From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=xSLH=NZ=os.amperecomputing.com=fxue@sourceware.org>
Received: from NAM10-DM6-obe.outbound.protection.outlook.com (mail-dm6nam10on2072f.outbound.protection.outlook.com [IPv6:2a01:111:f400:7e88::72f])
	by sourceware.org (Postfix) with ESMTPS id 66B4B385DDF1
	for <gcc-patches@gcc.gnu.org>; Sun, 23 Jun 2024 15:10:39 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 66B4B385DDF1
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=os.amperecomputing.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=os.amperecomputing.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 66B4B385DDF1
Authentication-Results: server2.sourceware.org; arc=pass smtp.remote-ip=2a01:111:f400:7e88::72f
ARC-Seal: i=2; a=rsa-sha256; d=sourceware.org; s=key; t=1719155445; cv=pass;
	b=xajYxa2jYG0rD1t4OJE3d0+trQjppBTzl30E2QnoJe/UE2Le9YycGN7PXn1j1c5HJBxr9PvehNWyYXK+e0CyeldiAcNoakiEinJdinLtzy51P4Cedv2/oUuVNn37hJLIuQbk5OZiSB5RALvnrYeEPqAEwKolrrJRhHhVdYKs7mM=
ARC-Message-Signature: i=2; a=rsa-sha256; d=sourceware.org; s=key;
	t=1719155445; c=relaxed/simple;
	bh=TS2GP50W9kS8RepvxPVqnWyX2p7m2CMqdXC4+I7rPmE=;
	h=DKIM-Signature:From:To:Subject:Date:Message-ID:MIME-Version; b=oKQ9C5c8LgHk7JzELQHimNK0ulAV9IbUgFol2ZkxTGoHVcEEvF9HUEZaX3oiFw5DHdhnSnIlfEazXg9aVaEgPGmZEhpY2QD6X+CLXeZghN/Y5kqSkbxUM+6o5/Gz9z/CjOiBfAF1lYYaujWAK8t+PknOxnKgzCW05iiVZHpLwcE=
ARC-Authentication-Results: i=2; server2.sourceware.org
ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none;
 b=NPdGfe4iRVX59Ti1VfGpRBdCFn5mMiMEtm1748jUGeTiVuBOs2IAn4GUvaOp4QWgOauwNY316XFXKQsJ++emT8xCuc590MqMEi5DkhUIi52NfS6YX6P+IvkcQJIump/AJLGOOshq/x3Qs3eqYeA8W4pUpMTeyGCRRyv/0UQVh3NlLr0uY9hDNgrOjSScjto1Mf1SY8dEx3o+t/Zgy+HpvI+Z2ClAsl3k8LXLbJDAUGbJtFfbJLIhBUKJJF8opLChNPpbn0L4QXULbSSbT0mzg2d2KGfp5clM1HKsRhrdun5elqAopUAUKbkd0NS4Isn/M8eppUcR+2KJgPpRYlH6jg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com;
 s=arcselector9901;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1;
 bh=hE15JQtXDOgAKHG25Rl7+4eNGIZPd6bhYlSx6sZMoXc=;
 b=JIZ6SVtAK++wo9/geBmP9F4wWmp6UNnoCJZqsZ0F42q3hsfqeOxOPnPNtwKKvjT2jdTXpKU9fJsJ94JlRP6b2VOYWDO3tuoo/iy2XtCAJFyVyn+c+uyZITWzINDzmafqZvKhkV1520Ah3hsCGp0Sd6FnOCeLNLacAmA+tnyQ6kpA2NY0yR1DIlbVDlTI9EXusJ7VVJrxxrtMygeMXDroh7sqq7xl/2yeFJiKq+SfAABYujvgK2d4bwHccYYcqWxPnqJTSktHAOo2KBGJ8xN3fMo7uJDCRiIgxsacfpzzp+kq4s/JiLeiFbcgS+XDiQzVP775MegijT06iISINVb9MQ==
ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass
 smtp.mailfrom=os.amperecomputing.com; dmarc=pass action=none
 header.from=os.amperecomputing.com; dkim=pass
 header.d=os.amperecomputing.com; arc=none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=os.amperecomputing.com; s=selector2;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;
 bh=hE15JQtXDOgAKHG25Rl7+4eNGIZPd6bhYlSx6sZMoXc=;
 b=oN3aUo/7O+JogjBte20Pm6nM9ZGglf8zLUqzKEUqgGMG5NTy1jWkhhYl00SEEX3dXzv5wvKp4Fe9DjB+QxAOOGC7ZcuW3T/0td3f60NcCX/k+GfMs6gZXW7ECgtI6vLo7w6Ptq8OHCT8XM1gWqfogdQmOMqcaT7Xo1FcyNwSHpA=
Received: from LV2PR01MB7839.prod.exchangelabs.com (2603:10b6:408:14f::13) by
 PH7PR01MB7608.prod.exchangelabs.com (2603:10b6:510:1d3::5) with Microsoft
 SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 15.20.7698.25; Sun, 23 Jun 2024 15:10:32 +0000
Received: from LV2PR01MB7839.prod.exchangelabs.com
 ([fe80::2ac3:5a77:36fd:9c63]) by LV2PR01MB7839.prod.exchangelabs.com
 ([fe80::2ac3:5a77:36fd:9c63%4]) with mapi id 15.20.7698.017; Sun, 23 Jun 2024
 15:10:32 +0000
From: Feng Xue OS <fxue@os.amperecomputing.com>
To: Richard Biener <richard.guenther@gmail.com>
CC: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for
 loop reduction [PR114440]
Thread-Topic: [PATCH 7/8] vect: Support multiple lane-reducing operations for
 loop reduction [PR114440]
Thread-Index: AQHaww0x/TTxzniavkuqtVlcAVxhibHVait1
Date: Sun, 23 Jun 2024 15:10:32 +0000
Message-ID:
 <LV2PR01MB7839454698AE00602C3EC782F7CB2@LV2PR01MB7839.prod.exchangelabs.com>
References:
 <LV2PR01MB7839A1A6112176E201D3DEBCF7CC2@LV2PR01MB7839.prod.exchangelabs.com>
 <CAFiYyc1oMjAW+OnBe5BNN4GYfyGRzwn6AzVcc8VxQ_eN+QNZ3A@mail.gmail.com>
In-Reply-To:
 <CAFiYyc1oMjAW+OnBe5BNN4GYfyGRzwn6AzVcc8VxQ_eN+QNZ3A@mail.gmail.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
msip_labels:
 MSIP_Label_5b82cb1d-c2e0-4643-920a-bbe7b2d7cc47_Enabled=True;MSIP_Label_5b82cb1d-c2e0-4643-920a-bbe7b2d7cc47_SiteId=3bc2b170-fd94-476d-b0ce-4229bdc904a7;MSIP_Label_5b82cb1d-c2e0-4643-920a-bbe7b2d7cc47_SetDate=2024-06-23T15:10:31.596Z;MSIP_Label_5b82cb1d-c2e0-4643-920a-bbe7b2d7cc47_Name=Confidential;MSIP_Label_5b82cb1d-c2e0-4643-920a-bbe7b2d7cc47_ContentBits=0;MSIP_Label_5b82cb1d-c2e0-4643-920a-bbe7b2d7cc47_Method=Standard;
authentication-results: dkim=none (message not signed)
 header.d=none;dmarc=none action=none header.from=os.amperecomputing.com;
x-ms-publictraffictype: Email
x-ms-traffictypediagnostic: LV2PR01MB7839:EE_|PH7PR01MB7608:EE_
x-ms-office365-filtering-correlation-id: af048b68-1abe-4466-262a-08dc9396a02a
x-ms-exchange-senderadcheck: 1
x-ms-exchange-antispam-relay: 0
x-microsoft-antispam: BCL:0;ARA:13230037|366013|376011|1800799021|38070700015;
x-microsoft-antispam-message-info:
 =?iso-8859-1?Q?KFATOjFQf1jUHVTbiocSVK0pZSwNmYzieK646KBCgG8Chyd1s8wVdIaKAa?=
 =?iso-8859-1?Q?2eNos3X328Iv5H+O/RfAdG9JRgEo4SxpBhHd5UMeo/mE3wLLaajJ+Iy+yS?=
 =?iso-8859-1?Q?at5H6SqjwtTHhUvn1Wuyf3mLmDnCatWTsGvQ3xzlOK6BuraqusGDhRTaQl?=
 =?iso-8859-1?Q?ipsrmF01HrJqxyTUjodoFN/PdrA5gAlmCJiy6X8Fjjhyic+BGXqzJcmyvh?=
 =?iso-8859-1?Q?TMRbKCKhhR0MC++EVQea0bVbnhaOUcf1IWGxRrr6KeqCw0qNSESkIRi4XH?=
 =?iso-8859-1?Q?opQb/SSgaAEqWi7SSINhQye2MHPnnGBei3Zhb2FpSRffk9nj0jN7IKPIOX?=
 =?iso-8859-1?Q?+fMEMN/FytpR9zNfY0VAltU5rnU5BkYSsp3jvgKFZjyx4ulfadAOej70VT?=
 =?iso-8859-1?Q?1BLzL6U5BqwIsnHrHBCa+qUp3ET0v1AQI7CWhW3ViI7mvOBEfyG3x5d4R9?=
 =?iso-8859-1?Q?6yO1LoiRWB1qJKrsw95Y7XpY+8AvUaP0CREda2omZI332mJXilAA7FrGHq?=
 =?iso-8859-1?Q?nzrco38Xph+GfZlRpjHlqP1pks5t/12RnK65qxT6LEJuT9V+BrJs4pj9tT?=
 =?iso-8859-1?Q?fbFVbbjuGI0LlucskcpkPSX0zpnGowKxQIzjEsWkRgwDQ6hM+LoH/KHVMI?=
 =?iso-8859-1?Q?qG7weyfPFWKYvxzi2536T87kdyj+dssAI/spOk0nKVd+uBChZAUdcxi7io?=
 =?iso-8859-1?Q?Y4KO1WW15EbHNC+QprkhmwhmggMRNOSLtPoNzzlBm4/Otxnl3IwoPGoeh8?=
 =?iso-8859-1?Q?FHZ3OdQFgRybxFmBGIEBoObasMRAaoEahYp9Qi3cnnQ6UGSvWQ8TUaovAY?=
 =?iso-8859-1?Q?rGAkwGhGcCOHq05yo+PUJQhVeqw/sBWhzsFS8QJR4LWw7XLA0pGQdJDq8J?=
 =?iso-8859-1?Q?XQ0unl4PllUNnY6YeJssESurZcb/NwP7fN0+8w1XJ0vpycsHsKk5S9HjRJ?=
 =?iso-8859-1?Q?um+6i5/7Dl50AYU1/8EuUiBsKqy3vo0eml9BF3Ze9KdbC9DQB6s3uHr8h9?=
 =?iso-8859-1?Q?CMBkEdWTnH00C6+ZC8LKi6x3LuIt4h35RgQR4u5h0PoXXhmeTu0M1Epvy+?=
 =?iso-8859-1?Q?F1d8z/L2TXXFzYrqiL2lalQVMbdPHuEsAZAK0MATO4HjBhRiLtKVoK5w3n?=
 =?iso-8859-1?Q?DqJBtagy4aMRsKmdDuZ7sODpk6QGmwRwKxgiyXsjlrXBXoC6zO/eYrTFw5?=
 =?iso-8859-1?Q?j4Lcg3IuvacIf+uplxI6Yl9dZCw06O5TAMwQ+VKD3KanAIBIzrKt4+LWt/?=
 =?iso-8859-1?Q?f2ySH7j7ta925yA4uB3ScXOZgeNuZfld7kRafp3FJQYQ4b8StUP8WPIRnI?=
 =?iso-8859-1?Q?nEm2r26Iliz3X524M7rwG1yk5CpeqQ5+1OZadTpvGk6S2IK2J8yuKolarw?=
 =?iso-8859-1?Q?slW90ZBvJU?=
x-forefront-antispam-report:
 CIP:255.255.255.255;CTRY:;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:LV2PR01MB7839.prod.exchangelabs.com;PTR:;CAT:NONE;SFS:(13230037)(366013)(376011)(1800799021)(38070700015);DIR:OUT;SFP:1102;
x-ms-exchange-antispam-messagedata-chunkcount: 1
x-ms-exchange-antispam-messagedata-0:
 =?iso-8859-1?Q?8NnDeDoK2V1H4As6nfsVQGrBcB/F5+yOlfJBJQ0Bxr+Ss2nwHPzQbNXuaJ?=
 =?iso-8859-1?Q?RCGFcUbo+MHOP0g+cwhrK0rniTEOIgr1OvCSzfqMKofmSjkYrHL1I61H92?=
 =?iso-8859-1?Q?o/ldJ1hVEpldjdgAvxqp3iAFfGrj/SpwIRSFT1nWpy6P8+9GFvTKZdnnqN?=
 =?iso-8859-1?Q?yhurfEcltmiZnV6nwMGcjWdQPUeTMOxgQ6y1OziiheQIa2XeUK5LozeF42?=
 =?iso-8859-1?Q?O2jfuMcwuN7I9qUL0mCr9ktkF4clG6uYZSUs1NT6OfCdvJZ27ce2ZDXAG9?=
 =?iso-8859-1?Q?GEt75jRQ0ANpkv9ICWo17itC0wMISBO/3TaPEtN/DD/CsVrMtD1A0PocpW?=
 =?iso-8859-1?Q?oCj4zPxCWT1IGZjvQZtwJeNT///oLnrN/biMOtfz0whJrvHRJ1dHs+9pjf?=
 =?iso-8859-1?Q?Mb8ub/8Y1U3fEpRfCU+rgvMjw30oSPTNtcUGYy7dTkmuSvKCCwWQjjWAYe?=
 =?iso-8859-1?Q?cMboLW4fMhMQIBYB0mrUG/W4Sq9nv49PfsgiktWcH2GnTL4X0S19diyIQl?=
 =?iso-8859-1?Q?UhYy7oVvX0qdf/6LRV695TGDchykQaW97smanBc47DgbVGlJ6vd2QyR6bR?=
 =?iso-8859-1?Q?jaNOXgCeVKji8+Hb5BmF8UBiZs/hH71SX5Xlc5C/rU8IQHBIVoXcGHO8T8?=
 =?iso-8859-1?Q?F3iw0bgIwQPKkIIGDldEyVArGwm+pI7E2OaVcUeBfmfMUyaSv1SGo3OSZn?=
 =?iso-8859-1?Q?tiTPh8O9FVdkQlYhBb0XtMg9N5JpFuWjp3XUlJ6Q/wQ7U09O3DhUyOkfHZ?=
 =?iso-8859-1?Q?3BdyBjRgaAfs5Q4DmJe/EkcKupC8o9nU0qJ7kmw4LIA+iOsmnwA9mcTQVP?=
 =?iso-8859-1?Q?mOSln/BYDAsmWXFvlHk25RY7q/+sz+trtUaUdwO9X4dXWTe6ygrTOBrvqi?=
 =?iso-8859-1?Q?0YzsXUdKTLuHzupVaUAe4VFYygZuLiIG2jbDK95F76Vthas8BkPun6T5Dj?=
 =?iso-8859-1?Q?tupzre5qOn+oGau3DyzipKZRHv9TNHvTd5drthkVfIC9Ff2p/92P0374IO?=
 =?iso-8859-1?Q?61FMWqV4KJbwdfxpCw6DcvMCS3TQ9uWpSpJ2U7PmE4jcf94jsCyZtWPEV5?=
 =?iso-8859-1?Q?pA5hY+DIsoL1GeTfQcTtt+yaK4/HUOz1Ns93A+gPeWgopavPPDHULn0HHu?=
 =?iso-8859-1?Q?dLD1M90wvnPH7oPG/tGp7TWVnlkvHdqgUH75q1SFnIxlqsxcY5iHlioS5y?=
 =?iso-8859-1?Q?yuUC7HjXFsiddYsTgcwQiSHK8YmDRpT/tISPq6ThcXiEEu7Z0ylChIvEFh?=
 =?iso-8859-1?Q?Q47HlYXusFPNqKrmC7hPeWZBPhQJtLvv5B1GeWA4aAOekKy60WHlHFZYvI?=
 =?iso-8859-1?Q?nYtZvtmw61jHUWs5pGSk6rSv9/F/suooICjGvAqplmnfMgTwMgSY/QqCpY?=
 =?iso-8859-1?Q?Wh6LdF/kLhY44nWDI6mCIop5c8kGs6XEykVvqfpB6YVUp3QHhZfUI8Eq2J?=
 =?iso-8859-1?Q?jFyW0ML6x+ZrZt2aOFe0Tt4HXW6vDdqW6PP7lgaoplR/Y3/Hq7u4BQoL9K?=
 =?iso-8859-1?Q?OcqXoyKYn5CqGwXTwc7FZ2m1EKKRTSfNO9fYFeg5iqSXF5tFx4ZrgbB+x/?=
 =?iso-8859-1?Q?y+ZuBrtpuvAAd7yMFX8CCdIABF5r+EeMJYYsrdi2eQuP4pupzAjr14sUMd?=
 =?iso-8859-1?Q?brDhlA1GTjVRnoDiKtXsGJiIGbgXWbF827?=
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-OriginatorOrg: os.amperecomputing.com
X-MS-Exchange-CrossTenant-AuthAs: Internal
X-MS-Exchange-CrossTenant-AuthSource: LV2PR01MB7839.prod.exchangelabs.com
X-MS-Exchange-CrossTenant-Network-Message-Id: af048b68-1abe-4466-262a-08dc9396a02a
X-MS-Exchange-CrossTenant-originalarrivaltime: 23 Jun 2024 15:10:32.0742
 (UTC)
X-MS-Exchange-CrossTenant-fromentityheader: Hosted
X-MS-Exchange-CrossTenant-id: 3bc2b170-fd94-476d-b0ce-4229bdc904a7
X-MS-Exchange-CrossTenant-mailboxtype: HOSTED
X-MS-Exchange-CrossTenant-userprincipalname: lLWGDGzJW3U5aBLT5epULmsrKDybOM6OLP1Kfe6b0efuPACFjNek/oihUDeBYolNSusKYwjPNtwhDWoi2EkOcgSe5ixPSUw3sDxyqpc6FBU=
X-MS-Exchange-Transport-CrossTenantHeadersStamped: PH7PR01MB7608
X-Spam-Status: No, score=-12.2 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,GIT_PATCH_0,SPF_HELO_PASS,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

>> -      if (slp_node)=0A=
>> +      if (slp_node && SLP_TREE_LANES (slp_node) > 1)=0A=
> =0A=
> Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off=
=0A=
> instead, which is bad.=0A=
> =0A=
>>         nvectors =3D SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);=0A=
>>        else=0A=
>>         nvectors =3D vect_get_num_copies (loop_vinfo, vectype_in);=0A=
>> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop=
_vec_info loop_vinfo,=0A=
>>      }=0A=
>>  }=0A=
>>=0A=
>> +/* Check if STMT_INFO is a lane-reducing operation that can be vectoriz=
ed in=0A=
>> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_=
VEC.=0A=
>> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad=
=0A=
>> +   (sum-of-absolute-differences).=0A=
>> +=0A=
>> +   For a lane-reducing operation, the loop reduction path that it lies =
in,=0A=
>> +   may contain normal operation, or other lane-reducing operation of di=
fferent=0A=
>> +   input type size, an example as:=0A=
>> +=0A=
>> +     int sum =3D 0;=0A=
>> +     for (i)=0A=
>> +       {=0A=
>> +         ...=0A=
>> +         sum +=3D d0[i] * d1[i];       // dot-prod <vector(16) char>=0A=
>> +         sum +=3D w[i];                // widen-sum <vector(16) char>=
=0A=
>> +         sum +=3D abs(s0[i] - s1[i]);  // sad <vector(8) short>=0A=
>> +         sum +=3D n[i];                // normal <vector(4) int>=0A=
>> +         ...=0A=
>> +       }=0A=
>> +=0A=
>> +   Vectorization factor is essentially determined by operation whose in=
put=0A=
>> +   vectype has the most lanes ("vector(16) char" in the example), while=
 we=0A=
>> +   need to choose input vectype with the least lanes ("vector(4) int" i=
n the=0A=
>> +   example) for the reduction PHI statement.  */=0A=
>> +=0A=
>> +bool=0A=
>> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stm=
t_info,=0A=
>> +                           slp_tree slp_node, stmt_vector_for_cost *cos=
t_vec)=0A=
>> +{=0A=
>> +  gimple *stmt =3D stmt_info->stmt;=0A=
>> +=0A=
>> +  if (!lane_reducing_stmt_p (stmt))=0A=
>> +    return false;=0A=
>> +=0A=
>> +  tree type =3D TREE_TYPE (gimple_assign_lhs (stmt));=0A=
>> +=0A=
>> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))=0A=
>> +    return false;=0A=
>> +=0A=
>> +  /* Do not try to vectorize bit-precision reductions.  */=0A=
>> +  if (!type_has_mode_precision_p (type))=0A=
>> +    return false;=0A=
>> +=0A=
>> +  if (!slp_node)=0A=
>> +    return false;=0A=
>> +=0A=
>> +  for (int i =3D 0; i < (int) gimple_num_ops (stmt) - 1; i++)=0A=
>> +    {=0A=
>> +      stmt_vec_info def_stmt_info;=0A=
>> +      slp_tree slp_op;=0A=
>> +      tree op;=0A=
>> +      tree vectype;=0A=
>> +      enum vect_def_type dt;=0A=
>> +=0A=
>> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,=
=0A=
>> +                              &slp_op, &dt, &vectype, &def_stmt_info))=
=0A=
>> +       {=0A=
>> +         if (dump_enabled_p ())=0A=
>> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,=0A=
>> +                            "use not simple.\n");=0A=
>> +         return false;=0A=
>> +       }=0A=
>> +=0A=
>> +      if (!vectype)=0A=
>> +       {=0A=
>> +         vectype =3D get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE=
 (op),=0A=
>> +                                                slp_op);=0A=
>> +         if (!vectype)=0A=
>> +           return false;=0A=
>> +       }=0A=
>> +=0A=
>> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))=0A=
>> +       {=0A=
>> +         if (dump_enabled_p ())=0A=
>> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,=0A=
>> +                            "incompatible vector types for invariants\n=
");=0A=
>> +         return false;=0A=
>> +       }=0A=
>> +=0A=
>> +      if (i =3D=3D STMT_VINFO_REDUC_IDX (stmt_info))=0A=
>> +       continue;=0A=
>> +=0A=
>> +      /* There should be at most one cycle def in the stmt.  */=0A=
>> +      if (VECTORIZABLE_CYCLE_DEF (dt))=0A=
>> +       return false;=0A=
>> +    }=0A=
>> +=0A=
>> +  stmt_vec_info reduc_info =3D STMT_VINFO_REDUC_DEF (vect_orig_stmt (st=
mt_info));=0A=
>> +=0A=
>> +  /* TODO: Support lane-reducing operation that does not directly parti=
cipate=0A=
>> +     in loop reduction. */=0A=
>> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)=0A=
>> +    return false;=0A=
>> +=0A=
>> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not=
=0A=
>> +     recoginized.  */=0A=
>> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) =3D=3D vect_reduction_de=
f);=0A=
>> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) =3D=3D TREE_CODE_REDUC=
TION);=0A=
>> +=0A=
>> +  tree vectype_in =3D STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);=0A=
>> +  int ncopies_for_cost;=0A=
>> +=0A=
>> +  if (SLP_TREE_LANES (slp_node) > 1)=0A=
>> +    {=0A=
>> +      /* Now lane-reducing operations in a non-single-lane slp node sho=
uld only=0A=
>> +        come from the same loop reduction path.  */=0A=
>> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));=0A=
>> +      ncopies_for_cost =3D 1;=0A=
>> +    }=0A=
>> +  else=0A=
>> +    {=0A=
>> +      ncopies_for_cost =3D vect_get_num_copies (loop_vinfo, vectype_in)=
;=0A=
> =0A=
> OK, so the fact that the ops are lane-reducing means they effectively=0A=
> change the VF for the result.  That's only possible as we tightly control=
=0A=
> code generation and "adjust" to the expected VF (by inserting the copies=
=0A=
> you mentioned above), but only up to the highest number of outputs=0A=
> created in the reduction chain.  In that sense instead of talking and rec=
ording=0A=
> "input vector types" wouldn't it make more sense to record the effective=
=0A=
> vectorization factor for the reduction instance?  That VF would be at mos=
t=0A=
> the loops VF but could be as low as 1.  Once we have a non-lane-reducing=
=0A=
> operation in the reduction chain it would be always equal to the loops VF=
.=0A=
> =0A=
> ncopies would then be always determined by that reduction instance VF and=
=0A=
> the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction=0A=
> instance VF would also trivially indicate the force-single-def-use-cycle=
=0A=
> case, possibly simplifying code?=0A=
=0A=
I tried to add such an effective VF, while the vectype_in is still needed i=
n some=0A=
scenarios, such as when checking whether a dot-prod stmt is emulated or not=
.=0A=
The former could be deduced from the later, so recording both things seems=
=0A=
to be redundant. Another consideration is that for normal op, ncopies=0A=
is determined from type (STMT_VINFO_VECTYPE), but for lane-reducing op,=0A=
it is from VF. So, a better means to make them unified? =0A=
 =0A=
>> +      gcc_assert (ncopies_for_cost >=3D 1);=0A=
>> +    }=0A=
>> +=0A=
>> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))=0A=
>> +    {=0A=
>> +      /* We need extra two invariants: one that contains the minimum si=
gned=0A=
>> +        value and one that contains half of its negative.  */=0A=
>> +      int prologue_stmts =3D 2;=0A=
>> +      unsigned cost =3D record_stmt_cost (cost_vec, prologue_stmts,=0A=
>> +                                       scalar_to_vec, stmt_info, 0,=0A=
>> +                                       vect_prologue);=0A=
>> +      if (dump_enabled_p ())=0A=
>> +       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "=0A=
>> +                    "extra prologue_cost =3D %d .\n", cost);=0A=
>> +=0A=
>> +      /* Three dot-products and a subtraction.  */=0A=
>> +      ncopies_for_cost *=3D 4;=0A=
>> +    }=0A=
>> +=0A=
>> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info,=
 0,=0A=
>> +                   vect_body);=0A=
>> +=0A=
>> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))=0A=
>> +    {=0A=
>> +      enum tree_code code =3D gimple_assign_rhs_code (stmt);=0A=
>> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_inf=
o,=0A=
>> +                                                 slp_node, code, type,=
=0A=
>> +                                                 vectype_in);=0A=
>> +    }=0A=
>> +=0A=
> =0A=
> Add a comment:=0A=
> =0A=
>     /* Transform via vect_transform_reduction.  */=0A=
> =0A=
>> +  STMT_VINFO_TYPE (stmt_info) =3D reduc_vec_info_type;=0A=
>> +  return true;=0A=
>> +}=0A=
>> +=0A=
>>  /* Function vectorizable_reduction.=0A=
>>=0A=
>>     Check if STMT_INFO performs a reduction operation that can be vector=
ized.=0A=
>> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,=
=0A=
>>    if (!type_has_mode_precision_p (op.type))=0A=
>>      return false;=0A=
>>=0A=
>> -  /* For lane-reducing ops we're reducing the number of reduction PHIs=
=0A=
>> -     which means the only use of that may be in the lane-reducing opera=
tion.  */=0A=
>> -  if (lane_reducing=0A=
>> -      && reduc_chain_length !=3D 1=0A=
>> -      && !only_slp_reduc_chain)=0A=
>> -    {=0A=
>> -      if (dump_enabled_p ())=0A=
>> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,=0A=
>> -                        "lane-reducing reduction with extra stmts.\n");=
=0A=
>> -      return false;=0A=
>> -    }=0A=
>> -=0A=
>>    /* Lane-reducing ops also never can be used in a SLP reduction group=
=0A=
>>       since we'll mix lanes belonging to different reductions.  But it's=
=0A=
>>       OK to use them in a reduction chain or when the reduction group=0A=
>> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo=
,=0A=
>>        && loop_vinfo->suggested_unroll_factor =3D=3D 1)=0A=
>>      single_defuse_cycle =3D true;=0A=
>>=0A=
>> -  if (single_defuse_cycle || lane_reducing)=0A=
>> +  if (single_defuse_cycle && !lane_reducing)=0A=
> =0A=
> If there's also a non-lane-reducing plus in the chain don't we have to=0A=
> check for that reduction op?  So shouldn't it be=0A=
> single_defuse_cycle && ... fact that we don't record=0A=
> (non-lane-reducing op there) ...=0A=
=0A=
Quite not understand this point.  For a non-lane-reducing op in the chain,=
=0A=
it should be handled in its own vectorizable_xxx function? The below check=
=0A=
is only for the first statement (vect_reduction_def) in the reduction.=0A=
=0A=
> =0A=
>>      {=0A=
>>        gcc_assert (op.code !=3D COND_EXPR);=0A=
>>=0A=
>> -      /* 4. Supportable by target?  */=0A=
>> -      bool ok =3D true;=0A=
>> -=0A=
>> -      /* 4.1. check support for the operation in the loop=0A=
>> +      /* 4. check support for the operation in the loop=0A=
>>=0A=
>>          This isn't necessary for the lane reduction codes, since they=
=0A=
>>          can only be produced by pattern matching, and it's up to the=0A=
>> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo=
,=0A=
>>          mixed-sign dot-products can be implemented using signed=0A=
>>          dot-products.  */=0A=
>>        machine_mode vec_mode =3D TYPE_MODE (vectype_in);=0A=
>> -      if (!lane_reducing=0A=
>> -         && !directly_supported_p (op.code, vectype_in, optab_vector))=
=0A=
>> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))=0A=
>>          {=0A=
>>            if (dump_enabled_p ())=0A=
>>              dump_printf (MSG_NOTE, "op not supported by target.\n");=0A=
>>           if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)=0A=
>>               || !vect_can_vectorize_without_simd_p (op.code))=0A=
>> -           ok =3D false;=0A=
>> +           single_defuse_cycle =3D false;=0A=
>>           else=0A=
>>             if (dump_enabled_p ())=0A=
>>               dump_printf (MSG_NOTE, "proceeding using word mode.\n");=
=0A=
>> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,=
=0A=
>>             dump_printf (MSG_NOTE, "using word mode not possible.\n");=
=0A=
>>           return false;=0A=
>>         }=0A=
>> -=0A=
>> -      /* lane-reducing operations have to go through vect_transform_red=
uction.=0A=
>> -         For the other cases try without the single cycle optimization.=
  */=0A=
>> -      if (!ok)=0A=
>> -       {=0A=
>> -         if (lane_reducing)=0A=
>> -           return false;=0A=
>> -         else=0A=
>> -           single_defuse_cycle =3D false;=0A=
>> -       }=0A=
>>      }=0A=
>>    if (dump_enabled_p () && single_defuse_cycle)=0A=
>>      dump_printf_loc (MSG_NOTE, vect_location,=0A=
>> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo=
,=0A=
>>                      "multiple vectors to one in the loop body\n");=0A=
>>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) =3D single_defuse_cycle;=
=0A=
>>=0A=
>> -  /* If the reduction stmt is one of the patterns that have lane=0A=
>> -     reduction embedded we cannot handle the case of ! single_defuse_cy=
cle.  */=0A=
>> -  if ((ncopies > 1 && ! single_defuse_cycle)=0A=
>> -      && lane_reducing)=0A=
>> -    {=0A=
>> -      if (dump_enabled_p ())=0A=
>> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,=0A=
>> -                        "multi def-use cycle not possible for lane-redu=
cing "=0A=
>> -                        "reduction operation\n");=0A=
>> -      return false;=0A=
>> -    }=0A=
>> +  /* For lane-reducing operation, the below processing related to singl=
e=0A=
>> +     defuse-cycle will be done in its own vectorizable function.  One m=
ore=0A=
>> +     thing to note is that the operation must not be involved in fold-l=
eft=0A=
>> +     reduction.  */=0A=
>> +  single_defuse_cycle &=3D !lane_reducing;=0A=
>>=0A=
>>    if (slp_node=0A=
>> -      && !(!single_defuse_cycle=0A=
>> -          && !lane_reducing=0A=
>> -          && reduction_type !=3D FOLD_LEFT_REDUCTION))=0A=
>> +      && (single_defuse_cycle || reduction_type =3D=3D FOLD_LEFT_REDUCT=
ION))=0A=
>>      for (i =3D 0; i < (int) op.num_ops; i++)=0A=
>>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))=
=0A=
>>         {=0A=
>> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo=
,=0A=
>>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,=0A=
>>                              reduction_type, ncopies, cost_vec);=0A=
>>    /* Cost the reduction op inside the loop if transformed via=0A=
>> -     vect_transform_reduction.  Otherwise this is costed by the=0A=
>> -     separate vectorizable_* routines.  */=0A=
>> -  if (single_defuse_cycle || lane_reducing)=0A=
>> -    {=0A=
>> -      int factor =3D 1;=0A=
>> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))=0A=
>> -       /* Three dot-products and a subtraction.  */=0A=
>> -       factor =3D 4;=0A=
>> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,=0A=
>> -                       stmt_info, 0, vect_body);=0A=
>> -    }=0A=
>> +     vect_transform_reduction for non-lane-reducing operation.  Otherwi=
se=0A=
>> +     this is costed by the separate vectorizable_* routines.  */=0A=
>> +  if (single_defuse_cycle)=0A=
>> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vec=
t_body);=0A=
>>=0A=
>>    if (dump_enabled_p ()=0A=
>>        && reduction_type =3D=3D FOLD_LEFT_REDUCTION)=0A=
>>      dump_printf_loc (MSG_NOTE, vect_location,=0A=
>>                      "using an in-order (fold-left) reduction.\n");=0A=
>>    STMT_VINFO_TYPE (orig_stmt_of_analysis) =3D cycle_phi_info_type;=0A=
>> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left=
=0A=
>> -     reductions go through their own vectorizable_* routines.  */=0A=
>> -  if (!single_defuse_cycle=0A=
>> -      && !lane_reducing=0A=
>> -      && reduction_type !=3D FOLD_LEFT_REDUCTION)=0A=
>> +=0A=
>> +  /* All but single defuse-cycle optimized and fold-left reductions go=
=0A=
>> +     through their own vectorizable_* routines.  */=0A=
>> +  if (!single_defuse_cycle && reduction_type !=3D FOLD_LEFT_REDUCTION)=
=0A=
>>      {=0A=
>>        stmt_vec_info tem=0A=
>>         =3D vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));=0A=
>> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinf=
o,=0A=
>>    bool lane_reducing =3D lane_reducing_op_p (code);=0A=
>>    gcc_assert (single_defuse_cycle || lane_reducing);=0A=
>>=0A=
>> +  if (lane_reducing)=0A=
>> +    {=0A=
>> +      /* The last operand of lane-reducing op is for reduction.  */=0A=
>> +      gcc_assert (reduc_index =3D=3D (int) op.num_ops - 1);=0A=
>> +=0A=
>> +      /* Now all lane-reducing ops are covered by some slp node.  */=0A=
>> +      gcc_assert (slp_node);=0A=
>> +    }=0A=
>> +=0A=
>>    /* Create the destination vector  */=0A=
>>    tree scalar_dest =3D gimple_get_lhs (stmt_info->stmt);=0A=
>>    tree vec_dest =3D vect_create_destination_var (scalar_dest, vectype_o=
ut);=0A=
>> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinf=
o,=0A=
>>                          reduc_index =3D=3D 2 ? op.ops[2] : NULL_TREE,=
=0A=
>>                          &vec_oprnds[2]);=0A=
>>      }=0A=
>> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) =3D=3D 1=0A=
>> +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length (=
))=0A=
>> +    {=0A=
>> +      /* For lane-reducing op covered by single-lane slp node, the inpu=
t=0A=
>> +        vectype of the reduction PHI determines copies of vectorized de=
f-use=0A=
>> +        cycles, which might be more than effective copies of vectorized=
 lane-=0A=
>> +        reducing reduction statements.  This could be complemented by=
=0A=
>> +        generating extra trivial pass-through copies.  For example:=0A=
>> +=0A=
>> +          int sum =3D 0;=0A=
>> +          for (i)=0A=
>> +            {=0A=
>> +              sum +=3D d0[i] * d1[i];      // dot-prod <vector(16) char=
>=0A=
>> +              sum +=3D abs(s0[i] - s1[i]); // sad <vector(8) short>=0A=
>> +              sum +=3D n[i];               // normal <vector(4) int>=0A=
>> +            }=0A=
>> +=0A=
>> +        The vector size is 128-bit,vectorization factor is 16.  Reducti=
on=0A=
>> +        statements would be transformed as:=0A=
>> +=0A=
>> +          vector<4> int sum_v0 =3D { 0, 0, 0, 0 };=0A=
>> +          vector<4> int sum_v1 =3D { 0, 0, 0, 0 };=0A=
>> +          vector<4> int sum_v2 =3D { 0, 0, 0, 0 };=0A=
>> +          vector<4> int sum_v3 =3D { 0, 0, 0, 0 };=0A=
>> +=0A=
>> +          for (i / 16)=0A=
>> +            {=0A=
>> +              sum_v0 =3D DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], =
sum_v0);=0A=
>> +              sum_v1 =3D sum_v1;  // copy=0A=
>> +              sum_v2 =3D sum_v2;  // copy=0A=
>> +              sum_v3 =3D sum_v3;  // copy=0A=
>> +=0A=
>> +              sum_v0 =3D SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v=
0);=0A=
>> +              sum_v1 =3D SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v=
1);=0A=
>> +              sum_v2 =3D sum_v2;  // copy=0A=
>> +              sum_v3 =3D sum_v3;  // copy=0A=
>> +=0A=
>> +              sum_v0 +=3D n_v0[i: 0  ~ 3 ];=0A=
>> +              sum_v1 +=3D n_v1[i: 4  ~ 7 ];=0A=
>> +              sum_v2 +=3D n_v2[i: 8  ~ 11];=0A=
>> +              sum_v3 +=3D n_v3[i: 12 ~ 15];=0A=
>> +            }=0A=
>> +       */=0A=
>> +      unsigned using_ncopies =3D vec_oprnds[0].length ();=0A=
>> +      unsigned reduc_ncopies =3D vec_oprnds[reduc_index].length ();=0A=
>> +=0A=
> =0A=
> assert reduc_ncopies >=3D using_ncopies?  Maybe assert=0A=
> reduc_index =3D=3D op.num_ops - 1 given you use one above=0A=
> and the other below?  Or simply iterate till op.num_ops=0A=
> and sip i =3D=3D reduc_index.=0A=
> =0A=
>> +      for (unsigned i =3D 0; i < op.num_ops - 1; i++)=0A=
>> +       {=0A=
>> +         gcc_assert (vec_oprnds[i].length () =3D=3D using_ncopies);=0A=
>> +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);=0A=
>> +       }=0A=
>> +    }=0A=
>>=0A=
>>    bool emulated_mixed_dot_prod =3D vect_is_emulated_mixed_dot_prod (stm=
t_info);=0A=
>>    unsigned num =3D vec_oprnds[reduc_index =3D=3D 0 ? 1 : 0].length ();=
=0A=
>> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinf=
o,=0A=
>>      {=0A=
>>        gimple *new_stmt;=0A=
>>        tree vop[3] =3D { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE }=
;=0A=
>> -      if (masked_loop_p && !mask_by_cond_expr)=0A=
>> +=0A=
>> +      if (!vop[0] || !vop[1])=0A=
>> +       {=0A=
>> +         tree reduc_vop =3D vec_oprnds[reduc_index][i];=0A=
>> +=0A=
>> +         /* Insert trivial copy if no need to generate vectorized=0A=
>> +            statement.  */=0A=
>> +         gcc_assert (reduc_vop);=0A=
>> +=0A=
>> +         new_stmt =3D gimple_build_assign (vec_dest, reduc_vop);=0A=
>> +         new_temp =3D make_ssa_name (vec_dest, new_stmt);=0A=
>> +         gimple_set_lhs (new_stmt, new_temp);=0A=
>> +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, =
gsi);=0A=
> =0A=
> I think you could simply do=0A=
> =0A=
>                slp_node->push_vec_def (reduc_vop);=0A=
>                continue;=0A=
> =0A=
> without any code generation.=0A=
> =0A=
=0A=
OK, that would be easy. Here comes another question, this patch assumes=0A=
lane-reducing op would always be contained in a slp node, since single-lane=
=0A=
slp node feature has been enabled. But I got some regression if I enforced=
=0A=
such constraint on lane-reducing op check. Those cases are founded to=0A=
be unvectorizable with single-lane slp, so this should not be what we want?=
=0A=
and need to be fixed?=0A=
=0A=
>> +       }=0A=
>> +      else if (masked_loop_p && !mask_by_cond_expr)=0A=
>>         {=0A=
>>           /* No conditional ifns have been defined for lane-reducing op=
=0A=
>>              yet.  */=0A=
>> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinf=
o,=0A=
>>=0A=
>>           if (masked_loop_p && mask_by_cond_expr)=0A=
>>             {=0A=
>> +             tree stmt_vectype_in =3D vectype_in;=0A=
>> +             unsigned nvectors =3D vec_num * ncopies;=0A=
>> +=0A=
>> +             if (lane_reducing && SLP_TREE_LANES (slp_node) =3D=3D 1)=
=0A=
>> +               {=0A=
>> +                 /* Input vectype of the reduction PHI may be defferent=
 from=0A=
> =0A=
> different=0A=
> =0A=
>> +                    that of lane-reducing operation.  */=0A=
>> +                 stmt_vectype_in =3D STMT_VINFO_REDUC_VECTYPE_IN (stmt_=
info);=0A=
>> +                 nvectors =3D vect_get_num_copies (loop_vinfo, stmt_vec=
type_in);=0A=
> =0A=
> I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.=0A=
=0A=
To partially vectorizing a dot_prod<16 * char> with 128-bit vector width, =
=0A=
we should pass (nvector=3D4, vectype=3D<4 *int>) instead of (nvector=3D1, v=
ectype=3D<16 *char>)=0A=
to vect_get_loop_mask?=0A=
=0A=
Thanks,=0A=
Feng=0A=
=0A=
=0A=
________________________________________=0A=
From: Richard Biener <richard.guenther@gmail.com>=0A=
Sent: Thursday, June 20, 2024 8:26 PM=0A=
To: Feng Xue OS=0A=
Cc: gcc-patches@gcc.gnu.org=0A=
Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations fo=
r loop reduction [PR114440]=0A=
=0A=
On Sun, Jun 16, 2024 at 9:31?AM Feng Xue OS <fxue@os.amperecomputing.com> w=
rote:=0A=
>=0A=
> For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, cu=
rrent=0A=
> vectorizer could only handle the pattern if the reduction chain does not=
=0A=
> contain other operation, no matter the other is normal or lane-reducing.=
=0A=
>=0A=
> Actually, to allow multiple arbitrary lane-reducing operations, we need t=
o=0A=
> support vectorization of loop reduction chain with mixed input vectypes. =
Since=0A=
> lanes of vectype may vary with operation, the effective ncopies of vector=
ized=0A=
> statements for operation also may not be same to each other, this causes=
=0A=
> mismatch on vectorized def-use cycles. A simple way is to align all opera=
tions=0A=
> with the one that has the most ncopies, the gap could be complemented by=
=0A=
> generating extra trivial pass-through copies. For example:=0A=
>=0A=
>    int sum =3D 0;=0A=
>    for (i)=0A=
>      {=0A=
>        sum +=3D d0[i] * d1[i];      // dot-prod <vector(16) char>=0A=
>        sum +=3D w[i];               // widen-sum <vector(16) char>=0A=
>        sum +=3D abs(s0[i] - s1[i]); // sad <vector(8) short>=0A=
>        sum +=3D n[i];               // normal <vector(4) int>=0A=
>      }=0A=
>=0A=
> The vector size is 128-bit vectorization factor is 16. Reduction statemen=
ts=0A=
> would be transformed as:=0A=
>=0A=
>    vector<4> int sum_v0 =3D { 0, 0, 0, 0 };=0A=
>    vector<4> int sum_v1 =3D { 0, 0, 0, 0 };=0A=
>    vector<4> int sum_v2 =3D { 0, 0, 0, 0 };=0A=
>    vector<4> int sum_v3 =3D { 0, 0, 0, 0 };=0A=
>=0A=
>    for (i / 16)=0A=
>      {=0A=
>        sum_v0 =3D DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);=
=0A=
>        sum_v1 =3D sum_v1;  // copy=0A=
>        sum_v2 =3D sum_v2;  // copy=0A=
>        sum_v3 =3D sum_v3;  // copy=0A=
>=0A=
>        sum_v0 =3D WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);=0A=
>        sum_v1 =3D sum_v1;  // copy=0A=
>        sum_v2 =3D sum_v2;  // copy=0A=
>        sum_v3 =3D sum_v3;  // copy=0A=
>=0A=
>        sum_v0 =3D SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);=0A=
>        sum_v1 =3D SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);=0A=
>        sum_v2 =3D sum_v2;  // copy=0A=
>        sum_v3 =3D sum_v3;  // copy=0A=
>=0A=
>        sum_v0 +=3D n_v0[i: 0  ~ 3 ];=0A=
>        sum_v1 +=3D n_v1[i: 4  ~ 7 ];=0A=
>        sum_v2 +=3D n_v2[i: 8  ~ 11];=0A=
>        sum_v3 +=3D n_v3[i: 12 ~ 15];=0A=
>      }=0A=
>=0A=
> Thanks,=0A=
> Feng=0A=
>=0A=
> ---=0A=
> gcc/=0A=
>         PR tree-optimization/114440=0A=
>         * tree-vectorizer.h (vectorizable_lane_reducing): New function=0A=
>         declaration.=0A=
>         * tree-vect-stmts.cc (vect_analyze_stmt): Call new function=0A=
>         vectorizable_lane_reducing to analyze lane-reducing operation.=0A=
>         * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost comp=
utation=0A=
>         code related to emulated_mixed_dot_prod.=0A=
>         (vect_reduction_update_partial_vector_usage): Compute ncopies as =
the=0A=
>         original means for single-lane slp node.=0A=
>         (vectorizable_lane_reducing): New function.=0A=
>         (vectorizable_reduction): Allow multiple lane-reducing operations=
 in=0A=
>         loop reduction. Move some original lane-reducing related code to=
=0A=
>         vectorizable_lane_reducing.=0A=
>         (vect_transform_reduction): Extend transformation to support redu=
ction=0A=
>         statements with mixed input vectypes.=0A=
>=0A=
> gcc/testsuite/=0A=
>         PR tree-optimization/114440=0A=
>         * gcc.dg/vect/vect-reduc-chain-1.c=0A=
>         * gcc.dg/vect/vect-reduc-chain-2.c=0A=
>         * gcc.dg/vect/vect-reduc-chain-3.c=0A=
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c=0A=
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c=0A=
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c=0A=
>         * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c=0A=
>         * gcc.dg/vect/vect-reduc-dot-slp-1.c=0A=
> ---=0A=
>  .../gcc.dg/vect/vect-reduc-chain-1.c          |  62 ++++=0A=
>  .../gcc.dg/vect/vect-reduc-chain-2.c          |  77 +++++=0A=
>  .../gcc.dg/vect/vect-reduc-chain-3.c          |  66 ++++=0A=
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +++++=0A=
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 ++++=0A=
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +++++=0A=
>  .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 ++++=0A=
>  .../gcc.dg/vect/vect-reduc-dot-slp-1.c        |  35 ++=0A=
>  gcc/tree-vect-loop.cc                         | 324 ++++++++++++++----=
=0A=
>  gcc/tree-vect-stmts.cc                        |   2 +=0A=
>  gcc/tree-vectorizer.h                         |   2 +=0A=
>  11 files changed, 802 insertions(+), 70 deletions(-)=0A=
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c=0A=
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c=0A=
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c=0A=
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.=
c=0A=
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.=
c=0A=
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.=
c=0A=
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.=
c=0A=
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c=0A=
>=0A=
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsui=
te/gcc.dg/vect/vect-reduc-chain-1.c=0A=
> new file mode 100644=0A=
> index 00000000000..04bfc419dbd=0A=
> --- /dev/null=0A=
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c=0A=
> @@ -0,0 +1,62 @@=0A=
> +/* Disabling epilogues until we find a better way to deal with scans.  *=
/=0A=
> +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A=
> +/* { dg-require-effective-target vect_int } */=0A=
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aa=
rch64*-*-* || arm*-*-* } } } */=0A=
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */=0A=
> +=0A=
> +#include "tree-vect.h"=0A=
> +=0A=
> +#define N 50=0A=
> +=0A=
> +#ifndef SIGNEDNESS_1=0A=
> +#define SIGNEDNESS_1 signed=0A=
> +#define SIGNEDNESS_2 signed=0A=
> +#endif=0A=
> +=0A=
> +SIGNEDNESS_1 int __attribute__ ((noipa))=0A=
> +f (SIGNEDNESS_1 int res,=0A=
> +   SIGNEDNESS_2 char *restrict a,=0A=
> +   SIGNEDNESS_2 char *restrict b,=0A=
> +   SIGNEDNESS_2 char *restrict c,=0A=
> +   SIGNEDNESS_2 char *restrict d,=0A=
> +   SIGNEDNESS_1 int *restrict e)=0A=
> +{=0A=
> +  for (int i =3D 0; i < N; ++i)=0A=
> +    {=0A=
> +      res +=3D a[i] * b[i];=0A=
> +      res +=3D c[i] * d[i];=0A=
> +      res +=3D e[i];=0A=
> +    }=0A=
> +  return res;=0A=
> +}=0A=
> +=0A=
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)=0A=
> +#define OFFSET 20=0A=
> +=0A=
> +int=0A=
> +main (void)=0A=
> +{=0A=
> +  check_vect ();=0A=
> +=0A=
> +  SIGNEDNESS_2 char a[N], b[N];=0A=
> +  SIGNEDNESS_2 char c[N], d[N];=0A=
> +  SIGNEDNESS_1 int e[N];=0A=
> +  int expected =3D 0x12345;=0A=
> +  for (int i =3D 0; i < N; ++i)=0A=
> +    {=0A=
> +      a[i] =3D BASE + i * 5;=0A=
> +      b[i] =3D BASE + OFFSET + i * 4;=0A=
> +      c[i] =3D BASE + i * 2;=0A=
> +      d[i] =3D BASE + OFFSET + i * 3;=0A=
> +      e[i] =3D i;=0A=
> +      asm volatile ("" ::: "memory");=0A=
> +      expected +=3D a[i] * b[i];=0A=
> +      expected +=3D c[i] * d[i];=0A=
> +      expected +=3D e[i];=0A=
> +    }=0A=
> +  if (f (0x12345, a, b, c, d, e) !=3D expected)=0A=
> +    __builtin_abort ();=0A=
> +}=0A=
> +=0A=
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "=
vect" } } */=0A=
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D DO=
T_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */=0A=
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsui=
te/gcc.dg/vect/vect-reduc-chain-2.c=0A=
> new file mode 100644=0A=
> index 00000000000..6c803b80120=0A=
> --- /dev/null=0A=
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c=0A=
> @@ -0,0 +1,77 @@=0A=
> +/* Disabling epilogues until we find a better way to deal with scans.  *=
/=0A=
> +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A=
> +/* { dg-require-effective-target vect_int } */=0A=
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aa=
rch64*-*-* || arm*-*-* } } } */=0A=
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */=0A=
> +=0A=
> +#include "tree-vect.h"=0A=
> +=0A=
> +#define N 50=0A=
> +=0A=
> +#ifndef SIGNEDNESS_1=0A=
> +#define SIGNEDNESS_1 signed=0A=
> +#define SIGNEDNESS_2 unsigned=0A=
> +#define SIGNEDNESS_3 signed=0A=
> +#define SIGNEDNESS_4 signed=0A=
> +#endif=0A=
> +=0A=
> +SIGNEDNESS_1 int __attribute__ ((noipa))=0A=
> +fn (SIGNEDNESS_1 int res,=0A=
> +   SIGNEDNESS_2 char *restrict a,=0A=
> +   SIGNEDNESS_2 char *restrict b,=0A=
> +   SIGNEDNESS_3 char *restrict c,=0A=
> +   SIGNEDNESS_3 char *restrict d,=0A=
> +   SIGNEDNESS_4 short *restrict e,=0A=
> +   SIGNEDNESS_4 short *restrict f,=0A=
> +   SIGNEDNESS_1 int *restrict g)=0A=
> +{=0A=
> +  for (int i =3D 0; i < N; ++i)=0A=
> +    {=0A=
> +      res +=3D a[i] * b[i];=0A=
> +      res +=3D i + 1;=0A=
> +      res +=3D c[i] * d[i];=0A=
> +      res +=3D e[i] * f[i];=0A=
> +      res +=3D g[i];=0A=
> +    }=0A=
> +  return res;=0A=
> +}=0A=
> +=0A=
> +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)=0A=
> +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4)=0A=
> +#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? -1026 : 373)=0A=
> +#define OFFSET 20=0A=
> +=0A=
> +int=0A=
> +main (void)=0A=
> +{=0A=
> +  check_vect ();=0A=
> +=0A=
> +  SIGNEDNESS_2 char a[N], b[N];=0A=
> +  SIGNEDNESS_3 char c[N], d[N];=0A=
> +  SIGNEDNESS_4 short e[N], f[N];=0A=
> +  SIGNEDNESS_1 int g[N];=0A=
> +  int expected =3D 0x12345;=0A=
> +  for (int i =3D 0; i < N; ++i)=0A=
> +    {=0A=
> +      a[i] =3D BASE2 + i * 5;=0A=
> +      b[i] =3D BASE2 + OFFSET + i * 4;=0A=
> +      c[i] =3D BASE3 + i * 2;=0A=
> +      d[i] =3D BASE3 + OFFSET + i * 3;=0A=
> +      e[i] =3D BASE4 + i * 6;=0A=
> +      f[i] =3D BASE4 + OFFSET + i * 5;=0A=
> +      g[i] =3D i;=0A=
> +      asm volatile ("" ::: "memory");=0A=
> +      expected +=3D a[i] * b[i];=0A=
> +      expected +=3D i + 1;=0A=
> +      expected +=3D c[i] * d[i];=0A=
> +      expected +=3D e[i] * f[i];=0A=
> +      expected +=3D g[i];=0A=
> +    }=0A=
> +  if (fn (0x12345, a, b, c, d, e, f, g) !=3D expected)=0A=
> +    __builtin_abort ();=0A=
> +}=0A=
> +=0A=
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "=
vect" } } */=0A=
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PROD=
_EXPR" "vect" { target { vect_sdot_qi } } } } */=0A=
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PROD=
_EXPR" "vect" { target { vect_udot_qi } } } } */=0A=
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PROD=
_EXPR" "vect" { target { vect_sdot_hi } } } } */=0A=
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/testsui=
te/gcc.dg/vect/vect-reduc-chain-3.c=0A=
> new file mode 100644=0A=
> index 00000000000..a41e4b176c4=0A=
> --- /dev/null=0A=
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c=0A=
> @@ -0,0 +1,66 @@=0A=
> +/* Disabling epilogues until we find a better way to deal with scans.  *=
/=0A=
> +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A=
> +/* { dg-require-effective-target vect_int } */=0A=
> +=0A=
> +#include "tree-vect.h"=0A=
> +=0A=
> +#define N 50=0A=
> +=0A=
> +#ifndef SIGNEDNESS_1=0A=
> +#define SIGNEDNESS_1 signed=0A=
> +#define SIGNEDNESS_2 unsigned=0A=
> +#define SIGNEDNESS_3 signed=0A=
> +#endif=0A=
> +=0A=
> +SIGNEDNESS_1 int __attribute__ ((noipa))=0A=
> +f (SIGNEDNESS_1 int res,=0A=
> +   SIGNEDNESS_2 char *restrict a,=0A=
> +   SIGNEDNESS_2 char *restrict b,=0A=
> +   SIGNEDNESS_3 short *restrict c,=0A=
> +   SIGNEDNESS_3 short *restrict d,=0A=
> +   SIGNEDNESS_1 int *restrict e)=0A=
> +{=0A=
> +  for (int i =3D 0; i < N; ++i)=0A=
> +    {=0A=
> +      short diff =3D a[i] - b[i];=0A=
> +      SIGNEDNESS_2 short abs =3D diff < 0 ? -diff : diff;=0A=
> +      res +=3D abs;=0A=
> +      res +=3D c[i] * d[i];=0A=
> +      res +=3D e[i];=0A=
> +    }=0A=
> +  return res;=0A=
> +}=0A=
> +=0A=
> +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)=0A=
> +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373)=0A=
> +#define OFFSET 20=0A=
> +=0A=
> +int=0A=
> +main (void)=0A=
> +{=0A=
> +  check_vect ();=0A=
> +=0A=
> +  SIGNEDNESS_2 char a[N], b[N];=0A=
> +  SIGNEDNESS_3 short c[N], d[N];=0A=
> +  SIGNEDNESS_1 int e[N];=0A=
> +  int expected =3D 0x12345;=0A=
> +  for (int i =3D 0; i < N; ++i)=0A=
> +    {=0A=
> +      a[i] =3D BASE2 + i * 5;=0A=
> +      b[i] =3D BASE2 - i * 4;=0A=
> +      c[i] =3D BASE3 + i * 2;=0A=
> +      d[i] =3D BASE3 + OFFSET + i * 3;=0A=
> +      e[i] =3D i;=0A=
> +      asm volatile ("" ::: "memory");=0A=
> +      short diff =3D a[i] - b[i];=0A=
> +      SIGNEDNESS_2 short abs =3D diff < 0 ? -diff : diff;=0A=
> +      expected +=3D abs;=0A=
> +      expected +=3D c[i] * d[i];=0A=
> +      expected +=3D e[i];=0A=
> +    }=0A=
> +  if (f (0x12345, a, b, c, d, e) !=3D expected)=0A=
> +    __builtin_abort ();=0A=
> +}=0A=
> +=0A=
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D SAD_EXPR=
" "vect" { target vect_udot_qi } } } */=0A=
> +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PROD=
_EXPR" "vect" { target vect_sdot_hi } } } */=0A=
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/gcc=
/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c=0A=
> new file mode 100644=0A=
> index 00000000000..c2831fbcc8e=0A=
> --- /dev/null=0A=
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c=0A=
> @@ -0,0 +1,95 @@=0A=
> +/* Disabling epilogues until we find a better way to deal with scans.  *=
/=0A=
> +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A=
> +/* { dg-require-effective-target vect_int } */=0A=
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aa=
rch64*-*-* || arm*-*-* } } } */=0A=
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */=0A=
> +=0A=
> +#include "tree-vect.h"=0A=
> +=0A=
> +#ifndef SIGNEDNESS_1=0A=
> +#define SIGNEDNESS_1 signed=0A=
> +#define SIGNEDNESS_2 signed=0A=
> +#endif=0A=
> +=0A=
> +SIGNEDNESS_1 int __attribute__ ((noipa))=0A=
> +f (SIGNEDNESS_1 int res,=0A=
> +   SIGNEDNESS_2 char *a,=0A=
> +   SIGNEDNESS_2 char *b,=0A=
> +   int step, int n)=0A=
> +{=0A=
> +  for (int i =3D 0; i < n; i++)=0A=
> +    {=0A=
> +      res +=3D a[0] * b[0];=0A=
> +      res +=3D a[1] * b[1];=0A=
> +      res +=3D a[2] * b[2];=0A=
> +      res +=3D a[3] * b[3];=0A=
> +      res +=3D a[4] * b[4];=0A=
> +      res +=3D a[5] * b[5];=0A=
> +      res +=3D a[6] * b[6];=0A=
> +      res +=3D a[7] * b[7];=0A=
> +      res +=3D a[8] * b[8];=0A=
> +      res +=3D a[9] * b[9];=0A=
> +      res +=3D a[10] * b[10];=0A=
> +      res +=3D a[11] * b[11];=0A=
> +      res +=3D a[12] * b[12];=0A=
> +      res +=3D a[13] * b[13];=0A=
> +      res +=3D a[14] * b[14];=0A=
> +      res +=3D a[15] * b[15];=0A=
> +=0A=
> +      a +=3D step;=0A=
> +      b +=3D step;=0A=
> +    }=0A=
> +=0A=
> +  return res;=0A=
> +}=0A=
> +=0A=
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)=0A=
> +#define OFFSET 20=0A=
> +=0A=
> +int=0A=
> +main (void)=0A=
> +{=0A=
> +  check_vect ();=0A=
> +=0A=
> +  SIGNEDNESS_2 char a[100], b[100];=0A=
> +  int expected =3D 0x12345;=0A=
> +  int step =3D 16;=0A=
> +  int n =3D 2;=0A=
> +  int t =3D 0;=0A=
> +=0A=
> +  for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i)=0A=
> +    {=0A=
> +      a[i] =3D BASE + i * 5;=0A=
> +      b[i] =3D BASE + OFFSET + i * 4;=0A=
> +      asm volatile ("" ::: "memory");=0A=
> +    }=0A=
> +=0A=
> +  for (int i =3D 0; i < n; i++)=0A=
> +    {=0A=
> +      asm volatile ("" ::: "memory");=0A=
> +      expected +=3D a[t + 0] * b[t + 0];=0A=
> +      expected +=3D a[t + 1] * b[t + 1];=0A=
> +      expected +=3D a[t + 2] * b[t + 2];=0A=
> +      expected +=3D a[t + 3] * b[t + 3];=0A=
> +      expected +=3D a[t + 4] * b[t + 4];=0A=
> +      expected +=3D a[t + 5] * b[t + 5];=0A=
> +      expected +=3D a[t + 6] * b[t + 6];=0A=
> +      expected +=3D a[t + 7] * b[t + 7];=0A=
> +      expected +=3D a[t + 8] * b[t + 8];=0A=
> +      expected +=3D a[t + 9] * b[t + 9];=0A=
> +      expected +=3D a[t + 10] * b[t + 10];=0A=
> +      expected +=3D a[t + 11] * b[t + 11];=0A=
> +      expected +=3D a[t + 12] * b[t + 12];=0A=
> +      expected +=3D a[t + 13] * b[t + 13];=0A=
> +      expected +=3D a[t + 14] * b[t + 14];=0A=
> +      expected +=3D a[t + 15] * b[t + 15];=0A=
> +      t +=3D step;=0A=
> +    }=0A=
> +=0A=
> +  if (f (0x12345, a, b, step, n) !=3D expected)=0A=
> +    __builtin_abort ();=0A=
> +}=0A=
> +=0A=
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "=
vect" } } */=0A=
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } =
*/=0A=
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D DO=
T_PROD_EXPR" 16 "vect" } } */=0A=
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/gcc=
/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c=0A=
> new file mode 100644=0A=
> index 00000000000..4114264a364=0A=
> --- /dev/null=0A=
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c=0A=
> @@ -0,0 +1,67 @@=0A=
> +/* Disabling epilogues until we find a better way to deal with scans.  *=
/=0A=
> +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A=
> +/* { dg-require-effective-target vect_int } */=0A=
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aa=
rch64*-*-* || arm*-*-* } } } */=0A=
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */=0A=
> +=0A=
> +#include "tree-vect.h"=0A=
> +=0A=
> +#ifndef SIGNEDNESS_1=0A=
> +#define SIGNEDNESS_1 signed=0A=
> +#define SIGNEDNESS_2 signed=0A=
> +#endif=0A=
> +=0A=
> +SIGNEDNESS_1 int __attribute__ ((noipa))=0A=
> +f (SIGNEDNESS_1 int res,=0A=
> +   SIGNEDNESS_2 char *a,=0A=
> +   SIGNEDNESS_2 char *b,=0A=
> +   int n)=0A=
> +{=0A=
> +  for (int i =3D 0; i < n; i++)=0A=
> +    {=0A=
> +      res +=3D a[5 * i + 0] * b[5 * i + 0];=0A=
> +      res +=3D a[5 * i + 1] * b[5 * i + 1];=0A=
> +      res +=3D a[5 * i + 2] * b[5 * i + 2];=0A=
> +      res +=3D a[5 * i + 3] * b[5 * i + 3];=0A=
> +      res +=3D a[5 * i + 4] * b[5 * i + 4];=0A=
> +    }=0A=
> +=0A=
> +  return res;=0A=
> +}=0A=
> +=0A=
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)=0A=
> +#define OFFSET 20=0A=
> +=0A=
> +int=0A=
> +main (void)=0A=
> +{=0A=
> +  check_vect ();=0A=
> +=0A=
> +  SIGNEDNESS_2 char a[100], b[100];=0A=
> +  int expected =3D 0x12345;=0A=
> +  int n =3D 18;=0A=
> +=0A=
> +  for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i)=0A=
> +    {=0A=
> +      a[i] =3D BASE + i * 5;=0A=
> +      b[i] =3D BASE + OFFSET + i * 4;=0A=
> +      asm volatile ("" ::: "memory");=0A=
> +    }=0A=
> +=0A=
> +  for (int i =3D 0; i < n; i++)=0A=
> +    {=0A=
> +      asm volatile ("" ::: "memory");=0A=
> +      expected +=3D a[5 * i + 0] * b[5 * i + 0];=0A=
> +      expected +=3D a[5 * i + 1] * b[5 * i + 1];=0A=
> +      expected +=3D a[5 * i + 2] * b[5 * i + 2];=0A=
> +      expected +=3D a[5 * i + 3] * b[5 * i + 3];=0A=
> +      expected +=3D a[5 * i + 4] * b[5 * i + 4];=0A=
> +    }=0A=
> +=0A=
> +  if (f (0x12345, a, b, n) !=3D expected)=0A=
> +    __builtin_abort ();=0A=
> +}=0A=
> +=0A=
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "=
vect" } } */=0A=
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } =
*/=0A=
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D DO=
T_PROD_EXPR" 5 "vect" } } */=0A=
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/gcc=
/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c=0A=
> new file mode 100644=0A=
> index 00000000000..2cdecc36d16=0A=
> --- /dev/null=0A=
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c=0A=
> @@ -0,0 +1,79 @@=0A=
> +/* Disabling epilogues until we find a better way to deal with scans.  *=
/=0A=
> +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A=
> +/* { dg-require-effective-target vect_int } */=0A=
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aa=
rch64*-*-* || arm*-*-* } } } */=0A=
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */=0A=
> +=0A=
> +#include "tree-vect.h"=0A=
> +=0A=
> +#ifndef SIGNEDNESS_1=0A=
> +#define SIGNEDNESS_1 signed=0A=
> +#define SIGNEDNESS_2 signed=0A=
> +#endif=0A=
> +=0A=
> +SIGNEDNESS_1 int __attribute__ ((noipa))=0A=
> +f (SIGNEDNESS_1 int res,=0A=
> +   SIGNEDNESS_2 short *a,=0A=
> +   SIGNEDNESS_2 short *b,=0A=
> +   int step, int n)=0A=
> +{=0A=
> +  for (int i =3D 0; i < n; i++)=0A=
> +    {=0A=
> +      res +=3D a[0] * b[0];=0A=
> +      res +=3D a[1] * b[1];=0A=
> +      res +=3D a[2] * b[2];=0A=
> +      res +=3D a[3] * b[3];=0A=
> +      res +=3D a[4] * b[4];=0A=
> +      res +=3D a[5] * b[5];=0A=
> +      res +=3D a[6] * b[6];=0A=
> +      res +=3D a[7] * b[7];=0A=
> +=0A=
> +      a +=3D step;=0A=
> +      b +=3D step;=0A=
> +    }=0A=
> +=0A=
> +  return res;=0A=
> +}=0A=
> +=0A=
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)=0A=
> +#define OFFSET 20=0A=
> +=0A=
> +int=0A=
> +main (void)=0A=
> +{=0A=
> +  check_vect ();=0A=
> +=0A=
> +  SIGNEDNESS_2 short a[100], b[100];=0A=
> +  int expected =3D 0x12345;=0A=
> +  int step =3D 8;=0A=
> +  int n =3D 2;=0A=
> +  int t =3D 0;=0A=
> +=0A=
> +  for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i)=0A=
> +    {=0A=
> +      a[i] =3D BASE + i * 5;=0A=
> +      b[i] =3D BASE + OFFSET + i * 4;=0A=
> +      asm volatile ("" ::: "memory");=0A=
> +    }=0A=
> +=0A=
> +  for (int i =3D 0; i < n; i++)=0A=
> +    {=0A=
> +      asm volatile ("" ::: "memory");=0A=
> +      expected +=3D a[t + 0] * b[t + 0];=0A=
> +      expected +=3D a[t + 1] * b[t + 1];=0A=
> +      expected +=3D a[t + 2] * b[t + 2];=0A=
> +      expected +=3D a[t + 3] * b[t + 3];=0A=
> +      expected +=3D a[t + 4] * b[t + 4];=0A=
> +      expected +=3D a[t + 5] * b[t + 5];=0A=
> +      expected +=3D a[t + 6] * b[t + 6];=0A=
> +      expected +=3D a[t + 7] * b[t + 7];=0A=
> +      t +=3D step;=0A=
> +    }=0A=
> +=0A=
> +  if (f (0x12345, a, b, step, n) !=3D expected)=0A=
> +    __builtin_abort ();=0A=
> +}=0A=
> +=0A=
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "=
vect" } } */=0A=
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } =
*/=0A=
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D DO=
T_PROD_EXPR" 8 "vect"  { target vect_sdot_hi } } } */=0A=
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c b/gcc=
/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c=0A=
> new file mode 100644=0A=
> index 00000000000..32c0f30c77b=0A=
> --- /dev/null=0A=
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c=0A=
> @@ -0,0 +1,63 @@=0A=
> +/* Disabling epilogues until we find a better way to deal with scans.  *=
/=0A=
> +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A=
> +/* { dg-require-effective-target vect_int } */=0A=
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aa=
rch64*-*-* || arm*-*-* } } } */=0A=
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */=0A=
> +=0A=
> +#include "tree-vect.h"=0A=
> +=0A=
> +#ifndef SIGNEDNESS_1=0A=
> +#define SIGNEDNESS_1 signed=0A=
> +#define SIGNEDNESS_2 signed=0A=
> +#endif=0A=
> +=0A=
> +SIGNEDNESS_1 int __attribute__ ((noipa))=0A=
> +f (SIGNEDNESS_1 int res,=0A=
> +   SIGNEDNESS_2 short *a,=0A=
> +   SIGNEDNESS_2 short *b,=0A=
> +   int n)=0A=
> +{=0A=
> +  for (int i =3D 0; i < n; i++)=0A=
> +    {=0A=
> +      res +=3D a[3 * i + 0] * b[3 * i + 0];=0A=
> +      res +=3D a[3 * i + 1] * b[3 * i + 1];=0A=
> +      res +=3D a[3 * i + 2] * b[3 * i + 2];=0A=
> +    }=0A=
> +=0A=
> +  return res;=0A=
> +}=0A=
> +=0A=
> +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)=0A=
> +#define OFFSET 20=0A=
> +=0A=
> +int=0A=
> +main (void)=0A=
> +{=0A=
> +  check_vect ();=0A=
> +=0A=
> +  SIGNEDNESS_2 short a[100], b[100];=0A=
> +  int expected =3D 0x12345;=0A=
> +  int n =3D 18;=0A=
> +=0A=
> +  for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i)=0A=
> +    {=0A=
> +      a[i] =3D BASE + i * 5;=0A=
> +      b[i] =3D BASE + OFFSET + i * 4;=0A=
> +      asm volatile ("" ::: "memory");=0A=
> +    }=0A=
> +=0A=
> +  for (int i =3D 0; i < n; i++)=0A=
> +    {=0A=
> +      asm volatile ("" ::: "memory");=0A=
> +      expected +=3D a[3 * i + 0] * b[3 * i + 0];=0A=
> +      expected +=3D a[3 * i + 1] * b[3 * i + 1];=0A=
> +      expected +=3D a[3 * i + 2] * b[3 * i + 2];=0A=
> +    }=0A=
> +=0A=
> +  if (f (0x12345, a, b, n) !=3D expected)=0A=
> +    __builtin_abort ();=0A=
> +}=0A=
> +=0A=
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "=
vect" } } */=0A=
> +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } =
*/=0A=
> +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D DO=
T_PROD_EXPR" 3 "vect"  { target vect_sdot_hi } } } */=0A=
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/tests=
uite/gcc.dg/vect/vect-reduc-dot-slp-1.c=0A=
> new file mode 100644=0A=
> index 00000000000..e17d6291f75=0A=
> --- /dev/null=0A=
> +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c=0A=
> @@ -0,0 +1,35 @@=0A=
> +/* Disabling epilogues until we find a better way to deal with scans.  *=
/=0A=
> +/* { dg-do compile } */=0A=
> +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A=
> +/* { dg-require-effective-target vect_int } */=0A=
> +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aa=
rch64*-*-* || arm*-*-* } } } */=0A=
> +/* { dg-add-options arm_v8_2a_dotprod_neon }  */=0A=
> +=0A=
> +#include "tree-vect.h"=0A=
> +=0A=
> +#ifndef SIGNEDNESS_1=0A=
> +#define SIGNEDNESS_1 signed=0A=
> +#define SIGNEDNESS_2 signed=0A=
> +#endif=0A=
> +=0A=
> +SIGNEDNESS_1 int __attribute__ ((noipa))=0A=
> +f (SIGNEDNESS_1 int res0,=0A=
> +   SIGNEDNESS_1 int res1,=0A=
> +   SIGNEDNESS_1 int res2,=0A=
> +   SIGNEDNESS_1 int res3,=0A=
> +   SIGNEDNESS_2 short *a,=0A=
> +   SIGNEDNESS_2 short *b)=0A=
> +{=0A=
> +  for (int i =3D 0; i < 64; i +=3D 4)=0A=
> +    {=0A=
> +      res0 +=3D a[i + 0] * b[i + 0];=0A=
> +      res1 +=3D a[i + 1] * b[i + 1];=0A=
> +      res2 +=3D a[i + 2] * b[i + 2];=0A=
> +      res3 +=3D a[i + 3] * b[i + 3];=0A=
> +    }=0A=
> +=0A=
> +  return res0 ^ res1 ^ res2 ^ res3;=0A=
> +}=0A=
> +=0A=
> +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "=
vect" } } */=0A=
> +/* { dg-final { scan-tree-dump-not "vectorizing stmts using SLP" "vect" =
} } */=0A=
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc=0A=
> index e0561feddce..6d91665a341 100644=0A=
> --- a/gcc/tree-vect-loop.cc=0A=
> +++ b/gcc/tree-vect-loop.cc=0A=
> @@ -5324,8 +5324,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo=
,=0A=
>    if (!gimple_extract_op (orig_stmt_info->stmt, &op))=0A=
>      gcc_unreachable ();=0A=
>=0A=
> -  bool emulated_mixed_dot_prod =3D vect_is_emulated_mixed_dot_prod (stmt=
_info);=0A=
> -=0A=
>    if (reduction_type =3D=3D EXTRACT_LAST_REDUCTION)=0A=
>      /* No extra instructions are needed in the prologue.  The loop body=
=0A=
>         operations are costed in vectorizable_condition.  */=0A=
> @@ -5360,12 +5358,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinf=
o,=0A=
>            initial result of the data reduction, initial value of the ind=
ex=0A=
>            reduction.  */=0A=
>         prologue_stmts =3D 4;=0A=
> -      else if (emulated_mixed_dot_prod)=0A=
> -       /* We need the initial reduction value and two invariants:=0A=
> -          one that contains the minimum signed value and one that=0A=
> -          contains half of its negative.  */=0A=
> -       prologue_stmts =3D 3;=0A=
>        else=0A=
> +       /* We need the initial reduction value.  */=0A=
>         prologue_stmts =3D 1;=0A=
>        prologue_cost +=3D record_stmt_cost (cost_vec, prologue_stmts,=0A=
>                                          scalar_to_vec, stmt_info, 0,=0A=
> @@ -7466,7 +7460,7 @@ vect_reduction_update_partial_vector_usage (loop_ve=
c_info loop_vinfo,=0A=
>        vec_loop_lens *lens =3D &LOOP_VINFO_LENS (loop_vinfo);=0A=
>        unsigned nvectors;=0A=
>=0A=
> -      if (slp_node)=0A=
> +      if (slp_node && SLP_TREE_LANES (slp_node) > 1)=0A=
=0A=
Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off=
=0A=
instead, which is bad.=0A=
=0A=
>         nvectors =3D SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);=0A=
>        else=0A=
>         nvectors =3D vect_get_num_copies (loop_vinfo, vectype_in);=0A=
> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_=
vec_info loop_vinfo,=0A=
>      }=0A=
>  }=0A=
>=0A=
> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorize=
d in=0A=
> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_V=
EC.=0A=
> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad=
=0A=
> +   (sum-of-absolute-differences).=0A=
> +=0A=
> +   For a lane-reducing operation, the loop reduction path that it lies i=
n,=0A=
> +   may contain normal operation, or other lane-reducing operation of dif=
ferent=0A=
> +   input type size, an example as:=0A=
> +=0A=
> +     int sum =3D 0;=0A=
> +     for (i)=0A=
> +       {=0A=
> +         ...=0A=
> +         sum +=3D d0[i] * d1[i];       // dot-prod <vector(16) char>=0A=
> +         sum +=3D w[i];                // widen-sum <vector(16) char>=0A=
> +         sum +=3D abs(s0[i] - s1[i]);  // sad <vector(8) short>=0A=
> +         sum +=3D n[i];                // normal <vector(4) int>=0A=
> +         ...=0A=
> +       }=0A=
> +=0A=
> +   Vectorization factor is essentially determined by operation whose inp=
ut=0A=
> +   vectype has the most lanes ("vector(16) char" in the example), while =
we=0A=
> +   need to choose input vectype with the least lanes ("vector(4) int" in=
 the=0A=
> +   example) for the reduction PHI statement.  */=0A=
> +=0A=
> +bool=0A=
> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt=
_info,=0A=
> +                           slp_tree slp_node, stmt_vector_for_cost *cost=
_vec)=0A=
> +{=0A=
> +  gimple *stmt =3D stmt_info->stmt;=0A=
> +=0A=
> +  if (!lane_reducing_stmt_p (stmt))=0A=
> +    return false;=0A=
> +=0A=
> +  tree type =3D TREE_TYPE (gimple_assign_lhs (stmt));=0A=
> +=0A=
> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))=0A=
> +    return false;=0A=
> +=0A=
> +  /* Do not try to vectorize bit-precision reductions.  */=0A=
> +  if (!type_has_mode_precision_p (type))=0A=
> +    return false;=0A=
> +=0A=
> +  if (!slp_node)=0A=
> +    return false;=0A=
> +=0A=
> +  for (int i =3D 0; i < (int) gimple_num_ops (stmt) - 1; i++)=0A=
> +    {=0A=
> +      stmt_vec_info def_stmt_info;=0A=
> +      slp_tree slp_op;=0A=
> +      tree op;=0A=
> +      tree vectype;=0A=
> +      enum vect_def_type dt;=0A=
> +=0A=
> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,=
=0A=
> +                              &slp_op, &dt, &vectype, &def_stmt_info))=
=0A=
> +       {=0A=
> +         if (dump_enabled_p ())=0A=
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,=0A=
> +                            "use not simple.\n");=0A=
> +         return false;=0A=
> +       }=0A=
> +=0A=
> +      if (!vectype)=0A=
> +       {=0A=
> +         vectype =3D get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE =
(op),=0A=
> +                                                slp_op);=0A=
> +         if (!vectype)=0A=
> +           return false;=0A=
> +       }=0A=
> +=0A=
> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))=0A=
> +       {=0A=
> +         if (dump_enabled_p ())=0A=
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,=0A=
> +                            "incompatible vector types for invariants\n"=
);=0A=
> +         return false;=0A=
> +       }=0A=
> +=0A=
> +      if (i =3D=3D STMT_VINFO_REDUC_IDX (stmt_info))=0A=
> +       continue;=0A=
> +=0A=
> +      /* There should be at most one cycle def in the stmt.  */=0A=
> +      if (VECTORIZABLE_CYCLE_DEF (dt))=0A=
> +       return false;=0A=
> +    }=0A=
> +=0A=
> +  stmt_vec_info reduc_info =3D STMT_VINFO_REDUC_DEF (vect_orig_stmt (stm=
t_info));=0A=
> +=0A=
> +  /* TODO: Support lane-reducing operation that does not directly partic=
ipate=0A=
> +     in loop reduction. */=0A=
> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)=0A=
> +    return false;=0A=
> +=0A=
> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not=0A=
> +     recoginized.  */=0A=
> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) =3D=3D vect_reduction_def=
);=0A=
> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) =3D=3D TREE_CODE_REDUCT=
ION);=0A=
> +=0A=
> +  tree vectype_in =3D STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);=0A=
> +  int ncopies_for_cost;=0A=
> +=0A=
> +  if (SLP_TREE_LANES (slp_node) > 1)=0A=
> +    {=0A=
> +      /* Now lane-reducing operations in a non-single-lane slp node shou=
ld only=0A=
> +        come from the same loop reduction path.  */=0A=
> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));=0A=
> +      ncopies_for_cost =3D 1;=0A=
> +    }=0A=
> +  else=0A=
> +    {=0A=
> +      ncopies_for_cost =3D vect_get_num_copies (loop_vinfo, vectype_in);=
=0A=
=0A=
OK, so the fact that the ops are lane-reducing means they effectively=0A=
change the VF for the result.  That's only possible as we tightly control=
=0A=
code generation and "adjust" to the expected VF (by inserting the copies=0A=
you mentioned above), but only up to the highest number of outputs=0A=
created in the reduction chain.  In that sense instead of talking and recor=
ding=0A=
"input vector types" wouldn't it make more sense to record the effective=0A=
vectorization factor for the reduction instance?  That VF would be at most=
=0A=
the loops VF but could be as low as 1.  Once we have a non-lane-reducing=0A=
operation in the reduction chain it would be always equal to the loops VF.=
=0A=
=0A=
ncopies would then be always determined by that reduction instance VF and=
=0A=
the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction=0A=
instance VF would also trivially indicate the force-single-def-use-cycle=0A=
case, possibly simplifying code?=0A=
=0A=
> +      gcc_assert (ncopies_for_cost >=3D 1);=0A=
> +    }=0A=
> +=0A=
> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))=0A=
> +    {=0A=
> +      /* We need extra two invariants: one that contains the minimum sig=
ned=0A=
> +        value and one that contains half of its negative.  */=0A=
> +      int prologue_stmts =3D 2;=0A=
> +      unsigned cost =3D record_stmt_cost (cost_vec, prologue_stmts,=0A=
> +                                       scalar_to_vec, stmt_info, 0,=0A=
> +                                       vect_prologue);=0A=
> +      if (dump_enabled_p ())=0A=
> +       dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "=0A=
> +                    "extra prologue_cost =3D %d .\n", cost);=0A=
> +=0A=
> +      /* Three dot-products and a subtraction.  */=0A=
> +      ncopies_for_cost *=3D 4;=0A=
> +    }=0A=
> +=0A=
> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, =
0,=0A=
> +                   vect_body);=0A=
> +=0A=
> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))=0A=
> +    {=0A=
> +      enum tree_code code =3D gimple_assign_rhs_code (stmt);=0A=
> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info=
,=0A=
> +                                                 slp_node, code, type,=
=0A=
> +                                                 vectype_in);=0A=
> +    }=0A=
> +=0A=
=0A=
Add a comment:=0A=
=0A=
    /* Transform via vect_transform_reduction.  */=0A=
=0A=
> +  STMT_VINFO_TYPE (stmt_info) =3D reduc_vec_info_type;=0A=
> +  return true;=0A=
> +}=0A=
> +=0A=
>  /* Function vectorizable_reduction.=0A=
>=0A=
>     Check if STMT_INFO performs a reduction operation that can be vectori=
zed.=0A=
> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,=
=0A=
>    if (!type_has_mode_precision_p (op.type))=0A=
>      return false;=0A=
>=0A=
> -  /* For lane-reducing ops we're reducing the number of reduction PHIs=
=0A=
> -     which means the only use of that may be in the lane-reducing operat=
ion.  */=0A=
> -  if (lane_reducing=0A=
> -      && reduc_chain_length !=3D 1=0A=
> -      && !only_slp_reduc_chain)=0A=
> -    {=0A=
> -      if (dump_enabled_p ())=0A=
> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,=0A=
> -                        "lane-reducing reduction with extra stmts.\n");=
=0A=
> -      return false;=0A=
> -    }=0A=
> -=0A=
>    /* Lane-reducing ops also never can be used in a SLP reduction group=
=0A=
>       since we'll mix lanes belonging to different reductions.  But it's=
=0A=
>       OK to use them in a reduction chain or when the reduction group=0A=
> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,=
=0A=
>        && loop_vinfo->suggested_unroll_factor =3D=3D 1)=0A=
>      single_defuse_cycle =3D true;=0A=
>=0A=
> -  if (single_defuse_cycle || lane_reducing)=0A=
> +  if (single_defuse_cycle && !lane_reducing)=0A=
=0A=
If there's also a non-lane-reducing plus in the chain don't we have to=0A=
check for that reduction op?  So shouldn't it be=0A=
single_defuse_cycle && ... fact that we don't record=0A=
(non-lane-reducing op there) ...=0A=
=0A=
>      {=0A=
>        gcc_assert (op.code !=3D COND_EXPR);=0A=
>=0A=
> -      /* 4. Supportable by target?  */=0A=
> -      bool ok =3D true;=0A=
> -=0A=
> -      /* 4.1. check support for the operation in the loop=0A=
> +      /* 4. check support for the operation in the loop=0A=
>=0A=
>          This isn't necessary for the lane reduction codes, since they=0A=
>          can only be produced by pattern matching, and it's up to the=0A=
> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,=
=0A=
>          mixed-sign dot-products can be implemented using signed=0A=
>          dot-products.  */=0A=
>        machine_mode vec_mode =3D TYPE_MODE (vectype_in);=0A=
> -      if (!lane_reducing=0A=
> -         && !directly_supported_p (op.code, vectype_in, optab_vector))=
=0A=
> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))=0A=
>          {=0A=
>            if (dump_enabled_p ())=0A=
>              dump_printf (MSG_NOTE, "op not supported by target.\n");=0A=
>           if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)=0A=
>               || !vect_can_vectorize_without_simd_p (op.code))=0A=
> -           ok =3D false;=0A=
> +           single_defuse_cycle =3D false;=0A=
>           else=0A=
>             if (dump_enabled_p ())=0A=
>               dump_printf (MSG_NOTE, "proceeding using word mode.\n");=0A=
> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,=
=0A=
>             dump_printf (MSG_NOTE, "using word mode not possible.\n");=0A=
>           return false;=0A=
>         }=0A=
> -=0A=
> -      /* lane-reducing operations have to go through vect_transform_redu=
ction.=0A=
> -         For the other cases try without the single cycle optimization. =
 */=0A=
> -      if (!ok)=0A=
> -       {=0A=
> -         if (lane_reducing)=0A=
> -           return false;=0A=
> -         else=0A=
> -           single_defuse_cycle =3D false;=0A=
> -       }=0A=
>      }=0A=
>    if (dump_enabled_p () && single_defuse_cycle)=0A=
>      dump_printf_loc (MSG_NOTE, vect_location,=0A=
> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,=
=0A=
>                      "multiple vectors to one in the loop body\n");=0A=
>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) =3D single_defuse_cycle;=0A=
>=0A=
> -  /* If the reduction stmt is one of the patterns that have lane=0A=
> -     reduction embedded we cannot handle the case of ! single_defuse_cyc=
le.  */=0A=
> -  if ((ncopies > 1 && ! single_defuse_cycle)=0A=
> -      && lane_reducing)=0A=
> -    {=0A=
> -      if (dump_enabled_p ())=0A=
> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,=0A=
> -                        "multi def-use cycle not possible for lane-reduc=
ing "=0A=
> -                        "reduction operation\n");=0A=
> -      return false;=0A=
> -    }=0A=
> +  /* For lane-reducing operation, the below processing related to single=
=0A=
> +     defuse-cycle will be done in its own vectorizable function.  One mo=
re=0A=
> +     thing to note is that the operation must not be involved in fold-le=
ft=0A=
> +     reduction.  */=0A=
> +  single_defuse_cycle &=3D !lane_reducing;=0A=
>=0A=
>    if (slp_node=0A=
> -      && !(!single_defuse_cycle=0A=
> -          && !lane_reducing=0A=
> -          && reduction_type !=3D FOLD_LEFT_REDUCTION))=0A=
> +      && (single_defuse_cycle || reduction_type =3D=3D FOLD_LEFT_REDUCTI=
ON))=0A=
>      for (i =3D 0; i < (int) op.num_ops; i++)=0A=
>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))=
=0A=
>         {=0A=
> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,=
=0A=
>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,=0A=
>                              reduction_type, ncopies, cost_vec);=0A=
>    /* Cost the reduction op inside the loop if transformed via=0A=
> -     vect_transform_reduction.  Otherwise this is costed by the=0A=
> -     separate vectorizable_* routines.  */=0A=
> -  if (single_defuse_cycle || lane_reducing)=0A=
> -    {=0A=
> -      int factor =3D 1;=0A=
> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))=0A=
> -       /* Three dot-products and a subtraction.  */=0A=
> -       factor =3D 4;=0A=
> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,=0A=
> -                       stmt_info, 0, vect_body);=0A=
> -    }=0A=
> +     vect_transform_reduction for non-lane-reducing operation.  Otherwis=
e=0A=
> +     this is costed by the separate vectorizable_* routines.  */=0A=
> +  if (single_defuse_cycle)=0A=
> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect=
_body);=0A=
>=0A=
>    if (dump_enabled_p ()=0A=
>        && reduction_type =3D=3D FOLD_LEFT_REDUCTION)=0A=
>      dump_printf_loc (MSG_NOTE, vect_location,=0A=
>                      "using an in-order (fold-left) reduction.\n");=0A=
>    STMT_VINFO_TYPE (orig_stmt_of_analysis) =3D cycle_phi_info_type;=0A=
> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left=
=0A=
> -     reductions go through their own vectorizable_* routines.  */=0A=
> -  if (!single_defuse_cycle=0A=
> -      && !lane_reducing=0A=
> -      && reduction_type !=3D FOLD_LEFT_REDUCTION)=0A=
> +=0A=
> +  /* All but single defuse-cycle optimized and fold-left reductions go=
=0A=
> +     through their own vectorizable_* routines.  */=0A=
> +  if (!single_defuse_cycle && reduction_type !=3D FOLD_LEFT_REDUCTION)=
=0A=
>      {=0A=
>        stmt_vec_info tem=0A=
>         =3D vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));=0A=
> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo=
,=0A=
>    bool lane_reducing =3D lane_reducing_op_p (code);=0A=
>    gcc_assert (single_defuse_cycle || lane_reducing);=0A=
>=0A=
> +  if (lane_reducing)=0A=
> +    {=0A=
> +      /* The last operand of lane-reducing op is for reduction.  */=0A=
> +      gcc_assert (reduc_index =3D=3D (int) op.num_ops - 1);=0A=
> +=0A=
> +      /* Now all lane-reducing ops are covered by some slp node.  */=0A=
> +      gcc_assert (slp_node);=0A=
> +    }=0A=
> +=0A=
>    /* Create the destination vector  */=0A=
>    tree scalar_dest =3D gimple_get_lhs (stmt_info->stmt);=0A=
>    tree vec_dest =3D vect_create_destination_var (scalar_dest, vectype_ou=
t);=0A=
> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo=
,=0A=
>                          reduc_index =3D=3D 2 ? op.ops[2] : NULL_TREE,=0A=
>                          &vec_oprnds[2]);=0A=
>      }=0A=
> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) =3D=3D 1=0A=
> +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ()=
)=0A=
> +    {=0A=
> +      /* For lane-reducing op covered by single-lane slp node, the input=
=0A=
> +        vectype of the reduction PHI determines copies of vectorized def=
-use=0A=
> +        cycles, which might be more than effective copies of vectorized =
lane-=0A=
> +        reducing reduction statements.  This could be complemented by=0A=
> +        generating extra trivial pass-through copies.  For example:=0A=
> +=0A=
> +          int sum =3D 0;=0A=
> +          for (i)=0A=
> +            {=0A=
> +              sum +=3D d0[i] * d1[i];      // dot-prod <vector(16) char>=
=0A=
> +              sum +=3D abs(s0[i] - s1[i]); // sad <vector(8) short>=0A=
> +              sum +=3D n[i];               // normal <vector(4) int>=0A=
> +            }=0A=
> +=0A=
> +        The vector size is 128-bit,vectorization factor is 16.  Reductio=
n=0A=
> +        statements would be transformed as:=0A=
> +=0A=
> +          vector<4> int sum_v0 =3D { 0, 0, 0, 0 };=0A=
> +          vector<4> int sum_v1 =3D { 0, 0, 0, 0 };=0A=
> +          vector<4> int sum_v2 =3D { 0, 0, 0, 0 };=0A=
> +          vector<4> int sum_v3 =3D { 0, 0, 0, 0 };=0A=
> +=0A=
> +          for (i / 16)=0A=
> +            {=0A=
> +              sum_v0 =3D DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], s=
um_v0);=0A=
> +              sum_v1 =3D sum_v1;  // copy=0A=
> +              sum_v2 =3D sum_v2;  // copy=0A=
> +              sum_v3 =3D sum_v3;  // copy=0A=
> +=0A=
> +              sum_v0 =3D SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0=
);=0A=
> +              sum_v1 =3D SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1=
);=0A=
> +              sum_v2 =3D sum_v2;  // copy=0A=
> +              sum_v3 =3D sum_v3;  // copy=0A=
> +=0A=
> +              sum_v0 +=3D n_v0[i: 0  ~ 3 ];=0A=
> +              sum_v1 +=3D n_v1[i: 4  ~ 7 ];=0A=
> +              sum_v2 +=3D n_v2[i: 8  ~ 11];=0A=
> +              sum_v3 +=3D n_v3[i: 12 ~ 15];=0A=
> +            }=0A=
> +       */=0A=
> +      unsigned using_ncopies =3D vec_oprnds[0].length ();=0A=
> +      unsigned reduc_ncopies =3D vec_oprnds[reduc_index].length ();=0A=
> +=0A=
=0A=
assert reduc_ncopies >=3D using_ncopies?  Maybe assert=0A=
reduc_index =3D=3D op.num_ops - 1 given you use one above=0A=
and the other below?  Or simply iterate till op.num_ops=0A=
and sip i =3D=3D reduc_index.=0A=
=0A=
> +      for (unsigned i =3D 0; i < op.num_ops - 1; i++)=0A=
> +       {=0A=
> +         gcc_assert (vec_oprnds[i].length () =3D=3D using_ncopies);=0A=
> +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);=0A=
> +       }=0A=
> +    }=0A=
>=0A=
>    bool emulated_mixed_dot_prod =3D vect_is_emulated_mixed_dot_prod (stmt=
_info);=0A=
>    unsigned num =3D vec_oprnds[reduc_index =3D=3D 0 ? 1 : 0].length ();=
=0A=
> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo=
,=0A=
>      {=0A=
>        gimple *new_stmt;=0A=
>        tree vop[3] =3D { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };=
=0A=
> -      if (masked_loop_p && !mask_by_cond_expr)=0A=
> +=0A=
> +      if (!vop[0] || !vop[1])=0A=
> +       {=0A=
> +         tree reduc_vop =3D vec_oprnds[reduc_index][i];=0A=
> +=0A=
> +         /* Insert trivial copy if no need to generate vectorized=0A=
> +            statement.  */=0A=
> +         gcc_assert (reduc_vop);=0A=
> +=0A=
> +         new_stmt =3D gimple_build_assign (vec_dest, reduc_vop);=0A=
> +         new_temp =3D make_ssa_name (vec_dest, new_stmt);=0A=
> +         gimple_set_lhs (new_stmt, new_temp);=0A=
> +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, g=
si);=0A=
=0A=
I think you could simply do=0A=
=0A=
               slp_node->push_vec_def (reduc_vop);=0A=
               continue;=0A=
=0A=
without any code generation.=0A=
=0A=
> +       }=0A=
> +      else if (masked_loop_p && !mask_by_cond_expr)=0A=
>         {=0A=
>           /* No conditional ifns have been defined for lane-reducing op=
=0A=
>              yet.  */=0A=
> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo=
,=0A=
>=0A=
>           if (masked_loop_p && mask_by_cond_expr)=0A=
>             {=0A=
> +             tree stmt_vectype_in =3D vectype_in;=0A=
> +             unsigned nvectors =3D vec_num * ncopies;=0A=
> +=0A=
> +             if (lane_reducing && SLP_TREE_LANES (slp_node) =3D=3D 1)=0A=
> +               {=0A=
> +                 /* Input vectype of the reduction PHI may be defferent =
from=0A=
=0A=
different=0A=
=0A=
> +                    that of lane-reducing operation.  */=0A=
> +                 stmt_vectype_in =3D STMT_VINFO_REDUC_VECTYPE_IN (stmt_i=
nfo);=0A=
> +                 nvectors =3D vect_get_num_copies (loop_vinfo, stmt_vect=
ype_in);=0A=
=0A=
I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.=0A=
=0A=
Otherwise the patch looks good to me.=0A=
=0A=
Richard.=0A=
=0A=
> +               }=0A=
> +=0A=
>               tree mask =3D vect_get_loop_mask (loop_vinfo, gsi, masks,=
=0A=
> -                                             vec_num * ncopies, vectype_=
in, i);=0A=
> +                                             nvectors, stmt_vectype_in, =
i);=0A=
>               build_vect_cond_expr (code, vop, mask, gsi);=0A=
>             }=0A=
>=0A=
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc=0A=
> index ca6052662a3..1b73ef01ade 100644=0A=
> --- a/gcc/tree-vect-stmts.cc=0A=
> +++ b/gcc/tree-vect-stmts.cc=0A=
> @@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,=0A=
>                                       NULL, NULL, node, cost_vec)=0A=
>           || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_=
vec)=0A=
>           || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost=
_vec)=0A=
> +         || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),=0A=
> +                                        stmt_info, node, cost_vec)=0A=
>           || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_i=
nfo,=0A=
>                                      node, node_instance, cost_vec)=0A=
>           || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_i=
nfo,=0A=
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h=0A=
> index 60224f4e284..94736736dcc 100644=0A=
> --- a/gcc/tree-vectorizer.h=0A=
> +++ b/gcc/tree-vectorizer.h=0A=
> @@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class =
loop *, vec_info_shared *,=0A=
>  extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,=0A=
>                                          slp_tree, slp_instance, int,=0A=
>                                          bool, stmt_vector_for_cost *);=
=0A=
> +extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,=0A=
> +                                       slp_tree, stmt_vector_for_cost *)=
;=0A=
>  extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,=0A=
>                                     slp_tree, slp_instance,=0A=
>                                     stmt_vector_for_cost *);=0A=
> --=0A=
> 2.17.1=0A=