From: Feng Xue OS
To: Richard Biener
CC: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
Date: Sun, 23 Jun 2024 15:10:32 +0000
>> -  if (slp_node)
>> +  if (slp_node && SLP_TREE_LANES (slp_node) > 1)
>
> Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
> instead, which is bad.
>
>>      nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>>    else
>>      nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
>> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
>>      }
>>  }
>>
>> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
>> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
>> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
>> +   (sum-of-absolute-differences).
>> +
>> +   For a lane-reducing operation, the loop reduction path that it lies in
>> +   may contain a normal operation, or another lane-reducing operation of
>> +   different input type size.  An example:
>> +
>> +     int sum = 0;
>> +     for (i)
>> +       {
>> +         ...
>> +         sum += d0[i] * d1[i];      // dot-prod
>> +         sum += w[i];               // widen-sum
>> +         sum += abs(s0[i] - s1[i]); // sad
>> +         sum += n[i];               // normal
>> +         ...
>> +       }
>> +
>> +   The vectorization factor is essentially determined by the operation whose
>> +   input vectype has the most lanes ("vector(16) char" in the example), while
>> +   we need to choose the input vectype with
>> +   the least lanes ("vector(4) int" in the example) for the reduction
>> +   PHI statement.  */
>> +
>> +bool
>> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
>> +                            slp_tree slp_node, stmt_vector_for_cost *cost_vec)
>> +{
>> +  gimple *stmt = stmt_info->stmt;
>> +
>> +  if (!lane_reducing_stmt_p (stmt))
>> +    return false;
>> +
>> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
>> +
>> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
>> +    return false;
>> +
>> +  /* Do not try to vectorize bit-precision reductions.  */
>> +  if (!type_has_mode_precision_p (type))
>> +    return false;
>> +
>> +  if (!slp_node)
>> +    return false;
>> +
>> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
>> +    {
>> +      stmt_vec_info def_stmt_info;
>> +      slp_tree slp_op;
>> +      tree op;
>> +      tree vectype;
>> +      enum vect_def_type dt;
>> +
>> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
>> +                               &slp_op, &dt, &vectype, &def_stmt_info))
>> +        {
>> +          if (dump_enabled_p ())
>> +            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +                             "use not simple.\n");
>> +          return false;
>> +        }
>> +
>> +      if (!vectype)
>> +        {
>> +          vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
>> +                                                 slp_op);
>> +          if (!vectype)
>> +            return false;
>> +        }
>> +
>> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
>> +        {
>> +          if (dump_enabled_p ())
>> +            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> +                             "incompatible vector types for invariants\n");
>> +          return false;
>> +        }
>> +
>> +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
>> +        continue;
>> +
>> +      /* There should be at most one cycle def in the stmt.
>> +       */
>> +      if (VECTORIZABLE_CYCLE_DEF (dt))
>> +        return false;
>> +    }
>> +
>> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
>> +
>> +  /* TODO: Support lane-reducing operation that does not directly participate
>> +     in loop reduction.  */
>> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
>> +    return false;
>> +
>> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
>> +     recognized.  */
>> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
>> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
>> +
>> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
>> +  int ncopies_for_cost;
>> +
>> +  if (SLP_TREE_LANES (slp_node) > 1)
>> +    {
>> +      /* Now lane-reducing operations in a non-single-lane slp node should only
>> +         come from the same loop reduction path.  */
>> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
>> +      ncopies_for_cost = 1;
>> +    }
>> +  else
>> +    {
>> +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
>
> OK, so the fact that the ops are lane-reducing means they effectively
> change the VF for the result.  That's only possible as we tightly control
> code generation and "adjust" to the expected VF (by inserting the copies
> you mentioned above), but only up to the highest number of outputs
> created in the reduction chain.  In that sense, instead of talking about and
> recording "input vector types", wouldn't it make more sense to record the
> effective vectorization factor for the reduction instance?  That VF would be
> at most the loop's VF but could be as low as 1.
> Once we have a non-lane-reducing
> operation in the reduction chain it would always be equal to the loop's VF.
>
> ncopies would then always be determined by that reduction instance VF and
> the accumulator vector type (STMT_VINFO_VECTYPE).  This reduction
> instance VF would also trivially indicate the force-single-def-use-cycle
> case, possibly simplifying code?

I tried to add such an effective VF, but the vectype_in is still needed in some
scenarios, such as when checking whether a dot-prod stmt is emulated or not.
The former could be deduced from the latter, so recording both things seems
to be redundant.  Another consideration is that for a normal op, ncopies is
determined from the type (STMT_VINFO_VECTYPE), but for a lane-reducing op,
it is from the VF.  So, is there a better means to unify them?

>> +      gcc_assert (ncopies_for_cost >= 1);
>> +    }
>> +
>> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
>> +    {
>> +      /* We need two extra invariants: one that contains the minimum signed
>> +         value and one that contains half of its negative.  */
>> +      int prologue_stmts = 2;
>> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
>> +                                        scalar_to_vec, stmt_info, 0,
>> +                                        vect_prologue);
>> +      if (dump_enabled_p ())
>> +        dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
>> +                     "extra prologue_cost = %d .\n", cost);
>> +
>> +      /* Three dot-products and a subtraction.
>> +       */
>> +      ncopies_for_cost *= 4;
>> +    }
>> +
>> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
>> +                    vect_body);
>> +
>> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>> +    {
>> +      enum tree_code code = gimple_assign_rhs_code (stmt);
>> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
>> +                                                  slp_node, code, type,
>> +                                                  vectype_in);
>> +    }
>> +
>
> Add a comment:
>
>   /* Transform via vect_transform_reduction.  */
>
>> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
>> +  return true;
>> +}
>> +
>>  /* Function vectorizable_reduction.
>>
>>     Check if STMT_INFO performs a reduction operation that can be vectorized.
>> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>>    if (!type_has_mode_precision_p (op.type))
>>      return false;
>>
>> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
>> -     which means the only use of that may be in the lane-reducing operation.  */
>> -  if (lane_reducing
>> -      && reduc_chain_length != 1
>> -      && !only_slp_reduc_chain)
>> -    {
>> -      if (dump_enabled_p ())
>> -        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> -                         "lane-reducing reduction with extra stmts.\n");
>> -      return false;
>> -    }
>> -
>>    /* Lane-reducing ops also never can be used in a SLP reduction group
>>       since we'll mix lanes belonging to different reductions.  But it's
But it's= =0A= >> OK to use them in a reduction chain or when the reduction group=0A= >> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo= ,=0A= >> && loop_vinfo->suggested_unroll_factor =3D=3D 1)=0A= >> single_defuse_cycle =3D true;=0A= >>=0A= >> - if (single_defuse_cycle || lane_reducing)=0A= >> + if (single_defuse_cycle && !lane_reducing)=0A= > =0A= > If there's also a non-lane-reducing plus in the chain don't we have to=0A= > check for that reduction op? So shouldn't it be=0A= > single_defuse_cycle && ... fact that we don't record=0A= > (non-lane-reducing op there) ...=0A= =0A= Quite not understand this point. For a non-lane-reducing op in the chain,= =0A= it should be handled in its own vectorizable_xxx function? The below check= =0A= is only for the first statement (vect_reduction_def) in the reduction.=0A= =0A= > =0A= >> {=0A= >> gcc_assert (op.code !=3D COND_EXPR);=0A= >>=0A= >> - /* 4. Supportable by target? */=0A= >> - bool ok =3D true;=0A= >> -=0A= >> - /* 4.1. check support for the operation in the loop=0A= >> + /* 4. check support for the operation in the loop=0A= >>=0A= >> This isn't necessary for the lane reduction codes, since they= =0A= >> can only be produced by pattern matching, and it's up to the=0A= >> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo= ,=0A= >> mixed-sign dot-products can be implemented using signed=0A= >> dot-products. 
>>       */
>>        machine_mode vec_mode = TYPE_MODE (vectype_in);
>> -      if (!lane_reducing
>> -          && !directly_supported_p (op.code, vectype_in, optab_vector))
>> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
>>          {
>>            if (dump_enabled_p ())
>>              dump_printf (MSG_NOTE, "op not supported by target.\n");
>>            if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
>>                || !vect_can_vectorize_without_simd_p (op.code))
>> -            ok = false;
>> +            single_defuse_cycle = false;
>>            else
>>              if (dump_enabled_p ())
>>                dump_printf (MSG_NOTE, "proceeding using word mode.\n");
>> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>>            dump_printf (MSG_NOTE, "using word mode not possible.\n");
>>            return false;
>>          }
>> -
>> -      /* lane-reducing operations have to go through vect_transform_reduction.
>> -         For the other cases try without the single cycle optimization.  */
>> -      if (!ok)
>> -        {
>> -          if (lane_reducing)
>> -            return false;
>> -          else
>> -            single_defuse_cycle = false;
>> -        }
>>      }
>>    if (dump_enabled_p () && single_defuse_cycle)
>>      dump_printf_loc (MSG_NOTE, vect_location,
>> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>>                       "multiple vectors to one in the loop body\n");
>>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
>>
>> -  /* If the reduction stmt is one of the patterns that have lane
>> -     reduction embedded we cannot handle the case of !
>> -  single_defuse_cycle.  */
>> -  if ((ncopies > 1 && ! single_defuse_cycle)
>> -      && lane_reducing)
>> -    {
>> -      if (dump_enabled_p ())
>> -        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> -                         "multi def-use cycle not possible for lane-reducing "
>> -                         "reduction operation\n");
>> -      return false;
>> -    }
>> +  /* For lane-reducing operation, the below processing related to single
>> +     defuse-cycle will be done in its own vectorizable function.  One more
>> +     thing to note is that the operation must not be involved in fold-left
>> +     reduction.  */
>> +  single_defuse_cycle &= !lane_reducing;
>>
>>    if (slp_node
>> -      && !(!single_defuse_cycle
>> -           && !lane_reducing
>> -           && reduction_type != FOLD_LEFT_REDUCTION))
>> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
>>      for (i = 0; i < (int) op.num_ops; i++)
>>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
>>          {
>> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
>>                               reduction_type, ncopies, cost_vec);
>>    /* Cost the reduction op inside the loop if transformed via
>> -     vect_transform_reduction.  Otherwise this is costed by the
>> -     separate vectorizable_* routines.  */
>> -  if (single_defuse_cycle || lane_reducing)
>> -    {
>> -      int factor = 1;
>> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
>> -        /* Three dot-products and a subtraction.  */
>> -        factor = 4;
>> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
>> -                        stmt_info, 0, vect_body);
>> -    }
>> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
>> +     this is costed by the separate vectorizable_* routines.
>> +   */
>> +  if (single_defuse_cycle)
>> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
>>
>>    if (dump_enabled_p ()
>>        && reduction_type == FOLD_LEFT_REDUCTION)
>>      dump_printf_loc (MSG_NOTE, vect_location,
>>                       "using an in-order (fold-left) reduction.\n");
>>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
>> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
>> -     reductions go through their own vectorizable_* routines.  */
>> -  if (!single_defuse_cycle
>> -      && !lane_reducing
>> -      && reduction_type != FOLD_LEFT_REDUCTION)
>> +
>> +  /* All but single defuse-cycle optimized and fold-left reductions go
>> +     through their own vectorizable_* routines.  */
>> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
>>      {
>>        stmt_vec_info tem
>>          = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
>> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>>    bool lane_reducing = lane_reducing_op_p (code);
>>    gcc_assert (single_defuse_cycle || lane_reducing);
>>
>> +  if (lane_reducing)
>> +    {
>> +      /* The last operand of lane-reducing op is for reduction.  */
>> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
>> +
>> +      /* Now all lane-reducing ops are covered by some slp node.  */
>> +      gcc_assert (slp_node);
>> +    }
>> +
>>    /* Create the destination vector  */
>>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
>>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
>> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>>                           reduc_index == 2 ?
>> op.ops[2] : NULL_TREE,
>>                           &vec_oprnds[2]);
>>      }
>> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
>> +           && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
>> +    {
>> +      /* For lane-reducing op covered by single-lane slp node, the input
>> +         vectype of the reduction PHI determines copies of vectorized def-use
>> +         cycles, which might be more than effective copies of vectorized lane-
>> +         reducing reduction statements.  This could be complemented by
>> +         generating extra trivial pass-through copies.  For example:
>> +
>> +           int sum = 0;
>> +           for (i)
>> +             {
>> +               sum += d0[i] * d1[i];      // dot-prod
>> +               sum += abs(s0[i] - s1[i]); // sad
>> +               sum += n[i];               // normal
>> +             }
>> +
>> +         The vector size is 128-bit, vectorization factor is 16.  Reduction
>> +         statements would be transformed as:
>> +
>> +           vector<4> int sum_v0 = { 0, 0, 0, 0 };
>> +           vector<4> int sum_v1 = { 0, 0, 0, 0 };
>> +           vector<4> int sum_v2 = { 0, 0, 0, 0 };
>> +           vector<4> int sum_v3 = { 0, 0, 0, 0 };
>> +
>> +           for (i / 16)
>> +             {
>> +               sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>> +               sum_v1 = sum_v1;  // copy
>> +               sum_v2 = sum_v2;  // copy
>> +               sum_v3 = sum_v3;  // copy
>> +
>> +               sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
>> +               sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
>> +               sum_v2 = sum_v2;  // copy
>> +               sum_v3 = sum_v3;  // copy
>> +
>> +               sum_v0 += n_v0[i: 0 ~ 3 ];
>> +               sum_v1 += n_v1[i: 4 ~ 7 ];
>> +               sum_v2 += n_v2[i: 8 ~ 11];
>> +               sum_v3 += n_v3[i: 12 ~ 15];
>> +             }
>> +       */
>> +      unsigned using_ncopies = vec_oprnds[0].length ();
>> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
>> +
>
> assert reduc_ncopies >= using_ncopies?
> Maybe assert
> reduc_index == op.num_ops - 1 given you use one above
> and the other below?  Or simply iterate till op.num_ops
> and skip i == reduc_index.
>
>> +      for (unsigned i = 0; i < op.num_ops - 1; i++)
>> +        {
>> +          gcc_assert (vec_oprnds[i].length () == using_ncopies);
>> +          vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
>> +        }
>> +    }
>>
>>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
>>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
>> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>>      {
>>        gimple *new_stmt;
>>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
>> +
>> +      if (!vop[0] || !vop[1])
>> +        {
>> +          tree reduc_vop = vec_oprnds[reduc_index][i];
>> +
>> +          /* Insert trivial copy if no need to generate vectorized
>> +             statement.  */
>> +          gcc_assert (reduc_vop);
>> +
>> +          new_stmt = gimple_build_assign (vec_dest, reduc_vop);
>> +          new_temp = make_ssa_name (vec_dest, new_stmt);
>> +          gimple_set_lhs (new_stmt, new_temp);
>> +          vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
>
> I think you could simply do
>
>   slp_node->push_vec_def (reduc_vop);
>   continue;
>
> without any code generation.

OK, that would be easy.  Here comes another question: this patch assumes a
lane-reducing op would always be contained in a slp node, since the single-lane
slp node feature has been enabled.  But I got some regression when I enforced
such a constraint on the lane-reducing op check.
Those cases were found to
be unvectorizable with single-lane slp, so is this not what we want, and does
it need to be fixed?

>> +        }
>> +      else if (masked_loop_p && !mask_by_cond_expr)
>>          {
>>            /* No conditional ifns have been defined for lane-reducing op
>>               yet.  */
>> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>>
>>            if (masked_loop_p && mask_by_cond_expr)
>>              {
>> +              tree stmt_vectype_in = vectype_in;
>> +              unsigned nvectors = vec_num * ncopies;
>> +
>> +              if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
>> +                {
>> +                  /* Input vectype of the reduction PHI may be defferent from
>
> different
>
>> +                     that of lane-reducing operation.  */
>> +                  stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
>> +                  nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
>
> I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.

To partially vectorize a dot_prod<16 * char> with 128-bit vector width,
should we pass (nvectors=4, vectype=<4 * int>) instead of (nvectors=1,
vectype=<16 * char>) to vect_get_loop_mask?

Thanks,
Feng

________________________________________
From: Richard Biener
Sent: Thursday, June 20, 2024 8:26 PM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

On Sun, Jun 16, 2024 at 9:31 AM Feng Xue OS wrote:
>
> For lane-reducing operation (dot-prod/widen-sum/sad) in loop reduction, the
> current vectorizer can only handle the pattern if the reduction chain does not
> contain other operation, whether the other is normal or lane-reducing.
>
> Actually, to allow multiple arbitrary lane-reducing operations, we need to
> support vectorization of loop reduction chain with mixed
> input vectypes.  Since
> lanes of vectype may vary with operation, the effective ncopies of vectorized
> statements for each operation also may not be the same, and this causes a
> mismatch in vectorized def-use cycles.  A simple way is to align all operations
> with the one that has the most ncopies; the gap could be complemented by
> generating extra trivial pass-through copies.  For example:
>
>   int sum = 0;
>   for (i)
>     {
>       sum += d0[i] * d1[i];      // dot-prod
>       sum += w[i];               // widen-sum
>       sum += abs(s0[i] - s1[i]); // sad
>       sum += n[i];               // normal
>     }
>
> The vector size is 128-bit, vectorization factor is 16.  Reduction statements
> would be transformed as:
>
>   vector<4> int sum_v0 = { 0, 0, 0, 0 };
>   vector<4> int sum_v1 = { 0, 0, 0, 0 };
>   vector<4> int sum_v2 = { 0, 0, 0, 0 };
>   vector<4> int sum_v3 = { 0, 0, 0, 0 };
>
>   for (i / 16)
>     {
>       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>       sum_v1 = sum_v1;  // copy
>       sum_v2 = sum_v2;  // copy
>       sum_v3 = sum_v3;  // copy
>
>       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
>       sum_v1 = sum_v1;  // copy
>       sum_v2 = sum_v2;  // copy
>       sum_v3 = sum_v3;  // copy
>
>       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
>       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
>       sum_v2 = sum_v2;  // copy
>       sum_v3 = sum_v3;  // copy
>
>       sum_v0 += n_v0[i: 0 ~ 3 ];
>       sum_v1 += n_v1[i: 4 ~ 7 ];
>       sum_v2 += n_v2[i: 8 ~ 11];
>       sum_v3 += n_v3[i: 12 ~ 15];
>     }
>
> Thanks,
> Feng
>
> ---
> gcc/
>         PR tree-optimization/114440
>         * tree-vectorizer.h (vectorizable_lane_reducing): New function
>         declaration.
>         * tree-vect-stmts.cc (vect_analyze_stmt): Call new function
>         vectorizable_lane_reducing to analyze lane-reducing operation.
>         *
tree-vect-loop.cc (vect_model_reduction_cost): Remove cost comp= utation=0A= > code related to emulated_mixed_dot_prod.=0A= > (vect_reduction_update_partial_vector_usage): Compute ncopies as = the=0A= > original means for single-lane slp node.=0A= > (vectorizable_lane_reducing): New function.=0A= > (vectorizable_reduction): Allow multiple lane-reducing operations= in=0A= > loop reduction. Move some original lane-reducing related code to= =0A= > vectorizable_lane_reducing.=0A= > (vect_transform_reduction): Extend transformation to support redu= ction=0A= > statements with mixed input vectypes.=0A= >=0A= > gcc/testsuite/=0A= > PR tree-optimization/114440=0A= > * gcc.dg/vect/vect-reduc-chain-1.c=0A= > * gcc.dg/vect/vect-reduc-chain-2.c=0A= > * gcc.dg/vect/vect-reduc-chain-3.c=0A= > * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c=0A= > * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c=0A= > * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c=0A= > * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c=0A= > * gcc.dg/vect/vect-reduc-dot-slp-1.c=0A= > ---=0A= > .../gcc.dg/vect/vect-reduc-chain-1.c | 62 ++++=0A= > .../gcc.dg/vect/vect-reduc-chain-2.c | 77 +++++=0A= > .../gcc.dg/vect/vect-reduc-chain-3.c | 66 ++++=0A= > .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c | 95 +++++=0A= > .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c | 67 ++++=0A= > .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c | 79 +++++=0A= > .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c | 63 ++++=0A= > .../gcc.dg/vect/vect-reduc-dot-slp-1.c | 35 ++=0A= > gcc/tree-vect-loop.cc | 324 ++++++++++++++----= =0A= > gcc/tree-vect-stmts.cc | 2 +=0A= > gcc/tree-vectorizer.h | 2 +=0A= > 11 files changed, 802 insertions(+), 70 deletions(-)=0A= > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c=0A= > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c=0A= > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c=0A= > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.= c=0A= > create mode 100644 
gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.= c=0A= > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.= c=0A= > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.= c=0A= > create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c=0A= >=0A= > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsui= te/gcc.dg/vect/vect-reduc-chain-1.c=0A= > new file mode 100644=0A= > index 00000000000..04bfc419dbd=0A= > --- /dev/null=0A= > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c=0A= > @@ -0,0 +1,62 @@=0A= > +/* Disabling epilogues until we find a better way to deal with scans. *= /=0A= > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A= > +/* { dg-require-effective-target vect_int } */=0A= > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aa= rch64*-*-* || arm*-*-* } } } */=0A= > +/* { dg-add-options arm_v8_2a_dotprod_neon } */=0A= > +=0A= > +#include "tree-vect.h"=0A= > +=0A= > +#define N 50=0A= > +=0A= > +#ifndef SIGNEDNESS_1=0A= > +#define SIGNEDNESS_1 signed=0A= > +#define SIGNEDNESS_2 signed=0A= > +#endif=0A= > +=0A= > +SIGNEDNESS_1 int __attribute__ ((noipa))=0A= > +f (SIGNEDNESS_1 int res,=0A= > + SIGNEDNESS_2 char *restrict a,=0A= > + SIGNEDNESS_2 char *restrict b,=0A= > + SIGNEDNESS_2 char *restrict c,=0A= > + SIGNEDNESS_2 char *restrict d,=0A= > + SIGNEDNESS_1 int *restrict e)=0A= > +{=0A= > + for (int i =3D 0; i < N; ++i)=0A= > + {=0A= > + res +=3D a[i] * b[i];=0A= > + res +=3D c[i] * d[i];=0A= > + res +=3D e[i];=0A= > + }=0A= > + return res;=0A= > +}=0A= > +=0A= > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? 
-126 : 4)=0A= > +#define OFFSET 20=0A= > +=0A= > +int=0A= > +main (void)=0A= > +{=0A= > + check_vect ();=0A= > +=0A= > + SIGNEDNESS_2 char a[N], b[N];=0A= > + SIGNEDNESS_2 char c[N], d[N];=0A= > + SIGNEDNESS_1 int e[N];=0A= > + int expected =3D 0x12345;=0A= > + for (int i =3D 0; i < N; ++i)=0A= > + {=0A= > + a[i] =3D BASE + i * 5;=0A= > + b[i] =3D BASE + OFFSET + i * 4;=0A= > + c[i] =3D BASE + i * 2;=0A= > + d[i] =3D BASE + OFFSET + i * 3;=0A= > + e[i] =3D i;=0A= > + asm volatile ("" ::: "memory");=0A= > + expected +=3D a[i] * b[i];=0A= > + expected +=3D c[i] * d[i];=0A= > + expected +=3D e[i];=0A= > + }=0A= > + if (f (0x12345, a, b, c, d, e) !=3D expected)=0A= > + __builtin_abort ();=0A= > +}=0A= > +=0A= > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "= vect" } } */=0A= > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D DO= T_PROD_EXPR" 2 "vect" { target vect_sdot_qi } } } */=0A= > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c b/gcc/testsui= te/gcc.dg/vect/vect-reduc-chain-2.c=0A= > new file mode 100644=0A= > index 00000000000..6c803b80120=0A= > --- /dev/null=0A= > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c=0A= > @@ -0,0 +1,77 @@=0A= > +/* Disabling epilogues until we find a better way to deal with scans. 
*= /=0A= > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A= > +/* { dg-require-effective-target vect_int } */=0A= > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aa= rch64*-*-* || arm*-*-* } } } */=0A= > +/* { dg-add-options arm_v8_2a_dotprod_neon } */=0A= > +=0A= > +#include "tree-vect.h"=0A= > +=0A= > +#define N 50=0A= > +=0A= > +#ifndef SIGNEDNESS_1=0A= > +#define SIGNEDNESS_1 signed=0A= > +#define SIGNEDNESS_2 unsigned=0A= > +#define SIGNEDNESS_3 signed=0A= > +#define SIGNEDNESS_4 signed=0A= > +#endif=0A= > +=0A= > +SIGNEDNESS_1 int __attribute__ ((noipa))=0A= > +fn (SIGNEDNESS_1 int res,=0A= > + SIGNEDNESS_2 char *restrict a,=0A= > + SIGNEDNESS_2 char *restrict b,=0A= > + SIGNEDNESS_3 char *restrict c,=0A= > + SIGNEDNESS_3 char *restrict d,=0A= > + SIGNEDNESS_4 short *restrict e,=0A= > + SIGNEDNESS_4 short *restrict f,=0A= > + SIGNEDNESS_1 int *restrict g)=0A= > +{=0A= > + for (int i =3D 0; i < N; ++i)=0A= > + {=0A= > + res +=3D a[i] * b[i];=0A= > + res +=3D i + 1;=0A= > + res +=3D c[i] * d[i];=0A= > + res +=3D e[i] * f[i];=0A= > + res +=3D g[i];=0A= > + }=0A= > + return res;=0A= > +}=0A= > +=0A= > +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)=0A= > +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -126 : 4)=0A= > +#define BASE4 ((SIGNEDNESS_4 int) -1 < 0 ? 
-1026 : 373)=0A= > +#define OFFSET 20=0A= > +=0A= > +int=0A= > +main (void)=0A= > +{=0A= > + check_vect ();=0A= > +=0A= > + SIGNEDNESS_2 char a[N], b[N];=0A= > + SIGNEDNESS_3 char c[N], d[N];=0A= > + SIGNEDNESS_4 short e[N], f[N];=0A= > + SIGNEDNESS_1 int g[N];=0A= > + int expected =3D 0x12345;=0A= > + for (int i =3D 0; i < N; ++i)=0A= > + {=0A= > + a[i] =3D BASE2 + i * 5;=0A= > + b[i] =3D BASE2 + OFFSET + i * 4;=0A= > + c[i] =3D BASE3 + i * 2;=0A= > + d[i] =3D BASE3 + OFFSET + i * 3;=0A= > + e[i] =3D BASE4 + i * 6;=0A= > + f[i] =3D BASE4 + OFFSET + i * 5;=0A= > + g[i] =3D i;=0A= > + asm volatile ("" ::: "memory");=0A= > + expected +=3D a[i] * b[i];=0A= > + expected +=3D i + 1;=0A= > + expected +=3D c[i] * d[i];=0A= > + expected +=3D e[i] * f[i];=0A= > + expected +=3D g[i];=0A= > + }=0A= > + if (fn (0x12345, a, b, c, d, e, f, g) !=3D expected)=0A= > + __builtin_abort ();=0A= > +}=0A= > +=0A= > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "= vect" } } */=0A= > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PROD= _EXPR" "vect" { target { vect_sdot_qi } } } } */=0A= > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PROD= _EXPR" "vect" { target { vect_udot_qi } } } } */=0A= > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PROD= _EXPR" "vect" { target { vect_sdot_hi } } } } */=0A= > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c b/gcc/testsui= te/gcc.dg/vect/vect-reduc-chain-3.c=0A= > new file mode 100644=0A= > index 00000000000..a41e4b176c4=0A= > --- /dev/null=0A= > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c=0A= > @@ -0,0 +1,66 @@=0A= > +/* Disabling epilogues until we find a better way to deal with scans. 
*= /=0A= > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A= > +/* { dg-require-effective-target vect_int } */=0A= > +=0A= > +#include "tree-vect.h"=0A= > +=0A= > +#define N 50=0A= > +=0A= > +#ifndef SIGNEDNESS_1=0A= > +#define SIGNEDNESS_1 signed=0A= > +#define SIGNEDNESS_2 unsigned=0A= > +#define SIGNEDNESS_3 signed=0A= > +#endif=0A= > +=0A= > +SIGNEDNESS_1 int __attribute__ ((noipa))=0A= > +f (SIGNEDNESS_1 int res,=0A= > + SIGNEDNESS_2 char *restrict a,=0A= > + SIGNEDNESS_2 char *restrict b,=0A= > + SIGNEDNESS_3 short *restrict c,=0A= > + SIGNEDNESS_3 short *restrict d,=0A= > + SIGNEDNESS_1 int *restrict e)=0A= > +{=0A= > + for (int i =3D 0; i < N; ++i)=0A= > + {=0A= > + short diff =3D a[i] - b[i];=0A= > + SIGNEDNESS_2 short abs =3D diff < 0 ? -diff : diff;=0A= > + res +=3D abs;=0A= > + res +=3D c[i] * d[i];=0A= > + res +=3D e[i];=0A= > + }=0A= > + return res;=0A= > +}=0A= > +=0A= > +#define BASE2 ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)=0A= > +#define BASE3 ((SIGNEDNESS_3 int) -1 < 0 ? -1236 : 373)=0A= > +#define OFFSET 20=0A= > +=0A= > +int=0A= > +main (void)=0A= > +{=0A= > + check_vect ();=0A= > +=0A= > + SIGNEDNESS_2 char a[N], b[N];=0A= > + SIGNEDNESS_3 short c[N], d[N];=0A= > + SIGNEDNESS_1 int e[N];=0A= > + int expected =3D 0x12345;=0A= > + for (int i =3D 0; i < N; ++i)=0A= > + {=0A= > + a[i] =3D BASE2 + i * 5;=0A= > + b[i] =3D BASE2 - i * 4;=0A= > + c[i] =3D BASE3 + i * 2;=0A= > + d[i] =3D BASE3 + OFFSET + i * 3;=0A= > + e[i] =3D i;=0A= > + asm volatile ("" ::: "memory");=0A= > + short diff =3D a[i] - b[i];=0A= > + SIGNEDNESS_2 short abs =3D diff < 0 ? 
-diff : diff;=0A= > + expected +=3D abs;=0A= > + expected +=3D c[i] * d[i];=0A= > + expected +=3D e[i];=0A= > + }=0A= > + if (f (0x12345, a, b, c, d, e) !=3D expected)=0A= > + __builtin_abort ();=0A= > +}=0A= > +=0A= > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D SAD_EXPR= " "vect" { target vect_udot_qi } } } */=0A= > +/* { dg-final { scan-tree-dump "vectorizing statement: \\S+ =3D DOT_PROD= _EXPR" "vect" { target vect_sdot_hi } } } */=0A= > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c b/gcc= /testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c=0A= > new file mode 100644=0A= > index 00000000000..c2831fbcc8e=0A= > --- /dev/null=0A= > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c=0A= > @@ -0,0 +1,95 @@=0A= > +/* Disabling epilogues until we find a better way to deal with scans. *= /=0A= > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A= > +/* { dg-require-effective-target vect_int } */=0A= > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aa= rch64*-*-* || arm*-*-* } } } */=0A= > +/* { dg-add-options arm_v8_2a_dotprod_neon } */=0A= > +=0A= > +#include "tree-vect.h"=0A= > +=0A= > +#ifndef SIGNEDNESS_1=0A= > +#define SIGNEDNESS_1 signed=0A= > +#define SIGNEDNESS_2 signed=0A= > +#endif=0A= > +=0A= > +SIGNEDNESS_1 int __attribute__ ((noipa))=0A= > +f (SIGNEDNESS_1 int res,=0A= > + SIGNEDNESS_2 char *a,=0A= > + SIGNEDNESS_2 char *b,=0A= > + int step, int n)=0A= > +{=0A= > + for (int i =3D 0; i < n; i++)=0A= > + {=0A= > + res +=3D a[0] * b[0];=0A= > + res +=3D a[1] * b[1];=0A= > + res +=3D a[2] * b[2];=0A= > + res +=3D a[3] * b[3];=0A= > + res +=3D a[4] * b[4];=0A= > + res +=3D a[5] * b[5];=0A= > + res +=3D a[6] * b[6];=0A= > + res +=3D a[7] * b[7];=0A= > + res +=3D a[8] * b[8];=0A= > + res +=3D a[9] * b[9];=0A= > + res +=3D a[10] * b[10];=0A= > + res +=3D a[11] * b[11];=0A= > + res +=3D a[12] * b[12];=0A= > + res +=3D a[13] * b[13];=0A= > + res +=3D a[14] * b[14];=0A= > 
+ res +=3D a[15] * b[15];=0A= > +=0A= > + a +=3D step;=0A= > + b +=3D step;=0A= > + }=0A= > +=0A= > + return res;=0A= > +}=0A= > +=0A= > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)=0A= > +#define OFFSET 20=0A= > +=0A= > +int=0A= > +main (void)=0A= > +{=0A= > + check_vect ();=0A= > +=0A= > + SIGNEDNESS_2 char a[100], b[100];=0A= > + int expected =3D 0x12345;=0A= > + int step =3D 16;=0A= > + int n =3D 2;=0A= > + int t =3D 0;=0A= > +=0A= > + for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i)=0A= > + {=0A= > + a[i] =3D BASE + i * 5;=0A= > + b[i] =3D BASE + OFFSET + i * 4;=0A= > + asm volatile ("" ::: "memory");=0A= > + }=0A= > +=0A= > + for (int i =3D 0; i < n; i++)=0A= > + {=0A= > + asm volatile ("" ::: "memory");=0A= > + expected +=3D a[t + 0] * b[t + 0];=0A= > + expected +=3D a[t + 1] * b[t + 1];=0A= > + expected +=3D a[t + 2] * b[t + 2];=0A= > + expected +=3D a[t + 3] * b[t + 3];=0A= > + expected +=3D a[t + 4] * b[t + 4];=0A= > + expected +=3D a[t + 5] * b[t + 5];=0A= > + expected +=3D a[t + 6] * b[t + 6];=0A= > + expected +=3D a[t + 7] * b[t + 7];=0A= > + expected +=3D a[t + 8] * b[t + 8];=0A= > + expected +=3D a[t + 9] * b[t + 9];=0A= > + expected +=3D a[t + 10] * b[t + 10];=0A= > + expected +=3D a[t + 11] * b[t + 11];=0A= > + expected +=3D a[t + 12] * b[t + 12];=0A= > + expected +=3D a[t + 13] * b[t + 13];=0A= > + expected +=3D a[t + 14] * b[t + 14];=0A= > + expected +=3D a[t + 15] * b[t + 15];=0A= > + t +=3D step;=0A= > + }=0A= > +=0A= > + if (f (0x12345, a, b, step, n) !=3D expected)=0A= > + __builtin_abort ();=0A= > +}=0A= > +=0A= > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "= vect" } } */=0A= > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } = */=0A= > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D DO= T_PROD_EXPR" 16 "vect" } } */=0A= > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c b/gcc= /testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c=0A= > 
new file mode 100644=0A= > index 00000000000..4114264a364=0A= > --- /dev/null=0A= > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c=0A= > @@ -0,0 +1,67 @@=0A= > +/* Disabling epilogues until we find a better way to deal with scans. *= /=0A= > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A= > +/* { dg-require-effective-target vect_int } */=0A= > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aa= rch64*-*-* || arm*-*-* } } } */=0A= > +/* { dg-add-options arm_v8_2a_dotprod_neon } */=0A= > +=0A= > +#include "tree-vect.h"=0A= > +=0A= > +#ifndef SIGNEDNESS_1=0A= > +#define SIGNEDNESS_1 signed=0A= > +#define SIGNEDNESS_2 signed=0A= > +#endif=0A= > +=0A= > +SIGNEDNESS_1 int __attribute__ ((noipa))=0A= > +f (SIGNEDNESS_1 int res,=0A= > + SIGNEDNESS_2 char *a,=0A= > + SIGNEDNESS_2 char *b,=0A= > + int n)=0A= > +{=0A= > + for (int i =3D 0; i < n; i++)=0A= > + {=0A= > + res +=3D a[5 * i + 0] * b[5 * i + 0];=0A= > + res +=3D a[5 * i + 1] * b[5 * i + 1];=0A= > + res +=3D a[5 * i + 2] * b[5 * i + 2];=0A= > + res +=3D a[5 * i + 3] * b[5 * i + 3];=0A= > + res +=3D a[5 * i + 4] * b[5 * i + 4];=0A= > + }=0A= > +=0A= > + return res;=0A= > +}=0A= > +=0A= > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? 
-126 : 4)=0A= > +#define OFFSET 20=0A= > +=0A= > +int=0A= > +main (void)=0A= > +{=0A= > + check_vect ();=0A= > +=0A= > + SIGNEDNESS_2 char a[100], b[100];=0A= > + int expected =3D 0x12345;=0A= > + int n =3D 18;=0A= > +=0A= > + for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i)=0A= > + {=0A= > + a[i] =3D BASE + i * 5;=0A= > + b[i] =3D BASE + OFFSET + i * 4;=0A= > + asm volatile ("" ::: "memory");=0A= > + }=0A= > +=0A= > + for (int i =3D 0; i < n; i++)=0A= > + {=0A= > + asm volatile ("" ::: "memory");=0A= > + expected +=3D a[5 * i + 0] * b[5 * i + 0];=0A= > + expected +=3D a[5 * i + 1] * b[5 * i + 1];=0A= > + expected +=3D a[5 * i + 2] * b[5 * i + 2];=0A= > + expected +=3D a[5 * i + 3] * b[5 * i + 3];=0A= > + expected +=3D a[5 * i + 4] * b[5 * i + 4];=0A= > + }=0A= > +=0A= > + if (f (0x12345, a, b, n) !=3D expected)=0A= > + __builtin_abort ();=0A= > +}=0A= > +=0A= > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "= vect" } } */=0A= > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } = */=0A= > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D DO= T_PROD_EXPR" 5 "vect" } } */=0A= > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c b/gcc= /testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c=0A= > new file mode 100644=0A= > index 00000000000..2cdecc36d16=0A= > --- /dev/null=0A= > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c=0A= > @@ -0,0 +1,79 @@=0A= > +/* Disabling epilogues until we find a better way to deal with scans. 
*= /=0A= > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A= > +/* { dg-require-effective-target vect_int } */=0A= > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aa= rch64*-*-* || arm*-*-* } } } */=0A= > +/* { dg-add-options arm_v8_2a_dotprod_neon } */=0A= > +=0A= > +#include "tree-vect.h"=0A= > +=0A= > +#ifndef SIGNEDNESS_1=0A= > +#define SIGNEDNESS_1 signed=0A= > +#define SIGNEDNESS_2 signed=0A= > +#endif=0A= > +=0A= > +SIGNEDNESS_1 int __attribute__ ((noipa))=0A= > +f (SIGNEDNESS_1 int res,=0A= > + SIGNEDNESS_2 short *a,=0A= > + SIGNEDNESS_2 short *b,=0A= > + int step, int n)=0A= > +{=0A= > + for (int i =3D 0; i < n; i++)=0A= > + {=0A= > + res +=3D a[0] * b[0];=0A= > + res +=3D a[1] * b[1];=0A= > + res +=3D a[2] * b[2];=0A= > + res +=3D a[3] * b[3];=0A= > + res +=3D a[4] * b[4];=0A= > + res +=3D a[5] * b[5];=0A= > + res +=3D a[6] * b[6];=0A= > + res +=3D a[7] * b[7];=0A= > +=0A= > + a +=3D step;=0A= > + b +=3D step;=0A= > + }=0A= > +=0A= > + return res;=0A= > +}=0A= > +=0A= > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? 
-1026 : 373)=0A= > +#define OFFSET 20=0A= > +=0A= > +int=0A= > +main (void)=0A= > +{=0A= > + check_vect ();=0A= > +=0A= > + SIGNEDNESS_2 short a[100], b[100];=0A= > + int expected =3D 0x12345;=0A= > + int step =3D 8;=0A= > + int n =3D 2;=0A= > + int t =3D 0;=0A= > +=0A= > + for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i)=0A= > + {=0A= > + a[i] =3D BASE + i * 5;=0A= > + b[i] =3D BASE + OFFSET + i * 4;=0A= > + asm volatile ("" ::: "memory");=0A= > + }=0A= > +=0A= > + for (int i =3D 0; i < n; i++)=0A= > + {=0A= > + asm volatile ("" ::: "memory");=0A= > + expected +=3D a[t + 0] * b[t + 0];=0A= > + expected +=3D a[t + 1] * b[t + 1];=0A= > + expected +=3D a[t + 2] * b[t + 2];=0A= > + expected +=3D a[t + 3] * b[t + 3];=0A= > + expected +=3D a[t + 4] * b[t + 4];=0A= > + expected +=3D a[t + 5] * b[t + 5];=0A= > + expected +=3D a[t + 6] * b[t + 6];=0A= > + expected +=3D a[t + 7] * b[t + 7];=0A= > + t +=3D step;=0A= > + }=0A= > +=0A= > + if (f (0x12345, a, b, step, n) !=3D expected)=0A= > + __builtin_abort ();=0A= > +}=0A= > +=0A= > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "= vect" } } */=0A= > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } = */=0A= > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D DO= T_PROD_EXPR" 8 "vect" { target vect_sdot_hi } } } */=0A= > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c b/gcc= /testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c=0A= > new file mode 100644=0A= > index 00000000000..32c0f30c77b=0A= > --- /dev/null=0A= > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c=0A= > @@ -0,0 +1,63 @@=0A= > +/* Disabling epilogues until we find a better way to deal with scans. 
*= /=0A= > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A= > +/* { dg-require-effective-target vect_int } */=0A= > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aa= rch64*-*-* || arm*-*-* } } } */=0A= > +/* { dg-add-options arm_v8_2a_dotprod_neon } */=0A= > +=0A= > +#include "tree-vect.h"=0A= > +=0A= > +#ifndef SIGNEDNESS_1=0A= > +#define SIGNEDNESS_1 signed=0A= > +#define SIGNEDNESS_2 signed=0A= > +#endif=0A= > +=0A= > +SIGNEDNESS_1 int __attribute__ ((noipa))=0A= > +f (SIGNEDNESS_1 int res,=0A= > + SIGNEDNESS_2 short *a,=0A= > + SIGNEDNESS_2 short *b,=0A= > + int n)=0A= > +{=0A= > + for (int i =3D 0; i < n; i++)=0A= > + {=0A= > + res +=3D a[3 * i + 0] * b[3 * i + 0];=0A= > + res +=3D a[3 * i + 1] * b[3 * i + 1];=0A= > + res +=3D a[3 * i + 2] * b[3 * i + 2];=0A= > + }=0A= > +=0A= > + return res;=0A= > +}=0A= > +=0A= > +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -1026 : 373)=0A= > +#define OFFSET 20=0A= > +=0A= > +int=0A= > +main (void)=0A= > +{=0A= > + check_vect ();=0A= > +=0A= > + SIGNEDNESS_2 short a[100], b[100];=0A= > + int expected =3D 0x12345;=0A= > + int n =3D 18;=0A= > +=0A= > + for (int i =3D 0; i < sizeof (a) / sizeof (a[0]); ++i)=0A= > + {=0A= > + a[i] =3D BASE + i * 5;=0A= > + b[i] =3D BASE + OFFSET + i * 4;=0A= > + asm volatile ("" ::: "memory");=0A= > + }=0A= > +=0A= > + for (int i =3D 0; i < n; i++)=0A= > + {=0A= > + asm volatile ("" ::: "memory");=0A= > + expected +=3D a[3 * i + 0] * b[3 * i + 0];=0A= > + expected +=3D a[3 * i + 1] * b[3 * i + 1];=0A= > + expected +=3D a[3 * i + 2] * b[3 * i + 2];=0A= > + }=0A= > +=0A= > + if (f (0x12345, a, b, n) !=3D expected)=0A= > + __builtin_abort ();=0A= > +}=0A= > +=0A= > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "= vect" } } */=0A= > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } = */=0A= > +/* { dg-final { scan-tree-dump-times "vectorizing statement: \\S+ =3D DO= T_PROD_EXPR" 3 "vect" { target 
vect_sdot_hi } } } */=0A= > diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c b/gcc/tests= uite/gcc.dg/vect/vect-reduc-dot-slp-1.c=0A= > new file mode 100644=0A= > index 00000000000..e17d6291f75=0A= > --- /dev/null=0A= > +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c=0A= > @@ -0,0 +1,35 @@=0A= > +/* Disabling epilogues until we find a better way to deal with scans. *= /=0A= > +/* { dg-do compile } */=0A= > +/* { dg-additional-options "--param vect-epilogues-nomask=3D0" } */=0A= > +/* { dg-require-effective-target vect_int } */=0A= > +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aa= rch64*-*-* || arm*-*-* } } } */=0A= > +/* { dg-add-options arm_v8_2a_dotprod_neon } */=0A= > +=0A= > +#include "tree-vect.h"=0A= > +=0A= > +#ifndef SIGNEDNESS_1=0A= > +#define SIGNEDNESS_1 signed=0A= > +#define SIGNEDNESS_2 signed=0A= > +#endif=0A= > +=0A= > +SIGNEDNESS_1 int __attribute__ ((noipa))=0A= > +f (SIGNEDNESS_1 int res0,=0A= > + SIGNEDNESS_1 int res1,=0A= > + SIGNEDNESS_1 int res2,=0A= > + SIGNEDNESS_1 int res3,=0A= > + SIGNEDNESS_2 short *a,=0A= > + SIGNEDNESS_2 short *b)=0A= > +{=0A= > + for (int i =3D 0; i < 64; i +=3D 4)=0A= > + {=0A= > + res0 +=3D a[i + 0] * b[i + 0];=0A= > + res1 +=3D a[i + 1] * b[i + 1];=0A= > + res2 +=3D a[i + 2] * b[i + 2];=0A= > + res3 +=3D a[i + 3] * b[i + 3];=0A= > + }=0A= > +=0A= > + return res0 ^ res1 ^ res2 ^ res3;=0A= > +}=0A= > +=0A= > +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "= vect" } } */=0A= > +/* { dg-final { scan-tree-dump-not "vectorizing stmts using SLP" "vect" = } } */=0A= > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc=0A= > index e0561feddce..6d91665a341 100644=0A= > --- a/gcc/tree-vect-loop.cc=0A= > +++ b/gcc/tree-vect-loop.cc=0A= > @@ -5324,8 +5324,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo= ,=0A= > if (!gimple_extract_op (orig_stmt_info->stmt, &op))=0A= > gcc_unreachable ();=0A= >=0A= > - bool emulated_mixed_dot_prod =3D 
vect_is_emulated_mixed_dot_prod (stmt= _info);=0A= > -=0A= > if (reduction_type =3D=3D EXTRACT_LAST_REDUCTION)=0A= > /* No extra instructions are needed in the prologue. The loop body= =0A= > operations are costed in vectorizable_condition. */=0A= > @@ -5360,12 +5358,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinf= o,=0A= > initial result of the data reduction, initial value of the ind= ex=0A= > reduction. */=0A= > prologue_stmts =3D 4;=0A= > - else if (emulated_mixed_dot_prod)=0A= > - /* We need the initial reduction value and two invariants:=0A= > - one that contains the minimum signed value and one that=0A= > - contains half of its negative. */=0A= > - prologue_stmts =3D 3;=0A= > else=0A= > + /* We need the initial reduction value. */=0A= > prologue_stmts =3D 1;=0A= > prologue_cost +=3D record_stmt_cost (cost_vec, prologue_stmts,=0A= > scalar_to_vec, stmt_info, 0,=0A= > @@ -7466,7 +7460,7 @@ vect_reduction_update_partial_vector_usage (loop_ve= c_info loop_vinfo,=0A= > vec_loop_lens *lens =3D &LOOP_VINFO_LENS (loop_vinfo);=0A= > unsigned nvectors;=0A= >=0A= > - if (slp_node)=0A= > + if (slp_node && SLP_TREE_LANES (slp_node) > 1)=0A= =0A= Hmm, that looks wrong. 
It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off= =0A= instead, which is bad.=0A= =0A= > nvectors =3D SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);=0A= > else=0A= > nvectors =3D vect_get_num_copies (loop_vinfo, vectype_in);=0A= > @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_= vec_info loop_vinfo,=0A= > }=0A= > }=0A= >=0A= > +/* Check if STMT_INFO is a lane-reducing operation that can be vectorize= d in=0A= > + the context of LOOP_VINFO, and vector cost will be recorded in COST_V= EC.=0A= > + Now there are three such kinds of operations: dot-prod/widen-sum/sad= =0A= > + (sum-of-absolute-differences).=0A= > +=0A= > + For a lane-reducing operation, the loop reduction path that it lies i= n,=0A= > + may contain normal operation, or other lane-reducing operation of dif= ferent=0A= > + input type size, an example as:=0A= > +=0A= > + int sum =3D 0;=0A= > + for (i)=0A= > + {=0A= > + ...=0A= > + sum +=3D d0[i] * d1[i]; // dot-prod =0A= > + sum +=3D w[i]; // widen-sum =0A= > + sum +=3D abs(s0[i] - s1[i]); // sad =0A= > + sum +=3D n[i]; // normal =0A= > + ...=0A= > + }=0A= > +=0A= > + Vectorization factor is essentially determined by operation whose inp= ut=0A= > + vectype has the most lanes ("vector(16) char" in the example), while = we=0A= > + need to choose input vectype with the least lanes ("vector(4) int" in= the=0A= > + example) for the reduction PHI statement. */=0A= > +=0A= > +bool=0A= > +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt= _info,=0A= > + slp_tree slp_node, stmt_vector_for_cost *cost= _vec)=0A= > +{=0A= > + gimple *stmt =3D stmt_info->stmt;=0A= > +=0A= > + if (!lane_reducing_stmt_p (stmt))=0A= > + return false;=0A= > +=0A= > + tree type =3D TREE_TYPE (gimple_assign_lhs (stmt));=0A= > +=0A= > + if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))=0A= > + return false;=0A= > +=0A= > + /* Do not try to vectorize bit-precision reductions. 
*/=0A= > + if (!type_has_mode_precision_p (type))=0A= > + return false;=0A= > +=0A= > + if (!slp_node)=0A= > + return false;=0A= > +=0A= > + for (int i =3D 0; i < (int) gimple_num_ops (stmt) - 1; i++)=0A= > + {=0A= > + stmt_vec_info def_stmt_info;=0A= > + slp_tree slp_op;=0A= > + tree op;=0A= > + tree vectype;=0A= > + enum vect_def_type dt;=0A= > +=0A= > + if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,= =0A= > + &slp_op, &dt, &vectype, &def_stmt_info))= =0A= > + {=0A= > + if (dump_enabled_p ())=0A= > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,=0A= > + "use not simple.\n");=0A= > + return false;=0A= > + }=0A= > +=0A= > + if (!vectype)=0A= > + {=0A= > + vectype =3D get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE = (op),=0A= > + slp_op);=0A= > + if (!vectype)=0A= > + return false;=0A= > + }=0A= > +=0A= > + if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))=0A= > + {=0A= > + if (dump_enabled_p ())=0A= > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,=0A= > + "incompatible vector types for invariants\n"= );=0A= > + return false;=0A= > + }=0A= > +=0A= > + if (i =3D=3D STMT_VINFO_REDUC_IDX (stmt_info))=0A= > + continue;=0A= > +=0A= > + /* There should be at most one cycle def in the stmt. */=0A= > + if (VECTORIZABLE_CYCLE_DEF (dt))=0A= > + return false;=0A= > + }=0A= > +=0A= > + stmt_vec_info reduc_info =3D STMT_VINFO_REDUC_DEF (vect_orig_stmt (stm= t_info));=0A= > +=0A= > + /* TODO: Support lane-reducing operation that does not directly partic= ipate=0A= > + in loop reduction. */=0A= > + if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)=0A= > + return false;=0A= > +=0A= > + /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not=0A= > + recoginized. 
*/=0A= > + gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) =3D=3D vect_reduction_def= );=0A= > + gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) =3D=3D TREE_CODE_REDUCT= ION);=0A= > +=0A= > + tree vectype_in =3D STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);=0A= > + int ncopies_for_cost;=0A= > +=0A= > + if (SLP_TREE_LANES (slp_node) > 1)=0A= > + {=0A= > + /* Now lane-reducing operations in a non-single-lane slp node shou= ld only=0A= > + come from the same loop reduction path. */=0A= > + gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));=0A= > + ncopies_for_cost =3D 1;=0A= > + }=0A= > + else=0A= > + {=0A= > + ncopies_for_cost =3D vect_get_num_copies (loop_vinfo, vectype_in);= =0A= =0A= OK, so the fact that the ops are lane-reducing means they effectively=0A= change the VF for the result. That's only possible as we tightly control= =0A= code generation and "adjust" to the expected VF (by inserting the copies=0A= you mentioned above), but only up to the highest number of outputs=0A= created in the reduction chain. In that sense instead of talking and recor= ding=0A= "input vector types" wouldn't it make more sense to record the effective=0A= vectorization factor for the reduction instance? That VF would be at most= =0A= the loops VF but could be as low as 1. Once we have a non-lane-reducing=0A= operation in the reduction chain it would be always equal to the loops VF.= =0A= =0A= ncopies would then be always determined by that reduction instance VF and= =0A= the accumulator vector type (STMT_VINFO_VECTYPE). This reduction=0A= instance VF would also trivially indicate the force-single-def-use-cycle=0A= case, possibly simplifying code?=0A= =0A= > + gcc_assert (ncopies_for_cost >=3D 1);=0A= > + }=0A= > +=0A= > + if (vect_is_emulated_mixed_dot_prod (stmt_info))=0A= > + {=0A= > + /* We need extra two invariants: one that contains the minimum sig= ned=0A= > + value and one that contains half of its negative. 
*/
> +      int prologue_stmts = 2;
> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> +					scalar_to_vec, stmt_info, 0,
> +					vect_prologue);
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> +		     "extra prologue_cost = %d .\n", cost);
> +
> +      /* Three dot-products and a subtraction.  */
> +      ncopies_for_cost *= 4;
> +    }
> +
> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> +		    vect_body);
> +
> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +    {
> +      enum tree_code code = gimple_assign_rhs_code (stmt);
> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> +						  slp_node, code, type,
> +						  vectype_in);
> +    }
> +

Add a comment:

  /* Transform via vect_transform_reduction.  */

> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> +  return true;
> +}
> +
>  /* Function vectorizable_reduction.
>
>     Check if STMT_INFO performs a reduction operation that can be vectorized.
> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    if (!type_has_mode_precision_p (op.type))
>      return false;
>
> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
> -     which means the only use of that may be in the lane-reducing operation.  */
> -  if (lane_reducing
> -      && reduc_chain_length != 1
> -      && !only_slp_reduc_chain)
> -    {
> -      if (dump_enabled_p ())
> -	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -			 "lane-reducing reduction with extra stmts.\n");
> -      return false;
> -    }
> -
>    /* Lane-reducing ops also never can be used in a SLP reduction group
>       since we'll mix lanes belonging to different reductions.  But it's
>       OK to use them in a reduction chain or when the reduction group
> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>        && loop_vinfo->suggested_unroll_factor == 1)
>      single_defuse_cycle = true;
>
> -  if (single_defuse_cycle || lane_reducing)
> +  if (single_defuse_cycle && !lane_reducing)

If there's also a non-lane-reducing plus in the chain don't we have to
check for that reduction op?  So shouldn't it be
single_defuse_cycle && ... fact that we don't record
(non-lane-reducing op there) ...

>      {
>        gcc_assert (op.code != COND_EXPR);
>
> -      /* 4. Supportable by target?  */
> -      bool ok = true;
> -
> -      /* 4.1. check support for the operation in the loop
> +      /* 4. check support for the operation in the loop
>
>  	 This isn't necessary for the lane reduction codes, since they
>  	 can only be produced by pattern matching, and it's up to the
> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>  	 mixed-sign dot-products can be implemented using signed
>  	 dot-products.  */
>        machine_mode vec_mode = TYPE_MODE (vectype_in);
> -      if (!lane_reducing
> -	  && !directly_supported_p (op.code, vectype_in, optab_vector))
> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
>  	{
>  	  if (dump_enabled_p ())
>  	    dump_printf (MSG_NOTE, "op not supported by target.\n");
>  	  if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
>  	      || !vect_can_vectorize_without_simd_p (op.code))
> -	    ok = false;
> +	    single_defuse_cycle = false;
>  	  else
>  	    if (dump_enabled_p ())
>  	      dump_printf (MSG_NOTE, "proceeding using word mode.\n");
> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>  	  dump_printf (MSG_NOTE, "using word mode not possible.\n");
>  	  return false;
>  	}
> -
> -      /* lane-reducing operations have to go through vect_transform_reduction.
> -	 For the other cases try without the single cycle optimization.  */
> -      if (!ok)
> -	{
> -	  if (lane_reducing)
> -	    return false;
> -	  else
> -	    single_defuse_cycle = false;
> -	}
>      }
>    if (dump_enabled_p () && single_defuse_cycle)
>      dump_printf_loc (MSG_NOTE, vect_location,
> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>  		     "multiple vectors to one in the loop body\n");
>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
>
> -  /* If the reduction stmt is one of the patterns that have lane
> -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
> -  if ((ncopies > 1 && ! single_defuse_cycle)
> -      && lane_reducing)
> -    {
> -      if (dump_enabled_p ())
> -	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -			 "multi def-use cycle not possible for lane-reducing "
> -			 "reduction operation\n");
> -      return false;
> -    }
> +  /* For lane-reducing operation, the below processing related to single
> +     defuse-cycle will be done in its own vectorizable function.  One more
> +     thing to note is that the operation must not be involved in fold-left
> +     reduction.  */
> +  single_defuse_cycle &= !lane_reducing;
>
>    if (slp_node
> -      && !(!single_defuse_cycle
> -	   && !lane_reducing
> -	   && reduction_type != FOLD_LEFT_REDUCTION))
> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
>      for (i = 0; i < (int) op.num_ops; i++)
>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
>  	{
> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
>  			     reduction_type, ncopies, cost_vec);
>    /* Cost the reduction op inside the loop if transformed via
> -     vect_transform_reduction.  Otherwise this is costed by the
> -     separate vectorizable_* routines.  */
> -  if (single_defuse_cycle || lane_reducing)
> -    {
> -      int factor = 1;
> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
> -	/* Three dot-products and a subtraction.  */
> -	factor = 4;
> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> -			stmt_info, 0, vect_body);
> -    }
> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
> +     this is costed by the separate vectorizable_* routines.  */
> +  if (single_defuse_cycle)
> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
>
>    if (dump_enabled_p ()
>        && reduction_type == FOLD_LEFT_REDUCTION)
>      dump_printf_loc (MSG_NOTE, vect_location,
>  		     "using an in-order (fold-left) reduction.\n");
>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
> -     reductions go through their own vectorizable_* routines.  */
> -  if (!single_defuse_cycle
> -      && !lane_reducing
> -      && reduction_type != FOLD_LEFT_REDUCTION)
> +
> +  /* All but single defuse-cycle optimized and fold-left reductions go
> +     through their own vectorizable_* routines.  */
> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
>      {
>        stmt_vec_info tem
>  	= vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>    bool lane_reducing = lane_reducing_op_p (code);
>    gcc_assert (single_defuse_cycle || lane_reducing);
>
> +  if (lane_reducing)
> +    {
> +      /* The last operand of lane-reducing op is for reduction.  */
> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> +
> +      /* Now all lane-reducing ops are covered by some slp node.  */
> +      gcc_assert (slp_node);
> +    }
> +
>    /* Create the destination vector  */
>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>  			     reduc_index == 2 ? op.ops[2] : NULL_TREE,
>  			     &vec_oprnds[2]);
>      }
> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
> +	   && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
> +    {
> +      /* For lane-reducing op covered by single-lane slp node, the input
> +	 vectype of the reduction PHI determines copies of vectorized def-use
> +	 cycles, which might be more than effective copies of vectorized lane-
> +	 reducing reduction statements.  This could be complemented by
> +	 generating extra trivial pass-through copies.  For example:
> +
> +	   int sum = 0;
> +	   for (i)
> +	     {
> +	       sum += d0[i] * d1[i];      // dot-prod
> +	       sum += abs(s0[i] - s1[i]); // sad
> +	       sum += n[i];               // normal
> +	     }
> +
> +	 The vector size is 128-bit, vectorization factor is 16.  Reduction
> +	 statements would be transformed as:
> +
> +	   vector<4> int sum_v0 = { 0, 0, 0, 0 };
> +	   vector<4> int sum_v1 = { 0, 0, 0, 0 };
> +	   vector<4> int sum_v2 = { 0, 0, 0, 0 };
> +	   vector<4> int sum_v3 = { 0, 0, 0, 0 };
> +
> +	   for (i / 16)
> +	     {
> +	       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> +	       sum_v1 = sum_v1;  // copy
> +	       sum_v2 = sum_v2;  // copy
> +	       sum_v3 = sum_v3;  // copy
> +
> +	       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> +	       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> +	       sum_v2 = sum_v2;  // copy
> +	       sum_v3 = sum_v3;  // copy
> +
> +	       sum_v0 += n_v0[i: 0 ~ 3 ];
> +	       sum_v1 += n_v1[i: 4 ~ 7 ];
> +	       sum_v2 += n_v2[i: 8 ~ 11];
> +	       sum_v3 += n_v3[i: 12 ~ 15];
> +	     }
> +       */
> +      unsigned using_ncopies = vec_oprnds[0].length ();
> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
> +

assert reduc_ncopies >= using_ncopies?  Maybe assert
reduc_index == op.num_ops - 1 given you use one above
and the other below?  Or simply iterate till op.num_ops
and skip i == reduc_index.

> +      for (unsigned i = 0; i < op.num_ops - 1; i++)
> +	{
> +	  gcc_assert (vec_oprnds[i].length () == using_ncopies);
> +	  vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> +	}
> +    }
>
>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>      {
>        gimple *new_stmt;
>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> +
> +      if (!vop[0] || !vop[1])
> +	{
> +	  tree reduc_vop = vec_oprnds[reduc_index][i];
> +
> +	  /* Insert trivial copy if no need to generate vectorized
> +	     statement.  */
> +	  gcc_assert (reduc_vop);
> +
> +	  new_stmt = gimple_build_assign (vec_dest, reduc_vop);
> +	  new_temp = make_ssa_name (vec_dest, new_stmt);
> +	  gimple_set_lhs (new_stmt, new_temp);
> +	  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);

I think you could simply do

  slp_node->push_vec_def (reduc_vop);
  continue;

without any code generation.

> +	}
> +      else if (masked_loop_p && !mask_by_cond_expr)
>  	{
>  	  /* No conditional ifns have been defined for lane-reducing op
>  	     yet.  */
> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>
>        if (masked_loop_p && mask_by_cond_expr)
>  	{
> +	  tree stmt_vectype_in = vectype_in;
> +	  unsigned nvectors = vec_num * ncopies;
> +
> +	  if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> +	    {
> +	      /* Input vectype of the reduction PHI may be defferent from

different

> +		 that of lane-reducing operation.  */
> +	      stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> +	      nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);

I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.

Otherwise the patch looks good to me.

Richard.

> +	    }
> +
>  	  tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
> -					  vec_num * ncopies, vectype_in, i);
> +					  nvectors, stmt_vectype_in, i);
>  	  build_vect_cond_expr (code, vop, mask, gsi);
>  	}
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index ca6052662a3..1b73ef01ade 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
>  				      NULL, NULL, node, cost_vec)
>  	  || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
>  	  || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
> +	  || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
> +					 stmt_info, node, cost_vec)
>  	  || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
>  				     node, node_instance, cost_vec)
>  	  || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 60224f4e284..94736736dcc 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
>  extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
>  					 slp_tree, slp_instance, int,
>  					 bool, stmt_vector_for_cost *);
> +extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
> +					slp_tree, stmt_vector_for_cost *);
>  extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
>  				    slp_tree, slp_instance,
>  				    stmt_vector_for_cost *);
> --
> 2.17.1
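
For readers following the thread: the scalar shape the patch comment above illustrates (a dot-product, a SAD, and a plain add all accumulating into one scalar sum) can be sketched as a standalone C function. Array and function names here are illustrative only, not taken from the patch or testsuite:

```c
#include <stdlib.h>

/* One reduction fed by three differently-shaped statements: the first is
   a dot-product candidate (DOT_PROD_EXPR), the second a sum-of-absolute-
   differences candidate (SAD_EXPR), the third an ordinary add.  With the
   patch, all three can be vectorized into the same loop reduction even
   though each wants a different input vectype.  */
static int
mixed_reduction (const signed char *d0, const signed char *d1,
		 const unsigned char *s0, const unsigned char *s1,
		 const int *n, int len)
{
  int sum = 0;
  for (int i = 0; i < len; i++)
    {
      sum += d0[i] * d1[i];		/* dot-prod */
      sum += abs (s0[i] - s1[i]);	/* sad */
      sum += n[i];			/* normal */
    }
  return sum;
}
```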