From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 19 Jun 2023 15:23:55 +0800
Subject: Re: [PATCH 8/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE
To: Hongtao Liu
Cc: gcc-patches@gcc.gnu.org, richard.guenther@gmail.com, richard.sandiford@arm.com, segher@kernel.crashing.org, bergner@linux.ibm.com, ubizjak@gmail.com, hongtao.liu@intel.com
References: <216bf6e61d4fe2caa6b87ae1e5c8e15b6d31c409.1686573640.git.linkw@linux.ibm.com>
From: "Kewen.Lin"

Hi Hongtao,

on 2023/6/14 16:17, Hongtao Liu wrote:
> On Tue, Jun 13, 2023 at 10:07 AM Kewen Lin via Gcc-patches
> wrote:
>>
>> This patch adjusts the cost handling on
>> VMAT_CONTIGUOUS_PERMUTE in function vectorizable_load. We
>> don't call function vect_model_load_cost for it any more.
>>
>> As the affected test case gcc.target/i386/pr70021.c shows,
>> the previous costing can under-cost the total generated
>> vector loads as for VMAT_CONTIGUOUS_PERMUTE function
>> vect_model_load_cost doesn't consider the group size which
>> is considered as vec_num during the transformation.
> The original PR is for the correctness issue, and I'm not sure how
> much of a performance impact the patch would be, but the change looks
> reasonable, so the test change looks ok to me.
> I'll track performance impact on SPEC2017 to see if there's any
> regression caused by the patch(Guess probably not).

Thanks for the feedback and further tracking!  Hope this (and this whole
series) doesn't impact SPEC2017 performance on x86. :)

BR,
Kewen

>>
>> This patch makes the count of vector load in costing become
>> consistent with what we generates during the transformation.
>> To be more specific, for the given test case, for memory
>> access b[i_20], it costed for 2 vector loads before,
>> with this patch it costs 8 instead, it matches the final
>> count of generated vector loads basing from b. This costing
>> change makes cost model analysis feel it's not profitable
>> to vectorize the first loop, so this patch adjusts the test
>> case without vect cost model any more.
>>
>> But note that this test case also exposes something we can
>> improve further is that although the number of vector
>> permutation what we costed and generated are consistent,
>> but DCE can further optimize some unused permutation out,
>> it would be good if we can predict that and generate only
>> those necessary permutations.
>>
>> gcc/ChangeLog:
>>
>>         * tree-vect-stmts.cc (vect_model_load_cost): Assert this function only
>>         handle memory_access_type VMAT_CONTIGUOUS, remove some
>>         VMAT_CONTIGUOUS_PERMUTE related handlings.
>>         (vectorizable_load): Adjust the cost handling on VMAT_CONTIGUOUS_PERMUTE
>>         without calling vect_model_load_cost.
>>
>> gcc/testsuite/ChangeLog:
>>
>>         * gcc.target/i386/pr70021.c: Adjust with -fno-vect-cost-model.
>> ---
>>  gcc/testsuite/gcc.target/i386/pr70021.c |  2 +-
>>  gcc/tree-vect-stmts.cc                  | 88 ++++++++++++++-----------
>>  2 files changed, 51 insertions(+), 39 deletions(-)
>>
>> diff --git a/gcc/testsuite/gcc.target/i386/pr70021.c b/gcc/testsuite/gcc.target/i386/pr70021.c
>> index 6562c0f2bd0..d509583601e 100644
>> --- a/gcc/testsuite/gcc.target/i386/pr70021.c
>> +++ b/gcc/testsuite/gcc.target/i386/pr70021.c
>> @@ -1,7 +1,7 @@
>>  /* PR target/70021 */
>>  /* { dg-do run } */
>>  /* { dg-require-effective-target avx2 } */
>> -/* { dg-options "-O2 -ftree-vectorize -mavx2 -fdump-tree-vect-details -mtune=skylake" } */
>> +/* { dg-options "-O2 -ftree-vectorize -mavx2 -fdump-tree-vect-details -mtune=skylake -fno-vect-cost-model" } */
>>
>>  #include "avx2-check.h"
>>
>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>> index 7f8d9db5363..e7a97dbe05d 100644
>> --- a/gcc/tree-vect-stmts.cc
>> +++ b/gcc/tree-vect-stmts.cc
>> @@ -1134,8 +1134,7 @@ vect_model_load_cost (vec_info *vinfo,
>>                        slp_tree slp_node,
>>                        stmt_vector_for_cost *cost_vec)
>>  {
>> -  gcc_assert (memory_access_type == VMAT_CONTIGUOUS
>> -              || memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
>> +  gcc_assert (memory_access_type == VMAT_CONTIGUOUS);
>>
>>    unsigned int inside_cost = 0, prologue_cost = 0;
>>    bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
>> @@ -1174,26 +1173,6 @@ vect_model_load_cost (vec_info *vinfo,
>>       once per group anyhow.  */
>>    bool first_stmt_p = (first_stmt_info == stmt_info);
>>
>> -  /* We assume that the cost of a single load-lanes instruction is
>> -     equivalent to the cost of DR_GROUP_SIZE separate loads.  If a grouped
>> -     access is instead being provided by a load-and-permute operation,
>> -     include the cost of the permutes.  */
>> -  if (first_stmt_p
>> -      && memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
>> -    {
>> -      /* Uses an even and odd extract operations or shuffle operations
>> -         for each needed permute.  */
>> -      int group_size = DR_GROUP_SIZE (first_stmt_info);
>> -      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
>> -      inside_cost += record_stmt_cost (cost_vec, nstmts, vec_perm,
>> -                                       stmt_info, 0, vect_body);
>> -
>> -      if (dump_enabled_p ())
>> -        dump_printf_loc (MSG_NOTE, vect_location,
>> -                         "vect_model_load_cost: strided group_size = %d .\n",
>> -                         group_size);
>> -    }
>> -
>>    vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
>>                        misalignment, first_stmt_p, &inside_cost, &prologue_cost,
>>                        cost_vec, cost_vec, true);
>> @@ -10652,11 +10631,22 @@ vectorizable_load (vec_info *vinfo,
>>                   alignment support schemes.  */
>>                if (costing_p)
>>                  {
>> -                  if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
>> +                  /* For VMAT_CONTIGUOUS_PERMUTE if it's grouped load, we
>> +                     only need to take care of the first stmt, whose
>> +                     stmt_info is first_stmt_info, vec_num iterating on it
>> +                     will cover the cost for the remaining, it's consistent
>> +                     with transforming.  For the prologue cost for realign,
>> +                     we only need to count it once for the whole group.  */
>> +                  bool first_stmt_info_p = first_stmt_info == stmt_info;
>> +                  bool add_realign_cost = first_stmt_info_p && i == 0;
>> +                  if (memory_access_type == VMAT_CONTIGUOUS_REVERSE
>> +                      || (memory_access_type == VMAT_CONTIGUOUS_PERMUTE
>> +                          && (!grouped_load || first_stmt_info_p)))
>>                      vect_get_load_cost (vinfo, stmt_info, 1,
>>                                          alignment_support_scheme, misalignment,
>> -                                        false, &inside_cost, &prologue_cost,
>> -                                        cost_vec, cost_vec, true);
>> +                                        add_realign_cost, &inside_cost,
>> +                                        &prologue_cost, cost_vec, cost_vec,
>> +                                        true);
>>                  }
>>                else
>>                  {
>> @@ -10774,8 +10764,7 @@ vectorizable_load (vec_info *vinfo,
>>               ??? This is a hack to prevent compile-time issues as seen
>>               in PR101120 and friends.  */
>>            if (costing_p
>> -              && memory_access_type != VMAT_CONTIGUOUS
>> -              && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
>> +              && memory_access_type != VMAT_CONTIGUOUS)
>>              {
>>                vect_transform_slp_perm_load (vinfo, slp_node, vNULL, nullptr, vf,
>>                                              true, &n_perms, nullptr);
>> @@ -10790,20 +10779,44 @@ vectorizable_load (vec_info *vinfo,
>>                gcc_assert (ok);
>>              }
>>          }
>> -      else if (!costing_p)
>> +      else
>>          {
>>            if (grouped_load)
>>              {
>>                if (memory_access_type != VMAT_LOAD_STORE_LANES)
>> -                vect_transform_grouped_load (vinfo, stmt_info, dr_chain,
>> -                                             group_size, gsi);
>> -              *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
>> -            }
>> -          else
>> -            {
>> -              STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>> +                {
>> +                  gcc_assert (memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
>> +                  /* We assume that the cost of a single load-lanes instruction
>> +                     is equivalent to the cost of DR_GROUP_SIZE separate loads.
>> +                     If a grouped access is instead being provided by a
>> +                     load-and-permute operation, include the cost of the
>> +                     permutes.  */
>> +                  if (costing_p && first_stmt_info == stmt_info)
>> +                    {
>> +                      /* Uses an even and odd extract operations or shuffle
>> +                         operations for each needed permute.  */
>> +                      int group_size = DR_GROUP_SIZE (first_stmt_info);
>> +                      int nstmts = ceil_log2 (group_size) * group_size;
>> +                      inside_cost
>> +                        += record_stmt_cost (cost_vec, nstmts, vec_perm,
>> +                                             stmt_info, 0, vect_body);
>> +
>> +                      if (dump_enabled_p ())
>> +                        dump_printf_loc (
>> +                          MSG_NOTE, vect_location,
>> +                          "vect_model_load_cost: strided group_size = %d .\n",
>> +                          group_size);
>> +                    }
>> +                  else if (!costing_p)
>> +                    vect_transform_grouped_load (vinfo, stmt_info, dr_chain,
>> +                                                 group_size, gsi);
>> +                }
>> +              if (!costing_p)
>> +                *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
>>              }
>> -        }
>> +          else if (!costing_p)
>> +            STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>> +        }
>>        dr_chain.release ();
>>      }
>>    if (!slp && !costing_p)
>> @@ -10814,8 +10827,7 @@ vectorizable_load (vec_info *vinfo,
>>    gcc_assert (memory_access_type != VMAT_INVARIANT
>>                && memory_access_type != VMAT_ELEMENTWISE
>>                && memory_access_type != VMAT_STRIDED_SLP);
>> -  if (memory_access_type != VMAT_CONTIGUOUS
>> -      && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
>> +  if (memory_access_type != VMAT_CONTIGUOUS)
>>      {
>>        if (dump_enabled_p ())
>>          dump_printf_loc (MSG_NOTE, vect_location,
>> --
>> 2.31.1
>>
>
>
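
For readers following the counting argument in the quoted patch, here is a rough standalone sketch of the arithmetic, not GCC code. It assumes the old model charged ncopies vector loads for the group's first statement, while the new costing charges one load per (copy, vec_num) iteration, which is how the commit message describes the change. The names ceil_log2_sketch, loads_old, loads_new, perms_old and perms_new are hypothetical helpers introduced only for this illustration. The permute formula ceil_log2 (group_size) * group_size is taken from the diff. That the new site is reached once per copy is an inference from the removed ncopies factor.

```cpp
#include <cassert>

// Hypothetical stand-in for GCC's ceil_log2: smallest k with 2^k >= x.
int ceil_log2_sketch (unsigned x)
{
  assert (x > 0);
  int k = 0;
  while ((1u << k) < x)
    ++k;
  return k;
}

// Assumed old load count: ncopies loads, group size not considered.
int loads_old (unsigned ncopies)
{
  return (int) ncopies;
}

// Assumed new load count: one load per copy and per vec_num iteration,
// counted only on the first statement of the group.
int loads_new (unsigned ncopies, unsigned vec_num)
{
  int total = 0;
  for (unsigned j = 0; j < ncopies; ++j)
    for (unsigned i = 0; i < vec_num; ++i)
      ++total;
  return total;
}

// Permutes per vector copy, mirroring the formula in the diff.
int perms_per_copy (unsigned group_size)
{
  return ceil_log2_sketch (group_size) * (int) group_size;
}

// Old shape: a single record scaled by ncopies.
int perms_old (unsigned ncopies, unsigned group_size)
{
  return (int) ncopies * perms_per_copy (group_size);
}

// New shape (assumed): the per-copy amount is recorded once per copy.
int perms_new (unsigned ncopies, unsigned group_size)
{
  int total = 0;
  for (unsigned j = 0; j < ncopies; ++j)
    total += perms_per_copy (group_size);
  return total;
}
```

Under these assumptions the permute total is unchanged by the patch and only the load count grows with the group size. The "2 before, 8 after" figures in the commit message would be consistent with, for instance, ncopies = 2 and vec_num = 4, but the thread does not state those values.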