From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <dbd0bff8-38b3-ee26-dbbb-02ca3e03f31b@linux.ibm.com>
Date: Mon, 19 Jun 2023 15:23:55 +0800
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
 Gecko/20100101 Thunderbird/91.6.1
Subject: Re: [PATCH 8/9] vect: Adjust vectorizable_load costing on
 VMAT_CONTIGUOUS_PERMUTE
Content-Language: en-US
To: Hongtao Liu <crazylht@gmail.com>
Cc: gcc-patches@gcc.gnu.org, richard.guenther@gmail.com,
        richard.sandiford@arm.com, segher@kernel.crashing.org,
        bergner@linux.ibm.com, ubizjak@gmail.com, hongtao.liu@intel.com
References: <cover.1686573640.git.linkw@linux.ibm.com>
 <216bf6e61d4fe2caa6b87ae1e5c8e15b6d31c409.1686573640.git.linkw@linux.ibm.com>
 <CAMZc-bzyDicGYWfEL0c-sLuWPdaXOURpywL_aAPZGxhCwNaTaA@mail.gmail.com>
From: "Kewen.Lin" <linkw@linux.ibm.com>
In-Reply-To: <CAMZc-bzyDicGYWfEL0c-sLuWPdaXOURpywL_aAPZGxhCwNaTaA@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
List-Id: <gcc-patches.gcc.gnu.org>

Hi Hongtao,

on 2023/6/14 16:17, Hongtao Liu wrote:
> On Tue, Jun 13, 2023 at 10:07 AM Kewen Lin via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> This patch adjusts the cost handling on
>> VMAT_CONTIGUOUS_PERMUTE in function vectorizable_load.  We
>> don't call function vect_model_load_cost for it any more.
>>
>> As the affected test case gcc.target/i386/pr70021.c shows,
>> the previous costing can under-cost the generated vector
>> loads: for VMAT_CONTIGUOUS_PERMUTE, function
>> vect_model_load_cost doesn't consider the group size, which
>> is used as vec_num during the transformation.
> The original PR is about a correctness issue, and I'm not sure how
> much performance impact the patch will have, but the change looks
> reasonable, so the test change looks ok to me.
> I'll track the performance impact on SPEC2017 to see if there's any
> regression caused by the patch (probably not, I guess).

Thanks for the feedback and further tracking!  Hope this (and
this whole series) doesn't impact SPEC2017 performance on x86. :)

BR,
Kewen

>>
>> This patch makes the count of vector loads in costing
>> consistent with what we generate during the transformation.
>> To be more specific, for the given test case, the memory
>> access b[i_20] was costed as 2 vector loads before; with
>> this patch it is costed as 8 instead, which matches the final
>> count of generated vector loads based on b.  This costing
>> change makes the cost model analysis consider it unprofitable
>> to vectorize the first loop, so this patch adjusts the test
>> case to run without the vect cost model.
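
To make the 2 vs. 8 arithmetic above concrete, here is a minimal
sketch.  It is not GCC code: the helper names are hypothetical, and
the values ncopies = 2 and vec_num = 4 are assumptions picked only
because they reproduce the counts quoted above; the real values for
pr70021.c would need to be read from the vect dump.  It also assumes
the old path counted one load per copy, while the new path counts one
load per (copy, vec_num) pair, as the transformation does.

```cpp
// Hypothetical helpers, for illustration only (not part of GCC).

// Old costing (assumed): one vector load per copy, group size ignored.
constexpr unsigned loads_costed_before (unsigned ncopies)
{
  return ncopies;
}

// New costing (assumed): one vector load per copy and per vec_num
// iteration, where vec_num is the group size for a grouped load.
constexpr unsigned loads_costed_after (unsigned ncopies, unsigned vec_num)
{
  return ncopies * vec_num;
}

// Assumed values: ncopies = 2, vec_num (group size) = 4.
static_assert (loads_costed_before (2) == 2, "old count");
static_assert (loads_costed_after (2, 4) == 8, "new count");
```

With those assumed values the old count is 2 and the new count is 8,
which is the gap the cost model now sees.
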
>>
>> But note that this test case also exposes something we can
>> improve further: although the numbers of vector permutations
>> that we cost and generate are consistent, DCE can further
>> optimize some unused permutations out, so it would be good
>> if we could predict that and generate only the necessary
>> permutations.
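
For reference, the permute count recorded in the diff is
ceil_log2 (group_size) * group_size.  The sketch below only evaluates
that formula for a few group sizes; ceil_log2_sketch and
permutes_costed are hypothetical stand-ins written for this example,
not GCC's own helpers, and the sketch says nothing about how many of
those permutes DCE would later remove.

```cpp
// Hypothetical stand-in for a ceiling log2, for illustration only.
// Returns the smallest k such that (1 << k) >= x, with 0 for x <= 1.
constexpr unsigned ceil_log2_sketch (unsigned x)
{
  return x <= 1 ? 0 : 1 + ceil_log2_sketch ((x + 1) / 2);
}

// Permute statements costed for one grouped load, following the
// formula in the diff: ceil_log2 (group_size) * group_size.
constexpr unsigned permutes_costed (unsigned group_size)
{
  return ceil_log2_sketch (group_size) * group_size;
}

static_assert (permutes_costed (2) == 2, "1 * 2");
static_assert (permutes_costed (3) == 6, "2 * 3");
static_assert (permutes_costed (4) == 8, "2 * 4");
static_assert (permutes_costed (8) == 24, "3 * 8");
```

Any permute whose result is never used would still be included in
this count, which is the over-estimate mentioned above.
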
>>
>> gcc/ChangeLog:
>>
>>         * tree-vect-stmts.cc (vect_model_load_cost): Assert this function only
>>         handles memory_access_type VMAT_CONTIGUOUS, remove some
>>         VMAT_CONTIGUOUS_PERMUTE related handling.
>>         (vectorizable_load): Adjust the cost handling on VMAT_CONTIGUOUS_PERMUTE
>>         without calling vect_model_load_cost.
>>
>> gcc/testsuite/ChangeLog:
>>
>>         * gcc.target/i386/pr70021.c: Adjust with -fno-vect-cost-model.
>> ---
>>  gcc/testsuite/gcc.target/i386/pr70021.c |  2 +-
>>  gcc/tree-vect-stmts.cc                  | 88 ++++++++++++++-----------
>>  2 files changed, 51 insertions(+), 39 deletions(-)
>>
>> diff --git a/gcc/testsuite/gcc.target/i386/pr70021.c b/gcc/testsuite/gcc.target/i386/pr70021.c
>> index 6562c0f2bd0..d509583601e 100644
>> --- a/gcc/testsuite/gcc.target/i386/pr70021.c
>> +++ b/gcc/testsuite/gcc.target/i386/pr70021.c
>> @@ -1,7 +1,7 @@
>>  /* PR target/70021 */
>>  /* { dg-do run } */
>>  /* { dg-require-effective-target avx2 } */
>> -/* { dg-options "-O2 -ftree-vectorize -mavx2 -fdump-tree-vect-details -mtune=skylake" } */
>> +/* { dg-options "-O2 -ftree-vectorize -mavx2 -fdump-tree-vect-details -mtune=skylake -fno-vect-cost-model" } */
>>
>>  #include "avx2-check.h"
>>
>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>> index 7f8d9db5363..e7a97dbe05d 100644
>> --- a/gcc/tree-vect-stmts.cc
>> +++ b/gcc/tree-vect-stmts.cc
>> @@ -1134,8 +1134,7 @@ vect_model_load_cost (vec_info *vinfo,
>>                       slp_tree slp_node,
>>                       stmt_vector_for_cost *cost_vec)
>>  {
>> -  gcc_assert (memory_access_type == VMAT_CONTIGUOUS
>> -             || memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
>> +  gcc_assert (memory_access_type == VMAT_CONTIGUOUS);
>>
>>    unsigned int inside_cost = 0, prologue_cost = 0;
>>    bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
>> @@ -1174,26 +1173,6 @@ vect_model_load_cost (vec_info *vinfo,
>>       once per group anyhow.  */
>>    bool first_stmt_p = (first_stmt_info == stmt_info);
>>
>> -  /* We assume that the cost of a single load-lanes instruction is
>> -     equivalent to the cost of DR_GROUP_SIZE separate loads.  If a grouped
>> -     access is instead being provided by a load-and-permute operation,
>> -     include the cost of the permutes.  */
>> -  if (first_stmt_p
>> -      && memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
>> -    {
>> -      /* Uses an even and odd extract operations or shuffle operations
>> -        for each needed permute.  */
>> -      int group_size = DR_GROUP_SIZE (first_stmt_info);
>> -      int nstmts = ncopies * ceil_log2 (group_size) * group_size;
>> -      inside_cost += record_stmt_cost (cost_vec, nstmts, vec_perm,
>> -                                      stmt_info, 0, vect_body);
>> -
>> -      if (dump_enabled_p ())
>> -        dump_printf_loc (MSG_NOTE, vect_location,
>> -                         "vect_model_load_cost: strided group_size = %d .\n",
>> -                         group_size);
>> -    }
>> -
>>    vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
>>                       misalignment, first_stmt_p, &inside_cost, &prologue_cost,
>>                       cost_vec, cost_vec, true);
>> @@ -10652,11 +10631,22 @@ vectorizable_load (vec_info *vinfo,
>>                  alignment support schemes.  */
>>               if (costing_p)
>>                 {
>> -                 if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
>> +                 /* For VMAT_CONTIGUOUS_PERMUTE if it's grouped load, we
>> +                    only need to take care of the first stmt, whose
>> +                    stmt_info is first_stmt_info, vec_num iterating on it
>> +                    will cover the cost for the remaining, it's consistent
>> +                    with transforming.  For the prologue cost for realign,
>> +                    we only need to count it once for the whole group.  */
>> +                 bool first_stmt_info_p = first_stmt_info == stmt_info;
>> +                 bool add_realign_cost = first_stmt_info_p && i == 0;
>> +                 if (memory_access_type == VMAT_CONTIGUOUS_REVERSE
>> +                     || (memory_access_type == VMAT_CONTIGUOUS_PERMUTE
>> +                         && (!grouped_load || first_stmt_info_p)))
>>                     vect_get_load_cost (vinfo, stmt_info, 1,
>>                                         alignment_support_scheme, misalignment,
>> -                                       false, &inside_cost, &prologue_cost,
>> -                                       cost_vec, cost_vec, true);
>> +                                       add_realign_cost, &inside_cost,
>> +                                       &prologue_cost, cost_vec, cost_vec,
>> +                                       true);
>>                 }
>>               else
>>                 {
>> @@ -10774,8 +10764,7 @@ vectorizable_load (vec_info *vinfo,
>>              ???  This is a hack to prevent compile-time issues as seen
>>              in PR101120 and friends.  */
>>           if (costing_p
>> -             && memory_access_type != VMAT_CONTIGUOUS
>> -             && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
>> +             && memory_access_type != VMAT_CONTIGUOUS)
>>             {
>>               vect_transform_slp_perm_load (vinfo, slp_node, vNULL, nullptr, vf,
>>                                             true, &n_perms, nullptr);
>> @@ -10790,20 +10779,44 @@ vectorizable_load (vec_info *vinfo,
>>               gcc_assert (ok);
>>             }
>>         }
>> -      else if (!costing_p)
>> +      else
>>          {
>>            if (grouped_load)
>>             {
>>               if (memory_access_type != VMAT_LOAD_STORE_LANES)
>> -               vect_transform_grouped_load (vinfo, stmt_info, dr_chain,
>> -                                            group_size, gsi);
>> -             *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
>> -           }
>> -          else
>> -           {
>> -             STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>> +               {
>> +                 gcc_assert (memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
>> +                 /* We assume that the cost of a single load-lanes instruction
>> +                    is equivalent to the cost of DR_GROUP_SIZE separate loads.
>> +                    If a grouped access is instead being provided by a
>> +                    load-and-permute operation, include the cost of the
>> +                    permutes.  */
>> +                 if (costing_p && first_stmt_info == stmt_info)
>> +                   {
>> +                     /* Uses an even and odd extract operations or shuffle
>> +                        operations for each needed permute.  */
>> +                     int group_size = DR_GROUP_SIZE (first_stmt_info);
>> +                     int nstmts = ceil_log2 (group_size) * group_size;
>> +                     inside_cost
>> +                       += record_stmt_cost (cost_vec, nstmts, vec_perm,
>> +                                            stmt_info, 0, vect_body);
>> +
>> +                     if (dump_enabled_p ())
>> +                       dump_printf_loc (
>> +                         MSG_NOTE, vect_location,
>> +                         "vect_model_load_cost: strided group_size = %d .\n",
>> +                         group_size);
>> +                   }
>> +                 else if (!costing_p)
>> +                   vect_transform_grouped_load (vinfo, stmt_info, dr_chain,
>> +                                                group_size, gsi);
>> +               }
>> +             if (!costing_p)
>> +               *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
>>             }
>> -        }
>> +         else if (!costing_p)
>> +           STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>> +       }
>>        dr_chain.release ();
>>      }
>>    if (!slp && !costing_p)
>> @@ -10814,8 +10827,7 @@ vectorizable_load (vec_info *vinfo,
>>        gcc_assert (memory_access_type != VMAT_INVARIANT
>>                   && memory_access_type != VMAT_ELEMENTWISE
>>                   && memory_access_type != VMAT_STRIDED_SLP);
>> -      if (memory_access_type != VMAT_CONTIGUOUS
>> -         && memory_access_type != VMAT_CONTIGUOUS_PERMUTE)
>> +      if (memory_access_type != VMAT_CONTIGUOUS)
>>         {
>>           if (dump_enabled_p ())
>>             dump_printf_loc (MSG_NOTE, vect_location,
>> --
>> 2.31.1
>>
> 
>