From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <4275f99b-cd1f-da4b-0bd3-3c5b50f0e93d@linux.ibm.com>
Date: Tue, 15 Aug 2023 19:47:37 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
 Gecko/20100101 Thunderbird/91.6.1
Subject: Re: [PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop
 nest
Content-Language: en-US
To: Richard Biener <richard.guenther@gmail.com>
Cc: richard.sandiford@arm.com, GCC Patches <gcc-patches@gcc.gnu.org>
References: <b39e934e-869e-840d-eb7a-5b2de24146a8@linux.ibm.com>
 <7314a4eb-26d0-e33e-94c2-31daca9f490e@linux.ibm.com>
 <mptzg2trf1l.fsf@arm.com>
 <dce2077b-81b6-0b78-eb54-a9339309ae78@linux.ibm.com>
 <mpth6p1r9ny.fsf@arm.com>
 <c6b56548-f451-03bf-e680-b92f90b248dd@linux.ibm.com>
 <CAFiYyc0d931GyuqwUXfRP-8UzfYbbEL7up0k91sJ5u3hh-Xu3Q@mail.gmail.com>
From: "Kewen.Lin" <linkw@linux.ibm.com>
In-Reply-To: <CAFiYyc0d931GyuqwUXfRP-8UzfYbbEL7up0k91sJ5u3hh-Xu3Q@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Proofpoint-UnRewURL: 0 URL was un-rewritten
MIME-Version: 1.0
List-Id: <gcc-patches.gcc.gnu.org>

on 2023/8/15 15:53, Richard Biener wrote:
> On Tue, Aug 15, 2023 at 4:44 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> on 2023/8/14 22:16, Richard Sandiford wrote:
>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>> Hi Richard,
>>>>
>>>> on 2023/8/14 20:20, Richard Sandiford wrote:
>>>>> Thanks for the clean-ups.  But...
>>>>>
>>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>>> Hi,
>>>>>>
>>>>>> Following Richi's suggestion [1], this patch is to move the
>>>>>> handlings on VMAT_GATHER_SCATTER in the final loop nest
>>>>>> of function vectorizable_load to its own loop.  Basically
>>>>>> it duplicates the final loop nest, clean up some useless
>>>>>> set up code for the case of VMAT_GATHER_SCATTER, remove some
>>>>>> unreachable code.  Also remove the corresponding handlings
>>>>>> in the final loop nest.
>>>>>>
>>>>>> Bootstrapped and regtested on x86_64-redhat-linux,
>>>>>> aarch64-linux-gnu and powerpc64{,le}-linux-gnu.
>>>>>>
>>>>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html
>>>>>>
>>>>>> Is it ok for trunk?
>>>>>>
>>>>>> BR,
>>>>>> Kewen
>>>>>> -----
>>>>>>
>>>>>> gcc/ChangeLog:
>>>>>>
>>>>>>    * tree-vect-stmts.cc (vectorizable_load): Move the handlings on
>>>>>>    VMAT_GATHER_SCATTER in the final loop nest to its own loop,
>>>>>>    and update the final nest accordingly.
>>>>>> ---
>>>>>>  gcc/tree-vect-stmts.cc | 361 +++++++++++++++++++++++++----------------
>>>>>>  1 file changed, 219 insertions(+), 142 deletions(-)
>>>>>
>>>>> ...that seems like quite a lot of +s.  Is there nothing we can do to
>>>>> avoid the cut-&-paste?
>>>>
>>>> Thanks for the comments!  I'm not sure if I get your question, if we
>>>> want to move out the handlings of VMAT_GATHER_SCATTER, the new +s seem
>>>> inevitable?  Your concern is mainly about git blame history?
>>>
>>> No, it was more that 219-142=77, so it seems like a lot of lines
>>> are being duplicated rather than simply being moved.  (Unlike for
>>> VMAT_LOAD_STORE_LANES, which was even a slight LOC saving, and so
>>> was a clear improvement.)
>>>
>>> So I was just wondering if there was any obvious factoring-out that
>>> could be done to reduce the duplication.
>>
>> ah, thanks for the clarification!
>>
>> I think the main duplication is at the beginning and end of the loop body;
>> let's take a look at them in detail:
>>
>> +  if (memory_access_type == VMAT_GATHER_SCATTER)
>> +    {
>> +      gcc_assert (alignment_support_scheme == dr_aligned
>> +                 || alignment_support_scheme == dr_unaligned_supported);
>> +      gcc_assert (!grouped_load && !slp_perm);
>> +
>> +      unsigned int inside_cost = 0, prologue_cost = 0;
>>
>> // These above are newly added.
>>
>> +      for (j = 0; j < ncopies; j++)
>> +       {
>> +         /* 1. Create the vector or array pointer update chain.  */
>> +         if (j == 0 && !costing_p)
>> +           {
>> +             if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +               vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
>> +                                            slp_node, &gs_info, &dataref_ptr,
>> +                                            &vec_offsets);
>> +             else
>> +               dataref_ptr
>> +                 = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
>> +                                             at_loop, offset, &dummy, gsi,
>> +                                             &ptr_incr, false, bump);
>> +           }
>> +         else if (!costing_p)
>> +           {
>> +             gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
>> +             if (!STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +               dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
>> +                                              gsi, stmt_info, bump);
>> +           }
>>
>> // These are for dataref_ptr.  In the final loop nest we deal with more cases
>> on simd_lane_access_p and diff_first_stmt_info, but don't handle
>> STMT_VINFO_GATHER_SCATTER_P any more, so very little (one case) can be shared
>> between them.  IMHO factoring it out seems like overkill.
>>
>> +
>> +         if (mask && !costing_p)
>> +           vec_mask = vec_masks[j];
>>
>> // It's merged out from j == 0 and j != 0
>>
>> +
>> +         gimple *new_stmt = NULL;
>> +         for (i = 0; i < vec_num; i++)
>> +           {
>> +             tree final_mask = NULL_TREE;
>> +             tree final_len = NULL_TREE;
>> +             tree bias = NULL_TREE;
>> +             if (!costing_p)
>> +               {
>> +                 if (loop_masks)
>> +                   final_mask
>> +                     = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
>> +                                           vec_num * ncopies, vectype,
>> +                                           vec_num * j + i);
>> +                 if (vec_mask)
>> +                   final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
>> +                                                  final_mask, vec_mask, gsi);
>> +
>> +                 if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +                   dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
>> +                                                  gsi, stmt_info, bump);
>> +               }
>>
>> // This part is copied directly from the original; the original is updated by
>> removing && !STMT_VINFO_GATHER_SCATTER_P.  Due to its size, I didn't consider
>> factoring it out before; would you prefer that I do?
>>
>> +             if (gs_info.ifn != IFN_LAST)
>> +               {
>> ...
>> +               }
>> +             else
>> +               {
>> +                 /* Emulated gather-scatter.  */
>> ...
>>
>> // This part is just moved from the original.
>>
>> +             vec_dest = vect_create_destination_var (scalar_dest, vectype);
>> +             /* DATA_REF is null if we've already built the statement.  */
>> +             if (data_ref)
>> +               {
>> +                 vect_copy_ref_info (data_ref, DR_REF (first_dr_info->dr));
>> +                 new_stmt = gimple_build_assign (vec_dest, data_ref);
>> +               }
>> +             new_temp = make_ssa_name (vec_dest, new_stmt);
>> +             gimple_set_lhs (new_stmt, new_temp);
>> +             vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
>> +
>> +             /* Store vector loads in the corresponding SLP_NODE.  */
>> +             if (slp)
>> +               slp_node->push_vec_def (new_stmt);
>> +
>> +         if (!slp && !costing_p)
>> +           STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>> +       }
>> +
>> +      if (!slp && !costing_p)
>> +       *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
>>
>> // This part is some subsequent handling.  It's duplicated from the original
>> with some more useless code removed.  I guess this part is not worth
>> factoring out?
>>
>> +      if (costing_p)
>> +       {
>> +         if (dump_enabled_p ())
>> +           dump_printf_loc (MSG_NOTE, vect_location,
>> +                            "vect_model_load_cost: inside_cost = %u, "
>> +                            "prologue_cost = %u .\n",
>> +                            inside_cost, prologue_cost);
>> +       }
>> +      return true;
>> +    }
>>
>> // This duplicates the dumping; I guess it's not necessary to factor it out.
>>
>> Oh, I just noticed that this should be shortened to
>> "if (costing_p && dump_enabled_p ())" instead, the same as what's
>> adopted for the VMAT_LOAD_STORE_LANES dumping.
> 
> Just to mention, the original motivational idea was even though we
> duplicate some code we make it overall more readable and thus
> maintainable.  In the end we
> might have vectorizable_load () for analysis but have not only
> load_vec_info_type but one for each VMAT_* which means multiple separate
> vect_transform_load () functions.  Currently vectorizable_load is structured
> very inconsistently, having the transforms all hang off a single
> switch (vmat-kind) {} would be an improvement IMHO.

Thanks for the comments!  With these two patches, the final loop nest now
handles only VMAT_CONTIGUOUS, VMAT_CONTIGUOUS_REVERSE and VMAT_CONTIGUOUS_PERMUTE.
IMHO, their handling is tightly coupled, so restructuring them could introduce
more duplicated code and the risk of incomplete bug fixes, as Richard pointed out.
But if I read the above comments right, our final goal seems to be to separate
all of them?  I wonder if you both prefer to separate them further?
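
To check that I read the "single switch (vmat-kind) {}" idea right, here is
a toy, self-contained sketch of the shape I have in mind.  It is not GCC code:
vmat_kind, transform_load and the three transform_* helpers are hypothetical
names made up for illustration, and only the VMAT_* kinds themselves come from
this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the vectorizer's memory access kinds; only the
// kinds discussed in this thread are listed.
enum class vmat_kind
{
  contiguous,
  contiguous_reverse,
  contiguous_permute,
  load_store_lanes,
  gather_scatter
};

// Hypothetical per-kind transform routines.  In this toy each one only
// reports which path was taken.
static std::string transform_lanes () { return "load_store_lanes"; }
static std::string transform_gather () { return "gather_scatter"; }
static std::string transform_contiguous (vmat_kind) { return "contiguous"; }

// Single dispatch point: every transform hangs off one switch, and the
// three contiguous kinds still share one routine because their handling
// is tightly coupled.
std::string
transform_load (vmat_kind kind)
{
  switch (kind)
    {
    case vmat_kind::load_store_lanes:
      return transform_lanes ();
    case vmat_kind::gather_scatter:
      return transform_gather ();
    case vmat_kind::contiguous:
    case vmat_kind::contiguous_reverse:
    case vmat_kind::contiguous_permute:
      return transform_contiguous (kind);
    }
  return "unhandled";
}
```

In this shape the contiguous kinds could be split further later on without
touching the other cases, which is the part I'm unsure you both want.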

> 
> But sure some of our internal APIs are verbose and maybe badly factored,
> any improvement there is welcome.  Inventing new random APIs just to
> save a few lines of code without actually making the code more readable
> is IMHO bad.
> 
> But, if we can for example enhance prepare_vec_mask to handle both loop
> and conditional mask and handle querying the mask that would be fine
> (of course you need to check all uses to see if that makes sense).

OK, I'll keep that in mind and add the example to my TODO list. :)
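
As a note for that TODO item, a rough toy model of a helper that both queries
the loop mask and combines it with the conditional mask could look like the
sketch below.  Everything in it is an assumption for illustration: lane_mask,
loop_mask_set and get_final_mask are hypothetical names, masks are modelled as
plain 8-bit integers, and the real prepare_vec_mask interface is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Toy 8-lane mask; a set bit means the lane is active.
using lane_mask = std::uint8_t;

// Hypothetical container for the per-copy loop masks.
struct loop_mask_set
{
  std::vector<lane_mask> masks;
};

// Hypothetical combined helper.  It looks up the loop mask at INDEX when
// LOOP_MASKS is non-null, then ANDs it with COND_MASK when that is present.
// The result is empty when neither kind of mask applies.
std::optional<lane_mask>
get_final_mask (const loop_mask_set *loop_masks, std::size_t index,
                std::optional<lane_mask> cond_mask)
{
  std::optional<lane_mask> final_mask;
  if (loop_masks)
    final_mask = loop_masks->masks.at (index);
  if (cond_mask)
    final_mask = final_mask ? lane_mask (*final_mask & *cond_mask)
                            : *cond_mask;
  return final_mask;
}
```

A caller would then need a single call per vector copy instead of the
separate "if (loop_masks)" and "if (vec_mask)" steps, but all existing uses
would have to be checked first, as you say.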

BR,
Kewen