From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linkw@linux.ibm.com>
Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com
 [148.163.158.5])
 by sourceware.org (Postfix) with ESMTPS id 3AB3C3858403
 for <gcc-patches@gcc.gnu.org>; Fri, 10 Sep 2021 03:23:10 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 3AB3C3858403
Received: from pps.filterd (m0098421.ppops.net [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id
 18A349UD096900; Thu, 9 Sep 2021 23:23:09 -0400
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0a-001b2d01.pphosted.com with ESMTP id 3aywabt5dy-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Thu, 09 Sep 2021 23:23:08 -0400
Received: from m0098421.ppops.net (m0098421.ppops.net [127.0.0.1])
 by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 18A3E0QC136127;
 Thu, 9 Sep 2021 23:23:08 -0400
Received: from ppma06ams.nl.ibm.com (66.31.33a9.ip4.static.sl-reverse.com
 [169.51.49.102])
 by mx0a-001b2d01.pphosted.com with ESMTP id 3aywabt5dg-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Thu, 09 Sep 2021 23:23:08 -0400
Received: from pps.filterd (ppma06ams.nl.ibm.com [127.0.0.1])
 by ppma06ams.nl.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 18A37S8i024926;
 Fri, 10 Sep 2021 03:23:06 GMT
Received: from b06avi18626390.portsmouth.uk.ibm.com
 (b06avi18626390.portsmouth.uk.ibm.com [9.149.26.192])
 by ppma06ams.nl.ibm.com with ESMTP id 3axcnp4mq9-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Fri, 10 Sep 2021 03:23:06 +0000
Received: from b06wcsmtp001.portsmouth.uk.ibm.com
 (b06wcsmtp001.portsmouth.uk.ibm.com [9.149.105.160])
 by b06avi18626390.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP
 id 18A3Ij6449283430
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Fri, 10 Sep 2021 03:18:45 GMT
Received: from b06wcsmtp001.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 18D93A4065;
 Fri, 10 Sep 2021 03:23:04 +0000 (GMT)
Received: from b06wcsmtp001.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id DDBC1A405C;
 Fri, 10 Sep 2021 03:23:01 +0000 (GMT)
Received: from KewenLins-MacBook-Pro.local (unknown [9.200.58.64])
 by b06wcsmtp001.portsmouth.uk.ibm.com (Postfix) with ESMTP;
 Fri, 10 Sep 2021 03:23:01 +0000 (GMT)
Subject: Re: [PATCH v4] rs6000: Add load density heuristic
To: wschmidt@linux.ibm.com, Segher Boessenkool <segher@kernel.crashing.org>
Cc: David Edelsohn <dje.gcc@gmail.com>,
 will schmidt <will_schmidt@vnet.ibm.com>,
 GCC Patches <gcc-patches@gcc.gnu.org>
References: <7b9f9bdf-1ed5-139b-de9c-511ee8454b85@linux.ibm.com>
 <3424a3d3-fa4e-16f9-89c6-0b07beec957d@linux.ibm.com>
 <df5512db-81bd-02e2-a45d-e6b276e07f1b@linux.ibm.com>
 <77fe5ac1-200f-db69-a92a-5d349642f394@linux.ibm.com>
 <4f7c5da8-75d3-2d98-b728-e1a319392097@linux.ibm.com>
 <20210909161152.GR1583@gate.crashing.org>
 <894f01c3-6481-0757-751f-b4239a4f0232@linux.ibm.com>
From: "Kewen.Lin" <linkw@linux.ibm.com>
Message-ID: <0cfeda5b-344a-05ce-de5d-7ae43b71be4c@linux.ibm.com>
Date: Fri, 10 Sep 2021 11:22:59 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0)
 Gecko/20100101 Thunderbird/78.10.0
MIME-Version: 1.0
In-Reply-To: <894f01c3-6481-0757-751f-b4239a4f0232@linux.ibm.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 8bit
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: kxY-82a87S00LHKv_WOCDrtUitPeXYFQ
X-Proofpoint-ORIG-GUID: M98hqoir9NvpWOF6VCDFqOuS9WZqDUAI
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.790
 definitions=2021-09-10_01:2021-09-09,
 2021-09-10 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 impostorscore=0
 suspectscore=0 bulkscore=0 lowpriorityscore=0 phishscore=0 spamscore=0
 adultscore=0 clxscore=1015 mlxscore=0 priorityscore=1501 mlxlogscore=999
 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2109030001 definitions=main-2109100013
X-Spam-Status: No, score=-3.9 required=5.0 tests=BAYES_00, BODY_8BITS,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_EF, NICE_REPLY_A, RCVD_IN_MSPIKE_H4,
 RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Fri, 10 Sep 2021 03:23:11 -0000

Hi Segher and Bill,

Thanks a lot for your reviews and helps!

on 2021/9/10 上午1:19, Bill Schmidt wrote:
> On 9/9/21 11:11 AM, Segher Boessenkool wrote:
>> Hi!
>>
>> On Wed, Sep 08, 2021 at 02:57:14PM +0800, Kewen.Lin wrote:
>>>>> +      /* If we have strided or elementwise loads into a vector, it's
>>>>> +     possible to be bounded by latency and execution resources for
>>>>> +     many scalar loads.  Try to account for this by scaling the
>>>>> +     construction cost by the number of elements involved, when
>>>>> +     handling each matching statement we record the possible extra
>>>>> +     penalized cost into target cost, in the end of costing for
>>>>> +     the whole loop, we do the actual penalization once some load
>>>>> +     density heuristics are satisfied.  */
>>>> The above comment is quite hard to read.  Can you please break up the last
>>>> sentence into at least two sentences?
>>> How about the below:
>>>
>>> +      /* If we have strided or elementwise loads into a vector, it's
>> "strided" is not a word: it properly is "stridden", which does not read
>> very well either.  "Have loads by stride, or by element, ..."?  Is that
>> good English, and easier to understand?
> 
> No, this is OK.  "Strided loads" is a term of art used by the vectorizer; whether or not it was the Queen's English, it's what we have...  (And I think you might only find "bestridden" in some 18th or 19th century English poetry... :-)
>>
>>> +        possible to be bounded by latency and execution resources for
>>> +        many scalar loads.  Try to account for this by scaling the
>>> +        construction cost by the number of elements involved.  For
>>> +        each matching statement, we record the possible extra
>>> +        penalized cost into the relevant field in target cost.  When
>>> +        we want to finalize the whole loop costing, we will check if
>>> +        those related load density heuristics are satisfied, and add
>>> +        this accumulated penalized cost if yes.  */
>>>
>>>> Otherwise this looks good to me, and I recommend maintainers approve with
>>>> that clarified.
>> Does that text look good to you now Bill?  It is still kinda complex,
>> maybe you see a way to make it simpler.
> 
> I think it's OK now.  The complexity at least matches the code now instead of exceeding it. :-P  j/k...
> 

Just noticed Bill helped to revise it, I will use that nice paragraph.
(Thanks, Bill!)

>>
>>>     * config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>>>     nstmts, nloads and extra_ctor_cost.
>>>     (rs6000_density_test): Add load density related heuristics and the
>>>     checks, do extra costing on vector construction statements if need.
>> "and the checks"?  Oh, "and checks"?  It is probably fine to just leave
>> out this whole phrase part :-)
>>
>> Don't use commas like this in changelogs.  s/, do/.  Do/  Yes this is a
>> bit boring text that way, but that is the purpose: it makes it simpler
>> to read (and read quickly, even merely scan).
>>

Thanks for the explanation, will fix.

>>> @@ -5262,6 +5262,12 @@ typedef struct _rs6000_cost_data
>> [ Btw, you can get rid of the typedef now, just have a struct with the
>> non-underscore name, we have C++ now.  Such a mechanical change (as
>> separate patch!) is pre-approved. ]
>>
>>> +  /* Check if we need to penalize the body cost for latency and
>>> +     execution resources bound from strided or elementwise loads
>>> +     into a vector.  */
>> Bill, is that clear enough?  I'm sure something nicer would help here,
>> but it's hard for me to write anything :-)
> 
> Perhaps:  "Check whether we need to penalize the body cost to account for excess strided or elementwise loads."

Thanks, will update.

>>
>>> +  if (data->extra_ctor_cost > 0)
>>> +    {
>>> +      /* Threshold for load stmts percentage in all vectorized stmts.  */
>>> +      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
>> Threshold for what?
>>
>> 45% is awfully exact.  Can you make this a param?
>>
>>> +      /* Threshold for total number of load stmts.  */
>>> +      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
>> Same.
> 
> 
> We have similar magic constants in here already.  Parameterizing is possible, but I'm more interested in making sure the numbers are appropriate for each processor.  Given that Kewen reports they work well for both P9 and P10, I'm pretty happy with what we have here.  (Kewen, thanks for running the P10 experiments!)
> 
> Perhaps a follow-up patch to add params for the magic constants would be reasonable, but I'd personally consider it pretty low priority.
> 

As Bill mentioned, there are some other thresholds which can be parameterized together.
Segher, if you don't mind, I will parameterize all of them in a follow up patch.  :)

>>
>>> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
>> No parens around the last thing please.  The other pair of parens is
>> unneeded as well, but perhaps it is easier to read like that.
>>

Will fix.

>>> +      if (dump_enabled_p ())
>>> +        dump_printf_loc (MSG_NOTE, vect_location,
>>> +                 "Found %u loads and load pct. %u%% exceed "
>>> +                 "the threshold, penalizing loop body "
>>> +                 "cost by extra cost %u for ctor.\n",
>>> +                 data->nloads, load_pct, data->extra_ctor_cost);
>> That line does not fit.  Make it more lines?
>>

Will fix.

>> It is a pity that using these interfaces at all takes up 45 chars
>> of noise already.
>>
>>> +/* Helper function for add_stmt_cost.  Check each statement cost
>>> +   entry, gather information and update the target_cost fields
>>> +   accordingly.  */
>>> +static void
>>> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>> +                    enum vect_cost_for_stmt kind,
>>> +                    struct _stmt_vec_info *stmt_info,
>>> +                    enum vect_cost_model_location where,
>>> +                    int stmt_cost, unsigned int orig_count)
>> Please put those last two on separate lines as well?
>>

Will fix.

>>> +      /* As function rs6000_builtin_vectorization_cost shows, we have
>>> +         priced much on V16QI/V8HI vector construction as their units,
>>> +         if we penalize them with nunits * stmt_cost, it can result in
>>> +         an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>>> +         is 20 and nunits is 16, the extra cost is 320 which looks
>>> +         much exaggerated.  So let's use one maximum bound for the
>>> +         extra penalized cost for vector construction here.  */
>>> +      const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>>> +      if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>>> +        extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> That is a pretty gross hack.  Can you think of any saner way to not have
>> those out of scale costs in the first place?
> 
> In Kewen's defense, the whole business of "finish_cost" for these vectorized loops is to tweak things that don't work quite right with the hooks currently provided to the vectorizer to add costs on a per-stmt basis without looking at the overall set of statements.  It gives the back end a chance to massage things and exercise veto power over otherwise bad decisions.  By nature, that's going to be very much a heuristic exercise.  Personally I think the heuristics used here are pretty reasonable, and importantly they are designed to only be employed in pretty rare circumstances.  It doesn't look easy to me to avoid the need for a cap here without making the rest of the heuristics harder to understand.  But sure, he can try! :)
> 

Thanks for the explanation, Bill!  I agree the current hacking isn't good, one alternative to avoid this bound check
seems not to use nunits * stmt_cost, which was referred from the existing practice for the similar case.  Instead,
we can use log2(nunits) and then V16QI => 4, V8HI => 3 ..., the cost will not be exaggerated much as before.
Not sure if it can work expectedly, I'll re-evaluate this idea on Power8/9/10 at Ofast unroll and O2 vect (at least)
and see if it works well.
>>
>> Okay for trunk with such tweaks.  Thanks!  (And please consult with Bill
>> for the wordsmithing :-) )
>>

Thanks again.

BR,
Kewen