From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linkw@linux.ibm.com>
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com
 [148.163.156.1])
 by sourceware.org (Postfix) with ESMTPS id 01CFF3858407
 for <gcc-patches@gcc.gnu.org>; Wed,  8 Sep 2021 08:28:47 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 01CFF3858407
Received: from pps.filterd (m0098393.ppops.net [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id
 18883hY9010546; Wed, 8 Sep 2021 04:28:47 -0400
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0a-001b2d01.pphosted.com with ESMTP id 3axmepxy8s-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Wed, 08 Sep 2021 04:28:46 -0400
Received: from m0098393.ppops.net (m0098393.ppops.net [127.0.0.1])
 by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 1888RqnS112883;
 Wed, 8 Sep 2021 04:28:46 -0400
Received: from ppma05fra.de.ibm.com (6c.4a.5195.ip4.static.sl-reverse.com
 [149.81.74.108])
 by mx0a-001b2d01.pphosted.com with ESMTP id 3axmepxy83-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Wed, 08 Sep 2021 04:28:46 -0400
Received: from pps.filterd (ppma05fra.de.ibm.com [127.0.0.1])
 by ppma05fra.de.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 1888LQvE016517;
 Wed, 8 Sep 2021 08:28:43 GMT
Received: from b06avi18878370.portsmouth.uk.ibm.com
 (b06avi18878370.portsmouth.uk.ibm.com [9.149.26.194])
 by ppma05fra.de.ibm.com with ESMTP id 3axcnnduse-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Wed, 08 Sep 2021 08:28:43 +0000
Received: from d06av25.portsmouth.uk.ibm.com (d06av25.portsmouth.uk.ibm.com
 [9.149.105.61])
 by b06avi18878370.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP
 id 1888ONOv58130826
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Wed, 8 Sep 2021 08:24:23 GMT
Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id A1DD111C050;
 Wed,  8 Sep 2021 08:28:40 +0000 (GMT)
Received: from d06av25.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 6B63B11C064;
 Wed,  8 Sep 2021 08:28:39 +0000 (GMT)
Received: from kewenlins-mbp.cn.ibm.com (unknown [9.200.147.34])
 by d06av25.portsmouth.uk.ibm.com (Postfix) with ESMTP;
 Wed,  8 Sep 2021 08:28:39 +0000 (GMT)
Subject: Re: [PATCH v4] rs6000: Add load density heuristic
To: wschmidt@linux.ibm.com
Cc: GCC Patches <gcc-patches@gcc.gnu.org>, David Edelsohn <dje.gcc@gmail.com>,
 Segher Boessenkool <segher@kernel.crashing.org>
References: <7b9f9bdf-1ed5-139b-de9c-511ee8454b85@linux.ibm.com>
 <3424a3d3-fa4e-16f9-89c6-0b07beec957d@linux.ibm.com>
 <df5512db-81bd-02e2-a45d-e6b276e07f1b@linux.ibm.com>
 <77fe5ac1-200f-db69-a92a-5d349642f394@linux.ibm.com>
 <4f7c5da8-75d3-2d98-b728-e1a319392097@linux.ibm.com>
From: "Kewen.Lin" <linkw@linux.ibm.com>
Message-ID: <a3b1a5ac-a6cc-9614-9252-bb766159af08@linux.ibm.com>
Date: Wed, 8 Sep 2021 16:28:37 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0)
 Gecko/20100101 Thunderbird/78.10.0
In-Reply-To: <4f7c5da8-75d3-2d98-b728-e1a319392097@linux.ibm.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: wqz7tN9y8xAZ8jcEmugA4AEbAT-3eWLL
X-Proofpoint-ORIG-GUID: XXRBW6-eQ-1YX6oCNyGtN5mA3VHYe73p
Content-Transfer-Encoding: 8bit
X-Proofpoint-UnRewURL: 0 URL was un-rewritten
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.790
 definitions=2021-09-08_03:2021-09-07,
 2021-09-08 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 impostorscore=0 bulkscore=0
 phishscore=0 adultscore=0 suspectscore=0 malwarescore=0 mlxlogscore=999
 priorityscore=1501 clxscore=1015 lowpriorityscore=0 mlxscore=0 spamscore=0
 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2109030001
 definitions=main-2109080047
X-Spam-Status: No, score=-6.2 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, KAM_SHORT, NICE_REPLY_A, RCVD_IN_MSPIKE_H3,
 RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 08 Sep 2021 08:28:49 -0000

on 2021/9/8 下午2:57, Kewen.Lin via Gcc-patches wrote:
> Hi Bill,
> 
> Thanks for the review comments!
> 
> on 2021/9/3 下午11:57, Bill Schmidt wrote:
>> Hi Kewen,
>>
>> Sorry that we lost track of this patch!  The heuristic approach looks good.  It is limited in scope and won't kick in often, and the case you're trying to account for is important.
>>
>> At the time you submitted this, I think reliable P10 testing wasn't possible.  Now that it is, could you please do a quick sniff test to make sure there aren't any adjustments that need to be made for P10?  I doubt it, but worth checking.
>>
> 
> Good point, thanks for the reminder!  I did one SPEC2017 full run on Power10 with Ofast unroll, this patch is neutral,
> one SPEC2017 run at O2 vectorization (cheap cost) to verify bwaves_r degradation existed or not and if it can fixed by
> this patch.  The result shows the degradation did exist and got fixed by this patch, besides got extra 3.93% speedup
> against O2 and another bmk 554.roms_r got 3.24% speed up.
> 

hmm, sorry that this improvement on 554.roms_r looks not reliable, I just ran it with 10 iterations for both w/ and w/o
the patch, both suffer the jitters and the best scores of them are close.  But note that bwaves_r scores are quite stable
so it's reliable.

BR,
Kewen

> In short, the Power10 evaluation result shows this patch is positive.
> 
>> Otherwise I have one comment below...
>>
>> On 7/28/21 12:22 AM, Kewen.Lin wrote:
>>> Hi,
>>>
>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
>>>
>>> This v3 addressed William's review comments in
>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576154.html
>>>
>>> It's mainly to deal with the bwaves_r degradation due to vector
>>> construction fed by strided loads.
>>>
>>> As Richi's comments [1], this follows the similar idea to over
>>> price the vector construction fed by VMAT_ELEMENTWISE or
>>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>>> construction costing immediately, it firstly records how many
>>> loads and vectorized statements in the given loop, later in
>>> rs6000_density_test (called by finish_cost) it computes the
>>> load density ratio against all vectorized stmts, and check
>>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>>> if both thresholds are exceeded.
>>>
>>> Note that this new load density heuristic check is based on
>>> some fields in target cost which are updated as needed when
>>> scanning each add_stmt_cost entry, it's independent of the
>>> current function rs6000_density_test which requires to scan
>>> non_vect stmts.  Since it's checking the load stmts count
>>> vs. all vectorized stmts, it's kind of density, so I put
>>> it in function rs6000_density_test.  With the same reason to
>>> keep it independent, I didn't put it as an else arm of the
>>> current existing density threshold check hunk or before this
>>> hunk.
>>>
>>> In the investigation of -1.04% degradation from 526.blender_r
>>> on Power8, I noticed that the extra penalized cost 320 on one
>>> single vector construction with type V16QI is much exaggerated,
>>> which makes the final body cost unreliable, so this patch adds
>>> one maximum bound for the extra penalized cost for each vector
>>> construction statement.
>>>
>>> Bootstrapped & regtested *again* on powerpc64le-linux-gnu P9.
>>>
>>> Full SPEC2017 performance evaluation on Power8/Power9 with
>>> option combinations (with v2, as v3 is NFC against v2):
>>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>>   * {-O3, -Ofast} {,-funroll-loops}
>>>
>>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>>> remarkable was observed.
>>>
> 
> ...
> 
>>> +  /* Gather some information when we are costing the vectorized instruction
>>> +     for the statements located in a loop body.  */
>>> +  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>>> +    {
>>> +      data->nstmts += orig_count;
>>> +
>>> +      if (kind == scalar_load || kind == vector_load
>>> +	  || kind == unaligned_load || kind == vector_gather_load)
>>> +	data->nloads += orig_count;
>>> +
>>> +      /* If we have strided or elementwise loads into a vector, it's
>>> +	 possible to be bounded by latency and execution resources for
>>> +	 many scalar loads.  Try to account for this by scaling the
>>> +	 construction cost by the number of elements involved, when
>>> +	 handling each matching statement we record the possible extra
>>> +	 penalized cost into target cost, in the end of costing for
>>> +	 the whole loop, we do the actual penalization once some load
>>> +	 density heuristics are satisfied.  */
>>
>> The above comment is quite hard to read.  Can you please break up the last
>> sentence into at least two sentences?
>>
> 
> How about the below:
> 
> +      /* If we have strided or elementwise loads into a vector, it's
> +        possible to be bounded by latency and execution resources for
> +        many scalar loads.  Try to account for this by scaling the
> +        construction cost by the number of elements involved.  For
> +        each matching statement, we record the possible extra
> +        penalized cost into the relevant field in target cost.  When
> +        we want to finalize the whole loop costing, we will check if
> +        those related load density heuristics are satisfied, and add
> +        this accumulated penalized cost if yes.  */
> 
> 
>> Otherwise this looks good to me, and I recommend maintainers approve with
>> that clarified.
>>
> 
> Thanks again!
> 
> The whole updated patch is attached, it also addressed Segher's comments on formatting.
> 
> Is it ok for trunk?
> 
> BR,
> Kewen
> -----
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
> 	nstmts, nloads and extra_ctor_cost.
> 	(rs6000_density_test): Add load density related heuristics and the
> 	checks, do extra costing on vector construction statements if need.
> 	(rs6000_init_cost): Init new members.
> 	(rs6000_update_target_cost_per_stmt): New function.
> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
> 	rs6000_update_target_cost_per_stmt and call it.
>