From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linkw@linux.ibm.com>
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com
 [148.163.156.1])
 by sourceware.org (Postfix) with ESMTPS id 111E13892023
 for <gcc-patches@gcc.gnu.org>; Wed, 28 Jul 2021 03:00:00 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 111E13892023
Received: from pps.filterd (m0098393.ppops.net [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id
 16S2hddS084094; Tue, 27 Jul 2021 22:59:59 -0400
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0a-001b2d01.pphosted.com with ESMTP id 3a2xpy893m-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 27 Jul 2021 22:59:59 -0400
Received: from m0098393.ppops.net (m0098393.ppops.net [127.0.0.1])
 by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 16S2iEi7085133;
 Tue, 27 Jul 2021 22:59:59 -0400
Received: from ppma06ams.nl.ibm.com (66.31.33a9.ip4.static.sl-reverse.com
 [169.51.49.102])
 by mx0a-001b2d01.pphosted.com with ESMTP id 3a2xpy8934-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 27 Jul 2021 22:59:59 -0400
Received: from pps.filterd (ppma06ams.nl.ibm.com [127.0.0.1])
 by ppma06ams.nl.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 16S2wt7l031311;
 Wed, 28 Jul 2021 02:59:57 GMT
Received: from b06cxnps4076.portsmouth.uk.ibm.com
 (d06relay13.portsmouth.uk.ibm.com [9.149.109.198])
 by ppma06ams.nl.ibm.com with ESMTP id 3a235kgryq-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Wed, 28 Jul 2021 02:59:56 +0000
Received: from d06av24.portsmouth.uk.ibm.com (d06av24.portsmouth.uk.ibm.com
 [9.149.105.60])
 by b06cxnps4076.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 16S2xsGp29229546
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Wed, 28 Jul 2021 02:59:54 GMT
Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 3297B4203F;
 Wed, 28 Jul 2021 02:59:54 +0000 (GMT)
Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 4E98342041;
 Wed, 28 Jul 2021 02:59:52 +0000 (GMT)
Received: from kewenlins-mbp.cn.ibm.com (unknown [9.200.147.34])
 by d06av24.portsmouth.uk.ibm.com (Postfix) with ESMTP;
 Wed, 28 Jul 2021 02:59:51 +0000 (GMT)
Subject: Re: [PATCH v2] rs6000: Add load density heuristic
To: will schmidt <will_schmidt@vnet.ibm.com>
Cc: Bill Schmidt <wschmidt@linux.ibm.com>,
 Segher Boessenkool <segher@kernel.crashing.org>,
 David Edelsohn <dje.gcc@gmail.com>, GCC Patches <gcc-patches@gcc.gnu.org>
References: <7b9f9bdf-1ed5-139b-de9c-511ee8454b85@linux.ibm.com>
 <3424a3d3-fa4e-16f9-89c6-0b07beec957d@linux.ibm.com>
 <101b1a2a18a6332b305e6259355d91772d0e02be.camel@vnet.ibm.com>
From: "Kewen.Lin" <linkw@linux.ibm.com>
Message-ID: <91063938-9ac4-93ed-c438-abf25d4eca05@linux.ibm.com>
Date: Wed, 28 Jul 2021 10:59:50 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0)
 Gecko/20100101 Thunderbird/78.10.0
In-Reply-To: <101b1a2a18a6332b305e6259355d91772d0e02be.camel@vnet.ibm.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: 7Rdmyviy6Hp3yLCQtHm-C-nOrlEM9du1
X-Proofpoint-ORIG-GUID: wztXBDsdXwrD4sdhglfkf_T_Tr-W5Z0b
Content-Transfer-Encoding: 8bit
X-Proofpoint-UnRewURL: 0 URL was un-rewritten
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.790
 definitions=2021-07-28_01:2021-07-27,
 2021-07-27 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 adultscore=0
 priorityscore=1501 phishscore=0 impostorscore=0 bulkscore=0 mlxscore=0
 lowpriorityscore=0 mlxlogscore=999 suspectscore=0 clxscore=1015
 spamscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2107140000 definitions=main-2107280012
X-Spam-Status: No, score=-11.2 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, NICE_REPLY_A,
 RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 28 Jul 2021 03:00:03 -0000

Hi William,

Thanks for the review comments!

on 2021/7/28 上午6:25, will schmidt wrote:
> On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
> 
> 
> Hi,
> 
> 
>> This is the updated version of patch to deal with the bwaves_r
>> degradation due to vector construction fed by strided loads.
>>
>> As Richi's comments [1], this follows the similar idea to over
>> price the vector construction fed by VMAT_ELEMENTWISE or
>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>> construction costing immediately, it firstly records how many
>> loads and vectorized statements in the given loop, later in
>> rs6000_density_test (called by finish_cost) it computes the
>> load density ratio against all vectorized stmts, and check
>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>> if both thresholds are exceeded.
> 
> ok
> 
>>
>> Note that this new load density heuristic check is based on
>> some fields in target cost which are updated as needed when
>> scanning each add_stmt_cost entry, it's independent of the
>> current function rs6000_density_test which requires to scan
>> non_vect stmts.  Since it's checking the load stmts count
>> vs. all vectorized stmts, it's kind of density, so I put
>> it in function rs6000_density_test.  With the same reason to
>> keep it independent, I didn't put it as an else arm of the
>> current existing density threshold check hunk or before this
>> hunk.
> 
> ok
> 
>>
>> In the investigation of -1.04% degradation from 526.blender_r
>> on Power8, I noticed that the extra penalized cost 320 on one
>> single vector construction with type V16QI is much exaggerated,
>> which makes the final body cost unreliable, so this patch adds
>> one maximum bound for the extra penalized cost for each vector
>> construction statement.
> 
> ok
> 
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>
>> Full SPEC2017 performance evaluation on Power8/Power9 with
>> option combinations:
>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>   * {-O3, -Ofast} {,-funroll-loops}
>>
>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>> remarkable was observed.
> 
> So, this fixes the "-1.04% degradation from 526.blender_r on Power8"
> degredation with no additional regressions.  that sounds good. 
> 

Sorry for the confusion, the original intention of this patch is to
fix the -8% degradation at -O2 -ftree-vectorize vs. -O2 on bwaves_r.
(From the last evaluation based on r12-1442, P8 is about -10% while
P9 is about -9%).  The mentioned -1.04% degradation from 526.blender_r
on P8 was expected to be a reason why we need the bound for the extra
penalized cost adjustment.

>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>> 	nstmts, nloads and extra_ctor_cost.
>> 	(rs6000_density_test): Add load density related heuristics and the
>> 	checks, do extra costing on vector construction statements if need.
>> 	(rs6000_init_cost): Init new members.
>> 	(rs6000_update_target_cost_per_stmt): New function.
>> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
>> 	rs6000_update_target_cost_per_stmt and call it.
>>
> 
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index 83d29cbfac1..806c3335cbc 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>>
>> @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data
>>  {
>>    struct loop *loop_info;
>>    unsigned cost[3];
>> +  /* Total number of vectorized stmts (loop only).  */
>> +  unsigned nstmts;
>> +  /* Total number of loads (loop only).  */
>> +  unsigned nloads;
>> +  /* Possible extra penalized cost on vector construction (loop only).  */
>> +  unsigned extra_ctor_cost;
>>
>>    /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>>       instruction is needed by the vectorization.  */
>>    bool vect_nonmem;
>> @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data)
>>        if (dump_enabled_p ())
>>  	dump_printf_loc (MSG_NOTE, vect_location,
>>  			 "density %d%%, cost %d exceeds threshold, penalizing "
>> -			 "loop body cost by %d%%", density_pct,
>> +			 "loop body cost by %d%%\n", density_pct,
>>  			 vec_cost + not_vec_cost, DENSITY_PENALTY);
>>      }
>> +
>> +  /* Check if we need to penalize the body cost for latency and
>> +     execution resources bound from strided or elementwise loads
>> +     into a vector.  */
>> +  if (data->extra_ctor_cost > 0)
>> +    {
>> +      /* Threshold for load stmts percentage in all vectorized stmts.  */
>> +      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
>> +      /* Threshold for total number of load stmts.  */
>> +      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
>> +
>> +      gcc_assert (data->nloads <= data->nstmts);
>> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
>> +
>> +      /* It's likely to be bounded by latency and execution resources
>> +	 from many scalar loads which are strided or elementwise loads
>> +	 into a vector if both conditions below are found:
>> +	   1. there are many loads, it's easy to result in a long wait
>> +	      for load units;
>> +	   2. load has a big proportion of all vectorized statements,
>> +	      it's not easy to schedule other statements to spread among
>> +	      the loads.
>> +	 One typical case is the innermost loop of the hotspot of SPEC2017
>> +	 503.bwaves_r without loop interchange.  */
>> +      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
>> +	  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
>> +	{
>> +	  data->cost[vect_body] += data->extra_ctor_cost;
>> +	  if (dump_enabled_p ())
>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>> +			     "Found %u loads and load pct. %u%% exceed "
>> +			     "the threshold, penalizing loop body "
>> +			     "cost by extra cost %u for ctor.\n",
>> +			     data->nloads, load_pct, data->extra_ctor_cost);
>> +	}
>> +    }
>>
>>  }
>>
>>  /* Implement targetm.vectorize.init_cost.  */
>> @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
>>    data->cost[vect_body]     = 0;
>>    data->cost[vect_epilogue] = 0;
>>    data->vect_nonmem = false;
>> +  data->nstmts = 0;
>> +  data->nloads = 0;
>> +  data->extra_ctor_cost = 0;
>>
>>    data->costing_for_scalar = costing_for_scalar;
>>    return data;
>>  }
>> @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
>>    return 0;
>>  }
>>
>> +/* As a visitor function for each statement cost entry handled in
>> +   function add_stmt_cost, gather some information and update its
>> +   relevant fields in target cost accordingly.  */
> 
> I got lost trying to parse that..  (could be just me :-) 
> 
> Possibly instead something like
> /* Helper function for add_stmt_cost ; gather information and update
> the target_cost fields accordingly.  */
> 
> 

OK, will update.  I was thinking for each entry handled in function
add_stmt_cost, this helper acts like a visitor, trying to visit each
entry and take some actions if some conditions are satisifed.

> 
>> +static void
>> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>> +				    enum vect_cost_for_stmt kind,
>> +				    struct _stmt_vec_info *stmt_info,
>> +				    enum vect_cost_model_location where,
>> +				    int stmt_cost, unsigned int orig_count)
>> +{
>> +
>> +  /* Check whether we're doing something other than just a copy loop.
>> +     Not all such loops may be profitably vectorized; see
>> +     rs6000_finish_cost.  */
>> +  if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote
>> +       || kind == vec_construct || kind == scalar_to_vec)
>> +      || (where == vect_body && kind == vector_stmt))
> 
> I thought I saw an alignment issue, then noticed the "(" that was
> hiding on me.. :-)    
> 
> Maybe clearer to read if you rearrange slightly and flatten it ?  I
> defer to others on that..
> 
> 	if ((kind == vec_to_scalar
> 	     || kind == vec_perm
> 	     || kind == vec_promote_demote
> 	     || kind ==
> vec_construct	
> 	     || kind == scalar_to_vec)
> 	    || (kind == vector_stmt && where == vect_body)
> 	
> 

This hunk is factored out from function rs6000_add_stmt_cost, maybe I
can keep the original formatting?  The formatting tool isn't so smart,
and sometimes rearrange things to become unexpected (although it meets
the basic rule, not so elegant), sigh.

>> +    data->vect_nonmem = true;
>> +
>> +  /* Gather some information when we are costing the vector version for
>> +     the statements locate in a loop body.  */
> s/version/instruction? operation?/  ? ? 
> s/locate/located/
> 

Will fix.

>> +  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>> +    {
>> +      data->nstmts += orig_count;
>> +
>> +      if (kind == scalar_load || kind == vector_load || kind == unaligned_load
>> +	  || kind == vector_gather_load)
> 
> Cosmetically, I'd move the second "||" to the next line to balance
> those two lines a bit more. 
> 

Will fix.

>> +	data->nloads += orig_count;
>> +
>> +      /* If we have strided or elementwise loads into a vector, it's
>> +	 possible to be bounded by latency and execution resources for
>> +	 many scalar loads.  Try to account for this by scaling the
>> +	 construction cost by the number of elements involved, when
>> +	 handling each matching statement we record the possible extra
>> +	 penalized cost into target cost, in the end of costing for
>> +	 the whole loop, we do the actual penalization once some load
>> +	 density heuristics are satisfied.  */
>> +      if (kind == vec_construct && stmt_info
>> +	  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>> +	  && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
>> +	      || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
>> +	{
>> +	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>> +	  unsigned int nunits = vect_nunits_for_cost (vectype);
>> +	  unsigned int extra_cost = nunits * stmt_cost;
>> +	  /* As function rs6000_builtin_vectorization_cost shows, we have
>> +	     priced much on V16QI/V8HI vector construction as their units,
>> +	     if we penalize them with nunits * stmt_cost, it can result in
>> +	     a unreliable body cost, eg: for V16QI on Power8, stmt_cost is
> s/a/an/ 

Will fix.

>> +	     20 and nunits is 16, the extra cost is 320 which looks much
>> +	     exaggerated.  So let's use one maximum bound for the extra
>> +	     penalized cost for vector construction here.  */
>> +	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>> +	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>> +	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> +	  data->extra_ctor_cost += extra_cost;
>> +	}
>> +    }
>> +}
> ok
> 
>> +
>>
>>  /* Implement targetm.vectorize.add_stmt_cost.  */
>>
>>  static unsigned
>> @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>>        /* Statements in an inner loop relative to the loop being
>>  	 vectorized are weighted more heavily.  The value here is
>>  	 arbitrary and could potentially be improved with analysis.  */
>> +      unsigned int orig_count = count;
> 
> I don't see any other assignments to orig_count.  Does 'count' get
> updated elsewhere in rs6000_add_stmt_cost() that the new orig_count
> variable is necessary?
> 

Yeah, the 'count' could get updated but the default "git diff" doesn't
show it, the codes omitted look as below:

      if (where == vect_body && stmt_info
	  && stmt_in_inner_loop_p (vinfo, stmt_info))
	{
	  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
	  gcc_assert (loop_vinfo);
	  count *= LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo); /* FIXME.  */
	}

BR,
Kewen

>>
>>        if (where == vect_body && stmt_info
>>  	  && stmt_in_inner_loop_p (vinfo, stmt_info))
>>  	{
>> @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>>        retval = (unsigned) (count * stmt_cost);
>>        cost_data->cost[where] += retval;
>>
>> -      /* Check whether we're doing something other than just a copy loop.
>> -	 Not all such loops may be profitably vectorized; see
>> -	 rs6000_finish_cost.  */
>> -      if ((kind == vec_to_scalar || kind == vec_perm
>> -	   || kind == vec_promote_demote || kind == vec_construct
>> -	   || kind == scalar_to_vec)
>> -	  || (where == vect_body && kind == vector_stmt))
>> -	cost_data->vect_nonmem = true;
>> +      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
>> +					  stmt_cost, orig_count);
>>
>>      }
>>
>>    return retval;
>>
> 
>