From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <will_schmidt@vnet.ibm.com>
Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com
 [148.163.156.1])
 by sourceware.org (Postfix) with ESMTPS id 4F652385040A
 for <gcc-patches@gcc.gnu.org>; Tue, 27 Jul 2021 22:25:12 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 4F652385040A
Received: from pps.filterd (m0098396.ppops.net [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id
 16RM3kXW061328; Tue, 27 Jul 2021 18:25:11 -0400
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0a-001b2d01.pphosted.com with ESMTP id 3a2tassbmt-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 27 Jul 2021 18:25:10 -0400
Received: from m0098396.ppops.net (m0098396.ppops.net [127.0.0.1])
 by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 16RM56tl067268;
 Tue, 27 Jul 2021 18:25:10 -0400
Received: from ppma04wdc.us.ibm.com (1a.90.2fa9.ip4.static.sl-reverse.com
 [169.47.144.26])
 by mx0a-001b2d01.pphosted.com with ESMTP id 3a2tassbmb-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 27 Jul 2021 18:25:10 -0400
Received: from pps.filterd (ppma04wdc.us.ibm.com [127.0.0.1])
 by ppma04wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 16RMMEVq006975;
 Tue, 27 Jul 2021 22:25:09 GMT
Received: from b01cxnp22034.gho.pok.ibm.com (b01cxnp22034.gho.pok.ibm.com
 [9.57.198.24]) by ppma04wdc.us.ibm.com with ESMTP id 3a235n5btf-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 27 Jul 2021 22:25:09 +0000
Received: from b01ledav003.gho.pok.ibm.com (b01ledav003.gho.pok.ibm.com
 [9.57.199.108])
 by b01cxnp22034.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 16RMP8TE37290332
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Tue, 27 Jul 2021 22:25:08 GMT
Received: from b01ledav003.gho.pok.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 82342B2077;
 Tue, 27 Jul 2021 22:25:08 +0000 (GMT)
Received: from b01ledav003.gho.pok.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 13B47B2073;
 Tue, 27 Jul 2021 22:25:07 +0000 (GMT)
Received: from lexx (unknown [9.171.17.235])
 by b01ledav003.gho.pok.ibm.com (Postfix) with ESMTP;
 Tue, 27 Jul 2021 22:25:06 +0000 (GMT)
Message-ID: <101b1a2a18a6332b305e6259355d91772d0e02be.camel@vnet.ibm.com>
Subject: Re: [PATCH v2] rs6000: Add load density heuristic
From: will schmidt <will_schmidt@vnet.ibm.com>
To: "Kewen.Lin" <linkw@linux.ibm.com>, GCC Patches <gcc-patches@gcc.gnu.org>
Cc: Bill Schmidt <wschmidt@linux.ibm.com>, Segher Boessenkool
 <segher@kernel.crashing.org>, David Edelsohn <dje.gcc@gmail.com>
Date: Tue, 27 Jul 2021 17:25:05 -0500
In-Reply-To: <3424a3d3-fa4e-16f9-89c6-0b07beec957d@linux.ibm.com>
References: <7b9f9bdf-1ed5-139b-de9c-511ee8454b85@linux.ibm.com>
 <3424a3d3-fa4e-16f9-89c6-0b07beec957d@linux.ibm.com>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.28.5 (3.28.5-10.el7) 
X-TM-AS-GCONF: 00
X-Proofpoint-ORIG-GUID: hMvLoMbq-Grc4SO54KXE9QhurVHhvNxq
X-Proofpoint-GUID: UMPLINQREewCacK8DfL6jz6WHE_sQVId
Content-Transfer-Encoding: 7bit
X-Proofpoint-UnRewURL: 0 URL was un-rewritten
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.790
 definitions=2021-07-27_14:2021-07-27,
 2021-07-27 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 mlxscore=0 mlxlogscore=999
 malwarescore=0 adultscore=0 phishscore=0 spamscore=0 clxscore=1015
 lowpriorityscore=0 bulkscore=0 priorityscore=1501 impostorscore=0
 suspectscore=0 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2107140000 definitions=main-2107270128
X-Spam-Status: No, score=-12.5 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT, RCVD_IN_MSPIKE_H3,
 RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 27 Jul 2021 22:25:14 -0000

On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 


Hi,


> This is the updated version of patch to deal with the bwaves_r
> degradation due to vector construction fed by strided loads.
> 
> As Richi's comments [1], this follows the similar idea to over
> price the vector construction fed by VMAT_ELEMENTWISE or
> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
> construction costing immediately, it firstly records how many
> loads and vectorized statements in the given loop, later in
> rs6000_density_test (called by finish_cost) it computes the
> load density ratio against all vectorized stmts, and check
> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
> if both thresholds are exceeded.

ok

> 
> Note that this new load density heuristic check is based on
> some fields in target cost which are updated as needed when
> scanning each add_stmt_cost entry, it's independent of the
> current function rs6000_density_test which requires to scan
> non_vect stmts.  Since it's checking the load stmts count
> vs. all vectorized stmts, it's kind of density, so I put
> it in function rs6000_density_test.  With the same reason to
> keep it independent, I didn't put it as an else arm of the
> current existing density threshold check hunk or before this
> hunk.

ok

> 
> In the investigation of -1.04% degradation from 526.blender_r
> on Power8, I noticed that the extra penalized cost 320 on one
> single vector construction with type V16QI is much exaggerated,
> which makes the final body cost unreliable, so this patch adds
> one maximum bound for the extra penalized cost for each vector
> construction statement.

ok

> 
> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
> 
> Full SPEC2017 performance evaluation on Power8/Power9 with
> option combinations:
>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>   * {-O3, -Ofast} {,-funroll-loops}
> 
> bwaves_r degradations on P8/P9 have been fixed, nothing else
> remarkable was observed.

So, this fixes the "-1.04% degradation from 526.blender_r on Power8"
degredation with no additional regressions.  that sounds good. 

> 
> Is it ok for trunk?
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
> 
> BR,
> Kewen
> -----
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
> 	nstmts, nloads and extra_ctor_cost.
> 	(rs6000_density_test): Add load density related heuristics and the
> 	checks, do extra costing on vector construction statements if need.
> 	(rs6000_init_cost): Init new members.
> 	(rs6000_update_target_cost_per_stmt): New function.
> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
> 	rs6000_update_target_cost_per_stmt and call it.
> 

> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 83d29cbfac1..806c3335cbc 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> 
> @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data
>  {
>    struct loop *loop_info;
>    unsigned cost[3];
> +  /* Total number of vectorized stmts (loop only).  */
> +  unsigned nstmts;
> +  /* Total number of loads (loop only).  */
> +  unsigned nloads;
> +  /* Possible extra penalized cost on vector construction (loop only).  */
> +  unsigned extra_ctor_cost;
> 
>    /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>       instruction is needed by the vectorization.  */
>    bool vect_nonmem;
> @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data)
>        if (dump_enabled_p ())
>  	dump_printf_loc (MSG_NOTE, vect_location,
>  			 "density %d%%, cost %d exceeds threshold, penalizing "
> -			 "loop body cost by %d%%", density_pct,
> +			 "loop body cost by %d%%\n", density_pct,
>  			 vec_cost + not_vec_cost, DENSITY_PENALTY);
>      }
> +
> +  /* Check if we need to penalize the body cost for latency and
> +     execution resources bound from strided or elementwise loads
> +     into a vector.  */
> +  if (data->extra_ctor_cost > 0)
> +    {
> +      /* Threshold for load stmts percentage in all vectorized stmts.  */
> +      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
> +      /* Threshold for total number of load stmts.  */
> +      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
> +
> +      gcc_assert (data->nloads <= data->nstmts);
> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
> +
> +      /* It's likely to be bounded by latency and execution resources
> +	 from many scalar loads which are strided or elementwise loads
> +	 into a vector if both conditions below are found:
> +	   1. there are many loads, it's easy to result in a long wait
> +	      for load units;
> +	   2. load has a big proportion of all vectorized statements,
> +	      it's not easy to schedule other statements to spread among
> +	      the loads.
> +	 One typical case is the innermost loop of the hotspot of SPEC2017
> +	 503.bwaves_r without loop interchange.  */
> +      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
> +	  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
> +	{
> +	  data->cost[vect_body] += data->extra_ctor_cost;
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_NOTE, vect_location,
> +			     "Found %u loads and load pct. %u%% exceed "
> +			     "the threshold, penalizing loop body "
> +			     "cost by extra cost %u for ctor.\n",
> +			     data->nloads, load_pct, data->extra_ctor_cost);
> +	}
> +    }
> 
>  }
> 
>  /* Implement targetm.vectorize.init_cost.  */
> @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
>    data->cost[vect_body]     = 0;
>    data->cost[vect_epilogue] = 0;
>    data->vect_nonmem = false;
> +  data->nstmts = 0;
> +  data->nloads = 0;
> +  data->extra_ctor_cost = 0;
> 
>    data->costing_for_scalar = costing_for_scalar;
>    return data;
>  }
> @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
>    return 0;
>  }
> 
> +/* As a visitor function for each statement cost entry handled in
> +   function add_stmt_cost, gather some information and update its
> +   relevant fields in target cost accordingly.  */

I got lost trying to parse that..  (could be just me :-) 

Possibly instead something like
/* Helper function for add_stmt_cost ; gather information and update
the target_cost fields accordingly.  */


> +static void
> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
> +				    enum vect_cost_for_stmt kind,
> +				    struct _stmt_vec_info *stmt_info,
> +				    enum vect_cost_model_location where,
> +				    int stmt_cost, unsigned int orig_count)
> +{
> +
> +  /* Check whether we're doing something other than just a copy loop.
> +     Not all such loops may be profitably vectorized; see
> +     rs6000_finish_cost.  */
> +  if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote
> +       || kind == vec_construct || kind == scalar_to_vec)
> +      || (where == vect_body && kind == vector_stmt))

I thought I saw an alignment issue, then noticed the "(" that was
hiding on me.. :-)    

Maybe clearer to read if you rearrange slightly and flatten it ?  I
defer to others on that..

	if ((kind == vec_to_scalar
	     || kind == vec_perm
	     || kind == vec_promote_demote
	     || kind ==
vec_construct	
	     || kind == scalar_to_vec)
	    || (kind == vector_stmt && where == vect_body)
	

> +    data->vect_nonmem = true;
> +
> +  /* Gather some information when we are costing the vector version for
> +     the statements locate in a loop body.  */
s/version/instruction? operation?/  ? ? 
s/locate/located/

> +  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
> +    {
> +      data->nstmts += orig_count;
> +
> +      if (kind == scalar_load || kind == vector_load || kind == unaligned_load
> +	  || kind == vector_gather_load)

Cosmetically, I'd move the second "||" to the next line to balance
those two lines a bit more. 

> +	data->nloads += orig_count;
> +
> +      /* If we have strided or elementwise loads into a vector, it's
> +	 possible to be bounded by latency and execution resources for
> +	 many scalar loads.  Try to account for this by scaling the
> +	 construction cost by the number of elements involved, when
> +	 handling each matching statement we record the possible extra
> +	 penalized cost into target cost, in the end of costing for
> +	 the whole loop, we do the actual penalization once some load
> +	 density heuristics are satisfied.  */
> +      if (kind == vec_construct && stmt_info
> +	  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
> +	  && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
> +	      || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
> +	{
> +	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> +	  unsigned int nunits = vect_nunits_for_cost (vectype);
> +	  unsigned int extra_cost = nunits * stmt_cost;
> +	  /* As function rs6000_builtin_vectorization_cost shows, we have
> +	     priced much on V16QI/V8HI vector construction as their units,
> +	     if we penalize them with nunits * stmt_cost, it can result in
> +	     a unreliable body cost, eg: for V16QI on Power8, stmt_cost is
s/a/an/ 
> +	     20 and nunits is 16, the extra cost is 320 which looks much
> +	     exaggerated.  So let's use one maximum bound for the extra
> +	     penalized cost for vector construction here.  */
> +	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
> +	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
> +	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
> +	  data->extra_ctor_cost += extra_cost;
> +	}
> +    }
> +}
ok

> +
> 
>  /* Implement targetm.vectorize.add_stmt_cost.  */
> 
>  static unsigned
> @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>        /* Statements in an inner loop relative to the loop being
>  	 vectorized are weighted more heavily.  The value here is
>  	 arbitrary and could potentially be improved with analysis.  */
> +      unsigned int orig_count = count;

I don't see any other assignments to orig_count.  Does 'count' get
updated elsewhere in rs6000_add_stmt_cost() that the new orig_count
variable is necessary?

> 
>        if (where == vect_body && stmt_info
>  	  && stmt_in_inner_loop_p (vinfo, stmt_info))
>  	{
> @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>        retval = (unsigned) (count * stmt_cost);
>        cost_data->cost[where] += retval;
> 
> -      /* Check whether we're doing something other than just a copy loop.
> -	 Not all such loops may be profitably vectorized; see
> -	 rs6000_finish_cost.  */
> -      if ((kind == vec_to_scalar || kind == vec_perm
> -	   || kind == vec_promote_demote || kind == vec_construct
> -	   || kind == scalar_to_vec)
> -	  || (where == vect_body && kind == vector_stmt))
> -	cost_data->vect_nonmem = true;
> +      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
> +					  stmt_cost, orig_count);
> 
>      }
> 
>    return retval;
>