Re: [PATCH v2] rs6000: Add load density heuristic

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: "Kewen.Lin" <linkw@linux.ibm.com>
To: will schmidt <will_schmidt@vnet.ibm.com>
Cc: Bill Schmidt <wschmidt@linux.ibm.com>,
	Segher Boessenkool <segher@kernel.crashing.org>,
	David Edelsohn <dje.gcc@gmail.com>,
	GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH v2] rs6000: Add load density heuristic
Date: Wed, 28 Jul 2021 10:59:50 +0800	[thread overview]
Message-ID: <91063938-9ac4-93ed-c438-abf25d4eca05@linux.ibm.com> (raw)
In-Reply-To: <101b1a2a18a6332b305e6259355d91772d0e02be.camel@vnet.ibm.com>

Hi William,

Thanks for the review comments!

on 2021/7/28 上午6:25, will schmidt wrote:
> On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
> 
> 
> Hi,
> 
> 
>> This is the updated version of patch to deal with the bwaves_r
>> degradation due to vector construction fed by strided loads.
>>
>> As Richi's comments [1], this follows the similar idea to over
>> price the vector construction fed by VMAT_ELEMENTWISE or
>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>> construction costing immediately, it firstly records how many
>> loads and vectorized statements in the given loop, later in
>> rs6000_density_test (called by finish_cost) it computes the
>> load density ratio against all vectorized stmts, and check
>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>> if both thresholds are exceeded.
> 
> ok
> 
>>
>> Note that this new load density heuristic check is based on
>> some fields in target cost which are updated as needed when
>> scanning each add_stmt_cost entry, it's independent of the
>> current function rs6000_density_test which requires to scan
>> non_vect stmts.  Since it's checking the load stmts count
>> vs. all vectorized stmts, it's kind of density, so I put
>> it in function rs6000_density_test.  With the same reason to
>> keep it independent, I didn't put it as an else arm of the
>> current existing density threshold check hunk or before this
>> hunk.
> 
> ok
> 
>>
>> In the investigation of -1.04% degradation from 526.blender_r
>> on Power8, I noticed that the extra penalized cost 320 on one
>> single vector construction with type V16QI is much exaggerated,
>> which makes the final body cost unreliable, so this patch adds
>> one maximum bound for the extra penalized cost for each vector
>> construction statement.
> 
> ok
> 
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>
>> Full SPEC2017 performance evaluation on Power8/Power9 with
>> option combinations:
>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>   * {-O3, -Ofast} {,-funroll-loops}
>>
>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>> remarkable was observed.
> 
> So, this fixes the "-1.04% degradation from 526.blender_r on Power8"
> degredation with no additional regressions.  that sounds good. 
> 

Sorry for the confusion, the original intention of this patch is to
fix the -8% degradation at -O2 -ftree-vectorize vs. -O2 on bwaves_r.
(From the last evaluation based on r12-1442, P8 is about -10% while
P9 is about -9%).  The mentioned -1.04% degradation from 526.blender_r
on P8 was expected to be a reason why we need the bound for the extra
penalized cost adjustment.

>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>> 	nstmts, nloads and extra_ctor_cost.
>> 	(rs6000_density_test): Add load density related heuristics and the
>> 	checks, do extra costing on vector construction statements if need.
>> 	(rs6000_init_cost): Init new members.
>> 	(rs6000_update_target_cost_per_stmt): New function.
>> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
>> 	rs6000_update_target_cost_per_stmt and call it.
>>
> 
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index 83d29cbfac1..806c3335cbc 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>>
>> @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data
>>  {
>>    struct loop *loop_info;
>>    unsigned cost[3];
>> +  /* Total number of vectorized stmts (loop only).  */
>> +  unsigned nstmts;
>> +  /* Total number of loads (loop only).  */
>> +  unsigned nloads;
>> +  /* Possible extra penalized cost on vector construction (loop only).  */
>> +  unsigned extra_ctor_cost;
>>
>>    /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>>       instruction is needed by the vectorization.  */
>>    bool vect_nonmem;
>> @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data)
>>        if (dump_enabled_p ())
>>  	dump_printf_loc (MSG_NOTE, vect_location,
>>  			 "density %d%%, cost %d exceeds threshold, penalizing "
>> -			 "loop body cost by %d%%", density_pct,
>> +			 "loop body cost by %d%%\n", density_pct,
>>  			 vec_cost + not_vec_cost, DENSITY_PENALTY);
>>      }
>> +
>> +  /* Check if we need to penalize the body cost for latency and
>> +     execution resources bound from strided or elementwise loads
>> +     into a vector.  */
>> +  if (data->extra_ctor_cost > 0)
>> +    {
>> +      /* Threshold for load stmts percentage in all vectorized stmts.  */
>> +      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
>> +      /* Threshold for total number of load stmts.  */
>> +      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
>> +
>> +      gcc_assert (data->nloads <= data->nstmts);
>> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
>> +
>> +      /* It's likely to be bounded by latency and execution resources
>> +	 from many scalar loads which are strided or elementwise loads
>> +	 into a vector if both conditions below are found:
>> +	   1. there are many loads, it's easy to result in a long wait
>> +	      for load units;
>> +	   2. load has a big proportion of all vectorized statements,
>> +	      it's not easy to schedule other statements to spread among
>> +	      the loads.
>> +	 One typical case is the innermost loop of the hotspot of SPEC2017
>> +	 503.bwaves_r without loop interchange.  */
>> +      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
>> +	  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
>> +	{
>> +	  data->cost[vect_body] += data->extra_ctor_cost;
>> +	  if (dump_enabled_p ())
>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>> +			     "Found %u loads and load pct. %u%% exceed "
>> +			     "the threshold, penalizing loop body "
>> +			     "cost by extra cost %u for ctor.\n",
>> +			     data->nloads, load_pct, data->extra_ctor_cost);
>> +	}
>> +    }
>>
>>  }
>>
>>  /* Implement targetm.vectorize.init_cost.  */
>> @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
>>    data->cost[vect_body]     = 0;
>>    data->cost[vect_epilogue] = 0;
>>    data->vect_nonmem = false;
>> +  data->nstmts = 0;
>> +  data->nloads = 0;
>> +  data->extra_ctor_cost = 0;
>>
>>    data->costing_for_scalar = costing_for_scalar;
>>    return data;
>>  }
>> @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
>>    return 0;
>>  }
>>
>> +/* As a visitor function for each statement cost entry handled in
>> +   function add_stmt_cost, gather some information and update its
>> +   relevant fields in target cost accordingly.  */
> 
> I got lost trying to parse that..  (could be just me :-) 
> 
> Possibly instead something like
> /* Helper function for add_stmt_cost ; gather information and update
> the target_cost fields accordingly.  */
> 
> 

OK, will update.  I was thinking for each entry handled in function
add_stmt_cost, this helper acts like a visitor, trying to visit each
entry and take some actions if some conditions are satisifed.

> 
>> +static void
>> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>> +				    enum vect_cost_for_stmt kind,
>> +				    struct _stmt_vec_info *stmt_info,
>> +				    enum vect_cost_model_location where,
>> +				    int stmt_cost, unsigned int orig_count)
>> +{
>> +
>> +  /* Check whether we're doing something other than just a copy loop.
>> +     Not all such loops may be profitably vectorized; see
>> +     rs6000_finish_cost.  */
>> +  if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote
>> +       || kind == vec_construct || kind == scalar_to_vec)
>> +      || (where == vect_body && kind == vector_stmt))
> 
> I thought I saw an alignment issue, then noticed the "(" that was
> hiding on me.. :-)    
> 
> Maybe clearer to read if you rearrange slightly and flatten it ?  I
> defer to others on that..
> 
> 	if ((kind == vec_to_scalar
> 	     || kind == vec_perm
> 	     || kind == vec_promote_demote
> 	     || kind ==
> vec_construct	
> 	     || kind == scalar_to_vec)
> 	    || (kind == vector_stmt && where == vect_body)
> 	
> 

This hunk is factored out from function rs6000_add_stmt_cost, maybe I
can keep the original formatting?  The formatting tool isn't so smart,
and sometimes rearrange things to become unexpected (although it meets
the basic rule, not so elegant), sigh.

>> +    data->vect_nonmem = true;
>> +
>> +  /* Gather some information when we are costing the vector version for
>> +     the statements locate in a loop body.  */
> s/version/instruction? operation?/  ? ? 
> s/locate/located/
> 

Will fix.

>> +  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>> +    {
>> +      data->nstmts += orig_count;
>> +
>> +      if (kind == scalar_load || kind == vector_load || kind == unaligned_load
>> +	  || kind == vector_gather_load)
> 
> Cosmetically, I'd move the second "||" to the next line to balance
> those two lines a bit more. 
> 

Will fix.

>> +	data->nloads += orig_count;
>> +
>> +      /* If we have strided or elementwise loads into a vector, it's
>> +	 possible to be bounded by latency and execution resources for
>> +	 many scalar loads.  Try to account for this by scaling the
>> +	 construction cost by the number of elements involved, when
>> +	 handling each matching statement we record the possible extra
>> +	 penalized cost into target cost, in the end of costing for
>> +	 the whole loop, we do the actual penalization once some load
>> +	 density heuristics are satisfied.  */
>> +      if (kind == vec_construct && stmt_info
>> +	  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>> +	  && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
>> +	      || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
>> +	{
>> +	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>> +	  unsigned int nunits = vect_nunits_for_cost (vectype);
>> +	  unsigned int extra_cost = nunits * stmt_cost;
>> +	  /* As function rs6000_builtin_vectorization_cost shows, we have
>> +	     priced much on V16QI/V8HI vector construction as their units,
>> +	     if we penalize them with nunits * stmt_cost, it can result in
>> +	     a unreliable body cost, eg: for V16QI on Power8, stmt_cost is
> s/a/an/ 

Will fix.

>> +	     20 and nunits is 16, the extra cost is 320 which looks much
>> +	     exaggerated.  So let's use one maximum bound for the extra
>> +	     penalized cost for vector construction here.  */
>> +	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>> +	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>> +	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> +	  data->extra_ctor_cost += extra_cost;
>> +	}
>> +    }
>> +}
> ok
> 
>> +
>>
>>  /* Implement targetm.vectorize.add_stmt_cost.  */
>>
>>  static unsigned
>> @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>>        /* Statements in an inner loop relative to the loop being
>>  	 vectorized are weighted more heavily.  The value here is
>>  	 arbitrary and could potentially be improved with analysis.  */
>> +      unsigned int orig_count = count;
> 
> I don't see any other assignments to orig_count.  Does 'count' get
> updated elsewhere in rs6000_add_stmt_cost() that the new orig_count
> variable is necessary?
> 

Yeah, the 'count' could get updated but the default "git diff" doesn't
show it, the codes omitted look as below:

      if (where == vect_body && stmt_info
	  && stmt_in_inner_loop_p (vinfo, stmt_info))
	{
	  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
	  gcc_assert (loop_vinfo);
	  count *= LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo); /* FIXME.  */
	}

BR,
Kewen

>>
>>        if (where == vect_body && stmt_info
>>  	  && stmt_in_inner_loop_p (vinfo, stmt_info))
>>  	{
>> @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>>        retval = (unsigned) (count * stmt_cost);
>>        cost_data->cost[where] += retval;
>>
>> -      /* Check whether we're doing something other than just a copy loop.
>> -	 Not all such loops may be profitably vectorized; see
>> -	 rs6000_finish_cost.  */
>> -      if ((kind == vec_to_scalar || kind == vec_perm
>> -	   || kind == vec_promote_demote || kind == vec_construct
>> -	   || kind == scalar_to_vec)
>> -	  || (where == vect_body && kind == vector_stmt))
>> -	cost_data->vect_nonmem = true;
>> +      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
>> +					  stmt_cost, orig_count);
>>
>>      }
>>
>>    return retval;
>>
> 
>

next prev parent reply	other threads:[~2021-07-28  3:00 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-07  2:29 [PATCH] rs6000: Adjust rs6000_density_test for strided_load Kewen.Lin
2021-05-26  2:59 ` [PATCH v2] rs6000: Add load density heuristic Kewen.Lin
2021-06-09  2:26   ` PING^1 " Kewen.Lin
2021-06-28  7:01     ` PING^2 " Kewen.Lin
2021-07-15  1:59       ` PING^3 " Kewen.Lin
2021-07-27 22:25   ` will schmidt
2021-07-28  2:59     ` Kewen.Lin [this message]
2021-09-06 23:43       ` Segher Boessenkool
2021-09-08  7:01         ` Kewen.Lin
2021-07-28  5:22   ` [PATCH v3] " Kewen.Lin
2021-09-03 15:57     ` Bill Schmidt
2021-09-08  6:57       ` [PATCH v4] " Kewen.Lin
2021-09-08  8:28         ` Kewen.Lin
2021-09-09 16:11         ` Segher Boessenkool
2021-09-09 17:19           ` Bill Schmidt
2021-09-09 17:39             ` Bill Schmidt
2021-09-09 18:24             ` Segher Boessenkool
2021-09-10  3:22             ` Kewen.Lin
2021-09-10  3:46               ` Kewen.Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=91063938-9ac4-93ed-c438-abf25d4eca05@linux.ibm.com \
    --to=linkw@linux.ibm.com \
    --cc=dje.gcc@gmail.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=segher@kernel.crashing.org \
    --cc=will_schmidt@vnet.ibm.com \
    --cc=wschmidt@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).