Re: [PATCH v2] vect/rs6000: Support vector with length cost modeling


From: Richard Sandiford <richard.sandiford@arm.com>
To: "Kewen.Lin" <linkw@linux.ibm.com>
Cc: Richard Biener <richard.guenther@gmail.com>,
	GCC Patches <gcc-patches@gcc.gnu.org>,
	Bill Schmidt <wschmidt@linux.ibm.com>,
	Segher Boessenkool <segher@kernel.crashing.org>
Subject: Re: [PATCH v2] vect/rs6000: Support vector with length cost modeling
Date: Wed, 22 Jul 2020 10:11:46 +0100
Message-ID: <mptv9ifud65.fsf@arm.com>
In-Reply-To: <a06e714e-04c3-8a2f-fa1d-02a72aecf7f4@linux.ibm.com> (Kewen Lin's message of "Wed, 22 Jul 2020 09:26:39 +0800")

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> on 2020/7/21 3:57 PM, Richard Biener wrote:
>> On Tue, Jul 21, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>
>>> Hi,
>>>
>>> This patch adds cost modeling for vector with length.  It mainly
>>> follows what we generate for vector with length in the functions
>>> vect_set_loop_controls_directly and vect_gen_len in the worst case.
>>>
>>> For Power, the length is expected to be in bits 0-7 (the high bits),
>>> so we have to model the cost of shifting it into place.  To keep other
>>> targets from paying this cost, I used a target hook to describe the
>>> extra cost; I'm not sure whether that is the right approach.
>>>
>>> Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
>>> param vect-partial-vector-usage=1.
>>>
>>> Any comments/suggestions are highly appreciated!
>> 
>> I don't like the introduction of an extra target hook for this.  All
>> vectorizer cost modeling should ideally go through
>> init_cost/add_stmt_cost/finish_cost.  If the extra costing is
>> not per stmt then either init_cost or finish_cost is appropriate.
>> Currently init_cost only gets a struct loop while we should
>> probably give it a vec_info * parameter so targets can
>> check LOOP_VINFO_USING_PARTIAL_VECTORS_P and friends.
>> 
>
> Thanks!  Nice, your suggested way looks better.  I've removed the hook
> and taken care of it in finish_cost.  The updated v2 is attached.
>
> Bootstrapped/regtested again on powerpc64le-linux-gnu (P9) with explicit
> param vect-partial-vector-usage=1.
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
> 	* config/rs6000/rs6000.c (adjust_vect_cost): New function.
> 	(rs6000_finish_cost): Call function adjust_vect_cost.
> 	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
> 	modeling for vector with length.
>
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 5a4f07d5810..f2724e792c9 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>    return retval;
>  }
>  
> +/* For some target specific vectorization cost which can't be handled per stmt,
> +   we check the requisite conditions and adjust the vectorization cost
> +   accordingly if satisfied.  One typical example is to model shift cost for
> +   vector with length by counting number of required lengths under condition
> +   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
> +
> +static void
> +adjust_vect_cost (rs6000_cost_data *data)
> +{
> +  struct loop *loop = data->loop_info;
> +  gcc_assert (loop);
> +  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
> +
> +  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +    {
> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      unsigned int shift_cnt = 0;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
> +	if (rgc->type)
> +	  /* Each length needs one shift to fill into bits 0-7.  */
> +	  shift_cnt += (num_vectors_m1 + 1);
> +
> +      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
> +			    NULL, NULL_TREE, 0, vect_body);
> +    }
> +}
> +
>  /* Implement targetm.vectorize.finish_cost.  */
>  
>  static void
> @@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
>    rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
>  
>    if (cost_data->loop_info)
> -    rs6000_density_test (cost_data);
> +    {
> +      adjust_vect_cost (cost_data);
> +      rs6000_density_test (cost_data);
> +    }
>  
>    /* Don't vectorize minimum-vectorization-factor, simple copy loops
>       that require versioning for any reason.  The vectorization is at
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index e933441b922..99e1fd7bdd0 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -3652,7 +3652,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>       TODO: Build an expression that represents peel_iters for prologue and
>       epilogue to be used in a run-time test.  */
>  
> -  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
>      {
>        peel_iters_prologue = 0;
>        peel_iters_epilogue = 0;
> @@ -3663,45 +3663,145 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>  	  peel_iters_epilogue += 1;
>  	  stmt_info_for_cost *si;
>  	  int j;
> -	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
> -			    j, si)
> +	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
> +			    si)
>  	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
>  				  si->kind, si->stmt_info, si->vectype,
>  				  si->misalign, vect_epilogue);
>  	}
>  
> -      /* Calculate how many masks we need to generate.  */
> -      unsigned int num_masks = 0;
> -      rgroup_controls *rgm;
> -      unsigned int num_vectors_m1;
> -      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
> -	if (rgm->type)
> -	  num_masks += num_vectors_m1 + 1;
> -      gcc_assert (num_masks > 0);
> -
> -      /* In the worst case, we need to generate each mask in the prologue
> -	 and in the loop body.  One of the loop body mask instructions
> -	 replaces the comparison in the scalar loop, and since we don't
> -	 count the scalar comparison against the scalar body, we shouldn't
> -	 count that vector instruction against the vector body either.
> -
> -	 Sometimes we can use unpacks instead of generating prologue
> -	 masks and sometimes the prologue mask will fold to a constant,
> -	 so the actual prologue cost might be smaller.  However, it's
> -	 simpler and safer to use the worst-case cost; if this ends up
> -	 being the tie-breaker between vectorizing or not, then it's
> -	 probably better not to vectorize.  */
> -      (void) add_stmt_cost (loop_vinfo,
> -			    target_cost_data, num_masks, vector_stmt,
> -			    NULL, NULL_TREE, 0, vect_prologue);
> -      (void) add_stmt_cost (loop_vinfo,
> -			    target_cost_data, num_masks - 1, vector_stmt,
> -			    NULL, NULL_TREE, 0, vect_body);
> -    }
> -  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> -    {
> -      peel_iters_prologue = 0;
> -      peel_iters_epilogue = 0;
> +      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +	{
> +	  /* Calculate how many masks we need to generate.  */
> +	  unsigned int num_masks = 0;
> +	  rgroup_controls *rgm;
> +	  unsigned int num_vectors_m1;
> +	  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
> +	    if (rgm->type)
> +	      num_masks += num_vectors_m1 + 1;
> +	  gcc_assert (num_masks > 0);
> +
> +	  /* In the worst case, we need to generate each mask in the prologue
> +	     and in the loop body.  One of the loop body mask instructions
> +	     replaces the comparison in the scalar loop, and since we don't
> +	     count the scalar comparison against the scalar body, we shouldn't
> +	     count that vector instruction against the vector body either.
> +
> +	     Sometimes we can use unpacks instead of generating prologue
> +	     masks and sometimes the prologue mask will fold to a constant,
> +	     so the actual prologue cost might be smaller.  However, it's
> +	     simpler and safer to use the worst-case cost; if this ends up
> +	     being the tie-breaker between vectorizing or not, then it's
> +	     probably better not to vectorize.  */
> +	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
> +				vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
> +	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
> +				vector_stmt, NULL, NULL_TREE, 0, vect_body);
> +	}
> +      else
> +	{
> +	  gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
> +
> +	  /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT.  */
> +	  if (npeel < 0)
> +	    {
> +	      peel_iters_prologue = assumed_vf / 2;
> +	      /* See below, if peeled iterations are unknown, count a taken
> +		 branch and a not taken branch per peeled loop.  */
> +	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
> +				    cond_branch_taken, NULL, NULL_TREE, 0,
> +				    vect_prologue);
> +	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
> +				    cond_branch_not_taken, NULL, NULL_TREE, 0,
> +				    vect_prologue);
> +	    }
> +	  else
> +	    {
> +	      peel_iters_prologue = npeel;
> +	      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +		/* See vect_get_known_peeling_cost, if peeled iterations are
> +		   known but number of scalar loop iterations are unknown, count
> +		   a taken branch per peeled loop.  */
> +		(void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
> +				      cond_branch_taken, NULL, NULL_TREE, 0,
> +				      vect_prologue);
> +	    }

I think it'd be good to avoid duplicating this.  How about the
following structure?

  if (vect_use_loop_mask_for_alignment_p (…))
    {
      peel_iters_prologue = 0;
      peel_iters_epilogue = 0;
    }
  else if (npeel < 0)
    {
      … // A
    }
  else
    {
      …vect_get_known_peeling_cost stuff…
    }

but in A and vect_get_known_peeling_cost, set peel_iters_epilogue to:

  LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0

for LOOP_VINFO_USING_PARTIAL_VECTORS_P, instead of setting it to
whatever value we'd normally use.  Then wrap:

      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
			    NULL, NULL_TREE, 0, vect_epilogue);
      (void) add_stmt_cost (loop_vinfo,
			    target_cost_data, 1, cond_branch_not_taken,
			    NULL, NULL_TREE, 0, vect_epilogue);

in !LOOP_VINFO_USING_PARTIAL_VECTORS_P and make the other vect_epilogue
stuff in A conditional on peel_iters_epilogue != 0.
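
To make the intended control flow concrete, here is a rough standalone
sketch of the peeling estimate described above.  It is not the real
vectorizer code: peel_inputs, peel_estimate and estimate_peeling are
hypothetical names, known_epilogue_iters stands in for whatever
vect_get_known_peeling_cost would compute, and I'm assuming that case A
normally uses assumed_vf / 2 for the epilogue as well as the prologue.

```cpp
// Hypothetical stand-in for the loop_vec_info state consulted here.
struct peel_inputs
{
  bool mask_for_alignment;     // vect_use_loop_mask_for_alignment_p (...)
  bool using_partial_vectors;  // LOOP_VINFO_USING_PARTIAL_VECTORS_P
  bool peeling_for_gaps;       // LOOP_VINFO_PEELING_FOR_GAPS
  int npeel;                   // < 0 means the peel count is unknown
  int assumed_vf;
  int known_epilogue_iters;    // placeholder for the known-peeling result
};

struct peel_estimate
{
  int prologue;
  int epilogue;
  bool cost_epilogue_branches;  // taken + not-taken pair from case A
};

static peel_estimate
estimate_peeling (const peel_inputs &in)
{
  peel_estimate out = { 0, 0, false };

  // Alignment is handled by the loop mask, so nothing is peeled.
  if (in.mask_for_alignment)
    return out;

  // With partial vectors the only scalar epilogue iteration left is the
  // one needed for gaps.
  const int gaps_only = in.peeling_for_gaps ? 1 : 0;

  if (in.npeel < 0)
    {
      // Case A: the number of peeled iterations is unknown.
      out.prologue = in.assumed_vf / 2;
      out.epilogue = (in.using_partial_vectors
		      ? gaps_only : in.assumed_vf / 2);
      out.cost_epilogue_branches = !in.using_partial_vectors;
    }
  else
    {
      // Known peel count: the vect_get_known_peeling_cost path.
      out.prologue = in.npeel;
      out.epilogue = (in.using_partial_vectors
		      ? gaps_only : in.known_epilogue_iters);
    }
  return out;
}
```

The per-statement vect_epilogue costs would then only be added when the
returned epilogue count is nonzero, which covers the peeling-for-gaps
case without a separate code path.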

This will also remove the need for the existing LOOP_VINFO_FULLY_MASKED_P
code:

      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
	{
	  /* We need to peel exactly one iteration.  */
	  peel_iters_epilogue += 1;
	  stmt_info_for_cost *si;
	  int j;
	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
			    j, si)
	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
				  si->kind, si->stmt_info, si->vectype,
				  si->misalign, vect_epilogue);
	}

Then, after the above, have:

  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
    …add costs for mask overhead…
  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
    …add costs for lengths overhead…

So we'd have one block of code for estimating the prologue and epilogue
peeling cost, and a separate block of code for the loop control overhead.
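
As an illustration of the second block only, the loop control overhead
could be modelled along these lines.  Again this is a simplified sketch
rather than the real code: rgroup, control_cost, count_controls and
control_overhead are hypothetical names.  The mask counts follow the
quoted patch (every mask in the prologue, one fewer in the body), while
the length costs are left as a caller-supplied placeholder since they
aren't settled here.

```cpp
#include <cassert>
#include <vector>

enum class control_kind { none, mask, length };

// Hypothetical stand-in for rgroup_controls; has_type mirrors the
// "rgm->type" check in the patch.
struct rgroup
{
  bool has_type;
};

struct control_cost
{
  unsigned prologue_stmts;
  unsigned body_stmts;
};

// Entry I of the rgroup vector describes I + 1 vectors, so a used entry
// contributes I + 1 controls.
static unsigned
count_controls (const std::vector<rgroup> &rgroups)
{
  unsigned n = 0;
  for (unsigned i = 0; i < rgroups.size (); ++i)
    if (rgroups[i].has_type)
      n += i + 1;
  return n;
}

static control_cost
control_overhead (control_kind kind, const std::vector<rgroup> &rgroups,
		  const control_cost &length_cost)
{
  switch (kind)
    {
    case control_kind::mask:
      {
	unsigned num_masks = count_controls (rgroups);
	assert (num_masks > 0);
	// Worst case: each mask is generated in the prologue and in the
	// body, but one body mask replaces the scalar loop comparison.
	return { num_masks, num_masks - 1 };
      }
    case control_kind::length:
      // Placeholder: whatever the length cost model decides on.
      return length_cost;
    case control_kind::none:
      break;
    }
  return { 0, 0 };
}
```

Keeping this separate from the peeling estimate means the mask and
length cases share the peeling logic and differ only in this function.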

Thanks,
Richard

