Hi Richard,

Thanks for the review!

on 2020/7/22 5:11 PM, Richard Sandiford wrote:
> "Kewen.Lin" writes:
>> -  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
>> -    {
>> -      peel_iters_prologue = 0;
>> -      peel_iters_epilogue = 0;
>> +  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>> +    {
>> +      /* Calculate how many masks we need to generate.  */
>> +      unsigned int num_masks = 0;
>> +      rgroup_controls *rgm;
>> +      unsigned int num_vectors_m1;
>> +      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
>> +        if (rgm->type)
>> +          num_masks += num_vectors_m1 + 1;
>> +      gcc_assert (num_masks > 0);
>> +
>> +      /* In the worst case, we need to generate each mask in the prologue
>> +         and in the loop body.  One of the loop body mask instructions
>> +         replaces the comparison in the scalar loop, and since we don't
>> +         count the scalar comparison against the scalar body, we shouldn't
>> +         count that vector instruction against the vector body either.
>> +
>> +         Sometimes we can use unpacks instead of generating prologue
>> +         masks and sometimes the prologue mask will fold to a constant,
>> +         so the actual prologue cost might be smaller.  However, it's
>> +         simpler and safer to use the worst-case cost; if this ends up
>> +         being the tie-breaker between vectorizing or not, then it's
>> +         probably better not to vectorize.  */
>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
>> +                            vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
>> +                            vector_stmt, NULL, NULL_TREE, 0, vect_body);
>> +    }
>> +  else
>> +    {
>> +      gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
>> +
>> +      /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT.  */
>> +      if (npeel < 0)
>> +        {
>> +          peel_iters_prologue = assumed_vf / 2;
>> +          /* See below, if peeled iterations are unknown, count a taken
>> +             branch and a not taken branch per peeled loop.
>> +             */
>> +          (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
>> +                                cond_branch_taken, NULL, NULL_TREE, 0,
>> +                                vect_prologue);
>> +          (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
>> +                                cond_branch_not_taken, NULL, NULL_TREE, 0,
>> +                                vect_prologue);
>> +        }
>> +      else
>> +        {
>> +          peel_iters_prologue = npeel;
>> +          if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>> +            /* See vect_get_known_peeling_cost, if peeled iterations are
>> +               known but number of scalar loop iterations are unknown, count
>> +               a taken branch per peeled loop.  */
>> +            (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
>> +                                  cond_branch_taken, NULL, NULL_TREE, 0,
>> +                                  vect_prologue);
>> +        }
>
> I think it'd be good to avoid duplicating this.  How about the
> following structure?
>
>   if (vect_use_loop_mask_for_alignment_p (…))
>     {
>       peel_iters_prologue = 0;
>       peel_iters_epilogue = 0;
>     }
>   else if (npeel < 0)
>     {
>       … // A
>     }
>   else
>     {
>       …vect_get_known_peeling_cost stuff…
>     }
>
> but in A and vect_get_known_peeling_cost, set peel_iters_epilogue to:
>
>   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0
>
> for LOOP_VINFO_USING_PARTIAL_VECTORS_P, instead of setting it to
> whatever value we'd normally use.  Then wrap:
>
>   (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
>                         NULL, NULL_TREE, 0, vect_epilogue);
>   (void) add_stmt_cost (loop_vinfo,
>                         target_cost_data, 1, cond_branch_not_taken,
>                         NULL, NULL_TREE, 0, vect_epilogue);
>
> in !LOOP_VINFO_USING_PARTIAL_VECTORS_P and make the other vect_epilogue
> stuff in A conditional on peel_iters_epilogue != 0.
>
> This will also remove the need for the existing LOOP_VINFO_FULLY_MASKED_P
> code:
>
>   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
>     {
>       /* We need to peel exactly one iteration.
>          */
>       peel_iters_epilogue += 1;
>       stmt_info_for_cost *si;
>       int j;
>       FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
>                         j, si)
>         (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
>                               si->kind, si->stmt_info, si->vectype,
>                               si->misalign, vect_epilogue);
>     }
>
> Then, after the above, have:
>
>   if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>     …add costs for mask overhead…
>   else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
>     …add costs for lengths overhead…
>
> So we'd have one block of code for estimating the prologue and epilogue
> peeling cost, and a separate block of code for the loop control overhead.

It's a great idea!  Following your subsequent suggestion, I made the
structure like:

  - calculate peel_iters_prologue
  - calculate peel_iters_epilogue
  - add costs associated with peel_iters_prologue
  - add costs associated with peel_iters_epilogue
  - add costs related to branch taken/not_taken

The updated v3 is attached.  It has just been bootstrapped/regtested on
powerpc64le-linux-gnu (P9) with the explicit param
vect-partial-vector-usage=1; I'll test it without any partial vector
setting, and also on aarch64, later.

BR,
Kewen

-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (adjust_vect_cost_per_loop): New function.
	(rs6000_finish_cost): Call adjust_vect_cost_per_loop.
	* tree-vect-loop.c (vect_get_known_peeling_cost): Factor out some
	code to determine peel_iters_epilogue to function ...
	(vect_get_peel_iters_epilogue): ... this.  New function.
	(vect_estimate_min_profitable_iters): Add cost modeling for vector
	with length, refactor cost calculation on peel_iters_prologue and
	peel_iters_epilogue.