public inbox for gcc-patches@gcc.gnu.org
* [PATCH] vect: Support vector with length cost modeling
@ 2020-07-21  5:51 Kewen.Lin
  2020-07-21  7:57 ` Richard Biener
  0 siblings, 1 reply; 23+ messages in thread
From: Kewen.Lin @ 2020-07-21  5:51 UTC (permalink / raw)
  To: GCC Patches
  Cc: Bill Schmidt, Segher Boessenkool, Richard Sandiford, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 955 bytes --]

Hi,

This patch adds cost modeling for vector with length; it mainly
follows what we generate for vector with length in the functions
vect_set_loop_controls_directly and vect_gen_len, assuming the
worst case.

For Power, the length is expected to be in bits 0-7 (the high bits),
so we have to model the cost of shifting the length into place.  To
avoid making other targets pay for this, I used a target hook to
describe this extra cost; I'm not sure whether that's the correct
approach.
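
For illustration, here is a minimal standalone C sketch (assuming
16-byte vectors and byte elements; this is not the exact GIMPLE we
generate) of the per-iteration work whose worst case is being costed:

  void
  copy_with_len (unsigned char *dst, const unsigned char *src,
		 unsigned long n)
  {
    const unsigned long nitems = 16;	/* bytes per vector access */
    for (unsigned long i = 0; i < n; i += nitems)
      {
	unsigned long rem = n - i;			 /* one MINUS */
	unsigned long len = rem < nitems ? rem : nitems; /* one MIN */
	/* This scalar loop stands in for one lxvl/stxvl pair; on Power
	   the length operand would additionally be shifted into the
	   high byte, i.e. len << 56 -- the extra shift cost that the
	   new hook is meant to describe.  */
	for (unsigned long j = 0; j < len; j++)
	  dst[i + j] = src[i + j];
      }
  }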

Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
param vect-partial-vector-usage=1.

Any comments/suggestions are highly appreciated!

BR,
Kewen
-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (TARGET_VECTORIZE_EXTRA_LENGTH_COST): New
	macro.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in (TARGET_VECTORIZE_EXTRA_LENGTH_COST): New target hook.
	* target.def (extra_length_cost): Likewise.
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
	modeling for vector with length.

[-- Attachment #2: cost_v1.diff --]
[-- Type: text/plain, Size: 13038 bytes --]

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 5a4f07d5810..1c5f02796f5 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1668,6 +1668,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_DOLOOP_COST_FOR_ADDRESS
 #define TARGET_DOLOOP_COST_FOR_ADDRESS 1000000000
 
+#undef TARGET_VECTORIZE_EXTRA_LENGTH_COST
+#define TARGET_VECTORIZE_EXTRA_LENGTH_COST 1
+
 #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
 #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 6e7d9dc54a9..ef37b5c1d6d 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6106,6 +6106,14 @@ This hook should complete calculations of the cost of vectorizing a loop or basi
 This hook should release @var{data} and any related data structures allocated by TARGET_VECTORIZE_INIT_COST.  The default releases the accumulator.
 @end deftypefn
 
+@deftypevr {Target Hook} unsigned TARGET_VECTORIZE_EXTRA_LENGTH_COST
+For loop vectorization using length-based partial vectors, some targets
+need extra operations to prepare the length; for example, on Power one
+shift is required to encode the length into bits 0-7.  This hook provides
+a way to account for this kind of extra cost.
+The default value is zero.
+@end deftypevr
+
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_GATHER (const_tree @var{mem_vectype}, const_tree @var{index_type}, int @var{scale})
 Target builtin that implements vector gather operation.  @var{mem_vectype}
 is the vector type of the load and @var{index_type} is scalar type of
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3be984bbd5c..ae9a513f529 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4191,6 +4191,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_DESTROY_COST_DATA
 
+@hook TARGET_VECTORIZE_EXTRA_LENGTH_COST
+
 @hook TARGET_VECTORIZE_BUILTIN_GATHER
 
 @hook TARGET_VECTORIZE_BUILTIN_SCATTER
diff --git a/gcc/target.def b/gcc/target.def
index 07059a87caf..3134be7ea7b 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2058,6 +2058,16 @@ DEFHOOK
  (void *data),
  default_destroy_cost_data)
 
+/* For target-specific cost on length-based vectorization.  */
+DEFHOOKPOD
+(extra_length_cost,
+ "For loop vectorization using length-based partial vectors, some targets\n\
+need extra operations for length preparation, like one shift operation is\n\
+required on Power to make length be encoded in bits 0-7.  This hook is to\n\
+provide a way for this kind of extra cost.\n\
+The default value is zero.",
+ unsigned, 0)
+
 HOOK_VECTOR_END (vectorize)
 
 #undef HOOK_PREFIX
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index e933441b922..294a445afac 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3652,7 +3652,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
      TODO: Build an expression that represents peel_iters for prologue and
      epilogue to be used in a run-time test.  */
 
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       peel_iters_prologue = 0;
       peel_iters_epilogue = 0;
@@ -3663,45 +3663,149 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	  peel_iters_epilogue += 1;
 	  stmt_info_for_cost *si;
 	  int j;
-	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
-			    j, si)
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
+			    si)
 	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
 				  si->kind, si->stmt_info, si->vectype,
 				  si->misalign, vect_epilogue);
 	}
 
-      /* Calculate how many masks we need to generate.  */
-      unsigned int num_masks = 0;
-      rgroup_controls *rgm;
-      unsigned int num_vectors_m1;
-      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
-	if (rgm->type)
-	  num_masks += num_vectors_m1 + 1;
-      gcc_assert (num_masks > 0);
-
-      /* In the worst case, we need to generate each mask in the prologue
-	 and in the loop body.  One of the loop body mask instructions
-	 replaces the comparison in the scalar loop, and since we don't
-	 count the scalar comparison against the scalar body, we shouldn't
-	 count that vector instruction against the vector body either.
-
-	 Sometimes we can use unpacks instead of generating prologue
-	 masks and sometimes the prologue mask will fold to a constant,
-	 so the actual prologue cost might be smaller.  However, it's
-	 simpler and safer to use the worst-case cost; if this ends up
-	 being the tie-breaker between vectorizing or not, then it's
-	 probably better not to vectorize.  */
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks - 1, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_body);
-    }
-  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
-    {
-      peel_iters_prologue = 0;
-      peel_iters_epilogue = 0;
+      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+	{
+	  /* Calculate how many masks we need to generate.  */
+	  unsigned int num_masks = 0;
+	  rgroup_controls *rgm;
+	  unsigned int num_vectors_m1;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
+	    if (rgm->type)
+	      num_masks += num_vectors_m1 + 1;
+	  gcc_assert (num_masks > 0);
+
+	  /* In the worst case, we need to generate each mask in the prologue
+	     and in the loop body.  One of the loop body mask instructions
+	     replaces the comparison in the scalar loop, and since we don't
+	     count the scalar comparison against the scalar body, we shouldn't
+	     count that vector instruction against the vector body either.
+
+	     Sometimes we can use unpacks instead of generating prologue
+	     masks and sometimes the prologue mask will fold to a constant,
+	     so the actual prologue cost might be smaller.  However, it's
+	     simpler and safer to use the worst-case cost; if this ends up
+	     being the tie-breaker between vectorizing or not, then it's
+	     probably better not to vectorize.  */
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
+				vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
+				vector_stmt, NULL, NULL_TREE, 0, vect_body);
+	}
+      else
+	{
+	  gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
+
+	  /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT.  */
+	  if (npeel < 0)
+	    {
+	      peel_iters_prologue = assumed_vf / 2;
+	      /* See below, if peeled iterations are unknown, count a taken
+		 branch and a not taken branch per peeled loop.  */
+	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+				    cond_branch_taken, NULL, NULL_TREE, 0,
+				    vect_prologue);
+	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+				    cond_branch_not_taken, NULL, NULL_TREE, 0,
+				    vect_prologue);
+	    }
+	  else
+	    {
+	      peel_iters_prologue = npeel;
+	      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+		/* See vect_get_known_peeling_cost, if peeled iterations are
+		   known but the number of scalar loop iterations is unknown,
+		   count a taken branch per peeled loop.  */
+		(void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+				      cond_branch_taken, NULL, NULL_TREE, 0,
+				      vect_prologue);
+	    }
+
+	  stmt_info_for_cost *si;
+	  int j;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
+			    si)
+	    (void) add_stmt_cost (loop_vinfo, target_cost_data,
+				  si->count * peel_iters_prologue, si->kind,
+				  si->stmt_info, si->vectype, si->misalign,
+				  vect_prologue);
+
+	  /* See the functions vect_set_loop_condition_partial_vectors
+	     and vect_set_loop_controls_directly: we need to generate each
+	     length in the prologue and in the loop body if required.
+	     Although some optimizations are possible, we consider the
+	     worst case here.  */
+
+	  /* For now we only use length-based partial vectors on Power,
+	     which always has a constant VF; the code below needs some
+	     tweaks if that no longer holds in future.  */
+	  gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
+
+	  /* For wrap around checking.  */
+	  tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+	  unsigned int compare_precision = TYPE_PRECISION (compare_type);
+	  widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+	  bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+	  bool need_iterate_p
+	    = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	       && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+	  /* Init min/max, shift and minus cost relative to single scalar_stmt.
+	     FIXME: For now we only use length-based partial vectors on Power;
+	     target-specific costs may be needed for other ports.  */
+	  unsigned int min_max_cost = 2;
+	  unsigned int shift_cost = 1, minus_cost = 1;
+
+	  /* Init cost relative to single scalar_stmt.  */
+	  unsigned int prol_cnt = 0;
+	  unsigned int body_cnt = 0;
+
+	  rgroup_controls *rgc;
+	  unsigned int num_vectors_m1;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	    if (rgc->type)
+	      {
+		unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
+
+		/* Need one shift for niters_total computation.  */
+		if (!niters_known_p && nitems != 1)
+		  prol_cnt += shift_cost;
+
+		/* Need to handle wrap around.  */
+		if (iv_limit == -1
+		    || (wi::min_precision (iv_limit * nitems, UNSIGNED)
+			> compare_precision))
+		  prol_cnt += (min_max_cost + minus_cost);
+
+		/* Need to handle the batch limit for all but the 1st one.  */
+		prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
+
+		unsigned int num_vectors = num_vectors_m1 + 1;
+		/* Need to set up lengths in prologue; only one MIN is
+		   required since the start index is zero.  */
+		prol_cnt += min_max_cost * num_vectors;
+
+		/* Probably need some extra cost to transform this length,
+		   like the shift on Power.  */
+		body_cnt += targetm.vectorize.extra_length_cost * num_vectors;
+
+		/* Need to update lengths in body for next iteration.  */
+		if (need_iterate_p)
+		  body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
+	      }
+
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt,
+				scalar_stmt, NULL, NULL_TREE, 0, vect_prologue);
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt,
+				scalar_stmt, NULL, NULL_TREE, 0, vect_body);
+	}
     }
   else if (npeel < 0)
     {
@@ -3913,8 +4017,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* ??? The "if" arm is written to handle all cases; see below for what
-     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+     we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* Rewriting the condition above in terms of the number of
 	 vector iterations (vniters) rather than the number of
@@ -3941,7 +4045,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
 		     min_vec_niters);
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  /* Now that we know the minimum number of vector iterations,
 	     find the minimum niters for which the scalar cost is larger:
@@ -3996,6 +4100,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
+  else if (min_profitable_iters < peel_iters_prologue)
+    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
+       vectorized loop to execute at least once.  */
+    min_profitable_iters = peel_iters_prologue;
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -4013,7 +4121,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
   if (vec_outside_cost <= 0)
     min_profitable_estimate = 0;
-  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* This is a repeat of the code above, but with + SOC rather
 	 than - SOC.  */
@@ -4025,7 +4133,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       if (outside_overhead > 0)
 	min_vec_niters = outside_overhead / saving_per_viter + 1;
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  int threshold = (vec_inside_cost * min_vec_niters
 			   + vec_outside_cost


* Re: [PATCH] vect: Support vector with length cost modeling
  2020-07-21  5:51 [PATCH] vect: Support vector with length cost modeling Kewen.Lin
@ 2020-07-21  7:57 ` Richard Biener
  2020-07-22  1:26   ` [PATCH v2] vect/rs6000: " Kewen.Lin
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Biener @ 2020-07-21  7:57 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Bill Schmidt, Segher Boessenkool, Richard Sandiford

On Tue, Jul 21, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi,
>
> This patch adds cost modeling for vector with length; it mainly
> follows what we generate for vector with length in the functions
> vect_set_loop_controls_directly and vect_gen_len, assuming the
> worst case.
>
> For Power, the length is expected to be in bits 0-7 (the high bits),
> so we have to model the cost of shifting the length into place.  To
> avoid making other targets pay for this, I used a target hook to
> describe this extra cost; I'm not sure whether that's the correct
> approach.
>
> Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
> param vect-partial-vector-usage=1.
>
> Any comments/suggestions are highly appreciated!

I don't like the introduction of an extra target hook for this.  All
vectorizer cost modeling should ideally go through
init_cost/add_stmt_cost/finish_cost.  If the extra costing is
not per stmt then either init_cost or finish_cost is appropriate.
Currently init_cost only gets a struct loop while we should
probably give it a vec_info * parameter so targets can
check LOOP_VINFO_USING_PARTIAL_VECTORS_P and friends.
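
For illustration, a hypothetical sketch of that direction (the function
name, signature, and body here are assumptions about what the reworked
hook could look like, not the actual API as it stands; dyn_cast,
loop_vec_info, LOOP_VINFO_USING_PARTIAL_VECTORS_P and XNEWVEC are the
existing GCC facilities):

  /* Hypothetical: init_cost takes a vec_info * so the target can see
     partial-vector state when the cost accumulator is created.  */
  static void *
  example_init_cost (class vec_info *vinfo)
  {
    loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
    if (loop_vinfo && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
      {
	/* A target could note here that length/mask setup overhead
	   still needs to be added, e.g. in finish_cost.  */
      }
    /* Mirror the default accumulator (prologue/body/epilogue).  */
    unsigned *cost = XNEWVEC (unsigned, 3);
    cost[0] = cost[1] = cost[2] = 0;
    return cost;
  }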

Richard.

> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
>         * config/rs6000/rs6000.c (TARGET_VECTORIZE_EXTRA_LENGTH_COST): New
>         macro.
>         * doc/tm.texi: Regenerate.
>         * doc/tm.texi.in (TARGET_VECTORIZE_EXTRA_LENGTH_COST): New target hook.
>         * target.def (extra_length_cost): Likewise.
>         * tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
>         modeling for vector with length.


* [PATCH v2] vect/rs6000: Support vector with length cost modeling
  2020-07-21  7:57 ` Richard Biener
@ 2020-07-22  1:26   ` Kewen.Lin
  2020-07-22  6:38     ` Richard Biener
                       ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Kewen.Lin @ 2020-07-22  1:26 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Bill Schmidt, Segher Boessenkool, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 1712 bytes --]

Hi Richard,

on 2020/7/21 3:57 PM, Richard Biener wrote:
> On Tue, Jul 21, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi,
>>
>> This patch adds cost modeling for vector with length; it mainly
>> follows what we generate for vector with length in the functions
>> vect_set_loop_controls_directly and vect_gen_len, assuming the
>> worst case.
>>
>> For Power, the length is expected to be in bits 0-7 (the high bits),
>> so we have to model the cost of shifting the length into place.  To
>> avoid making other targets pay for this, I used a target hook to
>> describe this extra cost; I'm not sure whether that's the correct
>> approach.
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
>> param vect-partial-vector-usage=1.
>>
>> Any comments/suggestions are highly appreciated!
> 
> I don't like the introduction of an extra target hook for this.  All
> vectorizer cost modeling should ideally go through
> init_cost/add_stmt_cost/finish_cost.  If the extra costing is
> not per stmt then either init_cost or finish_cost is appropriate.
> Currently init_cost only gets a struct loop while we should
> probably give it a vec_info * parameter so targets can
> check LOOP_VINFO_USING_PARTIAL_VECTORS_P and friends.
> 

Thanks!  Nice, your suggested way looks better.  I've removed the hook
and taken care of it in finish_cost.  The updated v2 is attached.

Bootstrapped/regtested again on powerpc64le-linux-gnu (P9) with explicit
param vect-partial-vector-usage=1.

BR,
Kewen
-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (adjust_vect_cost): New function.
	(rs6000_finish_cost): Call function adjust_vect_cost.
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
	modeling for vector with length.

[-- Attachment #2: cost_v2.diff --]
[-- Type: text/plain, Size: 11987 bytes --]

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 5a4f07d5810..f2724e792c9 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
   return retval;
 }
 
+/* For target-specific vectorization costs which can't be handled per stmt,
+   we check the requisite conditions and adjust the vectorization cost
+   accordingly if they are satisfied.  One typical example is modeling the
+   shift cost for vector with length by counting the number of required
+   lengths under LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
+
+static void
+adjust_vect_cost (rs6000_cost_data *data)
+{
+  struct loop *loop = data->loop_info;
+  gcc_assert (loop);
+  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      unsigned int shift_cnt = 0;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  /* Each length needs one shift to fill into bits 0-7.  */
+	  shift_cnt += (num_vectors_m1 + 1);
+
+      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
+    }
+}
+
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
@@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
   if (cost_data->loop_info)
-    rs6000_density_test (cost_data);
+    {
+      adjust_vect_cost (cost_data);
+      rs6000_density_test (cost_data);
+    }
 
   /* Don't vectorize minimum-vectorization-factor, simple copy loops
      that require versioning for any reason.  The vectorization is at
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index e933441b922..99e1fd7bdd0 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3652,7 +3652,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
      TODO: Build an expression that represents peel_iters for prologue and
      epilogue to be used in a run-time test.  */
 
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       peel_iters_prologue = 0;
       peel_iters_epilogue = 0;
@@ -3663,45 +3663,145 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	  peel_iters_epilogue += 1;
 	  stmt_info_for_cost *si;
 	  int j;
-	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
-			    j, si)
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
+			    si)
 	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
 				  si->kind, si->stmt_info, si->vectype,
 				  si->misalign, vect_epilogue);
 	}
 
-      /* Calculate how many masks we need to generate.  */
-      unsigned int num_masks = 0;
-      rgroup_controls *rgm;
-      unsigned int num_vectors_m1;
-      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
-	if (rgm->type)
-	  num_masks += num_vectors_m1 + 1;
-      gcc_assert (num_masks > 0);
-
-      /* In the worst case, we need to generate each mask in the prologue
-	 and in the loop body.  One of the loop body mask instructions
-	 replaces the comparison in the scalar loop, and since we don't
-	 count the scalar comparison against the scalar body, we shouldn't
-	 count that vector instruction against the vector body either.
-
-	 Sometimes we can use unpacks instead of generating prologue
-	 masks and sometimes the prologue mask will fold to a constant,
-	 so the actual prologue cost might be smaller.  However, it's
-	 simpler and safer to use the worst-case cost; if this ends up
-	 being the tie-breaker between vectorizing or not, then it's
-	 probably better not to vectorize.  */
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks - 1, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_body);
-    }
-  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
-    {
-      peel_iters_prologue = 0;
-      peel_iters_epilogue = 0;
+      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+	{
+	  /* Calculate how many masks we need to generate.  */
+	  unsigned int num_masks = 0;
+	  rgroup_controls *rgm;
+	  unsigned int num_vectors_m1;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
+	    if (rgm->type)
+	      num_masks += num_vectors_m1 + 1;
+	  gcc_assert (num_masks > 0);
+
+	  /* In the worst case, we need to generate each mask in the prologue
+	     and in the loop body.  One of the loop body mask instructions
+	     replaces the comparison in the scalar loop, and since we don't
+	     count the scalar comparison against the scalar body, we shouldn't
+	     count that vector instruction against the vector body either.
+
+	     Sometimes we can use unpacks instead of generating prologue
+	     masks and sometimes the prologue mask will fold to a constant,
+	     so the actual prologue cost might be smaller.  However, it's
+	     simpler and safer to use the worst-case cost; if this ends up
+	     being the tie-breaker between vectorizing or not, then it's
+	     probably better not to vectorize.  */
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
+				vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
+				vector_stmt, NULL, NULL_TREE, 0, vect_body);
+	}
+      else
+	{
+	  gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
+
+	  /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT.  */
+	  if (npeel < 0)
+	    {
+	      peel_iters_prologue = assumed_vf / 2;
+	      /* See below, if peeled iterations are unknown, count a taken
+		 branch and a not taken branch per peeled loop.  */
+	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+				    cond_branch_taken, NULL, NULL_TREE, 0,
+				    vect_prologue);
+	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+				    cond_branch_not_taken, NULL, NULL_TREE, 0,
+				    vect_prologue);
+	    }
+	  else
+	    {
+	      peel_iters_prologue = npeel;
+	      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+		/* See vect_get_known_peeling_cost, if peeled iterations are
+		   known but the number of scalar loop iterations is unknown,
+		   count a taken branch per peeled loop.  */
+		(void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+				      cond_branch_taken, NULL, NULL_TREE, 0,
+				      vect_prologue);
+	    }
+
+	  stmt_info_for_cost *si;
+	  int j;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
+			    si)
+	    (void) add_stmt_cost (loop_vinfo, target_cost_data,
+				  si->count * peel_iters_prologue, si->kind,
+				  si->stmt_info, si->vectype, si->misalign,
+				  vect_prologue);
+
+	  /* See the functions vect_set_loop_condition_partial_vectors
+	     and vect_set_loop_controls_directly: we need to generate each
+	     length in the prologue and in the loop body if required.
+	     Although some optimizations are possible, we consider the
+	     worst case here.  */
+
+	  /* For now we only use length-based partial vectors on Power,
+	     which always has a constant VF; the code below needs some
+	     tweaks if that no longer holds in future.  */
+	  gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
+
+	  /* For wrap around checking.  */
+	  tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+	  unsigned int compare_precision = TYPE_PRECISION (compare_type);
+	  widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+	  bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+	  bool need_iterate_p
+	    = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	       && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+	  /* Init min/max, shift and minus cost relative to single scalar_stmt.
+	     For now we only use length-based partial vectors on Power;
+	     target-specific cost tweaking may be needed for other ports.  */
+	  unsigned int min_max_cost = 2;
+	  unsigned int shift_cost = 1, minus_cost = 1;
+
+	  /* Init cost relative to single scalar_stmt.  */
+	  unsigned int prol_cnt = 0;
+	  unsigned int body_cnt = 0;
+
+	  rgroup_controls *rgc;
+	  unsigned int num_vectors_m1;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	    if (rgc->type)
+	      {
+		unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
+
+		/* Need one shift for niters_total computation.  */
+		if (!niters_known_p && nitems != 1)
+		  prol_cnt += shift_cost;
+
+		/* Need to handle wrap around.  */
+		if (iv_limit == -1
+		    || (wi::min_precision (iv_limit * nitems, UNSIGNED)
+			> compare_precision))
+		  prol_cnt += (min_max_cost + minus_cost);
+
+		/* Need to handle the batch limit for all but the 1st one.  */
+		prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
+
+		unsigned int num_vectors = num_vectors_m1 + 1;
+		/* Need to set up lengths in prologue; only one MIN is
+		   required since the start index is zero.  */
+		prol_cnt += min_max_cost * num_vectors;
+
+		/* Need to update lengths in body for next iteration.  */
+		if (need_iterate_p)
+		  body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
+	      }
+
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt,
+				scalar_stmt, NULL, NULL_TREE, 0, vect_prologue);
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt,
+				scalar_stmt, NULL, NULL_TREE, 0, vect_body);
+	}
     }
   else if (npeel < 0)
     {
@@ -3913,8 +4013,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* ??? The "if" arm is written to handle all cases; see below for what
-     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+     we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* Rewriting the condition above in terms of the number of
 	 vector iterations (vniters) rather than the number of
@@ -3941,7 +4041,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
 		     min_vec_niters);
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  /* Now that we know the minimum number of vector iterations,
 	     find the minimum niters for which the scalar cost is larger:
@@ -3996,6 +4096,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
+  else if (min_profitable_iters < peel_iters_prologue)
+    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
+       vectorized loop to execute at least once.  */
+    min_profitable_iters = peel_iters_prologue;
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -4013,7 +4117,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
   if (vec_outside_cost <= 0)
     min_profitable_estimate = 0;
-  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* This is a repeat of the code above, but with + SOC rather
 	 than - SOC.  */
@@ -4025,7 +4129,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       if (outside_overhead > 0)
 	min_vec_niters = outside_overhead / saving_per_viter + 1;
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  int threshold = (vec_inside_cost * min_vec_niters
 			   + vec_outside_cost


* Re: [PATCH v2] vect/rs6000: Support vector with length cost modeling
  2020-07-22  1:26   ` [PATCH v2] vect/rs6000: " Kewen.Lin
@ 2020-07-22  6:38     ` Richard Biener
  2020-07-22  7:08       ` Kewen.Lin
  2020-07-22  9:11     ` Richard Sandiford
  2020-07-22 17:49     ` [PATCH v2] " Segher Boessenkool
  2 siblings, 1 reply; 23+ messages in thread
From: Richard Biener @ 2020-07-22  6:38 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: GCC Patches, Bill Schmidt, Segher Boessenkool, Richard Sandiford

On Wed, Jul 22, 2020 at 3:26 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richard,
>
> on 2020/7/21 3:57 PM, Richard Biener wrote:
> > On Tue, Jul 21, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi,
> >>
> >> This patch adds cost modeling for vector with length; it mainly
> >> follows what we generate for vector with length in the functions
> >> vect_set_loop_controls_directly and vect_gen_len, assuming the
> >> worst case.
> >>
> >> For Power, the length is expected to be in bits 0-7 (the high bits),
> >> so we have to model the cost of shifting the length into place.  To
> >> avoid making other targets pay for this, I used a target hook to
> >> describe this extra cost; I'm not sure whether that's the correct
> >> approach.
> >>
> >> Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
> >> param vect-partial-vector-usage=1.
> >>
> >> Any comments/suggestions are highly appreciated!
> >
> > I don't like the introduction of an extra target hook for this.  All
> > vectorizer cost modeling should ideally go through
> > init_cost/add_stmt_cost/finish_cost.  If the extra costing is
> > not per stmt then either init_cost or finish_cost is appropriate.
> > Currently init_cost only gets a struct loop while we should
> > probably give it a vec_info * parameter so targets can
> > check LOOP_VINFO_USING_PARTIAL_VECTORS_P and friends.
> >
>
> Thanks!  Nice, your suggested way looks better.  I've removed the hook
> and taken care of it in finish_cost.  The updated v2 is attached.
>
> Bootstrapped/regtested again on powerpc64le-linux-gnu (P9) with explicit
> param vect-partial-vector-usage=1.

LGTM (assuming the first larger hunk is mostly re-indenting
under LOOP_VINFO_USING_PARTIAL_VECTORS_P).

Thanks,
Richard.

> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
>         * config/rs6000/rs6000.c (adjust_vect_cost): New function.
>         (rs6000_finish_cost): Call function adjust_vect_cost.
>         * tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
>         modeling for vector with length.


* Re: [PATCH v2] vect/rs6000: Support vector with length cost modeling
  2020-07-22  6:38     ` Richard Biener
@ 2020-07-22  7:08       ` Kewen.Lin
  0 siblings, 0 replies; 23+ messages in thread
From: Kewen.Lin @ 2020-07-22  7:08 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Bill Schmidt, Segher Boessenkool, Richard Sandiford

Hi Richard,

on 2020/7/22 2:38 PM, Richard Biener wrote:
> On Wed, Jul 22, 2020 at 3:26 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Richard,
>>
>> on 2020/7/21 3:57 PM, Richard Biener wrote:
>>> On Tue, Jul 21, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> This patch adds cost modeling for vector with length; it mainly
>>>> follows what we generate for vector with length in the functions
>>>> vect_set_loop_controls_directly and vect_gen_len, assuming the
>>>> worst case.
>>>>
>>>> For Power, the length is expected to be in bits 0-7 (the high bits),
>>>> so we have to model the cost of shifting the length into place.  To
>>>> avoid making other targets pay for this, I used a target hook to
>>>> describe this extra cost; I'm not sure whether that's the correct
>>>> approach.
>>>>
>>>> Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
>>>> param vect-partial-vector-usage=1.
>>>>
>>>> Any comments/suggestions are highly appreciated!
>>>
>>> I don't like the introduction of an extra target hook for this.  All
>>> vectorizer cost modeling should ideally go through
>>> init_cost/add_stmt_cost/finish_cost.  If the extra costing is
>>> not per stmt then either init_cost or finish_cost is appropriate.
>>> Currently init_cost only gets a struct loop while we should
>>> probably give it a vec_info * parameter so targets can
>>> check LOOP_VINFO_USING_PARTIAL_VECTORS_P and friends.
>>>
>>
>> Thanks!  Nice, your suggested way looks better.  I've removed the hook
>> and taken care of it in finish_cost.  The updated v2 is attached.
>>
>> Bootstrapped/regtested again on powerpc64le-linux-gnu (P9) with explicit
>> param vect-partial-vector-usage=1.
> 
> LGTM (assuming the first larger hunk is mostly re-indenting
> under LOOP_VINFO_USING_PARTIAL_VECTORS_P).

Thanks for the review!  Yes, for the original LOOP_VINFO_FULLY_MASKED_P
hunk, this patch moves the handling of gap peeling so that it is shared
between mask and length, and re-indents the remaining (masking-specific)
code under the inner LOOP_VINFO_FULLY_MASKED_P check.  The length-specific
code goes into the else arm.  It shouldn't change anything for masking;
I'll run aarch64 regression testing to make sure.  :)

BR,
Kewen


* Re: [PATCH v2] vect/rs6000: Support vector with length cost modeling
  2020-07-22  1:26   ` [PATCH v2] vect/rs6000: " Kewen.Lin
  2020-07-22  6:38     ` Richard Biener
@ 2020-07-22  9:11     ` Richard Sandiford
  2020-07-22 15:48       ` [PATCH v3] " Kewen.Lin
  2020-07-22 17:49     ` [PATCH v2] " Segher Boessenkool
  2 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2020-07-22  9:11 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: Richard Biener, GCC Patches, Bill Schmidt, Segher Boessenkool

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> on 2020/7/21 3:57 PM, Richard Biener wrote:
>> On Tue, Jul 21, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>
>>> Hi,
>>>
>>> This patch adds cost modeling for vector with length; it mainly
>>> follows what we generate for vector with length in the functions
>>> vect_set_loop_controls_directly and vect_gen_len, assuming the
>>> worst case.
>>>
>>> For Power, the length is expected to be in bits 0-7 (the high bits),
>>> so we have to model the cost of shifting the length into place.  To
>>> avoid making other targets pay for this, I used a target hook to
>>> describe this extra cost; I'm not sure whether that's the correct
>>> approach.
>>>
>>> Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
>>> param vect-partial-vector-usage=1.
>>>
>>> Any comments/suggestions are highly appreciated!
>> 
>> I don't like the introduction of an extra target hook for this.  All
>> vectorizer cost modeling should ideally go through
>> init_cost/add_stmt_cost/finish_cost.  If the extra costing is
>> not per stmt then either init_cost or finish_cost is appropriate.
>> Currently init_cost only gets a struct loop while we should
>> probably give it a vec_info * parameter so targets can
>> check LOOP_VINFO_USING_PARTIAL_VECTORS_P and friends.
>> 
>
> Thanks!  Nice, your suggested way looks better.  I've removed the hook
> and taken care of it in finish_cost.  The updated v2 is attached.
>
> Bootstrapped/regtested again on powerpc64le-linux-gnu (P9) with explicit
> param vect-partial-vector-usage=1.
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
> 	* config/rs6000/rs6000.c (adjust_vect_cost): New function.
> 	(rs6000_finish_cost): Call function adjust_vect_cost.
> 	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
> 	modeling for vector with length.
>
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 5a4f07d5810..f2724e792c9 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>    return retval;
>  }
>  
> +/* For target-specific vectorization costs which can't be handled per stmt,
> +   we check the requisite conditions and adjust the vectorization cost
> +   accordingly if they are satisfied.  One typical example is modeling the
> +   shift cost for vector with length by counting the number of required
> +   lengths under LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
> +
> +static void
> +adjust_vect_cost (rs6000_cost_data *data)
> +{
> +  struct loop *loop = data->loop_info;
> +  gcc_assert (loop);
> +  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
> +
> +  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +    {
> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      unsigned int shift_cnt = 0;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
> +	if (rgc->type)
> +	  /* Each length needs one shift to fill into bits 0-7.  */
> +	  shift_cnt += (num_vectors_m1 + 1);
> +
> +      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
> +			    NULL, NULL_TREE, 0, vect_body);
> +    }
> +}
> +
>  /* Implement targetm.vectorize.finish_cost.  */
>  
>  static void
> @@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
>    rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
>  
>    if (cost_data->loop_info)
> -    rs6000_density_test (cost_data);
> +    {
> +      adjust_vect_cost (cost_data);
> +      rs6000_density_test (cost_data);
> +    }
>  
>    /* Don't vectorize minimum-vectorization-factor, simple copy loops
>       that require versioning for any reason.  The vectorization is at
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index e933441b922..99e1fd7bdd0 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -3652,7 +3652,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>       TODO: Build an expression that represents peel_iters for prologue and
>       epilogue to be used in a run-time test.  */
>  
> -  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
>      {
>        peel_iters_prologue = 0;
>        peel_iters_epilogue = 0;
> @@ -3663,45 +3663,145 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>  	  peel_iters_epilogue += 1;
>  	  stmt_info_for_cost *si;
>  	  int j;
> -	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
> -			    j, si)
> +	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
> +			    si)
>  	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
>  				  si->kind, si->stmt_info, si->vectype,
>  				  si->misalign, vect_epilogue);
>  	}
>  
> -      /* Calculate how many masks we need to generate.  */
> -      unsigned int num_masks = 0;
> -      rgroup_controls *rgm;
> -      unsigned int num_vectors_m1;
> -      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
> -	if (rgm->type)
> -	  num_masks += num_vectors_m1 + 1;
> -      gcc_assert (num_masks > 0);
> -
> -      /* In the worst case, we need to generate each mask in the prologue
> -	 and in the loop body.  One of the loop body mask instructions
> -	 replaces the comparison in the scalar loop, and since we don't
> -	 count the scalar comparison against the scalar body, we shouldn't
> -	 count that vector instruction against the vector body either.
> -
> -	 Sometimes we can use unpacks instead of generating prologue
> -	 masks and sometimes the prologue mask will fold to a constant,
> -	 so the actual prologue cost might be smaller.  However, it's
> -	 simpler and safer to use the worst-case cost; if this ends up
> -	 being the tie-breaker between vectorizing or not, then it's
> -	 probably better not to vectorize.  */
> -      (void) add_stmt_cost (loop_vinfo,
> -			    target_cost_data, num_masks, vector_stmt,
> -			    NULL, NULL_TREE, 0, vect_prologue);
> -      (void) add_stmt_cost (loop_vinfo,
> -			    target_cost_data, num_masks - 1, vector_stmt,
> -			    NULL, NULL_TREE, 0, vect_body);
> -    }
> -  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> -    {
> -      peel_iters_prologue = 0;
> -      peel_iters_epilogue = 0;
> +      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +	{
> +	  /* Calculate how many masks we need to generate.  */
> +	  unsigned int num_masks = 0;
> +	  rgroup_controls *rgm;
> +	  unsigned int num_vectors_m1;
> +	  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
> +	    if (rgm->type)
> +	      num_masks += num_vectors_m1 + 1;
> +	  gcc_assert (num_masks > 0);
> +
> +	  /* In the worst case, we need to generate each mask in the prologue
> +	     and in the loop body.  One of the loop body mask instructions
> +	     replaces the comparison in the scalar loop, and since we don't
> +	     count the scalar comparison against the scalar body, we shouldn't
> +	     count that vector instruction against the vector body either.
> +
> +	     Sometimes we can use unpacks instead of generating prologue
> +	     masks and sometimes the prologue mask will fold to a constant,
> +	     so the actual prologue cost might be smaller.  However, it's
> +	     simpler and safer to use the worst-case cost; if this ends up
> +	     being the tie-breaker between vectorizing or not, then it's
> +	     probably better not to vectorize.  */
> +	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
> +				vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
> +	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
> +				vector_stmt, NULL, NULL_TREE, 0, vect_body);
> +	}
> +      else
> +	{
> +	  gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
> +
> +	  /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT.  */
> +	  if (npeel < 0)
> +	    {
> +	      peel_iters_prologue = assumed_vf / 2;
> +	      /* See below, if peeled iterations are unknown, count a taken
> +		 branch and a not taken branch per peeled loop.  */
> +	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
> +				    cond_branch_taken, NULL, NULL_TREE, 0,
> +				    vect_prologue);
> +	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
> +				    cond_branch_not_taken, NULL, NULL_TREE, 0,
> +				    vect_prologue);
> +	    }
> +	  else
> +	    {
> +	      peel_iters_prologue = npeel;
> +	      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +		/* See vect_get_known_peeling_cost, if peeled iterations are
> +		/* See vect_get_known_peeling_cost, if peeled iterations are
> +		   known but the number of scalar loop iterations is unknown,
> +		   count a taken branch per peeled loop.  */
> +				      cond_branch_taken, NULL, NULL_TREE, 0,
> +				      vect_prologue);
> +	    }

I think it'd be good to avoid duplicating this.  How about the
following structure?

  if (vect_use_loop_mask_for_alignment_p (…))
    {
      peel_iters_prologue = 0;
      peel_iters_epilogue = 0;
    }
  else if (npeel < 0)
    {
      … // A
    }
  else
    {
      …vect_get_known_peeling_cost stuff…
    }

but in A and vect_get_known_peeling_cost, set peel_iters_epilogue to:

  LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0

for LOOP_VINFO_USING_PARTIAL_VECTORS_P, instead of setting it to
whatever value we'd normally use.  Then wrap:

      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
			    NULL, NULL_TREE, 0, vect_epilogue);
      (void) add_stmt_cost (loop_vinfo,
			    target_cost_data, 1, cond_branch_not_taken,
			    NULL, NULL_TREE, 0, vect_epilogue);

in !LOOP_VINFO_USING_PARTIAL_VECTORS_P and make the other vect_epilogue
stuff in A conditional on peel_iters_epilogue != 0.

This will also remove the need for the existing LOOP_VINFO_FULLY_MASKED_P
code:

      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
	{
	  /* We need to peel exactly one iteration.  */
	  peel_iters_epilogue += 1;
	  stmt_info_for_cost *si;
	  int j;
	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
			    j, si)
	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
				  si->kind, si->stmt_info, si->vectype,
				  si->misalign, vect_epilogue);
	}

Then, after the above, have:

  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
    …add costs for mask overhead…
  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
    …add costs for lengths overhead…

So we'd have one block of code for estimating the prologue and epilogue
peeling cost, and a separate block of code for the loop control overhead.

Thanks,
Richard


* [PATCH v3] vect/rs6000: Support vector with length cost modeling
  2020-07-22  9:11     ` Richard Sandiford
@ 2020-07-22 15:48       ` Kewen.Lin
  2020-07-22 16:25         ` Kewen.Lin
  0 siblings, 1 reply; 23+ messages in thread
From: Kewen.Lin @ 2020-07-22 15:48 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Richard Biener, Bill Schmidt, Segher Boessenkool

[-- Attachment #1: Type: text/plain, Size: 5760 bytes --]

Hi Richard,

Thanks for the review!

on 2020/7/22 5:11 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> -  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
>> -    {
>> -      peel_iters_prologue = 0;
>> -      peel_iters_epilogue = 0;
>> +      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>> +	{
>> +	  /* Calculate how many masks we need to generate.  */
>> +	  unsigned int num_masks = 0;
>> +	  rgroup_controls *rgm;
>> +	  unsigned int num_vectors_m1;
>> +	  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
>> +	    if (rgm->type)
>> +	      num_masks += num_vectors_m1 + 1;
>> +	  gcc_assert (num_masks > 0);
>> +
>> +	  /* In the worst case, we need to generate each mask in the prologue
>> +	     and in the loop body.  One of the loop body mask instructions
>> +	     replaces the comparison in the scalar loop, and since we don't
>> +	     count the scalar comparison against the scalar body, we shouldn't
>> +	     count that vector instruction against the vector body either.
>> +
>> +	     Sometimes we can use unpacks instead of generating prologue
>> +	     masks and sometimes the prologue mask will fold to a constant,
>> +	     so the actual prologue cost might be smaller.  However, it's
>> +	     simpler and safer to use the worst-case cost; if this ends up
>> +	     being the tie-breaker between vectorizing or not, then it's
>> +	     probably better not to vectorize.  */
>> +	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
>> +				vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
>> +	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
>> +				vector_stmt, NULL, NULL_TREE, 0, vect_body);
>> +	}
>> +      else
>> +	{
>> +	  gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
>> +
>> +	  /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT.  */
>> +	  if (npeel < 0)
>> +	    {
>> +	      peel_iters_prologue = assumed_vf / 2;
>> +	      /* See below, if peeled iterations are unknown, count a taken
>> +		 branch and a not taken branch per peeled loop.  */
>> +	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
>> +				    cond_branch_taken, NULL, NULL_TREE, 0,
>> +				    vect_prologue);
>> +	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
>> +				    cond_branch_not_taken, NULL, NULL_TREE, 0,
>> +				    vect_prologue);
>> +	    }
>> +	  else
>> +	    {
>> +	      peel_iters_prologue = npeel;
>> +	      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
>> +		/* See vect_get_known_peeling_cost, if peeled iterations are
>> +		   known but the number of scalar loop iterations is unknown,
>> +		   count a taken branch per peeled loop.  */
>> +		(void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
>> +				      cond_branch_taken, NULL, NULL_TREE, 0,
>> +				      vect_prologue);
>> +	    }
> 
> I think it'd be good to avoid duplicating this.  How about the
> following structure?
> 
>   if (vect_use_loop_mask_for_alignment_p (…))
>     {
>       peel_iters_prologue = 0;
>       peel_iters_epilogue = 0;
>     }
>   else if (npeel < 0)
>     {
>       … // A
>     }
>   else
>     {
>       …vect_get_known_peeling_cost stuff…
>     }
> 
> but in A and vect_get_known_peeling_cost, set peel_iters_epilogue to:
> 
>   LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0
> 
> for LOOP_VINFO_USING_PARTIAL_VECTORS_P, instead of setting it to
> whatever value we'd normally use.  Then wrap:
> 
>       (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
> 			    NULL, NULL_TREE, 0, vect_epilogue);
>       (void) add_stmt_cost (loop_vinfo,
> 			    target_cost_data, 1, cond_branch_not_taken,
> 			    NULL, NULL_TREE, 0, vect_epilogue);
> 
> in !LOOP_VINFO_USING_PARTIAL_VECTORS_P and make the other vect_epilogue
> stuff in A conditional on peel_iters_epilogue != 0.
> 
> This will also remove the need for the existing LOOP_VINFO_FULLY_MASKED_P
> code:
> 
>       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> 	{
> 	  /* We need to peel exactly one iteration.  */
> 	  peel_iters_epilogue += 1;
> 	  stmt_info_for_cost *si;
> 	  int j;
> 	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
> 			    j, si)
> 	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
> 				  si->kind, si->stmt_info, si->vectype,
> 				  si->misalign, vect_epilogue);
> 	}
> 
> Then, after the above, have:
> 
>   if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
>     …add costs for mask overhead…
>   else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
>     …add costs for lengths overhead…
> 
> So we'd have one block of code for estimating the prologue and epilogue
> peeling cost, and a separate block of code for the loop control overhead.
> 

That's a great idea.  Following your subsequent suggestion, I made the
structure:

  - calculate peel_iters_prologue
  - calculate peel_iters_epilogue
  - add costs associated with peel_iters_prologue
  - add costs associated with peel_iters_epilogue
  - add costs related to branch taken/not_taken.

The updated v3 is attached.

Just bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
param vect-partial-vector-usage=1; I'll also test it without the partial
vectors setting, and on aarch64, later.

BR,
Kewen
-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (adjust_vect_cost_per_loop): New function.
	(rs6000_finish_cost): Call adjust_vect_cost_per_loop.
	* tree-vect-loop.c (vect_get_known_peeling_cost): Factor out some code
	to determine peel_iters_epilogue to function ...
	(vect_get_peel_iters_epilogue): ... this.  New function.
	(vect_estimate_min_profitable_iters): Add cost modeling for vector
	with length, refactor cost calculation on peel_iters_prologue and
	peel_iters_epilogue.

[-- Attachment #2: cost_v3.diff --]
[-- Type: text/plain, Size: 18524 bytes --]

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 009afc5f894..d71f2bf1c16 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
   return retval;
 }
 
+/* For target-specific vectorization costs which can't be handled per stmt,
+   we check the requisite conditions and adjust the vectorization cost
+   accordingly if they are satisfied.  One typical example is modeling the
+   shift cost for vector with length by counting the number of required
+   lengths under LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
+
+static void
+adjust_vect_cost_per_loop (rs6000_cost_data *data)
+{
+  struct loop *loop = data->loop_info;
+  gcc_assert (loop);
+  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      unsigned int shift_cnt = 0;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  /* Each length needs one shift to fill into bits 0-7.  */
+	  shift_cnt += (num_vectors_m1 + 1);
+
+      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
+    }
+}
+
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
@@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
   if (cost_data->loop_info)
-    rs6000_density_test (cost_data);
+    {
+      adjust_vect_cost_per_loop (cost_data);
+      rs6000_density_test (cost_data);
+    }
 
   /* Don't vectorize minimum-vectorization-factor, simple copy loops
      that require versioning for any reason.  The vectorization is at
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index e933441b922..c85b3ea1393 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3474,42 +3474,56 @@ vect_is_simple_reduction (loop_vec_info loop_info, stmt_vec_info phi_info,
   return NULL;
 }
 
-/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times.  */
-int
-vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
-                             int *peel_iters_epilogue,
-                             stmt_vector_for_cost *scalar_cost_vec,
-			     stmt_vector_for_cost *prologue_cost_vec,
-			     stmt_vector_for_cost *epilogue_cost_vec)
+/* Calculate how many iterations are peeled for the epilogue, given
+   LOOP_VINFO and PEEL_ITERS_PROLOGUE.  */
+
+static int
+vect_get_peel_iters_epilogue (loop_vec_info loop_vinfo, int peel_iters_prologue)
 {
-  int retval = 0;
   int assumed_vf = vect_vf_for_cost (loop_vinfo);
-
   if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
-      *peel_iters_epilogue = assumed_vf / 2;
       if (dump_enabled_p ())
-        dump_printf_loc (MSG_NOTE, vect_location,
+	dump_printf_loc (MSG_NOTE, vect_location,
 			 "cost model: epilogue peel iters set to vf/2 "
 			 "because loop iterations are unknown .\n");
-
-      /* If peeled iterations are known but number of scalar loop
-         iterations are unknown, count a taken branch per peeled loop.  */
-      retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
-				 NULL, NULL_TREE, 0, vect_prologue);
-      retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken,
-				  NULL, NULL_TREE, 0, vect_epilogue);
+      return assumed_vf / 2;
     }
   else
     {
       int niters = LOOP_VINFO_INT_NITERS (loop_vinfo);
-      peel_iters_prologue = niters < peel_iters_prologue ?
-                            niters : peel_iters_prologue;
-      *peel_iters_epilogue = (niters - peel_iters_prologue) % assumed_vf;
+      int npeel_prol
+	= niters < peel_iters_prologue ? niters : peel_iters_prologue;
+      int npeel_epil = (niters - npeel_prol) % assumed_vf;
       /* If we need to peel for gaps, but no peeling is required, we have to
 	 peel VF iterations.  */
-      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && !*peel_iters_epilogue)
-	*peel_iters_epilogue = assumed_vf;
+      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && !npeel_epil)
+	npeel_epil = assumed_vf;
+      return npeel_epil;
+    }
+}
+
+/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times.  */
+int
+vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
+			     int *peel_iters_epilogue,
+			     stmt_vector_for_cost *scalar_cost_vec,
+			     stmt_vector_for_cost *prologue_cost_vec,
+			     stmt_vector_for_cost *epilogue_cost_vec)
+{
+  int retval = 0;
+
+  *peel_iters_epilogue
+    = vect_get_peel_iters_epilogue (loop_vinfo, peel_iters_prologue);
+
+  if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+    {
+      /* If peeled iterations are known but number of scalar loop
+	 iterations are unknown, count a taken branch per peeled loop.  */
+      retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken, NULL,
+				 NULL_TREE, 0, vect_prologue);
+      retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken, NULL,
+				  NULL_TREE, 0, vect_epilogue);
     }
 
   stmt_info_for_cost *si;
@@ -3652,24 +3666,106 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
      TODO: Build an expression that represents peel_iters for prologue and
      epilogue to be used in a run-time test.  */
 
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  bool prol_need_br_taken_cost = false;
+  bool prol_need_br_not_taken_cost = false;
+
+  /* Calculate peel_iters_prologue.  */
+  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
+    peel_iters_prologue = 0;
+  else if (npeel < 0)
     {
-      peel_iters_prologue = 0;
-      peel_iters_epilogue = 0;
+      peel_iters_prologue = assumed_vf / 2;
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "cost model: "
+			       "prologue peel iters set to vf/2.\n");
 
-      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
+      /* If peeled iterations are unknown, count a taken branch and a not taken
+	 branch per peeled loop. Even if scalar loop iterations are known,
+	 vector iterations are not known since peeled prologue iterations are
+	 not known. Hence guards remain the same.  */
+      prol_need_br_taken_cost = true;
+      prol_need_br_not_taken_cost = true;
+    }
+  else
+    peel_iters_prologue = npeel;
+
+  bool epil_need_br_taken_cost = false;
+  bool epil_need_br_not_taken_cost = false;
+
+  /* Calculate peel_iters_epilogue.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
+    /* We need to peel exactly one iteration for gaps.  */
+    peel_iters_epilogue = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0;
+  else if (npeel < 0)
+    {
+      /* If peeling for alignment is unknown, loop bound of main loop
+	 becomes unknown.  */
+      peel_iters_epilogue = assumed_vf / 2;
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "cost model: "
+			       "epilogue peel iters set to vf/2 because "
+			       "peeling for alignment is unknown.\n");
+
+      /* See the same reason above in peel_iters_prologue calculation.  */
+      epil_need_br_taken_cost = true;
+      epil_need_br_not_taken_cost = true;
+    }
+  else
+    {
+      peel_iters_epilogue = vect_get_peel_iters_epilogue (loop_vinfo, npeel);
+      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
 	{
-	  /* We need to peel exactly one iteration.  */
-	  peel_iters_epilogue += 1;
-	  stmt_info_for_cost *si;
-	  int j;
-	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
-			    j, si)
-	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
-				  si->kind, si->stmt_info, si->vectype,
-				  si->misalign, vect_epilogue);
+	  /* If peeled iterations are known but number of scalar loop
+	     iterations are unknown, count a taken branch per peeled loop.  */
+	  prol_need_br_taken_cost = true;
+	  epil_need_br_taken_cost = true;
 	}
+    }
+
+  stmt_info_for_cost *si;
+  int j;
+  /* Add costs associated with peel_iters_prologue.  */
+  if (peel_iters_prologue)
+    FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)
+      {
+	(void) add_stmt_cost (loop_vinfo, target_cost_data,
+			      si->count * peel_iters_prologue, si->kind,
+			      si->stmt_info, si->vectype, si->misalign,
+			      vect_prologue);
+      }
+
+  /* Add costs associated with peel_iters_epilogue.  */
+  if (peel_iters_epilogue)
+    FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)
+      {
+	(void) add_stmt_cost (loop_vinfo, target_cost_data,
+			      si->count * peel_iters_epilogue, si->kind,
+			      si->stmt_info, si->vectype, si->misalign,
+			      vect_epilogue);
+      }
+
+  /* Add possible cond_branch_taken/cond_branch_not_taken cost.  */
+  if (prol_need_br_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
+			  NULL, NULL_TREE, 0, vect_prologue);
 
+  if (prol_need_br_not_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+			  cond_branch_not_taken, NULL, NULL_TREE, 0,
+			  vect_prologue);
+
+  if (epil_need_br_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
+			  NULL, NULL_TREE, 0, vect_epilogue);
+
+  if (epil_need_br_not_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+			  cond_branch_not_taken, NULL, NULL_TREE, 0,
+			  vect_epilogue);
+
+  /* Take care of special costs for rgroup controls of partial vectors.  */
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    {
       /* Calculate how many masks we need to generate.  */
       unsigned int num_masks = 0;
       rgroup_controls *rgm;
@@ -3691,93 +3787,79 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	 simpler and safer to use the worst-case cost; if this ends up
 	 being the tie-breaker between vectorizing or not, then it's
 	 probably better not to vectorize.  */
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks - 1, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_body);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
+			    vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
+			    vector_stmt, NULL, NULL_TREE, 0, vect_body);
     }
   else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
     {
-      peel_iters_prologue = 0;
-      peel_iters_epilogue = 0;
-    }
-  else if (npeel < 0)
-    {
-      peel_iters_prologue = assumed_vf / 2;
-      if (dump_enabled_p ())
-	dump_printf (MSG_NOTE, "cost model: "
-		     "prologue peel iters set to vf/2.\n");
-
-      /* If peeling for alignment is unknown, loop bound of main loop becomes
-         unknown.  */
-      peel_iters_epilogue = assumed_vf / 2;
-      if (dump_enabled_p ())
-	dump_printf (MSG_NOTE, "cost model: "
-		     "epilogue peel iters set to vf/2 because "
-		     "peeling for alignment is unknown.\n");
+      /* Refer to the functions vect_set_loop_condition_partial_vectors
+	 and vect_set_loop_controls_directly, we need to generate each
+	 length in the prologue and in the loop body if required. Although
+	 there are some possible optimization, we consider the worst case
+	 here.  */
+
+      /* For now we only operate length-based partial vectors on Power,
+	 which has constant VF all the time, we need some tweakings below
+	 if it doesn't hold in future.  */
+      gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
+
+      /* For wrap around checking.  */
+      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+      unsigned int compare_precision = TYPE_PRECISION (compare_type);
+      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+      bool need_iterate_p
+	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+      /* Init min/max, shift and minus cost relative to single
+	 scalar_stmt. For now we only use length-based partial vectors on
+	 Power, target specific cost tweaking may be needed for other
+	 ports in future.  */
+      unsigned int min_max_cost = 2;
+      unsigned int shift_cost = 1, minus_cost = 1;
+
+      /* Init cost relative to single scalar_stmt.  */
+      unsigned int prol_cnt = 0;
+      unsigned int body_cnt = 0;
+
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  {
+	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
 
-      /* If peeled iterations are unknown, count a taken branch and a not taken
-         branch per peeled loop. Even if scalar loop iterations are known,
-         vector iterations are not known since peeled prologue iterations are
-         not known. Hence guards remain the same.  */
-      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, 1, cond_branch_not_taken,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
-			    NULL, NULL_TREE, 0, vect_epilogue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, 1, cond_branch_not_taken,
-			    NULL, NULL_TREE, 0, vect_epilogue);
-      stmt_info_for_cost *si;
-      int j;
-      FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)
-	{
-	  (void) add_stmt_cost (loop_vinfo, target_cost_data,
-				si->count * peel_iters_prologue,
-				si->kind, si->stmt_info, si->vectype,
-				si->misalign,
-				vect_prologue);
-	  (void) add_stmt_cost (loop_vinfo, target_cost_data,
-				si->count * peel_iters_epilogue,
-				si->kind, si->stmt_info, si->vectype,
-				si->misalign,
-				vect_epilogue);
-	}
-    }
-  else
-    {
-      stmt_vector_for_cost prologue_cost_vec, epilogue_cost_vec;
-      stmt_info_for_cost *si;
-      int j;
-      void *data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
+	    /* Need one shift for niters_total computation.  */
+	    if (!niters_known_p && nitems != 1)
+	      prol_cnt += shift_cost;
 
-      prologue_cost_vec.create (2);
-      epilogue_cost_vec.create (2);
-      peel_iters_prologue = npeel;
+	    /* Need to handle wrap around.  */
+	    if (iv_limit == -1
+		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)
+		    > compare_precision))
+	      prol_cnt += (min_max_cost + minus_cost);
 
-      (void) vect_get_known_peeling_cost (loop_vinfo, peel_iters_prologue,
-					  &peel_iters_epilogue,
-					  &LOOP_VINFO_SCALAR_ITERATION_COST
-					    (loop_vinfo),
-					  &prologue_cost_vec,
-					  &epilogue_cost_vec);
+	    /* Need to handle batch limit excepting for the 1st one.  */
+	    prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
 
-      FOR_EACH_VEC_ELT (prologue_cost_vec, j, si)
-	(void) add_stmt_cost (loop_vinfo,
-			      data, si->count, si->kind, si->stmt_info,
-			      si->vectype, si->misalign, vect_prologue);
+	    unsigned int num_vectors = num_vectors_m1 + 1;
+	    /* Need to set up lengths in prologue, only one MIN required
+	       since start index is zero.  */
+	    prol_cnt += min_max_cost * num_vectors;
 
-      FOR_EACH_VEC_ELT (epilogue_cost_vec, j, si)
-	(void) add_stmt_cost (loop_vinfo,
-			      data, si->count, si->kind, si->stmt_info,
-			      si->vectype, si->misalign, vect_epilogue);
+	    /* Need to update lengths in body for next iteration.  */
+	    if (need_iterate_p)
+	      body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
+	  }
 
-      prologue_cost_vec.release ();
-      epilogue_cost_vec.release ();
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_prologue);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
     }
 
   /* FORNOW: The scalar outside cost is incremented in one of the
@@ -3913,8 +3995,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* ??? The "if" arm is written to handle all cases; see below for what
-     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+     we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* Rewriting the condition above in terms of the number of
 	 vector iterations (vniters) rather than the number of
@@ -3941,7 +4023,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
 		     min_vec_niters);
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  /* Now that we know the minimum number of vector iterations,
 	     find the minimum niters for which the scalar cost is larger:
@@ -3996,6 +4078,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
+  else if (min_profitable_iters < peel_iters_prologue)
+    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure that
+       the vectorized loop executes at least once.  */
+    min_profitable_iters = peel_iters_prologue;
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -4013,7 +4099,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
   if (vec_outside_cost <= 0)
     min_profitable_estimate = 0;
-  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* This is a repeat of the code above, but with + SOC rather
 	 than - SOC.  */
@@ -4025,7 +4111,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       if (outside_overhead > 0)
 	min_vec_niters = outside_overhead / saving_per_viter + 1;
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  int threshold = (vec_inside_cost * min_vec_niters
 			   + vec_outside_cost

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] vect/rs6000: Support vector with length cost modeling
  2020-07-22 15:48       ` [PATCH v3] " Kewen.Lin
@ 2020-07-22 16:25         ` Kewen.Lin
  2020-07-24 16:21           ` Richard Sandiford
  0 siblings, 1 reply; 23+ messages in thread
From: Kewen.Lin @ 2020-07-22 16:25 UTC (permalink / raw)
  To: richard.sandiford
  Cc: GCC Patches, Bill Schmidt, Segher Boessenkool, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 1323 bytes --]

Hi,

Sorry, please ignore the previously attached file, which isn't the latest one
although they are almost the same.  The latest tested version is attached here.

Sorry for the inconvenience.

BR,
Kewen

on 2020/7/22 11:48 PM, Kewen.Lin via Gcc-patches wrote:
> 
> It's a great idea.  Following your subsequent suggestion, I made the structure
> like:
> 
>   - calculate peel_iters_prologue
>   - calculate peel_iters_epilogue
>   - add costs associated with peel_iters_prologue
>   - add costs associated with peel_iters_epilogue
>   - add costs related to branch taken/not_taken.
> 
> The updated v3 is attached.
> 
> Just bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
> param vect-partial-vector-usage=1; I'll test it without the partial vectors
> setting, and also on aarch64, later.
> 
> BR,
> Kewen
> -----
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000.c (adjust_vect_cost_per_loop): New function.
> 	(rs6000_finish_cost): Call adjust_vect_cost_per_loop.
> 	* tree-vect-loop.c (vect_get_known_peeling_cost): Factor out some code
> 	to determine peel_iters_epilogue to function ...
> 	(vect_get_peel_iters_epilogue): ... this.  New function.
> 	(vect_estimate_min_profitable_iters):  Add cost modeling for vector
> 	with length, refactor cost calculation on peel_iters_prologue and
> 	peel_iters_epilogue.
> 


[-- Attachment #2: cost_v3.diff --]
[-- Type: text/plain, Size: 18717 bytes --]

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 009afc5f894..d71f2bf1c16 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
   return retval;
 }
 
+/* For some target specific vectorization cost which can't be handled per stmt,
+   we check the requisite conditions and adjust the vectorization cost
+   accordingly if satisfied.  One typical example is to model shift cost for
+   vector with length by counting number of required lengths under condition
+   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
+
+static void
+adjust_vect_cost_per_loop (rs6000_cost_data *data)
+{
+  struct loop *loop = data->loop_info;
+  gcc_assert (loop);
+  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      unsigned int shift_cnt = 0;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  /* Each length needs one shift to fill into bits 0-7.  */
+	  shift_cnt += (num_vectors_m1 + 1);
+
+      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
+    }
+}
+
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
@@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
   if (cost_data->loop_info)
-    rs6000_density_test (cost_data);
+    {
+      adjust_vect_cost_per_loop (cost_data);
+      rs6000_density_test (cost_data);
+    }
 
   /* Don't vectorize minimum-vectorization-factor, simple copy loops
      that require versioning for any reason.  The vectorization is at
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index e933441b922..8746c5ae582 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3474,42 +3474,56 @@ vect_is_simple_reduction (loop_vec_info loop_info, stmt_vec_info phi_info,
   return NULL;
 }
 
-/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times.  */
-int
-vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
-                             int *peel_iters_epilogue,
-                             stmt_vector_for_cost *scalar_cost_vec,
-			     stmt_vector_for_cost *prologue_cost_vec,
-			     stmt_vector_for_cost *epilogue_cost_vec)
+/* Calculate how many iterations peeled for epilogue with information LOOP_VINFO
+   and PEEL_ITERS_PROLOGUE.  */
+
+static int
+vect_get_peel_iters_epilogue (loop_vec_info loop_vinfo, int peel_iters_prologue)
 {
-  int retval = 0;
   int assumed_vf = vect_vf_for_cost (loop_vinfo);
-
   if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
     {
-      *peel_iters_epilogue = assumed_vf / 2;
       if (dump_enabled_p ())
-        dump_printf_loc (MSG_NOTE, vect_location,
+	dump_printf_loc (MSG_NOTE, vect_location,
 			 "cost model: epilogue peel iters set to vf/2 "
 			 "because loop iterations are unknown .\n");
-
-      /* If peeled iterations are known but number of scalar loop
-         iterations are unknown, count a taken branch per peeled loop.  */
-      retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
-				 NULL, NULL_TREE, 0, vect_prologue);
-      retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken,
-				  NULL, NULL_TREE, 0, vect_epilogue);
+      return assumed_vf / 2;
     }
   else
     {
       int niters = LOOP_VINFO_INT_NITERS (loop_vinfo);
-      peel_iters_prologue = niters < peel_iters_prologue ?
-                            niters : peel_iters_prologue;
-      *peel_iters_epilogue = (niters - peel_iters_prologue) % assumed_vf;
+      int npeel_prol
+	= niters < peel_iters_prologue ? niters : peel_iters_prologue;
+      int npeel_epil = (niters - npeel_prol) % assumed_vf;
       /* If we need to peel for gaps, but no peeling is required, we have to
 	 peel VF iterations.  */
-      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && !*peel_iters_epilogue)
-	*peel_iters_epilogue = assumed_vf;
+      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && !npeel_epil)
+	npeel_epil = assumed_vf;
+      return npeel_epil;
+    }
+}
+
+/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times.  */
+int
+vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
+			     int *peel_iters_epilogue,
+			     stmt_vector_for_cost *scalar_cost_vec,
+			     stmt_vector_for_cost *prologue_cost_vec,
+			     stmt_vector_for_cost *epilogue_cost_vec)
+{
+  int retval = 0;
+
+  *peel_iters_epilogue
+    = vect_get_peel_iters_epilogue (loop_vinfo, peel_iters_prologue);
+
+  if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+    {
+      /* If peeled iterations are known but number of scalar loop
+	 iterations are unknown, count a taken branch per peeled loop.  */
+      retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken, NULL,
+				 NULL_TREE, 0, vect_prologue);
+      retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken, NULL,
+				  NULL_TREE, 0, vect_epilogue);
     }
 
   stmt_info_for_cost *si;
@@ -3652,24 +3666,109 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
      TODO: Build an expression that represents peel_iters for prologue and
      epilogue to be used in a run-time test.  */
 
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  bool prol_need_br_taken_cost = false;
+  bool prol_need_br_not_taken_cost = false;
+
+  /* Calculate peel_iters_prologue.  */
+  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
+    peel_iters_prologue = 0;
+  else if (npeel < 0)
     {
-      peel_iters_prologue = 0;
-      peel_iters_epilogue = 0;
+      peel_iters_prologue = assumed_vf / 2;
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "cost model: "
+			       "prologue peel iters set to vf/2.\n");
 
-      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
-	{
-	  /* We need to peel exactly one iteration.  */
-	  peel_iters_epilogue += 1;
-	  stmt_info_for_cost *si;
-	  int j;
-	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
-			    j, si)
-	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
-				  si->kind, si->stmt_info, si->vectype,
-				  si->misalign, vect_epilogue);
-	}
+      /* If peeled iterations are unknown, count a taken branch and a not taken
+	 branch per peeled loop. Even if scalar loop iterations are known,
+	 vector iterations are not known since peeled prologue iterations are
+	 not known. Hence guards remain the same.  */
+      prol_need_br_taken_cost = true;
+      prol_need_br_not_taken_cost = true;
+    }
+  else
+    {
+      peel_iters_prologue = npeel;
+      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+	/* If peeled iterations are known but number of scalar loop
+	   iterations are unknown, count a taken branch per peeled loop.  */
+	prol_need_br_taken_cost = true;
+    }
+
+  bool epil_need_br_taken_cost = false;
+  bool epil_need_br_not_taken_cost = false;
 
+  /* Calculate peel_iters_epilogue.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
+    /* We need to peel exactly one iteration for gaps.  */
+    peel_iters_epilogue = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0;
+  else if (npeel < 0)
+    {
+      /* If peeling for alignment is unknown, loop bound of main loop
+	 becomes unknown.  */
+      peel_iters_epilogue = assumed_vf / 2;
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "cost model: "
+			       "epilogue peel iters set to vf/2 because "
+			       "peeling for alignment is unknown.\n");
+
+      /* See the same reason above in peel_iters_prologue calculation.  */
+      epil_need_br_taken_cost = true;
+      epil_need_br_not_taken_cost = true;
+    }
+  else
+    {
+      peel_iters_epilogue = vect_get_peel_iters_epilogue (loop_vinfo, npeel);
+      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+	/* If peeled iterations are known but number of scalar loop
+	   iterations are unknown, count a taken branch per peeled loop.  */
+	epil_need_br_taken_cost = true;
+    }
+
+  stmt_info_for_cost *si;
+  int j;
+  /* Add costs associated with peel_iters_prologue.  */
+  if (peel_iters_prologue)
+    FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)
+      {
+	(void) add_stmt_cost (loop_vinfo, target_cost_data,
+			      si->count * peel_iters_prologue, si->kind,
+			      si->stmt_info, si->vectype, si->misalign,
+			      vect_prologue);
+      }
+
+  /* Add costs associated with peel_iters_epilogue.  */
+  if (peel_iters_epilogue)
+    FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)
+      {
+	(void) add_stmt_cost (loop_vinfo, target_cost_data,
+			      si->count * peel_iters_epilogue, si->kind,
+			      si->stmt_info, si->vectype, si->misalign,
+			      vect_epilogue);
+      }
+
+  /* Add possible cond_branch_taken/cond_branch_not_taken cost.  */
+  if (prol_need_br_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
+			  NULL, NULL_TREE, 0, vect_prologue);
+
+  if (prol_need_br_not_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+			  cond_branch_not_taken, NULL, NULL_TREE, 0,
+			  vect_prologue);
+
+  if (epil_need_br_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
+			  NULL, NULL_TREE, 0, vect_epilogue);
+
+  if (epil_need_br_not_taken_cost)
+    (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+			  cond_branch_not_taken, NULL, NULL_TREE, 0,
+			  vect_epilogue);
+
+  /* Take care of special costs for rgroup controls of partial vectors.  */
+  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    {
       /* Calculate how many masks we need to generate.  */
       unsigned int num_masks = 0;
       rgroup_controls *rgm;
@@ -3691,93 +3790,79 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	 simpler and safer to use the worst-case cost; if this ends up
 	 being the tie-breaker between vectorizing or not, then it's
 	 probably better not to vectorize.  */
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks - 1, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_body);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
+			    vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
+			    vector_stmt, NULL, NULL_TREE, 0, vect_body);
     }
   else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
     {
-      peel_iters_prologue = 0;
-      peel_iters_epilogue = 0;
-    }
-  else if (npeel < 0)
-    {
-      peel_iters_prologue = assumed_vf / 2;
-      if (dump_enabled_p ())
-	dump_printf (MSG_NOTE, "cost model: "
-		     "prologue peel iters set to vf/2.\n");
-
-      /* If peeling for alignment is unknown, loop bound of main loop becomes
-         unknown.  */
-      peel_iters_epilogue = assumed_vf / 2;
-      if (dump_enabled_p ())
-	dump_printf (MSG_NOTE, "cost model: "
-		     "epilogue peel iters set to vf/2 because "
-		     "peeling for alignment is unknown.\n");
+      /* Refer to the functions vect_set_loop_condition_partial_vectors
+	 and vect_set_loop_controls_directly, we need to generate each
+	 length in the prologue and in the loop body if required. Although
+	 there are some possible optimization, we consider the worst case
+	 here.  */
+
+      /* For now we only operate length-based partial vectors on Power,
+	 which has constant VF all the time, we need some tweakings below
+	 if it doesn't hold in future.  */
+      gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
+
+      /* For wrap around checking.  */
+      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+      unsigned int compare_precision = TYPE_PRECISION (compare_type);
+      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+      bool need_iterate_p
+	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+      /* Init min/max, shift and minus cost relative to single
+	 scalar_stmt. For now we only use length-based partial vectors on
+	 Power, target specific cost tweaking may be needed for other
+	 ports in future.  */
+      unsigned int min_max_cost = 2;
+      unsigned int shift_cost = 1, minus_cost = 1;
+
+      /* Init cost relative to single scalar_stmt.  */
+      unsigned int prol_cnt = 0;
+      unsigned int body_cnt = 0;
+
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  {
+	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
 
-      /* If peeled iterations are unknown, count a taken branch and a not taken
-         branch per peeled loop. Even if scalar loop iterations are known,
-         vector iterations are not known since peeled prologue iterations are
-         not known. Hence guards remain the same.  */
-      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, 1, cond_branch_not_taken,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
-			    NULL, NULL_TREE, 0, vect_epilogue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, 1, cond_branch_not_taken,
-			    NULL, NULL_TREE, 0, vect_epilogue);
-      stmt_info_for_cost *si;
-      int j;
-      FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)
-	{
-	  (void) add_stmt_cost (loop_vinfo, target_cost_data,
-				si->count * peel_iters_prologue,
-				si->kind, si->stmt_info, si->vectype,
-				si->misalign,
-				vect_prologue);
-	  (void) add_stmt_cost (loop_vinfo, target_cost_data,
-				si->count * peel_iters_epilogue,
-				si->kind, si->stmt_info, si->vectype,
-				si->misalign,
-				vect_epilogue);
-	}
-    }
-  else
-    {
-      stmt_vector_for_cost prologue_cost_vec, epilogue_cost_vec;
-      stmt_info_for_cost *si;
-      int j;
-      void *data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
+	    /* Need one shift for niters_total computation.  */
+	    if (!niters_known_p && nitems != 1)
+	      prol_cnt += shift_cost;
 
-      prologue_cost_vec.create (2);
-      epilogue_cost_vec.create (2);
-      peel_iters_prologue = npeel;
+	    /* Need to handle wrap around.  */
+	    if (iv_limit == -1
+		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)
+		    > compare_precision))
+	      prol_cnt += (min_max_cost + minus_cost);
 
-      (void) vect_get_known_peeling_cost (loop_vinfo, peel_iters_prologue,
-					  &peel_iters_epilogue,
-					  &LOOP_VINFO_SCALAR_ITERATION_COST
-					    (loop_vinfo),
-					  &prologue_cost_vec,
-					  &epilogue_cost_vec);
+	    /* Need to handle batch limit excepting for the 1st one.  */
+	    prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
 
-      FOR_EACH_VEC_ELT (prologue_cost_vec, j, si)
-	(void) add_stmt_cost (loop_vinfo,
-			      data, si->count, si->kind, si->stmt_info,
-			      si->vectype, si->misalign, vect_prologue);
+	    unsigned int num_vectors = num_vectors_m1 + 1;
+	    /* Need to set up lengths in prologue, only one MIN required
+	       since start index is zero.  */
+	    prol_cnt += min_max_cost * num_vectors;
 
-      FOR_EACH_VEC_ELT (epilogue_cost_vec, j, si)
-	(void) add_stmt_cost (loop_vinfo,
-			      data, si->count, si->kind, si->stmt_info,
-			      si->vectype, si->misalign, vect_epilogue);
+	    /* Need to update lengths in body for next iteration.  */
+	    if (need_iterate_p)
+	      body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
+	  }
 
-      prologue_cost_vec.release ();
-      epilogue_cost_vec.release ();
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_prologue);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
     }
 
   /* FORNOW: The scalar outside cost is incremented in one of the
@@ -3913,8 +3998,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* ??? The "if" arm is written to handle all cases; see below for what
-     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+     we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* Rewriting the condition above in terms of the number of
 	 vector iterations (vniters) rather than the number of
@@ -3941,7 +4026,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
 		     min_vec_niters);
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  /* Now that we know the minimum number of vector iterations,
 	     find the minimum niters for which the scalar cost is larger:
@@ -3996,6 +4081,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
+  else if (min_profitable_iters < peel_iters_prologue)
+    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure that
+       the vectorized loop executes at least once.  */
+    min_profitable_iters = peel_iters_prologue;
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -4013,7 +4102,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
   if (vec_outside_cost <= 0)
     min_profitable_estimate = 0;
-  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* This is a repeat of the code above, but with + SOC rather
 	 than - SOC.  */
@@ -4025,7 +4114,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       if (outside_overhead > 0)
 	min_vec_niters = outside_overhead / saving_per_viter + 1;
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  int threshold = (vec_inside_cost * min_vec_niters
 			   + vec_outside_cost

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] vect/rs6000: Support vector with length cost modeling
  2020-07-22  1:26   ` [PATCH v2] vect/rs6000: " Kewen.Lin
  2020-07-22  6:38     ` Richard Biener
  2020-07-22  9:11     ` Richard Sandiford
@ 2020-07-22 17:49     ` Segher Boessenkool
  2020-07-27  3:44       ` Kewen.Lin
  2 siblings, 1 reply; 23+ messages in thread
From: Segher Boessenkool @ 2020-07-22 17:49 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: Richard Biener, GCC Patches, Bill Schmidt, Richard Sandiford

Hi!

On Wed, Jul 22, 2020 at 09:26:39AM +0800, Kewen.Lin wrote:
> +/* For some target specific vectorization cost which can't be handled per stmt,
> +   we check the requisite conditions and adjust the vectorization cost
> +   accordingly if satisfied.  One typical example is to model shift cost for
> +   vector with length by counting number of required lengths under condition
> +   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
> +
> +static void
> +adjust_vect_cost (rs6000_cost_data *data)
> +{

Maybe call it rs6000_adjust_vect_cost?  For consistency, but also it
could (in the future) collide with a global function of the same name (it
is a very non-specific name).

> +	  /* Each length needs one shift to fill into bits 0-7.  */
> +	  shift_cnt += (num_vectors_m1 + 1);

That doesn't need parentheses.
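
(For context: the shift being counted here is needed because the Power9
lxvl/stxvl instructions take the length in bits 0-7, i.e. the high-order
byte, of the length register.  A minimal standalone sketch of that
encoding, not the actual GCC code:

  /* Move a byte count into bits 0-7 (the top byte) of a 64-bit value,
     where lxvl/stxvl expect it.  */
  static inline unsigned long
  encode_len_for_lxvl (unsigned long len)
  {
    return len << 56;
  }

One such shift per length is what shift_cnt counts above.)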

>    if (cost_data->loop_info)
> -    rs6000_density_test (cost_data);
> +    {
> +      adjust_vect_cost (cost_data);
> +      rs6000_density_test (cost_data);
> +    }

^^^ consistency :-)

The rs6000 parts are fine for trunk, thanks!


Segher

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] vect/rs6000: Support vector with length cost modeling
  2020-07-22 16:25         ` Kewen.Lin
@ 2020-07-24 16:21           ` Richard Sandiford
  2020-07-27  3:58             ` [PATCH v4] " Kewen.Lin
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2020-07-24 16:21 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, Segher Boessenkool, Richard Biener

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi,
>
> Sorry, please ignore the previously attached file, which isn't the latest one
> although they are almost the same.  The latest tested version is attached here.
>
> Sorry for the inconvenience.
>
> BR,
> Kewen
>
> on 2020/7/22 11:48 PM, Kewen.Lin via Gcc-patches wrote:
>> 
>> It's a great idea.  Following your subsequent suggestion, I made the structure
>> like:
>> 
>>   - calculate peel_iters_prologue
>>   - calculate peel_iters_epilogue
>>   - add costs associated with peel_iters_prologue
>>   - add costs associated with peel_iters_epilogue
>>   - add costs related to branch taken/not_taken.
>> 
>> The updated v3 is attached.
>> 
>> Just bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
>> param vect-partial-vector-usage=1; I'll test it without the partial vectors
>> setting, and also on aarch64, later.
>> 
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>> 
>> 	* config/rs6000/rs6000.c (adjust_vect_cost_per_loop): New function.
>> 	(rs6000_finish_cost): Call adjust_vect_cost_per_loop.
>> 	* tree-vect-loop.c (vect_get_known_peeling_cost): Factor out some code
>> 	to determine peel_iters_epilogue to function ...
>> 	(vect_get_peel_iters_epilogue): ... this.  New function.
>> 	(vect_estimate_min_profitable_iters):  Add cost modeling for vector
>> 	with length, refactor cost calculation on peel_iters_prologue and
>> 	peel_iters_epilogue.
>> 

Thanks, the rearrangement of the existing code looks good.  Could you
split that and the new LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) stuff
out into separate patches?

> -/* Calculate cost of peeling the loop PEEL_ITERS_PROLOGUE times.  */
> -int
> -vect_get_known_peeling_cost (loop_vec_info loop_vinfo, int peel_iters_prologue,
> -                             int *peel_iters_epilogue,
> -                             stmt_vector_for_cost *scalar_cost_vec,
> -			     stmt_vector_for_cost *prologue_cost_vec,
> -			     stmt_vector_for_cost *epilogue_cost_vec)
> +/* Calculate how many iterations peeled for epilogue with information LOOP_VINFO
> +   and PEEL_ITERS_PROLOGUE.  */

Maybe:

/* Estimate the number of peeled epilogue iterations for LOOP_VINFO.
   PEEL_ITERS_PROLOGUE is the number of peeled prologue iterations,
   or -1 if not known.  */

> […]
> -      peel_iters_prologue = niters < peel_iters_prologue ?
> -                            niters : peel_iters_prologue;
> -      *peel_iters_epilogue = (niters - peel_iters_prologue) % assumed_vf;
> +      int npeel_prol
> +	= niters < peel_iters_prologue ? niters : peel_iters_prologue;
> +      int npeel_epil = (niters - npeel_prol) % assumed_vf;

Here and in the rest of the patch, I think we should stick with
the “ogue” rather than shorten it this much.

This is a comment about the original code, but it seems simpler to
write as either:

  if (peel_iters_prologue > niters)
    peel_iters_prologue = niters;

or:

  peel_iters_prologue = MIN (niters, peel_iters_prologue);

> […]
> +      /* Refer to the functions vect_set_loop_condition_partial_vectors

s/Refer/Referring/

> +	 and vect_set_loop_controls_directly, we need to generate each
> +	 length in the prologue and in the loop body if required. Although
> +	 there are some possible optimization, we consider the worst case

optimizations

> +	 here.  */
> +
> +      /* For now we only operate length-based partial vectors on Power,
> +	 which has constant VF all the time, we need some tweakings below
> +	 if it doesn't hold in future.  */
> +      gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());

Where do you rely on this?  There didn't seem to be any obvious
to_constant uses.  Since this is “only” a cost calculation, we should
be using assumed_vf.
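
(As a minimal illustration of the suggestion, reusing a helper the patch
already calls elsewhere:

  /* Estimated VF for costing; safe even when the real VF is not a
     compile-time constant.  */
  unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo);

as opposed to LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant (), which
is only valid when the VF is a compile-time constant.)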

> +
> +      /* For wrap around checking.  */
> +      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> +      unsigned int compare_precision = TYPE_PRECISION (compare_type);
> +      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
> +
> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
> +      bool need_iterate_p
> +	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
> +
> +      /* Init min/max, shift and minus cost relative to single
> +	 scalar_stmt. For now we only use length-based partial vectors on
> +	 Power, target specific cost tweaking may be needed for other
> +	 ports in future.  */
> +      unsigned int min_max_cost = 2;
> +      unsigned int shift_cost = 1, minus_cost = 1;
> +
> +      /* Init cost relative to single scalar_stmt.  */
> +      unsigned int prol_cnt = 0;
> +      unsigned int body_cnt = 0;
> +
> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
> +	if (rgc->type)
> +	  {
> +	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
>  
> -      /* If peeled iterations are unknown, count a taken branch and a not taken
> -         branch per peeled loop. Even if scalar loop iterations are known,
> -         vector iterations are not known since peeled prologue iterations are
> -         not known. Hence guards remain the same.  */
> -      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
> -			    NULL, NULL_TREE, 0, vect_prologue);
> -      (void) add_stmt_cost (loop_vinfo,
> -			    target_cost_data, 1, cond_branch_not_taken,
> -			    NULL, NULL_TREE, 0, vect_prologue);
> -      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
> -			    NULL, NULL_TREE, 0, vect_epilogue);
> -      (void) add_stmt_cost (loop_vinfo,
> -			    target_cost_data, 1, cond_branch_not_taken,
> -			    NULL, NULL_TREE, 0, vect_epilogue);
> -      stmt_info_for_cost *si;
> -      int j;
> -      FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j, si)
> -	{
> -	  (void) add_stmt_cost (loop_vinfo, target_cost_data,
> -				si->count * peel_iters_prologue,
> -				si->kind, si->stmt_info, si->vectype,
> -				si->misalign,
> -				vect_prologue);
> -	  (void) add_stmt_cost (loop_vinfo, target_cost_data,
> -				si->count * peel_iters_epilogue,
> -				si->kind, si->stmt_info, si->vectype,
> -				si->misalign,
> -				vect_epilogue);
> -	}
> -    }
> -  else
> -    {
> -      stmt_vector_for_cost prologue_cost_vec, epilogue_cost_vec;
> -      stmt_info_for_cost *si;
> -      int j;
> -      void *data = LOOP_VINFO_TARGET_COST_DATA (loop_vinfo);
> +	    /* Need one shift for niters_total computation.  */
> +	    if (!niters_known_p && nitems != 1)
> +	      prol_cnt += shift_cost;
>  
> -      prologue_cost_vec.create (2);
> -      epilogue_cost_vec.create (2);
> -      peel_iters_prologue = npeel;
> +	    /* Need to handle wrap around.  */
> +	    if (iv_limit == -1
> +		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)
> +		    > compare_precision))
> +	      prol_cnt += (min_max_cost + minus_cost);
>  
> -      (void) vect_get_known_peeling_cost (loop_vinfo, peel_iters_prologue,
> -					  &peel_iters_epilogue,
> -					  &LOOP_VINFO_SCALAR_ITERATION_COST
> -					    (loop_vinfo),
> -					  &prologue_cost_vec,
> -					  &epilogue_cost_vec);
> +	    /* Need to handle batch limit excepting for the 1st one.  */
> +	    prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
>  
> -      FOR_EACH_VEC_ELT (prologue_cost_vec, j, si)
> -	(void) add_stmt_cost (loop_vinfo,
> -			      data, si->count, si->kind, si->stmt_info,
> -			      si->vectype, si->misalign, vect_prologue);
> +	    unsigned int num_vectors = num_vectors_m1 + 1;
> +	    /* Need to set up lengths in prologue, only one MIN required
> +	       since start index is zero.  */
> +	    prol_cnt += min_max_cost * num_vectors;
>  
> -      FOR_EACH_VEC_ELT (epilogue_cost_vec, j, si)
> -	(void) add_stmt_cost (loop_vinfo,
> -			      data, si->count, si->kind, si->stmt_info,
> -			      si->vectype, si->misalign, vect_epilogue);
> +	    /* Need to update lengths in body for next iteration.  */
> +	    if (need_iterate_p)
> +	      body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
> +	  }
>  
> -      prologue_cost_vec.release ();
> -      epilogue_cost_vec.release ();
> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt,
> +			    NULL, NULL_TREE, 0, vect_prologue);
> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
> +			    NULL, NULL_TREE, 0, vect_body);

IMO this seems to be reproducing too much of the functions that you
referred to.  And the danger with that is that they could easily
get out of sync later.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] vect/rs6000: Support vector with length cost modeling
  2020-07-22 17:49     ` [PATCH v2] " Segher Boessenkool
@ 2020-07-27  3:44       ` Kewen.Lin
  0 siblings, 0 replies; 23+ messages in thread
From: Kewen.Lin @ 2020-07-27  3:44 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Richard Biener, GCC Patches, Bill Schmidt, Richard Sandiford

Hi Segher,

Thanks for the comments!

on 2020/7/23 1:49 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Jul 22, 2020 at 09:26:39AM +0800, Kewen.Lin wrote:
>> +/* For some target specific vectorization cost which can't be handled per stmt,
>> +   we check the requisite conditions and adjust the vectorization cost
>> +   accordingly if satisfied.  One typical example is to model shift cost for
>> +   vector with length by counting number of required lengths under condition
>> +   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
>> +
>> +static void
>> +adjust_vect_cost (rs6000_cost_data *data)
>> +{
> 
> Maybe call it rs6000_adjust_vect_cost?  For consistency, but also it
> could (in the future) collide with a globalfunction of the same name (it
> is a very non-specific name).

Done in v4, used rs6000_adjust_vect_cost_per_loop.

> 
>> +	  /* Each length needs one shift to fill into bits 0-7.  */
>> +	  shift_cnt += (num_vectors_m1 + 1);
> 
> That doesn't need parentheses.

Done in v4.

> 
>>    if (cost_data->loop_info)
>> -    rs6000_density_test (cost_data);
>> +    {
>> +      adjust_vect_cost (cost_data);
>> +      rs6000_density_test (cost_data);
>> +    }
> 
> ^^^ consistency :-)
> 
> The rs6000 parts are fine for trunk, thanks!

Thanks!

BR,
Kewen

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v4] vect/rs6000: Support vector with length cost modeling
  2020-07-24 16:21           ` Richard Sandiford
@ 2020-07-27  3:58             ` Kewen.Lin
  2020-07-27 13:40               ` Richard Sandiford
  0 siblings, 1 reply; 23+ messages in thread
From: Kewen.Lin @ 2020-07-27  3:58 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Bill Schmidt, Segher Boessenkool, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 2749 bytes --]

Hi Richard,

Thanks for the review again!

on 2020/7/25 12:21 AM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> 
> Thanks, the rearrangement of the existing code looks good.  Could you
> split that and the new LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) stuff
> out into separate patches?
> 

Split out to https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550691.html.

errr... that subject should have the prefix "[PATCH] vect:".

[snip ...] 
(Some comments in the snipped content will be addressed in v4.)

>> +	 here.  */
>> +
>> +      /* For now we only operate length-based partial vectors on Power,
>> +	 which has constant VF all the time, we need some tweakings below
>> +	 if it doesn't hold in future.  */
>> +      gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
> 
> Where do you rely on this?  There didn't seem to be any obvious
> to_constant uses.  Since this is “only” a cost calculation, we should
> be using assumed_vf.

Sorry for the confusion.  This was intended for poly values like
VF or nitems_per_ctrl, which aren't constant at compile time, to draw
attention to the possible runtime cost of things like scaling up
nitems_step.  But I just realized that computations like multiplying
by a constant can operate directly on the coefficients, so it looks
like there is no runtime cost then?  If so, I was overthinking it
before.  ;-)
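
For example, here is a minimal sketch of that (GCC-internal poly-int.h
types, assuming two coefficients as on aarch64; names and values are
made up for illustration):

  #include "poly-int.h"

  /* A runtime VF of the form 4 + 4 * x.  */
  poly_uint64 vf (4, 4);
  /* Multiplying by a compile-time constant just scales each
     coefficient (giving 8 + 8 * x), so no runtime code is needed.  */
  poly_uint64 nitems_step = vf * 2;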

>> -      prologue_cost_vec.release ();
>> -      epilogue_cost_vec.release ();
>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt,
>> +			    NULL, NULL_TREE, 0, vect_prologue);
>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
>> +			    NULL, NULL_TREE, 0, vect_body);
> 
> IMO this seems to be reproducing too much of the functions that you
> referred to.  And the danger with that is that they could easily
> get out of sync later.

Good point!  The original intention was to model as much as we can,
to avoid bad decisions due to unmodeled pieces, such as when the loop
body is small and this computation becomes non-negligible.
The risk of getting out of sync seems to apply to other code as well.
How about adding some "note" comments in those functions?

The updated v4, which addresses your comments as well as Segher's
comments, is attached.

BR,
Kewen
-----

gcc/ChangeLog:

	* config/rs6000/rs6000.c (rs6000_adjust_vect_cost_per_loop): New
	function.
	(rs6000_finish_cost): Call rs6000_adjust_vect_cost_per_loop.
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
	modeling for vector with length.
	* tree-vect-loop-manip.c (vect_set_loop_controls_directly): Update
	function comment.
	* tree-vect-stmts.c (vect_gen_len): Likewise.
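
To make the counting concrete, here is a worked example (hypothetical
single-rgroup case, values chosen for illustration): with
num_vectors_m1 = 0, nitems > 1, unknown niters, an iv_limit that fits
in the compare type, and need_iterate_p true, the patch below computes

  prologue_cnt = shift_cost + min_max_cost * 1   /* shift + one MIN */
               = 1 + 2 = 3
  body_cnt     = (2 * min_max_cost + minus_cost) * 1
               = 5   /* the two MINs and one minus of vect_gen_len */

both then costed as that many scalar_stmts.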

[-- Attachment #2: cost_v4.diff --]
[-- Type: text/plain, Size: 8541 bytes --]

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 009afc5f894..86ef584e09b 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
   return retval;
 }
 
+/* For some target specific vectorization cost which can't be handled per stmt,
+   we check the requisite conditions and adjust the vectorization cost
+   accordingly if satisfied.  One typical example is to model shift cost for
+   vector with length by counting number of required lengths under condition
+   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
+
+static void
+rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data)
+{
+  struct loop *loop = data->loop_info;
+  gcc_assert (loop);
+  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      unsigned int shift_cnt = 0;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  /* Each length needs one shift to fill into bits 0-7.  */
+	  shift_cnt += num_vectors_m1 + 1;
+
+      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
+    }
+}
+
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
@@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
   if (cost_data->loop_info)
-    rs6000_density_test (cost_data);
+    {
+      rs6000_adjust_vect_cost_per_loop (cost_data);
+      rs6000_density_test (cost_data);
+    }
 
   /* Don't vectorize minimum-vectorization-factor, simple copy loops
      that require versioning for any reason.  The vectorization is at
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 490e7befea3..9d0e3fc525e 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -412,7 +412,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
 
    This means that we cannot guarantee that such an induction variable
    would ever hit a value that produces a set of all-false masks or zero
-   lengths for RGC.  */
+   lengths for RGC.
+
+   Note: if this code changes, please check whether the cost modeling
+   in function vect_estimate_min_profitable_iters needs to be updated.  */
 
 static tree
 vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 06cde4b1da3..a00160a7f2d 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3798,6 +3798,70 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
 			    vector_stmt, NULL, NULL_TREE, 0, vect_body);
     }
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      /* Referring to the functions vect_set_loop_condition_partial_vectors
+	 and vect_set_loop_controls_directly, we need to generate each
+	 length in the prologue and in the loop body if required. Although
+	 there are some possible optimizations, we consider the worst case
+	 here.  */
+
+      /* For wrap around checking.  */
+      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+      unsigned int compare_precision = TYPE_PRECISION (compare_type);
+      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+      bool need_iterate_p
+	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+      /* Init min/max, shift and minus cost relative to single
+	 scalar_stmt. For now we only use length-based partial vectors on
+	 Power; target-specific cost tweaking may be needed for other
+	 ports in future.  */
+      unsigned int min_max_cost = 2;
+      unsigned int shift_cost = 1, minus_cost = 1;
+
+      /* Init cost relative to single scalar_stmt.  */
+      unsigned int prologue_cnt = 0;
+      unsigned int body_cnt = 0;
+
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  {
+	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
+
+	    /* May need one shift for nitems_total computation.  */
+	    if (nitems != 1 && !niters_known_p)
+	      prologue_cnt += shift_cost;
+
+	    /* Need to handle wrap around.  */
+	    if (iv_limit == -1
+		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)
+		    > compare_precision))
+	      prologue_cnt += (min_max_cost + minus_cost);
+
+	    /* Need to handle batch limit except for the 1st one.  */
+	    prologue_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
+
+	    unsigned int num_vectors = num_vectors_m1 + 1;
+	    /* Need to set up lengths in prologue, only one MIN required
+	       since start index is zero.  */
+	    prologue_cnt += min_max_cost * num_vectors;
+
+	    /* Need to update lengths in body for next iteration.  */
+	    if (need_iterate_p)
+	      body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
+	  }
+
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, prologue_cnt,
+			    scalar_stmt, NULL, NULL_TREE, 0, vect_prologue);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
+    }
 
   /* FORNOW: The scalar outside cost is incremented in one of the
      following ways:
@@ -3932,8 +3996,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* ??? The "if" arm is written to handle all cases; see below for what
-     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+     we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* Rewriting the condition above in terms of the number of
 	 vector iterations (vniters) rather than the number of
@@ -3960,7 +4024,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
 		     min_vec_niters);
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  /* Now that we know the minimum number of vector iterations,
 	     find the minimum niters for which the scalar cost is larger:
@@ -4015,6 +4079,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
+  else if (min_profitable_iters < peel_iters_prologue)
+    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
+       vectorized loop to execute at least once.  */
+    min_profitable_iters = peel_iters_prologue;
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -4032,7 +4100,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
   if (vec_outside_cost <= 0)
     min_profitable_estimate = 0;
-  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* This is a repeat of the code above, but with + SOC rather
 	 than - SOC.  */
@@ -4044,7 +4112,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       if (outside_overhead > 0)
 	min_vec_niters = outside_overhead / saving_per_viter + 1;
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  int threshold = (vec_inside_cost * min_vec_niters
 			   + vec_outside_cost
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 31af46ae19c..8550a252f44 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -12090,7 +12090,10 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
    min_of_start_and_end = min (START_INDEX, END_INDEX);
    left_len = END_INDEX - min_of_start_and_end;
    rhs = min (left_len, LEN_LIMIT);
-   LEN = rhs;  */
+   LEN = rhs;
+
+   Note that please check cost modeling whether needed to be updated in
+   function vect_estimate_min_profitable_iters if any changes.  */
 
 gimple_seq
 vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)


* Re: [PATCH v4] vect/rs6000: Support vector with length cost modeling
  2020-07-27  3:58             ` [PATCH v4] " Kewen.Lin
@ 2020-07-27 13:40               ` Richard Sandiford
  2020-07-28  8:36                 ` Kewen.Lin
  2020-07-31 14:51                 ` [PATCH v5] " Kewen.Lin
  0 siblings, 2 replies; 23+ messages in thread
From: Richard Sandiford @ 2020-07-27 13:40 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, Segher Boessenkool, Richard Biener

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> Thanks for the review again!
>
on 2020/7/25 12:21 AM, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> 
>> Thanks, the rearrangement of the existing code looks good.  Could you
>> split that and the new LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) stuff
>> out into separate patches?
>> 
>
> Split out to https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550691.html.
>
> errr... that subject should have the prefix "[PATCH] vect:".
>
> [snip ...] 
> (Some comments on the snipped content will be addressed in v4)
>
>>> +	 here.  */
>>> +
>>> +      /* For now we only operate length-based partial vectors on Power,
>>> +	 which has constant VF all the time, we need some tweakings below
>>> +	 if it doesn't hold in future.  */
>>> +      gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
>> 
>> Where do you rely on this?  There didn't seem to be any obvious
>> to_constant uses.  Since this is “only” a cost calculation, we should
>> be using assumed_vf.
>
> Sorry for the confusion.  This was intended for the poly things like
> VF or nitems_per_ctrl which aren't constant at compile time, to draw
> attention to the possible runtime cost of things like scaling up
> nitems_step etc.  But I just realized that computations like the
> multiply with another constant can operate on the coefficients, so it
> looks like there is no runtime cost then?  If so, I think I was
> overthinking it before.  ;-)
>
>>> -      prologue_cost_vec.release ();
>>> -      epilogue_cost_vec.release ();
>>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt, scalar_stmt,
>>> +			    NULL, NULL_TREE, 0, vect_prologue);
>>> +      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt, scalar_stmt,
>>> +			    NULL, NULL_TREE, 0, vect_body);
>> 
>> IMO this seems to be reproducing too much of the functions that you
>> referred to.  And the danger with that is that they could easily
>> get out of sync later.
>
> Good point!  The original intention was to model things as accurately
> as we can, to avoid bad decisions due to unmodeled pieces, such as when
> the loop body is small and these computations become non-negligible.
> The risk of getting out of sync also applies to other code.  How about
> adding some "note" comments in those functions?
>
> The updated v4 is attached by addressing your comments as well as Segher's
> comments.
>
> BR,
> Kewen
> -----
>
> gcc/ChangeLog:
>
> 	* config/rs6000/rs6000.c (rs6000_adjust_vect_cost_per_loop): New
> 	function.
> 	(rs6000_finish_cost): Call rs6000_adjust_vect_cost_per_loop.
> 	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
> 	modeling for vector with length.
> 	* tree-vect-loop-manip.c (vect_set_loop_controls_directly): Update
> 	function comment.
> 	* tree-vect-stmts.c (vect_gen_len): Likewise.
>
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 009afc5f894..86ef584e09b 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>    return retval;
>  }
>  
> +/* For some target specific vectorization cost which can't be handled per stmt,
> +   we check the requisite conditions and adjust the vectorization cost
> +   accordingly if satisfied.  One typical example is to model shift cost for
> +   vector with length by counting number of required lengths under condition
> +   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
> +
> +static void
> +rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data)
> +{
> +  struct loop *loop = data->loop_info;
> +  gcc_assert (loop);
> +  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
> +
> +  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +    {
> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      unsigned int shift_cnt = 0;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
> +	if (rgc->type)
> +	  /* Each length needs one shift to fill into bits 0-7.  */
> +	  shift_cnt += num_vectors_m1 + 1;
> +
> +      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
> +			    NULL, NULL_TREE, 0, vect_body);
> +    }
> +}
> +
>  /* Implement targetm.vectorize.finish_cost.  */
>  
>  static void
> @@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
>    rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
>  
>    if (cost_data->loop_info)
> -    rs6000_density_test (cost_data);
> +    {
> +      rs6000_adjust_vect_cost_per_loop (cost_data);
> +      rs6000_density_test (cost_data);
> +    }
>  
>    /* Don't vectorize minimum-vectorization-factor, simple copy loops
>       that require versioning for any reason.  The vectorization is at
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index 490e7befea3..9d0e3fc525e 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -412,7 +412,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>  
>     This means that we cannot guarantee that such an induction variable
>     would ever hit a value that produces a set of all-false masks or zero
> -   lengths for RGC.  */
> +   lengths for RGC.
> +
> +   Note that please check cost modeling whether needed to be updated in
> +   function vect_estimate_min_profitable_iters if any changes.  */
>  
>  static tree
>  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 06cde4b1da3..a00160a7f2d 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -3798,6 +3798,70 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>        (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
>  			    vector_stmt, NULL, NULL_TREE, 0, vect_body);
>      }
> +  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +    {
> +      /* Referring to the functions vect_set_loop_condition_partial_vectors
> +	 and vect_set_loop_controls_directly, we need to generate each
> +	 length in the prologue and in the loop body if required. Although
> +	 there are some possible optimizations, we consider the worst case
> +	 here.  */
> +
> +      /* For wrap around checking.  */
> +      tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> +      unsigned int compare_precision = TYPE_PRECISION (compare_type);
> +      widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
> +
> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
> +      bool need_iterate_p
> +	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
> +
> +      /* Init min/max, shift and minus cost relative to single
> +	 scalar_stmt. For now we only use length-based partial vectors on
> +	 Power, target specific cost tweaking may be needed for other
> +	 ports in future.  */
> +      unsigned int min_max_cost = 2;
> +      unsigned int shift_cost = 1, minus_cost = 1;

Please instead add a scalar_min_max to vect_cost_for_stmt, and use
scalar_stmt for shift and minus.  There shouldn't be any Power things
hard-coded into target-independent code.
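
Concretely, that would mean something like this (an abridged sketch --
the real enum in target.h has more entries, and scalar_min_max is only
the proposed addition):

  enum vect_cost_for_stmt
  {
    scalar_stmt,
    scalar_load,
    scalar_store,
    scalar_min_max,	/* proposed entry for scalar MIN/MAX_EXPR */
    vector_stmt
    /* ..., plus the remaining existing entries.  */
  };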

> +      /* Init cost relative to single scalar_stmt.  */
> +      unsigned int prologue_cnt = 0;
> +      unsigned int body_cnt = 0;

Maybe _stmts instead of _cnt, to make it clearer what we're counting.
At first it looked like this was actually the raw cost.  I guess with
the above change we'd also want separate counters for min_max.

> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
> +	if (rgc->type)
> +	  {
> +	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
> +
> +	    /* May need one shift for nitems_total computation.  */
> +	    if (nitems != 1 && !niters_known_p)
> +	      prologue_cnt += shift_cost;
> +
> +	    /* Need to handle wrap around.  */
> +	    if (iv_limit == -1
> +		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)
> +		    > compare_precision))
> +	      prologue_cnt += (min_max_cost + minus_cost);

I think we should instead split this condition out into a subroutine
that both this function and vect_set_loop_condition_partial_vectors
can use.  It would take the loop_vec_info and rgroup_controls as
arguments.

That means that we might recompute things a few more times, but this
shouldn't be performance-critical.

Thanks,
Richard


* Re: [PATCH v4] vect/rs6000: Support vector with length cost modeling
  2020-07-27 13:40               ` Richard Sandiford
@ 2020-07-28  8:36                 ` Kewen.Lin
  2020-07-31 11:03                   ` Richard Sandiford
  2020-07-31 14:51                 ` [PATCH v5] " Kewen.Lin
  1 sibling, 1 reply; 23+ messages in thread
From: Kewen.Lin @ 2020-07-28  8:36 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Bill Schmidt, Segher Boessenkool, Richard Biener

Hi Richard,

Thanks for the comments!

on 2020/7/27 9:40 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:

[snip]

>> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
>> +      bool need_iterate_p
>> +	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>> +	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
>> +
>> +      /* Init min/max, shift and minus cost relative to single
>> +	 scalar_stmt. For now we only use length-based partial vectors on
>> +	 Power, target specific cost tweaking may be needed for other
>> +	 ports in future.  */
>> +      unsigned int min_max_cost = 2;
>> +      unsigned int shift_cost = 1, minus_cost = 1;
> 
> Please instead add a scalar_min_max to vect_cost_for_stmt, and use
> scalar_stmt for shift and minus.  There shouldn't be any Power things
> hard-coded into target-independent code.
> 

Agreed!  It's not good to leave them there.  I thought I would wait and
see whether other targets that support vectors with length could reuse
this, and move it into target-specific code if it turned out not to be
sharable.  But it doesn't look good anyway, so let's fix it.

I had some concerns about the vect_cost_for_stmt approach, since it
seems to invite more computations similar to min/max to be added the
same way; in the long term it probably leads to a situation like:
scalar_min_max, scalar_div_expr, scalar_mul_expr ... an enum (cost
types) bloat, which seems hard to maintain.

I noticed that the i386 port's ix86_add_stmt_cost checks
stmt_info->stmt, whether it is an assignment, and the subcode of the
expression; this provides a chance to cost the statement at a finer
grain, not just as a generic scalar_stmt/vector_stmt.

For the case here, we don't have the stmt_info, but we do know the
type of computation (expression).  How about extending the hook
add_stmt_cost with one extra tree_code argument?  By default it could
be some meaningless code; for needs like this one, we would specify the
tree_code of the computation, like {MIN,MAX}_EXPR, and the
target-specific add_stmt_cost could check this tree_code and adjust the
cost accordingly.
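
To illustrate, I imagine something like the below (only a sketch, not
an actual interface; using ERROR_MARK as the "no particular code"
default is just my assumption):

  /* Hypothetical extension: one extra tree_code argument so targets
     can cost specific computations like {MIN,MAX}_EXPR more precisely
     when no stmt_info is available.  Passing ERROR_MARK would mean
     "no particular code".  */
  unsigned (*add_stmt_cost) (class vec_info *vinfo, void *data,
			     int count, enum vect_cost_for_stmt kind,
			     stmt_vec_info stmt_info, tree vectype,
			     int misalign,
			     enum vect_cost_model_location where,
			     enum tree_code code);

  /* Caller side for the lengths here, where no stmt_info exists:  */
  add_stmt_cost (loop_vinfo, target_cost_data, 1, scalar_stmt,
		 NULL, NULL_TREE, 0, vect_prologue, MIN_EXPR);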

>> +      /* Init cost relative to single scalar_stmt.  */
>> +      unsigned int prologue_cnt = 0;
>> +      unsigned int body_cnt = 0;
> 
> Maybe _stmts instead of _cnt, to make it clearer what we're counting.
> At first it looked like this was actually the raw cost.  I guess with
> the above change we'd also want separate counters for min_max.
> 

Good point, will fix it in v5.

>> +      rgroup_controls *rgc;
>> +      unsigned int num_vectors_m1;
>> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
>> +	if (rgc->type)
>> +	  {
>> +	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
>> +
>> +	    /* May need one shift for nitems_total computation.  */
>> +	    if (nitems != 1 && !niters_known_p)
>> +	      prologue_cnt += shift_cost;
>> +
>> +	    /* Need to handle wrap around.  */
>> +	    if (iv_limit == -1
>> +		|| (wi::min_precision (iv_limit * nitems, UNSIGNED)
>> +		    > compare_precision))
>> +	      prologue_cnt += (min_max_cost + minus_cost);
> 
> I think we should instead split this condition out into a subroutine
> that both this function and vect_set_loop_condition_partial_vectors
> can use.  It would take the loop_vec_info and rgroup_controls as
> arguments.

OK, will split it in v5.

BR,
Kewen


* Re: [PATCH v4] vect/rs6000: Support vector with length cost modeling
  2020-07-28  8:36                 ` Kewen.Lin
@ 2020-07-31 11:03                   ` Richard Sandiford
  2020-07-31 11:20                     ` Richard Biener
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2020-07-31 11:03 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, Segher Boessenkool, Richard Biener

"Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
>>> +      bool need_iterate_p
>>> +	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>>> +	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
>>> +
>>> +      /* Init min/max, shift and minus cost relative to single
>>> +	 scalar_stmt. For now we only use length-based partial vectors on
>>> +	 Power, target specific cost tweaking may be needed for other
>>> +	 ports in future.  */
>>> +      unsigned int min_max_cost = 2;
>>> +      unsigned int shift_cost = 1, minus_cost = 1;
>> 
>> Please instead add a scalar_min_max to vect_cost_for_stmt, and use
>> scalar_stmt for shift and minus.  There shouldn't be any Power things
>> hard-coded into target-independent code.
>> 
>
> Agreed!  It's not good to leave them there.  I thought I would wait and
> see whether other targets that support vectors with length could reuse
> this, and move it into target-specific code if it turned out not to be
> sharable.  But it doesn't look good anyway, so let's fix it.
>
> I had some concerns about the vect_cost_for_stmt approach, since it
> seems to invite more computations similar to min/max to be added the
> same way; in the long term it probably leads to a situation like:
> scalar_min_max, scalar_div_expr, scalar_mul_expr ... an enum (cost
> types) bloat, which seems hard to maintain.

I guess doing that doesn't seem so bad to me :-)  I think it's been
a recurring problem that the current classification isn't fine-grained
enough for some cases.

> I noticed that the i386 port's ix86_add_stmt_cost checks
> stmt_info->stmt, whether it is an assignment, and the subcode of the
> expression; this provides a chance to cost the statement at a finer
> grain, not just as a generic scalar_stmt/vector_stmt.
>
> For the case here, we don't have the stmt_info, but we do know the
> type of computation (expression).  How about extending the hook
> add_stmt_cost with one extra tree_code argument?  By default it could
> be some meaningless code; for needs like this one, we would specify the
> tree_code of the computation, like {MIN,MAX}_EXPR, and the
> target-specific add_stmt_cost could check this tree_code and adjust the
> cost accordingly.

If we do that, I guess we should “promote” code_helper out of
gimple-match.h and use that instead, so that we can handle
internal and built-in functions too.
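
(For reference, code_helper is roughly the following in gimple-match.h
-- a thin wrapper that holds either a tree_code or a combined_fn:)

  class code_helper
  {
  public:
    code_helper () {}
    code_helper (tree_code code) : rep ((int) code) {}
    code_helper (combined_fn fn) : rep (-(int) fn) {}
    operator tree_code () const { return (tree_code) rep; }
    operator combined_fn () const { return (combined_fn) -rep; }
    bool is_tree_code () const { return rep > 0; }
    bool is_fn_code () const { return rep < 0; }
  private:
    int rep;
  };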

Would like to hear Richard's opinion on the best way forward here.

Thanks,
Richard


* Re: [PATCH v4] vect/rs6000: Support vector with length cost modeling
  2020-07-31 11:03                   ` Richard Sandiford
@ 2020-07-31 11:20                     ` Richard Biener
  2020-07-31 12:37                       ` Kewen.Lin
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Biener @ 2020-07-31 11:20 UTC (permalink / raw)
  To: Kewen.Lin, GCC Patches, Bill Schmidt, Segher Boessenkool,
	Richard Biener, Richard Sandiford

On Fri, Jul 31, 2020 at 1:03 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> >>> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
> >>> +      bool need_iterate_p
> >>> +   = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> >>> +      && !vect_known_niters_smaller_than_vf (loop_vinfo));
> >>> +
> >>> +      /* Init min/max, shift and minus cost relative to single
> >>> +    scalar_stmt. For now we only use length-based partial vectors on
> >>> +    Power, target specific cost tweaking may be needed for other
> >>> +    ports in future.  */
> >>> +      unsigned int min_max_cost = 2;
> >>> +      unsigned int shift_cost = 1, minus_cost = 1;
> >>
> >> Please instead add a scalar_min_max to vect_cost_for_stmt, and use
> >> scalar_stmt for shift and minus.  There shouldn't be any Power things
> >> hard-coded into target-independent code.
> >>
> >
> > Agreed!  It's not good to leave them there.  I thought I would wait and
> > see whether other targets that support vectors with length could reuse
> > this, and move it into target-specific code if it turned out not to be
> > sharable.  But it doesn't look good anyway, so let's fix it.

In other generic places like this we simply use three generic scalar_stmt
costs.  At least I don't see how min_max should be different from it
when shift can be the same as minus.  Note this is also how we treat
vectorization of MAX_EXPR - scalar cost is one scalar_stmt and
vector cost is one vector_stmt.  As you say below the add_stmt_cost
hook can adjust based on the actual GIMPLE stmt -- if one is
available (which indeed it isn't here).

I'm somewhat lacking context here as well - we actually GIMPLE
code-generate the min/max/shift/minus and only the eventual
AND is deferred to the target, right?

> > I had some concerns about the vect_cost_for_stmt approach, since it
> > seems to invite more computations similar to min/max to be added the
> > same way; in the long term it probably leads to a situation like:
> > scalar_min_max, scalar_div_expr, scalar_mul_expr ... an enum (cost
> > types) bloat, which seems hard to maintain.
>
> I guess doing that doesn't seem so bad to me :-)  I think it's been
> a recurring problem that the current classification isn't fine-grained
> enough for some cases.

But we eventually want to get rid of this classification enum in favor
of the add_stmt_cost hook ...

> > I noticed that the i386 port's ix86_add_stmt_cost checks
> > stmt_info->stmt, whether it is an assignment, and the subcode of the
> > expression; this provides a chance to cost the statement at a finer
> > grain, not just as a generic scalar_stmt/vector_stmt.
> >
> > For the case here, we don't have the stmt_info, but we do know the
> > type of computation (expression).  How about extending the hook
> > add_stmt_cost with one extra tree_code argument?  By default it could
> > be some meaningless code; for needs like this one, we would specify the
> > tree_code of the computation, like {MIN,MAX}_EXPR, and the
> > target-specific add_stmt_cost could check this tree_code and adjust the
> > cost accordingly.
>
> If we do that, I guess we should “promote” code_helper out of
> gimple-match.h and use that instead, so that we can handle
> internal and built-in functions too.
>
> Would like to hear Richard's opinion on the best way forward here.

I'd say defer this to a later patch and for now simply cost one scalar
stmt for MIN/MAX.  I agree that if we add a tree_code we want a
code_helper instead.  Note that I want to eventually have a
full SLP tree for the final code generation where all info should be
there (including SLP nodes for those min/max ops) and which the
target could traverse.  But I'm not sure if I can make enough progress
on that SLP-only thing for GCC 11 even...

Richard.

> Thanks,
> Richard


* Re: [PATCH v4] vect/rs6000: Support vector with length cost modeling
  2020-07-31 11:20                     ` Richard Biener
@ 2020-07-31 12:37                       ` Kewen.Lin
  2020-07-31 13:01                         ` Richard Biener
  0 siblings, 1 reply; 23+ messages in thread
From: Kewen.Lin @ 2020-07-31 12:37 UTC (permalink / raw)
  To: Richard Biener, Richard Sandiford
  Cc: GCC Patches, Bill Schmidt, Segher Boessenkool

Hi Richards,

on 2020/7/31 7:20 PM, Richard Biener wrote:
> On Fri, Jul 31, 2020 at 1:03 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
>>>>> +      bool need_iterate_p
>>>>> +   = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>>>>> +      && !vect_known_niters_smaller_than_vf (loop_vinfo));
>>>>> +
>>>>> +      /* Init min/max, shift and minus cost relative to single
>>>>> +    scalar_stmt. For now we only use length-based partial vectors on
>>>>> +    Power, target specific cost tweaking may be needed for other
>>>>> +    ports in future.  */
>>>>> +      unsigned int min_max_cost = 2;
>>>>> +      unsigned int shift_cost = 1, minus_cost = 1;
>>>>
>>>> Please instead add a scalar_min_max to vect_cost_for_stmt, and use
>>>> scalar_stmt for shift and minus.  There shouldn't be any Power things
>>>> hard-coded into target-independent code.
>>>>
>>>
>>> Agreed!  It's not good to leave them there.  I thought I would wait and
>>> see whether other targets that support vectors with length could reuse
>>> this, and move it into target-specific code if it turned out not to be
>>> sharable.  But it doesn't look good anyway, so let's fix it.
> 
> In other generic places like this we simply use three generic scalar_stmt
> costs.  At least I don't see how min_max should be different from it
> when shift can be the same as minus.  Note this is also how we treat

Yeah, normally they (min/max/minus/shift) are all costed as
scalar_stmt; except for fine-grained tuning like the i386 port's, they
use the same cost.  On Power9, min/max takes double the cycles of
normal scalar operations like add/shift, so I was trying to model it at
a finer grain: we probably generate a few min/max here, and if the loop
body cost is small I was worried the decision wouldn't be good enough.
But yeah, in other generic places a small loop could also suffer from a
similar inaccuracy; they are essentially the same.

> vectorization of MAX_EXPR - scalar cost is one scalar_stmt and
> vector cost is one vector_stmt.  As you say below the add_stmt_cost
> hook can adjust based on the actual GIMPLE stmt -- if one is
> available (which indeed it isn't here).
> 
> I'm somewhat lacking context here as well - we actually GIMPLE
> code-generate the min/max/shift/minus and only the eventual
>>> AND is deferred to the target, right?
> 

Yes, min/max/shift/minus are all GIMPLE code; targets like Power will
have their own target-specific cost for the shift which moves the
length into the high bits 0:7.

One typical case is as below:

  <bb 3> [local count: 105119324]:
  _26 = n_11(D) * 4;
  _37 = MAX_EXPR <_26, 16>;
  _38 = _37 + 18446744073709551600;
  _40 = MIN_EXPR <_26, 16>;

  <bb 4> [local count: 630715945]:
  # ivtmp_35 = PHI <0(3), ivtmp_36(4)>
  # loop_len_30 = PHI <_40(3), _44(4)>
  _19 = &MEM[base: a_12(D), index: ivtmp_35, offset: 0B];
  vect_24 = .LEN_LOAD (_19, 4B, loop_len_30);
  vect__3.7_23 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect_24);
  _1 = &MEM[base: b_13(D), index: ivtmp_35, offset: 0B];
  vect_17 = .LEN_LOAD (_1, 4B, loop_len_30);
  vect__5.10_9 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect_17);
  vect__7.11_8 = vect__5.10_9 + vect__3.7_23;
  vect_28 = VIEW_CONVERT_EXPR<vector(16) unsigned char>(vect__7.11_8);
  _2 = &MEM[base: c_14(D), index: ivtmp_35, offset: 0B];
  .LEN_STORE (_2, 4B, loop_len_30, vect_28);
  _42 = MIN_EXPR <ivtmp_35, _38>;
  _43 = _38 - _42;
  _44 = MIN_EXPR <_43, 16>;
  ivtmp_36 = ivtmp_35 + 16;
  if (ivtmp_35 < _38)
    goto <bb 4>; [83.33%]
  else
    goto <bb 5>; [16.67%]
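
To make that shift concrete, what each loop_len above costs extra on
Power is effectively (just a sketch; the helper name is made up):

  /* lxvl/stxvl expect the byte count in bits 0:7, i.e. the high byte,
     of the 64-bit length operand, so each length takes one extra
     shift left by 56 -- a single sldi instruction.  */
  static inline unsigned long long
  encode_len_for_lxvl (unsigned long long nbytes)
  {
    return nbytes << 56;
  }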


>>> I had some concerns about the vect_cost_for_stmt approach, since it
>>> seems to invite more computations similar to min/max to be added the
>>> same way; in the long term it probably leads to a situation like:
>>> scalar_min_max, scalar_div_expr, scalar_mul_expr ... an enum (cost
>>> types) bloat, which seems hard to maintain.
>>
>> I guess doing that doesn't seem so bad to me :-)  I think it's been
>> a recurring problem that the current classification isn't fine-grained
>> enough for some cases.
> 
> But we eventually want to get rid of this classification enum in favor
> of the add_stmt_cost hook ...
> 

Nice, sounds like each target has to handle it at a fine grain.  :)
IIUC, the current modeling doesn't consider instruction dependencies,
execution resources, etc. the way scheduling does, so even if all costs
are fine-grained enough, the decision could be sub-optimal.

>>> I noticed that the i386 port's ix86_add_stmt_cost checks
>>> stmt_info->stmt, whether it is an assignment, and the subcode of the
>>> expression; this provides a chance to cost the statement at a finer
>>> grain, not just as a generic scalar_stmt/vector_stmt.
>>>
>>> For the case here, we don't have the stmt_info, but we do know the
>>> type of computation (expression).  How about extending the hook
>>> add_stmt_cost with one extra tree_code argument?  By default it could
>>> be some meaningless code; for needs like this one, we would specify the
>>> tree_code of the computation, like {MIN,MAX}_EXPR, and the
>>> target-specific add_stmt_cost could check this tree_code and adjust the
>>> cost accordingly.
>>
>> If we do that, I guess we should “promote” code_helper out of
>> gimple-match.h and use that instead, so that we can handle
>> internal and built-in functions too.
>>
>> Would like to hear Richard's opinion on the best way forward here.
> 
> I'd say defer this to a later patch and for now simply cost one scalar
> stmt for MIN/MAX.  I agree that if we add a tree_code we want a
> code_helper instead.  Note that I want to eventually have a
> full SLP tree for the final code generation where all info should be
> there (including SLP nodes for those min/max ops) and which the
> target could traverse.  But I'm not sure if I can make enough progress
> on that SLP-only thing for GCC 11 even...
> 

OK, I'm fine with costing MIN/MAX as scalar_stmt here.  Thanks to both
of you!  This new SLP framework sounds very promising and powerful.  :-)


BR,
Kewen


* Re: [PATCH v4] vect/rs6000: Support vector with length cost modeling
  2020-07-31 12:37                       ` Kewen.Lin
@ 2020-07-31 13:01                         ` Richard Biener
  2020-07-31 13:21                           ` Kewen.Lin
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Biener @ 2020-07-31 13:01 UTC (permalink / raw)
  To: Kewen.Lin
  Cc: Richard Sandiford, GCC Patches, Bill Schmidt, Segher Boessenkool

On Fri, Jul 31, 2020 at 2:37 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richards,
>
> on 2020/7/31 7:20 PM, Richard Biener wrote:
> > On Fri, Jul 31, 2020 at 1:03 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> "Kewen.Lin" <linkw@linux.ibm.com> writes:
> >>>>> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
> >>>>> +      bool need_iterate_p
> >>>>> +   = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> >>>>> +      && !vect_known_niters_smaller_than_vf (loop_vinfo));
> >>>>> +
> >>>>> +      /* Init min/max, shift and minus cost relative to single
> >>>>> +    scalar_stmt. For now we only use length-based partial vectors on
> >>>>> +    Power, target specific cost tweaking may be needed for other
> >>>>> +    ports in future.  */
> >>>>> +      unsigned int min_max_cost = 2;
> >>>>> +      unsigned int shift_cost = 1, minus_cost = 1;
> >>>>
> >>>> Please instead add a scalar_min_max to vect_cost_for_stmt, and use
> >>>> scalar_stmt for shift and minus.  There shouldn't be any Power things
> >>>> hard-coded into target-independent code.
> >>>>
> >>>
> >>> Agreed!  It's not good to leave them there.  I thought I would wait and
> >>> see whether other targets that support vectors with length could reuse
> >>> this, and move it into target-specific code if it turned out not to be
> >>> sharable.  But it doesn't look good anyway, so let's fix it.
> >
> > In other generic places like this we simply use three generic scalar_stmt
> > costs.  At least I don't see how min_max should be different from it
> > when shift can be the same as minus.  Note this is also how we treat
>
> Yeah, normally they (min/max/minus/shift) are all costed as
> scalar_stmt; except for fine-grained tuning like the i386 port's, they
> use the same cost.  On Power9, min/max takes double the cycles of
> normal scalar operations like add/shift, so I was trying to model it at
> a finer grain: we probably generate a few min/max here, and if the loop
> body cost is small I was worried the decision wouldn't be good enough.
> But yeah, in other generic places a small loop could also suffer from a
> similar inaccuracy; they are essentially the same.
>
> > vectorization of MAX_EXPR - scalar cost is one scalar_stmt and
> > vector cost is one vector_stmt.  As you say below the add_stmt_cost
> > hook can adjust based on the actual GIMPLE stmt -- if one is
> > available (which indeed it isn't here).
> >
> > I'm somewhat lacking context here as well - we actually GIMPLE
> > code-generate the min/max/shift/minus and only the eventual
> > AND is deferred to the target, right?
> >
>
> Yes, min/max/shift/minus are all GIMPLE code; targets like Power will
> have their own target-specific cost for the shift which moves the
> length into the high bits 0:7.
>
> One typical case is as below:
>
>   <bb 3> [local count: 105119324]:
>   _26 = n_11(D) * 4;
>   _37 = MAX_EXPR <_26, 16>;
>   _38 = _37 + 18446744073709551600;
>   _40 = MIN_EXPR <_26, 16>;
>
>   <bb 4> [local count: 630715945]:
>   # ivtmp_35 = PHI <0(3), ivtmp_36(4)>
>   # loop_len_30 = PHI <_40(3), _44(4)>
>   _19 = &MEM[base: a_12(D), index: ivtmp_35, offset: 0B];
>   vect_24 = .LEN_LOAD (_19, 4B, loop_len_30);
>   vect__3.7_23 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect_24);
>   _1 = &MEM[base: b_13(D), index: ivtmp_35, offset: 0B];
>   vect_17 = .LEN_LOAD (_1, 4B, loop_len_30);
>   vect__5.10_9 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect_17);
>   vect__7.11_8 = vect__5.10_9 + vect__3.7_23;
>   vect_28 = VIEW_CONVERT_EXPR<vector(16) unsigned char>(vect__7.11_8);
>   _2 = &MEM[base: c_14(D), index: ivtmp_35, offset: 0B];
>   .LEN_STORE (_2, 4B, loop_len_30, vect_28);
>   _42 = MIN_EXPR <ivtmp_35, _38>;
>   _43 = _38 - _42;
>   _44 = MIN_EXPR <_43, 16>;
>   ivtmp_36 = ivtmp_35 + 16;
>   if (ivtmp_35 < _38)
>     goto <bb 4>; [83.33%]
>   else
>     goto <bb 5>; [16.67%]
>
>
> >>> I had some concerns about the vect_cost_for_stmt approach, since it
> >>> seems to invite more computations similar to min/max to be added the
> >>> same way; in the long term it probably leads to a situation like:
> >>> scalar_min_max, scalar_div_expr, scalar_mul_expr ... an enum (cost
> >>> types) bloat, which seems hard to maintain.
> >>
> >> I guess doing that doesn't seem so bad to me :-)  I think it's been
> >> a recurring problem that the current classification isn't fine-grained
> >> enough for some cases.
> >
> > But we eventually want to get rid of this classification enum in favor
> > of the add_stmt_cost hook ...
> >
>
> Nice, sounds like each target has to handle it at a fine grain.  :)
> IIUC, the current modeling doesn't consider instruction dependencies,
> execution resources, etc. the way scheduling does, so even if all costs
> are fine-grained enough, the decision could be sub-optimal.

That's what the finish_cost hook is for - the target can collect all
stmts during add_stmt and then apply adjustments at the end (IIRC Power
already does this for shift operation resource constraints).
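
As a rough illustration of that collect-then-adjust pattern (made-up
names and simplified signatures; the real hooks take more arguments):

  /* Per-loop cost accumulator a target might keep.  */
  struct my_cost_data
  {
    unsigned body_cost;
    unsigned nshifts;	/* collected while costing stmts */
  };

  static void
  my_add_stmt_cost (struct my_cost_data *d, unsigned count, bool is_shift)
  {
    d->body_cost += count;	/* just record for now */
    if (is_shift)
      d->nshifts += count;
  }

  static void
  my_finish_cost (struct my_cost_data *d, unsigned shift_units)
  {
    /* Once all stmts are known, apply a resource-based adjustment,
       e.g. penalize shifts beyond what the units can absorb.  */
    if (d->nshifts > shift_units)
      d->body_cost += d->nshifts - shift_units;
  }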

> >>> I noticed that the i386 port's ix86_add_stmt_cost checks
> >>> stmt_info->stmt, whether it is an assignment, and the subcode of the
> >>> expression; this provides a chance to cost the statement at a finer
> >>> grain, not just as a generic scalar_stmt/vector_stmt.
> >>>
> >>> For the case here, we don't have the stmt_info, but we do know the
> >>> type of computation (expression).  How about extending the hook
> >>> add_stmt_cost with one extra tree_code argument?  By default it could
> >>> be some meaningless code; for needs like this one, we would specify the
> >>> tree_code of the computation, like {MIN,MAX}_EXPR, and the
> >>> target-specific add_stmt_cost could check this tree_code and adjust the
> >>> cost accordingly.
> >>
> >> If we do that, I guess we should “promote” code_helper out of
> >> gimple-match.h and use that instead, so that we can handle
> >> internal and built-in functions too.
> >>
> >> Would like to hear Richard's opinion on the best way forward here.
> >
> > I'd say defer this to a later patch and for now simply cost one scalar
> > stmt for MIN/MAX.  I agree that if we add a tree_code we want a
> > code_helper instead.  Note that I want to eventually have a
> > full SLP tree for the final code generation where all info should be
> > there (including SLP nodes for those min/max ops) and which the
> > target could traverse.  But I'm not sure if I can make enough progress
> > on that SLP-only thing for GCC 11 even...
> >
>
> OK, I'm fine with costing MIN/MAX as scalar_stmt here.  Thanks to both
> of you!  This new SLP framework sounds very promising and powerful.  :-)

As do all things that are still on paper^Win my head only ;)

Richard.

>
> BR,
> Kewen


* Re: [PATCH v4] vect/rs6000: Support vector with length cost modeling
  2020-07-31 13:01                         ` Richard Biener
@ 2020-07-31 13:21                           ` Kewen.Lin
  0 siblings, 0 replies; 23+ messages in thread
From: Kewen.Lin @ 2020-07-31 13:21 UTC (permalink / raw)
  To: Richard Biener
  Cc: Richard Sandiford, GCC Patches, Bill Schmidt, Segher Boessenkool

on 2020/7/31 9:01 PM, Richard Biener wrote:
> On Fri, Jul 31, 2020 at 2:37 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Richards,
>>
>> on 2020/7/31 7:20 PM, Richard Biener wrote:
>>> On Fri, Jul 31, 2020 at 1:03 PM Richard Sandiford
>>> <richard.sandiford@arm.com> wrote:
>>>>
>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>>>> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
>>>>>>> +      bool need_iterate_p
>>>>>>> +   = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>>>>>>> +      && !vect_known_niters_smaller_than_vf (loop_vinfo));
>>>>>>> +
>>>>>>> +      /* Init min/max, shift and minus cost relative to single
>>>>>>> +    scalar_stmt. For now we only use length-based partial vectors on
>>>>>>> +    Power, target specific cost tweaking may be needed for other
>>>>>>> +    ports in future.  */
>>>>>>> +      unsigned int min_max_cost = 2;
>>>>>>> +      unsigned int shift_cost = 1, minus_cost = 1;
>>>>>>
>>>>>> Please instead add a scalar_min_max to vect_cost_for_stmt, and use
>>>>>> scalar_stmt for shift and minus.  There shouldn't be any Power things
>>>>>> hard-coded into target-independent code.
>>>>>>
>>>>>
>>>>> Agreed!  It's not good to leave them there.  I thought I would wait and
>>>>> see whether other targets that support vectors with length could reuse
>>>>> this, and move it into target-specific code if it turned out not to be
>>>>> sharable.  But it doesn't look good anyway, so let's fix it.
>>>
>>> In other generic places like this we simply use three generic scalar_stmt
>>> costs.  At least I don't see how min_max should be different from it
>>> when shift can be the same as minus.  Note this is also how we treat
>>
>> Yeah, normally they (min/max/minus/shift) are all costed as
>> scalar_stmt; except for fine-grained tuning like the i386 port's, they
>> use the same cost.  On Power9, min/max takes double the cycles of
>> normal scalar operations like add/shift, so I was trying to model it at
>> a finer grain: we probably generate a few min/max here, and if the loop
>> body cost is small I was worried the decision wouldn't be good enough.
>> But yeah, in other generic places a small loop could also suffer from a
>> similar inaccuracy; they are essentially the same.
>>
>>> vectorization of MAX_EXPR - scalar cost is one scalar_stmt and
>>> vector cost is one vector_stmt.  As you say below the add_stmt_cost
>>> hook can adjust based on the actual GIMPLE stmt -- if one is
>>> available (which indeed it isn't here).
>>>
>>> I'm somewhat lacking context here as well - we actually GIMPLE
>>> code-generate the min/max/shift/minus and only the eventual
>>> AND is deferred to the target, right?
>>>
>>
>> Yes, min/max/shift/minus are all GIMPLE code; targets like Power will
>> have their own target-specific cost for the shift which moves the
>> length into the high bits 0:7.
>>
>> One typical case is as below:
>>
>>   <bb 3> [local count: 105119324]:
>>   _26 = n_11(D) * 4;
>>   _37 = MAX_EXPR <_26, 16>;
>>   _38 = _37 + 18446744073709551600;
>>   _40 = MIN_EXPR <_26, 16>;
>>
>>   <bb 4> [local count: 630715945]:
>>   # ivtmp_35 = PHI <0(3), ivtmp_36(4)>
>>   # loop_len_30 = PHI <_40(3), _44(4)>
>>   _19 = &MEM[base: a_12(D), index: ivtmp_35, offset: 0B];
>>   vect_24 = .LEN_LOAD (_19, 4B, loop_len_30);
>>   vect__3.7_23 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect_24);
>>   _1 = &MEM[base: b_13(D), index: ivtmp_35, offset: 0B];
>>   vect_17 = .LEN_LOAD (_1, 4B, loop_len_30);
>>   vect__5.10_9 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(vect_17);
>>   vect__7.11_8 = vect__5.10_9 + vect__3.7_23;
>>   vect_28 = VIEW_CONVERT_EXPR<vector(16) unsigned char>(vect__7.11_8);
>>   _2 = &MEM[base: c_14(D), index: ivtmp_35, offset: 0B];
>>   .LEN_STORE (_2, 4B, loop_len_30, vect_28);
>>   _42 = MIN_EXPR <ivtmp_35, _38>;
>>   _43 = _38 - _42;
>>   _44 = MIN_EXPR <_43, 16>;
>>   ivtmp_36 = ivtmp_35 + 16;
>>   if (ivtmp_35 < _38)
>>     goto <bb 4>; [83.33%]
>>   else
>>     goto <bb 5>; [16.67%]
>>
>>
>>>>> I had some concerns about the vect_cost_for_stmt approach, since it
>>>>> seems to invite more computations similar to min/max to be added the
>>>>> same way; in the long term it probably leads to a situation like:
>>>>> scalar_min_max, scalar_div_expr, scalar_mul_expr ... an enum (cost
>>>>> types) bloat, which seems hard to maintain.
>>>>
>>>> I guess doing that doesn't seem so bad to me :-)  I think it's been
>>>> a recurring problem that the current classification isn't fine-grained
>>>> enough for some cases.
>>>
>>> But we eventually want to get rid of this classification enum in favor
>>> of the add_stmt_cost hook ...
>>>
>>
>> Nice, sounds like each target has to handle it at a fine grain.  :)
>> IIUC, the current modeling doesn't consider instruction dependencies,
>> execution resources, etc. the way scheduling does, so even if all costs
>> are fine-grained enough, the decision could be sub-optimal.
> 
> That's what the finish_cost hook is for - the target can collect all
> stmts during add_stmt and then apply adjustments at the end (IIRC
> Power already does this for shift operation resource constraints).
> 

Oh, I forgot that we have that hook, though I had just imagined some
generic code would do this.  Good to know we have a way to do it.  Nice!

BR,
Kewen


* [PATCH v5] vect/rs6000: Support vector with length cost modeling
  2020-07-27 13:40               ` Richard Sandiford
  2020-07-28  8:36                 ` Kewen.Lin
@ 2020-07-31 14:51                 ` Kewen.Lin
  2020-08-05  7:27                   ` Richard Sandiford
  1 sibling, 1 reply; 23+ messages in thread
From: Kewen.Lin @ 2020-07-31 14:51 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford
  Cc: Bill Schmidt, Segher Boessenkool, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 827 bytes --]

Hi Richard,

New version v5 is attached.

v5 main changes against v4:
  1) use _stmt instead of _cnt to avoid confusion
  2) factor out function vect_rgroup_iv_might_wrap_p
  3) use generic scalar_stmt for min/max stmt

Does this look better?  Thanks in advance!

BR,
Kewen
-----

gcc/ChangeLog:

	* config/rs6000/rs6000.c (rs6000_adjust_vect_cost_per_loop): New
	function.
	(rs6000_finish_cost): Call rs6000_adjust_vect_cost_per_loop.
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
	modeling for vector with length.
	(vect_rgroup_iv_might_wrap_p): New function, factored out from...
	* tree-vect-loop-manip.c (vect_set_loop_controls_directly): ...this.
	Update function comment.
	* tree-vect-stmts.c (vect_gen_len): Update function comment.
	* tree-vectorizer.h (vect_rgroup_iv_might_wrap_p): New declare.

[-- Attachment #2: cost_v5.diff --]
[-- Type: text/plain, Size: 10429 bytes --]

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 009afc5f894..86ef584e09b 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
   return retval;
 }
 
+/* For some target specific vectorization cost which can't be handled per stmt,
+   we check the requisite conditions and adjust the vectorization cost
+   accordingly if satisfied.  One typical example is to model shift cost for
+   vector with length by counting number of required lengths under condition
+   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
+
+static void
+rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data)
+{
+  struct loop *loop = data->loop_info;
+  gcc_assert (loop);
+  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      unsigned int shift_cnt = 0;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  /* Each length needs one shift to fill into bits 0-7.  */
+	  shift_cnt += num_vectors_m1 + 1;
+
+      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
+			    NULL, NULL_TREE, 0, vect_body);
+    }
+}
+
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
@@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
   if (cost_data->loop_info)
-    rs6000_density_test (cost_data);
+    {
+      rs6000_adjust_vect_cost_per_loop (cost_data);
+      rs6000_density_test (cost_data);
+    }
 
   /* Don't vectorize minimum-vectorization-factor, simple copy loops
      that require versioning for any reason.  The vectorization is at
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 490e7befea3..2a0d3b1840d 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -412,7 +412,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
 
    This means that we cannot guarantee that such an induction variable
    would ever hit a value that produces a set of all-false masks or zero
-   lengths for RGC.  */
+   lengths for RGC.
+
+   Note that please check cost modeling whether needed to be updated in
+   function vect_estimate_min_profitable_iters if any changes.  */
 
 static tree
 vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
@@ -711,8 +714,6 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
   else
     niters = gimple_convert (&preheader_seq, compare_type, niters);
 
-  widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
-
   /* Iterate over all the rgroups and fill in their controls.  We could use
      the first control from any rgroup for the loop condition; here we
      arbitrarily pick the last.  */
@@ -739,11 +740,7 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
 
 	/* See whether zero-based IV would ever generate all-false masks
 	   or zero length before wrapping around.  */
-	unsigned nitems_per_iter = rgc->max_nscalars_per_iter * rgc->factor;
-	bool might_wrap_p
-	  = (iv_limit == -1
-	     || (wi::min_precision (iv_limit * nitems_per_iter, UNSIGNED)
-		 > compare_precision));
+	bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
 
 	/* Set up all controls for this group.  */
 	test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 43ec4fb04d5..dea9ca6778f 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3798,6 +3798,58 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
 			    vector_stmt, NULL, NULL_TREE, 0, vect_body);
     }
+  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      /* Referring to the functions vect_set_loop_condition_partial_vectors
+	 and vect_set_loop_controls_directly, we need to generate each
+	 length in the prologue and in the loop body if required. Although
+	 there are some possible optimizations, we consider the worst case
+	 here.  */
+
+      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+      bool need_iterate_p
+	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+      /* Calculate how many statements to be added.  */
+      unsigned int prologue_stmt = 0;
+      unsigned int body_stmt = 0;
+
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	if (rgc->type)
+	  {
+	    /* May need one SHIFT for nitems_total computation.  */
+	    unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
+	    if (nitems != 1 && !niters_known_p)
+	      prologue_stmt += 1;
+
+	    /* May need one MAX and one MINUS for wrap around.  */
+	    if (vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc))
+	      prologue_stmt += 2;
+
+	    /* Need one MAX and one MINUS for each batch limit excepting for
+	       the 1st one.  */
+	    prologue_stmt += num_vectors_m1 * 2;
+
+	    unsigned int num_vectors = num_vectors_m1 + 1;
+
+	    /* Need to set up lengths in prologue, only one MIN required
+	       for each since start index is zero.  */
+	    prologue_stmt += num_vectors;
+
+	    /* Each may need two MINs and one MINUS to update lengths in body
+	       for next iteration.  */
+	    if (need_iterate_p)
+	      body_stmt += 3 * num_vectors;
+	  }
+
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, prologue_stmt,
+			    scalar_stmt, NULL, NULL_TREE, 0, vect_prologue);
+      (void) add_stmt_cost (loop_vinfo, target_cost_data, body_stmt,
+			    scalar_stmt, NULL, NULL_TREE, 0, vect_body);
+    }
 
   /* FORNOW: The scalar outside cost is incremented in one of the
      following ways:
@@ -3932,8 +3984,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* ??? The "if" arm is written to handle all cases; see below for what
-     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+     we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* Rewriting the condition above in terms of the number of
 	 vector iterations (vniters) rather than the number of
@@ -3960,7 +4012,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
 		     min_vec_niters);
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  /* Now that we know the minimum number of vector iterations,
 	     find the minimum niters for which the scalar cost is larger:
@@ -4015,6 +4067,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
+  else if (min_profitable_iters < peel_iters_prologue)
+    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
+       vectorized loop to execute at least once.  */
+    min_profitable_iters = peel_iters_prologue;
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -4032,7 +4088,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
   if (vec_outside_cost <= 0)
     min_profitable_estimate = 0;
-  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* This is a repeat of the code above, but with + SOC rather
 	 than - SOC.  */
@@ -4044,7 +4100,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       if (outside_overhead > 0)
 	min_vec_niters = outside_overhead / saving_per_viter + 1;
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  int threshold = (vec_inside_cost * min_vec_niters
 			   + vec_outside_cost
@@ -9351,3 +9407,25 @@ vect_iv_limit_for_partial_vectors (loop_vec_info loop_vinfo)
   return iv_limit;
 }
 
+/* For the given rgroup_controls RGC, check whether an induction variable
+   would ever hit a value that produces a set of all-false masks or zero
+   lengths before wrapping around.  Return true if it's possible to wrap
+   around before hitting the desirable value, otherwise return false.  */
+
+bool
+vect_rgroup_iv_might_wrap_p (loop_vec_info loop_vinfo, rgroup_controls *rgc)
+{
+  widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+  if (iv_limit == -1)
+    return true;
+
+  tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+  unsigned int compare_precision = TYPE_PRECISION (compare_type);
+  unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
+
+  if (wi::min_precision (iv_limit * nitems, UNSIGNED) > compare_precision)
+    return true;
+
+  return false;
+}
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 31af46ae19c..8550a252f44 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -12090,7 +12090,10 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
    min_of_start_and_end = min (START_INDEX, END_INDEX);
    left_len = END_INDEX - min_of_start_and_end;
    rhs = min (left_len, LEN_LIMIT);
-   LEN = rhs;  */
+   LEN = rhs;
+
+   Note that please check cost modeling whether needed to be updated in
+   function vect_estimate_min_profitable_iters if any changes.  */
 
 gimple_seq
 vect_gen_len (tree len, tree start_index, tree end_index, tree len_limit)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 5466c78c20b..7ab6d8234ac 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1960,6 +1960,7 @@ extern tree vect_create_addr_base_for_vector_ref (vec_info *,
 
 /* In tree-vect-loop.c.  */
 extern widest_int vect_iv_limit_for_partial_vectors (loop_vec_info loop_vinfo);
+bool vect_rgroup_iv_might_wrap_p (loop_vec_info, rgroup_controls *);
 /* Used in tree-vect-loop-manip.c */
 extern void determine_peel_for_niter (loop_vec_info);
 /* Used in gimple-loop-interchange.c and tree-parloops.c.  */
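
As a sanity check of the counting above, consider a hypothetical loop
with a single rgroup where nitems = 4, niters is unknown at compile
time, the IV may wrap, and the main loop needs to iterate:

  prologue_stmts = 1 (SHIFT for nitems_total)
		 + 2 (MAX + MINUS for wrap around)
		 + 0 (num_vectors_m1 = 0, no extra batch limits)
		 + 1 (one MIN to set up the single length)
		 = 4 scalar_stmt units
  body_stmts	 = 3 * 1 (two MINs and one MINUS per length)
		 = 3 scalar_stmt units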


* Re: [PATCH v5] vect/rs6000: Support vector with length cost modeling
  2020-07-31 14:51                 ` [PATCH v5] " Kewen.Lin
@ 2020-08-05  7:27                   ` Richard Sandiford
  2020-08-05 14:06                     ` Segher Boessenkool
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2020-08-05  7:27 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: GCC Patches, Bill Schmidt, Segher Boessenkool, Richard Biener

"Kewen.Lin" <linkw@linux.ibm.com> writes:
> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> index 490e7befea3..2a0d3b1840d 100644
> --- a/gcc/tree-vect-loop-manip.c
> +++ b/gcc/tree-vect-loop-manip.c
> @@ -412,7 +412,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>  
>     This means that we cannot guarantee that such an induction variable
>     would ever hit a value that produces a set of all-false masks or zero
> -   lengths for RGC.  */
> +   lengths for RGC.
> +
> +   Note that please check cost modeling whether needed to be updated in
> +   function vect_estimate_min_profitable_iters if any changes.  */

Maybe:

   Note: the cost of the code generated by this function is modeled
   by vect_estimate_min_profitable_iters, so changes here may need
   corresponding changes there.  */

> […]
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 43ec4fb04d5..dea9ca6778f 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -3798,6 +3798,58 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>        (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
>  			    vector_stmt, NULL, NULL_TREE, 0, vect_body);
>      }
> +  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +    {
> +      /* Referring to the functions vect_set_loop_condition_partial_vectors
> +	 and vect_set_loop_controls_directly, we need to generate each
> +	 length in the prologue and in the loop body if required. Although
> +	 there are some possible optimizations, we consider the worst case
> +	 here.  */
> +
> +      bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
> +      bool need_iterate_p
> +	= (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +	   && !vect_known_niters_smaller_than_vf (loop_vinfo));
> +
> +      /* Calculate how many statements need to be added.  */
> +      unsigned int prologue_stmt = 0;
> +      unsigned int body_stmt = 0;

Sorry to be nit-picky, but “_stmts” reads better to me.

> […]
> @@ -4015,6 +4067,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>        && min_profitable_iters < (assumed_vf + peel_iters_prologue))
>      /* We want the vectorized loop to execute at least once.  */
>      min_profitable_iters = assumed_vf + peel_iters_prologue;
> +  else if (min_profitable_iters < peel_iters_prologue)
> +    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
> +       vectorized loop to execute at least once.  */
> +    min_profitable_iters = peel_iters_prologue;

s/to execute/executes/

> […]
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 31af46ae19c..8550a252f44 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -12090,7 +12090,10 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>     min_of_start_and_end = min (START_INDEX, END_INDEX);
>     left_len = END_INDEX - min_of_start_and_end;
>     rhs = min (left_len, LEN_LIMIT);
> -   LEN = rhs;  */
> +   LEN = rhs;
> +
> +   Note that please check cost modeling whether needed to be updated in
> +   function vect_estimate_min_profitable_iters if any changes.  */

Same suggested edit as above.

OK for the vectoriser parts with those changes, thanks.

Richard

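To make the worst-case accounting discussed in this review concrete, the counting has roughly the following shape.  This is a simplified, self-contained sketch under assumed rules (one setup statement per length in the prologue, one update per length in the body when the loop iterates); the patch's exact accounting lives in vect_estimate_min_profitable_iters.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative-only model of the length-preparation statement
   counts; the rules here are assumptions for the sketch, not the
   patch's exact code.  */
static void
count_length_stmts (unsigned int nlengths, bool niters_known_p,
		    bool need_iterate_p,
		    unsigned int *prologue_stmts, unsigned int *body_stmts)
{
  *prologue_stmts = nlengths;	/* set up each length once  */
  if (!niters_known_p)
    *prologue_stmts += 1;	/* compute niters at run time  */
  *body_stmts = need_iterate_p ? nlengths : 0;	/* per-iteration updates  */
}

int
main (void)
{
  unsigned int prologue_stmts, body_stmts;
  /* Two lengths, unknown niters, loop iterates more than once.  */
  count_length_stmts (2, false, true, &prologue_stmts, &body_stmts);
  printf ("prologue: %u, body: %u\n", prologue_stmts, body_stmts);
  return 0;
}

Statement counts of this kind would then feed add_stmt_cost with the vect_prologue and vect_body locations respectively, in the same way the masking counterpart quoted earlier charges num_masks - 1 vector_stmt costs to vect_body.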

* Re: [PATCH v5] vect/rs6000: Support vector with length cost modeling
  2020-08-05  7:27                   ` Richard Sandiford
@ 2020-08-05 14:06                     ` Segher Boessenkool
  2020-08-06  6:47                       ` Kewen.Lin
  0 siblings, 1 reply; 23+ messages in thread
From: Segher Boessenkool @ 2020-08-05 14:06 UTC (permalink / raw)
  To: Kewen.Lin, GCC Patches, Bill Schmidt, Richard Biener, richard.sandiford

On Wed, Aug 05, 2020 at 08:27:57AM +0100, Richard Sandiford wrote:
> OK for the vectoriser parts with those changes, thanks.

The rs6000 part is still fine as well.  Thanks!


Segher


* Re: [PATCH v5] vect/rs6000: Support vector with length cost modeling
  2020-08-05 14:06                     ` Segher Boessenkool
@ 2020-08-06  6:47                       ` Kewen.Lin
  0 siblings, 0 replies; 23+ messages in thread
From: Kewen.Lin @ 2020-08-06  6:47 UTC (permalink / raw)
  To: Segher Boessenkool, richard.sandiford
  Cc: GCC Patches, Bill Schmidt, Richard Biener

on 2020/8/5 10:06 PM, Segher Boessenkool wrote:
> On Wed, Aug 05, 2020 at 08:27:57AM +0100, Richard Sandiford wrote:
>> OK for the vectoriser parts with those changes, thanks.
> 
> The rs6000 part is still fine as well.  Thanks!
> 
> 

Committed via r11-2586.  Thanks all!

BR,
Kewen


Thread overview: 23+ messages
2020-07-21  5:51 [PATCH] vect: Support vector with length cost modeling Kewen.Lin
2020-07-21  7:57 ` Richard Biener
2020-07-22  1:26   ` [PATCH v2] vect/rs6000: " Kewen.Lin
2020-07-22  6:38     ` Richard Biener
2020-07-22  7:08       ` Kewen.Lin
2020-07-22  9:11     ` Richard Sandiford
2020-07-22 15:48       ` [PATCH v3] " Kewen.Lin
2020-07-22 16:25         ` Kewen.Lin
2020-07-24 16:21           ` Richard Sandiford
2020-07-27  3:58             ` [PATCH v4] " Kewen.Lin
2020-07-27 13:40               ` Richard Sandiford
2020-07-28  8:36                 ` Kewen.Lin
2020-07-31 11:03                   ` Richard Sandiford
2020-07-31 11:20                     ` Richard Biener
2020-07-31 12:37                       ` Kewen.Lin
2020-07-31 13:01                         ` Richard Biener
2020-07-31 13:21                           ` Kewen.Lin
2020-07-31 14:51                 ` [PATCH v5] " Kewen.Lin
2020-08-05  7:27                   ` Richard Sandiford
2020-08-05 14:06                     ` Segher Boessenkool
2020-08-06  6:47                       ` Kewen.Lin
2020-07-22 17:49     ` [PATCH v2] " Segher Boessenkool
2020-07-27  3:44       ` Kewen.Lin
