[PATCH] rs6000: Adjust rs6000_density_test for strided

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH] rs6000: Adjust rs6000_density_test for strided_load
@ 2021-05-07  2:29 Kewen.Lin
  2021-05-26  2:59 ` [PATCH v2] rs6000: Add load density heuristic Kewen.Lin
  0 siblings, 1 reply; 19+ messages in thread
From: Kewen.Lin @ 2021-05-07  2:29 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, Segher Boessenkool, David Edelsohn

[-- Attachment #1: Type: text/plain, Size: 1153 bytes --]

Hi,

We noticed that SPEC2017 503.bwaves_r run time degrades by about 8%
on P8 and P9 if we enabled vectorization at O2 fast-math.

Comparing to Ofast, compiler doesn't do the loop interchange on the
innermost loop, it's not profitable to vectorize it then.  Since
with loop vectorization, the loop becomes very intensive (density
ratio is 83), there are many scalar loads and further to construct
vector, it's bad that the vector CTORs have to wait for the required
loads are ready.

Now we have the function rs6000_density_test to check this kind of
intensive case, but for this case, the threshold is too generic and
a bit high.  This patch is to tweak the density heuristics by
introducing some more thresholds for strided_load, avoid to affect
some potential bmks sensitive to DENSITY_PCT_THRESHOLD change which
is generic.

Bootstrapped/regtested on powerpc64le-linux-gnu P9.

Nothing remarkable was observed with SPEC2017 Power9 full run,
excepting for bwaves_r degradation has been fixed.

Is it ok for trunk?

BR,
Kewen
------
gcc/ChangeLog:

	* config/rs6000/rs6000.c (rs6000_density_test): Add new heuristics
	for strided_load density check.

[-- Attachment #2: 0002-rs6000-Adjust-rs6000_density_test-for-strided_load.patch --]
[-- Type: text/plain, Size: 4604 bytes --]

---
 gcc/config/rs6000/rs6000.c | 88 +++++++++++++++++++++++++++++++++-----
 1 file changed, 77 insertions(+), 11 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index ffdf10098a9..5ae40d6f4ce 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5245,12 +5245,16 @@ rs6000_density_test (rs6000_cost_data *data)
   const int DENSITY_PCT_THRESHOLD = 85;
   const int DENSITY_SIZE_THRESHOLD = 70;
   const int DENSITY_PENALTY = 10;
+  const int DENSITY_LOAD_PCT_THRESHOLD = 80;
+  const int DENSITY_LOAD_FOR_CTOR_PCT_THRESHOLD = 65;
+  const int DENSITY_LOAD_SIZE_THRESHOLD = 20;
   struct loop *loop = data->loop_info;
   basic_block *bbs = get_loop_body (loop);
   int nbbs = loop->num_nodes;
   loop_vec_info loop_vinfo = loop_vec_info_for_loop (data->loop_info);
   int vec_cost = data->cost[vect_body], not_vec_cost = 0;
   int i, density_pct;
+  unsigned int nload_total = 0, nctor_for_strided = 0, nload_for_ctor = 0;
 
   /* Only care about cost of vector version, so exclude scalar version here.  */
   if (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo) != (void *) data)
@@ -5272,21 +5276,83 @@ rs6000_density_test (rs6000_cost_data *data)
 	  if (!STMT_VINFO_RELEVANT_P (stmt_info)
 	      && !STMT_VINFO_IN_PATTERN_P (stmt_info))
 	    not_vec_cost++;
+	  else
+	    {
+	      stmt_vec_info vstmt_info = vect_stmt_to_vectorize (stmt_info);
+	      if (STMT_VINFO_DATA_REF (vstmt_info)
+		  && DR_IS_READ (STMT_VINFO_DATA_REF (vstmt_info)))
+		{
+		  if (STMT_VINFO_STRIDED_P (vstmt_info))
+		    {
+		      unsigned int ncopies = 1;
+		      unsigned int nunits = 1;
+		      /* TODO: For VMAT_STRIDED_SLP, the total CTOR can be
+			 fewer due to group access.  Simply handle it here
+			 for now.  */
+		      if (!STMT_SLP_TYPE (vstmt_info))
+			{
+			  tree vectype = STMT_VINFO_VECTYPE (vstmt_info);
+			  ncopies = vect_get_num_copies (loop_vinfo, vectype);
+			  nunits = vect_nunits_for_cost (vectype);
+			}
+		      unsigned int nloads = ncopies * nunits;
+		      nload_for_ctor += nloads;
+		      nload_total += nloads;
+		      nctor_for_strided += ncopies;
+		    }
+		  else
+		    nload_total++;
+		}
+	    }
 	}
     }
-
   free (bbs);
-  density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
 
-  if (density_pct > DENSITY_PCT_THRESHOLD
-      && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
-    {
-      data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_NOTE, vect_location,
-			 "density %d%%, cost %d exceeds threshold, penalizing "
-			 "loop body cost by %d%%", density_pct,
-			 vec_cost + not_vec_cost, DENSITY_PENALTY);
+  if (vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
+    {
+      density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
+      if (density_pct > DENSITY_PCT_THRESHOLD)
+	{
+	  data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "density %d%%, cost %d exceeds threshold, "
+			     "penalizing loop body cost by %d%%.\n",
+			     density_pct, vec_cost + not_vec_cost,
+			     DENSITY_PENALTY);
+	}
+      /* For one loop which has a large proportion scalar loads of all
+	 loads fed into vector construction, if the density is high,
+	 the loads will have more stalls than usual, further affect
+	 the vector construction.  One typical case is the innermost
+	 loop of the hotspot of spec2017 503.bwaves_r without loop
+	 interchange.  Here we price more on the related vector
+	 construction and penalize the body cost.  */
+      else if (density_pct > DENSITY_LOAD_PCT_THRESHOLD
+	       && nload_total > DENSITY_LOAD_SIZE_THRESHOLD)
+	{
+	  int load_for_ctor_pct = (nload_for_ctor * 100) / nload_total;
+	  /* Large proportion of scalar loads fed to vector CTOR.  */
+	  if (load_for_ctor_pct > DENSITY_LOAD_FOR_CTOR_PCT_THRESHOLD)
+	    {
+	      vec_cost += nctor_for_strided;
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "Found high density loop with a large "
+				 "proportion %d%% of scalar loads fed to "
+				 "vector ctor, add cost %d.\n",
+				 load_for_ctor_pct, nctor_for_strided);
+
+	      data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "density %d%%, cost %d exceeds threshold, "
+				 "penalizing loop body cost by %d%% for "
+				 "load.\n",
+				 density_pct, vec_cost + not_vec_cost,
+				 DENSITY_PENALTY);
+	    }
+	}
     }
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2] rs6000: Add load density heuristic
  2021-05-07  2:29 [PATCH] rs6000: Adjust rs6000_density_test for strided_load Kewen.Lin
@ 2021-05-26  2:59 ` Kewen.Lin
  2021-06-09  2:26   ` PING^1 " Kewen.Lin
                     ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Kewen.Lin @ 2021-05-26  2:59 UTC (permalink / raw)
  To: GCC Patches
  Cc: Bill Schmidt, David Edelsohn, Segher Boessenkool, Richard Biener

[-- Attachment #1: Type: text/plain, Size: 2485 bytes --]

Hi,

This is the updated version of patch to deal with the bwaves_r
degradation due to vector construction fed by strided loads.

As Richi's comments [1], this follows the similar idea to over
price the vector construction fed by VMAT_ELEMENTWISE or
VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
construction costing immediately, it firstly records how many
loads and vectorized statements in the given loop, later in
rs6000_density_test (called by finish_cost) it computes the
load density ratio against all vectorized stmts, and check
with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
if both thresholds are exceeded.

Note that this new load density heuristic check is based on
some fields in target cost which are updated as needed when
scanning each add_stmt_cost entry, it's independent of the
current function rs6000_density_test which requires to scan
non_vect stmts.  Since it's checking the load stmts count
vs. all vectorized stmts, it's kind of density, so I put
it in function rs6000_density_test.  With the same reason to
keep it independent, I didn't put it as an else arm of the
current existing density threshold check hunk or before this
hunk.

In the investigation of -1.04% degradation from 526.blender_r
on Power8, I noticed that the extra penalized cost 320 on one
single vector construction with type V16QI is much exaggerated,
which makes the final body cost unreliable, so this patch adds
one maximum bound for the extra penalized cost for each vector
construction statement.

Bootstrapped/regtested on powerpc64le-linux-gnu P9.

Full SPEC2017 performance evaluation on Power8/Power9 with
option combinations:
  * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
  * {-O3, -Ofast} {,-funroll-loops}

bwaves_r degradations on P8/P9 have been fixed, nothing else
remarkable was observed.

Is it ok for trunk?

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html

BR,
Kewen
-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
	nstmts, nloads and extra_ctor_cost.
	(rs6000_density_test): Add load density related heuristics and the
	checks, do extra costing on vector construction statements if need.
	(rs6000_init_cost): Init new members.
	(rs6000_update_target_cost_per_stmt): New function.
	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
	rs6000_update_target_cost_per_stmt and call it.


[-- Attachment #2: ctor_extra_cost_v2.diff --]
[-- Type: text/plain, Size: 7038 bytes --]

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 83d29cbfac1..806c3335cbc 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data
 {
   struct loop *loop_info;
   unsigned cost[3];
+  /* Total number of vectorized stmts (loop only).  */
+  unsigned nstmts;
+  /* Total number of loads (loop only).  */
+  unsigned nloads;
+  /* Possible extra penalized cost on vector construction (loop only).  */
+  unsigned extra_ctor_cost;
   /* For each vectorized loop, this var holds TRUE iff a non-memory vector
      instruction is needed by the vectorization.  */
   bool vect_nonmem;
@@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data)
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "density %d%%, cost %d exceeds threshold, penalizing "
-			 "loop body cost by %d%%", density_pct,
+			 "loop body cost by %d%%\n", density_pct,
 			 vec_cost + not_vec_cost, DENSITY_PENALTY);
     }
+
+  /* Check if we need to penalize the body cost for latency and
+     execution resources bound from strided or elementwise loads
+     into a vector.  */
+  if (data->extra_ctor_cost > 0)
+    {
+      /* Threshold for load stmts percentage in all vectorized stmts.  */
+      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
+      /* Threshold for total number of load stmts.  */
+      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
+
+      gcc_assert (data->nloads <= data->nstmts);
+      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
+
+      /* It's likely to be bounded by latency and execution resources
+	 from many scalar loads which are strided or elementwise loads
+	 into a vector if both conditions below are found:
+	   1. there are many loads, it's easy to result in a long wait
+	      for load units;
+	   2. load has a big proportion of all vectorized statements,
+	      it's not easy to schedule other statements to spread among
+	      the loads.
+	 One typical case is the innermost loop of the hotspot of SPEC2017
+	 503.bwaves_r without loop interchange.  */
+      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
+	  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
+	{
+	  data->cost[vect_body] += data->extra_ctor_cost;
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Found %u loads and load pct. %u%% exceed "
+			     "the threshold, penalizing loop body "
+			     "cost by extra cost %u for ctor.\n",
+			     data->nloads, load_pct, data->extra_ctor_cost);
+	}
+    }
 }
 
 /* Implement targetm.vectorize.init_cost.  */
@@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
   data->cost[vect_body]     = 0;
   data->cost[vect_epilogue] = 0;
   data->vect_nonmem = false;
+  data->nstmts = 0;
+  data->nloads = 0;
+  data->extra_ctor_cost = 0;
   data->costing_for_scalar = costing_for_scalar;
   return data;
 }
@@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
   return 0;
 }
 
+/* As a visitor function for each statement cost entry handled in
+   function add_stmt_cost, gather some information and update its
+   relevant fields in target cost accordingly.  */
+static void
+rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
+				    enum vect_cost_for_stmt kind,
+				    struct _stmt_vec_info *stmt_info,
+				    enum vect_cost_model_location where,
+				    int stmt_cost, unsigned int orig_count)
+{
+
+  /* Check whether we're doing something other than just a copy loop.
+     Not all such loops may be profitably vectorized; see
+     rs6000_finish_cost.  */
+  if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote
+       || kind == vec_construct || kind == scalar_to_vec)
+      || (where == vect_body && kind == vector_stmt))
+    data->vect_nonmem = true;
+
+  /* Gather some information when we are costing the vector version for
+     the statements locate in a loop body.  */
+  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
+    {
+      data->nstmts += orig_count;
+
+      if (kind == scalar_load || kind == vector_load || kind == unaligned_load
+	  || kind == vector_gather_load)
+	data->nloads += orig_count;
+
+      /* If we have strided or elementwise loads into a vector, it's
+	 possible to be bounded by latency and execution resources for
+	 many scalar loads.  Try to account for this by scaling the
+	 construction cost by the number of elements involved, when
+	 handling each matching statement we record the possible extra
+	 penalized cost into target cost, in the end of costing for
+	 the whole loop, we do the actual penalization once some load
+	 density heuristics are satisfied.  */
+      if (kind == vec_construct && stmt_info
+	  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+	  && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
+	      || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
+	{
+	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+	  unsigned int nunits = vect_nunits_for_cost (vectype);
+	  unsigned int extra_cost = nunits * stmt_cost;
+	  /* As function rs6000_builtin_vectorization_cost shows, we have
+	     priced much on V16QI/V8HI vector construction as their units,
+	     if we penalize them with nunits * stmt_cost, it can result in
+	     a unreliable body cost, eg: for V16QI on Power8, stmt_cost is
+	     20 and nunits is 16, the extra cost is 320 which looks much
+	     exaggerated.  So let's use one maximum bound for the extra
+	     penalized cost for vector construction here.  */
+	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
+	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
+	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
+	  data->extra_ctor_cost += extra_cost;
+	}
+    }
+}
+
 /* Implement targetm.vectorize.add_stmt_cost.  */
 
 static unsigned
@@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       /* Statements in an inner loop relative to the loop being
 	 vectorized are weighted more heavily.  The value here is
 	 arbitrary and could potentially be improved with analysis.  */
+      unsigned int orig_count = count;
       if (where == vect_body && stmt_info
 	  && stmt_in_inner_loop_p (vinfo, stmt_info))
 	{
@@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       retval = (unsigned) (count * stmt_cost);
       cost_data->cost[where] += retval;
 
-      /* Check whether we're doing something other than just a copy loop.
-	 Not all such loops may be profitably vectorized; see
-	 rs6000_finish_cost.  */
-      if ((kind == vec_to_scalar || kind == vec_perm
-	   || kind == vec_promote_demote || kind == vec_construct
-	   || kind == scalar_to_vec)
-	  || (where == vect_body && kind == vector_stmt))
-	cost_data->vect_nonmem = true;
+      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
+					  stmt_cost, orig_count);
     }
 
   return retval;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* PING^1 [PATCH v2] rs6000: Add load density heuristic
  2021-05-26  2:59 ` [PATCH v2] rs6000: Add load density heuristic Kewen.Lin
@ 2021-06-09  2:26   ` Kewen.Lin
  2021-06-28  7:01     ` PING^2 " Kewen.Lin
  2021-07-27 22:25   ` will schmidt
  2021-07-28  5:22   ` [PATCH v3] " Kewen.Lin
  2 siblings, 1 reply; 19+ messages in thread
From: Kewen.Lin @ 2021-06-09  2:26 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, Segher Boessenkool, David Edelsohn

Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html

BR,
Kewen

on 2021/5/26 上午10:59, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> This is the updated version of patch to deal with the bwaves_r
> degradation due to vector construction fed by strided loads.
> 
> As Richi's comments [1], this follows the similar idea to over
> price the vector construction fed by VMAT_ELEMENTWISE or
> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
> construction costing immediately, it firstly records how many
> loads and vectorized statements in the given loop, later in
> rs6000_density_test (called by finish_cost) it computes the
> load density ratio against all vectorized stmts, and check
> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
> if both thresholds are exceeded.
> 
> Note that this new load density heuristic check is based on
> some fields in target cost which are updated as needed when
> scanning each add_stmt_cost entry, it's independent of the
> current function rs6000_density_test which requires to scan
> non_vect stmts.  Since it's checking the load stmts count
> vs. all vectorized stmts, it's kind of density, so I put
> it in function rs6000_density_test.  With the same reason to
> keep it independent, I didn't put it as an else arm of the
> current existing density threshold check hunk or before this
> hunk.
> 
> In the investigation of -1.04% degradation from 526.blender_r
> on Power8, I noticed that the extra penalized cost 320 on one
> single vector construction with type V16QI is much exaggerated,
> which makes the final body cost unreliable, so this patch adds
> one maximum bound for the extra penalized cost for each vector
> construction statement.
> 
> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
> 
> Full SPEC2017 performance evaluation on Power8/Power9 with
> option combinations:
>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>   * {-O3, -Ofast} {,-funroll-loops}
> 
> bwaves_r degradations on P8/P9 have been fixed, nothing else
> remarkable was observed.
> 
> Is it ok for trunk?
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
> 
> BR,
> Kewen
> -----
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
> 	nstmts, nloads and extra_ctor_cost.
> 	(rs6000_density_test): Add load density related heuristics and the
> 	checks, do extra costing on vector construction statements if need.
> 	(rs6000_init_cost): Init new members.
> 	(rs6000_update_target_cost_per_stmt): New function.
> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
> 	rs6000_update_target_cost_per_stmt and call it.
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* PING^2 [PATCH v2] rs6000: Add load density heuristic
  2021-06-09  2:26   ` PING^1 " Kewen.Lin
@ 2021-06-28  7:01     ` Kewen.Lin
  2021-07-15  1:59       ` PING^3 " Kewen.Lin
  0 siblings, 1 reply; 19+ messages in thread
From: Kewen.Lin @ 2021-06-28  7:01 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, David Edelsohn, Segher Boessenkool

Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html

BR,
Kewen

on 2021/6/9 上午10:26, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> Gentle ping this:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
> 
> BR,
> Kewen
> 
> on 2021/5/26 上午10:59, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> This is the updated version of patch to deal with the bwaves_r
>> degradation due to vector construction fed by strided loads.
>>
>> As Richi's comments [1], this follows the similar idea to over
>> price the vector construction fed by VMAT_ELEMENTWISE or
>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>> construction costing immediately, it firstly records how many
>> loads and vectorized statements in the given loop, later in
>> rs6000_density_test (called by finish_cost) it computes the
>> load density ratio against all vectorized stmts, and check
>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>> if both thresholds are exceeded.
>>
>> Note that this new load density heuristic check is based on
>> some fields in target cost which are updated as needed when
>> scanning each add_stmt_cost entry, it's independent of the
>> current function rs6000_density_test which requires to scan
>> non_vect stmts.  Since it's checking the load stmts count
>> vs. all vectorized stmts, it's kind of density, so I put
>> it in function rs6000_density_test.  With the same reason to
>> keep it independent, I didn't put it as an else arm of the
>> current existing density threshold check hunk or before this
>> hunk.
>>
>> In the investigation of -1.04% degradation from 526.blender_r
>> on Power8, I noticed that the extra penalized cost 320 on one
>> single vector construction with type V16QI is much exaggerated,
>> which makes the final body cost unreliable, so this patch adds
>> one maximum bound for the extra penalized cost for each vector
>> construction statement.
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>
>> Full SPEC2017 performance evaluation on Power8/Power9 with
>> option combinations:
>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>   * {-O3, -Ofast} {,-funroll-loops}
>>
>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>> remarkable was observed.
>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>> 	nstmts, nloads and extra_ctor_cost.
>> 	(rs6000_density_test): Add load density related heuristics and the
>> 	checks, do extra costing on vector construction statements if need.
>> 	(rs6000_init_cost): Init new members.
>> 	(rs6000_update_target_cost_per_stmt): New function.
>> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
>> 	rs6000_update_target_cost_per_stmt and call it.
>>
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* PING^3 [PATCH v2] rs6000: Add load density heuristic
  2021-06-28  7:01     ` PING^2 " Kewen.Lin
@ 2021-07-15  1:59       ` Kewen.Lin
  0 siblings, 0 replies; 19+ messages in thread
From: Kewen.Lin @ 2021-07-15  1:59 UTC (permalink / raw)
  To: GCC Patches; +Cc: Bill Schmidt, Segher Boessenkool, David Edelsohn

Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html

BR,
Kewen

on 2021/6/28 下午3:01, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> Gentle ping this:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
> 
> BR,
> Kewen
> 
> on 2021/6/9 上午10:26, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> Gentle ping this:
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
>>
>> BR,
>> Kewen
>>
>> on 2021/5/26 上午10:59, Kewen.Lin via Gcc-patches wrote:
>>> Hi,
>>>
>>> This is the updated version of patch to deal with the bwaves_r
>>> degradation due to vector construction fed by strided loads.
>>>
>>> As Richi's comments [1], this follows the similar idea to over
>>> price the vector construction fed by VMAT_ELEMENTWISE or
>>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>>> construction costing immediately, it firstly records how many
>>> loads and vectorized statements in the given loop, later in
>>> rs6000_density_test (called by finish_cost) it computes the
>>> load density ratio against all vectorized stmts, and check
>>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>>> if both thresholds are exceeded.
>>>
>>> Note that this new load density heuristic check is based on
>>> some fields in target cost which are updated as needed when
>>> scanning each add_stmt_cost entry, it's independent of the
>>> current function rs6000_density_test which requires to scan
>>> non_vect stmts.  Since it's checking the load stmts count
>>> vs. all vectorized stmts, it's kind of density, so I put
>>> it in function rs6000_density_test.  With the same reason to
>>> keep it independent, I didn't put it as an else arm of the
>>> current existing density threshold check hunk or before this
>>> hunk.
>>>
>>> In the investigation of -1.04% degradation from 526.blender_r
>>> on Power8, I noticed that the extra penalized cost 320 on one
>>> single vector construction with type V16QI is much exaggerated,
>>> which makes the final body cost unreliable, so this patch adds
>>> one maximum bound for the extra penalized cost for each vector
>>> construction statement.
>>>
>>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>>
>>> Full SPEC2017 performance evaluation on Power8/Power9 with
>>> option combinations:
>>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>>   * {-O3, -Ofast} {,-funroll-loops}
>>>
>>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>>> remarkable was observed.
>>>
>>> Is it ok for trunk?
>>>
>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
>>>
>>> BR,
>>> Kewen
>>> -----
>>> gcc/ChangeLog:
>>>
>>> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>>> 	nstmts, nloads and extra_ctor_cost.
>>> 	(rs6000_density_test): Add load density related heuristics and the
>>> 	checks, do extra costing on vector construction statements if need.
>>> 	(rs6000_init_cost): Init new members.
>>> 	(rs6000_update_target_cost_per_stmt): New function.
>>> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
>>> 	rs6000_update_target_cost_per_stmt and call it.
>>>
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] rs6000: Add load density heuristic
  2021-05-26  2:59 ` [PATCH v2] rs6000: Add load density heuristic Kewen.Lin
  2021-06-09  2:26   ` PING^1 " Kewen.Lin
@ 2021-07-27 22:25   ` will schmidt
  2021-07-28  2:59     ` Kewen.Lin
  2021-07-28  5:22   ` [PATCH v3] " Kewen.Lin
  2 siblings, 1 reply; 19+ messages in thread
From: will schmidt @ 2021-07-27 22:25 UTC (permalink / raw)
  To: Kewen.Lin, GCC Patches; +Cc: Bill Schmidt, Segher Boessenkool, David Edelsohn

On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 


Hi,


> This is the updated version of patch to deal with the bwaves_r
> degradation due to vector construction fed by strided loads.
> 
> As Richi's comments [1], this follows the similar idea to over
> price the vector construction fed by VMAT_ELEMENTWISE or
> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
> construction costing immediately, it firstly records how many
> loads and vectorized statements in the given loop, later in
> rs6000_density_test (called by finish_cost) it computes the
> load density ratio against all vectorized stmts, and check
> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
> if both thresholds are exceeded.

ok

> 
> Note that this new load density heuristic check is based on
> some fields in target cost which are updated as needed when
> scanning each add_stmt_cost entry, it's independent of the
> current function rs6000_density_test which requires to scan
> non_vect stmts.  Since it's checking the load stmts count
> vs. all vectorized stmts, it's kind of density, so I put
> it in function rs6000_density_test.  With the same reason to
> keep it independent, I didn't put it as an else arm of the
> current existing density threshold check hunk or before this
> hunk.

ok

> 
> In the investigation of -1.04% degradation from 526.blender_r
> on Power8, I noticed that the extra penalized cost 320 on one
> single vector construction with type V16QI is much exaggerated,
> which makes the final body cost unreliable, so this patch adds
> one maximum bound for the extra penalized cost for each vector
> construction statement.

ok

> 
> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
> 
> Full SPEC2017 performance evaluation on Power8/Power9 with
> option combinations:
>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>   * {-O3, -Ofast} {,-funroll-loops}
> 
> bwaves_r degradations on P8/P9 have been fixed, nothing else
> remarkable was observed.

So, this fixes the "-1.04% degradation from 526.blender_r on Power8"
degredation with no additional regressions.  that sounds good. 

> 
> Is it ok for trunk?
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
> 
> BR,
> Kewen
> -----
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
> 	nstmts, nloads and extra_ctor_cost.
> 	(rs6000_density_test): Add load density related heuristics and the
> 	checks, do extra costing on vector construction statements if need.
> 	(rs6000_init_cost): Init new members.
> 	(rs6000_update_target_cost_per_stmt): New function.
> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
> 	rs6000_update_target_cost_per_stmt and call it.
> 

> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 83d29cbfac1..806c3335cbc 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> 
> @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data
>  {
>    struct loop *loop_info;
>    unsigned cost[3];
> +  /* Total number of vectorized stmts (loop only).  */
> +  unsigned nstmts;
> +  /* Total number of loads (loop only).  */
> +  unsigned nloads;
> +  /* Possible extra penalized cost on vector construction (loop only).  */
> +  unsigned extra_ctor_cost;
> 
>    /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>       instruction is needed by the vectorization.  */
>    bool vect_nonmem;
> @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data)
>        if (dump_enabled_p ())
>  	dump_printf_loc (MSG_NOTE, vect_location,
>  			 "density %d%%, cost %d exceeds threshold, penalizing "
> -			 "loop body cost by %d%%", density_pct,
> +			 "loop body cost by %d%%\n", density_pct,
>  			 vec_cost + not_vec_cost, DENSITY_PENALTY);
>      }
> +
> +  /* Check if we need to penalize the body cost for latency and
> +     execution resources bound from strided or elementwise loads
> +     into a vector.  */
> +  if (data->extra_ctor_cost > 0)
> +    {
> +      /* Threshold for load stmts percentage in all vectorized stmts.  */
> +      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
> +      /* Threshold for total number of load stmts.  */
> +      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
> +
> +      gcc_assert (data->nloads <= data->nstmts);
> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
> +
> +      /* It's likely to be bounded by latency and execution resources
> +	 from many scalar loads which are strided or elementwise loads
> +	 into a vector if both conditions below are found:
> +	   1. there are many loads, it's easy to result in a long wait
> +	      for load units;
> +	   2. load has a big proportion of all vectorized statements,
> +	      it's not easy to schedule other statements to spread among
> +	      the loads.
> +	 One typical case is the innermost loop of the hotspot of SPEC2017
> +	 503.bwaves_r without loop interchange.  */
> +      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
> +	  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
> +	{
> +	  data->cost[vect_body] += data->extra_ctor_cost;
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_NOTE, vect_location,
> +			     "Found %u loads and load pct. %u%% exceed "
> +			     "the threshold, penalizing loop body "
> +			     "cost by extra cost %u for ctor.\n",
> +			     data->nloads, load_pct, data->extra_ctor_cost);
> +	}
> +    }
> 
>  }
> 
>  /* Implement targetm.vectorize.init_cost.  */
> @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
>    data->cost[vect_body]     = 0;
>    data->cost[vect_epilogue] = 0;
>    data->vect_nonmem = false;
> +  data->nstmts = 0;
> +  data->nloads = 0;
> +  data->extra_ctor_cost = 0;
> 
>    data->costing_for_scalar = costing_for_scalar;
>    return data;
>  }
> @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
>    return 0;
>  }
> 
> +/* As a visitor function for each statement cost entry handled in
> +   function add_stmt_cost, gather some information and update its
> +   relevant fields in target cost accordingly.  */

I got lost trying to parse that..  (could be just me :-) 

Possibly instead something like
/* Helper function for add_stmt_cost ; gather information and update
the target_cost fields accordingly.  */



> +static void
> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
> +				    enum vect_cost_for_stmt kind,
> +				    struct _stmt_vec_info *stmt_info,
> +				    enum vect_cost_model_location where,
> +				    int stmt_cost, unsigned int orig_count)
> +{
> +
> +  /* Check whether we're doing something other than just a copy loop.
> +     Not all such loops may be profitably vectorized; see
> +     rs6000_finish_cost.  */
> +  if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote
> +       || kind == vec_construct || kind == scalar_to_vec)
> +      || (where == vect_body && kind == vector_stmt))

I thought I saw an alignment issue, then noticed the "(" that was
hiding on me.. :-)    

Maybe clearer to read if you rearrange slightly and flatten it ?  I
defer to others on that..

	if ((kind == vec_to_scalar
	     || kind == vec_perm
	     || kind == vec_promote_demote
	     || kind ==
vec_construct	
	     || kind == scalar_to_vec)
	    || (kind == vector_stmt && where == vect_body)
	

> +    data->vect_nonmem = true;
> +
> +  /* Gather some information when we are costing the vector version for
> +     the statements locate in a loop body.  */
s/version/instruction? operation?/  ? ? 
s/locate/located/

> +  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
> +    {
> +      data->nstmts += orig_count;
> +
> +      if (kind == scalar_load || kind == vector_load || kind == unaligned_load
> +	  || kind == vector_gather_load)

Cosmetically, I'd move the second "||" to the next line to balance
those two lines a bit more. 

> +	data->nloads += orig_count;
> +
> +      /* If we have strided or elementwise loads into a vector, it's
> +	 possible to be bounded by latency and execution resources for
> +	 many scalar loads.  Try to account for this by scaling the
> +	 construction cost by the number of elements involved, when
> +	 handling each matching statement we record the possible extra
> +	 penalized cost into target cost, in the end of costing for
> +	 the whole loop, we do the actual penalization once some load
> +	 density heuristics are satisfied.  */
> +      if (kind == vec_construct && stmt_info
> +	  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
> +	  && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
> +	      || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
> +	{
> +	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> +	  unsigned int nunits = vect_nunits_for_cost (vectype);
> +	  unsigned int extra_cost = nunits * stmt_cost;
> +	  /* As function rs6000_builtin_vectorization_cost shows, we have
> +	     priced much on V16QI/V8HI vector construction as their units,
> +	     if we penalize them with nunits * stmt_cost, it can result in
> +	     a unreliable body cost, eg: for V16QI on Power8, stmt_cost is
s/a/an/ 
> +	     20 and nunits is 16, the extra cost is 320 which looks much
> +	     exaggerated.  So let's use one maximum bound for the extra
> +	     penalized cost for vector construction here.  */
> +	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
> +	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
> +	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
> +	  data->extra_ctor_cost += extra_cost;
> +	}
> +    }
> +}
ok

> +
> 
>  /* Implement targetm.vectorize.add_stmt_cost.  */
> 
>  static unsigned
> @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>        /* Statements in an inner loop relative to the loop being
>  	 vectorized are weighted more heavily.  The value here is
>  	 arbitrary and could potentially be improved with analysis.  */
> +      unsigned int orig_count = count;

I don't see any other assignments to orig_count.  Does 'count' get
updated elsewhere in rs6000_add_stmt_cost() that the new orig_count
variable is necessary?

> 
>        if (where == vect_body && stmt_info
>  	  && stmt_in_inner_loop_p (vinfo, stmt_info))
>  	{
> @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>        retval = (unsigned) (count * stmt_cost);
>        cost_data->cost[where] += retval;
> 
> -      /* Check whether we're doing something other than just a copy loop.
> -	 Not all such loops may be profitably vectorized; see
> -	 rs6000_finish_cost.  */
> -      if ((kind == vec_to_scalar || kind == vec_perm
> -	   || kind == vec_promote_demote || kind == vec_construct
> -	   || kind == scalar_to_vec)
> -	  || (where == vect_body && kind == vector_stmt))
> -	cost_data->vect_nonmem = true;
> +      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
> +					  stmt_cost, orig_count);
> 
>      }
> 
>    return retval;
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] rs6000: Add load density heuristic
  2021-07-27 22:25   ` will schmidt
@ 2021-07-28  2:59     ` Kewen.Lin
  2021-09-06 23:43       ` Segher Boessenkool
  0 siblings, 1 reply; 19+ messages in thread
From: Kewen.Lin @ 2021-07-28  2:59 UTC (permalink / raw)
  To: will schmidt
  Cc: Bill Schmidt, Segher Boessenkool, David Edelsohn, GCC Patches

Hi William,

Thanks for the review comments!

on 2021/7/28 上午6:25, will schmidt wrote:
> On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
> 
> 
> Hi,
> 
> 
>> This is the updated version of patch to deal with the bwaves_r
>> degradation due to vector construction fed by strided loads.
>>
>> As Richi's comments [1], this follows the similar idea to over
>> price the vector construction fed by VMAT_ELEMENTWISE or
>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>> construction costing immediately, it firstly records how many
>> loads and vectorized statements in the given loop, later in
>> rs6000_density_test (called by finish_cost) it computes the
>> load density ratio against all vectorized stmts, and check
>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>> if both thresholds are exceeded.
> 
> ok
> 
>>
>> Note that this new load density heuristic check is based on
>> some fields in target cost which are updated as needed when
>> scanning each add_stmt_cost entry, it's independent of the
>> current function rs6000_density_test which requires to scan
>> non_vect stmts.  Since it's checking the load stmts count
>> vs. all vectorized stmts, it's kind of density, so I put
>> it in function rs6000_density_test.  With the same reason to
>> keep it independent, I didn't put it as an else arm of the
>> current existing density threshold check hunk or before this
>> hunk.
> 
> ok
> 
>>
>> In the investigation of -1.04% degradation from 526.blender_r
>> on Power8, I noticed that the extra penalized cost 320 on one
>> single vector construction with type V16QI is much exaggerated,
>> which makes the final body cost unreliable, so this patch adds
>> one maximum bound for the extra penalized cost for each vector
>> construction statement.
> 
> ok
> 
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>
>> Full SPEC2017 performance evaluation on Power8/Power9 with
>> option combinations:
>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>   * {-O3, -Ofast} {,-funroll-loops}
>>
>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>> remarkable was observed.
> 
> So, this fixes the "-1.04% degradation from 526.blender_r on Power8"
> degredation with no additional regressions.  that sounds good. 
> 

Sorry for the confusion, the original intention of this patch is to
fix the -8% degradation at -O2 -ftree-vectorize vs. -O2 on bwaves_r.
(From the last evaluation based on r12-1442, P8 is about -10% while
P9 is about -9%).  The mentioned -1.04% degradation from 526.blender_r
on P8 was expected to be a reason why we need the bound for the extra
penalized cost adjustment.

>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>> 	nstmts, nloads and extra_ctor_cost.
>> 	(rs6000_density_test): Add load density related heuristics and the
>> 	checks, do extra costing on vector construction statements if need.
>> 	(rs6000_init_cost): Init new members.
>> 	(rs6000_update_target_cost_per_stmt): New function.
>> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
>> 	rs6000_update_target_cost_per_stmt and call it.
>>
> 
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index 83d29cbfac1..806c3335cbc 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>>
>> @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data
>>  {
>>    struct loop *loop_info;
>>    unsigned cost[3];
>> +  /* Total number of vectorized stmts (loop only).  */
>> +  unsigned nstmts;
>> +  /* Total number of loads (loop only).  */
>> +  unsigned nloads;
>> +  /* Possible extra penalized cost on vector construction (loop only).  */
>> +  unsigned extra_ctor_cost;
>>
>>    /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>>       instruction is needed by the vectorization.  */
>>    bool vect_nonmem;
>> @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data)
>>        if (dump_enabled_p ())
>>  	dump_printf_loc (MSG_NOTE, vect_location,
>>  			 "density %d%%, cost %d exceeds threshold, penalizing "
>> -			 "loop body cost by %d%%", density_pct,
>> +			 "loop body cost by %d%%\n", density_pct,
>>  			 vec_cost + not_vec_cost, DENSITY_PENALTY);
>>      }
>> +
>> +  /* Check if we need to penalize the body cost for latency and
>> +     execution resources bound from strided or elementwise loads
>> +     into a vector.  */
>> +  if (data->extra_ctor_cost > 0)
>> +    {
>> +      /* Threshold for load stmts percentage in all vectorized stmts.  */
>> +      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
>> +      /* Threshold for total number of load stmts.  */
>> +      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
>> +
>> +      gcc_assert (data->nloads <= data->nstmts);
>> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
>> +
>> +      /* It's likely to be bounded by latency and execution resources
>> +	 from many scalar loads which are strided or elementwise loads
>> +	 into a vector if both conditions below are found:
>> +	   1. there are many loads, it's easy to result in a long wait
>> +	      for load units;
>> +	   2. load has a big proportion of all vectorized statements,
>> +	      it's not easy to schedule other statements to spread among
>> +	      the loads.
>> +	 One typical case is the innermost loop of the hotspot of SPEC2017
>> +	 503.bwaves_r without loop interchange.  */
>> +      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
>> +	  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
>> +	{
>> +	  data->cost[vect_body] += data->extra_ctor_cost;
>> +	  if (dump_enabled_p ())
>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>> +			     "Found %u loads and load pct. %u%% exceed "
>> +			     "the threshold, penalizing loop body "
>> +			     "cost by extra cost %u for ctor.\n",
>> +			     data->nloads, load_pct, data->extra_ctor_cost);
>> +	}
>> +    }
>>
>>  }
>>
>>  /* Implement targetm.vectorize.init_cost.  */
>> @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
>>    data->cost[vect_body]     = 0;
>>    data->cost[vect_epilogue] = 0;
>>    data->vect_nonmem = false;
>> +  data->nstmts = 0;
>> +  data->nloads = 0;
>> +  data->extra_ctor_cost = 0;
>>
>>    data->costing_for_scalar = costing_for_scalar;
>>    return data;
>>  }
>> @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
>>    return 0;
>>  }
>>
>> +/* As a visitor function for each statement cost entry handled in
>> +   function add_stmt_cost, gather some information and update its
>> +   relevant fields in target cost accordingly.  */
> 
> I got lost trying to parse that..  (could be just me :-) 
> 
> Possibly instead something like
> /* Helper function for add_stmt_cost ; gather information and update
> the target_cost fields accordingly.  */
> 
> 

OK, will update.  I was thinking for each entry handled in function
add_stmt_cost, this helper acts like a visitor, trying to visit each
entry and take some actions if some conditions are satisifed.

> 
>> +static void
>> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>> +				    enum vect_cost_for_stmt kind,
>> +				    struct _stmt_vec_info *stmt_info,
>> +				    enum vect_cost_model_location where,
>> +				    int stmt_cost, unsigned int orig_count)
>> +{
>> +
>> +  /* Check whether we're doing something other than just a copy loop.
>> +     Not all such loops may be profitably vectorized; see
>> +     rs6000_finish_cost.  */
>> +  if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote
>> +       || kind == vec_construct || kind == scalar_to_vec)
>> +      || (where == vect_body && kind == vector_stmt))
> 
> I thought I saw an alignment issue, then noticed the "(" that was
> hiding on me.. :-)    
> 
> Maybe clearer to read if you rearrange slightly and flatten it ?  I
> defer to others on that..
> 
> 	if ((kind == vec_to_scalar
> 	     || kind == vec_perm
> 	     || kind == vec_promote_demote
> 	     || kind ==
> vec_construct	
> 	     || kind == scalar_to_vec)
> 	    || (kind == vector_stmt && where == vect_body)
> 	
> 

This hunk is factored out from function rs6000_add_stmt_cost, maybe I
can keep the original formatting?  The formatting tool isn't so smart,
and sometimes rearrange things to become unexpected (although it meets
the basic rule, not so elegant), sigh.

>> +    data->vect_nonmem = true;
>> +
>> +  /* Gather some information when we are costing the vector version for
>> +     the statements locate in a loop body.  */
> s/version/instruction? operation?/  ? ? 
> s/locate/located/
> 

Will fix.

>> +  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>> +    {
>> +      data->nstmts += orig_count;
>> +
>> +      if (kind == scalar_load || kind == vector_load || kind == unaligned_load
>> +	  || kind == vector_gather_load)
> 
> Cosmetically, I'd move the second "||" to the next line to balance
> those two lines a bit more. 
> 

Will fix.

>> +	data->nloads += orig_count;
>> +
>> +      /* If we have strided or elementwise loads into a vector, it's
>> +	 possible to be bounded by latency and execution resources for
>> +	 many scalar loads.  Try to account for this by scaling the
>> +	 construction cost by the number of elements involved, when
>> +	 handling each matching statement we record the possible extra
>> +	 penalized cost into target cost, in the end of costing for
>> +	 the whole loop, we do the actual penalization once some load
>> +	 density heuristics are satisfied.  */
>> +      if (kind == vec_construct && stmt_info
>> +	  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>> +	  && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
>> +	      || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
>> +	{
>> +	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>> +	  unsigned int nunits = vect_nunits_for_cost (vectype);
>> +	  unsigned int extra_cost = nunits * stmt_cost;
>> +	  /* As function rs6000_builtin_vectorization_cost shows, we have
>> +	     priced much on V16QI/V8HI vector construction as their units,
>> +	     if we penalize them with nunits * stmt_cost, it can result in
>> +	     a unreliable body cost, eg: for V16QI on Power8, stmt_cost is
> s/a/an/ 

Will fix.

>> +	     20 and nunits is 16, the extra cost is 320 which looks much
>> +	     exaggerated.  So let's use one maximum bound for the extra
>> +	     penalized cost for vector construction here.  */
>> +	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>> +	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>> +	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> +	  data->extra_ctor_cost += extra_cost;
>> +	}
>> +    }
>> +}
> ok
> 
>> +
>>
>>  /* Implement targetm.vectorize.add_stmt_cost.  */
>>
>>  static unsigned
>> @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>>        /* Statements in an inner loop relative to the loop being
>>  	 vectorized are weighted more heavily.  The value here is
>>  	 arbitrary and could potentially be improved with analysis.  */
>> +      unsigned int orig_count = count;
> 
> I don't see any other assignments to orig_count.  Does 'count' get
> updated elsewhere in rs6000_add_stmt_cost() that the new orig_count
> variable is necessary?
> 

Yeah, the 'count' could get updated but the default "git diff" doesn't
show it, the codes omitted look as below:

      if (where == vect_body && stmt_info
	  && stmt_in_inner_loop_p (vinfo, stmt_info))
	{
	  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
	  gcc_assert (loop_vinfo);
	  count *= LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo); /* FIXME.  */
	}

BR,
Kewen

>>
>>        if (where == vect_body && stmt_info
>>  	  && stmt_in_inner_loop_p (vinfo, stmt_info))
>>  	{
>> @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>>        retval = (unsigned) (count * stmt_cost);
>>        cost_data->cost[where] += retval;
>>
>> -      /* Check whether we're doing something other than just a copy loop.
>> -	 Not all such loops may be profitably vectorized; see
>> -	 rs6000_finish_cost.  */
>> -      if ((kind == vec_to_scalar || kind == vec_perm
>> -	   || kind == vec_promote_demote || kind == vec_construct
>> -	   || kind == scalar_to_vec)
>> -	  || (where == vect_body && kind == vector_stmt))
>> -	cost_data->vect_nonmem = true;
>> +      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
>> +					  stmt_cost, orig_count);
>>
>>      }
>>
>>    return retval;
>>
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v3] rs6000: Add load density heuristic
  2021-05-26  2:59 ` [PATCH v2] rs6000: Add load density heuristic Kewen.Lin
  2021-06-09  2:26   ` PING^1 " Kewen.Lin
  2021-07-27 22:25   ` will schmidt
@ 2021-07-28  5:22   ` Kewen.Lin
  2021-09-03 15:57     ` Bill Schmidt
  2 siblings, 1 reply; 19+ messages in thread
From: Kewen.Lin @ 2021-07-28  5:22 UTC (permalink / raw)
  To: GCC Patches
  Cc: Bill Schmidt, Segher Boessenkool, David Edelsohn, will schmidt

[-- Attachment #1: Type: text/plain, Size: 2684 bytes --]

Hi,

v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html

This v3 addressed William's review comments in
https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576154.html

It's mainly to deal with the bwaves_r degradation due to vector
construction fed by strided loads.

As Richi's comments [1], this follows the similar idea to over
price the vector construction fed by VMAT_ELEMENTWISE or
VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
construction costing immediately, it firstly records how many
loads and vectorized statements in the given loop, later in
rs6000_density_test (called by finish_cost) it computes the
load density ratio against all vectorized stmts, and check
with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
if both thresholds are exceeded.

Note that this new load density heuristic check is based on
some fields in target cost which are updated as needed when
scanning each add_stmt_cost entry, it's independent of the
current function rs6000_density_test which requires to scan
non_vect stmts.  Since it's checking the load stmts count
vs. all vectorized stmts, it's kind of density, so I put
it in function rs6000_density_test.  With the same reason to
keep it independent, I didn't put it as an else arm of the
current existing density threshold check hunk or before this
hunk.

In the investigation of -1.04% degradation from 526.blender_r
on Power8, I noticed that the extra penalized cost 320 on one
single vector construction with type V16QI is much exaggerated,
which makes the final body cost unreliable, so this patch adds
one maximum bound for the extra penalized cost for each vector
construction statement.

Bootstrapped & regtested *again* on powerpc64le-linux-gnu P9.

Full SPEC2017 performance evaluation on Power8/Power9 with
option combinations (with v2, as v3 is NFC against v2):
  * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
  * {-O3, -Ofast} {,-funroll-loops}

bwaves_r degradations on P8/P9 have been fixed, nothing else
remarkable was observed.

Is it ok for trunk?

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html

BR,
Kewen
-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
	nstmts, nloads and extra_ctor_cost.
	(rs6000_density_test): Add load density related heuristics and the
	checks, do extra costing on vector construction statements if need.
	(rs6000_init_cost): Init new members.
	(rs6000_update_target_cost_per_stmt): New function.
	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
	rs6000_update_target_cost_per_stmt and call it.

[-- Attachment #2: load-density-v3.patch --]
[-- Type: text/plain, Size: 7161 bytes --]

---
 gcc/config/rs6000/rs6000.c | 119 ++++++++++++++++++++++++++++++++++---
 1 file changed, 110 insertions(+), 9 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 279f00cc648..4e9087683dc 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5254,6 +5254,12 @@ typedef struct _rs6000_cost_data
 {
   struct loop *loop_info;
   unsigned cost[3];
+  /* Total number of vectorized stmts (loop only).  */
+  unsigned nstmts;
+  /* Total number of loads (loop only).  */
+  unsigned nloads;
+  /* Possible extra penalized cost on vector construction (loop only).  */
+  unsigned extra_ctor_cost;
   /* For each vectorized loop, this var holds TRUE iff a non-memory vector
      instruction is needed by the vectorization.  */
   bool vect_nonmem;
@@ -5315,9 +5321,45 @@ rs6000_density_test (rs6000_cost_data *data)
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "density %d%%, cost %d exceeds threshold, penalizing "
-			 "loop body cost by %d%%", density_pct,
+			 "loop body cost by %d%%\n", density_pct,
 			 vec_cost + not_vec_cost, DENSITY_PENALTY);
     }
+
+  /* Check if we need to penalize the body cost for latency and
+     execution resources bound from strided or elementwise loads
+     into a vector.  */
+  if (data->extra_ctor_cost > 0)
+    {
+      /* Threshold for load stmts percentage in all vectorized stmts.  */
+      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
+      /* Threshold for total number of load stmts.  */
+      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
+
+      gcc_assert (data->nloads <= data->nstmts);
+      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
+
+      /* It's likely to be bounded by latency and execution resources
+	 from many scalar loads which are strided or elementwise loads
+	 into a vector if both conditions below are found:
+	   1. there are many loads, it's easy to result in a long wait
+	      for load units;
+	   2. load has a big proportion of all vectorized statements,
+	      it's not easy to schedule other statements to spread among
+	      the loads.
+	 One typical case is the innermost loop of the hotspot of SPEC2017
+	 503.bwaves_r without loop interchange.  */
+      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
+	  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
+	{
+	  data->cost[vect_body] += data->extra_ctor_cost;
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Found %u loads and load pct. %u%% exceed "
+			     "the threshold, penalizing loop body "
+			     "cost by extra cost %u for ctor.\n",
+			     data->nloads, load_pct, data->extra_ctor_cost);
+	}
+    }
 }
 
 /* Implement targetm.vectorize.init_cost.  */
@@ -5331,6 +5373,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
   data->cost[vect_body]     = 0;
   data->cost[vect_epilogue] = 0;
   data->vect_nonmem = false;
+  data->nstmts = 0;
+  data->nloads = 0;
+  data->extra_ctor_cost = 0;
   data->costing_for_scalar = costing_for_scalar;
   return data;
 }
@@ -5358,6 +5403,67 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
   return 0;
 }
 
+/* Helper function for add_stmt_cost.  Check each statement cost
+   entry, gather information and update the target_cost fields
+   accordingly.  */
+static void
+rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
+				    enum vect_cost_for_stmt kind,
+				    struct _stmt_vec_info *stmt_info,
+				    enum vect_cost_model_location where,
+				    int stmt_cost, unsigned int orig_count)
+{
+
+  /* Check whether we're doing something other than just a copy loop.
+     Not all such loops may be profitably vectorized; see
+     rs6000_finish_cost.  */
+  if ((kind == vec_to_scalar || kind == vec_perm
+       || kind == vec_promote_demote || kind == vec_construct
+       || kind == scalar_to_vec)
+      || (where == vect_body && kind == vector_stmt))
+    data->vect_nonmem = true;
+
+  /* Gather some information when we are costing the vectorized instruction
+     for the statements located in a loop body.  */
+  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
+    {
+      data->nstmts += orig_count;
+
+      if (kind == scalar_load || kind == vector_load
+	  || kind == unaligned_load || kind == vector_gather_load)
+	data->nloads += orig_count;
+
+      /* If we have strided or elementwise loads into a vector, it's
+	 possible to be bounded by latency and execution resources for
+	 many scalar loads.  Try to account for this by scaling the
+	 construction cost by the number of elements involved, when
+	 handling each matching statement we record the possible extra
+	 penalized cost into target cost, in the end of costing for
+	 the whole loop, we do the actual penalization once some load
+	 density heuristics are satisfied.  */
+      if (kind == vec_construct && stmt_info
+	  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+	  && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
+	      || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
+	{
+	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+	  unsigned int nunits = vect_nunits_for_cost (vectype);
+	  unsigned int extra_cost = nunits * stmt_cost;
+	  /* As function rs6000_builtin_vectorization_cost shows, we have
+	     priced much on V16QI/V8HI vector construction as their units,
+	     if we penalize them with nunits * stmt_cost, it can result in
+	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
+	     is 20 and nunits is 16, the extra cost is 320 which looks
+	     much exaggerated.  So let's use one maximum bound for the
+	     extra penalized cost for vector construction here.  */
+	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
+	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
+	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
+	  data->extra_ctor_cost += extra_cost;
+	}
+    }
+}
+
 /* Implement targetm.vectorize.add_stmt_cost.  */
 
 static unsigned
@@ -5377,6 +5483,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       /* Statements in an inner loop relative to the loop being
 	 vectorized are weighted more heavily.  The value here is
 	 arbitrary and could potentially be improved with analysis.  */
+      unsigned int orig_count = count;
       if (where == vect_body && stmt_info
 	  && stmt_in_inner_loop_p (vinfo, stmt_info))
 	{
@@ -5388,14 +5495,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       retval = (unsigned) (count * stmt_cost);
       cost_data->cost[where] += retval;
 
-      /* Check whether we're doing something other than just a copy loop.
-	 Not all such loops may be profitably vectorized; see
-	 rs6000_finish_cost.  */
-      if ((kind == vec_to_scalar || kind == vec_perm
-	   || kind == vec_promote_demote || kind == vec_construct
-	   || kind == scalar_to_vec)
-	  || (where == vect_body && kind == vector_stmt))
-	cost_data->vect_nonmem = true;
+      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
+					  stmt_cost, orig_count);
     }
 
   return retval;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3] rs6000: Add load density heuristic
  2021-07-28  5:22   ` [PATCH v3] " Kewen.Lin
@ 2021-09-03 15:57     ` Bill Schmidt
  2021-09-08  6:57       ` [PATCH v4] " Kewen.Lin
  0 siblings, 1 reply; 19+ messages in thread
From: Bill Schmidt @ 2021-09-03 15:57 UTC (permalink / raw)
  To: Kewen.Lin, GCC Patches; +Cc: Segher Boessenkool, David Edelsohn, will schmidt

Hi Kewen,

Sorry that we lost track of this patch!  The heuristic approach looks 
good.  It is limited in scope and won't kick in often, and the case 
you're trying to account for is important.

At the time you submitted this, I think reliable P10 testing wasn't 
possible.  Now that it is, could you please do a quick sniff test to 
make sure there aren't any adjustments that need to be made for P10?  I 
doubt it, but worth checking.

Otherwise I have one comment below...

On 7/28/21 12:22 AM, Kewen.Lin wrote:
> Hi,
>
> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
>
> This v3 addressed William's review comments in
> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576154.html
>
> It's mainly to deal with the bwaves_r degradation due to vector
> construction fed by strided loads.
>
> As Richi's comments [1], this follows the similar idea to over
> price the vector construction fed by VMAT_ELEMENTWISE or
> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
> construction costing immediately, it firstly records how many
> loads and vectorized statements in the given loop, later in
> rs6000_density_test (called by finish_cost) it computes the
> load density ratio against all vectorized stmts, and check
> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
> if both thresholds are exceeded.
>
> Note that this new load density heuristic check is based on
> some fields in target cost which are updated as needed when
> scanning each add_stmt_cost entry, it's independent of the
> current function rs6000_density_test which requires to scan
> non_vect stmts.  Since it's checking the load stmts count
> vs. all vectorized stmts, it's kind of density, so I put
> it in function rs6000_density_test.  With the same reason to
> keep it independent, I didn't put it as an else arm of the
> current existing density threshold check hunk or before this
> hunk.
>
> In the investigation of -1.04% degradation from 526.blender_r
> on Power8, I noticed that the extra penalized cost 320 on one
> single vector construction with type V16QI is much exaggerated,
> which makes the final body cost unreliable, so this patch adds
> one maximum bound for the extra penalized cost for each vector
> construction statement.
>
> Bootstrapped & regtested *again* on powerpc64le-linux-gnu P9.
>
> Full SPEC2017 performance evaluation on Power8/Power9 with
> option combinations (with v2, as v3 is NFC against v2):
>    * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>    * {-O3, -Ofast} {,-funroll-loops}
>
> bwaves_r degradations on P8/P9 have been fixed, nothing else
> remarkable was observed.
>
> Is it ok for trunk?
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
> 	nstmts, nloads and extra_ctor_cost.
> 	(rs6000_density_test): Add load density related heuristics and the
> 	checks, do extra costing on vector construction statements if need.
> 	(rs6000_init_cost): Init new members.
> 	(rs6000_update_target_cost_per_stmt): New function.
> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
> 	rs6000_update_target_cost_per_stmt and call it.

>---
> gcc/config/rs6000/rs6000.c | 119 ++++++++++++++++++++++++++++++++++---
> 1 file changed, 110 insertions(+), 9 deletions(-)
>
>diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>index 279f00cc648..4e9087683dc 100644
>--- a/gcc/config/rs6000/rs6000.c
>+++ b/gcc/config/rs6000/rs6000.c
>@@ -5254,6 +5254,12 @@ typedef struct _rs6000_cost_data
> {
>   struct loop *loop_info;
>   unsigned cost[3];
>+  /* Total number of vectorized stmts (loop only).  */
>+  unsigned nstmts;
>+  /* Total number of loads (loop only).  */
>+  unsigned nloads;
>+  /* Possible extra penalized cost on vector construction (loop only).  */
>+  unsigned extra_ctor_cost;
>   /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>      instruction is needed by the vectorization.  */
>   bool vect_nonmem;
>@@ -5315,9 +5321,45 @@ rs6000_density_test (rs6000_cost_data *data)
>       if (dump_enabled_p ())
> 	dump_printf_loc (MSG_NOTE, vect_location,
> 			 "density %d%%, cost %d exceeds threshold, penalizing "
>-			 "loop body cost by %d%%", density_pct,
>+			 "loop body cost by %d%%\n", density_pct,
> 			 vec_cost + not_vec_cost, DENSITY_PENALTY);
>     }
>+
>+  /* Check if we need to penalize the body cost for latency and
>+     execution resources bound from strided or elementwise loads
>+     into a vector.  */
>+  if (data->extra_ctor_cost > 0)
>+    {
>+      /* Threshold for load stmts percentage in all vectorized stmts.  */
>+      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
>+      /* Threshold for total number of load stmts.  */
>+      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
>+
>+      gcc_assert (data->nloads <= data->nstmts);
>+      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
>+
>+      /* It's likely to be bounded by latency and execution resources
>+	 from many scalar loads which are strided or elementwise loads
>+	 into a vector if both conditions below are found:
>+	   1. there are many loads, it's easy to result in a long wait
>+	      for load units;
>+	   2. load has a big proportion of all vectorized statements,
>+	      it's not easy to schedule other statements to spread among
>+	      the loads.
>+	 One typical case is the innermost loop of the hotspot of SPEC2017
>+	 503.bwaves_r without loop interchange.  */
>+      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
>+	  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
>+	{
>+	  data->cost[vect_body] += data->extra_ctor_cost;
>+	  if (dump_enabled_p ())
>+	    dump_printf_loc (MSG_NOTE, vect_location,
>+			     "Found %u loads and load pct. %u%% exceed "
>+			     "the threshold, penalizing loop body "
>+			     "cost by extra cost %u for ctor.\n",
>+			     data->nloads, load_pct, data->extra_ctor_cost);
>+	}
>+    }
> }
> 
> /* Implement targetm.vectorize.init_cost.  */
>@@ -5331,6 +5373,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
>   data->cost[vect_body]     = 0;
>   data->cost[vect_epilogue] = 0;
>   data->vect_nonmem = false;
>+  data->nstmts = 0;
>+  data->nloads = 0;
>+  data->extra_ctor_cost = 0;
>   data->costing_for_scalar = costing_for_scalar;
>   return data;
> }
>@@ -5358,6 +5403,67 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
>   return 0;
> }
> 
>+/* Helper function for add_stmt_cost.  Check each statement cost
>+   entry, gather information and update the target_cost fields
>+   accordingly.  */
>+static void
>+rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>+				    enum vect_cost_for_stmt kind,
>+				    struct _stmt_vec_info *stmt_info,
>+				    enum vect_cost_model_location where,
>+				    int stmt_cost, unsigned int orig_count)
>+{
>+
>+  /* Check whether we're doing something other than just a copy loop.
>+     Not all such loops may be profitably vectorized; see
>+     rs6000_finish_cost.  */
>+  if ((kind == vec_to_scalar || kind == vec_perm
>+       || kind == vec_promote_demote || kind == vec_construct
>+       || kind == scalar_to_vec)
>+      || (where == vect_body && kind == vector_stmt))
>+    data->vect_nonmem = true;
>+
>+  /* Gather some information when we are costing the vectorized instruction
>+     for the statements located in a loop body.  */
>+  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>+    {
>+      data->nstmts += orig_count;
>+
>+      if (kind == scalar_load || kind == vector_load
>+	  || kind == unaligned_load || kind == vector_gather_load)
>+	data->nloads += orig_count;
>+
>+      /* If we have strided or elementwise loads into a vector, it's
>+	 possible to be bounded by latency and execution resources for
>+	 many scalar loads.  Try to account for this by scaling the
>+	 construction cost by the number of elements involved, when
>+	 handling each matching statement we record the possible extra
>+	 penalized cost into target cost, in the end of costing for
>+	 the whole loop, we do the actual penalization once some load
>+	 density heuristics are satisfied.  */

The above comment is quite hard to read.  Can you please break up the last
sentence into at least two sentences?

Otherwise this looks good to me, and I recommend maintainers approve with
that clarified.

Thanks,
Bill

>+      if (kind == vec_construct && stmt_info
>+	  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>+	  && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
>+	      || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
>+	{
>+	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>+	  unsigned int nunits = vect_nunits_for_cost (vectype);
>+	  unsigned int extra_cost = nunits * stmt_cost;
>+	  /* As function rs6000_builtin_vectorization_cost shows, we have
>+	     priced much on V16QI/V8HI vector construction as their units,
>+	     if we penalize them with nunits * stmt_cost, it can result in
>+	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>+	     is 20 and nunits is 16, the extra cost is 320 which looks
>+	     much exaggerated.  So let's use one maximum bound for the
>+	     extra penalized cost for vector construction here.  */
>+	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>+	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>+	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>+	  data->extra_ctor_cost += extra_cost;
>+	}
>+    }
>+}
>+
> /* Implement targetm.vectorize.add_stmt_cost.  */
> 
> static unsigned
>@@ -5377,6 +5483,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>       /* Statements in an inner loop relative to the loop being
> 	 vectorized are weighted more heavily.  The value here is
> 	 arbitrary and could potentially be improved with analysis.  */
>+      unsigned int orig_count = count;
>       if (where == vect_body && stmt_info
> 	  && stmt_in_inner_loop_p (vinfo, stmt_info))
> 	{
>@@ -5388,14 +5495,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>       retval = (unsigned) (count * stmt_cost);
>       cost_data->cost[where] += retval;
> 
>-      /* Check whether we're doing something other than just a copy loop.
>-	 Not all such loops may be profitably vectorized; see
>-	 rs6000_finish_cost.  */
>-      if ((kind == vec_to_scalar || kind == vec_perm
>-	   || kind == vec_promote_demote || kind == vec_construct
>-	   || kind == scalar_to_vec)
>-	  || (where == vect_body && kind == vector_stmt))
>-	cost_data->vect_nonmem = true;
>+      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
>+					  stmt_cost, orig_count);
>     }
> 
>   return retval;
>-- 
>2.17.1
>
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] rs6000: Add load density heuristic
  2021-07-28  2:59     ` Kewen.Lin
@ 2021-09-06 23:43       ` Segher Boessenkool
  2021-09-08  7:01         ` Kewen.Lin
  0 siblings, 1 reply; 19+ messages in thread
From: Segher Boessenkool @ 2021-09-06 23:43 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: will schmidt, Bill Schmidt, David Edelsohn, GCC Patches

Hi!

On Wed, Jul 28, 2021 at 10:59:50AM +0800, Kewen.Lin wrote:
> >> +/* As a visitor function for each statement cost entry handled in
> >> +   function add_stmt_cost, gather some information and update its
> >> +   relevant fields in target cost accordingly.  */
> > 
> > I got lost trying to parse that..  (could be just me :-) 
> > 
> > Possibly instead something like
> > /* Helper function for add_stmt_cost ; gather information and update
> > the target_cost fields accordingly.  */
> 
> OK, will update.  I was thinking for each entry handled in function
> add_stmt_cost, this helper acts like a visitor, trying to visit each
> entry and take some actions if some conditions are satisifed.

It (thankfully!) has nothing to do with the "visitor pattern", so some
other name might be better :-)

> > Maybe clearer to read if you rearrange slightly and flatten it ?  I
> > defer to others on that..
> > 
> > 	if ((kind == vec_to_scalar
> > 	     || kind == vec_perm
> > 	     || kind == vec_promote_demote
> > 	     || kind == vec_construct
> > 	     || kind == scalar_to_vec)
> > 	    || (kind == vector_stmt && where == vect_body)
> 
> This hunk is factored out from function rs6000_add_stmt_cost, maybe I
> can keep the original formatting?  The formatting tool isn't so smart,
> and sometimes rearrange things to become unexpected (although it meets
> the basic rule, not so elegant), sigh.

It has too many parens, making grouping where there is none, that is the
core issue.

	if (kind == vec_to_scalar
	    || kind == vec_perm
	    || kind == vec_promote_demote
	    || kind == vec_construct
	    || kind == scalar_to_vec
	    || (kind == vector_stmt && where == vect_body))


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v4] rs6000: Add load density heuristic
  2021-09-03 15:57     ` Bill Schmidt
@ 2021-09-08  6:57       ` Kewen.Lin
  2021-09-08  8:28         ` Kewen.Lin
  2021-09-09 16:11         ` Segher Boessenkool
  0 siblings, 2 replies; 19+ messages in thread
From: Kewen.Lin @ 2021-09-08  6:57 UTC (permalink / raw)
  To: wschmidt; +Cc: Segher Boessenkool, David Edelsohn, will schmidt, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 5742 bytes --]

Hi Bill,

Thanks for the review comments!

on 2021/9/3 下午11:57, Bill Schmidt wrote:
> Hi Kewen,
> 
> Sorry that we lost track of this patch!  The heuristic approach looks good.  It is limited in scope and won't kick in often, and the case you're trying to account for is important.
> 
> At the time you submitted this, I think reliable P10 testing wasn't possible.  Now that it is, could you please do a quick sniff test to make sure there aren't any adjustments that need to be made for P10?  I doubt it, but worth checking.
> 

Good point, thanks for the reminder!  I did one SPEC2017 full run on Power10 with Ofast unroll, this patch is neutral,
one SPEC2017 run at O2 vectorization (cheap cost) to verify bwaves_r degradation existed or not and if it can fixed by
this patch.  The result shows the degradation did exist and got fixed by this patch, besides got extra 3.93% speedup
against O2 and another bmk 554.roms_r got 3.24% speed up.

In short, the Power10 evaluation result shows this patch is positive.

> Otherwise I have one comment below...
> 
> On 7/28/21 12:22 AM, Kewen.Lin wrote:
>> Hi,
>>
>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
>>
>> This v3 addressed William's review comments in
>> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576154.html
>>
>> It's mainly to deal with the bwaves_r degradation due to vector
>> construction fed by strided loads.
>>
>> As Richi's comments [1], this follows the similar idea to over
>> price the vector construction fed by VMAT_ELEMENTWISE or
>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>> construction costing immediately, it firstly records how many
>> loads and vectorized statements in the given loop, later in
>> rs6000_density_test (called by finish_cost) it computes the
>> load density ratio against all vectorized stmts, and check
>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>> if both thresholds are exceeded.
>>
>> Note that this new load density heuristic check is based on
>> some fields in target cost which are updated as needed when
>> scanning each add_stmt_cost entry, it's independent of the
>> current function rs6000_density_test which requires to scan
>> non_vect stmts.  Since it's checking the load stmts count
>> vs. all vectorized stmts, it's kind of density, so I put
>> it in function rs6000_density_test.  With the same reason to
>> keep it independent, I didn't put it as an else arm of the
>> current existing density threshold check hunk or before this
>> hunk.
>>
>> In the investigation of -1.04% degradation from 526.blender_r
>> on Power8, I noticed that the extra penalized cost 320 on one
>> single vector construction with type V16QI is much exaggerated,
>> which makes the final body cost unreliable, so this patch adds
>> one maximum bound for the extra penalized cost for each vector
>> construction statement.
>>
>> Bootstrapped & regtested *again* on powerpc64le-linux-gnu P9.
>>
>> Full SPEC2017 performance evaluation on Power8/Power9 with
>> option combinations (with v2, as v3 is NFC against v2):
>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>   * {-O3, -Ofast} {,-funroll-loops}
>>
>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>> remarkable was observed.
>>

...

>>+  /* Gather some information when we are costing the vectorized instruction
>>+     for the statements located in a loop body.  */
>>+  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>>+    {
>>+      data->nstmts += orig_count;
>>+
>>+      if (kind == scalar_load || kind == vector_load
>>+	  || kind == unaligned_load || kind == vector_gather_load)
>>+	data->nloads += orig_count;
>>+
>>+      /* If we have strided or elementwise loads into a vector, it's
>>+	 possible to be bounded by latency and execution resources for
>>+	 many scalar loads.  Try to account for this by scaling the
>>+	 construction cost by the number of elements involved, when
>>+	 handling each matching statement we record the possible extra
>>+	 penalized cost into target cost, in the end of costing for
>>+	 the whole loop, we do the actual penalization once some load
>>+	 density heuristics are satisfied.  */
> 
> The above comment is quite hard to read.  Can you please break up the last
> sentence into at least two sentences?
> 

How about the below:

+      /* If we have strided or elementwise loads into a vector, it's
+        possible to be bounded by latency and execution resources for
+        many scalar loads.  Try to account for this by scaling the
+        construction cost by the number of elements involved.  For
+        each matching statement, we record the possible extra
+        penalized cost into the relevant field in target cost.  When
+        we want to finalize the whole loop costing, we will check if
+        those related load density heuristics are satisfied, and add
+        this accumulated penalized cost if yes.  */


> Otherwise this looks good to me, and I recommend maintainers approve with
> that clarified.
> 

Thanks again!

The whole updated patch is attached, it also addressed Segher's comments on formatting.

Is it ok for trunk?

BR,
Kewen
-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
	nstmts, nloads and extra_ctor_cost.
	(rs6000_density_test): Add load density related heuristics and the
	checks, do extra costing on vector construction statements if need.
	(rs6000_init_cost): Init new members.
	(rs6000_update_target_cost_per_stmt): New function.
	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
	rs6000_update_target_cost_per_stmt and call it.


[-- Attachment #2: density_v4.diff --]
[-- Type: text/plain, Size: 7095 bytes --]

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index e073b26b430..a8872cd6d8d 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5262,6 +5262,12 @@ typedef struct _rs6000_cost_data
 {
   struct loop *loop_info;
   unsigned cost[3];
+  /* Total number of vectorized stmts (loop only).  */
+  unsigned nstmts;
+  /* Total number of loads (loop only).  */
+  unsigned nloads;
+  /* Possible extra penalized cost on vector construction (loop only).  */
+  unsigned extra_ctor_cost;
   /* For each vectorized loop, this var holds TRUE iff a non-memory vector
      instruction is needed by the vectorization.  */
   bool vect_nonmem;
@@ -5323,9 +5329,45 @@ rs6000_density_test (rs6000_cost_data *data)
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "density %d%%, cost %d exceeds threshold, penalizing "
-			 "loop body cost by %d%%", density_pct,
+			 "loop body cost by %d%%\n", density_pct,
 			 vec_cost + not_vec_cost, DENSITY_PENALTY);
     }
+
+  /* Check if we need to penalize the body cost for latency and
+     execution resources bound from strided or elementwise loads
+     into a vector.  */
+  if (data->extra_ctor_cost > 0)
+    {
+      /* Threshold for load stmts percentage in all vectorized stmts.  */
+      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
+      /* Threshold for total number of load stmts.  */
+      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
+
+      gcc_assert (data->nloads <= data->nstmts);
+      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
+
+      /* It's likely to be bounded by latency and execution resources
+	 from many scalar loads which are strided or elementwise loads
+	 into a vector if both conditions below are found:
+	   1. there are many loads, it's easy to result in a long wait
+	      for load units;
+	   2. load has a big proportion of all vectorized statements,
+	      it's not easy to schedule other statements to spread among
+	      the loads.
+	 One typical case is the innermost loop of the hotspot of SPEC2017
+	 503.bwaves_r without loop interchange.  */
+      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
+	  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
+	{
+	  data->cost[vect_body] += data->extra_ctor_cost;
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Found %u loads and load pct. %u%% exceed "
+			     "the threshold, penalizing loop body "
+			     "cost by extra cost %u for ctor.\n",
+			     data->nloads, load_pct, data->extra_ctor_cost);
+	}
+    }
 }
 
 /* Implement targetm.vectorize.init_cost.  */
@@ -5339,6 +5381,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
   data->cost[vect_body]     = 0;
   data->cost[vect_epilogue] = 0;
   data->vect_nonmem = false;
+  data->nstmts = 0;
+  data->nloads = 0;
+  data->extra_ctor_cost = 0;
   data->costing_for_scalar = costing_for_scalar;
   return data;
 }
@@ -5366,6 +5411,70 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
   return 0;
 }
 
+/* Helper function for add_stmt_cost.  Check each statement cost
+   entry, gather information and update the target_cost fields
+   accordingly.  */
+static void
+rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
+				    enum vect_cost_for_stmt kind,
+				    struct _stmt_vec_info *stmt_info,
+				    enum vect_cost_model_location where,
+				    int stmt_cost, unsigned int orig_count)
+{
+
+  /* Check whether we're doing something other than just a copy loop.
+     Not all such loops may be profitably vectorized; see
+     rs6000_finish_cost.  */
+  if (kind == vec_to_scalar
+      || kind == vec_perm
+      || kind == vec_promote_demote
+      || kind == vec_construct
+      || kind == scalar_to_vec
+      || (where == vect_body && kind == vector_stmt))
+    data->vect_nonmem = true;
+
+  /* Gather some information when we are costing the vectorized instruction
+     for the statements located in a loop body.  */
+  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
+    {
+      data->nstmts += orig_count;
+
+      if (kind == scalar_load || kind == vector_load
+	  || kind == unaligned_load || kind == vector_gather_load)
+	data->nloads += orig_count;
+
+      /* If we have strided or elementwise loads into a vector, it's
+	 possible to be bounded by latency and execution resources for
+	 many scalar loads.  Try to account for this by scaling the
+	 construction cost by the number of elements involved.  For
+	 each matching statement, we record the possible extra
+	 penalized cost into the relevant field in target cost.  When
+	 we want to finalize the whole loop costing, we will check if
+	 those related load density heuristics are satisfied, and add
+	 this accumulated penalized cost if yes.  */
+      if (kind == vec_construct && stmt_info
+	  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+	  && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
+	      || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
+	{
+	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+	  unsigned int nunits = vect_nunits_for_cost (vectype);
+	  unsigned int extra_cost = nunits * stmt_cost;
+	  /* As function rs6000_builtin_vectorization_cost shows, we have
+	     priced much on V16QI/V8HI vector construction as their units,
+	     if we penalize them with nunits * stmt_cost, it can result in
+	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
+	     is 20 and nunits is 16, the extra cost is 320 which looks
+	     much exaggerated.  So let's use one maximum bound for the
+	     extra penalized cost for vector construction here.  */
+	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
+	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
+	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
+	  data->extra_ctor_cost += extra_cost;
+	}
+    }
+}
+
 /* Implement targetm.vectorize.add_stmt_cost.  */
 
 static unsigned
@@ -5385,6 +5494,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       /* Statements in an inner loop relative to the loop being
 	 vectorized are weighted more heavily.  The value here is
 	 arbitrary and could potentially be improved with analysis.  */
+      unsigned int orig_count = count;
       if (where == vect_body && stmt_info
 	  && stmt_in_inner_loop_p (vinfo, stmt_info))
 	{
@@ -5396,14 +5506,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       retval = (unsigned) (count * stmt_cost);
       cost_data->cost[where] += retval;
 
-      /* Check whether we're doing something other than just a copy loop.
-	 Not all such loops may be profitably vectorized; see
-	 rs6000_finish_cost.  */
-      if ((kind == vec_to_scalar || kind == vec_perm
-	   || kind == vec_promote_demote || kind == vec_construct
-	   || kind == scalar_to_vec)
-	  || (where == vect_body && kind == vector_stmt))
-	cost_data->vect_nonmem = true;
+      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
+					  stmt_cost, orig_count);
     }
 
   return retval;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] rs6000: Add load density heuristic
  2021-09-06 23:43       ` Segher Boessenkool
@ 2021-09-08  7:01         ` Kewen.Lin
  0 siblings, 0 replies; 19+ messages in thread
From: Kewen.Lin @ 2021-09-08  7:01 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: will schmidt, Bill Schmidt, David Edelsohn, GCC Patches

Hi Segher,

Thanks for the comments!

on 2021/9/7 上午7:43, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Jul 28, 2021 at 10:59:50AM +0800, Kewen.Lin wrote:
>>>> +/* As a visitor function for each statement cost entry handled in
>>>> +   function add_stmt_cost, gather some information and update its
>>>> +   relevant fields in target cost accordingly.  */
>>>
>>> I got lost trying to parse that..  (could be just me :-) 
>>>
>>> Possibly instead something like
>>> /* Helper function for add_stmt_cost ; gather information and update
>>> the target_cost fields accordingly.  */
>>
>> OK, will update.  I was thinking for each entry handled in function
>> add_stmt_cost, this helper acts like a visitor, trying to visit each
>> entry and take some actions if some conditions are satisifed.
> 
> It (thankfully!) has nothing to do with the "visitor pattern", so some
> other name might be better :-)
> 
>>> Maybe clearer to read if you rearrange slightly and flatten it ?  I
>>> defer to others on that..
>>>
>>> 	if ((kind == vec_to_scalar
>>> 	     || kind == vec_perm
>>> 	     || kind == vec_promote_demote
>>> 	     || kind == vec_construct
>>> 	     || kind == scalar_to_vec)
>>> 	    || (kind == vector_stmt && where == vect_body)
>>
>> This hunk is factored out from function rs6000_add_stmt_cost, maybe I
>> can keep the original formatting?  The formatting tool isn't so smart,
>> and sometimes rearrange things to become unexpected (although it meets
>> the basic rule, not so elegant), sigh.
> 
> It has too many parens, making grouping where there is none, that is the
> core issue.
> 
> 	if (kind == vec_to_scalar
> 	    || kind == vec_perm
> 	    || kind == vec_promote_demote
> 	    || kind == vec_construct
> 	    || kind == scalar_to_vec
> 	    || (kind == vector_stmt && where == vect_body))
> 
> 

Good catch, I've updated it in V4.

BR,
Kewen

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4] rs6000: Add load density heuristic
  2021-09-08  6:57       ` [PATCH v4] " Kewen.Lin
@ 2021-09-08  8:28         ` Kewen.Lin
  2021-09-09 16:11         ` Segher Boessenkool
  1 sibling, 0 replies; 19+ messages in thread
From: Kewen.Lin @ 2021-09-08  8:28 UTC (permalink / raw)
  To: wschmidt; +Cc: GCC Patches, David Edelsohn, Segher Boessenkool

on 2021/9/8 下午2:57, Kewen.Lin via Gcc-patches wrote:
> Hi Bill,
> 
> Thanks for the review comments!
> 
> on 2021/9/3 下午11:57, Bill Schmidt wrote:
>> Hi Kewen,
>>
>> Sorry that we lost track of this patch!  The heuristic approach looks good.  It is limited in scope and won't kick in often, and the case you're trying to account for is important.
>>
>> At the time you submitted this, I think reliable P10 testing wasn't possible.  Now that it is, could you please do a quick sniff test to make sure there aren't any adjustments that need to be made for P10?  I doubt it, but worth checking.
>>
> 
> Good point, thanks for the reminder!  I did one SPEC2017 full run on Power10 with Ofast unroll, this patch is neutral,
> one SPEC2017 run at O2 vectorization (cheap cost) to verify bwaves_r degradation existed or not and if it can fixed by
> this patch.  The result shows the degradation did exist and got fixed by this patch, besides got extra 3.93% speedup
> against O2 and another bmk 554.roms_r got 3.24% speed up.
> 

hmm, sorry that this improvement on 554.roms_r looks not reliable, I just ran it with 10 iterations for both w/ and w/o
the patch, both suffer the jitters and the best scores of them are close.  But note that bwaves_r scores are quite stable
so it's reliable.

BR,
Kewen

> In short, the Power10 evaluation result shows this patch is positive.
> 
>> Otherwise I have one comment below...
>>
>> On 7/28/21 12:22 AM, Kewen.Lin wrote:
>>> Hi,
>>>
>>> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
>>>
>>> This v3 addressed William's review comments in
>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576154.html
>>>
>>> It's mainly to deal with the bwaves_r degradation due to vector
>>> construction fed by strided loads.
>>>
>>> As Richi's comments [1], this follows the similar idea to over
>>> price the vector construction fed by VMAT_ELEMENTWISE or
>>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>>> construction costing immediately, it firstly records how many
>>> loads and vectorized statements in the given loop, later in
>>> rs6000_density_test (called by finish_cost) it computes the
>>> load density ratio against all vectorized stmts, and check
>>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>>> if both thresholds are exceeded.
>>>
>>> Note that this new load density heuristic check is based on
>>> some fields in target cost which are updated as needed when
>>> scanning each add_stmt_cost entry, it's independent of the
>>> current function rs6000_density_test which requires to scan
>>> non_vect stmts.  Since it's checking the load stmts count
>>> vs. all vectorized stmts, it's kind of density, so I put
>>> it in function rs6000_density_test.  With the same reason to
>>> keep it independent, I didn't put it as an else arm of the
>>> current existing density threshold check hunk or before this
>>> hunk.
>>>
>>> In the investigation of -1.04% degradation from 526.blender_r
>>> on Power8, I noticed that the extra penalized cost 320 on one
>>> single vector construction with type V16QI is much exaggerated,
>>> which makes the final body cost unreliable, so this patch adds
>>> one maximum bound for the extra penalized cost for each vector
>>> construction statement.
>>>
>>> Bootstrapped & regtested *again* on powerpc64le-linux-gnu P9.
>>>
>>> Full SPEC2017 performance evaluation on Power8/Power9 with
>>> option combinations (with v2, as v3 is NFC against v2):
>>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>>   * {-O3, -Ofast} {,-funroll-loops}
>>>
>>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>>> remarkable was observed.
>>>
> 
> ...
> 
>>> +  /* Gather some information when we are costing the vectorized instruction
>>> +     for the statements located in a loop body.  */
>>> +  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>>> +    {
>>> +      data->nstmts += orig_count;
>>> +
>>> +      if (kind == scalar_load || kind == vector_load
>>> +	  || kind == unaligned_load || kind == vector_gather_load)
>>> +	data->nloads += orig_count;
>>> +
>>> +      /* If we have strided or elementwise loads into a vector, it's
>>> +	 possible to be bounded by latency and execution resources for
>>> +	 many scalar loads.  Try to account for this by scaling the
>>> +	 construction cost by the number of elements involved, when
>>> +	 handling each matching statement we record the possible extra
>>> +	 penalized cost into target cost, in the end of costing for
>>> +	 the whole loop, we do the actual penalization once some load
>>> +	 density heuristics are satisfied.  */
>>
>> The above comment is quite hard to read.  Can you please break up the last
>> sentence into at least two sentences?
>>
> 
> How about the below:
> 
> +      /* If we have strided or elementwise loads into a vector, it's
> +        possible to be bounded by latency and execution resources for
> +        many scalar loads.  Try to account for this by scaling the
> +        construction cost by the number of elements involved.  For
> +        each matching statement, we record the possible extra
> +        penalized cost into the relevant field in target cost.  When
> +        we want to finalize the whole loop costing, we will check if
> +        those related load density heuristics are satisfied, and add
> +        this accumulated penalized cost if yes.  */
> 
> 
>> Otherwise this looks good to me, and I recommend maintainers approve with
>> that clarified.
>>
> 
> Thanks again!
> 
> The whole updated patch is attached, it also addressed Segher's comments on formatting.
> 
> Is it ok for trunk?
> 
> BR,
> Kewen
> -----
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
> 	nstmts, nloads and extra_ctor_cost.
> 	(rs6000_density_test): Add load density related heuristics and the
> 	checks, do extra costing on vector construction statements if need.
> 	(rs6000_init_cost): Init new members.
> 	(rs6000_update_target_cost_per_stmt): New function.
> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
> 	rs6000_update_target_cost_per_stmt and call it.
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4] rs6000: Add load density heuristic
  2021-09-08  6:57       ` [PATCH v4] " Kewen.Lin
  2021-09-08  8:28         ` Kewen.Lin
@ 2021-09-09 16:11         ` Segher Boessenkool
  2021-09-09 17:19           ` Bill Schmidt
  1 sibling, 1 reply; 19+ messages in thread
From: Segher Boessenkool @ 2021-09-09 16:11 UTC (permalink / raw)
  To: Kewen.Lin; +Cc: wschmidt, David Edelsohn, will schmidt, GCC Patches

Hi!

On Wed, Sep 08, 2021 at 02:57:14PM +0800, Kewen.Lin wrote:
> >>+      /* If we have strided or elementwise loads into a vector, it's
> >>+	 possible to be bounded by latency and execution resources for
> >>+	 many scalar loads.  Try to account for this by scaling the
> >>+	 construction cost by the number of elements involved, when
> >>+	 handling each matching statement we record the possible extra
> >>+	 penalized cost into target cost, in the end of costing for
> >>+	 the whole loop, we do the actual penalization once some load
> >>+	 density heuristics are satisfied.  */
> > 
> > The above comment is quite hard to read.  Can you please break up the last
> > sentence into at least two sentences?
> 
> How about the below:
> 
> +      /* If we have strided or elementwise loads into a vector, it's

"strided" is not a word: it properly is "stridden", which does not read
very well either.  "Have loads by stride, or by element, ..."?  Is that
good English, and easier to understand?

> +        possible to be bounded by latency and execution resources for
> +        many scalar loads.  Try to account for this by scaling the
> +        construction cost by the number of elements involved.  For
> +        each matching statement, we record the possible extra
> +        penalized cost into the relevant field in target cost.  When
> +        we want to finalize the whole loop costing, we will check if
> +        those related load density heuristics are satisfied, and add
> +        this accumulated penalized cost if yes.  */
> 
> > Otherwise this looks good to me, and I recommend maintainers approve with
> > that clarified.

Does that text look good to you now Bill?  It is still kinda complex,
maybe you see a way to make it simpler.

> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
> 	nstmts, nloads and extra_ctor_cost.
> 	(rs6000_density_test): Add load density related heuristics and the
> 	checks, do extra costing on vector construction statements if need.

"and the checks"?  Oh, "and checks"?  It is probably fine to just leave
out this whole phrase part :-)

Don't use commas like this in changelogs.  s/, do/.  Do/  Yes this is a
bit boring text that way, but that is the purpose: it makes it simpler
to read (and read quickly, even merely scan).

> @@ -5262,6 +5262,12 @@ typedef struct _rs6000_cost_data

[ Btw, you can get rid of the typedef now, just have a struct with the
non-underscore name, we have C++ now.  Such a mechanical change (as
separate patch!) is pre-approved. ]

> +  /* Check if we need to penalize the body cost for latency and
> +     execution resources bound from strided or elementwise loads
> +     into a vector.  */

Bill, is that clear enough?  I'm sure something nicer would help here,
but it's hard for me to write anything :-)

> +  if (data->extra_ctor_cost > 0)
> +    {
> +      /* Threshold for load stmts percentage in all vectorized stmts.  */
> +      const int DENSITY_LOAD_PCT_THRESHOLD = 45;

Threshold for what?

45% is awfully exact.  Can you make this a param?

> +      /* Threshold for total number of load stmts.  */
> +      const int DENSITY_LOAD_NUM_THRESHOLD = 20;

Same.

> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);

No parens around the last thing please.  The other pair of parens is
unneeded as well, but perhaps it is easier to read like that.

> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_NOTE, vect_location,
> +			     "Found %u loads and load pct. %u%% exceed "
> +			     "the threshold, penalizing loop body "
> +			     "cost by extra cost %u for ctor.\n",
> +			     data->nloads, load_pct, data->extra_ctor_cost);

That line does not fit.  Make it more lines?

It is a pity that using these interfaces at all takes up 45 chars
of noise already.

> +/* Helper function for add_stmt_cost.  Check each statement cost
> +   entry, gather information and update the target_cost fields
> +   accordingly.  */
> +static void
> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
> +				    enum vect_cost_for_stmt kind,
> +				    struct _stmt_vec_info *stmt_info,
> +				    enum vect_cost_model_location where,
> +				    int stmt_cost, unsigned int orig_count)

Please put those last two on separate lines as well?

> +	  /* As function rs6000_builtin_vectorization_cost shows, we have
> +	     priced much on V16QI/V8HI vector construction as their units,
> +	     if we penalize them with nunits * stmt_cost, it can result in
> +	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
> +	     is 20 and nunits is 16, the extra cost is 320 which looks
> +	     much exaggerated.  So let's use one maximum bound for the
> +	     extra penalized cost for vector construction here.  */
> +	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
> +	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
> +	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;

That is a pretty gross hack.  Can you think of any saner way to not have
those out of scale costs in the first place?

Okay for trunk with such tweaks.  Thanks!  (And please consult with Bill
for the wordsmithing :-) )


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4] rs6000: Add load density heuristic
  2021-09-09 16:11         ` Segher Boessenkool
@ 2021-09-09 17:19           ` Bill Schmidt
  2021-09-09 17:39             ` Bill Schmidt
                               ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Bill Schmidt @ 2021-09-09 17:19 UTC (permalink / raw)
  To: Segher Boessenkool, Kewen.Lin; +Cc: David Edelsohn, will schmidt, GCC Patches

On 9/9/21 11:11 AM, Segher Boessenkool wrote:
> Hi!
>
> On Wed, Sep 08, 2021 at 02:57:14PM +0800, Kewen.Lin wrote:
>>>> +      /* If we have strided or elementwise loads into a vector, it's
>>>> +	 possible to be bounded by latency and execution resources for
>>>> +	 many scalar loads.  Try to account for this by scaling the
>>>> +	 construction cost by the number of elements involved, when
>>>> +	 handling each matching statement we record the possible extra
>>>> +	 penalized cost into target cost, in the end of costing for
>>>> +	 the whole loop, we do the actual penalization once some load
>>>> +	 density heuristics are satisfied.  */
>>> The above comment is quite hard to read.  Can you please break up the last
>>> sentence into at least two sentences?
>> How about the below:
>>
>> +      /* If we have strided or elementwise loads into a vector, it's
> "strided" is not a word: it properly is "stridden", which does not read
> very well either.  "Have loads by stride, or by element, ..."?  Is that
> good English, and easier to understand?

No, this is OK.  "Strided loads" is a term of art used by the 
vectorizer; whether or not it was the Queen's English, it's what we 
have...  (And I think you might only find "bestridden" in some 18th or 
19th century English poetry... :-)
>
>> +        possible to be bounded by latency and execution resources for
>> +        many scalar loads.  Try to account for this by scaling the
>> +        construction cost by the number of elements involved.  For
>> +        each matching statement, we record the possible extra
>> +        penalized cost into the relevant field in target cost.  When
>> +        we want to finalize the whole loop costing, we will check if
>> +        those related load density heuristics are satisfied, and add
>> +        this accumulated penalized cost if yes.  */
>>
>>> Otherwise this looks good to me, and I recommend maintainers approve with
>>> that clarified.
> Does that text look good to you now Bill?  It is still kinda complex,
> maybe you see a way to make it simpler.

I think it's OK now.  The complexity at least matches the code now 
instead of exceeding it. :-P  j/k...

>
>> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>> 	nstmts, nloads and extra_ctor_cost.
>> 	(rs6000_density_test): Add load density related heuristics and the
>> 	checks, do extra costing on vector construction statements if need.
> "and the checks"?  Oh, "and checks"?  It is probably fine to just leave
> out this whole phrase part :-)
>
> Don't use commas like this in changelogs.  s/, do/.  Do/  Yes this is a
> bit boring text that way, but that is the purpose: it makes it simpler
> to read (and read quickly, even merely scan).
>
>> @@ -5262,6 +5262,12 @@ typedef struct _rs6000_cost_data
> [ Btw, you can get rid of the typedef now, just have a struct with the
> non-underscore name, we have C++ now.  Such a mechanical change (as
> separate patch!) is pre-approved. ]
>
>> +  /* Check if we need to penalize the body cost for latency and
>> +     execution resources bound from strided or elementwise loads
>> +     into a vector.  */
> Bill, is that clear enough?  I'm sure something nicer would help here,
> but it's hard for me to write anything :-)

Perhaps:  "Check whether we need to penalize the body cost to account 
for excess strided or elementwise loads."
>
>> +  if (data->extra_ctor_cost > 0)
>> +    {
>> +      /* Threshold for load stmts percentage in all vectorized stmts.  */
>> +      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
> Threshold for what?
>
> 45% is awfully exact.  Can you make this a param?
>
>> +      /* Threshold for total number of load stmts.  */
>> +      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
> Same.


We have similar magic constants in here already.  Parameterizing is 
possible, but I'm more interested in making sure the numbers are 
appropriate for each processor.  Given that Kewen reports they work well 
for both P9 and P10, I'm pretty happy with what we have here.  (Kewen, 
thanks for running the P10 experiments!)

Perhaps a follow-up patch to add params for the magic constants would be 
reasonable, but I'd personally consider it pretty low priority.

>
>> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
> No parens around the last thing please.  The other pair of parens is
> unneeded as well, but perhaps it is easier to read like that.
>
>> +	  if (dump_enabled_p ())
>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>> +			     "Found %u loads and load pct. %u%% exceed "
>> +			     "the threshold, penalizing loop body "
>> +			     "cost by extra cost %u for ctor.\n",
>> +			     data->nloads, load_pct, data->extra_ctor_cost);
> That line does not fit.  Make it more lines?
>
> It is a pity that using these interfaces at all takes up 45 chars
> of noise already.
>
>> +/* Helper function for add_stmt_cost.  Check each statement cost
>> +   entry, gather information and update the target_cost fields
>> +   accordingly.  */
>> +static void
>> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>> +				    enum vect_cost_for_stmt kind,
>> +				    struct _stmt_vec_info *stmt_info,
>> +				    enum vect_cost_model_location where,
>> +				    int stmt_cost, unsigned int orig_count)
> Please put those last two on separate lines as well?
>
>> +	  /* As function rs6000_builtin_vectorization_cost shows, we have
>> +	     priced much on V16QI/V8HI vector construction as their units,
>> +	     if we penalize them with nunits * stmt_cost, it can result in
>> +	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>> +	     is 20 and nunits is 16, the extra cost is 320 which looks
>> +	     much exaggerated.  So let's use one maximum bound for the
>> +	     extra penalized cost for vector construction here.  */
>> +	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>> +	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>> +	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
> That is a pretty gross hack.  Can you think of any saner way to not have
> those out of scale costs in the first place?

In Kewen's defense, the whole business of "finish_cost" for these 
vectorized loops is to tweak things that don't work quite right with the 
hooks currently provided to the vectorizer to add costs on a per-stmt 
basis without looking at the overall set of statements.  It gives the 
back end a chance to massage things and exercise veto power over 
otherwise bad decisions.  By nature, that's going to be very much a 
heuristic exercise.  Personally I think the heuristics used here are 
pretty reasonable, and importantly they are designed to only be employed 
in pretty rare circumstances.  It doesn't look easy to me to avoid the 
need for a cap here without making the rest of the heuristics harder to 
understand.  But sure, he can try! :)

Kewen, thanks for the updates!

Bill

>
> Okay for trunk with such tweaks.  Thanks!  (And please consult with Bill
> for the wordsmithing :-) )
>
>
> Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4] rs6000: Add load density heuristic
  2021-09-09 17:19           ` Bill Schmidt
@ 2021-09-09 17:39             ` Bill Schmidt
  2021-09-09 18:24             ` Segher Boessenkool
  2021-09-10  3:22             ` Kewen.Lin
  2 siblings, 0 replies; 19+ messages in thread
From: Bill Schmidt @ 2021-09-09 17:39 UTC (permalink / raw)
  To: Segher Boessenkool, Kewen.Lin; +Cc: David Edelsohn, will schmidt, GCC Patches

On 9/9/21 12:19 PM, Bill Schmidt wrote:
> On 9/9/21 11:11 AM, Segher Boessenkool wrote:
>> Hi!
>>
>> On Wed, Sep 08, 2021 at 02:57:14PM +0800, Kewen.Lin wrote:
>>>>> +      /* If we have strided or elementwise loads into a vector, it's
>>>>> +	 possible to be bounded by latency and execution resources for
>>>>> +	 many scalar loads.  Try to account for this by scaling the
>>>>> +	 construction cost by the number of elements involved, when
>>>>> +	 handling each matching statement we record the possible extra
>>>>> +	 penalized cost into target cost, in the end of costing for
>>>>> +	 the whole loop, we do the actual penalization once some load
>>>>> +	 density heuristics are satisfied.  */
>>>> The above comment is quite hard to read.  Can you please break up the last
>>>> sentence into at least two sentences?
>>> How about the below:
>>>
>>> +      /* If we have strided or elementwise loads into a vector, it's
>> "strided" is not a word: it properly is "stridden", which does not read
>> very well either.  "Have loads by stride, or by element, ..."?  Is that
>> good English, and easier to understand?
> No, this is OK.  "Strided loads" is a term of art used by the
> vectorizer; whether or not it was the Queen's English, it's what we
> have...  (And I think you might only find "bestridden" in some 18th or
> 19th century English poetry... :-)
>>> +        possible to be bounded by latency and execution resources for
>>> +        many scalar loads.  Try to account for this by scaling the
>>> +        construction cost by the number of elements involved.  For
>>> +        each matching statement, we record the possible extra
>>> +        penalized cost into the relevant field in target cost.  When
>>> +        we want to finalize the whole loop costing, we will check if
>>> +        those related load density heuristics are satisfied, and add
>>> +        this accumulated penalized cost if yes.  */
>>>
>>>> Otherwise this looks good to me, and I recommend maintainers approve with
>>>> that clarified.
>> Does that text look good to you now Bill?  It is still kinda complex,
>> maybe you see a way to make it simpler.
> I think it's OK now.  The complexity at least matches the code now
> instead of exceeding it. :-P  j/k...

Well, let me not be lazy, and see whether I can help:

"Power processors do not currently have instructions for strided and 
elementwise loads, and instead we must generate multiple scalar loads.  
This leads to undercounting of the cost.  We account for this by scaling 
the construction cost by the number of elements involved, and saving 
this as extra cost that we may or may not need to apply.  When 
finalizing the cost of the loop, the extra penalty is applied when the 
load density heuristics are satisfied."

Something like that?

Bill



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4] rs6000: Add load density heuristic
  2021-09-09 17:19           ` Bill Schmidt
  2021-09-09 17:39             ` Bill Schmidt
@ 2021-09-09 18:24             ` Segher Boessenkool
  2021-09-10  3:22             ` Kewen.Lin
  2 siblings, 0 replies; 19+ messages in thread
From: Segher Boessenkool @ 2021-09-09 18:24 UTC (permalink / raw)
  To: Bill Schmidt; +Cc: Kewen.Lin, David Edelsohn, will schmidt, GCC Patches

Hi!

On Thu, Sep 09, 2021 at 12:19:28PM -0500, Bill Schmidt wrote:
> On 9/9/21 11:11 AM, Segher Boessenkool wrote:
> >On Wed, Sep 08, 2021 at 02:57:14PM +0800, Kewen.Lin wrote:
> >>+      /* If we have strided or elementwise loads into a vector, it's
> >"strided" is not a word: it properly is "stridden", which does not read
> >very well either.  "Have loads by stride, or by element, ..."?  Is that
> >good English, and easier to understand?
> 
> No, this is OK.  "Strided loads" is a term of art used by the 
> vectorizer; whether or not it was the Queen's English, it's what we 
> have...

"Strided" is not in any dictionary I could find.  I have no idea what
the island queen has to do with anything :-)

> (And I think you might only find "bestridden" in some 18th or 
> 19th century English poetry... :-)

"Bestride" is not used in the US apparently, but it has nothing to do
with this anyway.

In any case, "strided" is used a lot in the vectoriser already (and
nowhere else).  So, fine with me of course.

> >Does that text look good to you now Bill?  It is still kinda complex,
> >maybe you see a way to make it simpler.
> 
> I think it's OK now.  The complexity at least matches the code now 
> instead of exceeding it. :-P  j/k...

Ha :-)

Ideally the comments will clarify complex code.  But not making it worse
is at least something, okay :-)

> >>+  if (data->extra_ctor_cost > 0)
> >>+    {
> >>+      /* Threshold for load stmts percentage in all vectorized stmts.  */
> >>+      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
> >Threshold for what?
> >
> >45% is awfully exact.  Can you make this a param?
> >
> >>+      /* Threshold for total number of load stmts.  */
> >>+      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
> >Same.
> 
> We have similar magic constants in here already.

And that is problematic.  That is the problem.

> Parameterizing is 
> possible, but I'm more interested in making sure the numbers are 
> appropriate for each processor.

Of course.  So a) there shouldn't be any magic constants, the constants
should be meaningful.  That way, tunings will need less maintenance for
new hardware versions, but way more importantly: if anything else in the
compiler changes.  And b), having params makes it easier to do such
tuning, or experiments for it.

It is very easy to add params btw (just in rs6000.opt).

> Perhaps a follow-up patch to add params for the magic constants would be 
> reasonable, but I'd personally consider it pretty low priority.

It also is super easy to do.  The only hard part is making a name for
it, and if you cannot think of a good name, that says that it is not a
good abstraction in the first place -- magic isn't good.

Please do it.

> >>+	  /* As function rs6000_builtin_vectorization_cost shows, we have
> >>+	     priced much on V16QI/V8HI vector construction as their units,
> >>+	     if we penalize them with nunits * stmt_cost, it can result in
> >>+	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
> >>+	     is 20 and nunits is 16, the extra cost is 320 which looks
> >>+	     much exaggerated.  So let's use one maximum bound for the
> >>+	     extra penalized cost for vector construction here.  */
> >>+	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
> >>+	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
> >>+	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
> >That is a pretty gross hack.  Can you think of any saner way to not have
> >those out of scale costs in the first place?
> 
> In Kewen's defense, the whole business of "finish_cost" for these 
> vectorized loops is to tweak things that don't work quite right with the 
> hooks currently provided to the vectorizer to add costs on a per-stmt 
> basis without looking at the overall set of statements.  It gives the 
> back end a chance to massage things and exercise veto power over 
> otherwise bad decisions.  By nature, that's going to be very much a 
> heuristic exercise.  Personally I think the heuristics used here are 
> pretty reasonable, and importantly they are designed to only be employed 
> in pretty rare circumstances.  It doesn't look easy to me to avoid the 
> need for a cap here without making the rest of the heuristics harder to 
> understand.  But sure, he can try! :)

Yes.  But we need to avoid magic.  Magic does not scale; magic makes it
hard or impossible to do any future changes.

I did approve the patch already.  But weaknesses like this need
continuous attention, and that does not scale.


Segher

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4] rs6000: Add load density heuristic
  2021-09-09 17:19           ` Bill Schmidt
  2021-09-09 17:39             ` Bill Schmidt
  2021-09-09 18:24             ` Segher Boessenkool
@ 2021-09-10  3:22             ` Kewen.Lin
  2021-09-10  3:46               ` Kewen.Lin
  2 siblings, 1 reply; 19+ messages in thread
From: Kewen.Lin @ 2021-09-10  3:22 UTC (permalink / raw)
  To: wschmidt, Segher Boessenkool; +Cc: David Edelsohn, will schmidt, GCC Patches

Hi Segher and Bill,

Thanks a lot for your reviews and helps!

on 2021/9/10 上午1:19, Bill Schmidt wrote:
> On 9/9/21 11:11 AM, Segher Boessenkool wrote:
>> Hi!
>>
>> On Wed, Sep 08, 2021 at 02:57:14PM +0800, Kewen.Lin wrote:
>>>>> +      /* If we have strided or elementwise loads into a vector, it's
>>>>> +     possible to be bounded by latency and execution resources for
>>>>> +     many scalar loads.  Try to account for this by scaling the
>>>>> +     construction cost by the number of elements involved, when
>>>>> +     handling each matching statement we record the possible extra
>>>>> +     penalized cost into target cost, in the end of costing for
>>>>> +     the whole loop, we do the actual penalization once some load
>>>>> +     density heuristics are satisfied.  */
>>>> The above comment is quite hard to read.  Can you please break up the last
>>>> sentence into at least two sentences?
>>> How about the below:
>>>
>>> +      /* If we have strided or elementwise loads into a vector, it's
>> "strided" is not a word: it properly is "stridden", which does not read
>> very well either.  "Have loads by stride, or by element, ..."?  Is that
>> good English, and easier to understand?
> 
> No, this is OK.  "Strided loads" is a term of art used by the vectorizer; whether or not it was the Queen's English, it's what we have...  (And I think you might only find "bestridden" in some 18th or 19th century English poetry... :-)
>>
>>> +        possible to be bounded by latency and execution resources for
>>> +        many scalar loads.  Try to account for this by scaling the
>>> +        construction cost by the number of elements involved.  For
>>> +        each matching statement, we record the possible extra
>>> +        penalized cost into the relevant field in target cost.  When
>>> +        we want to finalize the whole loop costing, we will check if
>>> +        those related load density heuristics are satisfied, and add
>>> +        this accumulated penalized cost if yes.  */
>>>
>>>> Otherwise this looks good to me, and I recommend maintainers approve with
>>>> that clarified.
>> Does that text look good to you now Bill?  It is still kinda complex,
>> maybe you see a way to make it simpler.
> 
> I think it's OK now.  The complexity at least matches the code now instead of exceeding it. :-P  j/k...
> 

Just noticed Bill helped to revise it, I will use that nice paragraph.
(Thanks, Bill!)

>>
>>>     * config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>>>     nstmts, nloads and extra_ctor_cost.
>>>     (rs6000_density_test): Add load density related heuristics and the
>>>     checks, do extra costing on vector construction statements if need.
>> "and the checks"?  Oh, "and checks"?  It is probably fine to just leave
>> out this whole phrase part :-)
>>
>> Don't use commas like this in changelogs.  s/, do/.  Do/  Yes this is a
>> bit boring text that way, but that is the purpose: it makes it simpler
>> to read (and read quickly, even merely scan).
>>

Thanks for the explanation, will fix.

>>> @@ -5262,6 +5262,12 @@ typedef struct _rs6000_cost_data
>> [ Btw, you can get rid of the typedef now, just have a struct with the
>> non-underscore name, we have C++ now.  Such a mechanical change (as
>> separate patch!) is pre-approved. ]
>>
>>> +  /* Check if we need to penalize the body cost for latency and
>>> +     execution resources bound from strided or elementwise loads
>>> +     into a vector.  */
>> Bill, is that clear enough?  I'm sure something nicer would help here,
>> but it's hard for me to write anything :-)
> 
> Perhaps:  "Check whether we need to penalize the body cost to account for excess strided or elementwise loads."

Thanks, will update.

>>
>>> +  if (data->extra_ctor_cost > 0)
>>> +    {
>>> +      /* Threshold for load stmts percentage in all vectorized stmts.  */
>>> +      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
>> Threshold for what?
>>
>> 45% is awfully exact.  Can you make this a param?
>>
>>> +      /* Threshold for total number of load stmts.  */
>>> +      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
>> Same.
> 
> 
> We have similar magic constants in here already.  Parameterizing is possible, but I'm more interested in making sure the numbers are appropriate for each processor.  Given that Kewen reports they work well for both P9 and P10, I'm pretty happy with what we have here.  (Kewen, thanks for running the P10 experiments!)
> 
> Perhaps a follow-up patch to add params for the magic constants would be reasonable, but I'd personally consider it pretty low priority.
> 

As Bill mentioned, there are some other thresholds which can be parameterized together.
Segher, if you don't mind, I will parameterize all of them in a follow up patch.  :)

>>
>>> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
>> No parens around the last thing please.  The other pair of parens is
>> unneeded as well, but perhaps it is easier to read like that.
>>

Will fix.

>>> +      if (dump_enabled_p ())
>>> +        dump_printf_loc (MSG_NOTE, vect_location,
>>> +                 "Found %u loads and load pct. %u%% exceed "
>>> +                 "the threshold, penalizing loop body "
>>> +                 "cost by extra cost %u for ctor.\n",
>>> +                 data->nloads, load_pct, data->extra_ctor_cost);
>> That line does not fit.  Make it more lines?
>>

Will fix.

>> It is a pity that using these interfaces at all takes up 45 chars
>> of noise already.
>>
>>> +/* Helper function for add_stmt_cost.  Check each statement cost
>>> +   entry, gather information and update the target_cost fields
>>> +   accordingly.  */
>>> +static void
>>> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>> +                    enum vect_cost_for_stmt kind,
>>> +                    struct _stmt_vec_info *stmt_info,
>>> +                    enum vect_cost_model_location where,
>>> +                    int stmt_cost, unsigned int orig_count)
>> Please put those last two on separate lines as well?
>>

Will fix.

>>> +      /* As function rs6000_builtin_vectorization_cost shows, we have
>>> +         priced much on V16QI/V8HI vector construction as their units,
>>> +         if we penalize them with nunits * stmt_cost, it can result in
>>> +         an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>>> +         is 20 and nunits is 16, the extra cost is 320 which looks
>>> +         much exaggerated.  So let's use one maximum bound for the
>>> +         extra penalized cost for vector construction here.  */
>>> +      const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>>> +      if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>>> +        extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> That is a pretty gross hack.  Can you think of any saner way to not have
>> those out of scale costs in the first place?
> 
> In Kewen's defense, the whole business of "finish_cost" for these vectorized loops is to tweak things that don't work quite right with the hooks currently provided to the vectorizer to add costs on a per-stmt basis without looking at the overall set of statements.  It gives the back end a chance to massage things and exercise veto power over otherwise bad decisions.  By nature, that's going to be very much a heuristic exercise.  Personally I think the heuristics used here are pretty reasonable, and importantly they are designed to only be employed in pretty rare circumstances.  It doesn't look easy to me to avoid the need for a cap here without making the rest of the heuristics harder to understand.  But sure, he can try! :)
> 

Thanks for the explanation, Bill!  I agree the current hacking isn't good, one alternative to avoid this bound check
seems not to use nunits * stmt_cost, which was referred from the existing practice for the similar case.  Instead,
we can use log2(nunits) and then V16QI => 4, V8HI => 3 ..., the cost will not be exaggerated much as before.
Not sure if it can work expectedly, I'll re-evaluate this idea on Power8/9/10 at Ofast unroll and O2 vect (at least)
and see if it works well.
>>
>> Okay for trunk with such tweaks.  Thanks!  (And please consult with Bill
>> for the wordsmithing :-) )
>>

Thanks again.

BR,
Kewen

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4] rs6000: Add load density heuristic
  2021-09-10  3:22             ` Kewen.Lin
@ 2021-09-10  3:46               ` Kewen.Lin
  0 siblings, 0 replies; 19+ messages in thread
From: Kewen.Lin @ 2021-09-10  3:46 UTC (permalink / raw)
  To: wschmidt, Segher Boessenkool; +Cc: GCC Patches, David Edelsohn

on 2021/9/10 上午11:22, Kewen.Lin via Gcc-patches wrote:
> Hi Segher and Bill,
> 
> Thanks a lot for your reviews and helps!
> 
> on 2021/9/10 上午1:19, Bill Schmidt wrote:
>> On 9/9/21 11:11 AM, Segher Boessenkool wrote:
>>> Hi!
>>>
>>> On Wed, Sep 08, 2021 at 02:57:14PM +0800, Kewen.Lin wrote:
>>>>>> +      /* If we have strided or elementwise loads into a vector, it's
>>>>>> +     possible to be bounded by latency and execution resources for
>>>>>> +     many scalar loads.  Try to account for this by scaling the
>>>>>> +     construction cost by the number of elements involved, when
>>>>>> +     handling each matching statement we record the possible extra
>>>>>> +     penalized cost into target cost, in the end of costing for
>>>>>> +     the whole loop, we do the actual penalization once some load
>>>>>> +     density heuristics are satisfied.  */
>>>>> The above comment is quite hard to read.  Can you please break up the last
>>>>> sentence into at least two sentences?
>>>> How about the below:
>>>>
>>>> +      /* If we have strided or elementwise loads into a vector, it's
>>> "strided" is not a word: it properly is "stridden", which does not read
>>> very well either.  "Have loads by stride, or by element, ..."?  Is that
>>> good English, and easier to understand?
>>
>> No, this is OK.  "Strided loads" is a term of art used by the vectorizer; whether or not it was the Queen's English, it's what we have...  (And I think you might only find "bestridden" in some 18th or 19th century English poetry... :-)
>>>
>>>> +        possible to be bounded by latency and execution resources for
>>>> +        many scalar loads.  Try to account for this by scaling the
>>>> +        construction cost by the number of elements involved.  For
>>>> +        each matching statement, we record the possible extra
>>>> +        penalized cost into the relevant field in target cost.  When
>>>> +        we want to finalize the whole loop costing, we will check if
>>>> +        those related load density heuristics are satisfied, and add
>>>> +        this accumulated penalized cost if yes.  */
>>>>
>>>>> Otherwise this looks good to me, and I recommend maintainers approve with
>>>>> that clarified.
>>> Does that text look good to you now Bill?  It is still kinda complex,
>>> maybe you see a way to make it simpler.
>>
>> I think it's OK now.  The complexity at least matches the code now instead of exceeding it. :-P  j/k...
>>
> 
> Just noticed Bill helped to revise it, I will use that nice paragraph.
> (Thanks, Bill!)
> 
>>>
>>>>     * config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>>>>     nstmts, nloads and extra_ctor_cost.
>>>>     (rs6000_density_test): Add load density related heuristics and the
>>>>     checks, do extra costing on vector construction statements if need.
>>> "and the checks"?  Oh, "and checks"?  It is probably fine to just leave
>>> out this whole phrase part :-)
>>>
>>> Don't use commas like this in changelogs.  s/, do/.  Do/  Yes this is a
>>> bit boring text that way, but that is the purpose: it makes it simpler
>>> to read (and read quickly, even merely scan).
>>>
> 
> Thanks for the explanation, will fix.
> 
>>>> @@ -5262,6 +5262,12 @@ typedef struct _rs6000_cost_data
>>> [ Btw, you can get rid of the typedef now, just have a struct with the
>>> non-underscore name, we have C++ now.  Such a mechanical change (as
>>> separate patch!) is pre-approved. ]
>>>
>>>> +  /* Check if we need to penalize the body cost for latency and
>>>> +     execution resources bound from strided or elementwise loads
>>>> +     into a vector.  */
>>> Bill, is that clear enough?  I'm sure something nicer would help here,
>>> but it's hard for me to write anything :-)
>>
>> Perhaps:  "Check whether we need to penalize the body cost to account for excess strided or elementwise loads."
> 
> Thanks, will update.
> 
>>>
>>>> +  if (data->extra_ctor_cost > 0)
>>>> +    {
>>>> +      /* Threshold for load stmts percentage in all vectorized stmts.  */
>>>> +      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
>>> Threshold for what?
>>>
>>> 45% is awfully exact.  Can you make this a param?
>>>
>>>> +      /* Threshold for total number of load stmts.  */
>>>> +      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
>>> Same.
>>
>>
>> We have similar magic constants in here already.  Parameterizing is possible, but I'm more interested in making sure the numbers are appropriate for each processor.  Given that Kewen reports they work well for both P9 and P10, I'm pretty happy with what we have here.  (Kewen, thanks for running the P10 experiments!)
>>
>> Perhaps a follow-up patch to add params for the magic constants would be reasonable, but I'd personally consider it pretty low priority.
>>
> 
> As Bill mentioned, there are some other thresholds which can be parameterized together.
> Segher, if you don't mind, I will parameterize all of them in a follow up patch.  :)
> 
>>>
>>>> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
>>> No parens around the last thing please.  The other pair of parens is
>>> unneeded as well, but perhaps it is easier to read like that.
>>>
> 
> Will fix.
> 
>>>> +      if (dump_enabled_p ())
>>>> +        dump_printf_loc (MSG_NOTE, vect_location,
>>>> +                 "Found %u loads and load pct. %u%% exceed "
>>>> +                 "the threshold, penalizing loop body "
>>>> +                 "cost by extra cost %u for ctor.\n",
>>>> +                 data->nloads, load_pct, data->extra_ctor_cost);
>>> That line does not fit.  Make it more lines?
>>>
> 
> Will fix.
> 
>>> It is a pity that using these interfaces at all takes up 45 chars
>>> of noise already.
>>>
>>>> +/* Helper function for add_stmt_cost.  Check each statement cost
>>>> +   entry, gather information and update the target_cost fields
>>>> +   accordingly.  */
>>>> +static void
>>>> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>>>> +                    enum vect_cost_for_stmt kind,
>>>> +                    struct _stmt_vec_info *stmt_info,
>>>> +                    enum vect_cost_model_location where,
>>>> +                    int stmt_cost, unsigned int orig_count)
>>> Please put those last two on separate lines as well?
>>>
> 
> Will fix.
> 
>>>> +      /* As function rs6000_builtin_vectorization_cost shows, we have
>>>> +         priced much on V16QI/V8HI vector construction as their units,
>>>> +         if we penalize them with nunits * stmt_cost, it can result in
>>>> +         an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>>>> +         is 20 and nunits is 16, the extra cost is 320 which looks
>>>> +         much exaggerated.  So let's use one maximum bound for the
>>>> +         extra penalized cost for vector construction here.  */
>>>> +      const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>>>> +      if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>>>> +        extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>>> That is a pretty gross hack.  Can you think of any saner way to not have
>>> those out of scale costs in the first place?
>>
>> In Kewen's defense, the whole business of "finish_cost" for these vectorized loops is to tweak things that don't work quite right with the hooks currently provided to the vectorizer to add costs on a per-stmt basis without looking at the overall set of statements.  It gives the back end a chance to massage things and exercise veto power over otherwise bad decisions.  By nature, that's going to be very much a heuristic exercise.  Personally I think the heuristics used here are pretty reasonable, and importantly they are designed to only be employed in pretty rare circumstances.  It doesn't look easy to me to avoid the need for a cap here without making the rest of the heuristics harder to understand.  But sure, he can try! :)
>>
> 
> Thanks for the explanation, Bill!  I agree the current hacking isn't good, one alternative to avoid this bound check
> seems not to use nunits * stmt_cost, which was referred from the existing practice for the similar case.  Instead,
> we can use log2(nunits) and then V16QI => 4, V8HI => 3 ..., the cost will not be exaggerated much as before.
> Not sure if it can work expectedly, I'll re-evaluate this idea on Power8/9/10 at Ofast unroll and O2 vect (at least)
> and see if it works well.

Just realized I considered log2(nunits) to replace units before, but gave up then since it doesn't reflect the reality
that there are actually nunits scalar loads.  So the trying will shift to tweak the used cost (currently it's stmt_cost)
according to log2(nunits), it seems to be a more reasonable direction to try.

BR,
Kewen

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-09-10  3:47 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-07  2:29 [PATCH] rs6000: Adjust rs6000_density_test for strided_load Kewen.Lin
2021-05-26  2:59 ` [PATCH v2] rs6000: Add load density heuristic Kewen.Lin
2021-06-09  2:26   ` PING^1 " Kewen.Lin
2021-06-28  7:01     ` PING^2 " Kewen.Lin
2021-07-15  1:59       ` PING^3 " Kewen.Lin
2021-07-27 22:25   ` will schmidt
2021-07-28  2:59     ` Kewen.Lin
2021-09-06 23:43       ` Segher Boessenkool
2021-09-08  7:01         ` Kewen.Lin
2021-07-28  5:22   ` [PATCH v3] " Kewen.Lin
2021-09-03 15:57     ` Bill Schmidt
2021-09-08  6:57       ` [PATCH v4] " Kewen.Lin
2021-09-08  8:28         ` Kewen.Lin
2021-09-09 16:11         ` Segher Boessenkool
2021-09-09 17:19           ` Bill Schmidt
2021-09-09 17:39             ` Bill Schmidt
2021-09-09 18:24             ` Segher Boessenkool
2021-09-10  3:22             ` Kewen.Lin
2021-09-10  3:46               ` Kewen.Lin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).