From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linkw@sourceware.org>
Received: by sourceware.org (Postfix, from userid 2063)
 id 6B4EC3857C6C; Mon, 13 Sep 2021 06:30:12 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 6B4EC3857C6C
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="utf-8"
From: Kewen Lin <linkw@gcc.gnu.org>
To: gcc-cvs@gcc.gnu.org
Subject: [gcc r12-3469] rs6000: Add load density heuristic
X-Act-Checkin: gcc
X-Git-Author: Kewen Lin <linkw@linux.ibm.com>
X-Git-Refname: refs/heads/master
X-Git-Oldrev: b70e2541fec8751d287b24b9e3681a41e420d4c4
X-Git-Newrev: fbeead55e03711fb39429ab564a0d8a0b43e9a74
Message-Id: <20210913063012.6B4EC3857C6C@sourceware.org>
Date: Mon, 13 Sep 2021 06:30:12 +0000 (GMT)
X-BeenThere: gcc-cvs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-cvs mailing list <gcc-cvs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-cvs>,
 <mailto:gcc-cvs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-cvs/>
List-Help: <mailto:gcc-cvs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-cvs>,
 <mailto:gcc-cvs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 13 Sep 2021 06:30:12 -0000

https://gcc.gnu.org/g:fbeead55e03711fb39429ab564a0d8a0b43e9a74

commit r12-3469-gfbeead55e03711fb39429ab564a0d8a0b43e9a74
Author: Kewen Lin <linkw@linux.ibm.com>
Date:   Sun Sep 12 21:57:34 2021 -0500

    rs6000: Add load density heuristic
    
    We noticed that SPEC2017 503.bwaves_r run time degrades by
    about 8% on P8 and P9 if we enabled vectorization at O2
    fast-math (with cheap vect cost model).  Comparing to Ofast,
    compiler doesn't do the loop interchange on the innermost
    loop, it's not profitable to vectorize it then.
    
    As Richi's comments [1], this follows the similar idea to
    over price the vector construction fed by VMAT_ELEMENTWISE
    or VMAT_STRIDED_SLP.  Instead of adding the extra cost on
    vector construction costing immediately, it firstly records
    how many loads and vectorized statements in the given loop,
    later in rs6000_density_test (called by finish_cost) it
    computes the load density ratio against all vectorized
    statements, and check with the corresponding thresholds
    DENSITY_LOAD_NUM_THRESHOLD and DENSITY_LOAD_PCT_THRESHOLD,
    do the actual extra pricing if both thresholds are exceeded.
    
    Note that this new load density heuristic check is based on
    some fields in target cost which are updated as needed when
    scanning each add_stmt_cost entry, it's independent of the
    current function rs6000_density_test which requires to scan
    non_vect stmts.  Since it's checking the load stmts count
    vs. all vectorized stmts, it's kind of density, so I put
    it in function rs6000_density_test.  With the same reason to
    keep it independent, I didn't put it as an else arm of the
    current existing density threshold check hunk or before this
    hunk.
    
    In the investigation of -1.04% degradation from 526.blender_r
    on Power8, I noticed that the extra penalized cost 320 on one
    single vector construction for mode V16QI is much exaggerated,
    which makes the final body cost unreliable, so this patch adds
    one maximum bound for the extra penalized cost for each vector
    construction statement.
    
    Full SPEC2017 performance evaluation on Power8/Power9 with
    option combinations:
      * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap}
        {,-ffast-math}
      * {-O3, -Ofast} {,-funroll-loops}
    bwaves_r degradations on P8/P9 have been fixed, nothing else
    remarkable was observed.  Power10 -Ofast -funroll-loops run
    shows it's neutral, while -O2 -ftree-vectorize run shows the
    bwaves_r degradation is fixed expectedly.
    
    [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
    
    gcc/ChangeLog:
    
            * config/rs6000/rs6000.c (struct rs6000_cost_data): New members
            nstmts, nloads and extra_ctor_cost.
            (rs6000_density_test): Add load density related heuristics.  Do
            extra costing on vector construction statements if need.
            (rs6000_init_cost): Init new members.
            (rs6000_update_target_cost_per_stmt): New function.
            (rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
            rs6000_update_target_cost_per_stmt and call it.

Diff:
---
 gcc/config/rs6000/rs6000.c | 125 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 116 insertions(+), 9 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 39d428db8e6..25709375635 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5266,6 +5266,12 @@ struct rs6000_cost_data
 {
   struct loop *loop_info;
   unsigned cost[3];
+  /* Total number of vectorized stmts (loop only).  */
+  unsigned nstmts;
+  /* Total number of loads (loop only).  */
+  unsigned nloads;
+  /* Possible extra penalized cost on vector construction (loop only).  */
+  unsigned extra_ctor_cost;
   /* For each vectorized loop, this var holds TRUE iff a non-memory vector
      instruction is needed by the vectorization.  */
   bool vect_nonmem;
@@ -5327,9 +5333,48 @@ rs6000_density_test (rs6000_cost_data *data)
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "density %d%%, cost %d exceeds threshold, penalizing "
-			 "loop body cost by %d%%", density_pct,
+			 "loop body cost by %d%%\n", density_pct,
 			 vec_cost + not_vec_cost, DENSITY_PENALTY);
     }
+
+  /* Check whether we need to penalize the body cost to account
+     for excess strided or elementwise loads.  */
+  if (data->extra_ctor_cost > 0)
+    {
+      /* Threshold for load stmts percentage in all vectorized stmts.  */
+      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
+      /* Threshold for total number of load stmts.  */
+      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
+
+      gcc_assert (data->nloads <= data->nstmts);
+      unsigned int load_pct = (data->nloads * 100) / data->nstmts;
+
+      /* It's likely to be bounded by latency and execution resources
+	 from many scalar loads which are strided or elementwise loads
+	 into a vector if both conditions below are found:
+	   1. there are many loads, it's easy to result in a long wait
+	      for load units;
+	   2. load has a big proportion of all vectorized statements,
+	      it's not easy to schedule other statements to spread among
+	      the loads.
+	 One typical case is the innermost loop of the hotspot of SPEC2017
+	 503.bwaves_r without loop interchange.  */
+      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
+	  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
+	{
+	  data->cost[vect_body] += data->extra_ctor_cost;
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Found %u loads and "
+			     "load pct. %u%% exceed "
+			     "the threshold, "
+			     "penalizing loop body "
+			     "cost by extra cost %u "
+			     "for ctor.\n",
+			     data->nloads, load_pct,
+			     data->extra_ctor_cost);
+	}
+    }
 }
 
 /* Implement targetm.vectorize.init_cost.  */
@@ -5343,6 +5388,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
   data->cost[vect_body]     = 0;
   data->cost[vect_epilogue] = 0;
   data->vect_nonmem = false;
+  data->nstmts = 0;
+  data->nloads = 0;
+  data->extra_ctor_cost = 0;
   data->costing_for_scalar = costing_for_scalar;
   return data;
 }
@@ -5370,6 +5418,70 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
   return 0;
 }
 
+/* Helper function for add_stmt_cost.  Check each statement cost
+   entry, gather information and update the target_cost fields
+   accordingly.  */
+static void
+rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
+				    enum vect_cost_for_stmt kind,
+				    struct _stmt_vec_info *stmt_info,
+				    enum vect_cost_model_location where,
+				    int stmt_cost,
+				    unsigned int orig_count)
+{
+
+  /* Check whether we're doing something other than just a copy loop.
+     Not all such loops may be profitably vectorized; see
+     rs6000_finish_cost.  */
+  if (kind == vec_to_scalar
+      || kind == vec_perm
+      || kind == vec_promote_demote
+      || kind == vec_construct
+      || kind == scalar_to_vec
+      || (where == vect_body && kind == vector_stmt))
+    data->vect_nonmem = true;
+
+  /* Gather some information when we are costing the vectorized instruction
+     for the statements located in a loop body.  */
+  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
+    {
+      data->nstmts += orig_count;
+
+      if (kind == scalar_load || kind == vector_load
+	  || kind == unaligned_load || kind == vector_gather_load)
+	data->nloads += orig_count;
+
+      /* Power processors do not currently have instructions for strided
+	 and elementwise loads, and instead we must generate multiple
+	 scalar loads.  This leads to undercounting of the cost.  We
+	 account for this by scaling the construction cost by the number
+	 of elements involved, and saving this as extra cost that we may
+	 or may not need to apply.  When finalizing the cost of the loop,
+	 the extra penalty is applied when the load density heuristics
+	 are satisfied.  */
+      if (kind == vec_construct && stmt_info
+	  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
+	  && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
+	      || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
+	{
+	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+	  unsigned int nunits = vect_nunits_for_cost (vectype);
+	  unsigned int extra_cost = nunits * stmt_cost;
+	  /* As function rs6000_builtin_vectorization_cost shows, we have
+	     priced much on V16QI/V8HI vector construction as their units,
+	     if we penalize them with nunits * stmt_cost, it can result in
+	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
+	     is 20 and nunits is 16, the extra cost is 320 which looks
+	     much exaggerated.  So let's use one maximum bound for the
+	     extra penalized cost for vector construction here.  */
+	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
+	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
+	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
+	  data->extra_ctor_cost += extra_cost;
+	}
+    }
+}
+
 /* Implement targetm.vectorize.add_stmt_cost.  */
 
 static unsigned
@@ -5389,6 +5501,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       /* Statements in an inner loop relative to the loop being
 	 vectorized are weighted more heavily.  The value here is
 	 arbitrary and could potentially be improved with analysis.  */
+      unsigned int orig_count = count;
       if (where == vect_body && stmt_info
 	  && stmt_in_inner_loop_p (vinfo, stmt_info))
 	{
@@ -5400,14 +5513,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       retval = (unsigned) (count * stmt_cost);
       cost_data->cost[where] += retval;
 
-      /* Check whether we're doing something other than just a copy loop.
-	 Not all such loops may be profitably vectorized; see
-	 rs6000_finish_cost.  */
-      if ((kind == vec_to_scalar || kind == vec_perm
-	   || kind == vec_promote_demote || kind == vec_construct
-	   || kind == scalar_to_vec)
-	  || (where == vect_body && kind == vector_stmt))
-	cost_data->vect_nonmem = true;
+      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
+					  stmt_cost, orig_count);
     }
 
   return retval;