From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linkw@sourceware.org>
Received: by sourceware.org (Postfix, from userid 2063)
 id DAC543858023; Tue, 30 Nov 2021 04:51:31 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org DAC543858023
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="utf-8"
From: Kewen Lin <linkw@gcc.gnu.org>
To: gcc-cvs@gcc.gnu.org
Subject: [gcc r12-5589] rs6000: Modify the way for extra penalized cost
X-Act-Checkin: gcc
X-Git-Author: Kewen Lin <linkw@linux.ibm.com>
X-Git-Refname: refs/heads/master
X-Git-Oldrev: bcb163eee8c290a1c023f89b401ba7406dcac605
X-Git-Newrev: aca68829d723a11f73b84d59401568015959f432
Message-Id: <20211130045131.DAC543858023@sourceware.org>
Date: Tue, 30 Nov 2021 04:51:31 +0000 (GMT)

https://gcc.gnu.org/g:aca68829d723a11f73b84d59401568015959f432

commit r12-5589-gaca68829d723a11f73b84d59401568015959f432
Author: Kewen Lin <linkw@linux.ibm.com>
Date:   Mon Nov 29 21:22:27 2021 -0600

    rs6000: Modify the way for extra penalized cost
    
    This patch follows the discussions here[1][2], where Segher
    pointed out the existing way to guard the extra penalized
    cost for strided/elementwise loads with a magic bound does
    not scale.
    
    Computing the penalty as nunits * stmt_cost can yield a much
    exaggerated cost: for V16QI on P8 it is 16 * 20 = 320, which
    is why a bound was needed.  To make this better and more
    readable, the penalized cost is simplified to:
    
        unsigned adjusted_cost = (nunits == 2) ? 2 : 1;
        unsigned extra_cost = nunits * adjusted_cost;
    
    For V2DI/V2DF, it uses a penalized cost of 2 for each scalar
    load, while for the other modes it uses 1.  This is mainly
    concluded from the performance evaluations.  One possibly
    related point: constructing a vector with more units uses
    more instructions, which gives more chances to schedule them
    well (or even run them in parallel when enough units are
    available at that time), so it seems reasonable not to
    penalize them more.
    
    The SPEC2017 evaluations on Power8/Power9/Power10 at option
    sets O2-vect and Ofast-unroll show this change is neutral.
    
    [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
    [2] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580099.html
    
    gcc/ChangeLog:
    
            * config/rs6000/rs6000.c
            (rs6000_cost_data::update_target_cost_per_stmt): Adjust the way to
            compute extra penalized cost.  Remove useless parameter.
            (rs6000_cost_data::add_stmt_cost): Adjust the call to function
            update_target_cost_per_stmt.
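
For readers comparing the two heuristics, here is a minimal standalone
sketch, not code from rs6000.c.  The helper names old_extra_cost and
new_extra_cost are hypothetical; the bound of 12 and the 2-vs-1
per-load penalty are taken from the diff below.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper mirroring the removed logic: nunits * stmt_cost,
// clamped to the fixed bound (MAX_PENALIZED_COST_FOR_CTOR, 12).
unsigned int old_extra_cost(unsigned int nunits, unsigned int stmt_cost) {
    const unsigned int max_penalized_cost_for_ctor = 12;
    return std::min(nunits * stmt_cost, max_penalized_cost_for_ctor);
}

// Hypothetical helper mirroring the new logic: a per-scalar-load
// penalty of 2 when the vector has 2 units, and 1 otherwise.
unsigned int new_extra_cost(unsigned int nunits) {
    // Strided/elementwise loads are not expected for a single unit.
    assert(nunits > 1);
    unsigned int adjusted_cost = (nunits == 2) ? 2 : 1;
    return nunits * adjusted_cost;
}
```

With these helpers, new_extra_cost gives 4 for 2 units (V2DI/V2DF),
4 for 4 units, 8 for 8 units and 16 for 16 units (V16QI), while
old_extra_cost(16, 20) gives 12, the clamped form of the 320 from the
V16QI example above.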

Diff:
---
 gcc/config/rs6000/rs6000.c | 35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index e4843eb0f1c..289c1b3df24 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5272,8 +5272,7 @@ public:
 
 protected:
   void update_target_cost_per_stmt (vect_cost_for_stmt, stmt_vec_info,
-				    vect_cost_model_location, int,
-				    unsigned int);
+				    vect_cost_model_location, unsigned int);
   void density_test (loop_vec_info);
   void adjust_vect_cost_per_loop (loop_vec_info);
 
@@ -5414,7 +5413,6 @@ void
 rs6000_cost_data::update_target_cost_per_stmt (vect_cost_for_stmt kind,
 					       stmt_vec_info stmt_info,
 					       vect_cost_model_location where,
-					       int stmt_cost,
 					       unsigned int orig_count)
 {
 
@@ -5456,17 +5454,23 @@ rs6000_cost_data::update_target_cost_per_stmt (vect_cost_for_stmt kind,
 	{
 	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
 	  unsigned int nunits = vect_nunits_for_cost (vectype);
-	  unsigned int extra_cost = nunits * stmt_cost;
-	  /* As function rs6000_builtin_vectorization_cost shows, we have
-	     priced much on V16QI/V8HI vector construction as their units,
-	     if we penalize them with nunits * stmt_cost, it can result in
-	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
-	     is 20 and nunits is 16, the extra cost is 320 which looks
-	     much exaggerated.  So let's use one maximum bound for the
-	     extra penalized cost for vector construction here.  */
-	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
-	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
-	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
+	  /* We don't expect strided/elementwise loads for just 1 nunit.  */
+	  gcc_assert (nunits > 1);
+	  /* i386 port adopts nunits * stmt_cost as the penalized cost
+	     for this kind of penalization, we used to follow it but
+	     found it could result in an unreliable body cost especially
+	     for V16QI/V8HI modes.  To make it better, we choose this
+	     new heuristic: for each scalar load, we use 2 as penalized
+	     cost for the case with 2 nunits and use 1 for the other
+	     cases.  It's without much supporting theory, mainly
+	     concluded from the broad performance evaluations on Power8,
+	     Power9 and Power10.  One possibly related point is that:
+	     vector construction for more units would use more insns,
+	     it has more chances to schedule them better (even run in
+	     parallelly when enough available units at that time), so
+	     it seems reasonable not to penalize that much for them.  */
+	  unsigned int adjusted_cost = (nunits == 2) ? 2 : 1;
+	  unsigned int extra_cost = nunits * adjusted_cost;
 	  m_extra_ctor_cost += extra_cost;
 	}
     }
@@ -5491,8 +5495,7 @@ rs6000_cost_data::add_stmt_cost (int count, vect_cost_for_stmt kind,
       retval = adjust_cost_for_freq (stmt_info, where, count * stmt_cost);
       m_costs[where] += retval;
 
-      update_target_cost_per_stmt (kind, stmt_info, where,
-				   stmt_cost, orig_count);
+      update_target_cost_per_stmt (kind, stmt_info, where, orig_count);
     }
 
   return retval;