From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <wschmidt@linux.ibm.com>
Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com
 [148.163.158.5])
 by sourceware.org (Postfix) with ESMTPS id C3DAC384B13A
 for <gcc-patches@gcc.gnu.org>; Fri,  3 Sep 2021 15:57:46 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org C3DAC384B13A
Received: from pps.filterd (m0127361.ppops.net [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id
 183FXJwH084182; Fri, 3 Sep 2021 11:57:46 -0400
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0a-001b2d01.pphosted.com with ESMTP id 3aum6yvm01-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Fri, 03 Sep 2021 11:57:45 -0400
Received: from m0127361.ppops.net (m0127361.ppops.net [127.0.0.1])
 by pps.reinject (8.16.0.43/8.16.0.43) with SMTP id 183FXOdB084928;
 Fri, 3 Sep 2021 11:57:45 -0400
Received: from ppma02dal.us.ibm.com (a.bd.3ea9.ip4.static.sl-reverse.com
 [169.62.189.10])
 by mx0a-001b2d01.pphosted.com with ESMTP id 3aum6yvkyk-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Fri, 03 Sep 2021 11:57:45 -0400
Received: from pps.filterd (ppma02dal.us.ibm.com [127.0.0.1])
 by ppma02dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 183Fv6LS008969;
 Fri, 3 Sep 2021 15:57:44 GMT
Received: from b03cxnp08027.gho.boulder.ibm.com
 (b03cxnp08027.gho.boulder.ibm.com [9.17.130.19])
 by ppma02dal.us.ibm.com with ESMTP id 3au6pndxdk-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Fri, 03 Sep 2021 15:57:44 +0000
Received: from b03ledav003.gho.boulder.ibm.com
 (b03ledav003.gho.boulder.ibm.com [9.17.130.234])
 by b03cxnp08027.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 183FvhNM17957142
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Fri, 3 Sep 2021 15:57:43 GMT
Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 84F0F6A05F;
 Fri,  3 Sep 2021 15:57:43 +0000 (GMT)
Received: from b03ledav003.gho.boulder.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 395E26A063;
 Fri,  3 Sep 2021 15:57:43 +0000 (GMT)
Received: from Bills-MacBook-Pro.local (unknown [9.211.156.247])
 by b03ledav003.gho.boulder.ibm.com (Postfix) with ESMTP;
 Fri,  3 Sep 2021 15:57:43 +0000 (GMT)
Reply-To: wschmidt@linux.ibm.com
Subject: Re: [PATCH v3] rs6000: Add load density heuristic
To: "Kewen.Lin" <linkw@linux.ibm.com>, GCC Patches <gcc-patches@gcc.gnu.org>
Cc: Segher Boessenkool <segher@kernel.crashing.org>,
 David Edelsohn <dje.gcc@gmail.com>,
 will schmidt <will_schmidt@vnet.ibm.com>
References: <7b9f9bdf-1ed5-139b-de9c-511ee8454b85@linux.ibm.com>
 <3424a3d3-fa4e-16f9-89c6-0b07beec957d@linux.ibm.com>
 <df5512db-81bd-02e2-a45d-e6b276e07f1b@linux.ibm.com>
From: Bill Schmidt <wschmidt@linux.ibm.com>
Message-ID: <77fe5ac1-200f-db69-a92a-5d349642f394@linux.ibm.com>
Date: Fri, 3 Sep 2021 10:57:42 -0500
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0)
 Gecko/20100101 Thunderbird/78.13.0
In-Reply-To: <df5512db-81bd-02e2-a45d-e6b276e07f1b@linux.ibm.com>
Content-Language: en-GB
X-TM-AS-GCONF: 00
X-Proofpoint-GUID: ytFGA_zu1_juc3k39Blh5edL1yyiAvBS
X-Proofpoint-ORIG-GUID: KJrMFxFOQ_NwU_NG-WMbDxtw2a5pYxt6
X-Proofpoint-UnRewURL: 0 URL was un-rewritten
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.790
 definitions=2021-09-03_05:2021-09-03,
 2021-09-03 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 clxscore=1015 phishscore=0
 mlxlogscore=999 suspectscore=0 priorityscore=1501 impostorscore=0
 mlxscore=0 malwarescore=0 adultscore=0 spamscore=0 bulkscore=0
 lowpriorityscore=0 classifier=spam adjust=0 reason=mlx scancount=1
 engine=8.12.0-2108310000 definitions=main-2109030095
X-Spam-Status: No, score=-13.4 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_EF, GIT_PATCH_0, HTML_MESSAGE, KAM_SHORT, NICE_REPLY_A,
 RCVD_IN_MSPIKE_H4, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.29
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Fri, 03 Sep 2021 15:57:49 -0000

Hi Kewen,

Sorry that we lost track of this patch!  The heuristic approach looks 
good.  It is limited in scope and won't kick in often, and the case 
you're trying to account for is important.

At the time you submitted this, I think reliable P10 testing wasn't 
possible.  Now that it is, could you please do a quick sniff test to 
make sure there aren't any adjustments that need to be made for P10?  I 
doubt it, but worth checking.

Otherwise I have one comment below...

On 7/28/21 12:22 AM, Kewen.Lin wrote:
> Hi,
>
> v2: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
>
> This v3 addressed William's review comments in
> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/576154.html
>
> It's mainly to deal with the bwaves_r degradation due to vector
> construction fed by strided loads.
>
> As Richi's comments [1], this follows the similar idea to over
> price the vector construction fed by VMAT_ELEMENTWISE or
> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
> construction costing immediately, it firstly records how many
> loads and vectorized statements in the given loop, later in
> rs6000_density_test (called by finish_cost) it computes the
> load density ratio against all vectorized stmts, and check
> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
> if both thresholds are exceeded.
>
> Note that this new load density heuristic check is based on
> some fields in target cost which are updated as needed when
> scanning each add_stmt_cost entry, it's independent of the
> current function rs6000_density_test which requires to scan
> non_vect stmts.  Since it's checking the load stmts count
> vs. all vectorized stmts, it's kind of density, so I put
> it in function rs6000_density_test.  With the same reason to
> keep it independent, I didn't put it as an else arm of the
> current existing density threshold check hunk or before this
> hunk.
>
> In the investigation of -1.04% degradation from 526.blender_r
> on Power8, I noticed that the extra penalized cost 320 on one
> single vector construction with type V16QI is much exaggerated,
> which makes the final body cost unreliable, so this patch adds
> one maximum bound for the extra penalized cost for each vector
> construction statement.
>
> Bootstrapped & regtested *again* on powerpc64le-linux-gnu P9.
>
> Full SPEC2017 performance evaluation on Power8/Power9 with
> option combinations (with v2, as v3 is NFC against v2):
>    * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>    * {-O3, -Ofast} {,-funroll-loops}
>
> bwaves_r degradations on P8/P9 have been fixed, nothing else
> remarkable was observed.
>
> Is it ok for trunk?
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
> 	nstmts, nloads and extra_ctor_cost.
> 	(rs6000_density_test): Add load density related heuristics and the
> 	checks, do extra costing on vector construction statements if need.
> 	(rs6000_init_cost): Init new members.
> 	(rs6000_update_target_cost_per_stmt): New function.
> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
> 	rs6000_update_target_cost_per_stmt and call it.

>---
> gcc/config/rs6000/rs6000.c | 119 ++++++++++++++++++++++++++++++++++---
> 1 file changed, 110 insertions(+), 9 deletions(-)
>
>diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>index 279f00cc648..4e9087683dc 100644
>--- a/gcc/config/rs6000/rs6000.c
>+++ b/gcc/config/rs6000/rs6000.c
>@@ -5254,6 +5254,12 @@ typedef struct _rs6000_cost_data
> {
>   struct loop *loop_info;
>   unsigned cost[3];
>+  /* Total number of vectorized stmts (loop only).  */
>+  unsigned nstmts;
>+  /* Total number of loads (loop only).  */
>+  unsigned nloads;
>+  /* Possible extra penalized cost on vector construction (loop only).  */
>+  unsigned extra_ctor_cost;
>   /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>      instruction is needed by the vectorization.  */
>   bool vect_nonmem;
>@@ -5315,9 +5321,45 @@ rs6000_density_test (rs6000_cost_data *data)
>       if (dump_enabled_p ())
> 	dump_printf_loc (MSG_NOTE, vect_location,
> 			 "density %d%%, cost %d exceeds threshold, penalizing "
>-			 "loop body cost by %d%%", density_pct,
>+			 "loop body cost by %d%%\n", density_pct,
> 			 vec_cost + not_vec_cost, DENSITY_PENALTY);
>     }
>+
>+  /* Check if we need to penalize the body cost for latency and
>+     execution resources bound from strided or elementwise loads
>+     into a vector.  */
>+  if (data->extra_ctor_cost > 0)
>+    {
>+      /* Threshold for load stmts percentage in all vectorized stmts.  */
>+      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
>+      /* Threshold for total number of load stmts.  */
>+      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
>+
>+      gcc_assert (data->nloads <= data->nstmts);
>+      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
>+
>+      /* It's likely to be bounded by latency and execution resources
>+	 from many scalar loads which are strided or elementwise loads
>+	 into a vector if both conditions below are found:
>+	   1. there are many loads, it's easy to result in a long wait
>+	      for load units;
>+	   2. load has a big proportion of all vectorized statements,
>+	      it's not easy to schedule other statements to spread among
>+	      the loads.
>+	 One typical case is the innermost loop of the hotspot of SPEC2017
>+	 503.bwaves_r without loop interchange.  */
>+      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
>+	  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
>+	{
>+	  data->cost[vect_body] += data->extra_ctor_cost;
>+	  if (dump_enabled_p ())
>+	    dump_printf_loc (MSG_NOTE, vect_location,
>+			     "Found %u loads and load pct. %u%% exceed "
>+			     "the threshold, penalizing loop body "
>+			     "cost by extra cost %u for ctor.\n",
>+			     data->nloads, load_pct, data->extra_ctor_cost);
>+	}
>+    }
> }
> 
> /* Implement targetm.vectorize.init_cost.  */
>@@ -5331,6 +5373,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
>   data->cost[vect_body]     = 0;
>   data->cost[vect_epilogue] = 0;
>   data->vect_nonmem = false;
>+  data->nstmts = 0;
>+  data->nloads = 0;
>+  data->extra_ctor_cost = 0;
>   data->costing_for_scalar = costing_for_scalar;
>   return data;
> }
>@@ -5358,6 +5403,67 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
>   return 0;
> }
> 
>+/* Helper function for add_stmt_cost.  Check each statement cost
>+   entry, gather information and update the target_cost fields
>+   accordingly.  */
>+static void
>+rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>+				    enum vect_cost_for_stmt kind,
>+				    struct _stmt_vec_info *stmt_info,
>+				    enum vect_cost_model_location where,
>+				    int stmt_cost, unsigned int orig_count)
>+{
>+
>+  /* Check whether we're doing something other than just a copy loop.
>+     Not all such loops may be profitably vectorized; see
>+     rs6000_finish_cost.  */
>+  if ((kind == vec_to_scalar || kind == vec_perm
>+       || kind == vec_promote_demote || kind == vec_construct
>+       || kind == scalar_to_vec)
>+      || (where == vect_body && kind == vector_stmt))
>+    data->vect_nonmem = true;
>+
>+  /* Gather some information when we are costing the vectorized instruction
>+     for the statements located in a loop body.  */
>+  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>+    {
>+      data->nstmts += orig_count;
>+
>+      if (kind == scalar_load || kind == vector_load
>+	  || kind == unaligned_load || kind == vector_gather_load)
>+	data->nloads += orig_count;
>+
>+      /* If we have strided or elementwise loads into a vector, it's
>+	 possible to be bounded by latency and execution resources for
>+	 many scalar loads.  Try to account for this by scaling the
>+	 construction cost by the number of elements involved, when
>+	 handling each matching statement we record the possible extra
>+	 penalized cost into target cost, in the end of costing for
>+	 the whole loop, we do the actual penalization once some load
>+	 density heuristics are satisfied.  */

The above comment is quite hard to read.  Can you please break up the last
sentence into at least two sentences?

Otherwise this looks good to me, and I recommend maintainers approve with
that clarified.

Thanks,
Bill

>+      if (kind == vec_construct && stmt_info
>+	  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>+	  && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
>+	      || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
>+	{
>+	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>+	  unsigned int nunits = vect_nunits_for_cost (vectype);
>+	  unsigned int extra_cost = nunits * stmt_cost;
>+	  /* As function rs6000_builtin_vectorization_cost shows, we have
>+	     priced much on V16QI/V8HI vector construction as their units,
>+	     if we penalize them with nunits * stmt_cost, it can result in
>+	     an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>+	     is 20 and nunits is 16, the extra cost is 320 which looks
>+	     much exaggerated.  So let's use one maximum bound for the
>+	     extra penalized cost for vector construction here.  */
>+	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>+	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>+	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>+	  data->extra_ctor_cost += extra_cost;
>+	}
>+    }
>+}
>+
> /* Implement targetm.vectorize.add_stmt_cost.  */
> 
> static unsigned
>@@ -5377,6 +5483,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>       /* Statements in an inner loop relative to the loop being
> 	 vectorized are weighted more heavily.  The value here is
> 	 arbitrary and could potentially be improved with analysis.  */
>+      unsigned int orig_count = count;
>       if (where == vect_body && stmt_info
> 	  && stmt_in_inner_loop_p (vinfo, stmt_info))
> 	{
>@@ -5388,14 +5495,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>       retval = (unsigned) (count * stmt_cost);
>       cost_data->cost[where] += retval;
> 
>-      /* Check whether we're doing something other than just a copy loop.
>-	 Not all such loops may be profitably vectorized; see
>-	 rs6000_finish_cost.  */
>-      if ((kind == vec_to_scalar || kind == vec_perm
>-	   || kind == vec_promote_demote || kind == vec_construct
>-	   || kind == scalar_to_vec)
>-	  || (where == vect_body && kind == vector_stmt))
>-	cost_data->vect_nonmem = true;
>+      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
>+					  stmt_cost, orig_count);
>     }
> 
>   return retval;
>-- 
>2.17.1
>
>