From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linkw@linux.ibm.com>
Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com
 [148.163.158.5])
 by sourceware.org (Postfix) with ESMTPS id A42393858D38
 for <gcc-patches@gcc.gnu.org>; Tue, 21 Jul 2020 05:52:10 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org A42393858D38
Received: from pps.filterd (m0098421.ppops.net [127.0.0.1])
 by mx0a-001b2d01.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id
 06L5V7Gx152672; Tue, 21 Jul 2020 01:52:08 -0400
Received: from pps.reinject (localhost [127.0.0.1])
 by mx0a-001b2d01.pphosted.com with ESMTP id 32d5k0m354-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 21 Jul 2020 01:52:07 -0400
Received: from m0098421.ppops.net (m0098421.ppops.net [127.0.0.1])
 by pps.reinject (8.16.0.36/8.16.0.36) with SMTP id 06L5VfVg154049;
 Tue, 21 Jul 2020 01:52:07 -0400
Received: from ppma03ams.nl.ibm.com (62.31.33a9.ip4.static.sl-reverse.com
 [169.51.49.98])
 by mx0a-001b2d01.pphosted.com with ESMTP id 32d5k0m34m-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 21 Jul 2020 01:52:07 -0400
Received: from pps.filterd (ppma03ams.nl.ibm.com [127.0.0.1])
 by ppma03ams.nl.ibm.com (8.16.0.42/8.16.0.42) with SMTP id 06L5p51G021172;
 Tue, 21 Jul 2020 05:52:05 GMT
Received: from b06cxnps3075.portsmouth.uk.ibm.com
 (d06relay10.portsmouth.uk.ibm.com [9.149.109.195])
 by ppma03ams.nl.ibm.com with ESMTP id 32brq7kg3n-1
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT);
 Tue, 21 Jul 2020 05:52:05 +0000
Received: from d06av21.portsmouth.uk.ibm.com (d06av21.portsmouth.uk.ibm.com
 [9.149.105.232])
 by b06cxnps3075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id
 06L5q3UZ17760592
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
 Tue, 21 Jul 2020 05:52:03 GMT
Received: from d06av21.portsmouth.uk.ibm.com (unknown [127.0.0.1])
 by IMSVA (Postfix) with ESMTP id 6D1695204E;
 Tue, 21 Jul 2020 05:52:03 +0000 (GMT)
Received: from KewenLins-MacBook-Pro.local (unknown [9.200.48.212])
 by d06av21.portsmouth.uk.ibm.com (Postfix) with ESMTP id 0D21652054;
 Tue, 21 Jul 2020 05:52:00 +0000 (GMT)
To: GCC Patches <gcc-patches@gcc.gnu.org>
Cc: Bill Schmidt <wschmidt@linux.ibm.com>,
 Segher Boessenkool <segher@kernel.crashing.org>,
 Richard Sandiford <richard.sandiford@arm.com>,
 Richard Biener <richard.guenther@gmail.com>
From: "Kewen.Lin" <linkw@linux.ibm.com>
Subject: [PATCH] vect: Support vector with length cost modeling
Message-ID: <419f1fad-05be-115c-1a53-cb710ae7b2dc@linux.ibm.com>
Date: Tue, 21 Jul 2020 13:51:58 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:68.0)
 Gecko/20100101 Thunderbird/68.9.0
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------BFEB2E4545496AA5B8BA4480"
Content-Language: en-US
X-TM-AS-GCONF: 00
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235, 18.0.687
 definitions=2020-07-21_01:2020-07-21,
 2020-07-21 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
 bulkscore=0 phishscore=0
 mlxlogscore=999 suspectscore=9 priorityscore=1501 mlxscore=0 clxscore=1015
 lowpriorityscore=0 malwarescore=0 impostorscore=0 spamscore=0 adultscore=0
 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000
 definitions=main-2007210035
X-Spam-Status: No, score=-10.9 required=5.0 tests=BAYES_00, GIT_PATCH_0,
 KAM_DMARC_STATUS, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE,
 SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <http://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <http://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 21 Jul 2020 05:52:12 -0000

This is a multi-part message in MIME format.
--------------BFEB2E4545496AA5B8BA4480
Content-Type: text/plain; charset=gbk
Content-Transfer-Encoding: 7bit

Hi,

This patch is to add the cost modeling for vector with length,
it mainly follows what we generate for vector with length in
functions vect_set_loop_controls_directly and vect_gen_len
at the worst case.

For Power, the length is expected to be in bits 0-7 (high bits),
we have to model the cost of shifting bits.  To allow other targets
not suffer this, I used one target hook to describe this extra cost,
I'm not sure if it's a correct way.

Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
param vect-partial-vector-usage=1.

Any comments/suggestions are highly appreciated!

BR,
Kewen
-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (TARGET_VECTORIZE_EXTRA_LENGTH_COST): New
	macro.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in (TARGET_VECTORIZE_EXTRA_LENGTH_COST): New target hook.
	* target.def (extra_length_cost): Likewise.
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
	modeling for vector with length.

--------------BFEB2E4545496AA5B8BA4480
Content-Type: text/plain; charset=UTF-8; x-mac-type="0"; x-mac-creator="0";
 name="cost_v1.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="cost_v1.diff"

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 5a4f07d5810..1c5f02796f5 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1668,6 +1668,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_DOLOOP_COST_FOR_ADDRESS
 #define TARGET_DOLOOP_COST_FOR_ADDRESS 1000000000
 
+#undef TARGET_VECTORIZE_EXTRA_LENGTH_COST
+#define TARGET_VECTORIZE_EXTRA_LENGTH_COST 1
+
 #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
 #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 6e7d9dc54a9..ef37b5c1d6d 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6106,6 +6106,14 @@ This hook should complete calculations of the cost of vectorizing a loop or basi
 This hook should release @var{data} and any related data structures allocated by TARGET_VECTORIZE_INIT_COST.  The default releases the accumulator.
 @end deftypefn
 
+@deftypevr {Target Hook} unsigned TARGET_VECTORIZE_EXTRA_LENGTH_COST
+For loop vectorization using length-based partial vectors, some targets
+need extra operations for length preparation, like one shift operation is
+required on Power to make length be encoded in bits 0-7.  This hook is to
+provide a way for this kind of extra cost.
+The default value is zero.
+@end deftypevr
+
 @deftypefn {Target Hook} tree TARGET_VECTORIZE_BUILTIN_GATHER (const_tree @var{mem_vectype}, const_tree @var{index_type}, int @var{scale})
 Target builtin that implements vector gather operation.  @var{mem_vectype}
 is the vector type of the load and @var{index_type} is scalar type of
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3be984bbd5c..ae9a513f529 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4191,6 +4191,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_DESTROY_COST_DATA
 
+@hook TARGET_VECTORIZE_EXTRA_LENGTH_COST
+
 @hook TARGET_VECTORIZE_BUILTIN_GATHER
 
 @hook TARGET_VECTORIZE_BUILTIN_SCATTER
diff --git a/gcc/target.def b/gcc/target.def
index 07059a87caf..3134be7ea7b 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2058,6 +2058,16 @@ DEFHOOK
  (void *data),
  default_destroy_cost_data)
 
+/* For target-specific cost on length-based vectorization.  */
+DEFHOOKPOD
+(extra_length_cost,
+ "For loop vectorization using length-based partial vectors, some targets\n\
+need extra operations for length preparation, like one shift operation is\n\
+required on Power to make length be encoded in bits 0-7.  This hook is to\n\
+provide a way for this kind of extra cost.\n\
+The default value is zero.",
+ unsigned, 0)
+
 HOOK_VECTOR_END (vectorize)
 
 #undef HOOK_PREFIX
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index e933441b922..294a445afac 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3652,7 +3652,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
      TODO: Build an expression that represents peel_iters for prologue and
      epilogue to be used in a run-time test.  */
 
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       peel_iters_prologue = 0;
       peel_iters_epilogue = 0;
@@ -3663,45 +3663,149 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	  peel_iters_epilogue += 1;
 	  stmt_info_for_cost *si;
 	  int j;
-	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
-			    j, si)
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
+			    si)
 	    (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
 				  si->kind, si->stmt_info, si->vectype,
 				  si->misalign, vect_epilogue);
 	}
 
-      /* Calculate how many masks we need to generate.  */
-      unsigned int num_masks = 0;
-      rgroup_controls *rgm;
-      unsigned int num_vectors_m1;
-      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
-	if (rgm->type)
-	  num_masks += num_vectors_m1 + 1;
-      gcc_assert (num_masks > 0);
-
-      /* In the worst case, we need to generate each mask in the prologue
-	 and in the loop body.  One of the loop body mask instructions
-	 replaces the comparison in the scalar loop, and since we don't
-	 count the scalar comparison against the scalar body, we shouldn't
-	 count that vector instruction against the vector body either.
-
-	 Sometimes we can use unpacks instead of generating prologue
-	 masks and sometimes the prologue mask will fold to a constant,
-	 so the actual prologue cost might be smaller.  However, it's
-	 simpler and safer to use the worst-case cost; if this ends up
-	 being the tie-breaker between vectorizing or not, then it's
-	 probably better not to vectorize.  */
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo,
-			    target_cost_data, num_masks - 1, vector_stmt,
-			    NULL, NULL_TREE, 0, vect_body);
-    }
-  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
-    {
-      peel_iters_prologue = 0;
-      peel_iters_epilogue = 0;
+      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+	{
+	  /* Calculate how many masks we need to generate.  */
+	  unsigned int num_masks = 0;
+	  rgroup_controls *rgm;
+	  unsigned int num_vectors_m1;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
+	    if (rgm->type)
+	      num_masks += num_vectors_m1 + 1;
+	  gcc_assert (num_masks > 0);
+
+	  /* In the worst case, we need to generate each mask in the prologue
+	     and in the loop body.  One of the loop body mask instructions
+	     replaces the comparison in the scalar loop, and since we don't
+	     count the scalar comparison against the scalar body, we shouldn't
+	     count that vector instruction against the vector body either.
+
+	     Sometimes we can use unpacks instead of generating prologue
+	     masks and sometimes the prologue mask will fold to a constant,
+	     so the actual prologue cost might be smaller.  However, it's
+	     simpler and safer to use the worst-case cost; if this ends up
+	     being the tie-breaker between vectorizing or not, then it's
+	     probably better not to vectorize.  */
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
+				vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
+				vector_stmt, NULL, NULL_TREE, 0, vect_body);
+	}
+      else
+	{
+	  gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
+
+	  /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT.  */
+	  if (npeel < 0)
+	    {
+	      peel_iters_prologue = assumed_vf / 2;
+	      /* See below, if peeled iterations are unknown, count a taken
+		 branch and a not taken branch per peeled loop.  */
+	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+				    cond_branch_taken, NULL, NULL_TREE, 0,
+				    vect_prologue);
+	      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+				    cond_branch_not_taken, NULL, NULL_TREE, 0,
+				    vect_prologue);
+	    }
+	  else
+	    {
+	      peel_iters_prologue = npeel;
+	      if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+		/* See vect_get_known_peeling_cost, if peeled iterations are
+		   known but number of scalar loop iterations are unknown, count
+		   a taken branch per peeled loop.  */
+		(void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+				      cond_branch_taken, NULL, NULL_TREE, 0,
+				      vect_prologue);
+	    }
+
+	  stmt_info_for_cost *si;
+	  int j;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
+			    si)
+	    (void) add_stmt_cost (loop_vinfo, target_cost_data,
+				  si->count * peel_iters_prologue, si->kind,
+				  si->stmt_info, si->vectype, si->misalign,
+				  vect_prologue);
+
+	  /* Refer to the functions vect_set_loop_condition_partial_vectors
+	     and vect_set_loop_controls_directly, we need to generate each
+	     length in the prologue and in the loop body if required.  Although
+	     there are some possible optimization, we consider the worst case
+	     here.  */
+
+	  /* For now we only operate length-based partial vectors on Power,
+	     which has constant VF all the time, we need some tweakings below
+	     if it doesn't hold in future.  */
+	  gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
+
+	  /* For wrap around checking.  */
+	  tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+	  unsigned int compare_precision = TYPE_PRECISION (compare_type);
+	  widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+	  bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+	  bool need_iterate_p
+	    = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	       && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+	  /* Init min/max, shift and minus cost relative to single scalar_stmt.
+	     FIXME: For now we only use length-based partial vectors on Power,
+	     target specific costs may be needed for other ports.  */
+	  unsigned int min_max_cost = 2;
+	  unsigned int shift_cost = 1, minus_cost = 1;
+
+	  /* Init cost relative to single scalar_stmt.  */
+	  unsigned int prol_cnt = 0;
+	  unsigned int body_cnt = 0;
+
+	  rgroup_controls *rgc;
+	  unsigned int num_vectors_m1;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+	    if (rgc->type)
+	      {
+		unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
+
+		/* Need one shift for niters_total computation.  */
+		if (!niters_known_p && nitems != 1)
+		  prol_cnt += shift_cost;
+
+		/* Need to handle wrap around.  */
+		if (iv_limit == -1
+		    || (wi::min_precision (iv_limit * nitems, UNSIGNED)
+			> compare_precision))
+		  prol_cnt += (min_max_cost + minus_cost);
+
+		/* Need to handle batch limit excepting for the 1st one.  */
+		prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
+
+		unsigned int num_vectors = num_vectors_m1 + 1;
+		/* Need to set up lengths in prologue, only one MIN required
+		   since start index is zero.  */
+		prol_cnt += min_max_cost * num_vectors;
+
+		/* Probably need some extra cost to transform this length, like
+		   shift on Power.  */
+		body_cnt += targetm.vectorize.extra_length_cost * num_vectors;
+
+		/* Need to update lengths in body for next iteration.  */
+		if (need_iterate_p)
+		  body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
+	      }
+
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt,
+				scalar_stmt, NULL, NULL_TREE, 0, vect_prologue);
+	  (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt,
+				scalar_stmt, NULL, NULL_TREE, 0, vect_body);
+	}
     }
   else if (npeel < 0)
     {
@@ -3913,8 +4017,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* ??? The "if" arm is written to handle all cases; see below for what
-     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+     we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* Rewriting the condition above in terms of the number of
 	 vector iterations (vniters) rather than the number of
@@ -3941,7 +4045,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 	dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
 		     min_vec_niters);
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  /* Now that we know the minimum number of vector iterations,
 	     find the minimum niters for which the scalar cost is larger:
@@ -3996,6 +4100,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
+  else if (min_profitable_iters < peel_iters_prologue)
+    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
+       vectorized loop to execute at least once.  */
+    min_profitable_iters = peel_iters_prologue;
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -4013,7 +4121,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
   if (vec_outside_cost <= 0)
     min_profitable_estimate = 0;
-  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* This is a repeat of the code above, but with + SOC rather
 	 than - SOC.  */
@@ -4025,7 +4133,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       if (outside_overhead > 0)
 	min_vec_niters = outside_overhead / saving_per_viter + 1;
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
 	{
 	  int threshold = (vec_inside_cost * min_vec_niters
 			   + vec_outside_cost

--------------BFEB2E4545496AA5B8BA4480--