Re: [PATCH 1v2/3][vect] Add main vectorized loop unrolling

public inbox for gcc-patches@gcc.gnu.org

From: "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com>
To: Richard Biener <rguenther@suse.de>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	Richard Sandiford <richard.sandiford@arm.com>
Subject: Re: [PATCH 1v2/3][vect] Add main vectorized loop unrolling
Date: Tue, 12 Oct 2021 11:35:08 +0100
Message-ID: <623fbfd9-b97c-8c6e-0348-07d6c4496592@arm.com>
In-Reply-To: <4272814n-8538-p793-157q-5n6q16r48n51@fhfr.qr>

[-- Attachment #1: Type: text/plain, Size: 8479 bytes --]

Hi Richi,

I think this is what you meant: I now hide all the unrolling cost 
calculations in the existing target hooks for costs. I did need to 
adjust 'finish_cost' to take the vec_info so that the target's 
implementations can set the newly renamed 'suggested_unroll_factor'.
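
As a rough illustration only (this is not code from the patch), a target's
finish_cost implementation could use the new parameter along the lines
below. The stand-in types, the name example_finish_cost and the body-cost
threshold are all hypothetical; a real hook would use GCC's own vec_info
hierarchy and a heuristic chosen by the target. The sketch also guards
against the two callers in the patch that do not pass a loop_vinfo: the
scalar-iteration costing passes NULL and the SLP code passes a bb_vinfo.

```cpp
#include <cassert>

// Stand-ins for the vectorizer types (assumption: the real definitions in
// tree-vectorizer.h are far richer and use GCC's own casting helpers).
struct vec_info { virtual ~vec_info () = default; };
struct loop_vec_info_sketch : vec_info
{
  unsigned suggested_unroll_factor = 1;
};

enum cost_location { vect_prologue = 0, vect_body = 1, vect_epilogue = 2 };

// Hypothetical target hook: report the accumulated costs and, when costing
// a loop, record a suggested unroll factor on the loop's vinfo.
static void
example_finish_cost (vec_info *vinfo, void *data, unsigned *prologue_cost,
		     unsigned *body_cost, unsigned *epilogue_cost)
{
  assert (data != nullptr);
  unsigned *cost = static_cast<unsigned *> (data);
  *prologue_cost = cost[vect_prologue];
  *body_cost = cost[vect_body];
  *epilogue_cost = cost[vect_epilogue];

  // VINFO may be null (scalar costing) or may not describe a loop (SLP).
  if (auto *loop_vinfo = dynamic_cast<loop_vec_info_sketch *> (vinfo))
    {
      // Made-up heuristic: unroll cheap loop bodies by two.
      loop_vinfo->suggested_unroll_factor = *body_cost < 8 ? 2 : 1;
    }
}
```

The point of the design is that the hook already has the complete cost data
at this stage, so it can make the unroll suggestion without a separate
target hook.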

Also added the checks for the epilogue's VF.
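
The rule those checks enforce can be sketched as follows. This is a
simplified model that assumes the vectorization factors are plain
compile-time constants; the patch itself works on poly_uint64 values with
known_lt and maybe_eq. The names loop_candidate and epilogue_vf_ok are
hypothetical.

```cpp
#include <cassert>

// Simplified view of a vectorization candidate (hypothetical type).
struct loop_candidate
{
  unsigned vf;			// vectorization factor
  bool using_partial_vectors;	// true if the loop is predicated
};

// An epilogue is acceptable if its VF is smaller than the main loop's VF,
// or if it is predicated and has the same VF as the main loop.
static bool
epilogue_vf_ok (const loop_candidate &epilogue,
		const loop_candidate &main_loop)
{
  assert (epilogue.vf > 0 && main_loop.vf > 0);
  return epilogue.vf < main_loop.vf
	 || (epilogue.using_partial_vectors
	     && epilogue.vf == main_loop.vf);
}
```

For example, with a main loop of VF 8, an unpredicated epilogue of VF 4 is
accepted and one of VF 8 is rejected, while a predicated epilogue of VF 8 is
accepted.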

Is this more like what you had in mind?


gcc/ChangeLog:

         * config/aarch64/aarch64.c (aarch64_finish_cost): Add class vec_info
         parameter.
         * config/i386/i386.c (ix86_finish_cost): Likewise.
         * config/rs6000/rs6000.c (rs6000_finish_cost): Likewise.
         * doc/tm.texi: Document changes to TARGET_VECTORIZE_FINISH_COST.
         * target.def: Add class vec_info parameter to finish_cost.
         * targhooks.c (default_finish_cost): Likewise.
         * targhooks.h (default_finish_cost): Likewise.
         * tree-vect-loop.c (vect_determine_vectorization_factor): Use
         suggested_unroll_factor to increase vectorization_factor if possible.
         (_loop_vec_info::_loop_vec_info): Add suggested_unroll_factor member.
         (vect_compute_single_scalar_iteration_cost): Adjust call to
         finish_cost.
         (vect_determine_partial_vectors_and_peeling): Ensure unrolled loop is
         not predicated.
         (vect_determine_unroll_factor): New.
         (vect_try_unrolling): New.
         (vect_reanalyze_as_main_loop): Also try to unroll when reanalyzing as
         main loop.
         (vect_analyze_loop): Add call to vect_try_unrolling and check to
         ensure the epilogue either has a smaller VF than the main loop or
         uses partial vectors and might be of equal VF.
         (vect_estimate_min_profitable_iters): Adjust call to finish_cost.
         (vectorizable_reduction): Make sure to not use single_defuse_cycle
         when unrolling.
         * tree-vect-slp.c (vect_bb_vectorization_profitable_p): Adjust call to
         finish_cost.
         * tree-vectorizer.h (finish_cost): Change to pass new class vec_info
         parameter.
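
For the vect_determine_vectorization_factor entry, the way the suggested
factor is folded into the VF can be modelled roughly like this. It is a
sketch under the assumption that all values are plain constants; the patch
compares estimated_poly_value of a poly_uint64 candidate against
LOOP_VINFO_MAX_VECT_FACTOR. The function name apply_unroll_factor is
hypothetical.

```cpp
#include <cassert>

// Multiply VF by the largest factor, obtained by repeatedly halving
// SUGGESTED_UNROLL, for which the product stays within MAX_VF.  Return VF
// unchanged if no factor greater than 1 fits.
static unsigned
apply_unroll_factor (unsigned vf, unsigned suggested_unroll, unsigned max_vf)
{
  assert (vf > 1 && suggested_unroll >= 1);
  for (unsigned unroll = suggested_unroll; unroll > 1; unroll /= 2)
    {
      unsigned candidate = vf * unroll;
      if (candidate <= max_vf)
	return candidate;
    }
  return vf;
}
```

With a VF of 4 and a suggested factor of 4, a maximum VF of 16 gives 16, a
maximum of 8 gives 8 (the factor is halved once), and a maximum of 4 leaves
the VF at 4.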

On 01/10/2021 09:19, Richard Biener wrote:
> On Thu, 30 Sep 2021, Andre Vieira (lists) wrote:
>
>> Hi,
>>
>>
>>>> That just forces trying the vector modes we've tried before. Though I might
>>>> need to revisit this now I think about it. I'm afraid it might be possible
>>>> for
>>>> this to generate an epilogue with a vf that is not lower than that of the
>>>> main
>>>> loop, but I'd need to think about this again.
>>>>
>>>> Either way I don't think this changes the vector modes used for the
>>>> epilogue.
>>>> But maybe I'm just missing your point here.
>>> Yes, I was referring to the above which suggests that when we vectorize
>>> the main loop with V4SF but unroll then we try vectorizing the
>>> epilogue with V4SF as well (but not unrolled).  I think that's
>>> premature (not sure if you try V8SF if the main loop was V4SF but
>>> unrolled 4 times).
>> My main motivation for this was because I had a SVE loop that vectorized with
>> both VNx8HI, then V8HI which beat VNx8HI on cost, then it decided to unroll
>> V8HI by two and skipped using VNx8HI as a predicated epilogue which would've
>> been the best choice.
> I see, yes - for fully predicated epilogues it makes sense to consider
> the same vector mode as for the main loop anyway (independent of
> whether we're unrolling or not).  One could argue that with an
> unrolled V4SImode main loop a predicated V8SImode epilogue would also
> be a good match (but then somehow costing favored the unrolled V4SI
> over the V8SI for the main loop...).
>
>> So that is why I decided to just 'reset' the vector_mode selection. In a
>> scenario where you only have the traditional vector modes it might make less
>> sense.
>>
>> Just realized I still didn't add any check to make sure the epilogue has a
>> lower VF than the previous loop, though I'm still not sure that could happen.
>> I'll go look at where to add that if you agree with this.
> As said above, it only needs a lower VF in case the epilogue is not
> fully masked - otherwise the same VF would be OK.
>
>>>> I can move it there, it would indeed remove the need for the change to
>>>> vect_update_vf_for_slp, the change to
>>>> vect_determine_partial_vectors_and_peeling would still be required I think.
>>>> It
>>>> is meant to disable using partial vectors in an unrolled loop.
>>> Why would we disable the use of partial vectors in an unrolled loop?
>> The motivation behind that is that the overhead caused by generating
>> predicates for each iteration will likely be too much for it to be profitable
>> to unroll. On top of that, when dealing with low iteration count loops, if
>> executing one predicated iteration would be enough we now still need to
>> execute all other unrolled predicated iterations, whereas if we keep them
>> unrolled we skip the unrolled loops.
> OK, I guess we're not factoring in costs when deciding on predication
> but go for it if it's generally enabled and possible.
>
> With the proposed scheme we'd then cost the predicated not unrolled
> loop against a not predicated unrolled loop which might be a bit
> apples vs. oranges also because the target made the unroll decision
> based on the data it collected for the predicated loop.
>
>>> Sure but I'm suggesting you keep the not unrolled body as one way of
>>> costed vectorization but then if the target says "try unrolling"
>>> re-do the analysis with the same mode but a larger VF.  Just like
>>> we iterate over vector modes you'll now iterate over pairs of
>>> vector mode + VF (unroll factor).  It's not about re-using the costing
>>> it's about using costing that is actually relevant and also to avoid
>>> targets inventing two distinct separate costings - a target (powerpc)
>>> might already compute load/store density and other stuff for the main
>>> costing so it should have an idea whether doubling or triplicating is OK.
>>>
>>> Richard.
>> Sounds good! I changed the patch to determine the unrolling factor later,
>> after all analysis has been done and retry analysis if an unrolling factor
>> larger than 1 has been chosen for this loop and vector_mode.
>>
>> gcc/ChangeLog:
>>
>>          * doc/tm.texi: Document TARGET_VECTORIZE_UNROLL_FACTOR.
>>          * doc/tm.texi.in: Add entries for TARGET_VECTORIZE_UNROLL_FACTOR.
>>          * params.opt: Add vect-unroll and vect-unroll-reductions
>> parameters.
> What's the reason to add the --params?  It looks like this makes
> us unroll with a static number short-cutting the target.
>
> IMHO that's never going to be a great thing - but what we could do
> is look at loop->unroll and try to honor that (factoring in that
> the vectorization factor is already the times we unroll).
>
> So I'd leave those params out for now, the user would have a much
> more fine-grained way to control this with the unroll pragma.
>
> Adding a max-vect-unroll parameter would be another thing but that
> would apply after the targets or pragma decision.
>
>>          * target.def: Define hook TARGET_VECTORIZE_UNROLL_FACTOR.
> I still do not like the new target hook - as said I'd like to
> make you have the finish_cost hook allow the target to specify
> a suggested unroll factor instead because that's the point where
> it has all the info.
>
> Thanks,
> Richard.
>
>>          * targhooks.c (default_unroll_factor): New.
>>          * targhooks.h (default_unroll_factor): Likewise.
>>          * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
>>          par_unrolling_factor.
>>          (vect_determine_partial_vectors_and_peeling): Account for
>> unrolling.
>>          (vect_determine_unroll_factor): New.
>>          (vect_try_unrolling): New.
>>          (vect_reanalyze_as_main_loop): Call vect_try_unrolling when
>>          retrying a loop_vinfo as a main loop.
>>          (vect_analyze_loop): Call vect_try_unrolling when vectorizing
>> main loops.
>>          (vect_analyze_loop): Allow for epilogue vectorization when unrolling
>>          and rewalk vector_mode array for the epilogues.
>>          (vectorizable_reduction): Disable single_defuse_cycle when
>> unrolling.
>>          * tree-vectorizer.h (vect_unroll_value): Declare par_unrolling_factor
>>          as a member of loop_vec_info.
>>

[-- Attachment #2: vect_unroll3.patch --]
[-- Type: text/plain, Size: 21115 bytes --]

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 36519ccc5a58abab483c38d0a6c5f039592bfc7f..e6ccb66ba41895c4583a959d03ac3f0f173adae6 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -15972,8 +15972,9 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
 
 /* Implement TARGET_VECTORIZE_FINISH_COST.  */
 static void
-aarch64_finish_cost (void *data, unsigned *prologue_cost,
-		     unsigned *body_cost, unsigned *epilogue_cost)
+aarch64_finish_cost (class vec_info *vinfo ATTRIBUTE_UNUSED, void *data,
+		     unsigned *prologue_cost, unsigned *body_cost,
+		     unsigned *epilogue_cost)
 {
   auto *costs = static_cast<aarch64_vector_costs *> (data);
   *prologue_cost = costs->region[vect_prologue];
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index afc2674d49da370ae0f5ef277df7e9954f303b8e..de7bb9fe62fcec53ee40a4798f24c6ccd4584736 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -23048,8 +23048,9 @@ ix86_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
-ix86_finish_cost (void *data, unsigned *prologue_cost,
-		  unsigned *body_cost, unsigned *epilogue_cost)
+ix86_finish_cost (class vec_info *vinfo ATTRIBUTE_UNUSED, void *data,
+		  unsigned *prologue_cost, unsigned *body_cost,
+		  unsigned *epilogue_cost)
 {
   unsigned *cost = (unsigned *) data;
   *prologue_cost = cost[vect_prologue];
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index ad81dfb316dff00cde810d6b1edd31fa49d5c1e8..6f674b6426284dbf9b9f8fdd85515cf9702adff6 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5551,8 +5551,9 @@ rs6000_adjust_vect_cost_per_loop (rs6000_cost_data *data)
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
-rs6000_finish_cost (void *data, unsigned *prologue_cost,
-		    unsigned *body_cost, unsigned *epilogue_cost)
+rs6000_finish_cost (class vec_info *vinfo ATTRIBUTE_UNUSED, void *data,
+		    unsigned *prologue_cost, unsigned *body_cost,
+		    unsigned *epilogue_cost)
 {
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index be8148583d8571b0d035b1938db9d056bfd213a8..05ddd4c58a3711dd949b28da3e61fb49d8175257 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6276,7 +6276,7 @@ return value should be viewed as a tentative cost that may later be
 revised.
 @end deftypefn
 
-@deftypefn {Target Hook} void TARGET_VECTORIZE_FINISH_COST (void *@var{data}, unsigned *@var{prologue_cost}, unsigned *@var{body_cost}, unsigned *@var{epilogue_cost})
+@deftypefn {Target Hook} void TARGET_VECTORIZE_FINISH_COST (class vec_info *@var{vinfo}, void *@var{data}, unsigned *@var{prologue_cost}, unsigned *@var{body_cost}, unsigned *@var{epilogue_cost})
 This hook should complete calculations of the cost of vectorizing a loop
 or basic block based on @var{data}, and return the prologue, body, and
 epilogue costs as unsigned integers.  The default returns the value of
diff --git a/gcc/target.def b/gcc/target.def
index bfa819609c21bd71c0cc585c01dba42534453f47..f0be0e10a9225dd75b013535d8e42c1d1bfe8f50 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2081,8 +2081,8 @@ or basic block based on @var{data}, and return the prologue, body, and\n\
 epilogue costs as unsigned integers.  The default returns the value of\n\
 the three accumulators.",
  void,
- (void *data, unsigned *prologue_cost, unsigned *body_cost,
-  unsigned *epilogue_cost),
+ (class vec_info *vinfo, void *data, unsigned *prologue_cost,
+  unsigned *body_cost, unsigned *epilogue_cost),
  default_finish_cost)
 
 /* Function to delete target-specific cost modeling data.  */
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 92d51992e625c2497aa8496b1e2e3d916e5706fd..6fd1fade49cfe00295afd52aee7a34931bb48b92 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -123,7 +123,8 @@ extern unsigned default_add_stmt_cost (class vec_info *, void *, int,
 				       enum vect_cost_for_stmt,
 				       class _stmt_vec_info *, tree, int,
 				       enum vect_cost_model_location);
-extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
+extern void default_finish_cost (class vec_info *, void *, unsigned *,
+				 unsigned *, unsigned *);
 extern void default_destroy_cost_data (void *);
 
 /* OpenACC hooks.  */
diff --git a/gcc/targhooks.c b/gcc/targhooks.c
index c9b5208853dbc15706a65d1eb335e28e0564325e..0a3ecfa76406152ce79aaf19c5a2cc8b652936ff 100644
--- a/gcc/targhooks.c
+++ b/gcc/targhooks.c
@@ -1518,8 +1518,9 @@ default_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 /* By default, the cost model just returns the accumulated costs.  */
 
 void
-default_finish_cost (void *data, unsigned *prologue_cost,
-		     unsigned *body_cost, unsigned *epilogue_cost)
+default_finish_cost (class vec_info *vinfo, void *data,
+		     unsigned *prologue_cost, unsigned *body_cost,
+		     unsigned *epilogue_cost)
 {
   unsigned *cost = (unsigned *) data;
   *prologue_cost = cost[vect_prologue];
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 5a5b8da2e771a1dd204f22a6447eba96bb3b352c..50256cb6cb478246e3402162391096cbbc7fde94 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -365,6 +365,24 @@ vect_determine_vectorization_factor (loop_vec_info loop_vinfo)
   if (known_le (vectorization_factor, 1U))
     return opt_result::failure_at (vect_location,
 				   "not vectorized: unsupported data-type\n");
+  /* Apply the unrolling factor.  This was determined by
+     vect_determine_unroll_factor the first time we ran the analysis for this
+     vector mode.  */
+  if (loop_vinfo->suggested_unroll_factor > 1)
+    {
+      unsigned unrolling_factor = loop_vinfo->suggested_unroll_factor;
+      while (unrolling_factor > 1)
+	{
+	  poly_uint64 candidate_factor = vectorization_factor * unrolling_factor;
+	  if (estimated_poly_value (candidate_factor, POLY_VALUE_MAX)
+	      <= (HOST_WIDE_INT) LOOP_VINFO_MAX_VECT_FACTOR (loop_vinfo))
+	    {
+	      vectorization_factor = candidate_factor;
+	      break;
+	    }
+	  unrolling_factor /= 2;
+	}
+    }
   LOOP_VINFO_VECT_FACTOR (loop_vinfo) = vectorization_factor;
   return opt_result::success ();
 }
@@ -828,6 +846,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     skip_main_loop_edge (nullptr),
     skip_this_loop_edge (nullptr),
     reusable_accumulators (),
+    suggested_unroll_factor (1),
     max_vectorization_factor (0),
     mask_skip_niters (NULL_TREE),
     rgroup_compare_type (NULL_TREE),
@@ -1301,7 +1320,7 @@ vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
 			  si->kind, si->stmt_info, si->vectype,
 			  si->misalign, si->where);
   unsigned prologue_cost = 0, body_cost = 0, epilogue_cost = 0;
-  finish_cost (target_cost_data, &prologue_cost, &body_cost,
+  finish_cost (NULL, target_cost_data, &prologue_cost, &body_cost,
 	       &epilogue_cost);
   destroy_cost_data (target_cost_data);
   LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST (loop_vinfo)
@@ -2128,10 +2147,16 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
 	 vectors to the epilogue, with the main loop continuing to operate
 	 on full vectors.
 
+	 If we are unrolling we also do not want to use partial vectors. This
+	 is to avoid the overhead of generating multiple masks and also to
+	 avoid having to execute entire iterations of FALSE masked instructions
+	 when dealing with one or fewer full iterations.
+
 	 ??? We could then end up failing to use partial vectors if we
 	 decide to peel iterations into a prologue, and if the main loop
 	 then ends up processing fewer than VF iterations.  */
-      if (param_vect_partial_vector_usage == 1
+      if ((param_vect_partial_vector_usage == 1
+	   || loop_vinfo->suggested_unroll_factor > 1)
 	  && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
 	  && !vect_known_niters_smaller_than_vf (loop_vinfo))
 	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
@@ -2879,6 +2904,121 @@ vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo,
   return true;
 }
 
+/* Determine whether we should unroll this loop and ask target how much to
+   unroll by.  */
+
+static opt_loop_vec_info
+vect_determine_unroll_factor (loop_vec_info loop_vinfo)
+{
+  stmt_vec_info stmt_info;
+  unsigned i;
+  bool seen_reduction_p = false;
+  poly_uint64 vectorization_factor = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+
+  FOR_EACH_VEC_ELT (loop_vinfo->stmt_vec_infos, i, stmt_info)
+    {
+      if (STMT_VINFO_IN_PATTERN_P (stmt_info)
+	  || !STMT_VINFO_RELEVANT_P (stmt_info)
+	  || stmt_info->vectype == NULL_TREE)
+	continue;
+      /* Do not unroll loops with negative steps as it is unlikely that
+	 vectorization will succeed due to the way we deal with negative steps
+	 in loads and stores in 'get_load_store_type'.  */
+      if (stmt_info->dr_aux.dr
+	  && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+	{
+	  dr_vec_info *dr_info = STMT_VINFO_DR_INFO (stmt_info);
+	  tree step = vect_dr_behavior (loop_vinfo, dr_info)->step;
+	  if (TREE_CODE (step) == INTEGER_CST
+	      && tree_int_cst_compare (step, size_zero_node) < 0)
+	    {
+	      return opt_loop_vec_info::failure_at
+		(vect_location, "could not unroll due to negative step\n");
+	    }
+	}
+
+      if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def)
+	{
+	  auto red_info = info_for_reduction (loop_vinfo, stmt_info);
+	  if (STMT_VINFO_REDUC_TYPE (red_info) == TREE_CODE_REDUCTION)
+	    seen_reduction_p = true;
+	  else
+	    {
+	      return opt_loop_vec_info::failure_at
+		(vect_location, "could not unroll loop with reduction due to "
+				"non TREE_CODE_REDUCTION\n");
+	    }
+	}
+    }
+
+  if (known_le (vectorization_factor, 1U))
+    return opt_loop_vec_info::failure_at (vect_location,
+					  "will not unroll loop with a VF of 1"
+					  " or less\n");
+
+  opt_loop_vec_info unrolled_vinfo
+    = opt_loop_vec_info::success (vect_analyze_loop_form (loop_vinfo->loop,
+							  loop_vinfo->shared));
+  unrolled_vinfo->vector_mode = loop_vinfo->vector_mode;
+  /* Use the suggested_unroll_factor that was set during the target's
+     TARGET_VECTORIZE_FINISH_COST hook.  */
+  unrolled_vinfo->suggested_unroll_factor = loop_vinfo->suggested_unroll_factor;
+  return unrolled_vinfo;
+}
+
+
+/* Try to unroll the current loop.  First determine the unrolling factor using
+   the analysis done for the current vector mode.  Then re-analyze the loop for
+   the given unrolling factor and the current vector mode.  */
+
+static opt_loop_vec_info
+vect_try_unrolling (opt_loop_vec_info loop_vinfo, unsigned *n_stmts)
+{
+  DUMP_VECT_SCOPE ("vect_try_unrolling");
+
+  opt_loop_vec_info unrolled_vinfo = vect_determine_unroll_factor (loop_vinfo);
+  /* Reset unrolling factor, in case we decide to not unroll.  */
+  loop_vinfo->suggested_unroll_factor = 1;
+  if (unrolled_vinfo)
+    {
+      if (unrolled_vinfo->suggested_unroll_factor > 1)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "***** unrolling factor %d chosen for vector mode %s,"
+			     " re-trying analysis...\n",
+			     unrolled_vinfo->suggested_unroll_factor,
+			     GET_MODE_NAME (unrolled_vinfo->vector_mode));
+	  bool unrolling_fatal = false;
+	  if (vect_analyze_loop_2 (unrolled_vinfo, unrolling_fatal, n_stmts)
+	      && known_ne (loop_vinfo->vectorization_factor,
+			   unrolled_vinfo->vectorization_factor))
+	    {
+
+	      loop_vinfo = unrolled_vinfo;
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "unrolling succeeded with factor = %d\n",
+				 loop_vinfo->suggested_unroll_factor);
+
+	    }
+	  else
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "unrolling failed with factor = %d\n",
+				 unrolled_vinfo->suggested_unroll_factor);
+	    }
+	}
+      else
+	if (dump_enabled_p ())
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "target determined unrolling is not profitable.\n");
+    }
+  loop_vinfo->loop->aux = NULL;
+  return loop_vinfo;
+}
+
 /* If LOOP_VINFO is already a main loop, return it unmodified.  Otherwise
    try to reanalyze it as a main loop.  Return the loop_vinfo on success
    and null on failure.  */
@@ -2904,6 +3044,8 @@ vect_reanalyze_as_main_loop (loop_vec_info loop_vinfo, unsigned int *n_stmts)
   bool fatal = false;
   bool res = vect_analyze_loop_2 (main_loop_vinfo, fatal, n_stmts);
   loop->aux = NULL;
+  main_loop_vinfo = vect_try_unrolling (main_loop_vinfo, n_stmts);
+
   if (!res)
     {
       if (dump_enabled_p ())
@@ -3038,6 +3180,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 
       if (res)
 	{
+	  /* Only try unrolling main loops.  */
+	  if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo))
+	    loop_vinfo = vect_try_unrolling (loop_vinfo, &n_stmts);
+
 	  LOOP_VINFO_VECTORIZABLE_P (loop_vinfo) = 1;
 	  vectorized_loops++;
 
@@ -3056,13 +3202,26 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 	      /* Keep trying to roll back vectorization attempts while the
 		 loop_vec_infos they produced were worse than this one.  */
 	      vec<loop_vec_info> &vinfos = first_loop_vinfo->epilogue_vinfos;
+	      poly_uint64 vinfo_vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+	      poly_uint64 first_vinfo_vf
+		= LOOP_VINFO_VECT_FACTOR (first_loop_vinfo);
 	      while (!vinfos.is_empty ()
+		     && (known_lt (vinfo_vf, first_vinfo_vf)
+			 || (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+			     && maybe_eq (vinfo_vf, first_vinfo_vf)))
 		     && vect_joust_loop_vinfos (loop_vinfo, vinfos.last ()))
 		{
 		  gcc_assert (vect_epilogues);
 		  delete vinfos.pop ();
 		}
+	      /* Check if we may want to replace the current first_loop_vinfo
+		 with the new loop, but only if they have different vector
+		 modes.  If they have the same vector mode this means the main
+		 loop is an unrolled loop and we are trying to vectorize the
+		 epilogue using the same vector mode but with a lower
+		 vectorization factor.  */
 	      if (vinfos.is_empty ()
+		  && loop_vinfo->vector_mode != first_loop_vinfo->vector_mode
 		  && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo))
 		{
 		  loop_vec_info main_loop_vinfo
@@ -3105,14 +3264,34 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 		   /* For now only allow one epilogue loop.  */
 		   && first_loop_vinfo->epilogue_vinfos.is_empty ())
 	    {
-	      first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
-	      poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
-	      gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
-			  || maybe_ne (lowest_th, 0U));
-	      /* Keep track of the known smallest versioning
-		 threshold.  */
-	      if (ordered_p (lowest_th, th))
-		lowest_th = ordered_min (lowest_th, th);
+	      /* Ensure the epilogue has a smaller VF than the main loop or
+		 uses predication and has the same VF.  */
+	      if (known_lt (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			    LOOP_VINFO_VECT_FACTOR (first_loop_vinfo))
+		  || (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+		      && maybe_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+				   LOOP_VINFO_VECT_FACTOR (first_loop_vinfo))))
+		{
+		  first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
+		  poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
+		  gcc_assert (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
+			      || maybe_ne (lowest_th, 0U));
+		  /* Keep track of the known smallest versioning
+		     threshold.  */
+		  if (ordered_p (lowest_th, th))
+		    lowest_th = ordered_min (lowest_th, th);
+		}
+	      else
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_NOTE, vect_location,
+				     "***** Will not use %s mode as an"
+				     " epilogue, since it leads to a higher"
+				     " vectorization factor than main loop\n",
+				     GET_MODE_NAME (loop_vinfo->vector_mode));
+		  delete loop_vinfo;
+		  loop_vinfo = opt_loop_vec_info::success (NULL);
+		}
 	    }
 	  else
 	    {
@@ -3153,13 +3332,32 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 
       /* Handle the case that the original loop can use partial
 	 vectorization, but want to only adopt it for the epilogue.
-	 The retry should be in the same mode as original.  */
+	 The retry should be in the same mode as original.
+	 Also handle the case where we have unrolled the main loop and want to
+	 retry all vector modes again for the epilogues, since the VF is now
+	 at least twice as high as the current vector mode.  */
       if (vect_epilogues
 	  && loop_vinfo
-	  && LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo))
+	  && (LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo)
+	      || loop_vinfo->suggested_unroll_factor > 1))
 	{
-	  gcc_assert (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+	  gcc_assert ((LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+		       || loop_vinfo->suggested_unroll_factor > 1)
 		      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo));
+	  /* If we are unrolling, try all VECTOR_MODES for the epilogue.  */
+	  if (loop_vinfo->suggested_unroll_factor > 1)
+	    {
+	      next_vector_mode = vector_modes[0];
+	      mode_i = 1;
+
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "***** Re-trying analysis with vector mode"
+				 " %s for epilogues after unrolling.\n",
+				 GET_MODE_NAME (next_vector_mode));
+	      continue;
+	    }
+
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_NOTE, vect_location,
 			     "***** Re-trying analysis with same vector mode"
@@ -4222,8 +4420,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* Complete the target-specific cost calculations.  */
-  finish_cost (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo), &vec_prologue_cost,
-	       &vec_inside_cost, &vec_epilogue_cost);
+  finish_cost (loop_vinfo, LOOP_VINFO_TARGET_COST_DATA (loop_vinfo),
+	       &vec_prologue_cost, &vec_inside_cost, &vec_epilogue_cost);
 
   vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost);
 
@@ -7212,7 +7410,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
    participating.  */
   if (ncopies > 1
       && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
-      && reduc_chain_length == 1)
+      && reduc_chain_length == 1
+      && loop_vinfo->suggested_unroll_factor == 1)
     single_defuse_cycle = true;
 
   if (single_defuse_cycle || lane_reduc_code_p)
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 024a1c38a2342246d7891db1de5f1d6e6458d5dd..dce8b953d306b90185ffe75c637f1fdb998aa953 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5405,7 +5405,8 @@ vect_bb_vectorization_profitable_p (bb_vec_info bb_vinfo,
       while (si < li_scalar_costs.length ()
 	     && li_scalar_costs[si].first == sl);
       unsigned dummy;
-      finish_cost (scalar_target_cost_data, &dummy, &scalar_cost, &dummy);
+      finish_cost (bb_vinfo, scalar_target_cost_data, &dummy, &scalar_cost,
+		   &dummy);
       destroy_cost_data (scalar_target_cost_data);
 
       /* Complete the target-specific vector cost calculation.  */
@@ -5418,7 +5419,7 @@ vect_bb_vectorization_profitable_p (bb_vec_info bb_vinfo,
 	}
       while (vi < li_vector_costs.length ()
 	     && li_vector_costs[vi].first == vl);
-      finish_cost (vect_target_cost_data, &vec_prologue_cost,
+      finish_cost (bb_vinfo, vect_target_cost_data, &vec_prologue_cost,
 		   &vec_inside_cost, &vec_epilogue_cost);
       destroy_cost_data (vect_target_cost_data);
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index c4c5678e7f1abafc25c465319dbacf3ef50f0ae9..e91fb6691857cbcc0b1c087d6de35164a7c75e48 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -621,6 +621,13 @@ public:
      about the reductions that generated them.  */
   hash_map<tree, vect_reusable_accumulator> reusable_accumulators;
 
+  /* The number of times that the target suggested we unroll the vector loop
+     in order to promote more ILP.  This value will be used to re-analyze the
+     loop for vectorization and if successful the value will be folded into
+     vectorization_factor (and therefore exactly divides
+     vectorization_factor).  */
+  unsigned int suggested_unroll_factor;
+
   /* Maximum runtime vectorization factor, or MAX_VECTORIZATION_FACTOR
      if there is no particular limit.  */
   unsigned HOST_WIDE_INT max_vectorization_factor;
@@ -1570,10 +1577,10 @@ add_stmt_cost (vec_info *vinfo, void *data, stmt_info_for_cost *i)
 /* Alias targetm.vectorize.finish_cost.  */
 
 static inline void
-finish_cost (void *data, unsigned *prologue_cost,
+finish_cost (class vec_info *vinfo, void *data, unsigned *prologue_cost,
 	     unsigned *body_cost, unsigned *epilogue_cost)
 {
-  targetm.vectorize.finish_cost (data, prologue_cost, body_cost, epilogue_cost);
+  targetm.vectorize.finish_cost (vinfo, data, prologue_cost, body_cost, epilogue_cost);
 }
 
 /* Alias targetm.vectorize.destroy_cost_data.  */


