public inbox for gcc-patches@gcc.gnu.org
* [0/6] Optionally pick the cheapest loop_vec_info
@ 2019-11-05 14:24 Richard Sandiford
  2019-11-05 14:25 ` [1/6] Fix vectorizable_conversion costs Richard Sandiford
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Richard Sandiford @ 2019-11-05 14:24 UTC (permalink / raw)
  To: gcc-patches

This series adds a mode in which we try to vectorise loops once for
each supported vector mode combination and then pick the one with the
lowest cost.  There are only really two patches for that: one to add the
feature and another to enable it by default for SVE.  However, for it to
work as hoped, I also needed to tweak some of the cost calculations.

The series applies on top of two earlier ones:

  https://gcc.gnu.org/ml/gcc-patches/2019-11/msg00119.html
  https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01822.html

Each patch tested individually on aarch64-linux-gnu and the series
as a whole on x86_64-linux-gnu.

Richard

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [1/6] Fix vectorizable_conversion costs
  2019-11-05 14:24 [0/6] Optionally pick the cheapest loop_vec_info Richard Sandiford
@ 2019-11-05 14:25 ` Richard Sandiford
  2019-11-06 12:01   ` Richard Biener
  2019-11-05 14:27 ` [2/6] Don't assign a cost to vectorizable_assignment Richard Sandiford
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2019-11-05 14:25 UTC (permalink / raw)
  To: gcc-patches

This patch makes two tweaks to vectorizable_conversion.  The first
is to use "modifier" to distinguish between promotion, demotion,
and neither promotion nor demotion, rather than using a code for
some cases and "modifier" for others.  The second is to take ncopies
into account for the promotion and demotion costs; previously we gave
multiple copies the same cost as a single copy.

Later patches test this, but it seemed worth splitting out.
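As a hedged sketch of the new accounting (the helper name and unit cost below are made up for illustration, not taken from the patch), the cost loop now starts from the number of vector statements at the narrowest end of the chain and doubles once per additional step, so e.g. a two-step demotion with two copies at the narrow end costs 2 + 4 statements rather than a flat per-step count:

```c
/* Standalone sketch of the revised promotion/demotion cost loop:
   NCOPIES vector statements at the narrowest end of the chain,
   doubling once for each additional step.  PWR is zero for a
   single-step conversion, one for two steps, and so on.  */
unsigned
promotion_demotion_cost (unsigned ncopies, int pwr, unsigned cost_per_stmt)
{
  unsigned inside_cost = 0;
  for (int i = 0; i < pwr + 1; i++)
    {
      inside_cost += ncopies * cost_per_stmt;
      ncopies *= 2;
    }
  return inside_cost;
}
```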


2019-11-05  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* tree-vect-stmts.c (vect_model_promotion_demotion_cost): Take
	ncopies, the number of vector copies, as an additional argument.
	(vectorizable_conversion): Update call accordingly.  Use "modifier"
	to check whether a conversion is between vectors with the same
	numbers of units.

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	2019-11-05 11:08:12.521631453 +0000
+++ gcc/tree-vect-stmts.c	2019-11-05 14:17:43.330141911 +0000
@@ -917,26 +917,27 @@ vect_model_simple_cost (stmt_vec_info st
 }
 
 
-/* Model cost for type demotion and promotion operations.  PWR is normally
-   zero for single-step promotions and demotions.  It will be one if 
-   two-step promotion/demotion is required, and so on.  Each additional
+/* Model cost for type demotion and promotion operations.  PWR is
+   normally zero for single-step promotions and demotions.  It will be
+   one if two-step promotion/demotion is required, and so on.  NCOPIES
+   is the number of vector results (and thus number of instructions)
+   for the narrowest end of the operation chain.  Each additional
    step doubles the number of instructions required.  */
 
 static void
 vect_model_promotion_demotion_cost (stmt_vec_info stmt_info,
-				    enum vect_def_type *dt, int pwr,
+				    enum vect_def_type *dt,
+				    unsigned int ncopies, int pwr,
 				    stmt_vector_for_cost *cost_vec)
 {
-  int i, tmp;
+  int i;
   int inside_cost = 0, prologue_cost = 0;
 
   for (i = 0; i < pwr + 1; i++)
     {
-      tmp = (STMT_VINFO_TYPE (stmt_info) == type_promotion_vec_info_type) ?
-	(i + 1) : i;
-      inside_cost += record_stmt_cost (cost_vec, vect_pow2 (tmp),
-				       vec_promote_demote, stmt_info, 0,
-				       vect_body);
+      inside_cost += record_stmt_cost (cost_vec, ncopies, vec_promote_demote,
+				       stmt_info, 0, vect_body);
+      ncopies *= 2;
     }
 
   /* FORNOW: Assuming maximum 2 args per stmts.  */
@@ -4981,7 +4982,7 @@ vectorizable_conversion (stmt_vec_info s
   if (!vec_stmt)		/* transformation not required.  */
     {
       DUMP_VECT_SCOPE ("vectorizable_conversion");
-      if (code == FIX_TRUNC_EXPR || code == FLOAT_EXPR)
+      if (modifier == NONE)
         {
 	  STMT_VINFO_TYPE (stmt_info) = type_conversion_vec_info_type;
 	  vect_model_simple_cost (stmt_info, ncopies, dt, ndts, slp_node,
@@ -4990,14 +4991,17 @@ vectorizable_conversion (stmt_vec_info s
       else if (modifier == NARROW)
 	{
 	  STMT_VINFO_TYPE (stmt_info) = type_demotion_vec_info_type;
-	  vect_model_promotion_demotion_cost (stmt_info, dt, multi_step_cvt,
-					      cost_vec);
+	  /* The final packing step produces one vector result per copy.  */
+	  vect_model_promotion_demotion_cost (stmt_info, dt, ncopies,
+					      multi_step_cvt, cost_vec);
 	}
       else
 	{
 	  STMT_VINFO_TYPE (stmt_info) = type_promotion_vec_info_type;
-	  vect_model_promotion_demotion_cost (stmt_info, dt, multi_step_cvt,
-					      cost_vec);
+	  /* The initial unpacking step produces two vector results
+	     per copy.  */
+	  vect_model_promotion_demotion_cost (stmt_info, dt, ncopies * 2,
+					      multi_step_cvt, cost_vec);
 	}
       interm_types.release ();
       return true;


* [2/6] Don't assign a cost to vectorizable_assignment
  2019-11-05 14:24 [0/6] Optionally pick the cheapest loop_vec_info Richard Sandiford
  2019-11-05 14:25 ` [1/6] Fix vectorizable_conversion costs Richard Sandiford
@ 2019-11-05 14:27 ` Richard Sandiford
  2019-11-06 12:04   ` Richard Biener
  2019-11-05 14:28 ` [3/6] Avoid accounting for non-existent vector loop versioning Richard Sandiford
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2019-11-05 14:27 UTC (permalink / raw)
  To: gcc-patches

vectorizable_assignment handles true SSA-to-SSA copies (which hopefully
we don't see in practice) and no-op conversions that are required
to maintain correct gimple, such as changes between signed and
unsigned types.  These cases shouldn't generate any code and so
shouldn't count against either the scalar or vector costs.
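For instance (an illustrative example, not taken from the patch), a loop like the following needs an int-to-unsigned conversion statement in gimple to keep the types correct, but the conversion is a pure view change and emits no instruction in either the scalar or the vector form:

```c
/* The store needs an (unsigned int) conversion of src[i] in gimple,
   but signed -> unsigned of the same width generates no code, so it
   should not be costed against the scalar or vector loop body.  */
void
copy_signed_to_unsigned (unsigned int *dst, const int *src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i];
}
```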

Later patches test this, but it seemed worth splitting out.


2019-11-04  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* tree-vect-stmts.c (vectorizable_assignment): Don't add a cost.

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	2019-11-05 14:17:43.330141911 +0000
+++ gcc/tree-vect-stmts.c	2019-11-05 14:18:39.169752725 +0000
@@ -5305,7 +5305,7 @@ vectorizable_conversion (stmt_vec_info s
 static bool
 vectorizable_assignment (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
 			 stmt_vec_info *vec_stmt, slp_tree slp_node,
-			 stmt_vector_for_cost *cost_vec)
+			 stmt_vector_for_cost *)
 {
   tree vec_dest;
   tree scalar_dest;
@@ -5313,7 +5313,6 @@ vectorizable_assignment (stmt_vec_info s
   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
   tree new_temp;
   enum vect_def_type dt[1] = {vect_unknown_def_type};
-  int ndts = 1;
   int ncopies;
   int i, j;
   vec<tree> vec_oprnds = vNULL;
@@ -5409,7 +5408,8 @@ vectorizable_assignment (stmt_vec_info s
     {
       STMT_VINFO_TYPE (stmt_info) = assignment_vec_info_type;
       DUMP_VECT_SCOPE ("vectorizable_assignment");
-      vect_model_simple_cost (stmt_info, ncopies, dt, ndts, slp_node, cost_vec);
+      /* Don't add a cost here.  SSA copies and no-op conversions
+	 shouldn't generate any code in either scalar or vector form.  */
       return true;
     }
 


* [3/6] Avoid accounting for non-existent vector loop versioning
  2019-11-05 14:24 [0/6] Optionally pick the cheapest loop_vec_info Richard Sandiford
  2019-11-05 14:25 ` [1/6] Fix vectorizable_conversion costs Richard Sandiford
  2019-11-05 14:27 ` [2/6] Don't assign a cost to vectorizable_assignment Richard Sandiford
@ 2019-11-05 14:28 ` Richard Sandiford
  2019-11-06 12:05   ` Richard Biener
  2019-11-05 14:29 ` [4/6] Optionally pick the cheapest loop_vec_info Richard Sandiford
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2019-11-05 14:28 UTC (permalink / raw)
  To: gcc-patches

vect_analyze_loop_costing uses two profitability thresholds: a runtime
one and a static compile-time one.  The runtime one is simply the point
at which the vector loop is cheaper than the scalar loop, while the
static one also takes into account the cost of choosing between the
scalar and vector loops at runtime.  We compare this static cost against
the expected execution frequency to decide whether it's worth generating
any vector code at all.

However, we never reclaimed the cost of applying the runtime threshold
if it turned out that the vector code can always be used.  And we only
know whether that's true once we've calculated what the runtime
threshold would be.
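A minimal sketch of the fix, with simplified names (the real code works on loop_vec_info fields, and "needs a runtime choice" condenses the versioning, peeling, and runtime-profitability-check conditions): when no runtime choice between the scalar and vector loops is needed, the static threshold can drop back to the plain break-even point.

```c
#include <stdbool.h>

/* MIN_PROFITABLE_ESTIMATE includes the cost of choosing between the
   scalar and vector loops at runtime; MIN_PROFITABLE_ITERS is the
   plain break-even point.  If no runtime choice is needed, the
   check's cost should not be charged to the static estimate.  */
unsigned
static_profitability_threshold (unsigned min_profitable_estimate,
				unsigned min_profitable_iters,
				bool needs_runtime_choice)
{
  if (!needs_runtime_choice
      && min_profitable_estimate > min_profitable_iters)
    return min_profitable_iters;
  return min_profitable_estimate;
}
```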


2019-11-04  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* tree-vectorizer.h (vect_apply_runtime_profitability_check_p):
	New function.
	* tree-vect-loop-manip.c (vect_loop_versioning): Use it.
	* tree-vect-loop.c (vect_analyze_loop_2): Likewise.
	(vect_transform_loop): Likewise.
	(vect_analyze_loop_costing): Don't take the cost of versioning
	into account for the static profitability threshold if it turns
	out that no versioning is needed.

Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h	2019-11-05 11:14:42.786884473 +0000
+++ gcc/tree-vectorizer.h	2019-11-05 14:19:33.829371745 +0000
@@ -1557,6 +1557,17 @@ vect_get_scalar_dr_size (dr_vec_info *dr
   return tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr_info->dr))));
 }
 
+/* Return true if LOOP_VINFO requires a runtime check for whether the
+   vector loop is profitable.  */
+
+inline bool
+vect_apply_runtime_profitability_check_p (loop_vec_info loop_vinfo)
+{
+  unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
+  return (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	  && th >= vect_vf_for_cost (loop_vinfo));
+}
+
 /* Source location + hotness information. */
 extern dump_user_location_t vect_location;
 
Index: gcc/tree-vect-loop-manip.c
===================================================================
--- gcc/tree-vect-loop-manip.c	2019-11-05 10:38:31.838181047 +0000
+++ gcc/tree-vect-loop-manip.c	2019-11-05 14:19:33.825371773 +0000
@@ -3173,8 +3173,7 @@ vect_loop_versioning (loop_vec_info loop
     = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo);
   unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
 
-  if (th >= vect_vf_for_cost (loop_vinfo)
-      && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+  if (vect_apply_runtime_profitability_check_p (loop_vinfo)
       && !ordered_p (th, versioning_threshold))
     cond_expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters,
 			     build_int_cst (TREE_TYPE (scalar_loop_iters),
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2019-11-05 11:14:42.782884501 +0000
+++ gcc/tree-vect-loop.c	2019-11-05 14:19:33.829371745 +0000
@@ -1689,6 +1689,24 @@ vect_analyze_loop_costing (loop_vec_info
       return 0;
     }
 
+  /* The static profitability threshold min_profitable_estimate includes
+     the cost of having to check at runtime whether the scalar loop
+     should be used instead.  If it turns out that we don't need or want
+     such a check, the threshold we should use for the static estimate
+     is simply the point at which the vector loop becomes more profitable
+     than the scalar loop.  */
+  if (min_profitable_estimate > min_profitable_iters
+      && !LOOP_REQUIRES_VERSIONING (loop_vinfo)
+      && !LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
+      && !LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
+      && !vect_apply_runtime_profitability_check_p (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location, "no need for a runtime"
+			 " choice between the scalar and vector loops\n");
+      min_profitable_estimate = min_profitable_iters;
+    }
+
   HOST_WIDE_INT estimated_niter;
 
   /* If we are vectorizing an epilogue then we know the maximum number of
@@ -2225,8 +2243,7 @@ vect_analyze_loop_2 (loop_vec_info loop_
 
       /*  Use the same condition as vect_transform_loop to decide when to use
 	  the cost to determine a versioning threshold.  */
-      if (th >= vect_vf_for_cost (loop_vinfo)
-	  && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      if (vect_apply_runtime_profitability_check_p (loop_vinfo)
 	  && ordered_p (th, niters_th))
 	niters_th = ordered_max (poly_uint64 (th), niters_th);
 
@@ -8268,14 +8285,13 @@ vect_transform_loop (loop_vec_info loop_
      run at least the (estimated) vectorization factor number of times
      checking is pointless, too.  */
   th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
-  if (th >= vect_vf_for_cost (loop_vinfo)
-      && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+  if (vect_apply_runtime_profitability_check_p (loop_vinfo))
     {
-	if (dump_enabled_p ())
-	  dump_printf_loc (MSG_NOTE, vect_location,
-			   "Profitability threshold is %d loop iterations.\n",
-			   th);
-	check_profitability = true;
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "Profitability threshold is %d loop iterations.\n",
+			 th);
+      check_profitability = true;
     }
 
   /* Make sure there exists a single-predecessor exit bb.  Do this before 


* [4/6] Optionally pick the cheapest loop_vec_info
  2019-11-05 14:24 [0/6] Optionally pick the cheapest loop_vec_info Richard Sandiford
                   ` (2 preceding siblings ...)
  2019-11-05 14:28 ` [3/6] Avoid accounting for non-existent vector loop versioning Richard Sandiford
@ 2019-11-05 14:29 ` Richard Sandiford
  2019-11-06 12:09   ` Richard Biener
  2019-11-05 14:31 ` [5/6] Account for the cost of generating loop masks Richard Sandiford
  2019-11-05 14:32 ` [6/6][AArch64] Enable vect-compare-loop-costs by default for SVE Richard Sandiford
  5 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2019-11-05 14:29 UTC (permalink / raw)
  To: gcc-patches

This patch adds a mode in which the vectoriser tries each available
base vector mode and picks the one with the lowest cost.  For now
the behaviour is behind a default-off --param, but a later patch
enables it by default for SVE.

The patch keeps the current behaviour of preferring a VF of
loop->simdlen over any larger or smaller VF, regardless of costs
or target preferences.
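When costs do decide, the comparison boils down to the fractional cost per scalar iteration, computed by cross-multiplication so that no division is needed. A minimal sketch, with plain integers standing in for the patch's poly_int types:

```c
#include <stdbool.h>

/* Is new_inside / new_vf < old_inside / old_vf?  Rewritten by
   cross-multiplication to avoid fractions, mirroring the rel_new
   vs. rel_old comparison in vect_better_loop_vinfo_p.  */
bool
cheaper_loop_body (long long new_inside, long long new_vf,
		   long long old_inside, long long old_vf)
{
  return new_inside * old_vf < old_inside * new_vf;
}
```

In the real code the operands are polynomial in the runtime vector length, so the comparison can also be inconclusive (maybe_lt vs. known_lt), in which case the tie is broken on prologue/epilogue costs.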


2019-11-05  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* params.def (vect-compare-loop-costs): New param.
	* doc/invoke.texi: Document it.
	* tree-vectorizer.h (_loop_vec_info::vec_outside_cost)
	(_loop_vec_info::vec_inside_cost): New member variables.
	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize them.
	(vect_better_loop_vinfo_p, vect_joust_loop_vinfos): New functions.
	(vect_analyze_loop): When the new parameter allows, try vectorizing
	the loop with each available vector mode and picking the one with
	the lowest cost.
	(vect_estimate_min_profitable_iters): Record the computed costs
	in the loop_vec_info.

Index: gcc/params.def
===================================================================
--- gcc/params.def	2019-10-31 17:15:25.470517368 +0000
+++ gcc/params.def	2019-11-05 14:19:58.781197820 +0000
@@ -661,6 +661,13 @@ DEFPARAM(PARAM_VECT_MAX_PEELING_FOR_ALIG
          "Maximum number of loop peels to enhance alignment of data references in a loop.",
          -1, -1, 64)
 
+DEFPARAM(PARAM_VECT_COMPARE_LOOP_COSTS,
+	 "vect-compare-loop-costs",
+	 "Whether to try vectorizing a loop using each supported"
+	 " combination of vector types and picking the version with the"
+	 " lowest cost.",
+	 0, 0, 1)
+
 DEFPARAM(PARAM_MAX_CSELIB_MEMORY_LOCATIONS,
 	 "max-cselib-memory-locations",
 	 "The maximum memory locations recorded by cselib.",
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	2019-11-04 21:13:57.611756365 +0000
+++ gcc/doc/invoke.texi	2019-11-05 14:19:58.777197850 +0000
@@ -11563,6 +11563,12 @@ doing loop versioning for alias in the v
 The maximum number of loop peels to enhance access alignment
 for vectorizer. Value -1 means no limit.
 
+@item vect-compare-loop-costs
+Whether to try vectorizing a loop using each supported combination of
+vector types and picking the version with the lowest cost.  This parameter
+has no effect when @option{-fno-vect-cost-model} or
+@option{-fvect-cost-model=unlimited} are used.
+
 @item max-iterations-to-track
 The maximum number of iterations of a loop the brute-force algorithm
 for analysis of the number of iterations of the loop tries to evaluate.
Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h	2019-11-05 14:19:33.829371745 +0000
+++ gcc/tree-vectorizer.h	2019-11-05 14:19:58.781197820 +0000
@@ -601,6 +601,13 @@ typedef class _loop_vec_info : public ve
   /* Cost of a single scalar iteration.  */
   int single_scalar_iteration_cost;
 
+  /* The cost of the vector prologue and epilogue, including peeled
+     iterations and set-up code.  */
+  int vec_outside_cost;
+
+  /* The cost of the vector loop body.  */
+  int vec_inside_cost;
+
   /* Is the loop vectorizable? */
   bool vectorizable;
 
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2019-11-05 14:19:33.829371745 +0000
+++ gcc/tree-vect-loop.c	2019-11-05 14:19:58.781197820 +0000
@@ -830,6 +830,8 @@ _loop_vec_info::_loop_vec_info (class lo
     scan_map (NULL),
     slp_unrolling_factor (1),
     single_scalar_iteration_cost (0),
+    vec_outside_cost (0),
+    vec_inside_cost (0),
     vectorizable (false),
     can_fully_mask_p (true),
     fully_masked_p (false),
@@ -2373,6 +2375,80 @@ vect_analyze_loop_2 (loop_vec_info loop_
   goto start_over;
 }
 
+/* Return true if vectorizing a loop using NEW_LOOP_VINFO appears
+   to be better than vectorizing it using OLD_LOOP_VINFO.  Assume that
+   OLD_LOOP_VINFO is better unless something specifically indicates
+   otherwise.
+
+   Note that this deliberately isn't a partial order.  */
+
+static bool
+vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
+			  loop_vec_info old_loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (new_loop_vinfo);
+  gcc_assert (LOOP_VINFO_LOOP (old_loop_vinfo) == loop);
+
+  poly_int64 new_vf = LOOP_VINFO_VECT_FACTOR (new_loop_vinfo);
+  poly_int64 old_vf = LOOP_VINFO_VECT_FACTOR (old_loop_vinfo);
+
+  /* Always prefer a VF of loop->simdlen over any other VF.  */
+  if (loop->simdlen)
+    {
+      bool new_simdlen_p = known_eq (new_vf, loop->simdlen);
+      bool old_simdlen_p = known_eq (old_vf, loop->simdlen);
+      if (new_simdlen_p != old_simdlen_p)
+	return new_simdlen_p;
+    }
+
+  /* Limit the VFs to what is likely to be the maximum number of iterations,
+     to handle cases in which at least one loop_vinfo is fully-masked.  */
+  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
+  if (estimated_max_niter != -1)
+    {
+      if (known_le (estimated_max_niter, new_vf))
+	new_vf = estimated_max_niter;
+      if (known_le (estimated_max_niter, old_vf))
+	old_vf = estimated_max_niter;
+    }
+
+  /* Check whether the (fractional) cost per scalar iteration is lower
+     or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf.  */
+  poly_widest_int rel_new = (new_loop_vinfo->vec_inside_cost
+			     * poly_widest_int (old_vf));
+  poly_widest_int rel_old = (old_loop_vinfo->vec_inside_cost
+			     * poly_widest_int (new_vf));
+  if (maybe_lt (rel_old, rel_new))
+    return false;
+  if (known_lt (rel_new, rel_old))
+    return true;
+
+  /* If there's nothing to choose between the loop bodies, see whether
+     there's a difference in the prologue and epilogue costs.  */
+  if (new_loop_vinfo->vec_outside_cost != old_loop_vinfo->vec_outside_cost)
+    return new_loop_vinfo->vec_outside_cost < old_loop_vinfo->vec_outside_cost;
+
+  return false;
+}
+
+/* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO.  Return
+   true if we should.  */
+
+static bool
+vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo,
+			loop_vec_info old_loop_vinfo)
+{
+  if (!vect_better_loop_vinfo_p (new_loop_vinfo, old_loop_vinfo))
+    return false;
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "***** Preferring vector mode %s to vector mode %s\n",
+		     GET_MODE_NAME (new_loop_vinfo->vector_mode),
+		     GET_MODE_NAME (old_loop_vinfo->vector_mode));
+  return true;
+}
+
 /* Function vect_analyze_loop.
 
    Apply a set of analyses on LOOP, and create a loop_vec_info struct
@@ -2408,6 +2484,8 @@ vect_analyze_loop (class loop *loop, vec
   machine_mode next_vector_mode = VOIDmode;
   poly_uint64 lowest_th = 0;
   unsigned vectorized_loops = 0;
+  bool pick_lowest_cost_p = (PARAM_VALUE (PARAM_VECT_COMPARE_LOOP_COSTS)
+			     && !unlimited_cost_model (loop));
 
   bool vect_epilogues = false;
   opt_result res = opt_result::success ();
@@ -2428,6 +2506,34 @@ vect_analyze_loop (class loop *loop, vec
 
       bool fatal = false;
 
+      /* When pick_lowest_cost_p is true, we should in principle iterate
+	 over all the loop_vec_infos that LOOP_VINFO could replace and
+	 try to vectorize LOOP_VINFO under the same conditions.
+	 E.g. when trying to replace an epilogue loop, we should vectorize
+	 LOOP_VINFO as an epilogue loop with the same VF limit.  When trying
+	 to replace the main loop, we should vectorize LOOP_VINFO as a main
+	 loop too.
+
+	 However, autovectorize_vector_modes is usually sorted as follows:
+
+	 - Modes that naturally produce lower VFs usually follow modes that
+	   naturally produce higher VFs.
+
+	 - When modes naturally produce the same VF, maskable modes
+	   usually follow unmaskable ones, so that the maskable mode
+	   can be used to vectorize the epilogue of the unmaskable mode.
+
+	 This order is preferred because it leads to the maximum
+	 epilogue vectorization opportunities.  Targets should only use
+	 a different order if they want to make wide modes available while
+	 disparaging them relative to earlier, smaller modes.  The assumption
+	 in that case is that the wider modes are more expensive in some
+	 way that isn't reflected directly in the costs.
+
+	 There should therefore be few interesting cases in which
+	 LOOP_VINFO fails when treated as an epilogue loop, succeeds when
+	 treated as a standalone loop, and ends up being genuinely cheaper
+	 than FIRST_LOOP_VINFO.  */
       if (vect_epilogues)
 	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
 
@@ -2475,13 +2581,34 @@ vect_analyze_loop (class loop *loop, vec
 	      LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL;
 	      simdlen = 0;
 	    }
+	  else if (pick_lowest_cost_p && first_loop_vinfo)
+	    {
+	      /* Keep trying to roll back vectorization attempts while the
+		 loop_vec_infos they produced were worse than this one.  */
+	      vec<loop_vec_info> &vinfos = first_loop_vinfo->epilogue_vinfos;
+	      while (!vinfos.is_empty ()
+		     && vect_joust_loop_vinfos (loop_vinfo, vinfos.last ()))
+		{
+		  gcc_assert (vect_epilogues);
+		  delete vinfos.pop ();
+		}
+	      if (vinfos.is_empty ()
+		  && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo))
+		{
+		  delete first_loop_vinfo;
+		  first_loop_vinfo = opt_loop_vec_info::success (NULL);
+		  LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL;
+		}
+	    }
 
 	  if (first_loop_vinfo == NULL)
 	    {
 	      first_loop_vinfo = loop_vinfo;
 	      lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
 	    }
-	  else if (vect_epilogues)
+	  else if (vect_epilogues
+		   /* For now only allow one epilogue loop.  */
+		   && first_loop_vinfo->epilogue_vinfos.is_empty ())
 	    {
 	      first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
 	      poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
@@ -2501,12 +2628,14 @@ vect_analyze_loop (class loop *loop, vec
 			    && loop->inner == NULL
 			    && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK)
 			    && LOOP_VINFO_PEELING_FOR_NITER (first_loop_vinfo)
-			    /* For now only allow one epilogue loop.  */
-			    && first_loop_vinfo->epilogue_vinfos.is_empty ());
+			    /* For now only allow one epilogue loop, but allow
+			       pick_lowest_cost_p to replace it.  */
+			    && (first_loop_vinfo->epilogue_vinfos.is_empty ()
+				|| pick_lowest_cost_p));
 
 	  /* Commit to first_loop_vinfo if we have no reason to try
 	     alternatives.  */
-	  if (!simdlen && !vect_epilogues)
+	  if (!simdlen && !vect_epilogues && !pick_lowest_cost_p)
 	    break;
 	}
       else
@@ -3454,7 +3583,11 @@ vect_estimate_min_profitable_iters (loop
 	       &vec_inside_cost, &vec_epilogue_cost);
 
   vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost);
-  
+
+  /* Stash the costs so that we can compare two loop_vec_infos.  */
+  loop_vinfo->vec_inside_cost = vec_inside_cost;
+  loop_vinfo->vec_outside_cost = vec_outside_cost;
+
   if (dump_enabled_p ())
     {
       dump_printf_loc (MSG_NOTE, vect_location, "Cost model analysis: \n");


* [5/6] Account for the cost of generating loop masks
  2019-11-05 14:24 [0/6] Optionally pick the cheapest loop_vec_info Richard Sandiford
                   ` (3 preceding siblings ...)
  2019-11-05 14:29 ` [4/6] Optionally pick the cheapest loop_vec_info Richard Sandiford
@ 2019-11-05 14:31 ` Richard Sandiford
  2019-11-06 12:16   ` Richard Biener
  2019-11-05 14:32 ` [6/6][AArch64] Enable vect-compare-loop-costs by default for SVE Richard Sandiford
  5 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2019-11-05 14:31 UTC (permalink / raw)
  To: gcc-patches

We didn't take the cost of generating loop masks into account, and so
tended to underestimate the cost of loops that need multiple masks.
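As a hedged illustration of the bookkeeping (the helper below is hypothetical and flattens the patch's rgroup_masks structure to an array of vector counts): each mask rgroup contributes its number of vectors to the total, the prologue pays for every mask, and the body pays for all but one, since one body mask replaces the scalar loop's comparison, which was never counted against the scalar body either.

```c
/* Sum the number of masks over all rgroups that have a mask type.
   The patch then costs NUM_MASKS vector_stmt entries in the prologue
   and NUM_MASKS - 1 in the loop body.  */
unsigned
count_loop_masks (const unsigned *rgroup_nvectors, int nrgroups)
{
  unsigned num_masks = 0;
  for (int i = 0; i < nrgroups; i++)
    num_masks += rgroup_nvectors[i];
  return num_masks;
}
```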


2019-11-05  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* tree-vect-loop.c (vect_estimate_min_profitable_iters): Include
	the cost of generating loop masks.

gcc/testsuite/
	* gcc.target/aarch64/sve/mask_struct_store_3.c: Add
	-fno-vect-cost-model.
	* gcc.target/aarch64/sve/mask_struct_store_3_run.c: Likewise.
	* gcc.target/aarch64/sve/peel_ind_3.c: Likewise.
	* gcc.target/aarch64/sve/peel_ind_3_run.c: Likewise.

Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2019-11-05 14:19:58.781197820 +0000
+++ gcc/tree-vect-loop.c	2019-11-05 14:20:40.188909187 +0000
@@ -3435,6 +3435,32 @@ vect_estimate_min_profitable_iters (loop
 				  si->kind, si->stmt_info, si->misalign,
 				  vect_epilogue);
 	}
+
+      /* Calculate how many masks we need to generate.  */
+      unsigned int num_masks = 0;
+      rgroup_masks *rgm;
+      unsigned int num_vectors_m1;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
+	if (rgm->mask_type)
+	  num_masks += num_vectors_m1 + 1;
+      gcc_assert (num_masks > 0);
+
+      /* In the worst case, we need to generate each mask in the prologue
+	 and in the loop body.  One of the loop body mask instructions
+	 replaces the comparison in the scalar loop, and since we don't
+	 count the scalar comparison against the scalar body, we shouldn't
+	 count that vector instruction against the vector body either.
+
+	 Sometimes we can use unpacks instead of generating prologue
+	 masks and sometimes the prologue mask will fold to a constant,
+	 so the actual prologue cost might be smaller.  However, it's
+	 simpler and safer to use the worst-case cost; if this ends up
+	 being the tie-breaker between vectorizing or not, then it's
+	 probably better not to vectorize.  */
+      (void) add_stmt_cost (target_cost_data, num_masks, vector_stmt,
+			    NULL, 0, vect_prologue);
+      (void) add_stmt_cost (target_cost_data, num_masks - 1, vector_stmt,
+			    NULL, 0, vect_body);
     }
   else if (npeel < 0)
     {
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3.c	2019-03-08 18:14:29.768994780 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3.c	2019-11-05 14:20:40.184909216 +0000
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -fno-vect-cost-model" } */
 
 #include <stdint.h>
 
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3_run.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3_run.c	2019-03-08 18:14:29.772994767 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3_run.c	2019-11-05 14:20:40.184909216 +0000
@@ -1,5 +1,5 @@
 /* { dg-do run { target aarch64_sve_hw } } */
-/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -fno-vect-cost-model" } */
 
 #include "mask_struct_store_3.c"
 
Index: gcc/testsuite/gcc.target/aarch64/sve/peel_ind_3.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/peel_ind_3.c	2019-03-08 18:14:29.776994751 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/peel_ind_3.c	2019-11-05 14:20:40.184909216 +0000
@@ -1,7 +1,7 @@
 /* { dg-do compile } */
 /* Pick an arbitrary target for which unaligned accesses are more
    expensive.  */
-/* { dg-options "-O3 -msve-vector-bits=256 -mtune=thunderx" } */
+/* { dg-options "-O3 -msve-vector-bits=256 -mtune=thunderx -fno-vect-cost-model" } */
 
 #define N 32
 #define MAX_START 8
Index: gcc/testsuite/gcc.target/aarch64/sve/peel_ind_3_run.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/peel_ind_3_run.c	2019-03-08 18:14:29.784994721 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/peel_ind_3_run.c	2019-11-05 14:20:40.184909216 +0000
@@ -1,6 +1,6 @@
 /* { dg-do run { target aarch64_sve_hw } } */
-/* { dg-options "-O3 -mtune=thunderx" } */
-/* { dg-options "-O3 -mtune=thunderx -msve-vector-bits=256" { target aarch64_sve256_hw } } */
+/* { dg-options "-O3 -mtune=thunderx -fno-vect-cost-model" } */
+/* { dg-options "-O3 -mtune=thunderx -msve-vector-bits=256 -fno-vect-cost-model" { target aarch64_sve256_hw } } */
 
 #include "peel_ind_3.c"
 


* [6/6][AArch64] Enable vect-compare-loop-costs by default for SVE
  2019-11-05 14:24 [0/6] Optionally pick the cheapest loop_vec_info Richard Sandiford
                   ` (4 preceding siblings ...)
  2019-11-05 14:31 ` [5/6] Account for the cost of generating loop masks Richard Sandiford
@ 2019-11-05 14:32 ` Richard Sandiford
  5 siblings, 0 replies; 23+ messages in thread
From: Richard Sandiford @ 2019-11-05 14:32 UTC (permalink / raw)
  To: gcc-patches

This patch enables vect-compare-loop-costs by default for SVE, both so
that we can compare SVE against Advanced SIMD and so that (with future
patches) we can compare multiple SVE vectorisation approaches against
each other.

I'll apply if the prerequisites are approved.


2019-11-05  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* config/aarch64/aarch64.c (aarch64_override_options_internal):
	Set the default value of PARAM_VECT_COMPARE_LOOP_COSTS to 1
	when SVE is enabled.

gcc/testsuite/
	* gcc.target/aarch64/sve/reduc_3.c: Split multi-vector cases out
	into...
	* gcc.target/aarch64/sve/reduc_3_costly.c: ...this new test,
	passing -fno-vect-cost-model for them.
	* gcc.target/aarch64/sve/slp_6.c: Add -fno-vect-cost-model.
	* gcc.target/aarch64/sve/slp_7.c,
	* gcc.target/aarch64/sve/slp_7_run.c: Split multi-vector cases out
	into...
	* gcc.target/aarch64/sve/slp_7_costly.c,
	* gcc.target/aarch64/sve/slp_7_costly_run.c: ...these new tests,
	passing -fno-vect-cost-model for them.

Index: gcc/config/aarch64/aarch64.c
===================================================================
--- gcc/config/aarch64/aarch64.c	2019-11-05 11:04:15.559298615 +0000
+++ gcc/config/aarch64/aarch64.c	2019-11-05 14:21:15.416663625 +0000
@@ -13308,6 +13308,14 @@ aarch64_override_options_internal (struc
   initialize_aarch64_code_model (opts);
   initialize_aarch64_tls_size (opts);
 
+  /* Enable vect-compare-loop-costs by default for SVE, both so that we
+     can compare SVE against Advanced SIMD and so that we can compare
+     multiple SVE vectorisation approaches against each other.  */
+  if (TARGET_SVE)
+    maybe_set_param_value (PARAM_VECT_COMPARE_LOOP_COSTS, 1,
+			   opts->x_param_values,
+			   global_options_set.x_param_values);
+
   int queue_depth = 0;
   switch (aarch64_tune_params.autoprefetcher_model)
     {
Index: gcc/testsuite/gcc.target/aarch64/sve/reduc_3.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/reduc_3.c	2019-03-08 18:14:29.784994721 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/reduc_3.c	2019-11-05 14:21:15.416663625 +0000
@@ -17,7 +17,6 @@ void reduc_ptr_##DSTTYPE##_##SRCTYPE (DS
 
 REDUC_PTR (int8_t, int8_t)
 REDUC_PTR (int16_t, int16_t)
-
 REDUC_PTR (int32_t, int32_t)
 REDUC_PTR (int64_t, int64_t)
 
@@ -25,17 +24,6 @@ REDUC_PTR (_Float16, _Float16)
 REDUC_PTR (float, float)
 REDUC_PTR (double, double)
 
-/* Widening reductions.  */
-REDUC_PTR (int32_t, int8_t)
-REDUC_PTR (int32_t, int16_t)
-
-REDUC_PTR (int64_t, int8_t)
-REDUC_PTR (int64_t, int16_t)
-REDUC_PTR (int64_t, int32_t)
-
-REDUC_PTR (float, _Float16)
-REDUC_PTR (double, float)
-
 /* Float<>Int conversions */
 REDUC_PTR (_Float16, int16_t)
 REDUC_PTR (float, int32_t)
@@ -45,8 +33,14 @@ REDUC_PTR (int16_t, _Float16)
 REDUC_PTR (int32_t, float)
 REDUC_PTR (int64_t, double)
 
-/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 3 } } */
-/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 4 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 { xfail *-*-* } } } */
+/* We don't yet vectorize the int<-float cases.  */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tfaddv\th[0-9]+, p[0-7], z[0-9]+\.h\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 3 } } */
-/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/reduc_3_costly.c
===================================================================
--- /dev/null	2019-09-17 11:41:18.176664108 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/reduc_3_costly.c	2019-11-05 14:21:15.416663625 +0000
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -fno-vect-cost-model" } */
+
+#include <stdint.h>
+
+#define NUM_ELEMS(TYPE) (32 / sizeof (TYPE))
+
+#define REDUC_PTR(DSTTYPE, SRCTYPE)				\
+void reduc_ptr_##DSTTYPE##_##SRCTYPE (DSTTYPE *restrict sum,	\
+				      SRCTYPE *restrict array,	\
+				      int count)		\
+{								\
+  *sum = 0;							\
+  for (int i = 0; i < count; ++i)				\
+    *sum += array[i];						\
+}
+
+/* Widening reductions.  */
+REDUC_PTR (int32_t, int8_t)
+REDUC_PTR (int32_t, int16_t)
+
+REDUC_PTR (int64_t, int8_t)
+REDUC_PTR (int64_t, int16_t)
+REDUC_PTR (int64_t, int32_t)
+
+REDUC_PTR (float, _Float16)
+REDUC_PTR (double, float)
+
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 1 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/slp_6.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/slp_6.c	2019-03-08 18:14:29.780994734 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/slp_6.c	2019-11-05 14:21:15.416663625 +0000
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -ffast-math" } */
+/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -ffast-math -fno-vect-cost-model" } */
 
 #include <stdint.h>
 
Index: gcc/testsuite/gcc.target/aarch64/sve/slp_7.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/slp_7.c	2019-10-25 10:13:15.544226032 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/slp_7.c	2019-11-05 14:21:15.416663625 +0000
@@ -31,37 +31,27 @@ #define TEST_ALL(T)				\
   T (uint16_t)					\
   T (int32_t)					\
   T (uint32_t)					\
-  T (int64_t)					\
-  T (uint64_t)					\
   T (_Float16)					\
-  T (float)					\
-  T (double)
+  T (float)
 
 TEST_ALL (VEC_PERM)
 
-/* We can't use SLP for the 64-bit loops, since the number of reduction
-   results might be greater than the number of elements in the vector.
-   Otherwise we have two loads per loop, one for the initial vector
-   and one for the loop body.  */
+/* We have two loads per loop, one for the initial vector and one for
+   the loop body.  */
 /* { dg-final { scan-assembler-times {\tld1b\t} 2 } } */
 /* { dg-final { scan-assembler-times {\tld1h\t} 3 } } */
 /* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
-/* { dg-final { scan-assembler-times {\tld4d\t} 3 } } */
 /* { dg-final { scan-assembler-not {\tld4b\t} } } */
 /* { dg-final { scan-assembler-not {\tld4h\t} } } */
 /* { dg-final { scan-assembler-not {\tld4w\t} } } */
-/* { dg-final { scan-assembler-not {\tld1d\t} } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 8 } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 8 } } */
 /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s} 8 } } */
-/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 8 } } */
 /* { dg-final { scan-assembler-times {\tfaddv\th[0-9]+, p[0-7], z[0-9]+\.h} 4 } } */
 /* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s} 4 } } */
-/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 4 } } */
 
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 } } */
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 } } */
 /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
 
 /* { dg-final { scan-assembler-not {\tuqdec} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/slp_7_run.c
===================================================================
--- gcc/testsuite/gcc.target/aarch64/sve/slp_7_run.c	2019-03-08 18:14:29.780994734 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/slp_7_run.c	2019-11-05 14:21:15.416663625 +0000
@@ -1,7 +1,11 @@
 /* { dg-do run { target aarch64_sve_hw } } */
 /* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
 
-#include "slp_7.c"
+#ifndef FILENAME
+#define FILENAME "slp_7.c"
+#endif
+
+#include FILENAME
 
 #define N (54 * 4)
 
Index: gcc/testsuite/gcc.target/aarch64/sve/slp_7_costly.c
===================================================================
--- /dev/null	2019-09-17 11:41:18.176664108 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/slp_7_costly.c	2019-11-05 14:21:15.416663625 +0000
@@ -0,0 +1,43 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -msve-vector-bits=scalable -ffast-math -fno-vect-cost-model" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE)						\
+void __attribute__ ((noinline, noclone))			\
+vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
+{								\
+  TYPE x0 = b[0];						\
+  TYPE x1 = b[1];						\
+  TYPE x2 = b[2];						\
+  TYPE x3 = b[3];						\
+  for (int i = 0; i < n; ++i)					\
+    {								\
+      x0 += a[i * 4];						\
+      x1 += a[i * 4 + 1];					\
+      x2 += a[i * 4 + 2];					\
+      x3 += a[i * 4 + 3];					\
+    }								\
+  b[0] = x0;							\
+  b[1] = x1;							\
+  b[2] = x2;							\
+  b[3] = x3;							\
+}
+
+#define TEST_ALL(T)				\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (double)
+
+TEST_ALL (VEC_PERM)
+
+/* We can't use SLP for the 64-bit loops, since the number of reduction
+   results might be greater than the number of elements in the vector.  */
+/* { dg-final { scan-assembler-times {\tld4d\t} 3 } } */
+/* { dg-final { scan-assembler-not {\tld1d\t} } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 8 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 4 } } */
+
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
+
+/* { dg-final { scan-assembler-not {\tuqdec} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/slp_7_costly_run.c
===================================================================
--- /dev/null	2019-09-17 11:41:18.176664108 +0100
+++ gcc/testsuite/gcc.target/aarch64/sve/slp_7_costly_run.c	2019-11-05 14:21:15.416663625 +0000
@@ -0,0 +1,5 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -fno-vect-cost-model" } */
+
+#define FILENAME "slp_7_costly.c"
+#include "slp_7_run.c"


* Re: [1/6] Fix vectorizable_conversion costs
  2019-11-05 14:25 ` [1/6] Fix vectorizable_conversion costs Richard Sandiford
@ 2019-11-06 12:01   ` Richard Biener
  2019-11-07 15:14     ` Richard Sandiford
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Biener @ 2019-11-06 12:01 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Tue, Nov 5, 2019 at 3:25 PM Richard Sandiford
<Richard.Sandiford@arm.com> wrote:
>
> This patch makes two tweaks to vectorizable_conversion.  The first
> is to use "modifier" to distinguish between promotion, demotion,
> and neither promotion nor demotion, rather than using a code for
> some cases and "modifier" for others.  The second is to take ncopies
> into account for the promotion and demotion costs; previously we gave
> multiple copies the same cost as a single copy.
>
> Later patches test this, but it seemed worth splitting out.
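A minimal sketch of the revised cost model described above (plain integers rather than GCC's cost vectors; the function name is illustrative, not the GCC API):

```c
/* Illustrative sketch, not the GCC implementation: after the patch,
   each of the PWR + 1 promotion/demotion steps is charged one
   vec_promote_demote per vector result, and the number of results
   doubles at each step, starting from NCOPIES at the narrowest end
   of the operation chain.  */
static unsigned int
promotion_demotion_cost (unsigned int ncopies, int pwr)
{
  unsigned int inside_cost = 0;
  for (int i = 0; i < pwr + 1; i++)
    {
      inside_cost += ncopies;
      ncopies *= 2;
    }
  return inside_cost;
}
```

Under this sketch a single-step demotion with two copies costs 2 rather than the 1 it was given before, and promotions pass `ncopies * 2` because the initial unpacking step produces two vector results per copy.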

OK, but does ncopies properly handle unrolling with SLP?

Richard.

>
> 2019-11-05  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
>         * tree-vect-stmts.c (vect_model_promotion_demotion_cost): Take the
>         number of ncopies as an additional argument.
>         (vectorizable_conversion): Update call accordingly.  Use "modifier"
>         to check whether a conversion is between vectors with the same
>         numbers of units.
>
> Index: gcc/tree-vect-stmts.c
> ===================================================================
> --- gcc/tree-vect-stmts.c       2019-11-05 11:08:12.521631453 +0000
> +++ gcc/tree-vect-stmts.c       2019-11-05 14:17:43.330141911 +0000
> @@ -917,26 +917,27 @@ vect_model_simple_cost (stmt_vec_info st
>  }
>
>
> -/* Model cost for type demotion and promotion operations.  PWR is normally
> -   zero for single-step promotions and demotions.  It will be one if
> -   two-step promotion/demotion is required, and so on.  Each additional
> +/* Model cost for type demotion and promotion operations.  PWR is
> +   normally zero for single-step promotions and demotions.  It will be
> +   one if two-step promotion/demotion is required, and so on.  NCOPIES
> +   is the number of vector results (and thus number of instructions)
> +   for the narrowest end of the operation chain.  Each additional
>     step doubles the number of instructions required.  */
>
>  static void
>  vect_model_promotion_demotion_cost (stmt_vec_info stmt_info,
> -                                   enum vect_def_type *dt, int pwr,
> +                                   enum vect_def_type *dt,
> +                                   unsigned int ncopies, int pwr,
>                                     stmt_vector_for_cost *cost_vec)
>  {
> -  int i, tmp;
> +  int i;
>    int inside_cost = 0, prologue_cost = 0;
>
>    for (i = 0; i < pwr + 1; i++)
>      {
> -      tmp = (STMT_VINFO_TYPE (stmt_info) == type_promotion_vec_info_type) ?
> -       (i + 1) : i;
> -      inside_cost += record_stmt_cost (cost_vec, vect_pow2 (tmp),
> -                                      vec_promote_demote, stmt_info, 0,
> -                                      vect_body);
> +      inside_cost += record_stmt_cost (cost_vec, ncopies, vec_promote_demote,
> +                                      stmt_info, 0, vect_body);
> +      ncopies *= 2;
>      }
>
>    /* FORNOW: Assuming maximum 2 args per stmts.  */
> @@ -4981,7 +4982,7 @@ vectorizable_conversion (stmt_vec_info s
>    if (!vec_stmt)               /* transformation not required.  */
>      {
>        DUMP_VECT_SCOPE ("vectorizable_conversion");
> -      if (code == FIX_TRUNC_EXPR || code == FLOAT_EXPR)
> +      if (modifier == NONE)
>          {
>           STMT_VINFO_TYPE (stmt_info) = type_conversion_vec_info_type;
>           vect_model_simple_cost (stmt_info, ncopies, dt, ndts, slp_node,
> @@ -4990,14 +4991,17 @@ vectorizable_conversion (stmt_vec_info s
>        else if (modifier == NARROW)
>         {
>           STMT_VINFO_TYPE (stmt_info) = type_demotion_vec_info_type;
> -         vect_model_promotion_demotion_cost (stmt_info, dt, multi_step_cvt,
> -                                             cost_vec);
> +         /* The final packing step produces one vector result per copy.  */
> +         vect_model_promotion_demotion_cost (stmt_info, dt, ncopies,
> +                                             multi_step_cvt, cost_vec);
>         }
>        else
>         {
>           STMT_VINFO_TYPE (stmt_info) = type_promotion_vec_info_type;
> -         vect_model_promotion_demotion_cost (stmt_info, dt, multi_step_cvt,
> -                                             cost_vec);
> +         /* The initial unpacking step produces two vector results
> +            per copy.  */
> +         vect_model_promotion_demotion_cost (stmt_info, dt, ncopies * 2,
> +                                             multi_step_cvt, cost_vec);
>         }
>        interm_types.release ();
>        return true;


* Re: [2/6] Don't assign a cost to vectorizable_assignment
  2019-11-05 14:27 ` [2/6] Don't assign a cost to vectorizable_assignment Richard Sandiford
@ 2019-11-06 12:04   ` Richard Biener
  2019-11-06 15:58     ` Richard Sandiford
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Biener @ 2019-11-06 12:04 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Tue, Nov 5, 2019 at 3:27 PM Richard Sandiford
<Richard.Sandiford@arm.com> wrote:
>
> vectorizable_assignment handles true SSA-to-SSA copies (which hopefully
> we don't see in practice) and no-op conversions that are required
> to maintain correct gimple, such as changes between signed and
> unsigned types.  These cases shouldn't generate any code and so
> shouldn't count against either the scalar or vector costs.
>
> Later patches test this, but it seemed worth splitting out.
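As a hedged illustration (not taken from the patch itself), the cast below is a conversion of the kind vectorizable_assignment handles: it keeps gimple type-correct but changes no bits and emits no instruction, so charging it to either the scalar or the vector cost would skew the comparison:

```c
#include <stdint.h>

/* Illustrative only: the unsigned-to-signed cast is a no-op
   conversion; it generates no code in either the scalar or the
   vector loop, and after the patch it is no longer costed.  */
void
copy_with_sign_change (int32_t *restrict dst, const uint32_t *restrict src,
		       int n)
{
  for (int i = 0; i < n; ++i)
    dst[i] = (int32_t) src[i];
}
```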

Hmm, but you have to adjust vect_compute_single_scalar_iteration_cost and
possibly the SLP cost walk as well, otherwise we're artificially making
those copies cheaper when vectorized.

>
> 2019-11-04  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
>         * tree-vect-stmts.c (vectorizable_assignment): Don't add a cost.
>
> Index: gcc/tree-vect-stmts.c
> ===================================================================
> --- gcc/tree-vect-stmts.c       2019-11-05 14:17:43.330141911 +0000
> +++ gcc/tree-vect-stmts.c       2019-11-05 14:18:39.169752725 +0000
> @@ -5305,7 +5305,7 @@ vectorizable_conversion (stmt_vec_info s
>  static bool
>  vectorizable_assignment (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>                          stmt_vec_info *vec_stmt, slp_tree slp_node,
> -                        stmt_vector_for_cost *cost_vec)
> +                        stmt_vector_for_cost *)
>  {
>    tree vec_dest;
>    tree scalar_dest;
> @@ -5313,7 +5313,6 @@ vectorizable_assignment (stmt_vec_info s
>    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
>    tree new_temp;
>    enum vect_def_type dt[1] = {vect_unknown_def_type};
> -  int ndts = 1;
>    int ncopies;
>    int i, j;
>    vec<tree> vec_oprnds = vNULL;
> @@ -5409,7 +5408,8 @@ vectorizable_assignment (stmt_vec_info s
>      {
>        STMT_VINFO_TYPE (stmt_info) = assignment_vec_info_type;
>        DUMP_VECT_SCOPE ("vectorizable_assignment");
> -      vect_model_simple_cost (stmt_info, ncopies, dt, ndts, slp_node, cost_vec);
> +      /* Don't add a cost here.  SSA copies and no-op conversions
> +        shouldn't generate any code in either scalar or vector form.  */
>        return true;
>      }
>


* Re: [3/6] Avoid accounting for non-existent vector loop versioning
  2019-11-05 14:28 ` [3/6] Avoid accounting for non-existent vector loop versioning Richard Sandiford
@ 2019-11-06 12:05   ` Richard Biener
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Biener @ 2019-11-06 12:05 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Tue, Nov 5, 2019 at 3:28 PM Richard Sandiford
<Richard.Sandiford@arm.com> wrote:
>
> vect_analyze_loop_costing uses two profitability thresholds: a runtime
> one and a static compile-time one.  The runtime one is simply the point
> at which the vector loop is cheaper than the scalar loop, while the
> static one also takes into account the cost of choosing between the
> scalar and vector loops at runtime.  We compare this static cost against
> the expected execution frequency to decide whether it's worth generating
> any vector code at all.
>
> However, we never reclaimed the cost of applying the runtime threshold
> if it turned out that the vector code can always be used.  And we only
> know whether that's true once we've calculated what the runtime
> threshold would be.
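The reclaiming step described above amounts to the following decision, sketched with plain ints (the real code also checks that no versioning, peeling, or alignment prologue is needed):

```c
/* Sketch, not the GCC code: when no runtime choice between the
   scalar and vector loops remains, the static profitability
   threshold can drop back to the plain break-even point.  */
static int
static_threshold (int min_profitable_estimate, int min_profitable_iters,
		  int needs_runtime_choice)
{
  if (min_profitable_estimate > min_profitable_iters
      && !needs_runtime_choice)
    return min_profitable_iters;
  return min_profitable_estimate;
}
```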

OK.

>
> 2019-11-04  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
>         * tree-vectorizer.h (vect_apply_runtime_profitability_check_p):
>         New function.
>         * tree-vect-loop-manip.c (vect_loop_versioning): Use it.
>         * tree-vect-loop.c (vect_analyze_loop_2): Likewise.
>         (vect_transform_loop): Likewise.
>         (vect_analyze_loop_costing): Don't take the cost of versioning
>         into account for the static profitability threshold if it turns
>         out that no versioning is needed.
>
> Index: gcc/tree-vectorizer.h
> ===================================================================
> --- gcc/tree-vectorizer.h       2019-11-05 11:14:42.786884473 +0000
> +++ gcc/tree-vectorizer.h       2019-11-05 14:19:33.829371745 +0000
> @@ -1557,6 +1557,17 @@ vect_get_scalar_dr_size (dr_vec_info *dr
>    return tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr_info->dr))));
>  }
>
> +/* Return true if LOOP_VINFO requires a runtime check for whether the
> +   vector loop is profitable.  */
> +
> +inline bool
> +vect_apply_runtime_profitability_check_p (loop_vec_info loop_vinfo)
> +{
> +  unsigned int th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
> +  return (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> +         && th >= vect_vf_for_cost (loop_vinfo));
> +}
> +
>  /* Source location + hotness information. */
>  extern dump_user_location_t vect_location;
>
> Index: gcc/tree-vect-loop-manip.c
> ===================================================================
> --- gcc/tree-vect-loop-manip.c  2019-11-05 10:38:31.838181047 +0000
> +++ gcc/tree-vect-loop-manip.c  2019-11-05 14:19:33.825371773 +0000
> @@ -3173,8 +3173,7 @@ vect_loop_versioning (loop_vec_info loop
>      = LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (loop_vinfo);
>    unsigned th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
>
> -  if (th >= vect_vf_for_cost (loop_vinfo)
> -      && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> +  if (vect_apply_runtime_profitability_check_p (loop_vinfo)
>        && !ordered_p (th, versioning_threshold))
>      cond_expr = fold_build2 (GE_EXPR, boolean_type_node, scalar_loop_iters,
>                              build_int_cst (TREE_TYPE (scalar_loop_iters),
> Index: gcc/tree-vect-loop.c
> ===================================================================
> --- gcc/tree-vect-loop.c        2019-11-05 11:14:42.782884501 +0000
> +++ gcc/tree-vect-loop.c        2019-11-05 14:19:33.829371745 +0000
> @@ -1689,6 +1689,24 @@ vect_analyze_loop_costing (loop_vec_info
>        return 0;
>      }
>
> +  /* The static profitability threshold min_profitable_estimate includes
> +     the cost of having to check at runtime whether the scalar loop
> +     should be used instead.  If it turns out that we don't need or want
> +     such a check, the threshold we should use for the static estimate
> +     is simply the point at which the vector loop becomes more profitable
> +     than the scalar loop.  */
> +  if (min_profitable_estimate > min_profitable_iters
> +      && !LOOP_REQUIRES_VERSIONING (loop_vinfo)
> +      && !LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> +      && !LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> +      && !vect_apply_runtime_profitability_check_p (loop_vinfo))
> +    {
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_NOTE, vect_location, "no need for a runtime"
> +                        " choice between the scalar and vector loops\n");
> +      min_profitable_estimate = min_profitable_iters;
> +    }
> +
>    HOST_WIDE_INT estimated_niter;
>
>    /* If we are vectorizing an epilogue then we know the maximum number of
> @@ -2225,8 +2243,7 @@ vect_analyze_loop_2 (loop_vec_info loop_
>
>        /*  Use the same condition as vect_transform_loop to decide when to use
>           the cost to determine a versioning threshold.  */
> -      if (th >= vect_vf_for_cost (loop_vinfo)
> -         && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> +      if (vect_apply_runtime_profitability_check_p (loop_vinfo)
>           && ordered_p (th, niters_th))
>         niters_th = ordered_max (poly_uint64 (th), niters_th);
>
> @@ -8268,14 +8285,13 @@ vect_transform_loop (loop_vec_info loop_
>       run at least the (estimated) vectorization factor number of times
>       checking is pointless, too.  */
>    th = LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo);
> -  if (th >= vect_vf_for_cost (loop_vinfo)
> -      && !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +  if (vect_apply_runtime_profitability_check_p (loop_vinfo))
>      {
> -       if (dump_enabled_p ())
> -         dump_printf_loc (MSG_NOTE, vect_location,
> -                          "Profitability threshold is %d loop iterations.\n",
> -                          th);
> -       check_profitability = true;
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_NOTE, vect_location,
> +                        "Profitability threshold is %d loop iterations.\n",
> +                        th);
> +      check_profitability = true;
>      }
>
>    /* Make sure there exists a single-predecessor exit bb.  Do this before


* Re: [4/6] Optionally pick the cheapest loop_vec_info
  2019-11-05 14:29 ` [4/6] Optionally pick the cheapest loop_vec_info Richard Sandiford
@ 2019-11-06 12:09   ` Richard Biener
  2019-11-06 14:01     ` Richard Sandiford
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Biener @ 2019-11-06 12:09 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Tue, Nov 5, 2019 at 3:29 PM Richard Sandiford
<Richard.Sandiford@arm.com> wrote:
>
> This patch adds a mode in which the vectoriser tries each available
> base vector mode and picks the one with the lowest cost.  For now
> the behaviour is behind a default-off --param, but a later patch
> enables it by default for SVE.
>
> The patch keeps the current behaviour of preferring a VF of
> loop->simdlen over any larger or smaller VF, regardless of costs
> or target preferences.
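The core of the cost comparison this patch introduces (in vect_better_loop_vinfo_p below) is a per-scalar-iteration comparison done by cross-multiplying, so no division is needed; a sketch with plain ints (the real code uses poly_int arithmetic and handles ties and loop->simdlen separately):

```c
/* Sketch only: decide whether new_inside / new_vf is strictly less
   than old_inside / old_vf by cross-multiplying each inside cost
   with the other candidate's vectorization factor.  */
static int
new_loop_cheaper_p (long new_inside, long new_vf,
		    long old_inside, long old_vf)
{
  return new_inside * old_vf < old_inside * new_vf;
}
```

For example, 20 cost units spread over 8 lanes (2.5 per scalar iteration) beats 12 units over 4 lanes (3 per iteration), while an exact tie is not "better" and the old loop_vec_info is kept.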

Can you avoid using a --param for this?  Instead I'd suggest to
amend the vectorize_modes target hook to return some
flags like VECT_FIRST_MODE_WINS.  We'd eventually want
to make the target able to say do-not-vectorize-epilogues-of-MODE
(I think we may not want to vectorize SSE vectorized loop
epilogues with MMX-with-SSE or GPRs for example).  I guess
for the latter we'd use a new target hook.

Otherwise looks reasonable.

Richard.

>
> 2019-11-05  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
>         * params.def (vect-compare-loop-costs): New param.
>         * doc/invoke.texi: Document it.
>         * tree-vectorizer.h (_loop_vec_info::vec_outside_cost)
>         (_loop_vec_info::vec_inside_cost): New member variables.
>         * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize them.
>         (vect_better_loop_vinfo_p, vect_joust_loop_vinfos): New functions.
>         (vect_analyze_loop): When the new parameter allows, try vectorizing
>         the loop with each available vector mode and picking the one with
>         the lowest cost.
>         (vect_estimate_min_profitable_iters): Record the computed costs
>         in the loop_vec_info.
>
> Index: gcc/params.def
> ===================================================================
> --- gcc/params.def      2019-10-31 17:15:25.470517368 +0000
> +++ gcc/params.def      2019-11-05 14:19:58.781197820 +0000
> @@ -661,6 +661,13 @@ DEFPARAM(PARAM_VECT_MAX_PEELING_FOR_ALIG
>           "Maximum number of loop peels to enhance alignment of data references in a loop.",
>           -1, -1, 64)
>
> +DEFPARAM(PARAM_VECT_COMPARE_LOOP_COSTS,
> +        "vect-compare-loop-costs",
> +        "Whether to try vectorizing a loop using each supported"
> +        " combination of vector types and picking the version with the"
> +        " lowest cost.",
> +        0, 0, 1)
> +
>  DEFPARAM(PARAM_MAX_CSELIB_MEMORY_LOCATIONS,
>          "max-cselib-memory-locations",
>          "The maximum memory locations recorded by cselib.",
> Index: gcc/doc/invoke.texi
> ===================================================================
> --- gcc/doc/invoke.texi 2019-11-04 21:13:57.611756365 +0000
> +++ gcc/doc/invoke.texi 2019-11-05 14:19:58.777197850 +0000
> @@ -11563,6 +11563,12 @@ doing loop versioning for alias in the v
>  The maximum number of loop peels to enhance access alignment
>  for vectorizer. Value -1 means no limit.
>
> +@item vect-compare-loop-costs
> +Whether to try vectorizing a loop using each supported combination of
> +vector types and picking the version with the lowest cost.  This parameter
> +has no effect when @option{-fno-vect-cost-model} or
> +@option{-fvect-cost-model=unlimited} is used.
> +
>  @item max-iterations-to-track
>  The maximum number of iterations of a loop the brute-force algorithm
>  for analysis of the number of iterations of the loop tries to evaluate.
> Index: gcc/tree-vectorizer.h
> ===================================================================
> --- gcc/tree-vectorizer.h       2019-11-05 14:19:33.829371745 +0000
> +++ gcc/tree-vectorizer.h       2019-11-05 14:19:58.781197820 +0000
> @@ -601,6 +601,13 @@ typedef class _loop_vec_info : public ve
>    /* Cost of a single scalar iteration.  */
>    int single_scalar_iteration_cost;
>
> +  /* The cost of the vector prologue and epilogue, including peeled
> +     iterations and set-up code.  */
> +  int vec_outside_cost;
> +
> +  /* The cost of the vector loop body.  */
> +  int vec_inside_cost;
> +
>    /* Is the loop vectorizable? */
>    bool vectorizable;
>
> Index: gcc/tree-vect-loop.c
> ===================================================================
> --- gcc/tree-vect-loop.c        2019-11-05 14:19:33.829371745 +0000
> +++ gcc/tree-vect-loop.c        2019-11-05 14:19:58.781197820 +0000
> @@ -830,6 +830,8 @@ _loop_vec_info::_loop_vec_info (class lo
>      scan_map (NULL),
>      slp_unrolling_factor (1),
>      single_scalar_iteration_cost (0),
> +    vec_outside_cost (0),
> +    vec_inside_cost (0),
>      vectorizable (false),
>      can_fully_mask_p (true),
>      fully_masked_p (false),
> @@ -2373,6 +2375,80 @@ vect_analyze_loop_2 (loop_vec_info loop_
>    goto start_over;
>  }
>
> +/* Return true if vectorizing a loop using NEW_LOOP_VINFO appears
> +   to be better than vectorizing it using OLD_LOOP_VINFO.  Assume that
> +   OLD_LOOP_VINFO is better unless something specifically indicates
> +   otherwise.
> +
> +   Note that this deliberately isn't a partial order.  */
> +
> +static bool
> +vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
> +                         loop_vec_info old_loop_vinfo)
> +{
> +  struct loop *loop = LOOP_VINFO_LOOP (new_loop_vinfo);
> +  gcc_assert (LOOP_VINFO_LOOP (old_loop_vinfo) == loop);
> +
> +  poly_int64 new_vf = LOOP_VINFO_VECT_FACTOR (new_loop_vinfo);
> +  poly_int64 old_vf = LOOP_VINFO_VECT_FACTOR (old_loop_vinfo);
> +
> +  /* Always prefer a VF of loop->simdlen over any other VF.  */
> +  if (loop->simdlen)
> +    {
> +      bool new_simdlen_p = known_eq (new_vf, loop->simdlen);
> +      bool old_simdlen_p = known_eq (old_vf, loop->simdlen);
> +      if (new_simdlen_p != old_simdlen_p)
> +       return new_simdlen_p;
> +    }
> +
> +  /* Limit the VFs to what is likely to be the maximum number of iterations,
> +     to handle cases in which at least one loop_vinfo is fully-masked.  */
> +  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
> +  if (estimated_max_niter != -1)
> +    {
> +      if (known_le (estimated_max_niter, new_vf))
> +       new_vf = estimated_max_niter;
> +      if (known_le (estimated_max_niter, old_vf))
> +       old_vf = estimated_max_niter;
> +    }
> +
> +  /* Check whether the (fractional) cost per scalar iteration is lower
> +     or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf.  */
> +  poly_widest_int rel_new = (new_loop_vinfo->vec_inside_cost
> +                            * poly_widest_int (old_vf));
> +  poly_widest_int rel_old = (old_loop_vinfo->vec_inside_cost
> +                            * poly_widest_int (new_vf));
> +  if (maybe_lt (rel_old, rel_new))
> +    return false;
> +  if (known_lt (rel_new, rel_old))
> +    return true;
> +
> +  /* If there's nothing to choose between the loop bodies, see whether
> +     there's a difference in the prologue and epilogue costs.  */
> +  if (new_loop_vinfo->vec_outside_cost != old_loop_vinfo->vec_outside_cost)
> +    return new_loop_vinfo->vec_outside_cost < old_loop_vinfo->vec_outside_cost;
> +
> +  return false;
> +}
> +
> +/* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO.  Return
> +   true if we should.  */
> +
> +static bool
> +vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo,
> +                       loop_vec_info old_loop_vinfo)
> +{
> +  if (!vect_better_loop_vinfo_p (new_loop_vinfo, old_loop_vinfo))
> +    return false;
> +
> +  if (dump_enabled_p ())
> +    dump_printf_loc (MSG_NOTE, vect_location,
> +                    "***** Preferring vector mode %s to vector mode %s\n",
> +                    GET_MODE_NAME (new_loop_vinfo->vector_mode),
> +                    GET_MODE_NAME (old_loop_vinfo->vector_mode));
> +  return true;
> +}
> +
>  /* Function vect_analyze_loop.
>
>     Apply a set of analyses on LOOP, and create a loop_vec_info struct
> @@ -2408,6 +2484,8 @@ vect_analyze_loop (class loop *loop, vec
>    machine_mode next_vector_mode = VOIDmode;
>    poly_uint64 lowest_th = 0;
>    unsigned vectorized_loops = 0;
> +  bool pick_lowest_cost_p = (PARAM_VALUE (PARAM_VECT_COMPARE_LOOP_COSTS)
> +                            && !unlimited_cost_model (loop));
>
>    bool vect_epilogues = false;
>    opt_result res = opt_result::success ();
> @@ -2428,6 +2506,34 @@ vect_analyze_loop (class loop *loop, vec
>
>        bool fatal = false;
>
> +      /* When pick_lowest_cost_p is true, we should in principle iterate
> +        over all the loop_vec_infos that LOOP_VINFO could replace and
> +        try to vectorize LOOP_VINFO under the same conditions.
> +        E.g. when trying to replace an epilogue loop, we should vectorize
> +        LOOP_VINFO as an epilogue loop with the same VF limit.  When trying
> +        to replace the main loop, we should vectorize LOOP_VINFO as a main
> +        loop too.
> +
> +        However, autovectorize_vector_modes is usually sorted as follows:
> +
> +        - Modes that naturally produce lower VFs usually follow modes that
> +          naturally produce higher VFs.
> +
> +        - When modes naturally produce the same VF, maskable modes
> +          usually follow unmaskable ones, so that the maskable mode
> +          can be used to vectorize the epilogue of the unmaskable mode.
> +
> +        This order is preferred because it leads to the maximum
> +        epilogue vectorization opportunities.  Targets should only use
> +        a different order if they want to make wide modes available while
> +        disparaging them relative to earlier, smaller modes.  The assumption
> +        in that case is that the wider modes are more expensive in some
> +        way that isn't reflected directly in the costs.
> +
> +        There should therefore be few interesting cases in which
> +        LOOP_VINFO fails when treated as an epilogue loop, succeeds when
> +        treated as a standalone loop, and ends up being genuinely cheaper
> +        than FIRST_LOOP_VINFO.  */
>        if (vect_epilogues)
>         LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
>
> @@ -2475,13 +2581,34 @@ vect_analyze_loop (class loop *loop, vec
>               LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL;
>               simdlen = 0;
>             }
> +         else if (pick_lowest_cost_p && first_loop_vinfo)
> +           {
> +             /* Keep trying to roll back vectorization attempts while the
> +                loop_vec_infos they produced were worse than this one.  */
> +             vec<loop_vec_info> &vinfos = first_loop_vinfo->epilogue_vinfos;
> +             while (!vinfos.is_empty ()
> +                    && vect_joust_loop_vinfos (loop_vinfo, vinfos.last ()))
> +               {
> +                 gcc_assert (vect_epilogues);
> +                 delete vinfos.pop ();
> +               }
> +             if (vinfos.is_empty ()
> +                 && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo))
> +               {
> +                 delete first_loop_vinfo;
> +                 first_loop_vinfo = opt_loop_vec_info::success (NULL);
> +                 LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL;
> +               }
> +           }
>
>           if (first_loop_vinfo == NULL)
>             {
>               first_loop_vinfo = loop_vinfo;
>               lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
>             }
> -         else if (vect_epilogues)
> +         else if (vect_epilogues
> +                  /* For now only allow one epilogue loop.  */
> +                  && first_loop_vinfo->epilogue_vinfos.is_empty ())
>             {
>               first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
>               poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
> @@ -2501,12 +2628,14 @@ vect_analyze_loop (class loop *loop, vec
>                             && loop->inner == NULL
>                             && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK)
>                             && LOOP_VINFO_PEELING_FOR_NITER (first_loop_vinfo)
> -                           /* For now only allow one epilogue loop.  */
> -                           && first_loop_vinfo->epilogue_vinfos.is_empty ());
> +                           /* For now only allow one epilogue loop, but allow
> +                              pick_lowest_cost_p to replace it.  */
> +                           && (first_loop_vinfo->epilogue_vinfos.is_empty ()
> +                               || pick_lowest_cost_p));
>
>           /* Commit to first_loop_vinfo if we have no reason to try
>              alternatives.  */
> -         if (!simdlen && !vect_epilogues)
> +         if (!simdlen && !vect_epilogues && !pick_lowest_cost_p)
>             break;
>         }
>        else
> @@ -3454,7 +3583,11 @@ vect_estimate_min_profitable_iters (loop
>                &vec_inside_cost, &vec_epilogue_cost);
>
>    vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost);
> -
> +
> +  /* Stash the costs so that we can compare two loop_vec_infos.  */
> +  loop_vinfo->vec_inside_cost = vec_inside_cost;
> +  loop_vinfo->vec_outside_cost = vec_outside_cost;
> +
>    if (dump_enabled_p ())
>      {
>        dump_printf_loc (MSG_NOTE, vect_location, "Cost model analysis: \n");

^ permalink raw reply	[flat|nested] 23+ messages in thread
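The body-cost comparison in vect_better_loop_vinfo_p above can be sketched with plain integers. This is a simplification: the real code uses poly_int64/poly_widest_int and the known_le/maybe_lt predicates so that it also works for runtime-sized (SVE) vectors, and the function name and types here are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the cost-per-scalar-iteration comparison, fixed-width VFs
   only.  A new candidate is strictly better when
     new_cost / new_vf < old_cost / old_vf,
   which is tested without division by cross-multiplying:
     new_cost * old_vf < old_cost * new_vf.  */
static bool
better_body_cost_p (uint64_t new_cost, uint64_t new_vf,
		    uint64_t old_cost, uint64_t old_vf,
		    int64_t estimated_max_niter)
{
  /* Cap each VF at the likely maximum iteration count, to handle the
     case in which one candidate is fully-masked and so never executes
     VF useful lanes per iteration.  */
  if (estimated_max_niter >= 0)
    {
      if ((uint64_t) estimated_max_niter <= new_vf)
	new_vf = estimated_max_niter;
      if ((uint64_t) estimated_max_niter <= old_vf)
	old_vf = estimated_max_niter;
    }
  return new_cost * old_vf < old_cost * new_vf;
}
```

For example, a body cost of 10 at VF 16 beats a body cost of 10 at VF 8, but not if the loop is only expected to run 8 iterations, since both VFs are then capped to 8 and the costs tie.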

* Re: [5/6] Account for the cost of generating loop masks
  2019-11-05 14:31 ` [5/6] Account for the cost of generating loop masks Richard Sandiford
@ 2019-11-06 12:16   ` Richard Biener
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Biener @ 2019-11-06 12:16 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Tue, Nov 5, 2019 at 3:31 PM Richard Sandiford
<Richard.Sandiford@arm.com> wrote:
>
> We didn't take the cost of generating loop masks into account, and so
> tended to underestimate the cost of loops that need multiple masks.

OK.

>
> 2019-11-05  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
>         * tree-vect-loop.c (vect_estimate_min_profitable_iters): Include
>         the cost of generating loop masks.
>
> gcc/testsuite/
>         * gcc.target/aarch64/sve/mask_struct_store_3.c: Add
>         -fno-vect-cost-model.
>         * gcc.target/aarch64/sve/mask_struct_store_3_run.c: Likewise.
>         * gcc.target/aarch64/sve/peel_ind_3.c: Likewise.
>         * gcc.target/aarch64/sve/peel_ind_3_run.c: Likewise.
>
> Index: gcc/tree-vect-loop.c
> ===================================================================
> --- gcc/tree-vect-loop.c        2019-11-05 14:19:58.781197820 +0000
> +++ gcc/tree-vect-loop.c        2019-11-05 14:20:40.188909187 +0000
> @@ -3435,6 +3435,32 @@ vect_estimate_min_profitable_iters (loop
>                                   si->kind, si->stmt_info, si->misalign,
>                                   vect_epilogue);
>         }
> +
> +      /* Calculate how many masks we need to generate.  */
> +      unsigned int num_masks = 0;
> +      rgroup_masks *rgm;
> +      unsigned int num_vectors_m1;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
> +       if (rgm->mask_type)
> +         num_masks += num_vectors_m1 + 1;
> +      gcc_assert (num_masks > 0);
> +
> +      /* In the worst case, we need to generate each mask in the prologue
> +        and in the loop body.  One of the loop body mask instructions
> +        replaces the comparison in the scalar loop, and since we don't
> +        count the scalar comparison against the scalar body, we shouldn't
> +        count that vector instruction against the vector body either.
> +
> +        Sometimes we can use unpacks instead of generating prologue
> +        masks and sometimes the prologue mask will fold to a constant,
> +        so the actual prologue cost might be smaller.  However, it's
> +        simpler and safer to use the worst-case cost; if this ends up
> +        being the tie-breaker between vectorizing or not, then it's
> +        probably better not to vectorize.  */
> +      (void) add_stmt_cost (target_cost_data, num_masks, vector_stmt,
> +                           NULL, 0, vect_prologue);
> +      (void) add_stmt_cost (target_cost_data, num_masks - 1, vector_stmt,
> +                           NULL, 0, vect_body);
>      }
>    else if (npeel < 0)
>      {
> Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3.c  2019-03-08 18:14:29.768994780 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3.c  2019-11-05 14:20:40.184909216 +0000
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
> +/* { dg-options "-O2 -ftree-vectorize -ffast-math -fno-vect-cost-model" } */
>
>  #include <stdint.h>
>
> Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3_run.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3_run.c      2019-03-08 18:14:29.772994767 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3_run.c      2019-11-05 14:20:40.184909216 +0000
> @@ -1,5 +1,5 @@
>  /* { dg-do run { target aarch64_sve_hw } } */
> -/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
> +/* { dg-options "-O2 -ftree-vectorize -ffast-math -fno-vect-cost-model" } */
>
>  #include "mask_struct_store_3.c"
>
> Index: gcc/testsuite/gcc.target/aarch64/sve/peel_ind_3.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/peel_ind_3.c   2019-03-08 18:14:29.776994751 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/peel_ind_3.c   2019-11-05 14:20:40.184909216 +0000
> @@ -1,7 +1,7 @@
>  /* { dg-do compile } */
>  /* Pick an arbitrary target for which unaligned accesses are more
>     expensive.  */
> -/* { dg-options "-O3 -msve-vector-bits=256 -mtune=thunderx" } */
> +/* { dg-options "-O3 -msve-vector-bits=256 -mtune=thunderx -fno-vect-cost-model" } */
>
>  #define N 32
>  #define MAX_START 8
> Index: gcc/testsuite/gcc.target/aarch64/sve/peel_ind_3_run.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve/peel_ind_3_run.c       2019-03-08 18:14:29.784994721 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve/peel_ind_3_run.c       2019-11-05 14:20:40.184909216 +0000
> @@ -1,6 +1,6 @@
>  /* { dg-do run { target aarch64_sve_hw } } */
> -/* { dg-options "-O3 -mtune=thunderx" } */
> -/* { dg-options "-O3 -mtune=thunderx -msve-vector-bits=256" { target aarch64_sve256_hw } } */
> +/* { dg-options "-O3 -mtune=thunderx -fno-vect-cost-model" } */
> +/* { dg-options "-O3 -mtune=thunderx -msve-vector-bits=256 -fno-vect-cost-model" { target aarch64_sve256_hw } } */
>
>  #include "peel_ind_3.c"
>

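The worst-case accounting added by this patch can be sketched as follows, with GCC's add_stmt_cost hook replaced by plain per-statement weights (the function and its parameters are illustrative, not the real API):

```c
/* Worst-case cost of generating NUM_MASKS loop masks: each mask is
   generated once in the prologue and all but one again in the loop
   body.  One body mask replaces the scalar loop's exit comparison,
   which isn't counted against the scalar body either, so it must not
   be counted against the vector body.  */
static unsigned int
mask_generation_cost (unsigned int num_masks,
		      unsigned int prologue_stmt_cost,
		      unsigned int body_stmt_cost)
{
  return num_masks * prologue_stmt_cost
	 + (num_masks - 1) * body_stmt_cost;
}
```

So a fully-masked loop needing three masks pays for three prologue statements and two body statements in the worst case, even if unpacking or constant folding later makes some of them free.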

* Re: [4/6] Optionally pick the cheapest loop_vec_info
  2019-11-06 12:09   ` Richard Biener
@ 2019-11-06 14:01     ` Richard Sandiford
  2019-11-06 14:50       ` Richard Biener
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2019-11-06 14:01 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Nov 5, 2019 at 3:29 PM Richard Sandiford
> <Richard.Sandiford@arm.com> wrote:
>>
>> This patch adds a mode in which the vectoriser tries each available
>> base vector mode and picks the one with the lowest cost.  For now
>> the behaviour is behind a default-off --param, but a later patch
>> enables it by default for SVE.
>>
>> The patch keeps the current behaviour of preferring a VF of
>> loop->simdlen over any larger or smaller VF, regardless of costs
>> or target preferences.
>
> Can you avoid using a --param for this?  Instead I'd suggest to
> amend the vectorize_modes target hook to return some
> flags like VECT_FIRST_MODE_WINS.  We'd eventually want
> to make the target able to say do-not-vectorize-epiloges-of-MODE
> (I think we may not want to vectorize SSE vectorized loop
> epilogues with MMX-with-SSE or GPRs for example).  I guess
> for the latter we'd use a new target hook.

The reason for using a --param was that I wanted a way of turning
this on and off on the command line, so that users can experiment
with it if necessary.  E.g. enabling the --param could be a viable
alternative to -mprefix-* in some cases.  Disabling it would be
a way of working around a bad cost model decision without going
all the way to -fno-vect-cost-model.

These kinds of --params can become useful workarounds until an
optimisation bug is fixed.

Thanks,
Richard


* Re: [4/6] Optionally pick the cheapest loop_vec_info
  2019-11-06 14:01     ` Richard Sandiford
@ 2019-11-06 14:50       ` Richard Biener
  2019-11-07 17:15         ` Richard Sandiford
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Biener @ 2019-11-06 14:50 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Wed, Nov 6, 2019 at 3:01 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Tue, Nov 5, 2019 at 3:29 PM Richard Sandiford
> > <Richard.Sandiford@arm.com> wrote:
> >>
> >> This patch adds a mode in which the vectoriser tries each available
> >> base vector mode and picks the one with the lowest cost.  For now
> >> the behaviour is behind a default-off --param, but a later patch
> >> enables it by default for SVE.
> >>
> >> The patch keeps the current behaviour of preferring a VF of
> >> loop->simdlen over any larger or smaller VF, regardless of costs
> >> or target preferences.
> >
> > Can you avoid using a --param for this?  Instead I'd suggest to
> > amend the vectorize_modes target hook to return some
> > flags like VECT_FIRST_MODE_WINS.  We'd eventually want
> > to make the target able to say do-not-vectorize-epiloges-of-MODE
> > (I think we may not want to vectorize SSE vectorized loop
> > epilogues with MMX-with-SSE or GPRs for example).  I guess
> > for the latter we'd use a new target hook.
>
> The reason for using a --param was that I wanted a way of turning
> this on and off on the command line, so that users can experiment
> with it if necessary.  E.g. enabling the --param could be a viable
> alternative to -mprefix-* in some cases.  Disabling it would be
> a way of working around a bad cost model decision without going
> all the way to -fno-vect-cost-model.
>
> These kinds of --params can become useful workarounds until an
> optimisation bug is fixed.

I'm arguing that the default depends on the actual ISAs so there isn't
a one-size-fits-all, and given we have OMP SIMD and target cloning for
multiple ISAs this looks like the wrong approach.  For sure the
target can use its own switches to override defaults here, or alternatively
we might want to have a #pragma GCC simdlen mimicking OMP behavior
here.

Richard.

>
> Thanks,
> Richard


* Re: [2/6] Don't assign a cost to vectorizable_assignment
  2019-11-06 12:04   ` Richard Biener
@ 2019-11-06 15:58     ` Richard Sandiford
  2019-11-07  9:35       ` Richard Biener
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2019-11-06 15:58 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Nov 5, 2019 at 3:27 PM Richard Sandiford
> <Richard.Sandiford@arm.com> wrote:
>>
>> vectorizable_assignment handles true SSA-to-SSA copies (which hopefully
>> we don't see in practice) and no-op conversions that are required
>> to maintain correct gimple, such as changes between signed and
>> unsigned types.  These cases shouldn't generate any code and so
>> shouldn't count against either the scalar or vector costs.
>>
>> Later patches test this, but it seemed worth splitting out.
>
> Hmm, but you have to adjust vect_compute_single_scalar_iteration_cost and
> possibly the SLP cost walk as well, otherwise we're artificially making
> those copies cheaper when vectorized.

Ah, yeah.  It looks complicated to reproduce the conditions exactly
there, so how about just costing 1 copy in vectorizable_assignment
to counteract it, and ignore ncopies?

Seems like vectorizable_* ought to be costing the scalar code as
well as the vector code, but that's too much for GCC 10 at this stage.

Thanks,
Richard


>
>>
>> 2019-11-04  Richard Sandiford  <richard.sandiford@arm.com>
>>
>> gcc/
>>         * tree-vect-stmts.c (vectorizable_assignment): Don't add a cost.
>>
>> Index: gcc/tree-vect-stmts.c
>> ===================================================================
>> --- gcc/tree-vect-stmts.c       2019-11-05 14:17:43.330141911 +0000
>> +++ gcc/tree-vect-stmts.c       2019-11-05 14:18:39.169752725 +0000
>> @@ -5305,7 +5305,7 @@ vectorizable_conversion (stmt_vec_info s
>>  static bool
>>  vectorizable_assignment (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>>                          stmt_vec_info *vec_stmt, slp_tree slp_node,
>> -                        stmt_vector_for_cost *cost_vec)
>> +                        stmt_vector_for_cost *)
>>  {
>>    tree vec_dest;
>>    tree scalar_dest;
>> @@ -5313,7 +5313,6 @@ vectorizable_assignment (stmt_vec_info s
>>    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
>>    tree new_temp;
>>    enum vect_def_type dt[1] = {vect_unknown_def_type};
>> -  int ndts = 1;
>>    int ncopies;
>>    int i, j;
>>    vec<tree> vec_oprnds = vNULL;
>> @@ -5409,7 +5408,8 @@ vectorizable_assignment (stmt_vec_info s
>>      {
>>        STMT_VINFO_TYPE (stmt_info) = assignment_vec_info_type;
>>        DUMP_VECT_SCOPE ("vectorizable_assignment");
>> -      vect_model_simple_cost (stmt_info, ncopies, dt, ndts, slp_node, cost_vec);
>> +      /* Don't add a cost here.  SSA copies and no-op conversions
>> +        shouldn't generate any code in either scalar or vector form.  */
>>        return true;
>>      }
>>


* Re: [2/6] Don't assign a cost to vectorizable_assignment
  2019-11-06 15:58     ` Richard Sandiford
@ 2019-11-07  9:35       ` Richard Biener
  2019-11-07 16:40         ` Richard Sandiford
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Biener @ 2019-11-07  9:35 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Wed, Nov 6, 2019 at 4:58 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Tue, Nov 5, 2019 at 3:27 PM Richard Sandiford
> > <Richard.Sandiford@arm.com> wrote:
> >>
> >> vectorizable_assignment handles true SSA-to-SSA copies (which hopefully
> >> we don't see in practice) and no-op conversions that are required
> >> to maintain correct gimple, such as changes between signed and
> >> unsigned types.  These cases shouldn't generate any code and so
> >> shouldn't count against either the scalar or vector costs.
> >>
> >> Later patches test this, but it seemed worth splitting out.
> >
> > Hmm, but you have to adjust vect_compute_single_scalar_iteration_cost and
> > possibly the SLP cost walk as well, otherwise we're artificially making
> > those copies cheaper when vectorized.
>
> Ah, yeah.  It looks complicated to reproduce the conditions exactly
> there, so how about just costing 1 copy in vectorizable_assignment
> to counteract it, and ignore ncopies?

I guess costing a single scalar_stmt ought to make it exactly offset
the scalar cost?

> Seems like vectorizable_* ought to be costing the scalar code as
> well as the vector code, but that's too much for GCC 10 at this stage.

Yeah.

Richard.

> Thanks,
> Richard
>
>
> >
> >>
> >> 2019-11-04  Richard Sandiford  <richard.sandiford@arm.com>
> >>
> >> gcc/
> >>         * tree-vect-stmts.c (vectorizable_assignment): Don't add a cost.
> >>
> >> Index: gcc/tree-vect-stmts.c
> >> ===================================================================
> >> --- gcc/tree-vect-stmts.c       2019-11-05 14:17:43.330141911 +0000
> >> +++ gcc/tree-vect-stmts.c       2019-11-05 14:18:39.169752725 +0000
> >> @@ -5305,7 +5305,7 @@ vectorizable_conversion (stmt_vec_info s
> >>  static bool
> >>  vectorizable_assignment (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> >>                          stmt_vec_info *vec_stmt, slp_tree slp_node,
> >> -                        stmt_vector_for_cost *cost_vec)
> >> +                        stmt_vector_for_cost *)
> >>  {
> >>    tree vec_dest;
> >>    tree scalar_dest;
> >> @@ -5313,7 +5313,6 @@ vectorizable_assignment (stmt_vec_info s
> >>    loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
> >>    tree new_temp;
> >>    enum vect_def_type dt[1] = {vect_unknown_def_type};
> >> -  int ndts = 1;
> >>    int ncopies;
> >>    int i, j;
> >>    vec<tree> vec_oprnds = vNULL;
> >> @@ -5409,7 +5408,8 @@ vectorizable_assignment (stmt_vec_info s
> >>      {
> >>        STMT_VINFO_TYPE (stmt_info) = assignment_vec_info_type;
> >>        DUMP_VECT_SCOPE ("vectorizable_assignment");
> >> -      vect_model_simple_cost (stmt_info, ncopies, dt, ndts, slp_node, cost_vec);
> >> +      /* Don't add a cost here.  SSA copies and no-op conversions
> >> +        shouldn't generate any code in either scalar or vector form.  */
> >>        return true;
> >>      }
> >>


* Re: [1/6] Fix vectorizable_conversion costs
  2019-11-06 12:01   ` Richard Biener
@ 2019-11-07 15:14     ` Richard Sandiford
  2019-11-07 16:13       ` Richard Biener
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2019-11-07 15:14 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Nov 5, 2019 at 3:25 PM Richard Sandiford
> <Richard.Sandiford@arm.com> wrote:
>>
>> This patch makes two tweaks to vectorizable_conversion.  The first
>> is to use "modifier" to distinguish between promotion, demotion,
>> and neither promotion nor demotion, rather than using a code for
>> some cases and "modifier" for others.  The second is to take ncopies
>> into account for the promotion and demotion costs; previously we gave
>> multiple copies the same cost as a single copy.
>>
>> Later patches test this, but it seemed worth splitting out.
>
> OK, but does ncopies properly handle unrolling with SLP?

Bah, thanks for catching that.  Here's a fixed version, tested as before.

Richard


2019-11-07  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* tree-vect-stmts.c (vect_model_promotion_demotion_cost): Take the
	number of copies (ncopies) as an additional argument.
	(vectorizable_conversion): Update call accordingly.  Use "modifier"
	to check whether a conversion is between vectors with the same
	numbers of units.

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	2019-11-07 15:11:49.000000000 +0000
+++ gcc/tree-vect-stmts.c	2019-11-07 15:11:50.134775028 +0000
@@ -917,26 +917,27 @@ vect_model_simple_cost (stmt_vec_info st
 }
 
 
-/* Model cost for type demotion and promotion operations.  PWR is normally
-   zero for single-step promotions and demotions.  It will be one if 
-   two-step promotion/demotion is required, and so on.  Each additional
+/* Model cost for type demotion and promotion operations.  PWR is
+   normally zero for single-step promotions and demotions.  It will be
+   one if two-step promotion/demotion is required, and so on.  NCOPIES
+   is the number of vector results (and thus number of instructions)
+   for the narrowest end of the operation chain.  Each additional
    step doubles the number of instructions required.  */
 
 static void
 vect_model_promotion_demotion_cost (stmt_vec_info stmt_info,
-				    enum vect_def_type *dt, int pwr,
+				    enum vect_def_type *dt,
+				    unsigned int ncopies, int pwr,
 				    stmt_vector_for_cost *cost_vec)
 {
-  int i, tmp;
+  int i;
   int inside_cost = 0, prologue_cost = 0;
 
   for (i = 0; i < pwr + 1; i++)
     {
-      tmp = (STMT_VINFO_TYPE (stmt_info) == type_promotion_vec_info_type) ?
-	(i + 1) : i;
-      inside_cost += record_stmt_cost (cost_vec, vect_pow2 (tmp),
-				       vec_promote_demote, stmt_info, 0,
-				       vect_body);
+      inside_cost += record_stmt_cost (cost_vec, ncopies, vec_promote_demote,
+				       stmt_info, 0, vect_body);
+      ncopies *= 2;
     }
 
   /* FORNOW: Assuming maximum 2 args per stmts.  */
@@ -4964,7 +4965,7 @@ vectorizable_conversion (stmt_vec_info s
   if (!vec_stmt)		/* transformation not required.  */
     {
       DUMP_VECT_SCOPE ("vectorizable_conversion");
-      if (code == FIX_TRUNC_EXPR || code == FLOAT_EXPR)
+      if (modifier == NONE)
         {
 	  STMT_VINFO_TYPE (stmt_info) = type_conversion_vec_info_type;
 	  vect_model_simple_cost (stmt_info, ncopies, dt, ndts, slp_node,
@@ -4973,14 +4974,24 @@ vectorizable_conversion (stmt_vec_info s
       else if (modifier == NARROW)
 	{
 	  STMT_VINFO_TYPE (stmt_info) = type_demotion_vec_info_type;
-	  vect_model_promotion_demotion_cost (stmt_info, dt, multi_step_cvt,
-					      cost_vec);
+	  /* The final packing step produces one vector result per copy.  */
+	  unsigned int nvectors
+	    = (slp_node ? SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) : ncopies);
+	  vect_model_promotion_demotion_cost (stmt_info, dt, nvectors,
+					      multi_step_cvt, cost_vec);
 	}
       else
 	{
 	  STMT_VINFO_TYPE (stmt_info) = type_promotion_vec_info_type;
-	  vect_model_promotion_demotion_cost (stmt_info, dt, multi_step_cvt,
-					      cost_vec);
+	  /* The initial unpacking step produces two vector results
+	     per copy.  MULTI_STEP_CVT is 0 for a single conversion,
+	     so >> MULTI_STEP_CVT divides by 2^(number of steps - 1).  */
+	  unsigned int nvectors
+	    = (slp_node
+	       ? SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) >> multi_step_cvt
+	       : ncopies * 2);
+	  vect_model_promotion_demotion_cost (stmt_info, dt, nvectors,
+					      multi_step_cvt, cost_vec);
 	}
       interm_types.release ();
       return true;

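The revised costing loop in the patch above amounts to the following, with plain integers standing in for the record_stmt_cost bookkeeping (a sketch, not the real interface):

```c
/* Cost of a multi-step promotion/demotion chain: NCOPIES vector
   statements at the narrowest end, doubling at each of the PWR
   additional steps, for a total of ncopies * (2^(pwr + 1) - 1)
   statements overall.  */
static unsigned int
promotion_demotion_stmts (unsigned int ncopies, int pwr)
{
  unsigned int total = 0;
  for (int i = 0; i < pwr + 1; i++)
    {
      total += ncopies;
      ncopies *= 2;
    }
  return total;
}
```

This is what fixes the original problem: multiple copies of a promotion or demotion now scale the cost with ncopies instead of being charged the same as a single copy.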

* Re: [1/6] Fix vectorizable_conversion costs
  2019-11-07 15:14     ` Richard Sandiford
@ 2019-11-07 16:13       ` Richard Biener
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Biener @ 2019-11-07 16:13 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On November 7, 2019 4:14:14 PM GMT+01:00, Richard Sandiford <richard.sandiford@arm.com> wrote:
>Richard Biener <richard.guenther@gmail.com> writes:
>> On Tue, Nov 5, 2019 at 3:25 PM Richard Sandiford
>> <Richard.Sandiford@arm.com> wrote:
>>>
>>> This patch makes two tweaks to vectorizable_conversion.  The first
>>> is to use "modifier" to distinguish between promotion, demotion,
>>> and neither promotion nor demotion, rather than using a code for
>>> some cases and "modifier" for others.  The second is to take ncopies
>>> into account for the promotion and demotion costs; previously we
>gave
>>> multiple copies the same cost as a single copy.
>>>
>>> Later patches test this, but it seemed worth splitting out.
>>
>> OK, but does ncopies properly handle unrolling with SLP?
>
>Bah, thanks for catching that.  Here's a fixed version, tested as
>before.

OK. 

Thanks, 
Richard. 

>Richard
>
>
>2019-11-07  Richard Sandiford  <richard.sandiford@arm.com>
>
>gcc/
>	* tree-vect-stmts.c (vect_model_promotion_demotion_cost): Take the
>	number of ncopies as an additional argument.
>	(vectorizable_conversion): Update call accordingly.  Use "modifier"
>	to check whether a conversion is between vectors with the same
>	numbers of units.
>
>Index: gcc/tree-vect-stmts.c
>===================================================================
>--- gcc/tree-vect-stmts.c	2019-11-07 15:11:49.000000000 +0000
>+++ gcc/tree-vect-stmts.c	2019-11-07 15:11:50.134775028 +0000
>@@ -917,26 +917,27 @@ vect_model_simple_cost (stmt_vec_info st
> }
> 
> 
>-/* Model cost for type demotion and promotion operations.  PWR is
>normally
>-   zero for single-step promotions and demotions.  It will be one if 
>-   two-step promotion/demotion is required, and so on.  Each
>additional
>+/* Model cost for type demotion and promotion operations.  PWR is
>+   normally zero for single-step promotions and demotions.  It will be
>+   one if two-step promotion/demotion is required, and so on.  NCOPIES
>+   is the number of vector results (and thus number of instructions)
>+   for the narrowest end of the operation chain.  Each additional
>    step doubles the number of instructions required.  */
> 
> static void
> vect_model_promotion_demotion_cost (stmt_vec_info stmt_info,
>-				    enum vect_def_type *dt, int pwr,
>+				    enum vect_def_type *dt,
>+				    unsigned int ncopies, int pwr,
> 				    stmt_vector_for_cost *cost_vec)
> {
>-  int i, tmp;
>+  int i;
>   int inside_cost = 0, prologue_cost = 0;
> 
>   for (i = 0; i < pwr + 1; i++)
>     {
>-      tmp = (STMT_VINFO_TYPE (stmt_info) ==
>type_promotion_vec_info_type) ?
>-	(i + 1) : i;
>-      inside_cost += record_stmt_cost (cost_vec, vect_pow2 (tmp),
>-				       vec_promote_demote, stmt_info, 0,
>-				       vect_body);
>+      inside_cost += record_stmt_cost (cost_vec, ncopies,
>vec_promote_demote,
>+				       stmt_info, 0, vect_body);
>+      ncopies *= 2;
>     }
> 
>   /* FORNOW: Assuming maximum 2 args per stmts.  */
>@@ -4964,7 +4965,7 @@ vectorizable_conversion (stmt_vec_info s
>   if (!vec_stmt)		/* transformation not required.  */
>     {
>       DUMP_VECT_SCOPE ("vectorizable_conversion");
>-      if (code == FIX_TRUNC_EXPR || code == FLOAT_EXPR)
>+      if (modifier == NONE)
>         {
> 	  STMT_VINFO_TYPE (stmt_info) = type_conversion_vec_info_type;
> 	  vect_model_simple_cost (stmt_info, ncopies, dt, ndts, slp_node,
>@@ -4973,14 +4974,24 @@ vectorizable_conversion (stmt_vec_info s
>       else if (modifier == NARROW)
> 	{
> 	  STMT_VINFO_TYPE (stmt_info) = type_demotion_vec_info_type;
>-	  vect_model_promotion_demotion_cost (stmt_info, dt, multi_step_cvt,
>-					      cost_vec);
>+	  /* The final packing step produces one vector result per copy.  */
>+	  unsigned int nvectors
>+	    = (slp_node ? SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) : ncopies);
>+	  vect_model_promotion_demotion_cost (stmt_info, dt, nvectors,
>+					      multi_step_cvt, cost_vec);
> 	}
>       else
> 	{
> 	  STMT_VINFO_TYPE (stmt_info) = type_promotion_vec_info_type;
>-	  vect_model_promotion_demotion_cost (stmt_info, dt, multi_step_cvt,
>-					      cost_vec);
>+	  /* The initial unpacking step produces two vector results
>+	     per copy.  MULTI_STEP_CVT is 0 for a single conversion,
>+	     so >> MULTI_STEP_CVT divides by 2^(number of steps - 1).  */
>+	  unsigned int nvectors
>+	    = (slp_node
>+	       ? SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) >> multi_step_cvt
>+	       : ncopies * 2);
>+	  vect_model_promotion_demotion_cost (stmt_info, dt, nvectors,
>+					      multi_step_cvt, cost_vec);
> 	}
>       interm_types.release ();
>       return true;


* Re: [2/6] Don't assign a cost to vectorizable_assignment
  2019-11-07  9:35       ` Richard Biener
@ 2019-11-07 16:40         ` Richard Sandiford
  2019-11-08 11:24           ` Richard Biener
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2019-11-07 16:40 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Wed, Nov 6, 2019 at 4:58 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Tue, Nov 5, 2019 at 3:27 PM Richard Sandiford
>> > <Richard.Sandiford@arm.com> wrote:
>> >>
>> >> vectorizable_assignment handles true SSA-to-SSA copies (which hopefully
>> >> we don't see in practice) and no-op conversions that are required
>> >> to maintain correct gimple, such as changes between signed and
>> >> unsigned types.  These cases shouldn't generate any code and so
>> >> shouldn't count against either the scalar or vector costs.
>> >>
>> >> Later patches test this, but it seemed worth splitting out.
>> >
>> > Hmm, but you have to adjust vect_compute_single_scalar_iteration_cost and
>> > possibly the SLP cost walk as well, otherwise we're artificially making
>> > those copies cheaper when vectorized.
>>
>> Ah, yeah.  It looks complicated to reproduce the conditions exactly
>> there, so how about just costing 1 copy in vectorizable_assignment
>> to counteract it, and ignore ncopies?
>
> I guess costing a single scalar_stmt ought to make it exactly offset
> the scalar cost?

To summarise what we said on IRC: the problem with that is that we
need to count VF scalar stmts, where VF might be a runtime value.
The follow-on loop costing code copes with variable VF without
relying on vect_vf_for_cost.

Calling vectorizable_assignment from the scalar costing code
seemed like too much of a hack.  And it turns out that we can't
delay the scalar costing until after vect_analyze_stmts because
vect_enhance_data_refs_alignment needs it before then.  Reworking
this whole thing is too much work for GCC 10 at this stage.

So this patch goes with your suggestion of using a test based on
tree_nop_conversion.  To make sure that the scalar and vector costs
stay somewhat consistent, vectorizable_assignment continues to cost
stmts for which the new predicate is false.
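
For reference, the shape of the new predicate can be modelled in isolation
as below.  The enum and the precision check are stand-ins: the real code
inspects the gimple assignment's rhs code, and tree_nop_conversion_p also
looks at modes and pointer types, so this is only an approximation of the
idea, not the exact GCC semantics:

```cpp
#include <cassert>

/* Stand-ins for the rhs codes vect_nop_conversion_p distinguishes.  */
enum rhs_kind { RHS_SSA_NAME, RHS_VIEW_CONVERT, RHS_CONVERT, RHS_OTHER };

/* Return true if neither the scalar nor the vector form of the statement
   should generate code: plain SSA copies and VIEW_CONVERT_EXPRs are free,
   and a conversion is free when it doesn't change the value's
   representation (modelled here as "same precision").  */
bool
model_nop_conversion_p (rhs_kind code, unsigned int lhs_prec,
			unsigned int rhs_prec)
{
  if (code == RHS_SSA_NAME || code == RHS_VIEW_CONVERT)
    return true;
  if (code == RHS_CONVERT)
    return lhs_prec == rhs_prec;	/* stand-in for tree_nop_conversion_p */
  return false;
}
```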

Tested as before.

Thanks,
Richard


2019-11-07  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* tree-vectorizer.h (vect_nop_conversion_p): Declare.
	* tree-vect-stmts.c (vect_nop_conversion_p): New function.
	(vectorizable_assignment): Don't add a cost for nop conversions.
	* tree-vect-loop.c (vect_compute_single_scalar_iteration_cost):
	Likewise.
	* tree-vect-slp.c (vect_bb_slp_scalar_cost): Likewise.

Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h	2019-11-07 15:11:22.290972236 +0000
+++ gcc/tree-vectorizer.h	2019-11-07 16:32:14.817523866 +0000
@@ -1654,6 +1654,7 @@ extern tree vect_get_vec_def_for_stmt_co
 extern bool vect_transform_stmt (stmt_vec_info, gimple_stmt_iterator *,
 				 slp_tree, slp_instance);
 extern void vect_remove_stores (stmt_vec_info);
+extern bool vect_nop_conversion_p (stmt_vec_info);
 extern opt_result vect_analyze_stmt (stmt_vec_info, bool *, slp_tree,
 				     slp_instance, stmt_vector_for_cost *);
 extern void vect_get_load_cost (stmt_vec_info, int, bool,
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	2019-11-07 15:11:50.134775028 +0000
+++ gcc/tree-vect-stmts.c	2019-11-07 16:32:14.817523866 +0000
@@ -5284,6 +5284,29 @@ vectorizable_conversion (stmt_vec_info s
   return true;
 }
 
+/* Return true if we can assume from the scalar form of STMT_INFO that
+   neither the scalar nor the vector forms will generate code.  STMT_INFO
+   is known not to involve a data reference.  */
+
+bool
+vect_nop_conversion_p (stmt_vec_info stmt_info)
+{
+  gassign *stmt = dyn_cast <gassign *> (stmt_info->stmt);
+  if (!stmt)
+    return false;
+
+  tree lhs = gimple_assign_lhs (stmt);
+  tree_code code = gimple_assign_rhs_code (stmt);
+  tree rhs = gimple_assign_rhs1 (stmt);
+
+  if (code == SSA_NAME || code == VIEW_CONVERT_EXPR)
+    return true;
+
+  if (CONVERT_EXPR_CODE_P (code))
+    return tree_nop_conversion_p (TREE_TYPE (lhs), TREE_TYPE (rhs));
+
+  return false;
+}
 
 /* Function vectorizable_assignment.
 
@@ -5399,7 +5422,9 @@ vectorizable_assignment (stmt_vec_info s
     {
       STMT_VINFO_TYPE (stmt_info) = assignment_vec_info_type;
       DUMP_VECT_SCOPE ("vectorizable_assignment");
-      vect_model_simple_cost (stmt_info, ncopies, dt, ndts, slp_node, cost_vec);
+      if (!vect_nop_conversion_p (stmt_info))
+	vect_model_simple_cost (stmt_info, ncopies, dt, ndts, slp_node,
+				cost_vec);
       return true;
     }
 
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2019-11-07 15:11:22.290972236 +0000
+++ gcc/tree-vect-loop.c	2019-11-07 16:32:14.813523891 +0000
@@ -1126,7 +1126,9 @@ vect_compute_single_scalar_iteration_cos
              else
                kind = scalar_store;
             }
-          else
+	  else if (vect_nop_conversion_p (stmt_info))
+	    continue;
+	  else
             kind = scalar_stmt;
 
 	  record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	2019-11-07 15:11:22.290972236 +0000
+++ gcc/tree-vect-slp.c	2019-11-07 16:32:14.817523866 +0000
@@ -3037,6 +3037,8 @@ vect_bb_slp_scalar_cost (basic_block bb,
           else
 	    kind = scalar_store;
         }
+      else if (vect_nop_conversion_p (stmt_info))
+	continue;
       else
 	kind = scalar_stmt;
       record_stmt_cost (cost_vec, 1, kind, stmt_info, 0, vect_body);


* Re: [4/6] Optionally pick the cheapest loop_vec_info
  2019-11-06 14:50       ` Richard Biener
@ 2019-11-07 17:15         ` Richard Sandiford
  2019-11-08 11:27           ` Richard Biener
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Sandiford @ 2019-11-07 17:15 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Wed, Nov 6, 2019 at 3:01 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Tue, Nov 5, 2019 at 3:29 PM Richard Sandiford
>> > <Richard.Sandiford@arm.com> wrote:
>> >>
>> >> This patch adds a mode in which the vectoriser tries each available
>> >> base vector mode and picks the one with the lowest cost.  For now
>> >> the behaviour is behind a default-off --param, but a later patch
>> >> enables it by default for SVE.
>> >>
>> >> The patch keeps the current behaviour of preferring a VF of
>> >> loop->simdlen over any larger or smaller VF, regardless of costs
>> >> or target preferences.
>> >
>> > Can you avoid using a --param for this?  Instead I'd suggest to
>> > amend the vectorize_modes target hook to return some
>> > flags like VECT_FIRST_MODE_WINS.  We'd eventually want
>> > to make the target able to say do-not-vectorize-epilogues-of-MODE
>> > (I think we may not want to vectorize SSE vectorized loop
>> > epilogues with MMX-with-SSE or GPRs for example).  I guess
>> > for the latter we'd use a new target hook.
>>
>> The reason for using a --param was that I wanted a way of turning
>> this on and off on the command line, so that users can experiment
>> with it if necessary.  E.g. enabling the --param could be a viable
>> alternative to -mprefer-* in some cases.  Disabling it would be
>> a way of working around a bad cost model decision without going
>> all the way to -fno-vect-cost-model.
>>
>> These kinds of --params can become useful workarounds until an
>> optimisation bug is fixed.
>
> I'm arguing that the default depends on the actual ISAs so there isn't
> a one-fits all and given we have OMP SIMD and target cloning for
> multiple ISAs this looks like a wrong approach.  For sure the
> target can use its own switches to override defaults here, or alternatively
>> > we might want to have a #pragma GCC simdlen mimicking OMP behavior
> here.

I agree there's no one-size-fits-all choice here, but that's true for
other --params too.  The problem with using target switches is that we
have to explain them and to keep accepting them "forever" (or at least
with a long deprecation period).  Whereas the --param was just something
that people could play with or perhaps use to work around problems
temporarily.  It would come with no guarantees attached.  And what the
--param did applied to any targets that support multiple modes,
regardless of what the targets do by default.

All that said, here's a version that returns the bitmask you suggested.
I ended up making the flag select the new behaviour and 0 select the
current behaviour, rather than have a flag for "first mode wins".
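
To spell out the convention the patch below introduces: the hook now
returns a bitmask instead of void, and a set VECT_COMPARE_COSTS bit asks
the vectorizer to try every pushed mode and keep the cheapest.  A toy
version, with placeholder mode values rather than a real target
implementation:

```cpp
#include <cassert>
#include <vector>

const unsigned int VECT_COMPARE_COSTS = 1U << 0;

/* Toy model of TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES: push the
   candidate modes and return flags describing how to use them.
   Returning 0 keeps the historical first-mode-wins behaviour.  */
unsigned int
toy_autovectorize_vector_modes (std::vector<int> *modes, bool compare_costs_p)
{
  modes->push_back (16);	/* stand-in for a 128-bit vector mode */
  modes->push_back (8);		/* stand-in for a 64-bit vector mode */
  return compare_costs_p ? VECT_COMPARE_COSTS : 0;
}
```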
Tested as before.

Thanks,
Richard


2019-11-07  Richard Sandiford  <richard.sandiford@arm.com>

gcc/
	* target.h (VECT_COMPARE_COSTS): New constant.
	* target.def (autovectorize_vector_modes): Return a bitmask of flags.
	* doc/tm.texi: Regenerate.
	* targhooks.h (default_autovectorize_vector_modes): Update accordingly.
	* targhooks.c (default_autovectorize_vector_modes): Likewise.
	* config/aarch64/aarch64.c (aarch64_autovectorize_vector_modes):
	Likewise.
	* config/arc/arc.c (arc_autovectorize_vector_modes): Likewise.
	* config/arm/arm.c (arm_autovectorize_vector_modes): Likewise.
	* config/i386/i386.c (ix86_autovectorize_vector_modes): Likewise.
	* config/mips/mips.c (mips_autovectorize_vector_modes): Likewise.
	* tree-vectorizer.h (_loop_vec_info::vec_outside_cost)
	(_loop_vec_info::vec_inside_cost): New member variables.
	* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize them.
	(vect_better_loop_vinfo_p, vect_joust_loop_vinfos): New functions.
	(vect_analyze_loop): When autovectorize_vector_modes returns
	VECT_COMPARE_COSTS, try vectorizing the loop with each available
	vector mode and picking the one with the lowest cost.
	(vect_estimate_min_profitable_iters): Record the computed costs
	in the loop_vec_info.

Index: gcc/target.h
===================================================================
--- gcc/target.h	2019-11-07 15:11:15.831017985 +0000
+++ gcc/target.h	2019-11-07 16:52:30.037198353 +0000
@@ -218,6 +218,14 @@ enum omp_device_kind_arch_isa {
   omp_device_isa
 };
 
+/* Flags returned by TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES:
+
+   VECT_COMPARE_COSTS
+       Tells the loop vectorizer to try all the provided modes and
+       pick the one with the lowest cost.  By default the vectorizer
+       will choose the first mode that works.  */
+const unsigned int VECT_COMPARE_COSTS = 1U << 0;
+
 /* The target structure.  This holds all the backend hooks.  */
 #define DEFHOOKPOD(NAME, DOC, TYPE, INIT) TYPE NAME;
 #define DEFHOOK(NAME, DOC, TYPE, PARAMS, INIT) TYPE (* NAME) PARAMS;
Index: gcc/target.def
===================================================================
--- gcc/target.def	2019-11-07 15:11:15.819018071 +0000
+++ gcc/target.def	2019-11-07 16:52:30.037198353 +0000
@@ -1925,10 +1925,20 @@ element mode.\n\
 If @var{all} is true, add suitable vector modes even when they are generally\n\
 not expected to be worthwhile.\n\
 \n\
+The hook returns a bitmask of flags that control how the modes in\n\
+@var{modes} are used.  The flags are:\n\
+@table @code\n\
+@item VECT_COMPARE_COSTS\n\
+Tells the loop vectorizer to try all the provided modes and pick the one\n\
+with the lowest cost.  By default the vectorizer will choose the first\n\
+mode that works.\n\
+@end table\n\
+\n\
 The hook does not need to do anything if the vector returned by\n\
 @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE} is the only one relevant\n\
-for autovectorization.  The default implementation does nothing.",
- void,
+for autovectorization.  The default implementation adds no modes and\n\
+returns 0.",
+ unsigned int,
  (vector_modes *modes, bool all),
  default_autovectorize_vector_modes)
 
Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	2019-11-07 15:11:15.779018354 +0000
+++ gcc/doc/tm.texi	2019-11-07 16:52:30.037198353 +0000
@@ -6016,7 +6016,7 @@ against lower halves of vectors recursiv
 reached.  The default is @var{mode} which means no splitting.
 @end deftypefn
 
-@deftypefn {Target Hook} void TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES (vector_modes *@var{modes}, bool @var{all})
+@deftypefn {Target Hook} {unsigned int} TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES (vector_modes *@var{modes}, bool @var{all})
 If using the mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}
 is not the only approach worth considering, this hook should add one mode to
 @var{modes} for each useful alternative approach.  These modes are then
@@ -6032,9 +6032,19 @@ element mode.
 If @var{all} is true, add suitable vector modes even when they are generally
 not expected to be worthwhile.
 
+The hook returns a bitmask of flags that control how the modes in
+@var{modes} are used.  The flags are:
+@table @code
+@item VECT_COMPARE_COSTS
+Tells the loop vectorizer to try all the provided modes and pick the one
+with the lowest cost.  By default the vectorizer will choose the first
+mode that works.
+@end table
+
 The hook does not need to do anything if the vector returned by
 @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE} is the only one relevant
-for autovectorization.  The default implementation does nothing.
+for autovectorization.  The default implementation adds no modes and
+returns 0.
 @end deftypefn
 
 @deftypefn {Target Hook} opt_machine_mode TARGET_VECTORIZE_RELATED_MODE (machine_mode @var{vector_mode}, scalar_mode @var{element_mode}, poly_uint64 @var{nunits})
Index: gcc/targhooks.h
===================================================================
--- gcc/targhooks.h	2019-11-07 15:11:15.831017985 +0000
+++ gcc/targhooks.h	2019-11-07 16:52:30.041198324 +0000
@@ -113,7 +113,7 @@ default_builtin_support_vector_misalignm
 					     int, bool);
 extern machine_mode default_preferred_simd_mode (scalar_mode mode);
 extern machine_mode default_split_reduction (machine_mode);
-extern void default_autovectorize_vector_modes (vector_modes *, bool);
+extern unsigned int default_autovectorize_vector_modes (vector_modes *, bool);
 extern opt_machine_mode default_vectorize_related_mode (machine_mode,
 							scalar_mode,
 							poly_uint64);
Index: gcc/targhooks.c
===================================================================
--- gcc/targhooks.c	2019-11-07 15:11:15.831017985 +0000
+++ gcc/targhooks.c	2019-11-07 16:52:30.041198324 +0000
@@ -1301,9 +1301,10 @@ default_split_reduction (machine_mode mo
 
 /* By default only the preferred vector mode is tried.  */
 
-void
+unsigned int
 default_autovectorize_vector_modes (vector_modes *, bool)
 {
+  return 0;
 }
 
 /* The default implementation of TARGET_VECTORIZE_RELATED_MODE.  */
Index: gcc/config/aarch64/aarch64.c
===================================================================
--- gcc/config/aarch64/aarch64.c	2019-11-07 15:11:19.442992405 +0000
+++ gcc/config/aarch64/aarch64.c	2019-11-07 16:52:30.021198461 +0000
@@ -15949,7 +15949,7 @@ aarch64_preferred_simd_mode (scalar_mode
 
 /* Return a list of possible vector sizes for the vectorizer
    to iterate over.  */
-static void
+static unsigned int
 aarch64_autovectorize_vector_modes (vector_modes *modes, bool)
 {
   if (TARGET_SVE)
@@ -15975,6 +15975,8 @@ aarch64_autovectorize_vector_modes (vect
      TODO: We could similarly support limited forms of V2QImode and V2HImode
      for this case.  */
   modes->safe_push (V2SImode);
+
+  return 0;
 }
 
 /* Implement TARGET_MANGLE_TYPE.  */
Index: gcc/config/arc/arc.c
===================================================================
--- gcc/config/arc/arc.c	2019-11-07 15:11:15.599019628 +0000
+++ gcc/config/arc/arc.c	2019-11-07 16:52:30.021198461 +0000
@@ -609,7 +609,7 @@ arc_preferred_simd_mode (scalar_mode mod
 /* Implements target hook
    TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES.  */
 
-static void
+static unsigned int
 arc_autovectorize_vector_modes (vector_modes *modes, bool)
 {
   if (TARGET_PLUS_QMACW)
@@ -617,6 +617,7 @@ arc_autovectorize_vector_modes (vector_m
       modes->quick_push (V4HImode);
       modes->quick_push (V2HImode);
     }
+  return 0;
 }
 
 
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	2019-11-07 15:11:15.683019033 +0000
+++ gcc/config/arm/arm.c	2019-11-07 16:52:30.029198406 +0000
@@ -289,7 +289,7 @@ static bool arm_builtin_support_vector_m
 static void arm_conditional_register_usage (void);
 static enum flt_eval_method arm_excess_precision (enum excess_precision_type);
 static reg_class_t arm_preferred_rename_class (reg_class_t rclass);
-static void arm_autovectorize_vector_modes (vector_modes *, bool);
+static unsigned int arm_autovectorize_vector_modes (vector_modes *, bool);
 static int arm_default_branch_cost (bool, bool);
 static int arm_cortex_a5_branch_cost (bool, bool);
 static int arm_cortex_m_branch_cost (bool, bool);
@@ -29015,7 +29015,7 @@ arm_vector_alignment (const_tree type)
   return align;
 }
 
-static void
+static unsigned int
 arm_autovectorize_vector_modes (vector_modes *modes, bool)
 {
   if (!TARGET_NEON_VECTORIZE_DOUBLE)
@@ -29023,6 +29023,7 @@ arm_autovectorize_vector_modes (vector_m
       modes->safe_push (V16QImode);
       modes->safe_push (V8QImode);
     }
+  return 0;
 }
 
 static bool
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	2019-11-07 15:11:15.715018807 +0000
+++ gcc/config/i386/i386.c	2019-11-07 16:52:30.033198382 +0000
@@ -21385,7 +21385,7 @@ ix86_preferred_simd_mode (scalar_mode mo
    vectors.  If AVX512F is enabled then try vectorizing with 512bit,
    256bit and 128bit vectors.  */
 
-static void
+static unsigned int
 ix86_autovectorize_vector_modes (vector_modes *modes, bool all)
 {
   if (TARGET_AVX512F && !TARGET_PREFER_AVX256)
@@ -21415,6 +21415,8 @@ ix86_autovectorize_vector_modes (vector_
 
   if (TARGET_MMX_WITH_SSE)
     modes->safe_push (V8QImode);
+
+  return 0;
 }
 
 /* Implemenation of targetm.vectorize.get_mask_mode.  */
Index: gcc/config/mips/mips.c
===================================================================
--- gcc/config/mips/mips.c	2019-11-07 15:11:15.755018524 +0000
+++ gcc/config/mips/mips.c	2019-11-07 16:52:30.037198353 +0000
@@ -13455,11 +13455,12 @@ mips_preferred_simd_mode (scalar_mode mo
 
 /* Implement TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES.  */
 
-static void
+static unsigned int
 mips_autovectorize_vector_modes (vector_modes *modes, bool)
 {
   if (ISA_HAS_MSA)
     modes->safe_push (V16QImode);
+  return 0;
 }
 
 /* Implement TARGET_INIT_LIBFUNCS.  */
Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h	2019-11-07 16:52:25.000000000 +0000
+++ gcc/tree-vectorizer.h	2019-11-07 16:52:30.041198324 +0000
@@ -601,6 +601,13 @@ typedef class _loop_vec_info : public ve
   /* Cost of a single scalar iteration.  */
   int single_scalar_iteration_cost;
 
+  /* The cost of the vector prologue and epilogue, including peeled
+     iterations and set-up code.  */
+  int vec_outside_cost;
+
+  /* The cost of the vector loop body.  */
+  int vec_inside_cost;
+
   /* Is the loop vectorizable? */
   bool vectorizable;
 
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2019-11-07 16:52:25.000000000 +0000
+++ gcc/tree-vect-loop.c	2019-11-07 16:52:30.041198324 +0000
@@ -830,6 +830,8 @@ _loop_vec_info::_loop_vec_info (class lo
     scan_map (NULL),
     slp_unrolling_factor (1),
     single_scalar_iteration_cost (0),
+    vec_outside_cost (0),
+    vec_inside_cost (0),
     vectorizable (false),
     can_fully_mask_p (true),
     fully_masked_p (false),
@@ -2375,6 +2377,80 @@ vect_analyze_loop_2 (loop_vec_info loop_
   goto start_over;
 }
 
+/* Return true if vectorizing a loop using NEW_LOOP_VINFO appears
+   to be better than vectorizing it using OLD_LOOP_VINFO.  Assume that
+   OLD_LOOP_VINFO is better unless something specifically indicates
+   otherwise.
+
+   Note that this deliberately isn't a partial order.  */
+
+static bool
+vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
+			  loop_vec_info old_loop_vinfo)
+{
+  struct loop *loop = LOOP_VINFO_LOOP (new_loop_vinfo);
+  gcc_assert (LOOP_VINFO_LOOP (old_loop_vinfo) == loop);
+
+  poly_int64 new_vf = LOOP_VINFO_VECT_FACTOR (new_loop_vinfo);
+  poly_int64 old_vf = LOOP_VINFO_VECT_FACTOR (old_loop_vinfo);
+
+  /* Always prefer a VF of loop->simdlen over any other VF.  */
+  if (loop->simdlen)
+    {
+      bool new_simdlen_p = known_eq (new_vf, loop->simdlen);
+      bool old_simdlen_p = known_eq (old_vf, loop->simdlen);
+      if (new_simdlen_p != old_simdlen_p)
+	return new_simdlen_p;
+    }
+
+  /* Limit the VFs to what is likely to be the maximum number of iterations,
+     to handle cases in which at least one loop_vinfo is fully-masked.  */
+  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
+  if (estimated_max_niter != -1)
+    {
+      if (known_le (estimated_max_niter, new_vf))
+	new_vf = estimated_max_niter;
+      if (known_le (estimated_max_niter, old_vf))
+	old_vf = estimated_max_niter;
+    }
+
+  /* Check whether the (fractional) cost per scalar iteration is lower
+     or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf.  */
+  poly_widest_int rel_new = (new_loop_vinfo->vec_inside_cost
+			     * poly_widest_int (old_vf));
+  poly_widest_int rel_old = (old_loop_vinfo->vec_inside_cost
+			     * poly_widest_int (new_vf));
+  if (maybe_lt (rel_old, rel_new))
+    return false;
+  if (known_lt (rel_new, rel_old))
+    return true;
+
+  /* If there's nothing to choose between the loop bodies, see whether
+     there's a difference in the prologue and epilogue costs.  */
+  if (new_loop_vinfo->vec_outside_cost != old_loop_vinfo->vec_outside_cost)
+    return new_loop_vinfo->vec_outside_cost < old_loop_vinfo->vec_outside_cost;
+
+  return false;
+}
+
+/* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO.  Return
+   true if we should.  */
+
+static bool
+vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo,
+			loop_vec_info old_loop_vinfo)
+{
+  if (!vect_better_loop_vinfo_p (new_loop_vinfo, old_loop_vinfo))
+    return false;
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "***** Preferring vector mode %s to vector mode %s\n",
+		     GET_MODE_NAME (new_loop_vinfo->vector_mode),
+		     GET_MODE_NAME (old_loop_vinfo->vector_mode));
+  return true;
+}
+
 /* Function vect_analyze_loop.
 
    Apply a set of analyses on LOOP, and create a loop_vec_info struct
@@ -2386,8 +2462,9 @@ vect_analyze_loop (class loop *loop, vec
   auto_vector_modes vector_modes;
 
   /* Autodetect first vector size we try.  */
-  targetm.vectorize.autovectorize_vector_modes (&vector_modes,
-						loop->simdlen != 0);
+  unsigned int autovec_flags
+    = targetm.vectorize.autovectorize_vector_modes (&vector_modes,
+						    loop->simdlen != 0);
   unsigned int mode_i = 0;
 
   DUMP_VECT_SCOPE ("analyze_loop_nest");
@@ -2410,6 +2487,8 @@ vect_analyze_loop (class loop *loop, vec
   machine_mode next_vector_mode = VOIDmode;
   poly_uint64 lowest_th = 0;
   unsigned vectorized_loops = 0;
+  bool pick_lowest_cost_p = ((autovec_flags & VECT_COMPARE_COSTS)
+			     && !unlimited_cost_model (loop));
 
   bool vect_epilogues = false;
   opt_result res = opt_result::success ();
@@ -2430,6 +2509,34 @@ vect_analyze_loop (class loop *loop, vec
 
       bool fatal = false;
 
+      /* When pick_lowest_cost_p is true, we should in principle iterate
+	 over all the loop_vec_infos that LOOP_VINFO could replace and
+	 try to vectorize LOOP_VINFO under the same conditions.
+	 E.g. when trying to replace an epilogue loop, we should vectorize
+	 LOOP_VINFO as an epilogue loop with the same VF limit.  When trying
+	 to replace the main loop, we should vectorize LOOP_VINFO as a main
+	 loop too.
+
+	 However, autovectorize_vector_modes is usually sorted as follows:
+
+	 - Modes that naturally produce lower VFs usually follow modes that
+	   naturally produce higher VFs.
+
+	 - When modes naturally produce the same VF, maskable modes
+	   usually follow unmaskable ones, so that the maskable mode
+	   can be used to vectorize the epilogue of the unmaskable mode.
+
+	 This order is preferred because it leads to the maximum
+	 epilogue vectorization opportunities.  Targets should only use
+	 a different order if they want to make wide modes available while
+	 disparaging them relative to earlier, smaller modes.  The assumption
+	 in that case is that the wider modes are more expensive in some
+	 way that isn't reflected directly in the costs.
+
+	 There should therefore be few interesting cases in which
+	 LOOP_VINFO fails when treated as an epilogue loop, succeeds when
+	 treated as a standalone loop, and ends up being genuinely cheaper
+	 than FIRST_LOOP_VINFO.  */
       if (vect_epilogues)
 	LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
 
@@ -2477,13 +2584,34 @@ vect_analyze_loop (class loop *loop, vec
 	      LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL;
 	      simdlen = 0;
 	    }
+	  else if (pick_lowest_cost_p && first_loop_vinfo)
+	    {
+	      /* Keep trying to roll back vectorization attempts while the
+		 loop_vec_infos they produced were worse than this one.  */
+	      vec<loop_vec_info> &vinfos = first_loop_vinfo->epilogue_vinfos;
+	      while (!vinfos.is_empty ()
+		     && vect_joust_loop_vinfos (loop_vinfo, vinfos.last ()))
+		{
+		  gcc_assert (vect_epilogues);
+		  delete vinfos.pop ();
+		}
+	      if (vinfos.is_empty ()
+		  && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo))
+		{
+		  delete first_loop_vinfo;
+		  first_loop_vinfo = opt_loop_vec_info::success (NULL);
+		  LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL;
+		}
+	    }
 
 	  if (first_loop_vinfo == NULL)
 	    {
 	      first_loop_vinfo = loop_vinfo;
 	      lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
 	    }
-	  else if (vect_epilogues)
+	  else if (vect_epilogues
+		   /* For now only allow one epilogue loop.  */
+		   && first_loop_vinfo->epilogue_vinfos.is_empty ())
 	    {
 	      first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
 	      poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
@@ -2503,12 +2631,14 @@ vect_analyze_loop (class loop *loop, vec
 			    && loop->inner == NULL
 			    && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK)
 			    && LOOP_VINFO_PEELING_FOR_NITER (first_loop_vinfo)
-			    /* For now only allow one epilogue loop.  */
-			    && first_loop_vinfo->epilogue_vinfos.is_empty ());
+			    /* For now only allow one epilogue loop, but allow
+			       pick_lowest_cost_p to replace it.  */
+			    && (first_loop_vinfo->epilogue_vinfos.is_empty ()
+				|| pick_lowest_cost_p));
 
 	  /* Commit to first_loop_vinfo if we have no reason to try
 	     alternatives.  */
-	  if (!simdlen && !vect_epilogues)
+	  if (!simdlen && !vect_epilogues && !pick_lowest_cost_p)
 	    break;
 	}
       else
@@ -3467,7 +3597,11 @@ vect_estimate_min_profitable_iters (loop
 	       &vec_inside_cost, &vec_epilogue_cost);
 
   vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost);
-  
+
+  /* Stash the costs so that we can compare two loop_vec_infos.  */
+  loop_vinfo->vec_inside_cost = vec_inside_cost;
+  loop_vinfo->vec_outside_cost = vec_outside_cost;
+
   if (dump_enabled_p ())
     {
       dump_printf_loc (MSG_NOTE, vect_location, "Cost model analysis: \n");
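
As a footnote on vect_better_loop_vinfo_p: the body-cost comparison avoids
dividing by the VF (which may be variable at compile time) by comparing
cross products instead.  A plain-integer sketch of that step, with a
made-up function name — the real code uses poly_widest_int together with
maybe_lt/known_lt so that incomparable poly values fall through to the
outside-cost tie-breaker:

```cpp
#include <cassert>
#include <cstdint>

/* Return true if NEW_COST / NEW_VF < OLD_COST / OLD_VF, i.e. if the new
   loop body is cheaper per scalar iteration, computed without division
   via NEW_COST * OLD_VF < OLD_COST * NEW_VF.  */
bool
cheaper_per_scalar_iter_p (int64_t new_cost, int64_t new_vf,
			   int64_t old_cost, int64_t old_vf)
{
  return new_cost * old_vf < old_cost * new_vf;
}
```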


* Re: [2/6] Don't assign a cost to vectorizable_assignment
  2019-11-07 16:40         ` Richard Sandiford
@ 2019-11-08 11:24           ` Richard Biener
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Biener @ 2019-11-08 11:24 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Thu, Nov 7, 2019 at 5:40 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Wed, Nov 6, 2019 at 4:58 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Richard Biener <richard.guenther@gmail.com> writes:
> >> > On Tue, Nov 5, 2019 at 3:27 PM Richard Sandiford
> >> > <Richard.Sandiford@arm.com> wrote:
> >> >>
> >> >> vectorizable_assignment handles true SSA-to-SSA copies (which hopefully
> >> >> we don't see in practice) and no-op conversions that are required
> >> >> to maintain correct gimple, such as changes between signed and
> >> >> unsigned types.  These cases shouldn't generate any code and so
> >> >> shouldn't count against either the scalar or vector costs.
> >> >>
> >> >> Later patches test this, but it seemed worth splitting out.
> >> >
> >> > Hmm, but you have to adjust vect_compute_single_scalar_iteration_cost and
> >> > possibly the SLP cost walk as well, otherwise we're artificially making
> >> > those copies cheaper when vectorized.
> >>
> >> Ah, yeah.  It looks complicated to reproduce the conditions exactly
> >> there, so how about just costing 1 copy in vectorizable_assignment
> >> to counteract it, and ignore ncopies?
> >
> > I guess costing a single scalar_stmt ought to make it exactly offset
> > the scalar cost?
>
> To summarise what we said on IRC: the problem with that is that we
> need to count VF scalar stmts, where VF might be a runtime value.
> The follow-on loop costing code copes with variable VF without
> relying on vect_vf_for_cost.
>
> Calling vectorizable_assignment from the scalar costing code
> seemed like too much of a hack.  And it turns out that we can't
> delay the scalar costing until after vect_analyze_stmts because
> vect_enhance_data_refs_alignment needs it before then.  Reworking
> this whole thing is too much work for GCC 10 at this stage.
>
> So this patch goes with your suggestion of using a test based on
> tree_nop_conversion.  To make sure that the scalar and vector costs
> stay somewhat consistent, vectorizable_assignment continues to cost
> stmts for which the new predicate is false.
>
> Tested as before.

OK.

thanks,
Richard.

> +++ gcc/tree-vect-slp.c 2019-11-07 16:32:14.817523866 +0000
> @@ -3037,6 +3037,8 @@ vect_bb_slp_scalar_cost (basic_block bb,
>            else
>             kind = scalar_store;
>          }
> +      else if (vect_nop_conversion_p (stmt_info))
> +       continue;
>        else
>         kind = scalar_stmt;
>        record_stmt_cost (cost_vec, 1, kind, stmt_info, 0, vect_body);

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [4/6] Optionally pick the cheapest loop_vec_info
  2019-11-07 17:15         ` Richard Sandiford
@ 2019-11-08 11:27           ` Richard Biener
  2019-11-08 12:15             ` Richard Sandiford
  0 siblings, 1 reply; 23+ messages in thread
From: Richard Biener @ 2019-11-08 11:27 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Thu, Nov 7, 2019 at 6:15 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Wed, Nov 6, 2019 at 3:01 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Richard Biener <richard.guenther@gmail.com> writes:
> >> > On Tue, Nov 5, 2019 at 3:29 PM Richard Sandiford
> >> > <Richard.Sandiford@arm.com> wrote:
> >> >>
> >> >> This patch adds a mode in which the vectoriser tries each available
> >> >> base vector mode and picks the one with the lowest cost.  For now
> >> >> the behaviour is behind a default-off --param, but a later patch
> >> >> enables it by default for SVE.
> >> >>
> >> >> The patch keeps the current behaviour of preferring a VF of
> >> >> loop->simdlen over any larger or smaller VF, regardless of costs
> >> >> or target preferences.
> >> >
> >> > Can you avoid using a --param for this?  Instead I'd suggest to
> >> > amend the vectorize_modes target hook to return some
> >> > flags like VECT_FIRST_MODE_WINS.  We'd eventually want
> >> > to make the target able to say do-not-vectorize-epilogues-of-MODE
> >> > (I think we may not want to vectorize SSE vectorized loop
> >> > epilogues with MMX-with-SSE or GPRs for example).  I guess
> >> > for the latter we'd use a new target hook.
> >>
> >> The reason for using a --param was that I wanted a way of turning
> >> this on and off on the command line, so that users can experiment
> >> with it if necessary.  E.g. enabling the --param could be a viable
> >> alternative to -mprefix-* in some cases.  Disabling it would be
> >> a way of working around a bad cost model decision without going
> >> all the way to -fno-vect-cost-model.
> >>
> >> These kinds of --params can become useful workarounds until an
> >> optimisation bug is fixed.
> >
> > I'm arguing that the default depends on the actual ISAs so there isn't
> > a one-size-fits-all and given we have OMP SIMD and target cloning for
> > multiple ISAs this looks like a wrong approach.  For sure the
> > target can use its own switches to override defaults here, or alternatively
> > we might want to have a #pragma GCC simdlen mimicking OMP behavior
> > here.
>
> I agree there's no one-size-fits-all choice here, but that's true for
> other --params too.  The problem with using target switches is that we
> have to explain them and to keep accepting them "forever" (or at least
> with a long deprecation period).

Fortunately next week you'll be able to add target-specific --params
to your target's .opt file ;)

>  Whereas the --param was just something
> that people could play with or perhaps use to work around problems
> temporarily.  It would come with no guarantees attached.  And what the
> --param did applied to any targets that support multiple modes,
> regardless of what the targets do by default.
>
> All that said, here's a version that returns the bitmask you suggested.
> I ended up making the flag select the new behaviour and 0 select the
> current behaviour, rather than have a flag for "first mode wins".
> Tested as before.

OK.

Thanks,
Richard.

> Thanks,
> Richard
>
>
> 2019-11-07  Richard Sandiford  <richard.sandiford@arm.com>
>
> gcc/
>         * target.h (VECT_COMPARE_COSTS): New constant.
>         * target.def (autovectorize_vector_modes): Return a bitmask of flags.
>         * doc/tm.texi: Regenerate.
>         * targhooks.h (default_autovectorize_vector_modes): Update accordingly.
>         * targhooks.c (default_autovectorize_vector_modes): Likewise.
>         * config/aarch64/aarch64.c (aarch64_autovectorize_vector_modes):
>         Likewise.
>         * config/arc/arc.c (arc_autovectorize_vector_modes): Likewise.
>         * config/arm/arm.c (arm_autovectorize_vector_modes): Likewise.
>         * config/i386/i386.c (ix86_autovectorize_vector_modes): Likewise.
>         * config/mips/mips.c (mips_autovectorize_vector_modes): Likewise.
>         * tree-vectorizer.h (_loop_vec_info::vec_outside_cost)
>         (_loop_vec_info::vec_inside_cost): New member variables.
>         * tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize them.
>         (vect_better_loop_vinfo_p, vect_joust_loop_vinfos): New functions.
>         (vect_analyze_loop): When autovectorize_vector_modes returns
>         VECT_COMPARE_COSTS, try vectorizing the loop with each available
>         vector mode and picking the one with the lowest cost.
>         (vect_estimate_min_profitable_iters): Record the computed costs
>         in the loop_vec_info.
>
> Index: gcc/target.h
> ===================================================================
> --- gcc/target.h        2019-11-07 15:11:15.831017985 +0000
> +++ gcc/target.h        2019-11-07 16:52:30.037198353 +0000
> @@ -218,6 +218,14 @@ enum omp_device_kind_arch_isa {
>    omp_device_isa
>  };
>
> +/* Flags returned by TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES:
> +
> +   VECT_COMPARE_COSTS
> +       Tells the loop vectorizer to try all the provided modes and
> +       pick the one with the lowest cost.  By default the vectorizer
> +       will choose the first mode that works.  */
> +const unsigned int VECT_COMPARE_COSTS = 1U << 0;
> +
>  /* The target structure.  This holds all the backend hooks.  */
>  #define DEFHOOKPOD(NAME, DOC, TYPE, INIT) TYPE NAME;
>  #define DEFHOOK(NAME, DOC, TYPE, PARAMS, INIT) TYPE (* NAME) PARAMS;
> Index: gcc/target.def
> ===================================================================
> --- gcc/target.def      2019-11-07 15:11:15.819018071 +0000
> +++ gcc/target.def      2019-11-07 16:52:30.037198353 +0000
> @@ -1925,10 +1925,20 @@ element mode.\n\
>  If @var{all} is true, add suitable vector modes even when they are generally\n\
>  not expected to be worthwhile.\n\
>  \n\
> +The hook returns a bitmask of flags that control how the modes in\n\
> +@var{modes} are used.  The flags are:\n\
> +@table @code\n\
> +@item VECT_COMPARE_COSTS\n\
> +Tells the loop vectorizer to try all the provided modes and pick the one\n\
> +with the lowest cost.  By default the vectorizer will choose the first\n\
> +mode that works.\n\
> +@end table\n\
> +\n\
>  The hook does not need to do anything if the vector returned by\n\
>  @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE} is the only one relevant\n\
> -for autovectorization.  The default implementation does nothing.",
> - void,
> +for autovectorization.  The default implementation adds no modes and\n\
> +returns 0.",
> + unsigned int,
>   (vector_modes *modes, bool all),
>   default_autovectorize_vector_modes)
>
> Index: gcc/doc/tm.texi
> ===================================================================
> --- gcc/doc/tm.texi     2019-11-07 15:11:15.779018354 +0000
> +++ gcc/doc/tm.texi     2019-11-07 16:52:30.037198353 +0000
> @@ -6016,7 +6016,7 @@ against lower halves of vectors recursiv
>  reached.  The default is @var{mode} which means no splitting.
>  @end deftypefn
>
> -@deftypefn {Target Hook} void TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES (vector_modes *@var{modes}, bool @var{all})
> +@deftypefn {Target Hook} {unsigned int} TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES (vector_modes *@var{modes}, bool @var{all})
>  If using the mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}
>  is not the only approach worth considering, this hook should add one mode to
>  @var{modes} for each useful alternative approach.  These modes are then
> @@ -6032,9 +6032,19 @@ element mode.
>  If @var{all} is true, add suitable vector modes even when they are generally
>  not expected to be worthwhile.
>
> +The hook returns a bitmask of flags that control how the modes in
> +@var{modes} are used.  The flags are:
> +@table @code
> +@item VECT_COMPARE_COSTS
> +Tells the loop vectorizer to try all the provided modes and pick the one
> +with the lowest cost.  By default the vectorizer will choose the first
> +mode that works.
> +@end table
> +
>  The hook does not need to do anything if the vector returned by
>  @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE} is the only one relevant
> -for autovectorization.  The default implementation does nothing.
> +for autovectorization.  The default implementation adds no modes and
> +returns 0.
>  @end deftypefn
>
>  @deftypefn {Target Hook} opt_machine_mode TARGET_VECTORIZE_RELATED_MODE (machine_mode @var{vector_mode}, scalar_mode @var{element_mode}, poly_uint64 @var{nunits})
> Index: gcc/targhooks.h
> ===================================================================
> --- gcc/targhooks.h     2019-11-07 15:11:15.831017985 +0000
> +++ gcc/targhooks.h     2019-11-07 16:52:30.041198324 +0000
> @@ -113,7 +113,7 @@ default_builtin_support_vector_misalignm
>                                              int, bool);
>  extern machine_mode default_preferred_simd_mode (scalar_mode mode);
>  extern machine_mode default_split_reduction (machine_mode);
> -extern void default_autovectorize_vector_modes (vector_modes *, bool);
> +extern unsigned int default_autovectorize_vector_modes (vector_modes *, bool);
>  extern opt_machine_mode default_vectorize_related_mode (machine_mode,
>                                                         scalar_mode,
>                                                         poly_uint64);
> Index: gcc/targhooks.c
> ===================================================================
> --- gcc/targhooks.c     2019-11-07 15:11:15.831017985 +0000
> +++ gcc/targhooks.c     2019-11-07 16:52:30.041198324 +0000
> @@ -1301,9 +1301,10 @@ default_split_reduction (machine_mode mo
>
>  /* By default only the preferred vector mode is tried.  */
>
> -void
> +unsigned int
>  default_autovectorize_vector_modes (vector_modes *, bool)
>  {
> +  return 0;
>  }
>
>  /* The default implementation of TARGET_VECTORIZE_RELATED_MODE.  */
> Index: gcc/config/aarch64/aarch64.c
> ===================================================================
> --- gcc/config/aarch64/aarch64.c        2019-11-07 15:11:19.442992405 +0000
> +++ gcc/config/aarch64/aarch64.c        2019-11-07 16:52:30.021198461 +0000
> @@ -15949,7 +15949,7 @@ aarch64_preferred_simd_mode (scalar_mode
>
>  /* Return a list of possible vector sizes for the vectorizer
>     to iterate over.  */
> -static void
> +static unsigned int
>  aarch64_autovectorize_vector_modes (vector_modes *modes, bool)
>  {
>    if (TARGET_SVE)
> @@ -15975,6 +15975,8 @@ aarch64_autovectorize_vector_modes (vect
>       TODO: We could similarly support limited forms of V2QImode and V2HImode
>       for this case.  */
>    modes->safe_push (V2SImode);
> +
> +  return 0;
>  }
>
>  /* Implement TARGET_MANGLE_TYPE.  */
> Index: gcc/config/arc/arc.c
> ===================================================================
> --- gcc/config/arc/arc.c        2019-11-07 15:11:15.599019628 +0000
> +++ gcc/config/arc/arc.c        2019-11-07 16:52:30.021198461 +0000
> @@ -609,7 +609,7 @@ arc_preferred_simd_mode (scalar_mode mod
>  /* Implements target hook
>     TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES.  */
>
> -static void
> +static unsigned int
>  arc_autovectorize_vector_modes (vector_modes *modes, bool)
>  {
>    if (TARGET_PLUS_QMACW)
> @@ -617,6 +617,7 @@ arc_autovectorize_vector_modes (vector_m
>        modes->quick_push (V4HImode);
>        modes->quick_push (V2HImode);
>      }
> +  return 0;
>  }
>
>
> Index: gcc/config/arm/arm.c
> ===================================================================
> --- gcc/config/arm/arm.c        2019-11-07 15:11:15.683019033 +0000
> +++ gcc/config/arm/arm.c        2019-11-07 16:52:30.029198406 +0000
> @@ -289,7 +289,7 @@ static bool arm_builtin_support_vector_m
>  static void arm_conditional_register_usage (void);
>  static enum flt_eval_method arm_excess_precision (enum excess_precision_type);
>  static reg_class_t arm_preferred_rename_class (reg_class_t rclass);
> -static void arm_autovectorize_vector_modes (vector_modes *, bool);
> +static unsigned int arm_autovectorize_vector_modes (vector_modes *, bool);
>  static int arm_default_branch_cost (bool, bool);
>  static int arm_cortex_a5_branch_cost (bool, bool);
>  static int arm_cortex_m_branch_cost (bool, bool);
> @@ -29015,7 +29015,7 @@ arm_vector_alignment (const_tree type)
>    return align;
>  }
>
> -static void
> +static unsigned int
>  arm_autovectorize_vector_modes (vector_modes *modes, bool)
>  {
>    if (!TARGET_NEON_VECTORIZE_DOUBLE)
> @@ -29023,6 +29023,7 @@ arm_autovectorize_vector_modes (vector_m
>        modes->safe_push (V16QImode);
>        modes->safe_push (V8QImode);
>      }
> +  return 0;
>  }
>
>  static bool
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c      2019-11-07 15:11:15.715018807 +0000
> +++ gcc/config/i386/i386.c      2019-11-07 16:52:30.033198382 +0000
> @@ -21385,7 +21385,7 @@ ix86_preferred_simd_mode (scalar_mode mo
>     vectors.  If AVX512F is enabled then try vectorizing with 512bit,
>     256bit and 128bit vectors.  */
>
> -static void
> +static unsigned int
>  ix86_autovectorize_vector_modes (vector_modes *modes, bool all)
>  {
>    if (TARGET_AVX512F && !TARGET_PREFER_AVX256)
> @@ -21415,6 +21415,8 @@ ix86_autovectorize_vector_modes (vector_
>
>    if (TARGET_MMX_WITH_SSE)
>      modes->safe_push (V8QImode);
> +
> +  return 0;
>  }
>
>  /* Implemenation of targetm.vectorize.get_mask_mode.  */
> Index: gcc/config/mips/mips.c
> ===================================================================
> --- gcc/config/mips/mips.c      2019-11-07 15:11:15.755018524 +0000
> +++ gcc/config/mips/mips.c      2019-11-07 16:52:30.037198353 +0000
> @@ -13455,11 +13455,12 @@ mips_preferred_simd_mode (scalar_mode mo
>
>  /* Implement TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES.  */
>
> -static void
> +static unsigned int
>  mips_autovectorize_vector_modes (vector_modes *modes, bool)
>  {
>    if (ISA_HAS_MSA)
>      modes->safe_push (V16QImode);
> +  return 0;
>  }
>
>  /* Implement TARGET_INIT_LIBFUNCS.  */
> Index: gcc/tree-vectorizer.h
> ===================================================================
> --- gcc/tree-vectorizer.h       2019-11-07 16:52:25.000000000 +0000
> +++ gcc/tree-vectorizer.h       2019-11-07 16:52:30.041198324 +0000
> @@ -601,6 +601,13 @@ typedef class _loop_vec_info : public ve
>    /* Cost of a single scalar iteration.  */
>    int single_scalar_iteration_cost;
>
> +  /* The cost of the vector prologue and epilogue, including peeled
> +     iterations and set-up code.  */
> +  int vec_outside_cost;
> +
> +  /* The cost of the vector loop body.  */
> +  int vec_inside_cost;
> +
>    /* Is the loop vectorizable? */
>    bool vectorizable;
>
> Index: gcc/tree-vect-loop.c
> ===================================================================
> --- gcc/tree-vect-loop.c        2019-11-07 16:52:25.000000000 +0000
> +++ gcc/tree-vect-loop.c        2019-11-07 16:52:30.041198324 +0000
> @@ -830,6 +830,8 @@ _loop_vec_info::_loop_vec_info (class lo
>      scan_map (NULL),
>      slp_unrolling_factor (1),
>      single_scalar_iteration_cost (0),
> +    vec_outside_cost (0),
> +    vec_inside_cost (0),
>      vectorizable (false),
>      can_fully_mask_p (true),
>      fully_masked_p (false),
> @@ -2375,6 +2377,80 @@ vect_analyze_loop_2 (loop_vec_info loop_
>    goto start_over;
>  }
>
> +/* Return true if vectorizing a loop using NEW_LOOP_VINFO appears
> +   to be better than vectorizing it using OLD_LOOP_VINFO.  Assume that
> +   OLD_LOOP_VINFO is better unless something specifically indicates
> +   otherwise.
> +
> +   Note that this deliberately isn't a partial order.  */
> +
> +static bool
> +vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
> +                         loop_vec_info old_loop_vinfo)
> +{
> +  struct loop *loop = LOOP_VINFO_LOOP (new_loop_vinfo);
> +  gcc_assert (LOOP_VINFO_LOOP (old_loop_vinfo) == loop);
> +
> +  poly_int64 new_vf = LOOP_VINFO_VECT_FACTOR (new_loop_vinfo);
> +  poly_int64 old_vf = LOOP_VINFO_VECT_FACTOR (old_loop_vinfo);
> +
> +  /* Always prefer a VF of loop->simdlen over any other VF.  */
> +  if (loop->simdlen)
> +    {
> +      bool new_simdlen_p = known_eq (new_vf, loop->simdlen);
> +      bool old_simdlen_p = known_eq (old_vf, loop->simdlen);
> +      if (new_simdlen_p != old_simdlen_p)
> +       return new_simdlen_p;
> +    }
> +
> +  /* Limit the VFs to what is likely to be the maximum number of iterations,
> +     to handle cases in which at least one loop_vinfo is fully-masked.  */
> +  HOST_WIDE_INT estimated_max_niter = likely_max_stmt_executions_int (loop);
> +  if (estimated_max_niter != -1)
> +    {
> +      if (known_le (estimated_max_niter, new_vf))
> +       new_vf = estimated_max_niter;
> +      if (known_le (estimated_max_niter, old_vf))
> +       old_vf = estimated_max_niter;
> +    }
> +
> +  /* Check whether the (fractional) cost per scalar iteration is lower
> +     or higher: new_inside_cost / new_vf vs. old_inside_cost / old_vf.  */
> +  poly_widest_int rel_new = (new_loop_vinfo->vec_inside_cost
> +                            * poly_widest_int (old_vf));
> +  poly_widest_int rel_old = (old_loop_vinfo->vec_inside_cost
> +                            * poly_widest_int (new_vf));
> +  if (maybe_lt (rel_old, rel_new))
> +    return false;
> +  if (known_lt (rel_new, rel_old))
> +    return true;
> +
> +  /* If there's nothing to choose between the loop bodies, see whether
> +     there's a difference in the prologue and epilogue costs.  */
> +  if (new_loop_vinfo->vec_outside_cost != old_loop_vinfo->vec_outside_cost)
> +    return new_loop_vinfo->vec_outside_cost < old_loop_vinfo->vec_outside_cost;
> +
> +  return false;
> +}
> +
> +/* Decide whether to replace OLD_LOOP_VINFO with NEW_LOOP_VINFO.  Return
> +   true if we should.  */
> +
> +static bool
> +vect_joust_loop_vinfos (loop_vec_info new_loop_vinfo,
> +                       loop_vec_info old_loop_vinfo)
> +{
> +  if (!vect_better_loop_vinfo_p (new_loop_vinfo, old_loop_vinfo))
> +    return false;
> +
> +  if (dump_enabled_p ())
> +    dump_printf_loc (MSG_NOTE, vect_location,
> +                    "***** Preferring vector mode %s to vector mode %s\n",
> +                    GET_MODE_NAME (new_loop_vinfo->vector_mode),
> +                    GET_MODE_NAME (old_loop_vinfo->vector_mode));
> +  return true;
> +}
> +
>  /* Function vect_analyze_loop.
>
>     Apply a set of analyses on LOOP, and create a loop_vec_info struct
> @@ -2386,8 +2462,9 @@ vect_analyze_loop (class loop *loop, vec
>    auto_vector_modes vector_modes;
>
>    /* Autodetect first vector size we try.  */
> -  targetm.vectorize.autovectorize_vector_modes (&vector_modes,
> -                                               loop->simdlen != 0);
> +  unsigned int autovec_flags
> +    = targetm.vectorize.autovectorize_vector_modes (&vector_modes,
> +                                                   loop->simdlen != 0);
>    unsigned int mode_i = 0;
>
>    DUMP_VECT_SCOPE ("analyze_loop_nest");
> @@ -2410,6 +2487,8 @@ vect_analyze_loop (class loop *loop, vec
>    machine_mode next_vector_mode = VOIDmode;
>    poly_uint64 lowest_th = 0;
>    unsigned vectorized_loops = 0;
> +  bool pick_lowest_cost_p = ((autovec_flags & VECT_COMPARE_COSTS)
> +                            && !unlimited_cost_model (loop));
>
>    bool vect_epilogues = false;
>    opt_result res = opt_result::success ();
> @@ -2430,6 +2509,34 @@ vect_analyze_loop (class loop *loop, vec
>
>        bool fatal = false;
>
> +      /* When pick_lowest_cost_p is true, we should in principle iterate
> +        over all the loop_vec_infos that LOOP_VINFO could replace and
> +        try to vectorize LOOP_VINFO under the same conditions.
> +        E.g. when trying to replace an epilogue loop, we should vectorize
> +        LOOP_VINFO as an epilogue loop with the same VF limit.  When trying
> +        to replace the main loop, we should vectorize LOOP_VINFO as a main
> +        loop too.
> +
> +        However, autovectorize_vector_modes is usually sorted as follows:
> +
> +        - Modes that naturally produce lower VFs usually follow modes that
> +          naturally produce higher VFs.
> +
> +        - When modes naturally produce the same VF, maskable modes
> +          usually follow unmaskable ones, so that the maskable mode
> +          can be used to vectorize the epilogue of the unmaskable mode.
> +
> +        This order is preferred because it leads to the maximum
> +        epilogue vectorization opportunities.  Targets should only use
> +        a different order if they want to make wide modes available while
> +        disparaging them relative to earlier, smaller modes.  The assumption
> +        in that case is that the wider modes are more expensive in some
> +        way that isn't reflected directly in the costs.
> +
> +        There should therefore be few interesting cases in which
> +        LOOP_VINFO fails when treated as an epilogue loop, succeeds when
> +        treated as a standalone loop, and ends up being genuinely cheaper
> +        than FIRST_LOOP_VINFO.  */
>        if (vect_epilogues)
>         LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = first_loop_vinfo;
>
> @@ -2477,13 +2584,34 @@ vect_analyze_loop (class loop *loop, vec
>               LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL;
>               simdlen = 0;
>             }
> +         else if (pick_lowest_cost_p && first_loop_vinfo)
> +           {
> +             /* Keep trying to roll back vectorization attempts while the
> +                loop_vec_infos they produced were worse than this one.  */
> +             vec<loop_vec_info> &vinfos = first_loop_vinfo->epilogue_vinfos;
> +             while (!vinfos.is_empty ()
> +                    && vect_joust_loop_vinfos (loop_vinfo, vinfos.last ()))
> +               {
> +                 gcc_assert (vect_epilogues);
> +                 delete vinfos.pop ();
> +               }
> +             if (vinfos.is_empty ()
> +                 && vect_joust_loop_vinfos (loop_vinfo, first_loop_vinfo))
> +               {
> +                 delete first_loop_vinfo;
> +                 first_loop_vinfo = opt_loop_vec_info::success (NULL);
> +                 LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) = NULL;
> +               }
> +           }
>
>           if (first_loop_vinfo == NULL)
>             {
>               first_loop_vinfo = loop_vinfo;
>               lowest_th = LOOP_VINFO_VERSIONING_THRESHOLD (first_loop_vinfo);
>             }
> -         else if (vect_epilogues)
> +         else if (vect_epilogues
> +                  /* For now only allow one epilogue loop.  */
> +                  && first_loop_vinfo->epilogue_vinfos.is_empty ())
>             {
>               first_loop_vinfo->epilogue_vinfos.safe_push (loop_vinfo);
>               poly_uint64 th = LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo);
> @@ -2503,12 +2631,14 @@ vect_analyze_loop (class loop *loop, vec
>                             && loop->inner == NULL
>                             && PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK)
>                             && LOOP_VINFO_PEELING_FOR_NITER (first_loop_vinfo)
> -                           /* For now only allow one epilogue loop.  */
> -                           && first_loop_vinfo->epilogue_vinfos.is_empty ());
> +                           /* For now only allow one epilogue loop, but allow
> +                              pick_lowest_cost_p to replace it.  */
> +                           && (first_loop_vinfo->epilogue_vinfos.is_empty ()
> +                               || pick_lowest_cost_p));
>
>           /* Commit to first_loop_vinfo if we have no reason to try
>              alternatives.  */
> -         if (!simdlen && !vect_epilogues)
> +         if (!simdlen && !vect_epilogues && !pick_lowest_cost_p)
>             break;
>         }
>        else
> @@ -3467,7 +3597,11 @@ vect_estimate_min_profitable_iters (loop
>                &vec_inside_cost, &vec_epilogue_cost);
>
>    vec_outside_cost = (int)(vec_prologue_cost + vec_epilogue_cost);
> -
> +
> +  /* Stash the costs so that we can compare two loop_vec_infos.  */
> +  loop_vinfo->vec_inside_cost = vec_inside_cost;
> +  loop_vinfo->vec_outside_cost = vec_outside_cost;
> +
>    if (dump_enabled_p ())
>      {
>        dump_printf_loc (MSG_NOTE, vect_location, "Cost model analysis: \n");

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [4/6] Optionally pick the cheapest loop_vec_info
  2019-11-08 11:27           ` Richard Biener
@ 2019-11-08 12:15             ` Richard Sandiford
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Sandiford @ 2019-11-08 12:15 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, mliska

Richard Biener <richard.guenther@gmail.com> writes:
> On Thu, Nov 7, 2019 at 6:15 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Wed, Nov 6, 2019 at 3:01 PM Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Richard Biener <richard.guenther@gmail.com> writes:
>> >> > On Tue, Nov 5, 2019 at 3:29 PM Richard Sandiford
>> >> > <Richard.Sandiford@arm.com> wrote:
>> >> >>
>> >> >> This patch adds a mode in which the vectoriser tries each available
>> >> >> base vector mode and picks the one with the lowest cost.  For now
>> >> >> the behaviour is behind a default-off --param, but a later patch
>> >> >> enables it by default for SVE.
>> >> >>
>> >> >> The patch keeps the current behaviour of preferring a VF of
>> >> >> loop->simdlen over any larger or smaller VF, regardless of costs
>> >> >> or target preferences.
>> >> >
>> >> > Can you avoid using a --param for this?  Instead I'd suggest to
>> >> > amend the vectorize_modes target hook to return some
>> >> > flags like VECT_FIRST_MODE_WINS.  We'd eventually want
>> >> > to make the target able to say do-not-vectorize-epilogues-of-MODE
>> >> > (I think we may not want to vectorize SSE vectorized loop
>> >> > epilogues with MMX-with-SSE or GPRs for example).  I guess
>> >> > for the latter we'd use a new target hook.
>> >>
>> >> The reason for using a --param was that I wanted a way of turning
>> >> this on and off on the command line, so that users can experiment
>> >> with it if necessary.  E.g. enabling the --param could be a viable
>> >> alternative to -mprefix-* in some cases.  Disabling it would be
>> >> a way of working around a bad cost model decision without going
>> >> all the way to -fno-vect-cost-model.
>> >>
>> >> These kinds of --params can become useful workarounds until an
>> >> optimisation bug is fixed.
>> >
>> > I'm arguing that the default depends on the actual ISAs so there isn't
>> > a one-size-fits-all and given we have OMP SIMD and target cloning for
>> > multiple ISAs this looks like a wrong approach.  For sure the
>> > target can use its own switches to override defaults here, or alternatively
>> > we might want to have a #pragma GCC simdlen mimicking OMP behavior
>> > here.
>>
>> I agree there's no one-size-fits-all choice here, but that's true for
>> other --params too.  The problem with using target switches is that we
>> have to explain them and to keep accepting them "forever" (or at least
>> with a long deprecation period).
>
Fortunately next week you'll be able to add target-specific --params
to your target's .opt file ;)

Nice!  That definitely sounds like a good compromise. :-)  I'll hold off
on 6/6 until Martin's patches have gone in.  There are a couple of other
SVE things that would benefit from that too.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-11-08 12:15 UTC | newest]

Thread overview: 23+ messages
2019-11-05 14:24 [0/6] Optionally pick the cheapest loop_vec_info Richard Sandiford
2019-11-05 14:25 ` [1/6] Fix vectorizable_conversion costs Richard Sandiford
2019-11-06 12:01   ` Richard Biener
2019-11-07 15:14     ` Richard Sandiford
2019-11-07 16:13       ` Richard Biener
2019-11-05 14:27 ` [2/6] Don't assign a cost to vectorizable_assignment Richard Sandiford
2019-11-06 12:04   ` Richard Biener
2019-11-06 15:58     ` Richard Sandiford
2019-11-07  9:35       ` Richard Biener
2019-11-07 16:40         ` Richard Sandiford
2019-11-08 11:24           ` Richard Biener
2019-11-05 14:28 ` [3/6] Avoid accounting for non-existent vector loop versioning Richard Sandiford
2019-11-06 12:05   ` Richard Biener
2019-11-05 14:29 ` [4/6] Optionally pick the cheapest loop_vec_info Richard Sandiford
2019-11-06 12:09   ` Richard Biener
2019-11-06 14:01     ` Richard Sandiford
2019-11-06 14:50       ` Richard Biener
2019-11-07 17:15         ` Richard Sandiford
2019-11-08 11:27           ` Richard Biener
2019-11-08 12:15             ` Richard Sandiford
2019-11-05 14:31 ` [5/6] Account for the cost of generating loop masks Richard Sandiford
2019-11-06 12:16   ` Richard Biener
2019-11-05 14:32 ` [6/6][AArch64] Enable vect-compare-loop-costs by default for SVE Richard Sandiford
