public inbox for gcc-patches@gcc.gnu.org
* aarch64: Opt-in tweaks to the AArch64 vector cost model
@ 2021-03-26 16:12 Richard Sandiford
  2021-03-26 16:14 ` [PATCH 01/13] aarch64: Add reduction costs to simd_vec_costs Richard Sandiford
                   ` (12 more replies)
  0 siblings, 13 replies; 14+ messages in thread
From: Richard Sandiford @ 2021-03-26 16:12 UTC (permalink / raw)
  To: gcc-patches

SVE uses VECT_COMPARE_COSTS to tell the vectoriser to try as many
variations as it knows about and pick the one with the lowest cost.
This serves two purposes:

(1) It means we can compare SVE loops that operate on packed vectors
    with SVE loops that operate on unpacked vectors.

(2) It means that we can compare SVE with Advanced SIMD.

Although we used VECT_COMPARE_COSTS for both of these purposes from the
outset, the focus initially was more on (1).  Adding VECT_COMPARE_COSTS
allowed us to use SVE extending loads and truncating stores, in which
loads and stores effectively operate on unpacked rather than packed
vectors.  This part seems to work pretty well in practice.
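
As a concrete (if simplified) illustration of (1), here's a sketch
of my own (not a testcase from the series):

  void
  widen_sum (int *restrict out, const short *restrict in, int n)
  {
    /* Each int16_t element can be sign-extended as part of the load
       (e.g. with LD1SH) and processed in a 32-bit lane, so the loads
       effectively operate on unpacked vectors.  */
    for (int i = 0; i < n; ++i)
      out[i] += in[i];
  }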

However, it turns out that the second part (Advanced SIMD vs. SVE)
is less reliable.  There are three main reasons for this:

* At the moment, the AArch64 vector cost structures stick rigidly to the
  vect_cost_for_stmt enumeration provided by target-independent code.
  This particularly affects vec_to_scalar, which is used for at least:

  - reductions
  - extracting an element from a vector to do scalar arithmetic
  - extracting an element to store it out

  The vectoriser gives us the information we need to distinguish
  these cases, but the port wasn't using it.  Other problems include
  undercosting LD[234] and ST[234] instructions and scatter stores.

* Currently, the vectoriser costing works by adding up what are typically
  latency values.  As Richi mentioned recently in an x86 context,
  this effectively means that we treat the scalar and vector code
  as executing serially.  That already causes some problems for
  Advanced SIMD vs. scalar code, but it turns out to be a particular
  problem when comparing SVE with Advanced SIMD.  Scalar, Advanced
  SIMD and SVE can have significantly different issue characteristics,
  and summing latencies misses some important details, especially in
  loops involving reductions.  (See the illustration after this list.)

* Advanced SIMD code can be completely unrolled at compile time,
  but length-agnostic SVE code can't.  We weren't taking this into
  account when comparing the costs.
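
To illustrate the issue-rate point above with deliberately made-up
numbers: suppose a loop body contains four independent vector
multiply-adds, each with a latency of 4 cycles.  Summing latencies
gives a cost of 16, but a core with two vector pipes could issue the
operations in roughly 2 cycles per iteration.  A chain of four
*dependent* reduction operations with the same latency also sums
to 16, yet really does take about 16 cycles.  A pure latency sum
therefore can't tell these two situations apart, even though their
real throughputs differ by a factor of 8.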

This series of patches tries to address these problems by making
some opt-in tweaks to the vector cost model.  It produces much better
results on the SVE workloads that we've tried internally.  We'd therefore
like to put this in for GCC 11.

I'm really sorry that this is landing so late in stage 4.  Clearly it
would have been much better to do this earlier.  However:

- The patches “only” change the vector cost hooks.  There are no changes
  elsewhere.  In other words, the SVE code we generate and the Advanced
  SIMD code we generate are unchanged: the “only” thing we're doing is
  using different heuristics to select between them.

- As mentioned above, almost all the new code is “opt-in”.  Therefore,
  only CPUs that explicitly want it (and will benefit from it) will be
  affected.  Most of the code is not executed otherwise.

Tested on aarch64-linux-gnu (with and without SVE), pushed to trunk.

Richard

* [PATCH 01/13] aarch64: Add reduction costs to simd_vec_costs
  2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
@ 2021-03-26 16:14 ` Richard Sandiford
  2021-03-26 16:14 ` [PATCH 02/13] aarch64: Add vector costs for SVE CLAST[AB] and FADDA Richard Sandiford
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2021-03-26 16:14 UTC (permalink / raw)
  To: gcc-patches

This patch is part of a series that makes opt-in tweaks to the
AArch64 vector cost model.

At the moment, all reductions are costed as vec_to_scalar, which
also includes things like extracting a single element from a vector.
This is a bit too coarse in practice, since the cost of a reduction
depends very much on the type of value that it's processing.
This patch therefore adds separate costs for each case.  To start with,
all the new costs are copied from the associated vec_to_scalar ones.

Due to the extreme lateness of this patch in the GCC 11 cycle, I've added
a new tuning flag (use_new_vector_costs) that selects the new behaviour.
This should help to ensure that the risk of the new code is only borne
by the CPUs that need it.  Generic tuning is not affected.
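
For reference, a minimal example of the kind of loop affected (my own
sketch, not one of the series' testcases); the epilogue reduction
would now be costed using reduc_i32_cost rather than the generic
vec_to_scalar_cost:

  int
  sum_i32 (const int *x, int n)
  {
    int s = 0;
    /* The partial vector sums are reduced to a scalar in the vector
       epilogue, e.g. with ADDV, which is what reduc_i32_cost is
       intended to describe.  */
    for (int i = 0; i < n; ++i)
      s += x[i];
    return s;
  }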

gcc/
	* config/aarch64/aarch64-tuning-flags.def (use_new_vector_costs):
	New tuning flag.
	* config/aarch64/aarch64-protos.h (simd_vec_cost): Put comments
	above the fields rather than to the right.
	(simd_vec_cost::reduc_i8_cost): New member variable.
	(simd_vec_cost::reduc_i16_cost): Likewise.
	(simd_vec_cost::reduc_i32_cost): Likewise.
	(simd_vec_cost::reduc_i64_cost): Likewise.
	(simd_vec_cost::reduc_f16_cost): Likewise.
	(simd_vec_cost::reduc_f32_cost): Likewise.
	(simd_vec_cost::reduc_f64_cost): Likewise.
	* config/aarch64/aarch64.c (generic_advsimd_vector_cost): Update
	accordingly, using the vec_to_scalar_cost for the new fields.
	(generic_sve_vector_cost, a64fx_advsimd_vector_cost): Likewise.
	(a64fx_sve_vector_cost, qdf24xx_advsimd_vector_cost): Likewise.
	(thunderx_advsimd_vector_cost, tsv110_advsimd_vector_cost): Likewise.
	(cortexa57_advsimd_vector_cost, exynosm1_advsimd_vector_cost)
	(xgene1_advsimd_vector_cost, thunderx2t99_advsimd_vector_cost)
	(thunderx3t110_advsimd_vector_cost): Likewise.
	(aarch64_use_new_vector_costs_p): New function.
	(aarch64_simd_vec_costs): New function, split out from...
	(aarch64_builtin_vectorization_cost): ...here.
	(aarch64_is_reduction): New function.
	(aarch64_detect_vector_stmt_subtype): Likewise.
	(aarch64_add_stmt_cost): Call aarch64_detect_vector_stmt_subtype if
	using the new vector costs.
---
 gcc/config/aarch64/aarch64-protos.h         |  56 ++++--
 gcc/config/aarch64/aarch64-tuning-flags.def |   2 +
 gcc/config/aarch64/aarch64.c                | 180 +++++++++++++++++++-
 3 files changed, 216 insertions(+), 22 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index ff87ced2a34..e4eeb2ce142 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -194,22 +194,46 @@ struct cpu_regmove_cost
 
 struct simd_vec_cost
 {
-  const int int_stmt_cost;		/* Cost of any int vector operation,
-					   excluding load, store, permute,
-					   vector-to-scalar and
-					   scalar-to-vector operation.  */
-  const int fp_stmt_cost;		 /* Cost of any fp vector operation,
-					    excluding load, store, permute,
-					    vector-to-scalar and
-					    scalar-to-vector operation.  */
-  const int permute_cost;		 /* Cost of permute operation.  */
-  const int vec_to_scalar_cost;		 /* Cost of vec-to-scalar operation.  */
-  const int scalar_to_vec_cost;		 /* Cost of scalar-to-vector
-					    operation.  */
-  const int align_load_cost;	 /* Cost of aligned vector load.  */
-  const int unalign_load_cost;	 /* Cost of unaligned vector load.  */
-  const int unalign_store_cost;	 /* Cost of unaligned vector store.  */
-  const int store_cost;		 /* Cost of vector store.  */
+  /* Cost of any integer vector operation, excluding the ones handled
+     specially below.  */
+  const int int_stmt_cost;
+
+  /* Cost of any fp vector operation, excluding the ones handled
+     specially below.  */
+  const int fp_stmt_cost;
+
+  /* Cost of a permute operation.  */
+  const int permute_cost;
+
+  /* Cost of reductions for various vector types: iN is for N-bit
+     integer elements and fN is for N-bit floating-point elements.
+     We need to single out the element type because it affects the
+     depth of the reduction.  */
+  const int reduc_i8_cost;
+  const int reduc_i16_cost;
+  const int reduc_i32_cost;
+  const int reduc_i64_cost;
+  const int reduc_f16_cost;
+  const int reduc_f32_cost;
+  const int reduc_f64_cost;
+
+  /* Cost of a vector-to-scalar operation.  */
+  const int vec_to_scalar_cost;
+
+  /* Cost of a scalar-to-vector operation.  */
+  const int scalar_to_vec_cost;
+
+  /* Cost of an aligned vector load.  */
+  const int align_load_cost;
+
+  /* Cost of an unaligned vector load.  */
+  const int unalign_load_cost;
+
+  /* Cost of an unaligned vector store.  */
+  const int unalign_store_cost;
+
+  /* Cost of a vector store.  */
+  const int store_cost;
 };
 
 typedef struct simd_vec_cost advsimd_vec_cost;
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 588edf4d923..a61fcf94916 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -48,4 +48,6 @@ AARCH64_EXTRA_TUNING_OPTION ("rename_load_regs", RENAME_LOAD_REGS)
 
 AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
 
+AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
+
 #undef AARCH64_EXTRA_TUNING_OPTION
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index c8a87fe858a..b44dcdc6a6e 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -591,6 +591,13 @@ static const advsimd_vec_cost generic_advsimd_vector_cost =
   1, /* int_stmt_cost  */
   1, /* fp_stmt_cost  */
   2, /* permute_cost  */
+  2, /* reduc_i8_cost  */
+  2, /* reduc_i16_cost  */
+  2, /* reduc_i32_cost  */
+  2, /* reduc_i64_cost  */
+  2, /* reduc_f16_cost  */
+  2, /* reduc_f32_cost  */
+  2, /* reduc_f64_cost  */
   2, /* vec_to_scalar_cost  */
   1, /* scalar_to_vec_cost  */
   1, /* align_load_cost  */
@@ -605,6 +612,13 @@ static const sve_vec_cost generic_sve_vector_cost =
   1, /* int_stmt_cost  */
   1, /* fp_stmt_cost  */
   2, /* permute_cost  */
+  2, /* reduc_i8_cost  */
+  2, /* reduc_i16_cost  */
+  2, /* reduc_i32_cost  */
+  2, /* reduc_i64_cost  */
+  2, /* reduc_f16_cost  */
+  2, /* reduc_f32_cost  */
+  2, /* reduc_f64_cost  */
   2, /* vec_to_scalar_cost  */
   1, /* scalar_to_vec_cost  */
   1, /* align_load_cost  */
@@ -631,6 +645,13 @@ static const advsimd_vec_cost a64fx_advsimd_vector_cost =
   2, /* int_stmt_cost  */
   5, /* fp_stmt_cost  */
   3, /* permute_cost  */
+  13, /* reduc_i8_cost  */
+  13, /* reduc_i16_cost  */
+  13, /* reduc_i32_cost  */
+  13, /* reduc_i64_cost  */
+  13, /* reduc_f16_cost  */
+  13, /* reduc_f32_cost  */
+  13, /* reduc_f64_cost  */
   13, /* vec_to_scalar_cost  */
   4, /* scalar_to_vec_cost  */
   6, /* align_load_cost  */
@@ -644,6 +665,13 @@ static const sve_vec_cost a64fx_sve_vector_cost =
   2, /* int_stmt_cost  */
   5, /* fp_stmt_cost  */
   3, /* permute_cost  */
+  13, /* reduc_i8_cost  */
+  13, /* reduc_i16_cost  */
+  13, /* reduc_i32_cost  */
+  13, /* reduc_i64_cost  */
+  13, /* reduc_f16_cost  */
+  13, /* reduc_f32_cost  */
+  13, /* reduc_f64_cost  */
   13, /* vec_to_scalar_cost  */
   4, /* scalar_to_vec_cost  */
   6, /* align_load_cost  */
@@ -669,6 +697,13 @@ static const advsimd_vec_cost qdf24xx_advsimd_vector_cost =
   1, /* int_stmt_cost  */
   3, /* fp_stmt_cost  */
   2, /* permute_cost  */
+  1, /* reduc_i8_cost  */
+  1, /* reduc_i16_cost  */
+  1, /* reduc_i32_cost  */
+  1, /* reduc_i64_cost  */
+  1, /* reduc_f16_cost  */
+  1, /* reduc_f32_cost  */
+  1, /* reduc_f64_cost  */
   1, /* vec_to_scalar_cost  */
   1, /* scalar_to_vec_cost  */
   1, /* align_load_cost  */
@@ -696,6 +731,13 @@ static const advsimd_vec_cost thunderx_advsimd_vector_cost =
   4, /* int_stmt_cost  */
   1, /* fp_stmt_cost  */
   4, /* permute_cost  */
+  2, /* reduc_i8_cost  */
+  2, /* reduc_i16_cost  */
+  2, /* reduc_i32_cost  */
+  2, /* reduc_i64_cost  */
+  2, /* reduc_f16_cost  */
+  2, /* reduc_f32_cost  */
+  2, /* reduc_f64_cost  */
   2, /* vec_to_scalar_cost  */
   2, /* scalar_to_vec_cost  */
   3, /* align_load_cost  */
@@ -722,6 +764,13 @@ static const advsimd_vec_cost tsv110_advsimd_vector_cost =
   2, /* int_stmt_cost  */
   2, /* fp_stmt_cost  */
   2, /* permute_cost  */
+  3, /* reduc_i8_cost  */
+  3, /* reduc_i16_cost  */
+  3, /* reduc_i32_cost  */
+  3, /* reduc_i64_cost  */
+  3, /* reduc_f16_cost  */
+  3, /* reduc_f32_cost  */
+  3, /* reduc_f64_cost  */
   3, /* vec_to_scalar_cost  */
   2, /* scalar_to_vec_cost  */
   5, /* align_load_cost  */
@@ -747,6 +796,13 @@ static const advsimd_vec_cost cortexa57_advsimd_vector_cost =
   2, /* int_stmt_cost  */
   2, /* fp_stmt_cost  */
   3, /* permute_cost  */
+  8, /* reduc_i8_cost  */
+  8, /* reduc_i16_cost  */
+  8, /* reduc_i32_cost  */
+  8, /* reduc_i64_cost  */
+  8, /* reduc_f16_cost  */
+  8, /* reduc_f32_cost  */
+  8, /* reduc_f64_cost  */
   8, /* vec_to_scalar_cost  */
   8, /* scalar_to_vec_cost  */
   4, /* align_load_cost  */
@@ -773,6 +829,13 @@ static const advsimd_vec_cost exynosm1_advsimd_vector_cost =
   3, /* int_stmt_cost  */
   3, /* fp_stmt_cost  */
   3, /* permute_cost  */
+  3, /* reduc_i8_cost  */
+  3, /* reduc_i16_cost  */
+  3, /* reduc_i32_cost  */
+  3, /* reduc_i64_cost  */
+  3, /* reduc_f16_cost  */
+  3, /* reduc_f32_cost  */
+  3, /* reduc_f64_cost  */
   3, /* vec_to_scalar_cost  */
   3, /* scalar_to_vec_cost  */
   5, /* align_load_cost  */
@@ -798,6 +861,13 @@ static const advsimd_vec_cost xgene1_advsimd_vector_cost =
   2, /* int_stmt_cost  */
   2, /* fp_stmt_cost  */
   2, /* permute_cost  */
+  4, /* reduc_i8_cost  */
+  4, /* reduc_i16_cost  */
+  4, /* reduc_i32_cost  */
+  4, /* reduc_i64_cost  */
+  4, /* reduc_f16_cost  */
+  4, /* reduc_f32_cost  */
+  4, /* reduc_f64_cost  */
   4, /* vec_to_scalar_cost  */
   4, /* scalar_to_vec_cost  */
   10, /* align_load_cost  */
@@ -824,6 +894,13 @@ static const advsimd_vec_cost thunderx2t99_advsimd_vector_cost =
   4, /* int_stmt_cost  */
   5, /* fp_stmt_cost  */
   10, /* permute_cost  */
+  6, /* reduc_i8_cost  */
+  6, /* reduc_i16_cost  */
+  6, /* reduc_i32_cost  */
+  6, /* reduc_i64_cost  */
+  6, /* reduc_f16_cost  */
+  6, /* reduc_f32_cost  */
+  6, /* reduc_f64_cost  */
   6, /* vec_to_scalar_cost  */
   5, /* scalar_to_vec_cost  */
   4, /* align_load_cost  */
@@ -850,6 +927,13 @@ static const advsimd_vec_cost thunderx3t110_advsimd_vector_cost =
   5, /* int_stmt_cost  */
   5, /* fp_stmt_cost  */
   10, /* permute_cost  */
+  5, /* reduc_i8_cost  */
+  5, /* reduc_i16_cost  */
+  5, /* reduc_i32_cost  */
+  5, /* reduc_i64_cost  */
+  5, /* reduc_f16_cost  */
+  5, /* reduc_f32_cost  */
+  5, /* reduc_f64_cost  */
   5, /* vec_to_scalar_cost  */
   5, /* scalar_to_vec_cost  */
   4, /* align_load_cost  */
@@ -13874,6 +13958,28 @@ aarch64_first_cycle_multipass_dfa_lookahead_guard (rtx_insn *insn,
 
 /* Vectorizer cost model target hooks.  */
 
+/* Return true if the current CPU should use the new costs defined
+   in GCC 11.  This should be removed for GCC 12 and above, with the
+   costs applying to all CPUs instead.  */
+static bool
+aarch64_use_new_vector_costs_p ()
+{
+  return (aarch64_tune_params.extra_tuning_flags
+	  & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
+}
+
+/* Return the appropriate SIMD costs for vectors of type VECTYPE.  */
+static const simd_vec_cost *
+aarch64_simd_vec_costs (tree vectype)
+{
+  const cpu_vector_cost *costs = aarch64_tune_params.vec_costs;
+  if (vectype != NULL
+      && aarch64_sve_mode_p (TYPE_MODE (vectype))
+      && costs->sve != NULL)
+    return costs->sve;
+  return costs->advsimd;
+}
+
 /* Implement targetm.vectorize.builtin_vectorization_cost.  */
 static int
 aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
@@ -13887,12 +13993,7 @@ aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
   if (vectype != NULL)
     fp = FLOAT_TYPE_P (vectype);
 
-  const simd_vec_cost *simd_costs;
-  if (vectype != NULL && aarch64_sve_mode_p (TYPE_MODE (vectype))
-      && costs->sve != NULL)
-    simd_costs = costs->sve;
-  else
-    simd_costs = costs->advsimd;
+  const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
 
   switch (type_of_cost)
     {
@@ -13951,6 +14052,14 @@ aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
     }
 }
 
+/* Return true if STMT_INFO represents part of a reduction.  */
+static bool
+aarch64_is_reduction (stmt_vec_info stmt_info)
+{
+  return (STMT_VINFO_REDUC_DEF (stmt_info)
+	  || VECTORIZABLE_CYCLE_DEF (STMT_VINFO_DEF_TYPE (stmt_info)));
+}
+
 /* Return true if creating multiple copies of STMT_INFO for Advanced SIMD
    vectors would produce a series of LDP or STP operations.  KIND is the
    kind of statement that STMT_INFO represents.  */
@@ -14014,6 +14123,57 @@ aarch64_integer_truncation_p (stmt_vec_info stmt_info)
 	  && TYPE_PRECISION (lhs_type) < TYPE_PRECISION (rhs_type));
 }
 
+/* STMT_COST is the cost calculated by aarch64_builtin_vectorization_cost
+   for the vectorized form of STMT_INFO, which has cost kind KIND and which
+   when vectorized would operate on vector type VECTYPE.  Try to subdivide
+   the target-independent categorization provided by KIND to get a more
+   accurate cost.  WHERE specifies where the cost associated with KIND
+   occurs.  */
+static unsigned int
+aarch64_detect_vector_stmt_subtype (vect_cost_for_stmt kind,
+				    stmt_vec_info stmt_info, tree vectype,
+				    enum vect_cost_model_location where,
+				    unsigned int stmt_cost)
+{
+  const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
+
+  /* Detect cases in which vec_to_scalar represents a single reduction
+     instruction like FADDP or MAXV.  */
+  if (kind == vec_to_scalar
+      && where == vect_epilogue
+      && aarch64_is_reduction (stmt_info))
+    switch (GET_MODE_INNER (TYPE_MODE (vectype)))
+      {
+      case E_QImode:
+	return simd_costs->reduc_i8_cost;
+
+      case E_HImode:
+	return simd_costs->reduc_i16_cost;
+
+      case E_SImode:
+	return simd_costs->reduc_i32_cost;
+
+      case E_DImode:
+	return simd_costs->reduc_i64_cost;
+
+      case E_HFmode:
+      case E_BFmode:
+	return simd_costs->reduc_f16_cost;
+
+      case E_SFmode:
+	return simd_costs->reduc_f32_cost;
+
+      case E_DFmode:
+	return simd_costs->reduc_f64_cost;
+
+      default:
+	break;
+      }
+
+  /* Otherwise stick with the original categorization.  */
+  return stmt_cost;
+}
+
 /* STMT_COST is the cost calculated by aarch64_builtin_vectorization_cost
    for STMT_INFO, which has cost kind KIND and which when vectorized would
    operate on vector type VECTYPE.  Adjust the cost as necessary for SVE
@@ -14097,6 +14257,14 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       int stmt_cost =
 	    aarch64_builtin_vectorization_cost (kind, vectype, misalign);
 
+      /* Try to get a more accurate cost by looking at STMT_INFO instead
+	 of just looking at KIND.  */
+      if (stmt_info && vectype && aarch64_use_new_vector_costs_p ())
+	stmt_cost = aarch64_detect_vector_stmt_subtype (kind, stmt_info,
+							vectype, where,
+							stmt_cost);
+
+      /* Do any SVE-specific adjustments to the cost.  */
       if (stmt_info && vectype && aarch64_sve_mode_p (TYPE_MODE (vectype)))
 	stmt_cost = aarch64_sve_adjust_stmt_cost (vinfo, kind, stmt_info,
 						  vectype, stmt_cost);
-- 
2.17.1


* [PATCH 02/13] aarch64: Add vector costs for SVE CLAST[AB] and FADDA
  2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
  2021-03-26 16:14 ` [PATCH 01/13] aarch64: Add reduction costs to simd_vec_costs Richard Sandiford
@ 2021-03-26 16:14 ` Richard Sandiford
  2021-03-26 16:15 ` [PATCH 03/13] aarch64: Add costs for LD[234]/ST[234] permutes Richard Sandiford
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2021-03-26 16:14 UTC (permalink / raw)
  To: gcc-patches

Following on from the previous reduction costs patch, this one
adds costs for the SVE CLAST[AB] and FADDA instructions.
These instructions occur within the loop body, whereas the
reductions handled by the previous patch occur outside.

Like with the previous patch, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.
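
To make the distinction concrete, here's an in-loop reduction of my
own (assuming strict FP semantics, i.e. no -ffast-math) that SVE
implements with FADDA, which the new fadda_f64_cost describes:

  double
  acc_f64 (const double *x, int n)
  {
    double s = 0.0;
    /* Without reassociation the additions must happen in source
       order, so SVE uses the strictly-ordered FADDA instruction in
       the loop body rather than a tree reduction.  */
    for (int i = 0; i < n; ++i)
      s += x[i];
    return s;
  }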

gcc/
	* config/aarch64/aarch64-protos.h (sve_vec_cost): Turn into a
	derived class of simd_vec_cost.  Add information about CLAST[AB]
	and FADDA instructions.
	* config/aarch64/aarch64.c (generic_sve_vector_cost): Update
	accordingly, using the vec_to_scalar costs for the new fields.
	(a64fx_sve_vector_cost): Likewise.
	(aarch64_reduc_type): New function.
	(aarch64_sve_in_loop_reduction_latency): Likewise.
	(aarch64_detect_vector_stmt_subtype): Take a vinfo parameter.
	Use aarch64_sve_in_loop_reduction_latency to handle SVE reductions
	that occur in the loop body.
	(aarch64_add_stmt_cost): Update call accordingly.
---
 gcc/config/aarch64/aarch64-protos.h |  28 +++++-
 gcc/config/aarch64/aarch64.c        | 150 +++++++++++++++++++++-------
 2 files changed, 141 insertions(+), 37 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index e4eeb2ce142..bfcab72b122 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -237,7 +237,33 @@ struct simd_vec_cost
 };
 
 typedef struct simd_vec_cost advsimd_vec_cost;
-typedef struct simd_vec_cost sve_vec_cost;
+
+/* SVE-specific extensions to the information provided by simd_vec_cost.  */
+struct sve_vec_cost : simd_vec_cost
+{
+  constexpr sve_vec_cost (const simd_vec_cost &base,
+			  unsigned int clast_cost,
+			  unsigned int fadda_f16_cost,
+			  unsigned int fadda_f32_cost,
+			  unsigned int fadda_f64_cost)
+    : simd_vec_cost (base),
+      clast_cost (clast_cost),
+      fadda_f16_cost (fadda_f16_cost),
+      fadda_f32_cost (fadda_f32_cost),
+      fadda_f64_cost (fadda_f64_cost)
+  {}
+
+  /* The cost of a vector-to-scalar CLASTA or CLASTB instruction,
+     with the scalar being stored in FP registers.  This cost is
+     assumed to be a cycle latency.  */
+  const int clast_cost;
+
+  /* The costs of FADDA for the three data types that it supports.
+     These costs are assumed to be cycle latencies.  */
+  const int fadda_f16_cost;
+  const int fadda_f32_cost;
+  const int fadda_f64_cost;
+};
 
 /* Cost for vector insn classes.  */
 struct cpu_vector_cost
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b44dcdc6a6e..b62169a267a 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -609,22 +609,28 @@ static const advsimd_vec_cost generic_advsimd_vector_cost =
 /* Generic costs for SVE vector operations.  */
 static const sve_vec_cost generic_sve_vector_cost =
 {
-  1, /* int_stmt_cost  */
-  1, /* fp_stmt_cost  */
-  2, /* permute_cost  */
-  2, /* reduc_i8_cost  */
-  2, /* reduc_i16_cost  */
-  2, /* reduc_i32_cost  */
-  2, /* reduc_i64_cost  */
-  2, /* reduc_f16_cost  */
-  2, /* reduc_f32_cost  */
-  2, /* reduc_f64_cost  */
-  2, /* vec_to_scalar_cost  */
-  1, /* scalar_to_vec_cost  */
-  1, /* align_load_cost  */
-  1, /* unalign_load_cost  */
-  1, /* unalign_store_cost  */
-  1  /* store_cost  */
+  {
+    1, /* int_stmt_cost  */
+    1, /* fp_stmt_cost  */
+    2, /* permute_cost  */
+    2, /* reduc_i8_cost  */
+    2, /* reduc_i16_cost  */
+    2, /* reduc_i32_cost  */
+    2, /* reduc_i64_cost  */
+    2, /* reduc_f16_cost  */
+    2, /* reduc_f32_cost  */
+    2, /* reduc_f64_cost  */
+    2, /* vec_to_scalar_cost  */
+    1, /* scalar_to_vec_cost  */
+    1, /* align_load_cost  */
+    1, /* unalign_load_cost  */
+    1, /* unalign_store_cost  */
+    1  /* store_cost  */
+  },
+  2, /* clast_cost  */
+  2, /* fadda_f16_cost  */
+  2, /* fadda_f32_cost  */
+  2 /* fadda_f64_cost  */
 };
 
 /* Generic costs for vector insn classes.  */
@@ -662,22 +668,28 @@ static const advsimd_vec_cost a64fx_advsimd_vector_cost =
 
 static const sve_vec_cost a64fx_sve_vector_cost =
 {
-  2, /* int_stmt_cost  */
-  5, /* fp_stmt_cost  */
-  3, /* permute_cost  */
-  13, /* reduc_i8_cost  */
-  13, /* reduc_i16_cost  */
-  13, /* reduc_i32_cost  */
-  13, /* reduc_i64_cost  */
-  13, /* reduc_f16_cost  */
-  13, /* reduc_f32_cost  */
-  13, /* reduc_f64_cost  */
-  13, /* vec_to_scalar_cost  */
-  4, /* scalar_to_vec_cost  */
-  6, /* align_load_cost  */
-  6, /* unalign_load_cost  */
-  1, /* unalign_store_cost  */
-  1  /* store_cost  */
+  {
+    2, /* int_stmt_cost  */
+    5, /* fp_stmt_cost  */
+    3, /* permute_cost  */
+    13, /* reduc_i8_cost  */
+    13, /* reduc_i16_cost  */
+    13, /* reduc_i32_cost  */
+    13, /* reduc_i64_cost  */
+    13, /* reduc_f16_cost  */
+    13, /* reduc_f32_cost  */
+    13, /* reduc_f64_cost  */
+    13, /* vec_to_scalar_cost  */
+    4, /* scalar_to_vec_cost  */
+    6, /* align_load_cost  */
+    6, /* unalign_load_cost  */
+    1, /* unalign_store_cost  */
+    1  /* store_cost  */
+  },
+  13, /* clast_cost  */
+  13, /* fadda_f16_cost  */
+  13, /* fadda_f32_cost  */
+  13 /* fadda_f64_cost  */
 };
 
 static const struct cpu_vector_cost a64fx_vector_cost =
@@ -14060,6 +14072,20 @@ aarch64_is_reduction (stmt_vec_info stmt_info)
 	  || VECTORIZABLE_CYCLE_DEF (STMT_VINFO_DEF_TYPE (stmt_info)));
 }
 
+/* If STMT_INFO describes a reduction, return the type of reduction
+   it describes, otherwise return -1.  */
+static int
+aarch64_reduc_type (vec_info *vinfo, stmt_vec_info stmt_info)
+{
+  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
+    if (STMT_VINFO_REDUC_DEF (stmt_info))
+      {
+	stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info);
+	return int (STMT_VINFO_REDUC_TYPE (reduc_info));
+      }
+  return -1;
+}
+
 /* Return true if creating multiple copies of STMT_INFO for Advanced SIMD
    vectors would produce a series of LDP or STP operations.  KIND is the
    kind of statement that STMT_INFO represents.  */
@@ -14123,6 +14149,43 @@ aarch64_integer_truncation_p (stmt_vec_info stmt_info)
 	  && TYPE_PRECISION (lhs_type) < TYPE_PRECISION (rhs_type));
 }
 
+/* We are considering implementing STMT_INFO using SVE vector type VECTYPE.
+   If STMT_INFO is an in-loop reduction that SVE supports directly, return
+   its latency in cycles, otherwise return zero.  SVE_COSTS specifies the
+   latencies of the relevant instructions.  */
+static unsigned int
+aarch64_sve_in_loop_reduction_latency (vec_info *vinfo,
+				       stmt_vec_info stmt_info,
+				       tree vectype,
+				       const sve_vec_cost *sve_costs)
+{
+  switch (aarch64_reduc_type (vinfo, stmt_info))
+    {
+    case EXTRACT_LAST_REDUCTION:
+      return sve_costs->clast_cost;
+
+    case FOLD_LEFT_REDUCTION:
+      switch (GET_MODE_INNER (TYPE_MODE (vectype)))
+	{
+	case E_HFmode:
+	case E_BFmode:
+	  return sve_costs->fadda_f16_cost;
+
+	case E_SFmode:
+	  return sve_costs->fadda_f32_cost;
+
+	case E_DFmode:
+	  return sve_costs->fadda_f64_cost;
+
+	default:
+	  break;
+	}
+      break;
+    }
+
+  return 0;
+}
+
 /* STMT_COST is the cost calculated by aarch64_builtin_vectorization_cost
    for the vectorized form of STMT_INFO, which has cost kind KIND and which
    when vectorized would operate on vector type VECTYPE.  Try to subdivide
@@ -14130,12 +14193,27 @@ aarch64_integer_truncation_p (stmt_vec_info stmt_info)
    accurate cost.  WHERE specifies where the cost associated with KIND
    occurs.  */
 static unsigned int
-aarch64_detect_vector_stmt_subtype (vect_cost_for_stmt kind,
+aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
 				    stmt_vec_info stmt_info, tree vectype,
 				    enum vect_cost_model_location where,
 				    unsigned int stmt_cost)
 {
   const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
+  const sve_vec_cost *sve_costs = nullptr;
+  if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
+    sve_costs = aarch64_tune_params.vec_costs->sve;
+
+  /* Detect cases in which vec_to_scalar represents an in-loop reduction.  */
+  if (kind == vec_to_scalar
+      && where == vect_body
+      && sve_costs)
+    {
+      unsigned int latency
+	= aarch64_sve_in_loop_reduction_latency (vinfo, stmt_info, vectype,
+						 sve_costs);
+      if (latency)
+	return latency;
+    }
 
   /* Detect cases in which vec_to_scalar represents a single reduction
      instruction like FADDP or MAXV.  */
@@ -14260,9 +14338,9 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       /* Try to get a more accurate cost by looking at STMT_INFO instead
 	 of just looking at KIND.  */
       if (stmt_info && vectype && aarch64_use_new_vector_costs_p ())
-	stmt_cost = aarch64_detect_vector_stmt_subtype (kind, stmt_info,
-							vectype, where,
-							stmt_cost);
+	stmt_cost = aarch64_detect_vector_stmt_subtype (vinfo, kind,
+							stmt_info, vectype,
+							where, stmt_cost);
 
       /* Do any SVE-specific adjustments to the cost.  */
       if (stmt_info && vectype && aarch64_sve_mode_p (TYPE_MODE (vectype)))
-- 
2.17.1


* [PATCH 03/13] aarch64: Add costs for LD[234]/ST[234] permutes
  2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
  2021-03-26 16:14 ` [PATCH 01/13] aarch64: Add reduction costs to simd_vec_costs Richard Sandiford
  2021-03-26 16:14 ` [PATCH 02/13] aarch64: Add vector costs for SVE CLAST[AB] and FADDA Richard Sandiford
@ 2021-03-26 16:15 ` Richard Sandiford
  2021-03-26 16:15 ` [PATCH 04/13] aarch64: Add costs for storing one element of a vector Richard Sandiford
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2021-03-26 16:15 UTC (permalink / raw)
  To: gcc-patches

At the moment, we cost LD[234] and ST[234] as N vector loads
or stores, which effectively treats the implied permute as free.
This patch adds additional costs for the permutes, which apply on
top of the costs for the loads and stores.

Like with the previous patches, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.
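
As an illustration (again my own example rather than one from the
series), the following access pattern is typically vectorised with
LD2; the implied de-interleaving permute is what the new
ld2_st2_permute_cost accounts for:

  void
  sum_pairs (float *restrict out, const float *restrict in, int n)
  {
    /* The two interleaved input sequences are loaded together and
       then separated, i.e. an LD2-style load plus permute for each
       pair of vectors.  */
    for (int i = 0; i < n; ++i)
      out[i] = in[2 * i] + in[2 * i + 1];
  }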

gcc/
	* config/aarch64/aarch64-protos.h (simd_vec_cost::ld2_st2_permute_cost)
	(simd_vec_cost::ld3_st3_permute_cost): New member variables.
	(simd_vec_cost::ld4_st4_permute_cost): Likewise.
	* config/aarch64/aarch64.c (generic_advsimd_vector_cost): Update
	accordingly, using zero for the new costs.
	(generic_sve_vector_cost, a64fx_advsimd_vector_cost): Likewise.
	(a64fx_sve_vector_cost, qdf24xx_advsimd_vector_cost): Likewise.
	(thunderx_advsimd_vector_cost, tsv110_advsimd_vector_cost): Likewise.
	(cortexa57_advsimd_vector_cost, exynosm1_advsimd_vector_cost)
	(xgene1_advsimd_vector_cost, thunderx2t99_advsimd_vector_cost)
	(thunderx3t110_advsimd_vector_cost): Likewise.
	(aarch64_ld234_st234_vectors): New function.
	(aarch64_adjust_stmt_cost): Likewise.
	(aarch64_add_stmt_cost): Call aarch64_adjust_stmt_cost if using
	the new vector costs.
---
 gcc/config/aarch64/aarch64-protos.h |  7 +++
 gcc/config/aarch64/aarch64.c        | 94 +++++++++++++++++++++++++++++
 2 files changed, 101 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index bfcab72b122..3d152754981 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -202,6 +202,13 @@ struct simd_vec_cost
      specially below.  */
   const int fp_stmt_cost;
 
+  /* Per-vector cost of permuting vectors after an LD2, LD3 or LD4,
+     as well as the per-vector cost of permuting vectors before
+     an ST2, ST3 or ST4.  */
+  const int ld2_st2_permute_cost;
+  const int ld3_st3_permute_cost;
+  const int ld4_st4_permute_cost;
+
   /* Cost of a permute operation.  */
   const int permute_cost;
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b62169a267a..8fb723dabd2 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -590,6 +590,9 @@ static const advsimd_vec_cost generic_advsimd_vector_cost =
 {
   1, /* int_stmt_cost  */
   1, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   2, /* permute_cost  */
   2, /* reduc_i8_cost  */
   2, /* reduc_i16_cost  */
@@ -612,6 +615,9 @@ static const sve_vec_cost generic_sve_vector_cost =
   {
     1, /* int_stmt_cost  */
     1, /* fp_stmt_cost  */
+    0, /* ld2_st2_permute_cost  */
+    0, /* ld3_st3_permute_cost  */
+    0, /* ld4_st4_permute_cost  */
     2, /* permute_cost  */
     2, /* reduc_i8_cost  */
     2, /* reduc_i16_cost  */
@@ -650,6 +656,9 @@ static const advsimd_vec_cost a64fx_advsimd_vector_cost =
 {
   2, /* int_stmt_cost  */
   5, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   3, /* permute_cost  */
   13, /* reduc_i8_cost  */
   13, /* reduc_i16_cost  */
@@ -671,6 +680,9 @@ static const sve_vec_cost a64fx_sve_vector_cost =
   {
     2, /* int_stmt_cost  */
     5, /* fp_stmt_cost  */
+    0, /* ld2_st2_permute_cost  */
+    0, /* ld3_st3_permute_cost  */
+    0, /* ld4_st4_permute_cost  */
     3, /* permute_cost  */
     13, /* reduc_i8_cost  */
     13, /* reduc_i16_cost  */
@@ -708,6 +720,9 @@ static const advsimd_vec_cost qdf24xx_advsimd_vector_cost =
 {
   1, /* int_stmt_cost  */
   3, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   2, /* permute_cost  */
   1, /* reduc_i8_cost  */
   1, /* reduc_i16_cost  */
@@ -742,6 +757,9 @@ static const advsimd_vec_cost thunderx_advsimd_vector_cost =
 {
   4, /* int_stmt_cost  */
   1, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   4, /* permute_cost  */
   2, /* reduc_i8_cost  */
   2, /* reduc_i16_cost  */
@@ -775,6 +793,9 @@ static const advsimd_vec_cost tsv110_advsimd_vector_cost =
 {
   2, /* int_stmt_cost  */
   2, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   2, /* permute_cost  */
   3, /* reduc_i8_cost  */
   3, /* reduc_i16_cost  */
@@ -807,6 +828,9 @@ static const advsimd_vec_cost cortexa57_advsimd_vector_cost =
 {
   2, /* int_stmt_cost  */
   2, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   3, /* permute_cost  */
   8, /* reduc_i8_cost  */
   8, /* reduc_i16_cost  */
@@ -840,6 +864,9 @@ static const advsimd_vec_cost exynosm1_advsimd_vector_cost =
 {
   3, /* int_stmt_cost  */
   3, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   3, /* permute_cost  */
   3, /* reduc_i8_cost  */
   3, /* reduc_i16_cost  */
@@ -872,6 +899,9 @@ static const advsimd_vec_cost xgene1_advsimd_vector_cost =
 {
   2, /* int_stmt_cost  */
   2, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   2, /* permute_cost  */
   4, /* reduc_i8_cost  */
   4, /* reduc_i16_cost  */
@@ -905,6 +935,9 @@ static const advsimd_vec_cost thunderx2t99_advsimd_vector_cost =
 {
   4, /* int_stmt_cost  */
   5, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   10, /* permute_cost  */
   6, /* reduc_i8_cost  */
   6, /* reduc_i16_cost  */
@@ -938,6 +971,9 @@ static const advsimd_vec_cost thunderx3t110_advsimd_vector_cost =
 {
   5, /* int_stmt_cost  */
   5, /* fp_stmt_cost  */
+  0, /* ld2_st2_permute_cost  */
+  0, /* ld3_st3_permute_cost  */
+  0, /* ld4_st4_permute_cost  */
   10, /* permute_cost  */
   5, /* reduc_i8_cost  */
   5, /* reduc_i16_cost  */
@@ -14086,6 +14122,26 @@ aarch64_reduc_type (vec_info *vinfo, stmt_vec_info stmt_info)
   return -1;
 }
 
+/* Return true if an access of kind KIND for STMT_INFO represents one
+   vector of an LD[234] or ST[234] operation.  Return the total number of
+   vectors (2, 3 or 4) if so, otherwise return a value outside that range.  */
+static int
+aarch64_ld234_st234_vectors (vect_cost_for_stmt kind, stmt_vec_info stmt_info)
+{
+  if ((kind == vector_load
+       || kind == unaligned_load
+       || kind == vector_store
+       || kind == unaligned_store)
+      && STMT_VINFO_DATA_REF (stmt_info))
+    {
+      stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
+      if (stmt_info
+	  && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_LOAD_STORE_LANES)
+	return DR_GROUP_SIZE (stmt_info);
+    }
+  return 0;
+}
+
 /* Return true if creating multiple copies of STMT_INFO for Advanced SIMD
    vectors would produce a series of LDP or STP operations.  KIND is the
    kind of statement that STMT_INFO represents.  */
@@ -14320,6 +14376,38 @@ aarch64_sve_adjust_stmt_cost (class vec_info *vinfo, vect_cost_for_stmt kind,
   return stmt_cost;
 }
 
+/* STMT_COST is the cost calculated for STMT_INFO, which has cost kind KIND
+   and which when vectorized would operate on vector type VECTYPE.  Add the
+   cost of any embedded operations.  */
+static unsigned int
+aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
+			  tree vectype, unsigned int stmt_cost)
+{
+  if (vectype)
+    {
+      const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
+
+      /* Detect cases in which a vector load or store represents an
+	 LD[234] or ST[234] instruction.  */
+      switch (aarch64_ld234_st234_vectors (kind, stmt_info))
+	{
+	case 2:
+	  stmt_cost += simd_costs->ld2_st2_permute_cost;
+	  break;
+
+	case 3:
+	  stmt_cost += simd_costs->ld3_st3_permute_cost;
+	  break;
+
+	case 4:
+	  stmt_cost += simd_costs->ld4_st4_permute_cost;
+	  break;
+	}
+    }
+
+  return stmt_cost;
+}
+
 /* Implement targetm.vectorize.add_stmt_cost.  */
 static unsigned
 aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
@@ -14347,6 +14435,12 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 	stmt_cost = aarch64_sve_adjust_stmt_cost (vinfo, kind, stmt_info,
 						  vectype, stmt_cost);
 
+      if (stmt_info && aarch64_use_new_vector_costs_p ())
+	/* Account for any extra "embedded" costs that apply additively
+	   to the base cost calculated above.  */
+	stmt_cost = aarch64_adjust_stmt_cost (kind, stmt_info, vectype,
+					      stmt_cost);
+
       /* Statements in an inner loop relative to the loop being
 	 vectorized are weighted more heavily.  The value here is
 	 arbitrary and could potentially be improved with analysis.  */
-- 
2.17.1


* [PATCH 04/13] aarch64: Add costs for storing one element of a vector
  2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
                   ` (2 preceding siblings ...)
  2021-03-26 16:15 ` [PATCH 03/13] aarch64: Add costs for LD[234]/ST[234] permutes Richard Sandiford
@ 2021-03-26 16:15 ` Richard Sandiford
  2021-03-26 16:15 ` [PATCH 05/13] aarch64: Add costs for one element of a scatter store Richard Sandiford
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2021-03-26 16:15 UTC (permalink / raw)
  To: gcc-patches

Storing one element of a vector is costed as a vec_to_scalar
followed by a scalar_store.  However, vec_to_scalar is also
used for reductions and for vector-to-GPR moves, which makes
it difficult to pick one cost for them all.

This patch therefore adds a cost for extracting one element
of a vector in preparation for storing it out.  The store
itself is still costed separately.

Like with the previous patches, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.
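
To give an idea of the case being targeted: in the (made-up) example
below the stores aren't contiguous, so (for Advanced SIMD at least)
each element of the vector result is extracted and stored
individually, with the extraction now costed via store_elt_extra_cost:

  void
  store_strided (float *restrict out, const float *restrict in, int n)
  {
    /* The additions are done in vector form, but each lane then has
       to be moved out of the vector and stored separately.  */
    for (int i = 0; i < n; ++i)
      out[3 * i] = in[i] + 1.0f;
  }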

gcc/
	* config/aarch64/aarch64-protos.h
	(simd_vec_cost::store_elt_extra_cost): New member variable.
	* config/aarch64/aarch64.c (generic_advsimd_vector_cost): Update
	accordingly, using the vec_to_scalar cost for the new field.
	(generic_sve_vector_cost, a64fx_advsimd_vector_cost): Likewise.
	(a64fx_sve_vector_cost, qdf24xx_advsimd_vector_cost): Likewise.
	(thunderx_advsimd_vector_cost, tsv110_advsimd_vector_cost): Likewise.
	(cortexa57_advsimd_vector_cost, exynosm1_advsimd_vector_cost)
	(xgene1_advsimd_vector_cost, thunderx2t99_advsimd_vector_cost)
	(thunderx3t110_advsimd_vector_cost): Likewise.
	(aarch64_detect_vector_stmt_subtype): Detect single-element stores.
---
 gcc/config/aarch64/aarch64-protos.h |  4 ++++
 gcc/config/aarch64/aarch64.c        | 20 ++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 3d152754981..fabe3df7071 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -224,6 +224,10 @@ struct simd_vec_cost
   const int reduc_f32_cost;
   const int reduc_f64_cost;
 
+  /* Additional cost of storing a single vector element, on top of the
+     normal cost of a scalar store.  */
+  const int store_elt_extra_cost;
+
   /* Cost of a vector-to-scalar operation.  */
   const int vec_to_scalar_cost;
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 8fb723dabd2..20bb75bd56c 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -601,6 +601,7 @@ static const advsimd_vec_cost generic_advsimd_vector_cost =
   2, /* reduc_f16_cost  */
   2, /* reduc_f32_cost  */
   2, /* reduc_f64_cost  */
+  2, /* store_elt_extra_cost  */
   2, /* vec_to_scalar_cost  */
   1, /* scalar_to_vec_cost  */
   1, /* align_load_cost  */
@@ -626,6 +627,7 @@ static const sve_vec_cost generic_sve_vector_cost =
     2, /* reduc_f16_cost  */
     2, /* reduc_f32_cost  */
     2, /* reduc_f64_cost  */
+    2, /* store_elt_extra_cost  */
     2, /* vec_to_scalar_cost  */
     1, /* scalar_to_vec_cost  */
     1, /* align_load_cost  */
@@ -667,6 +669,7 @@ static const advsimd_vec_cost a64fx_advsimd_vector_cost =
   13, /* reduc_f16_cost  */
   13, /* reduc_f32_cost  */
   13, /* reduc_f64_cost  */
+  13, /* store_elt_extra_cost  */
   13, /* vec_to_scalar_cost  */
   4, /* scalar_to_vec_cost  */
   6, /* align_load_cost  */
@@ -691,6 +694,7 @@ static const sve_vec_cost a64fx_sve_vector_cost =
     13, /* reduc_f16_cost  */
     13, /* reduc_f32_cost  */
     13, /* reduc_f64_cost  */
+    13, /* store_elt_extra_cost  */
     13, /* vec_to_scalar_cost  */
     4, /* scalar_to_vec_cost  */
     6, /* align_load_cost  */
@@ -731,6 +735,7 @@ static const advsimd_vec_cost qdf24xx_advsimd_vector_cost =
   1, /* reduc_f16_cost  */
   1, /* reduc_f32_cost  */
   1, /* reduc_f64_cost  */
+  1, /* store_elt_extra_cost  */
   1, /* vec_to_scalar_cost  */
   1, /* scalar_to_vec_cost  */
   1, /* align_load_cost  */
@@ -768,6 +773,7 @@ static const advsimd_vec_cost thunderx_advsimd_vector_cost =
   2, /* reduc_f16_cost  */
   2, /* reduc_f32_cost  */
   2, /* reduc_f64_cost  */
+  2, /* store_elt_extra_cost  */
   2, /* vec_to_scalar_cost  */
   2, /* scalar_to_vec_cost  */
   3, /* align_load_cost  */
@@ -804,6 +810,7 @@ static const advsimd_vec_cost tsv110_advsimd_vector_cost =
   3, /* reduc_f16_cost  */
   3, /* reduc_f32_cost  */
   3, /* reduc_f64_cost  */
+  3, /* store_elt_extra_cost  */
   3, /* vec_to_scalar_cost  */
   2, /* scalar_to_vec_cost  */
   5, /* align_load_cost  */
@@ -839,6 +846,7 @@ static const advsimd_vec_cost cortexa57_advsimd_vector_cost =
   8, /* reduc_f16_cost  */
   8, /* reduc_f32_cost  */
   8, /* reduc_f64_cost  */
+  8, /* store_elt_extra_cost  */
   8, /* vec_to_scalar_cost  */
   8, /* scalar_to_vec_cost  */
   4, /* align_load_cost  */
@@ -875,6 +883,7 @@ static const advsimd_vec_cost exynosm1_advsimd_vector_cost =
   3, /* reduc_f16_cost  */
   3, /* reduc_f32_cost  */
   3, /* reduc_f64_cost  */
+  3, /* store_elt_extra_cost  */
   3, /* vec_to_scalar_cost  */
   3, /* scalar_to_vec_cost  */
   5, /* align_load_cost  */
@@ -910,6 +919,7 @@ static const advsimd_vec_cost xgene1_advsimd_vector_cost =
   4, /* reduc_f16_cost  */
   4, /* reduc_f32_cost  */
   4, /* reduc_f64_cost  */
+  4, /* store_elt_extra_cost  */
   4, /* vec_to_scalar_cost  */
   4, /* scalar_to_vec_cost  */
   10, /* align_load_cost  */
@@ -946,6 +956,7 @@ static const advsimd_vec_cost thunderx2t99_advsimd_vector_cost =
   6, /* reduc_f16_cost  */
   6, /* reduc_f32_cost  */
   6, /* reduc_f64_cost  */
+  6, /* store_elt_extra_cost  */
   6, /* vec_to_scalar_cost  */
   5, /* scalar_to_vec_cost  */
   4, /* align_load_cost  */
@@ -982,6 +993,7 @@ static const advsimd_vec_cost thunderx3t110_advsimd_vector_cost =
   5, /* reduc_f16_cost  */
   5, /* reduc_f32_cost  */
   5, /* reduc_f64_cost  */
+  5, /* store_elt_extra_cost  */
   5, /* vec_to_scalar_cost  */
   5, /* scalar_to_vec_cost  */
   4, /* align_load_cost  */
@@ -14259,6 +14271,14 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
   if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
     sve_costs = aarch64_tune_params.vec_costs->sve;
 
+  /* Detect cases in which vec_to_scalar is describing the extraction of a
+     vector element in preparation for a scalar store.  The store itself is
+     costed separately.  */
+  if (kind == vec_to_scalar
+      && STMT_VINFO_DATA_REF (stmt_info)
+      && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
+    return simd_costs->store_elt_extra_cost;
+
   /* Detect cases in which vec_to_scalar represents an in-loop reduction.  */
   if (kind == vec_to_scalar
       && where == vect_body
-- 
2.17.1


* [PATCH 05/13] aarch64: Add costs for one element of a scatter store
  2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
                   ` (3 preceding siblings ...)
  2021-03-26 16:15 ` [PATCH 04/13] aarch64: Add costs for storing one element of a vector Richard Sandiford
@ 2021-03-26 16:15 ` Richard Sandiford
  2021-03-26 16:16 ` [PATCH 06/13] aarch64: Add a CPU-specific cost table for Neoverse V1 Richard Sandiford
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2021-03-26 16:15 UTC (permalink / raw)
  To: gcc-patches

Currently each element in a gather load is costed as a scalar_load
and each element in a scatter store is costed as a scalar_store.
The load side seems to work pretty well in practice, since many
CPU-specific costs give loads quite a high cost relative to
arithmetic operations.  However, stores usually have a cost
of just 1, which means that scatters tend to appear too cheap.

This patch adds a separate cost for one element in a scatter store.

Like with the previous patches, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.
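
For illustration, the (made-up) loop below ends up as a scatter
store: the store addresses are data-dependent, so SVE uses a scatter
form of ST1D, and each element is now costed using
scatter_store_elt_cost rather than a unit-cost scalar_store:

  void
  scatter_f64 (double *restrict out, const int *restrict idx,
               const double *restrict in, int n)
  {
    /* out[idx[i]] has no fixed stride, so the vectorised store is
       a scatter of one element per active lane.  */
    for (int i = 0; i < n; ++i)
      out[idx[i]] = in[i];
  }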

gcc/
	* config/aarch64/aarch64-protos.h
	(sve_vec_cost::scatter_store_elt_cost): New member variable.
	* config/aarch64/aarch64.c (generic_sve_vector_cost): Update
	accordingly, taking the cost from the cost of a scalar_store.
	(a64fx_sve_vector_cost): Likewise.
	(aarch64_detect_vector_stmt_subtype): Detect scatter stores.
---
 gcc/config/aarch64/aarch64-protos.h |  9 +++++++--
 gcc/config/aarch64/aarch64.c        | 13 +++++++++++--
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index fabe3df7071..2ffa96ec24b 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -256,12 +256,14 @@ struct sve_vec_cost : simd_vec_cost
 			  unsigned int clast_cost,
 			  unsigned int fadda_f16_cost,
 			  unsigned int fadda_f32_cost,
-			  unsigned int fadda_f64_cost)
+			  unsigned int fadda_f64_cost,
+			  unsigned int scatter_store_elt_cost)
     : simd_vec_cost (base),
       clast_cost (clast_cost),
       fadda_f16_cost (fadda_f16_cost),
       fadda_f32_cost (fadda_f32_cost),
-      fadda_f64_cost (fadda_f64_cost)
+      fadda_f64_cost (fadda_f64_cost),
+      scatter_store_elt_cost (scatter_store_elt_cost)
   {}
 
   /* The cost of a vector-to-scalar CLASTA or CLASTB instruction,
@@ -274,6 +276,9 @@ struct sve_vec_cost : simd_vec_cost
   const int fadda_f16_cost;
   const int fadda_f32_cost;
   const int fadda_f64_cost;
+
+  /* The per-element cost of a scatter store.  */
+  const int scatter_store_elt_cost;
 };
 
 /* Cost for vector insn classes.  */
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 20bb75bd56c..7f727413d01 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -638,7 +638,8 @@ static const sve_vec_cost generic_sve_vector_cost =
   2, /* clast_cost  */
   2, /* fadda_f16_cost  */
   2, /* fadda_f32_cost  */
-  2 /* fadda_f64_cost  */
+  2, /* fadda_f64_cost  */
+  1 /* scatter_store_elt_cost  */
 };
 
 /* Generic costs for vector insn classes.  */
@@ -705,7 +706,8 @@ static const sve_vec_cost a64fx_sve_vector_cost =
   13, /* clast_cost  */
   13, /* fadda_f16_cost  */
   13, /* fadda_f32_cost  */
-  13 /* fadda_f64_cost  */
+  13, /* fadda_f64_cost  */
+  1 /* scatter_store_elt_cost  */
 };
 
 static const struct cpu_vector_cost a64fx_vector_cost =
@@ -14279,6 +14281,13 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
       && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
     return simd_costs->store_elt_extra_cost;
 
+  /* Detect cases in which a scalar_store is really storing one element
+     in a scatter operation.  */
+  if (kind == scalar_store
+      && sve_costs
+      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
+    return sve_costs->scatter_store_elt_cost;
+
   /* Detect cases in which vec_to_scalar represents an in-loop reduction.  */
   if (kind == vec_to_scalar
       && where == vect_body
-- 
2.17.1


* [PATCH 06/13] aarch64: Add a CPU-specific cost table for Neoverse V1
  2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
                   ` (4 preceding siblings ...)
  2021-03-26 16:15 ` [PATCH 05/13] aarch64: Add costs for one element of a scatter store Richard Sandiford
@ 2021-03-26 16:16 ` Richard Sandiford
  2021-03-26 16:16 ` [PATCH 07/13] aarch64: Use an aarch64-specific structure for vector costing Richard Sandiford
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2021-03-26 16:16 UTC (permalink / raw)
  To: gcc-patches

This patch adds dedicated vector costs for Neoverse V1.
Previously we just used the Cortex-A57 costs, which isn't
ideal given that Cortex-A57 doesn't support SVE.
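
These costs are picked up automatically for -mcpu=neoverse-v1 and
-mtune=neoverse-v1.  If I've remembered the -moverride syntax
correctly, other tunings should also be able to experiment with the
new code path using -moverride=tune=use_new_vector_costs.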

gcc/
	* config/aarch64/aarch64.c (neoversev1_advsimd_vector_cost)
	(neoversev1_sve_vector_cost): New cost structures.
	(neoversev1_vector_cost): Likewise.
	(neoversev1_tunings): Use them.  Enable use_new_vector_costs.
---
 gcc/config/aarch64/aarch64.c | 95 +++++++++++++++++++++++++++++++++++-
 1 file changed, 93 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 7f727413d01..2e9853e4c9b 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1619,12 +1619,102 @@ static const struct tune_params neoversen1_tunings =
   &generic_prefetch_tune
 };
 
+static const advsimd_vec_cost neoversev1_advsimd_vector_cost =
+{
+  2, /* int_stmt_cost  */
+  2, /* fp_stmt_cost  */
+  4, /* ld2_st2_permute_cost */
+  4, /* ld3_st3_permute_cost  */
+  5, /* ld4_st4_permute_cost  */
+  3, /* permute_cost  */
+  4, /* reduc_i8_cost  */
+  4, /* reduc_i16_cost  */
+  2, /* reduc_i32_cost  */
+  2, /* reduc_i64_cost  */
+  6, /* reduc_f16_cost  */
+  3, /* reduc_f32_cost  */
+  2, /* reduc_f64_cost  */
+  2, /* store_elt_extra_cost  */
+  /* This value is just inherited from the Cortex-A57 table.  */
+  8, /* vec_to_scalar_cost  */
+  /* This depends very much on what the scalar value is and
+     where it comes from.  E.g. some constants take two dependent
+     instructions or a load, while others might be moved from a GPR.
+     4 seems to be a reasonable compromise in practice.  */
+  4, /* scalar_to_vec_cost  */
+  4, /* align_load_cost  */
+  4, /* unalign_load_cost  */
+  /* Although stores have a latency of 2 and compete for the
+     vector pipes, in practice it's better not to model that.  */
+  1, /* unalign_store_cost  */
+  1  /* store_cost  */
+};
+
+static const sve_vec_cost neoversev1_sve_vector_cost =
+{
+  {
+    2, /* int_stmt_cost  */
+    2, /* fp_stmt_cost  */
+    4, /* ld2_st2_permute_cost  */
+    7, /* ld3_st3_permute_cost  */
+    8, /* ld4_st4_permute_cost  */
+    3, /* permute_cost  */
+    /* Theoretically, a reduction involving 31 scalar ADDs could
+       complete in ~9 cycles and would have a cost of 31.  [SU]ADDV
+       completes in 14 cycles, so give it a cost of 31 + 5.  */
+    36, /* reduc_i8_cost  */
+    /* Likewise for 15 scalar ADDs (~5 cycles) vs. 12: 15 + 7.  */
+    22, /* reduc_i16_cost  */
+    /* Likewise for 7 scalar ADDs (~3 cycles) vs. 10: 7 + 7.  */
+    14, /* reduc_i32_cost  */
+    /* Likewise for 3 scalar ADDs (~2 cycles) vs. 10: 3 + 8.  */
+    11, /* reduc_i64_cost  */
+    /* Theoretically, a reduction involving 15 scalar FADDs could
+       complete in ~9 cycles and would have a cost of 30.  FADDV
+       completes in 13 cycles, so give it a cost of 30 + 4.  */
+    34, /* reduc_f16_cost  */
+    /* Likewise for 7 scalar FADDs (~6 cycles) vs. 11: 14 + 5.  */
+    19, /* reduc_f32_cost  */
+    /* Likewise for 3 scalar FADDs (~4 cycles) vs. 9: 6 + 5.  */
+    11, /* reduc_f64_cost  */
+    2, /* store_elt_extra_cost  */
+    /* This value is just inherited from the Cortex-A57 table.  */
+    8, /* vec_to_scalar_cost  */
+    /* See the comment above the Advanced SIMD versions.  */
+    4, /* scalar_to_vec_cost  */
+    4, /* align_load_cost  */
+    4, /* unalign_load_cost  */
+    /* Although stores have a latency of 2 and compete for the
+       vector pipes, in practice it's better not to model that.  */
+    1, /* unalign_store_cost  */
+    1  /* store_cost  */
+  },
+  3, /* clast_cost  */
+  19, /* fadda_f16_cost  */
+  11, /* fadda_f32_cost  */
+  8, /* fadda_f64_cost  */
+  3 /* scatter_store_elt_cost  */
+};
+
+/* Neoverse V1 costs for vector insn classes.  */
+static const struct cpu_vector_cost neoversev1_vector_cost =
+{
+  1, /* scalar_int_stmt_cost  */
+  2, /* scalar_fp_stmt_cost  */
+  4, /* scalar_load_cost  */
+  1, /* scalar_store_cost  */
+  1, /* cond_taken_branch_cost  */
+  1, /* cond_not_taken_branch_cost  */
+  &neoversev1_advsimd_vector_cost, /* advsimd  */
+  &neoversev1_sve_vector_cost /* sve  */
+};
+
 static const struct tune_params neoversev1_tunings =
 {
   &cortexa76_extra_costs,
   &generic_addrcost_table,
   &generic_regmove_cost,
-  &cortexa57_vector_cost,
+  &neoversev1_vector_cost,
   &generic_branch_cost,
   &generic_approx_modes,
   SVE_256, /* sve_width  */
@@ -1641,7 +1731,8 @@ static const struct tune_params neoversev1_tunings =
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS),	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
+   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS),	/* tune_flags.  */
   &generic_prefetch_tune
 };
 
-- 
2.17.1


* [PATCH 07/13] aarch64: Use an aarch64-specific structure for vector costing
  2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
                   ` (5 preceding siblings ...)
  2021-03-26 16:16 ` [PATCH 06/13] aarch64: Add a CPU-specific cost table for Neoverse V1 Richard Sandiford
@ 2021-03-26 16:16 ` Richard Sandiford
  2021-03-26 16:16 ` [PATCH 08/13] aarch64: Try to detect when Advanced SIMD code would be completely unrolled Richard Sandiford
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2021-03-26 16:16 UTC (permalink / raw)
  To: gcc-patches

This patch makes the AArch64 vector code use its own vector
costs structure, rather than just using the default unsigned[3].

Unfortunately, it's not easy to make this change specific to
use_new_vector_costs, so this part is one that affects all CPUs.
The change is relatively mechanical though.
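
For reference, the vectorizer drives these hooks roughly as follows
(a simplified sketch rather than actual vectorizer code; LOOP, VINFO,
COUNT, KIND, STMT_INFO, VECTYPE and MISALIGN are placeholders):

  void *data = targetm.vectorize.init_cost (loop);
  targetm.vectorize.add_stmt_cost (vinfo, data, count, kind, stmt_info,
                                   vectype, misalign, vect_body);
  unsigned int prologue_cost, body_cost, epilogue_cost;
  targetm.vectorize.finish_cost (data, &prologue_cost, &body_cost,
                                 &epilogue_cost);
  targetm.vectorize.destroy_cost_data (data);

After this patch, DATA points at an aarch64_vector_costs rather than
at an unsigned[3], which gives the later patches somewhere to keep
extra per-loop state.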

gcc/
	* config/aarch64/aarch64.c (aarch64_vector_costs): New structure.
	(aarch64_init_cost): New function.
	(aarch64_add_stmt_cost): Use aarch64_vector_costs instead of
	the default unsigned[3].
	(aarch64_finish_cost, aarch64_destroy_cost_data): New functions.
	(TARGET_VECTORIZE_INIT_COST): Override.
	(TARGET_VECTORIZE_FINISH_COST): Likewise.
	(TARGET_VECTORIZE_DESTROY_COST_DATA): Likewise.
---
 gcc/config/aarch64/aarch64.c | 46 ++++++++++++++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 2e9853e4c9b..81683b7faa7 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14111,6 +14111,21 @@ aarch64_first_cycle_multipass_dfa_lookahead_guard (rtx_insn *insn,
 
 /* Vectorizer cost model target hooks.  */
 
+/* Information about vector code that we're in the process of costing.  */
+struct aarch64_vector_costs
+{
+  /* The normal latency-based costs for each region (prologue, body and
+     epilogue), indexed by vect_cost_model_location.  */
+  unsigned int region[3] = {};
+};
+
+/* Implement TARGET_VECTORIZE_INIT_COST.  */
+static void *
+aarch64_init_cost (class loop *)
+{
+  return new aarch64_vector_costs;
+}
+
 /* Return true if the current CPU should use the new costs defined
    in GCC 11.  This should be removed for GCC 12 and above, with the
    costs applying to all CPUs instead.  */
@@ -14535,7 +14550,7 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 		       struct _stmt_vec_info *stmt_info, tree vectype,
 		       int misalign, enum vect_cost_model_location where)
 {
-  unsigned *cost = (unsigned *) data;
+  auto *costs = static_cast<aarch64_vector_costs *> (data);
   unsigned retval = 0;
 
   if (flag_vect_cost_model)
@@ -14569,12 +14584,30 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 	count *= 50; /*  FIXME  */
 
       retval = (unsigned) (count * stmt_cost);
-      cost[where] += retval;
+      costs->region[where] += retval;
     }
 
   return retval;
 }
 
+/* Implement TARGET_VECTORIZE_FINISH_COST.  */
+static void
+aarch64_finish_cost (void *data, unsigned *prologue_cost,
+		     unsigned *body_cost, unsigned *epilogue_cost)
+{
+  auto *costs = static_cast<aarch64_vector_costs *> (data);
+  *prologue_cost = costs->region[vect_prologue];
+  *body_cost     = costs->region[vect_body];
+  *epilogue_cost = costs->region[vect_epilogue];
+}
+
+/* Implement TARGET_VECTORIZE_DESTROY_COST_DATA.  */
+static void
+aarch64_destroy_cost_data (void *data)
+{
+  delete static_cast<aarch64_vector_costs *> (data);
+}
+
 static void initialize_aarch64_code_model (struct gcc_options *);
 
 /* Parse the TO_PARSE string and put the architecture struct that it
@@ -24713,9 +24746,18 @@ aarch64_libgcc_floating_mode_supported_p
 #undef TARGET_ARRAY_MODE_SUPPORTED_P
 #define TARGET_ARRAY_MODE_SUPPORTED_P aarch64_array_mode_supported_p
 
+#undef TARGET_VECTORIZE_INIT_COST
+#define TARGET_VECTORIZE_INIT_COST aarch64_init_cost
+
 #undef TARGET_VECTORIZE_ADD_STMT_COST
 #define TARGET_VECTORIZE_ADD_STMT_COST aarch64_add_stmt_cost
 
+#undef TARGET_VECTORIZE_FINISH_COST
+#define TARGET_VECTORIZE_FINISH_COST aarch64_finish_cost
+
+#undef TARGET_VECTORIZE_DESTROY_COST_DATA
+#define TARGET_VECTORIZE_DESTROY_COST_DATA aarch64_destroy_cost_data
+
 #undef TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST
 #define TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST \
   aarch64_builtin_vectorization_cost
-- 
2.17.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 08/13] aarch64: Try to detect when Advanced SIMD code would be completely unrolled
  2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
                   ` (6 preceding siblings ...)
  2021-03-26 16:16 ` [PATCH 07/13] aarch64: Use an aarch64-specific structure for vector costing Richard Sandiford
@ 2021-03-26 16:16 ` Richard Sandiford
  2021-03-26 16:17 ` [PATCH 09/13] aarch64: Detect scalar extending loads Richard Sandiford
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2021-03-26 16:16 UTC (permalink / raw)
  To: gcc-patches

GCC usually costs the SVE and Advanced SIMD versions of a loop
and picks the one with the lowest cost.  By default it will choose
SVE over Advanced SIMD in the event of a tie.

This is normally the correct behaviour, not least because SVE can
handle every scalar iteration count whereas Advanced SIMD can only
handle full vectors.  However, there is one important exception
that GCC failed to consider: we can completely unroll Advanced SIMD
code at compile time, but we can't do the same for SVE.

This patch therefore adds an opt-in heuristic to guess whether
the Advanced SIMD version of a loop is likely to be completely unrolled.
This will only be suitable for some CPUs, so it is not enabled
by default and is controlled separately from use_new_vector_costs.

Like with previous patches, this one only becomes active if a
CPU selects both of the new tuning parameters.  It should therefore
have a very low impact on other CPUs.
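
As an illustration (a hypothetical example, not taken from the patch),
consider a loop whose trip count is known at compile time:

  void
  saxpy16 (float *restrict x, float *restrict y, float a)
  {
    for (int i = 0; i < 16; i++)
      y[i] += a * x[i];
  }

With 128-bit Advanced SIMD vectors this is exactly four vector
iterations, which can be completely unrolled into straight-line code
with no loop overhead.  A length-agnostic SVE loop has to keep the
loop control, since the iteration count isn't known until runtime.
The heuristic estimates the Advanced SIMD iteration and statement
counts and, when both are within the complete-unrolling limits,
biases the comparison towards the Advanced SIMD version.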

gcc/
	* config/aarch64/aarch64-tuning-flags.def (matched_vector_throughput):
	New tuning parameter.
	* config/aarch64/aarch64.c (neoversev1_tunings): Use it.
	(aarch64_estimated_sve_vq): New function.
	(aarch64_vector_costs::analyzed_vinfo): New member variable.
	(aarch64_vector_costs::is_loop): Likewise.
	(aarch64_vector_costs::unrolled_advsimd_niters): Likewise.
	(aarch64_vector_costs::unrolled_advsimd_stmts): Likewise.
	(aarch64_record_potential_advsimd_unrolling): New function.
	(aarch64_analyze_loop_vinfo, aarch64_analyze_bb_vinfo): Likewise.
	(aarch64_add_stmt_cost): Call aarch64_analyze_loop_vinfo or
	aarch64_analyze_bb_vinfo on the first use of a costs structure.
	Detect whether we're vectorizing a loop for SVE that might be
	completely unrolled if it used Advanced SIMD instead.
	(aarch64_adjust_body_cost): New function.
	(aarch64_finish_cost): Call it.
---
 gcc/config/aarch64/aarch64-tuning-flags.def |   2 +
 gcc/config/aarch64/aarch64.c                | 215 +++++++++++++++++++-
 2 files changed, 210 insertions(+), 7 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index a61fcf94916..65b4c37d652 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -50,4 +50,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
 
 AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
 
+AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", MATCHED_VECTOR_THROUGHPUT)
+
 #undef AARCH64_EXTRA_TUNING_OPTION
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 81683b7faa7..63750e38862 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1732,7 +1732,8 @@ static const struct tune_params neoversev1_tunings =
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
   (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
-   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS),	/* tune_flags.  */
+   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
+   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
   &generic_prefetch_tune
 };
 
@@ -2539,6 +2540,14 @@ aarch64_bit_representation (rtx x)
   return x;
 }
 
+/* Return an estimate for the number of quadwords in an SVE vector.  This is
+   equivalent to the number of Advanced SIMD vectors in an SVE vector.  */
+static unsigned int
+aarch64_estimated_sve_vq ()
+{
+  return estimated_poly_value (BITS_PER_SVE_VECTOR) / 128;
+}
+
 /* Return true if MODE is any of the Advanced SIMD structure modes.  */
 static bool
 aarch64_advsimd_struct_mode_p (machine_mode mode)
@@ -14117,6 +14126,39 @@ struct aarch64_vector_costs
   /* The normal latency-based costs for each region (prologue, body and
      epilogue), indexed by vect_cost_model_location.  */
   unsigned int region[3] = {};
+
+  /* True if we have performed one-time initialization based on the vec_info.
+
+     This variable exists because the vec_info is not passed to the
+     init_cost hook.  We therefore have to defer initialization based on
+     it till later.  */
+  bool analyzed_vinfo = false;
+
+  /* True if we're costing a vector loop, false if we're costing block-level
+     vectorization.  */
+  bool is_loop = false;
+
+  /* - If VEC_FLAGS is zero then we're costing the original scalar code.
+     - If VEC_FLAGS & VEC_ADVSIMD is nonzero then we're costing Advanced
+       SIMD code.
+     - If VEC_FLAGS & VEC_ANY_SVE is nonzero then we're costing SVE code.  */
+  unsigned int vec_flags = 0;
+
+  /* On some CPUs, SVE and Advanced SIMD provide the same theoretical vector
+     throughput, such as 4x128 Advanced SIMD vs. 2x256 SVE.  In those
+     situations, we try to predict whether an Advanced SIMD implementation
+     of the loop could be completely unrolled and become straight-line code.
+     If so, it is generally better to use the Advanced SIMD version rather
+     than length-agnostic SVE, since the SVE loop would execute an unknown
+     number of times and so could not be completely unrolled in the same way.
+
+     If we're applying this heuristic, UNROLLED_ADVSIMD_NITERS is the
+     number of Advanced SIMD loop iterations that would be unrolled and
+     UNROLLED_ADVSIMD_STMTS estimates the total number of statements
+     in the unrolled loop.  Both values are zero if we're not applying
+     the heuristic.  */
+  unsigned HOST_WIDE_INT unrolled_advsimd_niters = 0;
+  unsigned HOST_WIDE_INT unrolled_advsimd_stmts = 0;
 };
 
 /* Implement TARGET_VECTORIZE_INIT_COST.  */
@@ -14148,6 +14190,94 @@ aarch64_simd_vec_costs (tree vectype)
   return costs->advsimd;
 }
 
+/* Decide whether to use the unrolling heuristic described above
+   aarch64_vector_costs::unrolled_advsimd_niters, updating that
+   field if so.  LOOP_VINFO describes the loop that we're vectorizing
+   and COSTS are the costs that we're calculating for it.  */
+static void
+aarch64_record_potential_advsimd_unrolling (loop_vec_info loop_vinfo,
+					    aarch64_vector_costs *costs)
+{
+  /* The heuristic only makes sense on targets that have the same
+     vector throughput for SVE and Advanced SIMD.  */
+  if (!(aarch64_tune_params.extra_tuning_flags
+	& AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT))
+    return;
+
+  /* We only want to apply the heuristic if LOOP_VINFO is being
+     vectorized for SVE.  */
+  if (!(costs->vec_flags & VEC_ANY_SVE))
+    return;
+
+  /* Check whether it is possible in principle to use Advanced SIMD
+     instead.  */
+  if (aarch64_autovec_preference == 2)
+    return;
+
+  /* We don't want to apply the heuristic to outer loops, since it's
+     harder to track two levels of unrolling.  */
+  if (LOOP_VINFO_LOOP (loop_vinfo)->inner)
+    return;
+
+  /* Only handle cases in which the number of Advanced SIMD iterations
+     would be known at compile time but the number of SVE iterations
+     would not.  */
+  if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+      || aarch64_sve_vg.is_constant ())
+    return;
+
+  /* Guess how many times the Advanced SIMD loop would iterate and make
+     sure that it is within the complete unrolling limit.  Even if the
+     number of iterations is small enough, the number of statements might
+     not be, which is why we need to estimate the number of statements too.  */
+  unsigned int estimated_vq = aarch64_estimated_sve_vq ();
+  unsigned int advsimd_vf = CEIL (vect_vf_for_cost (loop_vinfo), estimated_vq);
+  unsigned HOST_WIDE_INT unrolled_advsimd_niters
+    = LOOP_VINFO_INT_NITERS (loop_vinfo) / advsimd_vf;
+  if (unrolled_advsimd_niters > (unsigned int) param_max_completely_peel_times)
+    return;
+
+  /* Record that we're applying the heuristic and should try to estimate
+     the number of statements in the Advanced SIMD loop.  */
+  costs->unrolled_advsimd_niters = unrolled_advsimd_niters;
+}
+
+/* Do one-time initialization of COSTS given that we're costing the loop
+   vectorization described by LOOP_VINFO.  */
+static void
+aarch64_analyze_loop_vinfo (loop_vec_info loop_vinfo,
+			    aarch64_vector_costs *costs)
+{
+  costs->is_loop = true;
+
+  /* Detect whether we're costing the scalar code or the vector code.
+     This is a bit hacky: it would be better if the vectorizer told
+     us directly.
+
+     If we're costing the vector code, record whether we're vectorizing
+     for Advanced SIMD or SVE.  */
+  if (costs == LOOP_VINFO_TARGET_COST_DATA (loop_vinfo))
+    costs->vec_flags = aarch64_classify_vector_mode (loop_vinfo->vector_mode);
+  else
+    costs->vec_flags = 0;
+
+  /* Detect whether we're vectorizing for SVE and should
+     apply the unrolling heuristic described above
+     aarch64_vector_costs::unrolled_advsimd_niters.  */
+  aarch64_record_potential_advsimd_unrolling (loop_vinfo, costs);
+}
+
+/* Do one-time initialization of COSTS given that we're costing the block
+   vectorization described by BB_VINFO.  */
+static void
+aarch64_analyze_bb_vinfo (bb_vec_info bb_vinfo, aarch64_vector_costs *costs)
+{
+  /* Unfortunately, there's no easy way of telling whether we're costing
+     the vector code or the scalar code, so just assume that we're costing
+     the vector code.  */
+  costs->vec_flags = aarch64_classify_vector_mode (bb_vinfo->vector_mode);
+}
+
 /* Implement targetm.vectorize.builtin_vectorization_cost.  */
 static int
 aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
@@ -14555,8 +14685,20 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 
   if (flag_vect_cost_model)
     {
-      int stmt_cost =
-	    aarch64_builtin_vectorization_cost (kind, vectype, misalign);
+      int stmt_cost
+	= aarch64_builtin_vectorization_cost (kind, vectype, misalign);
+
+      /* Do one-time initialization based on the vinfo.  */
+      loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
+      bb_vec_info bb_vinfo = dyn_cast<bb_vec_info> (vinfo);
+      if (!costs->analyzed_vinfo && aarch64_use_new_vector_costs_p ())
+	{
+	  if (loop_vinfo)
+	    aarch64_analyze_loop_vinfo (loop_vinfo, costs);
+	  else
+	    aarch64_analyze_bb_vinfo (bb_vinfo, costs);
+	  costs->analyzed_vinfo = true;
+	}
 
       /* Try to get a more accurate cost by looking at STMT_INFO instead
 	 of just looking at KIND.  */
@@ -14571,10 +14713,21 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 						  vectype, stmt_cost);
 
       if (stmt_info && aarch64_use_new_vector_costs_p ())
-	/* Account for any extra "embedded" costs that apply additively
-	   to the base cost calculated above.  */
-	stmt_cost = aarch64_adjust_stmt_cost (kind, stmt_info, vectype,
-					      stmt_cost);
+	{
+	  /* Account for any extra "embedded" costs that apply additively
+	     to the base cost calculated above.  */
+	  stmt_cost = aarch64_adjust_stmt_cost (kind, stmt_info, vectype,
+						stmt_cost);
+
+	  /* If we're applying the SVE vs. Advanced SIMD unrolling heuristic,
+	     estimate the number of statements in the unrolled Advanced SIMD
+	     loop.  For simplicity, we assume that one iteration of the
+	     Advanced SIMD loop would need the same number of statements
+	     as one iteration of the SVE loop.  */
+	  if (where == vect_body && costs->unrolled_advsimd_niters)
+	    costs->unrolled_advsimd_stmts
+	      += count * costs->unrolled_advsimd_niters;
+	}
 
       /* Statements in an inner loop relative to the loop being
 	 vectorized are weighted more heavily.  The value here is
@@ -14590,6 +14743,49 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
   return retval;
 }
 
+/* BODY_COST is the cost of a vector loop body recorded in COSTS.
+   Adjust the cost as necessary and return the new cost.  */
+static unsigned int
+aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
+{
+  unsigned int orig_body_cost = body_cost;
+
+  if (costs->unrolled_advsimd_stmts)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location, "Number of insns in"
+			 " unrolled Advanced SIMD loop = %d\n",
+			 costs->unrolled_advsimd_stmts);
+
+      /* Apply the Advanced SIMD vs. SVE unrolling heuristic described above
+	 aarch64_vector_costs::unrolled_advsimd_niters.
+
+	 The balance here is tricky.  On the one hand, we can't be sure whether
+	 the code is vectorizable with Advanced SIMD or not.  However, even if
+	 it isn't vectorizable with Advanced SIMD, there's a possibility that
+	 the scalar code could also be unrolled.  Some of the code might then
+	 benefit from SLP, or from using LDP and STP.  We therefore apply
+	 the heuristic regardless of can_use_advsimd_p.  */
+      if (costs->unrolled_advsimd_stmts
+	  && (costs->unrolled_advsimd_stmts
+	      <= (unsigned int) param_max_completely_peeled_insns))
+	{
+	  unsigned int estimated_vq = aarch64_estimated_sve_vq ();
+	  unsigned int min_cost = (orig_body_cost * estimated_vq) + 1;
+	  if (body_cost < min_cost)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "Increasing body cost to %d to account for"
+				 " unrolling\n", min_cost);
+	      body_cost = min_cost;
+	    }
+	}
+    }
+
+  return body_cost;
+}
+
 /* Implement TARGET_VECTORIZE_FINISH_COST.  */
 static void
 aarch64_finish_cost (void *data, unsigned *prologue_cost,
@@ -14599,6 +14795,11 @@ aarch64_finish_cost (void *data, unsigned *prologue_cost,
   *prologue_cost = costs->region[vect_prologue];
   *body_cost     = costs->region[vect_body];
   *epilogue_cost = costs->region[vect_epilogue];
+
+  if (costs->is_loop
+      && costs->vec_flags
+      && aarch64_use_new_vector_costs_p ())
+    *body_cost = aarch64_adjust_body_cost (costs, *body_cost);
 }
 
 /* Implement TARGET_VECTORIZE_DESTROY_COST_DATA.  */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 09/13] aarch64: Detect scalar extending loads
  2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
                   ` (7 preceding siblings ...)
  2021-03-26 16:16 ` [PATCH 08/13] aarch64: Try to detect when Advanced SIMD code would be completely unrolled Richard Sandiford
@ 2021-03-26 16:17 ` Richard Sandiford
  2021-03-26 16:17 ` [PATCH 10/13] aarch64: Cost comparisons embedded in COND_EXPRs Richard Sandiford
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2021-03-26 16:17 UTC (permalink / raw)
  To: gcc-patches

If the scalar code does an integer load followed by an integer
extension, we've tended to cost that as two separate operations,
even though the extension is probably going to be free in practice.
This patch treats the extension as having zero cost, like we already
do for extending SVE loads.

Like with previous patches, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.
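
For example (a hypothetical scalar loop, not taken from the patch):

  long
  sum_ints (int *x, int n)
  {
    long s = 0;
    for (int i = 0; i < n; i++)
      s += x[i];   /* int load sign-extended to 64 bits */
    return s;
  }

On AArch64 the load and the sign extension combine into a single
LDRSW, so costing them as two scalar statements overstates the scalar
loop and makes the vector code look relatively cheaper than it is.
Giving the extension a zero cost when aarch64_extending_load_p holds
avoids that bias.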

gcc/
	* config/aarch64/aarch64.c (aarch64_detect_scalar_stmt_subtype):
	New function.
	(aarch64_add_stmt_cost): Call it.
---
 gcc/config/aarch64/aarch64.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 63750e38862..e2d92f0c136 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14492,6 +14492,23 @@ aarch64_sve_in_loop_reduction_latency (vec_info *vinfo,
   return 0;
 }
 
+/* STMT_COST is the cost calculated by aarch64_builtin_vectorization_cost
+   for STMT_INFO, which has cost kind KIND.  If this is a scalar operation,
+   try to subdivide the target-independent categorization provided by KIND
+   to get a more accurate cost.  */
+static unsigned int
+aarch64_detect_scalar_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
+				    stmt_vec_info stmt_info,
+				    unsigned int stmt_cost)
+{
+  /* Detect an extension of a loaded value.  In general, we'll be able to fuse
+     the extension with the load.  */
+  if (kind == scalar_stmt && aarch64_extending_load_p (vinfo, stmt_info))
+    return 0;
+
+  return stmt_cost;
+}
+
 /* STMT_COST is the cost calculated by aarch64_builtin_vectorization_cost
    for the vectorized form of STMT_INFO, which has cost kind KIND and which
    when vectorized would operate on vector type VECTYPE.  Try to subdivide
@@ -14702,10 +14719,16 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 
       /* Try to get a more accurate cost by looking at STMT_INFO instead
 	 of just looking at KIND.  */
-      if (stmt_info && vectype && aarch64_use_new_vector_costs_p ())
-	stmt_cost = aarch64_detect_vector_stmt_subtype (vinfo, kind,
-							stmt_info, vectype,
-							where, stmt_cost);
+      if (stmt_info && aarch64_use_new_vector_costs_p ())
+	{
+	  stmt_cost = aarch64_detect_scalar_stmt_subtype
+	    (vinfo, kind, stmt_info, stmt_cost);
+
+	  if (vectype && costs->vec_flags)
+	    stmt_cost = aarch64_detect_vector_stmt_subtype (vinfo, kind,
+							    stmt_info, vectype,
+							    where, stmt_cost);
+	}
 
       /* Do any SVE-specific adjustments to the cost.  */
       if (stmt_info && vectype && aarch64_sve_mode_p (TYPE_MODE (vectype)))
-- 
2.17.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 10/13] aarch64: Cost comparisons embedded in COND_EXPRs
  2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
                   ` (8 preceding siblings ...)
  2021-03-26 16:17 ` [PATCH 09/13] aarch64: Detect scalar extending loads Richard Sandiford
@ 2021-03-26 16:17 ` Richard Sandiford
  2021-03-26 16:18 ` [PATCH 11/13] aarch64: Ignore inductions when costing vector code Richard Sandiford
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2021-03-26 16:17 UTC (permalink / raw)
  To: gcc-patches

So far the costing of COND_EXPRs hasn't distinguished between cases
in which the condition is calculated separately and cases in which it
is built into the COND_EXPR itself.  This patch adds the cost of any
embedded comparison.

Like with the previous patches, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.
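
For example (hypothetical source, not taken from the patch):

  void
  relu (double *x, double *y, int n)
  {
    for (int i = 0; i < n; i++)
      x[i] = y[i] > 0.0 ? y[i] : 0.0;
  }

Here the COND_EXPR can carry the comparison y[i] > 0.0 directly as
its first operand rather than as a separate boolean SSA name.
Previously that embedded comparison was effectively free in the cost
model; with this patch an extra FP (or integer) statement cost is
added on top of the COND_EXPR's own cost, for both the scalar and the
vector forms.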

gcc/
	* config/aarch64/aarch64.c (aarch64_embedded_comparison_type): New
	function.
	(aarch64_adjust_stmt_cost): Add the costs of embedded scalar and
	vector comparisons.
---
 gcc/config/aarch64/aarch64.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index e2d92f0c136..e97e71b6e3d 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14392,6 +14392,21 @@ aarch64_ld234_st234_vectors (vect_cost_for_stmt kind, stmt_vec_info stmt_info)
   return 0;
 }
 
+/* If STMT_INFO is a COND_EXPR that includes an embedded comparison, return the
+   scalar type of the values being compared.  Return null otherwise.  */
+static tree
+aarch64_embedded_comparison_type (stmt_vec_info stmt_info)
+{
+  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
+    if (gimple_assign_rhs_code (assign) == COND_EXPR)
+      {
+	tree cond = gimple_assign_rhs1 (assign);
+	if (COMPARISON_CLASS_P (cond))
+	  return TREE_TYPE (TREE_OPERAND (cond, 0));
+      }
+  return NULL_TREE;
+}
+
 /* Return true if creating multiple copies of STMT_INFO for Advanced SIMD
    vectors would produce a series of LDP or STP operations.  KIND is the
    kind of statement that STMT_INFO represents.  */
@@ -14685,8 +14700,26 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
 	  stmt_cost += simd_costs->ld4_st4_permute_cost;
 	  break;
 	}
+
+      if (kind == vector_stmt || kind == vec_to_scalar)
+	if (tree cmp_type = aarch64_embedded_comparison_type (stmt_info))
+	  {
+	    if (FLOAT_TYPE_P (cmp_type))
+	      stmt_cost += simd_costs->fp_stmt_cost;
+	    else
+	      stmt_cost += simd_costs->int_stmt_cost;
+	  }
     }
 
+  if (kind == scalar_stmt)
+    if (tree cmp_type = aarch64_embedded_comparison_type (stmt_info))
+      {
+	if (FLOAT_TYPE_P (cmp_type))
+	  stmt_cost += aarch64_tune_params.vec_costs->scalar_fp_stmt_cost;
+	else
+	  stmt_cost += aarch64_tune_params.vec_costs->scalar_int_stmt_cost;
+      }
+
   return stmt_cost;
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 11/13] aarch64: Ignore inductions when costing vector code
  2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
                   ` (9 preceding siblings ...)
  2021-03-26 16:17 ` [PATCH 10/13] aarch64: Cost comparisons embedded in COND_EXPRs Richard Sandiford
@ 2021-03-26 16:18 ` Richard Sandiford
  2021-03-26 16:18 ` [PATCH 12/13] aarch64: Take issue rate into account for vector loop costs Richard Sandiford
  2021-03-26 16:18 ` [PATCH 13/13] aarch64: Add costs for LD[34] and ST[34] postincrements Richard Sandiford
  12 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2021-03-26 16:18 UTC (permalink / raw)
  To: gcc-patches

In practice it seems to be better not to cost a vector induction.
The scalar code generally needs the same induction, but we don't
cost it there, making an apples-for-apples comparison harder.  Most
inductions also have a low latency and their cost usually gets
hidden by other operations.

Like with the previous patches, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.
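
For example (hypothetical source, not taken from the patch):

  void
  iota (int *x, int n)
  {
    for (int i = 0; i < n; i++)
      x[i] = i;
  }

The vector loop needs a { 0, 1, 2, 3 }-style induction vector that is
stepped on every iteration, whereas the scalar loop reuses its loop
counter and that work is never costed.  Charging the vector side for
something the scalar side gets for free skews the comparison, so
induction phis are now given a zero cost.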

gcc/
	* config/aarch64/aarch64.c (aarch64_detect_vector_stmt_subtype):
	Assume a zero cost for induction phis.
---
 gcc/config/aarch64/aarch64.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index e97e71b6e3d..6d18d82079c 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14541,6 +14541,12 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
   if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
     sve_costs = aarch64_tune_params.vec_costs->sve;
 
+  /* It's generally better to avoid costing inductions, since the induction
+     will usually be hidden by other operations.  This is particularly true
+     for things like COND_REDUCTIONS.  */
+  if (is_a<gphi *> (stmt_info->stmt))
+    return 0;
+
   /* Detect cases in which vec_to_scalar is describing the extraction of a
      vector element in preparation for a scalar store.  The store itself is
      costed separately.  */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 12/13] aarch64: Take issue rate into account for vector loop costs
  2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
                   ` (10 preceding siblings ...)
  2021-03-26 16:18 ` [PATCH 11/13] aarch64: Ignore inductions when costing vector code Richard Sandiford
@ 2021-03-26 16:18 ` Richard Sandiford
  2021-03-26 16:18 ` [PATCH 13/13] aarch64: Add costs for LD[34] and ST[34] postincrements Richard Sandiford
  12 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2021-03-26 16:18 UTC (permalink / raw)
  To: gcc-patches

When SVE is enabled, GCC needs to do a three-way comparison
between scalar, Advanced SIMD and SVE code.  The normal costs
tend to be latency-based, which is well-suited to SLP.  However,
comparing sums of latency costs means that we effectively treat
the code as executing sequentially.  This can hide the effects of
pipeline bubbles or resource contention that in practice are quite
important for loop vectorisation.  This is particularly true for
loops that involve reductions.

This patch therefore tries to estimate how quickly each piece
of code could issue, using a very (very) simplistic model.
It then uses this to adjust the loop vector costs up or down as
appropriate.  Part of the Advanced SIMD vs. SVE adjustment is
opt-in and is not enabled by default even for use_new_vector_costs.

Like with the previous patches, this one only becomes active if
a CPU selects use_new_vector_costs.  It should therefore have
a very low impact on other CPUs.  The code also mostly ignores
CPUs that have no issue information, even if use_new_vector_costs
is enabled for some reason.
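
The core of the estimate looks roughly like the sketch below (a
simplified version with a made-up name; the real code is
aarch64_count_ops and aarch64_estimate_min_cycles_per_iter):

  /* Sketch: each issue resource imposes a lower bound on the cycles
     per iteration, as does the latency of any loop-carried reduction;
     the loop cannot issue faster than the slowest of them.  */
  static unsigned int
  sketch_min_cycles_per_iter (const aarch64_vec_op_count *ops,
                              const aarch64_base_vec_issue_info *issue_info)
  {
    unsigned int cycles = MAX (ops->reduction_latency, 1);
    cycles = MAX (cycles, CEIL (ops->stores, issue_info->stores_per_cycle));
    cycles = MAX (cycles, CEIL (ops->loads + ops->stores,
                                issue_info->loads_stores_per_cycle));
    cycles = MAX (cycles, CEIL (ops->general_ops,
                                issue_info->general_ops_per_cycle));
    return cycles;
  }

The scalar, Advanced SIMD and (for SVE loops) SVE estimates computed
this way are then compared, and the vector body cost is increased or
decreased so that the latency-based totals don't contradict the
issue-rate comparison.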

gcc/
	* config/aarch64/aarch64.opt
	(-param=aarch64-loop-vect-issue-rate-niters=): New parameter.
	* doc/invoke.texi: Document it.
	* config/aarch64/aarch64-protos.h (aarch64_base_vec_issue_info)
	(aarch64_scalar_vec_issue_info, aarch64_simd_vec_issue_info)
	(aarch64_advsimd_vec_issue_info, aarch64_sve_vec_issue_info)
	(aarch64_vec_issue_info): New structures.
	(cpu_vector_cost): Write comments above the variables rather
	than to the side.
	(cpu_vector_cost::issue_info): New member variable.
	* config/aarch64/aarch64.c: Include gimple-pretty-print.h
	and tree-ssa-loop-niter.h.
	(generic_vector_cost, a64fx_vector_cost, qdf24xx_vector_cost)
	(thunderx_vector_cost, tsv110_vector_cost, cortexa57_vector_cost)
	(exynosm1_vector_cost, xgene1_vector_cost, thunderx2t99_vector_cost)
	(thunderx3t110_vector_cost): Initialize issue_info to null.
	(neoversev1_scalar_issue_info, neoversev1_advsimd_issue_info)
	(neoversev1_sve_issue_info, neoversev1_vec_issue_info): New structures.
	(neoversev1_vector_cost): Use them.
	(aarch64_vec_op_count, aarch64_sve_op_count): New structures.
	(aarch64_vector_costs::saw_sve_only_op): New member variable.
	(aarch64_vector_costs::num_vector_iterations): Likewise.
	(aarch64_vector_costs::scalar_ops): Likewise.
	(aarch64_vector_costs::advsimd_ops): Likewise.
	(aarch64_vector_costs::sve_ops): Likewise.
	(aarch64_vector_costs::seen_loads): Likewise.
	(aarch64_simd_vec_costs_for_flags): New function.
	(aarch64_analyze_loop_vinfo): Initialize num_vector_iterations.
	Count the number of predicate operations required by SVE WHILE
	instructions.
	(aarch64_comparison_type, aarch64_multiply_add_p): New functions.
	(aarch64_sve_only_stmt_p, aarch64_in_loop_reduction_latency): Likewise.
	(aarch64_count_ops): Likewise.
	(aarch64_add_stmt_cost): Record whether we see an SVE operation
	that cannot currently be implemented using Advanced SIMD.
	Record issue information about the scalar, Advanced SIMD
	and (where relevant) SVE versions of a loop.
	(aarch64_vec_op_count::dump): New function.
	(aarch64_sve_op_count::dump): Likewise.
	(aarch64_estimate_min_cycles_per_iter): Likewise.
	(aarch64_adjust_body_cost): If issue information is available,
	try to compare the issue rates of the various loop implementations
	and increase or decrease the vector body cost accordingly.
---
 gcc/config/aarch64/aarch64-protos.h | 178 ++++++-
 gcc/config/aarch64/aarch64.c        | 798 +++++++++++++++++++++++++++-
 gcc/config/aarch64/aarch64.opt      |   3 +
 gcc/doc/invoke.texi                 |   8 +
 4 files changed, 966 insertions(+), 21 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 2ffa96ec24b..ca1ed9e8758 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -281,19 +281,177 @@ struct sve_vec_cost : simd_vec_cost
   const int scatter_store_elt_cost;
 };
 
+/* Base information about how the CPU issues code, containing
+   information that is relevant to scalar, Advanced SIMD and SVE
+   operations.
+
+   The structure uses the general term "operation" to refer to
+   whichever subdivision of an instruction makes sense for the CPU.
+   These operations would typically be micro operations or macro
+   operations.
+
+   Note that this structure and the ones derived from it are only
+   as general as they need to be for the CPUs that currently use them.
+   They will probably need to be extended or refined as more CPUs are
+   added.  */
+struct aarch64_base_vec_issue_info
+{
+  /* How many loads and stores can be issued per cycle.  */
+  const unsigned int loads_stores_per_cycle;
+
+  /* How many stores can be issued per cycle.  */
+  const unsigned int stores_per_cycle;
+
+  /* How many integer or FP/SIMD operations can be issued per cycle.
+
+     Currently we don't try to distinguish the two.  For vector code,
+     we only really track FP/SIMD operations during vector costing;
+     we don't for example try to cost arithmetic operations like
+     address calculations, which are only decided later during ivopts.
+
+     For scalar code, we effectively assume that code operates entirely
+     on integers or entirely on floating-point values.  Again, we don't
+     try to take address calculations into account.
+
+     This is not very precise, but it's only meant to be a heuristic.
+     We could certainly try to do better in future if there's an example
+     of something that would benefit.  */
+  const unsigned int general_ops_per_cycle;
+
+  /* How many FP/SIMD operations to count for a floating-point or
+     vector load operation.
+
+     When constructing an Advanced SIMD vector from elements that have
+     been loaded from memory, these values apply to each individual load.
+     When using an SVE gather load, the values apply to each element of
+     the gather.  */
+  const unsigned int fp_simd_load_general_ops;
+
+  /* How many FP/SIMD operations to count for a floating-point or
+     vector store operation.
+
+     When storing individual elements of an Advanced SIMD vector out to
+     memory, these values apply to each individual store.  When using an
+     SVE scatter store, these values apply to each element of the scatter.  */
+  const unsigned int fp_simd_store_general_ops;
+};
+
+using aarch64_scalar_vec_issue_info = aarch64_base_vec_issue_info;
+
+/* Base information about the issue stage for vector operations.
+   This structure contains information that is relevant to both
+   Advanced SIMD and SVE.  */
+struct aarch64_simd_vec_issue_info : aarch64_base_vec_issue_info
+{
+  constexpr aarch64_simd_vec_issue_info (aarch64_base_vec_issue_info base,
+					 unsigned int ld2_st2_general_ops,
+					 unsigned int ld3_st3_general_ops,
+					 unsigned int ld4_st4_general_ops)
+    : aarch64_base_vec_issue_info (base),
+      ld2_st2_general_ops (ld2_st2_general_ops),
+      ld3_st3_general_ops (ld3_st3_general_ops),
+      ld4_st4_general_ops (ld4_st4_general_ops)
+  {}
+
+  /* How many FP/SIMD operations to count for each vector loaded or
+     stored by an LD[234] or ST[234] operation, in addition to the
+     base costs given in the parent class.  For example, the full
+     number of operations for an LD3 would be:
+
+       load ops:    3
+       general ops: 3 * (fp_simd_load_general_ops + ld3_st3_general_ops).  */
+  const unsigned int ld2_st2_general_ops;
+  const unsigned int ld3_st3_general_ops;
+  const unsigned int ld4_st4_general_ops;
+};
+
+using aarch64_advsimd_vec_issue_info = aarch64_simd_vec_issue_info;
+
+/* Information about the issue stage for SVE.  The main thing this adds
+   is a concept of "predicate operations".  */
+struct aarch64_sve_vec_issue_info : aarch64_simd_vec_issue_info
+{
+  constexpr aarch64_sve_vec_issue_info
+    (aarch64_simd_vec_issue_info base,
+     unsigned int pred_ops_per_cycle,
+     unsigned int while_pred_ops,
+     unsigned int int_cmp_pred_ops,
+     unsigned int fp_cmp_pred_ops,
+     unsigned int gather_scatter_pair_general_ops,
+     unsigned int gather_scatter_pair_pred_ops)
+    : aarch64_simd_vec_issue_info (base),
+      pred_ops_per_cycle (pred_ops_per_cycle),
+      while_pred_ops (while_pred_ops),
+      int_cmp_pred_ops (int_cmp_pred_ops),
+      fp_cmp_pred_ops (fp_cmp_pred_ops),
+      gather_scatter_pair_general_ops (gather_scatter_pair_general_ops),
+      gather_scatter_pair_pred_ops (gather_scatter_pair_pred_ops)
+  {}
+
+  /* How many predicate operations can be issued per cycle.  */
+  const unsigned int pred_ops_per_cycle;
+
+  /* How many predicate operations are generated by a WHILExx
+     instruction.  */
+  const unsigned int while_pred_ops;
+
+  /* How many predicate operations are generated by an integer
+     comparison instruction.  */
+  const unsigned int int_cmp_pred_ops;
+
+  /* How many predicate operations are generated by a floating-point
+     comparison instruction.  */
+  const unsigned int fp_cmp_pred_ops;
+
+  /* How many general and predicate operations are generated by each pair
+     of elements in a gather load or scatter store.  These values apply
+     on top of the per-element counts recorded in fp_simd_load_general_ops
+     and fp_simd_store_general_ops.
+
+     The reason for using pairs is that that is the largest possible
+     granule size for 128-bit SVE, which can load and store 2 64-bit
+     elements or 4 32-bit elements.  */
+  const unsigned int gather_scatter_pair_general_ops;
+  const unsigned int gather_scatter_pair_pred_ops;
+};
+
+/* Information related to instruction issue for a particular CPU.  */
+struct aarch64_vec_issue_info
+{
+  const aarch64_base_vec_issue_info *const scalar;
+  const aarch64_simd_vec_issue_info *const advsimd;
+  const aarch64_sve_vec_issue_info *const sve;
+};
+
 /* Cost for vector insn classes.  */
 struct cpu_vector_cost
 {
-  const int scalar_int_stmt_cost;	 /* Cost of any int scalar operation,
-					    excluding load and store.  */
-  const int scalar_fp_stmt_cost;	 /* Cost of any fp scalar operation,
-					    excluding load and store.  */
-  const int scalar_load_cost;		 /* Cost of scalar load.  */
-  const int scalar_store_cost;		 /* Cost of scalar store.  */
-  const int cond_taken_branch_cost;	 /* Cost of taken branch.  */
-  const int cond_not_taken_branch_cost;  /* Cost of not taken branch.  */
-  const advsimd_vec_cost *advsimd;	 /* Cost of Advanced SIMD operations.  */
-  const sve_vec_cost *sve;		 /* Cost of SVE operations.  */
+  /* Cost of any integer scalar operation, excluding load and store.  */
+  const int scalar_int_stmt_cost;
+
+  /* Cost of any fp scalar operation, excluding load and store.  */
+  const int scalar_fp_stmt_cost;
+
+  /* Cost of a scalar load.  */
+  const int scalar_load_cost;
+
+  /* Cost of a scalar store.  */
+  const int scalar_store_cost;
+
+  /* Cost of a taken branch.  */
+  const int cond_taken_branch_cost;
+
+  /* Cost of a not-taken branch.  */
+  const int cond_not_taken_branch_cost;
+
+  /* Cost of Advanced SIMD operations.  */
+  const advsimd_vec_cost *advsimd;
+
+  /* Cost of SVE operations, or null if SVE is not implemented.  */
+  const sve_vec_cost *sve;
+
+  /* Issue information, or null if none is provided.  */
+  const aarch64_vec_issue_info *const issue_info;
 };
 
 /* Branch costs.  */
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 6d18d82079c..6d961bea5dc 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -74,6 +74,8 @@
 #include "intl.h"
 #include "expmed.h"
 #include "function-abi.h"
+#include "gimple-pretty-print.h"
+#include "tree-ssa-loop-niter.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -652,7 +654,8 @@ static const struct cpu_vector_cost generic_vector_cost =
   3, /* cond_taken_branch_cost  */
   1, /* cond_not_taken_branch_cost  */
   &generic_advsimd_vector_cost, /* advsimd  */
-  &generic_sve_vector_cost /* sve */
+  &generic_sve_vector_cost, /* sve */
+  nullptr /* issue_info  */
 };
 
 static const advsimd_vec_cost a64fx_advsimd_vector_cost =
@@ -719,7 +722,8 @@ static const struct cpu_vector_cost a64fx_vector_cost =
   3, /* cond_taken_branch_cost  */
   1, /* cond_not_taken_branch_cost  */
   &a64fx_advsimd_vector_cost, /* advsimd  */
-  &a64fx_sve_vector_cost /* sve  */
+  &a64fx_sve_vector_cost, /* sve  */
+  nullptr /* issue_info  */
 };
 
 static const advsimd_vec_cost qdf24xx_advsimd_vector_cost =
@@ -756,7 +760,8 @@ static const struct cpu_vector_cost qdf24xx_vector_cost =
   3, /* cond_taken_branch_cost  */
   1, /* cond_not_taken_branch_cost  */
   &qdf24xx_advsimd_vector_cost, /* advsimd  */
-  NULL /* sve  */
+  nullptr, /* sve  */
+  nullptr /* issue_info  */
 };
 
 
@@ -794,7 +799,8 @@ static const struct cpu_vector_cost thunderx_vector_cost =
   3, /* cond_taken_branch_cost  */
   3, /* cond_not_taken_branch_cost  */
   &thunderx_advsimd_vector_cost, /* advsimd  */
-  NULL /* sve  */
+  nullptr, /* sve  */
+  nullptr /* issue_info  */
 };
 
 static const advsimd_vec_cost tsv110_advsimd_vector_cost =
@@ -830,7 +836,8 @@ static const struct cpu_vector_cost tsv110_vector_cost =
   1, /* cond_taken_branch_cost  */
   1, /* cond_not_taken_branch_cost  */
   &tsv110_advsimd_vector_cost, /* advsimd  */
-  NULL, /* sve  */
+  nullptr, /* sve  */
+  nullptr /* issue_info  */
 };
 
 static const advsimd_vec_cost cortexa57_advsimd_vector_cost =
@@ -867,7 +874,8 @@ static const struct cpu_vector_cost cortexa57_vector_cost =
   1, /* cond_taken_branch_cost  */
   1, /* cond_not_taken_branch_cost  */
   &cortexa57_advsimd_vector_cost, /* advsimd  */
-  NULL /* sve  */
+  nullptr, /* sve  */
+  nullptr /* issue_info  */
 };
 
 static const advsimd_vec_cost exynosm1_advsimd_vector_cost =
@@ -903,7 +911,8 @@ static const struct cpu_vector_cost exynosm1_vector_cost =
   1, /* cond_taken_branch_cost  */
   1, /* cond_not_taken_branch_cost  */
   &exynosm1_advsimd_vector_cost, /* advsimd  */
-  NULL /* sve  */
+  nullptr, /* sve  */
+  nullptr /* issue_info  */
 };
 
 static const advsimd_vec_cost xgene1_advsimd_vector_cost =
@@ -940,7 +949,8 @@ static const struct cpu_vector_cost xgene1_vector_cost =
   2, /* cond_taken_branch_cost  */
   1, /* cond_not_taken_branch_cost  */
   &xgene1_advsimd_vector_cost, /* advsimd  */
-  NULL /* sve  */
+  nullptr, /* sve  */
+  nullptr /* issue_info  */
 };
 
 static const advsimd_vec_cost thunderx2t99_advsimd_vector_cost =
@@ -977,7 +987,8 @@ static const struct cpu_vector_cost thunderx2t99_vector_cost =
   2, /* cond_taken_branch_cost  */
   1,  /* cond_not_taken_branch_cost  */
   &thunderx2t99_advsimd_vector_cost, /* advsimd  */
-  NULL /* sve  */
+  nullptr, /* sve  */
+  nullptr /* issue_info  */
 };
 
 static const advsimd_vec_cost thunderx3t110_advsimd_vector_cost =
@@ -1013,7 +1024,8 @@ static const struct cpu_vector_cost thunderx3t110_vector_cost =
   2, /* cond_taken_branch_cost  */
   1,  /* cond_not_taken_branch_cost  */
   &thunderx3t110_advsimd_vector_cost, /* advsimd  */
-  NULL /* sve  */
+  nullptr, /* sve  */
+  nullptr /* issue_info  */
 };
 
 
@@ -1696,6 +1708,58 @@ static const sve_vec_cost neoversev1_sve_vector_cost =
   3 /* scatter_store_elt_cost  */
 };
 
+static const aarch64_scalar_vec_issue_info neoversev1_scalar_issue_info =
+{
+  3, /* loads_stores_per_cycle  */
+  2, /* stores_per_cycle  */
+  4, /* general_ops_per_cycle  */
+  0, /* fp_simd_load_general_ops  */
+  1 /* fp_simd_store_general_ops  */
+};
+
+static const aarch64_advsimd_vec_issue_info neoversev1_advsimd_issue_info =
+{
+  {
+    3, /* loads_stores_per_cycle  */
+    2, /* stores_per_cycle  */
+    4, /* general_ops_per_cycle  */
+    0, /* fp_simd_load_general_ops  */
+    1 /* fp_simd_store_general_ops  */
+  },
+  2, /* ld2_st2_general_ops  */
+  2, /* ld3_st3_general_ops  */
+  3 /* ld4_st4_general_ops  */
+};
+
+static const aarch64_sve_vec_issue_info neoversev1_sve_issue_info =
+{
+  {
+    {
+      2, /* loads_stores_per_cycle  */
+      2, /* stores_per_cycle  */
+      2, /* general_ops_per_cycle  */
+      0, /* fp_simd_load_general_ops  */
+      1 /* fp_simd_store_general_ops  */
+    },
+    2, /* ld2_st2_general_ops  */
+    2, /* ld3_st3_general_ops  */
+    3 /* ld4_st4_general_ops  */
+  },
+  1, /* pred_ops_per_cycle  */
+  2, /* while_pred_ops  */
+  2, /* int_cmp_pred_ops  */
+  1, /* fp_cmp_pred_ops  */
+  1, /* gather_scatter_pair_general_ops  */
+  1 /* gather_scatter_pair_pred_ops  */
+};
+
+static const aarch64_vec_issue_info neoversev1_vec_issue_info =
+{
+  &neoversev1_scalar_issue_info,
+  &neoversev1_advsimd_issue_info,
+  &neoversev1_sve_issue_info
+};
+
 /* Neoverse V1 costs for vector insn classes.  */
 static const struct cpu_vector_cost neoversev1_vector_cost =
 {
@@ -1706,7 +1770,8 @@ static const struct cpu_vector_cost neoversev1_vector_cost =
   1, /* cond_taken_branch_cost  */
   1, /* cond_not_taken_branch_cost  */
   &neoversev1_advsimd_vector_cost, /* advsimd  */
-  &neoversev1_sve_vector_cost /* sve  */
+  &neoversev1_sve_vector_cost, /* sve  */
+  &neoversev1_vec_issue_info /* issue_info  */
 };
 
 static const struct tune_params neoversev1_tunings =
@@ -14120,6 +14185,38 @@ aarch64_first_cycle_multipass_dfa_lookahead_guard (rtx_insn *insn,
 
 /* Vectorizer cost model target hooks.  */
 
+/* Information about how the CPU would issue the scalar, Advanced SIMD
+   or SVE version of a vector loop, using the scheme defined by the
+   aarch64_base_vec_issue_info hierarchy of structures.  */
+struct aarch64_vec_op_count
+{
+  void dump () const;
+
+  /* The number of individual "general" operations.  See the comments
+     in aarch64_base_vec_issue_info for details.  */
+  unsigned int general_ops = 0;
+
+  /* The number of load and store operations, under the same scheme
+     as above.  */
+  unsigned int loads = 0;
+  unsigned int stores = 0;
+
+  /* The minimum number of cycles needed to execute all loop-carried
+     operations, which in the vector code become associated with
+     reductions.  */
+  unsigned int reduction_latency = 0;
+};
+
+/* Extends aarch64_vec_op_count with SVE-specific information.  */
+struct aarch64_sve_op_count : aarch64_vec_op_count
+{
+  void dump () const;
+
+  /* The number of individual predicate operations.  See the comments
+     in aarch64_sve_vec_issue_info for details.  */
+  unsigned int pred_ops = 0;
+};
+
 /* Information about vector code that we're in the process of costing.  */
 struct aarch64_vector_costs
 {
@@ -14138,6 +14235,10 @@ struct aarch64_vector_costs
      vectorization.  */
   bool is_loop = false;
 
+  /* True if we've seen an SVE operation that we cannot currently vectorize
+     using Advanced SIMD.  */
+  bool saw_sve_only_op = false;
+
   /* - If VEC_FLAGS is zero then we're costing the original scalar code.
      - If VEC_FLAGS & VEC_ADVSIMD is nonzero then we're costing Advanced
        SIMD code.
@@ -14159,6 +14260,32 @@ struct aarch64_vector_costs
      the heuristic.  */
   unsigned HOST_WIDE_INT unrolled_advsimd_niters = 0;
   unsigned HOST_WIDE_INT unrolled_advsimd_stmts = 0;
+
+  /* If we're vectorizing a loop that executes a constant number of times,
+     this variable gives the number of times that the vector loop would
+     iterate, otherwise it is zero.  */
+  uint64_t num_vector_iterations = 0;
+
+  /* Used only when vectorizing loops.  Estimates the number and kind of scalar
+     operations that would be needed to perform the same work as one iteration
+     of the vector loop.  */
+  aarch64_vec_op_count scalar_ops;
+
+  /* Used only when vectorizing loops.  If VEC_FLAGS & VEC_ADVSIMD,
+     this structure estimates the number and kind of operations that the
+     vector loop would contain.  If VEC_FLAGS & VEC_SVE, the structure
+     estimates what the equivalent Advanced SIMD-only code would need in
+     order to perform the same work as one iteration of the SVE loop.  */
+  aarch64_vec_op_count advsimd_ops;
+
+  /* Used only when vectorizing loops with SVE.  It estimates the number and
+     kind of operations that the SVE loop would contain.  */
+  aarch64_sve_op_count sve_ops;
+
+  /* Used to detect cases in which we end up costing the same load twice,
+     once to account for results that are actually used and once to account
+     for unused results.  */
+  hash_map<nofree_ptr_hash<_stmt_vec_info>, unsigned int> seen_loads;
 };
 
 /* Implement TARGET_VECTORIZE_INIT_COST.  */
@@ -14190,6 +14317,16 @@ aarch64_simd_vec_costs (tree vectype)
   return costs->advsimd;
 }
 
+/* Return the appropriate SIMD costs for vectors with VEC_* flags FLAGS.  */
+static const simd_vec_cost *
+aarch64_simd_vec_costs_for_flags (unsigned int flags)
+{
+  const cpu_vector_cost *costs = aarch64_tune_params.vec_costs;
+  if ((flags & VEC_ANY_SVE) && costs->sve)
+    return costs->sve;
+  return costs->advsimd;
+}
+
 /* Decide whether to use the unrolling heuristic described above
    aarch64_vector_costs::unrolled_advsimd_niters, updating that
    field if so.  LOOP_VINFO describes the loop that we're vectorizing
@@ -14250,6 +14387,19 @@ aarch64_analyze_loop_vinfo (loop_vec_info loop_vinfo,
 {
   costs->is_loop = true;
 
+  /* Record the number of times that the vector loop would execute,
+     if known.  */
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  auto scalar_niters = max_stmt_executions_int (loop);
+  if (scalar_niters >= 0)
+    {
+      unsigned int vf = vect_vf_for_cost (loop_vinfo);
+      if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
+	costs->num_vector_iterations = scalar_niters / vf;
+      else
+	costs->num_vector_iterations = CEIL (scalar_niters, vf);
+    }
+
   /* Detect whether we're costing the scalar code or the vector code.
      This is a bit hacky: it would be better if the vectorizer told
      us directly.
@@ -14265,6 +14415,20 @@ aarch64_analyze_loop_vinfo (loop_vec_info loop_vinfo,
      apply the unrolling heuristic described above
      aarch64_vector_costs::unrolled_advsimd_niters.  */
   aarch64_record_potential_advsimd_unrolling (loop_vinfo, costs);
+
+  /* Record the issue information for any SVE WHILE instructions that the
+     loop needs.  */
+  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
+  if (issue_info->sve && !LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
+    {
+      unsigned int num_masks = 0;
+      rgroup_controls *rgm;
+      unsigned int num_vectors_m1;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
+	if (rgm->type)
+	  num_masks += num_vectors_m1 + 1;
+      costs->sve_ops.pred_ops += num_masks * issue_info->sve->while_pred_ops;
+    }
 }
 
 /* Do one-time initialization of COSTS given that we're costing the block
@@ -14407,6 +14571,17 @@ aarch64_embedded_comparison_type (stmt_vec_info stmt_info)
   return NULL_TREE;
 }
 
+/* If STMT_INFO is a comparison or contains an embedded comparison, return the
+   scalar type of the values being compared.  Return null otherwise.  */
+static tree
+aarch64_comparison_type (stmt_vec_info stmt_info)
+{
+  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
+    if (TREE_CODE_CLASS (gimple_assign_rhs_code (assign)) == tcc_comparison)
+      return TREE_TYPE (gimple_assign_rhs1 (assign));
+  return aarch64_embedded_comparison_type (stmt_info);
+}
+
 /* Return true if creating multiple copies of STMT_INFO for Advanced SIMD
    vectors would produce a series of LDP or STP operations.  KIND is the
    kind of statement that STMT_INFO represents.  */
@@ -14470,6 +14645,79 @@ aarch64_integer_truncation_p (stmt_vec_info stmt_info)
 	  && TYPE_PRECISION (lhs_type) < TYPE_PRECISION (rhs_type));
 }
 
+/* Return true if STMT_INFO is the second part of a two-statement multiply-add
+   or multiply-subtract sequence that might be suitable for fusing into a
+   single instruction.  */
+static bool
+aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
+{
+  gassign *assign = dyn_cast<gassign *> (stmt_info->stmt);
+  if (!assign)
+    return false;
+  tree_code code = gimple_assign_rhs_code (assign);
+  if (code != PLUS_EXPR && code != MINUS_EXPR)
+    return false;
+
+  if (CONSTANT_CLASS_P (gimple_assign_rhs1 (assign))
+      || CONSTANT_CLASS_P (gimple_assign_rhs2 (assign)))
+    return false;
+
+  for (int i = 1; i < 3; ++i)
+    {
+      tree rhs = gimple_op (assign, i);
+      /* ??? Should we try to check for a single use as well?  */
+      if (TREE_CODE (rhs) != SSA_NAME)
+	continue;
+
+      stmt_vec_info def_stmt_info = vinfo->lookup_def (rhs);
+      if (!def_stmt_info
+	  || STMT_VINFO_DEF_TYPE (def_stmt_info) != vect_internal_def)
+	continue;
+      gassign *rhs_assign = dyn_cast<gassign *> (def_stmt_info->stmt);
+      if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
+	continue;
+
+      return true;
+    }
+  return false;
+}
+
+/* Return true if the vectorized form of STMT_INFO is something that is only
+   possible when using SVE instead of Advanced SIMD.  VECTYPE is the type of
+   the vector that STMT_INFO is operating on.  */
+static bool
+aarch64_sve_only_stmt_p (stmt_vec_info stmt_info, tree vectype)
+{
+  if (!aarch64_sve_mode_p (TYPE_MODE (vectype)))
+    return false;
+
+  if (STMT_VINFO_DATA_REF (stmt_info))
+    {
+      /* Check for true gathers and scatters (rather than just strided accesses
+	 that we've chosen to implement using gathers and scatters).  Although
+	 in principle we could use elementwise accesses for Advanced SIMD,
+	 the vectorizer doesn't yet support that.  */
+      if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+	return true;
+
+      /* Check for masked loads and stores.  */
+      if (auto *call = dyn_cast<gcall *> (stmt_info->stmt))
+	if (gimple_call_internal_p (call)
+	    && internal_fn_mask_index (gimple_call_internal_fn (call)) >= 0)
+	  return true;
+    }
+
+  /* Check for 64-bit integer multiplications.  */
+  auto *assign = dyn_cast<gassign *> (stmt_info->stmt);
+  if (assign
+      && gimple_assign_rhs_code (assign) == MULT_EXPR
+      && GET_MODE_INNER (TYPE_MODE (vectype)) == DImode
+      && !integer_pow2p (gimple_assign_rhs2 (assign)))
+    return true;
+
+  return false;
+}
+
 /* We are considering implementing STMT_INFO using SVE vector type VECTYPE.
    If STMT_INFO is an in-loop reduction that SVE supports directly, return
    its latency in cycles, otherwise return zero.  SVE_COSTS specifies the
@@ -14507,6 +14755,59 @@ aarch64_sve_in_loop_reduction_latency (vec_info *vinfo,
   return 0;
 }
 
+/* STMT_INFO describes a loop-carried operation in the original scalar code
+   that we are considering implementing as a reduction.  Return one of the
+   following values, depending on VEC_FLAGS:
+
+   - If VEC_FLAGS is zero, return the loop carry latency of the original
+     scalar operation.
+
+   - If VEC_FLAGS & VEC_ADVSIMD, return the loop carry latency of the
+     Advanced SIMD implementation.
+
+   - If VEC_FLAGS & VEC_ANY_SVE, return the loop carry latency of the
+     SVE implementation.
+
+   VECTYPE is the type of vector that the vectorizer is considering using
+   for STMT_INFO, which might be different from the type of vector described
+   by VEC_FLAGS.  */
+static unsigned int
+aarch64_in_loop_reduction_latency (vec_info *vinfo, stmt_vec_info stmt_info,
+				   tree vectype, unsigned int vec_flags)
+{
+  const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
+  const sve_vec_cost *sve_costs = nullptr;
+  if (vec_flags & VEC_ANY_SVE)
+    sve_costs = aarch64_tune_params.vec_costs->sve;
+
+  /* If the caller is asking for the SVE latency, check for forms of reduction
+     that only SVE can handle directly.  */
+  if (sve_costs)
+    {
+      unsigned int latency
+	= aarch64_sve_in_loop_reduction_latency (vinfo, stmt_info, vectype,
+						 sve_costs);
+      if (latency)
+	return latency;
+    }
+
+  /* Handle scalar costs.  */
+  if (vec_flags == 0)
+    {
+      if (FLOAT_TYPE_P (vectype))
+	return vec_costs->scalar_fp_stmt_cost;
+      return vec_costs->scalar_int_stmt_cost;
+    }
+
+  /* Otherwise, the loop body just contains normal integer or FP operations,
+     with a vector reduction outside the loop.  */
+  const simd_vec_cost *simd_costs
+    = aarch64_simd_vec_costs_for_flags (vec_flags);
+  if (FLOAT_TYPE_P (vectype))
+    return simd_costs->fp_stmt_cost;
+  return simd_costs->int_stmt_cost;
+}
+
 /* STMT_COST is the cost calculated by aarch64_builtin_vectorization_cost
    for STMT_INFO, which has cost kind KIND.  If this is a scalar operation,
    try to subdivide the target-independent categorization provided by KIND
@@ -14729,6 +15030,203 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
   return stmt_cost;
 }
 
+/* VINFO, COSTS, COUNT, KIND, STMT_INFO and VECTYPE are the same as for
+   TARGET_VECTORIZE_ADD_STMT_COST and they describe an operation in the
+   body of a vector loop.  Record issue information relating to the vector
+   operation in OPS, where OPS is one of COSTS->scalar_ops, COSTS->advsimd_ops
+   or COSTS->sve_ops; see the comments above those variables for details.
+   In addition:
+
+   - VEC_FLAGS is zero if OPS is COSTS->scalar_ops.
+
+   - VEC_FLAGS & VEC_ADVSIMD is nonzero if OPS is COSTS->advsimd_ops.
+
+   - VEC_FLAGS & VEC_ANY_SVE is nonzero if OPS is COSTS->sve_ops.
+
+   ISSUE_INFO provides the scalar, Advanced SIMD or SVE issue information
+   associated with OPS and VEC_FLAGS.  FACTOR says how many iterations of
+   the loop described by VEC_FLAGS would be needed to match one iteration
+   of the vector loop in VINFO.  */
+static void
+aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
+		   unsigned int count, enum vect_cost_for_stmt kind,
+		   _stmt_vec_info *stmt_info, tree vectype,
+		   unsigned int vec_flags, aarch64_vec_op_count *ops,
+		   const aarch64_base_vec_issue_info *issue_info,
+		   unsigned int factor)
+{
+  if (!issue_info)
+    return;
+
+  const aarch64_simd_vec_issue_info *simd_issue = nullptr;
+  if (vec_flags)
+    simd_issue = static_cast<const aarch64_simd_vec_issue_info *> (issue_info);
+
+  const aarch64_sve_vec_issue_info *sve_issue = nullptr;
+  if (vec_flags & VEC_ANY_SVE)
+    sve_issue = static_cast<const aarch64_sve_vec_issue_info *> (issue_info);
+
+  /* Calculate the minimum cycles per iteration imposed by a reduction
+     operation.  */
+  if ((kind == vector_stmt || kind == vec_to_scalar)
+      && aarch64_is_reduction (stmt_info))
+    {
+      unsigned int base
+	= aarch64_in_loop_reduction_latency (vinfo, stmt_info, vectype,
+					     vec_flags);
+      if (aarch64_reduc_type (vinfo, stmt_info) == FOLD_LEFT_REDUCTION)
+	{
+	  if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
+	    {
+	      /* When costing an SVE FADDA, the vectorizer treats vec_to_scalar
+		 as a single operation, whereas for Advanced SIMD it is a
+		 per-element one.  Increase the factor accordingly, both for
+		 the reduction_latency calculation and for the op counting.  */
+	      if (vec_flags & VEC_ADVSIMD)
+		factor = vect_nunits_for_cost (vectype);
+	    }
+	  else
+	    /* An Advanced SIMD fold-left reduction is the same as a
+	       scalar one and the vectorizer therefore treats vec_to_scalar
+	       as a per-element cost.  There is no extra factor to apply for
+	       scalar code, either for reduction_latency or for the op
+	       counting below.  */
+	    factor = 1;
+	}
+
+      /* ??? Ideally for vector code we'd do COUNT * FACTOR reductions in
+	 parallel, but unfortunately that's not yet the case.  */
+      ops->reduction_latency = MAX (ops->reduction_latency,
+				    base * count * factor);
+    }
+
+  /* Assume that multiply-adds will become a single operation.  */
+  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info))
+    return;
+
+  /* When costing scalar statements in vector code, the count already
+     includes the number of scalar elements in the vector, so we don't
+     need to apply the factor as well.  */
+  if (kind == scalar_load || kind == scalar_store || kind == scalar_stmt)
+    factor = 1;
+
+  /* This can go negative with the load handling below.  */
+  int num_copies = count * factor;
+
+  /* Count the basic operation cost associated with KIND.  */
+  switch (kind)
+    {
+    case cond_branch_taken:
+    case cond_branch_not_taken:
+    case vector_gather_load:
+    case vector_scatter_store:
+      /* We currently don't expect these to be used in a loop body.  */
+      break;
+
+    case vec_perm:
+    case vec_promote_demote:
+    case vec_construct:
+    case vec_to_scalar:
+    case scalar_to_vec:
+      /* Assume that these operations have no overhead in the original
+	 scalar code.  */
+      if (!vec_flags)
+	break;
+      /* Fallthrough.  */
+    case vector_stmt:
+    case scalar_stmt:
+      ops->general_ops += num_copies;
+      break;
+
+    case scalar_load:
+    case vector_load:
+    case unaligned_load:
+      /* When costing scalars, detect cases in which we are called twice for
+	 the same load.  This happens for LD[234] operations if only some of
+	 the results are used.  The first time represents the cost of loading
+	 the unused vectors, while the second time represents the cost of
+	 loading the useful parts.  Only the latter should count towards the
+	 scalar costs.  */
+      if (stmt_info && !vec_flags)
+	{
+	  bool existed = false;
+	  unsigned int &prev_count
+	    = costs->seen_loads.get_or_insert (stmt_info, &existed);
+	  if (existed)
+	    num_copies -= prev_count;
+	  else
+	    prev_count = num_copies;
+	}
+      ops->loads += num_copies;
+      if (vec_flags || FLOAT_TYPE_P (vectype))
+	ops->general_ops += issue_info->fp_simd_load_general_ops * num_copies;
+      break;
+
+    case vector_store:
+    case unaligned_store:
+    case scalar_store:
+      ops->stores += num_copies;
+      if (vec_flags || FLOAT_TYPE_P (vectype))
+	ops->general_ops += issue_info->fp_simd_store_general_ops * num_copies;
+      break;
+    }
+
+  /* Add any embedded comparison operations.  */
+  if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
+      && aarch64_embedded_comparison_type (stmt_info))
+    ops->general_ops += num_copies;
+
+  /* Detect COND_REDUCTIONs and things that would need to become
+     COND_REDUCTIONs if they were implemented using Advanced SIMD.
+     There are then two sets of VEC_COND_EXPRs, whereas so far we
+     have only accounted for one.  */
+  if (vec_flags && (kind == vector_stmt || kind == vec_to_scalar))
+    {
+      int reduc_type = aarch64_reduc_type (vinfo, stmt_info);
+      if ((reduc_type == EXTRACT_LAST_REDUCTION && (vec_flags & VEC_ADVSIMD))
+	  || reduc_type == COND_REDUCTION)
+	ops->general_ops += num_copies;
+    }
+
+  /* Count the predicate operations needed by an SVE comparison.  */
+  if (sve_issue && (kind == vector_stmt || kind == vec_to_scalar))
+    if (tree type = aarch64_comparison_type (stmt_info))
+      {
+	unsigned int base = (FLOAT_TYPE_P (type)
+			     ? sve_issue->fp_cmp_pred_ops
+			     : sve_issue->int_cmp_pred_ops);
+	costs->sve_ops.pred_ops += base * num_copies;
+      }
+
+  /* Add any extra overhead associated with LD[234] and ST[234] operations.  */
+  if (simd_issue)
+    switch (aarch64_ld234_st234_vectors (kind, stmt_info))
+      {
+      case 2:
+	ops->general_ops += simd_issue->ld2_st2_general_ops * num_copies;
+	break;
+
+      case 3:
+	ops->general_ops += simd_issue->ld3_st3_general_ops * num_copies;
+	break;
+
+      case 4:
+	ops->general_ops += simd_issue->ld4_st4_general_ops * num_copies;
+	break;
+      }
+
+  /* Add any overhead associated with gather loads and scatter stores.  */
+  if (sve_issue
+      && (kind == scalar_load || kind == scalar_store)
+      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
+    {
+      unsigned int pairs = CEIL (count, 2);
+      costs->sve_ops.pred_ops
+	+= sve_issue->gather_scatter_pair_pred_ops * pairs;
+      ops->general_ops += sve_issue->gather_scatter_pair_general_ops * pairs;
+    }
+}
+
 /* Implement targetm.vectorize.add_stmt_cost.  */
 static unsigned
 aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
@@ -14760,6 +15258,9 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 	 of just looking at KIND.  */
       if (stmt_info && aarch64_use_new_vector_costs_p ())
 	{
+	  if (vectype && aarch64_sve_only_stmt_p (stmt_info, vectype))
+	    costs->saw_sve_only_op = true;
+
 	  stmt_cost = aarch64_detect_scalar_stmt_subtype
 	    (vinfo, kind, stmt_info, stmt_cost);
 
@@ -14781,6 +15282,44 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 	  stmt_cost = aarch64_adjust_stmt_cost (kind, stmt_info, vectype,
 						stmt_cost);
 
+	  /* If we're recording a nonzero vector loop body cost, also estimate
+	     the operations that would need to be issued by all relevant
+	     implementations of the loop.  */
+	  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
+	  if (loop_vinfo
+	      && issue_info
+	      && costs->vec_flags
+	      && where == vect_body
+	      && vectype
+	      && stmt_cost != 0)
+	    {
+	      /* Record estimates for the scalar code.  */
+	      aarch64_count_ops (vinfo, costs, count, kind, stmt_info, vectype,
+				 0, &costs->scalar_ops, issue_info->scalar,
+				 vect_nunits_for_cost (vectype));
+
+	      if (aarch64_sve_mode_p (vinfo->vector_mode) && issue_info->sve)
+		{
+		  /* Record estimates for a possible Advanced SIMD version
+		     of the SVE code.  */
+		  aarch64_count_ops (vinfo, costs, count, kind, stmt_info,
+				     vectype, VEC_ADVSIMD, &costs->advsimd_ops,
+				     issue_info->advsimd,
+				     aarch64_estimated_sve_vq ());
+
+		  /* Record estimates for the SVE code itself.  */
+		  aarch64_count_ops (vinfo, costs, count, kind, stmt_info,
+				     vectype, VEC_ANY_SVE, &costs->sve_ops,
+				     issue_info->sve, 1);
+		}
+	      else
+		/* Record estimates for the Advanced SIMD code.  Treat SVE like
+		   Advanced SIMD if the CPU has no specific SVE costs.  */
+		aarch64_count_ops (vinfo, costs, count, kind, stmt_info,
+				   vectype, VEC_ADVSIMD, &costs->advsimd_ops,
+				   issue_info->advsimd, 1);
+	    }
+
 	  /* If we're applying the SVE vs. Advanced SIMD unrolling heuristic,
 	     estimate the number of statements in the unrolled Advanced SIMD
	     loop.  For simplicity, we assume that one iteration of the
@@ -14805,12 +15344,56 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
   return retval;
 }
 
+/* Dump information about the structure.  */
+void
+aarch64_vec_op_count::dump () const
+{
+  dump_printf_loc (MSG_NOTE, vect_location,
+		   "  load operations = %d\n", loads);
+  dump_printf_loc (MSG_NOTE, vect_location,
+		   "  store operations = %d\n", stores);
+  dump_printf_loc (MSG_NOTE, vect_location,
+		   "  general operations = %d\n", general_ops);
+  dump_printf_loc (MSG_NOTE, vect_location,
+		   "  reduction latency = %d\n", reduction_latency);
+}
+
+/* Dump information about the structure.  */
+void
+aarch64_sve_op_count::dump () const
+{
+  aarch64_vec_op_count::dump ();
+  dump_printf_loc (MSG_NOTE, vect_location,
+		   "  predicate operations = %d\n", pred_ops);
+}
+
+/* Use ISSUE_INFO to estimate the minimum number of cycles needed to issue
+   the operations described by OPS.  This is a very simplistic model!  */
+static unsigned int
+aarch64_estimate_min_cycles_per_iter
+  (const aarch64_vec_op_count *ops,
+   const aarch64_base_vec_issue_info *issue_info)
+{
+  unsigned int cycles = MAX (ops->reduction_latency, 1);
+  cycles = MAX (cycles, CEIL (ops->stores, issue_info->stores_per_cycle));
+  cycles = MAX (cycles, CEIL (ops->loads + ops->stores,
+			      issue_info->loads_stores_per_cycle));
+  cycles = MAX (cycles, CEIL (ops->general_ops,
+			      issue_info->general_ops_per_cycle));
+  return cycles;
+}
+
 /* BODY_COST is the cost of a vector loop body recorded in COSTS.
    Adjust the cost as necessary and return the new cost.  */
 static unsigned int
 aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
 {
   unsigned int orig_body_cost = body_cost;
+  bool should_disparage = false;
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location,
+		     "Original vector body cost = %d\n", body_cost);
 
   if (costs->unrolled_advsimd_stmts)
     {
@@ -14841,10 +15424,203 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
 				 "Increasing body cost to %d to account for"
 				 " unrolling\n", min_cost);
 	      body_cost = min_cost;
+	      should_disparage = true;
 	    }
 	}
     }
 
+  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
+  if (!issue_info)
+    return body_cost;
+
+  unsigned int scalar_cycles_per_iter
+    = aarch64_estimate_min_cycles_per_iter (&costs->scalar_ops,
+					    issue_info->scalar);
+  unsigned int advsimd_cycles_per_iter
+    = aarch64_estimate_min_cycles_per_iter (&costs->advsimd_ops,
+					    issue_info->advsimd);
+  bool could_use_advsimd
+    = ((costs->vec_flags & VEC_ADVSIMD)
+       || (aarch64_autovec_preference != 2
+	   && (aarch64_tune_params.extra_tuning_flags
+	       & AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT)
+	   && !costs->saw_sve_only_op));
+
+  if (dump_enabled_p ())
+    {
+      if (IN_RANGE (costs->num_vector_iterations, 0, 65536))
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "Vector loop iterates at most %wd times\n",
+			 costs->num_vector_iterations);
+      dump_printf_loc (MSG_NOTE, vect_location, "Scalar issue estimate:\n");
+      costs->scalar_ops.dump ();
+      dump_printf_loc (MSG_NOTE, vect_location,
+		       "  estimated cycles per iteration = %d\n",
+		       scalar_cycles_per_iter);
+      if (could_use_advsimd)
+	{
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "Advanced SIMD issue estimate:\n");
+	  costs->advsimd_ops.dump ();
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "  estimated cycles per iteration = %d\n",
+			   advsimd_cycles_per_iter);
+	}
+      else
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "Loop could not use Advanced SIMD\n");
+    }
+
+  uint64_t vector_cycles_per_iter = advsimd_cycles_per_iter;
+  unsigned int vector_reduction_latency = costs->advsimd_ops.reduction_latency;
+  if ((costs->vec_flags & VEC_ANY_SVE) && issue_info->sve)
+    {
+      /* Estimate the minimum number of cycles per iteration needed to issue
+	 non-predicate operations.  */
+      unsigned int sve_cycles_per_iter
+	= aarch64_estimate_min_cycles_per_iter (&costs->sve_ops,
+						issue_info->sve);
+
+      /* Separately estimate the minimum number of cycles per iteration needed
+	 to issue the predicate operations.  */
+      unsigned int pred_cycles_per_iter
+	= CEIL (costs->sve_ops.pred_ops, issue_info->sve->pred_ops_per_cycle);
+
+      if (dump_enabled_p ())
+	{
+	  dump_printf_loc (MSG_NOTE, vect_location, "SVE issue estimate:\n");
+	  costs->sve_ops.dump ();
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "  estimated cycles per iteration for non-predicate"
+			   " operations = %d\n", sve_cycles_per_iter);
+	  if (costs->sve_ops.pred_ops)
+	    dump_printf_loc (MSG_NOTE, vect_location, "  estimated cycles per"
+			     " iteration for predicate operations = %d\n",
+			     pred_cycles_per_iter);
+	}
+
+      vector_cycles_per_iter = MAX (sve_cycles_per_iter, pred_cycles_per_iter);
+      vector_reduction_latency = costs->sve_ops.reduction_latency;
+
+      /* If the scalar version of the loop could issue at least as
+	 quickly as the predicate parts of the SVE loop, make the SVE loop
+	 prohibitively expensive.  In this case vectorization is adding an
+	 overhead that the original scalar code didn't have.
+
+	 This is mostly intended to detect cases in which WHILELOs dominate
+	 for very tight loops, which is something that normal latency-based
+	 costs would not model.  Adding this kind of cliffedge would be
+	 too drastic for scalar_cycles_per_iter vs. sve_cycles_per_iter;
+	 code later in the function handles that case in a more
+	 conservative way.  */
+      uint64_t sve_estimate = pred_cycles_per_iter + 1;
+      if (scalar_cycles_per_iter < sve_estimate)
+	{
+	  unsigned int min_cost
+	    = orig_body_cost * estimated_poly_value (BYTES_PER_SVE_VECTOR);
+	  if (body_cost < min_cost)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "Increasing body cost to %d because the"
+				 " scalar code could issue within the limit"
+				 " imposed by predicate operations\n",
+				 min_cost);
+	      body_cost = min_cost;
+	      should_disparage = true;
+	    }
+	}
+
+      /* If it appears that the Advanced SIMD version of a loop could issue
+	 more quickly than the SVE one, increase the SVE cost in proportion
+	 to the difference.  The intention is to make Advanced SIMD preferable
+	 in cases where an Advanced SIMD version exists, without increasing
+	 the costs so much that SVE won't be used at all.
+
+	 The reasoning is similar to the scalar vs. predicate comparison above:
+	 if the issue rate of the SVE code is limited by predicate operations
+	 (i.e. if pred_cycles_per_iter > sve_cycles_per_iter), and if the
+	 Advanced SIMD code could issue within the limit imposed by the
+	 predicate operations, the predicate operations are adding an
+	 overhead that the original code didn't have and so we should prefer
+	 the Advanced SIMD version.  However, if the predicate operations
+	 do not dominate in this way, we should only increase the cost of
+	 the SVE code if sve_cycles_per_iter is strictly greater than
+	 advsimd_cycles_per_iter.  Given rounding effects, this should mean
+	 that Advanced SIMD is either better or at least no worse.  */
+      if (sve_cycles_per_iter >= pred_cycles_per_iter)
+	sve_estimate = sve_cycles_per_iter;
+      if (could_use_advsimd && advsimd_cycles_per_iter < sve_estimate)
+	{
+	  /* This ensures that min_cost > orig_body_cost * 2.  */
+	  unsigned int min_cost
+	    = orig_body_cost * CEIL (sve_estimate, advsimd_cycles_per_iter) + 1;
+	  if (body_cost < min_cost)
+	    {
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_NOTE, vect_location,
+				 "Increasing body cost to %d because Advanced"
+				 " SIMD code could issue as quickly\n",
+				 min_cost);
+	      body_cost = min_cost;
+	      should_disparage = true;
+	    }
+	}
+    }
+
+  /* Decide whether to stick to latency-based costs or whether to try to
+     take issue rates into account.  */
+  unsigned int threshold = aarch64_loop_vect_issue_rate_niters;
+  if (costs->vec_flags & VEC_ANY_SVE)
+    threshold = CEIL (threshold, aarch64_estimated_sve_vq ());
+
+  if (costs->num_vector_iterations >= 1
+      && costs->num_vector_iterations < threshold)
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "Low iteration count, so using pure latency"
+			 " costs\n");
+    }
+  /* Increase the cost of the vector code if it looks like the scalar code
+     could issue more quickly.  These values are only rough estimates,
+     so minor differences should only result in minor changes.  */
+  else if (scalar_cycles_per_iter < vector_cycles_per_iter)
+    {
+      body_cost = CEIL (body_cost * vector_cycles_per_iter,
+			scalar_cycles_per_iter);
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "Increasing body cost to %d because scalar code"
+			 " would issue more quickly\n", body_cost);
+    }
+  /* In general, it's expected that the proposed vector code would be able
+     to issue more quickly than the original scalar code.  This should
+     already be reflected to some extent in the latency-based costs.
+
+     However, the latency-based costs effectively assume that the scalar
+     code and the vector code execute serially, which tends to underplay
+     one important case: if the real (non-serialized) execution time of
+     a scalar iteration is dominated by loop-carried dependencies,
+     and if the vector code is able to reduce both the length of
+     the loop-carried dependencies *and* the number of cycles needed
+     to issue the code in general, we can be more confident that the
+     vector code is an improvement, even if adding the other (non-loop-carried)
+     latencies tends to hide this saving.  We therefore reduce the cost of the
+     vector loop body in proportion to the saving.  */
+  else if (costs->scalar_ops.reduction_latency > vector_reduction_latency
+	   && costs->scalar_ops.reduction_latency == scalar_cycles_per_iter
+	   && scalar_cycles_per_iter > vector_cycles_per_iter
+	   && !should_disparage)
+    {
+      body_cost = CEIL (body_cost * vector_cycles_per_iter,
+			scalar_cycles_per_iter);
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "Decreasing body cost to %d to account for smaller"
+			 " reduction latency\n", body_cost);
+    }
+
   return body_cost;
 }
 
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 91e5c6151a7..32191cf1acf 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -277,3 +277,6 @@ The number of Newton iterations for calculating the reciprocal for double type.
 
 -param=aarch64-autovec-preference=
 Target Joined UInteger Var(aarch64_autovec_preference) Init(0) IntegerRange(0, 4) Param
+
+-param=aarch64-loop-vect-issue-rate-niters=
+Target Joined UInteger Var(aarch64_loop_vect_issue_rate_niters) Init(6) IntegerRange(0, 65536) Param
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index ca204c7c3d8..7f1231303bf 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -14417,6 +14417,14 @@ Use both Advanced SIMD and SVE.  Prefer SVE when the costs are deemed equal.
 @end table
 The default value is 0.
 
+@item aarch64-loop-vect-issue-rate-niters
+The tuning for some AArch64 CPUs tries to take both latencies and issue
+rates into account when deciding whether a loop should be vectorized
+using SVE, vectorized using Advanced SIMD, or not vectorized at all.
+If this parameter is set to @var{n}, GCC will not use this heuristic
+for loops that are known to execute in fewer than @var{n} Advanced
+SIMD iterations.
+
 @end table
 
 @end table
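
As a standalone illustration of the heuristic (not part of the patch; the
structure names, operation counts and issue widths below are made up, though
the field names mirror the ones used by aarch64_estimate_min_cycles_per_iter
above), the cycles-per-iteration estimate is simply the largest of the
per-resource ceilings:

/* Sketch of aarch64_estimate_min_cycles_per_iter with hypothetical numbers:
   the loop cannot issue faster than its most contended resource allows.  */
#include <algorithm>
#include <cstdio>

struct op_count { unsigned loads, stores, general_ops, reduction_latency; };
struct issue_limits { unsigned loads_stores_per_cycle, stores_per_cycle,
		      general_ops_per_cycle; };

static unsigned ceil_div (unsigned a, unsigned b) { return (a + b - 1) / b; }

static unsigned
min_cycles_per_iter (const op_count &ops, const issue_limits &il)
{
  unsigned cycles = std::max (ops.reduction_latency, 1U);
  cycles = std::max (cycles, ceil_div (ops.stores, il.stores_per_cycle));
  cycles = std::max (cycles, ceil_div (ops.loads + ops.stores,
				       il.loads_stores_per_cycle));
  return std::max (cycles, ceil_div (ops.general_ops,
				     il.general_ops_per_cycle));
}

int
main ()
{
  /* Hypothetical loop: 4 loads, 1 store, 6 general ops and a reduction
     latency of 2, on a core that issues 3 loads/stores per cycle (at most
     2 of them stores) and 4 general ops per cycle.  The load/store and
     general-op ceilings both give 2, so the estimate is 2 cycles.  */
  op_count ops = { 4, 1, 6, 2 };
  issue_limits il = { 3, 2, 4 };
  printf ("estimated cycles per iteration: %u\n",
	  min_cycles_per_iter (ops, il));
  return 0;
}
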
-- 
2.17.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 13/13] aarch64: Add costs for LD[34] and ST[34] postincrements
  2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
                   ` (11 preceding siblings ...)
  2021-03-26 16:18 ` [PATCH 12/13] aarch64: Take issue rate into account for vector loop costs Richard Sandiford
@ 2021-03-26 16:18 ` Richard Sandiford
  12 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2021-03-26 16:18 UTC (permalink / raw)
  To: gcc-patches

Most postincrements are cheap on Neoverse V1, but it's
generally better to avoid them on LD[34] and ST[34] instructions.
This patch adds separate address cost fields for these cases.
Other CPUs continue to use the same costs for all postincrements.
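
For illustration only (this example and the function name are made up and
are not part of the patch), a loop of the following shape is normally
vectorized with LD3/ST3 structure loads and stores, so its post-incremented
addresses are the kind that the new post_modify_ld3_st3 field costs:

/* Hypothetical example: accesses with a stride of 3 where all three
   elements are used are normally implemented with LD3/ST3, so the
   post-increment addressing chosen for this loop would be costed with
   post_modify_ld3_st3 rather than post_modify.  */
void
scale_xyz (float *__restrict dst, const float *__restrict src, int n)
{
  for (int i = 0; i < n; ++i)
    {
      dst[3 * i + 0] = src[3 * i + 0] * 2.0f;
      dst[3 * i + 1] = src[3 * i + 1] * 3.0f;
      dst[3 * i + 2] = src[3 * i + 2] * 4.0f;
    }
}

The higher LD3/ST3 and LD4/ST4 post-modify costs on Neoverse V1 are intended
to steer address generation for such loops away from post-increment forms.
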

gcc/
	* config/aarch64/aarch64-protos.h
	(cpu_addrcost_table::post_modify_ld3_st3): New member variable.
	(cpu_addrcost_table::post_modify_ld4_st4): Likewise.
	* config/aarch64/aarch64.c (generic_addrcost_table): Update
	accordingly, using the same costs as for post_modify.
	(exynosm1_addrcost_table, xgene1_addrcost_table): Likewise.
	(thunderx2t99_addrcost_table, thunderx3t110_addrcost_table):
	(tsv110_addrcost_table, qdf24xx_addrcost_table): Likewise.
	(a64fx_addrcost_table): Likewise.
	(neoversev1_addrcost_table): New.
	(neoversev1_tunings): Use neoversev1_addrcost_table.
	(aarch64_address_cost): Use the new post_modify costs for CImode
	and XImode.
---
 gcc/config/aarch64/aarch64-protos.h |  2 ++
 gcc/config/aarch64/aarch64.c        | 45 +++++++++++++++++++++++++++--
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index ca1ed9e8758..d5d5417370e 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -177,6 +177,8 @@ struct cpu_addrcost_table
   const struct scale_addr_mode_cost addr_scale_costs;
   const int pre_modify;
   const int post_modify;
+  const int post_modify_ld3_st3;
+  const int post_modify_ld4_st4;
   const int register_offset;
   const int register_sextend;
   const int register_zextend;
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 6d961bea5dc..a573850b3fd 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -364,6 +364,8 @@ static const struct cpu_addrcost_table generic_addrcost_table =
     },
   0, /* pre_modify  */
   0, /* post_modify  */
+  0, /* post_modify_ld3_st3  */
+  0, /* post_modify_ld4_st4  */
   0, /* register_offset  */
   0, /* register_sextend  */
   0, /* register_zextend  */
@@ -380,6 +382,8 @@ static const struct cpu_addrcost_table exynosm1_addrcost_table =
     },
   0, /* pre_modify  */
   0, /* post_modify  */
+  0, /* post_modify_ld3_st3  */
+  0, /* post_modify_ld4_st4  */
   1, /* register_offset  */
   1, /* register_sextend  */
   2, /* register_zextend  */
@@ -396,6 +400,8 @@ static const struct cpu_addrcost_table xgene1_addrcost_table =
     },
   1, /* pre_modify  */
   1, /* post_modify  */
+  1, /* post_modify_ld3_st3  */
+  1, /* post_modify_ld4_st4  */
   0, /* register_offset  */
   1, /* register_sextend  */
   1, /* register_zextend  */
@@ -412,6 +418,8 @@ static const struct cpu_addrcost_table thunderx2t99_addrcost_table =
     },
   0, /* pre_modify  */
   0, /* post_modify  */
+  0, /* post_modify_ld3_st3  */
+  0, /* post_modify_ld4_st4  */
   2, /* register_offset  */
   3, /* register_sextend  */
   3, /* register_zextend  */
@@ -428,6 +436,8 @@ static const struct cpu_addrcost_table thunderx3t110_addrcost_table =
     },
   0, /* pre_modify  */
   0, /* post_modify  */
+  0, /* post_modify_ld3_st3  */
+  0, /* post_modify_ld4_st4  */
   2, /* register_offset  */
   3, /* register_sextend  */
   3, /* register_zextend  */
@@ -444,6 +454,8 @@ static const struct cpu_addrcost_table tsv110_addrcost_table =
     },
   0, /* pre_modify  */
   0, /* post_modify  */
+  0, /* post_modify_ld3_st3  */
+  0, /* post_modify_ld4_st4  */
   0, /* register_offset  */
   1, /* register_sextend  */
   1, /* register_zextend  */
@@ -460,6 +472,8 @@ static const struct cpu_addrcost_table qdf24xx_addrcost_table =
     },
   1, /* pre_modify  */
   1, /* post_modify  */
+  1, /* post_modify_ld3_st3  */
+  1, /* post_modify_ld4_st4  */
   3, /* register_offset  */
   3, /* register_sextend  */
   3, /* register_zextend  */
@@ -476,12 +490,32 @@ static const struct cpu_addrcost_table a64fx_addrcost_table =
     },
   0, /* pre_modify  */
   0, /* post_modify  */
+  0, /* post_modify_ld3_st3  */
+  0, /* post_modify_ld4_st4  */
   2, /* register_offset  */
   3, /* register_sextend  */
   3, /* register_zextend  */
   0, /* imm_offset  */
 };
 
+static const struct cpu_addrcost_table neoversev1_addrcost_table =
+{
+    {
+      1, /* hi  */
+      0, /* si  */
+      0, /* di  */
+      1, /* ti  */
+    },
+  0, /* pre_modify  */
+  0, /* post_modify  */
+  3, /* post_modify_ld3_st3  */
+  3, /* post_modify_ld4_st4  */
+  0, /* register_offset  */
+  0, /* register_sextend  */
+  0, /* register_zextend  */
+  0 /* imm_offset  */
+};
+
 static const struct cpu_regmove_cost generic_regmove_cost =
 {
   1, /* GP2GP  */
@@ -1777,7 +1811,7 @@ static const struct cpu_vector_cost neoversev1_vector_cost =
 static const struct tune_params neoversev1_tunings =
 {
   &cortexa76_extra_costs,
-  &generic_addrcost_table,
+  &neoversev1_addrcost_table,
   &generic_regmove_cost,
   &neoversev1_vector_cost,
   &generic_branch_cost,
@@ -12077,7 +12111,14 @@ aarch64_address_cost (rtx x,
 	if (c == PRE_INC || c == PRE_DEC || c == PRE_MODIFY)
 	  cost += addr_cost->pre_modify;
 	else if (c == POST_INC || c == POST_DEC || c == POST_MODIFY)
-	  cost += addr_cost->post_modify;
+	  {
+	    if (mode == CImode)
+	      cost += addr_cost->post_modify_ld3_st3;
+	    else if (mode == XImode)
+	      cost += addr_cost->post_modify_ld4_st4;
+	    else
+	      cost += addr_cost->post_modify;
+	  }
 	else
 	  gcc_unreachable ();
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

Thread overview: 14+ messages
2021-03-26 16:12 aarch64: Opt-in tweaks to the AArch64 vector cost model Richard Sandiford
2021-03-26 16:14 ` [PATCH 01/13] aarch64: Add reduction costs to simd_vec_costs Richard Sandiford
2021-03-26 16:14 ` [PATCH 02/13] aarch64: Add vector costs for SVE CLAST[AB] and FADDA Richard Sandiford
2021-03-26 16:15 ` [PATCH 03/13] aarch64: Add costs for LD[234]/ST[234] permutes Richard Sandiford
2021-03-26 16:15 ` [PATCH 04/13] aarch64: Add costs for storing one element of a vector Richard Sandiford
2021-03-26 16:15 ` [PATCH 05/13] aarch64: Add costs for one element of a scatter store Richard Sandiford
2021-03-26 16:16 ` [PATCH 06/13] aarch64: Add a CPU-specific cost table for Neoverse V1 Richard Sandiford
2021-03-26 16:16 ` [PATCH 07/13] aarch64: Use an aarch64-specific structure for vector costing Richard Sandiford
2021-03-26 16:16 ` [PATCH 08/13] aarch64: Try to detect when Advanced SIMD code would be completely unrolled Richard Sandiford
2021-03-26 16:17 ` [PATCH 09/13] aarch64: Detect scalar extending loads Richard Sandiford
2021-03-26 16:17 ` [PATCH 10/13] aarch64: Cost comparisons embedded in COND_EXPRs Richard Sandiford
2021-03-26 16:18 ` [PATCH 11/13] aarch64: Ignore inductions when costing vector code Richard Sandiford
2021-03-26 16:18 ` [PATCH 12/13] aarch64: Take issue rate into account for vector loop costs Richard Sandiford
2021-03-26 16:18 ` [PATCH 13/13] aarch64: Add costs for LD[34] and ST[34] postincrements Richard Sandiford
