public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/8] aarch64 vector cost tweaks
@ 2021-08-03 12:03 Richard Sandiford
  2021-08-03 12:03 ` [PATCH 1/8] aarch64: Turn sve_width tuning field into a bitmask Richard Sandiford
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: Richard Sandiford @ 2021-08-03 12:03 UTC (permalink / raw)
  To: gcc-patches

This patch series:

(1) generalises the aarch64 vector costs to allow for the final patch.
    This part should be a no-op for existing tuning code.

(2) tweaks the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS code.  This currently
    only affects neoverse-v1 and again helps with the final patch.

(3) adds a new -mtune=neoverse-512tvb option.  See the covering message
    in the final patch for details.

Tested on aarch64-linux-gnu and applied to trunk so far.  I'll backport
to GCC 11 in a few days if there is no fallout.  The patches should be
very low risk; as mentioned, (1) should be a no-op for existing targets
and (2) simply provides minor tweaks/fixes to -mtune code that was new
to GCC 11.

Thanks,
Richard


* [PATCH 1/8] aarch64: Turn sve_width tuning field into a bitmask
  2021-08-03 12:03 [PATCH 0/8] aarch64 vector cost tweaks Richard Sandiford
@ 2021-08-03 12:03 ` Richard Sandiford
  2021-08-03 12:04 ` [PATCH 2/8] aarch64: Add a simple fixed-point class for costing Richard Sandiford
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Richard Sandiford @ 2021-08-03 12:03 UTC (permalink / raw)
  To: gcc-patches

The tuning structures have an sve_width field that specifies the
number of bits in an SVE vector (or SVE_NOT_IMPLEMENTED if not
applicable).  This patch turns the field into a bitmask so that
it can specify multiple widths at the same time.  For now we
always treat the minimum width as the likely width.
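
As a rough sketch (purely illustrative, not taken from the patch), a
hypothetical tuning entry that wanted to describe both 128-bit and
256-bit implementations could then set:

  /* Hypothetical tuning value; no entry sets more than one bit until
     the final patch in the series.  */
  unsigned int sve_width = 128 | 256;

  /* aarch64_estimated_poly_value then reads it as:  */
  unsigned int likely_bits = least_bit_hwi (sve_width);   /* 128 */
  unsigned int max_bits = 1 << floor_log2 (sve_width);    /* 256 */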

An alternative would have been to add extra fields, which would
have coped correctly with non-power-of-2 widths.  However,
we're very far from supporting constant non-power-of-2 vectors
in GCC, so I think the non-power-of-2 case will in reality always
have to be hidden behind VLA.

gcc/
	* config/aarch64/aarch64-protos.h (tune_params::sve_width): Turn
	into a bitmask.
	* config/aarch64/aarch64.c (aarch64_cmp_autovec_modes): Update
	accordingly.
	(aarch64_estimated_poly_value): Likewise.  Use the least significant
	set bit for the minimum and likely values.  Use the most significant
	set bit for the maximum value.
---
 gcc/config/aarch64/aarch64-protos.h |  8 ++++----
 gcc/config/aarch64/aarch64.c        | 15 ++++++++++-----
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index c2033387384..fb4ce8e9f84 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -506,10 +506,10 @@ struct tune_params
   const struct cpu_vector_cost *vec_costs;
   const struct cpu_branch_cost *branch_costs;
   const struct cpu_approx_modes *approx_modes;
-  /* Width of the SVE registers or SVE_NOT_IMPLEMENTED if not applicable.
-     Only used for tuning decisions, does not disable VLA
-     vectorization.  */
-  enum aarch64_sve_vector_bits_enum sve_width;
+  /* A bitmask of the possible SVE register widths in bits,
+     or SVE_NOT_IMPLEMENTED if not applicable.  Only used for tuning
+     decisions, does not disable VLA vectorization.  */
+  unsigned int sve_width;
   int memmov_cost;
   int issue_rate;
   unsigned int fusible_ops;
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index e2114605901..1a8cd131ca2 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -19144,14 +19144,12 @@ aarch64_cmp_autovec_modes (machine_mode sve_m, machine_mode asimd_m)
   bool prefer_asimd = aarch64_autovec_preference == 3;
   bool prefer_sve = aarch64_autovec_preference == 4;
 
-  aarch64_sve_vector_bits_enum tune_width = aarch64_tune_params.sve_width;
-
   poly_int64 nunits_sve = GET_MODE_NUNITS (sve_m);
   poly_int64 nunits_asimd = GET_MODE_NUNITS (asimd_m);
   /* If the CPU information does not have an SVE width registered use the
      generic poly_int comparison that prefers SVE.  If a preference is
      explicitly requested avoid this path.  */
-  if (tune_width == SVE_SCALABLE
+  if (aarch64_tune_params.sve_width == SVE_SCALABLE
       && !prefer_asimd
       && !prefer_sve)
     return maybe_gt (nunits_sve, nunits_asimd);
@@ -24980,8 +24978,7 @@ aarch64_estimated_poly_value (poly_int64 val,
 			      poly_value_estimate_kind kind
 				= POLY_VALUE_LIKELY)
 {
-  enum aarch64_sve_vector_bits_enum width_source
-    = aarch64_tune_params.sve_width;
+  unsigned int width_source = aarch64_tune_params.sve_width;
 
   /* If there is no core-specific information then the minimum and likely
      values are based on 128-bit vectors and the maximum is based on
@@ -24996,6 +24993,14 @@ aarch64_estimated_poly_value (poly_int64 val,
 	  return val.coeffs[0] + val.coeffs[1] * 15;
       }
 
+  /* Allow sve_width to be a bitmask of different VL, treating the lowest
+     as likely.  This could be made more general if future -mtune options
+     need it to be.  */
+  if (kind == POLY_VALUE_MAX)
+    width_source = 1 << floor_log2 (width_source);
+  else
+    width_source = least_bit_hwi (width_source);
+
   /* If the core provides width information, use that.  */
   HOST_WIDE_INT over_128 = width_source - 128;
   return val.coeffs[0] + val.coeffs[1] * over_128 / 128;


* [PATCH 2/8] aarch64: Add a simple fixed-point class for costing
  2021-08-03 12:03 [PATCH 0/8] aarch64 vector cost tweaks Richard Sandiford
  2021-08-03 12:03 ` [PATCH 1/8] aarch64: Turn sve_width tuning field into a bitmask Richard Sandiford
@ 2021-08-03 12:04 ` Richard Sandiford
  2021-08-03 12:04 ` [PATCH 3/8] aarch64: Split out aarch64_adjust_body_cost_sve Richard Sandiford
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Richard Sandiford @ 2021-08-03 12:04 UTC (permalink / raw)
  To: gcc-patches

This patch adds a simple fixed-point class for holding fractional
cost values.  It can exactly represent the reciprocal of any
single-vector SVE element count (including the non-power-of-2 ones).
This means that it can also hold 1/N for all N in [1, 16], which should
be enough for the various *_per_cycle fields.
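
As a concrete illustration of the representation: with the SCALE
constant of 11531520 defined in the new header, 1/3 is stored exactly
as 11531520 / 3 == 3843840 fixed-point units and 1/7 as
11531520 / 7 == 1647360.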

For now the assumption is that the number of possible reciprocals
is fixed at compile time and so the class should always be able
to hold an exact value.

The class uses a uint64_t to hold the fixed-point value, which means
that it can hold any scaled uint32_t cost.  Normally we don't worry
about overflow when manipulating raw uint32_t costs, but just to be
on the safe side, the class uses saturating arithmetic for all
operations.
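
For illustration, a minimal usage sketch of the class as defined later
in the patch (the values themselves are arbitrary):

  fractional_cost cycles (3, 4);           /* exactly 3/4 */
  cycles += fractional_cost (7, 2);        /* 3/4 + 7/2 = 17/4 */
  cycles *= 2;                             /* 17/2 */
  unsigned int rounded = cycles.ceil ();   /* rounds up to 9 */
  double printable = cycles.as_double ();  /* 8.5, only for dump output */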

As far as the changes to the cost routines themselves go:

- The changes to aarch64_add_stmt_cost and its subroutines are
  just laying groundwork for future patches; no functional change
  intended.

- The changes to aarch64_adjust_body_cost mean that we now
  take fractional differences into account.

gcc/
	* config/aarch64/fractional-cost.h: New file.
	* config/aarch64/aarch64.c: Include <algorithm> (indirectly)
	and fractional-cost.h.
	(aarch64_detect_scalar_stmt_subtype): Use fractional_cost for
	statement costs.
	(aarch64_detect_vector_stmt_subtype): Likewise.
	(aarch64_sve_adjust_stmt_cost, aarch64_adjust_stmt_cost): Likewise.
	(aarch64_estimate_min_cycles_per_iter): Use fractional_cost
	for cycle counts.
	(aarch64_adjust_body_cost): Likewise.
	(aarch64_test_fractional_cost): New function.
	(aarch64_run_selftests): Call it.
---
 gcc/config/aarch64/aarch64.c         | 179 +++++++++++++++-----
 gcc/config/aarch64/fractional-cost.h | 236 +++++++++++++++++++++++++++
 2 files changed, 377 insertions(+), 38 deletions(-)
 create mode 100644 gcc/config/aarch64/fractional-cost.h

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 1a8cd131ca2..17fcb34b2c8 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -20,8 +20,9 @@
 
 #define IN_TARGET_CODE 1
 
-#include "config.h"
 #define INCLUDE_STRING
+#define INCLUDE_ALGORITHM
+#include "config.h"
 #include "system.h"
 #include "coretypes.h"
 #include "backend.h"
@@ -76,6 +77,7 @@
 #include "function-abi.h"
 #include "gimple-pretty-print.h"
 #include "tree-ssa-loop-niter.h"
+#include "fractional-cost.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -14912,10 +14914,10 @@ aarch64_in_loop_reduction_latency (vec_info *vinfo, stmt_vec_info stmt_info,
    for STMT_INFO, which has cost kind KIND.  If this is a scalar operation,
    try to subdivide the target-independent categorization provided by KIND
    to get a more accurate cost.  */
-static unsigned int
+static fractional_cost
 aarch64_detect_scalar_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
 				    stmt_vec_info stmt_info,
-				    unsigned int stmt_cost)
+				    fractional_cost stmt_cost)
 {
   /* Detect an extension of a loaded value.  In general, we'll be able to fuse
      the extension with the load.  */
@@ -14931,11 +14933,11 @@ aarch64_detect_scalar_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
    the target-independent categorization provided by KIND to get a more
    accurate cost.  WHERE specifies where the cost associated with KIND
    occurs.  */
-static unsigned int
+static fractional_cost
 aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
 				    stmt_vec_info stmt_info, tree vectype,
 				    enum vect_cost_model_location where,
-				    unsigned int stmt_cost)
+				    fractional_cost stmt_cost)
 {
   const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
   const sve_vec_cost *sve_costs = nullptr;
@@ -15016,10 +15018,10 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
    for STMT_INFO, which has cost kind KIND and which when vectorized would
    operate on vector type VECTYPE.  Adjust the cost as necessary for SVE
    targets.  */
-static unsigned int
+static fractional_cost
 aarch64_sve_adjust_stmt_cost (class vec_info *vinfo, vect_cost_for_stmt kind,
 			      stmt_vec_info stmt_info, tree vectype,
-			      unsigned int stmt_cost)
+			      fractional_cost stmt_cost)
 {
   /* Unlike vec_promote_demote, vector_stmt conversions do not change the
      vector register size or number of units.  Integer promotions of this
@@ -15083,9 +15085,9 @@ aarch64_sve_adjust_stmt_cost (class vec_info *vinfo, vect_cost_for_stmt kind,
 /* STMT_COST is the cost calculated for STMT_INFO, which has cost kind KIND
    and which when vectorized would operate on vector type VECTYPE.  Add the
    cost of any embedded operations.  */
-static unsigned int
+static fractional_cost
 aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
-			  tree vectype, unsigned int stmt_cost)
+			  tree vectype, fractional_cost stmt_cost)
 {
   if (vectype)
     {
@@ -15339,7 +15341,7 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 
   if (flag_vect_cost_model)
     {
-      int stmt_cost
+      fractional_cost stmt_cost
 	= aarch64_builtin_vectorization_cost (kind, vectype, misalign);
 
       /* Do one-time initialization based on the vinfo.  */
@@ -15440,7 +15442,7 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 	  count *= LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo); /*  FIXME  */
 	}
 
-      retval = (unsigned) (count * stmt_cost);
+      retval = (count * stmt_cost).ceil ();
       costs->region[where] += retval;
     }
 
@@ -15472,17 +15474,17 @@ aarch64_sve_op_count::dump () const
 
 /* Use ISSUE_INFO to estimate the minimum number of cycles needed to issue
    the operations described by OPS.  This is a very simplistic model!  */
-static unsigned int
+static fractional_cost
 aarch64_estimate_min_cycles_per_iter
   (const aarch64_vec_op_count *ops,
    const aarch64_base_vec_issue_info *issue_info)
 {
-  unsigned int cycles = MAX (ops->reduction_latency, 1);
-  cycles = MAX (cycles, CEIL (ops->stores, issue_info->stores_per_cycle));
-  cycles = MAX (cycles, CEIL (ops->loads + ops->stores,
-			      issue_info->loads_stores_per_cycle));
-  cycles = MAX (cycles, CEIL (ops->general_ops,
-			      issue_info->general_ops_per_cycle));
+  fractional_cost cycles = MAX (ops->reduction_latency, 1);
+  cycles = std::max (cycles, { ops->stores, issue_info->stores_per_cycle });
+  cycles = std::max (cycles, { ops->loads + ops->stores,
+			       issue_info->loads_stores_per_cycle });
+  cycles = std::max (cycles, { ops->general_ops,
+			       issue_info->general_ops_per_cycle });
   return cycles;
 }
 
@@ -15536,12 +15538,14 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
   if (!issue_info)
     return body_cost;
 
-  unsigned int scalar_cycles_per_iter
+  fractional_cost scalar_cycles_per_iter
     = aarch64_estimate_min_cycles_per_iter (&costs->scalar_ops,
 					    issue_info->scalar);
-  unsigned int advsimd_cycles_per_iter
+
+  fractional_cost advsimd_cycles_per_iter
     = aarch64_estimate_min_cycles_per_iter (&costs->advsimd_ops,
 					    issue_info->advsimd);
+
   bool could_use_advsimd
     = ((costs->vec_flags & VEC_ADVSIMD)
        || (aarch64_autovec_preference != 2
@@ -15558,36 +15562,37 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
       dump_printf_loc (MSG_NOTE, vect_location, "Scalar issue estimate:\n");
       costs->scalar_ops.dump ();
       dump_printf_loc (MSG_NOTE, vect_location,
-		       "  estimated cycles per iteration = %d\n",
-		       scalar_cycles_per_iter);
+		       "  estimated cycles per iteration = %f\n",
+		       scalar_cycles_per_iter.as_double ());
       if (could_use_advsimd)
 	{
 	  dump_printf_loc (MSG_NOTE, vect_location,
 			   "Advanced SIMD issue estimate:\n");
 	  costs->advsimd_ops.dump ();
 	  dump_printf_loc (MSG_NOTE, vect_location,
-			   "  estimated cycles per iteration = %d\n",
-			   advsimd_cycles_per_iter);
+			   "  estimated cycles per iteration = %f\n",
+			   advsimd_cycles_per_iter.as_double ());
 	}
       else
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "Loop could not use Advanced SIMD\n");
     }
 
-  uint64_t vector_cycles_per_iter = advsimd_cycles_per_iter;
+  fractional_cost vector_cycles_per_iter = advsimd_cycles_per_iter;
   unsigned int vector_reduction_latency = costs->advsimd_ops.reduction_latency;
+
   if ((costs->vec_flags & VEC_ANY_SVE) && issue_info->sve)
     {
       /* Estimate the minimum number of cycles per iteration needed to issue
 	 non-predicate operations.  */
-      unsigned int sve_cycles_per_iter
+      fractional_cost sve_cycles_per_iter
 	= aarch64_estimate_min_cycles_per_iter (&costs->sve_ops,
 						issue_info->sve);
 
       /* Separately estimate the minimum number of cycles per iteration needed
 	 to issue the predicate operations.  */
-      unsigned int pred_cycles_per_iter
-	= CEIL (costs->sve_ops.pred_ops, issue_info->sve->pred_ops_per_cycle);
+      fractional_cost pred_cycles_per_iter
+	= { costs->sve_ops.pred_ops, issue_info->sve->pred_ops_per_cycle };
 
       if (dump_enabled_p ())
 	{
@@ -15595,14 +15600,16 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
 	  costs->sve_ops.dump ();
 	  dump_printf_loc (MSG_NOTE, vect_location,
 			   "  estimated cycles per iteration for non-predicate"
-			   " operations = %d\n", sve_cycles_per_iter);
+			   " operations = %f\n",
+			   sve_cycles_per_iter.as_double ());
 	  if (costs->sve_ops.pred_ops)
 	    dump_printf_loc (MSG_NOTE, vect_location, "  estimated cycles per"
-			     " iteration for predicate operations = %d\n",
-			     pred_cycles_per_iter);
+			     " iteration for predicate operations = %f\n",
+			     pred_cycles_per_iter.as_double ());
 	}
 
-      vector_cycles_per_iter = MAX (sve_cycles_per_iter, pred_cycles_per_iter);
+      vector_cycles_per_iter = std::max (sve_cycles_per_iter,
+					 pred_cycles_per_iter);
       vector_reduction_latency = costs->sve_ops.reduction_latency;
 
       /* If the scalar version of the loop could issue at least as
@@ -15616,7 +15623,7 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
 	 too drastic for scalar_cycles_per_iter vs. sve_cycles_per_iter;
 	 code later in the function handles that case in a more
 	 conservative way.  */
-      uint64_t sve_estimate = pred_cycles_per_iter + 1;
+      fractional_cost sve_estimate = pred_cycles_per_iter + 1;
       if (scalar_cycles_per_iter < sve_estimate)
 	{
 	  unsigned int min_cost
@@ -15656,8 +15663,10 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
       if (could_use_advsimd && advsimd_cycles_per_iter < sve_estimate)
 	{
 	  /* This ensures that min_cost > orig_body_cost * 2.  */
-	  unsigned int min_cost
-	    = orig_body_cost * CEIL (sve_estimate, advsimd_cycles_per_iter) + 1;
+	  unsigned int factor
+	    = fractional_cost::scale (1, sve_estimate,
+				      advsimd_cycles_per_iter);
+	  unsigned int min_cost = orig_body_cost * factor + 1;
 	  if (body_cost < min_cost)
 	    {
 	      if (dump_enabled_p ())
@@ -15690,8 +15699,8 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
      so minor differences should only result in minor changes.  */
   else if (scalar_cycles_per_iter < vector_cycles_per_iter)
     {
-      body_cost = CEIL (body_cost * vector_cycles_per_iter,
-			scalar_cycles_per_iter);
+      body_cost = fractional_cost::scale (body_cost, vector_cycles_per_iter,
+					  scalar_cycles_per_iter);
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "Increasing body cost to %d because scalar code"
@@ -15716,8 +15725,8 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
 	   && scalar_cycles_per_iter > vector_cycles_per_iter
 	   && !should_disparage)
     {
-      body_cost = CEIL (body_cost * vector_cycles_per_iter,
-			scalar_cycles_per_iter);
+      body_cost = fractional_cost::scale (body_cost, vector_cycles_per_iter,
+					  scalar_cycles_per_iter);
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location,
 			 "Decreasing body cost to %d account for smaller"
@@ -25589,12 +25598,106 @@ aarch64_test_loading_full_dump ()
   ASSERT_EQ (SImode, GET_MODE (crtl->return_rtx));
 }
 
+/* Test the fractional_cost class.  */
+
+static void
+aarch64_test_fractional_cost ()
+{
+  using cf = fractional_cost;
+
+  ASSERT_EQ (cf (0, 20), 0);
+
+  ASSERT_EQ (cf (4, 2), 2);
+  ASSERT_EQ (3, cf (9, 3));
+
+  ASSERT_NE (cf (5, 2), 2);
+  ASSERT_NE (3, cf (8, 3));
+
+  ASSERT_EQ (cf (7, 11) + cf (15, 11), 2);
+  ASSERT_EQ (cf (2, 3) + cf (3, 5), cf (19, 15));
+  ASSERT_EQ (cf (2, 3) + cf (1, 6) + cf (1, 6), 1);
+
+  ASSERT_EQ (cf (14, 15) - cf (4, 15), cf (2, 3));
+  ASSERT_EQ (cf (1, 4) - cf (1, 2), 0);
+  ASSERT_EQ (cf (3, 5) - cf (1, 10), cf (1, 2));
+  ASSERT_EQ (cf (11, 3) - 3, cf (2, 3));
+  ASSERT_EQ (3 - cf (7, 3), cf (2, 3));
+  ASSERT_EQ (3 - cf (10, 3), 0);
+
+  ASSERT_EQ (cf (2, 3) * 5, cf (10, 3));
+  ASSERT_EQ (14 * cf (11, 21), cf (22, 3));
+
+  ASSERT_TRUE (cf (4, 15) < cf (5, 15));
+  ASSERT_FALSE (cf (5, 15) < cf (5, 15));
+  ASSERT_FALSE (cf (6, 15) < cf (5, 15));
+  ASSERT_TRUE (cf (1, 3) < cf (2, 5));
+  ASSERT_TRUE (cf (1, 12) < cf (1, 6));
+  ASSERT_FALSE (cf (5, 3) < cf (5, 3));
+  ASSERT_TRUE (cf (239, 240) < 1);
+  ASSERT_FALSE (cf (240, 240) < 1);
+  ASSERT_FALSE (cf (241, 240) < 1);
+  ASSERT_FALSE (2 < cf (207, 104));
+  ASSERT_FALSE (2 < cf (208, 104));
+  ASSERT_TRUE (2 < cf (209, 104));
+
+  ASSERT_TRUE (cf (4, 15) <= cf (5, 15));
+  ASSERT_TRUE (cf (5, 15) <= cf (5, 15));
+  ASSERT_FALSE (cf (6, 15) <= cf (5, 15));
+  ASSERT_TRUE (cf (1, 3) <= cf (2, 5));
+  ASSERT_TRUE (cf (1, 12) <= cf (1, 6));
+  ASSERT_TRUE (cf (5, 3) <= cf (5, 3));
+  ASSERT_TRUE (cf (239, 240) <= 1);
+  ASSERT_TRUE (cf (240, 240) <= 1);
+  ASSERT_FALSE (cf (241, 240) <= 1);
+  ASSERT_FALSE (2 <= cf (207, 104));
+  ASSERT_TRUE (2 <= cf (208, 104));
+  ASSERT_TRUE (2 <= cf (209, 104));
+
+  ASSERT_FALSE (cf (4, 15) >= cf (5, 15));
+  ASSERT_TRUE (cf (5, 15) >= cf (5, 15));
+  ASSERT_TRUE (cf (6, 15) >= cf (5, 15));
+  ASSERT_FALSE (cf (1, 3) >= cf (2, 5));
+  ASSERT_FALSE (cf (1, 12) >= cf (1, 6));
+  ASSERT_TRUE (cf (5, 3) >= cf (5, 3));
+  ASSERT_FALSE (cf (239, 240) >= 1);
+  ASSERT_TRUE (cf (240, 240) >= 1);
+  ASSERT_TRUE (cf (241, 240) >= 1);
+  ASSERT_TRUE (2 >= cf (207, 104));
+  ASSERT_TRUE (2 >= cf (208, 104));
+  ASSERT_FALSE (2 >= cf (209, 104));
+
+  ASSERT_FALSE (cf (4, 15) > cf (5, 15));
+  ASSERT_FALSE (cf (5, 15) > cf (5, 15));
+  ASSERT_TRUE (cf (6, 15) > cf (5, 15));
+  ASSERT_FALSE (cf (1, 3) > cf (2, 5));
+  ASSERT_FALSE (cf (1, 12) > cf (1, 6));
+  ASSERT_FALSE (cf (5, 3) > cf (5, 3));
+  ASSERT_FALSE (cf (239, 240) > 1);
+  ASSERT_FALSE (cf (240, 240) > 1);
+  ASSERT_TRUE (cf (241, 240) > 1);
+  ASSERT_TRUE (2 > cf (207, 104));
+  ASSERT_FALSE (2 > cf (208, 104));
+  ASSERT_FALSE (2 > cf (209, 104));
+
+  ASSERT_EQ (cf (1, 2).ceil (), 1);
+  ASSERT_EQ (cf (11, 7).ceil (), 2);
+  ASSERT_EQ (cf (20, 1).ceil (), 20);
+  ASSERT_EQ ((cf (0xfffffffd) + 1).ceil (), 0xfffffffe);
+  ASSERT_EQ ((cf (0xfffffffd) + 2).ceil (), 0xffffffff);
+  ASSERT_EQ ((cf (0xfffffffd) + 3).ceil (), 0xffffffff);
+  ASSERT_EQ ((cf (0x7fffffff) * 2).ceil (), 0xfffffffe);
+  ASSERT_EQ ((cf (0x80000000) * 2).ceil (), 0xffffffff);
+
+  ASSERT_EQ (cf (1, 2).as_double (), 0.5);
+}
+
 /* Run all target-specific selftests.  */
 
 static void
 aarch64_run_selftests (void)
 {
   aarch64_test_loading_full_dump ();
+  aarch64_test_fractional_cost ();
 }
 
 } // namespace selftest
diff --git a/gcc/config/aarch64/fractional-cost.h b/gcc/config/aarch64/fractional-cost.h
new file mode 100644
index 00000000000..6a01634905b
--- /dev/null
+++ b/gcc/config/aarch64/fractional-cost.h
@@ -0,0 +1,236 @@
+// Simple fixed-point representation of fractional costs
+// Copyright (C) 2021 Free Software Foundation, Inc.
+//
+// This file is part of GCC.
+//
+// GCC is free software; you can redistribute it and/or modify it under
+// the terms of the GNU General Public License as published by the Free
+// Software Foundation; either version 3, or (at your option) any later
+// version.
+//
+// GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+// WARRANTY; without even the implied warranty of MERCHANTABILITY or
+// FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+// for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with GCC; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// A simple saturating fixed-point type for representing fractional
+// intermediate results in cost calculations.  The input and result
+// costs are assumed to be uint32_ts.  Unlike sreal, the class can
+// represent most values that we care about exactly (without rounding).
+// See the comment above the SCALE field for the current set of
+// exactly-representable reciprocals.
+class fractional_cost
+{
+public:
+  // Construct an object equal to INT_VALUE.
+  constexpr fractional_cost (uint32_t int_value = 0)
+    : m_value (uint64_t (int_value) * SCALE) {}
+
+  fractional_cost (uint32_t a, uint32_t b);
+
+  fractional_cost operator+ (const fractional_cost &) const;
+  fractional_cost operator- (const fractional_cost &) const;
+  fractional_cost operator* (uint32_t) const;
+
+  fractional_cost &operator+= (const fractional_cost &);
+  fractional_cost &operator-= (const fractional_cost &);
+  fractional_cost &operator*= (uint32_t);
+
+  bool operator== (const fractional_cost &) const;
+  bool operator!= (const fractional_cost &) const;
+  bool operator< (const fractional_cost &) const;
+  bool operator<= (const fractional_cost &) const;
+  bool operator>= (const fractional_cost &) const;
+  bool operator> (const fractional_cost &) const;
+
+  uint32_t ceil () const;
+
+  static uint32_t scale (uint32_t, fractional_cost, fractional_cost);
+
+  explicit operator bool () const { return m_value != 0; }
+
+  // Convert the value to a double.
+  double as_double () const { return double (m_value) / SCALE; }
+
+private:
+  enum raw { RAW };
+  constexpr fractional_cost (uint64_t value, raw) : m_value (value) {}
+
+  // A multiple of [1, 16] * 16.  This ensures that 1/N is representable
+  // for every possible SVE element count N, or for any "X per cycle"
+  // value N in the range [1, 16].
+  static const uint32_t SCALE = 11531520;
+
+  // The value multiplied by SCALE.
+  uint64_t m_value;
+};
+
+// Construct a representation of A / B, rounding up if (contrary to
+// expectations) we can't represent the value exactly.  For now we
+// treat inexact values as a bug, since all values of B should come
+// from a small set of values that are known at compile time.
+inline fractional_cost::fractional_cost (uint32_t a, uint32_t b)
+  : m_value (CEIL (uint64_t (a) * SCALE, uint64_t (b)))
+{
+  gcc_checking_assert (SCALE % b == 0);
+}
+
+inline fractional_cost
+fractional_cost::operator+ (const fractional_cost &other) const
+{
+  uint64_t sum = m_value + other.m_value;
+  return { sum >= m_value ? sum : ~uint64_t (0), RAW };
+}
+
+inline fractional_cost &
+fractional_cost::operator+= (const fractional_cost &other)
+{
+  *this = *this + other;
+  return *this;
+}
+
+inline fractional_cost
+fractional_cost::operator- (const fractional_cost &other) const
+{
+  uint64_t diff = m_value - other.m_value;
+  return { diff <= m_value ? diff : 0, RAW };
+}
+
+inline fractional_cost &
+fractional_cost::operator-= (const fractional_cost &other)
+{
+  *this = *this - other;
+  return *this;
+}
+
+inline fractional_cost
+fractional_cost::operator* (uint32_t other) const
+{
+  if (other == 0)
+    return 0;
+
+  uint64_t max = ~uint64_t (0);
+  return { m_value <= max / other ? m_value * other : max, RAW };
+}
+
+inline fractional_cost &
+fractional_cost::operator*= (uint32_t other)
+{
+  *this = *this * other;
+  return *this;
+}
+
+inline bool
+fractional_cost::operator== (const fractional_cost &other) const
+{
+  return m_value == other.m_value;
+}
+
+inline bool
+fractional_cost::operator!= (const fractional_cost &other) const
+{
+  return m_value != other.m_value;
+}
+
+inline bool
+fractional_cost::operator< (const fractional_cost &other) const
+{
+  return m_value < other.m_value;
+}
+
+inline bool
+fractional_cost::operator<= (const fractional_cost &other) const
+{
+  return m_value <= other.m_value;
+}
+
+inline bool
+fractional_cost::operator>= (const fractional_cost &other) const
+{
+  return m_value >= other.m_value;
+}
+
+inline bool
+fractional_cost::operator> (const fractional_cost &other) const
+{
+  return m_value > other.m_value;
+}
+
+// Round the value up to the nearest integer and saturate to a uint32_t.
+inline uint32_t
+fractional_cost::ceil () const
+{
+  uint32_t max = ~uint32_t (0);
+  if (m_value <= uint64_t (max - 1) * SCALE)
+    return (m_value + SCALE - 1) / SCALE;
+  return max;
+}
+
+// Round (COST * A) / B up to the nearest integer and saturate to a uint32_t.
+inline uint32_t
+fractional_cost::scale (uint32_t cost, fractional_cost a, fractional_cost b)
+{
+  widest_int result = wi::div_ceil (widest_int (cost) * a.m_value,
+				    b.m_value, SIGNED);
+  if (result < ~uint32_t (0))
+    return result.to_shwi ();
+  return ~uint32_t (0);
+}
+
+inline fractional_cost
+operator+ (uint32_t a, const fractional_cost &b)
+{
+  return b.operator+ (a);
+}
+
+inline fractional_cost
+operator- (uint32_t a, const fractional_cost &b)
+{
+  return fractional_cost (a).operator- (b);
+}
+
+inline fractional_cost
+operator* (uint32_t a, const fractional_cost &b)
+{
+  return b.operator* (a);
+}
+
+inline bool
+operator== (uint32_t a, const fractional_cost &b)
+{
+  return b.operator== (a);
+}
+
+inline bool
+operator!= (uint32_t a, const fractional_cost &b)
+{
+  return b.operator!= (a);
+}
+
+inline bool
+operator< (uint32_t a, const fractional_cost &b)
+{
+  return b.operator> (a);
+}
+
+inline bool
+operator<= (uint32_t a, const fractional_cost &b)
+{
+  return b.operator>= (a);
+}
+
+inline bool
+operator>= (uint32_t a, const fractional_cost &b)
+{
+  return b.operator<= (a);
+}
+
+inline bool
+operator> (uint32_t a, const fractional_cost &b)
+{
+  return b.operator< (a);
+}


* [PATCH 3/8] aarch64: Split out aarch64_adjust_body_cost_sve
  2021-08-03 12:03 [PATCH 0/8] aarch64 vector cost tweaks Richard Sandiford
  2021-08-03 12:03 ` [PATCH 1/8] aarch64: Turn sve_width tuning field into a bitmask Richard Sandiford
  2021-08-03 12:04 ` [PATCH 2/8] aarch64: Add a simple fixed-point class for costing Richard Sandiford
@ 2021-08-03 12:04 ` Richard Sandiford
  2021-08-03 12:05 ` [PATCH 4/8] aarch64: Add gather_load_xNN_cost tuning fields Richard Sandiford
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Richard Sandiford @ 2021-08-03 12:04 UTC (permalink / raw)
  To: gcc-patches

This patch splits the SVE-specific part of aarch64_adjust_body_cost
out into its own subroutine, so that a future patch can call it
more than once.  I wondered about using a lambda to avoid having
to pass all the arguments, but in the end this way seemed clearer.

gcc/
	* config/aarch64/aarch64.c (aarch64_adjust_body_cost_sve): New
	function, split out from...
	(aarch64_adjust_body_cost): ...here.
---
 gcc/config/aarch64/aarch64.c | 220 ++++++++++++++++++++---------------
 1 file changed, 127 insertions(+), 93 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 17fcb34b2c8..b14b6f22aec 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -15488,6 +15488,126 @@ aarch64_estimate_min_cycles_per_iter
   return cycles;
 }
 
+/* Subroutine of aarch64_adjust_body_cost for handling SVE.
+   Use ISSUE_INFO to work out how fast the SVE code can be issued and compare
+   it to the equivalent value for scalar code (SCALAR_CYCLES_PER_ITER).
+   If COULD_USE_ADVSIMD is true, also compare it to the issue rate of
+   Advanced SIMD code (ADVSIMD_CYCLES_PER_ITER).
+
+   COSTS is as for aarch64_adjust_body_cost.  ORIG_BODY_COST is the cost
+   originally passed to aarch64_adjust_body_cost and *BODY_COST is the current
+   value of the adjusted cost.  *SHOULD_DISPARAGE is true if we think the loop
+   body is too expensive.  */
+
+static fractional_cost
+aarch64_adjust_body_cost_sve (const aarch64_vector_costs *costs,
+			      const aarch64_vec_issue_info *issue_info,
+			      fractional_cost scalar_cycles_per_iter,
+			      fractional_cost advsimd_cycles_per_iter,
+			      bool could_use_advsimd,
+			      unsigned int orig_body_cost,
+			      unsigned int *body_cost,
+			      bool *should_disparage)
+{
+  /* Estimate the minimum number of cycles per iteration needed to issue
+     non-predicate operations.  */
+  fractional_cost sve_nonpred_cycles_per_iter
+    = aarch64_estimate_min_cycles_per_iter (&costs->sve_ops,
+					    issue_info->sve);
+
+  /* Separately estimate the minimum number of cycles per iteration needed
+     to issue the predicate operations.  */
+  fractional_cost sve_pred_issue_cycles_per_iter
+    = { costs->sve_ops.pred_ops, issue_info->sve->pred_ops_per_cycle };
+
+  /* Calculate the overall limit on the number of cycles per iteration.  */
+  fractional_cost sve_cycles_per_iter
+    = std::max (sve_nonpred_cycles_per_iter, sve_pred_issue_cycles_per_iter);
+
+  if (dump_enabled_p ())
+    {
+      costs->sve_ops.dump ();
+      dump_printf_loc (MSG_NOTE, vect_location,
+		       "  estimated cycles per iteration = %f\n",
+		       sve_cycles_per_iter.as_double ());
+      dump_printf_loc (MSG_NOTE, vect_location,
+		       "  estimated cycles per iteration for non-predicate"
+		       " operations = %f\n",
+		       sve_nonpred_cycles_per_iter.as_double ());
+      if (costs->sve_ops.pred_ops)
+	dump_printf_loc (MSG_NOTE, vect_location, "  estimated cycles per"
+			 " iteration for predicate operations = %f\n",
+			 sve_pred_issue_cycles_per_iter.as_double ());
+    }
+
+  /* If the scalar version of the loop could issue at least as
+     quickly as the predicate parts of the SVE loop, make the SVE loop
+     prohibitively expensive.  In this case vectorization is adding an
+     overhead that the original scalar code didn't have.
+
+     This is mostly intended to detect cases in which WHILELOs dominate
+     for very tight loops, which is something that normal latency-based
+     costs would not model.  Adding this kind of cliffedge would be
+     too drastic for scalar_cycles_per_iter vs. sve_cycles_per_iter;
+     code in the caller handles that case in a more conservative way.  */
+  fractional_cost sve_estimate = sve_pred_issue_cycles_per_iter + 1;
+  if (scalar_cycles_per_iter < sve_estimate)
+    {
+      unsigned int min_cost
+	= orig_body_cost * estimated_poly_value (BYTES_PER_SVE_VECTOR);
+      if (*body_cost < min_cost)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Increasing body cost to %d because the"
+			     " scalar code could issue within the limit"
+			     " imposed by predicate operations\n",
+			     min_cost);
+	  *body_cost = min_cost;
+	  *should_disparage = true;
+	}
+    }
+
+  /* If it appears that the Advanced SIMD version of a loop could issue
+     more quickly than the SVE one, increase the SVE cost in proportion
+     to the difference.  The intention is to make Advanced SIMD preferable
+     in cases where an Advanced SIMD version exists, without increasing
+     the costs so much that SVE won't be used at all.
+
+     The reasoning is similar to the scalar vs. predicate comparison above:
+     if the issue rate of the SVE code is limited by predicate operations
+     (i.e. if sve_pred_issue_cycles_per_iter > sve_nonpred_cycles_per_iter),
+     and if the Advanced SIMD code could issue within the limit imposed
+     by the predicate operations, the predicate operations are adding an
+     overhead that the original code didn't have and so we should prefer
+     the Advanced SIMD version.  However, if the predicate operations
+     do not dominate in this way, we should only increase the cost of
+     the SVE code if sve_cycles_per_iter is strictly greater than
+     advsimd_cycles_per_iter.  Given rounding effects, this should mean
+     that Advanced SIMD is either better or at least no worse.  */
+  if (sve_nonpred_cycles_per_iter >= sve_pred_issue_cycles_per_iter)
+    sve_estimate = sve_cycles_per_iter;
+  if (could_use_advsimd && advsimd_cycles_per_iter < sve_estimate)
+    {
+      /* This ensures that min_cost > orig_body_cost * 2.  */
+      unsigned int factor = fractional_cost::scale (1, sve_estimate,
+						    advsimd_cycles_per_iter);
+      unsigned int min_cost = orig_body_cost * factor + 1;
+      if (*body_cost < min_cost)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Increasing body cost to %d because Advanced"
+			     " SIMD code could issue as quickly\n",
+			     min_cost);
+	  *body_cost = min_cost;
+	  *should_disparage = true;
+	}
+    }
+
+  return sve_cycles_per_iter;
+}
+
 /* BODY_COST is the cost of a vector loop body recorded in COSTS.
    Adjust the cost as necessary and return the new cost.  */
 static unsigned int
@@ -15583,101 +15703,15 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
 
   if ((costs->vec_flags & VEC_ANY_SVE) && issue_info->sve)
     {
-      /* Estimate the minimum number of cycles per iteration needed to issue
-	 non-predicate operations.  */
-      fractional_cost sve_cycles_per_iter
-	= aarch64_estimate_min_cycles_per_iter (&costs->sve_ops,
-						issue_info->sve);
-
-      /* Separately estimate the minimum number of cycles per iteration needed
-	 to issue the predicate operations.  */
-      fractional_cost pred_cycles_per_iter
-	= { costs->sve_ops.pred_ops, issue_info->sve->pred_ops_per_cycle };
-
       if (dump_enabled_p ())
-	{
-	  dump_printf_loc (MSG_NOTE, vect_location, "SVE issue estimate:\n");
-	  costs->sve_ops.dump ();
-	  dump_printf_loc (MSG_NOTE, vect_location,
-			   "  estimated cycles per iteration for non-predicate"
-			   " operations = %f\n",
-			   sve_cycles_per_iter.as_double ());
-	  if (costs->sve_ops.pred_ops)
-	    dump_printf_loc (MSG_NOTE, vect_location, "  estimated cycles per"
-			     " iteration for predicate operations = %f\n",
-			     pred_cycles_per_iter.as_double ());
-	}
-
-      vector_cycles_per_iter = std::max (sve_cycles_per_iter,
-					 pred_cycles_per_iter);
+	dump_printf_loc (MSG_NOTE, vect_location, "SVE issue estimate:\n");
       vector_reduction_latency = costs->sve_ops.reduction_latency;
-
-      /* If the scalar version of the loop could issue at least as
-	 quickly as the predicate parts of the SVE loop, make the SVE loop
-	 prohibitively expensive.  In this case vectorization is adding an
-	 overhead that the original scalar code didn't have.
-
-	 This is mostly intended to detect cases in which WHILELOs dominate
-	 for very tight loops, which is something that normal latency-based
-	 costs would not model.  Adding this kind of cliffedge would be
-	 too drastic for scalar_cycles_per_iter vs. sve_cycles_per_iter;
-	 code later in the function handles that case in a more
-	 conservative way.  */
-      fractional_cost sve_estimate = pred_cycles_per_iter + 1;
-      if (scalar_cycles_per_iter < sve_estimate)
-	{
-	  unsigned int min_cost
-	    = orig_body_cost * estimated_poly_value (BYTES_PER_SVE_VECTOR);
-	  if (body_cost < min_cost)
-	    {
-	      if (dump_enabled_p ())
-		dump_printf_loc (MSG_NOTE, vect_location,
-				 "Increasing body cost to %d because the"
-				 " scalar code could issue within the limit"
-				 " imposed by predicate operations\n",
-				 min_cost);
-	      body_cost = min_cost;
-	      should_disparage = true;
-	    }
-	}
-
-      /* If it appears that the Advanced SIMD version of a loop could issue
-	 more quickly than the SVE one, increase the SVE cost in proportion
-	 to the difference.  The intention is to make Advanced SIMD preferable
-	 in cases where an Advanced SIMD version exists, without increasing
-	 the costs so much that SVE won't be used at all.
-
-	 The reasoning is similar to the scalar vs. predicate comparison above:
-	 if the issue rate of the SVE code is limited by predicate operations
-	 (i.e. if pred_cycles_per_iter > sve_cycles_per_iter), and if the
-	 Advanced SIMD code could issue within the limit imposed by the
-	 predicate operations, the predicate operations are adding an
-	 overhead that the original code didn't have and so we should prefer
-	 the Advanced SIMD version.  However, if the predicate operations
-	 do not dominate in this way, we should only increase the cost of
-	 the SVE code if sve_cycles_per_iter is strictly greater than
-	 advsimd_cycles_per_iter.  Given rounding effects, this should mean
-	 that Advanced SIMD is either better or at least no worse.  */
-      if (sve_cycles_per_iter >= pred_cycles_per_iter)
-	sve_estimate = sve_cycles_per_iter;
-      if (could_use_advsimd && advsimd_cycles_per_iter < sve_estimate)
-	{
-	  /* This ensures that min_cost > orig_body_cost * 2.  */
-	  unsigned int factor
-	    = fractional_cost::scale (1, sve_estimate,
-				      advsimd_cycles_per_iter);
-	  unsigned int min_cost = orig_body_cost * factor + 1;
-	  if (body_cost < min_cost)
-	    {
-	      if (dump_enabled_p ())
-		dump_printf_loc (MSG_NOTE, vect_location,
-				 "Increasing body cost to %d because Advanced"
-				 " SIMD code could issue as quickly\n",
-				 min_cost);
-	      body_cost = min_cost;
-	      should_disparage = true;
-	    }
-	}
+      vector_cycles_per_iter
+	= aarch64_adjust_body_cost_sve (costs, issue_info,
+					scalar_cycles_per_iter,
+					advsimd_cycles_per_iter,
+					could_use_advsimd, orig_body_cost,
+					&body_cost, &should_disparage);
     }
 
   /* Decide whether to stick to latency-based costs or whether to try to


* [PATCH 4/8] aarch64: Add gather_load_xNN_cost tuning fields
  2021-08-03 12:03 [PATCH 0/8] aarch64 vector cost tweaks Richard Sandiford
                   ` (2 preceding siblings ...)
  2021-08-03 12:04 ` [PATCH 3/8] aarch64: Split out aarch64_adjust_body_cost_sve Richard Sandiford
@ 2021-08-03 12:05 ` Richard Sandiford
  2021-08-03 12:05 ` [PATCH 5/8] aarch64: Tweak the cost of elementwise stores Richard Sandiford
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Richard Sandiford @ 2021-08-03 12:05 UTC (permalink / raw)
  To: gcc-patches

This patch adds tuning fields for the total cost of a gather load
instruction.  Until now, we've costed them as one scalar load
per element instead.  Those scalar_load-based values are also
what the patch uses to fill in the new fields for existing
cost structures.
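
For example (an illustrative calculation only, using the neoversev1
numbers added below): a gather load of 32-bit elements at an estimated
vector length of 256 bits covers 8 elements, so each scalar_load that
the vectorizer counts for the gather becomes:

  /* gather_load_x32_cost divided by the number of elements, using the
     fractional_cost class from the previous patch.  */
  fractional_cost per_element_cost (32, 8);   /* i.e. 4 per element */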

gcc/
	* config/aarch64/aarch64-protos.h (sve_vec_cost):
	Add gather_load_x32_cost and gather_load_x64_cost.
	* config/aarch64/aarch64.c (generic_sve_vector_cost)
	(a64fx_sve_vector_cost, neoversev1_sve_vector_cost): Update
	accordingly, using the values given by the scalar_load * number
	of elements calculation that we used previously.
	(aarch64_detect_vector_stmt_subtype): Use the new fields.
---
 gcc/config/aarch64/aarch64-protos.h |  9 +++++++++
 gcc/config/aarch64/aarch64.c        | 19 +++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index fb4ce8e9f84..b91eeeba101 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -259,12 +259,16 @@ struct sve_vec_cost : simd_vec_cost
 			  unsigned int fadda_f16_cost,
 			  unsigned int fadda_f32_cost,
 			  unsigned int fadda_f64_cost,
+			  unsigned int gather_load_x32_cost,
+			  unsigned int gather_load_x64_cost,
 			  unsigned int scatter_store_elt_cost)
     : simd_vec_cost (base),
       clast_cost (clast_cost),
       fadda_f16_cost (fadda_f16_cost),
       fadda_f32_cost (fadda_f32_cost),
       fadda_f64_cost (fadda_f64_cost),
+      gather_load_x32_cost (gather_load_x32_cost),
+      gather_load_x64_cost (gather_load_x64_cost),
       scatter_store_elt_cost (scatter_store_elt_cost)
   {}
 
@@ -279,6 +283,11 @@ struct sve_vec_cost : simd_vec_cost
   const int fadda_f32_cost;
   const int fadda_f64_cost;
 
+  /* The cost of a gather load instruction.  The x32 value is for loads
+     of 32-bit elements and the x64 value is for loads of 64-bit elements.  */
+  const int gather_load_x32_cost;
+  const int gather_load_x64_cost;
+
   /* The per-element cost of a scatter store.  */
   const int scatter_store_elt_cost;
 };
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b14b6f22aec..36f11808916 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -675,6 +675,8 @@ static const sve_vec_cost generic_sve_vector_cost =
   2, /* fadda_f16_cost  */
   2, /* fadda_f32_cost  */
   2, /* fadda_f64_cost  */
+  4, /* gather_load_x32_cost  */
+  2, /* gather_load_x64_cost  */
   1 /* scatter_store_elt_cost  */
 };
 
@@ -744,6 +746,8 @@ static const sve_vec_cost a64fx_sve_vector_cost =
   13, /* fadda_f16_cost  */
   13, /* fadda_f32_cost  */
   13, /* fadda_f64_cost  */
+  64, /* gather_load_x32_cost  */
+  32, /* gather_load_x64_cost  */
   1 /* scatter_store_elt_cost  */
 };
 
@@ -1739,6 +1743,8 @@ static const sve_vec_cost neoversev1_sve_vector_cost =
   19, /* fadda_f16_cost  */
   11, /* fadda_f32_cost  */
   8, /* fadda_f64_cost  */
+  32, /* gather_load_x32_cost  */
+  16, /* gather_load_x64_cost  */
   3 /* scatter_store_elt_cost  */
 };
 
@@ -14958,6 +14964,19 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
       && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
     return simd_costs->store_elt_extra_cost;
 
+  /* Detect SVE gather loads, which are costed as a single scalar_load
+     for each element.  We therefore need to divide the full-instruction
+     cost by the number of elements in the vector.  */
+  if (kind == scalar_load
+      && sve_costs
+      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
+    {
+      unsigned int nunits = vect_nunits_for_cost (vectype);
+      if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
+	return { sve_costs->gather_load_x64_cost, nunits };
+      return { sve_costs->gather_load_x32_cost, nunits };
+    }
+
   /* Detect cases in which a scalar_store is really storing one element
      in a scatter operation.  */
   if (kind == scalar_store


* [PATCH 5/8] aarch64: Tweak the cost of elementwise stores
  2021-08-03 12:03 [PATCH 0/8] aarch64 vector cost tweaks Richard Sandiford
                   ` (3 preceding siblings ...)
  2021-08-03 12:05 ` [PATCH 4/8] aarch64: Add gather_load_xNN_cost tuning fields Richard Sandiford
@ 2021-08-03 12:05 ` Richard Sandiford
  2021-08-04 11:43   ` Richard Biener
  2021-08-03 12:06 ` [PATCH 6/8] aarch64: Tweak MLA vector costs Richard Sandiford
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Richard Sandiford @ 2021-08-03 12:05 UTC (permalink / raw)
  To: gcc-patches

When the vectoriser scalarises a strided store, it counts one
scalar_store for each element plus one vec_to_scalar extraction
for each element.  However, extracting element 0 is free on AArch64,
so it should have zero cost.
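
As a rough illustration (not generated by the patch), a scalarised
four-element strided store of an Advanced SIMD register might end up as:

  str   s0, [x0]           // lane 0: plain FP store, no extract needed
  st1   {v0.s}[1], [x1]    // the remaining lanes need lane-indexed stores
  st1   {v0.s}[2], [x2]
  st1   {v0.s}[3], [x3]

so only count - 1 of the vec_to_scalar extractions carry a real cost.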

I don't have a testcase that requires this for existing -mtune
options, but it becomes more important with a later patch.

gcc/
	* config/aarch64/aarch64.c (aarch64_is_store_elt_extraction): New
	function, split out from...
	(aarch64_detect_vector_stmt_subtype): ...here.
	(aarch64_add_stmt_cost): Treat extracting element 0 as free.
---
 gcc/config/aarch64/aarch64.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 36f11808916..084f8caa0da 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14622,6 +14622,18 @@ aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
     }
 }
 
+/* Return true if an operation of kind KIND for STMT_INFO represents
+   the extraction of an element from a vector in preparation for
+   storing the element to memory.  */
+static bool
+aarch64_is_store_elt_extraction (vect_cost_for_stmt kind,
+				 stmt_vec_info stmt_info)
+{
+  return (kind == vec_to_scalar
+	  && STMT_VINFO_DATA_REF (stmt_info)
+	  && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)));
+}
+
 /* Return true if STMT_INFO represents part of a reduction.  */
 static bool
 aarch64_is_reduction (stmt_vec_info stmt_info)
@@ -14959,9 +14971,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
   /* Detect cases in which vec_to_scalar is describing the extraction of a
      vector element in preparation for a scalar store.  The store itself is
      costed separately.  */
-  if (kind == vec_to_scalar
-      && STMT_VINFO_DATA_REF (stmt_info)
-      && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
+  if (aarch64_is_store_elt_extraction (kind, stmt_info))
     return simd_costs->store_elt_extra_cost;
 
   /* Detect SVE gather loads, which are costed as a single scalar_load
@@ -15382,6 +15392,12 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 	  if (vectype && aarch64_sve_only_stmt_p (stmt_info, vectype))
 	    costs->saw_sve_only_op = true;
 
+	  /* If we scalarize a strided store, the vectorizer costs one
+	     vec_to_scalar for each element.  However, we can store the first
+	     element using an FP store without a separate extract step.  */
+	  if (aarch64_is_store_elt_extraction (kind, stmt_info))
+	    count -= 1;
+
 	  stmt_cost = aarch64_detect_scalar_stmt_subtype
 	    (vinfo, kind, stmt_info, stmt_cost);
 


* [PATCH 6/8] aarch64: Tweak MLA vector costs
  2021-08-03 12:03 [PATCH 0/8] aarch64 vector cost tweaks Richard Sandiford
                   ` (4 preceding siblings ...)
  2021-08-03 12:05 ` [PATCH 5/8] aarch64: Tweak the cost of elementwise stores Richard Sandiford
@ 2021-08-03 12:06 ` Richard Sandiford
  2021-08-04 11:45   ` Richard Biener
  2021-08-03 12:06 ` [PATCH 7/8] aarch64: Restrict issue heuristics to inner vector loop Richard Sandiford
  2021-08-03 12:06 ` [PATCH 8/8] aarch64: Add -mtune=neoverse-512tvb Richard Sandiford
  7 siblings, 1 reply; 17+ messages in thread
From: Richard Sandiford @ 2021-08-03 12:06 UTC (permalink / raw)
  To: gcc-patches

The issue-based vector costs currently assume that a multiply-add
sequence can be implemented using a single instruction.  This is
generally true for scalars (which have a 4-operand instruction)
and SVE (which allows the output to be tied to any input).
However, for Advanced SIMD, multiplying two values and adding
an invariant will end up being a move and an MLA.
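
A sketch of the Advanced SIMD case (illustrative only): for

    acc = invariant + a * b

the loop-invariant accumulator has to be copied before each FMLA, e.g.

  mov   v0.16b, v3.16b          // copy the invariant accumulator
  fmla  v0.4s, v1.4s, v2.4s     // v0 += v1 * v2

whereas scalar FMADD writes a separate destination and SVE can tie the
result to any input (or use MOVPRFX), so those remain a single operation.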

The only target to use the issue-based vector costs is Neoverse V1,
which would generally prefer SVE in this case anyway.  I therefore
don't have a self-contained testcase.  However, the distinction
becomes more important with a later patch.

gcc/
	* config/aarch64/aarch64.c (aarch64_multiply_add_p): Add a vec_flags
	parameter.  Detect cases in which an Advanced SIMD MLA would almost
	certainly require a MOV.
	(aarch64_count_ops): Update accordingly.
---
 gcc/config/aarch64/aarch64.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 084f8caa0da..19045ef6944 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14767,9 +14767,12 @@ aarch64_integer_truncation_p (stmt_vec_info stmt_info)
 
 /* Return true if STMT_INFO is the second part of a two-statement multiply-add
    or multiply-subtract sequence that might be suitable for fusing into a
-   single instruction.  */
+   single instruction.  If VEC_FLAGS is zero, analyze the operation as
+   a scalar one, otherwise analyze it as an operation on vectors with those
+   VEC_* flags.  */
 static bool
-aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
+aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info,
+			unsigned int vec_flags)
 {
   gassign *assign = dyn_cast<gassign *> (stmt_info->stmt);
   if (!assign)
@@ -14797,6 +14800,22 @@ aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
       if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
 	continue;
 
+      if (vec_flags & VEC_ADVSIMD)
+	{
+	  /* Scalar and SVE code can tie the result to any FMLA input (or none,
+	     although that requires a MOVPRFX for SVE).  However, Advanced SIMD
+	     only supports MLA forms, so will require a move if the result
+	     cannot be tied to the accumulator.  The most important case in
+	     which this is true is when the accumulator input is invariant.  */
+	  rhs = gimple_op (assign, 3 - i);
+	  if (TREE_CODE (rhs) != SSA_NAME)
+	    return false;
+	  def_stmt_info = vinfo->lookup_def (rhs);
+	  if (!def_stmt_info
+	      || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def)
+	    return false;
+	}
+
       return true;
     }
   return false;
@@ -15232,7 +15251,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
     }
 
   /* Assume that multiply-adds will become a single operation.  */
-  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info))
+  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info, vec_flags))
     return;
 
   /* When costing scalar statements in vector code, the count already


* [PATCH 7/8] aarch64: Restrict issue heuristics to inner vector loop
  2021-08-03 12:03 [PATCH 0/8] aarch64 vector cost tweaks Richard Sandiford
                   ` (5 preceding siblings ...)
  2021-08-03 12:06 ` [PATCH 6/8] aarch64: Tweak MLA vector costs Richard Sandiford
@ 2021-08-03 12:06 ` Richard Sandiford
  2021-08-03 12:06 ` [PATCH 8/8] aarch64: Add -mtune=neoverse-512tvb Richard Sandiford
  7 siblings, 0 replies; 17+ messages in thread
From: Richard Sandiford @ 2021-08-03 12:06 UTC (permalink / raw)
  To: gcc-patches

The AArch64 vector costs try to take issue rates into account.
However, when vectorising an outer loop, we lumped the inner
and outer operations together, which is somewhat meaningless.
This patch restricts the heuristic to the inner loop.
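
A minimal sketch of the situation (illustrative only):

  /* Outer-loop vectorisation: the outer loop is the one being
     vectorised, but the inner loop's statements execute far more often
     and therefore dominate the issue rate.  */
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j)
      res[i] += x[i][j];

With this patch, only the statements in the inner loop feed the
issue-rate estimate.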

gcc/
	* config/aarch64/aarch64.c (aarch64_add_stmt_cost): Only
	record issue information for operations that occur in the
	innermost loop.
---
 gcc/config/aarch64/aarch64.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 19045ef6944..19625eb048d 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -15392,6 +15392,10 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       fractional_cost stmt_cost
 	= aarch64_builtin_vectorization_cost (kind, vectype, misalign);
 
+      bool in_inner_loop_p = (where == vect_body
+			      && stmt_info
+			      && stmt_in_inner_loop_p (vinfo, stmt_info));
+
       /* Do one-time initialization based on the vinfo.  */
       loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
       bb_vec_info bb_vinfo = dyn_cast<bb_vec_info> (vinfo);
@@ -15438,14 +15442,15 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 	  stmt_cost = aarch64_adjust_stmt_cost (kind, stmt_info, vectype,
 						stmt_cost);
 
-	  /* If we're recording a nonzero vector loop body cost, also estimate
-	     the operations that would need to be issued by all relevant
-	     implementations of the loop.  */
+	  /* If we're recording a nonzero vector loop body cost for the
+	     innermost loop, also estimate the operations that would need
+	     to be issued by all relevant implementations of the loop.  */
 	  auto *issue_info = aarch64_tune_params.vec_costs->issue_info;
 	  if (loop_vinfo
 	      && issue_info
 	      && costs->vec_flags
 	      && where == vect_body
+	      && (!LOOP_VINFO_LOOP (loop_vinfo)->inner || in_inner_loop_p)
 	      && vectype
 	      && stmt_cost != 0)
 	    {
@@ -15489,8 +15494,7 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
       /* Statements in an inner loop relative to the loop being
 	 vectorized are weighted more heavily.  The value here is
 	 arbitrary and could potentially be improved with analysis.  */
-      if (where == vect_body && stmt_info
-	  && stmt_in_inner_loop_p (vinfo, stmt_info))
+      if (in_inner_loop_p)
 	{
 	  gcc_assert (loop_vinfo);
 	  count *= LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo); /*  FIXME  */


* [PATCH 8/8] aarch64: Add -mtune=neoverse-512tvb
  2021-08-03 12:03 [PATCH 0/8] aarch64 vector cost tweaks Richard Sandiford
                   ` (6 preceding siblings ...)
  2021-08-03 12:06 ` [PATCH 7/8] aarch64: Restrict issue heuristics to inner vector loop Richard Sandiford
@ 2021-08-03 12:06 ` Richard Sandiford
  2021-08-17 14:37   ` Richard Sandiford
  7 siblings, 1 reply; 17+ messages in thread
From: Richard Sandiford @ 2021-08-03 12:06 UTC (permalink / raw)
  To: gcc-patches

This patch adds an option to tune for Neoverse cores that have
a total vector bandwidth of 512 bits (4x128 for Advanced SIMD
and a vector-length-dependent equivalent for SVE).  This is intended
to be a compromise between tuning aggressively for a single core like
Neoverse V1 (which can be too narrow) and tuning for AArch64 cores
in general (which can be too wide).

-mcpu=neoverse-512tvb is equivalent to -mcpu=neoverse-v1
-mtune=neoverse-512tvb.
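
For example (command lines for illustration only):

  gcc -O3 -mcpu=neoverse-v1 -mtune=neoverse-512tvb ...
  gcc -O3 -mcpu=neoverse-512tvb ...

both enable the architecture features of Neoverse V1 while costing
vector code for the more generic 512-bit total-vector-bandwidth model.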

gcc/
	* doc/invoke.texi: Document -mtune=neoverse-512tvb and
	-mcpu=neoverse-512tvb.
	* config/aarch64/aarch64-cores.def (neoverse-512tvb): New entry.
	* config/aarch64/aarch64-tune.md: Regenerate.
	* config/aarch64/aarch64.c (neoverse512tvb_sve_vector_cost)
	(neoverse512tvb_sve_issue_info, neoverse512tvb_vec_issue_info)
	(neoverse512tvb_vector_cost, neoverse512tvb_tunings): New structures.
	(aarch64_adjust_body_cost_sve): Handle -mtune=neoverse-512tvb.
	(aarch64_adjust_body_cost): Likewise.
---
 gcc/config/aarch64/aarch64-cores.def |   1 +
 gcc/config/aarch64/aarch64-tune.md   |   2 +-
 gcc/config/aarch64/aarch64.c         | 184 ++++++++++++++++++++++++++-
 gcc/doc/invoke.texi                  |  27 +++-
 4 files changed, 202 insertions(+), 12 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def
index de8fe9bc09b..b2aa1670561 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -139,6 +139,7 @@ AARCH64_CORE("thunderx3t110",  thunderx3t110,  thunderx3t110, 8_3A,  AARCH64_FL_
 /* Arm ('A') cores.  */
 AARCH64_CORE("zeus", zeus, cortexa57, 8_4A,  AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_SVE | AARCH64_FL_RCPC | AARCH64_FL_I8MM | AARCH64_FL_BF16 | AARCH64_FL_F16 | AARCH64_FL_PROFILE | AARCH64_FL_SSBS | AARCH64_FL_RNG, neoversev1, 0x41, 0xd40, -1)
 AARCH64_CORE("neoverse-v1", neoversev1, cortexa57, 8_4A,  AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_SVE | AARCH64_FL_RCPC | AARCH64_FL_I8MM | AARCH64_FL_BF16 | AARCH64_FL_F16 | AARCH64_FL_PROFILE | AARCH64_FL_SSBS | AARCH64_FL_RNG, neoversev1, 0x41, 0xd40, -1)
+AARCH64_CORE("neoverse-512tvb", neoverse512tvb, cortexa57, 8_4A,  AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_SVE | AARCH64_FL_RCPC | AARCH64_FL_I8MM | AARCH64_FL_BF16 | AARCH64_FL_F16 | AARCH64_FL_PROFILE | AARCH64_FL_SSBS | AARCH64_FL_RNG, neoverse512tvb, INVALID_IMP, INVALID_CORE, -1)
 
 /* Qualcomm ('Q') cores. */
 AARCH64_CORE("saphira",     saphira,    saphira,    8_4A,  AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_CRYPTO | AARCH64_FL_RCPC, saphira,   0x51, 0xC01, -1)
diff --git a/gcc/config/aarch64/aarch64-tune.md b/gcc/config/aarch64/aarch64-tune.md
index af66c111da2..e491c29d31a 100644
--- a/gcc/config/aarch64/aarch64-tune.md
+++ b/gcc/config/aarch64/aarch64-tune.md
@@ -1,5 +1,5 @@
 ;; -*- buffer-read-only: t -*-
 ;; Generated automatically by gentune.sh from aarch64-cores.def
 (define_attr "tune"
-	"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,saphira,neoversen2,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82"
+	"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,neoversen2,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82"
 	(const (symbol_ref "((enum attr_tune) aarch64_tune)")))
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 19625eb048d..f80de2ca897 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1842,6 +1842,136 @@ static const struct tune_params neoversev1_tunings =
   &generic_prefetch_tune
 };
 
+static const sve_vec_cost neoverse512tvb_sve_vector_cost =
+{
+  {
+    2, /* int_stmt_cost  */
+    2, /* fp_stmt_cost  */
+    4, /* ld2_st2_permute_cost  */
+    5, /* ld3_st3_permute_cost  */
+    5, /* ld4_st4_permute_cost  */
+    3, /* permute_cost  */
+    /* Theoretically, a reduction involving 15 scalar ADDs could
+       complete in ~5 cycles and would have a cost of 15.  Assume that
+       [SU]ADDV completes in 11 cycles and so give it a cost of 15 + 6.  */
+    21, /* reduc_i8_cost  */
+    /* Likewise for 7 scalar ADDs (~3 cycles) vs. 9: 7 + 6.  */
+    13, /* reduc_i16_cost  */
+    /* Likewise for 3 scalar ADDs (~2 cycles) vs. 8: 3 + 6.  */
+    9, /* reduc_i32_cost  */
+    /* Likewise for 1 scalar ADD (1 cycle) vs. 8: 1 + 7.  */
+    8, /* reduc_i64_cost  */
+    /* Theoretically, a reduction involving 7 scalar FADDs could
+       complete in ~6 cycles and would have a cost of 14.  Assume that
+       FADDV completes in 8 cycles and so give it a cost of 14 + 2.  */
+    16, /* reduc_f16_cost  */
+    /* Likewise for 3 scalar FADDs (~4 cycles) vs. 6: 6 + 2.  */
+    8, /* reduc_f32_cost  */
+    /* Likewise for 1 scalar FADD (2 cycles) vs. 4: 2 + 2.  */
+    4, /* reduc_f64_cost  */
+    2, /* store_elt_extra_cost  */
+    /* This value is just inherited from the Cortex-A57 table.  */
+    8, /* vec_to_scalar_cost  */
+    /* This depends very much on what the scalar value is and
+       where it comes from.  E.g. some constants take two dependent
+       instructions or a load, while others might be moved from a GPR.
+       4 seems to be a reasonable compromise in practice.  */
+    4, /* scalar_to_vec_cost  */
+    4, /* align_load_cost  */
+    4, /* unalign_load_cost  */
+    /* Although stores generally have a latency of 2 and compete for the
+       vector pipes, in practice it's better not to model that.  */
+    1, /* unalign_store_cost  */
+    1  /* store_cost  */
+  },
+  3, /* clast_cost  */
+  10, /* fadda_f16_cost  */
+  6, /* fadda_f32_cost  */
+  4, /* fadda_f64_cost  */
+  /* A strided Advanced SIMD x64 load would take two parallel FP loads
+     (6 cycles) plus an insertion (2 cycles).  Assume a 64-bit SVE gather
+     is 1 cycle more.  The Advanced SIMD version is costed as 2 scalar loads
+     (cost 8) and a vec_construct (cost 2).  Add a full vector operation
+     (cost 2) to that, to avoid the difference being lost in rounding.
+
+     There is no easy comparison between a strided Advanced SIMD x32 load
+     and an SVE 32-bit gather, but cost an SVE 32-bit gather as 1 vector
+     operation more than a 64-bit gather.  */
+  14, /* gather_load_x32_cost  */
+  12, /* gather_load_x64_cost  */
+  3 /* scatter_store_elt_cost  */
+};
+
+static const aarch64_sve_vec_issue_info neoverse512tvb_sve_issue_info =
+{
+  {
+    {
+      3, /* loads_per_cycle  */
+      2, /* stores_per_cycle  */
+      4, /* general_ops_per_cycle  */
+      0, /* fp_simd_load_general_ops  */
+      1 /* fp_simd_store_general_ops  */
+    },
+    2, /* ld2_st2_general_ops  */
+    2, /* ld3_st3_general_ops  */
+    3 /* ld4_st4_general_ops  */
+  },
+  2, /* pred_ops_per_cycle  */
+  2, /* while_pred_ops  */
+  2, /* int_cmp_pred_ops  */
+  1, /* fp_cmp_pred_ops  */
+  1, /* gather_scatter_pair_general_ops  */
+  1 /* gather_scatter_pair_pred_ops  */
+};
+
+static const aarch64_vec_issue_info neoverse512tvb_vec_issue_info =
+{
+  &neoversev1_scalar_issue_info,
+  &neoversev1_advsimd_issue_info,
+  &neoverse512tvb_sve_issue_info
+};
+
+static const struct cpu_vector_cost neoverse512tvb_vector_cost =
+{
+  1, /* scalar_int_stmt_cost  */
+  2, /* scalar_fp_stmt_cost  */
+  4, /* scalar_load_cost  */
+  1, /* scalar_store_cost  */
+  1, /* cond_taken_branch_cost  */
+  1, /* cond_not_taken_branch_cost  */
+  &neoversev1_advsimd_vector_cost, /* advsimd  */
+  &neoverse512tvb_sve_vector_cost, /* sve  */
+  &neoverse512tvb_vec_issue_info /* issue_info  */
+};
+
+static const struct tune_params neoverse512tvb_tunings =
+{
+  &cortexa76_extra_costs,
+  &neoversev1_addrcost_table,
+  &generic_regmove_cost,
+  &neoverse512tvb_vector_cost,
+  &generic_branch_cost,
+  &generic_approx_modes,
+  SVE_128 | SVE_256, /* sve_width  */
+  4, /* memmov_cost  */
+  3, /* issue_rate  */
+  (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */
+  "32:16",	/* function_align.  */
+  "4",		/* jump_align.  */
+  "32:16",	/* loop_align.  */
+  2,	/* int_reassoc_width.  */
+  4,	/* fp_reassoc_width.  */
+  2,	/* vec_reassoc_width.  */
+  2,	/* min_div_recip_mul_sf.  */
+  2,	/* min_div_recip_mul_df.  */
+  0,	/* max_case_values.  */
+  tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
+  (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
+   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
+   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT),	/* tune_flags.  */
+  &generic_prefetch_tune
+};
+
 static const struct tune_params neoversen2_tunings =
 {
   &cortexa76_extra_costs,
@@ -15569,10 +15699,32 @@ aarch64_adjust_body_cost_sve (const aarch64_vector_costs *costs,
 {
   /* Estimate the minimum number of cycles per iteration needed to issue
      non-predicate operations.  */
-  fractional_cost sve_nonpred_cycles_per_iter
+  fractional_cost sve_nonpred_issue_cycles_per_iter
     = aarch64_estimate_min_cycles_per_iter (&costs->sve_ops,
 					    issue_info->sve);
 
+  /* Estimate the minimum number of cycles per iteration needed to rename
+     SVE instructions.
+
+     ??? For now this is done inline rather than via cost tables, since it
+     isn't clear how it should be parameterized for the general case.  */
+  fractional_cost sve_rename_cycles_per_iter = 0;
+  if (issue_info == &neoverse512tvb_vec_issue_info)
+    /* + 1 for an addition.  We've already counted a general op for each
+       store, so we don't need to account for stores separately.  The branch
+       reads no registers and so does not need to be counted either.
+
+       ??? This value is very much on the pessimistic side, but seems to work
+       pretty well in practice.  */
+    sve_rename_cycles_per_iter
+      = { costs->sve_ops.general_ops
+	  + costs->sve_ops.loads
+	  + costs->sve_ops.pred_ops + 1, 5 };
+
+  /* Combine the rename and non-predicate issue limits into a single value.  */
+  fractional_cost sve_nonpred_cycles_per_iter
+    = std::max (sve_nonpred_issue_cycles_per_iter, sve_rename_cycles_per_iter);
+
   /* Separately estimate the minimum number of cycles per iteration needed
      to issue the predicate operations.  */
   fractional_cost sve_pred_issue_cycles_per_iter
@@ -15588,14 +15740,17 @@ aarch64_adjust_body_cost_sve (const aarch64_vector_costs *costs,
       dump_printf_loc (MSG_NOTE, vect_location,
 		       "  estimated cycles per iteration = %f\n",
 		       sve_cycles_per_iter.as_double ());
-      dump_printf_loc (MSG_NOTE, vect_location,
-		       "  estimated cycles per iteration for non-predicate"
-		       " operations = %f\n",
-		       sve_nonpred_cycles_per_iter.as_double ());
       if (costs->sve_ops.pred_ops)
-	dump_printf_loc (MSG_NOTE, vect_location, "  estimated cycles per"
-			 " iteration for predicate operations = %d\n",
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "    predicate issue = %f\n",
 			 sve_pred_issue_cycles_per_iter.as_double ());
+      if (costs->sve_ops.pred_ops || sve_rename_cycles_per_iter)
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "    non-predicate issue = %f\n",
+			 sve_nonpred_issue_cycles_per_iter.as_double ());
+      if (sve_rename_cycles_per_iter)
+	dump_printf_loc (MSG_NOTE, vect_location, "    rename = %f\n",
+			 sve_rename_cycles_per_iter.as_double ());
     }
 
   /* If the scalar version of the loop could issue at least as
@@ -15770,6 +15925,21 @@ aarch64_adjust_body_cost (aarch64_vector_costs *costs, unsigned int body_cost)
 					advsimd_cycles_per_iter,
 					could_use_advsimd, orig_body_cost,
 					&body_cost, &should_disparage);
+
+      if (aarch64_tune_params.vec_costs == &neoverse512tvb_vector_cost)
+	{
+	  /* Also take Neoverse V1 tuning into account, doubling the
+	     scalar and Advanced SIMD estimates to account for the
+	     doubling in SVE vector length.  */
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "Neoverse V1 estimate:\n");
+	  aarch64_adjust_body_cost_sve (costs, &neoversev1_vec_issue_info,
+					scalar_cycles_per_iter * 2,
+					advsimd_cycles_per_iter * 2,
+					could_use_advsimd, orig_body_cost,
+					&body_cost, &should_disparage);
+	}
     }
 
   /* Decide whether to stick to latency-based costs or whether to try to
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 32697e6117c..65bb9981f02 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -18407,10 +18407,10 @@ performance of the code.  Permissible values for this option are:
 @samp{cortex-a65}, @samp{cortex-a65ae}, @samp{cortex-a34},
 @samp{cortex-a78}, @samp{cortex-a78ae}, @samp{cortex-a78c},
 @samp{ares}, @samp{exynos-m1}, @samp{emag}, @samp{falkor},
-@samp{neoverse-e1}, @samp{neoverse-n1}, @samp{neoverse-n2},
-@samp{neoverse-v1}, @samp{qdf24xx}, @samp{saphira},
-@samp{phecda}, @samp{xgene1}, @samp{vulcan}, @samp{octeontx},
-@samp{octeontx81},  @samp{octeontx83},
+@samp{neoverse-512tvb}, @samp{neoverse-e1}, @samp{neoverse-n1},
+@samp{neoverse-n2}, @samp{neoverse-v1}, @samp{qdf24xx},
+@samp{saphira}, @samp{phecda}, @samp{xgene1}, @samp{vulcan},
+@samp{octeontx}, @samp{octeontx81},  @samp{octeontx83},
 @samp{octeontx2}, @samp{octeontx2t98}, @samp{octeontx2t96}
 @samp{octeontx2t93}, @samp{octeontx2f95}, @samp{octeontx2f95n},
 @samp{octeontx2f95mm},
@@ -18428,6 +18428,15 @@ The values @samp{cortex-a57.cortex-a53}, @samp{cortex-a72.cortex-a53},
 @samp{cortex-a75.cortex-a55}, @samp{cortex-a76.cortex-a55} specify that GCC
 should tune for a big.LITTLE system.
 
+The value @samp{neoverse-512tvb} specifies that GCC should tune
+for Neoverse cores that (a) implement SVE and (b) have a total vector
+bandwidth of 512 bits per cycle.  In other words, the option tells GCC to
+tune for Neoverse cores that can execute 4 128-bit Advanced SIMD arithmetic
+instructions a cycle and that can execute an equivalent number of SVE
+arithmetic instructions per cycle (2 for 256-bit SVE, 4 for 128-bit SVE).
+This is more general than tuning for a specific core like Neoverse V1
+but is more specific than the default tuning described below.
+
 Additionally on native AArch64 GNU/Linux systems the value
 @samp{native} tunes performance to the host system.  This option has no effect
 if the compiler is unable to recognize the processor of the host system.
@@ -18457,6 +18466,16 @@ by @option{-mtune}).  Where this option is used in conjunction
 with @option{-march} or @option{-mtune}, those options take precedence
 over the appropriate part of this option.
 
+@option{-mcpu=neoverse-512tvb} is special in that it does not refer
+to a specific core, but instead refers to all Neoverse cores that
+(a) implement SVE and (b) have a total vector bandwidth of 512 bits
+a cycle.  Unless overridden by @option{-march},
+@option{-mcpu=neoverse-512tvb} generates code that can run on a
+Neoverse V1 core, since Neoverse V1 is the first Neoverse core with
+these properties.  Unless overridden by @option{-mtune},
+@option{-mcpu=neoverse-512tvb} tunes code in the same way as for
+@option{-mtune=neoverse-512tvb}.
+
 @item -moverride=@var{string}
 @opindex moverride
 Override tuning decisions made by the back-end in response to a

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/8] aarch64: Tweak the cost of elementwise stores
  2021-08-03 12:05 ` [PATCH 5/8] aarch64: Tweak the cost of elementwise stores Richard Sandiford
@ 2021-08-04 11:43   ` Richard Biener
  2021-08-05 12:04     ` [PATCH] vect: Move costing helpers from aarch64 code Richard Sandiford
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Biener @ 2021-08-04 11:43 UTC (permalink / raw)
  To: Richard Sandiford, GCC Patches

On Tue, Aug 3, 2021 at 2:09 PM Richard Sandiford via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> When the vectoriser scalarises a strided store, it counts one
> scalar_store for each element plus one vec_to_scalar extraction
> for each element.  However, extracting element 0 is free on AArch64,
> so it should have zero cost.
>
> I don't have a testcase that requires this for existing -mtune
> options, but it becomes more important with a later patch.
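
As a concrete, made-up illustration of the strided-store case described
above, consider:

  void
  f (float *restrict dst, float *restrict src, int n)
  {
    for (int i = 0; i < n; ++i)
      dst[i * 4] = src[i];
  }

If the strided store is scalarised, the current costing counts one
vec_to_scalar extraction and one scalar_store per element (so four of
each per vector iteration for a 4-element vector); with this change the
vec_to_scalar for element 0 becomes free.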
>
> gcc/
>         * config/aarch64/aarch64.c (aarch64_is_store_elt_extraction): New
>         function, split out from...
>         (aarch64_detect_vector_stmt_subtype): ...here.
>         (aarch64_add_stmt_cost): Treat extracting element 0 as free.
> ---
>  gcc/config/aarch64/aarch64.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 36f11808916..084f8caa0da 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -14622,6 +14622,18 @@ aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>      }
>  }
>
> +/* Return true if an operaton of kind KIND for STMT_INFO represents
> +   the extraction of an element from a vector in preparation for
> +   storing the element to memory.  */
> +static bool
> +aarch64_is_store_elt_extraction (vect_cost_for_stmt kind,
> +                                stmt_vec_info stmt_info)
> +{
> +  return (kind == vec_to_scalar
> +         && STMT_VINFO_DATA_REF (stmt_info)
> +         && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)));
> +}

It would be nice to put functions like this in tree-vectorizer.h in some
section marked with a comment to contain helpers for the target
add_stmt_cost.

>  /* Return true if STMT_INFO represents part of a reduction.  */
>  static bool
>  aarch64_is_reduction (stmt_vec_info stmt_info)
> @@ -14959,9 +14971,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>    /* Detect cases in which vec_to_scalar is describing the extraction of a
>       vector element in preparation for a scalar store.  The store itself is
>       costed separately.  */
> -  if (kind == vec_to_scalar
> -      && STMT_VINFO_DATA_REF (stmt_info)
> -      && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
> +  if (aarch64_is_store_elt_extraction (kind, stmt_info))
>      return simd_costs->store_elt_extra_cost;
>
>    /* Detect SVE gather loads, which are costed as a single scalar_load
> @@ -15382,6 +15392,12 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>           if (vectype && aarch64_sve_only_stmt_p (stmt_info, vectype))
>             costs->saw_sve_only_op = true;
>
> +         /* If we scalarize a strided store, the vectorizer costs one
> +            vec_to_scalar for each element.  However, we can store the first
> +            element using an FP store without a separate extract step.  */
> +         if (aarch64_is_store_elt_extraction (kind, stmt_info))
> +           count -= 1;
> +
>           stmt_cost = aarch64_detect_scalar_stmt_subtype
>             (vinfo, kind, stmt_info, stmt_cost);
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/8] aarch64: Tweak MLA vector costs
  2021-08-03 12:06 ` [PATCH 6/8] aarch64: Tweak MLA vector costs Richard Sandiford
@ 2021-08-04 11:45   ` Richard Biener
  2021-08-04 12:14     ` Richard Sandiford
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Biener @ 2021-08-04 11:45 UTC (permalink / raw)
  To: Richard Sandiford, GCC Patches

On Tue, Aug 3, 2021 at 2:10 PM Richard Sandiford via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> The issue-based vector costs currently assume that a multiply-add
> sequence can be implemented using a single instruction.  This is
> generally true for scalars (which have a 4-operand instruction)
> and SVE (which allows the output to be tied to any input).
> However, for Advanced SIMD, multiplying two values and adding
> an invariant will end up being a move and an MLA.
>
> The only target to use the issue-based vector costs is Neoverse V1,
> which would generally prefer SVE in this case anyway.  I therefore
> don't have a self-contained testcase.  However, the distinction
> becomes more important with a later patch.

But we do cost any invariants separately (for the prologue), so they
should be available in a register.  Why doesn't that work?

> gcc/
>         * config/aarch64/aarch64.c (aarch64_multiply_add_p): Add a vec_flags
>         parameter.  Detect cases in which an Advanced SIMD MLA would almost
>         certainly require a MOV.
>         (aarch64_count_ops): Update accordingly.
> ---
>  gcc/config/aarch64/aarch64.c | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 084f8caa0da..19045ef6944 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -14767,9 +14767,12 @@ aarch64_integer_truncation_p (stmt_vec_info stmt_info)
>
>  /* Return true if STMT_INFO is the second part of a two-statement multiply-add
>     or multiply-subtract sequence that might be suitable for fusing into a
> -   single instruction.  */
> +   single instruction.  If VEC_FLAGS is zero, analyze the operation as
> +   a scalar one, otherwise analyze it as an operation on vectors with those
> +   VEC_* flags.  */
>  static bool
> -aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
> +aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info,
> +                       unsigned int vec_flags)
>  {
>    gassign *assign = dyn_cast<gassign *> (stmt_info->stmt);
>    if (!assign)
> @@ -14797,6 +14800,22 @@ aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
>        if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
>         continue;
>
> +      if (vec_flags & VEC_ADVSIMD)
> +       {
> +         /* Scalar and SVE code can tie the result to any FMLA input (or none,
> +            although that requires a MOVPRFX for SVE).  However, Advanced SIMD
> +            only supports MLA forms, so will require a move if the result
> +            cannot be tied to the accumulator.  The most important case in
> +            which this is true is when the accumulator input is invariant.  */
> +         rhs = gimple_op (assign, 3 - i);
> +         if (TREE_CODE (rhs) != SSA_NAME)
> +           return false;
> +         def_stmt_info = vinfo->lookup_def (rhs);
> +         if (!def_stmt_info
> +             || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def)
> +           return false;
> +       }
> +
>        return true;
>      }
>    return false;
> @@ -15232,7 +15251,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
>      }
>
>    /* Assume that multiply-adds will become a single operation.  */
> -  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info))
> +  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info, vec_flags))
>      return;
>
>    /* When costing scalar statements in vector code, the count already

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/8] aarch64: Tweak MLA vector costs
  2021-08-04 11:45   ` Richard Biener
@ 2021-08-04 12:14     ` Richard Sandiford
  2021-08-04 12:19       ` Richard Sandiford
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Sandiford @ 2021-08-04 12:14 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Aug 3, 2021 at 2:10 PM Richard Sandiford via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> The issue-based vector costs currently assume that a multiply-add
>> sequence can be implemented using a single instruction.  This is
>> generally true for scalars (which have a 4-operand instruction)
>> and SVE (which allows the output to be tied to any input).
>> However, for Advanced SIMD, multiplying two values and adding
>> an invariant will end up being a move and an MLA.
>>
>> The only target to use the issue-based vector costs is Neoverse V1,
>> which would generally prefer SVE in this case anyway.  I therefore
>> don't have a self-contained testcase.  However, the distinction
>> becomes more important with a later patch.
>
> But we do cost any invariants separately (for the prologue), so they
> should be available in a register.  Why doesn't that work?

Yeah, that works, and the prologue part is costed correctly.  But the
result of an Advanced SIMD FMLA is tied to the accumulator input, so if
the accumulator input is an invariant, we need a register move (in the
loop body) before the FMLA.

E.g. for:

void
f (float *restrict x, float *restrict y, float *restrict z, float a)
{
  for (int i = 0; i < 100; ++i)
    x[i] = y[i] * z[i];
}

the scalar code is:

.L2:
        ldr     s1, [x1, x3]
        ldr     s2, [x2, x3]
        fmadd   s1, s1, s2, s0
        str     s1, [x0, x3]
        add     x3, x3, 4
        cmp     x3, 400
        bne     .L2

the SVE code is:

.L2:
        ld1w    z1.s, p0/z, [x1, x3, lsl 2]
        ld1w    z0.s, p0/z, [x2, x3, lsl 2]
        fmad    z0.s, p1/m, z1.s, z2.s
        st1w    z0.s, p0, [x0, x3, lsl 2]
        add     x3, x3, x5
        whilelo p0.s, w3, w4
        b.any   .L2

but the Advanced SIMD code is:

.L2:
        mov     v0.16b, v3.16b   // <-------- boo
        ldr     q2, [x2, x3]
        ldr     q1, [x1, x3]
        fmla    v0.4s, v2.4s, v1.4s
        str     q0, [x0, x3]
        add     x3, x3, 16
        cmp     x3, 400
        bne     .L2

Thanks,
Richard


>
>> gcc/
>>         * config/aarch64/aarch64.c (aarch64_multiply_add_p): Add a vec_flags
>>         parameter.  Detect cases in which an Advanced SIMD MLA would almost
>>         certainly require a MOV.
>>         (aarch64_count_ops): Update accordingly.
>> ---
>>  gcc/config/aarch64/aarch64.c | 25 ++++++++++++++++++++++---
>>  1 file changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index 084f8caa0da..19045ef6944 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -14767,9 +14767,12 @@ aarch64_integer_truncation_p (stmt_vec_info stmt_info)
>>
>>  /* Return true if STMT_INFO is the second part of a two-statement multiply-add
>>     or multiply-subtract sequence that might be suitable for fusing into a
>> -   single instruction.  */
>> +   single instruction.  If VEC_FLAGS is zero, analyze the operation as
>> +   a scalar one, otherwise analyze it as an operation on vectors with those
>> +   VEC_* flags.  */
>>  static bool
>> -aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
>> +aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info,
>> +                       unsigned int vec_flags)
>>  {
>>    gassign *assign = dyn_cast<gassign *> (stmt_info->stmt);
>>    if (!assign)
>> @@ -14797,6 +14800,22 @@ aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
>>        if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
>>         continue;
>>
>> +      if (vec_flags & VEC_ADVSIMD)
>> +       {
>> +         /* Scalar and SVE code can tie the result to any FMLA input (or none,
>> +            although that requires a MOVPRFX for SVE).  However, Advanced SIMD
>> +            only supports MLA forms, so will require a move if the result
>> +            cannot be tied to the accumulator.  The most important case in
>> +            which this is true is when the accumulator input is invariant.  */
>> +         rhs = gimple_op (assign, 3 - i);
>> +         if (TREE_CODE (rhs) != SSA_NAME)
>> +           return false;
>> +         def_stmt_info = vinfo->lookup_def (rhs);
>> +         if (!def_stmt_info
>> +             || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def)
>> +           return false;
>> +       }
>> +
>>        return true;
>>      }
>>    return false;
>> @@ -15232,7 +15251,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
>>      }
>>
>>    /* Assume that multiply-adds will become a single operation.  */
>> -  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info))
>> +  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info, vec_flags))
>>      return;
>>
>>    /* When costing scalar statements in vector code, the count already

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/8] aarch64: Tweak MLA vector costs
  2021-08-04 12:14     ` Richard Sandiford
@ 2021-08-04 12:19       ` Richard Sandiford
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Sandiford @ 2021-08-04 12:19 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Tue, Aug 3, 2021 at 2:10 PM Richard Sandiford via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>>>
>>> The issue-based vector costs currently assume that a multiply-add
>>> sequence can be implemented using a single instruction.  This is
>>> generally true for scalars (which have a 4-operand instruction)
>>> and SVE (which allows the output to be tied to any input).
>>> However, for Advanced SIMD, multiplying two values and adding
>>> an invariant will end up being a move and an MLA.
>>>
>>> The only target to use the issue-based vector costs is Neoverse V1,
>>> which would generally prefer SVE in this case anyway.  I therefore
>>> don't have a self-contained testcase.  However, the distinction
>>> becomes more important with a later patch.
>>
>> But we do cost any invariants separately (for the prologue), so they
>> should be available in a register.  Why doesn't that work?
>
> Yeah, that works, and the prologue part is costed correctly.  But the
> result of an Advanced SIMD FMLA is tied to the accumulator input, so if
> the accumulator input is an invariant, we need a register move (in the
> loop body) before the FMLA.
>
> E.g. for:
>
> void
> f (float *restrict x, float *restrict y, float *restrict z, float a)
> {
>   for (int i = 0; i < 100; ++i)
>     x[i] = y[i] * z[i];

+ 1.0, not sure where that went.

> }
>
> the scalar code is:
>
> .L2:
>         ldr     s1, [x1, x3]
>         ldr     s2, [x2, x3]
>         fmadd   s1, s1, s2, s0
>         str     s1, [x0, x3]
>         add     x3, x3, 4
>         cmp     x3, 400
>         bne     .L2
>
> the SVE code is:
>
> .L2:
>         ld1w    z1.s, p0/z, [x1, x3, lsl 2]
>         ld1w    z0.s, p0/z, [x2, x3, lsl 2]
>         fmad    z0.s, p1/m, z1.s, z2.s
>         st1w    z0.s, p0, [x0, x3, lsl 2]
>         add     x3, x3, x5
>         whilelo p0.s, w3, w4
>         b.any   .L2
>
> but the Advanced SIMD code is:
>
> .L2:
>         mov     v0.16b, v3.16b   // <-------- boo
>         ldr     q2, [x2, x3]
>         ldr     q1, [x1, x3]
>         fmla    v0.4s, v2.4s, v1.4s
>         str     q0, [x0, x3]
>         add     x3, x3, 16
>         cmp     x3, 400
>         bne     .L2
>
> Thanks,
> Richard
>
>
>>
>>> gcc/
>>>         * config/aarch64/aarch64.c (aarch64_multiply_add_p): Add a vec_flags
>>>         parameter.  Detect cases in which an Advanced SIMD MLA would almost
>>>         certainly require a MOV.
>>>         (aarch64_count_ops): Update accordingly.
>>> ---
>>>  gcc/config/aarch64/aarch64.c | 25 ++++++++++++++++++++++---
>>>  1 file changed, 22 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>> index 084f8caa0da..19045ef6944 100644
>>> --- a/gcc/config/aarch64/aarch64.c
>>> +++ b/gcc/config/aarch64/aarch64.c
>>> @@ -14767,9 +14767,12 @@ aarch64_integer_truncation_p (stmt_vec_info stmt_info)
>>>
>>>  /* Return true if STMT_INFO is the second part of a two-statement multiply-add
>>>     or multiply-subtract sequence that might be suitable for fusing into a
>>> -   single instruction.  */
>>> +   single instruction.  If VEC_FLAGS is zero, analyze the operation as
>>> +   a scalar one, otherwise analyze it as an operation on vectors with those
>>> +   VEC_* flags.  */
>>>  static bool
>>> -aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
>>> +aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info,
>>> +                       unsigned int vec_flags)
>>>  {
>>>    gassign *assign = dyn_cast<gassign *> (stmt_info->stmt);
>>>    if (!assign)
>>> @@ -14797,6 +14800,22 @@ aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info)
>>>        if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
>>>         continue;
>>>
>>> +      if (vec_flags & VEC_ADVSIMD)
>>> +       {
>>> +         /* Scalar and SVE code can tie the result to any FMLA input (or none,
>>> +            although that requires a MOVPRFX for SVE).  However, Advanced SIMD
>>> +            only supports MLA forms, so will require a move if the result
>>> +            cannot be tied to the accumulator.  The most important case in
>>> +            which this is true is when the accumulator input is invariant.  */
>>> +         rhs = gimple_op (assign, 3 - i);
>>> +         if (TREE_CODE (rhs) != SSA_NAME)
>>> +           return false;
>>> +         def_stmt_info = vinfo->lookup_def (rhs);
>>> +         if (!def_stmt_info
>>> +             || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def)
>>> +           return false;
>>> +       }
>>> +
>>>        return true;
>>>      }
>>>    return false;
>>> @@ -15232,7 +15251,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
>>>      }
>>>
>>>    /* Assume that multiply-adds will become a single operation.  */
>>> -  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info))
>>> +  if (stmt_info && aarch64_multiply_add_p (vinfo, stmt_info, vec_flags))
>>>      return;
>>>
>>>    /* When costing scalar statements in vector code, the count already

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] vect: Move costing helpers from aarch64 code
  2021-08-04 11:43   ` Richard Biener
@ 2021-08-05 12:04     ` Richard Sandiford
  2021-08-05 12:11       ` Richard Biener
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Sandiford @ 2021-08-05 12:04 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Tue, Aug 3, 2021 at 2:09 PM Richard Sandiford via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> When the vectoriser scalarises a strided store, it counts one
>> scalar_store for each element plus one vec_to_scalar extraction
>> for each element.  However, extracting element 0 is free on AArch64,
>> so it should have zero cost.
>>
>> I don't have a testcase that requires this for existing -mtune
>> options, but it becomes more important with a later patch.
>>
>> gcc/
>>         * config/aarch64/aarch64.c (aarch64_is_store_elt_extraction): New
>>         function, split out from...
>>         (aarch64_detect_vector_stmt_subtype): ...here.
>>         (aarch64_add_stmt_cost): Treat extracting element 0 as free.
>> ---
>>  gcc/config/aarch64/aarch64.c | 22 +++++++++++++++++++---
>>  1 file changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index 36f11808916..084f8caa0da 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -14622,6 +14622,18 @@ aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>>      }
>>  }
>>
>> +/* Return true if an operaton of kind KIND for STMT_INFO represents
>> +   the extraction of an element from a vector in preparation for
>> +   storing the element to memory.  */
>> +static bool
>> +aarch64_is_store_elt_extraction (vect_cost_for_stmt kind,
>> +                                stmt_vec_info stmt_info)
>> +{
>> +  return (kind == vec_to_scalar
>> +         && STMT_VINFO_DATA_REF (stmt_info)
>> +         && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)));
>> +}
>
> It would be nice to put functions like this in tree-vectorizer.h in some
> section marked with a comment to contain helpers for the target
> add_stmt_cost.

Yeah, I guess that would avoid pointless cut-&-paste between targets.
How does this look?  Tested on aarch64-linux-gnu and x86_64-linux-gnu.
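
For example, with these helpers in place, a target's add_stmt_cost hook
could do something along the following lines.  This is a hypothetical
sketch rather than part of the patch: the function name and the
particular cost decisions are invented purely for illustration.

/* Adjust STMT_COST for an operation of kind KIND on STMT_INFO,
   using the new tree-vectorizer.h helpers.  */
static unsigned
example_adjust_stmt_cost (vec_info *vinfo, vect_cost_for_stmt kind,
			  stmt_vec_info stmt_info, unsigned stmt_cost)
{
  /* The hook can be called without a stmt_info; keep the default then.  */
  if (!stmt_info)
    return stmt_cost;

  /* Extracting lane 0 in preparation for a scalar store is free on
     some targets.  */
  if (vect_is_store_elt_extraction (kind, stmt_info))
    return 0;

  /* Assume a load and its sign/zero extension fuse into one access.  */
  if (kind == vector_stmt && vect_is_extending_load (vinfo, stmt_info))
    return 0;

  /* Charge strictly in-order (fold-left) reductions more heavily.  */
  if (vect_is_reduction (stmt_info)
      && vect_reduc_type (vinfo, stmt_info) == FOLD_LEFT_REDUCTION)
    return stmt_cost * 2;

  return stmt_cost;
}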

Thanks,
Richard


gcc/
	* tree-vectorizer.h (vect_is_store_elt_extraction, vect_is_reduction)
	(vect_reduc_type, vect_embedded_comparison_type, vect_comparison_type)
	(vect_is_extending_load, vect_is_integer_truncation): New functions,
	moved from aarch64.c but given different names.
	* config/aarch64/aarch64.c (aarch64_is_store_elt_extraction)
	(aarch64_is_reduction, aarch64_reduc_type)
	(aarch64_embedded_comparison_type, aarch64_comparison_type)
	(aarch64_extending_load_p, aarch64_integer_truncation_p): Delete
	in favor of the above.  Update callers accordingly.
---
 gcc/config/aarch64/aarch64.c | 125 ++++-------------------------------
 gcc/tree-vectorizer.h        | 104 +++++++++++++++++++++++++++++
 2 files changed, 118 insertions(+), 111 deletions(-)

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index deb22477e28..fd8681747ca 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2192,4 +2192,108 @@ extern vect_pattern_decl_t slp_patterns[];
 /* Number of supported pattern matchers.  */
 extern size_t num__slp_patterns;
 
+/* ----------------------------------------------------------------------
+   Target support routines
+   -----------------------------------------------------------------------
+   The following routines are provided to simplify costing decisions in
+   target code.  Please add more as needed.  */
+
+/* Return true if an operation of kind KIND for STMT_INFO represents
+   the extraction of an element from a vector in preparation for
+   storing the element to memory.  */
+inline bool
+vect_is_store_elt_extraction (vect_cost_for_stmt kind, stmt_vec_info stmt_info)
+{
+  return (kind == vec_to_scalar
+	  && STMT_VINFO_DATA_REF (stmt_info)
+	  && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)));
+}
+
+/* Return true if STMT_INFO represents part of a reduction.  */
+inline bool
+vect_is_reduction (stmt_vec_info stmt_info)
+{
+  return (STMT_VINFO_REDUC_DEF (stmt_info)
+	  || VECTORIZABLE_CYCLE_DEF (STMT_VINFO_DEF_TYPE (stmt_info)));
+}
+
+/* If STMT_INFO describes a reduction, return the type of reduction
+   it describes, otherwise return -1.  */
+inline int
+vect_reduc_type (vec_info *vinfo, stmt_vec_info stmt_info)
+{
+  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
+    if (STMT_VINFO_REDUC_DEF (stmt_info))
+      {
+	stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info);
+	return int (STMT_VINFO_REDUC_TYPE (reduc_info));
+      }
+  return -1;
+}
+
+/* If STMT_INFO is a COND_EXPR that includes an embedded comparison, return the
+   scalar type of the values being compared.  Return null otherwise.  */
+inline tree
+vect_embedded_comparison_type (stmt_vec_info stmt_info)
+{
+  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
+    if (gimple_assign_rhs_code (assign) == COND_EXPR)
+      {
+	tree cond = gimple_assign_rhs1 (assign);
+	if (COMPARISON_CLASS_P (cond))
+	  return TREE_TYPE (TREE_OPERAND (cond, 0));
+      }
+  return NULL_TREE;
+}
+
+/* If STMT_INFO is a comparison or contains an embedded comparison, return the
+   scalar type of the values being compared.  Return null otherwise.  */
+inline tree
+vect_comparison_type (stmt_vec_info stmt_info)
+{
+  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
+    if (TREE_CODE_CLASS (gimple_assign_rhs_code (assign)) == tcc_comparison)
+      return TREE_TYPE (gimple_assign_rhs1 (assign));
+  return vect_embedded_comparison_type (stmt_info);
+}
+
+/* Return true if STMT_INFO extends the result of a load.  */
+inline bool
+vect_is_extending_load (class vec_info *vinfo, stmt_vec_info stmt_info)
+{
+  /* Although this is quite large for an inline function, this part
+     at least should be inline.  */
+  gassign *assign = dyn_cast <gassign *> (stmt_info->stmt);
+  if (!assign || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (assign)))
+    return false;
+
+  tree rhs = gimple_assign_rhs1 (stmt_info->stmt);
+  tree lhs_type = TREE_TYPE (gimple_assign_lhs (assign));
+  tree rhs_type = TREE_TYPE (rhs);
+  if (!INTEGRAL_TYPE_P (lhs_type)
+      || !INTEGRAL_TYPE_P (rhs_type)
+      || TYPE_PRECISION (lhs_type) <= TYPE_PRECISION (rhs_type))
+    return false;
+
+  stmt_vec_info def_stmt_info = vinfo->lookup_def (rhs);
+  return (def_stmt_info
+	  && STMT_VINFO_DATA_REF (def_stmt_info)
+	  && DR_IS_READ (STMT_VINFO_DATA_REF (def_stmt_info)));
+}
+
+/* Return true if STMT_INFO is an integer truncation.  */
+inline bool
+vect_is_integer_truncation (stmt_vec_info stmt_info)
+{
+  gassign *assign = dyn_cast <gassign *> (stmt_info->stmt);
+  if (!assign || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (assign)))
+    return false;
+
+  tree lhs_type = TREE_TYPE (gimple_assign_lhs (assign));
+  tree rhs_type = TREE_TYPE (gimple_assign_rhs1 (assign));
+  return (INTEGRAL_TYPE_P (lhs_type)
+	  && INTEGRAL_TYPE_P (rhs_type)
+	  && TYPE_PRECISION (lhs_type) < TYPE_PRECISION (rhs_type));
+}
+
 #endif  /* GCC_TREE_VECTORIZER_H  */
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index e02cbcbcb38..a4456a86764 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -14790,40 +14790,6 @@ aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
     }
 }
 
-/* Return true if an operaton of kind KIND for STMT_INFO represents
-   the extraction of an element from a vector in preparation for
-   storing the element to memory.  */
-static bool
-aarch64_is_store_elt_extraction (vect_cost_for_stmt kind,
-				 stmt_vec_info stmt_info)
-{
-  return (kind == vec_to_scalar
-	  && STMT_VINFO_DATA_REF (stmt_info)
-	  && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)));
-}
-
-/* Return true if STMT_INFO represents part of a reduction.  */
-static bool
-aarch64_is_reduction (stmt_vec_info stmt_info)
-{
-  return (STMT_VINFO_REDUC_DEF (stmt_info)
-	  || VECTORIZABLE_CYCLE_DEF (STMT_VINFO_DEF_TYPE (stmt_info)));
-}
-
-/* If STMT_INFO describes a reduction, return the type of reduction
-   it describes, otherwise return -1.  */
-static int
-aarch64_reduc_type (vec_info *vinfo, stmt_vec_info stmt_info)
-{
-  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
-    if (STMT_VINFO_REDUC_DEF (stmt_info))
-      {
-	stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info);
-	return int (STMT_VINFO_REDUC_TYPE (reduc_info));
-      }
-  return -1;
-}
-
 /* Return true if an access of kind KIND for STMT_INFO represents one
    vector of an LD[234] or ST[234] operation.  Return the total number of
    vectors (2, 3 or 4) if so, otherwise return a value outside that range.  */
@@ -14844,32 +14810,6 @@ aarch64_ld234_st234_vectors (vect_cost_for_stmt kind, stmt_vec_info stmt_info)
   return 0;
 }
 
-/* If STMT_INFO is a COND_EXPR that includes an embedded comparison, return the
-   scalar type of the values being compared.  Return null otherwise.  */
-static tree
-aarch64_embedded_comparison_type (stmt_vec_info stmt_info)
-{
-  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
-    if (gimple_assign_rhs_code (assign) == COND_EXPR)
-      {
-	tree cond = gimple_assign_rhs1 (assign);
-	if (COMPARISON_CLASS_P (cond))
-	  return TREE_TYPE (TREE_OPERAND (cond, 0));
-      }
-  return NULL_TREE;
-}
-
-/* If STMT_INFO is a comparison or contains an embedded comparison, return the
-   scalar type of the values being compared.  Return null otherwise.  */
-static tree
-aarch64_comparison_type (stmt_vec_info stmt_info)
-{
-  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
-    if (TREE_CODE_CLASS (gimple_assign_rhs_code (assign)) == tcc_comparison)
-      return TREE_TYPE (gimple_assign_rhs1 (assign));
-  return aarch64_embedded_comparison_type (stmt_info);
-}
-
 /* Return true if creating multiple copies of STMT_INFO for Advanced SIMD
    vectors would produce a series of LDP or STP operations.  KIND is the
    kind of statement that STMT_INFO represents.  */
@@ -14896,43 +14836,6 @@ aarch64_advsimd_ldp_stp_p (enum vect_cost_for_stmt kind,
   return is_gimple_assign (stmt_info->stmt);
 }
 
-/* Return true if STMT_INFO extends the result of a load.  */
-static bool
-aarch64_extending_load_p (class vec_info *vinfo, stmt_vec_info stmt_info)
-{
-  gassign *assign = dyn_cast <gassign *> (stmt_info->stmt);
-  if (!assign || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (assign)))
-    return false;
-
-  tree rhs = gimple_assign_rhs1 (stmt_info->stmt);
-  tree lhs_type = TREE_TYPE (gimple_assign_lhs (assign));
-  tree rhs_type = TREE_TYPE (rhs);
-  if (!INTEGRAL_TYPE_P (lhs_type)
-      || !INTEGRAL_TYPE_P (rhs_type)
-      || TYPE_PRECISION (lhs_type) <= TYPE_PRECISION (rhs_type))
-    return false;
-
-  stmt_vec_info def_stmt_info = vinfo->lookup_def (rhs);
-  return (def_stmt_info
-	  && STMT_VINFO_DATA_REF (def_stmt_info)
-	  && DR_IS_READ (STMT_VINFO_DATA_REF (def_stmt_info)));
-}
-
-/* Return true if STMT_INFO is an integer truncation.  */
-static bool
-aarch64_integer_truncation_p (stmt_vec_info stmt_info)
-{
-  gassign *assign = dyn_cast <gassign *> (stmt_info->stmt);
-  if (!assign || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (assign)))
-    return false;
-
-  tree lhs_type = TREE_TYPE (gimple_assign_lhs (assign));
-  tree rhs_type = TREE_TYPE (gimple_assign_rhs1 (assign));
-  return (INTEGRAL_TYPE_P (lhs_type)
-	  && INTEGRAL_TYPE_P (rhs_type)
-	  && TYPE_PRECISION (lhs_type) < TYPE_PRECISION (rhs_type));
-}
-
 /* Return true if STMT_INFO is the second part of a two-statement multiply-add
    or multiply-subtract sequence that might be suitable for fusing into a
    single instruction.  If VEC_FLAGS is zero, analyze the operation as
@@ -15035,7 +14938,7 @@ aarch64_sve_in_loop_reduction_latency (vec_info *vinfo,
 				       tree vectype,
 				       const sve_vec_cost *sve_costs)
 {
-  switch (aarch64_reduc_type (vinfo, stmt_info))
+  switch (vect_reduc_type (vinfo, stmt_info))
     {
     case EXTRACT_LAST_REDUCTION:
       return sve_costs->clast_cost;
@@ -15126,7 +15029,7 @@ aarch64_detect_scalar_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
 {
   /* Detect an extension of a loaded value.  In general, we'll be able to fuse
      the extension with the load.  */
-  if (kind == scalar_stmt && aarch64_extending_load_p (vinfo, stmt_info))
+  if (kind == scalar_stmt && vect_is_extending_load (vinfo, stmt_info))
     return 0;
 
   return stmt_cost;
@@ -15158,7 +15061,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
   /* Detect cases in which vec_to_scalar is describing the extraction of a
      vector element in preparation for a scalar store.  The store itself is
      costed separately.  */
-  if (aarch64_is_store_elt_extraction (kind, stmt_info))
+  if (vect_is_store_elt_extraction (kind, stmt_info))
     return simd_costs->store_elt_extra_cost;
 
   /* Detect SVE gather loads, which are costed as a single scalar_load
@@ -15197,7 +15100,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
      instruction like FADDP or MAXV.  */
   if (kind == vec_to_scalar
       && where == vect_epilogue
-      && aarch64_is_reduction (stmt_info))
+      && vect_is_reduction (stmt_info))
     switch (GET_MODE_INNER (TYPE_MODE (vectype)))
       {
       case E_QImode:
@@ -15247,12 +15150,12 @@ aarch64_sve_adjust_stmt_cost (class vec_info *vinfo, vect_cost_for_stmt kind,
      on the fly.  Optimistically assume that a load followed by an extension
      will fold to this form during combine, and that the extension therefore
      comes for free.  */
-  if (kind == vector_stmt && aarch64_extending_load_p (vinfo, stmt_info))
+  if (kind == vector_stmt && vect_is_extending_load (vinfo, stmt_info))
     stmt_cost = 0;
 
   /* For similar reasons, vector_stmt integer truncations are a no-op,
      because we can just ignore the unused upper bits of the source.  */
-  if (kind == vector_stmt && aarch64_integer_truncation_p (stmt_info))
+  if (kind == vector_stmt && vect_is_integer_truncation (stmt_info))
     stmt_cost = 0;
 
   /* Advanced SIMD can load and store pairs of registers using LDP and STP,
@@ -15327,7 +15230,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
 	}
 
       if (kind == vector_stmt || kind == vec_to_scalar)
-	if (tree cmp_type = aarch64_embedded_comparison_type (stmt_info))
+	if (tree cmp_type = vect_embedded_comparison_type (stmt_info))
 	  {
 	    if (FLOAT_TYPE_P (cmp_type))
 	      stmt_cost += simd_costs->fp_stmt_cost;
@@ -15337,7 +15240,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
     }
 
   if (kind == scalar_stmt)
-    if (tree cmp_type = aarch64_embedded_comparison_type (stmt_info))
+    if (tree cmp_type = vect_embedded_comparison_type (stmt_info))
       {
 	if (FLOAT_TYPE_P (cmp_type))
 	  stmt_cost += aarch64_tune_params.vec_costs->scalar_fp_stmt_cost;
@@ -15387,12 +15290,12 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
   /* Calculate the minimum cycles per iteration imposed by a reduction
      operation.  */
   if ((kind == vector_stmt || kind == vec_to_scalar)
-      && aarch64_is_reduction (stmt_info))
+      && vect_is_reduction (stmt_info))
     {
       unsigned int base
 	= aarch64_in_loop_reduction_latency (vinfo, stmt_info, vectype,
 					     vec_flags);
-      if (aarch64_reduc_type (vinfo, stmt_info) == FOLD_LEFT_REDUCTION)
+      if (vect_reduc_type (vinfo, stmt_info) == FOLD_LEFT_REDUCTION)
 	{
 	  if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
 	    {
@@ -15491,7 +15394,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
 
   /* Add any embedded comparison operations.  */
   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
-      && aarch64_embedded_comparison_type (stmt_info))
+      && vect_embedded_comparison_type (stmt_info))
     ops->general_ops += num_copies;
 
   /* Detect COND_REDUCTIONs and things that would need to become
@@ -15500,7 +15403,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
      have only accounted for one.  */
   if (vec_flags && (kind == vector_stmt || kind == vec_to_scalar))
     {
-      int reduc_type = aarch64_reduc_type (vinfo, stmt_info);
+      int reduc_type = vect_reduc_type (vinfo, stmt_info);
       if ((reduc_type == EXTRACT_LAST_REDUCTION && (vec_flags & VEC_ADVSIMD))
 	  || reduc_type == COND_REDUCTION)
 	ops->general_ops += num_copies;
@@ -15508,7 +15411,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
 
   /* Count the predicate operations needed by an SVE comparison.  */
   if (sve_issue && (kind == vector_stmt || kind == vec_to_scalar))
-    if (tree type = aarch64_comparison_type (stmt_info))
+    if (tree type = vect_comparison_type (stmt_info))
       {
 	unsigned int base = (FLOAT_TYPE_P (type)
 			     ? sve_issue->fp_cmp_pred_ops
@@ -15586,7 +15489,7 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 	  /* If we scalarize a strided store, the vectorizer costs one
 	     vec_to_scalar for each element.  However, we can store the first
 	     element using an FP store without a separate extract step.  */
-	  if (aarch64_is_store_elt_extraction (kind, stmt_info))
+	  if (vect_is_store_elt_extraction (kind, stmt_info))
 	    count -= 1;
 
 	  stmt_cost = aarch64_detect_scalar_stmt_subtype

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] vect: Move costing helpers from aarch64 code
  2021-08-05 12:04     ` [PATCH] vect: Move costing helpers from aarch64 code Richard Sandiford
@ 2021-08-05 12:11       ` Richard Biener
  2021-08-05 13:06         ` Richard Sandiford
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Biener @ 2021-08-05 12:11 UTC (permalink / raw)
  To: Richard Biener, GCC Patches, Richard Sandiford

On Thu, Aug 5, 2021 at 2:04 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Tue, Aug 3, 2021 at 2:09 PM Richard Sandiford via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> When the vectoriser scalarises a strided store, it counts one
> >> scalar_store for each element plus one vec_to_scalar extraction
> >> for each element.  However, extracting element 0 is free on AArch64,
> >> so it should have zero cost.
> >>
> >> I don't have a testcase that requires this for existing -mtune
> >> options, but it becomes more important with a later patch.
> >>
> >> gcc/
> >>         * config/aarch64/aarch64.c (aarch64_is_store_elt_extraction): New
> >>         function, split out from...
> >>         (aarch64_detect_vector_stmt_subtype): ...here.
> >>         (aarch64_add_stmt_cost): Treat extracting element 0 as free.
> >> ---
> >>  gcc/config/aarch64/aarch64.c | 22 +++++++++++++++++++---
> >>  1 file changed, 19 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> >> index 36f11808916..084f8caa0da 100644
> >> --- a/gcc/config/aarch64/aarch64.c
> >> +++ b/gcc/config/aarch64/aarch64.c
> >> @@ -14622,6 +14622,18 @@ aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
> >>      }
> >>  }
> >>
> >> +/* Return true if an operaton of kind KIND for STMT_INFO represents
> >> +   the extraction of an element from a vector in preparation for
> >> +   storing the element to memory.  */
> >> +static bool
> >> +aarch64_is_store_elt_extraction (vect_cost_for_stmt kind,
> >> +                                stmt_vec_info stmt_info)
> >> +{
> >> +  return (kind == vec_to_scalar
> >> +         && STMT_VINFO_DATA_REF (stmt_info)
> >> +         && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)));
> >> +}
> >
> > It would be nice to put functions like this in tree-vectorizer.h in some
> > section marked with a comment to contain helpers for the target
> > add_stmt_cost.
>
> Yeah, I guess that would avoid pointless cut-&-paste between targets.
> How does this look?  Tested on aarch64-linux-gnu and x86_64-linux-gnu.

Looks good besides ...

> Thanks,
> Richard
>
>
> gcc/
>         * tree-vectorizer.h (vect_is_store_elt_extraction, vect_is_reduction)
>         (vect_reduc_type, vect_embedded_comparison_type, vect_comparison_type)
>         (vect_is_extending_load, vect_is_integer_truncation): New functions,
>         moved from aarch64.c but given different names.
>         * config/aarch64/aarch64.c (aarch64_is_store_elt_extraction)
>         (aarch64_is_reduction, aarch64_reduc_type)
>         (aarch64_embedded_comparison_type, aarch64_comparison_type)
>         (aarch64_extending_load_p, aarch64_integer_truncation_p): Delete
>         in favor of the above.  Update callers accordingly.
> ---
>  gcc/config/aarch64/aarch64.c | 125 ++++-------------------------------
>  gcc/tree-vectorizer.h        | 104 +++++++++++++++++++++++++++++
>  2 files changed, 118 insertions(+), 111 deletions(-)
>
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index deb22477e28..fd8681747ca 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2192,4 +2192,108 @@ extern vect_pattern_decl_t slp_patterns[];
>  /* Number of supported pattern matchers.  */
>  extern size_t num__slp_patterns;
>
> +/* ----------------------------------------------------------------------
> +   Target support routines
> +   -----------------------------------------------------------------------
> +   The following routines are provided to simplify costing decisions in
> +   target code.  Please add more as needed.  */
> +
> +/* Return true if an operation of kind KIND for STMT_INFO represents
> +   the extraction of an element from a vector in preparation for
> +   storing the element to memory.  */
> +inline bool
> +vect_is_store_elt_extraction (vect_cost_for_stmt kind, stmt_vec_info stmt_info)
> +{
> +  return (kind == vec_to_scalar
> +         && STMT_VINFO_DATA_REF (stmt_info)
> +         && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)));
> +}
> +
> +/* Return true if STMT_INFO represents part of a reduction.  */
> +inline bool
> +vect_is_reduction (stmt_vec_info stmt_info)
> +{
> +  return (STMT_VINFO_REDUC_DEF (stmt_info)
> +         || VECTORIZABLE_CYCLE_DEF (STMT_VINFO_DEF_TYPE (stmt_info)));
> +}
> +
> +/* If STMT_INFO describes a reduction, return the type of reduction
> +   it describes, otherwise return -1.  */
> +inline int

it's not clear what 'type of reduction' is - why not return enum
vect_reduction_type?
Because of the -1?  Maybe we can simply add a NOT_REDUCTION member
to the enum?  Or simply adjust the comment as "return the vect_reduction_type
of the reduction it describes, otherwise return -1"?

> +vect_reduc_type (vec_info *vinfo, stmt_vec_info stmt_info)
> +{
> +  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
> +    if (STMT_VINFO_REDUC_DEF (stmt_info))
> +      {
> +       stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info);
> +       return int (STMT_VINFO_REDUC_TYPE (reduc_info));
> +      }
> +  return -1;
> +}
> +
> +/* If STMT_INFO is a COND_EXPR that includes an embedded comparison, return the
> +   scalar type of the values being compared.  Return null otherwise.  */
> +inline tree
> +vect_embedded_comparison_type (stmt_vec_info stmt_info)
> +{
> +  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
> +    if (gimple_assign_rhs_code (assign) == COND_EXPR)
> +      {
> +       tree cond = gimple_assign_rhs1 (assign);
> +       if (COMPARISON_CLASS_P (cond))
> +         return TREE_TYPE (TREE_OPERAND (cond, 0));
> +      }
> +  return NULL_TREE;
> +}
> +
> +/* If STMT_INFO is a comparison or contains an embedded comparison, return the
> +   scalar type of the values being compared.  Return null otherwise.  */
> +inline tree
> +vect_comparison_type (stmt_vec_info stmt_info)
> +{
> +  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
> +    if (TREE_CODE_CLASS (gimple_assign_rhs_code (assign)) == tcc_comparison)
> +      return TREE_TYPE (gimple_assign_rhs1 (assign));
> +  return vect_embedded_comparison_type (stmt_info);
> +}
> +
> +/* Return true if STMT_INFO extends the result of a load.  */
> +inline bool
> +vect_is_extending_load (class vec_info *vinfo, stmt_vec_info stmt_info)
> +{
> +  /* Although this is quite large for an inline function, this part
> +     at least should be inline.  */
> +  gassign *assign = dyn_cast <gassign *> (stmt_info->stmt);
> +  if (!assign || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (assign)))
> +    return false;
> +
> +  tree rhs = gimple_assign_rhs1 (stmt_info->stmt);
> +  tree lhs_type = TREE_TYPE (gimple_assign_lhs (assign));
> +  tree rhs_type = TREE_TYPE (rhs);
> +  if (!INTEGRAL_TYPE_P (lhs_type)
> +      || !INTEGRAL_TYPE_P (rhs_type)
> +      || TYPE_PRECISION (lhs_type) <= TYPE_PRECISION (rhs_type))
> +    return false;
> +
> +  stmt_vec_info def_stmt_info = vinfo->lookup_def (rhs);
> +  return (def_stmt_info
> +         && STMT_VINFO_DATA_REF (def_stmt_info)
> +         && DR_IS_READ (STMT_VINFO_DATA_REF (def_stmt_info)));
> +}
> +
> +/* Return true if STMT_INFO is an integer truncation.  */
> +inline bool
> +vect_is_integer_truncation (stmt_vec_info stmt_info)
> +{
> +  gassign *assign = dyn_cast <gassign *> (stmt_info->stmt);
> +  if (!assign || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (assign)))
> +    return false;
> +
> +  tree lhs_type = TREE_TYPE (gimple_assign_lhs (assign));
> +  tree rhs_type = TREE_TYPE (gimple_assign_rhs1 (assign));
> +  return (INTEGRAL_TYPE_P (lhs_type)
> +         && INTEGRAL_TYPE_P (rhs_type)
> +         && TYPE_PRECISION (lhs_type) < TYPE_PRECISION (rhs_type));
> +}
> +
>  #endif  /* GCC_TREE_VECTORIZER_H  */
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index e02cbcbcb38..a4456a86764 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -14790,40 +14790,6 @@ aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>      }
>  }
>
> -/* Return true if an operation of kind KIND for STMT_INFO represents
> -   the extraction of an element from a vector in preparation for
> -   storing the element to memory.  */
> -static bool
> -aarch64_is_store_elt_extraction (vect_cost_for_stmt kind,
> -                                stmt_vec_info stmt_info)
> -{
> -  return (kind == vec_to_scalar
> -         && STMT_VINFO_DATA_REF (stmt_info)
> -         && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)));
> -}
> -
> -/* Return true if STMT_INFO represents part of a reduction.  */
> -static bool
> -aarch64_is_reduction (stmt_vec_info stmt_info)
> -{
> -  return (STMT_VINFO_REDUC_DEF (stmt_info)
> -         || VECTORIZABLE_CYCLE_DEF (STMT_VINFO_DEF_TYPE (stmt_info)));
> -}
> -
> -/* If STMT_INFO describes a reduction, return the type of reduction
> -   it describes, otherwise return -1.  */
> -static int
> -aarch64_reduc_type (vec_info *vinfo, stmt_vec_info stmt_info)
> -{
> -  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
> -    if (STMT_VINFO_REDUC_DEF (stmt_info))
> -      {
> -       stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info);
> -       return int (STMT_VINFO_REDUC_TYPE (reduc_info));
> -      }
> -  return -1;
> -}
> -
>  /* Return true if an access of kind KIND for STMT_INFO represents one
>     vector of an LD[234] or ST[234] operation.  Return the total number of
>     vectors (2, 3 or 4) if so, otherwise return a value outside that range.  */
> @@ -14844,32 +14810,6 @@ aarch64_ld234_st234_vectors (vect_cost_for_stmt kind, stmt_vec_info stmt_info)
>    return 0;
>  }
>
> -/* If STMT_INFO is a COND_EXPR that includes an embedded comparison, return the
> -   scalar type of the values being compared.  Return null otherwise.  */
> -static tree
> -aarch64_embedded_comparison_type (stmt_vec_info stmt_info)
> -{
> -  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
> -    if (gimple_assign_rhs_code (assign) == COND_EXPR)
> -      {
> -       tree cond = gimple_assign_rhs1 (assign);
> -       if (COMPARISON_CLASS_P (cond))
> -         return TREE_TYPE (TREE_OPERAND (cond, 0));
> -      }
> -  return NULL_TREE;
> -}
> -
> -/* If STMT_INFO is a comparison or contains an embedded comparison, return the
> -   scalar type of the values being compared.  Return null otherwise.  */
> -static tree
> -aarch64_comparison_type (stmt_vec_info stmt_info)
> -{
> -  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
> -    if (TREE_CODE_CLASS (gimple_assign_rhs_code (assign)) == tcc_comparison)
> -      return TREE_TYPE (gimple_assign_rhs1 (assign));
> -  return aarch64_embedded_comparison_type (stmt_info);
> -}
> -
>  /* Return true if creating multiple copies of STMT_INFO for Advanced SIMD
>     vectors would produce a series of LDP or STP operations.  KIND is the
>     kind of statement that STMT_INFO represents.  */
> @@ -14896,43 +14836,6 @@ aarch64_advsimd_ldp_stp_p (enum vect_cost_for_stmt kind,
>    return is_gimple_assign (stmt_info->stmt);
>  }
>
> -/* Return true if STMT_INFO extends the result of a load.  */
> -static bool
> -aarch64_extending_load_p (class vec_info *vinfo, stmt_vec_info stmt_info)
> -{
> -  gassign *assign = dyn_cast <gassign *> (stmt_info->stmt);
> -  if (!assign || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (assign)))
> -    return false;
> -
> -  tree rhs = gimple_assign_rhs1 (stmt_info->stmt);
> -  tree lhs_type = TREE_TYPE (gimple_assign_lhs (assign));
> -  tree rhs_type = TREE_TYPE (rhs);
> -  if (!INTEGRAL_TYPE_P (lhs_type)
> -      || !INTEGRAL_TYPE_P (rhs_type)
> -      || TYPE_PRECISION (lhs_type) <= TYPE_PRECISION (rhs_type))
> -    return false;
> -
> -  stmt_vec_info def_stmt_info = vinfo->lookup_def (rhs);
> -  return (def_stmt_info
> -         && STMT_VINFO_DATA_REF (def_stmt_info)
> -         && DR_IS_READ (STMT_VINFO_DATA_REF (def_stmt_info)));
> -}
> -
> -/* Return true if STMT_INFO is an integer truncation.  */
> -static bool
> -aarch64_integer_truncation_p (stmt_vec_info stmt_info)
> -{
> -  gassign *assign = dyn_cast <gassign *> (stmt_info->stmt);
> -  if (!assign || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (assign)))
> -    return false;
> -
> -  tree lhs_type = TREE_TYPE (gimple_assign_lhs (assign));
> -  tree rhs_type = TREE_TYPE (gimple_assign_rhs1 (assign));
> -  return (INTEGRAL_TYPE_P (lhs_type)
> -         && INTEGRAL_TYPE_P (rhs_type)
> -         && TYPE_PRECISION (lhs_type) < TYPE_PRECISION (rhs_type));
> -}
> -
>  /* Return true if STMT_INFO is the second part of a two-statement multiply-add
>     or multiply-subtract sequence that might be suitable for fusing into a
>     single instruction.  If VEC_FLAGS is zero, analyze the operation as
> @@ -15035,7 +14938,7 @@ aarch64_sve_in_loop_reduction_latency (vec_info *vinfo,
>                                        tree vectype,
>                                        const sve_vec_cost *sve_costs)
>  {
> -  switch (aarch64_reduc_type (vinfo, stmt_info))
> +  switch (vect_reduc_type (vinfo, stmt_info))
>      {
>      case EXTRACT_LAST_REDUCTION:
>        return sve_costs->clast_cost;
> @@ -15126,7 +15029,7 @@ aarch64_detect_scalar_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>  {
>    /* Detect an extension of a loaded value.  In general, we'll be able to fuse
>       the extension with the load.  */
> -  if (kind == scalar_stmt && aarch64_extending_load_p (vinfo, stmt_info))
> +  if (kind == scalar_stmt && vect_is_extending_load (vinfo, stmt_info))
>      return 0;
>
>    return stmt_cost;
> @@ -15158,7 +15061,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>    /* Detect cases in which vec_to_scalar is describing the extraction of a
>       vector element in preparation for a scalar store.  The store itself is
>       costed separately.  */
> -  if (aarch64_is_store_elt_extraction (kind, stmt_info))
> +  if (vect_is_store_elt_extraction (kind, stmt_info))
>      return simd_costs->store_elt_extra_cost;
>
>    /* Detect SVE gather loads, which are costed as a single scalar_load
> @@ -15197,7 +15100,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>       instruction like FADDP or MAXV.  */
>    if (kind == vec_to_scalar
>        && where == vect_epilogue
> -      && aarch64_is_reduction (stmt_info))
> +      && vect_is_reduction (stmt_info))
>      switch (GET_MODE_INNER (TYPE_MODE (vectype)))
>        {
>        case E_QImode:
> @@ -15247,12 +15150,12 @@ aarch64_sve_adjust_stmt_cost (class vec_info *vinfo, vect_cost_for_stmt kind,
>       on the fly.  Optimistically assume that a load followed by an extension
>       will fold to this form during combine, and that the extension therefore
>       comes for free.  */
> -  if (kind == vector_stmt && aarch64_extending_load_p (vinfo, stmt_info))
> +  if (kind == vector_stmt && vect_is_extending_load (vinfo, stmt_info))
>      stmt_cost = 0;
>
>    /* For similar reasons, vector_stmt integer truncations are a no-op,
>       because we can just ignore the unused upper bits of the source.  */
> -  if (kind == vector_stmt && aarch64_integer_truncation_p (stmt_info))
> +  if (kind == vector_stmt && vect_is_integer_truncation (stmt_info))
>      stmt_cost = 0;
>
>    /* Advanced SIMD can load and store pairs of registers using LDP and STP,
> @@ -15327,7 +15230,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
>         }
>
>        if (kind == vector_stmt || kind == vec_to_scalar)
> -       if (tree cmp_type = aarch64_embedded_comparison_type (stmt_info))
> +       if (tree cmp_type = vect_embedded_comparison_type (stmt_info))
>           {
>             if (FLOAT_TYPE_P (cmp_type))
>               stmt_cost += simd_costs->fp_stmt_cost;
> @@ -15337,7 +15240,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
>      }
>
>    if (kind == scalar_stmt)
> -    if (tree cmp_type = aarch64_embedded_comparison_type (stmt_info))
> +    if (tree cmp_type = vect_embedded_comparison_type (stmt_info))
>        {
>         if (FLOAT_TYPE_P (cmp_type))
>           stmt_cost += aarch64_tune_params.vec_costs->scalar_fp_stmt_cost;
> @@ -15387,12 +15290,12 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
>    /* Calculate the minimum cycles per iteration imposed by a reduction
>       operation.  */
>    if ((kind == vector_stmt || kind == vec_to_scalar)
> -      && aarch64_is_reduction (stmt_info))
> +      && vect_is_reduction (stmt_info))
>      {
>        unsigned int base
>         = aarch64_in_loop_reduction_latency (vinfo, stmt_info, vectype,
>                                              vec_flags);
> -      if (aarch64_reduc_type (vinfo, stmt_info) == FOLD_LEFT_REDUCTION)
> +      if (vect_reduc_type (vinfo, stmt_info) == FOLD_LEFT_REDUCTION)
>         {
>           if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
>             {
> @@ -15491,7 +15394,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
>
>    /* Add any embedded comparison operations.  */
>    if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
> -      && aarch64_embedded_comparison_type (stmt_info))
> +      && vect_embedded_comparison_type (stmt_info))
>      ops->general_ops += num_copies;
>
>    /* Detect COND_REDUCTIONs and things that would need to become
> @@ -15500,7 +15403,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
>       have only accounted for one.  */
>    if (vec_flags && (kind == vector_stmt || kind == vec_to_scalar))
>      {
> -      int reduc_type = aarch64_reduc_type (vinfo, stmt_info);
> +      int reduc_type = vect_reduc_type (vinfo, stmt_info);
>        if ((reduc_type == EXTRACT_LAST_REDUCTION && (vec_flags & VEC_ADVSIMD))
>           || reduc_type == COND_REDUCTION)
>         ops->general_ops += num_copies;
> @@ -15508,7 +15411,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
>
>    /* Count the predicate operations needed by an SVE comparison.  */
>    if (sve_issue && (kind == vector_stmt || kind == vec_to_scalar))
> -    if (tree type = aarch64_comparison_type (stmt_info))
> +    if (tree type = vect_comparison_type (stmt_info))
>        {
>         unsigned int base = (FLOAT_TYPE_P (type)
>                              ? sve_issue->fp_cmp_pred_ops
> @@ -15586,7 +15489,7 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>           /* If we scalarize a strided store, the vectorizer costs one
>              vec_to_scalar for each element.  However, we can store the first
>              element using an FP store without a separate extract step.  */
> -         if (aarch64_is_store_elt_extraction (kind, stmt_info))
> +         if (vect_is_store_elt_extraction (kind, stmt_info))
>             count -= 1;
>
>           stmt_cost = aarch64_detect_scalar_stmt_subtype

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] vect: Move costing helpers from aarch64 code
  2021-08-05 12:11       ` Richard Biener
@ 2021-08-05 13:06         ` Richard Sandiford
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Sandiford @ 2021-08-05 13:06 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Thu, Aug 5, 2021 at 2:04 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Tue, Aug 3, 2021 at 2:09 PM Richard Sandiford via Gcc-patches
>> > <gcc-patches@gcc.gnu.org> wrote:
>> >>
>> >> When the vectoriser scalarises a strided store, it counts one
>> >> scalar_store for each element plus one vec_to_scalar extraction
>> >> for each element.  However, extracting element 0 is free on AArch64,
>> >> so it should have zero cost.
>> >>
>> >> I don't have a testcase that requires this for existing -mtune
>> >> options, but it becomes more important with a later patch.
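
For illustration only (a hypothetical example, not a testcase from this
patch), a strided store of the kind described above might come from a
loop such as:

    /* dst is written with a runtime stride, so the vectorizer scalarises
       the store: each element is costed as one vec_to_scalar extraction
       plus one scalar_store.  On AArch64 the extraction of element 0 can
       be folded into an FP store, hence the zero cost described above.  */
    void
    strided_store (double *restrict dst, const double *restrict src,
                   int n, int stride)
    {
      for (int i = 0; i < n; ++i)
        dst[i * stride] = src[i] + 1.0;
    }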
>> >>
>> >> gcc/
>> >>         * config/aarch64/aarch64.c (aarch64_is_store_elt_extraction): New
>> >>         function, split out from...
>> >>         (aarch64_detect_vector_stmt_subtype): ...here.
>> >>         (aarch64_add_stmt_cost): Treat extracting element 0 as free.
>> >> ---
>> >>  gcc/config/aarch64/aarch64.c | 22 +++++++++++++++++++---
>> >>  1 file changed, 19 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> >> index 36f11808916..084f8caa0da 100644
>> >> --- a/gcc/config/aarch64/aarch64.c
>> >> +++ b/gcc/config/aarch64/aarch64.c
>> >> @@ -14622,6 +14622,18 @@ aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>> >>      }
>> >>  }
>> >>
>> >> +/* Return true if an operation of kind KIND for STMT_INFO represents
>> >> +   the extraction of an element from a vector in preparation for
>> >> +   storing the element to memory.  */
>> >> +static bool
>> >> +aarch64_is_store_elt_extraction (vect_cost_for_stmt kind,
>> >> +                                stmt_vec_info stmt_info)
>> >> +{
>> >> +  return (kind == vec_to_scalar
>> >> +         && STMT_VINFO_DATA_REF (stmt_info)
>> >> +         && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)));
>> >> +}
>> >
>> > It would be nice to put functions like this in tree-vectorizer.h in some
>> > section marked with a comment to contain helpers for the target
>> > add_stmt_cost.
>>
>> Yeah, I guess that would avoid pointless cut-&-paste between targets.
>> How does this look?  Tested on aarch64-linux-gnu and x86_64-linux-gnu.
>
> Looks good besides ...
>
>> Thanks,
>> Richard
>>
>>
>> gcc/
>>         * tree-vectorizer.h (vect_is_store_elt_extraction, vect_is_reduction)
>>         (vect_reduc_type, vect_embedded_comparison_type, vect_comparison_type)
>>         (vect_is_extending_load, vect_is_integer_truncation): New functions,
>>         moved from aarch64.c but given different names.
>>         * config/aarch64/aarch64.c (aarch64_is_store_elt_extraction)
>>         (aarch64_is_reduction, aarch64_reduc_type)
>>         (aarch64_embedded_comparison_type, aarch64_comparison_type)
>>         (aarch64_extending_load_p, aarch64_integer_truncation_p): Delete
>>         in favor of the above.  Update callers accordingly.
>> ---
>>  gcc/config/aarch64/aarch64.c | 125 ++++-------------------------------
>>  gcc/tree-vectorizer.h        | 104 +++++++++++++++++++++++++++++
>>  2 files changed, 118 insertions(+), 111 deletions(-)
>>
>> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>> index deb22477e28..fd8681747ca 100644
>> --- a/gcc/tree-vectorizer.h
>> +++ b/gcc/tree-vectorizer.h
>> @@ -2192,4 +2192,108 @@ extern vect_pattern_decl_t slp_patterns[];
>>  /* Number of supported pattern matchers.  */
>>  extern size_t num__slp_patterns;
>>
>> +/* ----------------------------------------------------------------------
>> +   Target support routines
>> +   -----------------------------------------------------------------------
>> +   The following routines are provided to simplify costing decisions in
>> +   target code.  Please add more as needed.  */
>> +
>> +/* Return true if an operation of kind KIND for STMT_INFO represents
>> +   the extraction of an element from a vector in preparation for
>> +   storing the element to memory.  */
>> +inline bool
>> +vect_is_store_elt_extraction (vect_cost_for_stmt kind, stmt_vec_info stmt_info)
>> +{
>> +  return (kind == vec_to_scalar
>> +         && STMT_VINFO_DATA_REF (stmt_info)
>> +         && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)));
>> +}
>> +
>> +/* Return true if STMT_INFO represents part of a reduction.  */
>> +inline bool
>> +vect_is_reduction (stmt_vec_info stmt_info)
>> +{
>> +  return (STMT_VINFO_REDUC_DEF (stmt_info)
>> +         || VECTORIZABLE_CYCLE_DEF (STMT_VINFO_DEF_TYPE (stmt_info)));
>> +}
>> +
>> +/* If STMT_INFO describes a reduction, return the type of reduction
>> +   it describes, otherwise return -1.  */
>> +inline int
>
> it's not clear what 'type of reduction' is - why not return enum
> vect_reduction_type?
> Because of the -1?  Maybe we can simply add a NOT_REDUCTION member
> to the enum?

Yeah, because of the -1.  I don't like the idea of adding a fake value
since it complicates switch statements that handle all non-fake values.
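
To make that concrete, a hypothetical sketch (not proposed code; the list
of reduction kinds is abridged and the latency names are invented):

    /* Sketch only.  Switches like this enumerate the real reduction kinds
       and rely on -Wswitch to flag anything missing; a fake NOT_REDUCTION
       member would force a never-taken arm in each of them, even though
       the caller has already checked that the statement is a reduction.  */
    switch (STMT_VINFO_REDUC_TYPE (reduc_info))
      {
      case TREE_CODE_REDUCTION:
      case COND_REDUCTION:
        return base_latency;
      case EXTRACT_LAST_REDUCTION:
        return clast_latency;
      case FOLD_LEFT_REDUCTION:
        return fadda_latency;
      }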

> Or simply adjust the comment as "return the vect_reduction_type
> of the reduction it describes, otherwise return -1"?

Ah, yeah, that sounds better.  I've pushed the patch with that change.

Thanks,
Richard
>
>> +vect_reduc_type (vec_info *vinfo, stmt_vec_info stmt_info)
>> +{
>> +  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
>> +    if (STMT_VINFO_REDUC_DEF (stmt_info))
>> +      {
>> +       stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info);
>> +       return int (STMT_VINFO_REDUC_TYPE (reduc_info));
>> +      }
>> +  return -1;
>> +}
>> +
>> +/* If STMT_INFO is a COND_EXPR that includes an embedded comparison, return the
>> +   scalar type of the values being compared.  Return null otherwise.  */
>> +inline tree
>> +vect_embedded_comparison_type (stmt_vec_info stmt_info)
>> +{
>> +  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
>> +    if (gimple_assign_rhs_code (assign) == COND_EXPR)
>> +      {
>> +       tree cond = gimple_assign_rhs1 (assign);
>> +       if (COMPARISON_CLASS_P (cond))
>> +         return TREE_TYPE (TREE_OPERAND (cond, 0));
>> +      }
>> +  return NULL_TREE;
>> +}
>> +
>> +/* If STMT_INFO is a comparison or contains an embedded comparison, return the
>> +   scalar type of the values being compared.  Return null otherwise.  */
>> +inline tree
>> +vect_comparison_type (stmt_vec_info stmt_info)
>> +{
>> +  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
>> +    if (TREE_CODE_CLASS (gimple_assign_rhs_code (assign)) == tcc_comparison)
>> +      return TREE_TYPE (gimple_assign_rhs1 (assign));
>> +  return vect_embedded_comparison_type (stmt_info);
>> +}
>> +
>> +/* Return true if STMT_INFO extends the result of a load.  */
>> +inline bool
>> +vect_is_extending_load (class vec_info *vinfo, stmt_vec_info stmt_info)
>> +{
>> +  /* Although this is quite large for an inline function, this part
>> +     at least should be inline.  */
>> +  gassign *assign = dyn_cast <gassign *> (stmt_info->stmt);
>> +  if (!assign || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (assign)))
>> +    return false;
>> +
>> +  tree rhs = gimple_assign_rhs1 (stmt_info->stmt);
>> +  tree lhs_type = TREE_TYPE (gimple_assign_lhs (assign));
>> +  tree rhs_type = TREE_TYPE (rhs);
>> +  if (!INTEGRAL_TYPE_P (lhs_type)
>> +      || !INTEGRAL_TYPE_P (rhs_type)
>> +      || TYPE_PRECISION (lhs_type) <= TYPE_PRECISION (rhs_type))
>> +    return false;
>> +
>> +  stmt_vec_info def_stmt_info = vinfo->lookup_def (rhs);
>> +  return (def_stmt_info
>> +         && STMT_VINFO_DATA_REF (def_stmt_info)
>> +         && DR_IS_READ (STMT_VINFO_DATA_REF (def_stmt_info)));
>> +}
>> +
>> +/* Return true if STMT_INFO is an integer truncation.  */
>> +inline bool
>> +vect_is_integer_truncation (stmt_vec_info stmt_info)
>> +{
>> +  gassign *assign = dyn_cast <gassign *> (stmt_info->stmt);
>> +  if (!assign || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (assign)))
>> +    return false;
>> +
>> +  tree lhs_type = TREE_TYPE (gimple_assign_lhs (assign));
>> +  tree rhs_type = TREE_TYPE (gimple_assign_rhs1 (assign));
>> +  return (INTEGRAL_TYPE_P (lhs_type)
>> +         && INTEGRAL_TYPE_P (rhs_type)
>> +         && TYPE_PRECISION (lhs_type) < TYPE_PRECISION (rhs_type));
>> +}
>> +
>>  #endif  /* GCC_TREE_VECTORIZER_H  */
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index e02cbcbcb38..a4456a86764 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -14790,40 +14790,6 @@ aarch64_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>>      }
>>  }
>>
>> -/* Return true if an operation of kind KIND for STMT_INFO represents
>> -   the extraction of an element from a vector in preparation for
>> -   storing the element to memory.  */
>> -static bool
>> -aarch64_is_store_elt_extraction (vect_cost_for_stmt kind,
>> -                                stmt_vec_info stmt_info)
>> -{
>> -  return (kind == vec_to_scalar
>> -         && STMT_VINFO_DATA_REF (stmt_info)
>> -         && DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)));
>> -}
>> -
>> -/* Return true if STMT_INFO represents part of a reduction.  */
>> -static bool
>> -aarch64_is_reduction (stmt_vec_info stmt_info)
>> -{
>> -  return (STMT_VINFO_REDUC_DEF (stmt_info)
>> -         || VECTORIZABLE_CYCLE_DEF (STMT_VINFO_DEF_TYPE (stmt_info)));
>> -}
>> -
>> -/* If STMT_INFO describes a reduction, return the type of reduction
>> -   it describes, otherwise return -1.  */
>> -static int
>> -aarch64_reduc_type (vec_info *vinfo, stmt_vec_info stmt_info)
>> -{
>> -  if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
>> -    if (STMT_VINFO_REDUC_DEF (stmt_info))
>> -      {
>> -       stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info);
>> -       return int (STMT_VINFO_REDUC_TYPE (reduc_info));
>> -      }
>> -  return -1;
>> -}
>> -
>>  /* Return true if an access of kind KIND for STMT_INFO represents one
>>     vector of an LD[234] or ST[234] operation.  Return the total number of
>>     vectors (2, 3 or 4) if so, otherwise return a value outside that range.  */
>> @@ -14844,32 +14810,6 @@ aarch64_ld234_st234_vectors (vect_cost_for_stmt kind, stmt_vec_info stmt_info)
>>    return 0;
>>  }
>>
>> -/* If STMT_INFO is a COND_EXPR that includes an embedded comparison, return the
>> -   scalar type of the values being compared.  Return null otherwise.  */
>> -static tree
>> -aarch64_embedded_comparison_type (stmt_vec_info stmt_info)
>> -{
>> -  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
>> -    if (gimple_assign_rhs_code (assign) == COND_EXPR)
>> -      {
>> -       tree cond = gimple_assign_rhs1 (assign);
>> -       if (COMPARISON_CLASS_P (cond))
>> -         return TREE_TYPE (TREE_OPERAND (cond, 0));
>> -      }
>> -  return NULL_TREE;
>> -}
>> -
>> -/* If STMT_INFO is a comparison or contains an embedded comparison, return the
>> -   scalar type of the values being compared.  Return null otherwise.  */
>> -static tree
>> -aarch64_comparison_type (stmt_vec_info stmt_info)
>> -{
>> -  if (auto *assign = dyn_cast<gassign *> (stmt_info->stmt))
>> -    if (TREE_CODE_CLASS (gimple_assign_rhs_code (assign)) == tcc_comparison)
>> -      return TREE_TYPE (gimple_assign_rhs1 (assign));
>> -  return aarch64_embedded_comparison_type (stmt_info);
>> -}
>> -
>>  /* Return true if creating multiple copies of STMT_INFO for Advanced SIMD
>>     vectors would produce a series of LDP or STP operations.  KIND is the
>>     kind of statement that STMT_INFO represents.  */
>> @@ -14896,43 +14836,6 @@ aarch64_advsimd_ldp_stp_p (enum vect_cost_for_stmt kind,
>>    return is_gimple_assign (stmt_info->stmt);
>>  }
>>
>> -/* Return true if STMT_INFO extends the result of a load.  */
>> -static bool
>> -aarch64_extending_load_p (class vec_info *vinfo, stmt_vec_info stmt_info)
>> -{
>> -  gassign *assign = dyn_cast <gassign *> (stmt_info->stmt);
>> -  if (!assign || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (assign)))
>> -    return false;
>> -
>> -  tree rhs = gimple_assign_rhs1 (stmt_info->stmt);
>> -  tree lhs_type = TREE_TYPE (gimple_assign_lhs (assign));
>> -  tree rhs_type = TREE_TYPE (rhs);
>> -  if (!INTEGRAL_TYPE_P (lhs_type)
>> -      || !INTEGRAL_TYPE_P (rhs_type)
>> -      || TYPE_PRECISION (lhs_type) <= TYPE_PRECISION (rhs_type))
>> -    return false;
>> -
>> -  stmt_vec_info def_stmt_info = vinfo->lookup_def (rhs);
>> -  return (def_stmt_info
>> -         && STMT_VINFO_DATA_REF (def_stmt_info)
>> -         && DR_IS_READ (STMT_VINFO_DATA_REF (def_stmt_info)));
>> -}
>> -
>> -/* Return true if STMT_INFO is an integer truncation.  */
>> -static bool
>> -aarch64_integer_truncation_p (stmt_vec_info stmt_info)
>> -{
>> -  gassign *assign = dyn_cast <gassign *> (stmt_info->stmt);
>> -  if (!assign || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (assign)))
>> -    return false;
>> -
>> -  tree lhs_type = TREE_TYPE (gimple_assign_lhs (assign));
>> -  tree rhs_type = TREE_TYPE (gimple_assign_rhs1 (assign));
>> -  return (INTEGRAL_TYPE_P (lhs_type)
>> -         && INTEGRAL_TYPE_P (rhs_type)
>> -         && TYPE_PRECISION (lhs_type) < TYPE_PRECISION (rhs_type));
>> -}
>> -
>>  /* Return true if STMT_INFO is the second part of a two-statement multiply-add
>>     or multiply-subtract sequence that might be suitable for fusing into a
>>     single instruction.  If VEC_FLAGS is zero, analyze the operation as
>> @@ -15035,7 +14938,7 @@ aarch64_sve_in_loop_reduction_latency (vec_info *vinfo,
>>                                        tree vectype,
>>                                        const sve_vec_cost *sve_costs)
>>  {
>> -  switch (aarch64_reduc_type (vinfo, stmt_info))
>> +  switch (vect_reduc_type (vinfo, stmt_info))
>>      {
>>      case EXTRACT_LAST_REDUCTION:
>>        return sve_costs->clast_cost;
>> @@ -15126,7 +15029,7 @@ aarch64_detect_scalar_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>>  {
>>    /* Detect an extension of a loaded value.  In general, we'll be able to fuse
>>       the extension with the load.  */
>> -  if (kind == scalar_stmt && aarch64_extending_load_p (vinfo, stmt_info))
>> +  if (kind == scalar_stmt && vect_is_extending_load (vinfo, stmt_info))
>>      return 0;
>>
>>    return stmt_cost;
>> @@ -15158,7 +15061,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>>    /* Detect cases in which vec_to_scalar is describing the extraction of a
>>       vector element in preparation for a scalar store.  The store itself is
>>       costed separately.  */
>> -  if (aarch64_is_store_elt_extraction (kind, stmt_info))
>> +  if (vect_is_store_elt_extraction (kind, stmt_info))
>>      return simd_costs->store_elt_extra_cost;
>>
>>    /* Detect SVE gather loads, which are costed as a single scalar_load
>> @@ -15197,7 +15100,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>>       instruction like FADDP or MAXV.  */
>>    if (kind == vec_to_scalar
>>        && where == vect_epilogue
>> -      && aarch64_is_reduction (stmt_info))
>> +      && vect_is_reduction (stmt_info))
>>      switch (GET_MODE_INNER (TYPE_MODE (vectype)))
>>        {
>>        case E_QImode:
>> @@ -15247,12 +15150,12 @@ aarch64_sve_adjust_stmt_cost (class vec_info *vinfo, vect_cost_for_stmt kind,
>>       on the fly.  Optimistically assume that a load followed by an extension
>>       will fold to this form during combine, and that the extension therefore
>>       comes for free.  */
>> -  if (kind == vector_stmt && aarch64_extending_load_p (vinfo, stmt_info))
>> +  if (kind == vector_stmt && vect_is_extending_load (vinfo, stmt_info))
>>      stmt_cost = 0;
>>
>>    /* For similar reasons, vector_stmt integer truncations are a no-op,
>>       because we can just ignore the unused upper bits of the source.  */
>> -  if (kind == vector_stmt && aarch64_integer_truncation_p (stmt_info))
>> +  if (kind == vector_stmt && vect_is_integer_truncation (stmt_info))
>>      stmt_cost = 0;
>>
>>    /* Advanced SIMD can load and store pairs of registers using LDP and STP,
>> @@ -15327,7 +15230,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
>>         }
>>
>>        if (kind == vector_stmt || kind == vec_to_scalar)
>> -       if (tree cmp_type = aarch64_embedded_comparison_type (stmt_info))
>> +       if (tree cmp_type = vect_embedded_comparison_type (stmt_info))
>>           {
>>             if (FLOAT_TYPE_P (cmp_type))
>>               stmt_cost += simd_costs->fp_stmt_cost;
>> @@ -15337,7 +15240,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
>>      }
>>
>>    if (kind == scalar_stmt)
>> -    if (tree cmp_type = aarch64_embedded_comparison_type (stmt_info))
>> +    if (tree cmp_type = vect_embedded_comparison_type (stmt_info))
>>        {
>>         if (FLOAT_TYPE_P (cmp_type))
>>           stmt_cost += aarch64_tune_params.vec_costs->scalar_fp_stmt_cost;
>> @@ -15387,12 +15290,12 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
>>    /* Calculate the minimum cycles per iteration imposed by a reduction
>>       operation.  */
>>    if ((kind == vector_stmt || kind == vec_to_scalar)
>> -      && aarch64_is_reduction (stmt_info))
>> +      && vect_is_reduction (stmt_info))
>>      {
>>        unsigned int base
>>         = aarch64_in_loop_reduction_latency (vinfo, stmt_info, vectype,
>>                                              vec_flags);
>> -      if (aarch64_reduc_type (vinfo, stmt_info) == FOLD_LEFT_REDUCTION)
>> +      if (vect_reduc_type (vinfo, stmt_info) == FOLD_LEFT_REDUCTION)
>>         {
>>           if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
>>             {
>> @@ -15491,7 +15394,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
>>
>>    /* Add any embedded comparison operations.  */
>>    if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>> -      && aarch64_embedded_comparison_type (stmt_info))
>> +      && vect_embedded_comparison_type (stmt_info))
>>      ops->general_ops += num_copies;
>>
>>    /* Detect COND_REDUCTIONs and things that would need to become
>> @@ -15500,7 +15403,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
>>       have only accounted for one.  */
>>    if (vec_flags && (kind == vector_stmt || kind == vec_to_scalar))
>>      {
>> -      int reduc_type = aarch64_reduc_type (vinfo, stmt_info);
>> +      int reduc_type = vect_reduc_type (vinfo, stmt_info);
>>        if ((reduc_type == EXTRACT_LAST_REDUCTION && (vec_flags & VEC_ADVSIMD))
>>           || reduc_type == COND_REDUCTION)
>>         ops->general_ops += num_copies;
>> @@ -15508,7 +15411,7 @@ aarch64_count_ops (class vec_info *vinfo, aarch64_vector_costs *costs,
>>
>>    /* Count the predicate operations needed by an SVE comparison.  */
>>    if (sve_issue && (kind == vector_stmt || kind == vec_to_scalar))
>> -    if (tree type = aarch64_comparison_type (stmt_info))
>> +    if (tree type = vect_comparison_type (stmt_info))
>>        {
>>         unsigned int base = (FLOAT_TYPE_P (type)
>>                              ? sve_issue->fp_cmp_pred_ops
>> @@ -15586,7 +15489,7 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>>           /* If we scalarize a strided store, the vectorizer costs one
>>              vec_to_scalar for each element.  However, we can store the first
>>              element using an FP store without a separate extract step.  */
>> -         if (aarch64_is_store_elt_extraction (kind, stmt_info))
>> +         if (vect_is_store_elt_extraction (kind, stmt_info))
>>             count -= 1;
>>
>>           stmt_cost = aarch64_detect_scalar_stmt_subtype
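
For a rough idea of how target code consumes these shared helpers, here is
a hypothetical sketch modelled on the aarch64 usage quoted above (the
function name and the cost numbers are invented):

    /* Sketch of a target statement-cost adjustment built on the new
       tree-vectorizer.h helpers; the constants are placeholders.  */
    static unsigned int
    example_adjust_stmt_cost (vec_info *vinfo, vect_cost_for_stmt kind,
                              stmt_vec_info stmt_info, unsigned int stmt_cost)
    {
      /* Element extractions done only to feed a scalar store are assumed
         to fold into the store itself.  */
      if (vect_is_store_elt_extraction (kind, stmt_info))
        return 0;

      /* Extending loads are assumed to fuse with the load.  */
      if (kind == vector_stmt && vect_is_extending_load (vinfo, stmt_info))
        return 0;

      /* Charge embedded comparisons on top of the select they feed.  */
      if (tree cmp_type = vect_embedded_comparison_type (stmt_info))
        stmt_cost += FLOAT_TYPE_P (cmp_type) ? 2 : 1;

      return stmt_cost;
    }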

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 8/8] aarch64: Add -mtune=neoverse-512tvb
  2021-08-03 12:06 ` [PATCH 8/8] aarch64: Add -mtune=neoverse-512tvb Richard Sandiford
@ 2021-08-17 14:37   ` Richard Sandiford
  0 siblings, 0 replies; 17+ messages in thread
From: Richard Sandiford @ 2021-08-17 14:37 UTC (permalink / raw)
  To: gcc-patches

Richard Sandiford <richard.sandiford@arm.com> writes:
> This patch adds an option to tune for Neoverse cores that have
> a total vector bandwidth of 512 bits (4x128 for Advanced SIMD
> and a vector-length-dependent equivalent for SVE).  This is intended
> to be a compromise between tuning aggressively for a single core like
> Neoverse V1 (which can be too narrow) and tuning for AArch64 cores
> in general (which can be too wide).
>
> -mcpu=neoverse-512tvb is equivalent to -mcpu=neoverse-v1
> -mtune=neoverse-512tvb.
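
For instance (example invocations of my own, not taken from the patch),
these two command lines should select the same code generation and tuning:

    gcc -O3 -mcpu=neoverse-512tvb vec.c
    gcc -O3 -mcpu=neoverse-v1 -mtune=neoverse-512tvb vec.c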
>
> gcc/
> 	* doc/invoke.texi: Document -mtune=neoverse-512tvb and
> 	-mcpu=neoverse-512tvb.
> 	* config/aarch64/aarch64-cores.def (neoverse-512tvb): New entry.
> 	* config/aarch64/aarch64-tune.md: Regenerate.
> 	* config/aarch64/aarch64.c (neoverse512tvb_sve_vector_cost)
> 	(neoverse512tvb_sve_issue_info, neoverse512tvb_vec_issue_info)
> 	(neoverse512tvb_vector_cost, neoverse512tvb_tunings): New structures.
> 	(aarch64_adjust_body_cost_sve): Handle -mtune=neoverse-512tvb.
> 	(aarch64_adjust_body_cost): Likewise.

I've backported this cut-down version to GCC 10 and 9, so that the
option is at least recognised there too.

gcc/
	* doc/invoke.texi: Document -mtune=neoverse-512tvb and
	-mcpu=neoverse-512tvb.
	* config/aarch64/aarch64-cores.def (neoverse-512tvb): New entry.
	* config/aarch64/aarch64-tune.md: Regenerate.

---
 gcc/config/aarch64/aarch64-cores.def |  1 +
 gcc/config/aarch64/aarch64-tune.md   |  2 +-
 gcc/doc/invoke.texi                  | 24 ++++++++++++++++++++++--
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def
index 9c290292479..fc60e2ae1ac 100644
--- a/gcc/config/aarch64/aarch64-cores.def
+++ b/gcc/config/aarch64/aarch64-cores.def
@@ -135,6 +135,7 @@ AARCH64_CORE("thunderx3t110",  thunderx3t110,  thunderx3t110, 8_3A,  AARCH64_FL_
 /* Arm ('A') cores.  */
 AARCH64_CORE("zeus", zeus, cortexa57, 8_4A,  AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_SVE | AARCH64_FL_RCPC | AARCH64_FL_I8MM | AARCH64_FL_BF16 | AARCH64_FL_F16 | AARCH64_FL_PROFILE | AARCH64_FL_SSBS | AARCH64_FL_RNG, neoversev1, 0x41, 0xd40, -1)
 AARCH64_CORE("neoverse-v1", neoversev1, cortexa57, 8_4A,  AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_SVE | AARCH64_FL_RCPC | AARCH64_FL_I8MM | AARCH64_FL_BF16 | AARCH64_FL_F16 | AARCH64_FL_PROFILE | AARCH64_FL_SSBS | AARCH64_FL_RNG, neoversev1, 0x41, 0xd40, -1)
+AARCH64_CORE("neoverse-512tvb", neoverse512tvb, cortexa57, 8_4A,  AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_SVE | AARCH64_FL_RCPC | AARCH64_FL_I8MM | AARCH64_FL_BF16 | AARCH64_FL_F16 | AARCH64_FL_PROFILE | AARCH64_FL_SSBS | AARCH64_FL_RNG, neoversev1, INVALID_IMP, INVALID_CORE, -1)
 
 /* Qualcomm ('Q') cores. */
 AARCH64_CORE("saphira",     saphira,    saphira,    8_4A,  AARCH64_FL_FOR_ARCH8_4 | AARCH64_FL_CRYPTO | AARCH64_FL_RCPC, saphira,   0x51, 0xC01, -1)
diff --git a/gcc/config/aarch64/aarch64-tune.md b/gcc/config/aarch64/aarch64-tune.md
index 7fda2294b8a..aa68d67bdf4 100644
--- a/gcc/config/aarch64/aarch64-tune.md
+++ b/gcc/config/aarch64/aarch64-tune.md
@@ -1,5 +1,5 @@
 ;; -*- buffer-read-only: t -*-
 ;; Generated automatically by gentune.sh from aarch64-cores.def
 (define_attr "tune"
-	"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa65,cortexa65ae,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,saphira,neoversen2,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55"
+	"cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa65,cortexa65ae,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,neoversen2,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55"
 	(const (symbol_ref "((enum attr_tune) aarch64_tune)")))
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index eabeec944e7..72d995cd0cc 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -16994,8 +16994,9 @@ performance of the code.  Permissible values for this option are:
 @samp{cortex-a57}, @samp{cortex-a72}, @samp{cortex-a73}, @samp{cortex-a75},
 @samp{cortex-a76}, @samp{cortex-a76ae}, @samp{cortex-a77},
 @samp{cortex-a65}, @samp{cortex-a65ae}, @samp{cortex-a34},
-@samp{ares}, @samp{exynos-m1}, @samp{emag}, @samp{falkor}, @samp{neoverse-e1},
-@samp{neoverse-n1}, @samp{neoverse-n2}, @samp{neoverse-v1}, @samp{qdf24xx},
+@samp{ares}, @samp{exynos-m1}, @samp{emag}, @samp{falkor},
+@samp{neoverse-e1}, @samp{neoverse-n1}, @samp{neoverse-n2},
+@samp{neoverse-v1}, @samp{neoverse-512tvb}, @samp{qdf24xx},
 @samp{saphira}, @samp{phecda}, @samp{xgene1}, @samp{vulcan}, @samp{octeontx},
 @samp{octeontx81},  @samp{octeontx83},
 @samp{octeontx2}, @samp{octeontx2t98}, @samp{octeontx2t96}
@@ -17015,6 +17016,15 @@ The values @samp{cortex-a57.cortex-a53}, @samp{cortex-a72.cortex-a53},
 @samp{cortex-a75.cortex-a55}, @samp{cortex-a76.cortex-a55} specify that GCC
 should tune for a big.LITTLE system.
 
+The value @samp{neoverse-512tvb} specifies that GCC should tune
+for Neoverse cores that (a) implement SVE and (b) have a total vector
+bandwidth of 512 bits per cycle.  In other words, the option tells GCC to
+tune for Neoverse cores that can execute 4 128-bit Advanced SIMD arithmetic
+instructions a cycle and that can execute an equivalent number of SVE
+arithmetic instructions per cycle (2 for 256-bit SVE, 4 for 128-bit SVE).
+This is more general than tuning for a specific core like Neoverse V1
+but is more specific than the default tuning described below.
+
 Additionally on native AArch64 GNU/Linux systems the value
 @samp{native} tunes performance to the host system.  This option has no effect
 if the compiler is unable to recognize the processor of the host system.
@@ -17044,6 +17054,16 @@ by @option{-mtune}).  Where this option is used in conjunction
 with @option{-march} or @option{-mtune}, those options take precedence
 over the appropriate part of this option.
 
+@option{-mcpu=neoverse-512tvb} is special in that it does not refer
+to a specific core, but instead refers to all Neoverse cores that
+(a) implement SVE and (b) have a total vector bandwidth of 512 bits
+a cycle.  Unless overridden by @option{-march},
+@option{-mcpu=neoverse-512tvb} generates code that can run on a
+Neoverse V1 core, since Neoverse V1 is the first Neoverse core with
+these properties.  Unless overridden by @option{-mtune},
+@option{-mcpu=neoverse-512tvb} tunes code in the same way as for
+@option{-mtune=neoverse-512tvb}.
+
 @item -moverride=@var{string}
 @opindex moverride
 Override tuning decisions made by the back-end in response to a
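
Working through the bandwidth figures in the documentation hunk above (the
division is my own restatement of what the text implies):

    total vector bandwidth:  512 bits per cycle
    Advanced SIMD (128-bit): 512 / 128 = 4 arithmetic instructions per cycle
    128-bit SVE:             512 / 128 = 4 arithmetic instructions per cycle
    256-bit SVE:             512 / 256 = 2 arithmetic instructions per cycle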

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-08-17 14:37 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-03 12:03 [PATCH 0/8] aarch64 vector cost tweaks Richard Sandiford
2021-08-03 12:03 ` [PATCH 1/8] aarch64: Turn sve_width tuning field into a bitmask Richard Sandiford
2021-08-03 12:04 ` [PATCH 2/8] aarch64: Add a simple fixed-point class for costing Richard Sandiford
2021-08-03 12:04 ` [PATCH 3/8] aarch64: Split out aarch64_adjust_body_cost_sve Richard Sandiford
2021-08-03 12:05 ` [PATCH 4/8] aarch64: Add gather_load_xNN_cost tuning fields Richard Sandiford
2021-08-03 12:05 ` [PATCH 5/8] aarch64: Tweak the cost of elementwise stores Richard Sandiford
2021-08-04 11:43   ` Richard Biener
2021-08-05 12:04     ` [PATCH] vect: Move costing helpers from aarch64 code Richard Sandiford
2021-08-05 12:11       ` Richard Biener
2021-08-05 13:06         ` Richard Sandiford
2021-08-03 12:06 ` [PATCH 6/8] aarch64: Tweak MLA vector costs Richard Sandiford
2021-08-04 11:45   ` Richard Biener
2021-08-04 12:14     ` Richard Sandiford
2021-08-04 12:19       ` Richard Sandiford
2021-08-03 12:06 ` [PATCH 7/8] aarch64: Restrict issue heuristics to inner vector loop Richard Sandiford
2021-08-03 12:06 ` [PATCH 8/8] aarch64: Add -mtune=neoverse-512tvb Richard Sandiford
2021-08-17 14:37   ` Richard Sandiford
