public inbox for gcc-patches@gcc.gnu.org
* [PATCH] vect: Add a “very cheap” cost model
@ 2020-11-13 18:34 Richard Sandiford
  2020-11-16  8:47 ` Richard Biener
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Sandiford @ 2020-11-13 18:34 UTC (permalink / raw)
  To: gcc-patches

Currently we have three vector cost models: cheap, dynamic and
unlimited.  -O2 -ftree-vectorize uses “cheap” by default, but that's
still relatively aggressive about peeling and aliasing checks,
and can lead to significant code size growth.

This patch adds an even more conservative choice, which for lack of
imagination I've called “very cheap”.  It only allows vectorisation
if the vector code entirely replaces the scalar code.  It also
requires one iteration of the vector loop to pay for itself,
regardless of how often the loop iterates.  (If the vector loop
needs multiple iterations to be beneficial then things are
probably too close to call, and the conservative thing would
be to stick with the scalar code.)
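
To illustrate (a hypothetical example, assuming a vectorisation
factor of 4 and no required alias or alignment checks): a loop like

  void
  f (int *restrict x, int *restrict y)
  {
    for (unsigned int i = 0; i < 1024; ++i)
      x[i] += y[i];
  }

is allowed, because 1024 is a multiple of 4 and so the vector loop
replaces the scalar loop completely.  With a trip count of 1023 the
loop would instead be rejected, unless the target can handle the
tail using partial vectors (as SVE can), since otherwise some scalar
iterations would remain.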

The idea is that this should be suitable for -O2, although the patch
doesn't change any defaults itself.

I tested this by building and running a bunch of workloads for SVE,
with three options:

  (1) -O2
  (2) -O2 -ftree-vectorize -fvect-cost-model=very-cheap
  (3) -O2 -ftree-vectorize [-fvect-cost-model=cheap]

All three builds used the default -msve-vector-bits=scalable and
ran with the minimum vector length of 128 bits, which should give
a worst-case bound for the performance impact.

The workloads included a mixture of microbenchmarks and full
applications.  Because it's quite an eclectic mix, there's not
much point giving exact figures.  The aim was more to get a general
impression.

Code size growth with (2) was much lower than with (3).  Only a
handful of tests increased by more than 5%, and all of them were
microbenchmarks.

In terms of performance, (2) was significantly faster than (1)
on microbenchmarks (as expected) but also on some full apps.
Again, performance only regressed on a handful of tests.

As expected, the performance of (3) vs. (1) and (3) vs. (2) is more
of a mixed bag.  There are several significant improvements with (3)
over (2), but also some (smaller) regressions.  That seems to be in
line with -O2 -ftree-vectorize being a kind of -O2.5.

The patch reorders vect_cost_model so that values are in order
of increasing aggressiveness, which makes it possible to use
range checks.  The value 0 still represents “unlimited”,
so “if (flag_vect_cost_model)” is still a meaningful check.
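
For reference, the reordered enum (from the flag-types.h hunk below)
is:

  enum vect_cost_model {
    VECT_COST_MODEL_VERY_CHEAP = -3,
    VECT_COST_MODEL_CHEAP = -2,
    VECT_COST_MODEL_DYNAMIC = -1,
    VECT_COST_MODEL_UNLIMITED = 0,
    VECT_COST_MODEL_DEFAULT = 1
  };

so “cheap or more conservative” can be tested with
“flag_vect_cost_model <= VECT_COST_MODEL_CHEAP”.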

Tested on aarch64-linux-gnu, arm-linux-gnueabihf and
x86_64-linux-gnu.  OK to install?

Richard


gcc/
	* doc/invoke.texi (-fvect-cost-model): Add a very-cheap model.
	* common.opt (fvect-cost-model=): Add very-cheap as a possible option.
	(fsimd-cost-model=): Likewise.
	(vect_cost_model): Add very-cheap.
	* flag-types.h (vect_cost_model): Add VECT_COST_MODEL_VERY_CHEAP.
	Put the values in order of increasing aggressiveness.
	* tree-vect-data-refs.c (vect_enhance_data_refs_alignment): Use
	range checks when comparing against VECT_COST_MODEL_CHEAP.
	(vect_prune_runtime_alias_test_list): Do not allow any alias
	checks for the very-cheap cost model.
	* tree-vect-loop.c (vect_analyze_loop_costing): Do not allow
	any peeling for the very-cheap cost model.  Also require one
	iteration of the vector loop to pay for itself.

gcc/testsuite/
	* gcc.dg/vect/vect-cost-model-1.c: New test.
	* gcc.dg/vect/vect-cost-model-2.c: Likewise.
	* gcc.dg/vect/vect-cost-model-3.c: Likewise.
	* gcc.dg/vect/vect-cost-model-4.c: Likewise.
	* gcc.dg/vect/vect-cost-model-5.c: Likewise.
	* gcc.dg/vect/vect-cost-model-6.c: Likewise.
---
 gcc/common.opt                                |  7 +++--
 gcc/doc/invoke.texi                           | 11 ++++++--
 gcc/flag-types.h                              | 10 ++++---
 gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c | 11 ++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c | 11 ++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c | 11 ++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c | 11 ++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c | 11 ++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c | 12 +++++++++
 gcc/tree-vect-data-refs.c                     |  8 ++++--
 gcc/tree-vect-loop.c                          | 27 +++++++++++++++++++
 11 files changed, 120 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c

diff --git a/gcc/common.opt b/gcc/common.opt
index 7d0e0d9c88a..6ae613e3743 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -3008,11 +3008,11 @@ Enable basic block vectorization (SLP) on trees.
 
 fvect-cost-model=
 Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) Init(VECT_COST_MODEL_DEFAULT) Optimization
--fvect-cost-model=[unlimited|dynamic|cheap]	Specifies the cost model for vectorization.
+-fvect-cost-model=[unlimited|dynamic|cheap|very-cheap]	Specifies the cost model for vectorization.
 
 fsimd-cost-model=
 Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) Init(VECT_COST_MODEL_UNLIMITED) Optimization
--fsimd-cost-model=[unlimited|dynamic|cheap]	Specifies the vectorization cost model for code marked with a simd directive.
+-fsimd-cost-model=[unlimited|dynamic|cheap|very-cheap]	Specifies the vectorization cost model for code marked with a simd directive.
 
 Enum
 Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer cost model %qs)
@@ -3026,6 +3026,9 @@ Enum(vect_cost_model) String(dynamic) Value(VECT_COST_MODEL_DYNAMIC)
 EnumValue
 Enum(vect_cost_model) String(cheap) Value(VECT_COST_MODEL_CHEAP)
 
+EnumValue
+Enum(vect_cost_model) String(very-cheap) Value(VECT_COST_MODEL_VERY_CHEAP)
+
 fvect-cost-model
 Common Alias(fvect-cost-model=,dynamic,unlimited)
 Enables the dynamic vectorizer cost model.  Preserved for backward compatibility.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 8d0d2136831..2066705ff58 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -11384,7 +11384,8 @@ and @option{-fauto-profile}.
 @item -fvect-cost-model=@var{model}
 @opindex fvect-cost-model
 Alter the cost model used for vectorization.  The @var{model} argument
-should be one of @samp{unlimited}, @samp{dynamic} or @samp{cheap}.
+should be one of @samp{unlimited}, @samp{dynamic}, @samp{cheap} or
+@samp{very-cheap}.
 With the @samp{unlimited} model the vectorized code-path is assumed
 to be profitable while with the @samp{dynamic} model a runtime check
 guards the vectorized code-path to enable it only for iteration
@@ -11392,7 +11393,13 @@ counts that will likely execute faster than when executing the original
 scalar loop.  The @samp{cheap} model disables vectorization of
 loops where doing so would be cost prohibitive for example due to
 required runtime checks for data dependence or alignment but otherwise
-is equal to the @samp{dynamic} model.
+is equal to the @samp{dynamic} model.  The @samp{very-cheap} model only
+allows vectorization if the vector code would entirely replace the
+scalar code that is being vectorized.  For example, if each iteration
+of a vectorized loop would handle exactly four iterations, the
+@samp{very-cheap} model would only allow vectorization if the scalar
+iteration count is known to be a multiple of four.
+
 The default cost model depends on other optimization flags and is
 either @samp{dynamic} or @samp{cheap}.
 
diff --git a/gcc/flag-types.h b/gcc/flag-types.h
index a887c75cfc7..866c7a3c788 100644
--- a/gcc/flag-types.h
+++ b/gcc/flag-types.h
@@ -232,12 +232,14 @@ enum scalar_storage_order_kind {
   SSO_LITTLE_ENDIAN
 };
 
-/* Vectorizer cost-model.  */
+/* Vectorizer cost-model.  Except for DEFAULT, the values are ordered from
+   the most conservative to the least conservative.  */
 enum vect_cost_model {
+  VECT_COST_MODEL_VERY_CHEAP = -3,
+  VECT_COST_MODEL_CHEAP = -2,
+  VECT_COST_MODEL_DYNAMIC = -1,
   VECT_COST_MODEL_UNLIMITED = 0,
-  VECT_COST_MODEL_CHEAP = 1,
-  VECT_COST_MODEL_DYNAMIC = 2,
-  VECT_COST_MODEL_DEFAULT = 3
+  VECT_COST_MODEL_DEFAULT = 1
 };
 
 /* Different instrumentation modes.  */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 0efab495407..18e36c89d14 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -2161,7 +2161,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
         {
           unsigned max_allowed_peel
 	    = param_vect_max_peeling_for_alignment;
-	  if (flag_vect_cost_model == VECT_COST_MODEL_CHEAP)
+	  if (flag_vect_cost_model <= VECT_COST_MODEL_CHEAP)
 	    max_allowed_peel = 0;
           if (max_allowed_peel != (unsigned)-1)
             {
@@ -2259,7 +2259,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
   do_versioning
     = (optimize_loop_nest_for_speed_p (loop)
        && !loop->inner /* FORNOW */
-       && flag_vect_cost_model != VECT_COST_MODEL_CHEAP);
+       && flag_vect_cost_model > VECT_COST_MODEL_CHEAP);
 
   if (do_versioning)
     {
@@ -3682,6 +3682,10 @@ vect_prune_runtime_alias_test_list (loop_vec_info loop_vinfo)
   unsigned int count = (comp_alias_ddrs.length ()
 			+ check_unequal_addrs.length ());
 
+  if (count && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP)
+    return opt_result::failure_at
+      (vect_location, "would need a runtime alias check\n");
+
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
 		     "improved number of alias checks from %d to %d\n",
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 39b7319e825..3b020bd6f0a 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1827,6 +1827,19 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
 	}
     }
 
+  /* If using the "very cheap" model, reject cases in which we'd keep
+     a copy of the scalar code (even if we might be able to vectorize it).  */
+  if (flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
+      && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
+	  || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	  || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "some scalar iterations would need to be peeled\n");
+      return 0;
+    }
+
   int min_profitable_iters, min_profitable_estimate;
   vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
 				      &min_profitable_estimate);
@@ -1885,6 +1898,20 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
       min_profitable_estimate = min_profitable_iters;
     }
 
+  /* If the vector loop needs multiple iterations to be beneficial then
+     things are probably too close to call, and the conservative thing
+     would be to stick with the scalar code.  */
+  if (flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
+      && min_profitable_estimate >= (int) vect_vf_for_cost (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "one iteration of the vector loop would not"
+			 " be cheaper than the equivalent number of"
+			 " iterations of the scalar loop\n");
+      return 0;
+    }
+
   HOST_WIDE_INT estimated_niter;
 
   /* If we are vectorizing an epilogue then we know the maximum number of
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
new file mode 100644
index 00000000000..0737da5d671
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
+
+void
+f (int *x, int *y)
+{
+  for (unsigned int i = 0; i < 1024; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
new file mode 100644
index 00000000000..fa9bdb607b2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" } */
+
+void
+f (int *x, int *y)
+{
+  for (unsigned int i = 0; i < 1024; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump-not {LOOP VECTORIZED} vect { target vect_int } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
new file mode 100644
index 00000000000..d7c6cfd2049
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
+
+void
+f (int *restrict x, int *restrict y)
+{
+  for (unsigned int i = 0; i < 1024; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
new file mode 100644
index 00000000000..78129ecee6a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" } */
+
+void
+f (int *restrict x, int *restrict y)
+{
+  for (unsigned int i = 0; i < 1024; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
new file mode 100644
index 00000000000..536ec0a3cda
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
+
+void
+f (int *restrict x, int *restrict y)
+{
+  for (unsigned int i = 0; i < 1023; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c
new file mode 100644
index 00000000000..552febb5fee
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" } */
+
+void
+f (int *restrict x, int *restrict y)
+{
+  for (unsigned int i = 0; i < 1023; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target { vect_int && vect_partial_vectors_usage_2 } } } } */
+/* { dg-final { scan-tree-dump-not {LOOP VECTORIZED} vect { target { vect_int && { ! vect_partial_vectors_usage_2 } } } } } */
-- 
2.17.1


* Re: [PATCH] vect: Add a “very cheap” cost model
  2020-11-13 18:34 [PATCH] vect: Add a “very cheap” cost model Richard Sandiford
@ 2020-11-16  8:47 ` Richard Biener
  2020-11-16  9:58   ` Richard Sandiford
  2020-11-21 20:30   ` Jan Hubicka
  0 siblings, 2 replies; 8+ messages in thread
From: Richard Biener @ 2020-11-16  8:47 UTC (permalink / raw)
  To: Richard Sandiford, GCC Patches

On Fri, Nov 13, 2020 at 7:35 PM Richard Sandiford via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Currently we have three vector cost models: cheap, dynamic and
> unlimited.  -O2 -ftree-vectorize uses “cheap” by default, but that's
> still relatively aggressive about peeling and aliasing checks,
> and can lead to significant code size growth.
>
> This patch adds an even more conservative choice, which for lack of
> imagination I've called “very cheap”.  It only allows vectorisation
> if the vector code entirely replaces the scalar code.  It also
> requires one iteration of the vector loop to pay for itself,
> regardless of how often the loop iterates.  (If the vector loop
> needs multiple iterations to be beneficial then things are
> probably too close to call, and the conservative thing would
> be to stick with the scalar code.)
>
> The idea is that this should be suitable for -O2, although the patch
> doesn't change any defaults itself.
>
> I tested this by building and running a bunch of workloads for SVE,
> with three options:
>
>   (1) -O2
>   (2) -O2 -ftree-vectorize -fvect-cost-model=very-cheap
>   (3) -O2 -ftree-vectorize [-fvect-cost-model=cheap]
>
> All three builds used the default -msve-vector-bits=scalable and
> ran with the minimum vector length of 128 bits, which should give
> a worst-case bound for the performance impact.
>
> The workloads included a mixture of microbenchmarks and full
> applications.  Because it's quite an eclectic mix, there's not
> much point giving exact figures.  The aim was more to get a general
> impression.
>
> Code size growth with (2) was much lower than with (3).  Only a
> handful of tests increased by more than 5%, and all of them were
> microbenchmarks.
>
> In terms of performance, (2) was significantly faster than (1)
> on microbenchmarks (as expected) but also on some full apps.
> Again, performance only regressed on a handful of tests.
>
> As expected, the performance of (3) vs. (1) and (3) vs. (2) is more
> of a mixed bag.  There are several significant improvements with (3)
> over (2), but also some (smaller) regressions.  That seems to be in
> line with -O2 -ftree-vectorize being a kind of -O2.5.

So previous attempts at enabling vectorization at -O2 also factored
in compile-time requirements.  We've looked mainly at SPEC and
there even the current "cheap" model doesn't fare very well IIRC
and costs quite some compile-time and code-size.  Turning down
vectorization even more will have even less impact on performance
but the compile-time cost will likely not shrink very much.

I think we need ways to detect candidates that will end up
cheap or very cheap without actually doing all of the analysis
first.

> The patch reorders vect_cost_model so that values are in order
> of increasing aggressiveness, which makes it possible to use
> range checks.  The value 0 still represents “unlimited”,
> so “if (flag_vect_cost_model)” is still a meaningful check.
>
> Tested on aarch64-linux-gnu, arm-linux-gnueabihf and
> x86_64-linux-gnu.  OK to install?

Does the patch also vectorize with SVE loops that have
unknown loop bound?  The documentation isn't entirely
conclusive there.  Iff the iteration count is a multiple of
two and the target can vectorize the loop with both
VF 2 and VF 4 but VF 4 would be better if we'd use
the 'cheap' cost model, does 'very-cheap' not vectorize
the loop or does it choose VF 2?

In itself the patch is reasonable, thus OK.

Thanks,
Richard.

> Richard
>
>
> gcc/
>         * doc/invoke.texi (-fvect-cost-model): Add a very-cheap model.
>         * common.opt (fvect-cost-model=): Add very-cheap as a possible option.
>         (fsimd-cost-model=): Likewise.
>         (vect_cost_model): Add very-cheap.
>         * flag-types.h (vect_cost_model): Add VECT_COST_MODEL_VERY_CHEAP.
>         Put the values in order of increasing aggressiveness.
>         * tree-vect-data-refs.c (vect_enhance_data_refs_alignment): Use
>         range checks when comparing against VECT_COST_MODEL_CHEAP.
>         (vect_prune_runtime_alias_test_list): Do not allow any alias
>         checks for the very-cheap cost model.
>         * tree-vect-loop.c (vect_analyze_loop_costing): Do not allow
>         any peeling for the very-cheap cost model.  Also require one
>         iteration of the vector loop to pay for itself.
>
> gcc/testsuite/
>         * gcc.dg/vect/vect-cost-model-1.c: New test.
>         * gcc.dg/vect/vect-cost-model-2.c: Likewise.
>         * gcc.dg/vect/vect-cost-model-3.c: Likewise.
>         * gcc.dg/vect/vect-cost-model-4.c: Likewise.
>         * gcc.dg/vect/vect-cost-model-5.c: Likewise.
>         * gcc.dg/vect/vect-cost-model-6.c: Likewise.
> ---
>  gcc/common.opt                                |  7 +++--
>  gcc/doc/invoke.texi                           | 11 ++++++--
>  gcc/flag-types.h                              | 10 ++++---
>  gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c | 11 ++++++++
>  gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c | 11 ++++++++
>  gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c | 11 ++++++++
>  gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c | 11 ++++++++
>  gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c | 11 ++++++++
>  gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c | 12 +++++++++
>  gcc/tree-vect-data-refs.c                     |  8 ++++--
>  gcc/tree-vect-loop.c                          | 27 +++++++++++++++++++
>  11 files changed, 120 insertions(+), 10 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c
>
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 7d0e0d9c88a..6ae613e3743 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -3008,11 +3008,11 @@ Enable basic block vectorization (SLP) on trees.
>
>  fvect-cost-model=
>  Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) Init(VECT_COST_MODEL_DEFAULT) Optimization
> --fvect-cost-model=[unlimited|dynamic|cheap]    Specifies the cost model for vectorization.
> +-fvect-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the cost model for vectorization.
>
>  fsimd-cost-model=
>  Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) Init(VECT_COST_MODEL_UNLIMITED) Optimization
> --fsimd-cost-model=[unlimited|dynamic|cheap]    Specifies the vectorization cost model for code marked with a simd directive.
> +-fsimd-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the vectorization cost model for code marked with a simd directive.
>
>  Enum
>  Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer cost model %qs)
> @@ -3026,6 +3026,9 @@ Enum(vect_cost_model) String(dynamic) Value(VECT_COST_MODEL_DYNAMIC)
>  EnumValue
>  Enum(vect_cost_model) String(cheap) Value(VECT_COST_MODEL_CHEAP)
>
> +EnumValue
> +Enum(vect_cost_model) String(very-cheap) Value(VECT_COST_MODEL_VERY_CHEAP)
> +
>  fvect-cost-model
>  Common Alias(fvect-cost-model=,dynamic,unlimited)
>  Enables the dynamic vectorizer cost model.  Preserved for backward compatibility.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 8d0d2136831..2066705ff58 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -11384,7 +11384,8 @@ and @option{-fauto-profile}.
>  @item -fvect-cost-model=@var{model}
>  @opindex fvect-cost-model
>  Alter the cost model used for vectorization.  The @var{model} argument
> -should be one of @samp{unlimited}, @samp{dynamic} or @samp{cheap}.
> +should be one of @samp{unlimited}, @samp{dynamic}, @samp{cheap} or
> +@samp{very-cheap}.
>  With the @samp{unlimited} model the vectorized code-path is assumed
>  to be profitable while with the @samp{dynamic} model a runtime check
>  guards the vectorized code-path to enable it only for iteration
> @@ -11392,7 +11393,13 @@ counts that will likely execute faster than when executing the original
>  scalar loop.  The @samp{cheap} model disables vectorization of
>  loops where doing so would be cost prohibitive for example due to
>  required runtime checks for data dependence or alignment but otherwise
> -is equal to the @samp{dynamic} model.
> +is equal to the @samp{dynamic} model.  The @samp{very-cheap} model only
> +allows vectorization if the vector code would entirely replace the
> +scalar code that is being vectorized.  For example, if each iteration
> +of a vectorized loop would handle exactly four iterations, the
> +@samp{very-cheap} model would only allow vectorization if the scalar
> +iteration count is known to be a multiple of four.
> +
>  The default cost model depends on other optimization flags and is
>  either @samp{dynamic} or @samp{cheap}.
>
> diff --git a/gcc/flag-types.h b/gcc/flag-types.h
> index a887c75cfc7..866c7a3c788 100644
> --- a/gcc/flag-types.h
> +++ b/gcc/flag-types.h
> @@ -232,12 +232,14 @@ enum scalar_storage_order_kind {
>    SSO_LITTLE_ENDIAN
>  };
>
> -/* Vectorizer cost-model.  */
> +/* Vectorizer cost-model.  Except for DEFAULT, the values are ordered from
> +   the most conservative to the least conservative.  */
>  enum vect_cost_model {
> +  VECT_COST_MODEL_VERY_CHEAP = -3,
> +  VECT_COST_MODEL_CHEAP = -2,
> +  VECT_COST_MODEL_DYNAMIC = -1,
>    VECT_COST_MODEL_UNLIMITED = 0,
> -  VECT_COST_MODEL_CHEAP = 1,
> -  VECT_COST_MODEL_DYNAMIC = 2,
> -  VECT_COST_MODEL_DEFAULT = 3
> +  VECT_COST_MODEL_DEFAULT = 1
>  };
>
>  /* Different instrumentation modes.  */
> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> index 0efab495407..18e36c89d14 100644
> --- a/gcc/tree-vect-data-refs.c
> +++ b/gcc/tree-vect-data-refs.c
> @@ -2161,7 +2161,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
>          {
>            unsigned max_allowed_peel
>             = param_vect_max_peeling_for_alignment;
> -         if (flag_vect_cost_model == VECT_COST_MODEL_CHEAP)
> +         if (flag_vect_cost_model <= VECT_COST_MODEL_CHEAP)
>             max_allowed_peel = 0;
>            if (max_allowed_peel != (unsigned)-1)
>              {
> @@ -2259,7 +2259,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
>    do_versioning
>      = (optimize_loop_nest_for_speed_p (loop)
>         && !loop->inner /* FORNOW */
> -       && flag_vect_cost_model != VECT_COST_MODEL_CHEAP);
> +       && flag_vect_cost_model > VECT_COST_MODEL_CHEAP);
>
>    if (do_versioning)
>      {
> @@ -3682,6 +3682,10 @@ vect_prune_runtime_alias_test_list (loop_vec_info loop_vinfo)
>    unsigned int count = (comp_alias_ddrs.length ()
>                         + check_unequal_addrs.length ());
>
> +  if (count && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP)
> +    return opt_result::failure_at
> +      (vect_location, "would need a runtime alias check\n");
> +
>    if (dump_enabled_p ())
>      dump_printf_loc (MSG_NOTE, vect_location,
>                      "improved number of alias checks from %d to %d\n",
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 39b7319e825..3b020bd6f0a 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -1827,6 +1827,19 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
>         }
>      }
>
> +  /* If using the "very cheap" model, reject cases in which we'd keep
> +     a copy of the scalar code (even if we might be able to vectorize it).  */
> +  if (flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
> +      && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> +         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> +         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> +    {
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "some scalar iterations would need to be peeled\n");
> +      return 0;
> +    }
> +
>    int min_profitable_iters, min_profitable_estimate;
>    vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
>                                       &min_profitable_estimate);
> @@ -1885,6 +1898,20 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
>        min_profitable_estimate = min_profitable_iters;
>      }
>
> +  /* If the vector loop needs multiple iterations to be beneficial then
> +     things are probably too close to call, and the conservative thing
> +     would be to stick with the scalar code.  */
> +  if (flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
> +      && min_profitable_estimate >= (int) vect_vf_for_cost (loop_vinfo))
> +    {
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "one iteration of the vector loop would not"
> +                        " be cheaper than the equivalent number of"
> +                        " iterations of the scalar loop\n");
> +      return 0;
> +    }
> +
>    HOST_WIDE_INT estimated_niter;
>
>    /* If we are vectorizing an epilogue then we know the maximum number of
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
> new file mode 100644
> index 00000000000..0737da5d671
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
> +
> +void
> +f (int *x, int *y)
> +{
> +  for (unsigned int i = 0; i < 1024; ++i)
> +    x[i] += y[i];
> +}
> +
> +/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
> new file mode 100644
> index 00000000000..fa9bdb607b2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" } */
> +
> +void
> +f (int *x, int *y)
> +{
> +  for (unsigned int i = 0; i < 1024; ++i)
> +    x[i] += y[i];
> +}
> +
> +/* { dg-final { scan-tree-dump-not {LOOP VECTORIZED} vect { target vect_int } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
> new file mode 100644
> index 00000000000..d7c6cfd2049
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
> +
> +void
> +f (int *restrict x, int *restrict y)
> +{
> +  for (unsigned int i = 0; i < 1024; ++i)
> +    x[i] += y[i];
> +}
> +
> +/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
> new file mode 100644
> index 00000000000..78129ecee6a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" } */
> +
> +void
> +f (int *restrict x, int *restrict y)
> +{
> +  for (unsigned int i = 0; i < 1024; ++i)
> +    x[i] += y[i];
> +}
> +
> +/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
> new file mode 100644
> index 00000000000..536ec0a3cda
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
> +
> +void
> +f (int *restrict x, int *restrict y)
> +{
> +  for (unsigned int i = 0; i < 1023; ++i)
> +    x[i] += y[i];
> +}
> +
> +/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c
> new file mode 100644
> index 00000000000..552febb5fee
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" } */
> +
> +void
> +f (int *restrict x, int *restrict y)
> +{
> +  for (unsigned int i = 0; i < 1023; ++i)
> +    x[i] += y[i];
> +}
> +
> +/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target { vect_int && vect_partial_vectors_usage_2 } } } } */
> +/* { dg-final { scan-tree-dump-not {LOOP VECTORIZED} vect { target { vect_int && { ! vect_partial_vectors_usage_2 } } } } } */
> --
> 2.17.1
>

* Re: [PATCH] vect: Add a “very cheap” cost model
  2020-11-16  8:47 ` Richard Biener
@ 2020-11-16  9:58   ` Richard Sandiford
  2020-11-16 11:23     ` Richard Biener
  2020-11-21 20:30   ` Jan Hubicka
  1 sibling, 1 reply; 8+ messages in thread
From: Richard Sandiford @ 2020-11-16  9:58 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

Richard Biener <richard.guenther@gmail.com> writes:
> On Fri, Nov 13, 2020 at 7:35 PM Richard Sandiford via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> Currently we have three vector cost models: cheap, dynamic and
>> unlimited.  -O2 -ftree-vectorize uses “cheap” by default, but that's
>> still relatively aggressive about peeling and aliasing checks,
>> and can lead to significant code size growth.
>>
>> This patch adds an even more conservative choice, which for lack of
>> imagination I've called “very cheap”.  It only allows vectorisation
>> if the vector code entirely replaces the scalar code.  It also
>> requires one iteration of the vector loop to pay for itself,
>> regardless of how often the loop iterates.  (If the vector loop
>> needs multiple iterations to be beneficial then things are
>> probably too close to call, and the conservative thing would
>> be to stick with the scalar code.)
>>
>> The idea is that this should be suitable for -O2, although the patch
>> doesn't change any defaults itself.
>>
>> I tested this by building and running a bunch of workloads for SVE,
>> with three options:
>>
>>   (1) -O2
>>   (2) -O2 -ftree-vectorize -fvect-cost-model=very-cheap
>>   (3) -O2 -ftree-vectorize [-fvect-cost-model=cheap]
>>
>> All three builds used the default -msve-vector-bits=scalable and
>> ran with the minimum vector length of 128 bits, which should give
>> a worst-case bound for the performance impact.
>>
>> The workloads included a mixture of microbenchmarks and full
>> applications.  Because it's quite an eclectic mix, there's not
>> much point giving exact figures.  The aim was more to get a general
>> impression.
>>
>> Code size growth with (2) was much lower than with (3).  Only a
>> handful of tests increased by more than 5%, and all of them were
>> microbenchmarks.
>>
>> In terms of performance, (2) was significantly faster than (1)
>> on microbenchmarks (as expected) but also on some full apps.
>> Again, performance only regressed on a handful of tests.
>>
>> As expected, the performance of (3) vs. (1) and (3) vs. (2) is more
>> of a mixed bag.  There are several significant improvements with (3)
>> over (2), but also some (smaller) regressions.  That seems to be in
>> line with -O2 -ftree-vectorize being a kind of -O2.5.
>
> So previous attempts at enabling vectorization at -O2 also factored
> in compile-time requirements.  We've looked mainly at SPEC and
> there even the current "cheap" model doesn't fare very well IIRC
> and costs quite some compile-time and code-size.

Yeah, that seems to match what I was seeing with the cheap model:
the size could increase quite significantly.

> Turning down vectorization even more will have even less impact on
> performance but the compile-time cost will likely not shrink very
> much.

Agreed.  We've already done most of the work by the time we decide not
to go ahead.

I didn't really measure compile time TBH.  This was mostly written
from an SVE point of view: when SVE is enabled, vectorisation is
important enough that it's IMO worth paying the compile-time cost.

> I think we need ways to detect candidates that will end up
> cheap or very cheap without actually doing all of the analysis
> first.

Yeah, that sounds good if it's doable.  But with SVE, the aim
is to reduce the number of cases in which a loop would fail to
be vectorised on cost grounds.  I hope we'll be able to do more
of that for GCC 12.

E.g. one of the uses of the SVE2 WHILERW and WHILEWR instructions
is to clamp the amount of work that the vector loop does based on
runtime aliases.  We don't yet use it for that (it's still on
the TODO list), but once we do, runtime aliases would often not
be a problem even for the very cheap model.  And SVE already removes
two of the other main reasons for aborting early: the need to peel
for alignment and the need to peel for niters.

There are cases like peeling for gaps that should produce scalar code
even with SVE, but they probably aren't common enough to have a
significant impact on compile time.

So in a sense, the aim with SVE is to make that kind of early-out test
redundant as much as possible.

>> The patch reorders vect_cost_model so that values are in order
>> of increasing aggressiveness, which makes it possible to use
>> range checks.  The value 0 still represents “unlimited”,
>> so “if (flag_vect_cost_model)” is still a meaningful check.
>>
>> Tested on aarch64-linux-gnu, arm-linux-gnueabihf and
>> x86_64-linux-gnu.  OK to install?
>
> Does the patch also vectorize with SVE loops that have
> unknown loop bound?  The documentation isn't entirely
> conclusive there.

Yeah, for SVE it vectorises.  How about changing:

  For example, if each iteration of a vectorized loop would handle
  exactly four iterations, …

to:

  For example, if each iteration of a vectorized loop could only
  handle exactly four iterations of the original scalar loop, …

?

> Iff the iteration count is a multiple of two and the target can
> vectorize the loop with both VF 2 and VF 4 but VF 4 would be better if
> we'd use the 'cheap' cost model, does 'very-cheap' not vectorize the
> loop or does it choose VF 2?

It would choose VF 2, if that's still a win over scalar code.

> In itself the patch is reasonable, thus OK.

Thanks.

Richard

* Re: [PATCH] vect: Add a “very cheap” cost model
  2020-11-16  9:58   ` Richard Sandiford
@ 2020-11-16 11:23     ` Richard Biener
  2020-11-19 12:04       ` Richard Sandiford
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Biener @ 2020-11-16 11:23 UTC (permalink / raw)
  To: Richard Biener, GCC Patches, Richard Sandiford

On Mon, Nov 16, 2020 at 10:58 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Fri, Nov 13, 2020 at 7:35 PM Richard Sandiford via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> Currently we have three vector cost models: cheap, dynamic and
> >> unlimited.  -O2 -ftree-vectorize uses “cheap” by default, but that's
> >> still relatively aggressive about peeling and aliasing checks,
> >> and can lead to significant code size growth.
> >>
> >> This patch adds an even more conservative choice, which for lack of
> >> imagination I've called “very cheap”.  It only allows vectorisation
> >> if the vector code entirely replaces the scalar code.  It also
> >> requires one iteration of the vector loop to pay for itself,
> >> regardless of how often the loop iterates.  (If the vector loop
> >> needs multiple iterations to be beneficial then things are
> >> probably too close to call, and the conservative thing would
> >> be to stick with the scalar code.)
> >>
> >> The idea is that this should be suitable for -O2, although the patch
> >> doesn't change any defaults itself.
> >>
> >> I tested this by building and running a bunch of workloads for SVE,
> >> with three options:
> >>
> >>   (1) -O2
> >>   (2) -O2 -ftree-vectorize -fvect-cost-model=very-cheap
> >>   (3) -O2 -ftree-vectorize [-fvect-cost-model=cheap]
> >>
> >> All three builds used the default -msve-vector-bits=scalable and
> >> ran with the minimum vector length of 128 bits, which should give
> >> a worst-case bound for the performance impact.
> >>
> >> The workloads included a mixture of microbenchmarks and full
> >> applications.  Because it's quite an eclectic mix, there's not
> >> much point giving exact figures.  The aim was more to get a general
> >> impression.
> >>
> >> Code size growth with (2) was much lower than with (3).  Only a
> >> handful of tests increased by more than 5%, and all of them were
> >> microbenchmarks.
> >>
> >> In terms of performance, (2) was significantly faster than (1)
> >> on microbenchmarks (as expected) but also on some full apps.
> >> Again, performance only regressed on a handful of tests.
> >>
> >> As expected, the performance of (3) vs. (1) and (3) vs. (2) is more
> >> of a mixed bag.  There are several significant improvements with (3)
> >> over (2), but also some (smaller) regressions.  That seems to be in
> >> line with -O2 -ftree-vectorize being a kind of -O2.5.
> >
> > So previous attempts at enabling vectorization at -O2 also factored
> > in compile-time requirements.  We've looked mainly at SPEC and
> > there even the current "cheap" model doesn't fare very well IIRC
> > and costs quite some compile-time and code-size.
>
> Yeah, that seems to match what I was seeing with the cheap model:
> the size could increase quite significantly.
>
> > Turning down vectorization even more will have even less impact on
> > performance but the compile-time cost will likely not shrink very
> > much.
>
> Agreed.  We've already done most of the work by the time we decide not
> to go ahead.
>
> I didn't really measure compile time TBH.  This was mostly written
> from an SVE point of view: when SVE is enabled, vectorisation is
> important enough that it's IMO worth paying the compile-time cost.
>
> > I think we need ways to detect candidates that will end up
> > cheap or very cheap without actually doing all of the analysis
> > first.
>
> Yeah, that sounds good if it's doable.  But with SVE, the aim
> is to reduce the number of cases in which a loop would fail to
> be vectorised on cost grounds.  I hope we'll be able to do more
> of that for GCC 12.
>
> E.g. one of the uses of the SVE2 WHILERW and WHILEWR instructions
> is to clamp the amount of work that the vector loop does based on
> runtime aliases.  We don't yet use it for that (it's still on
> the TODO list), but once we do, runtime aliases would often not
> be a problem even for the very cheap model.  And SVE already removes
> two of the other main reasons for aborting early: the need to peel
> for alignment and the need to peel for niters.
>
> There are cases like peeling for gaps that should produce scalar code
> even with SVE, but they probably aren't common enough to have a
> significant impact on compile time.
>
> So in a sense, the aim with SVE is to make that kind of early-out test
> redundant as much as possible.
>
> >> The patch reorders vect_cost_model so that values are in order
> >> of increasing aggressiveness, which makes it possible to use
> >> range checks.  The value 0 still represents “unlimited”,
> >> so “if (flag_vect_cost_model)” is still a meaningful check.
> >>
> >> Tested on aarch64-linux-gnu, arm-linux-gnueabihf and
> >> x86_64-linux-gnu.  OK to install?
> >
> > Does the patch also vectorize with SVE loops that have
> > unknown loop bound?  The documentation isn't entirely
> > conclusive there.
>
> Yeah, for SVE it vectorises.  How about changing:
>
>   For example, if each iteration of a vectorized loop would handle
>   exactly four iterations, …
>
> to:
>
>   For example, if each iteration of a vectorized loop could only
>   handle exactly four iterations of the original scalar loop, …
>
> ?

Yeah, guess that's better.

>
> > Iff the iteration count is a multiple of two and the target can
> > vectorize the loop with both VF 2 and VF 4 but VF 4 would be better if
> > we'd use the 'cheap' cost model, does 'very-cheap' not vectorize the
> > loop or does it choose VF 2?
>
> It would choose VF 2, if that's still a win over scalar code.

OK, that's what I expected.  The VF iteration is one source of
compile-time that we might want to avoid somehow ... on
x86_64 knowing the precise number of constant iterations
should allow to only pick a subset of vector modes based on
largest_pow2_factor or so?  Or maybe just use the preferred
SIMD mode for cheap/very-cheap?  (maybe pass down
the cost model kind to the target hook so targets can decide
for themselves here)

> > In itself the patch is reasonable, thus OK.
>
> Thanks.
>
> Richard

* Re: [PATCH] vect: Add a “very cheap” cost model
  2020-11-16 11:23     ` Richard Biener
@ 2020-11-19 12:04       ` Richard Sandiford
  2020-11-19 14:08         ` Richard Biener
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Sandiford @ 2020-11-19 12:04 UTC (permalink / raw)
  To: Richard Biener via Gcc-patches

Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Mon, Nov 16, 2020 at 10:58 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>> > Does the patch also vectorize with SVE loops that have
>> > unknown loop bound?  The documentation isn't entirely
>> > conclusive there.
>>
>> Yeah, for SVE it vectorises.  How about changing:
>>
>>   For example, if each iteration of a vectorized loop would handle
>>   exactly four iterations, …
>>
>> to:
>>
>>   For example, if each iteration of a vectorized loop could only
>>   handle exactly four iterations of the original scalar loop, …
>>
>> ?
>
> Yeah, guess that's better.
>
>>
>> > Iff the iteration count is a multiple of two and the target can
>> > vectorize the loop with both VF 2 and VF 4 but VF 4 would be better if
>> > we'd use the 'cheap' cost model, does 'very-cheap' not vectorize the
>> > loop or does it choose VF 2?
>>
>> It would choose VF 2, if that's still a win over scalar code.
>
> OK, that's what I expected.  The VF iteration is one source of
> compile-time that we might want to avoid somehow ... on
> x86_64 knowing the precise number of constant iterations
> should allow to only pick a subset of vector modes based on
> largest_pow2_factor or so?  Or maybe just use the preferred
> SIMD mode for cheap/very-cheap?  (maybe pass down
> the cost model kind to the target hook so targets can decide
> for themselves here)

On the preferred simd mode thing: TBH, I'd prefer to get rid
of that hook one day and just rely on autovectorize_vector_modes.

The difficulty with adding an early check is that we don't know ahead
of time which types of scalar element a loop operates on: we only find
that out on the fly during the first analysis of the loop.  The check
would also depend on SLP grouping: we can use a vector of 4 ints to
handle 2 iterations of the scalar loop if the ints are in an SLP
group of size 2.

So I agree it would be nice to have early-outs, but I think we'd have
to restructure things first.  E.g. maybe we could do some “cheap” initial
analysis that checks for basic vectorisability, records which scalar
elements are used by the loop, and records how big the containing SLP
groups might be (based on optimistic assumptions).  Then we can use
that to prefilter the modes we try (perhaps all the way down to no modes).
I guess that's conceptually similar to building an SLP graph though.

Does the attached look OK?  I've included a version of the updated
wording above.  I also changed this condition to use “>” rather
than “>=”:

  /* If the vector loop needs multiple iterations to be beneficial then
     things are probably too close to call, and the conservative thing
     would be to stick with the scalar code.  */
  if (flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
      && min_profitable_estimate > (int) vect_vf_for_cost (loop_vinfo))

since when min_profitable_estimate == min_profitable_iters
we'll have done:

  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
      && min_profitable_iters < (assumed_vf + peel_iters_prologue))
    /* We want the vectorized loop to execute at least once.  */
    min_profitable_iters = assumed_vf + peel_iters_prologue;
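
(Concretely, assuming a VF of 4 and no prologue peeling: the clamp
above forces min_profitable_iters up to 4 even when the raw estimate
was lower, so min_profitable_estimate == 4 can simply reflect that
floor rather than a genuine break-even point; using “>” avoids
rejecting such loops.)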

I also tried to make vect-cost-model-4.c more resilient on targets
that require alignment.

Thanks,
Richard


gcc/
	* doc/invoke.texi (-fvect-cost-model): Add a very-cheap model.
	* common.opt (fvect-cost-model=): Add very-cheap as a possible option.
	(fsimd-cost-model=): Likewise.
	(vect_cost_model): Add very-cheap.
	* flag-types.h (vect_cost_model): Add VECT_COST_MODEL_VERY_CHEAP.
	Put the values in order of increasing aggressiveness.
	* tree-vect-data-refs.c (vect_enhance_data_refs_alignment): Use
	range checks when comparing against VECT_COST_MODEL_CHEAP.
	(vect_prune_runtime_alias_test_list): Do not allow any alias
	checks for the very-cheap cost model.
	* tree-vect-loop.c (vect_analyze_loop_costing): Do not allow
	any peeling for the very-cheap cost model.  Also require one
	iteration of the vector loop to pay for itself.

gcc/testsuite/
	* gcc.dg/vect/vect-cost-model-1.c: New test.
	* gcc.dg/vect/vect-cost-model-2.c: Likewise.
	* gcc.dg/vect/vect-cost-model-3.c: Likewise.
	* gcc.dg/vect/vect-cost-model-4.c: Likewise.
	* gcc.dg/vect/vect-cost-model-5.c: Likewise.
	* gcc.dg/vect/vect-cost-model-6.c: Likewise.
---
 gcc/common.opt                                |  7 +++--
 gcc/doc/invoke.texi                           | 12 +++++++--
 gcc/flag-types.h                              | 10 ++++---
 gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c | 11 ++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c | 11 ++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c | 11 ++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c | 13 +++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c | 11 ++++++++
 gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c | 12 +++++++++
 gcc/tree-vect-data-refs.c                     |  8 ++++--
 gcc/tree-vect-loop.c                          | 27 +++++++++++++++++++
 11 files changed, 123 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c

diff --git a/gcc/common.opt b/gcc/common.opt
index fe39b3dee9f..ca8a2690799 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -3020,11 +3020,11 @@ Enable basic block vectorization (SLP) on trees.
 
 fvect-cost-model=
 Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) Init(VECT_COST_MODEL_DEFAULT) Optimization
--fvect-cost-model=[unlimited|dynamic|cheap]	Specifies the cost model for vectorization.
+-fvect-cost-model=[unlimited|dynamic|cheap|very-cheap]	Specifies the cost model for vectorization.
 
 fsimd-cost-model=
 Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) Init(VECT_COST_MODEL_UNLIMITED) Optimization
--fsimd-cost-model=[unlimited|dynamic|cheap]	Specifies the vectorization cost model for code marked with a simd directive.
+-fsimd-cost-model=[unlimited|dynamic|cheap|very-cheap]	Specifies the vectorization cost model for code marked with a simd directive.
 
 Enum
 Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer cost model %qs)
@@ -3038,6 +3038,9 @@ Enum(vect_cost_model) String(dynamic) Value(VECT_COST_MODEL_DYNAMIC)
 EnumValue
 Enum(vect_cost_model) String(cheap) Value(VECT_COST_MODEL_CHEAP)
 
+EnumValue
+Enum(vect_cost_model) String(very-cheap) Value(VECT_COST_MODEL_VERY_CHEAP)
+
 fvect-cost-model
 Common Alias(fvect-cost-model=,dynamic,unlimited)
 Enables the dynamic vectorizer cost model.  Preserved for backward compatibility.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 3510a54c6c4..07232c6b33d 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -11440,7 +11440,8 @@ and @option{-fauto-profile}.
 @item -fvect-cost-model=@var{model}
 @opindex fvect-cost-model
 Alter the cost model used for vectorization.  The @var{model} argument
-should be one of @samp{unlimited}, @samp{dynamic} or @samp{cheap}.
+should be one of @samp{unlimited}, @samp{dynamic}, @samp{cheap} or
+@samp{very-cheap}.
 With the @samp{unlimited} model the vectorized code-path is assumed
 to be profitable while with the @samp{dynamic} model a runtime check
 guards the vectorized code-path to enable it only for iteration
@@ -11448,7 +11449,14 @@ counts that will likely execute faster than when executing the original
 scalar loop.  The @samp{cheap} model disables vectorization of
 loops where doing so would be cost prohibitive for example due to
 required runtime checks for data dependence or alignment but otherwise
-is equal to the @samp{dynamic} model.
+is equal to the @samp{dynamic} model.  The @samp{very-cheap} model only
+allows vectorization if the vector code would entirely replace the
+scalar code that is being vectorized.  For example, if each iteration
+of a vectorized loop would only be able to handle exactly four iterations
+of the scalar loop, the @samp{very-cheap} model would only allow
+vectorization if the scalar iteration count is known to be a multiple
+of four.
+
 The default cost model depends on other optimization flags and is
 either @samp{dynamic} or @samp{cheap}.
 
diff --git a/gcc/flag-types.h b/gcc/flag-types.h
index 648ed096e30..0dbab19943c 100644
--- a/gcc/flag-types.h
+++ b/gcc/flag-types.h
@@ -232,12 +232,14 @@ enum scalar_storage_order_kind {
   SSO_LITTLE_ENDIAN
 };
 
-/* Vectorizer cost-model.  */
+/* Vectorizer cost-model.  Except for DEFAULT, the values are ordered from
+   the most conservative to the least conservative.  */
 enum vect_cost_model {
+  VECT_COST_MODEL_VERY_CHEAP = -3,
+  VECT_COST_MODEL_CHEAP = -2,
+  VECT_COST_MODEL_DYNAMIC = -1,
   VECT_COST_MODEL_UNLIMITED = 0,
-  VECT_COST_MODEL_CHEAP = 1,
-  VECT_COST_MODEL_DYNAMIC = 2,
-  VECT_COST_MODEL_DEFAULT = 3
+  VECT_COST_MODEL_DEFAULT = 1
 };
 
 /* Different instrumentation modes.  */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
new file mode 100644
index 00000000000..0737da5d671
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
+
+void
+f (int *x, int *y)
+{
+  for (unsigned int i = 0; i < 1024; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
new file mode 100644
index 00000000000..fa9bdb607b2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" } */
+
+void
+f (int *x, int *y)
+{
+  for (unsigned int i = 0; i < 1024; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump-not {LOOP VECTORIZED} vect { target vect_int } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
new file mode 100644
index 00000000000..d7c6cfd2049
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
+
+void
+f (int *restrict x, int *restrict y)
+{
+  for (unsigned int i = 0; i < 1024; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
new file mode 100644
index 00000000000..bb018ad99fe
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" } */
+
+int x[1024], y[1024];
+
+void
+f (void)
+{
+  for (unsigned int i = 0; i < 1024; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
new file mode 100644
index 00000000000..536ec0a3cda
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
+
+void
+f (int *restrict x, int *restrict y)
+{
+  for (unsigned int i = 0; i < 1023; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c
new file mode 100644
index 00000000000..552febb5fee
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" } */
+
+void
+f (int *restrict x, int *restrict y)
+{
+  for (unsigned int i = 0; i < 1023; ++i)
+    x[i] += y[i];
+}
+
+/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target { vect_int && vect_partial_vectors_usage_2 } } } } */
+/* { dg-final { scan-tree-dump-not {LOOP VECTORIZED} vect { target { vect_int && { ! vect_partial_vectors_usage_2 } } } } } */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index 0efab495407..18e36c89d14 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -2161,7 +2161,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
         {
           unsigned max_allowed_peel
 	    = param_vect_max_peeling_for_alignment;
-	  if (flag_vect_cost_model == VECT_COST_MODEL_CHEAP)
+	  if (flag_vect_cost_model <= VECT_COST_MODEL_CHEAP)
 	    max_allowed_peel = 0;
           if (max_allowed_peel != (unsigned)-1)
             {
@@ -2259,7 +2259,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
   do_versioning
     = (optimize_loop_nest_for_speed_p (loop)
        && !loop->inner /* FORNOW */
-       && flag_vect_cost_model != VECT_COST_MODEL_CHEAP);
+       && flag_vect_cost_model > VECT_COST_MODEL_CHEAP);
 
   if (do_versioning)
     {
@@ -3682,6 +3682,10 @@ vect_prune_runtime_alias_test_list (loop_vec_info loop_vinfo)
   unsigned int count = (comp_alias_ddrs.length ()
 			+ check_unequal_addrs.length ());
 
+  if (count && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP)
+    return opt_result::failure_at
+      (vect_location, "would need a runtime alias check\n");
+
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
 		     "improved number of alias checks from %d to %d\n",
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 856bbfebf7c..48dfb4df00e 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1827,6 +1827,19 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
 	}
     }
 
+  /* If using the "very cheap" model, reject cases in which we'd keep
+     a copy of the scalar code (even if we might be able to vectorize it).  */
+  if (flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
+      && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
+	  || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+	  || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "some scalar iterations would need to be peeled\n");
+      return 0;
+    }
+
   int min_profitable_iters, min_profitable_estimate;
   vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
 				      &min_profitable_estimate);
@@ -1885,6 +1898,20 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
       min_profitable_estimate = min_profitable_iters;
     }
 
+  /* If the vector loop needs multiple iterations to be beneficial then
+     things are probably too close to call, and the conservative thing
+     would be to stick with the scalar code.  */
+  if (flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
+      && min_profitable_estimate > (int) vect_vf_for_cost (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "one iteration of the vector loop would be"
+			 " more expensive than the equivalent number of"
+			 " iterations of the scalar loop\n");
+      return 0;
+    }
+
   HOST_WIDE_INT estimated_niter;
 
   /* If we are vectorizing an epilogue then we know the maximum number of

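As an illustration of the second new check in vect_analyze_loop_costing
above (an editorial sketch with made-up numbers, not GCC code): the
very-cheap model only vectorizes when one vector iteration is already
expected to pay for itself, i.e. when the estimated break-even point in
scalar iterations does not exceed the vectorization factor.

/* Editorial sketch, not GCC code.  min_profitable_estimate stands for
   the estimated number of scalar iterations at which the vector loop
   breaks even; vf stands for the number of scalar iterations handled
   by one vector iteration.  */
#include <stdio.h>

static int
very_cheap_would_vectorize (int min_profitable_estimate, int vf)
{
  /* Mirrors the patch's "min_profitable_estimate > vect_vf_for_cost" test.  */
  return !(min_profitable_estimate > vf);
}

int
main (void)
{
  /* Break-even within one vector iteration: vectorize.  */
  printf ("%d\n", very_cheap_would_vectorize (3, 4));
  /* Break-even needs more than one vector iteration: stay scalar.  */
  printf ("%d\n", very_cheap_would_vectorize (6, 4));
  return 0;
}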

* Re: [PATCH] vect: Add a “very cheap” cost model
  2020-11-19 12:04       ` Richard Sandiford
@ 2020-11-19 14:08         ` Richard Biener
  0 siblings, 0 replies; 8+ messages in thread
From: Richard Biener @ 2020-11-19 14:08 UTC (permalink / raw)
  To: Richard Biener via Gcc-patches, Richard Biener, Richard Sandiford

On Thu, Nov 19, 2020 at 1:04 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Mon, Nov 16, 2020 at 10:58 AM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >> > Does the patch also vectorize with SVE loops that have
> >> > unknown loop bound?  The documentation isn't entirely
> >> > conclusive there.
> >>
> >> Yeah, for SVE it vectorises.  How about changing:
> >>
> >>   For example, if each iteration of a vectorized loop would handle
> >>   exactly four iterations, …
> >>
> >> to:
> >>
> >>   For example, if each iteration of a vectorized loop could only
> >>   handle exactly four iterations of the original scalar loop, …
> >>
> >> ?
> >
> > Yeah, guess that's better.
> >
> >>
> >> > Iff the iteration count is a multiple of two and the target can
> >> > vectorize the loop with both VF 2 and VF 4 but VF 4 would be better if
> >> > we'd use the 'cheap' cost model, does 'very-cheap' not vectorize the
> >> > loop or does it choose VF 2?
> >>
> >> It would choose VF 2, if that's still a win over scalar code.
> >
> > OK, that's what I expected.  The VF iteration is one source of
> > compile-time that we might want to avoid somehow ... on
> > x86_64 knowing the precise number of constant iterations
> > should allow to only pick a subset of vector modes based on
> > largest_pow2_factor or so?  Or maybe just use the preferred
> > SIMD mode for cheap/very-cheap?  (maybe pass down
> > the cost model kind to the target hook so targets can decide
> > for themselves here)
>
> On the preferred simd mode thing: TBH, I'd prefer to get rid
> of that hook one day and just rely on autovectorize_vector_modes.
>
> The difficulty with adding an early check is that we don't know ahead
> of time which types of scalar element a loop operates on: we only find
> that out on the fly during the first analysis of the loop.  The check
> would also depend on SLP grouping: we can use a vector of 4 ints to
> handle 2 iterations of the scalar loop if the ints are in an SLP
> group of size 2.
>
> So I agree it would be nice to have early-outs, but I think we'd have
> to restructure things first.  E.g. maybe we could do some “cheap” initial
> analysis that checks for basic vectorisability, records which scalar
> elements are used by the loop, and records how big the containing SLP
> groups might be (based on optimistic assumptions).  Then we can use
> that to prefilter the modes we try (perhaps all the way down to no modes).
> I guess that's conceptually similar to building an SLP graph though.
>
> Does the attached look OK?  I've included a version of the updated
> wording above.  I also changed this condition to use “>” rather
> than “>=”:

Yes, OK.

Thanks,
Richard.

>
>   /* If the vector loop needs multiple iterations to be beneficial then
>      things are probably too close to call, and the conservative thing
>      would be to stick with the scalar code.  */
>   if (flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
>       && min_profitable_estimate > (int) vect_vf_for_cost (loop_vinfo))
>
> since when min_profitable_estimate == min_profitable_iters
> we'll have done:
>
>   if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
>       && min_profitable_iters < (assumed_vf + peel_iters_prologue))
>     /* We want the vectorized loop to execute at least once.  */
>     min_profitable_iters = assumed_vf + peel_iters_prologue;
>
> I also tried to make vect-cost-model-4.c more resilient on targets
> that require alignment.
>
> Thanks,
> Richard
>
>
> gcc/
>         * doc/invoke.texi (-fvect-cost-model): Add a very-cheap model.
>         * common.opt (fvect-cost-model=): Add very-cheap as a possible option.
>         (fsimd-cost-model=): Likewise.
>         (vect_cost_model): Add very-cheap.
>         * flag-types.h (vect_cost_model): Add VECT_COST_MODEL_VERY_CHEAP.
>         Put the values in order of increasing aggressiveness.
>         * tree-vect-data-refs.c (vect_enhance_data_refs_alignment): Use
>         range checks when comparing against VECT_COST_MODEL_CHEAP.
>         (vect_prune_runtime_alias_test_list): Do not allow any alias
>         checks for the very-cheap cost model.
>         * tree-vect-loop.c (vect_analyze_loop_costing): Do not allow
>         any peeling for the very-cheap cost model.  Also require one
>         iteration of the vector loop to pay for itself.
>
> gcc/testsuite/
>         * gcc.dg/vect/vect-cost-model-1.c: New test.
>         * gcc.dg/vect/vect-cost-model-2.c: Likewise.
>         * gcc.dg/vect/vect-cost-model-3.c: Likewise.
>         * gcc.dg/vect/vect-cost-model-4.c: Likewise.
>         * gcc.dg/vect/vect-cost-model-5.c: Likewise.
>         * gcc.dg/vect/vect-cost-model-6.c: Likewise.
> ---
>  gcc/common.opt                                |  7 +++--
>  gcc/doc/invoke.texi                           | 12 +++++++--
>  gcc/flag-types.h                              | 10 ++++---
>  gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c | 11 ++++++++
>  gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c | 11 ++++++++
>  gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c | 11 ++++++++
>  gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c | 13 +++++++++
>  gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c | 11 ++++++++
>  gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c | 12 +++++++++
>  gcc/tree-vect-data-refs.c                     |  8 ++++--
>  gcc/tree-vect-loop.c                          | 27 +++++++++++++++++++
>  11 files changed, 123 insertions(+), 10 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c
>
> diff --git a/gcc/common.opt b/gcc/common.opt
> index fe39b3dee9f..ca8a2690799 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -3020,11 +3020,11 @@ Enable basic block vectorization (SLP) on trees.
>
>  fvect-cost-model=
>  Common Joined RejectNegative Enum(vect_cost_model) Var(flag_vect_cost_model) Init(VECT_COST_MODEL_DEFAULT) Optimization
> --fvect-cost-model=[unlimited|dynamic|cheap]    Specifies the cost model for vectorization.
> +-fvect-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the cost model for vectorization.
>
>  fsimd-cost-model=
>  Common Joined RejectNegative Enum(vect_cost_model) Var(flag_simd_cost_model) Init(VECT_COST_MODEL_UNLIMITED) Optimization
> --fsimd-cost-model=[unlimited|dynamic|cheap]    Specifies the vectorization cost model for code marked with a simd directive.
> +-fsimd-cost-model=[unlimited|dynamic|cheap|very-cheap] Specifies the vectorization cost model for code marked with a simd directive.
>
>  Enum
>  Name(vect_cost_model) Type(enum vect_cost_model) UnknownError(unknown vectorizer cost model %qs)
> @@ -3038,6 +3038,9 @@ Enum(vect_cost_model) String(dynamic) Value(VECT_COST_MODEL_DYNAMIC)
>  EnumValue
>  Enum(vect_cost_model) String(cheap) Value(VECT_COST_MODEL_CHEAP)
>
> +EnumValue
> +Enum(vect_cost_model) String(very-cheap) Value(VECT_COST_MODEL_VERY_CHEAP)
> +
>  fvect-cost-model
>  Common Alias(fvect-cost-model=,dynamic,unlimited)
>  Enables the dynamic vectorizer cost model.  Preserved for backward compatibility.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 3510a54c6c4..07232c6b33d 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -11440,7 +11440,8 @@ and @option{-fauto-profile}.
>  @item -fvect-cost-model=@var{model}
>  @opindex fvect-cost-model
>  Alter the cost model used for vectorization.  The @var{model} argument
> -should be one of @samp{unlimited}, @samp{dynamic} or @samp{cheap}.
> +should be one of @samp{unlimited}, @samp{dynamic}, @samp{cheap} or
> +@samp{very-cheap}.
>  With the @samp{unlimited} model the vectorized code-path is assumed
>  to be profitable while with the @samp{dynamic} model a runtime check
>  guards the vectorized code-path to enable it only for iteration
> @@ -11448,7 +11449,14 @@ counts that will likely execute faster than when executing the original
>  scalar loop.  The @samp{cheap} model disables vectorization of
>  loops where doing so would be cost prohibitive for example due to
>  required runtime checks for data dependence or alignment but otherwise
> -is equal to the @samp{dynamic} model.
> +is equal to the @samp{dynamic} model.  The @samp{very-cheap} model only
> +allows vectorization if the vector code would entirely replace the
> +scalar code that is being vectorized.  For example, if each iteration
> +of a vectorized loop would only be able to handle exactly four iterations
> +of the scalar loop, the @samp{very-cheap} model would only allow
> +vectorization if the scalar iteration count is known to be a multiple
> +of four.
> +
>  The default cost model depends on other optimization flags and is
>  either @samp{dynamic} or @samp{cheap}.
>
> diff --git a/gcc/flag-types.h b/gcc/flag-types.h
> index 648ed096e30..0dbab19943c 100644
> --- a/gcc/flag-types.h
> +++ b/gcc/flag-types.h
> @@ -232,12 +232,14 @@ enum scalar_storage_order_kind {
>    SSO_LITTLE_ENDIAN
>  };
>
> -/* Vectorizer cost-model.  */
> +/* Vectorizer cost-model.  Except for DEFAULT, the values are ordered from
> +   the most conservative to the least conservative.  */
>  enum vect_cost_model {
> +  VECT_COST_MODEL_VERY_CHEAP = -3,
> +  VECT_COST_MODEL_CHEAP = -2,
> +  VECT_COST_MODEL_DYNAMIC = -1,
>    VECT_COST_MODEL_UNLIMITED = 0,
> -  VECT_COST_MODEL_CHEAP = 1,
> -  VECT_COST_MODEL_DYNAMIC = 2,
> -  VECT_COST_MODEL_DEFAULT = 3
> +  VECT_COST_MODEL_DEFAULT = 1
>  };
>
>  /* Different instrumentation modes.  */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
> new file mode 100644
> index 00000000000..0737da5d671
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-1.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
> +
> +void
> +f (int *x, int *y)
> +{
> +  for (unsigned int i = 0; i < 1024; ++i)
> +    x[i] += y[i];
> +}
> +
> +/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
> new file mode 100644
> index 00000000000..fa9bdb607b2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-2.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" } */
> +
> +void
> +f (int *x, int *y)
> +{
> +  for (unsigned int i = 0; i < 1024; ++i)
> +    x[i] += y[i];
> +}
> +
> +/* { dg-final { scan-tree-dump-not {LOOP VECTORIZED} vect { target vect_int } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
> new file mode 100644
> index 00000000000..d7c6cfd2049
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-3.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
> +
> +void
> +f (int *restrict x, int *restrict y)
> +{
> +  for (unsigned int i = 0; i < 1024; ++i)
> +    x[i] += y[i];
> +}
> +
> +/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
> new file mode 100644
> index 00000000000..bb018ad99fe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-4.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" } */
> +
> +int x[1024], y[1024];
> +
> +void
> +f (void)
> +{
> +  for (unsigned int i = 0; i < 1024; ++i)
> +    x[i] += y[i];
> +}
> +
> +/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
> new file mode 100644
> index 00000000000..536ec0a3cda
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-5.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=cheap" } */
> +
> +void
> +f (int *restrict x, int *restrict y)
> +{
> +  for (unsigned int i = 0; i < 1023; ++i)
> +    x[i] += y[i];
> +}
> +
> +/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target vect_int } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c b/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c
> new file mode 100644
> index 00000000000..552febb5fee
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -ftree-vectorize -fvect-cost-model=very-cheap" } */
> +
> +void
> +f (int *restrict x, int *restrict y)
> +{
> +  for (unsigned int i = 0; i < 1023; ++i)
> +    x[i] += y[i];
> +}
> +
> +/* { dg-final { scan-tree-dump {LOOP VECTORIZED} vect { target { vect_int && vect_partial_vectors_usage_2 } } } } */
> +/* { dg-final { scan-tree-dump-not {LOOP VECTORIZED} vect { target { vect_int && { ! vect_partial_vectors_usage_2 } } } } } */
> diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
> index 0efab495407..18e36c89d14 100644
> --- a/gcc/tree-vect-data-refs.c
> +++ b/gcc/tree-vect-data-refs.c
> @@ -2161,7 +2161,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
>          {
>            unsigned max_allowed_peel
>             = param_vect_max_peeling_for_alignment;
> -         if (flag_vect_cost_model == VECT_COST_MODEL_CHEAP)
> +         if (flag_vect_cost_model <= VECT_COST_MODEL_CHEAP)
>             max_allowed_peel = 0;
>            if (max_allowed_peel != (unsigned)-1)
>              {
> @@ -2259,7 +2259,7 @@ vect_enhance_data_refs_alignment (loop_vec_info loop_vinfo)
>    do_versioning
>      = (optimize_loop_nest_for_speed_p (loop)
>         && !loop->inner /* FORNOW */
> -       && flag_vect_cost_model != VECT_COST_MODEL_CHEAP);
> +       && flag_vect_cost_model > VECT_COST_MODEL_CHEAP);
>
>    if (do_versioning)
>      {
> @@ -3682,6 +3682,10 @@ vect_prune_runtime_alias_test_list (loop_vec_info loop_vinfo)
>    unsigned int count = (comp_alias_ddrs.length ()
>                         + check_unequal_addrs.length ());
>
> +  if (count && flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP)
> +    return opt_result::failure_at
> +      (vect_location, "would need a runtime alias check\n");
> +
>    if (dump_enabled_p ())
>      dump_printf_loc (MSG_NOTE, vect_location,
>                      "improved number of alias checks from %d to %d\n",
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 856bbfebf7c..48dfb4df00e 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -1827,6 +1827,19 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
>         }
>      }
>
> +  /* If using the "very cheap" model, reject cases in which we'd keep
> +     a copy of the scalar code (even if we might be able to vectorize it).  */
> +  if (flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
> +      && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> +         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> +         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> +    {
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "some scalar iterations would need to be peeled\n");
> +      return 0;
> +    }
> +
>    int min_profitable_iters, min_profitable_estimate;
>    vect_estimate_min_profitable_iters (loop_vinfo, &min_profitable_iters,
>                                       &min_profitable_estimate);
> @@ -1885,6 +1898,20 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo)
>        min_profitable_estimate = min_profitable_iters;
>      }
>
> +  /* If the vector loop needs multiple iterations to be beneficial then
> +     things are probably too close to call, and the conservative thing
> +     would be to stick with the scalar code.  */
> +  if (flag_vect_cost_model == VECT_COST_MODEL_VERY_CHEAP
> +      && min_profitable_estimate > (int) vect_vf_for_cost (loop_vinfo))
> +    {
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "one iteration of the vector loop would be"
> +                        " more expensive than the equivalent number of"
> +                        " iterations of the scalar loop\n");
> +      return 0;
> +    }
> +
>    HOST_WIDE_INT estimated_niter;
>
>    /* If we are vectorizing an epilogue then we know the maximum number of

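To make the SLP-grouping point quoted above concrete (an editorial
example, not part of the patch): in a loop like the one below, each
scalar iteration accesses two adjacent ints, which form an SLP group of
size 2, so a vector of 4 ints covers only 2 scalar iterations.

void
f (int *restrict x, int *restrict y, int n)
{
  for (int i = 0; i < n; ++i)
    {
      /* The two adjacent accesses per iteration form an SLP group of
         size 2, so a V4SI vector handles two scalar iterations.  */
      x[2 * i] += y[2 * i];
      x[2 * i + 1] += y[2 * i + 1];
    }
}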

* Re: [PATCH] vect: Add a “very cheap” cost model
  2020-11-16  8:47 ` Richard Biener
  2020-11-16  9:58   ` Richard Sandiford
@ 2020-11-21 20:30   ` Jan Hubicka
  2020-11-23  8:12     ` Richard Biener
  1 sibling, 1 reply; 8+ messages in thread
From: Jan Hubicka @ 2020-11-21 20:30 UTC (permalink / raw)
  To: Richard Biener; +Cc: Richard Sandiford, GCC Patches

> > I tested this by building and running a bunch of workloads for SVE,
> > with three options:
> >
> >   (1) -O2
> >   (2) -O2 -ftree-vectorize -fvect-cost-model=very-cheap
> >   (3) -O2 -ftree-vectorize [-fvect-cost-model=cheap]
> >
> > All three builds used the default -msve-vector-bits=scalable and
> > ran with the minimum vector length of 128 bits, which should give
> > a worst-case bound for the performance impact.
> >
> > The workloads included a mixture of microbenchmarks and full
> > applications.  Because it's quite an eclectic mix, there's not
> > much point giving exact figures.  The aim was more to get a general
> > impression.
> >
> > Code size growth with (2) was much lower than with (3).  Only a
> > handful of tests increased by more than 5%, and all of them were
> > microbenchmarks.
> >
> > In terms of performance, (2) was significantly faster than (1)
> > on microbenchmarks (as expected) but also on some full apps.
> > Again, performance only regressed on a handful of tests.
> >
> > As expected, the performance of (3) vs. (1) and (3) vs. (2) is more
> > of a mixed bag.  There are several significant improvements with (3)
> > over (2), but also some (smaller) regressions.  That seems to be in
> > line with -O2 -ftree-vectorize being a kind of -O2.5.
> 
> So previous attempts at enabling vectorization at -O2 also factored
> in compile-time requirements.  We've looked mainly at SPEC and
> there even the current "cheap" model doesn't fare very well IIRC
> and costs quite some compile-time and code-size.  Turning down
> vectorization even more will have even less impact on performance
> but the compile-time cost will likely not shrink very much.
> 
> I think we need ways to detect candidates that will end up
> cheap or very cheap without actually doing all of the analysis
> first.
The current cheap model indeed costs quite some code size.  I
was playing with a similar patch (mine simply changed the cheap model).
Richard's patch tests as follows:

https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e4360e452b4c6cd56d4e21663703e920763413f5%2C11d39d4b278efb4a0134f495d698c2cd764c06e4%2Ca0917becd66182eee6eb6a7a150e38f0463b765d&include_user_branches=on
https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e4360e452b4c6cd56d4e21663703e920763413f5%2C11d39d4b278efb4a0134f495d698c2cd764c06e4%2Ca0917becd66182eee6eb6a7a150e38f0463b765d&include_user_branches=on
(not all of the SPEC2k runs were finished at the time of writing this email)

Here the baseline is current trunk, the first run is with the vectorizer
forced to very-cheap for both -O2 and -O3/fast, and the last is with the
vectorizer forced to dynamic (so -O3/fast is the same as the baseline).

A 6.5% SPECint2017 improvement at -O2 is certainly very nice, even if
a large part of it comes from the x264 benchmark.
CPU2006 is affected by regressions in tonto and cactusADM, where the
second is known to be a bit random.

There are some regressions, but they exist already at -O3, and I guess
those are easier to track than the usual -O3 vectorization failures, so I
will check whether they are tracked in bugzilla.
(For example, the 100% regression on cray should be very easy.)

libxul LTO link time at -O2 goes up from
real    7m47.358s
user    76m49.109s
sys     2m2.403s

to

real    8m12.651s
user    80m0.704s
sys     2m9.275s

so vectorization accounts for about 4.1% of backend time.  (Overall
Firefox build time is about 45 minutes on my setup.)

For comparison, -O2 --disable-tree-pre gives me:

real    7m36.438s
user    73m20.167s
sys     2m3.460s

So PRE accounts for about 4.7% of backend time.
These values should be well above the noise; I re-ran the tests a few times.

I would say that the speedups from vectorization are justified, especially
when there are essentially zero code-size costs.  It depends on where we
set the bar for compile time...

Honza

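For reference, a quick worked check of the user-time figures above
(treating each as a single representative run): 80m0.704s / 76m49.109s
= 4800.7s / 4609.1s ≈ 1.04, so vectorization adds roughly 4% more
backend time, while 76m49.109s / 73m20.167s = 4609.1s / 4400.2s ≈ 1.047,
so PRE adds roughly 4.7%, broadly consistent with the percentages quoted.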

* Re: [PATCH] vect: Add a “very cheap” cost model
  2020-11-21 20:30   ` Jan Hubicka
@ 2020-11-23  8:12     ` Richard Biener
  0 siblings, 0 replies; 8+ messages in thread
From: Richard Biener @ 2020-11-23  8:12 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Richard Sandiford, GCC Patches

On Sat, Nov 21, 2020 at 9:30 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > > I tested this by building and running a bunch of workloads for SVE,
> > > with three options:
> > >
> > >   (1) -O2
> > >   (2) -O2 -ftree-vectorize -fvect-cost-model=very-cheap
> > >   (3) -O2 -ftree-vectorize [-fvect-cost-model=cheap]
> > >
> > > All three builds used the default -msve-vector-bits=scalable and
> > > ran with the minimum vector length of 128 bits, which should give
> > > a worst-case bound for the performance impact.
> > >
> > > The workloads included a mixture of microbenchmarks and full
> > > applications.  Because it's quite an eclectic mix, there's not
> > > much point giving exact figures.  The aim was more to get a general
> > > impression.
> > >
> > > Code size growth with (2) was much lower than with (3).  Only a
> > > handful of tests increased by more than 5%, and all of them were
> > > microbenchmarks.
> > >
> > > In terms of performance, (2) was significantly faster than (1)
> > > on microbenchmarks (as expected) but also on some full apps.
> > > Again, performance only regressed on a handful of tests.
> > >
> > > As expected, the performance of (3) vs. (1) and (3) vs. (2) is more
> > > of a mixed bag.  There are several significant improvements with (3)
> > > over (2), but also some (smaller) regressions.  That seems to be in
> > > line with -O2 -ftree-vectorize being a kind of -O2.5.
> >
> > So previous attempts at enabling vectorization at -O2 also factored
> > in compile-time requirements.  We've looked mainly at SPEC and
> > there even the current "cheap" model doesn't fare very well IIRC
> > and costs quite some compile-time and code-size.  Turning down
> > vectorization even more will have even less impact on performance
> > but the compile-time cost will likely not shrink very much.
> >
> > I think we need ways to detect candidates that will end up
> > cheap or very cheap without actually doing all of the analysis
> > first.
> The current cheap model indeed costs quite some code size.  I
> was playing with similar patch (mine simply changed the cheap model).
> Richard's patch tests as follows:
>
> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e4360e452b4c6cd56d4e21663703e920763413f5%2C11d39d4b278efb4a0134f495d698c2cd764c06e4%2Ca0917becd66182eee6eb6a7a150e38f0463b765d&include_user_branches=on
> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?younger_in_days=14&older_in_days=0&all_changes=on&min_percentage_change=0.02&revisions=e4360e452b4c6cd56d4e21663703e920763413f5%2C11d39d4b278efb4a0134f495d698c2cd764c06e4%2Ca0917becd66182eee6eb6a7a150e38f0463b765d&include_user_branches=on
> (not all of SPEC2k runs are finished at the time of writting the email)
>
> Here baseline is current trunk, first run is with vectorizer forced to
> very cheap for both -O2 and -O3/fast and last is with vectorizer forced to
> dynamic (so -O3/fast is same as baseline)
>
> 6.5% SPECint2017 improvement at -O2 is certainly very nice even if
> large part comes from x264 benchmark.
> CPU2006 is affected by regression of tonto and cactusADM where the
> second is known to be bit random.
>
> There are some regressions but they exists already at -O3 and I guess
> those are easier to track than usual -O3 vectorization failure so I will
> check if they are tracked by bugzilla.
> (for example 100% regression on cray should be very easy)
>
> libxul LTO linktime at -O2 goes up from
> real    7m47.358s
> user    76m49.109s
> sys     2m2.403s
>
> to
>
> real    8m12.651s
> user    80m0.704s
> sys     2m9.275s
>
> so about 4.1% of backend time. (overall firefox build time is about 45
> minutes on my setup)

Hmm, that's unfortunate.  With very-cheap we should avoid the
known quadraticness (each vect_do_peeling call does a whole-function
SSA update), which would leave the other cost (the dependence calculation).
Still, profiling might make some sense here (IIRC SPEC wrf was one
of the worst outliers in my measurements, but that was not
avoiding all peelings).

Richard.

>
> For comparsion  -O2 --disable-tree-pre gives me:
>
> real    7m36.438s
> user    73m20.167s
> sys     2m3.460s
>
> So 4.7% backend time.
> These values should be off-noise, I re-run the tests few times.
>
> I would say that the speedups for vectorization are justified especially
> when there are essentially zero code size costs.  It depends where we
> set the bar on compile time...
>
> Honza


end of thread

Thread overview: 8+ messages
2020-11-13 18:34 [PATCH] vect: Add a “very cheap” cost model Richard Sandiford
2020-11-16  8:47 ` Richard Biener
2020-11-16  9:58   ` Richard Sandiford
2020-11-16 11:23     ` Richard Biener
2020-11-19 12:04       ` Richard Sandiford
2020-11-19 14:08         ` Richard Biener
2020-11-21 20:30   ` Jan Hubicka
2020-11-23  8:12     ` Richard Biener
