public inbox for gcc-patches@gcc.gnu.org
* [PATCH] [i386] Reject too large vectors for partial vector vectorization
@ 2023-06-19 12:35 Richard Biener
From: Richard Biener @ 2023-06-19 12:35 UTC
  To: gcc-patches; +Cc: hongtao.liu, richard.sandiford

The following works around the x86 backend not asking the
vectorizer to compare the costs of the different vector sizes
the backend advertises through the vector_modes hook.  When
masked epilogues or masked main loops are enabled, this means we
select the preferred vector mode, which is usually the largest,
even for loops that iterate far fewer times than the vector has
lanes.  Without masking, the vectorizer rejects any mode that
results in a VF bigger than the number of iterations, but with
masking the excess lanes are simply masked out.

So this overrides the finish_cost hook, matches the problematic
case and forces a high cost, which makes us try a smaller vector
size instead.


Bootstrapped and tested on x86_64-unknown-linux-gnu.  This should
avoid regressing 525.x264_r with partial vector epilogues and
should instead improve it by 25% with -march=znver4 (I still need
to re-check that figure; it was measured with an earlier attempt).

This falls short of enabling cost comparison in the x86 backend,
which I also considered doing for --param vect-partial-vector-usage=1
but which would cause much larger churn and compile-time impact
(though that should be bearable, as seen with aarch64).

I've filed PR110310 for an oddity I noticed around vectorizing
epilogues; I did not manage to adjust things for the case in that PR.

I'm using INT_MAX to fend off the vectorizer.  Should finish_cost
instead be able to signal rejection through a bool return value?
That said, INT_MAX seems to work fine.

Does this look reasonable?

Thanks,
Richard.

	* config/i386/i386.cc (ix86_vector_costs::finish_cost):
	Override.  For masked main loops make sure the vectorization
	factor isn't larger than the number of iterations rounded up
	to a power of two.


	* gcc.target/i386/vect-partial-vectors-1.c: New testcase.
	* gcc.target/i386/vect-partial-vectors-2.c: Likewise.
---
 gcc/config/i386/i386.cc                       | 26 +++++++++++++++++++
 .../gcc.target/i386/vect-partial-vectors-1.c  | 13 ++++++++++
 .../gcc.target/i386/vect-partial-vectors-2.c  | 12 +++++++++
 3 files changed, 51 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-partial-vectors-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-partial-vectors-2.c

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index b20cb86b822..32851a514a9 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23666,6 +23666,7 @@ class ix86_vector_costs : public vector_costs
 			      stmt_vec_info stmt_info, slp_tree node,
 			      tree vectype, int misalign,
 			      vect_cost_model_location where) override;
+  void finish_cost (const vector_costs *) override;
 };
 
 /* Implement targetm.vectorize.create_costs.  */
@@ -23918,6 +23919,31 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
   return retval;
 }
 
+void
+ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
+{
+  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
+  if (loop_vinfo && !m_costing_for_scalar)
+    {
+      /* We are currently not asking the vectorizer to compare costs
+	 between different vector mode sizes.  When using predication
+	 that will end up always choosing the preferred mode size even
+	 if there's a smaller mode covering all lanes.  Test for this
+	 situation and artificially reject the larger mode attempt.
+	 ???  We currently lack masked ops for sub-SSE sized modes,
+	 so we could restrict this rejection to AVX and AVX512 modes
+	 but err on the safe side for now.  */
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+	  && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	  && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	  && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
+	      > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
+	m_costs[vect_body] = INT_MAX;
+    }
+
+  vector_costs::finish_cost (scalar_costs);
+}
+
 /* Validate target specific memory model bits in VAL. */
 
 static unsigned HOST_WIDE_INT
diff --git a/gcc/testsuite/gcc.target/i386/vect-partial-vectors-1.c b/gcc/testsuite/gcc.target/i386/vect-partial-vectors-1.c
new file mode 100644
index 00000000000..3834720e8e2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-partial-vectors-1.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl -mprefer-vector-width=512 --param vect-partial-vector-usage=1" } */
+
+void foo (int * __restrict a, int *b)
+{
+  for (int i = 0; i < 4; ++i)
+    a[i] = b[i] + 42;
+}
+
+/* We want to optimize this using unmasked SSE, not masked AVX
+   or AVX512.  */
+/* { dg-final { scan-assembler-not "\[yz\]mm" } } */
+/* { dg-final { scan-assembler "xmm" } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-partial-vectors-2.c b/gcc/testsuite/gcc.target/i386/vect-partial-vectors-2.c
new file mode 100644
index 00000000000..4ab2cbc4203
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-partial-vectors-2.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512f -mavx512vl -mprefer-vector-width=512 --param vect-partial-vector-usage=1" } */
+
+void foo (int * __restrict a, int *b)
+{
+  for (int i = 0; i < 7; ++i)
+    a[i] = b[i] + 42;
+}
+
+/* We want to optimize this using masked AVX, not AVX512 or SSE.  */
+/* { dg-final { scan-assembler-not "zmm" } } */
+/* { dg-final { scan-assembler "ymm\[^\r\n\]*\{%k" } } */
-- 
2.35.3


* Re: [PATCH] [i386] Reject too large vectors for partial vector vectorization
@ 2023-06-20  6:48 Hongtao Liu
From: Hongtao Liu @ 2023-06-20  6:48 UTC
  To: Richard Biener; +Cc: gcc-patches, hongtao.liu, richard.sandiford

On Mon, Jun 19, 2023 at 8:35 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
> Does this look reasonable?
Looks reasonable to me, even for VECT_COMPARE_COSTS.



-- 
BR,
Hongtao

