Re: [PATCH] tree-optimization/108724 - vectorized code getting piecewise expanded

public inbox for gcc-patches@gcc.gnu.org

* Re: [PATCH] tree-optimization/108724 - vectorized code getting piecewise expanded
       [not found] <20230210110247.514FA385B53C@sourceware.org>
@ 2023-02-13 14:39 ` Jeff Law
  2023-02-13 14:51   ` Richard Biener
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff Law @ 2023-02-13 14:39 UTC
  To: Richard Biener, gcc-patches; +Cc: richard.sandiford



On 2/10/23 04:02, Richard Biener via Gcc-patches wrote:
> This fixes an oversight to when removing the hard limits on using
> generic vectors for the vectorizer to enable both SLP and BB
> vectorization to use those.  The vectorizer relies on vector lowering
> to expand plus, minus and negate to bit operations but vector
> lowering has a hard limit on the minimum number of elements per
> work item.  Vectorizer costs for the testcase at hand work out
> to vectorize a loop with just two work items per vector and that
> causes element wise expansion and spilling.
> 
> The fix for now is to re-instantiate the hard limit, matching what
> vector lowering does.  For the future the way to go is to emit the
> lowered sequence directly from the vectorizer instead.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?
> 
> Thanks,
> Richard.
> 
> 	PR tree-optimization/108724
> 	* tree-vect-stmts.cc (vectorizable_operation): Avoid
> 	using word_mode vectors when vector lowering will
> 	decompose them to elementwise operations.
> 
> 	* gcc.target/i386/pr108724.c: New testcase.
OK.  Though can't this be a problem with logicals too?  Or is there 
something special about +- going on here?


jeff


* Re: [PATCH] tree-optimization/108724 - vectorized code getting piecewise expanded
  2023-02-13 14:39 ` [PATCH] tree-optimization/108724 - vectorized code getting piecewise expanded Jeff Law
@ 2023-02-13 14:51   ` Richard Biener
  2023-02-13 16:05     ` Jeff Law
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Biener @ 2023-02-13 14:51 UTC
  To: Jeff Law; +Cc: gcc-patches, richard.sandiford

On Mon, 13 Feb 2023, Jeff Law wrote:

> 
> 
> On 2/10/23 04:02, Richard Biener via Gcc-patches wrote:
> > This fixes an oversight to when removing the hard limits on using
> > generic vectors for the vectorizer to enable both SLP and BB
> > vectorization to use those.  The vectorizer relies on vector lowering
> > to expand plus, minus and negate to bit operations but vector
> > lowering has a hard limit on the minimum number of elements per
> > work item.  Vectorizer costs for the testcase at hand work out
> > to vectorize a loop with just two work items per vector and that
> > causes element wise expansion and spilling.
> > 
> > The fix for now is to re-instantiate the hard limit, matching what
> > vector lowering does.  For the future the way to go is to emit the
> > lowered sequence directly from the vectorizer instead.
> > 
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?
> > 
> > Thanks,
> > Richard.
> > 
> >  PR tree-optimization/108724
> >  * tree-vect-stmts.cc (vectorizable_operation): Avoid
> >  using word_mode vectors when vector lowering will
> >  decompose them to elementwise operations.
> > 
> >  * gcc.target/i386/pr108724.c: New testcase.
> OK.  Though can't this be a problem with logicals too?  Or is there something
> special about +- going on here?

Logical ops do not cross lanes even when using scalar operations on GPRs.
For +- you have to compute the MSB separately to avoid spilling over to
the next vector lane.
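
To make the lane-crossing point concrete, here is a minimal sketch in C,
assuming two 32-bit lanes packed into one 64-bit GPR.  The helper names
(pack, lanes_and, lanes_add_naive, lanes_add, lanes_selfcheck) are
hypothetical and only for illustration; this is the usual
mask-and-fix-up trick, not necessarily the exact sequence vector
lowering emits.

```c
#include <assert.h>
#include <stdint.h>

/* MSB of each of the two 32-bit lanes packed into one 64-bit word.  */
#define LANE_MSB UINT64_C(0x8000000080000000)

/* Hypothetical helper: pack two 32-bit lanes into one 64-bit word.  */
static uint64_t pack (uint32_t hi, uint32_t lo)
{
  return ((uint64_t) hi << 32) | lo;
}

/* Logical ops work bit by bit, so a plain scalar AND is already
   lane-wise correct.  */
static uint64_t lanes_and (uint64_t x, uint64_t y)
{
  return x & y;
}

/* A plain scalar add lets the carry out of the low lane spill into
   the high lane.  */
static uint64_t lanes_add_naive (uint64_t x, uint64_t y)
{
  return x + y;
}

/* Add with each lane's MSB cleared so no carry can leave a lane, then
   compute the MSB separately with XOR.  */
static uint64_t lanes_add (uint64_t x, uint64_t y)
{
  uint64_t low = (x & ~LANE_MSB) + (y & ~LANE_MSB);
  return low ^ ((x ^ y) & LANE_MSB);
}

/* Sanity check on one input where the low lane overflows.  */
static void lanes_selfcheck (void)
{
  uint64_t x = pack (1, 0xFFFFFFFFu), y = pack (2, 1);
  assert (lanes_and (x, y) == pack (0, 1));
  assert (lanes_add_naive (x, y) == pack (4, 0)); /* high lane corrupted */
  assert (lanes_add (x, y) == pack (3, 0));       /* lanes stay independent */
}
```

With only two lanes per word, the extra mask and XOR operations are a
large share of the work, which is presumably why lowering falls back to
elementwise code below its element limit.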

Richard.


* Re: [PATCH] tree-optimization/108724 - vectorized code getting piecewise expanded
  2023-02-13 14:51   ` Richard Biener
@ 2023-02-13 16:05     ` Jeff Law
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff Law @ 2023-02-13 16:05 UTC
  To: Richard Biener; +Cc: gcc-patches, richard.sandiford



On 2/13/23 07:51, Richard Biener wrote:
> On Mon, 13 Feb 2023, Jeff Law wrote:
> 
>>
>>
>> On 2/10/23 04:02, Richard Biener via Gcc-patches wrote:
>>> This fixes an oversight to when removing the hard limits on using
>>> generic vectors for the vectorizer to enable both SLP and BB
>>> vectorization to use those.  The vectorizer relies on vector lowering
>>> to expand plus, minus and negate to bit operations but vector
>>> lowering has a hard limit on the minimum number of elements per
>>> work item.  Vectorizer costs for the testcase at hand work out
>>> to vectorize a loop with just two work items per vector and that
>>> causes element wise expansion and spilling.
>>>
>>> The fix for now is to re-instantiate the hard limit, matching what
>>> vector lowering does.  For the future the way to go is to emit the
>>> lowered sequence directly from the vectorizer instead.
>>>
>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?
>>>
>>> Thanks,
>>> Richard.
>>>
>>>   PR tree-optimization/108724
>>>   * tree-vect-stmts.cc (vectorizable_operation): Avoid
>>>   using word_mode vectors when vector lowering will
>>>   decompose them to elementwise operations.
>>>
>>>   * gcc.target/i386/pr108724.c: New testcase.
>> OK.  Though can't this be a problem with logicals too?  Or is there something
>> special about +- going on here?
> 
> Logical ops do not cross lanes even when using scalar operations on GPRs.
> For +- you have to compute the MSB separately to avoid spilling over to
> the next vector lane.
Oh, yes, makes perfect sense.

jeff


* Re: [PATCH] tree-optimization/108724 - vectorized code getting piecewise expanded
  2023-02-10 11:18 ` Richard Sandiford
@ 2023-02-10 11:21   ` Richard Biener
  0 siblings, 0 replies; 6+ messages in thread
From: Richard Biener @ 2023-02-10 11:21 UTC
  To: Richard Sandiford; +Cc: Richard Biener via Gcc-patches

On Fri, 10 Feb 2023, Richard Sandiford wrote:

> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > This fixes an oversight to when removing the hard limits on using
> > generic vectors for the vectorizer to enable both SLP and BB
> > vectorization to use those.  The vectorizer relies on vector lowering
> > to expand plus, minus and negate to bit operations but vector
> > lowering has a hard limit on the minimum number of elements per
> > work item.  Vectorizer costs for the testcase at hand work out
> > to vectorize a loop with just two work items per vector and that
> > causes element wise expansion and spilling.
> >
> > The fix for now is to re-instantiate the hard limit, matching what
> > vector lowering does.  For the future the way to go is to emit the
> > lowered sequence directly from the vectorizer instead.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?
> 
> LGTM after reading the vector lowering stuff in the PR trail.
> 
> TBH I don't remember when the hard limit was removed though.

I removed it as part of fixing PR101801 but didn't anticipate the
close connection with vector lowering.

Richard.

> Thanks,
> Richard
> 
> >
> > Thanks,
> > Richard.
> >
> > 	PR tree-optimization/108724
> > 	* tree-vect-stmts.cc (vectorizable_operation): Avoid
> > 	using word_mode vectors when vector lowering will
> > 	decompose them to elementwise operations.
> >
> > 	* gcc.target/i386/pr108724.c: New testcase.
> > ---
> >  gcc/testsuite/gcc.target/i386/pr108724.c | 15 +++++++++++++++
> >  gcc/tree-vect-stmts.cc                   | 14 ++++++++++++++
> >  2 files changed, 29 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr108724.c
> >
> > diff --git a/gcc/testsuite/gcc.target/i386/pr108724.c b/gcc/testsuite/gcc.target/i386/pr108724.c
> > new file mode 100644
> > index 00000000000..c4e0e918610
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr108724.c
> > @@ -0,0 +1,15 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -mno-sse" } */
> > +
> > +int a[16], b[16], c[16];
> > +void foo()
> > +{
> > +  for (int i = 0; i < 16; i++) {
> > +    a[i] = b[i] + c[i];
> > +  }
> > +}
> > +
> > +/* When this is vectorized this shouldn't be expanded piecewise again
> > +   which will result in spilling for the upper half access.  */
> > +
> > +/* { dg-final { scan-assembler-not "\\\[er\\\]sp" } } */
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index c86249adcc3..09b5af603d2 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -6315,6 +6315,20 @@ vectorizable_operation (vec_info *vinfo,
> >        return false;
> >      }
> >  
> > +  /* ???  We should instead expand the operations here, instead of
> > +     relying on vector lowering which has this hard cap on the number
> > +     of vector elements below it performs elementwise operations.  */
> > +  if (using_emulated_vectors_p
> > +      && (code == PLUS_EXPR || code == MINUS_EXPR || code == NEGATE_EXPR)
> > +      && ((BITS_PER_WORD / vector_element_bits (vectype)) < 4
> > +	  || maybe_lt (nunits_out, 4U)))
> > +    {
> > +      if (dump_enabled_p ())
> > +	dump_printf (MSG_NOTE, "not using word mode for +- and less than "
> > +		     "four vector elements\n");
> > +      return false;
> > +    }
> > +
> >    int reduc_idx = STMT_VINFO_REDUC_IDX (stmt_info);
> >    vec_loop_masks *masks = (loop_vinfo ? &LOOP_VINFO_MASKS (loop_vinfo) : NULL);
> >    internal_fn cond_fn = get_conditional_internal_fn (code);
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


* Re: [PATCH] tree-optimization/108724 - vectorized code getting piecewise expanded
       [not found] <20230210110251.62A5B385B52C@sourceware.org>
@ 2023-02-10 11:18 ` Richard Sandiford
  2023-02-10 11:21   ` Richard Biener
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Sandiford @ 2023-02-10 11:18 UTC
  To: Richard Biener via Gcc-patches; +Cc: Richard Biener

Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> This fixes an oversight to when removing the hard limits on using
> generic vectors for the vectorizer to enable both SLP and BB
> vectorization to use those.  The vectorizer relies on vector lowering
> to expand plus, minus and negate to bit operations but vector
> lowering has a hard limit on the minimum number of elements per
> work item.  Vectorizer costs for the testcase at hand work out
> to vectorize a loop with just two work items per vector and that
> causes element wise expansion and spilling.
>
> The fix for now is to re-instantiate the hard limit, matching what
> vector lowering does.  For the future the way to go is to emit the
> lowered sequence directly from the vectorizer instead.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?

LGTM after reading the vector lowering stuff in the PR trail.

TBH I don't remember when the hard limit was removed though.

Thanks,
Richard

>
> Thanks,
> Richard.
>
> 	PR tree-optimization/108724
> 	* tree-vect-stmts.cc (vectorizable_operation): Avoid
> 	using word_mode vectors when vector lowering will
> 	decompose them to elementwise operations.
>
> 	* gcc.target/i386/pr108724.c: New testcase.
> ---
>  gcc/testsuite/gcc.target/i386/pr108724.c | 15 +++++++++++++++
>  gcc/tree-vect-stmts.cc                   | 14 ++++++++++++++
>  2 files changed, 29 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr108724.c
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr108724.c b/gcc/testsuite/gcc.target/i386/pr108724.c
> new file mode 100644
> index 00000000000..c4e0e918610
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr108724.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mno-sse" } */
> +
> +int a[16], b[16], c[16];
> +void foo()
> +{
> +  for (int i = 0; i < 16; i++) {
> +    a[i] = b[i] + c[i];
> +  }
> +}
> +
> +/* When this is vectorized this shouldn't be expanded piecewise again
> +   which will result in spilling for the upper half access.  */
> +
> +/* { dg-final { scan-assembler-not "\\\[er\\\]sp" } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index c86249adcc3..09b5af603d2 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -6315,6 +6315,20 @@ vectorizable_operation (vec_info *vinfo,
>        return false;
>      }
>  
> +  /* ???  We should instead expand the operations here, instead of
> +     relying on vector lowering which has this hard cap on the number
> +     of vector elements below it performs elementwise operations.  */
> +  if (using_emulated_vectors_p
> +      && (code == PLUS_EXPR || code == MINUS_EXPR || code == NEGATE_EXPR)
> +      && ((BITS_PER_WORD / vector_element_bits (vectype)) < 4
> +	  || maybe_lt (nunits_out, 4U)))
> +    {
> +      if (dump_enabled_p ())
> +	dump_printf (MSG_NOTE, "not using word mode for +- and less than "
> +		     "four vector elements\n");
> +      return false;
> +    }
> +
>    int reduc_idx = STMT_VINFO_REDUC_IDX (stmt_info);
>    vec_loop_masks *masks = (loop_vinfo ? &LOOP_VINFO_MASKS (loop_vinfo) : NULL);
>    internal_fn cond_fn = get_conditional_internal_fn (code);


* [PATCH] tree-optimization/108724 - vectorized code getting piecewise expanded
@ 2023-02-10 11:02 Richard Biener
  0 siblings, 0 replies; 6+ messages in thread
From: Richard Biener @ 2023-02-10 11:02 UTC
  To: gcc-patches; +Cc: richard.sandiford

This fixes an oversight made when removing the hard limits on using
generic vectors for the vectorizer to enable both SLP and BB
vectorization to use those.  The vectorizer relies on vector lowering
to expand plus, minus and negate to bit operations, but vector
lowering has a hard limit on the minimum number of elements per
work item.  Vectorizer costs for the testcase at hand work out
to vectorize a loop with just two work items per vector, and that
causes elementwise expansion and spilling.

The fix for now is to reinstate the hard limit, matching what
vector lowering does.  In the future the way to go is to emit the
lowered sequence directly from the vectorizer instead.
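
As a rough illustration of the limit, the check added by the patch
below can be sketched as a standalone C predicate.  The function name
and its plain unsigned parameters are assumptions for illustration
only; the real code uses BITS_PER_WORD, vector_element_bits (vectype)
and a poly-int comparison via maybe_lt.

```c
#include <stdbool.h>

/* Hypothetical standalone model of the guard: word-mode emulation of
   plus, minus and negate is only considered worthwhile when a word
   holds at least four elements and the vector has at least four
   elements.  */
static bool word_mode_plus_minus_ok (unsigned bits_per_word,
                                     unsigned element_bits,
                                     unsigned nunits)
{
  unsigned lanes_per_word = bits_per_word / element_bits;
  return !(lanes_per_word < 4 || nunits < 4);
}
```

For the testcase, 32-bit int elements in a 64-bit word give two lanes
per word, so word_mode_plus_minus_ok (64, 32, 2) is false and the
operation is not vectorized with word_mode vectors.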

Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?

Thanks,
Richard.

	PR tree-optimization/108724
	* tree-vect-stmts.cc (vectorizable_operation): Avoid
	using word_mode vectors when vector lowering will
	decompose them to elementwise operations.

	* gcc.target/i386/pr108724.c: New testcase.
---
 gcc/testsuite/gcc.target/i386/pr108724.c | 15 +++++++++++++++
 gcc/tree-vect-stmts.cc                   | 14 ++++++++++++++
 2 files changed, 29 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr108724.c

diff --git a/gcc/testsuite/gcc.target/i386/pr108724.c b/gcc/testsuite/gcc.target/i386/pr108724.c
new file mode 100644
index 00000000000..c4e0e918610
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr108724.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mno-sse" } */
+
+int a[16], b[16], c[16];
+void foo()
+{
+  for (int i = 0; i < 16; i++) {
+    a[i] = b[i] + c[i];
+  }
+}
+
+/* When this is vectorized this shouldn't be expanded piecewise again
+   which will result in spilling for the upper half access.  */
+
+/* { dg-final { scan-assembler-not "\\\[er\\\]sp" } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index c86249adcc3..09b5af603d2 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -6315,6 +6315,20 @@ vectorizable_operation (vec_info *vinfo,
       return false;
     }
 
+  /* ???  We should instead expand the operations here, instead of
+     relying on vector lowering, which has this hard cap on the number
+     of vector elements below which it performs elementwise operations.  */
+  if (using_emulated_vectors_p
+      && (code == PLUS_EXPR || code == MINUS_EXPR || code == NEGATE_EXPR)
+      && ((BITS_PER_WORD / vector_element_bits (vectype)) < 4
+	  || maybe_lt (nunits_out, 4U)))
+    {
+      if (dump_enabled_p ())
+	dump_printf (MSG_NOTE, "not using word mode for +- and less than "
+		     "four vector elements\n");
+      return false;
+    }
+
   int reduc_idx = STMT_VINFO_REDUC_IDX (stmt_info);
   vec_loop_masks *masks = (loop_vinfo ? &LOOP_VINFO_MASKS (loop_vinfo) : NULL);
   internal_fn cond_fn = get_conditional_internal_fn (code);
-- 
2.35.3

