[PATCH] [RFC] main loop masked vectorization with --param vect-partial-vector-usage=1

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH] [RFC] main loop masked vectorization with --param vect-partial-vector-usage=1
@ 2023-06-14  9:46 Richard Biener
  0 siblings, 0 replies; 3+ messages in thread
From: Richard Biener @ 2023-06-14  9:46 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.sandiford

Currently vect_determine_partial_vectors_and_peeling will decide
to apply fully masking to the main loop despite
--param vect-partial-vector-usage=1 when the currently analyzed
vector mode results in a vectorization factor that's bigger
than the number of scalar iterations.  That's undesirable for
targets where a vector mode can handle both partial vector and
non-partial vector vectorization.  I understand that for AARCH64
we have SVE and NEON but SVE can only do partial vector and
NEON only non-partial vector vectorization, plus the target
chooses to let cost comparison decide the vector mode to use.

For x86 and the upcoming AVX512 partial vector support the
story is different, the target chooses the first (and largest)
vector mode that can successfully used for vectorization.  But
that means with --param vect-partial-vector-usage=1 we will
always choose AVX512 with partial vectors for the main loop
even if, for example, V4SI would be a perfect fit with full
vectors and no required epilog!

The following tries to find the appropriate condition for
this - I suppose simply refusing to set LOOP_VINFO_USING_PARTIAL_VECTORS_P
on the main loop when --param vect-partial-vector-usage=1 will
hurt AARCH64?  Incidentially looking up the docs for
vect-partial-vector-usage suggests that it's not supposed to
control epilog vectorization but instead
"1 allows partial vector loads and stores if vectorization removes the
need for the code to iterate".  That's probably OK in the end
but if there's a fixed size vector mode that allows the same thing
without using masking that would be better.

I wonder if we should special-case known niter (bounds) somehow
when analyzing the vector modes and override the targets sorting?

Maybe we want a new --param in addition to vect-epilogues-nomask
and vect-partial-vector-usage to say we want masked epilogues?

	* tree-vect-loop.cc (vect_determine_partial_vectors_and_peeling):
	For non-VLA vectorization interpret param_vect_partial_vector_usage == 1
	as only applying to epilogues.
---
 gcc/tree-vect-loop.cc | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 9be66b8fbc5..9323aa572d4 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2478,7 +2478,15 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
 	  && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
 	  && !vect_known_niters_smaller_than_vf (loop_vinfo))
 	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
-      else
+      /* Avoid using a large fixed size vectorization mode with masking
+	 for the main loop when we were asked to only use masking for
+	 the epilog.
+	 ???  Ideally we'd start analysis with a better sized mode,
+	 the param_vect_partial_vector_usage == 2 case suffers from
+	 this as well.  But there's a catch-22.  */
+      else if (!(!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+		 && param_vect_partial_vector_usage == 1
+		 && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()))
 	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
     }

-- 
2.35.3

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] [RFC] main loop masked vectorization with --param vect-partial-vector-usage=1
  2023-06-14 13:14 ` Richard Sandiford
@ 2023-06-14 13:44   ` Richard Biener
  0 siblings, 0 replies; 3+ messages in thread
From: Richard Biener @ 2023-06-14 13:44 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: Richard Biener via Gcc-patches

On Wed, 14 Jun 2023, Richard Sandiford wrote:

> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > Currently vect_determine_partial_vectors_and_peeling will decide
> > to apply fully masking to the main loop despite
> > --param vect-partial-vector-usage=1 when the currently analyzed
> > vector mode results in a vectorization factor that's bigger
> > than the number of scalar iterations.  That's undesirable for
> > targets where a vector mode can handle both partial vector and
> > non-partial vector vectorization.  I understand that for AARCH64
> > we have SVE and NEON but SVE can only do partial vector and
> > NEON only non-partial vector vectorization, plus the target
> > chooses to let cost comparison decide the vector mode to use.
> 
> SVE can do both (and does non-partial for things that can't yet be
> predicated, like reversing loads).  But yeah, NEON can only do
> non-partial.
> 
> > For x86 and the upcoming AVX512 partial vector support the
> > story is different, the target chooses the first (and largest)
> > vector mode that can successfully used for vectorization.  But
> > that means with --param vect-partial-vector-usage=1 we will
> > always choose AVX512 with partial vectors for the main loop
> > even if, for example, V4SI would be a perfect fit with full
> > vectors and no required epilog!
> 
> Sounds like a good candidate for VECT_COMPARE_COSTS.  Did you
> try using that?

Yeah, I didn't try that because we've never done that and I expect
unrelated "effects" ...

> > The following tries to find the appropriate condition for
> > this - I suppose simply refusing to set LOOP_VINFO_USING_PARTIAL_VECTORS_P
> > on the main loop when --param vect-partial-vector-usage=1 will
> > hurt AARCH64?
> 
> Yeah, I'd expect so.
> 
> > Incidentially looking up the docs for
> > vect-partial-vector-usage suggests that it's not supposed to
> > control epilog vectorization but instead
> > "1 allows partial vector loads and stores if vectorization removes the
> > need for the code to iterate".  That's probably OK in the end
> > but if there's a fixed size vector mode that allows the same thing
> > without using masking that would be better.
> >
> > I wonder if we should special-case known niter (bounds) somehow
> > when analyzing the vector modes and override the targets sorting?
> >
> > Maybe we want a new --param in addition to vect-epilogues-nomask
> > and vect-partial-vector-usage to say we want masked epilogues?
> >
> > 	* tree-vect-loop.cc (vect_determine_partial_vectors_and_peeling):
> > 	For non-VLA vectorization interpret param_vect_partial_vector_usage == 1
> > 	as only applying to epilogues.
> > ---
> >  gcc/tree-vect-loop.cc | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 9be66b8fbc5..9323aa572d4 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -2478,7 +2478,15 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
> >  	  && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> >  	  && !vect_known_niters_smaller_than_vf (loop_vinfo))
> >  	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
> > -      else
> > +      /* Avoid using a large fixed size vectorization mode with masking
> > +	 for the main loop when we were asked to only use masking for
> > +	 the epilog.
> > +	 ???  Ideally we'd start analysis with a better sized mode,
> > +	 the param_vect_partial_vector_usage == 2 case suffers from
> > +	 this as well.  But there's a catch-22.  */
> > +      else if (!(!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> > +		 && param_vect_partial_vector_usage == 1
> > +		 && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()))
> 
> I don't think is_constant is a good thing to test here.  The way things
> work for SVE is essentially the same for VL-agnostic and VL-specific.
> 
> Also, I think this hard-codes the assumption that the smallest mode
> isn't maskable.  Wouldn't it spuriously fail vectorisation if there
> was no smaller mode available?
> 
> Similarly, it looks like it could fail for AVX512 if processing 511
> characters, whereas I'd have expected AVX512 to still be the right
> choice there.

Possibly, yes.

> If VECT_COMPARE_COSTS seems too expensive, we could try to look
> for cases where a vector mode later in the list gives a VF that
> is exactly equal to the number of scalar iterations.  (Exactly
> *divides* the number of scalar iterations would be less clear-cut IMO.)
> 
> But converting a vector mode into a VF isn't trivial with our
> current vectoriser structures, so I'm not sure how much of a
> saving it would be over VECT_COMPARE_COSTS.  And it would be much
> more special-purpose than VECT_COMPARE_COSTS.

It occured to me if NITER is constant or we have a constant bound
on it we could set max_vf to the next higher power-of-two
(and min_vf to the next lower power-of-two?) when doing the
"autodetect" run.  Unfortunately we still select the "bad" mode
then, we simply fail that analysis and try with the next
still too big but no longer "autodetect" mode and end up
with AVX2 instead of the desired SSE2 for a simple (with disable cunrolli)

void foo (unsigned *a)
{
  for (int i = 0; i < 4; ++i)
    a[i] = 42;
}

but as you say it's difficult for get_vectype_for_scalar_type
with yet unknown VF to decide on a good mode.  It _could_ use
min_vf so we at least get V4SImode but with SLP the required
vector width could be larger (supposedly we've already
analyzed the maximum size of dataref groups).

Going through _all_ vector modes with the VF constraint
would be costly as well.

It's difficult to feed in enough info to the first
get_vectype_for_scalar_type call done since that's from
vect_analyze_data_refs.  Between that and vect_analyze_data_ref_accesses
there's pattern recog, trying to re-order and delay until after
that might work.

As you say, possibly VECT_COMPARE_COSTS is the way to go here ...

Richard.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] [RFC] main loop masked vectorization with --param vect-partial-vector-usage=1
       [not found] <20230614094639.B578B3858430@sourceware.org>
@ 2023-06-14 13:14 ` Richard Sandiford
  2023-06-14 13:44   ` Richard Biener
  0 siblings, 1 reply; 3+ messages in thread
From: Richard Sandiford @ 2023-06-14 13:14 UTC (permalink / raw)
  To: Richard Biener via Gcc-patches; +Cc: Richard Biener

Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Currently vect_determine_partial_vectors_and_peeling will decide
> to apply fully masking to the main loop despite
> --param vect-partial-vector-usage=1 when the currently analyzed
> vector mode results in a vectorization factor that's bigger
> than the number of scalar iterations.  That's undesirable for
> targets where a vector mode can handle both partial vector and
> non-partial vector vectorization.  I understand that for AARCH64
> we have SVE and NEON but SVE can only do partial vector and
> NEON only non-partial vector vectorization, plus the target
> chooses to let cost comparison decide the vector mode to use.

SVE can do both (and does non-partial for things that can't yet be
predicated, like reversing loads).  But yeah, NEON can only do
non-partial.

> For x86 and the upcoming AVX512 partial vector support the
> story is different, the target chooses the first (and largest)
> vector mode that can successfully used for vectorization.  But
> that means with --param vect-partial-vector-usage=1 we will
> always choose AVX512 with partial vectors for the main loop
> even if, for example, V4SI would be a perfect fit with full
> vectors and no required epilog!

Sounds like a good candidate for VECT_COMPARE_COSTS.  Did you
try using that?

> The following tries to find the appropriate condition for
> this - I suppose simply refusing to set LOOP_VINFO_USING_PARTIAL_VECTORS_P
> on the main loop when --param vect-partial-vector-usage=1 will
> hurt AARCH64?

Yeah, I'd expect so.

> Incidentially looking up the docs for
> vect-partial-vector-usage suggests that it's not supposed to
> control epilog vectorization but instead
> "1 allows partial vector loads and stores if vectorization removes the
> need for the code to iterate".  That's probably OK in the end
> but if there's a fixed size vector mode that allows the same thing
> without using masking that would be better.
>
> I wonder if we should special-case known niter (bounds) somehow
> when analyzing the vector modes and override the targets sorting?
>
> Maybe we want a new --param in addition to vect-epilogues-nomask
> and vect-partial-vector-usage to say we want masked epilogues?
>
> 	* tree-vect-loop.cc (vect_determine_partial_vectors_and_peeling):
> 	For non-VLA vectorization interpret param_vect_partial_vector_usage == 1
> 	as only applying to epilogues.
> ---
>  gcc/tree-vect-loop.cc | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 9be66b8fbc5..9323aa572d4 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -2478,7 +2478,15 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
>  	  && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
>  	  && !vect_known_niters_smaller_than_vf (loop_vinfo))
>  	LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
> -      else
> +      /* Avoid using a large fixed size vectorization mode with masking
> +	 for the main loop when we were asked to only use masking for
> +	 the epilog.
> +	 ???  Ideally we'd start analysis with a better sized mode,
> +	 the param_vect_partial_vector_usage == 2 case suffers from
> +	 this as well.  But there's a catch-22.  */
> +      else if (!(!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +		 && param_vect_partial_vector_usage == 1
> +		 && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()))

I don't think is_constant is a good thing to test here.  The way things
work for SVE is essentially the same for VL-agnostic and VL-specific.

Also, I think this hard-codes the assumption that the smallest mode
isn't maskable.  Wouldn't it spuriously fail vectorisation if there
was no smaller mode available?

Similarly, it looks like it could fail for AVX512 if processing 511
characters, whereas I'd have expected AVX512 to still be the right
choice there.

If VECT_COMPARE_COSTS seems too expensive, we could try to look
for cases where a vector mode later in the list gives a VF that
is exactly equal to the number of scalar iterations.  (Exactly
*divides* the number of scalar iterations would be less clear-cut IMO.)

But converting a vector mode into a VF isn't trivial with our
current vectoriser structures, so I'm not sure how much of a
saving it would be over VECT_COMPARE_COSTS.  And it would be much
more special-purpose than VECT_COMPARE_COSTS.

Thanks,
Richard


>  	LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
>      }

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-06-14 13:44 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-14  9:46 [PATCH] [RFC] main loop masked vectorization with --param vect-partial-vector-usage=1 Richard Biener
     [not found] <20230614094639.B578B3858430@sourceware.org>
2023-06-14 13:14 ` Richard Sandiford
2023-06-14 13:44   ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).