[PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences
@ 2023-06-29 12:28 Richard Biener
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Biener @ 2023-06-29 12:28 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.sandiford

With applying loop masking to epilogues on x86_64 AVX512 we see
some significant performance regressions when evaluating SPEC CPU 2017
that are caused by store-to-load forwarding fails across outer
loop iterations when the inner loop does not iterate.  Consider

  for (j = 0; j < m; ++j)
    for (i = 0; i < n; ++i)
      a[j*n + i] += b[j*n + i];

with 'n' chosen so that the inner loop vectorized code is fully
executed by the masked epilogue and that masked epilogue
storing O > n elements (with elements >= n masked of course).
Then the masked load performed for the next outer loop iteration
will get a hit in the store queue but it obviously cannot forward
so we have to wait for the store to retire.

That causes a significant hit to performance especially if 'n'
would have made a non-masked epilogue to fully cover 'n' as well
(say n == 4 for a V4DImode epilogue), avoiding the need for
store-forwarding and waiting for the retiring of the store.

The following applies a very simple heuristic, disabling
the use of loop masking when there's a memory reference pair
with dependence distance zero.  That resolves the issue
(other problematic dependence distances seem to be less common
at least).

I have applied this heuristic in generic vectorizer code but
restricted it to non-VL vector sizes.  There currently isn't
a way for the target to request disabling of masking only,
while we can reject the vectoriztion at costing time that will
not re-consider the same vector mode but without masking.
It seems simply re-costing with masking disabled should be
possible through, we'd just need an indication whether that
should be done?  Maybe always when the current vector mode is
of fixed size?

I wonder how SVE vectorized code behaves in these situations?
The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
503.bwaves_r though I think both will need a hardware vector
size covering at least 8 doubles to show the issue.  527.cam4_r
has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.

Bootstrap / regtest running on x86_64-unknown-linux-gnu.

Any comments?

Thanks,
Richard.

	PR target/110456
	* tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
	* tree-vectorizer.cc (vec_info_shared::vec_info_shared):
	Initialize has_zero_dep_dist.
	* tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
	Remember if we've seen a dependence distance of zero.
	* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
	When we've seen a dependence distance of zero and the vector
	type has constant size disable the use of partial vectors.
---
 gcc/tree-vect-data-refs.cc |  2 ++
 gcc/tree-vect-stmts.cc     | 10 ++++++++++
 gcc/tree-vectorizer.cc     |  1 +
 gcc/tree-vectorizer.h      |  3 +++
 4 files changed, 16 insertions(+)

diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index ebe93832b1e..40cde95c16a 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
 			     "dependence distance == 0 between %T and %T\n",
 			     DR_REF (dra), DR_REF (drb));

+	  loop_vinfo->shared->has_zero_dep_dist = true;
+
 	  /* When we perform grouped accesses and perform implicit CSE
 	     by detecting equal accesses and doing disambiguation with
 	     runtime alias tests like for
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index d642d3c257f..3bcbc000323 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
       using_partial_vectors_p = true;
     }

+  if (loop_vinfo->shared->has_zero_dep_dist
+      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "disabling partial vectors because of possible "
+			 "STLF issues\n");
+      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
+
   if (!using_partial_vectors_p)
     {
       if (dump_enabled_p ())
diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
index a048e9d8917..74457259b6e 100644
--- a/gcc/tree-vectorizer.cc
+++ b/gcc/tree-vectorizer.cc
@@ -478,6 +478,7 @@ vec_info::~vec_info ()

 vec_info_shared::vec_info_shared ()
   : n_stmts (0),
+    has_zero_dep_dist (false),
     datarefs (vNULL),
     datarefs_copy (vNULL),
     ddrs (vNULL)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index a36974c2c0d..7626cda2a73 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -419,6 +419,9 @@ public:
   /* The number of scalar stmts.  */
   unsigned n_stmts;

+  /* Whether there's a dependence with zero distance.  */
+  bool has_zero_dep_dist;
+
   /* All data references.  Freed by free_data_refs, so not an auto_vec.  */
   vec<data_reference_p> datarefs;
   vec<data_reference> datarefs_copy;
-- 
2.35.3

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences
  2023-07-05 13:29         ` Richard Biener
@ 2023-07-05 13:55           ` Richard Sandiford
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Sandiford @ 2023-07-05 13:55 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <rguenther@suse.de> writes:
> On Tue, 4 Jul 2023, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > On Thu, 29 Jun 2023, Richard Biener wrote:
>> >
>> >> On Thu, 29 Jun 2023, Richard Sandiford wrote:
>> >> 
>> >> > Richard Biener <rguenther@suse.de> writes:
>> >> > > With applying loop masking to epilogues on x86_64 AVX512 we see
>> >> > > some significant performance regressions when evaluating SPEC CPU 2017
>> >> > > that are caused by store-to-load forwarding fails across outer
>> >> > > loop iterations when the inner loop does not iterate.  Consider
>> >> > >
>> >> > >   for (j = 0; j < m; ++j)
>> >> > >     for (i = 0; i < n; ++i)
>> >> > >       a[j*n + i] += b[j*n + i];
>> >> > >
>> >> > > with 'n' chosen so that the inner loop vectorized code is fully
>> >> > > executed by the masked epilogue and that masked epilogue
>> >> > > storing O > n elements (with elements >= n masked of course).
>> >> > > Then the masked load performed for the next outer loop iteration
>> >> > > will get a hit in the store queue but it obviously cannot forward
>> >> > > so we have to wait for the store to retire.
>> >> > >
>> >> > > That causes a significant hit to performance especially if 'n'
>> >> > > would have made a non-masked epilogue to fully cover 'n' as well
>> >> > > (say n == 4 for a V4DImode epilogue), avoiding the need for
>> >> > > store-forwarding and waiting for the retiring of the store.
>> >> > >
>> >> > > The following applies a very simple heuristic, disabling
>> >> > > the use of loop masking when there's a memory reference pair
>> >> > > with dependence distance zero.  That resolves the issue
>> >> > > (other problematic dependence distances seem to be less common
>> >> > > at least).
>> >> > >
>> >> > > I have applied this heuristic in generic vectorizer code but
>> >> > > restricted it to non-VL vector sizes.  There currently isn't
>> >> > > a way for the target to request disabling of masking only,
>> >> > > while we can reject the vectoriztion at costing time that will
>> >> > > not re-consider the same vector mode but without masking.
>> >> > > It seems simply re-costing with masking disabled should be
>> >> > > possible through, we'd just need an indication whether that
>> >> > > should be done?  Maybe always when the current vector mode is
>> >> > > of fixed size?
>> >> > >
>> >> > > I wonder how SVE vectorized code behaves in these situations?
>> >> > > The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
>> >> > > 503.bwaves_r though I think both will need a hardware vector
>> >> > > size covering at least 8 doubles to show the issue.  527.cam4_r
>> >> > > has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.
>> >> > >
>> >> > > Bootstrap / regtest running on x86_64-unknown-linux-gnu.
>> >> > >
>> >> > > Any comments?
>> >> > >
>> >> > > Thanks,
>> >> > > Richard.
>> >> > >
>> >> > > 	PR target/110456
>> >> > > 	* tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
>> >> > > 	* tree-vectorizer.cc (vec_info_shared::vec_info_shared):
>> >> > > 	Initialize has_zero_dep_dist.
>> >> > > 	* tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
>> >> > > 	Remember if we've seen a dependence distance of zero.
>> >> > > 	* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
>> >> > > 	When we've seen a dependence distance of zero and the vector
>> >> > > 	type has constant size disable the use of partial vectors.
>> >> > > ---
>> >> > >  gcc/tree-vect-data-refs.cc |  2 ++
>> >> > >  gcc/tree-vect-stmts.cc     | 10 ++++++++++
>> >> > >  gcc/tree-vectorizer.cc     |  1 +
>> >> > >  gcc/tree-vectorizer.h      |  3 +++
>> >> > >  4 files changed, 16 insertions(+)
>> >> > >
>> >> > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
>> >> > > index ebe93832b1e..40cde95c16a 100644
>> >> > > --- a/gcc/tree-vect-data-refs.cc
>> >> > > +++ b/gcc/tree-vect-data-refs.cc
>> >> > > @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
>> >> > >  			     "dependence distance == 0 between %T and %T\n",
>> >> > >  			     DR_REF (dra), DR_REF (drb));
>> >> > >  
>> >> > > +	  loop_vinfo->shared->has_zero_dep_dist = true;
>> >> > > +
>> >> > >  	  /* When we perform grouped accesses and perform implicit CSE
>> >> > >  	     by detecting equal accesses and doing disambiguation with
>> >> > >  	     runtime alias tests like for
>> >> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>> >> > > index d642d3c257f..3bcbc000323 100644
>> >> > > --- a/gcc/tree-vect-stmts.cc
>> >> > > +++ b/gcc/tree-vect-stmts.cc
>> >> > > @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
>> >> > >        using_partial_vectors_p = true;
>> >> > >      }
>> >> > >  
>> >> > > +  if (loop_vinfo->shared->has_zero_dep_dist
>> >> > > +      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
>> >> > 
>> >> > I don't think it makes sense to treat VLA and VLS differently here.
>> >> > 
>> >> > But RMW operations are very common, so it seems like we're giving up a
>> >> > lot on the off chance that the inner loop is applied iteratively
>> >> > to successive memory locations.
>> >> 
>> >> Yes ...
>> >> 
>> >> > Maybe that's still OK for AVX512, where I guess loop masking is more
>> >> > of a niche use case.  But if so, then yeah, I think a hook to disable
>> >> > masking might be better here.
>> >> 
>> >> It's a nice use case in that if you'd cost the main vector loop
>> >> with and without masking the not masked case is always winning.
>> >> I understand with SVE it would be the same if you fix the
>> >> vector size but otherwise it would be a win to mask as there's
>> >> the chance the HW implementation uses bigger vectors than the
>> >> architected minimum size.
>> >> 
>> >> So for AVX512 the win is with the epilogue and the case of
>> >> few scalar iterations where the epilogue iterations play
>> >> a significant role.  Since we only vectorize the epilogue
>> >> of the main loop but not the epilogue of the epilogue loop
>> >> we're leaving quite some iterations unvectorized when the
>> >> main loop uses 512bit vectors and the epilogue 256bit.
>> >> 
>> >> But that case specifically is also prone to make the cross-iteration
>> >> issue wrt an outer loop significiant ...
>> >> 
>> >> I'll see to address this on the costing side somehow.
>> >
>> > So with us requiring both partial vector and non-partial vector variants
>> > working during vectorizable_* I thought it should be possible to
>> > simply reset LOOP_VINFO_USING_PARTIAL_VECTORS_P and re-cost without
>> > re-analyzing in case the target deemed the partial vector using loop
>> > not profitable.
>> >
>> > While the following successfully hacks this in-place the question is
>> > whether the above is really true for VLA vector types?
>> 
>> Yeah, I think so.  We do support full-vector VLA.
>> 
>> > Besides from this "working", maybe the targets should get to say
>> > more specifically what they'd like to change - should we make
>> > the finish_cost () hook have a more elaborate return value or
>> > provide more hints like we already do with the suggested unroll factor?
>> >
>> > I can imagine a "partial vector usage is bad" or "too high VF"?
>> > But then we probably still want to re-do the final costing part
>> > so we'd need a way to push away the per-stmt costing we already
>> > did (sth nicer than the hack below).  Maybe make
>> > 'bool costing_for_scalar' a tri-state, adding 'produce copy
>> > from current vector cost in vinfo'?
>> 
>> Not sure I follow the tri-state thing.  If the idea is to redo
>> the vector_costs::add_stmt_costs stuff, could we just export
>> the cost_vec from vect_analyze_loop_operations and apply the
>> costs in vect_analyze_loop_costing instead?
>
> I think we don't need to re-do the add_stmt_costs, but yes, I guess
> we could do this and basically initialize the backend cost info
> only at vect_analyze_loop_costing time.  Is that what you suggest?

I wasn't think that far ahead, but yeah, if we can do that,
it definitely sounds better.

>> That seems worthwhile anyway, if we think the costs are going
>> to depend on partial vs. full vectors.  As things stand, we could
>> already cost with CAN_USE_PARTIAL_VECTORS_P set to 1 and then later
>> set it to 0.
>
> Yeah, at the time we feed stmts to the backed 
> (in vect_analyze_loop_operations) we have not yet committed to
> using partial vectors or not.
>
> What I meant with the tri-state thing is to give the vectorizer
> an idea how to best recover here without wasting too many cycles.
> At least costing masking vs. not masking should involve only
> re-running the costing itself so we could unconditionally try that
> as the patch does when the masked variant is rejected by the
> backend (via setting cost to INT_MAX).  The question above was
> whether with VLA vectors we could even code generate with
> !LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P?

Yeah, we can.  And we currently do for things like reversed loops.
E.g.:

    int x[100];
    int y[100];
    void f() {
        for (int i = 0; i < 100; ++i)
            x[i] = y[99 - i];
    }

[https://godbolt.org/z/rhqc9dWxs~].  Admittedly that's only because
we haven't added loop predication for reversed loads yet, rather than
a deliberate choice.

But vect-partial-vector-usage=1 makes sense for VLA too, and is
supported there.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences
  2023-07-04 19:49       ` Richard Sandiford
@ 2023-07-05 13:29         ` Richard Biener
  2023-07-05 13:55           ` Richard Sandiford
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Biener @ 2023-07-05 13:29 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Tue, 4 Jul 2023, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Thu, 29 Jun 2023, Richard Biener wrote:
> >
> >> On Thu, 29 Jun 2023, Richard Sandiford wrote:
> >> 
> >> > Richard Biener <rguenther@suse.de> writes:
> >> > > With applying loop masking to epilogues on x86_64 AVX512 we see
> >> > > some significant performance regressions when evaluating SPEC CPU 2017
> >> > > that are caused by store-to-load forwarding fails across outer
> >> > > loop iterations when the inner loop does not iterate.  Consider
> >> > >
> >> > >   for (j = 0; j < m; ++j)
> >> > >     for (i = 0; i < n; ++i)
> >> > >       a[j*n + i] += b[j*n + i];
> >> > >
> >> > > with 'n' chosen so that the inner loop vectorized code is fully
> >> > > executed by the masked epilogue and that masked epilogue
> >> > > storing O > n elements (with elements >= n masked of course).
> >> > > Then the masked load performed for the next outer loop iteration
> >> > > will get a hit in the store queue but it obviously cannot forward
> >> > > so we have to wait for the store to retire.
> >> > >
> >> > > That causes a significant hit to performance especially if 'n'
> >> > > would have made a non-masked epilogue to fully cover 'n' as well
> >> > > (say n == 4 for a V4DImode epilogue), avoiding the need for
> >> > > store-forwarding and waiting for the retiring of the store.
> >> > >
> >> > > The following applies a very simple heuristic, disabling
> >> > > the use of loop masking when there's a memory reference pair
> >> > > with dependence distance zero.  That resolves the issue
> >> > > (other problematic dependence distances seem to be less common
> >> > > at least).
> >> > >
> >> > > I have applied this heuristic in generic vectorizer code but
> >> > > restricted it to non-VL vector sizes.  There currently isn't
> >> > > a way for the target to request disabling of masking only,
> >> > > while we can reject the vectoriztion at costing time that will
> >> > > not re-consider the same vector mode but without masking.
> >> > > It seems simply re-costing with masking disabled should be
> >> > > possible through, we'd just need an indication whether that
> >> > > should be done?  Maybe always when the current vector mode is
> >> > > of fixed size?
> >> > >
> >> > > I wonder how SVE vectorized code behaves in these situations?
> >> > > The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
> >> > > 503.bwaves_r though I think both will need a hardware vector
> >> > > size covering at least 8 doubles to show the issue.  527.cam4_r
> >> > > has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.
> >> > >
> >> > > Bootstrap / regtest running on x86_64-unknown-linux-gnu.
> >> > >
> >> > > Any comments?
> >> > >
> >> > > Thanks,
> >> > > Richard.
> >> > >
> >> > > 	PR target/110456
> >> > > 	* tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
> >> > > 	* tree-vectorizer.cc (vec_info_shared::vec_info_shared):
> >> > > 	Initialize has_zero_dep_dist.
> >> > > 	* tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
> >> > > 	Remember if we've seen a dependence distance of zero.
> >> > > 	* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
> >> > > 	When we've seen a dependence distance of zero and the vector
> >> > > 	type has constant size disable the use of partial vectors.
> >> > > ---
> >> > >  gcc/tree-vect-data-refs.cc |  2 ++
> >> > >  gcc/tree-vect-stmts.cc     | 10 ++++++++++
> >> > >  gcc/tree-vectorizer.cc     |  1 +
> >> > >  gcc/tree-vectorizer.h      |  3 +++
> >> > >  4 files changed, 16 insertions(+)
> >> > >
> >> > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> >> > > index ebe93832b1e..40cde95c16a 100644
> >> > > --- a/gcc/tree-vect-data-refs.cc
> >> > > +++ b/gcc/tree-vect-data-refs.cc
> >> > > @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
> >> > >  			     "dependence distance == 0 between %T and %T\n",
> >> > >  			     DR_REF (dra), DR_REF (drb));
> >> > >  
> >> > > +	  loop_vinfo->shared->has_zero_dep_dist = true;
> >> > > +
> >> > >  	  /* When we perform grouped accesses and perform implicit CSE
> >> > >  	     by detecting equal accesses and doing disambiguation with
> >> > >  	     runtime alias tests like for
> >> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> >> > > index d642d3c257f..3bcbc000323 100644
> >> > > --- a/gcc/tree-vect-stmts.cc
> >> > > +++ b/gcc/tree-vect-stmts.cc
> >> > > @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
> >> > >        using_partial_vectors_p = true;
> >> > >      }
> >> > >  
> >> > > +  if (loop_vinfo->shared->has_zero_dep_dist
> >> > > +      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
> >> > 
> >> > I don't think it makes sense to treat VLA and VLS differently here.
> >> > 
> >> > But RMW operations are very common, so it seems like we're giving up a
> >> > lot on the off chance that the inner loop is applied iteratively
> >> > to successive memory locations.
> >> 
> >> Yes ...
> >> 
> >> > Maybe that's still OK for AVX512, where I guess loop masking is more
> >> > of a niche use case.  But if so, then yeah, I think a hook to disable
> >> > masking might be better here.
> >> 
> >> It's a nice use case in that if you'd cost the main vector loop
> >> with and without masking the not masked case is always winning.
> >> I understand with SVE it would be the same if you fix the
> >> vector size but otherwise it would be a win to mask as there's
> >> the chance the HW implementation uses bigger vectors than the
> >> architected minimum size.
> >> 
> >> So for AVX512 the win is with the epilogue and the case of
> >> few scalar iterations where the epilogue iterations play
> >> a significant role.  Since we only vectorize the epilogue
> >> of the main loop but not the epilogue of the epilogue loop
> >> we're leaving quite some iterations unvectorized when the
> >> main loop uses 512bit vectors and the epilogue 256bit.
> >> 
> >> But that case specifically is also prone to make the cross-iteration
> >> issue wrt an outer loop significiant ...
> >> 
> >> I'll see to address this on the costing side somehow.
> >
> > So with us requiring both partial vector and non-partial vector variants
> > working during vectorizable_* I thought it should be possible to
> > simply reset LOOP_VINFO_USING_PARTIAL_VECTORS_P and re-cost without
> > re-analyzing in case the target deemed the partial vector using loop
> > not profitable.
> >
> > While the following successfully hacks this in-place the question is
> > whether the above is really true for VLA vector types?
> 
> Yeah, I think so.  We do support full-vector VLA.
> 
> > Besides from this "working", maybe the targets should get to say
> > more specifically what they'd like to change - should we make
> > the finish_cost () hook have a more elaborate return value or
> > provide more hints like we already do with the suggested unroll factor?
> >
> > I can imagine a "partial vector usage is bad" or "too high VF"?
> > But then we probably still want to re-do the final costing part
> > so we'd need a way to push away the per-stmt costing we already
> > did (sth nicer than the hack below).  Maybe make
> > 'bool costing_for_scalar' a tri-state, adding 'produce copy
> > from current vector cost in vinfo'?
> 
> Not sure I follow the tri-state thing.  If the idea is to redo
> the vector_costs::add_stmt_costs stuff, could we just export
> the cost_vec from vect_analyze_loop_operations and apply the
> costs in vect_analyze_loop_costing instead?

I think we don't need to re-do the add_stmt_costs, but yes, I guess
we could do this and basically initialize the backend cost info
only at vect_analyze_loop_costing time.  Is that what you suggest?

> That seems worthwhile anyway, if we think the costs are going
> to depend on partial vs. full vectors.  As things stand, we could
> already cost with CAN_USE_PARTIAL_VECTORS_P set to 1 and then later
> set it to 0.

Yeah, at the time we feed stmts to the backed 
(in vect_analyze_loop_operations) we have not yet committed to
using partial vectors or not.

What I meant with the tri-state thing is to give the vectorizer
an idea how to best recover here without wasting too many cycles.
At least costing masking vs. not masking should involve only
re-running the costing itself so we could unconditionally try that
as the patch does when the masked variant is rejected by the
backend (via setting cost to INT_MAX).  The question above was
whether with VLA vectors we could even code generate with
!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P?

> > Anyway - the takeaway is the x86 backend likes a way to disable
> > the use of partial vectors for the epilog (or main loop) in some cases.
> > An alternative to doing this somehow via the costing hooks would be
> > to add a new hook - the specific data dependence check could be
> > a hook invoked after dependence checking, initializing
> > LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P?
> >
> > Any good ideas?  Anything that comes to your minds that would be
> > useful in this area for other targets?
> >
> > Thanks,
> > Richard.
> >
> > The following applies ontop of the earlier posted patch in this thread.
> >
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 8989985700a..a892f72ded3 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -23976,6 +23976,10 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> >  	  && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
> >  	      > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
> >  	m_costs[vect_body] = INT_MAX;
> > +
> > +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> > +	  && loop_vinfo->shared->has_zero_dep_dist)
> > +	m_costs[vect_body] = INT_MAX;
> >      }
> >  
> >    vector_costs::finish_cost (scalar_costs);
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 3b46c58a8d8..85802f07443 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -2773,6 +2773,7 @@ start_over:
> >      }
> >  
> >    loop_vinfo->vector_costs = init_cost (loop_vinfo, false);
> > +  vector_costs saved_costs = *loop_vinfo->vector_costs;
> 
> The target can derive the class and add its own member variables,
> so I think we'd need some kind of clone method for this to be safe.
> 
> >    /* Analyze the alignment of the data-refs in the loop.
> >       Fail if a data reference is found that cannot be vectorized.  */
> > @@ -3017,6 +3018,8 @@ start_over:
> >  	LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
> >      }
> >  
> > +  saved_costs = *loop_vinfo->vector_costs;
> > +again_no_partial_vectors:
> >    /* If we're vectorizing an epilogue loop, the vectorized loop either needs
> >       to be able to handle fewer than VF scalars, or needs to have a lower VF
> >       than the main loop.  */
> > @@ -3043,6 +3046,17 @@ start_over:
> >      {
> >        ok = opt_result::failure_at (vect_location,
> >  				   "Loop costings may not be worthwhile.\n");
> > +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
> > +	{
> > +	  if (dump_enabled_p ())
> > +	    dump_printf_loc (MSG_NOTE, vect_location,
> > +			     "trying with partial vectors disabled\n");
> > +	  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> > +	  LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = false;
> > +	  LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = false;
> 
> I guess these last two variables should have been set after
> vect_determine_partial_vectors_and_peeling rather than before.
> Sorry for not noticing earlier.
> 
> The approach seems OK to me otherwise FWIW.

OK, will try post-poning the target cost creation and re-feed
the cost_vect, that should avoid the need to copy the target cost
structure as well.

Thanks,
Richard.

> Thanks,
> Richard
> 
> > +	  *loop_vinfo->vector_costs = saved_costs;
> > +	  goto again_no_partial_vectors;
> > +	}
> >        goto again;
> >      }
> >    if (!res)
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index 08d071463fb..003af878ee7 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -1854,7 +1854,7 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
> >        using_partial_vectors_p = true;
> >      }
> >  
> > -  if (loop_vinfo->shared->has_zero_dep_dist
> > +  if (0 && loop_vinfo->shared->has_zero_dep_dist
> >        && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
> >      {
> >        if (dump_enabled_p ())
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences
  2023-07-04 13:37     ` Richard Biener
@ 2023-07-04 19:49       ` Richard Sandiford
  2023-07-05 13:29         ` Richard Biener
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Sandiford @ 2023-07-04 19:49 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <rguenther@suse.de> writes:
> On Thu, 29 Jun 2023, Richard Biener wrote:
>
>> On Thu, 29 Jun 2023, Richard Sandiford wrote:
>> 
>> > Richard Biener <rguenther@suse.de> writes:
>> > > With applying loop masking to epilogues on x86_64 AVX512 we see
>> > > some significant performance regressions when evaluating SPEC CPU 2017
>> > > that are caused by store-to-load forwarding fails across outer
>> > > loop iterations when the inner loop does not iterate.  Consider
>> > >
>> > >   for (j = 0; j < m; ++j)
>> > >     for (i = 0; i < n; ++i)
>> > >       a[j*n + i] += b[j*n + i];
>> > >
>> > > with 'n' chosen so that the inner loop vectorized code is fully
>> > > executed by the masked epilogue and that masked epilogue
>> > > storing O > n elements (with elements >= n masked of course).
>> > > Then the masked load performed for the next outer loop iteration
>> > > will get a hit in the store queue but it obviously cannot forward
>> > > so we have to wait for the store to retire.
>> > >
>> > > That causes a significant hit to performance especially if 'n'
>> > > would have made a non-masked epilogue to fully cover 'n' as well
>> > > (say n == 4 for a V4DImode epilogue), avoiding the need for
>> > > store-forwarding and waiting for the retiring of the store.
>> > >
>> > > The following applies a very simple heuristic, disabling
>> > > the use of loop masking when there's a memory reference pair
>> > > with dependence distance zero.  That resolves the issue
>> > > (other problematic dependence distances seem to be less common
>> > > at least).
>> > >
>> > > I have applied this heuristic in generic vectorizer code but
>> > > restricted it to non-VL vector sizes.  There currently isn't
>> > > a way for the target to request disabling of masking only,
>> > > while we can reject the vectoriztion at costing time that will
>> > > not re-consider the same vector mode but without masking.
>> > > It seems simply re-costing with masking disabled should be
>> > > possible through, we'd just need an indication whether that
>> > > should be done?  Maybe always when the current vector mode is
>> > > of fixed size?
>> > >
>> > > I wonder how SVE vectorized code behaves in these situations?
>> > > The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
>> > > 503.bwaves_r though I think both will need a hardware vector
>> > > size covering at least 8 doubles to show the issue.  527.cam4_r
>> > > has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.
>> > >
>> > > Bootstrap / regtest running on x86_64-unknown-linux-gnu.
>> > >
>> > > Any comments?
>> > >
>> > > Thanks,
>> > > Richard.
>> > >
>> > > 	PR target/110456
>> > > 	* tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
>> > > 	* tree-vectorizer.cc (vec_info_shared::vec_info_shared):
>> > > 	Initialize has_zero_dep_dist.
>> > > 	* tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
>> > > 	Remember if we've seen a dependence distance of zero.
>> > > 	* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
>> > > 	When we've seen a dependence distance of zero and the vector
>> > > 	type has constant size disable the use of partial vectors.
>> > > ---
>> > >  gcc/tree-vect-data-refs.cc |  2 ++
>> > >  gcc/tree-vect-stmts.cc     | 10 ++++++++++
>> > >  gcc/tree-vectorizer.cc     |  1 +
>> > >  gcc/tree-vectorizer.h      |  3 +++
>> > >  4 files changed, 16 insertions(+)
>> > >
>> > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
>> > > index ebe93832b1e..40cde95c16a 100644
>> > > --- a/gcc/tree-vect-data-refs.cc
>> > > +++ b/gcc/tree-vect-data-refs.cc
>> > > @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
>> > >  			     "dependence distance == 0 between %T and %T\n",
>> > >  			     DR_REF (dra), DR_REF (drb));
>> > >  
>> > > +	  loop_vinfo->shared->has_zero_dep_dist = true;
>> > > +
>> > >  	  /* When we perform grouped accesses and perform implicit CSE
>> > >  	     by detecting equal accesses and doing disambiguation with
>> > >  	     runtime alias tests like for
>> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>> > > index d642d3c257f..3bcbc000323 100644
>> > > --- a/gcc/tree-vect-stmts.cc
>> > > +++ b/gcc/tree-vect-stmts.cc
>> > > @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
>> > >        using_partial_vectors_p = true;
>> > >      }
>> > >  
>> > > +  if (loop_vinfo->shared->has_zero_dep_dist
>> > > +      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
>> > 
>> > I don't think it makes sense to treat VLA and VLS differently here.
>> > 
>> > But RMW operations are very common, so it seems like we're giving up a
>> > lot on the off chance that the inner loop is applied iteratively
>> > to successive memory locations.
>> 
>> Yes ...
>> 
>> > Maybe that's still OK for AVX512, where I guess loop masking is more
>> > of a niche use case.  But if so, then yeah, I think a hook to disable
>> > masking might be better here.
>> 
>> It's a nice use case in that if you'd cost the main vector loop
>> with and without masking the not masked case is always winning.
>> I understand with SVE it would be the same if you fix the
>> vector size but otherwise it would be a win to mask as there's
>> the chance the HW implementation uses bigger vectors than the
>> architected minimum size.
>> 
>> So for AVX512 the win is with the epilogue and the case of
>> few scalar iterations where the epilogue iterations play
>> a significant role.  Since we only vectorize the epilogue
>> of the main loop but not the epilogue of the epilogue loop
>> we're leaving quite some iterations unvectorized when the
>> main loop uses 512bit vectors and the epilogue 256bit.
>> 
>> But that case specifically is also prone to make the cross-iteration
>> issue wrt an outer loop significiant ...
>> 
>> I'll see to address this on the costing side somehow.
>
> So with us requiring both partial vector and non-partial vector variants
> working during vectorizable_* I thought it should be possible to
> simply reset LOOP_VINFO_USING_PARTIAL_VECTORS_P and re-cost without
> re-analyzing in case the target deemed the partial vector using loop
> not profitable.
>
> While the following successfully hacks this in-place the question is
> whether the above is really true for VLA vector types?

Yeah, I think so.  We do support full-vector VLA.

> Besides from this "working", maybe the targets should get to say
> more specifically what they'd like to change - should we make
> the finish_cost () hook have a more elaborate return value or
> provide more hints like we already do with the suggested unroll factor?
>
> I can imagine a "partial vector usage is bad" or "too high VF"?
> But then we probably still want to re-do the final costing part
> so we'd need a way to push away the per-stmt costing we already
> did (sth nicer than the hack below).  Maybe make
> 'bool costing_for_scalar' a tri-state, adding 'produce copy
> from current vector cost in vinfo'?

Not sure I follow the tri-state thing.  If the idea is to redo
the vector_costs::add_stmt_costs stuff, could we just export
the cost_vec from vect_analyze_loop_operations and apply the
costs in vect_analyze_loop_costing instead?

That seems worthwhile anyway, if we think the costs are going
to depend on partial vs. full vectors.  As things stand, we could
already cost with CAN_USE_PARTIAL_VECTORS_P set to 1 and then later
set it to 0.

> Anyway - the takeaway is the x86 backend likes a way to disable
> the use of partial vectors for the epilog (or main loop) in some cases.
> An alternative to doing this somehow via the costing hooks would be
> to add a new hook - the specific data dependence check could be
> a hook invoked after dependence checking, initializing
> LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P?
>
> Any good ideas?  Anything that comes to your minds that would be
> useful in this area for other targets?
>
> Thanks,
> Richard.
>
> The following applies ontop of the earlier posted patch in this thread.
>
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 8989985700a..a892f72ded3 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -23976,6 +23976,10 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
>  	  && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
>  	      > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
>  	m_costs[vect_body] = INT_MAX;
> +
> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> +	  && loop_vinfo->shared->has_zero_dep_dist)
> +	m_costs[vect_body] = INT_MAX;
>      }
>  
>    vector_costs::finish_cost (scalar_costs);
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 3b46c58a8d8..85802f07443 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -2773,6 +2773,7 @@ start_over:
>      }
>  
>    loop_vinfo->vector_costs = init_cost (loop_vinfo, false);
> +  vector_costs saved_costs = *loop_vinfo->vector_costs;

The target can derive the class and add its own member variables,
so I think we'd need some kind of clone method for this to be safe.

>    /* Analyze the alignment of the data-refs in the loop.
>       Fail if a data reference is found that cannot be vectorized.  */
> @@ -3017,6 +3018,8 @@ start_over:
>  	LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
>      }
>  
> +  saved_costs = *loop_vinfo->vector_costs;
> +again_no_partial_vectors:
>    /* If we're vectorizing an epilogue loop, the vectorized loop either needs
>       to be able to handle fewer than VF scalars, or needs to have a lower VF
>       than the main loop.  */
> @@ -3043,6 +3046,17 @@ start_over:
>      {
>        ok = opt_result::failure_at (vect_location,
>  				   "Loop costings may not be worthwhile.\n");
> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
> +	{
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_NOTE, vect_location,
> +			     "trying with partial vectors disabled\n");
> +	  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +	  LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = false;
> +	  LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = false;

I guess these last two variables should have been set after
vect_determine_partial_vectors_and_peeling rather than before.
Sorry for not noticing earlier.

The approach seems OK to me otherwise FWIW.

Thanks,
Richard

> +	  *loop_vinfo->vector_costs = saved_costs;
> +	  goto again_no_partial_vectors;
> +	}
>        goto again;
>      }
>    if (!res)
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 08d071463fb..003af878ee7 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1854,7 +1854,7 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
>        using_partial_vectors_p = true;
>      }
>  
> -  if (loop_vinfo->shared->has_zero_dep_dist
> +  if (0 && loop_vinfo->shared->has_zero_dep_dist
>        && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
>      {
>        if (dump_enabled_p ())

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences
  2023-06-29 13:16   ` Richard Biener
@ 2023-07-04 13:37     ` Richard Biener
  2023-07-04 19:49       ` Richard Sandiford
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Biener @ 2023-07-04 13:37 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Thu, 29 Jun 2023, Richard Biener wrote:

> On Thu, 29 Jun 2023, Richard Sandiford wrote:
> 
> > Richard Biener <rguenther@suse.de> writes:
> > > With applying loop masking to epilogues on x86_64 AVX512 we see
> > > some significant performance regressions when evaluating SPEC CPU 2017
> > > that are caused by store-to-load forwarding fails across outer
> > > loop iterations when the inner loop does not iterate.  Consider
> > >
> > >   for (j = 0; j < m; ++j)
> > >     for (i = 0; i < n; ++i)
> > >       a[j*n + i] += b[j*n + i];
> > >
> > > with 'n' chosen so that the inner loop vectorized code is fully
> > > executed by the masked epilogue and that masked epilogue
> > > storing O > n elements (with elements >= n masked of course).
> > > Then the masked load performed for the next outer loop iteration
> > > will get a hit in the store queue but it obviously cannot forward
> > > so we have to wait for the store to retire.
> > >
> > > That causes a significant hit to performance especially if 'n'
> > > would have made a non-masked epilogue to fully cover 'n' as well
> > > (say n == 4 for a V4DImode epilogue), avoiding the need for
> > > store-forwarding and waiting for the retiring of the store.
> > >
> > > The following applies a very simple heuristic, disabling
> > > the use of loop masking when there's a memory reference pair
> > > with dependence distance zero.  That resolves the issue
> > > (other problematic dependence distances seem to be less common
> > > at least).
> > >
> > > I have applied this heuristic in generic vectorizer code but
> > > restricted it to non-VL vector sizes.  There currently isn't
> > > a way for the target to request disabling of masking only,
> > > while we can reject the vectoriztion at costing time that will
> > > not re-consider the same vector mode but without masking.
> > > It seems simply re-costing with masking disabled should be
> > > possible through, we'd just need an indication whether that
> > > should be done?  Maybe always when the current vector mode is
> > > of fixed size?
> > >
> > > I wonder how SVE vectorized code behaves in these situations?
> > > The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
> > > 503.bwaves_r though I think both will need a hardware vector
> > > size covering at least 8 doubles to show the issue.  527.cam4_r
> > > has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.
> > >
> > > Bootstrap / regtest running on x86_64-unknown-linux-gnu.
> > >
> > > Any comments?
> > >
> > > Thanks,
> > > Richard.
> > >
> > > 	PR target/110456
> > > 	* tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
> > > 	* tree-vectorizer.cc (vec_info_shared::vec_info_shared):
> > > 	Initialize has_zero_dep_dist.
> > > 	* tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
> > > 	Remember if we've seen a dependence distance of zero.
> > > 	* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
> > > 	When we've seen a dependence distance of zero and the vector
> > > 	type has constant size disable the use of partial vectors.
> > > ---
> > >  gcc/tree-vect-data-refs.cc |  2 ++
> > >  gcc/tree-vect-stmts.cc     | 10 ++++++++++
> > >  gcc/tree-vectorizer.cc     |  1 +
> > >  gcc/tree-vectorizer.h      |  3 +++
> > >  4 files changed, 16 insertions(+)
> > >
> > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> > > index ebe93832b1e..40cde95c16a 100644
> > > --- a/gcc/tree-vect-data-refs.cc
> > > +++ b/gcc/tree-vect-data-refs.cc
> > > @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
> > >  			     "dependence distance == 0 between %T and %T\n",
> > >  			     DR_REF (dra), DR_REF (drb));
> > >  
> > > +	  loop_vinfo->shared->has_zero_dep_dist = true;
> > > +
> > >  	  /* When we perform grouped accesses and perform implicit CSE
> > >  	     by detecting equal accesses and doing disambiguation with
> > >  	     runtime alias tests like for
> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > index d642d3c257f..3bcbc000323 100644
> > > --- a/gcc/tree-vect-stmts.cc
> > > +++ b/gcc/tree-vect-stmts.cc
> > > @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
> > >        using_partial_vectors_p = true;
> > >      }
> > >  
> > > +  if (loop_vinfo->shared->has_zero_dep_dist
> > > +      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
> > 
> > I don't think it makes sense to treat VLA and VLS differently here.
> > 
> > But RMW operations are very common, so it seems like we're giving up a
> > lot on the off chance that the inner loop is applied iteratively
> > to successive memory locations.
> 
> Yes ...
> 
> > Maybe that's still OK for AVX512, where I guess loop masking is more
> > of a niche use case.  But if so, then yeah, I think a hook to disable
> > masking might be better here.
> 
> It's a nice use case in that if you'd cost the main vector loop
> with and without masking the not masked case is always winning.
> I understand with SVE it would be the same if you fix the
> vector size but otherwise it would be a win to mask as there's
> the chance the HW implementation uses bigger vectors than the
> architected minimum size.
> 
> So for AVX512 the win is with the epilogue and the case of
> few scalar iterations where the epilogue iterations play
> a significant role.  Since we only vectorize the epilogue
> of the main loop but not the epilogue of the epilogue loop
> we're leaving quite some iterations unvectorized when the
> main loop uses 512bit vectors and the epilogue 256bit.
> 
> But that case specifically is also prone to make the cross-iteration
> issue wrt an outer loop significiant ...
> 
> I'll see to address this on the costing side somehow.

So with us requiring both partial vector and non-partial vector variants
working during vectorizable_* I thought it should be possible to
simply reset LOOP_VINFO_USING_PARTIAL_VECTORS_P and re-cost without
re-analyzing in case the target deemed the partial vector using loop
not profitable.

While the following successfully hacks this in-place the question is
whether the above is really true for VLA vector types?

Besides from this "working", maybe the targets should get to say
more specifically what they'd like to change - should we make
the finish_cost () hook have a more elaborate return value or
provide more hints like we already do with the suggested unroll factor?

I can imagine a "partial vector usage is bad" or "too high VF"?
But then we probably still want to re-do the final costing part
so we'd need a way to push away the per-stmt costing we already
did (sth nicer than the hack below).  Maybe make
'bool costing_for_scalar' a tri-state, adding 'produce copy
from current vector cost in vinfo'?

Anyway - the takeaway is the x86 backend likes a way to disable
the use of partial vectors for the epilog (or main loop) in some cases.
An alternative to doing this somehow via the costing hooks would be
to add a new hook - the specific data dependence check could be
a hook invoked after dependence checking, initializing
LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P?

Any good ideas?  Anything that comes to your minds that would be
useful in this area for other targets?

Thanks,
Richard.

The following applies ontop of the earlier posted patch in this thread.


diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 8989985700a..a892f72ded3 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23976,6 +23976,10 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
 	  && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
 	      > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
 	m_costs[vect_body] = INT_MAX;
+
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+	  && loop_vinfo->shared->has_zero_dep_dist)
+	m_costs[vect_body] = INT_MAX;
     }
 
   vector_costs::finish_cost (scalar_costs);
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 3b46c58a8d8..85802f07443 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2773,6 +2773,7 @@ start_over:
     }
 
   loop_vinfo->vector_costs = init_cost (loop_vinfo, false);
+  vector_costs saved_costs = *loop_vinfo->vector_costs;
 
   /* Analyze the alignment of the data-refs in the loop.
      Fail if a data reference is found that cannot be vectorized.  */
@@ -3017,6 +3018,8 @@ start_over:
 	LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
     }
 
+  saved_costs = *loop_vinfo->vector_costs;
+again_no_partial_vectors:
   /* If we're vectorizing an epilogue loop, the vectorized loop either needs
      to be able to handle fewer than VF scalars, or needs to have a lower VF
      than the main loop.  */
@@ -3043,6 +3046,17 @@ start_over:
     {
       ok = opt_result::failure_at (vect_location,
 				   "Loop costings may not be worthwhile.\n");
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "trying with partial vectors disabled\n");
+	  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	  LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = false;
+	  LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = false;
+	  *loop_vinfo->vector_costs = saved_costs;
+	  goto again_no_partial_vectors;
+	}
       goto again;
     }
   if (!res)
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 08d071463fb..003af878ee7 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1854,7 +1854,7 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
       using_partial_vectors_p = true;
     }
 
-  if (loop_vinfo->shared->has_zero_dep_dist
+  if (0 && loop_vinfo->shared->has_zero_dep_dist
       && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
     {
       if (dump_enabled_p ())

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences
  2023-06-29 12:55 ` Richard Sandiford
@ 2023-06-29 13:16   ` Richard Biener
  2023-07-04 13:37     ` Richard Biener
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Biener @ 2023-06-29 13:16 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Thu, 29 Jun 2023, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > With applying loop masking to epilogues on x86_64 AVX512 we see
> > some significant performance regressions when evaluating SPEC CPU 2017
> > that are caused by store-to-load forwarding fails across outer
> > loop iterations when the inner loop does not iterate.  Consider
> >
> >   for (j = 0; j < m; ++j)
> >     for (i = 0; i < n; ++i)
> >       a[j*n + i] += b[j*n + i];
> >
> > with 'n' chosen so that the inner loop vectorized code is fully
> > executed by the masked epilogue and that masked epilogue
> > storing O > n elements (with elements >= n masked of course).
> > Then the masked load performed for the next outer loop iteration
> > will get a hit in the store queue but it obviously cannot forward
> > so we have to wait for the store to retire.
> >
> > That causes a significant hit to performance especially if 'n'
> > would have made a non-masked epilogue to fully cover 'n' as well
> > (say n == 4 for a V4DImode epilogue), avoiding the need for
> > store-forwarding and waiting for the retiring of the store.
> >
> > The following applies a very simple heuristic, disabling
> > the use of loop masking when there's a memory reference pair
> > with dependence distance zero.  That resolves the issue
> > (other problematic dependence distances seem to be less common
> > at least).
> >
> > I have applied this heuristic in generic vectorizer code but
> > restricted it to non-VL vector sizes.  There currently isn't
> > a way for the target to request disabling of masking only,
> > while we can reject the vectoriztion at costing time that will
> > not re-consider the same vector mode but without masking.
> > It seems simply re-costing with masking disabled should be
> > possible through, we'd just need an indication whether that
> > should be done?  Maybe always when the current vector mode is
> > of fixed size?
> >
> > I wonder how SVE vectorized code behaves in these situations?
> > The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
> > 503.bwaves_r though I think both will need a hardware vector
> > size covering at least 8 doubles to show the issue.  527.cam4_r
> > has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.
> >
> > Bootstrap / regtest running on x86_64-unknown-linux-gnu.
> >
> > Any comments?
> >
> > Thanks,
> > Richard.
> >
> > 	PR target/110456
> > 	* tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
> > 	* tree-vectorizer.cc (vec_info_shared::vec_info_shared):
> > 	Initialize has_zero_dep_dist.
> > 	* tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
> > 	Remember if we've seen a dependence distance of zero.
> > 	* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
> > 	When we've seen a dependence distance of zero and the vector
> > 	type has constant size disable the use of partial vectors.
> > ---
> >  gcc/tree-vect-data-refs.cc |  2 ++
> >  gcc/tree-vect-stmts.cc     | 10 ++++++++++
> >  gcc/tree-vectorizer.cc     |  1 +
> >  gcc/tree-vectorizer.h      |  3 +++
> >  4 files changed, 16 insertions(+)
> >
> > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> > index ebe93832b1e..40cde95c16a 100644
> > --- a/gcc/tree-vect-data-refs.cc
> > +++ b/gcc/tree-vect-data-refs.cc
> > @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
> >  			     "dependence distance == 0 between %T and %T\n",
> >  			     DR_REF (dra), DR_REF (drb));
> >  
> > +	  loop_vinfo->shared->has_zero_dep_dist = true;
> > +
> >  	  /* When we perform grouped accesses and perform implicit CSE
> >  	     by detecting equal accesses and doing disambiguation with
> >  	     runtime alias tests like for
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index d642d3c257f..3bcbc000323 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
> >        using_partial_vectors_p = true;
> >      }
> >  
> > +  if (loop_vinfo->shared->has_zero_dep_dist
> > +      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
> 
> I don't think it makes sense to treat VLA and VLS differently here.
> 
> But RMW operations are very common, so it seems like we're giving up a
> lot on the off chance that the inner loop is applied iteratively
> to successive memory locations.

Yes ...

> Maybe that's still OK for AVX512, where I guess loop masking is more
> of a niche use case.  But if so, then yeah, I think a hook to disable
> masking might be better here.

It's a nice use case in that if you'd cost the main vector loop
with and without masking the not masked case is always winning.
I understand with SVE it would be the same if you fix the
vector size but otherwise it would be a win to mask as there's
the chance the HW implementation uses bigger vectors than the
architected minimum size.

So for AVX512 the win is with the epilogue and the case of
few scalar iterations where the epilogue iterations play
a significant role.  Since we only vectorize the epilogue
of the main loop but not the epilogue of the epilogue loop
we're leaving quite some iterations unvectorized when the
main loop uses 512bit vectors and the epilogue 256bit.

But that case specifically is also prone to make the cross-iteration
issue wrt an outer loop significiant ...

I'll see to address this on the costing side somehow.

Thanks,
Richard.

> Thanks,
> Richard
> 
> > +    {
> > +      if (dump_enabled_p ())
> > +	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > +			 "disabling partial vectors because of possible "
> > +			 "STLF issues\n");
> > +      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> > +    }
> > +
> >    if (!using_partial_vectors_p)
> >      {
> >        if (dump_enabled_p ())
> > diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> > index a048e9d8917..74457259b6e 100644
> > --- a/gcc/tree-vectorizer.cc
> > +++ b/gcc/tree-vectorizer.cc
> > @@ -478,6 +478,7 @@ vec_info::~vec_info ()
> >  
> >  vec_info_shared::vec_info_shared ()
> >    : n_stmts (0),
> > +    has_zero_dep_dist (false),
> >      datarefs (vNULL),
> >      datarefs_copy (vNULL),
> >      ddrs (vNULL)
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index a36974c2c0d..7626cda2a73 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -419,6 +419,9 @@ public:
> >    /* The number of scalar stmts.  */
> >    unsigned n_stmts;
> >  
> > +  /* Whether there's a dependence with zero distance.  */
> > +  bool has_zero_dep_dist;
> > +
> >    /* All data references.  Freed by free_data_refs, so not an auto_vec.  */
> >    vec<data_reference_p> datarefs;
> >    vec<data_reference> datarefs_copy;
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences
       [not found] <5b45a654-529e-4b2e-8a3e-b81ba94d28a1@AM7EUR03FT055.eop-EUR03.prod.protection.outlook.com>
@ 2023-06-29 12:55 ` Richard Sandiford
  2023-06-29 13:16   ` Richard Biener
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Sandiford @ 2023-06-29 12:55 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <rguenther@suse.de> writes:
> With applying loop masking to epilogues on x86_64 AVX512 we see
> some significant performance regressions when evaluating SPEC CPU 2017
> that are caused by store-to-load forwarding fails across outer
> loop iterations when the inner loop does not iterate.  Consider
>
>   for (j = 0; j < m; ++j)
>     for (i = 0; i < n; ++i)
>       a[j*n + i] += b[j*n + i];
>
> with 'n' chosen so that the inner loop vectorized code is fully
> executed by the masked epilogue and that masked epilogue
> storing O > n elements (with elements >= n masked of course).
> Then the masked load performed for the next outer loop iteration
> will get a hit in the store queue but it obviously cannot forward
> so we have to wait for the store to retire.
>
> That causes a significant hit to performance especially if 'n'
> would have made a non-masked epilogue to fully cover 'n' as well
> (say n == 4 for a V4DImode epilogue), avoiding the need for
> store-forwarding and waiting for the retiring of the store.
>
> The following applies a very simple heuristic, disabling
> the use of loop masking when there's a memory reference pair
> with dependence distance zero.  That resolves the issue
> (other problematic dependence distances seem to be less common
> at least).
>
> I have applied this heuristic in generic vectorizer code but
> restricted it to non-VL vector sizes.  There currently isn't
> a way for the target to request disabling of masking only,
> while we can reject the vectoriztion at costing time that will
> not re-consider the same vector mode but without masking.
> It seems simply re-costing with masking disabled should be
> possible through, we'd just need an indication whether that
> should be done?  Maybe always when the current vector mode is
> of fixed size?
>
> I wonder how SVE vectorized code behaves in these situations?
> The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
> 503.bwaves_r though I think both will need a hardware vector
> size covering at least 8 doubles to show the issue.  527.cam4_r
> has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.
>
> Bootstrap / regtest running on x86_64-unknown-linux-gnu.
>
> Any comments?
>
> Thanks,
> Richard.
>
> 	PR target/110456
> 	* tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
> 	* tree-vectorizer.cc (vec_info_shared::vec_info_shared):
> 	Initialize has_zero_dep_dist.
> 	* tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
> 	Remember if we've seen a dependence distance of zero.
> 	* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
> 	When we've seen a dependence distance of zero and the vector
> 	type has constant size disable the use of partial vectors.
> ---
>  gcc/tree-vect-data-refs.cc |  2 ++
>  gcc/tree-vect-stmts.cc     | 10 ++++++++++
>  gcc/tree-vectorizer.cc     |  1 +
>  gcc/tree-vectorizer.h      |  3 +++
>  4 files changed, 16 insertions(+)
>
> diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> index ebe93832b1e..40cde95c16a 100644
> --- a/gcc/tree-vect-data-refs.cc
> +++ b/gcc/tree-vect-data-refs.cc
> @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
>  			     "dependence distance == 0 between %T and %T\n",
>  			     DR_REF (dra), DR_REF (drb));
>  
> +	  loop_vinfo->shared->has_zero_dep_dist = true;
> +
>  	  /* When we perform grouped accesses and perform implicit CSE
>  	     by detecting equal accesses and doing disambiguation with
>  	     runtime alias tests like for
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index d642d3c257f..3bcbc000323 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
>        using_partial_vectors_p = true;
>      }
>  
> +  if (loop_vinfo->shared->has_zero_dep_dist
> +      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())

I don't think it makes sense to treat VLA and VLS differently here.

But RMW operations are very common, so it seems like we're giving up a
lot on the off chance that the inner loop is applied iteratively
to successive memory locations.

Maybe that's still OK for AVX512, where I guess loop masking is more
of a niche use case.  But if so, then yeah, I think a hook to disable
masking might be better here.

Thanks,
Richard

> +    {
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			 "disabling partial vectors because of possible "
> +			 "STLF issues\n");
> +      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +    }
> +
>    if (!using_partial_vectors_p)
>      {
>        if (dump_enabled_p ())
> diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> index a048e9d8917..74457259b6e 100644
> --- a/gcc/tree-vectorizer.cc
> +++ b/gcc/tree-vectorizer.cc
> @@ -478,6 +478,7 @@ vec_info::~vec_info ()
>  
>  vec_info_shared::vec_info_shared ()
>    : n_stmts (0),
> +    has_zero_dep_dist (false),
>      datarefs (vNULL),
>      datarefs_copy (vNULL),
>      ddrs (vNULL)
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index a36974c2c0d..7626cda2a73 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -419,6 +419,9 @@ public:
>    /* The number of scalar stmts.  */
>    unsigned n_stmts;
>  
> +  /* Whether there's a dependence with zero distance.  */
> +  bool has_zero_dep_dist;
> +
>    /* All data references.  Freed by free_data_refs, so not an auto_vec.  */
>    vec<data_reference_p> datarefs;
>    vec<data_reference> datarefs_copy;

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-07-05 13:55 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-29 12:28 [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences Richard Biener
     [not found] <5b45a654-529e-4b2e-8a3e-b81ba94d28a1@AM7EUR03FT055.eop-EUR03.prod.protection.outlook.com>
2023-06-29 12:55 ` Richard Sandiford
2023-06-29 13:16   ` Richard Biener
2023-07-04 13:37     ` Richard Biener
2023-07-04 19:49       ` Richard Sandiford
2023-07-05 13:29         ` Richard Biener
2023-07-05 13:55           ` Richard Sandiford

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).