public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Richard Biener <rguenther@suse.de>
To: Richard Sandiford <richard.sandiford@arm.com>
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences
Date: Thu, 29 Jun 2023 13:16:21 +0000 (UTC)	[thread overview]
Message-ID: <nycvar.YFH.7.77.849.2306291310060.4723@jbgna.fhfr.qr> (raw)
In-Reply-To: <mptbkgy5sxh.fsf@arm.com>

On Thu, 29 Jun 2023, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > With applying loop masking to epilogues on x86_64 AVX512 we see
> > some significant performance regressions when evaluating SPEC CPU 2017
> > that are caused by store-to-load forwarding fails across outer
> > loop iterations when the inner loop does not iterate.  Consider
> >
> >   for (j = 0; j < m; ++j)
> >     for (i = 0; i < n; ++i)
> >       a[j*n + i] += b[j*n + i];
> >
> > with 'n' chosen so that the inner loop vectorized code is fully
> > executed by the masked epilogue and that masked epilogue
> > storing O > n elements (with elements >= n masked of course).
> > Then the masked load performed for the next outer loop iteration
> > will get a hit in the store queue but it obviously cannot forward
> > so we have to wait for the store to retire.
> >
> > That causes a significant hit to performance especially if 'n'
> > would have made a non-masked epilogue to fully cover 'n' as well
> > (say n == 4 for a V4DImode epilogue), avoiding the need for
> > store-forwarding and waiting for the retiring of the store.
> >
> > The following applies a very simple heuristic, disabling
> > the use of loop masking when there's a memory reference pair
> > with dependence distance zero.  That resolves the issue
> > (other problematic dependence distances seem to be less common
> > at least).
> >
> > I have applied this heuristic in generic vectorizer code but
> > restricted it to non-VL vector sizes.  There currently isn't
> > a way for the target to request disabling of masking only,
> > while we can reject the vectoriztion at costing time that will
> > not re-consider the same vector mode but without masking.
> > It seems simply re-costing with masking disabled should be
> > possible through, we'd just need an indication whether that
> > should be done?  Maybe always when the current vector mode is
> > of fixed size?
> >
> > I wonder how SVE vectorized code behaves in these situations?
> > The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
> > 503.bwaves_r though I think both will need a hardware vector
> > size covering at least 8 doubles to show the issue.  527.cam4_r
> > has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.
> >
> > Bootstrap / regtest running on x86_64-unknown-linux-gnu.
> >
> > Any comments?
> >
> > Thanks,
> > Richard.
> >
> > 	PR target/110456
> > 	* tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
> > 	* tree-vectorizer.cc (vec_info_shared::vec_info_shared):
> > 	Initialize has_zero_dep_dist.
> > 	* tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
> > 	Remember if we've seen a dependence distance of zero.
> > 	* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
> > 	When we've seen a dependence distance of zero and the vector
> > 	type has constant size disable the use of partial vectors.
> > ---
> >  gcc/tree-vect-data-refs.cc |  2 ++
> >  gcc/tree-vect-stmts.cc     | 10 ++++++++++
> >  gcc/tree-vectorizer.cc     |  1 +
> >  gcc/tree-vectorizer.h      |  3 +++
> >  4 files changed, 16 insertions(+)
> >
> > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> > index ebe93832b1e..40cde95c16a 100644
> > --- a/gcc/tree-vect-data-refs.cc
> > +++ b/gcc/tree-vect-data-refs.cc
> > @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
> >  			     "dependence distance == 0 between %T and %T\n",
> >  			     DR_REF (dra), DR_REF (drb));
> >  
> > +	  loop_vinfo->shared->has_zero_dep_dist = true;
> > +
> >  	  /* When we perform grouped accesses and perform implicit CSE
> >  	     by detecting equal accesses and doing disambiguation with
> >  	     runtime alias tests like for
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index d642d3c257f..3bcbc000323 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
> >        using_partial_vectors_p = true;
> >      }
> >  
> > +  if (loop_vinfo->shared->has_zero_dep_dist
> > +      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
> 
> I don't think it makes sense to treat VLA and VLS differently here.
> 
> But RMW operations are very common, so it seems like we're giving up a
> lot on the off chance that the inner loop is applied iteratively
> to successive memory locations.

Yes ...

> Maybe that's still OK for AVX512, where I guess loop masking is more
> of a niche use case.  But if so, then yeah, I think a hook to disable
> masking might be better here.

It's a nice use case in that if you'd cost the main vector loop
with and without masking the not masked case is always winning.
I understand with SVE it would be the same if you fix the
vector size but otherwise it would be a win to mask as there's
the chance the HW implementation uses bigger vectors than the
architected minimum size.

So for AVX512 the win is with the epilogue and the case of
few scalar iterations where the epilogue iterations play
a significant role.  Since we only vectorize the epilogue
of the main loop but not the epilogue of the epilogue loop
we're leaving quite some iterations unvectorized when the
main loop uses 512bit vectors and the epilogue 256bit.

But that case specifically is also prone to make the cross-iteration
issue wrt an outer loop significiant ...

I'll see to address this on the costing side somehow.

Thanks,
Richard.

> Thanks,
> Richard
> 
> > +    {
> > +      if (dump_enabled_p ())
> > +	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > +			 "disabling partial vectors because of possible "
> > +			 "STLF issues\n");
> > +      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> > +    }
> > +
> >    if (!using_partial_vectors_p)
> >      {
> >        if (dump_enabled_p ())
> > diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> > index a048e9d8917..74457259b6e 100644
> > --- a/gcc/tree-vectorizer.cc
> > +++ b/gcc/tree-vectorizer.cc
> > @@ -478,6 +478,7 @@ vec_info::~vec_info ()
> >  
> >  vec_info_shared::vec_info_shared ()
> >    : n_stmts (0),
> > +    has_zero_dep_dist (false),
> >      datarefs (vNULL),
> >      datarefs_copy (vNULL),
> >      ddrs (vNULL)
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index a36974c2c0d..7626cda2a73 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -419,6 +419,9 @@ public:
> >    /* The number of scalar stmts.  */
> >    unsigned n_stmts;
> >  
> > +  /* Whether there's a dependence with zero distance.  */
> > +  bool has_zero_dep_dist;
> > +
> >    /* All data references.  Freed by free_data_refs, so not an auto_vec.  */
> >    vec<data_reference_p> datarefs;
> >    vec<data_reference> datarefs_copy;
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

  reply	other threads:[~2023-06-29 13:16 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <5b45a654-529e-4b2e-8a3e-b81ba94d28a1@AM7EUR03FT055.eop-EUR03.prod.protection.outlook.com>
2023-06-29 12:55 ` Richard Sandiford
2023-06-29 13:16   ` Richard Biener [this message]
2023-07-04 13:37     ` Richard Biener
2023-07-04 19:49       ` Richard Sandiford
2023-07-05 13:29         ` Richard Biener
2023-07-05 13:55           ` Richard Sandiford
2023-06-29 12:28 Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=nycvar.YFH.7.77.849.2306291310060.4723@jbgna.fhfr.qr \
    --to=rguenther@suse.de \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=richard.sandiford@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).