From: Richard Biener <rguenther@suse.de>
To: "Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	 Richard Sandiford <richard.sandiford@arm.com>
Subject: Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
Date: Fri, 7 May 2021 14:05:32 +0200 (CEST)
Message-ID: <nycvar.YFH.7.76.2105071351240.9200@zhemvz.fhfr.qr>
In-Reply-To: <6e4541ef-e0a1-1d2d-53f5-4bfed9a65598@arm.com>

On Wed, 5 May 2021, Andre Vieira (lists) wrote:

> 
> On 05/05/2021 13:34, Richard Biener wrote:
> > On Wed, 5 May 2021, Andre Vieira (lists) wrote:
> >
> >> I tried to see what IVOPTs would make of this and it is able to analyze
> >> the IVs, but it doesn't realize (not even sure it tries) that one IV's
> >> end (loop 1) could be used as the base for the other (loop 2). I don't
> >> know if this is where you'd want such optimizations to be made; on one
> >> hand I think it would be great as it would also help with non-vectorized
> >> loops, as you alluded to.
> > Hmm, OK.  So there's the first loop that has a loop-around jump and thus
> > we do not always enter the 2nd loop with the first loop's final value of
> > the IV.  But yes, IVOPTs does not try to allocate IVs across multiple
> > loops.  And for a followup transform to catch this it would need to
> > compute the final value of the IV and then match this up with the
> > initial value computation.  I suppose FRE could be taught to do this, at
> > least for very simple cases.
> I will admit I am not at all familiar with how FRE works; I know it exists
> because running it often breaks my vector patches :P But that's about all
> I know.
> I will have a look and see if it makes sense from my perspective to address
> it there, because ...
> >
> >> Anyway, I digress. Back to the main question of this patch: how do you
> >> suggest I go about this? Is there a way to make IVOPTs aware of the
> >> 'iterate-once' IVs in the epilogue(s) (both vector and scalar!) and then
> >> teach it to merge IVs if one ends where the other begins?
> > I don't think we will make that work easily.  So indeed attacking this
> > in the vectorizer sounds most promising.
> 
> The problem I found with my approach is that it only tackles the
> vectorized epilogues, and that leads to regressions. I don't have the
> example at hand, but what I saw happening was that increased register
> pressure led to a spill in the hot path. I believe this was caused by the
> epilogue loop using the updated pointers as the base for its DRs (in this
> case there were three DRs: two loads and one store), while the scalar
> epilogue still uses the original base + niters, since this data_reference
> approach only changes the vectorized epilogues.

Yeah, this issue obviously extends to the scalar prologue and epilogue loops...

So ideally we'd produce IL (mainly for the IV setup code, I guess)
that will be handled well by the following passes, but then IVOPTs
is not multi-loop aware ...

That said, in the end we should be able to code-generate the scalar
loops as well (my plan is to add that, at least for the vector
loop, to be able to support partly vectorized loops with unvectorizable
stmts simply replicated as scalar ops).  In that case we can use
the same IVs again.
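
To make that concrete, here is a hand-written C sketch (illustration only;
the function is made up and this is not vectorizer output) of the shape
we'd like the code to take, with the epilogue continuing from the main
loop's updated pointers instead of re-deriving base + niters:

void
add_arrays (float *restrict a, const float *restrict b, int n)
{
  int i = 0;
  float *pa = a;
  const float *pb = b;
  /* Main loop, VF = 8 elements per iteration; the inner loop stands in
     for one vector operation.  */
  for (; i + 8 <= n; i += 8, pa += 8, pb += 8)
    for (int j = 0; j < 8; j++)
      pa[j] += pb[j];
  /* Epilogue: reuse the updated pointers pa/pb directly rather than
     recomputing a + i and b + i from the original bases and niters.  */
  for (; i < n; ++i, ++pa, ++pb)
    *pa += *pb;
}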

> >   I'll note there's also
> > the issue of epilogue vectorization and reductions where we seem
> > to not re-use partially reduced reduction vectors but instead
> > reduce to a scalar in each step.  That's a related issue - we're
> > not able to carry forward a (reduction) IV we generated for the
> > main vector loop to the epilogue loops.  Like for
> >
> > double foo (double *a, int n)
> > {
> >    double sum = 0.;
> >    for (int i = 0; i < n; ++i)
> >      sum += a[i];
> >    return sum;
> > }
> >
> > with AVX512 we get three reductions to scalars instead of a partial
> > reduction from zmm to ymm before the first vectorized epilogue, followed
> > by a reduction from ymm to xmm before the second (the jump-arounds for
> > the epilogues obviously need to jump to the further reduction piece).
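
A hand-written sketch of the shape meant here, using AVX-512/AVX2
intrinsics (illustration only; the function is made up and this is not
what the vectorizer currently emits):

#include <immintrin.h>

double
sum_avx512 (const double *a, int n)
{
  __m512d acc8 = _mm512_setzero_pd ();
  int i = 0;
  /* Main zmm loop.  */
  for (; i + 8 <= n; i += 8)
    acc8 = _mm512_add_pd (acc8, _mm512_loadu_pd (a + i));

  /* zmm -> ymm partial reduction feeding the first (ymm) epilogue.  */
  __m256d acc4 = _mm256_add_pd (_mm512_castpd512_pd256 (acc8),
                                _mm512_extractf64x4_pd (acc8, 1));
  for (; i + 4 <= n; i += 4)
    acc4 = _mm256_add_pd (acc4, _mm256_loadu_pd (a + i));

  /* ymm -> xmm partial reduction feeding the second (xmm) epilogue.  */
  __m128d acc2 = _mm_add_pd (_mm256_castpd256_pd128 (acc4),
                             _mm256_extractf128_pd (acc4, 1));
  for (; i + 2 <= n; i += 2)
    acc2 = _mm_add_pd (acc2, _mm_loadu_pd (a + i));

  /* Single final reduction to scalar, plus the scalar tail.  */
  double sum = _mm_cvtsd_f64 (_mm_add_pd (acc2,
                                          _mm_unpackhi_pd (acc2, acc2)));
  for (; i < n; ++i)
    sum += a[i];
  return sum;
}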
> >
> > So I think we want to record the IVs we generate (the reduction IVs
> > are already nicely associated with the stmt-infos); one might
> > consider referring to them from the dr_vec_info, for example.
> >
> > It's just going to be "interesting" to wire everything up
> > correctly with all the jump-arounds we have ...
> I have a downstream hack for the reductions, but it only worked for
> partial-vector-usage, as there you have the guarantee it's the same
> vector mode, so you don't need to faff around with half and full vectors.
> Obviously what you are suggesting has much wider applicability, and not
> surprisingly Richard Sandiford also pointed out to me that these are
> somewhat related and that we might be able to reuse the IV creation to
> manage it all. But I feel like I am currently light years away from that.
> 
> I had started to look at removing the data_reference updating we have now
> and dealing with this in the 'create_iv' calls from
> 'vect_create_data_ref_ptr' inside 'vectorizable_{load,store}', but then I
> thought it would be good to discuss it with you first. This will require
> keeping track of the 'end-value' of the IV, which for loops where we can
> skip the previous loop means we will need to construct a phi-node
> containing the updated pointer and the initial base. But I'm not entirely
> sure where to keep track of all this. Also, I don't know if I can replace
> the base address of the data_reference right there at the 'create_iv'
> call; can a data_reference be used multiple times in the same loop?

Yes, it can.  I think we should see the data-ref IVs more as first-class objects.

But then we can also completely forgo those IVs and compute the
actual IV values based on the/a canonical IV we have/add.  We'd
then make sure to connect the canonical IV through all loops and
hope for IVOPTs to make sense of the mess we code-generate for
each dataref access ;)  Of course the datarefs in the scalar code
are not based on the canonical IV, but hey, IVOPTs eventually
manages to express them in terms of it.  And for cross-loop IVOPT
the important part would be that it's easy to connect the final/initial
values of adjacent loops.

Just in case you're out of ideas to experiment with.
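
For concreteness, a rough source-level sketch of that scheme (the
function, the names and the fixed VF of 8 are made up for illustration):
every data-ref address is recomputed from one canonical counter, which is
the only value carried from loop to loop:

void
add_arrays_canonical (float *a, const float *b, int n)
{
  int iv = 0;   /* the canonical IV, notionally shared by all loops */
  /* Main vectorized loop; addresses are re-derived from iv each
     iteration instead of using separate pointer IVs.  */
  for (; iv + 8 <= n; iv += 8)
    {
      float *ap = a + iv;
      const float *bp = b + iv;
      for (int j = 0; j < 8; j++)   /* stands in for the vector op */
        ap[j] += bp[j];
    }
  /* The epilogue starts from the same iv, so the final value of one
     loop is trivially the initial value of the next.  */
  for (; iv < n; ++iv)
    a[iv] += b[iv];
}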

> I'll go do a bit more nosing around this idea and the ivmap you mentioned
> before. Let me know if you have any ideas on what this all should look
> like, even if it's only 'in an ideal world'.

So in an ideal world we'd have a more explicit representation of those
IVs, just like we know about the reductions and inductions.  For example,
for each vector store/load (SLP node or stmt_info) store the IV we use
(maybe in some extra struct so we can track how many uses we have, whether
it's the "main" DR, or whether we re-used it via the iv-map).  Then for
epilogues we could replicate that metadata by creating the corresponding
IVs in the epilogue.  That would still leave the issue of the scalar
prologue and epilogue loops.
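
Purely as a sketch of what such per-DR metadata might look like (none of
these names exist in GCC; in real code the members would be tree nodes,
and the placeholder typedef only keeps the sketch self-contained):

#include <stdbool.h>

typedef void *gvalue;     /* placeholder for a GIMPLE value (tree)  */

struct dr_iv_info
{
  gvalue iv;              /* the pointer IV created for this DR group  */
  gvalue final_value;     /* value on loop exit, seeds the epilogue IV  */
  unsigned n_uses;        /* number of loads/stores sharing this IV  */
  bool is_main_dr;        /* the IV was created for this DR ...  */
  bool reused_via_ivmap;  /* ... or was picked up through the iv-map  */
};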

> 
> Andre
> >
> >> On 04/05/2021 10:56, Richard Biener wrote:
> >>> On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> The aim of this RFC is to explore a way of cleaning up the codegen
> >>>> around data_references.  To be specific, I'd like to reuse the main
> >>>> loop's updated data_reference as the base_address for the epilogue's
> >>>> corresponding data_reference, rather than use the niters.  We have
> >>>> found this leads to better codegen in the vectorized epilogue loops.
> >>>>
> >>>> The approach in this RFC creates a map of iv_updates which always
> >>>> contains an updated pointer that is captured in
> >>>> vectorizable_{load,store}; an iv_update may also contain a skip_edge
> >>>> in case we decide the vectorization can be skipped in
> >>>> 'vect_do_peeling'. During the epilogue update this map of iv_updates
> >>>> is then checked to see if it contains an entry for a data_reference;
> >>>> if it does, that entry is used, and if not we revert back to the old
> >>>> behavior of using the niters to advance the data_reference.
> >>>>
> >>>> The motivation for this work is to improve codegen for the option
> >>>> `--param vect-partial-vector-usage=1` for SVE. We found that one of
> >>>> the main problems for the codegen here was coming from unnecessary
> >>>> conversions caused by the way we update the data_references in the
> >>>> epilogue.
> >>>>
> >>>> This patch passes regression testing on aarch64-linux-gnu, but the
> >>>> codegen is still not optimal in some cases, specifically those where
> >>>> we have a scalar epilogue, as this does not use the data_references
> >>>> and will rely on the gimple scalar code, thus again constructing a
> >>>> memory access using the niters. This is a limitation for which I
> >>>> haven't quite worked out a solution yet, and it does cause some minor
> >>>> regressions due to unfortunate spills.
> >>>>
> >>>> Let me know what you think and if you have ideas of how we can
> >>>> better achieve this.
> >>> Hmm, so the patch adds a kludge to improve the kludge we have in place ;)
> >>>
> >>> I think it might be interesting to create a C testcase mimicking the
> >>> update problem without involving the vectorizer.  That way we can
> >>> see how the various components involved behave (FRE + ivopts most
> >>> specifically).
> >>>
> >>> That said, a cleaner approach to dealing with this would be to
> >>> explicitly track the IVs we generate for vectorized DRs, eventually
> >>> factoring that out from vectorizable_{store,load} so we can simply
> >>> carry over the actual pointer IV's final value to the epilogue as
> >>> the initial value.  For each DR group we'd create a single IV (we can
> >>> even do better in case we have load + store of the "same" group).
> >>>
> >>> We already kind-of track things via the ivexpr_map, but I'm not sure
> >>> if this lazily populated map can be reliably re-used to "re-populate"
> >>> the epilogue one (walk the map, create epilogue IVs with the appropriate
> >>> initial value & adjusted update).
> >>>
> >>> Richard.
> >>>
> >>>> Kind regards,
> >>>> Andre Vieira
> >>>>
> >>>>
> >>>>
> >>
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
