From: Richard Biener <rguenther@suse.de>
To: Andrew Stubbs <ams@codesourcery.com>
Cc: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com,
Jan Hubicka <hubicka@ucw.cz>,
hongtao.liu@intel.com, kirill.yukhin@gmail.com
Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
Date: Thu, 15 Jun 2023 14:00:28 +0000 (UTC) [thread overview]
Message-ID: <nycvar.YFH.7.77.849.2306151358040.4723@jbgna.fhfr.qr> (raw)
In-Reply-To: <ea69279a-f9e4-11b0-a299-31df42fa6d41@codesourcery.com>
[-- Attachment #1: Type: text/plain, Size: 12905 bytes --]
On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> On 15/06/2023 14:34, Richard Biener wrote:
> > On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> >
> >> On 15/06/2023 12:06, Richard Biener wrote:
> >>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> >>>
> >>>> On 15/06/2023 10:58, Richard Biener wrote:
> >>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> >>>>>
> >>>>>> On 14/06/2023 15:29, Richard Biener wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>> Am 14.06.2023 um 16:27 schrieb Andrew Stubbs <ams@codesourcery.com>:
> >>>>>>>>
> >>>>>>>> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
> >>>>>>>>> This implemens fully masked vectorization or a masked epilog for
> >>>>>>>>> AVX512 style masks which single themselves out by representing
> >>>>>>>>> each lane with a single bit and by using integer modes for the mask
> >>>>>>>>> (both is much like GCN).
> >>>>>>>>> AVX512 is also special in that it doesn't have any instruction
> >>>>>>>>> to compute the mask from a scalar IV like SVE has with while_ult.
> >>>>>>>>> Instead the masks are produced by vector compares and the loop
> >>>>>>>>> control retains the scalar IV (mainly to avoid dependences on
> >>>>>>>>> mask generation, a suitable mask test instruction is available).
> >>>>>>>>
> >>>>>>>> This is also sounds like GCN. We currently use WHILE_ULT in the
> >>>>>>>> middle
> >>>>>>>> end
> >>>>>>>> which expands to a vector compare against a vector of stepped values.
> >>>>>>>> This
> >>>>>>>> requires an additional instruction to prepare the comparison vector
> >>>>>>>> (compared to SVE), but the "while_ultv64sidi" pattern (for example)
> >>>>>>>> returns
> >>>>>>>> the DImode bitmask, so it works reasonably well.
> >>>>>>>>
> >>>>>>>>> Like RVV code generation prefers a decrementing IV though IVOPTs
> >>>>>>>>> messes things up in some cases removing that IV to eliminate
> >>>>>>>>> it with an incrementing one used for address generation.
> >>>>>>>>> One of the motivating testcases is from PR108410 which in turn
> >>>>>>>>> is extracted from x264 where large size vectorization shows
> >>>>>>>>> issues with small trip loops. Execution time there improves
> >>>>>>>>> compared to classic AVX512 with AVX2 epilogues for the cases
> >>>>>>>>> of less than 32 iterations.
> >>>>>>>>> size scalar 128 256 512 512e 512f
> >>>>>>>>> 1 9.42 11.32 9.35 11.17 15.13 16.89
> >>>>>>>>> 2 5.72 6.53 6.66 6.66 7.62 8.56
> >>>>>>>>> 3 4.49 5.10 5.10 5.74 5.08 5.73
> >>>>>>>>> 4 4.10 4.33 4.29 5.21 3.79 4.25
> >>>>>>>>> 6 3.78 3.85 3.86 4.76 2.54 2.85
> >>>>>>>>> 8 3.64 1.89 3.76 4.50 1.92 2.16
> >>>>>>>>> 12 3.56 2.21 3.75 4.26 1.26 1.42
> >>>>>>>>> 16 3.36 0.83 1.06 4.16 0.95 1.07
> >>>>>>>>> 20 3.39 1.42 1.33 4.07 0.75 0.85
> >>>>>>>>> 24 3.23 0.66 1.72 4.22 0.62 0.70
> >>>>>>>>> 28 3.18 1.09 2.04 4.20 0.54 0.61
> >>>>>>>>> 32 3.16 0.47 0.41 0.41 0.47 0.53
> >>>>>>>>> 34 3.16 0.67 0.61 0.56 0.44 0.50
> >>>>>>>>> 38 3.19 0.95 0.95 0.82 0.40 0.45
> >>>>>>>>> 42 3.09 0.58 1.21 1.13 0.36 0.40
> >>>>>>>>> 'size' specifies the number of actual iterations, 512e is for
> >>>>>>>>> a masked epilog and 512f for the fully masked loop. From
> >>>>>>>>> 4 scalar iterations on the AVX512 masked epilog code is clearly
> >>>>>>>>> the winner, the fully masked variant is clearly worse and
> >>>>>>>>> it's size benefit is also tiny.
> >>>>>>>>
> >>>>>>>> Let me check I understand correctly. In the fully masked case, there
> >>>>>>>> is
> >>>>>>>> a
> >>>>>>>> single loop in which a new mask is generated at the start of each
> >>>>>>>> iteration. In the masked epilogue case, the main loop uses no masking
> >>>>>>>> whatsoever, thus avoiding the need for generating a mask, carrying
> >>>>>>>> the
> >>>>>>>> mask, inserting vec_merge operations, etc, and then the epilogue
> >>>>>>>> looks
> >>>>>>>> much
> >>>>>>>> like the fully masked case, but unlike smaller mode epilogues there
> >>>>>>>> is
> >>>>>>>> no
> >>>>>>>> loop because the eplogue vector size is the same. Is that right?
> >>>>>>>
> >>>>>>> Yes.
> >>>>>>>
> >>>>>>>> This scheme seems like it might also benefit GCN, in so much as it
> >>>>>>>> simplifies the hot code path.
> >>>>>>>>
> >>>>>>>> GCN does not actually have smaller vector sizes, so there's no
> >>>>>>>> analogue
> >>>>>>>> to
> >>>>>>>> AVX2 (we pretend we have some smaller sizes, but that's because the
> >>>>>>>> middle
> >>>>>>>> end can't do masking everywhere yet, and it helps make some vector
> >>>>>>>> constants smaller, perhaps).
> >>>>>>>>
> >>>>>>>>> This patch does not enable using fully masked loops or
> >>>>>>>>> masked epilogues by default. More work on cost modeling
> >>>>>>>>> and vectorization kind selection on x86_64 is necessary
> >>>>>>>>> for this.
> >>>>>>>>> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
> >>>>>>>>> which could be exploited further to unify some of the flags
> >>>>>>>>> we have right now but there didn't seem to be many easy things
> >>>>>>>>> to merge, so I'm leaving this for followups.
> >>>>>>>>> Mask requirements as registered by vect_record_loop_mask are kept in
> >>>>>>>>> their
> >>>>>>>>> original form and recorded in a hash_set now instead of being
> >>>>>>>>> processed to a vector of rgroup_controls. Instead that's now
> >>>>>>>>> left to the final analysis phase which tries forming the
> >>>>>>>>> rgroup_controls
> >>>>>>>>> vector using while_ult and if that fails now tries AVX512 style
> >>>>>>>>> which needs a different organization and instead fills a hash_map
> >>>>>>>>> with the relevant info. vect_get_loop_mask now has two
> >>>>>>>>> implementations,
> >>>>>>>>> one for the two mask styles we then have.
> >>>>>>>>> I have decided against interweaving
> >>>>>>>>> vect_set_loop_condition_partial_vectors
> >>>>>>>>> with conditions to do AVX512 style masking and instead opted to
> >>>>>>>>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> >>>>>>>>> Likewise for vect_verify_full_masking vs
> >>>>>>>>> vect_verify_full_masking_avx512.
> >>>>>>>>> I was split between making 'vec_loop_masks' a class with methods,
> >>>>>>>>> possibly merging in the _len stuff into a single registry. It
> >>>>>>>>> seemed to be too many changes for the purpose of getting AVX512
> >>>>>>>>> working. I'm going to play wait and see what happens with RISC-V
> >>>>>>>>> here since they are going to get both masks and lengths registered
> >>>>>>>>> I think.
> >>>>>>>>> The vect_prepare_for_masked_peels hunk might run into issues with
> >>>>>>>>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> >>>>>>>>> looked odd.
> >>>>>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu. I've run
> >>>>>>>>> the testsuite with --param vect-partial-vector-usage=2 with and
> >>>>>>>>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
> >>>>>>>>> and one latent wrong-code (PR110237).
> >>>>>>>>> There's followup work to be done to try enabling masked epilogues
> >>>>>>>>> for x86-64 by default (when AVX512 is enabled, possibly only when
> >>>>>>>>> -mprefer-vector-width=512). Getting cost modeling and decision
> >>>>>>>>> right is going to be challenging.
> >>>>>>>>> Any comments?
> >>>>>>>>> OK?
> >>>>>>>>> Btw, testing on GCN would be welcome - the _avx512 paths could
> >>>>>>>>> work for it so in case the while_ult path fails (not sure if
> >>>>>>>>> it ever does) it could get _avx512 style masking. Likewise
> >>>>>>>>> testing on ARM just to see I didn't break anything here.
> >>>>>>>>> I don't have SVE hardware so testing is probably meaningless.
> >>>>>>>>
> >>>>>>>> I can set some tests going. Is vect.exp enough?
> >>>>>>>
> >>>>>>> Well, only you know (from experience), but sure that?s a nice start.
> >>>>>>
> >>>>>> I tested vect.exp for both gcc and gfortran and there were no
> >>>>>> regressions.
> >>>>>> I
> >>>>>> have another run going with the other param settings.
> >>>>>>
> >>>>>> (Side note: vect.exp used to be a nice quick test for use during
> >>>>>> development,
> >>>>>> but the tsvc tests are now really slow, at least when run on a single
> >>>>>> GPU
> >>>>>> thread.)
> >>>>>>
> >>>>>> I tried some small examples with --param vect-partial-vector-usage=1
> >>>>>> (IIUC
> >>>>>> this prevents masked loops, but not masked epilogues, right?)
> >>>>>
> >>>>> Yes. That should also work with the while_ult style btw.
> >>>>>
> >>>>>> and the results
> >>>>>> look good. I plan to do some benchmarking shortly. One comment:
> >>>>>> building
> >>>>>> a
> >>>>>> vector constant {0, 1, 2, 3, ...., 63} results in a very large entry in
> >>>>>> the
> >>>>>> constant pool and an unnecessary memory load (it literally has to use
> >>>>>> this
> >>>>>> sequence to generate the addresses to load the constant!) Generating
> >>>>>> the
> >>>>>> sequence via VEC_SERIES would be a no-op, for GCN, because we have an
> >>>>>> ABI-mandated register that already holds that value. (Perhaps I have
> >>>>>> another
> >>>>>> piece missing here, IDK?)
> >>>>>
> >>>>> I failed to special-case the {0, 1, 2, 3, ... } constant because I
> >>>>> couldn't see how to do a series that creates { 0, 0, 1, 1, 2, 2, ... }.
> >>>>> It might be that the target needs to pattern match these constants
> >>>>> at RTL expansion time?
> >>>>>
> >>>>> Btw, did you disable your while_ult pattern for the experiment?
> >>>>
> >>>> I tried it both ways; both appear to work, and the while_ult case does
> >>>> avoid
> >>>> the constant vector. I also don't seem to need while_ult for the fully
> >>>> masked
> >>>> case any more (is that new?).
> >>>
> >>> Yes, while_ult always compares to {0, 1, 2, 3 ...} which seems
> >>> conveniently available but it has to multiply the IV with the
> >>> number of scalars per iter which has overflow issues it has
> >>> to compensate for by choosing a wider IV. I'm avoiding that
> >>> issue (besides for alignment peeling) by instead altering
> >>> the constant vector to compare against. On x86 the constant
> >>> vector is always a load but the multiplication would add to
> >>> the latency of mask production which already isn't too great.
> >>
> >> Is the multiplication not usually a shift?
> >>
> >>> And yes, the alternate scheme doesn't rely on while_ult but instead
> >>> on vec_cmpu to produce the masks.
> >>>
> >>> You might be able to produce the {0, 0, 1, 1, ... } constant
> >>> by interleaving v1 with itself? Any non-power-of-two duplication
> >>> looks more difficult though.
> >>
> >> I think that would need to use a full permutation, which is probably faster
> >> than a cold load, but in all these cases the vector that defines the
> >> permutation looks exactly like the result, so ......
> >>
> >> I've been playing with this stuff some more and I find that even though GCN
> >> supports fully masked loops and uses them when I test without offload, it's
> >> actually been running in param_vect_partial_vector_usage==0 mode for
> >> offload
> >> because i386.cc has that hardcoded and the offload compiler inherits param
> >> settings from the host.
> >
> > Doesn't that mean it will have a scalar epilog and a very large VF for the
> > main loop due to the large vector size?
> >
> >> I tried running the Babelstream benchmark with the various settings and
> >> it's a
> >> wash for most of the measurements (memory limited, most likely), but the
> >> "Dot"
> >> benchmark is considerably slower when fully masked (about 50%). This
> >> probably
> >> explains why adding the additional "fake" smaller vector sizes was so good
> >> for
> >> our numbers, but confirms that the partial epilogue is a good option.
> >
> > Ah, "fake" smaller vector sizes probably then made up for this with
> > "fixed" size epilogue vectorization? But yes, I think a vectorized
> > epilog with partial vectors that then does not iterate would get you
> > the best of both worlds.
>
> Yes, it uses V32 for the epilogue, which won't fit every case but it better
> than nothing.
>
> > So param_vect_partial_vector_usage == 1.
>
> Unfortunately, there's doesn't seem to be a way to set this *only* for the
> offload compiler. If you could fix it for x86_64 soon then that would be
> awesome. :)
So some opts tweaking in the GCN option_override hook doesn't work?
> > Whether with a while_ult optab or the vec_cmpu scheme should then
> > depend on generated code quality.
>
> Which looks like it depends on these constants alone.
Understood, compared to while_ult these now appear explicitely. As
Richard said you likely have to tweak the backend to make use of 'r1'
Richard.
next prev parent reply other threads:[~2023-06-15 14:00 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20230614115450.28CEA3858288@sourceware.org>
2023-06-14 14:26 ` Andrew Stubbs
2023-06-14 14:29 ` Richard Biener
2023-06-15 5:50 ` Liu, Hongtao
2023-06-15 6:51 ` Richard Biener
2023-06-15 9:26 ` Andrew Stubbs
2023-06-15 9:58 ` Richard Biener
2023-06-15 10:13 ` Andrew Stubbs
2023-06-15 11:06 ` Richard Biener
2023-06-15 13:04 ` Andrew Stubbs
2023-06-15 13:34 ` Richard Biener
2023-06-15 13:52 ` Andrew Stubbs
2023-06-15 14:00 ` Richard Biener [this message]
2023-06-15 14:04 ` Andrew Stubbs
2023-06-15 16:16 ` Richard Biener
2023-06-15 9:58 ` Richard Sandiford
[not found] <20230614115429.D400C3858433@sourceware.org>
2023-06-14 18:45 ` Richard Sandiford
2023-06-15 12:14 ` Richard Biener
2023-06-15 12:53 ` Richard Biener
2023-06-14 11:54 Richard Biener
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=nycvar.YFH.7.77.849.2306151358040.4723@jbgna.fhfr.qr \
--to=rguenther@suse.de \
--cc=ams@codesourcery.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=hongtao.liu@intel.com \
--cc=hubicka@ucw.cz \
--cc=kirill.yukhin@gmail.com \
--cc=richard.sandiford@arm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).