Re: [PATCH 3/3] AVX512 fully masked vectorization

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Andrew Stubbs <ams@codesourcery.com>
To: Richard Biener <rguenther@suse.de>
Cc: <gcc-patches@gcc.gnu.org>, <richard.sandiford@arm.com>,
	Jan Hubicka <hubicka@ucw.cz>, <hongtao.liu@intel.com>,
	<kirill.yukhin@gmail.com>
Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
Date: Thu, 15 Jun 2023 14:52:46 +0100	[thread overview]
Message-ID: <ea69279a-f9e4-11b0-a299-31df42fa6d41@codesourcery.com> (raw)
In-Reply-To: <nycvar.YFH.7.77.849.2306151331470.4723@jbgna.fhfr.qr>

On 15/06/2023 14:34, Richard Biener wrote:
> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> 
>> On 15/06/2023 12:06, Richard Biener wrote:
>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>
>>>> On 15/06/2023 10:58, Richard Biener wrote:
>>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>>>
>>>>>> On 14/06/2023 15:29, Richard Biener wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Am 14.06.2023 um 16:27 schrieb Andrew Stubbs <ams@codesourcery.com>:
>>>>>>>>
>>>>>>>> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
>>>>>>>>> This implemens fully masked vectorization or a masked epilog for
>>>>>>>>> AVX512 style masks which single themselves out by representing
>>>>>>>>> each lane with a single bit and by using integer modes for the mask
>>>>>>>>> (both is much like GCN).
>>>>>>>>> AVX512 is also special in that it doesn't have any instruction
>>>>>>>>> to compute the mask from a scalar IV like SVE has with while_ult.
>>>>>>>>> Instead the masks are produced by vector compares and the loop
>>>>>>>>> control retains the scalar IV (mainly to avoid dependences on
>>>>>>>>> mask generation, a suitable mask test instruction is available).
>>>>>>>>
>>>>>>>> This is also sounds like GCN. We currently use WHILE_ULT in the middle
>>>>>>>> end
>>>>>>>> which expands to a vector compare against a vector of stepped values.
>>>>>>>> This
>>>>>>>> requires an additional instruction to prepare the comparison vector
>>>>>>>> (compared to SVE), but the "while_ultv64sidi" pattern (for example)
>>>>>>>> returns
>>>>>>>> the DImode bitmask, so it works reasonably well.
>>>>>>>>
>>>>>>>>> Like RVV code generation prefers a decrementing IV though IVOPTs
>>>>>>>>> messes things up in some cases removing that IV to eliminate
>>>>>>>>> it with an incrementing one used for address generation.
>>>>>>>>> One of the motivating testcases is from PR108410 which in turn
>>>>>>>>> is extracted from x264 where large size vectorization shows
>>>>>>>>> issues with small trip loops.  Execution time there improves
>>>>>>>>> compared to classic AVX512 with AVX2 epilogues for the cases
>>>>>>>>> of less than 32 iterations.
>>>>>>>>> size   scalar     128     256     512    512e    512f
>>>>>>>>>         1    9.42   11.32    9.35   11.17   15.13   16.89
>>>>>>>>>         2    5.72    6.53    6.66    6.66    7.62    8.56
>>>>>>>>>         3    4.49    5.10    5.10    5.74    5.08    5.73
>>>>>>>>>         4    4.10    4.33    4.29    5.21    3.79    4.25
>>>>>>>>>         6    3.78    3.85    3.86    4.76    2.54    2.85
>>>>>>>>>         8    3.64    1.89    3.76    4.50    1.92    2.16
>>>>>>>>>        12    3.56    2.21    3.75    4.26    1.26    1.42
>>>>>>>>>        16    3.36    0.83    1.06    4.16    0.95    1.07
>>>>>>>>>        20    3.39    1.42    1.33    4.07    0.75    0.85
>>>>>>>>>        24    3.23    0.66    1.72    4.22    0.62    0.70
>>>>>>>>>        28    3.18    1.09    2.04    4.20    0.54    0.61
>>>>>>>>>        32    3.16    0.47    0.41    0.41    0.47    0.53
>>>>>>>>>        34    3.16    0.67    0.61    0.56    0.44    0.50
>>>>>>>>>        38    3.19    0.95    0.95    0.82    0.40    0.45
>>>>>>>>>        42    3.09    0.58    1.21    1.13    0.36    0.40
>>>>>>>>> 'size' specifies the number of actual iterations, 512e is for
>>>>>>>>> a masked epilog and 512f for the fully masked loop.  From
>>>>>>>>> 4 scalar iterations on the AVX512 masked epilog code is clearly
>>>>>>>>> the winner, the fully masked variant is clearly worse and
>>>>>>>>> it's size benefit is also tiny.
>>>>>>>>
>>>>>>>> Let me check I understand correctly. In the fully masked case, there is
>>>>>>>> a
>>>>>>>> single loop in which a new mask is generated at the start of each
>>>>>>>> iteration. In the masked epilogue case, the main loop uses no masking
>>>>>>>> whatsoever, thus avoiding the need for generating a mask, carrying the
>>>>>>>> mask, inserting vec_merge operations, etc, and then the epilogue looks
>>>>>>>> much
>>>>>>>> like the fully masked case, but unlike smaller mode epilogues there is
>>>>>>>> no
>>>>>>>> loop because the eplogue vector size is the same. Is that right?
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>> This scheme seems like it might also benefit GCN, in so much as it
>>>>>>>> simplifies the hot code path.
>>>>>>>>
>>>>>>>> GCN does not actually have smaller vector sizes, so there's no analogue
>>>>>>>> to
>>>>>>>> AVX2 (we pretend we have some smaller sizes, but that's because the
>>>>>>>> middle
>>>>>>>> end can't do masking everywhere yet, and it helps make some vector
>>>>>>>> constants smaller, perhaps).
>>>>>>>>
>>>>>>>>> This patch does not enable using fully masked loops or
>>>>>>>>> masked epilogues by default.  More work on cost modeling
>>>>>>>>> and vectorization kind selection on x86_64 is necessary
>>>>>>>>> for this.
>>>>>>>>> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
>>>>>>>>> which could be exploited further to unify some of the flags
>>>>>>>>> we have right now but there didn't seem to be many easy things
>>>>>>>>> to merge, so I'm leaving this for followups.
>>>>>>>>> Mask requirements as registered by vect_record_loop_mask are kept in
>>>>>>>>> their
>>>>>>>>> original form and recorded in a hash_set now instead of being
>>>>>>>>> processed to a vector of rgroup_controls.  Instead that's now
>>>>>>>>> left to the final analysis phase which tries forming the
>>>>>>>>> rgroup_controls
>>>>>>>>> vector using while_ult and if that fails now tries AVX512 style
>>>>>>>>> which needs a different organization and instead fills a hash_map
>>>>>>>>> with the relevant info.  vect_get_loop_mask now has two
>>>>>>>>> implementations,
>>>>>>>>> one for the two mask styles we then have.
>>>>>>>>> I have decided against interweaving
>>>>>>>>> vect_set_loop_condition_partial_vectors
>>>>>>>>> with conditions to do AVX512 style masking and instead opted to
>>>>>>>>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
>>>>>>>>> Likewise for vect_verify_full_masking vs
>>>>>>>>> vect_verify_full_masking_avx512.
>>>>>>>>> I was split between making 'vec_loop_masks' a class with methods,
>>>>>>>>> possibly merging in the _len stuff into a single registry.  It
>>>>>>>>> seemed to be too many changes for the purpose of getting AVX512
>>>>>>>>> working.  I'm going to play wait and see what happens with RISC-V
>>>>>>>>> here since they are going to get both masks and lengths registered
>>>>>>>>> I think.
>>>>>>>>> The vect_prepare_for_masked_peels hunk might run into issues with
>>>>>>>>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
>>>>>>>>> looked odd.
>>>>>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
>>>>>>>>> the testsuite with --param vect-partial-vector-usage=2 with and
>>>>>>>>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
>>>>>>>>> and one latent wrong-code (PR110237).
>>>>>>>>> There's followup work to be done to try enabling masked epilogues
>>>>>>>>> for x86-64 by default (when AVX512 is enabled, possibly only when
>>>>>>>>> -mprefer-vector-width=512).  Getting cost modeling and decision
>>>>>>>>> right is going to be challenging.
>>>>>>>>> Any comments?
>>>>>>>>> OK?
>>>>>>>>> Btw, testing on GCN would be welcome - the _avx512 paths could
>>>>>>>>> work for it so in case the while_ult path fails (not sure if
>>>>>>>>> it ever does) it could get _avx512 style masking.  Likewise
>>>>>>>>> testing on ARM just to see I didn't break anything here.
>>>>>>>>> I don't have SVE hardware so testing is probably meaningless.
>>>>>>>>
>>>>>>>> I can set some tests going. Is vect.exp enough?
>>>>>>>
>>>>>>> Well, only you know (from experience), but sure that?s a nice start.
>>>>>>
>>>>>> I tested vect.exp for both gcc and gfortran and there were no
>>>>>> regressions.
>>>>>> I
>>>>>> have another run going with the other param settings.
>>>>>>
>>>>>> (Side note: vect.exp used to be a nice quick test for use during
>>>>>> development,
>>>>>> but the tsvc tests are now really slow, at least when run on a single GPU
>>>>>> thread.)
>>>>>>
>>>>>> I tried some small examples with --param vect-partial-vector-usage=1
>>>>>> (IIUC
>>>>>> this prevents masked loops, but not masked epilogues, right?)
>>>>>
>>>>> Yes.  That should also work with the while_ult style btw.
>>>>>
>>>>>> and the results
>>>>>> look good. I plan to do some benchmarking shortly. One comment: building
>>>>>> a
>>>>>> vector constant {0, 1, 2, 3, ...., 63} results in a very large entry in
>>>>>> the
>>>>>> constant pool and an unnecessary memory load (it literally has to use
>>>>>> this
>>>>>> sequence to generate the addresses to load the constant!) Generating the
>>>>>> sequence via VEC_SERIES would be a no-op, for GCN, because we have an
>>>>>> ABI-mandated register that already holds that value. (Perhaps I have
>>>>>> another
>>>>>> piece missing here, IDK?)
>>>>>
>>>>> I failed to special-case the {0, 1, 2, 3, ... } constant because I
>>>>> couldn't see how to do a series that creates { 0, 0, 1, 1, 2, 2, ... }.
>>>>> It might be that the target needs to pattern match these constants
>>>>> at RTL expansion time?
>>>>>
>>>>> Btw, did you disable your while_ult pattern for the experiment?
>>>>
>>>> I tried it both ways; both appear to work, and the while_ult case does
>>>> avoid
>>>> the constant vector. I also don't seem to need while_ult for the fully
>>>> masked
>>>> case any more (is that new?).
>>>
>>> Yes, while_ult always compares to {0, 1, 2, 3 ...} which seems
>>> conveniently available but it has to multiply the IV with the
>>> number of scalars per iter which has overflow issues it has
>>> to compensate for by choosing a wider IV.  I'm avoiding that
>>> issue (besides for alignment peeling) by instead altering
>>> the constant vector to compare against.  On x86 the constant
>>> vector is always a load but the multiplication would add to
>>> the latency of mask production which already isn't too great.
>>
>> Is the multiplication not usually a shift?
>>
>>> And yes, the alternate scheme doesn't rely on while_ult but instead
>>> on vec_cmpu to produce the masks.
>>>
>>> You might be able to produce the {0, 0, 1, 1, ... } constant
>>> by interleaving v1 with itself?  Any non-power-of-two duplication
>>> looks more difficult though.
>>
>> I think that would need to use a full permutation, which is probably faster
>> than a cold load, but in all these cases the vector that defines the
>> permutation looks exactly like the result, so ......
>>
>> I've been playing with this stuff some more and I find that even though GCN
>> supports fully masked loops and uses them when I test without offload, it's
>> actually been running in param_vect_partial_vector_usage==0 mode for offload
>> because i386.cc has that hardcoded and the offload compiler inherits param
>> settings from the host.
> 
> Doesn't that mean it will have a scalar epilog and a very large VF for the
> main loop due to the large vector size?
> 
>> I tried running the Babelstream benchmark with the various settings and it's a
>> wash for most of the measurements (memory limited, most likely), but the "Dot"
>> benchmark is considerably slower when fully masked (about 50%). This probably
>> explains why adding the additional "fake" smaller vector sizes was so good for
>> our numbers, but confirms that the partial epilogue is a good option.
> 
> Ah, "fake" smaller vector sizes probably then made up for this with
> "fixed" size epilogue vectorization?  But yes, I think a vectorized
> epilog with partial vectors that then does not iterate would get you
> the best of both worlds.

Yes, it uses V32 for the epilogue, which won't fit every case but it 
better than nothing.

> So param_vect_partial_vector_usage == 1.

Unfortunately, there's doesn't seem to be a way to set this *only* for 
the offload compiler. If you could fix it for x86_64 soon then that 
would be awesome. :)

> Whether with a while_ult optab or the vec_cmpu scheme should then
> depend on generated code quality.

Which looks like it depends on these constants alone.

Thanks

Andrew

next prev parent reply	other threads:[~2023-06-15 13:52 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20230614115450.28CEA3858288@sourceware.org>
2023-06-14 14:26 ` Andrew Stubbs
2023-06-14 14:29   ` Richard Biener
2023-06-15  5:50     ` Liu, Hongtao
2023-06-15  6:51       ` Richard Biener
2023-06-15  9:26     ` Andrew Stubbs
2023-06-15  9:58       ` Richard Biener
2023-06-15 10:13         ` Andrew Stubbs
2023-06-15 11:06           ` Richard Biener
2023-06-15 13:04             ` Andrew Stubbs
2023-06-15 13:34               ` Richard Biener
2023-06-15 13:52                 ` Andrew Stubbs [this message]
2023-06-15 14:00                   ` Richard Biener
2023-06-15 14:04                     ` Andrew Stubbs
2023-06-15 16:16                       ` Richard Biener
2023-06-15  9:58       ` Richard Sandiford
     [not found] <20230614115429.D400C3858433@sourceware.org>
2023-06-14 18:45 ` Richard Sandiford
2023-06-15 12:14   ` Richard Biener
2023-06-15 12:53     ` Richard Biener
2023-06-14 11:54 Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ea69279a-f9e4-11b0-a299-31df42fa6d41@codesourcery.com \
    --to=ams@codesourcery.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=hongtao.liu@intel.com \
    --cc=hubicka@ucw.cz \
    --cc=kirill.yukhin@gmail.com \
    --cc=rguenther@suse.de \
    --cc=richard.sandiford@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).