On Thu, 15 Jun 2023, Andrew Stubbs wrote:

> On 15/06/2023 10:58, Richard Biener wrote:
> > On Thu, 15 Jun 2023, Andrew Stubbs wrote:
> > 
> >> On 14/06/2023 15:29, Richard Biener wrote:
> >>>
> >>>
> >>>> Am 14.06.2023 um 16:27 schrieb Andrew Stubbs <ams@codesourcery.com>:
> >>>>
> >>>> ﻿On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
> >>>>> This implemens fully masked vectorization or a masked epilog for
> >>>>> AVX512 style masks which single themselves out by representing
> >>>>> each lane with a single bit and by using integer modes for the mask
> >>>>> (both is much like GCN).
> >>>>> AVX512 is also special in that it doesn't have any instruction
> >>>>> to compute the mask from a scalar IV like SVE has with while_ult.
> >>>>> Instead the masks are produced by vector compares and the loop
> >>>>> control retains the scalar IV (mainly to avoid dependences on
> >>>>> mask generation, a suitable mask test instruction is available).
> >>>>
> >>>> This is also sounds like GCN. We currently use WHILE_ULT in the middle
> >>>> end
> >>>> which expands to a vector compare against a vector of stepped values.
> >>>> This
> >>>> requires an additional instruction to prepare the comparison vector
> >>>> (compared to SVE), but the "while_ultv64sidi" pattern (for example)
> >>>> returns
> >>>> the DImode bitmask, so it works reasonably well.
> >>>>
> >>>>> Like RVV code generation prefers a decrementing IV though IVOPTs
> >>>>> messes things up in some cases removing that IV to eliminate
> >>>>> it with an incrementing one used for address generation.
> >>>>> One of the motivating testcases is from PR108410 which in turn
> >>>>> is extracted from x264 where large size vectorization shows
> >>>>> issues with small trip loops.  Execution time there improves
> >>>>> compared to classic AVX512 with AVX2 epilogues for the cases
> >>>>> of less than 32 iterations.
> >>>>> size   scalar     128     256     512    512e    512f
> >>>>>       1    9.42   11.32    9.35   11.17   15.13   16.89
> >>>>>       2    5.72    6.53    6.66    6.66    7.62    8.56
> >>>>>       3    4.49    5.10    5.10    5.74    5.08    5.73
> >>>>>       4    4.10    4.33    4.29    5.21    3.79    4.25
> >>>>>       6    3.78    3.85    3.86    4.76    2.54    2.85
> >>>>>       8    3.64    1.89    3.76    4.50    1.92    2.16
> >>>>>      12    3.56    2.21    3.75    4.26    1.26    1.42
> >>>>>      16    3.36    0.83    1.06    4.16    0.95    1.07
> >>>>>      20    3.39    1.42    1.33    4.07    0.75    0.85
> >>>>>      24    3.23    0.66    1.72    4.22    0.62    0.70
> >>>>>      28    3.18    1.09    2.04    4.20    0.54    0.61
> >>>>>      32    3.16    0.47    0.41    0.41    0.47    0.53
> >>>>>      34    3.16    0.67    0.61    0.56    0.44    0.50
> >>>>>      38    3.19    0.95    0.95    0.82    0.40    0.45
> >>>>>      42    3.09    0.58    1.21    1.13    0.36    0.40
> >>>>> 'size' specifies the number of actual iterations, 512e is for
> >>>>> a masked epilog and 512f for the fully masked loop.  From
> >>>>> 4 scalar iterations on the AVX512 masked epilog code is clearly
> >>>>> the winner, the fully masked variant is clearly worse and
> >>>>> it's size benefit is also tiny.
> >>>>
> >>>> Let me check I understand correctly. In the fully masked case, there is a
> >>>> single loop in which a new mask is generated at the start of each
> >>>> iteration. In the masked epilogue case, the main loop uses no masking
> >>>> whatsoever, thus avoiding the need for generating a mask, carrying the
> >>>> mask, inserting vec_merge operations, etc, and then the epilogue looks
> >>>> much
> >>>> like the fully masked case, but unlike smaller mode epilogues there is no
> >>>> loop because the eplogue vector size is the same. Is that right?
> >>>
> >>> Yes.
> >>>
> >>>> This scheme seems like it might also benefit GCN, in so much as it
> >>>> simplifies the hot code path.
> >>>>
> >>>> GCN does not actually have smaller vector sizes, so there's no analogue
> >>>> to
> >>>> AVX2 (we pretend we have some smaller sizes, but that's because the
> >>>> middle
> >>>> end can't do masking everywhere yet, and it helps make some vector
> >>>> constants smaller, perhaps).
> >>>>
> >>>>> This patch does not enable using fully masked loops or
> >>>>> masked epilogues by default.  More work on cost modeling
> >>>>> and vectorization kind selection on x86_64 is necessary
> >>>>> for this.
> >>>>> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
> >>>>> which could be exploited further to unify some of the flags
> >>>>> we have right now but there didn't seem to be many easy things
> >>>>> to merge, so I'm leaving this for followups.
> >>>>> Mask requirements as registered by vect_record_loop_mask are kept in
> >>>>> their
> >>>>> original form and recorded in a hash_set now instead of being
> >>>>> processed to a vector of rgroup_controls.  Instead that's now
> >>>>> left to the final analysis phase which tries forming the rgroup_controls
> >>>>> vector using while_ult and if that fails now tries AVX512 style
> >>>>> which needs a different organization and instead fills a hash_map
> >>>>> with the relevant info.  vect_get_loop_mask now has two implementations,
> >>>>> one for the two mask styles we then have.
> >>>>> I have decided against interweaving
> >>>>> vect_set_loop_condition_partial_vectors
> >>>>> with conditions to do AVX512 style masking and instead opted to
> >>>>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
> >>>>> Likewise for vect_verify_full_masking vs
> >>>>> vect_verify_full_masking_avx512.
> >>>>> I was split between making 'vec_loop_masks' a class with methods,
> >>>>> possibly merging in the _len stuff into a single registry.  It
> >>>>> seemed to be too many changes for the purpose of getting AVX512
> >>>>> working.  I'm going to play wait and see what happens with RISC-V
> >>>>> here since they are going to get both masks and lengths registered
> >>>>> I think.
> >>>>> The vect_prepare_for_masked_peels hunk might run into issues with
> >>>>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
> >>>>> looked odd.
> >>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
> >>>>> the testsuite with --param vect-partial-vector-usage=2 with and
> >>>>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
> >>>>> and one latent wrong-code (PR110237).
> >>>>> There's followup work to be done to try enabling masked epilogues
> >>>>> for x86-64 by default (when AVX512 is enabled, possibly only when
> >>>>> -mprefer-vector-width=512).  Getting cost modeling and decision
> >>>>> right is going to be challenging.
> >>>>> Any comments?
> >>>>> OK?
> >>>>> Btw, testing on GCN would be welcome - the _avx512 paths could
> >>>>> work for it so in case the while_ult path fails (not sure if
> >>>>> it ever does) it could get _avx512 style masking.  Likewise
> >>>>> testing on ARM just to see I didn't break anything here.
> >>>>> I don't have SVE hardware so testing is probably meaningless.
> >>>>
> >>>> I can set some tests going. Is vect.exp enough?
> >>>
> >>> Well, only you know (from experience), but sure that?s a nice start.
> >>
> >> I tested vect.exp for both gcc and gfortran and there were no regressions.
> >> I
> >> have another run going with the other param settings.
> >>
> >> (Side note: vect.exp used to be a nice quick test for use during
> >> development,
> >> but the tsvc tests are now really slow, at least when run on a single GPU
> >> thread.)
> >>
> >> I tried some small examples with --param vect-partial-vector-usage=1 (IIUC
> >> this prevents masked loops, but not masked epilogues, right?)
> > 
> > Yes.  That should also work with the while_ult style btw.
> > 
> >> and the results
> >> look good. I plan to do some benchmarking shortly. One comment: building a
> >> vector constant {0, 1, 2, 3, ...., 63} results in a very large entry in the
> >> constant pool and an unnecessary memory load (it literally has to use this
> >> sequence to generate the addresses to load the constant!) Generating the
> >> sequence via VEC_SERIES would be a no-op, for GCN, because we have an
> >> ABI-mandated register that already holds that value. (Perhaps I have
> >> another
> >> piece missing here, IDK?)
> > 
> > I failed to special-case the {0, 1, 2, 3, ... } constant because I
> > couldn't see how to do a series that creates { 0, 0, 1, 1, 2, 2, ... }.
> > It might be that the target needs to pattern match these constants
> > at RTL expansion time?
> > 
> > Btw, did you disable your while_ult pattern for the experiment?
> 
> I tried it both ways; both appear to work, and the while_ult case does avoid
> the constant vector. I also don't seem to need while_ult for the fully masked
> case any more (is that new?).

Yes, while_ult always compares to {0, 1, 2, 3 ...} which seems
conveniently available but it has to multiply the IV with the
number of scalars per iter which has overflow issues it has
to compensate for by choosing a wider IV.  I'm avoiding that
issue (besides for alignment peeling) by instead altering
the constant vector to compare against.  On x86 the constant
vector is always a load but the multiplication would add to
the latency of mask production which already isn't too great.

And yes, the alternate scheme doesn't rely on while_ult but instead
on vec_cmpu to produce the masks.

You might be able to produce the {0, 0, 1, 1, ... } constant
by interleaving v1 with itself?  Any non-power-of-two duplication
looks more difficult though.

Richard.