On Thu, 15 Jun 2023, Andrew Stubbs wrote: > On 15/06/2023 10:58, Richard Biener wrote: > > On Thu, 15 Jun 2023, Andrew Stubbs wrote: > > > >> On 14/06/2023 15:29, Richard Biener wrote: > >>> > >>> > >>>> Am 14.06.2023 um 16:27 schrieb Andrew Stubbs : > >>>> > >>>> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote: > >>>>> This implemens fully masked vectorization or a masked epilog for > >>>>> AVX512 style masks which single themselves out by representing > >>>>> each lane with a single bit and by using integer modes for the mask > >>>>> (both is much like GCN). > >>>>> AVX512 is also special in that it doesn't have any instruction > >>>>> to compute the mask from a scalar IV like SVE has with while_ult. > >>>>> Instead the masks are produced by vector compares and the loop > >>>>> control retains the scalar IV (mainly to avoid dependences on > >>>>> mask generation, a suitable mask test instruction is available). > >>>> > >>>> This is also sounds like GCN. We currently use WHILE_ULT in the middle > >>>> end > >>>> which expands to a vector compare against a vector of stepped values. > >>>> This > >>>> requires an additional instruction to prepare the comparison vector > >>>> (compared to SVE), but the "while_ultv64sidi" pattern (for example) > >>>> returns > >>>> the DImode bitmask, so it works reasonably well. > >>>> > >>>>> Like RVV code generation prefers a decrementing IV though IVOPTs > >>>>> messes things up in some cases removing that IV to eliminate > >>>>> it with an incrementing one used for address generation. > >>>>> One of the motivating testcases is from PR108410 which in turn > >>>>> is extracted from x264 where large size vectorization shows > >>>>> issues with small trip loops. Execution time there improves > >>>>> compared to classic AVX512 with AVX2 epilogues for the cases > >>>>> of less than 32 iterations. > >>>>> size scalar 128 256 512 512e 512f > >>>>> 1 9.42 11.32 9.35 11.17 15.13 16.89 > >>>>> 2 5.72 6.53 6.66 6.66 7.62 8.56 > >>>>> 3 4.49 5.10 5.10 5.74 5.08 5.73 > >>>>> 4 4.10 4.33 4.29 5.21 3.79 4.25 > >>>>> 6 3.78 3.85 3.86 4.76 2.54 2.85 > >>>>> 8 3.64 1.89 3.76 4.50 1.92 2.16 > >>>>> 12 3.56 2.21 3.75 4.26 1.26 1.42 > >>>>> 16 3.36 0.83 1.06 4.16 0.95 1.07 > >>>>> 20 3.39 1.42 1.33 4.07 0.75 0.85 > >>>>> 24 3.23 0.66 1.72 4.22 0.62 0.70 > >>>>> 28 3.18 1.09 2.04 4.20 0.54 0.61 > >>>>> 32 3.16 0.47 0.41 0.41 0.47 0.53 > >>>>> 34 3.16 0.67 0.61 0.56 0.44 0.50 > >>>>> 38 3.19 0.95 0.95 0.82 0.40 0.45 > >>>>> 42 3.09 0.58 1.21 1.13 0.36 0.40 > >>>>> 'size' specifies the number of actual iterations, 512e is for > >>>>> a masked epilog and 512f for the fully masked loop. From > >>>>> 4 scalar iterations on the AVX512 masked epilog code is clearly > >>>>> the winner, the fully masked variant is clearly worse and > >>>>> it's size benefit is also tiny. > >>>> > >>>> Let me check I understand correctly. In the fully masked case, there is a > >>>> single loop in which a new mask is generated at the start of each > >>>> iteration. In the masked epilogue case, the main loop uses no masking > >>>> whatsoever, thus avoiding the need for generating a mask, carrying the > >>>> mask, inserting vec_merge operations, etc, and then the epilogue looks > >>>> much > >>>> like the fully masked case, but unlike smaller mode epilogues there is no > >>>> loop because the eplogue vector size is the same. Is that right? > >>> > >>> Yes. > >>> > >>>> This scheme seems like it might also benefit GCN, in so much as it > >>>> simplifies the hot code path. > >>>> > >>>> GCN does not actually have smaller vector sizes, so there's no analogue > >>>> to > >>>> AVX2 (we pretend we have some smaller sizes, but that's because the > >>>> middle > >>>> end can't do masking everywhere yet, and it helps make some vector > >>>> constants smaller, perhaps). > >>>> > >>>>> This patch does not enable using fully masked loops or > >>>>> masked epilogues by default. More work on cost modeling > >>>>> and vectorization kind selection on x86_64 is necessary > >>>>> for this. > >>>>> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE > >>>>> which could be exploited further to unify some of the flags > >>>>> we have right now but there didn't seem to be many easy things > >>>>> to merge, so I'm leaving this for followups. > >>>>> Mask requirements as registered by vect_record_loop_mask are kept in > >>>>> their > >>>>> original form and recorded in a hash_set now instead of being > >>>>> processed to a vector of rgroup_controls. Instead that's now > >>>>> left to the final analysis phase which tries forming the rgroup_controls > >>>>> vector using while_ult and if that fails now tries AVX512 style > >>>>> which needs a different organization and instead fills a hash_map > >>>>> with the relevant info. vect_get_loop_mask now has two implementations, > >>>>> one for the two mask styles we then have. > >>>>> I have decided against interweaving > >>>>> vect_set_loop_condition_partial_vectors > >>>>> with conditions to do AVX512 style masking and instead opted to > >>>>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512. > >>>>> Likewise for vect_verify_full_masking vs > >>>>> vect_verify_full_masking_avx512. > >>>>> I was split between making 'vec_loop_masks' a class with methods, > >>>>> possibly merging in the _len stuff into a single registry. It > >>>>> seemed to be too many changes for the purpose of getting AVX512 > >>>>> working. I'm going to play wait and see what happens with RISC-V > >>>>> here since they are going to get both masks and lengths registered > >>>>> I think. > >>>>> The vect_prepare_for_masked_peels hunk might run into issues with > >>>>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE > >>>>> looked odd. > >>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu. I've run > >>>>> the testsuite with --param vect-partial-vector-usage=2 with and > >>>>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221) > >>>>> and one latent wrong-code (PR110237). > >>>>> There's followup work to be done to try enabling masked epilogues > >>>>> for x86-64 by default (when AVX512 is enabled, possibly only when > >>>>> -mprefer-vector-width=512). Getting cost modeling and decision > >>>>> right is going to be challenging. > >>>>> Any comments? > >>>>> OK? > >>>>> Btw, testing on GCN would be welcome - the _avx512 paths could > >>>>> work for it so in case the while_ult path fails (not sure if > >>>>> it ever does) it could get _avx512 style masking. Likewise > >>>>> testing on ARM just to see I didn't break anything here. > >>>>> I don't have SVE hardware so testing is probably meaningless. > >>>> > >>>> I can set some tests going. Is vect.exp enough? > >>> > >>> Well, only you know (from experience), but sure that?s a nice start. > >> > >> I tested vect.exp for both gcc and gfortran and there were no regressions. > >> I > >> have another run going with the other param settings. > >> > >> (Side note: vect.exp used to be a nice quick test for use during > >> development, > >> but the tsvc tests are now really slow, at least when run on a single GPU > >> thread.) > >> > >> I tried some small examples with --param vect-partial-vector-usage=1 (IIUC > >> this prevents masked loops, but not masked epilogues, right?) > > > > Yes. That should also work with the while_ult style btw. > > > >> and the results > >> look good. I plan to do some benchmarking shortly. One comment: building a > >> vector constant {0, 1, 2, 3, ...., 63} results in a very large entry in the > >> constant pool and an unnecessary memory load (it literally has to use this > >> sequence to generate the addresses to load the constant!) Generating the > >> sequence via VEC_SERIES would be a no-op, for GCN, because we have an > >> ABI-mandated register that already holds that value. (Perhaps I have > >> another > >> piece missing here, IDK?) > > > > I failed to special-case the {0, 1, 2, 3, ... } constant because I > > couldn't see how to do a series that creates { 0, 0, 1, 1, 2, 2, ... }. > > It might be that the target needs to pattern match these constants > > at RTL expansion time? > > > > Btw, did you disable your while_ult pattern for the experiment? > > I tried it both ways; both appear to work, and the while_ult case does avoid > the constant vector. I also don't seem to need while_ult for the fully masked > case any more (is that new?). Yes, while_ult always compares to {0, 1, 2, 3 ...} which seems conveniently available but it has to multiply the IV with the number of scalars per iter which has overflow issues it has to compensate for by choosing a wider IV. I'm avoiding that issue (besides for alignment peeling) by instead altering the constant vector to compare against. On x86 the constant vector is always a load but the multiplication would add to the latency of mask production which already isn't too great. And yes, the alternate scheme doesn't rely on while_ult but instead on vec_cmpu to produce the masks. You might be able to produce the {0, 0, 1, 1, ... } constant by interleaving v1 with itself? Any non-power-of-two duplication looks more difficult though. Richard.