Date: Thu, 15 Jun 2023 14:04:37 +0100
From: Andrew Stubbs
To: Richard Biener
Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization

On 15/06/2023 12:06, Richard Biener wrote:
> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>
>> On 15/06/2023 10:58, Richard Biener wrote:
>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>
>>>> On 14/06/2023 15:29, Richard Biener wrote:
>>>>>
>>>>>> On 14.06.2023 at 16:27, Andrew Stubbs wrote:
>>>>>>
>>>>>> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
>>>>>>> This implements fully masked vectorization or a masked epilogue
>>>>>>> for AVX512-style masks, which single themselves out by
>>>>>>> representing each lane with a single bit and by using integer
>>>>>>> modes for the mask (both much like GCN).
>>>>>>> AVX512 is also special in that it doesn't have any instruction
>>>>>>> to compute the mask from a scalar IV like SVE has with
>>>>>>> while_ult.  Instead the masks are produced by vector compares
>>>>>>> and the loop control retains the scalar IV (mainly to avoid
>>>>>>> dependences on mask generation; a suitable mask test
>>>>>>> instruction is available).
>>>>>>
>>>>>> This also sounds like GCN.  We currently use WHILE_ULT in the
>>>>>> middle end, which expands to a vector compare against a vector
>>>>>> of stepped values.  This requires an additional instruction to
>>>>>> prepare the comparison vector (compared to SVE), but the
>>>>>> "while_ultv64sidi" pattern (for example) returns the DImode
>>>>>> bitmask, so it works reasonably well.
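(For reference, the WHILE_ULT semantics under discussion amount to
roughly the following; a conceptual C sketch with an illustrative
64-lane vector and a made-up function name, not the actual pattern
expansion:)

#include <stdint.h>

/* Sketch of what a GCN-style while_ultv64sidi computes: bit LANE of
   the DImode mask is set iff IV + LANE < LIMIT, i.e. an unsigned
   vector compare of the stepped vector {0, 1, ..., 63}, offset by
   IV, against LIMIT.  */
static uint64_t
while_ult_sketch (uint32_t iv, uint32_t limit)
{
  uint64_t mask = 0;
  for (int lane = 0; lane < 64; lane++)
    if ((uint64_t) iv + lane < limit)
      mask |= (uint64_t) 1 << lane;
  return mask;
}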
>>>>>>
>>>>>>> Like RVV code generation prefers a decrementing IV, though
>>>>>>> IVOPTs messes things up in some cases, removing that IV to
>>>>>>> eliminate it with an incrementing one used for address
>>>>>>> generation.
>>>>>>> One of the motivating testcases is from PR108410, which in
>>>>>>> turn is extracted from x264, where large-size vectorization
>>>>>>> shows issues with small trip loops.  Execution time there
>>>>>>> improves compared to classic AVX512 with AVX2 epilogues for
>>>>>>> the cases of less than 32 iterations.
>>>>>>>
>>>>>>>   size  scalar    128    256    512   512e   512f
>>>>>>>      1    9.42  11.32   9.35  11.17  15.13  16.89
>>>>>>>      2    5.72   6.53   6.66   6.66   7.62   8.56
>>>>>>>      3    4.49   5.10   5.10   5.74   5.08   5.73
>>>>>>>      4    4.10   4.33   4.29   5.21   3.79   4.25
>>>>>>>      6    3.78   3.85   3.86   4.76   2.54   2.85
>>>>>>>      8    3.64   1.89   3.76   4.50   1.92   2.16
>>>>>>>     12    3.56   2.21   3.75   4.26   1.26   1.42
>>>>>>>     16    3.36   0.83   1.06   4.16   0.95   1.07
>>>>>>>     20    3.39   1.42   1.33   4.07   0.75   0.85
>>>>>>>     24    3.23   0.66   1.72   4.22   0.62   0.70
>>>>>>>     28    3.18   1.09   2.04   4.20   0.54   0.61
>>>>>>>     32    3.16   0.47   0.41   0.41   0.47   0.53
>>>>>>>     34    3.16   0.67   0.61   0.56   0.44   0.50
>>>>>>>     38    3.19   0.95   0.95   0.82   0.40   0.45
>>>>>>>     42    3.09   0.58   1.21   1.13   0.36   0.40
>>>>>>>
>>>>>>> 'size' specifies the number of actual iterations; 512e is for
>>>>>>> a masked epilogue and 512f for the fully masked loop.  From
>>>>>>> 4 scalar iterations on, the AVX512 masked-epilogue code is
>>>>>>> clearly the winner; the fully masked variant is clearly worse
>>>>>>> and its size benefit is also tiny.
>>>>>>
>>>>>> Let me check I understand correctly.  In the fully masked case,
>>>>>> there is a single loop in which a new mask is generated at the
>>>>>> start of each iteration.  In the masked epilogue case, the main
>>>>>> loop uses no masking whatsoever, thus avoiding the need for
>>>>>> generating a mask, carrying the mask, inserting vec_merge
>>>>>> operations, etc., and then the epilogue looks much like the
>>>>>> fully masked case, but unlike smaller-mode epilogues there is
>>>>>> no loop because the epilogue vector size is the same.  Is that
>>>>>> right?
>>>>>
>>>>> Yes.
>>>>>
>>>>>> This scheme seems like it might also benefit GCN, insofar as it
>>>>>> simplifies the hot code path.
>>>>>>
>>>>>> GCN does not actually have smaller vector sizes, so there's no
>>>>>> analogue to AVX2 (we pretend we have some smaller sizes, but
>>>>>> that's because the middle end can't do masking everywhere yet,
>>>>>> and it helps make some vector constants smaller, perhaps).
>>>>>>
>>>>>>> This patch does not enable using fully masked loops or masked
>>>>>>> epilogues by default.  More work on cost modeling and
>>>>>>> vectorization kind selection on x86_64 is necessary for this.
>>>>>>> Implementation-wise this introduces
>>>>>>> LOOP_VINFO_PARTIAL_VECTORS_STYLE, which could be exploited
>>>>>>> further to unify some of the flags we have right now, but
>>>>>>> there didn't seem to be many easy things to merge, so I'm
>>>>>>> leaving this for followups.
>>>>>>> Mask requirements as registered by vect_record_loop_mask are
>>>>>>> kept in their original form and recorded in a hash_set now
>>>>>>> instead of being processed to a vector of rgroup_controls.
>>>>>>> Instead, that's now left to the final analysis phase, which
>>>>>>> tries forming the rgroup_controls vector using while_ult and,
>>>>>>> if that fails, now tries the AVX512 style, which needs a
>>>>>>> different organization and instead fills a hash_map with the
>>>>>>> relevant info.  vect_get_loop_mask now has two
>>>>>>> implementations, one for each of the two mask styles we then
>>>>>>> have.
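(To make the masked-epilogue scheme confirmed above concrete, a rough
scalar model in C — entirely illustrative, with a made-up vector
factor and loop body; not the code the vectorizer emits:)

#include <stddef.h>

#define VF 16  /* illustrative vector factor, e.g. 16 floats in a
                  512-bit vector */

void
saxpy (float *restrict y, const float *restrict x, float a, size_t n)
{
  size_t i = 0;
  /* Main loop: whole vectors only; no mask generated or carried.  */
  for (; i + VF <= n; i += VF)
    for (size_t lane = 0; lane < VF; lane++)  /* one unmasked vector op */
      y[i + lane] += a * x[i + lane];
  /* Masked epilogue: a single pass at the same vector size, no loop;
     the lane mask simply disables lanes at and beyond n - i.  */
  for (size_t lane = 0; lane < VF; lane++)
    if (i + lane < n)                         /* the lane mask */
      y[i + lane] += a * x[i + lane];
}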
>>>>>>> I have decided against interweaving
>>>>>>> vect_set_loop_condition_partial_vectors with conditions to do
>>>>>>> AVX512-style masking and instead opted to "duplicate" this to
>>>>>>> vect_set_loop_condition_partial_vectors_avx512.  Likewise for
>>>>>>> vect_verify_full_masking vs vect_verify_full_masking_avx512.
>>>>>>> I was split between making 'vec_loop_masks' a class with
>>>>>>> methods, possibly merging in the _len stuff into a single
>>>>>>> registry.  It seemed to be too many changes for the purpose of
>>>>>>> getting AVX512 working.  I'm going to play wait-and-see on
>>>>>>> what happens with RISC-V here, since they are going to get
>>>>>>> both masks and lengths registered, I think.
>>>>>>> The vect_prepare_for_masked_peels hunk might run into issues
>>>>>>> with SVE; I didn't check yet, but using
>>>>>>> LOOP_VINFO_RGROUP_COMPARE_TYPE looked odd.
>>>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
>>>>>>> the testsuite with --param vect-partial-vector-usage=2 with
>>>>>>> and without -fno-vect-cost-model and filed two bugs, one ICE
>>>>>>> (PR110221) and one latent wrong-code (PR110237).
>>>>>>> There's followup work to be done to try enabling masked
>>>>>>> epilogues for x86-64 by default (when AVX512 is enabled,
>>>>>>> possibly only when -mprefer-vector-width=512).  Getting the
>>>>>>> cost modeling and decision right is going to be challenging.
>>>>>>> Any comments?
>>>>>>> OK?
>>>>>>> Btw, testing on GCN would be welcome - the _avx512 paths could
>>>>>>> work for it, so in case the while_ult path fails (not sure if
>>>>>>> it ever does) it could get _avx512-style masking.  Likewise
>>>>>>> testing on ARM, just to see I didn't break anything here.
>>>>>>> I don't have SVE hardware, so my testing is probably
>>>>>>> meaningless.
>>>>>>
>>>>>> I can set some tests going.  Is vect.exp enough?
>>>>>
>>>>> Well, only you know (from experience), but sure, that's a nice
>>>>> start.
>>>>
>>>> I tested vect.exp for both gcc and gfortran and there were no
>>>> regressions.  I have another run going with the other param
>>>> settings.
>>>>
>>>> (Side note: vect.exp used to be a nice quick test for use during
>>>> development, but the tsvc tests are now really slow, at least
>>>> when run on a single GPU thread.)
>>>>
>>>> I tried some small examples with
>>>> --param vect-partial-vector-usage=1 (IIUC this prevents masked
>>>> loops, but not masked epilogues, right?)
>>>
>>> Yes.  That should also work with the while_ult style, btw.
>>>
>>>> and the results look good.  I plan to do some benchmarking
>>>> shortly.  One comment: building a vector constant
>>>> {0, 1, 2, 3, ..., 63} results in a very large entry in the
>>>> constant pool and an unnecessary memory load (it literally has to
>>>> use this sequence to generate the addresses to load the
>>>> constant!).  Generating the sequence via VEC_SERIES would be a
>>>> no-op, for GCN, because we have an ABI-mandated register that
>>>> already holds that value.  (Perhaps I have another piece missing
>>>> here, IDK?)
>>>
>>> I failed to special-case the {0, 1, 2, 3, ...} constant because I
>>> couldn't see how to do a series that creates
>>> {0, 0, 1, 1, 2, 2, ...}.  It might be that the target needs to
>>> pattern-match these constants at RTL expansion time?
>>>
>>> Btw, did you disable your while_ult pattern for the experiment?
>>
>> I tried it both ways; both appear to work, and the while_ult case
>> does avoid the constant vector.  I also don't seem to need
>> while_ult for the fully masked case any more (is that new?).
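(As I understand it, the mask production without while_ult amounts to
something like the following — an illustrative C sketch with a
made-up name and a 64-lane vector; the real code is an unsigned
vector compare of a stepped constant vector:)

#include <stdint.h>

/* Illustrative sketch of AVX512-style mask production: a vec_cmpu of
   the constant stepped vector {0, 1, ..., 63} against the broadcast
   count of remaining scalar iterations; bit LANE of the mask is set
   iff LANE < REMAINING.  */
static uint64_t
avx512_style_mask (uint64_t remaining)
{
  uint64_t mask = 0;
  for (int lane = 0; lane < 64; lane++)
    if ((uint64_t) lane < remaining)   /* {0,1,...,63} <u remaining */
      mask |= (uint64_t) 1 << lane;
  return mask;
}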
>
> Yes, while_ult always compares to {0, 1, 2, 3, ...}, which seems
> conveniently available, but it has to multiply the IV by the number
> of scalars per iter, which has overflow issues that it has to
> compensate for by choosing a wider IV.  I'm avoiding that issue
> (besides for alignment peeling) by instead altering the constant
> vector to compare against.  On x86 the constant vector is always a
> load, but the multiplication would add to the latency of mask
> production, which already isn't too great.

Is the multiplication not usually a shift?

> And yes, the alternate scheme doesn't rely on while_ult but instead
> on vec_cmpu to produce the masks.
>
> You might be able to produce the {0, 0, 1, 1, ...} constant by
> interleaving v1 with itself?  Any non-power-of-two duplication
> looks more difficult though.

I think that would need to use a full permutation, which is probably
faster than a cold load, but in all these cases the vector that
defines the permutation looks exactly like the result, so ...

I've been playing with this stuff some more, and I find that even
though GCN supports fully masked loops, and uses them when I test
without offload, it's actually been running in
param_vect_partial_vector_usage==0 mode for offload, because i386.cc
has that hardcoded and the offload compiler inherits param settings
from the host.

I tried running the Babelstream benchmark with the various settings
and it's a wash for most of the measurements (memory-limited, most
likely), but the "Dot" benchmark is considerably slower when fully
masked (about 50%).  This probably explains why adding the
additional "fake" smaller vector sizes was so good for our numbers,
but it confirms that the partial epilogue is a good option.

Andrew