From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Biener
Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
Date: Thu, 15 Jun 2023 18:16:53 +0200
Message-Id: <7B2A55C9-E276-42DA-BA36-F6863C3EF969@suse.de>
In-Reply-To: <7758ea65-3a6b-c49b-c2c1-bfdf9b1bcbee@codesourcery.com>
References: <7758ea65-3a6b-c49b-c2c1-bfdf9b1bcbee@codesourcery.com>
To: Andrew Stubbs
Cc: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com, Jan Hubicka,
 hongtao.liu@intel.com, kirill.yukhin@gmail.com

> On 15/06/2023 16:04, Andrew Stubbs wrote:
>
> On 15/06/2023 15:00, Richard Biener wrote:
>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>> On 15/06/2023 14:34, Richard Biener wrote:
>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>>
>>>>> On 15/06/2023 12:06, Richard Biener wrote:
>>>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>>>>
>>>>>>> On 15/06/2023 10:58, Richard Biener wrote:
>>>>>>>> On Thu, 15 Jun 2023, Andrew Stubbs wrote:
>>>>>>>>
>>>>>>>>> On 14/06/2023 15:29, Richard Biener wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 14/06/2023 16:27, Andrew Stubbs wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
>>>>>>>>>>>> This implements fully masked vectorization or a masked
>>>>>>>>>>>> epilog for
>>>>>>>>>>>> AVX512 style masks which single themselves out by representing
>>>>>>>>>>>> each lane with a single bit and by using integer modes for the
>>>>>>>>>>>> mask (both much like GCN).
>>>>>>>>>>>> AVX512 is also special in that it doesn't have any instruction
>>>>>>>>>>>> to compute the mask from a scalar IV like SVE has with
>>>>>>>>>>>> while_ult.  Instead the masks are produced by vector compares
>>>>>>>>>>>> and the loop control retains the scalar IV (mainly to avoid
>>>>>>>>>>>> dependences on mask generation; a suitable mask test
>>>>>>>>>>>> instruction is available).
>>>>>>>>>>>
>>>>>>>>>>> This also sounds like GCN. We currently use WHILE_ULT in the
>>>>>>>>>>> middle end, which expands to a vector compare against a vector
>>>>>>>>>>> of stepped values. This requires an additional instruction to
>>>>>>>>>>> prepare the comparison vector (compared to SVE), but the
>>>>>>>>>>> "while_ultv64sidi" pattern (for example) returns the DImode
>>>>>>>>>>> bitmask, so it works reasonably well.
>>>>>>>>>>>
>>>>>>>>>>>> Like RVV, code generation prefers a decrementing IV, though
>>>>>>>>>>>> IVOPTs messes things up in some cases, removing that IV to
>>>>>>>>>>>> eliminate it with an incrementing one used for address
>>>>>>>>>>>> generation.
>>>>>>>>>>>> One of the motivating testcases is from PR108410, which in
>>>>>>>>>>>> turn is extracted from x264, where large size vectorization
>>>>>>>>>>>> shows issues with small trip loops. Execution time there
>>>>>>>>>>>> improves compared to classic AVX512 with AVX2 epilogues for
>>>>>>>>>>>> the cases of less than 32 iterations.
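The WHILE_ULT expansion described above (a lane-wise unsigned compare of a stepped vector against the trip count, yielding an integer-mode bitmask) can be modeled in scalar C. This is purely an illustrative sketch, not GCC code; the function name is made up:

```c
#include <stdint.h>
#include <assert.h>

/* Scalar model of a WHILE_ULT-style mask: bit i of the result is set
   iff iv + i < niters, i.e. a lane-wise unsigned compare of the
   stepped vector {iv+0, iv+1, ...} against the trip count, returned
   as an integer-mode bitmask (one bit per lane, AVX512/GCN style).  */
uint64_t
while_ult_mask (uint64_t iv, uint64_t niters, unsigned nlanes)
{
  uint64_t mask = 0;
  for (unsigned i = 0; i < nlanes; i++)
    if (iv + i < niters)
      mask |= (uint64_t) 1 << i;
  return mask;
}
```

For a full vector iteration the result is all-ones; on the final partial iteration only the low bits corresponding to remaining scalar iterations survive.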
>>>>>>>>>>>>
>>>>>>>>>>>>  size  scalar    128    256    512   512e   512f
>>>>>>>>>>>>     1    9.42  11.32   9.35  11.17  15.13  16.89
>>>>>>>>>>>>     2    5.72   6.53   6.66   6.66   7.62   8.56
>>>>>>>>>>>>     3    4.49   5.10   5.10   5.74   5.08   5.73
>>>>>>>>>>>>     4    4.10   4.33   4.29   5.21   3.79   4.25
>>>>>>>>>>>>     6    3.78   3.85   3.86   4.76   2.54   2.85
>>>>>>>>>>>>     8    3.64   1.89   3.76   4.50   1.92   2.16
>>>>>>>>>>>>    12    3.56   2.21   3.75   4.26   1.26   1.42
>>>>>>>>>>>>    16    3.36   0.83   1.06   4.16   0.95   1.07
>>>>>>>>>>>>    20    3.39   1.42   1.33   4.07   0.75   0.85
>>>>>>>>>>>>    24    3.23   0.66   1.72   4.22   0.62   0.70
>>>>>>>>>>>>    28    3.18   1.09   2.04   4.20   0.54   0.61
>>>>>>>>>>>>    32    3.16   0.47   0.41   0.41   0.47   0.53
>>>>>>>>>>>>    34    3.16   0.67   0.61   0.56   0.44   0.50
>>>>>>>>>>>>    38    3.19   0.95   0.95   0.82   0.40   0.45
>>>>>>>>>>>>    42    3.09   0.58   1.21   1.13   0.36   0.40
>>>>>>>>>>>>
>>>>>>>>>>>> 'size' specifies the number of actual iterations; 512e is for
>>>>>>>>>>>> a masked epilog and 512f for the fully masked loop. From
>>>>>>>>>>>> 4 scalar iterations on, the AVX512 masked epilog code is
>>>>>>>>>>>> clearly the winner, the fully masked variant is clearly worse
>>>>>>>>>>>> and its size benefit is also tiny.
>>>>>>>>>>>
>>>>>>>>>>> Let me check I understand correctly. In the fully masked case,
>>>>>>>>>>> there is a single loop in which a new mask is generated at the
>>>>>>>>>>> start of each iteration. In the masked epilogue case, the main
>>>>>>>>>>> loop uses no masking whatsoever, thus avoiding the need for
>>>>>>>>>>> generating a mask, carrying the mask, inserting vec_merge
>>>>>>>>>>> operations, etc., and then the epilogue looks much like the
>>>>>>>>>>> fully masked case, but unlike smaller mode epilogues there is
>>>>>>>>>>> no loop because the epilogue vector size is the same. Is that
>>>>>>>>>>> right?
>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>> This scheme seems like it might also benefit GCN, in so much
>>>>>>>>>>> as it simplifies the hot code path.
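To make the two schemes being compared concrete, here is a rough scalar model (hypothetical code, not from the patch): the fully masked loop pays for a mask on every iteration, while the masked-epilogue variant runs the main loop unmasked and then executes a single masked pass, at the same vector size, for the remainder.

```c
#include <stddef.h>
#include <assert.h>

/* Fully masked: every iteration computes a fresh mask, even when the
   vector is entirely full.  The inner loop models lane predication.  */
void
add_fully_masked (int *a, const int *b, size_t n, size_t vf)
{
  for (size_t iv = 0; iv < n; iv += vf)
    for (size_t l = 0; l < vf; l++)
      if (iv + l < n)          /* mask bit for lane l */
        a[iv + l] += b[iv + l];
}

/* Masked epilogue: unmasked main loop over whole vectors, then one
   masked pass (no loop) at the same vector size for the tail.  */
void
add_masked_epilogue (int *a, const int *b, size_t n, size_t vf)
{
  size_t main_n = n - n % vf;
  for (size_t iv = 0; iv < main_n; iv += vf)
    for (size_t l = 0; l < vf; l++)   /* hot path: no mask at all */
      a[iv + l] += b[iv + l];
  for (size_t l = 0; l < vf; l++)     /* single masked epilogue pass */
    if (main_n + l < n)
      a[main_n + l] += b[main_n + l];
}
```

Both produce identical results; the difference is only where the masking cost is paid.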
>>>>>>>>>>>
>>>>>>>>>>> GCN does not actually have smaller vector sizes, so there's no
>>>>>>>>>>> analogue to AVX2 (we pretend we have some smaller sizes, but
>>>>>>>>>>> that's because the middle end can't do masking everywhere yet,
>>>>>>>>>>> and it helps make some vector constants smaller, perhaps).
>>>>>>>>>>>
>>>>>>>>>>>> This patch does not enable using fully masked loops or
>>>>>>>>>>>> masked epilogues by default. More work on cost modeling
>>>>>>>>>>>> and vectorization kind selection on x86_64 is necessary
>>>>>>>>>>>> for this.
>>>>>>>>>>>> Implementation-wise this introduces
>>>>>>>>>>>> LOOP_VINFO_PARTIAL_VECTORS_STYLE, which could be exploited
>>>>>>>>>>>> further to unify some of the flags we have right now, but
>>>>>>>>>>>> there didn't seem to be many easy things to merge, so I'm
>>>>>>>>>>>> leaving this for followups.
>>>>>>>>>>>> Mask requirements as registered by vect_record_loop_mask are
>>>>>>>>>>>> kept in their original form and recorded in a hash_set now
>>>>>>>>>>>> instead of being processed to a vector of rgroup_controls.
>>>>>>>>>>>> Instead that's now left to the final analysis phase, which
>>>>>>>>>>>> tries forming the rgroup_controls vector using while_ult and,
>>>>>>>>>>>> if that fails, now tries AVX512 style, which needs a
>>>>>>>>>>>> different organization and instead fills a hash_map with the
>>>>>>>>>>>> relevant info. vect_get_loop_mask now has two
>>>>>>>>>>>> implementations, one for each of the two mask styles we then
>>>>>>>>>>>> have.
>>>>>>>>>>>> I have decided against interweaving
>>>>>>>>>>>> vect_set_loop_condition_partial_vectors with conditions to do
>>>>>>>>>>>> AVX512 style masking and instead opted to "duplicate" this to
>>>>>>>>>>>> vect_set_loop_condition_partial_vectors_avx512. Likewise for
>>>>>>>>>>>> vect_verify_full_masking vs vect_verify_full_masking_avx512.
>>>>>>>>>>>> I was split between making 'vec_loop_masks' a class with
>>>>>>>>>>>> methods, possibly merging in the _len stuff into a single
>>>>>>>>>>>> registry. It seemed to be too many changes for the purpose
>>>>>>>>>>>> of getting AVX512 working. I'm going to play wait and see
>>>>>>>>>>>> what happens with RISC-V here since they are going to get
>>>>>>>>>>>> both masks and lengths registered, I think.
>>>>>>>>>>>> The vect_prepare_for_masked_peels hunk might run into issues
>>>>>>>>>>>> with SVE; I didn't check yet, but using
>>>>>>>>>>>> LOOP_VINFO_RGROUP_COMPARE_TYPE looked odd.
>>>>>>>>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu. I've
>>>>>>>>>>>> run the testsuite with --param vect-partial-vector-usage=2
>>>>>>>>>>>> with and without -fno-vect-cost-model and filed two bugs,
>>>>>>>>>>>> one ICE (PR110221) and one latent wrong-code (PR110237).
>>>>>>>>>>>> There's followup work to be done to try enabling masked
>>>>>>>>>>>> epilogues for x86-64 by default (when AVX512 is enabled,
>>>>>>>>>>>> possibly only when -mprefer-vector-width=512). Getting cost
>>>>>>>>>>>> modeling and the decision right is going to be challenging.
>>>>>>>>>>>> Any comments?
>>>>>>>>>>>> OK?
>>>>>>>>>>>> Btw, testing on GCN would be welcome - the _avx512 paths
>>>>>>>>>>>> could work for it, so in case the while_ult path fails (not
>>>>>>>>>>>> sure if it ever does) it could get _avx512 style masking.
>>>>>>>>>>>> Likewise testing on ARM just to see I didn't break anything
>>>>>>>>>>>> here.  I don't have SVE hardware so testing is probably
>>>>>>>>>>>> meaningless.
>>>>>>>>>>>
>>>>>>>>>>> I can set some tests going. Is vect.exp enough?
>>>>>>>>>>
>>>>>>>>>> Well, only you know (from experience), but sure, that's a nice
>>>>>>>>>> start.
>>>>>>>>>
>>>>>>>>> I tested vect.exp for both gcc and gfortran and there were no
>>>>>>>>> regressions. I have another run going with the other param
>>>>>>>>> settings.
>>>>>>>>>
>>>>>>>>> (Side note: vect.exp used to be a nice quick test for use during
>>>>>>>>> development, but the tsvc tests are now really slow, at least
>>>>>>>>> when run on a single GPU thread.)
>>>>>>>>>
>>>>>>>>> I tried some small examples with --param
>>>>>>>>> vect-partial-vector-usage=1 (IIUC this prevents masked loops,
>>>>>>>>> but not masked epilogues, right?)
>>>>>>>>
>>>>>>>> Yes. That should also work with the while_ult style, btw.
>>>>>>>>
>>>>>>>>> and the results look good. I plan to do some benchmarking
>>>>>>>>> shortly. One comment: building a vector constant
>>>>>>>>> {0, 1, 2, 3, ..., 63} results in a very large entry in the
>>>>>>>>> constant pool and an unnecessary memory load (it literally has
>>>>>>>>> to use this sequence to generate the addresses to load the
>>>>>>>>> constant!). Generating the sequence via VEC_SERIES would be a
>>>>>>>>> no-op, for GCN, because we have an ABI-mandated register that
>>>>>>>>> already holds that value. (Perhaps I have another piece missing
>>>>>>>>> here, IDK?)
>>>>>>>>
>>>>>>>> I failed to special-case the {0, 1, 2, 3, ... } constant because
>>>>>>>> I couldn't see how to do a series that creates
>>>>>>>> { 0, 0, 1, 1, 2, 2, ... }. It might be that the target needs to
>>>>>>>> pattern match these constants at RTL expansion time?
>>>>>>>>
>>>>>>>> Btw, did you disable your while_ult pattern for the experiment?
>>>>>>>
>>>>>>> I tried it both ways; both appear to work, and the while_ult case
>>>>>>> does avoid the constant vector. I also don't seem to need
>>>>>>> while_ult for the fully masked case any more (is that new?).
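The { 0, 0, 1, 1, 2, 2, ... } constant discussed above could in principle be formed by interleaving the stepped vector { 0, 1, 2, ... } with itself. A scalar model of that interleave permute (illustrative only; whether a target can do this cheaply is a separate question):

```c
#include <assert.h>

/* Model of interleaving v1 = {0, 1, 2, ...} with itself (an
   interleave-lo/hi style permute) to produce {0, 0, 1, 1, ...}, the
   constant needed when each mask element covers two scalar lanes.  */
void
interleave_self (const int *v1, int *out, unsigned nlanes)
{
  for (unsigned i = 0; i < nlanes; i++)
    {
      out[2 * i] = v1[i];      /* even lanes take element i */
      out[2 * i + 1] = v1[i];  /* odd lanes duplicate it */
    }
}
```

Non-power-of-two duplication factors (three scalars per mask element, say) do not decompose into simple self-interleaves, which matches the observation that those cases look harder.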
>>>>>>
>>>>>> Yes, while_ult always compares to {0, 1, 2, 3, ...}, which seems
>>>>>> conveniently available, but it has to multiply the IV with the
>>>>>> number of scalars per iter, which has overflow issues it has to
>>>>>> compensate for by choosing a wider IV. I'm avoiding that issue
>>>>>> (besides for alignment peeling) by instead altering the constant
>>>>>> vector to compare against. On x86 the constant vector is always a
>>>>>> load, but the multiplication would add to the latency of mask
>>>>>> production, which already isn't too great.
>>>>>
>>>>> Is the multiplication not usually a shift?
>>>>>
>>>>>> And yes, the alternate scheme doesn't rely on while_ult but
>>>>>> instead on vec_cmpu to produce the masks.
>>>>>>
>>>>>> You might be able to produce the { 0, 0, 1, 1, ... } constant by
>>>>>> interleaving v1 with itself? Any non-power-of-two duplication
>>>>>> looks more difficult though.
>>>>>
>>>>> I think that would need to use a full permutation, which is
>>>>> probably faster than a cold load, but in all these cases the vector
>>>>> that defines the permutation looks exactly like the result, so...
>>>>>
>>>>> I've been playing with this stuff some more and I find that even
>>>>> though GCN supports fully masked loops and uses them when I test
>>>>> without offload, it's actually been running in
>>>>> param_vect_partial_vector_usage==0 mode for offload because i386.cc
>>>>> has that hardcoded and the offload compiler inherits param settings
>>>>> from the host.
>>>>
>>>> Doesn't that mean it will have a scalar epilog and a very large VF
>>>> for the main loop due to the large vector size?
>>>>
>>>>> I tried running the Babelstream benchmark with the various settings
>>>>> and it's a wash for most of the measurements (memory limited, most
>>>>> likely), but the "Dot" benchmark is considerably slower when fully
>>>>> masked (about 50%).
>>>>> This probably explains why adding the additional "fake" smaller
>>>>> vector sizes was so good for our numbers, but confirms that the
>>>>> partial epilogue is a good option.
>>>>
>>>> Ah, "fake" smaller vector sizes probably then made up for this with
>>>> "fixed" size epilogue vectorization? But yes, I think a vectorized
>>>> epilog with partial vectors that then does not iterate would get you
>>>> the best of both worlds.
>>>
>>> Yes, it uses V32 for the epilogue, which won't fit every case but is
>>> better than nothing.
>>>
>>>> So param_vect_partial_vector_usage == 1.
>>>
>>> Unfortunately, there doesn't seem to be a way to set this *only* for
>>> the offload compiler. If you could fix it for x86_64 soon then that
>>> would be awesome. :)
>>
>> So some opts tweaking in the GCN option_override hook doesn't work?
>
> I didn't try that, but
> -foffload-options=--param=vect-partial-vector-usage=1 had no effect.

I guess the flag is streamed as a function-specific optimization…

>
>>>> Whether with a while_ult optab or the vec_cmpu scheme should then
>>>> depend on generated code quality.
>>>
>>> Which looks like it depends on these constants alone.
>>
>> Understood, compared to while_ult these now appear explicitly. As
>> Richard said, you likely have to tweak the backend to make use of
>> 'r1'.
>
> Yes, I think I understood what he meant. Should be doable.
>
> Andrew
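For reference, the vec_cmpu-based masking discussed in the thread - keeping the scalar IV and folding the scalars-per-mask-element scaling into the pre-built constant vector instead of multiplying the IV - might be modeled roughly as follows. This is a sketch under my own assumptions, not the GIMPLE the patch actually emits:

```c
#include <stdint.h>
#include <assert.h>

/* Sketch: lane i is active iff cst[i] < remaining, where 'remaining'
   is the number of scalar iterations left and cst[] is a pre-built
   constant such as {0, 0, 1, 1, ...} when one mask element covers two
   scalars.  No IV multiplication is needed, hence no overflow to
   compensate for with a wider IV.  */
uint64_t
vec_cmpu_mask (const uint64_t *cst, unsigned nlanes, uint64_t remaining)
{
  uint64_t mask = 0;
  for (unsigned i = 0; i < nlanes; i++)
    if (cst[i] < remaining)     /* unsigned lane-wise compare */
      mask |= (uint64_t) 1 << i;
  return mask;
}
```

With cst = {0, 0, 1, 1} and one scalar iteration remaining, only the first pair of lanes stays active; with two remaining, all four do.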