From: Richard Biener
Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization
Date: Wed, 14 Jun 2023 16:29:54 +0200
To: Andrew Stubbs
Cc: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com, Jan Hubicka, hongtao.liu@intel.com, kirill.yukhin@gmail.com
In-Reply-To: <8aab0039-56a5-5bb8-e58a-29f13a9a6737@codesourcery.com>
References: <8aab0039-56a5-5bb8-e58a-29f13a9a6737@codesourcery.com>

> On 14.06.2023 at 16:27, Andrew Stubbs wrote:
>
> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote:
>> This implements fully masked vectorization or a masked epilog for
>> AVX512 style masks which single themselves out by representing
>> each lane with a single bit and by using integer modes for the mask
>> (both are much like GCN).
>> AVX512 is also special in that it doesn't have any instruction
>> to compute the mask from a scalar IV like SVE has with while_ult.
>> Instead the masks are produced by vector compares and the loop
>> control retains the scalar IV (mainly to avoid dependences on
>> mask generation, a suitable mask test instruction is available).
>
> This also sounds like GCN.
> We currently use WHILE_ULT in the middle end, which expands to a vector
> compare against a vector of stepped values. This requires an additional
> instruction to prepare the comparison vector (compared to SVE), but the
> "while_ultv64sidi" pattern (for example) returns the DImode bitmask, so
> it works reasonably well.
>
>> Like RVV, code generation prefers a decrementing IV, though IVOPTs
>> messes things up in some cases, removing that IV to eliminate
>> it with an incrementing one used for address generation.
>> One of the motivating testcases is from PR108410, which in turn
>> is extracted from x264, where large size vectorization shows
>> issues with small trip loops. Execution time there improves
>> compared to classic AVX512 with AVX2 epilogues for the cases
>> of less than 32 iterations.
>>
>> size   scalar     128     256     512    512e    512f
>>    1     9.42   11.32    9.35   11.17   15.13   16.89
>>    2     5.72    6.53    6.66    6.66    7.62    8.56
>>    3     4.49    5.10    5.10    5.74    5.08    5.73
>>    4     4.10    4.33    4.29    5.21    3.79    4.25
>>    6     3.78    3.85    3.86    4.76    2.54    2.85
>>    8     3.64    1.89    3.76    4.50    1.92    2.16
>>   12     3.56    2.21    3.75    4.26    1.26    1.42
>>   16     3.36    0.83    1.06    4.16    0.95    1.07
>>   20     3.39    1.42    1.33    4.07    0.75    0.85
>>   24     3.23    0.66    1.72    4.22    0.62    0.70
>>   28     3.18    1.09    2.04    4.20    0.54    0.61
>>   32     3.16    0.47    0.41    0.41    0.47    0.53
>>   34     3.16    0.67    0.61    0.56    0.44    0.50
>>   38     3.19    0.95    0.95    0.82    0.40    0.45
>>   42     3.09    0.58    1.21    1.13    0.36    0.40
>>
>> 'size' specifies the number of actual iterations, 512e is for
>> a masked epilog and 512f for the fully masked loop. From
>> 4 scalar iterations on, the AVX512 masked epilog code is clearly
>> the winner; the fully masked variant is clearly worse and
>> its size benefit is also tiny.
>
> Let me check I understand correctly. In the fully masked case, there is a
> single loop in which a new mask is generated at the start of each iteration.
> In the masked epilogue case, the main loop uses no masking whatsoever, thus
> avoiding the need for generating a mask, carrying the mask, inserting
> vec_merge operations, etc., and then the epilogue looks much like the fully
> masked case, but unlike smaller mode epilogues there is no loop because the
> epilogue vector size is the same. Is that right?

Yes.

> This scheme seems like it might also benefit GCN, insomuch as it
> simplifies the hot code path.
>
> GCN does not actually have smaller vector sizes, so there's no analogue to
> AVX2 (we pretend we have some smaller sizes, but that's because the middle
> end can't do masking everywhere yet, and it helps make some vector constants
> smaller, perhaps).
>
>> This patch does not enable using fully masked loops or
>> masked epilogues by default. More work on cost modeling
>> and vectorization kind selection on x86_64 is necessary
>> for this.
>> Implementation-wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
>> which could be exploited further to unify some of the flags
>> we have right now but there didn't seem to be many easy things
>> to merge, so I'm leaving this for followups.
>> Mask requirements as registered by vect_record_loop_mask are kept in their
>> original form and recorded in a hash_set now instead of being
>> processed to a vector of rgroup_controls. Instead that's now
>> left to the final analysis phase which tries forming the rgroup_controls
>> vector using while_ult and if that fails now tries AVX512 style
>> which needs a different organization and instead fills a hash_map
>> with the relevant info. vect_get_loop_mask now has two implementations,
>> one for each of the two mask styles we then have.
>> I have decided against interweaving vect_set_loop_condition_partial_vectors
>> with conditions to do AVX512 style masking and instead opted to
>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
>> Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.
>> I was split between making 'vec_loop_masks' a class with methods,
>> possibly merging in the _len stuff into a single registry. It
>> seemed to be too many changes for the purpose of getting AVX512
>> working. I'm going to play wait and see what happens with RISC-V
>> here since they are going to get both masks and lengths registered
>> I think.
>> The vect_prepare_for_masked_peels hunk might run into issues with
>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
>> looked odd.
>> Bootstrapped and tested on x86_64-unknown-linux-gnu. I've run
>> the testsuite with --param vect-partial-vector-usage=2 with and
>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
>> and one latent wrong-code (PR110237).
>> There's followup work to be done to try enabling masked epilogues
>> for x86-64 by default (when AVX512 is enabled, possibly only when
>> -mprefer-vector-width=512). Getting cost modeling and decision
>> right is going to be challenging.
>> Any comments?
>> OK?
>> Btw, testing on GCN would be welcome - the _avx512 paths could
>> work for it so in case the while_ult path fails (not sure if
>> it ever does) it could get _avx512 style masking. Likewise
>> testing on ARM just to see I didn't break anything here.
>> I don't have SVE hardware so testing is probably meaningless.
>
> I can set some tests going. Is vect.exp enough?

Well, only you know (from experience), but sure that's a nice start.

Richard

> Andrew
>
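
The two schemes discussed in this thread can be sketched in scalar C. This is a minimal illustration, not code from the patch. The 16-lane width and the names lane_mask, masked_add, add_fully_masked, add_masked_epilogue and sketch_self_check are assumptions made up for the example. The per-lane loops only emulate what a vector compare against stepped values and a masked vector operation would do on real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16 /* assumed: 16 lanes, e.g. 32-bit elements in 512 bits */

/* Hypothetical helper: one mask bit per lane, set when element iv + l
   is still below n.  Emulates comparing {iv, iv+1, ...} against n.  */
static uint16_t lane_mask(size_t iv, size_t n)
{
  uint16_t mask = 0;
  for (unsigned l = 0; l < LANES; l++)
    if (iv + l < n)
      mask |= (uint16_t)(1u << l);
  return mask;
}

/* Emulated masked vector add: lanes with a clear mask bit are untouched.  */
static void masked_add(int32_t *dst, const int32_t *a, const int32_t *b,
                       uint16_t mask)
{
  for (unsigned l = 0; l < LANES; l++)
    if (mask & (1u << l))
      dst[l] = a[l] + b[l];
}

/* Fully masked: a new mask every iteration, loop control on the scalar IV.  */
static void add_fully_masked(int32_t *dst, const int32_t *a,
                             const int32_t *b, size_t n)
{
  for (size_t i = 0; i < n; i += LANES)
    masked_add(dst + i, a + i, b + i, lane_mask(i, n));
}

/* Masked epilogue: the all-ones mask stands in for an unmasked main loop,
   followed by at most one masked step of the same vector size.  */
static void add_masked_epilogue(int32_t *dst, const int32_t *a,
                                const int32_t *b, size_t n)
{
  size_t i = 0;
  for (; i + LANES <= n; i += LANES)
    masked_add(dst + i, a + i, b + i, 0xffff);
  if (i < n)
    masked_add(dst + i, a + i, b + i, lane_mask(i, n));
}

/* Both variants must write exactly the first n elements and nothing else.  */
void sketch_self_check(void)
{
  static const size_t sizes[] = { 1, 4, 16, 20, 34, 42 };
  int32_t a[64], b[64], f[64], e[64];
  for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; s++)
    {
      for (size_t k = 0; k < 64; k++)
        {
          a[k] = (int32_t)k;
          b[k] = (int32_t)(2 * k);
          f[k] = e[k] = -1; /* sentinel for "not written" */
        }
      add_fully_masked(f, a, b, sizes[s]);
      add_masked_epilogue(e, a, b, sizes[s]);
      for (size_t k = 0; k < 64; k++)
        {
          int32_t want = k < sizes[s] ? (int32_t)(3 * k) : -1;
          assert(f[k] == want && e[k] == want);
        }
    }
}
```

Calling sketch_self_check() runs both variants on a few of the trip counts from the table and asserts that they write the same elements. The sketch says nothing about performance. It only shows why the masked epilogue needs no second loop and why the fully masked variant pays for mask generation on every iteration.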