From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp-out1.suse.de (smtp-out1.suse.de [IPv6:2001:67c:2178:6::1c]) by sourceware.org (Postfix) with ESMTPS id B24113858D3C for ; Thu, 15 Jun 2023 09:58:31 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org B24113858D3C Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=suse.de Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de Received: from relay2.suse.de (relay2.suse.de [149.44.160.134]) by smtp-out1.suse.de (Postfix) with ESMTP id E15E6223E8; Thu, 15 Jun 2023 09:58:30 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1686823110; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=YpSWaBqukSJH62Umjtwx6xc7xfiHFjOshBtgG5SFl4U=; b=b1thVqccbdLDxkgxO7lvCJ4yckjnXPDkehQ/v+BRBaQr59YkHdR0xsDmVjMDiSBKUU4+H4 9tkIpNMgICTj1U0hjvsLRsh3XcVX/apNuM3nHPpxU98H1GcRw8Jf62tXK8nA7HB/mDegIJ O16UEqWRk/Hb6PGFJR30ZUj0sYAhDl8= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1686823110; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=YpSWaBqukSJH62Umjtwx6xc7xfiHFjOshBtgG5SFl4U=; b=BWupzSR7LVEZhQPq5nEtnXcsK2+3MWh1wu9POjp4jcWjwdOqv3YhOm8BVyAlY/NoQ/q1Ry CeR2TZMz8ZnPDkAw== Received: from wotan.suse.de (wotan.suse.de [10.160.0.1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by relay2.suse.de (Postfix) with ESMTPS id 7E82B2C141; Thu, 15 Jun 2023 09:58:30 +0000 (UTC) Date: Thu, 15 Jun 2023 09:58:30 +0000 (UTC) From: Richard Biener To: Andrew Stubbs cc: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com, Jan Hubicka , hongtao.liu@intel.com, kirill.yukhin@gmail.com Subject: Re: [PATCH 3/3] AVX512 fully masked vectorization In-Reply-To: <378905b9-2383-d564-1c91-2c6b1e06629d@codesourcery.com> Message-ID: References: <8aab0039-56a5-5bb8-e58a-29f13a9a6737@codesourcery.com> <378905b9-2383-d564-1c91-2c6b1e06629d@codesourcery.com> User-Agent: Alpine 2.22 (LSU 394 2020-01-19) MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="-1609957120-2011304820-1686823110=:4723" X-Spam-Status: No, score=-4.9 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1609957120-2011304820-1686823110=:4723 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8BIT On Thu, 15 Jun 2023, Andrew Stubbs wrote: > On 14/06/2023 15:29, Richard Biener wrote: > > > > > >> Am 14.06.2023 um 16:27 schrieb Andrew Stubbs : > >> > >> On 14/06/2023 12:54, Richard Biener via Gcc-patches wrote: > >>> This implemens fully masked vectorization or a masked epilog for > >>> AVX512 style masks which single themselves out by representing > >>> each lane with a single bit and by using integer modes for the mask > >>> (both is much like GCN). > >>> AVX512 is also special in that it doesn't have any instruction > >>> to compute the mask from a scalar IV like SVE has with while_ult. > >>> Instead the masks are produced by vector compares and the loop > >>> control retains the scalar IV (mainly to avoid dependences on > >>> mask generation, a suitable mask test instruction is available). > >> > >> This is also sounds like GCN. We currently use WHILE_ULT in the middle end > >> which expands to a vector compare against a vector of stepped values. This > >> requires an additional instruction to prepare the comparison vector > >> (compared to SVE), but the "while_ultv64sidi" pattern (for example) returns > >> the DImode bitmask, so it works reasonably well. > >> > >>> Like RVV code generation prefers a decrementing IV though IVOPTs > >>> messes things up in some cases removing that IV to eliminate > >>> it with an incrementing one used for address generation. > >>> One of the motivating testcases is from PR108410 which in turn > >>> is extracted from x264 where large size vectorization shows > >>> issues with small trip loops. Execution time there improves > >>> compared to classic AVX512 with AVX2 epilogues for the cases > >>> of less than 32 iterations. > >>> size scalar 128 256 512 512e 512f > >>> 1 9.42 11.32 9.35 11.17 15.13 16.89 > >>> 2 5.72 6.53 6.66 6.66 7.62 8.56 > >>> 3 4.49 5.10 5.10 5.74 5.08 5.73 > >>> 4 4.10 4.33 4.29 5.21 3.79 4.25 > >>> 6 3.78 3.85 3.86 4.76 2.54 2.85 > >>> 8 3.64 1.89 3.76 4.50 1.92 2.16 > >>> 12 3.56 2.21 3.75 4.26 1.26 1.42 > >>> 16 3.36 0.83 1.06 4.16 0.95 1.07 > >>> 20 3.39 1.42 1.33 4.07 0.75 0.85 > >>> 24 3.23 0.66 1.72 4.22 0.62 0.70 > >>> 28 3.18 1.09 2.04 4.20 0.54 0.61 > >>> 32 3.16 0.47 0.41 0.41 0.47 0.53 > >>> 34 3.16 0.67 0.61 0.56 0.44 0.50 > >>> 38 3.19 0.95 0.95 0.82 0.40 0.45 > >>> 42 3.09 0.58 1.21 1.13 0.36 0.40 > >>> 'size' specifies the number of actual iterations, 512e is for > >>> a masked epilog and 512f for the fully masked loop. From > >>> 4 scalar iterations on the AVX512 masked epilog code is clearly > >>> the winner, the fully masked variant is clearly worse and > >>> it's size benefit is also tiny. > >> > >> Let me check I understand correctly. In the fully masked case, there is a > >> single loop in which a new mask is generated at the start of each > >> iteration. In the masked epilogue case, the main loop uses no masking > >> whatsoever, thus avoiding the need for generating a mask, carrying the > >> mask, inserting vec_merge operations, etc, and then the epilogue looks much > >> like the fully masked case, but unlike smaller mode epilogues there is no > >> loop because the eplogue vector size is the same. Is that right? > > > > Yes. > > > >> This scheme seems like it might also benefit GCN, in so much as it > >> simplifies the hot code path. > >> > >> GCN does not actually have smaller vector sizes, so there's no analogue to > >> AVX2 (we pretend we have some smaller sizes, but that's because the middle > >> end can't do masking everywhere yet, and it helps make some vector > >> constants smaller, perhaps). > >> > >>> This patch does not enable using fully masked loops or > >>> masked epilogues by default. More work on cost modeling > >>> and vectorization kind selection on x86_64 is necessary > >>> for this. > >>> Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE > >>> which could be exploited further to unify some of the flags > >>> we have right now but there didn't seem to be many easy things > >>> to merge, so I'm leaving this for followups. > >>> Mask requirements as registered by vect_record_loop_mask are kept in their > >>> original form and recorded in a hash_set now instead of being > >>> processed to a vector of rgroup_controls. Instead that's now > >>> left to the final analysis phase which tries forming the rgroup_controls > >>> vector using while_ult and if that fails now tries AVX512 style > >>> which needs a different organization and instead fills a hash_map > >>> with the relevant info. vect_get_loop_mask now has two implementations, > >>> one for the two mask styles we then have. > >>> I have decided against interweaving > >>> vect_set_loop_condition_partial_vectors > >>> with conditions to do AVX512 style masking and instead opted to > >>> "duplicate" this to vect_set_loop_condition_partial_vectors_avx512. > >>> Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512. > >>> I was split between making 'vec_loop_masks' a class with methods, > >>> possibly merging in the _len stuff into a single registry. It > >>> seemed to be too many changes for the purpose of getting AVX512 > >>> working. I'm going to play wait and see what happens with RISC-V > >>> here since they are going to get both masks and lengths registered > >>> I think. > >>> The vect_prepare_for_masked_peels hunk might run into issues with > >>> SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE > >>> looked odd. > >>> Bootstrapped and tested on x86_64-unknown-linux-gnu. I've run > >>> the testsuite with --param vect-partial-vector-usage=2 with and > >>> without -fno-vect-cost-model and filed two bugs, one ICE (PR110221) > >>> and one latent wrong-code (PR110237). > >>> There's followup work to be done to try enabling masked epilogues > >>> for x86-64 by default (when AVX512 is enabled, possibly only when > >>> -mprefer-vector-width=512). Getting cost modeling and decision > >>> right is going to be challenging. > >>> Any comments? > >>> OK? > >>> Btw, testing on GCN would be welcome - the _avx512 paths could > >>> work for it so in case the while_ult path fails (not sure if > >>> it ever does) it could get _avx512 style masking. Likewise > >>> testing on ARM just to see I didn't break anything here. > >>> I don't have SVE hardware so testing is probably meaningless. > >> > >> I can set some tests going. Is vect.exp enough? > > > > Well, only you know (from experience), but sure that?s a nice start. > > I tested vect.exp for both gcc and gfortran and there were no regressions. I > have another run going with the other param settings. > > (Side note: vect.exp used to be a nice quick test for use during development, > but the tsvc tests are now really slow, at least when run on a single GPU > thread.) > > I tried some small examples with --param vect-partial-vector-usage=1 (IIUC > this prevents masked loops, but not masked epilogues, right?) Yes. That should also work with the while_ult style btw. > and the results > look good. I plan to do some benchmarking shortly. One comment: building a > vector constant {0, 1, 2, 3, ...., 63} results in a very large entry in the > constant pool and an unnecessary memory load (it literally has to use this > sequence to generate the addresses to load the constant!) Generating the > sequence via VEC_SERIES would be a no-op, for GCN, because we have an > ABI-mandated register that already holds that value. (Perhaps I have another > piece missing here, IDK?) I failed to special-case the {0, 1, 2, 3, ... } constant because I couldn't see how to do a series that creates { 0, 0, 1, 1, 2, 2, ... }. It might be that the target needs to pattern match these constants at RTL expansion time? Btw, did you disable your while_ult pattern for the experiment? Richard. > > Andrew > -- Richard Biener SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman; HRB 36809 (AG Nuernberg) ---1609957120-2011304820-1686823110=:4723--