From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 5 May 2021 14:34:01 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: "Andre Vieira (lists)"
Cc: gcc-patches@gcc.gnu.org, Richard Sandiford
Subject: Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
In-Reply-To: <3a5de6dc-d5ec-7dda-8eb9-85ea6f77984f@arm.com>
List-Id: Gcc-patches mailing list

On Wed, 5 May 2021, Andre Vieira (lists) wrote:

> Hi Richi,
>
> So I'm trying to look at what IVOPTs does right now and how it might
> be able to help us.
> Looking at these two code examples:
>
> #include
> #if 0
> int foo(short * a, short * b, unsigned int n)
> {
>     int sum = 0;
>     for (unsigned int i = 0; i < n; ++i)
>         sum += a[i] + b[i];
>
>     return sum;
> }
>
> #else
>
> int bar (short * a, short *b, unsigned int n)
> {
>     int sum = 0;
>     unsigned int i = 0;
>     for (; i < (n / 16); i += 1)
>     {
>         // Accesses indices [0, 16, ..., (n/16 - 1) * 16]
>         // Example n = 127:
>         // accesses [0, 16, 32, 48, 64, 80, 96]
>         sum += a[i*16] + b[i*16];
>     }
>     for (size_t j = (size_t) ((n / 16) * 16); j < n; ++j)
>     {
>         // Iterates over [(n/16) * 16, n)
>         // Example n = 127:
>         // j starts at (127/16) * 16 = 7 * 16 = 112,
>         // so iterates over [112, 113, 114, 115, ..., 126]
>         sum += a[j] + b[j];
>     }
>     return sum;
> }
> #endif
>
> Compiled the bottom one (#if 0) with 'aarch64-linux-gnu' with the
> following options '-O3 -march=armv8-a -fno-tree-vectorize
> -fdump-tree-ivopts-all -fno-unroll-loops'. See godbolt link here:
> https://godbolt.org/z/MEf6j6ebM
>
> I tried to see what IVOPTs would make of this and it is able to
> analyze the IVs, but it doesn't realize (not even sure it tries) that
> one IV's end (loop 1) could be used as the base for the other (loop
> 2). I don't know if this is where you'd want such optimizations to be
> made; on one side I think it would be great as it would also help
> with non-vectorized loops, as you alluded to.

Hmm, OK.  So there's the first loop that has a looparound jump and
thus we do not always enter the 2nd loop with the first loop's final
value of the IV.  But yes, IVOPTs does not try to allocate IVs across
multiple loops.  And for a followup transform to catch this it would
need to compute the final value of the IV and then match this up with
the initial value computation.  I suppose FRE could be taught to do
this, at least for very simple cases.
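To illustrate, here is a hand-written scalar sketch (mine, not
compiler output; 'bar_merged' is a made-up name) of what matching
loop 1's final IV value up with loop 2's initial value would give for
bar - the epilogue simply continues from the pointers the main loop
ended with, instead of recomputing the base from (n / 16) * 16:

```c
#include <stddef.h>

/* Hand-written sketch of the IV merge, not compiler output.  The
   second loop reuses the pointer IVs pa/pb exactly where the first
   loop left them, so no fresh base-address computation is needed.  */
int bar_merged (short *a, short *b, unsigned int n)
{
  int sum = 0;
  short *pa = a, *pb = b;
  short *pa_end = a + (size_t) (n / 16) * 16;
  /* Main loop: same strided accesses as bar's first loop.  */
  for (; pa < pa_end; pa += 16, pb += 16)
    sum += *pa + *pb;
  /* Epilogue: pa/pb carried over as-is from the main loop.  */
  for (short *end = a + n; pa < end; ++pa, ++pb)
    sum += *pa + *pb;
  return sum;
}
```

Whether FRE or a dedicated transform gets us there, this is the shape
of code we'd like to end up with.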
> However, if you compile the top test case (#if 1) and let the
> tree-vectorizer have a go, you will see different behaviours for
> different vectorization approaches. For:
> '-O3 -march=armv8-a', using NEON and epilogue vectorization, it seems
> IVOPTs only picks up on one loop.

Yep, the "others" are likely fully peeled because they have just a
single iteration.  Again some kind-of final-value replacement "CSE"
could eventually help - but then we have jump-arounds here as well
thus we'd need final-value replacement "PRE".

> If you use '-O3 -march=armv8-a+sve --param
> vect-partial-vector-usage=1' it will detect two loops. This may well
> be because in fact epilogue vectorization 'un-loops' it, because it
> knows it will only have to do one iteration of the vectorized
> epilogue. vect-partial-vector-usage=1 could have done the same, but
> because we are dealing with polymorphic vector modes it fails to. I
> have a hack that works for vect-partial-vector-usage to avoid it, but
> I think we can probably do better and try to reason about boundaries
> in poly_int's rather than integers (TBC).
>
> Anyway, I digress. Back to the main question of this patch: how do
> you suggest I go about this? Is there a way to make IVOPTs aware of
> the 'iterate-once' IVs in the epilogue(s) (both vector and scalar!)
> and then teach it to merge IVs if one ends where the other begins?

I don't think we will make that work easily.  So indeed attacking this
in the vectorizer sounds most promising.  I'll note there's also the
issue of epilogue vectorization and reductions where we seem to not
re-use partially reduced reduction vectors but instead reduce to a
scalar in each step.  That's a related issue - we're not able to carry
forward a (reduction) IV we generated for the main vector loop to the
epilogue loops.
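In scalar form, a hand-written model of the carry-over I mean (vectors
modeled as plain arrays, widths of 8 and 4 doubles assumed to stand in
for zmm and ymm; 'foo_model' is a made-up name, not vectorizer
output):

```c
/* Scalar model of carrying the reduction IV into the vectorized
   epilogue: the 8-wide accumulator is folded to a 4-wide one instead
   of being reduced to a scalar, the 4-wide epilogue keeps
   accumulating into it, and only after the last epilogue do we
   reduce to a scalar.  */
double foo_model (const double *a, int n)
{
  double acc8[8] = { 0 }, acc4[4];
  int i = 0;
  for (; i + 8 <= n; i += 8)      /* 8-wide main loop */
    for (int l = 0; l < 8; ++l)
      acc8[l] += a[i + l];
  for (int l = 0; l < 4; ++l)     /* fold 8-wide -> 4-wide, no scalar reduction yet */
    acc4[l] = acc8[l] + acc8[l + 4];
  for (; i + 4 <= n; i += 4)      /* 4-wide vectorized epilogue */
    for (int l = 0; l < 4; ++l)
      acc4[l] += a[i + l];
  double sum = (acc4[0] + acc4[2]) + (acc4[1] + acc4[3]);
  for (; i < n; ++i)              /* scalar epilogue */
    sum += a[i];
  return sum;
}
```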
Like for

double foo (double *a, int n)
{
  double sum = 0.;
  for (int i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}

with AVX512 we get three reductions to scalars instead of a partial
reduction from zmm to ymm before the first vectorized epilogue,
followed by a reduction from ymm to xmm before the second (the
jump-arounds for the epilogues need to jump to the further reduction
piece, obviously).

So I think we want to record IVs we generate (the reduction IVs are
already nicely associated with the stmt-infos); one might consider to
refer to them from the dr_vec_info for example.  It's just going to be
"interesting" to wire everything up correctly with all the
jump-arounds we have ...

> On 04/05/2021 10:56, Richard Biener wrote:
> > On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:
> >
> >> Hi,
> >>
> >> The aim of this RFC is to explore a way of cleaning up the codegen
> >> around data_references.  To be specific, I'd like to reuse the
> >> main-loop's updated data_reference as the base_address for the
> >> epilogue's corresponding data_reference, rather than use the
> >> niters.  We have found this leads to better codegen in the
> >> vectorized epilogue loops.
> >>
> >> The approach in this RFC creates a map of iv_updates which always
> >> contains an updated pointer that is captured in
> >> vectorizable_{load,store}; an iv_update may also contain a
> >> skip_edge in case we decide the vectorization can be skipped in
> >> 'vect_do_peeling'. During the epilogue update this map of
> >> iv_updates is then checked to see if it contains an entry for a
> >> data_reference and it is used accordingly; if not, it reverts back
> >> to the old behavior of using the niters to advance the
> >> data_reference.
> >>
> >> The motivation for this work is to improve codegen for the option
> >> `--param vect-partial-vector-usage=1` for SVE.
> >> We found that one of the main problems for the codegen here was
> >> coming from unnecessary conversions caused by the way we update
> >> the data_references in the epilogue.
> >>
> >> This patch passes regression tests on aarch64-linux-gnu, but the
> >> codegen is still not optimal in some cases. Specifically those
> >> where we have a scalar epilogue, as this does not use the
> >> data_references and will rely on the gimple scalar code, thus
> >> constructing again a memory access using the niters. This is a
> >> limitation for which I haven't quite worked out a solution yet,
> >> and it does cause some minor regressions due to unfortunate
> >> spills.
> >>
> >> Let me know what you think and if you have ideas of how we can
> >> better achieve this.
> >
> > Hmm, so the patch adds a kludge to improve the kludge we have in
> > place ;)
> >
> > I think it might be interesting to create a C testcase mimicking
> > the update problem without involving the vectorizer. That way we
> > can see how the various components involved behave (FRE + ivopts
> > most specifically).
> >
> > That said, a cleaner approach to dealing with this would be to
> > explicitly track the IVs we generate for vectorized DRs, eventually
> > factoring that out from vectorizable_{store,load} so we can simply
> > carry over the actual pointer IV final value to the epilogue as
> > initial value.  For each DR group we'd create a single IV (we can
> > even do better in case we have load + store of the "same" group).
> >
> > We already kind-of track things via the ivexpr_map, but I'm not
> > sure if this lazily populated map can be reliably re-used to
> > "re-populate" the epilogue one (walk the map, create epilogue IVs
> > with the appropriate initial value & adjusted update).
> >
> > Richard.
> >
> >> Kind regards,
> >> Andre Vieira

-- 
Richard Biener
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)