From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id C828839888AA for ; Wed, 5 May 2021 11:34:20 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org C828839888AA Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 627C0ED1; Wed, 5 May 2021 04:34:20 -0700 (PDT) Received: from [10.57.1.74] (unknown [10.57.1.74]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id CD5A23F70D; Wed, 5 May 2021 04:34:19 -0700 (PDT) Subject: Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization To: Richard Biener Cc: "gcc-patches@gcc.gnu.org" , Richard Sandiford References: From: "Andre Vieira (lists)" Message-ID: <3a5de6dc-d5ec-7dda-8eb9-85ea6f77984f@arm.com> Date: Wed, 5 May 2021 12:34:18 +0100 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.9.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Content-Language: en-US X-Spam-Status: No, score=-5.7 required=5.0 tests=BAYES_00, BODY_8BITS, KAM_DMARC_STATUS, NICE_REPLY_A, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2 X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 05 May 2021 11:34:22 -0000 Hi Richi, So I'm trying to look at what IVOPTs does right now and how it might be able to help us. Looking at these two code examples: #include #if 0 int foo(short * a, short * b, unsigned int n) {     int sum = 0;     for (unsigned int i = 0; i < n; ++i)         sum += a[i] + b[i];     return sum; } #else int bar (short * a, short *b, unsigned int n) {     int sum = 0;     unsigned int i = 0;     for (; i < (n / 16); i += 1)     {         // Iterates [0, 16, .., (n/16 * 16) * 16]         // Example n = 127,         // iterates [0, 16, 32, 48, 64, 80, 96, 112]         sum += a[i*16] + b[i*16];     }     for (size_t j =  (size_t) ((n / 16) * 16); j < n; ++j)     {         // Iterates [(n/16 * 16) * 16 , (((n/16 * 16) + 1) * 16)... ,n*16]         // Example n = 127,         // j starts at (127/16) * 16 = 7 * 16 = 112,         // So iterates over [112, 113, 114, 115, ..., 127]         sum += a[j] + b[j];     }     return sum; } #endif Compiled the bottom one (#if 0) with 'aarch64-linux-gnu' with the following options '-O3 -march=armv8-a -fno-tree-vectorize -fdump-tree-ivopts-all -fno-unroll-loops'. See godbolt link here: https://godbolt.org/z/MEf6j6ebM I tried to see what IVOPTs would make of this and it is able to analyze the IVs but it doesn't realize (not even sure it tries) that one IV's end (loop 1) could be used as the base for the other (loop 2). I don't know if this is where you'd want such optimizations to be made, on one side I think it would be great as it would also help with non-vectorized loops as you allured to. However, if you compile the top test case (#if 1) and let the tree-vectorizer have a go you will see different behaviours for different vectorization approaches, so for: '-O3 -march=armv8-a', using NEON and epilogue vectorization it seems IVOPTs only picks up on one loop. If you use '-O3 -march=armv8-a+sve --param vect-partial-vector-usage=1' it will detect two loops. This may well be because in fact epilogue vectorization 'un-loops' it because it knows it will only have to do one iteration of the vectorized epilogue. vect-partial-vector-usage=1 could have done the same, but because we are dealing with polymorphic vector modes it fails to, I have a hack that works for vect-partial-vector-usage to avoid it, but I think we can probably do better and try to reason about boundaries in poly_int's rather than integers (TBC). Anyway I diverge. Back to the main question of this patch. How do you suggest I go about this? Is there a way to make IVOPTS aware of the 'iterate-once' IVs in the epilogue(s) (both vector and scalar!) and then teach it to merge IV's if one ends where the other begins? On 04/05/2021 10:56, Richard Biener wrote: > On Fri, 30 Apr 2021, Andre Vieira (lists) wrote: > >> Hi, >> >> The aim of this RFC is to explore a way of cleaning up the codegen around >> data_references.  To be specific, I'd like to reuse the main-loop's updated >> data_reference as the base_address for the epilogue's corresponding >> data_reference, rather than use the niters.  We have found this leads to >> better codegen in the vectorized epilogue loops. >> >> The approach in this RFC creates a map if iv_updates which always contain an >> updated pointer that is caputed in vectorizable_{load,store}, an iv_update may >> also contain a skip_edge in case we decide the vectorization can be skipped in >> 'vect_do_peeling'. During the epilogue update this map of iv_updates is then >> checked to see if it contains an entry for a data_reference and it is used >> accordingly and if not it reverts back to the old behavior of using the niters >> to advance the data_reference. >> >> The motivation for this work is to improve codegen for the option `--param >> vect-partial-vector-usage=1` for SVE. We found that one of the main problems >> for the codegen here was coming from unnecessary conversions caused by the way >> we update the data_references in the epilogue. >> >> This patch passes regression tests in aarch64-linux-gnu, but the codegen is >> still not optimal in some cases. Specifically those where we have a scalar >> epilogue, as this does not use the data_reference's and will rely on the >> gimple scalar code, thus constructing again a memory access using the niters. >> This is a limitation for which I haven't quite worked out a solution yet and >> does cause some minor regressions due to unfortunate spills. >> >> Let me know what you think and if you have ideas of how we can better achieve >> this. > Hmm, so the patch adds a kludge to improve the kludge we have in place ;) > > I think it might be interesting to create a C testcase mimicing the > update problem without involving the vectorizer. That way we can > see how the various components involved behave (FRE + ivopts most > specifically). > > That said, a cleaner approach to dealing with this would be to > explicitely track the IVs we generate for vectorized DRs, eventually > factoring that out from vectorizable_{store,load} so we can simply > carry over the actual pointer IV final value to the epilogue as > initial value. For each DR group we'd create a single IV (we can > even do better in case we have load + store of the "same" group). > > We already kind-of track things via the ivexpr_map, but I'm not sure > if this lazly populated map can be reliably re-used to "re-populate" > the epilogue one (walk the map, create epilogue IVs with the appropriate > initial value & adjustd upate). > > Richard. > >> Kind regards, >> Andre Vieira >> >> >>