From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id 550333858C2D for ; Tue, 11 Oct 2022 08:42:07 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 550333858C2D Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id DED50ED1; Tue, 11 Oct 2022 01:42:12 -0700 (PDT) Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.62]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id F0B263F766; Tue, 11 Oct 2022 01:42:05 -0700 (PDT) From: Richard Sandiford To: Richard Biener Mail-Followup-To: Richard Biener ,Andrew Stubbs , gcc-patches@gcc.gnu.org, juzhe.zhong@rivai.ai, richard.sandiford@arm.com Cc: Andrew Stubbs , gcc-patches@gcc.gnu.org, juzhe.zhong@rivai.ai Subject: Re: [PATCH][RFT] Vectorization of first-order recurrences References: <20221010110339.E9E2513479@imap2.suse-dmz.suse.de> <49b46c57-70b6-d9c0-a267-5e2f8315382b@codesourcery.com> Date: Tue, 11 Oct 2022 09:42:04 +0100 In-Reply-To: (Richard Biener's message of "Tue, 11 Oct 2022 07:01:06 +0000 (UTC)") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Status: No, score=-38.6 required=5.0 tests=BAYES_00,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,RCVD_IN_DNSWL_LOW,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Richard Biener writes: > On Mon, 10 Oct 2022, Andrew Stubbs wrote: >> On 10/10/2022 12:03, Richard Biener wrote: >> > The following picks up the prototype by Ju-Zhe Zhong for vectorizing >> > first order recurrences. That solves two TSVC missed optimization PRs. >> > >> > There's a new scalar cycle def kind, vect_first_order_recurrence >> > and it's handling of the backedge value vectorization is complicated >> > by the fact that the vectorized value isn't the PHI but instead >> > a (series of) permute(s) shifting in the recurring value from the >> > previous iteration. I've implemented this by creating both the >> > single vectorized PHI and the series of permutes when vectorizing >> > the scalar PHI but leave the backedge values in both unassigned. >> > The backedge values are (for the testcases) computed by a load >> > which is also the place after which the permutes are inserted. >> > That placement also restricts the cases we can handle (without >> > resorting to code motion). >> > >> > I added both costing and SLP handling though SLP handling is >> > restricted to the case where a single vectorized PHI is enough. >> > >> > Missing is epilogue handling - while prologue peeling would >> > be handled transparently by adjusting iv_phi_p the epilogue >> > case doesn't work with just inserting a scalar LC PHI since >> > that a) keeps the scalar load live and b) that loads is the >> > wrong one, it has to be the last, much like when we'd vectorize >> > the LC PHI as live operation. Unfortunately LIVE >> > compute/analysis happens too early before we decide on >> > peeling. When using fully masked loop vectorization the >> > vect-recurr-6.c works as expected though. >> > >> > I have tested this on x86_64 for now, but since epilogue >> > handling is missing there's probably no practical cases. >> > My prototype WHILE_ULT AVX512 patch can handle vect-recurr-6.c >> > just fine but I didn't feel like running SPEC within SDE nor >> > is the WHILE_ULT patch complete enough. Builds of SPEC 2k7 >> > with fully masked loops succeed (minus three cases of >> > PR107096, caused by my WHILE_ULT prototype). >> > >> > Bootstrapped and tested on x86_64-unknown-linux-gnu. >> > >> > Testing with SVE, GCN or RVV appreciated, ideas how to cleanly >> > handle epilogues welcome. >> >> The testcases all produce correct code on GCN and pass the execution tests. >> >> The code isn't terribly optimal because we don't have a two-input permutation >> instruction, so we permute each half separately and vec_merge the results. In >> this case the first vector is always a no-op permutation so that's wasted >> cycles. We'd really want a vector rotate and write-lane (or the other way >> around). I think the special-case permutations can be recognised and coded >> into the backend, but I don't know if we can easily tell that the first vector >> is just a bunch of duplicates, when it's not constant. > > It's not actually a bunch of duplicates in all but the first iteration. > But what you can recognize is that we're only using lane N - 1 of the > first vector, so you could model the permute as extract last > + shift in scalar (the extracted lane). IIRC VLA vector targets usually > have something like shift the vector and set the low lane from a > scalar? Yeah. > The extract lane N - 1 might be more difficult but then > a rotate plus extracting lane 0 might work as well. I guess for SVE we should probably use SPLICE, which joins two vectors and uses a predicate to select the first element that should be extracted. Unfortunately we don't have a way of representing "last bit set, all other bits clear" as a constant though, so I guess it'll have to be hidden behind unspecs. I meant to start SVE tests running once I'd finished for the day yesterday, but forgot, sorry. Will try to test today. On the patch: + /* This is the second phase of vectorizing first-order rececurrences. An + overview of the transformation is described below. Suppose we have the + following loop. + + int32_t t = 0; + for (int i = 0; i < n; ++i) + { + b[i] = a[i] - t; + t = a[i]; + } + + There is a first-order recurrence on "a". For this loop, the shorthand + scalar IR looks like: + + scalar.preheader: + init = a[-1] + br loop.body + + scalar.body: + i = PHI <0(scalar.preheader), i+1(scalar.body)> + _2 = PHI <(init(scalar.preheader), <_1(scalar.body)> + _1 = a[i] + b[i] = _1 - _2 + br cond, scalar.body, ... + + In this example, _2 is a recurrence because it's value depends on the + previous iteration. In the first phase of vectorization, we created a + temporary value for _2. We now complete the vectorization and produce the + shorthand vector IR shown below (VF = 4). + + vector.preheader: + vect_init = vect_cst(..., ..., ..., a[-1]) + br vector.body + + vector.body + i = PHI <0(vector.preheader), i+4(vector.body)> + vect_1 = PHI + vect_2 = a[i, i+1, i+2, i+3]; + vect_3 = vector(vect_1(3), vect_2(0, 1, 2)) + b[i, i+1, i+2, i+3] = vect_2 - vect_3 + br cond, vector.body, middle.block + + middle.block: + x = vect_2(3) + br scalar.preheader + + scalar.ph: + s_init = PHI + br scalar.body + + After execution completes the vector loop, we extract the next value of + the recurrence (x) to use as the initial value in the scalar loop. */ Looks like a[-1] should be zero in the example (or t should be initialised to a[-1]). Thanks, Richard