From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from esa1.mentor.iphmx.com (esa1.mentor.iphmx.com [68.232.129.153]) by sourceware.org (Postfix) with ESMTPS id BC2D73858C83 for ; Fri, 14 Oct 2022 09:43:03 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org BC2D73858C83 Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=codesourcery.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=mentor.com X-IronPort-AV: E=Sophos;i="5.95,184,1661846400"; d="scan'208";a="87514212" Received: from orw-gwy-01-in.mentorg.com ([192.94.38.165]) by esa1.mentor.iphmx.com with ESMTP; 14 Oct 2022 01:43:02 -0800 IronPort-SDR: hXhZxsz4UJZledWSbhzYdBRHNo3aiAoe1oyGJcK0LkGDE6Y6M/hiUMoAmYF89ah39cRacfLfIQ slcAdixqoB61fMf0NbOZOGm/EL6RnbK0B+2phzIwGAEGZTHV6gODK6rWVTjnSzpvjHYu2H2FWx gGUFeVIh3Xt7UAeKAUqY+KtPivyITf4Red6+0w40em90+ktNRlb3hZUs+4GWmfE1pFfi4ZRdov EIkfsi+ziWuxVNAGgmwkfV2MJM0wkp3S0PdcuXPaohLw+iiprRBz7uTUZHF87M15WfQO3K3DLO hxo= Message-ID: Date: Fri, 14 Oct 2022 10:42:58 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.3.2 Subject: Re: [PATCH][RFT] Vectorization of first-order recurrences Content-Language: en-GB To: Richard Biener , Richard Sandiford CC: , References: <20221010110339.E9E2513479@imap2.suse-dmz.suse.de> <49b46c57-70b6-d9c0-a267-5e2f8315382b@codesourcery.com> <271ron87-q0o2-5pr5-8s65-1682p750o0@fhfr.qr> From: Andrew Stubbs In-Reply-To: <271ron87-q0o2-5pr5-8s65-1682p750o0@fhfr.qr> Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [137.202.0.90] X-ClientProxiedBy: svr-ies-mbx-12.mgc.mentorg.com (139.181.222.12) To svr-ies-mbx-11.mgc.mentorg.com (139.181.222.11) X-Spam-Status: No, score=-8.1 required=5.0 tests=BAYES_00,HEADER_FROM_DIFFERENT_DOMAINS,KAM_DMARC_STATUS,NICE_REPLY_A,SPF_HELO_PASS,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On 14/10/2022 08:07, Richard Biener wrote: > On Tue, 11 Oct 2022, Richard Sandiford wrote: > >> Richard Biener writes: >>> On Mon, 10 Oct 2022, Andrew Stubbs wrote: >>>> On 10/10/2022 12:03, Richard Biener wrote: >>>>> The following picks up the prototype by Ju-Zhe Zhong for vectorizing >>>>> first order recurrences. That solves two TSVC missed optimization PRs. >>>>> >>>>> There's a new scalar cycle def kind, vect_first_order_recurrence >>>>> and it's handling of the backedge value vectorization is complicated >>>>> by the fact that the vectorized value isn't the PHI but instead >>>>> a (series of) permute(s) shifting in the recurring value from the >>>>> previous iteration. I've implemented this by creating both the >>>>> single vectorized PHI and the series of permutes when vectorizing >>>>> the scalar PHI but leave the backedge values in both unassigned. >>>>> The backedge values are (for the testcases) computed by a load >>>>> which is also the place after which the permutes are inserted. >>>>> That placement also restricts the cases we can handle (without >>>>> resorting to code motion). >>>>> >>>>> I added both costing and SLP handling though SLP handling is >>>>> restricted to the case where a single vectorized PHI is enough. >>>>> >>>>> Missing is epilogue handling - while prologue peeling would >>>>> be handled transparently by adjusting iv_phi_p the epilogue >>>>> case doesn't work with just inserting a scalar LC PHI since >>>>> that a) keeps the scalar load live and b) that loads is the >>>>> wrong one, it has to be the last, much like when we'd vectorize >>>>> the LC PHI as live operation. Unfortunately LIVE >>>>> compute/analysis happens too early before we decide on >>>>> peeling. When using fully masked loop vectorization the >>>>> vect-recurr-6.c works as expected though. >>>>> >>>>> I have tested this on x86_64 for now, but since epilogue >>>>> handling is missing there's probably no practical cases. >>>>> My prototype WHILE_ULT AVX512 patch can handle vect-recurr-6.c >>>>> just fine but I didn't feel like running SPEC within SDE nor >>>>> is the WHILE_ULT patch complete enough. Builds of SPEC 2k7 >>>>> with fully masked loops succeed (minus three cases of >>>>> PR107096, caused by my WHILE_ULT prototype). >>>>> >>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu. >>>>> >>>>> Testing with SVE, GCN or RVV appreciated, ideas how to cleanly >>>>> handle epilogues welcome. >>>> >>>> The testcases all produce correct code on GCN and pass the execution tests. >>>> >>>> The code isn't terribly optimal because we don't have a two-input permutation >>>> instruction, so we permute each half separately and vec_merge the results. In >>>> this case the first vector is always a no-op permutation so that's wasted >>>> cycles. We'd really want a vector rotate and write-lane (or the other way >>>> around). I think the special-case permutations can be recognised and coded >>>> into the backend, but I don't know if we can easily tell that the first vector >>>> is just a bunch of duplicates, when it's not constant. >>> >>> It's not actually a bunch of duplicates in all but the first iteration. >>> But what you can recognize is that we're only using lane N - 1 of the >>> first vector, so you could model the permute as extract last >>> + shift in scalar (the extracted lane). IIRC VLA vector targets usually >>> have something like shift the vector and set the low lane from a >>> scalar? >> >> Yeah. >> >>> The extract lane N - 1 might be more difficult but then >>> a rotate plus extracting lane 0 might work as well. >> >> I guess for SVE we should probably use SPLICE, which joins two vectors >> and uses a predicate to select the first element that should be extracted. >> >> Unfortunately we don't have a way of representing "last bit set, all other >> bits clear" as a constant though, so I guess it'll have to be hidden >> behind unspecs. >> >> I meant to start SVE tests running once I'd finished for the day yesterday, >> but forgot, sorry. Will try to test today. >> >> On the patch: >> >> + /* This is the second phase of vectorizing first-order rececurrences. An >> + overview of the transformation is described below. Suppose we have the >> + following loop. >> + >> + int32_t t = 0; >> + for (int i = 0; i < n; ++i) >> + { >> + b[i] = a[i] - t; >> + t = a[i]; >> + } >> + >> + There is a first-order recurrence on "a". For this loop, the shorthand >> + scalar IR looks like: >> + >> + scalar.preheader: >> + init = a[-1] >> + br loop.body >> + >> + scalar.body: >> + i = PHI <0(scalar.preheader), i+1(scalar.body)> >> + _2 = PHI <(init(scalar.preheader), <_1(scalar.body)> >> + _1 = a[i] >> + b[i] = _1 - _2 >> + br cond, scalar.body, ... >> + >> + In this example, _2 is a recurrence because it's value depends on the >> + previous iteration. In the first phase of vectorization, we created a >> + temporary value for _2. We now complete the vectorization and produce the >> + shorthand vector IR shown below (VF = 4). >> + >> + vector.preheader: >> + vect_init = vect_cst(..., ..., ..., a[-1]) >> + br vector.body >> + >> + vector.body >> + i = PHI <0(vector.preheader), i+4(vector.body)> >> + vect_1 = PHI >> + vect_2 = a[i, i+1, i+2, i+3]; >> + vect_3 = vector(vect_1(3), vect_2(0, 1, 2)) >> + b[i, i+1, i+2, i+3] = vect_2 - vect_3 >> + br cond, vector.body, middle.block >> + >> + middle.block: >> + x = vect_2(3) >> + br scalar.preheader >> + >> + scalar.ph: >> + s_init = PHI >> + br scalar.body >> + >> + After execution completes the vector loop, we extract the next value of >> + the recurrence (x) to use as the initial value in the scalar loop. */ >> >> Looks like a[-1] should be zero in the example (or t should be initialised >> to a[-1]). > > Indeed. I've promoted the comment to function level and adjusted it > to reflect what is done now (I've missed to update it from Ju-Zhes > prototype). I have also applied your correctness fix. > > Re-bootstrap and regtest running on x86_64-unknown-linux-gnu. > > OK (given limited x86 test quality)? I've tested it on amdgcn and the vect-recurr tests all pass. However the tsvc tests all fail because there's no definition of "struct timeval" without an "include ". If I add that then they pass too. (Curiously, the timeval variables don't appear to be used anywhere, which is lucky because amdgcn doesn't have a working gettimeofday function.) I can't claim any insight into the patch itself. Andrew