From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.220.29]) by sourceware.org (Postfix) with ESMTPS id BA5933858C83 for ; Fri, 14 Oct 2022 09:46:06 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org BA5933858C83 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=suse.de Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de Received: from relay2.suse.de (relay2.suse.de [149.44.160.134]) by smtp-out2.suse.de (Postfix) with ESMTP id 752711F88E; Fri, 14 Oct 2022 09:46:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1665740765; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=nIsnLDc3bEYWY0n0WJz+IkU1insiR2eaJw5xtidXm2c=; b=tNEOxPMqKdLwuZl1HuO2SC0BZgYai+OWPS1dIzFcw/PWG/zU90tC9FRfMYkIC4wQZBKlOp RRlGBPcJawvVtTntn1xcrsL+t5EJiILiZejb2sYYWaIX3Nc8+k88hpv8KZx7A9iH2WNhfN tGe5SCfYqmZHbIv/I93cVmmV78iDeQw= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1665740765; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=nIsnLDc3bEYWY0n0WJz+IkU1insiR2eaJw5xtidXm2c=; b=Y6Kj2JAsWx5QoFCPQ6ssrKI6YQQoBTGyN3Bt1PDnRWqoSH7JV067bE2Sf/8awglgAZLnNN Jdcw5Fvpyuxn2WBw== Received: from wotan.suse.de (wotan.suse.de [10.160.0.1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by relay2.suse.de (Postfix) with ESMTPS id 1F5D42C141; Fri, 14 Oct 2022 09:46:04 +0000 (UTC) Date: Fri, 14 Oct 2022 09:46:04 +0000 (UTC) From: Richard Biener To: Andrew Stubbs cc: Richard Sandiford , gcc-patches@gcc.gnu.org, juzhe.zhong@rivai.ai, Martin Liska Subject: Re: [PATCH][RFT] Vectorization of first-order recurrences In-Reply-To: Message-ID: References: <20221010110339.E9E2513479@imap2.suse-dmz.suse.de> <49b46c57-70b6-d9c0-a267-5e2f8315382b@codesourcery.com> <271ron87-q0o2-5pr5-8s65-1682p750o0@fhfr.qr> User-Agent: Alpine 2.22 (LSU 394 2020-01-19) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII X-Spam-Status: No, score=-5.1 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Fri, 14 Oct 2022, Andrew Stubbs wrote: > On 14/10/2022 08:07, Richard Biener wrote: > > On Tue, 11 Oct 2022, Richard Sandiford wrote: > > > >> Richard Biener writes: > >>> On Mon, 10 Oct 2022, Andrew Stubbs wrote: > >>>> On 10/10/2022 12:03, Richard Biener wrote: > >>>>> The following picks up the prototype by Ju-Zhe Zhong for vectorizing > >>>>> first order recurrences. That solves two TSVC missed optimization PRs. > >>>>> > >>>>> There's a new scalar cycle def kind, vect_first_order_recurrence > >>>>> and it's handling of the backedge value vectorization is complicated > >>>>> by the fact that the vectorized value isn't the PHI but instead > >>>>> a (series of) permute(s) shifting in the recurring value from the > >>>>> previous iteration. I've implemented this by creating both the > >>>>> single vectorized PHI and the series of permutes when vectorizing > >>>>> the scalar PHI but leave the backedge values in both unassigned. > >>>>> The backedge values are (for the testcases) computed by a load > >>>>> which is also the place after which the permutes are inserted. > >>>>> That placement also restricts the cases we can handle (without > >>>>> resorting to code motion). > >>>>> > >>>>> I added both costing and SLP handling though SLP handling is > >>>>> restricted to the case where a single vectorized PHI is enough. > >>>>> > >>>>> Missing is epilogue handling - while prologue peeling would > >>>>> be handled transparently by adjusting iv_phi_p the epilogue > >>>>> case doesn't work with just inserting a scalar LC PHI since > >>>>> that a) keeps the scalar load live and b) that loads is the > >>>>> wrong one, it has to be the last, much like when we'd vectorize > >>>>> the LC PHI as live operation. Unfortunately LIVE > >>>>> compute/analysis happens too early before we decide on > >>>>> peeling. When using fully masked loop vectorization the > >>>>> vect-recurr-6.c works as expected though. > >>>>> > >>>>> I have tested this on x86_64 for now, but since epilogue > >>>>> handling is missing there's probably no practical cases. > >>>>> My prototype WHILE_ULT AVX512 patch can handle vect-recurr-6.c > >>>>> just fine but I didn't feel like running SPEC within SDE nor > >>>>> is the WHILE_ULT patch complete enough. Builds of SPEC 2k7 > >>>>> with fully masked loops succeed (minus three cases of > >>>>> PR107096, caused by my WHILE_ULT prototype). > >>>>> > >>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu. > >>>>> > >>>>> Testing with SVE, GCN or RVV appreciated, ideas how to cleanly > >>>>> handle epilogues welcome. > >>>> > >>>> The testcases all produce correct code on GCN and pass the execution > >>>> tests. > >>>> > >>>> The code isn't terribly optimal because we don't have a two-input > >>>> permutation > >>>> instruction, so we permute each half separately and vec_merge the > >>>> results. In > >>>> this case the first vector is always a no-op permutation so that's wasted > >>>> cycles. We'd really want a vector rotate and write-lane (or the other way > >>>> around). I think the special-case permutations can be recognised and > >>>> coded > >>>> into the backend, but I don't know if we can easily tell that the first > >>>> vector > >>>> is just a bunch of duplicates, when it's not constant. > >>> > >>> It's not actually a bunch of duplicates in all but the first iteration. > >>> But what you can recognize is that we're only using lane N - 1 of the > >>> first vector, so you could model the permute as extract last > >>> + shift in scalar (the extracted lane). IIRC VLA vector targets usually > >>> have something like shift the vector and set the low lane from a > >>> scalar? > >> > >> Yeah. > >> > >>> The extract lane N - 1 might be more difficult but then > >>> a rotate plus extracting lane 0 might work as well. > >> > >> I guess for SVE we should probably use SPLICE, which joins two vectors > >> and uses a predicate to select the first element that should be extracted. > >> > >> Unfortunately we don't have a way of representing "last bit set, all other > >> bits clear" as a constant though, so I guess it'll have to be hidden > >> behind unspecs. > >> > >> I meant to start SVE tests running once I'd finished for the day yesterday, > >> but forgot, sorry. Will try to test today. > >> > >> On the patch: > >> > >> + /* This is the second phase of vectorizing first-order rececurrences. An > >> + overview of the transformation is described below. Suppose we have > >> the > >> + following loop. > >> + > >> + int32_t t = 0; > >> + for (int i = 0; i < n; ++i) > >> + { > >> + b[i] = a[i] - t; > >> + t = a[i]; > >> + } > >> + > >> + There is a first-order recurrence on "a". For this loop, the shorthand > >> + scalar IR looks like: > >> + > >> + scalar.preheader: > >> + init = a[-1] > >> + br loop.body > >> + > >> + scalar.body: > >> + i = PHI <0(scalar.preheader), i+1(scalar.body)> > >> + _2 = PHI <(init(scalar.preheader), <_1(scalar.body)> > >> + _1 = a[i] > >> + b[i] = _1 - _2 > >> + br cond, scalar.body, ... > >> + > >> + In this example, _2 is a recurrence because it's value depends on the > >> + previous iteration. In the first phase of vectorization, we created a > >> + temporary value for _2. We now complete the vectorization and produce > >> the > >> + shorthand vector IR shown below (VF = 4). > >> + > >> + vector.preheader: > >> + vect_init = vect_cst(..., ..., ..., a[-1]) > >> + br vector.body > >> + > >> + vector.body > >> + i = PHI <0(vector.preheader), i+4(vector.body)> > >> + vect_1 = PHI > >> + vect_2 = a[i, i+1, i+2, i+3]; > >> + vect_3 = vector(vect_1(3), vect_2(0, 1, 2)) > >> + b[i, i+1, i+2, i+3] = vect_2 - vect_3 > >> + br cond, vector.body, middle.block > >> + > >> + middle.block: > >> + x = vect_2(3) > >> + br scalar.preheader > >> + > >> + scalar.ph: > >> + s_init = PHI > >> + br scalar.body > >> + > >> + After execution completes the vector loop, we extract the next value > >> of > >> + the recurrence (x) to use as the initial value in the scalar loop. */ > >> > >> Looks like a[-1] should be zero in the example (or t should be initialised > >> to a[-1]). > > > > Indeed. I've promoted the comment to function level and adjusted it > > to reflect what is done now (I've missed to update it from Ju-Zhes > > prototype). I have also applied your correctness fix. > > > > Re-bootstrap and regtest running on x86_64-unknown-linux-gnu. > > > > OK (given limited x86 test quality)? > > I've tested it on amdgcn and the vect-recurr tests all pass. > > However the tsvc tests all fail because there's no definition of "struct > timeval" without an "include ". If I add that then they pass too. > (Curiously, the timeval variables don't appear to be used anywhere, which is > lucky because amdgcn doesn't have a working gettimeofday function.) Ah, that's probably from tsvc.h which is included by all tests which uses struct timeval but doesn't include any header defining it. It also looks like the members are unused. Martin - can you have a look into that issue? Thanks, Richard.