From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <richard.sandiford@arm.com>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by sourceware.org (Postfix) with ESMTP id 550333858C2D
	for <gcc-patches@gcc.gnu.org>; Tue, 11 Oct 2022 08:42:07 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 550333858C2D
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id DED50ED1;
	Tue, 11 Oct 2022 01:42:12 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.62])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id F0B263F766;
	Tue, 11 Oct 2022 01:42:05 -0700 (PDT)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Richard Biener <rguenther@suse.de>
Mail-Followup-To: Richard Biener <rguenther@suse.de>,Andrew Stubbs <ams@codesourcery.com>,  gcc-patches@gcc.gnu.org,  juzhe.zhong@rivai.ai, richard.sandiford@arm.com
Cc: Andrew Stubbs <ams@codesourcery.com>,  gcc-patches@gcc.gnu.org,  juzhe.zhong@rivai.ai
Subject: Re: [PATCH][RFT] Vectorization of first-order recurrences
References: <20221010110339.E9E2513479@imap2.suse-dmz.suse.de>
	<49b46c57-70b6-d9c0-a267-5e2f8315382b@codesourcery.com>
	<nycvar.YFH.7.77.849.2210110658110.18337@jbgna.fhfr.qr>
Date: Tue, 11 Oct 2022 09:42:04 +0100
In-Reply-To: <nycvar.YFH.7.77.849.2210110658110.18337@jbgna.fhfr.qr> (Richard
	Biener's message of "Tue, 11 Oct 2022 07:01:06 +0000 (UTC)")
Message-ID: <mptsfjug3ab.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Status: No, score=-38.6 required=5.0 tests=BAYES_00,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,RCVD_IN_DNSWL_LOW,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Richard Biener <rguenther@suse.de> writes:
> On Mon, 10 Oct 2022, Andrew Stubbs wrote:
>> On 10/10/2022 12:03, Richard Biener wrote:
>> > The following picks up the prototype by Ju-Zhe Zhong for vectorizing
>> > first order recurrences.  That solves two TSVC missed optimization PRs.
>> > 
>> > There's a new scalar cycle def kind, vect_first_order_recurrence
>> > and it's handling of the backedge value vectorization is complicated
>> > by the fact that the vectorized value isn't the PHI but instead
>> > a (series of) permute(s) shifting in the recurring value from the
>> > previous iteration.  I've implemented this by creating both the
>> > single vectorized PHI and the series of permutes when vectorizing
>> > the scalar PHI but leave the backedge values in both unassigned.
>> > The backedge values are (for the testcases) computed by a load
>> > which is also the place after which the permutes are inserted.
>> > That placement also restricts the cases we can handle (without
>> > resorting to code motion).
>> > 
>> > I added both costing and SLP handling though SLP handling is
>> > restricted to the case where a single vectorized PHI is enough.
>> > 
>> > Missing is epilogue handling - while prologue peeling would
>> > be handled transparently by adjusting iv_phi_p the epilogue
>> > case doesn't work with just inserting a scalar LC PHI since
>> > that a) keeps the scalar load live and b) that loads is the
>> > wrong one, it has to be the last, much like when we'd vectorize
>> > the LC PHI as live operation.  Unfortunately LIVE
>> > compute/analysis happens too early before we decide on
>> > peeling.  When using fully masked loop vectorization the
>> > vect-recurr-6.c works as expected though.
>> > 
>> > I have tested this on x86_64 for now, but since epilogue
>> > handling is missing there's probably no practical cases.
>> > My prototype WHILE_ULT AVX512 patch can handle vect-recurr-6.c
>> > just fine but I didn't feel like running SPEC within SDE nor
>> > is the WHILE_ULT patch complete enough.  Builds of SPEC 2k7
>> > with fully masked loops succeed (minus three cases of
>> > PR107096, caused by my WHILE_ULT prototype).
>> > 
>> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
>> > 
>> > Testing with SVE, GCN or RVV appreciated, ideas how to cleanly
>> > handle epilogues welcome.
>> 
>> The testcases all produce correct code on GCN and pass the execution tests.
>> 
>> The code isn't terribly optimal because we don't have a two-input permutation
>> instruction, so we permute each half separately and vec_merge the results. In
>> this case the first vector is always a no-op permutation so that's wasted
>> cycles. We'd really want a vector rotate and write-lane (or the other way
>> around). I think the special-case permutations can be recognised and coded
>> into the backend, but I don't know if we can easily tell that the first vector
>> is just a bunch of duplicates, when it's not constant.
>
> It's not actually a bunch of duplicates in all but the first iteration.
> But what you can recognize is that we're only using lane N - 1 of the
> first vector, so you could model the permute as extract last
> + shift in scalar (the extracted lane).  IIRC VLA vector targets usually
> have something like shift the vector and set the low lane from a
> scalar?

Yeah.

> The extract lane N - 1 might be more difficult but then
> a rotate plus extracting lane 0 might work as well.

I guess for SVE we should probably use SPLICE, which joins two vectors
and uses a predicate to select the first element that should be extracted.

Unfortunately we don't have a way of representing "last bit set, all other
bits clear" as a constant though, so I guess it'll have to be hidden
behind unspecs.

I meant to start SVE tests running once I'd finished for the day yesterday,
but forgot, sorry.  Will try to test today.

On the patch:

+  /* This is the second phase of vectorizing first-order rececurrences. An
+     overview of the transformation is described below. Suppose we have the
+     following loop.
+
+     int32_t t = 0;
+     for (int i = 0; i < n; ++i)
+       {
+	b[i] = a[i] - t;
+	t = a[i];
+      }
+
+    There is a first-order recurrence on "a". For this loop, the shorthand
+    scalar IR looks like:
+
+    scalar.preheader:
+      init = a[-1]
+      br loop.body
+
+    scalar.body:
+      i = PHI <0(scalar.preheader), i+1(scalar.body)>
+      _2 = PHI <(init(scalar.preheader), <_1(scalar.body)>
+      _1 = a[i]
+      b[i] = _1 - _2
+      br cond, scalar.body, ...
+
+    In this example, _2 is a recurrence because it's value depends on the
+    previous iteration. In the first phase of vectorization, we created a
+    temporary value for _2. We now complete the vectorization and produce the
+    shorthand vector IR shown below (VF = 4).
+
+    vector.preheader:
+      vect_init = vect_cst(..., ..., ..., a[-1])
+      br vector.body
+
+    vector.body
+      i = PHI <0(vector.preheader), i+4(vector.body)>
+      vect_1 = PHI <vect_init(vector.preheader), v2(vector.body)>
+      vect_2 = a[i, i+1, i+2, i+3];
+      vect_3 = vector(vect_1(3), vect_2(0, 1, 2))
+      b[i, i+1, i+2, i+3] = vect_2 - vect_3
+      br cond, vector.body, middle.block
+
+    middle.block:
+      x = vect_2(3)
+      br scalar.preheader
+
+    scalar.ph:
+      s_init = PHI <x(middle.block), a[-1], otherwise>
+      br scalar.body
+
+    After execution completes the vector loop, we extract the next value of
+    the recurrence (x) to use as the initial value in the scalar loop.  */

Looks like a[-1] should be zero in the example (or t should be initialised
to a[-1]).

Thanks,
Richard