From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <Andrew_Stubbs@mentor.com>
Received: from esa1.mentor.iphmx.com (esa1.mentor.iphmx.com [68.232.129.153])
	by sourceware.org (Postfix) with ESMTPS id BC2D73858C83
	for <gcc-patches@gcc.gnu.org>; Fri, 14 Oct 2022 09:43:03 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org BC2D73858C83
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=codesourcery.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=mentor.com
X-IronPort-AV: E=Sophos;i="5.95,184,1661846400"; 
   d="scan'208";a="87514212"
Received: from orw-gwy-01-in.mentorg.com ([192.94.38.165])
  by esa1.mentor.iphmx.com with ESMTP; 14 Oct 2022 01:43:02 -0800
IronPort-SDR: hXhZxsz4UJZledWSbhzYdBRHNo3aiAoe1oyGJcK0LkGDE6Y6M/hiUMoAmYF89ah39cRacfLfIQ
 slcAdixqoB61fMf0NbOZOGm/EL6RnbK0B+2phzIwGAEGZTHV6gODK6rWVTjnSzpvjHYu2H2FWx
 gGUFeVIh3Xt7UAeKAUqY+KtPivyITf4Red6+0w40em90+ktNRlb3hZUs+4GWmfE1pFfi4ZRdov
 EIkfsi+ziWuxVNAGgmwkfV2MJM0wkp3S0PdcuXPaohLw+iiprRBz7uTUZHF87M15WfQO3K3DLO
 hxo=
Message-ID: <e176fd8c-95b3-59dc-6f2d-4ff221ce9d52@codesourcery.com>
Date: Fri, 14 Oct 2022 10:42:58 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.3.2
Subject: Re: [PATCH][RFT] Vectorization of first-order recurrences
Content-Language: en-GB
To: Richard Biener <rguenther@suse.de>, Richard Sandiford
	<richard.sandiford@arm.com>
CC: <gcc-patches@gcc.gnu.org>, <juzhe.zhong@rivai.ai>
References: <20221010110339.E9E2513479@imap2.suse-dmz.suse.de>
 <49b46c57-70b6-d9c0-a267-5e2f8315382b@codesourcery.com>
 <nycvar.YFH.7.77.849.2210110658110.18337@jbgna.fhfr.qr>
 <mptsfjug3ab.fsf@arm.com> <271ron87-q0o2-5pr5-8s65-1682p750o0@fhfr.qr>
From: Andrew Stubbs <ams@codesourcery.com>
In-Reply-To: <271ron87-q0o2-5pr5-8s65-1682p750o0@fhfr.qr>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 7bit
X-Originating-IP: [137.202.0.90]
X-ClientProxiedBy: svr-ies-mbx-12.mgc.mentorg.com (139.181.222.12) To
 svr-ies-mbx-11.mgc.mentorg.com (139.181.222.11)
X-Spam-Status: No, score=-8.1 required=5.0 tests=BAYES_00,HEADER_FROM_DIFFERENT_DOMAINS,KAM_DMARC_STATUS,NICE_REPLY_A,SPF_HELO_PASS,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On 14/10/2022 08:07, Richard Biener wrote:
> On Tue, 11 Oct 2022, Richard Sandiford wrote:
> 
>> Richard Biener <rguenther@suse.de> writes:
>>> On Mon, 10 Oct 2022, Andrew Stubbs wrote:
>>>> On 10/10/2022 12:03, Richard Biener wrote:
>>>>> The following picks up the prototype by Ju-Zhe Zhong for vectorizing
>>>>> first order recurrences.  That solves two TSVC missed optimization PRs.
>>>>>
>>>>> There's a new scalar cycle def kind, vect_first_order_recurrence
>>>>> and it's handling of the backedge value vectorization is complicated
>>>>> by the fact that the vectorized value isn't the PHI but instead
>>>>> a (series of) permute(s) shifting in the recurring value from the
>>>>> previous iteration.  I've implemented this by creating both the
>>>>> single vectorized PHI and the series of permutes when vectorizing
>>>>> the scalar PHI but leave the backedge values in both unassigned.
>>>>> The backedge values are (for the testcases) computed by a load
>>>>> which is also the place after which the permutes are inserted.
>>>>> That placement also restricts the cases we can handle (without
>>>>> resorting to code motion).
>>>>>
>>>>> I added both costing and SLP handling though SLP handling is
>>>>> restricted to the case where a single vectorized PHI is enough.
>>>>>
>>>>> Missing is epilogue handling - while prologue peeling would
>>>>> be handled transparently by adjusting iv_phi_p the epilogue
>>>>> case doesn't work with just inserting a scalar LC PHI since
>>>>> that a) keeps the scalar load live and b) that loads is the
>>>>> wrong one, it has to be the last, much like when we'd vectorize
>>>>> the LC PHI as live operation.  Unfortunately LIVE
>>>>> compute/analysis happens too early before we decide on
>>>>> peeling.  When using fully masked loop vectorization the
>>>>> vect-recurr-6.c works as expected though.
>>>>>
>>>>> I have tested this on x86_64 for now, but since epilogue
>>>>> handling is missing there's probably no practical cases.
>>>>> My prototype WHILE_ULT AVX512 patch can handle vect-recurr-6.c
>>>>> just fine but I didn't feel like running SPEC within SDE nor
>>>>> is the WHILE_ULT patch complete enough.  Builds of SPEC 2k7
>>>>> with fully masked loops succeed (minus three cases of
>>>>> PR107096, caused by my WHILE_ULT prototype).
>>>>>
>>>>> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>>>>>
>>>>> Testing with SVE, GCN or RVV appreciated, ideas how to cleanly
>>>>> handle epilogues welcome.
>>>>
>>>> The testcases all produce correct code on GCN and pass the execution tests.
>>>>
>>>> The code isn't terribly optimal because we don't have a two-input permutation
>>>> instruction, so we permute each half separately and vec_merge the results. In
>>>> this case the first vector is always a no-op permutation so that's wasted
>>>> cycles. We'd really want a vector rotate and write-lane (or the other way
>>>> around). I think the special-case permutations can be recognised and coded
>>>> into the backend, but I don't know if we can easily tell that the first vector
>>>> is just a bunch of duplicates, when it's not constant.
>>>
>>> It's not actually a bunch of duplicates in all but the first iteration.
>>> But what you can recognize is that we're only using lane N - 1 of the
>>> first vector, so you could model the permute as extract last
>>> + shift in scalar (the extracted lane).  IIRC VLA vector targets usually
>>> have something like shift the vector and set the low lane from a
>>> scalar?
>>
>> Yeah.
>>
>>> The extract lane N - 1 might be more difficult but then
>>> a rotate plus extracting lane 0 might work as well.
>>
>> I guess for SVE we should probably use SPLICE, which joins two vectors
>> and uses a predicate to select the first element that should be extracted.
>>
>> Unfortunately we don't have a way of representing "last bit set, all other
>> bits clear" as a constant though, so I guess it'll have to be hidden
>> behind unspecs.
>>
>> I meant to start SVE tests running once I'd finished for the day yesterday,
>> but forgot, sorry.  Will try to test today.
>>
>> On the patch:
>>
>> +  /* This is the second phase of vectorizing first-order rececurrences. An
>> +     overview of the transformation is described below. Suppose we have the
>> +     following loop.
>> +
>> +     int32_t t = 0;
>> +     for (int i = 0; i < n; ++i)
>> +       {
>> +	b[i] = a[i] - t;
>> +	t = a[i];
>> +      }
>> +
>> +    There is a first-order recurrence on "a". For this loop, the shorthand
>> +    scalar IR looks like:
>> +
>> +    scalar.preheader:
>> +      init = a[-1]
>> +      br loop.body
>> +
>> +    scalar.body:
>> +      i = PHI <0(scalar.preheader), i+1(scalar.body)>
>> +      _2 = PHI <(init(scalar.preheader), <_1(scalar.body)>
>> +      _1 = a[i]
>> +      b[i] = _1 - _2
>> +      br cond, scalar.body, ...
>> +
>> +    In this example, _2 is a recurrence because it's value depends on the
>> +    previous iteration. In the first phase of vectorization, we created a
>> +    temporary value for _2. We now complete the vectorization and produce the
>> +    shorthand vector IR shown below (VF = 4).
>> +
>> +    vector.preheader:
>> +      vect_init = vect_cst(..., ..., ..., a[-1])
>> +      br vector.body
>> +
>> +    vector.body
>> +      i = PHI <0(vector.preheader), i+4(vector.body)>
>> +      vect_1 = PHI <vect_init(vector.preheader), v2(vector.body)>
>> +      vect_2 = a[i, i+1, i+2, i+3];
>> +      vect_3 = vector(vect_1(3), vect_2(0, 1, 2))
>> +      b[i, i+1, i+2, i+3] = vect_2 - vect_3
>> +      br cond, vector.body, middle.block
>> +
>> +    middle.block:
>> +      x = vect_2(3)
>> +      br scalar.preheader
>> +
>> +    scalar.ph:
>> +      s_init = PHI <x(middle.block), a[-1], otherwise>
>> +      br scalar.body
>> +
>> +    After execution completes the vector loop, we extract the next value of
>> +    the recurrence (x) to use as the initial value in the scalar loop.  */
>>
>> Looks like a[-1] should be zero in the example (or t should be initialised
>> to a[-1]).
> 
> Indeed.  I've promoted the comment to function level and adjusted it
> to reflect what is done now (I've missed to update it from Ju-Zhes
> prototype).  I have also applied your correctness fix.
> 
> Re-bootstrap and regtest running on x86_64-unknown-linux-gnu.
> 
> OK (given limited x86 test quality)?

I've tested it on amdgcn and the vect-recurr tests all pass.

However the tsvc tests all fail because there's no definition of "struct 
timeval" without an "include <sys/time.h>". If I add that then they pass 
too. (Curiously, the timeval variables don't appear to be used anywhere, 
which is lucky because amdgcn doesn't have a working gettimeofday function.)

I can't claim any insight into the patch itself.

Andrew