From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <Andrew_Stubbs@mentor.com>
Received: from esa3.mentor.iphmx.com (esa3.mentor.iphmx.com [68.232.137.180])
	by sourceware.org (Postfix) with ESMTPS id 98B1C3858D37
	for <gcc-patches@gcc.gnu.org>; Mon, 10 Oct 2022 13:57:31 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 98B1C3858D37
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=codesourcery.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=mentor.com
X-IronPort-AV: E=Sophos;i="5.95,173,1661846400"; 
   d="scan'208";a="84296243"
Received: from orw-gwy-02-in.mentorg.com ([192.94.38.167])
  by esa3.mentor.iphmx.com with ESMTP; 10 Oct 2022 05:57:24 -0800
IronPort-SDR: 4Ro7hg8qZC0jZP7A5HM29z6tfX74iwD6avirN+lD2095OgxAh9J4G2ds9RL2BQn1w17TIgdBHv
 Lc6SukgpiPwEmnn+jzith6gb4Yb1zA04qIdOWSvlcsDc02tAY2cjuWNMQjdmdAl7jEGZ6A0k6r
 U4GvCU+ry8942Vu9HkG5zYu9kvSGr6/wdoXxdTxU38wINjfjcIjPaF7kW1nO78lzkdE1KPllS/
 qT6/8C2BhXcHhHR/sgpQzEcewMkMAP0DeWlbK7OMlulD4RmsOd8iGwPtClnn3YiT5xByNWtiXN
 V84=
Message-ID: <49b46c57-70b6-d9c0-a267-5e2f8315382b@codesourcery.com>
Date: Mon, 10 Oct 2022 14:57:13 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
 Thunderbird/102.3.2
Subject: Re: [PATCH][RFT] Vectorization of first-order recurrences
To: Richard Biener <rguenther@suse.de>, <gcc-patches@gcc.gnu.org>
CC: <richard.sandiford@arm.com>, <juzhe.zhong@rivai.ai>
References: <20221010110339.E9E2513479@imap2.suse-dmz.suse.de>
Content-Language: en-GB
From: Andrew Stubbs <ams@codesourcery.com>
In-Reply-To: <20221010110339.E9E2513479@imap2.suse-dmz.suse.de>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 7bit
X-Originating-IP: [137.202.0.90]
X-ClientProxiedBy: svr-ies-mbx-13.mgc.mentorg.com (139.181.222.13) To
 svr-ies-mbx-11.mgc.mentorg.com (139.181.222.11)
X-Spam-Status: No, score=-7.8 required=5.0 tests=BAYES_00,HEADER_FROM_DIFFERENT_DOMAINS,KAM_DMARC_STATUS,NICE_REPLY_A,SPF_HELO_PASS,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On 10/10/2022 12:03, Richard Biener wrote:
> The following picks up the prototype by Ju-Zhe Zhong for vectorizing
> first order recurrences.  That solves two TSVC missed optimization PRs.
> 
> There's a new scalar cycle def kind, vect_first_order_recurrence
> and it's handling of the backedge value vectorization is complicated
> by the fact that the vectorized value isn't the PHI but instead
> a (series of) permute(s) shifting in the recurring value from the
> previous iteration.  I've implemented this by creating both the
> single vectorized PHI and the series of permutes when vectorizing
> the scalar PHI but leave the backedge values in both unassigned.
> The backedge values are (for the testcases) computed by a load
> which is also the place after which the permutes are inserted.
> That placement also restricts the cases we can handle (without
> resorting to code motion).
> 
> I added both costing and SLP handling though SLP handling is
> restricted to the case where a single vectorized PHI is enough.
> 
> Missing is epilogue handling - while prologue peeling would
> be handled transparently by adjusting iv_phi_p the epilogue
> case doesn't work with just inserting a scalar LC PHI since
> that a) keeps the scalar load live and b) that loads is the
> wrong one, it has to be the last, much like when we'd vectorize
> the LC PHI as live operation.  Unfortunately LIVE
> compute/analysis happens too early before we decide on
> peeling.  When using fully masked loop vectorization the
> vect-recurr-6.c works as expected though.
> 
> I have tested this on x86_64 for now, but since epilogue
> handling is missing there's probably no practical cases.
> My prototype WHILE_ULT AVX512 patch can handle vect-recurr-6.c
> just fine but I didn't feel like running SPEC within SDE nor
> is the WHILE_ULT patch complete enough.  Builds of SPEC 2k7
> with fully masked loops succeed (minus three cases of
> PR107096, caused by my WHILE_ULT prototype).
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
> 
> Testing with SVE, GCN or RVV appreciated, ideas how to cleanly
> handle epilogues welcome.

The testcases all produce correct code on GCN and pass the execution tests.

The code isn't terribly optimal because we don't have a two-input 
permutation instruction, so we permute each half separately and 
vec_merge the results. In this case the first vector is always a no-op 
permutation so that's wasted cycles. We'd really want a vector rotate 
and write-lane (or the other way around). I think the special-case 
permutations can be recognised and coded into the backend, but I don't 
know if we can easily tell that the first vector is just a bunch of 
duplicates, when it's not constant.

Andrew