From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=Snz2=AD=arm.com=richard.sandiford@sourceware.org>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by sourceware.org (Postfix) with ESMTP id AEFC93858D28
	for <gcc-patches@gcc.gnu.org>; Wed, 12 Apr 2023 11:17:56 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org AEFC93858D28
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id C4B1D1684;
	Wed, 12 Apr 2023 04:18:40 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.110.72])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 6F2BE3F73F;
	Wed, 12 Apr 2023 04:17:55 -0700 (PDT)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Richard Biener <rguenther@suse.de>
Mail-Followup-To: Richard Biener <rguenther@suse.de>,"juzhe.zhong\@rivai.ai" <juzhe.zhong@rivai.ai>,  gcc-patches <gcc-patches@gcc.gnu.org>,  jeffreyalaw <jeffreyalaw@gmail.com>,  rdapp@linux.ibm.com,  linkw@linux.ibm.com, richard.sandiford@arm.com
Cc: "juzhe.zhong\@rivai.ai" <juzhe.zhong@rivai.ai>,  gcc-patches <gcc-patches@gcc.gnu.org>,  jeffreyalaw <jeffreyalaw@gmail.com>,  rdapp@linux.ibm.com,  linkw@linux.ibm.com
Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization
References: <20230407014741.139387-1-juzhe.zhong@rivai.ai>
	<63723855B0BF2130+2023041120125573846623@rivai.ai>
	<mptjzyik2ql.fsf@arm.com>
	<nycvar.YFH.7.77.849.2304120657420.4466@jbgna.fhfr.qr>
	<139DA38AFC9CA5B5+2023041216004591287739@rivai.ai>
	<nycvar.YFH.7.77.849.2304120836450.4466@jbgna.fhfr.qr>
	<B6AA85AC56454A66+2023041217154958074655@rivai.ai>
	<nycvar.YFH.7.77.849.2304120923280.4466@jbgna.fhfr.qr>
Date: Wed, 12 Apr 2023 12:17:54 +0100
In-Reply-To: <nycvar.YFH.7.77.849.2304120923280.4466@jbgna.fhfr.qr> (Richard
	Biener's message of "Wed, 12 Apr 2023 09:29:50 +0000 (UTC)")
Message-ID: <mpt3555jqn1.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Status: No, score=-25.2 required=5.0 tests=BAYES_00,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Richard Biener <rguenther@suse.de> writes:
> On Wed, 12 Apr 2023, juzhe.zhong@rivai.ai wrote:
>
>> 
>> >> Thanks for the detailed explanation.  Just to clarify - with RVV
>> >> there's only a single mask register, v0.t, or did you want to
>> >> say an instruction can only specify a single mask register?
>> 
>> RVV has 32 vector registers (v0~v31) in total.
>> We can store a vector data value or a mask value in any of them.
>> We also have mask-logical instructions, for example mask-and, that operate on any of the vector registers.
>> 
>> However, any vector operation, for example vadd.vv, can only be predicated by v0 (written v0.t in asm), which is the first vector register.
>> We can't predicate vadd.vv with v1 - v31.
>> 
>> So, you can imagine that every time we want to use a mask to predicate a vector operation, we must first store the mask value
>> into v0.
>> 
>> So, we can write an instruction sequence like this:
>> 
>> vmseq v0,v8,v9 (store mask value to v0)
>> vmslt v1,v10,v11 (store mask value to v1)
>> vmand v0,v0,v1
>> vadd.vv ...v0.t (the predicate mask must always be v0).
>
> Ah, I see - that explains it well.
>
>> >> ARM SVE would have a loop control mask and a separate mask
>> >> for the if (cond[i]) which would be combined with a mask-and
>> >> instruction to a third mask which is then used on the
>> >> predicated instructions.
>> 
>> Yeah, I know. The ARM SVE way is more elegant than what RVV does.
>> However, for RVV, we can't follow this flow.
>> We don't have a "whilelo" instruction to generate a loop control mask.
>
> Yep.  Similar for AVX512 where I have to use a vector compare.  I'm
> currently using
>
>  { 0, 1, 2 ... } < { remaining_len, remaining_len, ... }
>
> and careful updating of remaining_len (we know it will either
> be adjusted by the full constant vector length or updated to zero).
>
>> We can only do loop control with the length generated by vsetvl.
>> And we can only use "v0" as the mask to predicate vadd.vv, and a mask value can only be generated by comparison or mask-logical instructions.
>> 
>> >> PowerPC and s390x might be able to use WHILE_LEN as well (though
>> >> they only have LEN variants of loads and stores) - of course
>> >> only "simulating it".  For the fixed-vector-length ISAs the
>> >> predicated vector loop IMHO makes most sense for the epilogue to
>> >> handle low-trip loops better.
>> 
>> Yeah, I wonder how they do the flow control (if (cond[i])).
>> For RVV, you can imagine I will need to add LEN_MASK_LOAD/LEN_MASK_STORE patterns (length generated by WHILE_LEN and mask generated by comparison).
>> 
>> I think we can CC the IBM folks to see whether we can make WHILE_LEN work
>> for both IBM and RVV?
>
> I've CCed them.  Adding WHILE_LEN support to rs6000/s390x would be
> mainly the "easy" way to get len-masked (epilog) loop support.

I think that already works for them (could be misremembering).
However, IIUC, they have no special instruction to calculate the
length (unlike for RVV), and so it's open-coded using vect_get_len.

I suppose my two questions are:

(1) How easy would it be to express WHILE_LEN in normal gimple?
    I haven't thought about this at all, so the answer might be
    "very hard".  But it reminds me a little of UQDEC on AArch64,
    which we open-code using MAX_EXPR and MINUS_EXPR (see
    vect_set_loop_controls_directly).

    I'm not saying WHILE_LEN is the same operation, just that it seems
    like it might be open-codeable in a similar way.

    Even if we can open-code it, we'd still need some way for the
    target to select the "RVV way" from the "s390/PowerPC way".

(2) What effect does using a variable IV step (the result of
    the WHILE_LEN) have on ivopts?  I remember experimenting with
    something similar once (can't remember the context) and not
    having a constant step prevented ivopts from making good
    addressing-mode choices.
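
To make (1) a bit more concrete, here is a rough C sketch of the kind of
open-coding I have in mind.  It assumes WHILE_LEN behaves like a plain
MIN (remaining, VF), which might not match the semantics proposed in the
patch (vsetvl is allowed more freedom than that, IIUC).  The function
names are made up for illustration and are not GCC internals.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of WHILE_LEN: the number of elements to process in
   this iteration, ASSUMING simple MIN (remaining, vf) semantics.  */
static size_t while_len_min(size_t remaining, size_t vf)
{
  return remaining < vf ? remaining : vf;
}

/* UQDEC-style saturating decrement, open-coded the way the text above
   describes: MAX_EXPR followed by MINUS_EXPR.  */
static size_t sat_dec(size_t x, size_t step)
{
  size_t m = x > step ? x : step;   /* MAX_EXPR <x, step> */
  return m - step;                  /* MINUS_EXPR, never wraps */
}

/* Scalar model of a decrementing-IV loop: returns the number of elements
   processed and stores the iteration count in *iters.  The step (len)
   is variable, which is what question (2) is about.  */
static size_t decrement_iv_loop(size_t n, size_t vf, size_t *iters)
{
  assert(vf > 0);                   /* a zero VF would never terminate */
  size_t remaining = n, processed = 0;
  *iters = 0;
  while (remaining > 0) {
    size_t len = while_len_min(remaining, vf);
    processed += len;               /* stands in for the vector body */
    remaining -= len;               /* decrement IV by a variable step */
    ++*iters;
  }
  return processed;
}
```

Under the MIN assumption, remaining - while_len_min (remaining, vf) is
always equal to sat_dec (remaining, vf), so the IV update itself is the
same MAX/MINUS shape as the UQDEC case; only the length fed to the
loads and stores is new.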

Thanks,
Richard