From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Sandiford <richard.sandiford@arm.com>
To: "juzhe.zhong@rivai.ai"
Cc: rguenther, gcc-patches, jeffreyalaw, rdapp, linkw, kito.cheng
Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization
Date: Tue, 18 Apr 2023 10:32:04 +0100
In-Reply-To: <4EBEDF1F4A296E7C+2023041317540077309355@rivai.ai> (juzhe's message of "Thu, 13 Apr 2023 17:54:01 +0800")
References: <20230407014741.139387-1-juzhe.zhong@rivai.ai> <63723855B0BF2130+2023041120125573846623@rivai.ai> <139DA38AFC9CA5B5+2023041216004591287739@rivai.ai> <3340A6BE437A5F32+2023041222184139393725@rivai.ai> <4EBEDF1F4A296E7C+2023041317540077309355@rivai.ai>
"juzhe.zhong@rivai.ai" writes:
>>> But the issue is the same in the reverse with WHILE_LEN, no?
>>> WHILE_LEN just computes a scalar value - you seem to suggest
>>> there's a hidden side-effect of "coalescing" the result with
>>> a hardware vector length register?  I don't think that's good design.
> No, I don't mean to suggest there's a hidden side-effect of "coalescing"
> the result with a hardware vector length register.
>
> Today I read the RVV ISA again, carefully.  I realize that this patch is
> not correct for every RVV hardware implementation.
>
> According to the RVV ISA, a vsetvli instruction of the form
> "vsetvli vl, avl, vtype" sets vl as follows:
>
>     vl = AVL                        if AVL ≤ VLMAX
>     ceil(AVL / 2) ≤ vl ≤ VLMAX      if AVL < (2 * VLMAX)
>     vl = VLMAX                      if AVL ≥ (2 * VLMAX)
>
> and the result is deterministic on any given implementation for the
> same input AVL and VLMAX values.
> The second constraint means the result of vsetvli is not necessarily
> VLMAX (the maximum number of elements a given vector-length RVV CPU can
> update per iteration).
>
> So for a vsetvli instruction (vsetvli vl, avl, vtype), the "vl" value
> can vary between RVV CPUs, depending on the implementation of the
> downstream RVV hardware.
>
> Now I think I should fix this patch, since it is not suitable for all
> hardware.
>
> According to the RVV ISA:
> "For example, this permits an implementation to set vl = ceil(AVL / 2)
> for VLMAX < AVL < 2*VLMAX in order to evenly distribute work over the
> last two iterations of a stripmine loop."
>
> So we can have the following two different RVV CPUs.
>
> Suppose the maximum number of elements that needs to be updated is 10
> (int32_t), and the vector length is 256 bits (at most 8 INT32 elements
> per iteration).
>
> Two iterations are needed; the number of elements processed in each
> iteration depends on the hardware implementation.
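For concreteness, the vl rule quoted above can be modelled like this (a sketch only; the two policy functions and their names are made up for illustration, they are not part of the ISA or the patch):

```c
#include <assert.h>

/* Model of the vsetvli rule quoted above: any vl in
   [ceil(AVL/2), VLMAX] is legal when VLMAX < AVL < 2*VLMAX.
   The two policies below correspond to the two example CPUs.  */

/* "RVV CPU 1": spread the tail evenly over the last two iterations.  */
static unsigned long
vl_even_split (unsigned long avl, unsigned long vlmax)
{
  if (avl <= vlmax)
    return avl;
  if (avl < 2 * vlmax)
    return (avl + 1) / 2;	/* ceil (AVL / 2) */
  return vlmax;
}

/* "RVV CPU 2": always take VLMAX until fewer than VLMAX elements remain.  */
static unsigned long
vl_greedy (unsigned long avl, unsigned long vlmax)
{
  return avl < vlmax ? avl : vlmax;
}
```

With AVL = 10 and VLMAX = 8, the first policy yields 5 then 5, the second 8 then 2 — both legal under the rule, which is exactly the portability problem being described.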
> So both of the following two hardware implementations are legal under
> the RVV standard:
>
> RVV CPU 1:
> 1st iteration updates 5 elements (satisfying the constraint
> ceil(AVL/2) ≤ vl ≤ VLMAX), setting vl = ceil(AVL/2) = 5.
> 2nd iteration updates 5 elements too.
>
> RVV CPU 2:
> 1st iteration updates 8 elements, setting vl = VLMAX = 8.
> 2nd iteration updates the remaining 2 elements.
>
> Both RVV CPUs are legal according to the RVV specification.
> Obviously this patch is correct for RVV CPU 2 but incorrect for
> RVV CPU 1.

Ah, OK.  In that case, I guess a new ifn like WHILE_LEN will be
needed after all.

> The current flow of this patch is as follows:
>
> +   _19 = (unsigned long) n_5(D);
> +   ...
> +
> +   ...
> +   # ivtmp_20 = PHI
> +   ...
> +   _22 = .WHILE_LEN (ivtmp_20, vf);
> +   ...
> +   LEN_LOAD (addr, _22);
> +   ...
> +   addr = addr + vf;
> +   ivtmp_21 = ivtmp_20 - _22;
> +   ...
> +   if (ivtmp_21 != 0)
> +     goto ; [75.00%]
> +   else
> +     goto ; [25.00%]
> +
> +   return;
>
> Here _22, the output of WHILE_LEN, is only used in
> "ivtmp_21 = ivtmp_20 - _22;", which serves as the saturating-to-zero
> subtraction.  As for "addr = addr + vf;", the address is advanced in
> the loop by just adding vf.
> Such a sequence is OK for most RVV CPUs so far, I think.
> However, for future compatibility, we should make the WHILE_LEN output
> drive the address IV increment as well.
>
> So we should abandon the current address-IV scheme of just adding vf.
>
> Replace "addr = addr + vf".
>
> Instead, we should do this:
>
> _22 = .WHILE_LEN (ivtmp_20, vf);
> ....
> LEN_LOAD (addr, _22);
> tmp = _22 * 4;  (assuming INT32 elements: scale _22 from an element
>                  count to a byte offset for the address calculation)
> addr = addr + tmp;
> ....
> The result of WHILE_LEN is then not only used for
> "remain = remain - len", but also in the address calculation:
> "tmp = len * (element byte size); addr = addr + tmp;".
> With that, the flow is correct for all RVV CPUs.
> This flow is exactly the same as the example in the RVV ISA:
> https://github.com/riscv/riscv-v-spec/blob/master/example/vvaddint32.s
> I think I need to change this patch as described above to make it
> suitable for all RVV CPUs in the world.  But I am not sure whether the
> GCC community will accept this flow, so I am proposing it now before I
> do it.
> I hadn't realized this before, since my downstream RVV hardware and the
> open-source simulator always generate "vl" = VLMAX.  (Sorry about
> that.)

Sounds OK to me FWIW.  Supporting non-(poly-)constant VFs seems useful.

I had a local patch (not submitted) that needed non-constant VFs for
a different reason: to clamp the VF at runtime in order to avoid
hazards.  E.g. if we know that vectorisation is safe up to VF=N but
not beyond that, the patch would limit the VF to N on targets with
larger vectors, rather than punt to the scalar fallback loop.

There's still a key difference between that use case and yours, since
in "my" case the VF would still be invariant (for SVE), whereas with
yours the VF would vary between iterations.  But the general principle
is the same.

Thanks,
Richard
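For reference, the corrected strip-mine flow described in the quoted message can be modelled in C roughly as follows (a sketch only: the while_len callback stands in for the hardware's choice of a legal vl, and none of the names correspond to GCC internals or the actual ifn expansion):

```c
#include <assert.h>
#include <string.h>

/* Any function returning a legal vl for a given AVL and VLMAX.  */
typedef unsigned long (*vl_policy) (unsigned long avl, unsigned long vlmax);

/* One legal policy: always take VLMAX until the tail.  */
static unsigned long
vl_greedy (unsigned long avl, unsigned long vlmax)
{
  return avl < vlmax ? avl : vlmax;
}

/* Another legal policy: split the last two iterations evenly.  */
static unsigned long
vl_even_split (unsigned long avl, unsigned long vlmax)
{
  if (avl <= vlmax)
    return avl;
  return avl < 2 * vlmax ? (avl + 1) / 2 : vlmax;
}

/* Corrected flow: the per-iteration length drives both the
   remaining-count decrement and the byte-scaled address advance, so
   the loop is correct whichever legal vl the hardware picks.  */
static void
copy_int32 (int *dst, const int *src, unsigned long n,
	    unsigned long vlmax, vl_policy while_len)
{
  unsigned long remain = n;
  while (remain != 0)
    {
      unsigned long len = while_len (remain, vlmax); /* _22 = .WHILE_LEN (...) */
      memcpy (dst, src, len * sizeof (int)); /* LEN_LOAD/LEN_STORE of len elts */
      src += len;			     /* addr = addr + len * 4 */
      dst += len;
      remain -= len;			     /* ivtmp_21 = ivtmp_20 - _22 */
    }
}
```

Copying 10 elements with VLMAX = 8 runs the loop with lengths 8 + 2 under the greedy policy and 5 + 5 under the even-split policy; both produce the same result, which is the point of routing the address increment through the WHILE_LEN output rather than adding a fixed vf.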