From: "juzhe.zhong at rivai dot ai"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c
Date: Fri, 27 Oct 2023 00:57:27 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #8 from JuzheZhong ---
(In reply to Maciej W. Rozycki from comment #7)
> Thank you for all your explanations.  I think I'm still missing something
> here, so I'll write it differently (and let's ignore the tail-agnostic vs
> tail-undisturbed choice for the purpose of this consideration).
>
> Let me paste the whole assembly code produced here (sans decorations):
>
>         beq     a5,zero,.L2
>         vsetvli zero,a6,e32,m1,tu,ma
> .L3:
>         beq     a4,zero,.L7
>         li      a5,0
> .L5:
>         vle32.v v1,0(a0)
>         vle32.v v1,0(a1)
>         vle32.v v1,0(a2)
>         vse32.v v1,0(a3)
>         addi    a5,a5,1
>         bne     a4,a5,.L5
> .L7:
>         ret
> .L2:
>         vsetvli zero,a6,e32,m1,tu,ma
>         j       .L3
>
> This seems to me to correspond to this source code:
>
>   if (cond)
>     __riscv_vsetvl_e32m1(avl);
>   else
>     __riscv_vsetvl_e16mf2(avl);
>   for (size_t i = 0; i < n; i += 1) {
>     vint32m1_t a = __riscv_vle32_v_i32m1(in1, avl);
>     vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, avl);
>     vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, avl);
>     __riscv_vse32_v_i32m1(out, c, avl);
>   }
>
> And in that case I'd expect the conditional to be optimised away, as its
> result is ignored (along with the intrinsics) and does not affect actual
> code executed except for the different execution path, i.e.:
>
>         beq     a4,zero,.L7
>         vsetvli zero,a6,e32,m1,tu,ma
>         li      a5,0
> .L5:
>         vle32.v v1,0(a0)
>         vle32.v v1,0(a1)
>         vle32.v v1,0(a2)
>         vse32.v v1,0(a3)
>         addi    a5,a5,1
>         bne     a4,a5,.L5
> .L7:
>         ret

Good catch! I think we have a missed optimization here, and I agree this
code is correct and is the optimal codegen for this case.  What we
generate now is close to optimal, but not optimal enough.

This optimization should not be done by the VSETVL pass, though.  After
the VSETVL pass's fusion, both the e16mf2 and the e32m1 user vsetvl
intrinsics are fused into e32m1,tu.  The two resulting vsetvls are
exactly the same, so it is pointless to keep them in separate blocks;
they should end up in one single block.
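
To make the fusion argument concrete, here is a minimal sketch (the
function name and the assert are illustrative assumptions, not code from
the testsuite): both intrinsics request the same SEW/LMUL ratio, so they
are guaranteed to return the same VL for any AVL, which is why the pass
can legally rewrite both branches to e32m1,tu.

  #include <assert.h>
  #include <stddef.h>
  #include <riscv_vector.h>

  /* e32m1  -> SEW/LMUL = 32/1     = 32, VLMAX = VLEN/32
     e16mf2 -> SEW/LMUL = 16/(1/2) = 32, VLMAX = (VLEN/2)/16 = VLEN/32
     Same ratio, same VLMAX, hence the same returned VL.  */
  void check_same_vl (size_t avl)
  {
    size_t vl1 = __riscv_vsetvl_e32m1 (avl);
    size_t vl2 = __riscv_vsetvl_e16mf2 (avl);
    assert (vl1 == vl2);
  }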
The reason we miss this optimization is that we expand the user vsetvls
__riscv_vsetvl_e32m1 and __riscv_vsetvl_e16mf2 into two different RTL
expressions.  The passes that run before VSETVL don't know the two are
equivalent, so they keep them in separate blocks.  If you change the
code as follows:

  if (cond)
    vl = __riscv_vsetvl_e32m1(avl);
  else
    vl = __riscv_vsetvl_e32m1(avl);

I am sure the codegen will be as you said above (a single vsetvl
e32m1,tu in a single block).

One alternative approach is to expand all user vsetvl intrinsics that
have the same SEW/LMUL ratio into the same RTL expression, since their
VL outputs are guaranteed to be identical.  For example,
__riscv_vsetvl_e32m1 and __riscv_vsetvl_e16mf2 (both ratio 32) would
expand to one RTL expression, and likewise __riscv_vsetvl_e64m1 and
__riscv_vsetvl_e8mf8 (both ratio 64).  I don't see that this would
cause any problems here, but intrinsics with different ratios, like
e32m1 and e32mf2, must stay as different RTL expressions.  I am not
sure Kito agrees with this idea.  (A small sketch of this ratio
grouping follows at the end of this comment.)

Another alternative approach is to enhance the bb_reorder pass.  The
VSETVL pass runs before bb_reorder, and the current bb_reorder pass is
unable to fuse these two e32m1,tu vsetvls into the same block, because
by that point we have split them into "real" vsetvl patterns, which
have side effects.  The "real" vsetvl patterns, the ones that generate
assembly, must have side effects, since vsetvl changes the global
VL/VTYPE status and also sets a general-purpose register.

Whichever approach we pick, I won't do it in GCC 14, since stage 1 is
soon to close and we have a few more features (which are much more
important) that we are planning and working to support in GCC 14.  I
have confidence that our current RVV GCC VSETVL pass is really optimal
and fancy enough.  After stage 1 closes, we won't do any more
optimizations; we will only run full-coverage testing (for example,
running the whole GCC testsuite with different LMULs and different
-march options) and fix bugs.
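
As referenced above, a small self-contained sketch of the ratio
grouping (plain C, nothing GCC-internal; the ratio() helper and its
fractional-LMUL encoding are illustrative assumptions, not the proposed
implementation):

  #include <stdio.h>

  /* Compute the SEW/LMUL ratio that the proposed expansion would key
     RTL expressions on.  Fractional LMUL is passed as
     lmul_num/lmul_den, e.g. mf2 is 1/2 and mf8 is 1/8.  */
  static unsigned ratio (unsigned sew, unsigned lmul_num, unsigned lmul_den)
  {
    return sew * lmul_den / lmul_num;
  }

  int main (void)
  {
    /* e32m1 and e16mf2 share ratio 32 and could share one RTL
       expression; e64m1 and e8mf8 share ratio 64 and form a second
       group.  */
    printf ("e32m1=%u e16mf2=%u e64m1=%u e8mf8=%u\n",
            ratio (32, 1, 1), ratio (16, 1, 2),
            ratio (64, 1, 1), ratio (8, 1, 8));
    return 0;
  }

Intrinsics from different groups (e.g. e32m1 vs. e32mf2, ratios 32 and
64) would keep distinct RTL expressions, matching the constraint above.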