From: "juzhe.zhong at rivai dot ai"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c
Date: Fri, 27 Oct 2023 00:57:27 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #8 from JuzheZhong ---
(In reply to Maciej W. Rozycki from comment #7)
> Thank you for all your explanations.  I think I'm still missing something
> here, so I'll write it differently (and let's ignore the tail-agnostic vs
> tail-undisturbed choice for the purpose of this consideration).
>
> Let me paste the whole assembly code produced here (sans decorations):
>
>         beq     a5,zero,.L2
>         vsetvli zero,a6,e32,m1,tu,ma
> .L3:
>         beq     a4,zero,.L7
>         li      a5,0
> .L5:
>         vle32.v v1,0(a0)
>         vle32.v v1,0(a1)
>         vle32.v v1,0(a2)
>         vse32.v v1,0(a3)
>         addi    a5,a5,1
>         bne     a4,a5,.L5
> .L7:
>         ret
> .L2:
>         vsetvli zero,a6,e32,m1,tu,ma
>         j       .L3
>
> This seems to me to correspond to this source code:
>
>   if (cond)
>     __riscv_vsetvl_e32m1(avl);
>   else
>     __riscv_vsetvl_e16mf2(avl);
>   for (size_t i = 0; i < n; i += 1) {
>     vint32m1_t a = __riscv_vle32_v_i32m1(in1, avl);
>     vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, avl);
>     vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, avl);
>     __riscv_vse32_v_i32m1(out, c, avl);
>   }
>
> And in that case I'd expect the conditional to be optimised away, as its
> result is ignored (along with the intrinsics) and does not affect actual
> code executed except for the different execution path, i.e.:
>
>         beq     a4,zero,.L7
>         vsetvli zero,a6,e32,m1,tu,ma
>         li      a5,0
> .L5:
>         vle32.v v1,0(a0)
>         vle32.v v1,0(a1)
>         vle32.v v1,0(a2)
>         vse32.v v1,0(a3)
>         addi    a5,a5,1
>         bne     a4,a5,.L5
> .L7:
>         ret

Good catch! I think we have a missed optimization here, and I agree this
code is correct and is the optimal codegen for this case.  What we
generate now is close to optimal, but not optimal enough.

This optimization should not be done by the VSETVL pass, though.  After
the VSETVL pass's fusion, both the e16mf2 and the e32m1 user vsetvl
intrinsics are fused into e32m1,tu.  The two resulting vsetvls are
exactly the same, so it is pointless to keep them in separate blocks;
they should end up in one single block.
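
To make the fusion argument concrete, here is a minimal sketch (the
function name and the assert are illustrative assumptions, not code from
the testsuite): both intrinsics request the same SEW/LMUL ratio, so they
are guaranteed to return the same VL for any AVL, which is why the pass
can legally rewrite both branches to e32m1,tu.

  #include <assert.h>
  #include <stddef.h>
  #include <riscv_vector.h>

  /* e32m1  -> SEW/LMUL = 32/1     = 32, VLMAX = VLEN/32
     e16mf2 -> SEW/LMUL = 16/(1/2) = 32, VLMAX = (VLEN/2)/16 = VLEN/32
     Same ratio, same VLMAX, hence the same returned VL.  */
  void check_same_vl (size_t avl)
  {
    size_t vl1 = __riscv_vsetvl_e32m1 (avl);
    size_t vl2 = __riscv_vsetvl_e16mf2 (avl);
    assert (vl1 == vl2);
  }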
The reason we miss this optimization is that we expand the user vsetvls
__riscv_vsetvl_e32m1 and __riscv_vsetvl_e16mf2 into two different RTL
expressions.  The passes that run before VSETVL don't know the two are
equivalent, so they keep them in separate blocks.  If you change the
code as follows:

  if (cond)
    vl = __riscv_vsetvl_e32m1(avl);
  else
    vl = __riscv_vsetvl_e32m1(avl);

I am sure the codegen will be as you said above (a single vsetvl
e32m1,tu in a single block).

One alternative approach is to expand all user vsetvl intrinsics that
have the same SEW/LMUL ratio into the same RTL expression, since their
VL outputs are guaranteed to be identical.  For example,
__riscv_vsetvl_e32m1 and __riscv_vsetvl_e16mf2 (both ratio 32) would
expand to one RTL expression, and likewise __riscv_vsetvl_e64m1 and
__riscv_vsetvl_e8mf8 (both ratio 64).  I don't see that this would
cause any problems here, but intrinsics with different ratios, like
e32m1 and e32mf2, must stay as different RTL expressions.  I am not
sure Kito agrees with this idea.  (A small sketch of this ratio
grouping follows at the end of this comment.)

Another alternative approach is to enhance the bb_reorder pass.  The
VSETVL pass runs before bb_reorder, and the current bb_reorder pass is
unable to fuse these two e32m1,tu vsetvls into the same block, because
by that point we have split them into "real" vsetvl patterns, which
have side effects.  The "real" vsetvl patterns, the ones that generate
assembly, must have side effects, since vsetvl changes the global
VL/VTYPE status and also sets a general-purpose register.

Whichever approach we pick, I won't do it in GCC 14, since stage 1 is
soon to close and we have a few more features (which are much more
important) that we are planning and working to support in GCC 14.  I
have confidence that our current RVV GCC VSETVL pass is really optimal
and fancy enough.  After stage 1 closes, we won't do any more
optimizations; we will only run full-coverage testing (for example,
running the whole GCC testsuite with different LMULs and different
-march options) and fix bugs.
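
As referenced above, a small self-contained sketch of the ratio
grouping (plain C, nothing GCC-internal; the ratio() helper and its
fractional-LMUL encoding are illustrative assumptions, not the proposed
implementation):

  #include <stdio.h>

  /* Compute the SEW/LMUL ratio that the proposed expansion would key
     RTL expressions on.  Fractional LMUL is passed as
     lmul_num/lmul_den, e.g. mf2 is 1/2 and mf8 is 1/8.  */
  static unsigned ratio (unsigned sew, unsigned lmul_num, unsigned lmul_den)
  {
    return sew * lmul_den / lmul_num;
  }

  int main (void)
  {
    /* e32m1 and e16mf2 share ratio 32 and could share one RTL
       expression; e64m1 and e8mf8 share ratio 64 and form a second
       group.  */
    printf ("e32m1=%u e16mf2=%u e64m1=%u e8mf8=%u\n",
            ratio (32, 1, 1), ratio (16, 1, 2),
            ratio (64, 1, 1), ratio (8, 1, 8));
    return 0;
  }

Intrinsics from different groups (e.g. e32m1 vs. e32mf2, ratios 32 and
64) would keep distinct RTL expressions, matching the constraint above.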