[Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c
@ 2023-10-26  1:00 macro at orcam dot me.uk
  2023-10-26  1:46 ` [Bug target/112092] " juzhe.zhong at rivai dot ai
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: macro at orcam dot me.uk @ 2023-10-26  1:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

            Bug ID: 112092
           Summary: RISC-V: Wrong RVV code produced for vsetvl-11.c and
                    vsetvlmax-8.c
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: wrong-code
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: macro at orcam dot me.uk
  Target Milestone: ---
            Target: riscv*-*-*

There is incorrect code produced for
gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-11.c, where we have:

  if (cond)
    vl = __riscv_vsetvl_e32m1(avl);
  else
    vl = __riscv_vsetvl_e16mf2(avl);

however the relevant parts of the assembly produced are:

        beq     a5,zero,.L2
        vsetvli zero,a6,e32,m1,tu,ma
.L3:
[...]
        ret
.L2:
        vsetvli zero,a6,e32,m1,tu,ma
        j       .L3

so both VSETVLI instructions are identical (and the whole conditional
redundant) while the former one is AFAICT supposed to be:

        vsetvli zero,a6,e16,mf2,ta,ma

according to the intrinsic used.  Additionally the pass condition of the
test case is too relaxed, making the test pass regardless.

RTL dumps indicate correct code generation up until the "vsetvl" pass,
where:

(insn 20 19 21 4 (set (reg/v:SI 16 a6 [orig:136 vl ] [136])
        (unspec:SI [
                (reg/v:SI 16 a6 [orig:147 avl ] [147])
                (const_int 16 [0x10])
                (const_int 7 [0x7])
                (const_int 2 [0x2]) repeated x2
            ] UNSPEC_VSETVL))
".../gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-11.c":12:10 1507
{vsetvlsi_no_side_effects}
     (nil))

is replaced with:

(insn 58 19 21 4 (parallel [
            (set (reg:SI 66 vl)
                (unspec:SI [
                        (reg/v:SI 16 a6 [orig:147 avl ] [147])
                        (const_int 32 [0x20])
                        (const_int 0 [0])
                    ] UNSPEC_VSETVL))
            (set (reg:SI 67 vtype)
                (unspec:SI [
                        (const_int 32 [0x20])
                        (const_int 0 [0]) repeated x2
                        (const_int 1 [0x1])
                    ] UNSPEC_VSETVL))
        ]) ".../gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-11.c":12:10
1505 {vsetvl_discard_resultsi}
     (nil))

Similarly with vsetvlmax-8.c.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c
  2023-10-26  1:00 [Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c macro at orcam dot me.uk
@ 2023-10-26  1:46 ` juzhe.zhong at rivai dot ai
  2023-10-26  1:57 ` juzhe.zhong at rivai dot ai
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-10-26  1:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

JuzheZhong <juzhe.zhong at rivai dot ai> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |juzhe.zhong at rivai dot ai

--- Comment #1 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
No, it is correct. It's the fancy optimization we have done in VSETVL PASS.

e16mf2 is same ratio e32m1.

The later loop demand e32m1 and TU, so we fuse it into e16mf2
(__riscv_vsetvl_e16mf2(avl)), change it into e32m1 and TU.

This is a valid optimization.

You can change e16mf2 into e16m1. I am sure the fusion will be blocked.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c
  2023-10-26  1:00 [Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c macro at orcam dot me.uk
  2023-10-26  1:46 ` [Bug target/112092] " juzhe.zhong at rivai dot ai
@ 2023-10-26  1:57 ` juzhe.zhong at rivai dot ai
  2023-10-26  4:01 ` macro at orcam dot me.uk
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-10-26  1:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #2 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
To demonstrate the idea, here is a simple example to make you easier understand
the idea:

https://godbolt.org/z/Gxzjv48Ec

#include "riscv_vector.h"

void foo(int32_t *in1, int32_t *in2, int32_t *in3, int32_t *out, size_t n, int
cond, int avl) {
    size_t vl = __riscv_vsetvl_e16mf2(avl >> 2);
    vint32m1_t a = __riscv_vle32_v_i32m1(in1, vl);
    vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, vl);
    vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, vl);
    __riscv_vse32_v_i32m1(out, c, vl);
}

LLVM:

        srai    a4, a6, 2
        vsetvli zero, a4, e16, mf2, ta, ma
        vle32.v v8, (a0)
        vsetvli zero, zero, e32, m1, tu, ma
        vle32.v v8, (a1)
        vle32.v v8, (a2)
        vse32.v v8, (a3)
        ret

LLVM is generating the naive code according to the intrinsics,
as you said, the first vsetvli keep e16mf2 unchanged.

Here is the codgen of GCC:
GCC:

        srai    a6,a6,2
        vsetvli a6,a6,e32,m1,tu,ma
        vle32.v v1,0(a0)
        vle32.v v1,0(a1)
        vle32.v v1,0(a2)
        vse32.v v1,0(a3)
        ret

since e16 mf2 is same ratio e32 m1, so we change first vsetvl from e16 mf2 into
e32 m1 TU. 

Then we can eliminate the second vsetvl

That is we call "local fusion" here.

For the case you mentioned is "global fusion" But they are the same thing.

Fuse vsetvl according to RVV ISA.

So, the example you mention, GCC is generating correct codes.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c
  2023-10-26  1:00 [Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c macro at orcam dot me.uk
  2023-10-26  1:46 ` [Bug target/112092] " juzhe.zhong at rivai dot ai
  2023-10-26  1:57 ` juzhe.zhong at rivai dot ai
@ 2023-10-26  4:01 ` macro at orcam dot me.uk
  2023-10-26  6:38 ` kito at gcc dot gnu.org
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: macro at orcam dot me.uk @ 2023-10-26  4:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #3 from Maciej W. Rozycki <macro at orcam dot me.uk> ---
Maybe I'm missing something, but the RVV spec has this for VSETVLI:

"The application specifies the total number of elements to be processed
(the application vector length or AVL) as a candidate value for vl, and
the hardware responds via a general-purpose register with the (frequently
smaller) number of elements that the hardware will handle per iteration
(stored in vl), based on the microarchitectural implementation and the
vtype setting."

Is is guaranteed by the RVV specification that the value of `vl' produced
(which is then supplied as an argument to `__riscv_vle32_v_i32m1', etc.;
I presume implicitly via the VL CSR as I can't see it in actual assembly
produced) is going to be the same for all microarchitectures for both:

        vsetvli zero,a6,e32,m1,tu,ma

and:

        vsetvli zero,a6,e16,mf2,ta,ma

?

If it is, then still the code is awkward and the conditional ought to be
removed and the code paths merged as both legs execute the same
instruction.

What is the definition of the `vl' parameter to `__riscv_vle32_v_i32m1',
etc. anyway?  I have troubles chasing one down and the source code is so
convoluted with macros I can't even find the implementation.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c
  2023-10-26  1:00 [Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c macro at orcam dot me.uk
                   ` (2 preceding siblings ...)
  2023-10-26  4:01 ` macro at orcam dot me.uk
@ 2023-10-26  6:38 ` kito at gcc dot gnu.org
  2023-10-26  6:51 ` juzhe.zhong at rivai dot ai
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: kito at gcc dot gnu.org @ 2023-10-26  6:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

Kito Cheng <kito at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kito at gcc dot gnu.org

--- Comment #4 from Kito Cheng <kito at gcc dot gnu.org> ---
The testcase it self is look like tricky but right, 
it typically could use to optimize mixed-width (mixed-SEW) operations,

You can refer to the EEW stuffs in v-spec[1], most load store has encoding
static-EEW and then could apply such vsetvli fusion optimization.

[1]
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#52-vector-operands

Give a (more) practical example here:

```c
#include "riscv_vector.h"

void foo(int32_t *in1, int16_t *in2, int16_t *in3, int32_t *out, size_t n, int
cond, int avl) {
    size_t vl = __riscv_vsetvl_e16mf2(avl);
    vint32m1_t a = __riscv_vle32_v_i32m1(in1, vl);
    vint16mf2_t b = __riscv_vle16_v_i16mf2(in2, vl);
    vint16mf2_t c = __riscv_vle16_v_i16mf2(in3, vl);
    vint32m1_t x = __riscv_vwmacc_vv_i32m1(a, b, c, vl);
    __riscv_vse32_v_i32m1(out, x, vl);
}

```

> Is is guaranteed by the RVV specification that the value of `vl' produced
> (which is then supplied as an argument to `__riscv_vle32_v_i32m1', etc.;
> I presume implicitly via the VL CSR as I can't see it in actual assembly
> produced) is going to be the same for all microarchitectures for both:
>
>	vsetvli	zero,a6,e32,m1,tu,ma
>
>and:
>
>	vsetvli	zero,a6,e16,mf2,ta,ma

This is another trick in this case: tail agnostic vs tail undisturbed

tail undisturbed has stronger semantic than tail agnostic, so using tail
undisturbed for agnostic is always safe and satisfied the semantic, same for
mask agnostic vs mask undisturbed.

But performance is another story, as I know some uArch implement agnostic as
undisturbed, which means agnostic or undisturbed no much difference, so fuse
those two vsetvli is become kind of optimization.

However you could imagine, that also means some uArch is implement agnostic in
another way: agnostic MAY has better performance than undisturbed, we should
not fuse those vsetvli IF we are targeting such target, anyway, our cost model
for RVV still in an initial states, so personally I am fine with that for now,
but I guess we need add some more stuff to -mtune to handle those difference.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c
  2023-10-26  1:00 [Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c macro at orcam dot me.uk
                   ` (3 preceding siblings ...)
  2023-10-26  6:38 ` kito at gcc dot gnu.org
@ 2023-10-26  6:51 ` juzhe.zhong at rivai dot ai
  2023-10-26  7:08 ` juzhe.zhong at rivai dot ai
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-10-26  6:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #5 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
Yes. I am agree that some arch prefer agnostic than undisturbed even with more
vsetvls. That's why I have post PR for asking whether we can have a option like

-mprefer-agosnotic.

https://github.com/riscv-non-isa/riscv-toolchain-conventions/issues/37


But I think Maciej is worrying about why GCC fuse vsetvl, and change

e16mf2 vsetvl into e32m1.


For example:

https://godbolt.org/z/6G9G7Pbe9

No 'TU' included.

I think LLVM codegen looks more reasonable:

        beqz    a5, .LBB0_4
        vsetvli a1, a6, e32, m1, ta, ma
        beqz    a4, .LBB0_3
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vsetvli zero, a1, e32, m1, ta, ma
        vle32.v v8, (a0)
        vadd.vv v8, v8, v8
        addi    a4, a4, -1
        vse32.v v8, (a3)
        bnez    a4, .LBB0_2
.LBB0_3:
        ret
.LBB0_4:
        srai    a1, a6, 2
        vsetvli a1, a1, e16, mf2, ta, ma
        bnez    a4, .LBB0_2
        j       .LBB0_3

But GCC is correct with optimizations:

foo(int*, int*, int*, int*, unsigned long, int, int):
        beq     a5,zero,.L2
        vsetvli a5,a6,e32,m1,ta,ma
.L3:
        beq     a4,zero,.L10
        li      a2,0
.L5:
        vle32.v v1,0(a0)
        addi    a2,a2,1
        vadd.vv v1,v1,v1
        vse32.v v1,0(a3)
        bne     a4,a2,.L5
.L10:
        ret
.L2:
        sraiw   a5,a6,2
        vsetvli zero,a5,e32,m1,ta,ma
        j       .L3

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c
  2023-10-26  1:00 [Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c macro at orcam dot me.uk
                   ` (4 preceding siblings ...)
  2023-10-26  6:51 ` juzhe.zhong at rivai dot ai
@ 2023-10-26  7:08 ` juzhe.zhong at rivai dot ai
  2023-10-26 23:31 ` macro at orcam dot me.uk
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-10-26  7:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #6 from JuzheZhong <juzhe.zhong at rivai dot ai> ---

> I have troubles chasing one down and the source code is so
> convoluted with macros I can't even find the implementation.

I am sorry for causing confusion to you here.

But because of the RVV fusion rules are so complicated, we define it in

riscv-vsetvl.def. To understand the codes, I suggest you directly read the
riscv-vsetvl.def

We define all compatible, fusion, available rules there.

For example, vle16.v (e16, m1 ) is compatible with vadd.vv (e32, mf2 ),
In this case, adjacent 2 instructions "vle16" (e16m1) and vadd.vv (e32mf2) can
have the same vsetvl (vsetvl e32mf2).

Wheras vsub.vv(e16,m1) and vadd (e32 mf2), they are not compatible.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c
  2023-10-26  1:00 [Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c macro at orcam dot me.uk
                   ` (5 preceding siblings ...)
  2023-10-26  7:08 ` juzhe.zhong at rivai dot ai
@ 2023-10-26 23:31 ` macro at orcam dot me.uk
  2023-10-27  0:57 ` juzhe.zhong at rivai dot ai
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: macro at orcam dot me.uk @ 2023-10-26 23:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #7 from Maciej W. Rozycki <macro at orcam dot me.uk> ---
Thank you for all your explanations.  I think I'm still missing something
here, so I'll write it differently (and let's ignore the tail-agnostic vs
tail-undisturbed choice for the purpose of this consideration).

Let me paste the whole assembly code produced here (sans decorations):

        beq     a5,zero,.L2
        vsetvli zero,a6,e32,m1,tu,ma
.L3:
        beq     a4,zero,.L7
        li      a5,0
.L5:
        vle32.v v1,0(a0)
        vle32.v v1,0(a1)
        vle32.v v1,0(a2)
        vse32.v v1,0(a3)
        addi    a5,a5,1
        bne     a4,a5,.L5
.L7:
        ret
.L2:
        vsetvli zero,a6,e32,m1,tu,ma
        j       .L3

This seems to me to correspond to this source code:

  if (cond)
    __riscv_vsetvl_e32m1(avl);
  else
    __riscv_vsetvl_e16mf2(avl);
  for (size_t i = 0; i < n; i += 1) {
    vint32m1_t a = __riscv_vle32_v_i32m1(in1, avl);
    vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, avl);
    vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, avl);
    __riscv_vse32_v_i32m1(out, c, avl);
  }

And in that case I'd expect the conditional to be optimised away, as its
result is ignored (along with the intrinsics) and does not affect actual
code executed except for the different execution path, i.e.:

        beq     a4,zero,.L7
        vsetvli zero,a6,e32,m1,tu,ma
        li      a5,0
.L5:
        vle32.v v1,0(a0)
        vle32.v v1,0(a1)
        vle32.v v1,0(a2)
        vse32.v v1,0(a3)
        addi    a5,a5,1
        bne     a4,a5,.L5
.L7:
        ret

However actual source code is as follows:

  size_t vl;
  if (cond)
    vl = __riscv_vsetvl_e32m1(avl);
  else
    vl = __riscv_vsetvl_e16mf2(avl);
  for (size_t i = 0; i < n; i += 1) {
    vint32m1_t a = __riscv_vle32_v_i32m1(in1, vl);
    vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, vl);
    vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, vl);
    __riscv_vse32_v_i32m1(out, c, vl);
  }

Based on what you write I'd expect code like this instead:

        beq     a5,zero,.L2
        vsetvli a6,a6,e16,mf2,ta,ma
.L3:
        beq     a4,zero,.L7
        vsetvli zero,a6,e32,m1,tu,ma
        li      a5,0
.L5:
        vle32.v v1,0(a0)
        vle32.v v1,0(a1)
        vle32.v v1,0(a2)
        vse32.v v1,0(a3)
        addi    a5,a5,1
        bne     a4,a5,.L5
.L7:
        ret
.L2:
        vsetvli a6,a6,e32,m1,ta,ma
        j       .L3

which is roughly what you say LLVM produces.

Why is the `vl' value determined by hardware from `avl' by an explicit
request (!) of the programmer who inserted the vsetvl intrinsics ignored?
Is the compiler able to prove the use of `avl' in place of `vl' does not
affect the operation of the VLE32.V and VSE32.V instructions in any way?
What is the purpose of these intrinsics if they can be freely ignored?

Please forgive me if my questions seem to you obvious to answer or
irrelevant, I'm still rather new to this RVV stuff.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c
  2023-10-26  1:00 [Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c macro at orcam dot me.uk
                   ` (6 preceding siblings ...)
  2023-10-26 23:31 ` macro at orcam dot me.uk
@ 2023-10-27  0:57 ` juzhe.zhong at rivai dot ai
  2023-10-27  1:03 ` juzhe.zhong at rivai dot ai
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-10-27  0:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #8 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Maciej W. Rozycki from comment #7)
> Thank you for all your explanations.  I think I'm still missing something
> here, so I'll write it differently (and let's ignore the tail-agnostic vs
> tail-undisturbed choice for the purpose of this consideration).
> 
> Let me paste the whole assembly code produced here (sans decorations):
> 
> 	beq	a5,zero,.L2
> 	vsetvli	zero,a6,e32,m1,tu,ma
> .L3:
> 	beq	a4,zero,.L7
> 	li	a5,0
> .L5:
> 	vle32.v	v1,0(a0)
> 	vle32.v	v1,0(a1)
> 	vle32.v	v1,0(a2)
> 	vse32.v	v1,0(a3)
> 	addi	a5,a5,1
> 	bne	a4,a5,.L5
> .L7:
> 	ret
> .L2:
> 	vsetvli	zero,a6,e32,m1,tu,ma
> 	j	.L3
> 
> This seems to me to correspond to this source code:
> 
>   if (cond)
>     __riscv_vsetvl_e32m1(avl);
>   else
>     __riscv_vsetvl_e16mf2(avl);
>   for (size_t i = 0; i < n; i += 1) {
>     vint32m1_t a = __riscv_vle32_v_i32m1(in1, avl);
>     vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, avl);
>     vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, avl);
>     __riscv_vse32_v_i32m1(out, c, avl);
>   }
> 
> And in that case I'd expect the conditional to be optimised away, as its
> result is ignored (along with the intrinsics) and does not affect actual
> code executed except for the different execution path, i.e.:
> 
> 	beq	a4,zero,.L7
> 	vsetvli	zero,a6,e32,m1,tu,ma
> 	li	a5,0
> .L5:
> 	vle32.v	v1,0(a0)
> 	vle32.v	v1,0(a1)
> 	vle32.v	v1,0(a2)
> 	vse32.v	v1,0(a3)
> 	addi	a5,a5,1
> 	bne	a4,a5,.L5
> .L7:
> 	ret
> 

Good catch ! I think we have a missed-optimization here and I agree this code
is correct and optimal codegen for this case.
We have a close-to-optimal (not optimal enough) codegen for now.

And this optimization should not be done by VSETVL PASS.

After VSETVL PASS fusion, both e16mf2 and e32m1 user vsetvl instrinsic are
fused into e32m1, tu. They are totally the same so it's meaningless seperate
them into different blocks (They should be the same single block).

The reason why we missed an optimization here is because we expand user
vsetvl __riscv_vsetvl_e32m1 and __riscv_vsetvl_e16mf2 into 2 different
RTL expressions. The before PASSes (before VSETVL) don't known they are
equivalent, so separate them into different blocks.

If you change codes as follows:
  if (cond)
    vl = __riscv_vsetvl_e32m1(avl);
  else
    vl = __riscv_vsetvl_e32m1(avl);

I am sure the codegen will be as you said above. (A single vsetvl e32m1 tu in
a single block).

To optimize it, a alternative approach is that we expand all user vsetvl
instrinscs into same RTL expression (as long as they are having same ratio).

Meaning, expand 

__riscv_vsetvl_e64m1
__riscv_vsetvl_e32m1
__riscv_vsetvl_e16mf2
__riscv_vsetvl_e8mf8

into same RTL expression since their VL outputs are definitely the same.

I don't see it will cause any problems here.

But different ratio like 32m1 and e32mf2 should be different RLT expression.

I am not sure kito agree with this idea.

Another alternative approach is that we enhance bb_reorder PASS.
The VSETVL PASS is run before bb_reorder PASS and current bb_reorder PASS
is unable to fuse these 2 vsetvls e32m1 Tu into same block because we split
it into "real" vsetvls which is the RTL pattern has side effects.

The "real" vsetvl patterns which generate assembly should have side effects
since vsetvl does change global VL/VTYPE status and also set a general
register.

No matter which approach to optimize it, I won't do it in GCC-14 since stage 1
is soon to close.  We have a few more features (which are much more imporant)
that we are planning and working to support in GCC-14.
I have confidence that our RVV GCC current VSETVL PASS is really optimal and
fancy enough.

After stage 1 close, we won't do any optimizations, we will only run full
coverage testing (for example, using different LMUL different -march to run the
whole gcc testsuite) and fix bugs.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/112092] RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c
  2023-10-26  1:00 [Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c macro at orcam dot me.uk
                   ` (7 preceding siblings ...)
  2023-10-27  0:57 ` juzhe.zhong at rivai dot ai
@ 2023-10-27  1:03 ` juzhe.zhong at rivai dot ai
  2023-10-31 13:58 ` [Bug target/112092] RISC-V: Suboptimal " macro at orcam dot me.uk
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-10-27  1:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #9 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Maciej W. Rozycki from comment #7)
> Thank you for all your explanations.  I think I'm still missing something
> here, so I'll write it differently (and let's ignore the tail-agnostic vs
> tail-undisturbed choice for the purpose of this consideration).
> 
> Why is the `vl' value determined by hardware from `avl' by an explicit
> request (!) of the programmer who inserted the vsetvl intrinsics ignored?
> Is the compiler able to prove the use of `avl' in place of `vl' does not
> affect the operation of the VLE32.V and VSE32.V instructions in any way?
> What is the purpose of these intrinsics if they can be freely ignored?
> 
> Please forgive me if my questions seem to you obvious to answer or
> irrelevant, I'm still rather new to this RVV stuff.

As long as the ratio of user vsetvl intrinsics are same as the following
RVV normal instruction, compiler is free to optimize it.

For example:

vl = __riscv_vsetvl_e32m1 (avl)
__riscv_vadd_vv_i32m1 (...,vl)

A naive way to insert vsetvl:

vsetvl VL, AVL e32 m1
vsetvl zero, VL e32 m1
vadd.vv

Howerver, since they are have same ratio, we can do it:

vsetvl zero, AVL e32 m1
vadd.vv

It's absolutely correct in-dependent on hardware.

However, different ratio:

vl = __riscv_vsetvl_e32m1 (avl)
__riscv_vadd_vv_i64m1 (...,vl)

vsetvl VL, AVL e32 m1
vsetvl zero, VL e64 m1
vadd.vv

We can't optimize it. This is the only correct codegen.

Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/112092] RISC-V: Suboptimal RVV code produced for vsetvl-11.c and vsetvlmax-8.c
  2023-10-26  1:00 [Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c macro at orcam dot me.uk
                   ` (8 preceding siblings ...)
  2023-10-27  1:03 ` juzhe.zhong at rivai dot ai
@ 2023-10-31 13:58 ` macro at orcam dot me.uk
  2023-11-08  6:38 ` cvs-commit at gcc dot gnu.org
  2023-11-08  6:39 ` juzhe.zhong at rivai dot ai
  11 siblings, 0 replies; 13+ messages in thread
From: macro at orcam dot me.uk @ 2023-10-31 13:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

Maciej W. Rozycki <macro at orcam dot me.uk> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|wrong-code                  |missed-optimization
           Severity|normal                      |enhancement
            Summary|RISC-V: Wrong RVV code      |RISC-V: Suboptimal RVV code
                   |produced for vsetvl-11.c    |produced for vsetvl-11.c
                   |and vsetvlmax-8.c           |and vsetvlmax-8.c

--- Comment #10 from Maciej W. Rozycki <macro at orcam dot me.uk> ---
I see what you mean, thanks for straightening me out.  I've updated bug
summary and parameters accordingly.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/112092] RISC-V: Suboptimal RVV code produced for vsetvl-11.c and vsetvlmax-8.c
  2023-10-26  1:00 [Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c macro at orcam dot me.uk
                   ` (9 preceding siblings ...)
  2023-10-31 13:58 ` [Bug target/112092] RISC-V: Suboptimal " macro at orcam dot me.uk
@ 2023-11-08  6:38 ` cvs-commit at gcc dot gnu.org
  2023-11-08  6:39 ` juzhe.zhong at rivai dot ai
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-11-08  6:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #11 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The trunk branch has been updated by Lehua Ding <lhtin@gcc.gnu.org>:

https://gcc.gnu.org/g:f9148120048f4508156acfcd19a334f4dcbb96f0

commit r14-5239-gf9148120048f4508156acfcd19a334f4dcbb96f0
Author: Juzhe-Zhong <juzhe.zhong@rivai.ai>
Date:   Wed Nov 8 14:05:00 2023 +0800

    RISC-V: Normalize user vsetvl intrinsics[PR112092]

    Since our user vsetvl intrinsics are defined as just calculate the VL
output
    which is the number of the elements to be processed. Such intrinsics do not
    have any side effects.  We should normalize them when they have same ratio.

    E.g __riscv_vsetvl_e8mf8 result is same as __riscv_vsetvl_e64m1.

    Normalize them can allow us have better codegen.
    Consider this following example:

    #include "riscv_vector.h"

    void foo(int32_t *in1, int32_t *in2, int32_t *in3, int32_t *out, size_t n,
int cond, int avl) {

      size_t vl;
      if (cond)
        vl = __riscv_vsetvl_e32m1(avl);
      else
        vl = __riscv_vsetvl_e16mf2(avl);
      for (size_t i = 0; i < n; i += 1) {
        vint32m1_t a = __riscv_vle32_v_i32m1(in1, vl);
        vint32m1_t b = __riscv_vle32_v_i32m1_tu(a, in2, vl);
        vint32m1_t c = __riscv_vle32_v_i32m1_tu(b, in3, vl);
        __riscv_vse32_v_i32m1(out, c, vl);
      }
    }

    Before this patch:

    foo:
            beq     a5,zero,.L2
            vsetvli a6,a6,e32,m1,tu,ma
    .L3:
            li      a5,0
            beq     a4,zero,.L9
    .L4:
            vle32.v v1,0(a0)
            addi    a5,a5,1
            vle32.v v1,0(a1)
            vle32.v v1,0(a2)
            vse32.v v1,0(a3)
            bne     a4,a5,.L4
    .L9:
            ret
    .L2:
            vsetvli zero,a6,e32,m1,tu,ma
            j       .L3

    After this patch:

    foo:
            li      a5,0
            vsetvli zero,a6,e32,m1,tu,ma
            beq     a4,zero,.L9
    .L4:
            vle32.v v1,0(a0)
            addi    a5,a5,1
            vle32.v v1,0(a1)
            vle32.v v1,0(a2)
            vse32.v v1,0(a3)
            bne     a4,a5,.L4
    .L9:
            ret

            PR target/112092

    gcc/ChangeLog:

            * config/riscv/riscv-vector-builtins-bases.cc: Normalize the
vsetvls.

    gcc/testsuite/ChangeLog:

            * gcc.target/riscv/rvv/vsetvl/pr109743-1.c: Adapt test.
            * gcc.target/riscv/rvv/vsetvl/pr109743-3.c: Ditto.
            * gcc.target/riscv/rvv/vsetvl/vsetvl-11.c: Ditto.
            * gcc.target/riscv/rvv/vsetvl/vsetvl-15.c: Ditto.
            * gcc.target/riscv/rvv/vsetvl/vsetvl-22.c: Ditto.
            * gcc.target/riscv/rvv/vsetvl/vsetvlmax-13.c: Ditto.
            * gcc.target/riscv/rvv/vsetvl/vsetvlmax-15.c: Ditto.
            * gcc.target/riscv/rvv/vsetvl/vsetvlmax-5.c: Ditto.
            * gcc.target/riscv/rvv/vsetvl/vsetvlmax-7.c: Ditto.
            * gcc.target/riscv/rvv/vsetvl/vsetvlmax-8.c: Ditto.
            * gcc.target/riscv/rvv/vsetvl/pr112092-1.c: New test.
            * gcc.target/riscv/rvv/vsetvl/pr112092-2.c: New test.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/112092] RISC-V: Suboptimal RVV code produced for vsetvl-11.c and vsetvlmax-8.c
  2023-10-26  1:00 [Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c macro at orcam dot me.uk
                   ` (10 preceding siblings ...)
  2023-11-08  6:38 ` cvs-commit at gcc dot gnu.org
@ 2023-11-08  6:39 ` juzhe.zhong at rivai dot ai
  11 siblings, 0 replies; 13+ messages in thread
From: juzhe.zhong at rivai dot ai @ 2023-11-08  6:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112092

--- Comment #12 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
It should be fixed on the trunk.
Plz verify it and close the issue.

Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-11-08  6:39 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-26  1:00 [Bug target/112092] New: RISC-V: Wrong RVV code produced for vsetvl-11.c and vsetvlmax-8.c macro at orcam dot me.uk
2023-10-26  1:46 ` [Bug target/112092] " juzhe.zhong at rivai dot ai
2023-10-26  1:57 ` juzhe.zhong at rivai dot ai
2023-10-26  4:01 ` macro at orcam dot me.uk
2023-10-26  6:38 ` kito at gcc dot gnu.org
2023-10-26  6:51 ` juzhe.zhong at rivai dot ai
2023-10-26  7:08 ` juzhe.zhong at rivai dot ai
2023-10-26 23:31 ` macro at orcam dot me.uk
2023-10-27  0:57 ` juzhe.zhong at rivai dot ai
2023-10-27  1:03 ` juzhe.zhong at rivai dot ai
2023-10-31 13:58 ` [Bug target/112092] RISC-V: Suboptimal " macro at orcam dot me.uk
2023-11-08  6:38 ` cvs-commit at gcc dot gnu.org
2023-11-08  6:39 ` juzhe.zhong at rivai dot ai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).