public inbox for gcc-bugs@sourceware.org
* [Bug target/116425] New: RISC-V missed optimization: vector lowering along lmul boundaries
@ 2024-08-20  0:08 ewlu at rivosinc dot com
  2024-08-20  7:26 ` [Bug target/116425] " rguenth at gcc dot gnu.org
  0 siblings, 1 reply; 2+ messages in thread
From: ewlu at rivosinc dot com @ 2024-08-20  0:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116425

            Bug ID: 116425
           Summary: RISC-V missed optimization: vector lowering along lmul
                    boundaries
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ewlu at rivosinc dot com
  Target Milestone: ---

Godbolt: https://godbolt.org/z/4T68qnsjc

Flags: -O3 -march=rv64gcv_zvl256b -ftree-vectorize -mrvv-max-lmul=m1

Testcase
```
/* Typedefs assumed from the Godbolt link (not quoted in the report);
   sizes match -march=rv64gcv_zvl256b: 8 and 16 ints.  */
typedef int vnx8si __attribute__ ((vector_size (32)));
typedef int vnx16si __attribute__ ((vector_size (64)));

__attribute__ ((noipa)) void permute_vnx8si (vnx8si values1, vnx8si values2,
                                             vnx8si *out) {
    vnx8si v =
        __builtin_shufflevector (values1, values2, 1, 1, 1, 1, 1, 1, 1, 1);
    *(vnx8si *) out = v;
}
__attribute__ ((noipa)) void permute_vnx16si (vnx16si values1, vnx16si values2,
                                              vnx16si *out) {
    vnx16si v =
        __builtin_shufflevector (values1, values2, 1, 1, 1, 1, 1, 1, 1, 1,
                                 1, 1, 1, 1, 1, 1, 1, 1);
    *(vnx16si *) out = v;
}
```

Assembly:
```
permute_vnx8si:
        vsetivli        zero,8,e32,m1,ta,ma
        vle32.v v2,0(a0)
        vrgather.vi     v1,v2,1
        vse32.v v1,0(a2)
        ret
permute_vnx16si:
        lw      a5,4(a0)
        sw      a5,0(a2)
        sw      a5,4(a2)
        sw      a5,8(a2)
        sw      a5,12(a2)
        sw      a5,16(a2)
        sw      a5,20(a2)
        sw      a5,24(a2)
        sw      a5,28(a2)
        sw      a5,32(a2)
        sw      a5,36(a2)
        sw      a5,40(a2)
        sw      a5,44(a2)
        sw      a5,48(a2)
        sw      a5,52(a2)
        sw      a5,56(a2)
        sw      a5,60(a2)
        ret
```

For permute_vnx16si, it should be possible to replace all of the scalar stores
with two vrgather.vi instructions, e.g.:

permute_vnx16si:
        vsetivli        zero,8,e32,m1,ta,ma
        vle32.v v2,0(a0)
        vrgather.vi     v8,v2,1
        vrgather.vi     v9,v2,1
        vse32.v v8,0(a2)
        vse32.v v9,32(a2)
        ret

This essentially mimics what a larger LMUL would do. Since both halves of the
result are identical, the second vrgather could additionally be eliminated,
leaving something like:

permute_vnx16si:
        vsetivli        zero,8,e32,m1,ta,ma
        vle32.v v2,0(a0)
        vrgather.vi     v8,v2,1
        vse32.v v8,0(a2)
        vse32.v v8,32(a2)
        ret

Currently, the veclower pass lowers permute_vnx16si like so:

__attribute__((noipa, noinline, noclone, no_icf))
void permute_vnx16si (vnx16si values1, vnx16si values2, vnx16si * out)
{
  vnx16si v;
  int _6;
  int _7;
  int _8;
  int _9;
  int _10;
  int _11;
  int _12;
  int _13;
  int _14;
  int _15;
  int _16;
  int _17;
  int _18;
  int _19;
  int _20;
  int _21;

  <bb 2> [local count: 1073741824]:
  _6 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _7 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _8 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _9 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _10 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _11 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _12 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _13 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _14 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _15 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _16 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _17 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _18 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _19 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _20 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  _21 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  v_2 = {_6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20,
_21};
  *out_4(D) = v_2;
  return;

}

I was thinking that modifying lower_vec_perm to check whether the vector can
be split along lmul boundaries, and to perform that split, would make the
most sense.

This would allow the veclower pass to output something along the lines of

  vnx8si v1;
  vnx8si v2;

  <bb 2> [local count: 1073741824]:
  v_1 = VEC_PERM_EXPR <values1_1(D), values1_1(D), { 1, 1, 1, 1, 1, 1, 1, 1 }>;
  v_2 = VEC_PERM_EXPR <values1_1(D), values1_1(D), { 1, 1, 1, 1, 1, 1, 1, 1 }>;
  *out_4(D) = v_1;
  *(out_4(D) + 8) = v_2;
  return;

However, the !VECTOR_MODE_P (mode) check in can_vec_perm_var_p would be rather
messy to work around, since the current vector mode is E_BLKmode.

The other possibility I was considering is modifying the forwprop pass, since
it converts the vector constructor into

__attribute__((noipa, noinline, noclone, no_icf))
void permute_vnx16si (vnx16si values1, vnx16si values2, vnx16si * out)
{
  vnx16si v;
  int _6;

  <bb 2> [local count: 1073741824]:
  _6 = BIT_FIELD_REF <values1_1(D), 32, 32>;
  BIT_FIELD_REF <*out_4(D), 32, 0> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 32> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 64> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 96> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 128> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 160> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 192> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 224> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 256> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 288> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 320> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 352> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 384> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 416> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 448> = _6;
  BIT_FIELD_REF <*out_4(D), 32, 480> = _6;
  return;

}

Ideally, the forwprop pass should be able to recognize that

v_2 = {_6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, _20,
_21};

could be split into two vectors, ending up with the same representation as
before. I think this would be a less invasive change, but forwprop may be the
wrong pass to add this in.

I might also be looking in the wrong location completely...


* [Bug target/116425] RISC-V missed optimization: vector lowering along lmul boundaries
  2024-08-20  0:08 [Bug target/116425] New: RISC-V missed optimization: vector lowering along lmul boundaries ewlu at rivosinc dot com
@ 2024-08-20  7:26 ` rguenth at gcc dot gnu.org
  0 siblings, 0 replies; 2+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-08-20  7:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116425

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |riscv

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think both passes are valid places to improve this.  The vector constructor
could come from the original source as well.

Note that vector lowering already tries to compute a "compute mode", but IIRC
not in all lowering paths, and it is possibly not aware of VLA vectors or
"LMUL".

