Thanks a lot for the clarification. I sent a patch to remove the address
memory cost for decrement IV/SELECT_VL:
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640595.html
I have tested various cases, and all of them get better codegen on RVV.

juzhe.zhong@rivai.ai

From: Richard Biener
Date: 2023-12-14 20:50
To: juzhe.zhong@rivai.ai
CC: gcc-patches; richard.sandiford; jeffreyalaw
Subject: Re: Re: [PATCH] Middle-end: Adjust decrement IV style partial vectorization COST model

On Thu, 14 Dec 2023, juzhe.zhong@rivai.ai wrote:

> Thanks Richard.
>
> Let me clarify again to make sure I understand your comments correctly:
>
> Do you suggest not modeling the address cost here, like the other
> partial vectorization styles (while_ult, avx512, etc.)? Then set
> COST = 1, since we have only had SELECT_VL since the beginning. In the
> various cases we saw, COST = 1 is better than COST = 2.

I suggest not modeling with address cost in mind, since nothing else in
the vectorizer does that and thus you're "comparing" apples with
oranges. If address cost is important to decide between SELECT_VL and
not SELECT_VL, then we'd need to start modeling address cost _at all_.

Richard.

> Thanks.
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-12-14 18:46
> To: juzhe.zhong
> CC: gcc-patches; richard.sandiford; jeffreyalaw
> Subject: Re: [PATCH] Middle-end: Adjust decrement IV style partial vectorization COST model
>
> On 14.12.2023 at 09:28, juzhe.zhong@rivai.ai wrote:
>
> Hi, Richard.
>
> I have a question about the decrement IV costing, since I find the
> reduction case generates inferior codegen.
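[Editorial note: the C source of the reduction kernel is not quoted anywhere in the thread. Judging from the assembly below (an e32 vredsum over n elements, returning 0 when n <= 0), it is presumably a plain sum reduction along these lines — a reconstruction, not text from the patch:]

```c
/* Hypothetical reconstruction of the reduction kernel discussed below:
   a signed-int sum over n elements, 0 when n <= 0.  */
int
reduc_plus_int (int *a, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += a[i];
  return sum;
}
```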
>
> reduc_plus_int:
>         mv      a3,a0
>         ble     a1,zero,.L7
>         addiw   a5,a1,-1
>         li      a4,2
>         bleu    a5,a4,.L8
>         vsetivli        zero,4,e32,m1,ta,ma
>         srliw   a4,a1,2
>         vmv.v.i v1,0
>         slli    a4,a4,4
>         add     a4,a4,a0
>         mv      a5,a0
> .L4:
>         vle32.v v2,0(a5)
>         addi    a5,a5,16
>         vadd.vv v1,v1,v2
>         bne     a5,a4,.L4
>         li      a5,0
>         vmv.s.x v2,a5
>         andi    a5,a1,-4
>         vredsum.vs      v1,v1,v2
>         vmv.x.s a0,v1
>         beq     a1,a5,.L12
> .L3:
>         subw    a1,a1,a5
>         slli    a5,a5,32
>         srli    a5,a5,32
>         slli    a1,a1,32
>         vsetvli a4,zero,e32,m1,ta,ma
>         slli    a5,a5,2
>         srli    a1,a1,32
>         vmv.v.i v1,0
>         add     a3,a3,a5
>         vsetvli a1,a1,e8,mf4,ta,ma
>         vle32.v v3,0(a3)
>         li      a5,0
>         vsetivli        zero,1,e32,m1,ta,ma
>         vmv.s.x v2,a5
>         vsetvli zero,a1,e32,m1,tu,ma
>         vmv.v.v v1,v3
>         vsetvli a4,zero,e32,m1,ta,ma
>         vredsum.vs      v1,v1,v2
>         vmv.x.s a5,v1
>         addw    a0,a0,a5
>         ret
> .L12:
>         ret
> .L7:
>         li      a0,0
>         ret
> .L8:
>         li      a5,0
>         li      a0,0
>         j       .L3
>
> Adjusting length_update_cost from 3 (the original cost) to 2, as this
> patch does, fixes the conversion case (the test appended in this
> patch), but it can't fix the reduction case.
>
> Then I adjusted it to COST = 1:
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 19e38b8637b..50c99b1fe79 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -4877,7 +4877,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>             processed in current iteration, and a SHIFT operation to
>             compute the next memory address instead of adding vectorization
>             factor.  */
> -        length_update_cost = 2;
> +        length_update_cost = 1;
>        else
>         /* For increment IV stype, Each may need two MINs and one MINUS to
>            update lengths in body for next iteration.  */
>
> Then the codegen is reasonable now:
>
> reduc_plus_int:
>         ble     a1,zero,.L4
>         vsetvli a5,zero,e32,m1,ta,ma
>         vmv.v.i v1,0
> .L3:
>         vsetvli a5,a1,e32,m1,tu,ma
>         vle32.v v2,0(a0)
>         slli    a4,a5,2
>         sub     a1,a1,a5
>         add     a0,a0,a4
>         vadd.vv v1,v2,v1
>         bne     a1,zero,.L3
>         li      a5,0
>         vsetivli        zero,1,e32,m1,ta,ma
>         vmv.s.x v2,a5
>         vsetvli a5,zero,e32,m1,ta,ma
>         vredsum.vs      v1,v1,v2
>         vmv.x.s a0,v1
>         ret
> .L4:
>         li      a0,0
>         ret
>
> The reason I set COST = 2 instead of 1 in this patch is:
>
> one unit of cost is for SELECT_VL;
>
> the other is for the memory address calculation, since we don't update
> the memory address by adding VF directly. Instead, we shift the result
> of SELECT_VL and then add it into the memory IV as follows:
>
> SSA_1 = SELECT_VL          --> SSA_1 is element-wise
> SSA_2 = SSA_1 << 1         (if the element is INT16, make it byte-wise)
> next iteration memory address = current iteration memory address + SSA_2
>
> If the element is INT8, then the shift operation is not needed, since
> SSA_1 is already byte-wise.
>
> So, my question is whether the COST should be 1 or 2. It seems that
> COST = 1 is better for using SELECT_VL.

We are not modeling address cost, so trying to account for it here is
only going to be a heuristic that's as many times wrong as it is
correct. If the difference between SELECT_VL and not is so small, then
you'll have a hard time modeling this here.

> Thanks.
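[Editorial note: the element-wise/byte-wise address update described above can be restated as a scalar C sketch. The helper names are hypothetical; `select_vl` here only models taking min(remaining, VLMAX) in elements, and the inner loop stands in for the actual vector operation:]

```c
#include <stddef.h>

/* Models SELECT_VL: number of elements to process this iteration,
   i.e. min (remaining, vlmax).  */
static size_t
select_vl (size_t remaining, size_t vlmax)
{
  return remaining < vlmax ? remaining : vlmax;
}

/* Decrement-IV loop shape for a char -> short widening copy:
   one SELECT_VL per iteration, plus one shift to turn the element
   count into a byte offset for the INT16 address IV (the extra
   statement the COST discussion is about).  The INT8 side needs
   no shift because elements are already byte-wise.  */
void
widen_loop_shape (short *dst, const char *src, size_t n, size_t vlmax)
{
  while (n > 0)
    {
      size_t vl = select_vl (n, vlmax);       /* SSA_1 = SELECT_VL      */
      size_t bytes = vl << 1;                 /* SSA_2 = SSA_1 << 1     */
      for (size_t i = 0; i < vl; i++)         /* stands in for vzext    */
        dst[i] = (short) src[i];
      dst = (short *) ((char *) dst + bytes); /* next addr = addr + SSA_2 */
      src += vl;                              /* INT8 IV: no shift      */
      n -= vl;
    }
}
```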
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-12-13 18:17
> To: Juzhe-Zhong
> CC: gcc-patches; richard.sandiford; jeffreyalaw
> Subject: Re: [PATCH] Middle-end: Adjust decrement IV style partial vectorization COST model
> On Wed, 13 Dec 2023, Juzhe-Zhong wrote:
>
> > Hi, before this patch, a simple conversion case had this RVV codegen:
> >
> > foo:
> >         ble     a2,zero,.L8
> >         addiw   a5,a2,-1
> >         li      a4,6
> >         bleu    a5,a4,.L6
> >         srliw   a3,a2,3
> >         slli    a3,a3,3
> >         add     a3,a3,a0
> >         mv      a5,a0
> >         mv      a4,a1
> >         vsetivli        zero,8,e16,m1,ta,ma
> > .L4:
> >         vle8.v  v2,0(a5)
> >         addi    a5,a5,8
> >         vzext.vf2       v1,v2
> >         vse16.v v1,0(a4)
> >         addi    a4,a4,16
> >         bne     a3,a5,.L4
> >         andi    a5,a2,-8
> >         beq     a2,a5,.L10
> > .L3:
> >         slli    a4,a5,32
> >         srli    a4,a4,32
> >         subw    a2,a2,a5
> >         slli    a2,a2,32
> >         slli    a5,a4,1
> >         srli    a2,a2,32
> >         add     a0,a0,a4
> >         add     a1,a1,a5
> >         vsetvli zero,a2,e16,m1,ta,ma
> >         vle8.v  v2,0(a0)
> >         vzext.vf2       v1,v2
> >         vse16.v v1,0(a1)
> > .L8:
> >         ret
> > .L10:
> >         ret
> > .L6:
> >         li      a5,0
> >         j       .L3
> >
> > This vectorization goes through the first loop:
> >
> >         vsetivli        zero,8,e16,m1,ta,ma
> > .L4:
> >         vle8.v  v2,0(a5)
> >         addi    a5,a5,8
> >         vzext.vf2       v1,v2
> >         vse16.v v1,0(a4)
> >         addi    a4,a4,16
> >         bne     a3,a5,.L4
> >
> > Each iteration processes 8 elements.
> >
> > For scalable vectorization on a CPU with VLEN >= 128 bits, this is
> > fine when VLEN = 128. But as soon as VLEN > 128 bits it wastes CPU
> > resources: with VLEN = 256 bits, for example, only half of the
> > vector unit is working and the other half is idle.
> >
> > After investigation, I realized that I forgot to adjust the COST for
> > SELECT_VL. So, adjust the COST for SELECT_VL style length
> > vectorization. We adjust the COST from 3 to 2, since after this patch:
> >
> > foo:
> >         ble     a2,zero,.L5
> > .L3:
> >         vsetvli a5,a2,e16,m1,ta,ma  -----> SELECT_VL cost.
> >         vle8.v  v2,0(a0)
> >         slli    a4,a5,1             -----> additional shift of the SELECT_VL result for memory address calculation.
> >         vzext.vf2       v1,v2
> >         sub     a2,a2,a5
> >         vse16.v v1,0(a1)
> >         add     a0,a0,a5
> >         add     a1,a1,a4
> >         bne     a2,zero,.L3
> > .L5:
> >         ret
> >
> > This patch is a simple fix that I previously forgot.
> >
> > OK for trunk?
>
> OK.
>
> Richard.
>
> > If not, I am going to adjust the cost in the backend cost model.
> >
> >     PR target/111317
> >
> > gcc/ChangeLog:
> >
> >     * tree-vect-loop.cc (vect_estimate_min_profitable_iters): Adjust for COST for decrement IV.
> >
> > gcc/testsuite/ChangeLog:
> >
> >     * gcc.dg/vect/costmodel/riscv/rvv/pr111317.c: New test.
> >
> > ---
> >  .../gcc.dg/vect/costmodel/riscv/rvv/pr111317.c | 12 ++++++++++++
> >  gcc/tree-vect-loop.cc                          | 17 ++++++++++++++---
> >  2 files changed, 26 insertions(+), 3 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr111317.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr111317.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr111317.c
> > new file mode 100644
> > index 00000000000..d4bea242a9a
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr111317.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -ftree-vectorize --param=riscv-autovec-lmul=m1" } */
> > +
> > +void
> > +foo (char *__restrict a, short *__restrict b, int n)
> > +{
> > +  for (int i = 0; i < n; i++)
> > +    b[i] = (short) a[i];
> > +}
> > +
> > +/* { dg-final { scan-assembler-times {vsetvli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*e16,\s*m1,\s*t[au],\s*m[au]} 1 } } */
> > +/* { dg-final { scan-assembler-times {vsetvli} 1 } } */
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 6261cd1be1d..19e38b8637b 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -4870,10 +4870,21 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
> >        if (partial_load_store_bias != 0)
> >          body_stmts += 1;
> >
> > -      /* Each may need two MINs and one MINUS to update lengths in body
> > -         for next iteration.  */
> > +      unsigned int length_update_cost = 0;
> > +      if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
> > +        /* For decrement IV style, we use a single SELECT_VL since
> > +           beginning to calculate the number of elements need to be
> > +           processed in current iteration, and a SHIFT operation to
> > +           compute the next memory address instead of adding vectorization
> > +           factor.  */
> > +        length_update_cost = 2;
> > +      else
> > +        /* For increment IV stype, Each may need two MINs and one MINUS to
> > +           update lengths in body for next iteration.  */
> > +        length_update_cost = 3;
> > +
> >        if (need_iterate_p)
> > -        body_stmts += 3 * num_vectors;
> > +        body_stmts += length_update_cost * num_vectors;
> >      }
> >
> >    (void) add_stmt_cost (target_cost_data, prologue_stmts,
>

--
Richard Biener
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
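[Editorial note: the effect of the hunk above, as committed in this thread (decrement IV charged 2 statements per vector for SELECT_VL plus the address shift, increment IV charged 3 for two MINs and one MINUS), can be sketched as a tiny standalone function. Names are simplified: `using_decrementing_iv` stands in for `LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)`.]

```c
/* Simplified sketch of the length-update accounting in
   vect_estimate_min_profitable_iters after this patch.
   Returns the number of body statements charged for updating
   lengths/addresses across num_vectors vector copies.  */
unsigned int
length_update_stmts (int using_decrementing_iv, unsigned int num_vectors)
{
  /* Decrement IV: SELECT_VL + shift for the address IV.
     Increment IV: two MINs + one MINUS per vector.  */
  unsigned int length_update_cost = using_decrementing_iv ? 2 : 3;
  return length_update_cost * num_vectors;
}
```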