public inbox for gcc@gcc.gnu.org
* Question about dynamic choosing vectorization factor for RVV
@ 2023-08-31  3:04 juzhe.zhong
  2023-08-31  7:38 ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: juzhe.zhong @ 2023-08-31  3:04 UTC (permalink / raw)
  To: gcc; +Cc: richard.sandiford, rguenther


Hi, Richard and Richi.

Currently, we statically return the vectorization factor from 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
according to a compile option.

For example:
void
foo (int32_t *__restrict a, int32_t *__restrict b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = a[i] + b[i];
}

with --param=riscv-autovec-lmul=m1:

vsetvli a5,a2,e32,m1,ta,ma
vle32.v v2,0(a0)
vle32.v v1,0(a1)
vsetvli a6,zero,e32,m1,ta,ma
slli a3,a5,2
vadd.vv v1,v1,v2
sub a2,a2,a5
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v1,0(a4)
add a0,a0,a3
add a1,a1,a3
add a4,a4,a3
bne a2,zero,.L3

Here the 'vadd.vv' operates on a single vector register.

with --param=riscv-autovec-lmul=m8:

  vsetvli a5,a2,e8,m2,ta,ma
  vle32.v v16,0(a0)
  vle32.v v8,0(a1)
  vsetvli a6,zero,e32,m8,ta,ma
  slli a3,a5,2
  vadd.vv v8,v8,v16
  vsetvli zero,a2,e32,m8,ta,ma
  sub a2,a2,a5
  vse32.v v8,0(a4)
  add a0,a0,a3
  add a1,a1,a3
  add a4,a4,a3
  bne a2,zero,.L3

The 'vadd.vv' here is performing operations on 8 consecutive registers:

vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]
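
To make the grouping concrete, here is a small sketch (not GCC code; `RegGroup` and `reg_group` are names made up for illustration) that computes which architectural registers one operand covers for a given integer LMUL. It relies on the RVV rule that the base register number of a group is a multiple of LMUL.

```cpp
#include <cassert>

// Hypothetical helper type: first and last vector register of one operand.
struct RegGroup {
  int first;
  int last;
};

// With integer LMUL in {1, 2, 4, 8}, an operand named by register `base`
// occupies `lmul` consecutive registers starting at `base`.
RegGroup reg_group(int base, int lmul)
{
  assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
  assert(base >= 0 && base + lmul <= 32);  // RVV has registers v0..v31
  assert(base % lmul == 0);                // groups are LMUL-aligned
  return RegGroup{base, base + lmul - 1};
}
```

For example, `reg_group(8, 8)` covers v8 to v15 and `reg_group(16, 8)` covers v16 to v23, which matches the expansion of the 'vadd.vv' above.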

Having users set the vectorization factor statically is not ideal.

We want GCC to choose the vectorization factor dynamically, based on loop analysis, when auto-vectorizing.

So far I have implemented a simplistic analysis that computes the live range of each local decl in the current function.

The analysis works as follows. RVV has 32 vector registers,
so from the live ranges of the current function's local decls we require:

(number of decls live at the same time) * LMUL <= 32

Based on this analysis, I set the vectorization factor in TARGET_VECTORIZE_PREFERRED_SIMD_MODE.
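
As a rough illustration of that inequality, the sketch below picks the largest integer LMUL that keeps the estimate within 32 registers. The function name and the `live_values` input are assumptions made for this example; the live-range analysis that would produce the count is not shown, and fractional LMUL and the special role of v0 as mask register are ignored.

```cpp
#include <cassert>

// Largest LMUL in {8, 4, 2, 1} such that live_values * LMUL <= 32.
// `live_values` is the peak number of vector values live at the same time
// (hypothetical input; how it is computed is outside this sketch).
int max_lmul_without_spill(int live_values)
{
  assert(live_values > 0);
  constexpr int kVectorRegs = 32;
  for (int lmul = 8; lmul > 1; lmul /= 2)
    if (live_values * lmul <= kVectorRegs)
      return lmul;
  return 1;  // smallest integer LMUL; may still spill if live_values > 32
}
```

With two live values the result is 8; with five to eight live values it drops to 4.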

This simplistic algorithm (implemented in the RISC-V backend) works well for the test cases I wrote.

However, it can only choose the best vectorization factor for a whole function, not for a specific loop.

Here is the example:

void foo2 (int32_t *__restrict a,
          int32_t *__restrict b,
          int32_t *__restrict c,
          int32_t *__restrict a2,
          int32_t *__restrict b2,
          int32_t *__restrict c2,
          int32_t *__restrict a3,
          int32_t *__restrict b3,
          int32_t *__restrict c3,
          int32_t *__restrict a4,
          int32_t *__restrict b4,
          int32_t *__restrict c4,
          int32_t *__restrict a5,
          int32_t *__restrict b5,
          int32_t *__restrict c5,
          int n)
{
// Loop 1
    for (int i = 0; i < n; i++)
       a[i] = a[i] + b[i];
// Loop 2
    for (int i = 0; i < n; i++){
      a[i] = b[i] + c[i];
      a2[i] = b2[i] + c2[i];
      a3[i] = b3[i] + c3[i];
      a4[i] = b4[i] + c4[i];
      a5[i] = a[i] + a4[i];
      a[i] = a3[i] + a2[i]+ a5[i];
    }
}

For Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should use LMUL = 4 (since LMUL = 8 would cause vector register spills).

If I split Loop 1 and Loop 2 into two separate functions, my algorithm works well.

However, if both loops are in the same function, I end up picking LMUL = 4 for both Loop 1 and Loop 2 because, as noted above, the analysis is done per function
rather than per loop.
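
The difference between the two granularities can be sketched as follows. This is not the backend implementation; the function names are invented, and the live-value counts in the usage note (2 for Loop 1, 6 for Loop 2) are assumed numbers chosen only to reproduce the LMUL = 8 versus LMUL = 4 outcome described above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Largest LMUL in {8, 4, 2, 1} with live_values * LMUL <= 32.
int max_lmul_without_spill(int live_values)
{
  assert(live_values > 0);
  for (int lmul = 8; lmul > 1; lmul /= 2)
    if (live_values * lmul <= 32)
      return lmul;
  return 1;
}

// Per-function choice: a single LMUL for every loop, so the loop with the
// highest register pressure decides for all of them.
int lmul_per_function(const std::vector<int> &live_per_loop)
{
  assert(!live_per_loop.empty());
  int peak = *std::max_element(live_per_loop.begin(), live_per_loop.end());
  return max_lmul_without_spill(peak);
}

// Per-loop choice: each loop gets the largest LMUL that fits its own pressure.
std::vector<int> lmul_per_loop(const std::vector<int> &live_per_loop)
{
  std::vector<int> result;
  for (int live : live_per_loop)
    result.push_back(max_lmul_without_spill(live));
  return result;
}
```

With the assumed counts `{2, 6}`, `lmul_per_function` returns 4 for the whole function, while `lmul_per_loop` returns `{8, 4}`.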

I am wondering whether there is a good solution to this issue. Could we pass loop_vec_info through
to the 'preferred_simd_mode' target hook?

Thanks.


juzhe.zhong@rivai.ai


* Re: Question about dynamic choosing vectorization factor for RVV
  2023-08-31  3:04 Question about dynamic choosing vectorization factor for RVV juzhe.zhong
@ 2023-08-31  7:38 ` Richard Biener
  2023-08-31  9:52   ` juzhe.zhong
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Biener @ 2023-08-31  7:38 UTC (permalink / raw)
  To: juzhe.zhong; +Cc: gcc, richard.sandiford

On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:

> Hi, Richard and Richi.
> 
> Currently, we are statically returning vectorization factor in 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
> according to compile option.
> 
> For example:
> void
> foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> {
>   for (int i = 0; i < n; i++)
>     a[i] = a[i] + b[i];
> }
> 
> with --param=riscv-autovec-lmul = m1:
> 
> vsetvli a5,a2,e32,m1,ta,ma
> vle32.v v2,0(a0)
> vle32.v v1,0(a1)
> vsetvli a6,zero,e32,m1,ta,ma
> slli a3,a5,2
> vadd.vv v1,v1,v2
> sub a2,a2,a5
> vsetvli zero,a5,e32,m1,ta,ma
> vse32.v v1,0(a4)
> add a0,a0,a3
> add a1,a1,a3
> add a4,a4,a3
> bne a2,zero,.L3
> 
> The 'vadd.vv' is only performing operations on a single register.
> 
> with --param=riscv-autovec-lmul=m8:
> 
>   vsetvli a5,a2,e8,m2,ta,ma
>   vle32.v v16,0(a0)
>   vle32.v v8,0(a1)
>   vsetvli a6,zero,e32,m8,ta,ma
>   slli a3,a5,2
>   vadd.vv v8,v8,v16
>   vsetvli zero,a2,e32,m8,ta,ma
>   sub a2,a2,a5
>   vse32.v v8,0(a4)
>   add a0,a0,a3
>   add a1,a1,a3
>   add a4,a4,a3
>   bne a2,zero,.L3
> 
> The 'vadd.vv' here is performing operations on 8 consecutive registers:
> 
> vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]
> 
> Users statically set the vectorization factor is not ideal.
> 
> We want GCC to dynamic choose vectorization factor to do the auto-vectorization according to loop analysis.
> 
> Currently, I have implement simplistic loop analysis like analyze live range of each local decl of current function.
> 
> Here is the analysis, we have 32 vector registers for RVV.
> So we calculate the live range of current function local decl:
> 
> the number of decls live at the same time * LMUL <= 32. 
> According to this analysis, I set the vectorization factor in TARGET_VECTORIZE_PREFERRED_SIMD_MODE
> 
> Then this simplistic algorithm (implemented in RISC-V backend) work well for the testcases I produces.
> 
> However, I can only choose optimal vectorization for whole function but failed to specific loop.
> 
> Here is the example:
> 
> void foo2 (int32_t *__restrict a,
>           int32_t *__restrict b,
>           int32_t *__restrict c,
>           int32_t *__restrict a2,
>           int32_t *__restrict b2,
>           int32_t *__restrict c2,
>           int32_t *__restrict a3,
>           int32_t *__restrict b3,
>           int32_t *__restrict c3,
>           int32_t *__restrict a4,
>           int32_t *__restrict b4,
>           int32_t *__restrict c4,
>           int32_t *__restrict a5,
>           int32_t *__restrict b5,
>           int32_t *__restrict c5,
>           int n)
> {
> // Loop 1
>     for (int i = 0; i < n; i++)
>        a[i] = a[i] + b[i];
> // Loop 2
>     for (int i = 0; i < n; i++){
>       a[i] = b[i] + c[i];
>       a2[i] = b2[i] + c2[i];
>       a3[i] = b3[i] + c3[i];
>       a4[i] = b4[i] + c4[i];
>       a5[i] = a[i] + a4[i];
>       a[i] = a3[i] + a2[i]+ a5[i];
>     }
> }
> 
> Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should choose LMUL = 4 (since LMUL = 8 will cause vector register spillings).
> 
> If I split loop 1 and loop 2 into 2 separate functions, my algorithm works well.
> 
> However, if we put these 2 loop in the same function, I finally pick LMUL = 4 for both loop 1 and loop 2 since as I said above, I do the analysis base on function not base
> on the loop.
> 
> I am struggling whether we could have a good idea for such issue. Can we pass through loop_vec_info
> to 'preferred_simd_mode' target hook?

That's not how it's currently designed to work - there's
the autovectorize_vector_modes hook where you should provide a vector
of modes the vectorizer iterates over and return VECT_COMPARE_COSTS
if you want to evaluate costs between choices.  Your analysis should
then happen in the finish_cost method.

That's how it's currently designed.  It might not be optimal for
compile-time reasons when there are many modes; giving the target
more control (and context) might be possible.
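
The flow described above (try each candidate mode, cost it, keep the cheapest) can be modelled with a toy example. Nothing here is a GCC type or API: `Attempt` and `pick_best` are invented names, the costs in the usage note are made-up numbers, and keeping the earlier mode on a tie is an assumption of this model.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for one vectorization attempt with a given vector mode.
struct Attempt {
  std::string mode;  // e.g. "RVVM8SI"
  unsigned cost;     // total cost the target's cost model reported
};

// Returns the index of the winning attempt. A later attempt replaces the
// current best only when it is strictly cheaper, so the first mode tried
// wins ties in this model.
std::size_t pick_best(const std::vector<Attempt> &attempts)
{
  assert(!attempts.empty());
  std::size_t best = 0;
  for (std::size_t i = 1; i < attempts.size(); ++i)
    if (attempts[i].cost < attempts[best].cost)
      best = i;
  return best;
}
```

For instance, with attempts `{"RVVM8SI", 24}`, `{"RVVM4SI", 3}` and `{"RVVM2SI", 3}`, the function returns index 1: the first mode tried is only a starting point, and a cheaper later mode replaces it.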

Richard.

> Thanks.
> 
> 
> juzhe.zhong@rivai.ai
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


* Re: Re: Question about dynamic choosing vectorization factor for RVV
  2023-08-31  7:38 ` Richard Biener
@ 2023-08-31  9:52   ` juzhe.zhong
  2023-08-31 11:12     ` Richard Sandiford
  2023-08-31 11:20     ` Richard Biener
  0 siblings, 2 replies; 10+ messages in thread
From: juzhe.zhong @ 2023-08-31  9:52 UTC (permalink / raw)
  To: rguenther; +Cc: gcc, richard.sandiford


Thanks Richi.

I am trying to figure out how to adjust finish_cost to lower the LMUL.

For example:

void
foo (int32_t *__restrict a, int32_t *__restrict b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = a[i] + b[i];
}

preferred_simd_mode picks LMUL = 8 (RVVM8SImode).

Is it possible to adjust the cost in finish_cost so that the loop vectorizer picks LMUL = 4?

I am experimenting with the following cost:

  if (loop_vinfo)
    {
      if (loop_vinfo->vector_mode == RVVM8SImode)
        {
          m_costs[vect_prologue] = 2;
          m_costs[vect_body] = 20;
          m_costs[vect_epilogue] = 2;
        }
      else
        {
          m_costs[vect_prologue] = 1;
          m_costs[vect_body] = 1;
          m_costs[vect_epilogue] = 1;
        }
    }

I increased the cost for LMUL = 8, but the codegen is odd:

foo:
ble a2,zero,.L12
addiw a5,a2,-1
li a4,30
sext.w t1,a2
bleu a5,a4,.L7
srliw a7,t1,5
slli a7,a7,7
li a4,32
add a7,a7,a0
mv a5,a0
mv a3,a1
vsetvli zero,a4,e32,m8,ta,ma
.L4:
vle32.v v8,0(a5)
vle32.v v16,0(a3)
vadd.vv v8,v8,v16
vse32.v v8,0(a5)
addi a5,a5,128
addi a3,a3,128
bne a5,a7,.L4
andi a2,a2,-32
beq t1,a2,.L14
.L3:
slli a4,a2,32
subw a5,t1,a2
srli a4,a4,32
slli a5,a5,32
slli a4,a4,2
srli a5,a5,32
add a0,a0,a4
add a1,a1,a4
vsetvli a4,a5,e8,m1,ta,ma
vle32.v v8,0(a0)
vle32.v v4,0(a1)
vsetvli a2,zero,e32,m4,ta,ma
vadd.vv v4,v4,v8
vsetvli zero,a5,e32,m4,ta,ma
vse32.v v4,0(a0)
sub a3,a5,a4
beq a5,a4,.L12
slli a4,a4,2
vsetvli zero,a3,e8,m1,ta,ma
add a0,a0,a4
add a1,a1,a4
vle32.v v4,0(a0)
vle32.v v8,0(a1)
vsetvli a2,zero,e32,m4,ta,ma
vadd.vv v4,v4,v8
vsetvli zero,a3,e32,m4,ta,ma
vse32.v v4,0(a0)
.L12:
ret
.L7:
li a2,0
j .L3
.L14:
ret
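
Reading the listing above, the code appears to have two parts: a main loop at .L4 that handles a fixed 32 elements per iteration (the pointers advance by 128 bytes), and a tail starting at .L3 that handles the remaining elements in length-controlled steps. The scalar model below sketches that shape. It is an interpretation of the assembly, not compiler output, and `add_chunked` and `tail_vlmax` (standing in for the maximum vector length available in the tail) are names invented here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

void add_chunked(int32_t *a, const int32_t *b, std::size_t n,
                 std::size_t tail_vlmax)
{
  assert(tail_vlmax > 0);
  constexpr std::size_t kChunk = 32;  // elements per main-loop iteration
  std::size_t i = 0;

  // Main loop: only entered while at least 32 elements remain.
  for (; i + kChunk <= n; i += kChunk)
    for (std::size_t j = 0; j < kChunk; ++j)
      a[i + j] += b[i + j];

  // Tail: each step handles min(remaining, tail_vlmax) elements.
  while (i < n)
    {
      std::size_t vl = std::min(n - i, tail_vlmax);
      for (std::size_t j = 0; j < vl; ++j)
        a[i + j] += b[i + j];
      i += vl;
    }
}
```

The single length-controlled loop hoped for corresponds to dropping the main loop and keeping only the second part.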

I hoped it would generate code like this:

foo:
ble a2,zero,.L5
mv a4,a0
.L3:
vsetvli a5,a2,e32,m4,ta,ma
vle32.v v8,0(a0)
vle32.v v4,0(a1)
vsetvli a6,zero,e32,m4,ta,ma
slli a3,a5,2
vadd.vv v4,v4,v8
sub a2,a2,a5
vsetvli zero,a5,e32,m4,ta,ma
vse32.v v4,0(a4)
add a0,a0,a3
add a1,a1,a3
add a4,a4,a3
bne a2,zero,.L3
.L5:
ret

I am checking whether we can adjust the cost statically so that the loop vectorizer uses LMUL = 4 even though preferred_simd_mode returns LMUL = 8.
If that works, I think we can run the analysis and then adjust the cost according to its result.
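
One possible shape for such an analysis-driven adjustment is sketched below. It is a hypothetical formula, not what the RISC-V backend does: the function name, the `spill_penalty` parameter and the idea of charging per register beyond the 32 available are all assumptions for illustration.

```cpp
#include <cassert>

// Returns the loop-body cost after adding a penalty when the estimated
// register demand (lmul * live_values) exceeds the 32 vector registers.
unsigned adjusted_body_cost(unsigned base_cost, unsigned lmul,
                            unsigned live_values, unsigned spill_penalty)
{
  assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
  constexpr unsigned kVectorRegs = 32;
  unsigned needed = lmul * live_values;
  if (needed <= kVectorRegs)
    return base_cost;  // everything fits, leave the cost alone
  // Charge the penalty once per register that does not fit.
  return base_cost + spill_penalty * (needed - kVectorRegs);
}
```

With a base cost of 10, a penalty of 5 and 6 live values, LMUL = 8 is costed at 90 while LMUL = 4 stays at 10, so a cost comparison would favour LMUL = 4 for that loop only.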

Thanks.


juzhe.zhong@rivai.ai
 

* Re: Question about dynamic choosing vectorization factor for RVV
  2023-08-31  9:52   ` juzhe.zhong
@ 2023-08-31 11:12     ` Richard Sandiford
  2023-08-31 11:20     ` Richard Biener
  1 sibling, 0 replies; 10+ messages in thread
From: Richard Sandiford @ 2023-08-31 11:12 UTC (permalink / raw)
  To: juzhe.zhong; +Cc: rguenther, gcc

"juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> Thanks Richi.
>
> I am trying to figure out how to adjust finish_cost to lower the LMUL
>
> For example:
>
> void
> foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> {
>   for (int i = 0; i < n; i++)
>     a[i] = a[i] + b[i];
> }
>
> preferred_simd_mode pick LMUL = 8 (RVVM8SImode)

But is the LMUL decided by the mode?  Like Richard says, the vectoriser
already provides a way of trying vectorisation with different modes and
picking the best one, via autovectorize_vector_modes, VECT_COMPARE_COST,
and the cost structures.  preferred_simd_mode then just picks the first
mode to try -- the choice isn't final.

The idea is that you get to see what vectorisation looks like with
multiple mode choices, and can pick the best one.

It's not clear from your reply whether you've tried that or not.

Thanks,
Richard


* Re: Re: Question about dynamic choosing vectorization factor for RVV
  2023-08-31  9:52   ` juzhe.zhong
  2023-08-31 11:12     ` Richard Sandiford
@ 2023-08-31 11:20     ` Richard Biener
  2023-08-31 11:24       ` juzhe.zhong
  1 sibling, 1 reply; 10+ messages in thread
From: Richard Biener @ 2023-08-31 11:20 UTC (permalink / raw)
  To: juzhe.zhong; +Cc: gcc, richard.sandiford

On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:

> Thanks Richi.
> 
> I am trying to figure out how to adjust finish_cost to lower the LMUL
> 
> For example:
> 
> void
> foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> {
>   for (int i = 0; i < n; i++)
>     a[i] = a[i] + b[i];
> }
> 
> preferred_simd_mode pick LMUL = 8 (RVVM8SImode)
> 
> Is is possible that we can adjust the COST in finish cost make Loop 
> vectorizer pick LMUL = 4?

I see you have an autovectorize_vector_modes hook and you use
VECT_COMPARE_COSTS.  So the appropriate place would be to
amend your vector_costs::better_main_loop_than_p.
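
As a rough picture of what a pairwise "is this main loop better" comparison could take into account, here is a toy predicate. It does not use GCC's types or the real hook signature; `LoopCandidate`, `fits_register_file` and `better_main_loop` are invented for this sketch, and the two criteria (avoid spilling first, then lower cost per element) are assumptions.

```cpp
#include <cassert>

// Toy description of one vectorized version of a loop.
struct LoopCandidate {
  unsigned lmul;         // 1, 2, 4 or 8
  unsigned live_values;  // peak number of simultaneously live vector values
  unsigned body_cost;    // cost of one vector iteration
};

bool fits_register_file(const LoopCandidate &c)
{
  return c.lmul * c.live_values <= 32;
}

// True if `a` should replace `b` as the main loop.
bool better_main_loop(const LoopCandidate &a, const LoopCandidate &b)
{
  assert(a.lmul > 0 && b.lmul > 0);
  bool a_fits = fits_register_file(a);
  bool b_fits = fits_register_file(b);
  if (a_fits != b_fits)
    return a_fits;  // prefer the candidate that avoids spilling
  // Same element width assumed, so elements per iteration scale with LMUL.
  // Compare body_cost / lmul without division by cross-multiplying.
  return a.body_cost * b.lmul < b.body_cost * a.lmul;
}
```

With 6 live values, an LMUL = 4 candidate beats an LMUL = 8 candidate because only the former fits; with 2 live values and equal body cost, LMUL = 8 wins on cost per element.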

> I am experimenting with this following cost:
> 
>   if (loop_vinfo)
>     {
>       if (loop_vinfo->vector_mode == RVVM8SImode)
>         {
>           m_costs[vect_prologue] = 2;
>           m_costs[vect_body] = 20;
>           m_costs[vect_epilogue] = 2;
>         }
>       else
>         {
>           m_costs[vect_prologue] = 1;
>           m_costs[vect_body] = 1;
>           m_costs[vect_epilogue] = 1;
>         }
>     }
> 
> I increase LMUL = 8 cost. The codegen is odd:
> 
> foo:
> ble a2,zero,.L12
> addiw a5,a2,-1
> li a4,30
> sext.w t1,a2
> bleu a5,a4,.L7
> srliw a7,t1,5
> slli a7,a7,7
> li a4,32
> add a7,a7,a0
> mv a5,a0
> mv a3,a1
> vsetvli zero,a4,e32,m8,ta,ma
> .L4:
> vle32.v v8,0(a5)
> vle32.v v16,0(a3)
> vadd.vv v8,v8,v16
> vse32.v v8,0(a5)
> addi a5,a5,128
> addi a3,a3,128
> bne a5,a7,.L4
> andi a2,a2,-32
> beq t1,a2,.L14
> .L3:
> slli a4,a2,32
> subw a5,t1,a2
> srli a4,a4,32
> slli a5,a5,32
> slli a4,a4,2
> srli a5,a5,32
> add a0,a0,a4
> add a1,a1,a4
> vsetvli a4,a5,e8,m1,ta,ma
> vle32.v v8,0(a0)
> vle32.v v4,0(a1)
> vsetvli a2,zero,e32,m4,ta,ma
> vadd.vv v4,v4,v8
> vsetvli zero,a5,e32,m4,ta,ma
> vse32.v v4,0(a0)
> sub a3,a5,a4
> beq a5,a4,.L12
> slli a4,a4,2
> vsetvli zero,a3,e8,m1,ta,ma
> add a0,a0,a4
> add a1,a1,a4
> vle32.v v4,0(a0)
> vle32.v v8,0(a1)
> vsetvli a2,zero,e32,m4,ta,ma
> vadd.vv v4,v4,v8
> vsetvli zero,a3,e32,m4,ta,ma
> vse32.v v4,0(a0)
> .L12:
> ret
> .L7:
> li a2,0
> j .L3
> .L14:
> ret
> 
> I hope it can generate the code like this:
> 
> foo:
> ble a2,zero,.L5
> mv a4,a0
> .L3:
> vsetvli a5,a2,e32,m4,ta,ma
> vle32.v v8,0(a0)
> vle32.v v4,0(a1)
> vsetvli a6,zero,e32,m4,ta,ma
> slli a3,a5,2
> vadd.vv v4,v4,v8
> sub a2,a2,a5
> vsetvli zero,a5,e32,m4,ta,ma
> vse32.v v4,0(a4)
> add a0,a0,a3
> add a1,a1,a3
> add a4,a4,a3
> bne a2,zero,.L3
> .L5:
> ret
> 
> I am experimenting whether we can adjust cost statically to make loop 
> vectorizer use LMUL = 4 even though preferred_simd_mode return LMUL = 8. 
> If we can do that, I think we can apply analysis and then adjust the 
> cost according to analysis.
>
> Thanks.
> 
> 
> juzhe.zhong@rivai.ai
>  

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


* Re: Re: Question about dynamic choosing vectorization factor for RVV
  2023-08-31 11:20     ` Richard Biener
@ 2023-08-31 11:24       ` juzhe.zhong
  2023-08-31 11:29         ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: juzhe.zhong @ 2023-08-31 11:24 UTC (permalink / raw)
  To: rguenther; +Cc: gcc, richard.sandiford


Hi. Thanks Richard and Richi.

I have now figured out how to choose a smaller LMUL:

void
costs::finish_cost (const vector_costs *scalar_costs)
{
  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
  if (loop_vinfo)
    {
      /* Experiment: make LMUL = 8 and the VLS modes look expensive so
         that the vectorizer prefers another candidate mode.  */
      if (loop_vinfo->vector_mode == RVVM8SImode
          || riscv_v_ext_vls_mode_p (loop_vinfo->vector_mode))
        {
          m_costs[vect_prologue] = 8;
          m_costs[vect_body] = 8;
          m_costs[vect_epilogue] = 8;
        }
      else
        {
          /* Every other mode gets a uniformly low cost.  */
          m_costs[vect_prologue] = 1;
          m_costs[vect_body] = 1;
          m_costs[vect_epilogue] = 1;
        }
    }
  // m_suggested_unroll_factor = 2;
  vector_costs::finish_cost (scalar_costs);
}

The earlier odd codegen was caused by the VLS modes.

With the adjusted cost, I now get LMUL = 4:
vsetvli a5,a2,e32,m4,ta,ma
vle32.v v8,0(a0)
vle32.v v4,0(a1)
vsetvli a6,zero,e32,m4,ta,ma
slli a3,a5,2
vadd.vv v4,v4,v8
sub a2,a2,a5
vsetvli zero,a5,e32,m4,ta,ma
vse32.v v4,0(a4)
add a0,a0,a3
add a1,a1,a3
add a4,a4,a3
bne a2,zero,.L3

The architecture of the GCC vector cost model is fantastic!

Thanks a lot.


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-08-31 19:20
To: juzhe.zhong@rivai.ai
CC: gcc; richard.sandiford
Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:
 
> Thanks Richi.
> 
> I am trying to figure out how to adjust finish_cost to lower the LMUL
> 
> For example:
> 
> void
> foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> {
>   for (int i = 0; i < n; i++)
>     a[i] = a[i] + b[i];
> }
> 
> preferred_simd_mode pick LMUL = 8 (RVVM8SImode)
> 
> Is is possible that we can adjust the COST in finish cost make Loop 
> vectorizer pick LMUL = 4?
 
I see you have a autovectorize_vector_modes hook and you use
VECT_COMPARE_COSTS.  So the appropriate place would be to
amend your vector_costs::better_main_loop_than_p.
 
> I am experimenting with this following cost:
> 
>   if (loop_vinfo)
>     {
>       if (loop_vinfo->vector_mode == RVVM8SImode)
>         {
>           m_costs[vect_prologue] = 2;
>           m_costs[vect_body] = 20;
>           m_costs[vect_epilogue] = 2;
>         }
>       else
>         {
>           m_costs[vect_prologue] = 1;
>           m_costs[vect_body] = 1;
>           m_costs[vect_epilogue] = 1;
>         }
>     }
> 
> I increase LMUL = 8 cost. The codegen is odd:
> 
> foo:
> ble a2,zero,.L12
> addiw a5,a2,-1
> li a4,30
> sext.w t1,a2
> bleu a5,a4,.L7
> srliw a7,t1,5
> slli a7,a7,7
> li a4,32
> add a7,a7,a0
> mv a5,a0
> mv a3,a1
> vsetvli zero,a4,e32,m8,ta,ma
> .L4:
> vle32.v v8,0(a5)
> vle32.v v16,0(a3)
> vadd.vv v8,v8,v16
> vse32.v v8,0(a5)
> addi a5,a5,128
> addi a3,a3,128
> bne a5,a7,.L4
> andi a2,a2,-32
> beq t1,a2,.L14
> .L3:
> slli a4,a2,32
> subw a5,t1,a2
> srli a4,a4,32
> slli a5,a5,32
> slli a4,a4,2
> srli a5,a5,32
> add a0,a0,a4
> add a1,a1,a4
> vsetvli a4,a5,e8,m1,ta,ma
> vle32.v v8,0(a0)
> vle32.v v4,0(a1)
> vsetvli a2,zero,e32,m4,ta,ma
> vadd.vv v4,v4,v8
> vsetvli zero,a5,e32,m4,ta,ma
> vse32.v v4,0(a0)
> sub a3,a5,a4
> beq a5,a4,.L12
> slli a4,a4,2
> vsetvli zero,a3,e8,m1,ta,ma
> add a0,a0,a4
> add a1,a1,a4
> vle32.v v4,0(a0)
> vle32.v v8,0(a1)
> vsetvli a2,zero,e32,m4,ta,ma
> vadd.vv v4,v4,v8
> vsetvli zero,a3,e32,m4,ta,ma
> vse32.v v4,0(a0)
> .L12:
> ret
> .L7:
> li a2,0
> j .L3
> .L14:
> ret
> 
> I hope it can generate the code like this:
> 
> foo:
> ble a2,zero,.L5
> mv a4,a0
> .L3:
> vsetvli a5,a2,e32,m4,ta,ma
> vle32.v v8,0(a0)
> vle32.v v4,0(a1)
> vsetvli a6,zero,e32,m4,ta,ma
> slli a3,a5,2
> vadd.vv v4,v4,v8
> sub a2,a2,a5
> vsetvli zero,a5,e32,m4,ta,ma
> vse32.v v4,0(a4)
> add a0,a0,a3
> add a1,a1,a3
> add a4,a4,a3
> bne a2,zero,.L3
> .L5:
> ret
> 
> I am experimenting whether we can adjust cost statically to make loop 
> vectorizer use LMUL = 4 even though preferred_simd_mode return LMUL = 8. 
> If we can do that, I think we can apply analysis and then adjust the 
> cost according to analysis.
>
> Thanks.
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-08-31 15:38
> To: juzhe.zhong@rivai.ai
> CC: gcc; richard.sandiford
> Subject: Re: Question about dynamic choosing vectorization factor for RVV
> On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:
>  
> > Hi, Richard and Richi.
> > 
> > Currently, we are statically returning vectorization factor in 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
> > according to compile option.
> > 
> > For example:
> > void
> > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > {
> >   for (int i = 0; i < n; i++)
> >     a[i] = a[i] + b[i];
> > }
> > 
> > with --param=riscv-autovec-lmul = m1:
> > 
> > vsetvli a5,a2,e32,m1,ta,ma
> > vle32.v v2,0(a0)
> > vle32.v v1,0(a1)
> > vsetvli a6,zero,e32,m1,ta,ma
> > slli a3,a5,2
> > vadd.vv v1,v1,v2
> > sub a2,a2,a5
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a4)
> > add a0,a0,a3
> > add a1,a1,a3
> > add a4,a4,a3
> > bne a2,zero,.L3
> > 
> > The 'vadd.vv' is only performing operations on a single register.
> > 
> > with --param=riscv-autovec-lmul=m8:
> > 
> >   vsetvli a5,a2,e8,m2,ta,ma
> >   vle32.v v16,0(a0)
> >   vle32.v v8,0(a1)
> >   vsetvli a6,zero,e32,m8,ta,ma
> >   slli a3,a5,2
> >   vadd.vv v8,v8,v16
> >   vsetvli zero,a2,e32,m8,ta,ma
> >   sub a2,a2,a5
> >   vse32.v v8,0(a4)
> >   add a0,a0,a3
> >   add a1,a1,a3
> >   add a4,a4,a3
> >   bne a2,zero,.L3
> > 
> > The 'vadd.vv' here is performing operations on 8 consecutive registers:
> > 
> > vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]
> > 
> > Users statically set the vectorization factor is not ideal.
> > 
> > We want GCC to dynamic choose vectorization factor to do the auto-vectorization according to loop analysis.
> > 
> > Currently, I have implement simplistic loop analysis like analyze live range of each local decl of current function.
> > 
> > Here is the analysis, we have 32 vector registers for RVV.
> > So we calculate the live range of current function local decl:
> > 
> > the number of decls live at the same time * LMUL <= 32. 
> > According to this analysis, I set the vectorization factor in TARGET_VECTORIZE_PREFERRED_SIMD_MODE
> > 
> > Then this simplistic algorithm (implemented in RISC-V backend) work well for the testcases I produces.
> > 
> > However, I can only choose optimal vectorization for whole function but failed to specific loop.
> > 
> > Here is the example:
> > 
> > void foo2 (int32_t *__restrict a,
> >           int32_t *__restrict b,
> >           int32_t *__restrict c,
> >           int32_t *__restrict a2,
> >           int32_t *__restrict b2,
> >           int32_t *__restrict c2,
> >           int32_t *__restrict a3,
> >           int32_t *__restrict b3,
> >           int32_t *__restrict c3,
> >           int32_t *__restrict a4,
> >           int32_t *__restrict b4,
> >           int32_t *__restrict c4,
> >           int32_t *__restrict a5,
> >           int32_t *__restrict b5,
> >           int32_t *__restrict c5,
> >           int n)
> > {
> > // Loop 1
> >     for (int i = 0; i < n; i++)
> >        a[i] = a[i] + b[i];
> > // Loop 2
> >     for (int i = 0; i < n; i++){
> >       a[i] = b[i] + c[i];
> >       a2[i] = b2[i] + c2[i];
> >       a3[i] = b3[i] + c3[i];
> >       a4[i] = b4[i] + c4[i];
> >       a5[i] = a[i] + a4[i];
> >       a[i] = a3[i] + a2[i]+ a5[i];
> >     }
> > }
> > 
> > For Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should use LMUL = 4 (since LMUL = 8 would cause vector register spills).
> > 
> > If I split Loop 1 and Loop 2 into two separate functions, my algorithm works well.
> > 
> > However, if the two loops are in the same function, I end up picking LMUL = 4 for both, because, as I said above,
> > the analysis is done per function rather than per loop.
> > 
> > I am struggling to find a good solution for this issue. Can we pass loop_vec_info
> > to the 'preferred_simd_mode' target hook?
>  
> That's not how it's currently designed to work - there's
> the autovectorize_vector_modes hook where you should provide a vector
> of modes the vectorizer iterates over and return VECT_COMPARE_COSTS
> if you want to evaluate costs between choices.  Your analysis should
> then happen in the finish_cost method.
>  
> That's how it's currently designed.  It might not be optimal for
> compile-time reasons when there are many modes, giving the target
> more control (and context) might be possible.
>  
> Richard.
>  
> > Thanks.
> > 
> > 
> > juzhe.zhong@rivai.ai
> > 
>  
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Re: Question about dynamic choosing vectorization factor for RVV
  2023-08-31 11:24       ` juzhe.zhong
@ 2023-08-31 11:29         ` Richard Biener
  2023-08-31 11:33           ` juzhe.zhong
  2023-08-31 11:54           ` juzhe.zhong
  0 siblings, 2 replies; 10+ messages in thread
From: Richard Biener @ 2023-08-31 11:29 UTC (permalink / raw)
  To: juzhe.zhong; +Cc: gcc, richard.sandiford

On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:

> Hi. Thanks Richard and Richi.
> 
> I have now figured out how to choose a smaller LMUL.
> 
> void
> costs::finish_cost (const vector_costs *scalar_costs)
> {
>   loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>   if (loop_vinfo)
>     {
>       if (loop_vinfo->vector_mode == RVVM8SImode
>           || riscv_v_ext_vls_mode_p (loop_vinfo->vector_mode))
>         {
>           m_costs[vect_prologue] = 8;
>           m_costs[vect_body] = 8;
>           m_costs[vect_epilogue] = 8;
>         }
>       else
>         {
>           m_costs[vect_prologue] = 1;
>           m_costs[vect_body] = 1;
>           m_costs[vect_epilogue] = 1;
>         }
>     }
>   // m_suggested_unroll_factor = 2;
>   vector_costs::finish_cost (scalar_costs);
> }

I don't think that's a "good" use of the API.

> The odd codegen I saw previously was caused by the VLS modes.
> 
> Now I can get LMUL = 4 by adjusting the cost:
> vsetvli a5,a2,e32,m4,ta,ma
> vle32.v v8,0(a0)
> vle32.v v4,0(a1)
> vsetvli a6,zero,e32,m4,ta,ma
> slli a3,a5,2
> vadd.vv v4,v4,v8
> sub a2,a2,a5
> vsetvli zero,a5,e32,m4,ta,ma
> vse32.v v4,0(a4)
> add a0,a0,a3
> add a1,a1,a3
> add a4,a4,a3
> bne a2,zero,.L3
> 
> Fantastic architecture of GCC Vector Cost model!
> 
> Thanks a lot.
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-08-31 19:20
> To: juzhe.zhong@rivai.ai
> CC: gcc; richard.sandiford
> Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
> On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:
>  
> > Thanks Richi.
> > 
> > I am trying to figure out how to adjust finish_cost to lower the LMUL
> > 
> > For example:
> > 
> > void
> > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > {
> >   for (int i = 0; i < n; i++)
> >     a[i] = a[i] + b[i];
> > }
> > 
> > preferred_simd_mode picks LMUL = 8 (RVVM8SImode).
> > 
> > Is it possible to adjust the cost in finish_cost so that the loop
> > vectorizer picks LMUL = 4?
>  
> I see you have an autovectorize_vector_modes hook and you use
> VECT_COMPARE_COSTS.  So the appropriate place would be to
> amend your vector_costs::better_main_loop_than_p.
>  
> > I am experimenting with the following cost:
> > 
> >   if (loop_vinfo)
> >     {
> >       if (loop_vinfo->vector_mode == RVVM8SImode)
> >         {
> >           m_costs[vect_prologue] = 2;
> >           m_costs[vect_body] = 20;
> >           m_costs[vect_epilogue] = 2;
> >         }
> >       else
> >         {
> >           m_costs[vect_prologue] = 1;
> >           m_costs[vect_body] = 1;
> >           m_costs[vect_epilogue] = 1;
> >         }
> >     }
> > 
> > I increased the cost for LMUL = 8, and the resulting codegen is odd:
> > 
> > foo:
> > ble a2,zero,.L12
> > addiw a5,a2,-1
> > li a4,30
> > sext.w t1,a2
> > bleu a5,a4,.L7
> > srliw a7,t1,5
> > slli a7,a7,7
> > li a4,32
> > add a7,a7,a0
> > mv a5,a0
> > mv a3,a1
> > vsetvli zero,a4,e32,m8,ta,ma
> > .L4:
> > vle32.v v8,0(a5)
> > vle32.v v16,0(a3)
> > vadd.vv v8,v8,v16
> > vse32.v v8,0(a5)
> > addi a5,a5,128
> > addi a3,a3,128
> > bne a5,a7,.L4
> > andi a2,a2,-32
> > beq t1,a2,.L14
> > .L3:
> > slli a4,a2,32
> > subw a5,t1,a2
> > srli a4,a4,32
> > slli a5,a5,32
> > slli a4,a4,2
> > srli a5,a5,32
> > add a0,a0,a4
> > add a1,a1,a4
> > vsetvli a4,a5,e8,m1,ta,ma
> > vle32.v v8,0(a0)
> > vle32.v v4,0(a1)
> > vsetvli a2,zero,e32,m4,ta,ma
> > vadd.vv v4,v4,v8
> > vsetvli zero,a5,e32,m4,ta,ma
> > vse32.v v4,0(a0)
> > sub a3,a5,a4
> > beq a5,a4,.L12
> > slli a4,a4,2
> > vsetvli zero,a3,e8,m1,ta,ma
> > add a0,a0,a4
> > add a1,a1,a4
> > vle32.v v4,0(a0)
> > vle32.v v8,0(a1)
> > vsetvli a2,zero,e32,m4,ta,ma
> > vadd.vv v4,v4,v8
> > vsetvli zero,a3,e32,m4,ta,ma
> > vse32.v v4,0(a0)
> > .L12:
> > ret
> > .L7:
> > li a2,0
> > j .L3
> > .L14:
> > ret
> > 
> > I would like it to generate code like this:
> > 
> > foo:
> > ble a2,zero,.L5
> > mv a4,a0
> > .L3:
> > vsetvli a5,a2,e32,m4,ta,ma
> > vle32.v v8,0(a0)
> > vle32.v v4,0(a1)
> > vsetvli a6,zero,e32,m4,ta,ma
> > slli a3,a5,2
> > vadd.vv v4,v4,v8
> > sub a2,a2,a5
> > vsetvli zero,a5,e32,m4,ta,ma
> > vse32.v v4,0(a4)
> > add a0,a0,a3
> > add a1,a1,a3
> > add a4,a4,a3
> > bne a2,zero,.L3
> > .L5:
> > ret
> > 
> > I am experimenting to see whether we can adjust the cost statically so that the loop
> > vectorizer uses LMUL = 4 even though preferred_simd_mode returns LMUL = 8.
> > If that works, I think we can run the analysis and then adjust the
> > cost according to its result.
> >
> > Thanks.
> > 
> > 
> > juzhe.zhong@rivai.ai
> >  
> > From: Richard Biener
> > Date: 2023-08-31 15:38
> > To: juzhe.zhong@rivai.ai
> > CC: gcc; richard.sandiford
> > Subject: Re: Question about dynamic choosing vectorization factor for RVV
> > On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:
> >  
> > > Hi, Richard and Richi.
> > > 
> > > Currently, we are statically returning vectorization factor in 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
> > > according to compile option.
> > > 
> > > For example:
> > > void
> > > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > > {
> > >   for (int i = 0; i < n; i++)
> > >     a[i] = a[i] + b[i];
> > > }
> > > 
> > > with --param=riscv-autovec-lmul=m1:
> > > 
> > > vsetvli a5,a2,e32,m1,ta,ma
> > > vle32.v v2,0(a0)
> > > vle32.v v1,0(a1)
> > > vsetvli a6,zero,e32,m1,ta,ma
> > > slli a3,a5,2
> > > vadd.vv v1,v1,v2
> > > sub a2,a2,a5
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vse32.v v1,0(a4)
> > > add a0,a0,a3
> > > add a1,a1,a3
> > > add a4,a4,a3
> > > bne a2,zero,.L3
> > > 
> > > The 'vadd.vv' is only performing operations on a single register.
> > > 
> > > with --param=riscv-autovec-lmul=m8:
> > > 
> > >   vsetvli a5,a2,e8,m2,ta,ma
> > >   vle32.v v16,0(a0)
> > >   vle32.v v8,0(a1)
> > >   vsetvli a6,zero,e32,m8,ta,ma
> > >   slli a3,a5,2
> > >   vadd.vv v8,v8,v16
> > >   vsetvli zero,a2,e32,m8,ta,ma
> > >   sub a2,a2,a5
> > >   vse32.v v8,0(a4)
> > >   add a0,a0,a3
> > >   add a1,a1,a3
> > >   add a4,a4,a3
> > >   bne a2,zero,.L3
> > > 
> > > The 'vadd.vv' here is performing operations on 8 consecutive registers:
> > > 
> > > vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]
> > > 
> > > Having users set the vectorization factor statically is not ideal.
> > > 
> > > We want GCC to choose the vectorization factor dynamically during auto-vectorization, based on loop analysis.
> > > 
> > > Currently, I have implemented a simplistic analysis that computes the live range of each local decl in the current function.
> > > 
> > > RVV has 32 vector registers, so the analysis checks the following constraint
> > > against the live ranges of the current function's local decls:
> > > 
> > > the number of decls live at the same time * LMUL <= 32.
> > > Based on this analysis, I set the vectorization factor in TARGET_VECTORIZE_PREFERRED_SIMD_MODE.
> > > 
> > > This simplistic algorithm (implemented in the RISC-V backend) works well for the testcases I wrote.
> > > 
> > > However, I can only choose the optimal vectorization factor for a whole function, not for a specific loop.
> > > 
> > > Here is the example:
> > > 
> > > void foo2 (int32_t *__restrict a,
> > >           int32_t *__restrict b,
> > >           int32_t *__restrict c,
> > >           int32_t *__restrict a2,
> > >           int32_t *__restrict b2,
> > >           int32_t *__restrict c2,
> > >           int32_t *__restrict a3,
> > >           int32_t *__restrict b3,
> > >           int32_t *__restrict c3,
> > >           int32_t *__restrict a4,
> > >           int32_t *__restrict b4,
> > >           int32_t *__restrict c4,
> > >           int32_t *__restrict a5,
> > >           int32_t *__restrict b5,
> > >           int32_t *__restrict c5,
> > >           int n)
> > > {
> > > // Loop 1
> > >     for (int i = 0; i < n; i++)
> > >        a[i] = a[i] + b[i];
> > > // Loop 2
> > >     for (int i = 0; i < n; i++){
> > >       a[i] = b[i] + c[i];
> > >       a2[i] = b2[i] + c2[i];
> > >       a3[i] = b3[i] + c3[i];
> > >       a4[i] = b4[i] + c4[i];
> > >       a5[i] = a[i] + a4[i];
> > >       a[i] = a3[i] + a2[i]+ a5[i];
> > >     }
> > > }
> > > 
> > > For Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should use LMUL = 4 (since LMUL = 8 would cause vector register spills).
> > > 
> > > If I split Loop 1 and Loop 2 into two separate functions, my algorithm works well.
> > > 
> > > However, if the two loops are in the same function, I end up picking LMUL = 4 for both, because, as I said above,
> > > the analysis is done per function rather than per loop.
> > > 
> > > I am struggling to find a good solution for this issue. Can we pass loop_vec_info
> > > to the 'preferred_simd_mode' target hook?
> >  
> > That's not how it's currently designed to work - there's
> > the autovectorize_vector_modes hook where you should provide a vector
> > of modes the vectorizer iterates over and return VECT_COMPARE_COSTS
> > if you want to evaluate costs between choices.  Your analysis should
> > then happen in the finish_cost method.
> >  
> > That's how it's currently designed.  It might not be optimal for
> > compile-time reasons when there are many modes, giving the target
> > more control (and context) might be possible.
> >  
> > Richard.
> >  
> > > Thanks.
> > > 
> > > 
> > > juzhe.zhong@rivai.ai
> > > 
> >  
> > 
>  
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Re: Question about dynamic choosing vectorization factor for RVV
  2023-08-31 11:29         ` Richard Biener
@ 2023-08-31 11:33           ` juzhe.zhong
  2023-08-31 11:54           ` juzhe.zhong
  1 sibling, 0 replies; 10+ messages in thread
From: juzhe.zhong @ 2023-08-31 11:33 UTC (permalink / raw)
  To: rguenther; +Cc: gcc, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 10146 bytes --]

Hi, Richi.

>> I don't think that's "good" use of the API.
Do you mean I should use 'better_main_loop_than_p' instead?
Yes, I plan to use it.
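
For illustration only, here is a minimal standalone sketch of the kind of
comparison such a hook could make. It does not use GCC's actual vector_costs
interface: 'candidate', 'fits_in_registers', 'better_than' and 'pick_lmul' are
hypothetical names, and the only rule applied is the 32-register budget from
my first mail (live values * LMUL <= 32). It also assumes the maximum number
of simultaneously live vector values has already been computed per loop.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one vectorization candidate for a single loop.
// This is NOT GCC's vector_costs class, only a toy stand-in.
struct candidate
{
  int lmul;      // register group size: 1, 2, 4 or 8
  int max_live;  // max number of vector values live at once in the loop
};

// RVV has 32 vector registers; each live value occupies LMUL of them.
constexpr int NUM_VECTOR_REGS = 32;

bool fits_in_registers (const candidate &c)
{
  return c.max_live * c.lmul <= NUM_VECTOR_REGS;
}

// Return true if A should be preferred over B.
// A candidate that fits in the register file beats one that would spill.
// Among fitting candidates the larger LMUL wins; among spilling ones
// the smaller LMUL wins.
bool better_than (const candidate &a, const candidate &b)
{
  bool a_fits = fits_in_registers (a);
  bool b_fits = fits_in_registers (b);
  if (a_fits != b_fits)
    return a_fits;
  return a_fits ? a.lmul > b.lmul : a.lmul < b.lmul;
}

// Pick the preferred LMUL for a loop with MAX_LIVE simultaneously live
// vector values by comparing all four candidates pairwise.
int pick_lmul (int max_live)
{
  assert (max_live > 0);
  std::vector<candidate> cands
    = { {8, max_live}, {4, max_live}, {2, max_live}, {1, max_live} };
  candidate best = cands[0];
  for (const candidate &c : cands)
    if (better_than (c, best))
      best = c;
  return best.lmul;
}
```

Under these assumptions a loop that keeps up to four vector values live gets
LMUL = 8, and one that keeps five to eight values live gets LMUL = 4. The point
is that the decision is made per loop, not per function.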

Thanks.


juzhe.zhong@rivai.ai
 
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Re: Question about dynamic choosing vectorization factor for RVV
  2023-08-31 11:29         ` Richard Biener
  2023-08-31 11:33           ` juzhe.zhong
@ 2023-08-31 11:54           ` juzhe.zhong
  2023-08-31 12:01             ` Richard Biener
  1 sibling, 1 reply; 10+ messages in thread
From: juzhe.zhong @ 2023-08-31 11:54 UTC (permalink / raw)
  To: rguenther; +Cc: gcc, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 10563 bytes --]

Hi, Richi.

  /* Keep track of the VF for each mode.  Initialize all to 0 which indicates
     a mode has not been analyzed.  */
  auto_vec<poly_uint64, 8> cached_vf_per_mode;
  for (unsigned i = 0; i < vector_modes.length (); ++i)
    cached_vf_per_mode.safe_push (0);

I saw the code above in the vectorizer:
'cached_vf_per_mode' is declared with an inline capacity of 8.

But for RVV, I will need to push the following modes:

RVVM8QI, RVVM4QI, RVVM2QI, RVVM1QI, V128QI, V64QI, V32QI, V16QI, V8QI, V4QI, V2QI

There are 11 modes.
Should I increase the number from 8 to 11?

Thanks.


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-08-31 19:29
To: juzhe.zhong@rivai.ai
CC: gcc; richard.sandiford
Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:
 
> Hi. Thanks Richard and Richi.
> 
> Now, I figure out how to choose smaller LMUL now.
> 
> void
> costs::finish_cost (const vector_costs *scalar_costs)
> {
>   loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>   if (loop_vinfo)
>     {
>       if (loop_vinfo->vector_mode == RVVM8SImode
>       || riscv_v_ext_vls_mode_p (loop_vinfo->vector_mode))
>         {
>           m_costs[vect_prologue] = 8;
>           m_costs[vect_body] = 8;
>           m_costs[vect_epilogue] = 8;
>         }
>       else
>         {
>           m_costs[vect_prologue] = 1;
>           m_costs[vect_body] = 1;
>           m_costs[vect_epilogue] = 1;
>         }
>     }
>    // m_suggested_unroll_factor = 2;
>   vector_costs::finish_cost (scalar_costs);
> }
 
I don't think that's "good" use of the API.
 
> Previous odd codes are because of VLS modes
> 
> Now, I can get the LMUL = 4 by adjusting cost.
> vsetvli a5,a2,e32,m4,ta,ma
> vle32.v v8,0(a0)
> vle32.v v4,0(a1)
> vsetvli a6,zero,e32,m4,ta,ma
> slli a3,a5,2
> vadd.vv v4,v4,v8
> sub a2,a2,a5
> vsetvli zero,a5,e32,m4,ta,ma
> vse32.v v4,0(a4)
> add a0,a0,a3
> add a1,a1,a3
> add a4,a4,a3
> bne a2,zero,.L3
> 
> Fantastic architecture of GCC Vector Cost model!
> 
> Thanks a lot.
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-08-31 19:20
> To: juzhe.zhong@rivai.ai
> CC: gcc; richard.sandiford
> Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
> On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:
>  
> > Thanks Richi.
> > 
> > I am trying to figure out how to adjust finish_cost to lower the LMUL
> > 
> > For example:
> > 
> > void
> > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > {
> >   for (int i = 0; i < n; i++)
> >     a[i] = a[i] + b[i];
> > }
> > 
> > preferred_simd_mode pick LMUL = 8 (RVVM8SImode)
> > 
> > Is is possible that we can adjust the COST in finish cost make Loop 
> > vectorizer pick LMUL = 4?
>  
> I see you have a autovectorize_vector_modes hook and you use
> VECT_COMPARE_COSTS.  So the appropriate place would be to
> amend your vector_costs::better_main_loop_than_p.
>  
> > I am experimenting with this following cost:
> > 
> >   if (loop_vinfo)
> >     {
> >       if (loop_vinfo->vector_mode == RVVM8SImode)
> >         {
> >           m_costs[vect_prologue] = 2;
> >           m_costs[vect_body] = 20;
> >           m_costs[vect_epilogue] = 2;
> >         }
> >       else
> >         {
> >           m_costs[vect_prologue] = 1;
> >           m_costs[vect_body] = 1;
> >           m_costs[vect_epilogue] = 1;
> >         }
> >     }
> > 
> > I increase LMUL = 8 cost. The codegen is odd:
> > 
> > foo:
> > ble a2,zero,.L12
> > addiw a5,a2,-1
> > li a4,30
> > sext.w t1,a2
> > bleu a5,a4,.L7
> > srliw a7,t1,5
> > slli a7,a7,7
> > li a4,32
> > add a7,a7,a0
> > mv a5,a0
> > mv a3,a1
> > vsetvli zero,a4,e32,m8,ta,ma
> > .L4:
> > vle32.v v8,0(a5)
> > vle32.v v16,0(a3)
> > vadd.vv v8,v8,v16
> > vse32.v v8,0(a5)
> > addi a5,a5,128
> > addi a3,a3,128
> > bne a5,a7,.L4
> > andi a2,a2,-32
> > beq t1,a2,.L14
> > .L3:
> > slli a4,a2,32
> > subw a5,t1,a2
> > srli a4,a4,32
> > slli a5,a5,32
> > slli a4,a4,2
> > srli a5,a5,32
> > add a0,a0,a4
> > add a1,a1,a4
> > vsetvli a4,a5,e8,m1,ta,ma
> > vle32.v v8,0(a0)
> > vle32.v v4,0(a1)
> > vsetvli a2,zero,e32,m4,ta,ma
> > vadd.vv v4,v4,v8
> > vsetvli zero,a5,e32,m4,ta,ma
> > vse32.v v4,0(a0)
> > sub a3,a5,a4
> > beq a5,a4,.L12
> > slli a4,a4,2
> > vsetvli zero,a3,e8,m1,ta,ma
> > add a0,a0,a4
> > add a1,a1,a4
> > vle32.v v4,0(a0)
> > vle32.v v8,0(a1)
> > vsetvli a2,zero,e32,m4,ta,ma
> > vadd.vv v4,v4,v8
> > vsetvli zero,a3,e32,m4,ta,ma
> > vse32.v v4,0(a0)
> > .L12:
> > ret
> > .L7:
> > li a2,0
> > j .L3
> > .L14:
> > ret
> > 
> > I hope it can generate the code like this:
> > 
> > foo:
> > ble a2,zero,.L5
> > mv a4,a0
> > .L3:
> > vsetvli a5,a2,e32,m4,ta,ma
> > vle32.v v8,0(a0)
> > vle32.v v4,0(a1)
> > vsetvli a6,zero,e32,m4,ta,ma
> > slli a3,a5,2
> > vadd.vv v4,v4,v8
> > sub a2,a2,a5
> > vsetvli zero,a5,e32,m4,ta,ma
> > vse32.v v4,0(a4)
> > add a0,a0,a3
> > add a1,a1,a3
> > add a4,a4,a3
> > bne a2,zero,.L3
> > .L5:
> > ret
> > 
> > I am experimenting whether we can adjust cost statically to make loop 
> > vectorizer use LMUL = 4 even though preferred_simd_mode return LMUL = 8. 
> > If we can do that, I think we can apply analysis and then adjust the 
> > cost according to analysis.
> >
> > Thanks.
> > 
> > 
> > juzhe.zhong@rivai.ai
> >  
> > From: Richard Biener
> > Date: 2023-08-31 15:38
> > To: juzhe.zhong@rivai.ai
> > CC: gcc; richard.sandiford
> > Subject: Re: Question about dynamic choosing vectorization factor for RVV
> > On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:
> >  
> > > Hi, Richard and Richi.
> > > 
> > > Currently, we are statically returning vectorization factor in 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
> > > according to compile option.
> > > 
> > > For example:
> > > void
> > > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > > {
> > >   for (int i = 0; i < n; i++)
> > >     a[i] = a[i] + b[i];
> > > }
> > > 
> > > with --param=riscv-autovec-lmul = m1:
> > > 
> > > vsetvli a5,a2,e32,m1,ta,ma
> > > vle32.v v2,0(a0)
> > > vle32.v v1,0(a1)
> > > vsetvli a6,zero,e32,m1,ta,ma
> > > slli a3,a5,2
> > > vadd.vv v1,v1,v2
> > > sub a2,a2,a5
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vse32.v v1,0(a4)
> > > add a0,a0,a3
> > > add a1,a1,a3
> > > add a4,a4,a3
> > > bne a2,zero,.L3
> > > 
> > > The 'vadd.vv' is only performing operations on a single register.
> > > 
> > > with --param=riscv-autovec-lmul=m8:
> > > 
> > >   vsetvli a5,a2,e8,m2,ta,ma
> > >   vle32.v v16,0(a0)
> > >   vle32.v v8,0(a1)
> > >   vsetvli a6,zero,e32,m8,ta,ma
> > >   slli a3,a5,2
> > >   vadd.vv v8,v8,v16
> > >   vsetvli zero,a2,e32,m8,ta,ma
> > >   sub a2,a2,a5
> > >   vse32.v v8,0(a4)
> > >   add a0,a0,a3
> > >   add a1,a1,a3
> > >   add a4,a4,a3
> > >   bne a2,zero,.L3
> > > 
> > > The 'vadd.vv' here is performing operations on 8 consecutive registers:
> > > 
> > > vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]
> > > 
> > > Users statically set the vectorization factor is not ideal.
> > > 
> > > We want GCC to dynamic choose vectorization factor to do the auto-vectorization according to loop analysis.
> > > 
> > > Currently, I have implement simplistic loop analysis like analyze live range of each local decl of current function.
> > > 
> > > Here is the analysis, we have 32 vector registers for RVV.
> > > So we calculate the live range of current function local decl:
> > > 
> > > the number of decls live at the same time * LMUL <= 32. 
> > > According to this analysis, I set the vectorization factor in TARGET_VECTORIZE_PREFERRED_SIMD_MODE
> > > 
> > > Then this simplistic algorithm (implemented in RISC-V backend) work well for the testcases I produces.
> > > 
> > > However, I can only choose optimal vectorization for whole function but failed to specific loop.
> > > 
> > > Here is the example:
> > > 
> > > void foo2 (int32_t *__restrict a,
> > >           int32_t *__restrict b,
> > >           int32_t *__restrict c,
> > >           int32_t *__restrict a2,
> > >           int32_t *__restrict b2,
> > >           int32_t *__restrict c2,
> > >           int32_t *__restrict a3,
> > >           int32_t *__restrict b3,
> > >           int32_t *__restrict c3,
> > >           int32_t *__restrict a4,
> > >           int32_t *__restrict b4,
> > >           int32_t *__restrict c4,
> > >           int32_t *__restrict a5,
> > >           int32_t *__restrict b5,
> > >           int32_t *__restrict c5,
> > >           int n)
> > > {
> > > // Loop 1
> > >     for (int i = 0; i < n; i++)
> > >        a[i] = a[i] + b[i];
> > > // Loop 2
> > >     for (int i = 0; i < n; i++){
> > >       a[i] = b[i] + c[i];
> > >       a2[i] = b2[i] + c2[i];
> > >       a3[i] = b3[i] + c3[i];
> > >       a4[i] = b4[i] + c4[i];
> > >       a5[i] = a[i] + a4[i];
> > >       a[i] = a3[i] + a2[i]+ a5[i];
> > >     }
> > > }
> > > 
> > > Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should choose LMUL = 4 (since LMUL = 8 will cause vector register spillings).
> > > 
> > > If I split loop 1 and loop 2 into 2 separate functions, my algorithm works well.
> > > 
> > > However, if we put these 2 loop in the same function, I finally pick LMUL = 4 for both loop 1 and loop 2 since as I said above, I do the analysis base on function not base
> > > on the loop.
> > > 
> > > I am struggling whether we could have a good idea for such issue. Can we pass through loop_vec_info
> > > to 'preferred_simd_mode' target hook?
> >  
> > That's not how it's currently designed to work - there's
> > the autovectorize_vector_modes hook where you should provide a vector
> > of modes the vectorizer iterates over and return VECT_COMPARE_COST
> > if you want to evaluate costs between choices.  Your analysis should
> > then happen in the finish_cost method.
> >  
> > That's how it's currently designed.  It might not be optimal for
> > compile-time reasons when there are many modes; giving the target
> > more control (and context) might be possible.
> >  
> > Richard.
> >  
> > > Thanks.
> > > 
> > > 
> > > juzhe.zhong@rivai.ai
> > > 
> >  
> > 
>  
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
 


* Re: Re: Question about dynamic choosing vectorization factor for RVV
  2023-08-31 11:54           ` juzhe.zhong
@ 2023-08-31 12:01             ` Richard Biener
  0 siblings, 0 replies; 10+ messages in thread
From: Richard Biener @ 2023-08-31 12:01 UTC (permalink / raw)
  To: juzhe.zhong; +Cc: gcc, richard.sandiford

On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:

> Hi, Richi.
> 
>   /* Keep track of the VF for each mode.  Initialize all to 0 which indicates
>      a mode has not been analyzed.  */
>   auto_vec<poly_uint64, 8> cached_vf_per_mode;
>   for (unsigned i = 0; i < vector_modes.length (); ++i)
>     cached_vf_per_mode.safe_push (0);
> 
> I saw the code above:
> 'cached_vf_per_mode' is allocated with size 8.
> 
> But for RVV, I will need to push the following modes:
> 
> RVVM8QI, RVVM4QI, RVVM2QI, RVVM1QI, V128QI, V64QI, V32QI, V16QI, V8QI, V4QI, V2QI
> 
> There are 11 modes.
> Should I increase the number from 8 to 11?

It will just perform dynamic allocation, no need to adjust.
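
For illustration only, here is a toy vector with N inline slots that moves to
heap storage once more than N elements are pushed.  It is a hypothetical
stand-in and not GCC's auto_vec; the names small_vec and
spills_to_heap_after are made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <type_traits>

// Hypothetical toy, NOT GCC's auto_vec: N elements live inline, and the
// storage moves to the heap when that capacity is exceeded.
template <typename T, std::size_t N>
class small_vec
{
  static_assert (N > 0, "toy needs at least one inline slot");
  static_assert (std::is_trivially_copyable<T>::value,
                 "toy only handles trivially copyable types");

public:
  small_vec () : m_data (m_inline), m_len (0), m_cap (N) {}
  ~small_vec () { if (on_heap ()) std::free (m_data); }
  small_vec (const small_vec &) = delete;
  small_vec &operator= (const small_vec &) = delete;

  // Append V, growing the storage first if it is full.
  void safe_push (const T &v)
  {
    if (m_len == m_cap)
      grow ();
    m_data[m_len++] = v;
  }

  std::size_t length () const { return m_len; }
  bool on_heap () const { return m_data != m_inline; }
  const T &operator[] (std::size_t i) const { return m_data[i]; }

private:
  // Double the capacity and copy the existing elements to heap storage.
  void grow ()
  {
    std::size_t new_cap = m_cap * 2;
    T *p = static_cast<T *> (std::malloc (new_cap * sizeof (T)));
    if (!p)
      std::abort ();
    std::memcpy (p, m_data, m_len * sizeof (T));
    if (on_heap ())
      std::free (m_data);
    m_data = p;
    m_cap = new_cap;
  }

  T m_inline[N];
  T *m_data;
  std::size_t m_len;
  std::size_t m_cap;
};

// Push COUNT zeros into a vector with 8 inline slots and report whether
// all of them were stored and the storage ended up on the heap.
bool
spills_to_heap_after (std::size_t count)
{
  small_vec<unsigned long, 8> v;
  for (std::size_t i = 0; i < count; ++i)
    v.safe_push (0);
  return v.length () == count && v.on_heap ();
}
```

With 8 inline slots, pushing 8 entries stays inline and pushing 11 (one per
mode in the list above) simply switches to a heap allocation.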

> Thanks.
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-08-31 19:29
> To: juzhe.zhong@rivai.ai
> CC: gcc; richard.sandiford
> Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
> On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:
>  
> > Hi. Thanks Richard and Richi.
> > 
> > I have now figured out how to choose a smaller LMUL.
> > 
> > void
> > costs::finish_cost (const vector_costs *scalar_costs)
> > {
> >   loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> >   if (loop_vinfo)
> >     {
> >       if (loop_vinfo->vector_mode == RVVM8SImode
> >           || riscv_v_ext_vls_mode_p (loop_vinfo->vector_mode))
> >         {
> >           m_costs[vect_prologue] = 8;
> >           m_costs[vect_body] = 8;
> >           m_costs[vect_epilogue] = 8;
> >         }
> >       else
> >         {
> >           m_costs[vect_prologue] = 1;
> >           m_costs[vect_body] = 1;
> >           m_costs[vect_epilogue] = 1;
> >         }
> >     }
> >   // m_suggested_unroll_factor = 2;
> >   vector_costs::finish_cost (scalar_costs);
> > }
>  
> I don't think that's "good" use of the API.
>  
> > The odd code generated previously was caused by the VLS modes.
> > 
> > Now I can get LMUL = 4 by adjusting the cost:
> > vsetvli a5,a2,e32,m4,ta,ma
> > vle32.v v8,0(a0)
> > vle32.v v4,0(a1)
> > vsetvli a6,zero,e32,m4,ta,ma
> > slli a3,a5,2
> > vadd.vv v4,v4,v8
> > sub a2,a2,a5
> > vsetvli zero,a5,e32,m4,ta,ma
> > vse32.v v4,0(a4)
> > add a0,a0,a3
> > add a1,a1,a3
> > add a4,a4,a3
> > bne a2,zero,.L3
> > 
> > Fantastic architecture of GCC Vector Cost model!
> > 
> > Thanks a lot.
> > 
> > 
> > juzhe.zhong@rivai.ai
> >  
> > From: Richard Biener
> > Date: 2023-08-31 19:20
> > To: juzhe.zhong@rivai.ai
> > CC: gcc; richard.sandiford
> > Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
> > On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:
> >  
> > > Thanks Richi.
> > > 
> > > I am trying to figure out how to adjust finish_cost to lower the LMUL.
> > > 
> > > For example:
> > > 
> > > void
> > > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > > {
> > >   for (int i = 0; i < n; i++)
> > >     a[i] = a[i] + b[i];
> > > }
> > > 
> > > preferred_simd_mode picks LMUL = 8 (RVVM8SImode).
> > > 
> > > Is it possible to adjust the cost in finish_cost to make the loop
> > > vectorizer pick LMUL = 4?
> >  
> > I see you have an autovectorize_vector_modes hook and you use
> > VECT_COMPARE_COSTS.  So the appropriate place would be to
> > amend your vector_costs::better_main_loop_than_p.
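
A minimal sketch of the kind of per-loop comparison such a hook could make,
written as standalone code rather than against the GCC API.  Everything here
is an assumption for illustration: loop_candidate, needs_spill and
better_main_loop_than are hypothetical names, and the cost units are
arbitrary.

```cpp
#include <cassert>

// Hypothetical per-loop candidate; none of these names are GCC's.
struct loop_candidate
{
  int lmul;        // register group size implied by the mode (1, 2, 4 or 8)
  int live_values; // vector values simultaneously live in this loop
  int body_cost;   // cost of one vector iteration, in arbitrary units
};

// True if the candidate needs more than the 32 RVV vector registers.
bool
needs_spill (const loop_candidate &c)
{
  return c.live_values * c.lmul > 32;
}

// Return true if A should be preferred over B as the main loop.
bool
better_main_loop_than (const loop_candidate &a, const loop_candidate &b)
{
  // A candidate that fits in the register file beats one that spills.
  if (needs_spill (a) != needs_spill (b))
    return !needs_spill (a);
  // Otherwise compare cost per element (body_cost / lmul), cross-multiplied
  // so the comparison stays in integers.
  return a.body_cost * b.lmul < b.body_cost * a.lmul;
}
```

Because the comparison runs once per loop, a loop with few live values can
keep LMUL = 8 while a register-hungry loop in the same function falls back
to LMUL = 4.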
> >  
> > > I am experimenting with the following cost:
> > > 
> > >   if (loop_vinfo)
> > >     {
> > >       if (loop_vinfo->vector_mode == RVVM8SImode)
> > >         {
> > >           m_costs[vect_prologue] = 2;
> > >           m_costs[vect_body] = 20;
> > >           m_costs[vect_epilogue] = 2;
> > >         }
> > >       else
> > >         {
> > >           m_costs[vect_prologue] = 1;
> > >           m_costs[vect_body] = 1;
> > >           m_costs[vect_epilogue] = 1;
> > >         }
> > >     }
> > > 
> > > I increased the cost for LMUL = 8, and the codegen is odd:
> > > 
> > > foo:
> > > ble a2,zero,.L12
> > > addiw a5,a2,-1
> > > li a4,30
> > > sext.w t1,a2
> > > bleu a5,a4,.L7
> > > srliw a7,t1,5
> > > slli a7,a7,7
> > > li a4,32
> > > add a7,a7,a0
> > > mv a5,a0
> > > mv a3,a1
> > > vsetvli zero,a4,e32,m8,ta,ma
> > > .L4:
> > > vle32.v v8,0(a5)
> > > vle32.v v16,0(a3)
> > > vadd.vv v8,v8,v16
> > > vse32.v v8,0(a5)
> > > addi a5,a5,128
> > > addi a3,a3,128
> > > bne a5,a7,.L4
> > > andi a2,a2,-32
> > > beq t1,a2,.L14
> > > .L3:
> > > slli a4,a2,32
> > > subw a5,t1,a2
> > > srli a4,a4,32
> > > slli a5,a5,32
> > > slli a4,a4,2
> > > srli a5,a5,32
> > > add a0,a0,a4
> > > add a1,a1,a4
> > > vsetvli a4,a5,e8,m1,ta,ma
> > > vle32.v v8,0(a0)
> > > vle32.v v4,0(a1)
> > > vsetvli a2,zero,e32,m4,ta,ma
> > > vadd.vv v4,v4,v8
> > > vsetvli zero,a5,e32,m4,ta,ma
> > > vse32.v v4,0(a0)
> > > sub a3,a5,a4
> > > beq a5,a4,.L12
> > > slli a4,a4,2
> > > vsetvli zero,a3,e8,m1,ta,ma
> > > add a0,a0,a4
> > > add a1,a1,a4
> > > vle32.v v4,0(a0)
> > > vle32.v v8,0(a1)
> > > vsetvli a2,zero,e32,m4,ta,ma
> > > vadd.vv v4,v4,v8
> > > vsetvli zero,a3,e32,m4,ta,ma
> > > vse32.v v4,0(a0)
> > > .L12:
> > > ret
> > > .L7:
> > > li a2,0
> > > j .L3
> > > .L14:
> > > ret
> > > 
> > > I hope it can generate code like this:
> > > 
> > > foo:
> > > ble a2,zero,.L5
> > > mv a4,a0
> > > .L3:
> > > vsetvli a5,a2,e32,m4,ta,ma
> > > vle32.v v8,0(a0)
> > > vle32.v v4,0(a1)
> > > vsetvli a6,zero,e32,m4,ta,ma
> > > slli a3,a5,2
> > > vadd.vv v4,v4,v8
> > > sub a2,a2,a5
> > > vsetvli zero,a5,e32,m4,ta,ma
> > > vse32.v v4,0(a4)
> > > add a0,a0,a3
> > > add a1,a1,a3
> > > add a4,a4,a3
> > > bne a2,zero,.L3
> > > .L5:
> > > ret
> > > 
> > > I am experimenting to see whether we can adjust the cost statically to make the loop
> > > vectorizer use LMUL = 4 even though preferred_simd_mode returns LMUL = 8.
> > > If we can do that, I think we can run the analysis and then adjust the
> > > cost according to its result.
> > >
> > > Thanks.
> > > 
> > > 
> > > juzhe.zhong@rivai.ai
> > >  
> > > From: Richard Biener
> > > Date: 2023-08-31 15:38
> > > To: juzhe.zhong@rivai.ai
> > > CC: gcc; richard.sandiford
> > > Subject: Re: Question about dynamic choosing vectorization factor for RVV
> > > On Thu, 31 Aug 2023, juzhe.zhong@rivai.ai wrote:
> > >  
> > > > Hi, Richard and Richi.
> > > > 
> > > > Currently, we statically return the vectorization factor in 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
> > > > according to the compile option.
> > > > 
> > > > For example:
> > > > void
> > > > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > > > {
> > > >   for (int i = 0; i < n; i++)
> > > >     a[i] = a[i] + b[i];
> > > > }
> > > > 
> > > > with --param=riscv-autovec-lmul=m1:
> > > > 
> > > > vsetvli a5,a2,e32,m1,ta,ma
> > > > vle32.v v2,0(a0)
> > > > vle32.v v1,0(a1)
> > > > vsetvli a6,zero,e32,m1,ta,ma
> > > > slli a3,a5,2
> > > > vadd.vv v1,v1,v2
> > > > sub a2,a2,a5
> > > > vsetvli zero,a5,e32,m1,ta,ma
> > > > vse32.v v1,0(a4)
> > > > add a0,a0,a3
> > > > add a1,a1,a3
> > > > add a4,a4,a3
> > > > bne a2,zero,.L3
> > > > 
> > > > The 'vadd.vv' is only performing operations on a single register.
> > > > 
> > > > with --param=riscv-autovec-lmul=m8:
> > > > 
> > > >   vsetvli a5,a2,e8,m2,ta,ma
> > > >   vle32.v v16,0(a0)
> > > >   vle32.v v8,0(a1)
> > > >   vsetvli a6,zero,e32,m8,ta,ma
> > > >   slli a3,a5,2
> > > >   vadd.vv v8,v8,v16
> > > >   vsetvli zero,a2,e32,m8,ta,ma
> > > >   sub a2,a2,a5
> > > >   vse32.v v8,0(a4)
> > > >   add a0,a0,a3
> > > >   add a1,a1,a3
> > > >   add a4,a4,a3
> > > >   bne a2,zero,.L3
> > > > 
> > > > The 'vadd.vv' here is performing operations on 8 consecutive registers:
> > > > 
> > > > vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]
> > > > 
> > > > Having users set the vectorization factor statically is not ideal.
> > > > 
> > > > We want GCC to choose the vectorization factor dynamically for auto-vectorization, based on loop analysis.
> > > > 
> > > > Currently, I have implemented a simplistic analysis that computes the live range of each local decl in the current function.
> > > > 
> > > > The analysis is as follows. RVV has 32 vector registers,
> > > > so we compute the live ranges of the current function's local decls and require:
> > > > 
> > > > the number of decls live at the same time * LMUL <= 32.
> > > > Based on this analysis, I set the vectorization factor in TARGET_VECTORIZE_PREFERRED_SIMD_MODE.
> > > > 
> > > > This simplistic algorithm (implemented in the RISC-V backend) works well for the test cases I wrote.
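
The constraint above (live decls * LMUL <= 32) can be sketched as a small
standalone function.  This is only an illustration under stated assumptions:
max_lmul_for_live_count and NUM_VREGS are hypothetical names, and the real
backend analysis is not shown in this thread.

```cpp
#include <cassert>

// Number of RVV vector registers, as stated in the thread.
constexpr int NUM_VREGS = 32;

// Return the largest LMUL in {8, 4, 2, 1} such that
// live * LMUL <= NUM_VREGS.  Falls back to 1 when even LMUL = 2
// would not fit.
int
max_lmul_for_live_count (int live)
{
  for (int lmul = 8; lmul > 1; lmul /= 2)
    if (live * lmul <= NUM_VREGS)
      return lmul;
  return 1;
}
```

For example, up to 4 simultaneously live values still allow LMUL = 8, while
5 to 8 live values reduce the choice to LMUL = 4.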
> > > > 
> > > > However, I can only choose the optimal vectorization factor for the whole function, not for a specific loop.
> > > > 
> > > > Here is the example:
> > > > 
> > > > void foo2 (int32_t *__restrict a,
> > > >           int32_t *__restrict b,
> > > >           int32_t *__restrict c,
> > > >           int32_t *__restrict a2,
> > > >           int32_t *__restrict b2,
> > > >           int32_t *__restrict c2,
> > > >           int32_t *__restrict a3,
> > > >           int32_t *__restrict b3,
> > > >           int32_t *__restrict c3,
> > > >           int32_t *__restrict a4,
> > > >           int32_t *__restrict b4,
> > > >           int32_t *__restrict c4,
> > > >           int32_t *__restrict a5,
> > > >           int32_t *__restrict b5,
> > > >           int32_t *__restrict c5,
> > > >           int n)
> > > > {
> > > > // Loop 1
> > > >     for (int i = 0; i < n; i++)
> > > >        a[i] = a[i] + b[i];
> > > > // Loop 2
> > > >     for (int i = 0; i < n; i++){
> > > >       a[i] = b[i] + c[i];
> > > >       a2[i] = b2[i] + c2[i];
> > > >       a3[i] = b3[i] + c3[i];
> > > >       a4[i] = b4[i] + c4[i];
> > > >       a5[i] = a[i] + a4[i];
> > > >       a[i] = a3[i] + a2[i]+ a5[i];
> > > >     }
> > > > }
> > > > 
> > > > For Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should use LMUL = 4 (since LMUL = 8 would cause vector register spills).
> > > > 
> > > > If I split loop 1 and loop 2 into 2 separate functions, my algorithm works well.
> > > > 
> > > > However, if we put these 2 loops in the same function, I end up picking LMUL = 4 for both loop 1 and loop 2 because, as I said above, I do the analysis per function, not per
> > > > loop.
> > > > 
> > > > I am struggling to find a good solution for this issue. Can we pass loop_vec_info through
> > > > to the 'preferred_simd_mode' target hook?
> > >  
> > > That's not how it's currently designed to work - there's
> > > the autovectorize_vector_modes hook where you should provide a vector
> > > of modes the vectorizer iterates over and return VECT_COMPARE_COSTS
> > > if you want to evaluate costs between choices.  Your analysis should
> > > then happen in the finish_cost method.
> > >  
> > > That's how it's currently designed.  It might not be optimal for
> > > compile-time reasons when there are many modes; giving the target
> > > more control (and context) might be possible.
> > >  
> > > Richard.
> > >  
> > > > Thanks.
> > > > 
> > > > 
> > > > juzhe.zhong@rivai.ai
> > > > 
> > >  
> > > 
> >  
> > 
>  
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Thread overview: 10+ messages
2023-08-31  3:04 Question about dynamic choosing vectorization factor for RVV juzhe.zhong
2023-08-31  7:38 ` Richard Biener
2023-08-31  9:52   ` juzhe.zhong
2023-08-31 11:12     ` Richard Sandiford
2023-08-31 11:20     ` Richard Biener
2023-08-31 11:24       ` juzhe.zhong
2023-08-31 11:29         ` Richard Biener
2023-08-31 11:33           ` juzhe.zhong
2023-08-31 11:54           ` juzhe.zhong
2023-08-31 12:01             ` Richard Biener
