Hi, all. After several investigations, here are my experiments:
void
single_rgroup (int32_t *__restrict a, int32_t *__restrict b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = b[i] + a[i];
}

void
mutiple_rgroup (float *__restrict f, double *__restrict d, int n)
{
  for (int i = 0; i < n; ++i)
    {
      f[i * 2 + 0] = 1;
      f[i * 2 + 1] = 2;
      d[i] = 3;
    }
} 


single_rgroup:
ble a2,zero,.L5
li a4,4
.L3:
minu a5,a2,a4
vsetvli zero,a5,e32,m1,ta,ma
vle32.v v1,0(a0)
vle32.v v2,0(a1)
vsetivli zero,4,e32,m1,ta,ma
mv a3,a2                                       ---------> 1 more "mv" instruction
vadd.vv v1,v1,v2
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v1,0(a0)
addi a1,a1,16
addi a0,a0,16
addi a2,a2,-4
bgtu a3,a4,.L3
.L5:
ret
.size single_rgroup, .-single_rgroup
.align 1
.globl foo5
.type foo5, @function
mutiple_rgroup :
ble a2,zero,.L11
lui a5,%hi(.LANCHOR0)
addi a5,a5,%lo(.LANCHOR0)
vl1re32.v v2,0(a5)
lui a5,%hi(.LANCHOR0+16)
addi a5,a5,%lo(.LANCHOR0+16)
slli a2,a2,1
li a3,8
li a7,4
vl1re64.v v1,0(a5)
.L9:
minu a5,a2,a3
minu a4,a5,a7
sub a5,a5,a4
addi a6,a0,16
vsetvli zero,a4,e32,m1,ta,ma
vse32.v v2,0(a0)
srli a4,a4,1
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v2,0(a6)
srli a5,a5,1
vsetvli zero,a4,e64,m1,ta,ma
addi a6,a1,16
vse64.v v1,0(a1)
mv a4,a2                                ---------> 1 more "mv" instruction
vsetvli zero,a5,e64,m1,ta,ma
vse64.v v1,0(a6)
addi a0,a0,32
addi a1,a1,32
addi a2,a2,-8
bgtu a4,a3,.L9
.L11:
ret

These are representative examples; I have tried a fair number of cases. This is the worst case for RVV after this patch:
for both single-rgroup and multiple-rgroup loops, we end up with one more "mv" instruction inside the loop.
In some of the other examples I tried there are no extra instructions (it seems IVOPTS optimizes them away in some cases).

From my side (RVV), I think one more "mv" instruction is not a big deal if this patch (step the IV by vf and use
"remain > vf" as the loop condition) can help IBM.
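
To make the two flows concrete, here is a minimal scalar sketch in C. It is only a model of the loop control, not the vectorizer's actual code: the names flow_old, flow_new and min_u are made up for illustration, and vf is assumed to be nonzero. It counts vector iterations and the total number of elements handled, so the two flows can be compared directly.

```c
#include <assert.h>

/* Hypothetical helper: number of elements one vector iteration handles.  */
static unsigned
min_u (unsigned a, unsigned b)
{
  return a < b ? a : b;
}

/* Flow before the patch (model): subtract the actual length and
   exit once nothing remains.  Returns the iteration count.  */
static unsigned
flow_old (unsigned n, unsigned vf, unsigned *total)
{
  unsigned iters = 0, remain = n;
  *total = 0;
  while (remain != 0)
    {
      unsigned len = min_u (remain, vf);
      *total += len;
      remain -= len;
      iters++;
    }
  return iters;
}

/* Flow after the patch (model): always step by vf and test the
   value from before the decrement against vf.  */
static unsigned
flow_new (unsigned n, unsigned vf, unsigned *total)
{
  unsigned iters = 0, remain = n;
  *total = 0;
  if (n == 0)
    return 0;
  for (;;)
    {
      unsigned len = min_u (remain, vf);
      unsigned prev = remain;	/* the copy that shows up as "mv" */
      *total += len;
      remain -= vf;		/* may wrap in the last iteration; not read again */
      iters++;
      if (!(prev > vf))
	break;
    }
  return iters;
}

/* Both flows must agree on iteration count and elements handled.  */
static void
check_flows_agree (unsigned n, unsigned vf)
{
  unsigned t_old, t_new;
  assert (flow_old (n, vf, &t_old) == flow_new (n, vf, &t_new));
  assert (t_old == n && t_new == n);
}
```

In this model the only extra work in flow_new is keeping the pre-decrement value alive for the exit test, which corresponds to the "mv" in the dumps above.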

For single-rgroup, this "mv" instruction will go away once we use SELECT_VL. For multiple-rgroup, the "mv" instruction remains,
but as I said, it is not a big deal.
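
For reference, a SELECT_VL-style single-rgroup loop can be modelled in scalar C roughly as below. This is a sketch under assumptions, not the real lowering: select_vl_model is a made-up stand-in for whatever length the target picks (here it evens out the last two iterations, which the RVV spec permits for vsetvli), vf is assumed to be nonzero, and the function returns the iteration count only so it can be checked.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the target's length choice: never more than
   MIN (remain, vf), and it may split the tail over two iterations.  */
static unsigned
select_vl_model (unsigned remain, unsigned vf)
{
  if (remain <= vf)
    return remain;
  if (remain < 2 * vf)
    return (remain + 1) / 2;
  return vf;
}

/* Scalar model of single_rgroup with a SELECT_VL-style IV: pointers
   and the remaining count all advance by the selected length.  */
static unsigned
single_rgroup_select_vl (int32_t *__restrict a, const int32_t *__restrict b,
			 unsigned n, unsigned vf)
{
  unsigned iters = 0, remain = n;
  while (remain != 0)
    {
      unsigned len = select_vl_model (remain, vf);
      for (unsigned j = 0; j < len; j++)	/* one "vector" iteration */
	a[j] = b[j] + a[j];
      a += len;
      b += len;
      remain -= len;	/* exit test only needs the updated value */
      iters++;
    }
  return iters;
}
```

Since the count is decremented by the selected length and compared against zero, this model needs no copy of the pre-decrement value.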

If this patch's approach is approved, I will rebase the SELECT_VL patch on top of it and send it again.

Looking forward to your suggestions.

Thanks.


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-05-30 20:33
To: juzhe.zhong
CC: Richard Sandiford; gcc-patches; linkw
Subject: Re: [PATCH] VECT: Change flow of decrement IV
On Tue, 30 May 2023, juzhe.zhong wrote:
 
> This patch will generate the number of rgroup "mov" instructions inside the
> loop. This is unacceptable. For example, if number of rgroups=3, will be 3 more
> instruction in loop. If this patch is necessary, I think I should find a way
> to fix it.
 
That's odd, you only need to adjust the IV which is used in the exit test,
not all the others.
 
> ---- Replied Message ----
> From
> Richard Sandiford<richard.sandiford@arm.com>
> Date
> 05/30/2023 19:41
> To
> juzhe.zhong@rivai.ai<juzhe.zhong@rivai.ai>
> Cc
> gcc-patches<gcc-patches@gcc.gnu.org>,
> rguenther<rguenther@suse.de>,
> linkw<linkw@linux.ibm.com>
> Subject
> Re: [PATCH] VECT: Change flow of decrement IV
> "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> > Before this patch:
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> >       sub   a2,a2,a5
> > bne a2,zero,.L3
> > .L5:
> > ret
> >
> > After this patch:
> >
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > neg a7,a4   -->>>additional instruction
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > mv a6,a2  -->>>additional instruction
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> > add a2,a2,a7
> > bgtu a6,a4,.L3
> > .L5:
> > ret
> >
> > There is 1 more instruction in preheader and 1 more instruction in loop.
> > But I think it's OK for RVV since we will definitely be using SELECT_VL so
> > this issue will gone.
> 
> But what about cases where you won't be using SELECT_VL, such as SLP?
> 
> Richard
> 
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)