This patch removed 2nd time initialization of RTL_SSA which is the approach we both hate.



juzhe.zhong@rivai.ai
 
From: Kito Cheng
Date: 2023-06-09 18:45
To: juzhe.zhong
CC: gcc-patches; kito.cheng; palmer; palmer; jeffreyalaw; rdapp.gcc; pan2.li
Subject: Re: [PATCH V2] RISC-V: Rework Phase 5 && Phase 6 of VSETVL PASS
Thankful you send this before weekend, I could run the fuzzy testing
during this weekend :P
 
On Fri, Jun 9, 2023 at 6:41 PM <juzhe.zhong@rivai.ai> wrote:
>
> From: Juzhe-Zhong <juzhe.zhong@rivai.ai>
>
> This patch is to rework Phase 5 && Phase 6 of VSETVL PASS since Phase 5 && Phase 6
> are quite messy and cause some bugs discovered by my downstream auto-vectorization
> test-generator.
>
> Before this patch.
>
> Phase 5 is cleanup_insns is the function remove AVL operand dependency from each RVV instruction.
> E.g. vadd.vv (use a5), after Phase 5, ====> vadd.vv (use const_int 0). Since "a5" is used in "vsetvl" instructions and
> after the correct "vsetvl" instructions are inserted, each RVV instruction doesn't need AVL operand "a5" anymore. Then,
> we remove this operand dependency helps for the following scheduling PASS.
>
> Phase 6 is propagate_avl do the following 2 things:
> 1. Local && Global user vsetvl instructions optimization.
>    E.g.
>       vsetvli a2, a2, e8, mf8   ======> Change it into vsetvli a2, a2, e32, mf2
>       vsetvli zero,a2, e32, mf2  ======> eliminate
> 2. Optimize user vsetvl from "vsetvl a2,a2" into "vsetvl zero,a2" if "a2" is not used by any instructions.
> Since from Phase 1 ~ Phase 4 which inserts "vsetvli" instructions base on LCM which change the CFG, I re-new a new
> RTL_SSA framework (which is more expensive than just using DF) for Phase 6 and optmize user vsetvli base on the new RTL_SSA.
>
> There are 2 issues in Phase 5 && Phase 6:
> 1. local_eliminate_vsetvl_insn was introduced by @kito which can do better local user vsetvl optimizations better than
>    Phase 6 do, such approach doesn't need to re-new the RTL_SSA framework. So the local user vsetvli instructions optimizaiton
>    in Phase 6 is redundant and should be removed.
> 2. A bug discovered by my downstream auto-vectorization test-generator (I can't put the test in this patch since we are missing autovec
>    patterns for it so we can't use the upstream GCC directly reproduce such issue but I will remember put it back after I support the
>    necessary autovec patterns). Such bug is causing by using RTL_SSA re-new framework. The issue description is this:
>
> Before Phase 6:
>    ...
>    insn1: vsetlvi a3, 17 <========== generated by SELECT_VL auto-vec pattern.
>    slli a4,a3,3
>    ...
>    insn2: vsetvli zero, a3, ...
>    load (use const_int 0, before Phase 5, it's using a3, but the use of "a3" is removed in Phase 5)
>    ...
>
> In Phase 6, we iterate to insn2, then get the def of "a3" which is the insn1.
> insn2 is the vsetvli instruction inserted in Phase 4 which is not included in the RLT_SSA framework
> even though we renew it (I didn't take a look at it and I don't think we need to now).
> Base on this situation, the def_info of insn2 has the information "set->single_nondebug_insn_use ()"
> which return true. Obviously, this information is not correct, since insn1 has aleast 2 uses:
> 1). slli a4,a3,3 2).insn2: vsetvli zero, a3, ... Then, the test generated by my downstream test-generator
> execution test failed.
>
> Conclusion of RTL_SSA framework:
> Before this patch, we initialize RTL_SSA 2 times. One is at the beginning of the VSETVL PASS which is absolutely correct, the other
> is re-new after Phase 4 (LCM) has incorrect information that causes bugs.
>
> Besides, we don't like to initialize RTL_SSA second time it seems to be a waste since we just need to do a little optimization.
>
> Base on all circumstances I described above, I rework and reorganize Phase 5 && Phase 6 as follows:
> 1. Phase 5 is called ssa_post_optimization which is doing the optimization base on the RTL_SSA information (The RTL_SSA is initialized
>    at the beginning of the VSETVL PASS, no need to re-new it again). This phase includes 3 optimizaitons:
>    1). local_eliminate_vsetvl_insn we already have (no change).
>    2). global_eliminate_vsetvl_insn ---> new optimizaiton splitted from orignal Phase 6 but with more powerful and reliable implementation.
>       E.g.
>       void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
>         size_t avl;
>         if (m > 100)
>           avl = __riscv_vsetvl_e16mf4(vl << 4);
>         else
>           avl = __riscv_vsetvl_e32mf2(vl >> 8);
>         for (size_t i = 0; i < m; i++) {
>           vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
>           v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
>           __riscv_vse8_v_i8mf8(out + i, v0, avl);
>         }
>       }
>
>       This example failed to global user vsetvl optimize before this patch:
>       f:
>               li      a5,100
>               bleu    a3,a5,.L2
>               slli    a2,a2,4
>               vsetvli a4,a2,e16,mf4,ta,mu
>       .L3:
>               li      a5,0
>               vsetvli zero,a4,e8,mf8,ta,ma
>       .L5:
>               add     a6,a0,a5
>               add     a2,a1,a5
>               vle8.v  v1,0(a6)
>               addi    a5,a5,1
>               vadd.vv v1,v1,v1
>               vse8.v  v1,0(a2)
>               bgtu    a3,a5,.L5
>       .L10:
>               ret
>       .L2:
>               beq     a3,zero,.L10
>               srli    a2,a2,8
>               vsetvli a4,a2,e32,mf2,ta,mu
>               j       .L3
>       With this patch:
>       f:
>               li      a5,100
>               bleu    a3,a5,.L2
>               slli    a2,a2,4
>               vsetvli zero,a2,e8,mf8,ta,ma
>       .L3:
>               li      a5,0
>       .L5:
>               add     a6,a0,a5
>               add     a2,a1,a5
>               vle8.v  v1,0(a6)
>               addi    a5,a5,1
>               vadd.vv v1,v1,v1
>               vse8.v  v1,0(a2)
>               bgtu    a3,a5,.L5
>       .L10:
>               ret
>       .L2:
>               beq     a3,zero,.L10
>               srli    a2,a2,8
>               vsetvli zero,a2,e8,mf8,ta,ma
>               j       .L3
>
>    3). Remove AVL operand dependency of each RVV instructions.
>
> 2. Phase 6 is called df_post_optimization: Optimize "vsetvl a3,a2...." into Optimize "vsetvl zero,a2...." base on
>    dataflow analysis of new CFG (new CFG is created by LCM). The reason we need to do use new CFG and after Phase 5:
>    ...
>    vsetvl a3, a2...
>    vadd.vv (use a3)
>    If we don't have Phase 5 which removes the "a3" use in vadd.vv, we will fail to optimize vsetvl a3,a2 into vsetvl zero,a2.
>
>    This patch passed all tests in rvv.exp with ONLY peformance && codegen improved (no performance decline and no bugs including my
>    downstream tests).
>
> gcc/ChangeLog:
>
>         * config/riscv/riscv-vsetvl.cc (available_occurrence_p): Ehance user vsetvl optimization.
>         (vector_insn_info::parse_insn): Add rtx_insn parse.
>         (pass_vsetvl::local_eliminate_vsetvl_insn): Ehance user vsetvl optimization.
>         (get_first_vsetvl): New function.
>         (pass_vsetvl::global_eliminate_vsetvl_insn): Ditto.
>         (pass_vsetvl::cleanup_insns): Remove it.
>         (pass_vsetvl::ssa_post_optimization): New function.
>         (has_no_uses): Ditto.
>         (pass_vsetvl::propagate_avl): Remove it.
>         (pass_vsetvl::df_post_optimization): New function.
>         (pass_vsetvl::lazy_vsetvl): Rework Phase 5 && Phase 6.
>         * config/riscv/riscv-vsetvl.h: Adapt declaration.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/riscv/rvv/vsetvl/vsetvl-16.c: Adapt test.
>         * gcc.target/riscv/rvv/vsetvl/vsetvl-2.c: Ditto.
>         * gcc.target/riscv/rvv/vsetvl/vsetvl-3.c: Ditto.
>         * gcc.target/riscv/rvv/vsetvl/vsetvl-21.c: New test.
>         * gcc.target/riscv/rvv/vsetvl/vsetvl-22.c: New test.
>         * gcc.target/riscv/rvv/vsetvl/vsetvl-23.c: New test.
>
> ---
>  gcc/config/riscv/riscv-vsetvl.cc              | 400 +++++++++++-------
>  gcc/config/riscv/riscv-vsetvl.h               |  34 +-
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-16.c   |   2 +-
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-2.c    |   2 +-
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-21.c   |  21 +
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-22.c   |  21 +
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-23.c   |  37 ++
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-3.c    |   2 +-
>  8 files changed, 366 insertions(+), 153 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c
>
> diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
> index fe55f4ccd30..924a94adf9c 100644
> --- a/gcc/config/riscv/riscv-vsetvl.cc
> +++ b/gcc/config/riscv/riscv-vsetvl.cc
> @@ -395,10 +395,15 @@ available_occurrence_p (const bb_info *bb, const vector_insn_info dem)
>        if (!vlmax_avl_p (dem.get_avl ()))
>         {
>           rtx dest = NULL_RTX;
> +         insn_info *i = insn;
>           if (vsetvl_insn_p (insn->rtl ()))
> -           dest = get_vl (insn->rtl ());
> -         for (const insn_info *i = insn; real_insn_and_same_bb_p (i, bb);
> -              i = i->next_nondebug_insn ())
> +           {
> +             dest = get_vl (insn->rtl ());
> +             /* For user vsetvl a2, a2 instruction, we consider it as
> +                available even though it modifies "a2".  */
> +             i = i->next_nondebug_insn ();
> +           }
> +         for (; real_insn_and_same_bb_p (i, bb); i = i->next_nondebug_insn ())
>             {
>               if (read_vl_insn_p (i->rtl ()))
>                 continue;
> @@ -1893,11 +1898,13 @@ vector_insn_info::parse_insn (rtx_insn *rinsn)
>    *this = vector_insn_info ();
>    if (!NONDEBUG_INSN_P (rinsn))
>      return;
> -  if (!has_vtype_op (rinsn))
> +  if (optimize == 0 && !has_vtype_op (rinsn))
> +    return;
> +  if (optimize > 0 && !vsetvl_insn_p (rinsn))
>      return;
>    m_state = VALID;
>    extract_insn_cached (rinsn);
> -  const rtx avl = recog_data.operand[get_attr_vl_op_idx (rinsn)];
> +  rtx avl = ::get_avl (rinsn);
>    m_avl = avl_info (avl, nullptr);
>    m_sew = ::get_sew (rinsn);
>    m_vlmul = ::get_vlmul (rinsn);
> @@ -2730,10 +2737,11 @@ private:
>    /* Phase 5.  */
>    rtx_insn *get_vsetvl_at_end (const bb_info *, vector_insn_info *) const;
>    void local_eliminate_vsetvl_insn (const bb_info *) const;
> -  void cleanup_insns (void) const;
> +  bool global_eliminate_vsetvl_insn (const bb_info *) const;
> +  void ssa_post_optimization (void) const;
>
>    /* Phase 6.  */
> -  void propagate_avl (void) const;
> +  void df_post_optimization (void) const;
>
>    void init (void);
>    void done (void);
> @@ -4246,7 +4254,7 @@ pass_vsetvl::local_eliminate_vsetvl_insn (const bb_info *bb) const
>
>        /* Local AVL compatibility checking is simpler than global, we only
>          need to check the REGNO is same.  */
> -      if (prev_dem.valid_p () && prev_dem.skip_avl_compatible_p (curr_dem)
> +      if (prev_dem.valid_or_dirty_p () && prev_dem.skip_avl_compatible_p (curr_dem)
>           && local_avl_compatible_p (prev_avl, curr_avl))
>         {
>           /* curr_dem and prev_dem is compatible!  */
> @@ -4277,27 +4285,187 @@ pass_vsetvl::local_eliminate_vsetvl_insn (const bb_info *bb) const
>      }
>  }
>
> -/* Before VSETVL PASS, RVV instructions pattern is depending on AVL operand
> -   implicitly. Since we will emit VSETVL instruction and make RVV instructions
> -   depending on VL/VTYPE global status registers, we remove the such AVL operand
> -   in the RVV instructions pattern here in order to remove AVL dependencies when
> -   AVL operand is a register operand.
> -
> -   Before the VSETVL PASS:
> -     li a5,32
> -     ...
> -     vadd.vv (..., a5)
> -   After the VSETVL PASS:
> -     li a5,32
> -     vsetvli zero, a5, ...
> -     ...
> -     vadd.vv (..., const_int 0).  */
> +/* Get the first vsetvl instructions of the block.  */
> +static rtx_insn *
> +get_first_vsetvl (basic_block cfg_bb)
> +{
> +  rtx_insn *rinsn;
> +  FOR_BB_INSNS (cfg_bb, rinsn)
> +    {
> +      if (!NONDEBUG_INSN_P (rinsn))
> +       continue;
> +      /* If we don't find any inserted vsetvli before user RVV instructions,
> +        we don't need to optimize the vsetvls in this block.  */
> +      if (has_vtype_op (rinsn) || vsetvl_insn_p (rinsn))
> +       return nullptr;
> +
> +      if (vsetvl_discard_result_insn_p (rinsn))
> +       return rinsn;
> +    }
> +  return nullptr;
> +}
> +
> +/* Global user vsetvl optimizaiton:
> +
> +     Case 1:
> +     bb 1:
> +       vsetvl a5,a4,e8,mf8
> +       ...
> +     bb 2:
> +       ...
> +       vsetvl zero,a5,e8,mf8 --> Eliminate directly.
> +
> +     Case 2:
> +      bb 1:
> +       vsetvl a5,a4,e8,mf8    --> vsetvl a5,a4,e32,mf2
> +       ...
> +      bb 2:
> +       ...
> +       vsetvl zero,a5,e32,mf2 --> Eliminate directly.
> +
> +     Case 3:
> +      bb 1:
> +       vsetvl a5,a4,e8,mf8    --> vsetvl a5,a4,e32,mf2
> +       ...
> +      bb 2:
> +       ...
> +       vsetvl a5,a4,e8,mf8    --> vsetvl a5,a4,e32,mf2
> +       goto bb 3
> +      bb 3:
> +       ...
> +       vsetvl zero,a5,e32,mf2 --> Eliminate directly.
> +*/
> +bool
> +pass_vsetvl::global_eliminate_vsetvl_insn (const bb_info *bb) const
> +{
> +  rtx_insn *vsetvl_rinsn;
> +  vector_insn_info dem = vector_insn_info ();
> +  const auto &block_info = get_block_info (bb);
> +  basic_block cfg_bb = bb->cfg_bb ();
> +
> +  if (block_info.local_dem.valid_or_dirty_p ())
> +    {
> +      /* Optimize the local vsetvl.  */
> +      dem = block_info.local_dem;
> +      vsetvl_rinsn = get_first_vsetvl (cfg_bb);
> +    }
> +  if (!vsetvl_rinsn)
> +    /* Optimize the global vsetvl inserted by LCM.  */
> +    vsetvl_rinsn = get_vsetvl_at_end (bb, &dem);
> +
> +  /* No need to optimize if block doesn't have vsetvl instructions.  */
> +  if (!dem.valid_or_dirty_p () || !vsetvl_rinsn || !dem.get_avl_source ()
> +      || !dem.has_avl_reg ())
> +    return false;
> +
> +  /* If all preds has VL/VTYPE status setted by user vsetvls, and these
> +     user vsetvls are all skip_avl_compatible_p with the vsetvl in this
> +     block, we can eliminate this vsetvl instruction.  */
> +  sbitmap avin = m_vector_manager->vector_avin[cfg_bb->index];
> +
> +  unsigned int bb_index;
> +  sbitmap_iterator sbi;
> +  rtx avl = get_avl (dem.get_insn ()->rtl ());
> +  hash_set<set_info *> sets
> +    = get_all_sets (dem.get_avl_source (), true, false, false);
> +  /* Condition 1: All VL/VTYPE available in are all compatible.  */
> +  EXECUTE_IF_SET_IN_BITMAP (avin, 0, bb_index, sbi)
> +    {
> +      const auto &expr = m_vector_manager->vector_exprs[bb_index];
> +      const auto &insn = expr->get_insn ();
> +      def_info *def = find_access (insn->defs (), REGNO (avl));
> +      set_info *set = safe_dyn_cast<set_info *> (def);
> +      if (!vsetvl_insn_p (insn->rtl ()) || insn->bb () == bb
> +         || !sets.contains (set))
> +       return false;
> +    }
> +
> +  /* Condition 2: Check it has preds.  */
> +  if (EDGE_COUNT (cfg_bb->preds) == 0)
> +    return false;
> +
> +  /* Condition 3: We don't do the global optimization for the block
> +     has a pred is entry block or exit block.  */
> +  /* Condition 4: All preds have available VL/VTYPE out.  */
> +  edge e;
> +  edge_iterator ei;
> +  FOR_EACH_EDGE (e, ei, cfg_bb->preds)
> +    {
> +      sbitmap avout = m_vector_manager->vector_avout[e->src->index];
> +      if (e->src == ENTRY_BLOCK_PTR_FOR_FN (cfun)
> +         || e->src == EXIT_BLOCK_PTR_FOR_FN (cfun) || bitmap_empty_p (avout))
> +       return false;
> +
> +      EXECUTE_IF_SET_IN_BITMAP (avout, 0, bb_index, sbi)
> +       {
> +         const auto &expr = m_vector_manager->vector_exprs[bb_index];
> +         const auto &insn = expr->get_insn ();
> +         def_info *def = find_access (insn->defs (), REGNO (avl));
> +         set_info *set = safe_dyn_cast<set_info *> (def);
> +         if (!vsetvl_insn_p (insn->rtl ()) || insn->bb () == bb
> +             || !sets.contains (set) || !expr->skip_avl_compatible_p (dem))
> +           return false;
> +       }
> +    }
> +
> +  /* Step1: Reshape the VL/VTYPE status to make sure everything compatible.  */
> +  hash_set<basic_block> pred_cfg_bbs = get_all_predecessors (cfg_bb);
> +  FOR_EACH_EDGE (e, ei, cfg_bb->preds)
> +    {
> +      sbitmap avout = m_vector_manager->vector_avout[e->src->index];
> +      EXECUTE_IF_SET_IN_BITMAP (avout, 0, bb_index, sbi)
> +       {
> +         vector_insn_info prev_dem = *m_vector_manager->vector_exprs[bb_index];
> +         vector_insn_info curr_dem = dem;
> +         insn_info *insn = prev_dem.get_insn ();
> +         if (!pred_cfg_bbs.contains (insn->bb ()->cfg_bb ()))
> +           continue;
> +         /* Update avl info since we need to make sure they are fully
> +            compatible before merge.  */
> +         curr_dem.set_avl_info (prev_dem.get_avl_info ());
> +         /* Merge both and update into curr_vsetvl.  */
> +         prev_dem = curr_dem.merge (prev_dem, LOCAL_MERGE);
> +         change_vsetvl_insn (insn, prev_dem);
> +       }
> +    }
> +
> +  /* Step2: eliminate the vsetvl instruction.  */
> +  eliminate_insn (vsetvl_rinsn);
> +  return true;
> +}
> +
> +/* This function does the following post optimization base on RTL_SSA:
> +
> +   1. Local user vsetvl optimizations.
> +   2. Global user vsetvl optimizations.
> +   3. AVL dependencies removal:
> +      Before VSETVL PASS, RVV instructions pattern is depending on AVL operand
> +      implicitly. Since we will emit VSETVL instruction and make RVV
> +      instructions depending on VL/VTYPE global status registers, we remove the
> +      such AVL operand in the RVV instructions pattern here in order to remove
> +      AVL dependencies when AVL operand is a register operand.
> +
> +      Before the VSETVL PASS:
> +       li a5,32
> +       ...
> +       vadd.vv (..., a5)
> +      After the VSETVL PASS:
> +       li a5,32
> +       vsetvli zero, a5, ...
> +       ...
> +       vadd.vv (..., const_int 0).  */
>  void
> -pass_vsetvl::cleanup_insns (void) const
> +pass_vsetvl::ssa_post_optimization (void) const
>  {
>    for (const bb_info *bb : crtl->ssa->bbs ())
>      {
>        local_eliminate_vsetvl_insn (bb);
> +      bool changed_p = true;
> +      while (changed_p)
> +       {
> +         changed_p = false;
> +         changed_p |= global_eliminate_vsetvl_insn (bb);
> +       }
>        for (insn_info *insn : bb->real_nondebug_insns ())
>         {
>           rtx_insn *rinsn = insn->rtl ();
> @@ -4342,135 +4510,81 @@ pass_vsetvl::cleanup_insns (void) const
>      }
>  }
>
> +/* Return true if the SET result is not used by any instructions.  */
> +static bool
> +has_no_uses (basic_block cfg_bb, rtx_insn *rinsn, int regno)
> +{
> +  /* Handle the following case that can not be detected in RTL_SSA.  */
> +  /* E.g.
> +         li a5, 100
> +         vsetvli a6, a5...
> +         ...
> +         vadd (use a6)
> +
> +       The use of "a6" is removed from "vadd" but the information is
> +       not updated in RTL_SSA framework. We don't want to re-new
> +       a new RTL_SSA which is expensive, instead, we use data-flow
> +       analysis to check whether "a6" has no uses.  */
> +  if (bitmap_bit_p (df_get_live_out (cfg_bb), regno))
> +    return false;
> +
> +  rtx_insn *iter;
> +  for (iter = NEXT_INSN (rinsn); iter && iter != NEXT_INSN (BB_END (cfg_bb));
> +       iter = NEXT_INSN (iter))
> +    if (df_find_use (iter, regno_reg_rtx[regno]))
> +      return false;
> +
> +  return true;
> +}
> +
> +/* This function does the following post optimization base on dataflow
> +   analysis:
> +
> +   1. Change vsetvl rd, rs1 --> vsevl zero, rs1, if rd is not used by any
> +   nondebug instructions. Even though this PASS runs after RA and it doesn't
> +   help for reduce register pressure, it can help instructions scheduling since
> +   we remove the dependencies.
> +
> +   2. Remove redundant user vsetvls base on outcome of Phase 4 (LCM) && Phase 5
> +   (AVL dependencies removal).  */
>  void
> -pass_vsetvl::propagate_avl (void) const
> -{
> -  /* Rebuild the RTL_SSA according to the new CFG generated by LCM.  */
> -  /* Finalization of RTL_SSA.  */
> -  free_dominance_info (CDI_DOMINATORS);
> -  if (crtl->ssa->perform_pending_updates ())
> -    cleanup_cfg (0);
> -  delete crtl->ssa;
> -  crtl->ssa = nullptr;
> -  /* Initialization of RTL_SSA.  */
> -  calculate_dominance_info (CDI_DOMINATORS);
> +pass_vsetvl::df_post_optimization (void) const
> +{
>    df_analyze ();
> -  crtl->ssa = new function_info (cfun);
> -
>    hash_set<rtx_insn *> to_delete;
> -  for (const bb_info *bb : crtl->ssa->bbs ())
> +  basic_block cfg_bb;
> +  rtx_insn *rinsn;
> +  FOR_ALL_BB_FN (cfg_bb, cfun)
>      {
> -      for (insn_info *insn : bb->real_nondebug_insns ())
> +      FOR_BB_INSNS (cfg_bb, rinsn)
>         {
> -         if (vsetvl_discard_result_insn_p (insn->rtl ()))
> +         if (NONDEBUG_INSN_P (rinsn) && vsetvl_insn_p (rinsn))
>             {
> -             rtx avl = get_avl (insn->rtl ());
> -             if (!REG_P (avl))
> -               continue;
> -
> -             set_info *set = find_access (insn->uses (), REGNO (avl))->def ();
> -             insn_info *def_insn = extract_single_source (set);
> -             if (!def_insn)
> -               continue;
> -
> -             /* Handle this case:
> -                vsetvli        a6,zero,e32,m1,ta,mu
> -                li     a5,4096
> -                add    a7,a0,a5
> -                addi   a7,a7,-96
> -                vsetvli        t1,zero,e8,mf8,ta,ma
> -                vle8.v v24,0(a7)
> -                add    a5,a3,a5
> -                addi   a5,a5,-96
> -                vse8.v v24,0(a5)
> -                vsetvli        zero,a6,e32,m1,tu,ma
> -             */
> -             if (vsetvl_insn_p (def_insn->rtl ()))
> -               {
> -                 vl_vtype_info def_info = get_vl_vtype_info (def_insn);
> -                 vl_vtype_info info = get_vl_vtype_info (insn);
> -                 rtx avl = get_avl (def_insn->rtl ());
> -                 rtx vl = get_vl (def_insn->rtl ());
> -                 if (def_info.get_ratio () == info.get_ratio ())
> -                   {
> -                     if (vlmax_avl_p (def_info.get_avl ()))
> -                       {
> -                         info.set_avl_info (
> -                           avl_info (def_info.get_avl (), nullptr));
> -                         rtx new_pat
> -                           = gen_vsetvl_pat (VSETVL_NORMAL, info, vl);
> -                         validate_change (insn->rtl (),
> -                                          &PATTERN (insn->rtl ()), new_pat,
> -                                          false);
> -                         continue;
> -                       }
> -                     if (def_info.has_avl_imm () || rtx_equal_p (avl, vl))
> -                       {
> -                         info.set_avl_info (avl_info (avl, nullptr));
> -                         emit_vsetvl_insn (VSETVL_DISCARD_RESULT, EMIT_AFTER,
> -                                           info, NULL_RTX, insn->rtl ());
> -                         if (set->single_nondebug_insn_use ())
> -                           {
> -                             to_delete.add (insn->rtl ());
> -                             to_delete.add (def_insn->rtl ());
> -                           }
> -                         continue;
> -                       }
> -                   }
> -               }
> -           }
> -
> -         /* Change vsetvl rd, rs1 --> vsevl zero, rs1,
> -            if rd is not used by any nondebug instructions.
> -            Even though this PASS runs after RA and it doesn't help for
> -            reduce register pressure, it can help instructions scheduling
> -            since we remove the dependencies.  */
> -         if (vsetvl_insn_p (insn->rtl ()))
> -           {
> -             rtx vl = get_vl (insn->rtl ());
> -             rtx avl = get_avl (insn->rtl ());
> -             def_info *def = find_access (insn->defs (), REGNO (vl));
> -             set_info *set = safe_dyn_cast<set_info *> (def);
> +             rtx vl = get_vl (rinsn);
>               vector_insn_info info;
> -             info.parse_insn (insn);
> -             gcc_assert (set);
> -             if (m_vector_manager->to_delete_vsetvls.contains (insn->rtl ()))
> -               {
> -                 m_vector_manager->to_delete_vsetvls.remove (insn->rtl ());
> -                 if (m_vector_manager->to_refine_vsetvls.contains (
> -                       insn->rtl ()))
> -                   m_vector_manager->to_refine_vsetvls.remove (insn->rtl ());
> -                 if (!set->has_nondebug_insn_uses ())
> -                   {
> -                     to_delete.add (insn->rtl ());
> -                     continue;
> -                   }
> -               }
> -             if (m_vector_manager->to_refine_vsetvls.contains (insn->rtl ()))
> +             info.parse_insn (rinsn);
> +             bool to_delete_p = m_vector_manager->to_delete_p (rinsn);
> +             bool to_refine_p = m_vector_manager->to_refine_p (rinsn);
> +             if (has_no_uses (cfg_bb, rinsn, REGNO (vl)))
>                 {
> -                 m_vector_manager->to_refine_vsetvls.remove (insn->rtl ());
> -                 if (!set->has_nondebug_insn_uses ())
> +                 if (to_delete_p)
> +                   to_delete.add (rinsn);
> +                 else if (to_refine_p)
>                     {
>                       rtx new_pat = gen_vsetvl_pat (VSETVL_VTYPE_CHANGE_ONLY,
>                                                     info, NULL_RTX);
> -                     change_insn (insn->rtl (), new_pat);
> -                     continue;
> +                     validate_change (rinsn, &PATTERN (rinsn), new_pat, false);
> +                   }
> +                 else if (!vlmax_avl_p (info.get_avl ()))
> +                   {
> +                     rtx new_pat = gen_vsetvl_pat (VSETVL_DISCARD_RESULT, info,
> +                                                   NULL_RTX);
> +                     validate_change (rinsn, &PATTERN (rinsn), new_pat, false);
>                     }
> -               }
> -             if (vlmax_avl_p (avl))
> -               continue;
> -             rtx new_pat
> -               = gen_vsetvl_pat (VSETVL_DISCARD_RESULT, info, NULL_RTX);
> -             if (!set->has_nondebug_insn_uses ())
> -               {
> -                 validate_change (insn->rtl (), &PATTERN (insn->rtl ()),
> -                                  new_pat, false);
> -                 continue;
>                 }
>             }
>         }
>      }
> -
>    for (rtx_insn *rinsn : to_delete)
>      eliminate_insn (rinsn);
>  }
> @@ -4593,16 +4707,16 @@ pass_vsetvl::lazy_vsetvl (void)
>      fprintf (dump_file, "\nPhase 4: PRE vsetvl by Lazy code motion (LCM)\n");
>    pre_vsetvl ();
>
> -  /* Phase 5 - Cleanup AVL && VL operand of RVV instruction.  */
> +  /* Phase 5 - Post optimization base on RTL_SSA.  */
>    if (dump_file)
> -    fprintf (dump_file, "\nPhase 5: Cleanup AVL and VL operands\n");
> -  cleanup_insns ();
> +    fprintf (dump_file, "\nPhase 5: Post optimization base on RTL_SSA\n");
> +  ssa_post_optimization ();
>
> -  /* Phase 6 - Rebuild RTL_SSA to propagate AVL between vsetvls.  */
> +  /* Phase 6 - Post optimization base on data-flow analysis.  */
>    if (dump_file)
>      fprintf (dump_file,
> -            "\nPhase 6: Rebuild RTL_SSA to propagate AVL between vsetvls\n");
> -  propagate_avl ();
> +            "\nPhase 6: Post optimization base on data-flow analysis\n");
> +  df_post_optimization ();
>  }
>
>  /* Main entry point for this pass.  */
> diff --git a/gcc/config/riscv/riscv-vsetvl.h b/gcc/config/riscv/riscv-vsetvl.h
> index d7a6c14e931..4257451bb74 100644
> --- a/gcc/config/riscv/riscv-vsetvl.h
> +++ b/gcc/config/riscv/riscv-vsetvl.h
> @@ -290,13 +290,6 @@ private:
>       definition of AVL.  */
>    rtl_ssa::insn_info *m_insn;
>
> -  /* Parse the instruction to get VL/VTYPE information and demanding
> -   * information.  */
> -  /* This is only called by simple_vsetvl subroutine when optimize == 0.
> -     Since RTL_SSA can not be enabled when optimize == 0, we don't initialize
> -     the m_insn.  */
> -  void parse_insn (rtx_insn *);
> -
>    friend class vector_infos_manager;
>
>  public:
> @@ -305,6 +298,12 @@ public:
>        m_insn (nullptr)
>    {}
>
> +  /* Parse the instruction to get VL/VTYPE information and demanding
> +   * information.  */
> +  /* This is only called by simple_vsetvl subroutine when optimize == 0.
> +     Since RTL_SSA can not be enabled when optimize == 0, we don't initialize
> +     the m_insn.  */
> +  void parse_insn (rtx_insn *);
>    /* This is only called by lazy_vsetvl subroutine when optimize > 0.
>       We use RTL_SSA framework to initialize the insn_info.  */
>    void parse_insn (rtl_ssa::insn_info *);
> @@ -454,6 +453,27 @@ public:
>    bool all_empty_predecessor_p (const basic_block) const;
>    bool all_avail_in_compatible_p (const basic_block) const;
>
> +  bool to_delete_p (rtx_insn *rinsn)
> +  {
> +    if (to_delete_vsetvls.contains (rinsn))
> +      {
> +       to_delete_vsetvls.remove (rinsn);
> +       if (to_refine_vsetvls.contains (rinsn))
> +         to_refine_vsetvls.remove (rinsn);
> +       return true;
> +      }
> +    return false;
> +  }
> +  bool to_refine_p (rtx_insn *rinsn)
> +  {
> +    if (to_refine_vsetvls.contains (rinsn))
> +      {
> +       to_refine_vsetvls.remove (rinsn);
> +       return true;
> +      }
> +    return false;
> +  }
> +
>    void release (void);
>    void create_bitmap_vectors (void);
>    void free_bitmap_vectors (void);
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c
> index e0c6588b1db..29e05c4982b 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c
> @@ -16,5 +16,5 @@ void f(int8_t *base, int8_t *out, size_t vl, size_t m) {
>    }
>  }
>
> -/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
>  /* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*10} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c
> index 0c5da5e640c..ff0171b3ff6 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c
> @@ -17,4 +17,4 @@ void f(int8_t *base, int8_t *out, size_t vl, size_t m) {
>  }
>
>  /* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*10} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> -/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c
> new file mode 100644
> index 00000000000..551920c6a72
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv32gcv -mabi=ilp32 -fno-schedule-insns -fno-schedule-insns2" } */
> +
> +#include "riscv_vector.h"
> +
> +void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
> +  size_t avl;
> +  if (m > 100)
> +    avl = __riscv_vsetvl_e16mf4(vl << 4);
> +  else{
> +    if (k)
> +      avl = __riscv_vsetvl_e8mf8(vl);
> +  }
> +  for (size_t i = 0; i < m; i++) {
> +    vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
> +    __riscv_vse8_v_i8mf8(out + i, v0, avl);
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*4} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c
> new file mode 100644
> index 00000000000..103f4238c76
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv32gcv -mabi=ilp32 -fno-schedule-insns -fno-schedule-insns2" } */
> +
> +#include "riscv_vector.h"
> +
> +void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
> +  size_t avl;
> +  if (m > 100)
> +    avl = __riscv_vsetvl_e16mf4(vl << 4);
> +  else
> +    avl = __riscv_vsetvl_e32mf2(vl >> 8);
> +  for (size_t i = 0; i < m; i++) {
> +    vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
> +    v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
> +    __riscv_vse8_v_i8mf8(out + i, v0, avl);
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*4} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli\s+zero,\s*[a-x0-9]+,\s*e8,\s*mf8,\s*t[au],\s*m[au]} 2 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c
> new file mode 100644
> index 00000000000..66c90ac10e7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c
> @@ -0,0 +1,37 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv32gcv -mabi=ilp32 -fno-schedule-insns -fno-schedule-insns2" } */
> +
> +#include "riscv_vector.h"
> +
> +void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
> +  size_t avl;
> +  switch (m)
> +  {
> +  case 50:
> +    avl = __riscv_vsetvl_e16mf4(vl << 4);
> +    break;
> +  case 1:
> +    avl = __riscv_vsetvl_e32mf2(k);
> +    break;
> +  case 2:
> +    avl = __riscv_vsetvl_e64m1(vl);
> +    break;
> +  case 3:
> +    avl = __riscv_vsetvl_e32mf2(k >> 8);
> +    break;
> +  default:
> +    avl = __riscv_vsetvl_e32mf2(k + vl);
> +    break;
> +  }
> +  for (size_t i = 0; i < m; i++) {
> +    vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
> +    v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
> +    v0 = __riscv_vadd_vv_i8mf8_tu (v0, v0, v0, avl);
> +    __riscv_vse8_v_i8mf8(out + i, v0, avl);
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*4} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {srli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*8} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 5 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli\s+zero,\s*[a-x0-9]+,\s*e8,\s*mf8,\s*tu,\s*m[au]} 5 { target { no-opts "-O0" no-opts "-Os" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c
> index f995e04aacc..13d09fc3fd1 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c
> @@ -18,4 +18,4 @@ void f(int8_t *base, int8_t *out, size_t vl, size_t m) {
>  }
>
>  /* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*10} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> -/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> --
> 2.36.1
>