public inbox for gcc-patches@gcc.gnu.org
From: Kito Cheng <kito.cheng@sifive.com>
To: Juzhe-Zhong <juzhe.zhong@rivai.ai>
Cc: gcc-patches@gcc.gnu.org, kito.cheng@gmail.com,
	jeffreyalaw@gmail.com,  rdapp.gcc@gmail.com,
	"Patrick O'Neill" <patrick@rivosinc.com>
Subject: Re: [PATCH V2] RISC-V: Add AVL propagation PASS for RVV auto-vectorization
Date: Thu, 26 Oct 2023 15:51:57 +0800	[thread overview]
Message-ID: <CALLt3ThXmk4pey2QhSUvK183uuK3oY5bU=a4m8QYv-6UukBYyg@mail.gmail.com> (raw)
In-Reply-To: <20231025120518.1319929-1-juzhe.zhong@rivai.ai>

LGTM, thanks! It's really awesome: the implementation is simpler than
I expected, and it's another great improvement for RISC-V GCC!

Just make sure Patrick gives a green light on the testing before
committing the patch :)




On Wed, Oct 25, 2023 at 8:05 PM Juzhe-Zhong <juzhe.zhong@rivai.ai> wrote:
>
> This patch addresses the redundant AVL/VL toggling in RVV partial auto-vectorization,
> which has been a known issue for a long time; I finally found the time to address it.
>
> Consider a simple vector addition operation:
>
> https://godbolt.org/z/7hfGfEjW3
>
> void
> foo (int *__restrict a,
>      int *__restrict b,
>      int n)
> {
>   for (int i = 0; i < n; i++)
>       a[i] = a[i] + b[i];
> }
>
> Optimized IR:
>
> Loop body:
>   _38 = .SELECT_VL (ivtmp_36, POLY_INT_CST [4, 4]);                          -> vsetvli a5,a2,e8,mf4,ta,ma
>   ...
>   vect__4.8_27 = .MASK_LEN_LOAD (vectp_a.6_29, 32B, { -1, ... }, _38, 0);    -> vle32.v v2,0(a0)
>   vect__6.11_20 = .MASK_LEN_LOAD (vectp_b.9_25, 32B, { -1, ... }, _38, 0);   -> vle32.v v1,0(a1)
>   vect__7.12_19 = vect__6.11_20 + vect__4.8_27;                              -> vsetvli a6,zero,e32,m1,ta,ma + vadd.vv v1,v1,v2
>   .MASK_LEN_STORE (vectp_a.13_11, 32B, { -1, ... }, _38, 0, vect__7.12_19);  -> vsetvli zero,a5,e32,m1,ta,ma + vse32.v v1,0(a4)
>
> We can see 2 redundant vsetvls inside the loop body due to AVL/VL toggling.
> The AVL/VL toggling happens because we are missing LEN information in the simple PLUS_EXPR GIMPLE assignment:
>
> vect__7.12_19 = vect__6.11_20 + vect__4.8_27;
>
> GCC applies predicated loads/stores and un-predicated full-vector operations in partial vectorization.
> This flow is used by other targets such as ARM SVE (RVV uses the same flow):
>
> ARM SVE:
>
> .L3:
>         ld1w    z30.s, p7/z, [x0, x3, lsl 2]   -> predicated load
>         ld1w    z31.s, p7/z, [x1, x3, lsl 2]   -> predicated load
>         add     z31.s, z31.s, z30.s            -> un-predicated add
>         st1w    z31.s, p7, [x0, x3, lsl 2]     -> predicated store
>
> Such a vectorization flow causes AVL/VL toggling on RVV, so we need an AVL propagation PASS for it.
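The toggling cost can be modeled abstractly (this is an illustrative sketch, not GCC code, and the instruction list below is an invented stand-in for the loop body): a vsetvl is required whenever the (AVL, SEW/LMUL ratio) state changes between consecutive vector instructions, so the un-predicated add's VLMAX state forces two extra toggles per iteration.

```python
# Illustrative model only (not GCC code): a vsetvl is required whenever the
# (AVL, SEW/LMUL-ratio) state changes between consecutive vector instructions.
def count_vsetvls(insns):
    """insns: list of (avl, ratio) pairs in program order."""
    count, state = 0, None
    for avl_ratio in insns:
        if avl_ratio != state:
            count += 1
            state = avl_ratio
    return count

# Loop body above: the loads/store use AVL 'a5', the un-predicated add uses VLMAX.
print(count_vsetvls([("a5", 32), ("a5", 32), ("vlmax", 32), ("a5", 32)]))  # 3

# After propagating 'a5' into the add, one vsetvl covers the whole body.
print(count_vsetvls([("a5", 32)] * 4))  # 1
```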
>
> Also, it's very unlikely that we can apply predicated operations to all vectorized operations, for the following reasons:
>
> 1. Supporting them for all vectorized operations is a heavy workload, and we see no benefit over handling this in the target backend.
> 2. Changing the loop vectorizer for it would make the code base ugly and hard to maintain.
> 3. We would need patterns for all operations: not only COND_LEN_ADD, COND_LEN_SUB, ...,
>    but also COND_LEN_EXTEND, ..., COND_LEN_CEIL, and so on: over 100 patterns, an unreasonable number.
>
> To conclude, we prefer un-predicated operations here, and design a clean AVL propagation PASS to elide the redundant vsetvls
> caused by AVL/VL toggling.
>
> The second question is why we add a separate AVL propagation PASS instead of optimizing it in the VSETVL PASS (we definitely could optimize AVL in the VSETVL PASS).
>
> Frankly, I was planning to address this issue in the VSETVL PASS, which is why we recently refactored it. However, I changed my mind after several
> experiments and attempts.
>
> The reasons are as follows:
>
> 1. Code base management and maintainability. The current VSETVL PASS is complicated enough, and already has enough aggressive and fancy optimizations that
>    it can generate optimal codegen in most cases. It's not a good idea to keep adding features to the VSETVL PASS and make it
>    heavier and heavier, or we will need to refactor it again in the future.
>    Actually, the VSETVL PASS is very stable and optimal after the recent refactoring. Hopefully, we should not change the VSETVL PASS any more except for minor
>    fixes.
>
> 2. vsetvl insertion (which the VSETVL PASS does) and AVL propagation are 2 different things; I don't think we should fuse them into the same PASS.
>
> 3. The VSETVL PASS is a post-RA PASS, whereas AVL propagation should be done before RA, where it can reduce register pressure.
>
> 4. This patch's AVL propagation PASS only does AVL propagation for RVV partial auto-vectorization situations.
>    The patch is only a few hundred lines, which is very manageable and can easily be extended with features and enhancements.
>    We can easily add more AVL propagation in this clean and separate PASS in the future. (If we did it in the VSETVL PASS, we would complicate
>    the VSETVL PASS again, and it is already complicated enough.)
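The propagation rule this PASS implements can be sketched as a small simulation (a hypothetical model with an invented dict-based IR, not the actual RTL_SSA implementation): a VLMAX tail-agnostic instruction adopts the AVL of its users when all users agree on a single non-VLMAX AVL and the same SEW/LMUL ratio, iterated to a fixpoint so that chains of VLMAX instructions are handled too.

```python
# Illustrative sketch only (not the GCC implementation): propagate a
# NON-VLMAX AVL into VLMAX tail-agnostic instructions whose results are
# consumed only by users agreeing on one AVL and one SEW/LMUL ratio.
def avl_propagate(insns, uses):
    """insns: {name: {"avl": str, "ratio": int, "ta": bool}}
       uses:  {name: [names of instructions consuming its result]}"""
    prop = {}                            # name -> propagated AVL
    changed = True
    while changed:                       # iterate to a fixpoint
        changed = False
        for name, info in insns.items():
            if info["avl"] != "vlmax" or not info["ta"]:
                continue                 # only VLMAX + tail-agnostic candidates
            avls, ok = set(), True
            for user in uses.get(name, []):
                u = insns[user]
                u_avl = prop.get(user, u["avl"])
                if u_avl == "vlmax" or u["ratio"] != info["ratio"]:
                    ok = False           # user still VLMAX, or ratio mismatch
                    break
                avls.add(u_avl)
            if ok and len(avls) == 1:    # all users agree on a single AVL
                new_avl = avls.pop()
                if prop.get(name) != new_avl:
                    prop[name] = new_avl
                    changed = True
    return prop

# Pattern from the cover letter: loads/store use 'a5', the add is VLMAX.
insns = {
    "vle_a": {"avl": "a5", "ratio": 32, "ta": True},
    "vle_b": {"avl": "a5", "ratio": 32, "ta": True},
    "vadd":  {"avl": "vlmax", "ratio": 32, "ta": True},
    "vse":   {"avl": "a5", "ratio": 32, "ta": True},
}
print(avl_propagate(insns, {"vadd": ["vse"]}))  # {'vadd': 'a5'}
```

The fixpoint loop matters for chains: a VLMAX add feeding another VLMAX add only becomes propagatable after its consumer has adopted a concrete AVL.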
>
> Here is an example to demonstrate more:
>
> https://godbolt.org/z/bE86sv3q5
>
> void foo2 (int *__restrict a,
>           int *__restrict b,
>           int *__restrict c,
>           int *__restrict a2,
>           int *__restrict b2,
>           int *__restrict c2,
>           int *__restrict a3,
>           int *__restrict b3,
>           int *__restrict c3,
>           int *__restrict a4,
>           int *__restrict b4,
>           int *__restrict c4,
>           int *__restrict a5,
>           int *__restrict b5,
>           int *__restrict c5,
>           int n)
> {
>     for (int i = 0; i < n; i++){
>       a[i] = b[i] + c[i];
>       b5[i] = b[i] + c[i];
>       a2[i] = b2[i] + c2[i];
>       a3[i] = b3[i] + c3[i];
>       a4[i] = b4[i] + c4[i];
>       a5[i] = a[i] + a4[i];
>       a[i] = a5[i] + b5[i]+ a[i];
>
>       a[i] = a[i] + c[i];
>       b5[i] = a[i] + c[i];
>       a2[i] = a[i] + c2[i];
>       a3[i] = a[i] + c3[i];
>       a4[i] = a[i] + c4[i];
>       a5[i] = a[i] + a4[i];
>       a[i] = a[i] + b5[i]+ a[i];
>     }
> }
>
> 1. Loop Body:
>
> Before this patch:                                          After this patch:
>
>               vsetvli a4,t1,e8,mf4,ta,ma                           vsetvli      a4,t1,e32,m1,ta,ma
>         vle32.v v2,0(a2)                                     vle32.v    v2,0(a2)
>         vle32.v v4,0(a1)                                     vle32.v    v3,0(t2)
>         vle32.v v1,0(t2)                                     vle32.v    v4,0(a1)
>         vsetvli a7,zero,e32,m1,ta,ma                         vle32.v    v1,0(t0)
>         vadd.vv v4,v2,v4                                     vadd.vv    v4,v2,v4
>         vsetvli zero,a4,e32,m1,ta,ma                         vadd.vv    v1,v3,v1
>         vle32.v v3,0(s0)                                     vadd.vv    v1,v1,v4
>         vsetvli a7,zero,e32,m1,ta,ma                         vadd.vv    v1,v1,v4
>         vadd.vv v1,v3,v1                                     vadd.vv    v1,v1,v4
>         vadd.vv v1,v1,v4                                     vadd.vv    v1,v1,v2
>         vadd.vv v1,v1,v4                                     vadd.vv    v2,v1,v2
>         vadd.vv v1,v1,v4                                     vse32.v    v2,0(t5)
>         vsetvli zero,a4,e32,m1,ta,ma                         vadd.vv    v2,v2,v1
>         vle32.v v4,0(a5)                                     vadd.vv    v2,v2,v1
>         vsetvli a7,zero,e32,m1,ta,ma                         slli       a7,a4,2
>         vadd.vv v1,v1,v2                                     vadd.vv    v3,v1,v3
>         vadd.vv v2,v1,v2                                     vle32.v    v5,0(a5)
>         vadd.vv v4,v1,v4                                     vle32.v    v6,0(t6)
>         vsetvli zero,a4,e32,m1,ta,ma                         vse32.v    v3,0(t3)
>         vse32.v v2,0(t5)                                     vse32.v    v2,0(a0)
>         vse32.v v4,0(a3)                                     vadd.vv    v3,v3,v1
>         vsetvli a7,zero,e32,m1,ta,ma                         vadd.vv    v2,v1,v5
>         vadd.vv v3,v1,v3                                     vse32.v    v3,0(t4)
>         vadd.vv v2,v2,v1                                     vadd.vv    v1,v1,v6
>         vadd.vv v2,v2,v1                                     vse32.v    v2,0(a3)
>         vsetvli zero,a4,e32,m1,ta,ma                         vse32.v    v1,0(a6)
>         vse32.v v2,0(a0)
>         vse32.v v3,0(t3)
>         vle32.v v2,0(t0)
>         vsetvli a7,zero,e32,m1,ta,ma
>         vadd.vv v3,v3,v1
>         vsetvli zero,a4,e32,m1,ta,ma
>         vse32.v v3,0(t4)
>         vsetvli a7,zero,e32,m1,ta,ma
>         slli    a7,a4,2
>         vadd.vv v1,v1,v2
>         sub     t1,t1,a4
>         vsetvli zero,a4,e32,m1,ta,ma
>         vse32.v v1,0(a6)
>
> It's quite obvious: all the heavy and redundant vsetvls inside the loop body are eliminated.
>
> 2. Epilogue:
>     Before this patch:                                          After this patch:
>
>      .L5:                                                      .L5:
>         ld      s0,8(sp)                                         ret
>         addi    sp,sp,16
>         jr      ra
>
> This is the benefit of doing AVL propagation before RA: we eliminate the use of the 'a7' register,
> which was consumed by the redundant AVL/VL toggling instruction 'vsetvli a7,zero,e32,m1,ta,ma'.
>
> The final codegen after this patch:
>
> foo2:
>         lw      t1,56(sp)
>         ld      t6,0(sp)
>         ld      t3,8(sp)
>         ld      t0,16(sp)
>         ld      t2,24(sp)
>         ld      t4,32(sp)
>         ld      t5,40(sp)
>         ble     t1,zero,.L5
> .L3:
>         vsetvli a4,t1,e32,m1,ta,ma
>         vle32.v v2,0(a2)
>         vle32.v v3,0(t2)
>         vle32.v v4,0(a1)
>         vle32.v v1,0(t0)
>         vadd.vv v4,v2,v4
>         vadd.vv v1,v3,v1
>         vadd.vv v1,v1,v4
>         vadd.vv v1,v1,v4
>         vadd.vv v1,v1,v4
>         vadd.vv v1,v1,v2
>         vadd.vv v2,v1,v2
>         vse32.v v2,0(t5)
>         vadd.vv v2,v2,v1
>         vadd.vv v2,v2,v1
>         slli    a7,a4,2
>         vadd.vv v3,v1,v3
>         vle32.v v5,0(a5)
>         vle32.v v6,0(t6)
>         vse32.v v3,0(t3)
>         vse32.v v2,0(a0)
>         vadd.vv v3,v3,v1
>         vadd.vv v2,v1,v5
>         vse32.v v3,0(t4)
>         vadd.vv v1,v1,v6
>         vse32.v v2,0(a3)
>         vse32.v v1,0(a6)
>         sub     t1,t1,a4
>         add     a1,a1,a7
>         add     a2,a2,a7
>         add     a5,a5,a7
>         add     t6,t6,a7
>         add     t0,t0,a7
>         add     t2,t2,a7
>         add     t5,t5,a7
>         add     a3,a3,a7
>         add     a6,a6,a7
>         add     t3,t3,a7
>         add     t4,t4,a7
>         add     a0,a0,a7
>         bne     t1,zero,.L3
> .L5:
>         ret
>
>
>         PR target/111318
>         PR target/111888
>
> gcc/ChangeLog:
>
>         * config.gcc: Add AVL propagation PASS.
>         * config/riscv/riscv-passes.def (INSERT_PASS_AFTER): Ditto.
>         * config/riscv/riscv-protos.h (make_pass_avlprop): Ditto.
>         * config/riscv/t-riscv: Ditto.
>         * config/riscv/riscv-avlprop.cc: New file.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/riscv/rvv/autovec/partial/select_vl-2.c: Adapt test.
>         * gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c: Ditto.
>         * gcc.target/riscv/rvv/autovec/pr111318.c: New test.
>         * gcc.target/riscv/rvv/autovec/pr111888.c: New test.
>
> ---
>  gcc/config.gcc                                |   2 +-
>  gcc/config/riscv/riscv-avlprop.cc             | 419 ++++++++++++++++++
>  gcc/config/riscv/riscv-passes.def             |   1 +
>  gcc/config/riscv/riscv-protos.h               |   1 +
>  gcc/config/riscv/t-riscv                      |   6 +
>  .../riscv/rvv/autovec/partial/select_vl-2.c   |   5 +-
>  .../gcc.target/riscv/rvv/autovec/pr111318.c   |  16 +
>  .../gcc.target/riscv/rvv/autovec/pr111888.c   |  33 ++
>  .../riscv/rvv/autovec/ternop/ternop_nofm-2.c  |   1 -
>  9 files changed, 480 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/config/riscv/riscv-avlprop.cc
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 606d3a8513e..efd53965c9a 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -544,7 +544,7 @@ pru-*-*)
>  riscv*)
>         cpu_type=riscv
>         extra_objs="riscv-builtins.o riscv-c.o riscv-sr.o riscv-shorten-memrefs.o riscv-selftests.o riscv-string.o"
> -       extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o"
> +       extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o riscv-avlprop.o"
>         extra_objs="${extra_objs} riscv-vector-builtins.o riscv-vector-builtins-shapes.o riscv-vector-builtins-bases.o"
>         extra_objs="${extra_objs} thead.o"
>         d_target_objs="riscv-d.o"
> diff --git a/gcc/config/riscv/riscv-avlprop.cc b/gcc/config/riscv/riscv-avlprop.cc
> new file mode 100644
> index 00000000000..2c79ec81806
> --- /dev/null
> +++ b/gcc/config/riscv/riscv-avlprop.cc
> @@ -0,0 +1,419 @@
> +/* AVL propagation pass for RISC-V 'V' Extension for GNU compiler.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +   Contributed by Juzhe Zhong (juzhe.zhong@rivai.ai), RiVAI Technologies Ltd.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify
> +it under the terms of the GNU General Public License as published by
> +the Free Software Foundation; either version 3, or (at your option)
> +any later version.
> +
> +GCC is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +GNU General Public License for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +/* A pre-RA RTL_SSA-based pass that propagates AVL for RVV instructions.
> +   A standalone AVL propagation pass is designed because:
> +
> +     - Better code maintainability:
> +       The current LCM-based VSETVL pass is so complicated that the code
> +       there would become even harder to maintain. A straightforward
> +       AVL propagation PASS is much easier to maintain.
> +
> +     - Reduce scalar register pressure:
> +       One type of AVL propagation is propagating the AVL from a NON-VLMAX
> +       instruction to a VLMAX instruction.
> +       Note: the VLMAX instruction must ignore tail elements (TA)
> +       and its result must be used by the NON-VLMAX instruction.
> +       This optimization is mostly for auto-vectorization codes:
> +
> +         vsetvli r136, r137      --- SELECT_VL
> +         vle8.v (use avl = r136) --- IFN_MASK_LEN_LOAD
> +         vadd.vv (use VLMAX)     --- PLUS_EXPR
> +         vse8.v (use avl = r136) --- IFN_MASK_LEN_STORE
> +
> +       Without AVL propagation:
> +
> +         vsetvli a5, a4, ta
> +         vle8.v v1
> +         vsetvli t0, zero, ta
> +         vadd.vv v2, v1, v1
> +         vse8.v v2
> +
> +       We can propagate the AVL to 'vadd.vv' since its result
> +       is consumed by a 'vse8.v' which has AVL = a5 and its
> +       tail elements are agnostic.
> +
> +       We DON'T do this optimization in the VSETVL pass since it is a
> +       post-RA pass in which 't0' is already allocated, whereas a
> +       standalone pre-RA AVL propagation pass allows us to elide the use
> +       of the pseudo register behind 't0', and thus reduce scalar
> +       register pressure.
> +
> +     - More AVL propagation opportunities:
> +       A pre-RA pass is more flexible with the AVL REG def-use chain,
> +       so we get more potential AVL propagations as long as
> +       they don't increase the scalar register pressure.
> +*/
> +
> +#define IN_TARGET_CODE 1
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "tm.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "target.h"
> +#include "tree-pass.h"
> +#include "df.h"
> +#include "rtl-ssa.h"
> +#include "cfgcleanup.h"
> +#include "insn-attr.h"
> +
> +using namespace rtl_ssa;
> +using namespace riscv_vector;
> +
> +enum avlprop_type
> +{
> +  /* VLMAX AVL and tail agnostic candidates.  */
> +  AVLPROP_VLMAX_TA,
> +  AVLPROP_NONE
> +};
> +
> +/* dump helper functions */
> +static const char *
> +avlprop_type_to_str (enum avlprop_type type)
> +{
> +  switch (type)
> +    {
> +    case AVLPROP_VLMAX_TA:
> +      return "vlmax_ta";
> +
> +    default:
> +      gcc_unreachable ();
> +    }
> +}
> +
> +static bool
> +vlmax_ta_p (rtx_insn *rinsn)
> +{
> +  return vlmax_avl_type_p (rinsn) && tail_agnostic_p (rinsn);
> +}
> +
> +const pass_data pass_data_avlprop = {
> +  RTL_PASS,     /* type */
> +  "avlprop",    /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_NONE,      /* tv_id */
> +  0,            /* properties_required */
> +  0,            /* properties_provided */
> +  0,            /* properties_destroyed */
> +  0,            /* todo_flags_start */
> +  0,            /* todo_flags_finish */
> +};
> +
> +class pass_avlprop : public rtl_opt_pass
> +{
> +public:
> +  pass_avlprop (gcc::context *ctxt) : rtl_opt_pass (pass_data_avlprop, ctxt) {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *) final override
> +  {
> +    return TARGET_VECTOR && optimize > 0;
> +  }
> +  virtual unsigned int execute (function *) final override;
> +
> +private:
> +  /* The AVL propagation instructions and corresponding preferred AVL.
> +     It will be updated during the analysis.  */
> +  hash_map<insn_info *, rtx> *m_avl_propagations;
> +
> +  /* Potential feasible AVL propagation candidates.  */
> +  auto_vec<std::pair<enum avlprop_type, insn_info *>> m_candidates;
> +
> +  rtx get_preferred_avl (const std::pair<enum avlprop_type, insn_info *>) const;
> +  rtx get_vlmax_ta_preferred_avl (insn_info *) const;
> +  rtx get_nonvlmax_avl (insn_info *) const;
> +
> +  void avlprop_init (function *);
> +  void avlprop_done (void);
> +}; // class pass_avlprop
> +
> +void
> +pass_avlprop::avlprop_init (function *fn)
> +{
> +  calculate_dominance_info (CDI_DOMINATORS);
> +  df_analyze ();
> +  crtl->ssa = new function_info (fn);
> +  m_avl_propagations = new hash_map<insn_info *, rtx>;
> +}
> +
> +void
> +pass_avlprop::avlprop_done (void)
> +{
> +  free_dominance_info (CDI_DOMINATORS);
> +  if (crtl->ssa->perform_pending_updates ())
> +    cleanup_cfg (0);
> +  delete crtl->ssa;
> +  crtl->ssa = nullptr;
> +  delete m_avl_propagations;
> +  m_avl_propagations = NULL;
> +  if (!m_candidates.is_empty ())
> +    m_candidates.release ();
> +}
> +
> +/* If we have a preferred AVL to propagate, return the AVL.
> +   Otherwise, return NULL_RTX as we don't have any preferred
> +   AVL.  */
> +
> +rtx
> +pass_avlprop::get_preferred_avl (
> +  const std::pair<enum avlprop_type, insn_info *> candidate) const
> +{
> +  switch (candidate.first)
> +    {
> +    case AVLPROP_VLMAX_TA:
> +      return get_vlmax_ta_preferred_avl (candidate.second);
> +    default:
> +      gcc_unreachable ();
> +    }
> +  return NULL_RTX;
> +}
> +
> +/* This is a straightforward pattern that ALWAYS appears in partial auto-vectorization:
> +
> +     VL = SELECT_AVL (AVL, ...)
> +     V0 = MASK_LEN_LOAD (..., VL)
> +     V1 = MASK_LEN_LOAD (..., VL)
> +     V2 = V0 + V1 --- Missed LEN information.
> +     MASK_LEN_STORE (..., V2, VL)
> +
> +   We prefer PLUS_EXPR (V0 + V1) instead of COND_LEN_ADD (V0, V1, dummy LEN)
> +   because:
> +
> +     - Few code changes in the loop vectorizer.
> +     - Reuse the current clean flow of partial vectorization. That is, apply
> +       a predicate LEN or MASK to LOAD/STORE operations and other special
> +       arithmetic operations (e.g. DIV), then perform the whole-vector-register
> +       operation where it doesn't affect correctness.
> +       This flow is used by all other targets like x86, SVE, s390, etc.
> +     - PLUS_EXPR has better gimple optimizations than COND_LEN_ADD.
> +
> +   We propagate the AVL from NON-VLMAX to VLMAX for gimple IR like PLUS_EXPR, which
> +   generates a VLMAX instruction due to the missing LEN information. The later
> +   VSETVL PASS will elide the redundant vsetvls.
> +*/
> +
> +rtx
> +pass_avlprop::get_vlmax_ta_preferred_avl (insn_info *insn) const
> +{
> +  int sew = get_sew (insn->rtl ());
> +  enum vlmul_type vlmul = get_vlmul (insn->rtl ());
> +  int ratio = calculate_ratio (sew, vlmul);
> +
> +  rtx use_avl = NULL_RTX;
> +  for (def_info *def : insn->defs ())
> +    {
> +      if (!is_a<set_info *> (def) || def->is_mem ())
> +       return NULL_RTX;
> +      const auto *set = dyn_cast<set_info *> (def);
> +
> +      /* FIXME: Stop AVL propagation if any USE is not a real RVV
> +        instruction. This should be enough for vectorized code since
> +        it is always located in extended basic blocks.
> +
> +        TODO: We can extend the PHI checking for intrinsic code if
> +        necessary in the future.  */
> +      if (!set->is_local_to_ebb ())
> +       return NULL_RTX;
> +
> +      for (use_info *use : set->nondebug_insn_uses ())
> +       {
> +         insn_info *use_insn = use->insn ();
> +         if (!use_insn->can_be_optimized () || use_insn->is_asm ()
> +             || use_insn->is_call () || use_insn->has_volatile_refs ()
> +             || use_insn->has_pre_post_modify ()
> +             || !has_vl_op (use_insn->rtl ())
> +             || !tail_agnostic_p (use_insn->rtl ()))
> +           return NULL_RTX;
> +
> +         int new_sew = get_sew (use_insn->rtl ());
> +         enum vlmul_type new_vlmul = get_vlmul (use_insn->rtl ());
> +         int new_ratio = calculate_ratio (new_sew, new_vlmul);
> +         if (new_ratio != ratio)
> +           return NULL_RTX;
> +
> +         rtx new_use_avl = get_nonvlmax_avl (use_insn);
> +         if (!new_use_avl || SUBREG_P (new_use_avl))
> +           return NULL_RTX;
> +         if (REG_P (new_use_avl))
> +           {
> +             resource_info resource = full_register (REGNO (new_use_avl));
> +             def_lookup dl = crtl->ssa->find_def (resource, use_insn);
> +             if (dl.matching_set ())
> +               return NULL_RTX;
> +             def_info *def1 = dl.prev_def (insn);
> +             def_info *def2 = dl.prev_def (use_insn);
> +             if (!def1 || !def2 || def1 != def2)
> +               return NULL_RTX;
> +
> +             /* FIXME: We only allow AVL propagation within a block, which
> +                should be enough for vectorized code.
> +
> +                TODO: We can enhance this for intrinsic code in the future
> +                if it is necessary.  */
> +             if (def1->insn ()->bb () != insn->bb ()
> +                 && !dominated_by_p (CDI_DOMINATORS, insn->bb ()->cfg_bb (),
> +                                     def1->insn ()->bb ()->cfg_bb ()))
> +               return NULL_RTX;
> +             if (def1->insn ()->bb () == insn->bb ()
> +                 && def1->insn ()->compare_with (insn) >= 0)
> +               return NULL_RTX;
> +           }
> +
> +         if (!use_avl)
> +           use_avl = new_use_avl;
> +         else if (!rtx_equal_p (use_avl, new_use_avl))
> +           return NULL_RTX;
> +       }
> +    }
> +
> +  return use_avl;
> +}
> +
> +/* Try to get the NON-VLMAX AVL of INSN.
> +   INSN can either be a NON-VLMAX AVL insn itself, or a VLMAX AVL insn
> +   before this PASS into which a NON-VLMAX AVL has been propagated
> +   in a previous propagation round.  */
> +rtx
> +pass_avlprop::get_nonvlmax_avl (insn_info *insn) const
> +{
> +  if (m_avl_propagations->get (insn))
> +    return (*m_avl_propagations->get (insn));
> +  else if (nonvlmax_avl_type_p (insn->rtl ()))
> +    {
> +      extract_insn_cached (insn->rtl ());
> +      return recog_data.operand[get_attr_vl_op_idx (insn->rtl ())];
> +    }
> +
> +  return NULL_RTX;
> +}
> +
> +/* Main entry point for this pass.  */
> +unsigned int
> +pass_avlprop::execute (function *fn)
> +{
> +  avlprop_init (fn);
> +
> +  /* Iterate over the whole function in reverse order (which can speed up
> +     convergence) to collect all potential candidates whose AVL could be
> +     propagated.
> +
> +     Note: **NOT** all the candidates will be successfully AVL propagated.
> +  */
> +  for (bb_info *bb : crtl->ssa->reverse_bbs ())
> +    {
> +      for (insn_info *insn : bb->reverse_real_nondebug_insns ())
> +       {
> +         /* We only forward the AVL to instructions that have an AVL/VL
> +            operand and can be optimized at the RTL_SSA level.  */
> +         if (!insn->can_be_optimized () || !has_vl_op (insn->rtl ()))
> +           continue;
> +
> +         /* TODO: We only do AVL propagation for VLMAX AVL with the tail
> +            agnostic policy, since partial auto-vectorization has missing
> +            LEN information.  We could add more AVL propagation
> +            for intrinsic code in the future.  */
> +         if (vlmax_ta_p (insn->rtl ()))
> +           m_candidates.safe_push (std::make_pair (AVLPROP_VLMAX_TA, insn));
> +       }
> +    }
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "\nNumber of potential AVL propagations: %d\n",
> +              m_candidates.length ());
> +      for (const auto candidate : m_candidates)
> +       {
> +         fprintf (dump_file, "\nAVL propagation type: %s\n",
> +                  avlprop_type_to_str (candidate.first));
> +         print_rtl_single (dump_file, candidate.second->rtl ());
> +       }
> +    }
> +
> +  /* Go through all the candidates looking for AVL that we could propagate. */
> +  bool change_p = true;
> +  while (change_p)
> +    {
> +      change_p = false;
> +      for (auto &candidate : m_candidates)
> +       {
> +         rtx new_avl = get_preferred_avl (candidate);
> +         if (new_avl)
> +           {
> +             gcc_assert (!vlmax_avl_p (new_avl));
> +             auto &update
> +               = m_avl_propagations->get_or_insert (candidate.second);
> +             change_p |= !rtx_equal_p (update, new_avl);
> +             update = new_avl;
> +           }
> +       }
> +    }
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "\nNumber of successful AVL propagations: %d\n\n",
> +            (int) m_avl_propagations->elements ());
> +
> +  for (const auto prop : *m_avl_propagations)
> +    {
> +      rtx_insn *rinsn = prop.first->rtl ();
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       {
> +         fprintf (dump_file, "\nPropagating AVL: ");
> +         print_rtl_single (dump_file, prop.second);
> +         fprintf (dump_file, "into: ");
> +         print_rtl_single (dump_file, rinsn);
> +       }
> +      /* Replace AVL operand.  */
> +      extract_insn_cached (rinsn);
> +      rtx avl = recog_data.operand[get_attr_vl_op_idx (rinsn)];
> +      int count = count_regno_occurrences (rinsn, REGNO (avl));
> +      gcc_assert (count == 1);
> +      rtx new_pat = simplify_replace_rtx (PATTERN (rinsn), avl, prop.second);
> +      validate_change_or_fail (rinsn, &PATTERN (rinsn), new_pat, false);
> +
> +      /* Change AVL TYPE into NONVLMAX if it is VLMAX.  */
> +      if (vlmax_avl_type_p (rinsn))
> +       {
> +         int index = get_attr_avl_type_idx (rinsn);
> +         gcc_assert (index != INVALID_ATTRIBUTE);
> +         validate_change_or_fail (rinsn, recog_data.operand_loc[index],
> +                                  get_avl_type_rtx (avl_type::NONVLMAX),
> +                                  false);
> +       }
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       {
> +         fprintf (dump_file, "Successfully matched this instruction: ");
> +         print_rtl_single (dump_file, rinsn);
> +       }
> +    }
> +
> +  avlprop_done ();
> +  return 0;
> +}
> +
> +rtl_opt_pass *
> +make_pass_avlprop (gcc::context *ctxt)
> +{
> +  return new pass_avlprop (ctxt);
> +}
> diff --git a/gcc/config/riscv/riscv-passes.def b/gcc/config/riscv/riscv-passes.def
> index 4084122cf0a..b6260939d5c 100644
> --- a/gcc/config/riscv/riscv-passes.def
> +++ b/gcc/config/riscv/riscv-passes.def
> @@ -18,4 +18,5 @@
>     <http://www.gnu.org/licenses/>.  */
>
>  INSERT_PASS_AFTER (pass_rtl_store_motion, 1, pass_shorten_memrefs);
> +INSERT_PASS_AFTER (pass_split_all_insns, 1, pass_avlprop);
>  INSERT_PASS_BEFORE (pass_fast_rtl_dce, 1, pass_vsetvl);
> diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
> index 668d75043ca..d4e17fc3fd0 100644
> --- a/gcc/config/riscv/riscv-protos.h
> +++ b/gcc/config/riscv/riscv-protos.h
> @@ -156,6 +156,7 @@ extern void riscv_parse_arch_string (const char *, struct gcc_options *, locatio
>  extern bool riscv_hard_regno_rename_ok (unsigned, unsigned);
>
>  rtl_opt_pass * make_pass_shorten_memrefs (gcc::context *ctxt);
> +rtl_opt_pass * make_pass_avlprop (gcc::context *ctxt);
>  rtl_opt_pass * make_pass_vsetvl (gcc::context *ctxt);
>
>  /* Routines implemented in riscv-string.c.  */
> diff --git a/gcc/config/riscv/t-riscv b/gcc/config/riscv/t-riscv
> index dd17056fe82..f8ca3f4ac57 100644
> --- a/gcc/config/riscv/t-riscv
> +++ b/gcc/config/riscv/t-riscv
> @@ -78,6 +78,12 @@ riscv-vector-costs.o: $(srcdir)/config/riscv/riscv-vector-costs.cc \
>         $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
>                 $(srcdir)/config/riscv/riscv-vector-costs.cc
>
> +riscv-avlprop.o: $(srcdir)/config/riscv/riscv-avlprop.cc \
> +  $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) $(REGS_H) \
> +  $(TARGET_H) tree-pass.h df.h rtl-ssa.h cfgcleanup.h insn-attr.h
> +       $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> +               $(srcdir)/config/riscv/riscv-avlprop.cc
> +
>  riscv-d.o: $(srcdir)/config/riscv/riscv-d.cc \
>    $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H)
>         $(COMPILE) $<
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
> index eac7cbc757b..ca88d42cdf4 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
> @@ -7,10 +7,11 @@
>  /*
>  ** foo:
>  **     vsetivli\t[a-x0-9]+,\s*8,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
> +**     ...
>  **     vle32\.v\tv[0-9]+,0\([a-x0-9]+\)
>  **     ...
> -**     vsetvli\t[a-x0-9]+,\s*[a-x0-9]+,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
> -**     add\t[a-x0-9]+,[a-x0-9]+,[a-x0-9]+
> +**     vsetvli\tzero,\s*[a-x0-9]+,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
> +**     ...
>  **     vle32\.v\tv[0-9]+,0\([a-x0-9]+\)
>  **     ...
>  */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
> new file mode 100644
> index 00000000000..ff36da8feeb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model" } */
> +
> +void
> +foo (int *__restrict a, int *__restrict b, int *__restrict c, int n)
> +{
> +  for (int i = 0; i < n; i += 1)
> +    c[i] = a[i] + b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {vsetvli} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetivli} } } */
> +/* { dg-final { scan-assembler-times {vsetvli\s*[a-x0-9]+,\s*[a-x0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*[a-x0-9]+,\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetivli\s*zero} } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
> new file mode 100644
> index 00000000000..2387c20a26c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model" } */
> +
> +void
> +foo (int *__restrict a, int *__restrict b, int *__restrict c,
> +     int *__restrict a2, int *__restrict b2, int *__restrict c2,
> +     int *__restrict a3, int *__restrict b3, int *__restrict c3,
> +     int *__restrict a4, int *__restrict b4, int *__restrict c4,
> +     int *__restrict a5, int *__restrict b5, int *__restrict c5,
> +     int *__restrict d, int *__restrict d2, int *__restrict d3,
> +     int *__restrict d4, int *__restrict d5, int n, int m)
> +{
> +  for (int i = 0; i < n; i++)
> +    {
> +      a[i] = b[i] + c[i];
> +      a2[i] = b2[i] + c2[i];
> +      a3[i] = b3[i] + c3[i];
> +      a4[i] = b4[i] + c4[i];
> +      a5[i] = a[i] + a4[i];
> +      d[i] = a[i] - a2[i];
> +      d2[i] = a2[i] * a[i];
> +      d3[i] = a3[i] * a2[i];
> +      d4[i] = a2[i] * d2[i];
> +      d5[i] = a[i] * a2[i] * a3[i] * a4[i] * d[i];
> +    }
> +}
> +
> +/* { dg-final { scan-assembler-times {vsetvli} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetivli} } } */
> +/* { dg-final { scan-assembler-times {vsetvli\s*[a-x0-9]+,\s*[a-x0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*[a-x0-9]+,\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetivli\s*zero} } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
> index 965365da4bb..13367423751 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
> @@ -3,7 +3,6 @@
>
>  #include "ternop-2.c"
>
> -/* { dg-final { scan-assembler-times {\tvmacc\.vv} 8 } } */
>  /* { dg-final { scan-assembler-times {\tvfma[c-d][c-d]\.vv} 9 } } */
>  /* { dg-final { scan-tree-dump-times "COND_LEN_FMA" 9 "optimized" } } */
>  /* { dg-final { scan-assembler-not {\tvmv} } } */
> --
> 2.36.3
>

  reply	other threads:[~2023-10-26  7:52 UTC|newest]

Thread overview: 5+ messages
2023-10-25 12:05 Juzhe-Zhong
2023-10-26  7:51 ` Kito Cheng [this message]
2023-10-26  8:20   ` juzhe.zhong
2023-10-26  8:34     ` Robin Dapp
2023-10-26  8:41       ` juzhe.zhong
