From: Kito Cheng <kito.cheng@sifive.com>
To: Juzhe-Zhong <juzhe.zhong@rivai.ai>
Cc: gcc-patches@gcc.gnu.org, kito.cheng@gmail.com,
jeffreyalaw@gmail.com, rdapp.gcc@gmail.com,
"Patrick O'Neill" <patrick@rivosinc.com>
Subject: Re: [PATCH V2] RISC-V: Add AVL propagation PASS for RVV auto-vectorization
Date: Thu, 26 Oct 2023 15:51:57 +0800
Message-ID: <CALLt3ThXmk4pey2QhSUvK183uuK3oY5bU=a4m8QYv-6UukBYyg@mail.gmail.com>
In-Reply-To: <20231025120518.1319929-1-juzhe.zhong@rivai.ai>

LGTM, Thanks, it's really awesome - the implementation is simpler than
I expected, it's another great improvement for RISC-V GCC!

Just make sure Patrick gives a green light on the testing before
committing the patch :)

On Wed, Oct 25, 2023 at 8:05 PM Juzhe-Zhong <juzhe.zhong@rivai.ai> wrote:
>
> This patch addresses the redundant AVL/VL toggling in RVV partial auto-vectorization,
> a long-known issue that I have finally found the time to fix.
>
> Consider a simple vector addition operation:
>
> https://godbolt.org/z/7hfGfEjW3
>
> void
> foo (int *__restrict a,
> int *__restrict b,
> int n)
> {
> for (int i = 0; i < n; i++)
> a[i] = a[i] + b[i];
> }
>
> Optimized IR:
>
> Loop body:
> _38 = .SELECT_VL (ivtmp_36, POLY_INT_CST [4, 4]); -> vsetvli a5,a2,e8,mf4,ta,ma
> ...
> vect__4.8_27 = .MASK_LEN_LOAD (vectp_a.6_29, 32B, { -1, ... }, _38, 0); -> vle32.v v2,0(a0)
> vect__6.11_20 = .MASK_LEN_LOAD (vectp_b.9_25, 32B, { -1, ... }, _38, 0); -> vle32.v v1,0(a1)
> vect__7.12_19 = vect__6.11_20 + vect__4.8_27; -> vsetvli a6,zero,e32,m1,ta,ma + vadd.vv v1,v1,v2
> .MASK_LEN_STORE (vectp_a.13_11, 32B, { -1, ... }, _38, 0, vect__7.12_19); -> vsetvli zero,a5,e32,m1,ta,ma + vse32.v v1,0(a4)
>
> We can see 2 redundant vsetvls inside the loop body due to AVL/VL toggling.
> The toggling happens because the simple PLUS_EXPR GIMPLE assignment carries no LEN information:
>
> vect__7.12_19 = vect__6.11_20 + vect__4.8_27;
>
> In partial vectorization, GCC applies predicated loads/stores and un-predicated full-vector operations.
> This flow is used by all other targets, such as ARM SVE (RVV uses it too):
>
> ARM SVE:
>
> .L3:
> ld1w z30.s, p7/z, [x0, x3, lsl 2] -> predicated load
> ld1w z31.s, p7/z, [x1, x3, lsl 2] -> predicated load
> add z31.s, z31.s, z30.s -> un-predicated add
> st1w z31.s, p7, [x0, x3, lsl 2] -> predicated store
>
> This vectorization flow causes AVL/VL toggling on RVV, so we need an AVL propagation PASS for it.
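
To make the toggling cost concrete, here is a minimal sketch of a toy model, not GCC code. It assumes each vector instruction demands a single AVL (written as a string, with "VLMAX" for a full-vector operation), that SEW/LMUL stay compatible throughout, and that a vsetvli is needed whenever the demanded AVL changes. The names `VecInsn`, `count_vsetvls`, `loop_before` and `loop_after` are made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy model of one RVV instruction: only its mnemonic and the
// AVL it demands ("a5" for the SELECT_VL result, "VLMAX" for full vector).
struct VecInsn {
  std::string name;
  std::string avl;
};

// Count the vsetvli instructions a naive emitter would need: one each time
// the demanded AVL differs from the currently configured one.
int count_vsetvls(const std::vector<VecInsn> &body) {
  int count = 0;
  std::string current;  // empty: nothing configured yet
  for (const VecInsn &insn : body) {
    assert(!insn.avl.empty());  // every instruction must demand some AVL
    if (insn.avl != current) {
      ++count;
      current = insn.avl;
    }
  }
  return count;
}

// Loop body of foo () as vectorized today: the add carries no LEN, so it
// demands VLMAX between the length-controlled loads and store.
std::vector<VecInsn> loop_before() {
  return {{"vle32.v", "a5"},
          {"vle32.v", "a5"},
          {"vadd.vv", "VLMAX"},
          {"vse32.v", "a5"}};
}

// Same loop body once the store's AVL has been propagated to the add.
std::vector<VecInsn> loop_after() {
  return {{"vle32.v", "a5"},
          {"vle32.v", "a5"},
          {"vadd.vv", "a5"},
          {"vse32.v", "a5"}};
}
```

Under these assumptions the first sequence needs three configurations and the second needs one, which matches the "2 redundant vsetvls" observed in the loop body above.
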
>
> Also, it's very unlikely that we can apply predicated operations to all vectorization, for the following reasons:
>
> 1. Supporting them for all vectorization is a very heavy workload, and we don't see any benefit if we can handle this in the target backend.
> 2. Changing the loop vectorizer for it would make the code base ugly and hard to maintain.
> 3. We would need a pattern for every operation: not only COND_LEN_ADD, COND_LEN_SUB, ...,
> but also COND_LEN_EXTEND, ..., COND_LEN_CEIL, ...: over 100 patterns, an unreasonable number.
>
> To conclude, we prefer un-predicated operations here, and design a clean AVL propagation PASS to elide the redundant vsetvls
> caused by AVL/VL toggling.
>
> The second question is why we add a separate PASS for AVL propagation. Why not optimize it in the VSETVL PASS? (We definitely can optimize AVL in the VSETVL PASS.)
>
> Frankly, I was planning to address this issue in the VSETVL PASS, which is why we recently refactored it. However, I changed my mind after several
> experiments.
>
> The reasons are as follows:
>
> 1. Code base management and maintenance. The current VSETVL PASS is complicated enough and already has enough aggressive, fancy optimizations that
> it generates optimal code in most cases. It's not a good idea to keep adding features to the VSETVL PASS and make it
> heavier and heavier, or we will need to refactor it again in the future.
> Actually, the VSETVL PASS is very stable and optimal after the recent refactoring. Hopefully, we should not need to change it any more, except for minor
> fixes.
>
> 2. vsetvl insertion (which the VSETVL PASS does) and AVL propagation are 2 different things; I don't think we should fuse them into the same PASS.
>
> 3. The VSETVL PASS is a post-RA PASS, whereas AVL propagation should be done before RA, where it can reduce register pressure.
>
> 4. This patch's AVL propagation PASS only does AVL propagation for RVV partial auto-vectorization.
> The code is only a few hundred lines, which is very manageable.
> We can easily extend and enhance AVL propagation in a clean, separate PASS in the future. (If we did it in the VSETVL PASS, we would complicate
> the VSETVL PASS again, which is already so complicated.)
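
The propagation rule itself is small. Below is a minimal sketch of it on a toy instruction list. It is not the code in riscv-avlprop.cc: it ignores RTL-SSA, dominance and def-use checks on the AVL register, and `Insn`, `ratio`, `preferred_avl` and `propagate` are hypothetical names. The assumed rule is: a VLMAX, tail-agnostic instruction may take a non-VLMAX AVL when every user of its result is tail-agnostic, has the same SEW/LMUL ratio, and agrees on that AVL. The loop repeats until nothing changes so that chains of VLMAX operations are covered.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for an RVV instruction.
struct Insn {
  bool vlmax;              // true: uses VLMAX; false: uses the AVL below
  bool tail_agnostic;      // tail policy is "ta"
  int sew;                 // element width in bits
  int lmul_x8;             // LMUL in eighths: mf4 = 2, m1 = 8, m2 = 16
  std::string avl;         // meaningful only when !vlmax
  std::vector<int> users;  // indices of instructions consuming the result
};

// SEW/LMUL ratio; e8,mf4 and e32,m1 both give 32.
int ratio(const Insn &insn) {
  assert(insn.lmul_x8 > 0);
  return insn.sew * 8 / insn.lmul_x8;
}

// Return the AVL that could replace VLMAX on insns[idx], or nullopt when
// the instruction is not a candidate or its users disagree.
std::optional<std::string> preferred_avl(const std::vector<Insn> &insns,
                                         int idx) {
  const Insn &insn = insns[idx];
  if (!insn.vlmax || !insn.tail_agnostic)
    return std::nullopt;
  std::optional<std::string> avl;  // stays empty if there are no users
  for (int u : insn.users) {
    const Insn &use = insns[u];
    if (use.vlmax || !use.tail_agnostic || ratio(use) != ratio(insn))
      return std::nullopt;
    if (!avl)
      avl = use.avl;
    else if (*avl != use.avl)
      return std::nullopt;
  }
  return avl;
}

// Walk the instructions in reverse until a fixed point is reached and
// return how many instructions received a propagated AVL.
int propagate(std::vector<Insn> &insns) {
  int propagated = 0;
  bool changed = true;
  while (changed) {
    changed = false;
    for (int i = static_cast<int>(insns.size()) - 1; i >= 0; --i) {
      if (std::optional<std::string> avl = preferred_avl(insns, i)) {
        insns[i].vlmax = false;
        insns[i].avl = *avl;
        changed = true;
        ++propagated;
      }
    }
  }
  return propagated;
}
```

Walking in reverse means a store's AVL reaches the add feeding it first, and that add then acts as a non-VLMAX user for the add before it, so a chain usually settles in a single round.
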
>
> Here is an example to demonstrate more:
>
> https://godbolt.org/z/bE86sv3q5
>
> void foo2 (int *__restrict a,
> int *__restrict b,
> int *__restrict c,
> int *__restrict a2,
> int *__restrict b2,
> int *__restrict c2,
> int *__restrict a3,
> int *__restrict b3,
> int *__restrict c3,
> int *__restrict a4,
> int *__restrict b4,
> int *__restrict c4,
> int *__restrict a5,
> int *__restrict b5,
> int *__restrict c5,
> int n)
> {
> for (int i = 0; i < n; i++){
> a[i] = b[i] + c[i];
> b5[i] = b[i] + c[i];
> a2[i] = b2[i] + c2[i];
> a3[i] = b3[i] + c3[i];
> a4[i] = b4[i] + c4[i];
> a5[i] = a[i] + a4[i];
> a[i] = a5[i] + b5[i]+ a[i];
>
> a[i] = a[i] + c[i];
> b5[i] = a[i] + c[i];
> a2[i] = a[i] + c2[i];
> a3[i] = a[i] + c3[i];
> a4[i] = a[i] + c4[i];
> a5[i] = a[i] + a4[i];
> a[i] = a[i] + b5[i]+ a[i];
> }
> }
>
> 1. Loop Body:
>
> Before this patch:                After this patch:
>
> vsetvli a4,t1,e8,mf4,ta,ma        vsetvli a4,t1,e32,m1,ta,ma
> vle32.v v2,0(a2)                  vle32.v v2,0(a2)
> vle32.v v4,0(a1)                  vle32.v v3,0(t2)
> vle32.v v1,0(t2)                  vle32.v v4,0(a1)
> vsetvli a7,zero,e32,m1,ta,ma      vle32.v v1,0(t0)
> vadd.vv v4,v2,v4                  vadd.vv v4,v2,v4
> vsetvli zero,a4,e32,m1,ta,ma      vadd.vv v1,v3,v1
> vle32.v v3,0(s0)                  vadd.vv v1,v1,v4
> vsetvli a7,zero,e32,m1,ta,ma      vadd.vv v1,v1,v4
> vadd.vv v1,v3,v1                  vadd.vv v1,v1,v4
> vadd.vv v1,v1,v4                  vadd.vv v1,v1,v2
> vadd.vv v1,v1,v4                  vadd.vv v2,v1,v2
> vadd.vv v1,v1,v4                  vse32.v v2,0(t5)
> vsetvli zero,a4,e32,m1,ta,ma      vadd.vv v2,v2,v1
> vle32.v v4,0(a5)                  vadd.vv v2,v2,v1
> vsetvli a7,zero,e32,m1,ta,ma      slli a7,a4,2
> vadd.vv v1,v1,v2                  vadd.vv v3,v1,v3
> vadd.vv v2,v1,v2                  vle32.v v5,0(a5)
> vadd.vv v4,v1,v4                  vle32.v v6,0(t6)
> vsetvli zero,a4,e32,m1,ta,ma      vse32.v v3,0(t3)
> vse32.v v2,0(t5)                  vse32.v v2,0(a0)
> vse32.v v4,0(a3)                  vadd.vv v3,v3,v1
> vsetvli a7,zero,e32,m1,ta,ma      vadd.vv v2,v1,v5
> vadd.vv v3,v1,v3                  vse32.v v3,0(t4)
> vadd.vv v2,v2,v1                  vadd.vv v1,v1,v6
> vadd.vv v2,v2,v1                  vse32.v v2,0(a3)
> vsetvli zero,a4,e32,m1,ta,ma      vse32.v v1,0(a6)
> vse32.v v2,0(a0)
> vse32.v v3,0(t3)
> vle32.v v2,0(t0)
> vsetvli a7,zero,e32,m1,ta,ma
> vadd.vv v3,v3,v1
> vsetvli zero,a4,e32,m1,ta,ma
> vse32.v v3,0(t4)
> vsetvli a7,zero,e32,m1,ta,ma
> slli a7,a4,2
> vadd.vv v1,v1,v2
> sub t1,t1,a4
> vsetvli zero,a4,e32,m1,ta,ma
> vse32.v v1,0(a6)
>
> It's quite obvious that all the heavy and redundant vsetvls inside the loop body are eliminated.
>
> 2. Epilogue:
> Before this patch:                After this patch:
>
> .L5:                              .L5:
> ld s0,8(sp)                       ret
> addi sp,sp,16
> jr ra
>
> This is the benefit of doing AVL propagation before RA: we eliminate the use of the 'a7' register,
> which is used by the redundant AVL/VL toggling instruction 'vsetvli a7,zero,e32,m1,ta,ma'.
>
> The final codegen after this patch:
>
> foo2:
> lw t1,56(sp)
> ld t6,0(sp)
> ld t3,8(sp)
> ld t0,16(sp)
> ld t2,24(sp)
> ld t4,32(sp)
> ld t5,40(sp)
> ble t1,zero,.L5
> .L3:
> vsetvli a4,t1,e32,m1,ta,ma
> vle32.v v2,0(a2)
> vle32.v v3,0(t2)
> vle32.v v4,0(a1)
> vle32.v v1,0(t0)
> vadd.vv v4,v2,v4
> vadd.vv v1,v3,v1
> vadd.vv v1,v1,v4
> vadd.vv v1,v1,v4
> vadd.vv v1,v1,v4
> vadd.vv v1,v1,v2
> vadd.vv v2,v1,v2
> vse32.v v2,0(t5)
> vadd.vv v2,v2,v1
> vadd.vv v2,v2,v1
> slli a7,a4,2
> vadd.vv v3,v1,v3
> vle32.v v5,0(a5)
> vle32.v v6,0(t6)
> vse32.v v3,0(t3)
> vse32.v v2,0(a0)
> vadd.vv v3,v3,v1
> vadd.vv v2,v1,v5
> vse32.v v3,0(t4)
> vadd.vv v1,v1,v6
> vse32.v v2,0(a3)
> vse32.v v1,0(a6)
> sub t1,t1,a4
> add a1,a1,a7
> add a2,a2,a7
> add a5,a5,a7
> add t6,t6,a7
> add t0,t0,a7
> add t2,t2,a7
> add t5,t5,a7
> add a3,a3,a7
> add a6,a6,a7
> add t3,t3,a7
> add t4,t4,a7
> add a0,a0,a7
> bne t1,zero,.L3
> .L5:
> ret
>
>
> PR target/111318
> PR target/111888
>
> gcc/ChangeLog:
>
> * config.gcc: Add AVL propagation PASS.
> * config/riscv/riscv-passes.def (INSERT_PASS_AFTER): Ditto.
> * config/riscv/riscv-protos.h (make_pass_avlprop): Ditto.
> * config/riscv/t-riscv: Ditto.
> * config/riscv/riscv-avlprop.cc: New file.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/rvv/autovec/partial/select_vl-2.c: Adapt test.
> * gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c: Ditto.
> * gcc.target/riscv/rvv/autovec/pr111318.c: New test.
> * gcc.target/riscv/rvv/autovec/pr111888.c: New test.
>
> ---
> gcc/config.gcc | 2 +-
> gcc/config/riscv/riscv-avlprop.cc | 419 ++++++++++++++++++
> gcc/config/riscv/riscv-passes.def | 1 +
> gcc/config/riscv/riscv-protos.h | 1 +
> gcc/config/riscv/t-riscv | 6 +
> .../riscv/rvv/autovec/partial/select_vl-2.c | 5 +-
> .../gcc.target/riscv/rvv/autovec/pr111318.c | 16 +
> .../gcc.target/riscv/rvv/autovec/pr111888.c | 33 ++
> .../riscv/rvv/autovec/ternop/ternop_nofm-2.c | 1 -
> 9 files changed, 480 insertions(+), 4 deletions(-)
> create mode 100644 gcc/config/riscv/riscv-avlprop.cc
> create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
> create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 606d3a8513e..efd53965c9a 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -544,7 +544,7 @@ pru-*-*)
> riscv*)
> cpu_type=riscv
> extra_objs="riscv-builtins.o riscv-c.o riscv-sr.o riscv-shorten-memrefs.o riscv-selftests.o riscv-string.o"
> - extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o"
> + extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o riscv-avlprop.o"
> extra_objs="${extra_objs} riscv-vector-builtins.o riscv-vector-builtins-shapes.o riscv-vector-builtins-bases.o"
> extra_objs="${extra_objs} thead.o"
> d_target_objs="riscv-d.o"
> diff --git a/gcc/config/riscv/riscv-avlprop.cc b/gcc/config/riscv/riscv-avlprop.cc
> new file mode 100644
> index 00000000000..2c79ec81806
> --- /dev/null
> +++ b/gcc/config/riscv/riscv-avlprop.cc
> @@ -0,0 +1,419 @@
> +/* AVL propagation pass for RISC-V 'V' Extension for GNU compiler.
> + Copyright (C) 2023-2023 Free Software Foundation, Inc.
> + Contributed by Juzhe Zhong (juzhe.zhong@rivai.ai), RiVAI Technologies Ltd.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify
> +it under the terms of the GNU General Public License as published by
> +the Free Software Foundation; either version 3, or (at your option)
> +any later version.
> +
> +GCC is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +GNU General Public License for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3. If not see
> +<http://www.gnu.org/licenses/>. */
> +
> +/* Pre-RA RTL_SSA-based pass that propagates AVL for RVV instructions.
> + A standalone AVL propagation pass is designed because:
> +
> + - Better code maintenance:
> + The current LCM-based VSETVL pass is so complicated that adding code
> + there would make it even harder to maintain. A straightforward
> + AVL propagation PASS is much easier to maintain.
> +
> + - Reduce scalar register pressure:
> + One type of AVL propagation is propagating the AVL from a NON-VLMAX
> + instruction to a VLMAX instruction.
> + Note: the VLMAX instruction should ignore tail elements (TA)
> + and its result should be used by the NON-VLMAX instruction.
> + This optimization is mostly for auto-vectorized code:
> +
> + vsetvli r136, r137 --- SELECT_VL
> + vle8.v (use avl = r136) --- IFN_MASK_LEN_LOAD
> + vadd.vv (use VLMAX) --- PLUS_EXPR
> + vse8.v (use avl = r136) --- IFN_MASK_LEN_STORE
> +
> + NO AVL propagation:
> +
> + vsetvli a5, a4, ta
> + vle8.v v1
> + vsetvli t0, zero, ta
> + vadd.vv v2, v1, v1
> + vse8.v v2
> +
> + We can propagate the AVL to 'vadd.vv' since its result
> + is consumed by a 'vse8.v' which has AVL = a5 and its
> + tail elements are agnostic.
> +
> + We DON'T do this optimization in the VSETVL pass since it is a
> + post-RA pass, where 't0' has already been consumed, whereas a standalone
> + pre-RA AVL propagation pass allows us to elide the consumption
> + of the pseudo register behind 't0', and thus reduce scalar
> + register pressure.
> +
> + - More AVL propagation opportunities:
> + A pre-RA pass is more flexible for AVL REG def-use chain,
> + thus we will get more potential AVL propagation as long as
> + it doesn't increase the scalar register pressure.
> +*/
> +
> +#define IN_TARGET_CODE 1
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "tm.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "target.h"
> +#include "tree-pass.h"
> +#include "df.h"
> +#include "rtl-ssa.h"
> +#include "cfgcleanup.h"
> +#include "insn-attr.h"
> +
> +using namespace rtl_ssa;
> +using namespace riscv_vector;
> +
> +enum avlprop_type
> +{
> + /* VLMAX AVL and tail agnostic candidates. */
> + AVLPROP_VLMAX_TA,
> + AVLPROP_NONE
> +};
> +
> +/* dump helper functions */
> +static const char *
> +avlprop_type_to_str (enum avlprop_type type)
> +{
> + switch (type)
> + {
> + case AVLPROP_VLMAX_TA:
> + return "vlmax_ta";
> +
> + default:
> + gcc_unreachable ();
> + }
> +}
> +
> +static bool
> +vlmax_ta_p (rtx_insn *rinsn)
> +{
> + return vlmax_avl_type_p (rinsn) && tail_agnostic_p (rinsn);
> +}
> +
> +const pass_data pass_data_avlprop = {
> + RTL_PASS, /* type */
> + "avlprop", /* name */
> + OPTGROUP_NONE, /* optinfo_flags */
> + TV_NONE, /* tv_id */
> + 0, /* properties_required */
> + 0, /* properties_provided */
> + 0, /* properties_destroyed */
> + 0, /* todo_flags_start */
> + 0, /* todo_flags_finish */
> +};
> +
> +class pass_avlprop : public rtl_opt_pass
> +{
> +public:
> + pass_avlprop (gcc::context *ctxt) : rtl_opt_pass (pass_data_avlprop, ctxt) {}
> +
> + /* opt_pass methods: */
> + virtual bool gate (function *) final override
> + {
> + return TARGET_VECTOR && optimize > 0;
> + }
> + virtual unsigned int execute (function *) final override;
> +
> +private:
> + /* The AVL propagation instructions and corresponding preferred AVL.
> + It will be updated during the analysis. */
> + hash_map<insn_info *, rtx> *m_avl_propagations;
> +
> + /* Potential feasible AVL propagation candidates. */
> + auto_vec<std::pair<enum avlprop_type, insn_info *>> m_candidates;
> +
> + rtx get_preferred_avl (const std::pair<enum avlprop_type, insn_info *>) const;
> + rtx get_vlmax_ta_preferred_avl (insn_info *) const;
> + rtx get_nonvlmax_avl (insn_info *) const;
> +
> + void avlprop_init (function *);
> + void avlprop_done (void);
> +}; // class pass_avlprop
> +
> +void
> +pass_avlprop::avlprop_init (function *fn)
> +{
> + calculate_dominance_info (CDI_DOMINATORS);
> + df_analyze ();
> + crtl->ssa = new function_info (fn);
> + m_avl_propagations = new hash_map<insn_info *, rtx>;
> +}
> +
> +void
> +pass_avlprop::avlprop_done (void)
> +{
> + free_dominance_info (CDI_DOMINATORS);
> + if (crtl->ssa->perform_pending_updates ())
> + cleanup_cfg (0);
> + delete crtl->ssa;
> + crtl->ssa = nullptr;
> + delete m_avl_propagations;
> + m_avl_propagations = NULL;
> + if (!m_candidates.is_empty ())
> + m_candidates.release ();
> +}
> +
> +/* If we have a preferred AVL to propagate, return the AVL.
> + Otherwise, return NULL_RTX as we don't have any preferred
> + AVL. */
> +
> +rtx
> +pass_avlprop::get_preferred_avl (
> + const std::pair<enum avlprop_type, insn_info *> candidate) const
> +{
> + switch (candidate.first)
> + {
> + case AVLPROP_VLMAX_TA:
> + return get_vlmax_ta_preferred_avl (candidate.second);
> + default:
> + gcc_unreachable ();
> + }
> + return NULL_RTX;
> +}
> +
> +/* This is a straightforward pattern that ALWAYS appears in partial auto-vectorization:
> +
> + VL = SELECT_VL (AVL, ...)
> + V0 = MASK_LEN_LOAD (..., VL)
> + V1 = MASK_LEN_LOAD (..., VL)
> + V2 = V0 + V1 --- Missed LEN information.
> + MASK_LEN_STORE (..., V2, VL)
> +
> + We prefer PLUS_EXPR (V0 + V1) instead of COND_LEN_ADD (V0, V1, dummy LEN)
> + because:
> +
> + - Few code changes in Loop Vectorizer.
> + - Reuse the current clean flow of partial vectorization. That is, apply
> + predicate LEN or MASK to LOAD/STORE operations and other special
> + arithmetic operations (e.g. DIV), then do the whole vector register
> + operation if it DOESN'T affect the correctness.
> + Such flow is used by all other targets like x86, sve, s390, ... etc.
> + - PLUS_EXPR has better gimple optimizations than COND_LEN_ADD.
> +
> + We propagate AVL from NON-VLMAX to VLMAX for gimple IR like PLUS_EXPR which
> + generates the VLMAX instruction due to missed LEN information. The later
> + VSETVL PASS will elide the redundant vsetvls.
> +*/
> +
> +rtx
> +pass_avlprop::get_vlmax_ta_preferred_avl (insn_info *insn) const
> +{
> + int sew = get_sew (insn->rtl ());
> + enum vlmul_type vlmul = get_vlmul (insn->rtl ());
> + int ratio = calculate_ratio (sew, vlmul);
> +
> + rtx use_avl = NULL_RTX;
> + for (def_info *def : insn->defs ())
> + {
> + if (!is_a<set_info *> (def) || def->is_mem ())
> + return NULL_RTX;
> + const auto *set = dyn_cast<set_info *> (def);
> +
> + /* FIXME: Stop AVL propagation if any USE is not a real RVV
> + instruction. It should be totally enough for vectorized code since
> + it is always located in extended basic blocks.
> +
> + TODO: We can extend PHI checking for intrinsic code if it is
> + necessary in the future. */
> + if (!set->is_local_to_ebb ())
> + return NULL_RTX;
> +
> + for (use_info *use : set->nondebug_insn_uses ())
> + {
> + insn_info *use_insn = use->insn ();
> + if (!use_insn->can_be_optimized () || use_insn->is_asm ()
> + || use_insn->is_call () || use_insn->has_volatile_refs ()
> + || use_insn->has_pre_post_modify ()
> + || !has_vl_op (use_insn->rtl ())
> + || !tail_agnostic_p (use_insn->rtl ()))
> + return NULL_RTX;
> +
> + int new_sew = get_sew (use_insn->rtl ());
> + enum vlmul_type new_vlmul = get_vlmul (use_insn->rtl ());
> + int new_ratio = calculate_ratio (new_sew, new_vlmul);
> + if (new_ratio != ratio)
> + return NULL_RTX;
> +
> + rtx new_use_avl = get_nonvlmax_avl (use_insn);
> + if (!new_use_avl || SUBREG_P (new_use_avl))
> + return NULL_RTX;
> + if (REG_P (new_use_avl))
> + {
> + resource_info resource = full_register (REGNO (new_use_avl));
> + def_lookup dl = crtl->ssa->find_def (resource, use_insn);
> + if (dl.matching_set ())
> + return NULL_RTX;
> + def_info *def1 = dl.prev_def (insn);
> + def_info *def2 = dl.prev_def (use_insn);
> + if (!def1 || !def2 || def1 != def2)
> + return NULL_RTX;
> +
> + /* FIXME: We only allow AVL propagation within a block, which should
> + be totally enough for vectorized code.
> +
> + TODO: We can enhance it here for intrinsic code in the future
> + if it is necessary. */
> + if (def1->insn ()->bb () != insn->bb ()
> + && !dominated_by_p (CDI_DOMINATORS, insn->bb ()->cfg_bb (),
> + def1->insn ()->bb ()->cfg_bb ()))
> + return NULL_RTX;
> + if (def1->insn ()->bb () == insn->bb ()
> + && def1->insn ()->compare_with (insn) >= 0)
> + return NULL_RTX;
> + }
> +
> + if (!use_avl)
> + use_avl = new_use_avl;
> + else if (!rtx_equal_p (use_avl, new_use_avl))
> + return NULL_RTX;
> + }
> + }
> +
> + return use_avl;
> +}
> +
> +/* Try to get the NONVLMAX AVL of the INSN.
> + INSN can either be a NON-VLMAX AVL insn itself or an insn that had
> + a VLMAX AVL before the PASS but has been given a NON-VLMAX AVL
> + in an earlier round of propagation. */
> +rtx
> +pass_avlprop::get_nonvlmax_avl (insn_info *insn) const
> +{
> + if (m_avl_propagations->get (insn))
> + return (*m_avl_propagations->get (insn));
> + else if (nonvlmax_avl_type_p (insn->rtl ()))
> + {
> + extract_insn_cached (insn->rtl ());
> + return recog_data.operand[get_attr_vl_op_idx (insn->rtl ())];
> + }
> +
> + return NULL_RTX;
> +}
> +
> +/* Main entry point for this pass. */
> +unsigned int
> +pass_avlprop::execute (function *fn)
> +{
> + avlprop_init (fn);
> +
> + /* Iterate the whole function in reverse order (which could speed the
> + convergence) to collect all potential candidates that could be AVL
> + propagated.
> +
> + Note that: **NOT** all the candidates will be successfully AVL propagated.
> + */
> + for (bb_info *bb : crtl->ssa->reverse_bbs ())
> + {
> + for (insn_info *insn : bb->reverse_real_nondebug_insns ())
> + {
> + /* We only forward AVL to the instruction that has AVL/VL operand
> + and can be optimized in RTL_SSA level. */
> + if (!insn->can_be_optimized () || !has_vl_op (insn->rtl ()))
> + continue;
> +
> + /* TODO: We only do AVL propagation for VLMAX AVL with tail
> + agnostic policy since LEN information is missing in partial
> + autovectorization. We could add more AVL propagation
> + for intrinsic code in the future. */
> + if (vlmax_ta_p (insn->rtl ()))
> + m_candidates.safe_push (std::make_pair (AVLPROP_VLMAX_TA, insn));
> + }
> + }
> +
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + {
> + fprintf (dump_file, "\nNumber of potential AVL propagations: %d\n",
> + m_candidates.length ());
> + for (const auto candidate : m_candidates)
> + {
> + fprintf (dump_file, "\nAVL propagation type: %s\n",
> + avlprop_type_to_str (candidate.first));
> + print_rtl_single (dump_file, candidate.second->rtl ());
> + }
> + }
> +
> + /* Go through all the candidates looking for AVL that we could propagate. */
> + bool change_p = true;
> + while (change_p)
> + {
> + change_p = false;
> + for (auto &candidate : m_candidates)
> + {
> + rtx new_avl = get_preferred_avl (candidate);
> + if (new_avl)
> + {
> + gcc_assert (!vlmax_avl_p (new_avl));
> + auto &update
> + = m_avl_propagations->get_or_insert (candidate.second);
> + change_p = !rtx_equal_p (update, new_avl);
> + update = new_avl;
> + }
> + }
> + }
> +
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file, "\nNumber of successful AVL propagations: %d\n\n",
> + (int) m_avl_propagations->elements ());
> +
> + for (const auto prop : *m_avl_propagations)
> + {
> + rtx_insn *rinsn = prop.first->rtl ();
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + {
> + fprintf (dump_file, "\nPropagating AVL: ");
> + print_rtl_single (dump_file, prop.second);
> + fprintf (dump_file, "into: ");
> + print_rtl_single (dump_file, rinsn);
> + }
> + /* Replace AVL operand. */
> + extract_insn_cached (rinsn);
> + rtx avl = recog_data.operand[get_attr_vl_op_idx (rinsn)];
> + int count = count_regno_occurrences (rinsn, REGNO (avl));
> + gcc_assert (count == 1);
> + rtx new_pat = simplify_replace_rtx (PATTERN (rinsn), avl, prop.second);
> + validate_change_or_fail (rinsn, &PATTERN (rinsn), new_pat, false);
> +
> + /* Change AVL TYPE into NONVLMAX if it is VLMAX. */
> + if (vlmax_avl_type_p (rinsn))
> + {
> + int index = get_attr_avl_type_idx (rinsn);
> + gcc_assert (index != INVALID_ATTRIBUTE);
> + validate_change_or_fail (rinsn, recog_data.operand_loc[index],
> + get_avl_type_rtx (avl_type::NONVLMAX),
> + false);
> + }
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + {
> + fprintf (dump_file, "Successfully to match this instruction: ");
> + print_rtl_single (dump_file, rinsn);
> + }
> + }
> +
> + avlprop_done ();
> + return 0;
> +}
> +
> +rtl_opt_pass *
> +make_pass_avlprop (gcc::context *ctxt)
> +{
> + return new pass_avlprop (ctxt);
> +}
> diff --git a/gcc/config/riscv/riscv-passes.def b/gcc/config/riscv/riscv-passes.def
> index 4084122cf0a..b6260939d5c 100644
> --- a/gcc/config/riscv/riscv-passes.def
> +++ b/gcc/config/riscv/riscv-passes.def
> @@ -18,4 +18,5 @@
> <http://www.gnu.org/licenses/>. */
>
> INSERT_PASS_AFTER (pass_rtl_store_motion, 1, pass_shorten_memrefs);
> +INSERT_PASS_AFTER (pass_split_all_insns, 1, pass_avlprop);
> INSERT_PASS_BEFORE (pass_fast_rtl_dce, 1, pass_vsetvl);
> diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
> index 668d75043ca..d4e17fc3fd0 100644
> --- a/gcc/config/riscv/riscv-protos.h
> +++ b/gcc/config/riscv/riscv-protos.h
> @@ -156,6 +156,7 @@ extern void riscv_parse_arch_string (const char *, struct gcc_options *, locatio
> extern bool riscv_hard_regno_rename_ok (unsigned, unsigned);
>
> rtl_opt_pass * make_pass_shorten_memrefs (gcc::context *ctxt);
> +rtl_opt_pass * make_pass_avlprop (gcc::context *ctxt);
> rtl_opt_pass * make_pass_vsetvl (gcc::context *ctxt);
>
> /* Routines implemented in riscv-string.c. */
> diff --git a/gcc/config/riscv/t-riscv b/gcc/config/riscv/t-riscv
> index dd17056fe82..f8ca3f4ac57 100644
> --- a/gcc/config/riscv/t-riscv
> +++ b/gcc/config/riscv/t-riscv
> @@ -78,6 +78,12 @@ riscv-vector-costs.o: $(srcdir)/config/riscv/riscv-vector-costs.cc \
> $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> $(srcdir)/config/riscv/riscv-vector-costs.cc
>
> +riscv-avlprop.o: $(srcdir)/config/riscv/riscv-avlprop.cc \
> + $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) $(REGS_H) \
> + $(TARGET_H) tree-pass.h df.h rtl-ssa.h cfgcleanup.h insn-attr.h
> + $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> + $(srcdir)/config/riscv/riscv-avlprop.cc
> +
> riscv-d.o: $(srcdir)/config/riscv/riscv-d.cc \
> $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H)
> $(COMPILE) $<
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
> index eac7cbc757b..ca88d42cdf4 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
> @@ -7,10 +7,11 @@
> /*
> ** foo:
> ** vsetivli\t[a-x0-9]+,\s*8,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
> +** ...
> ** vle32\.v\tv[0-9]+,0\([a-x0-9]+\)
> ** ...
> -** vsetvli\t[a-x0-9]+,\s*[a-x0-9]+,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
> -** add\t[a-x0-9]+,[a-x0-9]+,[a-x0-9]+
> +** vsetvli\tzero,\s*[a-x0-9]+,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
> +** ...
> ** vle32\.v\tv[0-9]+,0\([a-x0-9]+\)
> ** ...
> */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
> new file mode 100644
> index 00000000000..ff36da8feeb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model" } */
> +
> +void
> +foo (int *__restrict a, int *__restrict b, int *__restrict c, int n)
> +{
> + for (int i = 0; i < n; i += 1)
> + c[i] = a[i] + b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {vsetvli} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetivli} } } */
> +/* { dg-final { scan-assembler-times {vsetvli\s*[a-x0-9]+,\s*[a-x0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*[a-x0-9]+,\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetivli\s*zero} } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
> new file mode 100644
> index 00000000000..2387c20a26c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model" } */
> +
> +void
> +foo (int *__restrict a, int *__restrict b, int *__restrict c,
> + int *__restrict a2, int *__restrict b2, int *__restrict c2,
> + int *__restrict a3, int *__restrict b3, int *__restrict c3,
> + int *__restrict a4, int *__restrict b4, int *__restrict c4,
> + int *__restrict a5, int *__restrict b5, int *__restrict c5,
> + int *__restrict d, int *__restrict d2, int *__restrict d3,
> + int *__restrict d4, int *__restrict d5, int n, int m)
> +{
> + for (int i = 0; i < n; i++)
> + {
> + a[i] = b[i] + c[i];
> + a2[i] = b2[i] + c2[i];
> + a3[i] = b3[i] + c3[i];
> + a4[i] = b4[i] + c4[i];
> + a5[i] = a[i] + a4[i];
> + d[i] = a[i] - a2[i];
> + d2[i] = a2[i] * a[i];
> + d3[i] = a3[i] * a2[i];
> + d4[i] = a2[i] * d2[i];
> + d5[i] = a[i] * a2[i] * a3[i] * a4[i] * d[i];
> + }
> +}
> +
> +/* { dg-final { scan-assembler-times {vsetvli} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetivli} } } */
> +/* { dg-final { scan-assembler-times {vsetvli\s*[a-x0-9]+,\s*[a-x0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*[a-x0-9]+,\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetivli\s*zero} } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
> index 965365da4bb..13367423751 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
> @@ -3,7 +3,6 @@
>
> #include "ternop-2.c"
>
> -/* { dg-final { scan-assembler-times {\tvmacc\.vv} 8 } } */
> /* { dg-final { scan-assembler-times {\tvfma[c-d][c-d]\.vv} 9 } } */
> /* { dg-final { scan-tree-dump-times "COND_LEN_FMA" 9 "optimized" } } */
> /* { dg-final { scan-assembler-not {\tvmv} } } */
> --
> 2.36.3
>