From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id 48E433858D39 for ; Thu, 29 Sep 2022 16:53:50 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 48E433858D39 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 6FBB315A1; Thu, 29 Sep 2022 09:53:56 -0700 (PDT) Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.62]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 2BAE73F73B; Thu, 29 Sep 2022 09:53:49 -0700 (PDT) From: Richard Sandiford To: juzhe.zhong@rivai.ai Mail-Followup-To: juzhe.zhong@rivai.ai,gcc-patches@gcc.gnu.org, richard.sandiford@arm.com Cc: gcc-patches@gcc.gnu.org Subject: Re: [Unfinished PATCH] Add first-order recurrence autovectorization References: <20220929110723.277330-1-juzhe.zhong@rivai.ai> Date: Thu, 29 Sep 2022 17:53:47 +0100 In-Reply-To: <20220929110723.277330-1-juzhe.zhong@rivai.ai> (juzhe zhong's message of "Thu, 29 Sep 2022 19:07:23 +0800") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Status: No, score=-45.5 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,KAM_SHORT,SPF_HELO_NONE,SPF_NONE,TXREP,WEIRD_PORT autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Thanks for posting the patch. juzhe.zhong@rivai.ai writes: > From: Ju-Zhe Zhong > > gcc/ChangeLog: > > * tree-vect-loop.cc (vect_phi_first_order_recurrence_p): New function. > (vect_analyze_scalar_cycles_1): Classify first-order recurrence phi. > (vect_analyze_loop_operations): Add first-order recurrence autovectorization support. > (vectorizable_dep_phi): New function. > (vect_use_first_order_phi_result_p): New function. > (vect_transform_loop): Add first-order recurrence autovectorization support. > * tree-vect-stmts.cc (vect_transform_stmt): Ditto. > (vect_is_simple_use): Ditto. > * tree-vectorizer.h (enum vect_def_type): New enum. > (enum stmt_vec_info_type): Ditto. > (vectorizable_dep_phi): New function. > > Hi, since Richard said I can post unfinished for help, I post it. > This patch is for fix issue:https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99409. > LLVM can vectorize this case using first-order recurrence loop-vectorizer. > This patch is inspired by first-order recurrence autovectorization support in LLVM: > https://reviews.llvm.org/D16197 > There is a link that I can show you several cases that GCC fails vectorization > because no support of firs-order recurrence vectorization: https://godbolt.org/z/nzf1Wrd6T > > Let's consider a simple case that I simplify: > void foo (int32_t * __restrict__ a, int32_t * __restrict__ b, int32_t * __restrict__ c, int n) > { > int32_t t = *c; > for (int i = 0; i < n; ++i) > { > b[i] = a[i] - t; > t = a[i]; > } > } One thing that I wondered about the LLVM implementation is: does reusing the loaded value really pay for itself? E.g. for the un-predictive-commoned version: void foo (int32_t * __restrict__ a, int32_t * __restrict__ b, int32_t * __restr\ ict__ c, int n) { b[0] = a[0] - *c; for (int i = 1; i < n; ++i) b[i] = a[i] - a[i - 1]; } GCC generates: L4: ldr q0, [x6, x2] ldr q1, [x0, x2] sub v0.4s, v0.4s, v1.4s str q0, [x5, x2] add x2, x2, 16 cmp x2, x4 bne .L4 whereas LLVM (with -fno-unroll-loops) generates: .LBB0_4: // %vector.body mov v1.16b, v0.16b subs x15, x15, #4 ldr q0, [x13], #16 ext v1.16b, v1.16b, v0.16b, #12 sub v1.4s, v0.4s, v1.4s str q1, [x14], #16 b.ne .LBB0_4 Introducing the loop-carried dependency (via the ext) limits the throughput of the loop to the latency of a permutation. But I guess which approach is better depends on the amount of work that is repeated by GCC's approach. For a single load it's probably better to repeat the work, but for something more complicated the general recurrence approach probably wins out. So perhaps we should first handle the general case (as for your patch) and then, as a potential later follow-on patch, optimise the cases where the loop-carried dependency is harmful? This is all hand-wavy speculation, in case it wasn't obvious :-) Thanks, Richard > Applying this patch, my downstream RVV GCC can vectorize with -fdump-tree-vect: > > note: vect_is_simple_use: operand t_21 = PHI <_4(6), t_12(5)>, type of def: first order recurrence > > However, it ICE in "dce6" when removing PHI node "t_21 = PHI <_4(6), t_12(5)>": > 0x143c174 crash_signal > ../../../riscv-gcc/gcc/toplev.cc:322 > 0x170d4fd delink_imm_use > ../../../riscv-gcc/gcc/ssa-iterators.h:257 > I was stuck by this issue. Besides, this patch has more 2 more things to do that I didn't implement: > > 1. insert VEC_PERM before the vector subtraction statement (Because I was stuck, I didn't continue > implementing this patch and miss this.) > 2. Support this vectorization in SLP autovectorizaiton. > > To understand this patch, 2 functions are important: > > 1. vect_phi_first_order_recurrence_p, this function is used to forbid the cases that can not be vectorized > by this vectorizer. The constraints there are strictly the same as LLVM. > 2. vectorizable_dep_phi, the implementation of first-order recurrence vectorizer. > > I hope someone can help me fix && finish && test && refine this patch. > Thanks. > > --- > gcc/tree-vect-loop.cc | 239 ++++++++++++++++++++++++++++++++++++++++- > gcc/tree-vect-stmts.cc | 12 ++- > gcc/tree-vectorizer.h | 4 + > 3 files changed, 252 insertions(+), 3 deletions(-) > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > index 2536cc3cf49..adb48356c23 100644 > --- a/gcc/tree-vect-loop.cc > +++ b/gcc/tree-vect-loop.cc > @@ -529,6 +529,57 @@ vect_inner_phi_in_double_reduction_p (loop_vec_info loop_vinfo, gphi *phi) > return false; > } > > +/* Returns true if Phi is a first-order recurrence. A first-order > + recurrence is a non-reduction recurrence relation in which the value of > + the recurrence in the current loop iteration equals a value defined in > + the previous iteration. */ > + > +static bool > +vect_phi_first_order_recurrence_p (loop_vec_info loop_vinfo, class loop *loop, > + gphi *phi) > +{ > + /* Ensure the phi node is in the loop header and has two incoming values. */ > + if (gimple_bb (phi) != loop->header || gimple_phi_num_args (phi) != 2) > + return false; > + > + /* Ensure the loop has a preheader and a single latch block. The loop > + vectorizer will need the latch to set up the next iteration of the loop. */ > + edge preheader = loop_preheader_edge (loop); > + edge latch = loop_latch_edge (loop); > + if (!preheader || !latch) > + return false; > + > + /* Ensure the phi node's incoming blocks are the loop preheader and latch. */ > + if (!PHI_ARG_DEF_FROM_EDGE (phi, preheader) > + || !PHI_ARG_DEF_FROM_EDGE (phi, latch)) > + return false; > + > + /* Get the previous value. The previous value comes from the latch edge while > + the initial value comes form the preheader edge. */ > + gimple *previous = SSA_NAME_DEF_STMT (PHI_ARG_DEF_FROM_EDGE (phi, latch)); > + if (!previous) > + return false; > + > + /* Ensure every use_stmt of the phi node is dominated by the previous value. > + The dominance requirement ensures the loop vectorizer will not need to > + vectorize the initial value prior to the first iteration of the loop. */ > + gimple *use_stmt; > + imm_use_iterator imm_iter; > + FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, gimple_phi_result (phi)) > + if (use_stmt) > + if (!dominated_by_p (CDI_DOMINATORS, gimple_bb (use_stmt), > + gimple_bb (previous))) > + return false; > + > + /* First-order recurrence autovectorization needs shuffle vector. */ > + tree scalar_type = TREE_TYPE (PHI_RESULT (phi)); > + tree vectype = get_vectype_for_scalar_type (loop_vinfo, scalar_type); > + if (!can_vec_perm_var_p (TYPE_MODE (vectype))) > + return false; > + > + return true; > +} > + > /* Function vect_analyze_scalar_cycles_1. > > Examine the cross iteration def-use cycles of scalar variables > @@ -666,6 +717,8 @@ vect_analyze_scalar_cycles_1 (loop_vec_info loop_vinfo, class loop *loop, > } > } > } > + else if (vect_phi_first_order_recurrence_p (loop_vinfo, loop, phi)) > + STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_first_order_recurrence; > else > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > @@ -1808,9 +1861,13 @@ vect_analyze_loop_operations (loop_vec_info loop_vinfo) > > gcc_assert (stmt_info); > > + if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence) > + ok = vectorizable_dep_phi (loop_vinfo, stmt_info, NULL, NULL); > + > if ((STMT_VINFO_RELEVANT (stmt_info) == vect_used_in_scope > || STMT_VINFO_LIVE_P (stmt_info)) > - && STMT_VINFO_DEF_TYPE (stmt_info) != vect_induction_def) > + && STMT_VINFO_DEF_TYPE (stmt_info) != vect_induction_def > + && STMT_VINFO_DEF_TYPE (stmt_info) != vect_first_order_recurrence) > /* A scalar-dependence cycle that we don't support. */ > return opt_result::failure_at (phi, > "not vectorized:" > @@ -8267,6 +8324,120 @@ vectorizable_phi (vec_info *, > return true; > } > > +/* Vectorizes Dep PHIs. */ > + > +bool > +vectorizable_dep_phi (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, > + gimple **vec_stmt, slp_tree slp_node) > +{ > + if (!loop_vinfo || !is_a (stmt_info->stmt)) > + return false; > + > + gphi *phi = as_a (stmt_info->stmt); > + > + /* So far we only support first-order recurrence auto-vectorization. */ > + if (STMT_VINFO_DEF_TYPE (stmt_info) != vect_first_order_recurrence) > + return false; > + > + if (!vec_stmt) /* transformation not required. */ > + { > + /* So far we don't support first-order recurrence for SLP > + * auto-vectorization. */ > + if (slp_node) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > + "do not support first-order recurrence" > + "autovectorization for SLP node\n"); > + return false; > + } > + STMT_VINFO_TYPE (stmt_info) = dep_phi_info_type; > + return true; > + } > + > + /* This is the second phase of vectorizing first-order rececurrences. An > + overview of the transformation is described below. Suppose we have the > + following loop. > + > + int32_t t = 0; > + for (int i = 0; i < n; ++i) > + { > + b[i] = a[i] - t; > + t = a[i]; > + } > + > + There is a first-order recurrence on "a". For this loop, the shorthand > + scalar IR looks like: > + > + scalar.preheader: > + init = a[-1] > + br loop.body > + > + scalar.body: > + i = PHI <0(scalar.preheader), i+1(scalar.body)> > + _2 = PHI <(init(scalar.preheader), <_1(scalar.body)> > + _1 = a[i] > + b[i] = _1 - _2 > + br cond, scalar.body, ... > + > + In this example, _2 is a recurrence because it's value depends on the > + previous iteration. In the first phase of vectorization, we created a > + temporary value for _2. We now complete the vectorization and produce the > + shorthand vector IR shown below (VF = 4). > + > + vector.preheader: > + vect_init = vect_cst(..., ..., ..., a[-1]) > + br vector.body > + > + vector.body > + i = PHI <0(vector.preheader), i+4(vector.body)> > + vect_1 = PHI > + vect_2 = a[i, i+1, i+2, i+3]; > + vect_3 = vector(vect_1(3), vect_2(0, 1, 2)) > + b[i, i+1, i+2, i+3] = vect_2 - vect_3 > + br cond, vector.body, middle.block > + > + middle.block: > + x = vect_2(3) > + br scalar.preheader > + > + scalar.ph: > + s_init = PHI > + br scalar.body > + > + After execution completes the vector loop, we extract the next value of > + the recurrence (x) to use as the initial value in the scalar loop. */ > + > + class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); > + tree vectype = STMT_VINFO_VECTYPE (stmt_info); > + tree scalar_dest = gimple_phi_result (stmt_info->stmt); > + basic_block bb = gimple_bb (stmt_info->stmt); > + tree vec_dest = vect_create_destination_var (scalar_dest, vectype); > + auto_vec vec_oprnds; > + tree preheader = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop)); > + tree latch = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop)); > + gimple_seq stmts = NULL; > + > + tree vec_init = build_vector_from_val (vectype, preheader); > + vec_init = vect_init_vector (loop_vinfo, stmt_info, > + vec_init, vectype, NULL); > + vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, > + !slp_node ? vect_get_num_copies (loop_vinfo, vectype) : 1, > + gimple_phi_arg_def (stmt_info->stmt, 0), &vec_oprnds); > + /* Create the vectorized first-order PHI node. */ > + gphi *new_phi = create_phi_node (vec_dest, bb); > + add_phi_arg (new_phi, vec_oprnds[0], loop_latch_edge (loop), > + UNKNOWN_LOCATION); > + add_phi_arg (new_phi, vec_init, loop_preheader_edge (loop), UNKNOWN_LOCATION); > + if (slp_node) > + SLP_TREE_VEC_STMTS (slp_node).quick_push (new_phi); > + else > + STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_phi); > + if (!slp_node) > + *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0]; > + return true; > +} > + > /* Return true if VECTYPE represents a vector that requires lowering > by the vector lowering pass. */ > > @@ -10476,6 +10647,26 @@ update_epilogue_loop_vinfo (class loop *epilogue, tree advance) > epilogue_vinfo->shared->save_datarefs (); > } > > +/* Return true if the stmt is the first statement that uses the result > + of a first-order recurrence phi node. */ > + > +static gphi * > +vect_use_first_order_phi_result_p (auto_vec &first_order_phi_vec, > + gimple *stmt) > +{ > + for (unsigned int i = 0; i < first_order_phi_vec.length (); ++i) > + { > + imm_use_iterator imm; > + gphi *phi = first_order_phi_vec[i]; > + tree phi_result = PHI_RESULT (phi); > + gimple *first_use = first_imm_use_stmt (&imm, phi_result); > + if (first_use && first_use == stmt) > + return phi; > + } > + > + return NULL; > +} > + > /* Function vect_transform_loop. > > The analysis phase has determined that the loop is vectorizable. > @@ -10624,6 +10815,31 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call) > basic_block bb = bbs[i]; > stmt_vec_info stmt_info; > > + /* It's used to save the first-order recurrence phi node. > + Consider the following case: > + # t_19 = PHI <_4(6), 0(5)> > + ... > + _4 = *_3; > + ... > + _6 = _4 - t_19; > + > + The vectorization sequence should be: > + 1. Vectorize _4 = *_3; > + 2. Vectorize # t_19 = PHI <_4(6), 0(5)> > + 3, Vectorize _6 = _4 - t_19; > + > + Flow: > + 1. So we save the first-order recurrence PHI in first_order_phi_vec > + and skip vectorizing it tentatively when iterating phi statement > + (save t_19 = PHI <_4(6), 0(5)>). > + 2. Vectorize all statements when iterating all statements > + until we reach the statement who is using the result of > + first-order recurrence PHI (vectorize _4 = *_3;). > + 3. Stop iterating and return to vectorize the PHI saved > + in first_order_phi_vec (# t_19 = PHI <_4(6), 0(5)>). > + 4. Conintue iterating and vectorize the residual statements. */ > + auto_vec first_order_phi_vec; > + > for (gphi_iterator si = gsi_start_phis (bb); !gsi_end_p (si); > gsi_next (&si)) > { > @@ -10677,9 +10893,16 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call) > || STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def > || STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def > || STMT_VINFO_DEF_TYPE (stmt_info) == vect_nested_cycle > - || STMT_VINFO_DEF_TYPE (stmt_info) == vect_internal_def) > + || STMT_VINFO_DEF_TYPE (stmt_info) == vect_internal_def > + || STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence) > && ! PURE_SLP_STMT (stmt_info)) > maybe_set_vectorized_backedge_value (loop_vinfo, stmt_info); > + > + if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence) > + { > + first_order_phi_vec.safe_push (phi); > + continue; > + } > } > > for (gimple_stmt_iterator si = gsi_start_bb (bb); > @@ -10698,6 +10921,18 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call) > /* Ignore vector stmts created in the outer loop. */ > stmt_info = loop_vinfo->lookup_stmt (stmt); > > + gphi *first_order_phi > + = vect_use_first_order_phi_result_p (first_order_phi_vec, stmt); > + if (first_order_phi) > + { > + stmt_vec_info first_order_stmt_info > + = loop_vinfo->lookup_stmt (first_order_phi); > + gimple_stmt_iterator first_order_si > + = gsi_for_stmt (first_order_phi); > + vect_transform_loop_stmt (loop_vinfo, first_order_stmt_info, > + &first_order_si, NULL); > + } > + > /* vector stmts created in the outer-loop during vectorization of > stmts in an inner-loop may not have a stmt_info, and do not > need to be vectorized. */ > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > index c8d1efc45e5..a8c0a8a4151 100644 > --- a/gcc/tree-vect-stmts.cc > +++ b/gcc/tree-vect-stmts.cc > @@ -11404,6 +11404,12 @@ vect_transform_stmt (vec_info *vinfo, > gcc_assert (done); > break; > > + case dep_phi_info_type: > + done = vectorizable_dep_phi (as_a (vinfo), > + stmt_info, &vec_stmt, slp_node); > + gcc_assert (done); > + break; > + > case phi_info_type: > done = vectorizable_phi (vinfo, stmt_info, &vec_stmt, slp_node, NULL); > gcc_assert (done); > @@ -11804,6 +11810,9 @@ vect_is_simple_use (tree operand, vec_info *vinfo, enum vect_def_type *dt, > case vect_nested_cycle: > dump_printf (MSG_NOTE, "nested cycle\n"); > break; > + case vect_first_order_recurrence: > + dump_printf (MSG_NOTE, "first order recurrence\n"); > + break; > case vect_unknown_def_type: > dump_printf (MSG_NOTE, "unknown\n"); > break; > @@ -11852,7 +11861,8 @@ vect_is_simple_use (tree operand, vec_info *vinfo, enum vect_def_type *dt, > || *dt == vect_induction_def > || *dt == vect_reduction_def > || *dt == vect_double_reduction_def > - || *dt == vect_nested_cycle) > + || *dt == vect_nested_cycle > + || *dt == vect_first_order_recurrence) > { > *vectype = STMT_VINFO_VECTYPE (def_stmt_info); > gcc_assert (*vectype != NULL_TREE); > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > index 4870c754499..2c8e7660b4e 100644 > --- a/gcc/tree-vectorizer.h > +++ b/gcc/tree-vectorizer.h > @@ -65,6 +65,7 @@ enum vect_def_type { > vect_reduction_def, > vect_double_reduction_def, > vect_nested_cycle, > + vect_first_order_recurrence, > vect_unknown_def_type > }; > > @@ -1027,6 +1028,7 @@ enum stmt_vec_info_type { > cycle_phi_info_type, > lc_phi_info_type, > phi_info_type, > + dep_phi_info_type, > loop_exit_ctrl_vec_info_type > }; > > @@ -2331,6 +2333,8 @@ extern bool vectorizable_lc_phi (loop_vec_info, stmt_vec_info, > gimple **, slp_tree); > extern bool vectorizable_phi (vec_info *, stmt_vec_info, gimple **, slp_tree, > stmt_vector_for_cost *); > +extern bool vectorizable_dep_phi (loop_vec_info, stmt_vec_info, > + gimple **, slp_tree); > extern bool vect_emulated_vector_p (tree); > extern bool vect_can_vectorize_without_simd_p (tree_code); > extern bool vect_can_vectorize_without_simd_p (code_helper);