From: Richard Sandiford <richard.sandiford@arm.com>
To: juzhe.zhong@rivai.ai
Cc: gcc-patches@gcc.gnu.org, rguenther@suse.de
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
Date: Wed, 10 May 2023 17:45:13 +0100
In-Reply-To: <20230504132540.286148-1-juzhe.zhong@rivai.ai> (juzhe zhong's message of "Thu, 4 May 2023 21:25:40 +0800")

In addition to Jeff's comments:

juzhe.zhong@rivai.ai writes:
> [...]
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index cc4a93a8763..99cf0cdbdca 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++)
>    operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
> @end smallexample
>
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of active elements in vector will be updated value.
> +operand 1 is the total elements need to be updated value.
> +operand 2 is the vectorization factor.
> +The value of operand 0 is target dependent and flexible in each iteration.
> +The operation of this pattern can be:
> +
> +@smallexample
> +Case 1:
> +operand0 = MIN (operand1, operand2);
> +operand2 can be const_poly_int or poly_int related to vector mode size.
> +Some target like RISC-V has a standalone instruction to get MIN (n, MODE SIZE) so
> +that we can reduce a use of general purpose register.
> +
> +In this case, only the last iteration of the loop is partial iteration.
> +@end smallexample
> +
> +@smallexample
> +Case 2:
> +if (operand1 <= operand2)
> +  operand0 = operand1;
> +else if (operand1 < 2 * operand2)
> +  operand0 = IN_RANGE (ceil (operand1 / 2), operand2);

GCC's IN_RANGE is a predicate, so it would be best to avoid that here.
Why isn't it simply ceil (operand1 / 2), which must be <= operand2?

> +else
> +  operand0 = operand2;
> +
> +This case will evenly distribute work over the last 2 iterations of a stripmine loop.
> +@end smallexample
> +
> +The output of this pattern is not only used as IV of loop control counter, but also
> +is used as the IV of address calculation with multiply/shift operation.  This allow
> +us dynamic adjust the number of elements is processed in each iteration of the loop.
> +
>  @cindex @code{check_raw_ptrs@var{m}} instruction pattern
>  @item @samp{check_raw_ptrs@var{m}}
>  Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
> [...]
> diff --git a/gcc/tree-ssa-loop-manip.cc b/gcc/tree-ssa-loop-manip.cc
> index 909b705d00d..5abca64379e 100644
> --- a/gcc/tree-ssa-loop-manip.cc
> +++ b/gcc/tree-ssa-loop-manip.cc
> @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3.  If not see
>     so that we can free them all at once.  */
>  static bitmap_obstack loop_renamer_obstack;
>
> -/* Creates an induction variable with value BASE + STEP * iteration in LOOP.
> +/* Creates an induction variable with value BASE (+/-) STEP * iteration in LOOP.
> +   If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration.
> +   If CODE is MINUS_EXPR, the induction variable is BASE - STEP * iteration.
>     It is expected that neither BASE nor STEP are shared with other expressions
>     (unless the sharing rules allow this).  Use VAR as a base var_decl for it
>     (if NULL, a new temporary will be created).  The increment will occur at
> @@ -57,8 +59,8 @@ static bitmap_obstack loop_renamer_obstack;
>     VAR_AFTER (unless they are NULL).  */
>
>  void
> -create_iv (tree base, tree step, tree var, class loop *loop,
> -	   gimple_stmt_iterator *incr_pos, bool after,
> +create_iv (tree base, tree_code code, tree step, tree var,
> +	   class loop *loop, gimple_stmt_iterator *incr_pos, bool after,
>  	   tree *var_before, tree *var_after)
>  {
>    gassign *stmt;
> @@ -66,7 +68,9 @@ create_iv (tree base, tree step, tree var, class loop *loop,
>    tree initial, step1;
>    gimple_seq stmts;
>    tree vb, va;
> -  enum tree_code incr_op = PLUS_EXPR;
> +  /* The code can only be PLUS_EXPR or MINUS_EXPR.  */
> +  gcc_assert (code == PLUS_EXPR || code == MINUS_EXPR);
> +  tree_code incr_op = code;

As Richard said, we should be able to get rid of incr_op, probably by
calling the parameter incr_op.  I think the later:

  incr_op = MINUS_EXPR;

stmts need to be updated to something that flips between PLUS_EXPR and
MINUS_EXPR (with updates to the comments).

It would probably make sense to split the create_iv part out as a
separate prepatch.
>    edge pe = loop_preheader_edge (loop);
>
>    if (var != NULL_TREE)
> @@ -1365,7 +1369,7 @@ tree_transform_and_unroll_loop (class loop *loop, unsigned factor,
>        tree ctr_before, ctr_after;
>        gimple_stmt_iterator bsi = gsi_last_nondebug_bb (new_exit->src);
>        exit_if = as_a <gcond *> (gsi_stmt (bsi));
> -      create_iv (exit_base, exit_step, NULL_TREE, loop,
> +      create_iv (exit_base, PLUS_EXPR, exit_step, NULL_TREE, loop,
>  		 &bsi, false, &ctr_before, &ctr_after);
>        gimple_cond_set_code (exit_if, exit_cmp);
>        gimple_cond_set_lhs (exit_if, ctr_after);
> @@ -1580,8 +1584,8 @@ canonicalize_loop_ivs (class loop *loop, tree *nit, bool bump_in_latch)
>      gsi = gsi_last_bb (loop->latch);
>    else
>      gsi = gsi_last_nondebug_bb (loop->header);
> -  create_iv (build_int_cst_type (type, 0), build_int_cst (type, 1), NULL_TREE,
> -	     loop, &gsi, bump_in_latch, &var_before, NULL);
> +  create_iv (build_int_cst_type (type, 0), PLUS_EXPR, build_int_cst (type, 1),
> +	     NULL_TREE, loop, &gsi, bump_in_latch, &var_before, NULL);
>
>    rewrite_all_phi_nodes_with_iv (loop, var_before);
>
> [...]
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index 44bd5f2c805..d63ded5d4f0 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -385,6 +385,48 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>    return false;
>  }
>
> +/* Try to use permutes to define the lens in DEST_RGM using the lens
> +   in SRC_RGM, given that the former has twice as many lens as the
> +   latter.  Return true on success, adding any new statements to SEQ.  */

Nit: I think it's better to talk about "lengths" rather than "lens"
(which ordinarily means something different).

Like Jeff said, this operation is no longer a permute when applied to
lengths (as opposed to masks).  And because of that, I don't think
there's anything special about DEST_RGM having twice as many controls
as SRC_RGM.  (And the implementation does seem to be more general than
just a factor of two.)
> +
> +static bool
> +vect_maybe_permute_loop_lens (tree iv_type, gimple_seq *seq,
> +			      rgroup_controls *dest_rgm,
> +			      rgroup_controls *src_rgm)
> +{
> +  tree ctrl_type = dest_rgm->type;
> +  poly_uint64 nitems_per_ctrl
> +    = TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
> +  if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter)
> +    {
> +      for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> +	{
> +	  tree src = src_rgm->controls[i / dest_rgm->controls.length ()];
> +	  tree dest = dest_rgm->controls[i];
> +	  gassign *stmt;
> +	  if (i == 0)
> +	    {
> +	      /* MIN (X, VF*I/N) capped to the range [0, VF/N].  */
> +	      tree factor = build_int_cst (iv_type, nitems_per_ctrl);
> +	      stmt = gimple_build_assign (dest, MIN_EXPR, src, factor);
> +	      gimple_seq_add_stmt (seq, stmt);
> +	    }
> +	  else
> +	    {
> +	      /* (X - MIN (X, VF*I/N)) capped to the range [0, VF/N].  */
> +	      tree factor = build_int_cst (iv_type, nitems_per_ctrl * i);
> +	      tree temp = make_ssa_name (iv_type);
> +	      stmt = gimple_build_assign (temp, MIN_EXPR, src, factor);
> +	      gimple_seq_add_stmt (seq, stmt);
> +	      stmt = gimple_build_assign (dest, MINUS_EXPR, src, temp);
> +	      gimple_seq_add_stmt (seq, stmt);
> +	    }
> +	}
> +      return true;
> +    }
> +  return false;
> +}
> +

This return false thing looks odd.  I don't think we can decide on a
group-by-group basis whether to use this approach or not.  Instead it
has to be a per-loop decision.  Either (1) we use SELECT_VL for the
single-control group, and use this function to derive all other groups
from that, or (2) we just use MIN for all groups.  No mixing and
matching is allowed.

In other words, I think we need to decide at the top level whether to
use SELECT_VL or not.  As things stand, this decision should be "no"
if nitems_per_iter != 1 for *any* group.  If the decision is to use
SELECT_VL, we should use it for the single-control group only and then
unconditionally use the function above for the other groups, with no
failure path.
The decision about whether to use SELECT_VL or not should be cached if
necessary, so that it's available...

>  /* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
>     for all the rgroup controls in RGC and return a control that is nonzero
>     when the loop needs to iterate.  Add any new preheader statements to
> [...]
> @@ -10361,15 +10361,18 @@ vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>  }
>
>  /* Given a complete set of length LENS, extract length number INDEX for an
> -   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
> +   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.
> +   Insert any set-up statements before GSI.  */
>
>  tree
> -vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
> -		   unsigned int nvectors, unsigned int index)
> +vect_get_loop_len (loop_vec_info loop_vinfo, gimple_stmt_iterator *gsi,
> +		   vec_loop_lens *lens, unsigned int nvectors, tree vectype,
> +		   unsigned int index)
>  {
>    rgroup_controls *rgl = &(*lens)[nvectors - 1];
>    bool use_bias_adjusted_len =
>      LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) != 0;
> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>
>    /* Populate the rgroup's len array, if this is the first time we've
>       used it.  */
> @@ -10400,6 +10403,27 @@ vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>
>    if (use_bias_adjusted_len)
>      return rgl->bias_adjusted_ctrl;
> +  else if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
> +					   OPTIMIZE_FOR_SPEED))
> +    {
> +      tree loop_len = rgl->controls[index];
> +      poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type);
> +      poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype);
> +      if (maybe_ne (nunits1, nunits2))
> +	{
> +	  /* A loop len for data type X can be reused for data type Y
> +	     if X has N times more elements than Y and if Y's elements
> +	     are N times bigger than X's.  */
> +	  gcc_assert (multiple_p (nunits1, nunits2));
> +	  unsigned int factor = exact_div (nunits1, nunits2).to_constant ();
> +	  gimple_seq seq = NULL;
> +	  loop_len = gimple_build (&seq, RDIV_EXPR, iv_type, loop_len,
> +				   build_int_cst (iv_type, factor));
> +	  if (seq)
> +	    gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
> +	}
> +      return loop_len;
> +    }
>    else
>      return rgl->controls[index];
>  }

...here.  That is, the key isn't whether SELECT_VL is available, but
instead whether we've decided to use it for this loop (unless I'm
missing something).

> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 3ad6a7d28d7..26813ab56e9 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -3147,6 +3147,61 @@ vect_get_data_ptr_increment (vec_info *vinfo,
>    return iv_step;
>  }
>
> +/* Prepare the pointer IVs which needs to be updated by a variable amount.
> +   Such variable amount is the outcome of .SELECT_VL. In this case, we can
> +   allow each iteration process the flexible number of elements as long as
> +   the number <= vf elments.
> +
> +   Return data reference according to SELECT_VL.
> +   If new statements are needed, insert them before GSI.  */
> +
> +static tree
> +get_select_vl_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
> +			    tree aggr_type, class loop *at_loop, tree offset,
> +			    tree *dummy, gimple_stmt_iterator *gsi,
> +			    bool simd_lane_access_p, vec_loop_lens *loop_lens,
> +			    dr_vec_info *dr_info,
> +			    vect_memory_access_type memory_access_type)
> +{
> +  loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
> +  tree step = vect_dr_behavior (vinfo, dr_info)->step;
> +
> +  /* TODO: We don't support gather/scatter or load_lanes/store_lanes for pointer
> +     IVs are updated by variable amount but we will support them in the future.
> +   */
> +  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
> +	      && memory_access_type != VMAT_LOAD_STORE_LANES);
> +
> +  /* When we support SELECT_VL pattern, we dynamic adjust
> +     the memory address by .SELECT_VL result.
> +
> +     The result of .SELECT_VL is the number of elements to
> +     be processed of each iteration. So the memory address
> +     adjustment operation should be:
> +
> +     bytesize = GET_MODE_SIZE (element_mode (aggr_type));
> +     addr = addr + .SELECT_VL (ARG..) * bytesize;
> +  */
> +  gimple *ptr_incr;
> +  tree loop_len
> +    = vect_get_loop_len (loop_vinfo, gsi, loop_lens, 1, aggr_type, 0);
> +  tree len_type = TREE_TYPE (loop_len);
> +  poly_uint64 bytesize = GET_MODE_SIZE (element_mode (aggr_type));
> +  /* Since the outcome of .SELECT_VL is element size, we should adjust
> +     it into bytesize so that it can be used in address pointer variable
> +     amount IVs adjustment.  */
> +  tree tmp = fold_build2 (MULT_EXPR, len_type, loop_len,
> +			  build_int_cst (len_type, bytesize));
> +  if (tree_int_cst_sgn (step) == -1)
> +    tmp = fold_build1 (NEGATE_EXPR, len_type, tmp);
> +  tree bump = make_temp_ssa_name (len_type, NULL, "ivtmp");
> +  gassign *assign = gimple_build_assign (bump, tmp);
> +  gsi_insert_before (gsi, assign, GSI_SAME_STMT);

It seems a little wasteful to recompute the temporary results (into gsi)
for every pointer, but I guess it's not worth complicating the code to
do on-the-fly DCE.  Later passes can clean it up for us.  (Unless Richi
feels differently.  Can't remember who's said what now.)

> +  return vect_create_data_ref_ptr (vinfo, stmt_info, aggr_type, at_loop, offset,
> +				   dummy, gsi, &ptr_incr, simd_lane_access_p,
> +				   bump);
> +}
> +
>  /* Check and perform vectorization of BUILT_IN_BSWAP{16,32,64,128}.  */
>
>  static bool
> @@ -8279,7 +8334,7 @@ vectorizable_store (vec_info *vinfo,
>
>        stride_base = cse_and_gimplify_to_preheader (loop_vinfo, stride_base);
>        ivstep = cse_and_gimplify_to_preheader (loop_vinfo, ivstep);
> -      create_iv (stride_base, ivstep, NULL,
> +      create_iv (stride_base, PLUS_EXPR, ivstep, NULL,
>  		 loop, &incr_gsi, insert_after,
>  		 &offvar, NULL);
>        incr = gsi_stmt (incr_gsi);
> @@ -8540,6 +8595,17 @@ vectorizable_store (vec_info *vinfo,
>  	    vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
>  					 slp_node, &gs_info, &dataref_ptr,
>  					 &vec_offsets);
> +	  else if (loop_lens && loop_lens->length () == 1
> +		   && direct_internal_fn_supported_p (
> +		     IFN_SELECT_VL, LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo),
> +		     OPTIMIZE_FOR_SPEED)
> +		   && memory_access_type != VMAT_INVARIANT)
> +	    dataref_ptr
> +	      = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
> +					    simd_lane_access_p ? loop : NULL,
> +					    offset, &dummy, gsi,
> +					    simd_lane_access_p, loop_lens,
> +					    dr_info, memory_access_type);

Here too I think the check should be based on a cached record of whether
we're using SELECT_VL for this particular loop.

What happens for loop_lens->length () != 1?
Thanks,
Richard

>  	  else
>  	    dataref_ptr
>  	      = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
> @@ -8788,8 +8854,9 @@ vectorizable_store (vec_info *vinfo,
>  	      else if (loop_lens)
>  		{
>  		  tree final_len
> -		    = vect_get_loop_len (loop_vinfo, loop_lens,
> -					 vec_num * ncopies, vec_num * j + i);
> +		    = vect_get_loop_len (loop_vinfo, gsi, loop_lens,
> +					 vec_num * ncopies, vectype,
> +					 vec_num * j + i);
>  		  tree ptr = build_int_cst (ref_type, align * BITS_PER_UNIT);
>  		  machine_mode vmode = TYPE_MODE (vectype);
>  		  opt_machine_mode new_ovmode
> @@ -9450,7 +9517,7 @@ vectorizable_load (vec_info *vinfo,
>
>        stride_base = cse_and_gimplify_to_preheader (loop_vinfo, stride_base);
>        ivstep = cse_and_gimplify_to_preheader (loop_vinfo, ivstep);
> -      create_iv (stride_base, ivstep, NULL,
> +      create_iv (stride_base, PLUS_EXPR, ivstep, NULL,
>  		 loop, &incr_gsi, insert_after,
>  		 &offvar, NULL);
>
> @@ -9928,6 +9995,16 @@ vectorizable_load (vec_info *vinfo,
>  					 slp_node, &gs_info, &dataref_ptr,
>  					 &vec_offsets);
>  	    }
> +	  else if (loop_lens && loop_lens->length () == 1
> +		   && direct_internal_fn_supported_p (
> +		     IFN_SELECT_VL, LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo),
> +		     OPTIMIZE_FOR_SPEED)
> +		   && memory_access_type != VMAT_INVARIANT)
> +	    dataref_ptr
> +	      = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
> +					    at_loop, offset, &dummy, gsi,
> +					    simd_lane_access_p, loop_lens,
> +					    dr_info, memory_access_type);
>  	  else
>  	    dataref_ptr
>  	      = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
> @@ -10144,8 +10221,8 @@ vectorizable_load (vec_info *vinfo,
>  		    else if (loop_lens && memory_access_type != VMAT_INVARIANT)
>  		      {
>  			tree final_len
> -			  = vect_get_loop_len (loop_vinfo, loop_lens,
> -					       vec_num * ncopies,
> +			  = vect_get_loop_len (loop_vinfo, gsi, loop_lens,
> +					       vec_num * ncopies, vectype,
>  					       vec_num * j + i);
>  			tree ptr = build_int_cst (ref_type,
>  						  align * BITS_PER_UNIT);
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 9cf2fb23fe3..357bc5c7315 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2293,8 +2293,8 @@ extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
>  				unsigned int, tree, unsigned int);
>  extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
>  				  tree, unsigned int);
> -extern tree vect_get_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
> -			       unsigned int);
> +extern tree vect_get_loop_len (loop_vec_info, gimple_stmt_iterator *, vec_loop_lens *,
> +			       unsigned int, tree, unsigned int);
>  extern gimple_seq vect_gen_len (tree, tree, tree, tree);
>  extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
>  extern bool reduction_fn_for_scalar_code (code_helper, internal_fn *);