Thanks so much, Richard.

>> I don't think that's guaranteed by the proposed definition of WHILE_LEN.
>> The first int64_t WHILE_LEN could come up short, and return something
>> less than VF/2.

I'm sorry, the comment on vect_set_loop_controls_by_while_len was misleading
and incorrect; I have sent a V3 patch to fix it.  I don't actually use
WHILE_LEN in the multiple-rgroup situation.  Instead I use MIN_EXPR to force
VF elements in each non-final iteration, so that the result is correct.

Yes, I agree that WHILE_LEN would cause problems in SLP situations with
multiple rgroups, since its definition allows the target to produce a non-VF
outcome in a non-final iteration.  I fully tested this patch with local
downstream auto-vectorization testcases and saw exactly the issue you
describe, which is why I use MIN_EXPR for multiple rgroups; I just forgot to
update the comments, which caused the confusion.  Really sorry about that.

>> Is this ifn-specific handling needed?  From the description, it seems
>> like WHILE_LEN could be a normal binary internal function.

>> The comment needs to be updated to describe the new interface.

>> This is personal preference, but: I think the interface would be
>> clearer if the code argument came between the base and step,
>> so that the order matches a SCEV.  The code could no longer be
>> a default argument, and so all callers would need to be updated,
>> but IMO that's OK.  Not sure what others think though.

>> The patch doesn't seem to change vect_estimate_min_profitable_iters though,
>> so the comment doesn't seem accurate.

All of these comments are addressed in the V3 patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616730.html

Feel free to comment.
Thanks.

juzhe.zhong@rivai.ai

From: Richard Sandiford
Date: 2023-04-26 00:58
To: juzhe.zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH] VECT: Add decrement IV iteration loop control by variable amount support

juzhe.zhong@rivai.ai writes:
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 6e81dc05e0e..5f44def90d3 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -127,6 +127,7 @@ init_internal_fns ()
>  #define cond_binary_direct { 1, 1, true }
>  #define cond_ternary_direct { 1, 1, true }
>  #define while_direct { 0, 2, false }
> +#define while_len_direct { 0, 0, false }
>  #define fold_extract_direct { 2, 2, false }
>  #define fold_left_direct { 1, 1, false }
>  #define mask_fold_left_direct { 1, 1, false }
> @@ -3702,6 +3703,33 @@ expand_while_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>      emit_move_insn (lhs_rtx, ops[0].value);
>  }
>
> +/* Expand WHILE_LEN call STMT using optab OPTAB.  */
> +static void
> +expand_while_len_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
> +{
> +  expand_operand ops[3];
> +  tree rhs_type[2];
> +
> +  tree lhs = gimple_call_lhs (stmt);
> +  tree lhs_type = TREE_TYPE (lhs);
> +  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
> +
> +  for (unsigned int i = 0; i < gimple_call_num_args (stmt); ++i)
> +    {
> +      tree rhs = gimple_call_arg (stmt, i);
> +      rhs_type[i] = TREE_TYPE (rhs);
> +      rtx rhs_rtx = expand_normal (rhs);
> +      create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type[i]));
> +    }
> +
> +  insn_code icode = direct_optab_handler (optab, TYPE_MODE (rhs_type[0]));
> +
> +  expand_insn (icode, 3, ops);
> +  if (!rtx_equal_p (lhs_rtx, ops[0].value))
> +    emit_move_insn (lhs_rtx, ops[0].value);
> +}

Is this ifn-specific handling needed?  From the description, it seems
like WHILE_LEN could be a normal binary internal function.

> +
>  /* Expand a call to a convert-like optab using the operands in STMT.
>     FN has a single output operand and NARGS input operands.  */
>
> @@ -3843,6 +3871,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
>  #define direct_scatter_store_optab_supported_p convert_optab_supported_p
>  #define direct_len_store_optab_supported_p direct_optab_supported_p
>  #define direct_while_optab_supported_p convert_optab_supported_p
> +#define direct_while_len_optab_supported_p direct_optab_supported_p
>  #define direct_fold_extract_optab_supported_p direct_optab_supported_p
>  #define direct_fold_left_optab_supported_p direct_optab_supported_p
>  #define direct_mask_fold_left_optab_supported_p direct_optab_supported_p
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 7fe742c2ae7..3a933abff5d 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -153,6 +153,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
>  DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
>
>  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
> +DEF_INTERNAL_OPTAB_FN (WHILE_LEN, ECF_CONST | ECF_NOTHROW, while_len, while_len)
>  DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
>  		       check_raw_ptrs, check_ptrs)
>  DEF_INTERNAL_OPTAB_FN (CHECK_WAR_PTRS, ECF_CONST | ECF_NOTHROW,
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 695f5911b30..f5938bd2c24 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -476,3 +476,4 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
>  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
>  OPTAB_D (len_load_optab, "len_load_$a")
>  OPTAB_D (len_store_optab, "len_store_$a")
> +OPTAB_D (while_len_optab, "while_len$a")
> diff --git a/gcc/tree-ssa-loop-manip.cc b/gcc/tree-ssa-loop-manip.cc
> index a52277abdbf..54845a62298 100644
> --- a/gcc/tree-ssa-loop-manip.cc
> +++ b/gcc/tree-ssa-loop-manip.cc
> @@ -59,14 +59,14 @@ static bitmap_obstack loop_renamer_obstack;
>  void
>  create_iv (tree base, tree step, tree var, class loop *loop,
> 	   gimple_stmt_iterator *incr_pos, bool after,
> -	   tree *var_before, tree *var_after)
> +	   tree *var_before, tree *var_after, enum tree_code code)

The comment needs to be updated to describe the new interface.

This is personal preference, but: I think the interface would be
clearer if the code argument came between the base and step,
so that the order matches a SCEV.  The code could no longer be
a default argument, and so all callers would need to be updated,
but IMO that's OK.  Not sure what others think though.

>  {
>    gassign *stmt;
>    gphi *phi;
>    tree initial, step1;
>    gimple_seq stmts;
>    tree vb, va;
> -  enum tree_code incr_op = PLUS_EXPR;
> +  enum tree_code incr_op = code;
>    edge pe = loop_preheader_edge (loop);
>
>    if (var != NULL_TREE)
> diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
> index d49273a3987..da755320a3a 100644
> --- a/gcc/tree-ssa-loop-manip.h
> +++ b/gcc/tree-ssa-loop-manip.h
> @@ -23,7 +23,7 @@ along with GCC; see the file COPYING3.  If not see
>  typedef void (*transform_callback)(class loop *, void *);
>
>  extern void create_iv (tree, tree, tree, class loop *, gimple_stmt_iterator *,
> -		       bool, tree *, tree *);
> +		       bool, tree *, tree *, enum tree_code = PLUS_EXPR);
>  extern void rewrite_into_loop_closed_ssa (bitmap, unsigned);
>  extern void verify_loop_closed_ssa (bool, class loop * = NULL);
>
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index f60fa50e8f4..a1c892f285a 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -682,6 +682,210 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>    return next_ctrl;
>  }
>
> +/* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
> +   for all the rgroup controls in RGC and return a control that is nonzero
> +   when the loop needs to iterate.  Add any new preheader statements to
> +   PREHEADER_SEQ.  Use LOOP_COND_GSI to insert code before the exit gcond.
> +
> +   RGC belongs to loop LOOP.
The loop originally iterated NITERS
> +   times and has been vectorized according to LOOP_VINFO.
> +
> +   Unlike vect_set_loop_controls_directly which is iterating from 0-based IV
> +   to TEST_LIMIT - bias.
> +
> +   In vect_set_loop_controls_by_while_len, we are iterating from start at
> +   IV = TEST_LIMIT - bias and keep subtract IV by the length calculated by
> +   IFN_WHILE_LEN pattern.
> +
> +   Note: the cost of the code generated by this function is modeled
> +   by vect_estimate_min_profitable_iters, so changes here may need
> +   corresponding changes there.

The patch doesn't seem to change vect_estimate_min_profitable_iters though,
so the comment doesn't seem accurate.

> +
> +   1. Single rgroup, the Gimple IR should be:
> +
> +	_19 = (unsigned long) n_5(D);
> +	...
> +
> +	:
> +	...
> +	# ivtmp_20 = PHI
> +	...
> +	_22 = .WHILE_LEN (ivtmp_20, vf);
> +	...
> +	vector statement (use _22);
> +	...
> +	ivtmp_21 = ivtmp_20 - _22;
> +	...
> +	if (ivtmp_21 != 0)
> +	  goto ; [75.00%]
> +	else
> +	  goto ; [25.00%]
> +
> +	return;
> +
> +   Note: IFN_WHILE_LEN will guarantee "ivtmp_21 = ivtmp_20 - _22" never
> +   underflow 0.
> +
> +   2. Multiple rgroup, the Gimple IR should be:
> +
> +	_70 = (unsigned long) bnd.7_52;
> +	_71 = _70 * 2;
> +	_72 = MAX_EXPR <_71, 4>;
> +	_73 = _72 + 18446744073709551612;
> +	...
> +
> +	:
> +	...
> +	# ivtmp_74 = PHI
> +	# ivtmp_77 = PHI
> +	_76 = .WHILE_LEN (ivtmp_74, vf * nitems_per_ctrl);
> +	_79 = .WHILE_LEN (ivtmp_77, vf * nitems_per_ctrl);
> +	...
> +	vector statement (use _79);
> +	...
> +	vector statement (use _76);
> +	...
> +	_65 = _79 / 2;
> +	vector statement (use _65);
> +	...
> +	_68 = _76 / 2;
> +	vector statement (use _68);
> +	...
> +	ivtmp_78 = ivtmp_77 - _79;
> +	ivtmp_75 = ivtmp_74 - _76;
> +	...
> +	if (ivtmp_78 != 0)
> +	  goto ; [75.00%]
> +	else
> +	  goto ; [25.00%]
> +
> +	return;
> +
> +*/

I'm not sure it's safe to handle multiple rgroups like this.
For example, if a loop operates on both int32_ts and int64_ts, and uses
fully-packed vectors for both data types, there will be two int64_t
vectors per int32_t vector (since the total number of lanes has to be
the same for both types).  The int32_t vectors would be controlled by
one rgroup (with one control) and the int64_t vectors would be
controlled by a different rgroup (with two controls).  But it's
critical that the two rgroups process the same number of scalars every
iteration, and that they "agree" on which elements are being processed.

So if the int32_t length is some value X, where X <= VF, then the
int32_t control will enable the processing of X consecutive elements,
starting at the first lane.  It's then critical that the int64_t
control enables processing of those same elements.  If VF/2 <= X <= VF
then the first int64_t control must include all VF/2 elements and the
second must include exactly the first X - VF/2 elements.

I don't think that's guaranteed by the proposed definition of
WHILE_LEN.  The first int64_t WHILE_LEN could come up short, and return
something less than VF/2.

Thanks,
Richard