>> It's been pretty standard to stick with just PLUS_EXPR for this stuff >> and instead negate the constant to produce the same effect as >> MINUS_EXPR. Is there a reason we're not continuing that practice? >> Sorry if you've answered this already -- if you have, you can just point >> me at the prior discussion and I'll read it. Richard (Sandiford) has answered this question: https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616745.html And as Richard (Biener) said, it will make IVOPTs failed if we have variable IVs. However, we already have implemented variable IVs in downstream RVV GCC and works fine and I don't see any bad codegen so far. So, I think it may not be a serious issue for RVV. Besides, this implementation is not my idea which is just following the guide coming from RVV ISA spec. And also inspired by the implementation coming from LLVM. Reference: 1). RVV ISA: https://github.com/riscv/riscv-v-spec/blob/master/example/vvaddint32.s 2). LLVM length stuff implementation (Should note that "get_vector_length" pattern in LLVM is totally doing the same thing as "select_vl" pattern in GCC here: https://reviews.llvm.org/D99750 vvaddint32: vsetvli t0, a0, e32, ta, ma # Set vector length based on 32-bit vectors vle32.v v0, (a1) # Get first vector sub a0, a0, t0 # Decrement number done slli t0, t0, 2 # Multiply number done by 4 bytes add a1, a1, t0 # Bump pointer vle32.v v1, (a2) # Get second vector add a2, a2, t0 # Bump pointer vadd.vv v2, v0, v1 # Sum vectors vse32.v v2, (a3) # Store result add a3, a3, t0 # Bump pointer bnez a0, vvaddint32 # Loop back ret # Finished Notice "sub a0, a0, t0", the "t0" is the variable coming from "vsetvli t0, a0, e32, ta, ma" which is generated by "get_vector_length" in LLVM, or similiar it is also generatd by "select_vl" in GCC too. Other comments from Jeff will be addressed in the next patch (V5), I will wait for Richards (both Sandiford && Biener that are the experts in Loop Vectorizer) comments. Then send V5 patch which is including all comments from Jeff && Richards (Sandiford && Biener). Thanks. juzhe.zhong@rivai.ai From: Jeff Law Date: 2023-05-07 23:19 To: juzhe.zhong; gcc-patches CC: richard.sandiford; rguenther Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support On 5/4/23 07:25, juzhe.zhong@rivai.ai wrote: > From: Ju-Zhe Zhong > > This patch is fixing V3 patch: > https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zhong@rivai.ai/ > > Fix issues according to Richard Sandiford && Richard Biener. > > 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford. > 2. Support multiple-rgroup for non-SLP auto-vectorization. > > For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the total length: > > _36 = MIN_EXPR ; > > First length (MIN (X, VF/N)): > loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>; > > Second length (X - MIN (X, 1 * VF/N)): > loop_len_16 = _36 - loop_len_15; > > Third length (X - MIN (X, 2 * VF/N)): > _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>; > loop_len_17 = _36 - _38; > > Forth length (X - MIN (X, 3 * VF/N)): > _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>; > loop_len_18 = _36 - _39; > > The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length since using SELECT_VL > to adapt induction IV consumes more instructions than just using MIN_EXPR. Also, during testing, > I found it's hard to adjust length correctly according to SELECT_VL. > > So, this patch we only use SELECT_VL for single-rgroup with single length control. > > 3. Fix document of select_vl for Richard Biener (remove mode N). > 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard Biener. > 5. Keep loop_vinfo as first parameter for "vect_get_loop_len". > 6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated at the caller site. > > More comments from Richard Biener: >>> So it's not actually saturating. The saturating operation is done by .WHILE_LEN? > I define the outcome of SELECT_VL (n, vf) (WHILE_LEN) = IN_RANGE (0, min (n, vf)) will make > the loop control counter never underflow zero. > >>> I see. I wonder if it makes sense to leave .WHILE_LEN aside for a start, >>> the above scheme should also work for single rgroups, no? >>> As said, it _looks_ like you can progress without .WHILE_LEN and using >>> .WHILE_LEN is a pure optimization? > Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow > target adjust any length = INRANGE (0, min (n, vf)) each iteration. > > Let me known if I missed something for the V3 patch. So at a high level this is pretty good. I think there's some improvements we should make in the documentation and comments, but I'm comfortable with most of the implementation details. > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi > index cc4a93a8763..99cf0cdbdca 100644 > --- a/gcc/doc/md.texi > +++ b/gcc/doc/md.texi > @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++) > operand0[i] = operand0[i - 1] && (operand1 + i < operand2); > @end smallexample > > +@cindex @code{select_vl@var{m}} instruction pattern > +@item @code{select_vl@var{m}} > +Set operand 0 to the number of active elements in vector will be updated value. This reads rather poorly. Is this still accurate? Set operand 0 to the number of active elements in a vector to be updated in a loop iteration based on the total number of elements to be updated, the vectorization factor and vector properties of the target. > +operand 1 is the total elements need to be updated value. operand 1 is the total elements in the vector to be updated. > + > +The output of this pattern is not only used as IV of loop control counter, but also > +is used as the IV of address calculation with multiply/shift operation. This allow > +us dynamic adjust the number of elements is processed in each iteration of the loop. This allows dynamic adjustment of the number of elements processed each loop iteration. -- is that still accurate and does it read better? > @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3. If not see > so that we can free them all at once. */ > static bitmap_obstack loop_renamer_obstack; > > -/* Creates an induction variable with value BASE + STEP * iteration in LOOP. > +/* Creates an induction variable with value BASE (+/-) STEP * iteration in LOOP. > + If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration. > + If CODE is MINUS_EXPR, the induction variable is BASE - STEP * iteration. > It is expected that neither BASE nor STEP are shared with other expressions > (unless the sharing rules allow this). Use VAR as a base var_decl for it > (if NULL, a new temporary will be created). The increment will occur at It's been pretty standard to stick with just PLUS_EXPR for this stuff and instead negate the constant to produce the same effect as MINUS_EXPR. Is there a reason we're not continuing that practice? Sorry if you've answered this already -- if you have, you can just point me at the prior discussion and I'll read it. > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc > index 44bd5f2c805..d63ded5d4f0 100644 > --- a/gcc/tree-vect-loop-manip.cc > +++ b/gcc/tree-vect-loop-manip.cc > @@ -385,6 +385,48 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm, > return false; > } > > +/* Try to use permutes to define the lens in DEST_RGM using the lens > + in SRC_RGM, given that the former has twice as many lens as the > + latter. Return true on success, adding any new statements to SEQ. */ I would suggest not using "permute" in this description. When I read permute in the context of vectorization, I think of a vector permute to scramble elements within a vector. This looks like you're just adjusting how many vector elements you're operating on. > + { > + /* For SLP, we can't allow non-VF number of elements to be processed > + in non-final iteration. We force the number of elements to be > + processed in each non-final iteration is VF elements. If we allow > + non-VF elements processing in non-final iteration will make SLP too > + complicated and produce inferior codegen. Looks like you may have mixed up spaces and tabs in the above comment. Just a nit, but let's go ahead and get it fixed. > @@ -703,6 +1040,10 @@ vect_set_loop_condition_partial_vectors (class loop *loop, > > bool use_masks_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo); > tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo); > + tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo); > + bool use_vl_p = !use_masks_p > + && direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type, > + OPTIMIZE_FOR_SPEED); When you break a line with a logical like this, go ahead and add parenthesis and make sure the logical aligns just after the paren. ie bool use_vl_p = (!use_masks_p && direct.... Alternately, compute the direct_itnernal_fn_supported_p into its own boolean and then you don't need as much line wrapping. In general, don't be afraid to use extra temporaries if doing so improves readability. > + else if (loop_lens && loop_lens->length () == 1 > + && direct_internal_fn_supported_p ( > + IFN_SELECT_VL, LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo), > + OPTIMIZE_FOR_SPEED) > + && memory_access_type != VMAT_INVARIANT) This looks like a good example of code that would be easier to read if the call to direct_internal-fn_supported_p was saved into a temporary. Similarly for the instance you added in vectorizable_load. I'd like to get this patch wrapped up soon. But I also want to give both Richards a chance to chime in with their concerns. Thanks, Jeff