>> It's been pretty standard to stick with just PLUS_EXPR for this stuff
>> and instead negate the constant to produce the same effect as
>> MINUS_EXPR.  Is there a reason we're not continuing that practice?
>> Sorry if you've answered this already -- if you have, you can just point
>> me at the prior discussion and I'll read it.

Richard (Sandiford) has answered this question:
https://gcc.gnu.org/pipermail/gcc-patches/2023-April/616745.html 
And as Richard (Biener) said, it will make IVOPTs failed if we have variable IVs.
However, we already have implemented variable IVs in downstream RVV GCC
and works fine and I don't see any bad codegen so far. So, I think it may not
be a serious issue for RVV.

Besides, this implementation is not my idea which is just following the guide coming
from RVV ISA spec.  And also inspired by the implementation coming from LLVM.
Reference:
1). RVV ISA: https://github.com/riscv/riscv-v-spec/blob/master/example/vvaddint32.s 
2). LLVM length stuff implementation (Should note that "get_vector_length" pattern in LLVM is
     totally doing the same thing as "select_vl" pattern in GCC here:
     https://reviews.llvm.org/D99750 

vvaddint32:
    vsetvli t0, a0, e32, ta, ma  # Set vector length based on 32-bit vectors
    vle32.v v0, (a1)         # Get first vector
      sub a0, a0, t0         # Decrement number done
      slli t0, t0, 2         # Multiply number done by 4 bytes
      add a1, a1, t0         # Bump pointer
    vle32.v v1, (a2)         # Get second vector
      add a2, a2, t0         # Bump pointer
    vadd.vv v2, v0, v1       # Sum vectors
    vse32.v v2, (a3)         # Store result
      add a3, a3, t0         # Bump pointer
      bnez a0, vvaddint32    # Loop back
      ret                    # Finished

Notice "sub a0, a0, t0", the "t0" is the variable coming from "vsetvli t0, a0, e32, ta, ma"
which is generated by "get_vector_length" in LLVM, or similiar it is also generatd by "select_vl" in GCC too.

Other comments from Jeff will be addressed in the next patch (V5), I will wait for
Richards (both Sandiford && Biener that are the experts in Loop Vectorizer) comments.
Then send V5 patch which is including all comments from Jeff && Richards (Sandiford && Biener).

Thanks.


juzhe.zhong@rivai.ai
 
From: Jeff Law
Date: 2023-05-07 23:19
To: juzhe.zhong; gcc-patches
CC: richard.sandiford; rguenther
Subject: Re: [PATCH V4] VECT: Add decrement IV iteration loop control by variable amount support
 
 
On 5/4/23 07:25, juzhe.zhong@rivai.ai wrote:
> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> 
> This patch is fixing V3 patch:
> https://patchwork.sourceware.org/project/gcc/patch/20230407014741.139387-1-juzhe.zhong@rivai.ai/
> 
> Fix issues according to Richard Sandiford && Richard Biener.
> 
> 1. Rename WHILE_LEN pattern into SELECT_VL according to Richard Sandiford.
> 2. Support multiple-rgroup for non-SLP auto-vectorization.
> 
>     For vec_pack_trunc pattern (multi-rgroup of non-SLP), we generate the total length:
> 
>       _36 = MIN_EXPR <ivtmp_34, POLY_INT_CST [8, 8]>;
> 
>       First length (MIN (X, VF/N)):
>         loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;
> 
>       Second length (X - MIN (X, 1 * VF/N)):
>         loop_len_16 = _36 - loop_len_15;
> 
>       Third length (X - MIN (X, 2 * VF/N)):
>         _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
>         loop_len_17 = _36 - _38;
> 
>       Forth length (X - MIN (X, 3 * VF/N)):
>         _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
>         loop_len_18 = _36 - _39;
> 
> The reason that I use MIN_EXPR instead of SELECT_VL to calculate total length since using SELECT_VL
> to adapt induction IV consumes more instructions than just using MIN_EXPR. Also, during testing,
> I found it's hard to adjust length correctly according to SELECT_VL.
> 
> So, this patch we only use SELECT_VL for single-rgroup with single length control.
> 
> 3. Fix document of select_vl for Richard Biener (remove mode N).
> 4. Fix comments of vect_set_loop_controls_by_select_vl according to Richard Biener.
> 5. Keep loop_vinfo as first parameter for "vect_get_loop_len".
> 6. make requirement of get_while_len_data_ref_ptr outside, let it to be gated at the caller site.
> 
> More comments from Richard Biener:
>>> So it's not actually saturating.  The saturating operation is done by .WHILE_LEN?
> I define the outcome of SELECT_VL (n, vf)  (WHILE_LEN) = IN_RANGE (0, min (n, vf)) will make
> the loop control counter never underflow zero.
> 
>>> I see.  I wonder if it makes sense to leave .WHILE_LEN aside for a start,
>>> the above scheme should also work for single rgroups, no?
>>> As said, it _looks_ like you can progress without .WHILE_LEN and using
>>> .WHILE_LEN is a pure optimization?
> Yes, SELECT_VL (WHILE_LEN) is pure optimization for single-rgroup and allow
> target adjust any length = INRANGE (0, min (n, vf)) each iteration.
> 
> Let me known if I missed something for the V3 patch.
So at a high level this is pretty good.  I think there's some 
improvements we should make in the documentation and comments, but I'm 
comfortable with most of the implementation details.
 
 
 
> 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index cc4a93a8763..99cf0cdbdca 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,40 @@ for (i = 1; i < operand3; i++)
>     operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>   @end smallexample
>   
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of active elements in vector will be updated value.
This reads rather poorly.  Is this still accurate?
 
Set operand 0 to the number of active elements in a vector to be updated 
in a loop iteration based on the total number of elements to be updated, 
the vectorization factor and vector properties of the target.
 
 
> +operand 1 is the total elements need to be updated value.
operand 1 is the total elements in the vector to be updated.
 
 
> +
> +The output of this pattern is not only used as IV of loop control counter, but also
> +is used as the IV of address calculation with multiply/shift operation. This allow
> +us dynamic adjust the number of elements is processed in each iteration of the loop.
This allows dynamic adjustment of the number of elements processed each 
loop iteration. -- is that still accurate and does it read better?
 
 
> @@ -47,7 +47,9 @@ along with GCC; see the file COPYING3.  If not see
>      so that we can free them all at once.  */
>   static bitmap_obstack loop_renamer_obstack;
>   
> -/* Creates an induction variable with value BASE + STEP * iteration in LOOP.
> +/* Creates an induction variable with value BASE (+/-) STEP * iteration in LOOP.
> +   If CODE is PLUS_EXPR, the induction variable is BASE + STEP * iteration.
> +   If CODE is MINUS_EXPR, the induction variable is BASE - STEP * iteration.
>      It is expected that neither BASE nor STEP are shared with other expressions
>      (unless the sharing rules allow this).  Use VAR as a base var_decl for it
>      (if NULL, a new temporary will be created).  The increment will occur at
It's been pretty standard to stick with just PLUS_EXPR for this stuff 
and instead negate the constant to produce the same effect as 
MINUS_EXPR.  Is there a reason we're not continuing that practice? 
Sorry if you've answered this already -- if you have, you can just point 
me at the prior discussion and I'll read it.
 
 
 
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index 44bd5f2c805..d63ded5d4f0 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -385,6 +385,48 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>     return false;
>   }
>   
> +/* Try to use permutes to define the lens in DEST_RGM using the lens
> +   in SRC_RGM, given that the former has twice as many lens as the
> +   latter.  Return true on success, adding any new statements to SEQ.  */
I would suggest not using "permute" in this description.  When I read 
permute in the context of vectorization, I think of a vector permute to 
scramble elements within a vector.
 
This looks like you're just adjusting how many vector elements you're 
operating on.
 
> + {
> +   /* For SLP, we can't allow non-VF number of elements to be processed
> +      in non-final iteration. We force the number of elements to be
> +    processed in each non-final iteration is VF elements. If we allow
> +    non-VF elements processing in non-final iteration will make SLP too
> +    complicated and produce inferior codegen.
Looks like you may have mixed up spaces and tabs in the above comment. 
Just a nit, but let's go ahead and get it fixed.
 
> @@ -703,6 +1040,10 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
>   
>     bool use_masks_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
>     tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> +  bool use_vl_p = !use_masks_p
> +   && direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
> +      OPTIMIZE_FOR_SPEED);
When you break a line with a logical like this, go ahead and add 
parenthesis and make sure the logical aligns just after the paren.  ie
 
bool use_vl_p = (!use_masks_p
&& direct....
 
 
Alternately, compute the direct_itnernal_fn_supported_p into its own 
boolean and then you don't need as much line wrapping.
 
In general, don't be afraid to use extra temporaries if doing so 
improves readability.
 
 
 
 
 
> +   else if (loop_lens && loop_lens->length () == 1
> +    && direct_internal_fn_supported_p (
> +      IFN_SELECT_VL, LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo),
> +      OPTIMIZE_FOR_SPEED)
> +    && memory_access_type != VMAT_INVARIANT)
This looks like a good example of code that would be easier to read if 
the call to direct_internal-fn_supported_p was saved into a temporary. 
Similarly for the instance you added in vectorizable_load.
 
 
I'd like to get this patch wrapped up soon.   But I also want to give 
both Richards a chance to chime in with their concerns.
 
Thanks,
 
Jeff