Hi, Richard. Thanks for the comments.

>> I think it would be worth saying in more detail why we only use SELECT_VL
>> for single rgroups.  I assume the reason is to simplify the pointer IV
>> updates.  Is that right?

Yes.

>> The difficulty is
>> that an rgroup that controls N vector loads (say) would need N pointer
>> updates by variable amounts.  But I'm not 100% sure whether we're
>> avoiding that situation because it's difficult to code, or because
>> it's inefficient.  Or maybe we're avoiding it because it doesn't
>> fit well with the later RVV vsetvl pass.
I don't want to use SELECT_VL pattern for multiple rgroups since
it's really ineffecient and also not fit well with latter RVV vsetvl PASS. 
(I have a draft in my downstream, the loop body becomes very ugly with a
 lot of instructions to adjust IVs).

Since we define SELECT_VL as a flexible pattern that doesn't have the side effect
set vector length, we need much more scalar operations to adjust the pointer IVs
and it cause more vsetvli than just using MIN.

Also, it changes a lot in middle-end and make middle-end codes too ugly and no
benefits I see so far.

Current approach (MIN), I think the current codegen is good (even though may not be perfect).

Besides, LLVM only can handle one vector length (in GCC, we call multiple rgroup).

I think RVV GCC is already in a good shape now in case of loop control.

I'd rather support all RVV auto-vectorization features soon and focus on optimizing loopVectorizer
(For example, I known GCC in TSVC has 46 fails, fixing failed vectorization case will improve much more,
 I think it can also improve ARM. ).

Address comments and will send the next patch, really appreciate it.

Thanks.  


juzhe.zhong@rivai.ai
 
From: Richard Sandiford
Date: 2023-06-05 05:59
To: juzhe.zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH] VECT: Add SELECT_VL support
Sorry for the slow review.
 
I don't know the IV-related parts well enough to review those properly,
but they looked reasonable to me.  Hopefully Richi can comment.
 
I'm curious though.  For:
 
> +  tree step = vect_dr_behavior (vinfo, dr_info)->step;
> +
> +  [...]
> +  poly_uint64 bytesize = GET_MODE_SIZE (element_mode (aggr_type));
> +  /* Since the outcome of .SELECT_VL is element size, we should adjust
> +     it into bytesize so that it can be used in address pointer variable
> +     amount IVs adjustment.  */
> +  tree tmp = fold_build2 (MULT_EXPR, len_type, loop_len,
> +   build_int_cst (len_type, bytesize));
> +  if (tree_int_cst_sgn (step) == -1)
> +    tmp = fold_build1 (NEGATE_EXPR, len_type, tmp);
 
Could you not just multiply loop_len by step, probably written as:
 
  build_int_cst (len_type, wi::to_widest (step))
 
avoiding the NEGATE_EXPR and bytesize calculation?  step should
represent the step of the original scalar IV, so doing that feels
more direct.
 
The loop-control bits look good to me apart from one hunk:
 
> @@ -2737,6 +2738,14 @@ start_over:
>  LOOP_VINFO_VECT_FACTOR (loop_vinfo))))
>      LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = true;
>  
> +  /* If we're using decrement IV and SELECT_VL is supported by the target.
> +     Use output of SELECT_VL to adjust IV of loop control and data reference.
> +     Note: We only use SELECT_VL on single-rgroup control.  */
> +  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
> +      && LOOP_VINFO_LENS (loop_vinfo).length () == 1
> +      && !slp)
> +    LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
> +
>    /* If we're vectorizing an epilogue loop, the vectorized loop either needs
>       to be able to handle fewer than VF scalars, or needs to have a lower VF
>       than the main loop.  */
 
This test also needs to check that the target implements the select_vl
optab for the chosen iv type.  You can check that using
direct_internal_fn_supported_p.
 
We should also check that LOOP_VINFO_LENS (loop_vinfo)[0].factor == 1,
since the IV update multiplies by the size in bytes.
 
I think it would be worth saying in more detail why we only use SELECT_VL
for single rgroups.  I assume the reason is to simplify the pointer IV
updates.  Is that right?
 
That is: the multiple length controls that are currently generated from
a MIN_EXPR IV should also work with a SELECT_VL IV.  The difficulty is
that an rgroup that controls N vector loads (say) would need N pointer
updates by variable amounts.  But I'm not 100% sure whether we're
avoiding that situation because it's difficult to code, or because
it's inefficient.  Or maybe we're avoiding it because it doesn't
fit well with the later RVV vsetvl pass.
 
Thanks,
Richard