From: Richard Sandiford
To: juzhe.zhong@rivai.ai
Cc: gcc-patches@gcc.gnu.org, kito.cheng@gmail.com, palmer@dabbelt.com,
	richard.guenther@gmail.com
Subject: Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer
Date: Fri, 12 May 2023 12:08:05 +0100
In-Reply-To: <20230511231126.1594132-1-juzhe.zhong@rivai.ai> (juzhe zhong's
	message of "Fri, 12 May 2023 07:11:26 +0800")

juzhe.zhong@rivai.ai writes:
> From: Ju-Zhe Zhong
>
> 1. Fix document description according to Jeff && Richard.
> 2. Add LOOP_VINFO_USING_SELECT_VL_P for single rgroup.
> 3. Add LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P for SLP multiple rgroup.
>
> Fix bugs for V5 after testing:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618209.html
>
> gcc/ChangeLog:
>
>         * doc/md.texi: Add select_vl pattern.
>         * internal-fn.def (SELECT_VL): New ifn.
>         * optabs.def (OPTAB_D): New optab.
>         * tree-vect-loop-manip.cc (vect_adjust_loop_lens): New function.
>         (vect_set_loop_controls_by_select_vl): Ditto.
>         (vect_set_loop_condition_partial_vectors): Add loop control for
>         decrement IV.
>         * tree-vect-loop.cc (vect_get_loop_len): Adjust loop len for SLP.
>         * tree-vect-stmts.cc (get_select_vl_data_ref_ptr): New function.
>         (vectorizable_store): Support data reference IVs updated by the
>         outcome of SELECT_VL.
>         (vectorizable_load): Ditto.
>         * tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): New macro.
>         (LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P): Ditto.
>         (vect_get_loop_len): Adjust loop len for SLP.
>
> ---
>  gcc/doc/md.texi             |  36 ++++
>  gcc/internal-fn.def         |   1 +
>  gcc/optabs.def              |   1 +
>  gcc/tree-vect-loop-manip.cc | 380 +++++++++++++++++++++++++++++++++++-
>  gcc/tree-vect-loop.cc       |  31 ++-
>  gcc/tree-vect-stmts.cc      |  79 +++++++-
>  gcc/tree-vectorizer.h       |  12 +-
>  7 files changed, 526 insertions(+), 14 deletions(-)
>
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 8ebce31ba78..a94ffc4456d 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,42 @@ for (i = 1; i < operand3; i++)
>    operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>  @end smallexample
>
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of active elements in a vector to be updated
> +in a loop iteration, based on the total number of elements to be updated,
> +the vectorization factor and the vector properties of the target.
> +Operand 1 is the total number of elements to be updated.
> +Operand 2 is the vectorization factor.
> +The value of operand 0 is target dependent and may differ from one
> +iteration to the next.
> +The operation of this pattern can be:
> +
> +@smallexample
> +Case 1:
> +operand0 = MIN (operand1, operand2);
> +operand2 can be a const_poly_int or poly_int related to the vector mode
> +size.  Some targets such as RISC-V have a standalone instruction that
> +computes MIN (n, MODE SIZE), which saves the use of a general-purpose
> +register.
> +
> +In this case, only the last iteration of the loop is a partial iteration.
> +@end smallexample
> +
> +@smallexample
> +Case 2:
> +if (operand1 <= operand2)
> +  operand0 = operand1;
> +else if (operand1 < 2 * operand2)
> +  operand0 = ceil (operand1 / 2);
> +else
> +  operand0 = operand2;
> +
> +This case will evenly distribute work over the last 2 iterations of a
> +stripmine loop.
> +@end smallexample
> +
> +The output of this pattern is used not only as the IV of the loop control
> +counter but also as the IV of address calculations, via multiply/shift
> +operations.  This allows the number of elements processed in each loop
> +iteration to be adjusted dynamically.
> +

I don't think we need to restrict the definition to the two RVV cases.
How about:

-----------------------------------------------------------------------
Set operand 0 to the number of scalar iterations that should be handled
by one iteration of a vector loop.  Operand 1 is the total number of
scalar iterations that the loop needs to process and operand 2 is a
maximum bound on the result (also known as the maximum ``vectorization
factor'').

The maximum value of operand 0 is given by:

@smallexample
operand0 = MIN (operand1, operand2)
@end smallexample

However, targets might choose a lower value than this, based on
target-specific criteria.  Each iteration of the vector loop might
therefore process a different number of scalar iterations, which in turn
means that induction variables will have a variable step.  Because of
this, it is generally not useful to define this instruction if it will
always calculate the maximum value.

This optab is only useful on targets that implement @samp{len_load_@var{m}}
and/or @samp{len_store_@var{m}}.
-----------------------------------------------------------------------
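[Editorial aside: to make the semantics described above concrete, here is
a minimal C model of a stripmine loop driven by select_vl.  The function
target_select_vl and the axpy loop are illustrative stand-ins, not part
of the patch; the only contract assumed is that any result in
[1, MIN (n, vf)] is valid.]

  #include <stddef.h>

  /* Hypothetical stand-in for whatever the target's select_vl
     expansion computes.  Returns a value in [1, MIN (remaining, vf)].  */
  extern size_t target_select_vl (size_t remaining, size_t vf);

  void
  axpy (float *restrict x, const float *restrict y, float a,
        size_t n, size_t vf)
  {
    while (n > 0)
      {
        size_t len = target_select_vl (n, vf); /* len <= MIN (n, vf) */
        for (size_t i = 0; i < len; i++)       /* one "vector" iteration */
          x[i] += a * y[i];
        x += len;              /* address IVs step by LEN, not by VF */
        y += len;
        n -= len;              /* decrement IV; cannot underflow */
      }
  }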
> @cindex @code{check_raw_ptrs@var{m}} instruction pattern
> @item @samp{check_raw_ptrs@var{m}}
> Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 7fe742c2ae7..6f6fa7d37f9 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -153,6 +153,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
>  DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
>
>  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
> +DEF_INTERNAL_OPTAB_FN (SELECT_VL, ECF_CONST | ECF_NOTHROW, select_vl, binary)
>  DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
>  			check_raw_ptrs, check_ptrs)
>  DEF_INTERNAL_OPTAB_FN (CHECK_WAR_PTRS, ECF_CONST | ECF_NOTHROW,
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 695f5911b30..b637471b76e 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -476,3 +476,4 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
>  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
>  OPTAB_D (len_load_optab, "len_load_$a")
>  OPTAB_D (len_store_optab, "len_store_$a")
> +OPTAB_D (select_vl_optab, "select_vl$a")
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index ff6159e08d5..81334f4f171 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -385,6 +385,353 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>    return false;
>  }
>
> +/* Adjust the loop lens for non-SLP multiple rgroups.
> +
> +   _36 = MIN_EXPR ;
> +
> +   First length (MIN (X, VF/N)):
> +     loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;
> +
> +   Second length (X - MIN (X, 1 * VF/N)):
> +     loop_len_16 = _36 - loop_len_15;
> +
> +   Third length (X - MIN (X, 2 * VF/N)):
> +     _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
> +     loop_len_17 = _36 - _38;
> +
> +   Fourth length (X - MIN (X, 3 * VF/N)):
> +     _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
> +     loop_len_18 = _36 - _39;  */
> +
> +static void
> +vect_adjust_loop_lens (tree iv_type, gimple_seq *seq, rgroup_controls *dest_rgm,
> +		       rgroup_controls *src_rgm)
> +{
> +  tree ctrl_type = dest_rgm->type;
> +  poly_uint64 nitems_per_ctrl
> +    = TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
> +
> +  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> +    {
> +      tree src = src_rgm->controls[i / dest_rgm->controls.length ()];
> +      tree dest = dest_rgm->controls[i];
> +      gassign *stmt;
> +      if (i == 0)
> +	{
> +	  /* MIN (X, VF*I/N) capped to the range [0, VF/N].  */
> +	  tree factor = build_int_cst (iv_type, nitems_per_ctrl);
> +	  stmt = gimple_build_assign (dest, MIN_EXPR, src, factor);
> +	  gimple_seq_add_stmt (seq, stmt);
> +	}
> +      else
> +	{
> +	  /* (X - MIN (X, VF*I/N)) capped to the range [0, VF/N].  */
> +	  tree factor = build_int_cst (iv_type, nitems_per_ctrl * i);
> +	  tree temp = make_ssa_name (iv_type);
> +	  stmt = gimple_build_assign (temp, MIN_EXPR, src, factor);
> +	  gimple_seq_add_stmt (seq, stmt);
> +	  stmt = gimple_build_assign (dest, MINUS_EXPR, src, temp);
> +	  gimple_seq_add_stmt (seq, stmt);
> +	}
> +    }
> +}
> +
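[Editorial aside: a small self-contained sketch of the length
distribution above, for the simplest case of two controls and with
VF/N == 4 (the minimum value of POLY_INT_CST [4, 4]), mirroring the
loop_len_15/loop_len_16 pattern in the non-SLP example quoted further
down.  The names per_ctrl, len0 and len1 are illustrative only.]

  #include <stdio.h>

  int
  main (void)
  {
    unsigned per_ctrl = 4;                 /* VF/N */
    for (unsigned x = 0; x <= 8; x++)      /* x: items this iteration */
      {
        /* First length: MIN (X, VF/N).  */
        unsigned len0 = x < per_ctrl ? x : per_ctrl;
        /* Second length: X - MIN (X, VF/N); since x <= 2*VF/N here,
           this also stays within [0, VF/N].  */
        unsigned len1 = x - len0;
        printf ("x=%u: first=%u second=%u\n", x, len0, len1);
      }
    return 0;
  }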
> +/* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
> +   for all the rgroup controls in RGC and return a control that is nonzero
> +   when the loop needs to iterate.  Add any new preheader statements to
> +   PREHEADER_SEQ.  Use LOOP_COND_GSI to insert code before the exit gcond.
> +
> +   RGC belongs to loop LOOP.  The loop originally iterated NITERS
> +   times and has been vectorized according to LOOP_VINFO.
> +
> +   Unlike vect_set_loop_controls_directly, which iterates a 0-based IV
> +   up to TEST_LIMIT - bias, in vect_set_loop_controls_by_select_vl we
> +   start at IV = TEST_LIMIT - bias and keep subtracting the length
> +   calculated by the IFN_SELECT_VL pattern from the IV.
> +
> +   1. Single rgroup, the Gimple IR should be:
> +
> +	# vectp_B.6_8 = PHI
> +	# vectp_B.8_16 = PHI
> +	# vectp_A.11_19 = PHI
> +	# vectp_A.13_22 = PHI
> +	# ivtmp_26 = PHI
> +	_28 = .SELECT_VL (ivtmp_26, POLY_INT_CST [4, 4]);
> +	ivtmp_15 = _28 * 4;
> +	vect__1.10_18 = .LEN_LOAD (vectp_B.8_16, 128B, _28, 0);
> +	_1 = B[i_10];
> +	.LEN_STORE (vectp_A.13_22, 128B, _28, vect__1.10_18, 0);
> +	i_7 = i_10 + 1;
> +	vectp_B.8_17 = vectp_B.8_16 + ivtmp_15;
> +	vectp_A.13_23 = vectp_A.13_22 + ivtmp_15;
> +	ivtmp_27 = ivtmp_26 - _28;
> +	if (ivtmp_27 != 0)
> +	  goto ; [83.33%]
> +	else
> +	  goto ; [16.67%]
> +
> +   Note: We use the outcome of .SELECT_VL to adjust both the loop control
> +   IV and the data reference pointer IV.
> +
> +   1). The result of .SELECT_VL:
> +	_28 = .SELECT_VL (ivtmp_26, POLY_INT_CST [4, 4]);
> +   _28 need not be VF in every iteration; instead, we allow _28 to be
> +   any value as long as _28 <= VF.  Such a flexible SELECT_VL pattern
> +   allows the target various optimizations across vector loop
> +   iterations.  A target like RISC-V has a special application vector
> +   length calculation instruction which will distribute the workload
> +   evenly over the last 2 iterations.
> +
> +   Another example: we can even allow generating _28 <= VF / 2 so that
> +   some machines can run vector code in a low-power mode.
> +
> +   2). Loop control IV:
> +	ivtmp_27 = ivtmp_26 - _28;
> +	if (ivtmp_27 != 0)
> +	  goto ; [83.33%]
> +	else
> +	  goto ; [16.67%]
> +
> +   This is a saturating subtraction towards zero; the outcome of
> +   .SELECT_VL ensures that ivtmp_27 never underflows zero.
> +
> +   3). Data reference pointer IV:
> +	ivtmp_15 = _28 * 4;
> +	vectp_B.8_17 = vectp_B.8_16 + ivtmp_15;
> +	vectp_A.13_23 = vectp_A.13_22 + ivtmp_15;
> +
> +   The pointer IV is adjusted accurately according to the result of
> +   .SELECT_VL.
> +
> +   2. Multiple rgroup, the Gimple IR should be:
> +
> +	# i_23 = PHI
> +	# vectp_f.8_51 = PHI
> +	# vectp_d.10_59 = PHI
> +	# ivtmp_70 = PHI
> +	# ivtmp_73 = PHI
> +	_72 = MIN_EXPR ;
> +	_75 = MIN_EXPR ;
> +	_1 = i_23 * 2;
> +	_2 = (long unsigned int) _1;
> +	_3 = _2 * 2;
> +	_4 = f_15(D) + _3;
> +	_5 = _2 + 1;
> +	_6 = _5 * 2;
> +	_7 = f_15(D) + _6;
> +	.LEN_STORE (vectp_f.8_51, 128B, _75, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
> +	vectp_f.8_56 = vectp_f.8_51 + 16;
> +	.LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
> +	_8 = (long unsigned int) i_23;
> +	_9 = _8 * 4;
> +	_10 = d_18(D) + _9;
> +	_61 = _75 / 2;
> +	.LEN_STORE (vectp_d.10_59, 128B, _61, { 3, 3, 3, 3 }, 0);
> +	vectp_d.10_63 = vectp_d.10_59 + 16;
> +	_64 = _72 / 2;
> +	.LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);
> +	i_20 = i_23 + 1;
> +	vectp_f.8_52 = vectp_f.8_56 + 16;
> +	vectp_d.10_60 = vectp_d.10_63 + 16;
> +	ivtmp_74 = ivtmp_73 - _75;
> +	ivtmp_71 = ivtmp_70 - _72;
> +	if (ivtmp_74 != 0)
> +	  goto ; [83.33%]
> +	else
> +	  goto ; [16.67%]

In the gimple examples, I think it would help to quote only the
relevant parts and use ellipsis to hide things that don't directly
matter.  E.g. in the above samples, the old scalar code isn't relevant,
whereas it's difficult to follow the example without knowing how _69
and _67 relate to each other.

It would also help to say which scalar loop is being vectorised here.
> +
> +   Note: We DO NOT use .SELECT_VL in SLP auto-vectorization for multiple
> +   rgroups.  Instead, we use MIN_EXPR to guarantee that we always use VF
> +   as the iteration amount for multiple rgroups.
> +
> +   The analysis of the flow for multiple rgroups:
> +	_72 = MIN_EXPR ;
> +	_75 = MIN_EXPR ;
> +	...
> +	.LEN_STORE (vectp_f.8_51, 128B, _75, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
> +	vectp_f.8_56 = vectp_f.8_51 + 16;
> +	.LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
> +	...
> +	_61 = _75 / 2;
> +	.LEN_STORE (vectp_d.10_59, 128B, _61, { 3, 3, 3, 3 }, 0);
> +	vectp_d.10_63 = vectp_d.10_59 + 16;
> +	_64 = _72 / 2;
> +	.LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);
> +
> +   We use _72 = MIN_EXPR ; to generate the number of elements
> +   to be processed in each iteration.
> +
> +   The related STOREs:
> +	_72 = MIN_EXPR ;
> +	.LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
> +	_64 = _72 / 2;
> +	.LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);
> +   These 2 STOREs store 2 vectors, where the second vector has half as
> +   many elements as the first.  So the length of the second STORE is
> +   _64 = _72 / 2.  This is similar to the VIEW_CONVERT handling of
> +   masks in SLP.
> +
> +   3. Multiple rgroups for non-SLP auto-vectorization.
> +
> +	# ivtmp_26 = PHI
> +	# ivtmp.35_10 = PHI
> +	# ivtmp.36_2 = PHI
> +	_28 = MIN_EXPR ;
> +	loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
> +	loop_len_16 = _28 - loop_len_15;
> +	_29 = (void *) ivtmp.35_10;
> +	_7 = &MEM [(int *)_29];
> +	vect__1.25_17 = .LEN_LOAD (_7, 128B, loop_len_15, 0);
> +	_33 = _29 + POLY_INT_CST [16, 16];
> +	_34 = &MEM [(int *)_33];
> +	vect__1.26_19 = .LEN_LOAD (_34, 128B, loop_len_16, 0);
> +	vect__2.27_20 = VEC_PACK_TRUNC_EXPR ;
> +	_30 = (void *) ivtmp.36_2;
> +	_31 = &MEM [(short int *)_30];
> +	.LEN_STORE (_31, 128B, _28, vect__2.27_20, 0);
> +	ivtmp_27 = ivtmp_26 - _28;
> +	ivtmp.35_11 = ivtmp.35_10 + POLY_INT_CST [32, 32];
> +	ivtmp.36_8 = ivtmp.36_2 + POLY_INT_CST [16, 16];
> +	if (ivtmp_27 != 0)
> +	  goto ; [83.33%]
> +	else
> +	  goto ; [16.67%]
> +
> +   The total length: _28 = MIN_EXPR ;
> +
> +   The length of the first half vector:
> +	loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
> +
> +   The length of the second half vector:
> +	loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
> +	loop_len_16 = _28 - loop_len_15;
> +
> +   1). _28 is always <= POLY_INT_CST [8, 8].
> +   2). When _28 <= POLY_INT_CST [4, 4], the second half vector is not
> +       processed.
> +   3). When _28 > POLY_INT_CST [4, 4], the second half vector is
> +       processed.  */
> +
> +static tree
> +vect_set_loop_controls_by_select_vl (class loop *loop, loop_vec_info loop_vinfo,
> +				     gimple_seq *preheader_seq,
> +				     gimple_seq *header_seq,
> +				     rgroup_controls *rgc, tree niters)
> +{
> +  tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> +  /* We do not allow a masked approach with SELECT_VL.  */
> +  gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
> +
> +  tree ctrl_type = rgc->type;
> +  unsigned int nitems_per_iter = rgc->max_nscalars_per_iter * rgc->factor;
> +  poly_uint64 nitems_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type) * rgc->factor;
> +  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +
> +  /* Calculate the maximum number of item values that the rgroup
> +     handles in total and the number that it handles for each iteration
> +     of the vector loop.  */
> +  tree nitems_total = niters;
> +  if (nitems_per_iter != 1)
> +    {
> +      /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
> +	 these multiplications don't overflow.  */
> +      tree compare_factor = build_int_cst (compare_type, nitems_per_iter);
> +      nitems_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
> +				   nitems_total, compare_factor);
> +    }
> +
> +  /* Convert the comparison value to the IV type (either a no-op or
> +     a promotion).  */
> +  nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
> +
> +  /* Create an induction variable that counts the number of items
> +     processed.  */
> +  tree index_before_incr, index_after_incr;
> +  gimple_stmt_iterator incr_gsi;
> +  bool insert_after;
> +  standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> +
> +  /* Test the decremented IV, which will never underflow 0 since we have
> +     IFN_SELECT_VL to guarantee that.  */
> +  tree test_limit = nitems_total;
> +
> +  /* Provide a definition of each control in the group.  */
> +  tree ctrl;
> +  unsigned int i;
> +  FOR_EACH_VEC_ELT_REVERSE (rgc->controls, i, ctrl)
> +    {
> +      /* Previous controls will cover BIAS items.  This control covers the
> +	 next batch.  */
> +      poly_uint64 bias = nitems_per_ctrl * i;
> +      tree bias_tree = build_int_cst (iv_type, bias);
> +
> +      /* Rather than have a new IV that starts at TEST_LIMIT and goes down to
> +	 BIAS, prefer to use the same TEST_LIMIT - BIAS based IV for each
> +	 control and adjust the bound down by BIAS.  */
> +      tree this_test_limit = test_limit;
> +      if (i != 0)
> +	{
> +	  this_test_limit = gimple_build (preheader_seq, MAX_EXPR, iv_type,
> +					  this_test_limit, bias_tree);
> +	  this_test_limit = gimple_build (preheader_seq, MINUS_EXPR, iv_type,
> +					  this_test_limit, bias_tree);
> +	}
> +
> +      /* Create the decrement IV.  */
> +      create_iv (this_test_limit, MINUS_EXPR, ctrl, NULL_TREE, loop, &incr_gsi,
> +		 insert_after, &index_before_incr, &index_after_incr);
> +
> +      poly_uint64 final_vf = vf * nitems_per_iter;
> +      tree vf_step = build_int_cst (iv_type, final_vf);
> +      tree res_len;
> +      if (LOOP_VINFO_LENS (loop_vinfo).length () == 1)
> +	{
> +	  res_len = gimple_build (header_seq, IFN_SELECT_VL, iv_type,
> +				  index_before_incr, vf_step);
> +	  LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;

The middle of this loop seems too "deep down" to be setting this.
I think it would make sense to do it after:

  /* If we still have the option of using partial vectors,
     check whether we can generate the necessary loop controls.  */
  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
      && !vect_verify_full_masking (loop_vinfo)
      && !vect_verify_loop_lens (loop_vinfo))
    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;

in vect_analyze_loop_2.

I think it'd help for review purposes to split this patch into two
(independently tested) pieces:

(1) Your cases 2 and 3, where AIUI the main change is to use
    a decrementing loop control IV that counts scalars.
This can be done for any loop that:

(a) uses length "controls"; and
(b) can iterate more than once

Initially this patch would handle case 1 in the same way.

Conceptually, I think it would make sense for this case to use:

- a signed control IV
- with a constant VF step
- and a loop-back test for > 0

in cases where we can prove that that doesn't overflow.  But I accept
that using:

- an unsigned control IV
- with a variable step
- and a loop-back test for != 0

is more general.  So it's OK to handle just that case.  The
optimisation to use signed control IVs could be left to future work.

(2) Add SELECT_VL, where AIUI the main change (relative to (1))
    is to use a variable step for other IVs too.

This is just for review purposes, and to help to separate concepts.
SELECT_VL is still an important part of the end result.

Thanks,
Richard
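[Editorial aside: a minimal C rendering of the two control-IV styles
contrasted in (1).  The variable step here is modelled by a simple MIN;
a real SELECT_VL target may return less.]

  /* Signed IV, constant VF step: the IV may overshoot and go negative
     on the final iteration, so the loop-back test is "> 0".  */
  void
  signed_iv_loop (long n, long vf)
  {
    do
      n -= vf;                               /* constant step */
    while (n > 0);
  }

  /* Unsigned IV, variable step: the step never exceeds the remaining
     count, so the IV stops exactly at zero and the test is "!= 0".  */
  void
  unsigned_iv_loop (unsigned long n, unsigned long vf)
  {
    do
      {
        unsigned long step = n < vf ? n : vf; /* stands in for SELECT_VL */
        n -= step;                            /* never underflows */
      }
    while (n != 0);
  }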
> +	}
> +      else
> +	{
> +	  /* For SLP, we can't allow a non-VF number of elements to be
> +	     processed in a non-final iteration.  We force the number of
> +	     elements to be processed in each non-final iteration to be
> +	     VF elements.  Allowing a non-VF number of elements to be
> +	     processed in non-final iterations would make SLP too
> +	     complicated and produce inferior codegen.
> +
> +	     For example:
> +
> +	     If a non-final iteration processes VF elements:
> +
> +		...
> +		.LEN_STORE (vectp_f.8_51, 128B, _71, { 1, 2, 1, 2 }, 0);
> +		.LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2 }, 0);
> +		...
> +
> +	     If a non-final iteration processes a non-VF number of elements:
> +
> +		...
> +		.LEN_STORE (vectp_f.8_51, 128B, _71, { 1, 2, 1, 2 }, 0);
> +		if (_71 % 2 == 0)
> +		  .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2 }, 0);
> +		else
> +		  .LEN_STORE (vectp_f.8_56, 128B, _72, { 2, 1, 2, 1 }, 0);
> +		...
> +
> +	     This is the simple case of 2-element interleaved vector SLP.
> +	     If we consider other interleaved vectors, the situation
> +	     becomes more complicated.  */
> +	  res_len = gimple_build (header_seq, MIN_EXPR, iv_type,
> +				  index_before_incr, vf_step);
> +	  if (rgc->max_nscalars_per_iter != 1)
> +	    LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P (loop_vinfo) = true;
> +	}
> +      gassign *assign = gimple_build_assign (ctrl, res_len);
> +      gimple_seq_add_stmt (header_seq, assign);
> +    }
> +
> +  return index_after_incr;
> +}
> +
>  /* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
>     for all the rgroup controls in RGC and return a control that is nonzero
>     when the loop needs to iterate.  Add any new preheader statements to
> @@ -704,6 +1051,10 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
>
>    bool use_masks_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
>    tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> +  bool use_vl_p = !use_masks_p
> +		  && direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
> +						     OPTIMIZE_FOR_SPEED);
>    unsigned int compare_precision = TYPE_PRECISION (compare_type);
>    tree orig_niters = niters;
>
> @@ -753,17 +1104,34 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
>  	  continue;
>  	}
>
> +      if (use_vl_p && rgc->max_nscalars_per_iter == 1
> +	  && rgc != &LOOP_VINFO_LENS (loop_vinfo)[0])
> +	{
> +	  rgroup_controls *sub_rgc
> +	    = &(*controls)[nmasks / rgc->controls.length () - 1];
> +	  if (!sub_rgc->controls.is_empty ())
> +	    {
> +	      vect_adjust_loop_lens (iv_type, &header_seq, rgc, sub_rgc);
> +	      continue;
> +	    }
> +	}
> +
>        /* See whether zero-based IV would ever generate all-false masks
>  	 or zero length before wrapping around.  */
>        bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
>
>        /* Set up all controls for this group.  */
> -      test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
> -						   &preheader_seq,
> -						   &header_seq,
> -						   loop_cond_gsi, rgc,
> -						   niters, niters_skip,
> -						   might_wrap_p);
> +      if (use_vl_p)
> +	test_ctrl
> +	  = vect_set_loop_controls_by_select_vl (loop, loop_vinfo,
> +						 &preheader_seq, &header_seq,
> +						 rgc, niters);
> +      else
> +	test_ctrl
> +	  = vect_set_loop_controls_directly (loop, loop_vinfo, &preheader_seq,
> +					     &header_seq, loop_cond_gsi, rgc,
> +					     niters, niters_skip,
> +					     might_wrap_p);
>      }
>
>    /* Emit all accumulated statements.  */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index ed0166fedab..fe6af4286bf 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -973,6 +973,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>      vectorizable (false),
>      can_use_partial_vectors_p (param_vect_partial_vector_usage != 0),
>      using_partial_vectors_p (false),
> +    using_select_vl_p (false),
> +    using_slp_adjusted_len_p (false),
>      epil_using_partial_vectors_p (false),
>      partial_load_store_bias (0),
>      peeling_for_gaps (false),
> @@ -10361,15 +10363,18 @@ vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>  }
>
>  /* Given a complete set of length LENS, extract length number INDEX for an
> -   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
> +   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.
> +   Insert any set-up statements before GSI.  */
>
>  tree
> -vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
> -		   unsigned int nvectors, unsigned int index)
> +vect_get_loop_len (loop_vec_info loop_vinfo, gimple_stmt_iterator *gsi,
> +		   vec_loop_lens *lens, unsigned int nvectors, tree vectype,
> +		   unsigned int index)
>  {
>    rgroup_controls *rgl = &(*lens)[nvectors - 1];
>    bool use_bias_adjusted_len =
>      LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) != 0;
> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>
>    /* Populate the rgroup's len array, if this is the first time we've
>       used it.  */
> @@ -10400,6 +10405,26 @@ vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>
>    if (use_bias_adjusted_len)
>      return rgl->bias_adjusted_ctrl;
> +  else if (LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P (loop_vinfo))
> +    {
> +      tree loop_len = rgl->controls[index];
> +      poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type);
> +      poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype);
> +      if (maybe_ne (nunits1, nunits2))
> +	{
> +	  /* A loop len for data type X can be reused for data type Y
> +	     if X has N times more elements than Y and if Y's elements
> +	     are N times bigger than X's.  */
> +	  gcc_assert (multiple_p (nunits1, nunits2));
> +	  unsigned int factor = exact_div (nunits1, nunits2).to_constant ();
> +	  gimple_seq seq = NULL;
> +	  loop_len = gimple_build (&seq, RDIV_EXPR, iv_type, loop_len,
> +				   build_int_cst (iv_type, factor));
> +	  if (seq)
> +	    gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
> +	}
> +      return loop_len;
> +    }
>    else
>      return rgl->controls[index];
>  }
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 7313191b0db..15b22132bd6 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -3147,6 +3147,61 @@ vect_get_data_ptr_increment (vec_info *vinfo,
>    return iv_step;
>  }
>
> +/* Prepare the pointer IVs which need to be updated by a variable amount.
> +   The variable amount is the outcome of .SELECT_VL.
> +   In this case, we allow each iteration to process a flexible number
> +   of elements, as long as the number is <= VF elements.
> +
> +   Return the data reference pointer according to SELECT_VL.
> +   If new statements are needed, insert them before GSI.  */
> +
> +static tree
> +get_select_vl_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
> +			    tree aggr_type, class loop *at_loop, tree offset,
> +			    tree *dummy, gimple_stmt_iterator *gsi,
> +			    bool simd_lane_access_p, vec_loop_lens *loop_lens,
> +			    dr_vec_info *dr_info,
> +			    vect_memory_access_type memory_access_type)
> +{
> +  loop_vec_info loop_vinfo = dyn_cast (vinfo);
> +  tree step = vect_dr_behavior (vinfo, dr_info)->step;
> +
> +  /* TODO: We don't yet support gather/scatter or load_lanes/store_lanes
> +     when the pointer IVs are updated by a variable amount, but we will
> +     support them in the future.  */
> +  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
> +	      && memory_access_type != VMAT_LOAD_STORE_LANES);
> +
> +  /* When we use the SELECT_VL pattern, we dynamically adjust
> +     the memory address by the .SELECT_VL result.
> +
> +     The result of .SELECT_VL is the number of elements to be
> +     processed in each iteration.  So the memory address adjustment
> +     operation should be:
> +
> +     bytesize = GET_MODE_SIZE (element_mode (aggr_type));
> +     addr = addr + .SELECT_VL (ARG..) * bytesize;  */
> +  gimple *ptr_incr;
> +  tree loop_len
> +    = vect_get_loop_len (loop_vinfo, gsi, loop_lens, 1, aggr_type, 0);
> +  tree len_type = TREE_TYPE (loop_len);
> +  poly_uint64 bytesize = GET_MODE_SIZE (element_mode (aggr_type));
> +  /* Since the outcome of .SELECT_VL is a number of elements, we scale
> +     it to a byte size so that it can be used to adjust the address
> +     pointer IVs by a variable amount.  */
> +  tree tmp = fold_build2 (MULT_EXPR, len_type, loop_len,
> +			  build_int_cst (len_type, bytesize));
> +  if (tree_int_cst_sgn (step) == -1)
> +    tmp = fold_build1 (NEGATE_EXPR, len_type, tmp);
> +  tree bump = make_temp_ssa_name (len_type, NULL, "ivtmp");
> +  gassign *assign = gimple_build_assign (bump, tmp);
> +  gsi_insert_before (gsi, assign, GSI_SAME_STMT);
> +  return vect_create_data_ref_ptr (vinfo, stmt_info, aggr_type, at_loop, offset,
> +				   dummy, gsi, &ptr_incr, simd_lane_access_p,
> +				   bump);
> +}
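[Editorial aside: the address-bump computation above, restated as a tiny
C helper.  bump_dr_ptr, elem_bytes and step_sign are illustrative names
only; the GIMPLE built by the function multiplies the .SELECT_VL result
by the element byte size and negates it for negative-step data
references.]

  #include <stddef.h>

  /* Advance a data-reference pointer by the number of elements that
     .SELECT_VL reported for this iteration.  */
  static inline void *
  bump_dr_ptr (void *addr, size_t select_vl_result, size_t elem_bytes,
               int step_sign)
  {
    ptrdiff_t bump = (ptrdiff_t) (select_vl_result * elem_bytes);
    if (step_sign < 0)
      bump = -bump;             /* mirrors the NEGATE_EXPR above */
    return (char *) addr + bump;
  }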
> +
>  /* Check and perform vectorization of BUILT_IN_BSWAP{16,32,64,128}.  */
>
>  static bool
> @@ -8547,6 +8602,14 @@ vectorizable_store (vec_info *vinfo,
>  	    vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
>  					 slp_node, &gs_info, &dataref_ptr,
>  					 &vec_offsets);
> +	  else if (loop_vinfo && LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo)
> +		   && memory_access_type != VMAT_INVARIANT)
> +	    dataref_ptr
> +	      = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
> +					    simd_lane_access_p ? loop : NULL,
> +					    offset, &dummy, gsi,
> +					    simd_lane_access_p, loop_lens,
> +					    dr_info, memory_access_type);
>  	  else
>  	    dataref_ptr
>  	      = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
> @@ -8795,8 +8858,9 @@ vectorizable_store (vec_info *vinfo,
>  	      else if (loop_lens)
>  		{
>  		  tree final_len
> -		    = vect_get_loop_len (loop_vinfo, loop_lens,
> -					 vec_num * ncopies, vec_num * j + i);
> +		    = vect_get_loop_len (loop_vinfo, gsi, loop_lens,
> +					 vec_num * ncopies, vectype,
> +					 vec_num * j + i);
>  		  tree ptr = build_int_cst (ref_type, align * BITS_PER_UNIT);
>  		  machine_mode vmode = TYPE_MODE (vectype);
>  		  opt_machine_mode new_ovmode
> @@ -9935,6 +9999,13 @@ vectorizable_load (vec_info *vinfo,
>  					 slp_node, &gs_info, &dataref_ptr,
>  					 &vec_offsets);
>  	    }
> +	  else if (loop_vinfo && LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo)
> +		   && memory_access_type != VMAT_INVARIANT)
> +	    dataref_ptr
> +	      = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
> +					    at_loop, offset, &dummy, gsi,
> +					    simd_lane_access_p, loop_lens,
> +					    dr_info, memory_access_type);
>  	  else
>  	    dataref_ptr
>  	      = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
> @@ -10151,8 +10222,8 @@ vectorizable_load (vec_info *vinfo,
>  	      else if (loop_lens && memory_access_type != VMAT_INVARIANT)
>  		{
>  		  tree final_len
> -		    = vect_get_loop_len (loop_vinfo, loop_lens,
> -					 vec_num * ncopies,
> +		    = vect_get_loop_len (loop_vinfo, gsi, loop_lens,
> +					 vec_num * ncopies, vectype,
>  					 vec_num * j + i);
>  		  tree ptr = build_int_cst (ref_type,
>  					    align * BITS_PER_UNIT);
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 9cf2fb23fe3..3d21e23513d 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -818,6 +818,13 @@ public:
>       the vector loop can handle fewer than VF scalars.  */
>    bool using_partial_vectors_p;
>
> +  /* True if we've decided to use SELECT_VL to get the number of active
> +     elements to be updated in a vector loop.  */
> +  bool using_select_vl_p;
> +
> +  /* True if we use an adjusted loop length for SLP.  */
> +  bool using_slp_adjusted_len_p;
> +
>    /* True if we've decided to use partially-populated vectors for the
>       epilogue of loop.  */
>    bool epil_using_partial_vectors_p;
> @@ -890,6 +897,8 @@ public:
>  #define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
>  #define LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P(L) (L)->can_use_partial_vectors_p
>  #define LOOP_VINFO_USING_PARTIAL_VECTORS_P(L) (L)->using_partial_vectors_p
> +#define LOOP_VINFO_USING_SELECT_VL_P(L) (L)->using_select_vl_p
> +#define LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P(L) (L)->using_slp_adjusted_len_p
>  #define LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P(L) \
>    (L)->epil_using_partial_vectors_p
>  #define LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS(L) (L)->partial_load_store_bias
> @@ -2293,7 +2302,8 @@ extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
>  				unsigned int, tree, unsigned int);
>  extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
>  				  tree, unsigned int);
> -extern tree vect_get_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
> +extern tree vect_get_loop_len (loop_vec_info, gimple_stmt_iterator *,
> +			       vec_loop_lens *, unsigned int, tree,
>  			       unsigned int);
>  extern gimple_seq vect_gen_len (tree, tree, tree, tree);
>  extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);