From: Richard Sandiford <richard.sandiford@arm.com>
To: "juzhe.zhong" <juzhe.zhong@rivai.ai>
Mail-Followup-To: "juzhe.zhong" <juzhe.zhong@rivai.ai>,"gcc-patches\@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,  "kito.cheng\@gmail.com" <kito.cheng@gmail.com>,  "palmer\@dabbelt.com" <palmer@dabbelt.com>,  "richard.guenther\@gmail.com" <richard.guenther@gmail.com>, richard.sandiford@arm.com
Cc: "gcc-patches\@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,  "kito.cheng\@gmail.com" <kito.cheng@gmail.com>,  "palmer\@dabbelt.com" <palmer@dabbelt.com>,  "richard.guenther\@gmail.com" <richard.guenther@gmail.com>
Subject: Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer
References: <20230511231126.1594132-1-juzhe.zhong@rivai.ai>
	<5C0696881FC420F8+A4387224-068E-4647-B237-BC14AE06A32D@rivai.ai>
	<A80AA82F40C90097+853F3055-CCC0-44A5-B9C8-A931F855E0FB@rivai.ai>
Date: Fri, 12 May 2023 14:25:59 +0100
In-Reply-To: <A80AA82F40C90097+853F3055-CCC0-44A5-B9C8-A931F855E0FB@rivai.ai>
	(juzhe zhong's message of "Fri, 12 May 2023 20:41:35 +0800")
Message-ID: <mpt5y8xadg8.fsf@arm.com>
List-Id: <gcc-patches.gcc.gnu.org>

"juzhe.zhong" <juzhe.zhong@rivai.ai> writes:
> Hi, Richard.  For "can iterate more than once", is it correct to use the
> condition "LOOP_LENS ().length > 1"?

No, that says whether any LEN_LOADs or LEN_STOREs operate on multiple
vectors, rather than just single vectors.
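
As a rough illustration of why the length test answers a different
question, here is a toy model (not GCC code).  It assumes, based on the
"&(*lens)[nvectors - 1]" lookup in vect_get_loop_len, that the set of
length rgroups is indexed by nvectors - 1 and grown on demand when an
rgroup is recorded.  The names toy_rgroup, toy_loop_lens,
toy_record_loop_len and some_rgroup_uses_multiple_vectors are invented
for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for rgroup_controls: only tracks how many length
// controls the rgroup owns (0 means "rgroup unused").
struct toy_rgroup
{
  std::size_t ncontrols = 0;
};

// Toy stand-in for vec_loop_lens.  Entry [nvectors - 1] describes the
// rgroup whose accesses operate on NVECTORS vectors.
using toy_loop_lens = std::vector<toy_rgroup>;

// Record that some load or store needs an rgroup of NVECTORS vectors.
// Assumption: the container grows so that index nvectors - 1 exists.
void
toy_record_loop_len (toy_loop_lens &lens, std::size_t nvectors)
{
  if (lens.size () < nvectors)
    lens.resize (nvectors);
  lens[nvectors - 1].ncontrols = nvectors;
}

// What a "length () > 1" test would tell us in this model: whether any
// access spans more than one vector.  It says nothing about how many
// times the vector loop body runs.
bool
some_rgroup_uses_multiple_vectors (const toy_loop_lens &lens)
{
  return lens.size () > 1;
}
```

In this model a loop that only ever records single-vector accesses
keeps a length of 1 however many times it iterates.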

I meant: whether the vector loop body might be executed more than once
(i.e. whether the branch-back condition can be true).

This is true for a scalar loop that goes from 0 to some unbounded
variable n.  It's false for a scalar loop that goes from 0 to 6,
if the vectors are known to have at least 8 elements.
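
A minimal sketch of that test, under stated assumptions: the helper
name vector_loop_may_iterate_again and its parameters are hypothetical,
not existing vectorizer API.  MAX_NITERS is an upper bound on the number
of scalar iterations (empty if no bound is known) and MIN_VF is the
smallest number of elements that one vector iteration is known to
handle.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Return true if the branch-back condition of the vector loop might be
// true, i.e. if one vector iteration is not guaranteed to cover every
// scalar iteration.
//
// Hypothetical helper for illustration only.
inline bool
vector_loop_may_iterate_again (std::optional<std::uint64_t> max_niters,
			       std::uint64_t min_vf)
{
  // No known bound (e.g. a loop from 0 to some variable n): assume the
  // loop can go round again.
  if (!max_niters.has_value ())
    return true;

  // With a known bound, a second vector iteration is only possible if
  // the scalar iterations can exceed what one vector iteration handles.
  return *max_niters > min_vf;
}
```

With these definitions, an unbounded loop gives true, while a loop of 6
scalar iterations with at least 8 elements per vector gives false.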

Thanks,
Richard

> ---- Replied Message ----
>
> From      Richard Sandiford<richard.sandiford@arm.com>
>
> Date      05/12/2023 19:39
>
> To        juzhe.zhong<juzhe.zhong@rivai.ai>
>
> Cc        gcc-patches@gcc.gnu.org<gcc-patches@gcc.gnu.org>,
>           kito.cheng@gmail.com<kito.cheng@gmail.com>,
>           palmer@dabbelt.com<palmer@dabbelt.com>,
>           richard.guenther@gmail.com<richard.guenther@gmail.com>
>
> Subject   Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer
>
> "juzhe.zhong" <juzhe.zhong@rivai.ai> writes:
>> Thanks Richard.
>> I will do that as you suggested.  I have a question about the first patch:
>> how should the decrement IV be enabled?  Should I add a target hook or
>> something to let the target decide whether to enable the decrement IV?
>
> At the moment, the only other targets that use IFN_LEN_LOAD and
> IFN_LEN_STORE are PowerPC and s390.  Both targets default to
> --param vect-partial-vector-usage=1 (i.e. use partial vectors
> for epilogues only).
>
> So I think the condition should be that the loop:
>
>  (a) uses length "controls"; and
>  (b) can iterate more than once
>
> No target checks should be needed.
>
> Thanks,
> Richard
>
>> ---- Replied Message ----
>>
>> From      Richard Sandiford<richard.sandiford@arm.com>
>>
>> Date      05/12/2023 19:08
>>
>> To        juzhe.zhong@rivai.ai<juzhe.zhong@rivai.ai>
>>
>> Cc        gcc-patches@gcc.gnu.org<gcc-patches@gcc.gnu.org>,
>>           kito.cheng@gmail.com<kito.cheng@gmail.com>,
>>           palmer@dabbelt.com<palmer@dabbelt.com>,
>>           richard.guenther@gmail.com<richard.guenther@gmail.com>
>>
>> Subject   Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer
>>
>> juzhe.zhong@rivai.ai writes:
>>> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
>>>
>>> 1. Fix document description according to Jeff && Richard.
>>> 2. Add LOOP_VINFO_USING_SELECT_VL_P for single rgroup.
>>> 3. Add LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P for SLP multiple rgroup.
>>>
>>> Fix bugs for V5 after testing:
>>> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618209.html
>>>
>>> gcc/ChangeLog:
>>>
>>>         * doc/md.texi: Add select_vl pattern.
>>>         * internal-fn.def (SELECT_VL): New ifn.
>>>         * optabs.def (OPTAB_D): New optab.
>>>         * tree-vect-loop-manip.cc (vect_adjust_loop_lens): New function.
>>>         (vect_set_loop_controls_by_select_vl): Ditto.
>>>         (vect_set_loop_condition_partial_vectors): Add loop control for
>>>         decrement IV.
>>>         * tree-vect-loop.cc (vect_get_loop_len): Adjust loop len for SLP.
>>>         * tree-vect-stmts.cc (get_select_vl_data_ref_ptr): New function.
>>>         (vectorizable_store): Support data reference IV added by outcome of
>>>         SELECT_VL.
>>>         (vectorizable_load): Ditto.
>>>         * tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): New macro.
>>>         (LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P): Ditto.
>>>         (vect_get_loop_len): Adjust loop len for SLP.
>>>
>>> ---
>>>  gcc/doc/md.texi             |  36 ++++
>>>  gcc/internal-fn.def         |   1 +
>>>  gcc/optabs.def              |   1 +
>>>  gcc/tree-vect-loop-manip.cc | 380 +++++++++++++++++++++++++++++++++++-
>>>  gcc/tree-vect-loop.cc       |  31 ++-
>>>  gcc/tree-vect-stmts.cc      |  79 +++++++-
>>>  gcc/tree-vectorizer.h       |  12 +-
>>>  7 files changed, 526 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>>> index 8ebce31ba78..a94ffc4456d 100644
>>> --- a/gcc/doc/md.texi
>>> +++ b/gcc/doc/md.texi
>>> @@ -4974,6 +4974,42 @@ for (i = 1; i < operand3; i++)
>>>    operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>>>  @end smallexample
>>>  
>>> +@cindex @code{select_vl@var{m}} instruction pattern
>>> +@item @code{select_vl@var{m}}
>>> +Set operand 0 to the number of active elements in a vector to be updated
>>> +in a loop iteration based on the total number of elements to be updated,
>>> +the vectorization factor and vector properties of the target.
>>> +operand 1 is the total elements in the vector to be updated.
>>> +operand 2 is the vectorization factor.
>>> +The value of operand 0 is target dependent and flexible in each iteration.
>>> +The operation of this pattern can be:
>>> +
>>> +@smallexample
>>> +Case 1:
>>> +operand0 = MIN (operand1, operand2);
>>> +operand2 can be const_poly_int or poly_int related to vector mode size.
>>> +Some target like RISC-V has a standalone instruction to get MIN (n, MODE SIZE) so
>>> +that we can reduce a use of general purpose register.
>>> +
>>> +In this case, only the last iteration of the loop is partial iteration.
>>> +@end smallexample
>>> +
>>> +@smallexample
>>> +Case 2:
>>> +if (operand1 <= operand2)
>>> +  operand0 = operand1;
>>> +else if (operand1 < 2 * operand2)
>>> +  operand0 = ceil (operand1 / 2);
>>> +else
>>> +  operand0 = operand2;
>>> +
>>> +This case will evenly distribute work over the last 2 iterations of a stripmine loop.
>>> +@end smallexample
>>> +
>>> +The output of this pattern is not only used as IV of loop control counter, but also
>>> +is used as the IV of address calculation with multiply/shift operation. This allows
>>> +dynamic adjustment of the number of elements processed each loop iteration.
>>> +
>>
>> I don't think we need to restrict the definition to the two RVV cases.
>> How about:
>>
>> -----------------------------------------------------------------------
>> Set operand 0 to the number of scalar iterations that should be handled
>> by one iteration of a vector loop.  Operand 1 is the total number of
>> scalar iterations that the loop needs to process and operand 2 is a
>> maximum bound on the result (also known as the maximum ``vectorization
>> factor'').
>>
>> The maximum value of operand 0 is given by:
>> @smallexample
>> operand0 = MIN (operand1, operand2)
>> @end smallexample
>> However, targets might choose a lower value than this, based on
>> target-specific criteria.  Each iteration of the vector loop might
>> therefore process a different number of scalar iterations, which in turn
>> means that induction variables will have a variable step.  Because of
>> this, it is generally not useful to define this instruction if it will
>> always calculate the maximum value.
>>
>> This optab is only useful on targets that implement @samp{len_load_@var{m}}
>> and/or @samp{len_store_@var{m}}.
>> -----------------------------------------------------------------------
>>
>>>  @cindex @code{check_raw_ptrs@var{m}} instruction pattern
>>>  @item @samp{check_raw_ptrs@var{m}}
>>>  Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
>>> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
>>> index 7fe742c2ae7..6f6fa7d37f9 100644
>>> --- a/gcc/internal-fn.def
>>> +++ b/gcc/internal-fn.def
>>> @@ -153,6 +153,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
>>>  DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
>>>  
>>>  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
>>> +DEF_INTERNAL_OPTAB_FN (SELECT_VL, ECF_CONST | ECF_NOTHROW, select_vl, binary)
>>>  DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
>>>                 check_raw_ptrs, check_ptrs)
>>>  DEF_INTERNAL_OPTAB_FN (CHECK_WAR_PTRS, ECF_CONST | ECF_NOTHROW,
>>> diff --git a/gcc/optabs.def b/gcc/optabs.def
>>> index 695f5911b30..b637471b76e 100644
>>> --- a/gcc/optabs.def
>>> +++ b/gcc/optabs.def
>>> @@ -476,3 +476,4 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
>>>  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
>>>  OPTAB_D (len_load_optab, "len_load_$a")
>>>  OPTAB_D (len_store_optab, "len_store_$a")
>>> +OPTAB_D (select_vl_optab, "select_vl$a")
>>> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
>>> index ff6159e08d5..81334f4f171 100644
>>> --- a/gcc/tree-vect-loop-manip.cc
>>> +++ b/gcc/tree-vect-loop-manip.cc
>>> @@ -385,6 +385,353 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>>>    return false;
>>>  }
>>>  
>>> +/* Try to use adjust loop lens for non-SLP multiple-rgroups.
>>> +
>>> +     _36 = MIN_EXPR <ivtmp_34, POLY_INT_CST [8, 8]>;
>>> +
>>> +     First length (MIN (X, VF/N)):
>>> +       loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;
>>> +
>>> +     Second length (X - MIN (X, 1 * VF/N)):
>>> +       loop_len_16 = _36 - loop_len_15;
>>> +
>>> +     Third length (X - MIN (X, 2 * VF/N)):
>>> +       _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
>>> +       loop_len_17 = _36 - _38;
>>> +
>>> +     Fourth length (X - MIN (X, 3 * VF/N)):
>>> +       _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
>>> +       loop_len_18 = _36 - _39;  */
>>> +
>>> +static void
>>> +vect_adjust_loop_lens (tree iv_type, gimple_seq *seq, rgroup_controls *dest_rgm,
>>> +               rgroup_controls *src_rgm)
>>> +{
>>> +  tree ctrl_type = dest_rgm->type;
>>> +  poly_uint64 nitems_per_ctrl
>>> +    = TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
>>> +
>>> +  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
>>> +    {
>>> +      tree src = src_rgm->controls[i / dest_rgm->controls.length ()];
>>> +      tree dest = dest_rgm->controls[i];
>>> +      gassign *stmt;
>>> +      if (i == 0)
>>> +    {
>>> +      /* MIN (X, VF*I/N) capped to the range [0, VF/N].  */
>>> +      tree factor = build_int_cst (iv_type, nitems_per_ctrl);
>>> +      stmt = gimple_build_assign (dest, MIN_EXPR, src, factor);
>>> +      gimple_seq_add_stmt (seq, stmt);
>>> +    }
>>> +      else
>>> +    {
>>> +      /* (X - MIN (X, VF*I/N)) capped to the range [0, VF/N].  */
>>> +      tree factor = build_int_cst (iv_type, nitems_per_ctrl * i);
>>> +      tree temp = make_ssa_name (iv_type);
>>> +      stmt = gimple_build_assign (temp, MIN_EXPR, src, factor);
>>> +      gimple_seq_add_stmt (seq, stmt);
>>> +      stmt = gimple_build_assign (dest, MINUS_EXPR, src, temp);
>>> +      gimple_seq_add_stmt (seq, stmt);
>>> +    }
>>> +    }
>>> +}
>>> +
>>> +/* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
>>> +   for all the rgroup controls in RGC and return a control that is nonzero
>>> +   when the loop needs to iterate.  Add any new preheader statements to
>>> +   PREHEADER_SEQ.  Use LOOP_COND_GSI to insert code before the exit gcond.
>>> +
>>> +   RGC belongs to loop LOOP.  The loop originally iterated NITERS
>>> +   times and has been vectorized according to LOOP_VINFO.
>>> +
>>> +   Unlike vect_set_loop_controls_directly which is iterating from 0-based IV
>>> +   to TEST_LIMIT - bias.
>>> +
>>> +   In vect_set_loop_controls_by_select_vl, we are iterating from start at
>>> +   IV = TEST_LIMIT - bias and keep subtract IV by the length calculated by
>>> +   IFN_SELECT_VL pattern.
>>> +
>>> +   1. Single rgroup, the Gimple IR should be:
>>> +
>>> +    # vectp_B.6_8 = PHI <vectp_B.6_13(6), &B(5)>
>>> +    # vectp_B.8_16 = PHI <vectp_B.8_17(6), &B(5)>
>>> +    # vectp_A.11_19 = PHI <vectp_A.11_20(6), &A(5)>
>>> +    # vectp_A.13_22 = PHI <vectp_A.13_23(6), &A(5)>
>>> +    # ivtmp_26 = PHI <ivtmp_27(6), _25(5)>
>>> +    _28 = .SELECT_VL (ivtmp_26, POLY_INT_CST [4, 4]);
>>> +    ivtmp_15 = _28 * 4;
>>> +    vect__1.10_18 = .LEN_LOAD (vectp_B.8_16, 128B, _28, 0);
>>> +    _1 = B[i_10];
>>> +    .LEN_STORE (vectp_A.13_22, 128B, _28, vect__1.10_18, 0);
>>> +    i_7 = i_10 + 1;
>>> +    vectp_B.8_17 = vectp_B.8_16 + ivtmp_15;
>>> +    vectp_A.13_23 = vectp_A.13_22 + ivtmp_15;
>>> +    ivtmp_27 = ivtmp_26 - _28;
>>> +    if (ivtmp_27 != 0)
>>> +      goto <bb 6>; [83.33%]
>>> +    else
>>> +      goto <bb 7>; [16.67%]
>>> +
>>> +   Note: We use the outcome of .SELECT_VL to adjust both loop control IV and
>>> +   data reference pointer IV.
>>> +
>>> +   1). The result of .SELECT_VL:
>>> +       _28 = .SELECT_VL (ivtmp_26, POLY_INT_CST [4, 4]);
>>> +       The _28 is not necessary to be VF in any iteration, instead, we allow
>>> +       _28 to be any value as long as _28 <= VF. Such flexible SELECT_VL
>>> +       pattern allows target have various flexible optimizations in vector
>>> +       loop iterations. Target like RISC-V has special application vector
>>> +       length calculation instruction which will distribute even workload
>>> +       in the last 2 iterations.
>>> +
>>> +       Other example is that we can allow even generate _28 <= VF / 2 so
>>> +       that some machine can run vector codes in low power mode.
>>> +
>>> +   2). Loop control IV:
>>> +       ivtmp_27 = ivtmp_26 - _28;
>>> +       if (ivtmp_27 != 0)
>>> +     goto <bb 6>; [83.33%]
>>> +       else
>>> +     goto <bb 7>; [16.67%]
>>> +
>>> +       This is the saturating-subtraction towards zero, the outcome of
>>> +       .SELECT_VL will make ivtmp_27 never underflow zero.
>>> +
>>> +   3). Data reference pointer IV:
>>> +       ivtmp_15 = _28 * 4;
>>> +       vectp_B.8_17 = vectp_B.8_16 + ivtmp_15;
>>> +       vectp_A.13_23 = vectp_A.13_22 + ivtmp_15;
>>> +
>>> +       The pointer IV is adjusted accurately according to the .SELECT_VL.
>>> +
>>> +   2. Multiple rgroup, the Gimple IR should be:
>>> +
>>> +    # i_23 = PHI <i_20(6), 0(11)>
>>> +    # vectp_f.8_51 = PHI <vectp_f.8_52(6), f_15(D)(11)>
>>> +    # vectp_d.10_59 = PHI <vectp_d.10_60(6), d_18(D)(11)>
>>> +    # ivtmp_70 = PHI <ivtmp_71(6), _69(11)>
>>> +    # ivtmp_73 = PHI <ivtmp_74(6), _67(11)>
>>> +    _72 = MIN_EXPR <ivtmp_70, 16>;
>>> +    _75 = MIN_EXPR <ivtmp_73, 16>;
>>> +    _1 = i_23 * 2;
>>> +    _2 = (long unsigned int) _1;
>>> +    _3 = _2 * 2;
>>> +    _4 = f_15(D) + _3;
>>> +    _5 = _2 + 1;
>>> +    _6 = _5 * 2;
>>> +    _7 = f_15(D) + _6;
>>> +    .LEN_STORE (vectp_f.8_51, 128B, _75, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>>> +    vectp_f.8_56 = vectp_f.8_51 + 16;
>>> +    .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>>> +    _8 = (long unsigned int) i_23;
>>> +    _9 = _8 * 4;
>>> +    _10 = d_18(D) + _9;
>>> +    _61 = _75 / 2;
>>> +    .LEN_STORE (vectp_d.10_59, 128B, _61, { 3, 3, 3, 3 }, 0);
>>> +    vectp_d.10_63 = vectp_d.10_59 + 16;
>>> +    _64 = _72 / 2;
>>> +    .LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);
>>> +    i_20 = i_23 + 1;
>>> +    vectp_f.8_52 = vectp_f.8_56 + 16;
>>> +    vectp_d.10_60 = vectp_d.10_63 + 16;
>>> +    ivtmp_74 = ivtmp_73 - _75;
>>> +    ivtmp_71 = ivtmp_70 - _72;
>>> +    if (ivtmp_74 != 0)
>>> +      goto <bb 6>; [83.33%]
>>> +    else
>>> +      goto <bb 13>; [16.67%]
>>
>> In the gimple examples, I think it would help to quote only the relevant
>> parts and use ellipsis to hide things that don't directly matter.
>> E.g. in the above samples, the old scalar code isn't relevant, whereas
>> it's difficult to follow the example without knowing how _69 and _67
>> relate to each other.  It would also help to say which scalar loop
>> is being vectorised here.
>>
>>> +
>>> +   Note: We DO NOT use .SELECT_VL in SLP auto-vectorization for multiple
>>> +   rgroups. Instead, we use MIN_EXPR to guarantee we always use VF as the
>>> +   iteration amount for multiple rgroups.
>>> +
>>> +   The analysis of the flow of multiple rgroups:
>>> +    _72 = MIN_EXPR <ivtmp_70, 16>;
>>> +    _75 = MIN_EXPR <ivtmp_73, 16>;
>>> +    ...
>>> +    .LEN_STORE (vectp_f.8_51, 128B, _75, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>>> +    vectp_f.8_56 = vectp_f.8_51 + 16;
>>> +    .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>>> +    ...
>>> +    _61 = _75 / 2;
>>> +    .LEN_STORE (vectp_d.10_59, 128B, _61, { 3, 3, 3, 3 }, 0);
>>> +    vectp_d.10_63 = vectp_d.10_59 + 16;
>>> +    _64 = _72 / 2;
>>> +    .LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);
>>> +
>>> +  We use _72 = MIN_EXPR <ivtmp_70, 16>; to generate the number of the elements
>>> +  to be processed in each iteration.
>>> +
>>> +  The related STOREs:
>>> +    _72 = MIN_EXPR <ivtmp_70, 16>;
>>> +    .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>>> +    _64 = _72 / 2;
>>> +    .LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);
>>> +  Since these 2 STOREs store 2 vectors that the second vector is half elements
>>> +  of the first vector. So the length of second STORE will be _64 = _72 / 2;
>>> +  It's similar to the VIEW_CONVERT of handling masks in SLP.
>>
>>
>>
>>> +
>>> +  3. Multiple rgroups for non-SLP auto-vectorization.
>>> +
>>> +     # ivtmp_26 = PHI <ivtmp_27(4), _25(3)>
>>> +     # ivtmp.35_10 = PHI <ivtmp.35_11(4), ivtmp.35_1(3)>
>>> +     # ivtmp.36_2 = PHI <ivtmp.36_8(4), ivtmp.36_23(3)>
>>> +     _28 = MIN_EXPR <ivtmp_26, POLY_INT_CST [8, 8]>;
>>> +     loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
>>> +     loop_len_16   = _28 - loop_len_15;
>>> +     _29 = (void *) ivtmp.35_10;
>>> +     _7 = &MEM <vector([4,4]) int> [(int *)_29];
>>> +     vect__1.25_17 = .LEN_LOAD (_7, 128B, loop_len_15, 0);
>>> +     _33 = _29 + POLY_INT_CST [16, 16];
>>> +     _34 = &MEM <vector([4,4]) int> [(int *)_33];
>>> +     vect__1.26_19 = .LEN_LOAD (_34, 128B, loop_len_16, 0);
>>> +     vect__2.27_20 = VEC_PACK_TRUNC_EXPR <vect__1.25_17, vect__1.26_19>;
>>> +     _30 = (void *) ivtmp.36_2;
>>> +     _31 = &MEM <vector([8,8]) short int> [(short int *)_30];
>>> +     .LEN_STORE (_31, 128B, _28, vect__2.27_20, 0);
>>> +     ivtmp_27 = ivtmp_26 - _28;
>>> +     ivtmp.35_11 = ivtmp.35_10 + POLY_INT_CST [32, 32];
>>> +     ivtmp.36_8 = ivtmp.36_2 + POLY_INT_CST [16, 16];
>>> +     if (ivtmp_27 != 0)
>>> +       goto <bb 4>; [83.33%]
>>> +     else
>>> +       goto <bb 5>; [16.67%]
>>> +
>>> +     The total length: _28 = MIN_EXPR <ivtmp_26, POLY_INT_CST [8, 8]>;
>>> +
>>> +     The length of first half vector:
>>> +       loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
>>> +
>>> +     The length of second half vector:
>>> +       loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
>>> +       loop_len_16 = _28 - loop_len_15;
>>> +
>>> +     1). _28 always <= POLY_INT_CST [8, 8].
>>> +     2). When _28 <= POLY_INT_CST [4, 4], second half vector is not processed.
>>> +     3). When _28 > POLY_INT_CST [4, 4], second half vector is processed.
>>> +*/
>>> +
>>> +static tree
>>> +vect_set_loop_controls_by_select_vl (class loop *loop, loop_vec_info loop_vinfo,
>>> +                     gimple_seq *preheader_seq,
>>> +                     gimple_seq *header_seq,
>>> +                     rgroup_controls *rgc, tree niters)
>>> +{
>>> +  tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
>>> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>>> +  /* We are not allowing masked approach in SELECT_VL.  */
>>> +  gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
>>> +
>>> +  tree ctrl_type = rgc->type;
>>> +  unsigned int nitems_per_iter = rgc->max_nscalars_per_iter * rgc->factor;
>>> +  poly_uint64 nitems_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type) * rgc->factor;
>>> +  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>>> +
>>> +  /* Calculate the maximum number of item values that the rgroup
>>> +     handles in total, the number that it handles for each iteration
>>> +     of the vector loop.  */
>>> +  tree nitems_total = niters;
>>> +  if (nitems_per_iter != 1)
>>> +    {
>>> +      /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>>> +     these multiplications don't overflow.  */
>>> +      tree compare_factor = build_int_cst (compare_type, nitems_per_iter);
>>> +      nitems_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>>> +                   nitems_total, compare_factor);
>>> +    }
>>> +
>>> +  /* Convert the comparison value to the IV type (either a no-op or
>>> +     a promotion).  */
>>> +  nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
>>> +
>>> +  /* Create an induction variable that counts the number of items
>>> +     processed.  */
>>> +  tree index_before_incr, index_after_incr;
>>> +  gimple_stmt_iterator incr_gsi;
>>> +  bool insert_after;
>>> +  standard_iv_increment_position (loop, &incr_gsi, &insert_after);
>>> +
>>> +  /* Test the decremented IV, which will never underflow 0 since we have
>>> +     IFN_SELECT_VL to guarantee that.  */
>>> +  tree test_limit = nitems_total;
>>> +
>>> +  /* Provide a definition of each control in the group.  */
>>> +  tree ctrl;
>>> +  unsigned int i;
>>> +  FOR_EACH_VEC_ELT_REVERSE (rgc->controls, i, ctrl)
>>> +    {
>>> +      /* Previous controls will cover BIAS items.  This control covers the
>>> +     next batch.  */
>>> +      poly_uint64 bias = nitems_per_ctrl * i;
>>> +      tree bias_tree = build_int_cst (iv_type, bias);
>>> +
>>> +      /* Rather than have a new IV that starts at TEST_LIMIT and goes down to
>>> +     BIAS, prefer to use the same TEST_LIMIT - BIAS based IV for each
>>> +     control and adjust the bound down by BIAS.  */
>>> +      tree this_test_limit = test_limit;
>>> +      if (i != 0)
>>> +    {
>>> +      this_test_limit = gimple_build (preheader_seq, MAX_EXPR, iv_type,
>>> +                      this_test_limit, bias_tree);
>>> +      this_test_limit = gimple_build (preheader_seq, MINUS_EXPR, iv_type,
>>> +                      this_test_limit, bias_tree);
>>> +    }
>>> +
>>> +      /* Create decrement IV.  */
>>> +      create_iv (this_test_limit, MINUS_EXPR, ctrl, NULL_TREE, loop, &incr_gsi,
>>> +         insert_after, &index_before_incr, &index_after_incr);
>>> +
>>> +      poly_uint64 final_vf = vf * nitems_per_iter;
>>> +      tree vf_step = build_int_cst (iv_type, final_vf);
>>> +      tree res_len;
>>> +      if (LOOP_VINFO_LENS (loop_vinfo).length () == 1)
>>> +    {
>>> +      res_len = gimple_build (header_seq, IFN_SELECT_VL, iv_type,
>>> +                  index_before_incr, vf_step);
>>> +      LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
>>
>> The middle of this loop seems too "deep down" to be setting this.
>> I think it would make sense to do it after:
>>
>>  /* If we still have the option of using partial vectors,
>>     check whether we can generate the necessary loop controls.  */
>>  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>>      && !vect_verify_full_masking (loop_vinfo)
>>      && !vect_verify_loop_lens (loop_vinfo))
>>    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>
>> in vect_analyze_loop_2.
>>
>> I think it'd help for review purposes to split this patch into two
>> (independently tested) pieces:
>>
>> (1) Your cases 2 and 3, where AIUI the main change is to use
>>    a decrementing loop control IV that counts scalars.  This can
>>    be done for any loop that:
>>
>>    (a) uses length "controls"; and
>>    (b) can iterate more than once
>>
>>    Initially this patch would handle case 1 in the same way.
>>
>>    Conceptually, I think it would make sense for this case to use:
>>
>>    - a signed control IV
>>    - with a constant VF step
>>    - and a loop-back test for > 0
>>
>>    in cases where we can prove that that doesn't overflow.  But I
>>    accept that using:
>>
>>    - an unsigned control IV
>>    - with a variable step
>>    - and a loop-back test for != 0
>>
>>    is more general.  So it's OK to handle just that case.  The
>>    optimisation to use signed control IVs could be left to future work.
>>
>> (2) Add SELECT_VL, where AIUI the main change (relative to (1))
>>    is to use a variable step for other IVs too.
>>
>> This is just for review purposes, and to help to separate concepts.
>> SELECT_VL is still an important part of the end result.
>>
>> Thanks,
>> Richard
>>
>>> +    }
>>> +      else
>>> +    {
>>> +      /* For SLP, we can't allow non-VF number of elements to be processed
>>> +         in non-final iteration. We force the number of elements to be
>>> +         processed in each non-final iteration is VF elements. If we allow
>>> +         non-VF elements processing in non-final iteration will make SLP too
>>> +         complicated and produce inferior codegen.
>>> +
>>> +           For example:
>>> +
>>> +        If non-final iteration process VF elements.
>>> +
>>> +          ...
>>> +          .LEN_STORE (vectp_f.8_51, 128B, _71, { 1, 2, 1, 2 }, 0);
>>> +          .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2 }, 0);
>>> +          ...
>>> +
>>> +        If non-final iteration process non-VF elements.
>>> +
>>> +          ...
>>> +          .LEN_STORE (vectp_f.8_51, 128B, _71, { 1, 2, 1, 2 }, 0);
>>> +          if (_71 % 2 == 0)
>>> +           .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2 }, 0);
>>> +          else
>>> +           .LEN_STORE (vectp_f.8_56, 128B, _72, { 2, 1, 2, 1 }, 0);
>>> +          ...
>>> +
>>> +         This is the simple case of 2-elements interleaved vector SLP. We
>>> +         consider other interleave vector, the situation will become more
>>> +         complicated.  */
>>> +      res_len = gimple_build (header_seq, MIN_EXPR, iv_type,
>>> +                  index_before_incr, vf_step);
>>> +      if (rgc->max_nscalars_per_iter != 1)
>>> +        LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P (loop_vinfo) = true;
>>> +    }
>>> +      gassign *assign = gimple_build_assign (ctrl, res_len);
>>> +      gimple_seq_add_stmt (header_seq, assign);
>>> +    }
>>> +
>>> +  return index_after_incr;
>>> +}
>>> +
>>>  /* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
>>>     for all the rgroup controls in RGC and return a control that is nonzero
>>>     when the loop needs to iterate.  Add any new preheader statements to
>>> @@ -704,6 +1051,10 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
>>>  
>>>    bool use_masks_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
>>>    tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
>>> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>>> +  bool use_vl_p = !use_masks_p
>>> +          && direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
>>> +                             OPTIMIZE_FOR_SPEED);
>>>    unsigned int compare_precision = TYPE_PRECISION (compare_type);
>>>    tree orig_niters = niters;
>>>  
>>> @@ -753,17 +1104,34 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
>>>            continue;
>>>        }
>>>  
>>> +    if (use_vl_p && rgc->max_nscalars_per_iter == 1
>>> +        && rgc != &LOOP_VINFO_LENS (loop_vinfo)[0])
>>> +      {
>>> +        rgroup_controls *sub_rgc
>>> +          = &(*controls)[nmasks / rgc->controls.length () - 1];
>>> +        if (!sub_rgc->controls.is_empty ())
>>> +          {
>>> +        vect_adjust_loop_lens (iv_type, &header_seq, rgc, sub_rgc);
>>> +        continue;
>>> +          }
>>> +      }
>>> +
>>>      /* See whether zero-based IV would ever generate all-false masks
>>>         or zero length before wrapping around.  */
>>>      bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
>>>  
>>>      /* Set up all controls for this group.  */
>>> -    test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
>>> -                             &preheader_seq,
>>> -                             &header_seq,
>>> -                             loop_cond_gsi, rgc,
>>> -                             niters, niters_skip,
>>> -                             might_wrap_p);
>>> +    if (use_vl_p)
>>> +      test_ctrl
>>> +        = vect_set_loop_controls_by_select_vl (loop, loop_vinfo,
>>> +                           &preheader_seq, &header_seq,
>>> +                           rgc, niters);
>>> +    else
>>> +      test_ctrl
>>> +        = vect_set_loop_controls_directly (loop, loop_vinfo, &preheader_seq,
>>> +                           &header_seq, loop_cond_gsi, rgc,
>>> +                           niters, niters_skip,
>>> +                           might_wrap_p);
>>>        }
>>>  
>>>    /* Emit all accumulated statements.  */
>>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
>>> index ed0166fedab..fe6af4286bf 100644
>>> --- a/gcc/tree-vect-loop.cc
>>> +++ b/gcc/tree-vect-loop.cc
>>> @@ -973,6 +973,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in,
>> vec_info_shared *shared)
>>>      vectorizable (false),
>>>      can_use_partial_vectors_p (param_vect_partial_vector_usage !=3D 0),
>>>      using_partial_vectors_p (false),
>>> +    using_select_vl_p (false),
>>> +    using_slp_adjusted_len_p (false),
>>>      epil_using_partial_vectors_p (false),
>>>      partial_load_store_bias (0),
>>>      peeling_for_gaps (false),
>>> @@ -10361,15 +10363,18 @@ vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>>>  }
>>>
>>>  /* Given a complete set of length LENS, extract length number INDEX for an
>>> -   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
>>> +   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.
>>> +   Insert any set-up statements before GSI.  */
>>>
>>>  tree
>>> -vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>>> -           unsigned int nvectors, unsigned int index)
>>> +vect_get_loop_len (loop_vec_info loop_vinfo, gimple_stmt_iterator *gsi,
>>> +           vec_loop_lens *lens, unsigned int nvectors, tree vectype,
>>> +           unsigned int index)
>>>  {
>>>    rgroup_controls *rgl = &(*lens)[nvectors - 1];
>>>    bool use_bias_adjusted_len =
>>>      LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) != 0;
>>> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>>>
>>>    /* Populate the rgroup's len array, if this is the first time we've
>>>       used it.  */
>>> @@ -10400,6 +10405,26 @@ vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>>>
>>>    if (use_bias_adjusted_len)
>>>      return rgl->bias_adjusted_ctrl;
>>> +  else if (LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P (loop_vinfo))
>>> +    {
>>> +      tree loop_len = rgl->controls[index];
>>> +      poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type);
>>> +      poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype);
>>> +      if (maybe_ne (nunits1, nunits2))
>>> +    {
>>> +      /* A loop len for data type X can be reused for data type Y
>>> +         if X has N times more elements than Y and if Y's elements
>>> +         are N times bigger than X's.  */
>>> +      gcc_assert (multiple_p (nunits1, nunits2));
>>> +      unsigned int factor = exact_div (nunits1, nunits2).to_constant ();
>>> +      gimple_seq seq = NULL;
>>> +      loop_len = gimple_build (&seq, RDIV_EXPR, iv_type, loop_len,
>>> +                   build_int_cst (iv_type, factor));
>>> +      if (seq)
>>> +        gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
>>> +    }
>>> +      return loop_len;
>>> +    }
>>>    else
>>>      return rgl->controls[index];
>>>  }
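
For reference, the length reuse described in the comment above comes down
to one exact integer division.  A minimal standalone sketch, assuming plain
integers in place of poly_int64 and trees (scale_loop_len and its parameter
names are hypothetical, not part of the patch):

```cpp
#include <cassert>
#include <cstdint>

/* Hypothetical helper, not part of the patch: convert a length counted in
   elements of vector type X (NUNITS_X lanes) into a length counted in
   elements of vector type Y (NUNITS_Y lanes).  This assumes NUNITS_X is an
   exact multiple of NUNITS_Y, which the patch asserts via multiple_p.  */
static std::uint64_t
scale_loop_len (std::uint64_t len_x, std::uint64_t nunits_x,
                std::uint64_t nunits_y)
{
  assert (nunits_y != 0 && nunits_x % nunits_y == 0);
  std::uint64_t factor = nunits_x / nunits_y;
  /* Integer division; any remainder of LEN_X is dropped.  */
  return len_x / factor;
}
```

With 16 lanes for X and 4 lanes for Y the factor is 4, so a length of 12
elements of X corresponds to 3 elements of Y.
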
>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>>> index 7313191b0db..15b22132bd6 100644
>>> --- a/gcc/tree-vect-stmts.cc
>>> +++ b/gcc/tree-vect-stmts.cc
>>> @@ -3147,6 +3147,61 @@ vect_get_data_ptr_increment (vec_info *vinfo,
>>>    return iv_step;
>>>  }
>>>
>>> +/* Prepare the pointer IVs which needs to be updated by a variable amount.
>>> +   Such variable amount is the outcome of .SELECT_VL. In this case, we can
>>> +   allow each iteration process the flexible number of elements as long as
>>> +   the number <= vf elements.
>>> +
>>> +   Return data reference according to SELECT_VL.
>>> +   If new statements are needed, insert them before GSI.  */
>>> +
>>> +static tree
>>> +get_select_vl_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
>>> +                tree aggr_type, class loop *at_loop, tree offset,
>>> +                tree *dummy, gimple_stmt_iterator *gsi,
>>> +                bool simd_lane_access_p, vec_loop_lens *loop_lens,
>>> +                dr_vec_info *dr_info,
>>> +                vect_memory_access_type memory_access_type)
>>> +{
>>> +  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
>>> +  tree step = vect_dr_behavior (vinfo, dr_info)->step;
>>> +
>>> +  /* TODO: We don't support gather/scatter or load_lanes/store_lanes for pointer
>>> +     IVs are updated by variable amount but we will support them in the future.
>>> +   */
>>> +  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
>>> +          && memory_access_type != VMAT_LOAD_STORE_LANES);
>>> +
>>> +  /* When we support SELECT_VL pattern, we dynamic adjust
>>> +     the memory address by .SELECT_VL result.
>>> +
>>> +     The result of .SELECT_VL is the number of elements to
>>> +     be processed of each iteration. So the memory address
>>> +     adjustment operation should be:
>>> +
>>> +     bytesize = GET_MODE_SIZE (element_mode (aggr_type));
>>> +     addr = addr + .SELECT_VL (ARG..) * bytesize;
>>> +  */
>>> +  gimple *ptr_incr;
>>> +  tree loop_len
>>> +    = vect_get_loop_len (loop_vinfo, gsi, loop_lens, 1, aggr_type, 0);
>>> +  tree len_type = TREE_TYPE (loop_len);
>>> +  poly_uint64 bytesize = GET_MODE_SIZE (element_mode (aggr_type));
>>> +  /* Since the outcome of .SELECT_VL is element size, we should adjust
>>> +     it into bytesize so that it can be used in address pointer variable
>>> +     amount IVs adjustment.  */
>>> +  tree tmp = fold_build2 (MULT_EXPR, len_type, loop_len,
>>> +              build_int_cst (len_type, bytesize));
>>> +  if (tree_int_cst_sgn (step) == -1)
>>> +    tmp = fold_build1 (NEGATE_EXPR, len_type, tmp);
>>> +  tree bump = make_temp_ssa_name (len_type, NULL, "ivtmp");
>>> +  gassign *assign = gimple_build_assign (bump, tmp);
>>> +  gsi_insert_before (gsi, assign, GSI_SAME_STMT);
>>> +  return vect_create_data_ref_ptr (vinfo, stmt_info, aggr_type, at_loop, offset,
>>> +                   dummy, gsi, &ptr_incr, simd_lane_access_p,
>>> +                   bump);
>>> +}
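
To make the address update in get_select_vl_data_ref_ptr concrete, here is
a small standalone model of the pointer IV.  It is only a sketch under
stated assumptions: select_vl below is a hypothetical stand-in that returns
min (remaining, vf), whereas a real .SELECT_VL may return other values, and
final_offset is an illustrative name that does not appear in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

/* Hypothetical stand-in for .SELECT_VL: assume it returns
   min (REMAINING, VF).  */
static std::size_t
select_vl (std::size_t remaining, std::size_t vf)
{
  return std::min (remaining, vf);
}

/* Byte offset reached by the pointer IV once N elements of ELEM_BYTES
   bytes each have been processed.  Each iteration bumps the pointer by
   select_vl * ELEM_BYTES, negated when the data reference has a negative
   step.  */
static std::int64_t
final_offset (std::size_t n, std::size_t vf, std::size_t elem_bytes,
              bool negative_step)
{
  assert (vf > 0);
  std::int64_t addr = 0;
  while (n > 0)
    {
      std::size_t len = select_vl (n, vf);
      std::int64_t bump = static_cast<std::int64_t> (len * elem_bytes);
      addr += negative_step ? -bump : bump;
      n -= len;
    }
  return addr;
}
```

For 10 four-byte elements with vf = 4 the per-iteration lengths are 4, 4
and 2, so the pointer ends 40 bytes from where it started (or -40 for a
negative step).
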
>>> +
>>>  /* Check and perform vectorization of BUILT_IN_BSWAP{16,32,64,128}.  */
>>>
>>>  static bool
>>> @@ -8547,6 +8602,14 @@ vectorizable_store (vec_info *vinfo,
>>>          vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
>>>                       slp_node, &gs_info, &dataref_ptr,
>>>                       &vec_offsets);
>>> +      else if (loop_vinfo && LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo)
>>> +           && memory_access_type != VMAT_INVARIANT)
>>> +        dataref_ptr
>>> +          = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
>>> +                        simd_lane_access_p ? loop : NULL,
>>> +                        offset, &dummy, gsi,
>>> +                        simd_lane_access_p, loop_lens,
>>> +                        dr_info, memory_access_type);
>>>        else
>>>          dataref_ptr
>>>            = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
>>> @@ -8795,8 +8858,9 @@ vectorizable_store (vec_info *vinfo,
>>>            else if (loop_lens)
>>>          {
>>>            tree final_len
>>> -            = vect_get_loop_len (loop_vinfo, loop_lens,
>>> -                     vec_num * ncopies, vec_num * j + i);
>>> +            = vect_get_loop_len (loop_vinfo, gsi, loop_lens,
>>> +                     vec_num * ncopies, vectype,
>>> +                     vec_num * j + i);
>>>            tree ptr = build_int_cst (ref_type, align * BITS_PER_UNIT);
>>>            machine_mode vmode =3D TYPE_MODE (vectype);
>>>            opt_machine_mode new_ovmode
>>> @@ -9935,6 +9999,13 @@ vectorizable_load (vec_info *vinfo,
>>>                         slp_node, &gs_info, &dataref_ptr,
>>>                         &vec_offsets);
>>>          }
>>> +      else if (loop_vinfo && LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo)
>>> +           && memory_access_type != VMAT_INVARIANT)
>>> +        dataref_ptr
>>> +          = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
>>> +                        at_loop, offset, &dummy, gsi,
>>> +                        simd_lane_access_p, loop_lens,
>>> +                        dr_info, memory_access_type);
>>>        else
>>>          dataref_ptr
>>>            = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
>>> @@ -10151,8 +10222,8 @@ vectorizable_load (vec_info *vinfo,
>>>              else if (loop_lens && memory_access_type != VMAT_INVARIANT)
>>>                {
>>>              tree final_len
>>> -              = vect_get_loop_len (loop_vinfo, loop_lens,
>>> -                           vec_num * ncopies,
>>> +              = vect_get_loop_len (loop_vinfo, gsi, loop_lens,
>>> +                           vec_num * ncopies, vectype,
>>>                             vec_num * j + i);
>>>              tree ptr = build_int_cst (ref_type,
>>>                            align * BITS_PER_UNIT);
>>> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>>> index 9cf2fb23fe3..3d21e23513d 100644
>>> --- a/gcc/tree-vectorizer.h
>>> +++ b/gcc/tree-vectorizer.h
>>> @@ -818,6 +818,13 @@ public:
>>>       the vector loop can handle fewer than VF scalars.  */
>>>    bool using_partial_vectors_p;
>>>
>>> +  /* True if we've decided to use SELECT_VL to get the number of active
>>> +     elements in a vector loop to be updated.  */
>>> +  bool using_select_vl_p;
>>> +
>>> +  /* True if use adjusted loop length for SLP.  */
>>> +  bool using_slp_adjusted_len_p;
>>> +
>>>    /* True if we've decided to use partially-populated vectors for the
>>>       epilogue of loop.  */
>>>    bool epil_using_partial_vectors_p;
>>> @@ -890,6 +897,8 @@ public:
>>>  #define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
>>>  #define LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P(L) (L)->can_use_partial_vectors_p
>>>  #define LOOP_VINFO_USING_PARTIAL_VECTORS_P(L) (L)->using_partial_vectors_p
>>> +#define LOOP_VINFO_USING_SELECT_VL_P(L) (L)->using_select_vl_p
>>> +#define LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P(L) (L)->using_slp_adjusted_len_p
>>>  #define LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P(L)                             \
>>>    (L)->epil_using_partial_vectors_p
>>>  #define LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS(L) (L)->partial_load_store_bias
>>> @@ -2293,7 +2302,8 @@ extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
>>>                  unsigned int, tree, unsigned int);
>>>  extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
>>>                    tree, unsigned int);
>>> -extern tree vect_get_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
>>> +extern tree vect_get_loop_len (loop_vec_info, gimple_stmt_iterator *,
>>> +                   vec_loop_lens *, unsigned int, tree,
>>>                     unsigned int);
>>>  extern gimple_seq vect_gen_len (tree, tree, tree, tree);
>>>  extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
>> 8
> C