From: Richard Sandiford
To: "juzhe.zhong"
Cc: "gcc-patches@gcc.gnu.org", "kito.cheng@gmail.com", "palmer@dabbelt.com", "richard.guenther@gmail.com"
Subject: Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer
Date: Fri, 12 May 2023 14:25:59 +0100
References: <20230511231126.1594132-1-juzhe.zhong@rivai.ai> <5C0696881FC420F8+A4387224-068E-4647-B237-BC14AE06A32D@rivai.ai>
In-Reply-To: (juzhe zhong's message of "Fri, 12 May 2023 20:41:35 +0800")

"juzhe.zhong" writes:
> Hi, Richard.
> For "can iterate more than once", is it correct use the condition
> "LOOP_LENS ().length >1".

No, that says whether any LOAD_LENs or STORE_LENs operate on multiple
vectors, rather than just single vectors.

I meant: whether the vector loop body might be executed more than once
(i.e. whether the branch-back condition can be true).

This is true for a scalar loop that goes from 0 to some unbounded
variable n.  It's false for a scalar loop that goes from 0 to 6,
if the vectors are known to have at least 8 elements.

Thanks,
Richard

> ---- Replied Message ----
>
> From Richard Sandiford
>
> Date 05/12/2023 19:39
>
> To juzhe.zhong
>
> Cc gcc-patches@gcc.gnu.org,
> kito.cheng@gmail.com,
> palmer@dabbelt.com,
> richard.guenther@gmail.com
>
> Subject Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer
>
> "juzhe.zhong" writes:
>> Thanks Richard.
>> I will do that as you suggested. I have a question for the first patch. How to
>> enable decrement IV? Should I add a target hook or something to let target
>> decide whether enable decrement IV?
>
> At the moment, the only other targets that use IFN_LOAD_LEN and
> IFN_STORE_LEN are PowerPC and s390.  Both targets default to
> --param vect-partial-vector-usage=1 (i.e. use partial vectors
> for epilogues only).
>
> So I think the condition should be that the loop:
>
>   (a) uses length "controls"; and
>   (b) can iterate more than once
>
> No target checks should be needed.
>
> Thanks,
> Richard
>
>> ---- Replied Message ----
>>
>> From Richard Sandiford
>>
>> Date 05/12/2023 19:08
>>
>> To juzhe.zhong@rivai.ai
>>
>> Cc gcc-patches@gcc.gnu.org,
>> kito.cheng@gmail.com,
>> palmer@dabbelt.com,
>> richard.guenther@gmail.com
>>
>> Subject Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer
>>
>> juzhe.zhong@rivai.ai writes:
>>> From: Ju-Zhe Zhong
>>>
>>> 1. Fix document description according Jeff && Richard.
>>> 2. Add LOOP_VINFO_USING_SELECT_VL_P for single rgroup.
>>> 3. Add LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P for SLP multiple rgroup.
>>>
>>> Fix bugs for V5 after testing:
>>> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618209.html
>>>
>>> gcc/ChangeLog:
>>>
>>> * doc/md.texi: Add seletc_vl pattern.
>>> * internal-fn.def (SELECT_VL): New ifn.
>>> * optabs.def (OPTAB_D): New optab.
>>> * tree-vect-loop-manip.cc (vect_adjust_loop_lens): New function.
>>> (vect_set_loop_controls_by_select_vl): Ditto.
>>> (vect_set_loop_condition_partial_vectors): Add loop control for decrement IV.
>>> * tree-vect-loop.cc (vect_get_loop_len): Adjust loop len for SLP.
>>> * tree-vect-stmts.cc (get_select_vl_data_ref_ptr): New function.
>>> (vectorizable_store): Support data reference IV added by outcome of SELECT_VL.
>>> (vectorizable_load): Ditto.
>>> * tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): New macro.
>>> (LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P): Ditto.
>>> (vect_get_loop_len): Adjust loop len for SLP.
>>>
>>> ---
>>>  gcc/doc/md.texi             |  36 ++++
>>>  gcc/internal-fn.def         |   1 +
>>>  gcc/optabs.def              |   1 +
>>>  gcc/tree-vect-loop-manip.cc | 380 +++++++++++++++++++++++++++++++++++-
>>>  gcc/tree-vect-loop.cc       |  31 ++-
>>>  gcc/tree-vect-stmts.cc      |  79 +++++++-
>>>  gcc/tree-vectorizer.h       |  12 +-
>>>  7 files changed, 526 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>>> index 8ebce31ba78..a94ffc4456d 100644
>>> --- a/gcc/doc/md.texi
>>> +++ b/gcc/doc/md.texi
>>> @@ -4974,6 +4974,42 @@ for (i = 1; i < operand3; i++)
>>>    operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>>>  @end smallexample
>>>
>>> +@cindex @code{select_vl@var{m}} instruction pattern
>>> +@item @code{select_vl@var{m}}
>>> +Set operand 0 to the number of active elements in a vector to be updated
>>> +in a loop iteration based on the total number of elements to be updated,
>>> +the vectorization factor and vector properties of the target.
>>> +operand 1 is the total elements in the vector to be updated.
>>> +operand 2 is the vectorization factor.
>>> +The value of operand 0 is target dependent and flexible in each iteration.
>>> +The operation of this pattern can be:
>>> +
>>> +@smallexample
>>> +Case 1:
>>> +operand0 = MIN (operand1, operand2);
>>> +operand2 can be const_poly_int or poly_int related to vector mode size.
>>> +Some target like RISC-V has a standalone instruction to get MIN (n, MODE SIZE) so
>>> +that we can reduce a use of general purpose register.
>>> +
>>> +In this case, only the last iteration of the loop is partial iteration.
>>> +@end smallexample
>>> +
>>> +@smallexample
>>> +Case 2:
>>> +if (operand1 <= operand2)
>>> +  operand0 = operand1;
>>> +else if (operand1 < 2 * operand2)
>>> +  operand0 = ceil (operand1 / 2);
>>> +else
>>> +  operand0 = operand2;
>>> +
>>> +This case will evenly distribute work over the last 2 iterations of a stripmine loop.
>>> +@end smallexample
>>> +
>>> +The output of this pattern is not only used as IV of loop control counter, but also
>>> +is used as the IV of address calculation with multiply/shift operation. This allows
>>> +dynamic adjustment of the number of elements processed each loop iteration.
>>> +
>>
>> I don't think we need to restrict the definition to the two RVV cases.
>> How about:
>>
>> -----------------------------------------------------------------------
>> Set operand 0 to the number of scalar iterations that should be handled
>> by one iteration of a vector loop.  Operand 1 is the total number of
>> scalar iterations that the loop needs to process and operand 2 is a
>> maximum bound on the result (also known as the maximum ``vectorization
>> factor'').
>>
>> The maximum value of operand 0 is given by:
>> @smallexample
>> operand0 = MIN (operand1, operand2)
>> @end smallexample
>> However, targets might choose a lower value than this, based on
>> target-specific criteria.
>> Each iteration of the vector loop might
>> therefore process a different number of scalar iterations, which in turn
>> means that induction variables will have a variable step.  Because of
>> this, it is generally not useful to define this instruction if it will
>> always calculate the maximum value.
>>
>> This optab is only useful on targets that implement @samp{len_load_@var{m}}
>> and/or @samp{len_store_@var{m}}.
>> -----------------------------------------------------------------------
>>
>>>  @cindex @code{check_raw_ptrs@var{m}} instruction pattern
>>>  @item @samp{check_raw_ptrs@var{m}}
>>>  Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
>>> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
>>> index 7fe742c2ae7..6f6fa7d37f9 100644
>>> --- a/gcc/internal-fn.def
>>> +++ b/gcc/internal-fn.def
>>> @@ -153,6 +153,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
>>>  DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
>>>
>>>  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
>>> +DEF_INTERNAL_OPTAB_FN (SELECT_VL, ECF_CONST | ECF_NOTHROW, select_vl, binary)
>>>  DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
>>>                         check_raw_ptrs, check_ptrs)
>>>  DEF_INTERNAL_OPTAB_FN (CHECK_WAR_PTRS, ECF_CONST | ECF_NOTHROW,
>>> diff --git a/gcc/optabs.def b/gcc/optabs.def
>>> index 695f5911b30..b637471b76e 100644
>>> --- a/gcc/optabs.def
>>> +++ b/gcc/optabs.def
>>> @@ -476,3 +476,4 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
>>>  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
>>>  OPTAB_D (len_load_optab, "len_load_$a")
>>>  OPTAB_D (len_store_optab, "len_store_$a")
>>> +OPTAB_D (select_vl_optab, "select_vl$a")
>>> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
>>> index ff6159e08d5..81334f4f171 100644
>>> --- a/gcc/tree-vect-loop-manip.cc
>>> +++ b/gcc/tree-vect-loop-manip.cc
>>> @@ -385,6 +385,353 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>>>    return false;
>>>  }
>>>
>>> +/* Try to use adjust loop lens for non-SLP multiple-rgroups.
>>> +
>>> +     _36 = MIN_EXPR ;
>>> +
>>> +     First length (MIN (X, VF/N)):
>>> +       loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;
>>> +
>>> +     Second length (X - MIN (X, 1 * VF/N)):
>>> +       loop_len_16 = _36 - loop_len_15;
>>> +
>>> +     Third length (X - MIN (X, 2 * VF/N)):
>>> +       _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
>>> +       loop_len_17 = _36 - _38;
>>> +
>>> +     Forth length (X - MIN (X, 3 * VF/N)):
>>> +       _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
>>> +       loop_len_18 = _36 - _39;  */
>>> +
>>> +static void
>>> +vect_adjust_loop_lens (tree iv_type, gimple_seq *seq, rgroup_controls *dest_rgm,
>>> +                       rgroup_controls *src_rgm)
>>> +{
>>> +  tree ctrl_type = dest_rgm->type;
>>> +  poly_uint64 nitems_per_ctrl
>>> +    = TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
>>> +
>>> +  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
>>> +    {
>>> +      tree src = src_rgm->controls[i / dest_rgm->controls.length ()];
>>> +      tree dest = dest_rgm->controls[i];
>>> +      gassign *stmt;
>>> +      if (i == 0)
>>> +        {
>>> +          /* MIN (X, VF*I/N) capped to the range [0, VF/N].  */
>>> +          tree factor = build_int_cst (iv_type, nitems_per_ctrl);
>>> +          stmt = gimple_build_assign (dest, MIN_EXPR, src, factor);
>>> +          gimple_seq_add_stmt (seq, stmt);
>>> +        }
>>> +      else
>>> +        {
>>> +          /* (X - MIN (X, VF*I/N)) capped to the range [0, VF/N].  */
>>> +          tree factor = build_int_cst (iv_type, nitems_per_ctrl * i);
>>> +          tree temp = make_ssa_name (iv_type);
>>> +          stmt = gimple_build_assign (temp, MIN_EXPR, src, factor);
>>> +          gimple_seq_add_stmt (seq, stmt);
>>> +          stmt = gimple_build_assign (dest, MINUS_EXPR, src, temp);
>>> +          gimple_seq_add_stmt (seq, stmt);
>>> +        }
>>> +    }
>>> +}
>>> +
>>> +/* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
>>> +   for all the rgroup controls in RGC and return a control that is nonzero
>>> +   when the loop needs to iterate.  Add any new preheader statements to
>>> +   PREHEADER_SEQ.  Use LOOP_COND_GSI to insert code before the exit gcond.
>>> +
>>> +   RGC belongs to loop LOOP.  The loop originally iterated NITERS
>>> +   times and has been vectorized according to LOOP_VINFO.
>>> +
>>> +   Unlike vect_set_loop_controls_directly which is iterating from 0-based IV
>>> +   to TEST_LIMIT - bias.
>>> +
>>> +   In vect_set_loop_controls_by_select_vl, we are iterating from start at
>>> +   IV = TEST_LIMIT - bias and keep subtract IV by the length calculated by
>>> +   IFN_SELECT_VL pattern.
>>> +
>>> +   1. Single rgroup, the Gimple IR should be:
>>> +
>>> +      # vectp_B.6_8 = PHI
>>> +      # vectp_B.8_16 = PHI
>>> +      # vectp_A.11_19 = PHI
>>> +      # vectp_A.13_22 = PHI
>>> +      # ivtmp_26 = PHI
>>> +      _28 = .SELECT_VL (ivtmp_26, POLY_INT_CST [4, 4]);
>>> +      ivtmp_15 = _28 * 4;
>>> +      vect__1.10_18 = .LEN_LOAD (vectp_B.8_16, 128B, _28, 0);
>>> +      _1 = B[i_10];
>>> +      .LEN_STORE (vectp_A.13_22, 128B, _28, vect__1.10_18, 0);
>>> +      i_7 = i_10 + 1;
>>> +      vectp_B.8_17 = vectp_B.8_16 + ivtmp_15;
>>> +      vectp_A.13_23 = vectp_A.13_22 + ivtmp_15;
>>> +      ivtmp_27 = ivtmp_26 - _28;
>>> +      if (ivtmp_27 != 0)
>>> +        goto ; [83.33%]
>>> +      else
>>> +        goto ; [16.67%]
>>> +
>>> +   Note: We use the outcome of .SELECT_VL to adjust both loop control IV and
>>> +   data reference pointer IV.
>>> +
>>> +   1). The result of .SELECT_VL:
>>> +       _28 = .SELECT_VL (ivtmp_26, POLY_INT_CST [4, 4]);
>>> +       The _28 is not necessary to be VF in any iteration, instead, we allow
>>> +       _28 to be any value as long as _28 <= VF. Such flexible SELECT_VL
>>> +       pattern allows target have various flexible optimizations in vector
>>> +       loop iterations.
>>> +       Target like RISC-V has special application vector
>>> +       length calculation instruction which will distribute even workload
>>> +       in the last 2 iterations.
>>> +
>>> +       Other example is that we can allow even generate _28 <= VF / 2 so
>>> +       that some machine can run vector codes in low power mode.
>>> +
>>> +   2). Loop control IV:
>>> +       ivtmp_27 = ivtmp_26 - _28;
>>> +       if (ivtmp_27 != 0)
>>> +         goto ; [83.33%]
>>> +       else
>>> +         goto ; [16.67%]
>>> +
>>> +       This is the saturating-subtraction towards zero, the outcome of
>>> +       .SELECT_VL wil make ivtmp_27 never underflow zero.
>>> +
>>> +   3). Data reference pointer IV:
>>> +       ivtmp_15 = _28 * 4;
>>> +       vectp_B.8_17 = vectp_B.8_16 + ivtmp_15;
>>> +       vectp_A.13_23 = vectp_A.13_22 + ivtmp_15;
>>> +
>>> +       The pointer IV is adjusted accurately according to the .SELECT_VL.
>>> +
>>> +   2. Multiple rgroup, the Gimple IR should be:
>>> +
>>> +      # i_23 = PHI
>>> +      # vectp_f.8_51 = PHI
>>> +      # vectp_d.10_59 = PHI
>>> +      # ivtmp_70 = PHI
>>> +      # ivtmp_73 = PHI
>>> +      _72 = MIN_EXPR ;
>>> +      _75 = MIN_EXPR ;
>>> +      _1 = i_23 * 2;
>>> +      _2 = (long unsigned int) _1;
>>> +      _3 = _2 * 2;
>>> +      _4 = f_15(D) + _3;
>>> +      _5 = _2 + 1;
>>> +      _6 = _5 * 2;
>>> +      _7 = f_15(D) + _6;
>>> +      .LEN_STORE (vectp_f.8_51, 128B, _75, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>>> +      vectp_f.8_56 = vectp_f.8_51 + 16;
>>> +      .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>>> +      _8 = (long unsigned int) i_23;
>>> +      _9 = _8 * 4;
>>> +      _10 = d_18(D) + _9;
>>> +      _61 = _75 / 2;
>>> +      .LEN_STORE (vectp_d.10_59, 128B, _61, { 3, 3, 3, 3 }, 0);
>>> +      vectp_d.10_63 = vectp_d.10_59 + 16;
>>> +      _64 = _72 / 2;
>>> +      .LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);
>>> +      i_20 = i_23 + 1;
>>> +      vectp_f.8_52 = vectp_f.8_56 + 16;
>>> +      vectp_d.10_60 = vectp_d.10_63 + 16;
>>> +      ivtmp_74 = ivtmp_73 - _75;
>>> +      ivtmp_71 = ivtmp_70 - _72;
>>> +      if (ivtmp_74 != 0)
>>> +        goto ; [83.33%]
>>> +      else
>>> +        goto ; [16.67%]
>>
>> In the gimple examples, I think it would help to quote only the relevant
>> parts and use ellipsis to hide things that don't directly matter.
>> E.g. in the above samples, the old scalar code isn't relevant, whereas
>> it's difficult to follow the example without knowing how _69 and _67
>> relate to each other.  It would also help to say which scalar loop
>> is being vectorised here.
>>
>>> +
>>> +   Note: We DO NOT use .SELECT_VL in SLP auto-vectorization for multiple
>>> +   rgroups. Instead, we use MIN_EXPR to guarantee we always use VF as the
>>> +   iteration amount for mutiple rgroups.
>>> +
>>> +   The analysis of the flow of multiple rgroups:
>>> +      _72 = MIN_EXPR ;
>>> +      _75 = MIN_EXPR ;
>>> +      ...
>>> +      .LEN_STORE (vectp_f.8_51, 128B, _75, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>>> +      vectp_f.8_56 = vectp_f.8_51 + 16;
>>> +      .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>>> +      ...
>>> +      _61 = _75 / 2;
>>> +      .LEN_STORE (vectp_d.10_59, 128B, _61, { 3, 3, 3, 3 }, 0);
>>> +      vectp_d.10_63 = vectp_d.10_59 + 16;
>>> +      _64 = _72 / 2;
>>> +      .LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);
>>> +
>>> +   We use _72 = MIN_EXPR ; to generate the number of the elements
>>> +   to be processed in each iteration.
>>> +
>>> +   The related STOREs:
>>> +      _72 = MIN_EXPR ;
>>> +      .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>>> +      _64 = _72 / 2;
>>> +      .LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);
>>> +   Since these 2 STOREs store 2 vectors that the second vector is half elements
>>> +   of the first vector. So the length of second STORE will be _64 = _72 / 2;
>>> +   It's similar to the VIEW_CONVERT of handling masks in SLP.
>>
>>
>>
>>> +
>>> +   3. Multiple rgroups for non-SLP auto-vectorization.
>>> +
>>> +      # ivtmp_26 = PHI
>>> +      # ivtmp.35_10 = PHI
>>> +      # ivtmp.36_2 = PHI
>>> +      _28 = MIN_EXPR ;
>>> +      loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
>>> +      loop_len_16 = _28 - loop_len_15;
>>> +      _29 = (void *) ivtmp.35_10;
>>> +      _7 = &MEM [(int *)_29];
>>> +      vect__1.25_17 = .LEN_LOAD (_7, 128B, loop_len_15, 0);
>>> +      _33 = _29 + POLY_INT_CST [16, 16];
>>> +      _34 = &MEM [(int *)_33];
>>> +      vect__1.26_19 = .LEN_LOAD (_34, 128B, loop_len_16, 0);
>>> +      vect__2.27_20 = VEC_PACK_TRUNC_EXPR ;
>>> +      _30 = (void *) ivtmp.36_2;
>>> +      _31 = &MEM [(short int *)_30];
>>> +      .LEN_STORE (_31, 128B, _28, vect__2.27_20, 0);
>>> +      ivtmp_27 = ivtmp_26 - _28;
>>> +      ivtmp.35_11 = ivtmp.35_10 + POLY_INT_CST [32, 32];
>>> +      ivtmp.36_8 = ivtmp.36_2 + POLY_INT_CST [16, 16];
>>> +      if (ivtmp_27 != 0)
>>> +        goto ; [83.33%]
>>> +      else
>>> +        goto ; [16.67%]
>>> +
>>> +   The total length: _28 = MIN_EXPR ;
>>> +
>>> +   The length of first half vector:
>>> +     loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
>>> +
>>> +   The length of second half vector:
>>> +     loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
>>> +     loop_len_16 = _28 - loop_len_15;
>>> +
>>> +   1). _28 always <= POLY_INT_CST [8, 8].
>>> +   2). When _28 <= POLY_INT_CST [4, 4], second half vector is not processed.
>>> +   3). When _28 > POLY_INT_CST [4, 4], second half vector is processed.
>>> +*/
>>> +
>>> +static tree
>>> +vect_set_loop_controls_by_select_vl (class loop *loop, loop_vec_info loop_vinfo,
>>> +                                     gimple_seq *preheader_seq,
>>> +                                     gimple_seq *header_seq,
>>> +                                     rgroup_controls *rgc, tree niters)
>>> +{
>>> +  tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
>>> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>>> +  /* We are not allowing masked approach in SELECT_VL.  */
>>> +  gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
>>> +
>>> +  tree ctrl_type = rgc->type;
>>> +  unsigned int nitems_per_iter = rgc->max_nscalars_per_iter * rgc->factor;
>>> +  poly_uint64 nitems_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type) * rgc->factor;
>>> +  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>>> +
>>> +  /* Calculate the maximum number of item values that the rgroup
>>> +     handles in total, the number that it handles for each iteration
>>> +     of the vector loop.  */
>>> +  tree nitems_total = niters;
>>> +  if (nitems_per_iter != 1)
>>> +    {
>>> +      /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>>> +         these multiplications don't overflow.  */
>>> +      tree compare_factor = build_int_cst (compare_type, nitems_per_iter);
>>> +      nitems_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>>> +                                   nitems_total, compare_factor);
>>> +    }
>>> +
>>> +  /* Convert the comparison value to the IV type (either a no-op or
>>> +     a promotion).  */
>>> +  nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
>>> +
>>> +  /* Create an induction variable that counts the number of items
>>> +     processed.  */
>>> +  tree index_before_incr, index_after_incr;
>>> +  gimple_stmt_iterator incr_gsi;
>>> +  bool insert_after;
>>> +  standard_iv_increment_position (loop, &incr_gsi, &insert_after);
>>> +
>>> +  /* Test the decremented IV, which will never underflow 0 since we have
>>> +     IFN_SELECT_VL to gurantee that.  */
>>> +  tree test_limit = nitems_total;
>>> +
>>> +  /* Provide a definition of each control in the group.  */
>>> +  tree ctrl;
>>> +  unsigned int i;
>>> +  FOR_EACH_VEC_ELT_REVERSE (rgc->controls, i, ctrl)
>>> +    {
>>> +      /* Previous controls will cover BIAS items.  This control covers the
>>> +         next batch.  */
>>> +      poly_uint64 bias = nitems_per_ctrl * i;
>>> +      tree bias_tree = build_int_cst (iv_type, bias);
>>> +
>>> +      /* Rather than have a new IV that starts at TEST_LIMIT and goes down to
>>> +         BIAS, prefer to use the same TEST_LIMIT - BIAS based IV for each
>>> +         control and adjust the bound down by BIAS.  */
>>> +      tree this_test_limit = test_limit;
>>> +      if (i != 0)
>>> +        {
>>> +          this_test_limit = gimple_build (preheader_seq, MAX_EXPR, iv_type,
>>> +                                          this_test_limit, bias_tree);
>>> +          this_test_limit = gimple_build (preheader_seq, MINUS_EXPR, iv_type,
>>> +                                          this_test_limit, bias_tree);
>>> +        }
>>> +
>>> +      /* Create decrement IV.  */
>>> +      create_iv (this_test_limit, MINUS_EXPR, ctrl, NULL_TREE, loop, &incr_gsi,
>>> +                 insert_after, &index_before_incr, &index_after_incr);
>>> +
>>> +      poly_uint64 final_vf = vf * nitems_per_iter;
>>> +      tree vf_step = build_int_cst (iv_type, final_vf);
>>> +      tree res_len;
>>> +      if (LOOP_VINFO_LENS (loop_vinfo).length () == 1)
>>> +        {
>>> +          res_len = gimple_build (header_seq, IFN_SELECT_VL, iv_type,
>>> +                                  index_before_incr, vf_step);
>>> +          LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
>>
>> The middle of this loop seems too "deep down" to be setting this.
>> I think it would make sense to do it after:
>>
>>   /* If we still have the option of using partial vectors,
>>      check whether we can generate the necessary loop controls.  */
>>   if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>>       && !vect_verify_full_masking (loop_vinfo)
>>       && !vect_verify_loop_lens (loop_vinfo))
>>     LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>
>> in vect_analyze_loop_2.
>>
>> I think it'd help for review purposes to split this patch into two
>> (independently tested) pieces:
>>
>> (1) Your cases 2 and 3, where AIUI the main change is to use
>>     a decrementing loop control IV that counts scalars.
>>     This can be done for any loop that:
>>
>>     (a) uses length "controls"; and
>>     (b) can iterate more than once
>>
>>     Initially this patch would handle case 1 in the same way.
>>
>>     Conceptually, I think it would make sense for this case to use:
>>
>>     - a signed control IV
>>     - with a constant VF step
>>     - and a loop-back test for > 0
>>
>>     in cases where we can prove that that doesn't overflow.  But I
>>     accept that using:
>>
>>     - an unsigned control IV
>>     - with a variable step
>>     - and a loop-back test for != 0
>>
>>     is more general.  So it's OK to handle just that case.  The
>>     optimisation to use signed control IVs could be left to future work.
>>
>> (2) Add SELECT_VL, where AIUI the main change (relative to (1))
>>     is to use a variable step for other IVs too.
>>
>> This is just for review purposes, and to help to separate concepts.
>> SELECT_VL is still an important part of the end result.
>>
>> Thanks,
>> Richard
>>
>>> +        }
>>> +      else
>>> +        {
>>> +          /* For SLP, we can't allow non-VF number of elements to be processed
>>> +             in non-final iteration. We force the number of elements to be
>>> +             processed in each non-final iteration is VF elements. If we allow
>>> +             non-VF elements processing in non-final iteration will make SLP too
>>> +             complicated and produce inferior codegen.
>>> +
>>> +             For example:
>>> +
>>> +             If non-final iteration process VF elements.
>>> +
>>> +               ...
>>> +               .LEN_STORE (vectp_f.8_51, 128B, _71, { 1, 2, 1, 2 }, 0);
>>> +               .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2 }, 0);
>>> +               ...
>>> +
>>> +             If non-final iteration process non-VF elements.
>>> +
>>> +               ...
>>> +               .LEN_STORE (vectp_f.8_51, 128B, _71, { 1, 2, 1, 2 }, 0);
>>> +               if (_71 % 2 == 0)
>>> +                .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2 }, 0);
>>> +               else
>>> +                .LEN_STORE (vectp_f.8_56, 128B, _72, { 2, 1, 2, 1 }, 0);
>>> +               ...
>>> +
>>> +             This is the simple case of 2-elements interleaved vector SLP. We
>>> +             consider other interleave vector, the situation will become more
>>> +             complicated.  */
>>> +          res_len = gimple_build (header_seq, MIN_EXPR, iv_type,
>>> +                                  index_before_incr, vf_step);
>>> +          if (rgc->max_nscalars_per_iter != 1)
>>> +            LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P (loop_vinfo) = true;
>>> +        }
>>> +      gassign *assign = gimple_build_assign (ctrl, res_len);
>>> +      gimple_seq_add_stmt (header_seq, assign);
>>> +    }
>>> +
>>> +  return index_after_incr;
>>> +}
>>> +
>>>  /* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
>>>     for all the rgroup controls in RGC and return a control that is nonzero
>>>     when the loop needs to iterate.  Add any new preheader statements to
>>> @@ -704,6 +1051,10 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
>>>
>>>    bool use_masks_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
>>>    tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
>>> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>>> +  bool use_vl_p = !use_masks_p
>>> +                  && direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
>>> +                                                     OPTIMIZE_FOR_SPEED);
>>>    unsigned int compare_precision = TYPE_PRECISION (compare_type);
>>>    tree orig_niters = niters;
>>>
>>> @@ -753,17 +1104,34 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
>>>            continue;
>>>        }
>>>
>>> +      if (use_vl_p && rgc->max_nscalars_per_iter == 1
>>> +          && rgc != &LOOP_VINFO_LENS (loop_vinfo)[0])
>>> +        {
>>> +          rgroup_controls *sub_rgc
>>> +            = &(*controls)[nmasks / rgc->controls.length () - 1];
>>> +          if (!sub_rgc->controls.is_empty ())
>>> +            {
>>> +              vect_adjust_loop_lens (iv_type, &header_seq, rgc, sub_rgc);
>>> +              continue;
>>> +            }
>>> +        }
>>> +
>>>        /* See whether zero-based IV would ever generate all-false masks
>>>           or zero length before wrapping around.  */
>>>        bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
>>>
>>>        /* Set up all controls for this group.  */
>>> -      test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
>>> -                                                   &preheader_seq,
>>> -                                                   &header_seq,
>>> -                                                   loop_cond_gsi, rgc,
>>> -                                                   niters, niters_skip,
>>> -                                                   might_wrap_p);
>>> +      if (use_vl_p)
>>> +        test_ctrl
>>> +          = vect_set_loop_controls_by_select_vl (loop, loop_vinfo,
>>> +                                                 &preheader_seq, &header_seq,
>>> +                                                 rgc, niters);
>>> +      else
>>> +        test_ctrl
>>> +          = vect_set_loop_controls_directly (loop, loop_vinfo, &preheader_seq,
>>> +                                             &header_seq, loop_cond_gsi, rgc,
>>> +                                             niters, niters_skip,
>>> +                                             might_wrap_p);
>>>      }
>>>
>>>    /* Emit all accumulated statements.  */
>>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
>>> index ed0166fedab..fe6af4286bf 100644
>>> --- a/gcc/tree-vect-loop.cc
>>> +++ b/gcc/tree-vect-loop.cc
>>> @@ -973,6 +973,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>>>      vectorizable (false),
>>>      can_use_partial_vectors_p (param_vect_partial_vector_usage != 0),
>>>      using_partial_vectors_p (false),
>>> +    using_select_vl_p (false),
>>> +    using_slp_adjusted_len_p (false),
>>>      epil_using_partial_vectors_p (false),
>>>      partial_load_store_bias (0),
>>>      peeling_for_gaps (false),
>>> @@ -10361,15 +10363,18 @@ vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>>>  }
>>>
>>>  /* Given a complete set of length LENS, extract length number INDEX for an
>>> -   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
>>> +   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.
>>> +   Insert any set-up statements before GSI.  */
>>>
>>>  tree
>>> -vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>>> -                   unsigned int nvectors, unsigned int index)
>>> +vect_get_loop_len (loop_vec_info loop_vinfo, gimple_stmt_iterator *gsi,
>>> +                   vec_loop_lens *lens, unsigned int nvectors, tree vectype,
>>> +                   unsigned int index)
>>>  {
>>>    rgroup_controls *rgl = &(*lens)[nvectors - 1];
>>>    bool use_bias_adjusted_len =
>>>      LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) != 0;
>>> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>>>
>>>    /* Populate the rgroup's len array, if this is the first time we've
>>>       used it.  */
>>> @@ -10400,6 +10405,26 @@ vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>>>
>>>    if (use_bias_adjusted_len)
>>>      return rgl->bias_adjusted_ctrl;
>>> +  else if (LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P (loop_vinfo))
>>> +    {
>>> +      tree loop_len = rgl->controls[index];
>>> +      poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type);
>>> +      poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype);
>>> +      if (maybe_ne (nunits1, nunits2))
>>> +        {
>>> +          /* A loop len for data type X can be reused for data type Y
>>> +             if X has N times more elements than Y and if Y's elements
>>> +             are N times bigger than X's.  */
>>> +          gcc_assert (multiple_p (nunits1, nunits2));
>>> +          unsigned int factor = exact_div (nunits1, nunits2).to_constant ();
>>> +          gimple_seq seq = NULL;
>>> +          loop_len = gimple_build (&seq, RDIV_EXPR, iv_type, loop_len,
>>> +                                   build_int_cst (iv_type, factor));
>>> +          if (seq)
>>> +            gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
>>> +        }
>>> +      return loop_len;
>>> +    }
>>>    else
>>>      return rgl->controls[index];
>>>  }
>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>>> index 7313191b0db..15b22132bd6 100644
>>> --- a/gcc/tree-vect-stmts.cc
>>> +++ b/gcc/tree-vect-stmts.cc
>>> @@ -3147,6 +3147,61 @@ vect_get_data_ptr_increment (vec_info *vinfo,
>>>    return iv_step;
>>>  }
>>>
>>> +/* Prepare the pointer IVs which needs to be updated by a variable amount.
>>> +   Such variable amount is the outcome of .SELECT_VL. In this case, we can
>>> +   allow each iteration process the flexible number of elements as long as
>>> +   the number <= vf elments.
>>> +
>>> +   Return data reference according to SELECT_VL.
>>> +   If new statements are needed, insert them before GSI.  */
>>> +
>>> +static tree
>>> +get_select_vl_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
>>> +                            tree aggr_type, class loop *at_loop, tree offset,
>>> +                            tree *dummy, gimple_stmt_iterator *gsi,
>>> +                            bool simd_lane_access_p, vec_loop_lens *loop_lens,
>>> +                            dr_vec_info *dr_info,
>>> +                            vect_memory_access_type memory_access_type)
>>> +{
>>> +  loop_vec_info loop_vinfo = dyn_cast (vinfo);
>>> +  tree step = vect_dr_behavior (vinfo, dr_info)->step;
>>> +
>>> +  /* TODO: We don't support gather/scatter or load_lanes/store_lanes for pointer
>>> +     IVs are updated by variable amount but we will support them in the future.
>>> +   */
>>> +  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
>>> +              && memory_access_type != VMAT_LOAD_STORE_LANES);
>>> +
>>> +  /* When we support SELECT_VL pattern, we dynamic adjust
>>> +     the memory address by .SELECT_VL result.
>>> +
>>> +     The result of .SELECT_VL is the number of elements to
>>> +     be processed of each iteration. So the memory address
>>> +     adjustment operation should be:
>>> +
>>> +     bytesize = GET_MODE_SIZE (element_mode (aggr_type));
>>> +     addr = addr + .SELECT_VL (ARG..) * bytesize;
>>> +  */
>>> +  gimple *ptr_incr;
>>> +  tree loop_len
>>> +    = vect_get_loop_len (loop_vinfo, gsi, loop_lens, 1, aggr_type, 0);
>>> +  tree len_type = TREE_TYPE (loop_len);
>>> +  poly_uint64 bytesize = GET_MODE_SIZE (element_mode (aggr_type));
>>> +  /* Since the outcome of .SELECT_VL is element size, we should adjust
>>> +     it into bytesize so that it can be used in address pointer variable
>>> +     amount IVs adjustment.  */
>>> +  tree tmp = fold_build2 (MULT_EXPR, len_type, loop_len,
>>> +                          build_int_cst (len_type, bytesize));
>>> +  if (tree_int_cst_sgn (step) == -1)
>>> +    tmp = fold_build1 (NEGATE_EXPR, len_type, tmp);
>>> +  tree bump = make_temp_ssa_name (len_type, NULL, "ivtmp");
>>> +  gassign *assign = gimple_build_assign (bump, tmp);
>>> +  gsi_insert_before (gsi, assign, GSI_SAME_STMT);
>>> +  return vect_create_data_ref_ptr (vinfo, stmt_info, aggr_type, at_loop, offset,
>>> +                                   dummy, gsi, &ptr_incr, simd_lane_access_p,
>>> +                                   bump);
>>> +}
>>> +
>>>  /* Check and perform vectorization of BUILT_IN_BSWAP{16,32,64,128}.  */
>>>
>>>  static bool
>>> @@ -8547,6 +8602,14 @@ vectorizable_store (vec_info *vinfo,
>>>              vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
>>>                                           slp_node, &gs_info, &dataref_ptr,
>>>                                           &vec_offsets);
>>> +          else if (loop_vinfo && LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo)
>>> +                   && memory_access_type != VMAT_INVARIANT)
>>> +            dataref_ptr
>>> +              = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
>>> +                                            simd_lane_access_p ?
loop : NULL,
>>> +                                        offset, &dummy, gsi,
>>> +                                        simd_lane_access_p, loop_lens,
>>> +                                        dr_info, memory_access_type);
>>>        else
>>>          dataref_ptr
>>>            = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
>>> @@ -8795,8 +8858,9 @@ vectorizable_store (vec_info *vinfo,
>>>            else if (loop_lens)
>>>              {
>>>                tree final_len
>>> -                = vect_get_loop_len (loop_vinfo, loop_lens,
>>> -                                     vec_num * ncopies, vec_num * j + i);
>>> +                = vect_get_loop_len (loop_vinfo, gsi, loop_lens,
>>> +                                     vec_num * ncopies, vectype,
>>> +                                     vec_num * j + i);
>>>                tree ptr = build_int_cst (ref_type, align * BITS_PER_UNIT);
>>>                machine_mode vmode = TYPE_MODE (vectype);
>>>                opt_machine_mode new_ovmode
>>> @@ -9935,6 +9999,13 @@ vectorizable_load (vec_info *vinfo,
>>>                                         slp_node, &gs_info, &dataref_ptr,
>>>                                         &vec_offsets);
>>>          }
>>> +      else if (loop_vinfo && LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo)
>>> +               && memory_access_type != VMAT_INVARIANT)
>>> +        dataref_ptr
>>> +          = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
>>> +                                        at_loop, offset, &dummy, gsi,
>>> +                                        simd_lane_access_p, loop_lens,
>>> +                                        dr_info, memory_access_type);
>>>        else
>>>          dataref_ptr
>>>            = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
>>> @@ -10151,8 +10222,8 @@ vectorizable_load (vec_info *vinfo,
>>>            else if (loop_lens && memory_access_type != VMAT_INVARIANT)
>>>              {
>>>                tree final_len
>>> -                = vect_get_loop_len (loop_vinfo, loop_lens,
>>> -                                     vec_num * ncopies,
>>> +                = vect_get_loop_len (loop_vinfo, gsi, loop_lens,
>>> +                                     vec_num * ncopies, vectype,
>>>                                       vec_num * j + i);
>>>                tree ptr = build_int_cst (ref_type,
>>>                                          align * BITS_PER_UNIT);
>>> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>>> index 9cf2fb23fe3..3d21e23513d 100644
>>> --- a/gcc/tree-vectorizer.h
>>> +++ b/gcc/tree-vectorizer.h
>>> @@ -818,6 +818,13 @@ public:
>>>       the vector loop can handle fewer than VF scalars.
*/
>>>    bool using_partial_vectors_p;
>>>
>>> +  /* True if we've decided to use SELECT_VL to get the number of active
>>> +     elements in a vector loop to be updated. */
>>> +  bool using_select_vl_p;
>>> +
>>> +  /* True if use adjusted loop length for SLP. */
>>> +  bool using_slp_adjusted_len_p;
>>> +
>>>    /* True if we've decided to use partially-populated vectors for the
>>>       epilogue of loop. */
>>>    bool epil_using_partial_vectors_p;
>>> @@ -890,6 +897,8 @@ public:
>>>  #define LOOP_VINFO_VECTORIZABLE_P(L) (L)->vectorizable
>>>  #define LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P(L) (L)->can_use_partial_vectors_p
>>>  #define LOOP_VINFO_USING_PARTIAL_VECTORS_P(L) (L)->using_partial_vectors_p
>>> +#define LOOP_VINFO_USING_SELECT_VL_P(L) (L)->using_select_vl_p
>>> +#define LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P(L) (L)->using_slp_adjusted_len_p
>>>  #define LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P(L) \
>>>    (L)->epil_using_partial_vectors_p
>>>  #define LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS(L) (L)->partial_load_store_bias
>>> @@ -2293,7 +2302,8 @@ extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
>>>                                  unsigned int, tree, unsigned int);
>>>  extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
>>>                                    tree, unsigned int);
>>> -extern tree vect_get_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
>>> +extern tree vect_get_loop_len (loop_vec_info, gimple_stmt_iterator *,
>>> +                               vec_loop_lens *, unsigned int, tree,
>>>                                 unsigned int);
>>>  extern gimple_seq vect_gen_len (tree, tree, tree, tree);
>>>  extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
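
For anyone following the address update that the quoted
get_select_vl_data_ref_ptr comment describes
(addr = addr + .SELECT_VL (ARG..) * bytesize), here is a minimal scalar
sketch in plain C. It is not GCC code. The names select_vl and
copy_with_select_vl are hypothetical, and select_vl simply assumes
min (remaining, vf); a real target's .SELECT_VL may pick a different
value within that bound.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for .SELECT_VL: the number of elements to
   process in this iteration.  Assumption: min (remaining, vf).  */
static size_t
select_vl (size_t remaining, size_t vf)
{
  return remaining < vf ? remaining : vf;
}

/* Copy N elements of BYTESIZE bytes each from SRC to DST, at most VF
   elements per iteration.  Returns how many times the loop body ran.  */
static size_t
copy_with_select_vl (void *dst, const void *src, size_t n,
                     size_t bytesize, size_t vf)
{
  assert (vf > 0);
  unsigned char *d = dst;
  const unsigned char *s = src;
  size_t remaining = n;   /* The decrementing IV.  */
  size_t iterations = 0;

  while (remaining > 0)
    {
      size_t vl = select_vl (remaining, vf);
      /* Models a length-controlled load followed by a store.  */
      memcpy (d, s, vl * bytesize);
      /* The result of select_vl counts elements, so scale it to bytes
         before bumping the pointers.  */
      d += vl * bytesize;
      s += vl * bytesize;
      remaining -= vl;
      iterations++;
    }
  return iterations;
}
```

With 10 ints and vf = 4 the body runs three times (4, 4, then 2
elements). With 6 elements and vf = 8 it runs once, which is the
"cannot iterate more than once" case discussed earlier in the thread.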