From: Richard Sandiford
To: "juzhe.zhong"
Cc: gcc-patches@gcc.gnu.org, kito.cheng@gmail.com, palmer@dabbelt.com,
    richard.guenther@gmail.com
Subject: Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer
Date: Fri, 12 May 2023 12:39:01 +0100

"juzhe.zhong" writes:
> Thanks Richard.
> I will do that as you suggested.  I have a question for the first patch.
> How do we enable the decrement IV?  Should I add a target hook or
> something to let the target decide whether to enable the decrement IV?

At the moment, the only other targets that use IFN_LEN_LOAD and
IFN_LEN_STORE are PowerPC and s390.  Both targets default to
--param vect-partial-vector-usage=1 (i.e. use partial vectors for
epilogues only).

So I think the condition should be that the loop:

(a) uses length "controls"; and
(b) can iterate more than once

No target checks should be needed.

Thanks,
Richard
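As an editorial illustration of that condition (an assumption about how the
eventual patch might gate the behaviour, not actual GCC code), the decision
depends only on properties of the vectorized loop; the struct and helpers
below are hypothetical stand-ins:

```c
/* Editorial sketch only -- hypothetical types and fields, not GCC internals.
   The point is that no target hook is involved: the decision depends purely
   on (a) and (b) above.  */

struct loop_desc
{
  int uses_length_controls;   /* vectorized with len_load/len_store controls  */
  long scalar_niters;         /* known scalar iteration count                  */
  long vectorization_factor;
};

static int
use_decrementing_iv_p (const struct loop_desc *loop)
{
  return loop->uses_length_controls                           /* (a) */
         && loop->scalar_niters > loop->vectorization_factor; /* (b) more than
                                                                 one iteration */
}
```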
> ---- Replied Message ----
>
> From     Richard Sandiford
> Date     05/12/2023 19:08
> To       juzhe.zhong@rivai.ai
> Cc       gcc-patches@gcc.gnu.org,
>          kito.cheng@gmail.com,
>          palmer@dabbelt.com,
>          richard.guenther@gmail.com
> Subject  Re: [PATCH V6] VECT: Add decrement IV support in Loop Vectorizer
>
> juzhe.zhong@rivai.ai writes:
>> From: Ju-Zhe Zhong
>>
>> 1. Fix document description according to Jeff && Richard.
>> 2. Add LOOP_VINFO_USING_SELECT_VL_P for single rgroup.
>> 3. Add LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P for SLP multiple rgroup.
>>
>> Fix bugs for V5 after testing:
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618209.html
>>
>> gcc/ChangeLog:
>>
>>         * doc/md.texi: Add select_vl pattern.
>>         * internal-fn.def (SELECT_VL): New ifn.
>>         * optabs.def (OPTAB_D): New optab.
>>         * tree-vect-loop-manip.cc (vect_adjust_loop_lens): New function.
>>         (vect_set_loop_controls_by_select_vl): Ditto.
>>         (vect_set_loop_condition_partial_vectors): Add loop control for
>>         decrement IV.
>>         * tree-vect-loop.cc (vect_get_loop_len): Adjust loop len for SLP.
>>         * tree-vect-stmts.cc (get_select_vl_data_ref_ptr): New function.
>>         (vectorizable_store): Support data reference IV added by outcome of
>>         SELECT_VL.
>>         (vectorizable_load): Ditto.
>>         * tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): New macro.
>>         (LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P): Ditto.
>>         (vect_get_loop_len): Adjust loop len for SLP.
>>
>> ---
>>  gcc/doc/md.texi             |  36 ++++
>>  gcc/internal-fn.def         |   1 +
>>  gcc/optabs.def              |   1 +
>>  gcc/tree-vect-loop-manip.cc | 380 +++++++++++++++++++++++++++++++++++-
>>  gcc/tree-vect-loop.cc       |  31 ++-
>>  gcc/tree-vect-stmts.cc      |  79 +++++++-
>>  gcc/tree-vectorizer.h       |  12 +-
>>  7 files changed, 526 insertions(+), 14 deletions(-)
>>
>> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>> index 8ebce31ba78..a94ffc4456d 100644
>> --- a/gcc/doc/md.texi
>> +++ b/gcc/doc/md.texi
>> @@ -4974,6 +4974,42 @@ for (i = 1; i < operand3; i++)
>>    operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>>  @end smallexample
>>
>> +@cindex @code{select_vl@var{m}} instruction pattern
>> +@item @code{select_vl@var{m}}
>> +Set operand 0 to the number of active elements in a vector to be updated
>> +in a loop iteration based on the total number of elements to be updated,
>> +the vectorization factor and vector properties of the target.
>> +operand 1 is the total elements in the vector to be updated.
>> +operand 2 is the vectorization factor.
>> +The value of operand 0 is target dependent and flexible in each iteration.
>> +The operation of this pattern can be:
>> +
>> +@smallexample
>> +Case 1:
>> +operand0 = MIN (operand1, operand2);
>> +operand2 can be const_poly_int or poly_int related to vector mode size.
>> +Some target like RISC-V has a standalone instruction to get MIN (n, MODE SIZE) so
>> +that we can reduce a use of general purpose register.
>> +
>> +In this case, only the last iteration of the loop is partial iteration.
>> +@end smallexample
>> +
>> +@smallexample
>> +Case 2:
>> +if (operand1 <= operand2)
>> +  operand0 = operand1;
>> +else if (operand1 < 2 * operand2)
>> +  operand0 = ceil (operand1 / 2);
>> +else
>> +  operand0 = operand2;
>> +
>> +This case will evenly distribute work over the last 2 iterations of a stripmine loop.
>> +@end smallexample
>> +
>> +The output of this pattern is not only used as the IV of the loop control counter,
>> +but is also used as the IV of the address calculation with a multiply/shift
>> +operation.  This allows dynamic adjustment of the number of elements processed
>> +in each loop iteration.
>> +
>
> I don't think we need to restrict the definition to the two RVV cases.
> How about:
>
> -----------------------------------------------------------------------
> Set operand 0 to the number of scalar iterations that should be handled
> by one iteration of a vector loop.  Operand 1 is the total number of
> scalar iterations that the loop needs to process and operand 2 is a
> maximum bound on the result (also known as the maximum ``vectorization
> factor'').
>
> The maximum value of operand 0 is given by:
> @smallexample
> operand0 = MIN (operand1, operand2)
> @end smallexample
> However, targets might choose a lower value than this, based on
> target-specific criteria.  Each iteration of the vector loop might
> therefore process a different number of scalar iterations, which in turn
> means that induction variables will have a variable step.  Because of
> this, it is generally not useful to define this instruction if it will
> always calculate the maximum value.
>
> This optab is only useful on targets that implement @samp{len_load_@var{m}}
> and/or @samp{len_store_@var{m}}.
> -----------------------------------------------------------------------
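For concreteness, the semantics proposed above can be pictured as the
following strip-mined loop.  This is an illustrative C sketch only, not part
of the patch; `select_vl` and `process_chunk` are hypothetical stand-ins for
the optab result and the len_load/len_store vector body.

```c
#include <stddef.h>

/* Hypothetical stand-ins: select_vl() models the select_vl@var{m} optab
   (it may return anything in [1, MIN (remaining, vf)]), and
   process_chunk() models the len_load/len_store vector body.  */
extern size_t select_vl (size_t remaining, size_t vf);
extern void process_chunk (int *dst, const int *src, size_t len);

void
copy (int *dst, const int *src, size_t n, size_t vf)
{
  size_t done = 0;
  size_t remaining = n;       /* decrementing IV counting scalar iterations  */

  while (remaining > 0)
    {
      /* The target chooses how many scalar iterations this vector
         iteration handles; it never exceeds MIN (remaining, vf).  */
      size_t vl = select_vl (remaining, vf);

      process_chunk (dst + done, src + done, vl);

      done += vl;             /* data pointers advance by a variable step  */
      remaining -= vl;        /* can never underflow zero                  */
    }
}
```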
>
>>  @cindex @code{check_raw_ptrs@var{m}} instruction pattern
>>  @item @samp{check_raw_ptrs@var{m}}
>>  Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
>> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
>> index 7fe742c2ae7..6f6fa7d37f9 100644
>> --- a/gcc/internal-fn.def
>> +++ b/gcc/internal-fn.def
>> @@ -153,6 +153,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
>>  DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
>>
>>  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
>> +DEF_INTERNAL_OPTAB_FN (SELECT_VL, ECF_CONST | ECF_NOTHROW, select_vl, binary)
>>  DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
>>                         check_raw_ptrs, check_ptrs)
>>  DEF_INTERNAL_OPTAB_FN (CHECK_WAR_PTRS, ECF_CONST | ECF_NOTHROW,
>> diff --git a/gcc/optabs.def b/gcc/optabs.def
>> index 695f5911b30..b637471b76e 100644
>> --- a/gcc/optabs.def
>> +++ b/gcc/optabs.def
>> @@ -476,3 +476,4 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
>>  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
>>  OPTAB_D (len_load_optab, "len_load_$a")
>>  OPTAB_D (len_store_optab, "len_store_$a")
>> +OPTAB_D (select_vl_optab, "select_vl$a")
>> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
>> index ff6159e08d5..81334f4f171 100644
>> --- a/gcc/tree-vect-loop-manip.cc
>> +++ b/gcc/tree-vect-loop-manip.cc
>> @@ -385,6 +385,353 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>>    return false;
>>  }
>>
>> +/* Try to use adjusted loop lens for non-SLP multiple rgroups.
>> +
>> +   _36 = MIN_EXPR ;
>> +
>> +   First length (MIN (X, VF/N)):
>> +     loop_len_15 = MIN_EXPR <_36, POLY_INT_CST [2, 2]>;
>> +
>> +   Second length (X - MIN (X, 1 * VF/N)):
>> +     loop_len_16 = _36 - loop_len_15;
>> +
>> +   Third length (X - MIN (X, 2 * VF/N)):
>> +     _38 = MIN_EXPR <_36, POLY_INT_CST [4, 4]>;
>> +     loop_len_17 = _36 - _38;
>> +
>> +   Fourth length (X - MIN (X, 3 * VF/N)):
>> +     _39 = MIN_EXPR <_36, POLY_INT_CST [6, 6]>;
>> +     loop_len_18 = _36 - _39;  */
>> +
>> +static void
>> +vect_adjust_loop_lens (tree iv_type, gimple_seq *seq, rgroup_controls *dest_rgm,
>> +                       rgroup_controls *src_rgm)
>> +{
>> +  tree ctrl_type = dest_rgm->type;
>> +  poly_uint64 nitems_per_ctrl
>> +    = TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
>> +
>> +  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
>> +    {
>> +      tree src = src_rgm->controls[i / dest_rgm->controls.length ()];
>> +      tree dest = dest_rgm->controls[i];
>> +      gassign *stmt;
>> +      if (i == 0)
>> +        {
>> +          /* MIN (X, VF*I/N) capped to the range [0, VF/N].  */
>> +          tree factor = build_int_cst (iv_type, nitems_per_ctrl);
>> +          stmt = gimple_build_assign (dest, MIN_EXPR, src, factor);
>> +          gimple_seq_add_stmt (seq, stmt);
>> +        }
>> +      else
>> +        {
>> +          /* (X - MIN (X, VF*I/N)) capped to the range [0, VF/N].  */
>> +          tree factor = build_int_cst (iv_type, nitems_per_ctrl * i);
>> +          tree temp = make_ssa_name (iv_type);
>> +          stmt = gimple_build_assign (temp, MIN_EXPR, src, factor);
>> +          gimple_seq_add_stmt (seq, stmt);
>> +          stmt = gimple_build_assign (dest, MINUS_EXPR, src, temp);
>> +          gimple_seq_add_stmt (seq, stmt);
>> +        }
>> +    }
>> +}
>> +
>> +/* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
>> +   for all the rgroup controls in RGC and return a control that is nonzero
>> +   when the loop needs to iterate.  Add any new preheader statements to
>> +   PREHEADER_SEQ.  Use LOOP_COND_GSI to insert code before the exit gcond.
>> +
>> +   RGC belongs to loop LOOP.  The loop originally iterated NITERS
>> +   times and has been vectorized according to LOOP_VINFO.
>> +
>> +   Unlike vect_set_loop_controls_directly, which iterates from a 0-based IV
>> +   up to TEST_LIMIT - bias, in vect_set_loop_controls_by_select_vl we start
>> +   at IV = TEST_LIMIT - bias and keep subtracting from the IV the length
>> +   calculated by the IFN_SELECT_VL pattern.
>> +
>> +   1. Single rgroup, the Gimple IR should be:
>> +
>> +        # vectp_B.6_8 = PHI
>> +        # vectp_B.8_16 = PHI
>> +        # vectp_A.11_19 = PHI
>> +        # vectp_A.13_22 = PHI
>> +        # ivtmp_26 = PHI
>> +        _28 = .SELECT_VL (ivtmp_26, POLY_INT_CST [4, 4]);
>> +        ivtmp_15 = _28 * 4;
>> +        vect__1.10_18 = .LEN_LOAD (vectp_B.8_16, 128B, _28, 0);
>> +        _1 = B[i_10];
>> +        .LEN_STORE (vectp_A.13_22, 128B, _28, vect__1.10_18, 0);
>> +        i_7 = i_10 + 1;
>> +        vectp_B.8_17 = vectp_B.8_16 + ivtmp_15;
>> +        vectp_A.13_23 = vectp_A.13_22 + ivtmp_15;
>> +        ivtmp_27 = ivtmp_26 - _28;
>> +        if (ivtmp_27 != 0)
>> +          goto ; [83.33%]
>> +        else
>> +          goto ; [16.67%]
>> +
>> +   Note: We use the outcome of .SELECT_VL to adjust both the loop control IV
>> +   and the data reference pointer IV.
>> +
>> +   1). The result of .SELECT_VL:
>> +         _28 = .SELECT_VL (ivtmp_26, POLY_INT_CST [4, 4]);
>> +       _28 is not necessarily VF in every iteration; instead, we allow
>> +       _28 to be any value as long as _28 <= VF.  Such a flexible SELECT_VL
>> +       pattern allows the target various flexible optimizations across vector
>> +       loop iterations.  A target like RISC-V has a special application vector
>> +       length calculation instruction which will distribute the workload
>> +       evenly over the last 2 iterations.
>> +
>> +       Another example is that we can even allow generating _28 <= VF / 2 so
>> +       that some machines can run vector code in low-power mode.
>> +
>> +   2). Loop control IV:
>> +         ivtmp_27 = ivtmp_26 - _28;
>> +         if (ivtmp_27 != 0)
>> +           goto ; [83.33%]
>> +         else
>> +           goto ; [16.67%]
>> +
>> +       This is the saturating subtraction towards zero; the outcome of
>> +       .SELECT_VL will make ivtmp_27 never underflow zero.
>> +
>> +   3). Data reference pointer IV:
>> +         ivtmp_15 = _28 * 4;
>> +         vectp_B.8_17 = vectp_B.8_16 + ivtmp_15;
>> +         vectp_A.13_23 = vectp_A.13_22 + ivtmp_15;
>> +
>> +       The pointer IV is adjusted accurately according to the .SELECT_VL.
>> +
>> +   2. Multiple rgroup, the Gimple IR should be:
>> +
>> +        # i_23 = PHI
>> +        # vectp_f.8_51 = PHI
>> +        # vectp_d.10_59 = PHI
>> +        # ivtmp_70 = PHI
>> +        # ivtmp_73 = PHI
>> +        _72 = MIN_EXPR ;
>> +        _75 = MIN_EXPR ;
>> +        _1 = i_23 * 2;
>> +        _2 = (long unsigned int) _1;
>> +        _3 = _2 * 2;
>> +        _4 = f_15(D) + _3;
>> +        _5 = _2 + 1;
>> +        _6 = _5 * 2;
>> +        _7 = f_15(D) + _6;
>> +        .LEN_STORE (vectp_f.8_51, 128B, _75, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>> +        vectp_f.8_56 = vectp_f.8_51 + 16;
>> +        .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>> +        _8 = (long unsigned int) i_23;
>> +        _9 = _8 * 4;
>> +        _10 = d_18(D) + _9;
>> +        _61 = _75 / 2;
>> +        .LEN_STORE (vectp_d.10_59, 128B, _61, { 3, 3, 3, 3 }, 0);
>> +        vectp_d.10_63 = vectp_d.10_59 + 16;
>> +        _64 = _72 / 2;
>> +        .LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);
>> +        i_20 = i_23 + 1;
>> +        vectp_f.8_52 = vectp_f.8_56 + 16;
>> +        vectp_d.10_60 = vectp_d.10_63 + 16;
>> +        ivtmp_74 = ivtmp_73 - _75;
>> +        ivtmp_71 = ivtmp_70 - _72;
>> +        if (ivtmp_74 != 0)
>> +          goto ; [83.33%]
>> +        else
>> +          goto ; [16.67%]
>
> In the gimple examples, I think it would help to quote only the relevant
> parts and use ellipsis to hide things that don't directly matter.
> E.g. in the above samples, the old scalar code isn't relevant, whereas
> it's difficult to follow the example without knowing how _69 and _67
> relate to each other.  It would also help to say which scalar loop
> is being vectorised here.
>
>> +
>> +   Note: We DO NOT use .SELECT_VL in SLP auto-vectorization for multiple
>> +   rgroups.  Instead, we use MIN_EXPR to guarantee we always use VF as the
>> +   iteration amount for multiple rgroups.
>> +
>> +   The analysis of the flow of multiple rgroups:
>> +        _72 = MIN_EXPR ;
>> +        _75 = MIN_EXPR ;
>> +        ...
>> +        .LEN_STORE (vectp_f.8_51, 128B, _75, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>> +        vectp_f.8_56 = vectp_f.8_51 + 16;
>> +        .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>> +        ...
>> +        _61 = _75 / 2;
>> +        .LEN_STORE (vectp_d.10_59, 128B, _61, { 3, 3, 3, 3 }, 0);
>> +        vectp_d.10_63 = vectp_d.10_59 + 16;
>> +        _64 = _72 / 2;
>> +        .LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);
>> +
>> +   We use _72 = MIN_EXPR ; to generate the number of the elements
>> +   to be processed in each iteration.
>> +
>> +   The related STOREs:
>> +        _72 = MIN_EXPR ;
>> +        .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2, 1, 2, 1, 2 }, 0);
>> +        _64 = _72 / 2;
>> +        .LEN_STORE (vectp_d.10_63, 128B, _64, { 3, 3, 3, 3 }, 0);
>> +   These 2 STOREs store 2 vectors, and the second vector has half the
>> +   elements of the first vector, so the length of the second STORE will be
>> +   _64 = _72 / 2.  It's similar to the VIEW_CONVERT handling of masks in SLP.
>
>
>> +
>> +   3. Multiple rgroups for non-SLP auto-vectorization.
>> +
>> +        # ivtmp_26 = PHI
>> +        # ivtmp.35_10 = PHI
>> +        # ivtmp.36_2 = PHI
>> +        _28 = MIN_EXPR ;
>> +        loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
>> +        loop_len_16 = _28 - loop_len_15;
>> +        _29 = (void *) ivtmp.35_10;
>> +        _7 = &MEM [(int *)_29];
>> +        vect__1.25_17 = .LEN_LOAD (_7, 128B, loop_len_15, 0);
>> +        _33 = _29 + POLY_INT_CST [16, 16];
>> +        _34 = &MEM [(int *)_33];
>> +        vect__1.26_19 = .LEN_LOAD (_34, 128B, loop_len_16, 0);
>> +        vect__2.27_20 = VEC_PACK_TRUNC_EXPR ;
>> +        _30 = (void *) ivtmp.36_2;
>> +        _31 = &MEM [(short int *)_30];
>> +        .LEN_STORE (_31, 128B, _28, vect__2.27_20, 0);
>> +        ivtmp_27 = ivtmp_26 - _28;
>> +        ivtmp.35_11 = ivtmp.35_10 + POLY_INT_CST [32, 32];
>> +        ivtmp.36_8 = ivtmp.36_2 + POLY_INT_CST [16, 16];
>> +        if (ivtmp_27 != 0)
>> +          goto ; [83.33%]
>> +        else
>> +          goto ; [16.67%]
>> +
>> +   The total length: _28 = MIN_EXPR ;
>> +
>> +   The length of the first half vector:
>> +        loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
>> +
>> +   The length of the second half vector:
>> +        loop_len_15 = MIN_EXPR <_28, POLY_INT_CST [4, 4]>;
>> +        loop_len_16 = _28 - loop_len_15;
>> +
>> +   1). _28 is always <= POLY_INT_CST [8, 8].
>> +   2). When _28 <= POLY_INT_CST [4, 4], the second half vector is not processed.
>> +   3). When _28 > POLY_INT_CST [4, 4], the second half vector is processed.
>> +*/
>> +
>> +static tree
>> +vect_set_loop_controls_by_select_vl (class loop *loop, loop_vec_info loop_vinfo,
>> +                                     gimple_seq *preheader_seq,
>> +                                     gimple_seq *header_seq,
>> +                                     rgroup_controls *rgc, tree niters)
>> +{
>> +  tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
>> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>> +  /* We are not allowing a masked approach in SELECT_VL.  */
>> +  gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
>> +
>> +  tree ctrl_type = rgc->type;
>> +  unsigned int nitems_per_iter = rgc->max_nscalars_per_iter * rgc->factor;
>> +  poly_uint64 nitems_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type) * rgc->factor;
>> +  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>> +
>> +  /* Calculate the maximum number of item values that the rgroup
>> +     handles in total, the number that it handles for each iteration
>> +     of the vector loop.  */
>> +  tree nitems_total = niters;
>> +  if (nitems_per_iter != 1)
>> +    {
>> +      /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>> +         these multiplications don't overflow.  */
>> +      tree compare_factor = build_int_cst (compare_type, nitems_per_iter);
>> +      nitems_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>> +                                   nitems_total, compare_factor);
>> +    }
>> +
>> +  /* Convert the comparison value to the IV type (either a no-op or
>> +     a promotion).  */
>> +  nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
>> +
>> +  /* Create an induction variable that counts the number of items
>> +     processed.  */
>> +  tree index_before_incr, index_after_incr;
>> +  gimple_stmt_iterator incr_gsi;
>> +  bool insert_after;
>> +  standard_iv_increment_position (loop, &incr_gsi, &insert_after);
>> +
>> +  /* Test the decremented IV, which will never underflow 0 since we have
>> +     IFN_SELECT_VL to guarantee that.  */
>> +  tree test_limit = nitems_total;
>> +
>> +  /* Provide a definition of each control in the group.  */
>> +  tree ctrl;
>> +  unsigned int i;
>> +  FOR_EACH_VEC_ELT_REVERSE (rgc->controls, i, ctrl)
>> +    {
>> +      /* Previous controls will cover BIAS items.  This control covers the
>> +         next batch.  */
>> +      poly_uint64 bias = nitems_per_ctrl * i;
>> +      tree bias_tree = build_int_cst (iv_type, bias);
>> +
>> +      /* Rather than have a new IV that starts at TEST_LIMIT and goes down to
>> +         BIAS, prefer to use the same TEST_LIMIT - BIAS based IV for each
>> +         control and adjust the bound down by BIAS.  */
>> +      tree this_test_limit = test_limit;
>> +      if (i != 0)
>> +        {
>> +          this_test_limit = gimple_build (preheader_seq, MAX_EXPR, iv_type,
>> +                                          this_test_limit, bias_tree);
>> +          this_test_limit = gimple_build (preheader_seq, MINUS_EXPR, iv_type,
>> +                                          this_test_limit, bias_tree);
>> +        }
>> +
>> +      /* Create decrement IV.  */
>> +      create_iv (this_test_limit, MINUS_EXPR, ctrl, NULL_TREE, loop, &incr_gsi,
>> +                 insert_after, &index_before_incr, &index_after_incr);
>> +
>> +      poly_uint64 final_vf = vf * nitems_per_iter;
>> +      tree vf_step = build_int_cst (iv_type, final_vf);
>> +      tree res_len;
>> +      if (LOOP_VINFO_LENS (loop_vinfo).length () == 1)
>> +        {
>> +          res_len = gimple_build (header_seq, IFN_SELECT_VL, iv_type,
>> +                                  index_before_incr, vf_step);
>> +          LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
>
> The middle of this loop seems too "deep down" to be setting this.
> I think it would make sense to do it after:
>
>   /* If we still have the option of using partial vectors,
>      check whether we can generate the necessary loop controls.  */
>   if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>       && !vect_verify_full_masking (loop_vinfo)
>       && !vect_verify_loop_lens (loop_vinfo))
>     LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>
> in vect_analyze_loop_2.
>
> I think it'd help for review purposes to split this patch into two
> (independently tested) pieces:
>
> (1) Your cases 2 and 3, where AIUI the main change is to use
>     a decrementing loop control IV that counts scalars.  This can
>     be done for any loop that:
>
>     (a) uses length "controls"; and
>     (b) can iterate more than once
>
>     Initially this patch would handle case 1 in the same way.
>
>     Conceptually, I think it would make sense for this case to use:
>
>     - a signed control IV
>     - with a constant VF step
>     - and a loop-back test for > 0
>
>     in cases where we can prove that that doesn't overflow.  But I
>     accept that using:
>
>     - an unsigned control IV
>     - with a variable step
>     - and a loop-back test for != 0
>
>     is more general.  So it's OK to handle just that case.  The
>     optimisation to use signed control IVs could be left to future work.
>
> (2) Add SELECT_VL, where AIUI the main change (relative to (1))
>     is to use a variable step for other IVs too.
>
> This is just for review purposes, and to help to separate concepts.
> SELECT_VL is still an important part of the end result.
>
> Thanks,
> Richard
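The two control-IV schemes contrasted above can be sketched in plain C.
This is editorial illustration only, not code from the patch; `select_vl`
and `body` are hypothetical stand-ins for the SELECT_VL result and the
vector loop body.

```c
/* Editorial sketch contrasting the two loop-control schemes.  */

extern void body (unsigned long first, unsigned long len);
extern unsigned long select_vl (unsigned long remaining, unsigned long vf);

/* (1) Signed control IV, constant VF step, loop-back test "> 0".
   Only usable when the IV provably cannot overflow.  */
void
constant_step (long n, long vf)
{
  for (long remaining = n; remaining > 0; remaining -= vf)
    body (n - remaining, remaining < vf ? remaining : vf);
}

/* (2) Unsigned control IV, variable step, loop-back test "!= 0".
   More general; this is the form SELECT_VL produces, since the chosen
   step never takes the IV below zero.  */
void
variable_step (unsigned long n, unsigned long vf)
{
  for (unsigned long remaining = n; remaining != 0; )
    {
      unsigned long len = select_vl (remaining, vf);  /* 0 < len <= MIN (remaining, vf) */
      body (n - remaining, len);
      remaining -= len;
    }
}
```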
>
>> +        }
>> +      else
>> +        {
>> +          /* For SLP, we can't allow a non-VF number of elements to be
>> +             processed in a non-final iteration.  We force the number of
>> +             elements processed in each non-final iteration to be VF.
>> +             Allowing non-VF elements to be processed in non-final iterations
>> +             would make SLP too complicated and produce inferior codegen.
>> +
>> +             For example:
>> +
>> +             If a non-final iteration processes VF elements:
>> +
>> +               ...
>> +               .LEN_STORE (vectp_f.8_51, 128B, _71, { 1, 2, 1, 2 }, 0);
>> +               .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2 }, 0);
>> +               ...
>> +
>> +             If a non-final iteration processes non-VF elements:
>> +
>> +               ...
>> +               .LEN_STORE (vectp_f.8_51, 128B, _71, { 1, 2, 1, 2 }, 0);
>> +               if (_71 % 2 == 0)
>> +                 .LEN_STORE (vectp_f.8_56, 128B, _72, { 1, 2, 1, 2 }, 0);
>> +               else
>> +                 .LEN_STORE (vectp_f.8_56, 128B, _72, { 2, 1, 2, 1 }, 0);
>> +               ...
>> +
>> +             This is the simple case of 2-element interleaved vector SLP.
>> +             If we consider other interleaved vectors, the situation becomes
>> +             more complicated.  */
>> +          res_len = gimple_build (header_seq, MIN_EXPR, iv_type,
>> +                                  index_before_incr, vf_step);
>> +          if (rgc->max_nscalars_per_iter != 1)
>> +            LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P (loop_vinfo) = true;
>> +        }
>> +      gassign *assign = gimple_build_assign (ctrl, res_len);
>> +      gimple_seq_add_stmt (header_seq, assign);
>> +    }
>> +
>> +  return index_after_incr;
>> +}
>> +
>>  /* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
>>     for all the rgroup controls in RGC and return a control that is nonzero
>>     when the loop needs to iterate.  Add any new preheader statements to
>> @@ -704,6 +1051,10 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
>>
>>    bool use_masks_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
>>    tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
>> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>> +  bool use_vl_p = !use_masks_p
>> +                  && direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
>> +                                                     OPTIMIZE_FOR_SPEED);
>>    unsigned int compare_precision = TYPE_PRECISION (compare_type);
>>    tree orig_niters = niters;
>>
>> @@ -753,17 +1104,34 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
>>            continue;
>>          }
>>
>> +      if (use_vl_p && rgc->max_nscalars_per_iter == 1
>> +          && rgc != &LOOP_VINFO_LENS (loop_vinfo)[0])
>> +        {
>> +          rgroup_controls *sub_rgc
>> +            = &(*controls)[nmasks / rgc->controls.length () - 1];
>> +          if (!sub_rgc->controls.is_empty ())
>> +            {
>> +              vect_adjust_loop_lens (iv_type, &header_seq, rgc, sub_rgc);
>> +              continue;
>> +            }
>> +        }
>> +
>>        /* See whether zero-based IV would ever generate all-false masks
>>           or zero length before wrapping around.  */
>>        bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
>>
>>        /* Set up all controls for this group.  */
>> -      test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
>> -                                                   &preheader_seq,
>> -                                                   &header_seq,
>> -                                                   loop_cond_gsi, rgc,
>> -                                                   niters, niters_skip,
>> -                                                   might_wrap_p);
>> +      if (use_vl_p)
>> +        test_ctrl
>> +          = vect_set_loop_controls_by_select_vl (loop, loop_vinfo,
>> +                                                 &preheader_seq, &header_seq,
>> +                                                 rgc, niters);
>> +      else
>> +        test_ctrl
>> +          = vect_set_loop_controls_directly (loop, loop_vinfo, &preheader_seq,
>> +                                             &header_seq, loop_cond_gsi, rgc,
>> +                                             niters, niters_skip,
>> +                                             might_wrap_p);
>>      }
>>
>>    /* Emit all accumulated statements.  */
>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
>> index ed0166fedab..fe6af4286bf 100644
>> --- a/gcc/tree-vect-loop.cc
>> +++ b/gcc/tree-vect-loop.cc
>> @@ -973,6 +973,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>>      vectorizable (false),
>>      can_use_partial_vectors_p (param_vect_partial_vector_usage != 0),
>>      using_partial_vectors_p (false),
>> +    using_select_vl_p (false),
>> +    using_slp_adjusted_len_p (false),
>>      epil_using_partial_vectors_p (false),
>>      partial_load_store_bias (0),
>>      peeling_for_gaps (false),
>> @@ -10361,15 +10363,18 @@ vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>>  }
>>
>>  /* Given a complete set of length LENS, extract length number INDEX for an
>> -   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
>> +   rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.
>> +   Insert any set-up statements before GSI.  */
>>
>>  tree
>> -vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>> -                   unsigned int nvectors, unsigned int index)
>> +vect_get_loop_len (loop_vec_info loop_vinfo, gimple_stmt_iterator *gsi,
>> +                   vec_loop_lens *lens, unsigned int nvectors, tree vectype,
>> +                   unsigned int index)
>>  {
>>    rgroup_controls *rgl = &(*lens)[nvectors - 1];
>>    bool use_bias_adjusted_len =
>>      LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) != 0;
>> +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
>>
>>    /* Populate the rgroup's len array, if this is the first time we've
>>       used it.  */
>> @@ -10400,6 +10405,26 @@ vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
>>
>>    if (use_bias_adjusted_len)
>>      return rgl->bias_adjusted_ctrl;
>> +  else if (LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P (loop_vinfo))
>> +    {
>> +      tree loop_len = rgl->controls[index];
>> +      poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type);
>> +      poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype);
>> +      if (maybe_ne (nunits1, nunits2))
>> +        {
>> +          /* A loop len for data type X can be reused for data type Y
>> +             if X has N times more elements than Y and if Y's elements
>> +             are N times bigger than X's.  */
>> +          gcc_assert (multiple_p (nunits1, nunits2));
>> +          unsigned int factor = exact_div (nunits1, nunits2).to_constant ();
>> +          gimple_seq seq = NULL;
>> +          loop_len = gimple_build (&seq, RDIV_EXPR, iv_type, loop_len,
>> +                                   build_int_cst (iv_type, factor));
>> +          if (seq)
>> +            gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
>> +        }
>> +      return loop_len;
>> +    }
>>    else
>>      return rgl->controls[index];
>>  }
>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>> index 7313191b0db..15b22132bd6 100644
>> --- a/gcc/tree-vect-stmts.cc
>> +++ b/gcc/tree-vect-stmts.cc
>> @@ -3147,6 +3147,61 @@ vect_get_data_ptr_increment (vec_info *vinfo,
>>    return iv_step;
>>  }
>>
>> +/* Prepare the pointer IVs which need to be updated by a variable amount.
>> +   Such a variable amount is the outcome of .SELECT_VL.  In this case, we can
>> +   allow each iteration to process a flexible number of elements as long as
>> +   the number is <= vf elements.
>> +
>> +   Return the data reference according to SELECT_VL.
>> +   If new statements are needed, insert them before GSI.  */
>> +
>> +static tree
>> +get_select_vl_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
>> +                            tree aggr_type, class loop *at_loop, tree offset,
>> +                            tree *dummy, gimple_stmt_iterator *gsi,
>> +                            bool simd_lane_access_p, vec_loop_lens *loop_lens,
>> +                            dr_vec_info *dr_info,
>> +                            vect_memory_access_type memory_access_type)
>> +{
>> +  loop_vec_info loop_vinfo = dyn_cast (vinfo);
>> +  tree step = vect_dr_behavior (vinfo, dr_info)->step;
>> +
>> +  /* TODO: We don't support gather/scatter or load_lanes/store_lanes for
>> +     pointer IVs that are updated by a variable amount, but we will support
>> +     them in the future.  */
>> +  gcc_assert (memory_access_type != VMAT_GATHER_SCATTER
>> +              && memory_access_type != VMAT_LOAD_STORE_LANES);
>> +
>> +  /* When we support the SELECT_VL pattern, we dynamically adjust
>> +     the memory address by the .SELECT_VL result.
>> +
>> +     The result of .SELECT_VL is the number of elements to
>> +     be processed in each iteration.  So the memory address
>> +     adjustment operation should be:
>> +
>> +       bytesize = GET_MODE_SIZE (element_mode (aggr_type));
>> +       addr = addr + .SELECT_VL (ARG..) * bytesize;
>> +  */
>> +  gimple *ptr_incr;
>> +  tree loop_len
>> +    = vect_get_loop_len (loop_vinfo, gsi, loop_lens, 1, aggr_type, 0);
>> +  tree len_type = TREE_TYPE (loop_len);
>> +  poly_uint64 bytesize = GET_MODE_SIZE (element_mode (aggr_type));
>> +  /* Since the outcome of .SELECT_VL is an element count, we should adjust
>> +     it to a byte size so that it can be used in the variable-amount
>> +     adjustment of the address pointer IVs.  */
>> +  tree tmp = fold_build2 (MULT_EXPR, len_type, loop_len,
>> +                          build_int_cst (len_type, bytesize));
>> +  if (tree_int_cst_sgn (step) == -1)
>> +    tmp = fold_build1 (NEGATE_EXPR, len_type, tmp);
>> +  tree bump = make_temp_ssa_name (len_type, NULL, "ivtmp");
>> +  gassign *assign = gimple_build_assign (bump, tmp);
>> +  gsi_insert_before (gsi, assign, GSI_SAME_STMT);
>> +  return vect_create_data_ref_ptr (vinfo, stmt_info, aggr_type, at_loop, offset,
>> +                                   dummy, gsi, &ptr_incr, simd_lane_access_p,
>> +                                   bump);
>> +}
>> +
>>  /* Check and perform vectorization of BUILT_IN_BSWAP{16,32,64,128}.  */
>>
>>  static bool
>> @@ -8547,6 +8602,14 @@ vectorizable_store (vec_info *vinfo,
>>              vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
>>                                           slp_node, &gs_info, &dataref_ptr,
>>                                           &vec_offsets);
>> +          else if (loop_vinfo && LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo)
>> +                   && memory_access_type != VMAT_INVARIANT)
>> +            dataref_ptr
>> +              = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
>> +                                            simd_lane_access_p ? loop : NULL,
>> +                                            offset, &dummy, gsi,
>> +                                            simd_lane_access_p, loop_lens,
>> +                                            dr_info, memory_access_type);
>>            else
>>              dataref_ptr
>>                = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
>> @@ -8795,8 +8858,9 @@ vectorizable_store (vec_info *vinfo,
>>                else if (loop_lens)
>>                  {
>>                    tree final_len
>> -                    = vect_get_loop_len (loop_vinfo, loop_lens,
>> -                                         vec_num * ncopies, vec_num * j + i);
>> +                    = vect_get_loop_len (loop_vinfo, gsi, loop_lens,
>> +                                         vec_num * ncopies, vectype,
>> +                                         vec_num * j + i);
>>                    tree ptr = build_int_cst (ref_type, align * BITS_PER_UNIT);
>>                    machine_mode vmode = TYPE_MODE (vectype);
>>                    opt_machine_mode new_ovmode
>> @@ -9935,6 +9999,13 @@ vectorizable_load (vec_info *vinfo,
>>                                           slp_node, &gs_info, &dataref_ptr,
>>                                           &vec_offsets);
>>              }
>> +          else if (loop_vinfo && LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo)
>> +                   && memory_access_type != VMAT_INVARIANT)
>> +            dataref_ptr
>> +              = get_select_vl_data_ref_ptr (vinfo, stmt_info, aggr_type,
>> +                                            at_loop, offset, &dummy, gsi,
>> +                                            simd_lane_access_p, loop_lens,
>> +                                            dr_info, memory_access_type);
>>            else
>>              dataref_ptr
>>                = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
>> @@ -10151,8 +10222,8 @@ vectorizable_load (vec_info *vinfo,
>>                else if (loop_lens && memory_access_type != VMAT_INVARIANT)
>>                  {
>>                    tree final_len
>> -                    = vect_get_loop_len (loop_vinfo, loop_lens,
>> -                                         vec_num * ncopies,
>> +                    = vect_get_loop_len (loop_vinfo, gsi, loop_lens,
>> +                                         vec_num * ncopies, vectype,
>>                                           vec_num * j + i);
>>                    tree ptr = build_int_cst (ref_type,
>>                                              align * BITS_PER_UNIT);
>> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
>> index 9cf2fb23fe3..3d21e23513d 100644
>> --- a/gcc/tree-vectorizer.h
>> +++ b/gcc/tree-vectorizer.h
>> @@ -818,6 +818,13 @@ public:
>>       the vector loop can handle fewer than VF scalars.  */
>>    bool using_partial_vectors_p;
>>
>> +  /* True if we've decided to use SELECT_VL to get the number of active
>> +     elements in a vector loop to be updated.  */
>> +  bool using_select_vl_p;
>> +
>> +  /* True if we use an adjusted loop length for SLP.  */
>> +  bool using_slp_adjusted_len_p;
>> +
>>    /* True if we've decided to use partially-populated vectors for the
>>       epilogue of loop.  */
>>    bool epil_using_partial_vectors_p;
>> @@ -890,6 +897,8 @@ public:
>>  #define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
>>  #define LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P(L) (L)->can_use_partial_vectors_p
>>  #define LOOP_VINFO_USING_PARTIAL_VECTORS_P(L) (L)->using_partial_vectors_p
>> +#define LOOP_VINFO_USING_SELECT_VL_P(L) (L)->using_select_vl_p
>> +#define LOOP_VINFO_USING_SLP_ADJUSTED_LEN_P(L) (L)->using_slp_adjusted_len_p
>>  #define LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P(L)                     \
>>    (L)->epil_using_partial_vectors_p
>>  #define LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS(L) (L)->partial_load_store_bias
>> @@ -2293,7 +2302,8 @@ extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
>>                                  unsigned int, tree, unsigned int);
>>  extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
>>                                    tree, unsigned int);
>> -extern tree vect_get_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
>> +extern tree vect_get_loop_len (loop_vec_info, gimple_stmt_iterator *,
>> +                               vec_loop_lens *, unsigned int, tree,
>>                                 unsigned int);
>>  extern gimple_seq vect_gen_len (tree, tree, tree, tree);
>>  extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);