public inbox for gcc-patches@gcc.gnu.org
From: "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai>
To: rguenther <rguenther@suse.de>
Cc: richard.sandiford <richard.sandiford@arm.com>,
	 gcc-patches <gcc-patches@gcc.gnu.org>,
	 jeffreyalaw <jeffreyalaw@gmail.com>
Subject: Re: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization
Date: Wed, 12 Apr 2023 17:20:02 +0800
Message-ID: <AFB3AEBAFA05F11E+2023041217200154694356@rivai.ai>
In-Reply-To: <2023041217154958074655@rivai.ai>


Sorry, there was a typo in my previous mail. The sentence
"We can predicate vadd.vv with v1 - v31."
====>
"We can't predicate vadd.vv with v1 - v31."


juzhe.zhong@rivai.ai
 
From: juzhe.zhong@rivai.ai
Date: 2023-04-12 17:15
To: rguenther
CC: richard.sandiford; gcc-patches; jeffreyalaw
Subject: Re: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization

>> Thanks for the detailed explanation.  Just to clarify - with RVV
>> there's only a single mask register, v0.t, or did you want to
>> say an instruction can only specify a single mask register?

RVV has 32 vector registers in total (v0~v31).
Any of them can hold either vector data or a mask value.
We also have mask-logical instructions, for example a mask-and between any two vector registers.

However, a vector operation such as vadd.vv can only be predicated by v0 (written v0.t in asm), which is the first vector register.
We can predicate vadd.vv with v1 - v31.

So, you can imagine that every time we want to use a mask to predicate a vector operation, we must first put the mask value
into v0.

So, we can write an instruction sequence like this:

vmseq v0,v8,v9 (store mask value to v0)
vmslt v1,v10,v11 (store mask value to v1)
vmand v0,v0,v1
vadd.vv ...v0.t (the predicate mask must always be v0).
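
To make the sequence above concrete, here is a minimal scalar sketch in C. It is only a model, not RVV intrinsics: the names mask_t, mask_eq, mask_lt and masked_add are hypothetical, bit i of a mask stands for element i, and inactive elements are assumed to be left undisturbed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model: bit i of a mask predicates element i.  */
typedef uint32_t mask_t;

/* Models a vmseq-style compare: bit i is set when a[i] == b[i].  */
mask_t mask_eq (const int32_t *a, const int32_t *b, size_t vl)
{
  assert (vl <= 32);
  mask_t m = 0;
  for (size_t i = 0; i < vl; i++)
    if (a[i] == b[i])
      m |= (mask_t) 1 << i;
  return m;
}

/* Models a vmslt-style compare: bit i is set when a[i] < b[i].  */
mask_t mask_lt (const int32_t *a, const int32_t *b, size_t vl)
{
  assert (vl <= 32);
  mask_t m = 0;
  for (size_t i = 0; i < vl; i++)
    if (a[i] < b[i])
      m |= (mask_t) 1 << i;
  return m;
}

/* Models a vadd.vv predicated by v0.t: only elements whose bit is set
   in V0 are written; the others keep their old value (assumption).  */
void masked_add (int32_t *dst, const int32_t *x, const int32_t *y,
                 mask_t v0, size_t vl)
{
  assert (vl <= 32);
  for (size_t i = 0; i < vl; i++)
    if ((v0 >> i) & 1)
      dst[i] = x[i] + y[i];
}
```

The two compare results can live in any "register" (here, any variable), but the value handed to masked_add plays the role of v0, so the two masks are combined with a plain `&` (the vmand step) before the add.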

>> ARM SVE would have a loop control mask and a separate mask
>> for the if (cond[i]) which would be combined with a mask-and
>> instruction to a third mask which is then used on the
>> predicated instructions.

Yeah, I know. The ARM SVE way is more elegant than what RVV does.
However, RVV can't follow this flow.
We don't have a "whilelo" instruction to generate a loop control mask.
We can only do loop control with the length generated by vsetvl.
And we can only use "v0" as the mask to predicate vadd.vv, and a mask value can only be generated by comparison or mask-logical instructions.
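
As a rough illustration of loop control by length, here is a scalar sketch in C. It assumes the MIN (remaining, vf) semantics that the patch documents for WHILE_LEN; the function names and the vf parameter are hypothetical, and the pointers advance by the length just processed rather than by a full vector as in the asm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models WHILE_LEN / vsetvl under the assumption that it returns
   MIN (remaining, vf).  */
size_t while_len (size_t remaining, size_t vf)
{
  return remaining < vf ? remaining : vf;
}

/* a[i] += b[i] for n elements, strip-mined with a decrementing IV.
   Returns the number of "vector" iterations that were executed.  */
size_t add_strip_mined (int32_t *a, const int32_t *b, size_t n, size_t vf)
{
  assert (vf > 0);
  size_t iterations = 0;
  size_t remaining = n;
  while (remaining != 0)
    {
      /* Number of elements handled by this iteration.  */
      size_t len = while_len (remaining, vf);
      for (size_t i = 0; i < len; i++)
        a[i] += b[i];
      a += len;
      b += len;
      /* len <= remaining, so this can never wrap below 0.  */
      remaining -= len;
      iterations++;
    }
  return iterations;
}
```

With n = 10 and vf = 4 the lengths are 4, 4 and 2, so no mask is needed for the tail.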

>> PowerPC and s390x might be able to use WHILE_LEN as well (though
>> they only have LEN variants of loads and stores) - of course
>> only "simulating it".  For the fixed-vector-length ISAs the
>> predicated vector loop IMHO makes most sense for the epilogue to
>> handle low-trip loops better.

Yeah, I wonder how they do the flow control (if (cond[i])).
For RVV, you can imagine I will need to add LEN_MASK_LOAD/LEN_MASK_STORE patterns (with the length generated by WHILE_LEN and the mask generated by a comparison).
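
To show what combining a length and a mask means for the if (cond[i]) loop, here is a scalar sketch in C. It is only a model of the idea: the function name and the vf parameter are hypothetical, the length is assumed to be MIN (remaining, vf), and nothing here is the GCC definition of LEN_MASK_LOAD/LEN_MASK_STORE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* if (cond[i]) a[i] += b[i], with two separate predicates:
   - the length bounds how many elements an iteration touches;
   - the mask (cond[i] != 0) selects which of them are updated.  */
void cond_add_len_mask (int32_t *a, const int32_t *b, const int32_t *cond,
                        size_t n, size_t vf)
{
  assert (vf > 0);
  size_t remaining = n;
  while (remaining != 0)
    {
      /* Length predicate, assumed to be MIN (remaining, vf).  */
      size_t len = remaining < vf ? remaining : vf;
      for (size_t i = 0; i < len; i++)
        {
          /* Mask predicate, produced by a comparison.  */
          int active = cond[i] != 0;
          if (active)
            a[i] += b[i];
        }
      a += len;
      b += len;
      cond += len;
      remaining -= len;
    }
}
```

An element is written only if it is both inside the length and active in the mask.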

I think we can CC the IBM folks to see whether we can make WHILE_LEN work
for both IBM and RVV?

Thanks.


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-04-12 16:42
To: juzhe.zhong@rivai.ai
CC: richard.sandiford; gcc-patches; jeffreyalaw
Subject: Re: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization
On Wed, 12 Apr 2023, juzhe.zhong@rivai.ai wrote:
 
> Thank you very much for the reply.
> 
> WHILE_LEN is the pattern that calculates the number of vector elements that will be updated in each iteration.
> For RVV, we use the vsetvl instruction to calculate that number.
> 
> WHILE_ULT cannot work for RVV since WHILE_ULT generates a mask to predicate vector operations, but RVV does not
> use a mask to do the loop strip mining (RVV only uses masks for control flow inside the loop).
> 
> Here is an example of WHILE_ULT working on ARM SVE:
> https://godbolt.org/z/jKsT8E1hP 
> 
> The first example is:
> void foo (int32_t * __restrict a, int32_t * __restrict b, int n)
> {
>     for (int i = 0; i < n; i++)
>       a[i] = a[i] + b[i];
> }
> 
> ARM SVE:
> foo:
>         cmp     w2, 0
>         ble     .L1
>         mov     x3, 0
>         cntw    x4
>         whilelo p0.s, wzr, w2
> .L3:
>         ld1w    z1.s, p0/z, [x0, x3, lsl 2]
>         ld1w    z0.s, p0/z, [x1, x3, lsl 2]
>         add     z0.s, z0.s, z1.s
>         st1w    z0.s, p0, [x0, x3, lsl 2]
>         add     x3, x3, x4
>         whilelo p0.s, w3, w2
>         b.any   .L3
> .L1:
>         ret
> 
> Here, whilelo generates the mask from w3 and w2.
> For example, suppose w3 = 0 and w2 = 3 (and the machine vector length > 3).
> Then it generates the mask 0b111 to predicate the loads and stores.
> 
> For RVV, we can't do that since RVV doesn't have a whilelo instruction to generate a predicate mask.
> Also, we can't use a mask as the predicate for loop strip mining since RVV only has 1 single mask
> to handle flow control inside the loop.
> 
> Instead, we use vsetvl to do the strip mining. Based on this, for the same C code, the ideal RVV asm according to the RVV ISA should be:
> 
> preheader:
> a0 = n (the total number of scalar elements to be processed).
>  .....
> .L3:
>         vsetvli a5,a0,e32,m1,ta,ma    ====> generated by the WHILE_LEN pattern; calculates the number of elements to be updated
>         vle32.v v1,0(a4)
>         sub     a0,a0,a5      ============> decrement the induction variable by a5 (generated by WHILE_LEN)
>         ....   
> 
>         vadd.vv....
>         vse32.v v1,0(a3)
>         add     a4,a4,a2
>         add     a3,a3,a2
>         bne     a0,zero,.L3
> .L1:
>         ret
> 
> So you can see that if n = 3, as in the ARM SVE example (suppose the machine vector length > 3), then vsetvli a5,a0,e32,m1,ta,ma will
> generate a5 = 3, and vle32.v/vadd.vv/vse32.v all operate only on element 0, element 1 and element 2.
> 
> Besides, WHILE_LEN is defined so that its result never exceeds the input operand, which is "a0".
> That means sub a0,a0,a5 can never make a0 underflow below 0.
> 
> I have tried to return Pmode in TARGET_VECTORIZE_GET_MASK_MODE 
> target hook and then use WHILE_ULT. 
> 
> But there are 2 issues:
> One is that GCC currently iterates from a 0-based IV up to TEST_LIMIT, whereas the optimal flow for RVV that I showed above
> starts from "n" and keeps decreasing it until 0.  Trying to fit the current GCC flow, RVV needs more instructions to do the loop strip mining.
> 
> The second is that the mode returned by TARGET_VECTORIZE_GET_MASK_MODE
> specifies not only the dest mode for WHILE_ULT but also the mask mode for flow control.
> If we return Pmode, which is used as the length for RVV, we can't use a mask mode like VNx2BI to do the flow control predicate.
> Here is another example:
> void foo2 (int32_t * __restrict a, int32_t * __restrict b, int32_t * restrict cond, int n)
> {
>     for (int i = 0; i < n; i++)
>       if (cond[i])
>         a[i] = a[i] + b[i];
> }
> 
> ARM SVE:
>         ld1w    z0.s, p0/z, [x2, x4, lsl 2]
>         cmpne   p0.s, p0/z, z0.s, #0
>         ld1w    z0.s, p0/z, [x0, x4, lsl 2]
>         ld1w    z1.s, p0/z, [x1, x4, lsl 2]
>         add     z0.s, z0.s, z1.s
>         st1w    z0.s, p0, [x0, x4, lsl 2]
>         add     x4, x4, x5
>         whilelo p0.s, w4, w3
>         b.any   .L8
> 
> Here we can see that ARM uses a mask for both loop strip mining and flow control.
> 
> Whereas RVV uses the length generated by vsetvl (WHILE_LEN) to do the loop strip mining and a mask generated by a comparison to do the flow control.
> 
> So this is the asm generated by my downstream LLVM/GCC:
> .L3:
>         vsetvli a6,a3,e32,m1,ta,mu   ==========> generate the length to predicate RVV operations.
>         vle32.v v0,(a2)
>         sub     a3,a3,a6      ==========> decrease the induction variable until 0.
>         vmsne.vi        v0,v0,0   ==========> generate the mask to predicate RVV operations.
>         vle32.v v24,(a0),v0.t   ===========> v0.t is the only mask register that can predicate an RVV operation
>         vle32.v v25,(a1),v0.t
>         vadd.vv v24,v24,v25
>         vse32.v v24,(a0),v0.t
>         add     a2,a2,a4
>         add     a0,a0,a4
>         add     a1,a1,a4
>         bne     a3,zero,.L3
> .L1:
>         ret
> 
> 
> This is how RVV works.
> Feel free to comment if you have any questions.
 
Thanks for the detailed explanation.  Just to clarify - with RVV
there's only a single mask register, v0.t, or did you want to
say an instruction can only specify a single mask register?
ARM SVE would have a loop control mask and a separate mask
for the if (cond[i]) which would be combined with a mask-and
instruction to a third mask which is then used on the
predicated instructions.
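
As an illustration of that SVE-style flow, here is a minimal scalar sketch in C. It assumes the WHILE_ULT semantics where lane i is active while index + i < limit (ignoring wrap-around); the function names and the 32-lane limit are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Models WHILE_ULT / whilelo: bit i is set when index + i < limit.
   Wrap-around of index + i is ignored in this sketch.  */
uint32_t while_ult (uint64_t index, uint64_t limit, unsigned lanes)
{
  assert (lanes <= 32);
  uint32_t m = 0;
  for (unsigned i = 0; i < lanes; i++)
    if (index + i < limit)
      m |= (uint32_t) 1 << i;
  return m;
}

/* Models the compare for if (cond[i]): bit i is set when v[i] != 0.  */
uint32_t nonzero_mask (const int32_t *v, unsigned lanes)
{
  assert (lanes <= 32);
  uint32_t m = 0;
  for (unsigned i = 0; i < lanes; i++)
    if (v[i] != 0)
      m |= (uint32_t) 1 << i;
  return m;
}

/* The third mask: a lane is active only if both the loop control mask
   and the condition mask have it set.  */
uint32_t combined_mask (uint32_t loop_mask, uint32_t cond_mask)
{
  return loop_mask & cond_mask;
}
```

For example, while_ult (0, 3, 8) gives 0b111, and combining it with a condition mask keeps only the lanes below the loop bound whose condition is true.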
 
For AVX512 WHILE_ULT is a better match since we need a mask in the
end (but WHILE_ULT isn't a very good match either, so I'm still
working on masked loop support there).
 
PowerPC and s390x might be able to use WHILE_LEN as well (though
they only have LEN variants of loads and stores) - of course
only "simulating it".  For the fixed-vector-length ISAs the
predicated vector loop IMHO makes most sense for the epilogue to
handle low-trip loops better.
 
Richard.
 
> Thanks.
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-04-12 15:00
> To: Richard Sandiford
> CC: juzhe.zhong@rivai.ai; gcc-patches; jeffreyalaw
> Subject: Re: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization
> On Tue, 11 Apr 2023, Richard Sandiford wrote:
>  
> > "juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> > > Hi, Richards. 
> > > Kindly Ping this patch. 
> > > This is the most important patch for RVV auto-vectorization support.
> > > Bootstrapped on X86 has passed.
> > 
> > Can it wait for GCC 14?  It doesn't seem like stage 4 material.
> > 
> > Also, pinging after 5 days seems a bit soon.  It's been a 4-day
> > holiday weekend for much of Europe.
>  
> Also can you explain why using WHILE_ULT is not possible?  (I've
> successfully - to some extent - done that for AVX512 for example)
>  
> The patch lacks the description of what WHILE_LEN actually is.
>  
> Richard.
>  
> > Thanks,
> > Richard
> > 
> > > Feel free to comment.
> > >
> > > Thanks.
> > >
> > >
> > > juzhe.zhong@rivai.ai
> > >  
> > > From: juzhe.zhong
> > > Date: 2023-04-07 09:47
> > > To: gcc-patches
> > > CC: richard.sandiford; rguenther; jeffreyalaw; Juzhe-Zhong
> > > Subject: [PATCH] VECT: Add WHILE_LEN pattern for decrement IV support for auto-vectorization
> > > From: Juzhe-Zhong <juzhe.zhong@rivai.ai>
> > >  
> > > This patch adds the WHILE_LEN pattern.
> > > It's inspired by the simple "vvaddint32.s" example in the RVV ISA spec:
> > > https://github.com/riscv/riscv-v-spec/blob/master/example/vvaddint32.s
> > >  
> > > More details are in "vect_set_loop_controls_by_while_len" implementation
> > > and comments.
> > >  
> > > Consider the following case:
> > > #define N 16
> > > int src[N];
> > > int dest[N];
> > >  
> > > void
> > > foo (int n)
> > > {
> > >   for (int i = 0; i < n; i++)
> > >     dest[i] = src[i];
> > > }
> > >  
> > > -march=rv64gcv -O3 --param riscv-autovec-preference=scalable -fno-vect-cost-model -fno-tree-loop-distribute-patterns:
> > >  
> > > foo:        
> > >         ble     a0,zero,.L1
> > >         lui     a4,%hi(.LANCHOR0)
> > >         addi    a4,a4,%lo(.LANCHOR0)
> > >         addi    a3,a4,64
> > >         csrr    a2,vlenb
> > > .L3:
> > >         vsetvli a5,a0,e32,m1,ta,ma
> > >         vle32.v v1,0(a4)
> > >         sub     a0,a0,a5
> > >         vse32.v v1,0(a3)
> > >         add     a4,a4,a2
> > >         add     a3,a3,a2
> > >         bne     a0,zero,.L3
> > > .L1:
> > >         ret
> > >  
> > > gcc/ChangeLog:
> > >  
> > >         * doc/md.texi: Add WHILE_LEN support.
> > >         * internal-fn.cc (while_len_direct): Ditto.
> > >         (expand_while_len_optab_fn): Ditto.
> > >         (direct_while_len_optab_supported_p): Ditto.
> > >         * internal-fn.def (WHILE_LEN): Ditto.
> > >         * optabs.def (OPTAB_D): Ditto.
> > >         * tree-ssa-loop-manip.cc (create_iv): Ditto.
> > >         * tree-ssa-loop-manip.h (create_iv): Ditto.
> > >         * tree-vect-loop-manip.cc (vect_set_loop_controls_by_while_len): Ditto.
> > >         (vect_set_loop_condition_partial_vectors): Ditto.
> > >         * tree-vect-loop.cc (vect_get_loop_len): Ditto.
> > >         * tree-vect-stmts.cc (vectorizable_store): Ditto.
> > >         (vectorizable_load): Ditto.
> > >         * tree-vectorizer.h (vect_get_loop_len): Ditto.
> > >  
> > > ---
> > > gcc/doc/md.texi             |  14 +++
> > > gcc/internal-fn.cc          |  29 ++++++
> > > gcc/internal-fn.def         |   1 +
> > > gcc/optabs.def              |   1 +
> > > gcc/tree-ssa-loop-manip.cc  |   4 +-
> > > gcc/tree-ssa-loop-manip.h   |   2 +-
> > > gcc/tree-vect-loop-manip.cc | 186 ++++++++++++++++++++++++++++++++++--
> > > gcc/tree-vect-loop.cc       |  35 +++++--
> > > gcc/tree-vect-stmts.cc      |   9 +-
> > > gcc/tree-vectorizer.h       |   4 +-
> > > 10 files changed, 264 insertions(+), 21 deletions(-)
> > >  
> > > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > > index 8e3113599fd..72178ab014c 100644
> > > --- a/gcc/doc/md.texi
> > > +++ b/gcc/doc/md.texi
> > > @@ -4965,6 +4965,20 @@ for (i = 1; i < operand3; i++)
> > >    operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
> > > @end smallexample
> > > +@cindex @code{while_len@var{m}@var{n}} instruction pattern
> > > +@item @code{while_len@var{m}@var{n}}
> > > +Set operand 0 to the number of active elements in vector will be updated value.
> > > +operand 1 is the total elements need to be updated value.
> > > +operand 2 is the vectorization factor.
> > > +The operation is equivalent to:
> > > +
> > > +@smallexample
> > > +operand0 = MIN (operand1, operand2);
> > > +operand2 can be const_poly_int or poly_int related to vector mode size.
> > > +Some target like RISC-V has a standalone instruction to get MIN (n, MODE SIZE) so
> > > +that we can reduce a use of general purpose register.
> > > +@end smallexample
> > > +
> > > @cindex @code{check_raw_ptrs@var{m}} instruction pattern
> > > @item @samp{check_raw_ptrs@var{m}}
> > > Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
> > > diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> > > index 6e81dc05e0e..5f44def90d3 100644
> > > --- a/gcc/internal-fn.cc
> > > +++ b/gcc/internal-fn.cc
> > > @@ -127,6 +127,7 @@ init_internal_fns ()
> > > #define cond_binary_direct { 1, 1, true }
> > > #define cond_ternary_direct { 1, 1, true }
> > > #define while_direct { 0, 2, false }
> > > +#define while_len_direct { 0, 0, false }
> > > #define fold_extract_direct { 2, 2, false }
> > > #define fold_left_direct { 1, 1, false }
> > > #define mask_fold_left_direct { 1, 1, false }
> > > @@ -3702,6 +3703,33 @@ expand_while_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
> > >      emit_move_insn (lhs_rtx, ops[0].value);
> > > }
> > > +/* Expand WHILE_LEN call STMT using optab OPTAB.  */
> > > +static void
> > > +expand_while_len_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
> > > +{
> > > +  expand_operand ops[3];
> > > +  tree rhs_type[2];
> > > +
> > > +  tree lhs = gimple_call_lhs (stmt);
> > > +  tree lhs_type = TREE_TYPE (lhs);
> > > +  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> > > +  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
> > > +
> > > +  for (unsigned int i = 0; i < gimple_call_num_args (stmt); ++i)
> > > +    {
> > > +      tree rhs = gimple_call_arg (stmt, i);
> > > +      rhs_type[i] = TREE_TYPE (rhs);
> > > +      rtx rhs_rtx = expand_normal (rhs);
> > > +      create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type[i]));
> > > +    }
> > > +
> > > +  insn_code icode = direct_optab_handler (optab, TYPE_MODE (rhs_type[0]));
> > > +
> > > +  expand_insn (icode, 3, ops);
> > > +  if (!rtx_equal_p (lhs_rtx, ops[0].value))
> > > +    emit_move_insn (lhs_rtx, ops[0].value);
> > > +}
> > > +
> > > /* Expand a call to a convert-like optab using the operands in STMT.
> > >     FN has a single output operand and NARGS input operands.  */
> > > @@ -3843,6 +3871,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
> > > #define direct_scatter_store_optab_supported_p convert_optab_supported_p
> > > #define direct_len_store_optab_supported_p direct_optab_supported_p
> > > #define direct_while_optab_supported_p convert_optab_supported_p
> > > +#define direct_while_len_optab_supported_p direct_optab_supported_p
> > > #define direct_fold_extract_optab_supported_p direct_optab_supported_p
> > > #define direct_fold_left_optab_supported_p direct_optab_supported_p
> > > #define direct_mask_fold_left_optab_supported_p direct_optab_supported_p
> > > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > > index 7fe742c2ae7..3a933abff5d 100644
> > > --- a/gcc/internal-fn.def
> > > +++ b/gcc/internal-fn.def
> > > @@ -153,6 +153,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
> > > DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
> > > DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
> > > +DEF_INTERNAL_OPTAB_FN (WHILE_LEN, ECF_CONST | ECF_NOTHROW, while_len, while_len)
> > > DEF_INTERNAL_OPTAB_FN (CHECK_RAW_PTRS, ECF_CONST | ECF_NOTHROW,
> > >        check_raw_ptrs, check_ptrs)
> > > DEF_INTERNAL_OPTAB_FN (CHECK_WAR_PTRS, ECF_CONST | ECF_NOTHROW,
> > > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > > index 695f5911b30..f5938bd2c24 100644
> > > --- a/gcc/optabs.def
> > > +++ b/gcc/optabs.def
> > > @@ -476,3 +476,4 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
> > > OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
> > > OPTAB_D (len_load_optab, "len_load_$a")
> > > OPTAB_D (len_store_optab, "len_store_$a")
> > > +OPTAB_D (while_len_optab, "while_len$a")
> > > diff --git a/gcc/tree-ssa-loop-manip.cc b/gcc/tree-ssa-loop-manip.cc
> > > index 09acc1c94cc..cdbf280e249 100644
> > > --- a/gcc/tree-ssa-loop-manip.cc
> > > +++ b/gcc/tree-ssa-loop-manip.cc
> > > @@ -59,14 +59,14 @@ static bitmap_obstack loop_renamer_obstack;
> > > void
> > > create_iv (tree base, tree step, tree var, class loop *loop,
> > >    gimple_stmt_iterator *incr_pos, bool after,
> > > -    tree *var_before, tree *var_after)
> > > +    tree *var_before, tree *var_after, enum tree_code code)
> > > {
> > >    gassign *stmt;
> > >    gphi *phi;
> > >    tree initial, step1;
> > >    gimple_seq stmts;
> > >    tree vb, va;
> > > -  enum tree_code incr_op = PLUS_EXPR;
> > > +  enum tree_code incr_op = code;
> > >    edge pe = loop_preheader_edge (loop);
> > >    if (var != NULL_TREE)
> > > diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
> > > index d49273a3987..da755320a3a 100644
> > > --- a/gcc/tree-ssa-loop-manip.h
> > > +++ b/gcc/tree-ssa-loop-manip.h
> > > @@ -23,7 +23,7 @@ along with GCC; see the file COPYING3.  If not see
> > > typedef void (*transform_callback)(class loop *, void *);
> > > extern void create_iv (tree, tree, tree, class loop *, gimple_stmt_iterator *,
> > > -        bool, tree *, tree *);
> > > +        bool, tree *, tree *, enum tree_code = PLUS_EXPR);
> > > extern void rewrite_into_loop_closed_ssa (bitmap, unsigned);
> > > extern void verify_loop_closed_ssa (bool, class loop * = NULL);
> > > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> > > index f60fa50e8f4..f3cd6c51d2e 100644
> > > --- a/gcc/tree-vect-loop-manip.cc
> > > +++ b/gcc/tree-vect-loop-manip.cc
> > > @@ -682,6 +682,173 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
> > >    return next_ctrl;
> > > }
> > > +/* Helper for vect_set_loop_condition_partial_vectors.  Generate definitions
> > > +   for all the rgroup controls in RGC and return a control that is nonzero
> > > +   when the loop needs to iterate.  Add any new preheader statements to
> > > +   PREHEADER_SEQ.  Use LOOP_COND_GSI to insert code before the exit gcond.
> > > +
> > > +   RGC belongs to loop LOOP.  The loop originally iterated NITERS
> > > +   times and has been vectorized according to LOOP_VINFO.
> > > +
> > > +   Unlike vect_set_loop_controls_directly which is iterating from 0-based IV
> > > +   to TEST_LIMIT - bias.
> > > +
> > > +   In vect_set_loop_controls_by_while_len, we are iterating from start at
> > > +   IV = TEST_LIMIT - bias and keep subtract IV by the length calculated by
> > > +   IFN_WHILE_LEN pattern.
> > > +
> > > +   Note: the cost of the code generated by this function is modeled
> > > +   by vect_estimate_min_profitable_iters, so changes here may need
> > > +   corresponding changes there.
> > > +
> > > +   1. Single rgroup, the Gimple IR should be:
> > > +
> > > + <bb 3>
> > > + _19 = (unsigned long) n_5(D);
> > > + ...
> > > +
> > > + <bb 4>:
> > > + ...
> > > + # ivtmp_20 = PHI <ivtmp_21(4), _19(3)>
> > > + ...
> > > + _22 = .WHILE_LEN (ivtmp_20, vf);
> > > + ...
> > > + vector statement (use _22);
> > > + ...
> > > + ivtmp_21 = ivtmp_20 - _22;
> > > + ...
> > > + if (ivtmp_21 != 0)
> > > +   goto <bb 4>; [75.00%]
> > > + else
> > > +   goto <bb 5>; [25.00%]
> > > +
> > > + <bb 5>
> > > + return;
> > > +
> > > +   Note: IFN_WHILE_LEN will guarantee "ivtmp_21 = ivtmp_20 - _22" never
> > > +   underflow 0.
> > > +
> > > +   2. Multiple rgroup, the Gimple IR should be:
> > > +
> > > + <bb 3>
> > > + _70 = (unsigned long) bnd.7_52;
> > > + _71 = _70 * 2;
> > > + _72 = MAX_EXPR <_71, 4>;
> > > + _73 = _72 + 18446744073709551612;
> > > + ...
> > > +
> > > + <bb 4>:
> > > + ...
> > > + # ivtmp_74 = PHI <ivtmp_75(6), _73(12)>
> > > + # ivtmp_77 = PHI <ivtmp_78(6), _71(12)>
> > > + _76 = .WHILE_LEN (ivtmp_74, vf * nitems_per_ctrl);
> > > + _79 = .WHILE_LEN (ivtmp_77, vf * nitems_per_ctrl);
> > > + ...
> > > + vector statement (use _79);
> > > + ...
> > > + vector statement (use _76);
> > > + ...
> > > + _65 = _79 / 2;
> > > + vector statement (use _65);
> > > + ...
> > > + _68 = _76 / 2;
> > > + vector statement (use _68);
> > > + ...
> > > + ivtmp_78 = ivtmp_77 - _79;
> > > + ivtmp_75 = ivtmp_74 - _76;
> > > + ...
> > > + if (ivtmp_78 != 0)
> > > +   goto <bb 4>; [75.00%]
> > > + else
> > > +   goto <bb 5>; [25.00%]
> > > +
> > > + <bb 5>
> > > + return;
> > > +
> > > +*/
> > > +
> > > +static tree
> > > +vect_set_loop_controls_by_while_len (class loop *loop, loop_vec_info loop_vinfo,
> > > +      gimple_seq *preheader_seq,
> > > +      gimple_seq *header_seq,
> > > +      rgroup_controls *rgc, tree niters)
> > > +{
> > > +  tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> > > +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> > > +  /* We are not allowing masked approach in WHILE_LEN.  */
> > > +  gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
> > > +
> > > +  tree ctrl_type = rgc->type;
> > > +  unsigned int nitems_per_iter = rgc->max_nscalars_per_iter * rgc->factor;
> > > +  poly_uint64 nitems_per_ctrl = TYPE_VECTOR_SUBPARTS (ctrl_type) * rgc->factor;
> > > +  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > > +
> > > +  /* Calculate the maximum number of item values that the rgroup
> > > +     handles in total, the number that it handles for each iteration
> > > +     of the vector loop.  */
> > > +  tree nitems_total = niters;
> > > +  if (nitems_per_iter != 1)
> > > +    {
> > > +      /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
> > > + these multiplications don't overflow.  */
> > > +      tree compare_factor = build_int_cst (compare_type, nitems_per_iter);
> > > +      nitems_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
> > > +    nitems_total, compare_factor);
> > > +    }
> > > +
> > > +  /* Convert the comparison value to the IV type (either a no-op or
> > > +     a promotion).  */
> > > +  nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
> > > +
> > > +  /* Create an induction variable that counts the number of items
> > > +     processed.  */
> > > +  tree index_before_incr, index_after_incr;
> > > +  gimple_stmt_iterator incr_gsi;
> > > +  bool insert_after;
> > > +  standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> > > +
> > > +  /* Test the decremented IV, which will never underflow 0 since we have
> > > +     IFN_WHILE_LEN to guarantee that.  */
> > > +  tree test_limit = nitems_total;
> > > +
> > > +  /* Provide a definition of each control in the group.  */
> > > +  tree ctrl;
> > > +  unsigned int i;
> > > +  FOR_EACH_VEC_ELT_REVERSE (rgc->controls, i, ctrl)
> > > +    {
> > > +      /* Previous controls will cover BIAS items.  This control covers the
> > > + next batch.  */
> > > +      poly_uint64 bias = nitems_per_ctrl * i;
> > > +      tree bias_tree = build_int_cst (iv_type, bias);
> > > +
> > > +      /* Rather than have a new IV that starts at TEST_LIMIT and goes down to
> > > + BIAS, prefer to use the same TEST_LIMIT - BIAS based IV for each
> > > + control and adjust the bound down by BIAS.  */
> > > +      tree this_test_limit = test_limit;
> > > +      if (i != 0)
> > > + {
> > > +   this_test_limit = gimple_build (preheader_seq, MAX_EXPR, iv_type,
> > > +   this_test_limit, bias_tree);
> > > +   this_test_limit = gimple_build (preheader_seq, MINUS_EXPR, iv_type,
> > > +   this_test_limit, bias_tree);
> > > + }
> > > +
> > > +      /* Create decrement IV.  */
> > > +      create_iv (this_test_limit, ctrl, NULL_TREE, loop, &incr_gsi,
> > > + insert_after, &index_before_incr, &index_after_incr,
> > > + MINUS_EXPR);
> > > +
> > > +      poly_uint64 final_vf = vf * nitems_per_iter;
> > > +      tree vf_step = build_int_cst (iv_type, final_vf);
> > > +      tree res_len = gimple_build (header_seq, IFN_WHILE_LEN, iv_type,
> > > +    index_before_incr, vf_step);
> > > +      gassign *assign = gimple_build_assign (ctrl, res_len);
> > > +      gimple_seq_add_stmt (header_seq, assign);
> > > +    }
> > > +
> > > +  return index_after_incr;
> > > +}
> > > +
> > > /* Set up the iteration condition and rgroup controls for LOOP, given
> > >     that LOOP_VINFO_USING_PARTIAL_VECTORS_P is true for the vectorized
> > >     loop.  LOOP_VINFO describes the vectorization of LOOP.  NITERS is
> > > @@ -703,6 +870,7 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
> > >    bool use_masks_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
> > >    tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
> > > +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> > >    unsigned int compare_precision = TYPE_PRECISION (compare_type);
> > >    tree orig_niters = niters;
> > > @@ -757,12 +925,18 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
> > > bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
> > > /* Set up all controls for this group.  */
> > > - test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
> > > -      &preheader_seq,
> > > -      &header_seq,
> > > -      loop_cond_gsi, rgc,
> > > -      niters, niters_skip,
> > > -      might_wrap_p);
> > > + if (direct_internal_fn_supported_p (IFN_WHILE_LEN, iv_type,
> > > +     OPTIMIZE_FOR_SPEED))
> > > +   test_ctrl
> > > +     = vect_set_loop_controls_by_while_len (loop, loop_vinfo,
> > > +    &preheader_seq, &header_seq,
> > > +    rgc, niters);
> > > + else
> > > +   test_ctrl
> > > +     = vect_set_loop_controls_directly (loop, loop_vinfo, &preheader_seq,
> > > +        &header_seq, loop_cond_gsi, rgc,
> > > +        niters, niters_skip,
> > > +        might_wrap_p);
> > >        }
> > >    /* Emit all accumulated statements.  */
> > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > index 1ba9f18d73e..5bffd9a6322 100644
> > > --- a/gcc/tree-vect-loop.cc
> > > +++ b/gcc/tree-vect-loop.cc
> > > @@ -10360,12 +10360,14 @@ vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
> > >     rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS.  */
> > > tree
> > > -vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
> > > -    unsigned int nvectors, unsigned int index)
> > > +vect_get_loop_len (gimple_stmt_iterator *gsi, loop_vec_info loop_vinfo,
> > > +    vec_loop_lens *lens, unsigned int nvectors, tree vectype,
> > > +    unsigned int index)
> > > {
> > >    rgroup_controls *rgl = &(*lens)[nvectors - 1];
> > > -  bool use_bias_adjusted_len =
> > > -    LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) != 0;
> > > +  bool use_bias_adjusted_len
> > > +    = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) != 0;
> > > +  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> > >    /* Populate the rgroup's len array, if this is the first time we've
> > >       used it.  */
> > > @@ -10386,8 +10388,8 @@ vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
> > >   if (use_bias_adjusted_len)
> > >     {
> > >       gcc_assert (i == 0);
> > > -       tree adjusted_len =
> > > - make_temp_ssa_name (len_type, NULL, "adjusted_loop_len");
> > > +       tree adjusted_len
> > > + = make_temp_ssa_name (len_type, NULL, "adjusted_loop_len");
> > >       SSA_NAME_DEF_STMT (adjusted_len) = gimple_build_nop ();
> > >       rgl->bias_adjusted_ctrl = adjusted_len;
> > >     }
> > > @@ -10396,6 +10398,27 @@ vect_get_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens,
> > >    if (use_bias_adjusted_len)
> > >      return rgl->bias_adjusted_ctrl;
> > > +  else if (direct_internal_fn_supported_p (IFN_WHILE_LEN, iv_type,
> > > +    OPTIMIZE_FOR_SPEED))
> > > +    {
> > > +      tree loop_len = rgl->controls[index];
> > > +      poly_int64 nunits1 = TYPE_VECTOR_SUBPARTS (rgl->type);
> > > +      poly_int64 nunits2 = TYPE_VECTOR_SUBPARTS (vectype);
> > > +      if (maybe_ne (nunits1, nunits2))
> > > + {
> > > +   /* A loop len for data type X can be reused for data type Y
> > > +      if X has N times more elements than Y and if Y's elements
> > > +      are N times bigger than X's.  */
> > > +   gcc_assert (multiple_p (nunits1, nunits2));
> > > +   unsigned int factor = exact_div (nunits1, nunits2).to_constant ();
> > > +   gimple_seq seq = NULL;
> > > +   loop_len = gimple_build (&seq, RDIV_EXPR, iv_type, loop_len,
> > > +    build_int_cst (iv_type, factor));
> > > +   if (seq)
> > > +     gsi_insert_seq_before (gsi, seq, GSI_SAME_STMT);
> > > + }
> > > +      return loop_len;
> > > +    }
> > >    else
> > >      return rgl->controls[index];
> > > }
> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > index efa2d0daa52..708c8a1d806 100644
> > > --- a/gcc/tree-vect-stmts.cc
> > > +++ b/gcc/tree-vect-stmts.cc
> > > @@ -8653,8 +8653,9 @@ vectorizable_store (vec_info *vinfo,
> > >       else if (loop_lens)
> > > {
> > >   tree final_len
> > > -     = vect_get_loop_len (loop_vinfo, loop_lens,
> > > - vec_num * ncopies, vec_num * j + i);
> > > +     = vect_get_loop_len (gsi, loop_vinfo, loop_lens,
> > > + vec_num * ncopies, vectype,
> > > + vec_num * j + i);
> > >   tree ptr = build_int_cst (ref_type, align * BITS_PER_UNIT);
> > >   machine_mode vmode = TYPE_MODE (vectype);
> > >   opt_machine_mode new_ovmode
> > > @@ -10009,8 +10010,8 @@ vectorizable_load (vec_info *vinfo,
> > >     else if (loop_lens && memory_access_type != VMAT_INVARIANT)
> > >       {
> > > tree final_len
> > > -   = vect_get_loop_len (loop_vinfo, loop_lens,
> > > -        vec_num * ncopies,
> > > +   = vect_get_loop_len (gsi, loop_vinfo, loop_lens,
> > > +        vec_num * ncopies, vectype,
> > >        vec_num * j + i);
> > > tree ptr = build_int_cst (ref_type,
> > >   align * BITS_PER_UNIT);
> > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > > index 9cf2fb23fe3..e5cf38caf4b 100644
> > > --- a/gcc/tree-vectorizer.h
> > > +++ b/gcc/tree-vectorizer.h
> > > @@ -2293,8 +2293,8 @@ extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *,
> > > unsigned int, tree, unsigned int);
> > > extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
> > >   tree, unsigned int);
> > > -extern tree vect_get_loop_len (loop_vec_info, vec_loop_lens *, unsigned int,
> > > -        unsigned int);
> > > +extern tree vect_get_loop_len (gimple_stmt_iterator *, loop_vec_info,
> > > +        vec_loop_lens *, unsigned int, tree, unsigned int);
> > > extern gimple_seq vect_gen_len (tree, tree, tree, tree);
> > > extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info);
> > > extern bool reduction_fn_for_scalar_code (code_helper, internal_fn *);
> > 
>  
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
 
