Oh. I just realized the design you describe works well for vec_pack_trunk too.
Will send the V13 patch soon. Thanks.

juzhe.zhong@rivai.ai

From: 钟居哲
Date: 2023-05-24 22:10
To: richard.sandiford
CC: gcc-patches; rguenther
Subject: Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

>> Both approaches are fine.  I'm not against one or the other.
>>
>> What I didn't understand was why your patch only reuses existing IVs
>> for max_nscalars_per_iter == 1.  Was it to avoid having to do a
>> multiplication (well, really a shift left) when moving from one
>> rgroup to another?  E.g. if one rgroup had:
>>
>>   nscalars_per_iter == 2 && factor == 1
>>
>> and another had:
>>
>>   nscalars_per_iter == 4 && factor == 1
>>
>> then we would need to multiply by 2 when going from the first rgroup
>> to the second.
>>
>> If so, avoiding a multiplication seems like a good reason for the choice
>> you were making in the patch.  But we then need to check
>> max_nscalars_per_iter == 1 for both the source rgroup and the
>> destination rgroup, not just the destination.  And I think the
>> condition for "no multiplication needed" should be that:

Oh, I didn't realize it was such a complicated problem. Frankly, I don't
understand rgroups very well. Sorry about that :).

I just remember that last time you said I need to handle multiple rgroups
not only for SLP but also for non-SLP (which is the vec_pack_trunk case
that I tested). Then I asked you when the non-SLP case arises, and you said
max_nscalars_per_iter == 1. So I used max_nscalars_per_iter == 1 here
(I didn't really learn it very well; I just added the check as you said).

Actually, I just want to handle multiple rgroups for non-SLP here. I am not
deliberately trying to avoid multiplication; I think a scalar multiplication
(which does not cost much) is fine on a modern CPU.

So, how do you suggest I handle multiple rgroups for non-SLP? Thanks.
juzhe.zhong@rivai.ai

From: Richard Sandiford
Date: 2023-05-24 22:01
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

钟居哲 writes:
>>> In other words, why is this different from what
>>> vect_set_loop_controls_directly would do?
>
> Oh, I see.  You are asking why I do not do the multiple-rgroup
> vec_trunk handling inside "vect_set_loop_controls_directly".
>
> Well.  Frankly, I just replicated the handling of ARM SVE:
>
>	  unsigned int nmasks = i + 1;
>	  if (use_masks_p && (nmasks & 1) == 0)
>	    {
>	      rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
>	      if (!half_rgc->controls.is_empty ()
>		  && vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
>		continue;
>	    }
>
> /* Try to use permutes to define the masks in DEST_RGM using the masks
>    in SRC_RGM, given that the former has twice as many masks as the
>    latter.  Return true on success, adding any new statements to SEQ.  */
>
> static bool
> vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>			       rgroup_controls *src_rgm)
> {
>   tree src_masktype = src_rgm->type;
>   tree dest_masktype = dest_rgm->type;
>   machine_mode src_mode = TYPE_MODE (src_masktype);
>   insn_code icode1, icode2;
>   if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
>       && (icode1 = optab_handler (vec_unpacku_hi_optab,
>				   src_mode)) != CODE_FOR_nothing
>       && (icode2 = optab_handler (vec_unpacku_lo_optab,
>				   src_mode)) != CODE_FOR_nothing)
>     {
>       /* Unpacking the source masks gives at least as many mask bits as
>	   we need.  We can then VIEW_CONVERT any excess bits away.
>	   */
>       machine_mode dest_mode = insn_data[icode1].operand[0].mode;
>       gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
>       tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
>       for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
>	  {
>	    tree src = src_rgm->controls[i / 2];
>	    tree dest = dest_rgm->controls[i];
>	    tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
>			      ? VEC_UNPACK_HI_EXPR
>			      : VEC_UNPACK_LO_EXPR);
>	    gassign *stmt;
>	    if (dest_masktype == unpack_masktype)
>	      stmt = gimple_build_assign (dest, code, src);
>	    else
>	      {
>		tree temp = make_ssa_name (unpack_masktype);
>		stmt = gimple_build_assign (temp, code, src);
>		gimple_seq_add_stmt (seq, stmt);
>		stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
>					    build1 (VIEW_CONVERT_EXPR,
>						    dest_masktype, temp));
>	      }
>	    gimple_seq_add_stmt (seq, stmt);
>	  }
>       return true;
>     }
>
>   vec_perm_indices indices[2];
>   if (dest_masktype == src_masktype
>       && interleave_supported_p (&indices[0], src_masktype, 0)
>       && interleave_supported_p (&indices[1], src_masktype, 1))
>     {
>       /* The destination requires twice as many mask bits as the source, so
>	   we can use interleaving permutes to double up the number of bits.  */
>       tree masks[2];
>       for (unsigned int i = 0; i < 2; ++i)
>	  masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]);
>       for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
>	  {
>	    tree src = src_rgm->controls[i / 2];
>	    tree dest = dest_rgm->controls[i];
>	    gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR,
>						src, src, masks[i & 1]);
>	    gimple_seq_add_stmt (seq, stmt);
>	  }
>       return true;
>     }
>   return false;
> }
>
> I know this is just an optimization for ARM SVE where sub_rgc (int16) is
> half the size of rgc (int8).  But I just copied the code from ARM SVE and
> made it general for all cases (int8 <-> int64).  It all works well and the
> codegen is good.
>
> If you don't like this way, would you mind giving me some suggestions?
It's not a case of disliking one approach or disliking another.
There are two separate parts of this: one specific and one general.

The specific part is that the code had:

	      rgroup_controls *sub_rgc
		= &(*controls)[nmasks / rgc->controls.length () - 1];
	      if (!sub_rgc->controls.is_empty ())
		{
		  tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
		  vect_adjust_loop_lens_control (iv_type, &header_seq,
						 rgc, sub_rgc, NULL_TREE);
		  continue;
		}

But AIUI, nmasks is always equal to rgc->controls.length () (if
rgc->controls is non-empty).  So I think this always used (*controls)[0]
as the source rgroup.  And I think that's the only case that would work,
since vect_adjust_loop_lens_control only reads from sub_rgc once.
Is that right?

It would be possible to make other source rgroups work, but it would
involve reading every element of sub_rgc->controls, not just
sub_rgc->controls[0].  (This is also what the SVE code does.)

The general point is that there are two extreme approaches:

(1) Use one IV, and base everything off that.

(2) Use a separate IV for every rgroup.

It's also possible to do something between these two extremes.
(And we should, if it gives better code.)

The code above reuses an existing IV.  Doing that for all multi-control
rgroups would give (1).  At the other extreme, (2) would involve calling
vect_set_loop_controls_directly for every rgroup.

Both approaches are fine.  I'm not against one or the other.

What I didn't understand was why your patch only reuses existing IVs
for max_nscalars_per_iter == 1.  Was it to avoid having to do a
multiplication (well, really a shift left) when moving from one
rgroup to another?  E.g. if one rgroup had:

  nscalars_per_iter == 2 && factor == 1

and another had:

  nscalars_per_iter == 4 && factor == 1

then we would need to multiply by 2 when going from the first rgroup
to the second.

If so, avoiding a multiplication seems like a good reason for the choice
you were making in the patch.
But we then need to check max_nscalars_per_iter == 1 for both the source
rgroup and the destination rgroup, not just the destination.  And I think
the condition for "no multiplication needed" should be that:

  max_nscalars_per_iter * factor (aka "nitems_per_iter")

is the same for both groups.

Thanks,
Richard