On Tue, 6 Dec 2022 at 21:00, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Fri, 4 Nov 2022 at 14:00, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> >>
> >> On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
> >> <richard.sandiford@arm.com> wrote:
> >> >
> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> >> > > <richard.sandiford@arm.com> wrote:
> >> > >>
> >> > >> Sorry for the slow response.  I wanted to find some time to think
> >> > >> about this a bit more.
> >> > >>
> >> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> >> > >> > <richard.sandiford@arm.com> wrote:
> >> > >> >>
> >> > >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> > >> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> >> > >> >> >> For num_poly_int_coeffs == 2,
> >> > >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> >> > >> >> >> If a1/trunc n1 succeeds,
> >> > >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> >> > >> >> >> So, a1 has to be < n1.coeffs[0] ?
> >> > >> >> >
> >> > >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> >> > >> >> >
> >> > >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> >> > >> >> >
> >> > >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> >> > >> >>
> >> > >> >> Sorry, should have been:
> >> > >> >>
> >> > >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> >> > >> > Hi Richard,
> >> > >> > Thanks for the clarifications, and sorry for late reply.
> >> > >> > I have attached POC patch that tries to implement the above approach.
> >> > >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> >> > >> >
> >> > >> > For VLA vectors, I have only done limited testing so far.
> >> > >> > It seems to pass couple of tests written in the patch for
> >> > >> > nelts_per_pattern == 3,
> >> > >> > and folds the following svld1rq test:
> >> > >> > int32x4_t v = {1, 2, 3, 4};
> >> > >> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> >> > >> > into:
> >> > >> > return {1, 2, 3, 4, ...};
> >> > >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> >> > >> >
> >> > >> > I have a couple of questions:
> >> > >> > 1] When mask selects elements from same vector but from different patterns:
> >> > >> > For eg:
> >> > >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> >> > >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> >> > >> > mask = {0, 0, 0, 1, 0, 2, ... },
> >> > >> > All have npatterns = 2, nelts_per_pattern = 3.
> >> > >> >
> >> > >> > With above mask,
> >> > >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> >> > >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> >> > >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> >> > >> > pattern in arg0.
> >> > >> > The result is:
> >> > >> > res = {1, 1, 1, 11, 1, 2, ...}
> >> > >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> >> > >> > with a0 = 1, a1 = 11, S = -9.
> >> > >> > Is that expected tho ? It seems to create a new encoding which
> >> > >> > wasn't present in the input vector. For instance, the next elem in
> >> > >> > sequence would be -7,
> >> > >> > which is not present originally in arg0.
> >> > >>
> >> > >> Yeah, you're right, sorry.  Going back to:
> >> > >>
> >> > >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> >> > >>     elements for any integer N.  This extended sequence can be reencoded
> >> > >>     as having N*Px patterns, with Ex staying the same.
> >> > >>
> >> > >> I guess we need to pick an N for the selector such that each new
> >> > >> selector pattern (each one out of the N*Px patterns) selects from
> >> > >> the *same pattern* of the same data input.
> >> > >>
> >> > >> So if a particular pattern in the selector has a step S, and the data
> >> > >> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> >> > >> N must be a multiple of least_common_multiple(S,Pi)/S.
> >> > >>
> >> > >> I think that means that the total number of patterns in the result
> >> > >> (Pr from previous messages) can safely be:
> >> > >>
> >> > >>   Ps * least_common_multiple(
> >> > >>     least_common_multiple(S[1], P[input(1)]) / S[1],
> >> > >>     ...
> >> > >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
> >> > >>   )
> >> > >>
> >> > >> where:
> >> > >>
> >> > >>   Ps = the number of patterns in the selector
> >> > >>   S[I] = the step for selector pattern I (I being 1-based)
> >> > >>   input(I) = the data input selected by selector pattern I (I being 1-based)
> >> > >>   P[I] = the number of patterns in data input I
> >> > >>
> >> > >> That's getting quite complicated :-)  If we allow arbitrary P[...]
> >> > >> and S[...] then it could also get large.  Perhaps we should finally
> >> > >> give up on the general case and limit this to power-of-2 patterns and
> >> > >> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> >> > >> simplifies other things as well.
> >> > >>
> >> > >> What do you think?
> >> > > Hi Richard,
> >> > > Thanks for the suggestions. Yeah I suppose we can initially add support for
> >> > > power-of-2 patterns and power-of-2 steps and try to generalize it in
> >> > > follow up patches if possible.
> >> > >
> >> > > Sorry if this sounds like a silly ques -- if we are going to have
> >> > > pattern in selector, select *same pattern from same input vector*,
> >> > > instead of re-encoding the selector to have N * Ps patterns, would it
> >> > > make sense for elements in selector to denote pattern number itself
> >> > > instead of element index
> >> > > if input vectors are VLA ?
> >> > >
> >> > > For eg:
> >> > > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> >> > > op1 = {...}
> >> > > with npatterns == 4, nelts_per_pattern == 3,
> >> > > sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> >> > > so, res = {1, 4, 1, 5, 1, 6, ...}
> >> > > Not sure if this is correct tho.
> >> >
> >> > This wouldn't allow us to represent things like a "duplicate one
> >> > element", or "copy the leading N elements from the first input and
> >> > the other elements from elements N+ of the second input", which we
> >> > can with the current scheme.
> >> >
> >> > The restriction about each (unwound) selector pattern selecting from the
> >> > same input pattern only applies to case where the selector pattern is
> >> > stepped (and only applies to the stepped part of the pattern, not the
> >> > leading element).  The restriction is also local to this code; it
> >> > doesn't make other VEC_PERM_EXPRs invalid.
> >> Hi Richard,
> >> Thanks for the clarifications.
> >> Just to clarify your approach with an eg:
> >> Let selected input vector be:
> >> arg0: {a0, b0, c0, d0,
> >>           a0 + S, b0 + S, c0 + S, d0 + S,
> >>           a0 + 2S, b0 + 2S, c0 + 2S, dd + 2S, ...}
> >> where arg0 has npatterns = 4, and nelts_per_pattern = 3.
> >>
> >> Let sel = {0, 0, 1, 2, 2, 4, ...}
> >> where sel_npatterns = 2 and sel_nelts_per_pattern = 3
> >>
> >> So, the first pattern in sel:
> >> p1: {0, 1, 2, ...} which will select {a0, b0, c0, ...}
> >> which would be incorrect, since they belong to different patterns in arg0.
> >> So to select elements from same pattern in arg0, we need to divide p1
> >> into at least N1 = P_arg0 / S0 = 4 distinct patterns.
> >>
> >> Similarly for second pattern in sel:
> >> p2: {0, 2, 4, ...}, we need to divide it into
> >> at least N2 = P_arg0 / S1 = 2 distinct patterns.
> >>
> >> Select N = max(N1, N2) = 4
> >> So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements,
> >> and will be re-encoded with N*Ps = 8 patterns:
> >>
> >> re-encoded sel:
> >> {a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S,
> >> a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S,
> >> a0 + 4S, b0 + 4S, c0 + 4s, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S,
> >> ...}
> >>
> >> with 8 patterns,
> >> p1: {a0, a0 + 2S, a0 + 4S, ...}
> >> p2: {b0, b0 + 2S, b0 + 4S, ...}
> >> ...
> >> which select elements from same pattern from same input vector.
> >> Does this look correct ?
> >>
> >> For feasibility, we can check initially that sel_npatterns, arg0_npatterns,
> >> arg1_npatterns are powers of 2 and for each stepped pattern,
> >> it's stepped size S is a power of 2. I suppose this will be sufficient
> >> to ensure that sel can be re-encoded with N*Ps npatterns
> >> such that each new pattern selects elements from same pattern
> >> of the input vector ?
> >>
> >> Then compute N:
> >> N = 1;
> >> for (every pattern p in sel)
> >>   {
> >>      op = corresponding input vector for pattern;
> >>      S = step_size (p);
> >>      N_pattern = max (S, npatterns (op)) / S;
> >>      N = max(N, N_pattern)
> >>   }
> >>
> >> and re-encode selector with N*Ps patterns.
> >> I guess rest of the patch will mostly stay the same.
> > Hi,
> > I have attached a POC patch based on the above approach.
> > For the above eg:
> > arg0 = {1, 11, 2, 12, 3, 13, ...} // npatterns = 2, nelts_per_pattern = 3,
> > and
> > sel = {0, 0, 0, 1, 0, 2, ...}
> > with sel_npatterns == 2 and sel_nelts_per_pattern == 3.
> >
> > For pattern, {0, 1, 2, ...} it will select elements from different
> > patterns from arg0, which is incorrect.
> > So we choose N = P1/S = 2/1 = 2, where P1 is number of elements in arg0.
> > So re-encoded sel = { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...}
> > with following patterns:
> > p1 = { 0, ... }
> > p2 = { 0, 2, 4, ... }
> > p3 = { 0, ... }
> > p4 = { 1, 3, 5, ... }
> > which should be correct since each element from the respective
> > patterns in sel chooses
> > elements from same pattern from arg0.
> > So, res = { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ... }
> > Does this look correct ?
>
> Yeah.  But like I said above:
>
>   The restriction about each (unwound) selector pattern selecting from the
>   same input pattern only applies to case where the selector pattern is
>   stepped (and only applies to the stepped part of the pattern, not the
>   leading element).
>
> If the selector nelts-per-pattern is 1 or 2 then we can support all
> power-of-2 cases, with the final npatterns being the maximum of the
> source nelts-per-patterns.
>
> Also, going back to an earlier part of the discussion, I think we
> should use this technique for both VLA and VLS, and only fall back
> to the VLS-specific approach if the VLA approach fails.
>
> So I suggest we put the VLA code in its own function and have
> the VLS-only path kick in when the VLA code fails.  If the code is
> having to pass a lot of state around, it might make sense to define
> a local class, store the state in member variables, and use member
> functions for the various subroutines.  I don't know if that will
> work out neater though.
Hi Richard,
Thanks for the suggestions. I have attached an updated POC patch,
that does the following:
(a) Uses VLA approach by default, and falls back to VLS specific
folding if VLA approach fails for VLS vectors.
(b) Separates cases for sel_nelts_per_pattern < 3 and
sel_nelts_per_pattern == 3.
(c) Allows, a0 to select different vector from a1 .. ae.
I have written a few unit tests in the patch for testing the same.
Does the patch look in the right direction ?

The patch has an issue for the following case marked as "case 9"
in test_vec_perm_vla_folding:
arg0 = { 1, 11, 2, 12, 3, 13, ... }
arg1 = { 21, 31, 22, 32, 23, 33, ... }
arg0 and arg1 have npatterns = 2, nelts_per_pattern = 3.

mask = { 4 + 4x, 5 + 4x, 6 + 4x, ... }
where 4 + 4x is runtime vector length.
npatterns = 1, nelts_per_pattern = 3.

a1 = 5 + 4x
ae = a1 + (esel - 2) * S
     = (5 + 4x) + (4 + 4x - 2) * 1
     = 7 + 8x

Since (7 + 8x) /trunc (4 + 4x) returns false, fold_vec_perm returns NULL_TREE.
Is that expected for the above mask ?

I intended it to select the second vector similar to,
sel = { 0, 1, 2 .. }, which would select the first vector
by re-encoding sel as { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... }
with two patterns: {0, 2, 4, ...} and {1, 3, 5, ...}
The first would select elements from first pattern from arg0,
while the second pattern would select elements from second pattern from arg0.
with result effectively having same encoding as arg0.
Shouldn't sel = { 4 + 4x, 5 + 4x, 6 + 4x, ... } similarly select arg1 ?

PS: I will be on vacation next week.

Thanks,
Prathamesh

>
> > @@ -10494,38 +10497,55 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> >                         build_zero_cst (itype));
> >  }
> >
> > +/* Check if PATTERN in SEL selects either ARG0 or ARG1,
> > +   and return the selected arg, otherwise return NULL_TREE.  */
> >
> > -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
> > -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> > -   true if successful.  */
> > -
> > -static bool
> > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> > +static tree
> > +get_vector_for_pattern (tree arg0, tree arg1,
> > +                     const vec_perm_indices &sel, unsigned pattern,
> > +                     unsigned sel_npatterns, int &S)
> >  {
> > -  unsigned HOST_WIDE_INT i, nunits;
> > +  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> >
> > -  if (TREE_CODE (arg) == VECTOR_CST
> > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> > +  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +  poly_uint64 nsel = sel.length ();
> > +  poly_uint64 esel;
> > +
> > +  if (!multiple_p (nsel, sel_npatterns, &esel))
> > +    return NULL_TREE;
> > +
> > +  poly_uint64 a1 = sel[pattern + sel_npatterns];
> > +  S = 0;
> > +  if (sel_nelts_per_pattern == 3)
> >      {
> > -      for (i = 0; i < nunits; ++i)
> > -     elts[i] = VECTOR_CST_ELT (arg, i);
> > +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> > +      S = (a2 - a1).to_constant ();
>
> The code hasn't proven that this to_constant is safe.
>
> > +      if (S != 0 && !pow2p_hwi (S))
> > +     return NULL_TREE;
> >      }
> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> > +
> > +  poly_uint64 ae = a1 + (esel - 2) * S;
> > +  uint64_t q1, qe;
> > +  poly_uint64 r1, re;
> > +
> > +  if (!can_div_trunc_p (a1, n1, &q1, &r1)
> > +      || !can_div_trunc_p (ae, n1, &qe, &re)
> > +      || (q1 != qe))
> > +    return NULL_TREE;
>
> Going back to the above: this check doesn't make sense for
> sel_nelts_per_pattern != 3.
>
> Thanks,
> Richard
>
> > +
> > +  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
> > +
> > +  if (S < 0)
> >      {
> > -      constructor_elt *elt;
> > +      poly_uint64 a0 = sel[pattern];
> > +      if (!known_eq (S, a1 - a0))
> > +        return NULL_TREE;
> >
> > -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> > -     if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> > -       return false;
> > -     else
> > -       elts[i] = elt->value;
> > +      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
> > +        return NULL_TREE;
> >      }
> > -  else
> > -    return false;
> > -  for (; i < nelts; i++)
> > -    elts[i]
> > -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> > -  return true;
> > +
> > +  return arg;
> >  }
> >
> >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> > @@ -10539,41 +10559,135 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
> >    unsigned HOST_WIDE_INT nelts;
> >    bool need_ctor = false;
> >
> > -  if (!sel.length ().is_constant (&nelts))
> > -    return NULL_TREE;
> > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> > +           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> > +                        TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
> >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
> >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
> >      return NULL_TREE;
> >
> > -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> > -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> > -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> > +  unsigned res_npatterns = 0;
> > +  unsigned res_nelts_per_pattern = 0;
> > +  unsigned sel_npatterns = 0;
> > +  tree *vector_for_pattern = NULL;
> > +
> > +  if (TREE_CODE (arg0) == VECTOR_CST
> > +      && TREE_CODE (arg1) == VECTOR_CST
> > +      && !sel.length ().is_constant ())
> > +    {
> > +      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
> > +      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
> > +      sel_npatterns = sel.encoding ().npatterns ();
> > +
> > +      if (!pow2p_hwi (arg0_npatterns)
> > +       || !pow2p_hwi (arg1_npatterns)
> > +       || !pow2p_hwi (sel_npatterns))
> > +        return NULL_TREE;
> > +
> > +      unsigned N = 1;
> > +      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
> > +      for (unsigned i = 0; i < sel_npatterns; i++)
> > +     {
> > +       int S = 0;
> > +       tree op = get_vector_for_pattern (arg0, arg1, sel, i, sel_npatterns, S);
> > +       if (!op)
> > +         return NULL_TREE;
> > +       vector_for_pattern[i] = op;
> > +       unsigned N_pattern =
> > +         (S > 0) ? std::max<int>(S, VECTOR_CST_NPATTERNS (op)) / S : 1;
> > +       N = std::max (N, N_pattern);
> > +     }
> > +
> > +      res_npatterns
> > +        = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
> > +
> > +      res_nelts_per_pattern
> > +     = std::max(sel.encoding ().nelts_per_pattern (),
> > +                std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > +                          VECTOR_CST_NELTS_PER_PATTERN (arg1)));
> > +    }
> > +  else if (sel.length ().is_constant (&nelts)
> > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
> > +    {
> > +      /* For VLS vectors, treat all vectors with
> > +      npatterns = nelts, nelts_per_pattern = 1. */
> > +      res_npatterns = sel_npatterns = nelts;
> > +      res_nelts_per_pattern = 1;
> > +      vector_for_pattern = XALLOCAVEC (tree, nelts);
> > +      for (unsigned i = 0; i < nelts; i++)
> > +        {
> > +       HOST_WIDE_INT index;
> > +       if (!sel[i].is_constant (&index))
> > +         return NULL_TREE;
> > +       vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;
> > +     }
> > +    }
> > +  else
> >      return NULL_TREE;
> >
> > -  tree_vector_builder out_elts (type, nelts, 1);
> > -  for (i = 0; i < nelts; i++)
> > +  tree_vector_builder out_elts (type, res_npatterns,
> > +                             res_nelts_per_pattern);
> > +  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
> > +  for (unsigned i = 0; i < res_nelts; i++)
> >      {
> > -      HOST_WIDE_INT index;
> > -      if (!sel[i].is_constant (&index))
> > -     return NULL_TREE;
> > -      if (!CONSTANT_CLASS_P (in_elts[index]))
> > -     need_ctor = true;
> > -      out_elts.quick_push (unshare_expr (in_elts[index]));
> > +      /* For VLA vectors, i % sel_npatterns would give the original
> > +         pattern the element belongs to, which is sufficient to get the arg.
> > +      Even if sel_npatterns has been multiplied by N,
> > +      they will always come from the same input vector.
> > +      For VLS vectors, sel_npatterns == res_nelts == nelts,
> > +      so i % sel_npatterns == i since i < nelts */
> > +
> > +      tree arg = vector_for_pattern[i % sel_npatterns];
> > +      unsigned HOST_WIDE_INT index;
> > +
> > +      if (arg == arg0)
> > +     {
> > +       if (!sel[i].is_constant ())
> > +         return NULL_TREE;
> > +       index = sel[i].to_constant ();
> > +     }
> > +      else
> > +        {
> > +       gcc_assert (arg == arg1);
> > +       poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +       uint64_t q;
> > +       poly_uint64 r;
> > +
> > +       /* Divide sel[i] by input vector length, to obtain remainder,
> > +          which would be the index for either input vector.  */
> > +       if (!can_div_trunc_p (sel[i], n1, &q, &r))
> > +         return NULL_TREE;
> > +
> > +       if (!r.is_constant (&index))
> > +         return NULL_TREE;
> > +     }
> > +
> > +      tree elem;
> > +      if (TREE_CODE (arg) == CONSTRUCTOR)
> > +        {
> > +       gcc_assert (index < nelts);
> > +       if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
> > +         return NULL_TREE;
> > +       elem = CONSTRUCTOR_ELT (arg, index)->value;
> > +       if (VECTOR_TYPE_P (TREE_TYPE (elem)))
> > +         return NULL_TREE;
> > +       need_ctor = true;
> > +     }
> > +      else
> > +        elem = vector_cst_elt (arg, index);
> > +      out_elts.quick_push (elem);
> >      }
> >
> >    if (need_ctor)
> >      {
> >        vec<constructor_elt, va_gc> *v;
> > -      vec_alloc (v, nelts);
> > -      for (i = 0; i < nelts; i++)
> > +      vec_alloc (v, res_nelts);
> > +      for (i = 0; i < res_nelts; i++)
> >       CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> >        return build_constructor (type, v);
> >      }
> > -  else
> > -    return out_elts.build ();
> > +  return out_elts.build ();
> >  }
> >
> >  /* Try to fold a pointer difference of type TYPE two address expressions of
> > @@ -16910,6 +17024,97 @@ test_vec_duplicate_folding ()
> >    ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
> >  }
> >
> > +static tree
> > +build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
> > +                int *encoded_elems)
> > +{
> > +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
> > +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> > +  //machine_mode vmode = VNx4SImode;
> > +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > +  tree vectype = build_vector_type (integer_type_node, nunits);
> > +
> > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> > +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> > +    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
> > +  return builder.build ();
> > +}
> > +
> > +static void
> > +test_vec_perm_vla_folding ()
> > +{
> > +  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
> > +  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
> > +
> > +  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
> > +  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
> > +
> > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > +      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
> > +    return;
> > +
> > +  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
> > +     should select arg0.  */
> > +  {
> > +    int mask_elems[] = {0, 1, 2};
> > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > +    ASSERT_TRUE (res != NULL_TREE);
> > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
> > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > +
> > +    unsigned res_nelts = vector_cst_encoded_nelts (res);
> > +    for (unsigned i = 0; i < res_nelts; i++)
> > +      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
> > +                                 VECTOR_CST_ELT (arg0, i), 0));
> > +  }
> > +
> > +  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
> > +     should return NULL because for len = 4 + 4x,
> > +     if x == 0, we select from arg1
> > +     if x > 0, we select from arg0
> > +     and thus cannot determine result at compile time.  */
> > +  {
> > +    int mask_elems[] = {4, 5, 6};
> > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > +    gcc_assert (res == NULL_TREE);
> > +  }
> > +
> > +  /* Case 3:
> > +     mask: {0, 0, 0, 1, 0, 2, ...}
> > +     npatterns == 2, nelts_per_pattern == 3
> > +     Pattern {0, ...} should select arg0[0], ie, 1.
> > +     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
> > +     so res = {1, 1, 1, 11, 1, 2, ...}.  */
> > +  {
> > +    int mask_elems[] = {0, 0, 0, 1, 0, 2};
> > +    tree mask = build_vec_int_cst (2, 3, mask_elems);
> > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 4);
> > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > +
> > +    /* Check encoding: {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ...}  */
> > +    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13};
> > +    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> > +      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
> > +  }
> > +
> > +  /* Case 4:
> > +     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
> > +     npatterns == 2, nelts_per_pattern == 3
> > +     Pattern {0, ...} should select arg0[1]
> > +     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
> > +     a1 = 5 + 4x
> > +     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
> > +        = 5 + 6x
> > +     Since a1/4+4x == ae/4+4x == 1, we select arg1[0], arg1[1], arg1[2], ...
> > +     res: {1, 21, 1, 31, 1, 22, ... }
> > +     FIXME: How to build vector with poly_int elems ?  */
> > +
> > +  /* Case 5: S < 0.  */
> > +}
> > +
> >  /* Run all of the selftests within this file.  */
> >
> >  void
> > @@ -16918,6 +17123,7 @@ fold_const_cc_tests ()
> >    test_arithmetic_folding ();
> >    test_vector_folding ();
> >    test_vec_duplicate_folding ();
> > +  test_vec_perm_vla_folding ();
> >  }
> >
> >  } // namespace selftest