Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner


From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>,
	gcc Patches <gcc-patches@gcc.gnu.org>,
	 Richard Biener <richard.guenther@gmail.com>,
	richard.sandiford@arm.com
Subject: Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner
Date: Thu, 15 Sep 2022 17:56:28 +0530
Message-ID: <CAAgBjMnuG7EsGcsg+A99FTmh9Q9_dwJ6LtfE6aTLP7knxhwr7A@mail.gmail.com>
In-Reply-To: <mpt4jxc7jmu.fsf@arm.com>

On Mon, 12 Sept 2022 at 19:57, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Mon, 5 Sept 2022 at 15:51, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Sorry for the slow reply.  I wrote a response a couple of weeks ago
> >> but I think it got lost in a machine outage.
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > Hi,
> >> > The attached prototype patch extends fold_vec_perm to fold VEC_PERM_EXPR
> >> > in VLA manner, and currently handles the following cases:
> >> > (a) fixed len arg0, arg1 and fixed len sel.
> >> > (b) fixed len arg0, arg1 and vla sel
> >> > (c) vla arg0, arg1 and vla sel with arg0, arg1 being VECTOR_CST.
> >> >
> >> > It seems to work for the VLA tests written in
> >> > test_vec_perm_vla_folding (), and am working thru the fallout observed in
> >> > regression testing.
> >> >
> >> > Does the approach taken in the patch look in the right direction ?
> >> > I am not sure if I have got the conversion from "sel_index"
> >> > to index of either arg0, or arg1 entirely correct.
> >> > I would be grateful for suggestions on the patch.
> >> >
> >> > Thanks,
> >> > Prathamesh
> >> >
> >> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> >> > index 4f4ec81c8d4..5e12260211e 100644
> >> > --- a/gcc/fold-const.cc
> >> > +++ b/gcc/fold-const.cc
> >> > @@ -85,6 +85,9 @@ along with GCC; see the file COPYING3.  If not see
> >> >  #include "vec-perm-indices.h"
> >> >  #include "asan.h"
> >> >  #include "gimple-range.h"
> >> > +#include "tree-pretty-print.h"
> >> > +#include "gimple-pretty-print.h"
> >> > +#include "print-tree.h"
> >> >
> >> >  /* Nonzero if we are folding constants inside an initializer or a C++
> >> >     manifestly-constant-evaluated context; zero otherwise.
> >> > @@ -10496,40 +10499,6 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> >> >                         build_zero_cst (itype));
> >> >  }
> >> >
> >> > -
> >> > -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
> >> > -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> >> > -   true if successful.  */
> >> > -
> >> > -static bool
> >> > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> >> > -{
> >> > -  unsigned HOST_WIDE_INT i, nunits;
> >> > -
> >> > -  if (TREE_CODE (arg) == VECTOR_CST
> >> > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> >> > -    {
> >> > -      for (i = 0; i < nunits; ++i)
> >> > -     elts[i] = VECTOR_CST_ELT (arg, i);
> >> > -    }
> >> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> >> > -    {
> >> > -      constructor_elt *elt;
> >> > -
> >> > -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> >> > -     if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> >> > -       return false;
> >> > -     else
> >> > -       elts[i] = elt->value;
> >> > -    }
> >> > -  else
> >> > -    return false;
> >> > -  for (; i < nelts; i++)
> >> > -    elts[i]
> >> > -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> >> > -  return true;
> >> > -}
> >> > -
> >> >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> >> >     selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
> >> >     NULL_TREE otherwise.  */
> >> > @@ -10537,45 +10506,149 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> >> >  tree
> >> >  fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
> >> >  {
> >> > -  unsigned int i;
> >> > -  unsigned HOST_WIDE_INT nelts;
> >> > -  bool need_ctor = false;
> >> > +  poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> >> > +  poly_uint64 arg1_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1));
> >> > +
> >> > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type),
> >> > +                     sel.length ()));
> >> > +  gcc_assert (known_eq (arg0_len, arg1_len));
> >> >
> >> > -  if (!sel.length ().is_constant (&nelts))
> >> > -    return NULL_TREE;
> >> > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> >> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> >> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> >> >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
> >> >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
> >> >      return NULL_TREE;
> >> >
> >> > -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> >> > -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> >> > -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> >> > +  unsigned input_npatterns = 0;
> >> > +  unsigned out_npatterns = sel.encoding ().npatterns ();
> >> > +  unsigned out_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> >> > +
> >> > +  /* FIXME: How to reshape fixed length vector_cst, so that
> >> > +     npatterns == vector.length () and nelts_per_pattern == 1 ?
> >> > +     It seems the vector is canonicalized to minimize npatterns.  */
> >> > +
> >> > +  if (arg0_len.is_constant ())
> >> > +    {
> >> > +      /* If arg0, arg1 are fixed width vectors, and sel is VLA,
> >> > +         ensure that it is a dup sequence and has same period
> >> > +      as input vector.  */
> >> > +
> >> > +      if (!sel.length ().is_constant ()
> >> > +       && (sel.encoding ().nelts_per_pattern () > 2
> >> > +           || !known_eq (arg0_len, sel.encoding ().npatterns ())))
> >> > +     return NULL_TREE;
> >> > +
> >> > +      input_npatterns = arg0_len.to_constant ();
> >> > +
> >> > +      if (sel.length ().is_constant ())
> >> > +     {
> >> > +       out_npatterns = sel.length ().to_constant ();
> >> > +       out_nelts_per_pattern = 1;
> >> > +     }
> >> > +    }
> >> > +  else if (TREE_CODE (arg0) == VECTOR_CST
> >> > +        && TREE_CODE (arg1) == VECTOR_CST)
> >> > +    {
> >> > +      unsigned npatterns = VECTOR_CST_NPATTERNS (arg0);
> >> > +      unsigned input_nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg0);
> >> > +
> >> > +      /* If arg0, arg1 are VLA, then ensure that,
> >> > +      (a) sel also has same length as input vectors.
> >> > +      (b) arg0 and arg1 have same encoding.
> >> > +      (c) sel has same number of patterns as input vectors.
> >> > +      (d) if sel is a stepped sequence, then it has same
> >> > +          encoding as input vectors.  */
> >> > +
> >> > +      if (!known_eq (arg0_len, sel.length ())
> >> > +       || npatterns != VECTOR_CST_NPATTERNS (arg1)
> >> > +       || input_nelts_per_pattern != VECTOR_CST_NELTS_PER_PATTERN (arg1)
> >> > +       || npatterns != sel.encoding ().npatterns ()
> >> > +       || (sel.encoding ().nelts_per_pattern () > 2
> >> > +           && sel.encoding ().nelts_per_pattern () != input_nelts_per_pattern))
> >> > +     return NULL_TREE;
> >>
> >> This seems too restrictive.  More below.
> >>
> >> > +
> >> > +      input_npatterns = npatterns;
> >> > +    }
> >> > +  else
> >> >      return NULL_TREE;
> >> >
> >> > -  tree_vector_builder out_elts (type, nelts, 1);
> >> > -  for (i = 0; i < nelts; i++)
> >> > +  tree_vector_builder out_elts_builder (type, out_npatterns,
> >> > +                                     out_nelts_per_pattern);
> >> > +  bool need_ctor = false;
> >> > +  unsigned out_encoded_nelts = out_npatterns * out_nelts_per_pattern;
> >> > +
> >> > +  for (unsigned i = 0; i < out_encoded_nelts; i++)
> >> >      {
> >> > -      HOST_WIDE_INT index;
> >> > -      if (!sel[i].is_constant (&index))
> >> > +      HOST_WIDE_INT sel_index;
> >> > +      if (!sel[i].is_constant (&sel_index))
> >> >       return NULL_TREE;
> >> > -      if (!CONSTANT_CLASS_P (in_elts[index]))
> >> > -     need_ctor = true;
> >> > -      out_elts.quick_push (unshare_expr (in_elts[index]));
> >> > +
> >> > +      /* Convert sel_index to index of either arg0 or arg1.
> >> > +      For eg:
> >> > +      arg0: {a0, b0, a1, b1, a1 + S, b1 + S, ...}
> >> > +      arg1: {c0, d0, c1, d1, c1 + S, d1 + S, ...}
> >> > +      Both have npatterns == 2, nelts_per_pattern == 3.
> >> > +      Then the combined vector would be:
> >> > +      {a0, b0, c0, d0, a1, b1, c1, d1, a1 + S, b1 + S, c1 + S, d1 + S, ... }
> >> > +      This combined vector will have,
> >> > +      npatterns = 2 * input_npatterns == 4.
> >> > +      sel_index is used to index this above combined vector.
> >>
> >> There's no interleaving of the arguments though.  The selector selects from:
> >>
> >> {a0, b0, a1, b1, a1 + S, b1 + S, ..., c0, d0, c1, d1, c1 + S, d1 + S, ...}
> >>
> >> The VLA encoding encodes the first N patterns explicitly.  The
> >> npatterns/nelts_per_pattern values then describe how to extend that
> >> initial sequence to an arbitrary number of elements.  So when performing
> >> an operation on (potentially) variable-length vectors, the question is:
> >>
> >> * Can we work out an initial sequence and npatterns/nelts_per_pattern
> >>   pair that will be correct for all elements of the result?
> >>
> >> This depends on the operation that we're performing.  E.g. it's
> >> different for unary operations (vector_builder::new_unary_operation)
> >> and binary operations (vector_builder::new_binary_operation).  It also
> >> varies between unary operations and between binary operations, hence
> >> the allow_stepped_p parameters.
> >>
> >> For VEC_PERM_EXPR, I think the key requirement is that:
> >>
> >> (R) Each individual selector pattern must always select from the same vector.
> >>
> >> Whether this condition is met depends both on the pattern itself and on
> >> the number of patterns that it's combined with.
> >>
> >> E.g. suppose we had the selector pattern:
> >>
> >>   { 0, 1, 4, ... }   i.e. 3x - 2 for x > 0
> >>
> >> If the arguments and selector are n elements then this pattern on its
> >> own would select from more than one argument if 3(n-1) - 2 >= n.
> >> This is clearly true for large enough n.  So if n is variable then
> >> we cannot represent this.
> >>
> >> If the pattern above is one of two patterns, so interleaved as:
> >>
> >>      { 0, _, 1, _, 4, _, ... }  o=0
> >>   or { _, 0, _, 1, _, 4, ... }  o=1
> >>
> >> then the pattern would select from more than one argument if
> >> 3(n/2-1) - 2 + o >= n.  This too would be a problem for variable n.
> >>
> >> But if the pattern above is one of four patterns then it selects
> >> from more than one argument if 3(n/4-1) - 2 + o >= n.  This is not
> >> true for any valid n or o, so the pattern is OK.
> >>
> >> So let's define some ad hoc terminology:
> >>
> >> * Px is the number of patterns in x
> >> * Ex is the number of elements per pattern in x
> >>
> >> where x can be:
> >>
> >> * 1: first argument
> >> * 2: second argument
> >> * s: selector
> >> * r: result
> >>
> >> Then:
> >>
> >> (1) The number of elements encoded explicitly for x is Ex*Px
> >>
> >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> >>     elements for any integer N.  This extended sequence can be reencoded
> >>     as having N*Px patterns, with Ex staying the same.
> >>
> >> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements
> >>     of the explicit encoding.
> >>
> >> So let's assume (optimistically) that we can produce the result
> >> by calculating the first Pr*Er elements and using the Pr,Er encoding
> >> to imply the rest.  Then:
> >>
> >> * (2) means that, when combining multiple input operands with potentially
> >>   different encodings, we can set the number of patterns in the result
> >>   to the least common multiple of the number of patterns in the inputs.
> >>   In this case:
> >>
> >>   Pr = least_common_multiple(P1, P2, Ps)
> >>
> >>   is a valid number of patterns.
> >>
> >> * (3) means that the number of elements per pattern of the result can
> >>   be the maximum of the number of elements per pattern in the inputs.
> >>   (Alternatively, we could always use 3.)  In this case:
> >>
> >>   Er = max(E1, E2, Es)
> >>
> >>   is a valid number of elements per pattern.
> >>
> >> So if (R) holds we can compute the result -- for both VLA and VLS -- by
> >> calculating the first Pr*Er elements of the result and using the
> >> encoding to derive the rest.  If (R) doesn't hold then we need the
> >> selector to be constant-length.  We should then fill in the result
> >> based on:
> >>
> >> - Pr == number of elements in the result
> >> - Er == 1
> >>
> >> But this should be the fallback option, even for VLS.
> >>
> >> As far as the arguments go: we should reject CONSTRUCTORs for
> >> variable-length types.  After doing that, we can treat a CONSTRUCTOR
> >> for an N-element vector type by setting the number of patterns to N
> >> and the number of elements per pattern to 1.
> > Hi Richard,
> > Thanks for the suggestions, and sorry for late response.
> > I have a couple of very elementary questions:
> >
> > 1: Consider following inputs to VEC_PERM_EXPR:
> > op1: P_op1 == 4, E_op1 == 1
> > {1, 2, 3, 4, ...}
> >
> > op2: P_op2 == 2, E_op2 == 2
> > {11, 21, 12, 22, ...}
> >
> > sel: P_sel == 3, E_sel == 1
> > {0, 4, 5, ...}
> >
> > What shall be the result in this case ?
> > P_res = lcm(4, 2, 3) == 12
> > E_res = max(1, 2, 1) == 2.
>
> Yeah, that looks right.  Of course, since sel is just repeating
> every three elements, it could just be P_res==3, E_res==1,
> but the vector_builder would do that optimisation for us.
>
> (I'm not sure whether we'd see a P==3 encoding in practice,
> but perhaps it's possible.)
>
> If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than
> repeating every three elements) then:
>
> P_res = lcm(4, 2) == 4
> E_res = max(1, 2, 3) == 3
>
> which also looks like it would give the right encoding.
>
> > 2. How should we specify index of element in sel when it is not
> > explicitly encoded in the operand ?
> > For eg:
> > op1: npatterns == 2, nelts_per_pattern == 3
> > { 1, 0, 2, 0, 3, 0, ... }
> > op2: npatterns == 6, nelts_per_pattern == 1
> > { 11, 12, 13, 14, 15, 16, ...}
> >
> > In sel, how do we refer to element with value 4, that would be 4th element
> > of first pattern in op1, but not explicitly encoded ?
> > In op1, 4 will come at index == 6.
> > However in sel, index 6 would refer to 11, ie op2[0] ?
>
> What index 6 refers to depends on the length of op1.
> If the length of op1 is 4 at runtime the index 6 refers to op2[2].
> If the length of op1 is 6 then index 6 refers to op2[0].
> If the length of op1 is 8 then index 6 refers to op1[6], etc.
>
> This comes back to (R) above.  We need to be able to prove at compile
> time that each pattern selects from the same input vectors (for all
> elements, not just the encoded elements).  If we can't prove that
> then we can't fold for variable-length vectors.
Hi Richard,
Thanks for the clarification!
I have come up with an approach to verify R:

Consider the following pattern:
a0, a1, a1 + S, ...
At runtime, the number of elements in the pattern is n / Psel,
where n is the actual length of the vector.
The last element of the pattern is then given by:
a1 + (n/Psel - 2) * S

Rearranging the above term, we can think of the pattern
as a line with the following equation:
y = (S/Psel) * n + (a1 - 2S)
where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept.

At,
n = 2*Psel, y = a1
n = 3*Psel, y = a1 + S,
n = 4*Psel, y = a1 + 2S ...

To compare with n, we compare the following lines:
y1 = (S/Psel) * n + (a1 - 2S)
y2 = n

So to check if elements always come from first vector,
we want to check y1 < y2 for n > 0.
Likewise, if elements always come from second vector,
we want to check if y1 >= y2, for n > 0.

If both lines are parallel, ie S/Psel == 1,
then we choose the first or second vector depending on the y-intercept a1 - 2S.
If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select the second vector.
If a1 - 2S < 0, then y1 < y2 for all values of n, so select the first vector.

For eg, if we have the following pattern:
{0, 1, 3, ...}
where a1 = 1, S = 2, and consider Psel = 2.

y1 = n - 3
y2 = n

In this case, y1 < y2 for all values of n,  so we select first vector.
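
The line model can be checked numerically. Below is a minimal sketch, assuming a stepped pattern {a0, a1, a1 + S, ...} and n being a multiple of Psel with at least 2 elements per pattern. The names last_selected_index and line_y1 are only for illustration and are not GCC functions:

```cpp
#include <cassert>
#include <cstdint>

// Last element of the stepped pattern {a0, a1, a1 + S, ...} when the
// selector has n elements spread over psel patterns.
// Assumes n is a multiple of psel with at least 2 elements per pattern.
int64_t last_selected_index(int64_t a1, int64_t s, int64_t psel, int64_t n) {
    assert(psel > 0 && n % psel == 0 && n / psel >= 2);
    return a1 + (n / psel - 2) * s;
}

// The same value written as the line y1 = (S/Psel) * n + (a1 - 2S).
// Multiplying before dividing keeps the arithmetic exact in integers,
// given that n is a multiple of psel.
int64_t line_y1(int64_t a1, int64_t s, int64_t psel, int64_t n) {
    assert(psel > 0 && n % psel == 0);
    return (s * n) / psel + (a1 - 2 * s);
}
```

For the pattern {0, 1, 3, ...} with Psel = 2, both functions give n - 3, e.g. 5 at n = 8.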

Since y2 = n passes through the origin with slope 1,
a non-parallel line can intersect it only in the 1st or 3rd quadrant
(or at the origin).
Calculate the point of intersection:
n_int = Psel * (a1 - 2S) / (Psel - S);

(a) n_int > 0
The lines intersect in the 1st quadrant,
so there will be a cross-over at n_int.

For eg, consider pattern { 0, 1, 4, ...}
a1 = 1, S = 3, and let's take Psel = 2

y1 = (3/2)n - 5
y2 = n

Both intersect at (10, 10).
So for n < 10, y1 < y2,
and for n > 10, y1 > y2.
So in this case we can't fold, since we would select elements from both vectors.
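
The cross-over can also be found by brute force, which is a handy sanity check on the line argument. This is only a sketch under the same model (it looks at the last element of one stepped pattern, and only at lengths n that are multiples of Psel). The function name and the max_n bound are made up for the example:

```cpp
#include <cassert>
#include <cstdint>

// Smallest n (a multiple of psel, at least 2 * psel, at most max_n) for
// which the last element of {a0, a1, a1 + S, ...} reaches index n or
// beyond, i.e. selects from the second vector.  Returns -1 if there is
// no such n in the scanned range.
int64_t first_n_selecting_second(int64_t a1, int64_t s, int64_t psel,
                                 int64_t max_n) {
    assert(psel > 0);
    for (int64_t n = 2 * psel; n <= max_n; n += psel) {
        int64_t last = a1 + (n / psel - 2) * s;
        if (last >= n)
            return n;
    }
    return -1;
}
```

For { 0, 1, 4, ...} with Psel = 2 the scan stops at n = 10, matching the intersection point above. With Psel = 4 it finds nothing in the scanned range.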

(b) n_int <= 0
In this case, the lines intersect in the 3rd quadrant (or at the origin),
so depending upon the slope we can choose either vector.
If (S/Psel) < 1, ie y1 has a gentler slope than y2,
then y1 < y2 for n > 0
If (S/Psel) > 1, ie, y1 has a steeper slope than y2,
then y1 > y2 for n > 0.

For eg, in the above pattern {0, 1, 4, ...}
a1 = 1, S = 3, and let's take Psel = 4

y1 = (3/4)n - 5
y2 = n
Both intersect at (-20, -20).
y1's slope = (S/Psel) = (3/4) < 1
So y1 < y2 for n > 0.
Graph: https://www.desmos.com/calculator/ct7edqbr9d
So we pick first vector.
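
Since n_int is in general not an integer, here is a small sketch that computes it as an exact fraction instead of relying on integer division. Fraction, gcd_abs and intersection_n are hypothetical helper names used only for this illustration:

```cpp
#include <cassert>
#include <cstdint>

// Exact rational with den > 0, reduced to lowest terms.
struct Fraction {
    int64_t num;
    int64_t den;
};

// Greatest common divisor of |a| and |b| (Euclid's algorithm).
static int64_t gcd_abs(int64_t a, int64_t b) {
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b != 0) {
        int64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// n_int = Psel * (a1 - 2S) / (Psel - S), for the non-parallel case.
Fraction intersection_n(int64_t a1, int64_t s, int64_t psel) {
    assert(psel > 0 && s != psel);
    int64_t num = psel * (a1 - 2 * s);
    int64_t den = psel - s;
    if (den < 0) {  // normalise so that the sign lives in num
        num = -num;
        den = -den;
    }
    int64_t g = gcd_abs(num, den);  // den != 0, so g > 0
    return Fraction{num / g, den / g};
}
```

With a1 = 1 and S = 3 this gives 10/1 for Psel = 2 and -20/1 for Psel = 4, the two intersection points worked out above.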

The following pseudo code attempts to capture this:

tree select_vector_for_pattern (op1, op2, a1, S, Psel)
{
  if (S == Psel)
    {
      /* If y1 intercept >= 0, then y1 >= y2
         for all values of n.  */
      if (a1 - 2*S >= 0)
        return op2;
      return op1;
    }

  n_int = Psel * (a1 - 2*S) / (Psel - S);
  /* If intersecting in 1st quadrant, there will be cross over,
     bail out.  */
  if (n_int > 0)
    return NULL_TREE;
  /* If S/Psel < 1, ie y1 has gentler slope than y2,
     then y1 < y2 for n > 0.  */
  if (S < Psel)
    return op1;
  /* If S/Psel > 1, ie y1 has steeper slope than y2,
     then y1 > y2 for n > 0.  */
  return op2;
}
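
For experimenting outside of GCC, the pseudo code above can be turned into a standalone sketch. This is an assumption-laden toy, not the proposed patch: it uses plain integers and an enum in place of trees, and it models only what the pseudo code models (a single stepped pattern described by a1, S and Psel). It tests the sign of n_int rather than dividing, since integer division would truncate a small positive n_int to 0:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for "op1", "op2" and "NULL_TREE" in the pseudo code.
enum class Choice { FirstVector, SecondVector, CannotFold };

Choice select_vector_for_pattern(int64_t a1, int64_t s, int64_t psel) {
    assert(psel > 0);
    const int64_t intercept = a1 - 2 * s;  // y-intercept of y1

    // Parallel lines: the intercept alone decides.
    if (s == psel)
        return intercept >= 0 ? Choice::SecondVector : Choice::FirstVector;

    // n_int = psel * intercept / (psel - s).  With psel > 0, n_int > 0
    // exactly when intercept and (psel - s) are nonzero with equal sign.
    const int64_t den = psel - s;
    const bool crosses_in_q1 =
        (intercept > 0 && den > 0) || (intercept < 0 && den < 0);
    if (crosses_in_q1)
        return Choice::CannotFold;

    // Gentler slope stays below y2 = n, steeper slope stays above it.
    return s < psel ? Choice::FirstVector : Choice::SecondVector;
}
```

For the examples in this mail: {0, 1, 3, ...} with Psel = 2 picks the first vector, {0, 1, 4, ...} with Psel = 2 cannot be folded, and {0, 1, 4, ...} with Psel = 4 picks the first vector.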

Does this look reasonable ?

Thanks,
Prathamesh
>
> Thanks,
> Richard
