Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Richard Sandiford <richard.sandiford@arm.com>
To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
Cc: gcc Patches <gcc-patches@gcc.gnu.org>,
	 Richard Biener <richard.guenther@gmail.com>
Subject: Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner
Date: Mon, 12 Sep 2022 15:27:37 +0100	[thread overview]
Message-ID: <mpt4jxc7jmu.fsf@arm.com> (raw)
In-Reply-To: <CAAgBjM=XNLy_8ek1NKdrHte6F0UN30gPSxZ+pM7=6_1y=D8cug@mail.gmail.com> (Prathamesh Kulkarni's message of "Fri, 9 Sep 2022 19:29:18 +0530")

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Mon, 5 Sept 2022 at 15:51, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Sorry for the slow reply.  I wrote a response a couple of weeks ago
>> but I think it get lost in a machine outage.
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > Hi,
>> > The attached prototype patch extends fold_vec_perm to fold VEC_PERM_EXPR
>> > in VLA manner, and currently handles the following cases:
>> > (a) fixed len arg0, arg1 and fixed len sel.
>> > (b) fixed len arg0, arg1 and vla sel
>> > (c) vla arg0, arg1 and vla sel with arg0, arg1 being VECTOR_CST.
>> >
>> > It seems to work for the VLA tests written in
>> > test_vec_perm_vla_folding (), and am working thru the fallout observed in
>> > regression testing.
>> >
>> > Does the approach taken in the patch look in the right direction ?
>> > I am not sure if I have got the conversion from "sel_index"
>> > to index of either arg0, or arg1 entirely correct.
>> > I would be grateful for suggestions on the patch.
>> >
>> > Thanks,
>> > Prathamesh
>> >
>> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>> > index 4f4ec81c8d4..5e12260211e 100644
>> > --- a/gcc/fold-const.cc
>> > +++ b/gcc/fold-const.cc
>> > @@ -85,6 +85,9 @@ along with GCC; see the file COPYING3.  If not see
>> >  #include "vec-perm-indices.h"
>> >  #include "asan.h"
>> >  #include "gimple-range.h"
>> > +#include "tree-pretty-print.h"
>> > +#include "gimple-pretty-print.h"
>> > +#include "print-tree.h"
>> >
>> >  /* Nonzero if we are folding constants inside an initializer or a C++
>> >     manifestly-constant-evaluated context; zero otherwise.
>> > @@ -10496,40 +10499,6 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
>> >                         build_zero_cst (itype));
>> >  }
>> >
>> > -
>> > -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
>> > -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
>> > -   true if successful.  */
>> > -
>> > -static bool
>> > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>> > -{
>> > -  unsigned HOST_WIDE_INT i, nunits;
>> > -
>> > -  if (TREE_CODE (arg) == VECTOR_CST
>> > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
>> > -    {
>> > -      for (i = 0; i < nunits; ++i)
>> > -     elts[i] = VECTOR_CST_ELT (arg, i);
>> > -    }
>> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
>> > -    {
>> > -      constructor_elt *elt;
>> > -
>> > -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
>> > -     if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
>> > -       return false;
>> > -     else
>> > -       elts[i] = elt->value;
>> > -    }
>> > -  else
>> > -    return false;
>> > -  for (; i < nelts; i++)
>> > -    elts[i]
>> > -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
>> > -  return true;
>> > -}
>> > -
>> >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
>> >     selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
>> >     NULL_TREE otherwise.  */
>> > @@ -10537,45 +10506,149 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>> >  tree
>> >  fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
>> >  {
>> > -  unsigned int i;
>> > -  unsigned HOST_WIDE_INT nelts;
>> > -  bool need_ctor = false;
>> > +  poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
>> > +  poly_uint64 arg1_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1));
>> > +
>> > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type),
>> > +                     sel.length ()));
>> > +  gcc_assert (known_eq (arg0_len, arg1_len));
>> >
>> > -  if (!sel.length ().is_constant (&nelts))
>> > -    return NULL_TREE;
>> > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
>> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
>> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
>> >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
>> >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
>> >      return NULL_TREE;
>> >
>> > -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
>> > -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
>> > -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
>> > +  unsigned input_npatterns = 0;
>> > +  unsigned out_npatterns = sel.encoding ().npatterns ();
>> > +  unsigned out_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
>> > +
>> > +  /* FIXME: How to reshape fixed length vector_cst, so that
>> > +     npatterns == vector.length () and nelts_per_pattern == 1 ?
>> > +     It seems the vector is canonicalized to minimize npatterns.  */
>> > +
>> > +  if (arg0_len.is_constant ())
>> > +    {
>> > +      /* If arg0, arg1 are fixed width vectors, and sel is VLA,
>> > +         ensure that it is a dup sequence and has same period
>> > +      as input vector.  */
>> > +
>> > +      if (!sel.length ().is_constant ()
>> > +       && (sel.encoding ().nelts_per_pattern () > 2
>> > +           || !known_eq (arg0_len, sel.encoding ().npatterns ())))
>> > +     return NULL_TREE;
>> > +
>> > +      input_npatterns = arg0_len.to_constant ();
>> > +
>> > +      if (sel.length ().is_constant ())
>> > +     {
>> > +       out_npatterns = sel.length ().to_constant ();
>> > +       out_nelts_per_pattern = 1;
>> > +     }
>> > +    }
>> > +  else if (TREE_CODE (arg0) == VECTOR_CST
>> > +        && TREE_CODE (arg1) == VECTOR_CST)
>> > +    {
>> > +      unsigned npatterns = VECTOR_CST_NPATTERNS (arg0);
>> > +      unsigned input_nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg0);
>> > +
>> > +      /* If arg0, arg1 are VLA, then ensure that,
>> > +      (a) sel also has same length as input vectors.
>> > +      (b) arg0 and arg1 have same encoding.
>> > +      (c) sel has same number of patterns as input vectors.
>> > +      (d) if sel is a stepped sequence, then it has same
>> > +          encoding as input vectors.  */
>> > +
>> > +      if (!known_eq (arg0_len, sel.length ())
>> > +       || npatterns != VECTOR_CST_NPATTERNS (arg1)
>> > +       || input_nelts_per_pattern != VECTOR_CST_NELTS_PER_PATTERN (arg1)
>> > +       || npatterns != sel.encoding ().npatterns ()
>> > +       || (sel.encoding ().nelts_per_pattern () > 2
>> > +           && sel.encoding ().nelts_per_pattern () != input_nelts_per_pattern))
>> > +     return NULL_TREE;
>>
>> This seems too restrictive.  More below.
>>
>> > +
>> > +      input_npatterns = npatterns;
>> > +    }
>> > +  else
>> >      return NULL_TREE;
>> >
>> > -  tree_vector_builder out_elts (type, nelts, 1);
>> > -  for (i = 0; i < nelts; i++)
>> > +  tree_vector_builder out_elts_builder (type, out_npatterns,
>> > +                                     out_nelts_per_pattern);
>> > +  bool need_ctor = false;
>> > +  unsigned out_encoded_nelts = out_npatterns * out_nelts_per_pattern;
>> > +
>> > +  for (unsigned i = 0; i < out_encoded_nelts; i++)
>> >      {
>> > -      HOST_WIDE_INT index;
>> > -      if (!sel[i].is_constant (&index))
>> > +      HOST_WIDE_INT sel_index;
>> > +      if (!sel[i].is_constant (&sel_index))
>> >       return NULL_TREE;
>> > -      if (!CONSTANT_CLASS_P (in_elts[index]))
>> > -     need_ctor = true;
>> > -      out_elts.quick_push (unshare_expr (in_elts[index]));
>> > +
>> > +      /* Convert sel_index to index of either arg0 or arg1.
>> > +      For eg:
>> > +      arg0: {a0, b0, a1, b1, a1 + S, b1 + S, ...}
>> > +      arg1: {c0, d0, c1, d1, c1 + S, d1 + S, ...}
>> > +      Both have npatterns == 2, nelts_per_pattern == 3.
>> > +      Then the combined vector would be:
>> > +      {a0, b0, c0, d0, a1, b1, c1, d1, a1 + S, b1 + S, c1 + S, d1 + S, ... }
>> > +      This combined vector will have,
>> > +      npatterns = 2 * input_npatterns == 4.
>> > +      sel_index is used to index this above combined vector.
>>
>> There's no interleaving of the arguments though.  The selector selects from:
>>
>> {a0, b0, a1, b1, a1 + S, b1 + S, ..., c0, d0, c1, d1, c1 + S, d1 + S, ...}
>>
>> The VLA encoding encodes the first N patterns explicitly.  The
>> npatterns/nelts_per_pattern values then describe how to extend that
>> initial sequence to an arbitrary number of elements.  So when performing
>> an operation on (potentially) variable-length vectors, the questions is:
>>
>> * Can we work out an initial sequence and npatterns/nelts_per_pattern
>>   pair that will be correct for all elements of the result?
>>
>> This depends on the operation that we're performing.  E.g. it's
>> different for unary operations (vector_builder::new_unary_operation)
>> and binary operations (vector_builder::new_binary_operations).  It also
>> varies between unary operations and between binary operations, hence
>> the allow_stepped_p parameters.
>>
>> For VEC_PERM_EXPR, I think the key requirement is that:
>>
>> (R) Each individual selector pattern must always select from the same vector.
>>
>> Whether this condition is met depends both on the pattern itself and on
>> the number of patterns that it's combined with.
>>
>> E.g. suppose we had the selector pattern:
>>
>>   { 0, 1, 4, ... }   i.e. 3x - 2 for x > 0
>>
>> If the arguments and selector are n elements then this pattern on its
>> own would select from more than one argument if 3(n-1) - 2 >= n.
>> This is clearly true for large enough n.  So if n is variable then
>> we cannot represent this.
>>
>> If the pattern above is one of two patterns, so interleaved as:
>>
>>      { 0, _, 1, _, 4, _, ... }  o=0
>>   or { _, 0, _, 1, _, 4, ... }  o=1
>>
>> then the pattern would select from more than one argument if
>> 3(n/2-1) - 2 + o >= n.  This too would be a problem for variable n.
>>
>> But if the pattern above is one of four patterns then it selects
>> from more than one argument if 3(n/4-1) - 2 + o >= n.  This is not
>> true for any valid n or o, so the pattern is OK.
>>
>> So let's define some ad hoc terminology:
>>
>> * Px is the number of patterns in x
>> * Ex is the number of elements per pattern in x
>>
>> where x can be:
>>
>> * 1: first argument
>> * 2: second argument
>> * s: selector
>> * r: result
>>
>> Then:
>>
>> (1) The number of elements encoded explicitly for x is Ex*Px
>>
>> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
>>     elements for any integer N.  This extended sequence can be reencoded
>>     as having N*Px patterns, with Ex staying the same.
>>
>> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements
>>     of the explicit encoding.
>>
>> So let's assume (optimistically) that we can produce the result
>> by calculating the first Pr*Er elements and using the Pr,Er encoding
>> to imply the rest.  Then:
>>
>> * (2) means that, when combining multiple input operands with potentially
>>   different encodings, we can set the number of patterns in the result
>>   to the least common multiple of the number of patterns in the inputs.
>>   In this case:
>>
>>   Pr = least_common_multiple(P1, P2, Ps)
>>
>>   is a valid number of patterns.
>>
>> * (3) means that the number of elements per pattern of the result can
>>   be the maximum of the number of elements per pattern in the inputs.
>>   (Alternatively, we could always use 3.)  In this case:
>>
>>   Er = max(E1, E2, Es)
>>
>>   is a valid number of elements per pattern.
>>
>> So if (R) holds we can compute the result -- for both VLA and VLS -- by
>> calculating the first Pr*Er elements of the result and using the
>> encoding to derive the rest.  If (R) doesn't hold then we need the
>> selector to be constant-length.  We should then fill in the result
>> based on:
>>
>> - Pr == number of elements in the result
>> - Er == 1
>>
>> But this should be the fallback option, even for VLS.
>>
>> As far as the arguments go: we should reject CONSTRUCTORs for
>> variable-length types.  After doing that, we can treat a CONSTRUCTOR
>> for an N-element vector type by setting the number of patterns to N
>> and the number of elements per pattern to 1.
> Hi Richard,
> Thanks for the suggestions, and sorry for late response.
> I have a couple of very elementary questions:
>
> 1: Consider following inputs to VEC_PERM_EXPR:
> op1: P_op1 == 4, E_op1 == 1
> {1, 2, 3, 4, ...}
>
> op2: P_op2 == 2, E_op2 == 2
> {11, 21, 12, 22, ...}
>
> sel: P_sel == 3, E_sel == 1
> {0, 4, 5, ...}
>
> What shall be the result in this case ?
> P_res = lcm(4, 2, 3) == 12
> E_res = max(1, 2, 1) == 2.

Yeah, that looks right.  Of course, since sel is just repeating
every three elements, it could just be P_res==3, E_sel==1,
but the vector_builder would do that optimisation for us.

(I'm not sure whether we'd see a P==3 encoding in practice,
but perhaps it's possible.)

If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than
repeating every three elements) then:

P_res = lcm(4, 2) == 4
E_res = max(1, 2, 3) == 3

which also looks like it would give the right encoding.

> 2. How should we specify index of element in sel when it is not
> explicitly encoded in the operand ?
> For eg:
> op1: npatterns == 2, nelts_per_pattern == 3
> { 1, 0, 2, 0, 3, 0, ... }
> op2: npatterns == 6, nelts_per_pattern == 1
> { 11, 12, 13, 14, 15, 16, ...}
>
> In sel, how do we refer to element with value 4, that would be 4th element
> of first pattern in op1, but not explicitly encoded ?
> In op1, 4 will come at index == 6.
> However in sel, index 6 would refer to 11, ie op2[0] ?

What index 6 refers to depends on the length of op1.
If the length of op1 is 4 at runtime the index 6 refers to op2[2].
If the length of op1 is 6 then index 6 refers to op2[0].
If the length of op1 is 8 then index 6 refers to op1[6], etc.

This comes back to (R) above.  We need to be able to prove at compile
time that each pattern selects from the same input vectors (for all
elements, not just the encoded elements).  If we can't prove that
then we can't fold for variable-length vectors.

Thanks,
Richard

next prev parent reply	other threads:[~2022-09-12 14:27 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-17 12:39 Prathamesh Kulkarni
2022-08-29  6:08 ` Prathamesh Kulkarni
2022-09-05  8:53   ` Prathamesh Kulkarni
2022-09-05 10:21 ` Richard Sandiford
2022-09-09 13:59   ` Prathamesh Kulkarni
2022-09-12 14:27     ` Richard Sandiford [this message]
2022-09-15 12:26       ` Prathamesh Kulkarni
2022-09-20 12:39         ` Richard Sandiford
2022-09-23 11:59           ` Prathamesh Kulkarni
2022-09-23 16:03             ` Richard Sandiford
2022-09-26 19:33               ` Prathamesh Kulkarni
2022-09-26 20:29                 ` Richard Sandiford
2022-09-30 14:41                   ` Prathamesh Kulkarni
2022-09-30 16:00                     ` Richard Sandiford
2022-09-30 16:08                       ` Richard Sandiford
2022-10-10 10:48                         ` Prathamesh Kulkarni
2022-10-17 10:32                           ` Prathamesh Kulkarni
2022-10-24  8:12                             ` Prathamesh Kulkarni
2022-10-26 15:37                           ` Richard Sandiford
2022-10-28 14:46                             ` Prathamesh Kulkarni
2022-10-31  9:57                               ` Richard Sandiford
2022-11-04  8:30                                 ` Prathamesh Kulkarni
2022-11-21  9:07                                   ` Prathamesh Kulkarni
2022-11-28 11:44                                     ` Prathamesh Kulkarni
2022-12-06 15:30                                     ` Richard Sandiford
2022-12-13  6:05                                       ` Prathamesh Kulkarni
2022-12-26  4:26                                         ` Prathamesh Kulkarni
2023-01-17 11:54                                           ` Prathamesh Kulkarni
2023-02-01 10:01                                             ` Prathamesh Kulkarni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=mpt4jxc7jmu.fsf@arm.com \
    --to=richard.sandiford@arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=prathamesh.kulkarni@linaro.org \
    --cc=richard.guenther@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).