From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>,
gcc Patches <gcc-patches@gcc.gnu.org>,
Richard Biener <richard.guenther@gmail.com>,
richard.sandiford@arm.com
Subject: Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner
Date: Thu, 15 Sep 2022 17:56:28 +0530
Message-ID: <CAAgBjMnuG7EsGcsg+A99FTmh9Q9_dwJ6LtfE6aTLP7knxhwr7A@mail.gmail.com>
In-Reply-To: <mpt4jxc7jmu.fsf@arm.com>
On Mon, 12 Sept 2022 at 19:57, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Mon, 5 Sept 2022 at 15:51, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Sorry for the slow reply. I wrote a response a couple of weeks ago
> >> but I think it got lost in a machine outage.
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > Hi,
> >> > The attached prototype patch extends fold_vec_perm to fold VEC_PERM_EXPR
> >> > in VLA manner, and currently handles the following cases:
> >> > (a) fixed len arg0, arg1 and fixed len sel.
> >> > (b) fixed len arg0, arg1 and vla sel
> >> > (c) vla arg0, arg1 and vla sel with arg0, arg1 being VECTOR_CST.
> >> >
> >> > It seems to work for the VLA tests written in
> >> > test_vec_perm_vla_folding (), and am working thru the fallout observed in
> >> > regression testing.
> >> >
> >> > Does the approach taken in the patch look in the right direction ?
> >> > I am not sure if I have got the conversion from "sel_index"
> >> > to index of either arg0, or arg1 entirely correct.
> >> > I would be grateful for suggestions on the patch.
> >> >
> >> > Thanks,
> >> > Prathamesh
> >> >
> >> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> >> > index 4f4ec81c8d4..5e12260211e 100644
> >> > --- a/gcc/fold-const.cc
> >> > +++ b/gcc/fold-const.cc
> >> > @@ -85,6 +85,9 @@ along with GCC; see the file COPYING3. If not see
> >> > #include "vec-perm-indices.h"
> >> > #include "asan.h"
> >> > #include "gimple-range.h"
> >> > +#include "tree-pretty-print.h"
> >> > +#include "gimple-pretty-print.h"
> >> > +#include "print-tree.h"
> >> >
> >> > /* Nonzero if we are folding constants inside an initializer or a C++
> >> > manifestly-constant-evaluated context; zero otherwise.
> >> > @@ -10496,40 +10499,6 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> >> > build_zero_cst (itype));
> >> > }
> >> >
> >> > -
> >> > -/* Helper function for fold_vec_perm. Store elements of VECTOR_CST or
> >> > - CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> >> > - true if successful. */
> >> > -
> >> > -static bool
> >> > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> >> > -{
> >> > - unsigned HOST_WIDE_INT i, nunits;
> >> > -
> >> > - if (TREE_CODE (arg) == VECTOR_CST
> >> > - && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> >> > - {
> >> > - for (i = 0; i < nunits; ++i)
> >> > - elts[i] = VECTOR_CST_ELT (arg, i);
> >> > - }
> >> > - else if (TREE_CODE (arg) == CONSTRUCTOR)
> >> > - {
> >> > - constructor_elt *elt;
> >> > -
> >> > - FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> >> > - if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> >> > - return false;
> >> > - else
> >> > - elts[i] = elt->value;
> >> > - }
> >> > - else
> >> > - return false;
> >> > - for (; i < nelts; i++)
> >> > - elts[i]
> >> > - = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> >> > - return true;
> >> > -}
> >> > -
> >> > /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> >> > selector. Return the folded VECTOR_CST or CONSTRUCTOR if successful,
> >> > NULL_TREE otherwise. */
> >> > @@ -10537,45 +10506,149 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> >> > tree
> >> > fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
> >> > {
> >> > - unsigned int i;
> >> > - unsigned HOST_WIDE_INT nelts;
> >> > - bool need_ctor = false;
> >> > + poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> >> > + poly_uint64 arg1_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1));
> >> > +
> >> > + gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type),
> >> > + sel.length ()));
> >> > + gcc_assert (known_eq (arg0_len, arg1_len));
> >> >
> >> > - if (!sel.length ().is_constant (&nelts))
> >> > - return NULL_TREE;
> >> > - gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> >> > - && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> >> > - && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> >> > if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
> >> > || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
> >> > return NULL_TREE;
> >> >
> >> > - tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> >> > - if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> >> > - || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> >> > + unsigned input_npatterns = 0;
> >> > + unsigned out_npatterns = sel.encoding ().npatterns ();
> >> > + unsigned out_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> >> > +
> >> > + /* FIXME: How to reshape fixed length vector_cst, so that
> >> > + npatterns == vector.length () and nelts_per_pattern == 1 ?
> >> > + It seems the vector is canonicalized to minimize npatterns. */
> >> > +
> >> > + if (arg0_len.is_constant ())
> >> > + {
> >> > + /* If arg0, arg1 are fixed width vectors, and sel is VLA,
> >> > + ensure that it is a dup sequence and has same period
> >> > + as input vector. */
> >> > +
> >> > + if (!sel.length ().is_constant ()
> >> > + && (sel.encoding ().nelts_per_pattern () > 2
> >> > + || !known_eq (arg0_len, sel.encoding ().npatterns ())))
> >> > + return NULL_TREE;
> >> > +
> >> > + input_npatterns = arg0_len.to_constant ();
> >> > +
> >> > + if (sel.length ().is_constant ())
> >> > + {
> >> > + out_npatterns = sel.length ().to_constant ();
> >> > + out_nelts_per_pattern = 1;
> >> > + }
> >> > + }
> >> > + else if (TREE_CODE (arg0) == VECTOR_CST
> >> > + && TREE_CODE (arg1) == VECTOR_CST)
> >> > + {
> >> > + unsigned npatterns = VECTOR_CST_NPATTERNS (arg0);
> >> > + unsigned input_nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg0);
> >> > +
> >> > + /* If arg0, arg1 are VLA, then ensure that,
> >> > + (a) sel also has same length as input vectors.
> >> > + (b) arg0 and arg1 have same encoding.
> >> > + (c) sel has same number of patterns as input vectors.
> >> > + (d) if sel is a stepped sequence, then it has same
> >> > + encoding as input vectors. */
> >> > +
> >> > + if (!known_eq (arg0_len, sel.length ())
> >> > + || npatterns != VECTOR_CST_NPATTERNS (arg1)
> >> > + || input_nelts_per_pattern != VECTOR_CST_NELTS_PER_PATTERN (arg1)
> >> > + || npatterns != sel.encoding ().npatterns ()
> >> > + || (sel.encoding ().nelts_per_pattern () > 2
> >> > + && sel.encoding ().nelts_per_pattern () != input_nelts_per_pattern))
> >> > + return NULL_TREE;
> >>
> >> This seems too restrictive. More below.
> >>
> >> > +
> >> > + input_npatterns = npatterns;
> >> > + }
> >> > + else
> >> > return NULL_TREE;
> >> >
> >> > - tree_vector_builder out_elts (type, nelts, 1);
> >> > - for (i = 0; i < nelts; i++)
> >> > + tree_vector_builder out_elts_builder (type, out_npatterns,
> >> > + out_nelts_per_pattern);
> >> > + bool need_ctor = false;
> >> > + unsigned out_encoded_nelts = out_npatterns * out_nelts_per_pattern;
> >> > +
> >> > + for (unsigned i = 0; i < out_encoded_nelts; i++)
> >> > {
> >> > - HOST_WIDE_INT index;
> >> > - if (!sel[i].is_constant (&index))
> >> > + HOST_WIDE_INT sel_index;
> >> > + if (!sel[i].is_constant (&sel_index))
> >> > return NULL_TREE;
> >> > - if (!CONSTANT_CLASS_P (in_elts[index]))
> >> > - need_ctor = true;
> >> > - out_elts.quick_push (unshare_expr (in_elts[index]));
> >> > +
> >> > + /* Convert sel_index to index of either arg0 or arg1.
> >> > + For eg:
> >> > + arg0: {a0, b0, a1, b1, a1 + S, b1 + S, ...}
> >> > + arg1: {c0, d0, c1, d1, c1 + S, d1 + S, ...}
> >> > + Both have npatterns == 2, nelts_per_pattern == 3.
> >> > + Then the combined vector would be:
> >> > + {a0, b0, c0, d0, a1, b1, c1, d1, a1 + S, b1 + S, c1 + S, d1 + S, ... }
> >> > + This combined vector will have,
> >> > + npatterns = 2 * input_npatterns == 4.
> >> > + sel_index is used to index this above combined vector.
> >>
> >> There's no interleaving of the arguments though. The selector selects from:
> >>
> >> {a0, b0, a1, b1, a1 + S, b1 + S, ..., c0, d0, c1, d1, c1 + S, d1 + S, ...}
> >>
> >> The VLA encoding encodes the first N patterns explicitly. The
> >> npatterns/nelts_per_pattern values then describe how to extend that
> >> initial sequence to an arbitrary number of elements. So when performing
> >> an operation on (potentially) variable-length vectors, the question is:
> >>
> >> * Can we work out an initial sequence and npatterns/nelts_per_pattern
> >> pair that will be correct for all elements of the result?
> >>
> >> This depends on the operation that we're performing. E.g. it's
> >> different for unary operations (vector_builder::new_unary_operation)
> >> and binary operations (vector_builder::new_binary_operation). It also
> >> varies between unary operations and between binary operations, hence
> >> the allow_stepped_p parameters.
> >>
> >> For VEC_PERM_EXPR, I think the key requirement is that:
> >>
> >> (R) Each individual selector pattern must always select from the same vector.
> >>
> >> Whether this condition is met depends both on the pattern itself and on
> >> the number of patterns that it's combined with.
> >>
> >> E.g. suppose we had the selector pattern:
> >>
> >> { 0, 1, 4, ... } i.e. 3x - 2 for x > 0
> >>
> >> If the arguments and selector are n elements then this pattern on its
> >> own would select from more than one argument if 3(n-1) - 2 >= n.
> >> This is clearly true for large enough n. So if n is variable then
> >> we cannot represent this.
> >>
> >> If the pattern above is one of two patterns, so interleaved as:
> >>
> >> { 0, _, 1, _, 4, _, ... } o=0
> >> or { _, 0, _, 1, _, 4, ... } o=1
> >>
> >> then the pattern would select from more than one argument if
> >> 3(n/2-1) - 2 + o >= n. This too would be a problem for variable n.
> >>
> >> But if the pattern above is one of four patterns then it selects
> >> from more than one argument if 3(n/4-1) - 2 + o >= n. This is not
> >> true for any valid n or o, so the pattern is OK.
> >>
> >> So let's define some ad hoc terminology:
> >>
> >> * Px is the number of patterns in x
> >> * Ex is the number of elements per pattern in x
> >>
> >> where x can be:
> >>
> >> * 1: first argument
> >> * 2: second argument
> >> * s: selector
> >> * r: result
> >>
> >> Then:
> >>
> >> (1) The number of elements encoded explicitly for x is Ex*Px
> >>
> >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> >> elements for any integer N. This extended sequence can be reencoded
> >> as having N*Px patterns, with Ex staying the same.
> >>
> >> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements
> >> of the explicit encoding.
> >>
> >> So let's assume (optimistically) that we can produce the result
> >> by calculating the first Pr*Er elements and using the Pr,Er encoding
> >> to imply the rest. Then:
> >>
> >> * (2) means that, when combining multiple input operands with potentially
> >> different encodings, we can set the number of patterns in the result
> >> to the least common multiple of the number of patterns in the inputs.
> >> In this case:
> >>
> >> Pr = least_common_multiple(P1, P2, Ps)
> >>
> >> is a valid number of patterns.
> >>
> >> * (3) means that the number of elements per pattern of the result can
> >> be the maximum of the number of elements per pattern in the inputs.
> >> (Alternatively, we could always use 3.) In this case:
> >>
> >> Er = max(E1, E2, Es)
> >>
> >> is a valid number of elements per pattern.
> >>
> >> So if (R) holds we can compute the result -- for both VLA and VLS -- by
> >> calculating the first Pr*Er elements of the result and using the
> >> encoding to derive the rest. If (R) doesn't hold then we need the
> >> selector to be constant-length. We should then fill in the result
> >> based on:
> >>
> >> - Pr == number of elements in the result
> >> - Er == 1
> >>
> >> But this should be the fallback option, even for VLS.
> >>
> >> As far as the arguments go: we should reject CONSTRUCTORs for
> >> variable-length types. After doing that, we can treat a CONSTRUCTOR
> >> for an N-element vector type by setting the number of patterns to N
> >> and the number of elements per pattern to 1.
> > Hi Richard,
> > Thanks for the suggestions, and sorry for late response.
> > I have a couple of very elementary questions:
> >
> > 1: Consider following inputs to VEC_PERM_EXPR:
> > op1: P_op1 == 4, E_op1 == 1
> > {1, 2, 3, 4, ...}
> >
> > op2: P_op2 == 2, E_op2 == 2
> > {11, 21, 12, 22, ...}
> >
> > sel: P_sel == 3, E_sel == 1
> > {0, 4, 5, ...}
> >
> > What shall be the result in this case ?
> > P_res = lcm(4, 2, 3) == 12
> > E_res = max(1, 2, 1) == 2.
>
> Yeah, that looks right. Of course, since sel is just repeating
> every three elements, it could just be P_res==3, E_res==1,
> but the vector_builder would do that optimisation for us.
>
> (I'm not sure whether we'd see a P==3 encoding in practice,
> but perhaps it's possible.)
>
> If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than
> repeating every three elements) then:
>
> P_res = lcm(4, 2) == 4
> E_res = max(1, 2, 3) == 3
>
> which also looks like it would give the right encoding.
>
> > 2. How should we specify index of element in sel when it is not
> > explicitly encoded in the operand ?
> > For eg:
> > op1: npatterns == 2, nelts_per_pattern == 3
> > { 1, 0, 2, 0, 3, 0, ... }
> > op2: npatterns == 6, nelts_per_pattern == 1
> > { 11, 12, 13, 14, 15, 16, ...}
> >
> > In sel, how do we refer to element with value 4, that would be 4th element
> > of first pattern in op1, but not explicitly encoded ?
> > In op1, 4 will come at index == 6.
> > However in sel, index 6 would refer to 11, ie op2[0] ?
>
> What index 6 refers to depends on the length of op1.
> If the length of op1 is 4 at runtime the index 6 refers to op2[2].
> If the length of op1 is 6 then index 6 refers to op2[0].
> If the length of op1 is 8 then index 6 refers to op1[6], etc.
>
> This comes back to (R) above. We need to be able to prove at compile
> time that each pattern selects from the same input vectors (for all
> elements, not just the encoded elements). If we can't prove that
> then we can't fold for variable-length vectors.
Hi Richard,
Thanks for the clarification!
I have come up with an approach to verify R:
Consider the following pattern:

  a0, a1, a1 + S, ...

At runtime the pattern has n / Psel elements, where n is the actual
length of the vector. The last element of the pattern is then given by:

  a1 + (n/Psel - 2) * S

Rearranging the above term, we can think of the pattern
as a line with the following equation:

  y = (S/Psel) * n + (a1 - 2S)

where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept.
At:

  n = 2*Psel, y = a1
  n = 3*Psel, y = a1 + S
  n = 4*Psel, y = a1 + 2S ...
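
As a quick sanity check of the line equation, here is a minimal C++ sketch
that evaluates the last element of the pattern for a given runtime length.
The name last_pattern_element and the use of plain 64-bit integers are
assumptions made for illustration; they are not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Value of the last element of the stepped pattern {a0, a1, a1 + S, ...}
// when the vector has n elements split across psel interleaved patterns.
// Assumes n is a multiple of psel and n >= 2 * psel, so that the pattern
// has at least the two elements a0 and a1.
std::int64_t last_pattern_element(std::int64_t a1, std::int64_t step,
                                  std::int64_t psel, std::int64_t n)
{
  assert(psel > 0 && n % psel == 0 && n >= 2 * psel);
  return a1 + (n / psel - 2) * step;
}
```

With a1 = 1, S = 3 and Psel = 2, last_pattern_element (1, 3, 2, 20)
evaluates to 25, which agrees with (3/2) * 20 + (1 - 6).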
To compare with n, we compare the following lines:

  y1 = (S/Psel) * n + (a1 - 2S)
  y2 = n

So to check whether elements always come from the first vector,
we want to check y1 < y2 for n > 0.
Likewise, to check whether elements always come from the second vector,
we want to check y1 >= y2 for n > 0.

If both lines are parallel, ie S/Psel == 1,
then we choose the first or second vector depending on the y-intercept a1 - 2S.
If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select the second vector.
If a1 - 2S < 0, then y1 < y2 for all values of n, so select the first vector.
For eg, if we have the following pattern:

  {0, 1, 3, ...}

where a1 = 1, S = 2, and consider Psel = 2:

  y1 = n - 3
  y2 = n

In this case, y1 < y2 for all values of n, so we select the first vector.
Since y2 = n passes thru the origin with slope = 1,
a line that is not parallel to it can intersect it either in
the 1st or the 3rd quadrant (or at the origin itself).
Calculate the point of intersection:

  n_int = Psel * (a1 - 2S) / (Psel - S);
(a) n_int > 0

n_int > 0 => intersecting in the 1st quadrant.
In this case there will be a cross-over at n_int.
For eg, consider the pattern { 0, 1, 4, ...}
a1 = 1, S = 3, and let's take Psel = 2:

  y1 = (3/2)n - 5
  y2 = n

Both intersect at (10, 10).
So for n < 10, y1 < y2
and for n > 10, y1 > y2,
so in this case we can't fold since we will select elements from both vectors.
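
The cross-over can also be confirmed by brute force for concrete lengths.
The sketch below walks every element of one pattern and reports whether
all of them fall in the same input. pattern_selects_one_input is a
hypothetical helper, not part of the patch; it assumes that selector
indices are not wrapped, and it ignores the offset of the pattern within
the interleaving.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if every element of the pattern {a0, a1, a1 + S, ...}
// selects from the same input when each input has n elements and the
// selector consists of psel interleaved patterns.
// An index below n refers to the first input, anything else to the second.
bool pattern_selects_one_input(std::int64_t a0, std::int64_t a1,
                               std::int64_t step, std::int64_t psel,
                               std::int64_t n)
{
  assert(psel > 0 && n > 0 && n % psel == 0);
  const bool from_second = a0 >= n;
  // Element 0 is a0; element i (i >= 1) is a1 + (i - 1) * step.
  for (std::int64_t i = 1; i < n / psel; ++i)
    {
      const std::int64_t idx = a1 + (i - 1) * step;
      if ((idx >= n) != from_second)
        return false;
    }
  return true;
}
```

For { 0, 1, 4, ...} with Psel = 2, the check passes for n = 8 (the pattern
is 0, 1, 4, 7) but fails for n = 12 (the pattern is 0, 1, 4, 7, 10, 13,
and 13 is past the end of the first vector).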
(b) n_int <= 0

In this case, the lines will intersect in the 3rd quadrant (or at the origin),
so depending upon the slope we can choose either vector.
If (S/Psel) < 1, ie y1 has a gentler slope than y2,
then y1 < y2 for n > 0.
If (S/Psel) > 1, ie y1 has a steeper slope than y2,
then y1 > y2 for n > 0.
For eg, in the above pattern {0, 1, 4, ...}
a1 = 1, S = 3, and let's take Psel = 4:

  y1 = (3/4)n - 5
  y2 = n

Both intersect at (-20, -20).
y1's slope = (S/Psel) = (3/4) < 1,
so y1 < y2 for n > 0, and we pick the first vector.
Graph: https://www.desmos.com/calculator/ct7edqbr9d
The following pseudo code attempts to capture this:

tree select_vector_for_pattern (op1, op2, a1, S, Psel)
{
  if (S == Psel)
    {
      /* If y1 intercept >= 0, then y1 >= y2
         for all values of n.  */
      if (a1 - 2*S >= 0)
        return op2;
      return op1;
    }

  n_int = Psel * (a1 - 2*S) / (Psel - S);
  /* If intersecting in 1st quadrant, there will be cross over,
     bail out.  */
  if (n_int > 0)
    return NULL_TREE;
  /* If S/Psel < 1, ie y1 has gentler slope than y2,
     then y1 < y2 for n > 0.  */
  if (S < Psel)
    return op1;
  /* If S/Psel > 1, ie y1 has steeper slope than y2,
     then y1 > y2 for n > 0.  */
  return op2;
}

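
For experimenting outside of GCC, the pseudo code can be turned into a
small standalone C++ function. This is only a sketch under a few
assumptions: it uses plain 64-bit integers rather than trees or poly_ints,
the Choice enum is made up for illustration, and the sign of n_int is
derived from the signs of its numerator and denominator so that no
(truncating) integer division is needed. Like the pseudo code, it looks
only at a1, S and Psel, and does not consider a0.

```cpp
#include <cassert>
#include <cstdint>

// Outcome of the check: the pattern always selects from the first input,
// always from the second input, or crosses over (so no fold is possible).
enum class Choice { First, Second, Neither };

Choice select_vector_for_pattern(std::int64_t a1, std::int64_t step,
                                 std::int64_t psel)
{
  assert(psel > 0);
  // y1 = (step / psel) * n + intercept, compared against y2 = n.
  const std::int64_t intercept = a1 - 2 * step;

  // Parallel lines: the intercept alone decides the side.
  if (step == psel)
    return intercept >= 0 ? Choice::Second : Choice::First;

  // n_int = psel * intercept / (psel - step).  Since psel > 0, n_int is
  // positive exactly when intercept and (psel - step) are both nonzero
  // and have the same sign.
  const std::int64_t slope_gap = psel - step;
  if ((intercept > 0 && slope_gap > 0) || (intercept < 0 && slope_gap < 0))
    return Choice::Neither;

  // No cross-over for n > 0: a gentler slope stays below y2 = n,
  // a steeper slope stays above it.
  return step < psel ? Choice::First : Choice::Second;
}
```

For the three examples above, (a1, S, Psel) = (1, 2, 2) gives
Choice::First, (1, 3, 2) gives Choice::Neither, and (1, 3, 4) gives
Choice::First.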
Does this look reasonable ?
Thanks,
Prathamesh
>
> Thanks,
> Richard