From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ed1-x52f.google.com (mail-ed1-x52f.google.com [IPv6:2a00:1450:4864:20::52f]) by sourceware.org (Postfix) with ESMTPS id 295AA384D157 for ; Thu, 15 Sep 2022 12:27:06 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 295AA384D157 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=linaro.org Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linaro.org Received: by mail-ed1-x52f.google.com with SMTP id x94so9393657ede.11 for ; Thu, 15 Sep 2022 05:27:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :from:to:cc:subject:date; bh=17moDiHE8MCIbxOTbT6wzFiKyq7H59oyPtYm3VXV4ks=; b=NLIUoYFw85xxY5SL5xg1yAxExYEP6ogwhFduXbtPPoqv5UBfqakfrRWV9tQjc82pQR y564UK0PFNXuCTJDT4LnT/e1CRxc72XDqnp6c/yHSfVDN7sxiREKQHVjJVo6mivxmurX w1hH39KyufBU1YXjBingvJ0lz91bTCm9wGneu9rUjYk+bprQ4HJCQFthLKig/f1Z5Ayr RtOrp3tofk2dqVy/GyC+UH4IefHZDpRq/STo34HzC1kYQEpHt0JTIHxmb8YpSjzW/Tpu AXeQmhoFS45DgrV79gTieOZLP1fX33jmK3rbxUngArEw9sGEjeouhEqxchYzdWfoouMW kSQQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :x-gm-message-state:from:to:cc:subject:date; bh=17moDiHE8MCIbxOTbT6wzFiKyq7H59oyPtYm3VXV4ks=; b=b+Q/jQYm39jdpeZg89MZCzvoC6MGdjaGMB4gkaTYT2tb9a2kF+jDdZOL9qTna4Mrx+ 6KsaZ7Q2sKv8Bh3M9OAI4HuQfg6mjVgHnckyo/boJ0+xnc3p9xul47obWtA61RnSSNw0 Mz2mUuHgfaFhlBZBqYTyu2HmdloGFeoBJr12liT2eb9Dd29rob0T7sAVO/ARxOBADVV5 QMwEGY12MjhsSWcahApnvAm2N+HRvOaF6YrY5Z74aW8m8g7i8BZSjZN+wU5AtVk+uvo3 6Q6nJUI4zXm89cc4fgqsd33k9RL61a1PpMBx4/+IzBLRvrZbURnCbILEO18KNNwle7f6 Z4fg== X-Gm-Message-State: ACgBeo2DhvVDDEO1gjh1wyv+0JiMA47QSI+Q7rBgr8EGe5/iS1YPCC1s H8Bmi5ObMzoFajsvEYnubqwfatu1/ELSvOweD9TfIw== X-Google-Smtp-Source: AA6agR7BJLqMtgvgJm3VOwRUEgJ4ay7a/YyCs2t0R29U/8oKqGZT+u+3imHmw7KTqLJhXcxbSKigOtthw9JkyGenHCQ= X-Received: by 2002:a05:6402:cb2:b0:44e:4336:d2a0 with SMTP id cn18-20020a0564020cb200b0044e4336d2a0mr34683681edb.209.1663244824684; Thu, 15 Sep 2022 05:27:04 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Prathamesh Kulkarni Date: Thu, 15 Sep 2022 17:56:28 +0530 Message-ID: Subject: Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner To: Prathamesh Kulkarni , gcc Patches , Richard Biener , richard.sandiford@arm.com Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-9.3 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,GIT_PATCH_0,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Mon, 12 Sept 2022 at 19:57, Richard Sandiford wrote: > > Prathamesh Kulkarni writes: > > On Mon, 5 Sept 2022 at 15:51, Richard Sandiford > > wrote: > >> > >> Sorry for the slow reply. I wrote a response a couple of weeks ago > >> but I think it get lost in a machine outage. > >> > >> Prathamesh Kulkarni writes: > >> > Hi, > >> > The attached prototype patch extends fold_vec_perm to fold VEC_PERM_EXPR > >> > in VLA manner, and currently handles the following cases: > >> > (a) fixed len arg0, arg1 and fixed len sel. > >> > (b) fixed len arg0, arg1 and vla sel > >> > (c) vla arg0, arg1 and vla sel with arg0, arg1 being VECTOR_CST. > >> > > >> > It seems to work for the VLA tests written in > >> > test_vec_perm_vla_folding (), and am working thru the fallout observed in > >> > regression testing. > >> > > >> > Does the approach taken in the patch look in the right direction ? > >> > I am not sure if I have got the conversion from "sel_index" > >> > to index of either arg0, or arg1 entirely correct. > >> > I would be grateful for suggestions on the patch. > >> > > >> > Thanks, > >> > Prathamesh > >> > > >> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc > >> > index 4f4ec81c8d4..5e12260211e 100644 > >> > --- a/gcc/fold-const.cc > >> > +++ b/gcc/fold-const.cc > >> > @@ -85,6 +85,9 @@ along with GCC; see the file COPYING3. If not see > >> > #include "vec-perm-indices.h" > >> > #include "asan.h" > >> > #include "gimple-range.h" > >> > +#include "tree-pretty-print.h" > >> > +#include "gimple-pretty-print.h" > >> > +#include "print-tree.h" > >> > > >> > /* Nonzero if we are folding constants inside an initializer or a C++ > >> > manifestly-constant-evaluated context; zero otherwise. > >> > @@ -10496,40 +10499,6 @@ fold_mult_zconjz (location_t loc, tree type, tree expr) > >> > build_zero_cst (itype)); > >> > } > >> > > >> > - > >> > -/* Helper function for fold_vec_perm. Store elements of VECTOR_CST or > >> > - CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return > >> > - true if successful. */ > >> > - > >> > -static bool > >> > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts) > >> > -{ > >> > - unsigned HOST_WIDE_INT i, nunits; > >> > - > >> > - if (TREE_CODE (arg) == VECTOR_CST > >> > - && VECTOR_CST_NELTS (arg).is_constant (&nunits)) > >> > - { > >> > - for (i = 0; i < nunits; ++i) > >> > - elts[i] = VECTOR_CST_ELT (arg, i); > >> > - } > >> > - else if (TREE_CODE (arg) == CONSTRUCTOR) > >> > - { > >> > - constructor_elt *elt; > >> > - > >> > - FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt) > >> > - if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE) > >> > - return false; > >> > - else > >> > - elts[i] = elt->value; > >> > - } > >> > - else > >> > - return false; > >> > - for (; i < nelts; i++) > >> > - elts[i] > >> > - = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node); > >> > - return true; > >> > -} > >> > - > >> > /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL > >> > selector. Return the folded VECTOR_CST or CONSTRUCTOR if successful, > >> > NULL_TREE otherwise. */ > >> > @@ -10537,45 +10506,149 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts) > >> > tree > >> > fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel) > >> > { > >> > - unsigned int i; > >> > - unsigned HOST_WIDE_INT nelts; > >> > - bool need_ctor = false; > >> > + poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)); > >> > + poly_uint64 arg1_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)); > >> > + > >> > + gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), > >> > + sel.length ())); > >> > + gcc_assert (known_eq (arg0_len, arg1_len)); > >> > > >> > - if (!sel.length ().is_constant (&nelts)) > >> > - return NULL_TREE; > >> > - gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts) > >> > - && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts) > >> > - && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts)); > >> > if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type) > >> > || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type)) > >> > return NULL_TREE; > >> > > >> > - tree *in_elts = XALLOCAVEC (tree, nelts * 2); > >> > - if (!vec_cst_ctor_to_array (arg0, nelts, in_elts) > >> > - || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts)) > >> > + unsigned input_npatterns = 0; > >> > + unsigned out_npatterns = sel.encoding ().npatterns (); > >> > + unsigned out_nelts_per_pattern = sel.encoding ().nelts_per_pattern (); > >> > + > >> > + /* FIXME: How to reshape fixed length vector_cst, so that > >> > + npatterns == vector.length () and nelts_per_pattern == 1 ? > >> > + It seems the vector is canonicalized to minimize npatterns. */ > >> > + > >> > + if (arg0_len.is_constant ()) > >> > + { > >> > + /* If arg0, arg1 are fixed width vectors, and sel is VLA, > >> > + ensure that it is a dup sequence and has same period > >> > + as input vector. */ > >> > + > >> > + if (!sel.length ().is_constant () > >> > + && (sel.encoding ().nelts_per_pattern () > 2 > >> > + || !known_eq (arg0_len, sel.encoding ().npatterns ()))) > >> > + return NULL_TREE; > >> > + > >> > + input_npatterns = arg0_len.to_constant (); > >> > + > >> > + if (sel.length ().is_constant ()) > >> > + { > >> > + out_npatterns = sel.length ().to_constant (); > >> > + out_nelts_per_pattern = 1; > >> > + } > >> > + } > >> > + else if (TREE_CODE (arg0) == VECTOR_CST > >> > + && TREE_CODE (arg1) == VECTOR_CST) > >> > + { > >> > + unsigned npatterns = VECTOR_CST_NPATTERNS (arg0); > >> > + unsigned input_nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg0); > >> > + > >> > + /* If arg0, arg1 are VLA, then ensure that, > >> > + (a) sel also has same length as input vectors. > >> > + (b) arg0 and arg1 have same encoding. > >> > + (c) sel has same number of patterns as input vectors. > >> > + (d) if sel is a stepped sequence, then it has same > >> > + encoding as input vectors. */ > >> > + > >> > + if (!known_eq (arg0_len, sel.length ()) > >> > + || npatterns != VECTOR_CST_NPATTERNS (arg1) > >> > + || input_nelts_per_pattern != VECTOR_CST_NELTS_PER_PATTERN (arg1) > >> > + || npatterns != sel.encoding ().npatterns () > >> > + || (sel.encoding ().nelts_per_pattern () > 2 > >> > + && sel.encoding ().nelts_per_pattern () != input_nelts_per_pattern)) > >> > + return NULL_TREE; > >> > >> This seems too restrictive. More below. > >> > >> > + > >> > + input_npatterns = npatterns; > >> > + } > >> > + else > >> > return NULL_TREE; > >> > > >> > - tree_vector_builder out_elts (type, nelts, 1); > >> > - for (i = 0; i < nelts; i++) > >> > + tree_vector_builder out_elts_builder (type, out_npatterns, > >> > + out_nelts_per_pattern); > >> > + bool need_ctor = false; > >> > + unsigned out_encoded_nelts = out_npatterns * out_nelts_per_pattern; > >> > + > >> > + for (unsigned i = 0; i < out_encoded_nelts; i++) > >> > { > >> > - HOST_WIDE_INT index; > >> > - if (!sel[i].is_constant (&index)) > >> > + HOST_WIDE_INT sel_index; > >> > + if (!sel[i].is_constant (&sel_index)) > >> > return NULL_TREE; > >> > - if (!CONSTANT_CLASS_P (in_elts[index])) > >> > - need_ctor = true; > >> > - out_elts.quick_push (unshare_expr (in_elts[index])); > >> > + > >> > + /* Convert sel_index to index of either arg0 or arg1. > >> > + For eg: > >> > + arg0: {a0, b0, a1, b1, a1 + S, b1 + S, ...} > >> > + arg1: {c0, d0, c1, d1, c1 + S, d1 + S, ...} > >> > + Both have npatterns == 2, nelts_per_pattern == 3. > >> > + Then the combined vector would be: > >> > + {a0, b0, c0, d0, a1, b1, c1, d1, a1 + S, b1 + S, c1 + S, d1 + S, ... } > >> > + This combined vector will have, > >> > + npatterns = 2 * input_npatterns == 4. > >> > + sel_index is used to index this above combined vector. > >> > >> There's no interleaving of the arguments though. The selector selects from: > >> > >> {a0, b0, a1, b1, a1 + S, b1 + S, ..., c0, d0, c1, d1, c1 + S, d1 + S, ...} > >> > >> The VLA encoding encodes the first N patterns explicitly. The > >> npatterns/nelts_per_pattern values then describe how to extend that > >> initial sequence to an arbitrary number of elements. So when performing > >> an operation on (potentially) variable-length vectors, the questions is: > >> > >> * Can we work out an initial sequence and npatterns/nelts_per_pattern > >> pair that will be correct for all elements of the result? > >> > >> This depends on the operation that we're performing. E.g. it's > >> different for unary operations (vector_builder::new_unary_operation) > >> and binary operations (vector_builder::new_binary_operations). It also > >> varies between unary operations and between binary operations, hence > >> the allow_stepped_p parameters. > >> > >> For VEC_PERM_EXPR, I think the key requirement is that: > >> > >> (R) Each individual selector pattern must always select from the same vector. > >> > >> Whether this condition is met depends both on the pattern itself and on > >> the number of patterns that it's combined with. > >> > >> E.g. suppose we had the selector pattern: > >> > >> { 0, 1, 4, ... } i.e. 3x - 2 for x > 0 > >> > >> If the arguments and selector are n elements then this pattern on its > >> own would select from more than one argument if 3(n-1) - 2 >= n. > >> This is clearly true for large enough n. So if n is variable then > >> we cannot represent this. > >> > >> If the pattern above is one of two patterns, so interleaved as: > >> > >> { 0, _, 1, _, 4, _, ... } o=0 > >> or { _, 0, _, 1, _, 4, ... } o=1 > >> > >> then the pattern would select from more than one argument if > >> 3(n/2-1) - 2 + o >= n. This too would be a problem for variable n. > >> > >> But if the pattern above is one of four patterns then it selects > >> from more than one argument if 3(n/4-1) - 2 + o >= n. This is not > >> true for any valid n or o, so the pattern is OK. > >> > >> So let's define some ad hoc terminology: > >> > >> * Px is the number of patterns in x > >> * Ex is the number of elements per pattern in x > >> > >> where x can be: > >> > >> * 1: first argument > >> * 2: second argument > >> * s: selector > >> * r: result > >> > >> Then: > >> > >> (1) The number of elements encoded explicitly for x is Ex*Px > >> > >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px > >> elements for any integer N. This extended sequence can be reencoded > >> as having N*Px patterns, with Ex staying the same. > >> > >> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements > >> of the explicit encoding. > >> > >> So let's assume (optimistically) that we can produce the result > >> by calculating the first Pr*Er elements and using the Pr,Er encoding > >> to imply the rest. Then: > >> > >> * (2) means that, when combining multiple input operands with potentially > >> different encodings, we can set the number of patterns in the result > >> to the least common multiple of the number of patterns in the inputs. > >> In this case: > >> > >> Pr = least_common_multiple(P1, P2, Ps) > >> > >> is a valid number of patterns. > >> > >> * (3) means that the number of elements per pattern of the result can > >> be the maximum of the number of elements per pattern in the inputs. > >> (Alternatively, we could always use 3.) In this case: > >> > >> Er = max(E1, E2, Es) > >> > >> is a valid number of elements per pattern. > >> > >> So if (R) holds we can compute the result -- for both VLA and VLS -- by > >> calculating the first Pr*Er elements of the result and using the > >> encoding to derive the rest. If (R) doesn't hold then we need the > >> selector to be constant-length. We should then fill in the result > >> based on: > >> > >> - Pr == number of elements in the result > >> - Er == 1 > >> > >> But this should be the fallback option, even for VLS. > >> > >> As far as the arguments go: we should reject CONSTRUCTORs for > >> variable-length types. After doing that, we can treat a CONSTRUCTOR > >> for an N-element vector type by setting the number of patterns to N > >> and the number of elements per pattern to 1. > > Hi Richard, > > Thanks for the suggestions, and sorry for late response. > > I have a couple of very elementary questions: > > > > 1: Consider following inputs to VEC_PERM_EXPR: > > op1: P_op1 == 4, E_op1 == 1 > > {1, 2, 3, 4, ...} > > > > op2: P_op2 == 2, E_op2 == 2 > > {11, 21, 12, 22, ...} > > > > sel: P_sel == 3, E_sel == 1 > > {0, 4, 5, ...} > > > > What shall be the result in this case ? > > P_res = lcm(4, 2, 3) == 12 > > E_res = max(1, 2, 1) == 2. > > Yeah, that looks right. Of course, since sel is just repeating > every three elements, it could just be P_res==3, E_sel==1, > but the vector_builder would do that optimisation for us. > > (I'm not sure whether we'd see a P==3 encoding in practice, > but perhaps it's possible.) > > If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than > repeating every three elements) then: > > P_res = lcm(4, 2) == 4 > E_res = max(1, 2, 3) == 3 > > which also looks like it would give the right encoding. > > > 2. How should we specify index of element in sel when it is not > > explicitly encoded in the operand ? > > For eg: > > op1: npatterns == 2, nelts_per_pattern == 3 > > { 1, 0, 2, 0, 3, 0, ... } > > op2: npatterns == 6, nelts_per_pattern == 1 > > { 11, 12, 13, 14, 15, 16, ...} > > > > In sel, how do we refer to element with value 4, that would be 4th element > > of first pattern in op1, but not explicitly encoded ? > > In op1, 4 will come at index == 6. > > However in sel, index 6 would refer to 11, ie op2[0] ? > > What index 6 refers to depends on the length of op1. > If the length of op1 is 4 at runtime the index 6 refers to op2[2]. > If the length of op1 is 6 then index 6 refers to op2[0]. > If the length of op1 is 8 then index 6 refers to op1[6], etc. > > This comes back to (R) above. We need to be able to prove at compile > time that each pattern selects from the same input vectors (for all > elements, not just the encoded elements). If we can't prove that > then we can't fold for variable-length vectors. Hi Richard, Thanks for the clarification! I have come up with an approach to verify R: Consider following pattern: a0, a1, a1 + S, ..., nelts_per_pattern would be n / Psel, where n == actual length of the vector. And last element of pattern will be given by: a1 + (n/Psel - 2) * S Rearranging the above term, we can think of pattern as a line with following equation: y = (S/Psel) * n + (a1 - 2S) where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept. At, n = 2*Psel, y = a1 n = 3*Psel, y = a1 + S, n = 4*Psel, y = a1 + 2S ... To compare with n, we compare the following lines: y1 = (S/Psel) * n + (a1 - 2S) y2 = n So to check if elements always come from first vector, we want to check y1 < y2 for n > 0. Likewise, if elements always come from second vector, we want to check if y1 >= y2, for n > 0. If both lines are parallel, ie S/PSel == 1, then we choose first or second vector depending on the y-intercept a1 - 2S. If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select second vector. If a1 - 2S < 0, then y1 < y2 for all values of n, so select first vector. For eg, if we have following pattern: {0, 1, 3, ...} where a1 = 1, S = 2, and consider PSel = 2. y1 = n - 3 y2 = n In this case, y1 < y2 for all values of n, so we select first vector. Since y2 = n, passes thru origin with slope = 1, a line can intersect it either in 1st or 3rd quadrant. Calculate point of intersection: n_int = Psel * (a1 - 2S) / (Psel - S); (a) n_int > 0 n_int > 0 => intersecting in 1st quadrant. In this case there will be a cross-over at n_int. For eg, consider pattern { 0, 1, 4, ...} a1 = 1, S = 3, and let's take PSel = 2 y1 = (3/2)n - 5 y2 = n Both intersect at (10, 10). So for n < 10, y1 < y2 and for n > 10, y1 > y2. so in this case we can't fold since we will select elements from both vectors. (b) n_int <= 0 In this case, the lines will intersect in 3rd quadrant, so depending upon the slope we can choose either vector. If (S/Psel) < 1, ie y1 has a gentler slope than y2, then y1 < y2 for n > 0 If (S/Psel) > 1, ie, y1 has a steeper slope than y2, then y1 > y2 for n > 0. For eg, in the above pattern {0, 1, 4, ...} a1 = 1, S = 3, and let's take PSel = 4 y1 = (3/4)n - 5 y2 = n Both intersect at (-20, -20). y1's slope = (S/Psel) = (3/4) < 1 So y1 < y2 for n > 0. Graph: https://www.desmos.com/calculator/ct7edqbr9d So we pick first vector. The following pseudo code attempts to capture this: tree select_vector_for_pattern (op1, op2, a1, S, Psel) { if (S == Psel) { /* If y1 intercept >= 0, then y1 >= y2 for all values of n. */ if (a1 - 2*S >= 0) return op2; return op1; } n_int = Psel * (a1 - 2*S) / (Psel - S) /* If intersecting in 1st quadrant, there will be cross over, bail out. */ if (n_int > 0) return NULL_TREE; /* If S/Psel < 1, ie y1 has gentler slope than y2, then y1 < y2 for n > 0. */ if (S < Psel) return op1; /* If S/Psel > 1, ie y1 has steeper slope than y2, then y1 > y2 for n > 0. */ return op2; } Does this look reasonable ? Thanks, Prathamesh > > Thanks, > Richard