From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id 2DB1F3854172 for ; Fri, 30 Sep 2022 16:00:06 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 2DB1F3854172 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 34CEC1424; Fri, 30 Sep 2022 09:00:12 -0700 (PDT) Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.62]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id C888F3F792; Fri, 30 Sep 2022 09:00:04 -0700 (PDT) From: Richard Sandiford To: Prathamesh Kulkarni Mail-Followup-To: Prathamesh Kulkarni ,gcc Patches , Richard Biener , richard.sandiford@arm.com Cc: gcc Patches , Richard Biener Subject: Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner References: Date: Fri, 30 Sep 2022 17:00:03 +0100 In-Reply-To: (Prathamesh Kulkarni's message of "Fri, 30 Sep 2022 20:11:43 +0530") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Status: No, score=-39.1 required=5.0 tests=BAYES_00,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Prathamesh Kulkarni writes: > On Tue, 27 Sept 2022 at 01:59, Richard Sandiford > wrote: >> >> Prathamesh Kulkarni writes: >> > On Fri, 23 Sept 2022 at 21:33, Richard Sandiford >> > wrote: >> >> >> >> Prathamesh Kulkarni writes: >> >> > On Tue, 20 Sept 2022 at 18:09, Richard Sandiford >> >> > wrote: >> >> >> >> >> >> Prathamesh Kulkarni writes: >> >> >> > On Mon, 12 Sept 2022 at 19:57, Richard Sandiford >> >> >> > wrote: >> >> >> >> >> >> >> >> Prathamesh Kulkarni writes: >> >> >> >> >> The VLA encoding encodes the first N patterns explicitly. The >> >> >> >> >> npatterns/nelts_per_pattern values then describe how to extend that >> >> >> >> >> initial sequence to an arbitrary number of elements. So when performing >> >> >> >> >> an operation on (potentially) variable-length vectors, the questions is: >> >> >> >> >> >> >> >> >> >> * Can we work out an initial sequence and npatterns/nelts_per_pattern >> >> >> >> >> pair that will be correct for all elements of the result? >> >> >> >> >> >> >> >> >> >> This depends on the operation that we're performing. E.g. it's >> >> >> >> >> different for unary operations (vector_builder::new_unary_operation) >> >> >> >> >> and binary operations (vector_builder::new_binary_operations). It also >> >> >> >> >> varies between unary operations and between binary operations, hence >> >> >> >> >> the allow_stepped_p parameters. >> >> >> >> >> >> >> >> >> >> For VEC_PERM_EXPR, I think the key requirement is that: >> >> >> >> >> >> >> >> >> >> (R) Each individual selector pattern must always select from the same vector. >> >> >> >> >> >> >> >> >> >> Whether this condition is met depends both on the pattern itself and on >> >> >> >> >> the number of patterns that it's combined with. >> >> >> >> >> >> >> >> >> >> E.g. suppose we had the selector pattern: >> >> >> >> >> >> >> >> >> >> { 0, 1, 4, ... } i.e. 3x - 2 for x > 0 >> >> >> >> >> >> >> >> >> >> If the arguments and selector are n elements then this pattern on its >> >> >> >> >> own would select from more than one argument if 3(n-1) - 2 >= n. >> >> >> >> >> This is clearly true for large enough n. So if n is variable then >> >> >> >> >> we cannot represent this. >> >> >> >> >> >> >> >> >> >> If the pattern above is one of two patterns, so interleaved as: >> >> >> >> >> >> >> >> >> >> { 0, _, 1, _, 4, _, ... } o=0 >> >> >> >> >> or { _, 0, _, 1, _, 4, ... } o=1 >> >> >> >> >> >> >> >> >> >> then the pattern would select from more than one argument if >> >> >> >> >> 3(n/2-1) - 2 + o >= n. This too would be a problem for variable n. >> >> >> >> >> >> >> >> >> >> But if the pattern above is one of four patterns then it selects >> >> >> >> >> from more than one argument if 3(n/4-1) - 2 + o >= n. This is not >> >> >> >> >> true for any valid n or o, so the pattern is OK. >> >> >> >> >> >> >> >> >> >> So let's define some ad hoc terminology: >> >> >> >> >> >> >> >> >> >> * Px is the number of patterns in x >> >> >> >> >> * Ex is the number of elements per pattern in x >> >> >> >> >> >> >> >> >> >> where x can be: >> >> >> >> >> >> >> >> >> >> * 1: first argument >> >> >> >> >> * 2: second argument >> >> >> >> >> * s: selector >> >> >> >> >> * r: result >> >> >> >> >> >> >> >> >> >> Then: >> >> >> >> >> >> >> >> >> >> (1) The number of elements encoded explicitly for x is Ex*Px >> >> >> >> >> >> >> >> >> >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px >> >> >> >> >> elements for any integer N. This extended sequence can be reencoded >> >> >> >> >> as having N*Px patterns, with Ex staying the same. >> >> >> >> >> >> >> >> >> >> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements >> >> >> >> >> of the explicit encoding. >> >> >> >> >> >> >> >> >> >> So let's assume (optimistically) that we can produce the result >> >> >> >> >> by calculating the first Pr*Er elements and using the Pr,Er encoding >> >> >> >> >> to imply the rest. Then: >> >> >> >> >> >> >> >> >> >> * (2) means that, when combining multiple input operands with potentially >> >> >> >> >> different encodings, we can set the number of patterns in the result >> >> >> >> >> to the least common multiple of the number of patterns in the inputs. >> >> >> >> >> In this case: >> >> >> >> >> >> >> >> >> >> Pr = least_common_multiple(P1, P2, Ps) >> >> >> >> >> >> >> >> >> >> is a valid number of patterns. >> >> >> >> >> >> >> >> >> >> * (3) means that the number of elements per pattern of the result can >> >> >> >> >> be the maximum of the number of elements per pattern in the inputs. >> >> >> >> >> (Alternatively, we could always use 3.) In this case: >> >> >> >> >> >> >> >> >> >> Er = max(E1, E2, Es) >> >> >> >> >> >> >> >> >> >> is a valid number of elements per pattern. >> >> >> >> >> >> >> >> >> >> So if (R) holds we can compute the result -- for both VLA and VLS -- by >> >> >> >> >> calculating the first Pr*Er elements of the result and using the >> >> >> >> >> encoding to derive the rest. If (R) doesn't hold then we need the >> >> >> >> >> selector to be constant-length. We should then fill in the result >> >> >> >> >> based on: >> >> >> >> >> >> >> >> >> >> - Pr == number of elements in the result >> >> >> >> >> - Er == 1 >> >> >> >> >> >> >> >> >> >> But this should be the fallback option, even for VLS. >> >> >> >> >> >> >> >> >> >> As far as the arguments go: we should reject CONSTRUCTORs for >> >> >> >> >> variable-length types. After doing that, we can treat a CONSTRUCTOR >> >> >> >> >> for an N-element vector type by setting the number of patterns to N >> >> >> >> >> and the number of elements per pattern to 1. >> >> >> >> > Hi Richard, >> >> >> >> > Thanks for the suggestions, and sorry for late response. >> >> >> >> > I have a couple of very elementary questions: >> >> >> >> > >> >> >> >> > 1: Consider following inputs to VEC_PERM_EXPR: >> >> >> >> > op1: P_op1 == 4, E_op1 == 1 >> >> >> >> > {1, 2, 3, 4, ...} >> >> >> >> > >> >> >> >> > op2: P_op2 == 2, E_op2 == 2 >> >> >> >> > {11, 21, 12, 22, ...} >> >> >> >> > >> >> >> >> > sel: P_sel == 3, E_sel == 1 >> >> >> >> > {0, 4, 5, ...} >> >> >> >> > >> >> >> >> > What shall be the result in this case ? >> >> >> >> > P_res = lcm(4, 2, 3) == 12 >> >> >> >> > E_res = max(1, 2, 1) == 2. >> >> >> >> >> >> >> >> Yeah, that looks right. Of course, since sel is just repeating >> >> >> >> every three elements, it could just be P_res==3, E_sel==1, >> >> >> >> but the vector_builder would do that optimisation for us. >> >> >> >> >> >> >> >> (I'm not sure whether we'd see a P==3 encoding in practice, >> >> >> >> but perhaps it's possible.) >> >> >> >> >> >> >> >> If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than >> >> >> >> repeating every three elements) then: >> >> >> >> >> >> >> >> P_res = lcm(4, 2) == 4 >> >> >> >> E_res = max(1, 2, 3) == 3 >> >> >> >> >> >> >> >> which also looks like it would give the right encoding. >> >> >> >> >> >> >> >> > 2. How should we specify index of element in sel when it is not >> >> >> >> > explicitly encoded in the operand ? >> >> >> >> > For eg: >> >> >> >> > op1: npatterns == 2, nelts_per_pattern == 3 >> >> >> >> > { 1, 0, 2, 0, 3, 0, ... } >> >> >> >> > op2: npatterns == 6, nelts_per_pattern == 1 >> >> >> >> > { 11, 12, 13, 14, 15, 16, ...} >> >> >> >> > >> >> >> >> > In sel, how do we refer to element with value 4, that would be 4th element >> >> >> >> > of first pattern in op1, but not explicitly encoded ? >> >> >> >> > In op1, 4 will come at index == 6. >> >> >> >> > However in sel, index 6 would refer to 11, ie op2[0] ? >> >> >> >> >> >> >> >> What index 6 refers to depends on the length of op1. >> >> >> >> If the length of op1 is 4 at runtime the index 6 refers to op2[2]. >> >> >> >> If the length of op1 is 6 then index 6 refers to op2[0]. >> >> >> >> If the length of op1 is 8 then index 6 refers to op1[6], etc. >> >> >> >> >> >> >> >> This comes back to (R) above. We need to be able to prove at compile >> >> >> >> time that each pattern selects from the same input vectors (for all >> >> >> >> elements, not just the encoded elements). If we can't prove that >> >> >> >> then we can't fold for variable-length vectors. >> >> >> > Hi Richard, >> >> >> > Thanks for the clarification! >> >> >> > I have come up with an approach to verify R: >> >> >> > >> >> >> > Consider following pattern: >> >> >> > a0, a1, a1 + S, ..., >> >> >> > nelts_per_pattern would be n / Psel, where n == actual length of the vector. >> >> >> > And last element of pattern will be given by: >> >> >> > a1 + (n/Psel - 2) * S >> >> >> >> >> >> (I think this is just a terminology thing, but in the source, >> >> >> nelts_per_pattern is a compile-time constant that describes the >> >> >> encoding. It always has the value 1, 2 or 3, regardless of the >> >> >> runtime length.) >> >> >> >> >> >> > Rearranging the above term, we can think of pattern >> >> >> > as a line with following equation: >> >> >> > y = (S/Psel) * n + (a1 - 2S) >> >> >> > where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept. >> >> >> > >> >> >> > At, >> >> >> > n = 2*Psel, y = a1 >> >> >> > n = 3*Psel, y = a1 + S, >> >> >> > n = 4*Psel, y = a1 + 2S ... >> >> >> > >> >> >> > To compare with n, we compare the following lines: >> >> >> > y1 = (S/Psel) * n + (a1 - 2S) >> >> >> > y2 = n >> >> >> > >> >> >> > So to check if elements always come from first vector, >> >> >> > we want to check y1 < y2 for n > 0. >> >> >> > Likewise, if elements always come from second vector, >> >> >> > we want to check if y1 >= y2, for n > 0. >> >> >> >> >> >> One difficulty here is that the indices wrap around, so an index value of >> >> >> 2n selects from the first vector rather than the second. (This is pretty >> >> >> awkward for VLA and doesn't match the native SVE TBL behaviour.) So... >> >> >> >> >> >> > If both lines are parallel, ie S/PSel == 1, >> >> >> > then we choose first or second vector depending on the y-intercept a1 - 2S. >> >> >> > If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select second vector. >> >> >> > If a1 - 2S < 0, then y1 < y2 for all values of n, so select first vector. >> >> >> > >> >> >> > For eg, if we have following pattern: >> >> >> > {0, 1, 3, ...} >> >> >> > where a1 = 1, S = 2, and consider PSel = 2. >> >> >> > >> >> >> > y1 = n - 3 >> >> >> > y2 = n >> >> >> > >> >> >> > In this case, y1 < y2 for all values of n, so we select first vector. >> >> >> > >> >> >> > Since y2 = n, passes thru origin with slope = 1, >> >> >> > a line can intersect it either in 1st or 3rd quadrant. >> >> >> > Calculate point of intersection: >> >> >> > n_int = Psel * (a1 - 2S) / (Psel - S); >> >> >> > >> >> >> > (a) n_int > 0 >> >> >> > n_int > 0 => intersecting in 1st quadrant. >> >> >> > In this case there will be a cross-over at n_int. >> >> >> > >> >> >> > For eg, consider pattern { 0, 1, 4, ...} >> >> >> > a1 = 1, S = 3, and let's take PSel = 2 >> >> >> > >> >> >> > y1 = (3/2)n - 5 >> >> >> > y2 = n >> >> >> > >> >> >> > Both intersect at (10, 10). >> >> >> > So for n < 10, y1 < y2 >> >> >> > and for n > 10, y1 > y2. >> >> >> > so in this case we can't fold since we will select elements from both vectors. >> >> >> > >> >> >> > (b) n_int <= 0 >> >> >> > In this case, the lines will intersect in 3rd quadrant, >> >> >> > so depending upon the slope we can choose either vector. >> >> >> > If (S/Psel) < 1, ie y1 has a gentler slope than y2, >> >> >> > then y1 < y2 for n > 0 >> >> >> > If (S/Psel) > 1, ie, y1 has a steeper slope than y2, >> >> >> > then y1 > y2 for n > 0. >> >> >> > >> >> >> > For eg, in the above pattern {0, 1, 4, ...} >> >> >> > a1 = 1, S = 3, and let's take PSel = 4 >> >> >> > >> >> >> > y1 = (3/4)n - 5 >> >> >> > y2 = n >> >> >> > Both intersect at (-20, -20). >> >> >> > y1's slope = (S/Psel) = (3/4) < 1 >> >> >> > So y1 < y2 for n > 0. >> >> >> > Graph: https://www.desmos.com/calculator/ct7edqbr9d >> >> >> > So we pick first vector. >> >> >> > >> >> >> > The following pseudo code attempts to capture this: >> >> >> > >> >> >> > tree select_vector_for_pattern (op1, op2, a1, S, Psel) >> >> >> > { >> >> >> > if (S == Psel) >> >> >> > { >> >> >> > /* If y1 intercept >= 0, then y1 >= y2 >> >> >> > for all values of n. */ >> >> >> > if (a1 - 2*S >= 0) >> >> >> > return op2; >> >> >> > return op1; >> >> >> > } >> >> >> > >> >> >> > n_int = Psel * (a1 - 2*S) / (Psel - S) >> >> >> > /* If intersecting in 1st quadrant, there will be cross over, >> >> >> > bail out. */ >> >> >> > if (n_int > 0) >> >> >> > return NULL_TREE; >> >> >> > /* If S/Psel < 1, ie y1 has gentler slope than y2, >> >> >> > then y1 < y2 for n > 0. */ >> >> >> > if (S < Psel) >> >> >> > return op1; >> >> >> > /* If S/Psel > 1, ie y1 has steeper slope than y2, >> >> >> > then y1 > y2 for n > 0. */ >> >> >> > return op2; >> >> >> > } >> >> >> > >> >> >> > Does this look reasonable ? >> >> >> >> >> >> ...I think we need to be more conservative. I think we also need to >> >> >> distinguish n1 (the number of elements in the input vectors) and >> >> >> nsel (the number of elements in the selector). >> >> >> >> >> >> If nsel is a multiple of Psel and nsel >= 2 * Psel then like you say >> >> >> there will be (nsel /exact Psel) - 1 index elements from the stepped >> >> >> encoding and the final index value will be: >> >> >> >> >> >> ae = a1 + (nsel /exact Psel - 2) * S >> >> >> >> >> >> Because of wrap-around, we need to ensure that that doesn't run >> >> >> into an adjoining vector. I think the easiest way of doing that >> >> >> is to calculate a1 /trunc n1 and ae /trunc n1 (using can_div_trunc_p) >> >> >> and check that the quotients are equal. >> >> > IIUC, If a1/n1 == ae/n1, then the sequence will choose from the same >> >> > vector since ae is last elem, >> >> > and the quotient can choose the vector because it will be either 0 or >> >> > 1 (since indices wrap around after 2n). >> >> >> >> Right. >> >> >> >> > Um, could you please elaborate a bit on how will can_div_trunc_p >> >> > calculate quotients, when n1 and nsel are unknown >> >> > at compile time ? >> >> > >> >> > To calculate the quotients for a hard coded pattern, >> >> > with a1 = 1, nsel = n1 = len(VNx4SI), S = 3, Psel = 4, >> >> > I tried the following: >> >> > >> >> > poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode); >> >> > poly_uint64 nsel = n1; >> >> > poly_uint64 a1 = 1 >> >> > poly_uint64 Esel = exact_div (nsel / Psel); >> >> >> >> We can't use exact_div here. We need to test that nsel is a multiple >> >> of Psel (e.g. using multiple_p). >> >> >> >> > poly_uint64 ae = a1 + (Esel - 2) * S; >> >> > >> >> > int q1, qe; >> >> > poly_uint64 r1, re; >> >> > >> >> > bool div1_p = can_div_trunc_p (a1, n1, &q1, &r1); >> >> > bool dive_p = can_div_trunc_p (ae, n1, &qe, &re); >> >> > >> >> > Which gave strange values for qe and 0 for q1, with first call succeeding, >> >> > and second call returning false. >> >> > Am I calling it incorrectly ? >> >> >> >> No, that looks right. I guess the issue is that ae < a1 for Psel == nsel >> >> and so for min(nsel) the index wraps around to the other input vector. >> >> IMO it's OK to punt on that. We don't have interfaces for applying >> >> ranges to the indeterminates in a poly_int. >> > Hi Richard, >> > Thanks for the suggestions. >> > I tried the following to test for wrap around: >> > >> > poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode); >> > poly_uint64 nsel = n1; >> > poly_uint64 a1 = 1; >> > unsigned Psel = 4; >> > int S = 3; >> > >> > if (multiple_p (nsel, Psel)) >> > { >> > poly_uint64 nelems = exact_div (nsel, Psel); >> > poly_uint64 ae = a1 + (nelems - 2) * S; >> > >> > if (known_gt (ae, a1)) >> > { >> > int q1, qe; >> > poly_uint64 r1, re; >> > >> > bool ok1 = can_div_trunc_p (a1, n1, &q1, &r1); >> > bool oke = can_div_trunc_p (ae, n1, &qe, &re); >> > } >> > } >> > >> > However, the second call to can_div_trunc_p still returns false, with >> > strange values for qe. >> >> The known_gt (ae, a1) part isn't required. It's OK for S to be negative. >> >> What I meant above is that Psel is a valid value of nsel. When nsel >> *does* equal Psel, ae will select from a different input vector from a1. >> (Although I said ae < a1, that isn't the important bit, sorry. >> The important bit is that a1 > 0 and ae < 0.) >> >> That's why not having conditions that apply ranges to the indeterminates >> is a problem. I guess what we want to ask is "does ae select from the >> same input as a1 for all nsel >= Psel*2?". But we don't have a way of >> asking that. We can only ask for all nsel, and the answer to "does ae >> select from the same input as a1 for all nsel?" is "no". >> >> Perhaps another way of saying this is: if the selector was a natural >> stepped vector, the first value in the pattern (a0) would be -2. The >> selector { -2, 1, 4, ... } *would* cross input vectors. >> >> This means that a selector with the above values of a1 and S could only >> be valid if a0 is in the range [0, nsel). It's valid for a0 to be in >> that range, because it can be arbitrarily different from a1 and above, >> but it means that the test for a1 and above is no longer a simple linear >> one, of the type we're trying to use here. >> >> So what I meant was, we should accept that the fold will be rejected >> for this case. I don't think that matters in practice. >> >> On the last bit, the values returned by reference from can_div_trunc_p >> are only meaningful when the function returns true. The variables >> aren't updated otherwise (which is by design). > Hi Richard, > Thanks for the explanation! > IIUC, for can_div_trunc_p (a, b, &q, &r); > to return true, quotients obtained by division of respective > coeffs should be equal to q0, if q0 = a.coeffs[0] / b.coeffs[0] != 0. > > C q = NCa (a.coeffs[0]) / NCb (b.coeffs[0]); > /* Otherwise just check for the case in which ai / bi == Q. */ > if (NCa (a.coeffs[i]) / NCb (b.coeffs[i]) != q) > return false; > > Just to iterate, for above case, > n1 = len(Vnx4SI) = 4 + 4x > nsel = n1 = 4 + 4x > S = 3 > Psel = 4 > a1 = 1 > ae = 3x - 2 > > a1 /trunc n1 == (1 + 0x) / (4 + 4x) > Since 1/4 == 0/4, can_div_trunc_p returns true, and sets q1 to 0. > > ae /trunc n1 == (-2 + 3x) / (4 + 4x) > Since the coeff type is unsigned long, > -2/4 != 3/4, so it returns false. > and we reject to fold for this case. > Is this correct ? Yeah. Like I say, if this were a simple linear pattern, a0 would be -2, and so a0 and a1 would select from different inputs. So a simple linear test cannot handle this situation. > Sorry to ask a silly question but in which case shall we select 2nd vector ? > For num_poly_int_coeffs == 2, > a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x) > If a1/trunc n1 succeeds, > 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0. > So, a1 has to be < n1.coeffs[0] ? Remember that a1 is itself a poly_int. It's not necessarily a constant. E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector: { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... } which is an interleaving of the two patterns: { 0, 2, 4, ... } a0 = 0, a1 = 2, S = 2 { 2 + 2x, 4 + 2x, 6 + 2x } a0 = 2 + 2x, a1 = 4 + 2x, S = 2 > For eg, if n1 = 4 + 4x, and a1 = 5, > a1 /trunc n1 will return false since 5/4 != 0 /4. > At runtime, > if x == 0, we will select op1[2]; > for x > 0, we will select op0[5]; > But we cannot determine this at compile time ? Right. Punting on that case is the right thing to do. > I have tried to come up with following pseudo code to verify R: > > tree get_vector_for_pattern(tree op0, tree op1, vec_perm_indices sel, > int pattern) > { > if (!multiple_p (nsel, sel_npatterns)) > return NULL_TREE; > > poly_uint64 nsel = sel.length (); > poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0)); > poly_uint64 Esel = exact_div (nsel, Psel); Realise it's only pseudo code, but FWIW, this can be produced as a side-effect of the multiple_p. > poly_uint64 a1 = sel[pattern + sel_npatterns]; > poly_uint64 ae = a1 + (Esel - 2) * S; > > if (known_lt (ae, 0)) > return NULL_TREE; I don't think we need this. > uint64_t q1, qe; > poly_uint64 r1, re; > if (!can_div_trunc_p (a1, n1, &q1, &r1) > || !can_div_trunc_p (ae, n1, &qe, &re)) > return NULL_TREE; > > if (q1 != qe) > return NULL_TREE; > tree op_vec = (q1 == 0) ? op0 : op1; This should be testing the low bit of q1. 2 selects from the first input, 3 from the second, etc. (Not that we should see that, since the selector should already have been canonicalised. But better safe than sorry.) > sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern (); > if (sel_nelts_per_pattern == 3) > { I suppose all of the above is dependent on sel_nelts_per_pattern == 3 too. S would be zero otherwise. > a2 = sel[pattern + 2 * sel_npatterns]; > S = a2 - a1; > if (S < 0) > { > /* Check for natural stepped pattern. */ > if ((a1 - sel[pattern]) != S) > return NULL_TREE; > if (!known_ge (re, VECTOR_CST_NPATTERNS (op_vec))) > return NULL_TREE; > } > } > return op_vec; > } > > Does this look in right direction ? Yeah, looks like it. > Um, could you please give an example when nsel is *not* a multiple > of Psel ? > I had (incorrectly) assumed, nsel = Psel * Esel, > so nsel would always be a multiple of Psel. I was using nsel to mean TYPE_VECTOR_SUBPARTS, i.e. the total number of elements in the vector at runtime. Psel and Esel are instead compile-time constants (number of patterns and explicitly-encoded elements per pattern respectively). Although I think in practical use the Psel will be a power of 2, that's not guaranteed by construction. Also, there's no reason in principle why you couldn't have Psel = 4 for nsel = 2 + 2x. Two of the patterns will be dropped for x==0, but that's OK. So all in all, it's better to check. Like I said in the above comment, it's a single operation either way: a three-argument multiple_p rather than an exact_div. Thanks, Richard