From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id B6FCC3857346 for ; Fri, 23 Sep 2022 16:03:43 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org B6FCC3857346 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id D709E13D5; Fri, 23 Sep 2022 09:03:49 -0700 (PDT) Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.62]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 9409B3F73B; Fri, 23 Sep 2022 09:03:42 -0700 (PDT) From: Richard Sandiford To: Prathamesh Kulkarni Mail-Followup-To: Prathamesh Kulkarni ,gcc Patches , Richard Biener , richard.sandiford@arm.com Cc: gcc Patches , Richard Biener Subject: Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner References: Date: Fri, 23 Sep 2022 17:03:41 +0100 In-Reply-To: (Prathamesh Kulkarni's message of "Fri, 23 Sep 2022 17:29:03 +0530") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Status: No, score=-41.2 required=5.0 tests=BAYES_00,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Prathamesh Kulkarni writes: > On Tue, 20 Sept 2022 at 18:09, Richard Sandiford > wrote: >> >> Prathamesh Kulkarni writes: >> > On Mon, 12 Sept 2022 at 19:57, Richard Sandiford >> > wrote: >> >> >> >> Prathamesh Kulkarni writes: >> >> >> The VLA encoding encodes the first N patterns explicitly. The >> >> >> npatterns/nelts_per_pattern values then describe how to extend that >> >> >> initial sequence to an arbitrary number of elements. So when performing >> >> >> an operation on (potentially) variable-length vectors, the questions is: >> >> >> >> >> >> * Can we work out an initial sequence and npatterns/nelts_per_pattern >> >> >> pair that will be correct for all elements of the result? >> >> >> >> >> >> This depends on the operation that we're performing. E.g. it's >> >> >> different for unary operations (vector_builder::new_unary_operation) >> >> >> and binary operations (vector_builder::new_binary_operations). It also >> >> >> varies between unary operations and between binary operations, hence >> >> >> the allow_stepped_p parameters. >> >> >> >> >> >> For VEC_PERM_EXPR, I think the key requirement is that: >> >> >> >> >> >> (R) Each individual selector pattern must always select from the same vector. >> >> >> >> >> >> Whether this condition is met depends both on the pattern itself and on >> >> >> the number of patterns that it's combined with. >> >> >> >> >> >> E.g. suppose we had the selector pattern: >> >> >> >> >> >> { 0, 1, 4, ... } i.e. 3x - 2 for x > 0 >> >> >> >> >> >> If the arguments and selector are n elements then this pattern on its >> >> >> own would select from more than one argument if 3(n-1) - 2 >= n. >> >> >> This is clearly true for large enough n. So if n is variable then >> >> >> we cannot represent this. >> >> >> >> >> >> If the pattern above is one of two patterns, so interleaved as: >> >> >> >> >> >> { 0, _, 1, _, 4, _, ... } o=0 >> >> >> or { _, 0, _, 1, _, 4, ... } o=1 >> >> >> >> >> >> then the pattern would select from more than one argument if >> >> >> 3(n/2-1) - 2 + o >= n. This too would be a problem for variable n. >> >> >> >> >> >> But if the pattern above is one of four patterns then it selects >> >> >> from more than one argument if 3(n/4-1) - 2 + o >= n. This is not >> >> >> true for any valid n or o, so the pattern is OK. >> >> >> >> >> >> So let's define some ad hoc terminology: >> >> >> >> >> >> * Px is the number of patterns in x >> >> >> * Ex is the number of elements per pattern in x >> >> >> >> >> >> where x can be: >> >> >> >> >> >> * 1: first argument >> >> >> * 2: second argument >> >> >> * s: selector >> >> >> * r: result >> >> >> >> >> >> Then: >> >> >> >> >> >> (1) The number of elements encoded explicitly for x is Ex*Px >> >> >> >> >> >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px >> >> >> elements for any integer N. This extended sequence can be reencoded >> >> >> as having N*Px patterns, with Ex staying the same. >> >> >> >> >> >> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements >> >> >> of the explicit encoding. >> >> >> >> >> >> So let's assume (optimistically) that we can produce the result >> >> >> by calculating the first Pr*Er elements and using the Pr,Er encoding >> >> >> to imply the rest. Then: >> >> >> >> >> >> * (2) means that, when combining multiple input operands with potentially >> >> >> different encodings, we can set the number of patterns in the result >> >> >> to the least common multiple of the number of patterns in the inputs. >> >> >> In this case: >> >> >> >> >> >> Pr = least_common_multiple(P1, P2, Ps) >> >> >> >> >> >> is a valid number of patterns. >> >> >> >> >> >> * (3) means that the number of elements per pattern of the result can >> >> >> be the maximum of the number of elements per pattern in the inputs. >> >> >> (Alternatively, we could always use 3.) In this case: >> >> >> >> >> >> Er = max(E1, E2, Es) >> >> >> >> >> >> is a valid number of elements per pattern. >> >> >> >> >> >> So if (R) holds we can compute the result -- for both VLA and VLS -- by >> >> >> calculating the first Pr*Er elements of the result and using the >> >> >> encoding to derive the rest. If (R) doesn't hold then we need the >> >> >> selector to be constant-length. We should then fill in the result >> >> >> based on: >> >> >> >> >> >> - Pr == number of elements in the result >> >> >> - Er == 1 >> >> >> >> >> >> But this should be the fallback option, even for VLS. >> >> >> >> >> >> As far as the arguments go: we should reject CONSTRUCTORs for >> >> >> variable-length types. After doing that, we can treat a CONSTRUCTOR >> >> >> for an N-element vector type by setting the number of patterns to N >> >> >> and the number of elements per pattern to 1. >> >> > Hi Richard, >> >> > Thanks for the suggestions, and sorry for late response. >> >> > I have a couple of very elementary questions: >> >> > >> >> > 1: Consider following inputs to VEC_PERM_EXPR: >> >> > op1: P_op1 == 4, E_op1 == 1 >> >> > {1, 2, 3, 4, ...} >> >> > >> >> > op2: P_op2 == 2, E_op2 == 2 >> >> > {11, 21, 12, 22, ...} >> >> > >> >> > sel: P_sel == 3, E_sel == 1 >> >> > {0, 4, 5, ...} >> >> > >> >> > What shall be the result in this case ? >> >> > P_res = lcm(4, 2, 3) == 12 >> >> > E_res = max(1, 2, 1) == 2. >> >> >> >> Yeah, that looks right. Of course, since sel is just repeating >> >> every three elements, it could just be P_res==3, E_sel==1, >> >> but the vector_builder would do that optimisation for us. >> >> >> >> (I'm not sure whether we'd see a P==3 encoding in practice, >> >> but perhaps it's possible.) >> >> >> >> If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than >> >> repeating every three elements) then: >> >> >> >> P_res = lcm(4, 2) == 4 >> >> E_res = max(1, 2, 3) == 3 >> >> >> >> which also looks like it would give the right encoding. >> >> >> >> > 2. How should we specify index of element in sel when it is not >> >> > explicitly encoded in the operand ? >> >> > For eg: >> >> > op1: npatterns == 2, nelts_per_pattern == 3 >> >> > { 1, 0, 2, 0, 3, 0, ... } >> >> > op2: npatterns == 6, nelts_per_pattern == 1 >> >> > { 11, 12, 13, 14, 15, 16, ...} >> >> > >> >> > In sel, how do we refer to element with value 4, that would be 4th element >> >> > of first pattern in op1, but not explicitly encoded ? >> >> > In op1, 4 will come at index == 6. >> >> > However in sel, index 6 would refer to 11, ie op2[0] ? >> >> >> >> What index 6 refers to depends on the length of op1. >> >> If the length of op1 is 4 at runtime the index 6 refers to op2[2]. >> >> If the length of op1 is 6 then index 6 refers to op2[0]. >> >> If the length of op1 is 8 then index 6 refers to op1[6], etc. >> >> >> >> This comes back to (R) above. We need to be able to prove at compile >> >> time that each pattern selects from the same input vectors (for all >> >> elements, not just the encoded elements). If we can't prove that >> >> then we can't fold for variable-length vectors. >> > Hi Richard, >> > Thanks for the clarification! >> > I have come up with an approach to verify R: >> > >> > Consider following pattern: >> > a0, a1, a1 + S, ..., >> > nelts_per_pattern would be n / Psel, where n == actual length of the vector. >> > And last element of pattern will be given by: >> > a1 + (n/Psel - 2) * S >> >> (I think this is just a terminology thing, but in the source, >> nelts_per_pattern is a compile-time constant that describes the >> encoding. It always has the value 1, 2 or 3, regardless of the >> runtime length.) >> >> > Rearranging the above term, we can think of pattern >> > as a line with following equation: >> > y = (S/Psel) * n + (a1 - 2S) >> > where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept. >> > >> > At, >> > n = 2*Psel, y = a1 >> > n = 3*Psel, y = a1 + S, >> > n = 4*Psel, y = a1 + 2S ... >> > >> > To compare with n, we compare the following lines: >> > y1 = (S/Psel) * n + (a1 - 2S) >> > y2 = n >> > >> > So to check if elements always come from first vector, >> > we want to check y1 < y2 for n > 0. >> > Likewise, if elements always come from second vector, >> > we want to check if y1 >= y2, for n > 0. >> >> One difficulty here is that the indices wrap around, so an index value of >> 2n selects from the first vector rather than the second. (This is pretty >> awkward for VLA and doesn't match the native SVE TBL behaviour.) So... >> >> > If both lines are parallel, ie S/PSel == 1, >> > then we choose first or second vector depending on the y-intercept a1 - 2S. >> > If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select second vector. >> > If a1 - 2S < 0, then y1 < y2 for all values of n, so select first vector. >> > >> > For eg, if we have following pattern: >> > {0, 1, 3, ...} >> > where a1 = 1, S = 2, and consider PSel = 2. >> > >> > y1 = n - 3 >> > y2 = n >> > >> > In this case, y1 < y2 for all values of n, so we select first vector. >> > >> > Since y2 = n, passes thru origin with slope = 1, >> > a line can intersect it either in 1st or 3rd quadrant. >> > Calculate point of intersection: >> > n_int = Psel * (a1 - 2S) / (Psel - S); >> > >> > (a) n_int > 0 >> > n_int > 0 => intersecting in 1st quadrant. >> > In this case there will be a cross-over at n_int. >> > >> > For eg, consider pattern { 0, 1, 4, ...} >> > a1 = 1, S = 3, and let's take PSel = 2 >> > >> > y1 = (3/2)n - 5 >> > y2 = n >> > >> > Both intersect at (10, 10). >> > So for n < 10, y1 < y2 >> > and for n > 10, y1 > y2. >> > so in this case we can't fold since we will select elements from both vectors. >> > >> > (b) n_int <= 0 >> > In this case, the lines will intersect in 3rd quadrant, >> > so depending upon the slope we can choose either vector. >> > If (S/Psel) < 1, ie y1 has a gentler slope than y2, >> > then y1 < y2 for n > 0 >> > If (S/Psel) > 1, ie, y1 has a steeper slope than y2, >> > then y1 > y2 for n > 0. >> > >> > For eg, in the above pattern {0, 1, 4, ...} >> > a1 = 1, S = 3, and let's take PSel = 4 >> > >> > y1 = (3/4)n - 5 >> > y2 = n >> > Both intersect at (-20, -20). >> > y1's slope = (S/Psel) = (3/4) < 1 >> > So y1 < y2 for n > 0. >> > Graph: https://www.desmos.com/calculator/ct7edqbr9d >> > So we pick first vector. >> > >> > The following pseudo code attempts to capture this: >> > >> > tree select_vector_for_pattern (op1, op2, a1, S, Psel) >> > { >> > if (S == Psel) >> > { >> > /* If y1 intercept >= 0, then y1 >= y2 >> > for all values of n. */ >> > if (a1 - 2*S >= 0) >> > return op2; >> > return op1; >> > } >> > >> > n_int = Psel * (a1 - 2*S) / (Psel - S) >> > /* If intersecting in 1st quadrant, there will be cross over, >> > bail out. */ >> > if (n_int > 0) >> > return NULL_TREE; >> > /* If S/Psel < 1, ie y1 has gentler slope than y2, >> > then y1 < y2 for n > 0. */ >> > if (S < Psel) >> > return op1; >> > /* If S/Psel > 1, ie y1 has steeper slope than y2, >> > then y1 > y2 for n > 0. */ >> > return op2; >> > } >> > >> > Does this look reasonable ? >> >> ...I think we need to be more conservative. I think we also need to >> distinguish n1 (the number of elements in the input vectors) and >> nsel (the number of elements in the selector). >> >> If nsel is a multiple of Psel and nsel >= 2 * Psel then like you say >> there will be (nsel /exact Psel) - 1 index elements from the stepped >> encoding and the final index value will be: >> >> ae = a1 + (nsel /exact Psel - 2) * S >> >> Because of wrap-around, we need to ensure that that doesn't run >> into an adjoining vector. I think the easiest way of doing that >> is to calculate a1 /trunc n1 and ae /trunc n1 (using can_div_trunc_p) >> and check that the quotients are equal. > IIUC, If a1/n1 == ae/n1, then the sequence will choose from the same > vector since ae is last elem, > and the quotient can choose the vector because it will be either 0 or > 1 (since indices wrap around after 2n). Right. > Um, could you please elaborate a bit on how will can_div_trunc_p > calculate quotients, when n1 and nsel are unknown > at compile time ? > > To calculate the quotients for a hard coded pattern, > with a1 = 1, nsel = n1 = len(VNx4SI), S = 3, Psel = 4, > I tried the following: > > poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode); > poly_uint64 nsel = n1; > poly_uint64 a1 = 1 > poly_uint64 Esel = exact_div (nsel / Psel); We can't use exact_div here. We need to test that nsel is a multiple of Psel (e.g. using multiple_p). > poly_uint64 ae = a1 + (Esel - 2) * S; > > int q1, qe; > poly_uint64 r1, re; > > bool div1_p = can_div_trunc_p (a1, n1, &q1, &r1); > bool dive_p = can_div_trunc_p (ae, n1, &qe, &re); > > Which gave strange values for qe and 0 for q1, with first call succeeding, > and second call returning false. > Am I calling it incorrectly ? No, that looks right. I guess the issue is that ae < a1 for Psel == nsel and so for min(nsel) the index wraps around to the other input vector. IMO it's OK to punt on that. We don't have interfaces for applying ranges to the indeterminates in a poly_int. I guess if we want to be fancy, we could look for a1 = a0 + S and, if true, do the calculation based on a0 rather than a1. The combination of Psel >= min(nsel) and a0 != a1 + S, although valid in theory, should be a very niche case. But it might be better to stick to the a1-based case first and get that working. We can then see whether it's worth extending. Thanks, Richard > > Thanks, > Prathamesh > >> >> However, I now realise that there's a wrinkle. If S < 0 then we >> also need to check that either: >> >> (a) the chosen input vector (given by the quotient above) has either: >> >> (i) nelts_per_pattern == 1 >> (ii) nelts_per_pattern == 3 and the difference between the >> first and second elements in each pattern is the same as >> the difference between the second and third elements >> (i.e. every pattern is a natural stepped one). >> >> (b) ae % n1 >= the number of patterns in the input vector. >> (ae % n1 is calculated as a side-effect of can_div_trunc_p). >> >> Otherwise the index vector has the effect of moving the "foreground" >> from the front of the input vector to the end of the result vector. >> >> If nsel == Psel then the stepped part of the sequence doesn't matter. >> Thus, the same condition works whenever nsel is a multiple of Psel. >> >> If nsel is not a multiple of Psel then I think we should punt for now. >> There are some cases that we could handle when n1 == nsel, but "nsel >> is a multiple of Psel" will be the normal case. >> >> Thanks, >> Richard