From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-io1-xd2b.google.com (mail-io1-xd2b.google.com [IPv6:2607:f8b0:4864:20::d2b]) by sourceware.org (Postfix) with ESMTPS id CABF13858D37 for ; Mon, 26 Sep 2022 19:34:36 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org CABF13858D37 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=linaro.org Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linaro.org Received: by mail-io1-xd2b.google.com with SMTP id 138so6082801iou.9 for ; Mon, 26 Sep 2022 12:34:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :from:to:cc:subject:date; bh=3+ZV64Yw8L/tObKZvAt2vHQwvyHArnNHF1cmFRga9aU=; b=vF1GjZ97s0Iuxbn+bIPukaa95/Sgun9eQUA+Dog8LCd2l/HGooYJ4SCqTkkUjef0ln yZmKCuSpjYEEaJwzefG+X+y7pi+XfQg0JDj6vcwOqIlfXC10jUCx5FbPxzmHs1YdOBYO MSKhuIM3SBUEO4Pbk3KSH5/W6xu/6cel1/rjr1rS31EnJvyDNQ9MNBuGfs5Bp7whmn8M jxgTHkRxEcNa9U4lKz3eNI3UA3gKmxS6LzWY1t5StEVTp9XwDhZUSCDngjgglBMFRdjH iGp1plk3AUTtAUIcQW3sSuYdsT7kWXRhqyEctTfUS1TascmEktCvrUgT2B9JSAyT8y8l NH3A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :x-gm-message-state:from:to:cc:subject:date; bh=3+ZV64Yw8L/tObKZvAt2vHQwvyHArnNHF1cmFRga9aU=; b=TNGXQpOYR/IEU0RLZlJdCT8cDkPeI6dfAZBC/OjFAR+FePK2l+wfj2JvGtfnO/pH+b Jyv5L76IgZfslWWWJjR11z9advCnDzX77NbhEPJ1POEdQ7ZIwFERn3o87toG6+Okz/ii gHV+WXV1s45FVnPQfv1Te0oJTSZ51QrWvseyMJ9esXIpvxrP1UjQXLJ5s0FS23RLtSSq WBRkse5zGDfpmITrPxO66mq5Q8Pmpc72jDboklkqOUwMoZbCoFi/CdodxhGpOdVL9Is7 ZQOk4hSmnotW4nQARQAPi33LYoxY2C5ySscg7/nnYjReFGUTWRXK9Xf8W99o7HadL4IU 0l7A== X-Gm-Message-State: ACrzQf3GoI5qE/RoDcjc8Lz/yqA662UMW4N2vIW8MXkSVrXEQYV87vPF aOioq0VeQZUDh2W3rriSq3xLP7RA2rP4jPhgnqRvCw== X-Google-Smtp-Source: AMsMyM6jFpyXKqWmyJj6l7M+yRMklqNawOTZio1LyBHxmhYQcSpF22fO8CgJK5QNs+98y6XkY2Sa3QRj2FuIUiw7JFA= X-Received: by 2002:a05:6638:1a1b:b0:35a:ce31:2e2b with SMTP id cd27-20020a0566381a1b00b0035ace312e2bmr12558014jab.289.1664220875770; Mon, 26 Sep 2022 12:34:35 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Prathamesh Kulkarni Date: Tue, 27 Sep 2022 01:03:59 +0530 Message-ID: Subject: Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner To: Prathamesh Kulkarni , gcc Patches , Richard Biener , richard.sandiford@arm.com Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-3.5 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Fri, 23 Sept 2022 at 21:33, Richard Sandiford wrote: > > Prathamesh Kulkarni writes: > > On Tue, 20 Sept 2022 at 18:09, Richard Sandiford > > wrote: > >> > >> Prathamesh Kulkarni writes: > >> > On Mon, 12 Sept 2022 at 19:57, Richard Sandiford > >> > wrote: > >> >> > >> >> Prathamesh Kulkarni writes: > >> >> >> The VLA encoding encodes the first N patterns explicitly. The > >> >> >> npatterns/nelts_per_pattern values then describe how to extend that > >> >> >> initial sequence to an arbitrary number of elements. So when performing > >> >> >> an operation on (potentially) variable-length vectors, the questions is: > >> >> >> > >> >> >> * Can we work out an initial sequence and npatterns/nelts_per_pattern > >> >> >> pair that will be correct for all elements of the result? > >> >> >> > >> >> >> This depends on the operation that we're performing. E.g. it's > >> >> >> different for unary operations (vector_builder::new_unary_operation) > >> >> >> and binary operations (vector_builder::new_binary_operations). It also > >> >> >> varies between unary operations and between binary operations, hence > >> >> >> the allow_stepped_p parameters. > >> >> >> > >> >> >> For VEC_PERM_EXPR, I think the key requirement is that: > >> >> >> > >> >> >> (R) Each individual selector pattern must always select from the same vector. > >> >> >> > >> >> >> Whether this condition is met depends both on the pattern itself and on > >> >> >> the number of patterns that it's combined with. > >> >> >> > >> >> >> E.g. suppose we had the selector pattern: > >> >> >> > >> >> >> { 0, 1, 4, ... } i.e. 3x - 2 for x > 0 > >> >> >> > >> >> >> If the arguments and selector are n elements then this pattern on its > >> >> >> own would select from more than one argument if 3(n-1) - 2 >= n. > >> >> >> This is clearly true for large enough n. So if n is variable then > >> >> >> we cannot represent this. > >> >> >> > >> >> >> If the pattern above is one of two patterns, so interleaved as: > >> >> >> > >> >> >> { 0, _, 1, _, 4, _, ... } o=0 > >> >> >> or { _, 0, _, 1, _, 4, ... } o=1 > >> >> >> > >> >> >> then the pattern would select from more than one argument if > >> >> >> 3(n/2-1) - 2 + o >= n. This too would be a problem for variable n. > >> >> >> > >> >> >> But if the pattern above is one of four patterns then it selects > >> >> >> from more than one argument if 3(n/4-1) - 2 + o >= n. This is not > >> >> >> true for any valid n or o, so the pattern is OK. > >> >> >> > >> >> >> So let's define some ad hoc terminology: > >> >> >> > >> >> >> * Px is the number of patterns in x > >> >> >> * Ex is the number of elements per pattern in x > >> >> >> > >> >> >> where x can be: > >> >> >> > >> >> >> * 1: first argument > >> >> >> * 2: second argument > >> >> >> * s: selector > >> >> >> * r: result > >> >> >> > >> >> >> Then: > >> >> >> > >> >> >> (1) The number of elements encoded explicitly for x is Ex*Px > >> >> >> > >> >> >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px > >> >> >> elements for any integer N. This extended sequence can be reencoded > >> >> >> as having N*Px patterns, with Ex staying the same. > >> >> >> > >> >> >> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements > >> >> >> of the explicit encoding. > >> >> >> > >> >> >> So let's assume (optimistically) that we can produce the result > >> >> >> by calculating the first Pr*Er elements and using the Pr,Er encoding > >> >> >> to imply the rest. Then: > >> >> >> > >> >> >> * (2) means that, when combining multiple input operands with potentially > >> >> >> different encodings, we can set the number of patterns in the result > >> >> >> to the least common multiple of the number of patterns in the inputs. > >> >> >> In this case: > >> >> >> > >> >> >> Pr = least_common_multiple(P1, P2, Ps) > >> >> >> > >> >> >> is a valid number of patterns. > >> >> >> > >> >> >> * (3) means that the number of elements per pattern of the result can > >> >> >> be the maximum of the number of elements per pattern in the inputs. > >> >> >> (Alternatively, we could always use 3.) In this case: > >> >> >> > >> >> >> Er = max(E1, E2, Es) > >> >> >> > >> >> >> is a valid number of elements per pattern. > >> >> >> > >> >> >> So if (R) holds we can compute the result -- for both VLA and VLS -- by > >> >> >> calculating the first Pr*Er elements of the result and using the > >> >> >> encoding to derive the rest. If (R) doesn't hold then we need the > >> >> >> selector to be constant-length. We should then fill in the result > >> >> >> based on: > >> >> >> > >> >> >> - Pr == number of elements in the result > >> >> >> - Er == 1 > >> >> >> > >> >> >> But this should be the fallback option, even for VLS. > >> >> >> > >> >> >> As far as the arguments go: we should reject CONSTRUCTORs for > >> >> >> variable-length types. After doing that, we can treat a CONSTRUCTOR > >> >> >> for an N-element vector type by setting the number of patterns to N > >> >> >> and the number of elements per pattern to 1. > >> >> > Hi Richard, > >> >> > Thanks for the suggestions, and sorry for late response. > >> >> > I have a couple of very elementary questions: > >> >> > > >> >> > 1: Consider following inputs to VEC_PERM_EXPR: > >> >> > op1: P_op1 == 4, E_op1 == 1 > >> >> > {1, 2, 3, 4, ...} > >> >> > > >> >> > op2: P_op2 == 2, E_op2 == 2 > >> >> > {11, 21, 12, 22, ...} > >> >> > > >> >> > sel: P_sel == 3, E_sel == 1 > >> >> > {0, 4, 5, ...} > >> >> > > >> >> > What shall be the result in this case ? > >> >> > P_res = lcm(4, 2, 3) == 12 > >> >> > E_res = max(1, 2, 1) == 2. > >> >> > >> >> Yeah, that looks right. Of course, since sel is just repeating > >> >> every three elements, it could just be P_res==3, E_sel==1, > >> >> but the vector_builder would do that optimisation for us. > >> >> > >> >> (I'm not sure whether we'd see a P==3 encoding in practice, > >> >> but perhaps it's possible.) > >> >> > >> >> If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than > >> >> repeating every three elements) then: > >> >> > >> >> P_res = lcm(4, 2) == 4 > >> >> E_res = max(1, 2, 3) == 3 > >> >> > >> >> which also looks like it would give the right encoding. > >> >> > >> >> > 2. How should we specify index of element in sel when it is not > >> >> > explicitly encoded in the operand ? > >> >> > For eg: > >> >> > op1: npatterns == 2, nelts_per_pattern == 3 > >> >> > { 1, 0, 2, 0, 3, 0, ... } > >> >> > op2: npatterns == 6, nelts_per_pattern == 1 > >> >> > { 11, 12, 13, 14, 15, 16, ...} > >> >> > > >> >> > In sel, how do we refer to element with value 4, that would be 4th element > >> >> > of first pattern in op1, but not explicitly encoded ? > >> >> > In op1, 4 will come at index == 6. > >> >> > However in sel, index 6 would refer to 11, ie op2[0] ? > >> >> > >> >> What index 6 refers to depends on the length of op1. > >> >> If the length of op1 is 4 at runtime the index 6 refers to op2[2]. > >> >> If the length of op1 is 6 then index 6 refers to op2[0]. > >> >> If the length of op1 is 8 then index 6 refers to op1[6], etc. > >> >> > >> >> This comes back to (R) above. We need to be able to prove at compile > >> >> time that each pattern selects from the same input vectors (for all > >> >> elements, not just the encoded elements). If we can't prove that > >> >> then we can't fold for variable-length vectors. > >> > Hi Richard, > >> > Thanks for the clarification! > >> > I have come up with an approach to verify R: > >> > > >> > Consider following pattern: > >> > a0, a1, a1 + S, ..., > >> > nelts_per_pattern would be n / Psel, where n == actual length of the vector. > >> > And last element of pattern will be given by: > >> > a1 + (n/Psel - 2) * S > >> > >> (I think this is just a terminology thing, but in the source, > >> nelts_per_pattern is a compile-time constant that describes the > >> encoding. It always has the value 1, 2 or 3, regardless of the > >> runtime length.) > >> > >> > Rearranging the above term, we can think of pattern > >> > as a line with following equation: > >> > y = (S/Psel) * n + (a1 - 2S) > >> > where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept. > >> > > >> > At, > >> > n = 2*Psel, y = a1 > >> > n = 3*Psel, y = a1 + S, > >> > n = 4*Psel, y = a1 + 2S ... > >> > > >> > To compare with n, we compare the following lines: > >> > y1 = (S/Psel) * n + (a1 - 2S) > >> > y2 = n > >> > > >> > So to check if elements always come from first vector, > >> > we want to check y1 < y2 for n > 0. > >> > Likewise, if elements always come from second vector, > >> > we want to check if y1 >= y2, for n > 0. > >> > >> One difficulty here is that the indices wrap around, so an index value of > >> 2n selects from the first vector rather than the second. (This is pretty > >> awkward for VLA and doesn't match the native SVE TBL behaviour.) So... > >> > >> > If both lines are parallel, ie S/PSel == 1, > >> > then we choose first or second vector depending on the y-intercept a1 - 2S. > >> > If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select second vector. > >> > If a1 - 2S < 0, then y1 < y2 for all values of n, so select first vector. > >> > > >> > For eg, if we have following pattern: > >> > {0, 1, 3, ...} > >> > where a1 = 1, S = 2, and consider PSel = 2. > >> > > >> > y1 = n - 3 > >> > y2 = n > >> > > >> > In this case, y1 < y2 for all values of n, so we select first vector. > >> > > >> > Since y2 = n, passes thru origin with slope = 1, > >> > a line can intersect it either in 1st or 3rd quadrant. > >> > Calculate point of intersection: > >> > n_int = Psel * (a1 - 2S) / (Psel - S); > >> > > >> > (a) n_int > 0 > >> > n_int > 0 => intersecting in 1st quadrant. > >> > In this case there will be a cross-over at n_int. > >> > > >> > For eg, consider pattern { 0, 1, 4, ...} > >> > a1 = 1, S = 3, and let's take PSel = 2 > >> > > >> > y1 = (3/2)n - 5 > >> > y2 = n > >> > > >> > Both intersect at (10, 10). > >> > So for n < 10, y1 < y2 > >> > and for n > 10, y1 > y2. > >> > so in this case we can't fold since we will select elements from both vectors. > >> > > >> > (b) n_int <= 0 > >> > In this case, the lines will intersect in 3rd quadrant, > >> > so depending upon the slope we can choose either vector. > >> > If (S/Psel) < 1, ie y1 has a gentler slope than y2, > >> > then y1 < y2 for n > 0 > >> > If (S/Psel) > 1, ie, y1 has a steeper slope than y2, > >> > then y1 > y2 for n > 0. > >> > > >> > For eg, in the above pattern {0, 1, 4, ...} > >> > a1 = 1, S = 3, and let's take PSel = 4 > >> > > >> > y1 = (3/4)n - 5 > >> > y2 = n > >> > Both intersect at (-20, -20). > >> > y1's slope = (S/Psel) = (3/4) < 1 > >> > So y1 < y2 for n > 0. > >> > Graph: https://www.desmos.com/calculator/ct7edqbr9d > >> > So we pick first vector. > >> > > >> > The following pseudo code attempts to capture this: > >> > > >> > tree select_vector_for_pattern (op1, op2, a1, S, Psel) > >> > { > >> > if (S == Psel) > >> > { > >> > /* If y1 intercept >= 0, then y1 >= y2 > >> > for all values of n. */ > >> > if (a1 - 2*S >= 0) > >> > return op2; > >> > return op1; > >> > } > >> > > >> > n_int = Psel * (a1 - 2*S) / (Psel - S) > >> > /* If intersecting in 1st quadrant, there will be cross over, > >> > bail out. */ > >> > if (n_int > 0) > >> > return NULL_TREE; > >> > /* If S/Psel < 1, ie y1 has gentler slope than y2, > >> > then y1 < y2 for n > 0. */ > >> > if (S < Psel) > >> > return op1; > >> > /* If S/Psel > 1, ie y1 has steeper slope than y2, > >> > then y1 > y2 for n > 0. */ > >> > return op2; > >> > } > >> > > >> > Does this look reasonable ? > >> > >> ...I think we need to be more conservative. I think we also need to > >> distinguish n1 (the number of elements in the input vectors) and > >> nsel (the number of elements in the selector). > >> > >> If nsel is a multiple of Psel and nsel >= 2 * Psel then like you say > >> there will be (nsel /exact Psel) - 1 index elements from the stepped > >> encoding and the final index value will be: > >> > >> ae = a1 + (nsel /exact Psel - 2) * S > >> > >> Because of wrap-around, we need to ensure that that doesn't run > >> into an adjoining vector. I think the easiest way of doing that > >> is to calculate a1 /trunc n1 and ae /trunc n1 (using can_div_trunc_p) > >> and check that the quotients are equal. > > IIUC, If a1/n1 == ae/n1, then the sequence will choose from the same > > vector since ae is last elem, > > and the quotient can choose the vector because it will be either 0 or > > 1 (since indices wrap around after 2n). > > Right. > > > Um, could you please elaborate a bit on how will can_div_trunc_p > > calculate quotients, when n1 and nsel are unknown > > at compile time ? > > > > To calculate the quotients for a hard coded pattern, > > with a1 = 1, nsel = n1 = len(VNx4SI), S = 3, Psel = 4, > > I tried the following: > > > > poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode); > > poly_uint64 nsel = n1; > > poly_uint64 a1 = 1 > > poly_uint64 Esel = exact_div (nsel / Psel); > > We can't use exact_div here. We need to test that nsel is a multiple > of Psel (e.g. using multiple_p). > > > poly_uint64 ae = a1 + (Esel - 2) * S; > > > > int q1, qe; > > poly_uint64 r1, re; > > > > bool div1_p = can_div_trunc_p (a1, n1, &q1, &r1); > > bool dive_p = can_div_trunc_p (ae, n1, &qe, &re); > > > > Which gave strange values for qe and 0 for q1, with first call succeeding, > > and second call returning false. > > Am I calling it incorrectly ? > > No, that looks right. I guess the issue is that ae < a1 for Psel == nsel > and so for min(nsel) the index wraps around to the other input vector. > IMO it's OK to punt on that. We don't have interfaces for applying > ranges to the indeterminates in a poly_int. Hi Richard, Thanks for the suggestions. I tried the following to test for wrap around: poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode); poly_uint64 nsel = n1; poly_uint64 a1 = 1; unsigned Psel = 4; int S = 3; if (multiple_p (nsel, Psel)) { poly_uint64 nelems = exact_div (nsel, Psel); poly_uint64 ae = a1 + (nelems - 2) * S; if (known_gt (ae, a1)) { int q1, qe; poly_uint64 r1, re; bool ok1 = can_div_trunc_p (a1, n1, &q1, &r1); bool oke = can_div_trunc_p (ae, n1, &qe, &re); } } However, the second call to can_div_trunc_p still returns false, with strange values for qe. Thanks, Prathamesh > > I guess if we want to be fancy, we could look for a1 = a0 + S and, > if true, do the calculation based on a0 rather than a1. The combination > of Psel >= min(nsel) and a0 != a1 + S, although valid in theory, should > be a very niche case. > > But it might be better to stick to the a1-based case first and > get that working. We can then see whether it's worth extending. > > Thanks, > Richard > > > > > Thanks, > > Prathamesh > > > >> > >> However, I now realise that there's a wrinkle. If S < 0 then we > >> also need to check that either: > >> > >> (a) the chosen input vector (given by the quotient above) has either: > >> > >> (i) nelts_per_pattern == 1 > >> (ii) nelts_per_pattern == 3 and the difference between the > >> first and second elements in each pattern is the same as > >> the difference between the second and third elements > >> (i.e. every pattern is a natural stepped one). > >> > >> (b) ae % n1 >= the number of patterns in the input vector. > >> (ae % n1 is calculated as a side-effect of can_div_trunc_p). > >> > >> Otherwise the index vector has the effect of moving the "foreground" > >> from the front of the input vector to the end of the result vector. > >> > >> If nsel == Psel then the stepped part of the sequence doesn't matter. > >> Thus, the same condition works whenever nsel is a multiple of Psel. > >> > >> If nsel is not a multiple of Psel then I think we should punt for now. > >> There are some cases that we could handle when n1 == nsel, but "nsel > >> is a multiple of Psel" will be the normal case. > >> > >> Thanks, > >> Richard