From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr1-x432.google.com (mail-wr1-x432.google.com [IPv6:2a00:1450:4864:20::432]) by sourceware.org (Postfix) with ESMTPS id 2BA603858D38 for ; Wed, 1 Feb 2023 10:02:36 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 2BA603858D38 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=linaro.org Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linaro.org Received: by mail-wr1-x432.google.com with SMTP id q5so16794704wrv.0 for ; Wed, 01 Feb 2023 02:02:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :from:to:cc:subject:date:message-id:reply-to; bh=tDUbwnUZB8CTVWFhwRYc4mFASOtcVq5dqDLW5DXzaWY=; b=XL7LfqSMk8SWqsE7AA/K3wo2Iz4vPrZkTDpLWAnWoZj6bxsGu97QdRyphhbvYmqAXp ruM2jQazE+bV+MT5c1iEWfUB0Zbu5CugEAMcNOECnYKbbgnEL/uMQdT0Lj0MN9KWhjHM SsOUFVzZ7v0e/n0KoPCZgVLL1tPcnj/CJ6pgutY0A7o2aMl0LNU5NwGtyYwq75rjSjsL JhtLLqPvmQftKvDFRqb6QxjXfobrSaW12SELklpZ0Hf6/mDFLp3u1e02tV6b09OzjOYt c/DAOFKSsWSO7AAnArUJBkW7uZ7dugiBOy/CVnlDYbfVRODt0qc7P3I3QSQcwY236vKO rWgw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=tDUbwnUZB8CTVWFhwRYc4mFASOtcVq5dqDLW5DXzaWY=; b=sT7Qz3LoK3gygJc0ux8P5d1t0KK4/opFSl6T4JVkxvn+QIQwkrk/CyIvh9yatocPUm WPUmHcummB6vZeTEaYCNMFOV+h0JmROBfrhCB57sMzbWm/ZDnwfLjqFiZsyiGh8OcUaV sFIIejwNJIK7LGfZpOHumCpWHEHJIkiKNSrMpxS3RyX3Z+s5cyKhJtB1OgqzPYEvhgY6 JGbtlR2UqgkLdNJMabUxC0mjQFY7I0N3q0JgKfCJ6+FEL9GztVMUrTsfB2pmFjHDvIQD 55tx6J783lXmK+yGSDmgknQwiHGt678fS3Dk0cHnjghcGfT7pcQO+xmVmzBaDQiBUdjw QewQ== X-Gm-Message-State: AO0yUKUgxWxSCOSM+jiowRCTQFt17bjiBGttxy0LWMzIaMsl6ZiG0tGY N7VC18hCWr+zDHVAVIHU14zZ2dbMstcJpv//EvBZmFDo9gYsVYth X-Google-Smtp-Source: AK7set/DIGnViApM8OkCfbMFM/BrjjWUSYl6UPoNereEpNRWywxjnrp/JBrsU3UTilF5OgWaYFlf9UUS2gqA3zwGJMM= X-Received: by 2002:adf:ef05:0:b0:2bf:c066:dd76 with SMTP id e5-20020adfef05000000b002bfc066dd76mr83496wro.173.1675245753888; Wed, 01 Feb 2023 02:02:33 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Prathamesh Kulkarni Date: Wed, 1 Feb 2023 15:31:59 +0530 Message-ID: Subject: Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner To: Prathamesh Kulkarni via Gcc-patches , Prathamesh Kulkarni , Richard Biener , richard.sandiford@arm.com Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-2.8 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,KAM_SHORT,RCVD_IN_DNSWL_NONE,SCC_5_SHORT_WORD_LINES,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Tue, 17 Jan 2023 at 17:24, Prathamesh Kulkarni wrote: > > On Mon, 26 Dec 2022 at 09:56, Prathamesh Kulkarni > wrote: > > > > On Tue, 13 Dec 2022 at 11:35, Prathamesh Kulkarni > > wrote: > > > > > > On Tue, 6 Dec 2022 at 21:00, Richard Sandiford > > > wrote: > > > > > > > > Prathamesh Kulkarni via Gcc-patches writes: > > > > > On Fri, 4 Nov 2022 at 14:00, Prathamesh Kulkarni > > > > > wrote: > > > > >> > > > > >> On Mon, 31 Oct 2022 at 15:27, Richard Sandiford > > > > >> wrote: > > > > >> > > > > > >> > Prathamesh Kulkarni writes: > > > > >> > > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford > > > > >> > > wrote: > > > > >> > >> > > > > >> > >> Sorry for the slow response. I wanted to find some time to think > > > > >> > >> about this a bit more. > > > > >> > >> > > > > >> > >> Prathamesh Kulkarni writes: > > > > >> > >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford > > > > >> > >> > wrote: > > > > >> > >> >> > > > > >> > >> >> Richard Sandiford via Gcc-patches writes: > > > > >> > >> >> > Prathamesh Kulkarni writes: > > > > >> > >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ? > > > > >> > >> >> >> For num_poly_int_coeffs == 2, > > > > >> > >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x) > > > > >> > >> >> >> If a1/trunc n1 succeeds, > > > > >> > >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0. > > > > >> > >> >> >> So, a1 has to be < n1.coeffs[0] ? > > > > >> > >> >> > > > > > >> > >> >> > Remember that a1 is itself a poly_int. It's not necessarily a constant. > > > > >> > >> >> > > > > > >> > >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector: > > > > >> > >> >> > > > > > >> > >> >> > { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... } > > > > >> > >> >> > > > > >> > >> >> Sorry, should have been: > > > > >> > >> >> > > > > >> > >> >> { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... } > > > > >> > >> > Hi Richard, > > > > >> > >> > Thanks for the clarifications, and sorry for late reply. > > > > >> > >> > I have attached POC patch that tries to implement the above approach. > > > > >> > >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors. > > > > >> > >> > > > > > >> > >> > For VLA vectors, I have only done limited testing so far. > > > > >> > >> > It seems to pass couple of tests written in the patch for > > > > >> > >> > nelts_per_pattern == 3, > > > > >> > >> > and folds the following svld1rq test: > > > > >> > >> > int32x4_t v = {1, 2, 3, 4}; > > > > >> > >> > return svld1rq_s32 (svptrue_b8 (), &v[0]) > > > > >> > >> > into: > > > > >> > >> > return {1, 2, 3, 4, ...}; > > > > >> > >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding. > > > > >> > >> > > > > > >> > >> > I have a couple of questions: > > > > >> > >> > 1] When mask selects elements from same vector but from different patterns: > > > > >> > >> > For eg: > > > > >> > >> > arg0 = {1, 11, 2, 12, 3, 13, ...}, > > > > >> > >> > arg1 = {21, 31, 22, 32, 23, 33, ...}, > > > > >> > >> > mask = {0, 0, 0, 1, 0, 2, ... }, > > > > >> > >> > All have npatterns = 2, nelts_per_pattern = 3. > > > > >> > >> > > > > > >> > >> > With above mask, > > > > >> > >> > Pattern {0, ...} selects arg0[0], ie {1, ...} > > > > >> > >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...} > > > > >> > >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different > > > > >> > >> > pattern in arg0. > > > > >> > >> > The result is: > > > > >> > >> > res = {1, 1, 1, 11, 1, 2, ...} > > > > >> > >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with: > > > > >> > >> > with a0 = 1, a1 = 11, S = -9. > > > > >> > >> > Is that expected tho ? It seems to create a new encoding which > > > > >> > >> > wasn't present in the input vector. For instance, the next elem in > > > > >> > >> > sequence would be -7, > > > > >> > >> > which is not present originally in arg0. > > > > >> > >> > > > > >> > >> Yeah, you're right, sorry. Going back to: > > > > >> > >> > > > > >> > >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px > > > > >> > >> elements for any integer N. This extended sequence can be reencoded > > > > >> > >> as having N*Px patterns, with Ex staying the same. > > > > >> > >> > > > > >> > >> I guess we need to pick an N for the selector such that each new > > > > >> > >> selector pattern (each one out of the N*Px patterns) selects from > > > > >> > >> the *same pattern* of the same data input. > > > > >> > >> > > > > >> > >> So if a particular pattern in the selector has a step S, and the data > > > > >> > >> input it selects from has Pi patterns, N*S must be a multiple of Pi. > > > > >> > >> N must be a multiple of least_common_multiple(S,Pi)/S. > > > > >> > >> > > > > >> > >> I think that means that the total number of patterns in the result > > > > >> > >> (Pr from previous messages) can safely be: > > > > >> > >> > > > > >> > >> Ps * least_common_multiple( > > > > >> > >> least_common_multiple(S[1], P[input(1)]) / S[1], > > > > >> > >> ... > > > > >> > >> least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps] > > > > >> > >> ) > > > > >> > >> > > > > >> > >> where: > > > > >> > >> > > > > >> > >> Ps = the number of patterns in the selector > > > > >> > >> S[I] = the step for selector pattern I (I being 1-based) > > > > >> > >> input(I) = the data input selected by selector pattern I (I being 1-based) > > > > >> > >> P[I] = the number of patterns in data input I > > > > >> > >> > > > > >> > >> That's getting quite complicated :-) If we allow arbitrary P[...] > > > > >> > >> and S[...] then it could also get large. Perhaps we should finally > > > > >> > >> give up on the general case and limit this to power-of-2 patterns and > > > > >> > >> power-of-2 steps, so that least_common_multiple becomes MAX. Maybe that > > > > >> > >> simplifies other things as well. > > > > >> > >> > > > > >> > >> What do you think? > > > > >> > > Hi Richard, > > > > >> > > Thanks for the suggestions. Yeah I suppose we can initially add support for > > > > >> > > power-of-2 patterns and power-of-2 steps and try to generalize it in > > > > >> > > follow up patches if possible. > > > > >> > > > > > > >> > > Sorry if this sounds like a silly ques -- if we are going to have > > > > >> > > pattern in selector, select *same pattern from same input vector*, > > > > >> > > instead of re-encoding the selector to have N * Ps patterns, would it > > > > >> > > make sense for elements in selector to denote pattern number itself > > > > >> > > instead of element index > > > > >> > > if input vectors are VLA ? > > > > >> > > > > > > >> > > For eg: > > > > >> > > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...} > > > > >> > > op1 = {...} > > > > >> > > with npatterns == 4, nelts_per_pattern == 3, > > > > >> > > sel = {0, 3} should pick pattern 0 and pattern 3 from op0, > > > > >> > > so, res = {1, 4, 1, 5, 1, 6, ...} > > > > >> > > Not sure if this is correct tho. > > > > >> > > > > > >> > This wouldn't allow us to represent things like a "duplicate one > > > > >> > element", or "copy the leading N elements from the first input and > > > > >> > the other elements from elements N+ of the second input", which we > > > > >> > can with the current scheme. > > > > >> > > > > > >> > The restriction about each (unwound) selector pattern selecting from the > > > > >> > same input pattern only applies to case where the selector pattern is > > > > >> > stepped (and only applies to the stepped part of the pattern, not the > > > > >> > leading element). The restriction is also local to this code; it > > > > >> > doesn't make other VEC_PERM_EXPRs invalid. > > > > >> Hi Richard, > > > > >> Thanks for the clarifications. > > > > >> Just to clarify your approach with an eg: > > > > >> Let selected input vector be: > > > > >> arg0: {a0, b0, c0, d0, > > > > >> a0 + S, b0 + S, c0 + S, d0 + S, > > > > >> a0 + 2S, b0 + 2S, c0 + 2S, dd + 2S, ...} > > > > >> where arg0 has npatterns = 4, and nelts_per_pattern = 3. > > > > >> > > > > >> Let sel = {0, 0, 1, 2, 2, 4, ...} > > > > >> where sel_npatterns = 2 and sel_nelts_per_pattern = 3 > > > > >> > > > > >> So, the first pattern in sel: > > > > >> p1: {0, 1, 2, ...} which will select {a0, b0, c0, ...} > > > > >> which would be incorrect, since they belong to different patterns in arg0. > > > > >> So to select elements from same pattern in arg0, we need to divide p1 > > > > >> into at least N1 = P_arg0 / S0 = 4 distinct patterns. > > > > >> > > > > >> Similarly for second pattern in sel: > > > > >> p2: {0, 2, 4, ...}, we need to divide it into > > > > >> at least N2 = P_arg0 / S1 = 2 distinct patterns. > > > > >> > > > > >> Select N = max(N1, N2) = 4 > > > > >> So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements, > > > > >> and will be re-encoded with N*Ps = 8 patterns: > > > > >> > > > > >> re-encoded sel: > > > > >> {a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S, > > > > >> a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S, > > > > >> a0 + 4S, b0 + 4S, c0 + 4s, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S, > > > > >> ...} > > > > >> > > > > >> with 8 patterns, > > > > >> p1: {a0, a0 + 2S, a0 + 4S, ...} > > > > >> p2: {b0, b0 + 2S, b0 + 4S, ...} > > > > >> ... > > > > >> which select elements from same pattern from same input vector. > > > > >> Does this look correct ? > > > > >> > > > > >> For feasibility, we can check initially that sel_npatterns, arg0_npatterns, > > > > >> arg1_npatterns are powers of 2 and for each stepped pattern, > > > > >> it's stepped size S is a power of 2. I suppose this will be sufficient > > > > >> to ensure that sel can be re-encoded with N*Ps npatterns > > > > >> such that each new pattern selects elements from same pattern > > > > >> of the input vector ? > > > > >> > > > > >> Then compute N: > > > > >> N = 1; > > > > >> for (every pattern p in sel) > > > > >> { > > > > >> op = corresponding input vector for pattern; > > > > >> S = step_size (p); > > > > >> N_pattern = max (S, npatterns (op)) / S; > > > > >> N = max(N, N_pattern) > > > > >> } > > > > >> > > > > >> and re-encode selector with N*Ps patterns. > > > > >> I guess rest of the patch will mostly stay the same. > > > > > Hi, > > > > > I have attached a POC patch based on the above approach. > > > > > For the above eg: > > > > > arg0 = {1, 11, 2, 12, 3, 13, ...} // npatterns = 2, nelts_per_pattern = 3, > > > > > and > > > > > sel = {0, 0, 0, 1, 0, 2, ...} > > > > > with sel_npatterns == 2 and sel_nelts_per_pattern == 3. > > > > > > > > > > For pattern, {0, 1, 2, ...} it will select elements from different > > > > > patterns from arg0, which is incorrect. > > > > > So we choose N = P1/S = 2/1 = 2, where P1 is number of elements in arg0. > > > > > So re-encoded sel = { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...} > > > > > with following patterns: > > > > > p1 = { 0, ... } > > > > > p2 = { 0, 2, 4, ... } > > > > > p3 = { 0, ... } > > > > > p4 = { 1, 3, 5, ... } > > > > > which should be correct since each element from the respective > > > > > patterns in sel chooses > > > > > elements from same pattern from arg0. > > > > > So, res = { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ... } > > > > > Does this look correct ? > > > > > > > > Yeah. But like I said above: > > > > > > > > The restriction about each (unwound) selector pattern selecting from the > > > > same input pattern only applies to case where the selector pattern is > > > > stepped (and only applies to the stepped part of the pattern, not the > > > > leading element). > > > > > > > > If the selector nelts-per-pattern is 1 or 2 then we can support all > > > > power-of-2 cases, with the final npatterns being the maximum of the > > > > source nelts-per-patterns. > > > > > > > > Also, going back to an earlier part of the discussion, I think we > > > > should use this technique for both VLA and VLS, and only fall back > > > > to the VLS-specific approach if the VLA approach fails. > > > > > > > > So I suggest we put the VLA code in its own function and have > > > > the VLS-only path kick in when the VLA code fails. If the code is > > > > having to pass a lot of state around, it might make sense to define > > > > a local class, store the state in member variables, and use member > > > > functions for the various subroutines. I don't know if that will > > > > work out neater though. > > > Hi Richard, > > > Thanks for the suggestions. I have attached an updated POC patch, > > > that does the following: > > > (a) Uses VLA approach by default, and falls back to VLS specific > > > folding if VLA approach fails for VLS vectors. > > > (b) Separates cases for sel_nelts_per_pattern < 3 and > > > sel_nelts_per_pattern == 3. > > > (c) Allows, a0 to select different vector from a1 .. ae. > > > I have written a few unit tests in the patch for testing the same. > > > Does the patch look in the right direction ? > > > > > > The patch has an issue for the following case marked as "case 9" > > > in test_vec_perm_vla_folding: > > > arg0 = { 1, 11, 2, 12, 3, 13, ... } > > > arg1 = { 21, 31, 22, 32, 23, 33, ... } > > > arg0 and arg1 have npatterns = 2, nelts_per_pattern = 3. > > > > > > mask = { 4 + 4x, 5 + 4x, 6 + 4x, ... } > > > where 4 + 4x is runtime vector length. > > > npatterns = 1, nelts_per_pattern = 3. > > > > > > a1 = 5 + 4x > > > ae = a1 + (esel - 2) * S > > > = (5 + 4x) + (4 + 4x - 2) * 1 > > > = 7 + 8x > > > > > > Since (7 + 8x) /trunc (4 + 4x) returns false, fold_vec_perm returns NULL_TREE. > > > Is that expected for the above mask ? > > > > > > I intended it to select the second vector similar to, > > > sel = { 0, 1, 2 .. }, which would select the first vector > > > by re-encoding sel as { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... } > > > with two patterns: {0, 2, 4, ...} and {1, 3, 5, ...} > > > The first would select elements from first pattern from arg0, > > > while the second pattern would select elements from second pattern from arg0. > > > with result effectively having same encoding as arg0. > > > Shouldn't sel = { 4 + 4x, 5 + 4x, 6 + 4x, ... } similarly select arg1 ? > > Hi Richard, > > ping https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html > Hi Richard, > ping * 2: https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html Hi Richard, ping * 3: https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html Thanks, Prathamesh > > Thanks, > Prathamesh > > > > Thanks, > > Prathamesh > > > > > > PS: I will be on vacation next week. > > > > > > Thanks, > > > Prathamesh > > > > > > > > > > > > @@ -10494,38 +10497,55 @@ fold_mult_zconjz (location_t loc, tree type, tree expr) > > > > > build_zero_cst (itype)); > > > > > } > > > > > > > > > > +/* Check if PATTERN in SEL selects either ARG0 or ARG1, > > > > > + and return the selected arg, otherwise return NULL_TREE. */ > > > > > > > > > > -/* Helper function for fold_vec_perm. Store elements of VECTOR_CST or > > > > > - CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return > > > > > - true if successful. */ > > > > > - > > > > > -static bool > > > > > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts) > > > > > +static tree > > > > > +get_vector_for_pattern (tree arg0, tree arg1, > > > > > + const vec_perm_indices &sel, unsigned pattern, > > > > > + unsigned sel_npatterns, int &S) > > > > > { > > > > > - unsigned HOST_WIDE_INT i, nunits; > > > > > + unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern (); > > > > > > > > > > - if (TREE_CODE (arg) == VECTOR_CST > > > > > - && VECTOR_CST_NELTS (arg).is_constant (&nunits)) > > > > > + poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)); > > > > > + poly_uint64 nsel = sel.length (); > > > > > + poly_uint64 esel; > > > > > + > > > > > + if (!multiple_p (nsel, sel_npatterns, &esel)) > > > > > + return NULL_TREE; > > > > > + > > > > > + poly_uint64 a1 = sel[pattern + sel_npatterns]; > > > > > + S = 0; > > > > > + if (sel_nelts_per_pattern == 3) > > > > > { > > > > > - for (i = 0; i < nunits; ++i) > > > > > - elts[i] = VECTOR_CST_ELT (arg, i); > > > > > + poly_uint64 a2 = sel[pattern + 2 * sel_npatterns]; > > > > > + S = (a2 - a1).to_constant (); > > > > > > > > The code hasn't proven that this to_constant is safe. > > > > > > > > > + if (S != 0 && !pow2p_hwi (S)) > > > > > + return NULL_TREE; > > > > > } > > > > > - else if (TREE_CODE (arg) == CONSTRUCTOR) > > > > > + > > > > > + poly_uint64 ae = a1 + (esel - 2) * S; > > > > > + uint64_t q1, qe; > > > > > + poly_uint64 r1, re; > > > > > + > > > > > + if (!can_div_trunc_p (a1, n1, &q1, &r1) > > > > > + || !can_div_trunc_p (ae, n1, &qe, &re) > > > > > + || (q1 != qe)) > > > > > + return NULL_TREE; > > > > > > > > Going back to the above: this check doesn't make sense for > > > > sel_nelts_per_pattern != 3. > > > > > > > > Thanks, > > > > Richard > > > > > > > > > + > > > > > + tree arg = ((q1 & 1) == 0) ? arg0 : arg1; > > > > > + > > > > > + if (S < 0) > > > > > { > > > > > - constructor_elt *elt; > > > > > + poly_uint64 a0 = sel[pattern]; > > > > > + if (!known_eq (S, a1 - a0)) > > > > > + return NULL_TREE; > > > > > > > > > > - FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt) > > > > > - if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE) > > > > > - return false; > > > > > - else > > > > > - elts[i] = elt->value; > > > > > + if (!known_gt (re, VECTOR_CST_NPATTERNS (arg))) > > > > > + return NULL_TREE; > > > > > } > > > > > - else > > > > > - return false; > > > > > - for (; i < nelts; i++) > > > > > - elts[i] > > > > > - = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node); > > > > > - return true; > > > > > + > > > > > + return arg; > > > > > } > > > > > > > > > > /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL > > > > > @@ -10539,41 +10559,135 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel) > > > > > unsigned HOST_WIDE_INT nelts; > > > > > bool need_ctor = false; > > > > > > > > > > - if (!sel.length ().is_constant (&nelts)) > > > > > - return NULL_TREE; > > > > > - gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts) > > > > > - && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts) > > > > > - && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts)); > > > > > + gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ()) > > > > > + && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), > > > > > + TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)))); > > > > > if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type) > > > > > || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type)) > > > > > return NULL_TREE; > > > > > > > > > > - tree *in_elts = XALLOCAVEC (tree, nelts * 2); > > > > > - if (!vec_cst_ctor_to_array (arg0, nelts, in_elts) > > > > > - || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts)) > > > > > + unsigned res_npatterns = 0; > > > > > + unsigned res_nelts_per_pattern = 0; > > > > > + unsigned sel_npatterns = 0; > > > > > + tree *vector_for_pattern = NULL; > > > > > + > > > > > + if (TREE_CODE (arg0) == VECTOR_CST > > > > > + && TREE_CODE (arg1) == VECTOR_CST > > > > > + && !sel.length ().is_constant ()) > > > > > + { > > > > > + unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0); > > > > > + unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1); > > > > > + sel_npatterns = sel.encoding ().npatterns (); > > > > > + > > > > > + if (!pow2p_hwi (arg0_npatterns) > > > > > + || !pow2p_hwi (arg1_npatterns) > > > > > + || !pow2p_hwi (sel_npatterns)) > > > > > + return NULL_TREE; > > > > > + > > > > > + unsigned N = 1; > > > > > + vector_for_pattern = XALLOCAVEC (tree, sel_npatterns); > > > > > + for (unsigned i = 0; i < sel_npatterns; i++) > > > > > + { > > > > > + int S = 0; > > > > > + tree op = get_vector_for_pattern (arg0, arg1, sel, i, sel_npatterns, S); > > > > > + if (!op) > > > > > + return NULL_TREE; > > > > > + vector_for_pattern[i] = op; > > > > > + unsigned N_pattern = > > > > > + (S > 0) ? std::max(S, VECTOR_CST_NPATTERNS (op)) / S : 1; > > > > > + N = std::max (N, N_pattern); > > > > > + } > > > > > + > > > > > + res_npatterns > > > > > + = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns)); > > > > > + > > > > > + res_nelts_per_pattern > > > > > + = std::max(sel.encoding ().nelts_per_pattern (), > > > > > + std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0), > > > > > + VECTOR_CST_NELTS_PER_PATTERN (arg1))); > > > > > + } > > > > > + else if (sel.length ().is_constant (&nelts) > > > > > + && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant () > > > > > + && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts) > > > > > + { > > > > > + /* For VLS vectors, treat all vectors with > > > > > + npatterns = nelts, nelts_per_pattern = 1. */ > > > > > + res_npatterns = sel_npatterns = nelts; > > > > > + res_nelts_per_pattern = 1; > > > > > + vector_for_pattern = XALLOCAVEC (tree, nelts); > > > > > + for (unsigned i = 0; i < nelts; i++) > > > > > + { > > > > > + HOST_WIDE_INT index; > > > > > + if (!sel[i].is_constant (&index)) > > > > > + return NULL_TREE; > > > > > + vector_for_pattern[i] = (index < nelts) ? arg0 : arg1; > > > > > + } > > > > > + } > > > > > + else > > > > > return NULL_TREE; > > > > > > > > > > - tree_vector_builder out_elts (type, nelts, 1); > > > > > - for (i = 0; i < nelts; i++) > > > > > + tree_vector_builder out_elts (type, res_npatterns, > > > > > + res_nelts_per_pattern); > > > > > + unsigned res_nelts = res_npatterns * res_nelts_per_pattern; > > > > > + for (unsigned i = 0; i < res_nelts; i++) > > > > > { > > > > > - HOST_WIDE_INT index; > > > > > - if (!sel[i].is_constant (&index)) > > > > > - return NULL_TREE; > > > > > - if (!CONSTANT_CLASS_P (in_elts[index])) > > > > > - need_ctor = true; > > > > > - out_elts.quick_push (unshare_expr (in_elts[index])); > > > > > + /* For VLA vectors, i % sel_npatterns would give the original > > > > > + pattern the element belongs to, which is sufficient to get the arg. > > > > > + Even if sel_npatterns has been multiplied by N, > > > > > + they will always come from the same input vector. > > > > > + For VLS vectors, sel_npatterns == res_nelts == nelts, > > > > > + so i % sel_npatterns == i since i < nelts */ > > > > > + > > > > > + tree arg = vector_for_pattern[i % sel_npatterns]; > > > > > + unsigned HOST_WIDE_INT index; > > > > > + > > > > > + if (arg == arg0) > > > > > + { > > > > > + if (!sel[i].is_constant ()) > > > > > + return NULL_TREE; > > > > > + index = sel[i].to_constant (); > > > > > + } > > > > > + else > > > > > + { > > > > > + gcc_assert (arg == arg1); > > > > > + poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)); > > > > > + uint64_t q; > > > > > + poly_uint64 r; > > > > > + > > > > > + /* Divide sel[i] by input vector length, to obtain remainder, > > > > > + which would be the index for either input vector. */ > > > > > + if (!can_div_trunc_p (sel[i], n1, &q, &r)) > > > > > + return NULL_TREE; > > > > > + > > > > > + if (!r.is_constant (&index)) > > > > > + return NULL_TREE; > > > > > + } > > > > > + > > > > > + tree elem; > > > > > + if (TREE_CODE (arg) == CONSTRUCTOR) > > > > > + { > > > > > + gcc_assert (index < nelts); > > > > > + if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg))) > > > > > + return NULL_TREE; > > > > > + elem = CONSTRUCTOR_ELT (arg, index)->value; > > > > > + if (VECTOR_TYPE_P (TREE_TYPE (elem))) > > > > > + return NULL_TREE; > > > > > + need_ctor = true; > > > > > + } > > > > > + else > > > > > + elem = vector_cst_elt (arg, index); > > > > > + out_elts.quick_push (elem); > > > > > } > > > > > > > > > > if (need_ctor) > > > > > { > > > > > vec *v; > > > > > - vec_alloc (v, nelts); > > > > > - for (i = 0; i < nelts; i++) > > > > > + vec_alloc (v, res_nelts); > > > > > + for (i = 0; i < res_nelts; i++) > > > > > CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]); > > > > > return build_constructor (type, v); > > > > > } > > > > > - else > > > > > - return out_elts.build (); > > > > > + return out_elts.build (); > > > > > } > > > > > > > > > > /* Try to fold a pointer difference of type TYPE two address expressions of > > > > > @@ -16910,6 +17024,97 @@ test_vec_duplicate_folding () > > > > > ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0)); > > > > > } > > > > > > > > > > +static tree > > > > > +build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern, > > > > > + int *encoded_elems) > > > > > +{ > > > > > + scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node); > > > > > + machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode); > > > > > + //machine_mode vmode = VNx4SImode; > > > > > + poly_uint64 nunits = GET_MODE_NUNITS (vmode); > > > > > + tree vectype = build_vector_type (integer_type_node, nunits); > > > > > + > > > > > + tree_vector_builder builder (vectype, npatterns, nelts_per_pattern); > > > > > + for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++) > > > > > + builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i])); > > > > > + return builder.build (); > > > > > +} > > > > > + > > > > > +static void > > > > > +test_vec_perm_vla_folding () > > > > > +{ > > > > > + int arg0_elems[] = { 1, 11, 2, 12, 3, 13 }; > > > > > + tree arg0 = build_vec_int_cst (2, 3, arg0_elems); > > > > > + > > > > > + int arg1_elems[] = { 21, 31, 22, 32, 23, 33 }; > > > > > + tree arg1 = build_vec_int_cst (2, 3, arg1_elems); > > > > > + > > > > > + if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant () > > > > > + || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ()) > > > > > + return; > > > > > + > > > > > + /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3, > > > > > + should select arg0. */ > > > > > + { > > > > > + int mask_elems[] = {0, 1, 2}; > > > > > + tree mask = build_vec_int_cst (1, 3, mask_elems); > > > > > + tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask); > > > > > + ASSERT_TRUE (res != NULL_TREE); > > > > > + ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2); > > > > > + ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3); > > > > > + > > > > > + unsigned res_nelts = vector_cst_encoded_nelts (res); > > > > > + for (unsigned i = 0; i < res_nelts; i++) > > > > > + ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), > > > > > + VECTOR_CST_ELT (arg0, i), 0)); > > > > > + } > > > > > + > > > > > + /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3, > > > > > + should return NULL because for len = 4 + 4x, > > > > > + if x == 0, we select from arg1 > > > > > + if x > 0, we select from arg0 > > > > > + and thus cannot determine result at compile time. */ > > > > > + { > > > > > + int mask_elems[] = {4, 5, 6}; > > > > > + tree mask = build_vec_int_cst (1, 3, mask_elems); > > > > > + tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask); > > > > > + gcc_assert (res == NULL_TREE); > > > > > + } > > > > > + > > > > > + /* Case 3: > > > > > + mask: {0, 0, 0, 1, 0, 2, ...} > > > > > + npatterns == 2, nelts_per_pattern == 3 > > > > > + Pattern {0, ...} should select arg0[0], ie, 1. > > > > > + Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...}, > > > > > + so res = {1, 1, 1, 11, 1, 2, ...}. */ > > > > > + { > > > > > + int mask_elems[] = {0, 0, 0, 1, 0, 2}; > > > > > + tree mask = build_vec_int_cst (2, 3, mask_elems); > > > > > + tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask); > > > > > + ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 4); > > > > > + ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3); > > > > > + > > > > > + /* Check encoding: {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ...} */ > > > > > + int res_encoded_elems[] = {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13}; > > > > > + for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++) > > > > > + ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]); > > > > > + } > > > > > + > > > > > + /* Case 4: > > > > > + mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...} > > > > > + npatterns == 2, nelts_per_pattern == 3 > > > > > + Pattern {0, ...} should select arg0[1] > > > > > + Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since: > > > > > + a1 = 5 + 4x > > > > > + ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1 > > > > > + = 5 + 6x > > > > > + Since a1/4+4x == ae/4+4x == 1, we select arg1[0], arg1[1], arg1[2], ... > > > > > + res: {1, 21, 1, 31, 1, 22, ... } > > > > > + FIXME: How to build vector with poly_int elems ? */ > > > > > + > > > > > + /* Case 5: S < 0. */ > > > > > +} > > > > > + > > > > > /* Run all of the selftests within this file. */ > > > > > > > > > > void > > > > > @@ -16918,6 +17123,7 @@ fold_const_cc_tests () > > > > > test_arithmetic_folding (); > > > > > test_vector_folding (); > > > > > test_vec_duplicate_folding (); > > > > > + test_vec_perm_vla_folding (); > > > > > } > > > > > > > > > > } // namespace selftest