From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <prathamesh.kulkarni@linaro.org>
Received: from mail-ej1-x631.google.com (mail-ej1-x631.google.com [IPv6:2a00:1450:4864:20::631])
	by sourceware.org (Postfix) with ESMTPS id 99B4B385AC22
	for <gcc-patches@gcc.gnu.org>; Fri, 23 Sep 2022 11:59:40 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 99B4B385AC22
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=linaro.org
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=linaro.org
Received: by mail-ej1-x631.google.com with SMTP id z13so162700ejp.6
        for <gcc-patches@gcc.gnu.org>; Fri, 23 Sep 2022 04:59:40 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=linaro.org; s=google;
        h=to:subject:message-id:date:from:in-reply-to:references:mime-version
         :from:to:cc:subject:date;
        bh=ECJ19kv4X0nbE16DRknmTaLXJJInvOZAQaD9xgQI17s=;
        b=Ad+glpUIyZD68W/AFOhVQEKS/yhQ11llZLlng+7wfcuXns5VDHMqiHiFya35qhbB70
         sGGWVbxQ5Z/jDemoL3ZaoXSzlzbHVQZBaHjAaakuW74RvKdB20iAQCOXDiZdjG5Yd2o1
         ssI4+YGWmU9oveeBCSebCskLJbqgGK0I7eMWFZKdT4EdRashL7kgIGxhNvvoThuJ67EB
         O5xOeOzOEBjyGFeq2ViJAd4guyptgNyKT3BtdEMNcZH22k4fTiYhggqesIpfiSMBMyyP
         3ySNIuBSnEoF4lBeD3kqCEF2CIDTSAnycBiJ/CJZnQNJ0CwyTs5YR/9Om+fT296vHqKi
         vGDw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=to:subject:message-id:date:from:in-reply-to:references:mime-version
         :x-gm-message-state:from:to:cc:subject:date;
        bh=ECJ19kv4X0nbE16DRknmTaLXJJInvOZAQaD9xgQI17s=;
        b=s5ZPy3F21+V9cJ5oDWP4GJwTls6W0huPovQVHbHIwzAvCu+2V+CrychUEw197PIo3S
         xrwEQjPOYR2wc/gnRcQOLBn7ZERN+j8XI1YsZ2lpVMC3A5YyuykHo79PNaST7useIxyl
         f6kQFT9nN1v23/IckJz106cyu1Gf40piCpG5S8lOOUB/Q4P/C2ymBuUJZLZY2IVa/6wF
         toabRKBhvZuQ1EbEP+8MwGFx2NdAZQ8R81uNMMZ6WEhwz7fE7pDByjoAhPciPFLH6Kzc
         /ZkdRn7p7yuFyxGPfzsCDomCrFMD9Ai3rAOlqD5GFa1w4976VHWsY6grujsWS8ayPXCN
         H6oQ==
X-Gm-Message-State: ACrzQf30Na8CeZaNDZaICF8yfefMOUMb/bp4/g5mK9uiqPD4snN0zQho
	DjQq9o/B8iT49C+k7RdtzkrTngAcrRKnQbEXDB/IMO/aUC4=
X-Google-Smtp-Source: AMsMyM40Mhd/E29cuNkFsLhQymI9meHCI5+nllIbo0ElqfQICrh4N/k9ogTaCuTSi/E7dMhmgTArnt2vDFsdep8jDoc=
X-Received: by 2002:a17:907:d27:b0:782:cc3:8e87 with SMTP id
 gn39-20020a1709070d2700b007820cc38e87mr6418450ejc.94.1663934379201; Fri, 23
 Sep 2022 04:59:39 -0700 (PDT)
MIME-Version: 1.0
References: <CAAgBjM=CUm=fOVWCDu4YVydKQzRxkmw9fa_2hiZuX5pYungj6Q@mail.gmail.com>
 <mpt35d6azqb.fsf@arm.com> <CAAgBjM=XNLy_8ek1NKdrHte6F0UN30gPSxZ+pM7=6_1y=D8cug@mail.gmail.com>
 <mpt4jxc7jmu.fsf@arm.com> <CAAgBjMnuG7EsGcsg+A99FTmh9Q9_dwJ6LtfE6aTLP7knxhwr7A@mail.gmail.com>
 <mpt8rme2paf.fsf@arm.com>
In-Reply-To: <mpt8rme2paf.fsf@arm.com>
From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
Date: Fri, 23 Sep 2022 17:29:03 +0530
Message-ID: <CAAgBjMkr9x8e_twZxPmq6t5tGfd1Cyv623jYZrBDz4g7WFm5-g@mail.gmail.com>
Subject: Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner
To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>, gcc Patches <gcc-patches@gcc.gnu.org>, 
	Richard Biener <richard.guenther@gmail.com>, richard.sandiford@arm.com
Content-Type: text/plain; charset="UTF-8"
X-Spam-Status: No, score=-3.3 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Tue, 20 Sept 2022 at 18:09, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Mon, 12 Sept 2022 at 19:57, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> >> The VLA encoding encodes the first N patterns explicitly.  The
> >> >> npatterns/nelts_per_pattern values then describe how to extend that
> >> >> initial sequence to an arbitrary number of elements.  So when performing
> >> >> an operation on (potentially) variable-length vectors, the questions is:
> >> >>
> >> >> * Can we work out an initial sequence and npatterns/nelts_per_pattern
> >> >>   pair that will be correct for all elements of the result?
> >> >>
> >> >> This depends on the operation that we're performing.  E.g. it's
> >> >> different for unary operations (vector_builder::new_unary_operation)
> >> >> and binary operations (vector_builder::new_binary_operations).  It also
> >> >> varies between unary operations and between binary operations, hence
> >> >> the allow_stepped_p parameters.
> >> >>
> >> >> For VEC_PERM_EXPR, I think the key requirement is that:
> >> >>
> >> >> (R) Each individual selector pattern must always select from the same vector.
> >> >>
> >> >> Whether this condition is met depends both on the pattern itself and on
> >> >> the number of patterns that it's combined with.
> >> >>
> >> >> E.g. suppose we had the selector pattern:
> >> >>
> >> >>   { 0, 1, 4, ... }   i.e. 3x - 2 for x > 0
> >> >>
> >> >> If the arguments and selector are n elements then this pattern on its
> >> >> own would select from more than one argument if 3(n-1) - 2 >= n.
> >> >> This is clearly true for large enough n.  So if n is variable then
> >> >> we cannot represent this.
> >> >>
> >> >> If the pattern above is one of two patterns, so interleaved as:
> >> >>
> >> >>      { 0, _, 1, _, 4, _, ... }  o=0
> >> >>   or { _, 0, _, 1, _, 4, ... }  o=1
> >> >>
> >> >> then the pattern would select from more than one argument if
> >> >> 3(n/2-1) - 2 + o >= n.  This too would be a problem for variable n.
> >> >>
> >> >> But if the pattern above is one of four patterns then it selects
> >> >> from more than one argument if 3(n/4-1) - 2 + o >= n.  This is not
> >> >> true for any valid n or o, so the pattern is OK.
> >> >>
> >> >> So let's define some ad hoc terminology:
> >> >>
> >> >> * Px is the number of patterns in x
> >> >> * Ex is the number of elements per pattern in x
> >> >>
> >> >> where x can be:
> >> >>
> >> >> * 1: first argument
> >> >> * 2: second argument
> >> >> * s: selector
> >> >> * r: result
> >> >>
> >> >> Then:
> >> >>
> >> >> (1) The number of elements encoded explicitly for x is Ex*Px
> >> >>
> >> >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> >> >>     elements for any integer N.  This extended sequence can be reencoded
> >> >>     as having N*Px patterns, with Ex staying the same.
> >> >>
> >> >> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements
> >> >>     of the explicit encoding.
> >> >>
> >> >> So let's assume (optimistically) that we can produce the result
> >> >> by calculating the first Pr*Er elements and using the Pr,Er encoding
> >> >> to imply the rest.  Then:
> >> >>
> >> >> * (2) means that, when combining multiple input operands with potentially
> >> >>   different encodings, we can set the number of patterns in the result
> >> >>   to the least common multiple of the number of patterns in the inputs.
> >> >>   In this case:
> >> >>
> >> >>   Pr = least_common_multiple(P1, P2, Ps)
> >> >>
> >> >>   is a valid number of patterns.
> >> >>
> >> >> * (3) means that the number of elements per pattern of the result can
> >> >>   be the maximum of the number of elements per pattern in the inputs.
> >> >>   (Alternatively, we could always use 3.)  In this case:
> >> >>
> >> >>   Er = max(E1, E2, Es)
> >> >>
> >> >>   is a valid number of elements per pattern.
> >> >>
> >> >> So if (R) holds we can compute the result -- for both VLA and VLS -- by
> >> >> calculating the first Pr*Er elements of the result and using the
> >> >> encoding to derive the rest.  If (R) doesn't hold then we need the
> >> >> selector to be constant-length.  We should then fill in the result
> >> >> based on:
> >> >>
> >> >> - Pr == number of elements in the result
> >> >> - Er == 1
> >> >>
> >> >> But this should be the fallback option, even for VLS.
> >> >>
> >> >> As far as the arguments go: we should reject CONSTRUCTORs for
> >> >> variable-length types.  After doing that, we can treat a CONSTRUCTOR
> >> >> for an N-element vector type by setting the number of patterns to N
> >> >> and the number of elements per pattern to 1.
> >> > Hi Richard,
> >> > Thanks for the suggestions, and sorry for late response.
> >> > I have a couple of very elementary questions:
> >> >
> >> > 1: Consider following inputs to VEC_PERM_EXPR:
> >> > op1: P_op1 == 4, E_op1 == 1
> >> > {1, 2, 3, 4, ...}
> >> >
> >> > op2: P_op2 == 2, E_op2 == 2
> >> > {11, 21, 12, 22, ...}
> >> >
> >> > sel: P_sel == 3, E_sel == 1
> >> > {0, 4, 5, ...}
> >> >
> >> > What shall be the result in this case ?
> >> > P_res = lcm(4, 2, 3) == 12
> >> > E_res = max(1, 2, 1) == 2.
> >>
> >> Yeah, that looks right.  Of course, since sel is just repeating
> >> every three elements, it could just be P_res==3, E_sel==1,
> >> but the vector_builder would do that optimisation for us.
> >>
> >> (I'm not sure whether we'd see a P==3 encoding in practice,
> >> but perhaps it's possible.)
> >>
> >> If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than
> >> repeating every three elements) then:
> >>
> >> P_res = lcm(4, 2) == 4
> >> E_res = max(1, 2, 3) == 3
> >>
> >> which also looks like it would give the right encoding.
> >>
> >> > 2. How should we specify index of element in sel when it is not
> >> > explicitly encoded in the operand ?
> >> > For eg:
> >> > op1: npatterns == 2, nelts_per_pattern == 3
> >> > { 1, 0, 2, 0, 3, 0, ... }
> >> > op2: npatterns == 6, nelts_per_pattern == 1
> >> > { 11, 12, 13, 14, 15, 16, ...}
> >> >
> >> > In sel, how do we refer to element with value 4, that would be 4th element
> >> > of first pattern in op1, but not explicitly encoded ?
> >> > In op1, 4 will come at index == 6.
> >> > However in sel, index 6 would refer to 11, ie op2[0] ?
> >>
> >> What index 6 refers to depends on the length of op1.
> >> If the length of op1 is 4 at runtime the index 6 refers to op2[2].
> >> If the length of op1 is 6 then index 6 refers to op2[0].
> >> If the length of op1 is 8 then index 6 refers to op1[6], etc.
> >>
> >> This comes back to (R) above.  We need to be able to prove at compile
> >> time that each pattern selects from the same input vectors (for all
> >> elements, not just the encoded elements).  If we can't prove that
> >> then we can't fold for variable-length vectors.
> > Hi Richard,
> > Thanks for the clarification!
> > I have come up with an approach to verify R:
> >
> > Consider following pattern:
> > a0, a1, a1 + S, ...,
> > nelts_per_pattern would be n / Psel, where n == actual length of the vector.
> > And last element of pattern will be given by:
> > a1 + (n/Psel - 2) * S
>
> (I think this is just a terminology thing, but in the source,
> nelts_per_pattern is a compile-time constant that describes the
> encoding.  It always has the value 1, 2 or 3, regardless of the
> runtime length.)
>
> > Rearranging the above term, we can think of pattern
> > as a line with following equation:
> > y = (S/Psel) * n + (a1 - 2S)
> > where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept.
> >
> > At,
> > n = 2*Psel, y = a1
> > n = 3*Psel, y = a1 + S,
> > n = 4*Psel, y = a1 + 2S ...
> >
> > To compare with n, we compare the following lines:
> > y1 = (S/Psel) * n + (a1 - 2S)
> > y2 = n
> >
> > So to check if elements always come from first vector,
> > we want to check y1 < y2 for n > 0.
> > Likewise, if elements always come from second vector,
> > we want to check if y1 >= y2, for n > 0.
>
> One difficulty here is that the indices wrap around, so an index value of
> 2n selects from the first vector rather than the second.  (This is pretty
> awkward for VLA and doesn't match the native SVE TBL behaviour.)  So...
>
> > If both lines are parallel, ie S/PSel == 1,
> > then we choose first or second vector depending on the y-intercept a1 - 2S.
> > If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select second vector.
> > If a1 - 2S < 0, then y1 < y2 for all values of n, so select first vector.
> >
> > For eg, if we have following pattern:
> > {0, 1, 3, ...}
> > where a1 = 1, S = 2, and consider PSel = 2.
> >
> > y1 = n - 3
> > y2 = n
> >
> > In this case, y1 < y2 for all values of n,  so we select first vector.
> >
> > Since y2 = n, passes thru origin with slope = 1,
> > a line can intersect it either in 1st or 3rd quadrant.
> > Calculate point of intersection:
> > n_int = Psel * (a1 - 2S) / (Psel - S);
> >
> > (a) n_int > 0
> > n_int > 0 => intersecting in 1st quadrant.
> > In this case there will be a cross-over at n_int.
> >
> > For eg, consider pattern { 0, 1, 4, ...}
> > a1 = 1, S = 3, and let's take PSel = 2
> >
> > y1 = (3/2)n - 5
> > y2 = n
> >
> > Both intersect at (10, 10).
> > So for n < 10, y1 < y2
> > and for n > 10, y1 > y2.
> > so in this case we can't fold since we will select elements from both vectors.
> >
> > (b) n_int <= 0
> > In this case, the lines will intersect in 3rd quadrant,
> > so depending upon the slope we can choose either vector.
> > If (S/Psel) < 1, ie y1 has a gentler slope than y2,
> > then y1 < y2 for n > 0
> > If (S/Psel) > 1, ie, y1 has a steeper slope than y2,
> > then y1 > y2 for n > 0.
> >
> > For eg, in the above pattern {0, 1, 4, ...}
> > a1 = 1, S = 3, and let's take PSel = 4
> >
> > y1 = (3/4)n - 5
> > y2 = n
> > Both intersect at (-20, -20).
> > y1's slope = (S/Psel) = (3/4) < 1
> > So y1 < y2 for n > 0.
> > Graph: https://www.desmos.com/calculator/ct7edqbr9d
> > So we pick first vector.
> >
> > The following pseudo code attempts to capture this:
> >
> > tree select_vector_for_pattern (op1, op2, a1, S, Psel)
> > {
> >   if (S == Psel)
> >     {
> >       /* If y1 intercept >= 0, then y1 >= y2
> >           for all values of n.  */
> >       if (a1 - 2*S >= 0)
> >         return op2;
> >       return op1;
> >     }
> >
> >    n_int = Psel * (a1 - 2*S) / (Psel - S)
> >    /* If intersecting in 1st quadrant, there will be cross over,
> >        bail out.  */
> >    if (n_int > 0)
> >      return NULL_TREE;
> >    /* If S/Psel < 1, ie y1 has gentler slope than y2,
> >       then y1 < y2 for n > 0.  */
> >    if (S < Psel)
> >      return op1;
> >    /* If S/Psel > 1, ie y1 has steeper slope than y2,
> >       then y1 > y2 for n > 0.  */
> >    return op2;
> > }
> >
> > Does this look reasonable ?
>
> ...I think we need to be more conservative.  I think we also need to
> distinguish n1 (the number of elements in the input vectors) and
> nsel (the number of elements in the selector).
>
> If nsel is a multiple of Psel and nsel >= 2 * Psel then like you say
> there will be (nsel /exact Psel) - 1 index elements from the stepped
> encoding and the final index value will be:
>
>   ae = a1 + (nsel /exact Psel - 2) * S
>
> Because of wrap-around, we need to ensure that that doesn't run
> into an adjoining vector.  I think the easiest way of doing that
> is to calculate a1 /trunc n1 and ae /trunc n1 (using can_div_trunc_p)
> and check that the quotients are equal.
IIUC, If a1/n1 == ae/n1, then the sequence will choose from the same
vector since ae is last elem,
and the quotient can choose the vector because it will be either 0 or
1 (since indices wrap around after 2n).
Um, could you please elaborate a bit on how will can_div_trunc_p
calculate quotients, when n1 and nsel are unknown
at compile time ?

To calculate the quotients for a hard coded pattern,
with a1 = 1, nsel = n1 = len(VNx4SI), S = 3, Psel = 4,
I tried the following:

  poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode);
  poly_uint64 nsel = n1;
  poly_uint64 a1 = 1
  poly_uint64 Esel = exact_div (nsel / Psel);
  poly_uint64 ae = a1 + (Esel - 2) * S;

  int q1, qe;
  poly_uint64 r1, re;

  bool div1_p = can_div_trunc_p (a1, n1, &q1, &r1);
  bool dive_p = can_div_trunc_p (ae, n1, &qe, &re);

Which gave strange values for qe and 0 for q1, with first call succeeding,
and second call returning false.
Am I calling it incorrectly ?

Thanks,
Prathamesh

>
> However, I now realise that there's a wrinkle.  If S < 0 then we
> also need to check that either:
>
> (a) the chosen input vector (given by the quotient above) has either:
>
>     (i) nelts_per_pattern == 1
>     (ii) nelts_per_pattern == 3 and the difference between the
>          first and second elements in each pattern is the same as
>          the difference between the second and third elements
>          (i.e. every pattern is a natural stepped one).
>
> (b) ae % n1 >= the number of patterns in the input vector.
>     (ae % n1 is calculated as a side-effect of can_div_trunc_p).
>
> Otherwise the index vector has the effect of moving the "foreground"
> from the front of the input vector to the end of the result vector.
>
> If nsel == Psel then the stepped part of the sequence doesn't matter.
> Thus, the same condition works whenever nsel is a multiple of Psel.
>
> If nsel is not a multiple of Psel then I think we should punt for now.
> There are some cases that we could handle when n1 == nsel, but "nsel
> is a multiple of Psel" will be the normal case.
>
> Thanks,
> Richard