From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <richard.sandiford@arm.com>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by sourceware.org (Postfix) with ESMTP id 74178385DC07
	for <gcc-patches@gcc.gnu.org>; Mon, 31 Oct 2022 09:57:27 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 74178385DC07
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 2BD801FB;
	Mon, 31 Oct 2022 02:57:33 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.62])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 49A033F534;
	Mon, 31 Oct 2022 02:57:26 -0700 (PDT)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
Mail-Followup-To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>,gcc Patches <gcc-patches@gcc.gnu.org>,  Richard Biener <richard.guenther@gmail.com>, richard.sandiford@arm.com
Cc: gcc Patches <gcc-patches@gcc.gnu.org>,  Richard Biener <richard.guenther@gmail.com>
Subject: Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner
References: <CAAgBjM=CUm=fOVWCDu4YVydKQzRxkmw9fa_2hiZuX5pYungj6Q@mail.gmail.com>
	<mpt35d6azqb.fsf@arm.com>
	<CAAgBjM=XNLy_8ek1NKdrHte6F0UN30gPSxZ+pM7=6_1y=D8cug@mail.gmail.com>
	<mpt4jxc7jmu.fsf@arm.com>
	<CAAgBjMnuG7EsGcsg+A99FTmh9Q9_dwJ6LtfE6aTLP7knxhwr7A@mail.gmail.com>
	<mpt8rme2paf.fsf@arm.com>
	<CAAgBjMkr9x8e_twZxPmq6t5tGfd1Cyv623jYZrBDz4g7WFm5-g@mail.gmail.com>
	<mptleqaxele.fsf@arm.com>
	<CAAgBjMkVmDfGu0_5DCFfhmNsDQB45JASJyQahpo3Z=jKVCyRpQ@mail.gmail.com>
	<mpta66lx4k4.fsf@arm.com>
	<CAAgBjMmfTcy-aFEiiSwizYHWcwY7HNYAEP4kNw5Hsa-24CPJnQ@mail.gmail.com>
	<mptr0zsn98c.fsf@arm.com> <mptk05kn8ut.fsf@arm.com>
	<CAAgBjMmRt3H9KVPCT3z5YHRRmDJw6K2rHVKc9+cTCeLTi9EWqQ@mail.gmail.com>
	<mptwn8m38bm.fsf@arm.com>
	<CAAgBjMku5-i3m8eLc2ky1Hb-NBjwGFD5F4278R3GTSFQq=H5KQ@mail.gmail.com>
Date: Mon, 31 Oct 2022 09:57:25 +0000
In-Reply-To: <CAAgBjMku5-i3m8eLc2ky1Hb-NBjwGFD5F4278R3GTSFQq=H5KQ@mail.gmail.com>
	(Prathamesh Kulkarni's message of "Fri, 28 Oct 2022 20:16:07 +0530")
Message-ID: <mpth6zk2u4a.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Status: No, score=-36.9 required=5.0 tests=BAYES_00,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Sorry for the slow response.  I wanted to find some time to think
>> about this a bit more.
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
>> >> >> For num_poly_int_coeffs == 2,
>> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
>> >> >> If a1/trunc n1 succeeds,
>> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
>> >> >> So, a1 has to be < n1.coeffs[0] ?
>> >> >
>> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
>> >> >
>> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
>> >> >
>> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
>> >>
>> >> Sorry, should have been:
>> >>
>> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
>> > Hi Richard,
>> > Thanks for the clarifications, and sorry for late reply.
>> > I have attached POC patch that tries to implement the above approach.
>> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
>> >
>> > For VLA vectors, I have only done limited testing so far.
>> > It seems to pass couple of tests written in the patch for
>> > nelts_per_pattern == 3,
>> > and folds the following svld1rq test:
>> > int32x4_t v = {1, 2, 3, 4};
>> > return svld1rq_s32 (svptrue_b8 (), &v[0])
>> > into:
>> > return {1, 2, 3, 4, ...};
>> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
>> >
>> > I have a couple of questions:
>> > 1] When mask selects elements from same vector but from different patterns:
>> > For eg:
>> > arg0 = {1, 11, 2, 12, 3, 13, ...},
>> > arg1 = {21, 31, 22, 32, 23, 33, ...},
>> > mask = {0, 0, 0, 1, 0, 2, ... },
>> > All have npatterns = 2, nelts_per_pattern = 3.
>> >
>> > With above mask,
>> > Pattern {0, ...} selects arg0[0], ie {1, ...}
>> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
>> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
>> > pattern in arg0.
>> > The result is:
>> > res = {1, 1, 1, 11, 1, 2, ...}
>> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
>> > with a0 = 1, a1 = 11, S = -9.
>> > Is that expected tho ? It seems to create a new encoding which
>> > wasn't present in the input vector. For instance, the next elem in
>> > sequence would be -7,
>> > which is not present originally in arg0.
>>
>> Yeah, you're right, sorry.  Going back to:
>>
>> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
>>     elements for any integer N.  This extended sequence can be reencoded
>>     as having N*Px patterns, with Ex staying the same.
>>
>> I guess we need to pick an N for the selector such that each new
>> selector pattern (each one out of the N*Px patterns) selects from
>> the *same pattern* of the same data input.
>>
>> So if a particular pattern in the selector has a step S, and the data
>> input it selects from has Pi patterns, N*S must be a multiple of Pi.
>> N must be a multiple of least_common_multiple(S,Pi)/S.
>>
>> I think that means that the total number of patterns in the result
>> (Pr from previous messages) can safely be:
>>
>>   Ps * least_common_multiple(
>>     least_common_multiple(S[1], P[input(1)]) / S[1],
>>     ...
>>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
>>   )
>>
>> where:
>>
>>   Ps = the number of patterns in the selector
>>   S[I] = the step for selector pattern I (I being 1-based)
>>   input(I) = the data input selected by selector pattern I (I being 1-based)
>>   P[I] = the number of patterns in data input I
>>
>> That's getting quite complicated :-)  If we allow arbitrary P[...]
>> and S[...] then it could also get large.  Perhaps we should finally
>> give up on the general case and limit this to power-of-2 patterns and
>> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
>> simplifies other things as well.
>>
>> What do you think?
> Hi Richard,
> Thanks for the suggestions. Yeah I suppose we can initially add support for
> power-of-2 patterns and power-of-2 steps and try to generalize it in
> follow up patches if possible.
>
> Sorry if this sounds like a silly ques -- if we are going to have
> pattern in selector, select *same pattern from same input vector*,
> instead of re-encoding the selector to have N * Ps patterns, would it
> make sense for elements in selector to denote pattern number itself
> instead of element index
> if input vectors are VLA ?
>
> For eg:
> op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> op1 = {...}
> with npatterns == 4, nelts_per_pattern == 3,
> sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> so, res = {1, 4, 1, 5, 1, 6, ...}
> Not sure if this is correct tho.

This wouldn't allow us to represent things like a "duplicate one
element", or "copy the leading N elements from the first input and
the other elements from elements N+ of the second input", which we
can with the current scheme.

The restriction about each (unwound) selector pattern selecting from the
same input pattern only applies to case where the selector pattern is
stepped (and only applies to the stepped part of the pattern, not the
leading element).  The restriction is also local to this code; it
doesn't make other VEC_PERM_EXPRs invalid.

Thanks,
Richard

>
> Thanks,
> Prathamesh
>>
>> > I suppose it's fine since if the user defines mask to have pattern {0,
>> > 1, 2, ...}
>> > they intended result to have pattern with above encoding.
>> > Just wanted to confirm if this is correct ?
>> >
>> > 2] Could you please suggest a test-case for S < 0 ?
>> > I am not able to come up with one :/
>>
>> svrev is one way of creating negative steps.
>>
>> Thanks,
>> Richard
>>
>> >
>> > Thanks,
>> > Prathamesh
>> >>
>> >> > which is an interleaving of the two patterns:
>> >> >
>> >> >   { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
>> >> >   { 2 + 2x, 4 + 2x, 6 + 2x }        a0 = 2 + 2x, a1 = 4 + 2x, S = 2