From: Richard Sandiford <richard.sandiford@arm.com>
To: Prathamesh Kulkarni
Cc: gcc Patches
Subject: Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector
Date: Thu, 02 Feb 2023 15:20:40 +0000

Prathamesh Kulkarni writes:
>> >> > I have attached a patch that extends the transform if one half is a dup
>> >> > and the other is a set of constants.
>> >> > For eg:
>> >> > int8x16_t f(int8_t x)
>> >> > {
>> >> >   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4, x, 5, x, 6, x, 7, x, 8 };
>> >> > }
>> >> >
>> >> > code-gen trunk:
>> >> > f:
>> >> >         adrp    x1, .LC0
>> >> >         ldr     q0, [x1, #:lo12:.LC0]
>> >> >         ins     v0.b[0], w0
>> >> >         ins     v0.b[2], w0
>> >> >         ins     v0.b[4], w0
>> >> >         ins     v0.b[6], w0
>> >> >         ins     v0.b[8], w0
>> >> >         ins     v0.b[10], w0
>> >> >         ins     v0.b[12], w0
>> >> >         ins     v0.b[14], w0
>> >> >         ret
>> >> >
>> >> > code-gen with patch:
>> >> > f:
>> >> >         dup     v0.16b, w0
>> >> >         adrp    x0, .LC0
>> >> >         ldr     q1, [x0, #:lo12:.LC0]
>> >> >         zip1    v0.16b, v0.16b, v1.16b
>> >> >         ret
>> >> >
>> >> > Bootstrapped+tested on aarch64-linux-gnu.
>> >> > Does it look OK?
>> >>
>> >> Looks like a nice improvement.  It'll need to wait for GCC 14 now though.
>> >>
>> >> However, rather than handle this case specially, I think we should instead
>> >> take a divide-and-conquer approach: split the initialiser into even and
>> >> odd elements, find the best way of loading each part, then compare the
>> >> cost of these sequences + ZIP with the cost of the fallback code (the code
>> >> later in aarch64_expand_vector_init).
>> >>
>> >> For example, doing that would allow:
>> >>
>> >>   { x, y, 0, y, 0, y, 0, y, 0, y }
>> >>
>> >> to be loaded more easily, even though the even elements aren't wholly
>> >> constant.
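(For illustration, that even/odd split corresponds roughly to the
following at the intrinsics level.  This is just a sketch -- the
function name is invented, and the patch works in RTL via
aarch64_expand_vector_init rather than on source like this:)

    #include <arm_neon.h>

    int8x16_t
    f_sketch (int8_t x)
    {
      /* Even elements: a single dup of x.  */
      int8x16_t evens = vdupq_n_s8 (x);
      /* Odd elements: a constant vector (a constant-pool load).
         Only the low eight lanes matter, since zip1 reads just
         the low half of each input.  */
      int8x16_t odds = { 1, 2, 3, 4, 5, 6, 7, 8,
                         9, 10, 11, 12, 13, 14, 15, 16 };
      /* Interleave the low halves: { x, 1, x, 2, ..., x, 8 }.  */
      return vzip1q_s8 (evens, odds);
    }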
>> > Hi Richard,
>> > I have attached a prototype patch based on the above approach.
>> > It subsumes specializing for the above {x, y, x, y, x, y, x, y} case by
>> > generating the same sequence, thus I removed that hunk, and improves the
>> > following cases:
>> >
>> > (a)
>> > int8x16_t f_s16(int8_t x)
>> > {
>> >   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4,
>> >                        x, 5, x, 6, x, 7, x, 8 };
>> > }
>> >
>> > code-gen trunk:
>> > f_s16:
>> >         adrp    x1, .LC0
>> >         ldr     q0, [x1, #:lo12:.LC0]
>> >         ins     v0.b[0], w0
>> >         ins     v0.b[2], w0
>> >         ins     v0.b[4], w0
>> >         ins     v0.b[6], w0
>> >         ins     v0.b[8], w0
>> >         ins     v0.b[10], w0
>> >         ins     v0.b[12], w0
>> >         ins     v0.b[14], w0
>> >         ret
>> >
>> > code-gen with patch:
>> > f_s16:
>> >         dup     v0.16b, w0
>> >         adrp    x0, .LC0
>> >         ldr     q1, [x0, #:lo12:.LC0]
>> >         zip1    v0.16b, v0.16b, v1.16b
>> >         ret
>> >
>> > (b)
>> > int8x16_t f_s16(int8_t x, int8_t y)
>> > {
>> >   return (int8x16_t) { x, y, 1, y, 2, y, 3, y,
>> >                        4, y, 5, y, 6, y, 7, y };
>> > }
>> >
>> > code-gen trunk:
>> > f_s16:
>> >         adrp    x2, .LC0
>> >         ldr     q0, [x2, #:lo12:.LC0]
>> >         ins     v0.b[0], w0
>> >         ins     v0.b[1], w1
>> >         ins     v0.b[3], w1
>> >         ins     v0.b[5], w1
>> >         ins     v0.b[7], w1
>> >         ins     v0.b[9], w1
>> >         ins     v0.b[11], w1
>> >         ins     v0.b[13], w1
>> >         ins     v0.b[15], w1
>> >         ret
>> >
>> > code-gen with patch:
>> > f_s16:
>> >         adrp    x2, .LC0
>> >         dup     v1.16b, w1
>> >         ldr     q0, [x2, #:lo12:.LC0]
>> >         ins     v0.b[0], w0
>> >         zip1    v0.16b, v0.16b, v1.16b
>> >         ret
>>
>> Nice.
>>
>> > There are a couple of issues I have come across:
>> > (1) Choosing the element to pad the vector with.
>> > For eg, if we are initializing a vector, say { x, y, 0, y, 1, y, 2, y }
>> > with mode V8HI, we split it into { x, 0, 1, 2 } and { y, y, y, y }.
>> > However, since the mode is V8HI, we would need to pad the above split
>> > vectors with 4 more elements to match up to the vector length.
>> > For { x, 0, 1, 2 }, using any constant is the obvious choice, while for
>> > { y, y, y, y }, using 'y' is the obvious choice, thus making them:
>> > { x, 0, 1, 2, 0, 0, 0, 0 } and { y, y, y, y, y, y, y, y }.
>> > These would then be merged using zip1, which would discard the upper
>> > half of both vectors.
>> > Currently I encoded the above two heuristics in
>> > aarch64_expand_vector_init_get_padded_elem:
>> > (a) If the split portion contains a constant, use the constant to pad
>> > the vector.
>> > (b) If the split portion only contains variables, then use the most
>> > frequently repeating variable to pad the vector.
>> > I suppose though this could be improved?
>>
>> I think we should just build two 64-bit vectors (V4HIs) and use a subreg
>> to fill the upper elements with undefined values.
>>
>> I suppose in principle we would have the same problem when splitting
>> a 64-bit vector into 2 32-bit vectors, but it's probably better to punt
>> on that for now.  Eventually it would be worth adding full support for
>> 32-bit Advanced SIMD modes (with the necessary restrictions for FP
>> exceptions), but it's quite a big task.  The 128-bit to 64-bit split
>> is the one that matters most.
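(Concretely, for the { x, y, 0, y, 1, y, 2, y } case, the 64-bit-halves
idea looks something like this at the intrinsics level.  Again only a
sketch with an invented name: vcombine stands in here for the subreg,
and the upper halves passed to it are don't-care values, since zip1
never reads them:)

    #include <arm_neon.h>

    int16x8_t
    f_v8hi_sketch (int16_t x, int16_t y)
    {
      /* Even elements, built as a 64-bit V4HI.  */
      int16x4_t evens = { x, 0, 1, 2 };
      /* Odd elements: a single 64-bit dup.  */
      int16x4_t odds = vdup_n_s16 (y);
      /* Widen to 128 bits; the contents of the upper halves are
         irrelevant.  zip1 interleaves only the low halves, giving
         { x, y, 0, y, 1, y, 2, y }.  */
      return vzip1q_s16 (vcombine_s16 (evens, evens),
                         vcombine_s16 (odds, odds));
    }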
>> > (2) Setting the cost for zip1:
>> > Currently it returns 4 as the cost for the following zip1 insn:
>> > (set (reg:V8HI 102)
>> >      (unspec:V8HI [
>> >              (reg:V8HI 103)
>> >              (reg:V8HI 108)
>> >          ] UNSPEC_ZIP1))
>> > I am not sure if that's correct, or if not, what cost to use in this
>> > case for zip1?
>>
>> TBH 4 seems a bit optimistic.  It's COSTS_N_INSNS (1), whereas the
>> generic advsimd_vec_cost::permute_cost is 2 insns.  But the costs of
>> inserts are probably underestimated to the same extent, so hopefully
>> things work out.
>>
>> So it's probably best to accept the costs as they're currently given.
>> Changing them would need extensive testing.
>>
>> However, one of the advantages of the split is that it allows the
>> subvectors to be built in parallel.  When optimising for speed,
>> it might make sense to take the maximum of the subsequence costs
>> and add the cost of the zip to that.
> Hi Richard,
> Thanks for the suggestions.
> In the attached patch, it recurses only if nelts == 16, to punt on the
> 64-bit -> 32-bit split,

It should be based on the size rather than the number of elements.
The example we talked about above involved building V8HIs from two
V4HIs, which is also valid.

> and uses std::max (even_init, odd_init) + insn_cost (zip1_insn) for
> computing the total cost of the sequence.
>
> So, for the following case:
> int8x16_t f_s8(int8_t x)
> {
>   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4,
>                        x, 5, x, 6, x, 7, x, 8 };
> }
>
> it now generates:
> f_s16:
>         dup     v0.8b, w0
>         adrp    x0, .LC0
>         ldr     d1, [x0, #:lo12:.LC0]
>         zip1    v0.16b, v0.16b, v1.16b
>         ret
>
> Which I assume is correct, since zip1 will merge the lower halves of
> the two vectors while leaving the upper halves undefined?

Yeah, it looks valid, but I would say that zip1 ignores the upper
halves (rather than leaving them undefined).

Thanks,
Richard
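P.S. The costing scheme discussed above amounts to something like the
following (a sketch only, not the patch's actual code, and the names
are invented):

    #include <stdbool.h>

    /* Cost of building a vector as two interleaved halves.  When
       optimising for speed, the halves can be built in parallel, so
       only the more expensive one sits on the critical path before
       the zip1.  When optimising for size, every insn counts.  */
    unsigned int
    interleave_cost (unsigned int even_cost, unsigned int odd_cost,
                     unsigned int zip_cost, bool speed_p)
    {
      if (speed_p)
        return (even_cost > odd_cost ? even_cost : odd_cost) + zip_cost;
      return even_cost + odd_cost + zip_cost;
    }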