From: Richard Sandiford <richard.sandiford@arm.com>
To: Prathamesh Kulkarni
Cc: gcc Patches
Subject: Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector
Date: Wed, 01 Feb 2023 16:26:54 +0000

Prathamesh Kulkarni writes:
> On Thu, 12 Jan 2023 at 21:21, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni writes:
>> > On Tue, 6 Dec 2022 at 07:01, Prathamesh Kulkarni wrote:
>> >>
>> >> On Mon, 5 Dec 2022 at 16:50, Richard Sandiford wrote:
>> >> >
>> >> > Richard Sandiford via Gcc-patches writes:
>> >> > > Prathamesh Kulkarni writes:
>> >> > >> Hi,
>> >> > >> For the following test-case:
>> >> > >>
>> >> > >> int16x8_t foo(int16_t x, int16_t y)
>> >> > >> {
>> >> > >>   return (int16x8_t) { x, y, x, y, x, y, x, y };
>> >> > >> }
>> >> > >>
>> >> > >> Code gen at -O3:
>> >> > >> foo:
>> >> > >>         dup     v0.8h, w0
>> >> > >>         ins     v0.h[1], w1
>> >> > >>         ins     v0.h[3], w1
>> >> > >>         ins     v0.h[5], w1
>> >> > >>         ins     v0.h[7], w1
>> >> > >>         ret
>> >> > >>
>> >> > >> For 16 elements, it results in 8 ins instructions, which might
>> >> > >> not be optimal.
>> >> > >> I guess the above code-gen would be equivalent to the following?
>> >> > >>         dup     v0.8h, w0
>> >> > >>         dup     v1.8h, w1
>> >> > >>         zip1    v0.8h, v0.8h, v1.8h
>> >> > >>
>> >> > >> I have attached a patch to do the same if the number of elements
>> >> > >> >= 8, which should possibly be better than the current code-gen.
>> >> > >> The patch passes bootstrap+test on aarch64-linux-gnu.
>> >> > >> Does the patch look OK?
>> >> > >>
>> >> > >> Thanks,
>> >> > >> Prathamesh
>> >> > >>
>> >> > >> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> >> > >> index c91df6f5006..e5dea70e363 100644
>> >> > >> --- a/gcc/config/aarch64/aarch64.cc
>> >> > >> +++ b/gcc/config/aarch64/aarch64.cc
>> >> > >> @@ -22028,6 +22028,39 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >> > >>        return;
>> >> > >>      }
>> >> > >>
>> >> > >> +  /* Check for interleaving case.
>> >> > >> +     For eg if initializer is (int16x8_t) {x, y, x, y, x, y, x, y}.
>> >> > >> +     Generate following code:
>> >> > >> +     dup v0.h, x
>> >> > >> +     dup v1.h, y
>> >> > >> +     zip1 v0.h, v0.h, v1.h
>> >> > >> +     for "large enough" initializer.  */
>> >> > >> +
>> >> > >> +  if (n_elts >= 8)
>> >> > >> +    {
>> >> > >> +      int i;
>> >> > >> +      for (i = 2; i < n_elts; i++)
>> >> > >> +        if (!rtx_equal_p (XVECEXP (vals, 0, i), XVECEXP (vals, 0, i % 2)))
>> >> > >> +          break;
>> >> > >> +
>> >> > >> +      if (i == n_elts)
>> >> > >> +        {
>> >> > >> +          machine_mode mode = GET_MODE (target);
>> >> > >> +          rtx dest[2];
>> >> > >> +
>> >> > >> +          for (int i = 0; i < 2; i++)
>> >> > >> +            {
>> >> > >> +              rtx x = copy_to_mode_reg (GET_MODE_INNER (mode), XVECEXP (vals, 0, i));
>> >> > >
>> >> > > Formatting nit: long line.
>> >> > >
>> >> > >> +              dest[i] = gen_reg_rtx (mode);
>> >> > >> +              aarch64_emit_move (dest[i], gen_vec_duplicate (mode, x));
>> >> > >> +            }
>> >> > >
>> >> > > This could probably be written:
>> >> > >
>> >> > >   for (int i = 0; i < 2; i++)
>> >> > >     {
>> >> > >       rtx x = expand_vector_broadcast (mode, XVECEXP (vals, 0, i));
>> >> > >       dest[i] = force_reg (GET_MODE_INNER (mode), x);
>> >> >
>> >> > Oops, I meant "mode" rather than "GET_MODE_INNER (mode)", sorry.
>> >> Thanks, I have pushed the change in
>> >> 769370f3e2e04823c8a621d8ffa756dd83ebf21e after running
>> >> bootstrap+test on aarch64-linux-gnu.
>> > Hi Richard,
>> > I have attached a patch that extends the transform if one half is a
>> > dup and the other is a set of constants.
>> > For example:
>> > int8x16_t f(int8_t x)
>> > {
>> >   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4, x, 5, x, 6, x, 7, x, 8 };
>> > }
>> >
>> > code-gen trunk:
>> > f:
>> >         adrp    x1, .LC0
>> >         ldr     q0, [x1, #:lo12:.LC0]
>> >         ins     v0.b[0], w0
>> >         ins     v0.b[2], w0
>> >         ins     v0.b[4], w0
>> >         ins     v0.b[6], w0
>> >         ins     v0.b[8], w0
>> >         ins     v0.b[10], w0
>> >         ins     v0.b[12], w0
>> >         ins     v0.b[14], w0
>> >         ret
>> >
>> > code-gen with patch:
>> > f:
>> >         dup     v0.16b, w0
>> >         adrp    x0, .LC0
>> >         ldr     q1, [x0, #:lo12:.LC0]
>> >         zip1    v0.16b, v0.16b, v1.16b
>> >         ret
>> >
>> > Bootstrapped+tested on aarch64-linux-gnu.
>> > Does it look OK?
>>
>> Looks like a nice improvement.  It'll need to wait for GCC 14 now though.
>>
>> However, rather than handle this case specially, I think we should instead
>> take a divide-and-conquer approach: split the initialiser into even and
>> odd elements, find the best way of loading each part, then compare the
>> cost of these sequences + ZIP with the cost of the fallback code (the code
>> later in aarch64_expand_vector_init).
>>
>> For example, doing that would allow:
>>
>>   { x, y, 0, y, 0, y, 0, y, 0, y }
>>
>> to be loaded more easily, even though the even elements aren't wholly
>> constant.
> Hi Richard,
> I have attached a prototype patch based on the above approach.
> It subsumes the specialization for the above {x, y, x, y, x, y, x, y}
> case by generating the same sequence, so I removed that hunk, and it
> improves the following cases:
>
> (a)
> int8x16_t f_s16(int8_t x)
> {
>   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4,
>                        x, 5, x, 6, x, 7, x, 8 };
> }
>
> code-gen trunk:
> f_s16:
>         adrp    x1, .LC0
>         ldr     q0, [x1, #:lo12:.LC0]
>         ins     v0.b[0], w0
>         ins     v0.b[2], w0
>         ins     v0.b[4], w0
>         ins     v0.b[6], w0
>         ins     v0.b[8], w0
>         ins     v0.b[10], w0
>         ins     v0.b[12], w0
>         ins     v0.b[14], w0
>         ret
>
> code-gen with patch:
> f_s16:
>         dup     v0.16b, w0
>         adrp    x0, .LC0
>         ldr     q1, [x0, #:lo12:.LC0]
>         zip1    v0.16b, v0.16b, v1.16b
>         ret
>
> (b)
> int8x16_t f_s16(int8_t x, int8_t y)
> {
>   return (int8x16_t) { x, y, 1, y, 2, y, 3, y,
>                        4, y, 5, y, 6, y, 7, y };
> }
>
> code-gen trunk:
> f_s16:
>         adrp    x2, .LC0
>         ldr     q0, [x2, #:lo12:.LC0]
>         ins     v0.b[0], w0
>         ins     v0.b[1], w1
>         ins     v0.b[3], w1
>         ins     v0.b[5], w1
>         ins     v0.b[7], w1
>         ins     v0.b[9], w1
>         ins     v0.b[11], w1
>         ins     v0.b[13], w1
>         ins     v0.b[15], w1
>         ret
>
> code-gen with patch:
> f_s16:
>         adrp    x2, .LC0
>         dup     v1.16b, w1
>         ldr     q0, [x2, #:lo12:.LC0]
>         ins     v0.b[0], w0
>         zip1    v0.16b, v0.16b, v1.16b
>         ret

Nice.
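
For reference, this is roughly the shape of the even/odd split I had in
mind; a minimal sketch only, with a hypothetical helper name, and
leaving out the costing and fallback logic:

    /* Split VALS into its even-indexed and odd-indexed elements, so
       that each half can be expanded independently and the results
       interleaved with zip1.  */
    static void
    aarch64_split_vector_init_halves (rtx vals, rtvec *even, rtvec *odd)
    {
      int n = XVECLEN (vals, 0);
      *even = rtvec_alloc (n / 2);
      *odd = rtvec_alloc (n / 2);
      for (int i = 0; i < n / 2; i++)
        {
          RTVEC_ELT (*even, i) = XVECEXP (vals, 0, 2 * i);
          RTVEC_ELT (*odd, i) = XVECEXP (vals, 0, 2 * i + 1);
        }
    }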

> There are a couple of issues I have come across:
> (1) Choosing the element to pad the vector.
> For example, if we are initializing a vector, say { x, y, 0, y, 1, y, 2, y },
> with mode V8HI, we split it into { x, 0, 1, 2 } and { y, y, y, y }.
> However, since the mode is V8HI, we would need to pad the above split
> vectors with 4 more elements to match up to the vector length.
> For { x, 0, 1, 2 } using any constant is the obvious choice, while for
> { y, y, y, y } using 'y' is the obvious choice, thus making them:
> { x, 0, 1, 2, 0, 0, 0, 0 } and { y, y, y, y, y, y, y, y }.
> These would then be merged using zip1, which would discard the upper
> half of both vectors.
> Currently I encoded the above two heuristics in
> aarch64_expand_vector_init_get_padded_elem:
> (a) If the split portion contains a constant, use the constant to pad
> the vector.
> (b) If the split portion only contains variables, then use the most
> frequently repeating variable to pad the vector.
> I suppose, though, this could be improved?

I think we should just build two 64-bit vectors (V4HIs) and use a subreg
to fill the upper elements with undefined values.

I suppose in principle we would have the same problem when splitting
a 64-bit vector into 2 32-bit vectors, but it's probably better to punt
on that for now.  Eventually it would be worth adding full support for
32-bit Advanced SIMD modes (with the necessary restrictions for FP
exceptions), but it's quite a big task.  The 128-bit to 64-bit split
is the one that matters most.
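
Concretely, something along these lines; a rough, untested sketch, in
which even_vals and odd_vals stand for the half-width PARALLELs built
from the split halves:

    /* Build each half in the 64-bit mode, then widen with paradoxical
       subregs.  The upper elements are undefined, which is fine
       because zip1 only reads the lower half of each input.  */
    machine_mode half_mode = V4HImode;  /* for a V8HI target */
    rtx even = gen_reg_rtx (half_mode);
    rtx odd = gen_reg_rtx (half_mode);
    aarch64_expand_vector_init (even, even_vals);
    aarch64_expand_vector_init (odd, odd_vals);
    rtx op0 = lowpart_subreg (V8HImode, even, half_mode);
    rtx op1 = lowpart_subreg (V8HImode, odd, half_mode);
    emit_set_insn (target, gen_rtx_UNSPEC (V8HImode,
                                           gen_rtvec (2, op0, op1),
                                           UNSPEC_ZIP1));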

> (2) Setting the cost for zip1:
> Currently it returns 4 as the cost for the following zip1 insn:
> (set (reg:V8HI 102)
>     (unspec:V8HI [
>             (reg:V8HI 103)
>             (reg:V8HI 108)
>         ] UNSPEC_ZIP1))
> I am not sure if that's correct, or if not, what cost to use in this
> case for zip1?

TBH 4 seems a bit optimistic.  It's COSTS_N_INSNS (1), whereas the
generic advsimd_vec_cost::permute_cost is 2 insns.  But the costs of
inserts are probably underestimated to the same extent, so hopefully
things work out.

So it's probably best to accept the costs as they're currently given.
Changing them would need extensive testing.

However, one of the advantages of the split is that it allows the
subvectors to be built in parallel.  When optimising for speed, it
might make sense to take the maximum of the subsequence costs and add
the cost of the zip to that.

Thanks,
Richard
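
P.S. In case it helps, the cost comparison I have in mind for the speed
case would look something like this; a sketch only, where seq_even,
seq_odd and seq_fallback stand for the candidate instruction sequences
and zip1_insn for the interleaving instruction:

    /* The two subvector builds are independent and can execute in
       parallel, so for speed take the slower of the two and add the
       cost of the zip before comparing against the fallback.  */
    unsigned even_cost = seq_cost (seq_even, /*speed=*/true);
    unsigned odd_cost = seq_cost (seq_odd, /*speed=*/true);
    unsigned split_cost = (MAX (even_cost, odd_cost)
                           + insn_cost (zip1_insn, /*speed=*/true));
    if (split_cost < seq_cost (seq_fallback, /*speed=*/true))
      /* ... emit the even/odd split + zip1 sequence ... */;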