Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Richard Sandiford <richard.sandiford@arm.com>
To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
Cc: gcc Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector
Date: Wed, 01 Feb 2023 16:26:54 +0000	[thread overview]
Message-ID: <mptv8kle4hd.fsf@arm.com> (raw)
In-Reply-To: <CAAgBjMndgAd5eS52rKq+5MsqzA2FRiXM_3CLiovgD9rn8f6TBw@mail.gmail.com> (Prathamesh Kulkarni's message of "Wed, 1 Feb 2023 15:06:35 +0530")

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Thu, 12 Jan 2023 at 21:21, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > On Tue, 6 Dec 2022 at 07:01, Prathamesh Kulkarni
>> > <prathamesh.kulkarni@linaro.org> wrote:
>> >>
>> >> On Mon, 5 Dec 2022 at 16:50, Richard Sandiford
>> >> <richard.sandiford@arm.com> wrote:
>> >> >
>> >> > Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> > >> Hi,
>> >> > >> For the following test-case:
>> >> > >>
>> >> > >> int16x8_t foo(int16_t x, int16_t y)
>> >> > >> {
>> >> > >>   return (int16x8_t) { x, y, x, y, x, y, x, y };
>> >> > >> }
>> >> > >>
>> >> > >> Code gen at -O3:
>> >> > >> foo:
>> >> > >>         dup    v0.8h, w0
>> >> > >>         ins     v0.h[1], w1
>> >> > >>         ins     v0.h[3], w1
>> >> > >>         ins     v0.h[5], w1
>> >> > >>         ins     v0.h[7], w1
>> >> > >>         ret
>> >> > >>
>> >> > >> For 16 elements, it results in 8 ins instructions which might not be
>> >> > >> optimal perhaps.
>> >> > >> I guess, the above code-gen would be equivalent to the following ?
>> >> > >> dup v0.8h, w0
>> >> > >> dup v1.8h, w1
>> >> > >> zip1 v0.8h, v0.8h, v1.8h
>> >> > >>
>> >> > >> I have attached patch to do the same, if number of elements >= 8,
>> >> > >> which should be possibly better compared to current code-gen ?
>> >> > >> Patch passes bootstrap+test on aarch64-linux-gnu.
>> >> > >> Does the patch look OK ?
>> >> > >>
>> >> > >> Thanks,
>> >> > >> Prathamesh
>> >> > >>
>> >> > >> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> >> > >> index c91df6f5006..e5dea70e363 100644
>> >> > >> --- a/gcc/config/aarch64/aarch64.cc
>> >> > >> +++ b/gcc/config/aarch64/aarch64.cc
>> >> > >> @@ -22028,6 +22028,39 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >> > >>        return;
>> >> > >>      }
>> >> > >>
>> >> > >> +  /* Check for interleaving case.
>> >> > >> +     For eg if initializer is (int16x8_t) {x, y, x, y, x, y, x, y}.
>> >> > >> +     Generate following code:
>> >> > >> +     dup v0.h, x
>> >> > >> +     dup v1.h, y
>> >> > >> +     zip1 v0.h, v0.h, v1.h
>> >> > >> +     for "large enough" initializer.  */
>> >> > >> +
>> >> > >> +  if (n_elts >= 8)
>> >> > >> +    {
>> >> > >> +      int i;
>> >> > >> +      for (i = 2; i < n_elts; i++)
>> >> > >> +    if (!rtx_equal_p (XVECEXP (vals, 0, i), XVECEXP (vals, 0, i % 2)))
>> >> > >> +      break;
>> >> > >> +
>> >> > >> +      if (i == n_elts)
>> >> > >> +    {
>> >> > >> +      machine_mode mode = GET_MODE (target);
>> >> > >> +      rtx dest[2];
>> >> > >> +
>> >> > >> +      for (int i = 0; i < 2; i++)
>> >> > >> +        {
>> >> > >> +          rtx x = copy_to_mode_reg (GET_MODE_INNER (mode), XVECEXP (vals, 0, i));
>> >> > >
>> >> > > Formatting nit: long line.
>> >> > >
>> >> > >> +          dest[i] = gen_reg_rtx (mode);
>> >> > >> +          aarch64_emit_move (dest[i], gen_vec_duplicate (mode, x));
>> >> > >> +        }
>> >> > >
>> >> > > This could probably be written:
>> >> > >
>> >> > >         for (int i = 0; i < 2; i++)
>> >> > >           {
>> >> > >             rtx x = expand_vector_broadcast (mode, XVECEXP (vals, 0, i));
>> >> > >             dest[i] = force_reg (GET_MODE_INNER (mode), x);
>> >> >
>> >> > Oops, I meant "mode" rather than "GET_MODE_INNER (mode)", sorry.
>> >> Thanks, I have pushed the change in
>> >> 769370f3e2e04823c8a621d8ffa756dd83ebf21e after running
>> >> bootstrap+test on aarch64-linux-gnu.
>> > Hi Richard,
>> > I have attached a patch that extends the transform if one half is dup
>> > and other is set of constants.
>> > For eg:
>> > int8x16_t f(int8_t x)
>> > {
>> >   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4, x, 5, x, 6, x, 7, x, 8 };
>> > }
>> >
>> > code-gen trunk:
>> > f:
>> >         adrp    x1, .LC0
>> >         ldr     q0, [x1, #:lo12:.LC0]
>> >         ins     v0.b[0], w0
>> >         ins     v0.b[2], w0
>> >         ins     v0.b[4], w0
>> >         ins     v0.b[6], w0
>> >         ins     v0.b[8], w0
>> >         ins     v0.b[10], w0
>> >         ins     v0.b[12], w0
>> >         ins     v0.b[14], w0
>> >         ret
>> >
>> > code-gen with patch:
>> > f:
>> >         dup     v0.16b, w0
>> >         adrp    x0, .LC0
>> >         ldr     q1, [x0, #:lo12:.LC0]
>> >         zip1    v0.16b, v0.16b, v1.16b
>> >         ret
>> >
>> > Bootstrapped+tested on aarch64-linux-gnu.
>> > Does it look OK ?
>>
>> Looks like a nice improvement.  It'll need to wait for GCC 14 now though.
>>
>> However, rather than handle this case specially, I think we should instead
>> take a divide-and-conquer approach: split the initialiser into even and
>> odd elements, find the best way of loading each part, then compare the
>> cost of these sequences + ZIP with the cost of the fallback code (the code
>> later in aarch64_expand_vector_init).
>>
>> For example, doing that would allow:
>>
>>   { x, y, 0, y, 0, y, 0, y, 0, y }
>>
>> to be loaded more easily, even though the even elements aren't wholly
>> constant.
> Hi Richard,
> I have attached a prototype patch based on the above approach.
> It subsumes specializing for above {x, y, x, y, x, y, x, y} case by generating
> same sequence, thus I removed that hunk, and improves the following cases:
>
> (a)
> int8x16_t f_s16(int8_t x)
> {
>   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4,
>                                  x, 5, x, 6, x, 7, x, 8 };
> }
>
> code-gen trunk:
> f_s16:
>         adrp    x1, .LC0
>         ldr     q0, [x1, #:lo12:.LC0]
>         ins     v0.b[0], w0
>         ins     v0.b[2], w0
>         ins     v0.b[4], w0
>         ins     v0.b[6], w0
>         ins     v0.b[8], w0
>         ins     v0.b[10], w0
>         ins     v0.b[12], w0
>         ins     v0.b[14], w0
>         ret
>
> code-gen with patch:
> f_s16:
>         dup     v0.16b, w0
>         adrp    x0, .LC0
>         ldr     q1, [x0, #:lo12:.LC0]
>         zip1    v0.16b, v0.16b, v1.16b
>         ret
>
> (b)
> int8x16_t f_s16(int8_t x, int8_t y)
> {
>   return (int8x16_t) { x, y, 1, y, 2, y, 3, y,
>                                 4, y, 5, y, 6, y, 7, y };
> }
>
> code-gen trunk:
> f_s16:
>         adrp    x2, .LC0
>         ldr     q0, [x2, #:lo12:.LC0]
>         ins     v0.b[0], w0
>         ins     v0.b[1], w1
>         ins     v0.b[3], w1
>         ins     v0.b[5], w1
>         ins     v0.b[7], w1
>         ins     v0.b[9], w1
>         ins     v0.b[11], w1
>         ins     v0.b[13], w1
>         ins     v0.b[15], w1
>         ret
>
> code-gen patch:
> f_s16:
>         adrp    x2, .LC0
>         dup     v1.16b, w1
>         ldr     q0, [x2, #:lo12:.LC0]
>         ins     v0.b[0], w0
>         zip1    v0.16b, v0.16b, v1.16b
>         ret

Nice.

> There are a couple of issues I have come across:
> (1) Choosing element to pad vector.
> For eg, if we are initiailizing a vector say { x, y, 0, y, 1, y, 2, y }
> with mode V8HI.
> We split it into { x, 0, 1, 2 } and { y, y, y, y}
> However since the mode is V8HI, we would need to pad the above split vectors
> with 4 more elements to match up to vector length.
> For {x, 0, 1, 2} using any constant is the obvious choice while for {y, y, y, y}
> using 'y' is the obvious choice thus making them:
> {x, 0, 1, 2, 0, 0, 0, 0} and {y, y, y, y, y, y, y, y}
> These would be then merged using zip1 which would discard the lower half
> of both vectors.
> Currently I encoded the above two heuristics in
> aarch64_expand_vector_init_get_padded_elem:
> (a) If split portion contains a constant, use the constant to pad the vector.
> (b) If split portion only contains variables, then use the most
> frequently repeating variable
> to pad the vector.
> I suppose tho this could be improved ?

I think we should just build two 64-bit vectors (V4HIs) and use a subreg
to fill the upper elements with undefined values.

I suppose in principle we would have the same problem when splitting
a 64-bit vector into 2 32-bit vectors, but it's probably better to punt
on that for now.  Eventually it would be worth adding full support for
32-bit Advanced SIMD modes (with necessary restrictions for FP exceptions)
but it's quite a big task.  The 128-bit to 64-bit split is the one that
matters most.

> (2) Setting cost for zip1:
> Currently it returns 4 as cost for following zip1 insn:
> (set (reg:V8HI 102)
>     (unspec:V8HI [
>             (reg:V8HI 103)
>             (reg:V8HI 108)
>         ] UNSPEC_ZIP1))
> I am not sure if that's correct, or if not, what cost to use in this case
> for zip1 ?

TBH 4 seems a bit optimistic.  It's COSTS_N_INSNS (1), whereas the
generic advsimd_vec_cost::permute_cost is 2 insns.  But the costs of
inserts are probably underestimated to the same extent, so hopefully
things work out.

So it's probably best to accept the costs as they're currently given.
Changing them would need extensive testing.

However, one of the advantages of the split is that it allows the
subvectors to be built in parallel.  When optimising for speed,
it might make sense to take the maximum of the subsequence costs
and add the cost of the zip to that.

Thanks,
Richard

next prev parent reply	other threads:[~2023-02-01 16:26 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-29 14:39 Prathamesh Kulkarni
2022-11-29 15:13 ` Andrew Pinski
2022-11-29 17:06   ` Prathamesh Kulkarni
2022-12-05 10:52 ` Richard Sandiford
2022-12-05 11:20   ` Richard Sandiford
2022-12-06  1:31     ` Prathamesh Kulkarni
2022-12-26  4:22       ` Prathamesh Kulkarni
2023-01-12 15:51         ` Richard Sandiford
2023-02-01  9:36           ` Prathamesh Kulkarni
2023-02-01 16:26             ` Richard Sandiford [this message]
2023-02-02 14:51               ` Prathamesh Kulkarni
2023-02-02 15:20                 ` Richard Sandiford
2023-02-03  1:40                   ` Prathamesh Kulkarni
2023-02-03  3:02                     ` Prathamesh Kulkarni
2023-02-03 15:17                       ` Richard Sandiford
2023-02-04  6:49                         ` Prathamesh Kulkarni
2023-02-06 12:13                           ` Richard Sandiford
2023-02-11  9:12                             ` Prathamesh Kulkarni
2023-03-10 18:08                               ` Richard Sandiford
2023-03-13  7:33                                 ` Richard Biener
2023-04-03 16:33                                   ` Prathamesh Kulkarni
2023-04-04 18:05                                     ` Richard Sandiford
2023-04-06 10:26                                       ` Prathamesh Kulkarni
2023-04-06 10:34                                         ` Richard Sandiford
2023-04-06 11:21                                           ` Prathamesh Kulkarni
2023-04-12  8:59                                             ` Richard Sandiford
2023-04-21  7:27                                               ` Prathamesh Kulkarni
2023-04-21  9:17                                                 ` Richard Sandiford
2023-04-21 15:15                                                   ` Prathamesh Kulkarni
2023-04-23  1:53                                                     ` Prathamesh Kulkarni
2023-04-24  9:29                                                       ` Richard Sandiford
2023-05-04 11:47                                                         ` Prathamesh Kulkarni
2023-05-11 19:07                                                           ` Richard Sandiford
2023-05-13  9:10                                                             ` Prathamesh Kulkarni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=mptv8kle4hd.fsf@arm.com \
    --to=richard.sandiford@arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=prathamesh.kulkarni@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).