Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Richard Sandiford <richard.sandiford@arm.com>
To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
Cc: gcc Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector
Date: Thu, 02 Feb 2023 15:20:40 +0000	[thread overview]
Message-ID: <mpta61wccvr.fsf@arm.com> (raw)
In-Reply-To: <CAAgBjMkxZXVPYoX_C=deX1P83ZXXqxoWWAkhuFMVE2ha3XJG+A@mail.gmail.com> (Prathamesh Kulkarni's message of "Thu, 2 Feb 2023 20:21:37 +0530")

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> > I have attached a patch that extends the transform if one half is dup
>> >> > and other is set of constants.
>> >> > For eg:
>> >> > int8x16_t f(int8_t x)
>> >> > {
>> >> >   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4, x, 5, x, 6, x, 7, x, 8 };
>> >> > }
>> >> >
>> >> > code-gen trunk:
>> >> > f:
>> >> >         adrp    x1, .LC0
>> >> >         ldr     q0, [x1, #:lo12:.LC0]
>> >> >         ins     v0.b[0], w0
>> >> >         ins     v0.b[2], w0
>> >> >         ins     v0.b[4], w0
>> >> >         ins     v0.b[6], w0
>> >> >         ins     v0.b[8], w0
>> >> >         ins     v0.b[10], w0
>> >> >         ins     v0.b[12], w0
>> >> >         ins     v0.b[14], w0
>> >> >         ret
>> >> >
>> >> > code-gen with patch:
>> >> > f:
>> >> >         dup     v0.16b, w0
>> >> >         adrp    x0, .LC0
>> >> >         ldr     q1, [x0, #:lo12:.LC0]
>> >> >         zip1    v0.16b, v0.16b, v1.16b
>> >> >         ret
>> >> >
>> >> > Bootstrapped+tested on aarch64-linux-gnu.
>> >> > Does it look OK ?
>> >>
>> >> Looks like a nice improvement.  It'll need to wait for GCC 14 now though.
>> >>
>> >> However, rather than handle this case specially, I think we should instead
>> >> take a divide-and-conquer approach: split the initialiser into even and
>> >> odd elements, find the best way of loading each part, then compare the
>> >> cost of these sequences + ZIP with the cost of the fallback code (the code
>> >> later in aarch64_expand_vector_init).
>> >>
>> >> For example, doing that would allow:
>> >>
>> >>   { x, y, 0, y, 0, y, 0, y, 0, y }
>> >>
>> >> to be loaded more easily, even though the even elements aren't wholly
>> >> constant.
>> > Hi Richard,
>> > I have attached a prototype patch based on the above approach.
>> > It subsumes specializing for above {x, y, x, y, x, y, x, y} case by generating
>> > same sequence, thus I removed that hunk, and improves the following cases:
>> >
>> > (a)
>> > int8x16_t f_s16(int8_t x)
>> > {
>> >   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4,
>> >                                  x, 5, x, 6, x, 7, x, 8 };
>> > }
>> >
>> > code-gen trunk:
>> > f_s16:
>> >         adrp    x1, .LC0
>> >         ldr     q0, [x1, #:lo12:.LC0]
>> >         ins     v0.b[0], w0
>> >         ins     v0.b[2], w0
>> >         ins     v0.b[4], w0
>> >         ins     v0.b[6], w0
>> >         ins     v0.b[8], w0
>> >         ins     v0.b[10], w0
>> >         ins     v0.b[12], w0
>> >         ins     v0.b[14], w0
>> >         ret
>> >
>> > code-gen with patch:
>> > f_s16:
>> >         dup     v0.16b, w0
>> >         adrp    x0, .LC0
>> >         ldr     q1, [x0, #:lo12:.LC0]
>> >         zip1    v0.16b, v0.16b, v1.16b
>> >         ret
>> >
>> > (b)
>> > int8x16_t f_s16(int8_t x, int8_t y)
>> > {
>> >   return (int8x16_t) { x, y, 1, y, 2, y, 3, y,
>> >                                 4, y, 5, y, 6, y, 7, y };
>> > }
>> >
>> > code-gen trunk:
>> > f_s16:
>> >         adrp    x2, .LC0
>> >         ldr     q0, [x2, #:lo12:.LC0]
>> >         ins     v0.b[0], w0
>> >         ins     v0.b[1], w1
>> >         ins     v0.b[3], w1
>> >         ins     v0.b[5], w1
>> >         ins     v0.b[7], w1
>> >         ins     v0.b[9], w1
>> >         ins     v0.b[11], w1
>> >         ins     v0.b[13], w1
>> >         ins     v0.b[15], w1
>> >         ret
>> >
>> > code-gen patch:
>> > f_s16:
>> >         adrp    x2, .LC0
>> >         dup     v1.16b, w1
>> >         ldr     q0, [x2, #:lo12:.LC0]
>> >         ins     v0.b[0], w0
>> >         zip1    v0.16b, v0.16b, v1.16b
>> >         ret
>>
>> Nice.
>>
>> > There are a couple of issues I have come across:
>> > (1) Choosing element to pad vector.
>> > For eg, if we are initiailizing a vector say { x, y, 0, y, 1, y, 2, y }
>> > with mode V8HI.
>> > We split it into { x, 0, 1, 2 } and { y, y, y, y}
>> > However since the mode is V8HI, we would need to pad the above split vectors
>> > with 4 more elements to match up to vector length.
>> > For {x, 0, 1, 2} using any constant is the obvious choice while for {y, y, y, y}
>> > using 'y' is the obvious choice thus making them:
>> > {x, 0, 1, 2, 0, 0, 0, 0} and {y, y, y, y, y, y, y, y}
>> > These would be then merged using zip1 which would discard the lower half
>> > of both vectors.
>> > Currently I encoded the above two heuristics in
>> > aarch64_expand_vector_init_get_padded_elem:
>> > (a) If split portion contains a constant, use the constant to pad the vector.
>> > (b) If split portion only contains variables, then use the most
>> > frequently repeating variable
>> > to pad the vector.
>> > I suppose tho this could be improved ?
>>
>> I think we should just build two 64-bit vectors (V4HIs) and use a subreg
>> to fill the upper elements with undefined values.
>>
>> I suppose in principle we would have the same problem when splitting
>> a 64-bit vector into 2 32-bit vectors, but it's probably better to punt
>> on that for now.  Eventually it would be worth adding full support for
>> 32-bit Advanced SIMD modes (with necessary restrictions for FP exceptions)
>> but it's quite a big task.  The 128-bit to 64-bit split is the one that
>> matters most.
>>
>> > (2) Setting cost for zip1:
>> > Currently it returns 4 as cost for following zip1 insn:
>> > (set (reg:V8HI 102)
>> >     (unspec:V8HI [
>> >             (reg:V8HI 103)
>> >             (reg:V8HI 108)
>> >         ] UNSPEC_ZIP1))
>> > I am not sure if that's correct, or if not, what cost to use in this case
>> > for zip1 ?
>>
>> TBH 4 seems a bit optimistic.  It's COSTS_N_INSNS (1), whereas the
>> generic advsimd_vec_cost::permute_cost is 2 insns.  But the costs of
>> inserts are probably underestimated to the same extent, so hopefully
>> things work out.
>>
>> So it's probably best to accept the costs as they're currently given.
>> Changing them would need extensive testing.
>>
>> However, one of the advantages of the split is that it allows the
>> subvectors to be built in parallel.  When optimising for speed,
>> it might make sense to take the maximum of the subsequence costs
>> and add the cost of the zip to that.
> Hi Richard,
> Thanks for the suggestions.
> In the attached patch, it recurses only if nelts == 16 to punt for 64
> -> 32 bit split,

It should be based on the size rather than the number of elements.
The example we talked about above involved building V8HIs from two
V4HIs, which is also valid.

> and uses std::max(even_init, odd_init) + insn_cost (zip1_insn) for
> computing total cost of the sequence.
>
> So, for following case:
> int8x16_t f_s8(int8_t x)
> {
>   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4,
>                                 x, 5, x, 6, x, 7, x, 8 };
> }
>
> it now generates:
> f_s16:
>         dup     v0.8b, w0
>         adrp    x0, .LC0
>         ldr       d1, [x0, #:lo12:.LC0]
>         zip1    v0.16b, v0.16b, v1.16b
>         ret
>
> Which I assume is correct, since zip1 will merge the lower halves of
> two vectors while leaving the upper halves undefined ?

Yeah, it looks valid, but I would say that zip1 ignores the upper halves
(rather than leaving them undefined).

Thanks,
Richard

next prev parent reply	other threads:[~2023-02-02 15:20 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-29 14:39 Prathamesh Kulkarni
2022-11-29 15:13 ` Andrew Pinski
2022-11-29 17:06   ` Prathamesh Kulkarni
2022-12-05 10:52 ` Richard Sandiford
2022-12-05 11:20   ` Richard Sandiford
2022-12-06  1:31     ` Prathamesh Kulkarni
2022-12-26  4:22       ` Prathamesh Kulkarni
2023-01-12 15:51         ` Richard Sandiford
2023-02-01  9:36           ` Prathamesh Kulkarni
2023-02-01 16:26             ` Richard Sandiford
2023-02-02 14:51               ` Prathamesh Kulkarni
2023-02-02 15:20                 ` Richard Sandiford [this message]
2023-02-03  1:40                   ` Prathamesh Kulkarni
2023-02-03  3:02                     ` Prathamesh Kulkarni
2023-02-03 15:17                       ` Richard Sandiford
2023-02-04  6:49                         ` Prathamesh Kulkarni
2023-02-06 12:13                           ` Richard Sandiford
2023-02-11  9:12                             ` Prathamesh Kulkarni
2023-03-10 18:08                               ` Richard Sandiford
2023-03-13  7:33                                 ` Richard Biener
2023-04-03 16:33                                   ` Prathamesh Kulkarni
2023-04-04 18:05                                     ` Richard Sandiford
2023-04-06 10:26                                       ` Prathamesh Kulkarni
2023-04-06 10:34                                         ` Richard Sandiford
2023-04-06 11:21                                           ` Prathamesh Kulkarni
2023-04-12  8:59                                             ` Richard Sandiford
2023-04-21  7:27                                               ` Prathamesh Kulkarni
2023-04-21  9:17                                                 ` Richard Sandiford
2023-04-21 15:15                                                   ` Prathamesh Kulkarni
2023-04-23  1:53                                                     ` Prathamesh Kulkarni
2023-04-24  9:29                                                       ` Richard Sandiford
2023-05-04 11:47                                                         ` Prathamesh Kulkarni
2023-05-11 19:07                                                           ` Richard Sandiford
2023-05-13  9:10                                                             ` Prathamesh Kulkarni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=mpta61wccvr.fsf@arm.com \
    --to=richard.sandiford@arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=prathamesh.kulkarni@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).