From: Richard Sandiford
To: Prathamesh Kulkarni
Cc: gcc Patches
Subject: Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector
Date: Fri, 03 Feb 2023 15:17:44 +0000
In-Reply-To: (Prathamesh Kulkarni's message of "Fri, 3 Feb 2023 08:32:46 +0530")

Prathamesh Kulkarni writes:
> On Fri, 3 Feb 2023 at 07:10, Prathamesh Kulkarni
> wrote:
>>
>> On Thu, 2 Feb 2023 at 20:50, Richard Sandiford
>> wrote:
>> >
>> > Prathamesh Kulkarni writes:
>> > >> >> > I have attached a patch that extends the transform if one half is dup
>> > >> >> > and other is set of constants.
>> > >> >> > For eg:
>> > >> >> > int8x16_t f(int8_t x)
>> > >> >> > {
>> > >> >> >   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4, x, 5, x, 6, x, 7, x, 8 };
>> > >> >> > }
>> > >> >> >
>> > >> >> > code-gen trunk:
>> > >> >> > f:
>> > >> >> > adrp x1, .LC0
>> > >> >> > ldr q0, [x1, #:lo12:.LC0]
>> > >> >> > ins v0.b[0], w0
>> > >> >> > ins v0.b[2], w0
>> > >> >> > ins v0.b[4], w0
>> > >> >> > ins v0.b[6], w0
>> > >> >> > ins v0.b[8], w0
>> > >> >> > ins v0.b[10], w0
>> > >> >> > ins v0.b[12], w0
>> > >> >> > ins v0.b[14], w0
>> > >> >> > ret
>> > >> >> >
>> > >> >> > code-gen with patch:
>> > >> >> > f:
>> > >> >> > dup v0.16b, w0
>> > >> >> > adrp x0, .LC0
>> > >> >> > ldr q1, [x0, #:lo12:.LC0]
>> > >> >> > zip1 v0.16b, v0.16b, v1.16b
>> > >> >> > ret
>> > >> >> >
>> > >> >> > Bootstrapped+tested on aarch64-linux-gnu.
>> > >> >> > Does it look OK ?
>> > >> >>
>> > >> >> Looks like a nice improvement.  It'll need to wait for GCC 14 now though.
>> > >> >>
>> > >> >> However, rather than handle this case specially, I think we should instead
>> > >> >> take a divide-and-conquer approach: split the initialiser into even and
>> > >> >> odd elements, find the best way of loading each part, then compare the
>> > >> >> cost of these sequences + ZIP with the cost of the fallback code (the code
>> > >> >> later in aarch64_expand_vector_init).
>> > >> >>
>> > >> >> For example, doing that would allow:
>> > >> >>
>> > >> >>   { x, y, 0, y, 0, y, 0, y, 0, y }
>> > >> >>
>> > >> >> to be loaded more easily, even though the even elements aren't wholly
>> > >> >> constant.
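To make that concrete, here is a small illustration of the even/odd
decomposition (the function name g and the 8-element form are mine,
since the 10-element initialiser above doesn't map directly to a single
Advanced SIMD mode):

  #include <arm_neon.h>

  /* Even elements { x, 0, 0, 0 } are almost constant, so they can be
     loaded cheaply on their own; odd elements { y, y, y, y } are a
     single dup; one zip1 then interleaves the two halves.  */
  int16x8_t g (int16_t x, int16_t y)
  {
    return (int16x8_t) { x, y, 0, y, 0, y, 0, y };
  }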
>> > >> > Hi Richard,
>> > >> > I have attached a prototype patch based on the above approach.
>> > >> > It subsumes specializing for above {x, y, x, y, x, y, x, y} case by generating
>> > >> > same sequence, thus I removed that hunk, and improves the following cases:
>> > >> >
>> > >> > (a)
>> > >> > int8x16_t f_s16(int8_t x)
>> > >> > {
>> > >> >   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4,
>> > >> >                        x, 5, x, 6, x, 7, x, 8 };
>> > >> > }
>> > >> >
>> > >> > code-gen trunk:
>> > >> > f_s16:
>> > >> > adrp x1, .LC0
>> > >> > ldr q0, [x1, #:lo12:.LC0]
>> > >> > ins v0.b[0], w0
>> > >> > ins v0.b[2], w0
>> > >> > ins v0.b[4], w0
>> > >> > ins v0.b[6], w0
>> > >> > ins v0.b[8], w0
>> > >> > ins v0.b[10], w0
>> > >> > ins v0.b[12], w0
>> > >> > ins v0.b[14], w0
>> > >> > ret
>> > >> >
>> > >> > code-gen with patch:
>> > >> > f_s16:
>> > >> > dup v0.16b, w0
>> > >> > adrp x0, .LC0
>> > >> > ldr q1, [x0, #:lo12:.LC0]
>> > >> > zip1 v0.16b, v0.16b, v1.16b
>> > >> > ret
>> > >> >
>> > >> > (b)
>> > >> > int8x16_t f_s16(int8_t x, int8_t y)
>> > >> > {
>> > >> >   return (int8x16_t) { x, y, 1, y, 2, y, 3, y,
>> > >> >                        4, y, 5, y, 6, y, 7, y };
>> > >> > }
>> > >> >
>> > >> > code-gen trunk:
>> > >> > f_s16:
>> > >> > adrp x2, .LC0
>> > >> > ldr q0, [x2, #:lo12:.LC0]
>> > >> > ins v0.b[0], w0
>> > >> > ins v0.b[1], w1
>> > >> > ins v0.b[3], w1
>> > >> > ins v0.b[5], w1
>> > >> > ins v0.b[7], w1
>> > >> > ins v0.b[9], w1
>> > >> > ins v0.b[11], w1
>> > >> > ins v0.b[13], w1
>> > >> > ins v0.b[15], w1
>> > >> > ret
>> > >> >
>> > >> > code-gen patch:
>> > >> > f_s16:
>> > >> > adrp x2, .LC0
>> > >> > dup v1.16b, w1
>> > >> > ldr q0, [x2, #:lo12:.LC0]
>> > >> > ins v0.b[0], w0
>> > >> > zip1 v0.16b, v0.16b, v1.16b
>> > >> > ret
>> > >>
>> > >> Nice.
>> > >>
>> > >> > There are a couple of issues I have come across:
>> > >> > (1) Choosing element to pad vector.
>> > >> > For eg, if we are initializing a vector say { x, y, 0, y, 1, y, 2, y }
>> > >> > with mode V8HI.
>> > >> > We split it into { x, 0, 1, 2 } and { y, y, y, y }.
>> > >> > However since the mode is V8HI, we would need to pad the above split vectors
>> > >> > with 4 more elements to match up to vector length.
>> > >> > For {x, 0, 1, 2} using any constant is the obvious choice while for {y, y, y, y}
>> > >> > using 'y' is the obvious choice thus making them:
>> > >> > {x, 0, 1, 2, 0, 0, 0, 0} and {y, y, y, y, y, y, y, y}
>> > >> > These would then be merged using zip1, which would discard the upper half
>> > >> > of both vectors.
>> > >> > Currently I encoded the above two heuristics in
>> > >> > aarch64_expand_vector_init_get_padded_elem:
>> > >> > (a) If split portion contains a constant, use the constant to pad the vector.
>> > >> > (b) If split portion only contains variables, then use the most
>> > >> > frequently repeating variable
>> > >> > to pad the vector.
>> > >> > I suppose tho this could be improved ?
>> > >>
>> > >> I think we should just build two 64-bit vectors (V4HIs) and use a subreg
>> > >> to fill the upper elements with undefined values.
>> > >>
>> > >> I suppose in principle we would have the same problem when splitting
>> > >> a 64-bit vector into 2 32-bit vectors, but it's probably better to punt
>> > >> on that for now.  Eventually it would be worth adding full support for
>> > >> 32-bit Advanced SIMD modes (with necessary restrictions for FP exceptions)
>> > >> but it's quite a big task.  The 128-bit to 64-bit split is the one that
>> > >> matters most.
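Roughly the kind of thing I mean by the subreg trick, as a sketch only
(the helper name is invented, not something from the patch):

  /* Build a 64-bit half in its own pseudo and then view it in the
     128-bit mode through a paradoxical subreg, so that the upper
     64 bits are simply undefined instead of being padded with values
     we have to choose.  */
  static rtx
  build_half_in_full_mode (machine_mode full_mode, rtx half_vals)
  {
    /* HALF_VALS is a PARALLEL in the 64-bit mode, e.g. V4HImode.  */
    rtx half = gen_reg_rtx (GET_MODE (half_vals));
    aarch64_expand_vector_init (half, half_vals);
    return gen_rtx_SUBREG (full_mode, half, 0);
  }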
>> > >>
>> > >> > (2) Setting cost for zip1:
>> > >> > Currently it returns 4 as cost for following zip1 insn:
>> > >> > (set (reg:V8HI 102)
>> > >> >     (unspec:V8HI [
>> > >> >             (reg:V8HI 103)
>> > >> >             (reg:V8HI 108)
>> > >> >         ] UNSPEC_ZIP1))
>> > >> > I am not sure if that's correct, or if not, what cost to use in this case
>> > >> > for zip1 ?
>> > >>
>> > >> TBH 4 seems a bit optimistic.  It's COSTS_N_INSNS (1), whereas the
>> > >> generic advsimd_vec_cost::permute_cost is 2 insns.  But the costs of
>> > >> inserts are probably underestimated to the same extent, so hopefully
>> > >> things work out.
>> > >>
>> > >> So it's probably best to accept the costs as they're currently given.
>> > >> Changing them would need extensive testing.
>> > >>
>> > >> However, one of the advantages of the split is that it allows the
>> > >> subvectors to be built in parallel.  When optimising for speed,
>> > >> it might make sense to take the maximum of the subsequence costs
>> > >> and add the cost of the zip to that.
>> > > Hi Richard,
>> > > Thanks for the suggestions.
>> > > In the attached patch, it recurses only if nelts == 16 to punt for 64
>> > > -> 32 bit split,
>> >
>> > It should be based on the size rather than the number of elements.
>> > The example we talked about above involved building V8HIs from two
>> > V4HIs, which is also valid.
>> Right, sorry got mixed up. The attached patch punts if vector_size == 64 by
>> resorting to fallback, which handles V8HI cases.
>> For eg:
>> int16x8_t f(int16_t x)
>> {
>>   return (int16x8_t) { x, 1, x, 2, x, 3, x, 4 };
>> }
>>
>> code-gen with patch:
>> f:
>> dup v0.4h, w0
>> adrp x0, .LC0
>> ldr d1, [x0, #:lo12:.LC0]
>> zip1 v0.8h, v0.8h, v1.8h
>> ret
>>
>> Just to clarify, we punt on 64 bit vector size, because there is no
>> 32-bit vector available,
>> to build 2 32-bit vectors for even and odd halves, and then "extend"
>> them with subreg ?

Right.  And if we want to fix that, I think the starting point would be
to add (general) 32-bit vector support first.

>> It also punts if n_elts < 8, because I am not sure
>> if it's profitable to do recursion+merging for 4 or lesser elements.
>> Does it look OK ?

Splitting { x, y, x, y } should at least be a size win over 4 individual
moves/inserts.  Possibly a speed win too if x and y are in general
registers.

So I think n_elts < 4 might be better.  If the costs get a case wrong,
we should fix the costs.

>> > > and uses std::max(even_init, odd_init) + insn_cost (zip1_insn) for
>> > > computing total cost of the sequence.
>> > >
>> > > So, for following case:
>> > > int8x16_t f_s8(int8_t x)
>> > > {
>> > >   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4,
>> > >                        x, 5, x, 6, x, 7, x, 8 };
>> > > }
>> > >
>> > > it now generates:
>> > > f_s16:
>> > > dup v0.8b, w0
>> > > adrp x0, .LC0
>> > > ldr d1, [x0, #:lo12:.LC0]
>> > > zip1 v0.16b, v0.16b, v1.16b
>> > > ret
>> > >
>> > > Which I assume is correct, since zip1 will merge the lower halves of
>> > > two vectors while leaving the upper halves undefined ?
>> >
>> > Yeah, it looks valid, but I would say that zip1 ignores the upper halves
>> > (rather than leaving them undefined).
>> Yes, sorry for mis-phrasing.
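On the costing itself, something along these lines is what I had in
mind; this is just a sketch with invented names, not the exact code in
the patch:

  /* The two halves can be built in parallel when optimising for speed,
     so combine their costs with max; under -Os the total instruction
     count is what matters, so add them instead.  */
  static void
  emit_cheaper_sequence (rtx_insn *split_seq, unsigned costs[2],
                         rtx_insn *zip1_insn,
                         rtx_insn *fallback_seq, unsigned fallback_cost)
  {
    unsigned split_cost = optimize_size
                          ? costs[0] + costs[1]
                          : std::max (costs[0], costs[1]);
    split_cost += insn_cost (zip1_insn, !optimize_size);

    /* Prefer the fallback in the event of a tie, since it will tend
       to use fewer registers.  */
    emit_insn (split_cost < fallback_cost ? split_seq : fallback_seq);
  }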
>>
>> For the following test:
>> int16x8_t f_s16 (int16_t x0, int16_t x1, int16_t x2, int16_t x3,
>>                  int16_t x4, int16_t x5, int16_t x6, int16_t x7)
>> {
>>   return (int16x8_t) { x0, x1, x2, x3, x4, x5, x6, x7 };
>> }
>>
>> it chose to go recursive+zip1 route since we take max (cost
>> (odd_init), cost (even_init)) and add
>> cost of zip1 insn which turns out to be lesser than cost of fallback:
>>
>> f_s16:
>> sxth w0, w0
>> sxth w1, w1
>> fmov d0, x0
>> fmov d1, x1
>> ins v0.h[1], w2
>> ins v1.h[1], w3
>> ins v0.h[2], w4
>> ins v1.h[2], w5
>> ins v0.h[3], w6
>> ins v1.h[3], w7
>> zip1 v0.8h, v0.8h, v1.8h
>> ret
>>
>> I assume that's OK since it has fewer dependencies compared to
>> fallback code-gen even if it's longer ?
>> With -Os the cost for sequence is taken as cost(odd_init) +
>> cost(even_init) + cost(zip1_insn)
>> which turns out to be same as cost for fallback sequence and it
>> generates the fallback code-sequence:
>>
>> f_s16:
>> sxth w0, w0
>> fmov s0, w0
>> ins v0.h[1], w1
>> ins v0.h[2], w2
>> ins v0.h[3], w3
>> ins v0.h[4], w4
>> ins v0.h[5], w5
>> ins v0.h[6], w6
>> ins v0.h[7], w7
>> ret
>>
> Forgot to remove the hunk handling interleaving case, done in the
> attached patch.
>
> Thanks,
> Prathamesh
>> Thanks,
>> Prathamesh
>> >
>> > Thanks,
>> > Richard
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index acc0cfe5f94..dd2a64d2e4e 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -21976,7 +21976,7 @@ aarch64_simd_make_constant (rtx vals)
>     initialised to contain VALS.  */
>  
>  void
> -aarch64_expand_vector_init (rtx target, rtx vals)
> +aarch64_expand_vector_init_fallback (rtx target, rtx vals)

The comment needs to be updated.  Maybe:

/* A subroutine of aarch64_expand_vector_init, with the same interface.
   The caller has already tried a divide-and-conquer approach, so do not
   consider that case here.  */

>  {
>    machine_mode mode = GET_MODE (target);
>    scalar_mode inner_mode = GET_MODE_INNER (mode);
> @@ -22036,38 +22036,6 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>        return;
>      }
>  
> -  /* Check for interleaving case.
> -     For eg if initializer is (int16x8_t) {x, y, x, y, x, y, x, y}.
> -     Generate following code:
> -     dup v0.h, x
> -     dup v1.h, y
> -     zip1 v0.h, v0.h, v1.h
> -     for "large enough" initializer.  */
> -
> -  if (n_elts >= 8)
> -    {
> -      int i;
> -      for (i = 2; i < n_elts; i++)
> -        if (!rtx_equal_p (XVECEXP (vals, 0, i), XVECEXP (vals, 0, i % 2)))
> -          break;
> -
> -      if (i == n_elts)
> -        {
> -          machine_mode mode = GET_MODE (target);
> -          rtx dest[2];
> -
> -          for (int i = 0; i < 2; i++)
> -            {
> -              rtx x = expand_vector_broadcast (mode, XVECEXP (vals, 0, i));
> -              dest[i] = force_reg (mode, x);
> -            }
> -
> -          rtvec v = gen_rtvec (2, dest[0], dest[1]);
> -          emit_set_insn (target, gen_rtx_UNSPEC (mode, v, UNSPEC_ZIP1));
> -          return;
> -        }
> -    }
> -
>    enum insn_code icode = optab_handler (vec_set_optab, mode);
>    gcc_assert (icode != CODE_FOR_nothing);
>  
> @@ -22189,7 +22157,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>          }
>        XVECEXP (copy, 0, i) = subst;
>      }
> -  aarch64_expand_vector_init (target, copy);
> +  aarch64_expand_vector_init_fallback (target, copy);
>  }
>  
>    /* Insert the variable lanes directly.  */
> @@ -22203,6 +22171,91 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>      }
>  }
>  
> +DEBUG_FUNCTION
> +static void
> +aarch64_expand_vector_init_debug_seq (rtx_insn *seq, const char *s)
> +{
> +  fprintf (stderr, "%s: %u\n", s, seq_cost (seq, !optimize_size));
> +  for (rtx_insn *i = seq; i; i = NEXT_INSN (i))
> +    {
> +      debug_rtx (PATTERN (i));
> +      fprintf (stderr, "cost: %d\n", pattern_cost (PATTERN (i), !optimize_size));
> +    }
> +}

I'm not sure we should commit this to the tree.

> +
> +static rtx
> +aarch64_expand_vector_init_split_vals (machine_mode mode, rtx vals, bool even_p)

How about calling this aarch64_unzip_vector_init?  It needs a function
comment.

> +{
> +  int n = XVECLEN (vals, 0);
> +  machine_mode new_mode
> +    = aarch64_simd_container_mode (GET_MODE_INNER (mode), 64);

IMO it would be better to use "GET_MODE_BITSIZE (mode).to_constant () / 2"
or "GET_MODE_UNIT_BITSIZE (mode) * n / 2" for the second argument.

> +  rtvec vec = rtvec_alloc (n / 2);
> +  for (int i = 0; i < n; i++)
> +    RTVEC_ELT (vec, i) = (even_p) ? XVECEXP (vals, 0, 2 * i)
> +                                  : XVECEXP (vals, 0, 2 * i + 1);
> +  return gen_rtx_PARALLEL (new_mode, vec);
> +}
> +
> +/*
> +The function does the following:
> +(a) Generates code sequence by splitting VALS into even and odd halves,
> +    and recursively calling itself to initialize them and then merge using
> +    zip1.
> +(b) Generate code sequence directly using aarch64_expand_vector_init_fallback.
> +(c) Compare the cost of code sequences generated by (a) and (b), and choose
> +    the more efficient one.
> +*/

I think we should keep the current description of the interface before
describing the implementation:

/* Expand a vector initialization sequence, such that TARGET is
   initialized to contain VALS.  */

(includes an s/s/z/).  And it's probably better to describe the
implementation inside the function.

Most comments are written in imperative style, so how about:

  /* Try decomposing the initializer into even and odd halves and
     then ZIP them together.  Use the resulting sequence if it is
     strictly cheaper than loading VALS directly.

     Prefer the fallback sequence in the event of a tie, since it
     will tend to use fewer registers.  */

> +
> +void
> +aarch64_expand_vector_init (rtx target, rtx vals)
> +{
> +  machine_mode mode = GET_MODE (target);
> +  int n_elts = XVECLEN (vals, 0);
> +
> +  if (n_elts < 8
> +      || known_eq (GET_MODE_BITSIZE (mode), 64))

Might be more robust to test

  maybe_ne (GET_MODE_BITSIZE (mode), 128)

> +    {
> +      aarch64_expand_vector_init_fallback (target, vals);
> +      return;
> +    }
> +
> +  start_sequence ();
> +  rtx dest[2];
> +  unsigned costs[2];
> +  for (int i = 0; i < 2; i++)
> +    {
> +      start_sequence ();
> +      dest[i] = gen_reg_rtx (mode);
> +      rtx new_vals
> +        = aarch64_expand_vector_init_split_vals (mode, vals, (i % 2) == 0);
> +      rtx tmp_reg = gen_reg_rtx (GET_MODE (new_vals));
> +      aarch64_expand_vector_init (tmp_reg, new_vals);
> +      dest[i] = gen_rtx_SUBREG (mode, tmp_reg, 0);

Maybe "src" or "halves" would be a better name than "dest", given that
the rtx isn't actually the destination of the subsequence.

> +      rtx_insn *rec_seq = get_insns ();
> +      end_sequence ();
> +      costs[i] = seq_cost (rec_seq, !optimize_size);
> +      emit_insn (rec_seq);
> +    }
> +
> +  rtvec v = gen_rtvec (2, dest[0], dest[1]);
> +  rtx_insn *zip1_insn
> +    = emit_set_insn (target, gen_rtx_UNSPEC (mode, v, UNSPEC_ZIP1));
> +  unsigned seq_total_cost
> +    = (!optimize_size) ? std::max (costs[0], costs[1]) : costs[0] + costs[1];

This is the wrong way round: max should be for speed and addition
for size.

Thanks,
Richard

> +  seq_total_cost += insn_cost (zip1_insn, !optimize_size);
> +
> +  rtx_insn *seq = get_insns ();
> +  end_sequence ();
> +
> +  start_sequence ();
> +  aarch64_expand_vector_init_fallback (target, vals);
> +  rtx_insn *fallback_seq = get_insns ();
> +  unsigned fallback_seq_cost = seq_cost (fallback_seq, !optimize_size);
> +  end_sequence ();
> +
> +  emit_insn (seq_total_cost < fallback_seq_cost ? seq : fallback_seq);
> +}
> +
>  /* Emit RTL corresponding to:
>     insr TARGET, ELEM.  */
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/interleave-init-1.c b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> similarity index 82%
> rename from gcc/testsuite/gcc.target/aarch64/interleave-init-1.c
> rename to gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> index ee775048589..e812d3946de 100644
> --- a/gcc/testsuite/gcc.target/aarch64/interleave-init-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> @@ -7,8 +7,8 @@
>  /*
>  ** foo:
>  ** ...
> -** dup v[0-9]+\.8h, w[0-9]+
> -** dup v[0-9]+\.8h, w[0-9]+
> +** dup v[0-9]+\.4h, w[0-9]+
> +** dup v[0-9]+\.4h, w[0-9]+
>  ** zip1 v[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h
>  ** ...
>  ** ret
> @@ -23,8 +23,8 @@ int16x8_t foo(int16_t x, int y)
>  /*
>  ** foo2:
>  ** ...
> -** dup v[0-9]+\.8h, w[0-9]+
> -** movi v[0-9]+\.8h, 0x1
> +** dup v[0-9]+\.4h, w[0-9]+
> +** movi v[0-9]+\.4h, 0x1
>  ** zip1 v[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h
>  ** ...
>  ** ret
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-19.c b/gcc/testsuite/gcc.target/aarch64/vec-init-19.c
> new file mode 100644
> index 00000000000..e28fdcda29d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-19.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <arm_neon.h>
> +
> +/*
> +** f_s8:
> +** ...
> +** dup v[0-9]+\.8b, w[0-9]+
> +** adrp x[0-9]+, \.LC[0-9]+
> +** ldr d[0-9]+, \[x[0-9]+, #:lo12:.LC[0-9]+\]
> +** zip1 v[0-9]+\.16b, v[0-9]+\.16b, v[0-9]+\.16b
> +** ret
> +*/
> +
> +int8x16_t f_s8(int8_t x)
> +{
> +  return (int8x16_t) { x, 1, x, 2, x, 3, x, 4,
> +                       x, 5, x, 6, x, 7, x, 8 };
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-20.c b/gcc/testsuite/gcc.target/aarch64/vec-init-20.c
> new file mode 100644
> index 00000000000..9366ca349b6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-20.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <arm_neon.h>
> +
> +/*
> +** f_s8:
> +** ...
> +** adrp x[0-9]+, \.LC[0-9]+
> +** dup v[0-9]+\.8b, w[0-9]+
> +** ldr d[0-9]+, \[x[0-9]+, #:lo12:\.LC[0-9]+\]
> +** ins v0\.b\[0\], w0
> +** zip1 v[0-9]+\.16b, v[0-9]+\.16b, v[0-9]+\.16b
> +** ret
> +*/
> +
> +int8x16_t f_s8(int8_t x, int8_t y)
> +{
> +  return (int8x16_t) { x, y, 1, y, 2, y, 3, y,
> +                       4, y, 5, y, 6, y, 7, y };
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-21.c b/gcc/testsuite/gcc.target/aarch64/vec-init-21.c
> new file mode 100644
> index 00000000000..e16459486d7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-21.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <arm_neon.h>
> +
> +/*
> +** f_s8:
> +** ...
> +** adrp x[0-9]+, \.LC[0-9]+
> +** ldr q[0-9]+, \[x[0-9]+, #:lo12:\.LC[0-9]+\]
> +** ins v0\.b\[0\], w0
> +** ins v0\.b\[1\], w1
> +** ...
> +** ret
> +*/
> +
> +int8x16_t f_s8(int8_t x, int8_t y)
> +{
> +  return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6,
> +                       7, 8, 9, 10, 11, 12, 13, 14 };
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-22-size.c b/gcc/testsuite/gcc.target/aarch64/vec-init-22-size.c
> new file mode 100644
> index 00000000000..8f35854c008
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-22-size.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Os" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +/* Verify that fallback code-sequence is chosen over
> +   recursively generated code-sequence merged with zip1.  */
> +
> +/*
> +** f_s16:
> +** ...
> +** sxth w0, w0
> +** fmov s0, w0
> +** ins v0\.h\[1\], w1
> +** ins v0\.h\[2\], w2
> +** ins v0\.h\[3\], w3
> +** ins v0\.h\[4\], w4
> +** ins v0\.h\[5\], w5
> +** ins v0\.h\[6\], w6
> +** ins v0\.h\[7\], w7
> +** ...
> +** ret
> +*/
> +
> +#include "vec-init-22.h"
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-22-speed.c b/gcc/testsuite/gcc.target/aarch64/vec-init-22-speed.c
> new file mode 100644
> index 00000000000..172d56ffdf1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-22-speed.c
> @@ -0,0 +1,27 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +/* Verify that we recursively generate code for even and odd halves
> +   instead of fallback code.  This is so despite the longer code-gen
> +   because it has fewer dependencies and thus has lesser cost.  */
> +
> +/*
> +** f_s16:
> +** ...
> +** sxth w0, w0
> +** sxth w1, w1
> +** fmov d0, x0
> +** fmov d1, x1
> +** ins v[0-9]+\.h\[1\], w2
> +** ins v[0-9]+\.h\[1\], w3
> +** ins v[0-9]+\.h\[2\], w4
> +** ins v[0-9]+\.h\[2\], w5
> +** ins v[0-9]+\.h\[3\], w6
> +** ins v[0-9]+\.h\[3\], w7
> +** zip1 v[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h
> +** ...
> +** ret
> +*/
> +
> +#include "vec-init-22.h"
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-22.h b/gcc/testsuite/gcc.target/aarch64/vec-init-22.h
> new file mode 100644
> index 00000000000..15b889d4097
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-22.h
> @@ -0,0 +1,7 @@
> +#include <arm_neon.h>
> +
> +int16x8_t f_s16 (int16_t x0, int16_t x1, int16_t x2, int16_t x3,
> +                 int16_t x4, int16_t x5, int16_t x6, int16_t x7)
> +{
> +  return (int16x8_t) { x0, x1, x2, x3, x4, x5, x6, x7 };
> +}