From: Richard Sandiford
To: Prathamesh Kulkarni
Cc: gcc Patches
Subject: Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector
Date: Fri, 03 Feb 2023 15:17:44 +0000
In-Reply-To: (Prathamesh Kulkarni's message of "Fri, 3 Feb 2023 08:32:46 +0530")

Prathamesh Kulkarni writes:
> On Fri, 3 Feb 2023 at 07:10, Prathamesh Kulkarni
> wrote:
>>
>> On Thu, 2 Feb 2023 at 20:50, Richard Sandiford
>> wrote:
>> >
>> > Prathamesh Kulkarni writes:
>> > >> >> > I have attached a patch that extends the transform if one half is dup
>> > >> >> > and other is set of constants.
>> > >> >> > For eg:
>> > >> >> > int8x16_t f(int8_t x)
>> > >> >> > {
>> > >> >> >   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4, x, 5, x, 6, x, 7, x, 8 };
>> > >> >> > }
>> > >> >> >
>> > >> >> > code-gen trunk:
>> > >> >> > f:
>> > >> >> > adrp x1, .LC0
>> > >> >> > ldr q0, [x1, #:lo12:.LC0]
>> > >> >> > ins v0.b[0], w0
>> > >> >> > ins v0.b[2], w0
>> > >> >> > ins v0.b[4], w0
>> > >> >> > ins v0.b[6], w0
>> > >> >> > ins v0.b[8], w0
>> > >> >> > ins v0.b[10], w0
>> > >> >> > ins v0.b[12], w0
>> > >> >> > ins v0.b[14], w0
>> > >> >> > ret
>> > >> >> >
>> > >> >> > code-gen with patch:
>> > >> >> > f:
>> > >> >> > dup v0.16b, w0
>> > >> >> > adrp x0, .LC0
>> > >> >> > ldr q1, [x0, #:lo12:.LC0]
>> > >> >> > zip1 v0.16b, v0.16b, v1.16b
>> > >> >> > ret
>> > >> >> >
>> > >> >> > Bootstrapped+tested on aarch64-linux-gnu.
>> > >> >> > Does it look OK ?
>> > >> >>
>> > >> >> Looks like a nice improvement.  It'll need to wait for GCC 14 now though.
>> > >> >>
>> > >> >> However, rather than handle this case specially, I think we should instead
>> > >> >> take a divide-and-conquer approach: split the initialiser into even and
>> > >> >> odd elements, find the best way of loading each part, then compare the
>> > >> >> cost of these sequences + ZIP with the cost of the fallback code (the code
>> > >> >> later in aarch64_expand_vector_init).
>> > >> >>
>> > >> >> For example, doing that would allow:
>> > >> >>
>> > >> >>   { x, y, 0, y, 0, y, 0, y, 0, y }
>> > >> >>
>> > >> >> to be loaded more easily, even though the even elements aren't wholly
>> > >> >> constant.
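To make that concrete, here is a small illustration of the even/odd
decomposition (the function name g and the 8-element form are mine,
since the 10-element initialiser above doesn't map directly to a single
Advanced SIMD mode):

  #include <arm_neon.h>

  /* Even elements { x, 0, 0, 0 } are almost constant, so they can be
     loaded cheaply on their own; odd elements { y, y, y, y } are a
     single dup; one zip1 then interleaves the two halves.  */
  int16x8_t g (int16_t x, int16_t y)
  {
    return (int16x8_t) { x, y, 0, y, 0, y, 0, y };
  }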
>> > >> > Hi Richard,
>> > >> > I have attached a prototype patch based on the above approach.
>> > >> > It subsumes specializing for above {x, y, x, y, x, y, x, y} case by generating
>> > >> > same sequence, thus I removed that hunk, and improves the following cases:
>> > >> >
>> > >> > (a)
>> > >> > int8x16_t f_s16(int8_t x)
>> > >> > {
>> > >> >   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4,
>> > >> >                        x, 5, x, 6, x, 7, x, 8 };
>> > >> > }
>> > >> >
>> > >> > code-gen trunk:
>> > >> > f_s16:
>> > >> > adrp x1, .LC0
>> > >> > ldr q0, [x1, #:lo12:.LC0]
>> > >> > ins v0.b[0], w0
>> > >> > ins v0.b[2], w0
>> > >> > ins v0.b[4], w0
>> > >> > ins v0.b[6], w0
>> > >> > ins v0.b[8], w0
>> > >> > ins v0.b[10], w0
>> > >> > ins v0.b[12], w0
>> > >> > ins v0.b[14], w0
>> > >> > ret
>> > >> >
>> > >> > code-gen with patch:
>> > >> > f_s16:
>> > >> > dup v0.16b, w0
>> > >> > adrp x0, .LC0
>> > >> > ldr q1, [x0, #:lo12:.LC0]
>> > >> > zip1 v0.16b, v0.16b, v1.16b
>> > >> > ret
>> > >> >
>> > >> > (b)
>> > >> > int8x16_t f_s16(int8_t x, int8_t y)
>> > >> > {
>> > >> >   return (int8x16_t) { x, y, 1, y, 2, y, 3, y,
>> > >> >                        4, y, 5, y, 6, y, 7, y };
>> > >> > }
>> > >> >
>> > >> > code-gen trunk:
>> > >> > f_s16:
>> > >> > adrp x2, .LC0
>> > >> > ldr q0, [x2, #:lo12:.LC0]
>> > >> > ins v0.b[0], w0
>> > >> > ins v0.b[1], w1
>> > >> > ins v0.b[3], w1
>> > >> > ins v0.b[5], w1
>> > >> > ins v0.b[7], w1
>> > >> > ins v0.b[9], w1
>> > >> > ins v0.b[11], w1
>> > >> > ins v0.b[13], w1
>> > >> > ins v0.b[15], w1
>> > >> > ret
>> > >> >
>> > >> > code-gen patch:
>> > >> > f_s16:
>> > >> > adrp x2, .LC0
>> > >> > dup v1.16b, w1
>> > >> > ldr q0, [x2, #:lo12:.LC0]
>> > >> > ins v0.b[0], w0
>> > >> > zip1 v0.16b, v0.16b, v1.16b
>> > >> > ret
>> > >>
>> > >> Nice.
>> > >>
>> > >> > There are a couple of issues I have come across:
>> > >> > (1) Choosing element to pad vector.
>> > >> > For eg, if we are initializing a vector say { x, y, 0, y, 1, y, 2, y }
>> > >> > with mode V8HI.
>> > >> > We split it into { x, 0, 1, 2 } and { y, y, y, y }.
>> > >> > However since the mode is V8HI, we would need to pad the above split vectors
>> > >> > with 4 more elements to match up to vector length.
>> > >> > For {x, 0, 1, 2} using any constant is the obvious choice while for {y, y, y, y}
>> > >> > using 'y' is the obvious choice thus making them:
>> > >> > {x, 0, 1, 2, 0, 0, 0, 0} and {y, y, y, y, y, y, y, y}
>> > >> > These would then be merged using zip1, which would discard the upper half
>> > >> > of both vectors.
>> > >> > Currently I encoded the above two heuristics in
>> > >> > aarch64_expand_vector_init_get_padded_elem:
>> > >> > (a) If split portion contains a constant, use the constant to pad the vector.
>> > >> > (b) If split portion only contains variables, then use the most
>> > >> > frequently repeating variable
>> > >> > to pad the vector.
>> > >> > I suppose tho this could be improved ?
>> > >>
>> > >> I think we should just build two 64-bit vectors (V4HIs) and use a subreg
>> > >> to fill the upper elements with undefined values.
>> > >>
>> > >> I suppose in principle we would have the same problem when splitting
>> > >> a 64-bit vector into 2 32-bit vectors, but it's probably better to punt
>> > >> on that for now.  Eventually it would be worth adding full support for
>> > >> 32-bit Advanced SIMD modes (with necessary restrictions for FP exceptions)
>> > >> but it's quite a big task.  The 128-bit to 64-bit split is the one that
>> > >> matters most.
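Roughly the kind of thing I mean by the subreg trick, as a sketch only
(the helper name is invented, not something from the patch):

  /* Build a 64-bit half in its own pseudo and then view it in the
     128-bit mode through a paradoxical subreg, so that the upper
     64 bits are simply undefined instead of being padded with values
     we have to choose.  */
  static rtx
  build_half_in_full_mode (machine_mode full_mode, rtx half_vals)
  {
    /* HALF_VALS is a PARALLEL in the 64-bit mode, e.g. V4HImode.  */
    rtx half = gen_reg_rtx (GET_MODE (half_vals));
    aarch64_expand_vector_init (half, half_vals);
    return gen_rtx_SUBREG (full_mode, half, 0);
  }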
>> > >>
>> > >> > (2) Setting cost for zip1:
>> > >> > Currently it returns 4 as cost for following zip1 insn:
>> > >> > (set (reg:V8HI 102)
>> > >> >     (unspec:V8HI [
>> > >> >             (reg:V8HI 103)
>> > >> >             (reg:V8HI 108)
>> > >> >         ] UNSPEC_ZIP1))
>> > >> > I am not sure if that's correct, or if not, what cost to use in this case
>> > >> > for zip1 ?
>> > >>
>> > >> TBH 4 seems a bit optimistic.  It's COSTS_N_INSNS (1), whereas the
>> > >> generic advsimd_vec_cost::permute_cost is 2 insns.  But the costs of
>> > >> inserts are probably underestimated to the same extent, so hopefully
>> > >> things work out.
>> > >>
>> > >> So it's probably best to accept the costs as they're currently given.
>> > >> Changing them would need extensive testing.
>> > >>
>> > >> However, one of the advantages of the split is that it allows the
>> > >> subvectors to be built in parallel.  When optimising for speed,
>> > >> it might make sense to take the maximum of the subsequence costs
>> > >> and add the cost of the zip to that.
>> > > Hi Richard,
>> > > Thanks for the suggestions.
>> > > In the attached patch, it recurses only if nelts == 16 to punt for 64
>> > > -> 32 bit split,
>> >
>> > It should be based on the size rather than the number of elements.
>> > The example we talked about above involved building V8HIs from two
>> > V4HIs, which is also valid.
>> Right, sorry got mixed up. The attached patch punts if vector_size == 64 by
>> resorting to fallback, which handles V8HI cases.
>> For eg:
>> int16x8_t f(int16_t x)
>> {
>>   return (int16x8_t) { x, 1, x, 2, x, 3, x, 4 };
>> }
>>
>> code-gen with patch:
>> f:
>> dup v0.4h, w0
>> adrp x0, .LC0
>> ldr d1, [x0, #:lo12:.LC0]
>> zip1 v0.8h, v0.8h, v1.8h
>> ret
>>
>> Just to clarify, we punt on 64 bit vector size, because there is no
>> 32-bit vector available,
>> to build 2 32-bit vectors for even and odd halves, and then "extend"
>> them with subreg ?

Right.  And if we want to fix that, I think the starting point would be
to add (general) 32-bit vector support first.

>> It also punts if n_elts < 8, because I am not sure
>> if it's profitable to do recursion+merging for 4 or lesser elements.
>> Does it look OK ?

Splitting { x, y, x, y } should at least be a size win over 4 individual
moves/inserts.  Possibly a speed win too if x and y are in general
registers.

So I think n_elts < 4 might be better.  If the costs get a case wrong,
we should fix the costs.

>> > > and uses std::max(even_init, odd_init) + insn_cost (zip1_insn) for
>> > > computing total cost of the sequence.
>> > >
>> > > So, for following case:
>> > > int8x16_t f_s8(int8_t x)
>> > > {
>> > >   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4,
>> > >                        x, 5, x, 6, x, 7, x, 8 };
>> > > }
>> > >
>> > > it now generates:
>> > > f_s16:
>> > > dup v0.8b, w0
>> > > adrp x0, .LC0
>> > > ldr d1, [x0, #:lo12:.LC0]
>> > > zip1 v0.16b, v0.16b, v1.16b
>> > > ret
>> > >
>> > > Which I assume is correct, since zip1 will merge the lower halves of
>> > > two vectors while leaving the upper halves undefined ?
>> >
>> > Yeah, it looks valid, but I would say that zip1 ignores the upper halves
>> > (rather than leaving them undefined).
>> Yes, sorry for mis-phrasing.
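On the costing itself, something along these lines is what I had in
mind; this is just a sketch with invented names, not the exact code in
the patch:

  /* The two halves can be built in parallel when optimising for speed,
     so combine their costs with max; under -Os the total instruction
     count is what matters, so add them instead.  */
  static void
  emit_cheaper_sequence (rtx_insn *split_seq, unsigned costs[2],
                         rtx_insn *zip1_insn,
                         rtx_insn *fallback_seq, unsigned fallback_cost)
  {
    unsigned split_cost = optimize_size
                          ? costs[0] + costs[1]
                          : std::max (costs[0], costs[1]);
    split_cost += insn_cost (zip1_insn, !optimize_size);

    /* Prefer the fallback in the event of a tie, since it will tend
       to use fewer registers.  */
    emit_insn (split_cost < fallback_cost ? split_seq : fallback_seq);
  }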
>>
>> For the following test:
>> int16x8_t f_s16 (int16_t x0, int16_t x1, int16_t x2, int16_t x3,
>>                  int16_t x4, int16_t x5, int16_t x6, int16_t x7)
>> {
>>   return (int16x8_t) { x0, x1, x2, x3, x4, x5, x6, x7 };
>> }
>>
>> it chose to go recursive+zip1 route since we take max (cost
>> (odd_init), cost (even_init)) and add
>> cost of zip1 insn which turns out to be lesser than cost of fallback:
>>
>> f_s16:
>> sxth w0, w0
>> sxth w1, w1
>> fmov d0, x0
>> fmov d1, x1
>> ins v0.h[1], w2
>> ins v1.h[1], w3
>> ins v0.h[2], w4
>> ins v1.h[2], w5
>> ins v0.h[3], w6
>> ins v1.h[3], w7
>> zip1 v0.8h, v0.8h, v1.8h
>> ret
>>
>> I assume that's OK since it has fewer dependencies compared to
>> fallback code-gen even if it's longer ?
>> With -Os the cost for sequence is taken as cost(odd_init) +
>> cost(even_init) + cost(zip1_insn)
>> which turns out to be same as cost for fallback sequence and it
>> generates the fallback code-sequence:
>>
>> f_s16:
>> sxth w0, w0
>> fmov s0, w0
>> ins v0.h[1], w1
>> ins v0.h[2], w2
>> ins v0.h[3], w3
>> ins v0.h[4], w4
>> ins v0.h[5], w5
>> ins v0.h[6], w6
>> ins v0.h[7], w7
>> ret
>>
> Forgot to remove the hunk handling interleaving case, done in the
> attached patch.
>
> Thanks,
> Prathamesh
>> Thanks,
>> Prathamesh
>> >
>> > Thanks,
>> > Richard
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index acc0cfe5f94..dd2a64d2e4e 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -21976,7 +21976,7 @@ aarch64_simd_make_constant (rtx vals)
>     initialised to contain VALS.  */
>  
>  void
> -aarch64_expand_vector_init (rtx target, rtx vals)
> +aarch64_expand_vector_init_fallback (rtx target, rtx vals)

The comment needs to be updated.  Maybe:

/* A subroutine of aarch64_expand_vector_init, with the same interface.
   The caller has already tried a divide-and-conquer approach, so do not
   consider that case here.  */

>  {
>    machine_mode mode = GET_MODE (target);
>    scalar_mode inner_mode = GET_MODE_INNER (mode);
> @@ -22036,38 +22036,6 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>        return;
>      }
>  
> -  /* Check for interleaving case.
> -     For eg if initializer is (int16x8_t) {x, y, x, y, x, y, x, y}.
> -     Generate following code:
> -     dup v0.h, x
> -     dup v1.h, y
> -     zip1 v0.h, v0.h, v1.h
> -     for "large enough" initializer.  */
> -
> -  if (n_elts >= 8)
> -    {
> -      int i;
> -      for (i = 2; i < n_elts; i++)
> -        if (!rtx_equal_p (XVECEXP (vals, 0, i), XVECEXP (vals, 0, i % 2)))
> -          break;
> -
> -      if (i == n_elts)
> -        {
> -          machine_mode mode = GET_MODE (target);
> -          rtx dest[2];
> -
> -          for (int i = 0; i < 2; i++)
> -            {
> -              rtx x = expand_vector_broadcast (mode, XVECEXP (vals, 0, i));
> -              dest[i] = force_reg (mode, x);
> -            }
> -
> -          rtvec v = gen_rtvec (2, dest[0], dest[1]);
> -          emit_set_insn (target, gen_rtx_UNSPEC (mode, v, UNSPEC_ZIP1));
> -          return;
> -        }
> -    }
> -
>    enum insn_code icode = optab_handler (vec_set_optab, mode);
>    gcc_assert (icode != CODE_FOR_nothing);
>  
> @@ -22189,7 +22157,7 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>          }
>        XVECEXP (copy, 0, i) = subst;
>      }
> -  aarch64_expand_vector_init (target, copy);
> +  aarch64_expand_vector_init_fallback (target, copy);
>  }
>  
>    /* Insert the variable lanes directly.  */
> @@ -22203,6 +22171,91 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>      }
>  }
>  
> +DEBUG_FUNCTION
> +static void
> +aarch64_expand_vector_init_debug_seq (rtx_insn *seq, const char *s)
> +{
> +  fprintf (stderr, "%s: %u\n", s, seq_cost (seq, !optimize_size));
> +  for (rtx_insn *i = seq; i; i = NEXT_INSN (i))
> +    {
> +      debug_rtx (PATTERN (i));
> +      fprintf (stderr, "cost: %d\n", pattern_cost (PATTERN (i), !optimize_size));
> +    }
> +}

I'm not sure we should commit this to the tree.

> +
> +static rtx
> +aarch64_expand_vector_init_split_vals (machine_mode mode, rtx vals, bool even_p)

How about calling this aarch64_unzip_vector_init?  It needs a function
comment.

> +{
> +  int n = XVECLEN (vals, 0);
> +  machine_mode new_mode
> +    = aarch64_simd_container_mode (GET_MODE_INNER (mode), 64);

IMO it would be better to use "GET_MODE_BITSIZE (mode).to_constant () / 2"
or "GET_MODE_UNIT_BITSIZE (mode) * n / 2" for the second argument.

> +  rtvec vec = rtvec_alloc (n / 2);
> +  for (int i = 0; i < n; i++)
> +    RTVEC_ELT (vec, i) = (even_p) ? XVECEXP (vals, 0, 2 * i)
> +                                  : XVECEXP (vals, 0, 2 * i + 1);
> +  return gen_rtx_PARALLEL (new_mode, vec);
> +}
> +
> +/*
> +The function does the following:
> +(a) Generates code sequence by splitting VALS into even and odd halves,
> +    and recursively calling itself to initialize them and then merge using
> +    zip1.
> +(b) Generate code sequence directly using aarch64_expand_vector_init_fallback.
> +(c) Compare the cost of code sequences generated by (a) and (b), and choose
> +    the more efficient one.
> +*/

I think we should keep the current description of the interface before
describing the implementation:

/* Expand a vector initialization sequence, such that TARGET is
   initialized to contain VALS.  */

(includes an s/s/z/).  And it's probably better to describe the
implementation inside the function.

Most comments are written in imperative style, so how about:

  /* Try decomposing the initializer into even and odd halves and
     then ZIP them together.  Use the resulting sequence if it is
     strictly cheaper than loading VALS directly.

     Prefer the fallback sequence in the event of a tie, since it
     will tend to use fewer registers.  */

> +
> +void
> +aarch64_expand_vector_init (rtx target, rtx vals)
> +{
> +  machine_mode mode = GET_MODE (target);
> +  int n_elts = XVECLEN (vals, 0);
> +
> +  if (n_elts < 8
> +      || known_eq (GET_MODE_BITSIZE (mode), 64))

Might be more robust to test

  maybe_ne (GET_MODE_BITSIZE (mode), 128)

> +    {
> +      aarch64_expand_vector_init_fallback (target, vals);
> +      return;
> +    }
> +
> +  start_sequence ();
> +  rtx dest[2];
> +  unsigned costs[2];
> +  for (int i = 0; i < 2; i++)
> +    {
> +      start_sequence ();
> +      dest[i] = gen_reg_rtx (mode);
> +      rtx new_vals
> +        = aarch64_expand_vector_init_split_vals (mode, vals, (i % 2) == 0);
> +      rtx tmp_reg = gen_reg_rtx (GET_MODE (new_vals));
> +      aarch64_expand_vector_init (tmp_reg, new_vals);
> +      dest[i] = gen_rtx_SUBREG (mode, tmp_reg, 0);

Maybe "src" or "halves" would be a better name than "dest", given that
the rtx isn't actually the destination of the subsequence.

> +      rtx_insn *rec_seq = get_insns ();
> +      end_sequence ();
> +      costs[i] = seq_cost (rec_seq, !optimize_size);
> +      emit_insn (rec_seq);
> +    }
> +
> +  rtvec v = gen_rtvec (2, dest[0], dest[1]);
> +  rtx_insn *zip1_insn
> +    = emit_set_insn (target, gen_rtx_UNSPEC (mode, v, UNSPEC_ZIP1));
> +  unsigned seq_total_cost
> +    = (!optimize_size) ? std::max (costs[0], costs[1]) : costs[0] + costs[1];

This is the wrong way round: max should be for speed and addition
for size.

Thanks,
Richard

> +  seq_total_cost += insn_cost (zip1_insn, !optimize_size);
> +
> +  rtx_insn *seq = get_insns ();
> +  end_sequence ();
> +
> +  start_sequence ();
> +  aarch64_expand_vector_init_fallback (target, vals);
> +  rtx_insn *fallback_seq = get_insns ();
> +  unsigned fallback_seq_cost = seq_cost (fallback_seq, !optimize_size);
> +  end_sequence ();
> +
> +  emit_insn (seq_total_cost < fallback_seq_cost ? seq : fallback_seq);
> +}
> +
>  /* Emit RTL corresponding to:
>     insr TARGET, ELEM.  */
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/interleave-init-1.c b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> similarity index 82%
> rename from gcc/testsuite/gcc.target/aarch64/interleave-init-1.c
> rename to gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> index ee775048589..e812d3946de 100644
> --- a/gcc/testsuite/gcc.target/aarch64/interleave-init-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> @@ -7,8 +7,8 @@
>  /*
>  ** foo:
>  ** ...
> -** dup v[0-9]+\.8h, w[0-9]+
> -** dup v[0-9]+\.8h, w[0-9]+
> +** dup v[0-9]+\.4h, w[0-9]+
> +** dup v[0-9]+\.4h, w[0-9]+
>  ** zip1 v[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h
>  ** ...
>  ** ret
> @@ -23,8 +23,8 @@ int16x8_t foo(int16_t x, int y)
>  /*
>  ** foo2:
>  ** ...
> -** dup v[0-9]+\.8h, w[0-9]+
> -** movi v[0-9]+\.8h, 0x1
> +** dup v[0-9]+\.4h, w[0-9]+
> +** movi v[0-9]+\.4h, 0x1
>  ** zip1 v[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h
>  ** ...
>  ** ret
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-19.c b/gcc/testsuite/gcc.target/aarch64/vec-init-19.c
> new file mode 100644
> index 00000000000..e28fdcda29d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-19.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <arm_neon.h>
> +
> +/*
> +** f_s8:
> +** ...
> +** dup v[0-9]+\.8b, w[0-9]+
> +** adrp x[0-9]+, \.LC[0-9]+
> +** ldr d[0-9]+, \[x[0-9]+, #:lo12:.LC[0-9]+\]
> +** zip1 v[0-9]+\.16b, v[0-9]+\.16b, v[0-9]+\.16b
> +** ret
> +*/
> +
> +int8x16_t f_s8(int8_t x)
> +{
> +  return (int8x16_t) { x, 1, x, 2, x, 3, x, 4,
> +                       x, 5, x, 6, x, 7, x, 8 };
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-20.c b/gcc/testsuite/gcc.target/aarch64/vec-init-20.c
> new file mode 100644
> index 00000000000..9366ca349b6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-20.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <arm_neon.h>
> +
> +/*
> +** f_s8:
> +** ...
> +** adrp x[0-9]+, \.LC[0-9]+
> +** dup v[0-9]+\.8b, w[0-9]+
> +** ldr d[0-9]+, \[x[0-9]+, #:lo12:\.LC[0-9]+\]
> +** ins v0\.b\[0\], w0
> +** zip1 v[0-9]+\.16b, v[0-9]+\.16b, v[0-9]+\.16b
> +** ret
> +*/
> +
> +int8x16_t f_s8(int8_t x, int8_t y)
> +{
> +  return (int8x16_t) { x, y, 1, y, 2, y, 3, y,
> +                       4, y, 5, y, 6, y, 7, y };
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-21.c b/gcc/testsuite/gcc.target/aarch64/vec-init-21.c
> new file mode 100644
> index 00000000000..e16459486d7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-21.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <arm_neon.h>
> +
> +/*
> +** f_s8:
> +** ...
> +** adrp x[0-9]+, \.LC[0-9]+
> +** ldr q[0-9]+, \[x[0-9]+, #:lo12:\.LC[0-9]+\]
> +** ins v0\.b\[0\], w0
> +** ins v0\.b\[1\], w1
> +** ...
> +** ret
> +*/
> +
> +int8x16_t f_s8(int8_t x, int8_t y)
> +{
> +  return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6,
> +                       7, 8, 9, 10, 11, 12, 13, 14 };
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-22-size.c b/gcc/testsuite/gcc.target/aarch64/vec-init-22-size.c
> new file mode 100644
> index 00000000000..8f35854c008
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-22-size.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Os" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +/* Verify that fallback code-sequence is chosen over
> +   recursively generated code-sequence merged with zip1.  */
> +
> +/*
> +** f_s16:
> +** ...
> +** sxth w0, w0
> +** fmov s0, w0
> +** ins v0\.h\[1\], w1
> +** ins v0\.h\[2\], w2
> +** ins v0\.h\[3\], w3
> +** ins v0\.h\[4\], w4
> +** ins v0\.h\[5\], w5
> +** ins v0\.h\[6\], w6
> +** ins v0\.h\[7\], w7
> +** ...
> +** ret
> +*/
> +
> +#include "vec-init-22.h"
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-22-speed.c b/gcc/testsuite/gcc.target/aarch64/vec-init-22-speed.c
> new file mode 100644
> index 00000000000..172d56ffdf1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-22-speed.c
> @@ -0,0 +1,27 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +/* Verify that we recursively generate code for even and odd halves
> +   instead of fallback code.  This is so despite the longer code-gen
> +   because it has fewer dependencies and thus has lesser cost.  */
> +
> +/*
> +** f_s16:
> +** ...
> +** sxth w0, w0
> +** sxth w1, w1
> +** fmov d0, x0
> +** fmov d1, x1
> +** ins v[0-9]+\.h\[1\], w2
> +** ins v[0-9]+\.h\[1\], w3
> +** ins v[0-9]+\.h\[2\], w4
> +** ins v[0-9]+\.h\[2\], w5
> +** ins v[0-9]+\.h\[3\], w6
> +** ins v[0-9]+\.h\[3\], w7
> +** zip1 v[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h
> +** ...
> +** ret
> +*/
> +
> +#include "vec-init-22.h"
> diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-22.h b/gcc/testsuite/gcc.target/aarch64/vec-init-22.h
> new file mode 100644
> index 00000000000..15b889d4097
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-22.h
> @@ -0,0 +1,7 @@
> +#include <arm_neon.h>
> +
> +int16x8_t f_s16 (int16_t x0, int16_t x1, int16_t x2, int16_t x3,
> +                 int16_t x4, int16_t x5, int16_t x6, int16_t x7)
> +{
> +  return (int16x8_t) { x0, x1, x2, x3, x4, x5, x6, x7 };
> +}