From: Richard Sandiford
To: Prathamesh Kulkarni
Cc: gcc Patches
Subject: Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector
Date: Mon, 06 Feb 2023 12:13:54 +0000
In-Reply-To: (Prathamesh Kulkarni's message of "Sat, 4 Feb 2023 12:19:23 +0530")

Prathamesh Kulkarni writes: > On Fri, 3 Feb 2023 at 20:47, Richard Sandiford > wrote: >> >> Prathamesh Kulkarni writes: >> > On Fri, 3 Feb 2023 at 07:10, Prathamesh Kulkarni >> > wrote: >> >> >> >> On Thu, 2 Feb 2023 at 20:50, Richard Sandiford >> >> wrote: >> >> > >> >> > Prathamesh Kulkarni writes: >> >> > >> >> > I have attached a patch that extends the transform if one half is dup >> >> > >> >> > and other is set of constants. >> >> > >> >> > For eg: >> >> > >> >> > int8x16_t f(int8_t x) >> >> > >> >> > { >> >> > >> >> > return (int8x16_t) { x, 1, x, 2, x, 3, x, 4, x, 5, x, 6, x, 7, x, 8 }; >> >> > >> >> > } >> >> > >> >> > >> >> > >> >> > code-gen trunk: >> >> > >> >> > f: >> >> > >> >> > adrp x1, .LC0 >> >> > >> >> > ldr q0, [x1, #:lo12:.LC0] >> >> > >> >> > ins v0.b[0], w0 >> >> > >> >> > ins v0.b[2], w0 >> >> > >> >> > ins v0.b[4], w0 >> >> > >> >> > ins v0.b[6], w0 >> >> > >> >> > ins v0.b[8], w0 >> >> > >> >> > ins v0.b[10], w0 >> >> > >> >> > ins v0.b[12], w0 >> >> > >> >> > ins v0.b[14], w0 >> >> > >> >> > ret >> >> > >> >> > >> >> > >> >> > code-gen with patch: >> >> > >> >> > f: >> >> > >> >> > dup v0.16b, w0 >> >> > >> >> > adrp x0, .LC0 >> >> > >> >> > ldr q1, [x0, #:lo12:.LC0] >> >> > >> >> > zip1 v0.16b, v0.16b, v1.16b >> >> > >> >> > ret >> >> > >> >> > >> >> > >> >> > Bootstrapped+tested on aarch64-linux-gnu. >> >> > >> >> > Does it look OK ? >> >> > >> >> >> >> > >> >> Looks like a nice improvement. It'll need to wait for GCC 14 now though. >> >> > >> >> >> >> > >> >> However, rather than handle this case specially, I think we should instead >> >> > >> >> take a divide-and-conquer approach: split the initialiser into even and >> >> > >> >> odd elements, find the best way of loading each part, then compare the >> >> > >> >> cost of these sequences + ZIP with the cost of the fallback code (the code >> >> > >> >> later in aarch64_expand_vector_init).
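
For reference, the dup + load + zip1 sequence being discussed corresponds roughly to the following at the ACLE intrinsics level (an illustrative sketch only, not the patch itself; the function name and the constant table below are made up here):

#include <arm_neon.h>

/* Illustrative sketch of the transform for { x, 1, x, 2, ..., x, 8 }:
   zip1 interleaves the low halves of its two inputs, so the even lanes
   come from the dup and the odd lanes from the constant-pool load.  */
int8x16_t f_sketch (int8_t x)
{
  static const int8_t c[16] = { 1, 2, 3, 4, 5, 6, 7, 8 };  /* only lanes 0-7 are read */
  int8x16_t evens = vdupq_n_s8 (x);                        /* dup  v0.16b, w0  */
  int8x16_t odds = vld1q_s8 (c);                           /* load of .LC0     */
  return vzip1q_s8 (evens, odds);                          /* zip1 v0.16b, v0.16b, v1.16b */
}
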
>> >> > >> >> >> >> > >> >> For example, doing that would allow: >> >> > >> >> >> >> > >> >> { x, y, 0, y, 0, y, 0, y, 0, y } >> >> > >> >> >> >> > >> >> to be loaded more easily, even though the even elements aren't wholly >> >> > >> >> constant. >> >> > >> > Hi Richard, >> >> > >> > I have attached a prototype patch based on the above approach. >> >> > >> > It subsumes specializing for above {x, y, x, y, x, y, x, y} case by generating >> >> > >> > same sequence, thus I removed that hunk, and improves the following cases: >> >> > >> > >> >> > >> > (a) >> >> > >> > int8x16_t f_s16(int8_t x) >> >> > >> > { >> >> > >> > return (int8x16_t) { x, 1, x, 2, x, 3, x, 4, >> >> > >> > x, 5, x, 6, x, 7, x, 8 }; >> >> > >> > } >> >> > >> > >> >> > >> > code-gen trunk: >> >> > >> > f_s16: >> >> > >> > adrp x1, .LC0 >> >> > >> > ldr q0, [x1, #:lo12:.LC0] >> >> > >> > ins v0.b[0], w0 >> >> > >> > ins v0.b[2], w0 >> >> > >> > ins v0.b[4], w0 >> >> > >> > ins v0.b[6], w0 >> >> > >> > ins v0.b[8], w0 >> >> > >> > ins v0.b[10], w0 >> >> > >> > ins v0.b[12], w0 >> >> > >> > ins v0.b[14], w0 >> >> > >> > ret >> >> > >> > >> >> > >> > code-gen with patch: >> >> > >> > f_s16: >> >> > >> > dup v0.16b, w0 >> >> > >> > adrp x0, .LC0 >> >> > >> > ldr q1, [x0, #:lo12:.LC0] >> >> > >> > zip1 v0.16b, v0.16b, v1.16b >> >> > >> > ret >> >> > >> > >> >> > >> > (b) >> >> > >> > int8x16_t f_s16(int8_t x, int8_t y) >> >> > >> > { >> >> > >> > return (int8x16_t) { x, y, 1, y, 2, y, 3, y, >> >> > >> > 4, y, 5, y, 6, y, 7, y }; >> >> > >> > } >> >> > >> > >> >> > >> > code-gen trunk: >> >> > >> > f_s16: >> >> > >> > adrp x2, .LC0 >> >> > >> > ldr q0, [x2, #:lo12:.LC0] >> >> > >> > ins v0.b[0], w0 >> >> > >> > ins v0.b[1], w1 >> >> > >> > ins v0.b[3], w1 >> >> > >> > ins v0.b[5], w1 >> >> > >> > ins v0.b[7], w1 >> >> > >> > ins v0.b[9], w1 >> >> > >> > ins v0.b[11], w1 >> >> > >> > ins v0.b[13], w1 >> >> > >> > ins v0.b[15], w1 >> >> > >> > ret >> >> > >> > >> >> > >> > code-gen patch: >> >> > >> > f_s16: >> >> > >> > adrp x2, .LC0 >> >> > >> > dup v1.16b, w1 >> >> > >> > ldr q0, [x2, #:lo12:.LC0] >> >> > >> > ins v0.b[0], w0 >> >> > >> > zip1 v0.16b, v0.16b, v1.16b >> >> > >> > ret >> >> > >> >> >> > >> Nice. >> >> > >> >> >> > >> > There are a couple of issues I have come across: >> >> > >> > (1) Choosing element to pad vector. >> >> > >> > For eg, if we are initiailizing a vector say { x, y, 0, y, 1, y, 2, y } >> >> > >> > with mode V8HI. >> >> > >> > We split it into { x, 0, 1, 2 } and { y, y, y, y} >> >> > >> > However since the mode is V8HI, we would need to pad the above split vectors >> >> > >> > with 4 more elements to match up to vector length. >> >> > >> > For {x, 0, 1, 2} using any constant is the obvious choice while for {y, y, y, y} >> >> > >> > using 'y' is the obvious choice thus making them: >> >> > >> > {x, 0, 1, 2, 0, 0, 0, 0} and {y, y, y, y, y, y, y, y} >> >> > >> > These would be then merged using zip1 which would discard the lower half >> >> > >> > of both vectors. >> >> > >> > Currently I encoded the above two heuristics in >> >> > >> > aarch64_expand_vector_init_get_padded_elem: >> >> > >> > (a) If split portion contains a constant, use the constant to pad the vector. >> >> > >> > (b) If split portion only contains variables, then use the most >> >> > >> > frequently repeating variable >> >> > >> > to pad the vector. >> >> > >> > I suppose tho this could be improved ? 
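
To make the zip1 semantics behind that question concrete, here is a rough intrinsics-level sketch of the split being described (illustrative only; the helper name and the zero padding below are arbitrary, since zip1 never reads the upper half of either input):

#include <arm_neon.h>

/* Illustrative sketch for (int16x8_t) { x, y, 0, y, 1, y, 2, y }:
   build the even and odd elements as 64-bit vectors, widen them, and
   interleave.  zip1 only reads the low half of each input, so the
   choice of padding for the upper half cannot matter.  */
int16x8_t split_sketch (int16_t x, int16_t y)
{
  int16x4_t even = { x, 0, 1, 2 };                          /* even elements */
  int16x4_t odd  = vdup_n_s16 (y);                          /* odd elements  */
  int16x8_t even_q = vcombine_s16 (even, vdup_n_s16 (0));   /* padding: don't-care */
  int16x8_t odd_q  = vcombine_s16 (odd, vdup_n_s16 (0));
  return vzip1q_s16 (even_q, odd_q);                        /* zip1 v0.8h, v0.8h, v1.8h */
}
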
>> >> > >> >> >> > >> I think we should just build two 64-bit vectors (V4HIs) and use a subreg >> >> > >> to fill the upper elements with undefined values. >> >> > >> >> >> > >> I suppose in principle we would have the same problem when splitting >> >> > >> a 64-bit vector into 2 32-bit vectors, but it's probably better to punt >> >> > >> on that for now. Eventually it would be worth adding full support for >> >> > >> 32-bit Advanced SIMD modes (with necessary restrictions for FP exceptions) >> >> > >> but it's quite a big task. The 128-bit to 64-bit split is the one that >> >> > >> matters most. >> >> > >> >> >> > >> > (2) Setting cost for zip1: >> >> > >> > Currently it returns 4 as cost for following zip1 insn: >> >> > >> > (set (reg:V8HI 102) >> >> > >> > (unspec:V8HI [ >> >> > >> > (reg:V8HI 103) >> >> > >> > (reg:V8HI 108) >> >> > >> > ] UNSPEC_ZIP1)) >> >> > >> > I am not sure if that's correct, or if not, what cost to use in this case >> >> > >> > for zip1 ? >> >> > >> >> >> > >> TBH 4 seems a bit optimistic. It's COSTS_N_INSNS (1), whereas the >> >> > >> generic advsimd_vec_cost::permute_cost is 2 insns. But the costs of >> >> > >> inserts are probably underestimated to the same extent, so hopefully >> >> > >> things work out. >> >> > >> >> >> > >> So it's probably best to accept the costs as they're currently given. >> >> > >> Changing them would need extensive testing. >> >> > >> >> >> > >> However, one of the advantages of the split is that it allows the >> >> > >> subvectors to be built in parallel. When optimising for speed, >> >> > >> it might make sense to take the maximum of the subsequence costs >> >> > >> and add the cost of the zip to that. >> >> > > Hi Richard, >> >> > > Thanks for the suggestions. >> >> > > In the attached patch, it recurses only if nelts == 16 to punt for 64 >> >> > > -> 32 bit split, >> >> > >> >> > It should be based on the size rather than the number of elements. >> >> > The example we talked about above involved building V8HIs from two >> >> > V4HIs, which is also valid. >> >> Right, sorry got mixed up. The attached patch punts if vector_size == 64 by >> >> resorting to fallback, which handles V8HI cases. >> >> For eg: >> >> int16x8_t f(int16_t x) >> >> { >> >> return (int16x8_t) { x, 1, x, 2, x, 3, x, 4 }; >> >> } >> >> >> >> code-gen with patch: >> >> f: >> >> dup v0.4h, w0 >> >> adrp x0, .LC0 >> >> ldr d1, [x0, #:lo12:.LC0] >> >> zip1 v0.8h, v0.8h, v1.8h >> >> ret >> >> >> >> Just to clarify, we punt on 64 bit vector size, because there is no >> >> 32-bit vector available, >> >> to build 2 32-bit vectors for even and odd halves, and then "extend" >> >> them with subreg ? >> >> Right. And if we want to fix that, I think the starting point would >> be to add (general) 32-bit vector support first. >> >> >> It also punts if n_elts < 8, because I am not sure >> >> if it's profitable to do recursion+merging for 4 or lesser elements. >> >> Does it look OK ? >> >> Splitting { x, y, x, y } should at least be a size win over 4 individual >> moves/inserts. Possibly a speed win too if x and y are in general >> registers. >> >> So I think n_elts < 4 might be better. If the costs get a case wrong, >> we should fix the costs. >> >> >> > > and uses std::max(even_init, odd_init) + insn_cost (zip1_insn) for >> >> > > computing total cost of the sequence. 
>> >> > > >> >> > > So, for following case: >> >> > > int8x16_t f_s8(int8_t x) >> >> > > { >> >> > > return (int8x16_t) { x, 1, x, 2, x, 3, x, 4, >> >> > > x, 5, x, 6, x, 7, x, 8 }; >> >> > > } >> >> > > >> >> > > it now generates: >> >> > > f_s16: >> >> > > dup v0.8b, w0 >> >> > > adrp x0, .LC0 >> >> > > ldr d1, [x0, #:lo12:.LC0] >> >> > > zip1 v0.16b, v0.16b, v1.16b >> >> > > ret >> >> > > >> >> > > Which I assume is correct, since zip1 will merge the lower halves of >> >> > > two vectors while leaving the upper halves undefined ? >> >> > >> >> > Yeah, it looks valid, but I would say that zip1 ignores the upper halves >> >> > (rather than leaving them undefined). >> >> Yes, sorry for mis-phrasing. >> >> >> >> For the following test: >> >> int16x8_t f_s16 (int16_t x0, int16_t x1, int16_t x2, int16_t x3, >> >> int16_t x4, int16_t x5, int16_t x6, int16_t x7) >> >> { >> >> return (int16x8_t) { x0, x1, x2, x3, x4, x5, x6, x7 }; >> >> } >> >> >> >> it chose to go recursive+zip1 route since we take max (cost >> >> (odd_init), cost (even_init)) and add >> >> cost of zip1 insn which turns out to be lesser than cost of fallback: >> >> >> >> f_s16: >> >> sxth w0, w0 >> >> sxth w1, w1 >> >> fmov d0, x0 >> >> fmov d1, x1 >> >> ins v0.h[1], w2 >> >> ins v1.h[1], w3 >> >> ins v0.h[2], w4 >> >> ins v1.h[2], w5 >> >> ins v0.h[3], w6 >> >> ins v1.h[3], w7 >> >> zip1 v0.8h, v0.8h, v1.8h >> >> ret >> >> >> >> I assume that's OK since it has fewer dependencies compared to >> >> fallback code-gen even if it's longer ? >> >> With -Os the cost for sequence is taken as cost(odd_init) + >> >> cost(even_init) + cost(zip1_insn) >> >> which turns out to be same as cost for fallback sequence and it >> >> generates the fallback code-sequence: >> >> >> >> f_s16: >> >> sxth w0, w0 >> >> fmov s0, w0 >> >> ins v0.h[1], w1 >> >> ins v0.h[2], w2 >> >> ins v0.h[3], w3 >> >> ins v0.h[4], w4 >> >> ins v0.h[5], w5 >> >> ins v0.h[6], w6 >> >> ins v0.h[7], w7 >> >> ret >> >> >> > Forgot to remove the hunk handling interleaving case, done in the >> > attached patch. >> > >> > Thanks, >> > Prathamesh >> >> Thanks, >> >> Prathamesh >> >> > >> >> > Thanks, >> >> > Richard >> > >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc >> > index acc0cfe5f94..dd2a64d2e4e 100644 >> > --- a/gcc/config/aarch64/aarch64.cc >> > +++ b/gcc/config/aarch64/aarch64.cc >> > @@ -21976,7 +21976,7 @@ aarch64_simd_make_constant (rtx vals) >> > initialised to contain VALS. */ >> > >> > void >> > -aarch64_expand_vector_init (rtx target, rtx vals) >> > +aarch64_expand_vector_init_fallback (rtx target, rtx vals) >> >> The comment needs to be updated. Maybe: >> >> /* A subroutine of aarch64_expand_vector_init, with the same interface. >> The caller has already tried a divide-and-conquer approach, so do >> not consider that case here. */ >> >> > { >> > machine_mode mode = GET_MODE (target); >> > scalar_mode inner_mode = GET_MODE_INNER (mode); >> > @@ -22036,38 +22036,6 @@ aarch64_expand_vector_init (rtx target, rtx vals) >> > return; >> > } >> > >> > - /* Check for interleaving case. >> > - For eg if initializer is (int16x8_t) {x, y, x, y, x, y, x, y}. >> > - Generate following code: >> > - dup v0.h, x >> > - dup v1.h, y >> > - zip1 v0.h, v0.h, v1.h >> > - for "large enough" initializer. 
*/ >> > - >> > - if (n_elts >= 8) >> > - { >> > - int i; >> > - for (i = 2; i < n_elts; i++) >> > - if (!rtx_equal_p (XVECEXP (vals, 0, i), XVECEXP (vals, 0, i % 2))) >> > - break; >> > - >> > - if (i == n_elts) >> > - { >> > - machine_mode mode = GET_MODE (target); >> > - rtx dest[2]; >> > - >> > - for (int i = 0; i < 2; i++) >> > - { >> > - rtx x = expand_vector_broadcast (mode, XVECEXP (vals, 0, i)); >> > - dest[i] = force_reg (mode, x); >> > - } >> > - >> > - rtvec v = gen_rtvec (2, dest[0], dest[1]); >> > - emit_set_insn (target, gen_rtx_UNSPEC (mode, v, UNSPEC_ZIP1)); >> > - return; >> > - } >> > - } >> > - >> > enum insn_code icode = optab_handler (vec_set_optab, mode); >> > gcc_assert (icode != CODE_FOR_nothing); >> > >> > @@ -22189,7 +22157,7 @@ aarch64_expand_vector_init (rtx target, rtx vals) >> > } >> > XVECEXP (copy, 0, i) = subst; >> > } >> > - aarch64_expand_vector_init (target, copy); >> > + aarch64_expand_vector_init_fallback (target, copy); >> > } >> > >> > /* Insert the variable lanes directly. */ >> > @@ -22203,6 +22171,91 @@ aarch64_expand_vector_init (rtx target, rtx vals) >> > } >> > } >> > >> > +DEBUG_FUNCTION >> > +static void >> > +aarch64_expand_vector_init_debug_seq (rtx_insn *seq, const char *s) >> > +{ >> > + fprintf (stderr, "%s: %u\n", s, seq_cost (seq, !optimize_size)); >> > + for (rtx_insn *i = seq; i; i = NEXT_INSN (i)) >> > + { >> > + debug_rtx (PATTERN (i)); >> > + fprintf (stderr, "cost: %d\n", pattern_cost (PATTERN (i), !optimize_size)); >> > + } >> > +} >> >> I'm not sure we should commit this to the tree. >> >> > + >> > +static rtx >> > +aarch64_expand_vector_init_split_vals (machine_mode mode, rtx vals, bool even_p) >> >> How about calling this aarch64_unzip_vector_init? It needs a function >> comment. >> >> > +{ >> > + int n = XVECLEN (vals, 0); >> > + machine_mode new_mode >> > + = aarch64_simd_container_mode (GET_MODE_INNER (mode), 64); >> >> IMO it would be better to use "GET_MODE_BITSIZE (mode).to_constant () / 2" >> or "GET_MODE_UNIT_BITSIZE (mode) * n / 2" for the second argument. >> >> > + rtvec vec = rtvec_alloc (n / 2); >> > + for (int i = 0; i < n; i++) >> > + RTVEC_ELT (vec, i) = (even_p) ? XVECEXP (vals, 0, 2 * i) >> > + : XVECEXP (vals, 0, 2 * i + 1); >> > + return gen_rtx_PARALLEL (new_mode, vec); >> > +} >> > + >> > +/* >> > +The function does the following: >> > +(a) Generates code sequence by splitting VALS into even and odd halves, >> > + and recursively calling itself to initialize them and then merge using >> > + zip1. >> > +(b) Generate code sequence directly using aarch64_expand_vector_init_fallback. >> > +(c) Compare the cost of code sequences generated by (a) and (b), and choose >> > + the more efficient one. >> > +*/ >> >> I think we should keep the current description of the interface, >> before the describing the implementation: >> >> /* Expand a vector initialization sequence, such that TARGET is >> initialized to contain VALS. */ >> >> (includes an s/s/z/). >> >> And it's probably better to describe the implementation inside >> the function. >> >> Most comments are written in imperative style, so how about: >> >> /* Try decomposing the initializer into even and odd halves and >> then ZIP them together. Use the resulting sequence if it is >> strictly cheaper than loading VALS directly. >> >> Prefer the fallback sequence in the event of a tie, since it >> will tend to use fewer registers. 
*/ >> >> > + >> > +void >> > +aarch64_expand_vector_init (rtx target, rtx vals) >> > +{ >> > + machine_mode mode = GET_MODE (target); >> > + int n_elts = XVECLEN (vals, 0); >> > + >> > + if (n_elts < 8 >> > + || known_eq (GET_MODE_BITSIZE (mode), 64)) >> >> Might be more robust to test maybe_ne (GET_MODE_BITSIZE (mode), 128) >> >> > + { >> > + aarch64_expand_vector_init_fallback (target, vals); >> > + return; >> > + } >> > + >> > + start_sequence (); >> > + rtx dest[2]; >> > + unsigned costs[2]; >> > + for (int i = 0; i < 2; i++) >> > + { >> > + start_sequence (); >> > + dest[i] = gen_reg_rtx (mode); >> > + rtx new_vals >> > + = aarch64_expand_vector_init_split_vals (mode, vals, (i % 2) == 0); >> > + rtx tmp_reg = gen_reg_rtx (GET_MODE (new_vals)); >> > + aarch64_expand_vector_init (tmp_reg, new_vals); >> > + dest[i] = gen_rtx_SUBREG (mode, tmp_reg, 0); >> >> Maybe "src" or "halves" would be a better name than "dest", given that >> the rtx isn't actually the destination of the subsequence. >> >> > + rtx_insn *rec_seq = get_insns (); >> > + end_sequence (); >> > + costs[i] = seq_cost (rec_seq, !optimize_size); >> > + emit_insn (rec_seq); >> > + } >> > + >> > + rtvec v = gen_rtvec (2, dest[0], dest[1]); >> > + rtx_insn *zip1_insn >> > + = emit_set_insn (target, gen_rtx_UNSPEC (mode, v, UNSPEC_ZIP1)); >> > + unsigned seq_total_cost >> > + = (!optimize_size) ? std::max (costs[0], costs[1]) : costs[0] + costs[1]; >> >> This is the wrong way round: max should be for speed and addition >> for size. > I assumed, !optimize_size meant optimizing for speed ? > So (!optimize_size) ? std::max (costs[0] ,costs[1]) : costs[0] + costs[1] > would imply taking max of the two for speed and addition for size, or > am I misunderstanding ? Ah, sorry, I misread. But IMO it would be more natural as: optimize_size ? ... : ...; > I have done rest of the changes in attached patch. > > Thanks, > Prathamesh >> >> Thanks, >> Richard >> >> > + seq_total_cost += insn_cost (zip1_insn, !optimize_size); >> > + >> > + rtx_insn *seq = get_insns (); >> > + end_sequence (); >> > + >> > + start_sequence (); >> > + aarch64_expand_vector_init_fallback (target, vals); >> > + rtx_insn *fallback_seq = get_insns (); >> > + unsigned fallback_seq_cost = seq_cost (fallback_seq, !optimize_size); >> > + end_sequence (); >> > + >> > + emit_insn (seq_total_cost < fallback_seq_cost ? seq : fallback_seq); >> > +} >> > + >> > /* Emit RTL corresponding to: >> > insr TARGET, ELEM. */ >> > >> > diff --git a/gcc/testsuite/gcc.target/aarch64/interleave-init-1.c b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c >> > similarity index 82% >> > rename from gcc/testsuite/gcc.target/aarch64/interleave-init-1.c >> > rename to gcc/testsuite/gcc.target/aarch64/vec-init-18.c >> > index ee775048589..e812d3946de 100644 >> > --- a/gcc/testsuite/gcc.target/aarch64/interleave-init-1.c >> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c >> > @@ -7,8 +7,8 @@ >> > /* >> > ** foo: >> > ** ... >> > -** dup v[0-9]+\.8h, w[0-9]+ >> > -** dup v[0-9]+\.8h, w[0-9]+ >> > +** dup v[0-9]+\.4h, w[0-9]+ >> > +** dup v[0-9]+\.4h, w[0-9]+ >> > ** zip1 v[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h >> > ** ... >> > ** ret >> > @@ -23,8 +23,8 @@ int16x8_t foo(int16_t x, int y) >> > /* >> > ** foo2: >> > ** ... >> > -** dup v[0-9]+\.8h, w[0-9]+ >> > -** movi v[0-9]+\.8h, 0x1 >> > +** dup v[0-9]+\.4h, w[0-9]+ >> > +** movi v[0-9]+\.4h, 0x1 >> > ** zip1 v[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h >> > ** ... 
>> > ** ret >> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-19.c b/gcc/testsuite/gcc.target/aarch64/vec-init-19.c >> > new file mode 100644 >> > index 00000000000..e28fdcda29d >> > --- /dev/null >> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-19.c >> > @@ -0,0 +1,21 @@ >> > +/* { dg-do compile } */ >> > +/* { dg-options "-O3" } */ >> > +/* { dg-final { check-function-bodies "**" "" "" } } */ >> > + >> > +#include >> > + >> > +/* >> > +** f_s8: >> > +** ... >> > +** dup v[0-9]+\.8b, w[0-9]+ >> > +** adrp x[0-9]+, \.LC[0-9]+ >> > +** ldr d[0-9]+, \[x[0-9]+, #:lo12:.LC[0-9]+\] >> > +** zip1 v[0-9]+\.16b, v[0-9]+\.16b, v[0-9]+\.16b >> > +** ret >> > +*/ >> > + >> > +int8x16_t f_s8(int8_t x) >> > +{ >> > + return (int8x16_t) { x, 1, x, 2, x, 3, x, 4, >> > + x, 5, x, 6, x, 7, x, 8 }; >> > +} >> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-20.c b/gcc/testsuite/gcc.target/aarch64/vec-init-20.c >> > new file mode 100644 >> > index 00000000000..9366ca349b6 >> > --- /dev/null >> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-20.c >> > @@ -0,0 +1,22 @@ >> > +/* { dg-do compile } */ >> > +/* { dg-options "-O3" } */ >> > +/* { dg-final { check-function-bodies "**" "" "" } } */ >> > + >> > +#include >> > + >> > +/* >> > +** f_s8: >> > +** ... >> > +** adrp x[0-9]+, \.LC[0-9]+ >> > +** dup v[0-9]+\.8b, w[0-9]+ >> > +** ldr d[0-9]+, \[x[0-9]+, #:lo12:\.LC[0-9]+\] >> > +** ins v0\.b\[0\], w0 >> > +** zip1 v[0-9]+\.16b, v[0-9]+\.16b, v[0-9]+\.16b >> > +** ret >> > +*/ >> > + >> > +int8x16_t f_s8(int8_t x, int8_t y) >> > +{ >> > + return (int8x16_t) { x, y, 1, y, 2, y, 3, y, >> > + 4, y, 5, y, 6, y, 7, y }; >> > +} >> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-21.c b/gcc/testsuite/gcc.target/aarch64/vec-init-21.c >> > new file mode 100644 >> > index 00000000000..e16459486d7 >> > --- /dev/null >> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-21.c >> > @@ -0,0 +1,22 @@ >> > +/* { dg-do compile } */ >> > +/* { dg-options "-O3" } */ >> > +/* { dg-final { check-function-bodies "**" "" "" } } */ >> > + >> > +#include >> > + >> > +/* >> > +** f_s8: >> > +** ... >> > +** adrp x[0-9]+, \.LC[0-9]+ >> > +** ldr q[0-9]+, \[x[0-9]+, #:lo12:\.LC[0-9]+\] >> > +** ins v0\.b\[0\], w0 >> > +** ins v0\.b\[1\], w1 >> > +** ... >> > +** ret >> > +*/ >> > + >> > +int8x16_t f_s8(int8_t x, int8_t y) >> > +{ >> > + return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6, >> > + 7, 8, 9, 10, 11, 12, 13, 14 }; >> > +} >> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-22-size.c b/gcc/testsuite/gcc.target/aarch64/vec-init-22-size.c >> > new file mode 100644 >> > index 00000000000..8f35854c008 >> > --- /dev/null >> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-22-size.c >> > @@ -0,0 +1,24 @@ >> > +/* { dg-do compile } */ >> > +/* { dg-options "-Os" } */ >> > +/* { dg-final { check-function-bodies "**" "" "" } } */ >> > + >> > +/* Verify that fallback code-sequence is chosen over >> > + recursively generated code-sequence merged with zip1. */ >> > + >> > +/* >> > +** f_s16: >> > +** ... >> > +** sxth w0, w0 >> > +** fmov s0, w0 >> > +** ins v0\.h\[1\], w1 >> > +** ins v0\.h\[2\], w2 >> > +** ins v0\.h\[3\], w3 >> > +** ins v0\.h\[4\], w4 >> > +** ins v0\.h\[5\], w5 >> > +** ins v0\.h\[6\], w6 >> > +** ins v0\.h\[7\], w7 >> > +** ... 
>> > +** ret >> > +*/ >> > + >> > +#include "vec-init-22.h" >> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-22-speed.c b/gcc/testsuite/gcc.target/aarch64/vec-init-22-speed.c >> > new file mode 100644 >> > index 00000000000..172d56ffdf1 >> > --- /dev/null >> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-22-speed.c >> > @@ -0,0 +1,27 @@ >> > +/* { dg-do compile } */ >> > +/* { dg-options "-O3" } */ >> > +/* { dg-final { check-function-bodies "**" "" "" } } */ >> > + >> > +/* Verify that we recursively generate code for even and odd halves >> > + instead of fallback code. This is so despite the longer code-gen >> > + because it has fewer dependencies and thus has lesser cost. */ >> > + >> > +/* >> > +** f_s16: >> > +** ... >> > +** sxth w0, w0 >> > +** sxth w1, w1 >> > +** fmov d0, x0 >> > +** fmov d1, x1 >> > +** ins v[0-9]+\.h\[1\], w2 >> > +** ins v[0-9]+\.h\[1\], w3 >> > +** ins v[0-9]+\.h\[2\], w4 >> > +** ins v[0-9]+\.h\[2\], w5 >> > +** ins v[0-9]+\.h\[3\], w6 >> > +** ins v[0-9]+\.h\[3\], w7 >> > +** zip1 v[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h >> > +** ... >> > +** ret >> > +*/ >> > + >> > +#include "vec-init-22.h" >> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-22.h b/gcc/testsuite/gcc.target/aarch64/vec-init-22.h >> > new file mode 100644 >> > index 00000000000..15b889d4097 >> > --- /dev/null >> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-22.h >> > @@ -0,0 +1,7 @@ >> > +#include >> > + >> > +int16x8_t f_s16 (int16_t x0, int16_t x1, int16_t x2, int16_t x3, >> > + int16_t x4, int16_t x5, int16_t x6, int16_t x7) >> > +{ >> > + return (int16x8_t) { x0, x1, x2, x3, x4, x5, x6, x7 }; >> > +} > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index acc0cfe5f94..94cc4338678 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -21972,11 +21972,12 @@ aarch64_simd_make_constant (rtx vals) > return NULL_RTX; > } > > -/* Expand a vector initialisation sequence, such that TARGET is > - initialised to contain VALS. */ > +/* A subroutine of aarch64_expand_vector_init, with the same interface. > + The caller has already tried a divide-and-conquer approach, so do > + not consider that case here. */ > > void > -aarch64_expand_vector_init (rtx target, rtx vals) > +aarch64_expand_vector_init_fallback (rtx target, rtx vals) > { > machine_mode mode = GET_MODE (target); > scalar_mode inner_mode = GET_MODE_INNER (mode); > @@ -22036,38 +22037,6 @@ aarch64_expand_vector_init (rtx target, rtx vals) > return; > } > > - /* Check for interleaving case. > - For eg if initializer is (int16x8_t) {x, y, x, y, x, y, x, y}. > - Generate following code: > - dup v0.h, x > - dup v1.h, y > - zip1 v0.h, v0.h, v1.h > - for "large enough" initializer. 
*/ > - > - if (n_elts >= 8) > - { > - int i; > - for (i = 2; i < n_elts; i++) > - if (!rtx_equal_p (XVECEXP (vals, 0, i), XVECEXP (vals, 0, i % 2))) > - break; > - > - if (i == n_elts) > - { > - machine_mode mode = GET_MODE (target); > - rtx dest[2]; > - > - for (int i = 0; i < 2; i++) > - { > - rtx x = expand_vector_broadcast (mode, XVECEXP (vals, 0, i)); > - dest[i] = force_reg (mode, x); > - } > - > - rtvec v = gen_rtvec (2, dest[0], dest[1]); > - emit_set_insn (target, gen_rtx_UNSPEC (mode, v, UNSPEC_ZIP1)); > - return; > - } > - } > - > enum insn_code icode = optab_handler (vec_set_optab, mode); > gcc_assert (icode != CODE_FOR_nothing); > > @@ -22189,7 +22158,7 @@ aarch64_expand_vector_init (rtx target, rtx vals) > } > XVECEXP (copy, 0, i) = subst; > } > - aarch64_expand_vector_init (target, copy); > + aarch64_expand_vector_init_fallback (target, copy); > } > > /* Insert the variable lanes directly. */ > @@ -22203,6 +22172,81 @@ aarch64_expand_vector_init (rtx target, rtx vals) > } > } > > +/* Return even or odd half of VALS depending on EVEN_P. */ > + > +static rtx > +aarch64_unzip_vector_init (machine_mode mode, rtx vals, bool even_p) > +{ > + int n = XVECLEN (vals, 0); > + machine_mode new_mode > + = aarch64_simd_container_mode (GET_MODE_INNER (mode), > + GET_MODE_BITSIZE (mode).to_constant () / 2); > + rtvec vec = rtvec_alloc (n / 2); > + for (int i = 0; i < n; i++) > + RTVEC_ELT (vec, i) = (even_p) ? XVECEXP (vals, 0, 2 * i) > + : XVECEXP (vals, 0, 2 * i + 1); > + return gen_rtx_PARALLEL (new_mode, vec); > +} > + > +/* Expand a vector initialisation sequence, such that TARGET is initialization It would be good to add -fno-schedule-insns -fno-schedule-insns2 to the tests' dg-options (or use -O instead of -O3 for the -O3 tests, if that works). OK for stage 1 with those changes, thanks. Richard > + initialized to contain VALS. */ > + > +void > +aarch64_expand_vector_init (rtx target, rtx vals) > +{ > + /* Try decomposing the initializer into even and odd halves and > + then ZIP them together. Use the resulting sequence if it is > + strictly cheaper than loading VALS directly. > + > + Prefer the fallback sequence in the event of a tie, since it > + will tend to use fewer registers. */ > + > + machine_mode mode = GET_MODE (target); > + int n_elts = XVECLEN (vals, 0); > + > + if (n_elts < 4 > + || maybe_ne (GET_MODE_BITSIZE (mode), 128)) > + { > + aarch64_expand_vector_init_fallback (target, vals); > + return; > + } > + > + start_sequence (); > + rtx halves[2]; > + unsigned costs[2]; > + for (int i = 0; i < 2; i++) > + { > + start_sequence (); > + rtx new_vals > + = aarch64_unzip_vector_init (mode, vals, (i % 2) == 0); > + rtx tmp_reg = gen_reg_rtx (GET_MODE (new_vals)); > + aarch64_expand_vector_init (tmp_reg, new_vals); > + halves[i] = gen_rtx_SUBREG (mode, tmp_reg, 0); > + rtx_insn *rec_seq = get_insns (); > + end_sequence (); > + costs[i] = seq_cost (rec_seq, !optimize_size); > + emit_insn (rec_seq); > + } > + > + rtvec v = gen_rtvec (2, halves[0], halves[1]); > + rtx_insn *zip1_insn > + = emit_set_insn (target, gen_rtx_UNSPEC (mode, v, UNSPEC_ZIP1)); > + unsigned seq_total_cost > + = (!optimize_size) ? 
std::max (costs[0], costs[1]) : costs[0] + costs[1]; > + seq_total_cost += insn_cost (zip1_insn, !optimize_size); > + > + rtx_insn *seq = get_insns (); > + end_sequence (); > + > + start_sequence (); > + aarch64_expand_vector_init_fallback (target, vals); > + rtx_insn *fallback_seq = get_insns (); > + unsigned fallback_seq_cost = seq_cost (fallback_seq, !optimize_size); > + end_sequence (); > + > + emit_insn (seq_total_cost < fallback_seq_cost ? seq : fallback_seq); > +} > + > /* Emit RTL corresponding to: > insr TARGET, ELEM. */ > > diff --git a/gcc/testsuite/gcc.target/aarch64/interleave-init-1.c b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c > similarity index 82% > rename from gcc/testsuite/gcc.target/aarch64/interleave-init-1.c > rename to gcc/testsuite/gcc.target/aarch64/vec-init-18.c > index ee775048589..e812d3946de 100644 > --- a/gcc/testsuite/gcc.target/aarch64/interleave-init-1.c > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c > @@ -7,8 +7,8 @@ > /* > ** foo: > ** ... > -** dup v[0-9]+\.8h, w[0-9]+ > -** dup v[0-9]+\.8h, w[0-9]+ > +** dup v[0-9]+\.4h, w[0-9]+ > +** dup v[0-9]+\.4h, w[0-9]+ > ** zip1 v[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h > ** ... > ** ret > @@ -23,8 +23,8 @@ int16x8_t foo(int16_t x, int y) > /* > ** foo2: > ** ... > -** dup v[0-9]+\.8h, w[0-9]+ > -** movi v[0-9]+\.8h, 0x1 > +** dup v[0-9]+\.4h, w[0-9]+ > +** movi v[0-9]+\.4h, 0x1 > ** zip1 v[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h > ** ... > ** ret > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-19.c b/gcc/testsuite/gcc.target/aarch64/vec-init-19.c > new file mode 100644 > index 00000000000..e28fdcda29d > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-19.c > @@ -0,0 +1,21 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O3" } */ > +/* { dg-final { check-function-bodies "**" "" "" } } */ > + > +#include > + > +/* > +** f_s8: > +** ... > +** dup v[0-9]+\.8b, w[0-9]+ > +** adrp x[0-9]+, \.LC[0-9]+ > +** ldr d[0-9]+, \[x[0-9]+, #:lo12:.LC[0-9]+\] > +** zip1 v[0-9]+\.16b, v[0-9]+\.16b, v[0-9]+\.16b > +** ret > +*/ > + > +int8x16_t f_s8(int8_t x) > +{ > + return (int8x16_t) { x, 1, x, 2, x, 3, x, 4, > + x, 5, x, 6, x, 7, x, 8 }; > +} > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-20.c b/gcc/testsuite/gcc.target/aarch64/vec-init-20.c > new file mode 100644 > index 00000000000..9366ca349b6 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-20.c > @@ -0,0 +1,22 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O3" } */ > +/* { dg-final { check-function-bodies "**" "" "" } } */ > + > +#include > + > +/* > +** f_s8: > +** ... > +** adrp x[0-9]+, \.LC[0-9]+ > +** dup v[0-9]+\.8b, w[0-9]+ > +** ldr d[0-9]+, \[x[0-9]+, #:lo12:\.LC[0-9]+\] > +** ins v0\.b\[0\], w0 > +** zip1 v[0-9]+\.16b, v[0-9]+\.16b, v[0-9]+\.16b > +** ret > +*/ > + > +int8x16_t f_s8(int8_t x, int8_t y) > +{ > + return (int8x16_t) { x, y, 1, y, 2, y, 3, y, > + 4, y, 5, y, 6, y, 7, y }; > +} > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-21.c b/gcc/testsuite/gcc.target/aarch64/vec-init-21.c > new file mode 100644 > index 00000000000..e16459486d7 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-21.c > @@ -0,0 +1,22 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O3" } */ > +/* { dg-final { check-function-bodies "**" "" "" } } */ > + > +#include > + > +/* > +** f_s8: > +** ... > +** adrp x[0-9]+, \.LC[0-9]+ > +** ldr q[0-9]+, \[x[0-9]+, #:lo12:\.LC[0-9]+\] > +** ins v0\.b\[0\], w0 > +** ins v0\.b\[1\], w1 > +** ... 
> +** ret > +*/ > + > +int8x16_t f_s8(int8_t x, int8_t y) > +{ > + return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6, > + 7, 8, 9, 10, 11, 12, 13, 14 }; > +} > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-22-size.c b/gcc/testsuite/gcc.target/aarch64/vec-init-22-size.c > new file mode 100644 > index 00000000000..8f35854c008 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-22-size.c > @@ -0,0 +1,24 @@ > +/* { dg-do compile } */ > +/* { dg-options "-Os" } */ > +/* { dg-final { check-function-bodies "**" "" "" } } */ > + > +/* Verify that fallback code-sequence is chosen over > + recursively generated code-sequence merged with zip1. */ > + > +/* > +** f_s16: > +** ... > +** sxth w0, w0 > +** fmov s0, w0 > +** ins v0\.h\[1\], w1 > +** ins v0\.h\[2\], w2 > +** ins v0\.h\[3\], w3 > +** ins v0\.h\[4\], w4 > +** ins v0\.h\[5\], w5 > +** ins v0\.h\[6\], w6 > +** ins v0\.h\[7\], w7 > +** ... > +** ret > +*/ > + > +#include "vec-init-22.h" > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-22-speed.c b/gcc/testsuite/gcc.target/aarch64/vec-init-22-speed.c > new file mode 100644 > index 00000000000..172d56ffdf1 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-22-speed.c > @@ -0,0 +1,27 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O3" } */ > +/* { dg-final { check-function-bodies "**" "" "" } } */ > + > +/* Verify that we recursively generate code for even and odd halves > + instead of fallback code. This is so despite the longer code-gen > + because it has fewer dependencies and thus has lesser cost. */ > + > +/* > +** f_s16: > +** ... > +** sxth w0, w0 > +** sxth w1, w1 > +** fmov d0, x0 > +** fmov d1, x1 > +** ins v[0-9]+\.h\[1\], w2 > +** ins v[0-9]+\.h\[1\], w3 > +** ins v[0-9]+\.h\[2\], w4 > +** ins v[0-9]+\.h\[2\], w5 > +** ins v[0-9]+\.h\[3\], w6 > +** ins v[0-9]+\.h\[3\], w7 > +** zip1 v[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h > +** ... > +** ret > +*/ > + > +#include "vec-init-22.h" > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-22.h b/gcc/testsuite/gcc.target/aarch64/vec-init-22.h > new file mode 100644 > index 00000000000..15b889d4097 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-22.h > @@ -0,0 +1,7 @@ > +#include > + > +int16x8_t f_s16 (int16_t x0, int16_t x1, int16_t x2, int16_t x3, > + int16_t x4, int16_t x5, int16_t x6, int16_t x7) > +{ > + return (int16x8_t) { x0, x1, x2, x3, x4, x5, x6, x7 }; > +}
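
As a closing aside, the even/odd construction that vec-init-22-speed.c checks for corresponds roughly to this at the intrinsics level (illustrative only; the point is that the two 64-bit halves have independent dependency chains before the single zip1):

#include <arm_neon.h>

/* Illustrative sketch of the speed-variant expansion: the even and odd
   halves are built independently (two short ins chains instead of one
   long one) and then merged with a single zip1.  */
int16x8_t f_s16_sketch (int16_t x0, int16_t x1, int16_t x2, int16_t x3,
                        int16_t x4, int16_t x5, int16_t x6, int16_t x7)
{
  int16x4_t even = { x0, x2, x4, x6 };
  int16x4_t odd  = { x1, x3, x5, x7 };
  return vzip1q_s16 (vcombine_s16 (even, vdup_n_s16 (0)),
                     vcombine_s16 (odd, vdup_n_s16 (0)));
}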