From: Richard Sandiford <richard.sandiford@arm.com>
To: Prathamesh Kulkarni
Cc: gcc Patches
Subject: Re: [aarch64] Use dup and zip1 for interleaving elements in initializing vector
Date: Wed, 01 Feb 2023 16:26:54 +0000

Prathamesh Kulkarni writes:
> On Thu, 12 Jan 2023 at 21:21, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni writes:
>> > On Tue, 6 Dec 2022 at 07:01, Prathamesh Kulkarni wrote:
>> >>
>> >> On Mon, 5 Dec 2022 at 16:50, Richard Sandiford wrote:
>> >> >
>> >> > Richard Sandiford via Gcc-patches writes:
>> >> > > Prathamesh Kulkarni writes:
>> >> > >> Hi,
>> >> > >> For the following test-case:
>> >> > >>
>> >> > >> int16x8_t foo(int16_t x, int16_t y)
>> >> > >> {
>> >> > >>   return (int16x8_t) { x, y, x, y, x, y, x, y };
>> >> > >> }
>> >> > >>
>> >> > >> Code gen at -O3:
>> >> > >> foo:
>> >> > >>         dup     v0.8h, w0
>> >> > >>         ins     v0.h[1], w1
>> >> > >>         ins     v0.h[3], w1
>> >> > >>         ins     v0.h[5], w1
>> >> > >>         ins     v0.h[7], w1
>> >> > >>         ret
>> >> > >>
>> >> > >> For 16 elements, it results in 8 ins instructions, which might
>> >> > >> not be optimal.
>> >> > >> I guess the above code-gen would be equivalent to the following?
>> >> > >>         dup     v0.8h, w0
>> >> > >>         dup     v1.8h, w1
>> >> > >>         zip1    v0.8h, v0.8h, v1.8h
>> >> > >>
>> >> > >> I have attached a patch to do the same if the number of elements
>> >> > >> >= 8, which should possibly be better than the current code-gen.
>> >> > >> The patch passes bootstrap+test on aarch64-linux-gnu.
>> >> > >> Does the patch look OK?
>> >> > >>
>> >> > >> Thanks,
>> >> > >> Prathamesh
>> >> > >>
>> >> > >> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> >> > >> index c91df6f5006..e5dea70e363 100644
>> >> > >> --- a/gcc/config/aarch64/aarch64.cc
>> >> > >> +++ b/gcc/config/aarch64/aarch64.cc
>> >> > >> @@ -22028,6 +22028,39 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>> >> > >>        return;
>> >> > >>      }
>> >> > >>
>> >> > >> +  /* Check for interleaving case.
>> >> > >> +     For eg if initializer is (int16x8_t) {x, y, x, y, x, y, x, y}.
>> >> > >> +     Generate following code:
>> >> > >> +     dup v0.h, x
>> >> > >> +     dup v1.h, y
>> >> > >> +     zip1 v0.h, v0.h, v1.h
>> >> > >> +     for "large enough" initializer.  */
>> >> > >> +
>> >> > >> +  if (n_elts >= 8)
>> >> > >> +    {
>> >> > >> +      int i;
>> >> > >> +      for (i = 2; i < n_elts; i++)
>> >> > >> +        if (!rtx_equal_p (XVECEXP (vals, 0, i), XVECEXP (vals, 0, i % 2)))
>> >> > >> +          break;
>> >> > >> +
>> >> > >> +      if (i == n_elts)
>> >> > >> +        {
>> >> > >> +          machine_mode mode = GET_MODE (target);
>> >> > >> +          rtx dest[2];
>> >> > >> +
>> >> > >> +          for (int i = 0; i < 2; i++)
>> >> > >> +            {
>> >> > >> +              rtx x = copy_to_mode_reg (GET_MODE_INNER (mode), XVECEXP (vals, 0, i));
>> >> > >
>> >> > > Formatting nit: long line.
>> >> > >
>> >> > >> +              dest[i] = gen_reg_rtx (mode);
>> >> > >> +              aarch64_emit_move (dest[i], gen_vec_duplicate (mode, x));
>> >> > >> +            }
>> >> > >
>> >> > > This could probably be written:
>> >> > >
>> >> > >   for (int i = 0; i < 2; i++)
>> >> > >     {
>> >> > >       rtx x = expand_vector_broadcast (mode, XVECEXP (vals, 0, i));
>> >> > >       dest[i] = force_reg (GET_MODE_INNER (mode), x);
>> >> >
>> >> > Oops, I meant "mode" rather than "GET_MODE_INNER (mode)", sorry.
>> >> Thanks, I have pushed the change in
>> >> 769370f3e2e04823c8a621d8ffa756dd83ebf21e after running
>> >> bootstrap+test on aarch64-linux-gnu.
>> > Hi Richard,
>> > I have attached a patch that extends the transform if one half is a
>> > dup and the other is a set of constants.
>> > For example:
>> > int8x16_t f(int8_t x)
>> > {
>> >   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4, x, 5, x, 6, x, 7, x, 8 };
>> > }
>> >
>> > code-gen trunk:
>> > f:
>> >         adrp    x1, .LC0
>> >         ldr     q0, [x1, #:lo12:.LC0]
>> >         ins     v0.b[0], w0
>> >         ins     v0.b[2], w0
>> >         ins     v0.b[4], w0
>> >         ins     v0.b[6], w0
>> >         ins     v0.b[8], w0
>> >         ins     v0.b[10], w0
>> >         ins     v0.b[12], w0
>> >         ins     v0.b[14], w0
>> >         ret
>> >
>> > code-gen with patch:
>> > f:
>> >         dup     v0.16b, w0
>> >         adrp    x0, .LC0
>> >         ldr     q1, [x0, #:lo12:.LC0]
>> >         zip1    v0.16b, v0.16b, v1.16b
>> >         ret
>> >
>> > Bootstrapped+tested on aarch64-linux-gnu.
>> > Does it look OK?
>>
>> Looks like a nice improvement.  It'll need to wait for GCC 14 now though.
>>
>> However, rather than handle this case specially, I think we should instead
>> take a divide-and-conquer approach: split the initialiser into even and
>> odd elements, find the best way of loading each part, then compare the
>> cost of these sequences + ZIP with the cost of the fallback code (the code
>> later in aarch64_expand_vector_init).
>>
>> For example, doing that would allow:
>>
>>   { x, y, 0, y, 0, y, 0, y, 0, y }
>>
>> to be loaded more easily, even though the even elements aren't wholly
>> constant.
> Hi Richard,
> I have attached a prototype patch based on the above approach.
> It subsumes the specialization for the above {x, y, x, y, x, y, x, y}
> case by generating the same sequence, so I removed that hunk, and it
> improves the following cases:
>
> (a)
> int8x16_t f_s16(int8_t x)
> {
>   return (int8x16_t) { x, 1, x, 2, x, 3, x, 4,
>                        x, 5, x, 6, x, 7, x, 8 };
> }
>
> code-gen trunk:
> f_s16:
>         adrp    x1, .LC0
>         ldr     q0, [x1, #:lo12:.LC0]
>         ins     v0.b[0], w0
>         ins     v0.b[2], w0
>         ins     v0.b[4], w0
>         ins     v0.b[6], w0
>         ins     v0.b[8], w0
>         ins     v0.b[10], w0
>         ins     v0.b[12], w0
>         ins     v0.b[14], w0
>         ret
>
> code-gen with patch:
> f_s16:
>         dup     v0.16b, w0
>         adrp    x0, .LC0
>         ldr     q1, [x0, #:lo12:.LC0]
>         zip1    v0.16b, v0.16b, v1.16b
>         ret
>
> (b)
> int8x16_t f_s16(int8_t x, int8_t y)
> {
>   return (int8x16_t) { x, y, 1, y, 2, y, 3, y,
>                        4, y, 5, y, 6, y, 7, y };
> }
>
> code-gen trunk:
> f_s16:
>         adrp    x2, .LC0
>         ldr     q0, [x2, #:lo12:.LC0]
>         ins     v0.b[0], w0
>         ins     v0.b[1], w1
>         ins     v0.b[3], w1
>         ins     v0.b[5], w1
>         ins     v0.b[7], w1
>         ins     v0.b[9], w1
>         ins     v0.b[11], w1
>         ins     v0.b[13], w1
>         ins     v0.b[15], w1
>         ret
>
> code-gen with patch:
> f_s16:
>         adrp    x2, .LC0
>         dup     v1.16b, w1
>         ldr     q0, [x2, #:lo12:.LC0]
>         ins     v0.b[0], w0
>         zip1    v0.16b, v0.16b, v1.16b
>         ret

Nice.
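
For reference, this is roughly the shape of the even/odd split I had in
mind; a minimal sketch only, with a hypothetical helper name, and
leaving out the costing and fallback logic:

    /* Split VALS into its even-indexed and odd-indexed elements, so
       that each half can be expanded independently and the results
       interleaved with zip1.  */
    static void
    aarch64_split_vector_init_halves (rtx vals, rtvec *even, rtvec *odd)
    {
      int n = XVECLEN (vals, 0);
      *even = rtvec_alloc (n / 2);
      *odd = rtvec_alloc (n / 2);
      for (int i = 0; i < n / 2; i++)
        {
          RTVEC_ELT (*even, i) = XVECEXP (vals, 0, 2 * i);
          RTVEC_ELT (*odd, i) = XVECEXP (vals, 0, 2 * i + 1);
        }
    }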

> There are a couple of issues I have come across:
> (1) Choosing the element to pad the vector.
> For example, if we are initializing a vector, say { x, y, 0, y, 1, y, 2, y },
> with mode V8HI, we split it into { x, 0, 1, 2 } and { y, y, y, y }.
> However, since the mode is V8HI, we would need to pad the above split
> vectors with 4 more elements to match up to the vector length.
> For { x, 0, 1, 2 } using any constant is the obvious choice, while for
> { y, y, y, y } using 'y' is the obvious choice, thus making them:
> { x, 0, 1, 2, 0, 0, 0, 0 } and { y, y, y, y, y, y, y, y }.
> These would then be merged using zip1, which would discard the upper
> half of both vectors.
> Currently I encoded the above two heuristics in
> aarch64_expand_vector_init_get_padded_elem:
> (a) If the split portion contains a constant, use the constant to pad
> the vector.
> (b) If the split portion only contains variables, then use the most
> frequently repeating variable to pad the vector.
> I suppose, though, this could be improved?

I think we should just build two 64-bit vectors (V4HIs) and use a subreg
to fill the upper elements with undefined values.

I suppose in principle we would have the same problem when splitting
a 64-bit vector into 2 32-bit vectors, but it's probably better to punt
on that for now.  Eventually it would be worth adding full support for
32-bit Advanced SIMD modes (with the necessary restrictions for FP
exceptions), but it's quite a big task.  The 128-bit to 64-bit split
is the one that matters most.
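
Concretely, something along these lines; a rough, untested sketch, in
which even_vals and odd_vals stand for the half-width PARALLELs built
from the split halves:

    /* Build each half in the 64-bit mode, then widen with paradoxical
       subregs.  The upper elements are undefined, which is fine
       because zip1 only reads the lower half of each input.  */
    machine_mode half_mode = V4HImode;  /* for a V8HI target */
    rtx even = gen_reg_rtx (half_mode);
    rtx odd = gen_reg_rtx (half_mode);
    aarch64_expand_vector_init (even, even_vals);
    aarch64_expand_vector_init (odd, odd_vals);
    rtx op0 = lowpart_subreg (V8HImode, even, half_mode);
    rtx op1 = lowpart_subreg (V8HImode, odd, half_mode);
    emit_set_insn (target, gen_rtx_UNSPEC (V8HImode,
                                           gen_rtvec (2, op0, op1),
                                           UNSPEC_ZIP1));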

> (2) Setting the cost for zip1:
> Currently it returns 4 as the cost for the following zip1 insn:
> (set (reg:V8HI 102)
>     (unspec:V8HI [
>             (reg:V8HI 103)
>             (reg:V8HI 108)
>         ] UNSPEC_ZIP1))
> I am not sure if that's correct, or if not, what cost to use in this
> case for zip1?

TBH 4 seems a bit optimistic.  It's COSTS_N_INSNS (1), whereas the
generic advsimd_vec_cost::permute_cost is 2 insns.  But the costs of
inserts are probably underestimated to the same extent, so hopefully
things work out.

So it's probably best to accept the costs as they're currently given.
Changing them would need extensive testing.

However, one of the advantages of the split is that it allows the
subvectors to be built in parallel.  When optimising for speed, it
might make sense to take the maximum of the subsequence costs and add
the cost of the zip to that.

Thanks,
Richard
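
P.S. In case it helps, the cost comparison I have in mind for the speed
case would look something like this; a sketch only, where seq_even,
seq_odd and seq_fallback stand for the candidate instruction sequences
and zip1_insn for the interleaving instruction:

    /* The two subvector builds are independent and can execute in
       parallel, so for speed take the slower of the two and add the
       cost of the zip before comparing against the fallback.  */
    unsigned even_cost = seq_cost (seq_even, /*speed=*/true);
    unsigned odd_cost = seq_cost (seq_odd, /*speed=*/true);
    unsigned split_cost = (MAX (even_cost, odd_cost)
                           + insn_cost (zip1_insn, /*speed=*/true));
    if (split_cost < seq_cost (seq_fallback, /*speed=*/true))
      /* ... emit the even/odd split + zip1 sequence ... */;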