From: Richard Sandiford
To: Tamar Christina
Cc: gcc-patches@gcc.gnu.org, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov
Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants and operations
Date: Tue, 31 Aug 2021 17:07:14 +0100
In-Reply-To: Tamar Christina's message of "Tue, 31 Aug 2021 15:47:19 +0000"
List-Id: Gcc-patches mailing list

Tamar Christina writes:
>> -----Original Message-----
>> From: Richard Sandiford
>> Sent: Tuesday, August 31, 2021 4:14 PM
>> To: Tamar Christina
>> Cc: gcc-patches@gcc.gnu.org; nd; Richard Earnshaw;
>> Marcus Shawcroft; Kyrylo Tkachov
>> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants
>> and operations
>>
>> Tamar Christina writes:
>> > @@ -13936,8 +13937,65 @@ cost_plus:
>> > 			          mode, MULT, 1, speed);
>> > 	    return true;
>> > 	  }
>> > +	break;
>> > +    case PARALLEL:
>> > +      /* Fall through */
>>
>> Which code paths lead to getting a PARALLEL here?
>
> Hi,
>
> Thanks for the review!
>
> I added it for completeness because CSE treats a PARALLEL and a CONST_VECTOR
> as equivalent when each entry in the PARALLEL defines a constant.

Could you test whether it ever triggers in practice though?
The code would be much simpler without it.

>> > +    case CONST_VECTOR:
>> > +      {
>> > +	rtx gen_insn = aarch64_simd_make_constant (x, true);
>> > +	/* Not a valid const vector.  */
>> > +	if (!gen_insn)
>> > +	  break;
>> >
>> > -      /* Fall through.  */
>> > +	switch (GET_CODE (gen_insn))
>> > +	  {
>> > +	  case CONST_VECTOR:
>> > +	    /* Load using MOVI/MVNI.  */
>> > +	    if (aarch64_simd_valid_immediate (x, NULL))
>> > +	      *cost += extra_cost->vect.movi;
>> > +	    else /* Load using constant pool.  */
>> > +	      *cost += extra_cost->ldst.load;
>> > +	    break;
>> > +	  /* Load using a DUP.  */
>> > +	  case VEC_DUPLICATE:
>> > +	    *cost += extra_cost->vect.dup;
>> > +	    break;
>>
>> Does this trigger in practice?  The new check==true path (rightly) stops
>> the duplicated element from being forced into a register, but then I would
>> have expected:
>>
>>   rtx
>>   gen_vec_duplicate (machine_mode mode, rtx x)
>>   {
>>     if (valid_for_const_vector_p (mode, x))
>>       return gen_const_vec_duplicate (mode, x);
>>     return gen_rtx_VEC_DUPLICATE (mode, x);
>>   }
>>
>> to generate the original CONST_VECTOR again.
>
> Yes, but CSE is trying to see whether using a DUP is cheaper than another
> instruction.
> Normal code won't hit this, but CSE is just costing all the different ways
> one can semantically construct a vector; which RTL actually comes out of it
> depends on how it's folded, as you say.

But what I mean is, you call:

  rtx gen_insn = aarch64_simd_make_constant (x, true);
  /* Not a valid const vector.  */
  if (!gen_insn)
    break;

where aarch64_simd_make_constant does:

  if (CONST_VECTOR_P (vals))
    const_vec = vals;
  else if (GET_CODE (vals) == PARALLEL)
    {
      /* A CONST_VECTOR must contain only CONST_INTs and
	 CONST_DOUBLEs, but CONSTANT_P allows more (e.g. SYMBOL_REF).
	 Only store valid constants in a CONST_VECTOR.  */
      int n_elts = XVECLEN (vals, 0);
      for (i = 0; i < n_elts; ++i)
	{
	  rtx x = XVECEXP (vals, 0, i);
	  if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
	    n_const++;
	}
      if (n_const == n_elts)
	const_vec = gen_rtx_CONST_VECTOR (mode, XVEC (vals, 0));
    }
  else
    gcc_unreachable ();

  if (const_vec != NULL_RTX
      && aarch64_simd_valid_immediate (const_vec, NULL))
    /* Load using MOVI/MVNI.  */
    return const_vec;
  else if ((const_dup = aarch64_simd_dup_constant (vals, check)) != NULL_RTX)
    /* Loaded using DUP.  */
    return const_dup;

and aarch64_simd_dup_constant does:

  machine_mode mode = GET_MODE (vals);
  machine_mode inner_mode = GET_MODE_INNER (mode);
  rtx x;

  if (!const_vec_duplicate_p (vals, &x))
    return NULL_RTX;

  /* We can load this constant by using DUP and a constant in a
     single ARM register.  This will be cheaper than a vector
     load.  */
  if (!check)
    x = copy_to_mode_reg (inner_mode, x);
  return gen_vec_duplicate (mode, x);

For the "check" case, "x" will be a constant, and so gen_vec_duplicate
will call gen_const_vec_duplicate, which will return a CONST_VECTOR.
It didn't seem to be possible for gen_insn to be a VEC_DUPLICATE.
This would be much simpler if we could call aarch64_simd_valid_immediate
and aarch64_simd_dup_constant directly from the rtx cost code, hence the
question about whether the PARALLEL stuff was really needed in practice.

>> > +	  default:
>> > +	    *cost += extra_cost->ldst.load;
>> > +	    break;
>> > +	  }
>> > +	return true;
>> > +      }
>> > +    case VEC_CONCAT:
>> > +      /* depending on the operation, either DUP or INS.
>> > +	 For now, keep default costing.  */
>> > +      break;
>> > +    case VEC_DUPLICATE:
>> > +      *cost += extra_cost->vect.dup;
>> > +      return true;
>> > +    case VEC_SELECT:
>> > +      {
>> > +	/* cost subreg of 0 as free, otherwise as DUP.  */
>> > +	rtx op1 = XEXP (x, 1);
>> > +	int nelts;
>> > +	if ((op1 == const0_rtx && !BYTES_BIG_ENDIAN)
>> > +	    || (BYTES_BIG_ENDIAN
>> > +		&& GET_MODE_NUNITS (mode).is_constant (&nelts)
>> > +		&& INTVAL (op1) == nelts - 1))
>> > +	  ;
>> > +	else if (vec_series_lowpart_p (mode, GET_MODE (op1), op1))
>> > +	  ;
>> > +	else if (vec_series_highpart_p (mode, GET_MODE (op1), op1))
>> > +	  /* Selecting the high part is not technically free, but we lack
>> > +	     enough information to decide that here.  For instance, selecting
>> > +	     the high part of a vec_dup *is* free, as is feeding it into any
>> > +	     _high instruction, but we can't really tell either of those here.
>> > +	     That said, we have a better chance to optimize a dup vs multiple
>> > +	     constants.  */
>> > +	  ;
>>
>> Not sure about this.  We already try to detect the latter case (_high
>> instructions) via aarch64_strip_extend_vec_half.  We might be missing some
>> cases, but that still feels like the right way to go IMO.
>
> That's a different problem from what I understand.  What this is trying to
> say is that if you have a vector [x y a b] and you need the vector [x y],
> you can use the top part of the original vector for this.
>
> This is an approximation, because something that can be created with a movi
> is probably cheaper to keep distinct if it's not going to be paired with a
> _high operation (since you will have a dup then).
>
> The problem is that the front end has already split the two vectors into
> [x y a b] and [x y].  There's nothing else that tries to consolidate them
> back up if both survive.
>
> As a consequence of this, the testcase test0 is not handled optimally.  It
> would instead create 2 vectors, both movi 0x3, just one being 64 bits and
> one being 128 bits.
>
> So if the cost of selecting it is cheaper than the movi, CSE will not
> consolidate the vectors, and because movis are so cheap, the only cost that
> worked was 0.  But increasing the cost of movis requires the costs of
> everything else to be increased as well (including loads).
>
> I preferred to zero out the cost, because the worst that can happen is a
> dup instead of a movi, and at best a dup instead of a load from the
> constant pool (if the constant is complicated).

Hmm, will need to look at this more tomorrow.

>> Selecting the high part of a vec_dup should get folded into another
>> vec_dup.
>>
>> The lowpart bits look OK, but which paths call this function without first
>> simplifying the select to a subreg?  The subreg is now the canonical form
>> (thanks to r12-2288).
>
> The simplification will happen during folding in cse or in combine.  This
> costing happens before the folding, when CSE is trying to decide whether to
> undo the front end's lowering of constants.
>
> To do so, it models the constants and the semantic operation required to
> extract them.  E.g. to get 2 out of [0 2 4 5] it would need a VEC_SELECT
> with index 1.  And I don't treat the first element/bottom part specially
> here; costing-wise they would be the same.

But which code path creates the VEC_SELECT?  We don't need any context to
know that the VEC_SELECT is non-canonical.  It's obvious from the operands
of the VEC_SELECT in isolation.

I'd just rather tackle this at source than try to get the cost code to
handle non-canonical rtl.

Thanks,
Richard