From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id AF094385B801 for ; Mon, 13 Jun 2022 08:26:39 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org AF094385B801 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 49660D6E; Mon, 13 Jun 2022 01:26:39 -0700 (PDT) Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.37]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id F2ED73F792; Mon, 13 Jun 2022 01:26:37 -0700 (PDT) From: Richard Sandiford To: Richard Biener Mail-Followup-To: Richard Biener , Tamar Christina , "gcc-patches\@gcc.gnu.org" , nd , Richard Earnshaw , Marcus Shawcroft , Kyrylo Tkachov , roger@eyesopen.com, richard.sandiford@arm.com Cc: Tamar Christina , "gcc-patches\@gcc.gnu.org" , nd , Richard Earnshaw , Marcus Shawcroft , Kyrylo Tkachov , roger@eyesopen.com Subject: Re: [PATCH]AArch64 relax predicate on load structure load instructions References: Date: Mon, 13 Jun 2022 09:26:36 +0100 In-Reply-To: (Richard Biener's message of "Mon, 13 Jun 2022 10:00:21 +0200 (CEST)") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-52.7 required=5.0 tests=BAYES_00, KAM_DMARC_NONE, KAM_DMARC_STATUS, KAM_LAZY_DOMAIN_SECURITY, KAM_SHORT, SPF_HELO_NONE, SPF_NONE, TXREP, T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 13 Jun 2022 08:26:42 -0000 Richard Biener writes: > On Wed, 8 Jun 2022, Richard Sandiford wrote: >> Tamar Christina writes: >> >> -----Original Message----- >> >> From: Richard Sandiford >> >> Sent: Wednesday, June 8, 2022 11:31 AM >> >> To: Tamar Christina >> >> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw >> >> ; Marcus Shawcroft >> >> ; Kyrylo Tkachov >> >> Subject: Re: [PATCH]AArch64 relax predicate on load structure load >> >> instructions >> >>=20 >> >> Tamar Christina writes: >> >> > Hi All, >> >> > >> >> > At some point in time we started lowering the ld1r instructions in = gimple. >> >> > >> >> > That is: >> >> > >> >> > uint8x8_t f1(const uint8_t *in) { >> >> > return vld1_dup_u8(&in[1]); >> >> > } >> >> > >> >> > generates at gimple: >> >> > >> >> > _3 =3D MEM[(const uint8_t *)in_1(D) + 1B]; >> >> > _4 =3D {_3, _3, _3, _3, _3, _3, _3, _3}; >> >> > >> >> > Which is good, but we then generate: >> >> > >> >> > f1: >> >> > ldr b0, [x0, 1] >> >> > dup v0.8b, v0.b[0] >> >> > ret >> >> > >> >> > instead of ld1r. >> >> > >> >> > The reason for this is because the load instructions have a too >> >> > restrictive predicate on them which causes combine not to be able to >> >> > combine the instructions due to the predicate only accepting simple >> >> addressing modes. >> >> > >> >> > This patch relaxes the predicate to accept any memory operand and >> >> > relies on LRA to legitimize the address when it needs to as the >> >> > constraint still only allows the simple addressing mode. Reload is >> >> > always able to legitimize to these. >> >> > >> >> > Secondly since we are now actually generating more ld1r it became >> >> > clear that the lane instructions suffer from a similar issue. >> >> > >> >> > i.e. >> >> > >> >> > float32x4_t f2(const float32_t *in, float32x4_t a) { >> >> > float32x4_t dup =3D vld1q_dup_f32(&in[1]); >> >> > return vfmaq_laneq_f32 (a, a, dup, 1); } >> >> > >> >> > would generate ld1r + vector fmla instead of ldr + lane fmla. >> >> > >> >> > The reason for this is similar to the ld1r issue. The predicate is >> >> > too restrictive in only acception register operands but not memory. >> >> > >> >> > This relaxes it to accept register and/or memory while leaving the >> >> > constraint to only accept registers. This will have LRA generate a >> >> > reload if needed forcing the memory to registers using the standard >> >> patterns. >> >> > >> >> > These two changes allow combine and reload to generate the right >> >> sequences. >> >> > >> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. >> >>=20 >> >> This is going against the general direction of travel, which is to ma= ke the >> >> instruction's predicates and conditions enforce the constraints as mu= ch as >> >> possible (making optimistic assumptions about pseudo registers). >> >>=20 >> >> The RA *can* deal with things like: >> >>=20 >> >> (match_operand:M N "general_operand" "r") >> >>=20 >> >> but it's best avoided, for a few reasons: >> >>=20 >> >> (1) The fix-up will be done in LRA, so IRA will not see the temporary >> >> registers. This can make the allocation of those temporaries >> >> suboptimal but (more importantly) it might require other >> >> previously-allocated registers to be spilled late due to the >> >> unexpected increase in register pressure. >> >>=20 >> >> (2) It ends up hiding instructions from the pre-RA optimisers. >> >>=20 >> >> (3) It can also prevent combine opportunities (as well as create them= ), >> >> unless the loose predicates in an insn I are propagated to all >> >> patterns that might result from combining I with something else. >> >>=20 >> >> It sounds like the first problem (not generating ld1r) could be fixed= by (a) >> >> combining aarch64_simd_dup and *aarch64_simd_ld1r, so >> >> that the register and memory alternatives are in the same pattern and= (b) >> >> using the merged instruction(s) to implement the vec_duplicate optab. >> >> Target-independent code should then make the address satisfy the >> >> predicate, simplifying the address where necessary. >> >>=20 >> > >> > I think I am likely missing something here. I would assume that you wa= nted >> > to use the optab to split the addressing off from the mem expression s= o the >> > combined insn matches. >> > >> > But in that case, why do you need to combine the two instructions? >> > I've tried and it doesn't work since the vec_duplicate optab doesn't s= ee the >> > mem as op1, because in gimple the mem is not part of the duplicate. >> > >> > So you still just see: >> > >> >>>> dbgrtx (ops[1].value) >> > (subreg/s/v:QI (reg:SI 92 [ _3 ]) 0) >> > >> > As the operand as the argument to the dup is just an SSA_NAME. >>=20 >> Ah, yeah, I'd forgotten that fixed-length vec_duplicates would >> come from a constructor rather than a vec_duplicate_expr, so we don't >> get the usual benefit of folding single-use mems during expand. >>=20 >> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595362.html >> moves towards using vec_duplicate even for fixed-length vectors. >> If we take that approach, then I suppose a plain constructor >> should be folded to a vec_duplicate where possible. >>=20 >> (Alternatively, we could use an extended vec_perm_expr with >> scalar inputs, as Richi suggested in that thread.) >>=20 >> If we don't do that, or don't do it yet, then=E2=80=A6 > > I suppose since we alrady have vec_duplicate we can just use it ... > what was the reason to not do this originally? There just wasn't any specific benefit for fixed-length vectors at the time, and obvious potential problems -- introducing VEC_DUPLICATE_EXPRs too early would lose out on existing CONSTRUCTOR-based folds. Also, isel didn't exist at the time that vec_duplicate was added, but it seems like it might be a good place to do the replacement. Match rules that want to test for a uniform vector operand can already use vec_same_elem_p to handle all representations, but perhaps we also need a way of generating the =E2=80=9Cright=E2=80=9D form of duplicate for = the current stage in the pass pipeline? > I suppose the > vec_duplicate expander has a fallback via store_constuctor? > > Originally I wanted to avoid multiple ways to express the same thing > but vec_duplicate is a common enough special-case and it also > usually maps to a special instruction in vector ISAs. > There's VIEW_CONVERT vs. vec_duplicate for V1m modes then, I > suppose VIEW_CONVERT is more canonical here. Is that already true for V1m constructors? (view_convert being canonical and constructors not, I mean.) What do you think about the suggestion in the other thread of making VEC_PERM_EXPR take an arbitrary number of inputs, with (as you suggested) the inputs allowed to be scalars rather than vectors? VEC_PERM_EXPR could then replace both CONSTRUCTOR and VEC_DUPLICATE_EXPR and =E2=80=9Copt= imising=E2=80=9D a normal constructor to a duplicate would just be a case of removing repeated scalar inputs. Thanks, Richard