From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <richard.sandiford@arm.com>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
 by sourceware.org (Postfix) with ESMTP id AF094385B801
 for <gcc-patches@gcc.gnu.org>; Mon, 13 Jun 2022 08:26:39 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org AF094385B801
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
 by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 49660D6E;
 Mon, 13 Jun 2022 01:26:39 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.98.37])
 by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id F2ED73F792;
 Mon, 13 Jun 2022 01:26:37 -0700 (PDT)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Richard Biener <rguenther@suse.de>
Mail-Followup-To: Richard Biener <rguenther@suse.de>,
 Tamar Christina <Tamar.Christina@arm.com>,
 "gcc-patches\@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, nd <nd@arm.com>,
 Richard Earnshaw <Richard.Earnshaw@arm.com>,
 Marcus Shawcroft <Marcus.Shawcroft@arm.com>,
 Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>, roger@eyesopen.com,
 richard.sandiford@arm.com
Cc: Tamar Christina <Tamar.Christina@arm.com>,
 "gcc-patches\@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, nd <nd@arm.com>,
 Richard Earnshaw <Richard.Earnshaw@arm.com>,
 Marcus Shawcroft <Marcus.Shawcroft@arm.com>,
 Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>, roger@eyesopen.com
Subject: Re: [PATCH]AArch64 relax predicate on load structure load instructions
References: <patch-15781-tamar@arm.com> <mptbkv3igzp.fsf@arm.com>
 <VI1PR08MB532510AAA956E28EDFAFBE3AFFA49@VI1PR08MB5325.eurprd08.prod.outlook.com>
 <mpt4k0vgr47.fsf@arm.com>
 <nsq6nq15-7on8-8so2-o32r-7r547rnq141o@fhfr.qr>
Date: Mon, 13 Jun 2022 09:26:36 +0100
In-Reply-To: <nsq6nq15-7on8-8so2-o32r-7r547rnq141o@fhfr.qr> (Richard Biener's
 message of "Mon, 13 Jun 2022 10:00:21 +0200 (CEST)")
Message-ID: <mptk09ldl4z.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-52.7 required=5.0 tests=BAYES_00, KAM_DMARC_NONE,
 KAM_DMARC_STATUS, KAM_LAZY_DOMAIN_SECURITY, KAM_SHORT, SPF_HELO_NONE, SPF_NONE,
 TXREP, T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 13 Jun 2022 08:26:42 -0000

Richard Biener <rguenther@suse.de> writes:
> On Wed, 8 Jun 2022, Richard Sandiford wrote:
>> Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandiford@arm.com>
>> >> Sent: Wednesday, June 8, 2022 11:31 AM
>> >> To: Tamar Christina <Tamar.Christina@arm.com>
>> >> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
>> >> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
>> >> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
>> >> Subject: Re: [PATCH]AArch64 relax predicate on load structure load
>> >> instructions
>> >>=20
>> >> Tamar Christina <tamar.christina@arm.com> writes:
>> >> > Hi All,
>> >> >
>> >> > At some point in time we started lowering the ld1r instructions in =
gimple.
>> >> >
>> >> > That is:
>> >> >
>> >> > uint8x8_t f1(const uint8_t *in) {
>> >> >     return vld1_dup_u8(&in[1]);
>> >> > }
>> >> >
>> >> > generates at gimple:
>> >> >
>> >> >   _3 =3D MEM[(const uint8_t *)in_1(D) + 1B];
>> >> >   _4 =3D {_3, _3, _3, _3, _3, _3, _3, _3};
>> >> >
>> >> > Which is good, but we then generate:
>> >> >
>> >> > f1:
>> >> > 	ldr     b0, [x0, 1]
>> >> > 	dup     v0.8b, v0.b[0]
>> >> > 	ret
>> >> >
>> >> > instead of ld1r.
>> >> >
>> >> > The reason for this is because the load instructions have a too
>> >> > restrictive predicate on them which causes combine not to be able to
>> >> > combine the instructions due to the predicate only accepting simple
>> >> addressing modes.
>> >> >
>> >> > This patch relaxes the predicate to accept any memory operand and
>> >> > relies on LRA to legitimize the address when it needs to as the
>> >> > constraint still only allows the simple addressing mode.  Reload is
>> >> > always able to legitimize to these.
>> >> >
>> >> > Secondly since we are now actually generating more ld1r it became
>> >> > clear that the lane instructions suffer from a similar issue.
>> >> >
>> >> > i.e.
>> >> >
>> >> > float32x4_t f2(const float32_t *in, float32x4_t a) {
>> >> >     float32x4_t dup =3D vld1q_dup_f32(&in[1]);
>> >> >     return vfmaq_laneq_f32 (a, a, dup, 1); }
>> >> >
>> >> > would generate ld1r + vector fmla instead of ldr + lane fmla.
>> >> >
>> >> > The reason for this is similar to the ld1r issue.  The predicate is
>> >> > too restrictive in only acception register operands but not memory.
>> >> >
>> >> > This relaxes it to accept register and/or memory while leaving the
>> >> > constraint to only accept registers.  This will have LRA generate a
>> >> > reload if needed forcing the memory to registers using the standard
>> >> patterns.
>> >> >
>> >> > These two changes allow combine and reload to generate the right
>> >> sequences.
>> >> >
>> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >>=20
>> >> This is going against the general direction of travel, which is to ma=
ke the
>> >> instruction's predicates and conditions enforce the constraints as mu=
ch as
>> >> possible (making optimistic assumptions about pseudo registers).
>> >>=20
>> >> The RA *can* deal with things like:
>> >>=20
>> >>   (match_operand:M N "general_operand" "r")
>> >>=20
>> >> but it's best avoided, for a few reasons:
>> >>=20
>> >> (1) The fix-up will be done in LRA, so IRA will not see the temporary
>> >>     registers.  This can make the allocation of those temporaries
>> >>     suboptimal but (more importantly) it might require other
>> >>     previously-allocated registers to be spilled late due to the
>> >>     unexpected increase in register pressure.
>> >>=20
>> >> (2) It ends up hiding instructions from the pre-RA optimisers.
>> >>=20
>> >> (3) It can also prevent combine opportunities (as well as create them=
),
>> >>     unless the loose predicates in an insn I are propagated to all
>> >>     patterns that might result from combining I with something else.
>> >>=20
>> >> It sounds like the first problem (not generating ld1r) could be fixed=
 by (a)
>> >> combining aarch64_simd_dup<mode> and *aarch64_simd_ld1r<mode>, so
>> >> that the register and memory alternatives are in the same pattern and=
 (b)
>> >> using the merged instruction(s) to implement the vec_duplicate optab.
>> >> Target-independent code should then make the address satisfy the
>> >> predicate, simplifying the address where necessary.
>> >>=20
>> >
>> > I think I am likely missing something here. I would assume that you wa=
nted
>> > to use the optab to split the addressing off from the mem expression s=
o the
>> > combined insn matches.
>> >
>> > But in that case, why do you need to combine the two instructions?
>> > I've tried and it doesn't work since the vec_duplicate optab doesn't s=
ee the
>> > mem as op1, because in gimple the mem is not part of the duplicate.
>> >
>> > So you still just see:
>> >
>> >>>> dbgrtx (ops[1].value)
>> > (subreg/s/v:QI (reg:SI 92 [ _3 ]) 0)
>> >
>> > As the operand as the argument to the dup is just an SSA_NAME.
>>=20
>> Ah, yeah, I'd forgotten that fixed-length vec_duplicates would
>> come from a constructor rather than a vec_duplicate_expr, so we don't
>> get the usual benefit of folding single-use mems during expand.
>>=20
>> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595362.html
>> moves towards using vec_duplicate even for fixed-length vectors.
>> If we take that approach, then I suppose a plain constructor
>> should be folded to a vec_duplicate where possible.
>>=20
>> (Alternatively, we could use an extended vec_perm_expr with
>> scalar inputs, as Richi suggested in that thread.)
>>=20
>> If we don't do that, or don't do it yet, then=E2=80=A6
>
> I suppose since we alrady have vec_duplicate we can just use it ...
> what was the reason to not do this originally?

There just wasn't any specific benefit for fixed-length vectors at the
time, and obvious potential problems -- introducing VEC_DUPLICATE_EXPRs
too early would lose out on existing CONSTRUCTOR-based folds.

Also, isel didn't exist at the time that vec_duplicate was added, but it
seems like it might be a good place to do the replacement.

Match rules that want to test for a uniform vector operand can already
use vec_same_elem_p to handle all representations, but perhaps we also
need a way of generating the =E2=80=9Cright=E2=80=9D form of duplicate for =
the current
stage in the pass pipeline?

> I suppose the
> vec_duplicate expander has a fallback via store_constuctor?
>
> Originally I wanted to avoid multiple ways to express the same thing
> but vec_duplicate is a common enough special-case and it also
> usually maps to a special instruction in vector ISAs.
> There's VIEW_CONVERT vs. vec_duplicate for V1m modes then, I
> suppose VIEW_CONVERT is more canonical here.

Is that already true for V1m constructors?  (view_convert being
canonical and constructors not, I mean.)

What do you think about the suggestion in the other thread of making
VEC_PERM_EXPR take an arbitrary number of inputs, with (as you suggested)
the inputs allowed to be scalars rather than vectors?  VEC_PERM_EXPR
could then replace both CONSTRUCTOR and VEC_DUPLICATE_EXPR and =E2=80=9Copt=
imising=E2=80=9D
a normal constructor to a duplicate would just be a case of removing
repeated scalar inputs.

Thanks,
Richard