Re: [PATCH]AArch64 relax predicate on load structure load instructions

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Richard Biener <rguenther@suse.de>
To: Richard Sandiford <richard.sandiford@arm.com>
Cc: Tamar Christina <Tamar.Christina@arm.com>,
	 "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	nd <nd@arm.com>,  Richard Earnshaw <Richard.Earnshaw@arm.com>,
	 Marcus Shawcroft <Marcus.Shawcroft@arm.com>,
	 Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>,
	roger@eyesopen.com
Subject: Re: [PATCH]AArch64 relax predicate on load structure load instructions
Date: Mon, 13 Jun 2022 10:38:27 +0200 (CEST)	[thread overview]
Message-ID: <1821q4nr-nsp9-r811-229q-s3s447srq09n@fhfr.qr> (raw)
In-Reply-To: <mptk09ldl4z.fsf@arm.com>

On Mon, 13 Jun 2022, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Wed, 8 Jun 2022, Richard Sandiford wrote:
> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> -----Original Message-----
> >> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> Sent: Wednesday, June 8, 2022 11:31 AM
> >> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> >> >> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> >> >> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> >> >> Subject: Re: [PATCH]AArch64 relax predicate on load structure load
> >> >> instructions
> >> >> 
> >> >> Tamar Christina <tamar.christina@arm.com> writes:
> >> >> > Hi All,
> >> >> >
> >> >> > At some point in time we started lowering the ld1r instructions in gimple.
> >> >> >
> >> >> > That is:
> >> >> >
> >> >> > uint8x8_t f1(const uint8_t *in) {
> >> >> >     return vld1_dup_u8(&in[1]);
> >> >> > }
> >> >> >
> >> >> > generates at gimple:
> >> >> >
> >> >> >   _3 = MEM[(const uint8_t *)in_1(D) + 1B];
> >> >> >   _4 = {_3, _3, _3, _3, _3, _3, _3, _3};
> >> >> >
> >> >> > Which is good, but we then generate:
> >> >> >
> >> >> > f1:
> >> >> > 	ldr     b0, [x0, 1]
> >> >> > 	dup     v0.8b, v0.b[0]
> >> >> > 	ret
> >> >> >
> >> >> > instead of ld1r.
> >> >> >
> >> >> > The reason for this is because the load instructions have a too
> >> >> > restrictive predicate on them which causes combine not to be able to
> >> >> > combine the instructions due to the predicate only accepting simple
> >> >> addressing modes.
> >> >> >
> >> >> > This patch relaxes the predicate to accept any memory operand and
> >> >> > relies on LRA to legitimize the address when it needs to as the
> >> >> > constraint still only allows the simple addressing mode.  Reload is
> >> >> > always able to legitimize to these.
> >> >> >
> >> >> > Secondly since we are now actually generating more ld1r it became
> >> >> > clear that the lane instructions suffer from a similar issue.
> >> >> >
> >> >> > i.e.
> >> >> >
> >> >> > float32x4_t f2(const float32_t *in, float32x4_t a) {
> >> >> >     float32x4_t dup = vld1q_dup_f32(&in[1]);
> >> >> >     return vfmaq_laneq_f32 (a, a, dup, 1); }
> >> >> >
> >> >> > would generate ld1r + vector fmla instead of ldr + lane fmla.
> >> >> >
> >> >> > The reason for this is similar to the ld1r issue.  The predicate is
> >> >> > too restrictive in only acception register operands but not memory.
> >> >> >
> >> >> > This relaxes it to accept register and/or memory while leaving the
> >> >> > constraint to only accept registers.  This will have LRA generate a
> >> >> > reload if needed forcing the memory to registers using the standard
> >> >> patterns.
> >> >> >
> >> >> > These two changes allow combine and reload to generate the right
> >> >> sequences.
> >> >> >
> >> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >> >> 
> >> >> This is going against the general direction of travel, which is to make the
> >> >> instruction's predicates and conditions enforce the constraints as much as
> >> >> possible (making optimistic assumptions about pseudo registers).
> >> >> 
> >> >> The RA *can* deal with things like:
> >> >> 
> >> >>   (match_operand:M N "general_operand" "r")
> >> >> 
> >> >> but it's best avoided, for a few reasons:
> >> >> 
> >> >> (1) The fix-up will be done in LRA, so IRA will not see the temporary
> >> >>     registers.  This can make the allocation of those temporaries
> >> >>     suboptimal but (more importantly) it might require other
> >> >>     previously-allocated registers to be spilled late due to the
> >> >>     unexpected increase in register pressure.
> >> >> 
> >> >> (2) It ends up hiding instructions from the pre-RA optimisers.
> >> >> 
> >> >> (3) It can also prevent combine opportunities (as well as create them),
> >> >>     unless the loose predicates in an insn I are propagated to all
> >> >>     patterns that might result from combining I with something else.
> >> >> 
> >> >> It sounds like the first problem (not generating ld1r) could be fixed by (a)
> >> >> combining aarch64_simd_dup<mode> and *aarch64_simd_ld1r<mode>, so
> >> >> that the register and memory alternatives are in the same pattern and (b)
> >> >> using the merged instruction(s) to implement the vec_duplicate optab.
> >> >> Target-independent code should then make the address satisfy the
> >> >> predicate, simplifying the address where necessary.
> >> >> 
> >> >
> >> > I think I am likely missing something here. I would assume that you wanted
> >> > to use the optab to split the addressing off from the mem expression so the
> >> > combined insn matches.
> >> >
> >> > But in that case, why do you need to combine the two instructions?
> >> > I've tried and it doesn't work since the vec_duplicate optab doesn't see the
> >> > mem as op1, because in gimple the mem is not part of the duplicate.
> >> >
> >> > So you still just see:
> >> >
> >> >>>> dbgrtx (ops[1].value)
> >> > (subreg/s/v:QI (reg:SI 92 [ _3 ]) 0)
> >> >
> >> > As the operand as the argument to the dup is just an SSA_NAME.
> >> 
> >> Ah, yeah, I'd forgotten that fixed-length vec_duplicates would
> >> come from a constructor rather than a vec_duplicate_expr, so we don't
> >> get the usual benefit of folding single-use mems during expand.
> >> 
> >> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595362.html
> >> moves towards using vec_duplicate even for fixed-length vectors.
> >> If we take that approach, then I suppose a plain constructor
> >> should be folded to a vec_duplicate where possible.
> >> 
> >> (Alternatively, we could use an extended vec_perm_expr with
> >> scalar inputs, as Richi suggested in that thread.)
> >> 
> >> If we don't do that, or don't do it yet, then…
> >
> > I suppose since we alrady have vec_duplicate we can just use it ...
> > what was the reason to not do this originally?
> 
> There just wasn't any specific benefit for fixed-length vectors at the
> time, and obvious potential problems -- introducing VEC_DUPLICATE_EXPRs
> too early would lose out on existing CONSTRUCTOR-based folds.
> 
> Also, isel didn't exist at the time that vec_duplicate was added, but it
> seems like it might be a good place to do the replacement.
> 
> Match rules that want to test for a uniform vector operand can already
> use vec_same_elem_p to handle all representations, but perhaps we also
> need a way of generating the “right” form of duplicate for the current
> stage in the pass pipeline?

I think we can have vec_duplicate without native target support by
expanding via CONSTRUCTOR, so vec_duplicate would be the correct
one at all stages and we fixup during RTL expansion directly.

As you noted most targets don't implement vec_duplicate yet.

> > I suppose the
> > vec_duplicate expander has a fallback via store_constuctor?
> >
> > Originally I wanted to avoid multiple ways to express the same thing
> > but vec_duplicate is a common enough special-case and it also
> > usually maps to a special instruction in vector ISAs.
> > There's VIEW_CONVERT vs. vec_duplicate for V1m modes then, I
> > suppose VIEW_CONVERT is more canonical here.
> 
> Is that already true for V1m constructors?  (view_convert being
> canonical and constructors not, I mean.)

I think so, yes.

> What do you think about the suggestion in the other thread of making
> VEC_PERM_EXPR take an arbitrary number of inputs, with (as you suggested)
> the inputs allowed to be scalars rather than vectors?  VEC_PERM_EXPR
> could then replace both CONSTRUCTOR and VEC_DUPLICATE_EXPR and “optimising”
> a normal constructor to a duplicate would just be a case of removing
> repeated scalar inputs.

It's indeed somewhat appealing to make VEC_PERM a swiss army knife.
I'm not sure about making it a VL tree though, currently it's
a nice GIMPLE ternary while VL would make it a SINGLE RHS with
a GENERIC tree (unless we introduce a gimple_vec_perm special
gimple node).  That said, allowing scalars as VEC_PERM inputs
to get rid of VEC_DUPLICATE only will still leave us with the
VIEW_CONVERT special case.

At some point we might want to help targets with "interesting"
ISAs by lowering VEC_PERM to supported .VEC_PERM_CONSTs and
relaxing what permutes we allow earlier in the pipeline
(I'm thinking of x86 with its many special permutation ops
and the open-coded vec-perm-const expander).

So not sure what to do, but I'm happy to widen VEC_DUPLICATE_EXPR use.

Richard.

next prev parent reply	other threads:[~2022-06-13  8:38 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-08  9:38 Tamar Christina
2022-06-08 10:31 ` Richard Sandiford
2022-06-08 13:51   ` Tamar Christina
2022-06-08 14:35     ` Richard Sandiford
2022-06-09  7:42       ` Tamar Christina
2022-06-09  8:22         ` Richard Sandiford
2022-06-09  8:43           ` Tamar Christina
2022-06-13  8:00       ` Richard Biener
2022-06-13  8:26         ` Richard Sandiford
2022-06-13  8:38           ` Richard Biener [this message]
2022-06-13  9:51             ` Tamar Christina
2022-06-13 11:50               ` Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1821q4nr-nsp9-r811-229q-s3s447srq09n@fhfr.qr \
    --to=rguenther@suse.de \
    --cc=Kyrylo.Tkachov@arm.com \
    --cc=Marcus.Shawcroft@arm.com \
    --cc=Richard.Earnshaw@arm.com \
    --cc=Tamar.Christina@arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=nd@arm.com \
    --cc=richard.sandiford@arm.com \
    --cc=roger@eyesopen.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).