From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Sandiford <Richard.Sandiford@arm.com>
Cc: Richard Biener <richard.guenther@gmail.com>,
"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
nd <nd@arm.com>, Richard Earnshaw <Richard.Earnshaw@arm.com>,
Marcus Shawcroft <Marcus.Shawcroft@arm.com>,
"ktkachov@gcc.gnu.org" <ktkachov@gcc.gnu.org>
Subject: RE: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.
Date: Thu, 16 May 2024 02:45:18 +0000 [thread overview]
Message-ID: <VI1PR08MB53250391E2D0D4AB3A5D4AE2FFED2@VI1PR08MB5325.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <mptcypmhie8.fsf@arm.com>
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Wednesday, May 15, 2024 10:31 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Richard Biener <richard.guenther@gmail.com>; gcc-patches@gcc.gnu.org; nd
> <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>; Marcus
> Shawcroft <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
> Subject: Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain
> operations.
>
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
> >> >> <tamar.christina@arm.com> wrote:
> >> >> >
> >> >> > Hi All,
> >> >> >
> >> >> > Some Neoverse Software Optimization Guides (SWoG) have a clause
> >> >> > stating that, for predicated operations that also produce a
> >> >> > predicate, it is preferred that the codegen use a different register
> >> >> > for the destination than that of the input predicate, in order to
> >> >> > avoid a performance overhead.
> >> >> >
> >> >> > This of course has the problem that it increases register pressure
> >> >> > and so should be done with care.  Additionally, not all
> >> >> > micro-architectures have this consideration, so it shouldn't be
> >> >> > done by default.
> >> >> >
> >> >> > The patch series adds support for doing conditional early clobbers
> >> >> > through a combination of new alternatives and attributes to control
> >> >> > their availability.
> >> >>
> >> >> You could have two alternatives, one with early clobber and one with
> >> >> a matching constraint where you'd disparage the matching constraint one?
> >> >>
> >> >
> >> > Yeah, that's what I do, though there's no need to disparage the
> >> > non-early-clobber alternative, as the early clobber alternative will
> >> > naturally get a penalty if it needs a reload.
> >>
> >> But I think Richard's suggestion was to disparage the one with a matching
> >> constraint (not the earlyclobber), to reflect the increased cost of
> >> reusing the register.
> >>
> >> We did take that approach for gathers, e.g.:
> >>
> >> [&w, Z, w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
> >> [?w, Z, 0, Ui1, Ui1, Upl] ^
> >>
> >> The (supposed) advantage is that, if register pressure is so tight
> >> that using matching registers is the only alternative, we still
> >> have the opportunity to do that, as a last resort.
> >>
> >> Providing only an earlyclobber version means that using the same
> >> register is prohibited outright. If no other register is free, the RA
> >> would need to spill something else to free up a temporary register.
> >> And it might then do the equivalent of (pseudo-code):
> >>
> >> not p1.b, ..., p0.b
> >> mov p0.d, p1.d
> >>
> >> after spilling what would otherwise have occupied p1.  In that
> >> situation it would be better to use:
> >>
> >> not p0.b, ..., p0.b
> >>
> >> and not introduce the spill of p1.
> >
> > I think I understood what Richi meant, but I thought it was already
> > working that way.
>
> The suggestion was to use matching constraints (like "0") though,
> whereas the patch doesn't. I think your argument is that you don't
> need to use matching constraints. But that's different from the
> suggestion (and from how we handle gathers).
>
> I was going to say in response to patch 3 (but got distracted, sorry):
> I don't think we should have:
>
> &Upa, Upa, ...
> Upa, Upa, ...
>
> (taken from the pure logic ops) enabled at the same time. Even though
> it works for the testcases, I don't think it has well-defined semantics.
>
> The problem is that, taken on its own, the second alternative says that
> matching operands are free. And fundamentally, I don't think the costs
> *must* take the earlyclobber alternative over the non-earlyclobber one
> (when costing during IRA, for instance). In principle, the cheapest
> is best.
>
> The aim of the gather approach is to make each alternative correct in
> isolation. In:
>
> [&w, Z, w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
> [?w, Z, 0, Ui1, Ui1, Upl] ^
>
> the second alternative says that it is possible to have operands 0
> and 2 be the same vector register, but using that version has the
> cost of an extra reload. In that sense the alternatives are
> (essentially) consistent about the restriction.
>
Oh I see!  Sorry, I read over the explicit tie in the first mail.  I
understand now: the idea is to explicitly model both the tied and non-tied
versions.  Got it.
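To check my understanding, the gather shape generalises to the predicate case
roughly as follows (my own sketch in the compact syntax; the constraints and
placement are illustrative, not taken from any posted patch):

    [ &Upa , Upl , ... ] <insn>   ;; earlyclobber: dest must differ from inputs
    [ ?Upa , 0   , ... ] ^        ;; tied: dest explicitly reuses the input
                                  ;; predicate, disparaged with ? so it costs
                                  ;; the equivalent of an extra reload

i.e. the tied alternative stays available as a last resort under high register
pressure, but the ? means it only wins when the earlyclobber form would force
a spill.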
> > i.e. as one of the testcases I had:
> >
> >> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 -ffixed-p[1-15]
> >
> > foo:
> > mov z31.h, w0
> > ptrue p0.b, all
> > cmplo p0.h, p0/z, z0.h, z31.h
> > b use
> >
> > and reload did not force a spill.
> >
> > My understanding of how this works, and how it seems to be working, is
> > that since reload costs alternatives from front to back, the cheapest one
> > wins and it stops evaluating the rest.
> >
> > The early clobber case is first and preferred; however, when it's not
> > possible, i.e. it requires a non-pseudo reload, the reload cost is added
> > to the alternative.
> >
> > However you're right that in the following testcase:
> >
> > -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5
> > -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11
> > -ffixed-p12 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p14 -fdump-rtl-reload
> >
> > i.e. giving it an extra free register inexplicably causes a spill:
> >
> > foo:
> > addvl sp, sp, #-1
> > mov z31.h, w0
> > ptrue p0.b, all
> > str p15, [sp]
> > cmplo p15.h, p0/z, z0.h, z31.h
> > mov p0.b, p15.b
> > ldr p15, [sp]
> > addvl sp, sp, #1
> > b use
> >
> > so that's unexpected and very weird, as p15 has no defined value.
>
> This is because the function implicitly uses the SVE PCS, and so needs
> to preserve p15 for the caller. It looks like the correct behaviour.
>
> > Now adding the ? as suggested to the non-early-clobber alternative does
> > not fix it, and my mental model for how this is supposed to work does not
> > quite line up.  Why would making the non-clobber alternative even more
> > expensive help it during high register pressure?
>
> Hopefully the above answers this. The non-clobber alternative has
> zero extra cost as things stand. The costs from one alternative
> (the earlyclobber one) don't carry forward to other alternatives.
>
> > But with that suggestion the above case does not get fixed, and the
> > following case
> >
> > -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5
> > -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11
> > -ffixed-p12 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p15 -fdump-rtl-reload
> >
> > ICEs:
> >
> > pred-clobber.c: In function 'foo':
> > pred-clobber.c:9:1: error: unable to find a register to spill
> > 9 | }
> > | ^
> > pred-clobber.c:9:1: error: this is the insn:
> > (insn 10 22 19 2 (parallel [
> >             (set (reg:VNx8BI 110 [104])
> >                 (unspec:VNx8BI [
> >                         (reg:VNx8BI 112)
> >                         (const_int 1 [0x1])
> >                         (ltu:VNx8BI (reg:VNx8HI 32 v0)
> >                             (reg:VNx8HI 63 v31))
> >                     ] UNSPEC_PRED_Z))
> >             (clobber (reg:CC_NZC 66 cc))
> >         ]) "pred-clobber.c":7:19 8687 {aarch64_pred_cmplovnx8hi}
> >      (expr_list:REG_DEAD (reg:VNx8BI 112)
> >         (expr_list:REG_DEAD (reg:VNx8HI 63 v31)
> >             (expr_list:REG_DEAD (reg:VNx8HI 32 v0)
> >                 (expr_list:REG_UNUSED (reg:CC_NZC 66 cc)
> >                     (nil))))))
> > during RTL pass: reload
> > dump file: pred-clobber.c.315r.reload
>
> Which pattern did you use?
>
> > and this is because the use of ? has the unintended side-effect of
> > blocking a register class entirely during sched1, as we've recently
> > discovered; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114766
>
> (Is sched1 the problem here, or is it purely an RA thing? What happens
> when scheduling is disabled?)
>
> > In this case it marked the alternative as NO_REGS during sched1, so it's
> > completely dead.  The use of ? alternatives has caused quite a lot of
> > code bloat because of this unexpected and undocumented behavior.
> >
> > To me,
> >
> > diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
> > index 93ec59e58af..2ee3d8ea35e 100644
> > --- a/gcc/config/aarch64/aarch64-sve.md
> > +++ b/gcc/config/aarch64/aarch64-sve.md
> > @@ -8120,10 +8120,10 @@ (define_insn "@aarch64_pred_cmp<cmp_op><mode>"
> >     (clobber (reg:CC_NZC CC_REGNUM))]
> >    "TARGET_SVE"
> >    {@ [ cons: =0 , 1   , 3 , 4             ; attrs: pred_clobber ]
> > -     [ &Upa    , Upl , w , <sve_imm_con>  ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
> > -     [ Upa     , Upl , w , <sve_imm_con>  ; *                   ] ^
> > -     [ &Upa    , Upl , w , w              ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
> > -     [ Upa     , Upl , w , w              ; *                   ] ^
> > +     [ ^&Upa   , Upl , w , <sve_imm_con>  ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
> > +     [ Upa     , Upl , w , <sve_imm_con>  ; *                   ] ^
> > +     [ ^&Upa   , Upl , w , w              ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
> > +     [ Upa     , Upl , w , w              ; *                   ] ^
> >    }
> > )
> >
> > Would have been the right approach, i.e. we prefer the alternative unless
> > a reload is needed, which should work, no?  Well, if ^ weren't broken the
> > same way as ?.  Perhaps I need to use Wilco's new alternative that
> > doesn't block a register class?
>
> Hmm, I'm not sure. It seems odd to mark only the output with ^, since
> reloading the output isn't fundamentally different (costwise) from
> reloading the input.
>
> But to me, it's the alternative without the earlyclobber that should be
> disparaged, since it's the inherently expensive one.
>
> The gather-like approach would be something like:
>
>   [ &Upa , Upl , w , <sve_imm_con> ; yes ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
>   [ ?Upl , 0   , w , <sve_imm_con> ; yes ] ^
>   [ Upa  , Upl , w , <sve_imm_con> ; no  ] ^
>   [ &Upa , Upl , w , w             ; yes ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
>   [ ?Upl , 0   , w , w             ; yes ] ^
>   [ Upa  , Upl , w , w             ; no  ] ^
>
> with:
>
> (define_attr "pred_clobber" "any,no,yes" (const_string "any"))
Yeah, this makes sense to me.  Sorry, I completely misunderstood: the
alternative with the tie was suggested in addition to, and not instead of,
the earlyclobber one.  I'll respin the patches this way.

Thanks both!
Tamar
>
> Thanks,
> Richard
Thread overview: 19+ messages
2024-05-15 10:28 Tamar Christina
2024-05-15 10:28 ` [PATCH 1/4]AArch64: convert several predicate patterns to new compact syntax Tamar Christina
2024-05-15 10:35 ` Kyrill Tkachov
2024-05-15 11:06 ` Richard Sandiford
2024-05-15 10:28 ` [PATCH 2/4]AArch64: add new tuning param and attribute for enabling conditional early clobber Tamar Christina
2024-05-15 10:56 ` Richard Sandiford
2024-05-15 11:03 ` Tamar Christina
2024-05-22 9:29 ` Tamar Christina
2024-05-28 9:37 ` Tamar Christina
2024-05-30 14:59 ` Richard Sandiford
2024-05-15 10:29 ` [PATCH 3/4]AArch64: add new alternative with early clobber to patterns Tamar Christina
2024-05-15 10:29 ` [PATCH 4/4]AArch64: enable new predicate tuning for Neoverse cores Tamar Christina
2024-05-15 11:20 ` [PATCH 0/4]AArch64: support conditional early clobbers on certain operations Richard Biener
2024-05-15 11:23 ` Tamar Christina
2024-05-15 14:51 ` Richard Sandiford
2024-05-15 15:56 ` Tamar Christina
2024-05-15 21:31 ` Richard Sandiford
2024-05-16 2:45 ` Tamar Christina [this message]
2024-05-21 3:24 ` Tamar Christina