From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Sandiford <Richard.Sandiford@arm.com>
Cc: Richard Biener <richard.guenther@gmail.com>,
"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
nd <nd@arm.com>, Richard Earnshaw <Richard.Earnshaw@arm.com>,
Marcus Shawcroft <Marcus.Shawcroft@arm.com>,
"ktkachov@gcc.gnu.org" <ktkachov@gcc.gnu.org>
Subject: RE: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.
Date: Thu, 16 May 2024 02:45:18 +0000 [thread overview]
Message-ID: <VI1PR08MB53250391E2D0D4AB3A5D4AE2FFED2@VI1PR08MB5325.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <mptcypmhie8.fsf@arm.com>
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Wednesday, May 15, 2024 10:31 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Richard Biener <richard.guenther@gmail.com>; gcc-patches@gcc.gnu.org; nd
> <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>; Marcus
> Shawcroft <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
> Subject: Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain
> operations.
>
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
> >> >> <tamar.christina@arm.com> wrote:
> >> >> >
> >> >> > Hi All,
> >> >> >
> >> >> > Some Neoverse Software Optimization Guides (SWoG) have a clause
> >> >> > stating that, for predicated operations that also produce a
> >> >> > predicate, it is preferred that the codegen use a different register
> >> >> > for the destination than that of the input predicate, in order to
> >> >> > avoid a performance overhead.
> >> >> >
> >> >> > This of course has the problem that it increases register pressure
> >> >> > and so should be done with care.  Additionally, not all
> >> >> > micro-architectures have this consideration, so it shouldn't be
> >> >> > done by default.
> >> >> >
> >> >> > The patch series adds support for doing conditional early clobbers
> >> >> > through a combination of new alternatives and attributes to control
> >> >> > their availability.
> >> >>
> >> >> You could have two alternatives, one with early clobber and one with
> >> >> a matching constraint where you'd disparage the matching constraint one?
> >> >>
> >> >
> >> > Yeah, that's what I do, though there's no need to disparage the
> >> > non-early-clobber alternative, as the early clobber alternative will
> >> > naturally get a penalty if it needs a reload.
> >>
> >> But I think Richard's suggestion was to disparage the one with a matching
> >> constraint (not the earlyclobber), to reflect the increased cost of
> >> reusing the register.
> >>
> >> We did take that approach for gathers, e.g.:
> >>
> >> [&w, Z, w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
> >> [?w, Z, 0, Ui1, Ui1, Upl] ^
> >>
> >> The (supposed) advantage is that, if register pressure is so tight
> >> that using matching registers is the only alternative, we still
> >> have the opportunity to do that, as a last resort.
> >>
> >> Providing only an earlyclobber version means that using the same
> >> register is prohibited outright. If no other register is free, the RA
> >> would need to spill something else to free up a temporary register.
> >> And it might then do the equivalent of (pseudo-code):
> >>
> >> not p1.b, ..., p0.b
> >> mov p0.d, p1.d
> >>
> >> after spilling what would otherwise have occupied p1.  In that
> >> situation it would be better to use:
> >>
> >> not p0.b, ..., p0.b
> >>
> >> and not introduce the spill of p1.
> >
> > I think I understood what Richi meant, but I thought it was already
> > working that way.
>
> The suggestion was to use matching constraints (like "0") though,
> whereas the patch doesn't. I think your argument is that you don't
> need to use matching constraints. But that's different from the
> suggestion (and from how we handle gathers).
>
> I was going to say in response to patch 3 (but got distracted, sorry):
> I don't think we should have:
>
> &Upa, Upa, ...
> Upa, Upa, ...
>
> (taken from the pure logic ops) enabled at the same time. Even though
> it works for the testcases, I don't think it has well-defined semantics.
>
> The problem is that, taken on its own, the second alternative says that
> matching operands are free. And fundamentally, I don't think the costs
> *must* take the earlyclobber alternative over the non-earlyclobber one
> (when costing during IRA, for instance). In principle, the cheapest
> is best.
>
> The aim of the gather approach is to make each alternative correct in
> isolation. In:
>
> [&w, Z, w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
> [?w, Z, 0, Ui1, Ui1, Upl] ^
>
> the second alternative says that it is possible to have operands 0
> and 2 be the same vector register, but using that version has the
> cost of an extra reload. In that sense the alternatives are
> (essentially) consistent about the restriction.
>
Oh I see!  Sorry, I read over the explicit tie in the first mail.  I
understand now: the idea is to explicitly model both the tied and non-tied
versions.  Got it.
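To check my understanding, the gather shape generalises to the predicate case
roughly as follows (my own sketch in the compact syntax; the constraints and
placement are illustrative, not taken from any posted patch):

    [ &Upa , Upl , ... ] <insn>   ;; earlyclobber: dest must differ from inputs
    [ ?Upa , 0   , ... ] ^        ;; tied: dest explicitly reuses the input
                                  ;; predicate, disparaged with ? so it costs
                                  ;; the equivalent of an extra reload

i.e. the tied alternative stays available as a last resort under high register
pressure, but the ? means it only wins when the earlyclobber form would force
a spill.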
> > i.e. as one of the testcases I had:
> >
> >> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 -ffixed-p[1-15]
> >
> > foo:
> > mov z31.h, w0
> > ptrue p0.b, all
> > cmplo p0.h, p0/z, z0.h, z31.h
> > b use
> >
> > and reload did not force a spill.
> >
> > My understanding of how this works, and how it seems to be working, is
> > that since reload costs alternatives from front to back, the cheapest one
> > wins and it stops evaluating the rest.
> >
> > The early clobber case is first and preferred; however, when it's not
> > possible, i.e. it requires a non-pseudo reload, the reload cost is added
> > to the alternative.
> >
> > However you're right that in the following testcase:
> >
> > -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5
> > -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11
> > -ffixed-p12 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p14 -fdump-rtl-reload
> >
> > i.e. giving it an extra free register inexplicably causes a spill:
> >
> > foo:
> > addvl sp, sp, #-1
> > mov z31.h, w0
> > ptrue p0.b, all
> > str p15, [sp]
> > cmplo p15.h, p0/z, z0.h, z31.h
> > mov p0.b, p15.b
> > ldr p15, [sp]
> > addvl sp, sp, #1
> > b use
> >
> > so that's unexpected and very weird, as p15 has no defined value.
>
> This is because the function implicitly uses the SVE PCS, and so needs
> to preserve p15 for the caller. It looks like the correct behaviour.
>
> > Now adding the ? as suggested to the non-early-clobber alternative does
> > not fix it, and my mental model for how this is supposed to work does not
> > quite line up.  Why would making the non-clobber alternative even more
> > expensive help it during high register pressure?
>
> Hopefully the above answers this. The non-clobber alternative has
> zero extra cost as things stand. The costs from one alternative
> (the earlyclobber one) don't carry forward to other alternatives.
>
> > But with that suggestion the above case does not get fixed, and the
> > following case
> >
> > -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5
> > -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11
> > -ffixed-p12 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p15 -fdump-rtl-reload
> >
> > ICEs:
> >
> > pred-clobber.c: In function 'foo':
> > pred-clobber.c:9:1: error: unable to find a register to spill
> > 9 | }
> > | ^
> > pred-clobber.c:9:1: error: this is the insn:
> > (insn 10 22 19 2 (parallel [
> >             (set (reg:VNx8BI 110 [104])
> >                 (unspec:VNx8BI [
> >                         (reg:VNx8BI 112)
> >                         (const_int 1 [0x1])
> >                         (ltu:VNx8BI (reg:VNx8HI 32 v0)
> >                             (reg:VNx8HI 63 v31))
> >                     ] UNSPEC_PRED_Z))
> >             (clobber (reg:CC_NZC 66 cc))
> >         ]) "pred-clobber.c":7:19 8687 {aarch64_pred_cmplovnx8hi}
> >      (expr_list:REG_DEAD (reg:VNx8BI 112)
> >         (expr_list:REG_DEAD (reg:VNx8HI 63 v31)
> >             (expr_list:REG_DEAD (reg:VNx8HI 32 v0)
> >                 (expr_list:REG_UNUSED (reg:CC_NZC 66 cc)
> >                     (nil))))))
> > during RTL pass: reload
> > dump file: pred-clobber.c.315r.reload
>
> Which pattern did you use?
>
> > and this is because the use of ? has the unintended side-effect of
> > blocking a register class entirely during sched1, as we've recently
> > discovered; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114766
>
> (Is sched1 the problem here, or is it purely an RA thing? What happens
> when scheduling is disabled?)
>
> > In this case it marked the alternative as NO_REGS during sched1, so it's
> > completely dead.  The use of ? alternatives has caused quite a lot of
> > code bloat because of this unexpected and undocumented behavior.
> >
> > To me,
> >
> > diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
> > index 93ec59e58af..2ee3d8ea35e 100644
> > --- a/gcc/config/aarch64/aarch64-sve.md
> > +++ b/gcc/config/aarch64/aarch64-sve.md
> > @@ -8120,10 +8120,10 @@ (define_insn "@aarch64_pred_cmp<cmp_op><mode>"
> >     (clobber (reg:CC_NZC CC_REGNUM))]
> >    "TARGET_SVE"
> >    {@ [ cons: =0 , 1   , 3 , 4             ; attrs: pred_clobber ]
> > -     [ &Upa    , Upl , w , <sve_imm_con>  ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
> > -     [ Upa     , Upl , w , <sve_imm_con>  ; *                   ] ^
> > -     [ &Upa    , Upl , w , w              ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
> > -     [ Upa     , Upl , w , w              ; *                   ] ^
> > +     [ ^&Upa   , Upl , w , <sve_imm_con>  ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
> > +     [ Upa     , Upl , w , <sve_imm_con>  ; *                   ] ^
> > +     [ ^&Upa   , Upl , w , w              ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
> > +     [ Upa     , Upl , w , w              ; *                   ] ^
> >    }
> > )
> >
> > Would have been the right approach, i.e. we prefer the alternative unless
> > a reload is needed, which should work, no?  Well, if ^ weren't broken the
> > same way as ?.  Perhaps I need to use Wilco's new alternative that
> > doesn't block a register class?
>
> Hmm, I'm not sure. It seems odd to mark only the output with ^, since
> reloading the output isn't fundamentally different (costwise) from
> reloading the input.
>
> But to me, it's the alternative without the earlyclobber that should be
> disparaged, since it's the inherently expensive one.
>
> The gather-like approach would be something like:
>
>   [ &Upa , Upl , w , <sve_imm_con> ; yes ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
>   [ ?Upl , 0   , w , <sve_imm_con> ; yes ] ^
>   [ Upa  , Upl , w , <sve_imm_con> ; no  ] ^
>   [ &Upa , Upl , w , w             ; yes ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
>   [ ?Upl , 0   , w , w             ; yes ] ^
>   [ Upa  , Upl , w , w             ; no  ] ^
>
> with:
>
> (define_attr "pred_clobber" "any,no,yes" (const_string "any"))
Yeah, this makes sense to me.  Sorry, I completely misunderstood: the
alternative with the tie was suggested in addition to, and not instead of,
the earlyclobber one.  I'll respin the patches this way.

Thanks both!
Tamar
>
> Thanks,
> Richard
Thread overview: 19+ messages
2024-05-15 10:28 Tamar Christina
2024-05-15 10:28 ` [PATCH 1/4]AArch64: convert several predicate patterns to new compact syntax Tamar Christina
2024-05-15 10:35 ` Kyrill Tkachov
2024-05-15 11:06 ` Richard Sandiford
2024-05-15 10:28 ` [PATCH 2/4]AArch64: add new tuning param and attribute for enabling conditional early clobber Tamar Christina
2024-05-15 10:56 ` Richard Sandiford
2024-05-15 11:03 ` Tamar Christina
2024-05-22 9:29 ` Tamar Christina
2024-05-28 9:37 ` Tamar Christina
2024-05-30 14:59 ` Richard Sandiford
2024-05-15 10:29 ` [PATCH 3/4]AArch64: add new alternative with early clobber to patterns Tamar Christina
2024-05-15 10:29 ` [PATCH 4/4]AArch64: enable new predicate tuning for Neoverse cores Tamar Christina
2024-05-15 11:20 ` [PATCH 0/4]AArch64: support conditional early clobbers on certain operations Richard Biener
2024-05-15 11:23 ` Tamar Christina
2024-05-15 14:51 ` Richard Sandiford
2024-05-15 15:56 ` Tamar Christina
2024-05-15 21:31 ` Richard Sandiford
2024-05-16 2:45 ` Tamar Christina [this message]
2024-05-21 3:24 ` Tamar Christina