RE: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Sandiford <Richard.Sandiford@arm.com>
Cc: Richard Biener <richard.guenther@gmail.com>,
	"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	nd <nd@arm.com>, Richard Earnshaw <Richard.Earnshaw@arm.com>,
	Marcus Shawcroft <Marcus.Shawcroft@arm.com>,
	"ktkachov@gcc.gnu.org" <ktkachov@gcc.gnu.org>
Subject: RE: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.
Date: Wed, 15 May 2024 15:56:11 +0000	[thread overview]
Message-ID: <VI1PR08MB53255D5B497013562821DF16FFEC2@VI1PR08MB5325.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <mpta5krjfhj.fsf@arm.com>

> >> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
> >> <tamar.christina@arm.com> wrote:
> >> >
> >> > Hi All,
> >> >
> >> > Some Neoverse Software Optimization Guides (SWoG) have a clause that state
> >> > that for predicated operations that also produce a predicate it is preferred
> >> > that the codegen should use a different register for the destination than that
> >> > of the input predicate in order to avoid a performance overhead.
> >> >
> >> > This of course has the problem that it increases register pressure and so
> should
> >> > be done with care.  Additionally not all micro-architectures have this
> >> > consideration and so it shouldn't be done as a default thing.
> >> >
> >> > The patch series adds support for doing conditional early clobbers through a
> >> > combination of new alternatives and attributes to control their availability.
> >>
> >> You could have two alternatives, one with early clobber and one with
> >> a matching constraint where you'd disparage the matching constraint one?
> >>
> >
> > Yeah, that's what I do, though there's no need to disparage the non-early clobber
> > alternative as the early clobber alternative will naturally get a penalty if it needs a
> > reload.
> 
> But I think Richard's suggestion was to disparage the one with a matching
> constraint (not the earlyclobber), to reflect the increased cost of
> reusing the register.
> 
> We did take that approach for gathers, e.g.:
> 
>      [&w, Z,   w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
>      [?w, Z,   0, Ui1, Ui1, Upl] ^
> 
> The (supposed) advantage is that, if register pressure is so tight
> that using matching registers is the only alternative, we still
> have the opportunity to do that, as a last resort.
> 
> Providing only an earlyclobber version means that using the same
> register is prohibited outright.  If no other register is free, the RA
> would need to spill something else to free up a temporary register.
> And it might then do the equivalent of (pseudo-code):
> 
>       not p1.b, ..., p0.b
>       mov p0.d, p1.d
> 
> after spilling what would otherwise have occupied p1.  In that
> situation it would be better use:
> 
>       not p0.b, ..., p0.b
> 
> and not introduce the spill of p1.

I think I understood what Richi meant, but I thought it was already working that way.
i.e. as one of the testcases I had:

> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 -ffixed-p[1-15]

foo:
        mov     z31.h, w0
        ptrue   p0.b, all
        cmplo   p0.h, p0/z, z0.h, z31.h
        b       use

and reload did not force a spill.

My understanding of how this works, and how it seems to be working is that since reload costs
Alternative from front to back the cheapest one wins and it stops evaluating the rest.

The early clobber case is first and preferred, however when it's not possible, i.e. requires a non-pseudo
reload, the reload cost is added to the alternative.

However you're right that in the following testcase:

-mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5 -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 -ffixed-p12 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p14 -fdump-rtl-reload

i.e. giving it an extra free register inexplicably causes a spill:

foo:
        addvl   sp, sp, #-1
        mov     z31.h, w0
        ptrue   p0.b, all
        str     p15, [sp]
        cmplo   p15.h, p0/z, z0.h, z31.h
        mov     p0.b, p15.b
        ldr     p15, [sp]
        addvl   sp, sp, #1
        b       use

so that's unexpected and is very weird as p15 has no defined value..

Now adding the ? as suggested to the non-early clobber alternative does not fix it, and my mental model for how this is supposed to work does not quite line up..
Why would making the non-clobber alternative even more expensive help it during high register pressure?? But with that suggestion the above case does not get fixed
and the following case

-mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5 -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 -ffixed-p12 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p15 -fdump-rtl-reload

ICEs:

pred-clobber.c: In function 'foo':
pred-clobber.c:9:1: error: unable to find a register to spill
    9 | }
      | ^
pred-clobber.c:9:1: error: this is the insn:
(insn 10 22 19 2 (parallel [
            (set (reg:VNx8BI 110 [104])
                (unspec:VNx8BI [
                        (reg:VNx8BI 112)
                        (const_int 1 [0x1])
                        (ltu:VNx8BI (reg:VNx8HI 32 v0)
                            (reg:VNx8HI 63 v31))
                    ] UNSPEC_PRED_Z))
            (clobber (reg:CC_NZC 66 cc))
        ]) "pred-clobber.c":7:19 8687 {aarch64_pred_cmplovnx8hi}
     (expr_list:REG_DEAD (reg:VNx8BI 112)
        (expr_list:REG_DEAD (reg:VNx8HI 63 v31)
            (expr_list:REG_DEAD (reg:VNx8HI 32 v0)
                (expr_list:REG_UNUSED (reg:CC_NZC 66 cc)
                    (nil))))))
during RTL pass: reload
dump file: pred-clobber.c.315r.reload

and this is because the use of ? has the unintended side-effect of blocking a register class entirely during Sched1 as we've recently discovered.
i.e. see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114766

in this case it marked the alternative as NO_REGS during sched1 and so it's completely dead.
the use of the ? alternatives has caused quite the code bloat as we've recently discovered because of this unexpected and undocumented behavior.

To me,

diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index 93ec59e58af..2ee3d8ea35e 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -8120,10 +8120,10 @@ (define_insn "@aarch64_pred_cmp<cmp_op><mode>"
    (clobber (reg:CC_NZC CC_REGNUM))]
   "TARGET_SVE"
   {@ [ cons: =0 , 1   , 3 , 4            ; attrs: pred_clobber ]
-     [ &Upa     , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
-     [ Upa      , Upl , w , <sve_imm_con>; *                   ] ^
-     [ &Upa     , Upl , w , w            ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
-     [ Upa      , Upl , w , w            ; *                   ] ^
+     [ ^&Upa    , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
+     [  Upa     , Upl , w , <sve_imm_con>; *                   ] ^
+     [ ^&Upa    , Upl , w , w            ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
+     [  Upa     , Upl , w , w            ; *                   ] ^
   }
 )

Would have been the right approach, i.e. we prefer the alternative unless a reload is needed, which should work no? well. if ^ wasn't broken the same way
as ?.  Perhaps I need to use Wilco's new alternative that doesn't block a register class?

But I'm probably missing something...

> 
> Another case where using matching registers is natural is for
> loop-carried dependencies.  Do we want to keep them in:
> 
>    loop:
>       ...no other sets of p0....
>       not p0.b, ..., p0.b
>       ...no other sets of p0....
>       bne loop
> 
> or should we split it to:
> 
>    loop:
>       ...no other sets of p0....
>       not p1.b, ..., p0.b
>       mov p0.d, p1.d
>       ...no other sets of p0....
>       bne loop
> 
> ?

On the uarches that this affects they are equivalent (I'm happy to expand on this internally if you'd like),
So in those cases the first one is preferred as it won't matter.

Thanks for the review an explanation!

Tamar

> 
> Thanks,
> Richard
> 
> >
> > Cheers,
> > Tamar
> >
> >> > On high register pressure we also use LRA's costing to prefer not to use the
> >> > alternative and instead just use the tie as this is preferable to a reload.
> >> >
> >> > Concretely this patch series does:
> >> >
> >> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2
> >> >
> >> > foo:
> >> >         mov     z31.h, w0
> >> >         ptrue   p3.b, all
> >> >         cmplo   p0.h, p3/z, z0.h, z31.h
> >> >         b       use
> >> >
> >> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n1+sve
> >> >
> >> > foo:
> >> >         mov     z31.h, w0
> >> >         ptrue   p0.b, all
> >> >         cmplo   p0.h, p0/z, z0.h, z31.h
> >> >         b       use
> >> >
> >> > > aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 -
> >> ffixed-p[1-15]
> >> >
> >> > foo:
> >> >         mov     z31.h, w0
> >> >         ptrue   p0.b, all
> >> >         cmplo   p0.h, p0/z, z0.h, z31.h
> >> >         b       use
> >> >
> >> > Testcases for the changes are in the last patch of the series.
> >> >
> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >> >
> >> > Thanks,
> >> > Tamar
> >> >
> >> > ---
> >> >
> >> > --

next prev parent reply	other threads:[~2024-05-15 15:56 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-15 10:28 Tamar Christina
2024-05-15 10:28 ` [PATCH 1/4]AArch64: convert several predicate patterns to new compact syntax Tamar Christina
2024-05-15 10:35   ` Kyrill Tkachov
2024-05-15 11:06   ` Richard Sandiford
2024-05-15 10:28 ` [PATCH 2/4]AArch64: add new tuning param and attribute for enabling conditional early clobber Tamar Christina
2024-05-15 10:56   ` Richard Sandiford
2024-05-15 11:03     ` Tamar Christina
2024-05-22  9:29     ` Tamar Christina
2024-05-28  9:37       ` Tamar Christina
2024-05-30 14:59         ` Richard Sandiford
2024-05-15 10:29 ` [PATCH 3/4]AArch64: add new alternative with early clobber to patterns Tamar Christina
2024-05-15 10:29 ` [PATCH 4/4]AArch64: enable new predicate tuning for Neoverse cores Tamar Christina
2024-05-15 11:20 ` [PATCH 0/4]AArch64: support conditional early clobbers on certain operations Richard Biener
2024-05-15 11:23   ` Tamar Christina
2024-05-15 14:51     ` Richard Sandiford
2024-05-15 15:56       ` Tamar Christina [this message]
2024-05-15 21:31         ` Richard Sandiford
2024-05-16  2:45           ` Tamar Christina
2024-05-21  3:24           ` Tamar Christina

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=VI1PR08MB53255D5B497013562821DF16FFEC2@VI1PR08MB5325.eurprd08.prod.outlook.com \
    --to=tamar.christina@arm.com \
    --cc=Marcus.Shawcroft@arm.com \
    --cc=Richard.Earnshaw@arm.com \
    --cc=Richard.Sandiford@arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=ktkachov@gcc.gnu.org \
    --cc=nd@arm.com \
    --cc=richard.guenther@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).