From: Richard Sandiford <richard.sandiford@arm.com>
To: Tamar Christina <Tamar.Christina@arm.com>
Cc: Richard Biener <richard.guenther@gmail.com>,
	 "gcc-patches\@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	 nd <nd@arm.com>,  Richard Earnshaw <Richard.Earnshaw@arm.com>,
	 Marcus Shawcroft <Marcus.Shawcroft@arm.com>,
	 "ktkachov\@gcc.gnu.org" <ktkachov@gcc.gnu.org>
Subject: Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.
Date: Wed, 15 May 2024 22:31:11 +0100	[thread overview]
Message-ID: <mptcypmhie8.fsf@arm.com> (raw)
In-Reply-To: <VI1PR08MB53255D5B497013562821DF16FFEC2@VI1PR08MB5325.eurprd08.prod.outlook.com> (Tamar Christina's message of "Wed, 15 May 2024 15:56:11 +0000")

Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
>> >> <tamar.christina@arm.com> wrote:
>> >> >
>> >> > Hi All,
>> >> >
>> >> > Some Neoverse Software Optimization Guides (SWoG) have a clause stating
>> >> > that, for predicated operations that also produce a predicate, it is preferred
>> >> > that the codegen use a different register for the destination than that
>> >> > of the input predicate, in order to avoid a performance overhead.
>> >> >
>> >> > This of course has the problem that it increases register pressure and so
>> >> > should be done with care.  Additionally, not all micro-architectures have this
>> >> > consideration and so it shouldn't be done as a default thing.
>> >> >
>> >> > The patch series adds support for doing conditional early clobbers through a
>> >> > combination of new alternatives and attributes to control their availability.
>> >>
>> >> You could have two alternatives, one with early clobber and one with
>> >> a matching constraint where you'd disparage the matching constraint one?
>> >>
>> >
>> > Yeah, that's what I do, though there's no need to disparage the non-early clobber
>> > alternative as the early clobber alternative will naturally get a penalty if it needs a
>> > reload.
>> 
>> But I think Richard's suggestion was to disparage the one with a matching
>> constraint (not the earlyclobber), to reflect the increased cost of
>> reusing the register.
>> 
>> We did take that approach for gathers, e.g.:
>> 
>>      [&w, Z,   w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
>>      [?w, Z,   0, Ui1, Ui1, Upl] ^
>> 
>> The (supposed) advantage is that, if register pressure is so tight
>> that using matching registers is the only alternative, we still
>> have the opportunity to do that, as a last resort.
>> 
>> Providing only an earlyclobber version means that using the same
>> register is prohibited outright.  If no other register is free, the RA
>> would need to spill something else to free up a temporary register.
>> And it might then do the equivalent of (pseudo-code):
>> 
>>       not p1.b, ..., p0.b
>>       mov p0.d, p1.d
>> 
>> after spilling what would otherwise have occupied p1.  In that
>> situation it would be better to use:
>> 
>>       not p0.b, ..., p0.b
>> 
>> and not introduce the spill of p1.
>
> I think I understood what Richi meant, but I thought it was already working that way.

The suggestion was to use matching constraints (like "0") though,
whereas the patch doesn't.  I think your argument is that you don't
need to use matching constraints.  But that's different from the
suggestion (and from how we handle gathers).

I was going to say in response to patch 3 (but got distracted, sorry):
I don't think we should have:

   &Upa, Upa, ...
   Upa, Upa, ...

(taken from the pure logic ops) enabled at the same time.  Even though
it works for the testcases, I don't think it has well-defined semantics.

The problem is that, taken on its own, the second alternative says that
matching operands are free.  And fundamentally, I don't think the cost
calculation is guaranteed to pick the earlyclobber alternative over the
non-earlyclobber one (when costing during IRA, for instance).  In
principle, the cheapest alternative wins.

The aim of the gather approach is to make each alternative correct in
isolation.  In:

      [&w, Z,   w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
      [?w, Z,   0, Ui1, Ui1, Upl] ^

the second alternative says that it is possible to have operands 0
and 2 be the same vector register, but using that version has the
cost of an extra reload.  In that sense the alternatives are
(essentially) consistent about the restriction.
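
To make the "extra reload" concrete (this is just a rough sketch of the
kind of fix-up reload would insert, with made-up register numbers, not
actual compiler output): if operands 0 and 2 start out in different
registers and only the matching-constraint alternative can be used,
reload first has to copy the address vector into the destination
register so that the two operands match:

      mov     z0.d, z1.d
      ld1w    z0.s, p0/z, [z0.s]

whereas the earlyclobber alternative needs no such copy when a separate
destination register is available.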

> i.e. as one of the testcases I had:
>
>> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 -ffixed-p[1-15]
>
> foo:
>         mov     z31.h, w0
>         ptrue   p0.b, all
>         cmplo   p0.h, p0/z, z0.h, z31.h
>         b       use
>
> and reload did not force a spill.
>
> My understanding of how this works, and how it seems to be working, is that since reload costs
> alternatives from front to back, the cheapest one wins and it stops evaluating the rest.
>
> The early clobber case is first and preferred; however, when it's not possible, i.e. when it requires a non-pseudo
> reload, the reload cost is added to the alternative.
>
> However you're right that in the following testcase:
>
> -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5 -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 -ffixed-p12 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p14 -fdump-rtl-reload
>
> i.e. giving it an extra free register inexplicably causes a spill:
>
> foo:
>         addvl   sp, sp, #-1
>         mov     z31.h, w0
>         ptrue   p0.b, all
>         str     p15, [sp]
>         cmplo   p15.h, p0/z, z0.h, z31.h
>         mov     p0.b, p15.b
>         ldr     p15, [sp]
>         addvl   sp, sp, #1
>         b       use
>
> so that's unexpected and very weird, as p15 has no defined value.

This is because the function implicitly uses the SVE PCS, and so needs
to preserve p15 for the caller.  It looks like the correct behaviour.
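
(For the record, since the pred-clobber.c source isn't quoted anywhere
in the thread: my guess, reconstructed from the assembly above, is that
it is essentially the following.  The SVE types in the signature are
what make foo an SVE-PCS function, and hence what make p15 callee-saved.)

    #include <arm_sve.h>

    void use (svbool_t);

    void
    foo (svuint16_t a, uint16_t b)
    {
      /* Predicated compare: the ptrue is the governing predicate input,
         and the comparison result is the predicate output passed to use.  */
      use (svcmplt_n_u16 (svptrue_b8 (), a, b));
    }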

> Now adding the ? as suggested to the non-early clobber alternative does not fix it, and my mental model for how this is supposed to work does not quite line up.
> Why would making the non-clobber alternative even more expensive help during high register pressure?

Hopefully the above answers this.  The non-clobber alternative has
zero extra cost as things stand.  The costs from one alternative
(the earlyclobber one) don't carry forward to other alternatives.

> But with that suggestion the above case does not get fixed
> and the following case
>
> -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5 -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 -ffixed-p12 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p15 -fdump-rtl-reload
>
> ICEs:
>
> pred-clobber.c: In function 'foo':
> pred-clobber.c:9:1: error: unable to find a register to spill
>     9 | }
>       | ^
> pred-clobber.c:9:1: error: this is the insn:
> (insn 10 22 19 2 (parallel [
>             (set (reg:VNx8BI 110 [104])
>                 (unspec:VNx8BI [
>                         (reg:VNx8BI 112)
>                         (const_int 1 [0x1])
>                         (ltu:VNx8BI (reg:VNx8HI 32 v0)
>                             (reg:VNx8HI 63 v31))
>                     ] UNSPEC_PRED_Z))
>             (clobber (reg:CC_NZC 66 cc))
>         ]) "pred-clobber.c":7:19 8687 {aarch64_pred_cmplovnx8hi}
>      (expr_list:REG_DEAD (reg:VNx8BI 112)
>         (expr_list:REG_DEAD (reg:VNx8HI 63 v31)
>             (expr_list:REG_DEAD (reg:VNx8HI 32 v0)
>                 (expr_list:REG_UNUSED (reg:CC_NZC 66 cc)
>                     (nil))))))
> during RTL pass: reload
> dump file: pred-clobber.c.315r.reload

Which pattern did you use?

> and this is because the use of ? has the unintended side-effect of blocking a register class entirely during sched1, as we've recently discovered;
> see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114766

(Is sched1 the problem here, or is it purely an RA thing?  What happens
when scheduling is disabled?)
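
(If sched1 is the culprit, rerunning the reproducer with scheduling
turned off should make the NO_REGS effect disappear, e.g. something
like:

      aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c \
        -mcpu=neoverse-n2 -fno-schedule-insns -fno-schedule-insns2 \
        <same -ffixed-p* flags as above>

where -fno-schedule-insns disables sched1 and -fno-schedule-insns2
disables sched2; the rest is just the command line from earlier.)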

> In this case it marked the alternative as NO_REGS during sched1, and so it's completely dead.
> The use of ? alternatives has caused quite a lot of code bloat, as we've recently discovered, because of this unexpected and undocumented behavior.
>
> To me,
>
> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
> index 93ec59e58af..2ee3d8ea35e 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -8120,10 +8120,10 @@ (define_insn "@aarch64_pred_cmp<cmp_op><mode>"
>     (clobber (reg:CC_NZC CC_REGNUM))]
>    "TARGET_SVE"
>    {@ [ cons: =0 , 1   , 3 , 4            ; attrs: pred_clobber ]
> -     [ &Upa     , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
> -     [ Upa      , Upl , w , <sve_imm_con>; *                   ] ^
> -     [ &Upa     , Upl , w , w            ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
> -     [ Upa      , Upl , w , w            ; *                   ] ^
> +     [ ^&Upa    , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
> +     [  Upa     , Upl , w , <sve_imm_con>; *                   ] ^
> +     [ ^&Upa    , Upl , w , w            ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
> +     [  Upa     , Upl , w , w            ; *                   ] ^
>    }
>  )
>
> Would have been the right approach, i.e. we prefer the alternative unless a reload is needed, which should work, no?  Well, it would if ^ weren't broken in the same way
> as ?.  Perhaps I need to use Wilco's new alternative that doesn't block a register class?

Hmm, I'm not sure.  It seems odd to mark only the output with ^, since
reloading the output isn't fundamentally different (costwise) from
reloading the input.

But to me, it's the alternative without the earlyclobber that should be
disparaged, since it's the inherently expensive one.

The gather-like approach would be something like:

     [ &Upa     , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
     [ ?Upl     , 0   , w , <sve_imm_con>; yes                 ] ^
     [ Upa      , Upl , w , <sve_imm_con>; no                  ] ^
     [ &Upa     , Upl , w , w            ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
     [ ?Upl     , 0   , w , w            ; yes                 ] ^
     [ Upa      , Upl , w , w            ; no                  ] ^

with:

  (define_attr "pred_clobber" "any,no,yes" (const_string "any"))

Thanks,
Richard

Thread overview: 19+ messages
2024-05-15 10:28 Tamar Christina
2024-05-15 10:28 ` [PATCH 1/4]AArch64: convert several predicate patterns to new compact syntax Tamar Christina
2024-05-15 10:35   ` Kyrill Tkachov
2024-05-15 11:06   ` Richard Sandiford
2024-05-15 10:28 ` [PATCH 2/4]AArch64: add new tuning param and attribute for enabling conditional early clobber Tamar Christina
2024-05-15 10:56   ` Richard Sandiford
2024-05-15 11:03     ` Tamar Christina
2024-05-22  9:29     ` Tamar Christina
2024-05-28  9:37       ` Tamar Christina
2024-05-30 14:59         ` Richard Sandiford
2024-05-15 10:29 ` [PATCH 3/4]AArch64: add new alternative with early clobber to patterns Tamar Christina
2024-05-15 10:29 ` [PATCH 4/4]AArch64: enable new predicate tuning for Neoverse cores Tamar Christina
2024-05-15 11:20 ` [PATCH 0/4]AArch64: support conditional early clobbers on certain operations Richard Biener
2024-05-15 11:23   ` Tamar Christina
2024-05-15 14:51     ` Richard Sandiford
2024-05-15 15:56       ` Tamar Christina
2024-05-15 21:31         ` Richard Sandiford [this message]
2024-05-16  2:45           ` Tamar Christina
2024-05-21  3:24           ` Tamar Christina
