From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=msqH=MS=arm.com=richard.sandiford@sourceware.org>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by sourceware.org (Postfix) with ESMTP id E9E553858D20;
	Wed, 15 May 2024 21:31:13 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org E9E553858D20
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org E9E553858D20
Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=217.140.110.172
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1715808676; cv=none;
	b=gcNk4UX6PS7+hp1M5wVlaQWAm+L/0PAK+XBJ0P/h/2agKiQS2vBijedjpFebgIU7OYF0w5xow5mbx2WAu0Czh+6Hf4pafy7d8GA466/8la9R6P17dPAZ1j5ddsNIbFuIi03/QERZ1uy34p4Vrf208H8zkwYDsEqTeWQTksZeVJw=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
	t=1715808676; c=relaxed/simple;
	bh=VA5kaTT2eFIptLT7txD3k6iLr12kjTh5c6rn08uNG9I=;
	h=From:To:Subject:Date:Message-ID:MIME-Version; b=SkwD5gih3V/8CbdcORP+I5u5pi5Z4xriON96KQzeAaOsxQfKZa4PDlCln3uNqxYlNAU7IzZZOMolo2qSI+Qh//1r8m02vDIbvb5S6Iyi/OJqV60Ry/W6pUJfgVDh7qmnp+EVmC7Et4NfIu0epzAx+MmqVQiwRujm4j53tvyVaTc=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id EB60FDA7;
	Wed, 15 May 2024 14:31:37 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.110.72])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 6C3E33F7A6;
	Wed, 15 May 2024 14:31:12 -0700 (PDT)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Tamar Christina <Tamar.Christina@arm.com>
Mail-Followup-To: Tamar Christina <Tamar.Christina@arm.com>,Richard Biener <richard.guenther@gmail.com>,  "gcc-patches\@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,  nd <nd@arm.com>,  Richard Earnshaw <Richard.Earnshaw@arm.com>,  Marcus Shawcroft <Marcus.Shawcroft@arm.com>,  "ktkachov\@gcc.gnu.org" <ktkachov@gcc.gnu.org>, richard.sandiford@arm.com
Cc: Richard Biener <richard.guenther@gmail.com>,  "gcc-patches\@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,  nd <nd@arm.com>,  Richard Earnshaw <Richard.Earnshaw@arm.com>,  Marcus Shawcroft <Marcus.Shawcroft@arm.com>,  "ktkachov\@gcc.gnu.org" <ktkachov@gcc.gnu.org>
Subject: Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain operations.
References: <patch-18359-tamar@arm.com>
	<CAFiYyc1DV2jdUSH_DrOaGCest0QM=VUPGVEGPajGeDcngFRsug@mail.gmail.com>
	<VI1PR08MB5325B058CEAB0FC71F0AC00FFFEC2@VI1PR08MB5325.eurprd08.prod.outlook.com>
	<mpta5krjfhj.fsf@arm.com>
	<VI1PR08MB53255D5B497013562821DF16FFEC2@VI1PR08MB5325.eurprd08.prod.outlook.com>
Date: Wed, 15 May 2024 22:31:11 +0100
In-Reply-To: <VI1PR08MB53255D5B497013562821DF16FFEC2@VI1PR08MB5325.eurprd08.prod.outlook.com>
	(Tamar Christina's message of "Wed, 15 May 2024 15:56:11 +0000")
Message-ID: <mptcypmhie8.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-20.5 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,KAM_SHORT,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Tamar Christina <Tamar.Christina@arm.com> writes:
>> >> On Wed, May 15, 2024 at 12:29=E2=80=AFPM Tamar Christina
>> >> <tamar.christina@arm.com> wrote:
>> >> >
>> >> > Hi All,
>> >> >
>> >> > Some Neoverse Software Optimization Guides (SWoG) have a clause tha=
t state
>> >> > that for predicated operations that also produce a predicate it is =
preferred
>> >> > that the codegen should use a different register for the destinatio=
n than that
>> >> > of the input predicate in order to avoid a performance overhead.
>> >> >
>> >> > This of course has the problem that it increases register pressure =
and so
>> should
>> >> > be done with care.  Additionally not all micro-architectures have t=
his
>> >> > consideration and so it shouldn't be done as a default thing.
>> >> >
>> >> > The patch series adds support for doing conditional early clobbers =
through a
>> >> > combination of new alternatives and attributes to control their ava=
ilability.
>> >>
>> >> You could have two alternatives, one with early clobber and one with
>> >> a matching constraint where you'd disparage the matching constraint o=
ne?
>> >>
>> >
>> > Yeah, that's what I do, though there's no need to disparage the non-ea=
rly clobber
>> > alternative as the early clobber alternative will naturally get a pena=
lty if it needs a
>> > reload.
>>=20
>> But I think Richard's suggestion was to disparage the one with a matching
>> constraint (not the earlyclobber), to reflect the increased cost of
>> reusing the register.
>>=20
>> We did take that approach for gathers, e.g.:
>>=20
>>      [&w, Z,   w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
>>      [?w, Z,   0, Ui1, Ui1, Upl] ^
>>=20
>> The (supposed) advantage is that, if register pressure is so tight
>> that using matching registers is the only alternative, we still
>> have the opportunity to do that, as a last resort.
>>=20
>> Providing only an earlyclobber version means that using the same
>> register is prohibited outright.  If no other register is free, the RA
>> would need to spill something else to free up a temporary register.
>> And it might then do the equivalent of (pseudo-code):
>>=20
>>       not p1.b, ..., p0.b
>>       mov p0.d, p1.d
>>=20
>> after spilling what would otherwise have occupied p1.  In that
>> situation it would be better use:
>>=20
>>       not p0.b, ..., p0.b
>>=20
>> and not introduce the spill of p1.
>
> I think I understood what Richi meant, but I thought it was already worki=
ng that way.

The suggestion was to use matching constraints (like "0") though,
whereas the patch doesn't.  I think your argument is that you don't
need to use matching constraints.  But that's different from the
suggestion (and from how we handle gathers).

I was going to say in response to patch 3 (but got distracted, sorry):
I don't think we should have:

   &Upa, Upa, ...
   Upa, Upa, ...

(taken from the pure logic ops) enabled at the same time.  Even though
it works for the testcases, I don't think it has well-defined semantics.

The problem is that, taken on its own, the second alternative says that
matching operands are free.  And fundamentally, I don't think the costs
*must* take the earlyclobber alternative over the non-earlyclobber one
(when costing during IRA, for instance).  In principle, the cheapest
is best.

The aim of the gather approach is to make each alternative correct in
isolation.  In:

      [&w, Z,   w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
      [?w, Z,   0, Ui1, Ui1, Upl] ^

the second alternative says that it is possible to have operands 0
and 2 be the same vector register, but using that version has the
cost of an extra reload.  In that sense the alternatives are
(essentially) consistent about the restriction.

> i.e. as one of the testcases I had:
>
>> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=3Dneoverse-n2 =
-ffixed-p[1-15]
>
> foo:
>         mov     z31.h, w0
>         ptrue   p0.b, all
>         cmplo   p0.h, p0/z, z0.h, z31.h
>         b       use
>
> and reload did not force a spill.
>
> My understanding of how this works, and how it seems to be working is tha=
t since reload costs
> Alternative from front to back the cheapest one wins and it stops evaluat=
ing the rest.
>
> The early clobber case is first and preferred, however when it's not poss=
ible, i.e. requires a non-pseudo
> reload, the reload cost is added to the alternative.
>
> However you're right that in the following testcase:
>
> -mcpu=3Dneoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p=
5 -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 -ffix=
ed-p12 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p14 -fdump-rtl-reload
>
> i.e. giving it an extra free register inexplicably causes a spill:
>
> foo:
>         addvl   sp, sp, #-1
>         mov     z31.h, w0
>         ptrue   p0.b, all
>         str     p15, [sp]
>         cmplo   p15.h, p0/z, z0.h, z31.h
>         mov     p0.b, p15.b
>         ldr     p15, [sp]
>         addvl   sp, sp, #1
>         b       use
>
> so that's unexpected and is very weird as p15 has no defined value..

This is because the function implicitly uses the SVE PCS, and so needs
to preserve p15 for the caller.  It looks like the correct behaviour.

> Now adding the ? as suggested to the non-early clobber alternative does n=
ot fix it, and my mental model for how this is supposed to work does not qu=
ite line up..
> Why would making the non-clobber alternative even more expensive help it =
during high register pressure??

Hopefully the above answers this.  The non-clobber alternative has
zero extra cost as things stand.  The costs from one alternative
(the earlyclobber one) don't carry forward to other alternatives.

> But with that suggestion the above case does not get fixed
> and the following case
>
> -mcpu=3Dneoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p=
5 -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 -ffix=
ed-p12 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p15 -fdump-rtl-reload
>
> ICEs:
>
> pred-clobber.c: In function 'foo':
> pred-clobber.c:9:1: error: unable to find a register to spill
>     9 | }
>       | ^
> pred-clobber.c:9:1: error: this is the insn:
> (insn 10 22 19 2 (parallel [
>             (set (reg:VNx8BI 110 [104])
>                 (unspec:VNx8BI [
>                         (reg:VNx8BI 112)
>                         (const_int 1 [0x1])
>                         (ltu:VNx8BI (reg:VNx8HI 32 v0)
>                             (reg:VNx8HI 63 v31))
>                     ] UNSPEC_PRED_Z))
>             (clobber (reg:CC_NZC 66 cc))
>         ]) "pred-clobber.c":7:19 8687 {aarch64_pred_cmplovnx8hi}
>      (expr_list:REG_DEAD (reg:VNx8BI 112)
>         (expr_list:REG_DEAD (reg:VNx8HI 63 v31)
>             (expr_list:REG_DEAD (reg:VNx8HI 32 v0)
>                 (expr_list:REG_UNUSED (reg:CC_NZC 66 cc)
>                     (nil))))))
> during RTL pass: reload
> dump file: pred-clobber.c.315r.reload

Which pattern did you use?

> and this is because the use of ? has the unintended side-effect of blocki=
ng a register class entirely during Sched1 as we've recently discovered.
> i.e. see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D114766

(Is sched1 the problem here, or is it purely an RA thing?  What happens
when scheduling is disabled?)

> in this case it marked the alternative as NO_REGS during sched1 and so it=
's completely dead.
> the use of the ? alternatives has caused quite the code bloat as we've re=
cently discovered because of this unexpected and undocumented behavior.
>
> To me,
>
> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch=
64-sve.md
> index 93ec59e58af..2ee3d8ea35e 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -8120,10 +8120,10 @@ (define_insn "@aarch64_pred_cmp<cmp_op><mode>"
>     (clobber (reg:CC_NZC CC_REGNUM))]
>    "TARGET_SVE"
>    {@ [ cons: =3D0 , 1   , 3 , 4            ; attrs: pred_clobber ]
> -     [ &Upa     , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp=
_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
> -     [ Upa      , Upl , w , <sve_imm_con>; *                   ] ^
> -     [ &Upa     , Upl , w , w            ; yes                 ] cmp<cmp=
_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
> -     [ Upa      , Upl , w , w            ; *                   ] ^
> +     [ ^&Upa    , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp=
_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
> +     [  Upa     , Upl , w , <sve_imm_con>; *                   ] ^
> +     [ ^&Upa    , Upl , w , w            ; yes                 ] cmp<cmp=
_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
> +     [  Upa     , Upl , w , w            ; *                   ] ^
>    }
>  )
>
> Would have been the right approach, i.e. we prefer the alternative unless=
 a reload is needed, which should work no? well. if ^ wasn't broken the sam=
e way
> as ?.  Perhaps I need to use Wilco's new alternative that doesn't block a=
 register class?

Hmm, I'm not sure.  It seems odd to mark only the output with ^, since
reloading the output isn't fundamentally different (costwise) from
reloading the input.

But to me, it's the alternative without the earlyclobber that should be
disparaged, since it's the inherently expensive one.

The gather-like approach would be something like:

     [ &Upa     , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op=
>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
     [ ?Upl     , 0   , w , <sve_imm_con>; yes                 ] ^
     [ Upa      , Upl , w , <sve_imm_con>; no                  ] ^
     [ &Upa     , Upl , w , w            ; yes                 ] cmp<cmp_op=
>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
     [ ?Upl     , 0   , w , w            ; yes                 ] ^
     [ Upa      , Upl , w , w            ; no                  ] ^

with:

  (define_attr "pred_clobber" "any,no,yes" (const_string "any"))

Thanks,
Richard