From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=s5sW=FX=gmail.com=richard.guenther@sourceware.org>
Received: from mail-lj1-x232.google.com (mail-lj1-x232.google.com [IPv6:2a00:1450:4864:20::232])
	by sourceware.org (Postfix) with ESMTPS id 273923858D38
	for <gcc-patches@gcc.gnu.org>; Mon,  9 Oct 2023 11:33:37 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 273923858D38
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-lj1-x232.google.com with SMTP id 38308e7fff4ca-2b95d5ee18dso52971601fa.1
        for <gcc-patches@gcc.gnu.org>; Mon, 09 Oct 2023 04:33:37 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1696851215; x=1697456015; darn=gcc.gnu.org;
        h=content-transfer-encoding:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=43DDDIQbtM3ZWu3B1q0zBug+GLbg7c7wiIsv5OeOnco=;
        b=I9K6IWXO3co7qE2WLqxBHLzvlCsT0dyaBVIpM6tdnQRvoZbWzrFcglYqYn/VZESD8F
         HzuAVEdoZuF5wmm5Rr2BafVct8ns7HrNCLiD9Tgv20wL1YjqTR3JmmcjLbR4jmEPtEwL
         9DVw1tf60Ey3IlN2bk0qhh+Km+/DQq9Wqpo2zoUJOESy87fDf2/CYO/eVj9OvYD1k+7r
         9AJeNLNJnbm81so1UqbKeWi90wesXRykiql27nPl3HsebfZrdz8wearPjM/fuQZy8ZMf
         phfJHRZCGvXsbOgFhB6SnkU7/juXd2ScaIuHxCLdgk7ry+di0TmjGnCBAHE99q7JelJP
         3VWw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1696851215; x=1697456015;
        h=content-transfer-encoding:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=43DDDIQbtM3ZWu3B1q0zBug+GLbg7c7wiIsv5OeOnco=;
        b=wCCSDEHREOXUt8o3XXdOXYbxKpFVjDg6Mxy0FHfhR5+k5IkCqhABPcYAzavv7XPahd
         skIo6vyPcrh5VkVmuR9V0okLZXyMIrdsgxwDfrfQGl/7GN8KhNHqfnqHm0U96WIx2jx4
         TV4ZrIQDQM0KhoQjPFtKN87wovzE+QDFjwIxTCML6MilT+3Il884bn79ncJYwg8O6b4x
         Ii1Tb9cN+xuI3x34FJqokfiYDMJ/FQxlwHCIwCfSBUnpXh7YDqyAMuHRikT/gFHvSFhO
         AN0zkY6nxlNYQZQbds5aBvAdmvWkkM7mx9o6393fQK0W7rp4s69E1UDHi8JXdpE+BW3k
         mQSw==
X-Gm-Message-State: AOJu0YxqW1OKP1yPjDeJPgSzYGNSFdUfhBNW8z3YPsTDNsvX3UG7Zbuy
	oANQE6cQuShC+XE8qC+MvascIduJKo5on8ozTIU=
X-Google-Smtp-Source: AGHT+IESC/h/KDDL3ArysWTS+yAER5UKHYr4vH1E4yKl48ayPsDQGP6tqdPAJRlvpuEY29AT1k2CIwRMO3Dcud+fngQ=
X-Received: by 2002:ac2:4c4b:0:b0:503:cca:e52f with SMTP id
 o11-20020ac24c4b000000b005030ccae52fmr16203398lfk.51.1696851215059; Mon, 09
 Oct 2023 04:33:35 -0700 (PDT)
MIME-Version: 1.0
References: <patch-17812-tamar@arm.com> <mptzg0wna16.fsf@arm.com>
 <VI1PR08MB53256A3D4227EF9076576E42FFCAA@VI1PR08MB5325.eurprd08.prod.outlook.com>
 <mpt1qe8n7ef.fsf@arm.com> <VI1PR08MB53255672DF89DF1ADC094412FFCAA@VI1PR08MB5325.eurprd08.prod.outlook.com>
 <CAFiYyc0obZh7p=rxeEx1-0_giOgAKE0LbGCwudJkOSENOLO0hw@mail.gmail.com>
 <mptr0m6kb52.fsf@arm.com> <AM0PR08MB531607FD690F61F14805111BFFCEA@AM0PR08MB5316.eurprd08.prod.outlook.com>
 <mptr0m4glvi.fsf@arm.com> <AM0PR08MB53169D0C21D3BF3AA67781A8FFCEA@AM0PR08MB5316.eurprd08.prod.outlook.com>
 <mptil7ggkw7.fsf@arm.com>
In-Reply-To: <mptil7ggkw7.fsf@arm.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Mon, 9 Oct 2023 13:30:47 +0200
Message-ID: <CAFiYyc1BGV5FohcgzoEypf8+jz18Or6kpkRAXWym9wORjQXrLQ@mail.gmail.com>
Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
To: Tamar Christina <Tamar.Christina@arm.com>, Richard Biener <richard.guenther@gmail.com>, 
	"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, nd <nd@arm.com>, 
	Richard Earnshaw <Richard.Earnshaw@arm.com>, Marcus Shawcroft <Marcus.Shawcroft@arm.com>, 
	Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>, richard.sandiford@arm.com
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-1.2 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,KAM_MANYTO,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Mon, Oct 9, 2023 at 12:17=E2=80=AFPM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Monday, October 9, 2023 10:56 AM
> >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> Cc: Richard Biener <richard.guenther@gmail.com>; gcc-patches@gcc.gnu.o=
rg;
> >> nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> >> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> >> <Kyrylo.Tkachov@arm.com>
> >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> >>
> >> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> -----Original Message-----
> >> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> Sent: Saturday, October 7, 2023 10:58 AM
> >> >> To: Richard Biener <richard.guenther@gmail.com>
> >> >> Cc: Tamar Christina <Tamar.Christina@arm.com>;
> >> >> gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> >> >> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> >> >> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> >> <Kyrylo.Tkachov@arm.com>
> >> >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysig=
n.
> >> >>
> >> >> Richard Biener <richard.guenther@gmail.com> writes:
> >> >> > On Thu, Oct 5, 2023 at 10:46=E2=80=AFPM Tamar Christina
> >> >> <Tamar.Christina@arm.com> wrote:
> >> >> >>
> >> >> >> > -----Original Message-----
> >> >> >> > From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> >> > Sent: Thursday, October 5, 2023 9:26 PM
> >> >> >> > To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> >> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> >> >> >> > <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> >> >> >> > <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> >> >> <Kyrylo.Tkachov@arm.com>
> >> >> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for
> >> >> cond_copysign.
> >> >> >> >
> >> >> >> > Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> >> > >> -----Original Message-----
> >> >> >> > >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> >> > >> Sent: Thursday, October 5, 2023 8:29 PM
> >> >> >> > >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> >> > >> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard
> >> >> >> > >> Earnshaw <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> >> >> >> > >> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> >> >> >> > <Kyrylo.Tkachov@arm.com>
> >> >> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for
> >> >> cond_copysign.
> >> >> >> > >>
> >> >> >> > >> Tamar Christina <tamar.christina@arm.com> writes:
> >> >> >> > >> > Hi All,
> >> >> >> > >> >
> >> >> >> > >> > This adds an implementation for masked copysign along wit=
h
> >> >> >> > >> > an optimized pattern for masked copysign (x, -1).
> >> >> >> > >>
> >> >> >> > >> It feels like we're ending up with a lot of AArch64-specifi=
c
> >> >> >> > >> code that just hard- codes the observation that changing th=
e
> >> >> >> > >> sign is equivalent to changing the top bit.  We then need t=
o
> >> >> >> > >> make sure that we choose the best way of changing the top b=
it
> >> >> >> > >> for any
> >> >> given situation.
> >> >> >> > >>
> >> >> >> > >> Hard-coding the -1/negative case is one instance of that.
> >> >> >> > >> But it looks like we also fail to use the best sequence for=
 SVE2.  E.g.
> >> >> >> > >> [https://godbolt.org/z/ajh3MM5jv]:
> >> >> >> > >>
> >> >> >> > >> #include <stdint.h>
> >> >> >> > >>
> >> >> >> > >> void f(double *restrict a, double *restrict b) {
> >> >> >> > >>     for (int i =3D 0; i < 100; ++i)
> >> >> >> > >>         a[i] =3D __builtin_copysign(a[i], b[i]); }
> >> >> >> > >>
> >> >> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t=
 c) {
> >> >> >> > >>     for (int i =3D 0; i < 100; ++i)
> >> >> >> > >>         a[i] =3D (a[i] & ~c) | (b[i] & c); }
> >> >> >> > >>
> >> >> >> > >> gives:
> >> >> >> > >>
> >> >> >> > >> f:
> >> >> >> > >>         mov     x2, 0
> >> >> >> > >>         mov     w3, 100
> >> >> >> > >>         whilelo p7.d, wzr, w3
> >> >> >> > >> .L2:
> >> >> >> > >>         ld1d    z30.d, p7/z, [x0, x2, lsl 3]
> >> >> >> > >>         ld1d    z31.d, p7/z, [x1, x2, lsl 3]
> >> >> >> > >>         and     z30.d, z30.d, #0x7fffffffffffffff
> >> >> >> > >>         and     z31.d, z31.d, #0x8000000000000000
> >> >> >> > >>         orr     z31.d, z31.d, z30.d
> >> >> >> > >>         st1d    z31.d, p7, [x0, x2, lsl 3]
> >> >> >> > >>         incd    x2
> >> >> >> > >>         whilelo p7.d, w2, w3
> >> >> >> > >>         b.any   .L2
> >> >> >> > >>         ret
> >> >> >> > >> g:
> >> >> >> > >>         mov     x3, 0
> >> >> >> > >>         mov     w4, 100
> >> >> >> > >>         mov     z29.d, x2
> >> >> >> > >>         whilelo p7.d, wzr, w4
> >> >> >> > >> .L6:
> >> >> >> > >>         ld1d    z30.d, p7/z, [x0, x3, lsl 3]
> >> >> >> > >>         ld1d    z31.d, p7/z, [x1, x3, lsl 3]
> >> >> >> > >>         bsl     z31.d, z31.d, z30.d, z29.d
> >> >> >> > >>         st1d    z31.d, p7, [x0, x3, lsl 3]
> >> >> >> > >>         incd    x3
> >> >> >> > >>         whilelo p7.d, w3, w4
> >> >> >> > >>         b.any   .L6
> >> >> >> > >>         ret
> >> >> >> > >>
> >> >> >> > >> I saw that you originally tried to do this in match.pd and
> >> >> >> > >> that the decision was to fold to copysign instead.  But
> >> >> >> > >> perhaps there's a compromise where isel does something with
> >> >> >> > >> the (new) copysign canonical
> >> >> >> > form?
> >> >> >> > >> I.e. could we go with your new version of the match.pd patc=
h,
> >> >> >> > >> and add some isel stuff as a follow-on?
> >>
> >> [A]
> >>
> >> >> >> > >>
> >> >> >> > >
> >> >> >> > > Sure if that's what's desired.... But..
> >> >> >> > >
> >> >> >> > > The example you posted above is for instance worse for x86
> >> >> >> > > https://godbolt.org/z/x9ccqxW6T where the first operation ha=
s
> >> >> >> > > a dependency chain of 2 and the latter of 3.  It's likely an=
y
> >> >> >> > > open coding of this
> >> >> >> > operation is going to hurt a target.
> >> >> >> > >
> >> >> >> > > So I'm unsure what isel transform this into...
> >> >> >> >
> >> >> >> > I didn't mean that we should go straight to using isel for the
> >> >> >> > general case, just for the new case.  The example above was
> >> >> >> > instead trying to show the general point that hiding the logic
> >> >> >> > ops in target code is
> >> >> a double-edged sword.
> >> >> >>
> >> >> >> I see.. but the problem here is that transforming copysign (x, -=
1)
> >> >> >> into (x | 0x8000000) would require an integer operation on an FP
> >> >> >> value.  I'm happy to do it but it seems like it'll be an AArch64
> >> >> >> only thing
> >> >> anyway.
> >> >> >>
> >> >> >> If we want to do this we need to check can_change_mode_class or =
a
> >> hook.
> >> >> >> Most targets including x86 reject the conversion.  So it'll just
> >> >> >> be effectively an AArch64 thing.
> >> >> >>
> >> >> >> You're right that the actual equivalent transformation is this
> >> >> >> https://godbolt.org/z/KesfrMv5z But the target won't allow it.
> >> >> >>
> >> >> >> >
> >> >> >> > The x86_64 example for the -1 case would be
> >> >> >> > https://godbolt.org/z/b9s6MaKs8 where the isel change would be
> >> >> >> > an improvement.  Without that, I guess
> >> >> >> > x86_64 will need to have a similar patch to the AArch64 one.
> >> >> >> >
> >> >> >>
> >> >> >> I think that's to be expected.  I think it's logical that every
> >> >> >> target just needs to implement their optabs optimally.
> >> >> >>
> >> >> >> > That said, https://godbolt.org/z/e6nqoqbMh suggests that
> >> >> >> > powerpc64 is probably relying on the current copysign -> neg/a=
bs
> >> transform.
> >> >> >> > (Not sure why the second function uses different IVs from the
> >> >> >> > first.)
> >> >> >> >
> >> >> >> > Personally, I wouldn't be against a target hook that indicated
> >> >> >> > whether float bit manipulation is "free" for a given mode, if =
it
> >> >> >> > comes to
> >> >> that.
> >> >> >>
> >> >> >> I'm really sure Richi would agree there.  Generally speaking I
> >> >> >> don't think people see doing FP operations on Int as beneficial.
> >> >> >> But it's always
> >> >> the case on AArch64.
> >> >> >
> >> >> > IIRC we're doing fpclassify "expansion" early for example.
> >> >> >
> >> >> > Note the issue I see is that the middle-end shouldn't get in the
> >> >> > way of targets that have a copysign optab.  In case it's worthwhi=
le
> >> >> > to special-case a "setsign" thing then maybe provide an optab for
> >> >> > that as well.  Now, that doesn't help if we want setsign to be
> >> >> > expanded from the middle-end but still wan the copysign optab (an=
d
> >> >> > not require targets to implement both if they want to escape
> >> >> > middle-end generic
> >> >> expansion of setsign).
> >> >> >
> >> >> > But yes, I also thought the , 1 and , -1 cases could be special
> >> >> > cased by RTL expansion (or ISEL).  But it would need to invoke
> >> >> > costing which likely means target specific changes anyway... :/
> >> >>
> >> >> FWIW, if we had the target hook I suggested, I don't think AArch64
> >> >> would need to implement copysign or xorsign optabs.  That's probabl=
y
> >> >> true of
> >> >> AArch32 too (at least for all combinations that are likely to matte=
r these
> >> days).
> >> >>
> >> >> I'd go one step further and say that, if a target wants to do its o=
wn
> >> >> thing for copysign and xorsign, it clearly doesn't meet the
> >> >> requirement that bit manipulation of floats is "free" for that mode=
.
> >> >>
> >> >
> >> > So I'm aware I have no say here, but I'll make one last effort.
> >> >
> >> > The patch isn't just implementing the fneg (fabs ()) optimization ju=
st
> >> because.
> >> > The location where it's implemented makes a big difference.
> >> >
> >> > If this optimization is done late, it doesn't fix the regression
> >> > fully, because doing It after all optimization passes have run means=
 it can't
> >> properly be optimized.
> >> >
> >> > The point of doing the rewrite early to ORR accomplished two things:
> >> >
> >> > 1. It makes PRE realize that the block it's splitting would only hav=
e 1
> >> instruction in it
> >> >     and that such a split is not beneficial.  This is why I'm agains=
t doing such
> >> optimizations
> >> >     so later. Optimizations don=E2=80=99t' happen in isolation, and =
when they make
> >> sense should
> >> >     happen early.  Transforming fneg (fabs (..)) in isel not only fe=
els wrong but
> >> results in a 4%
> >> >     performance loss.
> >> >
> >> > 2. Doing it early also gets the ORRs to be reassociated changing whe=
re the
> >> loop dependent
> >> >      variable lands.  Early makes it land in the merging MOVPRFX rat=
her than
> >> requiring a SEL
> >> >      at the end of the iteration.
> >> >
> >> > Replacing the fneg (fabs (..)) with copysign accomplishes 1 but no 2=
.
> >> > This results in a 2-3% loss, but I can live with that given doing 1 =
gets us back
> >> to GCC 12 levels.
> >> >
> >> > Doing fneg (fabs (..)) in isel would have no meaning for me and not
> >> > fix the regression.  I won't be looking to do that in that case.
> >> >
> >> > If it's acceptable I can transform COPYSIGN (X, -1) in gimple-isel.
> >>
> >> FWIW, this is what I meant by [A] above.  I.e. go with your latest tar=
get-
> >> independent change, and make isel replace COPYSIGN (X, -1) with logic
> >> operations, under control of a new target hook that says that logic op=
erations
> >> on float values are as cheap as logic operations on integer values (fo=
r a given
> >> float mode).  In future, hopefully all COPYSIGNs and XORSIGNs would be
> >> handled the same way, under control of the same hook, but that would b=
e a
> >> separate future change.
> >
> > Ok, would you prefer isel or in the expander that Richi suggested?
> > I assume isel is better if the intention is to later remove the expansi=
on code?
>
> Yeah, isel sounds better to me FWIW (for that reason), but Richi would
> know better.

ISEL is supposed to be "RTL expansion on GIMPLE" and since we have
copysign builtin expansion with several variants in the optab code and
also direct IFN expansion I think it wouldn't be good to add another place
pre-empting those.

> At least with isel, there's the theoretical possibility of doing
> simple optimisation before expand, or taking surrounding stmts
> into account, if that ever becomes necessary for future changes.

Yeah, but I doubt there's anything for copysign (x, -1) here.

Unfortunately IFN_COPYSIGN expansion isn't routed through
expand_copysign but goes through expand_fn_using_insn(?).
I suppose we can go back to DEF_INTERNAL_OPTAB_FN
for it and implement -1 handing in expand_COPYSIGN?

There's also the case being made for targets not implementing
a vector copysign expander to handle this in pattern recognition.

Richard.

>
> Thanks,
> Richard