From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Sandiford <Richard.Sandiford@arm.com>
Cc: Richard Biener <richard.guenther@gmail.com>,
"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
nd <nd@arm.com>, Richard Earnshaw <Richard.Earnshaw@arm.com>,
Marcus Shawcroft <Marcus.Shawcroft@arm.com>,
Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
Subject: RE: [PATCH]AArch64 Add SVE implementation for cond_copysign.
Date: Mon, 9 Oct 2023 10:09:30 +0000 [thread overview]
Message-ID: <AM0PR08MB53169D0C21D3BF3AA67781A8FFCEA@AM0PR08MB5316.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <mptr0m4glvi.fsf@arm.com>
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, October 9, 2023 10:56 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Richard Biener <richard.guenther@gmail.com>; gcc-patches@gcc.gnu.org;
> nd <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>;
> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>
> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
>
> Tamar Christina <Tamar.Christina@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Saturday, October 7, 2023 10:58 AM
> >> To: Richard Biener <richard.guenther@gmail.com>
> >> Cc: Tamar Christina <Tamar.Christina@arm.com>;
> >> gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> >> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> >> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>
> >> Subject: Re: [PATCH]AArch64 Add SVE implementation for cond_copysign.
> >>
> >> Richard Biener <richard.guenther@gmail.com> writes:
> >> > On Thu, Oct 5, 2023 at 10:46 PM Tamar Christina
> >> <Tamar.Christina@arm.com> wrote:
> >> >>
> >> >> > -----Original Message-----
> >> >> > From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> > Sent: Thursday, October 5, 2023 9:26 PM
> >> >> > To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> >> >> > <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> >> >> > <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> >> <Kyrylo.Tkachov@arm.com>
> >> >> > Subject: Re: [PATCH]AArch64 Add SVE implementation for
> >> cond_copysign.
> >> >> >
> >> >> > Tamar Christina <Tamar.Christina@arm.com> writes:
> >> >> > >> -----Original Message-----
> >> >> > >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> >> > >> Sent: Thursday, October 5, 2023 8:29 PM
> >> >> > >> To: Tamar Christina <Tamar.Christina@arm.com>
> >> >> > >> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard
> >> >> > >> Earnshaw <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> >> >> > >> <Marcus.Shawcroft@arm.com>; Kyrylo Tkachov
> >> >> > <Kyrylo.Tkachov@arm.com>
> >> >> > >> Subject: Re: [PATCH]AArch64 Add SVE implementation for
> >> cond_copysign.
> >> >> > >>
> >> >> > >> Tamar Christina <tamar.christina@arm.com> writes:
> >> >> > >> > Hi All,
> >> >> > >> >
> >> >> > >> > This adds an implementation for masked copysign along with
> >> >> > >> > an optimized pattern for masked copysign (x, -1).
> >> >> > >>
> >> >> > >> It feels like we're ending up with a lot of AArch64-specific
> >> >> > >> code that just hard- codes the observation that changing the
> >> >> > >> sign is equivalent to changing the top bit. We then need to
> >> >> > >> make sure that we choose the best way of changing the top bit
> >> >> > >> for any
> >> given situation.
> >> >> > >>
> >> >> > >> Hard-coding the -1/negative case is one instance of that.
> >> >> > >> But it looks like we also fail to use the best sequence for SVE2. E.g.
> >> >> > >> [https://godbolt.org/z/ajh3MM5jv]:
> >> >> > >>
> >> >> > >> #include <stdint.h>
> >> >> > >>
> >> >> > >> void f(double *restrict a, double *restrict b) {
> >> >> > >> for (int i = 0; i < 100; ++i)
> >> >> > >> a[i] = __builtin_copysign(a[i], b[i]); }
> >> >> > >>
> >> >> > >> void g(uint64_t *restrict a, uint64_t *restrict b, uint64_t c) {
> >> >> > >> for (int i = 0; i < 100; ++i)
> >> >> > >> a[i] = (a[i] & ~c) | (b[i] & c); }
> >> >> > >>
> >> >> > >> gives:
> >> >> > >>
> >> >> > >> f:
> >> >> > >> mov x2, 0
> >> >> > >> mov w3, 100
> >> >> > >> whilelo p7.d, wzr, w3
> >> >> > >> .L2:
> >> >> > >> ld1d z30.d, p7/z, [x0, x2, lsl 3]
> >> >> > >> ld1d z31.d, p7/z, [x1, x2, lsl 3]
> >> >> > >> and z30.d, z30.d, #0x7fffffffffffffff
> >> >> > >> and z31.d, z31.d, #0x8000000000000000
> >> >> > >> orr z31.d, z31.d, z30.d
> >> >> > >> st1d z31.d, p7, [x0, x2, lsl 3]
> >> >> > >> incd x2
> >> >> > >> whilelo p7.d, w2, w3
> >> >> > >> b.any .L2
> >> >> > >> ret
> >> >> > >> g:
> >> >> > >> mov x3, 0
> >> >> > >> mov w4, 100
> >> >> > >> mov z29.d, x2
> >> >> > >> whilelo p7.d, wzr, w4
> >> >> > >> .L6:
> >> >> > >> ld1d z30.d, p7/z, [x0, x3, lsl 3]
> >> >> > >> ld1d z31.d, p7/z, [x1, x3, lsl 3]
> >> >> > >> bsl z31.d, z31.d, z30.d, z29.d
> >> >> > >> st1d z31.d, p7, [x0, x3, lsl 3]
> >> >> > >> incd x3
> >> >> > >> whilelo p7.d, w3, w4
> >> >> > >> b.any .L6
> >> >> > >> ret
> >> >> > >>
> >> >> > >> I saw that you originally tried to do this in match.pd and
> >> >> > >> that the decision was to fold to copysign instead. But
> >> >> > >> perhaps there's a compromise where isel does something with
> >> >> > >> the (new) copysign canonical
> >> >> > form?
> >> >> > >> I.e. could we go with your new version of the match.pd patch,
> >> >> > >> and add some isel stuff as a follow-on?
>
> [A]
>
> >> >> > >>
> >> >> > >
> >> >> > > Sure if that's what's desired.... But..
> >> >> > >
> >> >> > > The example you posted above is for instance worse for x86
> >> >> > > https://godbolt.org/z/x9ccqxW6T where the first operation has
> >> >> > > a dependency chain of 2 and the latter of 3. It's likely any
> >> >> > > open coding of this
> >> >> > operation is going to hurt a target.
> >> >> > >
> >> >> > > So I'm unsure what isel transform this into...
> >> >> >
> >> >> > I didn't mean that we should go straight to using isel for the
> >> >> > general case, just for the new case. The example above was
> >> >> > instead trying to show the general point that hiding the logic
> >> >> > ops in target code is
> >> a double-edged sword.
> >> >>
> >> >> I see.. but the problem here is that transforming copysign (x, -1)
> >> >> into (x | 0x8000000) would require an integer operation on an FP
> >> >> value. I'm happy to do it but it seems like it'll be an AArch64
> >> >> only thing
> >> anyway.
> >> >>
> >> >> If we want to do this we need to check can_change_mode_class or a
> hook.
> >> >> Most targets including x86 reject the conversion. So it'll just
> >> >> be effectively an AArch64 thing.
> >> >>
> >> >> You're right that the actual equivalent transformation is this
> >> >> https://godbolt.org/z/KesfrMv5z But the target won't allow it.
> >> >>
> >> >> >
> >> >> > The x86_64 example for the -1 case would be
> >> >> > https://godbolt.org/z/b9s6MaKs8 where the isel change would be
> >> >> > an improvement. Without that, I guess
> >> >> > x86_64 will need to have a similar patch to the AArch64 one.
> >> >> >
> >> >>
> >> >> I think that's to be expected. I think it's logical that every
> >> >> target just needs to implement their optabs optimally.
> >> >>
> >> >> > That said, https://godbolt.org/z/e6nqoqbMh suggests that
> >> >> > powerpc64 is probably relying on the current copysign -> neg/abs
> transform.
> >> >> > (Not sure why the second function uses different IVs from the
> >> >> > first.)
> >> >> >
> >> >> > Personally, I wouldn't be against a target hook that indicated
> >> >> > whether float bit manipulation is "free" for a given mode, if it
> >> >> > comes to
> >> that.
> >> >>
> >> >> I'm really sure Richi would agree there. Generally speaking I
> >> >> don't think people see doing FP operations on Int as beneficial.
> >> >> But it's always
> >> the case on AArch64.
> >> >
> >> > IIRC we're doing fpclassify "expansion" early for example.
> >> >
> >> > Note the issue I see is that the middle-end shouldn't get in the
> >> > way of targets that have a copysign optab. In case it's worthwhile
> >> > to special-case a "setsign" thing then maybe provide an optab for
> >> > that as well. Now, that doesn't help if we want setsign to be
> >> > expanded from the middle-end but still wan the copysign optab (and
> >> > not require targets to implement both if they want to escape
> >> > middle-end generic
> >> expansion of setsign).
> >> >
> >> > But yes, I also thought the , 1 and , -1 cases could be special
> >> > cased by RTL expansion (or ISEL). But it would need to invoke
> >> > costing which likely means target specific changes anyway... :/
> >>
> >> FWIW, if we had the target hook I suggested, I don't think AArch64
> >> would need to implement copysign or xorsign optabs. That's probably
> >> true of
> >> AArch32 too (at least for all combinations that are likely to matter these
> days).
> >>
> >> I'd go one step further and say that, if a target wants to do its own
> >> thing for copysign and xorsign, it clearly doesn't meet the
> >> requirement that bit manipulation of floats is "free" for that mode.
> >>
> >
> > So I'm aware I have no say here, but I'll make one last effort.
> >
> > The patch isn't just implementing the fneg (fabs ()) optimization just
> because.
> > The location where it's implemented makes a big difference.
> >
> > If this optimization is done late, it doesn't fix the regression
> > fully, because doing It after all optimization passes have run means it can't
> properly be optimized.
> >
> > The point of doing the rewrite early to ORR accomplished two things:
> >
> > 1. It makes PRE realize that the block it's splitting would only have 1
> instruction in it
> > and that such a split is not beneficial. This is why I'm against doing such
> optimizations
> > so later. Optimizations don’t' happen in isolation, and when they make
> sense should
> > happen early. Transforming fneg (fabs (..)) in isel not only feels wrong but
> results in a 4%
> > performance loss.
> >
> > 2. Doing it early also gets the ORRs to be reassociated changing where the
> loop dependent
> > variable lands. Early makes it land in the merging MOVPRFX rather than
> requiring a SEL
> > at the end of the iteration.
> >
> > Replacing the fneg (fabs (..)) with copysign accomplishes 1 but no 2.
> > This results in a 2-3% loss, but I can live with that given doing 1 gets us back
> to GCC 12 levels.
> >
> > Doing fneg (fabs (..)) in isel would have no meaning for me and not
> > fix the regression. I won't be looking to do that in that case.
> >
> > If it's acceptable I can transform COPYSIGN (X, -1) in gimple-isel.
>
> FWIW, this is what I meant by [A] above. I.e. go with your latest target-
> independent change, and make isel replace COPYSIGN (X, -1) with logic
> operations, under control of a new target hook that says that logic operations
> on float values are as cheap as logic operations on integer values (for a given
> float mode). In future, hopefully all COPYSIGNs and XORSIGNs would be
> handled the same way, under control of the same hook, but that would be a
> separate future change.
Ok, would you prefer isel or in the expander that Richi suggested?
I assume isel is better if the intention is to later remove the expansion code?
> I wasn't suggesting recognising (fneg (fabs )) in isel.
>
Ok, I wasn't sure as there were many interleaving threads.
> But like I say, beware of powerpc64. It seems to rely on the current reverse
> transformation of (copysign x -1) to (fneg (fabs x)).
> (Could be wrong, but would be worth checking.)
I'll do a check, thanks!
Tamar
> Thanks,
> Richard
>
>
>
> > So before I start on this, Would this be acceptable for you?
> >
> > Thanks,
> > Tamar
> >
> >> > So I have no good advice here but I hoped that even the generic
> >> > target specific copysign implementation with and & xor would
> >> > eventually be optimized later on RTL for constant second arg?
> >>
> >> Yeah. It looks like the required logic is there for scalars, it just
> >> needs extending to vectors.
> >>
> >> The patch below (untested beyond simple cases) seems to be enough to
> >> fix it, but using the simplify routines even for CONST_INTs might be
> controversial.
> >>
> >> Thanks,
> >> Richard
> >>
> >>
> >> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index
> >> bd9443dbcc2..5a9b1745673 100644
> >> --- a/gcc/simplify-rtx.cc
> >> +++ b/gcc/simplify-rtx.cc
> >> @@ -3409,20 +3409,20 @@
> simplify_context::simplify_binary_operation_1
> >> (rtx_code code,
> >>
> >> /* Canonicalize (X & C1) | C2. */
> >> if (GET_CODE (op0) == AND
> >> - && CONST_INT_P (trueop1)
> >> - && CONST_INT_P (XEXP (op0, 1)))
> >> + && CONSTANT_P (trueop1)
> >> + && CONSTANT_P (XEXP (op0, 1)))
> >> {
> >> - HOST_WIDE_INT mask = GET_MODE_MASK (mode);
> >> - HOST_WIDE_INT c1 = INTVAL (XEXP (op0, 1));
> >> - HOST_WIDE_INT c2 = INTVAL (trueop1);
> >> + rtx c1 = XEXP (op0, 1);
> >> + rtx c2 = trueop1;
> >>
> >> /* If (C1&C2) == C1, then (X&C1)|C2 becomes C2. */
> >> - if ((c1 & c2) == c1
> >> + if (rtx_equal_p (simplify_binary_operation (AND, mode, c1, c2),
> >> +c1)
> >> && !side_effects_p (XEXP (op0, 0)))
> >> return trueop1;
> >>
> >> /* If (C1|C2) == ~0 then (X&C1)|C2 becomes X|C2. */
> >> - if (((c1|c2) & mask) == mask)
> >> + if (rtx_equal_p (simplify_binary_operation (IOR, mode, c1, c2),
> >> + CONSTM1_RTX (mode)))
> >> return simplify_gen_binary (IOR, mode, XEXP (op0, 0), op1);
> >> }
> >>
> >> @@ -3732,7 +3732,7 @@ simplify_context::simplify_binary_operation_1
> >> (rtx_code code,
> >> machines, and also has shorter instruction path length. */
> >> if (GET_CODE (op0) == AND
> >> && GET_CODE (XEXP (op0, 0)) == XOR
> >> - && CONST_INT_P (XEXP (op0, 1))
> >> + && CONSTANT_P (XEXP (op0, 1))
> >> && rtx_equal_p (XEXP (XEXP (op0, 0), 0), trueop1))
> >> {
> >> rtx a = trueop1;
> >> @@ -3746,7 +3746,7 @@ simplify_context::simplify_binary_operation_1
> >> (rtx_code code,
> >> /* Similarly, (xor (and (xor A B) C) B) as (ior (and A C) (and B ~C)) */
> >> else if (GET_CODE (op0) == AND
> >> && GET_CODE (XEXP (op0, 0)) == XOR
> >> - && CONST_INT_P (XEXP (op0, 1))
> >> + && CONSTANT_P (XEXP (op0, 1))
> >> && rtx_equal_p (XEXP (XEXP (op0, 0), 1), trueop1))
> >> {
> >> rtx a = XEXP (XEXP (op0, 0), 0);
> >> --
> >> 2.25.1
> >>
next prev parent reply other threads:[~2023-10-09 10:09 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-05 18:21 Tamar Christina
2023-10-05 19:28 ` Richard Sandiford
2023-10-05 19:47 ` Tamar Christina
2023-10-05 20:25 ` Richard Sandiford
2023-10-05 20:45 ` Tamar Christina
2023-10-06 7:32 ` Richard Biener
2023-10-07 9:57 ` Richard Sandiford
2023-10-09 9:38 ` Tamar Christina
2023-10-09 9:45 ` Richard Biener
2023-10-09 9:55 ` Tamar Christina
2023-10-09 9:56 ` Richard Sandiford
2023-10-09 10:09 ` Tamar Christina [this message]
2023-10-09 10:17 ` Richard Sandiford
2023-10-09 11:30 ` Richard Biener
2023-10-05 20:34 ` Andrew Pinski
2023-10-19 21:29 ` Richard Sandiford
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=AM0PR08MB53169D0C21D3BF3AA67781A8FFCEA@AM0PR08MB5316.eurprd08.prod.outlook.com \
--to=tamar.christina@arm.com \
--cc=Kyrylo.Tkachov@arm.com \
--cc=Marcus.Shawcroft@arm.com \
--cc=Richard.Earnshaw@arm.com \
--cc=Richard.Sandiford@arm.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=nd@arm.com \
--cc=richard.guenther@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).