From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Sandiford <richard.sandiford@arm.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
nd <nd@arm.com>, Richard Earnshaw <Richard.Earnshaw@arm.com>,
Marcus Shawcroft <Marcus.Shawcroft@arm.com>,
Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
Subject: Re: [PATCH] AArch64: Adjust costing of by element MUL to be the same as SAME3 MUL.
Date: Tue, 9 Jun 2020 13:23:44 +0100 [thread overview]
Message-ID: <20200609122344.GA11277@arm.com> (raw)
In-Reply-To: <mpty2owo450.fsf@arm.com>
[-- Attachment #1: Type: text/plain, Size: 3866 bytes --]
Hi Richard,
The 06/09/2020 12:44, Richard Sandiford wrote:
> Tamar Christina <Tamar.Christina@arm.com> writes:
> > Hi Richard,
> > The 06/08/2020 16:42, Richard Sandiford wrote:
> >> Tamar Christina <tamar.christina@arm.com> writes:
> >> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> >> > index 97da60762390db81df9cffaf316b909cd1609130..9cc8da338125afa01bc9fb645f4112d2d7ef548c 100644
> >> > --- a/gcc/config/aarch64/aarch64.c
> >> > +++ b/gcc/config/aarch64/aarch64.c
> >> > @@ -11279,6 +11279,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
> >> > if (VECTOR_MODE_P (mode))
> >> > mode = GET_MODE_INNER (mode);
> >> >
> >> > + /* The by-element versions of the instruction have the same cost as the
> >> > + normal three-vector version, so don't add the cost of the duplicate
> >> > + into the cost of the multiply. */
> >> > + if (GET_CODE (op0) == VEC_DUPLICATE)
> >> > + op0 = XEXP (op0, 0);
> >> > + else if (GET_CODE (op1) == VEC_DUPLICATE)
> >> > + op1 = XEXP (op1, 0);
> >> > +
> >> > /* Integer multiply/fma. */
> >> > if (GET_MODE_CLASS (mode) == MODE_INT)
> >> > {
> >>
> >> SVE doesn't have duplicating forms, so I think we should put this code
> >> under the “if (VECTOR_MODE_P (mode))” condition, before changing “mode”,
> >> and then restrict it to VEC_ADVSIMD modes.
> >>
> >> (SVE FMUL does have an indexed form, but the index is relative to the
> >> start of the associated quadword, so it isn't a VEC_DUPLICATE.)
> >>
> >
> > Done, I have updated the patch. (See attached)
> >
> >> I guess there's a danger that this could underestimate the cost for
> >> integer modes, if the scalar integer input needs to be moved from GPRs.
> >> In that case the cost of a MULT + VEC_DUPLICATE is probably more
> >> accurate, even though it's still one instruction before RA.
> >>
> >> But I guess there's no perfect answer there. The new code will be
> >> right for integer modes in some cases and not in others. Same if
> >> we leave things as they are. But maybe it'd be worth having a comment
> >> to say that we're assuming the best case, i.e. that the duplicated
> >> value is naturally in FPRs?
> >>
> >
> > Hmm, I haven't added the comment yet, since I don't fully understand when the
> > integer case would be misleading.
> >
> > In both cases the cost for the GPR is paid by the MOV no? I'm missing
> > why having the MUL account for it would be better in some cases.
>
> The point was that any MOV isn't exposed until after register allocation,
> whereas costs are usually applied before then. So before RA:
>
> > For instance for the integer case we used to generate
> >
> > dup v0.4s, w2
> > mul v2.4s, v2.4s, v0.4s
>
> ...this was costed as:
>
> (set (reg:V4SI R2) (vec_duplicate:V4SI (reg:SI R1)))
> (set (reg:V4SI R3) (mult:V4SI ...))
>
> and so accurate when R1 naturally ends up in a GPR.
>
> > but now do
> >
> > fmov s0, w2
> > mul v2.4s, v2.4s, v0.s[0]
>
> ...and this is costed as:
>
> (set (reg:V4SI R3) (mult:V4SI ...))
>
> and so accurate when R1 naturally ends up in an FPR (without needing
> a reload to put it there).
>
> In other words, before RA, the patch is making the optimistic assumption
> that R1 is already in FPRs and so a separate FMOV won't be needed.
>
Aargggs... yes, that makes sense. Sorry, when I looked at the dump before I didn't notice that the order was switched.
The SET was for the load, of course. :(
I have added the comment as suggested, thanks for the explanation.
OK for master?
Thanks,
Tamar
> Thanks,
> Richard
>
> > Which is better on older cores such as Cortex-A55 and no different on newer cores
> > such as Cortex-A76, according to the optimization guides.
> >
> > Regards,
> > Tamar
> >
> >> Thanks,
> >> Richard
--
[-- Attachment #2: pr13166.patch --]
[-- Type: text/x-diff, Size: 2148 bytes --]
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 973c65aa4fb348450872036617362aa17310fb20..5a5a9ad44f0945b4d6a869fc2b4e857022659c55 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11279,7 +11279,22 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
op1 = XEXP (x, 1);
if (VECTOR_MODE_P (mode))
- mode = GET_MODE_INNER (mode);
+ {
+ unsigned int vec_flags = aarch64_classify_vector_mode (mode);
+ mode = GET_MODE_INNER (mode);
+ if (vec_flags & VEC_ADVSIMD)
+ {
+ /* The by-element versions of the instruction have the same cost as
+ the normal three-vector version, so don't add the cost of the
+ duplicate into the cost of the multiply. We assume that the value
+ in the VEC_DUPLICATE is already on the FP & SIMD side, which means
+ costing a by-element MUL pre-RA is somewhat optimistic. */
+ if (GET_CODE (op0) == VEC_DUPLICATE)
+ op0 = XEXP (op0, 0);
+ else if (GET_CODE (op1) == VEC_DUPLICATE)
+ op1 = XEXP (op1, 0);
+ }
+ }
/* Integer multiply/fma. */
if (GET_MODE_CLASS (mode) == MODE_INT)
diff --git a/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
new file mode 100644
index 0000000000000000000000000000000000000000..513721cee0c8372781e6daf33bc06e256cab8cb8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_float } */
+/* { dg-options "-Ofast" } */
+
+#include <arm_neon.h>
+
+void s_mult_i (int32_t* restrict res, int32_t* restrict a, int32_t b)
+{
+ for (int x = 0; x < 16; x++)
+ res[x] = a[x] * b;
+}
+
+void s_mult_f (float32_t* restrict res, float32_t* restrict a, float32_t b)
+{
+ for (int x = 0; x < 16; x++)
+ res[x] = a[x] * b;
+}
+
+/* { dg-final { scan-assembler-times {\s+mul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */
+/* { dg-final { scan-assembler-times {\s+fmul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */
Thread overview: 6+ messages
2020-06-08 14:14 Tamar Christina
2020-06-08 15:42 ` Richard Sandiford
2020-06-09 11:21 ` Tamar Christina
2020-06-09 11:44 ` Richard Sandiford
2020-06-09 12:23 ` Tamar Christina [this message]
2020-06-10 7:32 ` Richard Sandiford