public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Sandiford <richard.sandiford@arm.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	nd <nd@arm.com>, Richard Earnshaw <Richard.Earnshaw@arm.com>,
	Marcus Shawcroft <Marcus.Shawcroft@arm.com>,
	Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
Subject: Re: [PATCH] AArch64: Adjust costing of by element MUL to be the same as SAME3 MUL.
Date: Tue, 9 Jun 2020 13:23:44 +0100	[thread overview]
Message-ID: <20200609122344.GA11277@arm.com> (raw)
In-Reply-To: <mpty2owo450.fsf@arm.com>

[-- Attachment #1: Type: text/plain, Size: 3866 bytes --]

Hi Richard,

The 06/09/2020 12:44, Richard Sandiford wrote:
> Tamar Christina <Tamar.Christina@arm.com> writes:
> > Hi Richard,
> > The 06/08/2020 16:42, Richard Sandiford wrote:
> >> Tamar Christina <tamar.christina@arm.com> writes:
> >> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> >> > index 97da60762390db81df9cffaf316b909cd1609130..9cc8da338125afa01bc9fb645f4112d2d7ef548c 100644
> >> > --- a/gcc/config/aarch64/aarch64.c
> >> > +++ b/gcc/config/aarch64/aarch64.c
> >> > @@ -11279,6 +11279,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
> >> >    if (VECTOR_MODE_P (mode))
> >> >      mode = GET_MODE_INNER (mode);
> >> > 
> >> > +  /* The by element versions of the instruction has the same costs as the
> >> > +     normal 3 vector version.  So don't add the costs of the duplicate into
> >> > +     the costs of the multiply.  */
> >> > +  if (GET_CODE (op0) == VEC_DUPLICATE)
> >> > +    op0 = XEXP (op0, 0);
> >> > +  else if (GET_CODE (op1) == VEC_DUPLICATE)
> >> > +    op1 = XEXP (op1, 0);
> >> > +
> >> >    /* Integer multiply/fma.  */
> >> >    if (GET_MODE_CLASS (mode) == MODE_INT)
> >> >      {
> >> 
> >> SVE doesn't have duplicating forms, so I think we should put this code
> >> under the “if (VECTOR_MODE_P (mode))” condition, before changing “mode”,
> >> and then restrict it to VEC_ADVSIMD modes.
> >> 
> >> (SVE FMUL does have an indexed form, but the index is relative to the
> >> start of the associated quadword, so it isn't a VEC_DUPLICATE.)
> >>
> >
> > Done, I have updated the patch. (See attached)
> >  
> >> I guess there's a danger that this could underestimate the cost for
> >> integer modes, if the scalar integer input needs to be moved from GPRs.
> >> In that case the cost of a MULT + VEC_DUPLICATE is probably more
> >> accurate, even though it's still one instruction before RA.
> >> 
> >> But I guess there's no perfect answer there.  The new code will be
> >> right for integer modes in some cases and not in others.  Same if
> >> we leave things as they are.  But maybe it'd be worth having a comment
> >> to say that we're assuming the best case, i.e. that the duplicated
> >> value is naturally in FPRs?
> >> 
> >
> > Hmm I haven't added the comment yet since I don't fully understand when the
> > integer case would be misleading.
> >
> > In both cases the cost for the GPR is paid by the MOV no? I'm missing
> > why having the MUL account for it would be better in some cases.
> 
> The point was that any MOV isn't exposed until after register allocation,
> whereas costs are usually applied before then.  So before RA:
> 
> > For instance for the integer case we used to generate
> >
> >         dup     v0.4s, w2
> >         mul     v2.4s, v2.4s, v0.4s
> 
> ...this was costed as:
> 
>    (set (reg:V4SI R2) (vec_duplicate:V4SI (reg:SI R1)))
>    (set (reg:V4SI R3) (mult:V4SI ...))
> 
> and so accurate when R1 naturally ends up in a GPR.
> 
> > but now do
> >
> >         fmov    s0, w2
> >         mul     v2.4s, v2.4s, v0.s[0]
> 
> ...and this is costed as:
> 
>    (set (reg:V4SI R3) (mult:V4SI ...))
> 
> and so accurate when R1 naturally ends up in an FPR (without needing
> a reload to put it there).
> 
> In other words, before RA, the patch is making the optimistic assumption
> that R1 is already in FPRs and so a separate FMOV won't be needed.
> 

Aargggs... yes that makes sense. Sorry when I looked at the dump before I didn't noticed the order was switched.
The SET was for the load of course. :(

I have added the comment as suggested, thanks for the explanation.

OK for master?

Thanks,
Tamar

> Thanks,
> Richard
> 
> > Which is better on older cores such Cortex-A55 and no different on newer cores such as
> > Cortex-A76 according to the optimization guides.
> >
> > Regards,
> > Tamar
> >
> >> Thanks,
> >> Richard

-- 

[-- Attachment #2: pr13166.patch --]
[-- Type: text/x-diff, Size: 2148 bytes --]

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 973c65aa4fb348450872036617362aa17310fb20..5a5a9ad44f0945b4d6a869fc2b4e857022659c55 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11279,7 +11279,22 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
   op1 = XEXP (x, 1);
 
   if (VECTOR_MODE_P (mode))
-    mode = GET_MODE_INNER (mode);
+    {
+      unsigned int vec_flags = aarch64_classify_vector_mode (mode);
+      mode = GET_MODE_INNER (mode);
+      if (vec_flags & VEC_ADVSIMD)
+	{
+	  /* The by element versions of the instruction has the same costs as the
+	     normal 3 vector version.  So don't add the costs of the duplicate into
+	     the costs of the multiply.  We make an assumption that the value in
+	     the VEC_DUPLICATE is already the FP&SIMD side.  This means costing of
+	     a MUL by element pre RA is a bit optimistic.  */
+	  if (GET_CODE (op0) == VEC_DUPLICATE)
+	    op0 = XEXP (op0, 0);
+	  else if (GET_CODE (op1) == VEC_DUPLICATE)
+	    op1 = XEXP (op1, 0);
+	}
+    }
 
   /* Integer multiply/fma.  */
   if (GET_MODE_CLASS (mode) == MODE_INT)
diff --git a/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
new file mode 100644
index 0000000000000000000000000000000000000000..513721cee0c8372781e6daf33bc06e256cab8cb8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_float } */
+/* { dg-options "-Ofast" } */
+
+#include <arm_neon.h>
+
+void s_mult_i (int32_t* restrict res, int32_t* restrict a, int32_t b)
+{
+    for (int x = 0; x < 16; x++)
+      res[x] = a[x] * b;
+}
+
+void s_mult_f (float32_t* restrict res, float32_t* restrict a, float32_t b)
+{
+    for (int x = 0; x < 16; x++)
+      res[x] = a[x] * b;
+}
+
+/* { dg-final { scan-assembler-times {\s+mul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */
+/* { dg-final { scan-assembler-times {\s+fmul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */


  reply	other threads:[~2020-06-09 12:23 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-08 14:14 Tamar Christina
2020-06-08 15:42 ` Richard Sandiford
2020-06-09 11:21   ` Tamar Christina
2020-06-09 11:44     ` Richard Sandiford
2020-06-09 12:23       ` Tamar Christina [this message]
2020-06-10  7:32         ` Richard Sandiford

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200609122344.GA11277@arm.com \
    --to=tamar.christina@arm.com \
    --cc=Kyrylo.Tkachov@arm.com \
    --cc=Marcus.Shawcroft@arm.com \
    --cc=Richard.Earnshaw@arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=nd@arm.com \
    --cc=richard.sandiford@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).