public inbox for gcc-patches@gcc.gnu.org
* [PATCH] AArch64: Adjust costing of by element MUL to be the same as SAME3 MUL.
@ 2020-06-08 14:14 Tamar Christina
  2020-06-08 15:42 ` Richard Sandiford
  0 siblings, 1 reply; 6+ messages in thread
From: Tamar Christina @ 2020-06-08 14:14 UTC (permalink / raw)
  To: gcc-patches
  Cc: nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov,
	richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 779 bytes --]

Hi All,

The cost model currently treats multiplication by element as more expensive than
the 3-same multiplication.  This means that if the value is on the SIMD side we
add an unneeded DUP, and if the value is on the genreg side we use the more
expensive DUP instead of an FMOV.

This patch corrects the costs such that the two multiplies are costed the same
which allows us to generate

        fmul    v3.4s, v3.4s, v0.s[0]

instead of

        dup     v0.4s, v0.s[0]
        fmul    v3.4s, v3.4s, v0.4s
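
The kind of source that exposes this is a scalar-by-vector multiply loop,
essentially what the attached test exercises.  A minimal sketch (the function
name `scale` is illustrative, not from the patch):

```c
/* Multiply each element of a[] by the scalar b.  With the adjusted
   costs, the vectorizer can select the by-element FMUL form directly
   instead of materialising a separate DUP of b.  */
void scale(float *restrict res, const float *restrict a, float b)
{
  for (int x = 0; x < 16; x++)
    res[x] = a[x] * b;
}
```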

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Adjust costs for mul.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/asimd-mull-elem.c: New test.

-- 

[-- Attachment #2: rb13166.patch --]
[-- Type: text/x-diff, Size: 1792 bytes --]

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 97da60762390db81df9cffaf316b909cd1609130..9cc8da338125afa01bc9fb645f4112d2d7ef548c 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11279,6 +11279,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
   if (VECTOR_MODE_P (mode))
     mode = GET_MODE_INNER (mode);
 
+  /* The by element versions of the instruction has the same costs as the
+     normal 3 vector version.  So don't add the costs of the duplicate into
+     the costs of the multiply.  */
+  if (GET_CODE (op0) == VEC_DUPLICATE)
+    op0 = XEXP (op0, 0);
+  else if (GET_CODE (op1) == VEC_DUPLICATE)
+    op1 = XEXP (op1, 0);
+
   /* Integer multiply/fma.  */
   if (GET_MODE_CLASS (mode) == MODE_INT)
     {
diff --git a/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
new file mode 100644
index 0000000000000000000000000000000000000000..513721cee0c8372781e6daf33bc06e256cab8cb8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_float } */
+/* { dg-options "-Ofast" } */
+
+#include <arm_neon.h>
+
+void s_mult_i (int32_t* restrict res, int32_t* restrict a, int32_t b)
+{
+    for (int x = 0; x < 16; x++)
+      res[x] = a[x] * b;
+}
+
+void s_mult_f (float32_t* restrict res, float32_t* restrict a, float32_t b)
+{
+    for (int x = 0; x < 16; x++)
+      res[x] = a[x] * b;
+}
+
+/* { dg-final { scan-assembler-times {\s+mul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */
+/* { dg-final { scan-assembler-times {\s+fmul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] AArch64: Adjust costing of by element MUL to be the same as SAME3 MUL.
  2020-06-08 14:14 [PATCH] AArch64: Adjust costing of by element MUL to be the same as SAME3 MUL Tamar Christina
@ 2020-06-08 15:42 ` Richard Sandiford
  2020-06-09 11:21   ` Tamar Christina
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Sandiford @ 2020-06-08 15:42 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard.Earnshaw, Marcus.Shawcroft, Kyrylo.Tkachov

Tamar Christina <tamar.christina@arm.com> writes:
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 97da60762390db81df9cffaf316b909cd1609130..9cc8da338125afa01bc9fb645f4112d2d7ef548c 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11279,6 +11279,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
>    if (VECTOR_MODE_P (mode))
>      mode = GET_MODE_INNER (mode);
> 
> +  /* The by element versions of the instruction has the same costs as the
> +     normal 3 vector version.  So don't add the costs of the duplicate into
> +     the costs of the multiply.  */
> +  if (GET_CODE (op0) == VEC_DUPLICATE)
> +    op0 = XEXP (op0, 0);
> +  else if (GET_CODE (op1) == VEC_DUPLICATE)
> +    op1 = XEXP (op1, 0);
> +
>    /* Integer multiply/fma.  */
>    if (GET_MODE_CLASS (mode) == MODE_INT)
>      {

SVE doesn't have duplicating forms, so I think we should put this code
under the “if (VECTOR_MODE_P (mode))” condition, before changing “mode”,
and then restrict it to VEC_ADVSIMD modes.

(SVE FMUL does have an indexed form, but the index is relative to the
start of the associated quadword, so it isn't a VEC_DUPLICATE.)

I guess there's a danger that this could underestimate the cost for
integer modes, if the scalar integer input needs to be moved from GPRs.
In that case the cost of a MULT + VEC_DUPLICATE is probably more
accurate, even though it's still one instruction before RA.

But I guess there's no perfect answer there.  The new code will be
right for integer modes in some cases and not in others.  Same if
we leave things as they are.  But maybe it'd be worth having a comment
to say that we're assuming the best case, i.e. that the duplicated
value is naturally in FPRs?
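
To illustrate the trade-off being discussed, here is a standalone toy model
(not GCC's actual cost code; the names and unit costs are made up for the
example) of the two costing strategies for a multiply whose operand is a
VEC_DUPLICATE:

```c
/* Toy cost model: a multiply costs one unit; a VEC_DUPLICATE operand
   costs one extra unit unless it is folded into the by-element form.  */
enum op_kind { PLAIN_REG, VEC_DUP };

struct operand { enum op_kind kind; };

static int mult_cost(struct operand op0, struct operand op1, int fold_dup)
{
  int cost = 1;                       /* the multiply itself */
  if (!fold_dup)
    {
      /* Charge for a separate DUP instruction per duplicated operand.  */
      if (op0.kind == VEC_DUP) cost++;
      if (op1.kind == VEC_DUP) cost++;
    }
  return cost;
}
```

With `fold_dup` set (the patch's behaviour) the multiply costs one unit,
which is accurate when the scalar is already in an FPR; without it the cost
is two units, which is accurate when the scalar lives in a GPR and needs a
real move.  Neither answer is right in all cases, which is the point above.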

Thanks,
Richard


* Re: [PATCH] AArch64: Adjust costing of by element MUL to be the same as SAME3 MUL.
  2020-06-08 15:42 ` Richard Sandiford
@ 2020-06-09 11:21   ` Tamar Christina
  2020-06-09 11:44     ` Richard Sandiford
  0 siblings, 1 reply; 6+ messages in thread
From: Tamar Christina @ 2020-06-09 11:21 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

[-- Attachment #1: Type: text/plain, Size: 2603 bytes --]

Hi Richard,


The 06/08/2020 16:42, Richard Sandiford wrote:
> Tamar Christina <tamar.christina@arm.com> writes:
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index 97da60762390db81df9cffaf316b909cd1609130..9cc8da338125afa01bc9fb645f4112d2d7ef548c 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -11279,6 +11279,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
> >    if (VECTOR_MODE_P (mode))
> >      mode = GET_MODE_INNER (mode);
> > 
> > +  /* The by element versions of the instruction has the same costs as the
> > +     normal 3 vector version.  So don't add the costs of the duplicate into
> > +     the costs of the multiply.  */
> > +  if (GET_CODE (op0) == VEC_DUPLICATE)
> > +    op0 = XEXP (op0, 0);
> > +  else if (GET_CODE (op1) == VEC_DUPLICATE)
> > +    op1 = XEXP (op1, 0);
> > +
> >    /* Integer multiply/fma.  */
> >    if (GET_MODE_CLASS (mode) == MODE_INT)
> >      {
> 
> SVE doesn't have duplicating forms, so I think we should put this code
> under the “if (VECTOR_MODE_P (mode))” condition, before changing “mode”,
> and then restrict it to VEC_ADVSIMD modes.
> 
> (SVE FMUL does have an indexed form, but the index is relative to the
> start of the associated quadword, so it isn't a VEC_DUPLICATE.)
>

Done, I have updated the patch. (See attached)
 
> I guess there's a danger that this could underestimate the cost for
> integer modes, if the scalar integer input needs to be moved from GPRs.
> In that case the cost of a MULT + VEC_DUPLICATE is probably more
> accurate, even though it's still one instruction before RA.
> 
> But I guess there's no perfect answer there.  The new code will be
> right for integer modes in some cases and not in others.  Same if
> we leave things as they are.  But maybe it'd be worth having a comment
> to say that we're assuming the best case, i.e. that the duplicated
> value is naturally in FPRs?
> 

Hmm, I haven't added the comment yet since I don't fully understand when the
integer case would be misleading.

In both cases the cost for the GPR is paid by the MOV, no?  I'm missing why
having the MUL account for it would be better in some cases.

For instance for the integer case we used to generate

        dup     v0.4s, w2
        mul     v2.4s, v2.4s, v0.4s

but now do

        fmov    s0, w2
        mul     v2.4s, v2.4s, v0.s[0]

Which is better on older cores such as Cortex-A55 and no different on newer cores
such as Cortex-A76, according to the optimization guides.

Regards,
Tamar

> Thanks,
> Richard

-- 

[-- Attachment #2: pr13166.patch --]
[-- Type: text/x-diff, Size: 1978 bytes --]

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 973c65aa4fb348450872036617362aa17310fb20..d2e959c5276d9b801294c722c92762c5674cb244 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11279,7 +11279,20 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
   op1 = XEXP (x, 1);
 
   if (VECTOR_MODE_P (mode))
-    mode = GET_MODE_INNER (mode);
+    {
+      unsigned int vec_flags = aarch64_classify_vector_mode (mode);
+      mode = GET_MODE_INNER (mode);
+      if (vec_flags & VEC_ADVSIMD)
+	{
+	  /* The by element versions of the instruction has the same costs as the
+	     normal 3 vector version.  So don't add the costs of the duplicate into
+	     the costs of the multiply.  */
+	  if (GET_CODE (op0) == VEC_DUPLICATE)
+	    op0 = XEXP (op0, 0);
+	  else if (GET_CODE (op1) == VEC_DUPLICATE)
+	    op1 = XEXP (op1, 0);
+	}
+    }
 
   /* Integer multiply/fma.  */
   if (GET_MODE_CLASS (mode) == MODE_INT)
diff --git a/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
new file mode 100644
index 0000000000000000000000000000000000000000..513721cee0c8372781e6daf33bc06e256cab8cb8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_float } */
+/* { dg-options "-Ofast" } */
+
+#include <arm_neon.h>
+
+void s_mult_i (int32_t* restrict res, int32_t* restrict a, int32_t b)
+{
+    for (int x = 0; x < 16; x++)
+      res[x] = a[x] * b;
+}
+
+void s_mult_f (float32_t* restrict res, float32_t* restrict a, float32_t b)
+{
+    for (int x = 0; x < 16; x++)
+      res[x] = a[x] * b;
+}
+
+/* { dg-final { scan-assembler-times {\s+mul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */
+/* { dg-final { scan-assembler-times {\s+fmul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */



* Re: [PATCH] AArch64: Adjust costing of by element MUL to be the same as SAME3 MUL.
  2020-06-09 11:21   ` Tamar Christina
@ 2020-06-09 11:44     ` Richard Sandiford
  2020-06-09 12:23       ` Tamar Christina
  0 siblings, 1 reply; 6+ messages in thread
From: Richard Sandiford @ 2020-06-09 11:44 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

Tamar Christina <Tamar.Christina@arm.com> writes:
> Hi Richard,
> The 06/08/2020 16:42, Richard Sandiford wrote:
>> Tamar Christina <tamar.christina@arm.com> writes:
>> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> > index 97da60762390db81df9cffaf316b909cd1609130..9cc8da338125afa01bc9fb645f4112d2d7ef548c 100644
>> > --- a/gcc/config/aarch64/aarch64.c
>> > +++ b/gcc/config/aarch64/aarch64.c
>> > @@ -11279,6 +11279,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
>> >    if (VECTOR_MODE_P (mode))
>> >      mode = GET_MODE_INNER (mode);
>> > 
>> > +  /* The by element versions of the instruction has the same costs as the
>> > +     normal 3 vector version.  So don't add the costs of the duplicate into
>> > +     the costs of the multiply.  */
>> > +  if (GET_CODE (op0) == VEC_DUPLICATE)
>> > +    op0 = XEXP (op0, 0);
>> > +  else if (GET_CODE (op1) == VEC_DUPLICATE)
>> > +    op1 = XEXP (op1, 0);
>> > +
>> >    /* Integer multiply/fma.  */
>> >    if (GET_MODE_CLASS (mode) == MODE_INT)
>> >      {
>> 
>> SVE doesn't have duplicating forms, so I think we should put this code
>> under the “if (VECTOR_MODE_P (mode))” condition, before changing “mode”,
>> and then restrict it to VEC_ADVSIMD modes.
>> 
>> (SVE FMUL does have an indexed form, but the index is relative to the
>> start of the associated quadword, so it isn't a VEC_DUPLICATE.)
>>
>
> Done, I have updated the patch. (See attached)
>  
>> I guess there's a danger that this could underestimate the cost for
>> integer modes, if the scalar integer input needs to be moved from GPRs.
>> In that case the cost of a MULT + VEC_DUPLICATE is probably more
>> accurate, even though it's still one instruction before RA.
>> 
>> But I guess there's no perfect answer there.  The new code will be
>> right for integer modes in some cases and not in others.  Same if
>> we leave things as they are.  But maybe it'd be worth having a comment
>> to say that we're assuming the best case, i.e. that the duplicated
>> value is naturally in FPRs?
>> 
>
> Hmm I haven't added the comment yet since I don't fully understand when the
> integer case would be misleading.
>
> In both cases the cost for the GPR is paid by the MOV no? I'm missing
> why having the MUL account for it would be better in some cases.

The point was that any MOV isn't exposed until after register allocation,
whereas costs are usually applied before then.  So before RA:

> For instance for the integer case we used to generate
>
>         dup     v0.4s, w2
>         mul     v2.4s, v2.4s, v0.4s

...this was costed as:

   (set (reg:V4SI R2) (vec_duplicate:V4SI (reg:SI R1)))
   (set (reg:V4SI R3) (mult:V4SI ...))

and so accurate when R1 naturally ends up in a GPR.

> but now do
>
>         fmov    s0, w2
>         mul     v2.4s, v2.4s, v0.s[0]

...and this is costed as:

   (set (reg:V4SI R3) (mult:V4SI ...))

and so accurate when R1 naturally ends up in an FPR (without needing
a reload to put it there).

In other words, before RA, the patch is making the optimistic assumption
that R1 is already in FPRs and so a separate FMOV won't be needed.

Thanks,
Richard

> Which is better on older cores such as Cortex-A55 and no different on newer cores such as
> Cortex-A76, according to the optimization guides.
>
> Regards,
> Tamar
>
>> Thanks,
>> Richard


* Re: [PATCH] AArch64: Adjust costing of by element MUL to be the same as SAME3 MUL.
  2020-06-09 11:44     ` Richard Sandiford
@ 2020-06-09 12:23       ` Tamar Christina
  2020-06-10  7:32         ` Richard Sandiford
  0 siblings, 1 reply; 6+ messages in thread
From: Tamar Christina @ 2020-06-09 12:23 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

[-- Attachment #1: Type: text/plain, Size: 3866 bytes --]

Hi Richard,

The 06/09/2020 12:44, Richard Sandiford wrote:
> Tamar Christina <Tamar.Christina@arm.com> writes:
> > Hi Richard,
> > The 06/08/2020 16:42, Richard Sandiford wrote:
> >> Tamar Christina <tamar.christina@arm.com> writes:
> >> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> >> > index 97da60762390db81df9cffaf316b909cd1609130..9cc8da338125afa01bc9fb645f4112d2d7ef548c 100644
> >> > --- a/gcc/config/aarch64/aarch64.c
> >> > +++ b/gcc/config/aarch64/aarch64.c
> >> > @@ -11279,6 +11279,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
> >> >    if (VECTOR_MODE_P (mode))
> >> >      mode = GET_MODE_INNER (mode);
> >> > 
> >> > +  /* The by element versions of the instruction has the same costs as the
> >> > +     normal 3 vector version.  So don't add the costs of the duplicate into
> >> > +     the costs of the multiply.  */
> >> > +  if (GET_CODE (op0) == VEC_DUPLICATE)
> >> > +    op0 = XEXP (op0, 0);
> >> > +  else if (GET_CODE (op1) == VEC_DUPLICATE)
> >> > +    op1 = XEXP (op1, 0);
> >> > +
> >> >    /* Integer multiply/fma.  */
> >> >    if (GET_MODE_CLASS (mode) == MODE_INT)
> >> >      {
> >> 
> >> SVE doesn't have duplicating forms, so I think we should put this code
> >> under the “if (VECTOR_MODE_P (mode))” condition, before changing “mode”,
> >> and then restrict it to VEC_ADVSIMD modes.
> >> 
> >> (SVE FMUL does have an indexed form, but the index is relative to the
> >> start of the associated quadword, so it isn't a VEC_DUPLICATE.)
> >>
> >
> > Done, I have updated the patch. (See attached)
> >  
> >> I guess there's a danger that this could underestimate the cost for
> >> integer modes, if the scalar integer input needs to be moved from GPRs.
> >> In that case the cost of a MULT + VEC_DUPLICATE is probably more
> >> accurate, even though it's still one instruction before RA.
> >> 
> >> But I guess there's no perfect answer there.  The new code will be
> >> right for integer modes in some cases and not in others.  Same if
> >> we leave things as they are.  But maybe it'd be worth having a comment
> >> to say that we're assuming the best case, i.e. that the duplicated
> >> value is naturally in FPRs?
> >> 
> >
> > Hmm I haven't added the comment yet since I don't fully understand when the
> > integer case would be misleading.
> >
> > In both cases the cost for the GPR is paid by the MOV no? I'm missing
> > why having the MUL account for it would be better in some cases.
> 
> The point was that any MOV isn't exposed until after register allocation,
> whereas costs are usually applied before then.  So before RA:
> 
> > For instance for the integer case we used to generate
> >
> >         dup     v0.4s, w2
> >         mul     v2.4s, v2.4s, v0.4s
> 
> ...this was costed as:
> 
>    (set (reg:V4SI R2) (vec_duplicate:V4SI (reg:SI R1)))
>    (set (reg:V4SI R3) (mult:V4SI ...))
> 
> and so accurate when R1 naturally ends up in a GPR.
> 
> > but now do
> >
> >         fmov    s0, w2
> >         mul     v2.4s, v2.4s, v0.s[0]
> 
> ...and this is costed as:
> 
>    (set (reg:V4SI R3) (mult:V4SI ...))
> 
> and so accurate when R1 naturally ends up in an FPR (without needing
> a reload to put it there).
> 
> In other words, before RA, the patch is making the optimistic assumption
> that R1 is already in FPRs and so a separate FMOV won't be needed.
> 

Aargggs... yes, that makes sense.  Sorry, when I looked at the dump before I didn't notice the order was switched.
The SET was for the load, of course. :(

I have added the comment as suggested, thanks for the explanation.

OK for master?

Thanks,
Tamar

> Thanks,
> Richard
> 
> > Which is better on older cores such as Cortex-A55 and no different on newer cores such as
> > Cortex-A76, according to the optimization guides.
> >
> > Regards,
> > Tamar
> >
> >> Thanks,
> >> Richard

-- 

[-- Attachment #2: pr13166.patch --]
[-- Type: text/x-diff, Size: 2148 bytes --]

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 973c65aa4fb348450872036617362aa17310fb20..5a5a9ad44f0945b4d6a869fc2b4e857022659c55 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11279,7 +11279,22 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
   op1 = XEXP (x, 1);
 
   if (VECTOR_MODE_P (mode))
-    mode = GET_MODE_INNER (mode);
+    {
+      unsigned int vec_flags = aarch64_classify_vector_mode (mode);
+      mode = GET_MODE_INNER (mode);
+      if (vec_flags & VEC_ADVSIMD)
+	{
+	  /* The by element versions of the instruction has the same costs as the
+	     normal 3 vector version.  So don't add the costs of the duplicate into
+	     the costs of the multiply.  We make an assumption that the value in
+	     the VEC_DUPLICATE is already the FP&SIMD side.  This means costing of
+	     a MUL by element pre RA is a bit optimistic.  */
+	  if (GET_CODE (op0) == VEC_DUPLICATE)
+	    op0 = XEXP (op0, 0);
+	  else if (GET_CODE (op1) == VEC_DUPLICATE)
+	    op1 = XEXP (op1, 0);
+	}
+    }
 
   /* Integer multiply/fma.  */
   if (GET_MODE_CLASS (mode) == MODE_INT)
diff --git a/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
new file mode 100644
index 0000000000000000000000000000000000000000..513721cee0c8372781e6daf33bc06e256cab8cb8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_float } */
+/* { dg-options "-Ofast" } */
+
+#include <arm_neon.h>
+
+void s_mult_i (int32_t* restrict res, int32_t* restrict a, int32_t b)
+{
+    for (int x = 0; x < 16; x++)
+      res[x] = a[x] * b;
+}
+
+void s_mult_f (float32_t* restrict res, float32_t* restrict a, float32_t b)
+{
+    for (int x = 0; x < 16; x++)
+      res[x] = a[x] * b;
+}
+
+/* { dg-final { scan-assembler-times {\s+mul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */
+/* { dg-final { scan-assembler-times {\s+fmul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */



* Re: [PATCH] AArch64: Adjust costing of by element MUL to be the same as SAME3 MUL.
  2020-06-09 12:23       ` Tamar Christina
@ 2020-06-10  7:32         ` Richard Sandiford
  0 siblings, 0 replies; 6+ messages in thread
From: Richard Sandiford @ 2020-06-10  7:32 UTC (permalink / raw)
  To: Tamar Christina
  Cc: gcc-patches, nd, Richard Earnshaw, Marcus Shawcroft, Kyrylo Tkachov

Tamar Christina <Tamar.Christina@arm.com> writes:
> @@ -11279,7 +11279,22 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
>    op1 = XEXP (x, 1);
 
>    if (VECTOR_MODE_P (mode))
> -    mode = GET_MODE_INNER (mode);
> +    {
> +      unsigned int vec_flags = aarch64_classify_vector_mode (mode);
> +      mode = GET_MODE_INNER (mode);
> +      if (vec_flags & VEC_ADVSIMD)
> +	{
> +	  /* The by element versions of the instruction has the same costs as the

“by-element”, s/has/have/

> +	     normal 3 vector version.  So don't add the costs of the duplicate into

“3-vector”

> +	     the costs of the multiply.  We make an assumption that the value in

Maybe s/value in/input to/?

> +	     the VEC_DUPLICATE is already the FP&SIMD side.  This means costing of

“already on the”

Some lines are over the 80-character limit.

OK with those changes, thanks.

Richard

> +	     a MUL by element pre RA is a bit optimistic.  */
> +	  if (GET_CODE (op0) == VEC_DUPLICATE)
> +	    op0 = XEXP (op0, 0);
> +	  else if (GET_CODE (op1) == VEC_DUPLICATE)
> +	    op1 = XEXP (op1, 0);
> +	}
> +    }
>  
>    /* Integer multiply/fma.  */
>    if (GET_MODE_CLASS (mode) == MODE_INT)



