public inbox for gcc-patches@gcc.gnu.org
* [PATCH] aarch64: Don't include vec_select in SIMD multiply cost
@ 2021-07-20 10:46 Jonathan Wright
  2021-07-22 17:16 ` Richard Sandiford
  0 siblings, 1 reply; 4+ messages in thread
From: Jonathan Wright @ 2021-07-20 10:46 UTC (permalink / raw)
  To: gcc-patches; +Cc: Kyrylo Tkachov, Richard Sandiford

[-- Attachment #1: Type: text/plain, Size: 1075 bytes --]

Hi,

The Neon multiply/multiply-accumulate/multiply-subtract instructions
can take various forms - multiplying full vector registers of values
or multiplying one vector by a single element of another. Regardless
of the form used, these instructions have the same cost, and this
should be reflected by the RTL cost function.
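
For example (an illustrative snippet, not part of the patch itself),
the vector-by-vector and vector-by-lane forms below should cost the
same; the assembly in the comments is what one would expect, not
verified compiler output:

  #include <arm_neon.h>

  int32x4_t
  mul_vec (int32x4_t a, int32x4_t b)
  {
    return vmulq_s32 (a, b);          /* mul  v0.4s, v0.4s, v1.4s   */
  }

  int32x4_t
  mul_lane (int32x4_t a, int32x4_t b)
  {
    return vmulq_laneq_s32 (a, b, 1); /* mul  v0.4s, v0.4s, v1.s[1] */
  }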

This patch adds RTL tree traversal in the Neon multiply cost function
to match the vec_select used by the lane-referencing forms of the
instructions already mentioned. This traversal prevents the cost of
the vec_select from being added into the cost of the multiply -
meaning that these instructions can now be emitted in the combine
pass as they are no longer deemed prohibitively expensive.
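
Concretely, the by-lane multiply appears in RTL roughly as follows
(modes and operands simplified):

  (mult:V4SI
    (reg:V4SI x)
    (vec_duplicate:V4SI
      (vec_select:SI (reg:V4SI y)
                     (parallel [(const_int 1)]))))

with the widening forms additionally wrapping the selected element in
a zero_extend or sign_extend, which the traversal also looks through.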

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-19  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Traverse
	RTL tree to prevent vec_select from being added into Neon
	multiply cost.

[-- Attachment #2: rb14675.patch --]
[-- Type: application/octet-stream, Size: 1359 bytes --]

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index f5b25a7f7041645921e6ad85714efda73b993492..b368303b0e699229266e6d008e28179c496bf8cd 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11985,6 +11985,21 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
 	    op0 = XEXP (op0, 0);
 	  else if (GET_CODE (op1) == VEC_DUPLICATE)
 	    op1 = XEXP (op1, 0);
+	  /* The same argument applies to the VEC_SELECT when using the lane-
+	     referencing forms of the MUL/MLA/MLS instructions. Without the
+	     traversal here, the combine pass deems these patterns too
+	     expensive and subsequently does not emit the lane-referencing
+	     forms of the instructions. In addition, canonical form is for the
+	     VEC_SELECT to be the second argument of the multiply - thus only
+	     op1 is traversed.  */
+	  if (GET_CODE (op1) == VEC_SELECT
+	      && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
+	    op1 = XEXP (op1, 0);
+	  else if ((GET_CODE (op1) == ZERO_EXTEND
+		    || GET_CODE (op1) == SIGN_EXTEND)
+		   && GET_CODE (XEXP (op1, 0)) == VEC_SELECT
+		   && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
+	    op1 = XEXP (XEXP (op1, 0), 0);
 	}
       cost += rtx_cost (op0, mode, MULT, 0, speed);
       cost += rtx_cost (op1, mode, MULT, 1, speed);


* Re: [PATCH] aarch64: Don't include vec_select in SIMD multiply cost
  2021-07-20 10:46 [PATCH] aarch64: Don't include vec_select in SIMD multiply cost Jonathan Wright
@ 2021-07-22 17:16 ` Richard Sandiford
  2021-07-28 13:34   ` [PATCH V2] " Jonathan Wright
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Sandiford @ 2021-07-22 17:16 UTC (permalink / raw)
  To: Jonathan Wright; +Cc: gcc-patches, Kyrylo Tkachov

Jonathan Wright <Jonathan.Wright@arm.com> writes:
> Hi,
>
> The Neon multiply/multiply-accumulate/multiply-subtract instructions
> can take various forms - multiplying full vector registers of values
> or multiplying one vector by a single element of another. Regardless
> of the form used, these instructions have the same cost, and this
> should be reflected by the RTL cost function.
>
> This patch adds RTL tree traversal in the Neon multiply cost function
> to match the vec_select used by the lane-referencing forms of the
> instructions already mentioned. This traversal prevents the cost of
> the vec_select from being added into the cost of the multiply -
> meaning that these instructions can now be emitted in the combine
> pass as they are no longer deemed prohibitively expensive.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright  <jonathan.wright@arm.com>
>
>         * config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Traverse
>         RTL tree to prevent vec_select from being added into Neon
>         multiply cost.
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index f5b25a7f7041645921e6ad85714efda73b993492..b368303b0e699229266e6d008e28179c496bf8cd 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11985,6 +11985,21 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
>  	    op0 = XEXP (op0, 0);
>  	  else if (GET_CODE (op1) == VEC_DUPLICATE)
>  	    op1 = XEXP (op1, 0);
> +	  /* The same argument applies to the VEC_SELECT when using the lane-
> +	     referencing forms of the MUL/MLA/MLS instructions. Without the
> +	     traversal here, the combine pass deems these patterns too
> +	     expensive and subsequently does not emit the lane-referencing
> +	     forms of the instructions. In addition, canonical form is for the
> +	     VEC_SELECT to be the second argument of the multiply - thus only
> +	     op1 is traversed.  */
> +	  if (GET_CODE (op1) == VEC_SELECT
> +	      && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
> +	    op1 = XEXP (op1, 0);
> +	  else if ((GET_CODE (op1) == ZERO_EXTEND
> +		    || GET_CODE (op1) == SIGN_EXTEND)
> +		   && GET_CODE (XEXP (op1, 0)) == VEC_SELECT
> +		   && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
> +	    op1 = XEXP (XEXP (op1, 0), 0);

I think this logically belongs in the “GET_CODE (op1) == VEC_DUPLICATE”
if block, since the condition is never true otherwise.  We can probably
skip the GET_MODE_NUNITS tests, but if you'd prefer to keep them, I think
it would be better to add them to the existing VEC_DUPLICATE tests rather
than restrict them to the VEC_SELECT ones.
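
For example, roughly (an untested sketch):

  else if (GET_CODE (op1) == VEC_DUPLICATE)
    {
      op1 = XEXP (op1, 0);
      if (GET_CODE (op1) == VEC_SELECT)
        op1 = XEXP (op1, 0);
      else if ((GET_CODE (op1) == ZERO_EXTEND
                || GET_CODE (op1) == SIGN_EXTEND)
               && GET_CODE (XEXP (op1, 0)) == VEC_SELECT)
        op1 = XEXP (XEXP (op1, 0), 0);
    }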

Also, although this is in Advanced SIMD-specific code, I think it'd be
better to use:

  is_a<scalar_mode> (GET_MODE (op1))

instead of:

  GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1

Do you have a testcase?

Thanks,
Richard


* Re: [PATCH V2] aarch64: Don't include vec_select in SIMD multiply cost
  2021-07-22 17:16 ` Richard Sandiford
@ 2021-07-28 13:34   ` Jonathan Wright
  2021-08-04  8:51     ` Richard Sandiford
  0 siblings, 1 reply; 4+ messages in thread
From: Jonathan Wright @ 2021-07-28 13:34 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches, Kyrylo Tkachov

[-- Attachment #1: Type: text/plain, Size: 4179 bytes --]

Hi,

V2 of the patch addresses the initial review comments, factors out
common code (as we discussed off-list) and adds a set of unit tests
to verify the code generation benefit.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-19  Jonathan Wright  <jonathan.wright@arm.com>

	* config/aarch64/aarch64.c (aarch64_strip_duplicate_vec_elt):
	Define.
	(aarch64_rtx_mult_cost): Traverse RTL tree to prevent
	vec_select cost from being added into Neon multiply cost.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/vmul_element_cost.c: New test.



From: Richard Sandiford <richard.sandiford@arm.com>
Sent: 22 July 2021 18:16
To: Jonathan Wright <Jonathan.Wright@arm.com>
Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
Subject: Re: [PATCH] aarch64: Don't include vec_select in SIMD multiply cost 
 
Jonathan Wright <Jonathan.Wright@arm.com> writes:
> Hi,
>
> The Neon multiply/multiply-accumulate/multiply-subtract instructions
> can take various forms - multiplying full vector registers of values
> or multiplying one vector by a single element of another. Regardless
> of the form used, these instructions have the same cost, and this
> should be reflected by the RTL cost function.
>
> This patch adds RTL tree traversal in the Neon multiply cost function
> to match the vec_select used by the lane-referencing forms of the
> instructions already mentioned. This traversal prevents the cost of
> the vec_select from being added into the cost of the multiply -
> meaning that these instructions can now be emitted in the combine
> pass as they are no longer deemed prohibitively expensive.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright  <jonathan.wright@arm.com>
>
>         * config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Traverse
>         RTL tree to prevent vec_select from being added into Neon
>         multiply cost.
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index f5b25a7f7041645921e6ad85714efda73b993492..b368303b0e699229266e6d008e28179c496bf8cd 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11985,6 +11985,21 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
>            op0 = XEXP (op0, 0);
>          else if (GET_CODE (op1) == VEC_DUPLICATE)
>            op1 = XEXP (op1, 0);
> +       /* The same argument applies to the VEC_SELECT when using the lane-
> +          referencing forms of the MUL/MLA/MLS instructions. Without the
> +          traversal here, the combine pass deems these patterns too
> +          expensive and subsequently does not emit the lane-referencing
> +          forms of the instructions. In addition, canonical form is for the
> +          VEC_SELECT to be the second argument of the multiply - thus only
> +          op1 is traversed.  */
> +       if (GET_CODE (op1) == VEC_SELECT
> +           && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
> +         op1 = XEXP (op1, 0);
> +       else if ((GET_CODE (op1) == ZERO_EXTEND
> +                 || GET_CODE (op1) == SIGN_EXTEND)
> +                && GET_CODE (XEXP (op1, 0)) == VEC_SELECT
> +                && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
> +         op1 = XEXP (XEXP (op1, 0), 0);

I think this logically belongs in the “GET_CODE (op1) == VEC_DUPLICATE”
if block, since the condition is never true otherwise.  We can probably
skip the GET_MODE_NUNITS tests, but if you'd prefer to keep them, I think
it would be better to add them to the existing VEC_DUPLICATE tests rather
than restrict them to the VEC_SELECT ones.

Also, although this is in Advanced SIMD-specific code, I think it'd be
better to use:

  is_a<scalar_mode> (GET_MODE (op1))

instead of:

  GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1

Do you have a testcase?

Thanks,
Richard

[-- Attachment #2: rb14675.patch --]
[-- Type: application/octet-stream, Size: 6068 bytes --]

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 3bdf19d71b54d0ade8e5648323f6e1f012bc4f8f..5809887997305317c5a81421089db431685e2927 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11908,6 +11908,26 @@ aarch64_strip_extend (rtx x, bool strip_shift)
   return x;
 }
 
+
+/* Helper function for rtx cost calculation. Strip VEC_DUPLICATE as well as
+   any subsequent extend and VEC_SELECT from X. Returns the inner scalar
+   operand if successful, or the original expression on failure.  */
+static rtx
+aarch64_strip_duplicate_vec_elt (rtx x)
+{
+  if (GET_CODE (x) == VEC_DUPLICATE
+      && is_a<scalar_mode> (GET_MODE (XEXP (x, 0))))
+    {
+      x = XEXP (x, 0);
+      if (GET_CODE (x) == VEC_SELECT)
+	x = XEXP (x, 0);
+      else if ((GET_CODE (x) == ZERO_EXTEND || GET_CODE (x) == SIGN_EXTEND)
+	       && GET_CODE (XEXP (x, 0)) == VEC_SELECT)
+	x = XEXP (XEXP (x, 0), 0);
+    }
+  return x;
+}
+
 /* Return true iff CODE is a shift supported in combination
    with arithmetic instructions.  */
 
@@ -11977,14 +11997,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
 	{
 	  /* The by-element versions of the instruction have the same costs as
 	     the normal 3-vector version.  So don't add the costs of the
-	     duplicate into the costs of the multiply.  We make an assumption
-	     that the input to the VEC_DUPLICATE is already on the FP & SIMD
-	     side.  This means costing of a MUL by element pre RA is a bit
-	     optimistic.  */
+	     duplicate or subsequent select into the costs of the multiply.  We
+	     make an assumption that the input to the VEC_DUPLICATE is already
+	     on the FP & SIMD side.  This means costing of a MUL by element pre
+	     RA is a bit optimistic.  */
 	  if (GET_CODE (op0) == VEC_DUPLICATE)
-	    op0 = XEXP (op0, 0);
+	    op0 = aarch64_strip_duplicate_vec_elt (op0);
 	  else if (GET_CODE (op1) == VEC_DUPLICATE)
-	    op1 = XEXP (op1, 0);
+	    op1 = aarch64_strip_duplicate_vec_elt (op1);
 	}
       cost += rtx_cost (op0, mode, MULT, 0, speed);
       cost += rtx_cost (op1, mode, MULT, 1, speed);
diff --git a/gcc/testsuite/gcc.target/aarch64/vmul_element_cost.c b/gcc/testsuite/gcc.target/aarch64/vmul_element_cost.c
new file mode 100644
index 0000000000000000000000000000000000000000..c153775f0914072fb985b18516f110aded7dccd5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vmul_element_cost.c
@@ -0,0 +1,94 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+#include <arm_neon.h>
+
+#define TEST_MUL_UNIFORM(name, q, vectype, ts) \
+  vectype test_ ## name ## q ## _ ## ts (vectype a, vectype b, vectype c) \
+	{ \
+		vectype t0 = name ## q ## _n_ ## ts (a, c[1]); \
+		vectype t1 = name ## q ## _n_ ## ts (b, c[1]); \
+		return vmul ## q ## _ ## ts (t0, t1); \
+	}
+
+TEST_MUL_UNIFORM (vmul, , int16x4_t, s16)
+TEST_MUL_UNIFORM (vmul, , uint16x4_t, u16)
+TEST_MUL_UNIFORM (vmul, , int32x2_t, s32)
+TEST_MUL_UNIFORM (vmul, , uint32x2_t, u32)
+TEST_MUL_UNIFORM (vmul, , float32x2_t, f32)
+TEST_MUL_UNIFORM (vmul, q, int16x8_t, s16)
+TEST_MUL_UNIFORM (vmul, q, uint16x8_t, u16)
+TEST_MUL_UNIFORM (vmul, q, int32x4_t, s32)
+TEST_MUL_UNIFORM (vmul, q, uint32x4_t, u32)
+TEST_MUL_UNIFORM (vmul, q, float32x4_t, f32)
+TEST_MUL_UNIFORM (vmul, q, float64x2_t, f64)
+
+#define TEST_MLX_UNIFORM(name, q, vectype, ts) \
+  vectype test_ ## name ## q ## _ ## ts (vectype acc, vectype a, vectype b) \
+	{ \
+		acc = name ## q ## _n_ ## ts (acc, a, b[1]); \
+		return name ## q ## _n_ ## ts (acc, a, b[1]); \
+	}
+
+TEST_MLX_UNIFORM (vmla, , int16x4_t, s16)
+TEST_MLX_UNIFORM (vmla, , uint16x4_t, u16)
+TEST_MLX_UNIFORM (vmla, , int32x2_t, s32)
+TEST_MLX_UNIFORM (vmla, , uint32x2_t, u32)
+TEST_MLX_UNIFORM (vmla, , float32x2_t, f32)
+TEST_MLX_UNIFORM (vmla, q, int16x8_t, s16)
+TEST_MLX_UNIFORM (vmla, q, uint16x8_t, u16)
+TEST_MLX_UNIFORM (vmla, q, int32x4_t, s32)
+TEST_MLX_UNIFORM (vmla, q, uint32x4_t, u32)
+TEST_MLX_UNIFORM (vmla, q, float32x4_t, f32)
+
+TEST_MLX_UNIFORM (vmls, , int16x4_t, s16)
+TEST_MLX_UNIFORM (vmls, , uint16x4_t, u16)
+TEST_MLX_UNIFORM (vmls, , int32x2_t, s32)
+TEST_MLX_UNIFORM (vmls, , uint32x2_t, u32)
+TEST_MLX_UNIFORM (vmls, , float32x2_t, f32)
+TEST_MLX_UNIFORM (vmls, q, int16x8_t, s16)
+TEST_MLX_UNIFORM (vmls, q, uint16x8_t, u16)
+TEST_MLX_UNIFORM (vmls, q, int32x4_t, s32)
+TEST_MLX_UNIFORM (vmls, q, uint32x4_t, u32)
+TEST_MLX_UNIFORM (vmls, q, float32x4_t, f32)
+
+#define TEST_MUL_LONG(name, rettype, intype, ts, rs) \
+  rettype test_ ## name ## ts (intype a, intype b, intype c) \
+	{ \
+		rettype t0 = name ## ts (a, c[1]); \
+		rettype t1 = name ## ts (b, c[1]); \
+		return vqaddq ## _ ## rs (t0, t1); \
+	}
+
+TEST_MUL_LONG (vmull_n_, int32x4_t, int16x4_t, s16, s32)
+TEST_MUL_LONG (vmull_n_, uint32x4_t, uint16x4_t, u16, u32)
+TEST_MUL_LONG (vmull_n_, int64x2_t, int32x2_t, s32, s64)
+TEST_MUL_LONG (vmull_n_, uint64x2_t, uint32x2_t, u32, u64)
+
+TEST_MUL_LONG (vqdmull_n_, int32x4_t, int16x4_t, s16, s32)
+TEST_MUL_LONG (vqdmull_n_, int64x2_t, int32x2_t, s32, s64)
+
+#define TEST_MLX_LONG(name, rettype, intype, ts, rs) \
+  rettype test_ ## name ## _ ## ts (rettype acc, intype a, intype b) \
+	{ \
+		acc = name ## ts (acc, a, b[1]); \
+		return name ## ts (acc, a, b[1]); \
+	}
+
+TEST_MLX_LONG (vmlal_n_, int32x4_t, int16x4_t, s16, s32)
+TEST_MLX_LONG (vmlal_n_, uint32x4_t, uint16x4_t, u16, u32)
+TEST_MLX_LONG (vmlal_n_, int64x2_t, int32x2_t, s32, s64)
+TEST_MLX_LONG (vmlal_n_, uint64x2_t, uint32x2_t, u32, u64)
+
+TEST_MLX_LONG (vmlsl_n_, int32x4_t, int16x4_t, s16, s32)
+TEST_MLX_LONG (vmlsl_n_, uint32x4_t, uint16x4_t, u16, u32)
+TEST_MLX_LONG (vmlsl_n_, int64x2_t, int32x2_t, s32, s64)
+TEST_MLX_LONG (vmlsl_n_, uint64x2_t, uint32x2_t, u32, u64)
+
+TEST_MLX_LONG (vqdmlal_n_, int32x4_t, int16x4_t, s16, s32)
+TEST_MLX_LONG (vqdmlal_n_, int64x2_t, int32x2_t, s32, s64)
+
+TEST_MLX_LONG (vqdmlsl_n_, int32x4_t, int16x4_t, s16, s32)
+TEST_MLX_LONG (vqdmlsl_n_, int64x2_t, int32x2_t, s32, s64)
+
+/* { dg-final { scan-assembler-not "dup\\t" } } */


* Re: [PATCH V2] aarch64: Don't include vec_select in SIMD multiply cost
  2021-07-28 13:34   ` [PATCH V2] " Jonathan Wright
@ 2021-08-04  8:51     ` Richard Sandiford
  0 siblings, 0 replies; 4+ messages in thread
From: Richard Sandiford @ 2021-08-04  8:51 UTC (permalink / raw)
  To: Jonathan Wright via Gcc-patches

Jonathan Wright via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Hi,
>
> V2 of the patch addresses the initial review comments, factors out
> common code (as we discussed off-list) and adds a set of unit tests
> to verify the code generation benefit.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright  <jonathan.wright@arm.com>
>
> 	* config/aarch64/aarch64.c (aarch64_strip_duplicate_vec_elt):
> 	Define.
> 	(aarch64_rtx_mult_cost): Traverse RTL tree to prevent
> 	vec_select cost from being added into Neon multiply cost.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/vmul_element_cost.c: New test.
>
>
>
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: 22 July 2021 18:16
> To: Jonathan Wright <Jonathan.Wright@arm.com>
> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>; Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Subject: Re: [PATCH] aarch64: Don't include vec_select in SIMD multiply cost 
>  
> Jonathan Wright <Jonathan.Wright@arm.com> writes:
>> Hi,
>>
>> The Neon multiply/multiply-accumulate/multiply-subtract instructions
>> can take various forms - multiplying full vector registers of values
>> or multiplying one vector by a single element of another. Regardless
>> of the form used, these instructions have the same cost, and this
>> should be reflected by the RTL cost function.
>>
>> This patch adds RTL tree traversal in the Neon multiply cost function
>> to match the vec_select used by the lane-referencing forms of the
>> instructions already mentioned. This traversal prevents the cost of
>> the vec_select from being added into the cost of the multiply -
>> meaning that these instructions can now be emitted in the combine
>> pass as they are no longer deemed prohibitively expensive.
>>
>> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
>> issues.
>>
>> Ok for master?
>>
>> Thanks,
>> Jonathan
>>
>> ---
>>
>> gcc/ChangeLog:
>>
>> 2021-07-19  Jonathan Wright  <jonathan.wright@arm.com>
>>
>>         * config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Traverse
>>         RTL tree to prevent vec_select from being added into Neon
>>         multiply cost.
>>
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index f5b25a7f7041645921e6ad85714efda73b993492..b368303b0e699229266e6d008e28179c496bf8cd 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -11985,6 +11985,21 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
>>            op0 = XEXP (op0, 0);
>>          else if (GET_CODE (op1) == VEC_DUPLICATE)
>>            op1 = XEXP (op1, 0);
>> +       /* The same argument applies to the VEC_SELECT when using the lane-
>> +          referencing forms of the MUL/MLA/MLS instructions. Without the
>> +          traversal here, the combine pass deems these patterns too
>> +          expensive and subsequently does not emit the lane-referencing
>> +          forms of the instructions. In addition, canonical form is for the
>> +          VEC_SELECT to be the second argument of the multiply - thus only
>> +          op1 is traversed.  */
>> +       if (GET_CODE (op1) == VEC_SELECT
>> +           && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
>> +         op1 = XEXP (op1, 0);
>> +       else if ((GET_CODE (op1) == ZERO_EXTEND
>> +                 || GET_CODE (op1) == SIGN_EXTEND)
>> +                && GET_CODE (XEXP (op1, 0)) == VEC_SELECT
>> +                && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
>> +         op1 = XEXP (XEXP (op1, 0), 0);
>
> I think this logically belongs in the “GET_CODE (op1) == VEC_DUPLICATE”
> if block, since the condition is never true otherwise.  We can probably
> skip the GET_MODE_NUNITS tests, but if you'd prefer to keep them, I think
> it would be better to add them to the existing VEC_DUPLICATE tests rather
> than restrict them to the VEC_SELECT ones.
>
> Also, although this is in Advanced SIMD-specific code, I think it'd be
> better to use:
>
>   is_a<scalar_mode> (GET_MODE (op1))
>
> instead of:
>
>   GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1
>
> Do you have a testcase?
>
> Thanks,
> Richard
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 3bdf19d71b54d0ade8e5648323f6e1f012bc4f8f..5809887997305317c5a81421089db431685e2927 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11908,6 +11908,26 @@ aarch64_strip_extend (rtx x, bool strip_shift)
>    return x;
>  }
>  
> +
> +/* Helper function for rtx cost calculation. Strip VEC_DUPLICATE as well as
> +   any subsequent extend and VEC_SELECT from X. Returns the inner scalar
> +   operand if successful, or the original expression on failure.  */
> +static rtx
> +aarch64_strip_duplicate_vec_elt (rtx x)
> +{
> +  if (GET_CODE (x) == VEC_DUPLICATE
> +      && is_a<scalar_mode> (GET_MODE (XEXP (x, 0))))
> +    {
> +      x = XEXP (x, 0);
> +      if (GET_CODE (x) == VEC_SELECT)
> +	x = XEXP (x, 0);
> +      else if ((GET_CODE (x) == ZERO_EXTEND || GET_CODE (x) == SIGN_EXTEND)
> +	       && GET_CODE (XEXP (x, 0)) == VEC_SELECT)
> +	x = XEXP (XEXP (x, 0), 0);
> +    }
> +  return x;
> +}
> +
>  /* Return true iff CODE is a shift supported in combination
>     with arithmetic instructions.  */
>  
> @@ -11977,14 +11997,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
>  	{
>  	  /* The by-element versions of the instruction have the same costs as
>  	     the normal 3-vector version.  So don't add the costs of the
> -	     duplicate into the costs of the multiply.  We make an assumption
> -	     that the input to the VEC_DUPLICATE is already on the FP & SIMD
> -	     side.  This means costing of a MUL by element pre RA is a bit
> -	     optimistic.  */
> +	     duplicate or subsequent select into the costs of the multiply.  We

Very pedantic, but: the select conceptually happens before the duplicate.
TBH I think we can probably just drop this sentence, since the calls
make the operation self-describing.  (The other parts of the comment
are still useful.)

> +	     make an assumption that the input to the VEC_DUPLICATE is already
> +	     on the FP & SIMD side.  This means costing of a MUL by element pre
> +	     RA is a bit optimistic.  */
>  	  if (GET_CODE (op0) == VEC_DUPLICATE)
> -	    op0 = XEXP (op0, 0);
> +	    op0 = aarch64_strip_duplicate_vec_elt (op0);
>  	  else if (GET_CODE (op1) == VEC_DUPLICATE)
> -	    op1 = XEXP (op1, 0);
> +	    op1 = aarch64_strip_duplicate_vec_elt (op1);

I think we might as well call aarch64_strip_duplicate_vec_elt
unconditionally, without the VEC_DUPLICATE tests.
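
That is, something like (sketch):

  op0 = aarch64_strip_duplicate_vec_elt (op0);
  op1 = aarch64_strip_duplicate_vec_elt (op1);

since the helper already returns its argument unchanged when there is
nothing to strip.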

OK with those changes, and sorry for the slow review.

Thanks,
Richard

>  	}
>        cost += rtx_cost (op0, mode, MULT, 0, speed);
>        cost += rtx_cost (op1, mode, MULT, 1, speed);
> diff --git a/gcc/testsuite/gcc.target/aarch64/vmul_element_cost.c b/gcc/testsuite/gcc.target/aarch64/vmul_element_cost.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c153775f0914072fb985b18516f110aded7dccd5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vmul_element_cost.c
> @@ -0,0 +1,94 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +
> +#include <arm_neon.h>
> +
> +#define TEST_MUL_UNIFORM(name, q, vectype, ts) \
> +  vectype test_ ## name ## q ## _ ## ts (vectype a, vectype b, vectype c) \
> +	{ \
> +		vectype t0 = name ## q ## _n_ ## ts (a, c[1]); \
> +		vectype t1 = name ## q ## _n_ ## ts (b, c[1]); \
> +		return vmul ## q ## _ ## ts (t0, t1); \
> +	}
> +
> +TEST_MUL_UNIFORM (vmul, , int16x4_t, s16)
> +TEST_MUL_UNIFORM (vmul, , uint16x4_t, u16)
> +TEST_MUL_UNIFORM (vmul, , int32x2_t, s32)
> +TEST_MUL_UNIFORM (vmul, , uint32x2_t, u32)
> +TEST_MUL_UNIFORM (vmul, , float32x2_t, f32)
> +TEST_MUL_UNIFORM (vmul, q, int16x8_t, s16)
> +TEST_MUL_UNIFORM (vmul, q, uint16x8_t, u16)
> +TEST_MUL_UNIFORM (vmul, q, int32x4_t, s32)
> +TEST_MUL_UNIFORM (vmul, q, uint32x4_t, u32)
> +TEST_MUL_UNIFORM (vmul, q, float32x4_t, f32)
> +TEST_MUL_UNIFORM (vmul, q, float64x2_t, f64)
> +
> +#define TEST_MLX_UNIFORM(name, q, vectype, ts) \
> +  vectype test_ ## name ## q ## _ ## ts (vectype acc, vectype a, vectype b) \
> +	{ \
> +		acc = name ## q ## _n_ ## ts (acc, a, b[1]); \
> +		return name ## q ## _n_ ## ts (acc, a, b[1]); \
> +	}
> +
> +TEST_MLX_UNIFORM (vmla, , int16x4_t, s16)
> +TEST_MLX_UNIFORM (vmla, , uint16x4_t, u16)
> +TEST_MLX_UNIFORM (vmla, , int32x2_t, s32)
> +TEST_MLX_UNIFORM (vmla, , uint32x2_t, u32)
> +TEST_MLX_UNIFORM (vmla, , float32x2_t, f32)
> +TEST_MLX_UNIFORM (vmla, q, int16x8_t, s16)
> +TEST_MLX_UNIFORM (vmla, q, uint16x8_t, u16)
> +TEST_MLX_UNIFORM (vmla, q, int32x4_t, s32)
> +TEST_MLX_UNIFORM (vmla, q, uint32x4_t, u32)
> +TEST_MLX_UNIFORM (vmla, q, float32x4_t, f32)
> +
> +TEST_MLX_UNIFORM (vmls, , int16x4_t, s16)
> +TEST_MLX_UNIFORM (vmls, , uint16x4_t, u16)
> +TEST_MLX_UNIFORM (vmls, , int32x2_t, s32)
> +TEST_MLX_UNIFORM (vmls, , uint32x2_t, u32)
> +TEST_MLX_UNIFORM (vmls, , float32x2_t, f32)
> +TEST_MLX_UNIFORM (vmls, q, int16x8_t, s16)
> +TEST_MLX_UNIFORM (vmls, q, uint16x8_t, u16)
> +TEST_MLX_UNIFORM (vmls, q, int32x4_t, s32)
> +TEST_MLX_UNIFORM (vmls, q, uint32x4_t, u32)
> +TEST_MLX_UNIFORM (vmls, q, float32x4_t, f32)
> +
> +#define TEST_MUL_LONG(name, rettype, intype, ts, rs) \
> +  rettype test_ ## name ## ts (intype a, intype b, intype c) \
> +	{ \
> +		rettype t0 = name ## ts (a, c[1]); \
> +		rettype t1 = name ## ts (b, c[1]); \
> +		return vqaddq ## _ ## rs (t0, t1); \
> +	}
> +
> +TEST_MUL_LONG (vmull_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MUL_LONG (vmull_n_, uint32x4_t, uint16x4_t, u16, u32)
> +TEST_MUL_LONG (vmull_n_, int64x2_t, int32x2_t, s32, s64)
> +TEST_MUL_LONG (vmull_n_, uint64x2_t, uint32x2_t, u32, u64)
> +
> +TEST_MUL_LONG (vqdmull_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MUL_LONG (vqdmull_n_, int64x2_t, int32x2_t, s32, s64)
> +
> +#define TEST_MLX_LONG(name, rettype, intype, ts, rs) \
> +  rettype test_ ## name ## _ ## ts (rettype acc, intype a, intype b) \
> +	{ \
> +		acc = name ## ts (acc, a, b[1]); \
> +		return name ## ts (acc, a, b[1]); \
> +	}
> +
> +TEST_MLX_LONG (vmlal_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MLX_LONG (vmlal_n_, uint32x4_t, uint16x4_t, u16, u32)
> +TEST_MLX_LONG (vmlal_n_, int64x2_t, int32x2_t, s32, s64)
> +TEST_MLX_LONG (vmlal_n_, uint64x2_t, uint32x2_t, u32, u64)
> +
> +TEST_MLX_LONG (vmlsl_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MLX_LONG (vmlsl_n_, uint32x4_t, uint16x4_t, u16, u32)
> +TEST_MLX_LONG (vmlsl_n_, int64x2_t, int32x2_t, s32, s64)
> +TEST_MLX_LONG (vmlsl_n_, uint64x2_t, uint32x2_t, u32, u64)
> +
> +TEST_MLX_LONG (vqdmlal_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MLX_LONG (vqdmlal_n_, int64x2_t, int32x2_t, s32, s64)
> +
> +TEST_MLX_LONG (vqdmlsl_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MLX_LONG (vqdmlsl_n_, int64x2_t, int32x2_t, s32, s64)
> +
> +/* { dg-final { scan-assembler-not "dup\\t" } } */


