* [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful
@ 2018-05-15  8:20 Kyrill Tkachov
  2018-05-15 10:04 ` Richard Biener
  0 siblings, 1 reply; 12+ messages in thread
From: Kyrill Tkachov @ 2018-05-15  8:20 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard Earnshaw (lists), James Greenhalgh, Marcus Shawcroft

[-- Attachment #1: Type: text/plain, Size: 3044 bytes --]

Hi all,

This is a respin of James's patch from: https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
The original patch was approved and committed but was later reverted because of failures on big-endian.
This tweaked version fixes the big-endian failures in aarch64_expand_vector_init by picking the right
element of VALS to move into the low part of the vector register depending on endianness. The rest of the patch
stays the same. I'm looking for approval on the aarch64 parts, as they are the ones that have changed
since the last approved version of the patch.

-----------------------------------------------------------------------

In the testcase in this patch we create an SLP vector with only two
elements. Our current vector initialisation code will first duplicate
the first element to both lanes, then overwrite the top lane with a new
value.

This duplication can be clunky and wasteful.

Better would be to use the fact that we will always be overwriting
the remaining bits, and simply move the first element to the correct
place (implicitly zeroing all other bits).

This reduces the code generated for this case, and can allow more
efficient addressing modes and other second-order benefits for AArch64
code which has been vectorized to V2DI mode.

Note that the change is generic enough to catch the case for any vector
mode, but is expected to be most useful for 2x64-bit vectorization.
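
For illustration, here is a hand-written sketch (not the testcase added
by the patch) of the kind of two-element constructor this affects; with
the change, the first element is moved straight into lane zero rather
than being splatted to both lanes first:

   typedef double v2df __attribute__ ((vector_size (16)));

   /* A two-element vector constructor with no repeated element, so the
      splat of the first element buys us nothing.  */
   v2df
   make_pair (double a, double b)
   {
     return (v2df) { a, b };
   }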

Unfortunately, on its own, this would cause failures in
gcc.target/aarch64/load_v2vec_lanes_1.c and
gcc.target/aarch64/store_v2vec_lanes.c, which expect to see the
vec_merge and vec_duplicate forms for their simplifications to apply.
To fix this, add a special case to the AArch64 code if we are loading
from two memory addresses, and use the load_pair_lanes patterns
directly.
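
For completeness, a hand-written sketch (again, not one of those
existing testcases) of the memory form the special case targets: two
loads from adjacent addresses feeding the constructor.  Whether this
exact source reaches aarch64_expand_vector_init depends on earlier
folding, so take it purely as an illustration of the address pattern
being checked:

   typedef long long v2di __attribute__ ((vector_size (16)));

   /* y[0] and y[1] are adjacent DImode loads, which is what the
      load_pair_lanes<mode> path is looking for.  */
   v2di
   construct_lanes (long long *y)
   {
     return (v2di) { y[0], y[1] };
   }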

We also need a new pattern in simplify-rtx.c:simplify_ternary_operation
to catch:

   (vec_merge:OUTER
      (vec_duplicate:OUTER x:INNER)
      (subreg:OUTER y:INNER 0)
      (const_int N))

And simplify it to:

   (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)

This is similar to the existing patterns which are tested in this
function, without requiring the second operand to also be a
vec_duplicate.
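
For example (just the rule above instantiated for V2DI, not an
additional case), the new code would rewrite:

   (vec_merge:V2DI
      (vec_duplicate:V2DI x:DI)
      (subreg:V2DI y:DI 0)
      (const_int 1))

to:

   (vec_concat:V2DI x:DI y:DI)

and with (const_int 2) as the selector the vec_concat operands would be
swapped.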

Bootstrapped and tested on aarch64-none-linux-gnu and tested on
aarch64-none-elf.

Note that this requires
https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
if we don't want to ICE creating broken vector zero extends.

Are the non-AArch64 parts OK?

Thanks,
James

---
2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>
             Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

         * config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify
         code generation for cases where splatting a value is not useful.
         * simplify-rtx.c (simplify_ternary_operation): Simplify
         vec_merge across a vec_duplicate and a paradoxical subreg forming a vector
         mode to a vec_concat.

2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>

         * gcc.target/aarch64/vect-slp-dup.c: New.

[-- Attachment #2: vector-splat.patch --]
[-- Type: text/x-patch, Size: 4592 bytes --]

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 4b5183b602b8786307deb8e3d8056323028b50a2..562eb315f881a1e0d8aa3ba946c99b4c6f25949b 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -13901,9 +13901,54 @@ aarch64_expand_vector_init (rtx target, rtx vals)
 	    maxv = matches[i][1];
 	  }
 
-      /* Create a duplicate of the most common element.  */
-      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
-      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
+      /* Create a duplicate of the most common element, unless all elements
+	 are equally useless to us, in which case just immediately set the
+	 vector register using the first element.  */
+
+      if (maxv == 1)
+	{
+	  /* For vectors of two 64-bit elements, we can do even better.  */
+	  if (n_elts == 2
+	      && (inner_mode == E_DImode
+		  || inner_mode == E_DFmode))
+
+	    {
+	      rtx x0 = XVECEXP (vals, 0, 0);
+	      rtx x1 = XVECEXP (vals, 0, 1);
+	      /* Combine can pick up this case, but handling it directly
+		 here leaves clearer RTL.
+
+		 This is load_pair_lanes<mode>, and also gives us a clean-up
+		 for store_pair_lanes<mode>.  */
+	      if (memory_operand (x0, inner_mode)
+		  && memory_operand (x1, inner_mode)
+		  && !STRICT_ALIGNMENT
+		  && rtx_equal_p (XEXP (x1, 0),
+				  plus_constant (Pmode,
+						 XEXP (x0, 0),
+						 GET_MODE_SIZE (inner_mode))))
+		{
+		  rtx t;
+		  if (inner_mode == DFmode)
+		    t = gen_load_pair_lanesdf (target, x0, x1);
+		  else
+		    t = gen_load_pair_lanesdi (target, x0, x1);
+		  emit_insn (t);
+		  return;
+		}
+	    }
+	  /* The subreg-move sequence below will move into lane zero of the
+	     vector register.  For big-endian we want that position to hold
+	     the last element of VALS.  */
+	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
+	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
+	}
+      else
+	{
+	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+	  aarch64_emit_move (target, gen_vec_duplicate (mode, x));
+	}
 
       /* Insert the rest.  */
       for (int i = 0; i < n_elts; i++)
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 23244a12545ba2f9db21f66a63a6d36ff8fd29fc..d32cdd19ecd9b8ca6a2957d115bec9c6613d3836 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -5891,6 +5891,36 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
 		return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
 	    }
 
+	  /* Replace:
+
+	      (vec_merge:outer (vec_duplicate:outer x:inner)
+			       (subreg:outer y:inner 0)
+			       (const_int N))
+
+	     with (vec_concat:outer x:inner y:inner) if N == 1,
+	     or (vec_concat:outer y:inner x:inner) if N == 2.
+
+	     Implicitly, this means we have a paradoxical subreg, but such
+	     a check is cheap, so make it anyway.
+
+	     Only applies for vectors of two elements.  */
+	  if (GET_CODE (op0) == VEC_DUPLICATE
+	      && GET_CODE (op1) == SUBREG
+	      && GET_MODE (op1) == GET_MODE (op0)
+	      && GET_MODE (SUBREG_REG (op1)) == GET_MODE (XEXP (op0, 0))
+	      && paradoxical_subreg_p (op1)
+	      && subreg_lowpart_p (op1)
+	      && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
+	      && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
+	      && IN_RANGE (sel, 1, 2))
+	    {
+	      rtx newop0 = XEXP (op0, 0);
+	      rtx newop1 = SUBREG_REG (op1);
+	      if (sel == 2)
+		std::swap (newop0, newop1);
+	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
+	    }
+
 	  /* Replace (vec_merge (vec_duplicate x) (vec_duplicate y)
 				 (const_int n))
 	     with (vec_concat x y) or (vec_concat y x) depending on value
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
new file mode 100644
index 0000000000000000000000000000000000000000..0541e480d1f8561dbd9b2a56926c8df60d667a54
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+
+/* { dg-options "-O3 -ftree-vectorize -fno-vect-cost-model" } */
+
+void bar (double);
+
+void
+foo (double *restrict in, double *restrict in2,
+     double *restrict out1, double *restrict out2)
+{
+  for (int i = 0; i < 1024; i++)
+    {
+      out1[i] = in[i] + 2.0 * in[i+128];
+      out1[i+1] = in[i+1] + 2.0 * in2[i];
+      bar (in[i]);
+    }
+}
+
+/* { dg-final { scan-assembler-not "dup\tv\[0-9\]+.2d, v\[0-9\]+" } } */
+


* Re: [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful
  2018-05-15  8:20 [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful Kyrill Tkachov
@ 2018-05-15 10:04 ` Richard Biener
  2018-05-16  9:08   ` Kyrill Tkachov
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Biener @ 2018-05-15 10:04 UTC (permalink / raw)
  To: kyrylo.tkachov
  Cc: GCC Patches, Richard Earnshaw, james.greenhalgh, Marcus Shawcroft

On Tue, May 15, 2018 at 10:20 AM Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com>
wrote:

> Hi all,

> This is a respin of James's patch from:
https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
> The original patch was approved and committed but was later reverted
because of failures on big-endian.
> This tweaked version fixes the big-endian failures in
aarch64_expand_vector_init by picking the right
> element of VALS to move into the low part of the vector register
depending on endianness. The rest of the patch
> stays the same. I'm looking for approval on the aarch64 parts, as they
are the ones that have changed
> since the last approved version of the patch.

> -----------------------------------------------------------------------

> In the testcase in this patch we create an SLP vector with only two
> elements. Our current vector initialisation code will first duplicate
> the first element to both lanes, then overwrite the top lane with a new
> value.

> This duplication can be clunky and wasteful.

> Better would be to simply use the fact that we will always be
> overwriting
> the remaining bits, and simply move the first element to the corrcet
> place
> (implicitly zeroing all other bits).

> This reduces the code generation for this case, and can allow more
> efficient addressing modes, and other second order benefits for AArch64
> code which has been vectorized to V2DI mode.

> Note that the change is generic enough to catch the case for any vector
> mode, but is expected to be most useful for 2x64-bit vectorization.

> Unfortunately, on its own, this would cause failures in
> gcc.target/aarch64/load_v2vec_lanes_1.c and
> gcc.target/aarch64/store_v2vec_lanes.c , which expect to see many more
> vec_merge and vec_duplicate for their simplifications to apply. To fix
> this,
> add a special case to the AArch64 code if we are loading from two memory
> addresses, and use the load_pair_lanes patterns directly.

> We also need a new pattern in simplify-rtx.c:simplify_ternary_operation
> , to
> catch:

>     (vec_merge:OUTER
>        (vec_duplicate:OUTER x:INNER)
>        (subreg:OUTER y:INNER 0)
>        (const_int N))

> And simplify it to:

>     (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)

> This is similar to the existing patterns which are tested in this
> function,
> without requiring the second operand to also be a vec_duplicate.

> Bootstrapped and tested on aarch64-none-linux-gnu and tested on
> aarch64-none-elf.

> Note that this requires
> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
> if we don't want to ICE creating broken vector zero extends.

> Are the non-AArch64 parts OK?

Is (vec_merge (subreg ..) (vec_duplicate)) canonicalized to the form
you handle?  I see the (vec_merge (vec_duplicate...) (vec_concat)) case
also doesn't handle the swapped operand case.

Otherwise the middle-end parts looks ok.

Thanks,
Richard.

> Thanks,
> James

> ---
> 2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>
>               Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

>           * config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify
>           code generation for cases where splatting a value is not useful.
>           * simplify-rtx.c (simplify_ternary_operation): Simplify
>           vec_merge across a vec_duplicate and a paradoxical subreg
forming a vector
>           mode to a vec_concat.

> 2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>

>           * gcc.target/aarch64/vect-slp-dup.c: New.


* Re: [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful
  2018-05-15 10:04 ` Richard Biener
@ 2018-05-16  9:08   ` Kyrill Tkachov
  2018-05-16 10:11     ` Richard Biener
  2018-05-17  9:45     ` Richard Earnshaw (lists)
  0 siblings, 2 replies; 12+ messages in thread
From: Kyrill Tkachov @ 2018-05-16  9:08 UTC (permalink / raw)
  To: Richard Biener
  Cc: GCC Patches, Richard Earnshaw, james.greenhalgh, Marcus Shawcroft

[-- Attachment #1: Type: text/plain, Size: 3946 bytes --]


On 15/05/18 10:58, Richard Biener wrote:
> On Tue, May 15, 2018 at 10:20 AM Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com>
> wrote:
>
>> Hi all,
>> This is a respin of James's patch from:
> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>> The original patch was approved and committed but was later reverted
> because of failures on big-endian.
>> This tweaked version fixes the big-endian failures in
> aarch64_expand_vector_init by picking the right
>> element of VALS to move into the low part of the vector register
> depending on endianness. The rest of the patch
>> stays the same. I'm looking for approval on the aarch64 parts, as they
> are the ones that have changed
>> since the last approved version of the patch.
>> -----------------------------------------------------------------------
>> In the testcase in this patch we create an SLP vector with only two
>> elements. Our current vector initialisation code will first duplicate
>> the first element to both lanes, then overwrite the top lane with a new
>> value.
>> This duplication can be clunky and wasteful.
>> Better would be to simply use the fact that we will always be
>> overwriting
>> the remaining bits, and simply move the first element to the corrcet
>> place
>> (implicitly zeroing all other bits).
>> This reduces the code generation for this case, and can allow more
>> efficient addressing modes, and other second order benefits for AArch64
>> code which has been vectorized to V2DI mode.
>> Note that the change is generic enough to catch the case for any vector
>> mode, but is expected to be most useful for 2x64-bit vectorization.
>> Unfortunately, on its own, this would cause failures in
>> gcc.target/aarch64/load_v2vec_lanes_1.c and
>> gcc.target/aarch64/store_v2vec_lanes.c , which expect to see many more
>> vec_merge and vec_duplicate for their simplifications to apply. To fix
>> this,
>> add a special case to the AArch64 code if we are loading from two memory
>> addresses, and use the load_pair_lanes patterns directly.
>> We also need a new pattern in simplify-rtx.c:simplify_ternary_operation
>> , to
>> catch:
>>      (vec_merge:OUTER
>>         (vec_duplicate:OUTER x:INNER)
>>         (subreg:OUTER y:INNER 0)
>>         (const_int N))
>> And simplify it to:
>>      (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)
>> This is similar to the existing patterns which are tested in this
>> function,
>> without requiring the second operand to also be a vec_duplicate.
>> Bootstrapped and tested on aarch64-none-linux-gnu and tested on
>> aarch64-none-elf.
>> Note that this requires
>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>> if we don't want to ICE creating broken vector zero extends.
>> Are the non-AArch64 parts OK?
> Is (vec_merge (subreg ..) (vec_duplicate)) canonicalized to the form
> you handle?  I see the (vec_merge (vec_duplicate...) (vec_concat)) case
> also doesn't handle the swapped operand case.
>
> Otherwise the middle-end parts looks ok.

I don't see any explicit canonicalisation code for it.
I've updated the simplify-rtx part to handle the swapped operand case.
Is the attached patch better in this regard? I couldn't think of a clean way to avoid
duplicating some logic (beyond creating a new function away from the callsite).

Thanks,
Kyrill

> Thanks,
> Richard.
>
>> Thanks,
>> James
>> ---
>> 2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>
>>                Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>            * config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify
>>            code generation for cases where splatting a value is not useful.
>>            * simplify-rtx.c (simplify_ternary_operation): Simplify
>>            vec_merge across a vec_duplicate and a paradoxical subreg
> forming a vector
>>            mode to a vec_concat.
>> 2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>
>>            * gcc.target/aarch64/vect-slp-dup.c: New.


[-- Attachment #2: vec-splat.patch --]
[-- Type: text/x-patch, Size: 5502 bytes --]

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index a2003fe52875f1653d644347bafd7773d1f01e91..6bf6c05535b61eef1021d46bcd8448fb3a0b25f4 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -13916,9 +13916,54 @@ aarch64_expand_vector_init (rtx target, rtx vals)
 	    maxv = matches[i][1];
 	  }
 
-      /* Create a duplicate of the most common element.  */
-      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
-      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
+      /* Create a duplicate of the most common element, unless all elements
+	 are equally useless to us, in which case just immediately set the
+	 vector register using the first element.  */
+
+      if (maxv == 1)
+	{
+	  /* For vectors of two 64-bit elements, we can do even better.  */
+	  if (n_elts == 2
+	      && (inner_mode == E_DImode
+		  || inner_mode == E_DFmode))
+
+	    {
+	      rtx x0 = XVECEXP (vals, 0, 0);
+	      rtx x1 = XVECEXP (vals, 0, 1);
+	      /* Combine can pick up this case, but handling it directly
+		 here leaves clearer RTL.
+
+		 This is load_pair_lanes<mode>, and also gives us a clean-up
+		 for store_pair_lanes<mode>.  */
+	      if (memory_operand (x0, inner_mode)
+		  && memory_operand (x1, inner_mode)
+		  && !STRICT_ALIGNMENT
+		  && rtx_equal_p (XEXP (x1, 0),
+				  plus_constant (Pmode,
+						 XEXP (x0, 0),
+						 GET_MODE_SIZE (inner_mode))))
+		{
+		  rtx t;
+		  if (inner_mode == DFmode)
+		    t = gen_load_pair_lanesdf (target, x0, x1);
+		  else
+		    t = gen_load_pair_lanesdi (target, x0, x1);
+		  emit_insn (t);
+		  return;
+		}
+	    }
+	  /* The subreg-move sequence below will move into lane zero of the
+	     vector register.  For big-endian we want that position to hold
+	     the last element of VALS.  */
+	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
+	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
+	}
+      else
+	{
+	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+	  aarch64_emit_move (target, gen_vec_duplicate (mode, x));
+	}
 
       /* Insert the rest.  */
       for (int i = 0; i < n_elts; i++)
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index e96c9d1b441fcfcbddcbffb0c1c7c0e2a871a2a3..d2714db7ae8ef946a6ce035bea59ddbec890e905 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -5891,6 +5891,60 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
 		return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
 	    }
 
+	  /* Replace:
+
+	      (vec_merge:outer (vec_duplicate:outer x:inner)
+			       (subreg:outer y:inner 0)
+			       (const_int N))
+
+	     with (vec_concat:outer x:inner y:inner) if N == 1,
+	     or (vec_concat:outer y:inner x:inner) if N == 2.
+
+	     Implicitly, this means we have a paradoxical subreg, but such
+	     a check is cheap, so make it anyway.
+
+	     Only applies for vectors of two elements.  */
+	  if (GET_CODE (op0) == VEC_DUPLICATE
+	      && GET_CODE (op1) == SUBREG
+	      && GET_MODE (op1) == GET_MODE (op0)
+	      && GET_MODE (SUBREG_REG (op1)) == GET_MODE (XEXP (op0, 0))
+	      && paradoxical_subreg_p (op1)
+	      && subreg_lowpart_p (op1)
+	      && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
+	      && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
+	      && IN_RANGE (sel, 1, 2))
+	    {
+	      rtx newop0 = XEXP (op0, 0);
+	      rtx newop1 = SUBREG_REG (op1);
+	      if (sel == 2)
+		std::swap (newop0, newop1);
+	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
+	    }
+
+	  /* Same as above but with switched operands:
+		Replace (vec_merge:outer (subreg:outer x:inner 0)
+					 (vec_duplicate:outer y:inner)
+			       (const_int N))
+
+	     with (vec_concat:outer x:inner y:inner) if N == 1,
+	     or (vec_concat:outer y:inner x:inner) if N == 2.  */
+	  if (GET_CODE (op1) == VEC_DUPLICATE
+	      && GET_CODE (op0) == SUBREG
+	      && GET_MODE (op0) == GET_MODE (op1)
+	      && GET_MODE (SUBREG_REG (op0)) == GET_MODE (XEXP (op1, 0))
+	      && paradoxical_subreg_p (op0)
+	      && subreg_lowpart_p (op0)
+	      && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
+	      && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
+	      && IN_RANGE (sel, 1, 2))
+	    {
+	      rtx newop0 = SUBREG_REG (op0);
+	      rtx newop1 = XEXP (op1, 0);
+	      if (sel == 2)
+		std::swap (newop0, newop1);
+	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
+	    }
+
 	  /* Replace (vec_merge (vec_duplicate x) (vec_duplicate y)
 				 (const_int n))
 	     with (vec_concat x y) or (vec_concat y x) depending on value
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
new file mode 100644
index 0000000000000000000000000000000000000000..0541e480d1f8561dbd9b2a56926c8df60d667a54
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+
+/* { dg-options "-O3 -ftree-vectorize -fno-vect-cost-model" } */
+
+void bar (double);
+
+void
+foo (double *restrict in, double *restrict in2,
+     double *restrict out1, double *restrict out2)
+{
+  for (int i = 0; i < 1024; i++)
+    {
+      out1[i] = in[i] + 2.0 * in[i+128];
+      out1[i+1] = in[i+1] + 2.0 * in2[i];
+      bar (in[i]);
+    }
+}
+
+/* { dg-final { scan-assembler-not "dup\tv\[0-9\]+.2d, v\[0-9\]+" } } */
+


* Re: [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful
  2018-05-16  9:08   ` Kyrill Tkachov
@ 2018-05-16 10:11     ` Richard Biener
  2018-05-16 10:12       ` Kyrill Tkachov
  2018-05-16 14:42       ` Segher Boessenkool
  2018-05-17  9:45     ` Richard Earnshaw (lists)
  1 sibling, 2 replies; 12+ messages in thread
From: Richard Biener @ 2018-05-16 10:11 UTC (permalink / raw)
  To: kyrylo.tkachov, Segher Boessenkool
  Cc: GCC Patches, Richard Earnshaw, james.greenhalgh, Marcus Shawcroft

On Wed, May 16, 2018 at 10:37 AM Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com>
wrote:


> On 15/05/18 10:58, Richard Biener wrote:
> > On Tue, May 15, 2018 at 10:20 AM Kyrill Tkachov
> > <kyrylo.tkachov@foss.arm.com>
> > wrote:
> >
> >> Hi all,
> >> This is a respin of James's patch from:
> > https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
> >> The original patch was approved and committed but was later reverted
> > because of failures on big-endian.
> >> This tweaked version fixes the big-endian failures in
> > aarch64_expand_vector_init by picking the right
> >> element of VALS to move into the low part of the vector register
> > depending on endianness. The rest of the patch
> >> stays the same. I'm looking for approval on the aarch64 parts, as they
> > are the ones that have changed
> >> since the last approved version of the patch.
> >> -----------------------------------------------------------------------
> >> In the testcase in this patch we create an SLP vector with only two
> >> elements. Our current vector initialisation code will first duplicate
> >> the first element to both lanes, then overwrite the top lane with a new
> >> value.
> >> This duplication can be clunky and wasteful.
> >> Better would be to simply use the fact that we will always be
> >> overwriting
> >> the remaining bits, and simply move the first element to the corrcet
> >> place
> >> (implicitly zeroing all other bits).
> >> This reduces the code generation for this case, and can allow more
> >> efficient addressing modes, and other second order benefits for AArch64
> >> code which has been vectorized to V2DI mode.
> >> Note that the change is generic enough to catch the case for any vector
> >> mode, but is expected to be most useful for 2x64-bit vectorization.
> >> Unfortunately, on its own, this would cause failures in
> >> gcc.target/aarch64/load_v2vec_lanes_1.c and
> >> gcc.target/aarch64/store_v2vec_lanes.c , which expect to see many more
> >> vec_merge and vec_duplicate for their simplifications to apply. To fix
> >> this,
> >> add a special case to the AArch64 code if we are loading from two
memory
> >> addresses, and use the load_pair_lanes patterns directly.
> >> We also need a new pattern in simplify-rtx.c:simplify_ternary_operation
> >> , to
> >> catch:
> >>      (vec_merge:OUTER
> >>         (vec_duplicate:OUTER x:INNER)
> >>         (subreg:OUTER y:INNER 0)
> >>         (const_int N))
> >> And simplify it to:
> >>      (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)
> >> This is similar to the existing patterns which are tested in this
> >> function,
> >> without requiring the second operand to also be a vec_duplicate.
> >> Bootstrapped and tested on aarch64-none-linux-gnu and tested on
> >> aarch64-none-elf.
> >> Note that this requires
> >> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
> >> if we don't want to ICE creating broken vector zero extends.
> >> Are the non-AArch64 parts OK?
> > Is (vec_merge (subreg ..) (vec_duplicate)) canonicalized to the form
> > you handle?  I see the (vec_merge (vec_duplicate...) (vec_concat)) case
> > also doesn't handle the swapped operand case.
> >
> > Otherwise the middle-end parts looks ok.

> I don't see any explicit canonicalisation code for it.
> I've updated the simplify-rtx part to handle the swapped operand case.
> Is the attached patch better in this regard? I couldn't think of a clean
way to avoid
> duplicating some logic (beyond creating a new function away from the
callsite).

Works for me.  Were you able to actually create such RTL from testcases?
Segher, do you know where canonicalization rules are documented?
IIRC we do not actively try to canonicalize in most cases.

Richard.

> Thanks,
> Kyrill

> > Thanks,
> > Richard.
> >
> >> Thanks,
> >> James
> >> ---
> >> 2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>
> >>                Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> >>            * config/aarch64/aarch64.c (aarch64_expand_vector_init):
Modify
> >>            code generation for cases where splatting a value is not
useful.
> >>            * simplify-rtx.c (simplify_ternary_operation): Simplify
> >>            vec_merge across a vec_duplicate and a paradoxical subreg
> > forming a vector
> >>            mode to a vec_concat.
> >> 2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>
> >>            * gcc.target/aarch64/vect-slp-dup.c: New.


* Re: [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful
  2018-05-16 10:11     ` Richard Biener
@ 2018-05-16 10:12       ` Kyrill Tkachov
  2018-05-16 15:13         ` Segher Boessenkool
  2018-05-16 14:42       ` Segher Boessenkool
  1 sibling, 1 reply; 12+ messages in thread
From: Kyrill Tkachov @ 2018-05-16 10:12 UTC (permalink / raw)
  To: Richard Biener, Segher Boessenkool
  Cc: GCC Patches, Richard Earnshaw, james.greenhalgh, Marcus Shawcroft


On 16/05/18 10:42, Richard Biener wrote:
> On Wed, May 16, 2018 at 10:37 AM Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com>
> wrote:
>
>
>> On 15/05/18 10:58, Richard Biener wrote:
>>> On Tue, May 15, 2018 at 10:20 AM Kyrill Tkachov
>>> <kyrylo.tkachov@foss.arm.com>
>>> wrote:
>>>
>>>> Hi all,
>>>> This is a respin of James's patch from:
>>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>>>> The original patch was approved and committed but was later reverted
>>> because of failures on big-endian.
>>>> This tweaked version fixes the big-endian failures in
>>> aarch64_expand_vector_init by picking the right
>>>> element of VALS to move into the low part of the vector register
>>> depending on endianness. The rest of the patch
>>>> stays the same. I'm looking for approval on the aarch64 parts, as they
>>> are the ones that have changed
>>>> since the last approved version of the patch.
>>>> -----------------------------------------------------------------------
>>>> In the testcase in this patch we create an SLP vector with only two
>>>> elements. Our current vector initialisation code will first duplicate
>>>> the first element to both lanes, then overwrite the top lane with a new
>>>> value.
>>>> This duplication can be clunky and wasteful.
>>>> Better would be to simply use the fact that we will always be
>>>> overwriting
>>>> the remaining bits, and simply move the first element to the corrcet
>>>> place
>>>> (implicitly zeroing all other bits).
>>>> This reduces the code generation for this case, and can allow more
>>>> efficient addressing modes, and other second order benefits for AArch64
>>>> code which has been vectorized to V2DI mode.
>>>> Note that the change is generic enough to catch the case for any vector
>>>> mode, but is expected to be most useful for 2x64-bit vectorization.
>>>> Unfortunately, on its own, this would cause failures in
>>>> gcc.target/aarch64/load_v2vec_lanes_1.c and
>>>> gcc.target/aarch64/store_v2vec_lanes.c , which expect to see many more
>>>> vec_merge and vec_duplicate for their simplifications to apply. To fix
>>>> this,
>>>> add a special case to the AArch64 code if we are loading from two
> memory
>>>> addresses, and use the load_pair_lanes patterns directly.
>>>> We also need a new pattern in simplify-rtx.c:simplify_ternary_operation
>>>> , to
>>>> catch:
>>>>       (vec_merge:OUTER
>>>>          (vec_duplicate:OUTER x:INNER)
>>>>          (subreg:OUTER y:INNER 0)
>>>>          (const_int N))
>>>> And simplify it to:
>>>>       (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)
>>>> This is similar to the existing patterns which are tested in this
>>>> function,
>>>> without requiring the second operand to also be a vec_duplicate.
>>>> Bootstrapped and tested on aarch64-none-linux-gnu and tested on
>>>> aarch64-none-elf.
>>>> Note that this requires
>>>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>>>> if we don't want to ICE creating broken vector zero extends.
>>>> Are the non-AArch64 parts OK?
>>> Is (vec_merge (subreg ..) (vec_duplicate)) canonicalized to the form
>>> you handle?  I see the (vec_merge (vec_duplicate...) (vec_concat)) case
>>> also doesn't handle the swapped operand case.
>>>
>>> Otherwise the middle-end parts looks ok.
>> I don't see any explicit canonicalisation code for it.
>> I've updated the simplify-rtx part to handle the swapped operand case.
>> Is the attached patch better in this regard? I couldn't think of a clean
> way to avoid
>> duplicating some logic (beyond creating a new function away from the
> callsite).
>
> Works for me.  Were you able to actually create such RTL from testcases?
> Segher, do you know where canonicalization rules are documented?
> IIRC we do not actively try to canonicalize in most cases.

The documentation we have for RTL canonicalisation is at:
https://gcc.gnu.org/onlinedocs/gccint/Insn-Canonicalizations.html#Insn-Canonicalizations

It doesn't mention anything about vec_merge AFAICS so I couldn't convince myself that there
is a canonicalisation that we enforce (though maybe someone can prove me wrong).

Kyrill

> Richard.
>
>> Thanks,
>> Kyrill
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> James
>>>> ---
>>>> 2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>
>>>>                 Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>             * config/aarch64/aarch64.c (aarch64_expand_vector_init):
> Modify
>>>>             code generation for cases where splatting a value is not
> useful.
>>>>             * simplify-rtx.c (simplify_ternary_operation): Simplify
>>>>             vec_merge across a vec_duplicate and a paradoxical subreg
>>> forming a vector
>>>>             mode to a vec_concat.
>>>> 2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>
>>>>             * gcc.target/aarch64/vect-slp-dup.c: New.


* Re: [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful
  2018-05-16 10:11     ` Richard Biener
  2018-05-16 10:12       ` Kyrill Tkachov
@ 2018-05-16 14:42       ` Segher Boessenkool
  1 sibling, 0 replies; 12+ messages in thread
From: Segher Boessenkool @ 2018-05-16 14:42 UTC (permalink / raw)
  To: Richard Biener
  Cc: kyrylo.tkachov, GCC Patches, Richard Earnshaw, james.greenhalgh,
	Marcus Shawcroft

On Wed, May 16, 2018 at 11:42:39AM +0200, Richard Biener wrote:
> Works for me.  Were you able to actually create such RTL from testcases?
> Segher, do you know where canonicalization rules are documented?
> IIRC we do not actively try to canonicalize in most cases.

md.texi, node "Insn Canonicalizations"?


Segher


* Re: [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful
  2018-05-16 10:12       ` Kyrill Tkachov
@ 2018-05-16 15:13         ` Segher Boessenkool
  0 siblings, 0 replies; 12+ messages in thread
From: Segher Boessenkool @ 2018-05-16 15:13 UTC (permalink / raw)
  To: Kyrill Tkachov
  Cc: Richard Biener, GCC Patches, Richard Earnshaw, james.greenhalgh,
	Marcus Shawcroft

On Wed, May 16, 2018 at 11:10:55AM +0100, Kyrill Tkachov wrote:
> On 16/05/18 10:42, Richard Biener wrote:
> >Segher, do you know where canonicalization rules are documented?
> >IIRC we do not actively try to canonicalize in most cases.
> 
> The documentation we have for RTL canonicalisation is at:
> https://gcc.gnu.org/onlinedocs/gccint/Insn-Canonicalizations.html#Insn-Canonicalizations
> 
> It doesn't mention anything about vec_merge AFAICS so I couldn't convince 
> myself that there
> is a canonicalisation that we enforce (though maybe someone can prove me 
> wrong).

Many canonicalisations aren't documented; it's never clear which of the
canonicalisations are how canonical :-/


Segher


* Re: [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful
  2018-05-16  9:08   ` Kyrill Tkachov
  2018-05-16 10:11     ` Richard Biener
@ 2018-05-17  9:45     ` Richard Earnshaw (lists)
  2018-05-19  9:59       ` Richard Sandiford
  1 sibling, 1 reply; 12+ messages in thread
From: Richard Earnshaw (lists) @ 2018-05-17  9:45 UTC (permalink / raw)
  To: Kyrill Tkachov, Richard Biener
  Cc: GCC Patches, james.greenhalgh, Marcus Shawcroft

On 16/05/18 09:37, Kyrill Tkachov wrote:
> 
> On 15/05/18 10:58, Richard Biener wrote:
>> On Tue, May 15, 2018 at 10:20 AM Kyrill Tkachov
>> <kyrylo.tkachov@foss.arm.com>
>> wrote:
>>
>>> Hi all,
>>> This is a respin of James's patch from:
>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>>> The original patch was approved and committed but was later reverted
>> because of failures on big-endian.
>>> This tweaked version fixes the big-endian failures in
>> aarch64_expand_vector_init by picking the right
>>> element of VALS to move into the low part of the vector register
>> depending on endianness. The rest of the patch
>>> stays the same. I'm looking for approval on the aarch64 parts, as they
>> are the ones that have changed
>>> since the last approved version of the patch.
>>> -----------------------------------------------------------------------
>>> In the testcase in this patch we create an SLP vector with only two
>>> elements. Our current vector initialisation code will first duplicate
>>> the first element to both lanes, then overwrite the top lane with a new
>>> value.
>>> This duplication can be clunky and wasteful.
>>> Better would be to simply use the fact that we will always be
>>> overwriting
>>> the remaining bits, and simply move the first element to the corrcet
>>> place
>>> (implicitly zeroing all other bits).
>>> This reduces the code generation for this case, and can allow more
>>> efficient addressing modes, and other second order benefits for AArch64
>>> code which has been vectorized to V2DI mode.
>>> Note that the change is generic enough to catch the case for any vector
>>> mode, but is expected to be most useful for 2x64-bit vectorization.
>>> Unfortunately, on its own, this would cause failures in
>>> gcc.target/aarch64/load_v2vec_lanes_1.c and
>>> gcc.target/aarch64/store_v2vec_lanes.c , which expect to see many more
>>> vec_merge and vec_duplicate for their simplifications to apply. To fix
>>> this,
>>> add a special case to the AArch64 code if we are loading from two memory
>>> addresses, and use the load_pair_lanes patterns directly.
>>> We also need a new pattern in simplify-rtx.c:simplify_ternary_operation
>>> , to
>>> catch:
>>>      (vec_merge:OUTER
>>>         (vec_duplicate:OUTER x:INNER)
>>>         (subreg:OUTER y:INNER 0)
>>>         (const_int N))
>>> And simplify it to:
>>>      (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)
>>> This is similar to the existing patterns which are tested in this
>>> function,
>>> without requiring the second operand to also be a vec_duplicate.
>>> Bootstrapped and tested on aarch64-none-linux-gnu and tested on
>>> aarch64-none-elf.
>>> Note that this requires
>>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>>> if we don't want to ICE creating broken vector zero extends.
>>> Are the non-AArch64 parts OK?
>> Is (vec_merge (subreg ..) (vec_duplicate)) canonicalized to the form
>> you handle?  I see the (vec_merge (vec_duplicate...) (vec_concat)) case
>> also doesn't handle the swapped operand case.
>>
>> Otherwise the middle-end parts looks ok.
> 
> I don't see any explicit canonicalisation code for it.
> I've updated the simplify-rtx part to handle the swapped operand case.
> Is the attached patch better in this regard? I couldn't think of a clean
> way to avoid
> duplicating some logic (beyond creating a new function away from the
> callsite).
> 
> Thanks,
> Kyrill
> 
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> James
>>> ---
>>> 2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>
>>>                Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>            * config/aarch64/aarch64.c (aarch64_expand_vector_init):
>>> Modify
>>>            code generation for cases where splatting a value is not
>>> useful.
>>>            * simplify-rtx.c (simplify_ternary_operation): Simplify
>>>            vec_merge across a vec_duplicate and a paradoxical subreg
>> forming a vector
>>>            mode to a vec_concat.
>>> 2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>
>>>            * gcc.target/aarch64/vect-slp-dup.c: New.
> 

I'm surprised we don't seem to have a function in the compiler that
performs this check:

+		  && rtx_equal_p (XEXP (x1, 0),
+				  plus_constant (Pmode,
+						 XEXP (x0, 0),
+						 GET_MODE_SIZE (inner_mode))))

Without generating dead RTL (plus_constant will rarely be able to return
a subexpression of the original pattern).  I would have thought this
sort of test was not that uncommon.

However, I don't think that needs to hold up this patch.

OK.

R.
> 
> vec-splat.patch
> 
> 
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index a2003fe52875f1653d644347bafd7773d1f01e91..6bf6c05535b61eef1021d46bcd8448fb3a0b25f4 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -13916,9 +13916,54 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>  	    maxv = matches[i][1];
>  	  }
>  
> -      /* Create a duplicate of the most common element.  */
> -      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> -      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
> +      /* Create a duplicate of the most common element, unless all elements
> +	 are equally useless to us, in which case just immediately set the
> +	 vector register using the first element.  */
> +
> +      if (maxv == 1)
> +	{
> +	  /* For vectors of two 64-bit elements, we can do even better.  */
> +	  if (n_elts == 2
> +	      && (inner_mode == E_DImode
> +		  || inner_mode == E_DFmode))
> +
> +	    {
> +	      rtx x0 = XVECEXP (vals, 0, 0);
> +	      rtx x1 = XVECEXP (vals, 0, 1);
> +	      /* Combine can pick up this case, but handling it directly
> +		 here leaves clearer RTL.
> +
> +		 This is load_pair_lanes<mode>, and also gives us a clean-up
> +		 for store_pair_lanes<mode>.  */
> +	      if (memory_operand (x0, inner_mode)
> +		  && memory_operand (x1, inner_mode)
> +		  && !STRICT_ALIGNMENT
> +		  && rtx_equal_p (XEXP (x1, 0),
> +				  plus_constant (Pmode,
> +						 XEXP (x0, 0),
> +						 GET_MODE_SIZE (inner_mode))))
> +		{
> +		  rtx t;
> +		  if (inner_mode == DFmode)
> +		    t = gen_load_pair_lanesdf (target, x0, x1);
> +		  else
> +		    t = gen_load_pair_lanesdi (target, x0, x1);
> +		  emit_insn (t);
> +		  return;
> +		}
> +	    }
> +	  /* The subreg-move sequence below will move into lane zero of the
> +	     vector register.  For big-endian we want that position to hold
> +	     the last element of VALS.  */
> +	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> +	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> +	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> +	}
> +      else
> +	{
> +	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> +	  aarch64_emit_move (target, gen_vec_duplicate (mode, x));
> +	}
>  
>        /* Insert the rest.  */
>        for (int i = 0; i < n_elts; i++)
> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
> index e96c9d1b441fcfcbddcbffb0c1c7c0e2a871a2a3..d2714db7ae8ef946a6ce035bea59ddbec890e905 100644
> --- a/gcc/simplify-rtx.c
> +++ b/gcc/simplify-rtx.c
> @@ -5891,6 +5891,60 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
>  		return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>  	    }
>  
> +	  /* Replace:
> +
> +	      (vec_merge:outer (vec_duplicate:outer x:inner)
> +			       (subreg:outer y:inner 0)
> +			       (const_int N))
> +
> +	     with (vec_concat:outer x:inner y:inner) if N == 1,
> +	     or (vec_concat:outer y:inner x:inner) if N == 2.
> +
> +	     Implicitly, this means we have a paradoxical subreg, but such
> +	     a check is cheap, so make it anyway.
> +
> +	     Only applies for vectors of two elements.  */
> +	  if (GET_CODE (op0) == VEC_DUPLICATE
> +	      && GET_CODE (op1) == SUBREG
> +	      && GET_MODE (op1) == GET_MODE (op0)
> +	      && GET_MODE (SUBREG_REG (op1)) == GET_MODE (XEXP (op0, 0))
> +	      && paradoxical_subreg_p (op1)
> +	      && subreg_lowpart_p (op1)
> +	      && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
> +	      && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
> +	      && IN_RANGE (sel, 1, 2))
> +	    {
> +	      rtx newop0 = XEXP (op0, 0);
> +	      rtx newop1 = SUBREG_REG (op1);
> +	      if (sel == 2)
> +		std::swap (newop0, newop1);
> +	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
> +	    }
> +
> +	  /* Same as above but with switched operands:
> +		Replace (vec_merge:outer (subreg:outer x:inner 0)
> +					 (vec_duplicate:outer y:inner)
> +			       (const_int N))
> +
> +	     with (vec_concat:outer x:inner y:inner) if N == 1,
> +	     or (vec_concat:outer y:inner x:inner) if N == 2.  */
> +	  if (GET_CODE (op1) == VEC_DUPLICATE
> +	      && GET_CODE (op0) == SUBREG
> +	      && GET_MODE (op0) == GET_MODE (op1)
> +	      && GET_MODE (SUBREG_REG (op0)) == GET_MODE (XEXP (op1, 0))
> +	      && paradoxical_subreg_p (op0)
> +	      && subreg_lowpart_p (op0)
> +	      && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
> +	      && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
> +	      && IN_RANGE (sel, 1, 2))
> +	    {
> +	      rtx newop0 = SUBREG_REG (op0);
> +	      rtx newop1 = XEXP (op1, 0);
> +	      if (sel == 2)
> +		std::swap (newop0, newop1);
> +	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
> +	    }
> +
>  	  /* Replace (vec_merge (vec_duplicate x) (vec_duplicate y)
>  				 (const_int n))
>  	     with (vec_concat x y) or (vec_concat y x) depending on value
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..0541e480d1f8561dbd9b2a56926c8df60d667a54
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +
> +/* { dg-options "-O3 -ftree-vectorize -fno-vect-cost-model" } */
> +
> +void bar (double);
> +
> +void
> +foo (double *restrict in, double *restrict in2,
> +     double *restrict out1, double *restrict out2)
> +{
> +  for (int i = 0; i < 1024; i++)
> +    {
> +      out1[i] = in[i] + 2.0 * in[i+128];
> +      out1[i+1] = in[i+1] + 2.0 * in2[i];
> +      bar (in[i]);
> +    }
> +}
> +
> +/* { dg-final { scan-assembler-not "dup\tv\[0-9\]+.2d, v\[0-9\]+" } } */
> +
> 


* Re: [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful
  2018-05-17  9:45     ` Richard Earnshaw (lists)
@ 2018-05-19  9:59       ` Richard Sandiford
  0 siblings, 0 replies; 12+ messages in thread
From: Richard Sandiford @ 2018-05-19  9:59 UTC (permalink / raw)
  To: Richard Earnshaw (lists)
  Cc: Kyrill Tkachov, Richard Biener, GCC Patches, james.greenhalgh,
	Marcus Shawcroft

"Richard Earnshaw (lists)" <Richard.Earnshaw@arm.com> writes:
> On 16/05/18 09:37, Kyrill Tkachov wrote:
>> 
>> On 15/05/18 10:58, Richard Biener wrote:
>>> On Tue, May 15, 2018 at 10:20 AM Kyrill Tkachov
>>> <kyrylo.tkachov@foss.arm.com>
>>> wrote:
>>>
>>>> Hi all,
>>>> This is a respin of James's patch from:
>>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>>>> The original patch was approved and committed but was later reverted
>>> because of failures on big-endian.
>>>> This tweaked version fixes the big-endian failures in
>>> aarch64_expand_vector_init by picking the right
>>>> element of VALS to move into the low part of the vector register
>>> depending on endianness. The rest of the patch
>>>> stays the same. I'm looking for approval on the aarch64 parts, as they
>>> are the ones that have changed
>>>> since the last approved version of the patch.
>>>> -----------------------------------------------------------------------
>>>> In the testcase in this patch we create an SLP vector with only two
>>>> elements. Our current vector initialisation code will first duplicate
>>>> the first element to both lanes, then overwrite the top lane with a new
>>>> value.
>>>> This duplication can be clunky and wasteful.
>>>> Better would be to simply use the fact that we will always be
>>>> overwriting
>>>> the remaining bits, and simply move the first element to the corrcet
>>>> place
>>>> (implicitly zeroing all other bits).
>>>> This reduces the code generation for this case, and can allow more
>>>> efficient addressing modes, and other second order benefits for AArch64
>>>> code which has been vectorized to V2DI mode.
>>>> Note that the change is generic enough to catch the case for any vector
>>>> mode, but is expected to be most useful for 2x64-bit vectorization.
>>>> Unfortunately, on its own, this would cause failures in
>>>> gcc.target/aarch64/load_v2vec_lanes_1.c and
>>>> gcc.target/aarch64/store_v2vec_lanes.c , which expect to see many more
>>>> vec_merge and vec_duplicate for their simplifications to apply. To fix
>>>> this,
>>>> add a special case to the AArch64 code if we are loading from two memory
>>>> addresses, and use the load_pair_lanes patterns directly.
>>>> We also need a new pattern in simplify-rtx.c:simplify_ternary_operation
>>>> , to
>>>> catch:
>>>>      (vec_merge:OUTER
>>>>         (vec_duplicate:OUTER x:INNER)
>>>>         (subreg:OUTER y:INNER 0)
>>>>         (const_int N))
>>>> And simplify it to:
>>>>      (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)
>>>> This is similar to the existing patterns which are tested in this
>>>> function,
>>>> without requiring the second operand to also be a vec_duplicate.
>>>> Bootstrapped and tested on aarch64-none-linux-gnu and tested on
>>>> aarch64-none-elf.
>>>> Note that this requires
>>>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>>>> if we don't want to ICE creating broken vector zero extends.
>>>> Are the non-AArch64 parts OK?
>>> Is (vec_merge (subreg ..) (vec_duplicate)) canonicalized to the form
>>> you handle?  I see the (vec_merge (vec_duplicate...) (vec_concat)) case
>>> also doesn't handle the swapped operand case.
>>>
>>> Otherwise the middle-end parts looks ok.
>> 
>> I don't see any explicit canonicalisation code for it.
>> I've updated the simplify-rtx part to handle the swapped operand case.
>> Is the attached patch better in this regard? I couldn't think of a clean
>> way to avoid
>> duplicating some logic (beyond creating a new function away from the
>> callsite).
>> 
>> Thanks,
>> Kyrill
>> 
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> James
>>>> ---
>>>> 2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>
>>>>                Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>            * config/aarch64/aarch64.c (aarch64_expand_vector_init):
>>>> Modify
>>>>            code generation for cases where splatting a value is not
>>>> useful.
>>>>            * simplify-rtx.c (simplify_ternary_operation): Simplify
>>>>            vec_merge across a vec_duplicate and a paradoxical subreg
>>> forming a vector
>>>>            mode to a vec_concat.
>>>> 2018-05-15  James Greenhalgh  <james.greenhalgh@arm.com>
>>>>            * gcc.target/aarch64/vect-slp-dup.c: New.
>> 
>
> I'm surprised we don't seem to have a function in the compiler that
> performs this check:
>
> +		  && rtx_equal_p (XEXP (x1, 0),
> +				  plus_constant (Pmode,
> +						 XEXP (x0, 0),
> +						 GET_MODE_SIZE (inner_mode))))
>
> Without generating dead RTL (plus_constant will rarely be able to return
> a subexpression of the original pattern).  I would have thought this
> sort of test was not that uncommon.

FWIW, I think the way to write it without generating dead RTL is:

     rtx_equal_p (strip_offset (XEXP (x0, 0), &x0_offset),
                  strip_offset (XEXP (x1, 0), &x1_offset))
     && known_eq (x1_offset, x0_offset + GET_MODE_SIZE (inner_mode))

But yeah, a helper would be nice at some point.
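
Roughly, such a helper might look like this (the name and exact home
are made up here, purely a sketch of the idea):

     /* Hypothetical helper: return true if ADDR1 is ADDR0 plus SIZE
        bytes, without creating any new RTL.  */
     static bool
     addresses_adjacent_p (rtx addr0, rtx addr1, poly_int64 size)
     {
       poly_int64 off0, off1;
       rtx base0 = strip_offset (addr0, &off0);
       rtx base1 = strip_offset (addr1, &off1);
       return rtx_equal_p (base0, base1) && known_eq (off1, off0 + size);
     }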

> However, I don't think that needs to hold up this patch.
>
> OK.
>
> R.
>> 
>> vec-splat.patch
>> 
>> 
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index a2003fe52875f1653d644347bafd7773d1f01e91..6bf6c05535b61eef1021d46bcd8448fb3a0b25f4 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -13916,9 +13916,54 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>>  	    maxv = matches[i][1];
>>  	  }
>>  
>> -      /* Create a duplicate of the most common element.  */
>> -      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> -      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
>> +      /* Create a duplicate of the most common element, unless all elements
>> +	 are equally useless to us, in which case just immediately set the
>> +	 vector register using the first element.  */
>> +
>> +      if (maxv == 1)
>> +	{
>> +	  /* For vectors of two 64-bit elements, we can do even better.  */
>> +	  if (n_elts == 2
>> +	      && (inner_mode == E_DImode
>> +		  || inner_mode == E_DFmode))
>> +
>> +	    {
>> +	      rtx x0 = XVECEXP (vals, 0, 0);
>> +	      rtx x1 = XVECEXP (vals, 0, 1);
>> +	      /* Combine can pick up this case, but handling it directly
>> +		 here leaves clearer RTL.
>> +
>> +		 This is load_pair_lanes<mode>, and also gives us a clean-up
>> +		 for store_pair_lanes<mode>.  */
>> +	      if (memory_operand (x0, inner_mode)
>> +		  && memory_operand (x1, inner_mode)
>> +		  && !STRICT_ALIGNMENT
>> +		  && rtx_equal_p (XEXP (x1, 0),
>> +				  plus_constant (Pmode,
>> +						 XEXP (x0, 0),
>> +						 GET_MODE_SIZE (inner_mode))))
>> +		{
>> +		  rtx t;
>> +		  if (inner_mode == DFmode)
>> +		    t = gen_load_pair_lanesdf (target, x0, x1);
>> +		  else
>> +		    t = gen_load_pair_lanesdi (target, x0, x1);
>> +		  emit_insn (t);
>> +		  return;
>> +		}
>> +	    }
>> +	  /* The subreg-move sequence below will move into lane zero of the
>> +	     vector register.  For big-endian we want that position to hold
>> +	     the last element of VALS.  */
>> +	  maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
>> +	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> +	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
>> +	}
>> +      else
>> +	{
>> +	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> +	  aarch64_emit_move (target, gen_vec_duplicate (mode, x));
>> +	}
>>  
>>        /* Insert the rest.  */
>>        for (int i = 0; i < n_elts; i++)
>> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
>> index e96c9d1b441fcfcbddcbffb0c1c7c0e2a871a2a3..d2714db7ae8ef946a6ce035bea59ddbec890e905 100644
>> --- a/gcc/simplify-rtx.c
>> +++ b/gcc/simplify-rtx.c
>> @@ -5891,6 +5891,60 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
>>  		return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>>  	    }
>>  
>> +	  /* Replace:
>> +
>> +	      (vec_merge:outer (vec_duplicate:outer x:inner)
>> +			       (subreg:outer y:inner 0)
>> +			       (const_int N))
>> +
>> +	     with (vec_concat:outer x:inner y:inner) if N == 1,
>> +	     or (vec_concat:outer y:inner x:inner) if N == 2.
>> +
>> +	     Implicitly, this means we have a paradoxical subreg, but such
>> +	     a check is cheap, so make it anyway.
>> +
>> +	     Only applies for vectors of two elements.  */
>> +	  if (GET_CODE (op0) == VEC_DUPLICATE
>> +	      && GET_CODE (op1) == SUBREG
>> +	      && GET_MODE (op1) == GET_MODE (op0)
>> +	      && GET_MODE (SUBREG_REG (op1)) == GET_MODE (XEXP (op0, 0))
>> +	      && paradoxical_subreg_p (op1)
>> +	      && subreg_lowpart_p (op1)
>> +	      && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
>> +	      && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
>> +	      && IN_RANGE (sel, 1, 2))
>> +	    {
>> +	      rtx newop0 = XEXP (op0, 0);
>> +	      rtx newop1 = SUBREG_REG (op1);
>> +	      if (sel == 2)
>> +		std::swap (newop0, newop1);
>> +	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>> +	    }
>> +
>> +	  /* Same as above but with switched operands:
>> +		Replace (vec_merge:outer (subreg:outer x:inner 0)
>> +					 (vec_duplicate:outer y:inner)
>> +			       (const_int N))
>> +
>> +	     with (vec_concat:outer x:inner y:inner) if N == 1,
>> +	     or (vec_concat:outer y:inner x:inner) if N == 2.  */
>> +	  if (GET_CODE (op1) == VEC_DUPLICATE
>> +	      && GET_CODE (op0) == SUBREG
>> +	      && GET_MODE (op0) == GET_MODE (op1)
>> +	      && GET_MODE (SUBREG_REG (op0)) == GET_MODE (XEXP (op1, 0))
>> +	      && paradoxical_subreg_p (op0)
>> +	      && subreg_lowpart_p (op0)
>> +	      && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
>> +	      && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
>> +	      && IN_RANGE (sel, 1, 2))
>> +	    {
>> +	      rtx newop0 = SUBREG_REG (op0);
>> +	      rtx newop1 = XEXP (op1, 0);
>> +	      if (sel == 2)
>> +		std::swap (newop0, newop1);
>> +	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>> +	    }
>> +
>>  	  /* Replace (vec_merge (vec_duplicate x) (vec_duplicate y)
>>  				 (const_int n))
>>  	     with (vec_concat x y) or (vec_concat y x) depending on value
>> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
>> new file mode 100644
>> index 0000000000000000000000000000000000000000..0541e480d1f8561dbd9b2a56926c8df60d667a54
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
>> @@ -0,0 +1,20 @@
>> +/* { dg-do compile } */
>> +
>> +/* { dg-options "-O3 -ftree-vectorize -fno-vect-cost-model" } */
>> +
>> +void bar (double);
>> +
>> +void
>> +foo (double *restrict in, double *restrict in2,
>> +     double *restrict out1, double *restrict out2)
>> +{
>> +  for (int i = 0; i < 1024; i++)
>> +    {
>> +      out1[i] = in[i] + 2.0 * in[i+128];
>> +      out1[i+1] = in[i+1] + 2.0 * in2[i];
>> +      bar (in[i]);
>> +    }
>> +}
>> +
>> +/* { dg-final { scan-assembler-not "dup\tv\[0-9\]+.2d, v\[0-9\]+" } } */
>> +
>> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful
  2017-12-18 23:37   ` Jeff Law
@ 2018-01-03  9:55     ` Christophe Lyon
  0 siblings, 0 replies; 12+ messages in thread
From: Christophe Lyon @ 2018-01-03  9:55 UTC (permalink / raw)
  To: Jeff Law
  Cc: James Greenhalgh, gcc-patches, nd, Marcus Shawcroft, Richard Earnshaw

On 19 December 2017 at 00:36, Jeff Law <law@redhat.com> wrote:
> On 12/11/2017 08:44 AM, James Greenhalgh wrote:
>> Hi,
>>
>> In the testcase in this patch we create an SLP vector with only two
>> elements. Our current vector initialisation code will first duplicate
>> the first element to both lanes, then overwrite the top lane with a new
>> value.
>>
>> This duplication can be clunky and wasteful.
>>
>> Better would be to simply use the fact that we will always be overwriting
>> the remaining bits, and simply move the first element to the correct place
>> (implicitly zeroing all other bits).
>>
>> This reduces the code generated for this case, and can allow more
>> efficient addressing modes and other second-order benefits for AArch64
>> code which has been vectorized to V2DI mode.
>>
>> Note that the change is generic enough to catch the case for any vector
>> mode, but is expected to be most useful for 2x64-bit vectorization.
>>
>> Unfortunately, on its own, this would cause failures in
>> gcc.target/aarch64/load_v2vec_lanes_1.c and
>> gcc.target/aarch64/store_v2vec_lanes.c , which expect to see many more
>> vec_merge and vec_duplicate for their simplifications to apply. To fix this,
>> add a special case to the AArch64 code if we are loading from two memory
>> addresses, and use the load_pair_lanes patterns directly.
>>
>> We also need a new pattern in simplify-rtx.c:simplify_ternary_operation , to
>> catch:
>>
>>   (vec_merge:OUTER
>>      (vec_duplicate:OUTER x:INNER)
>>      (subreg:OUTER y:INNER 0)
>>      (const_int N))
>>
>> And simplify it to:
>>
>>   (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)
>>
>> This is similar to the existing patterns which are tested in this function,
>> without requiring the second operand to also be a vec_duplicate.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu and tested on
>> aarch64-none-elf.
>>
>> Note that this requires https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>> if we don't want to ICE creating broken vector zero extends.
>>
>> Are the non-AArch64 parts OK?
>>
>> Thanks,
>> James
>>
>> ---
>> 2017-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
>>
>>       * config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify code
>>       generation for cases where splatting a value is not useful.
>>       * simplify-rtx.c (simplify_ternary_operation): Simplify vec_merge
>>       across a vec_duplicate and a paradoxical subreg forming a vector
>>       mode to a vec_concat.
>>
>> 2017-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
>>
>>       * gcc.target/aarch64/vect-slp-dup.c: New.
>>
>>
>> 0001-patch-AArch64-Do-not-perform-a-vector-splat-for-vect.patch
>>
>>
>
>> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
>> index 806c309..ed16f70 100644
>> --- a/gcc/simplify-rtx.c
>> +++ b/gcc/simplify-rtx.c
>> @@ -5785,6 +5785,36 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
>>               return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>>           }
>>
>> +       /* Replace:
>> +
>> +           (vec_merge:outer (vec_duplicate:outer x:inner)
>> +                            (subreg:outer y:inner 0)
>> +                            (const_int N))
>> +
>> +          with (vec_concat:outer x:inner y:inner) if N == 1,
>> +          or (vec_concat:outer y:inner x:inner) if N == 2.
>> +
>> +          Implicitly, this means we have a paradoxical subreg, but such
>> +          a check is cheap, so make it anyway.
> I'm going to assume that N == 0 and N == 3 are handled elsewhere and do
> not show up here in practice.
>
>
> So is it advisable to handle the case where the VEC_DUPLICATE and SUBREG
> show up in the opposite order?  Or is there some canonicalization that
> prevents that?
>
> simplify-rtx bits are OK as-is if we're certain we're not going to get
> the alternate ordering of the VEC_MERGE operands.  Also OK if you either
> generalize this chunk of code or duplicate & twiddle it to handle the
> alternate order.
>
> I didn't look at the aarch64 specific bits.
>

Hi James,

Your patch (r255946) introduced regressions on aarch64_be:
http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/255946/report-build-info.html

I filed https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83663
to track this.

Thanks,

Christophe

> jeff
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful
  2017-12-11 15:46 ` [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful James Greenhalgh
@ 2017-12-18 23:37   ` Jeff Law
  2018-01-03  9:55     ` Christophe Lyon
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff Law @ 2017-12-18 23:37 UTC (permalink / raw)
  To: James Greenhalgh, gcc-patches; +Cc: nd, marcus.shawcroft, richard.earnshaw

On 12/11/2017 08:44 AM, James Greenhalgh wrote:
> Hi,
> 
> In the testcase in this patch we create an SLP vector with only two
> elements. Our current vector initialisation code will first duplicate
> the first element to both lanes, then overwrite the top lane with a new
> value.
> 
> This duplication can be clunky and wasteful.
> 
> Better would be to simply use the fact that we will always be overwriting
> the remaining bits, and simply move the first element to the correct place
> (implicitly zeroing all other bits).
> 
> This reduces the code generated for this case, and can allow more
> efficient addressing modes and other second-order benefits for AArch64
> code which has been vectorized to V2DI mode.
> 
> Note that the change is generic enough to catch the case for any vector
> mode, but is expected to be most useful for 2x64-bit vectorization.
> 
> Unfortunately, on its own, this would cause failures in
> gcc.target/aarch64/load_v2vec_lanes_1.c and
> gcc.target/aarch64/store_v2vec_lanes.c , which expect to see many more
> vec_merge and vec_duplicate for their simplifications to apply. To fix this,
> add a special case to the AArch64 code if we are loading from two memory
> addresses, and use the load_pair_lanes patterns directly.
> 
> We also need a new pattern in simplify-rtx.c:simplify_ternary_operation , to
> catch:
> 
>   (vec_merge:OUTER
>      (vec_duplicate:OUTER x:INNER)
>      (subreg:OUTER y:INNER 0)
>      (const_int N))
> 
> And simplify it to:
> 
>   (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)
> 
> This is similar to the existing patterns which are tested in this function,
> without requiring the second operand to also be a vec_duplicate.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu and tested on
> aarch64-none-elf.
> 
> Note that this requires https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
> if we don't want to ICE creating broken vector zero extends.
> 
> Are the non-AArch64 parts OK?
> 
> Thanks,
> James
> 
> ---
> 2017-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
> 
> 	* config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify code
> 	generation for cases where splatting a value is not useful.
> 	* simplify-rtx.c (simplify_ternary_operation): Simplify vec_merge
> 	across a vec_duplicate and a paradoxical subreg forming a vector
> 	mode to a vec_concat.
> 
> 2017-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
> 
> 	* gcc.target/aarch64/vect-slp-dup.c: New.
> 
> 
> 0001-patch-AArch64-Do-not-perform-a-vector-splat-for-vect.patch
> 
> 

> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
> index 806c309..ed16f70 100644
> --- a/gcc/simplify-rtx.c
> +++ b/gcc/simplify-rtx.c
> @@ -5785,6 +5785,36 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
>  		return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>  	    }
>  
> +	  /* Replace:
> +
> +	      (vec_merge:outer (vec_duplicate:outer x:inner)
> +			       (subreg:outer y:inner 0)
> +			       (const_int N))
> +
> +	     with (vec_concat:outer x:inner y:inner) if N == 1,
> +	     or (vec_concat:outer y:inner x:inner) if N == 2.
> +
> +	     Implicitly, this means we have a paradoxical subreg, but such
> +	     a check is cheap, so make it anyway.
I'm going to assume that N == 0 and N == 3 are handled elsewhere and do
not show up here in practice.


So is it advisable to handle the case where the VEC_DUPLICATE and SUBREG
show up in the opposite order?  Or is there some canonicalization that
prevents that?

simplify-rtx bits are OK as-is if we're certain we're not going to get
the alternate ordering of the VEC_MERGE operands.  Also OK if you either
generalize this chunk of code or duplicate & twiddle it to handle the
alternate order.

I didn't look at the aarch64 specific bits.

jeff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful
  2017-12-11 14:19 [Patch combine] Don't create vector mode ZERO_EXTEND from subregs James Greenhalgh
@ 2017-12-11 15:46 ` James Greenhalgh
  2017-12-18 23:37   ` Jeff Law
  0 siblings, 1 reply; 12+ messages in thread
From: James Greenhalgh @ 2017-12-11 15:46 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, law, marcus.shawcroft, richard.earnshaw

[-- Attachment #1: Type: text/plain, Size: 2298 bytes --]


Hi,

In the testcase in this patch we create an SLP vector with only two
elements. Our current vector initialisation code will first duplicate
the first element to both lanes, then overwrite the top lane with a new
value.

This duplication can be clunky and wasteful.

Better would be to simply use the fact that we will always be overwriting
the remaining bits, and simply move the first element to the correct place
(implicitly zeroing all other bits).
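
As a rough illustration (not part of the patch; the function is invented
for the example, and the exact instruction selection depends on register
allocation), the kind of two-element initialisation this targets can be
written with GCC's generic vector extensions as:

typedef long long v2di __attribute__ ((vector_size (16)));

v2di
make_v2di (long long a, long long b)
{
  /* Old expansion: splat 'a' into both lanes, then overwrite lane 1:
       dup   v0.2d, x0
       ins   v0.d[1], x1
     Intended new expansion: move 'a' into the low lane only, which
     implicitly zeroes the remaining bits, then insert 'b':
       fmov  d0, x0
       ins   v0.d[1], x1  */
  return (v2di) { a, b };
}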

This reduces the code generated for this case, and can allow more
efficient addressing modes and other second-order benefits for AArch64
code which has been vectorized to V2DI mode.

Note that the change is generic enough to catch the case for any vector
mode, but is expected to be most useful for 2x64-bit vectorization.

Unfortunately, on its own, this would cause failures in
gcc.target/aarch64/load_v2vec_lanes_1.c and
gcc.target/aarch64/store_v2vec_lanes.c , which expect to see many more
vec_merge and vec_duplicate for their simplifications to apply. To fix this,
add a special case to the AArch64 code if we are loading from two memory
addresses, and use the load_pair_lanes patterns directly.
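
As a sketch of the shape that special case targets (again an illustrative
example, not code from the patch), both lanes come from consecutive
memory:

typedef double v2df __attribute__ ((vector_size (16)));

/* With p[0] and p[1] adjacent in memory, the expansion can use the
   load_pair_lanes pattern, i.e. a single 128-bit load, instead of a
   dup followed by a lane insert (subject to the alignment checks in
   the patch).  */
v2df
load_adjacent (const double *p)
{
  return (v2df) { p[0], p[1] };
}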

We also need a new pattern in simplify-rtx.c:simplify_ternary_operation , to
catch:

  (vec_merge:OUTER
     (vec_duplicate:OUTER x:INNER)
     (subreg:OUTER y:INNER 0)
     (const_int N))

And simplify it to:

  (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)

This is similar to the existing patterns which are tested in this function,
without requiring the second operand to also be a vec_duplicate.
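
To make the selector semantics concrete, here is a small self-contained
C model (purely illustrative, not GCC code): bit i of the const_int
selector set means lane i of the result is taken from the first
vec_merge operand.

#include <stdio.h>

int
main (void)
{
  /* x stands for the vec_duplicate'd value, y for the value behind
     the paradoxical subreg.  */
  long long x = 1, y = 2;
  for (int sel = 0; sel <= 3; sel++)
    {
      long long lane0 = (sel & 1) ? x : y;
      long long lane1 = (sel & 2) ? x : y;
      /* sel == 1 gives {x, y}, i.e. (vec_concat x y);
         sel == 2 gives {y, x}, i.e. (vec_concat y x);
         sel == 0 and sel == 3 take both lanes from a single operand.  */
      printf ("sel=%d -> {%lld, %lld}\n", sel, lane0, lane1);
    }
  return 0;
}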

Bootstrapped and tested on aarch64-none-linux-gnu and tested on
aarch64-none-elf.

Note that this requires https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
if we don't want to ICE creating broken vector zero extends.

Are the non-AArch64 parts OK?

Thanks,
James

---
2017-12-11  James Greenhalgh  <james.greenhalgh@arm.com>

	* config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify code
	generation for cases where splatting a value is not useful.
	* simplify-rtx.c (simplify_ternary_operation): Simplify vec_merge
	across a vec_duplicate and a paradoxical subreg forming a vector
	mode to a vec_concat.

2017-12-11  James Greenhalgh  <james.greenhalgh@arm.com>

	* gcc.target/aarch64/vect-slp-dup.c: New.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-patch-AArch64-Do-not-perform-a-vector-splat-for-vect.patch --]
[-- Type: text/x-patch; name="0001-patch-AArch64-Do-not-perform-a-vector-splat-for-vect.patch", Size: 4158 bytes --]

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 83d8607..8abb8e4 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -12105,9 +12105,51 @@ aarch64_expand_vector_init (rtx target, rtx vals)
 	    maxv = matches[i][1];
 	  }
 
-      /* Create a duplicate of the most common element.  */
-      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
-      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
+      /* Create a duplicate of the most common element, unless all elements
+	 are equally useless to us, in which case just immediately set the
+	 vector register using the first element.  */
+
+      if (maxv == 1)
+	{
+	  /* For vectors of two 64-bit elements, we can do even better.  */
+	  if (n_elts == 2
+	      && (inner_mode == E_DImode
+		  || inner_mode == E_DFmode))
+
+	    {
+	      rtx x0 = XVECEXP (vals, 0, 0);
+	      rtx x1 = XVECEXP (vals, 0, 1);
+	      /* Combine can pick up this case, but handling it directly
+		 here leaves clearer RTL.
+
+		 This is load_pair_lanes<mode>, and also gives us a clean-up
+		 for store_pair_lanes<mode>.  */
+	      if (memory_operand (x0, inner_mode)
+		  && memory_operand (x1, inner_mode)
+		  && !STRICT_ALIGNMENT
+		  && rtx_equal_p (XEXP (x1, 0),
+				  plus_constant (Pmode,
+						 XEXP (x0, 0),
+						 GET_MODE_SIZE (inner_mode))))
+		{
+		  rtx t;
+		  if (inner_mode == DFmode)
+		    t = gen_load_pair_lanesdf (target, x0, x1);
+		  else
+		    t = gen_load_pair_lanesdi (target, x0, x1);
+		  emit_insn (t);
+		  return;
+		}
+	    }
+	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, 0));
+	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
+	  maxelement = 0;
+	}
+      else
+	{
+	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+	  aarch64_emit_move (target, gen_vec_duplicate (mode, x));
+	}
 
       /* Insert the rest.  */
       for (int i = 0; i < n_elts; i++)
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 806c309..ed16f70 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -5785,6 +5785,36 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
 		return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
 	    }
 
+	  /* Replace:
+
+	      (vec_merge:outer (vec_duplicate:outer x:inner)
+			       (subreg:outer y:inner 0)
+			       (const_int N))
+
+	     with (vec_concat:outer x:inner y:inner) if N == 1,
+	     or (vec_concat:outer y:inner x:inner) if N == 2.
+
+	     Implicitly, this means we have a paradoxical subreg, but such
+	     a check is cheap, so make it anyway.
+
+	     Only applies for vectors of two elements.  */
+	  if (GET_CODE (op0) == VEC_DUPLICATE
+	      && GET_CODE (op1) == SUBREG
+	      && GET_MODE (op1) == GET_MODE (op0)
+	      && GET_MODE (SUBREG_REG (op1)) == GET_MODE (XEXP (op0, 0))
+	      && paradoxical_subreg_p (op1)
+	      && SUBREG_BYTE (op1) == 0
+	      && GET_MODE_NUNITS (GET_MODE (op0)) == 2
+	      && GET_MODE_NUNITS (GET_MODE (op1)) == 2
+	      && IN_RANGE (sel, 1, 2))
+	    {
+	      rtx newop0 = XEXP (op0, 0);
+	      rtx newop1 = SUBREG_REG (op1);
+	      if (sel == 2)
+		std::swap (newop0, newop1);
+	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
+	    }
+
 	  /* Replace (vec_merge (vec_duplicate x) (vec_duplicate y)
 				 (const_int n))
 	     with (vec_concat x y) or (vec_concat y x) depending on value
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
new file mode 100644
index 0000000..0541e48
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+
+/* { dg-options "-O3 -ftree-vectorize -fno-vect-cost-model" } */
+
+void bar (double);
+
+void
+foo (double *restrict in, double *restrict in2,
+     double *restrict out1, double *restrict out2)
+{
+  for (int i = 0; i < 1024; i++)
+    {
+      out1[i] = in[i] + 2.0 * in[i+128];
+      out1[i+1] = in[i+1] + 2.0 * in2[i];
+      bar (in[i]);
+    }
+}
+
+/* { dg-final { scan-assembler-not "dup\tv\[0-9\]+.2d, v\[0-9\]+" } } */
+

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-05-19  9:54 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-15  8:20 [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful Kyrill Tkachov
2018-05-15 10:04 ` Richard Biener
2018-05-16  9:08   ` Kyrill Tkachov
2018-05-16 10:11     ` Richard Biener
2018-05-16 10:12       ` Kyrill Tkachov
2018-05-16 15:13         ` Segher Boessenkool
2018-05-16 14:42       ` Segher Boessenkool
2018-05-17  9:45     ` Richard Earnshaw (lists)
2018-05-19  9:59       ` Richard Sandiford
  -- strict thread matches above, loose matches on Subject: below --
2017-12-11 14:19 [Patch combine] Don't create vector mode ZERO_EXTEND from subregs James Greenhalgh
2017-12-11 15:46 ` [patch AArch64] Do not perform a vector splat for vector initialisation if it is not useful James Greenhalgh
2017-12-18 23:37   ` Jeff Law
2018-01-03  9:55     ` Christophe Lyon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).