public inbox for gcc-patches@gcc.gnu.org
* [PATCH] PR/67682, break SLP groups up if only some elements match
@ 2015-10-23 15:26 Alan Lawrence
  2015-10-25 11:56 ` Alan Lawrence
  2015-10-26 15:09 ` Richard Biener
  0 siblings, 2 replies; 15+ messages in thread
From: Alan Lawrence @ 2015-10-23 15:26 UTC (permalink / raw)
  To: gcc-patches

vect_analyze_slp_instance currently only creates an slp_instance if _all_
stores in a group fit the same pattern. This patch splits non-matching groups
up on vector boundaries, allowing only part of the group to be SLP'd, or
multiple subgroups to be SLP'd differently.

The algorithm could be made more efficient: we have more information available
in the matches vector (e.g. a single match in a vector full of non-matches
means we will be unable to SLP). But the double-recursion case has at most
log(N) depth, and the single-recursion case is at worst N^2 in *the group
width*, which is generally small.
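
The splitting heuristic can be sketched as a rough Python model (not GCC
code; the names and the shape of the matches vector are illustrative):

```python
# Rough Python model of the splitting heuristic described above.
# 'matches' stands in for the matches vector filled in by
# vect_build_slp_tree; 'vf' is the vectorization factor.

def split_points(matches, vf):
    """Return (group1_size, group2_start) describing the split,
    or None if no split is attempted."""
    group_size = len(matches)
    # Index of the first non-matching element (group_size if all match).
    i = next((k for k, m in enumerate(matches) if not m), group_size)
    if i < vf and vf < group_size:
        # First vector mixes matches and non-matches: split it off and
        # retry with the remainder.
        return (vf, vf)
    if vf <= i < group_size:
        # Split on the last vector boundary at or before the first
        # mismatch; vf is a power of two, so masking rounds down.
        i &= ~(vf - 1)
        return (i, i)
    return None  # whole group matches, or nothing to salvage

# Like the bb-slp-subgroups-2.c store group: with vf == 4, a mismatch
# inside the first vector splits the 8-element group into 4 + 4.
print(split_points([True, False, False, False,
                    True, True, True, True], 4))   # -> (4, 4)
```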

This could possibly be extended to hybrid SLP, but I believe that would also
require splitting the load groups, as at present removing the bb_vinfo check
ends up with data being stored to the wrong locations, e.g. in slp-11a.c.
Hence, I'm leaving that extension to a future patch.

Bootstrapped + check-{gcc,g++,fortran} on aarch64, x86_64 and arm (v7-a neon).

Thanks, Alan

gcc/ChangeLog:

	* tree-vect-slp.c (vect_split_slp_group): New.
	(vect_analyze_slp_instance): Recurse on subgroup(s) if
	vect_build_slp_tree fails during basic block SLP.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/bb-slp-7.c: Make that SLP group even more non-isomorphic.
	* gcc.dg/vect/bb-slp-subgroups-1.c: New.
	* gcc.dg/vect/bb-slp-subgroups-2.c: New.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-7.c           |  8 ++--
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 ++++++++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++++++
 gcc/tree-vect-slp.c                            | 63 +++++++++++++++++++++++++-
 4 files changed, 152 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
index ab54a48..b012d78 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
@@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
   unsigned int *pout = &out[0];
   unsigned int a0, a1, a2, a3;
 
-  /* Non isomorphic.  */
+  /* Non isomorphic, even 64-bit subgroups.  */
   a0 = *pin++ + 23;
-  a1 = *pin++ + 142;
+  a1 = *pin++ * 142;
   a2 = *pin++ + 2;
   a3 = *pin++ * 31;
-  
+
   *pout++ = a0 * x;
   *pout++ = a1 * y;
   *pout++ = a2 * x;
@@ -47,4 +47,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
-  
+
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
new file mode 100644
index 0000000..39c23c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
@@ -0,0 +1,44 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1];
+    a[2] = b[2];
+    a[3] = b[3];
+    a[4] = 0;
+    a[5] = 0;
+    a[6] = 0;
+    a[7] = 0;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  for (int i = 0; i < 4; i++)
+    if (a[i] != i+4)
+      abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != 0)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
new file mode 100644
index 0000000..06099fd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
@@ -0,0 +1,42 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1] + 1;
+    a[2] = b[2] * 2;
+    a[3] = b[3] + 3;
+
+    a[4] = b[4] + 4;
+    a[5] = b[5] + 4;
+    a[6] = b[6] + 4;
+    a[7] = b[7] + 4;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 1;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7))
+    abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != i + 5)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 7a2d623..fb8d11e 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1627,6 +1627,33 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
   body_cost_vec.release ();
 }
 
+/* Splits a group of statements currently beginning at FIRST_STMT into two
+   groups, one (still beginning at FIRST_STMT) containing the first I stmts,
+   the second containing the remainder.  Return the first stmt in the second
+   group.  */
+
+static gimple *
+vect_split_slp_group (gimple *first_stmt, unsigned int i)
+{
+  gcc_assert (GROUP_FIRST_ELEMENT (vinfo_for_stmt (first_stmt)) == first_stmt);
+  gcc_assert (i > 0);
+  int group2_size = GROUP_SIZE (vinfo_for_stmt (first_stmt)) - i;
+  gcc_assert (group2_size > 0);
+  GROUP_SIZE (vinfo_for_stmt (first_stmt)) = i;
+
+  gimple *stmt = first_stmt;
+  while (i-- > 1)
+    stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+  /* STMT is now the last element of the first group.  */
+  gimple *first = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
+
+  GROUP_SIZE (vinfo_for_stmt (first)) = group2_size;
+  for (stmt = first; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
+    GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = first;
+  return first;
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -1642,7 +1669,7 @@ vect_analyze_slp_instance (loop_vec_info loop_vinfo, bb_vec_info bb_vinfo,
   tree vectype, scalar_type = NULL_TREE;
   gimple *next;
   unsigned int vectorization_factor = 0;
-  int i;
+  unsigned int i;
   unsigned int max_nunits = 0;
   vec<slp_tree> loads;
   struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
@@ -1836,6 +1863,40 @@ vect_analyze_slp_instance (loop_vec_info loop_vinfo, bb_vec_info bb_vinfo,
   vect_free_slp_tree (node);
   loads.release ();
 
+  /* For basic block vectorization, try to break the group up into multiples
+     of the vectorization factor.  */
+  if (bb_vinfo && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      /* We consider breaking the group only on VF boundaries from the existing
+	 start.  */
+      for (i = 0; i < group_size; i++)
+	if (!matches[i]) break;
+
+      if (i < vectorization_factor && vectorization_factor < group_size)
+	{
+	  /* First vector is a mix of (non-/)matches, or first element was
+	     impossible for another reason.  Retry without the first vector.  */
+	  stmt = vect_split_slp_group (stmt, vectorization_factor);
+	  return vect_analyze_slp_instance (loop_vinfo, bb_vinfo,
+					    stmt, max_tree_size);
+	}
+      else if (i >= vectorization_factor && i < group_size)
+	{
+	  /* Split into two groups at the first vector boundary before i.  */
+	  gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
+	  i &= ~(vectorization_factor - 1);
+	  gimple *group2start = vect_split_slp_group (stmt, i);
+	  return vect_analyze_slp_instance (loop_vinfo, bb_vinfo,
+					    stmt, max_tree_size)
+		 | vect_analyze_slp_instance (loop_vinfo, bb_vinfo,
+					      group2start, max_tree_size);
+	 }
+    }
+  else
+    {
+      /* TODO: handle reduction case?  */
+    }
+
   return false;
 }
 
-- 
1.9.1

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] PR/67682, break SLP groups up if only some elements match
  2015-10-23 15:26 [PATCH] PR/67682, break SLP groups up if only some elements match Alan Lawrence
@ 2015-10-25 11:56 ` Alan Lawrence
  2015-10-25 12:31   ` Andrew Pinski
  2015-10-26 15:09 ` Richard Biener
  1 sibling, 1 reply; 15+ messages in thread
From: Alan Lawrence @ 2015-10-25 11:56 UTC (permalink / raw)
  To: gcc-patches

On 23 October 2015 at 16:20, Alan Lawrence <alan.lawrence@arm.com> wrote:
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> index ab54a48..b012d78 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
>    unsigned int *pout = &out[0];
>    unsigned int a0, a1, a2, a3;
>
> -  /* Non isomorphic.  */
> +  /* Non isomorphic, even 64-bit subgroups.  */
>    a0 = *pin++ + 23;
> -  a1 = *pin++ + 142;
> +  a1 = *pin++ * 142;
>    a2 = *pin++ + 2;
>    a3 = *pin++ * 31;

Erm, oops, I seem to have posted a version without the corresponding
change to result-checking in bb-slp-7.c...

Also, on second thoughts, a small change to improve the efficiency
of the recursion by skipping some known-impossible bits would not add
much complexity.

So I'll post a new version shortly with those changes.

Thanks, Alan


* Re: [PATCH] PR/67682, break SLP groups up if only some elements match
  2015-10-25 11:56 ` Alan Lawrence
@ 2015-10-25 12:31   ` Andrew Pinski
  0 siblings, 0 replies; 15+ messages in thread
From: Andrew Pinski @ 2015-10-25 12:31 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Sun, Oct 25, 2015 at 7:51 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> On 23 October 2015 at 16:20, Alan Lawrence <alan.lawrence@arm.com> wrote:
>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
>> index ab54a48..b012d78 100644
>> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
>> @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
>>    unsigned int *pout = &out[0];
>>    unsigned int a0, a1, a2, a3;
>>
>> -  /* Non isomorphic.  */
>> +  /* Non isomorphic, even 64-bit subgroups.  */
>>    a0 = *pin++ + 23;
>> -  a1 = *pin++ + 142;
>> +  a1 = *pin++ * 142;
>>    a2 = *pin++ + 2;
>>    a3 = *pin++ * 31;
>
> Erm, oops, I seem to have posted a version without the corresponding
> change to result-checking in bb-slp-7.c...
>
> Also on second thoughts a small change to improve efficiency
> of the recursion by skipping some known-impossible bits, would not add
> much complexity.
>
> So I'll post a new version shortly with those changes.

Maybe it is better to make a new test for the changed file and keep
the old one with the updated result checking?
That way the testcase is the same between versions and the only thing
that changes is the result checking.

Thanks,
Andrew

>
> Thanks, Alan


* Re: [PATCH] PR/67682, break SLP groups up if only some elements match
  2015-10-23 15:26 [PATCH] PR/67682, break SLP groups up if only some elements match Alan Lawrence
  2015-10-25 11:56 ` Alan Lawrence
@ 2015-10-26 15:09 ` Richard Biener
  2015-10-27 17:44   ` Alan Lawrence
  1 sibling, 1 reply; 15+ messages in thread
From: Richard Biener @ 2015-10-26 15:09 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: GCC Patches

On Fri, Oct 23, 2015 at 5:20 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> vect_analyze_slp_instance currently only creates an slp_instance if _all_ stores
> in a group fitted the same pattern. This patch splits non-matching groups up
> on vector boundaries, allowing only part of the group to be SLP'd, or multiple
> subgroups to be SLP'd differently.
>
> The algorithm could be made more efficient: we have more info available in
> the matches vector, and a single match in a vector full of non-matches, means
> we will be unable to SLP). But the double-recursion case has at most log(N)
> depth and the single-recursion case is at worst N^2 in *the group width*, which
> is generally small.
>
> This could possibly be extended to hybrid SLP, but I believe that would also
> require splitting the load groups, as at present removing the bb_vinfo check
> ends up with data being stored to the wrong locations e.g. in slp-11a.c. Hence,
> leaving that extension to a future patch.
>
> Bootstrapped + check-{gcc,g++,fortran} on aarch64, x86_64 and arm (v7-a neon).
>
> Thanks, Alan
>
> gcc/ChangeLog:
>
>         * tree-vect-slp.c (vect_split_slp_group): New.
>         (vect_analyze_slp_instance): Recurse on subgroup(s) if
>         vect_build_slp_tree fails during basic block SLP.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/vect/bb-slp-7.c: Make that SLP group even more non-isomorphic.
>         * gcc.dg/vect/bb-slp-subgroups-1.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-2.c: New.
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-7.c           |  8 ++--
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 ++++++++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++++++
>  gcc/tree-vect-slp.c                            | 63 +++++++++++++++++++++++++-
>  4 files changed, 152 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> index ab54a48..b012d78 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
>    unsigned int *pout = &out[0];
>    unsigned int a0, a1, a2, a3;
>
> -  /* Non isomorphic.  */
> +  /* Non isomorphic, even 64-bit subgroups.  */
>    a0 = *pin++ + 23;
> -  a1 = *pin++ + 142;
> +  a1 = *pin++ * 142;
>    a2 = *pin++ + 2;
>    a3 = *pin++ * 31;
> -
> +
>    *pout++ = a0 * x;
>    *pout++ = a1 * y;
>    *pout++ = a2 * x;
> @@ -47,4 +47,4 @@ int main (void)
>  }
>
>  /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
> -
> +
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> new file mode 100644
> index 0000000..39c23c3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> @@ -0,0 +1,44 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[4];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0];
> +    a[1] = b[1];
> +    a[2] = b[2];
> +    a[3] = b[3];
> +    a[4] = 0;
> +    a[5] = 0;
> +    a[6] = 0;
> +    a[7] = 0;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 4; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  for (int i = 0; i < 4; i++)
> +    if (a[i] != i+4)
> +      abort ();
> +  for (int i = 4; i < 8; i++)
> +    if (a[i] != 0)
> +      abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> new file mode 100644
> index 0000000..06099fd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> @@ -0,0 +1,42 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[8];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0];
> +    a[1] = b[1] + 1;
> +    a[2] = b[2] * 2;
> +    a[3] = b[3] + 3;
> +
> +    a[4] = b[4] + 4;
> +    a[5] = b[5] + 4;
> +    a[6] = b[6] + 4;
> +    a[7] = b[7] + 4;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    b[i] = i + 1;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7))
> +    abort ();
> +  for (int i = 4; i < 8; i++)
> +    if (a[i] != i + 5)
> +      abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> index 7a2d623..fb8d11e 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -1627,6 +1627,33 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
>    body_cost_vec.release ();
>  }
>
> +/* Splits a group of statements currently beginning at FIRST_STMT into two
> +   groups, one (still beginning at FIRST_STMT) containing the first I stmts,
> +   the second containing the remainder.  Return the first stmt in the second
> +   group.  */
> +
> +static gimple *
> +vect_split_slp_group (gimple *first_stmt, unsigned int i)
> +{
> +  gcc_assert (GROUP_FIRST_ELEMENT (vinfo_for_stmt (first_stmt)) == first_stmt);
> +  gcc_assert (i > 0);
> +  int group2_size = GROUP_SIZE (vinfo_for_stmt (first_stmt)) - i;
> +  gcc_assert (group2_size > 0);
> +  GROUP_SIZE (vinfo_for_stmt (first_stmt)) = i;
> +
> +  gimple *stmt = first_stmt;
> +  while (i-- > 1)
> +    stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
> +  /* STMT is now the last element of the first group.  */
> +  gimple *first = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
> +  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
> +
> +  GROUP_SIZE (vinfo_for_stmt (first)) = group2_size;
> +  for (stmt = first; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
> +    GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = first;

Apart from the fact that you'll post a new version, you need to adjust
GROUP_GAP.  You also seem to somewhat "confuse" "first I stmts" and "a group
of size I"; those are not the same when the group has gaps.  I'd say "a group
of size i" makes the most sense here, thus I suggest adjusting the function
comment accordingly.

GROUP_GAP is the gap from the previous group element.  The group's first
element's GROUP_GAP is the gap at the _end_ of the whole group.  Thus
for the 2nd group the first stmt's GROUP_GAP needs to be the GROUP_GAP
of the 1st stmt of the original group.  For the new 1st group the GROUP_GAP
of the 1st stmt should be the old GROUP_GAP of the 1st element of the
new 2nd group.  And of course the counting of 'i' needs to properly factor
in GROUP_GAP as well.

This is probably what broke the loop-vectorization side, but I'm quite sure
I added some intermediate-gap BB vectorization support this year as well.

Like

   ... = a[0];
   ... = a[2];
   ... = a[2];
   ... = a[3];

just no gaps at the boundaries of the group (but after being split they might
be on a boundary).
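
A quick sketch of that convention (illustrative Python, not GCC code):
each element's GROUP_GAP is its distance from the previous element, and
the first element's slot records the gap at the _end_ of the group.

```python
# Illustrative model of the GROUP_GAP convention (not GCC code).

def group_gaps(indices, group_extent):
    """indices: scalar access positions of the group's elements, in
    group order; group_extent: number of scalar slots the group spans."""
    gaps = [indices[k] - indices[k - 1] for k in range(1, len(indices))]
    # The first element's slot holds the gap at the *end* of the group.
    end_gap = group_extent - 1 - indices[-1]
    return [end_gap] + gaps

# The example above: loads from a[0], a[2], a[2], a[3] -- an intermediate
# gap (a[1] is skipped) and a repeated element, but no gap at either
# boundary of the group.
print(group_gaps([0, 2, 2, 3], 4))   # -> [0, 2, 0, 1]
```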

Richard.

> +  return first;
> +}
> +
>  /* Analyze an SLP instance starting from a group of grouped stores.  Call
>     vect_build_slp_tree to build a tree of packed stmts if possible.
>     Return FALSE if it's impossible to SLP any stmt in the loop.  */
> @@ -1642,7 +1669,7 @@ vect_analyze_slp_instance (loop_vec_info loop_vinfo, bb_vec_info bb_vinfo,
>    tree vectype, scalar_type = NULL_TREE;
>    gimple *next;
>    unsigned int vectorization_factor = 0;
> -  int i;
> +  unsigned int i;
>    unsigned int max_nunits = 0;
>    vec<slp_tree> loads;
>    struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
> @@ -1836,6 +1863,40 @@ vect_analyze_slp_instance (loop_vec_info loop_vinfo, bb_vec_info bb_vinfo,
>    vect_free_slp_tree (node);
>    loads.release ();
>
> +  /* For basic block vectorization, try to break the group up into multiples
> +     of the vectorization factor.  */
> +  if (bb_vinfo && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
> +    {
> +      /* We consider breaking the group only on VF boundaries from the existing
> +        start.  */
> +      for (i = 0; i < group_size; i++)
> +       if (!matches[i]) break;
> +
> +      if (i < vectorization_factor && vectorization_factor < group_size)
> +       {
> +         /* First vector is a mix of (non-/)matches, or first element was
> +            impossible for another reason.  Retry without the first vector.  */
> +         stmt = vect_split_slp_group (stmt, vectorization_factor);
> +         return vect_analyze_slp_instance (loop_vinfo, bb_vinfo,
> +                                           stmt, max_tree_size);
> +       }
> +      else if (i >= vectorization_factor && i < group_size)
> +       {
> +         /* Split into two groups at the first vector boundary before i.  */
> +         gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
> +         i &= ~(vectorization_factor - 1);
> +         gimple *group2start = vect_split_slp_group (stmt, i);
> +         return vect_analyze_slp_instance (loop_vinfo, bb_vinfo,
> +                                           stmt, max_tree_size)
> +                | vect_analyze_slp_instance (loop_vinfo, bb_vinfo,
> +                                             group2start, max_tree_size);
> +        }
> +    }
> +  else
> +    {
> +      /* TODO: handle reduction case?  */
> +    }
> +
>    return false;
>  }
>
> --
> 1.9.1
>


* Re: [PATCH] PR/67682, break SLP groups up if only some elements match
  2015-10-26 15:09 ` Richard Biener
@ 2015-10-27 17:44   ` Alan Lawrence
  2015-11-03 13:39     ` Richard Biener
  0 siblings, 1 reply; 15+ messages in thread
From: Alan Lawrence @ 2015-10-27 17:44 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

On 26/10/15 15:04, Richard Biener wrote:
>
> apart from the fact that you'll post a new version you need to adjust GROUP_GAP.
> You also seem to somewhat "confuse" "first I stmts" and "a group of
> size I", those
> are not the same when the group has haps.  I'd say "a group of size i" makes the
> most sense here thus I suggest to adjust the function comment accordingly.

Ok, thanks for pointing this out. My objective had been to only split the
store groups - which, in BB vectorization, always seem to have gaps
0 1 1 ... 1.  I didn't come up with a good scheme for how to split load
groups, but it seemed that I didn't need to do anything there if I restricted
to BB vectorization only. For example, consider (ignoring that we could
multiply the first four elements by 1 and add 0 to the last four):

     a[0] = b[I] + 1;
     a[1] = b[J] + 2;
     a[2] = b[K] + 3;
     a[3] = b[L] + 4;
     a[4] = b[M] * 3;
     a[5] = b[N] * 4;
     a[6] = b[O] * 5;
     a[7] = b[P] * 7;


with constants I,J,K,L,M,N,O,P. Even with those being a sequence 2 0 1 1 3 0 2 1 
with overlaps and repetitions, this works fine for BB SLP (two subgroups of 
stores, *sharing* a load group but with different permutations). Likewise 0 1 2 
3 0 2 4 6.

For loop SLP, yes it looks like the load group needs to be split. So how; and 
what constraints to impose on those constants? (There is no single right answer!)

A fairly-strict scheme could be that (I,J,K,L) must be within a contiguous block 
of memory, that does not overlap with the contiguous block containing (M,N,O,P). 
Then, splitting the load group on the boundary seems reasonable, and updating 
the gaps as you suggest. However, when you say "the group first element's
GROUP_GAP is the gap at the _end_ of the whole group" - the gap at the end is
the gap that comes after the last element, extending up to... what?

Say I...P are consecutive, the input would have gaps 0 1 1 1 1 1 1 1. If we 
split the load group, we would want subgroups with gaps 0 1 1 1 and 0 1 1 1? 
(IIUC, you suggest 1111 and 0111?)

If they are disjoint sets, but overlapping blocks of memory, say 0 2 4 6 1 3 5 
7...then do we create two load groups, with gap 0 2 2 2 and 0 2 2 2 again? Does 
something record that the load groups access overlapping areas, and record the 
offset against each other?

If there are repeated elements (as in the BB SLP case mentioned above), I'm not
clear how we can split this effectively... so we may have to rule out that case.
(Moreover, if we are considering hybrid SLP, it may not be clear what the loop
accesses are; we may be presented only with the SLP accesses. Do we necessarily
want to pull those out of a load group?)

So I expect I may resolve some of these issues as I progress, but I'm curious as 
to whether (and why) the patch was really broken (wrt gaps) as it stood...

Thanks,
Alan


* Re: [PATCH] PR/67682, break SLP groups up if only some elements match
  2015-10-27 17:44   ` Alan Lawrence
@ 2015-11-03 13:39     ` Richard Biener
  2015-11-05 13:34       ` Alan Lawrence
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Biener @ 2015-11-03 13:39 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: GCC Patches

On Tue, Oct 27, 2015 at 6:38 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> On 26/10/15 15:04, Richard Biener wrote:
>>
>>
>> apart from the fact that you'll post a new version you need to adjust
>> GROUP_GAP.
>> You also seem to somewhat "confuse" "first I stmts" and "a group of
>> size I", those
>> are not the same when the group has haps.  I'd say "a group of size i"
>> makes the
>> most sense here thus I suggest to adjust the function comment accordingly.
>
>
> Ok, thanks for pointing this out. My objective had been to only split the
> store groups - which in BB vectorization, always seem to have gap 0 1 1 ....
> 1. I didn't come up with a good scheme for how to split load groups, but it
> seemed that I didn't need to do anything there if I restricted to BB
> vectorization only. For example, consider (ignoring that we could multiply
> the first four elements by 1 and add 0 to the last four):
>
>     a[0] = b[I] + 1;
>     a[1] = b[J] + 2;
>     a[2] = b[K] + 3;
>     a[3] = b[L] + 4;
>     a[4] = b[M] * 3;
>     a[5] = b[N] * 4;
>     a[6] = b[O] * 5;
>     a[7] = b[P] * 7;
>
>
> with constants I,J,K,L,M,N,O,P. Even with those being a sequence 2 0 1 1 3 0
> 2 1 with overlaps and repetitions, this works fine for BB SLP (two subgroups
> of stores, *sharing* a load group but with different permutations). Likewise
> 0 1 2 3 0 2 4 6.
>
> For loop SLP, yes it looks like the load group needs to be split. So how;
> and what constraints to impose on those constants? (There is no single right
> answer!)
>
> A fairly-strict scheme could be that (I,J,K,L) must be within a contiguous
> block of memory, that does not overlap with the contiguous block containing
> (M,N,O,P). Then, splitting the load group on the boundary seems reasonable,
> and updating the gaps as you suggest. However, when you say "the group first
> elements GROUP_GAP is the gap at the _end_ of the whole group" - the gap at
> the end is the gap that comes after the last element and up to....what?
>
> Say I...P are consecutive, the input would have gaps 0 1 1 1 1 1 1 1. If we
> split the load group, we would want subgroups with gaps 0 1 1 1 and 0 1 1 1?
> (IIUC, you suggest 1111 and 0111?)

As said on IRC it should be 4 1 1 1 and 4 1 1 1.

> If they are disjoint sets, but overlapping blocks of memory, say 0 2 4 6 1 3
> 5 7...then do we create two load groups, with gap 0 2 2 2 and 0 2 2 2 again?
> Does something record that the load groups access overlapping areas, and
> record the offset against each other?

No, I don't think we can split load groups that way.  So I think if
splitting store
groups works well (with having larger load groups) then that's the way to go
(even for loop vect).

> If there are repeated elements (as in the BB SLP case mentioned above), I'm
> not clear how we can split this effectively...so may have to rule out that
> case. (Moreover, if we are considering hybrid SLP, it may not be clear what
> the loop accesses are, we may be presented only with the SLP accesses. Do we
> necessarily want to pull those out of a load group?)
>
> So I expect I may resolve some of these issues as I progress, but I'm
> curious as to whether (and why) the patch was really broken (wrt gaps) as it
> stood...

Yes, the gaps were clearly bogusly constructed in general.  If you have an
existing group you can only split it into non-overlapping groups.  Thus for
two load SLP nodes loading from 0 2 4 6 and from 1 3 5 7 you will have
a single "group" (0 1 2 3 4 5 6 7) and you can at most split it as
0 1 2 3, 4 5 6 7 which won't help in this case (but would be actually worse).

So I think restricting the splitting to the stores should work fine.

Richard.

> Thanks,
> Alan
>


* Re: [PATCH] PR/67682, break SLP groups up if only some elements match
  2015-11-03 13:39     ` Richard Biener
@ 2015-11-05 13:34       ` Alan Lawrence
  2015-11-06 12:55         ` Richard Biener
  0 siblings, 1 reply; 15+ messages in thread
From: Alan Lawrence @ 2015-11-05 13:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther

On 03/11/15 13:39, Richard Biener wrote:
> On Tue, Oct 27, 2015 at 6:38 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>>
>> Say I...P are consecutive, the input would have gaps 0 1 1 1 1 1 1 1. If we
>> split the load group, we would want subgroups with gaps 0 1 1 1 and 0 1 1 1?
>
> As said on IRC it should be 4 1 1 1 and 4 1 1 1.

Right. And so, if we have a twelve-element group (0 1 1 1 1 1 1 1 1 1 1 1),
by the time it has become three subgroups, these should each be (8 1 1 1),
via an intermediate stage of (4 1 1 1 1 1 1 1) (8 1 1 1). This leads to the
code in the attached patch.
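
The bookkeeping behind those numbers can be sketched as follows (an
illustrative Python model, not the actual GCC code): on a split, each
half's first-element GROUP_GAP grows by the size of the other half, so
the tail gap keeps covering everything the subgroup no longer accesses.

```python
# Illustrative model of the gap adjustment when splitting a store group
# whose gaps are (first_gap 1 1 ... 1).  Not GCC code.

def split_gaps(gaps, group1_size):
    group1, group2 = gaps[:group1_size], gaps[group1_size:]
    # Group 1's tail gap now also skips over group 2's elements.
    group1[0] += len(group2)
    # Group 2's tail gap is the original tail gap plus group 1's elements.
    group2[0] = gaps[0] + group1_size
    return group1, group2

eight = [0, 1, 1, 1, 1, 1, 1, 1]
print(split_gaps(eight, 4))
# -> ([4, 1, 1, 1], [4, 1, 1, 1])

twelve = [0] + [1] * 11
g1, g2 = split_gaps(twelve, 8)   # intermediate: (4 1x7) and (8 1 1 1)
print(split_gaps(g1, 4))
# -> ([8, 1, 1, 1], [8, 1, 1, 1])
```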

> No, I don't think we can split load groups that way.  So I think if
> splitting store
> groups works well (with having larger load groups) then that's the way to go
> (even for loop vect).

Well, slp-11a.c still fails if I enable the splitting for non-BB SLP; I was
thinking this was because I needed to split the load groups too, but maybe not
- maybe this is a separate bug/issue with hybrid SLP. Whatever the reason, I
still think splitting groups in hybrid SLP is another patch. (Do we really want
to put off handling the basic-block case until it works for hybrid SLP as well?
IMHO I would think not.) It sounds as if the approach of restricting splitting
to store groups, with appropriate asserts that GROUP_GAP == 1, is thus the
right thing to do in the longer term too (hence the rename to
vect_split_slp_store_group, to emphasize that) - at least until we remove that
restriction on SLP generally.

Bootstrapped + check-{gcc,g++,gfortran} on x86_64, AArch64, ARM.

Re. the extra skipping loop, I think it would theoretically be possible for the
recursive call to vect_slp_analyze to succeed on an element where the original
failed, because it may have more num_permutes remaining (after skipping over
the first vector). So there's a second argument (besides code complexity) for
dropping that part....?

gcc/ChangeLog:

	* tree-vect-slp.c (vect_split_slp_store_group): New.
	(vect_analyze_slp_instance): Recurse on subgroup(s) if
	vect_build_slp_tree fails during basic block SLP.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
	* gcc.dg/vect/bb-slp-subgroups-1.c: New.
	* gcc.dg/vect/bb-slp-subgroups-2.c: New.
	* gcc.dg/vect/bb-slp-subgroups-3.c: New.
	* gcc.dg/vect/bb-slp-subgroups-4.c: New.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 +--
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 ++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c | 41 ++++++++++++
 gcc/tree-vect-slp.c                            | 87 +++++++++++++++++++++++++-
 6 files changed, 259 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
index ab54a48..b8bef8c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
@@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
   unsigned int *pout = &out[0];
   unsigned int a0, a1, a2, a3;
 
-  /* Non isomorphic.  */
+  /* Non isomorphic, even 64-bit subgroups.  */
   a0 = *pin++ + 23;
-  a1 = *pin++ + 142;
+  a1 = *pin++ * 142;
   a2 = *pin++ + 2;
   a3 = *pin++ * 31;
-  
+
   *pout++ = a0 * x;
   *pout++ = a1 * y;
   *pout++ = a2 * x;
@@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
 
   /* Check results.  */
   if (out[0] != (in[0] + 23) * x
-      || out[1] != (in[1] + 142) * y
+      || out[1] != (in[1] * 142) * y
       || out[2] != (in[2] + 2) * x
       || out[3] != (in[3] * 31) * y)
     abort();
@@ -47,4 +47,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
-  
+
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
new file mode 100644
index 0000000..39c23c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
@@ -0,0 +1,44 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1];
+    a[2] = b[2];
+    a[3] = b[3];
+    a[4] = 0;
+    a[5] = 0;
+    a[6] = 0;
+    a[7] = 0;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  for (int i = 0; i < 4; i++)
+    if (a[i] != i+4)
+      abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != 0)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
new file mode 100644
index 0000000..06099fd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
@@ -0,0 +1,42 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1] + 1;
+    a[2] = b[2] * 2;
+    a[3] = b[3] + 3;
+
+    a[4] = b[4] + 4;
+    a[5] = b[5] + 4;
+    a[6] = b[6] + 4;
+    a[7] = b[7] + 4;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 1;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7))
+    abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != i + 5)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
new file mode 100644
index 0000000..13c51f3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[2] + 1;
+    a[1] = b[0] + 2;
+    a[2] = b[1] + 3;
+    a[3] = b[1] + 4;
+    a[4] = b[3] * 3;
+    a[5] = b[0] * 4;
+    a[6] = b[2] * 5;
+    a[7] = b[1] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
+      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
new file mode 100644
index 0000000..6ae9a89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0] + 1;
+    a[1] = b[1] + 2;
+    a[2] = b[2] + 3;
+    a[3] = b[3] + 4;
+    a[4] = b[0] * 3;
+    a[5] = b[2] * 4;
+    a[6] = b[4] * 5;
+    a[7] = b[6] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
+      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 1424123..c01632a 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1641,6 +1641,50 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
   body_cost_vec.release ();
 }
 
+/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
+   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
+   the first GROUP1_SIZE stmts, since stores are consecutive), the second
+   containing the remainder.
+   Return the first stmt in the second group.  */
+
+static gimple *
+vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
+{
+  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
+  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
+  gcc_assert (group1_size > 0);
+  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
+  gcc_assert (group2_size > 0);
+  GROUP_SIZE (first_vinfo) = group1_size;
+
+  gimple *stmt = first_stmt;
+  for (unsigned i = group1_size; i > 1; i--)
+    {
+      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+  /* STMT is now the last element of the first group.  */
+  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
+
+  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
+  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+
+  /* For the second group, the GROUP_GAP is that before the original group,
+     plus skipping over the first vector.  */
+  GROUP_GAP (vinfo_for_stmt (group2)) =
+    GROUP_GAP (first_vinfo) + group1_size;
+
+  /* GROUP_GAP of the first group now has to skip over the second group too.  */
+  GROUP_GAP (first_vinfo) += group2_size;
+
+  return group2;
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -1656,7 +1700,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
   tree vectype, scalar_type = NULL_TREE;
   gimple *next;
   unsigned int vectorization_factor = 0;
-  int i;
+  unsigned int i;
   unsigned int max_nunits = 0;
   vec<slp_tree> loads;
   struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
@@ -1846,6 +1890,47 @@ vect_analyze_slp_instance (vec_info *vinfo,
   vect_free_slp_tree (node);
   loads.release ();
 
+  /* For basic block vectorization, try to break the group up into multiples
+     of the vectorization factor.  */
+  if (is_a <bb_vec_info> (vinfo) && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      /* We consider breaking the group only on VF boundaries from the existing
+	 start.  */
+      for (i = 0; i < group_size; i++)
+	if (!matches[i]) break;
+
+      if (i < vectorization_factor)
+	{
+	  /* First vector is a mix of (non-/)matches, or first element was
+	     impossible for another reason.  Skip the first vector, and look
+	     for a vector's worth of consecutive (non-/)matches.  */
+	  i = vectorization_factor;
+	  while (i < group_size)
+	    {
+	      for (unsigned j = i + 1; matches[j] == matches[i]; j++)
+		if (j == (i + vectorization_factor - 1))
+		  {
+		    /* Either all in that vector match, or none do.
+		       Retry SLP from that vector onwards.  */
+		    stmt = vect_split_slp_store_group (stmt, i);
+		    return vect_analyze_slp_instance (vinfo,
+						      stmt, max_tree_size);
+		  }
+	      /* Vector contains both matches and non-matches.  Skip over.  */
+	      i += vectorization_factor;
+	    }
+	}
+      else if (i < group_size)
+	{
+	  /* Split into two groups at the first vector boundary before i.  */
+	  gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
+	  i &= ~(vectorization_factor - 1);
+	  gimple *grp2start = vect_split_slp_store_group (stmt, i);
+	  return vect_analyze_slp_instance (vinfo, stmt, max_tree_size)
+		 | vect_analyze_slp_instance (vinfo, grp2start, max_tree_size);
+	 }
+    }
+
   return false;
 }
 
-- 
1.9.1

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] PR/67682, break SLP groups up if only some elements match
  2015-11-05 13:34       ` Alan Lawrence
@ 2015-11-06 12:55         ` Richard Biener
  2015-11-09 14:59           ` Alan Lawrence
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Biener @ 2015-11-06 12:55 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: GCC Patches

On Thu, Nov 5, 2015 at 2:33 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> On 03/11/15 13:39, Richard Biener wrote:
>> On Tue, Oct 27, 2015 at 6:38 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>>>
>>> Say I...P are consecutive, the input would have gaps 0 1 1 1 1 1 1 1. If we
>>> split the load group, we would want subgroups with gaps 0 1 1 1 and 0 1 1 1?
>>
>> As said on IRC it should be 4 1 1 1 and 4 1 1 1.
>
> Right. And so, if we have a twelve-element group (0 1 1 1 1 1 1 1 1 1 1 1), by
> the time it became three subgroups, these should each be (8 1 1 1), via an
> intermediate stage of (4 1 1 1 1 1 1 1) (8 1 1 1). This leads to the code in
> the attached patch.
>
>> No, I don't think we can split load groups that way.  So I think if
>> splitting store
>> groups works well (with having larger load groups) then that's the way to go
>> (even for loop vect).
>
> Well, slp-11a.c still fails if I enable the splitting for non-BB SLP; I was
> thinking this was because I needed to split the load groups too, but maybe not
> - maybe this is a separate bug/issue with hybrid SLP. Whatever the reason, I
> still think splitting groups in hybrid SLP is another patch. (Do we really want
> to put off handling the basic-block case until it works for hybrid SLP as well?
> IMHO I would think not.)

No.

> It sounds as if the approach of restricting splitting
> to store groups, with appropriate asserts that GROUP_GAP == 1, is thus the right
> thing to do in the longer term too (hence, renamed vect_split_slp_store_group to
> emphasize that) - at least until we remove that restriction on SLP generally.
>
> Bootstrapped + check-{gcc,g++,gfortran} on x86_64, AArch64, ARM.
>
> Re. the extra skipping loop, I think it would theoretically be possible for the
> recursive call to vect_slp_analyze to succeed on an element where the original
> failed, because it may have more num_permutes remaining (after skipping over
> the first vector). So there's a second argument (besides code complexity) for
> dropping that part....?
>
> gcc/ChangeLog:
>
>         * tree-vect-slp.c (vect_split_slp_store_group): New.
>         (vect_analyze_slp_instance): Recurse on subgroup(s) if
>         vect_build_slp_tree fails during basic block SLP.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
>         * gcc.dg/vect/bb-slp-subgroups-1.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-2.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-3.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-4.c: New.
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 +--
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 ++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c | 41 ++++++++++++
>  gcc/tree-vect-slp.c                            | 87 +++++++++++++++++++++++++-
>  6 files changed, 259 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> index ab54a48..b8bef8c 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
>    unsigned int *pout = &out[0];
>    unsigned int a0, a1, a2, a3;
>
> -  /* Non isomorphic.  */
> +  /* Non isomorphic, even 64-bit subgroups.  */
>    a0 = *pin++ + 23;
> -  a1 = *pin++ + 142;
> +  a1 = *pin++ * 142;
>    a2 = *pin++ + 2;
>    a3 = *pin++ * 31;
> -
> +
>    *pout++ = a0 * x;
>    *pout++ = a1 * y;
>    *pout++ = a2 * x;
> @@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
>
>    /* Check results.  */
>    if (out[0] != (in[0] + 23) * x
> -      || out[1] != (in[1] + 142) * y
> +      || out[1] != (in[1] * 142) * y
>        || out[2] != (in[2] + 2) * x
>        || out[3] != (in[3] * 31) * y)
>      abort();
> @@ -47,4 +47,4 @@ int main (void)
>  }
>
>  /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
> -
> +
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> new file mode 100644
> index 0000000..39c23c3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> @@ -0,0 +1,44 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[4];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0];
> +    a[1] = b[1];
> +    a[2] = b[2];
> +    a[3] = b[3];
> +    a[4] = 0;
> +    a[5] = 0;
> +    a[6] = 0;
> +    a[7] = 0;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 4; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  for (int i = 0; i < 4; i++)
> +    if (a[i] != i+4)
> +      abort ();
> +  for (int i = 4; i < 8; i++)
> +    if (a[i] != 0)
> +      abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> new file mode 100644
> index 0000000..06099fd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> @@ -0,0 +1,42 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[8];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0];
> +    a[1] = b[1] + 1;
> +    a[2] = b[2] * 2;
> +    a[3] = b[3] + 3;
> +
> +    a[4] = b[4] + 4;
> +    a[5] = b[5] + 4;
> +    a[6] = b[6] + 4;
> +    a[7] = b[7] + 4;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    b[i] = i + 1;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7))
> +    abort ();
> +  for (int i = 4; i < 8; i++)
> +    if (a[i] != i + 5)
> +      abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> new file mode 100644
> index 0000000..13c51f3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> @@ -0,0 +1,41 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[4];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[2] + 1;
> +    a[1] = b[0] + 2;
> +    a[2] = b[1] + 3;
> +    a[3] = b[1] + 4;
> +    a[4] = b[3] * 3;
> +    a[5] = b[0] * 4;
> +    a[6] = b[2] * 5;
> +    a[7] = b[1] * 7;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 4; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
> +      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
> +    abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
> new file mode 100644
> index 0000000..6ae9a89
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
> @@ -0,0 +1,41 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[8];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0] + 1;
> +    a[1] = b[1] + 2;
> +    a[2] = b[2] + 3;
> +    a[3] = b[3] + 4;
> +    a[4] = b[0] * 3;
> +    a[5] = b[2] * 4;
> +    a[6] = b[4] * 5;
> +    a[7] = b[6] * 7;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 8; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
> +      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
> +    abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> index 1424123..c01632a 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -1641,6 +1641,50 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
>    body_cost_vec.release ();
>  }
>
> +/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
> +   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
> +   the first GROUP1_SIZE stmts, since stores are consecutive), the second
> +   containing the remainder.
> +   Return the first stmt in the second group.  */
> +
> +static gimple *
> +vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
> +{
> +  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
> +  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
> +  gcc_assert (group1_size > 0);
> +  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
> +  gcc_assert (group2_size > 0);
> +  GROUP_SIZE (first_vinfo) = group1_size;
> +
> +  gimple *stmt = first_stmt;
> +  for (unsigned i = group1_size; i > 1; i--)
> +    {
> +      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
> +      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
> +    }
> +  /* STMT is now the last element of the first group.  */
> +  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
> +  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
> +
> +  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
> +  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
> +    {
> +      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
> +      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
> +    }
> +
> +  /* For the second group, the GROUP_GAP is that before the original group,
> +     plus skipping over the first vector.  */
> +  GROUP_GAP (vinfo_for_stmt (group2)) =
> +    GROUP_GAP (first_vinfo) + group1_size;
> +
> +  /* GROUP_GAP of the first group now has to skip over the second group too.  */
> +  GROUP_GAP (first_vinfo) += group2_size;

Please add a MSG_NOTE debug printf stating that we split the group and
at which element.

> +  return group2;
> +}
> +
>  /* Analyze an SLP instance starting from a group of grouped stores.  Call
>     vect_build_slp_tree to build a tree of packed stmts if possible.
>     Return FALSE if it's impossible to SLP any stmt in the loop.  */
> @@ -1656,7 +1700,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
>    tree vectype, scalar_type = NULL_TREE;
>    gimple *next;
>    unsigned int vectorization_factor = 0;
> -  int i;
> +  unsigned int i;
>    unsigned int max_nunits = 0;
>    vec<slp_tree> loads;
>    struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
> @@ -1846,6 +1890,47 @@ vect_analyze_slp_instance (vec_info *vinfo,
>    vect_free_slp_tree (node);
>    loads.release ();
>
> +  /* For basic block vectorization, try to break the group up into multiples
> +     of the vectorization factor.  */
> +  if (is_a <bb_vec_info> (vinfo) && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))

I think you want to add && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)),
otherwise this could be SLP reductions, where there is no way the split
group would enable vectorization.

> +    {
> +      /* We consider breaking the group only on VF boundaries from the existing
> +        start.  */
> +      for (i = 0; i < group_size; i++)
> +       if (!matches[i]) break;
> +
> +      if (i < vectorization_factor)
> +       {
> +         /* First vector is a mix of (non-/)matches, or first element was
> +            impossible for another reason.  Skip the first vector, and look
> +            for a vector's worth of consecutive (non-/)matches.  */
> +         i = vectorization_factor;
> +         while (i < group_size)
> +           {
> +             for (unsigned j = i + 1; matches[j] == matches[i]; j++)
> +               if (j == (i + vectorization_factor - 1))
> +                 {
> +                   /* Either all in that vector match, or none do.
> +                      Retry SLP from that vector onwards.  */

Hmm.  This is of course pretty bad for compile-time for the non-matching
case as that effectively will always split into N pieces if we feed it
garbage (that is, without being sure that at least one piece _will_ vectorize).

OTOH with the current way of computing "matches" we'd restrict ourselves
to cases where the first stmt we look at (and match to) happens to be
the operation that in the end will form a matching group.

> +                   stmt = vect_split_slp_store_group (stmt, i);
> +                   return vect_analyze_slp_instance (vinfo,
> +                                                     stmt, max_tree_size);
> +                 }
> +             /* Vector contains both matches and non-matches.  Skip over.  */
> +             i += vectorization_factor;
> +           }
> +       }
> +      else if (i < group_size)
> +       {
> +         /* Split into two groups at the first vector boundary before i.  */
> +         gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
> +         i &= ~(vectorization_factor - 1);
> +         gimple *grp2start = vect_split_slp_store_group (stmt, i);
> +         return vect_analyze_slp_instance (vinfo, stmt, max_tree_size)
> +                | vect_analyze_slp_instance (vinfo, grp2start, max_tree_size);

I think this part is more obvious, at least one vectorization_factor size
group at the beginning "matches" and we're iteratively splitting the 2nd "half".

Note that BB vectorization now also very aggressively falls back to considering
non-matches as "external".

Not sure why that doesn't trigger for your testcases ;)

I'm comfortable with the i < group_size half of the patch.  For the other piece
I'd like to see more compile-time / success data from, say, building
SPEC CPU 2006.  Eventually we'd want to improve the "matches" return
to include the affected stmts (ok, that'll not be very easy) so we can do
a cheap "if we split here will it fix that particular mismatch" check.

Eventually the SLP tree build could maintain a "hierarchy" of 'matches'
itself, that is, make matches[] an array of ints referring to the element it
matches to (or itself if none).  That will make the non-matching case a bit
more expensive during the SLP tree build, of course.

So, please split the patch; I suggest leaving the controversial part
for next stage1, together with some improvement on the SLP build process
itself?

Richard.

> +        }
> +    }
> +
>    return false;
>  }
>
> --
> 1.9.1
>


* Re: [PATCH] PR/67682, break SLP groups up if only some elements match
  2015-11-06 12:55         ` Richard Biener
@ 2015-11-09 14:59           ` Alan Lawrence
  2015-11-10 12:45             ` Richard Biener
  0 siblings, 1 reply; 15+ messages in thread
From: Alan Lawrence @ 2015-11-09 14:59 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther

On 06/11/15 12:55, Richard Biener wrote:
>
>> +  /* GROUP_GAP of the first group now has to skip over the second group too.  */
>> +  GROUP_GAP (first_vinfo) += group2_size;
>
> Please add a MSG_NOTE debug printf stating that we split the group and
> at which element.

Done.

> I think you want to add && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt))
> otherwise this could be SLP reductions where there is no way the split
> group would enable vectorization.

Ah, I had thought that the (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))) check
sufficed for that, as per e.g. the section above
  /* Create a node (a root of the SLP tree) for the packed grouped stores. */
But done.

> Note that BB vectorization now also very aggressively falls back to considering
> non-matches as "external".
>
> Not sure why that doesn't trigger for your testcases ;)

I tested against trunk r229944, on which all of my scan-tree-dumps were failing...

> I'm comfortable with the i < group_size half of the patch.  For the other piece
> I'd like to see more compile-time / success data from, say, building
> SPEC CPU 2006.

Well, as a couple of quick data points, a compile of SPEC2000 on
aarch64-none-linux-gnu (-Ofast -fomit-frame-pointer -mcpu=cortex-a57), I have:

3080 successes without patch;
+79 successes from the "i < vectorization_factor" part of the patch (on its own)
+90 successes from the (i>=vectorization_factor) && "i < group_size" part (on its own)
+(79 from first) +(90 from second) + (an additional 62) from both parts together.

And for SPEC2006, aarch64-linux-gnu (-O3 -fomit-frame-pointer -mcpu=cortex-a57):
11979 successes without patch;
+ 499 from the "i < vectorization_factor" part
+ 264 from the (i >= vectorization factor) && (i < group_size)" part
+ extra 336 if both parts combined.

I haven't done any significant measurements of compile-time yet.

(snipping this bit out-of-order)
> Hmm. This is of course pretty bad for compile-time for the non-matching
> case as that effectively will always split into N pieces if we feed it
> garbage (that is, without being sure that at least one piece _will_ vectorize).
>
> OTOH with the current way of computing "matches" we'd restrict ourselves
> to cases where the first stmt we look at (and match to) happens to be
> the operation that in the end will form a matching group.
...
> Eventually we'd want to improve the "matches" return
> to include the affected stmts (ok, that'll be not very easy) so we can do
> a cheap "if we split here will it fix that particular mismatch" check.

Yes, I think there are a bunch of things we can do here that would be more
powerful than the simple approach I used. The biggest limiting factor will
probably be (lack of) permutation, i.e. if we only SLP stores to consecutive
addresses.

> So, please split the patch and I suggest to leave the controversical part
> for next stage1 together with some improvement on the SLP build process
> itself?

Here's a reduced version with just the second case,
bootstrapped+check-gcc/g++ on x86_64.

gcc/ChangeLog:

	* tree-vect-slp.c (vect_split_slp_store_group): New.
	(vect_analyze_slp_instance): During basic block SLP, recurse on
	subgroups if vect_build_slp_tree fails after 1st vector.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
	* gcc.dg/vect/bb-slp-subgroups-1.c: New.
	* gcc.dg/vect/bb-slp-subgroups-2.c: New.
	* gcc.dg/vect/bb-slp-subgroups-3.c: New.
	* gcc.dg/vect/bb-slp-subgroups-4.c: New.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 ++--
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 ++++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c | 41 ++++++++++++++
 gcc/tree-vect-slp.c                            | 74 +++++++++++++++++++++++++-
 6 files changed, 246 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
index ab54a48..b8bef8c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
@@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
   unsigned int *pout = &out[0];
   unsigned int a0, a1, a2, a3;
 
-  /* Non isomorphic.  */
+  /* Non isomorphic, even 64-bit subgroups.  */
   a0 = *pin++ + 23;
-  a1 = *pin++ + 142;
+  a1 = *pin++ * 142;
   a2 = *pin++ + 2;
   a3 = *pin++ * 31;
-  
+
   *pout++ = a0 * x;
   *pout++ = a1 * y;
   *pout++ = a2 * x;
@@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
 
   /* Check results.  */
   if (out[0] != (in[0] + 23) * x
-      || out[1] != (in[1] + 142) * y
+      || out[1] != (in[1] * 142) * y
       || out[2] != (in[2] + 2) * x
       || out[3] != (in[3] * 31) * y)
     abort();
@@ -47,4 +47,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
-  
+
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
new file mode 100644
index 0000000..39c23c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
@@ -0,0 +1,44 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1];
+    a[2] = b[2];
+    a[3] = b[3];
+    a[4] = 0;
+    a[5] = 0;
+    a[6] = 0;
+    a[7] = 0;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  for (int i = 0; i < 4; i++)
+    if (a[i] != i+4)
+      abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != 0)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
new file mode 100644
index 0000000..06099fd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
@@ -0,0 +1,42 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1] + 1;
+    a[2] = b[2] * 2;
+    a[3] = b[3] + 3;
+
+    a[4] = b[4] + 4;
+    a[5] = b[5] + 4;
+    a[6] = b[6] + 4;
+    a[7] = b[7] + 4;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 1;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7))
+    abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != i + 5)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
new file mode 100644
index 0000000..13c51f3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[2] + 1;
+    a[1] = b[0] + 2;
+    a[2] = b[1] + 3;
+    a[3] = b[1] + 4;
+    a[4] = b[3] * 3;
+    a[5] = b[0] * 4;
+    a[6] = b[2] * 5;
+    a[7] = b[1] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
+      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
new file mode 100644
index 0000000..6ae9a89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0] + 1;
+    a[1] = b[1] + 2;
+    a[2] = b[2] + 3;
+    a[3] = b[3] + 4;
+    a[4] = b[0] * 3;
+    a[5] = b[2] * 4;
+    a[6] = b[4] * 5;
+    a[7] = b[6] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
+      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index cfdfc29..05d4b35 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
   body_cost_vec.release ();
 }
 
+/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
+   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
+   the first GROUP1_SIZE stmts, since stores are consecutive), the second
+   containing the remainder.
+   Return the first stmt in the second group.  */
+
+static gimple *
+vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
+{
+  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
+  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
+  gcc_assert (group1_size > 0);
+  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
+  gcc_assert (group2_size > 0);
+  GROUP_SIZE (first_vinfo) = group1_size;
+
+  gimple *stmt = first_stmt;
+  for (unsigned i = group1_size; i > 1; i--)
+    {
+      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+  /* STMT is now the last element of the first group.  */
+  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
+
+  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
+  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+
+  /* For the second group, the GROUP_GAP is that before the original group,
+     plus skipping over the first vector.  */
+  GROUP_GAP (vinfo_for_stmt (group2)) =
+    GROUP_GAP (first_vinfo) + group1_size;
+
+  /* GROUP_GAP of the first group now has to skip over the second group too.  */
+  GROUP_GAP (first_vinfo) += group2_size;
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
+		     group1_size, group2_size);
+
+  return group2;
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
   tree vectype, scalar_type = NULL_TREE;
   gimple *next;
   unsigned int vectorization_factor = 0;
-  int i;
+  unsigned int i;
   unsigned int max_nunits = 0;
   vec<slp_tree> loads;
   struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
@@ -1811,6 +1859,30 @@ vect_analyze_slp_instance (vec_info *vinfo,
   vect_free_slp_tree (node);
   loads.release ();
 
+  /* For basic block SLP, try to break the group up into multiples of the
+     vectorization factor.  */
+  if (is_a <bb_vec_info> (vinfo)
+      && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
+      && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
+    {
+      /* We consider breaking the group only on VF boundaries from the existing
+	 start.  */
+      for (i = 0; i < group_size; i++)
+	if (!matches[i]) break;
+
+      if (i >= vectorization_factor && i < group_size)
+	{
+	  /* Split into two groups at the first vector boundary before i.  */
+	  gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
+	  i &= ~(vectorization_factor - 1);
+	  gimple *grp2start = vect_split_slp_store_group (stmt, i);
+	  return vect_analyze_slp_instance (vinfo, stmt, max_tree_size)
+		 | vect_analyze_slp_instance (vinfo, grp2start, max_tree_size);
+	 }
+      /* Even though the first vector did not all match, we might be able to SLP
+	 (some) of the remainder.  FORNOW ignore this possibility.  */
+    }
+
   return false;
 }
 
-- 
1.9.1
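For readers less familiar with the grouped-store representation, the bookkeeping that vect_split_slp_store_group performs can be modelled in a few lines. This is a toy sketch in plain Python, not the GCC stmt_vec_info API; the field names merely mirror the GROUP_* accessors used in the patch:

```python
class Stmt:
    """Toy stand-in for stmt_vec_info: each grouped store records its
    group leader, the group size (meaningful on the leader only), the
    next element in the chain, and the gap before its access."""
    def __init__(self):
        self.first = None   # models GROUP_FIRST_ELEMENT
        self.size = 0       # models GROUP_SIZE
        self.next = None    # models GROUP_NEXT_ELEMENT
        self.gap = 1        # models GROUP_GAP

def make_group(n):
    """Build a chain of n consecutive stores (gap 1 between elements,
    no gap before the leader)."""
    stmts = [Stmt() for _ in range(n)]
    for i, s in enumerate(stmts):
        s.first = stmts[0]
        s.next = stmts[i + 1] if i + 1 < n else None
    stmts[0].size = n
    stmts[0].gap = 0
    return stmts

def split_store_group(first, group1_size):
    """Mirror the patch's split: cut the chain after group1_size stmts,
    then fix GROUP_SIZE and GROUP_GAP so each subgroup still accounts
    for the whole original access range."""
    group2_size = first.size - group1_size
    first.size = group1_size
    stmt = first
    for _ in range(group1_size - 1):
        stmt = stmt.next
    group2 = stmt.next          # first stmt of the second group
    stmt.next = None            # terminate the first group
    group2.size = group2_size
    s = group2
    while s:
        s.first = group2        # re-point leaders in the second group
        s = s.next
    # The second group starts group1_size elements past the original
    # start, and the first group must now skip over the second's tail.
    group2.gap = first.gap + group1_size
    first.gap += group2_size
    return group2

g = make_group(8)
g2 = split_store_group(g[0], 4)
print(g[0].size, g[0].gap, g2.size, g2.gap)   # -> 4 4 4 4
```

Splitting an 8-store group at element 4 leaves two 4-element subgroups, each with a gap of 4 covering the other subgroup's slots, matching the comments in the patch.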

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] PR/67682, break SLP groups up if only some elements match
  2015-11-09 14:59           ` Alan Lawrence
@ 2015-11-10 12:45             ` Richard Biener
  2015-11-10 12:51               ` Richard Biener
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Biener @ 2015-11-10 12:45 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: GCC Patches

On Mon, Nov 9, 2015 at 3:59 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> On 06/11/15 12:55, Richard Biener wrote:
>>
>>> +  /* GROUP_GAP of the first group now has to skip over the second group too.  */
>>> +  GROUP_GAP (first_vinfo) += group2_size;
>>
>> Please add a MSG_NOTE debug printf stating that we split the group and
>> at which element.
>
> Done.
>
>> I think you want to add && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt))
>> otherwise this could be SLP reductions where there is no way the split
>> group would enable vectorization.
>
> Ah, I had thought that the (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))) check
> sufficed for that, as per e.g. the section above
>   /* Create a node (a root of the SLP tree) for the packed grouped stores. */
> But done.
>
>> Note that BB vectorization now also very aggressively falls back to considering
>> non-matches being "external".
>>
>> Not sure why that doesn't trigger for your testcases ;)
>
> I tested against trunk r229944, on which all of my scan-tree-dump's were failing....
>
>> I'm comfortable with the i < group_size half of the patch.  For the other piece
>> I'd like to see more compile-time / success data from, say, building
>> SPEC CPU 2006.
>
> Well, as a couple of quick data points, a compile of SPEC2000 on
> aarch64-none-linux-gnu (-Ofast -fomit-frame-pointer -mcpu=cortex-a57), I have:
>
> 3080 successes without patch;
> +79 successes from the "i < vectorization_factor" part of the patch (on its own)
> +90 successes from the (i>=vectorization_factor) && "i < group_size" part (on its own)
> +(79 from first) +(90 from second) + (an additional 62) from both parts together.
>
> And for SPEC2006, aarch64-linux-gnu (-O3 -fomit-frame-pointer -mcpu=cortex-a57):
> 11979 successes without patch;
> + 499 from the "i < vectorization_factor" part
> + 264 from the "(i >= vectorization_factor) && (i < group_size)" part
> + extra 336 if both parts combined.

Thanks

> I haven't done any significant measurements of compile-time yet.
>
> (snipping this bit out-of-order)
>> Hmm. This is of course pretty bad for compile-time for the non-matching
>> case as that effectively will always split into N pieces if we feed it
>> garbage (that is, without being sure that at least one piece _will_ vectorize).
>>
>> OTOH with the current way of computing "matches" we'd restrict ourselves
>> to cases where the first stmt we look at (and match to) happens to be
>> the operation that in the end will form a matching group.
> ...
>> Eventually we'd want to improve the "matches" return
>> to include the affected stmts (ok, that'll be not very easy) so we can do
>> a cheap "if we split here will it fix that particular mismatch" check.
>
> Yes, I think there are a bunch of things we can do here, that would be more
> powerful than the simple approach I used here. The biggest limiting factor will
> probably be (lack of) permutation, i.e. if we only SLP stores to consecutive
> addresses.

Yeah, doing anything else requires us to use masked or strided stores though.

>> So, please split the patch and I suggest to leave the controversial part
>> for next stage1 together with some improvement on the SLP build process
>> itself?
>
> Here's a reduced version with just the second case,
> bootstrapped+check-gcc/g++ on x86_64.
>
> gcc/ChangeLog:
>
>         * tree-vect-slp.c (vect_split_slp_store_group): New.
>         (vect_analyze_slp_instance): During basic block SLP, recurse on
>         subgroups if vect_build_slp_tree fails after 1st vector.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
>         * gcc.dg/vect/bb-slp-subgroups-1.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-2.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-3.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-4.c: New.
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 ++--
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 ++++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c | 41 ++++++++++++++
>  gcc/tree-vect-slp.c                            | 74 +++++++++++++++++++++++++-
>  6 files changed, 246 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> index ab54a48..b8bef8c 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
>    unsigned int *pout = &out[0];
>    unsigned int a0, a1, a2, a3;
>
> -  /* Non isomorphic.  */
> +  /* Non isomorphic, even 64-bit subgroups.  */
>    a0 = *pin++ + 23;
> -  a1 = *pin++ + 142;
> +  a1 = *pin++ * 142;
>    a2 = *pin++ + 2;
>    a3 = *pin++ * 31;
> -
> +
>    *pout++ = a0 * x;
>    *pout++ = a1 * y;
>    *pout++ = a2 * x;
> @@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
>
>    /* Check results.  */
>    if (out[0] != (in[0] + 23) * x
> -      || out[1] != (in[1] + 142) * y
> +      || out[1] != (in[1] * 142) * y
>        || out[2] != (in[2] + 2) * x
>        || out[3] != (in[3] * 31) * y)
>      abort();
> @@ -47,4 +47,4 @@ int main (void)
>  }
>
>  /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
> -
> +
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> new file mode 100644
> index 0000000..39c23c3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> @@ -0,0 +1,44 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[4];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0];
> +    a[1] = b[1];
> +    a[2] = b[2];
> +    a[3] = b[3];
> +    a[4] = 0;
> +    a[5] = 0;
> +    a[6] = 0;
> +    a[7] = 0;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 4; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  for (int i = 0; i < 4; i++)
> +    if (a[i] != i+4)
> +      abort ();
> +  for (int i = 4; i < 8; i++)
> +    if (a[i] != 0)
> +      abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> new file mode 100644
> index 0000000..06099fd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> @@ -0,0 +1,42 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[8];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0];
> +    a[1] = b[1] + 1;
> +    a[2] = b[2] * 2;
> +    a[3] = b[3] + 3;
> +
> +    a[4] = b[4] + 4;
> +    a[5] = b[5] + 4;
> +    a[6] = b[6] + 4;
> +    a[7] = b[7] + 4;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    b[i] = i + 1;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7))
> +    abort ();
> +  for (int i = 4; i < 8; i++)
> +    if (a[i] != i + 5)
> +      abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> new file mode 100644
> index 0000000..13c51f3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> @@ -0,0 +1,41 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[4];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[2] + 1;
> +    a[1] = b[0] + 2;
> +    a[2] = b[1] + 3;
> +    a[3] = b[1] + 4;
> +    a[4] = b[3] * 3;
> +    a[5] = b[0] * 4;
> +    a[6] = b[2] * 5;
> +    a[7] = b[1] * 7;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 4; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
> +      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
> +    abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
> new file mode 100644
> index 0000000..6ae9a89
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
> @@ -0,0 +1,41 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[8];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0] + 1;
> +    a[1] = b[1] + 2;
> +    a[2] = b[2] + 3;
> +    a[3] = b[3] + 4;
> +    a[4] = b[0] * 3;
> +    a[5] = b[2] * 4;
> +    a[6] = b[4] * 5;
> +    a[7] = b[6] * 7;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 8; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
> +      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
> +    abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> index cfdfc29..05d4b35 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
>    body_cost_vec.release ();
>  }
>
> +/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
> +   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
> +   the first GROUP1_SIZE stmts, since stores are consecutive), the second
> +   containing the remainder.
> +   Return the first stmt in the second group.  */
> +
> +static gimple *
> +vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
> +{
> +  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
> +  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
> +  gcc_assert (group1_size > 0);
> +  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
> +  gcc_assert (group2_size > 0);
> +  GROUP_SIZE (first_vinfo) = group1_size;
> +
> +  gimple *stmt = first_stmt;
> +  for (unsigned i = group1_size; i > 1; i--)
> +    {
> +      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
> +      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
> +    }
> +  /* STMT is now the last element of the first group.  */
> +  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
> +  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
> +
> +  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
> +  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
> +    {
> +      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
> +      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
> +    }
> +
> +  /* For the second group, the GROUP_GAP is that before the original group,
> +     plus skipping over the first vector.  */
> +  GROUP_GAP (vinfo_for_stmt (group2)) =
> +    GROUP_GAP (first_vinfo) + group1_size;
> +
> +  /* GROUP_GAP of the first group now has to skip over the second group too.  */
> +  GROUP_GAP (first_vinfo) += group2_size;
> +
> +  if (dump_enabled_p ())
> +    dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
> +                    group1_size, group2_size);
> +
> +  return group2;
> +}
> +
>  /* Analyze an SLP instance starting from a group of grouped stores.  Call
>     vect_build_slp_tree to build a tree of packed stmts if possible.
>     Return FALSE if it's impossible to SLP any stmt in the loop.  */
> @@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
>    tree vectype, scalar_type = NULL_TREE;
>    gimple *next;
>    unsigned int vectorization_factor = 0;
> -  int i;
> +  unsigned int i;
>    unsigned int max_nunits = 0;
>    vec<slp_tree> loads;
>    struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
> @@ -1811,6 +1859,30 @@ vect_analyze_slp_instance (vec_info *vinfo,
>    vect_free_slp_tree (node);
>    loads.release ();
>
> +  /* For basic block SLP, try to break the group up into multiples of the
> +     vectorization factor.  */
> +  if (is_a <bb_vec_info> (vinfo)
> +      && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
> +      && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
> +    {
> +      /* We consider breaking the group only on VF boundaries from the existing
> +        start.  */
> +      for (i = 0; i < group_size; i++)
> +       if (!matches[i]) break;
> +
> +      if (i >= vectorization_factor && i < group_size)
> +       {
> +         /* Split into two groups at the first vector boundary before i.  */
> +         gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);

Just noticing this... if we have a vectorization factor of 4 and matches
is 1, 1, 1, 1,  1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1, 1 and
1, 1, 0, 0, 0, ... where we know from the matches that it will again fail?

Thus shouldn't we either split only if i % vectorization_factor is 0, or,
if not, split "twice", dropping the intermediate surely non-matching
group of vectorization_factor size?  After all if we split like with the
patch then the upper half will _not_ be split again with the
simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again).

So I expect that the statistics will be the same if we restrict splitting
to the i % vectorization_factor == 0 case, or rather split where we do
now but only re-analyze group2 if i % vectorization_factor == 0 holds?
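The boundary arithmetic under discussion can be sketched as follows; this is a plain Python model of the heuristic, not GCC internals, and the `matches`/`vf` names are illustrative:

```python
def split_points(matches, vf):
    """Model of the splitting heuristic: find the first mismatch,
    round it down to a vectorization-factor boundary, and report
    whether re-analyzing the second subgroup can possibly succeed."""
    assert vf & (vf - 1) == 0              # VF is a power of two
    i = next((k for k, m in enumerate(matches) if not m), len(matches))
    if i < vf or i == len(matches):
        return None                        # nothing useful to split off
    group1 = i & ~(vf - 1)                 # i rounded down to a VF multiple
    # Only when the first mismatch falls exactly on a VF boundary does
    # the second group start "clean"; otherwise it begins with leftover
    # matches followed by the known mismatch, so it will fail again.
    retry_group2 = (i % vf == 0)
    return group1, retry_group2

# VF = 4, six matches then six mismatches: split at element 4, but the
# second group starts 1, 1, 0, ... so re-analysis is known to fail.
print(split_points([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], 4))   # -> (4, False)
```

With `matches` = 1, 1, 1, 1, 0, ... the split lands exactly on the boundary and the second group is worth re-analyzing, which is the `i % vectorization_factor == 0` restriction suggested above.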

Ok with that change.  Improvements on that incrementally.

Thanks,
Richard.

> +         i &= ~(vectorization_factor - 1);
> +         gimple *grp2start = vect_split_slp_store_group (stmt, i);
> +         return vect_analyze_slp_instance (vinfo, stmt, max_tree_size)
> +                | vect_analyze_slp_instance (vinfo, grp2start, max_tree_size);
> +        }
> +      /* Even though the first vector did not all match, we might be able to SLP
> +        (some) of the remainder.  FORNOW ignore this possibility.  */
> +    }
> +
>    return false;
>  }
>
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] PR/67682, break SLP groups up if only some elements match
  2015-11-10 12:45             ` Richard Biener
@ 2015-11-10 12:51               ` Richard Biener
  2015-11-13 16:19                 ` Alan Lawrence
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Biener @ 2015-11-10 12:51 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: GCC Patches

On Tue, Nov 10, 2015 at 1:45 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Mon, Nov 9, 2015 at 3:59 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>> On 06/11/15 12:55, Richard Biener wrote:
>>>
>>>> +  /* GROUP_GAP of the first group now has to skip over the second group too.  */
>>>> +  GROUP_GAP (first_vinfo) += group2_size;
>>>
>>> Please add a MSG_NOTE debug printf stating that we split the group and
>>> at which element.
>>
>> Done.
>>
>>> I think you want to add && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt))
>>> otherwise this could be SLP reductions where there is no way the split
>>> group would enable vectorization.
>>
>> Ah, I had thought that the (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))) check
>> sufficed for that, as per e.g. the section above
>>   /* Create a node (a root of the SLP tree) for the packed grouped stores. */
>> But done.
>>
>>> Note that BB vectorization now also very aggressively falls back to considering
>>> non-matches being "external".
>>>
>>> Not sure why that doesn't trigger for your testcases ;)
>>
>> I tested against trunk r229944, on which all of my scan-tree-dump's were failing....
>>
>>> I'm comfortable with the i < group_size half of the patch.  For the other piece
>>> I'd like to see more compile-time / success data from, say, building
>>> SPEC CPU 2006.
>>
>> Well, as a couple of quick data points, a compile of SPEC2000 on
>> aarch64-none-linux-gnu (-Ofast -fomit-frame-pointer -mcpu=cortex-a57), I have:
>>
>> 3080 successes without patch;
>> +79 successes from the "i < vectorization_factor" part of the patch (on its own)
>> +90 successes from the (i>=vectorization_factor) && "i < group_size" part (on its own)
>> +(79 from first) +(90 from second) + (an additional 62) from both parts together.
>>
>> And for SPEC2006, aarch64-linux-gnu (-O3 -fomit-frame-pointer -mcpu=cortex-a57):
>> 11979 successes without patch;
>> + 499 from the "i < vectorization_factor" part
>> + 264 from the "(i >= vectorization_factor) && (i < group_size)" part
>> + extra 336 if both parts combined.
>
> Thanks
>
>> I haven't done any significant measurements of compile-time yet.
>>
>> (snipping this bit out-of-order)
>>> Hmm. This is of course pretty bad for compile-time for the non-matching
>>> case as that effectively will always split into N pieces if we feed it
>>> garbage (that is, without being sure that at least one piece _will_ vectorize).
>>>
>>> OTOH with the current way of computing "matches" we'd restrict ourselves
>>> to cases where the first stmt we look at (and match to) happens to be
>>> the operation that in the end will form a matching group.
>> ...
>>> Eventually we'd want to improve the "matches" return
>>> to include the affected stmts (ok, that'll be not very easy) so we can do
>>> a cheap "if we split here will it fix that particular mismatch" check.
>>
>> Yes, I think there are a bunch of things we can do here, that would be more
>> powerful than the simple approach I used here. The biggest limiting factor will
>> probably be (lack of) permutation, i.e. if we only SLP stores to consecutive
>> addresses.
>
> Yeah, doing anything else requires us to use masked or strided stores though.
>
>>> So, please split the patch and I suggest to leave the controversial part
>>> for next stage1 together with some improvement on the SLP build process
>>> itself?
>>
>> Here's a reduced version with just the second case,
>> bootstrapped+check-gcc/g++ on x86_64.
>>
>> gcc/ChangeLog:
>>
>>         * tree-vect-slp.c (vect_split_slp_store_group): New.
>>         (vect_analyze_slp_instance): During basic block SLP, recurse on
>>         subgroups if vect_build_slp_tree fails after 1st vector.
>>
>> gcc/testsuite/ChangeLog:
>>
>>         * gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
>>         * gcc.dg/vect/bb-slp-subgroups-1.c: New.
>>         * gcc.dg/vect/bb-slp-subgroups-2.c: New.
>>         * gcc.dg/vect/bb-slp-subgroups-3.c: New.
>>         * gcc.dg/vect/bb-slp-subgroups-4.c: New.
>> ---
>>  gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 ++--
>>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++++
>>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++++
>>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 ++++++++++++++
>>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c | 41 ++++++++++++++
>>  gcc/tree-vect-slp.c                            | 74 +++++++++++++++++++++++++-
>>  6 files changed, 246 insertions(+), 6 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
>>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
>>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
>>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
>>
>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
>> index ab54a48..b8bef8c 100644
>> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
>> @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
>>    unsigned int *pout = &out[0];
>>    unsigned int a0, a1, a2, a3;
>>
>> -  /* Non isomorphic.  */
>> +  /* Non isomorphic, even 64-bit subgroups.  */
>>    a0 = *pin++ + 23;
>> -  a1 = *pin++ + 142;
>> +  a1 = *pin++ * 142;
>>    a2 = *pin++ + 2;
>>    a3 = *pin++ * 31;
>> -
>> +
>>    *pout++ = a0 * x;
>>    *pout++ = a1 * y;
>>    *pout++ = a2 * x;
>> @@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
>>
>>    /* Check results.  */
>>    if (out[0] != (in[0] + 23) * x
>> -      || out[1] != (in[1] + 142) * y
>> +      || out[1] != (in[1] * 142) * y
>>        || out[2] != (in[2] + 2) * x
>>        || out[3] != (in[3] * 31) * y)
>>      abort();
>> @@ -47,4 +47,4 @@ int main (void)
>>  }
>>
>>  /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
>> -
>> +
>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
>> new file mode 100644
>> index 0000000..39c23c3
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
>> @@ -0,0 +1,44 @@
>> +/* { dg-require-effective-target vect_int } */
>> +/* PR tree-optimization/67682.  */
>> +
>> +#include "tree-vect.h"
>> +
>> +int __attribute__((__aligned__(8))) a[8];
>> +int __attribute__((__aligned__(8))) b[4];
>> +
>> +__attribute__ ((noinline)) void
>> +test ()
>> +{
>> +    a[0] = b[0];
>> +    a[1] = b[1];
>> +    a[2] = b[2];
>> +    a[3] = b[3];
>> +    a[4] = 0;
>> +    a[5] = 0;
>> +    a[6] = 0;
>> +    a[7] = 0;
>> +}
>> +
>> +int
>> +main (int argc, char **argv)
>> +{
>> +  check_vect ();
>> +
>> +  for (int i = 0; i < 8; i++)
>> +    a[i] = 1;
>> +  for (int i = 0; i < 4; i++)
>> +    b[i] = i + 4;
>> +  __asm__ volatile ("" : : : "memory");
>> +  test (a, b);
>> +  __asm__ volatile ("" : : : "memory");
>> +  for (int i = 0; i < 4; i++)
>> +    if (a[i] != i+4)
>> +      abort ();
>> +  for (int i = 4; i < 8; i++)
>> +    if (a[i] != 0)
>> +      abort ();
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
>> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
>> new file mode 100644
>> index 0000000..06099fd
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
>> @@ -0,0 +1,42 @@
>> +/* { dg-require-effective-target vect_int } */
>> +/* PR tree-optimization/67682.  */
>> +
>> +#include "tree-vect.h"
>> +
>> +int __attribute__((__aligned__(8))) a[8];
>> +int __attribute__((__aligned__(8))) b[8];
>> +
>> +__attribute__ ((noinline)) void
>> +test ()
>> +{
>> +    a[0] = b[0];
>> +    a[1] = b[1] + 1;
>> +    a[2] = b[2] * 2;
>> +    a[3] = b[3] + 3;
>> +
>> +    a[4] = b[4] + 4;
>> +    a[5] = b[5] + 4;
>> +    a[6] = b[6] + 4;
>> +    a[7] = b[7] + 4;
>> +}
>> +
>> +int
>> +main (int argc, char **argv)
>> +{
>> +  check_vect ();
>> +
>> +  for (int i = 0; i < 8; i++)
>> +    b[i] = i + 1;
>> +  __asm__ volatile ("" : : : "memory");
>> +  test (a, b);
>> +  __asm__ volatile ("" : : : "memory");
>> +  if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7))
>> +    abort ();
>> +  for (int i = 4; i < 8; i++)
>> +    if (a[i] != i + 5)
>> +      abort ();
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
>> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
>> new file mode 100644
>> index 0000000..13c51f3
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
>> @@ -0,0 +1,41 @@
>> +/* { dg-require-effective-target vect_int } */
>> +/* PR tree-optimization/67682.  */
>> +
>> +#include "tree-vect.h"
>> +
>> +int __attribute__((__aligned__(8))) a[8];
>> +int __attribute__((__aligned__(8))) b[4];
>> +
>> +__attribute__ ((noinline)) void
>> +test ()
>> +{
>> +    a[0] = b[2] + 1;
>> +    a[1] = b[0] + 2;
>> +    a[2] = b[1] + 3;
>> +    a[3] = b[1] + 4;
>> +    a[4] = b[3] * 3;
>> +    a[5] = b[0] * 4;
>> +    a[6] = b[2] * 5;
>> +    a[7] = b[1] * 7;
>> +}
>> +
>> +int
>> +main (int argc, char **argv)
>> +{
>> +  check_vect ();
>> +
>> +  for (int i = 0; i < 8; i++)
>> +    a[i] = 1;
>> +  for (int i = 0; i < 4; i++)
>> +    b[i] = i + 4;
>> +  __asm__ volatile ("" : : : "memory");
>> +  test (a, b);
>> +  __asm__ volatile ("" : : : "memory");
>> +  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
>> +      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
>> +    abort ();
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
>> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
>> new file mode 100644
>> index 0000000..6ae9a89
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
>> @@ -0,0 +1,41 @@
>> +/* { dg-require-effective-target vect_int } */
>> +/* PR tree-optimization/67682.  */
>> +
>> +#include "tree-vect.h"
>> +
>> +int __attribute__((__aligned__(8))) a[8];
>> +int __attribute__((__aligned__(8))) b[8];
>> +
>> +__attribute__ ((noinline)) void
>> +test ()
>> +{
>> +    a[0] = b[0] + 1;
>> +    a[1] = b[1] + 2;
>> +    a[2] = b[2] + 3;
>> +    a[3] = b[3] + 4;
>> +    a[4] = b[0] * 3;
>> +    a[5] = b[2] * 4;
>> +    a[6] = b[4] * 5;
>> +    a[7] = b[6] * 7;
>> +}
>> +
>> +int
>> +main (int argc, char **argv)
>> +{
>> +  check_vect ();
>> +
>> +  for (int i = 0; i < 8; i++)
>> +    a[i] = 1;
>> +  for (int i = 0; i < 8; i++)
>> +    b[i] = i + 4;
>> +  __asm__ volatile ("" : : : "memory");
>> +  test (a, b);
>> +  __asm__ volatile ("" : : : "memory");
>> +  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
>> +      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
>> +    abort ();
>> +  return 0;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
>> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
>> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
>> index cfdfc29..05d4b35 100644
>> --- a/gcc/tree-vect-slp.c
>> +++ b/gcc/tree-vect-slp.c
>> @@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
>>    body_cost_vec.release ();
>>  }
>>
>> +/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
>> +   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
>> +   the first GROUP1_SIZE stmts, since stores are consecutive), the second
>> +   containing the remainder.
>> +   Return the first stmt in the second group.  */
>> +
>> +static gimple *
>> +vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
>> +{
>> +  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
>> +  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
>> +  gcc_assert (group1_size > 0);
>> +  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
>> +  gcc_assert (group2_size > 0);
>> +  GROUP_SIZE (first_vinfo) = group1_size;
>> +
>> +  gimple *stmt = first_stmt;
>> +  for (unsigned i = group1_size; i > 1; i--)
>> +    {
>> +      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
>> +      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
>> +    }
>> +  /* STMT is now the last element of the first group.  */
>> +  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
>> +  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
>> +
>> +  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
>> +  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
>> +    {
>> +      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
>> +      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
>> +    }
>> +
>> +  /* For the second group, the GROUP_GAP is that before the original group,
>> +     plus skipping over the first vector.  */
>> +  GROUP_GAP (vinfo_for_stmt (group2)) =
>> +    GROUP_GAP (first_vinfo) + group1_size;
>> +
>> +  /* GROUP_GAP of the first group now has to skip over the second group too.  */
>> +  GROUP_GAP (first_vinfo) += group2_size;
>> +
>> +  if (dump_enabled_p ())
>> +    dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
>> +                    group1_size, group2_size);
>> +
>> +  return group2;
>> +}
>> +
>>  /* Analyze an SLP instance starting from a group of grouped stores.  Call
>>     vect_build_slp_tree to build a tree of packed stmts if possible.
>>     Return FALSE if it's impossible to SLP any stmt in the loop.  */
>> @@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
>>    tree vectype, scalar_type = NULL_TREE;
>>    gimple *next;
>>    unsigned int vectorization_factor = 0;
>> -  int i;
>> +  unsigned int i;
>>    unsigned int max_nunits = 0;
>>    vec<slp_tree> loads;
>>    struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
>> @@ -1811,6 +1859,30 @@ vect_analyze_slp_instance (vec_info *vinfo,
>>    vect_free_slp_tree (node);
>>    loads.release ();
>>
>> +  /* For basic block SLP, try to break the group up into multiples of the
>> +     vectorization factor.  */
>> +  if (is_a <bb_vec_info> (vinfo)
>> +      && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
>> +      && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
>> +    {
>> +      /* We consider breaking the group only on VF boundaries from the existing
>> +        start.  */
>> +      for (i = 0; i < group_size; i++)
>> +       if (!matches[i]) break;
>> +
>> +      if (i >= vectorization_factor && i < group_size)
>> +       {
>> +         /* Split into two groups at the first vector boundary before i.  */
>> +         gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
>
> Just noticing this... if we have a vectorization factor of 4 and matches
> is 1, 1, 1, 1,  1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1, 1 and
> 1, 1, 0, 0, 0, ... where we know from the matches that it will again fail?
>
> Thus shouldn't we split either only if i % vectorization_factor is 0 or
> if not, split "twice", dropping the intermediate surely non-matching
> group of vectorization_factor size?  After all if we split like with the
> patch then the upper half will _not_ be split again with the
> simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again).
>
> So I expect that the statistics will be the same if we restrict splitting
> to the i % vectorization_factor == 0 case, or rather split where we do
> now but only re-analyze group2 if i % vectorization_factor == 0 holds?
>
> Ok with that change.  Improvements on that incrementally.

Btw, it just occurs to me that the whole thing is equivalent to splitting
the store-group into vector-size pieces up-front?  That way we do
> the maximum splitting up-front and avoid any redundant work?

The patch is still ok as said, just the above may be a simple thing
to explore.

Thanks,
Richard.

> Thanks,
> Richard.
>
>> +         i &= ~(vectorization_factor - 1);
>> +         gimple *grp2start = vect_split_slp_store_group (stmt, i);
>> +         return vect_analyze_slp_instance (vinfo, stmt, max_tree_size)
>> +                | vect_analyze_slp_instance (vinfo, grp2start, max_tree_size);
>> +        }
>> +      /* Even though the first vector did not all match, we might be able to SLP
>> +        (some) of the remainder.  FORNOW ignore this possibility.  */
>> +    }
>> +
>>    return false;
>>  }
>>
>> --
>> 1.9.1
>>


* Re: [PATCH] PR/67682, break SLP groups up if only some elements match
  2015-11-10 12:51               ` Richard Biener
@ 2015-11-13 16:19                 ` Alan Lawrence
  2015-11-16 14:42                   ` Christophe Lyon
  0 siblings, 1 reply; 15+ messages in thread
From: Alan Lawrence @ 2015-11-13 16:19 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther

On 10/11/15 12:51, Richard Biener wrote:
>>
>> Just noticing this... if we have a vectorization factor of 4 and matches
>> is 1, 1, 1, 1,  1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1, 1 and
>> 1, 1, 0, 0, 0, ... where we know from the matches that it will again fail?
>>
>> Thus shouldn't we split either only if i % vectorization_factor is 0 or
>> if not, split "twice", dropping the intermediate surely non-matching
>> group of vectorization_factor size?  After all if we split like with the
>> patch then the upper half will _not_ be split again with the
>> simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again).
>>
>> So I expect that the statistics will be the same if we restrict splitting
>> to the i % vectorization_factor == 0 case, or rather split where we do
>> now but only re-analyze group2 if i % vectorization_factor == 0 holds?
>>
>> Ok with that change.  Improvements on that incrementally.
>
> Btw, it just occurs to me that the whole thing is equivalent to splitting
> the store-group into vector-size pieces up-front?  That way we do
>> the maximum splitting up-front and avoid any redundant work?
>
> The patch is still ok as said, just the above may be a simple thing
> to explore.

I'd refrained from splitting in vect_analyze_group_access_1, as my understanding
was that we only do that once, whereas we retry the vect_analyze_slp_instance
path each time we decrease the vectorization_factor. However, I did try putting
code at the beginning of vect_analyze_slp_instance to split up any groups > vf.
Unfortunately this loses us some previously-successful SLPs, as some bigger
groups cannot be SLPed if we split them, since they then require 'unrolling';
so I'm not addressing that here.

However, your suggestion of splitting twice when we know the boundary is in
the middle of a vector is a nice compromise; it nets us a good number of
additional successes in SPEC2000 and SPEC2006, about 7% more than without the
patch.

Hence, here's the patch I've committed, as r230330, after regstrap on x86_64
and AArch64. (I dropped the previous bb-slp-subgroups-2 and renumbered the
others, as we don't do that one anymore.)

Cheers, Alan

gcc/ChangeLog:

	PR tree-optimization/67682
	* tree-vect-slp.c (vect_split_slp_store_group): New.
	(vect_analyze_slp_instance): During basic block SLP, recurse on
	subgroups if vect_build_slp_tree fails after 1st vector.

gcc/testsuite/ChangeLog:

	PR tree-optimization/67682
	* gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
	* gcc.dg/vect/bb-slp-subgroups-1.c: New.
	* gcc.dg/vect/bb-slp-subgroups-2.c: New.
	* gcc.dg/vect/bb-slp-subgroups-3.c: New.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 +--
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 41 +++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 +++++++++++++
 gcc/tree-vect-slp.c                            | 85 +++++++++++++++++++++++++-
 5 files changed, 215 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
index ab54a48..b8bef8c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
@@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
   unsigned int *pout = &out[0];
   unsigned int a0, a1, a2, a3;
 
-  /* Non isomorphic.  */
+  /* Non isomorphic, even 64-bit subgroups.  */
   a0 = *pin++ + 23;
-  a1 = *pin++ + 142;
+  a1 = *pin++ * 142;
   a2 = *pin++ + 2;
   a3 = *pin++ * 31;
-  
+
   *pout++ = a0 * x;
   *pout++ = a1 * y;
   *pout++ = a2 * x;
@@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
 
   /* Check results.  */
   if (out[0] != (in[0] + 23) * x
-      || out[1] != (in[1] + 142) * y
+      || out[1] != (in[1] * 142) * y
       || out[2] != (in[2] + 2) * x
       || out[3] != (in[3] * 31) * y)
     abort();
@@ -47,4 +47,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
-  
+
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
new file mode 100644
index 0000000..39c23c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
@@ -0,0 +1,44 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1];
+    a[2] = b[2];
+    a[3] = b[3];
+    a[4] = 0;
+    a[5] = 0;
+    a[6] = 0;
+    a[7] = 0;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  for (int i = 0; i < 4; i++)
+    if (a[i] != i+4)
+      abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != 0)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
new file mode 100644
index 0000000..13c51f3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[2] + 1;
+    a[1] = b[0] + 2;
+    a[2] = b[1] + 3;
+    a[3] = b[1] + 4;
+    a[4] = b[3] * 3;
+    a[5] = b[0] * 4;
+    a[6] = b[2] * 5;
+    a[7] = b[1] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
+      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
new file mode 100644
index 0000000..6ae9a89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0] + 1;
+    a[1] = b[1] + 2;
+    a[2] = b[2] + 3;
+    a[3] = b[3] + 4;
+    a[4] = b[0] * 3;
+    a[5] = b[2] * 4;
+    a[6] = b[4] * 5;
+    a[7] = b[6] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
+      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index cfdfc29..65a183f 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
   body_cost_vec.release ();
 }
 
+/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
+   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
+   the first GROUP1_SIZE stmts, since stores are consecutive), the second
+   containing the remainder.
+   Return the first stmt in the second group.  */
+
+static gimple *
+vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
+{
+  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
+  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
+  gcc_assert (group1_size > 0);
+  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
+  gcc_assert (group2_size > 0);
+  GROUP_SIZE (first_vinfo) = group1_size;
+
+  gimple *stmt = first_stmt;
+  for (unsigned i = group1_size; i > 1; i--)
+    {
+      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+  /* STMT is now the last element of the first group.  */
+  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
+
+  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
+  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+
+  /* For the second group, the GROUP_GAP is that before the original group,
+     plus skipping over the first vector.  */
+  GROUP_GAP (vinfo_for_stmt (group2)) =
+    GROUP_GAP (first_vinfo) + group1_size;
+
+  /* GROUP_GAP of the first group now has to skip over the second group too.  */
+  GROUP_GAP (first_vinfo) += group2_size;
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
+		     group1_size, group2_size);
+
+  return group2;
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
   tree vectype, scalar_type = NULL_TREE;
   gimple *next;
   unsigned int vectorization_factor = 0;
-  int i;
+  unsigned int i;
   unsigned int max_nunits = 0;
   vec<slp_tree> loads;
   struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
@@ -1811,6 +1859,41 @@ vect_analyze_slp_instance (vec_info *vinfo,
   vect_free_slp_tree (node);
   loads.release ();
 
+  /* For basic block SLP, try to break the group up into multiples of the
+     vectorization factor.  */
+  if (is_a <bb_vec_info> (vinfo)
+      && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
+      && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
+    {
+      /* We consider breaking the group only on VF boundaries from the existing
+	 start.  */
+      for (i = 0; i < group_size; i++)
+	if (!matches[i]) break;
+
+      if (i >= vectorization_factor && i < group_size)
+	{
+	  /* Split into two groups at the first vector boundary before i.  */
+	  gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
+	  unsigned group1_size = i & ~(vectorization_factor - 1);
+
+	  gimple *rest = vect_split_slp_store_group (stmt, group1_size);
+	  bool res = vect_analyze_slp_instance (vinfo, stmt, max_tree_size);
+	  /* If the first non-match was in the middle of a vector,
+	     skip the rest of that vector.  */
+	  if (group1_size < i)
+	    {
+	      i = group1_size + vectorization_factor;
+	      if (i < group_size)
+		rest = vect_split_slp_store_group (rest, vectorization_factor);
+	    }
+	  if (i < group_size)
+	    res |= vect_analyze_slp_instance (vinfo, rest, max_tree_size);
+	  return res;
+	}
+      /* Even though the first vector did not all match, we might be able to SLP
+	 (some) of the remainder.  FORNOW ignore this possibility.  */
+    }
+
   return false;
 }
 
-- 
1.9.1


* Re: [PATCH] PR/67682, break SLP groups up if only some elements match
  2015-11-13 16:19                 ` Alan Lawrence
@ 2015-11-16 14:42                   ` Christophe Lyon
  2015-11-17 16:47                     ` Alan Lawrence
  0 siblings, 1 reply; 15+ messages in thread
From: Christophe Lyon @ 2015-11-16 14:42 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches, Richard Biener

On 13 November 2015 at 17:18, Alan Lawrence <alan.lawrence@arm.com> wrote:
> On 10/11/15 12:51, Richard Biener wrote:
>>>
>>> Just noticing this... if we have a vectorization factor of 4 and matches
>>> is 1, 1, 1, 1,  1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1, 1 and
>>> 1, 1, 0, 0, 0, ... where we know from the matches that it will again fail?
>>>
>>> Thus shouldn't we split either only if i % vectorization_factor is 0 or
>>> if not, split "twice", dropping the intermediate surely non-matching
>>> group of vectorization_factor size?  After all if we split like with the
>>> patch then the upper half will _not_ be split again with the
>>> simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again).
>>>
>>> So I expect that the statistics will be the same if we restrict splitting
>>> to the i % vectorization_factor == 0 case, or rather split where we do
>>> now but only re-analyze group2 if i % vectorization_factor == 0 holds?
>>>
>>> Ok with that change.  Improvements on that incrementally.
>>
>> Btw, it just occurs to me that the whole thing is equivalent to splitting
>> the store-group into vector-size pieces up-front?  That way we do
>> the maximum splitting up-front and avoid any redundant work?
>>
>> The patch is still ok as said, just the above may be a simple thing
>> to explore.
>
> I'd refrained from splitting in vect_analyze_group_access_1, as my understanding
> was that we only do that once, whereas we retry the vect_analyze_slp_instance
> path each time we decrease the vectorization_factor. However, I did try putting
> code at the beginning of vect_analyze_slp_instance to split up any groups > vf.
> Unfortunately this loses us some previously-successful SLPs, as some bigger
> groups cannot be SLPed if we split them, since they then require 'unrolling';
> so I'm not addressing that here.
>
> However, your suggestion of splitting twice when we know the boundary is in
> the middle of a vector is a nice compromise; it nets us a good number of
> additional successes in SPEC2000 and SPEC2006, about 7% more than without the
> patch.
>
> Hence, here's the patch I've committed, as r230330, after regstrap on x86_64
> and AArch64. (I dropped the previous bb-slp-subgroups-2 and renumbered the
> others, as we don't do that one anymore.)
>
> Cheers, Alan
>
> gcc/ChangeLog:
>
>         PR tree-optimization/67682
>         * tree-vect-slp.c (vect_split_slp_store_group): New.
>         (vect_analyze_slp_instance): During basic block SLP, recurse on
>         subgroups if vect_build_slp_tree fails after 1st vector.
>
> gcc/testsuite/ChangeLog:
>
>         PR tree-optimization/67682
>         * gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
>         * gcc.dg/vect/bb-slp-subgroups-1.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-2.c: New.
>         * gcc.dg/vect/bb-slp-subgroups-3.c: New.

Hi Alan,

I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
fails for armeb targets.
I haven't had time to look at more details yet, but I guess you can
reproduce it quickly enough.



> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 +--
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 41 +++++++++++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 +++++++++++++
>  gcc/tree-vect-slp.c                            | 85 +++++++++++++++++++++++++-
>  5 files changed, 215 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> index ab54a48..b8bef8c 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
> @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
>    unsigned int *pout = &out[0];
>    unsigned int a0, a1, a2, a3;
>
> -  /* Non isomorphic.  */
> +  /* Non isomorphic, even 64-bit subgroups.  */
>    a0 = *pin++ + 23;
> -  a1 = *pin++ + 142;
> +  a1 = *pin++ * 142;
>    a2 = *pin++ + 2;
>    a3 = *pin++ * 31;
> -
> +
>    *pout++ = a0 * x;
>    *pout++ = a1 * y;
>    *pout++ = a2 * x;
> @@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
>
>    /* Check results.  */
>    if (out[0] != (in[0] + 23) * x
> -      || out[1] != (in[1] + 142) * y
> +      || out[1] != (in[1] * 142) * y
>        || out[2] != (in[2] + 2) * x
>        || out[3] != (in[3] * 31) * y)
>      abort();
> @@ -47,4 +47,4 @@ int main (void)
>  }
>
>  /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
> -
> +
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> new file mode 100644
> index 0000000..39c23c3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
> @@ -0,0 +1,44 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[4];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0];
> +    a[1] = b[1];
> +    a[2] = b[2];
> +    a[3] = b[3];
> +    a[4] = 0;
> +    a[5] = 0;
> +    a[6] = 0;
> +    a[7] = 0;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 4; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  for (int i = 0; i < 4; i++)
> +    if (a[i] != i+4)
> +      abort ();
> +  for (int i = 4; i < 8; i++)
> +    if (a[i] != 0)
> +      abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> new file mode 100644
> index 0000000..13c51f3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
> @@ -0,0 +1,41 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[4];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[2] + 1;
> +    a[1] = b[0] + 2;
> +    a[2] = b[1] + 3;
> +    a[3] = b[1] + 4;
> +    a[4] = b[3] * 3;
> +    a[5] = b[0] * 4;
> +    a[6] = b[2] * 5;
> +    a[7] = b[1] * 7;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 4; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 7) || (a[1] != 6) || (a[2] != 8) || (a[3] != 9)
> +      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
> +    abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> new file mode 100644
> index 0000000..6ae9a89
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
> @@ -0,0 +1,41 @@
> +/* { dg-require-effective-target vect_int } */
> +/* PR tree-optimization/67682.  */
> +
> +#include "tree-vect.h"
> +
> +int __attribute__((__aligned__(8))) a[8];
> +int __attribute__((__aligned__(8))) b[8];
> +
> +__attribute__ ((noinline)) void
> +test ()
> +{
> +    a[0] = b[0] + 1;
> +    a[1] = b[1] + 2;
> +    a[2] = b[2] + 3;
> +    a[3] = b[3] + 4;
> +    a[4] = b[0] * 3;
> +    a[5] = b[2] * 4;
> +    a[6] = b[4] * 5;
> +    a[7] = b[6] * 7;
> +}
> +
> +int
> +main (int argc, char **argv)
> +{
> +  check_vect ();
> +
> +  for (int i = 0; i < 8; i++)
> +    a[i] = 1;
> +  for (int i = 0; i < 8; i++)
> +    b[i] = i + 4;
> +  __asm__ volatile ("" : : : "memory");
> +  test (a, b);
> +  __asm__ volatile ("" : : : "memory");
> +  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
> +      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
> +    abort ();
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
> +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> index cfdfc29..65a183f 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
>    body_cost_vec.release ();
>  }
>
> +/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
> +   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
> +   the first GROUP1_SIZE stmts, since stores are consecutive), the second
> +   containing the remainder.
> +   Return the first stmt in the second group.  */
> +
> +static gimple *
> +vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
> +{
> +  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
> +  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
> +  gcc_assert (group1_size > 0);
> +  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
> +  gcc_assert (group2_size > 0);
> +  GROUP_SIZE (first_vinfo) = group1_size;
> +
> +  gimple *stmt = first_stmt;
> +  for (unsigned i = group1_size; i > 1; i--)
> +    {
> +      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
> +      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
> +    }
> +  /* STMT is now the last element of the first group.  */
> +  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
> +  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
> +
> +  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
> +  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
> +    {
> +      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
> +      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
> +    }
> +
> +  /* For the second group, the GROUP_GAP is that before the original group,
> +     plus skipping over the first vector.  */
> +  GROUP_GAP (vinfo_for_stmt (group2)) =
> +    GROUP_GAP (first_vinfo) + group1_size;
> +
> +  /* GROUP_GAP of the first group now has to skip over the second group too.  */
> +  GROUP_GAP (first_vinfo) += group2_size;
> +
> +  if (dump_enabled_p ())
> +    dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
> +                    group1_size, group2_size);
> +
> +  return group2;
> +}
> +
>  /* Analyze an SLP instance starting from a group of grouped stores.  Call
>     vect_build_slp_tree to build a tree of packed stmts if possible.
>     Return FALSE if it's impossible to SLP any stmt in the loop.  */
> @@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
>    tree vectype, scalar_type = NULL_TREE;
>    gimple *next;
>    unsigned int vectorization_factor = 0;
> -  int i;
> +  unsigned int i;
>    unsigned int max_nunits = 0;
>    vec<slp_tree> loads;
>    struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
> @@ -1811,6 +1859,41 @@ vect_analyze_slp_instance (vec_info *vinfo,
>    vect_free_slp_tree (node);
>    loads.release ();
>
> +  /* For basic block SLP, try to break the group up into multiples of the
> +     vectorization factor.  */
> +  if (is_a <bb_vec_info> (vinfo)
> +      && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
> +      && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
> +    {
> +      /* We consider breaking the group only on VF boundaries from the existing
> +        start.  */
> +      for (i = 0; i < group_size; i++)
> +       if (!matches[i]) break;
> +
> +      if (i >= vectorization_factor && i < group_size)
> +       {
> +         /* Split into two groups at the first vector boundary before i.  */
> +         gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
> +         unsigned group1_size = i & ~(vectorization_factor - 1);
> +
> +         gimple *rest = vect_split_slp_store_group (stmt, group1_size);
> +         bool res = vect_analyze_slp_instance (vinfo, stmt, max_tree_size);
> +         /* If the first non-match was in the middle of a vector,
> +            skip the rest of that vector.  */
> +         if (group1_size < i)
> +           {
> +             i = group1_size + vectorization_factor;
> +             if (i < group_size)
> +               rest = vect_split_slp_store_group (rest, vectorization_factor);
> +           }
> +         if (i < group_size)
> +           res |= vect_analyze_slp_instance (vinfo, rest, max_tree_size);
> +         return res;
> +       }
> +      /* Even though the first vector did not all match, we might be able to SLP
> +        (some) of the remainder.  FORNOW ignore this possibility.  */
> +    }
> +
>    return false;
>  }
>
> --
> 1.9.1
>
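
[Editorial aside, not part of the patch: the split point the patch chooses is just the index of the first non-matching stmt rounded down to a multiple of the vectorization factor, which the gcc_assert above guarantees is a power of two. A minimal standalone sketch of that computation; the function name is illustrative:]

```c
#include <assert.h>

/* Round FIRST_MISMATCH down to a multiple of VF; mirrors
   "i & ~(vectorization_factor - 1)" in the patch.  VF must be a
   power of two for the mask trick to be exact.  */
static unsigned
split_point (unsigned first_mismatch, unsigned vf)
{
  assert (vf != 0 && (vf & (vf - 1)) == 0);  /* power of two */
  return first_mismatch & ~(vf - 1);
}
```

So with VF 4, a first mismatch at stmt 5 yields a first group of 4 stmts; a mismatch inside the first vector (i < VF) yields 0, i.e. no split, which is why the patch guards with `i >= vectorization_factor`.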

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] PR/67682, break SLP groups up if only some elements match
  2015-11-16 14:42                   ` Christophe Lyon
@ 2015-11-17 16:47                     ` Alan Lawrence
  2015-11-18  9:26                       ` Richard Biener
  0 siblings, 1 reply; 15+ messages in thread
From: Alan Lawrence @ 2015-11-17 16:47 UTC (permalink / raw)
  To: Christophe Lyon
  Cc: gcc-patches, Tejas Belagod, Richard Earnshaw, Ramana Radhakrishnan

On 16/11/15 14:42, Christophe Lyon wrote:
>
> Hi Alan,
>
> I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
> fails for armeb targets.
> I haven't had time to look at more details yet, but I guess you can
> reproduce it quickly enough.
>

Thanks - yes I see it now.

-fdump-tree-optimized looks sensible:

__attribute__((noinline))
test ()
{
   vector(4) int vect__14.21;
   vector(4) int vect__2.20;
   vector(4) int vect__2.19;
   vector(4) int vect__3.13;
   vector(4) int vect__2.12;

   <bb 2>:
   vect__2.12_24 = MEM[(int *)&b];
   vect__3.13_27 = vect__2.12_24 + { 1, 2, 3, 4 };
   MEM[(int *)&a] = vect__3.13_27;
   vect__2.19_31 = MEM[(int *)&b + 16B];
   vect__2.20_33 = VEC_PERM_EXPR <vect__2.12_24, vect__2.19_31, { 0, 2, 4, 6 }>;
   vect__14.21_35 = vect__2.20_33 * { 3, 4, 5, 7 };
   MEM[(int *)&a + 16B] = vect__14.21_35;
   return;
}
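
[Editorial aside: for reference, a minimal sketch of the target-independent semantics of the VEC_PERM_EXPR in that dump, assuming 4-lane int vectors — lane i of the result is element sel[i] of the 8-lane concatenation (v0, v1), so selector {0, 2, 4, 6} should extract b[0], b[2], b[4], b[6]:]

```c
/* Generic VEC_PERM_EXPR <v0, v1, sel> semantics for 4-lane vectors.
   The selector is taken modulo twice the lane count.  */
static void
vec_perm4 (const int *v0, const int *v1, const int *sel, int *out)
{
  for (int i = 0; i < 4; i++)
    {
      int s = sel[i] & 7;
      out[i] = s < 4 ? v0[s] : v1[s - 4];
    }
}
```

With b[i] = i + 4 as in the testcase, the intended selection is therefore {4, 6, 8, 10} before the multiply by {3, 4, 5, 7}.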

but while a[0...3] end up containing 5 7 9 11 as expected,
a[4..7] end up with 30 32 30 28 rather than the expected 12 24 40 70.
That is, we end up with (10 8 6 4), rather than the expected (4 6 8 10), being 
multiplied by {3,4,5,7}. Looking at the RTL, those values come from a UZP1/2 
pair that should extract elements {0,2,4,6} of b. Assembler, with my workings as 
to what's in each register:

test:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	movw	r2, #:lower16:b
	movt	r2, #:upper16:b
	vldr	d22, .L11
	vldr	d23, .L11+8
;; So d22 = (3 4), d23 = (5 7), q11 = (5 7 3 4)
	movw	r3, #:lower16:a
	movt	r3, #:upper16:a
	vld1.64	{d16-d17}, [r2:64]
;; So d16 = (b[0] b[1]), d17 = (b[2] b[3]), q8 = (b[2] b[3] b[0] b[1])
	vmov	q9, q8  @ v4si
;; q9 = (b[2] b[3] b[0] b[1])
	vldr	d20, [r2, #16]
	vldr	d21, [r2, #24]
;; So d20 = (b[4] b[5]), d21 = (b[6] b[7]), q10 = (b[6] b[7] b[4] b[5])
	vuzp.32	q10, q9
;; So  q10 = (b[3] b[1] b[7] b[5]), i.e. d20 = (b[7] b[5]) and d21 = (b[3] b[1])
;; and q9 = (b[2] b[0] b[6] b[4]), i.e. d18 = (b[6] b[4]) and d19 = (b[2] b[0])
	vldr	d20, .L11+16
	vldr	d21, .L11+24
;; d20 = (1 2), d21 = (3 4), q10 = (3 4 1 2)
	vmul.i32	q9, q9, q11
;; q9 = (b[2]*5 b[0]*7 b[6]*3 b[4]*4)
;; i.e. d18 = (b[6]*3 b[4]*4) and d19 = (b[2]*5 b[0]*7)
	vadd.i32	q8, q8, q10
;; q8 = (b[2]+3 b[3]+4 b[0]+1 b[1]+2)
;; i.e. d16 = (b[0]+1 b[1]+2), d17 = (b[2]+3 b[3]+4)
	vst1.64	{d16-d17}, [r3:64]
;; a[0] = b[0]+1, a[1] = b[1]+2, a[2] = b[2]+3, a[3]=b[3]+4 all ok
	vstr	d18, [r3, #16]
;; a[4] = b[6]*3, a[5] = b[4]*4
	vstr	d19, [r3, #24]
;; a[6] = b[2]*5, a[7] = b[0]*7
	bx	lr
.L12:
	.align	3
.L11:
	.word	3
	.word	4
	.word	5
	.word	7
	.word	1
	.word	2
	.word	3
	.word	4

Which is to say - the element order in the q-registers is neither big- nor 
little-endian, but the elements get stored back to memory in an order consistent 
with how they were loaded, so we're OK as long as there are no permutes. 
Unfortunately UZP does not commute with this lane-ordering mixup, so it messes 
everything up...
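
[Editorial aside, a hedged sketch rather than compiler code: the mix-up can be modelled in a few lines of C. Register "slots" below use the same print order as the workings above (high d register first, memory order within each d); load_q, store_q, and uzp_even are illustrative names for the instruction behaviours traced in the disassembly:]

```c
/* A q register modelled as 4 ints in print order.  */
typedef struct { int s[4]; } qreg;

/* 64-bit loads fill d_lo then d_hi, so memory {m0,m1,m2,m3} lands in
   print-order slots as {m2,m3,m0,m1}.  */
static qreg
load_q (const int *m)
{
  return (qreg) { { m[2], m[3], m[0], m[1] } };
}

/* Stores invert the load, so a load/store round trip is correct --
   the "OK as long as there are no permutes" case.  */
static void
store_q (qreg q, int *m)
{
  m[0] = q.s[2]; m[1] = q.s[3]; m[2] = q.s[0]; m[3] = q.s[1];
}

/* The "even" half of vuzp.32, operating blindly on slot order.  */
static qreg
uzp_even (qreg a, qreg b)
{
  return (qreg) { { a.s[0], a.s[2], b.s[0], b.s[2] } };
}

/* Re-run the a[4..7] computation of bb-slp-subgroups-3.c: the intent
   is {b0,b2,b4,b6} * {3,4,5,7}, but the slot scramble breaks the uzp.  */
static void
compute_a4_to_a7 (const int *b, int *a_hi)
{
  static const int muls[4] = { 3, 4, 5, 7 };
  qreg sel = uzp_even (load_q (b), load_q (b + 4));
  qreg mul = load_q (muls);
  qreg prod;
  for (int i = 0; i < 4; i++)
    prod.s[i] = sel.s[i] * mul.s[i];
  store_q (prod, a_hi);
}
```

Running this with b[i] = i + 4 reproduces the observed 30 32 30 28 for a[4..7] rather than the expected 12 24 40 70: the permute selects b[2], b[0], b[6], b[4] instead of b[0], b[2], b[4], b[6].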

Hmmm. I'm feeling that "properly" fixing this testcase amounts to fixing 
armeb's whole register-numbering/lane-flipping scheme, which might be quite a 
large task. OTOH it might also fix the significant number of failing vectorizer 
tests. A simpler solution might be to disable some part of vector support on 
armeb, but I'm not sure which part would be best, yet.

Thoughts (CC maintainers)?

--Alan


* Re: [PATCH] PR/67682, break SLP groups up if only some elements match
  2015-11-17 16:47                     ` Alan Lawrence
@ 2015-11-18  9:26                       ` Richard Biener
  0 siblings, 0 replies; 15+ messages in thread
From: Richard Biener @ 2015-11-18  9:26 UTC (permalink / raw)
  To: Alan Lawrence
  Cc: Christophe Lyon, gcc-patches, Tejas Belagod, Richard Earnshaw,
	Ramana Radhakrishnan

On Tue, Nov 17, 2015 at 5:46 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> On 16/11/15 14:42, Christophe Lyon wrote:
>>
>>
>> Hi Alan,
>>
>> I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
>> fails for armeb targets.
>> I haven't had time to look at more details yet, but I guess you can
>> reproduce it quickly enough.
>>
>
> Thanks - yes I see it now.
>
> [-fdump-tree-optimized dump, disassembly, and lane-by-lane analysis
> quoted in full in the message above; snipped here]
>
> Thoughts (CC maintainers)?

As most of the permutation support in the arm backend is disabled
for armeb, I suppose disabling some more of it is appropriate ...

Or finally sitting down and fixing the mess.  For constant permutes,
fixing it on the constant side would be an option I guess.

Richard.

> --Alan
>


end of thread, other threads:[~2015-11-18  9:26 UTC | newest]

Thread overview: 15+ messages
2015-10-23 15:26 [PATCH] PR/67682, break SLP groups up if only some elements match Alan Lawrence
2015-10-25 11:56 ` Alan Lawrence
2015-10-25 12:31   ` Andrew Pinski
2015-10-26 15:09 ` Richard Biener
2015-10-27 17:44   ` Alan Lawrence
2015-11-03 13:39     ` Richard Biener
2015-11-05 13:34       ` Alan Lawrence
2015-11-06 12:55         ` Richard Biener
2015-11-09 14:59           ` Alan Lawrence
2015-11-10 12:45             ` Richard Biener
2015-11-10 12:51               ` Richard Biener
2015-11-13 16:19                 ` Alan Lawrence
2015-11-16 14:42                   ` Christophe Lyon
2015-11-17 16:47                     ` Alan Lawrence
2015-11-18  9:26                       ` Richard Biener
