From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-414019-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 79661 invoked by alias); 13 Nov 2015 16:19:00 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 79645 invoked by uid 89); 13 Nov 2015 16:19:00 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.7 required=5.0 tests=AWL,BAYES_00,SPF_PASS autolearn=ham version=3.3.2
X-HELO: eu-smtp-delivery-143.mimecast.com
Received: from eu-smtp-delivery-143.mimecast.com (HELO eu-smtp-delivery-143.mimecast.com) (207.82.80.143) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 13 Nov 2015 16:18:57 +0000
Received: from cam-owa2.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.140]) by eu-smtp-1.mimecast.com with ESMTP id uk-mta-16-HNfL1Me3QWWfjfSgu8sjNQ-1; Fri, 13 Nov 2015 16:18:51 +0000
Received: from arm.com ([10.1.2.79]) by cam-owa2.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959);	 Fri, 13 Nov 2015 16:18:51 +0000
From: Alan Lawrence <alan.lawrence@arm.com>
To: gcc-patches@gcc.gnu.org
Cc: richard.guenther@gmail.com
Subject: Re: [PATCH] PR/67682, break SLP groups up if only some elements match
Date: Fri, 13 Nov 2015 16:19:00 -0000
Message-Id: <1447431522-4695-1-git-send-email-alan.lawrence@arm.com>
In-Reply-To: <CAFiYyc3GXFmREtzjLP+m1LBju2okEEzDwcX6wHn2xcuSdrh4wg@mail.gmail.com>
References: <CAFiYyc3GXFmREtzjLP+m1LBju2okEEzDwcX6wHn2xcuSdrh4wg@mail.gmail.com>
X-MC-Unique: HNfL1Me3QWWfjfSgu8sjNQ-1
Content-Type: text/plain; charset=WINDOWS-1252
Content-Transfer-Encoding: quoted-printable
X-IsSubscribed: yes
X-SW-Source: 2015-11/txt/msg01735.txt.bz2

On 10/11/15 12:51, Richard Biener wrote:
>>
>> Just noticing this... if we have a vectorization factor of 4 and matches
>> is 1, 1, 1, 1,  1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1, 1 and
>> 1, 1, 0, 0, 0, ... where we know from the matches that it will again fail?
>>
>> Thus shouldn't we split either only if i % vectorization_factor is 0 or
>> if not, split "twice", dropping the intermediate surely non-matching
>> group of vectorization_factor size?  After all if we split like with the
>> patch then the upper half will _not_ be split again with the
>> simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again).
>>
>> So I expect that the statistics will be the same if we restrict splitting
>> to the i % vectorization_factor == 0 case, or rather split where we do
>> now but only re-analyze group2 if i % vectorization_factor == 0 holds?
>>
>> Ok with that change.  Improvements on that incrementally.
>
> Btw, it just occurs to me that the whole thing is equivalent to splitting
> the store-group into vector-size pieces up-front?  That way we do
> the maximum splitting up-front and avoid any redundant work?
>
> The patch is still ok as said, just the above may be a simple thing
> to explore.

I'd refrained from splitting in vect_analyze_group_access_1 as my understanding
was that we only did that once, whereas we would retry the
vect_analyze_slp_instance path each time we decreased the
vectorization_factor...however, I did try putting code at the beginning of
vect_analyze_slp_instance to split up any groups > vf. Unfortunately this loses
us some previously-successful SLPs, as some bigger groups cannot be SLPed if we
split them as they require 'unrolling'...so not addressing that here.

However your suggestion of splitting twice when we know the boundary is in the
middle of a vector is a nice compromise; it nets us a good number more
successes in SPEC2000 and SPEC2006, about 7% more than without the patch.
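
To make the split points concrete, here is a minimal standalone sketch of the
index arithmetic only (it is not the GCC code). The names struct split_plan and
plan_split are hypothetical; matches, group_size and vf stand in for the
corresponding values in vect_analyze_slp_instance, and vf is assumed to be a
power of two.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of how one failed store group gets carved up.  */
struct split_plan
{
  bool split;            /* false: the group is left alone */
  unsigned group1_size;  /* leading stmts, whole vectors that all matched */
  unsigned skipped;      /* stmts after group1 that are not re-analyzed */
  unsigned rest_size;    /* trailing stmts retried as a separate group */
};

static struct split_plan
plan_split (const bool *matches, unsigned group_size, unsigned vf)
{
  struct split_plan p = { false, 0, 0, 0 };

  /* The masking below only works for a power-of-two factor.  */
  assert (vf != 0 && (vf & (vf - 1)) == 0);

  /* Find the first stmt that failed to match.  */
  unsigned i = 0;
  while (i < group_size && matches[i])
    i++;

  /* Nothing to do if the first vector did not fully match, or if
     every stmt matched.  */
  if (i < vf || i >= group_size)
    return p;

  p.split = true;
  /* Round the mismatch index down to a vector boundary.  */
  p.group1_size = i & ~(vf - 1);

  if (p.group1_size < i)
    {
      /* Mismatch is mid-vector: drop the vector that contains it.  */
      unsigned next = p.group1_size + vf;
      if (next < group_size)
        {
          p.skipped = vf;
          p.rest_size = group_size - next;
        }
      else
        p.skipped = group_size - p.group1_size;
    }
  else
    /* Mismatch sits exactly on a boundary: retry everything after it.  */
    p.rest_size = group_size - p.group1_size;

  return p;
}
```

With Richard's example (vf of 4, twelve stores, first mismatch at index 6) this
gives a first group of 4, a dropped middle vector of 4, and a final group of 4
that gets re-analyzed.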

Hence, here's the patch I've committed, as r230330, after regstrap on x86_64
and AArch64. (I dropped the previous bb-slp-subgroups-2 and renamed the others
up as we don't do that one anymore.)
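
For anyone reading the helper in the patch without the GCC sources to hand, the
list surgery in vect_split_slp_store_group can be modelled roughly as below.
This is only a sketch: struct store and split_group are hypothetical stand-ins
for the stmt_vec_info fields behind GROUP_FIRST_ELEMENT, GROUP_NEXT_ELEMENT,
GROUP_SIZE and GROUP_GAP, and the gap updates simply mirror the ones in the
patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-stmt grouping fields.  */
struct store
{
  struct store *first;  /* head of the group this store belongs to */
  struct store *next;   /* next store in the group, or NULL at the end */
  unsigned size;        /* group size; only meaningful on the head */
  unsigned gap;         /* models GROUP_GAP */
};

/* Cut the group headed by HEAD after its first GROUP1_SIZE stores and
   return the head of the newly created second group.  */
static struct store *
split_group (struct store *head, unsigned group1_size)
{
  assert (head->first == head);
  assert (group1_size > 0 && group1_size < head->size);

  unsigned group2_size = head->size - group1_size;
  head->size = group1_size;

  /* Walk to the last store of the first group.  */
  struct store *s = head;
  for (unsigned i = group1_size; i > 1; i--)
    s = s->next;

  /* Detach the tail and make it a group of its own.  */
  struct store *group2 = s->next;
  s->next = NULL;
  group2->size = group2_size;
  for (s = group2; s; s = s->next)
    s->first = group2;

  /* The second group's gap is the original head's gap plus the stores
     of the first group; this must be computed before the head's gap
     is changed.  */
  group2->gap = head->gap + group1_size;
  /* The first group now also has to skip over the second group.  */
  head->gap += group2_size;

  return group2;
}
```

Splitting an 8-store group at 4 leaves two 4-store lists, each with its own
head and with the gap fields adjusted as in the patch.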

Cheers, Alan

gcc/ChangeLog:

	PR tree-optimization/67682
	* tree-vect-slp.c (vect_split_slp_store_group): New.
	(vect_analyze_slp_instance): During basic block SLP, recurse on
	subgroups if vect_build_slp_tree fails after 1st vector.

gcc/testsuite/ChangeLog:

	PR tree-optimization/67682
	* gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
	* gcc.dg/vect/bb-slp-subgroups-1.c: New.
	* gcc.dg/vect/bb-slp-subgroups-2.c: New.
	* gcc.dg/vect/bb-slp-subgroups-3.c: New.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 +--
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 41 +++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 +++++++++++++
 gcc/tree-vect-slp.c                            | 85 +++++++++++++++++++++++++-
 5 files changed, 215 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
index ab54a48..b8bef8c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
@@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
   unsigned int *pout = &out[0];
   unsigned int a0, a1, a2, a3;
 
-  /* Non isomorphic.  */
+  /* Non isomorphic, even 64-bit subgroups.  */
   a0 = *pin++ + 23;
-  a1 = *pin++ + 142;
+  a1 = *pin++ * 142;
   a2 = *pin++ + 2;
   a3 = *pin++ * 31;
-  
+
   *pout++ = a0 * x;
   *pout++ = a1 * y;
   *pout++ = a2 * x;
@@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
 
   /* Check results.  */
   if (out[0] != (in[0] + 23) * x
-      || out[1] != (in[1] + 142) * y
+      || out[1] != (in[1] * 142) * y
       || out[2] != (in[2] + 2) * x
       || out[3] != (in[3] * 31) * y)
     abort();
@@ -47,4 +47,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
-  
+
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
new file mode 100644
index 0000000..39c23c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
@@ -0,0 +1,44 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1];
+    a[2] = b[2];
+    a[3] = b[3];
+    a[4] = 0;
+    a[5] = 0;
+    a[6] = 0;
+    a[7] = 0;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  for (int i = 0; i < 4; i++)
+    if (a[i] != i+4)
+      abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != 0)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
new file mode 100644
index 0000000..13c51f3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[2] + 1;
+    a[1] = b[0] + 2;
+    a[2] = b[1] + 3;
+    a[3] = b[1] + 4;
+    a[4] = b[3] * 3;
+    a[5] = b[0] * 4;
+    a[6] = b[2] * 5;
+    a[7] = b[1] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
+      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
new file mode 100644
index 0000000..6ae9a89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0] + 1;
+    a[1] = b[1] + 2;
+    a[2] = b[2] + 3;
+    a[3] = b[3] + 4;
+    a[4] = b[0] * 3;
+    a[5] = b[2] * 4;
+    a[6] = b[4] * 5;
+    a[7] = b[6] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
+      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index cfdfc29..65a183f 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
   body_cost_vec.release ();
 }
 
+/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
+   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
+   the first GROUP1_SIZE stmts, since stores are consecutive), the second
+   containing the remainder.
+   Return the first stmt in the second group.  */
+
+static gimple *
+vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
+{
+  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
+  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
+  gcc_assert (group1_size > 0);
+  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
+  gcc_assert (group2_size > 0);
+  GROUP_SIZE (first_vinfo) = group1_size;
+
+  gimple *stmt = first_stmt;
+  for (unsigned i = group1_size; i > 1; i--)
+    {
+      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+  /* STMT is now the last element of the first group.  */
+  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
+
+  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
+  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+
+  /* For the second group, the GROUP_GAP is that before the original group,
+     plus skipping over the first vector.  */
+  GROUP_GAP (vinfo_for_stmt (group2)) =
+    GROUP_GAP (first_vinfo) + group1_size;
+
+  /* GROUP_GAP of the first group now has to skip over the second group too.  */
+  GROUP_GAP (first_vinfo) += group2_size;
+
+  if (dump_enabled_p ())
+    dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
+		     group1_size, group2_size);
+
+  return group2;
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
   tree vectype, scalar_type = NULL_TREE;
   gimple *next;
   unsigned int vectorization_factor = 0;
-  int i;
+  unsigned int i;
   unsigned int max_nunits = 0;
   vec<slp_tree> loads;
   struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
@@ -1811,6 +1859,41 @@ vect_analyze_slp_instance (vec_info *vinfo,
   vect_free_slp_tree (node);
   loads.release ();
 
+  /* For basic block SLP, try to break the group up into multiples of the
+     vectorization factor.  */
+  if (is_a <bb_vec_info> (vinfo)
+      && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
+      && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt)))
+    {
+      /* We consider breaking the group only on VF boundaries from the existing
+	 start.  */
+      for (i = 0; i < group_size; i++)
+	if (!matches[i]) break;
+
+      if (i >= vectorization_factor && i < group_size)
+	{
+	  /* Split into two groups at the first vector boundary before i.  */
+	  gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
+	  unsigned group1_size = i & ~(vectorization_factor - 1);
+
+	  gimple *rest = vect_split_slp_store_group (stmt, group1_size);
+	  bool res = vect_analyze_slp_instance (vinfo, stmt, max_tree_size);
+	  /* If the first non-match was in the middle of a vector,
+	     skip the rest of that vector.  */
+	  if (group1_size < i)
+	    {
+	      i = group1_size + vectorization_factor;
+	      if (i < group_size)
+		rest = vect_split_slp_store_group (rest, vectorization_factor);
+	    }
+	  if (i < group_size)
+	    res |= vect_analyze_slp_instance (vinfo, rest, max_tree_size);
+	  return res;
+	}
+      /* Even though the first vector did not all match, we might be able to SLP
+	 (some) of the remainder.  FORNOW ignore this possibility.  */
+    }
+
   return false;
 }
 
-- 
1.9.1