From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 79661 invoked by alias); 13 Nov 2015 16:19:00 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 79645 invoked by uid 89); 13 Nov 2015 16:19:00 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.7 required=5.0 tests=AWL,BAYES_00,SPF_PASS autolearn=ham version=3.3.2 X-HELO: eu-smtp-delivery-143.mimecast.com Received: from eu-smtp-delivery-143.mimecast.com (HELO eu-smtp-delivery-143.mimecast.com) (207.82.80.143) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 13 Nov 2015 16:18:57 +0000 Received: from cam-owa2.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.140]) by eu-smtp-1.mimecast.com with ESMTP id uk-mta-16-HNfL1Me3QWWfjfSgu8sjNQ-1; Fri, 13 Nov 2015 16:18:51 +0000 Received: from arm.com ([10.1.2.79]) by cam-owa2.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959); Fri, 13 Nov 2015 16:18:51 +0000 From: Alan Lawrence To: gcc-patches@gcc.gnu.org Cc: richard.guenther@gmail.com Subject: Re: [PATCH] PR/67682, break SLP groups up if only some elements match Date: Fri, 13 Nov 2015 16:19:00 -0000 Message-Id: <1447431522-4695-1-git-send-email-alan.lawrence@arm.com> In-Reply-To: References: X-MC-Unique: HNfL1Me3QWWfjfSgu8sjNQ-1 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: quoted-printable X-IsSubscribed: yes X-SW-Source: 2015-11/txt/msg01735.txt.bz2 On 10/11/15 12:51, Richard Biener wrote: >> >> Just noticing this... if we have a vectorization factor of 4 and matches >> is 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0 then this will split into 1, 1, 1= , 1 and >> 1, 1, 0, 0, 0, ... where we know from the matches that it will again fai= l? >> >> Thus shouldn't we split either only if i % vectorization_factor is 0 or >> if not, split "twice", dropping the intermediate surely non-matching >> group of vectorization_factor size? After all if we split like with the >> patch then the upper half will _not_ be splitted again with the >> simplified patch (result will be 1, 1, 0, 0, 0, 0, 0, 0 again). >> >> So I expect that the statistics will be the same if we restrict splitting >> to the i % vectorization_factor =3D=3D 0 case, or rather split where we = do >> now but only re-analyze group2 if i % vectorization_factor =3D=3D 0 hold= s? >> >> Ok with that change. Improvements on that incrementally. > > Btw, it just occurs to me that the whole thing is equivalent to splitting > the store-group into vector-size pieces up-front? That way we do > the maximum splitting up-frond and avoid any redundant work? > > The patch is still ok as said, just the above may be a simple thing > to explore. I'd refrained from splitting in vect_analyze_group_access_1 as my understan= ding was that we only did that once, whereas we would retry the vect_analyze_slp_instance path each time we decreased the vectorization_factor...however, I did try putting code at the beginning of vect_analyze_slp_instance to split up any groups > vf. Unfortunately this l= oses us some previously-successful SLPs, as some bigger groups cannot be SLPed i= f we split them as they require 'unrolling'...so not addressing that here. However your suggestion of splitting twice when we know the boundary is in = the middle of a vector is a nice compromise; it nets us a good number more successes in SPEC2000 and SPEC2006, about 7% more than without the patch. Hence, here's the patch I've committed, as r230330, after regstrap on x86_64 and AArch64. (I dropped the previous bb-slp-subgroups-2 and renamed the oth= ers up as we don't do that one anymore.) Cheers, Alan gcc/ChangeLog: PR tree-optimization/67682 * tree-vect-slp.c (vect_split_slp_store_group): New. (vect_analyze_slp_instance): During basic block SLP, recurse on subgroups if vect_build_slp_tree fails after 1st vector. gcc/testsuite/ChangeLog: PR tree-optimization/67682 * gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic. * gcc.dg/vect/bb-slp-subgroups-1.c: New. * gcc.dg/vect/bb-slp-subgroups-2.c: New. * gcc.dg/vect/bb-slp-subgroups-3.c: New. --- gcc/testsuite/gcc.dg/vect/bb-slp-7.c | 10 +-- gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++ gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 41 +++++++++++++ gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 +++++++++++++ gcc/tree-vect-slp.c | 85 ++++++++++++++++++++++= +++- 5 files changed, 215 insertions(+), 6 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/ve= ct/bb-slp-7.c index ab54a48..b8bef8c 100644 --- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c @@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y) unsigned int *pout =3D &out[0]; unsigned int a0, a1, a2, a3; =20 - /* Non isomorphic. */ + /* Non isomorphic, even 64-bit subgroups. */ a0 =3D *pin++ + 23; - a1 =3D *pin++ + 142; + a1 =3D *pin++ * 142; a2 =3D *pin++ + 2; a3 =3D *pin++ * 31; -=20=20 + *pout++ =3D a0 * x; *pout++ =3D a1 * y; *pout++ =3D a2 * x; @@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y) =20 /* Check results. */ if (out[0] !=3D (in[0] + 23) * x - || out[1] !=3D (in[1] + 142) * y + || out[1] !=3D (in[1] * 142) * y || out[2] !=3D (in[2] + 2) * x || out[3] !=3D (in[3] * 31) * y) abort(); @@ -47,4 +47,4 @@ int main (void) } =20 /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } }= */ -=20=20 + diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite= /gcc.dg/vect/bb-slp-subgroups-1.c new file mode 100644 index 0000000..39c23c3 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c @@ -0,0 +1,44 @@ +/* { dg-require-effective-target vect_int } */ +/* PR tree-optimization/67682. */ + +#include "tree-vect.h" + +int __attribute__((__aligned__(8))) a[8]; +int __attribute__((__aligned__(8))) b[4]; + +__attribute__ ((noinline)) void +test () +{ + a[0] =3D b[0]; + a[1] =3D b[1]; + a[2] =3D b[2]; + a[3] =3D b[3]; + a[4] =3D 0; + a[5] =3D 0; + a[6] =3D 0; + a[7] =3D 0; +} + +int +main (int argc, char **argv) +{ + check_vect (); + + for (int i =3D 0; i < 8; i++) + a[i] =3D 1; + for (int i =3D 0; i < 4; i++) + b[i] =3D i + 4; + __asm__ volatile ("" : : : "memory"); + test (a, b); + __asm__ volatile ("" : : : "memory"); + for (int i =3D 0; i < 4; i++) + if (a[i] !=3D i+4) + abort (); + for (int i =3D 4; i < 8; i++) + if (a[i] !=3D 0) + abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using= SLP" 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } }= */ diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite= /gcc.dg/vect/bb-slp-subgroups-2.c new file mode 100644 index 0000000..13c51f3 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c @@ -0,0 +1,41 @@ +/* { dg-require-effective-target vect_int } */ +/* PR tree-optimization/67682. */ + +#include "tree-vect.h" + +int __attribute__((__aligned__(8))) a[8]; +int __attribute__((__aligned__(8))) b[4]; + +__attribute__ ((noinline)) void +test () +{ + a[0] =3D b[2] + 1; + a[1] =3D b[0] + 2; + a[2] =3D b[1] + 3; + a[3] =3D b[1] + 4; + a[4] =3D b[3] * 3; + a[5] =3D b[0] * 4; + a[6] =3D b[2] * 5; + a[7] =3D b[1] * 7; +} + +int +main (int argc, char **argv) +{ + check_vect (); + + for (int i =3D 0; i < 8; i++) + a[i] =3D 1; + for (int i =3D 0; i < 4; i++) + b[i] =3D i + 4; + __asm__ volatile ("" : : : "memory"); + test (a, b); + __asm__ volatile ("" : : : "memory"); + if ((a[0] !=3D 7) || a[1] !=3D 6 || (a[2] !=3D 8) || (a[3] !=3D 9) + || (a[4] !=3D 21) || (a[5] !=3D 16) || (a[6] !=3D 30) || (a[7] !=3D = 35)) + abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using= SLP" 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } }= */ diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite= /gcc.dg/vect/bb-slp-subgroups-3.c new file mode 100644 index 0000000..6ae9a89 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c @@ -0,0 +1,41 @@ +/* { dg-require-effective-target vect_int } */ +/* PR tree-optimization/67682. */ + +#include "tree-vect.h" + +int __attribute__((__aligned__(8))) a[8]; +int __attribute__((__aligned__(8))) b[8]; + +__attribute__ ((noinline)) void +test () +{ + a[0] =3D b[0] + 1; + a[1] =3D b[1] + 2; + a[2] =3D b[2] + 3; + a[3] =3D b[3] + 4; + a[4] =3D b[0] * 3; + a[5] =3D b[2] * 4; + a[6] =3D b[4] * 5; + a[7] =3D b[6] * 7; +} + +int +main (int argc, char **argv) +{ + check_vect (); + + for (int i =3D 0; i < 8; i++) + a[i] =3D 1; + for (int i =3D 0; i < 8; i++) + b[i] =3D i + 4; + __asm__ volatile ("" : : : "memory"); + test (a, b); + __asm__ volatile ("" : : : "memory"); + if ((a[0] !=3D 5) || (a[1] !=3D 7) || (a[2] !=3D 9) || (a[3] !=3D 11) + || (a[4] !=3D 12) || (a[5] !=3D 24) || (a[6] !=3D 40) || (a[7] !=3D = 70)) + abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using= SLP" 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } }= */ diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c index cfdfc29..65a183f 100644 --- a/gcc/tree-vect-slp.c +++ b/gcc/tree-vect-slp.c @@ -1606,6 +1606,54 @@ vect_analyze_slp_cost (slp_instance instance, void *= data) body_cost_vec.release (); } =20 +/* Splits a group of stores, currently beginning at FIRST_STMT, into two g= roups: + one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing + the first GROUP1_SIZE stmts, since stores are consecutive), the second + containing the remainder. + Return the first stmt in the second group. */ + +static gimple * +vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size) +{ + stmt_vec_info first_vinfo =3D vinfo_for_stmt (first_stmt); + gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) =3D=3D first_stmt); + gcc_assert (group1_size > 0); + int group2_size =3D GROUP_SIZE (first_vinfo) - group1_size; + gcc_assert (group2_size > 0); + GROUP_SIZE (first_vinfo) =3D group1_size; + + gimple *stmt =3D first_stmt; + for (unsigned i =3D group1_size; i > 1; i--) + { + stmt =3D GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)); + gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) =3D=3D 1); + } + /* STMT is now the last element of the first group. */ + gimple *group2 =3D GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)); + GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) =3D 0; + + GROUP_SIZE (vinfo_for_stmt (group2)) =3D group2_size; + for (stmt =3D group2; stmt; stmt =3D GROUP_NEXT_ELEMENT (vinfo_for_stmt = (stmt))) + { + GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) =3D group2; + gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) =3D=3D 1); + } + + /* For the second group, the GROUP_GAP is that before the original group, + plus skipping over the first vector. */ + GROUP_GAP (vinfo_for_stmt (group2)) =3D + GROUP_GAP (first_vinfo) + group1_size; + + /* GROUP_GAP of the first group now has to skip over the second group to= o. */ + GROUP_GAP (first_vinfo) +=3D group2_size; + + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\= n", + group1_size, group2_size); + + return group2; +} + /* Analyze an SLP instance starting from a group of grouped stores. Call vect_build_slp_tree to build a tree of packed stmts if possible. Return FALSE if it's impossible to SLP any stmt in the loop. */ @@ -1621,7 +1669,7 @@ vect_analyze_slp_instance (vec_info *vinfo, tree vectype, scalar_type =3D NULL_TREE; gimple *next; unsigned int vectorization_factor =3D 0; - int i; + unsigned int i; unsigned int max_nunits =3D 0; vec loads; struct data_reference *dr =3D STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt)= ); @@ -1811,6 +1859,41 @@ vect_analyze_slp_instance (vec_info *vinfo, vect_free_slp_tree (node); loads.release (); =20 + /* For basic block SLP, try to break the group up into multiples of the + vectorization factor. */ + if (is_a (vinfo) + && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) + && STMT_VINFO_GROUPED_ACCESS (vinfo_for_stmt (stmt))) + { + /* We consider breaking the group only on VF boundaries from the exi= sting + start. */ + for (i =3D 0; i < group_size; i++) + if (!matches[i]) break; + + if (i >=3D vectorization_factor && i < group_size) + { + /* Split into two groups at the first vector boundary before i. */ + gcc_assert ((vectorization_factor & (vectorization_factor - 1)) =3D=3D = 0); + unsigned group1_size =3D i & ~(vectorization_factor - 1); + + gimple *rest =3D vect_split_slp_store_group (stmt, group1_size); + bool res =3D vect_analyze_slp_instance (vinfo, stmt, max_tree_size); + /* If the first non-match was in the middle of a vector, + skip the rest of that vector. */ + if (group1_size < i) + { + i =3D group1_size + vectorization_factor; + if (i < group_size) + rest =3D vect_split_slp_store_group (rest, vectorization_factor); + } + if (i < group_size) + res |=3D vect_analyze_slp_instance (vinfo, rest, max_tree_size); + return res; + } + /* Even though the first vector did not all match, we might be able = to SLP + (some) of the remainder. FORNOW ignore this possibility. */ + } + return false; } =20 --=20 1.9.1