From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-449159-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 27905 invoked by alias); 16 Apr 2014 12:10:11 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 27858 invoked by uid 48); 16 Apr 2014 12:10:06 -0000
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/60841] [4.9/4.10 Regression] gcc: internal compiler error: Killed (program cc1) out of memory
Date: Wed, 16 Apr 2014 12:10:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 4.9.0
X-Bugzilla-Keywords: memory-hog
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: rguenth at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 4.9.0
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID: <bug-60841-4-9wFfXDRpgk@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-60841-4@http.gcc.gnu.org/bugzilla/>
References: <bug-60841-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-04/txt/msg01179.txt.bz2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60841
--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ok, so this loop contains an incredibly connected value-graph (connecting the
loads to the stores) and stupidly (yeah, well...) the vectorizer SLP code
builds a tree (yes, by effectively duplicating 'shared' nodes).

AFAIK that itself isn't a regression.  What may be a regression is that
we compute SLP now before other analysis (dependence analysis for example)
would reject the loop as not vectorizable.  So you can certainly build
a testcase that should show similar behavior with 4.8 or 4.7.

Note that the vectorization transform doesn't yet support SLP rooting
of a build-from-scalars vector, so your obvious idea to fix this is
a general missed optimization (also for scalar code vectorization - as
soon as the SLP build fails we can make it succeed by building a starting
vector from N (different) scalars).  It also doesn't really work with
SLP within loop vectorization I think.

4.8 fails to vectorize with

t.i:9: note: === vect_analyze_data_ref_accesses ===
t.i:9: note: Detected interleaving of size 4
t.i:9: note: interleaving size is greater than step for p_240->f3
t.i:9: note: not vectorized: complicated access pattern.
t.i:9: note: bad data access.

but 4.9 and trunk are happy with them.

Example testcase, vectorized with the same graph vs. tree issue with
-fvect-cost-model=unlimited:

int a[1024];
int b[1024];
void foo()
{
  int i;
  for (i = 0; i < 1024/4; i+=4)
    {
      int a0 = a[i+0];
      int a1 = a[i+1];
      int a2 = a[i+2];
      int a3 = a[i+3];
      a0 = a0 * 3;
      a1 = a1 * 3;
      a2 = a2 * 3;
      a3 = a3 * 3;
      int a0p = a0 + 1;
      int a1p = a1 + 1;
      int a2p = a2 + 1;
      int a3p = a3 + 1;
      b[i+0] = a0p + a0;
      b[i+1] = a1p + a1;
      b[i+2] = a2p + a2;
      b[i+3] = a3p + a3;
    }
}

> grep 'vectorizing SLP node' t.c.114t.vect 
t.c:6:3: note: ------>vectorizing SLP node starting from: a0_4 = a[i_30];
t.c:6:3: note: ------>vectorizing SLP node starting from: a0_11 = a0_4 * 3;
t.c:6:3: note: ------>vectorizing SLP node starting from: a0p_15 = a0_11 + 1;
t.c:6:3: note: ------>vectorizing SLP node starting from: a0_4 = a[i_30];
t.c:6:3: note: ------>vectorizing SLP node starting from: a0_11 = a0_4 * 3;
t.c:6:3: note: ------>vectorizing SLP node starting from: _19 = a0p_15 + a0_11;
t.c:6:3: note: ------>vectorizing SLP node starting from: b[i_30] = _19;

thus we have duplicate nodes for the SLP load and the SLP mult by 3.  This
issue grows the SLP tree exponentially.  In theory each reference may
correspond to a different SLP permutation (but we don't support that in
the code-gen yet).  Note we also emit duplicate vectorized code from
such SLP trees (but usually we reject it via cost analysis).  The above
happens at least since GCC 4.7.

I'm not sure if we should completely disregard multi-uses for this reason
(will check if there are testcases that expect vectorization), but
definitely that would fix the issue.

Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c (revision 209423)
+++ gcc/tree-vect-slp.c (working copy)
@@ -911,6 +911,37 @@ vect_build_slp_tree (loop_vec_info loop_
       if (oprnd_info->first_dt != vect_internal_def)
         continue;

+      /* Check if we have multiple uses on stmts not participating in
+        this SLP node.  */
+      unsigned j;
+      gimple def_stmt, use_stmt;
+      imm_use_iterator iter;
+      FOR_EACH_VEC_ELT (oprnd_info->def_stmts, j, def_stmt)
+       {
+         tree def;
+         if (gimple_code (def_stmt) == GIMPLE_PHI)
+           def = gimple_phi_result (def_stmt);
+         else
+           def = single_ssa_tree_operand (def_stmt, SSA_OP_DEF);
+         FOR_EACH_IMM_USE_STMT (use_stmt, iter, def)
+           {
+             unsigned k;
+             gimple sstmt;
+             FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (*node), k, sstmt)
+                 if (use_stmt == sstmt)
+                   break;
+             if (k == SLP_TREE_SCALAR_STMTS (*node).length ())
+               {
+                 if (dump_enabled_p ())
+                   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                                    "Build SLP failed: multiple uses\n");
+                 end_imm_use_stmt_traverse (&iter);
+                 vect_free_oprnd_info (oprnds_info);
+                 return false;
+               }
+           }
+       }
+

this retains multiple uses in the same SLP node which is supported.
Unfortunately the check is quadratic in the SLP group size (but that's
limited to the size of a vector register ...).  We could, if we ensure
like PLF_2 is always clear on stmts, mark the scalar stmts that way
and aovid the inner loop, of course.

The above patch for example disables SLP vectorization of slp-12a.c
(and other vect testcases) because there is a non-SLP use of b6 with the
store to ia[i].

So checking this at this point doesn't seem to work.

We can impose an artificial limit on the SLP size.  But that's not a real
fix either.

We can mark stmts "consumed" when building the SLP tree and check that.

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c       (revision 209423)
+++ gcc/tree-vect-stmts.c       (working copy)
@@ -7388,6 +7388,8 @@ new_stmt_vec_info (gimple stmt, loop_vec
   GROUP_GAP (res) = 0;
   GROUP_SAME_DR_STMT (res) = NULL;

+  gimple_set_plf (stmt, GF_PLF_1, false);
+
   return res;
 }

Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c (revision 209423)
+++ gcc/tree-vect-slp.c (working copy)
@@ -85,6 +85,8 @@ vect_free_slp_tree (slp_tree node)
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
     vect_free_slp_tree (child);

+  gimple_set_plf (SLP_TREE_SCALAR_STMTS (node)[0], GF_PLF_1, false);
+
   SLP_TREE_CHILDREN (node).release ();
   SLP_TREE_SCALAR_STMTS (node).release ();
   SLP_TREE_VEC_STMTS (node).release ();
@@ -862,6 +864,14 @@ vect_build_slp_tree (loop_vec_info loop_
   matches[0] = false;

   stmt = SLP_TREE_SCALAR_STMTS (*node)[0];
+  if (gimple_plf (stmt, GF_PLF_1))
+    {
+      if (dump_enabled_p ())
+       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+                        "Build SLP failed: multiple uses\n");
+      return false;
+    }
+
   if (is_gimple_call (stmt))
     nops = gimple_call_num_args (stmt);
   else if (is_gimple_assign (stmt))
@@ -977,6 +1020,8 @@ vect_build_slp_tree (loop_vec_info loop_
       return false;
     }

+  gimple_set_plf (stmt, GF_PLF_1, true);
+
   vect_free_oprnd_info (oprnds_info);
   return true;
 }

That still fails slp-18.c which exactly has such a duplicate use (with
very non-optimal initial vectorized codegen).  Similar bb-slp-20.c
and bb-slp-21.c.  So we could add an artificial "number of allowed
duplications" counter ... (but that couldn't use the flag approach
anymore as we'd falsely clear it on duplicate uses).