public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH/RFC] Make loop-header-copying more aggressive, rerun before tree-if-conversion
@ 2015-05-22 15:46 Alan Lawrence
  2015-05-27 16:01 ` Jeff Law
  2015-05-28 12:11 ` Richard Biener
  0 siblings, 2 replies; 8+ messages in thread
From: Alan Lawrence @ 2015-05-22 15:46 UTC (permalink / raw)
  To: gcc-patches; +Cc: Richard Biener

[-- Attachment #1: Type: text/plain, Size: 4408 bytes --]

This example which I wrote to test ifconversion, currently fails to if-convert 
or vectorize:

int foo ()
{
   for (int i = 0; i < 32 ; i++)
     {
       int m = (a[i] & i) ? 5 : 4;
       b[i] = a[i] * m;
     }
}

...because jump-threading in dom1 rearranged the loop into a form that neither 
if-conversion nor vectorization would attempt. Discussion at 
https://gcc.gnu.org/ml/gcc/2015-04/msg00343.html lead to the suggestion that I 
should rerun loop-header copying (an earlier attempt to fix ifconversion, 
https://gcc.gnu.org/ml/gcc-patches/2015-04/msg01743.html, still did not enable 
vectorization.)

This patch does so (and makes slightly less conservative, to tackle the example 
above). I found I had to make this a separate pass, so that the phi nodes were 
cleaned up at the end of the pass before running tree_if_conversion. Also at 
this stage in the compiler (inside loop opts) it was not possible to run 
loop_optimizer_init+finalize, or other loop_optimizer data structures needed 
later would be deleted; hence, I have two nearly-but-not-quite-identical passes, 
the new "ch_vect" avoiding the init/finalize. I tried to tackle this with some 
C++ subclassing, which removes the duplication, but the result feels a little 
ugly; suggestions for any neater approach welcome.

This patch causes failure of the scan-tree-dump of dom2 in gcc.dg/ssa/pr21417.c. 
This looks for jump-threading to perform an optimization, but no longer finds 
the expected line in the log - as the loop-header-copying phase has already done 
an equivalent transformation *before* dom2. The final CFG is thus in the desired 
form, but I'm not sure how to determine this (scanning the CFG itself is very 
difficult, well beyond what we can do with regex, requiring looking at multiple 
lines and basic blocks). Can anyone advise? [The test issue can be worked around 
by preserving the old do_while_p logic for the first header-copying pass, and 
using the new logic only for the second, but this is more awkward inside the 
compiler, which feels wrong.]

Besides the new vect-ifcvt-11.c, the testsuite actually has a couple of other 
examples where this patch enables (undesired!) vectorization. I've dealt with 
these, but for the record:
	* gcc.dg/vect/slp-perm-7.c: the initialization loop in main, contained a check 
that input[i] < 200; this was already optimized out (because input[i] was set to 
i%256, where i<N with N #defined to 16), but that loop was not vectorized because:
/work/alalaw01/oban/srcfsf/gcc/gcc/testsuite/gcc.dg/vect/slp-perm-7.c:54:3: 
note: not vectorized: latch block not empty.
/work/alalaw01/oban/srcfsf/gcc/gcc/testsuite/gcc.dg/vect/slp-perm-7.c:54:3: 
note: bad loop form.

	* gcc.dg/vect/vect-strided-a-u16-i4.c: the main1() function has three loops; 
the first (initialization) has an 'if (y) abort() /* Avoid vectorization.  */'. 
However, the 'volatile int y = 0' this was meant to reference, is actually 
shadowed by a local non-volatile; the test is thus peeled off and absent from 
the body of the loop. The loop only avoided vectorization because of non-empty 
latch and bad loop form, as previous.

With this patch, both those loops now have good form, hence I have fixed both to 
check a global volatile to prevent these extraneous parts from being vectorized.

Tested with bootstrap + check-gcc on x86_64 and AArch64 (linux). As noted above, 
this causes a spurious PASS->FAIL of a scan-tree-dump test, which I'm unsure how 
to fix, but no other regressions.

gcc/ChangeLog:

	* tree-pass.h (make_pass_ch_vect): New.
	* passes.def: Add pass_ch_vect just before pass_if_conversion.

	* tree-ssa-loop-ch.c (do_while_loop_p): For single-exit loops,
	look for blocks with exit edge and code after them.
	(pass_data_ch_vect, class pass_ch_vect, make_pass_ch_vect): New.
	(class pass_ch): Extend pass_ch_vect.
	(pass_ch::execute): Move all but loop_optimizer_init/finalize to...
	(pass_ch_vect::execute): ...here.

	* tree-ssa-loop.c (pass_tree_loop_init::execute): Add flags
	LOOPS_HAVE_PREHEADERS and LOOPS_HAVE_SIMPLE_LATCHES.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/slp-perm-7.c (zero): New.
	(main): Test zero rather than input[i], to avoid vectorization.
	* gcc.dg/vect/vect-strided-a-u16-i4.c (main1): Narrow scope of x,y,z,w.
	of unsigned
	* gcc.dg/vect/vect-ifcvt-11.c: New.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: rerun-loop-ch.patch --]
[-- Type: text/x-patch; name=rerun-loop-ch.patch, Size: 7848 bytes --]

diff --git a/gcc/passes.def b/gcc/passes.def
index 1d598b2..87cfe2a 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -247,6 +247,7 @@ along with GCC; see the file COPYING3.  If not see
 	  PUSH_INSERT_PASSES_WITHIN (pass_parallelize_loops)
 	      NEXT_PASS (pass_expand_omp_ssa);
 	  POP_INSERT_PASSES ()
+	  NEXT_PASS (pass_ch_vect);
 	  NEXT_PASS (pass_if_conversion);
 	  /* pass_vectorize must immediately follow pass_if_conversion.
 	     Please do not add any other passes in between.  */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-7.c b/gcc/testsuite/gcc.dg/vect/slp-perm-7.c
index 6291096..e0ec55a 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-perm-7.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-perm-7.c
@@ -20,6 +20,8 @@
 
 #define N 16
 
+volatile int zero = 0;
+
 /* SLP with load permutation and loop-based vectorization.  */
 void foo (int *__restrict__ pInput, int *__restrict__ pOutput,
           int *__restrict__ pInput2, int *__restrict__ pOutput2)
@@ -57,7 +59,7 @@ int main (int argc, const char* argv[])
       input2[i] = i%256;
       output[i] = 0;
       output2[i] = 0;
-      if (input[i] > 200)
+      if (zero) /* Avoid vectorization.  */
         abort ();
     }
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-ifcvt-11.c b/gcc/testsuite/gcc.dg/vect/vect-ifcvt-11.c
new file mode 100644
index 0000000..53b139d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-ifcvt-11.c
@@ -0,0 +1,37 @@
+/* { dg-require-effective-target vect_condition } */
+/* { dg-require-effective-target vect_int } */
+
+#include "tree-vect.h"
+
+#define N 16
+
+extern void abort(void);
+
+int A[N] = {36, 39, 42, 45, 43, 32, 21, 12, 23, 34, 45, 56, 67, 78, 81, 11};
+int B[N] = {144,195,210,225,172,128,105,60, 92, 136,225,280,268,390,324,55};
+
+__attribute__((noinline))
+void foo ()
+{
+  for (int i = 0; i < N; i++)
+    {
+      int m = (A[i] & i) ? 5 : 4;
+      A[i] = A[i] * m;
+    }
+}
+
+int main ()
+{
+
+  check_vect ();
+  foo ();
+  /* check results:  */
+  for (int i = 0; i < N; i++)
+    if (A[i] != B[i])
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/vect-strided-a-u16-i4.c b/gcc/testsuite/gcc.dg/vect/vect-strided-a-u16-i4.c
index 68114a6..0656fb7 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-strided-a-u16-i4.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-strided-a-u16-i4.c
@@ -21,7 +21,6 @@ main1 ()
   s *ptr = arr;
   s res[N];
   int i;
-  unsigned short x, y, z, w;
 
   for (i = 0; i < N; i++)
     {
@@ -35,6 +34,7 @@ main1 ()
 
   for (i = 0; i < N; i++)
     {
+      unsigned short x, y, z, w;
       x = ptr->b - ptr->a;
       y = ptr->d - ptr->c;
       res[i].c = x + y;
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index bc8763d..cb0f986 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -379,6 +379,7 @@ extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_ch_vect (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ccp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_phi_only_cprop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_ssa (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop-ch.c b/gcc/tree-ssa-loop-ch.c
index c6441b8..0fb7155 100644
--- a/gcc/tree-ssa-loop-ch.c
+++ b/gcc/tree-ssa-loop-ch.c
@@ -140,6 +140,29 @@ do_while_loop_p (struct loop *loop)
       && gimple_code (stmt) == GIMPLE_COND)
     return false;
 
+  /* Leave anything with multiple exits, given it satisfies the above.  */
+  if (!single_exit (loop))
+    return true;
+
+  /* If any block in the loop has an exit edge, and code after it, it is
+     not a do-while loop.  */
+  basic_block *body = get_loop_body (loop);
+  for (unsigned i = 0; i < loop->num_nodes; i++)
+    {
+      bool block_has_exit = false, block_precedes_code = false;
+      edge_iterator ei;
+      edge e;
+      FOR_EACH_EDGE (e, ei, body[i]->succs)
+	if (loop_exit_edge_p (loop, e))
+	  block_has_exit = true;
+	else if (e->dest != loop->header
+		 && e->dest != loop->latch)
+	  block_precedes_code = true;
+      if (block_has_exit && block_precedes_code)
+        return false;
+    }
+  free (body);
+
   return true;
 }
 
@@ -162,21 +185,48 @@ const pass_data pass_data_ch =
   0, /* todo_flags_finish */
 };
 
-class pass_ch : public gimple_opt_pass
+const pass_data pass_data_ch_vect =
+{
+  GIMPLE_PASS, /* type */
+  "ch_vect", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_CH, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_ch_vect : public gimple_opt_pass
 {
 public:
-  pass_ch (gcc::context *ctxt)
-    : gimple_opt_pass (pass_data_ch, ctxt)
+  pass_ch_vect (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_ch_vect, ctxt)
   {}
 
   /* opt_pass methods: */
   virtual bool gate (function *) { return flag_tree_ch != 0; }
   virtual unsigned int execute (function *);
+protected:
+  pass_ch_vect (const pass_data data, gcc::context *ctxt)
+    : gimple_opt_pass (data, ctxt)
+  {}
+}; // class pass_ch_vect
 
+/* Adds a call to loop_optimizer_init before it executes,
+   and loop_optimizer_finalize after.  */
+class pass_ch : public pass_ch_vect
+{
+public:
+  pass_ch (gcc::context *ctxt) : pass_ch_vect (pass_data_ch, ctxt)
+  {}
+
+  virtual unsigned int execute (function *);
 }; // class pass_ch
 
 unsigned int
-pass_ch::execute (function *fun)
+pass_ch_vect::execute (function *fun)
 {
   struct loop *loop;
   basic_block header;
@@ -186,13 +236,8 @@ pass_ch::execute (function *fun)
   unsigned bbs_size;
   bool changed = false;
 
-  loop_optimizer_init (LOOPS_HAVE_PREHEADERS
-		       | LOOPS_HAVE_SIMPLE_LATCHES);
   if (number_of_loops (fun) <= 1)
-    {
-      loop_optimizer_finalize ();
       return 0;
-    }
 
   bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (fun));
   copied_bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (fun));
@@ -296,17 +341,35 @@ pass_ch::execute (function *fun)
       changed = true;
     }
 
-  update_ssa (TODO_update_ssa);
+  if (changed)
+    update_ssa (TODO_update_ssa);
   free (bbs);
   free (copied_bbs);
 
-  loop_optimizer_finalize ();
   return changed ? TODO_cleanup_cfg : 0;
 }
 
+unsigned int
+pass_ch::execute (function *fun)
+{
+  loop_optimizer_init (LOOPS_HAVE_PREHEADERS
+		       | LOOPS_HAVE_SIMPLE_LATCHES);
+
+  unsigned int res = pass_ch_vect::execute (fun);
+
+  loop_optimizer_finalize ();
+  return res;
+}
+
 } // anon namespace
 
 gimple_opt_pass *
+make_pass_ch_vect (gcc::context *ctxt)
+{
+  return new pass_ch_vect (ctxt);
+}
+
+gimple_opt_pass *
 make_pass_ch (gcc::context *ctxt)
 {
   return new pass_ch (ctxt);
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index ccb8f97..a5029bb 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -234,7 +234,9 @@ unsigned int
 pass_tree_loop_init::execute (function *fun ATTRIBUTE_UNUSED)
 {
   loop_optimizer_init (LOOPS_NORMAL
-		       | LOOPS_HAVE_RECORDED_EXITS);
+		       | LOOPS_HAVE_RECORDED_EXITS
+		       | LOOPS_HAVE_PREHEADERS
+		       | LOOPS_HAVE_SIMPLE_LATCHES);
   rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
 
   /* We might discover new loops, e.g. when turning irreducible

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH/RFC] Make loop-header-copying more aggressive, rerun before tree-if-conversion
  2015-05-22 15:46 [PATCH/RFC] Make loop-header-copying more aggressive, rerun before tree-if-conversion Alan Lawrence
@ 2015-05-27 16:01 ` Jeff Law
  2015-06-19 17:54   ` Alan Lawrence
  2015-05-28 12:11 ` Richard Biener
  1 sibling, 1 reply; 8+ messages in thread
From: Jeff Law @ 2015-05-27 16:01 UTC (permalink / raw)
  To: Alan Lawrence, gcc-patches; +Cc: Richard Biener

On 05/22/2015 09:42 AM, Alan Lawrence wrote:
>
> This patch does so (and makes slightly less conservative, to tackle the
> example above). I found I had to make this a separate pass, so that the
> phi nodes were cleaned up at the end of the pass before running
> tree_if_conversion. Also at this stage in the compiler (inside loop
> opts) it was not possible to run loop_optimizer_init+finalize, or other
> loop_optimizer data structures needed later would be deleted; hence, I
> have two nearly-but-not-quite-identical passes, the new "ch_vect"
> avoiding the init/finalize. I tried to tackle this with some C++
> subclassing, which removes the duplication, but the result feels a
> little ugly; suggestions for any neater approach welcome.
What PHI node cleanup needs to be done?  I don't doubt something's 
needed, but would like to understand the cleanup -- depending on what 
needs to be done, it may be the case that we can cleanup on-the-fly or 
it may point at a general issue we should be resolving prior to running 
tree_if_conversion.


>
> This patch causes failure of the scan-tree-dump of dom2 in
> gcc.dg/ssa/pr21417.c. This looks for jump-threading to perform an
> optimization, but no longer finds the expected line in the log - as the
> loop-header-copying phase has already done an equivalent transformation
> *before* dom2. The final CFG is thus in the desired form, but I'm not
> sure how to determine this (scanning the CFG itself is very difficult,
> well beyond what we can do with regex, requiring looking at multiple
> lines and basic blocks). Can anyone advise? [The test issue can be
> worked around by preserving the old do_while_p logic for the first
> header-copying pass, and using the new logic only for the second, but
> this is more awkward inside the compiler, which feels wrong.]
Don't we have a flag to turn off loop header copying?  If so, does 
adding that flag to the test "fix" it without resorting to something 
gross like preserving the old logic for the first pass and new logic for 
the second pass.

The refactoring to deal with being able to call into this without 
reinitializing the loop optimizer doesn't seem terrible to me.  One 
could argue that the loop optimizer init bits could become a property 
and managed by the pass manager.  I'm not sure that really simplifies 
anything though.

My biggest worry would be cases where data initialized by 
loop_optimizer_init gets invalidated by the header copying.  Have you 
looked at all at that possibility?  I don't have anything specific in 
mind to point you at -- just a general concern.

>
> Besides the new vect-ifcvt-11.c, the testsuite actually has a couple of
> other examples where this patch enables (undesired!) vectorization. I've
> dealt with these, but for the record:
Presumably undesired is within the scope of the testsuite, not 
necessarily in terms of the code we generate for real user code :-)

Overall it doesn't look bad to me...  Convince me it's safe WRT the 
loop_optimizer_init issue above and we'll have a clear path forward.

jeff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH/RFC] Make loop-header-copying more aggressive, rerun before tree-if-conversion
  2015-05-22 15:46 [PATCH/RFC] Make loop-header-copying more aggressive, rerun before tree-if-conversion Alan Lawrence
  2015-05-27 16:01 ` Jeff Law
@ 2015-05-28 12:11 ` Richard Biener
  2015-06-19 17:38   ` Alan Lawrence
  1 sibling, 1 reply; 8+ messages in thread
From: Richard Biener @ 2015-05-28 12:11 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches

On Fri, May 22, 2015 at 5:42 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
> This example which I wrote to test ifconversion, currently fails to
> if-convert or vectorize:
>
> int foo ()
> {
>   for (int i = 0; i < 32 ; i++)
>     {
>       int m = (a[i] & i) ? 5 : 4;
>       b[i] = a[i] * m;
>     }
> }
>
> ...because jump-threading in dom1 rearranged the loop into a form that
> neither if-conversion nor vectorization would attempt. Discussion at
> https://gcc.gnu.org/ml/gcc/2015-04/msg00343.html lead to the suggestion that
> I should rerun loop-header copying (an earlier attempt to fix ifconversion,
> https://gcc.gnu.org/ml/gcc-patches/2015-04/msg01743.html, still did not
> enable vectorization.)
>
> This patch does so (and makes slightly less conservative, to tackle the
> example above). I found I had to make this a separate pass, so that the phi
> nodes were cleaned up at the end of the pass before running
> tree_if_conversion. Also at this stage in the compiler (inside loop opts) it
> was not possible to run loop_optimizer_init+finalize, or other
> loop_optimizer data structures needed later would be deleted; hence, I have
> two nearly-but-not-quite-identical passes, the new "ch_vect" avoiding the
> init/finalize. I tried to tackle this with some C++ subclassing, which
> removes the duplication, but the result feels a little ugly; suggestions for
> any neater approach welcome.
>
> This patch causes failure of the scan-tree-dump of dom2 in
> gcc.dg/ssa/pr21417.c. This looks for jump-threading to perform an
> optimization, but no longer finds the expected line in the log - as the
> loop-header-copying phase has already done an equivalent transformation
> *before* dom2. The final CFG is thus in the desired form, but I'm not sure
> how to determine this (scanning the CFG itself is very difficult, well
> beyond what we can do with regex, requiring looking at multiple lines and
> basic blocks). Can anyone advise? [The test issue can be worked around by
> preserving the old do_while_p logic for the first header-copying pass, and
> using the new logic only for the second, but this is more awkward inside the
> compiler, which feels wrong.]
>
> Besides the new vect-ifcvt-11.c, the testsuite actually has a couple of
> other examples where this patch enables (undesired!) vectorization. I've
> dealt with these, but for the record:
>         * gcc.dg/vect/slp-perm-7.c: the initialization loop in main,
> contained a check that input[i] < 200; this was already optimized out
> (because input[i] was set to i%256, where i<N with N #defined to 16), but
> that loop was not vectorized because:
> /work/alalaw01/oban/srcfsf/gcc/gcc/testsuite/gcc.dg/vect/slp-perm-7.c:54:3:
> note: not vectorized: latch block not empty.
> /work/alalaw01/oban/srcfsf/gcc/gcc/testsuite/gcc.dg/vect/slp-perm-7.c:54:3:
> note: bad loop form.
>
>         * gcc.dg/vect/vect-strided-a-u16-i4.c: the main1() function has
> three loops; the first (initialization) has an 'if (y) abort() /* Avoid
> vectorization.  */'. However, the 'volatile int y = 0' this was meant to
> reference, is actually shadowed by a local non-volatile; the test is thus
> peeled off and absent from the body of the loop. The loop only avoided
> vectorization because of non-empty latch and bad loop form, as previous.
>
> With this patch, both those loops now have good form, hence I have fixed
> both to check a global volatile to prevent these extraneous parts from being
> vectorized.
>
> Tested with bootstrap + check-gcc on x86_64 and AArch64 (linux). As noted
> above, this causes a spurious PASS->FAIL of a scan-tree-dump test, which I'm
> unsure how to fix, but no other regressions.

Apart from Jeffs comment - the usual fix for the undesired
vectorization is to put
a __asm__ volatile (""); in the loop.

+  /* If any block in the loop has an exit edge, and code after it, it is
+     not a do-while loop.  */
+  basic_block *body = get_loop_body (loop);
+  for (unsigned i = 0; i < loop->num_nodes; i++)

wouldn't it be easier to verify that the predecessor of the loop latch
contains the (only) loop exit?

Like

   e = single_exit (loop);
   if (!e)
     return true;

   if (single_exit (loop)->pred != single_pred (loop->latch))
     return false;

?  In fact I think that even for multiple exists we want the latch predecessor
have an exit (though the vectorizer or if-conversion don't deal with that).

Note that single_exit () only works when the loop state has
LOOPS_HAVE_RECORDED_EXITS
thus it might be easier to simply check

  FOR_EACH_EDGE (... single_pred (loop->latch)->succs ..)
     if (e->dest == loop->latch)
       ;
     else
       break;
  if (!e || !loop_exit_edge_p (loop, e))
    return true;

which should work always.

Coding-style wise, can you please move the "common" pass_ch_vect::execute out
of the pass_ch_vect class?

  unsigned int res = pass_ch_vect::execute (fun);

looks ugly, as well as deriving pass_ch from pass_ch_vect.  I think pass_ch_vect
should be only executed if flag_tree_loop_vectorize is enabled.

   loop_optimizer_init (LOOPS_NORMAL
-                      | LOOPS_HAVE_RECORDED_EXITS);
+                      | LOOPS_HAVE_RECORDED_EXITS
+                      | LOOPS_HAVE_PREHEADERS
+                      | LOOPS_HAVE_SIMPLE_LATCHES);

already included in LOOPS_NORMAL.

Thanks,
Richard.

> gcc/ChangeLog:
>
>         * tree-pass.h (make_pass_ch_vect): New.
>         * passes.def: Add pass_ch_vect just before pass_if_conversion.
>
>         * tree-ssa-loop-ch.c (do_while_loop_p): For single-exit loops,
>         look for blocks with exit edge and code after them.
>         (pass_data_ch_vect, class pass_ch_vect, make_pass_ch_vect): New.
>         (class pass_ch): Extend pass_ch_vect.
>         (pass_ch::execute): Move all but loop_optimizer_init/finalize to...
>         (pass_ch_vect::execute): ...here.
>
>         * tree-ssa-loop.c (pass_tree_loop_init::execute): Add flags
>         LOOPS_HAVE_PREHEADERS and LOOPS_HAVE_SIMPLE_LATCHES.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/vect/slp-perm-7.c (zero): New.
>         (main): Test zero rather than input[i], to avoid vectorization.
>         * gcc.dg/vect/vect-strided-a-u16-i4.c (main1): Narrow scope of
> x,y,z,w.
>         of unsigned
>         * gcc.dg/vect/vect-ifcvt-11.c: New.
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH/RFC] Make loop-header-copying more aggressive, rerun before tree-if-conversion
  2015-05-28 12:11 ` Richard Biener
@ 2015-06-19 17:38   ` Alan Lawrence
  0 siblings, 0 replies; 8+ messages in thread
From: Alan Lawrence @ 2015-06-19 17:38 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener wrote:
> Apart from Jeffs comment - the usual fix for the undesired
> vectorization is to put
> a __asm__ volatile (""); in the loop.

In vect-strided-a-u16-i4.c, narrowing the scope of the declaration seemed to 
preserve the original intent. I've been able to drop the other testsuite changes.

> +  /* If any block in the loop has an exit edge, and code after it, it is
> +     not a do-while loop.  */
> +  basic_block *body = get_loop_body (loop);
> +  for (unsigned i = 0; i < loop->num_nodes; i++)
> 
> wouldn't it be easier to verify that the predecessor of the loop latch
> contains the (only) loop exit?

It's not guaranteed that the loop latch has only one predecessor. The testsuite 
contains quite a few examples, e.g. gcc.c-torture/compile/20011114.c (at -O3). 
However, I've found a simpler (and equivalent) test, as we have the unique exit 
edge and it's source already.

> Note that single_exit () only works when the loop state has
> LOOPS_HAVE_RECORDED_EXITS

Hah, thanks - didn't realize that. So using single_exit_p did make pass_ch 
behave differently from pass_ch_vect. I've restored the original code for the 
original pass_ch...

 > I think pass_ch_vect
 > should be only executed if flag_tree_loop_vectorize is enabled.

...agreed; and handling loop->force_vectorize and loop->dont_vectorize properly 
required splitting the two phases up more anyway, so I've used clearly-different 
predicates in each.

> Coding-style wise, can you please move the "common" pass_ch_vect::execute out
> of the pass_ch_vect class?

Yes, I've done some reorg, introducing a third base class with the common 
execute bits calling a virtual method returning bool.

>    loop_optimizer_init (LOOPS_NORMAL
> -                      | LOOPS_HAVE_RECORDED_EXITS);
> +                      | LOOPS_HAVE_RECORDED_EXITS
> +                      | LOOPS_HAVE_PREHEADERS
> +                      | LOOPS_HAVE_SIMPLE_LATCHES);
> 
> already included in LOOPS_NORMAL.

So it is. Thanks!

TYVM for the review - I've posted a v2 at 
https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01355.html .

Cheers, Alan

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH/RFC] Make loop-header-copying more aggressive, rerun before tree-if-conversion
  2015-05-27 16:01 ` Jeff Law
@ 2015-06-19 17:54   ` Alan Lawrence
  2015-06-23 20:34     ` Jeff Law
  0 siblings, 1 reply; 8+ messages in thread
From: Alan Lawrence @ 2015-06-19 17:54 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, Richard Biener

Jeff Law wrote:
> On 05/22/2015 09:42 AM, Alan Lawrence wrote:
>> This patch does so (and makes slightly less conservative, to tackle the
>> example above). I found I had to make this a separate pass, so that the
>> phi nodes were cleaned up at the end of the pass before running
>> tree_if_conversion.
> What PHI node cleanup needs to be done?  I don't doubt something's 
> needed, but would like to understand the cleanup -- depending on what 
> needs to be done, it may be the case that we can cleanup on-the-fly or 
> it may point at a general issue we should be resolving prior to running 
> tree_if_conversion.

If I change pass_ch_vect to return 0 rather than TODO_update_cfg, my testcase gives:

foo ()
{
   int m;
   int i;
   unsigned int ivtmp_3;
   int _5;
   int _6;
   int _7;
   int _10;
   int _12;
   unsigned int ivtmp_13;
   int _15;
   unsigned int ivtmp_19;
   int _20;
   unsigned int ivtmp_23;
   unsigned int ivtmp_27;

   <bb 2>:
   _10 = A[0];

   <bb 8>:
   # m_11 = PHI <4(2)>
   # _12 = PHI <_10(2)>
   # i_1 = PHI <0(2)>
   # ivtmp_19 = PHI <16(2)>
   _20 = m_11 * _12;
   A[i_1] = _20;
   i_22 = i_1 + 1;
   ivtmp_23 = ivtmp_19 - 1;
   if (ivtmp_23 != 0)
     goto <bb 9>;
   else
     goto <bb 7>;

   <bb 9>:

   <bb 3>:
   # i_26 = PHI <i_9(10), i_22(9)>
   # ivtmp_27 = PHI <ivtmp_13(10), ivtmp_23(9)>
   _5 = A[i_26];
   _6 = _5 & i_26;
   if (_6 != 0)
     goto <bb 5>;
   else
     goto <bb 4>;

   <bb 4>:

   <bb 5>:
   # m_14 = PHI <5(3), 4(4)>

   <bb 6>:
   # m_2 = PHI <m_14(5)>
   # _15 = PHI <_5(5)>
   # i_16 = PHI <i_26(5)>
   # ivtmp_3 = PHI <ivtmp_27(5)>
   _7 = m_2 * _15;
   A[i_16] = _7;
   i_9 = i_16 + 1;
   ivtmp_13 = ivtmp_3 - 1;
   if (ivtmp_13 != 0)
     goto <bb 10>;
   else
     goto <bb 7>;

   <bb 10>:
   goto <bb 3>;

   <bb 7>:
   return;

}

if-conversion then bails out in if_convertible_phi_p, with a phi (whose result 
is a virtual operand, specifically an SSA name) used as operand to another phi 
("Difficult to handle this virtual phi" in tree-if-conv.c): see m_2, i_16. 
Returning TODO_update_cfg, causes merging of blocks 2 and 8, and blocks 5 and 6, 
giving instead:

foo ()
{
   int m;
   int i;
   int _5;
   int _6;
   int _7;
   int _10;
   unsigned int ivtmp_13;
   int _20;
   unsigned int ivtmp_23;
   unsigned int ivtmp_27;

   <bb 2>:
   _10 = A[0];
   _20 = _10 * 4;
   A[0] = _20;
   i_22 = 1;
   ivtmp_23 = 15;
   if (ivtmp_23 != 0)
     goto <bb 3>;
   else
     goto <bb 8>;

   <bb 3>:

   <bb 4>:
   # i_26 = PHI <i_9(7), i_22(3)>
   # ivtmp_27 = PHI <ivtmp_13(7), ivtmp_23(3)>
   _5 = A[i_26];
   _6 = _5 & i_26;
   if (_6 != 0)
     goto <bb 6>;
   else
     goto <bb 5>;

   <bb 5>:

   <bb 6>:
   # m_14 = PHI <5(4), 4(5)>
   _7 = _5 * m_14;
   A[i_26] = _7;
   i_9 = i_26 + 1;
   ivtmp_13 = ivtmp_27 + 4294967295;
   if (ivtmp_13 != 0)
     goto <bb 7>;
   else
     goto <bb 8>;

   <bb 7>:
   goto <bb 4>;

   <bb 8>:
   return;

}

and one can see that the troublesome phi's are gone.

> Don't we have a flag to turn off loop header copying?  If so, does 
> adding that flag to the test "fix" it  without resorting to something
> gross like preserving the old logic for the first pass and new logic for 
> the second pass.

Ah, yes, thanks, -fno-tree-ch works. In fact here the problem was caused by the 
second pass, which following Richard's suggestion, I've now gated with 
-ftree-vectorize also - which is turned off by default at -O2, so the test is 
passing without changes.

However, it turns out I was implicitly using different logic in the two passes: 
the 'if (!single_exit(loop)) return true' always bailed out in the first 
pass_ch, as single_exit-ness is not known in the absence of 
LOOPS_HAVE_RECORDED_EXITS, so the first pass_ch was still using the original 
logic anyway! I've now had to refactor the two classes to allow different code 
anyway, which makes the distinction clear.

(I also tried a more complicated version of do_while_loop_p that computed it's 
own single_exit criteria to have that in the first pass; this gave more test 
failures, which are fixable, but it's not really the purpose of this patch to do 
additional header-copying when we are not vectorizing.)

> My biggest worry would be cases where data initialized by 
> loop_optimizer_init gets invalidated by the header copying.  Have you 
> looked at all at that possibility?  I don't have anything specific in 
> mind to point you at -- just a general concern.

Well, this code from tree-ssa-loop-ch.c appears to update the preheader and loop 
latch (I've verified with printf's that loop->latch, loop->header, 
loop_preheader_edge(loop) and loop_latch_edge(loop) are all sensible after this):

/* Ensure that the latch and the preheader is simple (we know that they
	 are not now, since there was the loop exit condition.  */
       split_edge (loop_preheader_edge (loop));
       split_edge (loop_latch_edge (loop));

and I think the exit edges don't change (they get cloned outside the loop). The 
one remaining worry might be irreducible code, but I believe this should have 
been removed by the stage pass_ch_vect runs: I've also run an entire bootstrap + 
testsuite with both (a) an assertion in pass_ch_vect::process_loop_p that none 
of the blocks in the loop are BB_IRREDUCIBLE, and (b) a call to 
verify_loop_structure at the end of pass_ch_base::copy_headers.

I agree this isn't (/cannot be) totally definitive, so if you are not 
sufficiently reassured - would you be if I called loop_optimizer_init 
(LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS) at the end of pass_ch_vect, redoing 
the setup done in pass_tree_loop_init::execute?

Patch v2 at https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01355.html .

Thanks for the review!

--Alan

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH/RFC] Make loop-header-copying more aggressive, rerun before tree-if-conversion
  2015-06-19 17:54   ` Alan Lawrence
@ 2015-06-23 20:34     ` Jeff Law
  2015-06-30 16:10       ` Alan Lawrence
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Law @ 2015-06-23 20:34 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches, Richard Biener

On 06/19/2015 11:38 AM, Alan Lawrence wrote:
> Jeff Law wrote:
>> On 05/22/2015 09:42 AM, Alan Lawrence wrote:
>>> This patch does so (and makes slightly less conservative, to tackle the
>>> example above). I found I had to make this a separate pass, so that the
>>> phi nodes were cleaned up at the end of the pass before running
>>> tree_if_conversion.
>> What PHI node cleanup needs to be done?  I don't doubt something's
>> needed, but would like to understand the cleanup -- depending on what
>> needs to be done, it may be the case that we can cleanup on-the-fly or
>> it may point at a general issue we should be resolving prior to
>> running tree_if_conversion.
>
> If I change pass_ch_vect to return 0 rather than TODO_update_cfg, my
> testcase gives:
Thanks.  Does running the phi-only propagator after the loop header 
copying help?  At first glance it would seem that it ought to propagate 
the values of those degenerate PHIs then eliminate those PHIs.

It was written to cleanup after jump threading which has a tendency to 
create very similar code to what you've shown below and to do so very 
quickly.


/* A very simple pass to eliminate degenerate PHI nodes from the
    IL.  This is meant to be fast enough to be able to be run several
    times in the optimization pipeline.

    Certain optimizations, particularly those which duplicate blocks
    or remove edges from the CFG can create or expose PHIs which are
    trivial copies or constant initializations.

    While we could pick up these optimizations in DOM or with the
    combination of copy-prop and CCP, those solutions are far too
    heavy-weight for our needs.

    This implementation has two phases so that we can efficiently
    eliminate the first order degenerate PHIs and second order
    degenerate PHIs.

    The first phase performs a dominator walk to identify and eliminate
    the vast majority of the degenerate PHIs.  When a degenerate PHI
    is identified and eliminated any affected statements or PHIs
    are put on a worklist.

    The second phase eliminates degenerate PHIs and trivial copies
    or constant initializations using the worklist.  This is how we
    pick up the secondary optimization opportunities with minimal
    cost.  */

> I agree this isn't (/cannot be) totally definitive, so if you are not
> sufficiently reassured - would you be if I called loop_optimizer_init
> (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS) at the end of pass_ch_vect,
> redoing the setup done in pass_tree_loop_init::execute?
As I mentioned, I didn't have anything specific in mind, just a general 
concern.   No way to be totally definitive here.  I think you've done 
suitable due diligence.


>
> Patch v2 at https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01355.html .
Thanks.  It's in the queue again.  Two weeks of PTO has made that queue 
much deeper than I'd like, but I'm making progress :-)

jeff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH/RFC] Make loop-header-copying more aggressive, rerun before tree-if-conversion
  2015-06-23 20:34     ` Jeff Law
@ 2015-06-30 16:10       ` Alan Lawrence
  2015-07-01 17:06         ` Jeff Law
  0 siblings, 1 reply; 8+ messages in thread
From: Alan Lawrence @ 2015-06-30 16:10 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, Richard Biener

Jeff Law wrote:
> Thanks.  Does running the phi-only propagator after the loop header 
> copying help?  At first glance it would seem that it ought to propagate 
> the values of those degenerate PHIs then eliminate those PHIs.
> 
> It was written to cleanup after jump threading which has a tendency to 
> create very similar code to what you've shown below and to do so very 
> quickly.

Thanks for the tip - this fixes up some examples, but not at all. Other examples 
require also a call to rewrite_into_loop_closed_ssa and recomputing 
dominators...maybe I can get everything to work with all of those, but my 
feeling is to keep it as a pass: if the first pass_ch justifies being a pass in 
its own right, then surely a *more aggressive* version of that, does too...

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH/RFC] Make loop-header-copying more aggressive, rerun before tree-if-conversion
  2015-06-30 16:10       ` Alan Lawrence
@ 2015-07-01 17:06         ` Jeff Law
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff Law @ 2015-07-01 17:06 UTC (permalink / raw)
  To: Alan Lawrence; +Cc: gcc-patches, Richard Biener

On 06/30/2015 10:04 AM, Alan Lawrence wrote:
> Jeff Law wrote:
>> Thanks.  Does running the phi-only propagator after the loop header
>> copying help?  At first glance it would seem that it ought to
>> propagate the values of those degenerate PHIs then eliminate those PHIs.
>>
>> It was written to cleanup after jump threading which has a tendency to
>> create very similar code to what you've shown below and to do so very
>> quickly.
>
> Thanks for the tip - this fixes up some examples, but not at all. Other
> examples require also a call to rewrite_into_loop_closed_ssa and
> recomputing dominators...maybe I can get everything to work with all of
> those, but my feeling is to keep it as a pass: if the first pass_ch
> justifies being a pass in its own right, then surely a *more aggressive*
> version of that, does too...
Works for me.  Thanks for investigating.

jeff

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-07-01 17:06 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-22 15:46 [PATCH/RFC] Make loop-header-copying more aggressive, rerun before tree-if-conversion Alan Lawrence
2015-05-27 16:01 ` Jeff Law
2015-06-19 17:54   ` Alan Lawrence
2015-06-23 20:34     ` Jeff Law
2015-06-30 16:10       ` Alan Lawrence
2015-07-01 17:06         ` Jeff Law
2015-05-28 12:11 ` Richard Biener
2015-06-19 17:38   ` Alan Lawrence

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).