[RFC] More jump threading restrictions in the presence of loops.

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [RFC] More jump threading restrictions in the presence of loops.
@ 2021-10-04  9:43 Aldy Hernandez
  2021-10-04 13:31 ` Jeff Law
  2021-10-06 10:22 ` Aldy Hernandez
  0 siblings, 2 replies; 27+ messages in thread
From: Aldy Hernandez @ 2021-10-04  9:43 UTC (permalink / raw)
  To: Jeff Law, Richard Biener, matz
  Cc: Andrew MacLeod, GCC patches, Aldy Hernandez

It is frustrating that virtually all the regressions with the hybrid
threader for VRP, have not been with the engine itself, but with the
independent restrictions we have agreed upon.

The following patch is a collection of discussions with Richi, Jeff,
and Michael Matz regarding jump threading limitations in the presence
of loops, that I hope can lead to further refinements.

As I have mentioned before, most of our threading tests are too
fragile, so in this patch I have distilled various restrictions into
gimple FE tests that I hope can help in maintaining the threader going
forward.  The goal is to have one test with no valid threads
whatsover, and one with exclusively one valid thread per function.
This should make it trivial to maintain this going forward.

I would like to request the relevant experts to not only examine the
patch, but review the tests in this patch, to make sure we agree upon
these restrictions.  I have distilled the smallest possible test for
each restriction and have annotated said tests to make reviewing easy.

Note that the test in ssa-thread-valid.c is a thread that Jeff has
suggested should be an exception to the path crossing loops
restriction, but I have not implemented it yet, because even if we
could loosen the restriction, it would violate Richi's restriction of
a path that rotates a loop.  Comments are highly welcome.

By the way, these restrictions trigger *a lot*.  We seem to be
rotating loops left and right.  So I wouldn't be surprised if this
requires (as usual) a lot of test tweaking.

Untested patch follows.

p.s. Note that I'm just facilitating the discussion.  I'm highly
dependent on the loop experts here ;-).

gcc/ChangeLog:

	* tree-ssa-threadupdate.c
	(jt_path_registry::cancel_invalid_paths): Tweak.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ssa-thread-invalid.c: New test.
	* gcc.dg/tree-ssa/ssa-thread-valid.c: New test.
---
 .../gcc.dg/tree-ssa/ssa-thread-invalid.c      | 102 ++++++++++++++++++
 .../gcc.dg/tree-ssa/ssa-thread-valid.c        |  57 ++++++++++
 gcc/tree-ssa-threadupdate.c                   |  29 ++++-
 3 files changed, 186 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-valid.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c
new file mode 100644
index 00000000000..bd56a62a4b4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c
@@ -0,0 +1,102 @@
+// { dg-do compile }
+// { dg-options "-O2 -fgimple -fdump-statistics" }
+//
+// This is a collection of seemingly threadble paths that should not be allowed.
+
+void foobar (int);
+
+// Possible thread from 2->4->3, but it would rotate the loop.
+void __GIMPLE (ssa)
+f1 ()
+{
+  int i;
+
+  // Pre-header.
+  __BB(2):
+  goto __BB4;
+
+  // Latch.
+  __BB(3):
+  foobar (i_1);
+  i_5 = i_1 + 1;
+  goto __BB4;
+
+  __BB(4,loop_header(1)):
+  i_1 = __PHI (__BB2: 0, __BB3: i_5);
+  if (i_1 != 101)
+    goto __BB3;
+  else
+    goto __BB5;
+
+  __BB(5):
+  return;
+
+}
+
+// Possible thread from 2->3->5 but threading through the empty latch
+// would create a non-empty latch.
+void __GIMPLE (ssa)
+f2 ()
+{
+  int i;
+
+  // Pre-header.
+  __BB(2):
+  goto __BB3;
+
+  __BB(3,loop_header(1)):
+  i_8 = __PHI (__BB5: i_5, __BB2: 0);
+  foobar (i_8);
+  i_5 = i_8 + 1;
+  if (i_5 != 256)
+    goto __BB5;
+  else
+    goto __BB4;
+
+  // Latch.
+  __BB(5):
+  goto __BB3;
+
+  __BB(4):
+  return;
+
+}
+
+// Possible thread from 3->5->6->3 but this would thread through the
+// header but not exit the loop.
+int __GIMPLE (ssa)
+f3 (int a)
+{
+  int i;
+
+  __BB(2):
+  goto __BB6;
+
+  __BB(3):
+  if (i_1 != 0)
+    goto __BB4;
+  else
+    goto __BB5;
+
+  __BB(4):
+  foobar (5);
+  goto __BB5;
+
+  // Latch.
+  __BB(5):
+  i_7 = i_1 + 1;
+  goto __BB6;
+
+  __BB(6,loop_header(1)):
+  i_1 = __PHI (__BB2: 1, __BB5: i_7);
+  if (i_1 <= 99)
+    goto __BB3;
+  else
+    goto __BB7;
+
+  __BB(7):
+  return;
+
+}
+
+// { dg-final { scan-tree-dump-not "Jumps threaded" "statistics" } }
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-valid.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-valid.c
new file mode 100644
index 00000000000..e10e877d01a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-valid.c
@@ -0,0 +1,57 @@
+// { dg-do compile }
+// { dg-options "-O2 -fgimple -fdump-statistics" }
+//
+// This is a collection of threadable paths.  To simplify maintenance,
+// there should only be one threadable path per function.
+
+int global;
+
+// ** FIXME: This currently fails**
+//
+// There is a path crossing loops through 3->4->5.
+//
+// Jeff has stipulated that...
+//
+// This might be an exception since this goes from inside the loop to
+// outside the loop without entering another loop.  That is, we have
+// found an early exit from the loop that doesn't require us to go
+// through the latch.  In fact, the whole threaded path is logically
+// outside the loop.  So the primary effect is to reduce the number of
+// branches necessary when we exit the loop via this path.
+//
+// This has nice secondary effects.  For example, objects on the
+// threaded path will no longer necessarily be live throughout the
+// loop -- so we can get register allocation improvements.  The
+// threaded path can physically move outside the loop resulting in
+// better icache efficiency, etc.
+//
+// The question we'd have to answer is whether or not exposing this
+// alternate exit path mucks up other loop optimizations, but if we
+// restrict to after the loop optimizer is done, then that's a
+// non-issue.
+int __GIMPLE (ssa)
+f1 (int x)
+{
+  int a;
+
+  __BB(2):
+  a_4 = ~x_3(D);
+  goto __BB4;
+
+  // Latch.
+  __BB(3):
+  global = a_1;
+  goto __BB4;
+
+  __BB(4,loop_header(1)):
+  a_1 = __PHI (__BB2: a_4, __BB3: 0);
+  if (a_1 != 0)
+    goto __BB3;
+  else
+    goto __BB5;
+
+  __BB(5):
+  return;
+}
+
+// { dg-final { scan-tree-dump "Jumps threaded: 1" "statistics" { xfail *-*-* } } }
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index dcabfdb30d2..414ebabe895 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -2766,10 +2766,12 @@ bool
 jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
 {
   gcc_checking_assert (!path.is_empty ());
-  edge taken_edge = path[path.length () - 1]->e;
-  loop_p loop = taken_edge->src->loop_father;
+  edge entry = path[0]->e;
+  edge exit = path[path.length () - 1]->e;
+  loop_p loop = exit->src->loop_father;
   bool seen_latch = false;
   bool path_crosses_loops = false;
+  bool path_crosses_loop_header = false;
 
   for (unsigned int i = 0; i < path.length (); i++)
     {
@@ -2793,6 +2795,14 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
 	  || e->dest->loop_father != loop)
 	path_crosses_loops = true;
 
+      // ?? Avoid threading through loop headers that remain in the
+      // loop, as such threadings tend to create sub-loops which
+      // _might_ be OK ??.
+      if (e->dest->loop_father->header == e->dest
+	  && !flow_loop_nested_p (exit->dest->loop_father,
+				  e->dest->loop_father))
+	path_crosses_loop_header = true;
+
       if (flag_checking && !m_backedge_threads)
 	gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0);
     }
@@ -2811,6 +2821,21 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
       cancel_thread (&path, "Path crosses loops");
       return true;
     }
+  // The path should either start and end in the same loop or exit the
+  // loop it starts in but never enter a loop.  This also catches
+  // creating irreducible loops, not only rotation.
+  if (entry->src->loop_father != exit->dest->loop_father
+      && !flow_loop_nested_p (exit->src->loop_father,
+			      entry->dest->loop_father))
+    {
+      cancel_thread (&path, "Path rotates loop");
+      return true;
+    }
+  if (path_crosses_loop_header)
+    {
+      cancel_thread (&path, "Path crosses loop header but does not exit it");
+      return true;
+    }
   return false;
 }
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-04  9:43 [RFC] More jump threading restrictions in the presence of loops Aldy Hernandez
@ 2021-10-04 13:31 ` Jeff Law
  2021-10-04 13:36   ` Aldy Hernandez
  2021-10-06 10:22 ` Aldy Hernandez
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff Law @ 2021-10-04 13:31 UTC (permalink / raw)
  To: Aldy Hernandez, Richard Biener, matz; +Cc: Andrew MacLeod, GCC patches



On 10/4/2021 3:43 AM, Aldy Hernandez wrote:
> It is frustrating that virtually all the regressions with the hybrid
> threader for VRP, have not been with the engine itself, but with the
> independent restrictions we have agreed upon.
>
> The following patch is a collection of discussions with Richi, Jeff,
> and Michael Matz regarding jump threading limitations in the presence
> of loops, that I hope can lead to further refinements.
>
> As I have mentioned before, most of our threading tests are too
> fragile, so in this patch I have distilled various restrictions into
> gimple FE tests that I hope can help in maintaining the threader going
> forward.  The goal is to have one test with no valid threads
> whatsover, and one with exclusively one valid thread per function.
> This should make it trivial to maintain this going forward.
>
> I would like to request the relevant experts to not only examine the
> patch, but review the tests in this patch, to make sure we agree upon
> these restrictions.  I have distilled the smallest possible test for
> each restriction and have annotated said tests to make reviewing easy.
>
> Note that the test in ssa-thread-valid.c is a thread that Jeff has
> suggested should be an exception to the path crossing loops
> restriction, but I have not implemented it yet, because even if we
> could loosen the restriction, it would violate Richi's restriction of
> a path that rotates a loop.  Comments are highly welcome.
>
> By the way, these restrictions trigger *a lot*.  We seem to be
> rotating loops left and right.  So I wouldn't be surprised if this
> requires (as usual) a lot of test tweaking.
>
> Untested patch follows.
>
> p.s. Note that I'm just facilitating the discussion.  I'm highly
> dependent on the loop experts here ;-).
>
> gcc/ChangeLog:
>
> 	* tree-ssa-threadupdate.c
> 	(jt_path_registry::cancel_invalid_paths): Tweak.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.dg/tree-ssa/ssa-thread-invalid.c: New test.
> 	* gcc.dg/tree-ssa/ssa-thread-valid.c: New test.
Let me throw that into the tester and see what pops.

Jeff


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-04 13:31 ` Jeff Law
@ 2021-10-04 13:36   ` Aldy Hernandez
  2021-10-04 15:30     ` Jeff Law
  0 siblings, 1 reply; 27+ messages in thread
From: Aldy Hernandez @ 2021-10-04 13:36 UTC (permalink / raw)
  To: Jeff Law; +Cc: Richard Biener, Michael Matz, Andrew MacLeod, GCC patches

Note that none of the other tests have been adjusted so it'll likely result
in multiple threading regressions.

Thanks for looking at this.

Aldy

On Mon, Oct 4, 2021, 15:31 Jeff Law <jeffreyalaw@gmail.com> wrote:

>
>
> On 10/4/2021 3:43 AM, Aldy Hernandez wrote:
> > It is frustrating that virtually all the regressions with the hybrid
> > threader for VRP, have not been with the engine itself, but with the
> > independent restrictions we have agreed upon.
> >
> > The following patch is a collection of discussions with Richi, Jeff,
> > and Michael Matz regarding jump threading limitations in the presence
> > of loops, that I hope can lead to further refinements.
> >
> > As I have mentioned before, most of our threading tests are too
> > fragile, so in this patch I have distilled various restrictions into
> > gimple FE tests that I hope can help in maintaining the threader going
> > forward.  The goal is to have one test with no valid threads
> > whatsover, and one with exclusively one valid thread per function.
> > This should make it trivial to maintain this going forward.
> >
> > I would like to request the relevant experts to not only examine the
> > patch, but review the tests in this patch, to make sure we agree upon
> > these restrictions.  I have distilled the smallest possible test for
> > each restriction and have annotated said tests to make reviewing easy.
> >
> > Note that the test in ssa-thread-valid.c is a thread that Jeff has
> > suggested should be an exception to the path crossing loops
> > restriction, but I have not implemented it yet, because even if we
> > could loosen the restriction, it would violate Richi's restriction of
> > a path that rotates a loop.  Comments are highly welcome.
> >
> > By the way, these restrictions trigger *a lot*.  We seem to be
> > rotating loops left and right.  So I wouldn't be surprised if this
> > requires (as usual) a lot of test tweaking.
> >
> > Untested patch follows.
> >
> > p.s. Note that I'm just facilitating the discussion.  I'm highly
> > dependent on the loop experts here ;-).
> >
> > gcc/ChangeLog:
> >
> >       * tree-ssa-threadupdate.c
> >       (jt_path_registry::cancel_invalid_paths): Tweak.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.dg/tree-ssa/ssa-thread-invalid.c: New test.
> >       * gcc.dg/tree-ssa/ssa-thread-valid.c: New test.
> Let me throw that into the tester and see what pops.
>
> Jeff
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-04 13:36   ` Aldy Hernandez
@ 2021-10-04 15:30     ` Jeff Law
  2021-10-04 16:29       ` Michael Matz
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff Law @ 2021-10-04 15:30 UTC (permalink / raw)
  To: Aldy Hernandez; +Cc: Richard Biener, Michael Matz, Andrew MacLeod, GCC patches



On 10/4/2021 7:36 AM, Aldy Hernandez wrote:
> Note that none of the other tests have been adjusted so it'll likely 
> result in multiple threading regressions.
Yea ;-)  So maybe we first focus on how those affect visium since that's 
what started us down this path.

The original test of regressions for visium were:
visium-sim: gcc.c-torture/execute/960218-1.c   -Os  (test for excess errors)
visium-sim: gcc.c-torture/execute/961125-1.c   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for 
excess errors)
visium-sim: gcc.c-torture/execute/961125-1.c   -O3 -g  (test for excess 
errors)
visium-sim: gcc.c-torture/execute/pending-4.c   -O1  (test for excess 
errors)
visium-sim: gcc.c-torture/execute/pr58209.c   -O2  (test for excess errors)
visium-sim: gcc.c-torture/execute/pr58209.c   -O2 -flto 
-fno-use-linker-plugin -flto-partition=none  (test for excess errors)
visium-sim: gcc.c-torture/execute/pr58209.c   -O2 -flto 
-fuse-linker-plugin -fno-fat-lto-objects  (test for excess errors)
visium-sim: gcc.c-torture/execute/pr58209.c   -O3 -g  (test for excess 
errors)
visium-sim: gcc.c-torture/execute/pr68911.c   -O1  (test for excess errors)

I'm pretty confident these are all cases where we previously threaded 
one or more jumps and as a result we were able to remove all calls to 
abort ().   Your change doesn't affect any of those :(

So probably the first thing to do is get a patch that fixes one or more 
of those failures *or* make a determination that one or more of those 
tests are things we do not want to optimize.   The one we had previously 
looked at (960218-1)  had a thread path from inside the loop to a point 
outside the loop.  My sense is we probably want to allow that.

And just in case it got lost.  Here's the analysis on 960218-1 on visium:

We've got this going into ethread:

int f (int x)
{
   int D.1420;
   int a;

;;   basic block 2, loop depth 0, maybe hot
;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
;;    pred:       ENTRY (FALLTHRU,EXECUTABLE)
   a_4 = ~x_3(D);
   goto <bb 4>; [INV]
;;    succ:       4 (FALLTHRU,EXECUTABLE)

;;   basic block 3, loop depth 1, maybe hot
;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
;;    pred:       4 (TRUE_VALUE,EXECUTABLE)
   gl = a_1;
;;    succ:       4 (FALLTHRU,DFS_BACK,EXECUTABLE)

;;   basic block 4, loop depth 1, maybe hot
;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
;;    pred:       2 (FALLTHRU,EXECUTABLE)
;;                3 (FALLTHRU,DFS_BACK,EXECUTABLE)
   # a_1 = PHI <a_4(2), 0(3)>
   if (a_1 != 0)
     goto <bb 3>; [INV]
   else
     goto <bb 5>; [INV]
;;    succ:       3 (TRUE_VALUE,EXECUTABLE)
;;                5 (FALSE_VALUE,EXECUTABLE)

;;   basic block 5, loop depth 0, maybe hot
;;    prev block 4, next block 1, flags: (NEW, REACHABLE, VISITED)
;;    pred:       4 (FALSE_VALUE,EXECUTABLE)
   return;
;;    succ:       EXIT

}


There's a pretty obvious jump thread in there.  3->4->5.  We used to 
allow this, but it's currently being rejected and I'm not sure it should.

In simplest terms we're threading from inside the loop, through the 
latch to a point outside the loop.  At first glance, that seems safe.

960218-1.c, compiled with -Os

int gl;

g (x)
{
   gl = x;
   return 0;
}

f (x)
{
   int a = ~x;
   while (a)
     a = g (a);
}

main ()
{
   f (3);
   if (gl != -4)
     abort ();
   exit (0);
}


Ideally in the final assembly there would be no calls to abort since it 
can be determined at compile-time that the call can never happen.

jeff


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-04 15:30     ` Jeff Law
@ 2021-10-04 16:29       ` Michael Matz
  2021-10-05 11:22         ` Richard Biener
  2021-10-05 13:33         ` Aldy Hernandez
  0 siblings, 2 replies; 27+ messages in thread
From: Michael Matz @ 2021-10-04 16:29 UTC (permalink / raw)
  To: Jeff Law; +Cc: Aldy Hernandez, Richard Biener, Andrew MacLeod, GCC patches

Hello,

On Mon, 4 Oct 2021, Jeff Law wrote:

> And just in case it got lost.  Here's the analysis on 960218-1 on visium:
> 
> We've got this going into ethread:
> 
> int f (int x)
> {
>   int D.1420;
>   int a;
> 
> ;;   basic block 2, loop depth 0, maybe hot
> ;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       ENTRY (FALLTHRU,EXECUTABLE)
>   a_4 = ~x_3(D);
>   goto <bb 4>; [INV]
> ;;    succ:       4 (FALLTHRU,EXECUTABLE)
> 
> ;;   basic block 3, loop depth 1, maybe hot
> ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       4 (TRUE_VALUE,EXECUTABLE)
>   gl = a_1;
> ;;    succ:       4 (FALLTHRU,DFS_BACK,EXECUTABLE)
> 
> ;;   basic block 4, loop depth 1, maybe hot
> ;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       2 (FALLTHRU,EXECUTABLE)
> ;;                3 (FALLTHRU,DFS_BACK,EXECUTABLE)
>   # a_1 = PHI <a_4(2), 0(3)>
>   if (a_1 != 0)
>     goto <bb 3>; [INV]
>   else
>     goto <bb 5>; [INV]
> ;;    succ:       3 (TRUE_VALUE,EXECUTABLE)
> ;;                5 (FALSE_VALUE,EXECUTABLE)
> 
> ;;   basic block 5, loop depth 0, maybe hot
> ;;    prev block 4, next block 1, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       4 (FALSE_VALUE,EXECUTABLE)
>   return;
> ;;    succ:       EXIT
> 
> }

First notice that this doesn't have an empty latch block to start with 
(in fact, there is no separate latch block at all), so the loop optimizers 
haven't been initialized for simple latches at this point.  Still, I would 
say that even then we want to be careful with the latch edge, as putting 
code onto it will most probably create a problem downstream once we _do_ 
want to intialize the loop optimizers for simple latches.  So, that we are 
careful here is okay, we are just too careful in this specific case.

> There's a pretty obvious jump thread in there.  3->4->5.
> 
> We used to allow this, but it's currently being rejected and I'm not 
> sure it should.
> 
> In simplest terms we're threading from inside the loop, through the 
> latch to a point outside the loop.  At first glance, that seems safe.

Is at least the unrestricted jump threader (the one after loop optimizers) 
picking this up?

Independend of that, I agree, that this specific instance of threading 
through the latch block, even though followed by to-be-copied 
instructions, is okay.  We need to determine precisely when that is okay, 
though.  In this case there are several reasons I could see why this is 
so:
(a) while the thread is through the latch block, it's _not_ through the
    latch edge (which is 4->3).  That would create the problem, so here
    we're fine.
(b) even if it were through the latch edge, and would be followed by 
    to-be-copied instructions (and hence fill the latch edge) the ultimate 
    end of the path is outside the loop, so all the edges and blocks that 
    would now carry instructions would also be outside the loop, and hence 
    be no problem

Now, capture this precisely ;-)

I think something like so: we have a candidate path

  S -> L -> B1 (-> B2 ... Bn)

(Start, Latch, Blocks 1 to n following the latch).  (I think in our 
notation that means that the jump in L is redirected to Bn, and all code 
from B1..Bn be copied, right?  Do we even support multiple blocks after 
the to-be-redirected jump?)

Now if L is a latch block, but L->B1 is no latch/back edge : no 
restrictions apply.

Otherwise, L->B1 is a latch/back edge (that means B1 is a loop header): if 
all of B2..Bn-1 don't contain jumps back into the loop, and Bn is outside 
the loop, then the thread is okay as well.  (B2..Bn-1 can be inside the 
loop, as long as they don't contain jumps back into the loop, after 
copying by the threader, they don't create problems: their copies will be 
placed outside the loop and won't generate side entries back into the 
loop; the copied latch edge will not be a latch edge anymore, but a loop 
exit edge).

It's quite possible that the full generality above is not necessary.  
It's probably enough to just not deal with the (b) case above (and 
continue to reject it), and simply only accept the candidate path if none 
of the involved edges is a latch/back edge (despite one block being the 
latch block).  Especially so because I'm not convinced that I handled 
the idea of case (b) correctly above ;-)

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-04 16:29       ` Michael Matz
@ 2021-10-05 11:22         ` Richard Biener
  2021-10-05 12:43           ` Michael Matz
  2021-10-05 14:56           ` Jeff Law
  2021-10-05 13:33         ` Aldy Hernandez
  1 sibling, 2 replies; 27+ messages in thread
From: Richard Biener @ 2021-10-05 11:22 UTC (permalink / raw)
  To: Michael Matz; +Cc: Jeff Law, Aldy Hernandez, Andrew MacLeod, GCC patches

On Mon, Oct 4, 2021 at 6:29 PM Michael Matz <matz@suse.de> wrote:
>
> Hello,
>
> On Mon, 4 Oct 2021, Jeff Law wrote:
>
> > And just in case it got lost.  Here's the analysis on 960218-1 on visium:
> >
> > We've got this going into ethread:
> >
> > int f (int x)
> > {
> >   int D.1420;
> >   int a;
> >
> > ;;   basic block 2, loop depth 0, maybe hot
> > ;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       ENTRY (FALLTHRU,EXECUTABLE)
> >   a_4 = ~x_3(D);
> >   goto <bb 4>; [INV]
> > ;;    succ:       4 (FALLTHRU,EXECUTABLE)
> >
> > ;;   basic block 3, loop depth 1, maybe hot
> > ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       4 (TRUE_VALUE,EXECUTABLE)
> >   gl = a_1;
> > ;;    succ:       4 (FALLTHRU,DFS_BACK,EXECUTABLE)
> >
> > ;;   basic block 4, loop depth 1, maybe hot
> > ;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       2 (FALLTHRU,EXECUTABLE)
> > ;;                3 (FALLTHRU,DFS_BACK,EXECUTABLE)
> >   # a_1 = PHI <a_4(2), 0(3)>
> >   if (a_1 != 0)
> >     goto <bb 3>; [INV]
> >   else
> >     goto <bb 5>; [INV]
> > ;;    succ:       3 (TRUE_VALUE,EXECUTABLE)
> > ;;                5 (FALSE_VALUE,EXECUTABLE)
> >
> > ;;   basic block 5, loop depth 0, maybe hot
> > ;;    prev block 4, next block 1, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       4 (FALSE_VALUE,EXECUTABLE)
> >   return;
> > ;;    succ:       EXIT
> >
> > }
>
> First notice that this doesn't have an empty latch block to start with
> (in fact, there is no separate latch block at all), so the loop optimizers
> haven't been initialized for simple latches at this point.  Still, I would
> say that even then we want to be careful with the latch edge, as putting
> code onto it will most probably create a problem downstream once we _do_
> want to intialize the loop optimizers for simple latches.  So, that we are
> careful here is okay, we are just too careful in this specific case.

Not sure if the argument about empty or not empty latches is important...
> > There's a pretty obvious jump thread in there.  3->4->5.
> >
> > We used to allow this, but it's currently being rejected and I'm not
> > sure it should.
> >
> > In simplest terms we're threading from inside the loop, through the
> > latch to a point outside the loop.  At first glance, that seems safe.

All threadings that start inside the loop and end outside of it are OK
in principle.  Even if the path crosses the latch or the header.

We _might_ end up creating a loop with multiple exits though.

> Is at least the unrestricted jump threader (the one after loop optimizers)
> picking this up?
>
> Independend of that, I agree, that this specific instance of threading
> through the latch block, even though followed by to-be-copied
> instructions, is okay.  We need to determine precisely when that is okay,
> though.  In this case there are several reasons I could see why this is
> so:
> (a) while the thread is through the latch block, it's _not_ through the
>     latch edge (which is 4->3).  That would create the problem, so here
>     we're fine.
> (b) even if it were through the latch edge, and would be followed by
>     to-be-copied instructions (and hence fill the latch edge) the ultimate
>     end of the path is outside the loop, so all the edges and blocks that
>     would now carry instructions would also be outside the loop, and hence
>     be no problem

Yep.

> Now, capture this precisely ;-)

I tried to capture this with

+  // The path should either start and end in the same loop or exit the
+  // loop it starts in but never enter a loop.  This also catches
+  // creating irreducible loops, not only rotation.
+  if (entry->src->loop_father != exit->dest->loop_father
+      && !flow_loop_nested_p (exit->src->loop_father,
+                             entry->dest->loop_father))
+    {
+      cancel_thread (&path, "Path rotates loop");
+      return true;
+    }

it's not really necessary to have (simple) latches to argue about this
I think.

> I think something like so: we have a candidate path
>
>   S -> L -> B1 (-> B2 ... Bn)
>
> (Start, Latch, Blocks 1 to n following the latch).  (I think in our
> notation that means that the jump in L is redirected to Bn, and all code
> from B1..Bn be copied, right?  Do we even support multiple blocks after
> the to-be-redirected jump?)
>
> Now if L is a latch block, but L->B1 is no latch/back edge : no
> restrictions apply.
>
> Otherwise, L->B1 is a latch/back edge (that means B1 is a loop header): if
> all of B2..Bn-1 don't contain jumps back into the loop, and Bn is outside
> the loop, then the thread is okay as well.  (B2..Bn-1 can be inside the
> loop, as long as they don't contain jumps back into the loop, after
> copying by the threader, they don't create problems: their copies will be
> placed outside the loop and won't generate side entries back into the
> loop; the copied latch edge will not be a latch edge anymore, but a loop
> exit edge).
>
> It's quite possible that the full generality above is not necessary.
> It's probably enough to just not deal with the (b) case above (and
> continue to reject it), and simply only accept the candidate path if none
> of the involved edges is a latch/back edge (despite one block being the
> latch block).  Especially so because I'm not convinced that I handled
> the idea of case (b) correctly above ;-)
>
>
> Ciao,
> Michael.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-05 11:22         ` Richard Biener
@ 2021-10-05 12:43           ` Michael Matz
  2021-10-05 14:56           ` Jeff Law
  1 sibling, 0 replies; 27+ messages in thread
From: Michael Matz @ 2021-10-05 12:43 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, Aldy Hernandez, Andrew MacLeod, GCC patches

Hello,

On Tue, 5 Oct 2021, Richard Biener wrote:

> > First notice that this doesn't have an empty latch block to start with 
> > (in fact, there is no separate latch block at all), so the loop 
> > optimizers haven't been initialized for simple latches at this point.  
> > Still, I would say that even then we want to be careful with the latch 
> > edge, as putting code onto it will most probably create a problem 
> > downstream once we _do_ want to intialize the loop optimizers for 
> > simple latches.  So, that we are careful here is okay, we are just too 
> > careful in this specific case.
> 
> Not sure if the argument about empty or not empty latches is important...

In this case it's not (as there are no separate latches anyway), but 
generally a latch that is already non-empty (i.e. problematic) only 
becomes more non-empty, so doing the threading doesn't introduce that 
specific problem.

> > > There's a pretty obvious jump thread in there.  3->4->5.
> > >
> > > We used to allow this, but it's currently being rejected and I'm not
> > > sure it should.
> > >
> > > In simplest terms we're threading from inside the loop, through the
> > > latch to a point outside the loop.  At first glance, that seems safe.
> 
> All threadings that start inside the loop and end outside of it are OK
> in principle.  Even if the path crosses the latch or the header.
> 
> We _might_ end up creating a loop with multiple exits though.

And entries (and hence irreducable loops)!  That's why I also added the 
piece about "all of B2..Bn-1 don't contain jumps back into the loop".  
I'm not sure if candidate paths going into our threader can have this 
problem, but if they have then you also don't want to do the block 
copying.

> > (a) while the thread is through the latch block, it's _not_ through the
> >     latch edge (which is 4->3).  That would create the problem, so here
> >     we're fine.
> > (b) even if it were through the latch edge, and would be followed by
> >     to-be-copied instructions (and hence fill the latch edge) the ultimate
> >     end of the path is outside the loop, so all the edges and blocks that
> >     would now carry instructions would also be outside the loop, and hence
> >     be no problem
> 
> Yep.
> 
> > Now, capture this precisely ;-)
> 
> I tried to capture this with
> 
> +  // The path should either start and end in the same loop or exit the
> +  // loop it starts in but never enter a loop.  This also catches
> +  // creating irreducible loops, not only rotation.
> +  if (entry->src->loop_father != exit->dest->loop_father
> +      && !flow_loop_nested_p (exit->src->loop_father,
> +                             entry->dest->loop_father))
> +    {
> +      cancel_thread (&path, "Path rotates loop");
> +      return true;
> +    }
> 
> it's not really necessary to have (simple) latches to argue about this
> I think.

Correct.  I don't think the above captures the re-entry problem (when 
intermediary blocks contain jumps inside the loop), the one I'm not sure 
we even have, but otherwise I think it does capture the (a) and (b) parts.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-04 16:29       ` Michael Matz
  2021-10-05 11:22         ` Richard Biener
@ 2021-10-05 13:33         ` Aldy Hernandez
  2021-10-05 15:10           ` Jeff Law
                             ` (2 more replies)
  1 sibling, 3 replies; 27+ messages in thread
From: Aldy Hernandez @ 2021-10-05 13:33 UTC (permalink / raw)
  To: Michael Matz; +Cc: Jeff Law, Richard Biener, Andrew MacLeod, GCC patches

[-- Attachment #1: Type: text/plain, Size: 5156 bytes --]

On Mon, Oct 4, 2021 at 6:29 PM Michael Matz <matz@suse.de> wrote:
>
> Hello,
>
> On Mon, 4 Oct 2021, Jeff Law wrote:
>
> > And just in case it got lost.  Here's the analysis on 960218-1 on visium:
> >
> > We've got this going into ethread:
> >
> > int f (int x)
> > {
> >   int D.1420;
> >   int a;
> >
> > ;;   basic block 2, loop depth 0, maybe hot
> > ;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       ENTRY (FALLTHRU,EXECUTABLE)
> >   a_4 = ~x_3(D);
> >   goto <bb 4>; [INV]
> > ;;    succ:       4 (FALLTHRU,EXECUTABLE)
> >
> > ;;   basic block 3, loop depth 1, maybe hot
> > ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       4 (TRUE_VALUE,EXECUTABLE)
> >   gl = a_1;
> > ;;    succ:       4 (FALLTHRU,DFS_BACK,EXECUTABLE)
> >
> > ;;   basic block 4, loop depth 1, maybe hot
> > ;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       2 (FALLTHRU,EXECUTABLE)
> > ;;                3 (FALLTHRU,DFS_BACK,EXECUTABLE)
> >   # a_1 = PHI <a_4(2), 0(3)>
> >   if (a_1 != 0)
> >     goto <bb 3>; [INV]
> >   else
> >     goto <bb 5>; [INV]
> > ;;    succ:       3 (TRUE_VALUE,EXECUTABLE)
> > ;;                5 (FALSE_VALUE,EXECUTABLE)
> >
> > ;;   basic block 5, loop depth 0, maybe hot
> > ;;    prev block 4, next block 1, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       4 (FALSE_VALUE,EXECUTABLE)
> >   return;
> > ;;    succ:       EXIT
> >
> > }
>
> First notice that this doesn't have an empty latch block to start with
> (in fact, there is no separate latch block at all), so the loop optimizers
> haven't been initialized for simple latches at this point.  Still, I would
> say that even then we want to be careful with the latch edge, as putting
> code onto it will most probably create a problem downstream once we _do_
> want to intialize the loop optimizers for simple latches.  So, that we are
> careful here is okay, we are just too careful in this specific case.
>
> > There's a pretty obvious jump thread in there.  3->4->5.
> >
> > We used to allow this, but it's currently being rejected and I'm not
> > sure it should.
> >
> > In simplest terms we're threading from inside the loop, through the
> > latch to a point outside the loop.  At first glance, that seems safe.
>
> Is at least the unrestricted jump threader (the one after loop optimizers)
> picking this up?
>
> Independend of that, I agree, that this specific instance of threading
> through the latch block, even though followed by to-be-copied
> instructions, is okay.  We need to determine precisely when that is okay,
> though.  In this case there are several reasons I could see why this is
> so:
> (a) while the thread is through the latch block, it's _not_ through the
>     latch edge (which is 4->3).  That would create the problem, so here
>     we're fine.

It seems we all agree Jeff's finding should be allowed, so let's
attack this one first, since it gets almost all of his visium
failures.  I can submit the rest of the cases separately.

The attached patch catches the IL discussed, and adds a relevant
gimple FE test others can use for experimenting :).

Tested on x86-64 and by visual inspection on visium-elf on the
regressions Jeff pointed me at.

OK?

BTW Jeff, this fixes all the regressions you mention except:

1. pr68911.c

The path not being threaded here is 7->10->11->12.  It crosses loops
multiple times, so I believe the restriction code is correct.

7, 10, 12 are in loop1.
11 is in loop 2.

So we have a path going from loop1 -> loop2 -> loop1.  I can't
conceive any scenario where this is ok, but I can attach the entire IL
if I'm missing something.

2. 961125-1.c

This one is a bit trickier.  Here we're trying to thread the following
conditional, which I mentioned when I contributed this work, we don't
handle (and happens infrequently enough not to matter):

+  // Loop 4
+  <bb 6> [local count: 114863531]:
+  # ptr_8 = PHI <ptr_2(4), ptr_2(5)>
+  if (ptr_8 < &MEM <char[4]> [(void *)":ab" + 3B])
+    goto <bb 7>; [50.00%]
+  else
+    goto <bb 17>; [50.00%]

The hybrid threader doesn't handle &MEM in the final conditional.  As
I mentioned earlier, if this becomes an issue, we can adapt class
pointer_equiv_analyzer like we did for evrp.  I have a gimple FE test
I will contribute as an XFAIL with an associated PR to keep us honest.

That being said... in this particular test, this is all irrelevant
because the path will be disallowed for two reasons:

a) The path crosses loops, and the reason we didn't realize it in the
old code was because the ASSERT_EXPR had pulled the SSA outside the
loop, so it looks like the entire path is l in the same loop.  If you
look at the original IL, it's not.

b) Now the path actually fits the pattern being discussed in this
patch, where there's an early exit out of a loop, so it looks like we
should handle it.  But...in this case, we would fill a presently empty
latch.  Interestingly, the old code didn't catch it, because....you
guessed it...there was an ASSERT_EXPR in the latch.

So I argue that even in the absence of MEM_REF handling, we shouldn't
thread this path.

[-- Attachment #2: 0001-Loosen-loop-crossing-restriction-in-threader.patch --]
[-- Type: text/x-patch, Size: 5004 bytes --]

From 5abe65668f602d53b8f3dc515508dc4616beb048 Mon Sep 17 00:00:00 2001
From: Aldy Hernandez <aldyh@redhat.com>
Date: Tue, 5 Oct 2021 15:03:34 +0200
Subject: [PATCH] Loosen loop crossing restriction in threader.

Crossing loops is generally discouraged from the threader, but we can
make an exception when we don't cross the latch or enter another loop,
since this is just an early exit out of the loop.

In fact, the whole threaded path is logically outside the loop.  This
has nice secondary effects.  For example, objects on the threaded path
will no longer necessarily be live throughout the loop, so we can get
register allocation improvements.  The threaded path can physically
move outside the loop resulting in better icache efficiency, etc.

Tested on x86-64 Linux, and on a visium-elf cross making sure that the
following tests do not have an abort in the final assembly:

gcc.c-torture/execute/960218-1.c
gcc.c-torture/execute/visium-pending-4.c
gcc.c-torture/execute/pr58209.c

gcc/ChangeLog:

	* tree-ssa-threadupdate.c (jt_path_registry::cancel_invalid_paths):

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ssa-thread-valid.c: New test.

Co-authored-by: Jeff Law <jeffrealaw@gmail.com>
---
 .../gcc.dg/tree-ssa/ssa-thread-valid.c        | 39 ++++++++++++++++++
 gcc/tree-ssa-threadupdate.c                   | 40 ++++++++++++++-----
 2 files changed, 68 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-valid.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-valid.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-valid.c
new file mode 100644
index 00000000000..7adca97cc2b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-valid.c
@@ -0,0 +1,39 @@
+// { dg-do compile }
+// { dg-options "-O2 -fgimple -fdump-statistics" }
+
+// This is a collection of threadable paths.  To simplify maintenance,
+// there should only be one threadable path per function.
+
+int global;
+
+// The thread from 3->4->5 crosses loops but is allowed because it
+// never crosses the latch (BB3) and is just an early exit out of the
+// loop.
+int __GIMPLE (ssa)
+foo1 (int x)
+{
+  int D_1420;
+  int a;
+
+  __BB(2):
+  a_4 = ~x_3(D);
+  goto __BB4;
+
+  // Latch.
+  __BB(3):
+  global = a_1;
+  goto __BB4;
+
+  __BB(4,loop_header(1)):
+  a_1 = __PHI (__BB2: a_4, __BB3: 0);
+  if (a_1 != 0)
+    goto __BB3;
+  else
+    goto __BB5;
+
+  __BB(5):
+  return;
+
+}
+
+// { dg-final { scan-tree-dump "Jumps threaded\" \"foo1\" 1" "statistics" } }
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index dcabfdb30d2..32ce1e3af40 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -2766,10 +2766,17 @@ bool
 jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
 {
   gcc_checking_assert (!path.is_empty ());
-  edge taken_edge = path[path.length () - 1]->e;
-  loop_p loop = taken_edge->src->loop_father;
+  edge entry = path[0]->e;
+  edge exit = path[path.length () - 1]->e;
   bool seen_latch = false;
-  bool path_crosses_loops = false;
+  int loops_crossed = 0;
+  bool crossed_latch = false;
+  // Use ->dest here instead of ->src to ignore the first block.  The
+  // first block is allowed to be in a different loop, since it'll be
+  // redirected.  See similar comment in profitable_path_p: "we don't
+  // care about that block...".
+  loop_p loop = entry->dest->loop_father;
+  loop_p curr_loop = loop;
 
   for (unsigned int i = 0; i < path.length (); i++)
     {
@@ -2784,19 +2791,30 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
 	}
 
       if (loop->latch == e->src || loop->latch == e->dest)
-	seen_latch = true;
+	{
+	  seen_latch = true;
+	  // Like seen_latch, but excludes the first block.
+	  if (e->src != entry->src)
+	    crossed_latch = true;
+	}
 
-      // The first entry represents the block with an outgoing edge
-      // that we will redirect to the jump threading path.  Thus we
-      // don't care about that block's loop father.
-      if ((i > 0 && e->src->loop_father != loop)
-	  || e->dest->loop_father != loop)
-	path_crosses_loops = true;
+      if (e->dest->loop_father != curr_loop)
+	{
+	  curr_loop = e->dest->loop_father;
+	  ++loops_crossed;
+	}
 
       if (flag_checking && !m_backedge_threads)
 	gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0);
     }
 
+  // If we crossed a loop into an outer loop without crossing the
+  // latch, this is just an early exit from the loop.
+  if (loops_crossed == 1
+      && !crossed_latch
+      && flow_loop_nested_p (exit->dest->loop_father, exit->src->loop_father))
+    return false;
+
   if (cfun->curr_properties & PROP_loop_opts_done)
     return false;
 
@@ -2806,7 +2824,7 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
 		     "would create non-empty latch");
       return true;
     }
-  if (path_crosses_loops)
+  if (loops_crossed)
     {
       cancel_thread (&path, "Path crosses loops");
       return true;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-05 11:22         ` Richard Biener
  2021-10-05 12:43           ` Michael Matz
@ 2021-10-05 14:56           ` Jeff Law
  1 sibling, 0 replies; 27+ messages in thread
From: Jeff Law @ 2021-10-05 14:56 UTC (permalink / raw)
  To: Richard Biener, Michael Matz; +Cc: Aldy Hernandez, Andrew MacLeod, GCC patches



On 10/5/2021 5:22 AM, Richard Biener wrote:
> On Mon, Oct 4, 2021 at 6:29 PM Michael Matz <matz@suse.de> wrote:
>> Hello,
>>
>> On Mon, 4 Oct 2021, Jeff Law wrote:
>>
>>> And just in case it got lost.  Here's the analysis on 960218-1 on visium:
>>>
>>> We've got this going into ethread:
>>>
>>> int f (int x)
>>> {
>>>    int D.1420;
>>>    int a;
>>>
>>> ;;   basic block 2, loop depth 0, maybe hot
>>> ;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       ENTRY (FALLTHRU,EXECUTABLE)
>>>    a_4 = ~x_3(D);
>>>    goto <bb 4>; [INV]
>>> ;;    succ:       4 (FALLTHRU,EXECUTABLE)
>>>
>>> ;;   basic block 3, loop depth 1, maybe hot
>>> ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       4 (TRUE_VALUE,EXECUTABLE)
>>>    gl = a_1;
>>> ;;    succ:       4 (FALLTHRU,DFS_BACK,EXECUTABLE)
>>>
>>> ;;   basic block 4, loop depth 1, maybe hot
>>> ;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       2 (FALLTHRU,EXECUTABLE)
>>> ;;                3 (FALLTHRU,DFS_BACK,EXECUTABLE)
>>>    # a_1 = PHI <a_4(2), 0(3)>
>>>    if (a_1 != 0)
>>>      goto <bb 3>; [INV]
>>>    else
>>>      goto <bb 5>; [INV]
>>> ;;    succ:       3 (TRUE_VALUE,EXECUTABLE)
>>> ;;                5 (FALSE_VALUE,EXECUTABLE)
>>>
>>> ;;   basic block 5, loop depth 0, maybe hot
>>> ;;    prev block 4, next block 1, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       4 (FALSE_VALUE,EXECUTABLE)
>>>    return;
>>> ;;    succ:       EXIT
>>>
>>> }
>> First notice that this doesn't have an empty latch block to start with
>> (in fact, there is no separate latch block at all), so the loop optimizers
>> haven't been initialized for simple latches at this point.  Still, I would
>> say that even then we want to be careful with the latch edge, as putting
>> code onto it will most probably create a problem downstream once we _do_
>> want to intialize the loop optimizers for simple latches.  So, that we are
>> careful here is okay, we are just too careful in this specific case.
> Not sure if the argument about empty or not empty latches is important...
>>> There's a pretty obvious jump thread in there.  3->4->5.
>>>
>>> We used to allow this, but it's currently being rejected and I'm not
>>> sure it should.
>>>
>>> In simplest terms we're threading from inside the loop, through the
>>> latch to a point outside the loop.  At first glance, that seems safe.
> All threadings that start inside the loop and end outside of it are OK
> in principle.  Even if the path crosses the latch or the header.
I generally agree, but with one caveat.  If the latch is marked as a 
joiner block, then it will not be OK as that'll create multiple 
backedges to the top of the loop.


Jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-05 13:33         ` Aldy Hernandez
@ 2021-10-05 15:10           ` Jeff Law
  2021-10-05 16:08           ` Jeff Law
  2021-10-06 13:15           ` Andreas Schwab
  2 siblings, 0 replies; 27+ messages in thread
From: Jeff Law @ 2021-10-05 15:10 UTC (permalink / raw)
  To: Aldy Hernandez, Michael Matz; +Cc: Richard Biener, Andrew MacLeod, GCC patches



On 10/5/2021 7:33 AM, Aldy Hernandez wrote:
> On Mon, Oct 4, 2021 at 6:29 PM Michael Matz <matz@suse.de> wrote:
>> Hello,
>>
>> On Mon, 4 Oct 2021, Jeff Law wrote:
>>
>>> And just in case it got lost.  Here's the analysis on 960218-1 on visium:
>>>
>>> We've got this going into ethread:
>>>
>>> int f (int x)
>>> {
>>>    int D.1420;
>>>    int a;
>>>
>>> ;;   basic block 2, loop depth 0, maybe hot
>>> ;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       ENTRY (FALLTHRU,EXECUTABLE)
>>>    a_4 = ~x_3(D);
>>>    goto <bb 4>; [INV]
>>> ;;    succ:       4 (FALLTHRU,EXECUTABLE)
>>>
>>> ;;   basic block 3, loop depth 1, maybe hot
>>> ;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       4 (TRUE_VALUE,EXECUTABLE)
>>>    gl = a_1;
>>> ;;    succ:       4 (FALLTHRU,DFS_BACK,EXECUTABLE)
>>>
>>> ;;   basic block 4, loop depth 1, maybe hot
>>> ;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       2 (FALLTHRU,EXECUTABLE)
>>> ;;                3 (FALLTHRU,DFS_BACK,EXECUTABLE)
>>>    # a_1 = PHI <a_4(2), 0(3)>
>>>    if (a_1 != 0)
>>>      goto <bb 3>; [INV]
>>>    else
>>>      goto <bb 5>; [INV]
>>> ;;    succ:       3 (TRUE_VALUE,EXECUTABLE)
>>> ;;                5 (FALSE_VALUE,EXECUTABLE)
>>>
>>> ;;   basic block 5, loop depth 0, maybe hot
>>> ;;    prev block 4, next block 1, flags: (NEW, REACHABLE, VISITED)
>>> ;;    pred:       4 (FALSE_VALUE,EXECUTABLE)
>>>    return;
>>> ;;    succ:       EXIT
>>>
>>> }
>> First notice that this doesn't have an empty latch block to start with
>> (in fact, there is no separate latch block at all), so the loop optimizers
>> haven't been initialized for simple latches at this point.  Still, I would
>> say that even then we want to be careful with the latch edge, as putting
>> code onto it will most probably create a problem downstream once we _do_
>> want to intialize the loop optimizers for simple latches.  So, that we are
>> careful here is okay, we are just too careful in this specific case.
>>
>>> There's a pretty obvious jump thread in there.  3->4->5.
>>>
>>> We used to allow this, but it's currently being rejected and I'm not
>>> sure it should.
>>>
>>> In simplest terms we're threading from inside the loop, through the
>>> latch to a point outside the loop.  At first glance, that seems safe.
>> Is at least the unrestricted jump threader (the one after loop optimizers)
>> picking this up?
>>
>> Independend of that, I agree, that this specific instance of threading
>> through the latch block, even though followed by to-be-copied
>> instructions, is okay.  We need to determine precisely when that is okay,
>> though.  In this case there are several reasons I could see why this is
>> so:
>> (a) while the thread is through the latch block, it's _not_ through the
>>      latch edge (which is 4->3).  That would create the problem, so here
>>      we're fine.
> It seems we all agree Jeff's finding should be allowed, so let's
> attack this one first, since it gets almost all of his visium
> failures.  I can submit the rest of the cases separately.
>
> The attached patch catches the IL discussed, and adds a relevant
> gimple FE test others can use for experimenting :).
>
> Tested on x86-64 and by visual inspection on visium-elf on the
> regressions Jeff pointed me at.
>
> OK?
>
> BTW Jeff, this fixes all the regressions you mention except:
>
> 1. pr68911.c
>
> The path not being threaded here is 7->10->11->12.  It crosses loops
> multiple times, so I believe the restriction code is correct.
>
> 7, 10, 12 are in loop1.
> 11 is in loop 2.
>
> So we have a path going from loop1 -> loop2 -> loop1.  I can't
> conceive any scenario where this is ok, but I can attach the entire IL
> if I'm missing something.
>
> 2. 961125-1.c
>
> This one is a bit trickier.  Here we're trying to thread the following
> conditional, which I mentioned when I contributed this work, we don't
> handle (and happens infrequently enough not to matter):
>
> +  // Loop 4
> +  <bb 6> [local count: 114863531]:
> +  # ptr_8 = PHI <ptr_2(4), ptr_2(5)>
> +  if (ptr_8 < &MEM <char[4]> [(void *)":ab" + 3B])
> +    goto <bb 7>; [50.00%]
> +  else
> +    goto <bb 17>; [50.00%]
>
> The hybrid threader doesn't handle &MEM in the final conditional.  As
> I mentioned earlier, if this becomes an issue, we can adapt class
> pointer_equiv_analyzer like we did for evrp.  I have a gimple FE test
> I will contribute as an XFAIL with an associated PR to keep us honest.
>
> That being said... in this particular test, this is all irrelevant
> because the path will be disallowed for two reasons:
>
> a) The path crosses loops, and the reason we didn't realize it in the
> old code was because the ASSERT_EXPR had pulled the SSA outside the
> loop, so it looks like the entire path is l in the same loop.  If you
> look at the original IL, it's not.
>
> b) Now the path actually fits the pattern being discussed in this
> patch, where there's an early exit out of a loop, so it looks like we
> should handle it.  But...in this case, we would fill a presently empty
> latch.  Interestingly, the old code didn't catch it, because....you
> guessed it...there was an ASSERT_EXPR in the latch.
>
> So I argue that even in the absence of MEM_REF handling, we shouldn't
> thread this path.
>
> 0001-Loosen-loop-crossing-restriction-in-threader.patch
>
>  From 5abe65668f602d53b8f3dc515508dc4616beb048 Mon Sep 17 00:00:00 2001
> From: Aldy Hernandez <aldyh@redhat.com>
> Date: Tue, 5 Oct 2021 15:03:34 +0200
> Subject: [PATCH] Loosen loop crossing restriction in threader.
>
> Crossing loops is generally discouraged from the threader, but we can
> make an exception when we don't cross the latch or enter another loop,
> since this is just an early exit out of the loop.
>
> In fact, the whole threaded path is logically outside the loop.  This
> has nice secondary effects.  For example, objects on the threaded path
> will no longer necessarily be live throughout the loop, so we can get
> register allocation improvements.  The threaded path can physically
> move outside the loop resulting in better icache efficiency, etc.
>
> Tested on x86-64 Linux, and on a visium-elf cross making sure that the
> following tests do not have an abort in the final assembly:
>
> gcc.c-torture/execute/960218-1.c
> gcc.c-torture/execute/visium-pending-4.c
> gcc.c-torture/execute/pr58209.c
>
> gcc/ChangeLog:
>
> 	* tree-ssa-threadupdate.c (jt_path_registry::cancel_invalid_paths):
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.dg/tree-ssa/ssa-thread-valid.c: New test.
I'll throw it in and see what we get :-)

jeff


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-05 13:33         ` Aldy Hernandez
  2021-10-05 15:10           ` Jeff Law
@ 2021-10-05 16:08           ` Jeff Law
  2021-10-05 16:22             ` Aldy Hernandez
  2021-10-06 13:15           ` Andreas Schwab
  2 siblings, 1 reply; 27+ messages in thread
From: Jeff Law @ 2021-10-05 16:08 UTC (permalink / raw)
  To: Aldy Hernandez, Michael Matz; +Cc: Richard Biener, Andrew MacLeod, GCC patches



On 10/5/2021 7:33 AM, Aldy Hernandez wrote:
>
> OK?
>
> BTW Jeff, this fixes all the regressions you mention except:
>
> 1. pr68911.c
>
> The path not being threaded here is 7->10->11->12.  It crosses loops
> multiple times, so I believe the restriction code is correct.
>
> 7, 10, 12 are in loop1.
> 11 is in loop 2.
>
> So we have a path going from loop1 -> loop2 -> loop1.  I can't
> conceive any scenario where this is ok, but I can attach the entire IL
> if I'm missing something.
Wow.  I'm having trouble seeing how we were threading that, but clearly 
not OK.
>
> 2. 961125-1.c
>
> This one is a bit trickier.  Here we're trying to thread the following
> conditional, which I mentioned when I contributed this work, we don't
> handle (and happens infrequently enough not to matter):
Sorry, I'd forgotten about this case.

>
> +  // Loop 4
> +  <bb 6> [local count: 114863531]:
> +  # ptr_8 = PHI <ptr_2(4), ptr_2(5)>
> +  if (ptr_8 < &MEM <char[4]> [(void *)":ab" + 3B])
> +    goto <bb 7>; [50.00%]
> +  else
> +    goto <bb 17>; [50.00%]
>
> The hybrid threader doesn't handle &MEM in the final conditional.  As
> I mentioned earlier, if this becomes an issue, we can adapt class
> pointer_equiv_analyzer like we did for evrp.  I have a gimple FE test
> I will contribute as an XFAIL with an associated PR to keep us honest.
>
> That being said... in this particular test, this is all irrelevant
> because the path will be disallowed for two reasons:
>
> a) The path crosses loops, and the reason we didn't realize it in the
> old code was because the ASSERT_EXPR had pulled the SSA outside the
> loop, so it looks like the entire path is l in the same loop.  If you
> look at the original IL, it's not.
>
> b) Now the path actually fits the pattern being discussed in this
> patch, where there's an early exit out of a loop, so it looks like we
> should handle it.  But...in this case, we would fill a presently empty
> latch.  Interestingly, the old code didn't catch it, because....you
> guessed it...there was an ASSERT_EXPR in the latch.
>
> So I argue that even in the absence of MEM_REF handling, we shouldn't
> thread this path.
I can live with that too.  I'll set new baselines for visium-elf so that 
these don't show up again.

>
> 0001-Loosen-loop-crossing-restriction-in-threader.patch
>
>  From 5abe65668f602d53b8f3dc515508dc4616beb048 Mon Sep 17 00:00:00 2001
> From: Aldy Hernandez <aldyh@redhat.com>
> Date: Tue, 5 Oct 2021 15:03:34 +0200
> Subject: [PATCH] Loosen loop crossing restriction in threader.
>
> Crossing loops is generally discouraged from the threader, but we can
> make an exception when we don't cross the latch or enter another loop,
> since this is just an early exit out of the loop.
>
> In fact, the whole threaded path is logically outside the loop.  This
> has nice secondary effects.  For example, objects on the threaded path
> will no longer necessarily be live throughout the loop, so we can get
> register allocation improvements.  The threaded path can physically
> move outside the loop resulting in better icache efficiency, etc.
>
> Tested on x86-64 Linux, and on a visium-elf cross making sure that the
> following tests do not have an abort in the final assembly:
>
> gcc.c-torture/execute/960218-1.c
> gcc.c-torture/execute/visium-pending-4.c
> gcc.c-torture/execute/pr58209.c
>
> gcc/ChangeLog:
>
> 	* tree-ssa-threadupdate.c (jt_path_registry::cancel_invalid_paths):
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.dg/tree-ssa/ssa-thread-valid.c: New test.
OK.  I think there may still be some concern if the latch is marked as a 
joiner.    I think we should always reject those before the loop 
optimizers run as the joiner's clone would introduce a second latch.   I 
think that can be handled in a follow-up refinement.





>
> Co-authored-by: Jeff Law <jeffrealaw@gmail.com>
I'm not sure I deserve co-authorship :-)  You're doing all the work, I'm 
just reporting the regressions.

jeff


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-05 16:08           ` Jeff Law
@ 2021-10-05 16:22             ` Aldy Hernandez
  0 siblings, 0 replies; 27+ messages in thread
From: Aldy Hernandez @ 2021-10-05 16:22 UTC (permalink / raw)
  To: Jeff Law; +Cc: Michael Matz, Richard Biener, Andrew MacLeod, GCC patches

On Tue, Oct 5, 2021 at 6:08 PM Jeff Law <jeffreyalaw@gmail.com> wrote:

> OK.  I think there may still be some concern if the latch is marked as a joiner.    I think we should always reject those before the loop optimizers run as the joiner's clone would introduce a second latch.   I think that can be handled in a follow-up refinement.

I've been mumbling, to myself mostly, that the loop threading
restrictions should be owned by the loop experts.  There's not much I
can contribute here, except testcases and patches that I *think* are
what y'all want.  That being said, I will follow-up with the
suggestions made by Richi, because I've already done the work, and
have cobbled up the relevant gimple FE tests ;-).

>
>
>
>
>
>
> Co-authored-by: Jeff Law <jeffrealaw@gmail.com>
>
> I'm not sure I deserve co-authorship :-)  You're doing all the work, I'm just reporting the regressions.

Heh.  Heh.  Ok.   I honestly thought about putting everyone's name here.

Aldy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-04  9:43 [RFC] More jump threading restrictions in the presence of loops Aldy Hernandez
  2021-10-04 13:31 ` Jeff Law
@ 2021-10-06 10:22 ` Aldy Hernandez
  2021-10-14  9:25   ` Aldy Hernandez
  2021-10-17  1:32   ` Jeff Law
  1 sibling, 2 replies; 27+ messages in thread
From: Aldy Hernandez @ 2021-10-06 10:22 UTC (permalink / raw)
  To: Jeff Law, Richard Biener, Michael Matz; +Cc: Andrew MacLeod, GCC patches

[-- Attachment #1: Type: text/plain, Size: 1427 bytes --]

[Here go the bits by Richi, tested on x86-64 Linux, and ongoing tests
on aarch64 and ppc64le.]

There is a lot of fall-out from this patch, as there were many threading
tests that assumed the restrictions introduced by this patch were valid.
Some tests have merely shifted the threading to after loop
optimizations, but others ended up with no threading opportunities at
all.  Surprisingly some tests ended up with more total threads.  It was
a crapshoot all around.

On a postive note, there are 6 tests that no longer XFAIL, and one
guality test which now passes.

I would appreciate someone reviewing the test changes.  I am unsure
whether some of the tests should be removed, XFAILed, or some other
thing.

I felt a bit queasy about such a fundamental change wrt threading, so I
ran it through my callgrind test harness (.ii files from a bootstrap).
There was no change in overall compilation, DOM, or the VRP threaders.

However, there was a slight increase of 1.63% in the backward threader.
I'm pretty sure we could reduce this if we incorporated the restrictions
into their profitability code.  This way we could stop the search when
we ran into one of these restrictions.  Not sure it's worth it at this
point.

Note, that this ad-hoc testing is not meant to replace a more thorough
SPEC, etc test.

Tested on x86-64 Linux.

OK pending tests on aarch64 and ppc64le?

Co-authored-by: Richard Biener <rguenther@suse.de>

[-- Attachment #2: 0001-Disallow-loop-rotation-and-loop-header-crossing-in-j.patch --]
[-- Type: text/x-patch, Size: 25697 bytes --]

From 5c0fb55b45733ec38918f5d7a576719da32e4a6c Mon Sep 17 00:00:00 2001
From: Aldy Hernandez <aldyh@redhat.com>
Date: Mon, 4 Oct 2021 09:47:02 +0200
Subject: [PATCH] Disallow loop rotation and loop header crossing in jump
 threaders.

There is a lot of fall-out from this patch, as there are many threading
tests that assumed the restrictions introduced by this patch were valid.
Some tests have merely shifted the threading to after loop
optimizations, but others ended up with no threading opportunities at
all.  Surprisingly some tests ended up with more total threads.  It was
a crapshoot all around.

On a postive note, there are 6 tests that no longer XFAIL, and one
guality test which now passes.

I would appreciate someone reviewing the test changes.  I am unsure
whether some of the tests should be removed, XFAILed, or some other
thing.

I felt a bit queasy about such a fundamental change wrt threading, so I
ran it through my callgrind test harness (.ii files from a bootstrap).
There was no change in overall compilation, DOM, or the VRP threaders.

However, there was a slight increase of 1.63% in the backward threader.
I'm pretty sure we could reduce this if we incorporated the restrictions
into their profitability code.  This way we could stop the search when
we ran into one of these restrictions.  Not sure it's worth it at this
point.

Note, that this ad-hoc testing is not meant to replace a more thorough
SPEC, etc test.

Tested on x86-64 Linux.

OK pending tests on aarch64 and ppc64le?

Co-authored-by: Richard Biener <rguenther@suse.de>

gcc/ChangeLog:

	* tree-ssa-threadupdate.c (cancel_thread): Dump threading reason
	on the same line as the threading cancellation.
	(jt_path_registry::cancel_invalid_paths): Avoid rotating loops.
	Avoid threading through loop headers where the path remains in the
	loop.

libgomp/ChangeLog:

	* testsuite/libgomp.graphite/force-parallel-5.c: Remove xfail.

gcc/testsuite/ChangeLog:

	* gcc.dg/Warray-bounds-87.c: Remove xfail.
	* gcc.dg/analyzer/pr94851-2.c: Remove xfail.
	* gcc.dg/graphite/pr69728.c: Remove xfail.
	* gcc.dg/graphite/scop-dsyr2k.c: Remove xfail.
	* gcc.dg/graphite/scop-dsyrk.c: Remove xfail.
	* gcc.dg/shrink-wrap-loop.c: Remove xfail.
	* gcc.dg/loop-8.c: Adjust for new threading restrictions.
	* gcc.dg/tree-ssa/ifc-20040816-1.c: Same.
	* gcc.dg/tree-ssa/pr21559.c: Same.
	* gcc.dg/tree-ssa/pr59597.c: Same.
	* gcc.dg/tree-ssa/pr71437.c: Same.
	* gcc.dg/tree-ssa/pr77445-2.c: Same.
	* gcc.dg/tree-ssa/ssa-dom-thread-18.c: Same.
	* gcc.dg/tree-ssa/ssa-dom-thread-2a.c: Same.
	* gcc.dg/tree-ssa/ssa-dom-thread-4.c: Same.
	* gcc.dg/tree-ssa/ssa-dom-thread-6.c: Same.
	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Same.
	* gcc.dg/vect/bb-slp-16.c: Same.
	* gcc.dg/tree-ssa/ssa-thread-invalid.c: New test.
---
 gcc/testsuite/gcc.dg/Warray-bounds-87.c       |   2 +-
 gcc/testsuite/gcc.dg/analyzer/pr94851-2.c     |   2 +-
 gcc/testsuite/gcc.dg/graphite/pr69728.c       |   4 +-
 gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c   |   2 +-
 gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c    |   2 +-
 gcc/testsuite/gcc.dg/loop-8.c                 |   6 +-
 gcc/testsuite/gcc.dg/shrink-wrap-loop.c       |  54 +---------
 .../gcc.dg/tree-ssa/ifc-20040816-1.c          |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr21559.c       |   7 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr59597.c       |  10 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr71437.c       |   8 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c     |   3 +-
 .../gcc.dg/tree-ssa/ssa-dom-thread-18.c       |   8 +-
 .../gcc.dg/tree-ssa/ssa-dom-thread-2a.c       |  11 +-
 .../gcc.dg/tree-ssa/ssa-dom-thread-4.c        |  14 ++-
 .../gcc.dg/tree-ssa/ssa-dom-thread-6.c        |   8 +-
 .../gcc.dg/tree-ssa/ssa-dom-thread-7.c        |   5 +-
 .../gcc.dg/tree-ssa/ssa-thread-invalid.c      | 102 ++++++++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-16.c         |   6 +-
 gcc/tree-ssa-threadupdate.c                   |  26 ++++-
 .../libgomp.graphite/force-parallel-5.c       |   2 +-
 21 files changed, 179 insertions(+), 105 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c

diff --git a/gcc/testsuite/gcc.dg/Warray-bounds-87.c b/gcc/testsuite/gcc.dg/Warray-bounds-87.c
index a49874df5da..a5457807c3a 100644
--- a/gcc/testsuite/gcc.dg/Warray-bounds-87.c
+++ b/gcc/testsuite/gcc.dg/Warray-bounds-87.c
@@ -33,7 +33,7 @@ static unsigned int h (int i, int j)
     case 9:
       return j;
     case 10:
-      return a[i];  // { dg-bogus "-Warray-bounds" "pr101671" { xfail *-*-* } }
+      return a[i];  // { dg-bogus "-Warray-bounds" "pr101671" }
     }
   return 0;
 }
diff --git a/gcc/testsuite/gcc.dg/analyzer/pr94851-2.c b/gcc/testsuite/gcc.dg/analyzer/pr94851-2.c
index 0acf48810c1..62176bdaee8 100644
--- a/gcc/testsuite/gcc.dg/analyzer/pr94851-2.c
+++ b/gcc/testsuite/gcc.dg/analyzer/pr94851-2.c
@@ -45,7 +45,7 @@ int pamark(void) {
     if (curbp->b_amark == (AMARK *)NULL)
       curbp->b_amark = p;
     else
-      last->m_next = p; /* { dg-warning "dereference of NULL 'last'" "deref" { xfail *-*-* } } */
+      last->m_next = p; /* { dg-warning "dereference of NULL 'last'" "deref" } */
   }
 
   p->m_name = (char)c; /* { dg-bogus "leak of 'p'" "bogus leak" } */
diff --git a/gcc/testsuite/gcc.dg/graphite/pr69728.c b/gcc/testsuite/gcc.dg/graphite/pr69728.c
index 69e28318aaf..a6f385749c2 100644
--- a/gcc/testsuite/gcc.dg/graphite/pr69728.c
+++ b/gcc/testsuite/gcc.dg/graphite/pr69728.c
@@ -24,6 +24,4 @@ fn1 ()
    run into scheduling issues before here, not being able to handle
    empty domains.  */
 
-/* XFAILed by fix for PR86865.  */
-
-/* { dg-final { scan-tree-dump "loop nest optimized" "graphite" { xfail *-*-* } } }  */
+/* { dg-final { scan-tree-dump "loop nest optimized" "graphite" } } */
diff --git a/gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c b/gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c
index 925ae306903..41c91b97b57 100644
--- a/gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c
+++ b/gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c
@@ -17,4 +17,4 @@ void dsyr2k(int N) {
 #pragma endscop
 }
 
-/* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite" { xfail *-*-* } } } */ 
+/* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite" } } */
diff --git a/gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c b/gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c
index b748946fabb..e01a517be11 100644
--- a/gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c
+++ b/gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c
@@ -19,4 +19,4 @@ void dsyrk(int N)
 #pragma endscop
 }
 
-/* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite" } } */
diff --git a/gcc/testsuite/gcc.dg/loop-8.c b/gcc/testsuite/gcc.dg/loop-8.c
index 90ea1c45524..66318fc08dc 100644
--- a/gcc/testsuite/gcc.dg/loop-8.c
+++ b/gcc/testsuite/gcc.dg/loop-8.c
@@ -24,5 +24,9 @@ f (int *a, int *b)
 
 /* Load of 42 is moved out of the loop, introducing a new pseudo register.  */
 /* { dg-final { scan-rtl-dump-times "Decided" 1 "loop2_invariant" } } */
-/* { dg-final { scan-rtl-dump-not "without introducing a new temporary register" "loop2_invariant" } } */
+
+
+/* ?? The expected behavior below depends on threading the 2->3->5 path
+   in DOM2, but this is invalid since it would rotate the loop.  */
+/* { dg-final { scan-rtl-dump-not "without introducing a new temporary register" "loop2_invariant" { xfail *-*-* } } } */
 
diff --git a/gcc/testsuite/gcc.dg/shrink-wrap-loop.c b/gcc/testsuite/gcc.dg/shrink-wrap-loop.c
index 6e1be8937fe..ddc99e6b75a 100644
--- a/gcc/testsuite/gcc.dg/shrink-wrap-loop.c
+++ b/gcc/testsuite/gcc.dg/shrink-wrap-loop.c
@@ -1,58 +1,6 @@
 /* { dg-do compile { target { { { i?86-*-* x86_64-*-* } && lp64 } || { arm_thumb2 } } } } */
 /* { dg-options "-O2 -fdump-rtl-pro_and_epilogue"  } */
 
-/*
-Our new threader is threading things a bit too early, and causing the
-testcase in gcc.dg/shrink-wrap-loop.c to fail.
-
-  The gist is this BB inside a loop:
-
-  <bb 6> :
-  # p_2 = PHI <p2_6(D)(2), p_12(5)>
-  if (p_2 != 0B)
-    goto <bb 3>; [INV]
-  else
-    goto <bb 7>; [INV]
-
-Our threader can move this check outside of the loop (good).  This is
-done before branch probabilities are calculated and causes the probs
-to be calculated as:
-
-<bb 2> [local count: 216361238]:
-  if (p2_6(D) != 0B)
-    goto <bb 7>; [54.59%]
-  else
-    goto <bb 6>; [45.41%]
-
-Logically this seems correct to me.  A simple check outside of a loop
-should slightly but not overwhelmingly favor a non-zero value.
-
-Interestingly however, the old threader couldn't get this, but the IL
-ended up identical, albeit with different probabilities.  What happens
-is that, because the old code could not thread this, the p2 != 0 check
-would remain inside the loop and probs would be calculated thusly:
-
-  <bb 6> [local count: 1073741824]:
-  # p_2 = PHI <p2_6(D)(2), p_12(5)>
-  if (p_2 != 0B)
-    goto <bb 3>; [94.50%]
-  else
-    goto <bb 7>; [5.50%]
-
-Then when the loop header copying pass ("ch") shuffled things around,
-the IL would end up identical to my early threader code, but with the
-probabilities would remain as 94.5/5.5.
-
-The above discrepancy causes the RTL ifcvt pass to generate different
-code, and by the time we get to the shrink wrapping pass, things look
-sufficiently different such that the legacy code can actually shrink
-wrap, whereas our new code does not.
-
-IMO, if the loop-ch pass moves conditionals outside of a loop, the
-probabilities should be adjusted, but that does mean the shrink wrap
-won't happen for this contrived testcase.
- */
-
 int foo (int *p1, int *p2);
 
 int
@@ -68,4 +16,4 @@ test (int *p1, int *p2)
 
   return 1;
 }
-/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail *-*-* } } } */
+/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c
index b55a533e374..f8a6495cbaa 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c
@@ -39,4 +39,4 @@ int main1 ()
    which is folded by vectorizer.  Both outgoing edges must have probability
    100% so the resulting profile match after folding.  */
 /* { dg-final { scan-tree-dump-times "Invalid sum of outgoing probabilities 200.0" 1 "ifcvt" } } */
-/* { dg-final { scan-tree-dump-times "Invalid sum of incoming counts" 1 "ifcvt" } } */
+/* { dg-final { scan-tree-dump-times "Invalid sum of incoming counts" 2 "ifcvt" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr21559.c b/gcc/testsuite/gcc.dg/tree-ssa/pr21559.c
index 51b3b7ac755..43f046edabe 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr21559.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr21559.c
@@ -35,10 +35,7 @@ void foo (void)
 /* First, we should simplify the bits < 0 test within the loop.  */
 /* { dg-final { scan-tree-dump-times "Simplified relational" 1 "evrp" } } */
 
-/* Second, we should thread the edge out of the loop via the break
-   statement.  We also realize that the final bytes == 0 test is useless,
-   and thread over it.  We also know that toread != 0 is useless when
-   entering while loop and thread over it.  */
-/* { dg-final { scan-tree-dump-times "Threaded jump" 3 "vrp-thread1" } } */
+/* We used to check for 3 threaded jumps here, but they all would
+   rotate the loop.  */
 
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c b/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c
index 2caa1f532ea..764b3fe2e80 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c
@@ -56,11 +56,7 @@ main (int argc, char argv[])
   return crc;
 }
 
-/* Previously we had 3 jump threads, but one of them crossed loops.
-   The reason the old threader was allowing it, was because there was
-   an ASSERT_EXPR getting in the way.  Without the ASSERT_EXPR, we
-   have an empty pre-header block as the final block in the thread,
-   which the threader will simply join with the next block which *is*
-   in a different loop.  */
-/* { dg-final { scan-tree-dump-times "Registering jump thread" 2 "vrp-thread1" } } */
+/* None of the threads we can get in vrp-thread1 are valid.  They all
+   cross or rotate loops.  */
+/* { dg-final { scan-tree-dump-not "Registering jump thread" "vrp-thread1" } } */
 /* { dg-final { scan-tree-dump-not "joiner" "vrp-thread1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr71437.c b/gcc/testsuite/gcc.dg/tree-ssa/pr71437.c
index a2386ba19f0..eab3a25928e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr71437.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr71437.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-ffast-math -O3 -fdump-tree-vrp-thread1-details" } */
+/* { dg-options "-ffast-math -O3 -fdump-tree-dom3-details" } */
 
 int I = 50, J = 50;
 int S, L;
@@ -39,4 +39,8 @@ void foo (int K)
 	bar (LD, SD);
     }
 }
-/* { dg-final { scan-tree-dump-times "Threaded jump " 2 "vrp-thread1" } } */
+
+/* We used to get 1 vrp-thread1 candidates here, but they now get
+   deferred until after loop opts are done, because they were rotating
+   loops.  */
+/* { dg-final { scan-tree-dump-times "Threaded jump " 2 "dom3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
index 18f7aab2be7..f2a5e78e6be 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
@@ -123,8 +123,7 @@ enum STATES FMS( u8 **in , u32 *transitions) {
    aarch64 has the highest CASE_VALUES_THRESHOLD in GCC.  It's high enough
    to change decisions in switch expansion which in turn can expose new
    jump threading opportunities.  Skip the later tests on aarch64.  */
-/* { dg-final { scan-tree-dump "Jumps threaded: \[7-9\]" "thread1" } } */
-/* { dg-final { scan-tree-dump-times "Invalid sum" 1 "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: \[7-9\]" "thread2" } } */
 /* { dg-final { scan-tree-dump-not "optimizing for size" "thread1" } } */
 /* { dg-final { scan-tree-dump-not "optimizing for size" "thread2" } } */
 /* { dg-final { scan-tree-dump-not "optimizing for size" "thread3" { target { ! aarch64*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
index 0246ebf3c63..f83cefd8d89 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
@@ -22,6 +22,8 @@
 
    All the cases are picked up by VRP1 as jump threads.  */
 
-/* There used to be 6 jump threads found by thread1, but they all
-   depended on threading through distinct loops in ethread.  */
-/* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp-thread1" } } */
+/* This test should be obsoleted.  We used to catch 2 total threads in
+   vrp-thread1, but after adding loop rotating restrictions, we get
+   none.  Interestingly, on x86-64 we now get 1 in DOM2, 5 in DOM3,
+   and 1 in vrp-thread2.  */
+/* { dg-final { scan-tree-dump-not "Threaded" "vrp-thread1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
index 8f0a12c12ee..68808bd09fc 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
@@ -1,10 +1,9 @@
 /* { dg-do compile } */ 
-/* { dg-options "-O2 -fdump-tree-vrp-thread1-stats -fdump-tree-dom2-stats" } */
+/* { dg-options "-O2 -fdump-statistics" } */
 
 void bla();
 
-/* In the following case, we should be able to thread edge through
-   the loop header.  */
+/* No one should thread through the loop header.  */
 
 void thread_entry_through_header (void)
 {
@@ -14,8 +13,4 @@ void thread_entry_through_header (void)
     bla ();
 }
 
-/* There's a single jump thread that should be handled by the VRP
-   jump threading pass.  */
-/* { dg-final { scan-tree-dump-times "Jumps threaded: 1" 1 "vrp-thread1"} } */
-/* { dg-final { scan-tree-dump-times "Jumps threaded: 2" 0 "vrp-thread1"} } */
-/* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2"} } */
+/* { dg-final { scan-tree-dump-not "Jumps threaded" "statistics"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-4.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-4.c
index 46e464ff26a..9cd463571c4 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-4.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-4.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */ 
-/* { dg-options "-O2 -fdump-tree-vrp-thread1-details -fdump-tree-dom2-details -std=gnu89 --param logical-op-non-short-circuit=1" } */
+/* { dg-options "-O2 -fdump-tree-vrp-thread2-details -fdump-tree-dom2-details -std=gnu89 --param logical-op-non-short-circuit=1" } */
 struct bitmap_head_def;
 typedef struct bitmap_head_def *bitmap;
 typedef const struct bitmap_head_def *const_bitmap;
@@ -53,10 +53,8 @@ bitmap_ior_and_compl (bitmap dst, const_bitmap a, const_bitmap b,
 
   return changed;
 }
-/* The block starting the second conditional has  3 incoming edges,
-   we should thread all three, but due to a bug in the threading
-   code we missed the edge when the first conditional is false
-   (b_elt is zero, which means the second conditional is always
-   zero.  VRP1 catches all three.  */
-/* { dg-final { scan-tree-dump-times "Registering jump thread" 2 "vrp-thread1" } } */
-/* { dg-final { scan-tree-dump-times "Path crosses loops" 1 "vrp-thread1" } } */
+/* We used to catch 3 jump threads in vrp-thread1, but they all
+   rotated the loop, so they were disallowed.  This in turn created
+   other opportunities for the other threaders which result in the the
+   post-loop threader (vrp-thread2) catching more.  */
+/* { dg-final { scan-tree-dump-times "Registering jump thread" 5 "vrp-thread2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
index b0a7d423475..24de9d57d50 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
@@ -1,8 +1,12 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3-details" } */
 
-/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */
-/* { dg-final { scan-tree-dump-times "Registering jump" 1 "thread3" } } */
+/* ?? We should obsolete this test.  All the threads in thread1 and
+   thread3 we used to get cross the loop header but does not exit the
+   loop, so they have been deemed invalid.  */
+
+/* { dg-final { scan-tree-dump-times "Registering jump" 0 "thread1" } } */
+/* { dg-final { scan-tree-dump-times "Registering jump" 0 "thread3" } } */
 
 int sum0, sum1, sum2, sum3;
 int foo (char *s, char **ret)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
index 16abcde5053..1da00a691c8 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
@@ -1,15 +1,14 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
 
-/* { dg-final { scan-tree-dump "Jumps threaded: 12"  "thread1" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 5" "thread3" { target { ! aarch64*-*-* } } } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 12"  "thread3" } } */
 /* { dg-final { scan-tree-dump-not "Jumps threaded"  "dom2" } } */
 
 /* aarch64 has the highest CASE_VALUES_THRESHOLD in GCC.  It's high enough
    to change decisions in switch expansion which in turn can expose new
    jump threading opportunities.  Skip the later tests on aarch64.  */
 /* { dg-final { scan-tree-dump-not "Jumps threaded"  "dom3" { target { ! aarch64*-*-* } } } } */
-/* { dg-final { scan-tree-dump-not "Jumps threaded"  "vrp2" { target { ! aarch64*-*-* } } } } */
+/* { dg-final { scan-tree-dump-not "Jumps threaded"  "vrp-thread2" { target { ! aarch64*-*-* } } } } */
 
 enum STATE {
   S0=0,
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c
new file mode 100644
index 00000000000..bd56a62a4b4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c
@@ -0,0 +1,102 @@
+// { dg-do compile }
+// { dg-options "-O2 -fgimple -fdump-statistics" }
+//
+// This is a collection of seemingly threadble paths that should not be allowed.
+
+void foobar (int);
+
+// Possible thread from 2->4->3, but it would rotate the loop.
+void __GIMPLE (ssa)
+f1 ()
+{
+  int i;
+
+  // Pre-header.
+  __BB(2):
+  goto __BB4;
+
+  // Latch.
+  __BB(3):
+  foobar (i_1);
+  i_5 = i_1 + 1;
+  goto __BB4;
+
+  __BB(4,loop_header(1)):
+  i_1 = __PHI (__BB2: 0, __BB3: i_5);
+  if (i_1 != 101)
+    goto __BB3;
+  else
+    goto __BB5;
+
+  __BB(5):
+  return;
+
+}
+
+// Possible thread from 2->3->5 but threading through the empty latch
+// would create a non-empty latch.
+void __GIMPLE (ssa)
+f2 ()
+{
+  int i;
+
+  // Pre-header.
+  __BB(2):
+  goto __BB3;
+
+  __BB(3,loop_header(1)):
+  i_8 = __PHI (__BB5: i_5, __BB2: 0);
+  foobar (i_8);
+  i_5 = i_8 + 1;
+  if (i_5 != 256)
+    goto __BB5;
+  else
+    goto __BB4;
+
+  // Latch.
+  __BB(5):
+  goto __BB3;
+
+  __BB(4):
+  return;
+
+}
+
+// Possible thread from 3->5->6->3 but this would thread through the
+// header but not exit the loop.
+int __GIMPLE (ssa)
+f3 (int a)
+{
+  int i;
+
+  __BB(2):
+  goto __BB6;
+
+  __BB(3):
+  if (i_1 != 0)
+    goto __BB4;
+  else
+    goto __BB5;
+
+  __BB(4):
+  foobar (5);
+  goto __BB5;
+
+  // Latch.
+  __BB(5):
+  i_7 = i_1 + 1;
+  goto __BB6;
+
+  __BB(6,loop_header(1)):
+  i_1 = __PHI (__BB2: 1, __BB5: i_7);
+  if (i_1 <= 99)
+    goto __BB3;
+  else
+    goto __BB7;
+
+  __BB(7):
+  return;
+
+}
+
+// { dg-final { scan-tree-dump-not "Jumps threaded" "statistics" } }
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
index e68a9b62535..fc3adab3fc3 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
@@ -65,5 +65,9 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp1" } } */
+/* ?? The check below depends on jump threading.  There are no a
+   couple threaded paths that are being rejected because they either
+   rotate a loop or cross the loop header without exiting the
+   loop.  */
+/* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp1" { xfail *-*-* } } } */
   
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index 32ce1e3af40..293836cdc53 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -278,7 +278,7 @@ cancel_thread (vec<jump_thread_edge *> *path, const char *reason = NULL)
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       if (reason)
-	fprintf (dump_file, "%s:\n", reason);
+	fprintf (dump_file, "%s: ", reason);
 
       dump_jump_thread_path (dump_file, *path, false);
       fprintf (dump_file, "\n");
@@ -2771,6 +2771,7 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
   bool seen_latch = false;
   int loops_crossed = 0;
   bool crossed_latch = false;
+  bool crossed_loop_header = false;
   // Use ->dest here instead of ->src to ignore the first block.  The
   // first block is allowed to be in a different loop, since it'll be
   // redirected.  See similar comment in profitable_path_p: "we don't
@@ -2804,6 +2805,14 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
 	  ++loops_crossed;
 	}
 
+      // ?? Avoid threading through loop headers that remain in the
+      // loop, as such threadings tend to create sub-loops which
+      // _might_ be OK ??.
+      if (e->dest->loop_father->header == e->dest
+	  && !flow_loop_nested_p (exit->dest->loop_father,
+				  e->dest->loop_father))
+	crossed_loop_header = true;
+
       if (flag_checking && !m_backedge_threads)
 	gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0);
     }
@@ -2829,6 +2838,21 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
       cancel_thread (&path, "Path crosses loops");
       return true;
     }
+  // The path should either start and end in the same loop or exit the
+  // loop it starts in but never enter a loop.  This also catches
+  // creating irreducible loops, not only rotation.
+  if (entry->src->loop_father != exit->dest->loop_father
+      && !flow_loop_nested_p (exit->src->loop_father,
+			      entry->dest->loop_father))
+    {
+      cancel_thread (&path, "Path rotates loop");
+      return true;
+    }
+  if (crossed_loop_header)
+    {
+      cancel_thread (&path, "Path crosses loop header but does not exit it");
+      return true;
+    }
   return false;
 }
 
diff --git a/libgomp/testsuite/libgomp.graphite/force-parallel-5.c b/libgomp/testsuite/libgomp.graphite/force-parallel-5.c
index b83ca79dfae..de31d6436f5 100644
--- a/libgomp/testsuite/libgomp.graphite/force-parallel-5.c
+++ b/libgomp/testsuite/libgomp.graphite/force-parallel-5.c
@@ -31,6 +31,6 @@ int main(void)
 }
 
 /* Check that parallel code generation part make the right answer.  */
-/* { dg-final { scan-tree-dump-times "2 loops carried no dependency" 1 "graphite" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "2 loops carried no dependency" 1 "graphite" } } */
 /* { dg-final { scan-tree-dump-times "loopfn.0" 4 "optimized" } } */
 /* { dg-final { scan-tree-dump-times "loopfn.1" 4 "optimized" } } */
-- 
2.31.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-05 13:33         ` Aldy Hernandez
  2021-10-05 15:10           ` Jeff Law
  2021-10-05 16:08           ` Jeff Law
@ 2021-10-06 13:15           ` Andreas Schwab
  2021-10-06 13:47             ` Aldy Hernandez
  2 siblings, 1 reply; 27+ messages in thread
From: Andreas Schwab @ 2021-10-06 13:15 UTC (permalink / raw)
  To: Aldy Hernandez via Gcc-patches; +Cc: Michael Matz, Aldy Hernandez

On Okt 05 2021, Aldy Hernandez via Gcc-patches wrote:

> From 5abe65668f602d53b8f3dc515508dc4616beb048 Mon Sep 17 00:00:00 2001
> From: Aldy Hernandez <aldyh@redhat.com>
> Date: Tue, 5 Oct 2021 15:03:34 +0200
> Subject: [PATCH] Loosen loop crossing restriction in threader.
>
> Crossing loops is generally discouraged from the threader, but we can
> make an exception when we don't cross the latch or enter another loop,
> since this is just an early exit out of the loop.

This breaks bootstrap on aarch64 (in stage2):

In function 'void mark_stack_region_used(poly_uint64, poly_uint64)',
    inlined from 'rtx_def* emit_library_call_value_1(int, rtx, rtx, libcall_type, machine_mode, int, rtx_mode_t*)' at ../../gcc/calls.c:4536:29:
../../gcc/calls.c:206:26: error: 'const_upper' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  206 |       stack_usage_map[i] = 1;
      |       ~~~~~~~~~~~~~~~~~~~^~~
../../gcc/calls.c: In function 'rtx_def* emit_library_call_value_1(int, rtx, rtx, libcall_type, machine_mode, int, rtx_mode_t*)':
../../gcc/calls.c:202:39: note: 'const_upper' was declared here
  202 |   unsigned HOST_WIDE_INT const_lower, const_upper;
      |                                       ^~~~~~~~~~~

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-06 13:15           ` Andreas Schwab
@ 2021-10-06 13:47             ` Aldy Hernandez
  2021-10-06 15:03               ` Martin Sebor
  2021-10-06 23:06               ` Jeff Law
  0 siblings, 2 replies; 27+ messages in thread
From: Aldy Hernandez @ 2021-10-06 13:47 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Aldy Hernandez via Gcc-patches, Michael Matz, Martin Sebor

The pending patch I have from Richi fixes this.  Even so, it's the
uninit code that's confused.

Sigh...every single change to the threading code shines the light on
some warning bug.

If you take the calls.ii file from the aarch64 bootstrap and break on
the warning, you can see that the uninitalized use is for
const_upper_3934 here:

 <bb 102> [local count: 315357954]:
  # const_upper_3934 = PHI <const_upper_3937(D)(101), _6707(293)>
  if (_881 != 0)
    goto <bb 103>; [50.00%]
  else
    goto <bb 106>; [50.00%]

  <bb 103> [local count: 157678977]:
  if (const_upper_3934 > _6699)
    goto <bb 105>; [89.00%]
  else
    goto <bb 294>; [11.00%]

  <bb 294> [local count: 17344687]:

  <bb 104> [local count: 157678977]:
  goto <bb 107>; [100.00%]

  <bb 105> [local count: 140334290]:
  stack_usage_map.481_3930 = stack_usage_map;
  _6441 = const_upper_3934 - _6699;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PROBLEMATIC READ HERE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  _4819 = stack_usage_map.481_3930 + _6699;
  __builtin_memset (_4819, 1, _6441);
  goto <bb 104>; [11.00%]

const_upper_3934 could be undefined if it comes from BB101
(const_upper_3937(D)), but it only gets read for _881 != 0, so it
shouldn't warn.

I suggest -Wmaybe-uninitialized be turned off for that file until the
warning is fixed.

And yes, the proposed patch will also cure this, but the underlying
problem in the warning is still there.

Aldy

On Wed, Oct 6, 2021 at 3:24 PM Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> On Okt 05 2021, Aldy Hernandez via Gcc-patches wrote:
>
> > From 5abe65668f602d53b8f3dc515508dc4616beb048 Mon Sep 17 00:00:00 2001
> > From: Aldy Hernandez <aldyh@redhat.com>
> > Date: Tue, 5 Oct 2021 15:03:34 +0200
> > Subject: [PATCH] Loosen loop crossing restriction in threader.
> >
> > Crossing loops is generally discouraged from the threader, but we can
> > make an exception when we don't cross the latch or enter another loop,
> > since this is just an early exit out of the loop.
>
> This breaks bootstrap on aarch64 (in stage2):
>
> In function 'void mark_stack_region_used(poly_uint64, poly_uint64)',
>     inlined from 'rtx_def* emit_library_call_value_1(int, rtx, rtx, libcall_type, machine_mode, int, rtx_mode_t*)' at ../../gcc/calls.c:4536:29:
> ../../gcc/calls.c:206:26: error: 'const_upper' may be used uninitialized in this function [-Werror=maybe-uninitialized]
>   206 |       stack_usage_map[i] = 1;
>       |       ~~~~~~~~~~~~~~~~~~~^~~
> ../../gcc/calls.c: In function 'rtx_def* emit_library_call_value_1(int, rtx, rtx, libcall_type, machine_mode, int, rtx_mode_t*)':
> ../../gcc/calls.c:202:39: note: 'const_upper' was declared here
>   202 |   unsigned HOST_WIDE_INT const_lower, const_upper;
>       |                                       ^~~~~~~~~~~
>
> Andreas.
>
> --
> Andreas Schwab, schwab@linux-m68k.org
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> "And now for something completely different."
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-06 13:47             ` Aldy Hernandez
@ 2021-10-06 15:03               ` Martin Sebor
  2021-10-06 16:22                 ` Aldy Hernandez
  2021-10-06 23:06               ` Jeff Law
  1 sibling, 1 reply; 27+ messages in thread
From: Martin Sebor @ 2021-10-06 15:03 UTC (permalink / raw)
  To: Aldy Hernandez, Andreas Schwab
  Cc: Martin Sebor, Michael Matz, Aldy Hernandez via Gcc-patches

On 10/6/21 7:47 AM, Aldy Hernandez via Gcc-patches wrote:
> The pending patch I have from Richi fixes this.  Even so, it's the
> uninit code that's confused.
> 
> Sigh...every single change to the threading code shines the light on
> some warning bug.
> 
> If you take the calls.ii file from the aarch64 bootstrap and break on
> the warning, you can see that the uninitalized use is for
> const_upper_3934 here:
> 
>   <bb 102> [local count: 315357954]:
>    # const_upper_3934 = PHI <const_upper_3937(D)(101), _6707(293)>
>    if (_881 != 0)
>      goto <bb 103>; [50.00%]
>    else
>      goto <bb 106>; [50.00%]
> 
>    <bb 103> [local count: 157678977]:
>    if (const_upper_3934 > _6699)
>      goto <bb 105>; [89.00%]
>    else
>      goto <bb 294>; [11.00%]
> 
>    <bb 294> [local count: 17344687]:
> 
>    <bb 104> [local count: 157678977]:
>    goto <bb 107>; [100.00%]
> 
>    <bb 105> [local count: 140334290]:
>    stack_usage_map.481_3930 = stack_usage_map;
>    _6441 = const_upper_3934 - _6699;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> PROBLEMATIC READ HERE
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>    _4819 = stack_usage_map.481_3930 + _6699;
>    __builtin_memset (_4819, 1, _6441);
>    goto <bb 104>; [11.00%]
> 
> const_upper_3934 could be undefined if it comes from BB101
> (const_upper_3937(D)), but it only gets read for _881 != 0, so it
> shouldn't warn.
> 
> I suggest -Wmaybe-uninitialized be turned off for that file until the
> warning is fixed.

-Wmaybe-uninitialized certainly suffers from a high rate of false
positives (I count 40 open PRs).  Even after all this time debugging
it I still struggle with the code for it but in all the bug reports
I've reviewed, only a small subset seems to be due to bugs or even
limitations in it.  Most are inherent in its design goal to warn
for reads from variables that cannot be proven to have been
initialized.

If this one is a bug with some chance of getting fixed it would
be good to distill it to a small test case (even though it's not
unlikely that there already is an existing report with the same
root cause).

That said, from from your description and the IL above it sounds
to me like the uninitialized read here is possible (it takes place
when _881 != 0), and so -Wmaybe-uininitialized triggers as it's
supposed to.

Either way, rather than suppressing the warning for the whole file
I would suggest to consider initializing the local variable instead.
Looking at the code in calls.c, I can't convince myself that
the variable cannot, in fact, be used uninitialized.

Martin

> 
> And yes, the proposed patch will also cure this, but the underlying
> problem in the warning is still there.
> 
> Aldy
> 
> On Wed, Oct 6, 2021 at 3:24 PM Andreas Schwab <schwab@linux-m68k.org> wrote:
>>
>> On Okt 05 2021, Aldy Hernandez via Gcc-patches wrote:
>>
>>>  From 5abe65668f602d53b8f3dc515508dc4616beb048 Mon Sep 17 00:00:00 2001
>>> From: Aldy Hernandez <aldyh@redhat.com>
>>> Date: Tue, 5 Oct 2021 15:03:34 +0200
>>> Subject: [PATCH] Loosen loop crossing restriction in threader.
>>>
>>> Crossing loops is generally discouraged from the threader, but we can
>>> make an exception when we don't cross the latch or enter another loop,
>>> since this is just an early exit out of the loop.
>>
>> This breaks bootstrap on aarch64 (in stage2):
>>
>> In function 'void mark_stack_region_used(poly_uint64, poly_uint64)',
>>      inlined from 'rtx_def* emit_library_call_value_1(int, rtx, rtx, libcall_type, machine_mode, int, rtx_mode_t*)' at ../../gcc/calls.c:4536:29:
>> ../../gcc/calls.c:206:26: error: 'const_upper' may be used uninitialized in this function [-Werror=maybe-uninitialized]
>>    206 |       stack_usage_map[i] = 1;
>>        |       ~~~~~~~~~~~~~~~~~~~^~~
>> ../../gcc/calls.c: In function 'rtx_def* emit_library_call_value_1(int, rtx, rtx, libcall_type, machine_mode, int, rtx_mode_t*)':
>> ../../gcc/calls.c:202:39: note: 'const_upper' was declared here
>>    202 |   unsigned HOST_WIDE_INT const_lower, const_upper;
>>        |                                       ^~~~~~~~~~~
>>
>> Andreas.
>>
>> --
>> Andreas Schwab, schwab@linux-m68k.org
>> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
>> "And now for something completely different."
>>
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-06 15:03               ` Martin Sebor
@ 2021-10-06 16:22                 ` Aldy Hernandez
  2021-10-06 17:03                   ` Aldy Hernandez
  2021-10-06 23:00                   ` Jeff Law
  0 siblings, 2 replies; 27+ messages in thread
From: Aldy Hernandez @ 2021-10-06 16:22 UTC (permalink / raw)
  To: Martin Sebor, Jakub Jelinek
  Cc: Andreas Schwab, Martin Sebor, Michael Matz,
	Aldy Hernandez via Gcc-patches

On Wed, Oct 6, 2021 at 5:03 PM Martin Sebor <msebor@gmail.com> wrote:
>
> On 10/6/21 7:47 AM, Aldy Hernandez via Gcc-patches wrote:

> -Wmaybe-uninitialized certainly suffers from a high rate of false
> positives (I count 40 open PRs).  Even after all this time debugging
> it I still struggle with the code for it but in all the bug reports
> I've reviewed, only a small subset seems to be due to bugs or even
> limitations in it.  Most are inherent in its design goal to warn
> for reads from variables that cannot be proven to have been
> initialized.
>
> If this one is a bug with some chance of getting fixed it would
> be good to distill it to a small test case (even though it's not
> unlikely that there already is an existing report with the same
> root cause).
>
> That said, from from your description and the IL above it sounds
> to me like the uninitialized read here is possible (it takes place
> when _881 != 0), and so -Wmaybe-uininitialized triggers as it's
> supposed to.
>
> Either way, rather than suppressing the warning for the whole file
> I would suggest to consider initializing the local variable instead.
> Looking at the code in calls.c, I can't convince myself that
> the variable cannot, in fact, be used uninitialized.
>
> Martin

FWIW, libgomp/team.c suffers from the same problem if you remove the
stop-gap in gomp_team_start():

  struct gomp_thread_start_data *start_data = NULL;

The use of start_data in the block predicated by "if (nested)" is the
only path that can use start_data without passing through its
initialization.  But the only way to elide the initialization is from:

if (!nested)
{
  ...
 ....
  if (i == nthreads)
    goto do_release;
}

So the use of start_data either crosses the gomp_alloca
initialization, or is predicated by if(nested).

And the gimple shows a very similar pattern to what I described earlier:

  <bb 163> [local count: 59055800]:
  # start_data_876 = PHI <start_data_770(162), start_data_258(288),
start_data_359(307)>
do_release:
  if (_1 != 0)
    goto <bb 164>; [55.90%]
  else
    goto <bb 289>; [44.10%]

  <bb 289> [local count: 26046541]:
  goto <bb 165>; [100.00%]

  <bb 164> [local count: 33009259]:
  _223 = &team_417(D)->barrier;
  gomp_barrier_wait (_223);
  goto <bb 166>; [100.00%]

  <bb 265> [local count: 6962719]:

  <bb 165> [local count: 33009259]:
  # start_data_781 = PHI <start_data_876(289), start_data_518(D)(265)>
 # old_threads_used_887 = PHI <old_threads_used_782(289),
old_threads_used_454(265)>
  # affinity_count_825 = PHI <affinity_count_885(289), affinity_count_343(265)>
  # affinity_thr_904 = PHI <affinity_thr_867(289), 0B(265)>
  # force_display_840 = PHI <force_display_612(289), force_display_192(265)>
  _589 = &MEM[(struct gomp_simple_barrier_t *)pool_410 + 64B].bar;
  gomp_barrier_wait (_589);

  <bb 166> [local count: 66018519]:
  # start_data_870 = PHI <start_data_876(164), start_data_781(165)>

The use of start_data_781 coming in from  BB265 (start_data_518(D))
would be uninitialized if the read wasn't guarded by "if (_1 != 0)".

It seems uninit has issues seeing this pattern.

Unfortunately reducing this has been, ahem...challenging, but the
problem is still there, both within calls.c on aarch64 and on x86-64
for libgomp/team.c.

Aldy


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-06 16:22                 ` Aldy Hernandez
@ 2021-10-06 17:03                   ` Aldy Hernandez
  2021-10-06 19:11                     ` Martin Sebor
  2021-10-06 23:00                   ` Jeff Law
  1 sibling, 1 reply; 27+ messages in thread
From: Aldy Hernandez @ 2021-10-06 17:03 UTC (permalink / raw)
  To: Martin Sebor, Jakub Jelinek
  Cc: Andreas Schwab, Martin Sebor, Michael Matz,
	Aldy Hernandez via Gcc-patches

FWIW, I've opened a PR with both testcases:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102631


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-06 17:03                   ` Aldy Hernandez
@ 2021-10-06 19:11                     ` Martin Sebor
  0 siblings, 0 replies; 27+ messages in thread
From: Martin Sebor @ 2021-10-06 19:11 UTC (permalink / raw)
  To: Aldy Hernandez, Jakub Jelinek
  Cc: Andreas Schwab, Martin Sebor, Michael Matz,
	Aldy Hernandez via Gcc-patches

On 10/6/21 11:03 AM, Aldy Hernandez wrote:
> FWIW, I've opened a PR with both testcases:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102631
> 

Okay, thanks.  I see what you mean that reducing it to a test case
is challenging.  I'll see if I can find some time to distill it to
something useful.  Until then, to get past the warning, I recommend
to simply initialize the variable.  For simple scalars there's little
reason not to.

Martin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-06 16:22                 ` Aldy Hernandez
  2021-10-06 17:03                   ` Aldy Hernandez
@ 2021-10-06 23:00                   ` Jeff Law
  1 sibling, 0 replies; 27+ messages in thread
From: Jeff Law @ 2021-10-06 23:00 UTC (permalink / raw)
  To: Aldy Hernandez, Martin Sebor, Jakub Jelinek
  Cc: Aldy Hernandez via Gcc-patches, Martin Sebor, Michael Matz,
	Andreas Schwab



On 10/6/2021 10:22 AM, Aldy Hernandez via Gcc-patches wrote:
> On Wed, Oct 6, 2021 at 5:03 PM Martin Sebor <msebor@gmail.com> wrote:
>> On 10/6/21 7:47 AM, Aldy Hernandez via Gcc-patches wrote:
>> -Wmaybe-uninitialized certainly suffers from a high rate of false
>> positives (I count 40 open PRs).  Even after all this time debugging
>> it I still struggle with the code for it but in all the bug reports
>> I've reviewed, only a small subset seems to be due to bugs or even
>> limitations in it.  Most are inherent in its design goal to warn
>> for reads from variables that cannot be proven to have been
>> initialized.
>>
>> If this one is a bug with some chance of getting fixed it would
>> be good to distill it to a small test case (even though it's not
>> unlikely that there already is an existing report with the same
>> root cause).
>>
>> That said, from from your description and the IL above it sounds
>> to me like the uninitialized read here is possible (it takes place
>> when _881 != 0), and so -Wmaybe-uininitialized triggers as it's
>> supposed to.
>>
>> Either way, rather than suppressing the warning for the whole file
>> I would suggest to consider initializing the local variable instead.
>> Looking at the code in calls.c, I can't convince myself that
>> the variable cannot, in fact, be used uninitialized.
>>
>> Martin
> FWIW, libgomp/team.c suffers from the same problem if you remove the
> stop-gap in gomp_team_start():
>
>    struct gomp_thread_start_data *start_data = NULL;
>
> The use of start_data in the block predicated by "if (nested)" is the
> only path that can use start_data without passing through its
> initialization.  But the only way to elide the initialization is from:
>
> if (!nested)
> {
>    ...
>   ....
>    if (i == nthreads)
>      goto do_release;
> }
>
> So the use of start_data either crosses the gomp_alloca
> initialization, or is predicated by if(nested).
It may simply be the case that the uninit analysis gave up or it may 
have failed to simplify its predicates enough to realize the use is 
properly guarded.  These things certainly do happen.

Of course this is all part of why we try so hard to thread jumps to 
aggressively.  Missed jump threads closely correlate with bogus 
uninitialized warnings due to infeasible paths in the CFG.  In my mind 
the predicate analysis we do for uninit is primarily there to catch 
cases that jump threading discovers, but does not optimize due to cost.

Jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-06 13:47             ` Aldy Hernandez
  2021-10-06 15:03               ` Martin Sebor
@ 2021-10-06 23:06               ` Jeff Law
  2021-10-07  8:15                 ` Aldy Hernandez
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff Law @ 2021-10-06 23:06 UTC (permalink / raw)
  To: Aldy Hernandez, Andreas Schwab
  Cc: Martin Sebor, Michael Matz, Aldy Hernandez via Gcc-patches



On 10/6/2021 7:47 AM, Aldy Hernandez via Gcc-patches wrote:
> The pending patch I have from Richi fixes this.  Even so, it's the
> uninit code that's confused.
>
> Sigh...every single change to the threading code shines the light on
> some warning bug.
>
> If you take the calls.ii file from the aarch64 bootstrap and break on
> the warning, you can see that the uninitalized use is for
> const_upper_3934 here:
>
>   <bb 102> [local count: 315357954]:
>    # const_upper_3934 = PHI <const_upper_3937(D)(101), _6707(293)>
>    if (_881 != 0)
>      goto <bb 103>; [50.00%]
>    else
>      goto <bb 106>; [50.00%]
>
>    <bb 103> [local count: 157678977]:
>    if (const_upper_3934 > _6699)
>      goto <bb 105>; [89.00%]
>    else
>      goto <bb 294>; [11.00%]
>
>    <bb 294> [local count: 17344687]:
>
>    <bb 104> [local count: 157678977]:
>    goto <bb 107>; [100.00%]
>
>    <bb 105> [local count: 140334290]:
>    stack_usage_map.481_3930 = stack_usage_map;
>    _6441 = const_upper_3934 - _6699;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> PROBLEMATIC READ HERE
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>    _4819 = stack_usage_map.481_3930 + _6699;
>    __builtin_memset (_4819, 1, _6441);
>    goto <bb 104>; [11.00%]
>
> const_upper_3934 could be undefined if it comes from BB101
> (const_upper_3937(D)), but it only gets read for _881 != 0, so it
> shouldn't warn.
>
> I suggest -Wmaybe-uninitialized be turned off for that file until the
> warning is fixed.
>
> And yes, the proposed patch will also cure this, but the underlying
> problem in the warning is still there.
You haven't shown enough context for me to agree or disagree.  Are there 
other preds to bb105?   I hate these slimmed down dumps.  I work better 
with the full pred/succ lists. -fdump-tree-<whatever>-blocks-details :-)

It appears to me that for _881 != 0 we certainly flow into the read of 
_const_upper_3934 in bb103 and bb105.  Why do you think that's safe?

Jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-06 23:06               ` Jeff Law
@ 2021-10-07  8:15                 ` Aldy Hernandez
  2021-10-07 13:35                   ` Jeff Law
  0 siblings, 1 reply; 27+ messages in thread
From: Aldy Hernandez @ 2021-10-07  8:15 UTC (permalink / raw)
  To: Jeff Law, Andreas Schwab
  Cc: Martin Sebor, Michael Matz, Aldy Hernandez via Gcc-patches

[Andrew, ranger comment embedded below.]

On 10/7/21 1:06 AM, Jeff Law wrote:
> 
> 
> On 10/6/2021 7:47 AM, Aldy Hernandez via Gcc-patches wrote:
>> The pending patch I have from Richi fixes this.  Even so, it's the
>> uninit code that's confused.
>>
>> Sigh...every single change to the threading code shines the light on
>> some warning bug.
>>
>> If you take the calls.ii file from the aarch64 bootstrap and break on
>> the warning, you can see that the uninitalized use is for
>> const_upper_3934 here:
>>
>>   <bb 102> [local count: 315357954]:
>>    # const_upper_3934 = PHI <const_upper_3937(D)(101), _6707(293)>
>>    if (_881 != 0)
>>      goto <bb 103>; [50.00%]
>>    else
>>      goto <bb 106>; [50.00%]
>>
>>    <bb 103> [local count: 157678977]:
>>    if (const_upper_3934 > _6699)
>>      goto <bb 105>; [89.00%]
>>    else
>>      goto <bb 294>; [11.00%]
>>
>>    <bb 294> [local count: 17344687]:
>>
>>    <bb 104> [local count: 157678977]:
>>    goto <bb 107>; [100.00%]
>>
>>    <bb 105> [local count: 140334290]:
>>    stack_usage_map.481_3930 = stack_usage_map;
>>    _6441 = const_upper_3934 - _6699;
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> PROBLEMATIC READ HERE
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>    _4819 = stack_usage_map.481_3930 + _6699;
>>    __builtin_memset (_4819, 1, _6441);
>>    goto <bb 104>; [11.00%]
>>
>> const_upper_3934 could be undefined if it comes from BB101
>> (const_upper_3937(D)), but it only gets read for _881 != 0, so it
>> shouldn't warn.
>>
>> I suggest -Wmaybe-uninitialized be turned off for that file until the
>> warning is fixed.
>>
>> And yes, the proposed patch will also cure this, but the underlying
>> problem in the warning is still there.
> You haven't shown enough context for me to agree or disagree.  Are there 
> other preds to bb105?   I hate these slimmed down dumps.  I work better 
> with the full pred/succ lists. -fdump-tree-<whatever>-blocks-details :-)
> 
> It appears to me that for _881 != 0 we certainly flow into the read of 
> _const_upper_3934 in bb103 and bb105.  Why do you think that's safe?

My bad, there's some missing context.

The only way to get to BB101->BB102 is through:

   <bb 100>
   if (_6711 != 0)
     goto <bb 101>; [5.50%]
   else
     goto <bb 293>; [94.50%]

And there's an implicit relation between _6711 and _811:

<bb 86>
...
   if (_6711 != 0)
     goto <bb 287>; [5.50%]
   else
     goto <bb 87>; [94.50%]

   <bb 287> [local count: 17344687]:
   goto <bb 88>; [100.00%]

   <bb 87> [local count: 298013267]:

   <bb 88> [local count: 315357954]:
   # _881 = PHI <1(87), 0(287)>

That is, _6711 == !_881.

[Andrew, it'd be neat if we could teach ranger the relationship between 
_6711 and _811 above.  And also, that _881 is [0,1].  Perhaps with the 
relation oracle, the uninit code could notice that a _6711 guard is also 
a !_811 guard.]

Presumably the threader shuffled things sufficiently so that the above 
relationship was difficult to devise.

Your preferred full context follows ;-).

Aldy

----

Here we see that _881 == 0 when _6711 != 0:

;;   basic block 286, loop depth 1, count 640272214 (estimated locally), 
maybe hot
;;    prev block 85, next block 86, flags: (NEW)
;;    pred:       85 [67.0% (guessed)]  count:640272214 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)
   goto <bb 112>; [100.00%]
;;    succ:       112 [always]  count:640272214 (estimated locally) 
(FALLTHRU)

;;   basic block 86, loop depth 1, count 315357954 (estimated locally), 
maybe hot
;;    prev block 286, next block 287, flags: (NEW, REACHABLE, VISITED)
;;    pred:       85 [33.0% (guessed)]  count:315357953 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
   _4293 = MEM[(long int *)_5338 + 80B];
   _6727 = MEM[(long int *)_5338 + 88B];
   _6715 = MEM[(long int *)_5338 + 32B];
   _4305 = _4293 + _6715;
   _4299 = MEM[(long int *)_5338 + 40B];
   _4300 = _4299 + _6727;
   _6707 = (long unsigned int) _4305;
   _6711 = (long unsigned int) _4300;
   _6699 = (long unsigned int) _4293;
   if (_6711 != 0)
     goto <bb 287>; [5.50%]
   else
     goto <bb 87>; [94.50%]
;;    succ:       287 [5.5% (guessed)]  count:17344687 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
;;                87 [94.5% (guessed)]  count:298013267 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 287, loop depth 1, count 17344687 (estimated locally), 
maybe hot
;;    prev block 86, next block 87, flags: (NEW)
;;    pred:       86 [5.5% (guessed)]  count:17344687 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
   goto <bb 88>; [100.00%]
;;    succ:       88 [always]  count:17344687 (estimated locally) (FALLTHRU)

;;   basic block 87, loop depth 1, count 298013267 (estimated locally), 
maybe hot
;;    prev block 287, next block 88, flags: (NEW, VISITED)
;;    pred:       86 [94.5% (guessed)]  count:298013267 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)
;;    succ:       88 [always]  count:298013267 (estimated locally) 
(FALLTHRU,EXECUTABLE)

;;   basic block 88, loop depth 1, count 315357954 (estimated locally), 
maybe hot
;;    prev block 87, next block 89, flags: (NEW, REACHABLE, VISITED)
;;    pred:       87 [always]  count:298013267 (estimated locally) 
(FALLTHRU,EXECUTABLE)
;;                287 [always]  count:17344687 (estimated locally) 
(FALLTHRU)
   # const_upper_3863 = PHI <_6707(87), 18446744073709551615(287)>
   # _881 = PHI <1(87), 0(287)>

[snip]
[snip]
[snip]

And here we see that the "undefined" read is guarded by _6711:

;;   basic block 100, loop depth 1, count 315357955 (estimated locally), 
maybe hot
;;    prev block 99, next block 293, flags: (NEW, REACHABLE, VISITED)
;;    pred:       98 [always]  count:94607385 (estimated locally) 
(FALLTHRU,EXECUTABLE)
;;                99 [always]  count:220750569 (estimated locally) 
(FALLTHRU,EXECUTABLE)
   # iftmp.581_254 = PHI <_669(98), _3922(99)>
   MEM[(struct poly_int *)&D.150826] ={v} {CLOBBER};
   _4417 = MEM[(long int *)_5338 + 56B];
   MEM[(struct poly_int *)&D.150826].D.21135.coeffs[0] = _4417;
   _4410 = MEM[(long int *)_5338 + 64B];
   MEM[(struct poly_int *)&D.150826].D.21135.coeffs[1] = _4410;
   _673 = gen_int_mode (D.150826, 16);
   MEM[(struct poly_int *)&D.150832] ={v} {CLOBBER};
   MEM[(struct poly_int *)&D.150832].D.21135.coeffs[0] = 0;
   MEM[(struct poly_int *)&D.150832].D.21135.coeffs[1] = 0;
   emit_push_insn (val_598, mode_597, 0B, 0B, parm_align_601, 
partial_600, reg_599, D.150832, argblock_88, _673, 0, iftmp.581_254, 0);
   D.150832 ={v} {CLOBBER};
   D.150826 ={v} {CLOBBER};
   D.150830 ={v} {CLOBBER};
   D.150828 ={v} {CLOBBER};
   if (_6711 != 0)
     goto <bb 101>; [5.50%]
   else
     goto <bb 293>; [94.50%]
;;    succ:       101 [5.5% (guessed)]  count:17344687 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
;;                293 [94.5% (guessed)]  count:298013268 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 293, loop depth 1, count 298013268 (estimated locally), 
maybe hot
;;    prev block 100, next block 101, flags: (NEW)
;;    pred:       100 [94.5% (guessed)]  count:298013268 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)
   goto <bb 102>; [100.00%]
;;    succ:       102 [always]  count:298013268 (estimated locally) 
(FALLTHRU)

;;   basic block 101, loop depth 1, count 17344687 (estimated locally), 
maybe hot
;;    prev block 293, next block 102, flags: (NEW, VISITED)
;;    pred:       100 [5.5% (guessed)]  count:17344687 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
;;    succ:       102 [always]  count:17344687 (estimated locally) 
(FALLTHRU,EXECUTABLE)

;;   basic block 102, loop depth 1, count 315357954 (estimated locally), 
maybe hot
;;    prev block 101, next block 103, flags: (NEW, REACHABLE, VISITED)
;;    pred:       101 [always]  count:17344687 (estimated locally) 
(FALLTHRU,EXECUTABLE)
;;                293 [always]  count:298013268 (estimated locally) 
(FALLTHRU)
   # const_upper_3934 = PHI <const_upper_3937(D)(101), _6707(293)>
   if (_881 != 0)
     goto <bb 103>; [50.00%]
   else
     goto <bb 106>; [50.00%]
;;    succ:       103 [50.0% (guessed)]  count:157678977 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
;;                106 [50.0% (guessed)]  count:157678977 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 103, loop depth 1, count 157678977 (estimated locally), 
maybe hot
;;    prev block 102, next block 294, flags: (NEW, REACHABLE, VISITED)
;;    pred:       102 [50.0% (guessed)]  count:157678977 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
   if (const_upper_3934 > _6699)
     goto <bb 105>; [89.00%]
   else
     goto <bb 294>; [11.00%]
;;    succ:       105 [89.0% (guessed)]  count:140334290 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
;;                294 [11.0% (guessed)]  count:17344687 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 294, loop depth 1, count 17344687 (estimated locally), 
maybe hot
;;    prev block 103, next block 104, flags: (NEW)
;;    pred:       103 [11.0% (guessed)]  count:17344687 (estimated 
locally) (FALSE_VALUE,EXECUTABLE)
;;    succ:       104 [always]  count:17344687 (estimated locally) 
(FALLTHRU)

;;   basic block 104, loop depth 1, count 157678977 (estimated locally), 
maybe hot
;;   Invalid sum of incoming counts 32781459 (estimated locally), should 
be 157678977 (estimated locally)
;;    prev block 294, next block 105, flags: (NEW, VISITED)
;;    pred:       294 [always]  count:17344687 (estimated locally) 
(FALLTHRU)
;;                105 [11.0% (guessed)]  count:15436772 (estimated 
locally) (FALLTHRU,EXECUTABLE)
   goto <bb 107>; [100.00%]
;;    succ:       107 [always]  count:157678977 (estimated locally) 
(FALLTHRU,EXECUTABLE)

;;   basic block 105, loop depth 1, count 140334290 (estimated locally), 
maybe hot
;;    prev block 104, next block 106, flags: (NEW, REACHABLE, VISITED)
;;    pred:       103 [89.0% (guessed)]  count:140334290 (estimated 
locally) (TRUE_VALUE,EXECUTABLE)
   stack_usage_map.481_3930 = stack_usage_map;
   _6441 = const_upper_3934 - _6699;
   _4819 = stack_usage_map.481_3930 + _6699;
   __builtin_memset (_4819, 1, _6441);
   goto <bb 104>; [11.00%]


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-07  8:15                 ` Aldy Hernandez
@ 2021-10-07 13:35                   ` Jeff Law
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Law @ 2021-10-07 13:35 UTC (permalink / raw)
  To: Aldy Hernandez, Andreas Schwab
  Cc: Martin Sebor, Michael Matz, Aldy Hernandez via Gcc-patches



On 10/7/2021 2:15 AM, Aldy Hernandez wrote:
> [Andrew, ranger comment embedded below.]
>
> On 10/7/21 1:06 AM, Jeff Law wrote:
>>
>>
>> On 10/6/2021 7:47 AM, Aldy Hernandez via Gcc-patches wrote:
>>> The pending patch I have from Richi fixes this.  Even so, it's the
>>> uninit code that's confused.
>>>
>>> Sigh...every single change to the threading code shines the light on
>>> some warning bug.
>>>
>>> If you take the calls.ii file from the aarch64 bootstrap and break on
>>> the warning, you can see that the uninitalized use is for
>>> const_upper_3934 here:
>>>
>>>   <bb 102> [local count: 315357954]:
>>>    # const_upper_3934 = PHI <const_upper_3937(D)(101), _6707(293)>
>>>    if (_881 != 0)
>>>      goto <bb 103>; [50.00%]
>>>    else
>>>      goto <bb 106>; [50.00%]
>>>
>>>    <bb 103> [local count: 157678977]:
>>>    if (const_upper_3934 > _6699)
>>>      goto <bb 105>; [89.00%]
>>>    else
>>>      goto <bb 294>; [11.00%]
>>>
>>>    <bb 294> [local count: 17344687]:
>>>
>>>    <bb 104> [local count: 157678977]:
>>>    goto <bb 107>; [100.00%]
>>>
>>>    <bb 105> [local count: 140334290]:
>>>    stack_usage_map.481_3930 = stack_usage_map;
>>>    _6441 = const_upper_3934 - _6699;
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> PROBLEMATIC READ HERE
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>    _4819 = stack_usage_map.481_3930 + _6699;
>>>    __builtin_memset (_4819, 1, _6441);
>>>    goto <bb 104>; [11.00%]
>>>
>>> const_upper_3934 could be undefined if it comes from BB101
>>> (const_upper_3937(D)), but it only gets read for _881 != 0, so it
>>> shouldn't warn.
>>>
>>> I suggest -Wmaybe-uninitialized be turned off for that file until the
>>> warning is fixed.
>>>
>>> And yes, the proposed patch will also cure this, but the underlying
>>> problem in the warning is still there.
>> You haven't shown enough context for me to agree or disagree. Are 
>> there other preds to bb105?   I hate these slimmed down dumps.  I 
>> work better with the full pred/succ lists. 
>> -fdump-tree-<whatever>-blocks-details :-)
>>
>> It appears to me that for _881 != 0 we certainly flow into the read 
>> of _const_upper_3934 in bb103 and bb105.  Why do you think that's safe?
>
> My bad, there's some missing context.
>
> The only way to get to BB101->BB102 is through:
>
>   <bb 100>
>   if (_6711 != 0)
>     goto <bb 101>; [5.50%]
>   else
>     goto <bb 293>; [94.50%]
>
> And there's an implicit relation between _6711 and _811:
>
> <bb 86>
> ...
>   if (_6711 != 0)
>     goto <bb 287>; [5.50%]
>   else
>     goto <bb 87>; [94.50%]
>
>   <bb 287> [local count: 17344687]:
>   goto <bb 88>; [100.00%]
>
>   <bb 87> [local count: 298013267]:
>
>   <bb 88> [local count: 315357954]:
>   # _881 = PHI <1(87), 0(287)>
>
> That is, _6711 == !_881.
>
> [Andrew, it'd be neat if we could teach ranger the relationship 
> between _6711 and _811 above.  And also, that _881 is [0,1]. Perhaps 
> with the relation oracle, the uninit code could notice that a _6711 
> guard is also a !_811 guard.]
Ah, yes.  This is one of the most common cases where we fail to detect 
jump threads and/or fail to prune a path in the uninit warnings.  
There's multiple instances of this BZ, though I don't have the #s handy 
and the testcases are much smaller & easier to understand.  I'm pretty 
sure they're linked to the meta bug for threading or uninit warnings.

However, in this instance we may have a way out.   When we record the 
constant boolean equivalence for _881 talking the paths from 87 and 287 
respectively, we could walk up the control dependency graph and derive 
other values such as _6711 in this example.  I'm pretty sure we don't 
have the CDG built for DOM, but I do think we have it in the predicate 
analysis engine, so we could at least prune the warning.


Jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-06 10:22 ` Aldy Hernandez
@ 2021-10-14  9:25   ` Aldy Hernandez
  2021-10-14 14:23     ` Jeff Law
  2021-10-17  1:32   ` Jeff Law
  1 sibling, 1 reply; 27+ messages in thread
From: Aldy Hernandez @ 2021-10-14  9:25 UTC (permalink / raw)
  To: Jeff Law, Richard Biener, Michael Matz; +Cc: Andrew MacLeod, GCC patches

PING.

Note, that there are no PRs and nothing really dependent on this
patch.  This has just been suggested as the right thing to do wrt
loops.

This patch fixes 6 XFAILs in our testsuite and has the added
side-effect of fixing the aarch64 bootstrap problem (though the
problem in the uninit code is still there).

This is a fundamental change in what we've traditionally allowed for
jump threading, but I think it's a good thing.  It also paves the way
for combining the validity models for both the forward and the
backward threaders.

I am happy to field the PRs this may bring about, since every change
in the cost model has us questioning whether we should or shouldn't
thread a path.  But I may need some help from y'all if there's a
missing thread that causes a regression in some other pass.  That
being said, most of the issues that have come with the threader
changes haven't been because we thread less, but because we thread
more-- so perhaps restricting things is a good thing ;-).

Aldy

On Wed, Oct 6, 2021 at 12:22 PM Aldy Hernandez <aldyh@redhat.com> wrote:
>
> [Here go the bits by Richi, tested on x86-64 Linux, and ongoing tests
> on aarch64 and ppc64le.]
>
> There is a lot of fall-out from this patch, as there were many threading
> tests that assumed the restrictions introduced by this patch were valid.
> Some tests have merely shifted the threading to after loop
> optimizations, but others ended up with no threading opportunities at
> all.  Surprisingly some tests ended up with more total threads.  It was
> a crapshoot all around.
>
> On a postive note, there are 6 tests that no longer XFAIL, and one
> guality test which now passes.
>
> I would appreciate someone reviewing the test changes.  I am unsure
> whether some of the tests should be removed, XFAILed, or some other
> thing.
>
> I felt a bit queasy about such a fundamental change wrt threading, so I
> ran it through my callgrind test harness (.ii files from a bootstrap).
> There was no change in overall compilation, DOM, or the VRP threaders.
>
> However, there was a slight increase of 1.63% in the backward threader.
> I'm pretty sure we could reduce this if we incorporated the restrictions
> into their profitability code.  This way we could stop the search when
> we ran into one of these restrictions.  Not sure it's worth it at this
> point.
>
> Note, that this ad-hoc testing is not meant to replace a more thorough
> SPEC, etc test.
>
> Tested on x86-64 Linux.
>
> OK pending tests on aarch64 and ppc64le?
>
> Co-authored-by: Richard Biener <rguenther@suse.de>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-14  9:25   ` Aldy Hernandez
@ 2021-10-14 14:23     ` Jeff Law
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Law @ 2021-10-14 14:23 UTC (permalink / raw)
  To: Aldy Hernandez, Richard Biener, Michael Matz; +Cc: Andrew MacLeod, GCC patches



On 10/14/2021 3:25 AM, Aldy Hernandez wrote:
> PING.
>
> Note, that there are no PRs and nothing really dependent on this
> patch.  This has just been suggested as the right thing to do wrt
> loops.
>
> This patch fixes 6 XFAILs in our testsuite and has the added
> side-effect of fixing the aarch64 bootstrap problem (though the
> problem in the uninit code is still there).
>
> This is a fundamental change in what we've traditionally allowed for
> jump threading, but I think it's a good thing.  It also paves the way
> for combining the validity models for both the forward and the
> backward threaders.
>
> I am happy to field the PRs this may bring about, since every change
> in the cost model has us questioning whether we should or shouldn't
> thread a path.  But I may need some help from y'all if there's a
> missing thread that causes a regression in some other pass.  That
> being said, most of the issues that have come with the threader
> changes haven't been because we thread less, but because we thread
> more-- so perhaps restricting things is a good thing ;-).
Just chasing down fallout from the vectorizer changes first :-)

jeff


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-06 10:22 ` Aldy Hernandez
  2021-10-14  9:25   ` Aldy Hernandez
@ 2021-10-17  1:32   ` Jeff Law
  2021-10-18 10:14     ` Aldy Hernandez
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff Law @ 2021-10-17  1:32 UTC (permalink / raw)
  To: Aldy Hernandez, Richard Biener, Michael Matz; +Cc: Andrew MacLeod, GCC patches



On 10/6/2021 4:22 AM, Aldy Hernandez wrote:
> [Here go the bits by Richi, tested on x86-64 Linux, and ongoing tests
> on aarch64 and ppc64le.]
>
> There is a lot of fall-out from this patch, as there were many threading
> tests that assumed the restrictions introduced by this patch were valid.
> Some tests have merely shifted the threading to after loop
> optimizations, but others ended up with no threading opportunities at
> all.  Surprisingly some tests ended up with more total threads.  It was
> a crapshoot all around.
>
> On a postive note, there are 6 tests that no longer XFAIL, and one
> guality test which now passes.
>
> I would appreciate someone reviewing the test changes.  I am unsure
> whether some of the tests should be removed, XFAILed, or some other
> thing.
>
> I felt a bit queasy about such a fundamental change wrt threading, so I
> ran it through my callgrind test harness (.ii files from a bootstrap).
> There was no change in overall compilation, DOM, or the VRP threaders.
>
> However, there was a slight increase of 1.63% in the backward threader.
> I'm pretty sure we could reduce this if we incorporated the restrictions
> into their profitability code.  This way we could stop the search when
> we ran into one of these restrictions.  Not sure it's worth it at this
> point.
>
> Note, that this ad-hoc testing is not meant to replace a more thorough
> SPEC, etc test.
>
> Tested on x86-64 Linux.
>
> OK pending tests on aarch64 and ppc64le?
>
> Co-authored-by: Richard Biener <rguenther@suse.de>
>
> 0001-Disallow-loop-rotation-and-loop-header-crossing-in-j.patch
>
>  From 5c0fb55b45733ec38918f5d7a576719da32e4a6c Mon Sep 17 00:00:00 2001
> From: Aldy Hernandez <aldyh@redhat.com>
> Date: Mon, 4 Oct 2021 09:47:02 +0200
> Subject: [PATCH] Disallow loop rotation and loop header crossing in jump
>   threaders.
>
> There is a lot of fall-out from this patch, as there are many threading
> tests that assumed the restrictions introduced by this patch were valid.
> Some tests have merely shifted the threading to after loop
> optimizations, but others ended up with no threading opportunities at
> all.  Surprisingly some tests ended up with more total threads.  It was
> a crapshoot all around.
>
> On a postive note, there are 6 tests that no longer XFAIL, and one
> guality test which now passes.
>
> I would appreciate someone reviewing the test changes.  I am unsure
> whether some of the tests should be removed, XFAILed, or some other
> thing.
>
> I felt a bit queasy about such a fundamental change wrt threading, so I
> ran it through my callgrind test harness (.ii files from a bootstrap).
> There was no change in overall compilation, DOM, or the VRP threaders.
>
> However, there was a slight increase of 1.63% in the backward threader.
> I'm pretty sure we could reduce this if we incorporated the restrictions
> into their profitability code.  This way we could stop the search when
> we ran into one of these restrictions.  Not sure it's worth it at this
> point.
>
> Note, that this ad-hoc testing is not meant to replace a more thorough
> SPEC, etc test.
>
> Tested on x86-64 Linux.
>
> OK pending tests on aarch64 and ppc64le?
>
> Co-authored-by: Richard Biener <rguenther@suse.de>
>
> gcc/ChangeLog:
>
> 	* tree-ssa-threadupdate.c (cancel_thread): Dump threading reason
> 	on the same line as the threading cancellation.
> 	(jt_path_registry::cancel_invalid_paths): Avoid rotating loops.
> 	Avoid threading through loop headers where the path remains in the
> 	loop.
>
> libgomp/ChangeLog:
>
> 	* testsuite/libgomp.graphite/force-parallel-5.c: Remove xfail.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.dg/Warray-bounds-87.c: Remove xfail.
> 	* gcc.dg/analyzer/pr94851-2.c: Remove xfail.
> 	* gcc.dg/graphite/pr69728.c: Remove xfail.
> 	* gcc.dg/graphite/scop-dsyr2k.c: Remove xfail.
> 	* gcc.dg/graphite/scop-dsyrk.c: Remove xfail.
> 	* gcc.dg/shrink-wrap-loop.c: Remove xfail.
> 	* gcc.dg/loop-8.c: Adjust for new threading restrictions.
> 	* gcc.dg/tree-ssa/ifc-20040816-1.c: Same.
> 	* gcc.dg/tree-ssa/pr21559.c: Same.
> 	* gcc.dg/tree-ssa/pr59597.c: Same.
> 	* gcc.dg/tree-ssa/pr71437.c: Same.
> 	* gcc.dg/tree-ssa/pr77445-2.c: Same.
> 	* gcc.dg/tree-ssa/ssa-dom-thread-18.c: Same.
> 	* gcc.dg/tree-ssa/ssa-dom-thread-2a.c: Same.
> 	* gcc.dg/tree-ssa/ssa-dom-thread-4.c: Same.
> 	* gcc.dg/tree-ssa/ssa-dom-thread-6.c: Same.
> 	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Same.
> 	* gcc.dg/vect/bb-slp-16.c: Same.
> 	* gcc.dg/tree-ssa/ssa-thread-invalid.c: New test.
So there is some light fallout on our friend visium.  I must say that 
having a port which fails to link if there's a call to abort () left in 
the program is awful helpful :-)    I reviewed the half-dozen or so 
tests that fail after this change, they all look like any jump threads 
are for loop rotation.  So I'm inclined to ignore those and just 
generate new baselines for the visium port.

Some additional thoughts on the tests below.  I didn't see anything 
that's worth an objection, but at least in a couple cases there's ways 
to save the test.  In others I might have made a slightly different 
decision, but I can live with yours.

I think once we reach a consensus on the tests, this will be good to go.


> diff --git a/gcc/testsuite/gcc.dg/loop-8.c b/gcc/testsuite/gcc.dg/loop-8.c
> index 90ea1c45524..66318fc08dc 100644
> --- a/gcc/testsuite/gcc.dg/loop-8.c
> +++ b/gcc/testsuite/gcc.dg/loop-8.c
> @@ -24,5 +24,9 @@ f (int *a, int *b)
>   
>   /* Load of 42 is moved out of the loop, introducing a new pseudo register.  */
>   /* { dg-final { scan-rtl-dump-times "Decided" 1 "loop2_invariant" } } */
> -/* { dg-final { scan-rtl-dump-not "without introducing a new temporary register" "loop2_invariant" } } */
> +
> +
> +/* ?? The expected behavior below depends on threading the 2->3->5 path
> +   in DOM2, but this is invalid since it would rotate the loop.  */
> +/* { dg-final { scan-rtl-dump-not "without introducing a new temporary register" "loop2_invariant" { xfail *-*-* } } } */
So maybe the thing to do here since I guess we want to keep the test 
would be to manually rotate the loop in the source.  In theory that 
should restore the test to validating what we want it to validate 
(specifically the behavior of LICM).

> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
> index 0246ebf3c63..f83cefd8d89 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
> @@ -22,6 +22,8 @@
>   
>      All the cases are picked up by VRP1 as jump threads.  */
>   
> -/* There used to be 6 jump threads found by thread1, but they all
> -   depended on threading through distinct loops in ethread.  */
> -/* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp-thread1" } } */
> +/* This test should be obsoleted.  We used to catch 2 total threads in
> +   vrp-thread1, but after adding loop rotating restrictions, we get
> +   none.  Interestingly, on x86-64 we now get 1 in DOM2, 5 in DOM3,
> +   and 1 in vrp-thread2.  */
> +/* { dg-final { scan-tree-dump-not "Threaded" "vrp-thread1" } } */
I think that testing nothing was threaded in vrp1 is probably best. 
Though I wouldn't lose any sleep if this just went away.

> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
> index 8f0a12c12ee..68808bd09fc 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
> @@ -1,10 +1,9 @@
>   /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-vrp-thread1-stats -fdump-tree-dom2-stats" } */
> +/* { dg-options "-O2 -fdump-statistics" } */
>   
>   void bla();
>   
> -/* In the following case, we should be able to thread edge through
> -   the loop header.  */
> +/* No one should thread through the loop header.  */
>   
>   void thread_entry_through_header (void)
>   {
> @@ -14,8 +13,4 @@ void thread_entry_through_header (void)
>       bla ();
>   }
>   
> -/* There's a single jump thread that should be handled by the VRP
> -   jump threading pass.  */
> -/* { dg-final { scan-tree-dump-times "Jumps threaded: 1" 1 "vrp-thread1"} } */
> -/* { dg-final { scan-tree-dump-times "Jumps threaded: 2" 0 "vrp-thread1"} } */
> -/* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2"} } */
> +/* { dg-final { scan-tree-dump-not "Jumps threaded" "statistics"} } */
Similarly.

> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
> index b0a7d423475..24de9d57d50 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
> @@ -1,8 +1,12 @@
>   /* { dg-do compile } */
>   /* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3-details" } */
>   
> -/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */
> -/* { dg-final { scan-tree-dump-times "Registering jump" 1 "thread3" } } */
> +/* ?? We should obsolete this test.  All the threads in thread1 and
> +   thread3 we used to get cross the loop header but does not exit the
> +   loop, so they have been deemed invalid.  */
> +
> +/* { dg-final { scan-tree-dump-times "Registering jump" 0 "thread1" } } */
> +/* { dg-final { scan-tree-dump-times "Registering jump" 0 "thread3" } } */
>   
>   int sum0, sum1, sum2, sum3;
>   int foo (char *s, char **ret)
Similarly.

> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
> index e68a9b62535..fc3adab3fc3 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
> @@ -65,5 +65,9 @@ int main (void)
>     return 0;
>   }
>   
> -/* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp1" } } */
> +/* ?? The check below depends on jump threading.  There are no a
> +   couple threaded paths that are being rejected because they either
> +   rotate a loop or cross the loop header without exiting the
> +   loop.  */
> +/* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp1" { xfail *-*-* } } } */
So we could manually rotate the loop in the source which in turn would 
allow this test to continue to validate SLP's behavior.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] More jump threading restrictions in the presence of loops.
  2021-10-17  1:32   ` Jeff Law
@ 2021-10-18 10:14     ` Aldy Hernandez
  0 siblings, 0 replies; 27+ messages in thread
From: Aldy Hernandez @ 2021-10-18 10:14 UTC (permalink / raw)
  To: Jeff Law, Richard Biener, Michael Matz; +Cc: Andrew MacLeod, GCC patches

[-- Attachment #1: Type: text/plain, Size: 5650 bytes --]



On 10/17/21 3:32 AM, Jeff Law wrote:

> I think once we reach a consensus on the tests, this will be good to go.
> 
> 
>> diff --git a/gcc/testsuite/gcc.dg/loop-8.c b/gcc/testsuite/gcc.dg/loop-8.c
>> index 90ea1c45524..66318fc08dc 100644
>> --- a/gcc/testsuite/gcc.dg/loop-8.c
>> +++ b/gcc/testsuite/gcc.dg/loop-8.c
>> @@ -24,5 +24,9 @@ f (int *a, int *b)
>>   
>>   /* Load of 42 is moved out of the loop, introducing a new pseudo register.  */
>>   /* { dg-final { scan-rtl-dump-times "Decided" 1 "loop2_invariant" } } */
>> -/* { dg-final { scan-rtl-dump-not "without introducing a new temporary register" "loop2_invariant" } } */
>> +
>> +
>> +/* ?? The expected behavior below depends on threading the 2->3->5 path
>> +   in DOM2, but this is invalid since it would rotate the loop.  */
>> +/* { dg-final { scan-rtl-dump-not "without introducing a new temporary register" "loop2_invariant" { xfail *-*-* } } } */
> So maybe the thing to do here since I guess we want to keep the test 
> would be to manually rotate the loop in the source.  In theory that 
> should restore the test to validating what we want it to validate 
> (specifically the behavior of LICM).

I have rotated the loop.  This fixes the xfail I introduced, but the 
"Decided"  test fails.  I've removed that instead.  Let me know if this 
is OK.

> 
>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
>> index 0246ebf3c63..f83cefd8d89 100644
>> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
>> @@ -22,6 +22,8 @@
>>   
>>      All the cases are picked up by VRP1 as jump threads.  */
>>   
>> -/* There used to be 6 jump threads found by thread1, but they all
>> -   depended on threading through distinct loops in ethread.  */
>> -/* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp-thread1" } } */
>> +/* This test should be obsoleted.  We used to catch 2 total threads in
>> +   vrp-thread1, but after adding loop rotating restrictions, we get
>> +   none.  Interestingly, on x86-64 we now get 1 in DOM2, 5 in DOM3,
>> +   and 1 in vrp-thread2.  */
>> +/* { dg-final { scan-tree-dump-not "Threaded" "vrp-thread1" } } */
> I think that testing nothing was threaded in vrp1 is probably best. 
> Though I wouldn't lose any sleep if this just went away.

I've opted to remove these tests, since I'm testing the exact behavior 
we're disallowing in the gimple FE tests I've included in this patch.

> 
>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
>> index 8f0a12c12ee..68808bd09fc 100644
>> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
>> @@ -1,10 +1,9 @@
>>   /* { dg-do compile } */
>> -/* { dg-options "-O2 -fdump-tree-vrp-thread1-stats -fdump-tree-dom2-stats" } */
>> +/* { dg-options "-O2 -fdump-statistics" } */
>>   
>>   void bla();
>>   
>> -/* In the following case, we should be able to thread edge through
>> -   the loop header.  */
>> +/* No one should thread through the loop header.  */
>>   
>>   void thread_entry_through_header (void)
>>   {
>> @@ -14,8 +13,4 @@ void thread_entry_through_header (void)
>>       bla ();
>>   }
>>   
>> -/* There's a single jump thread that should be handled by the VRP
>> -   jump threading pass.  */
>> -/* { dg-final { scan-tree-dump-times "Jumps threaded: 1" 1 "vrp-thread1"} } */
>> -/* { dg-final { scan-tree-dump-times "Jumps threaded: 2" 0 "vrp-thread1"} } */
>> -/* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2"} } */
>> +/* { dg-final { scan-tree-dump-not "Jumps threaded" "statistics"} } */
> Similarly.

Same.

> 
>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
>> index b0a7d423475..24de9d57d50 100644
>> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
>> @@ -1,8 +1,12 @@
>>   /* { dg-do compile } */
>>   /* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3-details" } */
>>   
>> -/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */
>> -/* { dg-final { scan-tree-dump-times "Registering jump" 1 "thread3" } } */
>> +/* ?? We should obsolete this test.  All the threads in thread1 and
>> +   thread3 we used to get cross the loop header but does not exit the
>> +   loop, so they have been deemed invalid.  */
>> +
>> +/* { dg-final { scan-tree-dump-times "Registering jump" 0 "thread1" } } */
>> +/* { dg-final { scan-tree-dump-times "Registering jump" 0 "thread3" } } */
>>   
>>   int sum0, sum1, sum2, sum3;
>>   int foo (char *s, char **ret)
> Similarly.

Same.

> 
>> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
>> index e68a9b62535..fc3adab3fc3 100644
>> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
>> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
>> @@ -65,5 +65,9 @@ int main (void)
>>     return 0;
>>   }
>>   
>> -/* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp1" } } */
>> +/* ?? The check below depends on jump threading.  There are no a
>> +   couple threaded paths that are being rejected because they either
>> +   rotate a loop or cross the loop header without exiting the
>> +   loop.  */
>> +/* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp1" { xfail *-*-* } } } */
> So we could manually rotate the loop in the source which in turn would 
> allow this test to continue to validate SLP's behavior.

That did the trick.  Thanks.

OK?

Aldy

[-- Attachment #2: 0001-Disallow-loop-rotation-and-loop-header-crossing-in-j.patch --]
[-- Type: text/x-patch, Size: 27979 bytes --]

From 4e21d1aa9a794356b32d25a9ed539ac4bb70ab7e Mon Sep 17 00:00:00 2001
From: Aldy Hernandez <aldyh@redhat.com>
Date: Mon, 4 Oct 2021 09:47:02 +0200
Subject: [PATCH] Disallow loop rotation and loop header crossing in jump
 threaders.

There is a lot of fall-out from this patch, as there were many threading
tests that assumed the restrictions introduced by this patch were valid.
Some tests have merely shifted the threading to after loop
optimizations, but others ended up with no threading opportunities at
all.  Surprisingly some tests ended up with more total threads.  It was
a crapshoot all around.

On a postive note, there are 6 tests that no longer XFAIL, and one
guality test which now passes.

I would appreciate someone reviewing the test changes.  I am unsure
whether some of the tests should be removed, XFAILed, or some other
thing.

I felt a bit queasy about such a fundamental change wrt threading, so I
ran it through my callgrind test harness (.ii files from a bootstrap).
There was no change in overall compilation, DOM, or the VRP threaders.

However, there was a slight increase of 1.63% in the backward threader.
I'm pretty sure we could reduce this if we incorporated the restrictions
into their profitability code.  This way we could stop the search when
we ran into one of these restrictions.  Not sure it's worth it at this
point.

Note, that this ad-hoc testing is not meant to replace a more thorough
SPEC, etc test.

Tested on x86-64 Linux.

Co-authored-by: Richard Biener <rguenther@suse.de>

gcc/ChangeLog:

	* tree-ssa-threadupdate.c (cancel_thread): Dump threading reason
	on the same line as the threading cancellation.
	(jt_path_registry::cancel_invalid_paths): Avoid rotating loops.
	Avoid threading through loop headers where the path remains in the
	loop.

libgomp/ChangeLog:

	* testsuite/libgomp.graphite/force-parallel-5.c: Remove xfail.

gcc/testsuite/ChangeLog:

	* gcc.dg/Warray-bounds-87.c: Remove xfail.
	* gcc.dg/analyzer/pr94851-2.c: Remove xfail.
	* gcc.dg/graphite/pr69728.c: Remove xfail.
	* gcc.dg/graphite/scop-dsyr2k.c: Remove xfail.
	* gcc.dg/graphite/scop-dsyrk.c: Remove xfail.
	* gcc.dg/shrink-wrap-loop.c: Remove xfail.
	* gcc.dg/loop-8.c: Adjust for new threading restrictions.
	* gcc.dg/tree-ssa/ifc-20040816-1.c: Same.
	* gcc.dg/tree-ssa/pr21559.c: Same.
	* gcc.dg/tree-ssa/pr59597.c: Same.
	* gcc.dg/tree-ssa/pr71437.c: Same.
	* gcc.dg/tree-ssa/pr77445-2.c: Same.
	* gcc.dg/tree-ssa/ssa-dom-thread-4.c: Same.
	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Same.
	* gcc.dg/vect/bb-slp-16.c: Same.
	* gcc.dg/tree-ssa/ssa-dom-thread-6.c: Remove.
	* gcc.dg/tree-ssa/ssa-dom-thread-18.c: Remove.
	* gcc.dg/tree-ssa/ssa-dom-thread-2a.c: Remove.
	* gcc.dg/tree-ssa/ssa-thread-invalid.c: New test.
---
 gcc/testsuite/gcc.dg/Warray-bounds-87.c       |   2 +-
 gcc/testsuite/gcc.dg/analyzer/pr94851-2.c     |   2 +-
 gcc/testsuite/gcc.dg/graphite/pr69728.c       |   4 +-
 gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c   |   2 +-
 gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c    |   2 +-
 gcc/testsuite/gcc.dg/loop-8.c                 |  19 ++--
 gcc/testsuite/gcc.dg/shrink-wrap-loop.c       |  54 +---------
 .../gcc.dg/tree-ssa/ifc-20040816-1.c          |   2 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr21559.c       |   7 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr59597.c       |  10 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr71437.c       |   8 +-
 gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c     |   3 +-
 .../gcc.dg/tree-ssa/ssa-dom-thread-18.c       |  27 -----
 .../gcc.dg/tree-ssa/ssa-dom-thread-2a.c       |  21 ----
 .../gcc.dg/tree-ssa/ssa-dom-thread-4.c        |  14 ++-
 .../gcc.dg/tree-ssa/ssa-dom-thread-6.c        |  44 --------
 .../gcc.dg/tree-ssa/ssa-dom-thread-7.c        |   5 +-
 .../gcc.dg/tree-ssa/ssa-thread-invalid.c      | 102 ++++++++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-16.c         |  70 ++++++------
 gcc/tree-ssa-threadupdate.c                   |  26 ++++-
 .../libgomp.graphite/force-parallel-5.c       |   2 +-
 21 files changed, 207 insertions(+), 219 deletions(-)
 delete mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
 delete mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
 delete mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c

diff --git a/gcc/testsuite/gcc.dg/Warray-bounds-87.c b/gcc/testsuite/gcc.dg/Warray-bounds-87.c
index a49874df5da..a5457807c3a 100644
--- a/gcc/testsuite/gcc.dg/Warray-bounds-87.c
+++ b/gcc/testsuite/gcc.dg/Warray-bounds-87.c
@@ -33,7 +33,7 @@ static unsigned int h (int i, int j)
     case 9:
       return j;
     case 10:
-      return a[i];  // { dg-bogus "-Warray-bounds" "pr101671" { xfail *-*-* } }
+      return a[i];  // { dg-bogus "-Warray-bounds" "pr101671" }
     }
   return 0;
 }
diff --git a/gcc/testsuite/gcc.dg/analyzer/pr94851-2.c b/gcc/testsuite/gcc.dg/analyzer/pr94851-2.c
index 0acf48810c1..62176bdaee8 100644
--- a/gcc/testsuite/gcc.dg/analyzer/pr94851-2.c
+++ b/gcc/testsuite/gcc.dg/analyzer/pr94851-2.c
@@ -45,7 +45,7 @@ int pamark(void) {
     if (curbp->b_amark == (AMARK *)NULL)
       curbp->b_amark = p;
     else
-      last->m_next = p; /* { dg-warning "dereference of NULL 'last'" "deref" { xfail *-*-* } } */
+      last->m_next = p; /* { dg-warning "dereference of NULL 'last'" "deref" } */
   }
 
   p->m_name = (char)c; /* { dg-bogus "leak of 'p'" "bogus leak" } */
diff --git a/gcc/testsuite/gcc.dg/graphite/pr69728.c b/gcc/testsuite/gcc.dg/graphite/pr69728.c
index 69e28318aaf..a6f385749c2 100644
--- a/gcc/testsuite/gcc.dg/graphite/pr69728.c
+++ b/gcc/testsuite/gcc.dg/graphite/pr69728.c
@@ -24,6 +24,4 @@ fn1 ()
    run into scheduling issues before here, not being able to handle
    empty domains.  */
 
-/* XFAILed by fix for PR86865.  */
-
-/* { dg-final { scan-tree-dump "loop nest optimized" "graphite" { xfail *-*-* } } }  */
+/* { dg-final { scan-tree-dump "loop nest optimized" "graphite" } } */
diff --git a/gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c b/gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c
index 925ae306903..41c91b97b57 100644
--- a/gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c
+++ b/gcc/testsuite/gcc.dg/graphite/scop-dsyr2k.c
@@ -17,4 +17,4 @@ void dsyr2k(int N) {
 #pragma endscop
 }
 
-/* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite" { xfail *-*-* } } } */ 
+/* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite" } } */
diff --git a/gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c b/gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c
index b748946fabb..e01a517be11 100644
--- a/gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c
+++ b/gcc/testsuite/gcc.dg/graphite/scop-dsyrk.c
@@ -19,4 +19,4 @@ void dsyrk(int N)
 #pragma endscop
 }
 
-/* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite" } } */
diff --git a/gcc/testsuite/gcc.dg/loop-8.c b/gcc/testsuite/gcc.dg/loop-8.c
index 90ea1c45524..a685fc25056 100644
--- a/gcc/testsuite/gcc.dg/loop-8.c
+++ b/gcc/testsuite/gcc.dg/loop-8.c
@@ -11,18 +11,23 @@ f (int *a, int *b)
 {
   int i;
 
-  for (i = 0; i < 100; i++)
+  i = 100;
+  if (i > 0)
     {
-      int d = 42;
+      do
+	{
+	  int d = 42;
 
-      a[i] = d;
-      if (i % 2)
-	d = i;
-      b[i] = d;
+	  a[i] = d;
+	  if (i % 2)
+	    d = i;
+	  b[i] = d;
+	  ++i;
+	}
+      while (i < 100);
     }
 }
 
 /* Load of 42 is moved out of the loop, introducing a new pseudo register.  */
-/* { dg-final { scan-rtl-dump-times "Decided" 1 "loop2_invariant" } } */
 /* { dg-final { scan-rtl-dump-not "without introducing a new temporary register" "loop2_invariant" } } */
 
diff --git a/gcc/testsuite/gcc.dg/shrink-wrap-loop.c b/gcc/testsuite/gcc.dg/shrink-wrap-loop.c
index 6e1be8937fe..ddc99e6b75a 100644
--- a/gcc/testsuite/gcc.dg/shrink-wrap-loop.c
+++ b/gcc/testsuite/gcc.dg/shrink-wrap-loop.c
@@ -1,58 +1,6 @@
 /* { dg-do compile { target { { { i?86-*-* x86_64-*-* } && lp64 } || { arm_thumb2 } } } } */
 /* { dg-options "-O2 -fdump-rtl-pro_and_epilogue"  } */
 
-/*
-Our new threader is threading things a bit too early, and causing the
-testcase in gcc.dg/shrink-wrap-loop.c to fail.
-
-  The gist is this BB inside a loop:
-
-  <bb 6> :
-  # p_2 = PHI <p2_6(D)(2), p_12(5)>
-  if (p_2 != 0B)
-    goto <bb 3>; [INV]
-  else
-    goto <bb 7>; [INV]
-
-Our threader can move this check outside of the loop (good).  This is
-done before branch probabilities are calculated and causes the probs
-to be calculated as:
-
-<bb 2> [local count: 216361238]:
-  if (p2_6(D) != 0B)
-    goto <bb 7>; [54.59%]
-  else
-    goto <bb 6>; [45.41%]
-
-Logically this seems correct to me.  A simple check outside of a loop
-should slightly but not overwhelmingly favor a non-zero value.
-
-Interestingly however, the old threader couldn't get this, but the IL
-ended up identical, albeit with different probabilities.  What happens
-is that, because the old code could not thread this, the p2 != 0 check
-would remain inside the loop and probs would be calculated thusly:
-
-  <bb 6> [local count: 1073741824]:
-  # p_2 = PHI <p2_6(D)(2), p_12(5)>
-  if (p_2 != 0B)
-    goto <bb 3>; [94.50%]
-  else
-    goto <bb 7>; [5.50%]
-
-Then when the loop header copying pass ("ch") shuffled things around,
-the IL would end up identical to my early threader code, but with the
-probabilities would remain as 94.5/5.5.
-
-The above discrepancy causes the RTL ifcvt pass to generate different
-code, and by the time we get to the shrink wrapping pass, things look
-sufficiently different such that the legacy code can actually shrink
-wrap, whereas our new code does not.
-
-IMO, if the loop-ch pass moves conditionals outside of a loop, the
-probabilities should be adjusted, but that does mean the shrink wrap
-won't happen for this contrived testcase.
- */
-
 int foo (int *p1, int *p2);
 
 int
@@ -68,4 +16,4 @@ test (int *p1, int *p2)
 
   return 1;
 }
-/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail *-*-* } } } */
+/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c b/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c
index b55a533e374..f8a6495cbaa 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ifc-20040816-1.c
@@ -39,4 +39,4 @@ int main1 ()
    which is folded by vectorizer.  Both outgoing edges must have probability
    100% so the resulting profile match after folding.  */
 /* { dg-final { scan-tree-dump-times "Invalid sum of outgoing probabilities 200.0" 1 "ifcvt" } } */
-/* { dg-final { scan-tree-dump-times "Invalid sum of incoming counts" 1 "ifcvt" } } */
+/* { dg-final { scan-tree-dump-times "Invalid sum of incoming counts" 2 "ifcvt" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr21559.c b/gcc/testsuite/gcc.dg/tree-ssa/pr21559.c
index 51b3b7ac755..43f046edabe 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr21559.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr21559.c
@@ -35,10 +35,7 @@ void foo (void)
 /* First, we should simplify the bits < 0 test within the loop.  */
 /* { dg-final { scan-tree-dump-times "Simplified relational" 1 "evrp" } } */
 
-/* Second, we should thread the edge out of the loop via the break
-   statement.  We also realize that the final bytes == 0 test is useless,
-   and thread over it.  We also know that toread != 0 is useless when
-   entering while loop and thread over it.  */
-/* { dg-final { scan-tree-dump-times "Threaded jump" 3 "vrp-thread1" } } */
+/* We used to check for 3 threaded jumps here, but they all would
+   rotate the loop.  */
 
 
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c b/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c
index 2caa1f532ea..764b3fe2e80 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr59597.c
@@ -56,11 +56,7 @@ main (int argc, char argv[])
   return crc;
 }
 
-/* Previously we had 3 jump threads, but one of them crossed loops.
-   The reason the old threader was allowing it, was because there was
-   an ASSERT_EXPR getting in the way.  Without the ASSERT_EXPR, we
-   have an empty pre-header block as the final block in the thread,
-   which the threader will simply join with the next block which *is*
-   in a different loop.  */
-/* { dg-final { scan-tree-dump-times "Registering jump thread" 2 "vrp-thread1" } } */
+/* None of the threads we can get in vrp-thread1 are valid.  They all
+   cross or rotate loops.  */
+/* { dg-final { scan-tree-dump-not "Registering jump thread" "vrp-thread1" } } */
 /* { dg-final { scan-tree-dump-not "joiner" "vrp-thread1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr71437.c b/gcc/testsuite/gcc.dg/tree-ssa/pr71437.c
index a2386ba19f0..eab3a25928e 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr71437.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr71437.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-ffast-math -O3 -fdump-tree-vrp-thread1-details" } */
+/* { dg-options "-ffast-math -O3 -fdump-tree-dom3-details" } */
 
 int I = 50, J = 50;
 int S, L;
@@ -39,4 +39,8 @@ void foo (int K)
 	bar (LD, SD);
     }
 }
-/* { dg-final { scan-tree-dump-times "Threaded jump " 2 "vrp-thread1" } } */
+
+/* We used to get 1 vrp-thread1 candidates here, but they now get
+   deferred until after loop opts are done, because they were rotating
+   loops.  */
+/* { dg-final { scan-tree-dump-times "Threaded jump " 2 "dom3" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
index 18f7aab2be7..f2a5e78e6be 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
@@ -123,8 +123,7 @@ enum STATES FMS( u8 **in , u32 *transitions) {
    aarch64 has the highest CASE_VALUES_THRESHOLD in GCC.  It's high enough
    to change decisions in switch expansion which in turn can expose new
    jump threading opportunities.  Skip the later tests on aarch64.  */
-/* { dg-final { scan-tree-dump "Jumps threaded: \[7-9\]" "thread1" } } */
-/* { dg-final { scan-tree-dump-times "Invalid sum" 1 "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: \[7-9\]" "thread2" } } */
 /* { dg-final { scan-tree-dump-not "optimizing for size" "thread1" } } */
 /* { dg-final { scan-tree-dump-not "optimizing for size" "thread2" } } */
 /* { dg-final { scan-tree-dump-not "optimizing for size" "thread3" { target { ! aarch64*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
deleted file mode 100644
index 0246ebf3c63..00000000000
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
+++ /dev/null
@@ -1,27 +0,0 @@
-/* { dg-do compile } */ 
-/* { dg-options "-O2 -fdump-tree-vrp-thread1-details -std=gnu89 --param logical-op-non-short-circuit=0" } */
-
-#include "ssa-dom-thread-4.c"
-
-/* On targets that define LOGICAL_OP_NON_SHORT_CIRCUIT to 0, we split both
-   "a_elt || b_elt" and "b_elt && kill_elt" into two conditions each,
-   rather than using "(var1 != 0) op (var2 != 0)".  Also, as on other targets,
-   we duplicate the header of the inner "while" loop.  There are then
-   4 threading opportunities:
-
-   1x "!a_elt && b_elt" in the outer "while" loop
-      -> the start of the inner "while" loop,
-	 skipping the known-true "b_elt" in the first condition.
-   1x "!b_elt" in the first condition
-      -> the outer "while" loop's continuation point,
-	 skipping the known-false "b_elt" in the second condition.
-   2x "kill_elt->indx >= b_elt->indx" in the first "while" loop
-      -> "kill_elt->indx == b_elt->indx" in the second condition,
-	 skipping the known-true "b_elt && kill_elt" in the second
-	 condition.
-
-   All the cases are picked up by VRP1 as jump threads.  */
-
-/* There used to be 6 jump threads found by thread1, but they all
-   depended on threading through distinct loops in ethread.  */
-/* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp-thread1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
deleted file mode 100644
index 8f0a12c12ee..00000000000
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2a.c
+++ /dev/null
@@ -1,21 +0,0 @@
-/* { dg-do compile } */ 
-/* { dg-options "-O2 -fdump-tree-vrp-thread1-stats -fdump-tree-dom2-stats" } */
-
-void bla();
-
-/* In the following case, we should be able to thread edge through
-   the loop header.  */
-
-void thread_entry_through_header (void)
-{
-  int i;
-
-  for (i = 0; i < 170; i++)
-    bla ();
-}
-
-/* There's a single jump thread that should be handled by the VRP
-   jump threading pass.  */
-/* { dg-final { scan-tree-dump-times "Jumps threaded: 1" 1 "vrp-thread1"} } */
-/* { dg-final { scan-tree-dump-times "Jumps threaded: 2" 0 "vrp-thread1"} } */
-/* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-4.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-4.c
index 46e464ff26a..9cd463571c4 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-4.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-4.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */ 
-/* { dg-options "-O2 -fdump-tree-vrp-thread1-details -fdump-tree-dom2-details -std=gnu89 --param logical-op-non-short-circuit=1" } */
+/* { dg-options "-O2 -fdump-tree-vrp-thread2-details -fdump-tree-dom2-details -std=gnu89 --param logical-op-non-short-circuit=1" } */
 struct bitmap_head_def;
 typedef struct bitmap_head_def *bitmap;
 typedef const struct bitmap_head_def *const_bitmap;
@@ -53,10 +53,8 @@ bitmap_ior_and_compl (bitmap dst, const_bitmap a, const_bitmap b,
 
   return changed;
 }
-/* The block starting the second conditional has  3 incoming edges,
-   we should thread all three, but due to a bug in the threading
-   code we missed the edge when the first conditional is false
-   (b_elt is zero, which means the second conditional is always
-   zero.  VRP1 catches all three.  */
-/* { dg-final { scan-tree-dump-times "Registering jump thread" 2 "vrp-thread1" } } */
-/* { dg-final { scan-tree-dump-times "Path crosses loops" 1 "vrp-thread1" } } */
+/* We used to catch 3 jump threads in vrp-thread1, but they all
+   rotated the loop, so they were disallowed.  This in turn created
+   other opportunities for the other threaders which result in the the
+   post-loop threader (vrp-thread2) catching more.  */
+/* { dg-final { scan-tree-dump-times "Registering jump thread" 5 "vrp-thread2" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
deleted file mode 100644
index b0a7d423475..00000000000
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c
+++ /dev/null
@@ -1,44 +0,0 @@
-/* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3-details" } */
-
-/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */
-/* { dg-final { scan-tree-dump-times "Registering jump" 1 "thread3" } } */
-
-int sum0, sum1, sum2, sum3;
-int foo (char *s, char **ret)
-{
-  int state=0;
-  char c;
-
-  for (; *s && state != 4; s++)
-    {
-      c = *s;
-      if (c == '*')
-	{
-	  s++;
-	  break;
-	}
-      switch (state)
-	{
-	case 0:
-	  if (c == '+')
-	    state = 1;
-	  else if (c != '-')
-	    sum0+=c;
-	  break;
-	case 1:
-	  if (c == '+')
-	    state = 2;
-	  else if (c == '-')
-	    state = 0;
-	  else
-	    sum1+=c;
-	  break;
-	default:
-	  break;
-	}
-
-    }
-  *ret = s;
-  return state;
-}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
index 16abcde5053..1da00a691c8 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
@@ -1,15 +1,14 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
 
-/* { dg-final { scan-tree-dump "Jumps threaded: 12"  "thread1" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 5" "thread3" { target { ! aarch64*-*-* } } } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 12"  "thread3" } } */
 /* { dg-final { scan-tree-dump-not "Jumps threaded"  "dom2" } } */
 
 /* aarch64 has the highest CASE_VALUES_THRESHOLD in GCC.  It's high enough
    to change decisions in switch expansion which in turn can expose new
    jump threading opportunities.  Skip the later tests on aarch64.  */
 /* { dg-final { scan-tree-dump-not "Jumps threaded"  "dom3" { target { ! aarch64*-*-* } } } } */
-/* { dg-final { scan-tree-dump-not "Jumps threaded"  "vrp2" { target { ! aarch64*-*-* } } } } */
+/* { dg-final { scan-tree-dump-not "Jumps threaded"  "vrp-thread2" { target { ! aarch64*-*-* } } } } */
 
 enum STATE {
   S0=0,
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c
new file mode 100644
index 00000000000..bd56a62a4b4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-invalid.c
@@ -0,0 +1,102 @@
+// { dg-do compile }
+// { dg-options "-O2 -fgimple -fdump-statistics" }
+//
+// This is a collection of seemingly threadble paths that should not be allowed.
+
+void foobar (int);
+
+// Possible thread from 2->4->3, but it would rotate the loop.
+void __GIMPLE (ssa)
+f1 ()
+{
+  int i;
+
+  // Pre-header.
+  __BB(2):
+  goto __BB4;
+
+  // Latch.
+  __BB(3):
+  foobar (i_1);
+  i_5 = i_1 + 1;
+  goto __BB4;
+
+  __BB(4,loop_header(1)):
+  i_1 = __PHI (__BB2: 0, __BB3: i_5);
+  if (i_1 != 101)
+    goto __BB3;
+  else
+    goto __BB5;
+
+  __BB(5):
+  return;
+
+}
+
+// Possible thread from 2->3->5 but threading through the empty latch
+// would create a non-empty latch.
+void __GIMPLE (ssa)
+f2 ()
+{
+  int i;
+
+  // Pre-header.
+  __BB(2):
+  goto __BB3;
+
+  __BB(3,loop_header(1)):
+  i_8 = __PHI (__BB5: i_5, __BB2: 0);
+  foobar (i_8);
+  i_5 = i_8 + 1;
+  if (i_5 != 256)
+    goto __BB5;
+  else
+    goto __BB4;
+
+  // Latch.
+  __BB(5):
+  goto __BB3;
+
+  __BB(4):
+  return;
+
+}
+
+// Possible thread from 3->5->6->3 but this would thread through the
+// header but not exit the loop.
+int __GIMPLE (ssa)
+f3 (int a)
+{
+  int i;
+
+  __BB(2):
+  goto __BB6;
+
+  __BB(3):
+  if (i_1 != 0)
+    goto __BB4;
+  else
+    goto __BB5;
+
+  __BB(4):
+  foobar (5);
+  goto __BB5;
+
+  // Latch.
+  __BB(5):
+  i_7 = i_1 + 1;
+  goto __BB6;
+
+  __BB(6,loop_header(1)):
+  i_1 = __PHI (__BB2: 1, __BB5: i_7);
+  if (i_1 <= 99)
+    goto __BB3;
+  else
+    goto __BB7;
+
+  __BB(7):
+  return;
+
+}
+
+// { dg-final { scan-tree-dump-not "Jumps threaded" "statistics" } }
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
index e68a9b62535..4fc176dde84 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
@@ -16,41 +16,52 @@ main1 (int dummy)
   unsigned int *pin = &in[0];
   unsigned int *pout = &out[0];
   unsigned int a = 0;
-  
-  for (i = 0; i < N; i++)
+
+  i = N;
+  if (i > 0)
     {
-      *pout++ = *pin++ + a;
-      *pout++ = *pin++ + a;
-      *pout++ = *pin++ + a;
-      *pout++ = *pin++ + a;
-      *pout++ = *pin++ + a;
-      *pout++ = *pin++ + a;
-      *pout++ = *pin++ + a;
-      *pout++ = *pin++ + a;
-      if (arr[i] = i)
-        a = i;
-      else
-        a = 2;
+      do
+	{
+	  *pout++ = *pin++ + a;
+	  *pout++ = *pin++ + a;
+	  *pout++ = *pin++ + a;
+	  *pout++ = *pin++ + a;
+	  *pout++ = *pin++ + a;
+	  *pout++ = *pin++ + a;
+	  *pout++ = *pin++ + a;
+	  *pout++ = *pin++ + a;
+	  if (arr[i] = i)
+	    a = i;
+	  else
+	    a = 2;
+	}
+      while (i < N);
     }
 
   a = 0;
-  /* check results: */ 
-  for (i = 0; i < N; i++)
+  /* check results: */
+  i = N;
+  if (i > 0)
     {
-      if (out[i*8] !=  in[i*8] + a
-         || out[i*8 + 1] != in[i*8 + 1] + a
-         || out[i*8 + 2] != in[i*8 + 2] + a
-         || out[i*8 + 3] != in[i*8 + 3] + a
-         || out[i*8 + 4] != in[i*8 + 4] + a
-         || out[i*8 + 5] != in[i*8 + 5] + a
-         || out[i*8 + 6] != in[i*8 + 6] + a
-         || out[i*8 + 7] != in[i*8 + 7] + a)
-	abort ();
+      do
+	{
+	  if (out[i*8] !=  in[i*8] + a
+	      || out[i*8 + 1] != in[i*8 + 1] + a
+	      || out[i*8 + 2] != in[i*8 + 2] + a
+	      || out[i*8 + 3] != in[i*8 + 3] + a
+	      || out[i*8 + 4] != in[i*8 + 4] + a
+	      || out[i*8 + 5] != in[i*8 + 5] + a
+	      || out[i*8 + 6] != in[i*8 + 6] + a
+	      || out[i*8 + 7] != in[i*8 + 7] + a)
+	    abort ();
 
-      if (arr[i] = i)
-        a = i;
-      else
-        a = 2;
+	  if (arr[i] = i)
+	    a = i;
+	  else
+	    a = 2;
+	  i++;
+	}
+      while (i < N);
     }
 
   return 0;
@@ -66,4 +77,3 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp1" } } */
-  
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index 32ce1e3af40..293836cdc53 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -278,7 +278,7 @@ cancel_thread (vec<jump_thread_edge *> *path, const char *reason = NULL)
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       if (reason)
-	fprintf (dump_file, "%s:\n", reason);
+	fprintf (dump_file, "%s: ", reason);
 
       dump_jump_thread_path (dump_file, *path, false);
       fprintf (dump_file, "\n");
@@ -2771,6 +2771,7 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
   bool seen_latch = false;
   int loops_crossed = 0;
   bool crossed_latch = false;
+  bool crossed_loop_header = false;
   // Use ->dest here instead of ->src to ignore the first block.  The
   // first block is allowed to be in a different loop, since it'll be
   // redirected.  See similar comment in profitable_path_p: "we don't
@@ -2804,6 +2805,14 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
 	  ++loops_crossed;
 	}
 
+      // ?? Avoid threading through loop headers that remain in the
+      // loop, as such threadings tend to create sub-loops which
+      // _might_ be OK ??.
+      if (e->dest->loop_father->header == e->dest
+	  && !flow_loop_nested_p (exit->dest->loop_father,
+				  e->dest->loop_father))
+	crossed_loop_header = true;
+
       if (flag_checking && !m_backedge_threads)
 	gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0);
     }
@@ -2829,6 +2838,21 @@ jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
       cancel_thread (&path, "Path crosses loops");
       return true;
     }
+  // The path should either start and end in the same loop or exit the
+  // loop it starts in but never enter a loop.  This also catches
+  // creating irreducible loops, not only rotation.
+  if (entry->src->loop_father != exit->dest->loop_father
+      && !flow_loop_nested_p (exit->src->loop_father,
+			      entry->dest->loop_father))
+    {
+      cancel_thread (&path, "Path rotates loop");
+      return true;
+    }
+  if (crossed_loop_header)
+    {
+      cancel_thread (&path, "Path crosses loop header but does not exit it");
+      return true;
+    }
   return false;
 }
 
diff --git a/libgomp/testsuite/libgomp.graphite/force-parallel-5.c b/libgomp/testsuite/libgomp.graphite/force-parallel-5.c
index b83ca79dfae..de31d6436f5 100644
--- a/libgomp/testsuite/libgomp.graphite/force-parallel-5.c
+++ b/libgomp/testsuite/libgomp.graphite/force-parallel-5.c
@@ -31,6 +31,6 @@ int main(void)
 }
 
 /* Check that parallel code generation part make the right answer.  */
-/* { dg-final { scan-tree-dump-times "2 loops carried no dependency" 1 "graphite" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump-times "2 loops carried no dependency" 1 "graphite" } } */
 /* { dg-final { scan-tree-dump-times "loopfn.0" 4 "optimized" } } */
 /* { dg-final { scan-tree-dump-times "loopfn.1" 4 "optimized" } } */
-- 
2.31.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2021-10-18 10:14 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-04  9:43 [RFC] More jump threading restrictions in the presence of loops Aldy Hernandez
2021-10-04 13:31 ` Jeff Law
2021-10-04 13:36   ` Aldy Hernandez
2021-10-04 15:30     ` Jeff Law
2021-10-04 16:29       ` Michael Matz
2021-10-05 11:22         ` Richard Biener
2021-10-05 12:43           ` Michael Matz
2021-10-05 14:56           ` Jeff Law
2021-10-05 13:33         ` Aldy Hernandez
2021-10-05 15:10           ` Jeff Law
2021-10-05 16:08           ` Jeff Law
2021-10-05 16:22             ` Aldy Hernandez
2021-10-06 13:15           ` Andreas Schwab
2021-10-06 13:47             ` Aldy Hernandez
2021-10-06 15:03               ` Martin Sebor
2021-10-06 16:22                 ` Aldy Hernandez
2021-10-06 17:03                   ` Aldy Hernandez
2021-10-06 19:11                     ` Martin Sebor
2021-10-06 23:00                   ` Jeff Law
2021-10-06 23:06               ` Jeff Law
2021-10-07  8:15                 ` Aldy Hernandez
2021-10-07 13:35                   ` Jeff Law
2021-10-06 10:22 ` Aldy Hernandez
2021-10-14  9:25   ` Aldy Hernandez
2021-10-14 14:23     ` Jeff Law
2021-10-17  1:32   ` Jeff Law
2021-10-18 10:14     ` Aldy Hernandez

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).