public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH GCC]A simple implementation of loop interchange
@ 2017-08-30 14:31 Bin Cheng
  2017-08-30 15:11 ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: Bin Cheng @ 2017-08-30 14:31 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd

[-- Attachment #1: Type: text/plain, Size: 4275 bytes --]

Hi,
This patch implements a simple loop interchange pass in GCC, as described by its comments:
+/* This pass performs loop interchange: for example, the loop nest
+
+   for (int j = 0; j < N; j++)
+     for (int k = 0; k < N; k++)
+       for (int i = 0; i < N; i++)
+	 c[i][j] = c[i][j] + a[i][k]*b[k][j];
+
+   is transformed to
+
+   for (int i = 0; i < N; i++)
+     for (int j = 0; j < N; j++)
+       for (int k = 0; k < N; k++)
+	 c[i][j] = c[i][j] + a[i][k]*b[k][j];
+
+   This pass implements loop interchange in the following steps:
+
+     1) Find perfect loop nest for each innermost loop and compute data
+	dependence relations for it.  For above example, loop nest is
+	<loop_j, loop_k, loop_i>.
+     2) From innermost to outermost loop, this pass tries to interchange
+	each loop pair.  For above case, it firstly tries to interchange
+	<loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
+	Then it tries to interchange <loop_j, loop_i> and loop nest becomes
+	<loop_i, loop_j, loop_k>.  The overall effect is to move innermost
+	loop to the outermost position.  For loop pair <loop_i, loop_j>
+	to be interchanged, we:
+     3) Check if data dependence relations are valid for loop interchange.
+     4) Check if both loops can be interchanged in terms of transformation.
+     5) Check if interchanging the two loops is profitable.
+     6) Interchange the two loops by mapping induction variables.
+
+   This pass also handles reductions in loop nest.  So far we only support
+   simple reduction of inner loop and double reduction of the loop nest.  */

Actually, this pass only does loop shift which outermosting inner loop to outer, rather
than permutation.  Also as a traditional loop optimizer, it only works for perfect loop
nest.  I put it just after loop distribution thus ideally loop split/distribution could
create perfect nest for it.  Unfortunately, we don't get any perfect nest from distribution
for now because it only works for innermost loop.  For example, the motivation case in
spec2k6/bwaves is not handled on this pass alone.  I have a patch extending distribution
for (innermost) loop nest and with that patch bwaves case can be handled.
Another point is I deliberately make both the cost model and code transformation (very)
conservative.  We can support more cases, or more transformations with great care when
it is for sure known beneficial.  IMHO, we already hit over-baked issues quite often and
don't want to introduce more.
As for code generation, this patch has an issue that invariant code in outer loop could
be moved to inner loop.  For the moment, we rely on the last lim pass to handle all INV
generated during interchange.  In the future, we may need to avoid that in interchange
itself, or another lim pass just like the one after graphite optimizations.

Boostrap and test on x86_64 and AArch64.  Various benchmarks built and run successfully.
Note this pass is disabled in patch, while the code is exercised by bootstrap/building
programs with it enabled by default.  Any comments?

Thanks,
bin
2017-08-29  Bin Cheng  <bin.cheng@arm.com>

	* Makefile.in (tree-ssa-loop-interchange.o): New object file.
	* common.opt (ftree-loop-interchange): New option.
	* doc/invoke.texi (-ftree-loop-interchange): Document new option.
	* passes.def (pass_linterchange): New pass.
	* timevar.def (TV_LINTERCHANGE): New time var.
	* tree-pass.h (make_pass_linterchange): New declaration.
	* tree-ssa-loop-interchange.cc: New file.
	* tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
	* tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.

gcc/testsuite
2017-08-29  Bin Cheng  <bin.cheng@arm.com>

	* gcc.dg/tree-ssa/loop-interchange-1.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-2.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-3.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-4.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-5.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-6.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-7.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-8.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-9.c: New test.
	* gcc.dg/tree-ssa/loop-interchange-10.c: New test.

[-- Attachment #2: linterchange-20170829.txt --]
[-- Type: text/plain, Size: 71691 bytes --]

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 0bde7ac..5002598 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1522,6 +1522,7 @@ OBJS = \
 	tree-ssa-live.o \
 	tree-ssa-loop-ch.o \
 	tree-ssa-loop-im.o \
+	tree-ssa-loop-interchange.o \
 	tree-ssa-loop-ivcanon.o \
 	tree-ssa-loop-ivopts.o \
 	tree-ssa-loop-manip.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 1331008..e7efa09 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2524,6 +2524,10 @@ ftree-loop-distribute-patterns
 Common Report Var(flag_tree_loop_distribute_patterns) Optimization
 Enable loop distribution for patterns transformed into a library call.
 
+ftree-loop-interchange
+Common Report Var(flag_tree_loop_interchange) Optimization
+Enable loop interchange on trees.
+
 ftree-loop-im
 Common Report Var(flag_tree_loop_im) Init(1) Optimization
 Enable loop invariant motion on trees.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 64363e5..9807977 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -8490,6 +8490,25 @@ ENDDO
 @end smallexample
 and the initialization loop is transformed into a call to memset zero.
 
+@item -ftree-loop-interchange
+@opindex ftree-loop-interchange
+Perform loop interchange outside of graphite.  This flag can improve cache
+performance on loop nest and allow further loop optimizations, like
+vectorization, to take place.  For example, the loop
+@smallexample
+for (int i = 0; i < N; i++)
+  for (int j = 0; j < N; j++)
+    for (int k = 0; k < N; k++)
+      c[i][j] = c[i][j] + a[i][k]*b[k][j];
+@end smallexample
+is transformed to
+@smallexample
+for (int i = 0; i < N; i++)
+  for (int k = 0; k < N; k++)
+    for (int j = 0; j < N; j++)
+      c[i][j] = c[i][j] + a[i][k]*b[k][j];
+@end smallexample
+
 @item -ftree-loop-im
 @opindex ftree-loop-im
 Perform loop invariant motion on trees.  This pass moves only invariants that
diff --git a/gcc/passes.def b/gcc/passes.def
index 316e19d..b1ee57c 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -279,6 +279,7 @@ along with GCC; see the file COPYING3.  If not see
 	  NEXT_PASS (pass_cd_dce);
 	  NEXT_PASS (pass_iv_canon);
 	  NEXT_PASS (pass_loop_distribution);
+	  NEXT_PASS (pass_linterchange);
 	  NEXT_PASS (pass_copy_prop);
 	  NEXT_PASS (pass_graphite);
 	  PUSH_INSERT_PASSES_WITHIN (pass_graphite)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-1.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-1.c
new file mode 100644
index 0000000..9782fd6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-1.c
@@ -0,0 +1,50 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+/* Copied from graphite/interchange-4.c */
+
+#define DEBUG 0
+#if DEBUG
+#include <stdio.h>
+#endif
+
+double u[1782225];
+
+static int __attribute__((noinline))
+foo (int N, int *res)
+{
+  int i, j;
+  double sum = 0;
+  for (i = 0; i < N; i++)
+    for (j = 0; j < N; j++)
+      sum = sum + u[i + 1335 * j];
+
+  for (i = 0; i < N; i++)
+    u[1336 * i] *= 2;
+
+  *res = sum + N + u[1336 * 2] + u[1336];
+}
+
+extern void abort ();
+
+int
+main (void)
+{
+  int i, j, res;
+
+  for (i = 0; i < 1782225; i++)
+    u[i] = 2;
+
+  foo (1335, &res);
+
+#if DEBUG
+  fprintf (stderr, "res = %d \n", res);
+#endif
+
+  if (res != 3565793)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_nest<outer:., inner:.> is interchanged" 1 "linterchange"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-10.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-10.c
new file mode 100644
index 0000000..5dfad8a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-10.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M];
+int __attribute__((noinline))
+double_reduc (int n)
+{
+  int sum = 0;
+  for (int j = 0; j < n; j++)
+    {
+      for (int i = 0; i < n; i++)
+	sum = sum + a[i][j]*b[i][j];
+    }
+  return sum;
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+    }
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  int sum = double_reduc (M);
+
+  if (sum != 1065369600)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_nest<outer:., inner:.> is interchanged" 1 "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-2.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-2.c
new file mode 100644
index 0000000..748d5fd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-2.c
@@ -0,0 +1,58 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+/* Copied from graphite/interchange-5.c */
+
+#define DEBUG 0
+#if DEBUG
+#include <stdio.h>
+#endif
+
+#define N 100
+#define M 1111
+int A[N][M];
+
+static int __attribute__((noinline))
+foo (void)
+{
+  int i, j;
+
+  for( i = 0; i < M; i++)
+    for( j = 0; j < N; j++)
+      A[j][i] = 5 * A[j][i];
+
+  return A[0][0] + A[N-1][M-1];
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  int j;
+
+  for (j = 0; j < M; j++)
+    A[i][j] = 2;
+}
+
+int
+main (void)
+{
+  int i, j, res;
+
+  for (i = 0; i < N; i++)
+    init (i);
+
+  res = foo ();
+
+#if DEBUG
+  fprintf (stderr, "res = %d \n", res);
+#endif
+
+  if (res != 20)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_nest<outer:., inner:.> is interchanged" 1 "linterchange"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-3.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-3.c
new file mode 100644
index 0000000..b7bb478
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-3.c
@@ -0,0 +1,59 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+/* Copied from graphite/interchange-6.c */
+
+#define DEBUG 0
+#if DEBUG
+#include <stdio.h>
+#endif
+
+#define N 100
+#define M 200
+
+static int __attribute__((noinline))
+foo (int A[N][M])
+{
+  int i, j;
+
+  /* This loop should be interchanged. */
+  for(j = 0; j < M; j++)
+    for(i = 0; i < N; i++)
+      A[i][j] = A[i][j] + A[i][j];
+
+  return A[0][0] + A[N-1][M-1];
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int *arr, int i)
+{
+  int j;
+
+  for (j = 0; j < M; j++)
+    arr[j] = 2;
+}
+
+int
+main (void)
+{
+  int A[N][M];
+  int i, j, res;
+
+  for (i = 0; i < N; i++)
+    init (A[i], i);
+
+  res = foo (A);
+
+#if DEBUG
+  fprintf (stderr, "res = %d \n", res);
+#endif
+
+  if (res != 8)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_nest<outer:., inner:.> is interchanged" 1 "linterchange"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-4.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-4.c
new file mode 100644
index 0000000..76bd8c4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-4.c
@@ -0,0 +1,50 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+/* Copied from graphite/interchange-7.c */
+
+#define DEBUG 0
+#if DEBUG
+#include <stdio.h>
+#endif
+
+#define N 111
+#define M 1111
+
+static int __attribute__((noinline))
+foo (double *a)
+{
+  int i,j;
+  int r = 0;
+
+  for (i = 0; i < N; ++i)
+    for (j = 0; j < M; ++j)
+      r += a[j * N + i];
+
+  return r;
+}
+
+extern void abort ();
+
+int
+main (void)
+{
+  double A[N*M];
+  int i, res;
+
+  for (i = 0; i < N*M; i++)
+    A[i] = 2;
+
+  res = foo (A);
+
+#if DEBUG
+  fprintf (stderr, "res = %d \n", res);
+#endif
+
+  if (res != 246642)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_nest<outer:., inner:.> is interchanged" 1 "linterchange"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-5.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-5.c
new file mode 100644
index 0000000..5b851aa
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-5.c
@@ -0,0 +1,71 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M][M], d[M][M];
+void __attribute__((noinline))
+matrix_mul_1 (int n)
+{
+  for (int i = 0; i < n; i++)
+    for (int j = 0; j < n; j++)
+      for (int k = 0; k < n; k++)
+        c[i][j] = c[i][j] + a[i][k]*b[k][j];
+}
+
+void __attribute__((noinline))
+matrix_mul_2 (int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      for (int j = 0; j < n; j++)
+	{
+	  for (int k = 0; k < n; k++)
+	    d[i][j] = d[i][j] + a[i][k]*b[k][j];
+
+	  asm volatile ("" ::: "memory");
+	}
+      asm volatile ("" ::: "memory");
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+      c[i][j] = 0;
+      d[i][j] = 0;
+    }
+}
+
+static int __attribute__((noinline))
+check (int i)
+{
+  for (int j = 0; j < M; j++)
+    if (c[i][j] != d[i][j])
+      return 0;
+
+  return 1;
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  matrix_mul_1 (M);
+  matrix_mul_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (!check (i))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_nest<outer:., inner:.> is interchanged" 1 "linterchange" } } */
+/* { dg-final { scan-tree-dump-times "Loop_nest<outer:., inner:.> is not interchanged" 1 "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-6.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-6.c
new file mode 100644
index 0000000..613c00e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-6.c
@@ -0,0 +1,70 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M][M], d[M][M];
+void __attribute__((noinline))
+matrix_mul_1 (int n)
+{
+    for (int j = 0; j < n; j++)
+      for (int k = 0; k < n; k++)
+  for (int i = 0; i < n; i++)
+        c[i][j] = c[i][j] + a[i][k]*b[k][j];
+}
+
+void __attribute__((noinline))
+matrix_mul_2 (int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      for (int j = 0; j < n; j++)
+	{
+	  for (int k = 0; k < n; k++)
+	    d[i][j] = d[i][j] + a[i][k]*b[k][j];
+
+	  asm volatile ("" ::: "memory");
+	}
+      asm volatile ("" ::: "memory");
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+      c[i][j] = 0;
+      d[i][j] = 0;
+    }
+}
+
+static int __attribute__((noinline))
+check (int i)
+{
+  for (int j = 0; j < M; j++)
+    if (c[i][j] != d[i][j])
+      return 0;
+
+  return 1;
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  matrix_mul_1 (M);
+  matrix_mul_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (!check (i))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_nest<outer:., inner:.> is interchanged" 2 "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-7.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-7.c
new file mode 100644
index 0000000..78627a5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-7.c
@@ -0,0 +1,70 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M][M], d[M][M];
+void __attribute__((noinline))
+matrix_mul_1 (int n)
+{
+      for (int k = 0; k < n; k++)
+    for (int j = 0; j < n; j++)
+  for (int i = 0; i < n; i++)
+        c[i][j] = c[i][j] + a[i][k]*b[k][j];
+}
+
+void __attribute__((noinline))
+matrix_mul_2 (int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      for (int j = 0; j < n; j++)
+	{
+	  for (int k = 0; k < n; k++)
+	    d[i][j] = d[i][j] + a[i][k]*b[k][j];
+
+	  asm volatile ("" ::: "memory");
+	}
+      asm volatile ("" ::: "memory");
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+      c[i][j] = 0;
+      d[i][j] = 0;
+    }
+}
+
+static int __attribute__((noinline))
+check (int i)
+{
+  for (int j = 0; j < M; j++)
+    if (c[i][j] != d[i][j])
+      return 0;
+
+  return 1;
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  matrix_mul_1 (M);
+  matrix_mul_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (!check (i))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_nest<outer:., inner:.> is interchanged" 2 "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-8.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-8.c
new file mode 100644
index 0000000..4c5653d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-8.c
@@ -0,0 +1,70 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M][M], d[M][M];
+void __attribute__((noinline))
+matrix_mul_1 (int n)
+{
+  for (int i = 0; i < n; i++)
+      for (int k = 0; k < n; k++)
+    for (int j = 0; j < n; j++)
+        c[i][j] = c[i][j] + a[i][k]*b[k][j];
+}
+
+void __attribute__((noinline))
+matrix_mul_2 (int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      for (int j = 0; j < n; j++)
+	{
+	  for (int k = 0; k < n; k++)
+	    d[i][j] = d[i][j] + a[i][k]*b[k][j];
+
+	  asm volatile ("" ::: "memory");
+	}
+      asm volatile ("" ::: "memory");
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+      c[i][j] = 0;
+      d[i][j] = 0;
+    }
+}
+
+static int __attribute__((noinline))
+check (int i)
+{
+  for (int j = 0; j < M; j++)
+    if (c[i][j] != d[i][j])
+      return 0;
+
+  return 1;
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  matrix_mul_1 (M);
+  matrix_mul_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (!check (i))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_nest<outer:., inner:.> is not interchanged" 2 "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-9.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-9.c
new file mode 100644
index 0000000..ffe9835
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-9.c
@@ -0,0 +1,62 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M], d[M];
+void __attribute__((noinline))
+simple_reduc_1 (int n)
+{
+  for (int j = 0; j < n; j++)
+    {
+      int sum = c[j];
+      for (int i = 0; i < n; i++)
+	sum = sum + a[i][j]*b[i][j];
+
+      c[j] = sum;
+    }
+}
+
+void __attribute__((noinline))
+simple_reduc_2 (int n)
+{
+  for (int j = 0; j < n; j++)
+    {
+      int sum = d[j];
+      for (int i = 0; i < n; i++)
+	sum = sum + a[i][j]*b[i][j];
+
+      asm volatile ("" ::: "memory");
+      d[j] = sum;
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  c[i] = 0;
+  d[i] = 0;
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+    }
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  simple_reduc_1 (M);
+  simple_reduc_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (c[i] != d[i])
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_nest<outer:., inner:.> is interchanged" 1 "linterchange" } } */
diff --git a/gcc/timevar.def b/gcc/timevar.def
index e246058..269fad6 100644
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -184,6 +184,7 @@ DEFTIMEVAR (TV_TREE_LOOP	     , "tree loop optimization")
 DEFTIMEVAR (TV_TREE_NOLOOP           , "loopless fn")
 DEFTIMEVAR (TV_TREE_LOOP_BOUNDS	     , "tree loop bounds")
 DEFTIMEVAR (TV_LIM                   , "tree loop invariant motion")
+DEFTIMEVAR (TV_LINTERCHANGE          , "tree loop interchange")
 DEFTIMEVAR (TV_TREE_LOOP_IVCANON     , "tree canonical iv")
 DEFTIMEVAR (TV_SCEV_CONST            , "scev constant prop")
 DEFTIMEVAR (TV_TREE_LOOP_UNSWITCH    , "tree loop unswitching")
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 2863f76..de68574 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -367,6 +367,7 @@ extern gimple_opt_pass *make_pass_tree_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_no_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_init (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_lim (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_linterchange (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_unswitch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_split (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_predcom (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop-interchange.cc b/gcc/tree-ssa-loop-interchange.cc
new file mode 100644
index 0000000..d8482ff
--- /dev/null
+++ b/gcc/tree-ssa-loop-interchange.cc
@@ -0,0 +1,1814 @@
+/* Loop invariant motion.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "tree.h"
+#include "gimple.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "gimple-pretty-print.h"
+#include "fold-const.h"
+#include "gimplify.h"
+#include "gimple-iterator.h"
+#include "cfgloop.h"
+#include "params.h"
+#include "tree-scalar-evolution.h"
+#include "tree-ssa-loop-manip.h"
+#include "tree-ssa-loop-niter.h"
+#include "tree-ssa-loop-ivopts.h"
+#include "tree-data-ref.h"
+
+/* This pass performs loop interchange: for example, the loop nest
+
+   for (int j = 0; j < N; j++)
+     for (int k = 0; k < N; k++)
+       for (int i = 0; i < N; i++)
+	 c[i][j] = c[i][j] + a[i][k]*b[k][j];
+
+   is transformed to
+
+   for (int i = 0; i < N; i++)
+     for (int j = 0; j < N; j++)
+       for (int k = 0; k < N; k++)
+	 c[i][j] = c[i][j] + a[i][k]*b[k][j];
+
+   This pass implements loop interchange in the following steps:
+
+     1) Find perfect loop nest for each innermost loop and compute data
+	dependence relations for it.  For above example, loop nest is
+	<loop_j, loop_k, loop_i>.
+     2) From innermost to outermost loop, this pass tries to interchange
+	each loop pair.  For above case, it firstly tries to interchange
+	<loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
+	Then it tries to interchange <loop_j, loop_i> and loop nest becomes
+	<loop_i, loop_j, loop_k>.  The overall effect is to move innermost
+	loop to the outermost position.  For loop pair <loop_i, loop_j>
+	to be interchanged, we:
+     3) Check if data dependence relations are valid for loop interchange.
+     4) Check if both loops can be interchanged in terms of transformation.
+     5) Check if interchanging the two loops is profitable.
+     6) Interchange the two loops by mapping induction variables.
+
+   This pass also handles reductions in loop nest.  So far we only support
+   simple reduction of inner loop and double reduction of the loop nest.  */
+
+/* Maximum number of stmts in each loop that should be interchanged.  */
+#define MAX_NUM_STMT	(64)
+
+/* Default size for each array dimension.  */
+#define AVG_DIM_SIZE	(PARAM_VALUE (PARAM_AVG_LOOP_NITER))
+
+/* Comparison ratio of access stride between inner/outer loops to be
+   interchanged.  This ratio is used for interchanging the immermost
+   two loop.  */
+#define INNER_STRIDE_RATIO	(3)
+/* The same as above, but this one is only used for interchanging not
+   innermost loops.  */
+#define OUTER_STRIDE_RATIO	(2)
+
+/* Structure recording loop induction variable.  */
+typedef struct induction
+{
+  /* IV itself.  */
+  tree var;
+  /* IV's base and step part of SCEV.  */
+  tree base;
+  tree step;
+  /* Mapped IV variabled used for interchanging loops.  */
+  tree mapped_var;
+}*induction_p;
+
+/* Enum type for loop reduction variable.  */
+enum reduction_type
+{
+  UNKNOWN_RTYPE = 0,
+  SIMPLE_RTYPE,
+  DOUBLE_RTYPE
+};
+
+/* Structure recording loop reduction variable.  */
+typedef struct reduction
+{
+  /* Reduction itself.  */
+  tree var;
+  /* PHI node defining reduction variable.  */
+  gphi *phi;
+  /* Init and next variables of the reduction.  */
+  tree init;
+  tree next;
+  /* Lcssa PHI node if reduction is used outside of its definition loop.  */
+  gphi *lcssa_phi;
+  /* Single use of reduction variable.  This is generally but not necessarily
+     the stmt defining next variable of reduction.  */
+  gimple *single_use;
+  /* Stmts defining init and next.  */
+  gimple *producer;
+  gimple *consumer;
+  /* If init is loaded from memory, this is the loading memory reference.  */
+  tree init_ref;
+  /* If reduction is finally stored to memory, this is the stored memory
+     reference.  */
+  tree fini_ref;
+  enum reduction_type type;
+}*reduction_p;
+
+
+/* Dump reduction RE.  */
+
+static void
+dump_reduction (reduction_p re)
+{
+  if (re->type == SIMPLE_RTYPE)
+    fprintf (dump_file, "  Simple reduction:  ");
+  else if (re->type == DOUBLE_RTYPE)
+    fprintf (dump_file, "  Double reduction:  ");
+  else
+    fprintf (dump_file, "  Unknown reduction:  ");
+
+  print_gimple_stmt (dump_file, re->phi, 0);
+}
+
+/* Dump LOOP's induction IV.  */
+static void
+dump_induction (struct loop *loop, induction_p iv)
+{
+  fprintf (dump_file, "  Induction:  ");
+  print_generic_expr (dump_file, iv->var, TDF_SLIM);
+  fprintf (dump_file, " = {");
+  print_generic_expr (dump_file, iv->base, TDF_SLIM);
+  fprintf (dump_file, ", ");
+  print_generic_expr (dump_file, iv->step, TDF_SLIM);
+  fprintf (dump_file, "}_%d\n", loop->num);
+}
+
+/* Loop candidate for interchange.  */
+
+class loop_cand
+{
+public:
+  loop_cand (struct loop *, struct loop *);
+  ~loop_cand ();
+
+  reduction_p find_reduction_by_init (tree);
+  reduction_p find_reduction_by_stmt (gimple *);
+  void classify_simple_reduction (reduction_p);
+  bool analyze_iloop_reduction_var (tree);
+  bool analyze_oloop_reduction_var (loop_cand *, tree);
+  bool analyze_reduction_var (loop_cand *, tree);
+  bool analyze_induction_var (tree, tree);
+  bool analyze_carried_vars (loop_cand *);
+  bool analyze_lcssa_phis (void);
+  bool can_interchange_p (loop_cand *);
+  bool unsupported_operation (basic_block, loop_cand *);
+  void undo_simple_reduction (reduction_p);
+
+  friend class tree_loop_interchange;
+private:
+  /* The loop itself.  */
+  struct loop *loop;
+  /* The outer loop of loop nest for interchange.  */
+  struct loop *nest;
+  /* Vector of induction variables in loop.  */
+  vec<induction_p> inductions;
+  /* Vector of reduction variables in loop.  */
+  vec<reduction_p> reductions;
+  /* Lcssa PHI nodes of this loop.  */
+  vec<gphi *> lcssa_nodes;
+  /* # of iterations of this loop.  */
+  tree niters;
+  /* Single exit edge of this loop.  */
+  edge exit;
+  /* Basic blocks of this loop.  */
+  basic_block *bbs;
+  /* # of stmts of this loop.  */
+  unsigned num_stmts;
+};
+
+/* Constructor.  */
+
+loop_cand::loop_cand (struct loop *loop, struct loop *nest)
+  : loop (loop), nest (nest), niters (number_of_latch_executions (loop)),
+    exit (single_exit (loop)), bbs (get_loop_body (loop)), num_stmts (0)
+{
+    inductions.create (3);
+    reductions.create (3);
+    lcssa_nodes.create (3);
+}
+
+/* Destructor.  */
+
+loop_cand::~loop_cand ()
+{
+  induction_p iv;
+  for (unsigned i = 0; inductions.iterate (i, &iv); ++i)
+    free (iv);
+
+  reduction_p re;
+  for (unsigned i = 0; reductions.iterate (i, &re); ++i)
+    free (iv);
+
+  inductions.release ();
+  reductions.release ();
+  lcssa_nodes.release ();
+  free (bbs);
+}
+
+/* Return single use stmt of VAR in LOOP, otherwise return NULL.  */
+
+static gimple *
+single_use_in_loop (tree var, struct loop *loop)
+{
+  gimple *stmt, *res = NULL;
+  use_operand_p use_p;
+  imm_use_iterator iterator;
+
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      basic_block bb = gimple_bb (stmt);
+      gcc_assert (bb != NULL);
+      if (!flow_bb_inside_loop_p (loop, bb))
+	continue;
+
+      if (res)
+	return NULL;
+
+      res = stmt;
+    }
+  return res;
+}
+
+/* Return true if E is abnormal edge.  */
+
+static inline bool
+abnormal_edge (edge e)
+{
+  return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
+}
+
+/* Return true if PHI is abnormal phi node.  */
+
+static inline bool
+abnormal_phi_node (gphi *phi)
+{
+  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi)))
+    return true;
+
+  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
+    {
+      tree arg = PHI_ARG_DEF (phi, i);
+      if (TREE_CODE (arg) == SSA_NAME
+	  && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg))
+	return true;
+    }
+
+  return false;
+}
+
+/* Return reduction whose init variable is VAR, otherwise return NULL.  */
+
+reduction_p
+loop_cand::find_reduction_by_init (tree var)
+{
+  reduction_p re;
+
+  for (unsigned i = 0; reductions.iterate (i, &re); ++i)
+    if (re->init == var || operand_equal_p (re->init, var, 0))
+      return re;
+
+  return NULL;
+}
+
+/* Return the reduction if STMT is one of its lcssa PHI, producer or consumer
+   stmt.  */
+
+reduction_p
+loop_cand::find_reduction_by_stmt (gimple *stmt)
+{
+  gphi *phi = NULL;
+  reduction_p re;
+
+  if (is_a <gphi *> (stmt))
+    phi = as_a <gphi *> (stmt);
+
+  for (unsigned i = 0; reductions.iterate (i, &re); ++i)
+    if ((phi != NULL && phi == re->lcssa_phi)
+	|| (stmt == re->producer || stmt == re->consumer))
+      return re;;
+
+  return NULL;
+}
+
+/* Return true if all stmts in BB can be supported by loop interchange,
+   otherwise return false.  ILOOP is not NULL if this loop_cand is the
+   outer loop in loop nest.  */
+
+bool
+loop_cand::unsupported_operation (basic_block bb, loop_cand *iloop)
+{
+  unsigned bb_num_stmts = 0;
+  gphi_iterator psi;
+  gimple_stmt_iterator gsi;
+
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple *stmt = gsi_stmt (gsi);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      if (gimple_has_volatile_ops (stmt)
+	  || gimple_has_side_effects (stmt))
+	return false;
+
+      bb_num_stmts++;
+      if (is_gimple_call (stmt))
+	{
+	  int cflags = gimple_call_flags (stmt);
+	  /* Only support const/pure calls.  */
+	  if (!(cflags & (ECF_CONST | ECF_PURE)))
+	    return false;
+
+	  /* In basic block of outer loop, the call should be cheap since
+	     it will be moved to inner loop.  */
+	  if (iloop != NULL
+	      && !gimple_inexpensive_call_p (as_a <gcall *> (stmt)))
+	    return false;
+
+	  continue;
+	}
+
+      if (!iloop || !gimple_vuse (stmt))
+	continue;
+
+      /* Support stmt accessing memory in outer loop only if it is for inner
+	 loop's reduction.  */
+      if (iloop->find_reduction_by_stmt (stmt))
+	continue;
+
+      tree lhs;
+      /* Or it's invariant memory reference and only used by inner loop.  */
+      if (gimple_assign_single_p (stmt)
+	  && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
+	  && TREE_CODE (lhs) == SSA_NAME
+	  && single_use_in_loop (lhs, iloop->loop))
+	continue;
+
+      return false;
+    }
+  num_stmts += bb_num_stmts;
+
+  /* Allow PHI nodes in any basic block of inner loop, or PHI nodes in
+     (outer) loop's header.  */
+  if (!iloop || bb == loop->header)
+    return true;
+
+  for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
+    {
+      gphi *phi = psi.phi ();
+
+      if (abnormal_phi_node (phi))
+	return false;
+
+      if (virtual_operand_p (PHI_RESULT (phi)))
+	continue;
+
+      /* For outer loop, we only support PHI in loop header and lcssa PHI
+	 of inner loop's reduction.  */
+      if (!iloop->find_reduction_by_stmt (phi))
+	return false;
+    }
+  return true;
+}
+
+/* Return true if current loop_cand be interchanged.  ILOOP is not NULL if
+   current loop_cand is outer loop in loop nest.  */
+
+bool
+loop_cand::can_interchange_p (loop_cand *iloop)
+{
+  /* For now we only support at most one reduction.  */
+  unsigned allowed_reduction_num = 1;
+
+  /* Only support reduction if the loop nest to be interchanged is the
+     innermostin two loops.  */
+  if ((iloop == NULL && loop->inner != NULL)
+       || (iloop != NULL && iloop->loop->inner != NULL))
+    allowed_reduction_num = 0;
+
+  if (reductions.length () > allowed_reduction_num
+      || (reductions.length () == 1
+	  && reductions[0]->type == UNKNOWN_RTYPE))
+    return false;
+
+  /* Only support lcssa PHI node which is for reduction.  */
+  if (lcssa_nodes.length () > allowed_reduction_num)
+    return false;
+
+  /* Check basic blocks other than loop header/exit.  */
+  for (unsigned i = 0; i < loop->num_nodes; i++)
+    {
+      basic_block bb = bbs[i];
+
+      /* Skip basic blocks of inner loops.  */
+      if (bb->loop_father != loop)
+	continue;
+
+      /* Check if basic block has any unsupported operation.  */
+      if (!unsupported_operation (bb, iloop))
+	return false;
+
+      /* Check if loop has too many stmts.  */
+      if (num_stmts > MAX_NUM_STMT)
+	return false;
+    }
+
+  return true;
+}
+
+/* Classify if reduction RE is a simple one.  */
+
+void
+loop_cand::classify_simple_reduction (reduction_p re)
+{
+  gimple *producer, *consumer;
+  enum tree_code code;
+  tree lhs, rhs;
+
+  /* Check init variable of reduction and how it is initialized.  */
+  if (TREE_CODE (re->init) == SSA_NAME)
+    {
+      producer = SSA_NAME_DEF_STMT (re->init);
+      re->producer = producer;
+      basic_block bb = gimple_bb (producer);
+      if (!bb || bb->loop_father != nest)
+	return;
+
+      if (!is_gimple_assign (producer))
+	return;
+
+      code = gimple_assign_rhs_code (producer);
+      if (get_gimple_rhs_class (code) != GIMPLE_SINGLE_RHS)
+	return;
+
+      if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE
+	  || lhs != re->init)
+	return;
+
+      if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE
+	  || !REFERENCE_CLASS_P (rhs))
+	return;
+
+      re->init_ref = rhs;
+    }
+  else if (!CONSTANT_CLASS_P (re->init))
+    return;
+
+  /* TODO: Don't support constant initializer yet.  */
+  if (re->init_ref == NULL_TREE)
+    return;
+
+  /* Check how reduction variable is used.  Note usually reduction variable
+     is used outside of its defining loop, we don't require that in terms
+     loop interchange.  */
+  if (re->lcssa_phi == NULL)
+    consumer = single_use_in_loop (re->next, loop);
+  else
+    consumer = single_use_in_loop (PHI_RESULT (re->lcssa_phi), nest);
+
+  if (consumer == NULL)
+    return;
+
+  re->consumer = consumer;
+
+  if (!is_gimple_assign (consumer))
+    return;
+
+  code = gimple_assign_rhs_code (consumer);
+  if (get_gimple_rhs_class (code) != GIMPLE_SINGLE_RHS)
+    return;
+
+  if ((rhs = gimple_assign_rhs1 (consumer)) == NULL_TREE
+      || rhs != PHI_RESULT (re->lcssa_phi))
+    return;
+
+  if ((lhs = gimple_assign_lhs (consumer)) == NULL_TREE
+      || !REFERENCE_CLASS_P (lhs))
+    return;
+
+  re->fini_ref = lhs;
+
+  /* Require memory references in producer and consumer are the same so
+     that we can undo reduction during interchange.  */
+  if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0))
+    return;
+
+  re->type = SIMPLE_RTYPE;
+}
+
+/* Analyze reduction variable VAR.  ILOOP is not NULL if current loop_cand
+   is outer loop in loop nest.  Return true if analysis succeeds.  */
+
+bool
+loop_cand::analyze_reduction_var (loop_cand *iloop, tree var)
+{
+  if (iloop != NULL)
+    return analyze_oloop_reduction_var (iloop, var);
+  else
+    return analyze_iloop_reduction_var (var);
+}
+
+/* Analyze reduction variable VAR for inner loop of the loop nest to be
+   interchanged.  Return true if analysis succeeds.  */
+
+bool
+loop_cand::analyze_iloop_reduction_var (tree var)
+{
+  gphi *phi = as_a <gphi *> (SSA_NAME_DEF_STMT (var));
+  gphi *lcssa_phi = NULL, *use_phi;
+  tree init = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
+  tree next = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
+  edge e = single_exit (loop);
+  reduction_p re;
+  gimple *stmt, *next_def, *single_use = NULL;
+  use_operand_p use_p;
+  imm_use_iterator iterator;
+  basic_block bb;
+
+  if (TREE_CODE (next) != SSA_NAME)
+    return false;
+
+  next_def = SSA_NAME_DEF_STMT (next);
+  bb = gimple_bb (next_def);
+  if (!bb || !flow_bb_inside_loop_p (loop, bb))
+    return false;
+
+  /* In restricted reduction, the var is (and must be) used in defining
+     the updated var.  The process can be depicted as below:
+
+		var ;; = PHI<init, next>
+		 |
+		 |
+		 v
+      +---------------------+
+      | reduction operators | <-- other operands
+      +---------------------+
+		 |
+		 |
+		 v
+		next
+
+     In terms loop interchange, we don't change how NEXT is computed based
+     on VAR and OTHER OPERANDS.  In case of double reduction in loop nest
+     to be interchanged, we don't changed it at all.  In the case of simple
+     reduction in inner loop, we only make change how VAR/NEXT is loaded or
+     stored.  With these conditions, we can relax restrictions on reduction
+     in a way that reduction operation is seen as black box.  In general,
+     we can ignore reassociation of reduction operator; we can handle fake
+     reductions in which VAR is not even used to compute NEXT.  */
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      bb = gimple_bb (stmt);
+      if (!bb || !flow_bb_inside_loop_p (loop, bb))
+	return false;
+
+      if (single_use != NULL)
+	return false;
+
+      single_use = stmt;
+    }
+
+  if (single_use != next_def
+      && !stmt_dominates_stmt_p (single_use, next_def))
+    return false;
+
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, next)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      bb = gimple_bb (stmt);
+      if (!bb)
+	return false;
+
+      /* Or else it's used in PHI itself.  */
+      use_phi = NULL;
+      if (is_a <gphi *> (stmt)
+	  && (use_phi = as_a <gphi *> (stmt)) != NULL
+	  && use_phi == phi)
+	continue;
+
+      if (use_phi != NULL
+	  && lcssa_phi == NULL
+	  && bb == e->dest
+	  && PHI_ARG_DEF_FROM_EDGE (use_phi, e) == next)
+	lcssa_phi = use_phi;
+      else
+	return false;
+    }
+  re = XCNEW (struct reduction);
+  re->var = var;
+  re->init = init;
+  re->next = next;
+  re->phi = phi;
+  re->lcssa_phi = lcssa_phi;
+  re->single_use = single_use;
+
+  classify_simple_reduction (re);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    dump_reduction (re);
+
+  reductions.safe_push (re);
+  return true;
+}
+
+/* Analyze reduction variable VAR for outer loop of the loop nest to be
+   interchanged.  ILOOP is not NULL and points to inner loop.  For the
+   moment, we only support double reduction for outer loop, like:
+
+     for (int i = 0; i < n; i++)
+       {
+	 int sum = 0;
+
+	 for (int j = 0; j < n; j++)    // outer loop
+	   for (int k = 0; k < n; k++)  // inner loop
+	     sum += a[i][k]*b[k][j];
+
+	 s[i] = sum;
+       }
+
+   Note the innermost two loops are the loop nest to be interchanged.
+   Return true if analysis succeeds.  */
+
+bool
+loop_cand::analyze_oloop_reduction_var (loop_cand *iloop, tree var)
+{
+  gphi *phi = as_a <gphi *> (SSA_NAME_DEF_STMT (var));
+  gphi *lcssa_phi = NULL, *use_phi;
+  tree init = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
+  tree next = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
+  edge e = single_exit (loop);
+  reduction_p re;
+  gimple *stmt, *next_def;
+  use_operand_p use_p;
+  imm_use_iterator iterator;
+  basic_block bb;
+
+  if (TREE_CODE (next) != SSA_NAME)
+    return false;
+
+  next_def = SSA_NAME_DEF_STMT (next);
+  bb = gimple_bb (next_def);
+  if (!bb || !flow_bb_inside_loop_p (loop, bb))
+    return false;
+
+  /* Find inner loop's simple reduction that uses var as initializer.  */
+  reduction_p inner_re = iloop->find_reduction_by_init (var);
+  if (inner_re == NULL
+      || inner_re->type != UNKNOWN_RTYPE
+      || inner_re->producer != phi)
+    return false;
+
+  /* In case of double reduction, outer loop's reduction should be updated
+     by inner loop's simple reduction.  */
+  if (next_def != inner_re->lcssa_phi)
+    return false;
+
+  /* Outer loop's reduction should only be used to initialize inner loop's
+     simple reduction.  */
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      bb = gimple_bb (stmt);
+      if (!bb || !flow_bb_inside_loop_p (loop, bb))
+	return false;
+
+      if (! is_a <gphi *> (stmt)
+	  || (use_phi = as_a <gphi *> (stmt)) == NULL
+	  || use_phi != inner_re->phi)
+	return false;
+    }
+
+  /* Check this reduction is correctly used outside of loop via lcssa phi.  */
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, next)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      bb = gimple_bb (stmt);
+      if (!bb)
+	return false;
+
+      /* Or else it's used in PHI itself.  */
+      use_phi = NULL;
+      if (is_a <gphi *> (stmt)
+	  && (use_phi = as_a <gphi *> (stmt)) != NULL
+	  && use_phi == phi)
+	continue;
+
+      if (lcssa_phi == NULL
+	  && use_phi != NULL
+	  && bb == e->dest
+	  && PHI_ARG_DEF_FROM_EDGE (use_phi, e) == next)
+	lcssa_phi = use_phi;
+      else
+	return false;
+    }
+
+  re = XCNEW (struct reduction);
+  re->var = var;
+  re->init = init;
+  re->next = next;
+  re->phi = phi;
+  re->lcssa_phi = lcssa_phi;
+  re->type = DOUBLE_RTYPE;
+  inner_re->type = DOUBLE_RTYPE;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    dump_reduction (re);
+
+  reductions.safe_push (re);
+  return true;
+}
+
+/* Return true if VAR is induction variable of current loop whose scev is
+   specified by CHREC.  */
+
+bool
+loop_cand::analyze_induction_var (tree var, tree chrec)
+{
+  /* Var is loop invariant, though it's unlikely to happen.  */
+  if (tree_does_not_contain_chrecs (chrec))
+    {
+      struct induction *iv = XCNEW (struct induction);
+      iv->var = var;
+      iv->base = chrec;
+      iv->step = build_int_cst (TREE_TYPE (chrec), 0);
+      inductions.safe_push (iv);
+      return true;
+    }
+
+  if (TREE_CODE (chrec) != POLYNOMIAL_CHREC
+      || CHREC_VARIABLE (chrec) != (unsigned) loop->num
+      || tree_contains_chrecs (CHREC_LEFT (chrec), NULL)
+      || tree_contains_chrecs (CHREC_RIGHT (chrec), NULL))
+    return false;
+
+  struct induction *iv = XCNEW (struct induction);
+  iv->var = var;
+  iv->base = CHREC_LEFT (chrec);
+  iv->step = CHREC_RIGHT (chrec);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    dump_induction (loop, iv);
+
+  inductions.safe_push (iv);
+  return true;
+}
+
+/* Return true if all loop carried variables defined in loop header can
+   be successfully analyzed.  */
+
+bool
+loop_cand::analyze_carried_vars (loop_cand *iloop)
+{
+  basic_block before_loop = block_before_loop (nest);
+  gphi_iterator gsi;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "\nLoop(%d) carried vars:\n", loop->num);
+
+  for (gsi = gsi_start_phis (loop->header); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gphi *phi = gsi.phi ();
+
+      if (abnormal_phi_node (phi))
+	return false;
+
+      tree var = PHI_RESULT (phi);
+      if (virtual_operand_p (var))
+	continue;
+
+      tree chrec = analyze_scalar_evolution (loop, var);
+      chrec = instantiate_scev (before_loop, loop, chrec);
+
+      /* Analyze var as reduction variable.  */
+      if (chrec_contains_undetermined (chrec)
+	  || chrec_contains_symbols_defined_in_loop (chrec, nest->num))
+	{
+	  if (!analyze_reduction_var (iloop, var))
+	    return false;
+	}
+      /* Analyze var as induction variable.  */
+      else if (!analyze_induction_var (var, chrec))
+	return false;
+    }
+
+  return true;
+}
+
+/* Return TRUE if loop closed PHI nodes can be analyzed successfully.  */
+
+bool
+loop_cand::analyze_lcssa_phis (void)
+{
+  edge e = single_exit (loop);
+  gphi_iterator gsi;
+
+  for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gphi *phi = gsi.phi ();
+
+      if (abnormal_phi_node (phi))
+	return false;
+
+      if (virtual_operand_p (PHI_RESULT (phi)))
+	continue;
+
+      /* TODO: We only support lcssa phi for reduction for now.  */
+      if (!find_reduction_by_stmt (phi))
+	return false;
+    }
+
+  return true;
+}
+
+/* Given inner loop with simple reduction as below:
+
+   for (i = 0; i < N; i++)
+     for (j = 0; j < N; j++)
+       {
+	 int red = c[i][j];           // producer
+	 for (k = 0; k < N; k++)
+	   red += a[i][k] * b[k][j];
+
+	 c[i][j] = red;               // consumer
+       }
+
+   This function undo the reduction and generates below loop nest:
+
+   for (i = 0; i < N; i++)
+     for (j = 0; j < N; j++)
+       {
+	 for (k = 0; k < N; k++)
+	   c[i][j] += a[i][k] * b[k][j];
+       }
+
+   This basically reverts transformation done by LIM or PRE.  */
+
+void
+loop_cand::undo_simple_reduction (reduction_p re)
+{
+  gimple *phi = SSA_NAME_DEF_STMT (re->var);
+  gimple_stmt_iterator gsi, from;
+
+  /* Move producer stmt into inner loop, just before its use.  */
+  if (gimple_vuse (re->producer))
+    gimple_set_vuse (re->producer, NULL_TREE);
+  gimple_assign_set_lhs (re->producer, re->var);
+  from = gsi_for_stmt (re->producer);
+  gsi = gsi_for_stmt (re->single_use);
+  gsi_move_before (&from, &gsi);
+
+  /* Delete loop header PHI node of reduction.  */
+  gsi = gsi_for_stmt (phi);
+  gsi_remove (&gsi, true);
+
+  /* Move consumer stmt into inner loop, just after its def.  */
+  if (gimple_vdef (re->consumer))
+    gimple_set_vuse (re->consumer, NULL_TREE);
+  if (gimple_vuse (re->consumer))
+    gimple_set_vuse (re->consumer, NULL_TREE);
+  gimple_assign_set_rhs1 (re->consumer, re->next);
+  from = gsi_for_stmt (re->consumer);
+  gsi = gsi_for_stmt (SSA_NAME_DEF_STMT (re->next));
+  gsi_move_after (&from, &gsi);
+
+  /* Delete loop closed PHI node of reduction.  */
+  gsi = gsi_for_stmt (re->lcssa_phi);
+  gsi_remove (&gsi, true);
+}
+
+/* Class for loop interchange transformation.  */
+
+class tree_loop_interchange
+{
+public:
+  tree_loop_interchange (vec<loop_p> loop_nest,
+			 vec<data_reference_p> datarefs, vec<ddr_p> ddrs)
+    : loop_nest (loop_nest), datarefs (datarefs),
+      ddrs (ddrs), unshare_datarefs_p (true) { }
+  ~tree_loop_interchange () {
+    free_dependence_relations (ddrs);
+    free_data_refs (datarefs);
+    loop_nest.release ();
+  }
+  bool interchange ();
+
+private:
+  void update_data_refs (loop_cand &, loop_cand &);
+  void update_data_deps (unsigned, unsigned);
+  bool valid_data_dependences (unsigned, unsigned);
+  bool can_interchange_loops (loop_cand &, loop_cand &);
+  bool should_interchange_loops (loop_cand &, loop_cand &);
+  void interchange_loops (loop_cand &, loop_cand &);
+  void interchange_reductions (loop_cand &, loop_cand &);
+  void interchange_inductions (loop_cand &, loop_cand &);
+  void map_inductions_to_loop (loop_cand &, loop_cand &);
+  void move_code_to_inner_loop (struct loop *, struct loop *, basic_block *);
+
+  /* Vector of the loop nest.  */
+  vec<struct loop *> loop_nest;
+  /* Vector of data references in loop nest.  */
+  vec<data_reference_p> datarefs;
+  /* Vector of data dependence relations in loop nest.  */
+  vec<ddr_p> ddrs;
+  /* Due to current implementation of data dependence analysis, access
+     functions are shared by all data dependence relations.  After loop
+     interchange, we need to update data reference/dependence according
+     to interchanged loops.  During updating, this flag will be checked
+     and DR_ACCESS_FNs will be unshared if it's true.  */
+  bool unshare_datarefs_p;
+};
+
+/* Given ILOOP and OLOOP representing two loops in loop nest to be
+   interchanged, this function decomposes access function ACCESS_FN.
+   This function does below things:
+
+     1) Look into ACCESS_FN and record the corresponding polynomial
+	chrec of ILOOP/OLOOP in ILOOP_CHREC/OLOOP_CHREC.
+     2) Record access STRIDE in ILOOP_STRIDE or OLOOP_STRIDE if they
+	are not NULL.
+     4) Record index of dimension DIM in iloop_dim or oloop_dim if
+	they are not NULL.
+
+   For example, given below ACCESS_FN:
+
+     {{base, step1}_oloop, step2}_iloop
+
+   This function decomposes it into:
+
+     ILOOP_CHREC: {{base, step1}_oloop, step2}_iloop
+     OLOOP_CHREC: {base, step1}_oloop
+
+   and sets other arguments properly.  */
+
+static void
+decompose_chrecs (unsigned iloop_num, unsigned oloop_num,
+		  tree access_fn, unsigned dim, tree stride,
+		  tree *iloop_chrec, tree *iloop_stride, unsigned *iloop_dim,
+		  tree *oloop_chrec, tree *oloop_stride, unsigned *oloop_dim)
+{
+  do {
+    unsigned var = CHREC_VARIABLE (access_fn);
+    if (var == iloop_num)
+      {
+	gcc_assert (*iloop_chrec == NULL_TREE);
+	*iloop_chrec = access_fn;
+	if (iloop_stride)
+	  *iloop_stride = stride;
+	if (iloop_dim)
+	  *iloop_dim = dim;
+      }
+    if (var == oloop_num)
+      {
+	gcc_assert (*oloop_chrec == NULL_TREE);
+	*oloop_chrec = access_fn;
+	if (oloop_stride)
+	  *oloop_stride = stride;
+	if (oloop_dim)
+	  *oloop_dim = dim;
+      }
+    access_fn = CHREC_LEFT (access_fn);
+  } while (TREE_CODE (access_fn) == POLYNOMIAL_CHREC
+	   && (*iloop_chrec == NULL_TREE || *oloop_chrec == NULL_TREE));
+}
+
+/* Update data refs to keep them valid after interchanging ILOOP/OLOOP.  */
+
+void
+tree_loop_interchange::update_data_refs (loop_cand &iloop, loop_cand &oloop)
+{
+  struct data_reference *dr;
+  for (unsigned i = 0; datarefs.iterate (i, &dr); ++i)
+    {
+      unsigned iloop_dim = DR_NUM_DIMENSIONS (dr);
+      unsigned oloop_dim = DR_NUM_DIMENSIONS (dr);
+      tree iloop_chrec = NULL_TREE, oloop_chrec = NULL_TREE;
+
+      for (unsigned j = 0; j < DR_NUM_DIMENSIONS (dr); ++j)
+	{
+	  if (TREE_CODE (DR_ACCESS_FN (dr, j)) != POLYNOMIAL_CHREC)
+	    continue;
+
+	  /* Unshare access functions for the first time.  */
+	  if (unshare_datarefs_p)
+	    DR_ACCESS_FN (dr, j) = unshare_expr (DR_ACCESS_FN (dr, j));
+
+	  /* Find the corresponding CHRECs for ILOOP and OLOOP.  */
+	  tree access_fn = DR_ACCESS_FN (dr, j);
+	  decompose_chrecs (iloop.loop->num, oloop.loop->num,
+			    access_fn, j, NULL,
+			    &iloop_chrec, NULL, &iloop_dim,
+			    &oloop_chrec, NULL, &oloop_dim);
+	}
+      if (iloop_chrec && oloop_chrec)
+	{
+	  if (iloop_dim != oloop_dim)
+	    {
+	      /* For data reference with independent CHRECs for both loops,
+		 swap the loop information.  For example,
+		   (data_ref ...
+		    {base1, step1}_iloop
+		    {base2, step2}_oloop)
+		 is transformed into:
+		   (data_ref ...
+		    {base1, step1}_oloop
+		    {base2, step2}_iloop).  */
+	      std::swap (CHREC_VAR (iloop_chrec), CHREC_VAR (oloop_chrec));
+	    }
+	  else
+	    {
+	      /* For data reference with multivariate CHREC for the two loops,
+		 swap step part of CHREC.  For example,
+		   (data_ref ...
+		    {{base1, step1}_oloop, step2}_iloop)
+		 is transformed into:
+		   (data_ref ...
+		    {{base1, step2}_oloop, step1}_iloop).  */
+	      std::swap (CHREC_RIGHT (iloop_chrec), CHREC_RIGHT (oloop_chrec));
+	    }
+	}
+      /* For data reference without CHREC for either one of ILOOP/OLOOP, set
+	 the loop information to the other loop.  This works because we only
+	 interchange consecutive loops in loop nest.  */
+      else if (iloop_chrec)
+	{
+	  tree type = TREE_TYPE (CHREC_VAR (iloop_chrec));
+	  CHREC_VAR (iloop_chrec) = build_int_cst (type, oloop.loop->num);
+	}
+      else if (oloop_chrec)
+	{
+	  tree type = TREE_TYPE (CHREC_VAR (oloop_chrec));
+	  CHREC_VAR (oloop_chrec) = build_int_cst (type, iloop.loop->num);
+	}
+    }
+  unshare_datarefs_p = false;
+}
+
+/* Update data dependence relations after interchanging loops.  INNER/OUTER
+   gives index of interchanged loops in loop nest, they are used to access
+   DIR_MATRIX.  */
+
+void
+tree_loop_interchange::update_data_deps (unsigned inner, unsigned outer)
+{
+  struct data_dependence_relation *ddr;
+
+  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
+    {
+      /* Skip no-dependence case.  */
+      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
+	continue;
+
+      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
+	{
+	  lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
+	  std::swap (dir_vect[inner], dir_vect[outer]);
+	}
+    }
+}
+
+/* Check data dependence relations, return TRUE if it's valid to interchange
+   two loops specified by INNER/OUTER.  Note we have been conservative here
+   by not checking all valid cases.  */
+
+bool
+tree_loop_interchange::valid_data_dependences (unsigned inner, unsigned outer)
+{
+  struct data_dependence_relation *ddr;
+
+  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
+    {
+      /* Skip no-dependence case.  */
+      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
+	continue;
+
+      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
+	{
+	  lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
+	  unsigned level = dependence_level (dir_vect, loop_nest.length ());
+
+	  if (level == 0)
+	    continue;
+
+	  level --;
+	  if (!lambda_vector_lexico_pos (dir_vect, level))
+	    return false;
+
+	  if (level < outer)
+	    continue;
+
+	  if (dir_vect[inner] == dir_vect[outer])
+	    continue;
+
+	  if (dir_vect[inner] != dir_equal
+	      && dir_vect[inner] != dir_positive
+	      && dir_vect[inner] != dir_independent
+	      && dir_vect[inner] != dir_positive_or_equal)
+	    return false;
+
+	  if (dir_vect[outer] != dir_equal
+	      && dir_vect[outer] != dir_positive
+	      && dir_vect[outer] != dir_independent
+	      && dir_vect[outer] != dir_positive_or_equal)
+	    return false;
+	}
+    }
+
+  return true;
+}
+
+/* Return true if ILOOP and OLOOP can be interchanged in terms of code
+   transformation.  */
+
+bool
+tree_loop_interchange::can_interchange_loops (loop_cand &iloop,
+					      loop_cand &oloop)
+{
+  /* Iloop could be previously interchanged from inside, check rectangle
+     loop nest again.  */
+  if (chrec_contains_symbols_defined_in_loop (iloop.niters, oloop.loop->num))
+    return false;
+
+  return (iloop.analyze_carried_vars (NULL)
+	  && iloop.analyze_lcssa_phis ()
+	  && oloop.analyze_carried_vars (&iloop)
+	  && oloop.analyze_lcssa_phis ()
+	  && iloop.can_interchange_p (NULL)
+	  && oloop.can_interchange_p (&iloop));
+}
+
+/* Compute and return overall access stride given CHREC and step STRIDE.  */
+
+static inline unsigned HOST_WIDE_INT
+access_stride (tree chrec, tree stride)
+{
+  unsigned HOST_WIDE_INT uhwi_stride, uhwi_step;
+
+  if (tree_fits_uhwi_p (CHREC_RIGHT (chrec)))
+    uhwi_step = tree_to_uhwi (CHREC_RIGHT (chrec));
+  else
+    uhwi_step = AVG_DIM_SIZE;
+
+  gcc_assert (tree_fits_uhwi_p (stride));
+  uhwi_stride = tree_to_uhwi (stride);
+
+  return uhwi_step * uhwi_stride;
+}
+
+/* Given LOOP, strip off its inner loop's chrec from ACCESS_FN and return
+   true for stripped case.  */
+
+static bool
+strip_sub_loop_access_fn (struct loop *loop, tree *access_fn)
+{
+  bool sub_loop_p = false;
+  do {
+    struct loop *sub_loop = get_loop (cfun, CHREC_VARIABLE (*access_fn));
+
+    if (!flow_loop_nested_p (loop, sub_loop))
+      break;
+
+    sub_loop_p = true;
+    *access_fn = CHREC_LEFT (*access_fn);
+  } while (TREE_CODE (*access_fn) == POLYNOMIAL_CHREC);
+
+  return sub_loop_p;
+}
+
+/* Return true if interchanging ILOOP/OLOOP is profitable.  The function
+   computes and compares three types information:
+     1) Access stride for ILOOP and OLOOP.
+     2) Number of invariant memory references with respect to ILOOP before
+	and after loop interchange.
+     3) Flags indicating if all memory references access sequential memory
+	in ILOOP, before and after loop interchange.  */
+
+bool
+tree_loop_interchange::should_interchange_loops (loop_cand &iloop,
+						 loop_cand &oloop)
+{
+  unsigned HOST_WIDE_INT ratio;
+  unsigned i, j, num_old_inv_drs = 0, num_new_inv_drs = 0;
+  struct data_reference *dr;
+  bool all_seq_dr_before_p = true, all_seq_dr_after_p = true;
+  widest_int iloop_strides = 0, oloop_strides = 0;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "\nData ref strides:\n\tmem_ref:\t\tiloop\toloop\n");
+
+  for (i = 0; datarefs.iterate (i, &dr); ++i)
+    {
+      bool sub_loop_p = false;
+      tree ref = DR_REF (dr), stride;
+      tree iloop_chrec = NULL_TREE, iloop_stride = NULL_TREE;
+      tree oloop_chrec = NULL_TREE, oloop_stride = NULL_TREE;
+
+      /* Get CHREC and the corresponding stride for ILOOP/OLOOP.  */
+      for (j = 0; j < DR_NUM_DIMENSIONS (dr); ++j)
+	{
+	  while (handled_component_p (ref) && TREE_CODE (ref) != ARRAY_REF)
+	    ref = TREE_OPERAND (ref, 0);
+
+	  stride = integer_one_node;
+	  if (TREE_CODE (ref) == ARRAY_REF)
+	    {
+	      stride = TYPE_SIZE_UNIT (TREE_TYPE (ref));
+	      ref = TREE_OPERAND (ref, 0);
+	    }
+
+	  tree access_fn = DR_ACCESS_FN (dr, j);
+	  if (TREE_CODE (access_fn) != POLYNOMIAL_CHREC)
+	    continue;
+
+	  sub_loop_p |= strip_sub_loop_access_fn (iloop.loop, &access_fn);
+	  if (TREE_CODE (access_fn) != POLYNOMIAL_CHREC)
+	    continue;
+
+	  decompose_chrecs (iloop.loop->num, oloop.loop->num,
+			    access_fn, j, stride,
+			    &iloop_chrec, &iloop_stride, NULL,
+			    &oloop_chrec, &oloop_stride, NULL);
+	}
+
+      if (!iloop_chrec && !oloop_chrec)
+	continue;
+
+      /* If stride is unknown for inner or outer loop.  */
+      if ((iloop_chrec != NULL
+	   && (iloop_stride == NULL_TREE
+	       || !tree_fits_uhwi_p (iloop_stride)))
+	  || (oloop_chrec != NULL
+	      && (oloop_stride == NULL_TREE
+		  || !tree_fits_uhwi_p (oloop_stride))))
+	continue;
+
+      /* Compute overall access strides for ILOOP.  */
+      unsigned HOST_WIDE_INT t1 = 0, t2 = 0;
+      if (iloop_chrec)
+	{
+	  t1 = access_stride (iloop_chrec, iloop_stride);
+	  iloop_strides = wi::add (iloop_strides, t1);
+	}
+      else if (!sub_loop_p)
+	num_old_inv_drs++;
+
+      /* Compute overall access strides for OLOOP.  */
+      if (oloop_chrec)
+	{
+	  t2 = access_stride (oloop_chrec, oloop_stride);
+	  oloop_strides = wi::add (oloop_strides, t2);
+	}
+      else if (!sub_loop_p)
+	num_new_inv_drs++;
+
+      /* Track if all data references access sequential memory before and
+	 after loop interchange.  */
+      if (sub_loop_p)
+	{
+	  /* Data ref can't be sequential if it evaluates wrto any sub loop
+	     of inner loop.  */
+	  all_seq_dr_before_p = false;
+	  all_seq_dr_after_p = false;
+	}
+      else if ((stride = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr)))) != NULL_TREE
+	       && tree_fits_uhwi_p (stride))
+	{
+	  /* Check if data ref access sequential memory wrto inner loop.
+	     Note invariant is considered sequential.  */
+	  unsigned HOST_WIDE_INT uhwi_stride = tree_to_uhwi (stride);
+	  if (t1 != 0 && t1 != uhwi_stride)
+	    all_seq_dr_before_p = false;
+	  if (t2 != 0 && t2 != uhwi_stride)
+	    all_seq_dr_after_p = false;
+	}
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "\t");
+	  print_generic_expr (dump_file, DR_REF (dr), TDF_SLIM);
+	  fprintf (dump_file, ":\t\t" HOST_WIDE_INT_PRINT_DEC
+		   "\t" HOST_WIDE_INT_PRINT_DEC "\n", t1, t2);
+	}
+    }
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "\toverall:\t\t");
+      print_decu (iloop_strides, dump_file);
+      fprintf (dump_file, "\t");
+      print_decu (oloop_strides, dump_file);
+      fprintf (dump_file, "\n");
+
+      fprintf (dump_file, "Invariant data ref: before(%d), after(%d)\n",
+	       num_old_inv_drs, num_new_inv_drs);
+      fprintf (dump_file, "Consecutive stride: before(%s), after(%s)\n",
+	       all_seq_dr_before_p ? "true" : "false",
+	       all_seq_dr_after_p ? "true" : "false");
+    }
+
+  /* We use different stride comparison ratio for interchanging innermost
+     two loops or not.  The idea is to be conservative in interchange for
+     the innermost loops.  */
+  ratio = !iloop.loop->inner ? INNER_STRIDE_RATIO : OUTER_STRIDE_RATIO;
+  /* Do interchange if it gives better data locality behavior.  */
+  if (wi::gtu_p (iloop_strides, wi::mul (oloop_strides, ratio)))
+    return true;
+  if (wi::gtu_p (iloop_strides, oloop_strides))
+    {
+      /* Or it creates more invariant memory references.  */
+      if ((!all_seq_dr_before_p || all_seq_dr_after_p)
+	  && num_new_inv_drs > num_old_inv_drs)
+	return true;
+      /* Or it makes all memory references sequential.  */
+      if (num_new_inv_drs >= num_old_inv_drs
+	  && !all_seq_dr_before_p && all_seq_dr_after_p)
+	return true;
+    }
+
+  return false;
+}
+
+/* Interchange two loops specified by ILOOP and OLOOP.  */
+
+void
+tree_loop_interchange::interchange_loops (loop_cand &iloop, loop_cand &oloop)
+{
+  interchange_reductions (iloop, oloop);
+  interchange_inductions (iloop, oloop);
+
+  /* TODO: niters information should be swapped here.  */
+  iloop.loop->any_upper_bound = false;
+  iloop.loop->any_likely_upper_bound = false;
+  free_numbers_of_iterations_estimates (iloop.loop);
+  oloop.loop->any_upper_bound = false;
+  oloop.loop->any_likely_upper_bound = false;
+  free_numbers_of_iterations_estimates (oloop.loop);
+}
+
+/* Interchange transformation for reductions of ILOOP and OLOOP.  We only
+   support two types reductions for now:
+     1) simple reduction of inner loop.
+     2) double reduction of loop nest.
+   For simple reduction, we simply undo it by moving producer/consumer to
+   inner loop; for double reduction, we don't need to do anything.  */
+
+void
+tree_loop_interchange::interchange_reductions (loop_cand &iloop,
+					       loop_cand &oloop)
+{
+  unsigned i;
+  reduction_p re;
+
+  for (i = 0; iloop.reductions.iterate (i, &re); ++i)
+    {
+      if (re->type == DOUBLE_RTYPE)
+	continue;
+
+      /* Undo simple reductions.  */
+      iloop.undo_simple_reduction (re);
+    }
+
+  for (i = 0; oloop.reductions.iterate (i, &re); ++i)
+    if (re->type != DOUBLE_RTYPE)
+      gcc_unreachable ();
+}
+
+/* Interchange transformation for inductions of ILOOP and OLOOP.  */
+
+void
+tree_loop_interchange::interchange_inductions (loop_cand &iloop,
+					       loop_cand &oloop)
+{
+  /* Map outer loop's IV to inner loop.  */
+  map_inductions_to_loop (oloop, iloop);
+  /* Map inner loop's IV to outer loop.  */
+  map_inductions_to_loop (iloop, oloop);
+
+  /* Create canonical IV for both loops.  Note canonical IV for outer/inner
+     loop is actually from inner/outer loop.  */
+  create_canonical_iv (oloop.loop, single_exit (oloop.loop), iloop.niters);
+  create_canonical_iv (iloop.loop, single_exit (iloop.loop), oloop.niters);
+}
+
+/* Map induction variables of SRC loop to TGT loop.  The function firstly
+   creates the same IV of SRC loop in TGT loop, then deletes the original
+   IV and re-initialize it using the newly created IV.  For example, loop
+   nest:
+
+     for (i = 0; i < N; i++)
+       for (j = 0; j < M; j++)
+	 {
+	   //...
+	 }
+
+   will be transformed into:
+
+     for (jj = 0; jj < M; jj++)
+       for (ii = 0; ii < N; ii++)
+	 {
+	   j = jj;
+	   i = ii;
+	   //...
+	 }
+
+   after loop interchange.  */
+
+void
+tree_loop_interchange::map_inductions_to_loop (loop_cand &src, loop_cand &tgt)
+{
+  gimple *def_stmt;
+  gassign *assign;
+  induction_p iv;
+  unsigned i;
+  edge e = single_exit (tgt.loop);
+  gimple_stmt_iterator incr_pos = gsi_last_bb (e->src), gsi;
+  bool move_code_p = flow_loop_nested_p (src.loop, tgt.loop);
+
+  /* Move src's code to tgt loop.  This is necessary when src is the outer
+     loop and tgt is the inner loop.  */
+  if (move_code_p)
+    move_code_to_inner_loop (src.loop, tgt.loop, src.bbs);
+
+  /* Map source loop's IV to target loop.  */
+  for (i = 0; src.inductions.iterate (i, &iv); ++i)
+    {
+      /* Step 1: Create the same IV in target loop.  */
+      tree base = unshare_expr (iv->base), step = unshare_expr (iv->step);
+      create_iv (base, step, SSA_NAME_VAR (iv->var),
+		 tgt.loop, &incr_pos, false, &iv->mapped_var, NULL);
+
+      /* Delete var definition of the original IV's in the source loop.  */
+      def_stmt = SSA_NAME_DEF_STMT (iv->var);
+      gcc_assert (is_a <gphi *> (def_stmt));
+      gsi = gsi_for_stmt (def_stmt);
+      gsi_remove (&gsi, true);
+
+      /* Initialize the var with newly added IV and insert it to either tgt
+	 loop or src loop.  Actually, new stmt is always inserted to inner
+	 loop for now.  */
+      assign = gimple_build_assign (iv->var, iv->mapped_var);
+      if (move_code_p)
+	gsi = gsi_after_labels (tgt.loop->header);
+      else
+	gsi = gsi_after_labels (src.loop->header);
+      gsi_insert_before (&gsi, assign, GSI_SAME_STMT);
+    }
+}
+
+/* Compute the insert position at inner loop when moving code from outer
+   loop to inner one.  */
+
+static inline void
+insert_pos_at_inner_loop (struct loop *outer, struct loop *inner,
+			  basic_block bb, gimple_stmt_iterator *pos)
+{
+  if (bb == outer->header || bb == outer->latch)
+    {
+      /* Move code from header/latch to header/latch.  */
+      *pos = gsi_after_labels (inner->header);
+    }
+  else
+    {
+      /* Otherwise, simply move to exit->src.  */
+      edge e = single_exit (inner);
+      *pos = gsi_last_bb (e->src);
+    }
+}
+
+/* Move stmts of outer loop to inner loop.  */
+
+void
+tree_loop_interchange::move_code_to_inner_loop (struct loop *outer,
+						struct loop *inner,
+						basic_block *outer_bbs)
+{
+  unsigned int i;
+  edge oloop_exit = single_exit (outer);
+  gimple_stmt_iterator insert_pos, gsi;
+
+  for (i = 0; i < outer->num_nodes; i++)
+    {
+      basic_block bb = outer_bbs[i];
+
+      /* Skip basic blocks of inner loop.  */
+      if (flow_bb_inside_loop_p (inner, bb))
+	continue;
+
+      insert_pos_at_inner_loop (outer, inner, bb, &insert_pos);
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (gimple_code (stmt) == GIMPLE_LABEL)
+	    {
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (oloop_exit->src == bb
+	      && stmt == gsi_stmt (gsi_last_bb (oloop_exit->src)))
+	    {
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (gimple_vuse (stmt))
+	    gimple_set_vuse (stmt, NULL_TREE);
+	  if (gimple_vdef (stmt))
+	    gimple_set_vdef (stmt, NULL_TREE);
+
+	  gsi_move_before (&gsi, &insert_pos);
+	}
+    }
+}
+
+static int num_interchanged = 0;
+
+/* Try to interchange inner loop of a loop nest to outer level.  */
+
+bool
+tree_loop_interchange::interchange ()
+{
+  bool changed_p = false;
+  /* In each iteration we try to interchange I-th loop with (I+1)-th loop.
+     The overall effect is to push inner loop to outermost level in whole
+     loop nest.  */
+  for (unsigned i = loop_nest.length (); i > 1; --i)
+    {
+      unsigned inner = i - 1, outer = i - 2;
+
+      /* Check validity for loop interchange.  */
+      if (!valid_data_dependences (inner, outer))
+	break;
+
+      loop_cand iloop (loop_nest[inner], loop_nest[outer]);
+      loop_cand oloop (loop_nest[outer], loop_nest[outer]);
+
+      /* Check if we can do transformation for loop interchange.  */
+      if (!can_interchange_loops (iloop, oloop))
+	break;
+
+      /* Check profitability for loop interchange.  */
+      if (should_interchange_loops (iloop, oloop))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Loop_nest<outer:%d, inner:%d> is interchanged\n\n",
+		     oloop.loop->num, iloop.loop->num);
+
+	  num_interchanged++;
+
+	  interchange_loops (iloop, oloop);
+	  /* Update data structures for further loop interchange.  */
+	  update_data_refs (iloop, oloop);
+	  update_data_deps (inner, outer);
+	  changed_p = true;
+	}
+      else
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Loop_nest<outer:%d, inner:%d> is not interchanged\n\n",
+		     oloop.loop->num, iloop.loop->num);
+	}
+    }
+
+  return changed_p;
+}
+
+
+/* Loop interchange pass.  */
+
+namespace {
+
+const pass_data pass_data_linterchange =
+{
+  GIMPLE_PASS, /* type */
+  "linterchange", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_LINTERCHANGE, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_linterchange : public gimple_opt_pass
+{
+public:
+  pass_linterchange (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_linterchange, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  opt_pass * clone () { return new pass_linterchange (m_ctxt); }
+  virtual bool gate (function *) { return flag_tree_loop_interchange; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_linterchange
+
+
+/* Return true if LOOP has proper form for interchange.  */
+
+static bool
+proper_loop_form_for_interchange (struct loop *loop)
+{
+  edge e0, e1, exit;
+
+  /* Don't interchange if outer loop has basic block other than header,
+     exit->src and latch.  In general, only below form of loop nest:
+		header<---+
+                  |       |
+                  v       |
+              INNER_LOOP  |
+                  |       |
+                  v       |
+		exit--->latch
+     is supported.  */
+  if (loop->inner != NULL
+      && loop->num_nodes != loop->inner->num_nodes + 3)
+    return false;
+
+  if ((exit = single_dom_exit (loop)) == NULL)
+    return false;
+
+  /* Check control flow on loop header/exit blocks.  */
+  if (loop->header == exit->src
+      && (EDGE_COUNT (loop->header->preds) != 2
+	  || EDGE_COUNT (loop->header->succs) != 2))
+    return false;
+  else if (loop->header != exit->src
+	   && (EDGE_COUNT (loop->header->preds) != 2
+	       || !single_succ_p (loop->header)
+	       || abnormal_edge (single_succ_edge (loop->header))
+	       || EDGE_COUNT (exit->src->succs) != 2
+	       || !single_pred_p (exit->src)
+	       || abnormal_edge (single_pred_edge (exit->src))))
+    return false;
+
+  e0 = EDGE_PRED (loop->header, 0);
+  e1 = EDGE_PRED (loop->header, 1);
+  if (abnormal_edge (e0) || abnormal_edge (e1)
+      || (e0->src != loop->latch && e1->src != loop->latch)
+      || (e0->src->loop_father == loop && e1->src->loop_father == loop))
+    return false;
+
+  e0 = EDGE_SUCC (exit->src, 0);
+  e1 = EDGE_SUCC (exit->src, 1);
+  if (abnormal_edge (e0) || abnormal_edge (e1)
+      || (e0->dest != loop->latch && e1->dest != loop->latch)
+      || (e0->dest->loop_father == loop && e1->dest->loop_father == loop))
+    return false;
+
+  return true;
+}
+
+/* Given innermost LOOP, return true if perfect loop nest can be found and
+   data dependences can be computed.  If succeed, record the perfect loop
+   nest in LOOP_NEST; record all data references in DATAREFS and record all
+   data dependence relations in DDRS.  */
+
+static bool
+prepare_perfect_loop_nest (struct loop *loop, vec<loop_p> *loop_nest,
+			   vec<data_reference_p> *datarefs, vec<ddr_p> *ddrs)
+{
+  tree niters;
+  struct loop *start_loop = NULL;
+
+  /* Find loop nest from the innermost loop.  */
+  while (loop->num != 0 && loop->inner == start_loop)
+    {
+      if (!proper_loop_form_for_interchange (loop))
+	break;
+
+      /* Loop must have determined niters.  */
+      niters = number_of_latch_executions (loop);
+      if (!niters || chrec_contains_undetermined (niters))
+	break;
+
+      start_loop = loop;
+      /* If this loop has sibling loop, the father loop won't be in perfect
+	 loop nest.  */
+      if (loop->next != NULL)
+	break;
+
+      /* Get the outer loop.  */
+      loop = superloop_at_depth (loop, loop_depth (loop) - 1);
+      /* Only support rectangle loop nest, i.e, inner loop's niters doesn't
+	 depends on outer loop's IV.  */
+      if (chrec_contains_symbols_defined_in_loop (niters, loop->num))
+	break;
+    }
+
+  if (!start_loop || !start_loop->inner)
+    return false;
+
+  /* Check if start_loop forms a perfect loop nest.  */
+  loop_nest->create (3);
+  do {
+    datarefs->create (20);
+    ddrs->create (20);
+    loop_nest->truncate (0);
+    if (compute_data_dependences_for_loop (start_loop, false, loop_nest,
+					   datarefs, ddrs))
+      {
+	unsigned i;
+	struct data_dependence_relation *ddr;
+
+	for (i = 0; ddrs->iterate (i, &ddr); ++i)
+	  {
+	    if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
+	      continue;
+	    /* Give up on any unknown dependence.  */
+	    if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know
+		|| DDR_NUM_DIR_VECTS (ddr) == 0)
+	      break;
+	  }
+
+	if (i == ddrs->length ())
+	  return true;
+      }
+    /* Try to analyze with the outermost loop skipped.  */
+    free_dependence_relations (*ddrs);
+    free_data_refs (*datarefs);
+    start_loop = start_loop->inner;
+  } while (start_loop && start_loop->inner);
+
+  loop_nest->release ();
+  return false;
+}
+
+/* Main entry for loop interchange pass.  */
+
+unsigned int
+pass_linterchange::execute (function *fun)
+{
+  if (number_of_loops (fun) <= 2)
+    return 0;
+
+  bool changed_p = false;;
+  struct loop *loop;
+  vec<loop_p> loop_nest;
+  vec<data_reference_p> datarefs;
+  vec<ddr_p> ddrs;
+
+  FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
+    if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
+      {
+	tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs);
+	changed_p |= loop_interchange.interchange ();
+      }
+
+  return changed_p ? (TODO_update_ssa_only_virtuals) : 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_linterchange (gcc::context *ctxt)
+{
+  return new pass_linterchange (ctxt);
+}
diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
index efb199a..2b3bce6 100644
--- a/gcc/tree-ssa-loop-ivcanon.c
+++ b/gcc/tree-ssa-loop-ivcanon.c
@@ -78,7 +78,7 @@ enum unroll_level
 /* Adds a canonical induction variable to LOOP iterating NITER times.  EXIT
    is the exit edge whose condition is replaced.  */
 
-static void
+void
 create_canonical_iv (struct loop *loop, edge exit, tree niter)
 {
   edge in;
diff --git a/gcc/tree-ssa-loop-ivopts.h b/gcc/tree-ssa-loop-ivopts.h
index f8f31e9..f741fe2 100644
--- a/gcc/tree-ssa-loop-ivopts.h
+++ b/gcc/tree-ssa-loop-ivopts.h
@@ -31,4 +31,5 @@ extern bool expr_invariant_in_loop_p (struct loop *, tree);
 bool may_be_nonaddressable_p (tree expr);
 void tree_ssa_iv_optimize (void);
 
+void create_canonical_iv (struct loop *, edge, tree);
 #endif /* GCC_TREE_SSA_LOOP_IVOPTS_H */

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH GCC]A simple implementation of loop interchange
  2017-08-30 14:31 [PATCH GCC]A simple implementation of loop interchange Bin Cheng
@ 2017-08-30 15:11 ` Richard Biener
  2017-08-30 17:03   ` Bin.Cheng
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Biener @ 2017-08-30 15:11 UTC (permalink / raw)
  To: Bin Cheng, Michael Matz; +Cc: gcc-patches, nd

On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
> Hi,
> This patch implements a simple loop interchange pass in GCC, as described by its comments:
> +/* This pass performs loop interchange: for example, the loop nest
> +
> +   for (int j = 0; j < N; j++)
> +     for (int k = 0; k < N; k++)
> +       for (int i = 0; i < N; i++)
> +        c[i][j] = c[i][j] + a[i][k]*b[k][j];
> +
> +   is transformed to
> +
> +   for (int i = 0; i < N; i++)
> +     for (int j = 0; j < N; j++)
> +       for (int k = 0; k < N; k++)
> +        c[i][j] = c[i][j] + a[i][k]*b[k][j];
> +
> +   This pass implements loop interchange in the following steps:
> +
> +     1) Find perfect loop nest for each innermost loop and compute data
> +       dependence relations for it.  For above example, loop nest is
> +       <loop_j, loop_k, loop_i>.
> +     2) From innermost to outermost loop, this pass tries to interchange
> +       each loop pair.  For above case, it firstly tries to interchange
> +       <loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
> +       Then it tries to interchange <loop_j, loop_i> and loop nest becomes
> +       <loop_i, loop_j, loop_k>.  The overall effect is to move innermost
> +       loop to the outermost position.  For loop pair <loop_i, loop_j>
> +       to be interchanged, we:
> +     3) Check if data dependence relations are valid for loop interchange.
> +     4) Check if both loops can be interchanged in terms of transformation.
> +     5) Check if interchanging the two loops is profitable.
> +     6) Interchange the two loops by mapping induction variables.
> +
> +   This pass also handles reductions in loop nest.  So far we only support
> +   simple reduction of inner loop and double reduction of the loop nest.  */
>
> Actually, this pass only does loop shift which outermosting inner loop to outer, rather
> than permutation.  Also as a traditional loop optimizer, it only works for perfect loop
> nest.  I put it just after loop distribution thus ideally loop split/distribution could
> create perfect nest for it.  Unfortunately, we don't get any perfect nest from distribution
> for now because it only works for innermost loop.  For example, the motivation case in
> spec2k6/bwaves is not handled on this pass alone.  I have a patch extending distribution
> for (innermost) loop nest and with that patch bwaves case can be handled.
> Another point is I deliberately make both the cost model and code transformation (very)
> conservative.  We can support more cases, or more transformations with great care when
> it is for sure known beneficial.  IMHO, we already hit over-baked issues quite often and
> don't want to introduce more.
> As for code generation, this patch has an issue that invariant code in outer loop could
> be moved to inner loop.  For the moment, we rely on the last lim pass to handle all INV
> generated during interchange.  In the future, we may need to avoid that in interchange
> itself, or another lim pass just like the one after graphite optimizations.
>
> Boostrap and test on x86_64 and AArch64.  Various benchmarks built and run successfully.
> Note this pass is disabled in patch, while the code is exercised by bootstrap/building
> programs with it enabled by default.  Any comments?

+/* The same as above, but this one is only used for interchanging not
+   innermost loops.  */
+#define OUTER_STRIDE_RATIO     (2)

please make all these knobs --params.

+/* Enum type for loop reduction variable.  */
+enum reduction_type
+{
+  UNKNOWN_RTYPE = 0,
+  SIMPLE_RTYPE,
+  DOUBLE_RTYPE
+};

seeing this we should have some generic data structure / analysis for
reduction detection.  This adds a third user (autopar and vectorizer
are the others).  Just an idea.

+/* Return true if E is abnormal edge.  */
+
+static inline bool
+abnormal_edge (edge e)
+{
+  return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
+}

bad name/comment for what it does.

... jumping to end of file / start of pass

+      /* Get the outer loop.  */
+      loop = superloop_at_depth (loop, loop_depth (loop) - 1);

loop_outer (loop)?

+      /* Only support rectangle loop nest, i.e, inner loop's niters doesn't
+        depends on outer loop's IV.  */
+      if (chrec_contains_symbols_defined_in_loop (niters, loop->num))
+       break;

but you don't check for a three-nest whether niters depends on outer outer
loop's IV that way.  Either the check is superfluous here or incomplete.

+  /* Check if start_loop forms a perfect loop nest.  */
+  loop_nest->create (3);
+  do {
+    datarefs->create (20);
+    ddrs->create (20);
+    loop_nest->truncate (0);
+    if (compute_data_dependences_for_loop (start_loop, false, loop_nest,
+                                          datarefs, ddrs))
+      {
+       unsigned i;
+       struct data_dependence_relation *ddr;
+
+       for (i = 0; ddrs->iterate (i, &ddr); ++i)
+         {
+           if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
+             continue;
+           /* Give up on any unknown dependence.  */
+           if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know
+               || DDR_NUM_DIR_VECTS (ddr) == 0)
+             break;
+         }
+
+       if (i == ddrs->length ())
+         return true;

better open-code this so we dont' waste time computing all dependences
when we give up in the majority of cases (unknown dependence).
Memleak here (ddrs and datarefs).

+    start_loop = start_loop->inner;
+  } while (start_loop && start_loop->inner);

ick.  So this is cubic -- nest depth * #drs * #drs ... (exactly why I
never committed loop distribution for nests ;)).

I see that should_interchange_loops only uses datarefs.  This means
I'd rather do that as a very first step before considering validity
(and computing dependences).  That analysis (for all possible
interchanges) should be much cheaper?  I see it probably doesn't
fit with the iteration you do very well...  can't we somehow compute
a loop permutation and apply it in a single step rather than
piecewise with update_data_refs/deps?

valid_data_dependences has almost no comments, it would be nice
to add some (overall) one(s).

Thanks,
Richard.


> Thanks,
> bin
> 2017-08-29  Bin Cheng  <bin.cheng@arm.com>
>
>         * Makefile.in (tree-ssa-loop-interchange.o): New object file.
>         * common.opt (ftree-loop-interchange): New option.
>         * doc/invoke.texi (-ftree-loop-interchange): Document new option.
>         * passes.def (pass_linterchange): New pass.
>         * timevar.def (TV_LINTERCHANGE): New time var.
>         * tree-pass.h (make_pass_linterchange): New declaration.
>         * tree-ssa-loop-interchange.cc: New file.
>         * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
>         * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>
> gcc/testsuite
> 2017-08-29  Bin Cheng  <bin.cheng@arm.com>
>
>         * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>         * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>         * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>         * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>         * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>         * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>         * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>         * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>         * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>         * gcc.dg/tree-ssa/loop-interchange-10.c: New test.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH GCC]A simple implementation of loop interchange
  2017-08-30 15:11 ` Richard Biener
@ 2017-08-30 17:03   ` Bin.Cheng
  2017-09-04 13:55     ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: Bin.Cheng @ 2017-08-30 17:03 UTC (permalink / raw)
  To: Richard Biener; +Cc: Michael Matz, gcc-patches

On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>> Hi,
>> This patch implements a simple loop interchange pass in GCC, as described by its comments:
>> +/* This pass performs loop interchange: for example, the loop nest
>> +
>> +   for (int j = 0; j < N; j++)
>> +     for (int k = 0; k < N; k++)
>> +       for (int i = 0; i < N; i++)
>> +        c[i][j] = c[i][j] + a[i][k]*b[k][j];
>> +
>> +   is transformed to
>> +
>> +   for (int i = 0; i < N; i++)
>> +     for (int j = 0; j < N; j++)
>> +       for (int k = 0; k < N; k++)
>> +        c[i][j] = c[i][j] + a[i][k]*b[k][j];
>> +
>> +   This pass implements loop interchange in the following steps:
>> +
>> +     1) Find perfect loop nest for each innermost loop and compute data
>> +       dependence relations for it.  For above example, loop nest is
>> +       <loop_j, loop_k, loop_i>.
>> +     2) From innermost to outermost loop, this pass tries to interchange
>> +       each loop pair.  For above case, it firstly tries to interchange
>> +       <loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
>> +       Then it tries to interchange <loop_j, loop_i> and loop nest becomes
>> +       <loop_i, loop_j, loop_k>.  The overall effect is to move innermost
>> +       loop to the outermost position.  For loop pair <loop_i, loop_j>
>> +       to be interchanged, we:
>> +     3) Check if data dependence relations are valid for loop interchange.
>> +     4) Check if both loops can be interchanged in terms of transformation.
>> +     5) Check if interchanging the two loops is profitable.
>> +     6) Interchange the two loops by mapping induction variables.
>> +
>> +   This pass also handles reductions in loop nest.  So far we only support
>> +   simple reduction of inner loop and double reduction of the loop nest.  */
>>
>> Actually, this pass only does loop shift which outermosting inner loop to outer, rather
>> than permutation.  Also as a traditional loop optimizer, it only works for perfect loop
>> nest.  I put it just after loop distribution thus ideally loop split/distribution could
>> create perfect nest for it.  Unfortunately, we don't get any perfect nest from distribution
>> for now because it only works for innermost loop.  For example, the motivation case in
>> spec2k6/bwaves is not handled on this pass alone.  I have a patch extending distribution
>> for (innermost) loop nest and with that patch bwaves case can be handled.
>> Another point is I deliberately make both the cost model and code transformation (very)
>> conservative.  We can support more cases, or more transformations with great care when
>> it is for sure known beneficial.  IMHO, we already hit over-baked issues quite often and
>> don't want to introduce more.
>> As for code generation, this patch has an issue that invariant code in outer loop could
>> be moved to inner loop.  For the moment, we rely on the last lim pass to handle all INV
>> generated during interchange.  In the future, we may need to avoid that in interchange
>> itself, or another lim pass just like the one after graphite optimizations.
>>
>> Boostrap and test on x86_64 and AArch64.  Various benchmarks built and run successfully.
>> Note this pass is disabled in patch, while the code is exercised by bootstrap/building
>> programs with it enabled by default.  Any comments?
>
Thanks for quick review.
> +/* The same as above, but this one is only used for interchanging not
> +   innermost loops.  */
> +#define OUTER_STRIDE_RATIO     (2)
>
> please make all these knobs --params.
>
> +/* Enum type for loop reduction variable.  */
> +enum reduction_type
> +{
> +  UNKNOWN_RTYPE = 0,
> +  SIMPLE_RTYPE,
> +  DOUBLE_RTYPE
> +};
>
> seeing this we should have some generic data structure / analysis for
> reduction detection.  This adds a third user (autopar and vectorizer
> are the others).  Just an idea.
>
> +/* Return true if E is abnormal edge.  */
> +
> +static inline bool
> +abnormal_edge (edge e)
> +{
> +  return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
> +}
>
> bad name/comment for what it does.
>
> ... jumping to end of file / start of pass
>
> +      /* Get the outer loop.  */
> +      loop = superloop_at_depth (loop, loop_depth (loop) - 1);
>
> loop_outer (loop)?
>
> +      /* Only support rectangle loop nest, i.e, inner loop's niters doesn't
> +        depends on outer loop's IV.  */
> +      if (chrec_contains_symbols_defined_in_loop (niters, loop->num))
> +       break;
>
> but you don't check for a three-nest whether niters depends on outer outer
> loop's IV that way.  Either the check is superfluous here or incomplete.
It is checked for multi-nest case in can_interchange_loops.  I will
move the check to this function so that we can save compilation time.
>
> +  /* Check if start_loop forms a perfect loop nest.  */
> +  loop_nest->create (3);
> +  do {
> +    datarefs->create (20);
> +    ddrs->create (20);
> +    loop_nest->truncate (0);
> +    if (compute_data_dependences_for_loop (start_loop, false, loop_nest,
> +                                          datarefs, ddrs))
> +      {
> +       unsigned i;
> +       struct data_dependence_relation *ddr;
> +
> +       for (i = 0; ddrs->iterate (i, &ddr); ++i)
> +         {
> +           if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
> +             continue;
> +           /* Give up on any unknown dependence.  */
> +           if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know
> +               || DDR_NUM_DIR_VECTS (ddr) == 0)
> +             break;
> +         }
> +
> +       if (i == ddrs->length ())
> +         return true;
>
> better open-code this so we dont' waste time computing all dependences
> when we give up in the majority of cases (unknown dependence).
Right.  Will make the change.
> Memleak here (ddrs and datarefs).
>
> +    start_loop = start_loop->inner;
> +  } while (start_loop && start_loop->inner);
>
> ick.  So this is cubic -- nest depth * #drs * #drs ... (exactly why I
> never committed loop distribution for nests ;)).
Hmm, loop distribution for (innermost) nest is necessary for this pass
to handle bwaves unfortunately.  It is also necessary to distribute
for (i;)
  for (j)
    arr[i * len + j] = 0;
into a single memcall, rather than a loop of sub-memcall.

>
> I see that should_interchange_loops only uses datarefs.  This means
> I'd rather do that as a very first step before considering validity
> (and computing dependences).  That analysis (for all possible
> interchanges) should be much cheaper?  I see it probably doesn't
Yes it makes the code less straightforward.  I think we can
additionally call should_interchange_loops before computing
dependencies.  This will bypass for most cases, for example, ~3500
loops can be interchanged without cost model, while only  ~200 loops
actually pass the cost model.

> fit with the iteration you do very well...  can't we somehow compute
> a loop permutation and apply it in a single step rather than
> piecewise with update_data_refs/deps?
Arbitrary loop permutation would be non-trivial.  If you meant
computing loop shit (outermosting here) and transforming the code once
for all, I think it's doable.  Cost model check can be easily extended
in that way, though transformation part needs quite lot rework.  One
of my concern is, other than matrix-mul, I don't have motivation case
for multi-nest interchange so far.
>
> valid_data_dependences has almost no comments, it would be nice
> to add some (overall) one(s).
Sorry, I just realized it's wrong because direct vector is misused for
distance vector...  Surprised it didn't bite with thousands of
interchanged loops in spec...  Will fix it.

>
> Thanks,
> Richard.
>
>
>> Thanks,
>> bin
>> 2017-08-29  Bin Cheng  <bin.cheng@arm.com>
>>
>>         * Makefile.in (tree-ssa-loop-interchange.o): New object file.
>>         * common.opt (ftree-loop-interchange): New option.
>>         * doc/invoke.texi (-ftree-loop-interchange): Document new option.
>>         * passes.def (pass_linterchange): New pass.
>>         * timevar.def (TV_LINTERCHANGE): New time var.
>>         * tree-pass.h (make_pass_linterchange): New declaration.
>>         * tree-ssa-loop-interchange.cc: New file.
>>         * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
>>         * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>>
>> gcc/testsuite
>> 2017-08-29  Bin Cheng  <bin.cheng@arm.com>
>>
>>         * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>>         * gcc.dg/tree-ssa/loop-interchange-10.c: New test.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH GCC]A simple implementation of loop interchange
  2017-08-30 17:03   ` Bin.Cheng
@ 2017-09-04 13:55     ` Richard Biener
  2017-09-04 14:48       ` Bin.Cheng
  2017-09-22 10:25       ` Bin.Cheng
  0 siblings, 2 replies; 10+ messages in thread
From: Richard Biener @ 2017-09-04 13:55 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Michael Matz, gcc-patches

On Wed, Aug 30, 2017 at 6:32 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> This patch implements a simple loop interchange pass in GCC, as described by its comments:
>>> +/* This pass performs loop interchange: for example, the loop nest
>>> +
>>> +   for (int j = 0; j < N; j++)
>>> +     for (int k = 0; k < N; k++)
>>> +       for (int i = 0; i < N; i++)
>>> +        c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>> +
>>> +   is transformed to
>>> +
>>> +   for (int i = 0; i < N; i++)
>>> +     for (int j = 0; j < N; j++)
>>> +       for (int k = 0; k < N; k++)
>>> +        c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>> +
>>> +   This pass implements loop interchange in the following steps:
>>> +
>>> +     1) Find perfect loop nest for each innermost loop and compute data
>>> +       dependence relations for it.  For above example, loop nest is
>>> +       <loop_j, loop_k, loop_i>.
>>> +     2) From innermost to outermost loop, this pass tries to interchange
>>> +       each loop pair.  For above case, it firstly tries to interchange
>>> +       <loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
>>> +       Then it tries to interchange <loop_j, loop_i> and loop nest becomes
>>> +       <loop_i, loop_j, loop_k>.  The overall effect is to move innermost
>>> +       loop to the outermost position.  For loop pair <loop_i, loop_j>
>>> +       to be interchanged, we:
>>> +     3) Check if data dependence relations are valid for loop interchange.
>>> +     4) Check if both loops can be interchanged in terms of transformation.
>>> +     5) Check if interchanging the two loops is profitable.
>>> +     6) Interchange the two loops by mapping induction variables.
>>> +
>>> +   This pass also handles reductions in loop nest.  So far we only support
>>> +   simple reduction of inner loop and double reduction of the loop nest.  */
>>>
>>> Actually, this pass only does loop shift which outermosting inner loop to outer, rather
>>> than permutation.  Also as a traditional loop optimizer, it only works for perfect loop
>>> nest.  I put it just after loop distribution thus ideally loop split/distribution could
>>> create perfect nest for it.  Unfortunately, we don't get any perfect nest from distribution
>>> for now because it only works for innermost loop.  For example, the motivation case in
>>> spec2k6/bwaves is not handled on this pass alone.  I have a patch extending distribution
>>> for (innermost) loop nest and with that patch bwaves case can be handled.
>>> Another point is I deliberately make both the cost model and code transformation (very)
>>> conservative.  We can support more cases, or more transformations with great care when
>>> it is for sure known beneficial.  IMHO, we already hit over-baked issues quite often and
>>> don't want to introduce more.
>>> As for code generation, this patch has an issue that invariant code in outer loop could
>>> be moved to inner loop.  For the moment, we rely on the last lim pass to handle all INV
>>> generated during interchange.  In the future, we may need to avoid that in interchange
>>> itself, or another lim pass just like the one after graphite optimizations.
>>>
>>> Boostrap and test on x86_64 and AArch64.  Various benchmarks built and run successfully.
>>> Note this pass is disabled in patch, while the code is exercised by bootstrap/building
>>> programs with it enabled by default.  Any comments?
>>
> Thanks for quick review.
>> +/* The same as above, but this one is only used for interchanging not
>> +   innermost loops.  */
>> +#define OUTER_STRIDE_RATIO     (2)
>>
>> please make all these knobs --params.
>>
>> +/* Enum type for loop reduction variable.  */
>> +enum reduction_type
>> +{
>> +  UNKNOWN_RTYPE = 0,
>> +  SIMPLE_RTYPE,
>> +  DOUBLE_RTYPE
>> +};
>>
>> seeing this we should have some generic data structure / analysis for
>> reduction detection.  This adds a third user (autopar and vectorizer
>> are the others).  Just an idea.
>>
>> +/* Return true if E is abnormal edge.  */
>> +
>> +static inline bool
>> +abnormal_edge (edge e)
>> +{
>> +  return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
>> +}
>>
>> bad name/comment for what it does.
>>
>> ... jumping to end of file / start of pass
>>
>> +      /* Get the outer loop.  */
>> +      loop = superloop_at_depth (loop, loop_depth (loop) - 1);
>>
>> loop_outer (loop)?
>>
>> +      /* Only support rectangle loop nest, i.e, inner loop's niters doesn't
>> +        depends on outer loop's IV.  */
>> +      if (chrec_contains_symbols_defined_in_loop (niters, loop->num))
>> +       break;
>>
>> but you don't check for a three-nest whether niters depends on outer outer
>> loop's IV that way.  Either the check is superfluous here or incomplete.
> It is checked for multi-nest case in can_interchange_loops.  I will
> move the check to this function so that we can save compilation time.
>>
>> +  /* Check if start_loop forms a perfect loop nest.  */
>> +  loop_nest->create (3);
>> +  do {
>> +    datarefs->create (20);
>> +    ddrs->create (20);
>> +    loop_nest->truncate (0);
>> +    if (compute_data_dependences_for_loop (start_loop, false, loop_nest,
>> +                                          datarefs, ddrs))
>> +      {
>> +       unsigned i;
>> +       struct data_dependence_relation *ddr;
>> +
>> +       for (i = 0; ddrs->iterate (i, &ddr); ++i)
>> +         {
>> +           if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>> +             continue;
>> +           /* Give up on any unknown dependence.  */
>> +           if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know
>> +               || DDR_NUM_DIR_VECTS (ddr) == 0)
>> +             break;
>> +         }
>> +
>> +       if (i == ddrs->length ())
>> +         return true;
>>
>> better open-code this so we dont' waste time computing all dependences
>> when we give up in the majority of cases (unknown dependence).
> Right.  Will make the change.
>> Memleak here (ddrs and datarefs).
>>
>> +    start_loop = start_loop->inner;
>> +  } while (start_loop && start_loop->inner);
>>
>> ick.  So this is cubic -- nest depth * #drs * #drs ... (exactly why I
>> never committed loop distribution for nests ;)).
> Hmm, loop distribution for (innermost) nest is necessary for this pass
> to handle bwaves unfortunately.  It is also necessary to distribute
> for (i;)
>   for (j)
>     arr[i * len + j] = 0;
> into a single memcall, rather than a loop of sub-memcall.

I know...  bwaves is what I implemented the change for to enable interchange.
I think right now store-motion makes the nest difficult to handle as it
has a reduction (but I see you handle reductions in interchange -- just this
one cannot be handled I think?).

Richard.

>>
>> I see that should_interchange_loops only uses datarefs.  This means
>> I'd rather do that as a very first step before considering validity
>> (and computing dependences).  That analysis (for all possible
>> interchanges) should be much cheaper?  I see it probably doesn't
> Yes it makes the code less straightforward.  I think we can
> additionally call should_interchange_loops before computing
> dependencies.  This will bypass for most cases, for example, ~3500
> loops can be interchanged without cost model, while only  ~200 loops
> actually pass the cost model.
>
>> fit with the iteration you do very well...  can't we somehow compute
>> a loop permutation and apply it in a single step rather than
>> piecewise with update_data_refs/deps?
> Arbitrary loop permutation would be non-trivial.  If you meant
> computing loop shit (outermosting here) and transforming the code once
> for all, I think it's doable.  Cost model check can be easily extended
> in that way, though transformation part needs quite lot rework.  One
> of my concern is, other than matrix-mul, I don't have motivation case
> for multi-nest interchange so far.
>>
>> valid_data_dependences has almost no comments, it would be nice
>> to add some (overall) one(s).
> Sorry, I just realized it's wrong because direct vector is misused for
> distance vector...  Surprised it didn't bite with thousands of
> interchanged loops in spec...  Will fix it.
>
>>
>> Thanks,
>> Richard.
>>
>>
>>> Thanks,
>>> bin
>>> 2017-08-29  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>         * Makefile.in (tree-ssa-loop-interchange.o): New object file.
>>>         * common.opt (ftree-loop-interchange): New option.
>>>         * doc/invoke.texi (-ftree-loop-interchange): Document new option.
>>>         * passes.def (pass_linterchange): New pass.
>>>         * timevar.def (TV_LINTERCHANGE): New time var.
>>>         * tree-pass.h (make_pass_linterchange): New declaration.
>>>         * tree-ssa-loop-interchange.cc: New file.
>>>         * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
>>>         * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>>>
>>> gcc/testsuite
>>> 2017-08-29  Bin Cheng  <bin.cheng@arm.com>
>>>
>>>         * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>>>         * gcc.dg/tree-ssa/loop-interchange-10.c: New test.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH GCC]A simple implementation of loop interchange
  2017-09-04 13:55     ` Richard Biener
@ 2017-09-04 14:48       ` Bin.Cheng
  2017-09-22 10:25       ` Bin.Cheng
  1 sibling, 0 replies; 10+ messages in thread
From: Bin.Cheng @ 2017-09-04 14:48 UTC (permalink / raw)
  To: Richard Biener; +Cc: Michael Matz, gcc-patches

On Mon, Sep 4, 2017 at 2:54 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Wed, Aug 30, 2017 at 6:32 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>> Hi,
>>>> This patch implements a simple loop interchange pass in GCC, as described by its comments:
>>>> +/* This pass performs loop interchange: for example, the loop nest
>>>> +
>>>> +   for (int j = 0; j < N; j++)
>>>> +     for (int k = 0; k < N; k++)
>>>> +       for (int i = 0; i < N; i++)
>>>> +        c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>>> +
>>>> +   is transformed to
>>>> +
>>>> +   for (int i = 0; i < N; i++)
>>>> +     for (int j = 0; j < N; j++)
>>>> +       for (int k = 0; k < N; k++)
>>>> +        c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>>> +
>>>> +   This pass implements loop interchange in the following steps:
>>>> +
>>>> +     1) Find perfect loop nest for each innermost loop and compute data
>>>> +       dependence relations for it.  For above example, loop nest is
>>>> +       <loop_j, loop_k, loop_i>.
>>>> +     2) From innermost to outermost loop, this pass tries to interchange
>>>> +       each loop pair.  For above case, it firstly tries to interchange
>>>> +       <loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
>>>> +       Then it tries to interchange <loop_j, loop_i> and loop nest becomes
>>>> +       <loop_i, loop_j, loop_k>.  The overall effect is to move innermost
>>>> +       loop to the outermost position.  For loop pair <loop_i, loop_j>
>>>> +       to be interchanged, we:
>>>> +     3) Check if data dependence relations are valid for loop interchange.
>>>> +     4) Check if both loops can be interchanged in terms of transformation.
>>>> +     5) Check if interchanging the two loops is profitable.
>>>> +     6) Interchange the two loops by mapping induction variables.
>>>> +
>>>> +   This pass also handles reductions in loop nest.  So far we only support
>>>> +   simple reduction of inner loop and double reduction of the loop nest.  */
>>>>
>>>> Actually, this pass only does loop shift which outermosting inner loop to outer, rather
>>>> than permutation.  Also as a traditional loop optimizer, it only works for perfect loop
>>>> nest.  I put it just after loop distribution thus ideally loop split/distribution could
>>>> create perfect nest for it.  Unfortunately, we don't get any perfect nest from distribution
>>>> for now because it only works for innermost loop.  For example, the motivation case in
>>>> spec2k6/bwaves is not handled on this pass alone.  I have a patch extending distribution
>>>> for (innermost) loop nest and with that patch bwaves case can be handled.
>>>> Another point is I deliberately make both the cost model and code transformation (very)
>>>> conservative.  We can support more cases, or more transformations with great care when
>>>> it is for sure known beneficial.  IMHO, we already hit over-baked issues quite often and
>>>> don't want to introduce more.
>>>> As for code generation, this patch has an issue that invariant code in outer loop could
>>>> be moved to inner loop.  For the moment, we rely on the last lim pass to handle all INV
>>>> generated during interchange.  In the future, we may need to avoid that in interchange
>>>> itself, or another lim pass just like the one after graphite optimizations.
>>>>
>>>> Boostrap and test on x86_64 and AArch64.  Various benchmarks built and run successfully.
>>>> Note this pass is disabled in patch, while the code is exercised by bootstrap/building
>>>> programs with it enabled by default.  Any comments?
>>>
>> Thanks for quick review.
>>> +/* The same as above, but this one is only used for interchanging not
>>> +   innermost loops.  */
>>> +#define OUTER_STRIDE_RATIO     (2)
>>>
>>> please make all these knobs --params.
>>>
>>> +/* Enum type for loop reduction variable.  */
>>> +enum reduction_type
>>> +{
>>> +  UNKNOWN_RTYPE = 0,
>>> +  SIMPLE_RTYPE,
>>> +  DOUBLE_RTYPE
>>> +};
>>>
>>> seeing this we should have some generic data structure / analysis for
>>> reduction detection.  This adds a third user (autopar and vectorizer
>>> are the others).  Just an idea.
>>>
>>> +/* Return true if E is abnormal edge.  */
>>> +
>>> +static inline bool
>>> +abnormal_edge (edge e)
>>> +{
>>> +  return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
>>> +}
>>>
>>> bad name/comment for what it does.
>>>
>>> ... jumping to end of file / start of pass
>>>
>>> +      /* Get the outer loop.  */
>>> +      loop = superloop_at_depth (loop, loop_depth (loop) - 1);
>>>
>>> loop_outer (loop)?
>>>
>>> +      /* Only support rectangle loop nest, i.e, inner loop's niters doesn't
>>> +        depends on outer loop's IV.  */
>>> +      if (chrec_contains_symbols_defined_in_loop (niters, loop->num))
>>> +       break;
>>>
>>> but you don't check for a three-nest whether niters depends on outer outer
>>> loop's IV that way.  Either the check is superfluous here or incomplete.
>> It is checked for multi-nest case in can_interchange_loops.  I will
>> move the check to this function so that we can save compilation time.
>>>
>>> +  /* Check if start_loop forms a perfect loop nest.  */
>>> +  loop_nest->create (3);
>>> +  do {
>>> +    datarefs->create (20);
>>> +    ddrs->create (20);
>>> +    loop_nest->truncate (0);
>>> +    if (compute_data_dependences_for_loop (start_loop, false, loop_nest,
>>> +                                          datarefs, ddrs))
>>> +      {
>>> +       unsigned i;
>>> +       struct data_dependence_relation *ddr;
>>> +
>>> +       for (i = 0; ddrs->iterate (i, &ddr); ++i)
>>> +         {
>>> +           if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>>> +             continue;
>>> +           /* Give up on any unknown dependence.  */
>>> +           if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know
>>> +               || DDR_NUM_DIR_VECTS (ddr) == 0)
>>> +             break;
>>> +         }
>>> +
>>> +       if (i == ddrs->length ())
>>> +         return true;
>>>
>>> better open-code this so we dont' waste time computing all dependences
>>> when we give up in the majority of cases (unknown dependence).
>> Right.  Will make the change.
>>> Memleak here (ddrs and datarefs).
>>>
>>> +    start_loop = start_loop->inner;
>>> +  } while (start_loop && start_loop->inner);
>>>
>>> ick.  So this is cubic -- nest depth * #drs * #drs ... (exactly why I
>>> never committed loop distribution for nests ;)).
>> Hmm, loop distribution for (innermost) nest is necessary for this pass
>> to handle bwaves unfortunately.  It is also necessary to distribute
>> for (i;)
>>   for (j)
>>     arr[i * len + j] = 0;
>> into a single memcall, rather than a loop of sub-memcall.
>
> I know...  bwaves is what I implemented the change for to enable interchange.
> I think right now store-motion makes the nest difficult to handle as it
> has a reduction (but I see you handle reductions in interchange -- just this
> one cannot be handled I think?).
Yes, it's necessary to handle reduction in loop interchange because
store motion in GCC is powerful enough to change cases like bwaves and
matrix-mul into reduction.
As for bwaves, the transformed reduction is supported, it is the
initialization statement not handled by this patch.  Well, it actually
can but that's effectively doing distribution in interchange pass.
One of my idea is to only implement single-objective passes, so I'd
rather rely on distribution for that.

Thanks,
bin
>
> Richard.
>
>>>
>>> I see that should_interchange_loops only uses datarefs.  This means
>>> I'd rather do that as a very first step before considering validity
>>> (and computing dependences).  That analysis (for all possible
>>> interchanges) should be much cheaper?  I see it probably doesn't
>> Yes it makes the code less straightforward.  I think we can
>> additionally call should_interchange_loops before computing
>> dependencies.  This will bypass for most cases, for example, ~3500
>> loops can be interchanged without cost model, while only  ~200 loops
>> actually pass the cost model.
>>
>>> fit with the iteration you do very well...  can't we somehow compute
>>> a loop permutation and apply it in a single step rather than
>>> piecewise with update_data_refs/deps?
>> Arbitrary loop permutation would be non-trivial.  If you meant
>> computing loop shit (outermosting here) and transforming the code once
>> for all, I think it's doable.  Cost model check can be easily extended
>> in that way, though transformation part needs quite lot rework.  One
>> of my concern is, other than matrix-mul, I don't have motivation case
>> for multi-nest interchange so far.
>>>
>>> valid_data_dependences has almost no comments, it would be nice
>>> to add some (overall) one(s).
>> Sorry, I just realized it's wrong because direct vector is misused for
>> distance vector...  Surprised it didn't bite with thousands of
>> interchanged loops in spec...  Will fix it.
>>
>>>
>>> Thanks,
>>> Richard.
>>>
>>>
>>>> Thanks,
>>>> bin
>>>> 2017-08-29  Bin Cheng  <bin.cheng@arm.com>
>>>>
>>>>         * Makefile.in (tree-ssa-loop-interchange.o): New object file.
>>>>         * common.opt (ftree-loop-interchange): New option.
>>>>         * doc/invoke.texi (-ftree-loop-interchange): Document new option.
>>>>         * passes.def (pass_linterchange): New pass.
>>>>         * timevar.def (TV_LINTERCHANGE): New time var.
>>>>         * tree-pass.h (make_pass_linterchange): New declaration.
>>>>         * tree-ssa-loop-interchange.cc: New file.
>>>>         * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
>>>>         * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>>>>
>>>> gcc/testsuite
>>>> 2017-08-29  Bin Cheng  <bin.cheng@arm.com>
>>>>
>>>>         * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>>>>         * gcc.dg/tree-ssa/loop-interchange-10.c: New test.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH GCC]A simple implementation of loop interchange
  2017-09-04 13:55     ` Richard Biener
  2017-09-04 14:48       ` Bin.Cheng
@ 2017-09-22 10:25       ` Bin.Cheng
  2017-10-24 14:35         ` Michael Matz
  1 sibling, 1 reply; 10+ messages in thread
From: Bin.Cheng @ 2017-09-22 10:25 UTC (permalink / raw)
  To: Richard Biener; +Cc: Michael Matz, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 10664 bytes --]

On Mon, Sep 4, 2017 at 2:54 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Wed, Aug 30, 2017 at 6:32 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>> Hi,
>>>> This patch implements a simple loop interchange pass in GCC, as described by its comments:
>>>> +/* This pass performs loop interchange: for example, the loop nest
>>>> +
>>>> +   for (int j = 0; j < N; j++)
>>>> +     for (int k = 0; k < N; k++)
>>>> +       for (int i = 0; i < N; i++)
>>>> +        c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>>> +
>>>> +   is transformed to
>>>> +
>>>> +   for (int i = 0; i < N; i++)
>>>> +     for (int j = 0; j < N; j++)
>>>> +       for (int k = 0; k < N; k++)
>>>> +        c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>>> +
>>>> +   This pass implements loop interchange in the following steps:
>>>> +
>>>> +     1) Find perfect loop nest for each innermost loop and compute data
>>>> +       dependence relations for it.  For above example, loop nest is
>>>> +       <loop_j, loop_k, loop_i>.
>>>> +     2) From innermost to outermost loop, this pass tries to interchange
>>>> +       each loop pair.  For above case, it firstly tries to interchange
>>>> +       <loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
>>>> +       Then it tries to interchange <loop_j, loop_i> and loop nest becomes
>>>> +       <loop_i, loop_j, loop_k>.  The overall effect is to move innermost
>>>> +       loop to the outermost position.  For loop pair <loop_i, loop_j>
>>>> +       to be interchanged, we:
>>>> +     3) Check if data dependence relations are valid for loop interchange.
>>>> +     4) Check if both loops can be interchanged in terms of transformation.
>>>> +     5) Check if interchanging the two loops is profitable.
>>>> +     6) Interchange the two loops by mapping induction variables.
>>>> +
>>>> +   This pass also handles reductions in loop nest.  So far we only support
>>>> +   simple reduction of inner loop and double reduction of the loop nest.  */
>>>>
>>>> Actually, this pass only does loop shift which outermosting inner loop to outer, rather
>>>> than permutation.  Also as a traditional loop optimizer, it only works for perfect loop
>>>> nest.  I put it just after loop distribution thus ideally loop split/distribution could
>>>> create perfect nest for it.  Unfortunately, we don't get any perfect nest from distribution
>>>> for now because it only works for innermost loop.  For example, the motivation case in
>>>> spec2k6/bwaves is not handled on this pass alone.  I have a patch extending distribution
>>>> for (innermost) loop nest and with that patch bwaves case can be handled.
>>>> Another point is I deliberately make both the cost model and code transformation (very)
>>>> conservative.  We can support more cases, or more transformations with great care when
>>>> it is for sure known beneficial.  IMHO, we already hit over-baked issues quite often and
>>>> don't want to introduce more.
>>>> As for code generation, this patch has an issue that invariant code in outer loop could
>>>> be moved to inner loop.  For the moment, we rely on the last lim pass to handle all INV
>>>> generated during interchange.  In the future, we may need to avoid that in interchange
>>>> itself, or another lim pass just like the one after graphite optimizations.
>>>>
>>>> Boostrap and test on x86_64 and AArch64.  Various benchmarks built and run successfully.
>>>> Note this pass is disabled in patch, while the code is exercised by bootstrap/building
>>>> programs with it enabled by default.  Any comments?
>>>
>> Thanks for quick review.
>>> +/* The same as above, but this one is only used for interchanging not
>>> +   innermost loops.  */
>>> +#define OUTER_STRIDE_RATIO     (2)
>>>
>>> please make all these knobs --params.
>>>
>>> +/* Enum type for loop reduction variable.  */
>>> +enum reduction_type
>>> +{
>>> +  UNKNOWN_RTYPE = 0,
>>> +  SIMPLE_RTYPE,
>>> +  DOUBLE_RTYPE
>>> +};
>>>
>>> seeing this we should have some generic data structure / analysis for
>>> reduction detection.  This adds a third user (autopar and vectorizer
>>> are the others).  Just an idea.
>>>
>>> +/* Return true if E is abnormal edge.  */
>>> +
>>> +static inline bool
>>> +abnormal_edge (edge e)
>>> +{
>>> +  return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
>>> +}
>>>
>>> bad name/comment for what it does.
>>>
>>> ... jumping to end of file / start of pass
>>>
>>> +      /* Get the outer loop.  */
>>> +      loop = superloop_at_depth (loop, loop_depth (loop) - 1);
>>>
>>> loop_outer (loop)?
>>>
>>> +      /* Only support rectangle loop nest, i.e, inner loop's niters doesn't
>>> +        depends on outer loop's IV.  */
>>> +      if (chrec_contains_symbols_defined_in_loop (niters, loop->num))
>>> +       break;
>>>
>>> but you don't check for a three-nest whether niters depends on outer outer
>>> loop's IV that way.  Either the check is superfluous here or incomplete.
>> It is checked for multi-nest case in can_interchange_loops.  I will
>> move the check to this function so that we can save compilation time.
>>>
>>> +  /* Check if start_loop forms a perfect loop nest.  */
>>> +  loop_nest->create (3);
>>> +  do {
>>> +    datarefs->create (20);
>>> +    ddrs->create (20);
>>> +    loop_nest->truncate (0);
>>> +    if (compute_data_dependences_for_loop (start_loop, false, loop_nest,
>>> +                                          datarefs, ddrs))
>>> +      {
>>> +       unsigned i;
>>> +       struct data_dependence_relation *ddr;
>>> +
>>> +       for (i = 0; ddrs->iterate (i, &ddr); ++i)
>>> +         {
>>> +           if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>>> +             continue;
>>> +           /* Give up on any unknown dependence.  */
>>> +           if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know
>>> +               || DDR_NUM_DIR_VECTS (ddr) == 0)
>>> +             break;
>>> +         }
>>> +
>>> +       if (i == ddrs->length ())
>>> +         return true;
>>>
>>> better open-code this so we dont' waste time computing all dependences
>>> when we give up in the majority of cases (unknown dependence).
>> Right.  Will make the change.
>>> Memleak here (ddrs and datarefs).
>>>
>>> +    start_loop = start_loop->inner;
>>> +  } while (start_loop && start_loop->inner);
>>>
>>> ick.  So this is cubic -- nest depth * #drs * #drs ... (exactly why I
>>> never committed loop distribution for nests ;)).
>> Hmm, loop distribution for (innermost) nest is necessary for this pass
>> to handle bwaves unfortunately.  It is also necessary to distribute
>> for (i;)
>>   for (j)
>>     arr[i * len + j] = 0;
>> into a single memcall, rather than a loop of sub-memcall.
>
> I know...  bwaves is what I implemented the change for to enable interchange.
> I think right now store-motion makes the nest difficult to handle as it
> has a reduction (but I see you handle reductions in interchange -- just this
> one cannot be handled I think?).
>
> Richard.
>
>>>
>>> I see that should_interchange_loops only uses datarefs.  This means
>>> I'd rather do that as a very first step before considering validity
>>> (and computing dependences).  That analysis (for all possible
>>> interchanges) should be much cheaper?  I see it probably doesn't
>> Yes it makes the code less straightforward.  I think we can
>> additionally call should_interchange_loops before computing
>> dependencies.  This will bypass for most cases, for example, ~3500
>> loops can be interchanged without cost model, while only  ~200 loops
>> actually pass the cost model.
>>
>>> fit with the iteration you do very well...  can't we somehow compute
>>> a loop permutation and apply it in a single step rather than
>>> piecewise with update_data_refs/deps?
>> Arbitrary loop permutation would be non-trivial.  If you meant
>> computing loop shit (outermosting here) and transforming the code once
>> for all, I think it's doable.  Cost model check can be easily extended
>> in that way, though transformation part needs quite lot rework.  One
>> of my concern is, other than matrix-mul, I don't have motivation case
>> for multi-nest interchange so far.
>>>
>>> valid_data_dependences has almost no comments, it would be nice
>>> to add some (overall) one(s).
>> Sorry, I just realized it's wrong because direct vector is misused for
>> distance vector...  Surprised it didn't bite with thousands of
>> interchanged loops in spec...  Will fix it.
>>

Hi,
This is updated patch for loop interchange with review suggestions
resolved.  Changes are:
  1) It does more light weight checks like rectangle loop nest check
earlier than before.
  2) It checks profitability of interchange before data dependence computation.
  3) It calls find_data_references_in_loop only once for a loop nest now.
  4) Data dependence is open-computed so that we can skip instantly at
unknown dependence.
  5) It improves code generation in mapping induction variables for
loop nest, as well as
     adding a simple dead code elimination pass.
  6) It changes magic constants into parameters.

Bootstrap and test as previous.  Any comments?  I noticed many
GRAPHITE bugs are fixed now, I will be happy to discard this patch if
GRAPHITE is close to be stable/default in GCC.

Thanks,
bin
2017-09-21  Bin Cheng  <bin.cheng@arm.com>

* Makefile.in (tree-ssa-loop-interchange.o): New object file.
* common.opt (ftree-loop-interchange): New option.
* doc/invoke.texi (-ftree-loop-interchange): Document new option.
* params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter.
(PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
* passes.def (pass_linterchange): New pass.
* timevar.def (TV_LINTERCHANGE): New time var.
* tree-pass.h (make_pass_linterchange): New declaration.
* tree-ssa-loop-interchange.cc: New file.
* tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
Record IV before/after increment in new parameters.
* tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.

gcc/testsuite
2017-09-21  Bin Cheng  <bin.cheng@arm.com>

* gcc.dg/tree-ssa/loop-interchange-1.c: New test.
* gcc.dg/tree-ssa/loop-interchange-2.c: New test.
* gcc.dg/tree-ssa/loop-interchange-3.c: New test.
* gcc.dg/tree-ssa/loop-interchange-4.c: New test.
* gcc.dg/tree-ssa/loop-interchange-5.c: New test.
* gcc.dg/tree-ssa/loop-interchange-6.c: New test.
* gcc.dg/tree-ssa/loop-interchange-7.c: New test.
* gcc.dg/tree-ssa/loop-interchange-8.c: New test.
* gcc.dg/tree-ssa/loop-interchange-9.c: New test.
* gcc.dg/tree-ssa/loop-interchange-10.c: New test.

[-- Attachment #2: linterchange-20170912.txt --]
[-- Type: text/plain, Size: 80036 bytes --]

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 0bde7ac..5002598 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1522,6 +1522,7 @@ OBJS = \
 	tree-ssa-live.o \
 	tree-ssa-loop-ch.o \
 	tree-ssa-loop-im.o \
+	tree-ssa-loop-interchange.o \
 	tree-ssa-loop-ivcanon.o \
 	tree-ssa-loop-ivopts.o \
 	tree-ssa-loop-manip.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 1581ca8..5babe3f 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2530,6 +2530,10 @@ ftree-loop-distribute-patterns
 Common Report Var(flag_tree_loop_distribute_patterns) Optimization
 Enable loop distribution for patterns transformed into a library call.
 
+ftree-loop-interchange
+Common Report Var(flag_tree_loop_interchange) Optimization
+Enable loop interchange on trees.
+
 ftree-loop-im
 Common Report Var(flag_tree_loop_im) Init(1) Optimization
 Enable loop invariant motion on trees.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index e4cacf2..1825be8 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -8498,6 +8498,25 @@ ENDDO
 @end smallexample
 and the initialization loop is transformed into a call to memset zero.
 
+@item -ftree-loop-interchange
+@opindex ftree-loop-interchange
+Perform loop interchange outside of graphite.  This flag can improve cache
+performance on loop nest and allow further loop optimizations, like
+vectorization, to take place.  For example, the loop
+@smallexample
+for (int i = 0; i < N; i++)
+  for (int j = 0; j < N; j++)
+    for (int k = 0; k < N; k++)
+      c[i][j] = c[i][j] + a[i][k]*b[k][j];
+@end smallexample
+is transformed to
+@smallexample
+for (int i = 0; i < N; i++)
+  for (int k = 0; k < N; k++)
+    for (int j = 0; j < N; j++)
+      c[i][j] = c[i][j] + a[i][k]*b[k][j];
+@end smallexample
+
 @item -ftree-loop-im
 @opindex ftree-loop-im
 Perform loop invariant motion on trees.  This pass moves only invariants that
diff --git a/gcc/params.def b/gcc/params.def
index 805302b..c50a1fe 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -780,6 +780,20 @@ DEFPARAM (PARAM_L2_CACHE_SIZE,
 	  "The size of L2 cache.",
 	  512, 0, 0)
 
+/* Maximum number of statements in loop nest for loop interchange.  */
+
+DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS,
+	  "loop-interchange-max-num-stmts",
+	  "The maximum number of stmts in loop nest for loop interchange.",
+	  64, 0, 0)
+
+/* Minimum stride ratio for loop interchange to be profitiable.  */
+
+DEFPARAM (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO,
+	  "loop-interchange-stride-ratio",
+	  "The minimum stride ratio for loop interchange to be profitable",
+	  2, 0, 0)
+
 /* Whether we should use canonical types rather than deep "structural"
    type checking.  Setting this value to 1 (the default) improves
    compilation performance in the C++ and Objective-C++ front end;
diff --git a/gcc/passes.def b/gcc/passes.def
index 00e75d2..2e38c6f 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -278,6 +278,7 @@ along with GCC; see the file COPYING3.  If not see
 	  NEXT_PASS (pass_cd_dce);
 	  NEXT_PASS (pass_iv_canon);
 	  NEXT_PASS (pass_loop_distribution);
+	  NEXT_PASS (pass_linterchange);
 	  NEXT_PASS (pass_copy_prop);
 	  NEXT_PASS (pass_graphite);
 	  PUSH_INSERT_PASSES_WITHIN (pass_graphite)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-1.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-1.c
new file mode 100644
index 0000000..f2392e3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-1.c
@@ -0,0 +1,50 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+/* Copied from graphite/interchange-4.c */
+
+#define DEBUG 0
+#if DEBUG
+#include <stdio.h>
+#endif
+
+double u[1782225];
+
+static int __attribute__((noinline))
+foo (int N, int *res)
+{
+  int i, j;
+  double sum = 0;
+  for (i = 0; i < N; i++)
+    for (j = 0; j < N; j++)
+      sum = sum + u[i + 1335 * j];
+
+  for (i = 0; i < N; i++)
+    u[1336 * i] *= 2;
+
+  *res = sum + N + u[1336 * 2] + u[1336];
+}
+
+extern void abort ();
+
+int
+main (void)
+{
+  int i, j, res;
+
+  for (i = 0; i < 1782225; i++)
+    u[i] = 2;
+
+  foo (1335, &res);
+
+#if DEBUG
+  fprintf (stderr, "res = %d \n", res);
+#endif
+
+  if (res != 3565793)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-10.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-10.c
new file mode 100644
index 0000000..610610b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-10.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M];
+int __attribute__((noinline))
+double_reduc (int n)
+{
+  int sum = 0;
+  for (int j = 0; j < n; j++)
+    {
+      for (int i = 0; i < n; i++)
+	sum = sum + a[i][j]*b[i][j];
+    }
+  return sum;
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+    }
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  int sum = double_reduc (M);
+
+  if (sum != 1065369600)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-2.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-2.c
new file mode 100644
index 0000000..8341a22
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-2.c
@@ -0,0 +1,58 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+/* Copied from graphite/interchange-5.c */
+
+#define DEBUG 0
+#if DEBUG
+#include <stdio.h>
+#endif
+
+#define N 100
+#define M 1111
+int A[N][M];
+
+static int __attribute__((noinline))
+foo (void)
+{
+  int i, j;
+
+  for( i = 0; i < M; i++)
+    for( j = 0; j < N; j++)
+      A[j][i] = 5 * A[j][i];
+
+  return A[0][0] + A[N-1][M-1];
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  int j;
+
+  for (j = 0; j < M; j++)
+    A[i][j] = 2;
+}
+
+int
+main (void)
+{
+  int i, j, res;
+
+  for (i = 0; i < N; i++)
+    init (i);
+
+  res = foo ();
+
+#if DEBUG
+  fprintf (stderr, "res = %d \n", res);
+#endif
+
+  if (res != 20)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-3.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-3.c
new file mode 100644
index 0000000..ca2a114
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-3.c
@@ -0,0 +1,59 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+/* Copied from graphite/interchange-6.c */
+
+#define DEBUG 0
+#if DEBUG
+#include <stdio.h>
+#endif
+
+#define N 100
+#define M 200
+
+static int __attribute__((noinline))
+foo (int A[N][M])
+{
+  int i, j;
+
+  /* This loop should be interchanged. */
+  for(j = 0; j < M; j++)
+    for(i = 0; i < N; i++)
+      A[i][j] = A[i][j] + A[i][j];
+
+  return A[0][0] + A[N-1][M-1];
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int *arr, int i)
+{
+  int j;
+
+  for (j = 0; j < M; j++)
+    arr[j] = 2;
+}
+
+int
+main (void)
+{
+  int A[N][M];
+  int i, j, res;
+
+  for (i = 0; i < N; i++)
+    init (A[i], i);
+
+  res = foo (A);
+
+#if DEBUG
+  fprintf (stderr, "res = %d \n", res);
+#endif
+
+  if (res != 8)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-4.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-4.c
new file mode 100644
index 0000000..ff820f3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-4.c
@@ -0,0 +1,50 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+/* Copied from graphite/interchange-7.c */
+
+#define DEBUG 0
+#if DEBUG
+#include <stdio.h>
+#endif
+
+#define N 111
+#define M 1111
+
+static int __attribute__((noinline))
+foo (double *a)
+{
+  int i,j;
+  int r = 0;
+
+  for (i = 0; i < N; ++i)
+    for (j = 0; j < M; ++j)
+      r += a[j * N + i];
+
+  return r;
+}
+
+extern void abort ();
+
+int
+main (void)
+{
+  double A[N*M];
+  int i, res;
+
+  for (i = 0; i < N*M; i++)
+    A[i] = 2;
+
+  res = foo (A);
+
+#if DEBUG
+  fprintf (stderr, "res = %d \n", res);
+#endif
+
+  if (res != 246642)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-5.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-5.c
new file mode 100644
index 0000000..706da88
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-5.c
@@ -0,0 +1,71 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M][M], d[M][M];
+void __attribute__((noinline))
+matrix_mul_1 (int n)
+{
+  for (int i = 0; i < n; i++)
+    for (int j = 0; j < n; j++)
+      for (int k = 0; k < n; k++)
+        c[i][j] = c[i][j] + a[i][k]*b[k][j];
+}
+
+void __attribute__((noinline))
+matrix_mul_2 (int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      for (int j = 0; j < n; j++)
+	{
+	  for (int k = 0; k < n; k++)
+	    d[i][j] = d[i][j] + a[i][k]*b[k][j];
+
+	  asm volatile ("" ::: "memory");
+	}
+      asm volatile ("" ::: "memory");
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+      c[i][j] = 0;
+      d[i][j] = 0;
+    }
+}
+
+static int __attribute__((noinline))
+check (int i)
+{
+  for (int j = 0; j < M; j++)
+    if (c[i][j] != d[i][j])
+      return 0;
+
+  return 1;
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  matrix_mul_1 (M);
+  matrix_mul_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (!check (i))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange" } } */
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is not interchanged" 1 "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-6.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-6.c
new file mode 100644
index 0000000..97555ed
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-6.c
@@ -0,0 +1,70 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M][M], d[M][M];
+void __attribute__((noinline))
+matrix_mul_1 (int n)
+{
+    for (int j = 0; j < n; j++)
+      for (int k = 0; k < n; k++)
+  for (int i = 0; i < n; i++)
+        c[i][j] = c[i][j] + a[i][k]*b[k][j];
+}
+
+void __attribute__((noinline))
+matrix_mul_2 (int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      for (int j = 0; j < n; j++)
+	{
+	  for (int k = 0; k < n; k++)
+	    d[i][j] = d[i][j] + a[i][k]*b[k][j];
+
+	  asm volatile ("" ::: "memory");
+	}
+      asm volatile ("" ::: "memory");
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+      c[i][j] = 0;
+      d[i][j] = 0;
+    }
+}
+
+static int __attribute__((noinline))
+check (int i)
+{
+  for (int j = 0; j < M; j++)
+    if (c[i][j] != d[i][j])
+      return 0;
+
+  return 1;
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  matrix_mul_1 (M);
+  matrix_mul_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (!check (i))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 2 "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-7.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-7.c
new file mode 100644
index 0000000..b93ca78
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-7.c
@@ -0,0 +1,70 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M][M], d[M][M];
+void __attribute__((noinline))
+matrix_mul_1 (int n)
+{
+      for (int k = 0; k < n; k++)
+    for (int j = 0; j < n; j++)
+  for (int i = 0; i < n; i++)
+        c[i][j] = c[i][j] + a[i][k]*b[k][j];
+}
+
+void __attribute__((noinline))
+matrix_mul_2 (int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      for (int j = 0; j < n; j++)
+	{
+	  for (int k = 0; k < n; k++)
+	    d[i][j] = d[i][j] + a[i][k]*b[k][j];
+
+	  asm volatile ("" ::: "memory");
+	}
+      asm volatile ("" ::: "memory");
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+      c[i][j] = 0;
+      d[i][j] = 0;
+    }
+}
+
+static int __attribute__((noinline))
+check (int i)
+{
+  for (int j = 0; j < M; j++)
+    if (c[i][j] != d[i][j])
+      return 0;
+
+  return 1;
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  matrix_mul_1 (M);
+  matrix_mul_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (!check (i))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 2 "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-8.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-8.c
new file mode 100644
index 0000000..29f5917
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-8.c
@@ -0,0 +1,70 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M][M], d[M][M];
+void __attribute__((noinline))
+matrix_mul_1 (int n)
+{
+  for (int i = 0; i < n; i++)
+      for (int k = 0; k < n; k++)
+    for (int j = 0; j < n; j++)
+        c[i][j] = c[i][j] + a[i][k]*b[k][j];
+}
+
+void __attribute__((noinline))
+matrix_mul_2 (int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      for (int j = 0; j < n; j++)
+	{
+	  for (int k = 0; k < n; k++)
+	    d[i][j] = d[i][j] + a[i][k]*b[k][j];
+
+	  asm volatile ("" ::: "memory");
+	}
+      asm volatile ("" ::: "memory");
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+      c[i][j] = 0;
+      d[i][j] = 0;
+    }
+}
+
+static int __attribute__((noinline))
+check (int i)
+{
+  for (int j = 0; j < M; j++)
+    if (c[i][j] != d[i][j])
+      return 0;
+
+  return 1;
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  matrix_mul_1 (M);
+  matrix_mul_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (!check (i))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-not "Loop_pair<outer:., inner:.> is interchanged" "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-9.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-9.c
new file mode 100644
index 0000000..d6a3f5c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-9.c
@@ -0,0 +1,62 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M], d[M];
+void __attribute__((noinline))
+simple_reduc_1 (int n)
+{
+  for (int j = 0; j < n; j++)
+    {
+      int sum = c[j];
+      for (int i = 0; i < n; i++)
+	sum = sum + a[i][j]*b[i][j];
+
+      c[j] = sum;
+    }
+}
+
+void __attribute__((noinline))
+simple_reduc_2 (int n)
+{
+  for (int j = 0; j < n; j++)
+    {
+      int sum = d[j];
+      for (int i = 0; i < n; i++)
+	sum = sum + a[i][j]*b[i][j];
+
+      asm volatile ("" ::: "memory");
+      d[j] = sum;
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  c[i] = 0;
+  d[i] = 0;
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+    }
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  simple_reduc_1 (M);
+  simple_reduc_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (c[i] != d[i])
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange" } } */
diff --git a/gcc/timevar.def b/gcc/timevar.def
index 8cec6af..730a1dc 100644
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -184,6 +184,7 @@ DEFTIMEVAR (TV_TREE_LOOP	     , "tree loop optimization")
 DEFTIMEVAR (TV_TREE_NOLOOP           , "loopless fn")
 DEFTIMEVAR (TV_TREE_LOOP_BOUNDS	     , "tree loop bounds")
 DEFTIMEVAR (TV_LIM                   , "tree loop invariant motion")
+DEFTIMEVAR (TV_LINTERCHANGE          , "tree loop interchange")
 DEFTIMEVAR (TV_TREE_LOOP_IVCANON     , "tree canonical iv")
 DEFTIMEVAR (TV_SCEV_CONST            , "scev constant prop")
 DEFTIMEVAR (TV_TREE_LOOP_UNSWITCH    , "tree loop unswitching")
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 9f76d82..fe47736 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -367,6 +367,7 @@ extern gimple_opt_pass *make_pass_tree_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_no_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_init (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_lim (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_linterchange (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_unswitch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_split (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_predcom (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop-interchange.cc b/gcc/tree-ssa-loop-interchange.cc
new file mode 100644
index 0000000..c835d02
--- /dev/null
+++ b/gcc/tree-ssa-loop-interchange.cc
@@ -0,0 +1,2015 @@
+/* Loop invariant motion.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "tree.h"
+#include "gimple.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "gimple-pretty-print.h"
+#include "fold-const.h"
+#include "gimplify.h"
+#include "gimple-iterator.h"
+#include "cfgloop.h"
+#include "params.h"
+#include "tree-scalar-evolution.h"
+#include "tree-ssa-loop-manip.h"
+#include "tree-ssa-loop-niter.h"
+#include "tree-ssa-loop-ivopts.h"
+#include "tree-data-ref.h"
+
+/* This pass performs loop interchange: for example, the loop nest
+
+   for (int j = 0; j < N; j++)
+     for (int k = 0; k < N; k++)
+       for (int i = 0; i < N; i++)
+	 c[i][j] = c[i][j] + a[i][k]*b[k][j];
+
+   is transformed to
+
+   for (int i = 0; i < N; i++)
+     for (int j = 0; j < N; j++)
+       for (int k = 0; k < N; k++)
+	 c[i][j] = c[i][j] + a[i][k]*b[k][j];
+
+   This pass implements loop interchange in the following steps:
+
+     1) Find perfect loop nest for each innermost loop and compute data
+	dependence relations for it.  For above example, loop nest is
+	<loop_j, loop_k, loop_i>.
+     2) From innermost to outermost loop, this pass tries to interchange
+	each loop pair.  For above case, it firstly tries to interchange
+	<loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
+	Then it tries to interchange <loop_j, loop_i> and loop nest becomes
+	<loop_i, loop_j, loop_k>.  The overall effect is to move innermost
+	loop to the outermost position.  For loop pair <loop_i, loop_j>
+	to be interchanged, we:
+     3) Check if data dependence relations are valid for loop interchange.
+     4) Check if both loops can be interchanged in terms of transformation.
+     5) Check if interchanging the two loops is profitable.
+     6) Interchange the two loops by mapping induction variables.
+
+   This pass also handles reductions in loop nest.  So far we only support
+   simple reduction of inner loop and double reduction of the loop nest.  */
+
+/* Maximum number of stmts in each loop that should be interchanged.  */
+#define MAX_NUM_STMT    (PARAM_VALUE (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS))
+
+/* Default size for each array dimension.  */
+#define AVG_DIM_SIZE    (PARAM_VALUE (PARAM_AVG_LOOP_NITER))
+
+/* Comparison ratio of access stride between inner/outer loops to be
+   interchanged.  This is the minimum stride ratio for loop interchange
+   to be profitable.  */
+#define OUTER_STRIDE_RATIO  (PARAM_VALUE (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO))
+/* The same as above, but we require higher ratio for interchanging the
+   innermost two loops.  */
+#define INNER_STRIDE_RATIO  ((OUTER_STRIDE_RATIO) + 1)
+
+/* Structure recording loop induction variable.  */
+typedef struct induction
+{
+  /* IV itself.  */
+  tree var;
+  /* IV's base and step part of SCEV.  */
+  tree base;
+  tree step;
+  /* Mapped IV variabled used for interchanging loops.  */
+  tree mapped_var;
+}*induction_p;
+
+/* Enum type for loop reduction variable.  */
+enum reduction_type
+{
+  UNKNOWN_RTYPE = 0,
+  SIMPLE_RTYPE,
+  DOUBLE_RTYPE
+};
+
+/* Structure recording loop reduction variable.  */
+typedef struct reduction
+{
+  /* Reduction itself.  */
+  tree var;
+  /* PHI node defining reduction variable.  */
+  gphi *phi;
+  /* Init and next variables of the reduction.  */
+  tree init;
+  tree next;
+  /* Lcssa PHI node if reduction is used outside of its definition loop.  */
+  gphi *lcssa_phi;
+  /* Single use of reduction variable.  This is generally but not necessarily
+     the stmt defining next variable of reduction.  */
+  gimple *single_use;
+  /* Stmts defining init and next.  */
+  gimple *producer;
+  gimple *consumer;
+  /* If init is loaded from memory, this is the loading memory reference.  */
+  tree init_ref;
+  /* If reduction is finally stored to memory, this is the stored memory
+     reference.  */
+  tree fini_ref;
+  enum reduction_type type;
+}*reduction_p;
+
+
+/* Dump reduction RE.  */
+
+static void
+dump_reduction (reduction_p re)
+{
+  if (re->type == SIMPLE_RTYPE)
+    fprintf (dump_file, "  Simple reduction:  ");
+  else if (re->type == DOUBLE_RTYPE)
+    fprintf (dump_file, "  Double reduction:  ");
+  else
+    fprintf (dump_file, "  Unknown reduction:  ");
+
+  print_gimple_stmt (dump_file, re->phi, 0);
+}
+
+/* Dump LOOP's induction IV.  */
+static void
+dump_induction (struct loop *loop, induction_p iv)
+{
+  fprintf (dump_file, "  Induction:  ");
+  print_generic_expr (dump_file, iv->var, TDF_SLIM);
+  fprintf (dump_file, " = {");
+  print_generic_expr (dump_file, iv->base, TDF_SLIM);
+  fprintf (dump_file, ", ");
+  print_generic_expr (dump_file, iv->step, TDF_SLIM);
+  fprintf (dump_file, "}_%d\n", loop->num);
+}
+
+/* Loop candidate for interchange.  */
+
+class loop_cand
+{
+public:
+  loop_cand (struct loop *, struct loop *);
+  ~loop_cand ();
+
+  reduction_p find_reduction_by_init (tree);
+  reduction_p find_reduction_by_stmt (gimple *);
+  void classify_simple_reduction (reduction_p);
+  bool analyze_iloop_reduction_var (tree);
+  bool analyze_oloop_reduction_var (loop_cand *, tree);
+  bool analyze_reduction_var (loop_cand *, tree);
+  bool analyze_induction_var (tree, tree);
+  bool analyze_carried_vars (loop_cand *);
+  bool analyze_lcssa_phis (void);
+  bool can_interchange_p (loop_cand *);
+  bool unsupported_operation (basic_block, loop_cand *);
+  void undo_simple_reduction (reduction_p);
+  void eliminate_dead_code (void);
+
+  friend class tree_loop_interchange;
+private:
+  /* The loop itself.  */
+  struct loop *loop;
+  /* The outer loop of loop nest for interchange.  */
+  struct loop *nest;
+  /* Vector of induction variables in loop.  */
+  vec<induction_p> inductions;
+  /* Vector of reduction variables in loop.  */
+  vec<reduction_p> reductions;
+  /* Lcssa PHI nodes of this loop.  */
+  vec<gphi *> lcssa_nodes;
+  /* # of iterations of this loop.  */
+  tree niters;
+  /* Single exit edge of this loop.  */
+  edge exit;
+  /* Basic blocks of this loop.  */
+  basic_block *bbs;
+  /* # of stmts of this loop.  */
+  int num_stmts;
+};
+
+/* Constructor.  */
+
+loop_cand::loop_cand (struct loop *loop, struct loop *nest)
+  : loop (loop), nest (nest),
+    niters (unshare_expr (number_of_latch_executions (loop))),
+    exit (single_exit (loop)), bbs (get_loop_body (loop)), num_stmts (0)
+{
+    inductions.create (3);
+    reductions.create (3);
+    lcssa_nodes.create (3);
+}
+
+/* Destructor.  */
+
+loop_cand::~loop_cand ()
+{
+  induction_p iv;
+  for (unsigned i = 0; inductions.iterate (i, &iv); ++i)
+    free (iv);
+
+  reduction_p re;
+  for (unsigned i = 0; reductions.iterate (i, &re); ++i)
+    free (iv);
+
+  inductions.release ();
+  reductions.release ();
+  lcssa_nodes.release ();
+  free (bbs);
+}
+
+/* Return single use stmt of VAR in LOOP, otherwise return NULL.  */
+
+static gimple *
+single_use_in_loop (tree var, struct loop *loop)
+{
+  gimple *stmt, *res = NULL;
+  use_operand_p use_p;
+  imm_use_iterator iterator;
+
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      basic_block bb = gimple_bb (stmt);
+      gcc_assert (bb != NULL);
+      if (!flow_bb_inside_loop_p (loop, bb))
+	continue;
+
+      if (res)
+	return NULL;
+
+      res = stmt;
+    }
+  return res;
+}
+
+/* Return true if E is unsupported in loop interchange, i.e, E is a complex
+   edge or part of irreducible loop.  */
+
+static inline bool
+unsupported_edge (edge e)
+{
+  return (e->flags & (EDGE_COMPLEX | EDGE_IRREDUCIBLE_LOOP));
+}
+
+/* Return true if PHI is unsupported in loop interchange, i.e, PHI contains
+   ssa var appearing in any abnormal phi node.  */
+
+static inline bool
+unsupported_phi_node (gphi *phi)
+{
+  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi)))
+    return true;
+
+  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
+    {
+      tree arg = PHI_ARG_DEF (phi, i);
+      if (TREE_CODE (arg) == SSA_NAME
+	  && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg))
+	return true;
+    }
+
+  return false;
+}
+
+/* Return reduction whose init variable is VAR, otherwise return NULL.  */
+
+reduction_p
+loop_cand::find_reduction_by_init (tree var)
+{
+  reduction_p re;
+
+  for (unsigned i = 0; reductions.iterate (i, &re); ++i)
+    if (re->init == var || operand_equal_p (re->init, var, 0))
+      return re;
+
+  return NULL;
+}
+
+/* Return the reduction if STMT is one of its lcssa PHI, producer or consumer
+   stmt.  */
+
+reduction_p
+loop_cand::find_reduction_by_stmt (gimple *stmt)
+{
+  gphi *phi = NULL;
+  reduction_p re;
+
+  if (is_a <gphi *> (stmt))
+    phi = as_a <gphi *> (stmt);
+
+  for (unsigned i = 0; reductions.iterate (i, &re); ++i)
+    if ((phi != NULL && phi == re->lcssa_phi)
+	|| (stmt == re->producer || stmt == re->consumer))
+      return re;;
+
+  return NULL;
+}
+
+/* Return true if all stmts in BB can be supported by loop interchange,
+   otherwise return false.  ILOOP is not NULL if this loop_cand is the
+   outer loop in loop nest.  */
+
+bool
+loop_cand::unsupported_operation (basic_block bb, loop_cand *iloop)
+{
+  int bb_num_stmts = 0;
+  gphi_iterator psi;
+  gimple_stmt_iterator gsi;
+
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple *stmt = gsi_stmt (gsi);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      if (gimple_has_volatile_ops (stmt)
+	  || gimple_has_side_effects (stmt))
+	return false;
+
+      bb_num_stmts++;
+      if (is_gimple_call (stmt))
+	{
+	  int cflags = gimple_call_flags (stmt);
+	  /* Only support const/pure calls.  */
+	  if (!(cflags & (ECF_CONST | ECF_PURE)))
+	    return false;
+
+	  /* In basic block of outer loop, the call should be cheap since
+	     it will be moved to inner loop.  */
+	  if (iloop != NULL
+	      && !gimple_inexpensive_call_p (as_a <gcall *> (stmt)))
+	    return false;
+
+	  continue;
+	}
+
+      if (!iloop || !gimple_vuse (stmt))
+	continue;
+
+      /* Support stmt accessing memory in outer loop only if it is for inner
+	 loop's reduction.  */
+      if (iloop->find_reduction_by_stmt (stmt))
+	continue;
+
+      tree lhs;
+      /* Or it's invariant memory reference and only used by inner loop.  */
+      if (gimple_assign_single_p (stmt)
+	  && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
+	  && TREE_CODE (lhs) == SSA_NAME
+	  && single_use_in_loop (lhs, iloop->loop))
+	continue;
+
+      return false;
+    }
+  num_stmts += bb_num_stmts;
+
+  /* Allow PHI nodes in any basic block of inner loop, or PHI nodes in
+     (outer) loop's header.  */
+  if (!iloop || bb == loop->header)
+    return true;
+
+  for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
+    {
+      gphi *phi = psi.phi ();
+
+      if (unsupported_phi_node (phi))
+	return false;
+
+      if (virtual_operand_p (PHI_RESULT (phi)))
+	continue;
+
+      /* For outer loop, we only support PHI in loop header and lcssa PHI
+	 of inner loop's reduction.  */
+      if (!iloop->find_reduction_by_stmt (phi))
+	return false;
+    }
+  return true;
+}
+
+/* Return true if current loop_cand be interchanged.  ILOOP is not NULL if
+   current loop_cand is outer loop in loop nest.  */
+
+bool
+loop_cand::can_interchange_p (loop_cand *iloop)
+{
+  /* For now we only support at most one reduction.  */
+  unsigned allowed_reduction_num = 1;
+
+  /* Only support reduction if the loop nest to be interchanged is the
+     innermostin two loops.  */
+  if ((iloop == NULL && loop->inner != NULL)
+       || (iloop != NULL && iloop->loop->inner != NULL))
+    allowed_reduction_num = 0;
+
+  if (reductions.length () > allowed_reduction_num
+      || (reductions.length () == 1
+	  && reductions[0]->type == UNKNOWN_RTYPE))
+    return false;
+
+  /* Only support lcssa PHI node which is for reduction.  */
+  if (lcssa_nodes.length () > allowed_reduction_num)
+    return false;
+
+  /* Check basic blocks other than loop header/exit.  */
+  for (unsigned i = 0; i < loop->num_nodes; i++)
+    {
+      basic_block bb = bbs[i];
+
+      /* Skip basic blocks of inner loops.  */
+      if (bb->loop_father != loop)
+	continue;
+
+      /* Check if basic block has any unsupported operation.  */
+      if (!unsupported_operation (bb, iloop))
+	return false;
+
+      /* Check if loop has too many stmts.  */
+      if (num_stmts > MAX_NUM_STMT)
+	return false;
+    }
+
+  return true;
+}
+
+/* Classify if reduction RE is a simple one.  */
+
+void
+loop_cand::classify_simple_reduction (reduction_p re)
+{
+  gimple *producer, *consumer;
+  enum tree_code code;
+  tree lhs, rhs;
+
+  /* Check init variable of reduction and how it is initialized.  */
+  if (TREE_CODE (re->init) == SSA_NAME)
+    {
+      producer = SSA_NAME_DEF_STMT (re->init);
+      re->producer = producer;
+      basic_block bb = gimple_bb (producer);
+      if (!bb || bb->loop_father != nest)
+	return;
+
+      if (!is_gimple_assign (producer))
+	return;
+
+      code = gimple_assign_rhs_code (producer);
+      if (get_gimple_rhs_class (code) != GIMPLE_SINGLE_RHS)
+	return;
+
+      if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE
+	  || lhs != re->init)
+	return;
+
+      if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE
+	  || !REFERENCE_CLASS_P (rhs))
+	return;
+
+      re->init_ref = rhs;
+    }
+  else if (!CONSTANT_CLASS_P (re->init))
+    return;
+
+  /* TODO: Don't support constant initializer yet.  */
+  if (re->init_ref == NULL_TREE)
+    return;
+
+  /* Check how reduction variable is used.  Note usually reduction variable
+     is used outside of its defining loop, we don't require that in terms
+     loop interchange.  */
+  if (re->lcssa_phi == NULL)
+    consumer = single_use_in_loop (re->next, loop);
+  else
+    consumer = single_use_in_loop (PHI_RESULT (re->lcssa_phi), nest);
+
+  if (consumer == NULL)
+    return;
+
+  re->consumer = consumer;
+
+  if (!is_gimple_assign (consumer))
+    return;
+
+  code = gimple_assign_rhs_code (consumer);
+  if (get_gimple_rhs_class (code) != GIMPLE_SINGLE_RHS)
+    return;
+
+  if ((rhs = gimple_assign_rhs1 (consumer)) == NULL_TREE
+      || rhs != PHI_RESULT (re->lcssa_phi))
+    return;
+
+  if ((lhs = gimple_assign_lhs (consumer)) == NULL_TREE
+      || !REFERENCE_CLASS_P (lhs))
+    return;
+
+  re->fini_ref = lhs;
+
+  /* Require memory references in producer and consumer are the same so
+     that we can undo reduction during interchange.  */
+  if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0))
+    return;
+
+  re->type = SIMPLE_RTYPE;
+}
+
+/* Analyze reduction variable VAR.  ILOOP is not NULL if current loop_cand
+   is outer loop in loop nest.  Return true if analysis succeeds.  */
+
+bool
+loop_cand::analyze_reduction_var (loop_cand *iloop, tree var)
+{
+  if (iloop != NULL)
+    return analyze_oloop_reduction_var (iloop, var);
+  else
+    return analyze_iloop_reduction_var (var);
+}
+
+/* Analyze reduction variable VAR for inner loop of the loop nest to be
+   interchanged.  Return true if analysis succeeds.  */
+
+bool
+loop_cand::analyze_iloop_reduction_var (tree var)
+{
+  gphi *phi = as_a <gphi *> (SSA_NAME_DEF_STMT (var));
+  gphi *lcssa_phi = NULL, *use_phi;
+  tree init = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
+  tree next = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
+  edge e = single_exit (loop);
+  reduction_p re;
+  gimple *stmt, *next_def, *single_use = NULL;
+  use_operand_p use_p;
+  imm_use_iterator iterator;
+  basic_block bb;
+
+  if (TREE_CODE (next) != SSA_NAME)
+    return false;
+
+  next_def = SSA_NAME_DEF_STMT (next);
+  bb = gimple_bb (next_def);
+  if (!bb || !flow_bb_inside_loop_p (loop, bb))
+    return false;
+
+  /* In restricted reduction, the var is (and must be) used in defining
+     the updated var.  The process can be depicted as below:
+
+		var ;; = PHI<init, next>
+		 |
+		 |
+		 v
+      +---------------------+
+      | reduction operators | <-- other operands
+      +---------------------+
+		 |
+		 |
+		 v
+		next
+
+     In terms loop interchange, we don't change how NEXT is computed based
+     on VAR and OTHER OPERANDS.  In case of double reduction in loop nest
+     to be interchanged, we don't changed it at all.  In the case of simple
+     reduction in inner loop, we only make change how VAR/NEXT is loaded or
+     stored.  With these conditions, we can relax restrictions on reduction
+     in a way that reduction operation is seen as black box.  In general,
+     we can ignore reassociation of reduction operator; we can handle fake
+     reductions in which VAR is not even used to compute NEXT.  */
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      bb = gimple_bb (stmt);
+      if (!bb || !flow_bb_inside_loop_p (loop, bb))
+	return false;
+
+      if (single_use != NULL)
+	return false;
+
+      single_use = stmt;
+    }
+
+  if (single_use != next_def
+      && !stmt_dominates_stmt_p (single_use, next_def))
+    return false;
+
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, next)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      bb = gimple_bb (stmt);
+      if (!bb)
+	return false;
+
+      /* Or else it's used in PHI itself.  */
+      use_phi = NULL;
+      if (is_a <gphi *> (stmt)
+	  && (use_phi = as_a <gphi *> (stmt)) != NULL
+	  && use_phi == phi)
+	continue;
+
+      if (use_phi != NULL
+	  && lcssa_phi == NULL
+	  && bb == e->dest
+	  && PHI_ARG_DEF_FROM_EDGE (use_phi, e) == next)
+	lcssa_phi = use_phi;
+      else
+	return false;
+    }
+  re = XCNEW (struct reduction);
+  re->var = var;
+  re->init = init;
+  re->next = next;
+  re->phi = phi;
+  re->lcssa_phi = lcssa_phi;
+  re->single_use = single_use;
+
+  classify_simple_reduction (re);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    dump_reduction (re);
+
+  reductions.safe_push (re);
+  return true;
+}
+
+/* Analyze reduction variable VAR for outer loop of the loop nest to be
+   interchanged.  ILOOP is not NULL and points to inner loop.  For the
+   moment, we only support double reduction for outer loop, like:
+
+     for (int i = 0; i < n; i++)
+       {
+	 int sum = 0;
+
+	 for (int j = 0; j < n; j++)    // outer loop
+	   for (int k = 0; k < n; k++)  // inner loop
+	     sum += a[i][k]*b[k][j];
+
+	 s[i] = sum;
+       }
+
+   Note the innermost two loops are the loop nest to be interchanged.
+   Return true if analysis succeeds.  */
+
+bool
+loop_cand::analyze_oloop_reduction_var (loop_cand *iloop, tree var)
+{
+  gphi *phi = as_a <gphi *> (SSA_NAME_DEF_STMT (var));
+  gphi *lcssa_phi = NULL, *use_phi;
+  tree init = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
+  tree next = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
+  edge e = single_exit (loop);
+  reduction_p re;
+  gimple *stmt, *next_def;
+  use_operand_p use_p;
+  imm_use_iterator iterator;
+  basic_block bb;
+
+  if (TREE_CODE (next) != SSA_NAME)
+    return false;
+
+  next_def = SSA_NAME_DEF_STMT (next);
+  bb = gimple_bb (next_def);
+  if (!bb || !flow_bb_inside_loop_p (loop, bb))
+    return false;
+
+  /* Find inner loop's simple reduction that uses var as initializer.  */
+  reduction_p inner_re = iloop->find_reduction_by_init (var);
+  if (inner_re == NULL
+      || inner_re->type != UNKNOWN_RTYPE
+      || inner_re->producer != phi)
+    return false;
+
+  /* In case of double reduction, outer loop's reduction should be updated
+     by inner loop's simple reduction.  */
+  if (next_def != inner_re->lcssa_phi)
+    return false;
+
+  /* Outer loop's reduction should only be used to initialize inner loop's
+     simple reduction.  */
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      bb = gimple_bb (stmt);
+      if (!bb || !flow_bb_inside_loop_p (loop, bb))
+	return false;
+
+      if (! is_a <gphi *> (stmt)
+	  || (use_phi = as_a <gphi *> (stmt)) == NULL
+	  || use_phi != inner_re->phi)
+	return false;
+    }
+
+  /* Check this reduction is correctly used outside of loop via lcssa phi.  */
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, next)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      bb = gimple_bb (stmt);
+      if (!bb)
+	return false;
+
+      /* Or else it's used in PHI itself.  */
+      use_phi = NULL;
+      if (is_a <gphi *> (stmt)
+	  && (use_phi = as_a <gphi *> (stmt)) != NULL
+	  && use_phi == phi)
+	continue;
+
+      if (lcssa_phi == NULL
+	  && use_phi != NULL
+	  && bb == e->dest
+	  && PHI_ARG_DEF_FROM_EDGE (use_phi, e) == next)
+	lcssa_phi = use_phi;
+      else
+	return false;
+    }
+
+  re = XCNEW (struct reduction);
+  re->var = var;
+  re->init = init;
+  re->next = next;
+  re->phi = phi;
+  re->lcssa_phi = lcssa_phi;
+  re->type = DOUBLE_RTYPE;
+  inner_re->type = DOUBLE_RTYPE;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    dump_reduction (re);
+
+  reductions.safe_push (re);
+  return true;
+}
+
+/* Return true if VAR is induction variable of current loop whose scev is
+   specified by CHREC.  */
+
+bool
+loop_cand::analyze_induction_var (tree var, tree chrec)
+{
+  /* Var is loop invariant, though it's unlikely to happen.  */
+  if (tree_does_not_contain_chrecs (chrec))
+    {
+      struct induction *iv = XCNEW (struct induction);
+      iv->var = var;
+      iv->base = chrec;
+      iv->step = build_int_cst (TREE_TYPE (chrec), 0);
+      inductions.safe_push (iv);
+      return true;
+    }
+
+  if (TREE_CODE (chrec) != POLYNOMIAL_CHREC
+      || CHREC_VARIABLE (chrec) != (unsigned) loop->num
+      || tree_contains_chrecs (CHREC_LEFT (chrec), NULL)
+      || tree_contains_chrecs (CHREC_RIGHT (chrec), NULL))
+    return false;
+
+  struct induction *iv = XCNEW (struct induction);
+  iv->var = var;
+  iv->base = CHREC_LEFT (chrec);
+  iv->step = CHREC_RIGHT (chrec);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    dump_induction (loop, iv);
+
+  inductions.safe_push (iv);
+  return true;
+}
+
+/* Return true if all loop carried variables defined in loop header can
+   be successfully analyzed.  */
+
+bool
+loop_cand::analyze_carried_vars (loop_cand *iloop)
+{
+  basic_block before_loop = block_before_loop (nest);
+  gphi_iterator gsi;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "\nLoop(%d) carried vars:\n", loop->num);
+
+  for (gsi = gsi_start_phis (loop->header); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gphi *phi = gsi.phi ();
+
+      if (unsupported_phi_node (phi))
+	return false;
+
+      tree var = PHI_RESULT (phi);
+      if (virtual_operand_p (var))
+	continue;
+
+      tree chrec = analyze_scalar_evolution (loop, var);
+      chrec = instantiate_scev (before_loop, loop, chrec);
+
+      /* Analyze var as reduction variable.  */
+      if (chrec_contains_undetermined (chrec)
+	  || chrec_contains_symbols_defined_in_loop (chrec, nest->num))
+	{
+	  if (!analyze_reduction_var (iloop, var))
+	    return false;
+	}
+      /* Analyze var as induction variable.  */
+      else if (!analyze_induction_var (var, chrec))
+	return false;
+    }
+
+  return true;
+}
+
+/* Return TRUE if loop closed PHI nodes can be analyzed successfully.  */
+
+bool
+loop_cand::analyze_lcssa_phis (void)
+{
+  edge e = single_exit (loop);
+  gphi_iterator gsi;
+
+  for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gphi *phi = gsi.phi ();
+
+      if (unsupported_phi_node (phi))
+	return false;
+
+      if (virtual_operand_p (PHI_RESULT (phi)))
+	continue;
+
+      /* TODO: We only support lcssa phi for reduction for now.  */
+      if (!find_reduction_by_stmt (phi))
+	return false;
+    }
+
+  return true;
+}
+
+/* Given inner loop with simple reduction as below:
+
+   for (i = 0; i < N; i++)
+     for (j = 0; j < N; j++)
+       {
+	 int red = c[i][j];           // producer
+	 for (k = 0; k < N; k++)
+	   red += a[i][k] * b[k][j];
+
+	 c[i][j] = red;               // consumer
+       }
+
+   This function undo the reduction and generates below loop nest:
+
+   for (i = 0; i < N; i++)
+     for (j = 0; j < N; j++)
+       {
+	 for (k = 0; k < N; k++)
+	   c[i][j] += a[i][k] * b[k][j];
+       }
+
+   This basically reverts transformation done by LIM or PRE.  */
+
+void
+loop_cand::undo_simple_reduction (reduction_p re)
+{
+  gimple *phi = SSA_NAME_DEF_STMT (re->var);
+  gimple_stmt_iterator gsi, from;
+
+  /* Move producer stmt into inner loop, just before its use.  */
+  if (gimple_vuse (re->producer))
+    gimple_set_vuse (re->producer, NULL_TREE);
+  gimple_assign_set_lhs (re->producer, re->var);
+  from = gsi_for_stmt (re->producer);
+  gsi = gsi_for_stmt (re->single_use);
+  gsi_move_before (&from, &gsi);
+
+  /* Delete loop header PHI node of reduction.  */
+  gsi = gsi_for_stmt (phi);
+  gsi_remove (&gsi, true);
+
+  /* Move consumer stmt into inner loop, just after its def.  */
+  if (gimple_vdef (re->consumer))
+    gimple_set_vuse (re->consumer, NULL_TREE);
+  if (gimple_vuse (re->consumer))
+    gimple_set_vuse (re->consumer, NULL_TREE);
+  gimple_assign_set_rhs1 (re->consumer, re->next);
+  from = gsi_for_stmt (re->consumer);
+  gsi = gsi_for_stmt (SSA_NAME_DEF_STMT (re->next));
+  gsi_move_after (&from, &gsi);
+
+  /* Delete loop closed PHI node of reduction.  */
+  gsi = gsi_for_stmt (re->lcssa_phi);
+  gsi_remove (&gsi, true);
+}
+
+/* Eliminate dead code after loop interchange.  */
+
+void
+loop_cand::eliminate_dead_code (void)
+{
+  /* Check basic blocks other than loop header/exit.  */
+  for (unsigned i = 0; i < loop->num_nodes; i++)
+    {
+      basic_block bb = bbs[i];
+
+      /* Skip basic blocks of inner loops.  */
+      if (bb->loop_father != loop)
+	continue;
+
+      for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+	{
+	  tree lhs;
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  /* Given copy propagation is done during interchange, we can
+	     simply check zero uses of var and eliminate it.  */
+	  if (is_gimple_assign (stmt)
+	      && !gimple_vuse (stmt)
+	      && !gimple_has_volatile_ops (stmt)
+	      && !gimple_has_side_effects (stmt)
+	      && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
+	      && TREE_CODE (lhs) == SSA_NAME
+	      && has_zero_uses (lhs))
+	    gsi_remove (&gsi, true);
+	  else
+	    gsi_next (&gsi);
+	}
+    }
+}
+
+/* Class for loop interchange transformation.  */
+
+class tree_loop_interchange
+{
+public:
+  tree_loop_interchange (vec<loop_p> loop_nest,
+			 vec<data_reference_p> datarefs, vec<ddr_p> ddrs)
+    : loop_nest (loop_nest), datarefs (datarefs),
+      ddrs (ddrs), niters_iv_var (NULL_TREE), unshare_datarefs_p (true) { }
+  ~tree_loop_interchange () {
+    free_dependence_relations (ddrs);
+    free_data_refs (datarefs);
+    loop_nest.release ();
+  }
+  bool interchange ();
+
+private:
+  void update_data_refs (loop_cand &, loop_cand &);
+  void update_data_deps (unsigned, unsigned);
+  bool valid_data_dependences (unsigned, unsigned);
+  bool can_interchange_loops (loop_cand &, loop_cand &);
+  void interchange_loops (loop_cand &, loop_cand &);
+  void interchange_reductions (loop_cand &, loop_cand &);
+  void interchange_inductions (loop_cand &, loop_cand &);
+  void map_inductions_to_loop (loop_cand &, loop_cand &);
+  void move_code_to_inner_loop (struct loop *, struct loop *, basic_block *);
+
+  /* Vector of the loop nest.  */
+  vec<struct loop *> loop_nest;
+  /* Vector of data references in loop nest.  */
+  vec<data_reference_p> datarefs;
+  /* Vector of data dependence relations in loop nest.  */
+  vec<ddr_p> ddrs;
+
+  /* We create new IV which is only used in loop's exit condition check.
+     In case of 3-level loop nest interchange, when we interchange the
+     innermost two loops, new IV created in the middle level loop does
+     not need to be preserved in interchanging the outermost two loops
+     later.  We record the IV so that it can be skipped.  */
+  tree niters_iv_var;
+  /* Due to current implementation of data dependence analysis, access
+     functions are shared by all data dependence relations.  After loop
+     interchange, we need to update data reference/dependence according
+     to interchanged loops.  During updating, this flag will be checked
+     and DR_ACCESS_FNs will be unshared if it's true.  */
+  bool unshare_datarefs_p;
+};
+
+/* Given ILOOP and OLOOP representing two loops in loop nest to be
+   interchanged, this function decomposes access function ACCESS_FN.
+   This function does below things:
+
+     1) Look into ACCESS_FN and record the corresponding polynomial
+	chrec of ILOOP/OLOOP in ILOOP_CHREC/OLOOP_CHREC.
+     2) Record access STRIDE in ILOOP_STRIDE or OLOOP_STRIDE if they
+	are not NULL.
+     4) Record index of dimension DIM in iloop_dim or oloop_dim if
+	they are not NULL.
+
+   For example, given below ACCESS_FN:
+
+     {{base, step1}_oloop, step2}_iloop
+
+   This function decomposes it into:
+
+     ILOOP_CHREC: {{base, step1}_oloop, step2}_iloop
+     OLOOP_CHREC: {base, step1}_oloop
+
+   and sets other arguments properly.  */
+
+static void
+decompose_chrecs (unsigned iloop_num, unsigned oloop_num,
+		  tree access_fn, unsigned dim, tree stride,
+		  tree *iloop_chrec, tree *iloop_stride, unsigned *iloop_dim,
+		  tree *oloop_chrec, tree *oloop_stride, unsigned *oloop_dim)
+{
+  do {
+    unsigned var = CHREC_VARIABLE (access_fn);
+    if (var == iloop_num)
+      {
+	gcc_assert (*iloop_chrec == NULL_TREE);
+	*iloop_chrec = access_fn;
+	if (iloop_stride)
+	  *iloop_stride = stride;
+	if (iloop_dim)
+	  *iloop_dim = dim;
+      }
+    if (var == oloop_num)
+      {
+	gcc_assert (*oloop_chrec == NULL_TREE);
+	*oloop_chrec = access_fn;
+	if (oloop_stride)
+	  *oloop_stride = stride;
+	if (oloop_dim)
+	  *oloop_dim = dim;
+      }
+    access_fn = CHREC_LEFT (access_fn);
+  } while (TREE_CODE (access_fn) == POLYNOMIAL_CHREC
+	   && (*iloop_chrec == NULL_TREE || *oloop_chrec == NULL_TREE));
+}
+
+/* Update data refs to keep them valid after interchanging ILOOP/OLOOP.  */
+
+void
+tree_loop_interchange::update_data_refs (loop_cand &iloop, loop_cand &oloop)
+{
+  struct data_reference *dr;
+  for (unsigned i = 0; datarefs.iterate (i, &dr); ++i)
+    {
+      unsigned iloop_dim = DR_NUM_DIMENSIONS (dr);
+      unsigned oloop_dim = DR_NUM_DIMENSIONS (dr);
+      tree iloop_chrec = NULL_TREE, oloop_chrec = NULL_TREE;
+
+      for (unsigned j = 0; j < DR_NUM_DIMENSIONS (dr); ++j)
+	{
+	  if (TREE_CODE (DR_ACCESS_FN (dr, j)) != POLYNOMIAL_CHREC)
+	    continue;
+
+	  /* Unshare access functions for the first time.  */
+	  if (unshare_datarefs_p)
+	    DR_ACCESS_FN (dr, j) = unshare_expr (DR_ACCESS_FN (dr, j));
+
+	  /* Find the corresponding CHRECs for ILOOP and OLOOP.  */
+	  tree access_fn = DR_ACCESS_FN (dr, j);
+	  decompose_chrecs (iloop.loop->num, oloop.loop->num,
+			    access_fn, j, NULL,
+			    &iloop_chrec, NULL, &iloop_dim,
+			    &oloop_chrec, NULL, &oloop_dim);
+	}
+      if (iloop_chrec && oloop_chrec)
+	{
+	  if (iloop_dim != oloop_dim)
+	    {
+	      /* For data reference with independent CHRECs for both loops,
+		 swap the loop information.  For example,
+		   (data_ref ...
+		    {base1, step1}_iloop
+		    {base2, step2}_oloop)
+		 is transformed into:
+		   (data_ref ...
+		    {base1, step1}_oloop
+		    {base2, step2}_iloop).  */
+	      std::swap (CHREC_VAR (iloop_chrec), CHREC_VAR (oloop_chrec));
+	    }
+	  else
+	    {
+	      /* For data reference with multivariate CHREC for the two loops,
+		 swap step part of CHREC.  For example,
+		   (data_ref ...
+		    {{base1, step1}_oloop, step2}_iloop)
+		 is transformed into:
+		   (data_ref ...
+		    {{base1, step2}_oloop, step1}_iloop).  */
+	      std::swap (CHREC_RIGHT (iloop_chrec), CHREC_RIGHT (oloop_chrec));
+	    }
+	}
+      /* For data reference without CHREC for either one of ILOOP/OLOOP, set
+	 the loop information to the other loop.  This works because we only
+	 interchange consecutive loops in loop nest.  */
+      else if (iloop_chrec)
+	{
+	  tree type = TREE_TYPE (CHREC_VAR (iloop_chrec));
+	  CHREC_VAR (iloop_chrec) = build_int_cst (type, oloop.loop->num);
+	}
+      else if (oloop_chrec)
+	{
+	  tree type = TREE_TYPE (CHREC_VAR (oloop_chrec));
+	  CHREC_VAR (oloop_chrec) = build_int_cst (type, iloop.loop->num);
+	}
+    }
+  unshare_datarefs_p = false;
+}
+
+/* Update data dependence relations after interchanging loops.  INNER/OUTER
+   gives index of interchanged loops in loop nest, they are used to access
+   DIR_MATRIX.  */
+
+void
+tree_loop_interchange::update_data_deps (unsigned inner, unsigned outer)
+{
+  struct data_dependence_relation *ddr;
+
+  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
+    {
+      /* Skip no-dependence case.  */
+      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
+	continue;
+
+      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
+	{
+	  lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
+	  std::swap (dir_vect[inner], dir_vect[outer]);
+	}
+    }
+}
+
+/* Check data dependence relations, return TRUE if it's valid to interchange
+   two loops specified by INNER/OUTER.  Theoretically, interchanging the two
+   loops is valid only if direct vector, after interchanging, doesn't have
+   '>' as the leftmost non-'=' direction.  Practically, this function have
+   been conservative here by not checking some valid cases.  */
+
+bool
+tree_loop_interchange::valid_data_dependences (unsigned inner, unsigned outer)
+{
+  struct data_dependence_relation *ddr;
+
+  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
+    {
+      /* Skip no-dependence case.  */
+      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
+	continue;
+
+      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
+	{
+	  lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
+	  unsigned level = dependence_level (dist_vect, loop_nest.length ());
+
+	  /* If there is no carried dependence.  */
+	  if (level == 0)
+	    continue;
+
+	  level --;
+	  /* Skip case which has '>' as the leftmost direction.  */
+	  if (!lambda_vector_lexico_pos (dist_vect, level))
+	    return false;
+
+	  /* If dependence is carried by outer loop of the two loops for
+	     interchange.  */
+	  if (level < outer)
+	    continue;
+
+	  lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
+	  /* If directions at both inner/outer levels are the same.  */
+	  if (dir_vect[inner] == dir_vect[outer])
+	    continue;
+
+	  /* Be conservative, skip case if either direction at inner/outer
+	     levels is not '=' or '<'.  */
+	  if (dir_vect[inner] != dir_equal
+	      && dir_vect[inner] != dir_positive
+	      && dir_vect[inner] != dir_independent
+	      && dir_vect[inner] != dir_positive_or_equal)
+	    return false;
+
+	  if (dir_vect[outer] != dir_equal
+	      && dir_vect[outer] != dir_positive
+	      && dir_vect[outer] != dir_independent
+	      && dir_vect[outer] != dir_positive_or_equal)
+	    return false;
+	}
+    }
+
+  return true;
+}
+
+/* Return true if ILOOP and OLOOP can be interchanged in terms of code
+   transformation.  */
+
+bool
+tree_loop_interchange::can_interchange_loops (loop_cand &iloop,
+					      loop_cand &oloop)
+{
+  return (iloop.analyze_carried_vars (NULL)
+	  && iloop.analyze_lcssa_phis ()
+	  && oloop.analyze_carried_vars (&iloop)
+	  && oloop.analyze_lcssa_phis ()
+	  && iloop.can_interchange_p (NULL)
+	  && oloop.can_interchange_p (&iloop));
+}
+
+/* Compute and return overall access stride given CHREC and step STRIDE.  */
+
+static inline unsigned HOST_WIDE_INT
+access_stride (tree chrec, tree stride)
+{
+  unsigned HOST_WIDE_INT uhwi_stride, uhwi_step;
+
+  if (tree_fits_uhwi_p (CHREC_RIGHT (chrec)))
+    uhwi_step = tree_to_uhwi (CHREC_RIGHT (chrec));
+  else
+    uhwi_step = AVG_DIM_SIZE;
+
+  gcc_assert (tree_fits_uhwi_p (stride));
+  uhwi_stride = tree_to_uhwi (stride);
+
+  return uhwi_step * uhwi_stride;
+}
+
+/* Given LOOP, strip off its inner loop's chrec from ACCESS_FN and return
+   true for stripped case.  */
+
+static bool
+strip_sub_loop_access_fn (struct loop *loop, tree *access_fn)
+{
+  bool sub_loop_p = false;
+  do {
+    struct loop *sub_loop = get_loop (cfun, CHREC_VARIABLE (*access_fn));
+
+    if (!flow_loop_nested_p (loop, sub_loop))
+      break;
+
+    sub_loop_p = true;
+    *access_fn = CHREC_LEFT (*access_fn);
+  } while (TREE_CODE (*access_fn) == POLYNOMIAL_CHREC);
+
+  return sub_loop_p;
+}
+
+/* Interchange niters info of ILOOP and OLOOP while reset any other niters
+   estimates information for now.  */
+
+static inline void
+interchange_nb_iterations (struct loop *iloop, struct loop *oloop)
+{
+  tree nb_iterations = oloop->nb_iterations;
+
+  oloop->any_upper_bound = false;
+  oloop->any_likely_upper_bound = false;
+  free_numbers_of_iterations_estimates (oloop);
+
+  oloop->nb_iterations = iloop->nb_iterations;
+
+  iloop->any_upper_bound = false;
+  iloop->any_likely_upper_bound = false;
+  free_numbers_of_iterations_estimates (iloop);
+
+  iloop->nb_iterations = nb_iterations;
+}
+
+/* Interchange two loops specified by ILOOP and OLOOP.  */
+
+void
+tree_loop_interchange::interchange_loops (loop_cand &iloop, loop_cand &oloop)
+{
+  interchange_reductions (iloop, oloop);
+  interchange_inductions (iloop, oloop);
+
+  interchange_nb_iterations (iloop.loop, oloop.loop);
+
+  iloop.eliminate_dead_code ();
+}
+
+/* Interchange transformation for reductions of ILOOP and OLOOP.  We only
+   support two types reductions for now:
+     1) simple reduction of inner loop.
+     2) double reduction of loop nest.
+   For simple reduction, we simply undo it by moving producer/consumer to
+   inner loop; for double reduction, we don't need to do anything.  */
+
+void
+tree_loop_interchange::interchange_reductions (loop_cand &iloop,
+					       loop_cand &oloop)
+{
+  unsigned i;
+  reduction_p re;
+
+  for (i = 0; iloop.reductions.iterate (i, &re); ++i)
+    {
+      if (re->type == DOUBLE_RTYPE)
+	continue;
+
+      /* Undo simple reductions.  */
+      iloop.undo_simple_reduction (re);
+    }
+
+  for (i = 0; oloop.reductions.iterate (i, &re); ++i)
+    if (re->type != DOUBLE_RTYPE)
+      gcc_unreachable ();
+}
+
+/* Interchange transformation for inductions of ILOOP and OLOOP.  */
+
+void
+tree_loop_interchange::interchange_inductions (loop_cand &iloop,
+					       loop_cand &oloop)
+{
+  /* Map outer loop's IV to inner loop.  */
+  map_inductions_to_loop (oloop, iloop);
+  /* Map inner loop's IV to outer loop.  */
+  map_inductions_to_loop (iloop, oloop);
+
+  /* Create canonical IV for both loops.  Note canonical IV for outer/inner
+     loop is actually from inner/outer loop.  Also we record the new IV
+     created for the outer loop so that it can be skipped in later loop
+     interchange.  */
+  create_canonical_iv (oloop.loop, single_exit (oloop.loop), iloop.niters,
+		       &niters_iv_var);
+  create_canonical_iv (iloop.loop, single_exit (iloop.loop), oloop.niters);
+}
+
+/* Map induction variables of SRC loop to TGT loop.  The function firstly
+   creates the same IV of SRC loop in TGT loop, then deletes the original
+   IV and re-initialize it using the newly created IV.  For example, loop
+   nest:
+
+     for (i = 0; i < N; i++)
+       for (j = 0; j < M; j++)
+	 {
+	   //use of i;
+	   //use of j;
+	 }
+
+   will be transformed into:
+
+     for (jj = 0; jj < M; jj++)
+       for (ii = 0; ii < N; ii++)
+	 {
+	   //use of ii;
+	   //use of jj;
+	 }
+
+   after loop interchange.  */
+
+void
+tree_loop_interchange::map_inductions_to_loop (loop_cand &src, loop_cand &tgt)
+{
+  induction_p iv;
+  edge e = single_exit (tgt.loop);
+  gimple_stmt_iterator incr_pos = gsi_last_bb (e->src), gsi;
+  bool move_code_p = flow_loop_nested_p (src.loop, tgt.loop);
+
+  /* Move src's code to tgt loop.  This is necessary when src is the outer
+     loop and tgt is the inner loop.  */
+  if (move_code_p)
+    move_code_to_inner_loop (src.loop, tgt.loop, src.bbs);
+
+  /* Map source loop's IV to target loop.  */
+  for (unsigned i = 0; src.inductions.iterate (i, &iv); ++i)
+    {
+      gimple *stmt = SSA_NAME_DEF_STMT (iv->var);
+      gcc_assert (is_a <gphi *> (stmt));
+
+      /* Delete var definition of the original IV's in the source loop.  */
+      gsi = gsi_for_stmt (stmt);
+      gsi_remove (&gsi, true);
+
+      /* No need to map PHI to target loop if it is created in previous
+	 loop interchange.  */
+      if (niters_iv_var == iv->var)
+	{
+	  gcc_assert (!move_code_p);
+	  continue;
+	}
+
+      /* Map the IV by creating the same one in target loop.  */
+      tree base = unshare_expr (iv->base), step = unshare_expr (iv->step);
+      create_iv (base, step, SSA_NAME_VAR (iv->var),
+		 tgt.loop, &incr_pos, false, &iv->mapped_var, NULL);
+
+      /* Replace uses of the original IV var with newly created IV var.  */
+      use_operand_p imm_use_p;
+      imm_use_iterator iterator;
+      FOR_EACH_IMM_USE_STMT (stmt, iterator, iv->var)
+	FOR_EACH_IMM_USE_ON_STMT (imm_use_p, iterator)
+	  SET_USE (imm_use_p, iv->mapped_var);
+    }
+}
+
+/* Compute the insert position at inner loop when moving code from outer
+   loop to inner one.  */
+
+static inline void
+insert_pos_at_inner_loop (struct loop *outer, struct loop *inner,
+			  basic_block bb, gimple_stmt_iterator *pos)
+{
+  if (bb == outer->header || bb == outer->latch)
+    {
+      /* Move code from header/latch to header/latch.  */
+      *pos = gsi_after_labels (inner->header);
+    }
+  else
+    {
+      /* Otherwise, simply move to exit->src.  */
+      edge e = single_exit (inner);
+      *pos = gsi_last_bb (e->src);
+    }
+}
+
+/* Move stmts of outer loop to inner loop.  */
+
+void
+tree_loop_interchange::move_code_to_inner_loop (struct loop *outer,
+						struct loop *inner,
+						basic_block *outer_bbs)
+{
+  unsigned int i;
+  edge oloop_exit = single_exit (outer);
+  gimple_stmt_iterator insert_pos, gsi;
+
+  for (i = 0; i < outer->num_nodes; i++)
+    {
+      basic_block bb = outer_bbs[i];
+
+      /* Skip basic blocks of inner loop.  */
+      if (flow_bb_inside_loop_p (inner, bb))
+	continue;
+
+      insert_pos_at_inner_loop (outer, inner, bb, &insert_pos);
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (gimple_code (stmt) == GIMPLE_LABEL)
+	    {
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (oloop_exit->src == bb
+	      && stmt == gsi_stmt (gsi_last_bb (oloop_exit->src)))
+	    {
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (gimple_vuse (stmt))
+	    gimple_set_vuse (stmt, NULL_TREE);
+	  if (gimple_vdef (stmt))
+	    gimple_set_vdef (stmt, NULL_TREE);
+
+	  gsi_move_before (&gsi, &insert_pos);
+	}
+    }
+}
+
+/* Return true if interchanging ILOOP/OLOOP is profitable.  The function
+   computes and compares three types information from all DATAREFS:
+     1) Access stride for ILOOP and OLOOP.
+     2) Number of invariant memory references with respect to ILOOP before
+	and after loop interchange.
+     3) Flags indicating if all memory references access sequential memory
+	in ILOOP, before and after loop interchange.
+   This function also dumps information if DUMP_INFO_P is true.  */
+
+static bool
+should_interchange_loops (struct loop *iloop, struct loop *oloop,
+			  vec<data_reference_p> datarefs,
+			  bool dump_info_p = true)
+{
+  unsigned HOST_WIDE_INT ratio;
+  unsigned i, j, num_old_inv_drs = 0, num_new_inv_drs = 0;
+  struct data_reference *dr;
+  bool all_seq_dr_before_p = true, all_seq_dr_after_p = true;
+  widest_int iloop_strides = 0, oloop_strides = 0;
+
+  if (dump_info_p && dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "\nData ref strides:\n\tmem_ref:\t\tiloop\toloop\n");
+
+  for (i = 0; datarefs.iterate (i, &dr); ++i)
+    {
+      bool sub_loop_p = false;
+      tree ref = DR_REF (dr), stride;
+      tree iloop_chrec = NULL_TREE, iloop_stride = NULL_TREE;
+      tree oloop_chrec = NULL_TREE, oloop_stride = NULL_TREE;
+
+      /* Get CHREC and the corresponding stride for ILOOP/OLOOP.  */
+      for (j = 0; j < DR_NUM_DIMENSIONS (dr); ++j)
+	{
+	  while (handled_component_p (ref) && TREE_CODE (ref) != ARRAY_REF)
+	    ref = TREE_OPERAND (ref, 0);
+
+	  stride = integer_one_node;
+	  if (TREE_CODE (ref) == ARRAY_REF)
+	    {
+	      stride = TYPE_SIZE_UNIT (TREE_TYPE (ref));
+	      ref = TREE_OPERAND (ref, 0);
+	    }
+
+	  tree access_fn = DR_ACCESS_FN (dr, j);
+	  if (TREE_CODE (access_fn) != POLYNOMIAL_CHREC)
+	    continue;
+
+	  sub_loop_p |= strip_sub_loop_access_fn (iloop, &access_fn);
+	  if (TREE_CODE (access_fn) != POLYNOMIAL_CHREC)
+	    continue;
+
+	  decompose_chrecs (iloop->num, oloop->num,
+			    access_fn, j, stride,
+			    &iloop_chrec, &iloop_stride, NULL,
+			    &oloop_chrec, &oloop_stride, NULL);
+	}
+
+      if (!iloop_chrec && !oloop_chrec)
+	continue;
+
+      /* If stride is unknown for inner or outer loop.  */
+      if ((iloop_chrec != NULL
+	   && (iloop_stride == NULL_TREE
+	       || !tree_fits_uhwi_p (iloop_stride)))
+	  || (oloop_chrec != NULL
+	      && (oloop_stride == NULL_TREE
+		  || !tree_fits_uhwi_p (oloop_stride))))
+	continue;
+
+      /* Compute overall access strides for ILOOP.  */
+      unsigned HOST_WIDE_INT t1 = 0, t2 = 0;
+      if (iloop_chrec)
+	{
+	  t1 = access_stride (iloop_chrec, iloop_stride);
+	  iloop_strides = wi::add (iloop_strides, t1);
+	}
+      else if (!sub_loop_p)
+	num_old_inv_drs++;
+
+      /* Compute overall access strides for OLOOP.  */
+      if (oloop_chrec)
+	{
+	  t2 = access_stride (oloop_chrec, oloop_stride);
+	  oloop_strides = wi::add (oloop_strides, t2);
+	}
+      else if (!sub_loop_p)
+	num_new_inv_drs++;
+
+      /* Track if all data references access sequential memory before and
+	 after loop interchange.  */
+      if (sub_loop_p)
+	{
+	  /* Data ref can't be sequential if it evaluates wrto any sub loop
+	     of inner loop.  */
+	  all_seq_dr_before_p = false;
+	  all_seq_dr_after_p = false;
+	}
+      else if ((stride = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr)))) != NULL_TREE
+	       && tree_fits_uhwi_p (stride))
+	{
+	  /* Check if data ref access sequential memory wrto inner loop.
+	     Note invariant is considered sequential.  */
+	  unsigned HOST_WIDE_INT uhwi_stride = tree_to_uhwi (stride);
+	  if (t1 != 0 && t1 != uhwi_stride)
+	    all_seq_dr_before_p = false;
+	  if (t2 != 0 && t2 != uhwi_stride)
+	    all_seq_dr_after_p = false;
+	}
+
+      if (dump_info_p && dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "\t");
+	  print_generic_expr (dump_file, DR_REF (dr), TDF_SLIM);
+	  fprintf (dump_file, ":\t\t" HOST_WIDE_INT_PRINT_DEC
+		   "\t" HOST_WIDE_INT_PRINT_DEC "\n", t1, t2);
+	}
+    }
+
+  if (dump_info_p && dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "\toverall:\t\t");
+      print_decu (iloop_strides, dump_file);
+      fprintf (dump_file, "\t");
+      print_decu (oloop_strides, dump_file);
+      fprintf (dump_file, "\n");
+
+      fprintf (dump_file, "Invariant data ref: before(%d), after(%d)\n",
+	       num_old_inv_drs, num_new_inv_drs);
+      fprintf (dump_file, "Consecutive stride: before(%s), after(%s)\n",
+	       all_seq_dr_before_p ? "true" : "false",
+	       all_seq_dr_after_p ? "true" : "false");
+    }
+
+  /* We use different stride comparison ratio for interchanging innermost
+     two loops or not.  The idea is to be conservative in interchange for
+     the innermost loops.  */
+  ratio = !iloop->inner ? INNER_STRIDE_RATIO : OUTER_STRIDE_RATIO;
+  /* Do interchange if it gives better data locality behavior.  */
+  if (wi::gtu_p (iloop_strides, wi::mul (oloop_strides, ratio)))
+    return true;
+  if (wi::gtu_p (iloop_strides, oloop_strides))
+    {
+      /* Or it creates more invariant memory references.  */
+      if ((!all_seq_dr_before_p || all_seq_dr_after_p)
+	  && num_new_inv_drs > num_old_inv_drs)
+	return true;
+      /* Or it makes all memory references sequential.  */
+      if (num_new_inv_drs >= num_old_inv_drs
+	  && !all_seq_dr_before_p && all_seq_dr_after_p)
+	return true;
+    }
+
+  return false;
+}
+
+/* Try to interchange inner loop of a loop nest to outer level.  */
+
+bool
+tree_loop_interchange::interchange ()
+{
+  bool changed_p = false;
+  /* In each iteration we try to interchange I-th loop with (I+1)-th loop.
+     The overall effect is to push inner loop to outermost level in whole
+     loop nest.  */
+  for (unsigned i = loop_nest.length (); i > 1; --i)
+    {
+      unsigned inner = i - 1, outer = i - 2;
+
+      /* Check validity for loop interchange.  */
+      if (!valid_data_dependences (inner, outer))
+	break;
+
+      loop_cand iloop (loop_nest[inner], loop_nest[outer]);
+      loop_cand oloop (loop_nest[outer], loop_nest[outer]);
+
+      /* Check if we can do transformation for loop interchange.  */
+      if (!can_interchange_loops (iloop, oloop))
+	break;
+
+      /* Check profitability for loop interchange.  */
+      if (should_interchange_loops (iloop.loop, oloop.loop, datarefs))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Loop_pair<outer:%d, inner:%d> is interchanged\n\n",
+		     oloop.loop->num, iloop.loop->num);
+
+	  interchange_loops (iloop, oloop);
+	  /* Update data structures for further loop interchange.  */
+	  update_data_refs (iloop, oloop);
+	  update_data_deps (inner, outer);
+	  changed_p = true;
+	}
+      else
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Loop_pair<outer:%d, inner:%d> is not interchanged\n\n",
+		     oloop.loop->num, iloop.loop->num);
+	}
+    }
+
+  return changed_p;
+}
+
+
+/* Loop interchange pass.  */
+
+namespace {
+
+const pass_data pass_data_linterchange =
+{
+  GIMPLE_PASS, /* type */
+  "linterchange", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_LINTERCHANGE, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_linterchange : public gimple_opt_pass
+{
+public:
+  pass_linterchange (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_linterchange, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  opt_pass * clone () { return new pass_linterchange (m_ctxt); }
+  virtual bool gate (function *) { return flag_tree_loop_interchange; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_linterchange
+
+
+/* Return true if LOOP has proper form for interchange.  */
+
+static bool
+proper_loop_form_for_interchange (struct loop *loop)
+{
+  edge e0, e1, exit;
+
+  /* Don't interchange if loop has unsupported information for the moment.  */
+  if (loop->safelen > 0
+      || loop->constraints != 0
+      || loop->can_be_parallel
+      || loop->dont_vectorize
+      || loop->force_vectorize
+      || loop->in_oacc_kernels_region
+      || loop->orig_loop_num != 0
+      || loop->simduid != NULL_TREE)
+    return false;
+
+  /* Don't interchange if outer loop has basic block other than header,
+     exit->src and latch.  In general, only below form of loop nest:
+		header<---+
+                  |       |
+                  v       |
+              INNER_LOOP  |
+                  |       |
+                  v       |
+		exit--->latch
+     is supported.  */
+  if (loop->inner != NULL
+      && loop->num_nodes != loop->inner->num_nodes + 3)
+    return false;
+
+  if ((exit = single_dom_exit (loop)) == NULL)
+    return false;
+
+  /* Check control flow on loop header/exit blocks.  */
+  if (loop->header == exit->src
+      && (EDGE_COUNT (loop->header->preds) != 2
+	  || EDGE_COUNT (loop->header->succs) != 2))
+    return false;
+  else if (loop->header != exit->src
+	   && (EDGE_COUNT (loop->header->preds) != 2
+	       || !single_succ_p (loop->header)
+	       || unsupported_edge (single_succ_edge (loop->header))
+	       || EDGE_COUNT (exit->src->succs) != 2
+	       || !single_pred_p (exit->src)
+	       || unsupported_edge (single_pred_edge (exit->src))))
+    return false;
+
+  e0 = EDGE_PRED (loop->header, 0);
+  e1 = EDGE_PRED (loop->header, 1);
+  if (unsupported_edge (e0) || unsupported_edge (e1)
+      || (e0->src != loop->latch && e1->src != loop->latch)
+      || (e0->src->loop_father == loop && e1->src->loop_father == loop))
+    return false;
+
+  e0 = EDGE_SUCC (exit->src, 0);
+  e1 = EDGE_SUCC (exit->src, 1);
+  if (unsupported_edge (e0) || unsupported_edge (e1)
+      || (e0->dest != loop->latch && e1->dest != loop->latch)
+      || (e0->dest->loop_father == loop && e1->dest->loop_father == loop))
+    return false;
+
+  return true;
+}
+
+/* Return true if any two adjacent loops in loop nest OUTERMOST_LOOP should
+   be interchanged by looking into all DATAREFS.  INNERMOST_LOOP is the
+   innermost loop of this loop nest.  */
+
+static bool
+should_interchange_loop_nest (struct loop *outermost_loop,
+			      struct loop *innermost_loop,
+			      vec<data_reference_p> datarefs)
+{
+  struct loop *inner = innermost_loop, *outer;
+
+  /* Check if any two adjacent loops should be interchanged.  */
+  while (inner != outermost_loop)
+    {
+      outer = loop_outer (inner);
+
+      if (should_interchange_loops (inner, outer, datarefs, false))
+	return true;
+
+      inner = outer;
+    }
+
+  return false;
+}
+
+/* Given loop nest LOOP_NEST and data references DATAREFS, compute data
+   dependences for loop interchange and store it in DDRS.  Note we compute
+   dependences directly rather than call generic interface so that we can
+   return on unknown dependence instantly.  */
+
+static bool
+tree_loop_interchange_compute_ddrs (vec<loop_p> loop_nest,
+				    vec<data_reference_p> datarefs,
+				    vec<ddr_p> *ddrs)
+{
+  struct data_reference *a, *b;
+
+  for (unsigned i = 0; datarefs.iterate (i, &a); ++i)
+    for (unsigned j = i + 1; datarefs.iterate (j, &b); ++j)
+      if (DR_IS_WRITE (a) || DR_IS_WRITE (b))
+	{
+	  ddr_p ddr = initialize_data_dependence_relation (a, b, loop_nest);
+	  ddrs->safe_push (ddr);
+	  compute_affine_dependence (ddr, loop_nest[0]);
+
+	  /* Give up if ddr is unknown dependence or classic direct vector
+	     is not available.  */
+	  if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know
+	      || (DDR_ARE_DEPENDENT (ddr) == NULL_TREE
+		  && DDR_NUM_DIR_VECTS (ddr) == 0))
+	    return false;
+	}
+
+  return true;
+}
+
+/* Prune DATAREFS by removing any data reference not inside of LOOP.  */
+
+static inline void
+prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> datarefs)
+{
+  struct data_reference *dr;
+
+  for (unsigned i = 0; datarefs.iterate (i, &dr);)
+    if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr))))
+      i++;
+    else
+      datarefs.ordered_remove (i);
+}
+
+/* Given loop nest like <OLOOP, ILOOP>, the function strips off outer
+   loops if it forms non-rectangle loop nest.  The outermost loop of
+   the rest rectangle loop nest or NULL is returned.  */
+
+struct loop *
+prune_non_rectangle_loop_nest (struct loop *iloop, struct loop *oloop)
+{
+  if (!oloop || !iloop || oloop == iloop)
+    return NULL;
+
+  struct loop *loop1 = iloop;
+
+  while (loop1 != NULL && flow_loop_nested_p (oloop, loop1))
+    {
+      tree niters = number_of_latch_executions (loop1);
+      struct loop *loop2 = loop_outer (loop1);
+
+      while (loop2 != NULL
+	     && (loop2 == oloop || flow_loop_nested_p (oloop, loop2)))
+	{
+	  /* Strip off the outermost loop if it isn't rectangle loop nest.  */
+	  if (chrec_contains_symbols_defined_in_loop (niters, loop2->num))
+	    {
+	      oloop = loop2->inner;
+	      break;
+	    }
+
+	  loop2 = loop_outer (loop2);
+	}
+      if (oloop == iloop)
+	return NULL;
+
+      loop1 = loop_outer (loop1);
+    }
+
+  return oloop;
+}
+
+/* Given innermost LOOP, return true if perfect loop nest can be found and
+   data dependences can be computed.  If succeed, record the perfect loop
+   nest in LOOP_NEST; record all data references in DATAREFS and record all
+   data dependence relations in DDRS.  */
+
+static bool
+prepare_perfect_loop_nest (struct loop *loop, vec<loop_p> *loop_nest,
+			   vec<data_reference_p> *datarefs, vec<ddr_p> *ddrs)
+{
+  tree niters;
+  struct loop *start_loop = NULL, *innermost_loop = loop;
+
+  /* Find loop nest from the innermost loop.  */
+  while (loop->num != 0 && loop->inner == start_loop)
+    {
+      if (!proper_loop_form_for_interchange (loop))
+	break;
+
+      /* Loop must have determined niters.  */
+      niters = number_of_latch_executions (loop);
+      if (!niters || chrec_contains_undetermined (niters))
+	break;
+
+      start_loop = loop;
+      /* If this loop has sibling loop, the father loop won't be in perfect
+	 loop nest.  */
+      if (loop->next != NULL)
+	break;
+
+      loop = loop_outer (loop);
+    }
+
+  start_loop = prune_non_rectangle_loop_nest (innermost_loop, start_loop);
+
+  if (!start_loop || !start_loop->inner)
+    return false;
+
+  datarefs->create (20);
+  if (find_data_references_in_loop (start_loop, datarefs) == chrec_dont_know
+      /* Check if there are too many of data references.  */
+      || ((int) datarefs->length ()
+	  > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS))
+      /* Check if loop nest should be interchanged.  */
+      || !should_interchange_loop_nest (start_loop, innermost_loop, *datarefs))
+    {
+      free_data_refs (*datarefs);
+      return false;
+    }
+
+  /* Check if data dependences can be computed for loop nest starting from
+     start_loop.  */
+  loop = start_loop;
+  loop_nest->create (3);
+  do {
+    ddrs->create (20);
+    loop_nest->truncate (0);
+
+    /* Strip data references not in loop nest starting from start_loop.  */
+    if (loop != start_loop)
+      prune_datarefs_not_in_loop (start_loop, *datarefs);
+
+    if (find_loop_nest (start_loop, loop_nest)
+	&& tree_loop_interchange_compute_ddrs (*loop_nest, *datarefs, ddrs))
+      {
+	if (dump_file && (dump_flags & TDF_DETAILS))
+	  fprintf (dump_file,
+		   "\nConsider loop interchange for loop_nest<%d - %d>\n",
+		   start_loop->num, innermost_loop->num);
+
+	return true;
+      }
+
+    free_dependence_relations (*ddrs);
+    /* Try to compute data dependences with the outermost loop stripped.  */
+    loop = start_loop;
+    start_loop = start_loop->inner;
+  } while (start_loop && start_loop->inner);
+
+  loop_nest->release ();
+  free_data_refs (*datarefs);
+  return false;
+}
+
+/* Main entry for loop interchange pass.  */
+
+unsigned int
+pass_linterchange::execute (function *fun)
+{
+  if (number_of_loops (fun) <= 2)
+    return 0;
+
+  bool changed_p = false;;
+  struct loop *loop;
+  vec<loop_p> loop_nest;
+  vec<data_reference_p> datarefs;
+  vec<ddr_p> ddrs;
+
+  FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
+    if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
+      {
+	tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs);
+	changed_p |= loop_interchange.interchange ();
+      }
+
+  if (changed_p)
+    scev_reset_htab ();
+
+  return changed_p ? (TODO_update_ssa_only_virtuals) : 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_linterchange (gcc::context *ctxt)
+{
+  return new pass_linterchange (ctxt);
+}
diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
index efb199a..56eff1c 100644
--- a/gcc/tree-ssa-loop-ivcanon.c
+++ b/gcc/tree-ssa-loop-ivcanon.c
@@ -76,10 +76,13 @@ enum unroll_level
 };
 
 /* Adds a canonical induction variable to LOOP iterating NITER times.  EXIT
-   is the exit edge whose condition is replaced.  */
+   is the exit edge whose condition is replaced.  The ssa versions of the new
+   IV before and after increment will be stored in VAR_BEFORE and VAR_AFTER
+   if they are not NULL.  */
 
-static void
-create_canonical_iv (struct loop *loop, edge exit, tree niter)
+void
+create_canonical_iv (struct loop *loop, edge exit, tree niter,
+		     tree *var_before = NULL, tree *var_after = NULL)
 {
   edge in;
   tree type, var;
@@ -112,7 +115,9 @@ create_canonical_iv (struct loop *loop, edge exit, tree niter)
   create_iv (niter,
 	     build_int_cst (type, -1),
 	     NULL_TREE, loop,
-	     &incr_at, false, NULL, &var);
+	     &incr_at, false, var_before, &var);
+  if (var_after)
+    *var_after = var;
 
   cmp = (exit->flags & EDGE_TRUE_VALUE) ? EQ_EXPR : NE_EXPR;
   gimple_cond_set_code (cond, cmp);
diff --git a/gcc/tree-ssa-loop-ivopts.h b/gcc/tree-ssa-loop-ivopts.h
index f8f31e9..cfa1fbe 100644
--- a/gcc/tree-ssa-loop-ivopts.h
+++ b/gcc/tree-ssa-loop-ivopts.h
@@ -31,4 +31,6 @@ extern bool expr_invariant_in_loop_p (struct loop *, tree);
 bool may_be_nonaddressable_p (tree expr);
 void tree_ssa_iv_optimize (void);
 
+void create_canonical_iv (struct loop *, edge, tree,
+			  tree * = NULL, tree * = NULL);
 #endif /* GCC_TREE_SSA_LOOP_IVOPTS_H */

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH GCC]A simple implementation of loop interchange
  2017-09-22 10:25       ` Bin.Cheng
@ 2017-10-24 14:35         ` Michael Matz
  2017-11-16 15:30           ` Bin.Cheng
  0 siblings, 1 reply; 10+ messages in thread
From: Michael Matz @ 2017-10-24 14:35 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Richard Biener, gcc-patches

Hello,

On Fri, 22 Sep 2017, Bin.Cheng wrote:

> This is updated patch for loop interchange with review suggestions
> resolved.  Changes are:
>   1) It does more light weight checks like rectangle loop nest check
> earlier than before.
>   2) It checks profitability of interchange before data dependence computation.
>   3) It calls find_data_references_in_loop only once for a loop nest now.
>   4) Data dependence is open-computed so that we can skip instantly at
> unknown dependence.
>   5) It improves code generation in mapping induction variables for
> loop nest, as well as
>      adding a simple dead code elimination pass.
>   6) It changes magic constants into parameters.

So I have a couple comments/questions.  Something stylistic:

> +class loop_cand
> +{
> +public:
> ...
> +  friend class tree_loop_interchange;
> +private:

Just make this all public (and hence a struct, not class).
No need for friends in file local classes.

> +single_use_in_loop (tree var, struct loop *loop)
> ...
> +  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
> +    {
> +      stmt = USE_STMT (use_p);
> ...
> +      basic_block bb = gimple_bb (stmt);
> +      gcc_assert (bb != NULL);

This pattern reoccurs often in your patch: you check for a bb associated 
for a USE_STMT.  Uses of SSA names always occur in basic blocks, no need 
for checking.

Then, something about your handling of simple reductions:

> +void
> +loop_cand::classify_simple_reduction (reduction_p re)
> +{
> ...
> +  /* Require memory references in producer and consumer are the same so
> +     that we can undo reduction during interchange.  */
> +  if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0))
> +    return;

Where is it checked that the undoing transformation is legal also
from a data dep point of view?  Think code like this:

   sum = X[i];
   for (j ...)
     sum += X[j];
   X[i] = sum;

Moving the store into the inner loop isn't always correct and I don't seem 
to find where the above situation is rejected.

Maybe I'm confused because I also don't see where you even can get into 
the above situation (though I do see testcases about this).  The thing is, 
for an 2d loop nest to contain something like the above reduction it can't 
be perfect:

   for (j) {
     int sum = X[j];      // 1
     for (i)
       sum += Y[j][i];
     X[j] = sum;          // 2
   }

But you do check for perfectness in proper_loop_form_for_interchange and 
prepare_perfect_loop_nest, so either you can't get into the situation or 
the checking can't be complete, or you define the above to be perfect 
nevertheless (probably because the load and store are in outer loop 
header/exit blocks?).  The latter would mean that you accept also other 
code in header/footer of loops from a pure CFG perspective, so where is it 
checked that that other code (which aren't simple reductions) isn't 
harmful to the transformation?

Then, the data dependence part of the new pass:

> +bool
> +tree_loop_interchange::valid_data_dependences (unsigned inner, unsigned outer)
> +{
> +  struct data_dependence_relation *ddr;
> +
> +  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
> +    {
> +      /* Skip no-dependence case.  */
> +      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
> +     continue;
> +
> +      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
> +     {
> +       lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
> +       unsigned level = dependence_level (dist_vect, loop_nest.length ());
> +
> +       /* If there is no carried dependence.  */
> +       if (level == 0)
> +         continue;
> +
> +       level --;
> +       /* Skip case which has '>' as the leftmost direction.  */
> +       if (!lambda_vector_lexico_pos (dist_vect, level))
> +         return false;

Shouldn't happen as dist vectors are forced positive via DDR_REVERSED.

> +       /* If dependence is carried by outer loop of the two loops for
> +          interchange.  */
> +       if (level < outer)
> +         continue;
> +
> +       lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
> +       /* If directions at both inner/outer levels are the same.  */
> +       if (dir_vect[inner] == dir_vect[outer])
> +         continue;
> +
> +       /* Be conservative, skip case if either direction at inner/outer
> +          levels is not '=' or '<'.  */
> +       if (dir_vect[inner] != dir_equal
> +           && dir_vect[inner] != dir_positive
> +           && dir_vect[inner] != dir_independent
> +           && dir_vect[inner] != dir_positive_or_equal)
> +         return false;
> +
> +       if (dir_vect[outer] != dir_equal
> +           && dir_vect[outer] != dir_positive
> +           && dir_vect[outer] != dir_independent
> +           && dir_vect[outer] != dir_positive_or_equal)
> +         return false;

Checking dir vectors doesn't make much sense in GCC: the elements are only
ever set to dir_positive, dir_negative or dir_equal, exactly when distance 
is
 > 0, < 0 or == 0.  So checking dist vector is enough. (though sameness of
direction checks sameness of sign with zero).  Incidentally:

> +tree_loop_interchange::update_data_deps (unsigned inner, unsigned outer)
> +{
> +  struct data_dependence_relation *ddr;
> +
> +  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
> +    {
> +      /* Skip no-dependence case.  */
> +      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
> +     continue;
> +
> +      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
> +     {
> +       lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
> +       std::swap (dir_vect[inner], dir_vect[outer]);
> +     }
> +    }
> +}

Here you swap only the direction but not the distance vector, which can't 
be right.  I suggest only using (and updating) the distance vector.

And then your usage and update of DR_ACCESS_FNs: there's quite some 
complexity connected with that and I'm not sure how worthwhile it is.  
You're basically using the ACCESS_FNs to determine profitability (and not 
for validity, and that's good).  But e.g. for pointer based accesses like 
in fortran with explicit address arithmetic the relation between access-fn 
step and stride and actual access stride isn't that easy (e.g. in your 
should_interchange_loops function iloop_stride and oloop_stride will 
always be one for pointer based accesses).

Conceptually what you should check is how the access address for each data 
ref revolves for each loop, so why not doing this explicitely?  What I 
mean is: calculate a (complicated) chrec for the DR addresses for the 
whole nest at the beginning.  It should be in the form like (assume "+"
always):

  {{{init, s1}_l1, s2}_l2, s3}_l3

(i.e. all steps should be invariants/constants, and only one non-chrec 
init value).  Addresses which aren't in this form you're already ignoring 
right now, so you could continue doing that.  (Or better said, all 
non-constant steps you regard as being AVG_DIM_SIZE, which you still can 
continue doing).

Now, with the above form you can form expressions for the difference 
between addresses per iteration for each loop (i.e. the address stride per 
loop); store these.  Then, when interchanging loops you need to merely 
swap these expressions like you have to with the distance vector, instead 
of fiddling inside the DR_ACCESS_FNs themself.  Much code would go away.

Testcases: given that we had to remove our old separate interchange pass 
because it miscompiled stuff all over I'm missing some testcases where 
interchange should _not_ happen for validity reasons, like my above 
example with an reduction that can't be moved inside.  Perhaps you can 
think of some more.

I hope this is of some help to you :)


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH GCC]A simple implementation of loop interchange
  2017-10-24 14:35         ` Michael Matz
@ 2017-11-16 15:30           ` Bin.Cheng
  2017-11-20 14:59             ` Richard Biener
  0 siblings, 1 reply; 10+ messages in thread
From: Bin.Cheng @ 2017-11-16 15:30 UTC (permalink / raw)
  To: Michael Matz; +Cc: Richard Biener, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 10974 bytes --]

On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <matz@suse.de> wrote:
> Hello,
>
> On Fri, 22 Sep 2017, Bin.Cheng wrote:
>
>> This is updated patch for loop interchange with review suggestions
>> resolved.  Changes are:
>>   1) It does more light weight checks like rectangle loop nest check
>> earlier than before.
>>   2) It checks profitability of interchange before data dependence computation.
>>   3) It calls find_data_references_in_loop only once for a loop nest now.
>>   4) Data dependence is open-computed so that we can skip instantly at
>> unknown dependence.
>>   5) It improves code generation in mapping induction variables for
>> loop nest, as well as
>>      adding a simple dead code elimination pass.
>>   6) It changes magic constants into parameters.
>
> So I have a couple comments/questions.  Something stylistic:
Hi Michael,
Thanks for reviewing.

>
>> +class loop_cand
>> +{
>> +public:
>> ...
>> +  friend class tree_loop_interchange;
>> +private:
>
> Just make this all public (and hence a struct, not class).
> No need for friends in file local classes.
Done.

>
>> +single_use_in_loop (tree var, struct loop *loop)
>> ...
>> +  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
>> +    {
>> +      stmt = USE_STMT (use_p);
>> ...
>> +      basic_block bb = gimple_bb (stmt);
>> +      gcc_assert (bb != NULL);
>
> This pattern reoccurs often in your patch: you check for a bb associated
> for a USE_STMT.  Uses of SSA names always occur in basic blocks, no need
> for checking.
Done.

>
> Then, something about your handling of simple reductions:
>
>> +void
>> +loop_cand::classify_simple_reduction (reduction_p re)
>> +{
>> ...
>> +  /* Require memory references in producer and consumer are the same so
>> +     that we can undo reduction during interchange.  */
>> +  if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0))
>> +    return;
>
> Where is it checked that the undoing transformation is legal also
> from a data dep point of view?  Think code like this:
>
>    sum = X[i];
>    for (j ...)
>      sum += X[j];
>    X[i] = sum;
>
> Moving the store into the inner loop isn't always correct and I don't seem
> to find where the above situation is rejected.
Yeah.  for the old patch, it's possible to have such loop wrongly interchanged;
in practice, it's hard to create an example.  The pass will give up
when computing
data dep between references in inner/outer loops.  In this updated
patch, it's fixed
by giving up if there is any dependence between references of inner/outer loops.

>
> Maybe I'm confused because I also don't see where you even can get into
> the above situation (though I do see testcases about this).  The thing is,
> for an 2d loop nest to contain something like the above reduction it can't
> be perfect:
>
>    for (j) {
>      int sum = X[j];      // 1
>      for (i)
>        sum += Y[j][i];
>      X[j] = sum;          // 2
>    }
>
> But you do check for perfectness in proper_loop_form_for_interchange and
> prepare_perfect_loop_nest, so either you can't get into the situation or
> the checking can't be complete, or you define the above to be perfect
> nevertheless (probably because the load and store are in outer loop
> header/exit blocks?).  The latter would mean that you accept also other
> code in header/footer of loops from a pure CFG perspective, so where is it
> checked that that other code (which aren't simple reductions) isn't
> harmful to the transformation?
Yes, I used the name perfect loop nest, but the pass can handle special form
imperfect loop nest for the simple reduction.  I added comments describing
this before function prepare_perfect_loop_nest.

>
> Then, the data dependence part of the new pass:
>
>> +bool
>> +tree_loop_interchange::valid_data_dependences (unsigned inner, unsigned outer)
>> +{
>> +  struct data_dependence_relation *ddr;
>> +
>> +  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
>> +    {
>> +      /* Skip no-dependence case.  */
>> +      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>> +     continue;
>> +
>> +      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
>> +     {
>> +       lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
>> +       unsigned level = dependence_level (dist_vect, loop_nest.length ());
>> +
>> +       /* If there is no carried dependence.  */
>> +       if (level == 0)
>> +         continue;
>> +
>> +       level --;
>> +       /* Skip case which has '>' as the leftmost direction.  */
>> +       if (!lambda_vector_lexico_pos (dist_vect, level))
>> +         return false;
>
> Shouldn't happen as dist vectors are forced positive via DDR_REVERSED.
Done.

>
>> +       /* If dependence is carried by outer loop of the two loops for
>> +          interchange.  */
>> +       if (level < outer)
>> +         continue;
>> +
>> +       lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
>> +       /* If directions at both inner/outer levels are the same.  */
>> +       if (dir_vect[inner] == dir_vect[outer])
>> +         continue;
>> +
>> +       /* Be conservative, skip case if either direction at inner/outer
>> +          levels is not '=' or '<'.  */
>> +       if (dir_vect[inner] != dir_equal
>> +           && dir_vect[inner] != dir_positive
>> +           && dir_vect[inner] != dir_independent
>> +           && dir_vect[inner] != dir_positive_or_equal)
>> +         return false;
>> +
>> +       if (dir_vect[outer] != dir_equal
>> +           && dir_vect[outer] != dir_positive
>> +           && dir_vect[outer] != dir_independent
>> +           && dir_vect[outer] != dir_positive_or_equal)
>> +         return false;
>
> Checking dir vectors doesn't make much sense in GCC: the elements are only
> ever set to dir_positive, dir_negative or dir_equal, exactly when distance
> is
>  > 0, < 0 or == 0.  So checking dist vector is enough. (though sameness of
> direction checks sameness of sign with zero).  Incidentally:
Done.

>
>> +tree_loop_interchange::update_data_deps (unsigned inner, unsigned outer)
>> +{
>> +  struct data_dependence_relation *ddr;
>> +
>> +  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
>> +    {
>> +      /* Skip no-dependence case.  */
>> +      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>> +     continue;
>> +
>> +      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
>> +     {
>> +       lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
>> +       std::swap (dir_vect[inner], dir_vect[outer]);
>> +     }
>> +    }
>> +}
>
> Here you swap only the direction but not the distance vector, which can't
> be right.  I suggest only using (and updating) the distance vector.
Yeah, fixed.

>
> And then your usage and update of DR_ACCESS_FNs: there's quite some
> complexity connected with that and I'm not sure how worthwhile it is.
> You're basically using the ACCESS_FNs to determine profitability (and not
> for validity, and that's good).  But e.g. for pointer based accesses like
> in fortran with explicit address arithmetic the relation between access-fn
> step and stride and actual access stride isn't that easy (e.g. in your
> should_interchange_loops function iloop_stride and oloop_stride will
> always be one for pointer based accesses).
>
> Conceptually what you should check is how the access address for each data
> ref revolves for each loop, so why not doing this explicitely?  What I
> mean is: calculate a (complicated) chrec for the DR addresses for the
> whole nest at the beginning.  It should be in the form like (assume "+"
> always):
>
>   {{{init, s1}_l1, s2}_l2, s3}_l3
>
> (i.e. all steps should be invariants/constants, and only one non-chrec
> init value).  Addresses which aren't in this form you're already ignoring
> right now, so you could continue doing that.  (Or better said, all
> non-constant steps you regard as being AVG_DIM_SIZE, which you still can
> continue doing).
>
> Now, with the above form you can form expressions for the difference
> between addresses per iteration for each loop (i.e. the address stride per
> loop); store these.  Then, when interchanging loops you need to merely
> swap these expressions like you have to with the distance vector, instead
> of fiddling inside the DR_ACCESS_FNs themself.  Much code would go away.
Yeah.  Did similar thing in loop nest distribution pass.  See
compute_access_range
in tree-loop-distribution.c.  Actually, I would do the same here if I
had implemented
this pass after loop nest distribution patches.  Done in this updated patch.

>
> Testcases: given that we had to remove our old separate interchange pass
> because it miscompiled stuff all over I'm missing some testcases where
> interchange should _not_ happen for validity reasons, like my above
> example with an reduction that can't be moved inside.  Perhaps you can
> think of some more.
As mentioned above, it's hard to create test that fail exactly for this reason.
I added one that data dependence prevents us from interchanging the loop.

>
> I hope this is of some help to you :)
Thanks again, it's very helpful.

I also fixed several bugs of previous implementation, mostly about debug info
statements and simple reductions.  As for test, I enabled this pass by default,
bootstrap and regtest GCC, I also build/run specs.  There must be some other
latent bugs in it, but guess we have to exercise it by enabling it at
some point.

So any comments?

Thanks,
bin

2017-11-15  Bin Cheng  <bin.cheng@arm.com>

    * Makefile.in (tree-ssa-loop-interchange.o): New object file.
    * common.opt (ftree-loop-interchange): New option.
    * doc/invoke.texi (-ftree-loop-interchange): Document new option.
    * gimple-iterator.c (gsi_remove): New parameter determining if dbg
    bind stmt is inserted or not.
    * gimple-iterator.h (gsi_remove): New parameter in declaration.
    * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter.
    (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
    * passes.def (pass_linterchange): New pass.
    * timevar.def (TV_LINTERCHANGE): New time var.
    * tree-pass.h (make_pass_linterchange): New declaration.
    * tree-ssa-loop-interchange.cc: New file.
    * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
    Record IV before/after increment in new parameters.
    * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.

gcc/testsuite
2017-11-15  Bin Cheng  <bin.cheng@arm.com>

    * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
    * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
    * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
    * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
    * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
    * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
    * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
    * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
    * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
    * gcc.dg/tree-ssa/loop-interchange-10.c: New test.
    * gcc.dg/tree-ssa/loop-interchange-11.c: New test.

>
>
> Ciao,
> Michael.

[-- Attachment #2: linterchange-20171115.txt --]
[-- Type: text/plain, Size: 90277 bytes --]

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 5db7855..a720bed 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1524,6 +1524,7 @@ OBJS = \
 	tree-ssa-live.o \
 	tree-ssa-loop-ch.o \
 	tree-ssa-loop-im.o \
+	tree-ssa-loop-interchange.o \
 	tree-ssa-loop-ivcanon.o \
 	tree-ssa-loop-ivopts.o \
 	tree-ssa-loop-manip.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index f8f2ed3..f4eeaa7 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2574,6 +2574,10 @@ ftree-loop-distribute-patterns
 Common Report Var(flag_tree_loop_distribute_patterns) Optimization
 Enable loop distribution for patterns transformed into a library call.
 
+ftree-loop-interchange
+Common Report Var(flag_tree_loop_interchange) Optimization
+Enable loop interchange on trees.
+
 ftree-loop-im
 Common Report Var(flag_tree_loop_im) Init(1) Optimization
 Enable loop invariant motion on trees.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 4427328..00bdc1d 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -8620,6 +8620,25 @@ ENDDO
 @end smallexample
 and the initialization loop is transformed into a call to memset zero.
 
+@item -ftree-loop-interchange
+@opindex ftree-loop-interchange
+Perform loop interchange outside of graphite.  This flag can improve cache
+performance on loop nest and allow further loop optimizations, like
+vectorization, to take place.  For example, the loop
+@smallexample
+for (int i = 0; i < N; i++)
+  for (int j = 0; j < N; j++)
+    for (int k = 0; k < N; k++)
+      c[i][j] = c[i][j] + a[i][k]*b[k][j];
+@end smallexample
+is transformed to
+@smallexample
+for (int i = 0; i < N; i++)
+  for (int k = 0; k < N; k++)
+    for (int j = 0; j < N; j++)
+      c[i][j] = c[i][j] + a[i][k]*b[k][j];
+@end smallexample
+
 @item -ftree-loop-im
 @opindex ftree-loop-im
 Perform loop invariant motion on trees.  This pass moves only invariants that
diff --git a/gcc/gimple-iterator.c b/gcc/gimple-iterator.c
index d9d02d3..a4e8f46 100644
--- a/gcc/gimple-iterator.c
+++ b/gcc/gimple-iterator.c
@@ -549,16 +549,20 @@ gsi_insert_after (gimple_stmt_iterator *i, gimple *stmt,
    from the IL and not reinserted elsewhere.  In that case we remove the
    statement pointed to by iterator I from the EH tables, and free its
    operand caches.  Otherwise we do not modify this information.  Returns
-   true whether EH edge cleanup is required.  */
+   true whether EH edge cleanup is required.
+
+   If INSERT_DGB is true, this function inserts debug bind stmt for each
+   variable defined by the current stmt when necessary; replaces uses of
+   def with the newly-created debug temp.  */
 
 bool
-gsi_remove (gimple_stmt_iterator *i, bool remove_permanently)
+gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool insert_dbg)
 {
   gimple_seq_node cur, next, prev;
   gimple *stmt = gsi_stmt (*i);
   bool require_eh_edge_purge = false;
 
-  if (gimple_code (stmt) != GIMPLE_PHI)
+  if (insert_dbg && gimple_code (stmt) != GIMPLE_PHI)
     insert_debug_temps_for_defs (i);
 
   /* Free all the data flow information for STMT.  */
diff --git a/gcc/gimple-iterator.h b/gcc/gimple-iterator.h
index 70f18be..f1c43cc 100644
--- a/gcc/gimple-iterator.h
+++ b/gcc/gimple-iterator.h
@@ -77,7 +77,7 @@ extern void gsi_insert_after_without_update (gimple_stmt_iterator *, gimple *,
 					     enum gsi_iterator_update);
 extern void gsi_insert_after (gimple_stmt_iterator *, gimple *,
 			      enum gsi_iterator_update);
-extern bool gsi_remove (gimple_stmt_iterator *, bool);
+extern bool gsi_remove (gimple_stmt_iterator *, bool, bool = true);
 extern gimple_stmt_iterator gsi_for_stmt (gimple *);
 extern gphi_iterator gsi_for_phi (gphi *);
 extern void gsi_move_after (gimple_stmt_iterator *, gimple_stmt_iterator *);
diff --git a/gcc/params.def b/gcc/params.def
index 8881f4c..b5e5893 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -790,6 +790,20 @@ DEFPARAM (PARAM_L2_CACHE_SIZE,
 	  "The size of L2 cache.",
 	  512, 0, 0)
 
+/* Maximum number of statements in loop nest for loop interchange.  */
+
+DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS,
+	  "loop-interchange-max-num-stmts",
+	  "The maximum number of stmts in loop nest for loop interchange.",
+	  64, 0, 0)
+
+/* Minimum stride ratio for loop interchange to be profitiable.  */
+
+DEFPARAM (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO,
+	  "loop-interchange-stride-ratio",
+	  "The minimum stride ratio for loop interchange to be profitable",
+	  2, 0, 0)
+
 /* Whether we should use canonical types rather than deep "structural"
    type checking.  Setting this value to 1 (the default) improves
    compilation performance in the C++ and Objective-C++ front end;
diff --git a/gcc/passes.def b/gcc/passes.def
index 00e75d2..2e38c6f 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -278,6 +278,7 @@ along with GCC; see the file COPYING3.  If not see
 	  NEXT_PASS (pass_cd_dce);
 	  NEXT_PASS (pass_iv_canon);
 	  NEXT_PASS (pass_loop_distribution);
+	  NEXT_PASS (pass_linterchange);
 	  NEXT_PASS (pass_copy_prop);
 	  NEXT_PASS (pass_graphite);
 	  PUSH_INSERT_PASSES_WITHIN (pass_graphite)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-1.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-1.c
new file mode 100644
index 0000000..f2392e3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-1.c
@@ -0,0 +1,50 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+/* Copied from graphite/interchange-4.c */
+
+#define DEBUG 0
+#if DEBUG
+#include <stdio.h>
+#endif
+
+double u[1782225];
+
+static int __attribute__((noinline))
+foo (int N, int *res)
+{
+  int i, j;
+  double sum = 0;
+  for (i = 0; i < N; i++)
+    for (j = 0; j < N; j++)
+      sum = sum + u[i + 1335 * j];
+
+  for (i = 0; i < N; i++)
+    u[1336 * i] *= 2;
+
+  *res = sum + N + u[1336 * 2] + u[1336];
+}
+
+extern void abort ();
+
+int
+main (void)
+{
+  int i, j, res;
+
+  for (i = 0; i < 1782225; i++)
+    u[i] = 2;
+
+  foo (1335, &res);
+
+#if DEBUG
+  fprintf (stderr, "res = %d \n", res);
+#endif
+
+  if (res != 3565793)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-10.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-10.c
new file mode 100644
index 0000000..610610b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-10.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M];
+int __attribute__((noinline))
+double_reduc (int n)
+{
+  int sum = 0;
+  for (int j = 0; j < n; j++)
+    {
+      for (int i = 0; i < n; i++)
+	sum = sum + a[i][j]*b[i][j];
+    }
+  return sum;
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+    }
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  int sum = double_reduc (M);
+
+  if (sum != 1065369600)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-11.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-11.c
new file mode 100644
index 0000000..e7acc1b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-11.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M];
+
+void
+simple_reduc_1 (int n, int *p)
+{
+  for (int j = 0; j < n; j++)
+    {
+      int sum = p[j];
+      for (int i = 0; i < n; i++)
+	{
+	  sum = sum + b[i][j];
+	  b[i][j] += a[i][j];
+	}
+
+      p[j] = sum;
+    }
+}
+/* { dg-final { scan-tree-dump-not "Loop_pair<outer:., inner:.> is interchanged" "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-2.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-2.c
new file mode 100644
index 0000000..8341a22
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-2.c
@@ -0,0 +1,58 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+/* Copied from graphite/interchange-5.c */
+
+#define DEBUG 0
+#if DEBUG
+#include <stdio.h>
+#endif
+
+#define N 100
+#define M 1111
+int A[N][M];
+
+static int __attribute__((noinline))
+foo (void)
+{
+  int i, j;
+
+  for( i = 0; i < M; i++)
+    for( j = 0; j < N; j++)
+      A[j][i] = 5 * A[j][i];
+
+  return A[0][0] + A[N-1][M-1];
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  int j;
+
+  for (j = 0; j < M; j++)
+    A[i][j] = 2;
+}
+
+int
+main (void)
+{
+  int i, j, res;
+
+  for (i = 0; i < N; i++)
+    init (i);
+
+  res = foo ();
+
+#if DEBUG
+  fprintf (stderr, "res = %d \n", res);
+#endif
+
+  if (res != 20)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-3.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-3.c
new file mode 100644
index 0000000..ca2a114
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-3.c
@@ -0,0 +1,59 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+/* Copied from graphite/interchange-6.c */
+
+#define DEBUG 0
+#if DEBUG
+#include <stdio.h>
+#endif
+
+#define N 100
+#define M 200
+
+static int __attribute__((noinline))
+foo (int A[N][M])
+{
+  int i, j;
+
+  /* This loop should be interchanged. */
+  for(j = 0; j < M; j++)
+    for(i = 0; i < N; i++)
+      A[i][j] = A[i][j] + A[i][j];
+
+  return A[0][0] + A[N-1][M-1];
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int *arr, int i)
+{
+  int j;
+
+  for (j = 0; j < M; j++)
+    arr[j] = 2;
+}
+
+int
+main (void)
+{
+  int A[N][M];
+  int i, j, res;
+
+  for (i = 0; i < N; i++)
+    init (A[i], i);
+
+  res = foo (A);
+
+#if DEBUG
+  fprintf (stderr, "res = %d \n", res);
+#endif
+
+  if (res != 8)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-4.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-4.c
new file mode 100644
index 0000000..ff820f3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-4.c
@@ -0,0 +1,50 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+/* Copied from graphite/interchange-7.c */
+
+#define DEBUG 0
+#if DEBUG
+#include <stdio.h>
+#endif
+
+#define N 111
+#define M 1111
+
+static int __attribute__((noinline))
+foo (double *a)
+{
+  int i,j;
+  int r = 0;
+
+  for (i = 0; i < N; ++i)
+    for (j = 0; j < M; ++j)
+      r += a[j * N + i];
+
+  return r;
+}
+
+extern void abort ();
+
+int
+main (void)
+{
+  double A[N*M];
+  int i, res;
+
+  for (i = 0; i < N*M; i++)
+    A[i] = 2;
+
+  res = foo (A);
+
+#if DEBUG
+  fprintf (stderr, "res = %d \n", res);
+#endif
+
+  if (res != 246642)
+    abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-5.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-5.c
new file mode 100644
index 0000000..706da88
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-5.c
@@ -0,0 +1,71 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M][M], d[M][M];
+void __attribute__((noinline))
+matrix_mul_1 (int n)
+{
+  for (int i = 0; i < n; i++)
+    for (int j = 0; j < n; j++)
+      for (int k = 0; k < n; k++)
+        c[i][j] = c[i][j] + a[i][k]*b[k][j];
+}
+
+void __attribute__((noinline))
+matrix_mul_2 (int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      for (int j = 0; j < n; j++)
+	{
+	  for (int k = 0; k < n; k++)
+	    d[i][j] = d[i][j] + a[i][k]*b[k][j];
+
+	  asm volatile ("" ::: "memory");
+	}
+      asm volatile ("" ::: "memory");
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+      c[i][j] = 0;
+      d[i][j] = 0;
+    }
+}
+
+static int __attribute__((noinline))
+check (int i)
+{
+  for (int j = 0; j < M; j++)
+    if (c[i][j] != d[i][j])
+      return 0;
+
+  return 1;
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  matrix_mul_1 (M);
+  matrix_mul_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (!check (i))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange" } } */
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is not interchanged" 1 "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-6.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-6.c
new file mode 100644
index 0000000..97555ed
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-6.c
@@ -0,0 +1,70 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M][M], d[M][M];
+void __attribute__((noinline))
+matrix_mul_1 (int n)
+{
+    for (int j = 0; j < n; j++)
+      for (int k = 0; k < n; k++)
+  for (int i = 0; i < n; i++)
+        c[i][j] = c[i][j] + a[i][k]*b[k][j];
+}
+
+void __attribute__((noinline))
+matrix_mul_2 (int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      for (int j = 0; j < n; j++)
+	{
+	  for (int k = 0; k < n; k++)
+	    d[i][j] = d[i][j] + a[i][k]*b[k][j];
+
+	  asm volatile ("" ::: "memory");
+	}
+      asm volatile ("" ::: "memory");
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+      c[i][j] = 0;
+      d[i][j] = 0;
+    }
+}
+
+static int __attribute__((noinline))
+check (int i)
+{
+  for (int j = 0; j < M; j++)
+    if (c[i][j] != d[i][j])
+      return 0;
+
+  return 1;
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  matrix_mul_1 (M);
+  matrix_mul_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (!check (i))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 2 "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-7.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-7.c
new file mode 100644
index 0000000..b93ca78
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-7.c
@@ -0,0 +1,70 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M][M], d[M][M];
+void __attribute__((noinline))
+matrix_mul_1 (int n)
+{
+      for (int k = 0; k < n; k++)
+    for (int j = 0; j < n; j++)
+  for (int i = 0; i < n; i++)
+        c[i][j] = c[i][j] + a[i][k]*b[k][j];
+}
+
+void __attribute__((noinline))
+matrix_mul_2 (int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      for (int j = 0; j < n; j++)
+	{
+	  for (int k = 0; k < n; k++)
+	    d[i][j] = d[i][j] + a[i][k]*b[k][j];
+
+	  asm volatile ("" ::: "memory");
+	}
+      asm volatile ("" ::: "memory");
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+      c[i][j] = 0;
+      d[i][j] = 0;
+    }
+}
+
+static int __attribute__((noinline))
+check (int i)
+{
+  for (int j = 0; j < M; j++)
+    if (c[i][j] != d[i][j])
+      return 0;
+
+  return 1;
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  matrix_mul_1 (M);
+  matrix_mul_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (!check (i))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 2 "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-8.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-8.c
new file mode 100644
index 0000000..29f5917
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-8.c
@@ -0,0 +1,70 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M][M], d[M][M];
+void __attribute__((noinline))
+matrix_mul_1 (int n)
+{
+  for (int i = 0; i < n; i++)
+      for (int k = 0; k < n; k++)
+    for (int j = 0; j < n; j++)
+        c[i][j] = c[i][j] + a[i][k]*b[k][j];
+}
+
+void __attribute__((noinline))
+matrix_mul_2 (int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      for (int j = 0; j < n; j++)
+	{
+	  for (int k = 0; k < n; k++)
+	    d[i][j] = d[i][j] + a[i][k]*b[k][j];
+
+	  asm volatile ("" ::: "memory");
+	}
+      asm volatile ("" ::: "memory");
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+      c[i][j] = 0;
+      d[i][j] = 0;
+    }
+}
+
+static int __attribute__((noinline))
+check (int i)
+{
+  for (int j = 0; j < M; j++)
+    if (c[i][j] != d[i][j])
+      return 0;
+
+  return 1;
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  matrix_mul_1 (M);
+  matrix_mul_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (!check (i))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-not "Loop_pair<outer:., inner:.> is interchanged" "linterchange" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-9.c b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-9.c
new file mode 100644
index 0000000..d6a3f5c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/loop-interchange-9.c
@@ -0,0 +1,62 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -ftree-loop-interchange -fdump-tree-linterchange-details" } */
+
+#define M 256
+int a[M][M], b[M][M], c[M], d[M];
+void __attribute__((noinline))
+simple_reduc_1 (int n)
+{
+  for (int j = 0; j < n; j++)
+    {
+      int sum = c[j];
+      for (int i = 0; i < n; i++)
+	sum = sum + a[i][j]*b[i][j];
+
+      c[j] = sum;
+    }
+}
+
+void __attribute__((noinline))
+simple_reduc_2 (int n)
+{
+  for (int j = 0; j < n; j++)
+    {
+      int sum = d[j];
+      for (int i = 0; i < n; i++)
+	sum = sum + a[i][j]*b[i][j];
+
+      asm volatile ("" ::: "memory");
+      d[j] = sum;
+    }
+}
+
+extern void abort ();
+
+static void __attribute__((noinline))
+init (int i)
+{
+  c[i] = 0;
+  d[i] = 0;
+  for (int j = 0; j < M; j++)
+    {
+      a[i][j] = i;
+      b[i][j] = j;
+    }
+}
+
+int main (void)
+{
+  for (int i = 0; i < M; ++i)
+    init (i);
+
+  simple_reduc_1 (M);
+  simple_reduc_2 (M);
+
+  for (int i = 0; i < M; ++i)
+    if (c[i] != d[i])
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange" } } */
diff --git a/gcc/timevar.def b/gcc/timevar.def
index 8cec6af..730a1dc 100644
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -184,6 +184,7 @@ DEFTIMEVAR (TV_TREE_LOOP	     , "tree loop optimization")
 DEFTIMEVAR (TV_TREE_NOLOOP           , "loopless fn")
 DEFTIMEVAR (TV_TREE_LOOP_BOUNDS	     , "tree loop bounds")
 DEFTIMEVAR (TV_LIM                   , "tree loop invariant motion")
+DEFTIMEVAR (TV_LINTERCHANGE          , "tree loop interchange")
 DEFTIMEVAR (TV_TREE_LOOP_IVCANON     , "tree canonical iv")
 DEFTIMEVAR (TV_SCEV_CONST            , "scev constant prop")
 DEFTIMEVAR (TV_TREE_LOOP_UNSWITCH    , "tree loop unswitching")
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 9777308..18b06bd 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -368,6 +368,7 @@ extern gimple_opt_pass *make_pass_tree_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_no_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_init (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_lim (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_linterchange (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_unswitch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_split (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_predcom (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop-interchange.cc b/gcc/tree-ssa-loop-interchange.cc
new file mode 100644
index 0000000..abffbf6
--- /dev/null
+++ b/gcc/tree-ssa-loop-interchange.cc
@@ -0,0 +1,2274 @@
+/* Loop invariant motion.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "tree.h"
+#include "gimple.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "gimple-pretty-print.h"
+#include "fold-const.h"
+#include "gimplify.h"
+#include "gimple-iterator.h"
+#include "cfgloop.h"
+#include "params.h"
+#include "tree-ssa.h"
+#include "tree-scalar-evolution.h"
+#include "tree-ssa-loop-manip.h"
+#include "tree-ssa-loop-niter.h"
+#include "tree-ssa-loop-ivopts.h"
+#include "tree-data-ref.h"
+
+/* This pass performs loop interchange: for example, the loop nest
+
+   for (int j = 0; j < N; j++)
+     for (int k = 0; k < N; k++)
+       for (int i = 0; i < N; i++)
+	 c[i][j] = c[i][j] + a[i][k]*b[k][j];
+
+   is transformed to
+
+   for (int i = 0; i < N; i++)
+     for (int j = 0; j < N; j++)
+       for (int k = 0; k < N; k++)
+	 c[i][j] = c[i][j] + a[i][k]*b[k][j];
+
+   This pass implements loop interchange in the following steps:
+
+     1) Find perfect loop nest for each innermost loop and compute data
+	dependence relations for it.  For above example, loop nest is
+	<loop_j, loop_k, loop_i>.
+     2) From innermost to outermost loop, this pass tries to interchange
+	each loop pair.  For above case, it firstly tries to interchange
+	<loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
+	Then it tries to interchange <loop_j, loop_i> and loop nest becomes
+	<loop_i, loop_j, loop_k>.  The overall effect is to move innermost
+	loop to the outermost position.  For loop pair <loop_i, loop_j>
+	to be interchanged, we:
+     3) Check if data dependence relations are valid for loop interchange.
+     4) Check if both loops can be interchanged in terms of transformation.
+     5) Check if interchanging the two loops is profitable.
+     6) Interchange the two loops by mapping induction variables.
+
+   This pass also handles reductions in loop nest.  So far we only support
+   simple reduction of inner loop and double reduction of the loop nest.  */
+
+/* Maximum number of stmts in each loop that should be interchanged.  */
+#define MAX_NUM_STMT    (PARAM_VALUE (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS))
+
+/* Default average number of loop iterations.  */
+#define AVG_LOOP_NITER    (PARAM_VALUE (PARAM_AVG_LOOP_NITER))
+
+/* Comparison ratio of access stride between inner/outer loops to be
+   interchanged.  This is the minimum stride ratio for loop interchange
+   to be profitable.  */
+#define OUTER_STRIDE_RATIO  (PARAM_VALUE (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO))
+/* The same as above, but we require higher ratio for interchanging the
+   innermost two loops.  */
+#define INNER_STRIDE_RATIO  ((OUTER_STRIDE_RATIO) + 1)
+
+/* Vector of strides that DR accesses in each level loop of a loop nest.  */
+#define DR_ACCESS_STRIDE(dr) ((vec<tree> *) dr->aux)
+
+/* Structure recording loop induction variable.  */
+typedef struct induction
+{
+  /* IV itself.  */
+  tree var;
+  /* Initializer.  */
+  tree init;
+  /* IV's base and step part of SCEV.  */
+  tree base;
+  tree step;
+  /* Mapped IV variabled used for interchanging loops.  */
+  tree mapped_var;
+}*induction_p;
+
+/* Enum type for loop reduction variable.  */
+enum reduction_type
+{
+  UNKNOWN_RTYPE = 0,
+  SIMPLE_RTYPE,
+  DOUBLE_RTYPE
+};
+
+/* Structure recording loop reduction variable.  */
+typedef struct reduction
+{
+  /* Reduction itself.  */
+  tree var;
+  /* PHI node defining reduction variable.  */
+  gphi *phi;
+  /* Init and next variables of the reduction.  */
+  tree init;
+  tree next;
+  /* Lcssa PHI node if reduction is used outside of its definition loop.  */
+  gphi *lcssa_phi;
+  /* Stmts defining init and next.  */
+  gimple *producer;
+  gimple *consumer;
+  /* If init is loaded from memory, this is the loading memory reference.  */
+  tree init_ref;
+  /* If reduction is finally stored to memory, this is the stored memory
+     reference.  */
+  tree fini_ref;
+  enum reduction_type type;
+}*reduction_p;
+
+
+/* Dump reduction RE.  */
+
+static void
+dump_reduction (reduction_p re)
+{
+  if (re->type == SIMPLE_RTYPE)
+    fprintf (dump_file, "  Simple reduction:  ");
+  else if (re->type == DOUBLE_RTYPE)
+    fprintf (dump_file, "  Double reduction:  ");
+  else
+    fprintf (dump_file, "  Unknown reduction:  ");
+
+  print_gimple_stmt (dump_file, re->phi, 0);
+}
+
+/* Dump LOOP's induction IV.  */
+static void
+dump_induction (struct loop *loop, induction_p iv)
+{
+  fprintf (dump_file, "  Induction:  ");
+  print_generic_expr (dump_file, iv->var, TDF_SLIM);
+  fprintf (dump_file, " = {");
+  print_generic_expr (dump_file, iv->base, TDF_SLIM);
+  fprintf (dump_file, ", ");
+  print_generic_expr (dump_file, iv->step, TDF_SLIM);
+  fprintf (dump_file, "}_%d\n", loop->num);
+}
+
+/* Loop candidate for interchange.  */
+
+struct loop_cand
+{
+  loop_cand (unsigned, struct loop *, struct loop *);
+  ~loop_cand ();
+
+  reduction_p find_reduction_by_init (tree);
+  reduction_p find_reduction_by_stmt (gimple *);
+  void classify_simple_reduction (reduction_p);
+  bool analyze_iloop_reduction_var (tree);
+  bool analyze_oloop_reduction_var (loop_cand *, tree);
+  bool analyze_reduction_var (loop_cand *, tree);
+  bool analyze_induction_var (tree, tree);
+  bool analyze_carried_vars (loop_cand *);
+  bool analyze_lcssa_phis (void);
+  bool can_interchange_p (loop_cand *);
+  bool unsupported_operation (basic_block, loop_cand *);
+  void undo_simple_reduction (reduction_p);
+  void eliminate_dead_code (void);
+
+  /* The index of this loop in the whole loop nest vector.  */
+  unsigned idx;
+  /* The loop itself.  */
+  struct loop *loop;
+  /* The outer loop for interchange.  It equals to loop if this loop cand
+     itself represents the outer loop.  */
+  struct loop *outer;
+  /* Vector of induction variables in loop.  */
+  vec<induction_p> inductions;
+  /* Vector of reduction variables in loop.  */
+  vec<reduction_p> reductions;
+  /* Lcssa PHI nodes of this loop.  */
+  vec<gphi *> lcssa_nodes;
+  /* # of iterations of this loop.  */
+  tree niters;
+  /* Single exit edge of this loop.  */
+  edge exit;
+  /* Basic blocks of this loop.  */
+  basic_block *bbs;
+  /* # of stmts of this loop.  */
+  int num_stmts;
+};
+
+/* Constructor.  */
+
+loop_cand::loop_cand (unsigned idx, struct loop *loop, struct loop *outer)
+  : idx (idx), loop (loop), outer (outer),
+    niters (unshare_expr (number_of_latch_executions (loop))),
+    exit (single_exit (loop)), bbs (get_loop_body (loop)), num_stmts (0)
+{
+    inductions.create (3);
+    reductions.create (3);
+    lcssa_nodes.create (3);
+}
+
+/* Destructor.  */
+
+loop_cand::~loop_cand ()
+{
+  induction_p iv;
+  for (unsigned i = 0; inductions.iterate (i, &iv); ++i)
+    free (iv);
+
+  reduction_p re;
+  for (unsigned i = 0; reductions.iterate (i, &re); ++i)
+    free (re);
+
+  inductions.release ();
+  reductions.release ();
+  lcssa_nodes.release ();
+  free (bbs);
+}
+
+/* Return single use stmt of VAR in LOOP, otherwise return NULL.  */
+
+static gimple *
+single_use_in_loop (tree var, struct loop *loop)
+{
+  gimple *stmt, *res = NULL;
+  use_operand_p use_p;
+  imm_use_iterator iterator;
+
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+	continue;
+
+      if (res)
+	return NULL;
+
+      res = stmt;
+    }
+  return res;
+}
+
+/* Return true if E is unsupported in loop interchange, i.e, E is a complex
+   edge or part of irreducible loop.  */
+
+static inline bool
+unsupported_edge (edge e)
+{
+  return (e->flags & (EDGE_COMPLEX | EDGE_IRREDUCIBLE_LOOP));
+}
+
+/* Return true if PHI is unsupported in loop interchange, i.e, PHI contains
+   ssa var appearing in any abnormal phi node.  */
+
+static inline bool
+unsupported_phi_node (gphi *phi)
+{
+  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi)))
+    return true;
+
+  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
+    {
+      tree arg = PHI_ARG_DEF (phi, i);
+      if (TREE_CODE (arg) == SSA_NAME
+	  && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg))
+	return true;
+    }
+
+  return false;
+}
+
+/* Return reduction whose init variable is VAR, otherwise return NULL.  */
+
+reduction_p
+loop_cand::find_reduction_by_init (tree var)
+{
+  reduction_p re;
+
+  for (unsigned i = 0; reductions.iterate (i, &re); ++i)
+    if (re->init == var || operand_equal_p (re->init, var, 0))
+      return re;
+
+  return NULL;
+}
+
+/* Return the reduction if STMT is one of its lcssa PHI, producer or consumer
+   stmt.  */
+
+reduction_p
+loop_cand::find_reduction_by_stmt (gimple *stmt)
+{
+  gphi *phi = NULL;
+  reduction_p re;
+
+  if (is_a <gphi *> (stmt))
+    phi = as_a <gphi *> (stmt);
+
+  for (unsigned i = 0; reductions.iterate (i, &re); ++i)
+    if ((phi != NULL && phi == re->lcssa_phi)
+	|| (stmt == re->producer || stmt == re->consumer))
+      return re;;
+
+  return NULL;
+}
+
+/* Return true if all stmts in BB can be supported by loop interchange,
+   otherwise return false.  ILOOP is not NULL if this loop_cand is the
+   outer loop in loop nest.  */
+
+bool
+loop_cand::unsupported_operation (basic_block bb, loop_cand *iloop)
+{
+  int bb_num_stmts = 0;
+  gphi_iterator psi;
+  gimple_stmt_iterator gsi;
+
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple *stmt = gsi_stmt (gsi);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      if (gimple_has_volatile_ops (stmt)
+	  || gimple_has_side_effects (stmt))
+	return false;
+
+      bb_num_stmts++;
+      if (is_gimple_call (stmt))
+	{
+	  int cflags = gimple_call_flags (stmt);
+	  /* Only support const/pure calls.  */
+	  if (!(cflags & (ECF_CONST | ECF_PURE)))
+	    return false;
+
+	  /* In basic block of outer loop, the call should be cheap since
+	     it will be moved to inner loop.  */
+	  if (iloop != NULL
+	      && !gimple_inexpensive_call_p (as_a <gcall *> (stmt)))
+	    return false;
+
+	  continue;
+	}
+
+      if (!iloop || !gimple_vuse (stmt))
+	continue;
+
+      /* Support stmt accessing memory in outer loop only if it is for inner
+	 loop's reduction.  */
+      if (iloop->find_reduction_by_stmt (stmt))
+	continue;
+
+      tree lhs;
+      /* Or it's invariant memory reference and only used by inner loop.  */
+      if (gimple_assign_single_p (stmt)
+	  && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
+	  && TREE_CODE (lhs) == SSA_NAME
+	  && single_use_in_loop (lhs, iloop->loop))
+	continue;
+
+      return false;
+    }
+  num_stmts += bb_num_stmts;
+
+  /* Allow PHI nodes in any basic block of inner loop, or PHI nodes in
+     (outer) loop's header.  */
+  if (!iloop || bb == loop->header)
+    return true;
+
+  for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
+    {
+      gphi *phi = psi.phi ();
+
+      if (unsupported_phi_node (phi))
+	return false;
+
+      if (virtual_operand_p (PHI_RESULT (phi)))
+	continue;
+
+      /* For outer loop, we only support PHI in loop header and lcssa PHI
+	 of inner loop's reduction.  */
+      if (!iloop->find_reduction_by_stmt (phi))
+	return false;
+    }
+  return true;
+}
+
+/* Return true if current loop_cand be interchanged.  ILOOP is not NULL if
+   current loop_cand is outer loop in loop nest.  */
+
+bool
+loop_cand::can_interchange_p (loop_cand *iloop)
+{
+  /* For now we only support at most one reduction.  */
+  unsigned allowed_reduction_num = 1;
+
+  /* Only support reduction if the loop nest to be interchanged is the
+     innermostin two loops.  */
+  if ((iloop == NULL && loop->inner != NULL)
+       || (iloop != NULL && iloop->loop->inner != NULL))
+    allowed_reduction_num = 0;
+
+  if (reductions.length () > allowed_reduction_num
+      || (reductions.length () == 1
+	  && reductions[0]->type == UNKNOWN_RTYPE))
+    return false;
+
+  /* Only support lcssa PHI node which is for reduction.  */
+  if (lcssa_nodes.length () > allowed_reduction_num)
+    return false;
+
+  /* Check basic blocks other than loop header/exit.  */
+  for (unsigned i = 0; i < loop->num_nodes; i++)
+    {
+      basic_block bb = bbs[i];
+
+      /* Skip basic blocks of inner loops.  */
+      if (bb->loop_father != loop)
+	continue;
+
+      /* Check if basic block has any unsupported operation.  */
+      if (!unsupported_operation (bb, iloop))
+	return false;
+
+      /* Check if loop has too many stmts.  */
+      if (num_stmts > MAX_NUM_STMT)
+	return false;
+    }
+
+  return true;
+}
+
+/* Classify if reduction RE is a simple one.  */
+
+void
+loop_cand::classify_simple_reduction (reduction_p re)
+{
+  gimple *producer, *consumer;
+  enum tree_code code;
+  tree lhs, rhs;
+
+  /* Check init variable of reduction and how it is initialized.  */
+  if (TREE_CODE (re->init) == SSA_NAME)
+    {
+      producer = SSA_NAME_DEF_STMT (re->init);
+      re->producer = producer;
+      basic_block bb = gimple_bb (producer);
+      if (!bb || bb->loop_father != outer)
+	return;
+
+      if (!is_gimple_assign (producer))
+	return;
+
+      code = gimple_assign_rhs_code (producer);
+      if (get_gimple_rhs_class (code) != GIMPLE_SINGLE_RHS)
+	return;
+
+      if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE
+	  || lhs != re->init)
+	return;
+
+      if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE
+	  || !REFERENCE_CLASS_P (rhs))
+	return;
+
+      re->init_ref = rhs;
+    }
+  else if (!CONSTANT_CLASS_P (re->init))
+    return;
+
+  /* Check how reduction variable is used.  Note usually reduction variable
+     is used outside of its defining loop, we don't require that in terms
+     loop interchange.  */
+  if (re->lcssa_phi == NULL)
+    consumer = single_use_in_loop (re->next, loop);
+  else
+    consumer = single_use_in_loop (PHI_RESULT (re->lcssa_phi), outer);
+
+  if (consumer == NULL)
+    return;
+
+  re->consumer = consumer;
+
+  if (!is_gimple_assign (consumer))
+    return;
+
+  code = gimple_assign_rhs_code (consumer);
+  if (get_gimple_rhs_class (code) != GIMPLE_SINGLE_RHS)
+    return;
+
+  if ((rhs = gimple_assign_rhs1 (consumer)) == NULL_TREE
+      || rhs != PHI_RESULT (re->lcssa_phi))
+    return;
+
+  if ((lhs = gimple_assign_lhs (consumer)) == NULL_TREE
+      || !REFERENCE_CLASS_P (lhs))
+    return;
+
+  re->fini_ref = lhs;
+
+  /* Simple reduction with constant initializer.  */
+  if (re->init_ref == NULL_TREE)
+    {
+      gcc_assert (CONSTANT_CLASS_P (re->init));
+      re->init_ref = unshare_expr (re->fini_ref);
+    }
+
+  /* Require memory references in producer and consumer are the same so
+     that we can undo reduction during interchange.  */
+  if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0))
+    return;
+
+  re->type = SIMPLE_RTYPE;
+}
+
+/* Analyze reduction variable VAR.  ILOOP is not NULL if current loop_cand
+   is outer loop in loop nest.  Return true if analysis succeeds.  */
+
+bool
+loop_cand::analyze_reduction_var (loop_cand *iloop, tree var)
+{
+  if (iloop != NULL)
+    return analyze_oloop_reduction_var (iloop, var);
+  else
+    return analyze_iloop_reduction_var (var);
+}
+
+/* Analyze reduction variable VAR for inner loop of the loop nest to be
+   interchanged.  Return true if analysis succeeds.  */
+
+bool
+loop_cand::analyze_iloop_reduction_var (tree var)
+{
+  gphi *phi = as_a <gphi *> (SSA_NAME_DEF_STMT (var));
+  gphi *lcssa_phi = NULL, *use_phi;
+  tree init = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
+  tree next = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
+  edge e = single_exit (loop);
+  reduction_p re;
+  gimple *stmt, *next_def, *single_use = NULL;
+  use_operand_p use_p;
+  imm_use_iterator iterator;
+
+  if (TREE_CODE (next) != SSA_NAME)
+    return false;
+
+  next_def = SSA_NAME_DEF_STMT (next);
+  basic_block bb = gimple_bb (next_def);
+  if (!bb || !flow_bb_inside_loop_p (loop, bb))
+    return false;
+
+  /* In restricted reduction, the var is (and must be) used in defining
+     the updated var.  The process can be depicted as below:
+
+		var ;; = PHI<init, next>
+		 |
+		 |
+		 v
+      +---------------------+
+      | reduction operators | <-- other operands
+      +---------------------+
+		 |
+		 |
+		 v
+		next
+
+     In terms loop interchange, we don't change how NEXT is computed based
+     on VAR and OTHER OPERANDS.  In case of double reduction in loop nest
+     to be interchanged, we don't changed it at all.  In the case of simple
+     reduction in inner loop, we only make change how VAR/NEXT is loaded or
+     stored.  With these conditions, we can relax restrictions on reduction
+     in a way that reduction operation is seen as black box.  In general,
+     we can ignore reassociation of reduction operator; we can handle fake
+     reductions in which VAR is not even used to compute NEXT.  */
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+	return false;
+
+      if (single_use != NULL)
+	return false;
+
+      single_use = stmt;
+    }
+
+  if (single_use != next_def
+      && !stmt_dominates_stmt_p (single_use, next_def))
+    return false;
+
+  /* Only support cases in which INIT is used in inner loop.  */
+  if (TREE_CODE (init) == SSA_NAME)
+    FOR_EACH_IMM_USE_FAST (use_p, iterator, init)
+      {
+	stmt = USE_STMT (use_p);
+	if (is_gimple_debug (stmt))
+	  continue;
+
+	if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+	  return false;
+      }
+
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, next)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      /* Or else it's used in PHI itself.  */
+      use_phi = NULL;
+      if (is_a <gphi *> (stmt)
+	  && (use_phi = as_a <gphi *> (stmt)) != NULL
+	  && use_phi == phi)
+	continue;
+
+      if (use_phi != NULL
+	  && lcssa_phi == NULL
+	  && gimple_bb (stmt) == e->dest
+	  && PHI_ARG_DEF_FROM_EDGE (use_phi, e) == next)
+	lcssa_phi = use_phi;
+      else
+	return false;
+    }
+  re = XCNEW (struct reduction);
+  re->var = var;
+  re->init = init;
+  re->next = next;
+  re->phi = phi;
+  re->lcssa_phi = lcssa_phi;
+
+  classify_simple_reduction (re);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    dump_reduction (re);
+
+  reductions.safe_push (re);
+  return true;
+}
+
+/* Analyze reduction variable VAR for outer loop of the loop nest to be
+   interchanged.  ILOOP is not NULL and points to inner loop.  For the
+   moment, we only support double reduction for outer loop, like:
+
+     for (int i = 0; i < n; i++)
+       {
+	 int sum = 0;
+
+	 for (int j = 0; j < n; j++)    // outer loop
+	   for (int k = 0; k < n; k++)  // inner loop
+	     sum += a[i][k]*b[k][j];
+
+	 s[i] = sum;
+       }
+
+   Note the innermost two loops are the loop nest to be interchanged.
+   Return true if analysis succeeds.  */
+
+bool
+loop_cand::analyze_oloop_reduction_var (loop_cand *iloop, tree var)
+{
+  gphi *phi = as_a <gphi *> (SSA_NAME_DEF_STMT (var));
+  gphi *lcssa_phi = NULL, *use_phi;
+  tree init = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
+  tree next = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
+  edge e = single_exit (loop);
+  reduction_p re;
+  gimple *stmt, *next_def;
+  use_operand_p use_p;
+  imm_use_iterator iterator;
+
+  if (TREE_CODE (next) != SSA_NAME)
+    return false;
+
+  next_def = SSA_NAME_DEF_STMT (next);
+  basic_block bb = gimple_bb (next_def);
+  if (!bb || !flow_bb_inside_loop_p (loop, bb))
+    return false;
+
+  /* Find inner loop's simple reduction that uses var as initializer.  */
+  reduction_p inner_re = iloop->find_reduction_by_init (var);
+  if (inner_re == NULL
+      || inner_re->type != UNKNOWN_RTYPE
+      || inner_re->producer != phi)
+    return false;
+
+  /* In case of double reduction, outer loop's reduction should be updated
+     by inner loop's simple reduction.  */
+  if (next_def != inner_re->lcssa_phi)
+    return false;
+
+  /* Outer loop's reduction should only be used to initialize inner loop's
+     simple reduction.  */
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+	return false;
+
+      if (! is_a <gphi *> (stmt)
+	  || (use_phi = as_a <gphi *> (stmt)) == NULL
+	  || use_phi != inner_re->phi)
+	return false;
+    }
+
+  /* Check this reduction is correctly used outside of loop via lcssa phi.  */
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, next)
+    {
+      stmt = USE_STMT (use_p);
+      if (is_gimple_debug (stmt))
+	continue;
+
+      /* Or else it's used in PHI itself.  */
+      use_phi = NULL;
+      if (is_a <gphi *> (stmt)
+	  && (use_phi = as_a <gphi *> (stmt)) != NULL
+	  && use_phi == phi)
+	continue;
+
+      if (lcssa_phi == NULL
+	  && use_phi != NULL
+	  && gimple_bb (stmt) == e->dest
+	  && PHI_ARG_DEF_FROM_EDGE (use_phi, e) == next)
+	lcssa_phi = use_phi;
+      else
+	return false;
+    }
+
+  re = XCNEW (struct reduction);
+  re->var = var;
+  re->init = init;
+  re->next = next;
+  re->phi = phi;
+  re->lcssa_phi = lcssa_phi;
+  re->type = DOUBLE_RTYPE;
+  inner_re->type = DOUBLE_RTYPE;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    dump_reduction (re);
+
+  reductions.safe_push (re);
+  return true;
+}
+
+/* Return true if VAR is induction variable of current loop whose scev is
+   specified by CHREC.  */
+
+bool
+loop_cand::analyze_induction_var (tree var, tree chrec)
+{
+  gphi *phi = as_a <gphi *> (SSA_NAME_DEF_STMT (var));
+  tree init = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
+
+  /* Var is loop invariant, though it's unlikely to happen.  */
+  if (tree_does_not_contain_chrecs (chrec))
+    {
+      struct induction *iv = XCNEW (struct induction);
+      iv->var = var;
+      iv->init = init;
+      iv->base = chrec;
+      iv->step = build_int_cst (TREE_TYPE (chrec), 0);
+      inductions.safe_push (iv);
+      return true;
+    }
+
+  if (TREE_CODE (chrec) != POLYNOMIAL_CHREC
+      || CHREC_VARIABLE (chrec) != (unsigned) loop->num
+      || tree_contains_chrecs (CHREC_LEFT (chrec), NULL)
+      || tree_contains_chrecs (CHREC_RIGHT (chrec), NULL))
+    return false;
+
+  struct induction *iv = XCNEW (struct induction);
+  iv->var = var;
+  iv->init = init;
+  iv->base = CHREC_LEFT (chrec);
+  iv->step = CHREC_RIGHT (chrec);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    dump_induction (loop, iv);
+
+  inductions.safe_push (iv);
+  return true;
+}
+
+/* Return true if all loop carried variables defined in loop header can
+   be successfully analyzed.  */
+
+bool
+loop_cand::analyze_carried_vars (loop_cand *iloop)
+{
+  edge e = loop_preheader_edge (outer);
+  gphi_iterator gsi;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "\nLoop(%d) carried vars:\n", loop->num);
+
+  for (gsi = gsi_start_phis (loop->header); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gphi *phi = gsi.phi ();
+
+      if (unsupported_phi_node (phi))
+	return false;
+
+      tree var = PHI_RESULT (phi);
+      if (virtual_operand_p (var))
+	continue;
+
+      tree chrec = analyze_scalar_evolution (loop, var);
+      chrec = instantiate_scev (e, loop, chrec);
+
+      /* Analyze var as reduction variable.  */
+      if (chrec_contains_undetermined (chrec)
+	  || chrec_contains_symbols_defined_in_loop (chrec, outer->num))
+	{
+	  if (!analyze_reduction_var (iloop, var))
+	    return false;
+	}
+      /* Analyze var as induction variable.  */
+      else if (!analyze_induction_var (var, chrec))
+	return false;
+    }
+
+  return true;
+}
+
+/* Return TRUE if loop closed PHI nodes can be analyzed successfully.  */
+
+bool
+loop_cand::analyze_lcssa_phis (void)
+{
+  edge e = single_exit (loop);
+  gphi_iterator gsi;
+
+  for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gphi *phi = gsi.phi ();
+
+      if (unsupported_phi_node (phi))
+	return false;
+
+      if (virtual_operand_p (PHI_RESULT (phi)))
+	continue;
+
+      /* TODO: We only support lcssa phi for reduction for now.  */
+      if (!find_reduction_by_stmt (phi))
+	return false;
+    }
+
+  return true;
+}
+
+/* Find all stmts on which CONSUMER depends in basic block BB, put the
+   result stmts in sequence STMTS.  */
+
+static void
+find_deps_in_bb_for_stmt (gimple_seq *stmts, basic_block bb, gimple *consumer)
+{
+  auto_vec<gimple *> worklist;
+  use_operand_p use_p;
+  ssa_op_iter iter;
+  gimple *stmt;
+
+  worklist.safe_push (consumer);
+  while (!worklist.is_empty ())
+    {
+      stmt = worklist.pop ();
+      FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_USE)
+	{
+	  gimple *def_stmt = SSA_NAME_DEF_STMT (USE_FROM_PTR (use_p));
+
+	  if (is_a <gphi *> (def_stmt))
+	    continue;
+
+	  if (gimple_bb (def_stmt) != bb)
+	    continue;
+
+	  if (gimple_plf (def_stmt, GF_PLF_1))
+	    continue;
+
+	  worklist.safe_push (def_stmt);
+	}
+      gimple_set_plf (stmt, GF_PLF_1, true);
+    }
+  for (gimple_stmt_iterator gsi = gsi_start_bb_nondebug (bb); !gsi_end_p (gsi);)
+    {
+      stmt = gsi_stmt (gsi);
+      if (stmt == consumer)
+	break;
+      if (gimple_plf (stmt, GF_PLF_1))
+	{
+	  gsi_remove (&gsi, false, false);
+	  gimple_seq_add_stmt (stmts, stmt);
+	}
+      else
+	gsi_next_nondebug (&gsi);
+    }
+}
+
+/* Given inner loop with simple reduction as below:
+
+   for (i = 0; i < N; i++)
+     for (j = 0; j < N; j++)
+       {
+	 int red = c[i][j];           // producer
+	 for (k = 0; k < N; k++)
+	   red += a[i][k] * b[k][j];
+
+	 c[i][j] = red;               // consumer
+       }
+
+   This function undo the reduction and generates below loop nest:
+
+   for (i = 0; i < N; i++)
+     for (j = 0; j < N; j++)
+       {
+	 for (k = 0; k < N; k++)
+	   c[i][j] += a[i][k] * b[k][j];
+       }
+
+   This basically reverts transformation done by LIM or PRE.  */
+
+void
+loop_cand::undo_simple_reduction (reduction_p re)
+{
+  gimple *phi = SSA_NAME_DEF_STMT (re->var);
+  gimple_stmt_iterator from, gsi = gsi_after_labels (loop->header);
+
+  if (re->producer != NULL)
+    {
+      /* Move producer stmt into inner loop.  */
+      if (gimple_vuse (re->producer))
+	gimple_set_vuse (re->producer, NULL_TREE);
+      reset_debug_uses (re->producer);
+      gimple_assign_set_lhs (re->producer, re->var);
+      update_stmt(re->producer);
+      from = gsi_for_stmt (re->producer);
+      gsi_remove (&from, false, false);
+      gsi_insert_before (&gsi, re->producer, GSI_SAME_STMT);
+
+      /* Replace all uses of init in loop.  */
+      gcc_assert (TREE_CODE (re->init) == SSA_NAME);
+      gimple *stmt;
+      use_operand_p use_p;
+      imm_use_iterator iterator;
+      FOR_EACH_IMM_USE_STMT (stmt, iterator, re->init)
+	{
+	  FOR_EACH_IMM_USE_ON_STMT (use_p, iterator)
+	    SET_USE (use_p, re->var);
+
+	  update_stmt (stmt);
+	}
+    }
+  else
+    {
+      /* Transform simple reduction of below form:
+
+	   init = 0;
+	   loop:
+	     var = phi<init, next>
+	     next = var op ...
+	   reduc_sum = phi<next>
+	   MEM_REF[...] = reduc_sum
+
+	 into:
+
+	   init = 0;
+	   loop:
+	     tmp = MEM_REF[...]
+	     var = !first_iteration ? tmp : init
+	     next = var op ...
+	     MEM_REF[...] = next
+	   reduc_sum = phi<next>
+
+        Note constant init value is used in the first iteration.  */
+
+      /* Find all stmts on which consumer depends because we generate new
+	 load stmt to TMP.  */
+      gimple_seq stmts = NULL;
+      find_deps_in_bb_for_stmt (&stmts, gimple_bb (re->consumer), re->consumer);
+
+      /* Generate stmt loading from mem reference to TMP variable.  */
+      tree tmp = copy_ssa_name (re->var);
+      gimple *stmt = gimple_build_assign (tmp, re->init_ref);
+      gimple_seq_add_stmt (&stmts, stmt);
+
+      /* Generate cond stmt so that constant init value is used in the first
+	 iteration.  */
+      induction_p iv = inductions[0];
+      tree cond = fold_build2 (NE_EXPR, boolean_type_node, iv->var, iv->init);
+      stmt = gimple_build_assign (re->var, COND_EXPR, cond, tmp, re->init);
+      gimple_seq_add_stmt (&stmts, stmt);
+
+      gsi_insert_seq_before (&gsi, stmts, GSI_SAME_STMT);
+    }
+
+  /* Delete loop header PHI node of reduction.  */
+  gsi = gsi_for_stmt (phi);
+  gsi_remove (&gsi, true);
+
+  /* Move consumer stmt into inner loop, just after its def.  */
+  if (gimple_vdef (re->consumer))
+    gimple_set_vuse (re->consumer, NULL_TREE);
+  if (gimple_vuse (re->consumer))
+    gimple_set_vuse (re->consumer, NULL_TREE);
+  gimple_assign_set_rhs1 (re->consumer, re->next);
+  update_stmt (re->consumer);
+  from = gsi_for_stmt (re->consumer);
+  gsi = gsi_for_stmt (SSA_NAME_DEF_STMT (re->next));
+  gsi_remove (&from, false, false);
+  gsi_insert_after (&gsi, re->consumer, GSI_NEW_STMT);
+
+  /* Delete loop closed PHI node of reduction.  */
+  gsi = gsi_for_stmt (re->lcssa_phi);
+  reset_debug_uses (re->lcssa_phi);
+  gsi_remove (&gsi, true);
+}
+
+/* Eliminate dead code after loop interchange.  */
+
+void
+loop_cand::eliminate_dead_code (void)
+{
+  /* Check basic blocks other than loop header/exit.  */
+  for (unsigned i = 0; i < loop->num_nodes; i++)
+    {
+      basic_block bb = bbs[i];
+
+      /* Skip basic blocks of inner loops.  */
+      if (bb->loop_father != loop)
+	continue;
+
+      for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+	{
+	  tree lhs;
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  /* Given copy propagation is done during interchange, we can
+	     simply check zero uses of var and eliminate it.  */
+	  if (is_gimple_assign (stmt)
+	      && !gimple_vuse (stmt)
+	      && !gimple_has_volatile_ops (stmt)
+	      && !gimple_has_side_effects (stmt)
+	      && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
+	      && TREE_CODE (lhs) == SSA_NAME
+	      && has_zero_uses (lhs))
+	    gsi_remove (&gsi, true);
+	  else
+	    gsi_next (&gsi);
+	}
+    }
+}
+
+/* Free DATAREFS and its auxiliary memory.  */
+
+static void
+free_data_refs_with_aux (vec<data_reference_p> datarefs)
+{
+  data_reference_p dr;
+  for (unsigned i = 0; datarefs.iterate (i, &dr); ++i)
+    if (dr->aux != NULL)
+      {
+	DR_ACCESS_STRIDE (dr)->release ();
+	free (dr->aux);
+      }
+
+  free_data_refs (datarefs);
+}
+
+/* Class for loop interchange transformation.  */
+
+class tree_loop_interchange
+{
+public:
+  tree_loop_interchange (vec<loop_p> loop_nest,
+			 vec<data_reference_p> datarefs, vec<ddr_p> ddrs)
+    : loop_nest (loop_nest), datarefs (datarefs),
+      ddrs (ddrs), niters_iv_var (NULL_TREE) { }
+  ~tree_loop_interchange () {
+    free_dependence_relations (ddrs);
+    free_data_refs_with_aux (datarefs);
+    loop_nest.release ();
+  }
+  bool interchange ();
+
+private:
+  void update_data_refs (unsigned, unsigned);
+  void update_data_deps (unsigned, unsigned);
+  bool valid_data_dependences (unsigned, unsigned);
+  bool can_interchange_loops (loop_cand &, loop_cand &);
+  void interchange_loops (loop_cand &, loop_cand &);
+  void interchange_reductions (loop_cand &, loop_cand &);
+  void interchange_inductions (loop_cand &, loop_cand &);
+  void map_inductions_to_loop (loop_cand &, loop_cand &);
+  void move_code_to_inner_loop (struct loop *, struct loop *, basic_block *);
+
+  /* Vector of the loop nest.  */
+  vec<struct loop *> loop_nest;
+  /* Vector of data references in loop nest.  */
+  vec<data_reference_p> datarefs;
+  /* Vector of data dependence relations in loop nest.  */
+  vec<ddr_p> ddrs;
+
+  /* We create new IV which is only used in loop's exit condition check.
+     In case of 3-level loop nest interchange, when we interchange the
+     innermost two loops, new IV created in the middle level loop does
+     not need to be preserved in interchanging the outermost two loops
+     later.  We record the IV so that it can be skipped.  */
+  tree niters_iv_var;
+};
+
+/* Update data refs' access stride after interchanging loops.  I_IDX/O_IDX
+   gives index of interchanged loops in loop nest.  */
+
+void
+tree_loop_interchange::update_data_refs (unsigned i_idx, unsigned o_idx)
+{
+  struct data_reference *dr;
+  for (unsigned i = 0; datarefs.iterate (i, &dr); ++i)
+    {
+      vec<tree> *stride = DR_ACCESS_STRIDE (dr);
+      gcc_assert (stride->length () > i_idx);
+      std::swap ((*stride)[i_idx], (*stride)[o_idx]);
+    }
+}
+
+/* Update data dependence relations after interchanging loops.  I_IDX/O_IDX
+   gives index of interchanged loops in loop nest, they are used to access
+   DIST_MATRIX.  */
+
+void
+tree_loop_interchange::update_data_deps (unsigned i_idx, unsigned o_idx)
+{
+  struct data_dependence_relation *ddr;
+
+  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
+    {
+      /* Skip no-dependence case.  */
+      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
+	continue;
+
+      for (unsigned j = 0; j < DDR_NUM_DIST_VECTS (ddr); ++j)
+	{
+	  lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
+	  std::swap (dist_vect[i_idx], dist_vect[o_idx]);
+	}
+    }
+}
+
+/* Check data dependence relations, return TRUE if it's valid to interchange
+   two loops specified by I_IDX/O_IDX.  Theoretically, interchanging the two
+   loops is valid only if dist vector, after interchanging, doesn't have '>'
+   as the leftmost non-'=' direction.  Practically, this function have been
+   conservative here by not checking some valid cases.  */
+
+bool
+tree_loop_interchange::valid_data_dependences (unsigned i_idx, unsigned o_idx)
+{
+  struct data_dependence_relation *ddr;
+
+  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
+    {
+      /* Skip no-dependence case.  */
+      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
+	continue;
+
+      for (unsigned j = 0; j < DDR_NUM_DIST_VECTS (ddr); ++j)
+	{
+	  lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
+	  unsigned level = dependence_level (dist_vect, loop_nest.length ());
+
+	  /* If there is no carried dependence.  */
+	  if (level == 0)
+	    continue;
+
+	  level --;
+
+	  /* If dependence is not carried by any loop in between the two
+	     loops [oloop, iloop] to interchange.  */
+	  if (level < o_idx || level > i_idx)
+	    continue;
+
+	  /* Be conservative, skip case if either direction at i_idx/o_idx
+	     levels is not '=' or '<'.  */
+	  if (dist_vect[i_idx] < 0 || dist_vect[o_idx] < 0)
+	    return false;
+	}
+    }
+
+  return true;
+}
+
+/* Return true if ILOOP and OLOOP can be interchanged in terms of code
+   transformation.  */
+
+bool
+tree_loop_interchange::can_interchange_loops (loop_cand &iloop,
+					      loop_cand &oloop)
+{
+  return (iloop.analyze_carried_vars (NULL)
+	  && iloop.analyze_lcssa_phis ()
+	  && oloop.analyze_carried_vars (&iloop)
+	  && oloop.analyze_lcssa_phis ()
+	  && iloop.can_interchange_p (NULL)
+	  && oloop.can_interchange_p (&iloop));
+}
+
+/* Interchange niters info of ILOOP and OLOOP while reset any other niters
+   estimates information for now.  */
+
+static inline void
+interchange_nb_iterations (struct loop *iloop, struct loop *oloop)
+{
+  tree nb_iterations = oloop->nb_iterations;
+
+  oloop->any_upper_bound = false;
+  oloop->any_likely_upper_bound = false;
+  free_numbers_of_iterations_estimates (oloop);
+
+  oloop->nb_iterations = iloop->nb_iterations;
+
+  iloop->any_upper_bound = false;
+  iloop->any_likely_upper_bound = false;
+  free_numbers_of_iterations_estimates (iloop);
+
+  iloop->nb_iterations = nb_iterations;
+}
+
+/* Interchange two loops specified by ILOOP and OLOOP.  */
+
+void
+tree_loop_interchange::interchange_loops (loop_cand &iloop, loop_cand &oloop)
+{
+  interchange_reductions (iloop, oloop);
+  interchange_inductions (iloop, oloop);
+
+  interchange_nb_iterations (iloop.loop, oloop.loop);
+
+  iloop.eliminate_dead_code ();
+}
+
+/* If STMT is a debug stmt that uses VAR, mark pass local flag if it only
+   refers to VAR, otherwise, remove the stmt.  */
+
+static inline void
+mark_or_remove_dbg_stmt (gimple *stmt, tree var)
+{
+  if (!is_gimple_debug (stmt))
+    return;
+
+  tree t = SINGLE_SSA_TREE_OPERAND (stmt, SSA_OP_USE);
+  if (t == var)
+    gimple_set_plf (stmt, GF_PLF_1, true);
+  else
+    {
+      gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+      gsi_remove (&gsi, true);
+    }
+}
+
+/* Interchange transformation for reductions of ILOOP and OLOOP.  We only
+   support two types reductions for now:
+     1) simple reduction of inner loop.
+     2) double reduction of loop nest.
+   For simple reduction, we simply undo it by moving producer/consumer to
+   inner loop; for double reduction, we don't need to do anything.  */
+
+void
+tree_loop_interchange::interchange_reductions (loop_cand &iloop,
+					       loop_cand &oloop)
+{
+  unsigned i;
+  reduction_p re;
+
+  for (i = 0; iloop.reductions.iterate (i, &re); ++i)
+    {
+      if (re->type == DOUBLE_RTYPE)
+	continue;
+
+      /* Undo simple reductions.  */
+      iloop.undo_simple_reduction (re);
+    }
+
+  for (i = 0; oloop.reductions.iterate (i, &re); ++i)
+    {
+      if (re->type != DOUBLE_RTYPE)
+	gcc_unreachable ();
+
+      use_operand_p use_p;
+      imm_use_iterator iterator;
+      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var)
+	mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var);
+      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next)
+	mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next);
+      if (TREE_CODE (re->init) == SSA_NAME)
+	{
+	  FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init)
+	    mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init);
+	}
+    }
+}
+
+/* Interchange transformation for inductions of ILOOP and OLOOP.  */
+
+void
+tree_loop_interchange::interchange_inductions (loop_cand &iloop,
+					       loop_cand &oloop)
+{
+  /* Map outer loop's IV to inner loop.  */
+  map_inductions_to_loop (oloop, iloop);
+  /* Map inner loop's IV to outer loop.  */
+  map_inductions_to_loop (iloop, oloop);
+
+  /* Create canonical IV for both loops.  Note canonical IV for outer/inner
+     loop is actually from inner/outer loop.  Also we record the new IV
+     created for the outer loop so that it can be skipped in later loop
+     interchange.  */
+  create_canonical_iv (oloop.loop, single_exit (oloop.loop), iloop.niters,
+		       &niters_iv_var);
+  create_canonical_iv (iloop.loop, single_exit (iloop.loop), oloop.niters);
+}
+
+/* Map induction variables of SRC loop to TGT loop.  The function firstly
+   creates the same IV of SRC loop in TGT loop, then deletes the original
+   IV and re-initialize it using the newly created IV.  For example, loop
+   nest:
+
+     for (i = 0; i < N; i++)
+       for (j = 0; j < M; j++)
+	 {
+	   //use of i;
+	   //use of j;
+	 }
+
+   will be transformed into:
+
+     for (jj = 0; jj < M; jj++)
+       for (ii = 0; ii < N; ii++)
+	 {
+	   //use of ii;
+	   //use of jj;
+	 }
+
+   after loop interchange.  */
+
+void
+tree_loop_interchange::map_inductions_to_loop (loop_cand &src, loop_cand &tgt)
+{
+  induction_p iv;
+  edge e = single_exit (tgt.loop);
+  gimple_stmt_iterator incr_pos = gsi_last_bb (e->src), gsi;
+  bool move_code_p = flow_loop_nested_p (src.loop, tgt.loop);
+
+  /* Move src's code to tgt loop.  This is necessary when src is the outer
+     loop and tgt is the inner loop.  */
+  if (move_code_p)
+    move_code_to_inner_loop (src.loop, tgt.loop, src.bbs);
+
+  /* Map source loop's IV to target loop.  */
+  for (unsigned i = 0; src.inductions.iterate (i, &iv); ++i)
+    {
+      gimple *stmt = SSA_NAME_DEF_STMT (iv->var);
+      gcc_assert (is_a <gphi *> (stmt));
+
+      /* Delete var definition of the original IV's in the source loop.  */
+      gsi = gsi_for_stmt (stmt);
+      gsi_remove (&gsi, true);
+
+      /* No need to map PHI to target loop if it is created in previous
+	 loop interchange.  */
+      if (niters_iv_var == iv->var)
+	{
+	  gcc_assert (!move_code_p);
+	  continue;
+	}
+
+      /* Map the IV by creating the same one in target loop.  */
+      tree base = unshare_expr (iv->base), step = unshare_expr (iv->step);
+      create_iv (base, step, SSA_NAME_VAR (iv->var),
+		 tgt.loop, &incr_pos, false, &iv->mapped_var, NULL);
+
+      /* Replace uses of the original IV var with newly created IV var.  */
+      use_operand_p imm_use_p;
+      imm_use_iterator iterator;
+      FOR_EACH_IMM_USE_STMT (stmt, iterator, iv->var)
+	{
+	  FOR_EACH_IMM_USE_ON_STMT (imm_use_p, iterator)
+	    SET_USE (imm_use_p, iv->mapped_var);
+
+	  update_stmt (stmt);
+	}
+    }
+}
+
+/* Compute the insert position at inner loop when moving code from outer
+   loop to inner one.  */
+
+static inline void
+insert_pos_at_inner_loop (struct loop *outer, struct loop *inner,
+			  basic_block bb, gimple_stmt_iterator *pos)
+{
+  /* Move code from header/latch to header/latch.  */
+  if (bb == outer->header)
+    *pos = gsi_after_labels (inner->header);
+  else if (bb == outer->latch)
+    *pos = gsi_after_labels (inner->latch);
+  else
+    {
+      /* Otherwise, simply move to exit->src.  */
+      edge e = single_exit (inner);
+      *pos = gsi_last_bb (e->src);
+    }
+}
+
+/* Move stmts of outer loop to inner loop.  */
+
+void
+tree_loop_interchange::move_code_to_inner_loop (struct loop *outer,
+						struct loop *inner,
+						basic_block *outer_bbs)
+{
+  unsigned int i;
+  edge oloop_exit = single_exit (outer);
+  gimple_stmt_iterator insert_pos, gsi;
+
+  for (i = 0; i < outer->num_nodes; i++)
+    {
+      basic_block bb = outer_bbs[i];
+
+      /* Skip basic blocks of inner loop.  */
+      if (flow_bb_inside_loop_p (inner, bb))
+	continue;
+
+      insert_pos_at_inner_loop (outer, inner, bb, &insert_pos);
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (gimple_code (stmt) == GIMPLE_LABEL)
+	    {
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (oloop_exit->src == bb
+	      && stmt == gsi_stmt (gsi_last_bb (oloop_exit->src)))
+	    {
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (gimple_plf (stmt, GF_PLF_1))
+	    {
+	      gsi_next (&gsi);
+	      continue;
+	    }
+
+	  if (gimple_vuse (stmt))
+	    gimple_set_vuse (stmt, NULL_TREE);
+	  if (gimple_vdef (stmt))
+	    gimple_set_vdef (stmt, NULL_TREE);
+
+	  gsi_remove (&gsi, false, false);
+	  gsi_insert_before (&insert_pos, stmt, GSI_SAME_STMT);
+	}
+    }
+}
+
+/* Estimate and return the value of EXPR by replacing variables in EXPR
+   with CST_TREE and simplifying.  */
+
+static tree
+estimate_val_by_simplify_replace (tree expr, tree cst_tree)
+{
+  unsigned i, n;
+  tree ret = NULL_TREE, e, se;
+
+  if (!expr)
+    return NULL_TREE;
+
+  /* Do not bother to replace constants.  */
+  if (CONSTANT_CLASS_P (expr))
+    return expr;
+
+  if (!EXPR_P (expr))
+    return cst_tree;
+
+  n = TREE_OPERAND_LENGTH (expr);
+  for (i = 0; i < n; i++)
+    {
+      e = TREE_OPERAND (expr, i);
+      se = estimate_val_by_simplify_replace (e, cst_tree);
+      if (e == se)
+	continue;
+
+      if (!ret)
+	ret = copy_node (expr);
+
+      TREE_OPERAND (ret, i) = se;
+    }
+
+  return (ret ? fold (ret) : expr);
+}
+
+/* Given data reference DR in LOOP_NEST, the function computes DR's access
+   stride at each level of loop from innermost LOOP to outer.  On success,
+   it saves access stride at each level loop in a vector which is pointed
+   by DR->aux.  For example:
+
+     int arr[100][100][100];
+     for (i = 0; i < 100; i++)       ;(DR->aux)strides[0] = 40000
+       for (j = 100; j > 0; j--)     ;(DR->aux)strides[1] = 400
+	 for (k = 0; k < 100; k++)   ;(DR->aux)strides[2] = 4
+	   arr[i][j - 1][k] = 0;  */
+
+static void
+compute_access_stride (struct loop *loop_nest, struct loop *loop,
+		       data_reference_p dr)
+{
+  vec<tree> *strides = new vec<tree> ();
+  basic_block bb = gimple_bb (DR_STMT (dr));
+
+  while (!flow_bb_inside_loop_p (loop, bb))
+    {
+      strides->safe_push (build_int_cst (sizetype, 0));
+      loop = loop_outer (loop);
+    }
+
+  gcc_assert (loop == bb->loop_father);
+
+  tree ref = DR_REF (dr);
+  tree scev_base = build_fold_addr_expr (ref);
+  tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref));
+  tree niters = build_int_cst (sizetype, AVG_LOOP_NITER);
+  access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size);
+
+  do {
+    tree scev_fn = analyze_scalar_evolution (loop, scev_base);
+    if (chrec_contains_undetermined (scev_fn)
+	|| chrec_contains_symbols_defined_in_loop (scev_fn, loop->num))
+      break;
+
+    if (TREE_CODE (scev_fn) != POLYNOMIAL_CHREC)
+      {
+	scev_base = scev_fn;
+	strides->safe_push (build_int_cst (sizetype, 0));
+	continue;
+      }
+
+    scev_base = CHREC_LEFT (scev_fn);
+    if (tree_contains_chrecs (scev_base, NULL))
+      break;
+
+    tree scev_step = fold_convert (sizetype, CHREC_RIGHT (scev_fn));
+
+    enum ev_direction scev_dir = scev_direction (scev_fn);
+    /* Estimate if step isn't constant.  */
+    if (scev_dir == EV_DIR_UNKNOWN)
+      {
+	scev_step = estimate_val_by_simplify_replace (scev_step, niters);
+	if (TREE_CODE (scev_step) != INTEGER_CST
+	    || tree_int_cst_lt (scev_step, access_size))
+	  scev_step = access_size;
+      }
+    /* Compute absolute value of scev step.  */
+    else if (scev_dir == EV_DIR_DECREASES)
+      scev_step = fold_build1 (NEGATE_EXPR, sizetype, scev_step);
+
+    strides->safe_push (scev_step);
+  } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL);
+
+  dr->aux = strides;
+}
+
+/* Given loop nest LOOP_NEST with innermost LOOP, the function computes
+   access strides with respect to each level loop for all data refs in
+   DATAREFS from inner loop to outer loop.  On success, it returns the
+   outermost loop that access strides can be computed successfully for
+   all data references.  If access strides cannot be computed at least
+   for two levels of loop for any data reference, it returns NULL.  */
+
+static struct loop *
+compute_access_strides (struct loop *loop_nest, struct loop *loop,
+			vec<data_reference_p> datarefs)
+{
+  unsigned i, j, num_loops = (unsigned) -1;
+  data_reference_p dr;
+  vec<tree> *stride;
+
+  for (i = 0; datarefs.iterate (i, &dr); ++i)
+    {
+      compute_access_stride (loop_nest, loop, dr);
+      stride = DR_ACCESS_STRIDE (dr);
+      if (stride->length () < num_loops)
+	{
+	  num_loops = stride->length ();
+	  if (num_loops < 2)
+	    return NULL;
+	}
+    }
+
+  for (i = 0; datarefs.iterate (i, &dr); ++i)
+    {
+      stride = DR_ACCESS_STRIDE (dr);
+      if (stride->length () > num_loops)
+	stride->truncate (num_loops);
+
+      for (j = 0; j < (num_loops >> 1); ++j)
+	std::swap ((*stride)[j], (*stride)[num_loops - j - 1]);
+    }
+
+  while (--num_loops > 0)
+    loop = loop_outer (loop);
+
+  gcc_assert (loop_nest == loop || flow_loop_nested_p (loop_nest, loop));
+  return loop;
+}
+
+/* Count and return the number of loops in LOOP_NEST.  */
+
+unsigned int
+num_loops_in_loop_nest (struct loop *loop_nest)
+{
+  unsigned num_loops;
+  for (num_loops = 0; loop_nest; num_loops++, loop_nest = loop_nest->inner)
+    ;
+  return num_loops;
+}
+
+/* Prune access strides for data references in DATAREFS by removing strides
+   of loops that isn't in current LOOP_NEST.  */
+
+static void
+prune_access_strides_not_in_loop (struct loop *loop_nest,
+				  vec<data_reference_p> datarefs)
+{
+  data_reference_p dr;
+  unsigned num_loops = num_loops_in_loop_nest (loop_nest);
+  gcc_assert (num_loops > 1);
+
+  /* Block remove strides of loops that is not in current loop nest.  */
+  for (unsigned i = 0; datarefs.iterate (i, &dr); ++i)
+    {
+      vec<tree> *stride = DR_ACCESS_STRIDE (dr);
+      if (stride->length () > num_loops)
+	stride->block_remove (0, stride->length () - num_loops);
+    }
+}
+
+/* Dump access strides for all DATAREFS.  */
+
+static void
+dump_access_strides (vec<data_reference_p> datarefs)
+{
+  data_reference_p dr;
+  fprintf (dump_file, "Access Strides for DRs:\n");
+  for (unsigned i = 0; datarefs.iterate (i, &dr); ++i)
+    {
+      fprintf (dump_file, "  ");
+      print_generic_expr (dump_file, DR_REF (dr), TDF_SLIM);
+      fprintf (dump_file, ":\t\t<");
+
+      vec<tree> *stride = DR_ACCESS_STRIDE (dr);
+      unsigned num_loops = stride->length ();
+      for (unsigned j = 0; j < num_loops; ++j)
+	{
+	  print_generic_expr (dump_file, (*stride)[j], TDF_SLIM);
+	  fprintf (dump_file, "%s", (j < num_loops - 1) ? ",\t" : ">\n");
+	}
+    }
+}
+
+/* Return true if it's profitable to interchange two loops whose index
+   in whole loop nest vector are I_IDX/O_IDX respectively.  The function
+   computes and compares three types information from all DATAREFS:
+     1) Access stride for loop I_IDX and O_IDX.
+     2) Number of invariant memory references with respect to I_IDX before
+	and after loop interchange.
+     3) Flags indicating if all memory references access sequential memory
+	in ILOOP, before and after loop interchange.
+   If INNMOST_LOOP_P is true, the two loops for interchanging are the two
+   innermost loops in loop nest.  This function also dumps information if
+   DUMP_INFO_P is true.  */
+
+static bool
+should_interchange_loops (unsigned i_idx, unsigned o_idx,
+			  vec<data_reference_p> datarefs,
+			  bool innermost_loops_p, bool dump_info_p = true)
+{
+  unsigned HOST_WIDE_INT ratio;
+  unsigned i, j, num_old_inv_drs = 0, num_new_inv_drs = 0;
+  struct data_reference *dr;
+  bool all_seq_dr_before_p = true, all_seq_dr_after_p = true;
+  widest_int iloop_strides = 0, oloop_strides = 0;
+
+  if (dump_info_p && dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "\nData ref strides:\n\tmem_ref:\t\tiloop\toloop\n");
+
+  for (i = 0; datarefs.iterate (i, &dr); ++i)
+    {
+      vec<tree> *stride = DR_ACCESS_STRIDE (dr);
+      tree iloop_stride = (*stride)[i_idx], oloop_stride = (*stride)[o_idx];
+      gcc_assert (TREE_CODE (iloop_stride) == INTEGER_CST);
+      gcc_assert (TREE_CODE (oloop_stride) == INTEGER_CST);
+
+      bool subloop_stride_p = false;
+      for (j = i_idx + 1; j < stride->length (); ++j)
+	if (integer_nonzerop ((*stride)[j]))
+	  {
+	    subloop_stride_p = true;
+	    break;
+	  }
+
+      if (integer_nonzerop (iloop_stride))
+	iloop_strides = wi::add (iloop_strides, wi::to_widest (iloop_stride));
+      else if (!subloop_stride_p)
+	num_old_inv_drs++;
+
+      if (integer_nonzerop (oloop_stride))
+	oloop_strides = wi::add (oloop_strides, wi::to_widest (oloop_stride));
+      else if (!subloop_stride_p)
+	num_new_inv_drs++;
+
+      /* Track if all data references access sequential memory before and
+	 after loop interchange.  */
+      if (subloop_stride_p)
+	{
+	  /* Data ref can't be sequential if it evaluates wrto any sub loop
+	     of inner loop.  */
+	  all_seq_dr_before_p = false;
+	  all_seq_dr_after_p = false;
+	}
+      else
+	{
+	  tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr)));
+	  /* Consider data reference with smaller access stride than the
+	     threshold as sequential access.  Note invariant is considered
+	     sequential too.  */
+	  if (all_seq_dr_before_p
+	      && integer_nonzerop (iloop_stride)
+	      && wi::gtu_p (wi::to_wide (iloop_stride),
+			    wi::to_wide (access_size)))
+	    all_seq_dr_before_p = false;
+	  if (all_seq_dr_after_p
+	      && integer_nonzerop (oloop_stride)
+	      && wi::gtu_p (wi::to_wide (oloop_stride),
+			    wi::to_wide (access_size)))
+	    all_seq_dr_after_p = false;
+	}
+    }
+
+  if (dump_info_p && dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "\toverall:\t\t");
+      print_decu (iloop_strides, dump_file);
+      fprintf (dump_file, "\t");
+      print_decu (oloop_strides, dump_file);
+      fprintf (dump_file, "\n");
+
+      fprintf (dump_file, "Invariant data ref: before(%d), after(%d)\n",
+	       num_old_inv_drs, num_new_inv_drs);
+      fprintf (dump_file, "Consecutive stride: before(%s), after(%s)\n",
+	       all_seq_dr_before_p ? "true" : "false",
+	       all_seq_dr_after_p ? "true" : "false");
+    }
+
+  /* We use different stride comparison ratio for interchanging innermost
+     two loops or not.  The idea is to be conservative in interchange for
+     the innermost loops.  */
+  ratio = innermost_loops_p ? INNER_STRIDE_RATIO : OUTER_STRIDE_RATIO;
+  /* Do interchange if it gives better data locality behavior.  */
+  if (wi::gtu_p (iloop_strides, wi::mul (oloop_strides, ratio)))
+    return true;
+  if (wi::gtu_p (iloop_strides, oloop_strides))
+    {
+      /* Or it creates more invariant memory references.  */
+      if ((!all_seq_dr_before_p || all_seq_dr_after_p)
+	  && num_new_inv_drs > num_old_inv_drs)
+	return true;
+      /* Or it makes all memory references sequential.  */
+      if (num_new_inv_drs >= num_old_inv_drs
+	  && !all_seq_dr_before_p && all_seq_dr_after_p)
+	return true;
+    }
+
+  return false;
+}
+
+/* Try to interchange inner loop of a loop nest to outer level.  */
+
+bool
+tree_loop_interchange::interchange ()
+{
+  bool changed_p = false;
+  /* In each iteration we try to interchange I-th loop with (I+1)-th loop.
+     The overall effect is to push inner loop to outermost level in whole
+     loop nest.  */
+  for (unsigned i = loop_nest.length (); i > 1; --i)
+    {
+      unsigned i_idx = i - 1, o_idx = i - 2;
+
+      /* Check validity for loop interchange.  */
+      if (!valid_data_dependences (i_idx, o_idx))
+	break;
+
+      loop_cand iloop (i_idx, loop_nest[i_idx], loop_nest[o_idx]);
+      loop_cand oloop (o_idx, loop_nest[o_idx], loop_nest[o_idx]);
+
+      /* Check if we can do transformation for loop interchange.  */
+      if (!can_interchange_loops (iloop, oloop))
+	break;
+
+      /* Check profitability for loop interchange.  */
+      if (should_interchange_loops (i_idx, o_idx, datarefs,
+				    iloop.loop->inner == NULL))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Loop_pair<outer:%d, inner:%d> is interchanged\n\n",
+		     oloop.loop->num, iloop.loop->num);
+
+	  interchange_loops (iloop, oloop);
+	  /* No need to update information if there is no further loop
+	     interchange.  */
+	  if (o_idx > 0)
+	    {
+	      update_data_refs (i_idx, o_idx);
+	      update_data_deps (i_idx, o_idx);
+	    }
+	  changed_p = true;
+	}
+      else
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file,
+		     "Loop_pair<outer:%d, inner:%d> is not interchanged\n\n",
+		     oloop.loop->num, iloop.loop->num);
+	}
+    }
+
+  return changed_p;
+}
+
+
+/* Loop interchange pass.  */
+
+namespace {
+
+const pass_data pass_data_linterchange =
+{
+  GIMPLE_PASS, /* type */
+  "linterchange", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_LINTERCHANGE, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_linterchange : public gimple_opt_pass
+{
+public:
+  pass_linterchange (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_linterchange, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  opt_pass * clone () { return new pass_linterchange (m_ctxt); }
+  virtual bool gate (function *) { return flag_tree_loop_interchange; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_linterchange
+
+
+/* Return true if LOOP has proper form for interchange.  */
+
+static bool
+proper_loop_form_for_interchange (struct loop *loop)
+{
+  edge e0, e1, exit;
+
+  /* Don't interchange if loop has unsupported information for the moment.  */
+  if (loop->safelen > 0
+      || loop->constraints != 0
+      || loop->can_be_parallel
+      || loop->dont_vectorize
+      || loop->force_vectorize
+      || loop->in_oacc_kernels_region
+      || loop->orig_loop_num != 0
+      || loop->simduid != NULL_TREE)
+    return false;
+
+  /* Don't interchange if outer loop has basic block other than header,
+     exit->src and latch.  In general, only below form of loop nest:
+		header<---+
+                  |       |
+                  v       |
+              INNER_LOOP  |
+                  |       |
+                  v       |
+		exit--->latch
+     is supported.  */
+  if (loop->inner != NULL
+      && loop->num_nodes != loop->inner->num_nodes + 3)
+    return false;
+
+  if ((exit = single_dom_exit (loop)) == NULL)
+    return false;
+
+  /* Check control flow on loop header/exit blocks.  */
+  if (loop->header == exit->src
+      && (EDGE_COUNT (loop->header->preds) != 2
+	  || EDGE_COUNT (loop->header->succs) != 2))
+    return false;
+  else if (loop->header != exit->src
+	   && (EDGE_COUNT (loop->header->preds) != 2
+	       || !single_succ_p (loop->header)
+	       || unsupported_edge (single_succ_edge (loop->header))
+	       || EDGE_COUNT (exit->src->succs) != 2
+	       || !single_pred_p (exit->src)
+	       || unsupported_edge (single_pred_edge (exit->src))))
+    return false;
+
+  e0 = EDGE_PRED (loop->header, 0);
+  e1 = EDGE_PRED (loop->header, 1);
+  if (unsupported_edge (e0) || unsupported_edge (e1)
+      || (e0->src != loop->latch && e1->src != loop->latch)
+      || (e0->src->loop_father == loop && e1->src->loop_father == loop))
+    return false;
+
+  e0 = EDGE_SUCC (exit->src, 0);
+  e1 = EDGE_SUCC (exit->src, 1);
+  if (unsupported_edge (e0) || unsupported_edge (e1)
+      || (e0->dest != loop->latch && e1->dest != loop->latch)
+      || (e0->dest->loop_father == loop && e1->dest->loop_father == loop))
+    return false;
+
+  return true;
+}
+
+/* Return true if any two adjacent loops in loop nest OUTERMOST_LOOP should
+   be interchanged by looking into all DATAREFS.  INNERMOST_LOOP is the
+   innermost loop of this loop nest.  */
+
+static bool
+should_interchange_loop_nest (struct loop *outermost_loop,
+			      struct loop *innermost_loop,
+			      vec<data_reference_p> datarefs)
+{
+  unsigned idx = num_loops_in_loop_nest (outermost_loop) - 1;
+  gcc_assert (idx > 0);
+
+  /* Check if any two adjacent loops should be interchanged.  */
+  for (struct loop *loop = innermost_loop;
+       loop != outermost_loop;
+       loop = loop_outer (loop), idx--)
+    if (should_interchange_loops (idx, idx - 1, datarefs,
+				  loop == innermost_loop, false))
+      return true;
+
+  return false;
+}
+
+/* Given loop nest LOOP_NEST and data references DATAREFS, compute data
+   dependences for loop interchange and store it in DDRS.  Note we compute
+   dependences directly rather than call generic interface so that we can
+   return on unknown dependence instantly.  */
+
+static bool
+tree_loop_interchange_compute_ddrs (vec<loop_p> loop_nest,
+				    vec<data_reference_p> datarefs,
+				    vec<ddr_p> *ddrs)
+{
+  struct data_reference *a, *b;
+  struct loop *innermost = loop_nest.last ();
+
+  for (unsigned i = 0; datarefs.iterate (i, &a); ++i)
+    {
+      bool a_outer_p = gimple_bb (DR_STMT (a))->loop_father != innermost;
+      for (unsigned j = i + 1; datarefs.iterate (j, &b); ++j)
+	if (DR_IS_WRITE (a) || DR_IS_WRITE (b))
+	  {
+	    bool b_outer_p = gimple_bb (DR_STMT (b))->loop_father != innermost;
+	    /* Don't support multiple write references in outer loop.  */
+	    if (a_outer_p && b_outer_p && DR_IS_WRITE (a) && DR_IS_WRITE (b))
+	      return false;
+
+	    ddr_p ddr = initialize_data_dependence_relation (a, b, loop_nest);
+	    ddrs->safe_push (ddr);
+	    compute_affine_dependence (ddr, loop_nest[0]);
+
+	    /* Give up if ddr is unknown dependence or classic direct vector
+	       is not available.  */
+	    if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know
+		|| (DDR_ARE_DEPENDENT (ddr) == NULL_TREE
+		    && DDR_NUM_DIR_VECTS (ddr) == 0))
+	      return false;
+
+	    /* If either data references is in outer loop of nest, we require
+	       no dependence here because the data reference need to be moved
+	       into inner loop during interchange.  */
+	    if (a_outer_p && b_outer_p
+		&& operand_equal_p (DR_REF (a), DR_REF (b), 0))
+	      continue;
+	    if (DDR_ARE_DEPENDENT (ddr) != chrec_known
+		&& (a_outer_p || b_outer_p))
+	      return false;
+	}
+    }
+
+  return true;
+}
+
+/* Prune DATAREFS by removing any data reference not inside of LOOP.  */
+
+static inline void
+prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> datarefs)
+{
+  struct data_reference *dr;
+
+  for (unsigned i = 0; datarefs.iterate (i, &dr);)
+    if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr))))
+      i++;
+    else
+      {
+	datarefs.ordered_remove (i);
+	if (dr->aux)
+	  {
+	    DR_ACCESS_STRIDE (dr)->release ();
+	    free (dr->aux);
+	  }
+	free_data_ref (dr);
+      }
+}
+
+/* Given loop nest like <OLOOP, ILOOP>, the function strips off outer
+   loops if it forms non-rectangle loop nest.  The outermost loop of
+   the rest rectangle loop nest or NULL is returned.  */
+
+struct loop *
+prune_non_rectangle_loop_nest (struct loop *iloop, struct loop *oloop)
+{
+  if (!oloop || !iloop || oloop == iloop)
+    return NULL;
+
+  struct loop *loop1 = iloop;
+
+  while (loop1 != NULL && flow_loop_nested_p (oloop, loop1))
+    {
+      tree niters = number_of_latch_executions (loop1);
+      struct loop *loop2 = loop_outer (loop1);
+
+      while (loop2 != NULL
+	     && (loop2 == oloop || flow_loop_nested_p (oloop, loop2)))
+	{
+	  /* Strip off the outermost loop if it isn't rectangle loop nest.  */
+	  if (chrec_contains_symbols_defined_in_loop (niters, loop2->num))
+	    {
+	      oloop = loop2->inner;
+	      break;
+	    }
+
+	  loop2 = loop_outer (loop2);
+	}
+      if (oloop == iloop)
+	return NULL;
+
+      loop1 = loop_outer (loop1);
+    }
+
+  return oloop;
+}
+
+/* Clear pass local flag stmts in LOOP.  */
+
+static void
+clear_stmt_flag (struct loop *loop)
+{
+  basic_block *bbs = get_loop_body (loop);
+
+  for (unsigned i = 0; i < loop->num_nodes; i++)
+    {
+      basic_block bb = bbs[i];
+
+      for (gimple_stmt_iterator gsi = gsi_start_bb (bb);
+	   !gsi_end_p (gsi); gsi_next (&gsi))
+	gimple_set_plf (gsi_stmt (gsi), GF_PLF_1, false);
+    }
+
+  free (bbs);
+}
+
+/* Return true if number of iterations of any data reference in DATAREFS
+   is different to niters of its loop header.  */
+
+static bool
+dataref_niters_diff_to_loop_header (vec<data_reference_p> datarefs)
+{
+  data_reference_p dr;
+  for (unsigned i = 0; datarefs.iterate (i, &dr); ++i)
+    {
+      basic_block bb = gimple_bb (DR_STMT (dr));
+      struct loop *loop = bb->loop_father;
+      if (!dominated_by_p (CDI_DOMINATORS, single_exit (loop)->src, bb))
+	return true;
+    }
+  return false;
+}
+
+/* Given innermost LOOP, return true if perfect loop nest can be found and
+   data dependences can be computed.  If succeed, record the perfect loop
+   nest in LOOP_NEST; record all data references in DATAREFS and record all
+   data dependence relations in DDRS.
+
+   We do support a restricted form of imperfect loop nest, i.e, loop nest
+   with load/store in outer loop initializing/finalizing simple reduction
+   of the innermost loop.  For such outer loop reference, we require that
+   it has no dependence with others sinve it will be moved to inner loop
+   in interchange.  */
+
+static bool
+prepare_perfect_loop_nest (struct loop *loop, vec<loop_p> *loop_nest,
+			   vec<data_reference_p> *datarefs, vec<ddr_p> *ddrs)
+{
+  tree niters;
+  struct loop *start_loop = NULL, *innermost_loop = loop;
+
+  /* Find loop nest from the innermost loop.  */
+  while (loop->num != 0 && loop->inner == start_loop)
+    {
+      if (!proper_loop_form_for_interchange (loop))
+	break;
+
+      /* Loop must have determined niters.  */
+      niters = number_of_latch_executions (loop);
+      if (!niters || chrec_contains_undetermined (niters))
+	break;
+
+      start_loop = loop;
+      /* If this loop has sibling loop, the father loop won't be in perfect
+	 loop nest.  */
+      if (loop->next != NULL)
+	break;
+
+      loop = loop_outer (loop);
+    }
+
+  start_loop = prune_non_rectangle_loop_nest (innermost_loop, start_loop);
+
+  if (!start_loop || !start_loop->inner)
+    return false;
+
+  datarefs->create (20);
+  if (find_data_references_in_loop (start_loop, datarefs) == chrec_dont_know
+      /* Check if there is no data reference.  */
+      || datarefs->length () == 0
+      /* Check if there are too many of data references.  */
+      || ((int) datarefs->length ()
+	  > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS))
+      /* Check if there is any data reference in loop latch.  We can't handle
+	 loops which loop header and data references have different execution
+	 times.  */
+      || dataref_niters_diff_to_loop_header (*datarefs)
+      /* Compute access strides for all data references.  */
+      || ((start_loop = compute_access_strides (start_loop, innermost_loop,
+						*datarefs)) == NULL)
+      /* Check if loop nest should be interchanged.  */
+      || !should_interchange_loop_nest (start_loop, innermost_loop, *datarefs))
+    {
+      free_data_refs_with_aux (*datarefs);
+      return false;
+    }
+
+  /* Check if data dependences can be computed for loop nest starting from
+     start_loop.  */
+  loop = start_loop;
+  loop_nest->create (3);
+  do {
+    ddrs->create (20);
+    loop_nest->truncate (0);
+
+    if (loop != start_loop)
+      prune_datarefs_not_in_loop (start_loop, *datarefs);
+
+    if (find_loop_nest (start_loop, loop_nest)
+	&& tree_loop_interchange_compute_ddrs (*loop_nest, *datarefs, ddrs))
+      {
+	if (dump_file && (dump_flags & TDF_DETAILS))
+	  fprintf (dump_file,
+		   "\nConsider loop interchange for loop_nest<%d - %d>\n",
+		   start_loop->num, innermost_loop->num);
+
+	if (loop != start_loop)
+	  prune_access_strides_not_in_loop (start_loop, *datarefs);
+
+	if (dump_file && (dump_flags & TDF_DETAILS))
+	  dump_access_strides (*datarefs);
+
+	clear_stmt_flag (start_loop);
+	return true;
+      }
+
+    free_dependence_relations (*ddrs);
+    /* Try to compute data dependences with the outermost loop stripped.  */
+    loop = start_loop;
+    start_loop = start_loop->inner;
+  } while (start_loop && start_loop->inner);
+
+  loop_nest->release ();
+  free_data_refs_with_aux (*datarefs);
+  return false;
+}
+
+/* Main entry for loop interchange pass.  */
+
+unsigned int
+pass_linterchange::execute (function *fun)
+{
+  if (number_of_loops (fun) <= 2)
+    return 0;
+
+  bool changed_p = false;;
+  struct loop *loop;
+  vec<loop_p> loop_nest;
+  vec<data_reference_p> datarefs;
+  vec<ddr_p> ddrs;
+
+  FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
+    if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
+      {
+	tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs);
+	changed_p |= loop_interchange.interchange ();
+      }
+
+  if (changed_p)
+    scev_reset_htab ();
+
+  return changed_p ? (TODO_update_ssa_only_virtuals) : 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_linterchange (gcc::context *ctxt)
+{
+  return new pass_linterchange (ctxt);
+}
diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
index 8b1daa6..205534f 100644
--- a/gcc/tree-ssa-loop-ivcanon.c
+++ b/gcc/tree-ssa-loop-ivcanon.c
@@ -76,10 +76,13 @@ enum unroll_level
 };
 
 /* Adds a canonical induction variable to LOOP iterating NITER times.  EXIT
-   is the exit edge whose condition is replaced.  */
+   is the exit edge whose condition is replaced.  The ssa versions of the new
+   IV before and after increment will be stored in VAR_BEFORE and VAR_AFTER
+   if they are not NULL.  */
 
-static void
-create_canonical_iv (struct loop *loop, edge exit, tree niter)
+void
+create_canonical_iv (struct loop *loop, edge exit, tree niter,
+		     tree *var_before = NULL, tree *var_after = NULL)
 {
   edge in;
   tree type, var;
@@ -112,7 +115,9 @@ create_canonical_iv (struct loop *loop, edge exit, tree niter)
   create_iv (niter,
 	     build_int_cst (type, -1),
 	     NULL_TREE, loop,
-	     &incr_at, false, NULL, &var);
+	     &incr_at, false, var_before, &var);
+  if (var_after)
+    *var_after = var;
 
   cmp = (exit->flags & EDGE_TRUE_VALUE) ? EQ_EXPR : NE_EXPR;
   gimple_cond_set_code (cond, cmp);
diff --git a/gcc/tree-ssa-loop-ivopts.h b/gcc/tree-ssa-loop-ivopts.h
index bd92051..a723f46 100644
--- a/gcc/tree-ssa-loop-ivopts.h
+++ b/gcc/tree-ssa-loop-ivopts.h
@@ -32,4 +32,6 @@ extern tree strip_offset (tree, unsigned HOST_WIDE_INT *);
 bool may_be_nonaddressable_p (tree expr);
 void tree_ssa_iv_optimize (void);
 
+void create_canonical_iv (struct loop *, edge, tree,
+			  tree * = NULL, tree * = NULL);
 #endif /* GCC_TREE_SSA_LOOP_IVOPTS_H */

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH GCC]A simple implementation of loop interchange
  2017-11-16 15:30           ` Bin.Cheng
@ 2017-11-20 14:59             ` Richard Biener
  2017-11-23 17:01               ` Bin.Cheng
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Biener @ 2017-11-20 14:59 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Michael Matz, gcc-patches

On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <matz@suse.de> wrote:
>> Hello,
>>
>> On Fri, 22 Sep 2017, Bin.Cheng wrote:
>>
>>> This is updated patch for loop interchange with review suggestions
>>> resolved.  Changes are:
>>>   1) It does more light weight checks like rectangle loop nest check
>>> earlier than before.
>>>   2) It checks profitability of interchange before data dependence computation.
>>>   3) It calls find_data_references_in_loop only once for a loop nest now.
>>>   4) Data dependence is open-computed so that we can skip instantly at
>>> unknown dependence.
>>>   5) It improves code generation in mapping induction variables for
>>> loop nest, as well as
>>>      adding a simple dead code elimination pass.
>>>   6) It changes magic constants into parameters.
>>
>> So I have a couple comments/questions.  Something stylistic:
> Hi Michael,
> Thanks for reviewing.
>
>>
>>> +class loop_cand
>>> +{
>>> +public:
>>> ...
>>> +  friend class tree_loop_interchange;
>>> +private:
>>
>> Just make this all public (and hence a struct, not class).
>> No need for friends in file local classes.
> Done.
>
>>
>>> +single_use_in_loop (tree var, struct loop *loop)
>>> ...
>>> +  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
>>> +    {
>>> +      stmt = USE_STMT (use_p);
>>> ...
>>> +      basic_block bb = gimple_bb (stmt);
>>> +      gcc_assert (bb != NULL);
>>
>> This pattern reoccurs often in your patch: you check for a bb associated
>> for a USE_STMT.  Uses of SSA names always occur in basic blocks, no need
>> for checking.
> Done.
>
>>
>> Then, something about your handling of simple reductions:
>>
>>> +void
>>> +loop_cand::classify_simple_reduction (reduction_p re)
>>> +{
>>> ...
>>> +  /* Require memory references in producer and consumer are the same so
>>> +     that we can undo reduction during interchange.  */
>>> +  if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0))
>>> +    return;
>>
>> Where is it checked that the undoing transformation is legal also
>> from a data dep point of view?  Think code like this:
>>
>>    sum = X[i];
>>    for (j ...)
>>      sum += X[j];
>>    X[i] = sum;
>>
>> Moving the store into the inner loop isn't always correct and I don't seem
>> to find where the above situation is rejected.
> Yeah.  for the old patch, it's possible to have such loop wrongly interchanged;
> in practice, it's hard to create an example.  The pass will give up
> when computing
> data dep between references in inner/outer loops.  In this updated
> patch, it's fixed
> by giving up if there is any dependence between references of inner/outer loops.
>
>>
>> Maybe I'm confused because I also don't see where you even can get into
>> the above situation (though I do see testcases about this).  The thing is,
>> for an 2d loop nest to contain something like the above reduction it can't
>> be perfect:
>>
>>    for (j) {
>>      int sum = X[j];      // 1
>>      for (i)
>>        sum += Y[j][i];
>>      X[j] = sum;          // 2
>>    }
>>
>> But you do check for perfectness in proper_loop_form_for_interchange and
>> prepare_perfect_loop_nest, so either you can't get into the situation or
>> the checking can't be complete, or you define the above to be perfect
>> nevertheless (probably because the load and store are in outer loop
>> header/exit blocks?).  The latter would mean that you accept also other
>> code in header/footer of loops from a pure CFG perspective, so where is it
>> checked that that other code (which aren't simple reductions) isn't
>> harmful to the transformation?
> Yes, I used the name perfect loop nest, but the pass can handle special form
> imperfect loop nest for the simple reduction.  I added comments describing
> this before function prepare_perfect_loop_nest.
>
>>
>> Then, the data dependence part of the new pass:
>>
>>> +bool
>>> +tree_loop_interchange::valid_data_dependences (unsigned inner, unsigned outer)
>>> +{
>>> +  struct data_dependence_relation *ddr;
>>> +
>>> +  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
>>> +    {
>>> +      /* Skip no-dependence case.  */
>>> +      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>>> +     continue;
>>> +
>>> +      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
>>> +     {
>>> +       lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
>>> +       unsigned level = dependence_level (dist_vect, loop_nest.length ());
>>> +
>>> +       /* If there is no carried dependence.  */
>>> +       if (level == 0)
>>> +         continue;
>>> +
>>> +       level --;
>>> +       /* Skip case which has '>' as the leftmost direction.  */
>>> +       if (!lambda_vector_lexico_pos (dist_vect, level))
>>> +         return false;
>>
>> Shouldn't happen as dist vectors are forced positive via DDR_REVERSED.
> Done.
>
>>
>>> +       /* If dependence is carried by outer loop of the two loops for
>>> +          interchange.  */
>>> +       if (level < outer)
>>> +         continue;
>>> +
>>> +       lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
>>> +       /* If directions at both inner/outer levels are the same.  */
>>> +       if (dir_vect[inner] == dir_vect[outer])
>>> +         continue;
>>> +
>>> +       /* Be conservative, skip case if either direction at inner/outer
>>> +          levels is not '=' or '<'.  */
>>> +       if (dir_vect[inner] != dir_equal
>>> +           && dir_vect[inner] != dir_positive
>>> +           && dir_vect[inner] != dir_independent
>>> +           && dir_vect[inner] != dir_positive_or_equal)
>>> +         return false;
>>> +
>>> +       if (dir_vect[outer] != dir_equal
>>> +           && dir_vect[outer] != dir_positive
>>> +           && dir_vect[outer] != dir_independent
>>> +           && dir_vect[outer] != dir_positive_or_equal)
>>> +         return false;
>>
>> Checking dir vectors doesn't make much sense in GCC: the elements are only
>> ever set to dir_positive, dir_negative or dir_equal, exactly when distance
>> is
>>  > 0, < 0 or == 0.  So checking dist vector is enough. (though sameness of
>> direction checks sameness of sign with zero).  Incidentally:
> Done.
>
>>
>>> +tree_loop_interchange::update_data_deps (unsigned inner, unsigned outer)
>>> +{
>>> +  struct data_dependence_relation *ddr;
>>> +
>>> +  for (unsigned i = 0; ddrs.iterate (i, &ddr); ++i)
>>> +    {
>>> +      /* Skip no-dependence case.  */
>>> +      if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>>> +     continue;
>>> +
>>> +      for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
>>> +     {
>>> +       lambda_vector dir_vect = DDR_DIR_VECT (ddr, j);
>>> +       std::swap (dir_vect[inner], dir_vect[outer]);
>>> +     }
>>> +    }
>>> +}
>>
>> Here you swap only the direction but not the distance vector, which can't
>> be right.  I suggest only using (and updating) the distance vector.
> Yeah, fixed.
>
>>
>> And then your usage and update of DR_ACCESS_FNs: there's quite some
>> complexity connected with that and I'm not sure how worthwhile it is.
>> You're basically using the ACCESS_FNs to determine profitability (and not
>> for validity, and that's good).  But e.g. for pointer based accesses like
>> in fortran with explicit address arithmetic the relation between access-fn
>> step and stride and actual access stride isn't that easy (e.g. in your
>> should_interchange_loops function iloop_stride and oloop_stride will
>> always be one for pointer based accesses).
>>
>> Conceptually what you should check is how the access address for each data
>> ref revolves for each loop, so why not doing this explicitely?  What I
>> mean is: calculate a (complicated) chrec for the DR addresses for the
>> whole nest at the beginning.  It should be in the form like (assume "+"
>> always):
>>
>>   {{{init, s1}_l1, s2}_l2, s3}_l3
>>
>> (i.e. all steps should be invariants/constants, and only one non-chrec
>> init value).  Addresses which aren't in this form you're already ignoring
>> right now, so you could continue doing that.  (Or better said, all
>> non-constant steps you regard as being AVG_DIM_SIZE, which you still can
>> continue doing).
>>
>> Now, with the above form you can form expressions for the difference
>> between addresses per iteration for each loop (i.e. the address stride per
>> loop); store these.  Then, when interchanging loops you need to merely
>> swap these expressions like you have to with the distance vector, instead
>> of fiddling inside the DR_ACCESS_FNs themself.  Much code would go away.
> Yeah.  Did similar thing in loop nest distribution pass.  See
> compute_access_range
> in tree-loop-distribution.c.  Actually, I would do the same here if I
> had implemented
> this pass after loop nest distribution patches.  Done in this updated patch.
>
>>
>> Testcases: given that we had to remove our old separate interchange pass
>> because it miscompiled stuff all over I'm missing some testcases where
>> interchange should _not_ happen for validity reasons, like my above
>> example with an reduction that can't be moved inside.  Perhaps you can
>> think of some more.
> As mentioned above, it's hard to create test that fail exactly for this reason.
> I added one that data dependence prevents us from interchanging the loop.
>
>>
>> I hope this is of some help to you :)
> Thanks again, it's very helpful.
>
> I also fixed several bugs of previous implementation, mostly about debug info
> statements and simple reductions.  As for test, I enabled this pass by default,
> bootstrap and regtest GCC, I also build/run specs.  There must be some other
> latent bugs in it, but guess we have to exercise it by enabling it at
> some point.
>
> So any comments?

 bool
-gsi_remove (gimple_stmt_iterator *i, bool remove_permanently)
+gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool insert_dbg)
 {

that you need this suggests you do stmt removal in wrong order (you need to
do reverse dom order).

+/* Maximum number of statements in loop nest for loop interchange.  */
+
+DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS,
+         "loop-interchange-max-num-stmts",
+         "The maximum number of stmts in loop nest for loop interchange.",
+         64, 0, 0)

is that to limit dependence computation?  In this case you should probably
limit the number of data references instead?

+ftree-loop-interchange
+Common Report Var(flag_tree_loop_interchange) Optimization
+Enable loop interchange on trees.
+

please re-use -floop-interchange instead and change the GRAPHITE tests
to use -floop-nest-optimize.  You can do that as pre-approved thing now.

Please enable the pass by default at O3 via opts.c.

diff --git a/gcc/tree-ssa-loop-interchange.cc b/gcc/tree-ssa-loop-interchange.cc

gimple-loop-interchange.cc please.

new file mode 100644
index 0000000..abffbf6
--- /dev/null
+++ b/gcc/tree-ssa-loop-interchange.cc
@@ -0,0 +1,2274 @@
+/* Loop invariant motion.
+   Copyright (C) 2017 Free Software Foundation, Inc.

Loop invariant motion? ... ;)

Please add a "Contributed by ..." to have an easy way to figure people to blame.

+}*induction_p;
+

space after '*'

+}*reduction_p;
+

likewise.

+/* Return true if PHI is unsupported in loop interchange, i.e, PHI contains
+   ssa var appearing in any abnormal phi node.  */
+
+static inline bool
+unsupported_phi_node (gphi *phi)
+{
+  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi)))
+    return true;
+
+  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
+    {
+      tree arg = PHI_ARG_DEF (phi, i);
+      if (TREE_CODE (arg) == SSA_NAME
+         && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg))
+       return true;
+    }
+
+  return false;

I believe the above isn't necessary given you rule out abnormal edges
into the loop.
Did you have a testcase that broke?  A minor thing I guess if it is
just for extra
safety...

+/* Return true if all stmts in BB can be supported by loop interchange,
+   otherwise return false.  ILOOP is not NULL if this loop_cand is the
+   outer loop in loop nest.  */
+
+bool
+loop_cand::unsupported_operation (basic_block bb, loop_cand *iloop)
+{

docs and return value suggest this be named supported_operation

+      /* Or it's invariant memory reference and only used by inner loop.  */
+      if (gimple_assign_single_p (stmt)
+         && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
+         && TREE_CODE (lhs) == SSA_NAME
+         && single_use_in_loop (lhs, iloop->loop))
+       continue;

comment suggests multiple uses in loop would be ok?

+      if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE
+         || lhs != re->init)
+       return;
+
+      if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE
+         || !REFERENCE_CLASS_P (rhs))
+       return;

lhs and rhs are never NULL.  Please initialize them outside of the if.
You want to disallow DECL_P rhs here?

Can you add an overall function comment what this function does?  It seems
to detect a reduction as produced by loop store-motion?  Thus it tries to
get at enough information to perform the reverse transform?

During review I have a hard time distinguishing between locals and members
so can you please prefix all member variables with m_ as according to our
code guidelines?  I guess what adds to the confusion is the loop_cand argument
that sometimes is present for loop_cand member functions...
(I personally prefer to prefix all member accesses with this-> but that's harder
to enforce)

+static void
+find_deps_in_bb_for_stmt (gimple_seq *stmts, basic_block bb, gimple *consumer)
+{

this extracts stmts starting from consumer but rules out other
consumers (recursively).
Is that intended?  I wonder if you can avoid this dance...

+      /* Transform simple reduction of below form:
+
+          init = 0;
+          loop:
+            var = phi<init, next>
+            next = var op ...
+          reduc_sum = phi<next>
+          MEM_REF[...] = reduc_sum
+
+        into:
+

which one is 'consumer'?  I wonder if you can simply leave all the
dead code in place
just emitting the load-update-store stmts and removing the
MEM_REF[...] = reduc_sum
above?

+/* Eliminate dead code after loop interchange.  */
+
+void
+loop_cand::eliminate_dead_code (void)
+{

PRE tracks "possible" dead defs and has a worklist algorithm in
remove_dead_inserted_code.
I wonder if you can do sth similar?  That is, I wonder if doing a
sweep from last to first
stmt wouldn't be better here?

+         /* Given copy propagation is done during interchange, we can
+            simply check zero uses of var and eliminate it.  */
+         if (is_gimple_assign (stmt)
+             && !gimple_vuse (stmt)

you probably meant to check gimple_vdef?

+             && !gimple_has_volatile_ops (stmt)
+             && !gimple_has_side_effects (stmt)

the former is redundant

+             && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
+             && TREE_CODE (lhs) == SSA_NAME
+             && has_zero_uses (lhs))

if you use gimple_get_lhs () you can also handle calls.

That said, this seems to be a very poor DCE, why is it necessary at all?

+/* Interchange niters info of ILOOP and OLOOP while reset any other niters
+   estimates information for now.  */
+
+static inline void
+interchange_nb_iterations (struct loop *iloop, struct loop *oloop)
+{
+  tree nb_iterations = oloop->nb_iterations;
+
+  oloop->any_upper_bound = false;
+  oloop->any_likely_upper_bound = false;
+  free_numbers_of_iterations_estimates (oloop);
+
+  oloop->nb_iterations = iloop->nb_iterations;
+
+  iloop->any_upper_bound = false;
+  iloop->any_likely_upper_bound = false;
+  free_numbers_of_iterations_estimates (iloop);
+
+  iloop->nb_iterations = nb_iterations;

use std::swap?  Also I think if you can keep nb_iterations you
can certainly keep the upper bounds.  You're probably
afraid of the ->stmt references in the nb_iter_bound entries?

Anyway, either scrap everything or try to keep everything.

+  for (i = 0; oloop.reductions.iterate (i, &re); ++i)
+    {
+      if (re->type != DOUBLE_RTYPE)
+       gcc_unreachable ();
+
+      use_operand_p use_p;
+      imm_use_iterator iterator;
+      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var)
+       mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var);
+      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next)
+       mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next);
+      if (TREE_CODE (re->init) == SSA_NAME)
+       {
+         FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init)
+           mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init);
+       }

can you add a comment what you are doing here?

Note that other loop opts simply scrap all debug stmts ...

+static void
+compute_access_stride (struct loop *loop_nest, struct loop *loop,
+                      data_reference_p dr)
+{
...
+  tree ref = DR_REF (dr);
+  tree scev_base = build_fold_addr_expr (ref);
+  tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref));
+  tree niters = build_int_cst (sizetype, AVG_LOOP_NITER);
+  access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size);
+
+  do {
+    tree scev_fn = analyze_scalar_evolution (loop, scev_base);
+    if (chrec_contains_undetermined (scev_fn)
+       || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num))
+      break;
...
+    strides->safe_push (scev_step);
+  } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL);
+

I _think_ you want to do

   scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming
DR_STMT (dr) is in loop
   scev_fn = instantiate_parameters (nest, scev_fn);
   if (chrec_contains_undetermined (scev_fn))
     return; // false?

and analyze the result which should be of the form

  { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ...

if canonical.  I think estimate_val_by_simplify_replace isn't needed
if you do that
(it also looks odd to replace all vairables in step by niter...).

I think keeping the chrec in the above form is also more suitable for what
the caller does so the outermost loop is simply

  loop = loop_nest;
  loop-over-all-dr-chrecs
    if (flow_loop_nested_p (loop, CHREC_LOOP (chrec)))
      loop = CHREC_LOOP (chrec);

given the outermost loop is the outer evolution.  If you sort the
stride vecs from inner
to outer you don't need prune_access_strides_not_in_loop.

+/* Count and return the number of loops in LOOP_NEST.  */
+
+unsigned int
+num_loops_in_loop_nest (struct loop *loop_nest)
+{
+  unsigned num_loops;
+  for (num_loops = 0; loop_nest; num_loops++, loop_nest = loop_nest->inner)
+    ;
+  return num_loops;

loop_depth (innermost) - loop_depth (nest)?

+static bool
+should_interchange_loops (unsigned i_idx, unsigned o_idx,
+                         vec<data_reference_p> datarefs,
+                         bool innermost_loops_p, bool dump_info_p = true)
+{

isn't all we need associating the above CHREC to sort after the CHREC_RIGHTs
and figure a permutation sequence to arrive there?  That is for the local
decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT [o_idx]
when we should interchange?

That subloop_stride_p and tracking invariant DRs looks a bit odd.  For loops
where a DR is invariant you simply do not have an evolution in that loop.
You seem to simply add strides in the inner and outer loops for each DR,
can you explain how that works?  Also I guess strides bigger than the
various cache-line size parameters should be treated 'equal'?  That is,
if we don't get any DR to a step that results in L1 cache hits because
two DRs share a cache line the interchange is pointless?

+/* Prune DATAREFS by removing any data reference not inside of LOOP.  */
+
+static inline void
+prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> datarefs)
+{
+  struct data_reference *dr;
+
+  for (unsigned i = 0; datarefs.iterate (i, &dr);)
+    if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr))))
+      i++;
+    else
+      {
+       datarefs.ordered_remove (i);

that's expensive.  It's better to keep moving DRs we want to keep
when walking the array.  That is, add a j you increment only when
we keep a DR, moving *i to *j.

+       if (dr->aux)
+         {
+           DR_ACCESS_STRIDE (dr)->release ();
+           free (dr->aux);
+         }
+       free_data_ref (dr);
+      }
+}

+
+  start_loop = prune_non_rectangle_loop_nest (innermost_loop, start_loop);
+

Hmm.  If you instantiate the SCEV for the niters for each loop in the nest
wrt the nest you can figure if it has any evolution in sth else than the
loop (evolution_function_is_univariate_p).   That is, this is not a problem
until you arrive at analyzing DR strides, right?  At which point you
can check for the appropriate form?

+  if (find_data_references_in_loop (start_loop, datarefs) == chrec_dont_know
+      /* Check if there is no data reference.  */
+      || datarefs->length () == 0
+      /* Check if there are too many of data references.  */
+      || ((int) datarefs->length ()
+         > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS))
+      /* Check if there is any data reference in loop latch.  We can't handle
+        loops which loop header and data references have different execution
+        times.  */
+      || dataref_niters_diff_to_loop_header (*datarefs)

this suggests to do your own find_data_references_in_loop so you can early
out.

Overall the flow through the pass is a bit hard to follow given there are
IMHO too many functions.

+unsigned int
+pass_linterchange::execute (function *fun)
+{
+  if (number_of_loops (fun) <= 2)
+    return 0;
+
+  bool changed_p = false;;
+  struct loop *loop;
+  vec<loop_p> loop_nest;
+  vec<data_reference_p> datarefs;
+  vec<ddr_p> ddrs;
+
+  FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
+    if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
+      {
+       tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs);
+       changed_p |= loop_interchange.interchange ();
+      }

you leak datarefs/ddrs?

Richard.


> Thanks,
> bin
>
> 2017-11-15  Bin Cheng  <bin.cheng@arm.com>
>
>     * Makefile.in (tree-ssa-loop-interchange.o): New object file.
>     * common.opt (ftree-loop-interchange): New option.
>     * doc/invoke.texi (-ftree-loop-interchange): Document new option.
>     * gimple-iterator.c (gsi_remove): New parameter determining if dbg
>     bind stmt is inserted or not.
>     * gimple-iterator.h (gsi_remove): New parameter in declaration.
>     * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter.
>     (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
>     * passes.def (pass_linterchange): New pass.
>     * timevar.def (TV_LINTERCHANGE): New time var.
>     * tree-pass.h (make_pass_linterchange): New declaration.
>     * tree-ssa-loop-interchange.cc: New file.
>     * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
>     Record IV before/after increment in new parameters.
>     * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>
> gcc/testsuite
> 2017-11-15  Bin Cheng  <bin.cheng@arm.com>
>
>     * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>     * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>     * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>     * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>     * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>     * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>     * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>     * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>     * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>     * gcc.dg/tree-ssa/loop-interchange-10.c: New test.
>     * gcc.dg/tree-ssa/loop-interchange-11.c: New test.
>
>>
>>
>> Ciao,
>> Michael.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH GCC]A simple implementation of loop interchange
  2017-11-20 14:59             ` Richard Biener
@ 2017-11-23 17:01               ` Bin.Cheng
  0 siblings, 0 replies; 10+ messages in thread
From: Bin.Cheng @ 2017-11-23 17:01 UTC (permalink / raw)
  To: Richard Biener; +Cc: Michael Matz, gcc-patches

Hi Richard,
Thanks for reviewing.  It's quite lot comment, I am trying to resolve
it one by one.  Here I have some questions as embedded.

On Mon, Nov 20, 2017 at 2:46 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <matz@suse.de> wrote:
>>
>>>
>>> I hope this is of some help to you :)
>> Thanks again, it's very helpful.
>>
>> I also fixed several bugs of previous implementation, mostly about debug info
>> statements and simple reductions.  As for test, I enabled this pass by default,
>> bootstrap and regtest GCC, I also build/run specs.  There must be some other
>> latent bugs in it, but guess we have to exercise it by enabling it at
>> some point.
>>
>> So any comments?
>
>  bool
> -gsi_remove (gimple_stmt_iterator *i, bool remove_permanently)
> +gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool insert_dbg)
>  {
>
> that you need this suggests you do stmt removal in wrong order (you need to
> do reverse dom order).
>
> +/* Maximum number of statements in loop nest for loop interchange.  */
> +
> +DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS,
> +         "loop-interchange-max-num-stmts",
> +         "The maximum number of stmts in loop nest for loop interchange.",
> +         64, 0, 0)
>
> is that to limit dependence computation?  In this case you should probably
> limit the number of data references instead?
No, this is to limit number of statements in loop.  We don't want to
do interchange for too large loops, right?

>
> +ftree-loop-interchange
> +Common Report Var(flag_tree_loop_interchange) Optimization
> +Enable loop interchange on trees.
> +
>
> please re-use -floop-interchange instead and change the GRAPHITE tests
> to use -floop-nest-optimize.  You can do that as pre-approved thing now.
>
> Please enable the pass by default at O3 via opts.c.
There are quite many (vectorize) test cases affected by interchange
(which is correct I believe), so I will prepare another patch enabling
it at O3 and adjusting the tests, to keep this patch small.

>
> diff --git a/gcc/tree-ssa-loop-interchange.cc b/gcc/tree-ssa-loop-interchange.cc
>
> gimple-loop-interchange.cc please.
>
> new file mode 100644
> index 0000000..abffbf6
> --- /dev/null
> +++ b/gcc/tree-ssa-loop-interchange.cc
> @@ -0,0 +1,2274 @@
> +/* Loop invariant motion.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
>
> Loop invariant motion? ... ;)
>
> Please add a "Contributed by ..." to have an easy way to figure people to blame.
>
> +}*induction_p;
> +
>
> space after '*'
>
> +}*reduction_p;
> +
>
> likewise.
>
> +/* Return true if PHI is unsupported in loop interchange, i.e, PHI contains
> +   ssa var appearing in any abnormal phi node.  */
> +
> +static inline bool
> +unsupported_phi_node (gphi *phi)
> +{
> +  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi)))
> +    return true;
> +
> +  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> +    {
> +      tree arg = PHI_ARG_DEF (phi, i);
> +      if (TREE_CODE (arg) == SSA_NAME
> +         && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg))
> +       return true;
> +    }
> +
> +  return false;
>
> I believe the above isn't necessary given you rule out abnormal edges
> into the loop.
> Did you have a testcase that broke?  A minor thing I guess if it is
> just for extra
> safety...
>
> +/* Return true if all stmts in BB can be supported by loop interchange,
> +   otherwise return false.  ILOOP is not NULL if this loop_cand is the
> +   outer loop in loop nest.  */
> +
> +bool
> +loop_cand::unsupported_operation (basic_block bb, loop_cand *iloop)
> +{
>
> docs and return value suggest this be named supported_operation
>
> +      /* Or it's invariant memory reference and only used by inner loop.  */
> +      if (gimple_assign_single_p (stmt)
> +         && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
> +         && TREE_CODE (lhs) == SSA_NAME
> +         && single_use_in_loop (lhs, iloop->loop))
> +       continue;
>
> comment suggests multiple uses in loop would be ok?
>
> +      if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE
> +         || lhs != re->init)
> +       return;
> +
> +      if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE
> +         || !REFERENCE_CLASS_P (rhs))
> +       return;
>
> lhs and rhs are never NULL.  Please initialize them outside of the if.
> You want to disallow DECL_P rhs here?
>
> Can you add an overall function comment what this function does?  It seems
> to detect a reduction as produced by loop store-motion?  Thus it tries to
> get at enough information to perform the reverse transform?
>
> During review I have a hard time distinguishing between locals and members
> so can you please prefix all member variables with m_ as according to our
> code guidelines?  I guess what adds to the confusion is the loop_cand argument
> that sometimes is present for loop_cand member functions...
> (I personally prefer to prefix all member accesses with this-> but that's harder
> to enforce)
>
> +static void
> +find_deps_in_bb_for_stmt (gimple_seq *stmts, basic_block bb, gimple *consumer)
> +{
>
> this extracts stmts starting from consumer but rules out other
> consumers (recursively).
> Is that intended?  I wonder if you can avoid this dance...
>
> +      /* Transform simple reduction of below form:
> +
> +          init = 0;
> +          loop:
> +            var = phi<init, next>
> +            next = var op ...
> +          reduc_sum = phi<next>
> +          MEM_REF[...] = reduc_sum
> +
> +        into:
> +
>
> which one is 'consumer'?  I wonder if you can simply leave all the
> dead code in place
> just emitting the load-update-store stmts and removing the
> MEM_REF[...] = reduc_sum
> above?
>
> +/* Eliminate dead code after loop interchange.  */
> +
> +void
> +loop_cand::eliminate_dead_code (void)
> +{
>
> PRE tracks "possible" dead defs and has a worklist algorithm in
> remove_dead_inserted_code.
> I wonder if you can do sth similar?  That is, I wonder if doing a
> sweep from last to first
> stmt wouldn't be better here?
>
> +         /* Given copy propagation is done during interchange, we can
> +            simply check zero uses of var and eliminate it.  */
> +         if (is_gimple_assign (stmt)
> +             && !gimple_vuse (stmt)
>
> you probably meant to check gimple_vdef?
>
> +             && !gimple_has_volatile_ops (stmt)
> +             && !gimple_has_side_effects (stmt)
>
> the former is redundant
>
> +             && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
> +             && TREE_CODE (lhs) == SSA_NAME
> +             && has_zero_uses (lhs))
>
> if you use gimple_get_lhs () you can also handle calls.
>
> That said, this seems to be a very poor DCE, why is it necessary at all?
It is a very poor DCE, but enough to catch cases generated by
interchange operation.  Because there is no DCE pass between
interchange and vectorizer, I don't want to pass them to vectorizer.
It's expensive, and maybe sometimes bad code generation in vectorizer.

>
> +/* Interchange niters info of ILOOP and OLOOP while reset any other niters
> +   estimates information for now.  */
> +
> +static inline void
> +interchange_nb_iterations (struct loop *iloop, struct loop *oloop)
> +{
> +  tree nb_iterations = oloop->nb_iterations;
> +
> +  oloop->any_upper_bound = false;
> +  oloop->any_likely_upper_bound = false;
> +  free_numbers_of_iterations_estimates (oloop);
> +
> +  oloop->nb_iterations = iloop->nb_iterations;
> +
> +  iloop->any_upper_bound = false;
> +  iloop->any_likely_upper_bound = false;
> +  free_numbers_of_iterations_estimates (iloop);
> +
> +  iloop->nb_iterations = nb_iterations;
>
> use std::swap?  Also I think if you can keep nb_iterations you
> can certainly keep the upper bounds.  You're probably
> afraid of the ->stmt references in the nb_iter_bound entries?
>
> Anyway, either scrap everything or try to keep everything.
>
> +  for (i = 0; oloop.reductions.iterate (i, &re); ++i)
> +    {
> +      if (re->type != DOUBLE_RTYPE)
> +       gcc_unreachable ();
> +
> +      use_operand_p use_p;
> +      imm_use_iterator iterator;
> +      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var)
> +       mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var);
> +      FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next)
> +       mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next);
> +      if (TREE_CODE (re->init) == SSA_NAME)
> +       {
> +         FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init)
> +           mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init);
> +       }
>
> can you add a comment what you are doing here?
>
> Note that other loop opts simply scrap all debug stmts ...
>
> +static void
> +compute_access_stride (struct loop *loop_nest, struct loop *loop,
> +                      data_reference_p dr)
> +{
> ...
> +  tree ref = DR_REF (dr);
> +  tree scev_base = build_fold_addr_expr (ref);
> +  tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref));
> +  tree niters = build_int_cst (sizetype, AVG_LOOP_NITER);
> +  access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size);
> +
> +  do {
> +    tree scev_fn = analyze_scalar_evolution (loop, scev_base);
> +    if (chrec_contains_undetermined (scev_fn)
> +       || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num))
> +      break;
> ...
> +    strides->safe_push (scev_step);
> +  } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL);
> +
>
> I _think_ you want to do
>
>    scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming
> DR_STMT (dr) is in loop
>    scev_fn = instantiate_parameters (nest, scev_fn);
>    if (chrec_contains_undetermined (scev_fn))
>      return; // false?
>
> and analyze the result which should be of the form
>
>   { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ...
>
> if canonical.  I think estimate_val_by_simplify_replace isn't needed
> if you do that
> (it also looks odd to replace all vairables in step by niter...).
No, estimate_val_by_simplify_replace is for cases like:

foo (int *p, int n)
{
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      for (int k = 0; k < n; k++)
        p[k + j * n + i * n * n] += 1;
}

So the step1/step2 are linear functions of n, which are not constant.
The loop here can be seen as inlined instantiate_parameters.  Calling
it directly can't compute step1/step2 neither.
I have to admit SCEV still doesn't work very well with pointer
addresses.  Even worse, for above case, call to instantiate_parameters
gives:
{{p_21(D) + (sizetype) ((long unsigned int) ((int) ((unsigned int)
n_18(D) + 4294967295) * n_18(D)) * 4), +, (long unsigned int) (n_18(D)
* n_18(D)) * 4}_1, +, 4}_3

This isn't canonical at all, there might be something wrong in SCEV.

So if you don't mind, I tend to keep the code, computing scev for base
iteratively for each loop in nest.
Function estimate_val_by_simplify_replace does look odd, we can get
rid of it by giving up on all pointer cases.  Any preference?

>
> I think keeping the chrec in the above form is also more suitable for what
> the caller does so the outermost loop is simply
>
>   loop = loop_nest;
>   loop-over-all-dr-chrecs
>     if (flow_loop_nested_p (loop, CHREC_LOOP (chrec)))
>       loop = CHREC_LOOP (chrec);
>
> given the outermost loop is the outer evolution.  If you sort the
> stride vecs from inner
> to outer you don't need prune_access_strides_not_in_loop.
Problem is loop_nest created by data dependence analyzer is sorted
from outer to inner.  Here I sort strides too so that we don't have
trouble in should_interchange_loops and update_data_refs.

Thanks,
bin
>
> +/* Count and return the number of loops in LOOP_NEST.  */
> +
> +unsigned int
> +num_loops_in_loop_nest (struct loop *loop_nest)
> +{
> +  unsigned num_loops;
> +  for (num_loops = 0; loop_nest; num_loops++, loop_nest = loop_nest->inner)
> +    ;
> +  return num_loops;
>
> loop_depth (innermost) - loop_depth (nest)?
>
> +static bool
> +should_interchange_loops (unsigned i_idx, unsigned o_idx,
> +                         vec<data_reference_p> datarefs,
> +                         bool innermost_loops_p, bool dump_info_p = true)
> +{
>
> isn't all we need associating the above CHREC to sort after the CHREC_RIGHTs
> and figure a permutation sequence to arrive there?  That is for the local
> decision you compute here it is CHREC_RIGHT [i_idx] > CHREC_RIGHT [o_idx]
> when we should interchange?
>
> That subloop_stride_p and tracking invariant DRs looks a bit odd.  For loops
> where a DR is invariant you simply do not have an evolution in that loop.
> You seem to simply add strides in the inner and outer loops for each DR,
> can you explain how that works?  Also I guess strides bigger than the
> various cache-line size parameters should be treated 'equal'?  That is,
> if we don't get any DR to a step that results in L1 cache hits because
> two DRs share a cache line the interchange is pointless?
>
> +/* Prune DATAREFS by removing any data reference not inside of LOOP.  */
> +
> +static inline void
> +prune_datarefs_not_in_loop (struct loop *loop, vec<data_reference_p> datarefs)
> +{
> +  struct data_reference *dr;
> +
> +  for (unsigned i = 0; datarefs.iterate (i, &dr);)
> +    if (flow_bb_inside_loop_p (loop, gimple_bb (DR_STMT (dr))))
> +      i++;
> +    else
> +      {
> +       datarefs.ordered_remove (i);
>
> that's expensive.  It's better to keep moving DRs we want to keep
> when walking the array.  That is, add a j you increment only when
> we keep a DR, moving *i to *j.
>
> +       if (dr->aux)
> +         {
> +           DR_ACCESS_STRIDE (dr)->release ();
> +           free (dr->aux);
> +         }
> +       free_data_ref (dr);
> +      }
> +}
>
> +
> +  start_loop = prune_non_rectangle_loop_nest (innermost_loop, start_loop);
> +
>
> Hmm.  If you instantiate the SCEV for the niters for each loop in the nest
> wrt the nest you can figure if it has any evolution in sth else than the
> loop (evolution_function_is_univariate_p).   That is, this is not a problem
> until you arrive at analyzing DR strides, right?  At which point you
> can check for the appropriate form?
>
> +  if (find_data_references_in_loop (start_loop, datarefs) == chrec_dont_know
> +      /* Check if there is no data reference.  */
> +      || datarefs->length () == 0
> +      /* Check if there are too many of data references.  */
> +      || ((int) datarefs->length ()
> +         > PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS))
> +      /* Check if there is any data reference in loop latch.  We can't handle
> +        loops which loop header and data references have different execution
> +        times.  */
> +      || dataref_niters_diff_to_loop_header (*datarefs)
>
> this suggests to do your own find_data_references_in_loop so you can early
> out.
>
> Overall the flow through the pass is a bit hard to follow given there are
> IMHO too many functions.
>
> +unsigned int
> +pass_linterchange::execute (function *fun)
> +{
> +  if (number_of_loops (fun) <= 2)
> +    return 0;
> +
> +  bool changed_p = false;;
> +  struct loop *loop;
> +  vec<loop_p> loop_nest;
> +  vec<data_reference_p> datarefs;
> +  vec<ddr_p> ddrs;
> +
> +  FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
> +    if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
> +      {
> +       tree_loop_interchange loop_interchange (loop_nest, datarefs, ddrs);
> +       changed_p |= loop_interchange.interchange ();
> +      }
>
> you leak datarefs/ddrs?
>
> Richard.
>
>
>> Thanks,
>> bin
>>
>> 2017-11-15  Bin Cheng  <bin.cheng@arm.com>
>>
>>     * Makefile.in (tree-ssa-loop-interchange.o): New object file.
>>     * common.opt (ftree-loop-interchange): New option.
>>     * doc/invoke.texi (-ftree-loop-interchange): Document new option.
>>     * gimple-iterator.c (gsi_remove): New parameter determining if dbg
>>     bind stmt is inserted or not.
>>     * gimple-iterator.h (gsi_remove): New parameter in declaration.
>>     * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter.
>>     (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
>>     * passes.def (pass_linterchange): New pass.
>>     * timevar.def (TV_LINTERCHANGE): New time var.
>>     * tree-pass.h (make_pass_linterchange): New declaration.
>>     * tree-ssa-loop-interchange.cc: New file.
>>     * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external.
>>     Record IV before/after increment in new parameters.
>>     * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>>
>> gcc/testsuite
>> 2017-11-15  Bin Cheng  <bin.cheng@arm.com>
>>
>>     * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-10.c: New test.
>>     * gcc.dg/tree-ssa/loop-interchange-11.c: New test.
>>
>>>
>>>
>>> Ciao,
>>> Michael.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-11-23 16:29 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-30 14:31 [PATCH GCC]A simple implementation of loop interchange Bin Cheng
2017-08-30 15:11 ` Richard Biener
2017-08-30 17:03   ` Bin.Cheng
2017-09-04 13:55     ` Richard Biener
2017-09-04 14:48       ` Bin.Cheng
2017-09-22 10:25       ` Bin.Cheng
2017-10-24 14:35         ` Michael Matz
2017-11-16 15:30           ` Bin.Cheng
2017-11-20 14:59             ` Richard Biener
2017-11-23 17:01               ` Bin.Cheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).