conditional lim

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* conditional lim
@ 2015-05-08 21:07 Evgeniya Maenkova
  2015-05-09 12:41 ` Andi Kleen
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Evgeniya Maenkova @ 2015-05-08 21:07 UTC (permalink / raw)
  To: Richard Biener, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 5080 bytes --]

Hi,

Could you please review my patch for predicated lim?

Let me note some details about it:



1)      Phi statements are still moved only if they have 1 or 2
arguments. However, phi statements could be move under conditions (as
it’s done for the other statements).  Probably, phi statement motion
with 3 + arguments could be implemented in the next patch after
predicated lim.

2)      Patch has limitations/features like (it was ok to me to
implement it such way, maybe I’m not correct. ):

a)      Loop1

    {

              If (a)

                 Loop2

                     {

                       Stmt - Invariant for Loop1

                     }

             }

       In this case Stmt will be moved only out of Loop2, because of if (a).

b)      Or

Loop1

    {

         …

         If (cond1)

              If (cond2)

                  If (cond3)

                      Stmt;

       }

Stmt will be moved out only if cond1 is always executed in Loop1.

c)       It took me a long time to write all of these code, so there
might be other peculiarities which I forgot to mention. :)

                       Let’s discuss these ones as you will review my patch.

3)      Patch consists of 9 files:

a)      gcc/testsuite/gcc.dg/tree-ssa/loop-7.c,
gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – changed tests:

-          gcc/testsuite/gcc.dg/tree-ssa/loop-7.c  changed as
predicated lim moves 2 more statements out of the loop;

-          gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – with conditional
lim recip optimization in this test doesn’t work (the corresponding
value is below threshold as I could see in the code for recip, 1<3).
So to have recip working in this test I changed test a little bit.

b)      gcc/tree-ssa-loop-im.c – the patched lim per se

c)       gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c,
gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c,

gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c,
gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c,

gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c,
gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c

the tests for conditional lim.

4)      Patch testing:

a)      make –k check (no difference in results for me for the clean
build and the patched one,

-          Revision: 222849,

-          uname -a

               Linux Istanbul 3.16.0-23-generic #31-Ubuntu SMP Tue Oct
21 18:00:35 UTC 2014 i686 i686 i686 GNU/Linux

b)      Bootstrap.

 It goes well now, however to fix it I have made a temporary hack in
the lim code. And with this fix patch definitely shouldn’t be
committed.

I did so, as I would like to discuss this issue first.

The problem is: I got stage2-stage3 comparison failure on the single
file (tree-vect-data-refs.o). After some investigation I understood
that tree-vect-data-refs.o differs being compiled with and without
‘-g’ option (yes, more exactly on stage 2 this is  ‘-g –O2 –gtoggle’,
and for stage 3 this is ‘-g –O2’. But to simplify things I can
reproduce this difference on the same build (even not bootstrapped),
with option ‘ –g’ and without it).

Of course, the file compiled with –g option will contain debug
information and will differ from the corresponding file without debug
information. I mean there is the difference reported by
contrib/compare-debug (I mean we talk about stripped files).

Then I compared assemblers and lim dump files and made a conclusion
the difference is:

There is statement _484=something, which is conditionally moved out of
loop X. In non debug cases no usages of _484 are left inside the loop
X. In debug case, there is the single use in debug statement. So for
debug case my patch generates phi statement for _484 to make it
available inside the loop X. For non debug case, no such phi statement
generated as there is no uses of _484.

There is one more statement with lhs=_493, with exactly this pattern
of use. But this is not important for our discussion.



So, in my opinion it’s ok to generate additional phi node for debug
case. But I’m not a compiler expert and maybe there is some
requirement that debug and non-debug versions should differ only by
debug statements, I don’t know.



My temporary hack fix is just removing of such debug statements (no
debug statements, no problems J).



The alternatives I see are:

-          Move debug statements outside the loop (not good in my opinion);

-          Somehow reset values in debug statements (not good in my
opinion, if I could provide correct information for them);

-          Generate phi for debug statements and fix bootstrap (say,
let’s have some list of files, which we have special check for. I mean
for my case, the difference is not in stage2 and stage3, it’s in debug
and non-debug). I like this variant. However, as I don’t see such list
of special files in the whole gcc, I think I might be missing
something.

What do you think?



Thanks,



Evgeniya

[-- Attachment #2: patch_May8_3 --]
[-- Type: application/octet-stream, Size: 65515 bytes --]

Index: gcc/testsuite/gcc.dg/tree-ssa/loop-7.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/loop-7.c	(revision 222849)
+++ gcc/testsuite/gcc.dg/tree-ssa/loop-7.c	(working copy)
@@ -28,8 +28,9 @@ int xxx (void)
 }
 
 /* Calls to cst_fun1 and pure_fun1 may be moved out of the loop.
-   Calls to cst_fun2 and pure_fun2 should not be, since calling
-   with k = 0 may be invalid.  */
+   Calls to cst_fun2 and pure_fun2 should be moved under condition,
+   since calling with k = 0 may be invalid.  */
 
-/* { dg-final { scan-tree-dump-times "Moving statement" 2 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Moving statement" 4 "lim1" } } */
+/* { dg-final { scan-tree-dump-times "Conditionally executed" 2 "lim1" } } */
 /* { dg-final { cleanup-tree-dump "lim1" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/recip-3.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/recip-3.c	(revision 222849)
+++ gcc/testsuite/gcc.dg/tree-ssa/recip-3.c	(working copy)
@@ -7,18 +7,19 @@ double F[2] = { 0.0, 0.0 }, e;
 float h ()
 {
 	int i;
-	double E, W, P, d;
+	double E, W, P1, d, P2, S;
 
 	W = 1.1;
 	d = 2.*e;
 	E = 1. - d;
 
 	for( i=0; i < 2; i++ )
-		if( d > 0.01 )
-		{
-			P = ( W < E ) ? (W - E)/d : (E - W)/d;
-			F[i] += P;
-		}
+	  {
+	    S = (W < E) ? (W - E) : (E - W);
+	    P1 = S / d;
+	    P2 = (S * 3) /d;
+	    F[i] += (P1 > P2) ? P1 : P2;
+	  }
 
 	F[0] += E / d;
 }
Index: gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c	(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c	(working copy)
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdump-tree-lim1-details" } */
+
+void bar (int);
+void foo (int n, int m, int k)
+{
+  unsigned i;
+  for (i = 0; i < n; ++i)
+    {
+      if (k)
+	{
+	  int x;
+	  if (m < 0)
+	    x = 1;
+	  else
+	    x = m;
+	  bar(x);
+	}
+
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Moving PHI node" 1 "lim1"  } } */
+/* { dg-final { scan-tree-dump-times "Conditionally executed" 1 "lim1"  } } */
+/* { dg-final { scan-tree-dump-times "Moving statement" 0 "lim1"  } } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c	(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c	(working copy)
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdump-tree-lim1-details" } */
+
+void bar (int);
+void foo (int n, int m, int k)
+{
+  unsigned i;
+  for (i = 0; i < n; ++i)
+    {
+      if (k + i < 5)
+	{
+	  int x;
+	  if (m < 0)
+	    x = 1;
+	  else
+	    x = m;
+	  bar(x);
+	}
+
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Moving" 0 "lim1"  } } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c	(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c	(working copy)
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdump-tree-lim1-details" } */
+
+void bar (int);
+void foo (int n, int m, int k)
+{
+  unsigned i;
+  for (i = 0; i < n; ++i)
+    {
+      if (n + m > 5000)
+	continue;
+      if (k)
+	{
+	  int x;
+	  if (m < 0)
+	    x = 1;
+	  else
+	    x = m;
+	  bar(x);
+	}
+
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Moving PHI node" 1 "lim1"  } } */
+/* { dg-final { scan-tree-dump-times "Conditionally executed" 1 "lim1"  } } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c	(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c	(working copy)
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+
+void bar (long);
+void foo (int n, int m, int k, int s)
+{
+  unsigned x,y,z;
+  long a,b,c;
+
+  for (x = 0; x < 1000; ++x)
+    {
+      c = c + k;
+      for (z = 0; z < 2000; z +=3)
+	{
+	  if (k)
+	    {
+	      a += m * n;
+	      b += m * x;
+	    }
+	}
+    }
+  bar (a);
+  bar (b);
+  bar (c);
+
+}
+
+/* { dg-final { scan-tree-dump-times "Conditionally executed" 3 "lim1"  } } */
+/* { dg-final { scan-tree-dump-times "Moving statement" 3 "lim1"  } } */
+/* { dg-final { scan-tree-dump-times "out of loop 2" 2 "lim1"  } } */
+/* { dg-final { scan-tree-dump-times "out of loop 1" 1 "lim1"  } } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c	(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c	(working copy)
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim1-details" } */
+
+void bar (long);
+void foo (int n, int m, int k, int s)
+{
+  unsigned x,y,z;
+  long a,b,c;
+
+  for (x = 0; x < 1000; ++x)
+    {
+      c = c + k;
+      if (c + k < 200)
+	continue;
+      for (z = 0; z < 2000; z +=3)
+	{
+	  if (k)
+	    {
+	      a += m * n;
+	      b += m * x;
+	    }
+	}
+    }
+  bar (a);
+  bar (b);
+  bar (c);
+
+}
+
+/* { dg-final { scan-tree-dump-times "Conditionally executed" 3 "lim1"  } } */
+/* { dg-final { scan-tree-dump-times "Moving statement" 3 "lim1"  } } */
+/* { dg-final { scan-tree-dump-times "out of loop 2" 3 "lim1"  } } */
+/* { dg-final { scan-tree-dump-times "out of loop 1" 0 "lim1"  } } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c	(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c	(working copy)
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdump-tree-lim1-details" } */
+
+struct { int x; int y; } global;
+void foo(int n, int m)
+{
+  int i;
+  for ( i=0; i<n; i++)
+    if (m > 0)
+      global.y += global.x*global.x;
+}
+
+/* { dg-final { scan-tree-dump "Executing store motion of global.y" "lim1" } } */
+/* { dg-final { scan-tree-dump "Moving statement.*global.x.*out of loop 1" "lim1" } } */
+/* { dg-final { cleanup-tree-dump "lim1" } } */
Index: gcc/tree-ssa-loop-im.c
===================================================================
--- gcc/tree-ssa-loop-im.c	(revision 222849)
+++ gcc/tree-ssa-loop-im.c	(working copy)
@@ -65,34 +65,13 @@ along with GCC; see the file COPYING3.
 #include "domwalk.h"
 #include "params.h"
 #include "tree-pass.h"
+#include "tree-dfa.h"
 #include "flags.h"
 #include "tree-affine.h"
 #include "tree-ssa-propagate.h"
 #include "trans-mem.h"
 #include "gimple-fold.h"
 
-/* TODO:  Support for predicated code motion.  I.e.
-
-   while (1)
-     {
-       if (cond)
-	 {
-	   a = inv;
-	   something;
-	 }
-     }
-
-   Where COND and INV are invariants, but evaluating INV may trap or be
-   invalid from some other reason if !COND.  This may be transformed to
-
-   if (cond)
-     a = inv;
-   while (1)
-     {
-       if (cond)
-	 something;
-     }  */
-
 /* The auxiliary data kept for each statement.  */
 
 struct lim_aux_data
@@ -116,11 +95,65 @@ struct lim_aux_data
 				   is hoisted; i.e. those that define the
 				   operands of the statement and are inside of
 				   the MAX_LOOP loop.  */
+  bool ignore_condition;	/* This is used for store motion. */
+};
+
+/* The data used to create phi nodes after computations movement (per each loop
+   where stmts were moved to). */
+struct phi_data
+{
+  vec<gimple> stmts;		/* Stmts which phi nodes should be created
+				       for. As left parts of these stmts are
+				       used in stmts in some other levels. */
+  vec<gimple> stmts_fl_phi_only;
+			       /* Stmt which phi nodes should be created only
+				  on the first level. */
+
+  hash_map<tree, tree> *names_map;  /* For each of the statements in stmts
+				       field we store the name of phi node,
+				       which will be used outside of this
+				       level (i.e. in the base block which
+				       is the preheader dominator).
+				       I.e.
+					 (cond)
+					   /   \
+				 (bb1) (cond2)
+					 |       /    \
+					 | (bb2) (bb with stmt)
+					  \       \    /
+					   \ (bb with intermidiate phi node)
+					    \       /
+					 (phi node with
+					     the name from this map,
+					  this is the preheader dominator)*/
+  hash_map<tree, tree> *fl_names_map; /* This is the first level phi node names,
+					 i.e phi node in the first bb after
+					 bb with stmt (see the picture above).
+					 This map is used only to replace var
+					 uses in phi with 2 arguments.*/
+  hash_map<tree, tree> *defs_map;   /* This is used when we create phi nodes
+				       and need to know definitions on outer
+				       loops. */
+
 };
 
 /* Maps statements to their lim_aux_data.  */
 
 static hash_map<gimple, lim_aux_data *> *lim_aux_data_map;
+typedef hash_map<gimple, basic_block> cond_data_map;
+/* This information is used when we move computations. Per loop we store
+base blocks created to keep conditions. */
+static hash_map<struct loop*, cond_data_map*> *level_cond_data_map;
+/* As by this pass the information about the post dominators is not collected
+ (CDI_POST_DOMINATORS flag), we do create some kind of post dominator info for
+created based blocks. This is only locally created, locally used and for anew
+created blocks only. */
+static hash_map<basic_block, basic_block> *post_dominators_map;
+/* This vector keeps the statements with lhs which is used in the conditions.
+After computations movements we do iterate over this to replace lhs as needed
+in these conditions. As there could be several copies of the same conditions,
+we cannot do these replacements before the all movements. */
+static vec<gimple> stmts_with_changed_lhs;
 
 /* Description of a memory reference location.  */
 
@@ -130,21 +163,49 @@ typedef struct mem_ref_loc
   gimple stmt;			/* The statement in that it occurs.  */
 } *mem_ref_loc_p;
 
+/* The type of movement condtion */
+enum cond_type
+{
+   TRUE_TARGET_OF,		     /* bb with move_cond is true target */
+   FALSE_TARGET_OF,		    /* bb with move_cond is false target */
+   ALWAYS_EXECUTED_IN,		 /* bb with move_cond is always executed
+					  in some loop */
+   NEW_BB_FROM_PREHEADER_REGION_OF     /* this is for anew created bb,
+					  indicates which preheader region
+					  this is. */
+};
+
+/* This data is per base block. */
+struct move_cond
+{
+   enum cond_type type;
+   int aux;
+   union {
+       gimple cond_stmt;   /* if cond_type is TRUE_TARGET_OF of FALSE_TARGET_OF,
+			      this is parent condition */
+       struct loop* loop;  /* this is either the loop where correspoding base
+			      block is always executed
+			 (type == ALWAYS_EXECUTED_IN) or the loop which
+			      preheader region corresponding base block belongs
+			      to (NEW_BB_FROM_PREHEADER_REGION_OF). */
+   };
+};
+
 
 /* Description of a memory reference.  */
 
 typedef struct im_mem_ref
 {
   unsigned id;			/* ID assigned to the memory reference
-				   (its index in memory_accesses.refs_list)  */
+				 (its index in memory_accesses.refs_list)  */
   hashval_t hash;		/* Its hash value.  */
 
   /* The memory access itself and associated caching of alias-oracle
      query meta-data.  */
   ao_ref mem;
 
-  bitmap stored;		/* The set of loops in that this memory location
-				   is stored to.  */
+  bitmap stored;		/* The set of loops in that this memory
+				   location is stored to.  */
   vec<mem_ref_loc>		accesses_in_loop;
 				/* The locations of the accesses.  Vector
 				   indexed by the loop number.  */
@@ -157,8 +218,8 @@ typedef struct im_mem_ref
 				   If it is stored in the loop, this store
 				     is independent on all other loads and
 				     stores.
-				   If it is only loaded, then it is independent
-				     on all stores in the loop.  */
+				   If it is only loaded, then it is
+				   independent on all stores in the loop. */
   bitmap_head dep_loop;		/* The complement of INDEP_LOOP.  */
 } *mem_ref_p;
 
@@ -228,10 +289,54 @@ static bool ref_indep_loop_p (struct loo
 /* Minimum cost of an expensive expression.  */
 #define LIM_EXPENSIVE ((unsigned) PARAM_VALUE (PARAM_LIM_EXPENSIVE))
 
+#define BB_MOVE_COND(BB) ((struct move_cond *) (BB->aux))
 /* The outermost loop for which execution of the header guarantees that the
    block will be executed.  */
-#define ALWAYS_EXECUTED_IN(BB) ((struct loop *) (BB)->aux)
-#define SET_ALWAYS_EXECUTED_IN(BB, VAL) ((BB)->aux = (void *) (VAL))
+#define ALWAYS_EXECUTED_IN(BB)\
+		 (BB_MOVE_COND (BB) != NULL\
+		 && BB_MOVE_COND (BB)->type == ALWAYS_EXECUTED_IN\
+		 ? BB_MOVE_COND (BB)->loop: NULL)
+#define SET_ALWAYS_EXECUTED_IN(BB, VAL)\
+		  BB->aux = (void *) (XCNEW (struct move_cond));\
+		  BB_MOVE_COND (BB)->type=ALWAYS_EXECUTED_IN;\
+		  BB_MOVE_COND (BB)->loop=VAL;
+#define IS_TRUE_TARGET(BB) (BB_MOVE_COND (BB) != NULL\
+			    && BB_MOVE_COND (BB)->type == TRUE_TARGET_OF)
+#define IS_FALSE_TARGET(BB) (BB_MOVE_COND (BB) != NULL\
+			    && BB_MOVE_COND (BB)->type == FALSE_TARGET_OF)
+#define IS_THE_SAME_TARGET_TYPE(BB, type) (BB_MOVE_COND (BB) != NULL\
+			     && BB_MOVE_COND (BB)->type == type)
+
+/* Whether this BB is true or false target. (If the outermost condition is
+   not always executed in the loop, correspoding move_cond of the all child
+   cond and the BB will be NULL */
+#define IS_TRUE_OR_FALSE_TARGET(BB) (IS_TRUE_TARGET (BB)\
+	  || IS_FALSE_TARGET (BB))
+/* Condition statement for BB */
+#define COND_STMT(BB) (BB_MOVE_COND (BB)->cond_stmt)
+#define SET_TARGET_OF(cond,type,BB)  if (BB->aux == NULL)\
+				     BB->aux = (void *) (new move_cond);\
+				     BB_MOVE_COND (BB)->type=type;\
+				     BB_MOVE_COND (BB)->cond_stmt=cond;\
+				     BB_MOVE_COND (BB)->aux=false;
+#define SET_MOVE_COND_AUX(BB, VALUE) (BB_MOVE_COND (BB)->aux = VALUE)
+/* This is used when we evaluate whether the outermost condition is always
+   executed in the loop. */
+#define GET_MOVE_COND_AUX(BB) (BB_MOVE_COND (BB)->aux)
+#define IS_NEW_BB(BB) (BB_MOVE_COND (BB) != NULL\
+		       && BB_MOVE_COND (BB)->type\
+	       == NEW_BB_FROM_PREHEADER_REGION_OF)
+#define SET_LOOP_OF_PREHEADER_REGION(BB,VAL)\
+	   if (BB->aux == NULL)\
+	      {BB->aux = (void *)\
+			 (XCNEW (struct move_cond));\
+	       BB_MOVE_COND (BB)->type\
+		 =NEW_BB_FROM_PREHEADER_REGION_OF;\
+	       BB_MOVE_COND (BB)->loop=VAL;}
+#define LOOP_OF_PREHEADER_REGION(BB) (IS_NEW_BB (BB)\
+	       ? BB_MOVE_COND (BB)->loop: NULL)
+
+#define FREE_AUX_FOR_BB(BB) XDELETE (BB_MOVE_COND (BB));BB->aux=NULL;
 
 /* ID of the shared unanalyzable mem.  */
 #define UNANALYZABLE_MEM_ID 0
@@ -244,7 +349,7 @@ init_lim_data (gimple stmt)
 {
   lim_aux_data *p = XCNEW (struct lim_aux_data);
   lim_aux_data_map->put (stmt, p);
-
+  p->ignore_condition = false;
   return p;
 }
 
@@ -258,6 +363,26 @@ get_lim_data (gimple stmt)
   return *p;
 }
 
+static cond_data_map *get_cond_data_map (struct loop * level)
+{
+  cond_data_map **p = level_cond_data_map->get (level);
+  if (!p)
+    {
+      cond_data_map *map = new cond_data_map;
+      level_cond_data_map->put (level, map);
+      return map;
+    }
+   return *p;
+}
+
+static basic_block get_bb_by_cond (cond_data_map *map, gimple cond)
+{
+  basic_block *p = map->get (cond);
+  if (!p)
+    return NULL;
+  return *p;
+}
+
 /* Releases the memory occupied by DATA.  */
 
 static void
@@ -302,13 +427,9 @@ movement_possibility (gimple stmt)
   tree lhs;
   enum move_pos ret = MOVE_POSSIBLE;
 
-  if (flag_unswitch_loops
-      && gimple_code (stmt) == GIMPLE_COND)
-    {
-      /* If we perform unswitching, force the operands of the invariant
-	 condition to be moved out of the loop.  */
-      return MOVE_POSSIBLE;
-    }
+  if (gimple_code (stmt) == GIMPLE_COND)
+    return MOVE_POSSIBLE;
+
 
   if (gimple_code (stmt) == GIMPLE_PHI
       && gimple_phi_num_args (stmt) <= 2
@@ -540,13 +661,13 @@ stmt_cost (gimple stmt)
 
     case CONSTRUCTOR:
       /* Make vector construction cost proportional to the number
-         of elements.  */
+	 of elements.  */
       return CONSTRUCTOR_NELTS (gimple_assign_rhs1 (stmt));
 
     case SSA_NAME:
     case PAREN_EXPR:
       /* Whether or not something is wrapped inside a PAREN_EXPR
-         should not change move cost.  Nor should an intermediate
+	 should not change move cost.  Nor should an intermediate
 	 unpropagated SSA name copy.  */
       return 0;
 
@@ -707,21 +828,18 @@ extract_true_false_args_from_phi (basic_
    is defined in, and true otherwise.  */
 
 static bool
-determine_max_movement (gimple stmt, bool must_preserve_exec)
+determine_max_movement (gimple stmt, struct loop *start_level)
 {
   basic_block bb = gimple_bb (stmt);
   struct loop *loop = bb->loop_father;
-  struct loop *level;
+  struct loop *level = start_level;
   struct lim_aux_data *lim_data = get_lim_data (stmt);
   tree val;
   ssa_op_iter iter;
 
-  if (must_preserve_exec)
-    level = ALWAYS_EXECUTED_IN (bb);
-  else
-    level = superloop_at_depth (loop, 1);
   lim_data->max_loop = level;
 
+
   if (gphi *phi = dyn_cast <gphi *> (stmt))
     {
       use_operand_p use_p;
@@ -762,7 +880,6 @@ determine_max_movement (gimple stmt, boo
 
       min_cost = MIN (min_cost, total_cost);
       lim_data->cost += min_cost;
-
       if (gimple_phi_num_args (phi) > 1)
 	{
 	  basic_block dom = get_immediate_dominator (CDI_DOMINATORS, bb);
@@ -887,7 +1004,7 @@ nonpure_call_p (gimple stmt)
   return gimple_has_side_effects (stmt);
 }
 
-/* Rewrite a/b to a*(1/b).  Return the invariant stmt to process.  */
+/* Rewrite a/b to a* (1/b).  Return the invariant stmt to process.  */
 
 static gimple
 rewrite_reciprocal (gimple_stmt_iterator *bsi)
@@ -921,7 +1038,7 @@ rewrite_reciprocal (gimple_stmt_iterator
 }
 
 /* Check if the pattern at *BSI is a bittest of the form
-   (A >> B) & 1 != 0 and in this case rewrite it to A & (1 << B) != 0.  */
+ (A >> B) & 1 != 0 and in this case rewrite it to A & (1 << B) != 0.  */
 
 static gimple
 rewrite_bittest (gimple_stmt_iterator *bsi)
@@ -1027,6 +1144,151 @@ public:
   virtual void before_dom_children (basic_block);
 };
 
+#define EVALUATION_IN_PROGRESS (0x01 << 1)
+#define EVALUATED (0x01 << 2)
+
+int is_first_level_cond_always_executed (basic_block bb)
+{
+  if (BB_MOVE_COND (bb) == NULL)
+    return false;
+  /* We have recursive conditions, will not move these computations. */
+  if (GET_MOVE_COND_AUX (bb) & EVALUATION_IN_PROGRESS)
+    {
+      FREE_AUX_FOR_BB (bb);
+      return false;
+    }
+
+  if (GET_MOVE_COND_AUX (bb) & EVALUATED)
+    return true;
+
+  if (ALWAYS_EXECUTED_IN (bb) != NULL)
+    {
+      SET_MOVE_COND_AUX (bb, EVALUATED);
+      return true;
+    }
+  if (!IS_TRUE_OR_FALSE_TARGET (bb))
+    {
+      FREE_AUX_FOR_BB (bb);
+      return false;
+    }
+
+   SET_MOVE_COND_AUX (bb, EVALUATION_IN_PROGRESS);
+   bool result = is_first_level_cond_always_executed (COND_STMT (bb)->bb);
+
+   if (!result)
+     {
+       FREE_AUX_FOR_BB (bb);
+     } else
+       SET_MOVE_COND_AUX (bb, EVALUATED);
+
+   return result;
+}
+
+/* Checks whether bb is conditionally executed. However, we only count
+   as conditionally executed the blocks which outermost condition
+   is always executed.
+   I.e.
+   if (a) if (b) if (c) stmt; ('a' should be always executed in some loop,
+   then base blocks corresponging to 'b', 'c' and 'stmt' will be called
+   conditionally executed, otherwise correspoding move_cond will be NULL
+   for basic blocks corresponding to 'b', 'c' and 'stmt'). */
+bool check_conditionally_executed (basic_block bb)
+{
+  if (!IS_TRUE_OR_FALSE_TARGET (bb))
+    return false;
+  basic_block cond_bb = COND_STMT (bb)->bb;
+  if (cond_bb == bb)
+    {
+      FREE_AUX_FOR_BB (bb);
+      return false;
+    }
+  return is_first_level_cond_always_executed (cond_bb);
+}
+
+bool calculate_tgt_level (gimple stmt, struct loop *outermost,
+			 bool cond_executed, basic_block bb)
+{
+  struct lim_aux_data *lim_data = init_lim_data (stmt);
+  lim_data->always_executed_in = outermost;
+
+  struct loop * start_level = NULL;
+  if (cond_executed)
+    {
+      gimple cond = COND_STMT (bb);
+      struct lim_aux_data *cond_lim_data = get_lim_data (cond);
+      if (cond_lim_data)
+	start_level = cond_lim_data->max_loop;
+	struct loop *current_loop = bb->loop_father;
+	if (start_level != NULL && current_loop != start_level
+	    && !flow_loop_nested_p (start_level, current_loop))
+	  start_level = NULL;
+     }
+  else /* The reason why we don't set start_level for MOVE_POSSIBLE
+	  statements to more outer level is: this statement could depend on
+	  conditional statements and even probably is not defined without this
+	  condition. So either we should analyze these ones and move
+	  under condition somehow or keep more strong start_level */
+    start_level = ALWAYS_EXECUTED_IN (bb);
+  if (start_level == NULL || !determine_max_movement (stmt, start_level))
+    {
+       lim_data->max_loop = NULL;
+       return false;
+    }
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+       print_gimple_stmt (dump_file, stmt, 2, 0);
+       fprintf (dump_file, "  invariant up to level %d, cost %d.\n\n",
+	       loop_depth (lim_data->max_loop),
+	       lim_data->cost);
+     }
+
+  if (lim_data->cost >= LIM_EXPENSIVE)
+    set_profitable_level (stmt);
+  return true;
+}
+
+/* This method is called only from handle_cond_stmt, which is called
+   in CDI_DOMINATORS order, so we call SET_TARGET_OF for all nodes which
+   is predicated by this condition and for _some_ of them predicated by
+   additional subconditions (as this method will be called again for
+   sub-conditions to correct things) */
+
+void handle_branch_edge (edge e, enum cond_type type, gimple stmt)
+ {
+   edge edge_loc = e;
+   basic_block bb = edge_loc->dest;
+   if (stmt->bb->loop_father != bb->loop_father)
+     return;
+   if (!single_pred_p (bb))
+     return;
+   basic_block dominator = bb;
+   struct loop *loop = bb->loop_father;
+   do
+     {
+       if (IS_THE_SAME_TARGET_TYPE (bb, type) && COND_STMT (bb) == stmt)
+	 {
+	    FREE_AUX_FOR_BB (bb);
+	    break;
+	 }
+       SET_TARGET_OF (stmt, type, bb);
+       edge_loc = EDGE_COUNT (bb->succs) > 0 ? EDGE_SUCC (bb, 0) : NULL;
+       if (edge_loc)
+	 bb = edge_loc->dest;
+      } while (edge_loc
+	    && bb->loop_father == loop
+	    &&  dominated_by_p (CDI_DOMINATORS, bb, dominator));
+ }
+
+void handle_cond_stmt (gimple stmt)
+{
+   basic_block bb = stmt->bb;
+   edge true_edge,false_edge;
+   extract_true_false_edges_from_block (bb, &true_edge, &false_edge);
+   handle_branch_edge (true_edge, TRUE_TARGET_OF, stmt);
+   handle_branch_edge (false_edge, FALSE_TARGET_OF, stmt);
+}
+
+
 /* Determine the outermost loops in that statements in basic block BB are
    invariant, and record them to the LIM_DATA associated with the statements.
    Callback for dom_walker.  */
@@ -1039,10 +1301,10 @@ invariantness_dom_walker::before_dom_chi
   gimple stmt;
   bool maybe_never = ALWAYS_EXECUTED_IN (bb) == NULL;
   struct loop *outermost = ALWAYS_EXECUTED_IN (bb);
-  struct lim_aux_data *lim_data;
 
   if (!loop_outer (bb->loop_father))
     return;
+  bool cond_executed = maybe_never && check_conditionally_executed (bb);
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file, "Basic block %d (loop %d -- depth %d):\n\n",
@@ -1063,32 +1325,16 @@ invariantness_dom_walker::before_dom_chi
 	pos = movement_possibility (stmt);
 	if (pos == MOVE_IMPOSSIBLE)
 	  continue;
-
-	lim_data = init_lim_data (stmt);
-	lim_data->always_executed_in = outermost;
-
-	if (!determine_max_movement (stmt, false))
-	  {
-	    lim_data->max_loop = NULL;
-	    continue;
-	  }
-
-	if (dump_file && (dump_flags & TDF_DETAILS))
-	  {
-	    print_gimple_stmt (dump_file, stmt, 2, 0);
-	    fprintf (dump_file, "  invariant up to level %d, cost %d.\n\n",
-		     loop_depth (lim_data->max_loop),
-		     lim_data->cost);
-	  }
-
-	if (lim_data->cost >= LIM_EXPENSIVE)
-	  set_profitable_level (stmt);
+	if (!calculate_tgt_level (stmt, outermost, cond_executed, bb))
+	  continue;
       }
 
-  for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
+  for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi);
+						 gsi_next (&bsi))
     {
       stmt = gsi_stmt (bsi);
-
+      if (gimple_code (stmt) == GIMPLE_COND)
+	handle_cond_stmt (stmt);
       pos = movement_possibility (stmt);
       if (pos == MOVE_IMPOSSIBLE)
 	{
@@ -1116,7 +1362,7 @@ invariantness_dom_walker::before_dom_chi
 	  struct loop *ol1 = outermost_invariant_loop (op1,
 					loop_containing_stmt (stmt));
 
-	  /* If divisor is invariant, convert a/b to a*(1/b), allowing reciprocal
+	  /* If divisor is invariant, convert a/b to a* (1/b), allowing reciprocal
 	     to be hoisted out of loop, saving expensive divide.  */
 	  if (pos == MOVE_POSSIBLE
 	      && gimple_assign_rhs_code (stmt) == RDIV_EXPR
@@ -1137,55 +1383,656 @@ invariantness_dom_walker::before_dom_chi
 	    stmt = rewrite_bittest (&bsi);
 	}
 
-      lim_data = init_lim_data (stmt);
-      lim_data->always_executed_in = outermost;
+      if (maybe_never && !cond_executed && pos == MOVE_PRESERVE_EXECUTION)
+	continue;
 
-      if (maybe_never && pos == MOVE_PRESERVE_EXECUTION)
+      if (!calculate_tgt_level (stmt, outermost, cond_executed, bb))
 	continue;
+    }
+}
+
+class move_computations_dom_walker : public dom_walker
+{
+public:
+  move_computations_dom_walker (cdi_direction direction)
+    : dom_walker (direction), todo_ (0) {}
+
+  virtual void before_dom_children (basic_block);
+
+  unsigned int todo_;
+};
+
+/* This function should be called only for anew created basic blocks */
+basic_block get_postdominator (basic_block bb)
+{
+  basic_block *p = post_dominators_map->get (bb);
+  /* as this is called only for anew created bb */
+  gcc_assert (p != NULL);
+  return *p;
+}
+
+/* This function is called only for bb created in this pass. */
+edge find_branch_edge (basic_block bb, enum cfg_edge_flags flag)
+{
+  edge e;
+  edge_iterator ei;
 
-      if (!determine_max_movement (stmt, pos == MOVE_PRESERVE_EXECUTION))
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    if (e->flags & flag)
+      break;
+  if (e != NULL && get_postdominator (bb) != e->dest)
+    return e;
+  return NULL;
+}
+
+/* This function is called only for bb created in this pass.
+   Called to get basic block where to add some statement to. The branch
+   is defined by first_in_branch (the is true or false edge, which src is bb
+   and dest is first_in_branch).*/
+basic_block get_most_remoute_in_branch (basic_block bb,
+				       basic_block first_in_branch,
+				       struct loop * level)
+{
+  basic_block post_dominator = get_postdominator (bb);
+  edge e;
+  edge_iterator ei;
+  basic_block merger = NULL;
+  edge prev= NULL;
+  struct vec<edge> *edges_to_change_dest = XCNEW (struct vec<edge>);
+
+  FOR_EACH_EDGE (e, ei, post_dominator->preds)
+    {
+      if (dominated_by_p (CDI_DOMINATORS, e->src, first_in_branch))
 	{
-	  lim_data->max_loop = NULL;
-	  continue;
+	  if (prev != NULL)
+	    {
+	      if (merger == NULL)
+		{
+		  merger = create_empty_bb (first_in_branch);
+
+		  SET_LOOP_OF_PREHEADER_REGION (merger, level);
+		  edges_to_change_dest->safe_push (prev);
+		  merger->frequency = EDGE_FREQUENCY (prev);
+		  set_immediate_dominator (CDI_DOMINATORS, merger,
+					   first_in_branch);
+		  post_dominators_map->put (first_in_branch, merger);
+		  add_bb_to_loop (merger, first_in_branch->loop_father);
+		}
+	      edges_to_change_dest->safe_push (e);
+	      merger->frequency += EDGE_FREQUENCY (e);
+	    }
+	  else
+	    prev = e;
 	}
+    }
+   if (merger != NULL)
+     {
+       edge new_edge = make_edge (merger, post_dominator,
+					    EDGE_FALLTHRU);
+       new_edge->probability = REG_BR_PROB_BASE;
+       unsigned i;
+       for (i = 0; edges_to_change_dest->iterate (i, &e); ++i)
+	 redirect_edge_succ (e, merger);
+       free (edges_to_change_dest);
+       return merger;
+     }
+  return prev->src;
+}
+
+/* This function is called to get basic block to add statement
+ (under condition from bb, whether true or false edge is defined by flag).
+   Probability fields of the edges added during predicated lim is computed
+   as described in profile.c when profile is absent.
+   Counts fileds of the edges and basic blocks added during predicatedlim
+   is not updated.
+   Frequency fields of the basic blocks added during predicated lim is
+   computed as sum of the edges frequencies calculated by EDGE_FREQUENCY.
+  */
+
+basic_block make_branch (basic_block bb, enum cfg_edge_flags flag,
+			struct loop *level)
+{
+  edge e;
+  edge_iterator ei;
+  basic_block result = NULL;
+
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    if (e->flags & flag)
+      break;
+  if (e != NULL)
+    result = split_edge (e);
+  else
+    {
+      basic_block new_bb = create_empty_bb (bb);
+      set_immediate_dominator (CDI_DOMINATORS, new_bb, bb);
+      edge e1 = make_edge (bb, new_bb, flag);
+      edge fallthru_edge = find_fallthru_edge (bb->succs);
+      fallthru_edge->flags = (flag & EDGE_TRUE_VALUE ? EDGE_FALSE_VALUE
+			       : EDGE_TRUE_VALUE);
+      int old_freq = EDGE_FREQUENCY (fallthru_edge);
+      basic_block post_dominator = fallthru_edge->dest;
+      edge e2 = make_edge (new_bb, post_dominator, EDGE_FALLTHRU);
+      add_bb_to_loop (new_bb, bb->loop_father);
+      post_dominators_map->put (bb, post_dominator);
+      e1->probability = REG_BR_PROB_BASE/2;
+      fallthru_edge->probability = REG_BR_PROB_BASE/2;
+      e2->probability = REG_BR_PROB_BASE;
+      new_bb->frequency = EDGE_FREQUENCY (e1);
+      post_dominator->frequency +=EDGE_FREQUENCY (fallthru_edge)
+				  + EDGE_FREQUENCY (e2) - old_freq;
+      result = new_bb;
+    }
+  SET_LOOP_OF_PREHEADER_REGION (result, level);
+  return result;
+}
+
+/* This method is called to get a basic block to add a conditional statement.*/
+basic_block get_block_for_predicated_statement (gimple cond, bool branch,
+					       edge preheader,
+					       cond_data_map *cond_map,
+					       struct loop *level)
+{
+  basic_block bb = get_bb_by_cond (cond_map, cond);
+
+  enum cfg_edge_flags flag = branch ? EDGE_TRUE_VALUE : EDGE_FALSE_VALUE;
+  if (bb != NULL)
+    {
+      edge be = find_branch_edge (bb, flag);
+      if (be != NULL)
+	return get_most_remoute_in_branch (bb, be->dest, level);
+      else
+	return make_branch (bb, flag, level);
+    }
+  else
+    {
+      basic_block cond_bb = cond->bb;
+      bool always_executed = (ALWAYS_EXECUTED_IN (cond_bb) != NULL);
+      edge edge_for_cond;
+      if (always_executed)
+	{
+	  edge_for_cond = preheader;
+	}
+      else
+	{
+	  gimple parent_cond = COND_STMT (cond_bb);
+	  bool branch = IS_TRUE_TARGET (cond_bb);
+	  cond_bb = get_block_for_predicated_statement (parent_cond, branch,
+						       preheader, cond_map,
+						       level);
+	  edge_for_cond = find_fallthru_edge (cond_bb->succs);
+	}
+      gimple cond_copy = gimple_copy (cond);
+      gsi_insert_on_edge_immediate (edge_for_cond, cond_copy);
+      cond_bb = cond_copy->bb;
+      if (always_executed)
+	split_edge (preheader);
+      cond_map->put (cond, cond_bb);
+      SET_LOOP_OF_PREHEADER_REGION (cond_bb, level);
+      return make_branch (cond_bb, flag, level);
+    }
+}
+
+/* phi_data->names_map is going to be used more often than
+   fl_names_map and defs_map. So we initialize it together
+   with phi_data initialization. */
+phi_data *init_phi_data (struct loop *level)
+{
+  phi_data *data = XCNEW (struct phi_data);
+  data->names_map = new hash_map<tree,tree>;
+  data->fl_names_map = NULL;
+  data->defs_map = NULL;
+  level->aux = data;
+  return data;
+}
+
+tree create_key (tree var)
+{
+  tree key = SSA_NAME_VAR (var);
+  if (key == NULL) //temporary_variable
+    key = var;
+  return key;
+}
+
+tree get_name (hash_map<tree, tree> *map, tree key)
+{
+  if (map == NULL)
+    return NULL;
+  tree *p = map->get (key);
+  if (p != NULL)
+    return *p;
+  return NULL;
+}
+
+
+hash_map<tree, tree> *get_names_map (struct phi_data *data,
+				    bool from_fl_names_map)
+{
+  gcc_assert (data != NULL);
+  if (from_fl_names_map)
+    {
+      if (data->fl_names_map == NULL)
+	data->fl_names_map = new hash_map<tree, tree>;
+      return data->fl_names_map;
+    }
+  else
+    return data->names_map;
+}
+
+tree get_new_name (struct loop * level, tree var,
+		   bool from_fl_names_map)
+{
+  struct phi_data *data = (phi_data *) (level->aux);
+  tree key = create_key (var);
+  tree new_name = NULL;
+  if (data != NULL)
+    new_name = get_name (get_names_map (data,
+				      from_fl_names_map),
+			key);
+  else
+    data = init_phi_data (level);
+  if (new_name == NULL)
+    {
+      new_name = copy_ssa_name (var);
+      get_names_map (data, from_fl_names_map)->put (key,
+						   new_name);
+    }
+  return new_name;
+}
+
+tree get_new_name (struct loop * level, tree var)
+{
+  return get_new_name (level, var, false);
+}
+
+void set_default_def (struct loop* level, tree key, tree def)
+{
+  struct phi_data *data = (phi_data *) (level->aux);
+  if (data == NULL)
+    data = init_phi_data (level);
+  if (data->defs_map == NULL)
+    data->defs_map = new hash_map<tree, tree>;
+  data->defs_map->put (key, def);
+}
+
+/*!!! TODO: debug statements removing is a very temporary solution!!!
+  This is terrible and only before the first review of predicated lim.
+  The problem is: when debug statement is the only statement which we need
+  create phi for there is the different code for 'g' option and without 'g'
+  option.
+  That's why bootstrap build fails. The difference between stage2 and stage3
+  is in the single file onl (tree-vect-data-refs.o).
+  And actually the difference is not in stage2 and
+  stage3. The difference is in between compiler options (-gtoogle).
+  So as a temporary solution, we just remove such debug statements.
+  No statements - no problems :) . */
+void remove_debug_stmt (vec<gimple> *stmts, bool cond)
+{
+  if (!cond)
+    return;
+
+  gimple stmt;
+  gimple_stmt_iterator gsi;
+  unsigned i;
 
+  for (i = 0; stmts->iterate (i, &stmt); ++i)
+    {
+      gsi = gsi_for_stmt (stmt);
+      gsi_remove (&gsi, true);
+    }
+}
+
+tree create_tmp_va (tree old_var, gimple stmt,
+		    struct loop *loop_with_data,
+		    struct loop *level)
+{
+  tree result = old_var;
+  use_operand_p use_p;
+  imm_use_iterator iter;
+  bool cond_for_tmp_var = false;
+  bool remove_debug_stmts_flag = false;
+  struct vec<gimple> *debug_stmts= NULL;
+  FOR_EACH_IMM_USE_FAST (use_p, iter, old_var)
+    {
+      gimple use_stmt = USE_STMT (use_p);
+      struct lim_aux_data * lim_data = get_lim_data (use_stmt);
+      struct loop *tgt_loop = lim_data != NULL ? lim_data->tgt_loop
+					      : NULL;
+      gphi * phi;
+      if (tgt_loop != level && gimple_code (use_stmt) == GIMPLE_DEBUG)
+	{
+	  remove_debug_stmts_flag = true;
+	  if (debug_stmts == NULL)
+	     debug_stmts = XCNEW (struct vec<gimple>);
+	  debug_stmts->safe_push (use_stmt);
+	}
+      else if (tgt_loop != level
+	  || gimple_code (use_stmt) == GIMPLE_COND
+	  || (phi = dyn_cast <gphi *> (use_stmt)))
+	{
+	  cond_for_tmp_var = true;
+	  remove_debug_stmts_flag = false;
+	  break;
+	}
+    }
+  remove_debug_stmt (debug_stmts, remove_debug_stmts_flag);
+  free (debug_stmts);
+  /* This is limitation from make_ssa_name and
+     get_or_create_ssa_default_def. We need these methods further
+     when create phi nodes. */
+  cond_for_tmp_var &= !(TREE_CODE (old_var) == VAR_DECL
+			|| TREE_CODE (old_var) == PARM_DECL
+			|| TREE_CODE (old_var) == RESULT_DECL);
+  if (cond_for_tmp_var)
+    {
+      tree tmp_var = create_tmp_reg (TREE_TYPE (old_var), "plim");
+      result = make_ssa_name (tmp_var, NULL);
+      gimple_set_lhs (stmt, result);
       if (dump_file && (dump_flags & TDF_DETAILS))
 	{
-	  print_gimple_stmt (dump_file, stmt, 2, 0);
-	  fprintf (dump_file, "  invariant up to level %d, cost %d.\n\n",
-		   loop_depth (lim_data->max_loop),
-		   lim_data->cost);
+	  fprintf (dump_file, "created new name ");
+	  print_generic_stmt (dump_file, result, 0);
+	  fprintf (dump_file, "\n");
+	}
+
+      set_default_def (loop_with_data, create_key (result),
+		     build_zero_cst (TREE_TYPE (old_var)));
+    }
+  return result;
+}
+
+#define WAS_NOT_ADDED (0x0)
+#define ADDED_TO_PHI_DATA (0x01 << 1)
+#define ADDED_TO_STMTS_WITH_CH_LHS (0x01 << 2)
+#define ADDED_TO_FL_PHI_DATA (0x01 << 3)
+#define ADDED_TO_EVERYWHERE (ADDED_TO_PHI_DATA | ADDED_TO_STMTS_WITH_CH_LHS)
+
+/* This function is called from move_computations routines, when
+   stmt is going to be moved outside its current loop. So we analyze whether
+   phi nodes should be added in the new loop, store corresponding info
+   into loop->aux (as phi_data->stmts, ADDED_TO_PHI_DATA result code),
+   replace uses of lhs of stmt, if any,
+   to suggested phi name (phi_data->names_map) and also
+   add stmt to stmts_with_changed_lhs to update lhs uses in conditions later
+ (ADDED_TO_STMTS_WITH_CH_LHS result code). */
+int add_info_for_phi (gimple stmt, struct loop *level)
+{
+  int result = WAS_NOT_ADDED;
+  tree old_var = gimple_get_lhs (stmt);
+  if (old_var == NULL)
+     return result;
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Will analyze uses of ");
+      print_generic_expr (dump_file, old_var, 0);
+      fprintf (dump_file,"\n");
+    }
+  struct loop* loop_with_data = loop_outer (level);
+  tree name_for_key = create_tmp_va (old_var, stmt, loop_with_data,
+				     level);
+  bool tmp_reg_created = (name_for_key != old_var);
+
+  use_operand_p use_p;
+  imm_use_iterator iter;
+  gimple use_stmt;
+  tree new_name = NULL;
+  tree fl_new_name = NULL;
+  struct lim_aux_data *lim_data;
+  bool create_phi_node = false;
+  gphi *phi = NULL;
+  if (TREE_CODE (old_var) == SSA_NAME && !has_zero_uses (old_var))
+    {
+      FOR_EACH_IMM_USE_STMT (use_stmt, iter, old_var)
+	{
+	  lim_data = get_lim_data (use_stmt);
+	  struct loop *tgt_loop = lim_data != NULL ? lim_data->tgt_loop
+						       : NULL;
+	  bool is_phi = (phi = dyn_cast <gphi *> (use_stmt))
+	      && (gimple_phi_num_args (phi) == 2);
+	  bool is_the_same_level = (tgt_loop == level);
+	  bool is_cond = gimple_code (use_stmt) == GIMPLE_COND;
+	  if ((lim_data == NULL || tgt_loop == NULL
+	       || loop_depth (tgt_loop) > loop_depth (level))
+	       || (is_the_same_level && (is_phi || is_cond))
+	       || tmp_reg_created)
+	    {
+	      bool is_phi_on_the_same_level = is_phi && is_the_same_level;
+	      tree replacement = NULL;
+	      if (is_phi_on_the_same_level
+		   && check_conditionally_executed (use_stmt->bb))
+		{
+		  if (fl_new_name == NULL)
+		    fl_new_name = get_new_name (loop_with_data, name_for_key,
+						true);
+		  replacement = fl_new_name;
+		}
+	      else
+		{
+		  if ((is_phi_on_the_same_level
+		      || !(tmp_reg_created && is_the_same_level)) && new_name == NULL)
+		    new_name = get_new_name (loop_with_data, name_for_key);
+		  replacement = is_the_same_level && !is_phi ? name_for_key : new_name;
+		}
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		{
+		  fprintf (dump_file, "Will replace uses in: ");
+		  print_gimple_stmt (dump_file, use_stmt, 2, 0);
+		}
+	      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+		{
+		   SET_USE (use_p, replacement);
+		}
+	      update_stmt (use_stmt);
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		{
+		  fprintf (dump_file, "After replacement: ");
+		  print_gimple_stmt (dump_file, use_stmt, 2, 0);
+		}
+
+	      if (is_cond)
+		{ /* If cond is on the same level, we will leave it in the initial loop,
+		    So we do need phi anyway */
+		  create_phi_node = true;
+		  if ((result & ADDED_TO_STMTS_WITH_CH_LHS) == 0)
+		    {
+		      stmts_with_changed_lhs.safe_push (stmt);
+		      result |= ADDED_TO_STMTS_WITH_CH_LHS;
+		    }
+		  if (new_name == NULL)
+		    new_name = get_new_name (loop_with_data, name_for_key);
+		}
+	    }
+	}
+    }
+  if (new_name != NULL || create_phi_node || fl_new_name != NULL)
+    {
+      ((phi_data *)loop_with_data->aux)->stmts.safe_push (stmt);
+      result |= ADDED_TO_PHI_DATA;
+      if (new_name == NULL && fl_new_name != NULL)
+	{
+	  ((phi_data *)loop_with_data->aux)->stmts_fl_phi_only.safe_push (stmt);
+	  result |= ADDED_TO_FL_PHI_DATA;
 	}
+    }
+  return result;
+}
+/* This is called after computations movement. As there could appear
+   several copies of the same conditional statement
+ (as statements under the same condition could go to different levels),
+   we may want to change some uses in these conditions (definition
+   statements of some operands could also be moved). */
+void replace_uses_in_conditions ()
+{
+  use_operand_p use_p;
+  imm_use_iterator iter;
+  gimple use_stmt;
+  tree new_name;
+  unsigned i;
+  gimple stmt;
+  tree old_var;
+  struct loop *def_loop;
+  struct loop *use_loop;
+  FOR_EACH_VEC_ELT (stmts_with_changed_lhs, i, stmt)
+    {
+      new_name = NULL;
+      old_var = gimple_get_lhs (stmt);
+      def_loop = gimple_bb (stmt)->loop_father;
 
-      if (lim_data->cost >= LIM_EXPENSIVE)
-	set_profitable_level (stmt);
+      FOR_EACH_IMM_USE_STMT (use_stmt, iter, old_var)
+	{
+	  if (gimple_code (use_stmt) != GIMPLE_COND)
+	    continue;
+	  use_loop = gimple_bb (use_stmt)->loop_father;
+	  if (use_loop != def_loop)
+	    {
+	      if (new_name == NULL)
+		new_name = get_new_name (def_loop, old_var);
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		{
+		  fprintf (dump_file, "Statement: ");
+		  print_gimple_stmt (dump_file, use_stmt, 2, 0);
+		}
+	      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+		{
+		  SET_USE (use_p, new_name);
+		}
+	      update_stmt (use_stmt);
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		{
+		  fprintf (dump_file, "changed to: ");
+		  print_gimple_stmt (dump_file, use_stmt, 2, 0);
+		}
+	    }
+	}
     }
 }
 
-class move_computations_dom_walker : public dom_walker
+/* Get ssa default definition corresponding to key recursively asking outer
+   loops. Once found for the given loop, the default definition is stored in
+  phi_data->defs_map (loop->aux), so there is no need in recursion in further
+  calls of this function. */
+tree get_default_def (struct loop *loop, tree key, bool is_first_level)
 {
-public:
-  move_computations_dom_walker (cdi_direction direction)
-    : dom_walker (direction), todo_ (0) {}
+  phi_data *data = (phi_data *)loop->aux;
+  tree default_def = NULL;
+  if (data != NULL)
+    {
+      hash_map<tree, tree> *map = is_first_level ? data->defs_map
+				   : data->names_map;
+      if (map != NULL)
+	 default_def = get_name (map, key);
+     if (default_def != NULL)
+	 return default_def;
+    }
+  struct loop *outer_loop = loop_outer (loop);
+  if (outer_loop != NULL)
+    default_def = get_default_def (outer_loop, key, false);
+  if (is_first_level)
+    {
+      if (default_def == NULL)
+	default_def = get_or_create_ssa_default_def (cfun, key);
+      if (data == NULL)
+	data = init_phi_data (loop);
+      if (data->defs_map == NULL)
+	data->defs_map = new hash_map<tree, tree>;
+      data->defs_map->put (key, default_def);
+    }
+  return default_def;
+}
 
-  virtual void before_dom_children (basic_block);
+void conditionally_replace_last (vec<gimple> vec, int cond, gimple stmt)
+{
+  if (cond)
+    {
+      vec.pop ();
+      vec.safe_push (stmt);
+    }
+}
 
-  unsigned int todo_;
-};
+void check_and_correct_vectors (struct loop *level, gimple stmt, int cond)
+{
+  phi_data *data = (phi_data *)level->aux;
+  if (data != NULL)
+    {
+      conditionally_replace_last (data->stmts, cond & ADDED_TO_PHI_DATA, stmt);
+      conditionally_replace_last (data->stmts_fl_phi_only,
+				  cond & ADDED_TO_FL_PHI_DATA, stmt);
+    }
+  conditionally_replace_last (stmts_with_changed_lhs,
+			     cond & ADDED_TO_STMTS_WITH_CH_LHS, stmt);
+}
+
+basic_block get_bb_for_phi (basic_block bb)
+{
+  edge e;
+  basic_block res = bb;
+  while (1)
+    {
+      e = EDGE_SUCC (res, 0);
+      res = e->dest;
+      if (!dominated_by_p (CDI_DOMINATORS, res, bb))
+	break;
+    };
+  return res;
+}
+
+int prepare_edge_for_stm (gimple stmt,
+			  struct loop *level,
+			  bool cond_executed,
+			  edge &e,
+			  basic_block bb)
+{
+  int add_info_for_phi_result = WAS_NOT_ADDED;
+  e = loop_preheader_edge (level);
+
+  if (cond_executed)
+    {
+      basic_block bb_for_stmt
+	 = get_block_for_predicated_statement (COND_STMT (bb),
+					       IS_TRUE_TARGET (bb),
+					       e,
+					       get_cond_data_map (level),
+					       level);
+      add_info_for_phi_result
+	   = add_info_for_phi (stmt, level);
+      e = find_fallthru_edge (bb_for_stmt->succs);
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	  fprintf (dump_file,
+	  "Conditionally executed, created bb to move statement is %d\n",
+	   bb_for_stmt->index);
+
+      }
+
+  return add_info_for_phi_result;
+}
+
+void set_loop_of_preheader_regio (basic_block stmt_bb, struct loop *level,
+				  bool cond_executed)
+{
+  if (stmt_bb != NULL && cond_executed)
+    {
+      SET_LOOP_OF_PREHEADER_REGION (stmt_bb, level);
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	  fprintf (dump_file,
+	 "Statement added to bb = %d\n",
+	 stmt_bb->index);
+    }
+}
 
 /* Hoist the statements in basic block BB out of the loops prescribed by
    data stored in LIM_DATA structures associated with each statement.  Callback
    for walk_dominator_tree.  */
-
 void
 move_computations_dom_walker::before_dom_children (basic_block bb)
 {
   struct loop *level;
   unsigned cost = 0;
   struct lim_aux_data *lim_data;
+  edge e = NULL;
 
   if (!loop_outer (bb->loop_father))
     return;
+  struct loop * always_executed_in = ALWAYS_EXECUTED_IN (bb);
+  bool bb_cond_executed = check_conditionally_executed (bb);
 
   for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi); )
     {
@@ -1216,7 +2063,6 @@ move_computations_dom_walker::before_dom
 	  fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
 		   cost, level->num);
 	}
-
       if (gimple_phi_num_args (stmt) == 1)
 	{
 	  tree arg = PHI_ARG_DEF (stmt, 0);
@@ -1247,23 +2093,22 @@ move_computations_dom_walker::before_dom
 	  SSA_NAME_RANGE_INFO (lhs) = NULL;
 	  SSA_NAME_ANTI_RANGE_P (lhs) = 0;
 	}
-      gsi_insert_on_edge (loop_preheader_edge (level), new_stmt);
+      prepare_edge_for_stm (new_stmt, level, bb_cond_executed, e, bb);
+      basic_block stmt_bb = gsi_insert_on_edge_immediate (e, new_stmt);
+      set_loop_of_preheader_regio (stmt_bb, level, bb_cond_executed);
       remove_phi_node (&bsi, false);
     }
 
   for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi); )
     {
-      edge e;
-
+      bool cond_executed = bb_cond_executed;
       gimple stmt = gsi_stmt (bsi);
-
       lim_data = get_lim_data (stmt);
       if (lim_data == NULL)
 	{
 	  gsi_next (&bsi);
 	  continue;
 	}
-
       cost = lim_data->cost;
       level = lim_data->tgt_loop;
       clear_lim_data (stmt);
@@ -1283,11 +2128,11 @@ move_computations_dom_walker::before_dom
 	{
 	  fprintf (dump_file, "Moving statement\n");
 	  print_gimple_stmt (dump_file, stmt, 0, 0);
-	  fprintf (dump_file, "(cost %u) out of loop %d.\n\n",
-		   cost, level->num);
+	  fprintf (dump_file, " (cost %u) out of loop %d from bb %d.\n\n",
+		   cost, level->num, bb->index);
 	}
 
-      e = loop_preheader_edge (level);
+      edge e = loop_preheader_edge (level);
       gcc_assert (!gimple_vdef (stmt));
       if (gimple_vuse (stmt))
 	{
@@ -1305,36 +2150,256 @@ move_computations_dom_walker::before_dom
 		}
 	    }
 	}
-      gsi_remove (&bsi, false);
+      cond_executed &= !lim_data->ignore_condition;
+      int add_info_for_phi_result =
+	 prepare_edge_for_stm (stmt, level, cond_executed, e, bb);
+
       if (gimple_has_lhs (stmt)
 	  && TREE_CODE (gimple_get_lhs (stmt)) == SSA_NAME
 	  && INTEGRAL_TYPE_P (TREE_TYPE (gimple_get_lhs (stmt)))
-	  && (!ALWAYS_EXECUTED_IN (bb)
-	      || !(ALWAYS_EXECUTED_IN (bb) == level
-		   || flow_loop_nested_p (ALWAYS_EXECUTED_IN (bb), level))))
+	  && (!always_executed_in
+	      || ! (always_executed_in == level
+	      || flow_loop_nested_p (always_executed_in, level))))
 	{
 	  tree lhs = gimple_get_lhs (stmt);
 	  SSA_NAME_RANGE_INFO (lhs) = NULL;
 	  SSA_NAME_ANTI_RANGE_P (lhs) = 0;
 	}
+
+       gsi_remove (&bsi, false);
       /* In case this is a stmt that is not unconditionally executed
-         when the target loop header is executed and the stmt may
+	 when the target loop header is executed and the stmt may
 	 invoke undefined integer or pointer overflow rewrite it to
 	 unsigned arithmetic.  */
+      basic_block stmt_bb = NULL;
       if (is_gimple_assign (stmt)
 	  && INTEGRAL_TYPE_P (TREE_TYPE (gimple_assign_lhs (stmt)))
 	  && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (gimple_assign_lhs (stmt)))
 	  && arith_code_with_undefined_signed_overflow
-	       (gimple_assign_rhs_code (stmt))
-	  && (!ALWAYS_EXECUTED_IN (bb)
-	      || !(ALWAYS_EXECUTED_IN (bb) == level
+	 (gimple_assign_rhs_code (stmt))
+	  && (!always_executed_in
+	      || ! (always_executed_in == level
 		   || flow_loop_nested_p (ALWAYS_EXECUTED_IN (bb), level))))
-	gsi_insert_seq_on_edge (e, rewrite_to_defined_overflow (stmt));
-      else
-	gsi_insert_on_edge (e, stmt);
+	{
+	  gimple_seq seq = rewrite_to_defined_overflow (stmt);
+	  /* stmt which we was going to move could have changed lhs after
+	  rewrite_to_defined_overflow call.
+	  So the last stmt in the sequence will contain another statement,
+	  but with needed lhs. So we need to make the replacement in our
+	  helper structures (check_and_correct_vectors). */
+	  gimple last_stmt = gimple_seq_last_stmt (seq);
+	  stmt_bb = gsi_insert_seq_on_edge_immediate (e, seq);
+	  check_and_correct_vectors (stmt->bb->loop_father, last_stmt,
+				   add_info_for_phi_result);
+	}
+	else
+	  stmt_bb = gsi_insert_on_edge_immediate (e, stmt);
+	todo_ |= TODO_cleanup_cfg;
+	set_loop_of_preheader_regio (stmt_bb, level, cond_executed);
+    }
+}
+
+
+void add_phi_args (gphi *phi, tree lhs, tree default_def,
+		  bool add_default_def, basic_block stmt_bb,
+		  basic_block phi_bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  FOR_EACH_EDGE (e, ei, phi_bb->preds)
+    {
+      if (dominated_by_p (CDI_DOMINATORS, e->src, stmt_bb))
+	add_phi_arg (phi, lhs, e, UNKNOWN_LOCATION);
+      else if (add_default_def)
+	add_phi_arg (phi, default_def, e, UNKNOWN_LOCATION);
+    }
+}
+
+gphi *get_phi_from_map (basic_block bb, tree var,
+		       hash_map<basic_block,
+		       hash_map<tree, gphi*>*> *map)
+{
+  hash_map<tree, gphi*> **p1 = map->get (bb);
+  if (p1 == NULL)
+    return NULL;
+  gphi **p2 = (*p1)->get (var);
+  if (p2 == NULL)
+    return NULL;
+  return *p2;
+}
+
+void put_phi_into_map (basic_block bb, tree var, gphi *phi,
+		      hash_map<basic_block, hash_map<tree, gphi*>*> *map)
+{
+   hash_map<tree, gphi*> **p1 = map->get (bb);
+   hash_map<tree, gphi*> *phi_map;
+   if (p1 == NULL)
+     {
+       phi_map = new hash_map<tree, gphi*>;
+       map->put (bb, phi_map);
+     }
+   else
+     phi_map = *p1;
+   phi_map->put (var, phi);
+}
+
+
+gphi *add_phi_node (basic_block phi_bb,
+		  basic_block stmt_bb, tree lhs,
+		  struct loop *loop, phi_data * data,
+		  bool is_preheader_dominator,
+		  hash_map<basic_block, hash_map<tree, gphi*>*> *map,
+		  bool is_first_phi)
+{
+  tree key = create_key (lhs);
+  gphi *phi = NULL;
+  bool is_tmp = (key == lhs);
+  /* if this is not temorary variable the phi node could be created
+     already (for another branch) */
+  if (!is_tmp)
+    phi = get_phi_from_map (phi_bb, key, map);
+
+  tree default_def = get_default_def (loop, key, true);
+  if (phi == NULL)
+    {
+      phi = create_phi_node (key, phi_bb);
+      if (is_preheader_dominator)
+	{
+	  /* The name of 'the last' phi node should be already
+	     choosen, when moving computations. */
+	  tree phi_name = get_name (data->names_map, key);
+	  gimple_phi_set_result (phi, phi_name);
+	}
+      else if (is_first_phi)
+	{
+	  /* If we move phi with 2 argument this name was choosen
+	     earlier. */
+	  tree phi_name = get_name (data->fl_names_map, key);
+	  if (phi_name != NULL)
+	    gimple_phi_set_result (phi, phi_name);
+	}
+
+      if (!is_tmp)
+	put_phi_into_map (phi_bb, key, phi, map);
+      add_phi_args (phi, lhs, default_def, true, stmt_bb, phi_bb);
+      return phi;
+    }
+  else
+   {
+     add_phi_args (phi, lhs, default_def, false, stmt_bb, phi_bb);
+     return NULL;
+   }
+}
+
+
+void delete_bb_phi_map (hash_map<basic_block, hash_map<tree, gphi*>*> *map)
+{
+  for (hash_map<basic_block, hash_map<tree, gphi*>*>::iterator
+       iter = map->begin ();
+       iter != map->end (); ++iter)
+    {
+      basic_block key = (*iter).first;
+      hash_map<tree, gphi*> **phi_map = map->get (key);
+      delete *phi_map;
+      phi_map = NULL;
+    }
+  delete map;
+  map = NULL;
+}
+
+bool check_fl_phi_cond (phi_data *data, gimple stmt, bool is_first_phi)
+{
+  if (!is_first_phi)
+    return false;
+  vec<gimple> v = data->stmts_fl_phi_only;
+  if (v.length () == 0)
+    return false;
+  if (v.last () == stmt)
+    {
+      v.pop ();
+      return true;
+    }
+  return false;
+}
+
+
+/* After computations movement, we create phi nodes. */
+void create_phi_nodes ()
+{
+  struct loop * loop;
+  phi_data *data;
+  basic_block phi_bb = NULL;
+  basic_block preheader;
+  /* Here we store information about phi nodes in bb.
+  This map corresponding to the current loop in the cycle below.*/
+  hash_map<basic_block, hash_map<tree, gphi*>*> *phi_map;
+  gphi *phi;
+  edge preheader_edge;
+  FOR_EACH_LOOP (loop,1)
+    {
+      phi_map = new hash_map<basic_block, hash_map<tree, gphi*>*>;
+      data = (phi_data *)loop->aux;
+      if (data == NULL)
+	continue;
+      gimple stmt;
+      /* data->stmts keeps the statements which require phi node
+	 because of their lhs uses outside of current level. */
+      while (data->stmts.length () > 0)
+	{
+	  stmt = data->stmts.pop ();
+	  bool is_preheader_dominator = false;
+	  tree lhs = gimple_get_lhs (stmt);
+	  basic_block stmt_bb = stmt->bb;
+	  preheader_edge =
+	   loop_preheader_edge (LOOP_OF_PREHEADER_REGION (stmt_bb));
+	  preheader = preheader_edge->src;
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    {
+	      fprintf (dump_file, "Will create phi node for: ");
+	      print_generic_expr (dump_file, lhs, 0);
+	      fprintf (dump_file, "\n, def bb = %d, loop = %d,\
+		      preheader.src = %d\n",
+		      stmt_bb->index, loop->num, preheader->index);
+	    }
+	  /* We add phi nodes while basic block with phi node doesn't
+	     dominate preheader (of the loop which we moved statement from).
+	     That's why we keep
+	     LOOP_OF_PREHEADER_REGION (stmt_bb) information. */
+	  bool is_first_phi = true;
+	  do
+	    {
+	      phi_bb = get_bb_for_phi (stmt_bb);
+	      is_preheader_dominator = dominated_by_p (CDI_DOMINATORS, preheader,
+						      phi_bb);
+	      phi = add_phi_node (phi_bb, stmt_bb, lhs, loop, data,
+				 is_preheader_dominator,
+				 phi_map,
+				 is_first_phi);
+	      if (dump_file && (dump_flags & TDF_DETAILS))
+		{
+		  if (phi == NULL)
+		    fprintf (dump_file, "phi node is NULL");
+		  else
+		    {
+		      fprintf (dump_file, "phi node create in the bb = %d:",
+		       	   phi->bb->index);
+		      print_gimple_stmt (dump_file, phi, 2, 0);
+		    }
+		}
+	      if (phi == NULL || check_fl_phi_cond (data, stmt, is_first_phi))
+		break;
+	      lhs = PHI_RESULT (phi);
+	      stmt_bb = phi_bb;
+	      is_first_phi = false;
+	    }
+	  while (!is_preheader_dominator);
+	}
+      delete_bb_phi_map (phi_map);
     }
 }
 
+
 /* Hoist the statements out of the loops prescribed by data stored in
    LIM_DATA structures associated with each statement.*/
 
@@ -1344,10 +2409,11 @@ move_computations (void)
   move_computations_dom_walker walker (CDI_DOMINATORS);
   walker.walk (cfun->cfg->x_entry_block_ptr);
 
-  gsi_commit_edge_inserts ();
+  create_phi_nodes ();
+  replace_uses_in_conditions ();
+
   if (need_ssa_update_p (cfun))
     rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
-
   return walker.todo_;
 }
 
@@ -1564,8 +2630,8 @@ static unsigned *bb_loop_postorder;
 static int
 sort_bbs_in_loop_postorder_cmp (const void *bb1_, const void *bb2_)
 {
-  basic_block bb1 = *(basic_block *)const_cast<void *>(bb1_);
-  basic_block bb2 = *(basic_block *)const_cast<void *>(bb2_);
+  basic_block bb1 = * (basic_block *)const_cast<void *> (bb1_);
+  basic_block bb2 = * (basic_block *)const_cast<void *> (bb2_);
   struct loop *loop1 = bb1->loop_father;
   struct loop *loop2 = bb2->loop_father;
   if (loop1->num == loop2->num)
@@ -1578,8 +2644,8 @@ sort_bbs_in_loop_postorder_cmp (const vo
 static int
 sort_locs_in_loop_postorder_cmp (const void *loc1_, const void *loc2_)
 {
-  mem_ref_loc *loc1 = (mem_ref_loc *)const_cast<void *>(loc1_);
-  mem_ref_loc *loc2 = (mem_ref_loc *)const_cast<void *>(loc2_);
+  mem_ref_loc *loc1 = (mem_ref_loc *)const_cast<void *> (loc1_);
+  mem_ref_loc *loc2 = (mem_ref_loc *)const_cast<void *> (loc2_);
   struct loop *loop1 = gimple_bb (loc1->stmt)->loop_father;
   struct loop *loop2 = gimple_bb (loc2->stmt)->loop_father;
   if (loop1->num == loop2->num)
@@ -1598,7 +2664,7 @@ analyze_memory_references (void)
   unsigned i, n;
 
   /* Collect all basic-blocks in loops and sort them after their
-     loops postorder.  */
+   ps postorder.  */
   i = 0;
   bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
   FOR_EACH_BB_FN (bb, cfun)
@@ -1613,7 +2679,7 @@ analyze_memory_references (void)
     {
       basic_block bb = bbs[i];
       for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
-        gather_mem_refs_stmt (bb->loop_father, gsi_stmt (bsi));
+	gather_mem_refs_stmt (bb->loop_father, gsi_stmt (bsi));
     }
 
   /* Sort the location list of gathered memory references after their
@@ -1685,8 +2751,8 @@ mem_refs_may_alias_p (mem_ref_p mem1, me
 static int
 find_ref_loc_in_loop_cmp (const void *loop_, const void *loc_)
 {
-  struct loop *loop = (struct loop *)const_cast<void *>(loop_);
-  mem_ref_loc *loc = (mem_ref_loc *)const_cast<void *>(loc_);
+  struct loop *loop = (struct loop *)const_cast<void *> (loop_);
+  mem_ref_loc *loc = (mem_ref_loc *)const_cast<void *> (loc_);
   struct loop *loc_loop = gimple_bb (loc->stmt)->loop_father;
   if (loop->num  == loc_loop->num
       || flow_loop_nested_p (loop, loc_loop))
@@ -1696,7 +2762,7 @@ find_ref_loc_in_loop_cmp (const void *lo
 }
 
 /* Iterates over all locations of REF in LOOP and its subloops calling
-   fn.operator() with the location as argument.  When that operator
+   fn.operator () with the location as argument.  When that operator
    returns true the iteration is stopped and true is returned.
    Otherwise false is returned.  */
 
@@ -1808,9 +2874,9 @@ struct prev_flag_edges {
 
      for (...) {
        if (foo)
-         stuff;
+	 stuff;
        else
-         MEM = TMP_VAR;
+	 MEM = TMP_VAR;
      }
 
    into:
@@ -1818,9 +2884,9 @@ struct prev_flag_edges {
      lsm = MEM;
      for (...) {
        if (foo)
-         stuff;
+	 stuff;
        else
-         lsm = TMP_VAR;
+	 lsm = TMP_VAR;
      }
      MEM = lsm;
 
@@ -1832,10 +2898,10 @@ struct prev_flag_edges {
      ...
      for (...) {
        if (foo)
-         stuff;
+	 stuff;
        else {
-         lsm = TMP_VAR;
-         lsm_flag = true;
+	 lsm = TMP_VAR;
+	 lsm_flag = true;
        }
      }
      if (lsm_flag)	<--
@@ -2013,7 +3079,8 @@ execute_sm (struct loop *loop, vec<edge>
   /* Emit the load code on a random exit edge or into the latch if
      the loop does not exit, so that we are sure it will be processed
      by move_computations after all dependencies.  */
-  gsi = gsi_for_stmt (first_mem_ref_loc (loop, ref)->stmt);
+  gimple stmt = first_mem_ref_loc (loop, ref)->stmt;
+  gsi = gsi_for_stmt (stmt);
 
   /* FIXME/TODO: For the multi-threaded variant, we could avoid this
      load altogether, since the store is predicated by a flag.  We
@@ -2022,12 +3089,18 @@ execute_sm (struct loop *loop, vec<edge>
   lim_data = init_lim_data (load);
   lim_data->max_loop = loop;
   lim_data->tgt_loop = loop;
+  basic_block bb = gimple_bb (stmt);
+  bool ignore_condition = IS_TRUE_OR_FALSE_TARGET (bb);
+  lim_data->ignore_condition = ignore_condition;
+
+
   gsi_insert_before (&gsi, load, GSI_SAME_STMT);
 
   if (multi_threaded_model_p)
     {
       load = gimple_build_assign (store_flag, boolean_false_node);
       lim_data = init_lim_data (load);
+      lim_data->ignore_condition = ignore_condition;
       lim_data->max_loop = loop;
       lim_data->tgt_loop = loop;
       gsi_insert_before (&gsi, load, GSI_SAME_STMT);
@@ -2202,6 +3275,7 @@ ref_indep_loop_p_2 (struct loop *loop, m
 
   bool indep_p = ref_indep_loop_p_1 (loop, ref, stored_p);
 
+
   if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file, "Querying dependencies of ref %u in loop %d: %s\n",
 	     ref->id, loop->num, indep_p ? "independent" : "dependent");
@@ -2428,9 +3502,12 @@ fill_always_executed_in_1 (struct loop *
     fill_always_executed_in_1 (loop, contains_call);
 }
 
-/* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e.
-   for each such basic block bb records the outermost loop for that execution
-   of its header implies execution of bb.  */
+
+/* Fills move_cond information for basic blocks (either ALWAYS_EXECUTED_IN
+   or IS_[TRUE/FALSE]_TARGET, i.e.
+   for each such basic block bb records either the outermost loop for that execution
+   of its header implies execution of bb or this is conditionally executed bb
+   or nothing.  */
 
 static void
 fill_always_executed_in (void)
@@ -2444,8 +3521,8 @@ fill_always_executed_in (void)
     {
       gimple_stmt_iterator gsi;
       for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
-	{
-	  if (nonpure_call_p (gsi_stmt (gsi)))
+	{ gimple stmt = gsi_stmt (gsi);
+	  if (nonpure_call_p (stmt))
 	    break;
 	}
 
@@ -2471,6 +3548,8 @@ tree_ssa_lim_initialize (void)
   bitmap_obstack_initialize (&lim_bitmap_obstack);
   gcc_obstack_init (&mem_ref_obstack);
   lim_aux_data_map = new hash_map<gimple, lim_aux_data *>;
+  level_cond_data_map = new hash_map<struct loop*, cond_data_map *>;
+  post_dominators_map = new hash_map<basic_block, basic_block>;
 
   if (flag_tm)
     compute_transaction_bits ();
@@ -2481,7 +3560,7 @@ tree_ssa_lim_initialize (void)
   memory_accesses.refs_list.create (100);
   /* Allocate a special, unanalyzable mem-ref with ID zero.  */
   memory_accesses.refs_list.quick_push
-    (mem_ref_alloc (error_mark_node, 0, UNANALYZABLE_MEM_ID));
+ (mem_ref_alloc (error_mark_node, 0, UNANALYZABLE_MEM_ID));
 
   memory_accesses.refs_in_loop.create (number_of_loops (cfun));
   memory_accesses.refs_in_loop.quick_grow (number_of_loops (cfun));
@@ -2510,6 +3589,7 @@ tree_ssa_lim_initialize (void)
     bb_loop_postorder[loop->num] = i++;
 }
 
+
 /* Cleans up after the invariant motion pass.  */
 
 static void
@@ -2522,10 +3602,25 @@ tree_ssa_lim_finalize (void)
   free_aux_for_edges ();
 
   FOR_EACH_BB_FN (bb, cfun)
-    SET_ALWAYS_EXECUTED_IN (bb, NULL);
+  {
+      FREE_AUX_FOR_BB (bb);
+  }
 
   bitmap_obstack_release (&lim_bitmap_obstack);
   delete lim_aux_data_map;
+  delete post_dominators_map;
+
+  for (hash_map<struct loop * , cond_data_map *>::iterator iter
+	  = level_cond_data_map->begin ();
+       iter != level_cond_data_map->end (); ++iter)
+  {
+    cond_data_map **map = level_cond_data_map->get ((*iter).first);
+    delete *map;
+    map = NULL;
+  }
+
+
+  delete level_cond_data_map;
 
   delete memory_accesses.refs;
   memory_accesses.refs = NULL;
@@ -2543,21 +3638,41 @@ tree_ssa_lim_finalize (void)
     free_affine_expand_cache (&memory_accesses.ttae_cache);
 
   free (bb_loop_postorder);
+  struct loop *loop;
+  FOR_EACH_LOOP (loop, 1)
+  {
+    phi_data *data = (phi_data*)loop->aux;
+    if (data != NULL)
+    {
+      data->stmts.release ();
+      data->stmts_fl_phi_only.release ();
+      delete data->names_map;
+      data->names_map = NULL;
+      delete data->fl_names_map;
+      data->fl_names_map = NULL;
+      delete data->defs_map;
+      data->defs_map = NULL;
+      free (data);
+      loop->aux = NULL;
+    }
+  }
+
+  stmts_with_changed_lhs.release ();
 }
 
-/* Moves invariants from loops.  Only "expensive" invariants are moved out --
+/*  Moves invariants from loops.  Only "expensive" invariants are moved out --
    i.e. those that are likely to be win regardless of the register pressure.  */
 
 unsigned int
 tree_ssa_lim (void)
 {
   unsigned int todo;
-
   tree_ssa_lim_initialize ();
 
   /* Gathers information about memory accesses in the loops.  */
   analyze_memory_references ();
 
+
   /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
   fill_always_executed_in ();
 
@@ -2574,7 +3689,6 @@ tree_ssa_lim (void)
   todo = move_computations ();
 
   tree_ssa_lim_finalize ();
-
   return todo;
 }
 
@@ -2627,3 +3741,4 @@ make_pass_lim (gcc::context *ctxt)
 }
 
 
+

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: conditional lim
  2015-05-08 21:07 conditional lim Evgeniya Maenkova
@ 2015-05-09 12:41 ` Andi Kleen
  2015-05-26 10:32 ` Richard Biener
  2015-05-29 13:32 ` Evgeniya Maenkova
  2 siblings, 0 replies; 14+ messages in thread
From: Andi Kleen @ 2015-05-09 12:41 UTC (permalink / raw)
  To: Evgeniya Maenkova; +Cc: Richard Biener, gcc-patches

Evgeniya Maenkova <evgeniya.maenkova@gmail.com> writes:
>
> So, in my opinion it’s ok to generate additional phi node for debug
> case. But I’m not a compiler expert and maybe there is some
> requirement that debug and non-debug versions should differ only by
> debug statements, I don’t know.

gcc has such a requirement.

Otherwise you cannot debug code that was originally not compiled
with debugging.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: conditional lim
  2015-05-08 21:07 conditional lim Evgeniya Maenkova
  2015-05-09 12:41 ` Andi Kleen
@ 2015-05-26 10:32 ` Richard Biener
  2015-05-26 13:38   ` Evgeniya Maenkova
  2015-05-29 13:32 ` Evgeniya Maenkova
  2 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2015-05-26 10:32 UTC (permalink / raw)
  To: Evgeniya Maenkova; +Cc: GCC Patches

On Fri, May 8, 2015 at 11:07 PM, Evgeniya Maenkova
<evgeniya.maenkova@gmail.com> wrote:
> Hi,
>
> Could you please review my patch for predicated lim?
>
> Let me note some details about it:
>
>
>
> 1)      Phi statements are still moved only if they have 1 or 2
> arguments. However, phi statements could be move under conditions (as
> it’s done for the other statements).  Probably, phi statement motion
> with 3 + arguments could be implemented in the next patch after
> predicated lim.
>
> 2)      Patch has limitations/features like (it was ok to me to
> implement it such way, maybe I’m not correct. ):
>
> a)      Loop1
>
>     {
>
>               If (a)
>
>                  Loop2
>
>                      {
>
>                        Stmt - Invariant for Loop1
>
>                      }
>
>              }
>
>        In this case Stmt will be moved only out of Loop2, because of if (a).
>
> b)      Or
>
> Loop1
>
>     {
>
>          …
>
>          If (cond1)
>
>               If (cond2)
>
>                   If (cond3)
>
>                       Stmt;
>
>        }
>
> Stmt will be moved out only if cond1 is always executed in Loop1.
>
> c)       It took me a long time to write all of these code, so there
> might be other peculiarities which I forgot to mention. :)
>
>                        Let’s discuss these ones as you will review my patch.
>
> 3)      Patch consists of 9 files:
>
> a)      gcc/testsuite/gcc.dg/tree-ssa/loop-7.c,
> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – changed tests:
>
> -          gcc/testsuite/gcc.dg/tree-ssa/loop-7.c  changed as
> predicated lim moves 2 more statements out of the loop;
>
> -          gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – with conditional
> lim recip optimization in this test doesn’t work (the corresponding
> value is below threshold as I could see in the code for recip, 1<3).
> So to have recip working in this test I changed test a little bit.
>
> b)      gcc/tree-ssa-loop-im.c – the patched lim per se
>
> c)       gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c,
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c,
>
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c,
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c,
>
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c,
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>
> the tests for conditional lim.
>
> 4)      Patch testing:
>
> a)      make –k check (no difference in results for me for the clean
> build and the patched one,
>
> -          Revision: 222849,
>
> -          uname -a
>
>                Linux Istanbul 3.16.0-23-generic #31-Ubuntu SMP Tue Oct
> 21 18:00:35 UTC 2014 i686 i686 i686 GNU/Linux
>
> b)      Bootstrap.
>
>  It goes well now, however to fix it I have made a temporary hack in
> the lim code. And with this fix patch definitely shouldn’t be
> committed.
>
> I did so, as I would like to discuss this issue first.
>
> The problem is: I got stage2-stage3 comparison failure on the single
> file (tree-vect-data-refs.o). After some investigation I understood
> that tree-vect-data-refs.o differs being compiled with and without
> ‘-g’ option (yes, more exactly on stage 2 this is  ‘-g –O2 –gtoggle’,
> and for stage 3 this is ‘-g –O2’. But to simplify things I can
> reproduce this difference on the same build (even not bootstrapped),
> with option ‘ –g’ and without it).
>
> Of course, the file compiled with –g option will contain debug
> information and will differ from the corresponding file without debug
> information. I mean there is the difference reported by
> contrib/compare-debug (I mean we talk about stripped files).
>
> Then I compared assemblers and lim dump files and made a conclusion
> the difference is:
>
> There is statement _484=something, which is conditionally moved out of
> loop X. In non debug cases no usages of _484 are left inside the loop
> X. In debug case, there is the single use in debug statement. So for
> debug case my patch generates phi statement for _484 to make it
> available inside the loop X. For non debug case, no such phi statement
> generated as there is no uses of _484.
>
> There is one more statement with lhs=_493, with exactly this pattern
> of use. But this is not important for our discussion.
>
>
>
> So, in my opinion it’s ok to generate additional phi node for debug
> case. But I’m not a compiler expert and maybe there is some
> requirement that debug and non-debug versions should differ only by
> debug statements, I don’t know.

As Andi said code generation needs to be the same.

> My temporary hack fix is just removing of such debug statements (no
> debug statements, no problems J).

That's fine and generally what is done in other passes.

> The alternatives I see are:
>
> -          Move debug statements outside the loop (not good in my opinion);
>
> -          Somehow reset values in debug statements (not good in my
> opinion, if I could provide correct information for them);

You could re-set them via gimple_debug_bind_reset_value which
will tell the user that at this point the value is optimized out
(that's slightly
better than removing the debug stmts).

> -          Generate phi for debug statements and fix bootstrap (say,
> let’s have some list of files, which we have special check for. I mean
> for my case, the difference is not in stage2 and stage3, it’s in debug
> and non-debug). I like this variant. However, as I don’t see such list
> of special files in the whole gcc, I think I might be missing
> something.

The policy ;)

> What do you think?

Now about the patch.  Generally there seem to be spurious whitespace-only
or formatting differences - try to avoid these.

All new functiuons need a comment explaining them and their function
parameters.

+static cond_data_map *get_cond_data_map (struct loop * level)
+{

per GCC coding conventions this should be formatted like

static cond_data_map *
get_cond_data_map (struct loop * level)
{

+#define SET_ALWAYS_EXECUTED_IN(BB, VAL)\
+                 BB->aux = (void *) (XCNEW (struct move_cond));\
+                 BB_MOVE_COND (BB)->type=ALWAYS_EXECUTED_IN;\
+                 BB_MOVE_COND (BB)->loop=VAL;

this kind of side-effects in macros are bad - better turn them into
inline functions.

-  if (flag_unswitch_loops
-      && gimple_code (stmt) == GIMPLE_COND)
-    {
-      /* If we perform unswitching, force the operands of the invariant
-        condition to be moved out of the loop.  */
-      return MOVE_POSSIBLE;
-    }
+  if (gimple_code (stmt) == GIMPLE_COND)
+    return MOVE_POSSIBLE;

any reason you removed the guard on flag_unswitch_loops?  We may
want to add a new flag, certainly this transform might not be suitable
for -O[12s] for example.  Like if not all stmts inside a conditional can
be moved the transform will increase code-size.

You are doing a (lot of?) refactoring which makes review of the patch
harder.  For example

+bool calculate_tgt_level (gimple stmt, struct loop *outermost,
+                        bool cond_executed, basic_block bb)
+{
+  struct lim_aux_data *lim_dat
...
+
+  if (lim_data->cost >= LIM_EXPENSIVE)
+    set_profitable_level (stmt);
+  return true;
+}

...

-
-       lim_data = init_lim_data (stmt);
-       lim_data->always_executed_in = outermost;
-
-       if (!determine_max_movement (stmt, false))
-         {
-           lim_data->max_loop = NULL;
-           continue;
-         }
-
-       if (dump_file && (dump_flags & TDF_DETAILS))
-         {
-           print_gimple_stmt (dump_file, stmt, 2, 0);
-           fprintf (dump_file, "  invariant up to level %d, cost %d.\n\n",
-                    loop_depth (lim_data->max_loop),
-                    lim_data->cost);
-         }
-
-       if (lim_data->cost >= LIM_EXPENSIVE)
-         set_profitable_level (stmt);
+       if (!calculate_tgt_level (stmt, outermost, cond_executed, bb))
+         continue;

where it is not obvious why moving set_profitable_level should be
in a function named calculate_tgt_level.  In fact determine_max_movement
was supposed to be the abstraction already.

Without yet getting an idea of the overall algorithm (I'm really missing such
a thing...) it seems to me that the current handling of doing invariant motion
of PHIs should handle the analysis phase just fine (you might be adding
more advanced cases in this patch, but I'd rather do that as a followup for
initial support).

The code-gen part looks somewhat ad-hoc to me.  In principle if we have
a list of PHIs to hoist we can materialize the needed part of the original
basic-block structure on the pre-header edge together with the controlling
conditionals.  The loop pre-header would be the always-executed BB
and the needed part would always start with the loop header and be a
single-entry multiple-exit region.  So ideally we could use sth like
gimple_duplicate_sese_region with the BB copying worker copy_bbs
refactored/replaced with something only copying controlling conditionals
and statements/PHIs that a callback identifies (tree-loop-distribution.c
would also like sth like that - currently it copies whole loops and removes
stmts it didn't intend to copy...).

So basically invariantness_dom_walker would record a vector of PHIs
to hoist (or we'd construct that at move_computations time).
If there is any control-flow to hoist then we'd do an alternative
move_computations by "copying" the needed CFG and covered stmts.
In theory doing it the loop-distribution way (simply copy the whole
loop body and then remove anything not needed) would also be possible
though that wouldn't be the very best idea obviously ;)

Thanks,
Richard.

>
>
> Thanks,
>
>
>
> Evgeniya

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: conditional lim
  2015-05-26 10:32 ` Richard Biener
@ 2015-05-26 13:38   ` Evgeniya Maenkova
  2015-05-27 10:52     ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Evgeniya Maenkova @ 2015-05-26 13:38 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

Hi, Richard

Thanks for review starting.

Do you see any major issues with this patch (i.e. algorithms and ideas
that should be completely replaced, effectively causing the re-write
of most code)?

To decide if there are major issues in the patch, perhaps, you need
additional clarifications from me? Could you point at the places where
additional explanations could save you most effort?

Your answers to these questions are looking the first priority ones.
You wrote about several issues in the code, which are looking as easy
(or almost easy ;) to fix(inline functions, unswitch-loops flag,
comments, etc). But, I think you agree, let’s first decide about the
major issues (I mean, whether we continue with this patch or starting
new one, this will save a lot of time for both of us).

Thanks,

Evgeniya

On Tue, May 26, 2015 at 2:31 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, May 8, 2015 at 11:07 PM, Evgeniya Maenkova
> <evgeniya.maenkova@gmail.com> wrote:
>> Hi,
>>
>> Could you please review my patch for predicated lim?
>>
>> Let me note some details about it:
>>
>>
>>
>> 1)      Phi statements are still moved only if they have 1 or 2
>> arguments. However, phi statements could be move under conditions (as
>> it’s done for the other statements).  Probably, phi statement motion
>> with 3 + arguments could be implemented in the next patch after
>> predicated lim.
>>
>> 2)      Patch has limitations/features like (it was ok to me to
>> implement it such way, maybe I’m not correct. ):
>>
>> a)      Loop1
>>
>>     {
>>
>>               If (a)
>>
>>                  Loop2
>>
>>                      {
>>
>>                        Stmt - Invariant for Loop1
>>
>>                      }
>>
>>              }
>>
>>        In this case Stmt will be moved only out of Loop2, because of if (a).
>>
>> b)      Or
>>
>> Loop1
>>
>>     {
>>
>>          …
>>
>>          If (cond1)
>>
>>               If (cond2)
>>
>>                   If (cond3)
>>
>>                       Stmt;
>>
>>        }
>>
>> Stmt will be moved out only if cond1 is always executed in Loop1.
>>
>> c)       It took me a long time to write all of these code, so there
>> might be other peculiarities which I forgot to mention. :)
>>
>>                        Let’s discuss these ones as you will review my patch.
>>
>> 3)      Patch consists of 9 files:
>>
>> a)      gcc/testsuite/gcc.dg/tree-ssa/loop-7.c,
>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – changed tests:
>>
>> -          gcc/testsuite/gcc.dg/tree-ssa/loop-7.c  changed as
>> predicated lim moves 2 more statements out of the loop;
>>
>> -          gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – with conditional
>> lim recip optimization in this test doesn’t work (the corresponding
>> value is below threshold as I could see in the code for recip, 1<3).
>> So to have recip working in this test I changed test a little bit.
>>
>> b)      gcc/tree-ssa-loop-im.c – the patched lim per se
>>
>> c)       gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c,
>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c,
>>
>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c,
>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c,
>>
>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c,
>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>
>> the tests for conditional lim.
>>
>> 4)      Patch testing:
>>
>> a)      make –k check (no difference in results for me for the clean
>> build and the patched one,
>>
>> -          Revision: 222849,
>>
>> -          uname -a
>>
>>                Linux Istanbul 3.16.0-23-generic #31-Ubuntu SMP Tue Oct
>> 21 18:00:35 UTC 2014 i686 i686 i686 GNU/Linux
>>
>> b)      Bootstrap.
>>
>>  It goes well now, however to fix it I have made a temporary hack in
>> the lim code. And with this fix patch definitely shouldn’t be
>> committed.
>>
>> I did so, as I would like to discuss this issue first.
>>
>> The problem is: I got stage2-stage3 comparison failure on the single
>> file (tree-vect-data-refs.o). After some investigation I understood
>> that tree-vect-data-refs.o differs being compiled with and without
>> ‘-g’ option (yes, more exactly on stage 2 this is  ‘-g –O2 –gtoggle’,
>> and for stage 3 this is ‘-g –O2’. But to simplify things I can
>> reproduce this difference on the same build (even not bootstrapped),
>> with option ‘ –g’ and without it).
>>
>> Of course, the file compiled with –g option will contain debug
>> information and will differ from the corresponding file without debug
>> information. I mean there is the difference reported by
>> contrib/compare-debug (I mean we talk about stripped files).
>>
>> Then I compared assemblers and lim dump files and made a conclusion
>> the difference is:
>>
>> There is statement _484=something, which is conditionally moved out of
>> loop X. In non debug cases no usages of _484 are left inside the loop
>> X. In debug case, there is the single use in debug statement. So for
>> debug case my patch generates phi statement for _484 to make it
>> available inside the loop X. For non debug case, no such phi statement
>> generated as there is no uses of _484.
>>
>> There is one more statement with lhs=_493, with exactly this pattern
>> of use. But this is not important for our discussion.
>>
>>
>>
>> So, in my opinion it’s ok to generate additional phi node for debug
>> case. But I’m not a compiler expert and maybe there is some
>> requirement that debug and non-debug versions should differ only by
>> debug statements, I don’t know.
>
> As Andi said code generation needs to be the same.
>
>> My temporary hack fix is just removing of such debug statements (no
>> debug statements, no problems J).
>
> That's fine and generally what is done in other passes.
>
>> The alternatives I see are:
>>
>> -          Move debug statements outside the loop (not good in my opinion);
>>
>> -          Somehow reset values in debug statements (not good in my
>> opinion, if I could provide correct information for them);
>
> You could re-set them via gimple_debug_bind_reset_value which
> will tell the user that at this point the value is optimized out
> (that's slightly
> better than removing the debug stmts).
>
>> -          Generate phi for debug statements and fix bootstrap (say,
>> let’s have some list of files, which we have special check for. I mean
>> for my case, the difference is not in stage2 and stage3, it’s in debug
>> and non-debug). I like this variant. However, as I don’t see such list
>> of special files in the whole gcc, I think I might be missing
>> something.
>
> The policy ;)
>
>> What do you think?
>
> Now about the patch.  Generally there seem to be spurious whitespace-only
> or formatting differences - try to avoid these.
>
> All new functiuons need a comment explaining them and their function
> parameters.
>
> +static cond_data_map *get_cond_data_map (struct loop * level)
> +{
>
> per GCC coding conventions this should be formatted like
>
> static cond_data_map *
> get_cond_data_map (struct loop * level)
> {
>
> +#define SET_ALWAYS_EXECUTED_IN(BB, VAL)\
> +                 BB->aux = (void *) (XCNEW (struct move_cond));\
> +                 BB_MOVE_COND (BB)->type=ALWAYS_EXECUTED_IN;\
> +                 BB_MOVE_COND (BB)->loop=VAL;
>
> this kind of side-effects in macros are bad - better turn them into
> inline functions.
>
> -  if (flag_unswitch_loops
> -      && gimple_code (stmt) == GIMPLE_COND)
> -    {
> -      /* If we perform unswitching, force the operands of the invariant
> -        condition to be moved out of the loop.  */
> -      return MOVE_POSSIBLE;
> -    }
> +  if (gimple_code (stmt) == GIMPLE_COND)
> +    return MOVE_POSSIBLE;
>
> any reason you removed the guard on flag_unswitch_loops?  We may
> want to add a new flag, certainly this transform might not be suitable
> for -O[12s] for example.  Like if not all stmts inside a conditional can
> be moved the transform will increase code-size.
>
> You are doing a (lot of?) refactoring which makes review of the patch
> harder.  For example
>
> +bool calculate_tgt_level (gimple stmt, struct loop *outermost,
> +                        bool cond_executed, basic_block bb)
> +{
> +  struct lim_aux_data *lim_dat
> ...
> +
> +  if (lim_data->cost >= LIM_EXPENSIVE)
> +    set_profitable_level (stmt);
> +  return true;
> +}
>
> ...
>
> -
> -       lim_data = init_lim_data (stmt);
> -       lim_data->always_executed_in = outermost;
> -
> -       if (!determine_max_movement (stmt, false))
> -         {
> -           lim_data->max_loop = NULL;
> -           continue;
> -         }
> -
> -       if (dump_file && (dump_flags & TDF_DETAILS))
> -         {
> -           print_gimple_stmt (dump_file, stmt, 2, 0);
> -           fprintf (dump_file, "  invariant up to level %d, cost %d.\n\n",
> -                    loop_depth (lim_data->max_loop),
> -                    lim_data->cost);
> -         }
> -
> -       if (lim_data->cost >= LIM_EXPENSIVE)
> -         set_profitable_level (stmt);
> +       if (!calculate_tgt_level (stmt, outermost, cond_executed, bb))
> +         continue;
>
> where it is not obvious why moving set_profitable_level should be
> in a function named calculate_tgt_level.  In fact determine_max_movement
> was supposed to be the abstraction already.
>
> Without yet getting an idea of the overall algorithm (I'm really missing such
> a thing...) it seems to me that the current handling of doing invariant motion
> of PHIs should handle the analysis phase just fine (you might be adding
> more advanced cases in this patch, but I'd rather do that as a followup for
> initial support).
>
> The code-gen part looks somewhat ad-hoc to me.  In principle if we have
> a list of PHIs to hoist we can materialize the needed part of the original
> basic-block structure on the pre-header edge together with the controlling
> conditionals.  The loop pre-header would be the always-executed BB
> and the needed part would always start with the loop header and be a
> single-entry multiple-exit region.  So ideally we could use sth like
> gimple_duplicate_sese_region with the BB copying worker copy_bbs
> refactored/replaced with something only copying controlling conditionals
> and statements/PHIs that a callback identifies (tree-loop-distribution.c
> would also like sth like that - currently it copies whole loops and removes
> stmts it didn't intend to copy...).
>
> So basically invariantness_dom_walker would record a vector of PHIs
> to hoist (or we'd construct that at move_computations time).
> If there is any control-flow to hoist then we'd do an alternative
> move_computations by "copying" the needed CFG and covered stmts.
> In theory doing it the loop-distribution way (simply copy the whole
> loop body and then remove anything not needed) would also be possible
> though that wouldn't be the very best idea obviously ;)
>
> Thanks,
> Richard.
>
>>
>>
>> Thanks,
>>
>>
>>
>> Evgeniya



-- 
Thanks,

Evgeniya

perfstories.wordpress.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: conditional lim
  2015-05-26 13:38   ` Evgeniya Maenkova
@ 2015-05-27 10:52     ` Richard Biener
  2015-05-27 13:36       ` Evgeniya Maenkova
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2015-05-27 10:52 UTC (permalink / raw)
  To: Evgeniya Maenkova; +Cc: GCC Patches

On Tue, May 26, 2015 at 3:10 PM, Evgeniya Maenkova
<evgeniya.maenkova@gmail.com> wrote:
> Hi, Richard
>
> Thanks for review starting.
>
> Do you see any major issues with this patch (i.e. algorithms and ideas
> that should be completely replaced, effectively causing the re-write
> of most code)?
>
> To decide if there are major issues in the patch, perhaps, you need
> additional clarifications from me? Could you point at the places where
> additional explanations could save you most effort?
>
> Your answers to these questions are looking the first priority ones.
> You wrote about several issues in the code, which are looking as easy
> (or almost easy ;) to fix(inline functions, unswitch-loops flag,
> comments, etc). But, I think you agree, let’s first decide about the
> major issues (I mean, whether we continue with this patch or starting
> new one, this will save a lot of time for both of us).

I didn't get an overall idea on how the patch works, that is, how it integrates
with the existing algorithm.  If you can elaborate on that a bit that would
be helpful.

I think the code-generation part needs some work (whether by following
my idea with re-using copy_bbs or whether by basically re-implementing
it is up to debate).  How does your code handle

  for ()
    {
       if (cond1)
        {
           if (cond2)
             invariant;
           if (cond3)
             invariant;
        }
    }

?  Optimally we'd have before the loop exactly the same if () structure
(thus if (cond1) is shared).

Richard.


> Thanks,
>
> Evgeniya
>
> On Tue, May 26, 2015 at 2:31 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, May 8, 2015 at 11:07 PM, Evgeniya Maenkova
>> <evgeniya.maenkova@gmail.com> wrote:
>>> Hi,
>>>
>>> Could you please review my patch for predicated lim?
>>>
>>> Let me note some details about it:
>>>
>>>
>>>
>>> 1)      Phi statements are still moved only if they have 1 or 2
>>> arguments. However, phi statements could be move under conditions (as
>>> it’s done for the other statements).  Probably, phi statement motion
>>> with 3 + arguments could be implemented in the next patch after
>>> predicated lim.
>>>
>>> 2)      Patch has limitations/features like (it was ok to me to
>>> implement it such way, maybe I’m not correct. ):
>>>
>>> a)      Loop1
>>>
>>>     {
>>>
>>>               If (a)
>>>
>>>                  Loop2
>>>
>>>                      {
>>>
>>>                        Stmt - Invariant for Loop1
>>>
>>>                      }
>>>
>>>              }
>>>
>>>        In this case Stmt will be moved only out of Loop2, because of if (a).
>>>
>>> b)      Or
>>>
>>> Loop1
>>>
>>>     {
>>>
>>>          …
>>>
>>>          If (cond1)
>>>
>>>               If (cond2)
>>>
>>>                   If (cond3)
>>>
>>>                       Stmt;
>>>
>>>        }
>>>
>>> Stmt will be moved out only if cond1 is always executed in Loop1.
>>>
>>> c)       It took me a long time to write all of these code, so there
>>> might be other peculiarities which I forgot to mention. :)
>>>
>>>                        Let’s discuss these ones as you will review my patch.
>>>
>>> 3)      Patch consists of 9 files:
>>>
>>> a)      gcc/testsuite/gcc.dg/tree-ssa/loop-7.c,
>>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – changed tests:
>>>
>>> -          gcc/testsuite/gcc.dg/tree-ssa/loop-7.c  changed as
>>> predicated lim moves 2 more statements out of the loop;
>>>
>>> -          gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – with conditional
>>> lim recip optimization in this test doesn’t work (the corresponding
>>> value is below threshold as I could see in the code for recip, 1<3).
>>> So to have recip working in this test I changed test a little bit.
>>>
>>> b)      gcc/tree-ssa-loop-im.c – the patched lim per se
>>>
>>> c)       gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c,
>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c,
>>>
>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c,
>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c,
>>>
>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c,
>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>>
>>> the tests for conditional lim.
>>>
>>> 4)      Patch testing:
>>>
>>> a)      make –k check (no difference in results for me for the clean
>>> build and the patched one,
>>>
>>> -          Revision: 222849,
>>>
>>> -          uname -a
>>>
>>>                Linux Istanbul 3.16.0-23-generic #31-Ubuntu SMP Tue Oct
>>> 21 18:00:35 UTC 2014 i686 i686 i686 GNU/Linux
>>>
>>> b)      Bootstrap.
>>>
>>>  It goes well now, however to fix it I have made a temporary hack in
>>> the lim code. And with this fix patch definitely shouldn’t be
>>> committed.
>>>
>>> I did so, as I would like to discuss this issue first.
>>>
>>> The problem is: I got stage2-stage3 comparison failure on the single
>>> file (tree-vect-data-refs.o). After some investigation I understood
>>> that tree-vect-data-refs.o differs being compiled with and without
>>> ‘-g’ option (yes, more exactly on stage 2 this is  ‘-g –O2 –gtoggle’,
>>> and for stage 3 this is ‘-g –O2’. But to simplify things I can
>>> reproduce this difference on the same build (even not bootstrapped),
>>> with option ‘ –g’ and without it).
>>>
>>> Of course, the file compiled with –g option will contain debug
>>> information and will differ from the corresponding file without debug
>>> information. I mean there is the difference reported by
>>> contrib/compare-debug (I mean we talk about stripped files).
>>>
>>> Then I compared assemblers and lim dump files and made a conclusion
>>> the difference is:
>>>
>>> There is statement _484=something, which is conditionally moved out of
>>> loop X. In non debug cases no usages of _484 are left inside the loop
>>> X. In debug case, there is the single use in debug statement. So for
>>> debug case my patch generates phi statement for _484 to make it
>>> available inside the loop X. For non debug case, no such phi statement
>>> generated as there is no uses of _484.
>>>
>>> There is one more statement with lhs=_493, with exactly this pattern
>>> of use. But this is not important for our discussion.
>>>
>>>
>>>
>>> So, in my opinion it’s ok to generate additional phi node for debug
>>> case. But I’m not a compiler expert and maybe there is some
>>> requirement that debug and non-debug versions should differ only by
>>> debug statements, I don’t know.
>>
>> As Andi said code generation needs to be the same.
>>
>>> My temporary hack fix is just removing of such debug statements (no
>>> debug statements, no problems J).
>>
>> That's fine and generally what is done in other passes.
>>
>>> The alternatives I see are:
>>>
>>> -          Move debug statements outside the loop (not good in my opinion);
>>>
>>> -          Somehow reset values in debug statements (not good in my
>>> opinion, if I could provide correct information for them);
>>
>> You could re-set them via gimple_debug_bind_reset_value which
>> will tell the user that at this point the value is optimized out
>> (that's slightly
>> better than removing the debug stmts).
>>
>>> -          Generate phi for debug statements and fix bootstrap (say,
>>> let’s have some list of files, which we have special check for. I mean
>>> for my case, the difference is not in stage2 and stage3, it’s in debug
>>> and non-debug). I like this variant. However, as I don’t see such list
>>> of special files in the whole gcc, I think I might be missing
>>> something.
>>
>> The policy ;)
>>
>>> What do you think?
>>
>> Now about the patch.  Generally there seem to be spurious whitespace-only
>> or formatting differences - try to avoid these.
>>
>> All new functiuons need a comment explaining them and their function
>> parameters.
>>
>> +static cond_data_map *get_cond_data_map (struct loop * level)
>> +{
>>
>> per GCC coding conventions this should be formatted like
>>
>> static cond_data_map *
>> get_cond_data_map (struct loop * level)
>> {
>>
>> +#define SET_ALWAYS_EXECUTED_IN(BB, VAL)\
>> +                 BB->aux = (void *) (XCNEW (struct move_cond));\
>> +                 BB_MOVE_COND (BB)->type=ALWAYS_EXECUTED_IN;\
>> +                 BB_MOVE_COND (BB)->loop=VAL;
>>
>> this kind of side-effects in macros are bad - better turn them into
>> inline functions.
>>
>> -  if (flag_unswitch_loops
>> -      && gimple_code (stmt) == GIMPLE_COND)
>> -    {
>> -      /* If we perform unswitching, force the operands of the invariant
>> -        condition to be moved out of the loop.  */
>> -      return MOVE_POSSIBLE;
>> -    }
>> +  if (gimple_code (stmt) == GIMPLE_COND)
>> +    return MOVE_POSSIBLE;
>>
>> any reason you removed the guard on flag_unswitch_loops?  We may
>> want to add a new flag, certainly this transform might not be suitable
>> for -O[12s] for example.  Like if not all stmts inside a conditional can
>> be moved the transform will increase code-size.
>>
>> You are doing a (lot of?) refactoring which makes review of the patch
>> harder.  For example
>>
>> +bool calculate_tgt_level (gimple stmt, struct loop *outermost,
>> +                        bool cond_executed, basic_block bb)
>> +{
>> +  struct lim_aux_data *lim_dat
>> ...
>> +
>> +  if (lim_data->cost >= LIM_EXPENSIVE)
>> +    set_profitable_level (stmt);
>> +  return true;
>> +}
>>
>> ...
>>
>> -
>> -       lim_data = init_lim_data (stmt);
>> -       lim_data->always_executed_in = outermost;
>> -
>> -       if (!determine_max_movement (stmt, false))
>> -         {
>> -           lim_data->max_loop = NULL;
>> -           continue;
>> -         }
>> -
>> -       if (dump_file && (dump_flags & TDF_DETAILS))
>> -         {
>> -           print_gimple_stmt (dump_file, stmt, 2, 0);
>> -           fprintf (dump_file, "  invariant up to level %d, cost %d.\n\n",
>> -                    loop_depth (lim_data->max_loop),
>> -                    lim_data->cost);
>> -         }
>> -
>> -       if (lim_data->cost >= LIM_EXPENSIVE)
>> -         set_profitable_level (stmt);
>> +       if (!calculate_tgt_level (stmt, outermost, cond_executed, bb))
>> +         continue;
>>
>> where it is not obvious why moving set_profitable_level should be
>> in a function named calculate_tgt_level.  In fact determine_max_movement
>> was supposed to be the abstraction already.
>>
>> Without yet getting an idea of the overall algorithm (I'm really missing such
>> a thing...) it seems to me that the current handling of doing invariant motion
>> of PHIs should handle the analysis phase just fine (you might be adding
>> more advanced cases in this patch, but I'd rather do that as a followup for
>> initial support).
>>
>> The code-gen part looks somewhat ad-hoc to me.  In principle if we have
>> a list of PHIs to hoist we can materialize the needed part of the original
>> basic-block structure on the pre-header edge together with the controlling
>> conditionals.  The loop pre-header would be the always-executed BB
>> and the needed part would always start with the loop header and be a
>> single-entry multiple-exit region.  So ideally we could use sth like
>> gimple_duplicate_sese_region with the BB copying worker copy_bbs
>> refactored/replaced with something only copying controlling conditionals
>> and statements/PHIs that a callback identifies (tree-loop-distribution.c
>> would also like sth like that - currently it copies whole loops and removes
>> stmts it didn't intend to copy...).
>>
>> So basically invariantness_dom_walker would record a vector of PHIs
>> to hoist (or we'd construct that at move_computations time).
>> If there is any control-flow to hoist then we'd do an alternative
>> move_computations by "copying" the needed CFG and covered stmts.
>> In theory doing it the loop-distribution way (simply copy the whole
>> loop body and then remove anything not needed) would also be possible
>> though that wouldn't be the very best idea obviously ;)
>>
>> Thanks,
>> Richard.
>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> Evgeniya
>
>
>
> --
> Thanks,
>
> Evgeniya
>
> perfstories.wordpress.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: conditional lim
  2015-05-27 10:52     ` Richard Biener
@ 2015-05-27 13:36       ` Evgeniya Maenkova
  0 siblings, 0 replies; 14+ messages in thread
From: Evgeniya Maenkova @ 2015-05-27 13:36 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

On Wed, May 27, 2015 at 2:11 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Tue, May 26, 2015 at 3:10 PM, Evgeniya Maenkova
> <evgeniya.maenkova@gmail.com> wrote:
>> Hi, Richard
>>
>> Thanks for review starting.
>>
>> Do you see any major issues with this patch (i.e. algorithms and ideas
>> that should be completely replaced, effectively causing the re-write
>> of most code)?
>>
>> To decide if there are major issues in the patch, perhaps, you need
>> additional clarifications from me? Could you point at the places where
>> additional explanations could save you most effort?
>>
>> Your answers to these questions are looking the first priority ones.
>> You wrote about several issues in the code, which are looking as easy
>> (or almost easy ;) to fix(inline functions, unswitch-loops flag,
>> comments, etc). But, I think you agree, let’s first decide about the
>> major issues (I mean, whether we continue with this patch or starting
>> new one, this will save a lot of time for both of us).
>
> I didn't get an overall idea on how the patch works, that is, how it integrates
> with the existing algorithm.  If you can elaborate on that a bit that would
> be helpful.
>
Hi,
Sure, I'll write you some notes in several days.

> I think the code-generation part needs some work (whether by following
> my idea with re-using copy_bbs or whether by basically re-implementing
> it is up to debate).  How does your code handle
>
>   for ()
>     {
>        if (cond1)
>         {
>            if (cond2)
>              invariant;
>            if (cond3)
>              invariant;
>         }
>     }
>
> ?  Optimally we'd have before the loop exactly the same if () structure
> (thus if (cond1) is shared).

If both invariants are going out of the same loop (i mean tgt_level),
then if structure will be the same.
for1()
  for ()
    {
      if (cond1)
        {
          if (cond2)
            invariant1;
          if (cond3)
            invariant2;
        }
    }

will be transformed to
for1()
  if (cond1)
    {
      if (cond2)
        invariant1;
      if (cond3)
        invariant2;
     }
  }
  for ()
    {
      if (cond1)
        {
          if (cond2);
          if (cond3);
        }
    }
(I don't cleanup empty if's in lim code).


If these invarians are moved in different loops then

for1
  for2()
    for()
      {
        if (cond1)
          {
            if (cond2)
              invariant1;
            if (cond3)
              invariant2;
           }
       }

will be transformed to:
for1
  {
    if (cond1)
      if (cond2)
        invariant1;
    for2()
      {
         if (cond1)
           if (cond3)
             invariant2;
         for()
           {
              if (cond1)
               {
                 if (cond2);
                 if (cond3);
               }
           }
        }
   }

Of course, there could be some bugs, but the idea was as mentioned above.

This transformation was looking logical to me. What do you think?

Thanks,

Evgeniya

>
> Richard.



>
>
>> Thanks,
>>
>> Evgeniya
>>
>> On Tue, May 26, 2015 at 2:31 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, May 8, 2015 at 11:07 PM, Evgeniya Maenkova
>>> <evgeniya.maenkova@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> Could you please review my patch for predicated lim?
>>>>
>>>> Let me note some details about it:
>>>>
>>>>
>>>>
>>>> 1)      Phi statements are still moved only if they have 1 or 2
>>>> arguments. However, phi statements could be move under conditions (as
>>>> it’s done for the other statements).  Probably, phi statement motion
>>>> with 3 + arguments could be implemented in the next patch after
>>>> predicated lim.
>>>>
>>>> 2)      Patch has limitations/features like (it was ok to me to
>>>> implement it such way, maybe I’m not correct. ):
>>>>
>>>> a)      Loop1
>>>>
>>>>     {
>>>>
>>>>               If (a)
>>>>
>>>>                  Loop2
>>>>
>>>>                      {
>>>>
>>>>                        Stmt - Invariant for Loop1
>>>>
>>>>                      }
>>>>
>>>>              }
>>>>
>>>>        In this case Stmt will be moved only out of Loop2, because of if (a).
>>>>
>>>> b)      Or
>>>>
>>>> Loop1
>>>>
>>>>     {
>>>>
>>>>          …
>>>>
>>>>          If (cond1)
>>>>
>>>>               If (cond2)
>>>>
>>>>                   If (cond3)
>>>>
>>>>                       Stmt;
>>>>
>>>>        }
>>>>
>>>> Stmt will be moved out only if cond1 is always executed in Loop1.
>>>>
>>>> c)       It took me a long time to write all of these code, so there
>>>> might be other peculiarities which I forgot to mention. :)
>>>>
>>>>                        Let’s discuss these ones as you will review my patch.
>>>>
>>>> 3)      Patch consists of 9 files:
>>>>
>>>> a)      gcc/testsuite/gcc.dg/tree-ssa/loop-7.c,
>>>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – changed tests:
>>>>
>>>> -          gcc/testsuite/gcc.dg/tree-ssa/loop-7.c  changed as
>>>> predicated lim moves 2 more statements out of the loop;
>>>>
>>>> -          gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – with conditional
>>>> lim recip optimization in this test doesn’t work (the corresponding
>>>> value is below threshold as I could see in the code for recip, 1<3).
>>>> So to have recip working in this test I changed test a little bit.
>>>>
>>>> b)      gcc/tree-ssa-loop-im.c – the patched lim per se
>>>>
>>>> c)       gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c,
>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c,
>>>>
>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c,
>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c,
>>>>
>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c,
>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>>>
>>>> the tests for conditional lim.
>>>>
>>>> 4)      Patch testing:
>>>>
>>>> a)      make –k check (no difference in results for me for the clean
>>>> build and the patched one,
>>>>
>>>> -          Revision: 222849,
>>>>
>>>> -          uname -a
>>>>
>>>>                Linux Istanbul 3.16.0-23-generic #31-Ubuntu SMP Tue Oct
>>>> 21 18:00:35 UTC 2014 i686 i686 i686 GNU/Linux
>>>>
>>>> b)      Bootstrap.
>>>>
>>>>  It goes well now, however to fix it I have made a temporary hack in
>>>> the lim code. And with this fix patch definitely shouldn’t be
>>>> committed.
>>>>
>>>> I did so, as I would like to discuss this issue first.
>>>>
>>>> The problem is: I got stage2-stage3 comparison failure on the single
>>>> file (tree-vect-data-refs.o). After some investigation I understood
>>>> that tree-vect-data-refs.o differs being compiled with and without
>>>> ‘-g’ option (yes, more exactly on stage 2 this is  ‘-g –O2 –gtoggle’,
>>>> and for stage 3 this is ‘-g –O2’. But to simplify things I can
>>>> reproduce this difference on the same build (even not bootstrapped),
>>>> with option ‘ –g’ and without it).
>>>>
>>>> Of course, the file compiled with –g option will contain debug
>>>> information and will differ from the corresponding file without debug
>>>> information. I mean there is the difference reported by
>>>> contrib/compare-debug (I mean we talk about stripped files).
>>>>
>>>> Then I compared assemblers and lim dump files and made a conclusion
>>>> the difference is:
>>>>
>>>> There is statement _484=something, which is conditionally moved out of
>>>> loop X. In non debug cases no usages of _484 are left inside the loop
>>>> X. In debug case, there is the single use in debug statement. So for
>>>> debug case my patch generates phi statement for _484 to make it
>>>> available inside the loop X. For non debug case, no such phi statement
>>>> generated as there is no uses of _484.
>>>>
>>>> There is one more statement with lhs=_493, with exactly this pattern
>>>> of use. But this is not important for our discussion.
>>>>
>>>>
>>>>
>>>> So, in my opinion it’s ok to generate additional phi node for debug
>>>> case. But I’m not a compiler expert and maybe there is some
>>>> requirement that debug and non-debug versions should differ only by
>>>> debug statements, I don’t know.
>>>
>>> As Andi said code generation needs to be the same.
>>>
>>>> My temporary hack fix is just removing of such debug statements (no
>>>> debug statements, no problems J).
>>>
>>> That's fine and generally what is done in other passes.
>>>
>>>> The alternatives I see are:
>>>>
>>>> -          Move debug statements outside the loop (not good in my opinion);
>>>>
>>>> -          Somehow reset values in debug statements (not good in my
>>>> opinion, if I could provide correct information for them);
>>>
>>> You could re-set them via gimple_debug_bind_reset_value which
>>> will tell the user that at this point the value is optimized out
>>> (that's slightly
>>> better than removing the debug stmts).
>>>
>>>> -          Generate phi for debug statements and fix bootstrap (say,
>>>> let’s have some list of files, which we have special check for. I mean
>>>> for my case, the difference is not in stage2 and stage3, it’s in debug
>>>> and non-debug). I like this variant. However, as I don’t see such list
>>>> of special files in the whole gcc, I think I might be missing
>>>> something.
>>>
>>> The policy ;)
>>>
>>>> What do you think?
>>>
>>> Now about the patch.  Generally there seem to be spurious whitespace-only
>>> or formatting differences - try to avoid these.
>>>
>>> All new functiuons need a comment explaining them and their function
>>> parameters.
>>>
>>> +static cond_data_map *get_cond_data_map (struct loop * level)
>>> +{
>>>
>>> per GCC coding conventions this should be formatted like
>>>
>>> static cond_data_map *
>>> get_cond_data_map (struct loop * level)
>>> {
>>>
>>> +#define SET_ALWAYS_EXECUTED_IN(BB, VAL)\
>>> +                 BB->aux = (void *) (XCNEW (struct move_cond));\
>>> +                 BB_MOVE_COND (BB)->type=ALWAYS_EXECUTED_IN;\
>>> +                 BB_MOVE_COND (BB)->loop=VAL;
>>>
>>> this kind of side-effects in macros are bad - better turn them into
>>> inline functions.
>>>
>>> -  if (flag_unswitch_loops
>>> -      && gimple_code (stmt) == GIMPLE_COND)
>>> -    {
>>> -      /* If we perform unswitching, force the operands of the invariant
>>> -        condition to be moved out of the loop.  */
>>> -      return MOVE_POSSIBLE;
>>> -    }
>>> +  if (gimple_code (stmt) == GIMPLE_COND)
>>> +    return MOVE_POSSIBLE;
>>>
>>> any reason you removed the guard on flag_unswitch_loops?  We may
>>> want to add a new flag, certainly this transform might not be suitable
>>> for -O[12s] for example.  Like if not all stmts inside a conditional can
>>> be moved the transform will increase code-size.
>>>
>>> You are doing a (lot of?) refactoring which makes review of the patch
>>> harder.  For example
>>>
>>> +bool calculate_tgt_level (gimple stmt, struct loop *outermost,
>>> +                        bool cond_executed, basic_block bb)
>>> +{
>>> +  struct lim_aux_data *lim_dat
>>> ...
>>> +
>>> +  if (lim_data->cost >= LIM_EXPENSIVE)
>>> +    set_profitable_level (stmt);
>>> +  return true;
>>> +}
>>>
>>> ...
>>>
>>> -
>>> -       lim_data = init_lim_data (stmt);
>>> -       lim_data->always_executed_in = outermost;
>>> -
>>> -       if (!determine_max_movement (stmt, false))
>>> -         {
>>> -           lim_data->max_loop = NULL;
>>> -           continue;
>>> -         }
>>> -
>>> -       if (dump_file && (dump_flags & TDF_DETAILS))
>>> -         {
>>> -           print_gimple_stmt (dump_file, stmt, 2, 0);
>>> -           fprintf (dump_file, "  invariant up to level %d, cost %d.\n\n",
>>> -                    loop_depth (lim_data->max_loop),
>>> -                    lim_data->cost);
>>> -         }
>>> -
>>> -       if (lim_data->cost >= LIM_EXPENSIVE)
>>> -         set_profitable_level (stmt);
>>> +       if (!calculate_tgt_level (stmt, outermost, cond_executed, bb))
>>> +         continue;
>>>
>>> where it is not obvious why moving set_profitable_level should be
>>> in a function named calculate_tgt_level.  In fact determine_max_movement
>>> was supposed to be the abstraction already.
>>>
>>> Without yet getting an idea of the overall algorithm (I'm really missing such
>>> a thing...) it seems to me that the current handling of doing invariant motion
>>> of PHIs should handle the analysis phase just fine (you might be adding
>>> more advanced cases in this patch, but I'd rather do that as a followup for
>>> initial support).
>>>
>>> The code-gen part looks somewhat ad-hoc to me.  In principle if we have
>>> a list of PHIs to hoist we can materialize the needed part of the original
>>> basic-block structure on the pre-header edge together with the controlling
>>> conditionals.  The loop pre-header would be the always-executed BB
>>> and the needed part would always start with the loop header and be a
>>> single-entry multiple-exit region.  So ideally we could use sth like
>>> gimple_duplicate_sese_region with the BB copying worker copy_bbs
>>> refactored/replaced with something only copying controlling conditionals
>>> and statements/PHIs that a callback identifies (tree-loop-distribution.c
>>> would also like sth like that - currently it copies whole loops and removes
>>> stmts it didn't intend to copy...).
>>>
>>> So basically invariantness_dom_walker would record a vector of PHIs
>>> to hoist (or we'd construct that at move_computations time).
>>> If there is any control-flow to hoist then we'd do an alternative
>>> move_computations by "copying" the needed CFG and covered stmts.
>>> In theory doing it the loop-distribution way (simply copy the whole
>>> loop body and then remove anything not needed) would also be possible
>>> though that wouldn't be the very best idea obviously ;)
>>>
>>> Thanks,
>>> Richard.
>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>>
>>>> Evgeniya
>>
>>
>>
>> --
>> Thanks,
>>
>> Evgeniya
>>
>> perfstories.wordpress.com



-- 
Thanks,

Evgeniya

perfstories.wordpress.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: conditional lim
  2015-05-08 21:07 conditional lim Evgeniya Maenkova
  2015-05-09 12:41 ` Andi Kleen
  2015-05-26 10:32 ` Richard Biener
@ 2015-05-29 13:32 ` Evgeniya Maenkova
  2015-06-09 11:48   ` Richard Biener
  2 siblings, 1 reply; 14+ messages in thread
From: Evgeniya Maenkova @ 2015-05-29 13:32 UTC (permalink / raw)
  To: Richard Biener, GCC Patches

Hi Richard,

Here is some explanation. I hope you let me know if I need to clarify something.

Also, you asked me about concrete example, to make sure you don’t miss
my answer here is the link:
https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02417.html.

Also, I doubt whether it’s convenient for you to create a build with
my patch or not. May be to clarify things you could send me some
examples/concrete cases, then I’ll compile them with
–fdump-tree-loopinit-details and –fdump-tree-lim-details and send you
these dumps. May be these dumps will be useful. (I’ll only disable
cleanup_cfg TODO after lim to let you know the exact picture after
lim).

What do you think?

1.       invariantness _dom_walker –

1.1   for each GIMPLE_COND in given bb calls handle_cond_stmt to call
for true and false edges handle_branch_edge, which calls SET_TARGET_OF
for all bb ‘predicated’ by given GIMPLE_COND.

SET_TARGET_OF sets in basic_blocks aux 2 facts:

a)      this is true or false edge;

b)      link to cond stmt;

Handle_branch_edge works this way:

If (cond1)

  {

     bb1;

     if (cond2}

       {

           bb2;

        }

   Being called for cond1, it sets cond1 as condition for both bb1 and
bb2 (the whole branch for cond1, ie also for bb containing cond2),
then this method will be called (as there is dominance order) for
cond2 to correct things (ie to set cond2 as condition for bb2).

1.2   As 1.1 goes we identify whether some bb is predicated by some
condition or not.

bb->aux->type will be [TRUE/FALSE]_TARGET_OF and
bb->aux->cond_stmt=cond stmt (the nearest condition).

If bb is always executed bb->aux->type = ALWAYS_EXECUTED_IN,
bb->loop->loop (this info was available in the clean build).

1.3   As this walker is called in dominance order, information about
condition is available when invariantness_dom_walker is called for
given bb.  So we can make some computations based on bb->aux
structure. This is done in check_conditionally_executed. The main goal
of this method is to check that the root condition is always executed
in the loop. I did so to avoid situation like this

Loop:

   Jmp somewhere;

  If (cond1)

      If (cond2)

         Etc

By ‘root condition’ I mean cond1 in this case (the first cond in
dominance order).

If  ‘root condition’ is not always executed we disable (free) all the
info in bb->aux, ie all the blocks becomes neither conditionally nor
always executed according to bb->aux.

There is some optimization to don’t go each time trough the conditions
chain (cond2->cond1), let me skip such details for now.

1.4   Then we calculate tgt_level  (which is used in move_computations later)

The patch is the same for phi and regular stmt (calculate_tgt_level)

1)      If stmt is conditionally executed we compute max movement
(determine_max_movement) starting from
get_lim_data(condition)->max_loop.

2)      If stmt is not cond executed as start level  for
determine_max_movement computations we choose ALWAYS_EXECUTED_IN.

To clarify why, see the comment

/* The reason why we don't set start_level for MOVE_POSSIBLE

          statements to more outer level is: this statement could depend on

          conditional statements and even probably is not defined without this

          condition. So either we should analyze these ones and move

          under condition somehow or keep more strong start_level . */

As you noted in your review there was some refactoring. Of course, I
had no goal to refactor existing code, I intended to remove some
duplication which I caused by my changes.  I hope we discuss in
details later if you don’t like some refactoring.

2.       store_motion  - for some stmts set flags to ignore predicated
execution (I mean to move statements without conditions).

3.       move_computations. The patch is doing something in the
following 3 calls inside of this method.

3.1   (more details below) walker.walk – move computations(create if()
structure) and store information to create phi nodes later (which
statements require phi nodes, as they conditionally executed and there
are still their uses on other levels, what are the names for such phi
nodes, as we replace uses in here, in walker.walk.)

3.2   (more details below) create_phi_nodes – create phi nodes for
statements which require that (see 3.1)

3.3   replace_uses_in_conditions

After computations movements we can have several copies of the cond
stmt. In 3.1 we do replacement of uses based on stmt’s tgt_level. For,
cond stmt it doesn’t work. As cond stmt, of course, have something in
lim_aux_data->tgt_level, but, in fact, they are moved only if
corresponding invariants are moved. So, in fact, the same cond (copies
of it, of course) could go to the different levels. So to fix these
uses, we call replace_uses_in_conditions.

More details on 3.1 and 3.2

3.1 The patch is very similar for phi stmts and regular stmts (there
is some difference for regular stmts related to the case when we move
sequence instead of single stmt, let’s skip it in this description,
will describe later. BTW, there are comments, in corresponding parts).

a) for each bb we call check_conditionally_executed and if bb is cond
executed then invariant statement  will be moved conditionally.

b) prepare_edge_for_stmt collects information needed to create phi
nodes in 3.2(add_info_for_phi) AND prepares edge ie creates if()
structure needed to move statement
(get_block_for_predicated_statement, see below). All of this is done
for cond executed stmt. Nothing is done for non cond executed.

BTW, there is some strange names for functions like missed ending
(prepare_edge_for_stm, w/o t in the end, this is because of my script
to add ‘ ‘ before ‘(‘, will fix in the next patch version, ignore
please so far).

Get_block_for_predicated_stmt:

Create bb where to add cond stmt to (or return existing, each loop has
a map  (level_cond_data_map is map <loop, cond_data_map> ,
cond_data_map is map <gimple, basic_block>) to identify if bb for
given cond was created or not for given level).

Parameters of this method:

Cond – condition of stmt, branch – is it on true or false edge of
cond, cond_map  - see above, level (tgt_level), preheader –
preheader_edge for level.

So first of all we check whether there is a basic block for given
condition in given loop.

a)      If such bb exists then we trying to find corresponding edge,
destination of this edge shouldn’t be a post dominator of bb
(find_branch_edge).

a.1) if such edge exists we return the most remote in this branch bb
(or create it), see get_most_remote_in_branch;

a.2) if such edge doesn’t exists, we create corresponding bb, see make_branch

       b)  If such bb doesn’t exist we create it (recursively calling
get_block_for_predicated_stmt if required bb is conditionally
executed).

Will not describe it further so far (let me know if you would like to
read additional details).

Add_info_for_phi

The main goal of this method is to identify situation where
stmt(parameter,ie the statement which we are moving) or more exactly
lhs of this statement

a)      is still used in other loops, so we need to create phi node to
use phi name inside other loops.

b)      Is used in phi node of original loop and this phi node is
going to be moved. BUT, as we move phi node as assignment (in the case
of phi node with 2 arguments) we need to create phi node to use lhs of
statement (param of add_info_for_phi).

In these cases we make corresponding replacements and store
information needed to create phi nodes and make some other
replacements in 3.3 (replace_uses_in_conditions).

Add_info_for_phi calls create_tmp_var which requires some explanation.

Create_tmp_var

To create new names for phi nodes and to create default def for phi
arguments I use make_ssa_name and get_or_create_ssa_default_def. These
methods have some asserts (being merged):   (TREE_CODE (old_var) ==
VAR_DECL   || TREE_CODE (old_var) == PARM_DECL   || TREE_CODE
(old_var) == RESULT_DECL).

So in some cases (when I know that I need to use methods mentioned
above, see the start of create_tmp_var, ie the uses in other loops) I
create new tmp variable to overcome these asserts. To be honest I
don’t like this but don’t know how to deal with this better.

And a couple of words regarding where we store info for phi nodes:

Each loop contains in its aux  phi_data structure. On this stage we
only add there stmt (if phi node will be required), phi name=phi node
result name in names_map or fl_names_map). See the comment about
phi_data in patch.

If there should be created several phi nodes for given lhs (I mean the
first place for phi node doesn’t dominate on corresponding preheader,
we add only the last name in names_map (as intermediate names could be
created later, but the last name in bb which dominates preheader
should be used for replacement in other loops. Replacements were
already made in walker).

If lhs is used only in phi node, which are moved and transformed into
assignment, and there is no uses in other loops, we need to create
only first level phi node (don’t need to create phi nodes till some bb
which dominates preheader), then we add such name to fl_names_map.

3.2 Create_phi_nodes

For each of the loops we go over ((phi_data*) loop->aux)->stmts. These
are statements which were moved, but they have uses in the original
loop (more exactly, their uses in some other loops replaced by some
name from ((phi_data*) loop->aux)->names_map or ((phi_data*)
loop->aux)->fl_names_map.

So for each of such stmts we create phi nodes (for its lhs) chain.

Basic block for phi node is found in get_bb_for_phi  for given bb with
stmt. If basic block for phi node dominates preheader->src we end
chain creation. There is some special condition for the case when we
need to create only first level phi node. (I described this situation
above, but let me know If I need to add  more details.

Basic blocks can have maps to identify if we created phi node for
given variable in given basic_block, so we only add an edge argument
for such case (phi_map = new hash_map<basic_block, hash_map<tree,
gphi*>*>).

In short, that’s all.

Thanks,

Evgeniya

On Sat, May 9, 2015 at 1:07 AM, Evgeniya Maenkova
<evgeniya.maenkova@gmail.com> wrote:
> Hi,
>
> Could you please review my patch for predicated lim?
>
> Let me note some details about it:
>
>
>
> 1)      Phi statements are still moved only if they have 1 or 2
> arguments. However, phi statements could be move under conditions (as
> it’s done for the other statements).  Probably, phi statement motion
> with 3 + arguments could be implemented in the next patch after
> predicated lim.
>
> 2)      Patch has limitations/features like (it was ok to me to
> implement it such way, maybe I’m not correct. ):
>
> a)      Loop1
>
>     {
>
>               If (a)
>
>                  Loop2
>
>                      {
>
>                        Stmt - Invariant for Loop1
>
>                      }
>
>              }
>
>        In this case Stmt will be moved only out of Loop2, because of if (a).
>
> b)      Or
>
> Loop1
>
>     {
>
>          …
>
>          If (cond1)
>
>               If (cond2)
>
>                   If (cond3)
>
>                       Stmt;
>
>        }
>
> Stmt will be moved out only if cond1 is always executed in Loop1.
>
> c)       It took me a long time to write all of these code, so there
> might be other peculiarities which I forgot to mention. :)
>
>                        Let’s discuss these ones as you will review my patch.
>
> 3)      Patch consists of 9 files:
>
> a)      gcc/testsuite/gcc.dg/tree-ssa/loop-7.c,
> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – changed tests:
>
> -          gcc/testsuite/gcc.dg/tree-ssa/loop-7.c  changed as
> predicated lim moves 2 more statements out of the loop;
>
> -          gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – with conditional
> lim recip optimization in this test doesn’t work (the corresponding
> value is below threshold as I could see in the code for recip, 1<3).
> So to have recip working in this test I changed test a little bit.
>
> b)      gcc/tree-ssa-loop-im.c – the patched lim per se
>
> c)       gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c,
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c,
>
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c,
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c,
>
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c,
> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>
> the tests for conditional lim.
>
> 4)      Patch testing:
>
> a)      make –k check (no difference in results for me for the clean
> build and the patched one,
>
> -          Revision: 222849,
>
> -          uname -a
>
>                Linux Istanbul 3.16.0-23-generic #31-Ubuntu SMP Tue Oct
> 21 18:00:35 UTC 2014 i686 i686 i686 GNU/Linux
>
> b)      Bootstrap.
>
>  It goes well now, however to fix it I have made a temporary hack in
> the lim code. And with this fix patch definitely shouldn’t be
> committed.
>
> I did so, as I would like to discuss this issue first.
>
> The problem is: I got stage2-stage3 comparison failure on the single
> file (tree-vect-data-refs.o). After some investigation I understood
> that tree-vect-data-refs.o differs being compiled with and without
> ‘-g’ option (yes, more exactly on stage 2 this is  ‘-g –O2 –gtoggle’,
> and for stage 3 this is ‘-g –O2’. But to simplify things I can
> reproduce this difference on the same build (even not bootstrapped),
> with option ‘ –g’ and without it).
>
> Of course, the file compiled with –g option will contain debug
> information and will differ from the corresponding file without debug
> information. I mean there is the difference reported by
> contrib/compare-debug (I mean we talk about stripped files).
>
> Then I compared assemblers and lim dump files and made a conclusion
> the difference is:
>
> There is statement _484=something, which is conditionally moved out of
> loop X. In non debug cases no usages of _484 are left inside the loop
> X. In debug case, there is the single use in debug statement. So for
> debug case my patch generates phi statement for _484 to make it
> available inside the loop X. For non debug case, no such phi statement
> generated as there is no uses of _484.
>
> There is one more statement with lhs=_493, with exactly this pattern
> of use. But this is not important for our discussion.
>
>
>
> So, in my opinion it’s ok to generate additional phi node for debug
> case. But I’m not a compiler expert and maybe there is some
> requirement that debug and non-debug versions should differ only by
> debug statements, I don’t know.
>
>
>
> My temporary hack fix is just removing of such debug statements (no
> debug statements, no problems J).
>
>
>
> The alternatives I see are:
>
> -          Move debug statements outside the loop (not good in my opinion);
>
> -          Somehow reset values in debug statements (not good in my
> opinion, if I could provide correct information for them);
>
> -          Generate phi for debug statements and fix bootstrap (say,
> let’s have some list of files, which we have special check for. I mean
> for my case, the difference is not in stage2 and stage3, it’s in debug
> and non-debug). I like this variant. However, as I don’t see such list
> of special files in the whole gcc, I think I might be missing
> something.
>
> What do you think?
>
>
>
> Thanks,
>
>
>
> Evgeniya

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: conditional lim
  2015-05-29 13:32 ` Evgeniya Maenkova
@ 2015-06-09 11:48   ` Richard Biener
  2015-06-09 20:12     ` Evgeniya Maenkova
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2015-06-09 11:48 UTC (permalink / raw)
  To: Evgeniya Maenkova; +Cc: GCC Patches

On Fri, May 29, 2015 at 3:14 PM, Evgeniya Maenkova
<evgeniya.maenkova@gmail.com> wrote:
> Hi Richard,
>
> Here is some explanation. I hope you let me know if I need to clarify something.
>
> Also, you asked me about concrete example, to make sure you don’t miss
> my answer here is the link:
> https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02417.html.
>
> Also, I doubt whether it’s convenient for you to create a build with
> my patch or not. May be to clarify things you could send me some
> examples/concrete cases, then I’ll compile them with
> –fdump-tree-loopinit-details and –fdump-tree-lim-details and send you
> these dumps. May be these dumps will be useful. (I’ll only disable
> cleanup_cfg TODO after lim to let you know the exact picture after
> lim).
>
> What do you think?
>
> 1.       invariantness _dom_walker –
>
> 1.1   for each GIMPLE_COND in given bb calls handle_cond_stmt to call
> for true and false edges handle_branch_edge, which calls SET_TARGET_OF
> for all bb ‘predicated’ by given GIMPLE_COND.
>
> SET_TARGET_OF sets in basic_blocks aux 2 facts:
>
> a)      this is true or false edge;
>
> b)      link to cond stmt;
>
> Handle_branch_edge works this way:
>
> If (cond1)
>
>   {
>
>      bb1;
>
>      if (cond2}
>
>        {
>
>            bb2;
>
>         }
>
>    Being called for cond1, it sets cond1 as condition for both bb1 and
> bb2 (the whole branch for cond1, ie also for bb containing cond2),
> then this method will be called (as there is dominance order) for
> cond2 to correct things (ie to set cond2 as condition for bb2).

Hmm, why not track the current condition as state during the DOM walk
and thus avoid processing more than one basic-block in handle_branch_edge?
Thus get rid of handle_branch_edge and instead do everything in handle_cond_stmt
plus the dom-walkers BB visitor?

I see you don't handle BBs with multiple predecessors - that's ok, but
are you sure you don't run into correctness issues when not marking such
BBs as predicated?  This misses handling of, say

 if (a || b)
   bb;

which is a pity (but can be fixed later if desired).

I note that collecting predicates has similarities to what if-conversion
does in tree-ifcvt.c (even if its implementation is completely different,
of course).

> 1.2   As 1.1 goes we identify whether some bb is predicated by some
> condition or not.
>
> bb->aux->type will be [TRUE/FALSE]_TARGET_OF and
> bb->aux->cond_stmt=cond stmt (the nearest condition).
>
> If bb is always executed bb->aux->type = ALWAYS_EXECUTED_IN,
> bb->loop->loop (this info was available in the clean build).
>
> 1.3   As this walker is called in dominance order, information about
> condition is available when invariantness_dom_walker is called for
> given bb.  So we can make some computations based on bb->aux
> structure. This is done in check_conditionally_executed. The main goal
> of this method is to check that the root condition is always executed
> in the loop. I did so to avoid situation like this
>
> Loop:
>
>    Jmp somewhere;
>
>   If (cond1)
>
>       If (cond2)
>
>          Etc
>
> By ‘root condition’ I mean cond1 in this case (the first cond in
> dominance order).

Ok, but this can't happen (an uncontional jump to somehwere).  It
will always be a conditional jump (a loop exit) and thus should
be handled already.

Did you have a testcase that was mishandled?

> If  ‘root condition’ is not always executed we disable (free) all the
> info in bb->aux, ie all the blocks becomes neither conditionally nor
> always executed according to bb->aux.
>
> There is some optimization to don’t go each time trough the conditions
> chain (cond2->cond1), let me skip such details for now.
>
>
>
> 1.4   Then we calculate tgt_level  (which is used in move_computations later)
>
> The patch is the same for phi and regular stmt (calculate_tgt_level)
>
> 1)      If stmt is conditionally executed we compute max movement
> (determine_max_movement) starting from
> get_lim_data(condition)->max_loop.
>
> 2)      If stmt is not cond executed as start level  for
> determine_max_movement computations we choose ALWAYS_EXECUTED_IN.
>
> To clarify why, see the comment
>
> /* The reason why we don't set start_level for MOVE_POSSIBLE
>
>           statements to more outer level is: this statement could depend on
>
>           conditional statements and even probably is not defined without this
>
>           condition. So either we should analyze these ones and move
>
>           under condition somehow or keep more strong start_level . */
>
> As you noted in your review there was some refactoring. Of course, I
> had no goal to refactor existing code, I intended to remove some
> duplication which I caused by my changes.  I hope we discuss in
> details later if you don’t like some refactoring.

Refactoring makes a big patch harder to review because it ends up
containing no-op changes.  It also makes it appear bigger than it
really is ;)

A reasonable solution is to factor out the refactoring to a separate patch.

> 2.       store_motion  - for some stmts set flags to ignore predicated
> execution (I mean to move statements without conditions).

That's for the existing "predicated" support as far as I can see.

>
>
> 3.       move_computations. The patch is doing something in the
> following 3 calls inside of this method.
>
>
>
> 3.1   (more details below) walker.walk – move computations(create if()
> structure) and store information to create phi nodes later (which
> statements require phi nodes, as they conditionally executed and there
> are still their uses on other levels, what are the names for such phi
> nodes, as we replace uses in here, in walker.walk.)

What kind of uses are that?  By definition all stmts executed
under condition A that are used in code not under condition A have
a PHI node associated.  This is why the existing conditional movement
handling moves PHI nodes (and all dependent computations).

Are you thinking of

  for (;;)
    if (a) x = 1;

  .. use of x

?  If you move the PHI node for x then you don't need to insert any other
PHI nodes (and you _do_ have to move the PHI node anyway).

> 3.2   (more details below) create_phi_nodes – create phi nodes for
> statements which require that (see 3.1)

Only PHI nodes need PHI node creation and only its dependences
might need predication.

That is, if you move a conditionally executed statement but not
any PHI node it feeds then the movement is useless.

But maybe I am missing something about the PHI business.

>
> 3.3   replace_uses_in_conditions
>
> After computations movements we can have several copies of the cond
> stmt. In 3.1 we do replacement of uses based on stmt’s tgt_level. For,
> cond stmt it doesn’t work. As cond stmt, of course, have something in
> lim_aux_data->tgt_level, but, in fact, they are moved only if
> corresponding invariants are moved. So, in fact, the same cond (copies
> of it, of course) could go to the different levels. So to fix these
> uses, we call replace_uses_in_conditions.

Hmm, but stmts that a condition depends on always have at least the
target level of the condition statement.  Thus their computation is
available in all other "same" conditon statements that might appear
in more inner loop levels, no?  So isn't this just about performing the
movement in a proper ordering?

> More details on 3.1 and 3.2
>
> 3.1 The patch is very similar for phi stmts and regular stmts (there
> is some difference for regular stmts related to the case when we move
> sequence instead of single stmt, let’s skip it in this description,
> will describe later. BTW, there are comments, in corresponding parts).
>
> a) for each bb we call check_conditionally_executed and if bb is cond
> executed then invariant statement  will be moved conditionally.
>
> b) prepare_edge_for_stmt collects information needed to create phi
> nodes in 3.2(add_info_for_phi) AND prepares edge ie creates if()
> structure needed to move statement
> (get_block_for_predicated_statement, see below). All of this is done
> for cond executed stmt. Nothing is done for non cond executed.
>
> BTW, there is some strange names for functions like missed ending
> (prepare_edge_for_stm, w/o t in the end, this is because of my script
> to add ‘ ‘ before ‘(‘, will fix in the next patch version, ignore
> please so far).
>
> Get_block_for_predicated_stmt:
>
> Create bb where to add cond stmt to (or return existing, each loop has
> a map  (level_cond_data_map is map <loop, cond_data_map> ,
> cond_data_map is map <gimple, basic_block>) to identify if bb for
> given cond was created or not for given level).
>
> Parameters of this method:
>
> Cond – condition of stmt, branch – is it on true or false edge of
> cond, cond_map  - see above, level (tgt_level), preheader –
> preheader_edge for level.
>
> So first of all we check whether there is a basic block for given
> condition in given loop.
>
> a)      If such bb exists then we trying to find corresponding edge,
> destination of this edge shouldn’t be a post dominator of bb
> (find_branch_edge).
>
> a.1) if such edge exists we return the most remote in this branch bb
> (or create it), see get_most_remote_in_branch;
>
> a.2) if such edge doesn’t exists, we create corresponding bb, see make_branch
>
>        b)  If such bb doesn’t exist we create it (recursively calling
> get_block_for_predicated_stmt if required bb is conditionally
> executed).
>
> Will not describe it further so far (let me know if you would like to
> read additional details).
>
> Add_info_for_phi
>
> The main goal of this method is to identify situation where
> stmt(parameter,ie the statement which we are moving) or more exactly
> lhs of this statement
>
> a)      is still used in other loops, so we need to create phi node to
> use phi name inside other loops.
>
> b)      Is used in phi node of original loop and this phi node is
> going to be moved. BUT, as we move phi node as assignment (in the case
> of phi node with 2 arguments) we need to create phi node to use lhs of
> statement (param of add_info_for_phi).
>
> In these cases we make corresponding replacements and store
> information needed to create phi nodes and make some other
> replacements in 3.3 (replace_uses_in_conditions).
>
> Add_info_for_phi calls create_tmp_var which requires some explanation.
>
> Create_tmp_var
>
> To create new names for phi nodes and to create default def for phi
> arguments I use make_ssa_name and get_or_create_ssa_default_def. These
> methods have some asserts (being merged):   (TREE_CODE (old_var) ==
> VAR_DECL   || TREE_CODE (old_var) == PARM_DECL   || TREE_CODE
> (old_var) == RESULT_DECL).
>
> So in some cases (when I know that I need to use methods mentioned
> above, see the start of create_tmp_var, ie the uses in other loops) I
> create new tmp variable to overcome these asserts. To be honest I
> don’t like this but don’t know how to deal with this better.

Hmm, don't you just run into SSA names here?  That is, TREE_CODE
(old_var) == SSA_NAME?
You can use make_ssa_name (TREE_TYPE (old_var), "plim") in this case.

Or you run into the issue that anonymous SSA names cannot have default
defintions (I don't understand why you need all the fuss about default defs).

> And a couple of words regarding where we store info for phi nodes:
>
> Each loop contains in its aux  phi_data structure. On this stage we
> only add there stmt (if phi node will be required), phi name=phi node
> result name in names_map or fl_names_map). See the comment about
> phi_data in patch.
>
> If there should be created several phi nodes for given lhs (I mean the
> first place for phi node doesn’t dominate on corresponding preheader,
> we add only the last name in names_map (as intermediate names could be
> created later, but the last name in bb which dominates preheader
> should be used for replacement in other loops. Replacements were
> already made in walker).
>
> If lhs is used only in phi node, which are moved and transformed into
> assignment, and there is no uses in other loops, we need to create
> only first level phi node (don’t need to create phi nodes till some bb
> which dominates preheader), then we add such name to fl_names_map.
>
> 3.2 Create_phi_nodes
>
> For each of the loops we go over ((phi_data*) loop->aux)->stmts. These
> are statements which were moved, but they have uses in the original
> loop (more exactly, their uses in some other loops replaced by some
> name from ((phi_data*) loop->aux)->names_map or ((phi_data*)
> loop->aux)->fl_names_map.
>
> So for each of such stmts we create phi nodes (for its lhs) chain.
>
> Basic block for phi node is found in get_bb_for_phi  for given bb with
> stmt. If basic block for phi node dominates preheader->src we end
> chain creation. There is some special condition for the case when we
> need to create only first level phi node. (I described this situation
> above, but let me know If I need to add  more details.
>
> Basic blocks can have maps to identify if we created phi node for
> given variable in given basic_block, so we only add an edge argument
> for such case (phi_map = new hash_map<basic_block, hash_map<tree,
> gphi*>*>).

Ok.

Well - I think it might be easier code-generation-wise to have a separate
phase move_phis executed before move_computations, that just moves
invariant PHI nodes (which this all is about - see above) and their
controlling conditions and only then move their dependencies.

Thanks,
Richard.

> In short, that’s all.
>
> Thanks,
>
> Evgeniya
>
>
>
> On Sat, May 9, 2015 at 1:07 AM, Evgeniya Maenkova
> <evgeniya.maenkova@gmail.com> wrote:
>> Hi,
>>
>> Could you please review my patch for predicated lim?
>>
>> Let me note some details about it:
>>
>>
>>
>> 1)      Phi statements are still moved only if they have 1 or 2
>> arguments. However, phi statements could be move under conditions (as
>> it’s done for the other statements).  Probably, phi statement motion
>> with 3 + arguments could be implemented in the next patch after
>> predicated lim.
>>
>> 2)      Patch has limitations/features like (it was ok to me to
>> implement it such way, maybe I’m not correct. ):
>>
>> a)      Loop1
>>
>>     {
>>
>>               If (a)
>>
>>                  Loop2
>>
>>                      {
>>
>>                        Stmt - Invariant for Loop1
>>
>>                      }
>>
>>              }
>>
>>        In this case Stmt will be moved only out of Loop2, because of if (a).
>>
>> b)      Or
>>
>> Loop1
>>
>>     {
>>
>>          …
>>
>>          If (cond1)
>>
>>               If (cond2)
>>
>>                   If (cond3)
>>
>>                       Stmt;
>>
>>        }
>>
>> Stmt will be moved out only if cond1 is always executed in Loop1.
>>
>> c)       It took me a long time to write all of these code, so there
>> might be other peculiarities which I forgot to mention. :)
>>
>>                        Let’s discuss these ones as you will review my patch.
>>
>> 3)      Patch consists of 9 files:
>>
>> a)      gcc/testsuite/gcc.dg/tree-ssa/loop-7.c,
>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – changed tests:
>>
>> -          gcc/testsuite/gcc.dg/tree-ssa/loop-7.c  changed as
>> predicated lim moves 2 more statements out of the loop;
>>
>> -          gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – with conditional
>> lim recip optimization in this test doesn’t work (the corresponding
>> value is below threshold as I could see in the code for recip, 1<3).
>> So to have recip working in this test I changed test a little bit.
>>
>> b)      gcc/tree-ssa-loop-im.c – the patched lim per se
>>
>> c)       gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c,
>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c,
>>
>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c,
>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c,
>>
>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c,
>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>
>> the tests for conditional lim.
>>
>> 4)      Patch testing:
>>
>> a)      make –k check (no difference in results for me for the clean
>> build and the patched one,
>>
>> -          Revision: 222849,
>>
>> -          uname -a
>>
>>                Linux Istanbul 3.16.0-23-generic #31-Ubuntu SMP Tue Oct
>> 21 18:00:35 UTC 2014 i686 i686 i686 GNU/Linux
>>
>> b)      Bootstrap.
>>
>>  It goes well now, however to fix it I have made a temporary hack in
>> the lim code. And with this fix patch definitely shouldn’t be
>> committed.
>>
>> I did so, as I would like to discuss this issue first.
>>
>> The problem is: I got stage2-stage3 comparison failure on the single
>> file (tree-vect-data-refs.o). After some investigation I understood
>> that tree-vect-data-refs.o differs being compiled with and without
>> ‘-g’ option (yes, more exactly on stage 2 this is  ‘-g –O2 –gtoggle’,
>> and for stage 3 this is ‘-g –O2’. But to simplify things I can
>> reproduce this difference on the same build (even not bootstrapped),
>> with option ‘ –g’ and without it).
>>
>> Of course, the file compiled with –g option will contain debug
>> information and will differ from the corresponding file without debug
>> information. I mean there is the difference reported by
>> contrib/compare-debug (I mean we talk about stripped files).
>>
>> Then I compared assemblers and lim dump files and made a conclusion
>> the difference is:
>>
>> There is statement _484=something, which is conditionally moved out of
>> loop X. In non debug cases no usages of _484 are left inside the loop
>> X. In debug case, there is the single use in debug statement. So for
>> debug case my patch generates phi statement for _484 to make it
>> available inside the loop X. For non debug case, no such phi statement
>> generated as there is no uses of _484.
>>
>> There is one more statement with lhs=_493, with exactly this pattern
>> of use. But this is not important for our discussion.
>>
>>
>>
>> So, in my opinion it’s ok to generate additional phi node for debug
>> case. But I’m not a compiler expert and maybe there is some
>> requirement that debug and non-debug versions should differ only by
>> debug statements, I don’t know.
>>
>>
>>
>> My temporary hack fix is just removing of such debug statements (no
>> debug statements, no problems J).
>>
>>
>>
>> The alternatives I see are:
>>
>> -          Move debug statements outside the loop (not good in my opinion);
>>
>> -          Somehow reset values in debug statements (not good in my
>> opinion, if I could provide correct information for them);
>>
>> -          Generate phi for debug statements and fix bootstrap (say,
>> let’s have some list of files, which we have special check for. I mean
>> for my case, the difference is not in stage2 and stage3, it’s in debug
>> and non-debug). I like this variant. However, as I don’t see such list
>> of special files in the whole gcc, I think I might be missing
>> something.
>>
>> What do you think?
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Evgeniya

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: conditional lim
  2015-06-09 11:48   ` Richard Biener
@ 2015-06-09 20:12     ` Evgeniya Maenkova
  2015-06-29 13:15       ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Evgeniya Maenkova @ 2015-06-09 20:12 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 25166 bytes --]

On Tue, Jun 9, 2015 at 3:46 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, May 29, 2015 at 3:14 PM, Evgeniya Maenkova
> <evgeniya.maenkova@gmail.com> wrote:
>> Hi Richard,
>>
>> Here is some explanation. I hope you let me know if I need to clarify something.
>>
>> Also, you asked me about concrete example, to make sure you don’t miss
>> my answer here is the link:
>> https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02417.html.
>>
>> Also, I doubt whether it’s convenient for you to create a build with
>> my patch or not. May be to clarify things you could send me some
>> examples/concrete cases, then I’ll compile them with
>> –fdump-tree-loopinit-details and –fdump-tree-lim-details and send you
>> these dumps. May be these dumps will be useful. (I’ll only disable
>> cleanup_cfg TODO after lim to let you know the exact picture after
>> lim).
>>
>> What do you think?
>>
>> 1.       invariantness _dom_walker –
>>
>> 1.1   for each GIMPLE_COND in given bb calls handle_cond_stmt to call
>> for true and false edges handle_branch_edge, which calls SET_TARGET_OF
>> for all bb ‘predicated’ by given GIMPLE_COND.
>>
>> SET_TARGET_OF sets in basic_blocks aux 2 facts:
>>
>> a)      this is true or false edge;
>>
>> b)      link to cond stmt;
>>
>> Handle_branch_edge works this way:
>>
>> If (cond1)
>>
>>   {
>>
>>      bb1;
>>
>>      if (cond2}
>>
>>        {
>>
>>            bb2;
>>
>>         }
>>
>>    Being called for cond1, it sets cond1 as condition for both bb1 and
>> bb2 (the whole branch for cond1, ie also for bb containing cond2),
>> then this method will be called (as there is dominance order) for
>> cond2 to correct things (ie to set cond2 as condition for bb2).
>
> Hmm, why not track the current condition as state during the DOM walk
> and thus avoid processing more than one basic-block in handle_branch_edge?
> Thus get rid of handle_branch_edge and instead do everything in handle_cond_stmt
> plus the dom-walkers BB visitor?
>
I need to look more carefully how to implement it, but I think I
understand what you mean and this optimization of course looks
reasonable to me. Will do.

> I see you don't handle BBs with multiple predecessors - that's ok, but
> are you sure you don't run into correctness issues when not marking such
> BBs as predicated?  This misses handling of, say
>
>  if (a || b)
>    bb;
>
> which is a pity (but can be fixed later if desired).
>
I had some test (in gcc testsuite or bootstrap build) which worked
incorrectly because of multiple predecessors. As far as I remember the
situation was (further, will make some notes about such tests to
clarify this better), I mean with previous version of my code which
handled bb with 2 predecessors:
if (a)
  tmpvar=something;
while()
  if (a || b)
      basic_block {do something with tmpvar;} // I mean basic block
predicated by bb with a and bb with b

So, if a is false, I mean we didn't do tmpvar=something (outside
loop), BUT we are in basick_block  (we went by bb with b), we got
Segmentation falt in basic_block {do something with tmpvar;}.

I think we can reproduce all the details of this test if I remove not
handling bb with 2 predecessors.

So I wouldn't move bb with 2 predecessors (this is not always executed
bb in any loop, not conditionally, they will not be moved at all).

This is my more detail explanation on this point. Perhaps, I didn't
understand your question about correctness. Could you repeat it in
other words (based on this new clarification).

So I think according to current code it will not be moved. What
incorrectness do you mean?

> I note that collecting predicates has similarities to what if-conversion
> does in tree-ifcvt.c (even if its implementation is completely different,
> of course).
>

Ok, I'll look at this. But could you please clarify your point?
(Should I just take this into account with low priority and look at
this later or you want some refactoring?)

>> 1.2   As 1.1 goes we identify whether some bb is predicated by some
>> condition or not.
>>
>> bb->aux->type will be [TRUE/FALSE]_TARGET_OF and
>> bb->aux->cond_stmt=cond stmt (the nearest condition).
>>
>> If bb is always executed bb->aux->type = ALWAYS_EXECUTED_IN,
>> bb->loop->loop (this info was available in the clean build).
>>
>> 1.3   As this walker is called in dominance order, information about
>> condition is available when invariantness_dom_walker is called for
>> given bb.  So we can make some computations based on bb->aux
>> structure. This is done in check_conditionally_executed. The main goal
>> of this method is to check that the root condition is always executed
>> in the loop. I did so to avoid situation like this
>>
>> Loop:
>>
>>    Jmp somewhere;
>>
>>   If (cond1)
>>
>>       If (cond2)
>>
>>          Etc
>>
>> By ‘root condition’ I mean cond1 in this case (the first cond in
>> dominance order).
>
> Ok, but this can't happen (an uncontional jump to somehwere).  It
> will always be a conditional jump (a loop exit) and thus should
> be handled already.
>
> Did you have a testcase that was mishandled?

No, I have not such testcase. This is because I;m newbie at gcc and
compilers. If you tell me this fact, let's just remove this check?
Correct?
>
>> If  ‘root condition’ is not always executed we disable (free) all the
>> info in bb->aux, ie all the blocks becomes neither conditionally nor
>> always executed according to bb->aux.
>>
>> There is some optimization to don’t go each time trough the conditions
>> chain (cond2->cond1), let me skip such details for now.
>>
>>
>>
>> 1.4   Then we calculate tgt_level  (which is used in move_computations later)
>>
>> The patch is the same for phi and regular stmt (calculate_tgt_level)
>>
>> 1)      If stmt is conditionally executed we compute max movement
>> (determine_max_movement) starting from
>> get_lim_data(condition)->max_loop.
>>
>> 2)      If stmt is not cond executed as start level  for
>> determine_max_movement computations we choose ALWAYS_EXECUTED_IN.
>>
>> To clarify why, see the comment
>>
>> /* The reason why we don't set start_level for MOVE_POSSIBLE
>>
>>           statements to more outer level is: this statement could depend on
>>
>>           conditional statements and even probably is not defined without this
>>
>>           condition. So either we should analyze these ones and move
>>
>>           under condition somehow or keep more strong start_level . */
>>
>> As you noted in your review there was some refactoring. Of course, I
>> had no goal to refactor existing code, I intended to remove some
>> duplication which I caused by my changes.  I hope we discuss in
>> details later if you don’t like some refactoring.
>
> Refactoring makes a big patch harder to review because it ends up
> containing no-op changes.  It also makes it appear bigger than it
> really is ;)
>
> A reasonable solution is to factor out the refactoring to a separate patch.
>
I see what you mean.

>> 2.       store_motion  - for some stmts set flags to ignore predicated
>> execution (I mean to move statements without conditions).
>
> That's for the existing "predicated" support as far as I can see.
>
>>
>>
>> 3.       move_computations. The patch is doing something in the
>> following 3 calls inside of this method.
>>
>>
>>
>> 3.1   (more details below) walker.walk – move computations(create if()
>> structure) and store information to create phi nodes later (which
>> statements require phi nodes, as they conditionally executed and there
>> are still their uses on other levels, what are the names for such phi
>> nodes, as we replace uses in here, in walker.walk.)
>
> What kind of uses are that?  By definition all stmts executed
> under condition A that are used in code not under condition A have
> a PHI node associated.  This is why the existing conditional movement
> handling moves PHI nodes (and all dependent computations).
>
> Are you thinking of
>
>   for (;;)
>     if (a) x = 1;
>
>   .. use of x
>
> ?
No, I think in this case phi node of x will be used in initial
(==loopinit) code (in 'use of x') instead of some version of x (which
I move).

>If you move the PHI node for x then you don't need to insert any other
> PHI nodes (and you _do_ have to move the PHI node anyway).

>
>> 3.2   (more details below) create_phi_nodes – create phi nodes for
>> statements which require that (see 3.1)
>
> Only PHI nodes need PHI node creation and only its dependences
> might need predication.
>
> That is, if you move a conditionally executed statement but not
> any PHI node it feeds then the movement is useless.
>
> But maybe I am missing something about the PHI business.
>
Ok. I have different understanding on this (and am afraid of this of
course :) ).

See the example:
void bar (long);
void foo (int cond1, int cond2, int cond3, int m, int n)
{
  unsigned x;
  unsigned inv1;

  for (x = 0; x < 1000; ++x)
    {
      if (cond1)
        {
          int tmp_inv1 = m*n;
          non_inv1 = tmp_inv1 + x;
        }
    }
  bar (non_inv1);
}

This is working example, will attach loopinit and lim1 dumps.
There is no phi node for tmp_inv1 in loopinitat all. However, I think
we should move tmp_inv1. To use tmp_inv1 in non_inv1 we need phi node
(otherwise we get verification failure as bb with conditionally moved
tmp_inv1 doesn;t dominate bb with non_inv1). Another example if
non_inv1 is invariant but has not enough cost or something else. In
short, if it's still used in initial or others loops. Say, non_inv1 is
invariant and moved in another loop. Or something else.

So, when I read
'Only PHI nodes need PHI node creation'
and
'That is, if you move a conditionally executed statement but not any
PHI node it feeds then the movement is useless.'
I think about this example and don't understand. Could you clarify?

>>
>> 3.3   replace_uses_in_conditions
>>
>> After computations movements we can have several copies of the cond
>> stmt. In 3.1 we do replacement of uses based on stmt’s tgt_level. For,
>> cond stmt it doesn’t work. As cond stmt, of course, have something in
>> lim_aux_data->tgt_level, but, in fact, they are moved only if
>> corresponding invariants are moved. So, in fact, the same cond (copies
>> of it, of course) could go to the different levels. So to fix these
>> uses, we call replace_uses_in_conditions.
>
> Hmm, but stmts that a condition depends on always have at least the
> target level of the condition statement.  Thus their computation is
> available in all other "same" conditon statements that might appear
> in more inner loop levels, no?  So isn't this just about performing the
> movement in a proper ordering?

Let me give your more details, then may be you repeat question in other words.
for ()
{
  if (a)
    {
       inv1 = some_computations;
       if (inv1 < inv2)
            {
              inv3;
              inv4;
            }
       }
}

So, the order for movement is: inv1; inv3; inv4. When we move inv1 we
have tgt_level for inv3 and inv4 computed but we have not created
copies of 'if (inv1 < inv2)' created. In theory, we can have several
copies:(1) for tgt_level of inv3 (2) for tgt_level of inv2 (3) in
initial cycle. if (1) is in tgt level of inv1 we need no replacement,
in other cases we need replacements. Of course, we know this(all the
info about tgt_levels) when we move inv1, but then we need to create
all needed copies (store it somehow, etc) of 'if (inv1 < inv2') and
make replacement in these copies. I don't see now why this is much
better. This doesn't mean that current solution is perfect, just
clarify your thought.

>
>> More details on 3.1 and 3.2
>>
>> 3.1 The patch is very similar for phi stmts and regular stmts (there
>> is some difference for regular stmts related to the case when we move
>> sequence instead of single stmt, let’s skip it in this description,
>> will describe later. BTW, there are comments, in corresponding parts).
>>
>> a) for each bb we call check_conditionally_executed and if bb is cond
>> executed then invariant statement  will be moved conditionally.
>>
>> b) prepare_edge_for_stmt collects information needed to create phi
>> nodes in 3.2(add_info_for_phi) AND prepares edge ie creates if()
>> structure needed to move statement
>> (get_block_for_predicated_statement, see below). All of this is done
>> for cond executed stmt. Nothing is done for non cond executed.
>>
>> BTW, there is some strange names for functions like missed ending
>> (prepare_edge_for_stm, w/o t in the end, this is because of my script
>> to add ‘ ‘ before ‘(‘, will fix in the next patch version, ignore
>> please so far).
>>
>> Get_block_for_predicated_stmt:
>>
>> Create bb where to add cond stmt to (or return existing, each loop has
>> a map  (level_cond_data_map is map <loop, cond_data_map> ,
>> cond_data_map is map <gimple, basic_block>) to identify if bb for
>> given cond was created or not for given level).
>>
>> Parameters of this method:
>>
>> Cond – condition of stmt, branch – is it on true or false edge of
>> cond, cond_map  - see above, level (tgt_level), preheader –
>> preheader_edge for level.
>>
>> So first of all we check whether there is a basic block for given
>> condition in given loop.
>>
>> a)      If such bb exists then we trying to find corresponding edge,
>> destination of this edge shouldn’t be a post dominator of bb
>> (find_branch_edge).
>>
>> a.1) if such edge exists we return the most remote in this branch bb
>> (or create it), see get_most_remote_in_branch;
>>
>> a.2) if such edge doesn’t exists, we create corresponding bb, see make_branch
>>
>>        b)  If such bb doesn’t exist we create it (recursively calling
>> get_block_for_predicated_stmt if required bb is conditionally
>> executed).
>>
>> Will not describe it further so far (let me know if you would like to
>> read additional details).
>>
>> Add_info_for_phi
>>
>> The main goal of this method is to identify situation where
>> stmt(parameter,ie the statement which we are moving) or more exactly
>> lhs of this statement
>>
>> a)      is still used in other loops, so we need to create phi node to
>> use phi name inside other loops.
>>
>> b)      Is used in phi node of original loop and this phi node is
>> going to be moved. BUT, as we move phi node as assignment (in the case
>> of phi node with 2 arguments) we need to create phi node to use lhs of
>> statement (param of add_info_for_phi).
>>
>> In these cases we make corresponding replacements and store
>> information needed to create phi nodes and make some other
>> replacements in 3.3 (replace_uses_in_conditions).
>>
>> Add_info_for_phi calls create_tmp_var which requires some explanation.
>>
>> Create_tmp_var
>>
>> To create new names for phi nodes and to create default def for phi
>> arguments I use make_ssa_name and get_or_create_ssa_default_def. These
>> methods have some asserts (being merged):   (TREE_CODE (old_var) ==
>> VAR_DECL   || TREE_CODE (old_var) == PARM_DECL   || TREE_CODE
>> (old_var) == RESULT_DECL).
>>
>> So in some cases (when I know that I need to use methods mentioned
>> above, see the start of create_tmp_var, ie the uses in other loops) I
>> create new tmp variable to overcome these asserts. To be honest I
>> don’t like this but don’t know how to deal with this better.
>
> Hmm, don't you just run into SSA names here?  That is, TREE_CODE
> (old_var) == SSA_NAME?
> You can use make_ssa_name (TREE_TYPE (old_var), "plim") in this case.
>
> Or you run into the issue that anonymous SSA names cannot have default
> defintions (I don't understand why you need all the fuss about default defs).

I used default defs to write something in phi args when I have no values on
corresponding branches. Say there is some tmp value which I should
create phi for.
if (a)
   {
     tmpvar=something
     b = something(tmpvar);
    }
else
   {
   }
When I move if (a) {tmpvar = something} and create phi for tmpvar (to
use phi result
in the initial cycle) i need to write something for 'not a' phi arg.
BUT, in some cases I use
 build_zero_cst (which I started to use after default defs use. I mean
I didn't know about
 build_zero_cst while I started to use default defs.). So may be I can replace
uses of default defs in such cases by build_zero_cst? I mean just SOME values
(as I know that control flow will never go to this places actually). I
mean if !a tmpvar will not used.
>
>> And a couple of words regarding where we store info for phi nodes:
>>
>> Each loop contains in its aux  phi_data structure. On this stage we
>> only add there stmt (if phi node will be required), phi name=phi node
>> result name in names_map or fl_names_map). See the comment about
>> phi_data in patch.
>>
>> If there should be created several phi nodes for given lhs (I mean the
>> first place for phi node doesn’t dominate on corresponding preheader,
>> we add only the last name in names_map (as intermediate names could be
>> created later, but the last name in bb which dominates preheader
>> should be used for replacement in other loops. Replacements were
>> already made in walker).
>>
>> If lhs is used only in phi node, which are moved and transformed into
>> assignment, and there is no uses in other loops, we need to create
>> only first level phi node (don’t need to create phi nodes till some bb
>> which dominates preheader), then we add such name to fl_names_map.
>>
>> 3.2 Create_phi_nodes
>>
>> For each of the loops we go over ((phi_data*) loop->aux)->stmts. These
>> are statements which were moved, but they have uses in the original
>> loop (more exactly, their uses in some other loops replaced by some
>> name from ((phi_data*) loop->aux)->names_map or ((phi_data*)
>> loop->aux)->fl_names_map.
>>
>> So for each of such stmts we create phi nodes (for its lhs) chain.
>>
>> Basic block for phi node is found in get_bb_for_phi  for given bb with
>> stmt. If basic block for phi node dominates preheader->src we end
>> chain creation. There is some special condition for the case when we
>> need to create only first level phi node. (I described this situation
>> above, but let me know If I need to add  more details.
>>
>> Basic blocks can have maps to identify if we created phi node for
>> given variable in given basic_block, so we only add an edge argument
>> for such case (phi_map = new hash_map<basic_block, hash_map<tree,
>> gphi*>*>).
>
> Ok.
>
> Well - I think it might be easier code-generation-wise to have a separate
> phase move_phis executed before move_computations, that just moves
> invariant PHI nodes (which this all is about - see above) and their
> controlling conditions and only then move their dependencies.
I didn;t understand so far (I think let's clarify previous questions,especially
about phi nodes then let's go back to this one.)
>
> Thanks,
> Richard.
>
>> In short, that’s all.
>>
>> Thanks,
>>
>> Evgeniya
>>
>>
>>
>> On Sat, May 9, 2015 at 1:07 AM, Evgeniya Maenkova
>> <evgeniya.maenkova@gmail.com> wrote:
>>> Hi,
>>>
>>> Could you please review my patch for predicated lim?
>>>
>>> Let me note some details about it:
>>>
>>>
>>>
>>> 1)      Phi statements are still moved only if they have 1 or 2
>>> arguments. However, phi statements could be move under conditions (as
>>> it’s done for the other statements).  Probably, phi statement motion
>>> with 3 + arguments could be implemented in the next patch after
>>> predicated lim.
>>>
>>> 2)      Patch has limitations/features like (it was ok to me to
>>> implement it such way, maybe I’m not correct. ):
>>>
>>> a)      Loop1
>>>
>>>     {
>>>
>>>               If (a)
>>>
>>>                  Loop2
>>>
>>>                      {
>>>
>>>                        Stmt - Invariant for Loop1
>>>
>>>                      }
>>>
>>>              }
>>>
>>>        In this case Stmt will be moved only out of Loop2, because of if (a).
>>>
>>> b)      Or
>>>
>>> Loop1
>>>
>>>     {
>>>
>>>          …
>>>
>>>          If (cond1)
>>>
>>>               If (cond2)
>>>
>>>                   If (cond3)
>>>
>>>                       Stmt;
>>>
>>>        }
>>>
>>> Stmt will be moved out only if cond1 is always executed in Loop1.
>>>
>>> c)       It took me a long time to write all of these code, so there
>>> might be other peculiarities which I forgot to mention. :)
>>>
>>>                        Let’s discuss these ones as you will review my patch.
>>>
>>> 3)      Patch consists of 9 files:
>>>
>>> a)      gcc/testsuite/gcc.dg/tree-ssa/loop-7.c,
>>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – changed tests:
>>>
>>> -          gcc/testsuite/gcc.dg/tree-ssa/loop-7.c  changed as
>>> predicated lim moves 2 more statements out of the loop;
>>>
>>> -          gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – with conditional
>>> lim recip optimization in this test doesn’t work (the corresponding
>>> value is below threshold as I could see in the code for recip, 1<3).
>>> So to have recip working in this test I changed test a little bit.
>>>
>>> b)      gcc/tree-ssa-loop-im.c – the patched lim per se
>>>
>>> c)       gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c,
>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c,
>>>
>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c,
>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c,
>>>
>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c,
>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>>
>>> the tests for conditional lim.
>>>
>>> 4)      Patch testing:
>>>
>>> a)      make –k check (no difference in results for me for the clean
>>> build and the patched one,
>>>
>>> -          Revision: 222849,
>>>
>>> -          uname -a
>>>
>>>                Linux Istanbul 3.16.0-23-generic #31-Ubuntu SMP Tue Oct
>>> 21 18:00:35 UTC 2014 i686 i686 i686 GNU/Linux
>>>
>>> b)      Bootstrap.
>>>
>>>  It goes well now, however to fix it I have made a temporary hack in
>>> the lim code. And with this fix patch definitely shouldn’t be
>>> committed.
>>>
>>> I did so, as I would like to discuss this issue first.
>>>
>>> The problem is: I got stage2-stage3 comparison failure on the single
>>> file (tree-vect-data-refs.o). After some investigation I understood
>>> that tree-vect-data-refs.o differs being compiled with and without
>>> ‘-g’ option (yes, more exactly on stage 2 this is  ‘-g –O2 –gtoggle’,
>>> and for stage 3 this is ‘-g –O2’. But to simplify things I can
>>> reproduce this difference on the same build (even not bootstrapped),
>>> with option ‘ –g’ and without it).
>>>
>>> Of course, the file compiled with –g option will contain debug
>>> information and will differ from the corresponding file without debug
>>> information. I mean there is the difference reported by
>>> contrib/compare-debug (I mean we talk about stripped files).
>>>
>>> Then I compared assemblers and lim dump files and made a conclusion
>>> the difference is:
>>>
>>> There is statement _484=something, which is conditionally moved out of
>>> loop X. In non debug cases no usages of _484 are left inside the loop
>>> X. In debug case, there is the single use in debug statement. So for
>>> debug case my patch generates phi statement for _484 to make it
>>> available inside the loop X. For non debug case, no such phi statement
>>> generated as there is no uses of _484.
>>>
>>> There is one more statement with lhs=_493, with exactly this pattern
>>> of use. But this is not important for our discussion.
>>>
>>>
>>>
>>> So, in my opinion it’s ok to generate additional phi node for debug
>>> case. But I’m not a compiler expert and maybe there is some
>>> requirement that debug and non-debug versions should differ only by
>>> debug statements, I don’t know.
>>>
>>>
>>>
>>> My temporary hack fix is just removing of such debug statements (no
>>> debug statements, no problems J).
>>>
>>>
>>>
>>> The alternatives I see are:
>>>
>>> -          Move debug statements outside the loop (not good in my opinion);
>>>
>>> -          Somehow reset values in debug statements (not good in my
>>> opinion, if I could provide correct information for them);
>>>
>>> -          Generate phi for debug statements and fix bootstrap (say,
>>> let’s have some list of files, which we have special check for. I mean
>>> for my case, the difference is not in stage2 and stage3, it’s in debug
>>> and non-debug). I like this variant. However, as I don’t see such list
>>> of special files in the whole gcc, I think I might be missing
>>> something.
>>>
>>> What do you think?
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> Evgeniya

[-- Attachment #2: ssa-lim-exp1.c.114t.lim1 --]
[-- Type: application/octet-stream, Size: 2163 bytes --]


;; Function foo (foo, funcdef_no=0, decl_uid=1485, cgraph_uid=0, symbol_order=0)

Basic block 3 (loop 1 -- depth 1):

if (cond1_5(D) != 0)
  invariant up to level 1, cost 20.

Basic block 7 (loop 1 -- depth 1):

Basic block 4 (loop 1 -- depth 1):

tmp_inv1_8 = m_6(D) * n_7(D);
  invariant up to level 1, cost 20.

tmp_inv1.0_9 = (unsigned int) tmp_inv1_8;
  invariant up to level 1, cost 21.

Basic block 5 (loop 1 -- depth 1):

Basic block 8 (loop 1 -- depth 1):

Moving statement
tmp_inv1_8 = m_6(D) * n_7(D);
 (cost 20) out of loop 1 from bb 4.

Will analyze uses of tmp_inv1_8
Conditionally executed, created bb to move statement is 10
Moving statement
tmp_inv1.0_9 = (unsigned int) tmp_inv1_8;
 (cost 21) out of loop 1 from bb 4.

Will analyze uses of tmp_inv1.0_9
created new name plim.4_3

Will replace uses in: non_inv1_10 = tmp_inv1.0_9 + x_17;
After replacement: non_inv1_10 = plim.4_1 + x_17;
Conditionally executed, created bb to move statement is 10
Will create phi node for: plim.4_3
, def bb = 10, loop = 0,		      preheader.src = 9
phi node create in the bb = 9:plim.4_1 = PHI <0(2), plim.4_3(10)>
Removing basic block 7
basic block 7, loop depth 1
 pred:      
goto <bb 5>;
 succ:       5


fix_loop_structure: fixing up loops for function
foo (int cond1, int cond2, int cond3, int m, int n)
{
  unsigned int plim.4;
  int tmp_inv1;
  unsigned int non_inv1;
  unsigned int x;
  unsigned int tmp_inv1.0_9;
  long int non_inv1.1_12;

  <bb 2>:
  if (cond1_5(D) != 0)
    goto <bb 3>;
  else
    goto <bb 4>;

  <bb 3>:
  tmp_inv1_8 = m_6(D) * n_7(D);
  plim.4_3 = (unsigned int) tmp_inv1_8;

  <bb 4>:
  # plim.4_1 = PHI <0(2), plim.4_3(3)>

  <bb 5>:
  # x_17 = PHI <x_11(8), 0(4)>
  # non_inv1_18 = PHI <non_inv1_2(8), non_inv1_4(D)(4)>
  if (cond1_5(D) != 0)
    goto <bb 6>;
  else
    goto <bb 7>;

  <bb 6>:
  non_inv1_10 = plim.4_1 + x_17;

  <bb 7>:
  # non_inv1_2 = PHI <non_inv1_18(5), non_inv1_10(6)>
  x_11 = x_17 + 1;
  if (x_11 <= 999)
    goto <bb 8>;
  else
    goto <bb 9>;

  <bb 8>:
  goto <bb 5>;

  <bb 9>:
  # non_inv1_19 = PHI <non_inv1_2(7)>
  non_inv1.1_12 = (long int) non_inv1_19;
  bar (non_inv1.1_12);
  return;

}



[-- Attachment #3: ssa-lim-exp1.c --]
[-- Type: text/x-csrc, Size: 708 bytes --]

/* { dg-do compile } */
/* { dg-options "-O2 -fdump-tree-lim1-details" } */

void bar (long);
void foo (int cond1, int cond2, int cond3, int m, int n)
{
  unsigned x;
  unsigned non_inv1;

  for (x = 0; x < 1000; ++x)
    {
      if (cond1)
	{
          int tmp_inv1 = m*n;
          non_inv1 = tmp_inv1 + x;    
	}
    }
  bar (non_inv1);
}

/* { dg-final { scan-tree-dump-times "Conditionally executed" 3 "lim1"  } } */
/* { dg-final { scan-tree-dump-times "Moving statement" 3 "lim1"  } } */
/* { dg-final { scan-tree-dump-times "out of loop 2" 2 "lim1"  } } */
/* { dg-final { scan-tree-dump-times "out of loop 1" 1 "lim1"  } } */
/* { dg-final { cleanup-tree-dump "lim1" } } */

[-- Attachment #4: ssa-lim-exp1.c.113t.loopinit --]
[-- Type: application/octet-stream, Size: 1802 bytes --]


;; Function foo (foo, funcdef_no=0, decl_uid=1485, cgraph_uid=0, symbol_order=0)

;; 2 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2 3 7 4 5 8 6
;;
;; Loop 1
;;  header 3, latch 8
;;  depth 1, outer 0
;;  nodes: 3 8 5 7 4
;; 2 succs { 3 }
;; 3 succs { 4 7 }
;; 7 succs { 5 }
;; 4 succs { 5 }
;; 5 succs { 8 6 }
;; 8 succs { 3 }
;; 6 succs { 1 }
;; Created LCSSA PHI: non_inv1_19 = PHI <non_inv1_2(5)>

Updating SSA:
Registering new PHI nodes in block #3
Registering new PHI nodes in block #7
Registering new PHI nodes in block #4
Registering new PHI nodes in block #5
Registering new PHI nodes in block #8
Registering new PHI nodes in block #6
Updating SSA information for statement non_inv1.1_12 = (long int) non_inv1_2;

SSA replacement table
N_i -> { O_1 ... O_j } means that N_i replaces O_1, ..., O_j

non_inv1_19 -> { non_inv1_2 }
Incremental SSA update started at block: 3
Number of blocks in CFG: 9
Number of blocks to update: 4 ( 44%)
Affected blocks: 3 5 6 8


foo (int cond1, int cond2, int cond3, int m, int n)
{
  int tmp_inv1;
  unsigned int non_inv1;
  unsigned int x;
  unsigned int tmp_inv1.0_9;
  long int non_inv1.1_12;

  <bb 2>:

  <bb 3>:
  # x_17 = PHI <x_11(8), 0(2)>
  # non_inv1_18 = PHI <non_inv1_2(8), non_inv1_4(D)(2)>
  if (cond1_5(D) != 0)
    goto <bb 4>;
  else
    goto <bb 7>;

  <bb 7>:
  goto <bb 5>;

  <bb 4>:
  tmp_inv1_8 = m_6(D) * n_7(D);
  tmp_inv1.0_9 = (unsigned int) tmp_inv1_8;
  non_inv1_10 = tmp_inv1.0_9 + x_17;

  <bb 5>:
  # non_inv1_2 = PHI <non_inv1_18(7), non_inv1_10(4)>
  x_11 = x_17 + 1;
  if (x_11 <= 999)
    goto <bb 8>;
  else
    goto <bb 6>;

  <bb 8>:
  goto <bb 3>;

  <bb 6>:
  # non_inv1_19 = PHI <non_inv1_2(5)>
  non_inv1.1_12 = (long int) non_inv1_19;
  bar (non_inv1.1_12);
  return;

}



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: conditional lim
  2015-06-09 20:12     ` Evgeniya Maenkova
@ 2015-06-29 13:15       ` Richard Biener
  2015-06-29 14:24         ` Evgeniya Maenkova
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2015-06-29 13:15 UTC (permalink / raw)
  To: Evgeniya Maenkova; +Cc: GCC Patches

On Tue, Jun 9, 2015 at 10:11 PM, Evgeniya Maenkova
<evgeniya.maenkova@gmail.com> wrote:
> On Tue, Jun 9, 2015 at 3:46 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, May 29, 2015 at 3:14 PM, Evgeniya Maenkova
>> <evgeniya.maenkova@gmail.com> wrote:
>>> Hi Richard,
>>>
>>> Here is some explanation. I hope you let me know if I need to clarify something.
>>>
>>> Also, you asked me about concrete example, to make sure you don’t miss
>>> my answer here is the link:
>>> https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02417.html.
>>>
>>> Also, I doubt whether it’s convenient for you to create a build with
>>> my patch or not. May be to clarify things you could send me some
>>> examples/concrete cases, then I’ll compile them with
>>> –fdump-tree-loopinit-details and –fdump-tree-lim-details and send you
>>> these dumps. May be these dumps will be useful. (I’ll only disable
>>> cleanup_cfg TODO after lim to let you know the exact picture after
>>> lim).
>>>
>>> What do you think?
>>>
>>> 1.       invariantness _dom_walker –
>>>
>>> 1.1   for each GIMPLE_COND in given bb calls handle_cond_stmt to call
>>> for true and false edges handle_branch_edge, which calls SET_TARGET_OF
>>> for all bb ‘predicated’ by given GIMPLE_COND.
>>>
>>> SET_TARGET_OF sets in basic_blocks aux 2 facts:
>>>
>>> a)      this is true or false edge;
>>>
>>> b)      link to cond stmt;
>>>
>>> Handle_branch_edge works this way:
>>>
>>> If (cond1)
>>>
>>>   {
>>>
>>>      bb1;
>>>
>>>      if (cond2}
>>>
>>>        {
>>>
>>>            bb2;
>>>
>>>         }
>>>
>>>    Being called for cond1, it sets cond1 as condition for both bb1 and
>>> bb2 (the whole branch for cond1, ie also for bb containing cond2),
>>> then this method will be called (as there is dominance order) for
>>> cond2 to correct things (ie to set cond2 as condition for bb2).
>>
>> Hmm, why not track the current condition as state during the DOM walk
>> and thus avoid processing more than one basic-block in handle_branch_edge?
>> Thus get rid of handle_branch_edge and instead do everything in handle_cond_stmt
>> plus the dom-walkers BB visitor?
>>
> I need to look more carefully how to implement it, but I think I
> understand what you mean and this optimization of course looks
> reasonable to me. Will do.
>
>> I see you don't handle BBs with multiple predecessors - that's ok, but
>> are you sure you don't run into correctness issues when not marking such
>> BBs as predicated?  This misses handling of, say
>>
>>  if (a || b)
>>    bb;
>>
>> which is a pity (but can be fixed later if desired).
>>
> I had some test (in gcc testsuite or bootstrap build) which worked
> incorrectly because of multiple predecessors. As far as I remember the
> situation was (further, will make some notes about such tests to
> clarify this better), I mean with previous version of my code which
> handled bb with 2 predecessors:
> if (a)
>   tmpvar=something;
> while()
>   if (a || b)
>       basic_block {do something with tmpvar;} // I mean basic block
> predicated by bb with a and bb with b
>
> So, if a is false, I mean we didn't do tmpvar=something (outside
> loop), BUT we are in basick_block  (we went by bb with b), we got
> Segmentation falt in basic_block {do something with tmpvar;}.
>
> I think we can reproduce all the details of this test if I remove not
> handling bb with 2 predecessors.
>
> So I wouldn't move bb with 2 predecessors (this is not always executed
> bb in any loop, not conditionally, they will not be moved at all).
>
> This is my more detail explanation on this point. Perhaps, I didn't
> understand your question about correctness. Could you repeat it in
> other words (based on this new clarification).
>
> So I think according to current code it will not be moved. What
> incorrectness do you mean?

If the block isn't marked as predicated the question is whether it is
handled correctly or assumed to be unconditionally executed.

>> I note that collecting predicates has similarities to what if-conversion
>> does in tree-ifcvt.c (even if its implementation is completely different,
>> of course).
>>
>
> Ok, I'll look at this. But could you please clarify your point?
> (Should I just take this into account with low priority and look at
> this later or you want some refactoring?)

I just noted similar code exists elsewhere - it may be possible to
factor it out but I didn't investigate.  And no, doing that isn't a prerequesite
for this patch.

>>> 1.2   As 1.1 goes we identify whether some bb is predicated by some
>>> condition or not.
>>>
>>> bb->aux->type will be [TRUE/FALSE]_TARGET_OF and
>>> bb->aux->cond_stmt=cond stmt (the nearest condition).
>>>
>>> If bb is always executed bb->aux->type = ALWAYS_EXECUTED_IN,
>>> bb->loop->loop (this info was available in the clean build).
>>>
>>> 1.3   As this walker is called in dominance order, information about
>>> condition is available when invariantness_dom_walker is called for
>>> given bb.  So we can make some computations based on bb->aux
>>> structure. This is done in check_conditionally_executed. The main goal
>>> of this method is to check that the root condition is always executed
>>> in the loop. I did so to avoid situation like this
>>>
>>> Loop:
>>>
>>>    Jmp somewhere;
>>>
>>>   If (cond1)
>>>
>>>       If (cond2)
>>>
>>>          Etc
>>>
>>> By ‘root condition’ I mean cond1 in this case (the first cond in
>>> dominance order).
>>
>> Ok, but this can't happen (an uncontional jump to somehwere).  It
>> will always be a conditional jump (a loop exit) and thus should
>> be handled already.
>>
>> Did you have a testcase that was mishandled?
>
> No, I have not such testcase. This is because I;m newbie at gcc and
> compilers. If you tell me this fact, let's just remove this check?
> Correct?

Yes.

>>
>>> If  ‘root condition’ is not always executed we disable (free) all the
>>> info in bb->aux, ie all the blocks becomes neither conditionally nor
>>> always executed according to bb->aux.
>>>
>>> There is some optimization to don’t go each time trough the conditions
>>> chain (cond2->cond1), let me skip such details for now.
>>>
>>>
>>>
>>> 1.4   Then we calculate tgt_level  (which is used in move_computations later)
>>>
>>> The patch is the same for phi and regular stmt (calculate_tgt_level)
>>>
>>> 1)      If stmt is conditionally executed we compute max movement
>>> (determine_max_movement) starting from
>>> get_lim_data(condition)->max_loop.
>>>
>>> 2)      If stmt is not cond executed as start level  for
>>> determine_max_movement computations we choose ALWAYS_EXECUTED_IN.
>>>
>>> To clarify why, see the comment
>>>
>>> /* The reason why we don't set start_level for MOVE_POSSIBLE
>>>
>>>           statements to more outer level is: this statement could depend on
>>>
>>>           conditional statements and even probably is not defined without this
>>>
>>>           condition. So either we should analyze these ones and move
>>>
>>>           under condition somehow or keep more strong start_level . */
>>>
>>> As you noted in your review there was some refactoring. Of course, I
>>> had no goal to refactor existing code, I intended to remove some
>>> duplication which I caused by my changes.  I hope we discuss in
>>> details later if you don’t like some refactoring.
>>
>> Refactoring makes a big patch harder to review because it ends up
>> containing no-op changes.  It also makes it appear bigger than it
>> really is ;)
>>
>> A reasonable solution is to factor out the refactoring to a separate patch.
>>
> I see what you mean.
>
>>> 2.       store_motion  - for some stmts set flags to ignore predicated
>>> execution (I mean to move statements without conditions).
>>
>> That's for the existing "predicated" support as far as I can see.
>>
>>>
>>>
>>> 3.       move_computations. The patch is doing something in the
>>> following 3 calls inside of this method.
>>>
>>>
>>>
>>> 3.1   (more details below) walker.walk – move computations(create if()
>>> structure) and store information to create phi nodes later (which
>>> statements require phi nodes, as they conditionally executed and there
>>> are still their uses on other levels, what are the names for such phi
>>> nodes, as we replace uses in here, in walker.walk.)
>>
>> What kind of uses are that?  By definition all stmts executed
>> under condition A that are used in code not under condition A have
>> a PHI node associated.  This is why the existing conditional movement
>> handling moves PHI nodes (and all dependent computations).
>>
>> Are you thinking of
>>
>>   for (;;)
>>     if (a) x = 1;
>>
>>   .. use of x
>>
>> ?
> No, I think in this case phi node of x will be used in initial
> (==loopinit) code (in 'use of x') instead of some version of x (which
> I move).
>
>>If you move the PHI node for x then you don't need to insert any other
>> PHI nodes (and you _do_ have to move the PHI node anyway).
>
>>
>>> 3.2   (more details below) create_phi_nodes – create phi nodes for
>>> statements which require that (see 3.1)
>>
>> Only PHI nodes need PHI node creation and only its dependences
>> might need predication.
>>
>> That is, if you move a conditionally executed statement but not
>> any PHI node it feeds then the movement is useless.
>>
>> But maybe I am missing something about the PHI business.
>>
> Ok. I have different understanding on this (and am afraid of this of
> course :) ).
>
> See the example:
> void bar (long);
> void foo (int cond1, int cond2, int cond3, int m, int n)
> {
>   unsigned x;
>   unsigned inv1;
>
>   for (x = 0; x < 1000; ++x)
>     {
>       if (cond1)
>         {
>           int tmp_inv1 = m*n;
>           non_inv1 = tmp_inv1 + x;
>         }
>     }
>   bar (non_inv1);
> }
>
> This is working example, will attach loopinit and lim1 dumps.
> There is no phi node for tmp_inv1 in loopinitat all. However, I think
> we should move tmp_inv1. To use tmp_inv1 in non_inv1 we need phi node
> (otherwise we get verification failure as bb with conditionally moved
> tmp_inv1 doesn;t dominate bb with non_inv1). Another example if
> non_inv1 is invariant but has not enough cost or something else. In
> short, if it's still used in initial or others loops. Say, non_inv1 is
> invariant and moved in another loop. Or something else.
>
> So, when I read
> 'Only PHI nodes need PHI node creation'
> and
> 'That is, if you move a conditionally executed statement but not any
> PHI node it feeds then the movement is useless.'
> I think about this example and don't understand. Could you clarify?

Ok, so this case is already handled by LIM by simply moving tmp_inv1 = m*n
(and making it unconditionally execute before the loop).  In fact whether
we can move it does not depend on whether 'cond' is invariant or not.

If we couldn't execute m*n unconditionally on the loop preheader edge
then we'd have to be able to also move the guarding condition.  Note
that cost considerations shouldn't come into play here.

So I was only thinking of the case where we also want to move the
condition (which is wanted if we want to move a PHI node).

>>>
>>> 3.3   replace_uses_in_conditions
>>>
>>> After computations movements we can have several copies of the cond
>>> stmt. In 3.1 we do replacement of uses based on stmt’s tgt_level. For,
>>> cond stmt it doesn’t work. As cond stmt, of course, have something in
>>> lim_aux_data->tgt_level, but, in fact, they are moved only if
>>> corresponding invariants are moved. So, in fact, the same cond (copies
>>> of it, of course) could go to the different levels. So to fix these
>>> uses, we call replace_uses_in_conditions.
>>
>> Hmm, but stmts that a condition depends on always have at least the
>> target level of the condition statement.  Thus their computation is
>> available in all other "same" conditon statements that might appear
>> in more inner loop levels, no?  So isn't this just about performing the
>> movement in a proper ordering?
>
> Let me give your more details, then may be you repeat question in other words.
> for ()
> {
>   if (a)
>     {
>        inv1 = some_computations;
>        if (inv1 < inv2)
>             {
>               inv3;
>               inv4;
>             }
>        }
> }
>
> So, the order for movement is: inv1; inv3; inv4. When we move inv1 we
> have tgt_level for inv3 and inv4 computed but we have not created
> copies of 'if (inv1 < inv2)' created. In theory, we can have several
> copies:(1) for tgt_level of inv3 (2) for tgt_level of inv2 (3) in
> initial cycle. if (1) is in tgt level of inv1 we need no replacement,
> in other cases we need replacements. Of course, we know this(all the
> info about tgt_levels) when we move inv1, but then we need to create
> all needed copies (store it somehow, etc) of 'if (inv1 < inv2') and
> make replacement in these copies. I don't see now why this is much
> better. This doesn't mean that current solution is perfect, just
> clarify your thought.

I still don't get what "replacements" actually mean.  Let's extend the
example to

for ()  (A)
  {
> for ()  (B)
> {
>   if (a)
>     {
>        inv1 = some_computations;
>        if (inv1 < inv2)
>             {
>               inv3;
>               inv4;
>             }
>        }
> }
 }

then you say the target level of inv1 could be 'A' and the target
level of inv3 'B'
and that of inv4 is 'B' as well.

That's true.  So we move inv1 to before 'A' and inv3/inv4 to before 'B' guarded
with if (inv1 < inv2).  I fail to see what "copies" we need here.  'inv1' is
available in all target levels below its own, no copies needed.

>>
>>> More details on 3.1 and 3.2
>>>
>>> 3.1 The patch is very similar for phi stmts and regular stmts (there
>>> is some difference for regular stmts related to the case when we move
>>> sequence instead of single stmt, let’s skip it in this description,
>>> will describe later. BTW, there are comments, in corresponding parts).
>>>
>>> a) for each bb we call check_conditionally_executed and if bb is cond
>>> executed then invariant statement  will be moved conditionally.
>>>
>>> b) prepare_edge_for_stmt collects information needed to create phi
>>> nodes in 3.2(add_info_for_phi) AND prepares edge ie creates if()
>>> structure needed to move statement
>>> (get_block_for_predicated_statement, see below). All of this is done
>>> for cond executed stmt. Nothing is done for non cond executed.
>>>
>>> BTW, there is some strange names for functions like missed ending
>>> (prepare_edge_for_stm, w/o t in the end, this is because of my script
>>> to add ‘ ‘ before ‘(‘, will fix in the next patch version, ignore
>>> please so far).
>>>
>>> Get_block_for_predicated_stmt:
>>>
>>> Create bb where to add cond stmt to (or return existing, each loop has
>>> a map  (level_cond_data_map is map <loop, cond_data_map> ,
>>> cond_data_map is map <gimple, basic_block>) to identify if bb for
>>> given cond was created or not for given level).
>>>
>>> Parameters of this method:
>>>
>>> Cond – condition of stmt, branch – is it on true or false edge of
>>> cond, cond_map  - see above, level (tgt_level), preheader –
>>> preheader_edge for level.
>>>
>>> So first of all we check whether there is a basic block for given
>>> condition in given loop.
>>>
>>> a)      If such bb exists then we trying to find corresponding edge,
>>> destination of this edge shouldn’t be a post dominator of bb
>>> (find_branch_edge).
>>>
>>> a.1) if such edge exists we return the most remote in this branch bb
>>> (or create it), see get_most_remote_in_branch;
>>>
>>> a.2) if such edge doesn’t exists, we create corresponding bb, see make_branch
>>>
>>>        b)  If such bb doesn’t exist we create it (recursively calling
>>> get_block_for_predicated_stmt if required bb is conditionally
>>> executed).
>>>
>>> Will not describe it further so far (let me know if you would like to
>>> read additional details).
>>>
>>> Add_info_for_phi
>>>
>>> The main goal of this method is to identify situation where
>>> stmt(parameter,ie the statement which we are moving) or more exactly
>>> lhs of this statement
>>>
>>> a)      is still used in other loops, so we need to create phi node to
>>> use phi name inside other loops.
>>>
>>> b)      Is used in phi node of original loop and this phi node is
>>> going to be moved. BUT, as we move phi node as assignment (in the case
>>> of phi node with 2 arguments) we need to create phi node to use lhs of
>>> statement (param of add_info_for_phi).
>>>
>>> In these cases we make corresponding replacements and store
>>> information needed to create phi nodes and make some other
>>> replacements in 3.3 (replace_uses_in_conditions).
>>>
>>> Add_info_for_phi calls create_tmp_var which requires some explanation.
>>>
>>> Create_tmp_var
>>>
>>> To create new names for phi nodes and to create default def for phi
>>> arguments I use make_ssa_name and get_or_create_ssa_default_def. These
>>> methods have some asserts (being merged):   (TREE_CODE (old_var) ==
>>> VAR_DECL   || TREE_CODE (old_var) == PARM_DECL   || TREE_CODE
>>> (old_var) == RESULT_DECL).
>>>
>>> So in some cases (when I know that I need to use methods mentioned
>>> above, see the start of create_tmp_var, ie the uses in other loops) I
>>> create new tmp variable to overcome these asserts. To be honest I
>>> don’t like this but don’t know how to deal with this better.
>>
>> Hmm, don't you just run into SSA names here?  That is, TREE_CODE
>> (old_var) == SSA_NAME?
>> You can use make_ssa_name (TREE_TYPE (old_var), "plim") in this case.
>>
>> Or you run into the issue that anonymous SSA names cannot have default
>> defintions (I don't understand why you need all the fuss about default defs).
>
> I used default defs to write something in phi args when I have no values on
> corresponding branches. Say there is some tmp value which I should
> create phi for.
> if (a)
>    {
>      tmpvar=something
>      b = something(tmpvar);
>     }
> else
>    {
>    }
> When I move if (a) {tmpvar = something} and create phi for tmpvar (to
> use phi result
> in the initial cycle) i need to write something for 'not a' phi arg.
> BUT, in some cases I use
>  build_zero_cst (which I started to use after default defs use. I mean
> I didn't know about
>  build_zero_cst while I started to use default defs.). So may be I can replace
> uses of default defs in such cases by build_zero_cst? I mean just SOME values
> (as I know that control flow will never go to this places actually). I
> mean if !a tmpvar will not used.

Ok, this is again about the case where you are not moving a PHI node but
creating it because you want to conditionally execute invariant stmts.
Yes, a default def is ok here, but you can always use a new VAR_DECL
just for the default def creation like gimple_fold_call does for example:

                          tree var = create_tmp_var (TREE_TYPE (lhs));
                          tree def = get_or_create_ssa_default_def (cfun, var);

Note that I think we don't need to bother about this case as we'd only
want to move PHI nodes in the first place (so you already have values
for all PHI args).

>>
>>> And a couple of words regarding where we store info for phi nodes:
>>>
>>> Each loop contains in its aux  phi_data structure. On this stage we
>>> only add there stmt (if phi node will be required), phi name=phi node
>>> result name in names_map or fl_names_map). See the comment about
>>> phi_data in patch.
>>>
>>> If there should be created several phi nodes for given lhs (I mean the
>>> first place for phi node doesn’t dominate on corresponding preheader,
>>> we add only the last name in names_map (as intermediate names could be
>>> created later, but the last name in bb which dominates preheader
>>> should be used for replacement in other loops. Replacements were
>>> already made in walker).
>>>
>>> If lhs is used only in phi node, which are moved and transformed into
>>> assignment, and there is no uses in other loops, we need to create
>>> only first level phi node (don’t need to create phi nodes till some bb
>>> which dominates preheader), then we add such name to fl_names_map.
>>>
>>> 3.2 Create_phi_nodes
>>>
>>> For each of the loops we go over ((phi_data*) loop->aux)->stmts. These
>>> are statements which were moved, but they have uses in the original
>>> loop (more exactly, their uses in some other loops replaced by some
>>> name from ((phi_data*) loop->aux)->names_map or ((phi_data*)
>>> loop->aux)->fl_names_map.
>>>
>>> So for each of such stmts we create phi nodes (for its lhs) chain.
>>>
>>> Basic block for phi node is found in get_bb_for_phi  for given bb with
>>> stmt. If basic block for phi node dominates preheader->src we end
>>> chain creation. There is some special condition for the case when we
>>> need to create only first level phi node. (I described this situation
>>> above, but let me know If I need to add  more details.
>>>
>>> Basic blocks can have maps to identify if we created phi node for
>>> given variable in given basic_block, so we only add an edge argument
>>> for such case (phi_map = new hash_map<basic_block, hash_map<tree,
>>> gphi*>*>).
>>
>> Ok.
>>
>> Well - I think it might be easier code-generation-wise to have a separate
>> phase move_phis executed before move_computations, that just moves
>> invariant PHI nodes (which this all is about - see above) and their
>> controlling conditions and only then move their dependencies.
> I didn;t understand so far (I think let's clarify previous questions,especially
> about phi nodes then let's go back to this one.)

Yeah, I think the PHI node thing is our major mis-understanding (or the
main thing that I think complicates the patch for too little gain - at least
initially).

Thanks,
Richard.

>>
>> Thanks,
>> Richard.
>>
>>> In short, that’s all.
>>>
>>> Thanks,
>>>
>>> Evgeniya
>>>
>>>
>>>
>>> On Sat, May 9, 2015 at 1:07 AM, Evgeniya Maenkova
>>> <evgeniya.maenkova@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> Could you please review my patch for predicated lim?
>>>>
>>>> Let me note some details about it:
>>>>
>>>>
>>>>
>>>> 1)      Phi statements are still moved only if they have 1 or 2
>>>> arguments. However, phi statements could be move under conditions (as
>>>> it’s done for the other statements).  Probably, phi statement motion
>>>> with 3 + arguments could be implemented in the next patch after
>>>> predicated lim.
>>>>
>>>> 2)      Patch has limitations/features like (it was ok to me to
>>>> implement it such way, maybe I’m not correct. ):
>>>>
>>>> a)      Loop1
>>>>
>>>>     {
>>>>
>>>>               If (a)
>>>>
>>>>                  Loop2
>>>>
>>>>                      {
>>>>
>>>>                        Stmt - Invariant for Loop1
>>>>
>>>>                      }
>>>>
>>>>              }
>>>>
>>>>        In this case Stmt will be moved only out of Loop2, because of if (a).
>>>>
>>>> b)      Or
>>>>
>>>> Loop1
>>>>
>>>>     {
>>>>
>>>>          …
>>>>
>>>>          If (cond1)
>>>>
>>>>               If (cond2)
>>>>
>>>>                   If (cond3)
>>>>
>>>>                       Stmt;
>>>>
>>>>        }
>>>>
>>>> Stmt will be moved out only if cond1 is always executed in Loop1.
>>>>
>>>> c)       It took me a long time to write all of these code, so there
>>>> might be other peculiarities which I forgot to mention. :)
>>>>
>>>>                        Let’s discuss these ones as you will review my patch.
>>>>
>>>> 3)      Patch consists of 9 files:
>>>>
>>>> a)      gcc/testsuite/gcc.dg/tree-ssa/loop-7.c,
>>>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – changed tests:
>>>>
>>>> -          gcc/testsuite/gcc.dg/tree-ssa/loop-7.c  changed as
>>>> predicated lim moves 2 more statements out of the loop;
>>>>
>>>> -          gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – with conditional
>>>> lim recip optimization in this test doesn’t work (the corresponding
>>>> value is below threshold as I could see in the code for recip, 1<3).
>>>> So to have recip working in this test I changed test a little bit.
>>>>
>>>> b)      gcc/tree-ssa-loop-im.c – the patched lim per se
>>>>
>>>> c)       gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c,
>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c,
>>>>
>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c,
>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c,
>>>>
>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c,
>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>>>
>>>> the tests for conditional lim.
>>>>
>>>> 4)      Patch testing:
>>>>
>>>> a)      make –k check (no difference in results for me for the clean
>>>> build and the patched one,
>>>>
>>>> -          Revision: 222849,
>>>>
>>>> -          uname -a
>>>>
>>>>                Linux Istanbul 3.16.0-23-generic #31-Ubuntu SMP Tue Oct
>>>> 21 18:00:35 UTC 2014 i686 i686 i686 GNU/Linux
>>>>
>>>> b)      Bootstrap.
>>>>
>>>>  It goes well now, however to fix it I have made a temporary hack in
>>>> the lim code. And with this fix patch definitely shouldn’t be
>>>> committed.
>>>>
>>>> I did so, as I would like to discuss this issue first.
>>>>
>>>> The problem is: I got stage2-stage3 comparison failure on the single
>>>> file (tree-vect-data-refs.o). After some investigation I understood
>>>> that tree-vect-data-refs.o differs being compiled with and without
>>>> ‘-g’ option (yes, more exactly on stage 2 this is  ‘-g –O2 –gtoggle’,
>>>> and for stage 3 this is ‘-g –O2’. But to simplify things I can
>>>> reproduce this difference on the same build (even not bootstrapped),
>>>> with option ‘ –g’ and without it).
>>>>
>>>> Of course, the file compiled with –g option will contain debug
>>>> information and will differ from the corresponding file without debug
>>>> information. I mean there is the difference reported by
>>>> contrib/compare-debug (I mean we talk about stripped files).
>>>>
>>>> Then I compared assemblers and lim dump files and made a conclusion
>>>> the difference is:
>>>>
>>>> There is statement _484=something, which is conditionally moved out of
>>>> loop X. In non debug cases no usages of _484 are left inside the loop
>>>> X. In debug case, there is the single use in debug statement. So for
>>>> debug case my patch generates phi statement for _484 to make it
>>>> available inside the loop X. For non debug case, no such phi statement
>>>> generated as there is no uses of _484.
>>>>
>>>> There is one more statement with lhs=_493, with exactly this pattern
>>>> of use. But this is not important for our discussion.
>>>>
>>>>
>>>>
>>>> So, in my opinion it’s ok to generate additional phi node for debug
>>>> case. But I’m not a compiler expert and maybe there is some
>>>> requirement that debug and non-debug versions should differ only by
>>>> debug statements, I don’t know.
>>>>
>>>>
>>>>
>>>> My temporary hack fix is just removing of such debug statements (no
>>>> debug statements, no problems J).
>>>>
>>>>
>>>>
>>>> The alternatives I see are:
>>>>
>>>> -          Move debug statements outside the loop (not good in my opinion);
>>>>
>>>> -          Somehow reset values in debug statements (not good in my
>>>> opinion, if I could provide correct information for them);
>>>>
>>>> -          Generate phi for debug statements and fix bootstrap (say,
>>>> let’s have some list of files, which we have special check for. I mean
>>>> for my case, the difference is not in stage2 and stage3, it’s in debug
>>>> and non-debug). I like this variant. However, as I don’t see such list
>>>> of special files in the whole gcc, I think I might be missing
>>>> something.
>>>>
>>>> What do you think?
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>>
>>>> Evgeniya

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: conditional lim
  2015-06-29 13:15       ` Richard Biener
@ 2015-06-29 14:24         ` Evgeniya Maenkova
  2015-07-14 10:54           ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Evgeniya Maenkova @ 2015-06-29 14:24 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

On Mon, Jun 29, 2015 at 5:10 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Tue, Jun 9, 2015 at 10:11 PM, Evgeniya Maenkova
> <evgeniya.maenkova@gmail.com> wrote:
>> On Tue, Jun 9, 2015 at 3:46 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, May 29, 2015 at 3:14 PM, Evgeniya Maenkova
>>> <evgeniya.maenkova@gmail.com> wrote:
>>>> Hi Richard,
>>>>
>>>> Here is some explanation. I hope you let me know if I need to clarify something.
>>>>
>>>> Also, you asked me about concrete example, to make sure you don’t miss
>>>> my answer here is the link:
>>>> https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02417.html.
>>>>
>>>> Also, I doubt whether it’s convenient for you to create a build with
>>>> my patch or not. May be to clarify things you could send me some
>>>> examples/concrete cases, then I’ll compile them with
>>>> –fdump-tree-loopinit-details and –fdump-tree-lim-details and send you
>>>> these dumps. May be these dumps will be useful. (I’ll only disable
>>>> cleanup_cfg TODO after lim to let you know the exact picture after
>>>> lim).
>>>>
>>>> What do you think?
>>>>
>>>> 1.       invariantness _dom_walker –
>>>>
>>>> 1.1   for each GIMPLE_COND in given bb calls handle_cond_stmt to call
>>>> for true and false edges handle_branch_edge, which calls SET_TARGET_OF
>>>> for all bb ‘predicated’ by given GIMPLE_COND.
>>>>
>>>> SET_TARGET_OF sets in basic_blocks aux 2 facts:
>>>>
>>>> a)      this is true or false edge;
>>>>
>>>> b)      link to cond stmt;
>>>>
>>>> Handle_branch_edge works this way:
>>>>
>>>> If (cond1)
>>>>
>>>>   {
>>>>
>>>>      bb1;
>>>>
>>>>      if (cond2}
>>>>
>>>>        {
>>>>
>>>>            bb2;
>>>>
>>>>         }
>>>>
>>>>    Being called for cond1, it sets cond1 as condition for both bb1 and
>>>> bb2 (the whole branch for cond1, ie also for bb containing cond2),
>>>> then this method will be called (as there is dominance order) for
>>>> cond2 to correct things (ie to set cond2 as condition for bb2).
>>>
>>> Hmm, why not track the current condition as state during the DOM walk
>>> and thus avoid processing more than one basic-block in handle_branch_edge?
>>> Thus get rid of handle_branch_edge and instead do everything in handle_cond_stmt
>>> plus the dom-walkers BB visitor?
>>>
>> I need to look more carefully how to implement it, but I think I
>> understand what you mean and this optimization of course looks
>> reasonable to me. Will do.
>>
>>> I see you don't handle BBs with multiple predecessors - that's ok, but
>>> are you sure you don't run into correctness issues when not marking such
>>> BBs as predicated?  This misses handling of, say
>>>
>>>  if (a || b)
>>>    bb;
>>>
>>> which is a pity (but can be fixed later if desired).
>>>
>> I had some test (in gcc testsuite or bootstrap build) which worked
>> incorrectly because of multiple predecessors. As far as I remember the
>> situation was (further, will make some notes about such tests to
>> clarify this better), I mean with previous version of my code which
>> handled bb with 2 predecessors:
>> if (a)
>>   tmpvar=something;
>> while()
>>   if (a || b)
>>       basic_block {do something with tmpvar;} // I mean basic block
>> predicated by bb with a and bb with b
>>
>> So, if a is false, I mean we didn't do tmpvar=something (outside
>> loop), BUT we are in basick_block  (we went by bb with b), we got
>> Segmentation falt in basic_block {do something with tmpvar;}.
>>
>> I think we can reproduce all the details of this test if I remove not
>> handling bb with 2 predecessors.
>>
>> So I wouldn't move bb with 2 predecessors (this is not always executed
>> bb in any loop, not conditionally, they will not be moved at all).
>>
>> This is my more detail explanation on this point. Perhaps, I didn't
>> understand your question about correctness. Could you repeat it in
>> other words (based on this new clarification).
>>
>> So I think according to current code it will not be moved. What
>> incorrectness do you mean?
>
> If the block isn't marked as predicated the question is whether it is
> handled correctly or assumed to be unconditionally executed.
>
>>> I note that collecting predicates has similarities to what if-conversion
>>> does in tree-ifcvt.c (even if its implementation is completely different,
>>> of course).
>>>
>>
>> Ok, I'll look at this. But could you please clarify your point?
>> (Should I just take this into account with low priority and look at
>> this later or you want some refactoring?)
>
> I just noted similar code exists elsewhere - it may be possible to
> factor it out but I didn't investigate.  And no, doing that isn't a prerequesite
> for this patch.
>
>>>> 1.2   As 1.1 goes we identify whether some bb is predicated by some
>>>> condition or not.
>>>>
>>>> bb->aux->type will be [TRUE/FALSE]_TARGET_OF and
>>>> bb->aux->cond_stmt=cond stmt (the nearest condition).
>>>>
>>>> If bb is always executed bb->aux->type = ALWAYS_EXECUTED_IN,
>>>> bb->loop->loop (this info was available in the clean build).
>>>>
>>>> 1.3   As this walker is called in dominance order, information about
>>>> condition is available when invariantness_dom_walker is called for
>>>> given bb.  So we can make some computations based on bb->aux
>>>> structure. This is done in check_conditionally_executed. The main goal
>>>> of this method is to check that the root condition is always executed
>>>> in the loop. I did so to avoid situation like this
>>>>
>>>> Loop:
>>>>
>>>>    Jmp somewhere;
>>>>
>>>>   If (cond1)
>>>>
>>>>       If (cond2)
>>>>
>>>>          Etc
>>>>
>>>> By ‘root condition’ I mean cond1 in this case (the first cond in
>>>> dominance order).
>>>
>>> Ok, but this can't happen (an uncontional jump to somehwere).  It
>>> will always be a conditional jump (a loop exit) and thus should
>>> be handled already.
>>>
>>> Did you have a testcase that was mishandled?
>>
>> No, I have not such testcase. This is because I;m newbie at gcc and
>> compilers. If you tell me this fact, let's just remove this check?
>> Correct?
>
> Yes.
>
>>>
>>>> If  ‘root condition’ is not always executed we disable (free) all the
>>>> info in bb->aux, ie all the blocks becomes neither conditionally nor
>>>> always executed according to bb->aux.
>>>>
>>>> There is some optimization to don’t go each time trough the conditions
>>>> chain (cond2->cond1), let me skip such details for now.
>>>>
>>>>
>>>>
>>>> 1.4   Then we calculate tgt_level  (which is used in move_computations later)
>>>>
>>>> The patch is the same for phi and regular stmt (calculate_tgt_level)
>>>>
>>>> 1)      If stmt is conditionally executed we compute max movement
>>>> (determine_max_movement) starting from
>>>> get_lim_data(condition)->max_loop.
>>>>
>>>> 2)      If stmt is not cond executed as start level  for
>>>> determine_max_movement computations we choose ALWAYS_EXECUTED_IN.
>>>>
>>>> To clarify why, see the comment
>>>>
>>>> /* The reason why we don't set start_level for MOVE_POSSIBLE
>>>>
>>>>           statements to more outer level is: this statement could depend on
>>>>
>>>>           conditional statements and even probably is not defined without this
>>>>
>>>>           condition. So either we should analyze these ones and move
>>>>
>>>>           under condition somehow or keep more strong start_level . */
>>>>
>>>> As you noted in your review there was some refactoring. Of course, I
>>>> had no goal to refactor existing code, I intended to remove some
>>>> duplication which I caused by my changes.  I hope we discuss in
>>>> details later if you don’t like some refactoring.
>>>
>>> Refactoring makes a big patch harder to review because it ends up
>>> containing no-op changes.  It also makes it appear bigger than it
>>> really is ;)
>>>
>>> A reasonable solution is to factor out the refactoring to a separate patch.
>>>
>> I see what you mean.
>>
>>>> 2.       store_motion  - for some stmts set flags to ignore predicated
>>>> execution (I mean to move statements without conditions).
>>>
>>> That's for the existing "predicated" support as far as I can see.
>>>
>>>>
>>>>
>>>> 3.       move_computations. The patch is doing something in the
>>>> following 3 calls inside of this method.
>>>>
>>>>
>>>>
>>>> 3.1   (more details below) walker.walk – move computations(create if()
>>>> structure) and store information to create phi nodes later (which
>>>> statements require phi nodes, as they conditionally executed and there
>>>> are still their uses on other levels, what are the names for such phi
>>>> nodes, as we replace uses in here, in walker.walk.)
>>>
>>> What kind of uses are that?  By definition all stmts executed
>>> under condition A that are used in code not under condition A have
>>> a PHI node associated.  This is why the existing conditional movement
>>> handling moves PHI nodes (and all dependent computations).
>>>
>>> Are you thinking of
>>>
>>>   for (;;)
>>>     if (a) x = 1;
>>>
>>>   .. use of x
>>>
>>> ?
>> No, I think in this case phi node of x will be used in initial
>> (==loopinit) code (in 'use of x') instead of some version of x (which
>> I move).
>>
>>>If you move the PHI node for x then you don't need to insert any other
>>> PHI nodes (and you _do_ have to move the PHI node anyway).
>>
>>>
>>>> 3.2   (more details below) create_phi_nodes – create phi nodes for
>>>> statements which require that (see 3.1)
>>>
>>> Only PHI nodes need PHI node creation and only its dependences
>>> might need predication.
>>>
>>> That is, if you move a conditionally executed statement but not
>>> any PHI node it feeds then the movement is useless.
>>>
>>> But maybe I am missing something about the PHI business.
>>>
>> Ok. I have different understanding on this (and am afraid of this of
>> course :) ).
>>
>> See the example:
>> void bar (long);
>> void foo (int cond1, int cond2, int cond3, int m, int n)
>> {
>>   unsigned x;
>>   unsigned inv1;
>>
>>   for (x = 0; x < 1000; ++x)
>>     {
>>       if (cond1)
>>         {
>>           int tmp_inv1 = m*n;
>>           non_inv1 = tmp_inv1 + x;
>>         }
>>     }
>>   bar (non_inv1);
>> }
>>
>> This is working example, will attach loopinit and lim1 dumps.
>> There is no phi node for tmp_inv1 in loopinitat all. However, I think
>> we should move tmp_inv1. To use tmp_inv1 in non_inv1 we need phi node
>> (otherwise we get verification failure as bb with conditionally moved
>> tmp_inv1 doesn;t dominate bb with non_inv1). Another example if
>> non_inv1 is invariant but has not enough cost or something else. In
>> short, if it's still used in initial or others loops. Say, non_inv1 is
>> invariant and moved in another loop. Or something else.
>>
>> So, when I read
>> 'Only PHI nodes need PHI node creation'
>> and
>> 'That is, if you move a conditionally executed statement but not any
>> PHI node it feeds then the movement is useless.'
>> I think about this example and don't understand. Could you clarify?
>
> Ok, so this case is already handled by LIM by simply moving tmp_inv1 = m*n
> (and making it unconditionally execute before the loop).  In fact whether
> we can move it does not depend on whether 'cond' is invariant or not.
>
> If we couldn't execute m*n unconditionally on the loop preheader edge
> then we'd have to be able to also move the guarding condition.  Note
> that cost considerations shouldn't come into play here.
>
> So I was only thinking of the case where we also want to move the
> condition (which is wanted if we want to move a PHI node).
>

Don't understand "So I was only thinking of the case where we also
want to move the condition (which is wanted if we want to move a PHI
node)."

What about the case when we want to move condition but don't have phi
node to move.

Let's consider a little bit modified example:

void bar (long);
void foo (int cond1, int cond2, int cond3, int m, int n)
  {
     unsigned x;
     unsigned inv1;

     for (x = 0; x < 1000; ++x)
       {
         if (n)
           {
             int tmp_inv1 = m/n;
             non_inv1 = tmp_inv1 + x;
            }
       }
     bar (non_inv1);
 }
We couldn't  move tmp_inv1 uncoditionally. Right (I have no clean
build at hands right now to check. If I'm wrong let's replace m/n to
something that couldn't be unconditionally moved :))? There is no phi
node for tmp_inv1.
Could you clarify what conditional lim should do in your opinion here?

Is it too little gain to move statements that only used in the same bb
or branch (I mean before phi node creation, even if stmt had phi
node)?

In fact, we could have here 2 thousands of divisions:
if (n)
  {
    int tmp_inv1 = m/n;
    tmp_inv2 = something_which_couldn't_be moved_unconditionally (tmp_inv1);
    ...
    tmp_invN = something_which_couldn't_be moved_unconditionally (tmp_invN-1);
    non_inv1 = tmp_invN + x;
  }


>>>>
>>>> 3.3   replace_uses_in_conditions
>>>>
>>>> After computations movements we can have several copies of the cond
>>>> stmt. In 3.1 we do replacement of uses based on stmt’s tgt_level. For,
>>>> cond stmt it doesn’t work. As cond stmt, of course, have something in
>>>> lim_aux_data->tgt_level, but, in fact, they are moved only if
>>>> corresponding invariants are moved. So, in fact, the same cond (copies
>>>> of it, of course) could go to the different levels. So to fix these
>>>> uses, we call replace_uses_in_conditions.
>>>
>>> Hmm, but stmts that a condition depends on always have at least the
>>> target level of the condition statement.  Thus their computation is
>>> available in all other "same" conditon statements that might appear
>>> in more inner loop levels, no?  So isn't this just about performing the
>>> movement in a proper ordering?
>>
>> Let me give your more details, then may be you repeat question in other words.
>> for ()
>> {
>>   if (a)
>>     {
>>        inv1 = some_computations;
>>        if (inv1 < inv2)
>>             {
>>               inv3;
>>               inv4;
>>             }
>>        }
>> }
>>
>> So, the order for movement is: inv1; inv3; inv4. When we move inv1 we
>> have tgt_level for inv3 and inv4 computed but we have not created
>> copies of 'if (inv1 < inv2)' created. In theory, we can have several
>> copies:(1) for tgt_level of inv3 (2) for tgt_level of inv2 (3) in
>> initial cycle. if (1) is in tgt level of inv1 we need no replacement,
>> in other cases we need replacements. Of course, we know this(all the
>> info about tgt_levels) when we move inv1, but then we need to create
>> all needed copies (store it somehow, etc) of 'if (inv1 < inv2') and
>> make replacement in these copies. I don't see now why this is much
>> better. This doesn't mean that current solution is perfect, just
>> clarify your thought.
>
> I still don't get what "replacements" actually mean.  Let's extend the
> example to
>
> for ()  (A)
>   {
>> for ()  (B)
>> {
>>   if (a)
>>     {
>>        inv1 = some_computations;
>>        if (inv1 < inv2)
>>             {
>>               inv3;
>>               inv4;
>>             }
>>        }
>> }
>  }
>
> then you say the target level of inv1 could be 'A' and the target
> level of inv3 'B'
> and that of inv4 is 'B' as well.
>
> That's true.  So we move inv1 to before 'A' and inv3/inv4 to before 'B' guarded
> with if (inv1 < inv2).  I fail to see what "copies" we need here.  'inv1' is
> available in all target levels below its own, no copies needed.
>

I meant the copies of conditions (COND_STMT: inv1 < inv2).

Let's modify this example: tgt_level (inv1) = A; tgt_level(inv3) = A;
tgt_level(inv4) = B.

 Then
if (a)
   inv1=some_computations;
if (inv1 < inv2)
   inv3;
<preheader edge for A, where phi for inv1(phi_for_inv1) is computed>
for ()  (A)
 {
 if (phi_for_inv1 <inv2)
   inv3
 for ()  (B)
   {
   if (a)
     {
        if (phi_for_inv1 < inv2)
          {
          }
     }
 }
So, we have 3 copies of inv1 < inv2 after lim (in some of them we use
inv1, in other phi_for_inv1). But this is the second priority for now
(this is all about whether we should move statements w/o phi node).

>>>
>>>> More details on 3.1 and 3.2
>>>>
>>>> 3.1 The patch is very similar for phi stmts and regular stmts (there
>>>> is some difference for regular stmts related to the case when we move
>>>> sequence instead of single stmt, let’s skip it in this description,
>>>> will describe later. BTW, there are comments, in corresponding parts).
>>>>
>>>> a) for each bb we call check_conditionally_executed and if bb is cond
>>>> executed then invariant statement  will be moved conditionally.
>>>>
>>>> b) prepare_edge_for_stmt collects information needed to create phi
>>>> nodes in 3.2(add_info_for_phi) AND prepares edge ie creates if()
>>>> structure needed to move statement
>>>> (get_block_for_predicated_statement, see below). All of this is done
>>>> for cond executed stmt. Nothing is done for non cond executed.
>>>>
>>>> BTW, there is some strange names for functions like missed ending
>>>> (prepare_edge_for_stm, w/o t in the end, this is because of my script
>>>> to add ‘ ‘ before ‘(‘, will fix in the next patch version, ignore
>>>> please so far).
>>>>
>>>> Get_block_for_predicated_stmt:
>>>>
>>>> Create bb where to add cond stmt to (or return existing, each loop has
>>>> a map  (level_cond_data_map is map <loop, cond_data_map> ,
>>>> cond_data_map is map <gimple, basic_block>) to identify if bb for
>>>> given cond was created or not for given level).
>>>>
>>>> Parameters of this method:
>>>>
>>>> Cond – condition of stmt, branch – is it on true or false edge of
>>>> cond, cond_map  - see above, level (tgt_level), preheader –
>>>> preheader_edge for level.
>>>>
>>>> So first of all we check whether there is a basic block for given
>>>> condition in given loop.
>>>>
>>>> a)      If such bb exists then we trying to find corresponding edge,
>>>> destination of this edge shouldn’t be a post dominator of bb
>>>> (find_branch_edge).
>>>>
>>>> a.1) if such edge exists we return the most remote in this branch bb
>>>> (or create it), see get_most_remote_in_branch;
>>>>
>>>> a.2) if such edge doesn’t exists, we create corresponding bb, see make_branch
>>>>
>>>>        b)  If such bb doesn’t exist we create it (recursively calling
>>>> get_block_for_predicated_stmt if required bb is conditionally
>>>> executed).
>>>>
>>>> Will not describe it further so far (let me know if you would like to
>>>> read additional details).
>>>>
>>>> Add_info_for_phi
>>>>
>>>> The main goal of this method is to identify situation where
>>>> stmt(parameter,ie the statement which we are moving) or more exactly
>>>> lhs of this statement
>>>>
>>>> a)      is still used in other loops, so we need to create phi node to
>>>> use phi name inside other loops.
>>>>
>>>> b)      Is used in phi node of original loop and this phi node is
>>>> going to be moved. BUT, as we move phi node as assignment (in the case
>>>> of phi node with 2 arguments) we need to create phi node to use lhs of
>>>> statement (param of add_info_for_phi).
>>>>
>>>> In these cases we make corresponding replacements and store
>>>> information needed to create phi nodes and make some other
>>>> replacements in 3.3 (replace_uses_in_conditions).
>>>>
>>>> Add_info_for_phi calls create_tmp_var which requires some explanation.
>>>>
>>>> Create_tmp_var
>>>>
>>>> To create new names for phi nodes and to create default def for phi
>>>> arguments I use make_ssa_name and get_or_create_ssa_default_def. These
>>>> methods have some asserts (being merged):   (TREE_CODE (old_var) ==
>>>> VAR_DECL   || TREE_CODE (old_var) == PARM_DECL   || TREE_CODE
>>>> (old_var) == RESULT_DECL).
>>>>
>>>> So in some cases (when I know that I need to use methods mentioned
>>>> above, see the start of create_tmp_var, ie the uses in other loops) I
>>>> create new tmp variable to overcome these asserts. To be honest I
>>>> don’t like this but don’t know how to deal with this better.
>>>
>>> Hmm, don't you just run into SSA names here?  That is, TREE_CODE
>>> (old_var) == SSA_NAME?
>>> You can use make_ssa_name (TREE_TYPE (old_var), "plim") in this case.
>>>
>>> Or you run into the issue that anonymous SSA names cannot have default
>>> defintions (I don't understand why you need all the fuss about default defs).
>>
>> I used default defs to write something in phi args when I have no values on
>> corresponding branches. Say there is some tmp value which I should
>> create phi for.
>> if (a)
>>    {
>>      tmpvar=something
>>      b = something(tmpvar);
>>     }
>> else
>>    {
>>    }
>> When I move if (a) {tmpvar = something} and create phi for tmpvar (to
>> use phi result
>> in the initial cycle) i need to write something for 'not a' phi arg.
>> BUT, in some cases I use
>>  build_zero_cst (which I started to use after default defs use. I mean
>> I didn't know about
>>  build_zero_cst while I started to use default defs.). So may be I can replace
>> uses of default defs in such cases by build_zero_cst? I mean just SOME values
>> (as I know that control flow will never go to this places actually). I
>> mean if !a tmpvar will not used.
>
> Ok, this is again about the case where you are not moving a PHI node but
> creating it because you want to conditionally execute invariant stmts.
> Yes, a default def is ok here, but you can always use a new VAR_DECL
> just for the default def creation like gimple_fold_call does for example:
>
>                           tree var = create_tmp_var (TREE_TYPE (lhs));
>                           tree def = get_or_create_ssa_default_def (cfun, var);
>
> Note that I think we don't need to bother about this case as we'd only
> want to move PHI nodes in the first place (so you already have values
> for all PHI args).
>
>>>
>>>> And a couple of words regarding where we store info for phi nodes:
>>>>
>>>> Each loop contains in its aux  phi_data structure. On this stage we
>>>> only add there stmt (if phi node will be required), phi name=phi node
>>>> result name in names_map or fl_names_map). See the comment about
>>>> phi_data in patch.
>>>>
>>>> If there should be created several phi nodes for given lhs (I mean the
>>>> first place for phi node doesn’t dominate on corresponding preheader,
>>>> we add only the last name in names_map (as intermediate names could be
>>>> created later, but the last name in bb which dominates preheader
>>>> should be used for replacement in other loops. Replacements were
>>>> already made in walker).
>>>>
>>>> If lhs is used only in phi node, which are moved and transformed into
>>>> assignment, and there is no uses in other loops, we need to create
>>>> only first level phi node (don’t need to create phi nodes till some bb
>>>> which dominates preheader), then we add such name to fl_names_map.
>>>>
>>>> 3.2 Create_phi_nodes
>>>>
>>>> For each of the loops we go over ((phi_data*) loop->aux)->stmts. These
>>>> are statements which were moved, but they have uses in the original
>>>> loop (more exactly, their uses in some other loops replaced by some
>>>> name from ((phi_data*) loop->aux)->names_map or ((phi_data*)
>>>> loop->aux)->fl_names_map.
>>>>
>>>> So for each of such stmts we create phi nodes (for its lhs) chain.
>>>>
>>>> Basic block for phi node is found in get_bb_for_phi  for given bb with
>>>> stmt. If basic block for phi node dominates preheader->src we end
>>>> chain creation. There is some special condition for the case when we
>>>> need to create only first level phi node. (I described this situation
>>>> above, but let me know If I need to add  more details.
>>>>
>>>> Basic blocks can have maps to identify if we created phi node for
>>>> given variable in given basic_block, so we only add an edge argument
>>>> for such case (phi_map = new hash_map<basic_block, hash_map<tree,
>>>> gphi*>*>).
>>>
>>> Ok.
>>>
>>> Well - I think it might be easier code-generation-wise to have a separate
>>> phase move_phis executed before move_computations, that just moves
>>> invariant PHI nodes (which this all is about - see above) and their
>>> controlling conditions and only then move their dependencies.
>> I didn;t understand so far (I think let's clarify previous questions,especially
>> about phi nodes then let's go back to this one.)
>
> Yeah, I think the PHI node thing is our major mis-understanding (or the
> main thing that I think complicates the patch for too little gain - at least
> initially).
>
> Thanks,
> Richard.
>
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> In short, that’s all.
>>>>
>>>> Thanks,
>>>>
>>>> Evgeniya
>>>>
>>>>
>>>>
>>>> On Sat, May 9, 2015 at 1:07 AM, Evgeniya Maenkova
>>>> <evgeniya.maenkova@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> Could you please review my patch for predicated lim?
>>>>>
>>>>> Let me note some details about it:
>>>>>
>>>>>
>>>>>
>>>>> 1)      Phi statements are still moved only if they have 1 or 2
>>>>> arguments. However, phi statements could be move under conditions (as
>>>>> it’s done for the other statements).  Probably, phi statement motion
>>>>> with 3 + arguments could be implemented in the next patch after
>>>>> predicated lim.
>>>>>
>>>>> 2)      Patch has limitations/features like (it was ok to me to
>>>>> implement it such way, maybe I’m not correct. ):
>>>>>
>>>>> a)      Loop1
>>>>>
>>>>>     {
>>>>>
>>>>>               If (a)
>>>>>
>>>>>                  Loop2
>>>>>
>>>>>                      {
>>>>>
>>>>>                        Stmt - Invariant for Loop1
>>>>>
>>>>>                      }
>>>>>
>>>>>              }
>>>>>
>>>>>        In this case Stmt will be moved only out of Loop2, because of if (a).
>>>>>
>>>>> b)      Or
>>>>>
>>>>> Loop1
>>>>>
>>>>>     {
>>>>>
>>>>>          …
>>>>>
>>>>>          If (cond1)
>>>>>
>>>>>               If (cond2)
>>>>>
>>>>>                   If (cond3)
>>>>>
>>>>>                       Stmt;
>>>>>
>>>>>        }
>>>>>
>>>>> Stmt will be moved out only if cond1 is always executed in Loop1.
>>>>>
>>>>> c)       It took me a long time to write all of these code, so there
>>>>> might be other peculiarities which I forgot to mention. :)
>>>>>
>>>>>                        Let’s discuss these ones as you will review my patch.
>>>>>
>>>>> 3)      Patch consists of 9 files:
>>>>>
>>>>> a)      gcc/testsuite/gcc.dg/tree-ssa/loop-7.c,
>>>>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – changed tests:
>>>>>
>>>>> -          gcc/testsuite/gcc.dg/tree-ssa/loop-7.c  changed as
>>>>> predicated lim moves 2 more statements out of the loop;
>>>>>
>>>>> -          gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – with conditional
>>>>> lim recip optimization in this test doesn’t work (the corresponding
>>>>> value is below threshold as I could see in the code for recip, 1<3).
>>>>> So to have recip working in this test I changed test a little bit.
>>>>>
>>>>> b)      gcc/tree-ssa-loop-im.c – the patched lim per se
>>>>>
>>>>> c)       gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c,
>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c,
>>>>>
>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c,
>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c,
>>>>>
>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c,
>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>>>>
>>>>> the tests for conditional lim.
>>>>>
>>>>> 4)      Patch testing:
>>>>>
>>>>> a)      make –k check (no difference in results for me for the clean
>>>>> build and the patched one,
>>>>>
>>>>> -          Revision: 222849,
>>>>>
>>>>> -          uname -a
>>>>>
>>>>>                Linux Istanbul 3.16.0-23-generic #31-Ubuntu SMP Tue Oct
>>>>> 21 18:00:35 UTC 2014 i686 i686 i686 GNU/Linux
>>>>>
>>>>> b)      Bootstrap.
>>>>>
>>>>>  It goes well now, however to fix it I have made a temporary hack in
>>>>> the lim code. And with this fix patch definitely shouldn’t be
>>>>> committed.
>>>>>
>>>>> I did so, as I would like to discuss this issue first.
>>>>>
>>>>> The problem is: I got stage2-stage3 comparison failure on the single
>>>>> file (tree-vect-data-refs.o). After some investigation I understood
>>>>> that tree-vect-data-refs.o differs being compiled with and without
>>>>> ‘-g’ option (yes, more exactly on stage 2 this is  ‘-g –O2 –gtoggle’,
>>>>> and for stage 3 this is ‘-g –O2’. But to simplify things I can
>>>>> reproduce this difference on the same build (even not bootstrapped),
>>>>> with option ‘ –g’ and without it).
>>>>>
>>>>> Of course, the file compiled with –g option will contain debug
>>>>> information and will differ from the corresponding file without debug
>>>>> information. I mean there is the difference reported by
>>>>> contrib/compare-debug (I mean we talk about stripped files).
>>>>>
>>>>> Then I compared assemblers and lim dump files and made a conclusion
>>>>> the difference is:
>>>>>
>>>>> There is statement _484=something, which is conditionally moved out of
>>>>> loop X. In non debug cases no usages of _484 are left inside the loop
>>>>> X. In debug case, there is the single use in debug statement. So for
>>>>> debug case my patch generates phi statement for _484 to make it
>>>>> available inside the loop X. For non debug case, no such phi statement
>>>>> generated as there is no uses of _484.
>>>>>
>>>>> There is one more statement with lhs=_493, with exactly this pattern
>>>>> of use. But this is not important for our discussion.
>>>>>
>>>>>
>>>>>
>>>>> So, in my opinion it’s ok to generate additional phi node for debug
>>>>> case. But I’m not a compiler expert and maybe there is some
>>>>> requirement that debug and non-debug versions should differ only by
>>>>> debug statements, I don’t know.
>>>>>
>>>>>
>>>>>
>>>>> My temporary hack fix is just removing of such debug statements (no
>>>>> debug statements, no problems J).
>>>>>
>>>>>
>>>>>
>>>>> The alternatives I see are:
>>>>>
>>>>> -          Move debug statements outside the loop (not good in my opinion);
>>>>>
>>>>> -          Somehow reset values in debug statements (not good in my
>>>>> opinion, if I could provide correct information for them);
>>>>>
>>>>> -          Generate phi for debug statements and fix bootstrap (say,
>>>>> let’s have some list of files, which we have special check for. I mean
>>>>> for my case, the difference is not in stage2 and stage3, it’s in debug
>>>>> and non-debug). I like this variant. However, as I don’t see such list
>>>>> of special files in the whole gcc, I think I might be missing
>>>>> something.
>>>>>
>>>>> What do you think?
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>>
>>>>> Evgeniya



-- 
Thanks,

Evgeniya

perfstories.wordpress.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: conditional lim
  2015-06-29 14:24         ` Evgeniya Maenkova
@ 2015-07-14 10:54           ` Richard Biener
  2015-07-15  8:43             ` Evgeniya Maenkova
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Biener @ 2015-07-14 10:54 UTC (permalink / raw)
  To: Evgeniya Maenkova; +Cc: GCC Patches

On Mon, Jun 29, 2015 at 4:21 PM, Evgeniya Maenkova
<evgeniya.maenkova@gmail.com> wrote:
> On Mon, Jun 29, 2015 at 5:10 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Tue, Jun 9, 2015 at 10:11 PM, Evgeniya Maenkova
>> <evgeniya.maenkova@gmail.com> wrote:
>>> On Tue, Jun 9, 2015 at 3:46 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, May 29, 2015 at 3:14 PM, Evgeniya Maenkova
>>>> <evgeniya.maenkova@gmail.com> wrote:
>>>>> Hi Richard,
>>>>>
>>>>> Here is some explanation. I hope you let me know if I need to clarify something.
>>>>>
>>>>> Also, you asked me about concrete example, to make sure you don’t miss
>>>>> my answer here is the link:
>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02417.html.
>>>>>
>>>>> Also, I doubt whether it’s convenient for you to create a build with
>>>>> my patch or not. May be to clarify things you could send me some
>>>>> examples/concrete cases, then I’ll compile them with
>>>>> –fdump-tree-loopinit-details and –fdump-tree-lim-details and send you
>>>>> these dumps. May be these dumps will be useful. (I’ll only disable
>>>>> cleanup_cfg TODO after lim to let you know the exact picture after
>>>>> lim).
>>>>>
>>>>> What do you think?
>>>>>
>>>>> 1.       invariantness _dom_walker –
>>>>>
>>>>> 1.1   for each GIMPLE_COND in given bb calls handle_cond_stmt to call
>>>>> for true and false edges handle_branch_edge, which calls SET_TARGET_OF
>>>>> for all bb ‘predicated’ by given GIMPLE_COND.
>>>>>
>>>>> SET_TARGET_OF sets in basic_blocks aux 2 facts:
>>>>>
>>>>> a)      this is true or false edge;
>>>>>
>>>>> b)      link to cond stmt;
>>>>>
>>>>> Handle_branch_edge works this way:
>>>>>
>>>>> If (cond1)
>>>>>
>>>>>   {
>>>>>
>>>>>      bb1;
>>>>>
>>>>>      if (cond2}
>>>>>
>>>>>        {
>>>>>
>>>>>            bb2;
>>>>>
>>>>>         }
>>>>>
>>>>>    Being called for cond1, it sets cond1 as condition for both bb1 and
>>>>> bb2 (the whole branch for cond1, ie also for bb containing cond2),
>>>>> then this method will be called (as there is dominance order) for
>>>>> cond2 to correct things (ie to set cond2 as condition for bb2).
>>>>
>>>> Hmm, why not track the current condition as state during the DOM walk
>>>> and thus avoid processing more than one basic-block in handle_branch_edge?
>>>> Thus get rid of handle_branch_edge and instead do everything in handle_cond_stmt
>>>> plus the dom-walkers BB visitor?
>>>>
>>> I need to look more carefully how to implement it, but I think I
>>> understand what you mean and this optimization of course looks
>>> reasonable to me. Will do.
>>>
>>>> I see you don't handle BBs with multiple predecessors - that's ok, but
>>>> are you sure you don't run into correctness issues when not marking such
>>>> BBs as predicated?  This misses handling of, say
>>>>
>>>>  if (a || b)
>>>>    bb;
>>>>
>>>> which is a pity (but can be fixed later if desired).
>>>>
>>> I had some test (in gcc testsuite or bootstrap build) which worked
>>> incorrectly because of multiple predecessors. As far as I remember the
>>> situation was (further, will make some notes about such tests to
>>> clarify this better), I mean with previous version of my code which
>>> handled bb with 2 predecessors:
>>> if (a)
>>>   tmpvar=something;
>>> while()
>>>   if (a || b)
>>>       basic_block {do something with tmpvar;} // I mean basic block
>>> predicated by bb with a and bb with b
>>>
>>> So, if a is false, I mean we didn't do tmpvar=something (outside
>>> loop), BUT we are in basick_block  (we went by bb with b), we got
>>> Segmentation falt in basic_block {do something with tmpvar;}.
>>>
>>> I think we can reproduce all the details of this test if I remove not
>>> handling bb with 2 predecessors.
>>>
>>> So I wouldn't move bb with 2 predecessors (this is not always executed
>>> bb in any loop, not conditionally, they will not be moved at all).
>>>
>>> This is my more detail explanation on this point. Perhaps, I didn't
>>> understand your question about correctness. Could you repeat it in
>>> other words (based on this new clarification).
>>>
>>> So I think according to current code it will not be moved. What
>>> incorrectness do you mean?
>>
>> If the block isn't marked as predicated the question is whether it is
>> handled correctly or assumed to be unconditionally executed.
>>
>>>> I note that collecting predicates has similarities to what if-conversion
>>>> does in tree-ifcvt.c (even if its implementation is completely different,
>>>> of course).
>>>>
>>>
>>> Ok, I'll look at this. But could you please clarify your point?
>>> (Should I just take this into account with low priority and look at
>>> this later or you want some refactoring?)
>>
>> I just noted similar code exists elsewhere - it may be possible to
>> factor it out but I didn't investigate.  And no, doing that isn't a prerequesite
>> for this patch.
>>
>>>>> 1.2   As 1.1 goes we identify whether some bb is predicated by some
>>>>> condition or not.
>>>>>
>>>>> bb->aux->type will be [TRUE/FALSE]_TARGET_OF and
>>>>> bb->aux->cond_stmt=cond stmt (the nearest condition).
>>>>>
>>>>> If bb is always executed bb->aux->type = ALWAYS_EXECUTED_IN,
>>>>> bb->loop->loop (this info was available in the clean build).
>>>>>
>>>>> 1.3   As this walker is called in dominance order, information about
>>>>> condition is available when invariantness_dom_walker is called for
>>>>> given bb.  So we can make some computations based on bb->aux
>>>>> structure. This is done in check_conditionally_executed. The main goal
>>>>> of this method is to check that the root condition is always executed
>>>>> in the loop. I did so to avoid situation like this
>>>>>
>>>>> Loop:
>>>>>
>>>>>    Jmp somewhere;
>>>>>
>>>>>   If (cond1)
>>>>>
>>>>>       If (cond2)
>>>>>
>>>>>          Etc
>>>>>
>>>>> By ‘root condition’ I mean cond1 in this case (the first cond in
>>>>> dominance order).
>>>>
>>>> Ok, but this can't happen (an uncontional jump to somehwere).  It
>>>> will always be a conditional jump (a loop exit) and thus should
>>>> be handled already.
>>>>
>>>> Did you have a testcase that was mishandled?
>>>
>>> No, I have not such testcase. This is because I;m newbie at gcc and
>>> compilers. If you tell me this fact, let's just remove this check?
>>> Correct?
>>
>> Yes.
>>
>>>>
>>>>> If  ‘root condition’ is not always executed we disable (free) all the
>>>>> info in bb->aux, ie all the blocks becomes neither conditionally nor
>>>>> always executed according to bb->aux.
>>>>>
>>>>> There is some optimization to don’t go each time trough the conditions
>>>>> chain (cond2->cond1), let me skip such details for now.
>>>>>
>>>>>
>>>>>
>>>>> 1.4   Then we calculate tgt_level  (which is used in move_computations later)
>>>>>
>>>>> The patch is the same for phi and regular stmt (calculate_tgt_level)
>>>>>
>>>>> 1)      If stmt is conditionally executed we compute max movement
>>>>> (determine_max_movement) starting from
>>>>> get_lim_data(condition)->max_loop.
>>>>>
>>>>> 2)      If stmt is not cond executed as start level  for
>>>>> determine_max_movement computations we choose ALWAYS_EXECUTED_IN.
>>>>>
>>>>> To clarify why, see the comment
>>>>>
>>>>> /* The reason why we don't set start_level for MOVE_POSSIBLE
>>>>>
>>>>>           statements to more outer level is: this statement could depend on
>>>>>
>>>>>           conditional statements and even probably is not defined without this
>>>>>
>>>>>           condition. So either we should analyze these ones and move
>>>>>
>>>>>           under condition somehow or keep more strong start_level . */
>>>>>
>>>>> As you noted in your review there was some refactoring. Of course, I
>>>>> had no goal to refactor existing code, I intended to remove some
>>>>> duplication which I caused by my changes.  I hope we discuss in
>>>>> details later if you don’t like some refactoring.
>>>>
>>>> Refactoring makes a big patch harder to review because it ends up
>>>> containing no-op changes.  It also makes it appear bigger than it
>>>> really is ;)
>>>>
>>>> A reasonable solution is to factor out the refactoring to a separate patch.
>>>>
>>> I see what you mean.
>>>
>>>>> 2.       store_motion  - for some stmts set flags to ignore predicated
>>>>> execution (I mean to move statements without conditions).
>>>>
>>>> That's for the existing "predicated" support as far as I can see.
>>>>
>>>>>
>>>>>
>>>>> 3.       move_computations. The patch is doing something in the
>>>>> following 3 calls inside of this method.
>>>>>
>>>>>
>>>>>
>>>>> 3.1   (more details below) walker.walk – move computations(create if()
>>>>> structure) and store information to create phi nodes later (which
>>>>> statements require phi nodes, as they conditionally executed and there
>>>>> are still their uses on other levels, what are the names for such phi
>>>>> nodes, as we replace uses in here, in walker.walk.)
>>>>
>>>> What kind of uses are that?  By definition all stmts executed
>>>> under condition A that are used in code not under condition A have
>>>> a PHI node associated.  This is why the existing conditional movement
>>>> handling moves PHI nodes (and all dependent computations).
>>>>
>>>> Are you thinking of
>>>>
>>>>   for (;;)
>>>>     if (a) x = 1;
>>>>
>>>>   .. use of x
>>>>
>>>> ?
>>> No, I think in this case phi node of x will be used in initial
>>> (==loopinit) code (in 'use of x') instead of some version of x (which
>>> I move).
>>>
>>>>If you move the PHI node for x then you don't need to insert any other
>>>> PHI nodes (and you _do_ have to move the PHI node anyway).
>>>
>>>>
>>>>> 3.2   (more details below) create_phi_nodes – create phi nodes for
>>>>> statements which require that (see 3.1)
>>>>
>>>> Only PHI nodes need PHI node creation and only its dependences
>>>> might need predication.
>>>>
>>>> That is, if you move a conditionally executed statement but not
>>>> any PHI node it feeds then the movement is useless.
>>>>
>>>> But maybe I am missing something about the PHI business.
>>>>
>>> Ok. I have different understanding on this (and am afraid of this of
>>> course :) ).
>>>
>>> See the example:
>>> void bar (long);
>>> void foo (int cond1, int cond2, int cond3, int m, int n)
>>> {
>>>   unsigned x;
>>>   unsigned inv1;
>>>
>>>   for (x = 0; x < 1000; ++x)
>>>     {
>>>       if (cond1)
>>>         {
>>>           int tmp_inv1 = m*n;
>>>           non_inv1 = tmp_inv1 + x;
>>>         }
>>>     }
>>>   bar (non_inv1);
>>> }
>>>
>>> This is working example, will attach loopinit and lim1 dumps.
>>> There is no phi node for tmp_inv1 in loopinitat all. However, I think
>>> we should move tmp_inv1. To use tmp_inv1 in non_inv1 we need phi node
>>> (otherwise we get verification failure as bb with conditionally moved
>>> tmp_inv1 doesn;t dominate bb with non_inv1). Another example if
>>> non_inv1 is invariant but has not enough cost or something else. In
>>> short, if it's still used in initial or others loops. Say, non_inv1 is
>>> invariant and moved in another loop. Or something else.
>>>
>>> So, when I read
>>> 'Only PHI nodes need PHI node creation'
>>> and
>>> 'That is, if you move a conditionally executed statement but not any
>>> PHI node it feeds then the movement is useless.'
>>> I think about this example and don't understand. Could you clarify?
>>
>> Ok, so this case is already handled by LIM by simply moving tmp_inv1 = m*n
>> (and making it unconditionally execute before the loop).  In fact whether
>> we can move it does not depend on whether 'cond' is invariant or not.
>>
>> If we couldn't execute m*n unconditionally on the loop preheader edge
>> then we'd have to be able to also move the guarding condition.  Note
>> that cost considerations shouldn't come into play here.
>>
>> So I was only thinking of the case where we also want to move the
>> condition (which is wanted if we want to move a PHI node).
>>
>
> Don't understand "So I was only thinking of the case where we also
> want to move the condition (which is wanted if we want to move a PHI
> node)."
>
> What about the case when we want to move condition but don't have phi
> node to move.
>
> Let's consider a little bit modified example:
>
> void bar (long);
> void foo (int cond1, int cond2, int cond3, int m, int n)
>   {
>      unsigned x;
>      unsigned inv1;
>
>      for (x = 0; x < 1000; ++x)
>        {
>          if (n)
>            {
>              int tmp_inv1 = m/n;
>              non_inv1 = tmp_inv1 + x;
>             }
>        }
>      bar (non_inv1);
>  }
> We couldn't  move tmp_inv1 uncoditionally. Right (I have no clean
> build at hands right now to check. If I'm wrong let's replace m/n to
> something that couldn't be unconditionally moved :))? There is no phi
> node for tmp_inv1.
> Could you clarify what conditional lim should do in your opinion here?

Nothing at this point.  If only then for the simple reason to make the
first iteration of the patch simpler (still covering most cases).

You seem to want to catch 100% of all possible cases.

> Is it too little gain to move statements that only used in the same bb
> or branch (I mean before phi node creation, even if stmt had phi
> node)?
>
> In fact, we could have here 2 thousands of divisions:
> if (n)
>   {
>     int tmp_inv1 = m/n;
>     tmp_inv2 = something_which_couldn't_be moved_unconditionally (tmp_inv1);
>     ...
>     tmp_invN = something_which_couldn't_be moved_unconditionally (tmp_invN-1);
>     non_inv1 = tmp_invN + x;
>   }

Of course we could.  But the patch didn't even exercise this case (no testcase).

Please make patches simple - support for moving non-PHIs can be added
as a followup.  The first important part is to make the current code
(moving PHIs)
create control-flow rather than if-converted form.

Richard.

>
>>>>>
>>>>> 3.3   replace_uses_in_conditions
>>>>>
>>>>> After computations movements we can have several copies of the cond
>>>>> stmt. In 3.1 we do replacement of uses based on stmt’s tgt_level. For,
>>>>> cond stmt it doesn’t work. As cond stmt, of course, have something in
>>>>> lim_aux_data->tgt_level, but, in fact, they are moved only if
>>>>> corresponding invariants are moved. So, in fact, the same cond (copies
>>>>> of it, of course) could go to the different levels. So to fix these
>>>>> uses, we call replace_uses_in_conditions.
>>>>
>>>> Hmm, but stmts that a condition depends on always have at least the
>>>> target level of the condition statement.  Thus their computation is
>>>> available in all other "same" conditon statements that might appear
>>>> in more inner loop levels, no?  So isn't this just about performing the
>>>> movement in a proper ordering?
>>>
>>> Let me give your more details, then may be you repeat question in other words.
>>> for ()
>>> {
>>>   if (a)
>>>     {
>>>        inv1 = some_computations;
>>>        if (inv1 < inv2)
>>>             {
>>>               inv3;
>>>               inv4;
>>>             }
>>>        }
>>> }
>>>
>>> So, the order for movement is: inv1; inv3; inv4. When we move inv1 we
>>> have tgt_level for inv3 and inv4 computed but we have not created
>>> copies of 'if (inv1 < inv2)' created. In theory, we can have several
>>> copies:(1) for tgt_level of inv3 (2) for tgt_level of inv2 (3) in
>>> initial cycle. if (1) is in tgt level of inv1 we need no replacement,
>>> in other cases we need replacements. Of course, we know this(all the
>>> info about tgt_levels) when we move inv1, but then we need to create
>>> all needed copies (store it somehow, etc) of 'if (inv1 < inv2') and
>>> make replacement in these copies. I don't see now why this is much
>>> better. This doesn't mean that current solution is perfect, just
>>> clarify your thought.
>>
>> I still don't get what "replacements" actually mean.  Let's extend the
>> example to
>>
>> for ()  (A)
>>   {
>>> for ()  (B)
>>> {
>>>   if (a)
>>>     {
>>>        inv1 = some_computations;
>>>        if (inv1 < inv2)
>>>             {
>>>               inv3;
>>>               inv4;
>>>             }
>>>        }
>>> }
>>  }
>>
>> then you say the target level of inv1 could be 'A' and the target
>> level of inv3 'B'
>> and that of inv4 is 'B' as well.
>>
>> That's true.  So we move inv1 to before 'A' and inv3/inv4 to before 'B' guarded
>> with if (inv1 < inv2).  I fail to see what "copies" we need here.  'inv1' is
>> available in all target levels below its own, no copies needed.
>>
>
> I meant the copies of conditions (COND_STMT: inv1 < inv2).
>
> Let's modify this example: tgt_level (inv1) = A; tgt_level(inv3) = A;
> tgt_level(inv4) = B.
>
>  Then
> if (a)
>    inv1=some_computations;
> if (inv1 < inv2)
>    inv3;
> <preheader edge for A, where phi for inv1(phi_for_inv1) is computed>
> for ()  (A)
>  {
>  if (phi_for_inv1 <inv2)
>    inv3
>  for ()  (B)
>    {
>    if (a)
>      {
>         if (phi_for_inv1 < inv2)
>           {
>           }
>      }
>  }
> So, we have 3 copies of inv1 < inv2 after lim (in some of them we use
> inv1, in other phi_for_inv1). But this is the second priority for now
> (this is all about whether we should move statements w/o phi node).
>
>>>>
>>>>> More details on 3.1 and 3.2
>>>>>
>>>>> 3.1 The patch is very similar for phi stmts and regular stmts (there
>>>>> is some difference for regular stmts related to the case when we move
>>>>> sequence instead of single stmt, let’s skip it in this description,
>>>>> will describe later. BTW, there are comments, in corresponding parts).
>>>>>
>>>>> a) for each bb we call check_conditionally_executed and if bb is cond
>>>>> executed then invariant statement  will be moved conditionally.
>>>>>
>>>>> b) prepare_edge_for_stmt collects information needed to create phi
>>>>> nodes in 3.2(add_info_for_phi) AND prepares edge ie creates if()
>>>>> structure needed to move statement
>>>>> (get_block_for_predicated_statement, see below). All of this is done
>>>>> for cond executed stmt. Nothing is done for non cond executed.
>>>>>
>>>>> BTW, there is some strange names for functions like missed ending
>>>>> (prepare_edge_for_stm, w/o t in the end, this is because of my script
>>>>> to add ‘ ‘ before ‘(‘, will fix in the next patch version, ignore
>>>>> please so far).
>>>>>
>>>>> Get_block_for_predicated_stmt:
>>>>>
>>>>> Create bb where to add cond stmt to (or return existing, each loop has
>>>>> a map  (level_cond_data_map is map <loop, cond_data_map> ,
>>>>> cond_data_map is map <gimple, basic_block>) to identify if bb for
>>>>> given cond was created or not for given level).
>>>>>
>>>>> Parameters of this method:
>>>>>
>>>>> Cond – condition of stmt, branch – is it on true or false edge of
>>>>> cond, cond_map  - see above, level (tgt_level), preheader –
>>>>> preheader_edge for level.
>>>>>
>>>>> So first of all we check whether there is a basic block for given
>>>>> condition in given loop.
>>>>>
>>>>> a)      If such bb exists then we trying to find corresponding edge,
>>>>> destination of this edge shouldn’t be a post dominator of bb
>>>>> (find_branch_edge).
>>>>>
>>>>> a.1) if such edge exists we return the most remote in this branch bb
>>>>> (or create it), see get_most_remote_in_branch;
>>>>>
>>>>> a.2) if such edge doesn’t exists, we create corresponding bb, see make_branch
>>>>>
>>>>>        b)  If such bb doesn’t exist we create it (recursively calling
>>>>> get_block_for_predicated_stmt if required bb is conditionally
>>>>> executed).
>>>>>
>>>>> Will not describe it further so far (let me know if you would like to
>>>>> read additional details).
>>>>>
>>>>> Add_info_for_phi
>>>>>
>>>>> The main goal of this method is to identify situation where
>>>>> stmt(parameter,ie the statement which we are moving) or more exactly
>>>>> lhs of this statement
>>>>>
>>>>> a)      is still used in other loops, so we need to create phi node to
>>>>> use phi name inside other loops.
>>>>>
>>>>> b)      Is used in phi node of original loop and this phi node is
>>>>> going to be moved. BUT, as we move phi node as assignment (in the case
>>>>> of phi node with 2 arguments) we need to create phi node to use lhs of
>>>>> statement (param of add_info_for_phi).
>>>>>
>>>>> In these cases we make corresponding replacements and store
>>>>> information needed to create phi nodes and make some other
>>>>> replacements in 3.3 (replace_uses_in_conditions).
>>>>>
>>>>> Add_info_for_phi calls create_tmp_var which requires some explanation.
>>>>>
>>>>> Create_tmp_var
>>>>>
>>>>> To create new names for phi nodes and to create default def for phi
>>>>> arguments I use make_ssa_name and get_or_create_ssa_default_def. These
>>>>> methods have some asserts (being merged):   (TREE_CODE (old_var) ==
>>>>> VAR_DECL   || TREE_CODE (old_var) == PARM_DECL   || TREE_CODE
>>>>> (old_var) == RESULT_DECL).
>>>>>
>>>>> So in some cases (when I know that I need to use methods mentioned
>>>>> above, see the start of create_tmp_var, ie the uses in other loops) I
>>>>> create new tmp variable to overcome these asserts. To be honest I
>>>>> don’t like this but don’t know how to deal with this better.
>>>>
>>>> Hmm, don't you just run into SSA names here?  That is, TREE_CODE
>>>> (old_var) == SSA_NAME?
>>>> You can use make_ssa_name (TREE_TYPE (old_var), "plim") in this case.
>>>>
>>>> Or you run into the issue that anonymous SSA names cannot have default
>>>> defintions (I don't understand why you need all the fuss about default defs).
>>>
>>> I used default defs to write something in phi args when I have no values on
>>> corresponding branches. Say there is some tmp value which I should
>>> create phi for.
>>> if (a)
>>>    {
>>>      tmpvar=something
>>>      b = something(tmpvar);
>>>     }
>>> else
>>>    {
>>>    }
>>> When I move if (a) {tmpvar = something} and create phi for tmpvar (to
>>> use phi result
>>> in the initial cycle) i need to write something for 'not a' phi arg.
>>> BUT, in some cases I use
>>>  build_zero_cst (which I started to use after default defs use. I mean
>>> I didn't know about
>>>  build_zero_cst while I started to use default defs.). So may be I can replace
>>> uses of default defs in such cases by build_zero_cst? I mean just SOME values
>>> (as I know that control flow will never go to this places actually). I
>>> mean if !a tmpvar will not used.
>>
>> Ok, this is again about the case where you are not moving a PHI node but
>> creating it because you want to conditionally execute invariant stmts.
>> Yes, a default def is ok here, but you can always use a new VAR_DECL
>> just for the default def creation like gimple_fold_call does for example:
>>
>>                           tree var = create_tmp_var (TREE_TYPE (lhs));
>>                           tree def = get_or_create_ssa_default_def (cfun, var);
>>
>> Note that I think we don't need to bother about this case as we'd only
>> want to move PHI nodes in the first place (so you already have values
>> for all PHI args).
>>
>>>>
>>>>> And a couple of words regarding where we store info for phi nodes:
>>>>>
>>>>> Each loop contains in its aux  phi_data structure. On this stage we
>>>>> only add there stmt (if phi node will be required), phi name=phi node
>>>>> result name in names_map or fl_names_map). See the comment about
>>>>> phi_data in patch.
>>>>>
>>>>> If there should be created several phi nodes for given lhs (I mean the
>>>>> first place for phi node doesn’t dominate on corresponding preheader,
>>>>> we add only the last name in names_map (as intermediate names could be
>>>>> created later, but the last name in bb which dominates preheader
>>>>> should be used for replacement in other loops. Replacements were
>>>>> already made in walker).
>>>>>
>>>>> If lhs is used only in phi node, which are moved and transformed into
>>>>> assignment, and there is no uses in other loops, we need to create
>>>>> only first level phi node (don’t need to create phi nodes till some bb
>>>>> which dominates preheader), then we add such name to fl_names_map.
>>>>>
>>>>> 3.2 Create_phi_nodes
>>>>>
>>>>> For each of the loops we go over ((phi_data*) loop->aux)->stmts. These
>>>>> are statements which were moved, but they have uses in the original
>>>>> loop (more exactly, their uses in some other loops replaced by some
>>>>> name from ((phi_data*) loop->aux)->names_map or ((phi_data*)
>>>>> loop->aux)->fl_names_map.
>>>>>
>>>>> So for each of such stmts we create phi nodes (for its lhs) chain.
>>>>>
>>>>> Basic block for phi node is found in get_bb_for_phi  for given bb with
>>>>> stmt. If basic block for phi node dominates preheader->src we end
>>>>> chain creation. There is some special condition for the case when we
>>>>> need to create only first level phi node. (I described this situation
>>>>> above, but let me know If I need to add  more details.
>>>>>
>>>>> Basic blocks can have maps to identify if we created phi node for
>>>>> given variable in given basic_block, so we only add an edge argument
>>>>> for such case (phi_map = new hash_map<basic_block, hash_map<tree,
>>>>> gphi*>*>).
>>>>
>>>> Ok.
>>>>
>>>> Well - I think it might be easier code-generation-wise to have a separate
>>>> phase move_phis executed before move_computations, that just moves
>>>> invariant PHI nodes (which this all is about - see above) and their
>>>> controlling conditions and only then move their dependencies.
>>> I didn;t understand so far (I think let's clarify previous questions,especially
>>> about phi nodes then let's go back to this one.)
>>
>> Yeah, I think the PHI node thing is our major mis-understanding (or the
>> main thing that I think complicates the patch for too little gain - at least
>> initially).
>>
>> Thanks,
>> Richard.
>>
>>>>
>>>> Thanks,
>>>> Richard.
>>>>
>>>>> In short, that’s all.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Evgeniya
>>>>>
>>>>>
>>>>>
>>>>> On Sat, May 9, 2015 at 1:07 AM, Evgeniya Maenkova
>>>>> <evgeniya.maenkova@gmail.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Could you please review my patch for predicated lim?
>>>>>>
>>>>>> Let me note some details about it:
>>>>>>
>>>>>>
>>>>>>
>>>>>> 1)      Phi statements are still moved only if they have 1 or 2
>>>>>> arguments. However, phi statements could be move under conditions (as
>>>>>> it’s done for the other statements).  Probably, phi statement motion
>>>>>> with 3 + arguments could be implemented in the next patch after
>>>>>> predicated lim.
>>>>>>
>>>>>> 2)      Patch has limitations/features like (it was ok to me to
>>>>>> implement it such way, maybe I’m not correct. ):
>>>>>>
>>>>>> a)      Loop1
>>>>>>
>>>>>>     {
>>>>>>
>>>>>>               If (a)
>>>>>>
>>>>>>                  Loop2
>>>>>>
>>>>>>                      {
>>>>>>
>>>>>>                        Stmt - Invariant for Loop1
>>>>>>
>>>>>>                      }
>>>>>>
>>>>>>              }
>>>>>>
>>>>>>        In this case Stmt will be moved only out of Loop2, because of if (a).
>>>>>>
>>>>>> b)      Or
>>>>>>
>>>>>> Loop1
>>>>>>
>>>>>>     {
>>>>>>
>>>>>>          …
>>>>>>
>>>>>>          If (cond1)
>>>>>>
>>>>>>               If (cond2)
>>>>>>
>>>>>>                   If (cond3)
>>>>>>
>>>>>>                       Stmt;
>>>>>>
>>>>>>        }
>>>>>>
>>>>>> Stmt will be moved out only if cond1 is always executed in Loop1.
>>>>>>
>>>>>> c)       It took me a long time to write all of these code, so there
>>>>>> might be other peculiarities which I forgot to mention. :)
>>>>>>
>>>>>>                        Let’s discuss these ones as you will review my patch.
>>>>>>
>>>>>> 3)      Patch consists of 9 files:
>>>>>>
>>>>>> a)      gcc/testsuite/gcc.dg/tree-ssa/loop-7.c,
>>>>>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – changed tests:
>>>>>>
>>>>>> -          gcc/testsuite/gcc.dg/tree-ssa/loop-7.c  changed as
>>>>>> predicated lim moves 2 more statements out of the loop;
>>>>>>
>>>>>> -          gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – with conditional
>>>>>> lim recip optimization in this test doesn’t work (the corresponding
>>>>>> value is below threshold as I could see in the code for recip, 1<3).
>>>>>> So to have recip working in this test I changed test a little bit.
>>>>>>
>>>>>> b)      gcc/tree-ssa-loop-im.c – the patched lim per se
>>>>>>
>>>>>> c)       gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c,
>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c,
>>>>>>
>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c,
>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c,
>>>>>>
>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c,
>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>>>>>
>>>>>> the tests for conditional lim.
>>>>>>
>>>>>> 4)      Patch testing:
>>>>>>
>>>>>> a)      make –k check (no difference in results for me for the clean
>>>>>> build and the patched one,
>>>>>>
>>>>>> -          Revision: 222849,
>>>>>>
>>>>>> -          uname -a
>>>>>>
>>>>>>                Linux Istanbul 3.16.0-23-generic #31-Ubuntu SMP Tue Oct
>>>>>> 21 18:00:35 UTC 2014 i686 i686 i686 GNU/Linux
>>>>>>
>>>>>> b)      Bootstrap.
>>>>>>
>>>>>>  It goes well now, however to fix it I have made a temporary hack in
>>>>>> the lim code. And with this fix patch definitely shouldn’t be
>>>>>> committed.
>>>>>>
>>>>>> I did so, as I would like to discuss this issue first.
>>>>>>
>>>>>> The problem is: I got stage2-stage3 comparison failure on the single
>>>>>> file (tree-vect-data-refs.o). After some investigation I understood
>>>>>> that tree-vect-data-refs.o differs being compiled with and without
>>>>>> ‘-g’ option (yes, more exactly on stage 2 this is  ‘-g –O2 –gtoggle’,
>>>>>> and for stage 3 this is ‘-g –O2’. But to simplify things I can
>>>>>> reproduce this difference on the same build (even not bootstrapped),
>>>>>> with option ‘ –g’ and without it).
>>>>>>
>>>>>> Of course, the file compiled with –g option will contain debug
>>>>>> information and will differ from the corresponding file without debug
>>>>>> information. I mean there is the difference reported by
>>>>>> contrib/compare-debug (I mean we talk about stripped files).
>>>>>>
>>>>>> Then I compared assemblers and lim dump files and made a conclusion
>>>>>> the difference is:
>>>>>>
>>>>>> There is statement _484=something, which is conditionally moved out of
>>>>>> loop X. In non debug cases no usages of _484 are left inside the loop
>>>>>> X. In debug case, there is the single use in debug statement. So for
>>>>>> debug case my patch generates phi statement for _484 to make it
>>>>>> available inside the loop X. For non debug case, no such phi statement
>>>>>> generated as there is no uses of _484.
>>>>>>
>>>>>> There is one more statement with lhs=_493, with exactly this pattern
>>>>>> of use. But this is not important for our discussion.
>>>>>>
>>>>>>
>>>>>>
>>>>>> So, in my opinion it’s ok to generate additional phi node for debug
>>>>>> case. But I’m not a compiler expert and maybe there is some
>>>>>> requirement that debug and non-debug versions should differ only by
>>>>>> debug statements, I don’t know.
>>>>>>
>>>>>>
>>>>>>
>>>>>> My temporary hack fix is just removing of such debug statements (no
>>>>>> debug statements, no problems J).
>>>>>>
>>>>>>
>>>>>>
>>>>>> The alternatives I see are:
>>>>>>
>>>>>> -          Move debug statements outside the loop (not good in my opinion);
>>>>>>
>>>>>> -          Somehow reset values in debug statements (not good in my
>>>>>> opinion, if I could provide correct information for them);
>>>>>>
>>>>>> -          Generate phi for debug statements and fix bootstrap (say,
>>>>>> let’s have some list of files, which we have special check for. I mean
>>>>>> for my case, the difference is not in stage2 and stage3, it’s in debug
>>>>>> and non-debug). I like this variant. However, as I don’t see such list
>>>>>> of special files in the whole gcc, I think I might be missing
>>>>>> something.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>>
>>>>>> Evgeniya
>
>
>
> --
> Thanks,
>
> Evgeniya
>
> perfstories.wordpress.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: conditional lim
  2015-07-14 10:54           ` Richard Biener
@ 2015-07-15  8:43             ` Evgeniya Maenkova
  2015-07-15 10:58               ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Evgeniya Maenkova @ 2015-07-15  8:43 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches

On Tue, Jul 14, 2015 at 2:54 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Mon, Jun 29, 2015 at 4:21 PM, Evgeniya Maenkova
> <evgeniya.maenkova@gmail.com> wrote:
>> On Mon, Jun 29, 2015 at 5:10 PM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Tue, Jun 9, 2015 at 10:11 PM, Evgeniya Maenkova
>>> <evgeniya.maenkova@gmail.com> wrote:
>>>> On Tue, Jun 9, 2015 at 3:46 PM, Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>> On Fri, May 29, 2015 at 3:14 PM, Evgeniya Maenkova
>>>>> <evgeniya.maenkova@gmail.com> wrote:
>>>>>> Hi Richard,
>>>>>>
>>>>>> Here is some explanation. I hope you let me know if I need to clarify something.
>>>>>>
>>>>>> Also, you asked me about concrete example, to make sure you don’t miss
>>>>>> my answer here is the link:
>>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02417.html.
>>>>>>
>>>>>> Also, I doubt whether it’s convenient for you to create a build with
>>>>>> my patch or not. May be to clarify things you could send me some
>>>>>> examples/concrete cases, then I’ll compile them with
>>>>>> –fdump-tree-loopinit-details and –fdump-tree-lim-details and send you
>>>>>> these dumps. May be these dumps will be useful. (I’ll only disable
>>>>>> cleanup_cfg TODO after lim to let you know the exact picture after
>>>>>> lim).
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> 1.       invariantness _dom_walker –
>>>>>>
>>>>>> 1.1   for each GIMPLE_COND in given bb calls handle_cond_stmt to call
>>>>>> for true and false edges handle_branch_edge, which calls SET_TARGET_OF
>>>>>> for all bb ‘predicated’ by given GIMPLE_COND.
>>>>>>
>>>>>> SET_TARGET_OF sets in basic_blocks aux 2 facts:
>>>>>>
>>>>>> a)      this is true or false edge;
>>>>>>
>>>>>> b)      link to cond stmt;
>>>>>>
>>>>>> Handle_branch_edge works this way:
>>>>>>
>>>>>> If (cond1)
>>>>>>
>>>>>>   {
>>>>>>
>>>>>>      bb1;
>>>>>>
>>>>>>      if (cond2}
>>>>>>
>>>>>>        {
>>>>>>
>>>>>>            bb2;
>>>>>>
>>>>>>         }
>>>>>>
>>>>>>    Being called for cond1, it sets cond1 as condition for both bb1 and
>>>>>> bb2 (the whole branch for cond1, ie also for bb containing cond2),
>>>>>> then this method will be called (as there is dominance order) for
>>>>>> cond2 to correct things (ie to set cond2 as condition for bb2).
>>>>>
>>>>> Hmm, why not track the current condition as state during the DOM walk
>>>>> and thus avoid processing more than one basic-block in handle_branch_edge?
>>>>> Thus get rid of handle_branch_edge and instead do everything in handle_cond_stmt
>>>>> plus the dom-walkers BB visitor?
>>>>>
>>>> I need to look more carefully how to implement it, but I think I
>>>> understand what you mean and this optimization of course looks
>>>> reasonable to me. Will do.
>>>>
>>>>> I see you don't handle BBs with multiple predecessors - that's ok, but
>>>>> are you sure you don't run into correctness issues when not marking such
>>>>> BBs as predicated?  This misses handling of, say
>>>>>
>>>>>  if (a || b)
>>>>>    bb;
>>>>>
>>>>> which is a pity (but can be fixed later if desired).
>>>>>
>>>> I had some test (in gcc testsuite or bootstrap build) which worked
>>>> incorrectly because of multiple predecessors. As far as I remember the
>>>> situation was (further, will make some notes about such tests to
>>>> clarify this better), I mean with previous version of my code which
>>>> handled bb with 2 predecessors:
>>>> if (a)
>>>>   tmpvar=something;
>>>> while()
>>>>   if (a || b)
>>>>       basic_block {do something with tmpvar;} // I mean basic block
>>>> predicated by bb with a and bb with b
>>>>
>>>> So, if a is false, I mean we didn't do tmpvar=something (outside
>>>> loop), BUT we are in basick_block  (we went by bb with b), we got
>>>> Segmentation falt in basic_block {do something with tmpvar;}.
>>>>
>>>> I think we can reproduce all the details of this test if I remove not
>>>> handling bb with 2 predecessors.
>>>>
>>>> So I wouldn't move bb with 2 predecessors (this is not always executed
>>>> bb in any loop, not conditionally, they will not be moved at all).
>>>>
>>>> This is my more detail explanation on this point. Perhaps, I didn't
>>>> understand your question about correctness. Could you repeat it in
>>>> other words (based on this new clarification).
>>>>
>>>> So I think according to current code it will not be moved. What
>>>> incorrectness do you mean?
>>>
>>> If the block isn't marked as predicated the question is whether it is
>>> handled correctly or assumed to be unconditionally executed.
>>>
>>>>> I note that collecting predicates has similarities to what if-conversion
>>>>> does in tree-ifcvt.c (even if its implementation is completely different,
>>>>> of course).
>>>>>
>>>>
>>>> Ok, I'll look at this. But could you please clarify your point?
>>>> (Should I just take this into account with low priority and look at
>>>> this later or you want some refactoring?)
>>>
>>> I just noted similar code exists elsewhere - it may be possible to
>>> factor it out but I didn't investigate.  And no, doing that isn't a prerequesite
>>> for this patch.
>>>
>>>>>> 1.2   As 1.1 goes we identify whether some bb is predicated by some
>>>>>> condition or not.
>>>>>>
>>>>>> bb->aux->type will be [TRUE/FALSE]_TARGET_OF and
>>>>>> bb->aux->cond_stmt=cond stmt (the nearest condition).
>>>>>>
>>>>>> If bb is always executed bb->aux->type = ALWAYS_EXECUTED_IN,
>>>>>> bb->loop->loop (this info was available in the clean build).
>>>>>>
>>>>>> 1.3   As this walker is called in dominance order, information about
>>>>>> condition is available when invariantness_dom_walker is called for
>>>>>> given bb.  So we can make some computations based on bb->aux
>>>>>> structure. This is done in check_conditionally_executed. The main goal
>>>>>> of this method is to check that the root condition is always executed
>>>>>> in the loop. I did so to avoid situation like this
>>>>>>
>>>>>> Loop:
>>>>>>
>>>>>>    Jmp somewhere;
>>>>>>
>>>>>>   If (cond1)
>>>>>>
>>>>>>       If (cond2)
>>>>>>
>>>>>>          Etc
>>>>>>
>>>>>> By ‘root condition’ I mean cond1 in this case (the first cond in
>>>>>> dominance order).
>>>>>
>>>>> Ok, but this can't happen (an uncontional jump to somehwere).  It
>>>>> will always be a conditional jump (a loop exit) and thus should
>>>>> be handled already.
>>>>>
>>>>> Did you have a testcase that was mishandled?
>>>>
>>>> No, I have not such testcase. This is because I;m newbie at gcc and
>>>> compilers. If you tell me this fact, let's just remove this check?
>>>> Correct?
>>>
>>> Yes.
>>>
>>>>>
>>>>>> If  ‘root condition’ is not always executed we disable (free) all the
>>>>>> info in bb->aux, ie all the blocks becomes neither conditionally nor
>>>>>> always executed according to bb->aux.
>>>>>>
>>>>>> There is some optimization to don’t go each time trough the conditions
>>>>>> chain (cond2->cond1), let me skip such details for now.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 1.4   Then we calculate tgt_level  (which is used in move_computations later)
>>>>>>
>>>>>> The patch is the same for phi and regular stmt (calculate_tgt_level)
>>>>>>
>>>>>> 1)      If stmt is conditionally executed we compute max movement
>>>>>> (determine_max_movement) starting from
>>>>>> get_lim_data(condition)->max_loop.
>>>>>>
>>>>>> 2)      If stmt is not cond executed as start level  for
>>>>>> determine_max_movement computations we choose ALWAYS_EXECUTED_IN.
>>>>>>
>>>>>> To clarify why, see the comment
>>>>>>
>>>>>> /* The reason why we don't set start_level for MOVE_POSSIBLE
>>>>>>
>>>>>>           statements to more outer level is: this statement could depend on
>>>>>>
>>>>>>           conditional statements and even probably is not defined without this
>>>>>>
>>>>>>           condition. So either we should analyze these ones and move
>>>>>>
>>>>>>           under condition somehow or keep more strong start_level . */
>>>>>>
>>>>>> As you noted in your review there was some refactoring. Of course, I
>>>>>> had no goal to refactor existing code, I intended to remove some
>>>>>> duplication which I caused by my changes.  I hope we discuss in
>>>>>> details later if you don’t like some refactoring.
>>>>>
>>>>> Refactoring makes a big patch harder to review because it ends up
>>>>> containing no-op changes.  It also makes it appear bigger than it
>>>>> really is ;)
>>>>>
>>>>> A reasonable solution is to factor out the refactoring to a separate patch.
>>>>>
>>>> I see what you mean.
>>>>
>>>>>> 2.       store_motion  - for some stmts set flags to ignore predicated
>>>>>> execution (I mean to move statements without conditions).
>>>>>
>>>>> That's for the existing "predicated" support as far as I can see.
>>>>>
>>>>>>
>>>>>>
>>>>>> 3.       move_computations. The patch is doing something in the
>>>>>> following 3 calls inside of this method.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 3.1   (more details below) walker.walk – move computations(create if()
>>>>>> structure) and store information to create phi nodes later (which
>>>>>> statements require phi nodes, as they conditionally executed and there
>>>>>> are still their uses on other levels, what are the names for such phi
>>>>>> nodes, as we replace uses in here, in walker.walk.)
>>>>>
>>>>> What kind of uses are that?  By definition all stmts executed
>>>>> under condition A that are used in code not under condition A have
>>>>> a PHI node associated.  This is why the existing conditional movement
>>>>> handling moves PHI nodes (and all dependent computations).
>>>>>
>>>>> Are you thinking of
>>>>>
>>>>>   for (;;)
>>>>>     if (a) x = 1;
>>>>>
>>>>>   .. use of x
>>>>>
>>>>> ?
>>>> No, I think in this case phi node of x will be used in initial
>>>> (==loopinit) code (in 'use of x') instead of some version of x (which
>>>> I move).
>>>>
>>>>>If you move the PHI node for x then you don't need to insert any other
>>>>> PHI nodes (and you _do_ have to move the PHI node anyway).
>>>>
>>>>>
>>>>>> 3.2   (more details below) create_phi_nodes – create phi nodes for
>>>>>> statements which require that (see 3.1)
>>>>>
>>>>> Only PHI nodes need PHI node creation and only its dependences
>>>>> might need predication.
>>>>>
>>>>> That is, if you move a conditionally executed statement but not
>>>>> any PHI node it feeds then the movement is useless.
>>>>>
>>>>> But maybe I am missing something about the PHI business.
>>>>>
>>>> Ok. I have different understanding on this (and am afraid of this of
>>>> course :) ).
>>>>
>>>> See the example:
>>>> void bar (long);
>>>> void foo (int cond1, int cond2, int cond3, int m, int n)
>>>> {
>>>>   unsigned x;
>>>>   unsigned inv1;
>>>>
>>>>   for (x = 0; x < 1000; ++x)
>>>>     {
>>>>       if (cond1)
>>>>         {
>>>>           int tmp_inv1 = m*n;
>>>>           non_inv1 = tmp_inv1 + x;
>>>>         }
>>>>     }
>>>>   bar (non_inv1);
>>>> }
>>>>
>>>> This is working example, will attach loopinit and lim1 dumps.
>>>> There is no phi node for tmp_inv1 in loopinitat all. However, I think
>>>> we should move tmp_inv1. To use tmp_inv1 in non_inv1 we need phi node
>>>> (otherwise we get verification failure as bb with conditionally moved
>>>> tmp_inv1 doesn;t dominate bb with non_inv1). Another example if
>>>> non_inv1 is invariant but has not enough cost or something else. In
>>>> short, if it's still used in initial or others loops. Say, non_inv1 is
>>>> invariant and moved in another loop. Or something else.
>>>>
>>>> So, when I read
>>>> 'Only PHI nodes need PHI node creation'
>>>> and
>>>> 'That is, if you move a conditionally executed statement but not any
>>>> PHI node it feeds then the movement is useless.'
>>>> I think about this example and don't understand. Could you clarify?
>>>
>>> Ok, so this case is already handled by LIM by simply moving tmp_inv1 = m*n
>>> (and making it unconditionally execute before the loop).  In fact whether
>>> we can move it does not depend on whether 'cond' is invariant or not.
>>>
>>> If we couldn't execute m*n unconditionally on the loop preheader edge
>>> then we'd have to be able to also move the guarding condition.  Note
>>> that cost considerations shouldn't come into play here.
>>>
>>> So I was only thinking of the case where we also want to move the
>>> condition (which is wanted if we want to move a PHI node).
>>>
>>
>> Don't understand "So I was only thinking of the case where we also
>> want to move the condition (which is wanted if we want to move a PHI
>> node)."
>>
>> What about the case when we want to move condition but don't have phi
>> node to move.
>>
>> Let's consider a little bit modified example:
>>
>> void bar (long);
>> void foo (int cond1, int cond2, int cond3, int m, int n)
>>   {
>>      unsigned x;
>>      unsigned inv1;
>>
>>      for (x = 0; x < 1000; ++x)
>>        {
>>          if (n)
>>            {
>>              int tmp_inv1 = m/n;
>>              non_inv1 = tmp_inv1 + x;
>>             }
>>        }
>>      bar (non_inv1);
>>  }
>> We couldn't  move tmp_inv1 uncoditionally. Right (I have no clean
>> build at hands right now to check. If I'm wrong let's replace m/n to
>> something that couldn't be unconditionally moved :))? There is no phi
>> node for tmp_inv1.
>> Could you clarify what conditional lim should do in your opinion here?
>
> Nothing at this point.  If only then for the simple reason to make the
> first iteration of the patch simpler (still covering most cases).
>
> You seem to want to catch 100% of all possible cases.
>
>> Is it too little gain to move statements that only used in the same bb
>> or branch (I mean before phi node creation, even if stmt had phi
>> node)?
>>
>> In fact, we could have here 2 thousands of divisions:
>> if (n)
>>   {
>>     int tmp_inv1 = m/n;
>>     tmp_inv2 = something_which_couldn't_be moved_unconditionally (tmp_inv1);
>>     ...
>>     tmp_invN = something_which_couldn't_be moved_unconditionally (tmp_invN-1);
>>     non_inv1 = tmp_invN + x;
>>   }
>
> Of course we could.  But the patch didn't even exercise this case (no testcase).

Patch handles this case (this is why I created phi nodes which we
tried to discuss).
Your comment is about test case not about the patch.

>
> Please make patches simple - support for moving non-PHIs can be added
> as a followup.  The first important part is to make the current code
> (moving PHIs)
> create control-flow rather than if-converted form.
>
> Richard.
>
>>
>>>>>>
>>>>>> 3.3   replace_uses_in_conditions
>>>>>>
>>>>>> After computations movements we can have several copies of the cond
>>>>>> stmt. In 3.1 we do replacement of uses based on stmt’s tgt_level. For,
>>>>>> cond stmt it doesn’t work. As cond stmt, of course, have something in
>>>>>> lim_aux_data->tgt_level, but, in fact, they are moved only if
>>>>>> corresponding invariants are moved. So, in fact, the same cond (copies
>>>>>> of it, of course) could go to the different levels. So to fix these
>>>>>> uses, we call replace_uses_in_conditions.
>>>>>
>>>>> Hmm, but stmts that a condition depends on always have at least the
>>>>> target level of the condition statement.  Thus their computation is
>>>>> available in all other "same" conditon statements that might appear
>>>>> in more inner loop levels, no?  So isn't this just about performing the
>>>>> movement in a proper ordering?
>>>>
>>>> Let me give your more details, then may be you repeat question in other words.
>>>> for ()
>>>> {
>>>>   if (a)
>>>>     {
>>>>        inv1 = some_computations;
>>>>        if (inv1 < inv2)
>>>>             {
>>>>               inv3;
>>>>               inv4;
>>>>             }
>>>>        }
>>>> }
>>>>
>>>> So, the order for movement is: inv1; inv3; inv4. When we move inv1 we
>>>> have tgt_level for inv3 and inv4 computed but we have not created
>>>> copies of 'if (inv1 < inv2)' created. In theory, we can have several
>>>> copies:(1) for tgt_level of inv3 (2) for tgt_level of inv2 (3) in
>>>> initial cycle. if (1) is in tgt level of inv1 we need no replacement,
>>>> in other cases we need replacements. Of course, we know this(all the
>>>> info about tgt_levels) when we move inv1, but then we need to create
>>>> all needed copies (store it somehow, etc) of 'if (inv1 < inv2') and
>>>> make replacement in these copies. I don't see now why this is much
>>>> better. This doesn't mean that current solution is perfect, just
>>>> clarify your thought.
>>>
>>> I still don't get what "replacements" actually mean.  Let's extend the
>>> example to
>>>
>>> for ()  (A)
>>>   {
>>>> for ()  (B)
>>>> {
>>>>   if (a)
>>>>     {
>>>>        inv1 = some_computations;
>>>>        if (inv1 < inv2)
>>>>             {
>>>>               inv3;
>>>>               inv4;
>>>>             }
>>>>        }
>>>> }
>>>  }
>>>
>>> then you say the target level of inv1 could be 'A' and the target
>>> level of inv3 'B'
>>> and that of inv4 is 'B' as well.
>>>
>>> That's true.  So we move inv1 to before 'A' and inv3/inv4 to before 'B' guarded
>>> with if (inv1 < inv2).  I fail to see what "copies" we need here.  'inv1' is
>>> available in all target levels below its own, no copies needed.
>>>
>>
>> I meant the copies of conditions (COND_STMT: inv1 < inv2).
>>
>> Let's modify this example: tgt_level (inv1) = A; tgt_level(inv3) = A;
>> tgt_level(inv4) = B.
>>
>>  Then
>> if (a)
>>    inv1=some_computations;
>> if (inv1 < inv2)
>>    inv3;
>> <preheader edge for A, where phi for inv1(phi_for_inv1) is computed>
>> for ()  (A)
>>  {
>>  if (phi_for_inv1 <inv2)
>>    inv3
>>  for ()  (B)
>>    {
>>    if (a)
>>      {
>>         if (phi_for_inv1 < inv2)
>>           {
>>           }
>>      }
>>  }
>> So, we have 3 copies of inv1 < inv2 after lim (in some of them we use
>> inv1, in other phi_for_inv1). But this is the second priority for now
>> (this is all about whether we should move statements w/o phi node).
>>
>>>>>
>>>>>> More details on 3.1 and 3.2
>>>>>>
>>>>>> 3.1 The patch is very similar for phi stmts and regular stmts (there
>>>>>> is some difference for regular stmts related to the case when we move
>>>>>> sequence instead of single stmt, let’s skip it in this description,
>>>>>> will describe later. BTW, there are comments, in corresponding parts).
>>>>>>
>>>>>> a) for each bb we call check_conditionally_executed and if bb is cond
>>>>>> executed then invariant statement  will be moved conditionally.
>>>>>>
>>>>>> b) prepare_edge_for_stmt collects information needed to create phi
>>>>>> nodes in 3.2(add_info_for_phi) AND prepares edge ie creates if()
>>>>>> structure needed to move statement
>>>>>> (get_block_for_predicated_statement, see below). All of this is done
>>>>>> for cond executed stmt. Nothing is done for non cond executed.
>>>>>>
>>>>>> BTW, there is some strange names for functions like missed ending
>>>>>> (prepare_edge_for_stm, w/o t in the end, this is because of my script
>>>>>> to add ‘ ‘ before ‘(‘, will fix in the next patch version, ignore
>>>>>> please so far).
>>>>>>
>>>>>> Get_block_for_predicated_stmt:
>>>>>>
>>>>>> Create bb where to add cond stmt to (or return existing, each loop has
>>>>>> a map  (level_cond_data_map is map <loop, cond_data_map> ,
>>>>>> cond_data_map is map <gimple, basic_block>) to identify if bb for
>>>>>> given cond was created or not for given level).
>>>>>>
>>>>>> Parameters of this method:
>>>>>>
>>>>>> Cond – condition of stmt, branch – is it on true or false edge of
>>>>>> cond, cond_map  - see above, level (tgt_level), preheader –
>>>>>> preheader_edge for level.
>>>>>>
>>>>>> So first of all we check whether there is a basic block for given
>>>>>> condition in given loop.
>>>>>>
>>>>>> a)      If such bb exists then we trying to find corresponding edge,
>>>>>> destination of this edge shouldn’t be a post dominator of bb
>>>>>> (find_branch_edge).
>>>>>>
>>>>>> a.1) if such edge exists we return the most remote in this branch bb
>>>>>> (or create it), see get_most_remote_in_branch;
>>>>>>
>>>>>> a.2) if such edge doesn’t exists, we create corresponding bb, see make_branch
>>>>>>
>>>>>>        b)  If such bb doesn’t exist we create it (recursively calling
>>>>>> get_block_for_predicated_stmt if required bb is conditionally
>>>>>> executed).
>>>>>>
>>>>>> Will not describe it further so far (let me know if you would like to
>>>>>> read additional details).
>>>>>>
>>>>>> Add_info_for_phi
>>>>>>
>>>>>> The main goal of this method is to identify situation where
>>>>>> stmt(parameter,ie the statement which we are moving) or more exactly
>>>>>> lhs of this statement
>>>>>>
>>>>>> a)      is still used in other loops, so we need to create phi node to
>>>>>> use phi name inside other loops.
>>>>>>
>>>>>> b)      Is used in phi node of original loop and this phi node is
>>>>>> going to be moved. BUT, as we move phi node as assignment (in the case
>>>>>> of phi node with 2 arguments) we need to create phi node to use lhs of
>>>>>> statement (param of add_info_for_phi).
>>>>>>
>>>>>> In these cases we make corresponding replacements and store
>>>>>> information needed to create phi nodes and make some other
>>>>>> replacements in 3.3 (replace_uses_in_conditions).
>>>>>>
>>>>>> Add_info_for_phi calls create_tmp_var which requires some explanation.
>>>>>>
>>>>>> Create_tmp_var
>>>>>>
>>>>>> To create new names for phi nodes and to create default def for phi
>>>>>> arguments I use make_ssa_name and get_or_create_ssa_default_def. These
>>>>>> methods have some asserts (being merged):   (TREE_CODE (old_var) ==
>>>>>> VAR_DECL   || TREE_CODE (old_var) == PARM_DECL   || TREE_CODE
>>>>>> (old_var) == RESULT_DECL).
>>>>>>
>>>>>> So in some cases (when I know that I need to use methods mentioned
>>>>>> above, see the start of create_tmp_var, ie the uses in other loops) I
>>>>>> create new tmp variable to overcome these asserts. To be honest I
>>>>>> don’t like this but don’t know how to deal with this better.
>>>>>
>>>>> Hmm, don't you just run into SSA names here?  That is, TREE_CODE
>>>>> (old_var) == SSA_NAME?
>>>>> You can use make_ssa_name (TREE_TYPE (old_var), "plim") in this case.
>>>>>
>>>>> Or you run into the issue that anonymous SSA names cannot have default
>>>>> defintions (I don't understand why you need all the fuss about default defs).
>>>>
>>>> I used default defs to write something in phi args when I have no values on
>>>> corresponding branches. Say there is some tmp value which I should
>>>> create phi for.
>>>> if (a)
>>>>    {
>>>>      tmpvar=something
>>>>      b = something(tmpvar);
>>>>     }
>>>> else
>>>>    {
>>>>    }
>>>> When I move if (a) {tmpvar = something} and create phi for tmpvar (to
>>>> use phi result
>>>> in the initial cycle) i need to write something for 'not a' phi arg.
>>>> BUT, in some cases I use
>>>>  build_zero_cst (which I started to use after default defs use. I mean
>>>> I didn't know about
>>>>  build_zero_cst while I started to use default defs.). So may be I can replace
>>>> uses of default defs in such cases by build_zero_cst? I mean just SOME values
>>>> (as I know that control flow will never go to this places actually). I
>>>> mean if !a tmpvar will not used.
>>>
>>> Ok, this is again about the case where you are not moving a PHI node but
>>> creating it because you want to conditionally execute invariant stmts.
>>> Yes, a default def is ok here, but you can always use a new VAR_DECL
>>> just for the default def creation like gimple_fold_call does for example:
>>>
>>>                           tree var = create_tmp_var (TREE_TYPE (lhs));
>>>                           tree def = get_or_create_ssa_default_def (cfun, var);
>>>
>>> Note that I think we don't need to bother about this case as we'd only
>>> want to move PHI nodes in the first place (so you already have values
>>> for all PHI args).
>>>
>>>>>
>>>>>> And a couple of words regarding where we store info for phi nodes:
>>>>>>
>>>>>> Each loop contains in its aux  phi_data structure. On this stage we
>>>>>> only add there stmt (if phi node will be required), phi name=phi node
>>>>>> result name in names_map or fl_names_map). See the comment about
>>>>>> phi_data in patch.
>>>>>>
>>>>>> If there should be created several phi nodes for given lhs (I mean the
>>>>>> first place for phi node doesn’t dominate on corresponding preheader,
>>>>>> we add only the last name in names_map (as intermediate names could be
>>>>>> created later, but the last name in bb which dominates preheader
>>>>>> should be used for replacement in other loops. Replacements were
>>>>>> already made in walker).
>>>>>>
>>>>>> If lhs is used only in phi node, which are moved and transformed into
>>>>>> assignment, and there is no uses in other loops, we need to create
>>>>>> only first level phi node (don’t need to create phi nodes till some bb
>>>>>> which dominates preheader), then we add such name to fl_names_map.
>>>>>>
>>>>>> 3.2 Create_phi_nodes
>>>>>>
>>>>>> For each of the loops we go over ((phi_data*) loop->aux)->stmts. These
>>>>>> are statements which were moved, but they have uses in the original
>>>>>> loop (more exactly, their uses in some other loops replaced by some
>>>>>> name from ((phi_data*) loop->aux)->names_map or ((phi_data*)
>>>>>> loop->aux)->fl_names_map.
>>>>>>
>>>>>> So for each of such stmts we create phi nodes (for its lhs) chain.
>>>>>>
>>>>>> Basic block for phi node is found in get_bb_for_phi  for given bb with
>>>>>> stmt. If basic block for phi node dominates preheader->src we end
>>>>>> chain creation. There is some special condition for the case when we
>>>>>> need to create only first level phi node. (I described this situation
>>>>>> above, but let me know If I need to add  more details.
>>>>>>
>>>>>> Basic blocks can have maps to identify if we created phi node for
>>>>>> given variable in given basic_block, so we only add an edge argument
>>>>>> for such case (phi_map = new hash_map<basic_block, hash_map<tree,
>>>>>> gphi*>*>).
>>>>>
>>>>> Ok.
>>>>>
>>>>> Well - I think it might be easier code-generation-wise to have a separate
>>>>> phase move_phis executed before move_computations, that just moves
>>>>> invariant PHI nodes (which this all is about - see above) and their
>>>>> controlling conditions and only then move their dependencies.
>>>> I didn;t understand so far (I think let's clarify previous questions,especially
>>>> about phi nodes then let's go back to this one.)
>>>
>>> Yeah, I think the PHI node thing is our major mis-understanding (or the
>>> main thing that I think complicates the patch for too little gain - at least
>>> initially).
>>>
>>> Thanks,
>>> Richard.
>>>
>>>>>
>>>>> Thanks,
>>>>> Richard.
>>>>>
>>>>>> In short, that’s all.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Evgeniya
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, May 9, 2015 at 1:07 AM, Evgeniya Maenkova
>>>>>> <evgeniya.maenkova@gmail.com> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Could you please review my patch for predicated lim?
>>>>>>>
>>>>>>> Let me note some details about it:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 1)      Phi statements are still moved only if they have 1 or 2
>>>>>>> arguments. However, phi statements could be move under conditions (as
>>>>>>> it’s done for the other statements).  Probably, phi statement motion
>>>>>>> with 3 + arguments could be implemented in the next patch after
>>>>>>> predicated lim.
>>>>>>>
>>>>>>> 2)      Patch has limitations/features like (it was ok to me to
>>>>>>> implement it such way, maybe I’m not correct. ):
>>>>>>>
>>>>>>> a)      Loop1
>>>>>>>
>>>>>>>     {
>>>>>>>
>>>>>>>               If (a)
>>>>>>>
>>>>>>>                  Loop2
>>>>>>>
>>>>>>>                      {
>>>>>>>
>>>>>>>                        Stmt - Invariant for Loop1
>>>>>>>
>>>>>>>                      }
>>>>>>>
>>>>>>>              }
>>>>>>>
>>>>>>>        In this case Stmt will be moved only out of Loop2, because of if (a).
>>>>>>>
>>>>>>> b)      Or
>>>>>>>
>>>>>>> Loop1
>>>>>>>
>>>>>>>     {
>>>>>>>
>>>>>>>          …
>>>>>>>
>>>>>>>          If (cond1)
>>>>>>>
>>>>>>>               If (cond2)
>>>>>>>
>>>>>>>                   If (cond3)
>>>>>>>
>>>>>>>                       Stmt;
>>>>>>>
>>>>>>>        }
>>>>>>>
>>>>>>> Stmt will be moved out only if cond1 is always executed in Loop1.
>>>>>>>
>>>>>>> c)       It took me a long time to write all of these code, so there
>>>>>>> might be other peculiarities which I forgot to mention. :)
>>>>>>>
>>>>>>>                        Let’s discuss these ones as you will review my patch.
>>>>>>>
>>>>>>> 3)      Patch consists of 9 files:
>>>>>>>
>>>>>>> a)      gcc/testsuite/gcc.dg/tree-ssa/loop-7.c,
>>>>>>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – changed tests:
>>>>>>>
>>>>>>> -          gcc/testsuite/gcc.dg/tree-ssa/loop-7.c  changed as
>>>>>>> predicated lim moves 2 more statements out of the loop;
>>>>>>>
>>>>>>> -          gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – with conditional
>>>>>>> lim recip optimization in this test doesn’t work (the corresponding
>>>>>>> value is below threshold as I could see in the code for recip, 1<3).
>>>>>>> So to have recip working in this test I changed test a little bit.
>>>>>>>
>>>>>>> b)      gcc/tree-ssa-loop-im.c – the patched lim per se
>>>>>>>
>>>>>>> c)       gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c,
>>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c,
>>>>>>>
>>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c,
>>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c,
>>>>>>>
>>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c,
>>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>>>>>>
>>>>>>> the tests for conditional lim.
>>>>>>>
>>>>>>> 4)      Patch testing:
>>>>>>>
>>>>>>> a)      make –k check (no difference in results for me for the clean
>>>>>>> build and the patched one,
>>>>>>>
>>>>>>> -          Revision: 222849,
>>>>>>>
>>>>>>> -          uname -a
>>>>>>>
>>>>>>>                Linux Istanbul 3.16.0-23-generic #31-Ubuntu SMP Tue Oct
>>>>>>> 21 18:00:35 UTC 2014 i686 i686 i686 GNU/Linux
>>>>>>>
>>>>>>> b)      Bootstrap.
>>>>>>>
>>>>>>>  It goes well now, however to fix it I have made a temporary hack in
>>>>>>> the lim code. And with this fix patch definitely shouldn’t be
>>>>>>> committed.
>>>>>>>
>>>>>>> I did so, as I would like to discuss this issue first.
>>>>>>>
>>>>>>> The problem is: I got stage2-stage3 comparison failure on the single
>>>>>>> file (tree-vect-data-refs.o). After some investigation I understood
>>>>>>> that tree-vect-data-refs.o differs being compiled with and without
>>>>>>> ‘-g’ option (yes, more exactly on stage 2 this is  ‘-g –O2 –gtoggle’,
>>>>>>> and for stage 3 this is ‘-g –O2’. But to simplify things I can
>>>>>>> reproduce this difference on the same build (even not bootstrapped),
>>>>>>> with option ‘ –g’ and without it).
>>>>>>>
>>>>>>> Of course, the file compiled with –g option will contain debug
>>>>>>> information and will differ from the corresponding file without debug
>>>>>>> information. I mean there is the difference reported by
>>>>>>> contrib/compare-debug (I mean we talk about stripped files).
>>>>>>>
>>>>>>> Then I compared assemblers and lim dump files and made a conclusion
>>>>>>> the difference is:
>>>>>>>
>>>>>>> There is statement _484=something, which is conditionally moved out of
>>>>>>> loop X. In non debug cases no usages of _484 are left inside the loop
>>>>>>> X. In debug case, there is the single use in debug statement. So for
>>>>>>> debug case my patch generates phi statement for _484 to make it
>>>>>>> available inside the loop X. For non debug case, no such phi statement
>>>>>>> generated as there is no uses of _484.
>>>>>>>
>>>>>>> There is one more statement with lhs=_493, with exactly this pattern
>>>>>>> of use. But this is not important for our discussion.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> So, in my opinion it’s ok to generate additional phi node for debug
>>>>>>> case. But I’m not a compiler expert and maybe there is some
>>>>>>> requirement that debug and non-debug versions should differ only by
>>>>>>> debug statements, I don’t know.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> My temporary hack fix is just removing of such debug statements (no
>>>>>>> debug statements, no problems J).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The alternatives I see are:
>>>>>>>
>>>>>>> -          Move debug statements outside the loop (not good in my opinion);
>>>>>>>
>>>>>>> -          Somehow reset values in debug statements (not good in my
>>>>>>> opinion, if I could provide correct information for them);
>>>>>>>
>>>>>>> -          Generate phi for debug statements and fix bootstrap (say,
>>>>>>> let’s have some list of files, which we have special check for. I mean
>>>>>>> for my case, the difference is not in stage2 and stage3, it’s in debug
>>>>>>> and non-debug). I like this variant. However, as I don’t see such list
>>>>>>> of special files in the whole gcc, I think I might be missing
>>>>>>> something.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Evgeniya
>>
>>
>>
>> --
>> Thanks,
>>
>> Evgeniya
>>
>> perfstories.wordpress.com



-- 
Thanks,

Evgeniya

perfstories.wordpress.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: conditional lim
  2015-07-15  8:43             ` Evgeniya Maenkova
@ 2015-07-15 10:58               ` Richard Biener
  0 siblings, 0 replies; 14+ messages in thread
From: Richard Biener @ 2015-07-15 10:58 UTC (permalink / raw)
  To: Evgeniya Maenkova; +Cc: GCC Patches

On Wed, Jul 15, 2015 at 10:36 AM, Evgeniya Maenkova
<evgeniya.maenkova@gmail.com> wrote:
> On Tue, Jul 14, 2015 at 2:54 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Mon, Jun 29, 2015 at 4:21 PM, Evgeniya Maenkova
>> <evgeniya.maenkova@gmail.com> wrote:
>>> On Mon, Jun 29, 2015 at 5:10 PM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Tue, Jun 9, 2015 at 10:11 PM, Evgeniya Maenkova
>>>> <evgeniya.maenkova@gmail.com> wrote:
>>>>> On Tue, Jun 9, 2015 at 3:46 PM, Richard Biener
>>>>> <richard.guenther@gmail.com> wrote:
>>>>>> On Fri, May 29, 2015 at 3:14 PM, Evgeniya Maenkova
>>>>>> <evgeniya.maenkova@gmail.com> wrote:
>>>>>>> Hi Richard,
>>>>>>>
>>>>>>> Here is some explanation. I hope you let me know if I need to clarify something.
>>>>>>>
>>>>>>> Also, you asked me about concrete example, to make sure you don’t miss
>>>>>>> my answer here is the link:
>>>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02417.html.
>>>>>>>
>>>>>>> Also, I doubt whether it’s convenient for you to create a build with
>>>>>>> my patch or not. May be to clarify things you could send me some
>>>>>>> examples/concrete cases, then I’ll compile them with
>>>>>>> –fdump-tree-loopinit-details and –fdump-tree-lim-details and send you
>>>>>>> these dumps. May be these dumps will be useful. (I’ll only disable
>>>>>>> cleanup_cfg TODO after lim to let you know the exact picture after
>>>>>>> lim).
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> 1.       invariantness _dom_walker –
>>>>>>>
>>>>>>> 1.1   for each GIMPLE_COND in given bb calls handle_cond_stmt to call
>>>>>>> for true and false edges handle_branch_edge, which calls SET_TARGET_OF
>>>>>>> for all bb ‘predicated’ by given GIMPLE_COND.
>>>>>>>
>>>>>>> SET_TARGET_OF sets in basic_blocks aux 2 facts:
>>>>>>>
>>>>>>> a)      this is true or false edge;
>>>>>>>
>>>>>>> b)      link to cond stmt;
>>>>>>>
>>>>>>> Handle_branch_edge works this way:
>>>>>>>
>>>>>>> If (cond1)
>>>>>>>
>>>>>>>   {
>>>>>>>
>>>>>>>      bb1;
>>>>>>>
>>>>>>>      if (cond2}
>>>>>>>
>>>>>>>        {
>>>>>>>
>>>>>>>            bb2;
>>>>>>>
>>>>>>>         }
>>>>>>>
>>>>>>>    Being called for cond1, it sets cond1 as condition for both bb1 and
>>>>>>> bb2 (the whole branch for cond1, ie also for bb containing cond2),
>>>>>>> then this method will be called (as there is dominance order) for
>>>>>>> cond2 to correct things (ie to set cond2 as condition for bb2).
>>>>>>
>>>>>> Hmm, why not track the current condition as state during the DOM walk
>>>>>> and thus avoid processing more than one basic-block in handle_branch_edge?
>>>>>> Thus get rid of handle_branch_edge and instead do everything in handle_cond_stmt
>>>>>> plus the dom-walkers BB visitor?
>>>>>>
>>>>> I need to look more carefully how to implement it, but I think I
>>>>> understand what you mean and this optimization of course looks
>>>>> reasonable to me. Will do.
>>>>>
>>>>>> I see you don't handle BBs with multiple predecessors - that's ok, but
>>>>>> are you sure you don't run into correctness issues when not marking such
>>>>>> BBs as predicated?  This misses handling of, say
>>>>>>
>>>>>>  if (a || b)
>>>>>>    bb;
>>>>>>
>>>>>> which is a pity (but can be fixed later if desired).
>>>>>>
>>>>> I had some test (in gcc testsuite or bootstrap build) which worked
>>>>> incorrectly because of multiple predecessors. As far as I remember the
>>>>> situation was (further, will make some notes about such tests to
>>>>> clarify this better), I mean with previous version of my code which
>>>>> handled bb with 2 predecessors:
>>>>> if (a)
>>>>>   tmpvar=something;
>>>>> while()
>>>>>   if (a || b)
>>>>>       basic_block {do something with tmpvar;} // I mean basic block
>>>>> predicated by bb with a and bb with b
>>>>>
>>>>> So, if a is false, I mean we didn't do tmpvar=something (outside
>>>>> loop), BUT we are in basick_block  (we went by bb with b), we got
>>>>> Segmentation falt in basic_block {do something with tmpvar;}.
>>>>>
>>>>> I think we can reproduce all the details of this test if I remove not
>>>>> handling bb with 2 predecessors.
>>>>>
>>>>> So I wouldn't move bb with 2 predecessors (this is not always executed
>>>>> bb in any loop, not conditionally, they will not be moved at all).
>>>>>
>>>>> This is my more detail explanation on this point. Perhaps, I didn't
>>>>> understand your question about correctness. Could you repeat it in
>>>>> other words (based on this new clarification).
>>>>>
>>>>> So I think according to current code it will not be moved. What
>>>>> incorrectness do you mean?
>>>>
>>>> If the block isn't marked as predicated the question is whether it is
>>>> handled correctly or assumed to be unconditionally executed.
>>>>
>>>>>> I note that collecting predicates has similarities to what if-conversion
>>>>>> does in tree-ifcvt.c (even if its implementation is completely different,
>>>>>> of course).
>>>>>>
>>>>>
>>>>> Ok, I'll look at this. But could you please clarify your point?
>>>>> (Should I just take this into account with low priority and look at
>>>>> this later or you want some refactoring?)
>>>>
>>>> I just noted similar code exists elsewhere - it may be possible to
>>>> factor it out but I didn't investigate.  And no, doing that isn't a prerequesite
>>>> for this patch.
>>>>
>>>>>>> 1.2   As 1.1 goes we identify whether some bb is predicated by some
>>>>>>> condition or not.
>>>>>>>
>>>>>>> bb->aux->type will be [TRUE/FALSE]_TARGET_OF and
>>>>>>> bb->aux->cond_stmt=cond stmt (the nearest condition).
>>>>>>>
>>>>>>> If bb is always executed bb->aux->type = ALWAYS_EXECUTED_IN,
>>>>>>> bb->loop->loop (this info was available in the clean build).
>>>>>>>
>>>>>>> 1.3   As this walker is called in dominance order, information about
>>>>>>> condition is available when invariantness_dom_walker is called for
>>>>>>> given bb.  So we can make some computations based on bb->aux
>>>>>>> structure. This is done in check_conditionally_executed. The main goal
>>>>>>> of this method is to check that the root condition is always executed
>>>>>>> in the loop. I did so to avoid situation like this
>>>>>>>
>>>>>>> Loop:
>>>>>>>
>>>>>>>    Jmp somewhere;
>>>>>>>
>>>>>>>   If (cond1)
>>>>>>>
>>>>>>>       If (cond2)
>>>>>>>
>>>>>>>          Etc
>>>>>>>
>>>>>>> By ‘root condition’ I mean cond1 in this case (the first cond in
>>>>>>> dominance order).
>>>>>>
>>>>>> Ok, but this can't happen (an uncontional jump to somehwere).  It
>>>>>> will always be a conditional jump (a loop exit) and thus should
>>>>>> be handled already.
>>>>>>
>>>>>> Did you have a testcase that was mishandled?
>>>>>
>>>>> No, I have not such testcase. This is because I;m newbie at gcc and
>>>>> compilers. If you tell me this fact, let's just remove this check?
>>>>> Correct?
>>>>
>>>> Yes.
>>>>
>>>>>>
>>>>>>> If  ‘root condition’ is not always executed we disable (free) all the
>>>>>>> info in bb->aux, ie all the blocks becomes neither conditionally nor
>>>>>>> always executed according to bb->aux.
>>>>>>>
>>>>>>> There is some optimization to don’t go each time trough the conditions
>>>>>>> chain (cond2->cond1), let me skip such details for now.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 1.4   Then we calculate tgt_level  (which is used in move_computations later)
>>>>>>>
>>>>>>> The patch is the same for phi and regular stmt (calculate_tgt_level)
>>>>>>>
>>>>>>> 1)      If stmt is conditionally executed we compute max movement
>>>>>>> (determine_max_movement) starting from
>>>>>>> get_lim_data(condition)->max_loop.
>>>>>>>
>>>>>>> 2)      If stmt is not cond executed as start level  for
>>>>>>> determine_max_movement computations we choose ALWAYS_EXECUTED_IN.
>>>>>>>
>>>>>>> To clarify why, see the comment
>>>>>>>
>>>>>>> /* The reason why we don't set start_level for MOVE_POSSIBLE
>>>>>>>
>>>>>>>           statements to more outer level is: this statement could depend on
>>>>>>>
>>>>>>>           conditional statements and even probably is not defined without this
>>>>>>>
>>>>>>>           condition. So either we should analyze these ones and move
>>>>>>>
>>>>>>>           under condition somehow or keep more strong start_level . */
>>>>>>>
>>>>>>> As you noted in your review there was some refactoring. Of course, I
>>>>>>> had no goal to refactor existing code, I intended to remove some
>>>>>>> duplication which I caused by my changes.  I hope we discuss in
>>>>>>> details later if you don’t like some refactoring.
>>>>>>
>>>>>> Refactoring makes a big patch harder to review because it ends up
>>>>>> containing no-op changes.  It also makes it appear bigger than it
>>>>>> really is ;)
>>>>>>
>>>>>> A reasonable solution is to factor out the refactoring to a separate patch.
>>>>>>
>>>>> I see what you mean.
>>>>>
>>>>>>> 2.       store_motion  - for some stmts set flags to ignore predicated
>>>>>>> execution (I mean to move statements without conditions).
>>>>>>
>>>>>> That's for the existing "predicated" support as far as I can see.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 3.       move_computations. The patch is doing something in the
>>>>>>> following 3 calls inside of this method.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 3.1   (more details below) walker.walk – move computations(create if()
>>>>>>> structure) and store information to create phi nodes later (which
>>>>>>> statements require phi nodes, as they conditionally executed and there
>>>>>>> are still their uses on other levels, what are the names for such phi
>>>>>>> nodes, as we replace uses in here, in walker.walk.)
>>>>>>
>>>>>> What kind of uses are that?  By definition all stmts executed
>>>>>> under condition A that are used in code not under condition A have
>>>>>> a PHI node associated.  This is why the existing conditional movement
>>>>>> handling moves PHI nodes (and all dependent computations).
>>>>>>
>>>>>> Are you thinking of
>>>>>>
>>>>>>   for (;;)
>>>>>>     if (a) x = 1;
>>>>>>
>>>>>>   .. use of x
>>>>>>
>>>>>> ?
>>>>> No, I think in this case phi node of x will be used in initial
>>>>> (==loopinit) code (in 'use of x') instead of some version of x (which
>>>>> I move).
>>>>>
>>>>>>If you move the PHI node for x then you don't need to insert any other
>>>>>> PHI nodes (and you _do_ have to move the PHI node anyway).
>>>>>
>>>>>>
>>>>>>> 3.2   (more details below) create_phi_nodes – create phi nodes for
>>>>>>> statements which require that (see 3.1)
>>>>>>
>>>>>> Only PHI nodes need PHI node creation and only its dependences
>>>>>> might need predication.
>>>>>>
>>>>>> That is, if you move a conditionally executed statement but not
>>>>>> any PHI node it feeds then the movement is useless.
>>>>>>
>>>>>> But maybe I am missing something about the PHI business.
>>>>>>
>>>>> Ok. I have different understanding on this (and am afraid of this of
>>>>> course :) ).
>>>>>
>>>>> See the example:
>>>>> void bar (long);
>>>>> void foo (int cond1, int cond2, int cond3, int m, int n)
>>>>> {
>>>>>   unsigned x;
>>>>>   unsigned inv1;
>>>>>
>>>>>   for (x = 0; x < 1000; ++x)
>>>>>     {
>>>>>       if (cond1)
>>>>>         {
>>>>>           int tmp_inv1 = m*n;
>>>>>           non_inv1 = tmp_inv1 + x;
>>>>>         }
>>>>>     }
>>>>>   bar (non_inv1);
>>>>> }
>>>>>
>>>>> This is working example, will attach loopinit and lim1 dumps.
>>>>> There is no phi node for tmp_inv1 in loopinitat all. However, I think
>>>>> we should move tmp_inv1. To use tmp_inv1 in non_inv1 we need phi node
>>>>> (otherwise we get verification failure as bb with conditionally moved
>>>>> tmp_inv1 doesn;t dominate bb with non_inv1). Another example if
>>>>> non_inv1 is invariant but has not enough cost or something else. In
>>>>> short, if it's still used in initial or others loops. Say, non_inv1 is
>>>>> invariant and moved in another loop. Or something else.
>>>>>
>>>>> So, when I read
>>>>> 'Only PHI nodes need PHI node creation'
>>>>> and
>>>>> 'That is, if you move a conditionally executed statement but not any
>>>>> PHI node it feeds then the movement is useless.'
>>>>> I think about this example and don't understand. Could you clarify?
>>>>
>>>> Ok, so this case is already handled by LIM by simply moving tmp_inv1 = m*n
>>>> (and making it unconditionally execute before the loop).  In fact whether
>>>> we can move it does not depend on whether 'cond' is invariant or not.
>>>>
>>>> If we couldn't execute m*n unconditionally on the loop preheader edge
>>>> then we'd have to be able to also move the guarding condition.  Note
>>>> that cost considerations shouldn't come into play here.
>>>>
>>>> So I was only thinking of the case where we also want to move the
>>>> condition (which is wanted if we want to move a PHI node).
>>>>
>>>
>>> Don't understand "So I was only thinking of the case where we also
>>> want to move the condition (which is wanted if we want to move a PHI
>>> node)."
>>>
>>> What about the case when we want to move condition but don't have phi
>>> node to move.
>>>
>>> Let's consider a little bit modified example:
>>>
>>> void bar (long);
>>> void foo (int cond1, int cond2, int cond3, int m, int n)
>>>   {
>>>      unsigned x;
>>>      unsigned inv1;
>>>
>>>      for (x = 0; x < 1000; ++x)
>>>        {
>>>          if (n)
>>>            {
>>>              int tmp_inv1 = m/n;
>>>              non_inv1 = tmp_inv1 + x;
>>>             }
>>>        }
>>>      bar (non_inv1);
>>>  }
>>> We couldn't  move tmp_inv1 uncoditionally. Right (I have no clean
>>> build at hands right now to check. If I'm wrong let's replace m/n to
>>> something that couldn't be unconditionally moved :))? There is no phi
>>> node for tmp_inv1.
>>> Could you clarify what conditional lim should do in your opinion here?
>>
>> Nothing at this point.  If only then for the simple reason to make the
>> first iteration of the patch simpler (still covering most cases).
>>
>> You seem to want to catch 100% of all possible cases.
>>
>>> Is it too little gain to move statements that only used in the same bb
>>> or branch (I mean before phi node creation, even if stmt had phi
>>> node)?
>>>
>>> In fact, we could have here 2 thousands of divisions:
>>> if (n)
>>>   {
>>>     int tmp_inv1 = m/n;
>>>     tmp_inv2 = something_which_couldn't_be moved_unconditionally (tmp_inv1);
>>>     ...
>>>     tmp_invN = something_which_couldn't_be moved_unconditionally (tmp_invN-1);
>>>     non_inv1 = tmp_invN + x;
>>>   }
>>
>> Of course we could.  But the patch didn't even exercise this case (no testcase).
>
> Patch handles this case (this is why I created phi nodes which we
> tried to discuss).
> Your comment is about test case not about the patch.

My comment was about the lack of a testcase for this in the patch submission.

But my whole point is that supporting this should be split off to a followup
patch to make both patches easier to review.

Richard.

>>
>> Please make patches simple - support for moving non-PHIs can be added
>> as a followup.  The first important part is to make the current code
>> (moving PHIs)
>> create control-flow rather than if-converted form.
>>
>> Richard.
>>
>>>
>>>>>>>
>>>>>>> 3.3   replace_uses_in_conditions
>>>>>>>
>>>>>>> After computations movements we can have several copies of the cond
>>>>>>> stmt. In 3.1 we do replacement of uses based on stmt’s tgt_level. For,
>>>>>>> cond stmt it doesn’t work. As cond stmt, of course, have something in
>>>>>>> lim_aux_data->tgt_level, but, in fact, they are moved only if
>>>>>>> corresponding invariants are moved. So, in fact, the same cond (copies
>>>>>>> of it, of course) could go to the different levels. So to fix these
>>>>>>> uses, we call replace_uses_in_conditions.
>>>>>>
>>>>>> Hmm, but stmts that a condition depends on always have at least the
>>>>>> target level of the condition statement.  Thus their computation is
>>>>>> available in all other "same" conditon statements that might appear
>>>>>> in more inner loop levels, no?  So isn't this just about performing the
>>>>>> movement in a proper ordering?
>>>>>
>>>>> Let me give your more details, then may be you repeat question in other words.
>>>>> for ()
>>>>> {
>>>>>   if (a)
>>>>>     {
>>>>>        inv1 = some_computations;
>>>>>        if (inv1 < inv2)
>>>>>             {
>>>>>               inv3;
>>>>>               inv4;
>>>>>             }
>>>>>        }
>>>>> }
>>>>>
>>>>> So, the order for movement is: inv1; inv3; inv4. When we move inv1 we
>>>>> have tgt_level for inv3 and inv4 computed but we have not created
>>>>> copies of 'if (inv1 < inv2)' created. In theory, we can have several
>>>>> copies:(1) for tgt_level of inv3 (2) for tgt_level of inv2 (3) in
>>>>> initial cycle. if (1) is in tgt level of inv1 we need no replacement,
>>>>> in other cases we need replacements. Of course, we know this(all the
>>>>> info about tgt_levels) when we move inv1, but then we need to create
>>>>> all needed copies (store it somehow, etc) of 'if (inv1 < inv2') and
>>>>> make replacement in these copies. I don't see now why this is much
>>>>> better. This doesn't mean that current solution is perfect, just
>>>>> clarify your thought.
>>>>
>>>> I still don't get what "replacements" actually mean.  Let's extend the
>>>> example to
>>>>
>>>> for ()  (A)
>>>>   {
>>>>> for ()  (B)
>>>>> {
>>>>>   if (a)
>>>>>     {
>>>>>        inv1 = some_computations;
>>>>>        if (inv1 < inv2)
>>>>>             {
>>>>>               inv3;
>>>>>               inv4;
>>>>>             }
>>>>>        }
>>>>> }
>>>>  }
>>>>
>>>> then you say the target level of inv1 could be 'A' and the target
>>>> level of inv3 'B'
>>>> and that of inv4 is 'B' as well.
>>>>
>>>> That's true.  So we move inv1 to before 'A' and inv3/inv4 to before 'B' guarded
>>>> with if (inv1 < inv2).  I fail to see what "copies" we need here.  'inv1' is
>>>> available in all target levels below its own, no copies needed.
>>>>
>>>
>>> I meant the copies of conditions (COND_STMT: inv1 < inv2).
>>>
>>> Let's modify this example: tgt_level (inv1) = A; tgt_level(inv3) = A;
>>> tgt_level(inv4) = B.
>>>
>>>  Then
>>> if (a)
>>>    inv1=some_computations;
>>> if (inv1 < inv2)
>>>    inv3;
>>> <preheader edge for A, where phi for inv1(phi_for_inv1) is computed>
>>> for ()  (A)
>>>  {
>>>  if (phi_for_inv1 <inv2)
>>>    inv3
>>>  for ()  (B)
>>>    {
>>>    if (a)
>>>      {
>>>         if (phi_for_inv1 < inv2)
>>>           {
>>>           }
>>>      }
>>>  }
>>> So, we have 3 copies of inv1 < inv2 after lim (in some of them we use
>>> inv1, in other phi_for_inv1). But this is the second priority for now
>>> (this is all about whether we should move statements w/o phi node).
>>>
>>>>>>
>>>>>>> More details on 3.1 and 3.2
>>>>>>>
>>>>>>> 3.1 The patch is very similar for phi stmts and regular stmts (there
>>>>>>> is some difference for regular stmts related to the case when we move
>>>>>>> sequence instead of single stmt, let’s skip it in this description,
>>>>>>> will describe later. BTW, there are comments, in corresponding parts).
>>>>>>>
>>>>>>> a) for each bb we call check_conditionally_executed and if bb is cond
>>>>>>> executed then invariant statement  will be moved conditionally.
>>>>>>>
>>>>>>> b) prepare_edge_for_stmt collects information needed to create phi
>>>>>>> nodes in 3.2(add_info_for_phi) AND prepares edge ie creates if()
>>>>>>> structure needed to move statement
>>>>>>> (get_block_for_predicated_statement, see below). All of this is done
>>>>>>> for cond executed stmt. Nothing is done for non cond executed.
>>>>>>>
>>>>>>> BTW, there is some strange names for functions like missed ending
>>>>>>> (prepare_edge_for_stm, w/o t in the end, this is because of my script
>>>>>>> to add ‘ ‘ before ‘(‘, will fix in the next patch version, ignore
>>>>>>> please so far).
>>>>>>>
>>>>>>> Get_block_for_predicated_stmt:
>>>>>>>
>>>>>>> Create bb where to add cond stmt to (or return existing, each loop has
>>>>>>> a map  (level_cond_data_map is map <loop, cond_data_map> ,
>>>>>>> cond_data_map is map <gimple, basic_block>) to identify if bb for
>>>>>>> given cond was created or not for given level).
>>>>>>>
>>>>>>> Parameters of this method:
>>>>>>>
>>>>>>> Cond – condition of stmt, branch – is it on true or false edge of
>>>>>>> cond, cond_map  - see above, level (tgt_level), preheader –
>>>>>>> preheader_edge for level.
>>>>>>>
>>>>>>> So first of all we check whether there is a basic block for given
>>>>>>> condition in given loop.
>>>>>>>
>>>>>>> a)      If such bb exists then we trying to find corresponding edge,
>>>>>>> destination of this edge shouldn’t be a post dominator of bb
>>>>>>> (find_branch_edge).
>>>>>>>
>>>>>>> a.1) if such edge exists we return the most remote in this branch bb
>>>>>>> (or create it), see get_most_remote_in_branch;
>>>>>>>
>>>>>>> a.2) if such edge doesn’t exists, we create corresponding bb, see make_branch
>>>>>>>
>>>>>>>        b)  If such bb doesn’t exist we create it (recursively calling
>>>>>>> get_block_for_predicated_stmt if required bb is conditionally
>>>>>>> executed).
>>>>>>>
>>>>>>> Will not describe it further so far (let me know if you would like to
>>>>>>> read additional details).
>>>>>>>
>>>>>>> Add_info_for_phi
>>>>>>>
>>>>>>> The main goal of this method is to identify situation where
>>>>>>> stmt(parameter,ie the statement which we are moving) or more exactly
>>>>>>> lhs of this statement
>>>>>>>
>>>>>>> a)      is still used in other loops, so we need to create phi node to
>>>>>>> use phi name inside other loops.
>>>>>>>
>>>>>>> b)      Is used in phi node of original loop and this phi node is
>>>>>>> going to be moved. BUT, as we move phi node as assignment (in the case
>>>>>>> of phi node with 2 arguments) we need to create phi node to use lhs of
>>>>>>> statement (param of add_info_for_phi).
>>>>>>>
>>>>>>> In these cases we make corresponding replacements and store
>>>>>>> information needed to create phi nodes and make some other
>>>>>>> replacements in 3.3 (replace_uses_in_conditions).
>>>>>>>
>>>>>>> Add_info_for_phi calls create_tmp_var which requires some explanation.
>>>>>>>
>>>>>>> Create_tmp_var
>>>>>>>
>>>>>>> To create new names for phi nodes and to create default def for phi
>>>>>>> arguments I use make_ssa_name and get_or_create_ssa_default_def. These
>>>>>>> methods have some asserts (being merged):   (TREE_CODE (old_var) ==
>>>>>>> VAR_DECL   || TREE_CODE (old_var) == PARM_DECL   || TREE_CODE
>>>>>>> (old_var) == RESULT_DECL).
>>>>>>>
>>>>>>> So in some cases (when I know that I need to use methods mentioned
>>>>>>> above, see the start of create_tmp_var, ie the uses in other loops) I
>>>>>>> create new tmp variable to overcome these asserts. To be honest I
>>>>>>> don’t like this but don’t know how to deal with this better.
>>>>>>
>>>>>> Hmm, don't you just run into SSA names here?  That is, TREE_CODE
>>>>>> (old_var) == SSA_NAME?
>>>>>> You can use make_ssa_name (TREE_TYPE (old_var), "plim") in this case.
>>>>>>
>>>>>> Or you run into the issue that anonymous SSA names cannot have default
>>>>>> defintions (I don't understand why you need all the fuss about default defs).
>>>>>
>>>>> I used default defs to write something in phi args when I have no values on
>>>>> corresponding branches. Say there is some tmp value which I should
>>>>> create phi for.
>>>>> if (a)
>>>>>    {
>>>>>      tmpvar=something
>>>>>      b = something(tmpvar);
>>>>>     }
>>>>> else
>>>>>    {
>>>>>    }
>>>>> When I move if (a) {tmpvar = something} and create phi for tmpvar (to
>>>>> use phi result
>>>>> in the initial cycle) i need to write something for 'not a' phi arg.
>>>>> BUT, in some cases I use
>>>>>  build_zero_cst (which I started to use after default defs use. I mean
>>>>> I didn't know about
>>>>>  build_zero_cst while I started to use default defs.). So may be I can replace
>>>>> uses of default defs in such cases by build_zero_cst? I mean just SOME values
>>>>> (as I know that control flow will never go to this places actually). I
>>>>> mean if !a tmpvar will not used.
>>>>
>>>> Ok, this is again about the case where you are not moving a PHI node but
>>>> creating it because you want to conditionally execute invariant stmts.
>>>> Yes, a default def is ok here, but you can always use a new VAR_DECL
>>>> just for the default def creation like gimple_fold_call does for example:
>>>>
>>>>                           tree var = create_tmp_var (TREE_TYPE (lhs));
>>>>                           tree def = get_or_create_ssa_default_def (cfun, var);
>>>>
>>>> Note that I think we don't need to bother about this case as we'd only
>>>> want to move PHI nodes in the first place (so you already have values
>>>> for all PHI args).
>>>>
>>>>>>
>>>>>>> And a couple of words regarding where we store info for phi nodes:
>>>>>>>
>>>>>>> Each loop contains in its aux  phi_data structure. On this stage we
>>>>>>> only add there stmt (if phi node will be required), phi name=phi node
>>>>>>> result name in names_map or fl_names_map). See the comment about
>>>>>>> phi_data in patch.
>>>>>>>
>>>>>>> If there should be created several phi nodes for given lhs (I mean the
>>>>>>> first place for phi node doesn’t dominate on corresponding preheader,
>>>>>>> we add only the last name in names_map (as intermediate names could be
>>>>>>> created later, but the last name in bb which dominates preheader
>>>>>>> should be used for replacement in other loops. Replacements were
>>>>>>> already made in walker).
>>>>>>>
>>>>>>> If lhs is used only in phi node, which are moved and transformed into
>>>>>>> assignment, and there is no uses in other loops, we need to create
>>>>>>> only first level phi node (don’t need to create phi nodes till some bb
>>>>>>> which dominates preheader), then we add such name to fl_names_map.
>>>>>>>
>>>>>>> 3.2 Create_phi_nodes
>>>>>>>
>>>>>>> For each of the loops we go over ((phi_data*) loop->aux)->stmts. These
>>>>>>> are statements which were moved, but they have uses in the original
>>>>>>> loop (more exactly, their uses in some other loops replaced by some
>>>>>>> name from ((phi_data*) loop->aux)->names_map or ((phi_data*)
>>>>>>> loop->aux)->fl_names_map.
>>>>>>>
>>>>>>> So for each of such stmts we create phi nodes (for its lhs) chain.
>>>>>>>
>>>>>>> Basic block for phi node is found in get_bb_for_phi  for given bb with
>>>>>>> stmt. If basic block for phi node dominates preheader->src we end
>>>>>>> chain creation. There is some special condition for the case when we
>>>>>>> need to create only first level phi node. (I described this situation
>>>>>>> above, but let me know If I need to add  more details.
>>>>>>>
>>>>>>> Basic blocks can have maps to identify if we created phi node for
>>>>>>> given variable in given basic_block, so we only add an edge argument
>>>>>>> for such case (phi_map = new hash_map<basic_block, hash_map<tree,
>>>>>>> gphi*>*>).
>>>>>>
>>>>>> Ok.
>>>>>>
>>>>>> Well - I think it might be easier code-generation-wise to have a separate
>>>>>> phase move_phis executed before move_computations, that just moves
>>>>>> invariant PHI nodes (which this all is about - see above) and their
>>>>>> controlling conditions and only then move their dependencies.
>>>>> I didn;t understand so far (I think let's clarify previous questions,especially
>>>>> about phi nodes then let's go back to this one.)
>>>>
>>>> Yeah, I think the PHI node thing is our major mis-understanding (or the
>>>> main thing that I think complicates the patch for too little gain - at least
>>>> initially).
>>>>
>>>> Thanks,
>>>> Richard.
>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Richard.
>>>>>>
>>>>>>> In short, that’s all.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Evgeniya
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, May 9, 2015 at 1:07 AM, Evgeniya Maenkova
>>>>>>> <evgeniya.maenkova@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Could you please review my patch for predicated lim?
>>>>>>>>
>>>>>>>> Let me note some details about it:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 1)      Phi statements are still moved only if they have 1 or 2
>>>>>>>> arguments. However, phi statements could be move under conditions (as
>>>>>>>> it’s done for the other statements).  Probably, phi statement motion
>>>>>>>> with 3 + arguments could be implemented in the next patch after
>>>>>>>> predicated lim.
>>>>>>>>
>>>>>>>> 2)      Patch has limitations/features like (it was ok to me to
>>>>>>>> implement it such way, maybe I’m not correct. ):
>>>>>>>>
>>>>>>>> a)      Loop1
>>>>>>>>
>>>>>>>>     {
>>>>>>>>
>>>>>>>>               If (a)
>>>>>>>>
>>>>>>>>                  Loop2
>>>>>>>>
>>>>>>>>                      {
>>>>>>>>
>>>>>>>>                        Stmt - Invariant for Loop1
>>>>>>>>
>>>>>>>>                      }
>>>>>>>>
>>>>>>>>              }
>>>>>>>>
>>>>>>>>        In this case Stmt will be moved only out of Loop2, because of if (a).
>>>>>>>>
>>>>>>>> b)      Or
>>>>>>>>
>>>>>>>> Loop1
>>>>>>>>
>>>>>>>>     {
>>>>>>>>
>>>>>>>>          …
>>>>>>>>
>>>>>>>>          If (cond1)
>>>>>>>>
>>>>>>>>               If (cond2)
>>>>>>>>
>>>>>>>>                   If (cond3)
>>>>>>>>
>>>>>>>>                       Stmt;
>>>>>>>>
>>>>>>>>        }
>>>>>>>>
>>>>>>>> Stmt will be moved out only if cond1 is always executed in Loop1.
>>>>>>>>
>>>>>>>> c)       It took me a long time to write all of these code, so there
>>>>>>>> might be other peculiarities which I forgot to mention. :)
>>>>>>>>
>>>>>>>>                        Let’s discuss these ones as you will review my patch.
>>>>>>>>
>>>>>>>> 3)      Patch consists of 9 files:
>>>>>>>>
>>>>>>>> a)      gcc/testsuite/gcc.dg/tree-ssa/loop-7.c,
>>>>>>>> gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – changed tests:
>>>>>>>>
>>>>>>>> -          gcc/testsuite/gcc.dg/tree-ssa/loop-7.c  changed as
>>>>>>>> predicated lim moves 2 more statements out of the loop;
>>>>>>>>
>>>>>>>> -          gcc/testsuite/gcc.dg/tree-ssa/recip-3.c – with conditional
>>>>>>>> lim recip optimization in this test doesn’t work (the corresponding
>>>>>>>> value is below threshold as I could see in the code for recip, 1<3).
>>>>>>>> So to have recip working in this test I changed test a little bit.
>>>>>>>>
>>>>>>>> b)      gcc/tree-ssa-loop-im.c – the patched lim per se
>>>>>>>>
>>>>>>>> c)       gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-13.c,
>>>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-14.c,
>>>>>>>>
>>>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-15.c,
>>>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-16.c,
>>>>>>>>
>>>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-17.c,
>>>>>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-18.c
>>>>>>>>
>>>>>>>> the tests for conditional lim.
>>>>>>>>
>>>>>>>> 4)      Patch testing:
>>>>>>>>
>>>>>>>> a)      make –k check (no difference in results for me for the clean
>>>>>>>> build and the patched one,
>>>>>>>>
>>>>>>>> -          Revision: 222849,
>>>>>>>>
>>>>>>>> -          uname -a
>>>>>>>>
>>>>>>>>                Linux Istanbul 3.16.0-23-generic #31-Ubuntu SMP Tue Oct
>>>>>>>> 21 18:00:35 UTC 2014 i686 i686 i686 GNU/Linux
>>>>>>>>
>>>>>>>> b)      Bootstrap.
>>>>>>>>
>>>>>>>>  It goes well now, however to fix it I have made a temporary hack in
>>>>>>>> the lim code. And with this fix patch definitely shouldn’t be
>>>>>>>> committed.
>>>>>>>>
>>>>>>>> I did so, as I would like to discuss this issue first.
>>>>>>>>
>>>>>>>> The problem is: I got stage2-stage3 comparison failure on the single
>>>>>>>> file (tree-vect-data-refs.o). After some investigation I understood
>>>>>>>> that tree-vect-data-refs.o differs being compiled with and without
>>>>>>>> ‘-g’ option (yes, more exactly on stage 2 this is  ‘-g –O2 –gtoggle’,
>>>>>>>> and for stage 3 this is ‘-g –O2’. But to simplify things I can
>>>>>>>> reproduce this difference on the same build (even not bootstrapped),
>>>>>>>> with option ‘ –g’ and without it).
>>>>>>>>
>>>>>>>> Of course, the file compiled with –g option will contain debug
>>>>>>>> information and will differ from the corresponding file without debug
>>>>>>>> information. I mean there is the difference reported by
>>>>>>>> contrib/compare-debug (I mean we talk about stripped files).
>>>>>>>>
>>>>>>>> Then I compared assemblers and lim dump files and made a conclusion
>>>>>>>> the difference is:
>>>>>>>>
>>>>>>>> There is statement _484=something, which is conditionally moved out of
>>>>>>>> loop X. In non debug cases no usages of _484 are left inside the loop
>>>>>>>> X. In debug case, there is the single use in debug statement. So for
>>>>>>>> debug case my patch generates phi statement for _484 to make it
>>>>>>>> available inside the loop X. For non debug case, no such phi statement
>>>>>>>> generated as there is no uses of _484.
>>>>>>>>
>>>>>>>> There is one more statement with lhs=_493, with exactly this pattern
>>>>>>>> of use. But this is not important for our discussion.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> So, in my opinion it’s ok to generate additional phi node for debug
>>>>>>>> case. But I’m not a compiler expert and maybe there is some
>>>>>>>> requirement that debug and non-debug versions should differ only by
>>>>>>>> debug statements, I don’t know.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> My temporary hack fix is just removing of such debug statements (no
>>>>>>>> debug statements, no problems J).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The alternatives I see are:
>>>>>>>>
>>>>>>>> -          Move debug statements outside the loop (not good in my opinion);
>>>>>>>>
>>>>>>>> -          Somehow reset values in debug statements (not good in my
>>>>>>>> opinion, if I could provide correct information for them);
>>>>>>>>
>>>>>>>> -          Generate phi for debug statements and fix bootstrap (say,
>>>>>>>> let’s have some list of files, which we have special check for. I mean
>>>>>>>> for my case, the difference is not in stage2 and stage3, it’s in debug
>>>>>>>> and non-debug). I like this variant. However, as I don’t see such list
>>>>>>>> of special files in the whole gcc, I think I might be missing
>>>>>>>> something.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Evgeniya
>>>
>>>
>>>
>>> --
>>> Thanks,
>>>
>>> Evgeniya
>>>
>>> perfstories.wordpress.com
>
>
>
> --
> Thanks,
>
> Evgeniya
>
> perfstories.wordpress.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2015-07-15 10:49 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-08 21:07 conditional lim Evgeniya Maenkova
2015-05-09 12:41 ` Andi Kleen
2015-05-26 10:32 ` Richard Biener
2015-05-26 13:38   ` Evgeniya Maenkova
2015-05-27 10:52     ` Richard Biener
2015-05-27 13:36       ` Evgeniya Maenkova
2015-05-29 13:32 ` Evgeniya Maenkova
2015-06-09 11:48   ` Richard Biener
2015-06-09 20:12     ` Evgeniya Maenkova
2015-06-29 13:15       ` Richard Biener
2015-06-29 14:24         ` Evgeniya Maenkova
2015-07-14 10:54           ` Richard Biener
2015-07-15  8:43             ` Evgeniya Maenkova
2015-07-15 10:58               ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).