public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* Early jump threading
@ 2016-08-11 14:02 Jan Hubicka
  2016-08-11 14:06 ` Richard Biener
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Jan Hubicka @ 2016-08-11 14:02 UTC (permalink / raw)
  To: gcc-patches, law, rguenther

Hi,
this patch adds early jump threading pass.  Jump threading is one of most
common cases where estimated profile becomes corrupted, because the branches
are predicted independently beforehand. This patch simply adds early mode to
jump threader that does not permit code growth and thus only win/win threads
are done before profile is constructed.

Naturally this also improves IPA decisions because code size/time is estimated
more acurately.

It is not very cool to add another pass and the jump threader is already
run 5 times. I think incrementally we could drop one of late threaders at least.
I tried to measure compile time effects but they are in wash. Moreover the patch
pays for itself in cc1plus code size:

Before patch to tweak size estimates: 27779964  
Current mainline:                     27748900  
With patch applied:                   27716173  

So despite adding few functions the code size effect is positive which I think
is quite nice.

Given the fact that jump threading seems quite lightweit, I wonder why it is
guarded by flag_expensive_optimizations? Is it expensive in terms of code
size?

The effectivity of individual threading passes is as follows (for tramp3d)

      mainline                              with patch
pass  thread count     profile mismatches   thread count    profile mismatch
early                                       525
1     1853             1900                 316             337
2     4                812                  4               288
3     24               1450                 32              947
4     76               1457                 75              975

So at least tramp3d data seems to suggest that we can drop the second occurence
of jump threading and that most of the job thread1 does can be done by the
size restricted early version (the lower absolute counts are caused by the
fact that threadable paths gets duplicated by the inliner). 50% drop in
profile distortion is not bad. I wonder why basically every threaded paths seems
to introduce a mismatch.

The patch distorts testusite somewhat, in most cases we only want to update
dump files or disable early threading:

+XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
+XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
+FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
+FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 1
+FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not thread around loop and would copy too many statements"
+FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 "Jumps threaded: 1" 1
+FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
+FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate p_.*to 1" 1
+FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
+FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: \\\\[1, 1\\\\]"

This testcase is the now probably unnecesary heuristics (it may still be
relevant in cases we do not thread because of size bounds but its effectivity
may not be worth the maintenance cost):

+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
+FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
+FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3

If the patch seems acceptable, I will do the updates. One option why I did
not do that is that it seems to be now posisble to pass parameters to passes
from passes.def, so perhaps we do not need early_thread_jumps, but doing so is
consistent with way we handle other early passes.

Bootstrapped/regtested x86_64-linux
Honza

	* passes.def (pass_early_thread_jumps): Schedule after forwprop.
	* tree-pass.h (make_pass_early_thread_jumps): Declare.
	* tree-ssa-threadbackward.c (fsm_find_thread_path,
	fsm_find_thread_path, profitable_jump_thread_path,
	fsm_find_control_statement_thread_paths,
	find_jump_threads_backwards): Add speed_p parameter.
	(pass_data_early_thread_jumps): New pass.
	(make_pass_early_thread_jumps): New function.
	
Index: passes.def
===================================================================
--- passes.def	(revision 239218)
+++ passes.def	(working copy)
@@ -84,6 +84,7 @@ along with GCC; see the file COPYING3.
 	  /* After CCP we rewrite no longer addressed locals into SSA
 	     form if possible.  */
 	  NEXT_PASS (pass_forwprop);
+          NEXT_PASS (pass_early_thread_jumps);
 	  NEXT_PASS (pass_sra_early);
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
Index: tree-pass.h
===================================================================
--- tree-pass.h	(revision 239218)
+++ tree-pass.h	(working copy)
@@ -399,6 +399,7 @@ extern gimple_opt_pass *make_pass_cd_dce
 extern gimple_opt_pass *make_pass_call_cdce (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_merge_phi (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_thread_jumps (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_early_thread_jumps (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_split_crit_edges (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_laddress (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_pre (gcc::context *ctxt);
Index: tree-ssa-threadbackward.c
===================================================================
--- tree-ssa-threadbackward.c	(revision 239219)
+++ tree-ssa-threadbackward.c	(working copy)
@@ -61,12 +61,14 @@ get_gimple_control_stmt (basic_block bb)
 /* Return true if the CFG contains at least one path from START_BB to END_BB.
    When a path is found, record in PATH the blocks from END_BB to START_BB.
    VISITED_BBS is used to make sure we don't fall into an infinite loop.  Bound
-   the recursion to basic blocks belonging to LOOP.  */
+   the recursion to basic blocks belonging to LOOP.
+   SPEED_P indicate that we could increase code size to improve the code path */
 
 static bool
 fsm_find_thread_path (basic_block start_bb, basic_block end_bb,
 		      vec<basic_block, va_gc> *&path,
-		      hash_set<basic_block> *visited_bbs, loop_p loop)
+		      hash_set<basic_block> *visited_bbs, loop_p loop,
+		      bool speed_p)
 {
   if (loop != start_bb->loop_father)
     return false;
@@ -82,7 +84,8 @@ fsm_find_thread_path (basic_block start_
       edge e;
       edge_iterator ei;
       FOR_EACH_EDGE (e, ei, start_bb->succs)
-	if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, loop))
+	if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, loop,
+				  speed_p))
 	  {
 	    vec_safe_push (path, start_bb);
 	    return true;
@@ -101,11 +104,13 @@ fsm_find_thread_path (basic_block start_
    value on PATH.  ARG is the value of that SSA_NAME.
 
    BBI will be appended to PATH when we have a profitable jump threading
-   path.  Callers are responsible for removing BBI from PATH in that case. */
+   path.  Callers are responsible for removing BBI from PATH in that case.
+
+   SPEED_P indicate that we could increase code size to improve the code path */
 
 static edge
 profitable_jump_thread_path (vec<basic_block, va_gc> *&path,
-			     basic_block bbi, tree name, tree arg)
+			     basic_block bbi, tree name, tree arg, bool speed_p)
 {
   /* Note BBI is not in the path yet, hence the +1 in the test below
      to make sure BBI is accounted for in the path length test.  */
@@ -306,7 +311,7 @@ profitable_jump_thread_path (vec<basic_b
       return NULL;
     }
 
-  if (optimize_edge_for_speed_p (taken_edge))
+  if (speed_p && optimize_edge_for_speed_p (taken_edge))
     {
       if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
 	{
@@ -421,13 +426,15 @@ convert_and_register_jump_thread_path (v
 
 /* We trace the value of the SSA_NAME NAME back through any phi nodes looking
    for places where it gets a constant value and save the path.  Stop after
-   having recorded MAX_PATHS jump threading paths.  */
+   having recorded MAX_PATHS jump threading paths.
+
+   SPEED_P indicate that we could increase code size to improve the code path */
 
 static void
 fsm_find_control_statement_thread_paths (tree name,
 					 hash_set<basic_block> *visited_bbs,
 					 vec<basic_block, va_gc> *&path,
-					 bool seen_loop_phi)
+					 bool seen_loop_phi, bool speed_p)
 {
   /* If NAME appears in an abnormal PHI, then don't try to trace its
      value back through PHI nodes.  */
@@ -495,7 +502,7 @@ fsm_find_control_statement_thread_paths
 	  hash_set<basic_block> *visited_bbs = new hash_set<basic_block>;
 
 	  if (fsm_find_thread_path (var_bb, e->src, next_path, visited_bbs,
-				    e->src->loop_father))
+				    e->src->loop_father, speed_p))
 	    ++e_count;
 
 	  delete visited_bbs;
@@ -561,7 +568,7 @@ fsm_find_control_statement_thread_paths
 	      /* Recursively follow SSA_NAMEs looking for a constant
 		 definition.  */
 	      fsm_find_control_statement_thread_paths (arg, visited_bbs, path,
-						       seen_loop_phi);
+						       seen_loop_phi, speed_p);
 
 	      path->pop ();
 	      continue;
@@ -572,7 +579,8 @@ fsm_find_control_statement_thread_paths
 
 	  /* If this is a profitable jump thread path, then convert it
 	     into the canonical form and register it.  */
-	  edge taken_edge = profitable_jump_thread_path (path, bbi, name, arg);
+	  edge taken_edge = profitable_jump_thread_path (path, bbi, name, arg,
+							 speed_p);
 	  if (taken_edge)
 	    {
 	      if (bb_loop_depth (taken_edge->src)
@@ -588,7 +596,7 @@ fsm_find_control_statement_thread_paths
 
       if (TREE_CODE (arg) == SSA_NAME)
 	fsm_find_control_statement_thread_paths (arg, visited_bbs,
-						 path, seen_loop_phi);
+						 path, seen_loop_phi, speed_p);
 
       else
 	{
@@ -598,7 +606,7 @@ fsm_find_control_statement_thread_paths
 	  path->pop ();
 
 	  edge taken_edge = profitable_jump_thread_path (path, var_bb,
-						     name, arg);
+						         name, arg, speed_p);
 	  if (taken_edge)
 	    {
 	      if (bb_loop_depth (taken_edge->src)
@@ -622,10 +630,11 @@ fsm_find_control_statement_thread_paths
    is a constant.  Record such paths for jump threading.
 
    It is assumed that BB ends with a control statement and that by
-   finding a path where NAME is a constant, we can thread the path.  */
+   finding a path where NAME is a constant, we can thread the path.
+   SPEED_P indicate that we could increase code size to improve the code path */
 
 void  
-find_jump_threads_backwards (basic_block bb)
+find_jump_threads_backwards (basic_block bb, bool speed_p)
 {     
   gimple *stmt = get_gimple_control_stmt (bb);
   if (!stmt)
@@ -655,7 +664,8 @@ find_jump_threads_backwards (basic_block
   hash_set<basic_block> *visited_bbs = new hash_set<basic_block>;
 
   max_threaded_paths = PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATHS);
-  fsm_find_control_statement_thread_paths (name, visited_bbs, bb_path, false);
+  fsm_find_control_statement_thread_paths (name, visited_bbs, bb_path, false,
+					   speed_p);
 
   delete visited_bbs;
   vec_free (bb_path);
@@ -703,7 +713,7 @@ pass_thread_jumps::execute (function *fu
   FOR_EACH_BB_FN (bb, fun)
     {
       if (EDGE_COUNT (bb->succs) > 1)
-	find_jump_threads_backwards (bb);
+	find_jump_threads_backwards (bb, true);
     }
   thread_through_all_blocks (true);
   return 0;
@@ -716,3 +726,59 @@ make_pass_thread_jumps (gcc::context *ct
 {
   return new pass_thread_jumps (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_early_thread_jumps =
+{
+  GIMPLE_PASS,
+  "early_thread",
+  OPTGROUP_NONE,
+  TV_TREE_SSA_THREAD_JUMPS,
+  ( PROP_cfg | PROP_ssa ),
+  0,
+  0,
+  0,
+  ( TODO_cleanup_cfg | TODO_update_ssa ),
+};
+
+class pass_early_thread_jumps : public gimple_opt_pass
+{
+public:
+  pass_early_thread_jumps (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_early_thread_jumps, ctxt)
+  {}
+
+  opt_pass * clone (void) { return new pass_early_thread_jumps (m_ctxt); }
+  virtual bool gate (function *);
+  virtual unsigned int execute (function *);
+};
+
+bool
+pass_early_thread_jumps::gate (function *fun ATTRIBUTE_UNUSED)
+{
+  return true;
+}
+
+
+unsigned int
+pass_early_thread_jumps::execute (function *fun)
+{
+  /* Try to thread each block with more than one successor.  */
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, fun)
+    {
+      if (EDGE_COUNT (bb->succs) > 1)
+	find_jump_threads_backwards (bb, false);
+    }
+  thread_through_all_blocks (true);
+  return 0;
+}
+
+}
+
+gimple_opt_pass *
+make_pass_early_thread_jumps (gcc::context *ctxt)
+{
+  return new pass_early_thread_jumps (ctxt);
+}

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-11 14:02 Early jump threading Jan Hubicka
@ 2016-08-11 14:06 ` Richard Biener
  2016-08-11 14:27   ` Jan Hubicka
  2016-08-11 18:30   ` Jeff Law
  2016-08-11 18:29 ` Jeff Law
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 25+ messages in thread
From: Richard Biener @ 2016-08-11 14:06 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, law

On Thu, 11 Aug 2016, Jan Hubicka wrote:

> Hi,
> this patch adds early jump threading pass.  Jump threading is one of most
> common cases where estimated profile becomes corrupted, because the branches
> are predicted independently beforehand. This patch simply adds early mode to
> jump threader that does not permit code growth and thus only win/win threads
> are done before profile is constructed.
> 
> Naturally this also improves IPA decisions because code size/time is estimated
> more acurately.
> 
> It is not very cool to add another pass and the jump threader is already
> run 5 times. I think incrementally we could drop one of late threaders at least.
> I tried to measure compile time effects but they are in wash. Moreover the patch
> pays for itself in cc1plus code size:
> 
> Before patch to tweak size estimates: 27779964  
> Current mainline:                     27748900  
> With patch applied:                   27716173  
> 
> So despite adding few functions the code size effect is positive which I think
> is quite nice.
> 
> Given the fact that jump threading seems quite lightweit, I wonder why it is
> guarded by flag_expensive_optimizations? Is it expensive in terms of code
> size?
> 
> The effectivity of individual threading passes is as follows (for tramp3d)
> 
>       mainline                              with patch
> pass  thread count     profile mismatches   thread count    profile mismatch
> early                                       525
> 1     1853             1900                 316             337
> 2     4                812                  4               288
> 3     24               1450                 32              947
> 4     76               1457                 75              975
> 
> So at least tramp3d data seems to suggest that we can drop the second occurence
> of jump threading and that most of the job thread1 does can be done by the
> size restricted early version (the lower absolute counts are caused by the
> fact that threadable paths gets duplicated by the inliner). 50% drop in
> profile distortion is not bad. I wonder why basically every threaded paths seems
> to introduce a mismatch.
> 
> The patch distorts testusite somewhat, in most cases we only want to update
> dump files or disable early threading:
> 
> +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
> +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
> +FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
> +FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 1
> +FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not thread around loop and would copy too many statements"
> +FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 "Jumps threaded: 1" 1
> +FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
> +FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate p_.*to 1" 1
> +FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
> +FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: \\\\[1, 1\\\\]"
> 
> This testcase is the now probably unnecesary heuristics (it may still be
> relevant in cases we do not thread because of size bounds but its effectivity
> may not be worth the maintenance cost):
> 
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> 
> If the patch seems acceptable, I will do the updates. One option why I did
> not do that is that it seems to be now posisble to pass parameters to passes
> from passes.def, so perhaps we do not need early_thread_jumps, but doing so is
> consistent with way we handle other early passes.

I wonder why you choose to put the FSM threader early which only does
backward threading(?!).  I'd expect forward threading to be more
profitable (though we don't have a separate threader for that and
would need VRP or DOM - but it seems we will get an early VRP anyway).

Richard.

> Bootstrapped/regtested x86_64-linux
> Honza
> 
> 	* passes.def (pass_early_thread_jumps): Schedule after forwprop.
> 	* tree-pass.h (make_pass_early_thread_jumps): Declare.
> 	* tree-ssa-threadbackward.c (fsm_find_thread_path,
> 	fsm_find_thread_path, profitable_jump_thread_path,
> 	fsm_find_control_statement_thread_paths,
> 	find_jump_threads_backwards): Add speed_p parameter.
> 	(pass_data_early_thread_jumps): New pass.
> 	(make_pass_early_thread_jumps): New function.
> 	
> Index: passes.def
> ===================================================================
> --- passes.def	(revision 239218)
> +++ passes.def	(working copy)
> @@ -84,6 +84,7 @@ along with GCC; see the file COPYING3.
>  	  /* After CCP we rewrite no longer addressed locals into SSA
>  	     form if possible.  */
>  	  NEXT_PASS (pass_forwprop);
> +          NEXT_PASS (pass_early_thread_jumps);
>  	  NEXT_PASS (pass_sra_early);
>  	  /* pass_build_ealias is a dummy pass that ensures that we
>  	     execute TODO_rebuild_alias at this point.  */
> Index: tree-pass.h
> ===================================================================
> --- tree-pass.h	(revision 239218)
> +++ tree-pass.h	(working copy)
> @@ -399,6 +399,7 @@ extern gimple_opt_pass *make_pass_cd_dce
>  extern gimple_opt_pass *make_pass_call_cdce (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_merge_phi (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_thread_jumps (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_early_thread_jumps (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_split_crit_edges (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_laddress (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_pre (gcc::context *ctxt);
> Index: tree-ssa-threadbackward.c
> ===================================================================
> --- tree-ssa-threadbackward.c	(revision 239219)
> +++ tree-ssa-threadbackward.c	(working copy)
> @@ -61,12 +61,14 @@ get_gimple_control_stmt (basic_block bb)
>  /* Return true if the CFG contains at least one path from START_BB to END_BB.
>     When a path is found, record in PATH the blocks from END_BB to START_BB.
>     VISITED_BBS is used to make sure we don't fall into an infinite loop.  Bound
> -   the recursion to basic blocks belonging to LOOP.  */
> +   the recursion to basic blocks belonging to LOOP.
> +   SPEED_P indicate that we could increase code size to improve the code path */
>  
>  static bool
>  fsm_find_thread_path (basic_block start_bb, basic_block end_bb,
>  		      vec<basic_block, va_gc> *&path,
> -		      hash_set<basic_block> *visited_bbs, loop_p loop)
> +		      hash_set<basic_block> *visited_bbs, loop_p loop,
> +		      bool speed_p)
>  {
>    if (loop != start_bb->loop_father)
>      return false;
> @@ -82,7 +84,8 @@ fsm_find_thread_path (basic_block start_
>        edge e;
>        edge_iterator ei;
>        FOR_EACH_EDGE (e, ei, start_bb->succs)
> -	if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, loop))
> +	if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, loop,
> +				  speed_p))
>  	  {
>  	    vec_safe_push (path, start_bb);
>  	    return true;
> @@ -101,11 +104,13 @@ fsm_find_thread_path (basic_block start_
>     value on PATH.  ARG is the value of that SSA_NAME.
>  
>     BBI will be appended to PATH when we have a profitable jump threading
> -   path.  Callers are responsible for removing BBI from PATH in that case. */
> +   path.  Callers are responsible for removing BBI from PATH in that case.
> +
> +   SPEED_P indicate that we could increase code size to improve the code path */
>  
>  static edge
>  profitable_jump_thread_path (vec<basic_block, va_gc> *&path,
> -			     basic_block bbi, tree name, tree arg)
> +			     basic_block bbi, tree name, tree arg, bool speed_p)
>  {
>    /* Note BBI is not in the path yet, hence the +1 in the test below
>       to make sure BBI is accounted for in the path length test.  */
> @@ -306,7 +311,7 @@ profitable_jump_thread_path (vec<basic_b
>        return NULL;
>      }
>  
> -  if (optimize_edge_for_speed_p (taken_edge))
> +  if (speed_p && optimize_edge_for_speed_p (taken_edge))
>      {
>        if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
>  	{
> @@ -421,13 +426,15 @@ convert_and_register_jump_thread_path (v
>  
>  /* We trace the value of the SSA_NAME NAME back through any phi nodes looking
>     for places where it gets a constant value and save the path.  Stop after
> -   having recorded MAX_PATHS jump threading paths.  */
> +   having recorded MAX_PATHS jump threading paths.
> +
> +   SPEED_P indicate that we could increase code size to improve the code path */
>  
>  static void
>  fsm_find_control_statement_thread_paths (tree name,
>  					 hash_set<basic_block> *visited_bbs,
>  					 vec<basic_block, va_gc> *&path,
> -					 bool seen_loop_phi)
> +					 bool seen_loop_phi, bool speed_p)
>  {
>    /* If NAME appears in an abnormal PHI, then don't try to trace its
>       value back through PHI nodes.  */
> @@ -495,7 +502,7 @@ fsm_find_control_statement_thread_paths
>  	  hash_set<basic_block> *visited_bbs = new hash_set<basic_block>;
>  
>  	  if (fsm_find_thread_path (var_bb, e->src, next_path, visited_bbs,
> -				    e->src->loop_father))
> +				    e->src->loop_father, speed_p))
>  	    ++e_count;
>  
>  	  delete visited_bbs;
> @@ -561,7 +568,7 @@ fsm_find_control_statement_thread_paths
>  	      /* Recursively follow SSA_NAMEs looking for a constant
>  		 definition.  */
>  	      fsm_find_control_statement_thread_paths (arg, visited_bbs, path,
> -						       seen_loop_phi);
> +						       seen_loop_phi, speed_p);
>  
>  	      path->pop ();
>  	      continue;
> @@ -572,7 +579,8 @@ fsm_find_control_statement_thread_paths
>  
>  	  /* If this is a profitable jump thread path, then convert it
>  	     into the canonical form and register it.  */
> -	  edge taken_edge = profitable_jump_thread_path (path, bbi, name, arg);
> +	  edge taken_edge = profitable_jump_thread_path (path, bbi, name, arg,
> +							 speed_p);
>  	  if (taken_edge)
>  	    {
>  	      if (bb_loop_depth (taken_edge->src)
> @@ -588,7 +596,7 @@ fsm_find_control_statement_thread_paths
>  
>        if (TREE_CODE (arg) == SSA_NAME)
>  	fsm_find_control_statement_thread_paths (arg, visited_bbs,
> -						 path, seen_loop_phi);
> +						 path, seen_loop_phi, speed_p);
>  
>        else
>  	{
> @@ -598,7 +606,7 @@ fsm_find_control_statement_thread_paths
>  	  path->pop ();
>  
>  	  edge taken_edge = profitable_jump_thread_path (path, var_bb,
> -						     name, arg);
> +						         name, arg, speed_p);
>  	  if (taken_edge)
>  	    {
>  	      if (bb_loop_depth (taken_edge->src)
> @@ -622,10 +630,11 @@ fsm_find_control_statement_thread_paths
>     is a constant.  Record such paths for jump threading.
>  
>     It is assumed that BB ends with a control statement and that by
> -   finding a path where NAME is a constant, we can thread the path.  */
> +   finding a path where NAME is a constant, we can thread the path.
> +   SPEED_P indicate that we could increase code size to improve the code path */
>  
>  void  
> -find_jump_threads_backwards (basic_block bb)
> +find_jump_threads_backwards (basic_block bb, bool speed_p)
>  {     
>    gimple *stmt = get_gimple_control_stmt (bb);
>    if (!stmt)
> @@ -655,7 +664,8 @@ find_jump_threads_backwards (basic_block
>    hash_set<basic_block> *visited_bbs = new hash_set<basic_block>;
>  
>    max_threaded_paths = PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATHS);
> -  fsm_find_control_statement_thread_paths (name, visited_bbs, bb_path, false);
> +  fsm_find_control_statement_thread_paths (name, visited_bbs, bb_path, false,
> +					   speed_p);
>  
>    delete visited_bbs;
>    vec_free (bb_path);
> @@ -703,7 +713,7 @@ pass_thread_jumps::execute (function *fu
>    FOR_EACH_BB_FN (bb, fun)
>      {
>        if (EDGE_COUNT (bb->succs) > 1)
> -	find_jump_threads_backwards (bb);
> +	find_jump_threads_backwards (bb, true);
>      }
>    thread_through_all_blocks (true);
>    return 0;
> @@ -716,3 +726,59 @@ make_pass_thread_jumps (gcc::context *ct
>  {
>    return new pass_thread_jumps (ctxt);
>  }
> +
> +namespace {
> +
> +const pass_data pass_data_early_thread_jumps =
> +{
> +  GIMPLE_PASS,
> +  "early_thread",
> +  OPTGROUP_NONE,
> +  TV_TREE_SSA_THREAD_JUMPS,
> +  ( PROP_cfg | PROP_ssa ),
> +  0,
> +  0,
> +  0,
> +  ( TODO_cleanup_cfg | TODO_update_ssa ),
> +};
> +
> +class pass_early_thread_jumps : public gimple_opt_pass
> +{
> +public:
> +  pass_early_thread_jumps (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_early_thread_jumps, ctxt)
> +  {}
> +
> +  opt_pass * clone (void) { return new pass_early_thread_jumps (m_ctxt); }
> +  virtual bool gate (function *);
> +  virtual unsigned int execute (function *);
> +};
> +
> +bool
> +pass_early_thread_jumps::gate (function *fun ATTRIBUTE_UNUSED)
> +{
> +  return true;
> +}
> +
> +
> +unsigned int
> +pass_early_thread_jumps::execute (function *fun)
> +{
> +  /* Try to thread each block with more than one successor.  */
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, fun)
> +    {
> +      if (EDGE_COUNT (bb->succs) > 1)
> +	find_jump_threads_backwards (bb, false);
> +    }
> +  thread_through_all_blocks (true);
> +  return 0;
> +}
> +
> +}
> +
> +gimple_opt_pass *
> +make_pass_early_thread_jumps (gcc::context *ctxt)
> +{
> +  return new pass_early_thread_jumps (ctxt);
> +}
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-11 14:06 ` Richard Biener
@ 2016-08-11 14:27   ` Jan Hubicka
  2016-08-11 15:50     ` Richard Biener
  2016-08-11 18:41     ` Jeff Law
  2016-08-11 18:30   ` Jeff Law
  1 sibling, 2 replies; 25+ messages in thread
From: Jan Hubicka @ 2016-08-11 14:27 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, gcc-patches, law

> On Thu, 11 Aug 2016, Jan Hubicka wrote:
> 
> > Hi,
> > this patch adds early jump threading pass.  Jump threading is one of most
> > common cases where estimated profile becomes corrupted, because the branches
> > are predicted independently beforehand. This patch simply adds early mode to
> > jump threader that does not permit code growth and thus only win/win threads
> > are done before profile is constructed.
> > 
> > Naturally this also improves IPA decisions because code size/time is estimated
> > more acurately.
> > 
> > It is not very cool to add another pass and the jump threader is already
> > run 5 times. I think incrementally we could drop one of late threaders at least.
> > I tried to measure compile time effects but they are in wash. Moreover the patch
> > pays for itself in cc1plus code size:
> > 
> > Before patch to tweak size estimates: 27779964  
> > Current mainline:                     27748900  
> > With patch applied:                   27716173  
> > 
> > So despite adding few functions the code size effect is positive which I think
> > is quite nice.
> > 
> > Given the fact that jump threading seems quite lightweit, I wonder why it is
> > guarded by flag_expensive_optimizations? Is it expensive in terms of code
> > size?
> > 
> > The effectivity of individual threading passes is as follows (for tramp3d)
> > 
> >       mainline                              with patch
> > pass  thread count     profile mismatches   thread count    profile mismatch
> > early                                       525
> > 1     1853             1900                 316             337
> > 2     4                812                  4               288
> > 3     24               1450                 32              947
> > 4     76               1457                 75              975
> > 
> > So at least tramp3d data seems to suggest that we can drop the second occurence
> > of jump threading and that most of the job thread1 does can be done by the
> > size restricted early version (the lower absolute counts are caused by the
> > fact that threadable paths gets duplicated by the inliner). 50% drop in
> > profile distortion is not bad. I wonder why basically every threaded paths seems
> > to introduce a mismatch.
> > 
> > The patch distorts testusite somewhat, in most cases we only want to update
> > dump files or disable early threading:
> > 
> > +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
> > +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
> > +FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
> > +FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 1
> > +FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not thread around loop and would copy too many statements"
> > +FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 "Jumps threaded: 1" 1
> > +FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
> > +FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate p_.*to 1" 1
> > +FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
> > +FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: \\\\[1, 1\\\\]"
> > 
> > This testcase is the now probably unnecesary heuristics (it may still be
> > relevant in cases we do not thread because of size bounds but its effectivity
> > may not be worth the maintenance cost):
> > 
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> > 
> > If the patch seems acceptable, I will do the updates. One option why I did
> > not do that is that it seems to be now posisble to pass parameters to passes
> > from passes.def, so perhaps we do not need early_thread_jumps, but doing so is
> > consistent with way we handle other early passes.
> 
> I wonder why you choose to put the FSM threader early which only does
> backward threading(?!).  I'd expect forward threading to be more
> profitable (though we don't have a separate threader for that and
> would need VRP or DOM - but it seems we will get an early VRP anyway).

On tramp3d all VRP passes threads together 730 branches, all DOM passes 393, so
FSM threading (with 1957 branches) is the most effective one. Perhaps eventually
early VRP can also do bit of work.

I am not 100% sure from where "backward" is comming from. I guess is means that
analysis goes backward from conditionals to definitions: it looks for
conditional driven by a PHI statement that has a constant value on some paths
and duplicates for those. This seems cheap and rather effective way of getting
good part of the threading oppurtunities (most expensive part is probably
identifying and walking paths that will not be threaded at the end).

BTW I wonder if the same analysis can't be done for other instructions where constant
operands are very profitable, like division or multiplication.

Honza

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-11 14:27   ` Jan Hubicka
@ 2016-08-11 15:50     ` Richard Biener
  2016-08-11 18:42       ` Jeff Law
  2016-08-11 18:41     ` Jeff Law
  1 sibling, 1 reply; 25+ messages in thread
From: Richard Biener @ 2016-08-11 15:50 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, law

On August 11, 2016 4:27:00 PM GMT+02:00, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Thu, 11 Aug 2016, Jan Hubicka wrote:
>> 
>> > Hi,
>> > this patch adds early jump threading pass.  Jump threading is one
>of most
>> > common cases where estimated profile becomes corrupted, because the
>branches
>> > are predicted independently beforehand. This patch simply adds
>early mode to
>> > jump threader that does not permit code growth and thus only
>win/win threads
>> > are done before profile is constructed.
>> > 
>> > Naturally this also improves IPA decisions because code size/time
>is estimated
>> > more acurately.
>> > 
>> > It is not very cool to add another pass and the jump threader is
>already
>> > run 5 times. I think incrementally we could drop one of late
>threaders at least.
>> > I tried to measure compile time effects but they are in wash.
>Moreover the patch
>> > pays for itself in cc1plus code size:
>> > 
>> > Before patch to tweak size estimates: 27779964  
>> > Current mainline:                     27748900  
>> > With patch applied:                   27716173  
>> > 
>> > So despite adding few functions the code size effect is positive
>which I think
>> > is quite nice.
>> > 
>> > Given the fact that jump threading seems quite lightweit, I wonder
>why it is
>> > guarded by flag_expensive_optimizations? Is it expensive in terms
>of code
>> > size?
>> > 
>> > The effectivity of individual threading passes is as follows (for
>tramp3d)
>> > 
>> >       mainline                              with patch
>> > pass  thread count     profile mismatches   thread count    profile
>mismatch
>> > early                                       525
>> > 1     1853             1900                 316             337
>> > 2     4                812                  4               288
>> > 3     24               1450                 32              947
>> > 4     76               1457                 75              975
>> > 
>> > So at least tramp3d data seems to suggest that we can drop the
>second occurence
>> > of jump threading and that most of the job thread1 does can be done
>by the
>> > size restricted early version (the lower absolute counts are caused
>by the
>> > fact that threadable paths gets duplicated by the inliner). 50%
>drop in
>> > profile distortion is not bad. I wonder why basically every
>threaded paths seems
>> > to introduce a mismatch.
>> > 
>> > The patch distorts testusite somewhat, in most cases we only want
>to update
>> > dump files or disable early threading:
>> > 
>> > +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
>> > +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
>> > +FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
>> > +FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1
>"Registering FSM" 1
>> > +FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did
>not thread around loop and would copy too many statements"
>> > +FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times
>thread1 "Jumps threaded: 1" 1
>> > +FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
>> > +FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding
>predicate p_.*to 1" 1
>> > +FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps
>threaded: 1"
>> > +FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: \\\\[1,
>1\\\\]"
>> > 
>> > This testcase is the now probably unnecesary heuristics (it may
>still be
>> > relevant in cases we do not thread because of size bounds but its
>effectivity
>> > may not be worth the maintenance cost):
>> > 
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 1
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 1
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 1
>> > +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98 
>scan-tree-dump-times profile_estimate "extra loop exit heuristics of
>edge[^:]*:" 2
>> > +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98 
>scan-tree-dump-times profile_estimate "loop exit heuristics of
>edge[^:]*:" 3
>> > 
>> > If the patch seems acceptable, I will do the updates. One option
>why I did
>> > not do that is that it seems to be now posisble to pass parameters
>to passes
>> > from passes.def, so perhaps we do not need early_thread_jumps, but
>doing so is
>> > consistent with way we handle other early passes.
>> 
>> I wonder why you choose to put the FSM threader early which only does
>> backward threading(?!).  I'd expect forward threading to be more
>> profitable (though we don't have a separate threader for that and
>> would need VRP or DOM - but it seems we will get an early VRP
>anyway).
>
>On tramp3d all VRP passes threads together 730 branches, all DOM passes
>393, so
>FSM threading (with 1957 branches) is the most effective one. Perhaps
>eventually
>early VRP can also do bit of work.
>
>I am not 100% sure from where "backward" is comming from. I guess is
>means that
>analysis goes backward from conditionals to definitions: it looks for
>conditional driven by a PHI statement that has a constant value on some
>paths
>and duplicates for those. This seems cheap and rather effective way of
>getting
>good part of the threading oppurtunities (most expensive part is
>probably
>identifying and walking paths that will not be threaded at the end).

Ah, I thought it was exclusively dealing with threading through back edges which is sth I'd avoid doing early?

>BTW I wonder if the same analysis can't be done for other instructions
>where constant
>operands are very profitable, like division or multiplication.

No idea, but Jeff will likely know.

Richard.

>
>Honza


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-11 14:02 Early jump threading Jan Hubicka
  2016-08-11 14:06 ` Richard Biener
@ 2016-08-11 18:29 ` Jeff Law
  2016-08-16 11:02   ` Jan Hubicka
  2016-08-12  8:02 ` Richard Biener
  2016-08-18 19:55 ` Jeff Law
  3 siblings, 1 reply; 25+ messages in thread
From: Jeff Law @ 2016-08-11 18:29 UTC (permalink / raw)
  To: Jan Hubicka, gcc-patches, rguenther

On 08/11/2016 08:02 AM, Jan Hubicka wrote:
> Hi,
> this patch adds early jump threading pass.  Jump threading is one of most
> common cases where estimated profile becomes corrupted, because the branches
> are predicted independently beforehand. This patch simply adds early mode to
> jump threader that does not permit code growth and thus only win/win threads
> are done before profile is constructed.
>
> Naturally this also improves IPA decisions because code size/time is estimated
> more acurately.
Excellent.  One of the goals here was to enable you to run it early, so 
I'm glad to see it's working out.

>
> It is not very cool to add another pass and the jump threader is already
> run 5 times. I think incrementally we could drop one of late threaders at least.
> I tried to measure compile time effects but they are in wash. Moreover the patch
> pays for itself in cc1plus code size:
Most definitely we want to be dropping calls into the later passes. 
There's analysis to do, but the goal is to drop the old style threading 
from DOM/VRP completely.   It may also be the case that one or more 
passes of the backwards/FSM threader can be avoided, but we should look 
at removing the DOM/VRP threading first.

>
> Before patch to tweak size estimates: 27779964
> Current mainline:                     27748900
> With patch applied:                   27716173
>
> So despite adding few functions the code size effect is positive which I think
> is quite nice.
>
> Given the fact that jump threading seems quite lightweit, I wonder why it is
> guarded by flag_expensive_optimizations? Is it expensive in terms of code
> size?
The DOM/VRP jump threading used to be very expensive because of 
iteration and the ssa updates.  Both of those issues have since been 
addressed.   The backwards/FSM threader is based on an algorithm that 
IIRC is cubic, and we also have some implementation issues that cause it 
to use far more time than it should.

However, in an early mode, we don't want to walk backwards through the 
CFG very far (if at all) and for an early mode we may not want to guard 
on flag_expensive_optimizations.

>
> The effectivity of individual threading passes is as follows (for tramp3d)
>
>       mainline                              with patch
> pass  thread count     profile mismatches   thread count    profile mismatch
> early                                       525
> 1     1853             1900                 316             337
> 2     4                812                  4               288
> 3     24               1450                 32              947
> 4     76               1457                 75              975
>
> So at least tramp3d data seems to suggest that we can drop the second occurence
> of jump threading and that most of the job thread1 does can be done by the
> size restricted early version (the lower absolute counts are caused by the
> fact that threadable paths gets duplicated by the inliner). 50% drop in
> profile distortion is not bad. I wonder why basically every threaded paths seems
> to introduce a mismatch.
I don't think the backwards/FSM threader tries to update the profile 
data at all right now.

>
> The patch distorts testusite somewhat, in most cases we only want to update
> dump files or disable early threading:
>
> +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
> +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
> +FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
> +FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 1
> +FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not thread around loop and would copy too many statements"
> +FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 "Jumps threaded: 1" 1
> +FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
> +FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate p_.*to 1" 1
> +FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
> +FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: \\\\[1, 1\\\\]"
>
> This testcase is the now probably unnecesary heuristics (it may still be
> relevant in cases we do not thread because of size bounds but its effectivity
> may not be worth the maintenance cost):
>
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
>
> If the patch seems acceptable, I will do the updates. One option why I did
> not do that is that it seems to be now posisble to pass parameters to passes
> from passes.def, so perhaps we do not need early_thread_jumps, but doing so is
> consistent with way we handle other early passes.
>
> Bootstrapped/regtested x86_64-linux
> Honza
>
> 	* passes.def (pass_early_thread_jumps): Schedule after forwprop.
> 	* tree-pass.h (make_pass_early_thread_jumps): Declare.
> 	* tree-ssa-threadbackward.c (fsm_find_thread_path,
> 	fsm_find_thread_path, profitable_jump_thread_path,
> 	fsm_find_control_statement_thread_paths,
> 	find_jump_threads_backwards): Add speed_p parameter.
> 	(pass_data_early_thread_jumps): New pass.
> 	(make_pass_early_thread_jumps): New function.
I'll try to take a look at the details shortly.

jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-11 14:06 ` Richard Biener
  2016-08-11 14:27   ` Jan Hubicka
@ 2016-08-11 18:30   ` Jeff Law
  1 sibling, 0 replies; 25+ messages in thread
From: Jeff Law @ 2016-08-11 18:30 UTC (permalink / raw)
  To: Richard Biener, Jan Hubicka; +Cc: gcc-patches

On 08/11/2016 08:06 AM, Richard Biener wrote:
> On Thu, 11 Aug 2016, Jan Hubicka wrote:
>
>> Hi,
>> this patch adds early jump threading pass.  Jump threading is one of most
>> common cases where estimated profile becomes corrupted, because the branches
>> are predicted independently beforehand. This patch simply adds early mode to
>> jump threader that does not permit code growth and thus only win/win threads
>> are done before profile is constructed.
>>
>> Naturally this also improves IPA decisions because code size/time is estimated
>> more acurately.
>>
>> It is not very cool to add another pass and the jump threader is already
>> run 5 times. I think incrementally we could drop one of late threaders at least.
>> I tried to measure compile time effects but they are in wash. Moreover the patch
>> pays for itself in cc1plus code size:
>>
>> Before patch to tweak size estimates: 27779964
>> Current mainline:                     27748900
>> With patch applied:                   27716173
>>
>> So despite adding few functions the code size effect is positive which I think
>> is quite nice.
>>
>> Given the fact that jump threading seems quite lightweit, I wonder why it is
>> guarded by flag_expensive_optimizations? Is it expensive in terms of code
>> size?
>>
>> The effectivity of individual threading passes is as follows (for tramp3d)
>>
>>       mainline                              with patch
>> pass  thread count     profile mismatches   thread count    profile mismatch
>> early                                       525
>> 1     1853             1900                 316             337
>> 2     4                812                  4               288
>> 3     24               1450                 32              947
>> 4     76               1457                 75              975
>>
>> So at least tramp3d data seems to suggest that we can drop the second occurence
>> of jump threading and that most of the job thread1 does can be done by the
>> size restricted early version (the lower absolute counts are caused by the
>> fact that threadable paths gets duplicated by the inliner). 50% drop in
>> profile distortion is not bad. I wonder why basically every threaded paths seems
>> to introduce a mismatch.
>>
>> The patch distorts testusite somewhat, in most cases we only want to update
>> dump files or disable early threading:
>>
>> +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
>> +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
>> +FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
>> +FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 1
>> +FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not thread around loop and would copy too many statements"
>> +FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 "Jumps threaded: 1" 1
>> +FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
>> +FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate p_.*to 1" 1
>> +FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
>> +FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: \\\\[1, 1\\\\]"
>>
>> This testcase is the now probably unnecesary heuristics (it may still be
>> relevant in cases we do not thread because of size bounds but its effectivity
>> may not be worth the maintenance cost):
>>
>> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
>> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
>> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
>> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
>> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
>> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
>> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
>> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
>> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
>>
>> If the patch seems acceptable, I will do the updates. One option why I did
>> not do that is that it seems to be now posisble to pass parameters to passes
>> from passes.def, so perhaps we do not need early_thread_jumps, but doing so is
>> consistent with way we handle other early passes.
>
> I wonder why you choose to put the FSM threader early which only does
> backward threading(?!).  I'd expect forward threading to be more
> profitable (though we don't have a separate threader for that and
> would need VRP or DOM - but it seems we will get an early VRP anyway).
forward/backward refers to how they find threading opportunities.  The 
backward/FSM threader walks back from a conditional through the CFG & 
PHI nodes.  The old DOM/VRP threader utilizes state from forward walks 
during the CFG to try and simplify conditionals when they're encountered.

Jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-11 14:27   ` Jan Hubicka
  2016-08-11 15:50     ` Richard Biener
@ 2016-08-11 18:41     ` Jeff Law
  2016-08-12  6:02       ` Richard Biener
  1 sibling, 1 reply; 25+ messages in thread
From: Jeff Law @ 2016-08-11 18:41 UTC (permalink / raw)
  To: Jan Hubicka, Richard Biener; +Cc: gcc-patches

On 08/11/2016 08:27 AM, Jan Hubicka wrote:
>
> On tramp3d all VRP passes threads together 730 branches, all DOM passes 393, so
> FSM threading (with 1957 branches) is the most effective one. Perhaps eventually
> early VRP can also do bit of work.
That's roughly consistent with what I've seen.  I have some thoughts on 
what DOM's threading pass is catching that we're missing in the 
backward/FSM threader, but I haven't had time to see how big those 
effects really are.

>
> I am not 100% sure from where "backward" is comming from. I guess is means that
> analysis goes backward from conditionals to definitions: it looks for
> conditional driven by a PHI statement that has a constant value on some paths
> and duplicates for those. This seems cheap and rather effective way of getting
> good part of the threading oppurtunities (most expensive part is probably
> identifying and walking paths that will not be threaded at the end).
Correct, forward/backward is based on the direction of analysis.  ie, do 
we start at the conditional and work backwards through the USE-DEF chain 
or do we build a equivalence table as we walk forward through the entire 
function and use the equivalence tables to simplify the conditional.


>
> BTW I wonder if the same analysis can't be done for other instructions where constant
> operands are very profitable, like division or multiplication.
Yes.  There's all kinds of problems that can be solved using a backwards 
walk to identify useful properties on paths, then path duplication to 
isolate the use point and modify it.

So you could (for example) walk backwards from an indirect call and 
identify paths through the CFG where we know the target.  We then can 
use path duplication to isolate that path from the rest and call the 
target function directly.


Jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-11 15:50     ` Richard Biener
@ 2016-08-11 18:42       ` Jeff Law
  2016-09-18 19:30         ` Jan Hubicka
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff Law @ 2016-08-11 18:42 UTC (permalink / raw)
  To: Richard Biener, Jan Hubicka; +Cc: gcc-patches

On 08/11/2016 09:50 AM, Richard Biener wrote:

>
> Ah, I thought it was exclusively dealing with threading through back
> edges which is sth I'd avoid doing early?
No, that's one of the fundamental changes we made for gcc-6, namely 
using the FSM threader for general purpose threading rather than just 
using it for threading across loop backedges.

I've got a few things to do first, but the direction I want to go is to 
use the FSM threader exclusively.

Jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-11 18:41     ` Jeff Law
@ 2016-08-12  6:02       ` Richard Biener
  2016-08-15 17:06         ` Jeff Law
  0 siblings, 1 reply; 25+ messages in thread
From: Richard Biener @ 2016-08-12  6:02 UTC (permalink / raw)
  To: Jeff Law, Jan Hubicka; +Cc: gcc-patches

On August 11, 2016 8:41:53 PM GMT+02:00, Jeff Law <law@redhat.com> wrote:
>On 08/11/2016 08:27 AM, Jan Hubicka wrote:
>>
>> On tramp3d all VRP passes threads together 730 branches, all DOM
>passes 393, so
>> FSM threading (with 1957 branches) is the most effective one. Perhaps
>eventually
>> early VRP can also do bit of work.
>That's roughly consistent with what I've seen.  I have some thoughts on
>
>what DOM's threading pass is catching that we're missing in the 
>backward/FSM threader, but I haven't had time to see how big those 
>effects really are.
>
>>
>> I am not 100% sure from where "backward" is comming from. I guess is
>means that
>> analysis goes backward from conditionals to definitions: it looks for
>> conditional driven by a PHI statement that has a constant value on
>some paths
>> and duplicates for those. This seems cheap and rather effective way
>of getting
>> good part of the threading oppurtunities (most expensive part is
>probably
>> identifying and walking paths that will not be threaded at the end).
>Correct, forward/backward is based on the direction of analysis.  ie,
>do 
>we start at the conditional and work backwards through the USE-DEF
>chain 
>or do we build a equivalence table as we walk forward through the
>entire 
>function and use the equivalence tables to simplify the conditional.
>
>
>>
>> BTW I wonder if the same analysis can't be done for other
>instructions where constant
>> operands are very profitable, like division or multiplication.
>Yes.  There's all kinds of problems that can be solved using a
>backwards 
>walk to identify useful properties on paths, then path duplication to 
>isolate the use point and modify it.
>
>So you could (for example) walk backwards from an indirect call and 
>identify paths through the CFG where we know the target.  We then can 
>use path duplication to isolate that path from the rest and call the 
>target function directly.

Hmm, isn't walking backwards from uses doing a lot of redundant stmt walking compared to walking stmts once in forward direction?  To me it sounds like a 'local' patterns matching like optimization rather than a global one with proper data flow or a lattice?

Richard.

>
>Jeff


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-11 14:02 Early jump threading Jan Hubicka
  2016-08-11 14:06 ` Richard Biener
  2016-08-11 18:29 ` Jeff Law
@ 2016-08-12  8:02 ` Richard Biener
  2016-08-12 11:27   ` Jan Hubicka
  2016-08-16 16:52   ` Jeff Law
  2016-08-18 19:55 ` Jeff Law
  3 siblings, 2 replies; 25+ messages in thread
From: Richard Biener @ 2016-08-12  8:02 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, law

On Thu, 11 Aug 2016, Jan Hubicka wrote:

> Hi,
> this patch adds early jump threading pass.  Jump threading is one of most
> common cases where estimated profile becomes corrupted, because the branches
> are predicted independently beforehand. This patch simply adds early mode to
> jump threader that does not permit code growth and thus only win/win threads
> are done before profile is constructed.
> 
> Naturally this also improves IPA decisions because code size/time is estimated
> more acurately.
> 
> It is not very cool to add another pass and the jump threader is already
> run 5 times. I think incrementally we could drop one of late threaders at least.
> I tried to measure compile time effects but they are in wash. Moreover the patch
> pays for itself in cc1plus code size:
> 
> Before patch to tweak size estimates: 27779964  
> Current mainline:                     27748900  
> With patch applied:                   27716173  
> 
> So despite adding few functions the code size effect is positive which I think
> is quite nice.
> 
> Given the fact that jump threading seems quite lightweit, I wonder why it is
> guarded by flag_expensive_optimizations? Is it expensive in terms of code
> size?
> 
> The effectivity of individual threading passes is as follows (for tramp3d)
> 
>       mainline                              with patch
> pass  thread count     profile mismatches   thread count    profile mismatch
> early                                       525
> 1     1853             1900                 316             337
> 2     4                812                  4               288
> 3     24               1450                 32              947
> 4     76               1457                 75              975
> 
> So at least tramp3d data seems to suggest that we can drop the second occurence
> of jump threading and that most of the job thread1 does can be done by the
> size restricted early version (the lower absolute counts are caused by the
> fact that threadable paths gets duplicated by the inliner). 50% drop in
> profile distortion is not bad. I wonder why basically every threaded paths seems
> to introduce a mismatch.
> 
> The patch distorts testusite somewhat, in most cases we only want to update
> dump files or disable early threading:
> 
> +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
> +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
> +FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
> +FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 1
> +FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not thread around loop and would copy too many statements"
> +FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 "Jumps threaded: 1" 1
> +FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
> +FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate p_.*to 1" 1
> +FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
> +FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: \\\\[1, 1\\\\]"
> 
> This testcase is the now probably unnecesary heuristics (it may still be
> relevant in cases we do not thread because of size bounds but its effectivity
> may not be worth the maintenance cost):
> 
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> 
> If the patch seems acceptable, I will do the updates. One option why I did
> not do that is that it seems to be now posisble to pass parameters to passes
> from passes.def, so perhaps we do not need early_thread_jumps, but doing so is
> consistent with way we handle other early passes.

Few comments on the patch itself below.  Note currently the pass is
enabled at -O[s2]+ only which I think is fine.  We certainly do not
want it at -Og.  (and nobody seems to try making -O1 sane anyway)

Given your statistics we want to remove thread2 and thread3 (I guess
tracer immediately before thread_jumps is odd as well, I'd at least
swap them).  The fact that thread4 threads more than thread3 argues
for DOM being required as a FSM thread enabler rather than the
other way around.  So I'd try with generally moving the threader
_after_ CSE-like optimizations.

Richard.

> Bootstrapped/regtested x86_64-linux
> Honza
> 
> 	* passes.def (pass_early_thread_jumps): Schedule after forwprop.
> 	* tree-pass.h (make_pass_early_thread_jumps): Declare.
> 	* tree-ssa-threadbackward.c (fsm_find_thread_path,
> 	fsm_find_thread_path, profitable_jump_thread_path,
> 	fsm_find_control_statement_thread_paths,
> 	find_jump_threads_backwards): Add speed_p parameter.
> 	(pass_data_early_thread_jumps): New pass.
> 	(make_pass_early_thread_jumps): New function.
> 	
> Index: passes.def
> ===================================================================
> --- passes.def	(revision 239218)
> +++ passes.def	(working copy)
> @@ -84,6 +84,7 @@ along with GCC; see the file COPYING3.
>  	  /* After CCP we rewrite no longer addressed locals into SSA
>  	     form if possible.  */
>  	  NEXT_PASS (pass_forwprop);
> +          NEXT_PASS (pass_early_thread_jumps);

What's the reason for this placement?  I know Jeff argues that
as jump threading helps CSE we need to place it before CSE but
OTOH the FSM style threading relies on copies and redundancies
being optimized already and the above has only constants and copies
being propagated and forwprop left you with lots of dead code
(but it should also have copies and constants propagated but it
leaves PHIs alone, not propagating into them or removing degenerate
ones - sth to fix I guess).

So I'd be interested to see threading statistics when you place
the threading pass after early FRE (or cd_dce).  I guess early
FRE will already handle quite some of the simplistic "threading"
opportunities (it optimizes redundant checks) thus numbers may
even get worse here.

That said - if you put it before early FRE then I'd put it
right after CCP, not after forwprop.



>  	  NEXT_PASS (pass_sra_early);
>  	  /* pass_build_ealias is a dummy pass that ensures that we
>  	     execute TODO_rebuild_alias at this point.  */
> Index: tree-pass.h
> ===================================================================
> --- tree-pass.h	(revision 239218)
> +++ tree-pass.h	(working copy)
> @@ -399,6 +399,7 @@ extern gimple_opt_pass *make_pass_cd_dce
>  extern gimple_opt_pass *make_pass_call_cdce (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_merge_phi (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_thread_jumps (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_early_thread_jumps (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_split_crit_edges (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_laddress (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_pre (gcc::context *ctxt);
> Index: tree-ssa-threadbackward.c
> ===================================================================
> --- tree-ssa-threadbackward.c	(revision 239219)
> +++ tree-ssa-threadbackward.c	(working copy)
> @@ -61,12 +61,14 @@ get_gimple_control_stmt (basic_block bb)
>  /* Return true if the CFG contains at least one path from START_BB to END_BB.
>     When a path is found, record in PATH the blocks from END_BB to START_BB.
>     VISITED_BBS is used to make sure we don't fall into an infinite loop.  Bound
> -   the recursion to basic blocks belonging to LOOP.  */
> +   the recursion to basic blocks belonging to LOOP.
> +   SPEED_P indicate that we could increase code size to improve the code path */
>  
>  static bool
>  fsm_find_thread_path (basic_block start_bb, basic_block end_bb,
>  		      vec<basic_block, va_gc> *&path,
> -		      hash_set<basic_block> *visited_bbs, loop_p loop)
> +		      hash_set<basic_block> *visited_bbs, loop_p loop,
> +		      bool speed_p)
>  {
>    if (loop != start_bb->loop_father)
>      return false;
> @@ -82,7 +84,8 @@ fsm_find_thread_path (basic_block start_
>        edge e;
>        edge_iterator ei;
>        FOR_EACH_EDGE (e, ei, start_bb->succs)
> -	if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, loop))
> +	if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, loop,
> +				  speed_p))
>  	  {
>  	    vec_safe_push (path, start_bb);
>  	    return true;
> @@ -101,11 +104,13 @@ fsm_find_thread_path (basic_block start_
>     value on PATH.  ARG is the value of that SSA_NAME.
>  
>     BBI will be appended to PATH when we have a profitable jump threading
> -   path.  Callers are responsible for removing BBI from PATH in that case. */
> +   path.  Callers are responsible for removing BBI from PATH in that case.
> +
> +   SPEED_P indicate that we could increase code size to improve the code path */
>  
>  static edge
>  profitable_jump_thread_path (vec<basic_block, va_gc> *&path,
> -			     basic_block bbi, tree name, tree arg)
> +			     basic_block bbi, tree name, tree arg, bool speed_p)
>  {
>    /* Note BBI is not in the path yet, hence the +1 in the test below
>       to make sure BBI is accounted for in the path length test.  */
> @@ -306,7 +311,7 @@ profitable_jump_thread_path (vec<basic_b
>        return NULL;
>      }
>  
> -  if (optimize_edge_for_speed_p (taken_edge))
> +  if (speed_p && optimize_edge_for_speed_p (taken_edge))
>      {
>        if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
>  	{
> @@ -421,13 +426,15 @@ convert_and_register_jump_thread_path (v
>  
>  /* We trace the value of the SSA_NAME NAME back through any phi nodes looking
>     for places where it gets a constant value and save the path.  Stop after
> -   having recorded MAX_PATHS jump threading paths.  */
> +   having recorded MAX_PATHS jump threading paths.
> +
> +   SPEED_P indicate that we could increase code size to improve the code path */
>  
>  static void
>  fsm_find_control_statement_thread_paths (tree name,
>  					 hash_set<basic_block> *visited_bbs,
>  					 vec<basic_block, va_gc> *&path,
> -					 bool seen_loop_phi)
> +					 bool seen_loop_phi, bool speed_p)
>  {
>    /* If NAME appears in an abnormal PHI, then don't try to trace its
>       value back through PHI nodes.  */
> @@ -495,7 +502,7 @@ fsm_find_control_statement_thread_paths
>  	  hash_set<basic_block> *visited_bbs = new hash_set<basic_block>;
>  
>  	  if (fsm_find_thread_path (var_bb, e->src, next_path, visited_bbs,
> -				    e->src->loop_father))
> +				    e->src->loop_father, speed_p))
>  	    ++e_count;
>  
>  	  delete visited_bbs;
> @@ -561,7 +568,7 @@ fsm_find_control_statement_thread_paths
>  	      /* Recursively follow SSA_NAMEs looking for a constant
>  		 definition.  */
>  	      fsm_find_control_statement_thread_paths (arg, visited_bbs, path,
> -						       seen_loop_phi);
> +						       seen_loop_phi, speed_p);
>  
>  	      path->pop ();
>  	      continue;
> @@ -572,7 +579,8 @@ fsm_find_control_statement_thread_paths
>  
>  	  /* If this is a profitable jump thread path, then convert it
>  	     into the canonical form and register it.  */
> -	  edge taken_edge = profitable_jump_thread_path (path, bbi, name, arg);
> +	  edge taken_edge = profitable_jump_thread_path (path, bbi, name, arg,
> +							 speed_p);
>  	  if (taken_edge)
>  	    {
>  	      if (bb_loop_depth (taken_edge->src)
> @@ -588,7 +596,7 @@ fsm_find_control_statement_thread_paths
>  
>        if (TREE_CODE (arg) == SSA_NAME)
>  	fsm_find_control_statement_thread_paths (arg, visited_bbs,
> -						 path, seen_loop_phi);
> +						 path, seen_loop_phi, speed_p);
>  
>        else
>  	{
> @@ -598,7 +606,7 @@ fsm_find_control_statement_thread_paths
>  	  path->pop ();
>  
>  	  edge taken_edge = profitable_jump_thread_path (path, var_bb,
> -						     name, arg);
> +						         name, arg, speed_p);
>  	  if (taken_edge)
>  	    {
>  	      if (bb_loop_depth (taken_edge->src)
> @@ -622,10 +630,11 @@ fsm_find_control_statement_thread_paths
>     is a constant.  Record such paths for jump threading.
>  
>     It is assumed that BB ends with a control statement and that by
> -   finding a path where NAME is a constant, we can thread the path.  */
> +   finding a path where NAME is a constant, we can thread the path.
> +   SPEED_P indicate that we could increase code size to improve the code path */
>  
>  void  
> -find_jump_threads_backwards (basic_block bb)
> +find_jump_threads_backwards (basic_block bb, bool speed_p)
>  {     
>    gimple *stmt = get_gimple_control_stmt (bb);
>    if (!stmt)
> @@ -655,7 +664,8 @@ find_jump_threads_backwards (basic_block
>    hash_set<basic_block> *visited_bbs = new hash_set<basic_block>;
>  
>    max_threaded_paths = PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATHS);
> -  fsm_find_control_statement_thread_paths (name, visited_bbs, bb_path, false);
> +  fsm_find_control_statement_thread_paths (name, visited_bbs, bb_path, false,
> +					   speed_p);
>  
>    delete visited_bbs;
>    vec_free (bb_path);
> @@ -703,7 +713,7 @@ pass_thread_jumps::execute (function *fu
>    FOR_EACH_BB_FN (bb, fun)
>      {
>        if (EDGE_COUNT (bb->succs) > 1)
> -	find_jump_threads_backwards (bb);
> +	find_jump_threads_backwards (bb, true);
>      }
>    thread_through_all_blocks (true);
>    return 0;
> @@ -716,3 +726,59 @@ make_pass_thread_jumps (gcc::context *ct
>  {
>    return new pass_thread_jumps (ctxt);
>  }
> +
> +namespace {
> +
> +const pass_data pass_data_early_thread_jumps =
> +{
> +  GIMPLE_PASS,
> +  "early_thread",
> +  OPTGROUP_NONE,
> +  TV_TREE_SSA_THREAD_JUMPS,
> +  ( PROP_cfg | PROP_ssa ),
> +  0,
> +  0,
> +  0,
> +  ( TODO_cleanup_cfg | TODO_update_ssa ),
> +};
> +
> +class pass_early_thread_jumps : public gimple_opt_pass
> +{
> +public:
> +  pass_early_thread_jumps (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_early_thread_jumps, ctxt)
> +  {}
> +
> +  opt_pass * clone (void) { return new pass_early_thread_jumps (m_ctxt); }
> +  virtual bool gate (function *);
> +  virtual unsigned int execute (function *);
> +};
> +
> +bool
> +pass_early_thread_jumps::gate (function *fun ATTRIBUTE_UNUSED)
> +{
> +  return true;
> +}
> +
> +
> +unsigned int
> +pass_early_thread_jumps::execute (function *fun)
> +{
> +  /* Try to thread each block with more than one successor.  */
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, fun)
> +    {
> +      if (EDGE_COUNT (bb->succs) > 1)
> +	find_jump_threads_backwards (bb, false);
> +    }
> +  thread_through_all_blocks (true);
> +  return 0;
> +}
> +
> +}
> +
> +gimple_opt_pass *
> +make_pass_early_thread_jumps (gcc::context *ctxt)
> +{
> +  return new pass_early_thread_jumps (ctxt);
> +}
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-12  8:02 ` Richard Biener
@ 2016-08-12 11:27   ` Jan Hubicka
  2016-08-15 17:25     ` Jeff Law
  2016-08-16 16:52   ` Jeff Law
  1 sibling, 1 reply; 25+ messages in thread
From: Jan Hubicka @ 2016-08-12 11:27 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, gcc-patches, law

> > 	* passes.def (pass_early_thread_jumps): Schedule after forwprop.
> > 	* tree-pass.h (make_pass_early_thread_jumps): Declare.
> > 	* tree-ssa-threadbackward.c (fsm_find_thread_path,
> > 	fsm_find_thread_path, profitable_jump_thread_path,
> > 	fsm_find_control_statement_thread_paths,
> > 	find_jump_threads_backwards): Add speed_p parameter.
> > 	(pass_data_early_thread_jumps): New pass.
> > 	(make_pass_early_thread_jumps): New function.
> > 	
> > Index: passes.def
> > ===================================================================
> > --- passes.def	(revision 239218)
> > +++ passes.def	(working copy)
> > @@ -84,6 +84,7 @@ along with GCC; see the file COPYING3.
> >  	  /* After CCP we rewrite no longer addressed locals into SSA
> >  	     form if possible.  */
> >  	  NEXT_PASS (pass_forwprop);
> > +          NEXT_PASS (pass_early_thread_jumps);
> 
> What's the reason for this placement?  I know Jeff argues that
> as jump threading helps CSE we need to place it before CSE but
> OTOH the FSM style threading relies on copies and redundancies
> being optimized already and the above has only constants and copies
> being propagated and forwprop left you with lots of dead code
> (but it should also have copies and constants propagated but it
> leaves PHIs alone, not propagating into them or removing degenerate
> ones - sth to fix I guess).
> 
> So I'd be interested to see threading statistics when you place
> the threading pass after early FRE (or cd_dce).  I guess early
> FRE will already handle quite some of the simplistic "threading"
> opportunities (it optimizes redundant checks) thus numbers may
> even get worse here.
> 
> That said - if you put it before early FRE then I'd put it
> right after CCP, not after forwprop.

I placed it just after forwprop becasue the pattern it handles:
     bb0:
       x = a COND b;
       if (x) goto ... else goto ...

   Will be transformed into:

     bb0:
       if (a COND b) goto ... else goto ...

is probably going to help the simpistic analysis done by threadbackwards.
I collected data with placement just before FRE and they were
more or less identical.

In general threading helps forward propagators becuase of code specialization
it does. It does not like dead code (as it will get accounted and prevent
duplication), degenerate PHIs (because it will do useless duplication -
something to fix probably), and unpropagated temporaries (because
fsm_find_control_statement_thread_paths does not look into them)

Honza

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-12  6:02       ` Richard Biener
@ 2016-08-15 17:06         ` Jeff Law
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Law @ 2016-08-15 17:06 UTC (permalink / raw)
  To: Richard Biener, Jan Hubicka; +Cc: gcc-patches

On 08/12/2016 12:02 AM, Richard Biener wrote:
>
> Hmm, isn't walking backwards from uses doing a lot of redundant stmt
> walking compared to walking stmts once in forward direction?  To me
> it sounds like a 'local' patterns matching like optimization rather
> than a global one with proper data flow or a lattice?
You end up walking the use-def chain, so you look only at the chain of 
feeding statements.

Forward threading is actually worse because it tries to walk *past* the 
current point in the dominator walk.  For example at a dominance 
frontier we have multiple paths merging -- we will walk all statements 
in the merge block for every incoming path as well as all statements on 
the outgoing paths of the merge block.   We have to update the various 
tables, then unwind them as we work through all those paths.

I have not done a full analysis, but I strongly suspect the backwards 
threader will ultimately end up doing less work, both in the common and 
pathological cases.

jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-12 11:27   ` Jan Hubicka
@ 2016-08-15 17:25     ` Jeff Law
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Law @ 2016-08-15 17:25 UTC (permalink / raw)
  To: Jan Hubicka, Richard Biener; +Cc: gcc-patches

On 08/12/2016 05:27 AM, Jan Hubicka wrote:
>>> 	* passes.def (pass_early_thread_jumps): Schedule after forwprop.
>>> 	* tree-pass.h (make_pass_early_thread_jumps): Declare.
>>> 	* tree-ssa-threadbackward.c (fsm_find_thread_path,
>>> 	fsm_find_thread_path, profitable_jump_thread_path,
>>> 	fsm_find_control_statement_thread_paths,
>>> 	find_jump_threads_backwards): Add speed_p parameter.
>>> 	(pass_data_early_thread_jumps): New pass.
>>> 	(make_pass_early_thread_jumps): New function.
>>> 	
>>> Index: passes.def
>>> ===================================================================
>>> --- passes.def	(revision 239218)
>>> +++ passes.def	(working copy)
>>> @@ -84,6 +84,7 @@ along with GCC; see the file COPYING3.
>>>  	  /* After CCP we rewrite no longer addressed locals into SSA
>>>  	     form if possible.  */
>>>  	  NEXT_PASS (pass_forwprop);
>>> +          NEXT_PASS (pass_early_thread_jumps);
>>
>> What's the reason for this placement?  I know Jeff argues that
>> as jump threading helps CSE we need to place it before CSE but
>> OTOH the FSM style threading relies on copies and redundancies
>> being optimized already and the above has only constants and copies
>> being propagated and forwprop left you with lots of dead code
>> (but it should also have copies and constants propagated but it
>> leaves PHIs alone, not propagating into them or removing degenerate
>> ones - sth to fix I guess).
>>
>> So I'd be interested to see threading statistics when you place
>> the threading pass after early FRE (or cd_dce).  I guess early
>> FRE will already handle quite some of the simplistic "threading"
>> opportunities (it optimizes redundant checks) thus numbers may
>> even get worse here.
>>
>> That said - if you put it before early FRE then I'd put it
>> right after CCP, not after forwprop.
>
> I placed it just after forwprop becasue the pattern it handles:
>      bb0:
>        x = a COND b;
>        if (x) goto ... else goto ...
>
>    Will be transformed into:
>
>      bb0:
>        if (a COND b) goto ... else goto ...
Note that extending the backward threader to handle the former style 
sequence is relatively straightforward.  In fact, building a bit of that 
kind of infrastructure is what I expect to be the biggest source of 
things we're missing relative to the forward threader.

>
> In general threading helps forward propagators becuase of code specialization
> it does. It does not like dead code (as it will get accounted and prevent
> duplication), degenerate PHIs (because it will do useless duplication -
> something to fix probably), and unpropagated temporaries (because
> fsm_find_control_statement_thread_paths does not look into them)
Yes, avoiding useless duplication and factoring identical paths from 
different incoming edges are definitely on the TODO list.  They're 
clearly a source of codesize issues when we compare the backward 
threader to the forward threader.

Presumably for unpropagated temporaries you're referring to cases where 
both operands of a COND_EXPR need to be looked up?  Right now we only 
walk one operand backward to a constant, but we really want to walk both.

jeff
>
> Honza
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-11 18:29 ` Jeff Law
@ 2016-08-16 11:02   ` Jan Hubicka
  2016-08-16 15:42     ` Jeff Law
  0 siblings, 1 reply; 25+ messages in thread
From: Jan Hubicka @ 2016-08-16 11:02 UTC (permalink / raw)
  To: Jeff Law; +Cc: Jan Hubicka, gcc-patches, rguenther

> I don't think the backwards/FSM threader tries to update the profile
> data at all right now.

This seems to be cause of the regression andrew is speaking about.  I wrote updating
of profile after threading few times and there was also Theresa's patch.
I tought all the threaders use common duplication engine and that one does the updating?
I will try to dive into the code, but creating basic blocks with frequency 0 across hot
paths is bad idea - they will be optimized for size.

Honza

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-16 11:02   ` Jan Hubicka
@ 2016-08-16 15:42     ` Jeff Law
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Law @ 2016-08-16 15:42 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, rguenther

On 08/16/2016 05:02 AM, Jan Hubicka wrote:
>> I don't think the backwards/FSM threader tries to update the profile
>> data at all right now.
>
> This seems to be cause of the regression andrew is speaking about.  I wrote updating
> of profile after threading few times and there was also Theresa's patch.
> I tought all the threaders use common duplication engine and that one does the updating?
> I will try to dive into the code, but creating basic blocks with frequency 0 across hot
> paths is bad idea - they will be optimized for size.
No, the updaters are independent.  We'll need to port Theresa's work to 
the newer scheme.

jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-12  8:02 ` Richard Biener
  2016-08-12 11:27   ` Jan Hubicka
@ 2016-08-16 16:52   ` Jeff Law
  1 sibling, 0 replies; 25+ messages in thread
From: Jeff Law @ 2016-08-16 16:52 UTC (permalink / raw)
  To: Richard Biener, Jan Hubicka; +Cc: gcc-patches

On 08/12/2016 02:02 AM, Richard Biener wrote:
> On Thu, 11 Aug 2016, Jan Hubicka wrote:
>
>> Hi,
>> this patch adds early jump threading pass.  Jump threading is one of most
>> common cases where estimated profile becomes corrupted, because the branches
>> are predicted independently beforehand. This patch simply adds early mode to
>> jump threader that does not permit code growth and thus only win/win threads
>> are done before profile is constructed.
>>
>> Naturally this also improves IPA decisions because code size/time is estimated
>> more acurately.
>>
>> It is not very cool to add another pass and the jump threader is already
>> run 5 times. I think incrementally we could drop one of late threaders at least.
>> I tried to measure compile time effects but they are in wash. Moreover the patch
>> pays for itself in cc1plus code size:
>>
>> Before patch to tweak size estimates: 27779964
>> Current mainline:                     27748900
>> With patch applied:                   27716173
>>
>> So despite adding few functions the code size effect is positive which I think
>> is quite nice.
>>
>> Given the fact that jump threading seems quite lightweit, I wonder why it is
>> guarded by flag_expensive_optimizations? Is it expensive in terms of code
>> size?
>>
>> The effectivity of individual threading passes is as follows (for tramp3d)
>>
>>       mainline                              with patch
>> pass  thread count     profile mismatches   thread count    profile mismatch
>> early                                       525
>> 1     1853             1900                 316             337
>> 2     4                812                  4               288
>> 3     24               1450                 32              947
>> 4     76               1457                 75              975
>>
>> So at least tramp3d data seems to suggest that we can drop the second occurence
>> of jump threading and that most of the job thread1 does can be done by the
>> size restricted early version (the lower absolute counts are caused by the
>> fact that threadable paths gets duplicated by the inliner). 50% drop in
>> profile distortion is not bad. I wonder why basically every threaded paths seems
>> to introduce a mismatch.
>>
>> The patch distorts testusite somewhat, in most cases we only want to update
>> dump files or disable early threading:
>>
>> +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
>> +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
>> +FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
>> +FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 1
>> +FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not thread around loop and would copy too many statements"
>> +FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 "Jumps threaded: 1" 1
>> +FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
>> +FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate p_.*to 1" 1
>> +FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
>> +FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: \\\\[1, 1\\\\]"
>>
>> This testcase is the now probably unnecesary heuristics (it may still be
>> relevant in cases we do not thread because of size bounds but its effectivity
>> may not be worth the maintenance cost):
>>
>> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
>> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
>> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
>> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
>> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
>> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
>> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
>> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
>> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
>> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
>>
>> If the patch seems acceptable, I will do the updates. One option why I did
>> not do that is that it seems to be now posisble to pass parameters to passes
>> from passes.def, so perhaps we do not need early_thread_jumps, but doing so is
>> consistent with way we handle other early passes.
>
> Few comments on the patch itself below.  Note currently the pass is
> enabled at -O[s2]+ only which I think is fine.  We certainly do not
> want it at -Og.  (and nobody seems to try making -O1 sane anyway)
>
> Given your statistics we want to remove thread2 and thread3 (I guess
> tracer immediately before thread_jumps is odd as well, I'd at least
> swap them).  The fact that thread4 threads more than thread3 argues
> for DOM being required as a FSM thread enabler rather than the
> other way around.  So I'd try with generally moving the threader
> _after_ CSE-like optimizations.
What we want to be doing is threading *before* CSE-like optimizations 
and removing the DOM/VRP jump threaders.  Based on what I've seen for 
the last 15 years, threading helps CSE far more often than the opposite.

The fact that we do CSE first, then threading is an artifact of the 
threading implementation wanting to exploit the CSE tables.  But that's 
not necessary in a backwards walking threader.

So what we need to be going is figuring out what key things aren't being 
handled so that we can drop the DOM/VRP threaders.  Once that's done we 
will want to look at whether or not any of the standalone threading 
passes continue to make sense.

jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-11 14:02 Early jump threading Jan Hubicka
                   ` (2 preceding siblings ...)
  2016-08-12  8:02 ` Richard Biener
@ 2016-08-18 19:55 ` Jeff Law
  3 siblings, 0 replies; 25+ messages in thread
From: Jeff Law @ 2016-08-18 19:55 UTC (permalink / raw)
  To: Jan Hubicka, gcc-patches, rguenther

On 08/11/2016 08:02 AM, Jan Hubicka wrote:
> Hi,
> this patch adds early jump threading pass.  Jump threading is one of most
> common cases where estimated profile becomes corrupted, because the branches
> are predicted independently beforehand. This patch simply adds early mode to
> jump threader that does not permit code growth and thus only win/win threads
> are done before profile is constructed.
>
> Naturally this also improves IPA decisions because code size/time is estimated
> more acurately.
>
> It is not very cool to add another pass and the jump threader is already
> run 5 times. I think incrementally we could drop one of late threaders at least.
> I tried to measure compile time effects but they are in wash. Moreover the patch
> pays for itself in cc1plus code size:
>
> Before patch to tweak size estimates: 27779964
> Current mainline:                     27748900
> With patch applied:                   27716173
>
> So despite adding few functions the code size effect is positive which I think
> is quite nice.
>
> Given the fact that jump threading seems quite lightweit, I wonder why it is
> guarded by flag_expensive_optimizations? Is it expensive in terms of code
> size?
>
> The effectivity of individual threading passes is as follows (for tramp3d)
>
>       mainline                              with patch
> pass  thread count     profile mismatches   thread count    profile mismatch
> early                                       525
> 1     1853             1900                 316             337
> 2     4                812                  4               288
So the real question here is what did VRP1 threading do between the 
standalone thread1/thread2 passes.  ie, it may look tempting to 
eliminate the standalone thread2 pass, but threading done by VRP1 is 
expected to be pushed into the thread{1,2} passes.  So removing thread2 
is premature at this point.



>
> The patch distorts testusite somewhat, in most cases we only want to update
> dump files or disable early threading:
>
> +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 13)
> +XPASS: gcc.dg/uninit-15.c  (test for warnings, line 23)
> +FAIL: gcc.dg/uninit-15.c  (test for warnings, line 24)
This seems like a step backwards to me.  We're warning about "j", but in 
the context in which its used, we really ought to be warning about "i". 
And we lost a later warning.




> +FAIL: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 1
Moved into early-threading, so adjusting test for that seems appropriate.


> +FAIL: gcc.dg/tree-ssa/pr69196-1.c scan-tree-dump thread1 "FSM did not thread around loop and would copy too many statements"
I'm not sure how to best check this one.  Threading this test too 
aggressively results in something like a 2X bloat in the resulting code 
on the sparc.




> +FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2b.c scan-tree-dump-times thread1 "Jumps threaded: 1" 1
Moved into early threading, so adjusting the test for that seems 
appropriate to me.


> +FAIL: gcc.dg/tree-ssa/ssa-thread-13.c scan-tree-dump thread1 "FSM"
Moved into early threading, so adjusting the test for that seems 
appropriate to me.


> +FAIL: gcc.dg/tree-ssa/vrp01.c scan-tree-dump-times vrp1 "Folding predicate p_.*to 1" 1
So the optimization moved into fre1 as a result of early jump threading 
which is good. We should probably check for that, even if the test is 
badly mis-named now.

But this is showing one of the cases where the backwards threader is 
deficient.  It's not using the conditional as an implied set.  It's a 
known limitation and I'm hoping Andrew's work makes that an easier 
problem to solve.

> +FAIL: gcc.dg/tree-ssa/vrp56.c scan-tree-dump thread1 "Jumps threaded: 1"
Moved into early threading, so adjusting the test for that seems 
appropriate to me.

> +FAIL: gcc.dg/tree-ssa/vrp92.c scan-tree-dump vrp1 "res_.: \\\\[1, 1\\\\]"
Some results adjustment seems to be in order here.  I think verifying 
that early threading finds the obvious jump threads, then vrp is able to 
eliminate the 2nd conditional and the result is a collapsed return i for 
the entire function.

>
> This testcase is the now probably unnecesary heuristics (it may still be
> relevant in cases we do not thread because of size bounds but its effectivity
> may not be worth the maintenance cost):
>
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-1.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 1
> +FAIL: g++.dg/predict-loop-exit-2.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++11  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++14  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "extra loop exit heuristics of edge[^:]*:" 2
> +FAIL: g++.dg/predict-loop-exit-3.C  -std=gnu++98  scan-tree-dump-times profile_estimate "loop exit heuristics of edge[^:]*:" 3
You're the best judge on how to deal with these.  I leave that to your 
discretion.



> If the patch seems acceptable, I will do the updates. One option why I did
> not do that is that it seems to be now posisble to pass parameters to passes
> from passes.def, so perhaps we do not need early_thread_jumps, but doing so is
> consistent with way we handle other early passes.
Your call on this.  One of the things in the related literature is a 
limiter on how far back the use-def chains we go.  I'd envisioned adding 
that limiter and having early jump threading use a different (much 
smaller) limiter.   But I don't think it's critical at this point.


>
> Bootstrapped/regtested x86_64-linux
> Honza
>
> 	* passes.def (pass_early_thread_jumps): Schedule after forwprop.
> 	* tree-pass.h (make_pass_early_thread_jumps): Declare.
> 	* tree-ssa-threadbackward.c (fsm_find_thread_path,
> 	fsm_find_thread_path, profitable_jump_thread_path,
> 	fsm_find_control_statement_thread_paths,
> 	find_jump_threads_backwards): Add speed_p parameter.
> 	(pass_data_early_thread_jumps): New pass.
> 	(make_pass_early_thread_jumps): New function.
LGTM.  OK after fixing up the tests.

Jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-08-11 18:42       ` Jeff Law
@ 2016-09-18 19:30         ` Jan Hubicka
  2016-09-19  6:53           ` Andrew Pinski
  2016-09-19  7:59           ` Early jump threading Christophe Lyon
  0 siblings, 2 replies; 25+ messages in thread
From: Jan Hubicka @ 2016-09-18 19:30 UTC (permalink / raw)
  To: Jeff Law; +Cc: Richard Biener, Jan Hubicka, gcc-patches

Hi,
this is the patch compensating testsuite I commited after re-testing
on x86_64-linux.

Other placements of early_thread_jumps does not work veyr well (at least in
current implementation). Putting it before forwprop disables about 15% of
threadings. Placing it after DCE makes inliner to not see much of benefits
because threading requires a cleanup propagation+DCE after itself.
So unless we extend threader to be smarter or add extra DCE cleanup, i think
this is best placement.

Honza

	* passes.def (pass_early_thread_jumps): Schedule after forwprop.
	* tree-pass.h (make_pass_early_thread_jumps): Declare.
	* tree-ssa-threadbackward.c (fsm_find_thread_path,
	fsm_find_thread_path, profitable_jump_thread_path,
	fsm_find_control_statement_thread_paths,
	find_jump_threads_backwards): Add speed_p parameter.
	(pass_data_early_thread_jumps): New pass.
	(make_pass_early_thread_jumps): New function.

	* g++.dg/predict-loop-exit-1.C: Disable early jump threading.
	* g++.dg/predict-loop-exit-2.C: Disable early jump threading.
	* g++.dg/predict-loop-exit-3.C: Disable early jump threading.
	* gcc.dg/tree-ssa/pr69196-1.c: Disable early jump threading.
	* gcc.dg/tree-ssa/vrp01.c: Disable early jump threading.
	* gcc.dg/tree-ssa/ssa-dom-thread-2b.c: Disable early jump threading.
	* gcc.dg/tree-ssa/pr68198.c: Scan ethread dump.
	* gcc.dg/tree-ssa/ssa-thread-13.c: Scan ethread dump.
	* gcc.dg/tree-ssa/vrp56.c: Scan ethread dump.
	* gcc.dg/tree-ssa/vrp92.c: Scan ethread dump.
	* gcc.dg/uninit-15.c: Swap xfailed and non-xfailed alternative.

Index: passes.def
===================================================================
--- passes.def	(revision 240109)
+++ passes.def	(working copy)
@@ -84,6 +84,7 @@ along with GCC; see the file COPYING3.
 	  /* After CCP we rewrite no longer addressed locals into SSA
 	     form if possible.  */
 	  NEXT_PASS (pass_forwprop);
+          NEXT_PASS (pass_early_thread_jumps);
 	  NEXT_PASS (pass_sra_early);
 	  /* pass_build_ealias is a dummy pass that ensures that we
 	     execute TODO_rebuild_alias at this point.  */
Index: testsuite/g++.dg/predict-loop-exit-1.C
===================================================================
--- testsuite/g++.dg/predict-loop-exit-1.C	(revision 240109)
+++ testsuite/g++.dg/predict-loop-exit-1.C	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-profile_estimate" } */
+/* { dg-options "-O2 -fdump-tree-profile_estimate -fdisable-tree-ethread" } */
 
 int g;
 int foo();
Index: testsuite/g++.dg/predict-loop-exit-2.C
===================================================================
--- testsuite/g++.dg/predict-loop-exit-2.C	(revision 240109)
+++ testsuite/g++.dg/predict-loop-exit-2.C	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-profile_estimate" } */
+/* { dg-options "-O2 -fdump-tree-profile_estimate -fdisable-tree-ethread" } */
 
 int g;
 int foo();
Index: testsuite/g++.dg/predict-loop-exit-3.C
===================================================================
--- testsuite/g++.dg/predict-loop-exit-3.C	(revision 240109)
+++ testsuite/g++.dg/predict-loop-exit-3.C	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-profile_estimate" } */
+/* { dg-options "-O2 -fdump-tree-profile_estimate -fdisable-tree-ethread" } */
 
 int g;
 int foo();
Index: testsuite/gcc.dg/tree-ssa/pr68198.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/pr68198.c	(revision 240109)
+++ testsuite/gcc.dg/tree-ssa/pr68198.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-thread1-details" } */
+/* { dg-options "-O2 -fdump-tree-thread1-details -fdisable-tree-ethread" } */
 
 extern void abort (void);
 
@@ -40,4 +40,4 @@ c_finish_omp_clauses (tree clauses)
 /* There are 3 FSM jump threading opportunities, two of which will
   get filtered out.  */
 /* { dg-final { scan-tree-dump-times "Registering FSM" 1 "thread1"} } */
-/* { dg-final { scan-tree-dump-times "FSM Thread through multiway branch without threading a multiway branch" 2 "thread1"} } */
+/* { dg-final { scan-tree-dump-times "FSM Thread through multiway branch without threading a multiway branch" 2 "ethread"} } */
Index: testsuite/gcc.dg/tree-ssa/pr69196-1.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/pr69196-1.c	(revision 240109)
+++ testsuite/gcc.dg/tree-ssa/pr69196-1.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile { target sparc*-*-* x86_64-*-* } } */
-/* { dg-options "-O2 -fdump-tree-thread1-details" } */
+/* { dg-options "-O2 -fdump-tree-thread1-details -fdisable-tree-ethread" } */
 
 /* { dg-final { scan-tree-dump "FSM did not thread around loop and would copy too many statements" "thread1" } } */
 
Index: testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c	(revision 240109)
+++ testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */ 
-/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-dom2-stats" } */
+/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-dom2-stats -fdisable-tree-ethread" } */
 
 void foo();
 void bla();
Index: testsuite/gcc.dg/tree-ssa/ssa-thread-13.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/ssa-thread-13.c	(revision 240109)
+++ testsuite/gcc.dg/tree-ssa/ssa-thread-13.c	(working copy)
@@ -1,6 +1,6 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-thread1-details" } */
-/* { dg-final { scan-tree-dump "FSM" "thread1" } } */
+/* { dg-options "-O2 -fdump-tree-ethread-details" } */
+/* { dg-final { scan-tree-dump "FSM" "ethread" } } */
 
 typedef struct rtx_def *rtx;
 typedef const struct rtx_def *const_rtx;
Index: testsuite/gcc.dg/tree-ssa/vrp01.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/vrp01.c	(revision 240109)
+++ testsuite/gcc.dg/tree-ssa/vrp01.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-vrp1" } */
+/* { dg-options "-O2 -fdump-tree-vrp1 -fdisable-tree-ethread" } */
 
 int
 foo (int *p, int i)
Index: testsuite/gcc.dg/tree-ssa/vrp56.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/vrp56.c	(revision 240109)
+++ testsuite/gcc.dg/tree-ssa/vrp56.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */ 
-/* { dg-options "-O2 -fdump-tree-thread1-stats" } */
+/* { dg-options "-O2 -fdump-tree-ethread-stats" } */
 typedef struct basic_block_def *basic_block;
 struct basic_block_def;
 struct edge_def;
@@ -38,5 +38,5 @@ cleanup_empty_eh (basic_block bb)
 	foo ();
     }
 }
-/* { dg-final { scan-tree-dump "Jumps threaded: 1" "thread1"} } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 1" "ethread"} } */
 
Index: testsuite/gcc.dg/tree-ssa/vrp92.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/vrp92.c	(revision 240109)
+++ testsuite/gcc.dg/tree-ssa/vrp92.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-vrp1-details" } */
+/* { dg-options "-O2 -fdump-tree-vrp1-details -fdisable-tree-ethread" } */
 
 void bar (void);
 int foo (int i, int j)
Index: tree-pass.h
===================================================================
--- tree-pass.h	(revision 240109)
+++ tree-pass.h	(working copy)
@@ -399,6 +399,7 @@ extern gimple_opt_pass *make_pass_cd_dce
 extern gimple_opt_pass *make_pass_call_cdce (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_merge_phi (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_thread_jumps (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_early_thread_jumps (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_split_crit_edges (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_laddress (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_pre (gcc::context *ctxt);
Index: tree-ssa-threadbackward.c
===================================================================
--- tree-ssa-threadbackward.c	(revision 240109)
+++ tree-ssa-threadbackward.c	(working copy)
@@ -61,12 +61,14 @@ get_gimple_control_stmt (basic_block bb)
 /* Return true if the CFG contains at least one path from START_BB to END_BB.
    When a path is found, record in PATH the blocks from END_BB to START_BB.
    VISITED_BBS is used to make sure we don't fall into an infinite loop.  Bound
-   the recursion to basic blocks belonging to LOOP.  */
+   the recursion to basic blocks belonging to LOOP.
+   SPEED_P indicate that we could increase code size to improve the code path */
 
 static bool
 fsm_find_thread_path (basic_block start_bb, basic_block end_bb,
 		      vec<basic_block, va_gc> *&path,
-		      hash_set<basic_block> *visited_bbs, loop_p loop)
+		      hash_set<basic_block> *visited_bbs, loop_p loop,
+		      bool speed_p)
 {
   if (loop != start_bb->loop_father)
     return false;
@@ -82,7 +84,8 @@ fsm_find_thread_path (basic_block start_
       edge e;
       edge_iterator ei;
       FOR_EACH_EDGE (e, ei, start_bb->succs)
-	if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, loop))
+	if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, loop,
+				  speed_p))
 	  {
 	    vec_safe_push (path, start_bb);
 	    return true;
@@ -101,11 +104,13 @@ fsm_find_thread_path (basic_block start_
    value on PATH.  ARG is the value of that SSA_NAME.
 
    BBI will be appended to PATH when we have a profitable jump threading
-   path.  Callers are responsible for removing BBI from PATH in that case. */
+   path.  Callers are responsible for removing BBI from PATH in that case.
+
+   SPEED_P indicate that we could increase code size to improve the code path */
 
 static edge
 profitable_jump_thread_path (vec<basic_block, va_gc> *&path,
-			     basic_block bbi, tree name, tree arg)
+			     basic_block bbi, tree name, tree arg, bool speed_p)
 {
   /* Note BBI is not in the path yet, hence the +1 in the test below
      to make sure BBI is accounted for in the path length test.  */
@@ -307,7 +312,7 @@ profitable_jump_thread_path (vec<basic_b
       return NULL;
     }
 
-  if (optimize_edge_for_speed_p (taken_edge))
+  if (speed_p && optimize_edge_for_speed_p (taken_edge))
     {
       if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
 	{
@@ -422,13 +427,15 @@ convert_and_register_jump_thread_path (v
 
 /* We trace the value of the SSA_NAME NAME back through any phi nodes looking
    for places where it gets a constant value and save the path.  Stop after
-   having recorded MAX_PATHS jump threading paths.  */
+   having recorded MAX_PATHS jump threading paths.
+
+   SPEED_P indicate that we could increase code size to improve the code path */
 
 static void
 fsm_find_control_statement_thread_paths (tree name,
 					 hash_set<basic_block> *visited_bbs,
 					 vec<basic_block, va_gc> *&path,
-					 bool seen_loop_phi)
+					 bool seen_loop_phi, bool speed_p)
 {
   /* If NAME appears in an abnormal PHI, then don't try to trace its
      value back through PHI nodes.  */
@@ -496,7 +503,7 @@ fsm_find_control_statement_thread_paths
 	  hash_set<basic_block> *visited_bbs = new hash_set<basic_block>;
 
 	  if (fsm_find_thread_path (var_bb, e->src, next_path, visited_bbs,
-				    e->src->loop_father))
+				    e->src->loop_father, speed_p))
 	    ++e_count;
 
 	  delete visited_bbs;
@@ -562,7 +569,7 @@ fsm_find_control_statement_thread_paths
 	      /* Recursively follow SSA_NAMEs looking for a constant
 		 definition.  */
 	      fsm_find_control_statement_thread_paths (arg, visited_bbs, path,
-						       seen_loop_phi);
+						       seen_loop_phi, speed_p);
 
 	      path->pop ();
 	      continue;
@@ -573,7 +580,8 @@ fsm_find_control_statement_thread_paths
 
 	  /* If this is a profitable jump thread path, then convert it
 	     into the canonical form and register it.  */
-	  edge taken_edge = profitable_jump_thread_path (path, bbi, name, arg);
+	  edge taken_edge = profitable_jump_thread_path (path, bbi, name, arg,
+							 speed_p);
 	  if (taken_edge)
 	    {
 	      if (bb_loop_depth (taken_edge->src)
@@ -589,7 +597,7 @@ fsm_find_control_statement_thread_paths
 
       if (TREE_CODE (arg) == SSA_NAME)
 	fsm_find_control_statement_thread_paths (arg, visited_bbs,
-						 path, seen_loop_phi);
+						 path, seen_loop_phi, speed_p);
 
       else
 	{
@@ -599,7 +607,7 @@ fsm_find_control_statement_thread_paths
 	  path->pop ();
 
 	  edge taken_edge = profitable_jump_thread_path (path, var_bb,
-						     name, arg);
+						         name, arg, speed_p);
 	  if (taken_edge)
 	    {
 	      if (bb_loop_depth (taken_edge->src)
@@ -623,10 +631,11 @@ fsm_find_control_statement_thread_paths
    is a constant.  Record such paths for jump threading.
 
    It is assumed that BB ends with a control statement and that by
-   finding a path where NAME is a constant, we can thread the path.  */
+   finding a path where NAME is a constant, we can thread the path.
+   SPEED_P indicate that we could increase code size to improve the code path */
 
 void  
-find_jump_threads_backwards (basic_block bb)
+find_jump_threads_backwards (basic_block bb, bool speed_p)
 {     
   gimple *stmt = get_gimple_control_stmt (bb);
   if (!stmt)
@@ -656,7 +665,8 @@ find_jump_threads_backwards (basic_block
   hash_set<basic_block> *visited_bbs = new hash_set<basic_block>;
 
   max_threaded_paths = PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATHS);
-  fsm_find_control_statement_thread_paths (name, visited_bbs, bb_path, false);
+  fsm_find_control_statement_thread_paths (name, visited_bbs, bb_path, false,
+					   speed_p);
 
   delete visited_bbs;
   vec_free (bb_path);
@@ -706,7 +716,7 @@ pass_thread_jumps::execute (function *fu
   FOR_EACH_BB_FN (bb, fun)
     {
       if (EDGE_COUNT (bb->succs) > 1)
-	find_jump_threads_backwards (bb);
+	find_jump_threads_backwards (bb, true);
     }
   bool changed = thread_through_all_blocks (true);
 
@@ -721,3 +731,59 @@ make_pass_thread_jumps (gcc::context *ct
 {
   return new pass_thread_jumps (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_early_thread_jumps =
+{
+  GIMPLE_PASS,
+  "ethread",
+  OPTGROUP_NONE,
+  TV_TREE_SSA_THREAD_JUMPS,
+  ( PROP_cfg | PROP_ssa ),
+  0,
+  0,
+  0,
+  ( TODO_cleanup_cfg | TODO_update_ssa ),
+};
+
+class pass_early_thread_jumps : public gimple_opt_pass
+{
+public:
+  pass_early_thread_jumps (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_early_thread_jumps, ctxt)
+  {}
+
+  opt_pass * clone (void) { return new pass_early_thread_jumps (m_ctxt); }
+  virtual bool gate (function *);
+  virtual unsigned int execute (function *);
+};
+
+bool
+pass_early_thread_jumps::gate (function *fun ATTRIBUTE_UNUSED)
+{
+  return true;
+}
+
+
+unsigned int
+pass_early_thread_jumps::execute (function *fun)
+{
+  /* Try to thread each block with more than one successor.  */
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, fun)
+    {
+      if (EDGE_COUNT (bb->succs) > 1)
+	find_jump_threads_backwards (bb, false);
+    }
+  thread_through_all_blocks (true);
+  return 0;
+}
+
+}
+
+gimple_opt_pass *
+make_pass_early_thread_jumps (gcc::context *ctxt)
+{
+  return new pass_early_thread_jumps (ctxt);
+}
Index: testsuite/gcc.dg/uninit-15.c
===================================================================
--- testsuite/gcc.dg/uninit-15.c	(revision 240109)
+++ testsuite/gcc.dg/uninit-15.c	(working copy)
@@ -1,16 +1,16 @@
 /* PR tree-optimization/17506
    We issue an uninitialized variable warning at a wrong location at
    line 11, which is very confusing.  Make sure we print out a note to
-   make it less confusing.  (xfailed alternative)
+   make it less confusing.  (not xfailed alternative)
    But it is of course ok if we warn in bar about uninitialized use
-   of j.  (not xfailed alternative)  */
+   of j.  (xfailed alternative)  */
 /* { dg-do compile } */
 /* { dg-options "-O1 -Wuninitialized" } */
 
 inline int
 foo (int i)
 {
-  if (i) /* { dg-warning "used uninitialized in this function" "" { xfail *-*-* } } */
+  if (i) /* { dg-warning "used uninitialized in this function" } */
     return 1;
   return 0;
 }
@@ -20,7 +20,7 @@ void baz (void);
 void
 bar (void)
 {
-  int j; /* { dg-message "note: 'j' was declared here" "" { xfail *-*-* } } */
-  for (; foo (j); ++j)  /* { dg-warning "'j' is used uninitialized" } */
+  int j; /* { dg-message "note: 'j' was declared here" } */
+  for (; foo (j); ++j)  /* { dg-warning "'j' is used uninitialized" "" { xfail *-*-* } } */
     baz ();
 }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-09-18 19:30         ` Jan Hubicka
@ 2016-09-19  6:53           ` Andrew Pinski
  2016-09-19  9:29             ` Jan Hubicka
  2016-09-19  7:59           ` Early jump threading Christophe Lyon
  1 sibling, 1 reply; 25+ messages in thread
From: Andrew Pinski @ 2016-09-19  6:53 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Jeff Law, Richard Biener, GCC Patches

On Mon, Sep 19, 2016 at 2:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> this is the patch compensating testsuite I commited after re-testing
> on x86_64-linux.
>
> Other placements of early_thread_jumps does not work veyr well (at least in
> current implementation). Putting it before forwprop disables about 15% of
> threadings. Placing it after DCE makes inliner to not see much of benefits
> because threading requires a cleanup propagation+DCE after itself.
> So unless we extend threader to be smarter or add extra DCE cleanup, i think
> this is best placement.


This caused (another) 3-4% degradation in coremarks on ThunderX.

Thanks,
Andrew

>
> Honza
>
>         * passes.def (pass_early_thread_jumps): Schedule after forwprop.
>         * tree-pass.h (make_pass_early_thread_jumps): Declare.
>         * tree-ssa-threadbackward.c (fsm_find_thread_path,
>         fsm_find_thread_path, profitable_jump_thread_path,
>         fsm_find_control_statement_thread_paths,
>         find_jump_threads_backwards): Add speed_p parameter.
>         (pass_data_early_thread_jumps): New pass.
>         (make_pass_early_thread_jumps): New function.
>
>         * g++.dg/predict-loop-exit-1.C: Disable early jump threading.
>         * g++.dg/predict-loop-exit-2.C: Disable early jump threading.
>         * g++.dg/predict-loop-exit-3.C: Disable early jump threading.
>         * gcc.dg/tree-ssa/pr69196-1.c: Disable early jump threading.
>         * gcc.dg/tree-ssa/vrp01.c: Disable early jump threading.
>         * gcc.dg/tree-ssa/ssa-dom-thread-2b.c: Disable early jump threading.
>         * gcc.dg/tree-ssa/pr68198.c: Scan ethread dump.
>         * gcc.dg/tree-ssa/ssa-thread-13.c: Scan ethread dump.
>         * gcc.dg/tree-ssa/vrp56.c: Scan ethread dump.
>         * gcc.dg/tree-ssa/vrp92.c: Scan ethread dump.
>         * gcc.dg/uninit-15.c: Swap xfailed and non-xfailed alternative.
>
> Index: passes.def
> ===================================================================
> --- passes.def  (revision 240109)
> +++ passes.def  (working copy)
> @@ -84,6 +84,7 @@ along with GCC; see the file COPYING3.
>           /* After CCP we rewrite no longer addressed locals into SSA
>              form if possible.  */
>           NEXT_PASS (pass_forwprop);
> +          NEXT_PASS (pass_early_thread_jumps);
>           NEXT_PASS (pass_sra_early);
>           /* pass_build_ealias is a dummy pass that ensures that we
>              execute TODO_rebuild_alias at this point.  */
> Index: testsuite/g++.dg/predict-loop-exit-1.C
> ===================================================================
> --- testsuite/g++.dg/predict-loop-exit-1.C      (revision 240109)
> +++ testsuite/g++.dg/predict-loop-exit-1.C      (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-profile_estimate" } */
> +/* { dg-options "-O2 -fdump-tree-profile_estimate -fdisable-tree-ethread" } */
>
>  int g;
>  int foo();
> Index: testsuite/g++.dg/predict-loop-exit-2.C
> ===================================================================
> --- testsuite/g++.dg/predict-loop-exit-2.C      (revision 240109)
> +++ testsuite/g++.dg/predict-loop-exit-2.C      (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-profile_estimate" } */
> +/* { dg-options "-O2 -fdump-tree-profile_estimate -fdisable-tree-ethread" } */
>
>  int g;
>  int foo();
> Index: testsuite/g++.dg/predict-loop-exit-3.C
> ===================================================================
> --- testsuite/g++.dg/predict-loop-exit-3.C      (revision 240109)
> +++ testsuite/g++.dg/predict-loop-exit-3.C      (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-profile_estimate" } */
> +/* { dg-options "-O2 -fdump-tree-profile_estimate -fdisable-tree-ethread" } */
>
>  int g;
>  int foo();
> Index: testsuite/gcc.dg/tree-ssa/pr68198.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/pr68198.c (revision 240109)
> +++ testsuite/gcc.dg/tree-ssa/pr68198.c (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-thread1-details" } */
> +/* { dg-options "-O2 -fdump-tree-thread1-details -fdisable-tree-ethread" } */
>
>  extern void abort (void);
>
> @@ -40,4 +40,4 @@ c_finish_omp_clauses (tree clauses)
>  /* There are 3 FSM jump threading opportunities, two of which will
>    get filtered out.  */
>  /* { dg-final { scan-tree-dump-times "Registering FSM" 1 "thread1"} } */
> -/* { dg-final { scan-tree-dump-times "FSM Thread through multiway branch without threading a multiway branch" 2 "thread1"} } */
> +/* { dg-final { scan-tree-dump-times "FSM Thread through multiway branch without threading a multiway branch" 2 "ethread"} } */
> Index: testsuite/gcc.dg/tree-ssa/pr69196-1.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/pr69196-1.c       (revision 240109)
> +++ testsuite/gcc.dg/tree-ssa/pr69196-1.c       (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile { target sparc*-*-* x86_64-*-* } } */
> -/* { dg-options "-O2 -fdump-tree-thread1-details" } */
> +/* { dg-options "-O2 -fdump-tree-thread1-details -fdisable-tree-ethread" } */
>
>  /* { dg-final { scan-tree-dump "FSM did not thread around loop and would copy too many statements" "thread1" } } */
>
> Index: testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c       (revision 240109)
> +++ testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c       (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-dom2-stats" } */
> +/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-dom2-stats -fdisable-tree-ethread" } */
>
>  void foo();
>  void bla();
> Index: testsuite/gcc.dg/tree-ssa/ssa-thread-13.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/ssa-thread-13.c   (revision 240109)
> +++ testsuite/gcc.dg/tree-ssa/ssa-thread-13.c   (working copy)
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-thread1-details" } */
> -/* { dg-final { scan-tree-dump "FSM" "thread1" } } */
> +/* { dg-options "-O2 -fdump-tree-ethread-details" } */
> +/* { dg-final { scan-tree-dump "FSM" "ethread" } } */
>
>  typedef struct rtx_def *rtx;
>  typedef const struct rtx_def *const_rtx;
> Index: testsuite/gcc.dg/tree-ssa/vrp01.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/vrp01.c   (revision 240109)
> +++ testsuite/gcc.dg/tree-ssa/vrp01.c   (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-vrp1" } */
> +/* { dg-options "-O2 -fdump-tree-vrp1 -fdisable-tree-ethread" } */
>
>  int
>  foo (int *p, int i)
> Index: testsuite/gcc.dg/tree-ssa/vrp56.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/vrp56.c   (revision 240109)
> +++ testsuite/gcc.dg/tree-ssa/vrp56.c   (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-thread1-stats" } */
> +/* { dg-options "-O2 -fdump-tree-ethread-stats" } */
>  typedef struct basic_block_def *basic_block;
>  struct basic_block_def;
>  struct edge_def;
> @@ -38,5 +38,5 @@ cleanup_empty_eh (basic_block bb)
>         foo ();
>      }
>  }
> -/* { dg-final { scan-tree-dump "Jumps threaded: 1" "thread1"} } */
> +/* { dg-final { scan-tree-dump "Jumps threaded: 1" "ethread"} } */
>
> Index: testsuite/gcc.dg/tree-ssa/vrp92.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/vrp92.c   (revision 240109)
> +++ testsuite/gcc.dg/tree-ssa/vrp92.c   (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-vrp1-details" } */
> +/* { dg-options "-O2 -fdump-tree-vrp1-details -fdisable-tree-ethread" } */
>
>  void bar (void);
>  int foo (int i, int j)
> Index: tree-pass.h
> ===================================================================
> --- tree-pass.h (revision 240109)
> +++ tree-pass.h (working copy)
> @@ -399,6 +399,7 @@ extern gimple_opt_pass *make_pass_cd_dce
>  extern gimple_opt_pass *make_pass_call_cdce (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_merge_phi (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_thread_jumps (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_early_thread_jumps (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_split_crit_edges (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_laddress (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_pre (gcc::context *ctxt);
> Index: tree-ssa-threadbackward.c
> ===================================================================
> --- tree-ssa-threadbackward.c   (revision 240109)
> +++ tree-ssa-threadbackward.c   (working copy)
> @@ -61,12 +61,14 @@ get_gimple_control_stmt (basic_block bb)
>  /* Return true if the CFG contains at least one path from START_BB to END_BB.
>     When a path is found, record in PATH the blocks from END_BB to START_BB.
>     VISITED_BBS is used to make sure we don't fall into an infinite loop.  Bound
> -   the recursion to basic blocks belonging to LOOP.  */
> +   the recursion to basic blocks belonging to LOOP.
> +   SPEED_P indicate that we could increase code size to improve the code path */
>
>  static bool
>  fsm_find_thread_path (basic_block start_bb, basic_block end_bb,
>                       vec<basic_block, va_gc> *&path,
> -                     hash_set<basic_block> *visited_bbs, loop_p loop)
> +                     hash_set<basic_block> *visited_bbs, loop_p loop,
> +                     bool speed_p)
>  {
>    if (loop != start_bb->loop_father)
>      return false;
> @@ -82,7 +84,8 @@ fsm_find_thread_path (basic_block start_
>        edge e;
>        edge_iterator ei;
>        FOR_EACH_EDGE (e, ei, start_bb->succs)
> -       if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, loop))
> +       if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, loop,
> +                                 speed_p))
>           {
>             vec_safe_push (path, start_bb);
>             return true;
> @@ -101,11 +104,13 @@ fsm_find_thread_path (basic_block start_
>     value on PATH.  ARG is the value of that SSA_NAME.
>
>     BBI will be appended to PATH when we have a profitable jump threading
> -   path.  Callers are responsible for removing BBI from PATH in that case. */
> +   path.  Callers are responsible for removing BBI from PATH in that case.
> +
> +   SPEED_P indicate that we could increase code size to improve the code path */
>
>  static edge
>  profitable_jump_thread_path (vec<basic_block, va_gc> *&path,
> -                            basic_block bbi, tree name, tree arg)
> +                            basic_block bbi, tree name, tree arg, bool speed_p)
>  {
>    /* Note BBI is not in the path yet, hence the +1 in the test below
>       to make sure BBI is accounted for in the path length test.  */
> @@ -307,7 +312,7 @@ profitable_jump_thread_path (vec<basic_b
>        return NULL;
>      }
>
> -  if (optimize_edge_for_speed_p (taken_edge))
> +  if (speed_p && optimize_edge_for_speed_p (taken_edge))
>      {
>        if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
>         {
> @@ -422,13 +427,15 @@ convert_and_register_jump_thread_path (v
>
>  /* We trace the value of the SSA_NAME NAME back through any phi nodes looking
>     for places where it gets a constant value and save the path.  Stop after
> -   having recorded MAX_PATHS jump threading paths.  */
> +   having recorded MAX_PATHS jump threading paths.
> +
> +   SPEED_P indicate that we could increase code size to improve the code path */
>
>  static void
>  fsm_find_control_statement_thread_paths (tree name,
>                                          hash_set<basic_block> *visited_bbs,
>                                          vec<basic_block, va_gc> *&path,
> -                                        bool seen_loop_phi)
> +                                        bool seen_loop_phi, bool speed_p)
>  {
>    /* If NAME appears in an abnormal PHI, then don't try to trace its
>       value back through PHI nodes.  */
> @@ -496,7 +503,7 @@ fsm_find_control_statement_thread_paths
>           hash_set<basic_block> *visited_bbs = new hash_set<basic_block>;
>
>           if (fsm_find_thread_path (var_bb, e->src, next_path, visited_bbs,
> -                                   e->src->loop_father))
> +                                   e->src->loop_father, speed_p))
>             ++e_count;
>
>           delete visited_bbs;
> @@ -562,7 +569,7 @@ fsm_find_control_statement_thread_paths
>               /* Recursively follow SSA_NAMEs looking for a constant
>                  definition.  */
>               fsm_find_control_statement_thread_paths (arg, visited_bbs, path,
> -                                                      seen_loop_phi);
> +                                                      seen_loop_phi, speed_p);
>
>               path->pop ();
>               continue;
> @@ -573,7 +580,8 @@ fsm_find_control_statement_thread_paths
>
>           /* If this is a profitable jump thread path, then convert it
>              into the canonical form and register it.  */
> -         edge taken_edge = profitable_jump_thread_path (path, bbi, name, arg);
> +         edge taken_edge = profitable_jump_thread_path (path, bbi, name, arg,
> +                                                        speed_p);
>           if (taken_edge)
>             {
>               if (bb_loop_depth (taken_edge->src)
> @@ -589,7 +597,7 @@ fsm_find_control_statement_thread_paths
>
>        if (TREE_CODE (arg) == SSA_NAME)
>         fsm_find_control_statement_thread_paths (arg, visited_bbs,
> -                                                path, seen_loop_phi);
> +                                                path, seen_loop_phi, speed_p);
>
>        else
>         {
> @@ -599,7 +607,7 @@ fsm_find_control_statement_thread_paths
>           path->pop ();
>
>           edge taken_edge = profitable_jump_thread_path (path, var_bb,
> -                                                    name, arg);
> +                                                        name, arg, speed_p);
>           if (taken_edge)
>             {
>               if (bb_loop_depth (taken_edge->src)
> @@ -623,10 +631,11 @@ fsm_find_control_statement_thread_paths
>     is a constant.  Record such paths for jump threading.
>
>     It is assumed that BB ends with a control statement and that by
> -   finding a path where NAME is a constant, we can thread the path.  */
> +   finding a path where NAME is a constant, we can thread the path.
> +   SPEED_P indicate that we could increase code size to improve the code path */
>
>  void
> -find_jump_threads_backwards (basic_block bb)
> +find_jump_threads_backwards (basic_block bb, bool speed_p)
>  {
>    gimple *stmt = get_gimple_control_stmt (bb);
>    if (!stmt)
> @@ -656,7 +665,8 @@ find_jump_threads_backwards (basic_block
>    hash_set<basic_block> *visited_bbs = new hash_set<basic_block>;
>
>    max_threaded_paths = PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATHS);
> -  fsm_find_control_statement_thread_paths (name, visited_bbs, bb_path, false);
> +  fsm_find_control_statement_thread_paths (name, visited_bbs, bb_path, false,
> +                                          speed_p);
>
>    delete visited_bbs;
>    vec_free (bb_path);
> @@ -706,7 +716,7 @@ pass_thread_jumps::execute (function *fu
>    FOR_EACH_BB_FN (bb, fun)
>      {
>        if (EDGE_COUNT (bb->succs) > 1)
> -       find_jump_threads_backwards (bb);
> +       find_jump_threads_backwards (bb, true);
>      }
>    bool changed = thread_through_all_blocks (true);
>
> @@ -721,3 +731,59 @@ make_pass_thread_jumps (gcc::context *ct
>  {
>    return new pass_thread_jumps (ctxt);
>  }
> +
> +namespace {
> +
> +const pass_data pass_data_early_thread_jumps =
> +{
> +  GIMPLE_PASS,
> +  "ethread",
> +  OPTGROUP_NONE,
> +  TV_TREE_SSA_THREAD_JUMPS,
> +  ( PROP_cfg | PROP_ssa ),
> +  0,
> +  0,
> +  0,
> +  ( TODO_cleanup_cfg | TODO_update_ssa ),
> +};
> +
> +class pass_early_thread_jumps : public gimple_opt_pass
> +{
> +public:
> +  pass_early_thread_jumps (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_early_thread_jumps, ctxt)
> +  {}
> +
> +  opt_pass * clone (void) { return new pass_early_thread_jumps (m_ctxt); }
> +  virtual bool gate (function *);
> +  virtual unsigned int execute (function *);
> +};
> +
> +bool
> +pass_early_thread_jumps::gate (function *fun ATTRIBUTE_UNUSED)
> +{
> +  return true;
> +}
> +
> +
> +unsigned int
> +pass_early_thread_jumps::execute (function *fun)
> +{
> +  /* Try to thread each block with more than one successor.  */
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, fun)
> +    {
> +      if (EDGE_COUNT (bb->succs) > 1)
> +       find_jump_threads_backwards (bb, false);
> +    }
> +  thread_through_all_blocks (true);
> +  return 0;
> +}
> +
> +}
> +
> +gimple_opt_pass *
> +make_pass_early_thread_jumps (gcc::context *ctxt)
> +{
> +  return new pass_early_thread_jumps (ctxt);
> +}
> Index: testsuite/gcc.dg/uninit-15.c
> ===================================================================
> --- testsuite/gcc.dg/uninit-15.c        (revision 240109)
> +++ testsuite/gcc.dg/uninit-15.c        (working copy)
> @@ -1,16 +1,16 @@
>  /* PR tree-optimization/17506
>     We issue an uninitialized variable warning at a wrong location at
>     line 11, which is very confusing.  Make sure we print out a note to
> -   make it less confusing.  (xfailed alternative)
> +   make it less confusing.  (not xfailed alternative)
>     But it is of course ok if we warn in bar about uninitialized use
> -   of j.  (not xfailed alternative)  */
> +   of j.  (xfailed alternative)  */
>  /* { dg-do compile } */
>  /* { dg-options "-O1 -Wuninitialized" } */
>
>  inline int
>  foo (int i)
>  {
> -  if (i) /* { dg-warning "used uninitialized in this function" "" { xfail *-*-* } } */
> +  if (i) /* { dg-warning "used uninitialized in this function" } */
>      return 1;
>    return 0;
>  }
> @@ -20,7 +20,7 @@ void baz (void);
>  void
>  bar (void)
>  {
> -  int j; /* { dg-message "note: 'j' was declared here" "" { xfail *-*-* } } */
> -  for (; foo (j); ++j)  /* { dg-warning "'j' is used uninitialized" } */
> +  int j; /* { dg-message "note: 'j' was declared here" } */
> +  for (; foo (j); ++j)  /* { dg-warning "'j' is used uninitialized" "" { xfail *-*-* } } */
>      baz ();
>  }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-09-18 19:30         ` Jan Hubicka
  2016-09-19  6:53           ` Andrew Pinski
@ 2016-09-19  7:59           ` Christophe Lyon
  2016-09-20  8:41             ` Thomas Preudhomme
  1 sibling, 1 reply; 25+ messages in thread
From: Christophe Lyon @ 2016-09-19  7:59 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Jeff Law, Richard Biener, gcc-patches

Hi,


On 18 September 2016 at 20:48, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> this is the patch compensating testsuite I commited after re-testing
> on x86_64-linux.
>
> Other placements of early_thread_jumps does not work veyr well (at least in
> current implementation). Putting it before forwprop disables about 15% of
> threadings. Placing it after DCE makes inliner to not see much of benefits
> because threading requires a cleanup propagation+DCE after itself.
> So unless we extend threader to be smarter or add extra DCE cleanup, i think
> this is best placement.
>
> Honza
>
>         * passes.def (pass_early_thread_jumps): Schedule after forwprop.
>         * tree-pass.h (make_pass_early_thread_jumps): Declare.
>         * tree-ssa-threadbackward.c (fsm_find_thread_path,
>         fsm_find_thread_path, profitable_jump_thread_path,
>         fsm_find_control_statement_thread_paths,
>         find_jump_threads_backwards): Add speed_p parameter.
>         (pass_data_early_thread_jumps): New pass.
>         (make_pass_early_thread_jumps): New function.
>
>         * g++.dg/predict-loop-exit-1.C: Disable early jump threading.
>         * g++.dg/predict-loop-exit-2.C: Disable early jump threading.
>         * g++.dg/predict-loop-exit-3.C: Disable early jump threading.
>         * gcc.dg/tree-ssa/pr69196-1.c: Disable early jump threading.
>         * gcc.dg/tree-ssa/vrp01.c: Disable early jump threading.
>         * gcc.dg/tree-ssa/ssa-dom-thread-2b.c: Disable early jump threading.
>         * gcc.dg/tree-ssa/pr68198.c: Scan ethread dump.
>         * gcc.dg/tree-ssa/ssa-thread-13.c: Scan ethread dump.
>         * gcc.dg/tree-ssa/vrp56.c: Scan ethread dump.
>         * gcc.dg/tree-ssa/vrp92.c: Scan ethread dump.
>         * gcc.dg/uninit-15.c: Swap xfailed and non-xfailed alternative.
>
> Index: passes.def
> ===================================================================
> --- passes.def  (revision 240109)
> +++ passes.def  (working copy)
> @@ -84,6 +84,7 @@ along with GCC; see the file COPYING3.
>           /* After CCP we rewrite no longer addressed locals into SSA
>              form if possible.  */
>           NEXT_PASS (pass_forwprop);
> +          NEXT_PASS (pass_early_thread_jumps);
>           NEXT_PASS (pass_sra_early);
>           /* pass_build_ealias is a dummy pass that ensures that we
>              execute TODO_rebuild_alias at this point.  */
> Index: testsuite/g++.dg/predict-loop-exit-1.C
> ===================================================================
> --- testsuite/g++.dg/predict-loop-exit-1.C      (revision 240109)
> +++ testsuite/g++.dg/predict-loop-exit-1.C      (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-profile_estimate" } */
> +/* { dg-options "-O2 -fdump-tree-profile_estimate -fdisable-tree-ethread" } */
>
>  int g;
>  int foo();
> Index: testsuite/g++.dg/predict-loop-exit-2.C
> ===================================================================
> --- testsuite/g++.dg/predict-loop-exit-2.C      (revision 240109)
> +++ testsuite/g++.dg/predict-loop-exit-2.C      (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-profile_estimate" } */
> +/* { dg-options "-O2 -fdump-tree-profile_estimate -fdisable-tree-ethread" } */
>
>  int g;
>  int foo();
> Index: testsuite/g++.dg/predict-loop-exit-3.C
> ===================================================================
> --- testsuite/g++.dg/predict-loop-exit-3.C      (revision 240109)
> +++ testsuite/g++.dg/predict-loop-exit-3.C      (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-profile_estimate" } */
> +/* { dg-options "-O2 -fdump-tree-profile_estimate -fdisable-tree-ethread" } */
>
>  int g;
>  int foo();
> Index: testsuite/gcc.dg/tree-ssa/pr68198.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/pr68198.c (revision 240109)
> +++ testsuite/gcc.dg/tree-ssa/pr68198.c (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-thread1-details" } */
> +/* { dg-options "-O2 -fdump-tree-thread1-details -fdisable-tree-ethread" } */
>
>  extern void abort (void);
>
> @@ -40,4 +40,4 @@ c_finish_omp_clauses (tree clauses)
>  /* There are 3 FSM jump threading opportunities, two of which will
>    get filtered out.  */
>  /* { dg-final { scan-tree-dump-times "Registering FSM" 1 "thread1"} } */
> -/* { dg-final { scan-tree-dump-times "FSM Thread through multiway branch without threading a multiway branch" 2 "thread1"} } */
> +/* { dg-final { scan-tree-dump-times "FSM Thread through multiway branch without threading a multiway branch" 2 "ethread"} } */

This test does not work, at least on arm* and aarch64*. I'm seeing:
cc1: note: disable pass tree-ethread for functions in the range of [0,
4294967295]

PASS: gcc.dg/tree-ssa/pr68198.c (test for excess errors)
PASS: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 1
gcc.dg/tree-ssa/pr68198.c: dump file does not exist
UNRESOLVED: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times ethread
"FSM Thread through multiway branch without threading a multiway
branch" 2

Christophe

> Index: testsuite/gcc.dg/tree-ssa/pr69196-1.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/pr69196-1.c       (revision 240109)
> +++ testsuite/gcc.dg/tree-ssa/pr69196-1.c       (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile { target sparc*-*-* x86_64-*-* } } */
> -/* { dg-options "-O2 -fdump-tree-thread1-details" } */
> +/* { dg-options "-O2 -fdump-tree-thread1-details -fdisable-tree-ethread" } */
>
>  /* { dg-final { scan-tree-dump "FSM did not thread around loop and would copy too many statements" "thread1" } } */
>
> Index: testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c       (revision 240109)
> +++ testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c       (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-dom2-stats" } */
> +/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-dom2-stats -fdisable-tree-ethread" } */
>
>  void foo();
>  void bla();
> Index: testsuite/gcc.dg/tree-ssa/ssa-thread-13.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/ssa-thread-13.c   (revision 240109)
> +++ testsuite/gcc.dg/tree-ssa/ssa-thread-13.c   (working copy)
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-thread1-details" } */
> -/* { dg-final { scan-tree-dump "FSM" "thread1" } } */
> +/* { dg-options "-O2 -fdump-tree-ethread-details" } */
> +/* { dg-final { scan-tree-dump "FSM" "ethread" } } */
>
>  typedef struct rtx_def *rtx;
>  typedef const struct rtx_def *const_rtx;
> Index: testsuite/gcc.dg/tree-ssa/vrp01.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/vrp01.c   (revision 240109)
> +++ testsuite/gcc.dg/tree-ssa/vrp01.c   (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-vrp1" } */
> +/* { dg-options "-O2 -fdump-tree-vrp1 -fdisable-tree-ethread" } */
>
>  int
>  foo (int *p, int i)
> Index: testsuite/gcc.dg/tree-ssa/vrp56.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/vrp56.c   (revision 240109)
> +++ testsuite/gcc.dg/tree-ssa/vrp56.c   (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-thread1-stats" } */
> +/* { dg-options "-O2 -fdump-tree-ethread-stats" } */
>  typedef struct basic_block_def *basic_block;
>  struct basic_block_def;
>  struct edge_def;
> @@ -38,5 +38,5 @@ cleanup_empty_eh (basic_block bb)
>         foo ();
>      }
>  }
> -/* { dg-final { scan-tree-dump "Jumps threaded: 1" "thread1"} } */
> +/* { dg-final { scan-tree-dump "Jumps threaded: 1" "ethread"} } */
>
> Index: testsuite/gcc.dg/tree-ssa/vrp92.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/vrp92.c   (revision 240109)
> +++ testsuite/gcc.dg/tree-ssa/vrp92.c   (working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-vrp1-details" } */
> +/* { dg-options "-O2 -fdump-tree-vrp1-details -fdisable-tree-ethread" } */
>
>  void bar (void);
>  int foo (int i, int j)
> Index: tree-pass.h
> ===================================================================
> --- tree-pass.h (revision 240109)
> +++ tree-pass.h (working copy)
> @@ -399,6 +399,7 @@ extern gimple_opt_pass *make_pass_cd_dce
>  extern gimple_opt_pass *make_pass_call_cdce (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_merge_phi (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_thread_jumps (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_early_thread_jumps (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_split_crit_edges (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_laddress (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_pre (gcc::context *ctxt);
> Index: tree-ssa-threadbackward.c
> ===================================================================
> --- tree-ssa-threadbackward.c   (revision 240109)
> +++ tree-ssa-threadbackward.c   (working copy)
> @@ -61,12 +61,14 @@ get_gimple_control_stmt (basic_block bb)
>  /* Return true if the CFG contains at least one path from START_BB to END_BB.
>     When a path is found, record in PATH the blocks from END_BB to START_BB.
>     VISITED_BBS is used to make sure we don't fall into an infinite loop.  Bound
> -   the recursion to basic blocks belonging to LOOP.  */
> +   the recursion to basic blocks belonging to LOOP.
> +   SPEED_P indicate that we could increase code size to improve the code path */
>
>  static bool
>  fsm_find_thread_path (basic_block start_bb, basic_block end_bb,
>                       vec<basic_block, va_gc> *&path,
> -                     hash_set<basic_block> *visited_bbs, loop_p loop)
> +                     hash_set<basic_block> *visited_bbs, loop_p loop,
> +                     bool speed_p)
>  {
>    if (loop != start_bb->loop_father)
>      return false;
> @@ -82,7 +84,8 @@ fsm_find_thread_path (basic_block start_
>        edge e;
>        edge_iterator ei;
>        FOR_EACH_EDGE (e, ei, start_bb->succs)
> -       if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, loop))
> +       if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, loop,
> +                                 speed_p))
>           {
>             vec_safe_push (path, start_bb);
>             return true;
> @@ -101,11 +104,13 @@ fsm_find_thread_path (basic_block start_
>     value on PATH.  ARG is the value of that SSA_NAME.
>
>     BBI will be appended to PATH when we have a profitable jump threading
> -   path.  Callers are responsible for removing BBI from PATH in that case. */
> +   path.  Callers are responsible for removing BBI from PATH in that case.
> +
> +   SPEED_P indicate that we could increase code size to improve the code path */
>
>  static edge
>  profitable_jump_thread_path (vec<basic_block, va_gc> *&path,
> -                            basic_block bbi, tree name, tree arg)
> +                            basic_block bbi, tree name, tree arg, bool speed_p)
>  {
>    /* Note BBI is not in the path yet, hence the +1 in the test below
>       to make sure BBI is accounted for in the path length test.  */
> @@ -307,7 +312,7 @@ profitable_jump_thread_path (vec<basic_b
>        return NULL;
>      }
>
> -  if (optimize_edge_for_speed_p (taken_edge))
> +  if (speed_p && optimize_edge_for_speed_p (taken_edge))
>      {
>        if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
>         {
> @@ -422,13 +427,15 @@ convert_and_register_jump_thread_path (v
>
>  /* We trace the value of the SSA_NAME NAME back through any phi nodes looking
>     for places where it gets a constant value and save the path.  Stop after
> -   having recorded MAX_PATHS jump threading paths.  */
> +   having recorded MAX_PATHS jump threading paths.
> +
> +   SPEED_P indicate that we could increase code size to improve the code path */
>
>  static void
>  fsm_find_control_statement_thread_paths (tree name,
>                                          hash_set<basic_block> *visited_bbs,
>                                          vec<basic_block, va_gc> *&path,
> -                                        bool seen_loop_phi)
> +                                        bool seen_loop_phi, bool speed_p)
>  {
>    /* If NAME appears in an abnormal PHI, then don't try to trace its
>       value back through PHI nodes.  */
> @@ -496,7 +503,7 @@ fsm_find_control_statement_thread_paths
>           hash_set<basic_block> *visited_bbs = new hash_set<basic_block>;
>
>           if (fsm_find_thread_path (var_bb, e->src, next_path, visited_bbs,
> -                                   e->src->loop_father))
> +                                   e->src->loop_father, speed_p))
>             ++e_count;
>
>           delete visited_bbs;
> @@ -562,7 +569,7 @@ fsm_find_control_statement_thread_paths
>               /* Recursively follow SSA_NAMEs looking for a constant
>                  definition.  */
>               fsm_find_control_statement_thread_paths (arg, visited_bbs, path,
> -                                                      seen_loop_phi);
> +                                                      seen_loop_phi, speed_p);
>
>               path->pop ();
>               continue;
> @@ -573,7 +580,8 @@ fsm_find_control_statement_thread_paths
>
>           /* If this is a profitable jump thread path, then convert it
>              into the canonical form and register it.  */
> -         edge taken_edge = profitable_jump_thread_path (path, bbi, name, arg);
> +         edge taken_edge = profitable_jump_thread_path (path, bbi, name, arg,
> +                                                        speed_p);
>           if (taken_edge)
>             {
>               if (bb_loop_depth (taken_edge->src)
> @@ -589,7 +597,7 @@ fsm_find_control_statement_thread_paths
>
>        if (TREE_CODE (arg) == SSA_NAME)
>         fsm_find_control_statement_thread_paths (arg, visited_bbs,
> -                                                path, seen_loop_phi);
> +                                                path, seen_loop_phi, speed_p);
>
>        else
>         {
> @@ -599,7 +607,7 @@ fsm_find_control_statement_thread_paths
>           path->pop ();
>
>           edge taken_edge = profitable_jump_thread_path (path, var_bb,
> -                                                    name, arg);
> +                                                        name, arg, speed_p);
>           if (taken_edge)
>             {
>               if (bb_loop_depth (taken_edge->src)
> @@ -623,10 +631,11 @@ fsm_find_control_statement_thread_paths
>     is a constant.  Record such paths for jump threading.
>
>     It is assumed that BB ends with a control statement and that by
> -   finding a path where NAME is a constant, we can thread the path.  */
> +   finding a path where NAME is a constant, we can thread the path.
> +   SPEED_P indicate that we could increase code size to improve the code path */
>
>  void
> -find_jump_threads_backwards (basic_block bb)
> +find_jump_threads_backwards (basic_block bb, bool speed_p)
>  {
>    gimple *stmt = get_gimple_control_stmt (bb);
>    if (!stmt)
> @@ -656,7 +665,8 @@ find_jump_threads_backwards (basic_block
>    hash_set<basic_block> *visited_bbs = new hash_set<basic_block>;
>
>    max_threaded_paths = PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATHS);
> -  fsm_find_control_statement_thread_paths (name, visited_bbs, bb_path, false);
> +  fsm_find_control_statement_thread_paths (name, visited_bbs, bb_path, false,
> +                                          speed_p);
>
>    delete visited_bbs;
>    vec_free (bb_path);
> @@ -706,7 +716,7 @@ pass_thread_jumps::execute (function *fu
>    FOR_EACH_BB_FN (bb, fun)
>      {
>        if (EDGE_COUNT (bb->succs) > 1)
> -       find_jump_threads_backwards (bb);
> +       find_jump_threads_backwards (bb, true);
>      }
>    bool changed = thread_through_all_blocks (true);
>
> @@ -721,3 +731,59 @@ make_pass_thread_jumps (gcc::context *ct
>  {
>    return new pass_thread_jumps (ctxt);
>  }
> +
> +namespace {
> +
> +const pass_data pass_data_early_thread_jumps =
> +{
> +  GIMPLE_PASS,
> +  "ethread",
> +  OPTGROUP_NONE,
> +  TV_TREE_SSA_THREAD_JUMPS,
> +  ( PROP_cfg | PROP_ssa ),
> +  0,
> +  0,
> +  0,
> +  ( TODO_cleanup_cfg | TODO_update_ssa ),
> +};
> +
> +class pass_early_thread_jumps : public gimple_opt_pass
> +{
> +public:
> +  pass_early_thread_jumps (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_early_thread_jumps, ctxt)
> +  {}
> +
> +  opt_pass * clone (void) { return new pass_early_thread_jumps (m_ctxt); }
> +  virtual bool gate (function *);
> +  virtual unsigned int execute (function *);
> +};
> +
> +bool
> +pass_early_thread_jumps::gate (function *fun ATTRIBUTE_UNUSED)
> +{
> +  return true;
> +}
> +
> +
> +unsigned int
> +pass_early_thread_jumps::execute (function *fun)
> +{
> +  /* Try to thread each block with more than one successor.  */
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, fun)
> +    {
> +      if (EDGE_COUNT (bb->succs) > 1)
> +       find_jump_threads_backwards (bb, false);
> +    }
> +  thread_through_all_blocks (true);
> +  return 0;
> +}
> +
> +}
> +
> +gimple_opt_pass *
> +make_pass_early_thread_jumps (gcc::context *ctxt)
> +{
> +  return new pass_early_thread_jumps (ctxt);
> +}
> Index: testsuite/gcc.dg/uninit-15.c
> ===================================================================
> --- testsuite/gcc.dg/uninit-15.c        (revision 240109)
> +++ testsuite/gcc.dg/uninit-15.c        (working copy)
> @@ -1,16 +1,16 @@
>  /* PR tree-optimization/17506
>     We issue an uninitialized variable warning at a wrong location at
>     line 11, which is very confusing.  Make sure we print out a note to
> -   make it less confusing.  (xfailed alternative)
> +   make it less confusing.  (not xfailed alternative)
>     But it is of course ok if we warn in bar about uninitialized use
> -   of j.  (not xfailed alternative)  */
> +   of j.  (xfailed alternative)  */
>  /* { dg-do compile } */
>  /* { dg-options "-O1 -Wuninitialized" } */
>
>  inline int
>  foo (int i)
>  {
> -  if (i) /* { dg-warning "used uninitialized in this function" "" { xfail *-*-* } } */
> +  if (i) /* { dg-warning "used uninitialized in this function" } */
>      return 1;
>    return 0;
>  }
> @@ -20,7 +20,7 @@ void baz (void);
>  void
>  bar (void)
>  {
> -  int j; /* { dg-message "note: 'j' was declared here" "" { xfail *-*-* } } */
> -  for (; foo (j); ++j)  /* { dg-warning "'j' is used uninitialized" } */
> +  int j; /* { dg-message "note: 'j' was declared here" } */
> +  for (; foo (j); ++j)  /* { dg-warning "'j' is used uninitialized" "" { xfail *-*-* } } */
>      baz ();
>  }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-09-19  6:53           ` Andrew Pinski
@ 2016-09-19  9:29             ` Jan Hubicka
  2016-10-18 15:56               ` James Greenhalgh
  0 siblings, 1 reply; 25+ messages in thread
From: Jan Hubicka @ 2016-09-19  9:29 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: Jan Hubicka, Jeff Law, Richard Biener, GCC Patches

> On Mon, Sep 19, 2016 at 2:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> > Hi,
> > this is the patch compensating testsuite I commited after re-testing
> > on x86_64-linux.
> >
> > Other placements of early_thread_jumps does not work veyr well (at least in
> > current implementation). Putting it before forwprop disables about 15% of
> > threadings. Placing it after DCE makes inliner to not see much of benefits
> > because threading requires a cleanup propagation+DCE after itself.
> > So unless we extend threader to be smarter or add extra DCE cleanup, i think
> > this is best placement.
> 
> 
> This caused (another) 3-4% degradation in coremarks on ThunderX.

Hmm, this is interesting. The patch should have "fixed" the previous
degradation by making the profile correct (backward threader still doe not
update it, but because most threading now happens early and profile is built
afterwards this should be less of issue).  I am now looking into the profile
update issues and will try to check why coremarks degrade again.

Honza
> 
> Thanks,
> Andrew
> 
> >
> > Honza
> >
> >         * passes.def (pass_early_thread_jumps): Schedule after forwprop.
> >         * tree-pass.h (make_pass_early_thread_jumps): Declare.
> >         * tree-ssa-threadbackward.c (fsm_find_thread_path,
> >         fsm_find_thread_path, profitable_jump_thread_path,
> >         fsm_find_control_statement_thread_paths,
> >         find_jump_threads_backwards): Add speed_p parameter.
> >         (pass_data_early_thread_jumps): New pass.
> >         (make_pass_early_thread_jumps): New function.
> >
> >         * g++.dg/predict-loop-exit-1.C: Disable early jump threading.
> >         * g++.dg/predict-loop-exit-2.C: Disable early jump threading.
> >         * g++.dg/predict-loop-exit-3.C: Disable early jump threading.
> >         * gcc.dg/tree-ssa/pr69196-1.c: Disable early jump threading.
> >         * gcc.dg/tree-ssa/vrp01.c: Disable early jump threading.
> >         * gcc.dg/tree-ssa/ssa-dom-thread-2b.c: Disable early jump threading.
> >         * gcc.dg/tree-ssa/pr68198.c: Scan ethread dump.
> >         * gcc.dg/tree-ssa/ssa-thread-13.c: Scan ethread dump.
> >         * gcc.dg/tree-ssa/vrp56.c: Scan ethread dump.
> >         * gcc.dg/tree-ssa/vrp92.c: Scan ethread dump.
> >         * gcc.dg/uninit-15.c: Swap xfailed and non-xfailed alternative.
> >
> > Index: passes.def
> > ===================================================================
> > --- passes.def  (revision 240109)
> > +++ passes.def  (working copy)
> > @@ -84,6 +84,7 @@ along with GCC; see the file COPYING3.
> >           /* After CCP we rewrite no longer addressed locals into SSA
> >              form if possible.  */
> >           NEXT_PASS (pass_forwprop);
> > +          NEXT_PASS (pass_early_thread_jumps);
> >           NEXT_PASS (pass_sra_early);
> >           /* pass_build_ealias is a dummy pass that ensures that we
> >              execute TODO_rebuild_alias at this point.  */
> > Index: testsuite/g++.dg/predict-loop-exit-1.C
> > ===================================================================
> > --- testsuite/g++.dg/predict-loop-exit-1.C      (revision 240109)
> > +++ testsuite/g++.dg/predict-loop-exit-1.C      (working copy)
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -fdump-tree-profile_estimate" } */
> > +/* { dg-options "-O2 -fdump-tree-profile_estimate -fdisable-tree-ethread" } */
> >
> >  int g;
> >  int foo();
> > Index: testsuite/g++.dg/predict-loop-exit-2.C
> > ===================================================================
> > --- testsuite/g++.dg/predict-loop-exit-2.C      (revision 240109)
> > +++ testsuite/g++.dg/predict-loop-exit-2.C      (working copy)
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -fdump-tree-profile_estimate" } */
> > +/* { dg-options "-O2 -fdump-tree-profile_estimate -fdisable-tree-ethread" } */
> >
> >  int g;
> >  int foo();
> > Index: testsuite/g++.dg/predict-loop-exit-3.C
> > ===================================================================
> > --- testsuite/g++.dg/predict-loop-exit-3.C      (revision 240109)
> > +++ testsuite/g++.dg/predict-loop-exit-3.C      (working copy)
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -fdump-tree-profile_estimate" } */
> > +/* { dg-options "-O2 -fdump-tree-profile_estimate -fdisable-tree-ethread" } */
> >
> >  int g;
> >  int foo();
> > Index: testsuite/gcc.dg/tree-ssa/pr68198.c
> > ===================================================================
> > --- testsuite/gcc.dg/tree-ssa/pr68198.c (revision 240109)
> > +++ testsuite/gcc.dg/tree-ssa/pr68198.c (working copy)
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -fdump-tree-thread1-details" } */
> > +/* { dg-options "-O2 -fdump-tree-thread1-details -fdisable-tree-ethread" } */
> >
> >  extern void abort (void);
> >
> > @@ -40,4 +40,4 @@ c_finish_omp_clauses (tree clauses)
> >  /* There are 3 FSM jump threading opportunities, two of which will
> >    get filtered out.  */
> >  /* { dg-final { scan-tree-dump-times "Registering FSM" 1 "thread1"} } */
> > -/* { dg-final { scan-tree-dump-times "FSM Thread through multiway branch without threading a multiway branch" 2 "thread1"} } */
> > +/* { dg-final { scan-tree-dump-times "FSM Thread through multiway branch without threading a multiway branch" 2 "ethread"} } */
> > Index: testsuite/gcc.dg/tree-ssa/pr69196-1.c
> > ===================================================================
> > --- testsuite/gcc.dg/tree-ssa/pr69196-1.c       (revision 240109)
> > +++ testsuite/gcc.dg/tree-ssa/pr69196-1.c       (working copy)
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile { target sparc*-*-* x86_64-*-* } } */
> > -/* { dg-options "-O2 -fdump-tree-thread1-details" } */
> > +/* { dg-options "-O2 -fdump-tree-thread1-details -fdisable-tree-ethread" } */
> >
> >  /* { dg-final { scan-tree-dump "FSM did not thread around loop and would copy too many statements" "thread1" } } */
> >
> > Index: testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c
> > ===================================================================
> > --- testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c       (revision 240109)
> > +++ testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c       (working copy)
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-dom2-stats" } */
> > +/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-dom2-stats -fdisable-tree-ethread" } */
> >
> >  void foo();
> >  void bla();
> > Index: testsuite/gcc.dg/tree-ssa/ssa-thread-13.c
> > ===================================================================
> > --- testsuite/gcc.dg/tree-ssa/ssa-thread-13.c   (revision 240109)
> > +++ testsuite/gcc.dg/tree-ssa/ssa-thread-13.c   (working copy)
> > @@ -1,6 +1,6 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -fdump-tree-thread1-details" } */
> > -/* { dg-final { scan-tree-dump "FSM" "thread1" } } */
> > +/* { dg-options "-O2 -fdump-tree-ethread-details" } */
> > +/* { dg-final { scan-tree-dump "FSM" "ethread" } } */
> >
> >  typedef struct rtx_def *rtx;
> >  typedef const struct rtx_def *const_rtx;
> > Index: testsuite/gcc.dg/tree-ssa/vrp01.c
> > ===================================================================
> > --- testsuite/gcc.dg/tree-ssa/vrp01.c   (revision 240109)
> > +++ testsuite/gcc.dg/tree-ssa/vrp01.c   (working copy)
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -fdump-tree-vrp1" } */
> > +/* { dg-options "-O2 -fdump-tree-vrp1 -fdisable-tree-ethread" } */
> >
> >  int
> >  foo (int *p, int i)
> > Index: testsuite/gcc.dg/tree-ssa/vrp56.c
> > ===================================================================
> > --- testsuite/gcc.dg/tree-ssa/vrp56.c   (revision 240109)
> > +++ testsuite/gcc.dg/tree-ssa/vrp56.c   (working copy)
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -fdump-tree-thread1-stats" } */
> > +/* { dg-options "-O2 -fdump-tree-ethread-stats" } */
> >  typedef struct basic_block_def *basic_block;
> >  struct basic_block_def;
> >  struct edge_def;
> > @@ -38,5 +38,5 @@ cleanup_empty_eh (basic_block bb)
> >         foo ();
> >      }
> >  }
> > -/* { dg-final { scan-tree-dump "Jumps threaded: 1" "thread1"} } */
> > +/* { dg-final { scan-tree-dump "Jumps threaded: 1" "ethread"} } */
> >
> > Index: testsuite/gcc.dg/tree-ssa/vrp92.c
> > ===================================================================
> > --- testsuite/gcc.dg/tree-ssa/vrp92.c   (revision 240109)
> > +++ testsuite/gcc.dg/tree-ssa/vrp92.c   (working copy)
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -fdump-tree-vrp1-details" } */
> > +/* { dg-options "-O2 -fdump-tree-vrp1-details -fdisable-tree-ethread" } */
> >
> >  void bar (void);
> >  int foo (int i, int j)
> > Index: tree-pass.h
> > ===================================================================
> > --- tree-pass.h (revision 240109)
> > +++ tree-pass.h (working copy)
> > @@ -399,6 +399,7 @@ extern gimple_opt_pass *make_pass_cd_dce
> >  extern gimple_opt_pass *make_pass_call_cdce (gcc::context *ctxt);
> >  extern gimple_opt_pass *make_pass_merge_phi (gcc::context *ctxt);
> >  extern gimple_opt_pass *make_pass_thread_jumps (gcc::context *ctxt);
> > +extern gimple_opt_pass *make_pass_early_thread_jumps (gcc::context *ctxt);
> >  extern gimple_opt_pass *make_pass_split_crit_edges (gcc::context *ctxt);
> >  extern gimple_opt_pass *make_pass_laddress (gcc::context *ctxt);
> >  extern gimple_opt_pass *make_pass_pre (gcc::context *ctxt);
> > Index: tree-ssa-threadbackward.c
> > ===================================================================
> > --- tree-ssa-threadbackward.c   (revision 240109)
> > +++ tree-ssa-threadbackward.c   (working copy)
> > @@ -61,12 +61,14 @@ get_gimple_control_stmt (basic_block bb)
> >  /* Return true if the CFG contains at least one path from START_BB to END_BB.
> >     When a path is found, record in PATH the blocks from END_BB to START_BB.
> >     VISITED_BBS is used to make sure we don't fall into an infinite loop.  Bound
> > -   the recursion to basic blocks belonging to LOOP.  */
> > +   the recursion to basic blocks belonging to LOOP.
> > +   SPEED_P indicate that we could increase code size to improve the code path */
> >
> >  static bool
> >  fsm_find_thread_path (basic_block start_bb, basic_block end_bb,
> >                       vec<basic_block, va_gc> *&path,
> > -                     hash_set<basic_block> *visited_bbs, loop_p loop)
> > +                     hash_set<basic_block> *visited_bbs, loop_p loop,
> > +                     bool speed_p)
> >  {
> >    if (loop != start_bb->loop_father)
> >      return false;
> > @@ -82,7 +84,8 @@ fsm_find_thread_path (basic_block start_
> >        edge e;
> >        edge_iterator ei;
> >        FOR_EACH_EDGE (e, ei, start_bb->succs)
> > -       if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, loop))
> > +       if (fsm_find_thread_path (e->dest, end_bb, path, visited_bbs, loop,
> > +                                 speed_p))
> >           {
> >             vec_safe_push (path, start_bb);
> >             return true;
> > @@ -101,11 +104,13 @@ fsm_find_thread_path (basic_block start_
> >     value on PATH.  ARG is the value of that SSA_NAME.
> >
> >     BBI will be appended to PATH when we have a profitable jump threading
> > -   path.  Callers are responsible for removing BBI from PATH in that case. */
> > +   path.  Callers are responsible for removing BBI from PATH in that case.
> > +
> > +   SPEED_P indicate that we could increase code size to improve the code path */
> >
> >  static edge
> >  profitable_jump_thread_path (vec<basic_block, va_gc> *&path,
> > -                            basic_block bbi, tree name, tree arg)
> > +                            basic_block bbi, tree name, tree arg, bool speed_p)
> >  {
> >    /* Note BBI is not in the path yet, hence the +1 in the test below
> >       to make sure BBI is accounted for in the path length test.  */
> > @@ -307,7 +312,7 @@ profitable_jump_thread_path (vec<basic_b
> >        return NULL;
> >      }
> >
> > -  if (optimize_edge_for_speed_p (taken_edge))
> > +  if (speed_p && optimize_edge_for_speed_p (taken_edge))
> >      {
> >        if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
> >         {
> > @@ -422,13 +427,15 @@ convert_and_register_jump_thread_path (v
> >
> >  /* We trace the value of the SSA_NAME NAME back through any phi nodes looking
> >     for places where it gets a constant value and save the path.  Stop after
> > -   having recorded MAX_PATHS jump threading paths.  */
> > +   having recorded MAX_PATHS jump threading paths.
> > +
> > +   SPEED_P indicate that we could increase code size to improve the code path */
> >
> >  static void
> >  fsm_find_control_statement_thread_paths (tree name,
> >                                          hash_set<basic_block> *visited_bbs,
> >                                          vec<basic_block, va_gc> *&path,
> > -                                        bool seen_loop_phi)
> > +                                        bool seen_loop_phi, bool speed_p)
> >  {
> >    /* If NAME appears in an abnormal PHI, then don't try to trace its
> >       value back through PHI nodes.  */
> > @@ -496,7 +503,7 @@ fsm_find_control_statement_thread_paths
> >           hash_set<basic_block> *visited_bbs = new hash_set<basic_block>;
> >
> >           if (fsm_find_thread_path (var_bb, e->src, next_path, visited_bbs,
> > -                                   e->src->loop_father))
> > +                                   e->src->loop_father, speed_p))
> >             ++e_count;
> >
> >           delete visited_bbs;
> > @@ -562,7 +569,7 @@ fsm_find_control_statement_thread_paths
> >               /* Recursively follow SSA_NAMEs looking for a constant
> >                  definition.  */
> >               fsm_find_control_statement_thread_paths (arg, visited_bbs, path,
> > -                                                      seen_loop_phi);
> > +                                                      seen_loop_phi, speed_p);
> >
> >               path->pop ();
> >               continue;
> > @@ -573,7 +580,8 @@ fsm_find_control_statement_thread_paths
> >
> >           /* If this is a profitable jump thread path, then convert it
> >              into the canonical form and register it.  */
> > -         edge taken_edge = profitable_jump_thread_path (path, bbi, name, arg);
> > +         edge taken_edge = profitable_jump_thread_path (path, bbi, name, arg,
> > +                                                        speed_p);
> >           if (taken_edge)
> >             {
> >               if (bb_loop_depth (taken_edge->src)
> > @@ -589,7 +597,7 @@ fsm_find_control_statement_thread_paths
> >
> >        if (TREE_CODE (arg) == SSA_NAME)
> >         fsm_find_control_statement_thread_paths (arg, visited_bbs,
> > -                                                path, seen_loop_phi);
> > +                                                path, seen_loop_phi, speed_p);
> >
> >        else
> >         {
> > @@ -599,7 +607,7 @@ fsm_find_control_statement_thread_paths
> >           path->pop ();
> >
> >           edge taken_edge = profitable_jump_thread_path (path, var_bb,
> > -                                                    name, arg);
> > +                                                        name, arg, speed_p);
> >           if (taken_edge)
> >             {
> >               if (bb_loop_depth (taken_edge->src)
> > @@ -623,10 +631,11 @@ fsm_find_control_statement_thread_paths
> >     is a constant.  Record such paths for jump threading.
> >
> >     It is assumed that BB ends with a control statement and that by
> > -   finding a path where NAME is a constant, we can thread the path.  */
> > +   finding a path where NAME is a constant, we can thread the path.
> > +   SPEED_P indicate that we could increase code size to improve the code path */
> >
> >  void
> > -find_jump_threads_backwards (basic_block bb)
> > +find_jump_threads_backwards (basic_block bb, bool speed_p)
> >  {
> >    gimple *stmt = get_gimple_control_stmt (bb);
> >    if (!stmt)
> > @@ -656,7 +665,8 @@ find_jump_threads_backwards (basic_block
> >    hash_set<basic_block> *visited_bbs = new hash_set<basic_block>;
> >
> >    max_threaded_paths = PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATHS);
> > -  fsm_find_control_statement_thread_paths (name, visited_bbs, bb_path, false);
> > +  fsm_find_control_statement_thread_paths (name, visited_bbs, bb_path, false,
> > +                                          speed_p);
> >
> >    delete visited_bbs;
> >    vec_free (bb_path);
> > @@ -706,7 +716,7 @@ pass_thread_jumps::execute (function *fu
> >    FOR_EACH_BB_FN (bb, fun)
> >      {
> >        if (EDGE_COUNT (bb->succs) > 1)
> > -       find_jump_threads_backwards (bb);
> > +       find_jump_threads_backwards (bb, true);
> >      }
> >    bool changed = thread_through_all_blocks (true);
> >
> > @@ -721,3 +731,59 @@ make_pass_thread_jumps (gcc::context *ct
> >  {
> >    return new pass_thread_jumps (ctxt);
> >  }
> > +
> > +namespace {
> > +
> > +const pass_data pass_data_early_thread_jumps =
> > +{
> > +  GIMPLE_PASS,
> > +  "ethread",
> > +  OPTGROUP_NONE,
> > +  TV_TREE_SSA_THREAD_JUMPS,
> > +  ( PROP_cfg | PROP_ssa ),
> > +  0,
> > +  0,
> > +  0,
> > +  ( TODO_cleanup_cfg | TODO_update_ssa ),
> > +};
> > +
> > +class pass_early_thread_jumps : public gimple_opt_pass
> > +{
> > +public:
> > +  pass_early_thread_jumps (gcc::context *ctxt)
> > +    : gimple_opt_pass (pass_data_early_thread_jumps, ctxt)
> > +  {}
> > +
> > +  opt_pass * clone (void) { return new pass_early_thread_jumps (m_ctxt); }
> > +  virtual bool gate (function *);
> > +  virtual unsigned int execute (function *);
> > +};
> > +
> > +bool
> > +pass_early_thread_jumps::gate (function *fun ATTRIBUTE_UNUSED)
> > +{
> > +  return true;
> > +}
> > +
> > +
> > +unsigned int
> > +pass_early_thread_jumps::execute (function *fun)
> > +{
> > +  /* Try to thread each block with more than one successor.  */
> > +  basic_block bb;
> > +  FOR_EACH_BB_FN (bb, fun)
> > +    {
> > +      if (EDGE_COUNT (bb->succs) > 1)
> > +       find_jump_threads_backwards (bb, false);
> > +    }
> > +  thread_through_all_blocks (true);
> > +  return 0;
> > +}
> > +
> > +}
> > +
> > +gimple_opt_pass *
> > +make_pass_early_thread_jumps (gcc::context *ctxt)
> > +{
> > +  return new pass_early_thread_jumps (ctxt);
> > +}
> > Index: testsuite/gcc.dg/uninit-15.c
> > ===================================================================
> > --- testsuite/gcc.dg/uninit-15.c        (revision 240109)
> > +++ testsuite/gcc.dg/uninit-15.c        (working copy)
> > @@ -1,16 +1,16 @@
> >  /* PR tree-optimization/17506
> >     We issue an uninitialized variable warning at a wrong location at
> >     line 11, which is very confusing.  Make sure we print out a note to
> > -   make it less confusing.  (xfailed alternative)
> > +   make it less confusing.  (not xfailed alternative)
> >     But it is of course ok if we warn in bar about uninitialized use
> > -   of j.  (not xfailed alternative)  */
> > +   of j.  (xfailed alternative)  */
> >  /* { dg-do compile } */
> >  /* { dg-options "-O1 -Wuninitialized" } */
> >
> >  inline int
> >  foo (int i)
> >  {
> > -  if (i) /* { dg-warning "used uninitialized in this function" "" { xfail *-*-* } } */
> > +  if (i) /* { dg-warning "used uninitialized in this function" } */
> >      return 1;
> >    return 0;
> >  }
> > @@ -20,7 +20,7 @@ void baz (void);
> >  void
> >  bar (void)
> >  {
> > -  int j; /* { dg-message "note: 'j' was declared here" "" { xfail *-*-* } } */
> > -  for (; foo (j); ++j)  /* { dg-warning "'j' is used uninitialized" } */
> > +  int j; /* { dg-message "note: 'j' was declared here" } */
> > +  for (; foo (j); ++j)  /* { dg-warning "'j' is used uninitialized" "" { xfail *-*-* } } */
> >      baz ();
> >  }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-09-19  7:59           ` Early jump threading Christophe Lyon
@ 2016-09-20  8:41             ` Thomas Preudhomme
  0 siblings, 0 replies; 25+ messages in thread
From: Thomas Preudhomme @ 2016-09-20  8:41 UTC (permalink / raw)
  To: Christophe Lyon, Jan Hubicka; +Cc: Jeff Law, Richard Biener, gcc-patches

On 19/09/16 08:02, Christophe Lyon wrote:
>> Index: testsuite/gcc.dg/tree-ssa/pr68198.c
>> ===================================================================
>> --- testsuite/gcc.dg/tree-ssa/pr68198.c (revision 240109)
>> +++ testsuite/gcc.dg/tree-ssa/pr68198.c (working copy)
>> @@ -1,5 +1,5 @@
>>  /* { dg-do compile } */
>> -/* { dg-options "-O2 -fdump-tree-thread1-details" } */
>> +/* { dg-options "-O2 -fdump-tree-thread1-details -fdisable-tree-ethread" } */
>>
>>  extern void abort (void);
>>
>> @@ -40,4 +40,4 @@ c_finish_omp_clauses (tree clauses)
>>  /* There are 3 FSM jump threading opportunities, two of which will
>>    get filtered out.  */
>>  /* { dg-final { scan-tree-dump-times "Registering FSM" 1 "thread1"} } */
>> -/* { dg-final { scan-tree-dump-times "FSM Thread through multiway branch without threading a multiway branch" 2 "thread1"} } */
>> +/* { dg-final { scan-tree-dump-times "FSM Thread through multiway branch without threading a multiway branch" 2 "ethread"} } */
>
> This test does not work, at least on arm* and aarch64*. I'm seeing:
> cc1: note: disable pass tree-ethread for functions in the range of [0,
> 4294967295]
>
> PASS: gcc.dg/tree-ssa/pr68198.c (test for excess errors)
> PASS: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times thread1 "Registering FSM" 1
> gcc.dg/tree-ssa/pr68198.c: dump file does not exist
> UNRESOLVED: gcc.dg/tree-ssa/pr68198.c scan-tree-dump-times ethread
> "FSM Thread through multiway branch without threading a multiway
> branch" 2

Shouldn't the test enable tree-ethread to have a ethread dump?

Best regards,

Thomas

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Early jump threading
  2016-09-19  9:29             ` Jan Hubicka
@ 2016-10-18 15:56               ` James Greenhalgh
  2016-12-15 11:54                 ` [Patch] Undermine the jump threading cost model to fix PR77445 James Greenhalgh
  0 siblings, 1 reply; 25+ messages in thread
From: James Greenhalgh @ 2016-10-18 15:56 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Andrew Pinski, Jeff Law, Richard Biener, GCC Patches, nd

On Mon, Sep 19, 2016 at 11:22:27AM +0200, Jan Hubicka wrote:
> > On Mon, Sep 19, 2016 at 2:48 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> > > Hi,
> > > this is the patch compensating testsuite I commited after re-testing
> > > on x86_64-linux.
> > >
> > > Other placements of early_thread_jumps does not work veyr well (at least in
> > > current implementation). Putting it before forwprop disables about 15% of
> > > threadings. Placing it after DCE makes inliner to not see much of benefits
> > > because threading requires a cleanup propagation+DCE after itself.
> > > So unless we extend threader to be smarter or add extra DCE cleanup, i think
> > > this is best placement.
> > 
> > 
> > This caused (another) 3-4% degradation in coremarks on ThunderX.
> 
> Hmm, this is interesting. The patch should have "fixed" the previous
> degradation by making the profile correct (backward threader still doe not
> update it, but because most threading now happens early and profile is built
> afterwards this should be less of issue).  I am now looking into the profile
> update issues and will try to check why coremarks degrade again.
 
But, the early threader is running with speed_p set to false (second parameter
to find_jump_threads_backwards)

  unsigned int
  pass_early_thread_jumps::execute (function *fun)
  {
    /* Try to thread each block with more than one successor.  */
    basic_block bb;
    FOR_EACH_BB_FN (bb, fun)
      {
        if (EDGE_COUNT (bb->succs) > 1)
  	find_jump_threads_backwards (bb, false);
      }
    thread_through_all_blocks (true);
    return 0;
  }

So even though profile information is ignored, we think we are compiling
for size and won't thread. The relevant check in profitable_jump_thread_path
is:

  if (speed_p && optimize_edge_for_speed_p (taken_edge))
    {
      <snip>
    }
  else if (n_insns > 1)
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
	fprintf (dump_file, "FSM jump-thread path not considered: "
		 "duplication of %i insns is needed and optimizing for size.\n",
		 n_insns);
      path->pop ();
      return NULL;
    }

Changing false to true in the above hunk looks like it enables some of
the threading we're relying on here.

Thanks,
James

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Patch] Undermine the jump threading cost model to fix PR77445.
  2016-10-18 15:56               ` James Greenhalgh
@ 2016-12-15 11:54                 ` James Greenhalgh
  2016-12-16 12:34                   ` Richard Biener
  0 siblings, 1 reply; 25+ messages in thread
From: James Greenhalgh @ 2016-12-15 11:54 UTC (permalink / raw)
  To: gcc-patches; +Cc: nd, law, hubicka

[-- Attachment #1: Type: text/plain, Size: 3590 bytes --]


Hi,

As mentioned in PR77445, the improvements to the jump threading cost model
this year have caused substantial regressions in the amount of jump threading
we do and the performance of workloads which rely on that threading.

This patch represents the low-bar in fixing the performance issues reported
in PR77445 - by weakening the cost model enough that we thread in a way much
closer to GCC 6. I don't think this patch is likely to be acceptable for
trunk, but I'm posting it for consideration regardless. 

Under the new cost model, if the edge doesn't pass optimize_edge_for_speed_p,
then we don't thread. The problem in late threading is bad edge profile
data makes the edge look cold, and thus it fails optimize_edge_for_speed_p
and is no longer considered a candidate for threading. As an aside, I think
this is the wrong cost model for jump threading, where you get the most
impact if you can resolve unpredictable switch statements - which by their
nature may have multiple cold edges in need of threading.

Early threading should avoid these issues, as there is no edge profile
info yet. optimize_edge_for_speed_p is therefore more likely to hold, but
the condition for threading is:

  if (speed_p && optimize_edge_for_speed_p (taken_edge))
    {
      if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
	{
          [...reject threading...]
        }
    }
  else if (n_insns > 1)
    {
      [...reject threading...]
    }

With speed_p is hardwired to false for the early threader
( pass_early_thread_jumps::execute ):

	find_jump_threads_backwards (bb, false);

So we always fall to the n_insns > 1 case and thus only rarely get to
thread.

In this patch I change that call in pass_early_thread_jumps::execute to
instead look at optimize_bb_for_speed_p (bb) . That allows the speed_p
check to pass in the main threading cost model, and then the
optimize_edge_for_speed_p can also pass. That gets the first stage of
jump-threading back working in a proprietary benchmark which is sensitive
to this optimisation.

To get the rest of the required jump threading, I also have to weaken the
cost model - and this is obviously a hack! The easy hack is to special case
when the taken edge has frequency zero, and permit jump threading there.

I know this patch is likely not the preferred way to fix this. For me that
would be a change to the cost model, which as I mentioned above I think
misses the point about which edges we want to thread. By far the best
fix would be to the junk edge profiling data we create during threading.

However, this patch does fix the performance issues identified in PR77445,
and does highlight a fundamental issue with the early threader (which
doesn't seem to me like it will be effective while it sets speed_p to
false), so I'd like it to be considered for trunk if no better fix appears
before stage 4.

Bootstrapped on x86_64 with no issues. The testsuite changes just reshuffle
which passes spot the threading opportunities.

OK?

Thanks,
James

---
gcc/

2016-12-15  James Greenhalgh  <james.greenhalgh@arm.com>

	PR tree-optimization/77445
	* tree-ssa-threadbackward.c (profitable_jump_thread_path) Work
	around sometimes corrupt edge frequency data.
	(pass_early_thread_jumps::execute): Pass
	optimize_bb_for_speed_p as the speed_p parameter to
	find_jump_threads_backwards to enable threading in more cases.

gcc/testsuite/

2016-12-15  James Greenhalgh  <james.greenhalgh@arm.com>

	PR tree-optimization/77445
	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust options and dump passes.
	* gcc.dg/tree-ssa/pr66752-3.c: Likewise.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Patch-Undermine-the-jump-threading-cost-model-to-fix.patch --]
[-- Type: text/x-patch; name="0001-Patch-Undermine-the-jump-threading-cost-model-to-fix.patch", Size: 3819 bytes --]

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
index 896c8bf..39ec3d6 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-dce2" } */
+/* { dg-options "-O2 -fdump-tree-ethread-details -fdump-tree-dce2" } */
 
 extern int status, pt;
 extern int count;
@@ -34,7 +34,7 @@ foo (int N, int c, int b, int *a)
 
 /* There are 4 FSM jump threading opportunities, all of which will be
    realized, which will eliminate testing of FLAG, completely.  */
-/* { dg-final { scan-tree-dump-times "Registering FSM" 4 "thread1"} } */
+/* { dg-final { scan-tree-dump-times "Registering FSM" 4 "ethread"} } */
 
 /* There should be no assignments or references to FLAG, verify they're
    eliminated as early as possible.  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
index 9a9d1cb..5b087fb 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
@@ -1,8 +1,9 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 16"  "thread1" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 9" "thread2" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 3" "thread3" } } */
+/* { dg-options "-O2 -fdump-tree-ethread-stats -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 16"  "ethread" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 6"  "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 4" "thread2" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 2" "thread3" } } */
 /* { dg-final { scan-tree-dump-not "Jumps threaded"  "dom2" } } */
 /* { dg-final { scan-tree-dump-not "Jumps threaded"  "dom3" } } */
 /* { dg-final { scan-tree-dump-not "Jumps threaded"  "vrp2" } } */
diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c
index 203e20e..0d29ab5 100644
--- a/gcc/tree-ssa-threadbackward.c
+++ b/gcc/tree-ssa-threadbackward.c
@@ -311,7 +311,20 @@ profitable_jump_thread_path (vec<basic_block, va_gc> *&path,
       return NULL;
     }
 
-  if (speed_p && optimize_edge_for_speed_p (taken_edge))
+
+  /* FIXME: Edge frequency can get badly out shape as a result of
+     the jump threading passes.  In those cases,
+     EDGE_FREQUENCY (taken_edge) == 0 , and so trivially fails the
+     test for optimize_edge_for_speed_p.  The correct fix would
+     be to ensure that profiling information coming out of jump threading
+     is meaningful, but in lieu of that add a hack check to this cost model
+     which permits jump threading in the case EDGE_FREQUENCY has been
+     corrupted.  Only do this if the profile info is present and corrupt,
+     not if it is absent.  */
+  if (speed_p
+      && (optimize_edge_for_speed_p (taken_edge)
+	  || (profile_status_for_fn (cfun) != PROFILE_ABSENT
+	      && EDGE_FREQUENCY (taken_edge) == 0)))
     {
       if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
 	{
@@ -870,7 +883,7 @@ pass_early_thread_jumps::execute (function *fun)
   FOR_EACH_BB_FN (bb, fun)
     {
       if (EDGE_COUNT (bb->succs) > 1)
-	find_jump_threads_backwards (bb, false);
+	find_jump_threads_backwards (bb, optimize_bb_for_speed_p (bb));
     }
   thread_through_all_blocks (true);
   return 0;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Patch] Undermine the jump threading cost model to fix PR77445.
  2016-12-15 11:54                 ` [Patch] Undermine the jump threading cost model to fix PR77445 James Greenhalgh
@ 2016-12-16 12:34                   ` Richard Biener
  0 siblings, 0 replies; 25+ messages in thread
From: Richard Biener @ 2016-12-16 12:34 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: GCC Patches, nd, Jeff Law, Jan Hubicka

On Thu, Dec 15, 2016 at 12:33 PM, James Greenhalgh
<james.greenhalgh@arm.com> wrote:
>
> Hi,
>
> As mentioned in PR77445, the improvements to the jump threading cost model
> this year have caused substantial regressions in the amount of jump threading
> we do and the performance of workloads which rely on that threading.
>
> This patch represents the low-bar in fixing the performance issues reported
> in PR77445 - by weakening the cost model enough that we thread in a way much
> closer to GCC 6. I don't think this patch is likely to be acceptable for
> trunk, but I'm posting it for consideration regardless.
>
> Under the new cost model, if the edge doesn't pass optimize_edge_for_speed_p,
> then we don't thread. The problem in late threading is bad edge profile
> data makes the edge look cold, and thus it fails optimize_edge_for_speed_p
> and is no longer considered a candidate for threading. As an aside, I think
> this is the wrong cost model for jump threading, where you get the most
> impact if you can resolve unpredictable switch statements - which by their
> nature may have multiple cold edges in need of threading.
>
> Early threading should avoid these issues, as there is no edge profile
> info yet. optimize_edge_for_speed_p is therefore more likely to hold, but
> the condition for threading is:
>
>   if (speed_p && optimize_edge_for_speed_p (taken_edge))
>     {
>       if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
>         {
>           [...reject threading...]
>         }
>     }
>   else if (n_insns > 1)
>     {
>       [...reject threading...]
>     }
>
> With speed_p is hardwired to false for the early threader
> ( pass_early_thread_jumps::execute ):
>
>         find_jump_threads_backwards (bb, false);
>
> So we always fall to the n_insns > 1 case and thus only rarely get to
> thread.
>
> In this patch I change that call in pass_early_thread_jumps::execute to
> instead look at optimize_bb_for_speed_p (bb) . That allows the speed_p
> check to pass in the main threading cost model, and then the
> optimize_edge_for_speed_p can also pass. That gets the first stage of
> jump-threading back working in a proprietary benchmark which is sensitive
> to this optimisation.

But all early optimizations are supposed to never increase code size -- they
exist to get more realistic sizes to the IPA inliner heuristic code.

Iff at all then add some --param early-jump-threading-insns and use that
in place of the '1' and then eventually bump that to 2 or 3.  But using
the 100 from the late threading is too much.

Note that using optimize_bb_for_speed_p at this point in the pipeline
is nonsense and you can as well use ! optimize_size ...  because even
a guessed profile is only created at the end of the early optimization
pipeline.

> To get the rest of the required jump threading, I also have to weaken the
> cost model - and this is obviously a hack! The easy hack is to special case
> when the taken edge has frequency zero, and permit jump threading there.
>
> I know this patch is likely not the preferred way to fix this. For me that
> would be a change to the cost model, which as I mentioned above I think
> misses the point about which edges we want to thread. By far the best
> fix would be to the junk edge profiling data we create during threading.
>
> However, this patch does fix the performance issues identified in PR77445,
> and does highlight a fundamental issue with the early threader (which
> doesn't seem to me like it will be effective while it sets speed_p to
> false), so I'd like it to be considered for trunk if no better fix appears
> before stage 4.

But the PR has nothing to do with early threading but with late threading
fed a bogus profile.

> Bootstrapped on x86_64 with no issues. The testsuite changes just reshuffle
> which passes spot the threading opportunities.
>
> OK?

I don't think so.

Richard.

> Thanks,
> James
>
> ---
> gcc/
>
> 2016-12-15  James Greenhalgh  <james.greenhalgh@arm.com>
>
>         PR tree-optimization/77445
>         * tree-ssa-threadbackward.c (profitable_jump_thread_path) Work
>         around sometimes corrupt edge frequency data.
>         (pass_early_thread_jumps::execute): Pass
>         optimize_bb_for_speed_p as the speed_p parameter to
>         find_jump_threads_backwards to enable threading in more cases.
>
> gcc/testsuite/
>
> 2016-12-15  James Greenhalgh  <james.greenhalgh@arm.com>
>
>         PR tree-optimization/77445
>         * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust options and dump passes.
>         * gcc.dg/tree-ssa/pr66752-3.c: Likewise.
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2016-12-16 12:25 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-11 14:02 Early jump threading Jan Hubicka
2016-08-11 14:06 ` Richard Biener
2016-08-11 14:27   ` Jan Hubicka
2016-08-11 15:50     ` Richard Biener
2016-08-11 18:42       ` Jeff Law
2016-09-18 19:30         ` Jan Hubicka
2016-09-19  6:53           ` Andrew Pinski
2016-09-19  9:29             ` Jan Hubicka
2016-10-18 15:56               ` James Greenhalgh
2016-12-15 11:54                 ` [Patch] Undermine the jump threading cost model to fix PR77445 James Greenhalgh
2016-12-16 12:34                   ` Richard Biener
2016-09-19  7:59           ` Early jump threading Christophe Lyon
2016-09-20  8:41             ` Thomas Preudhomme
2016-08-11 18:41     ` Jeff Law
2016-08-12  6:02       ` Richard Biener
2016-08-15 17:06         ` Jeff Law
2016-08-11 18:30   ` Jeff Law
2016-08-11 18:29 ` Jeff Law
2016-08-16 11:02   ` Jan Hubicka
2016-08-16 15:42     ` Jeff Law
2016-08-12  8:02 ` Richard Biener
2016-08-12 11:27   ` Jan Hubicka
2016-08-15 17:25     ` Jeff Law
2016-08-16 16:52   ` Jeff Law
2016-08-18 19:55 ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).