public inbox for gcc-patches@gcc.gnu.org
* backward threading heuristics tweak
@ 2016-06-06 10:19 Jan Hubicka
  2016-08-05 21:34 ` Jeff Law
  2016-08-07 17:31 ` Andrew Pinski
  0 siblings, 2 replies; 10+ messages in thread
From: Jan Hubicka @ 2016-06-06 10:19 UTC (permalink / raw)
  To: law, gcc-patches

Hi,
while looking into profile mismatches introduced by the backward threading pass
I noticed that the heuristics seem quite simplistic.  First, they should be
profile-sensitive and disallow duplication when optimizing cold paths.  Second,
they should use estimate_num_insns, because the gimple statement count is not a
very realistic estimate of the final code size effect.  Third, there seems to be
no reason to disable the pass for functions optimized for size.

If we block duplication of more than 1 insn on size-optimized paths, the pass
is still able to make the majority of threading decisions that are free and improve codegen.
The code size benefit was between 0.5% and 2.7% on the testcases I tried (tramp3d,
GCC modules, xalancbmk and some other stuff around my hd).

Bootstrapped/regtested x86_64-linux, seems sane?

The pass should also avoid calling cleanup_cfg when no threading was done,
and I do not see why it is guarded by expensive_optimizations. What are the
main compile-time complexity limitations?

Honza

	* tree-ssa-threadbackward.c: Include tree-inline.h
	(profitable_jump_thread_path): Use estimate_num_insns to estimate
	size of copied block; for cold paths reduce duplication.
	(find_jump_threads_backwards): Remove redundant tests.
	(pass_thread_jumps::gate): Enable for -Os.
	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Update testcase.
Index: tree-ssa-threadbackward.c
===================================================================
--- tree-ssa-threadbackward.c	(revision 237101)
+++ tree-ssa-threadbackward.c	(working copy)
@@ -35,6 +35,7 @@ along with GCC; see the file COPYING3.
 #include "tree-pass.h"
 #include "gimple-ssa.h"
 #include "tree-phinodes.h"
+#include "tree-inline.h"
 
 static int max_threaded_paths;
 
@@ -210,7 +211,7 @@ profitable_jump_thread_path (vec<basic_b
 		  && !(gimple_code (stmt) == GIMPLE_ASSIGN
 		       && gimple_assign_rhs_code (stmt) == ASSERT_EXPR)
 		  && !is_gimple_debug (stmt))
-		++n_insns;
+	        n_insns += estimate_num_insns (stmt, &eni_size_weights);
 	    }
 
 	  /* We do not look at the block with the threaded branch
@@ -238,13 +239,15 @@ profitable_jump_thread_path (vec<basic_b
 	threaded_through_latch = true;
     }
 
+  gimple *stmt = get_gimple_control_stmt ((*path)[0]);
+  gcc_assert (stmt);
+
   /* We are going to remove the control statement at the end of the
      last block in the threading path.  So don't count it against our
      statement count.  */
-  n_insns--;
 
-  gimple *stmt = get_gimple_control_stmt ((*path)[0]);
-  gcc_assert (stmt);
+  n_insns-= estimate_num_insns (stmt, &eni_size_weights);
+
   /* We have found a constant value for ARG.  For GIMPLE_SWITCH
      and GIMPLE_GOTO, we use it as-is.  However, for a GIMPLE_COND
      we need to substitute, fold and simplify so we can determine
@@ -290,12 +293,24 @@ profitable_jump_thread_path (vec<basic_b
       return NULL;
     }
 
-  if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
+  if (optimize_edge_for_speed_p (taken_edge))
+    {
+      if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, "FSM jump-thread path not considered: "
+		     "the number of instructions on the path "
+		     "exceeds PARAM_MAX_FSM_THREAD_PATH_INSNS.\n");
+	  path->pop ();
+	  return NULL;
+	}
+    }
+  else if (n_insns > 1)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
 	fprintf (dump_file, "FSM jump-thread path not considered: "
-		 "the number of instructions on the path "
-		 "exceeds PARAM_MAX_FSM_THREAD_PATH_INSNS.\n");
+		 "duplication of %i insns is needed and optimizing for size.\n",
+		 n_insns);
       path->pop ();
       return NULL;
     }
@@ -600,10 +615,6 @@ fsm_find_control_statement_thread_paths
 void  
 find_jump_threads_backwards (basic_block bb)
 {     
-  if (!flag_expensive_optimizations
-      || optimize_function_for_size_p (cfun))
-    return;
-
   gimple *stmt = get_gimple_control_stmt (bb);
   if (!stmt)
     return;
@@ -668,8 +679,7 @@ public:
 bool
 pass_thread_jumps::gate (function *fun ATTRIBUTE_UNUSED)
 {
-  return (flag_expensive_optimizations
-	  && ! optimize_function_for_size_p (cfun));
+  return flag_expensive_optimizations;
 }
 
 
Index: testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c	(revision 237101)
+++ testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c	(working copy)
@@ -1,7 +1,7 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats" } */
+/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
 /* { dg-final { scan-tree-dump "Jumps threaded: 16"  "thread1" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 11" "thread2" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 6" "thread2" } } */
 /* { dg-final { scan-tree-dump "Jumps threaded: 3" "thread3" } } */
 /* { dg-final { scan-tree-dump-not "Jumps threaded"  "dom3" } } */
 /* { dg-final { scan-tree-dump-not "Jumps threaded"  "vrp2" } } */

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: backward threading heuristics tweak
  2016-06-06 10:19 backward threading heuristics tweak Jan Hubicka
@ 2016-08-05 21:34 ` Jeff Law
  2016-08-07 17:31 ` Andrew Pinski
  1 sibling, 0 replies; 10+ messages in thread
From: Jeff Law @ 2016-08-05 21:34 UTC (permalink / raw)
  To: Jan Hubicka, gcc-patches

On 06/06/2016 04:19 AM, Jan Hubicka wrote:
> Hi,
> while looking into profile mismatches introduced by the backward threading pass
> I noticed that the heuristics seems quite simplistics.  First it should be
> profile sensitive and disallow duplication when optimizing cold paths. Second
> it should use estimate_num_insns because gimple statement count is not really
> very realistic estimate of final code size effect and third there seems to be
> no reason to disable the pass for functions optimized for size.
I've never been happy with the size estimation code -- it's a series of 
hacks to address a couple of pathological code-size regressions that were 
seen when comparing gcc-6 to prior versions.  I won't lose any sleep if 
we move to estimate_num_insns and rely more on profile data.


>
> If we block duplication for more than 1 insns for size optimized paths the pass
> is able to do majority of threading decisions that are for free and improve codegen.
> The code size benefit was between 0.5% to 2.7% on testcases I tried (tramp3d,
> GCC modules, xlanancbmk and some other stuff around my hd).
>
> Bootstrapped/regtested x86_64-linux, seems sane?
>
> The pass should also avoid calling cleanup_cfg when no trheading was done
> and i do not see why it is guarded by expensive_optimizations. What are the
> main compile time complexity limitations?
The pass essentially walks up the use-def links in the CFG.  When it 
encounters a PHI, we walk up every PHI argument.  That's where it gets 
expensive.  I think there's a better algorithm to do those walks that 
doesn't start at the beginning each time, but I haven't had time to 
experiment with coding it up.


>
> Honza
>
> 	* tree-ssa-threadbackward.c: Include tree-inline.h
> 	(profitable_jump_thread_path): Use estimate_num_insns to estimate
> 	size of copied block; for cold paths reduce duplication.
> 	(find_jump_threads_backwards): Remove redundant tests.
> 	(pass_thread_jumps::gate): Enable for -Os.
> 	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Update testcase.
> Index: tree-ssa-threadbackward.c
> ===================================================================
> --- tree-ssa-threadbackward.c	(revision 237101)
> +++ tree-ssa-threadbackward.c	(working copy)
> @@ -35,6 +35,7 @@ along with GCC; see the file COPYING3.
>  #include "tree-pass.h"
>  #include "gimple-ssa.h"
>  #include "tree-phinodes.h"
> +#include "tree-inline.h"
Probably an indication that we want estimate_num_insns broken out into a more 
generic place that could be used by the threader, the inliner, etc.  Please 
consider that as a follow-up.



>
>  static int max_threaded_paths;
>
> @@ -210,7 +211,7 @@ profitable_jump_thread_path (vec<basic_b
>  		  && !(gimple_code (stmt) == GIMPLE_ASSIGN
>  		       && gimple_assign_rhs_code (stmt) == ASSERT_EXPR)
>  		  && !is_gimple_debug (stmt))
> -		++n_insns;
> +	        n_insns += estimate_num_insns (stmt, &eni_size_weights);
>  	    }
>
>  	  /* We do not look at the block with the threaded branch
> @@ -238,13 +239,15 @@ profitable_jump_thread_path (vec<basic_b
>  	threaded_through_latch = true;
>      }
>
> +  gimple *stmt = get_gimple_control_stmt ((*path)[0]);
> +  gcc_assert (stmt);
> +
>    /* We are going to remove the control statement at the end of the
>       last block in the threading path.  So don't count it against our
>       statement count.  */
> -  n_insns--;
>
> -  gimple *stmt = get_gimple_control_stmt ((*path)[0]);
> -  gcc_assert (stmt);
> +  n_insns-= estimate_num_insns (stmt, &eni_size_weights);
Whitespace nit here before the "-=".

I think this is fine with the whitespace fix.  And again, consider 
moving estimate_num_insns to a new location outside the inliner.

jeff



* Re: backward threading heuristics tweak
  2016-06-06 10:19 backward threading heuristics tweak Jan Hubicka
  2016-08-05 21:34 ` Jeff Law
@ 2016-08-07 17:31 ` Andrew Pinski
  2016-08-08  8:29   ` James Greenhalgh
  2016-08-11 11:35   ` Jan Hubicka
  1 sibling, 2 replies; 10+ messages in thread
From: Andrew Pinski @ 2016-08-07 17:31 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Jeff Law, GCC Patches

On Mon, Jun 6, 2016 at 3:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> while looking into profile mismatches introduced by the backward threading pass
> I noticed that the heuristics seems quite simplistics.  First it should be
> profile sensitive and disallow duplication when optimizing cold paths. Second
> it should use estimate_num_insns because gimple statement count is not really
> very realistic estimate of final code size effect and third there seems to be
> no reason to disable the pass for functions optimized for size.
>
> If we block duplication for more than 1 insns for size optimized paths the pass
> is able to do majority of threading decisions that are for free and improve codegen.
> The code size benefit was between 0.5% to 2.7% on testcases I tried (tramp3d,
> GCC modules, xlanancbmk and some other stuff around my hd).
>
> Bootstrapped/regtested x86_64-linux, seems sane?
>
> The pass should also avoid calling cleanup_cfg when no trheading was done
> and i do not see why it is guarded by expensive_optimizations. What are the
> main compile time complexity limitations?

This patch caused a huge regression (~11%) on CoreMark on ThunderX.
I assume it affects other targets too.
Basically it looks like the path is no longer jump-threaded.

Thanks,
Andrew


>
> Honza
>
>         * tree-ssa-threadbackward.c: Include tree-inline.h
>         (profitable_jump_thread_path): Use estimate_num_insns to estimate
>         size of copied block; for cold paths reduce duplication.
>         (find_jump_threads_backwards): Remove redundant tests.
>         (pass_thread_jumps::gate): Enable for -Os.
>         * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Update testcase.
> Index: tree-ssa-threadbackward.c
> ===================================================================
> --- tree-ssa-threadbackward.c   (revision 237101)
> +++ tree-ssa-threadbackward.c   (working copy)
> @@ -35,6 +35,7 @@ along with GCC; see the file COPYING3.
>  #include "tree-pass.h"
>  #include "gimple-ssa.h"
>  #include "tree-phinodes.h"
> +#include "tree-inline.h"
>
>  static int max_threaded_paths;
>
> @@ -210,7 +211,7 @@ profitable_jump_thread_path (vec<basic_b
>                   && !(gimple_code (stmt) == GIMPLE_ASSIGN
>                        && gimple_assign_rhs_code (stmt) == ASSERT_EXPR)
>                   && !is_gimple_debug (stmt))
> -               ++n_insns;
> +               n_insns += estimate_num_insns (stmt, &eni_size_weights);
>             }
>
>           /* We do not look at the block with the threaded branch
> @@ -238,13 +239,15 @@ profitable_jump_thread_path (vec<basic_b
>         threaded_through_latch = true;
>      }
>
> +  gimple *stmt = get_gimple_control_stmt ((*path)[0]);
> +  gcc_assert (stmt);
> +
>    /* We are going to remove the control statement at the end of the
>       last block in the threading path.  So don't count it against our
>       statement count.  */
> -  n_insns--;
>
> -  gimple *stmt = get_gimple_control_stmt ((*path)[0]);
> -  gcc_assert (stmt);
> +  n_insns-= estimate_num_insns (stmt, &eni_size_weights);
> +
>    /* We have found a constant value for ARG.  For GIMPLE_SWITCH
>       and GIMPLE_GOTO, we use it as-is.  However, for a GIMPLE_COND
>       we need to substitute, fold and simplify so we can determine
> @@ -290,12 +293,24 @@ profitable_jump_thread_path (vec<basic_b
>        return NULL;
>      }
>
> -  if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
> +  if (optimize_edge_for_speed_p (taken_edge))
> +    {
> +      if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
> +       {
> +         if (dump_file && (dump_flags & TDF_DETAILS))
> +           fprintf (dump_file, "FSM jump-thread path not considered: "
> +                    "the number of instructions on the path "
> +                    "exceeds PARAM_MAX_FSM_THREAD_PATH_INSNS.\n");
> +         path->pop ();
> +         return NULL;
> +       }
> +    }
> +  else if (n_insns > 1)
>      {
>        if (dump_file && (dump_flags & TDF_DETAILS))
>         fprintf (dump_file, "FSM jump-thread path not considered: "
> -                "the number of instructions on the path "
> -                "exceeds PARAM_MAX_FSM_THREAD_PATH_INSNS.\n");
> +                "duplication of %i insns is needed and optimizing for size.\n",
> +                n_insns);
>        path->pop ();
>        return NULL;
>      }
> @@ -600,10 +615,6 @@ fsm_find_control_statement_thread_paths
>  void
>  find_jump_threads_backwards (basic_block bb)
>  {
> -  if (!flag_expensive_optimizations
> -      || optimize_function_for_size_p (cfun))
> -    return;
> -
>    gimple *stmt = get_gimple_control_stmt (bb);
>    if (!stmt)
>      return;
> @@ -668,8 +679,7 @@ public:
>  bool
>  pass_thread_jumps::gate (function *fun ATTRIBUTE_UNUSED)
>  {
> -  return (flag_expensive_optimizations
> -         && ! optimize_function_for_size_p (cfun));
> +  return flag_expensive_optimizations;
>  }
>
>
> Index: testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c        (revision 237101)
> +++ testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c        (working copy)
> @@ -1,7 +1,7 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats" } */
> +/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
>  /* { dg-final { scan-tree-dump "Jumps threaded: 16"  "thread1" } } */
> -/* { dg-final { scan-tree-dump "Jumps threaded: 11" "thread2" } } */
> +/* { dg-final { scan-tree-dump "Jumps threaded: 6" "thread2" } } */
>  /* { dg-final { scan-tree-dump "Jumps threaded: 3" "thread3" } } */
>  /* { dg-final { scan-tree-dump-not "Jumps threaded"  "dom3" } } */
>  /* { dg-final { scan-tree-dump-not "Jumps threaded"  "vrp2" } } */


* Re: backward threading heuristics tweak
  2016-08-07 17:31 ` Andrew Pinski
@ 2016-08-08  8:29   ` James Greenhalgh
  2016-08-11 12:36     ` Jan Hubicka
  2016-08-11 11:35   ` Jan Hubicka
  1 sibling, 1 reply; 10+ messages in thread
From: James Greenhalgh @ 2016-08-08  8:29 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: Jan Hubicka, Jeff Law, GCC Patches, nd

On Sun, Aug 07, 2016 at 10:30:48AM -0700, Andrew Pinski wrote:
> On Mon, Jun 6, 2016 at 3:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> > Hi,
> > while looking into profile mismatches introduced by the backward threading pass
> > I noticed that the heuristics seems quite simplistics.  First it should be
> > profile sensitive and disallow duplication when optimizing cold paths. Second
> > it should use estimate_num_insns because gimple statement count is not really
> > very realistic estimate of final code size effect and third there seems to be
> > no reason to disable the pass for functions optimized for size.
> >
> > If we block duplication for more than 1 insns for size optimized paths the pass
> > is able to do majority of threading decisions that are for free and improve codegen.
> > The code size benefit was between 0.5% to 2.7% on testcases I tried (tramp3d,
> > GCC modules, xlanancbmk and some other stuff around my hd).
> >
> > Bootstrapped/regtested x86_64-linux, seems sane?
> >
> > The pass should also avoid calling cleanup_cfg when no trheading was done
> > and i do not see why it is guarded by expensive_optimizations. What are the
> > main compile time complexity limitations?
> 
> This patch caused a huge regression (~11%) on coremarks on ThunderX.
> I assume other targets too.
> Basically it looks like the path is no longer thread jumped.

This also caused:

FAIL: gcc.dg/tree-ssa/pr69270-3.c scan-tree-dump-times uncprop1 ", 1" 4

  Failures:
	gcc.dg/tree-ssa/pr69270-3.c
	
  Bisected to: 

  Author: hubicka
  Date:   Sun Aug 7 10:50:16 2016 +0000

    	* tree-ssa-threadbackward.c: Include tree-inline.h
    	(profitable_jump_thread_path): Use estimate_num_insns to estimate
    	size of copied block; for cold paths reduce duplication.
    	(find_jump_threads_backwards): Remove redundant tests.
    	(pass_thread_jumps::gate): Enable for -Os.
    	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Update testcase.

  svn+ssh://gcc.gnu.org/svn/gcc/trunk@239219 

On my aarch64-none-linux-gnu and x86_64-none-linux-gnu automatic bisect
robots.

Thanks,
James


* Re: backward threading heuristics tweak
  2016-08-07 17:31 ` Andrew Pinski
  2016-08-08  8:29   ` James Greenhalgh
@ 2016-08-11 11:35   ` Jan Hubicka
  2016-08-12 15:07     ` James Greenhalgh
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Hubicka @ 2016-08-11 11:35 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: Jan Hubicka, Jeff Law, GCC Patches

> On Mon, Jun 6, 2016 at 3:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> > Hi,
> > while looking into profile mismatches introduced by the backward threading pass
> > I noticed that the heuristics seems quite simplistics.  First it should be
> > profile sensitive and disallow duplication when optimizing cold paths. Second
> > it should use estimate_num_insns because gimple statement count is not really
> > very realistic estimate of final code size effect and third there seems to be
> > no reason to disable the pass for functions optimized for size.
> >
> > If we block duplication for more than 1 insns for size optimized paths the pass
> > is able to do majority of threading decisions that are for free and improve codegen.
> > The code size benefit was between 0.5% to 2.7% on testcases I tried (tramp3d,
> > GCC modules, xlanancbmk and some other stuff around my hd).
> >
> > Bootstrapped/regtested x86_64-linux, seems sane?
> >
> > The pass should also avoid calling cleanup_cfg when no trheading was done
> > and i do not see why it is guarded by expensive_optimizations. What are the
> > main compile time complexity limitations?
> 
> This patch caused a huge regression (~11%) on coremarks on ThunderX.
> I assume other targets too.
> Basically it looks like the path is no longer thread jumped.

Sorry for the late reply. I checked our periodic testers and the patch seems more or less
performance-neutral, with some code size improvements. Can you point me to the path that
is no longer threaded? I added diagnostic output, so you should see the reason why the
path was considered unprofitable - either it was cold, or it exceeded the maximum size.
The size limit is largely untuned, so perhaps we can just adjust it.

Honza


* Re: backward threading heuristics tweak
  2016-08-08  8:29   ` James Greenhalgh
@ 2016-08-11 12:36     ` Jan Hubicka
  2016-08-15 16:57       ` Jeff Law
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Hubicka @ 2016-08-11 12:36 UTC (permalink / raw)
  To: James Greenhalgh; +Cc: Andrew Pinski, Jan Hubicka, Jeff Law, GCC Patches, nd

> This also caused:
> 
> FAIL: gcc.dg/tree-ssa/pr69270-3.c scan-tree-dump-times uncprop1 ", 1" 4
> 
>   Failures:
> 	gcc.dg/tree-ssa/pr69270-3.c
> 	
>   Bisected to: 
> 
>   Author: hubicka
>   Date:   Sun Aug 7 10:50:16 2016 +0000
> 
>     	* tree-ssa-threadbackward.c: Include tree-inline.h
>     	(profitable_jump_thread_path): Use estimate_num_insns to estimate
>     	size of copied block; for cold paths reduce duplication.
>     	(find_jump_threads_backwards): Remove redundant tests.
>     	(pass_thread_jumps::gate): Enable for -Os.
>     	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Update testcase.
> 
>   svn+ssh://gcc.gnu.org/svn/gcc/trunk@239219 
> 
> On my aarch64-none-linux-gnu and x86_64-none-linux-gnu automatic bisect
> robots.

Sorry about that - it seems I missed this failure.  The reason is that the FSM
threader now stops on:
  /* We avoid creating irreducible inner loops unless we thread through
     a multiway branch, in which case we have deemed it worth losing
     other loop optimizations later.

     We also consider it worth creating an irreducible inner loop if
     the number of copied statement is low relative to the length of
     the path -- in that case there's little the traditional loop
     optimizer would have done anyway, so an irreducible loop is not
     so bad.  */
  if (!threaded_multiway_branch && creates_irreducible_loop
      && (n_insns * PARAM_VALUE (PARAM_FSM_SCALE_PATH_STMTS)
          > path_length * PARAM_VALUE (PARAM_FSM_SCALE_PATH_BLOCKS)))

    {
      if (dump_file && (dump_flags & TDF_DETAILS))
        fprintf (dump_file,
                 "FSM would create irreducible loop without threading "
                 "multiway branch.\n");
      path->pop ();
      return NULL;
    }

The threaded path now gets n_insns == 13 and path_length == 6. I guess the difference
is that the path consists of several calls that are considered heavy by the
new code size estimate - which is correct: it is definitely heavier than a path
consisting of a few increments.

<bb 6>:                                                                                                                                                                                    
if (phi_inserted_5 == 0)                                                                                                                                                                   
  goto <bb 8>;                                                                                                                                                                             
else                                                                                                                                                                                       
  goto <bb 7>;

<bb 4>:
_2 = boo ();
if (_2 != 20)
  goto <bb 8>;
else
  goto <bb 6>;

<bb 3>:
_1 = arf ();
if (_1 != 10)
  goto <bb 8>;
else
  goto <bb 4>;

<bb 9>:
# phi_inserted_5 = PHI <0(2), phi_inserted_4(8)>
_3 = end_imm_use_stmt_p ();
if (_3 == 0)
  goto <bb 3>;
else
  goto <bb 10>;

loop latch.
<bb 8>: 
# phi_inserted_4 = PHI <phi_inserted_5(4), 1(6), phi_inserted_5(7), phi_inserted_5(3)>
next_imm_use_stmt ();

<bb 6>:
if (phi_inserted_5 == 0)
  goto <bb 8>;
else
  goto <bb 7>;


I would say that optimizing this path to dead code is not the most important thing.  The question is whether
there is really a problem with an irreducible loop.  There are two loops in the function body prior to threading:

;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2 3 4 6 7 8 9 10
;;
;; Loop 1
;;  header 9, latch 8
;;  depth 1, outer 0
;;  nodes: 9 8 4 6 7 3
;; 2 succs { 9 }
;; 3 succs { 8 4 }
;; 4 succs { 8 6 }
;; 6 succs { 8 7 }
;; 7 succs { 8 }
;; 8 succs { 9 }
;; 9 succs { 3 10 }
;; 10 succs { 1 }

So the threaded path lives fully inside loop 1: 6->8->9->3->4->6, propagating
that phi_inserted is 0 after the first iteration of the loop.  This looks like a
useful loop peeling opportunity which does not garble the loop structure.  So
perhaps threading paths starting at and passing through the loop latch (i.e. peeling) is
sane?  Perhaps all paths fully captured within the loop in question are?

;; 2 loops found
;;
;; Loop 0
;;  header 0, latch 1
;;  depth 0, outer -1
;;  nodes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
;;
;; Loop 1
;;  header 12, latch 11
;;  depth 1, outer 0
;;  nodes: 12 11 4 9 10 3 8 7 6 5
;; 2 succs { 12 }
;; 3 succs { 11 4 }
;; 4 succs { 11 5 }
;; 5 succs { 6 10 }
;; 6 succs { 7 }
;; 7 succs { 8 13 }
;; 8 succs { 11 9 }
;; 9 succs { 11 10 }
;; 10 succs { 11 }
;; 11 succs { 12 }
;; 12 succs { 3 13 }
;; 13 succs { 1 }

Honza


* Re: backward threading heuristics tweak
  2016-08-11 11:35   ` Jan Hubicka
@ 2016-08-12 15:07     ` James Greenhalgh
  0 siblings, 0 replies; 10+ messages in thread
From: James Greenhalgh @ 2016-08-12 15:07 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Andrew Pinski, Jeff Law, GCC Patches, nd

On Thu, Aug 11, 2016 at 01:35:16PM +0200, Jan Hubicka wrote:
> > On Mon, Jun 6, 2016 at 3:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> > > Hi,
> > > while looking into profile mismatches introduced by the backward threading pass
> > > I noticed that the heuristics seems quite simplistics.  First it should be
> > > profile sensitive and disallow duplication when optimizing cold paths. Second
> > > it should use estimate_num_insns because gimple statement count is not really
> > > very realistic estimate of final code size effect and third there seems to be
> > > no reason to disable the pass for functions optimized for size.
> > >
> > > If we block duplication for more than 1 insns for size optimized paths the pass
> > > is able to do majority of threading decisions that are for free and improve codegen.
> > > The code size benefit was between 0.5% to 2.7% on testcases I tried (tramp3d,
> > > GCC modules, xlanancbmk and some other stuff around my hd).
> > >
> > > Bootstrapped/regtested x86_64-linux, seems sane?
> > >
> > > The pass should also avoid calling cleanup_cfg when no trheading was done
> > > and i do not see why it is guarded by expensive_optimizations. What are the
> > > main compile time complexity limitations?
> > 
> > This patch caused a huge regression (~11%) on coremarks on ThunderX.
> > I assume other targets too.
> > Basically it looks like the path is no longer thread jumped.
> 
> Sorry for late reply. I checked our periodic testers and the patch seems more
> or less performance neutral with some code size improvements. Can you point
> me to the path that is no longer crossjumped? I added diag output, so you
> should see the reason why the path was considered unprofitable - either it
> was cold or we exceeded the maximal size.  The size is largely untuned, so
> perhaps we can just adjust it.

Yes, it is going to be the "cold" edge heuristic that does it.

FSM jump-thread path not considered: duplication of 18 insns is needed and optimizing for size.

For whatever reason, for the edges in this benchmark, if I dump
EDGE_FREQUENCY for the rejected edges in thread2 that we used to
accept, they have frequency 0 (!).

This happens because the source block of the edge (the controlling switch)
has frequency 0:

  ;;   basic block 7, loop depth 3, count 0, freq 0
  ;;    prev block 6, next block 8, flags: (NEW, REACHABLE)
  ;;    pred:       5 [95.0%]  (FALSE_VALUE,EXECUTABLE)
    switch (state_504) <default: <L31>, case 0: <L24>, case 2: <L25>, case 3: <L28>, case 4: <L26>, case 5: <L27>, case 6: <L29>, case 7: <L30>>
  ;;    succ:       36 [12.5%]  (EXECUTABLE)
  ;;                8 [12.5%]  (EXECUTABLE)
  ;;                18 [12.5%]  (EXECUTABLE)
  ;;                30 [12.5%]  (EXECUTABLE)
  ;;                22 [12.5%]  (EXECUTABLE)
  ;;                26 [12.5%]  (EXECUTABLE)
  ;;                32 [12.5%]  (EXECUTABLE)
  ;;                34 [12.5%]  (EXECUTABLE)

It looks like it is thread1 that creates these blocks with empty
frequency information.

I haven't managed to find a reduced testcase that shows this issue.

Thanks,
James


* Re: backward threading heuristics tweak
  2016-08-11 12:36     ` Jan Hubicka
@ 2016-08-15 16:57       ` Jeff Law
  2016-08-15 20:06         ` Jan Hubicka
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff Law @ 2016-08-15 16:57 UTC (permalink / raw)
  To: Jan Hubicka, James Greenhalgh; +Cc: Andrew Pinski, GCC Patches, nd

On 08/11/2016 06:35 AM, Jan Hubicka wrote:
>> This also caused:
>>
>> FAIL: gcc.dg/tree-ssa/pr69270-3.c scan-tree-dump-times uncprop1 ", 1" 4
>>
>>   Failures:
>> 	gcc.dg/tree-ssa/pr69270-3.c
>> 	
>>   Bisected to:
>>
>>   Author: hubicka
>>   Date:   Sun Aug 7 10:50:16 2016 +0000
>>
>>     	* tree-ssa-threadbackward.c: Include tree-inline.h
>>     	(profitable_jump_thread_path): Use estimate_num_insns to estimate
>>     	size of copied block; for cold paths reduce duplication.
>>     	(find_jump_threads_backwards): Remove redundant tests.
>>     	(pass_thread_jumps::gate): Enable for -Os.
>>     	* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Update testcase.
>>
>>   svn+ssh://gcc.gnu.org/svn/gcc/trunk@239219
>>
>> On my aarch64-none-linux-gnu and x86_64-none-linux-gnu automatic bisect
>> robots.
>
> Sorry for that - it seems I have missed this failure.  The reason is that FSM
> now stops on:
>   /* We avoid creating irreducible inner loops unless we thread through
>      a multiway branch, in which case we have deemed it worth losing
>      other loop optimizations later.
>
>      We also consider it worth creating an irreducible inner loop if
>      the number of copied statement is low relative to the length of
>      the path -- in that case there's little the traditional loop
>      optimizer would have done anyway, so an irreducible loop is not
>      so bad.  */
>   if (!threaded_multiway_branch && creates_irreducible_loop
>       && (n_insns * PARAM_VALUE (PARAM_FSM_SCALE_PATH_STMTS)
>           > path_length * PARAM_VALUE (PARAM_FSM_SCALE_PATH_BLOCKS)))
>
>     {
>       if (dump_file && (dump_flags & TDF_DETAILS))
>         fprintf (dump_file,
>                  "FSM would create irreducible loop without threading "
>                  "multiway branch.\n");
>       path->pop ();
>       return NULL;
>     }
>
> The path threaded now gets n_insn==13 and path_lengt=6. I guess the difference
> is that the path consists of several calls that are considered heavy by the
> new code size estimate which is correct. It is definitly heaver than path
> consisting of few increments.
>
> <bb 6>:
> if (phi_inserted_5 == 0)
>   goto <bb 8>;
> else
>   goto <bb 7>;
>
> <bb 4>:
> _2 = boo ();
> if (_2 != 20)
>   goto <bb 8>;
> else
>   goto <bb 6>;
>
> <bb 3>:
> _1 = arf ();
> if (_1 != 10)
>   goto <bb 8>;
> else
>   goto <bb 4>;
>
> <bb 9>:
> # phi_inserted_5 = PHI <0(2), phi_inserted_4(8)>
> _3 = end_imm_use_stmt_p ();
> if (_3 == 0)
>   goto <bb 3>;
> else
>   goto <bb 10>;
>
> loop latch.
> <bb 8>:
> # phi_inserted_4 = PHI <phi_inserted_5(4), 1(6), phi_inserted_5(7), phi_inserted_5(3)>
> next_imm_use_stmt ();
>
> <bb 6>:
> if (phi_inserted_5 == 0)
>   goto <bb 8>;
> else
>   goto <bb 7>;
>
>
> I would say that optimizing this path away is not the most important thing.  The question is whether
> there is really a problem with an irreducible loop.  There are two loops in the function body prior to threading:
>
> ;; Loop 0
> ;;  header 0, latch 1
> ;;  depth 0, outer -1
> ;;  nodes: 0 1 2 3 4 6 7 8 9 10
> ;;
> ;; Loop 1
> ;;  header 9, latch 8
> ;;  depth 1, outer 0
> ;;  nodes: 9 8 4 6 7 3
> ;; 2 succs { 9 }
> ;; 3 succs { 8 4 }
> ;; 4 succs { 8 6 }
> ;; 6 succs { 8 7 }
> ;; 7 succs { 8 }
> ;; 8 succs { 9 }
> ;; 9 succs { 3 10 }
> ;; 10 succs { 1 }
>
> So the threaded path lives fully inside loop 1: 6->8->9->3->4->6, propagating
> that phi_inserted is 0 after the first iteration of the loop.  This looks like a
> useful loop peeling opportunity which does not garble the loop structure.  So
> perhaps threading paths starting at and passing through the loop latch (i.e. peeling)
> is sane?  Perhaps all paths fully contained in the loop in question are?
Peeling like this has long been a point of contention -- it totally
mucks up things like vectorization.

The general issue is that the threader knows nothing about the
characteristics of the loop -- thus peeling at this point is
premature and just as likely to hinder performance as to improve it.

I've never been happy with how this aspect of threading vs. loop opts
turned out, and we have open BZs related to this rat's nest of issues.

jeff


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: backward threading heuristics tweek
  2016-08-15 16:57       ` Jeff Law
@ 2016-08-15 20:06         ` Jan Hubicka
  2016-08-16 15:43           ` Jeff Law
  0 siblings, 1 reply; 10+ messages in thread
From: Jan Hubicka @ 2016-08-15 20:06 UTC (permalink / raw)
  To: Jeff Law; +Cc: Jan Hubicka, James Greenhalgh, Andrew Pinski, GCC Patches, nd

> >So the threaded path lives fully inside loop 1: 6->8->9->3->4->6, propagating
> >that phi_inserted is 0 after the first iteration of the loop.  This looks like a
> >useful loop peeling opportunity which does not garble the loop structure.  So
> >perhaps threading paths starting at and passing through the loop latch (i.e. peeling)
> >is sane?  Perhaps all paths fully contained in the loop in question are?
> Peeling like this has long been a point of contention -- it totally
> mucks up things like vectorization.
> 
> The general issue is that the threader knows nothing about the
> characteristics of the loop -- thus peeling at this point is
> premature and just as likely to hinder performance as to improve it.
> 
> I've never been happy with how this aspect of threading vs. loop opts
> turned out, and we have open BZs related to this rat's nest of issues.

Ok, then we perhaps just want to silence the testcase?

Honza
> 
> jeff
> 


* Re: backward threading heuristics tweek
  2016-08-15 20:06         ` Jan Hubicka
@ 2016-08-16 15:43           ` Jeff Law
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff Law @ 2016-08-16 15:43 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: James Greenhalgh, Andrew Pinski, GCC Patches, nd

On 08/15/2016 02:06 PM, Jan Hubicka wrote:
>>> So the threaded path lives fully inside loop 1: 6->8->9->3->4->6, propagating
>>> that phi_inserted is 0 after the first iteration of the loop.  This looks like a
>>> useful loop peeling opportunity which does not garble the loop structure.  So
>>> perhaps threading paths starting at and passing through the loop latch (i.e. peeling)
>>> is sane?  Perhaps all paths fully contained in the loop in question are?
>> Peeling like this has long been a point of contention -- it totally
>> mucks up things like vectorization.
>>
>> The general issue is that the threader knows nothing about the
>> characteristics of the loop -- thus peeling at this point is
>> premature and just as likely to hinder performance as to improve it.
>>
>> I've never been happy with how this aspect of threading vs. loop opts
>> turned out, and we have open BZs related to this rat's nest of issues.
>
> Ok, then we perhaps just want to silence the testcase?
We might.  I'll have to take a closer look though.  Which means I have 
to stop losing time to other things every day ;(

jeff


end of thread, other threads:[~2016-08-16 15:43 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-06 10:19 backward threading heuristics tweek Jan Hubicka
2016-08-05 21:34 ` Jeff Law
2016-08-07 17:31 ` Andrew Pinski
2016-08-08  8:29   ` James Greenhalgh
2016-08-11 12:36     ` Jan Hubicka
2016-08-15 16:57       ` Jeff Law
2016-08-15 20:06         ` Jan Hubicka
2016-08-16 15:43           ` Jeff Law
2016-08-11 11:35   ` Jan Hubicka
2016-08-12 15:07     ` James Greenhalgh
