Reduce complette unrolling & peeling limits

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* Reduce complette unrolling & peeling limits
@ 2012-11-14 23:34 Jan Hubicka
  2012-11-15 13:19 ` Jakub Jelinek
  2012-11-18  8:09 ` Eric Botcazou
  0 siblings, 2 replies; 19+ messages in thread
From: Jan Hubicka @ 2012-11-14 23:34 UTC (permalink / raw)
  To: gcc-patches, jakub

Hi,
this patch reduces max-peeled-insns and max-completely-peeled-insns from 400 to
100.  The reason why I am doing this is that I want to reduce code bloat caused
by my cunroll work that enabled a lot more unrolling then previously causing
considerable code size regression at -O3.

I do not think those params was ever serviously tunned, or re-tunned after
introduction of tree-ssa peeling.  I bootstrapped/regtested x86_64 with few
values - 4000, 200, 100, 50 on spec2000,spec2k6,C++ benchmarks and polyhedron.

I also did partial tests on ia-64 (that is broken quite a lot now, but I wanted
to have some sanity check that these values are not too x86 specific).

With 4000 (and also bumped up max-peel-times/max-completely-peel-times) there
are improvements on
  ammp 1360->1460
  equake 1800->1840
  applu 1450->1500
but i guess those needs to be handled by better heuristic.

Otherwise there are no perfromance regression with going 400->100. With 50
there are tiny performance drops on swim and applu.

I plan to follow by testing the max-peel times parameters and then doing inliner
tests.

Bootstrapped/regtested x86_64-linux, OK?

Honza

	* params.def (max-peeled-insns, max-completely-peeled-insns): Reduce to 100.
Index: params.def
===================================================================
--- params.def	(revision 193505)
+++ params.def	(working copy)
@@ -290,7 +290,7 @@ DEFPARAM(PARAM_MAX_UNROLL_TIMES,
 DEFPARAM(PARAM_MAX_PEELED_INSNS,
 	"max-peeled-insns",
 	"The maximum number of insns of a peeled loop",
-	400, 0, 0)
+	100, 0, 0)
 /* The maximum number of peelings of a single loop.  */
 DEFPARAM(PARAM_MAX_PEEL_TIMES,
 	"max-peel-times",
@@ -305,7 +305,7 @@ DEFPARAM(PARAM_MAX_PEEL_BRANCHES,
 DEFPARAM(PARAM_MAX_COMPLETELY_PEELED_INSNS,
 	"max-completely-peeled-insns",
 	"The maximum number of insns of a completely peeled loop",
-	400, 0, 0)
+	100, 0, 0)
 /* The maximum number of peelings of a single loop that is peeled completely.  */
 DEFPARAM(PARAM_MAX_COMPLETELY_PEEL_TIMES,
 	"max-completely-peel-times",

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-11-14 23:34 Reduce complette unrolling & peeling limits Jan Hubicka
@ 2012-11-15 13:19 ` Jakub Jelinek
  2012-11-18  8:09 ` Eric Botcazou
  1 sibling, 0 replies; 19+ messages in thread
From: Jakub Jelinek @ 2012-11-15 13:19 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches

On Thu, Nov 15, 2012 at 12:34:07AM +0100, Jan Hubicka wrote:
> 	* params.def (max-peeled-insns, max-completely-peeled-insns): Reduce to 100.

Ok, thanks.

> --- params.def	(revision 193505)
> +++ params.def	(working copy)
> @@ -290,7 +290,7 @@ DEFPARAM(PARAM_MAX_UNROLL_TIMES,
>  DEFPARAM(PARAM_MAX_PEELED_INSNS,
>  	"max-peeled-insns",
>  	"The maximum number of insns of a peeled loop",
> -	400, 0, 0)
> +	100, 0, 0)
>  /* The maximum number of peelings of a single loop.  */
>  DEFPARAM(PARAM_MAX_PEEL_TIMES,
>  	"max-peel-times",
> @@ -305,7 +305,7 @@ DEFPARAM(PARAM_MAX_PEEL_BRANCHES,
>  DEFPARAM(PARAM_MAX_COMPLETELY_PEELED_INSNS,
>  	"max-completely-peeled-insns",
>  	"The maximum number of insns of a completely peeled loop",
> -	400, 0, 0)
> +	100, 0, 0)
>  /* The maximum number of peelings of a single loop that is peeled completely.  */
>  DEFPARAM(PARAM_MAX_COMPLETELY_PEEL_TIMES,
>  	"max-completely-peel-times",

	Jakub

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-11-14 23:34 Reduce complette unrolling & peeling limits Jan Hubicka
  2012-11-15 13:19 ` Jakub Jelinek
@ 2012-11-18  8:09 ` Eric Botcazou
  2012-11-18 16:52   ` Jan Hubicka
  1 sibling, 1 reply; 19+ messages in thread
From: Eric Botcazou @ 2012-11-18  8:09 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, jakub

> this patch reduces max-peeled-insns and max-completely-peeled-insns from 400
> to 100.  The reason why I am doing this is that I want to reduce code bloat
> caused by my cunroll work that enabled a lot more unrolling then previously
> causing considerable code size regression at -O3.

Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now again 
takes a while to compile, so times out on slow machines:

FAIL: gcc.c-torture/compile/pr43186.c  -O3 -fomit-frame-pointer  (test for 
excess errors)
FAIL: gcc.c-torture/compile/pr43186.c  -O3 -fomit-frame-pointer -funroll-loops  
(test for excess errors)
FAIL: gcc.c-torture/compile/pr43186.c  -O3 -fomit-frame-pointer -funroll-all-
loops -finline-functions  (test for excess errors)
FAIL: gcc.c-torture/compile/pr43186.c  -O3 -g  (test for excess errors)
WARNING: program timed out.
WARNING: program timed out.
WARNING: program timed out.
WARNING: program timed out.

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-11-18  8:09 ` Eric Botcazou
@ 2012-11-18 16:52   ` Jan Hubicka
  2012-11-18 17:09     ` Jan Hubicka
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Hubicka @ 2012-11-18 16:52 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: Jan Hubicka, gcc-patches, jakub

> > this patch reduces max-peeled-insns and max-completely-peeled-insns from 400
> > to 100.  The reason why I am doing this is that I want to reduce code bloat
> > caused by my cunroll work that enabled a lot more unrolling then previously
> > causing considerable code size regression at -O3.
> 
> Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now again 
> takes a while to compile, so times out on slow machines:

I did not :(.  I am currently on a trip, but will take a look on tuesday.
If it seems to disturb testing, please just revert the patch for time being.

Honza

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-11-18 16:52   ` Jan Hubicka
@ 2012-11-18 17:09     ` Jan Hubicka
  2012-11-18 18:47       ` Eric Botcazou
  2012-11-23 18:46       ` Hans-Peter Nilsson
  0 siblings, 2 replies; 19+ messages in thread
From: Jan Hubicka @ 2012-11-18 17:09 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Eric Botcazou, gcc-patches, jakub

> > > this patch reduces max-peeled-insns and max-completely-peeled-insns from 400
> > > to 100.  The reason why I am doing this is that I want to reduce code bloat
> > > caused by my cunroll work that enabled a lot more unrolling then previously
> > > causing considerable code size regression at -O3.
> > 
> > Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now again 
> > takes a while to compile, so times out on slow machines:
> 
> I did not :(.  I am currently on a trip, but will take a look on tuesday.
> If it seems to disturb testing, please just revert the patch for time being.

OK, here are multiple issues.
1) recursive inlining makes huge loop nest (of 18 loops)
2) SCEV is very slow on answering simple_iv tests in this case becuase it walks
   the nest
3) unroller is computing loop body size even when it is clear the body is much larger
   than the limit (the outer loop has 78000 instructions)

I will prepare patches to fix those issues. 

Honza
> 
> Honza

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-11-18 17:09     ` Jan Hubicka
@ 2012-11-18 18:47       ` Eric Botcazou
  2012-11-19 12:46         ` Jan Hubicka
  2012-11-23 18:46       ` Hans-Peter Nilsson
  1 sibling, 1 reply; 19+ messages in thread
From: Eric Botcazou @ 2012-11-18 18:47 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, jakub

> OK, here are multiple issues.
> 1) recursive inlining makes huge loop nest (of 18 loops)
> 2) SCEV is very slow on answering simple_iv tests in this case becuase it
> walks the nest
> 3) unroller is computing loop body size even when it is clear the body is
> much larger than the limit (the outer loop has 78000 instructions)
> 
> I will prepare patches to fix those issues.

Thanks for the analysis (and don't worry, I won't revert anything :-).

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-11-18 18:47       ` Eric Botcazou
@ 2012-11-19 12:46         ` Jan Hubicka
  2012-11-21 16:25           ` Jan Hubicka
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Hubicka @ 2012-11-19 12:46 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: Jan Hubicka, gcc-patches, jakub

Hi,
this is patch I will try to test once I have chance :)
t simply prevents unroller from analyzing loops when they are already too large.

	* tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
	parameter.
	(try_unroll_loop_completely) Update.
Index: tree-ssa-loop-ivcanon.c
===================================================================
--- tree-ssa-loop-ivcanon.c	(revision 193598)
+++ tree-ssa-loop-ivcanon.c	(working copy)
@@ -1,5 +1,5 @@
-/* Induction variable canonicalization.
-   Copyright (C) 2004, 2005, 2007, 2008, 2010
+/* Induction variable canonicalization and loop peeling.
+   Copyright (C) 2004, 2005, 2007, 2008, 2010, 2012
    Free Software Foundation, Inc.
 
 This file is part of GCC.
@@ -29,9 +29,12 @@ along with GCC; see the file COPYING3.
    variables.  In that case the created optimization possibilities are likely
    to pay up.
 
-   Additionally in case we detect that it is beneficial to unroll the
-   loop completely, we do it right here to expose the optimization
-   possibilities to the following passes.  */
+   We also perform
+     - complette unrolling (or peeling) when the loops is rolling few enough
+       times
+     - simple peeling (i.e. copying few initial iterations prior the loop)
+       when number of iteration estimate is known (typically by the profile
+       info).  */
 
 #include "config.h"
 #include "system.h"
@@ -207,10 +210,12 @@ constant_after_peeling (tree op, gimple
    iteration of the loop.
    EDGE_TO_CANCEL (if non-NULL) is an non-exit edge eliminated in the last iteration
    of loop.
-   Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT.  */
+   Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT. 
+   Stop estimating after UPPER_BOUND is met. Return true in this case */
 
-static void
-tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel, struct loop_size *size)
+static bool
+tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel, struct loop_size *size,
+			 int upper_bound)
 {
   basic_block *body = get_loop_body (loop);
   gimple_stmt_iterator gsi;
@@ -316,6 +321,12 @@ tree_estimate_loop_size (struct loop *lo
 	      if (likely_eliminated || likely_eliminated_last)
 		size->last_iteration_eliminated_by_peeling += num;
 	    }
+	  if ((size->overall - size->eliminated_by_peeling
+	      - size->last_iteration_eliminated_by_peeling) > upper_bound)
+	    {
+              free (body);
+	      return true;
+	    }
 	}
     }
   while (path.length ())
@@ -357,6 +368,7 @@ tree_estimate_loop_size (struct loop *lo
 	     size->last_iteration_eliminated_by_peeling);
 
   free (body);
+  return false;
 }
 
 /* Estimate number of insns of completely unrolled loop.
@@ -699,12 +711,23 @@ try_unroll_loop_completely (struct loop
       sbitmap wont_exit;
       edge e;
       unsigned i;
+      bool large;
       vec<edge> to_remove = vec<edge>();
       if (ul == UL_SINGLE_ITER)
 	return false;
 
-      tree_estimate_loop_size (loop, exit, edge_to_cancel, &size);
+      large = tree_estimate_loop_size
+		 (loop, exit, edge_to_cancel, &size,
+	          ul == UL_NO_GROWTH ? 0
+		  : PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS) * 2);
       ninsns = size.overall;
+      if (large)
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, "Not unrolling loop %d: it is too large.\n",
+		     loop->num);
+	  return false;
+	}
 
       unr_insns = estimated_unrolled_size (&size, n_unroll);
       if (dump_file && (dump_flags & TDF_DETAILS))
@@ -865,6 +888,133 @@ try_unroll_loop_completely (struct loop
   return true;
 }
 
+/* Return number of instructions after peeling.  */
+static unsigned HOST_WIDE_INT
+estimated_peeled_sequence_size (struct loop_size *size,
+			        unsigned HOST_WIDE_INT npeel)
+{
+  return MAX (npeel * (HOST_WIDE_INT) (size->overall
+			     	       - size->eliminated_by_peeling), 1);
+}
+
+/* If the loop is expected to iterate N times and is
+   small enough, duplicate the loop body N+1 times before
+   the loop itself.  This way the hot path will never
+   enter the loop.  
+   Parameters are the same as for try_unroll_loops_completely */
+
+static bool
+try_peel_loop (struct loop *loop,
+	       edge exit, tree niter,
+	       HOST_WIDE_INT maxiter)
+{
+  int npeel;
+  struct loop_size size;
+  int peeled_size;
+  sbitmap wont_exit;
+  unsigned i;
+  vec<edge> to_remove = vec<edge>();
+  edge e;
+
+  /* If the iteration bound is known and large, then we can safely eliminate
+     the check in peeled copies.  */
+  if (TREE_CODE (niter) != INTEGER_CST)
+    exit = NULL;
+
+  if (!flag_peel_loops || PARAM_VALUE (PARAM_MAX_PEEL_TIMES) <= 0)
+    return false;
+
+  /* Peel only innermost loops.  */
+  if (loop->inner)
+    {
+      if (dump_file)
+        fprintf (dump_file, "Not peeling: outer loop\n");
+      return false;
+    }
+
+  if (!optimize_loop_for_speed_p (loop))
+    {
+      if (dump_file)
+        fprintf (dump_file, "Not peeling: cold loop\n");
+      return false;
+    }
+
+  /* Check if there is an estimate on the number of iterations.  */
+  npeel = estimated_loop_iterations_int (loop);
+  if (npeel < 0)
+    {
+      if (dump_file)
+        fprintf (dump_file, "Not peeling: number of iterations is not "
+	         "estimated\n");
+      return false;
+    }
+  if (maxiter >= 0 && maxiter <= npeel)
+    {
+      if (dump_file)
+        fprintf (dump_file, "Not peeling: upper bound is known so can "
+		 "unroll complettely\n");
+      return false;
+    }
+
+  /* We want to peel estimated number of iterations + 1 (so we never
+     enter the loop on quick path).  Check against PARAM_MAX_PEEL_TIMES
+     and be sure to avoid overflows.  */
+  if (npeel > PARAM_VALUE (PARAM_MAX_PEEL_TIMES) - 1)
+    {
+      if (dump_file)
+        fprintf (dump_file, "Not peeling: rolls too much "
+		 "(%i + 1 > --param max-peel-times)\n", npeel);
+      return false;
+    }
+  npeel++;
+
+  /* Check peeled loops size.  */
+  tree_estimate_loop_size (loop, exit, NULL, &size,
+			   PARAM_VALUE (PARAM_MAX_PEELED_INSNS));
+  if ((peeled_size = estimated_peeled_sequence_size (&size, npeel))
+      > PARAM_VALUE (PARAM_MAX_PEELED_INSNS))
+    {
+      if (dump_file)
+        fprintf (dump_file, "Not peeling: peeled sequence size is too large "
+		 "(%i insns > --param max-peel-insns)", peeled_size);
+      return false;
+    }
+
+  /* Duplicate possibly eliminating the exits.  */
+  initialize_original_copy_tables ();
+  wont_exit = sbitmap_alloc (npeel + 1);
+  bitmap_ones (wont_exit);
+  bitmap_clear_bit (wont_exit, 0);
+  if (!gimple_duplicate_loop_to_header_edge (loop, loop_preheader_edge (loop),
+					     npeel, wont_exit,
+					     exit, &to_remove,
+					     DLTHE_FLAG_UPDATE_FREQ
+					     | DLTHE_FLAG_COMPLETTE_PEEL))
+    {
+      free_original_copy_tables ();
+      free (wont_exit);
+      return false;
+    }
+  FOR_EACH_VEC_ELT (to_remove, i, e)
+    {
+      bool ok = remove_path (e);
+      gcc_assert (ok);
+    }
+  free (wont_exit);
+  free_original_copy_tables ();
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Peeled loop %d, %i times.\n",
+	       loop->num, npeel);
+    }
+  if (loop->any_upper_bound)
+    loop->nb_iterations_upper_bound -= double_int::from_uhwi (npeel);
+  loop->nb_iterations_estimate = double_int_zero;
+  /* Make sure to mark loop cold so we do not try to peel it more.  */
+  scale_loop_profile (loop, 1, 0);
+  loop->header->count = 0;
+  return true;
+}
 /* Adds a canonical induction variable to LOOP if suitable.
    CREATE_IV is true if we may create a new iv.  UL determines
    which loops we are allowed to completely unroll.  If TRY_EVAL is true, we try
@@ -939,6 +1089,9 @@ canonicalize_loop_induction_variables (s
       && exit && just_once_each_iteration_p (loop, exit->src))
     create_canonical_iv (loop, exit, niter);
 
+  if (ul == UL_ALL)
+    modified |= try_peel_loop (loop, exit, niter, maxiter);
+
   return modified;
 }
 
@@ -981,8 +1134,10 @@ canonicalize_induction_variables (void)
     }
   BITMAP_FREE (loop_closed_ssa_invalidated);
 
+  /* Update virtuals because we possibly introduced __builtin_unreachable
+     call.  */
   if (changed)
-    return TODO_cleanup_cfg;
+    return TODO_cleanup_cfg | TODO_update_ssa_only_virtuals;
   return 0;
 }
 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-11-19 12:46         ` Jan Hubicka
@ 2012-11-21 16:25           ` Jan Hubicka
  2012-12-03 12:06             ` Eric Botcazou
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Hubicka @ 2012-11-21 16:25 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Eric Botcazou, gcc-patches, jakub

Hi,
here is updated patch.  It should get the bounds safe enough to not have effect
on codegen of complette unrolling.

There is IMO no way to cut the walk of loop body w/o affecting codegen in 
unrolling for size mode.  The condition on unroling to happen is

 unrolled_size * 2 / 3 < original_size

The patch makes the function walking body to stop after minimal number of
duplicated insns is large (PARAM_MAX_COMPLETELY_PEELED_INSNS). The formula
above allows unlimited duplication when loop body is large enough. This is
more a bug than feature, so I think it is safe to alter it.

Bootstrapped/regtested x86_64-linux, OK?

Honza

 	* tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
 	parameter.
 	(try_unroll_loop_completely) Update.


Index: tree-ssa-loop-ivcanon.c
===================================================================
--- tree-ssa-loop-ivcanon.c	(revision 193694)
+++ tree-ssa-loop-ivcanon.c	(working copy)
@@ -1,5 +1,5 @@
-/* Induction variable canonicalization.
-   Copyright (C) 2004, 2005, 2007, 2008, 2010
+/* Induction variable canonicalization and loop peeling.
+   Copyright (C) 2004, 2005, 2007, 2008, 2010, 2012
    Free Software Foundation, Inc.
 
 This file is part of GCC.
@@ -207,10 +210,12 @@ constant_after_peeling (tree op, gimple
    iteration of the loop.
    EDGE_TO_CANCEL (if non-NULL) is an non-exit edge eliminated in the last iteration
    of loop.
-   Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT.  */
+   Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT. 
+   Stop estimating after UPPER_BOUND is met. Return true in this case */
 
-static void
-tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel, struct loop_size *size)
+static bool
+tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel, struct loop_size *size,
+			 int upper_bound)
 {
   basic_block *body = get_loop_body (loop);
   gimple_stmt_iterator gsi;
@@ -316,6 +321,12 @@ tree_estimate_loop_size (struct loop *lo
 	      if (likely_eliminated || likely_eliminated_last)
 		size->last_iteration_eliminated_by_peeling += num;
 	    }
+	  if ((size->overall * 3 / 2 - size->eliminated_by_peeling
+	      - size->last_iteration_eliminated_by_peeling) > upper_bound)
+	    {
+              free (body);
+	      return true;
+	    }
 	}
     }
   while (path.length ())
@@ -357,6 +368,7 @@ tree_estimate_loop_size (struct loop *lo
 	     size->last_iteration_eliminated_by_peeling);
 
   free (body);
+  return false;
 }
 
 /* Estimate number of insns of completely unrolled loop.
@@ -699,12 +711,22 @@ try_unroll_loop_completely (struct loop
       sbitmap wont_exit;
       edge e;
       unsigned i;
-      vec<edge> to_remove = vNULL;
+      bool large;
+      vec<edge> to_remove = vNULL;
       if (ul == UL_SINGLE_ITER)
 	return false;
 
-      tree_estimate_loop_size (loop, exit, edge_to_cancel, &size);
+      large = tree_estimate_loop_size
+		 (loop, exit, edge_to_cancel, &size,
+		  PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS));
       ninsns = size.overall;
+      if (large)
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, "Not unrolling loop %d: it is too large.\n",
+		     loop->num);
+	  return false;
+	}
 
       unr_insns = estimated_unrolled_size (&size, n_unroll);
       if (dump_file && (dump_flags & TDF_DETAILS))

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-11-18 17:09     ` Jan Hubicka
  2012-11-18 18:47       ` Eric Botcazou
@ 2012-11-23 18:46       ` Hans-Peter Nilsson
  2012-11-24  7:47         ` Jan Hubicka
  1 sibling, 1 reply; 19+ messages in thread
From: Hans-Peter Nilsson @ 2012-11-23 18:46 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Eric Botcazou, gcc-patches, jakub

On Sun, 18 Nov 2012, Jan Hubicka wrote:
> > > > this patch reduces max-peeled-insns and max-completely-peeled-insns from 400
> > > > to 100.  The reason why I am doing this is that I want to reduce code bloat
> > > > caused by my cunroll work that enabled a lot more unrolling then previously
> > > > causing considerable code size regression at -O3.
> > >
> > > Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now again
> > > takes a while to compile, so times out on slow machines:
> >
> > I did not :(.  I am currently on a trip, but will take a look on tuesday.
> > If it seems to disturb testing, please just revert the patch for time being.
>
> OK, here are multiple issues.
> 1) recursive inlining makes huge loop nest (of 18 loops)
> 2) SCEV is very slow on answering simple_iv tests in this case becuase it walks
>    the nest
> 3) unroller is computing loop body size even when it is clear the body is much larger
>    than the limit (the outer loop has 78000 instructions)
>
> I will prepare patches to fix those issues.

The recent (well, a week ago) params.def change also regressed
gfortran.dg/reassoc_4.f almost everywhere; see PR55452.

I guess a fix is fairly trivial, I just don't know to what.

brgds, H-P

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-11-23 18:46       ` Hans-Peter Nilsson
@ 2012-11-24  7:47         ` Jan Hubicka
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Hubicka @ 2012-11-24  7:47 UTC (permalink / raw)
  To: Hans-Peter Nilsson; +Cc: Jan Hubicka, Eric Botcazou, gcc-patches, jakub

> On Sun, 18 Nov 2012, Jan Hubicka wrote:
> > > > > this patch reduces max-peeled-insns and max-completely-peeled-insns from 400
> > > > > to 100.  The reason why I am doing this is that I want to reduce code bloat
> > > > > caused by my cunroll work that enabled a lot more unrolling then previously
> > > > > causing considerable code size regression at -O3.
> > > >
> > > > Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now again
> > > > takes a while to compile, so times out on slow machines:
> > >
> > > I did not :(.  I am currently on a trip, but will take a look on tuesday.
> > > If it seems to disturb testing, please just revert the patch for time being.
> >
> > OK, here are multiple issues.
> > 1) recursive inlining makes huge loop nest (of 18 loops)
> > 2) SCEV is very slow on answering simple_iv tests in this case becuase it walks
> >    the nest
> > 3) unroller is computing loop body size even when it is clear the body is much larger
> >    than the limit (the outer loop has 78000 instructions)
> >
> > I will prepare patches to fix those issues.
> 
> The recent (well, a week ago) params.def change also regressed
> gfortran.dg/reassoc_4.f almost everywhere; see PR55452.
> 
> I guess a fix is fairly trivial, I just don't know to what.

Yes, we siply should add explicit unrolling limits there, I believe I posted a patch?
I am currently on a way, I will look up the message and/or post it.

Honza
> 
> brgds, H-P

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-11-21 16:25           ` Jan Hubicka
@ 2012-12-03 12:06             ` Eric Botcazou
  2012-12-04 18:28               ` Jan Hubicka
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Botcazou @ 2012-12-03 12:06 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, jakub, Richard Biener

> here is updated patch.  It should get the bounds safe enough to not have
> effect on codegen of complette unrolling.
> 
> There is IMO no way to cut the walk of loop body w/o affecting codegen in
> unrolling for size mode.  The condition on unroling to happen is
> 
>  unrolled_size * 2 / 3 < original_size
> 
> The patch makes the function walking body to stop after minimal number of
> duplicated insns is large (PARAM_MAX_COMPLETELY_PEELED_INSNS). The formula
> above allows unlimited duplication when loop body is large enough. This is
> more a bug than feature, so I think it is safe to alter it.
> 
> Bootstrapped/regtested x86_64-linux, OK?
> 
> Honza
> 
>  	* tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
>  	parameter.
>  	(try_unroll_loop_completely) Update.

The patch hasn't been installed, has it?  The test still takes 20s to compile 
at -O3 on a fast x86-64 box, so you can imagine what this yields on slower 
machines (and that's before the x4 because of the various dg-torture options).

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-12-03 12:06             ` Eric Botcazou
@ 2012-12-04 18:28               ` Jan Hubicka
  2012-12-06  9:35                 ` Richard Biener
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Hubicka @ 2012-12-04 18:28 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: Jan Hubicka, gcc-patches, jakub, Richard Biener

> > here is updated patch.  It should get the bounds safe enough to not have
> > effect on codegen of complette unrolling.
> > 
> > There is IMO no way to cut the walk of loop body w/o affecting codegen in
> > unrolling for size mode.  The condition on unroling to happen is
> > 
> >  unrolled_size * 2 / 3 < original_size
> > 
> > The patch makes the function walking body to stop after minimal number of
> > duplicated insns is large (PARAM_MAX_COMPLETELY_PEELED_INSNS). The formula
> > above allows unlimited duplication when loop body is large enough. This is
> > more a bug than feature, so I think it is safe to alter it.
> > 
> > Bootstrapped/regtested x86_64-linux, OK?
> > 
> > Honza
> > 
> >  	* tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
> >  	parameter.
> >  	(try_unroll_loop_completely) Update.
> 
> The patch hasn't been installed, has it?  The test still takes 20s to compile 
> at -O3 on a fast x86-64 box, so you can imagine what this yields on slower 
> machines (and that's before the x4 because of the various dg-torture options).

Yes, I need approval for this one.
http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01798.html

Honza

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-12-04 18:28               ` Jan Hubicka
@ 2012-12-06  9:35                 ` Richard Biener
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Biener @ 2012-12-06  9:35 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Eric Botcazou, gcc-patches, jakub

On Tue, 4 Dec 2012, Jan Hubicka wrote:

> > > here is updated patch.  It should get the bounds safe enough to not have
> > > effect on codegen of complette unrolling.
> > > 
> > > There is IMO no way to cut the walk of loop body w/o affecting codegen in
> > > unrolling for size mode.  The condition on unroling to happen is
> > > 
> > >  unrolled_size * 2 / 3 < original_size
> > > 
> > > The patch makes the function walking body to stop after minimal number of
> > > duplicated insns is large (PARAM_MAX_COMPLETELY_PEELED_INSNS). The formula
> > > above allows unlimited duplication when loop body is large enough. This is
> > > more a bug than feature, so I think it is safe to alter it.
> > > 
> > > Bootstrapped/regtested x86_64-linux, OK?
> > > 
> > > Honza
> > > 
> > >  	* tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add UPPER_BOUND
> > >  	parameter.
> > >  	(try_unroll_loop_completely) Update.
> > 
> > The patch hasn't been installed, has it?  The test still takes 20s to compile 
> > at -O3 on a fast x86-64 box, so you can imagine what this yields on slower 
> > machines (and that's before the x4 because of the various dg-torture options).
> 
> Yes, I need approval for this one.
> http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01798.html

Ok.

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-11-25 12:22   ` Dominique Dhumieres
@ 2012-11-25 12:26     ` Dominique Dhumieres
  0 siblings, 0 replies; 19+ messages in thread
From: Dominique Dhumieres @ 2012-11-25 12:26 UTC (permalink / raw)
  To: hubicka, dominiq; +Cc: hubicka, gcc-patches, ebotcazou

My mailer has eaten a line in my previous mail. One should read:

I have found another fall out: I have some avatars of the polyhedron tests
where the REAL(8) have been replaced with REAL(10). Some of them are now ~50%
slower with the new value of max-completely-peeled-insns.
Should I open a new PR for that?

Sorry for the noise.

Dominique

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-11-21 16:27 ` Jan Hubicka
@ 2012-11-25 12:22   ` Dominique Dhumieres
  2012-11-25 12:26     ` Dominique Dhumieres
  0 siblings, 1 reply; 19+ messages in thread
From: Dominique Dhumieres @ 2012-11-25 12:22 UTC (permalink / raw)
  To: hubicka, dominiq; +Cc: hubicka, gcc-patches, ebotcazou

> ... I believe I posted a patch?

Yes: http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01799.html

I have found another fall out: I have some avatars of the polyhedron tests
where the REAL(8) have been replaced with REAL(10). Some of them are now
Should I open a new PR for that?

Cheers,

Dominique

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-11-18 13:35 Dominique Dhumieres
@ 2012-11-21 16:27 ` Jan Hubicka
  2012-11-25 12:22   ` Dominique Dhumieres
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Hubicka @ 2012-11-21 16:27 UTC (permalink / raw)
  To: Dominique Dhumieres; +Cc: gcc-patches, ebotcazou, hubicka

> > Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now again
> > takes a while to compile, so times out on slow machines:
> > ...
> 
> On a 2.5Ghz Core2Duo, compiling the test with revision 192891 (2012-10-28)
> takes a small fraction of a second, while with revision 193270 (2012-11-06)
> it takes ~25s.
> 
> However this patch makes gfortran.dg/reassoc_4.f to fail
> 
> FAIL: gfortran.dg/reassoc_4.f  -O   scan-tree-dump-times reassoc1 "[0-9] \\\\* " 22
> 
> After it 22 should be replaced with 16 (thresshold max-completely-peeled-insns=138
> gives 16, =139 gives 22).

I would propose the following patch instead.  The patch anyway changes the limits on some
targets, so lets change them on all.

Honza

Index: reassoc_4.f
===================================================================
--- reassoc_4.f (revision 193698)
+++ reassoc_4.f (working copy)
@@ -1,7 +1,5 @@
 ! { dg-do compile }
-! { dg-options "-O3 -ffast-math -fdump-tree-reassoc1" }
-! { dg-additional-options "--param max-completely-peel-times=16" { target spu-*-* } }
-! { dg-additional-options "--param max-completely-peeled-insns=400" { target s390*-*-* } }
+! { dg-options "-O3 -ffast-math -fdump-tree-reassoc1 --param max-completely-peel-times=16 --param max-completely-peeled-insns=400" }
       subroutine anisonl(w,vo,anisox,s,ii1,jj1,weight)
       integer ii1,jj1,i1,iii1,j1,jjj1,k1,l1,m1,n1
       real*8 w(3,3),vo(3,3),anisox(3,3,3,3),s(60,60),weight

> 
> Dominique

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
  2012-11-21 13:47 Dominique Dhumieres
@ 2012-11-21 14:16 ` Jan Hubicka
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Hubicka @ 2012-11-21 14:16 UTC (permalink / raw)
  To: Dominique Dhumieres; +Cc: gcc-patches, ebotcazou, hubicka

> FAIL: gcc.dg/graphite/interchange-8.c scan-tree-dump-times graphite "will be interchanged" 2
> FAIL: gcc.dg/graphite/pr42530.c (internal compiler error)
> FAIL: gcc.dg/graphite/pr42530.c (test for excess errors)
> FAIL: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli "Unrolled loop 1 completely .duplicated 2 times.."
> FAIL: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli "Last iteration exit edge was proved true."
> FAIL: gcc.dg/tree-ssa/cunroll-3.c scan-tree-dump cunrolli "Unrolled loop 1 completely .duplicated 1 times.."
> FAIL: gcc.dg/tree-ssa/loop-36.c scan-tree-dump-not dce2 "c.array"
> FAIL: gcc.dg/tree-ssa/loop-37.c scan-tree-dump-not optimized "my_array"
> FAIL: gcc.dg/tree-ssa/pr21829.c scan-tree-dump-not optimized "if \\("
> FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer  execution test
> FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-loops  execution test
> FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
> FAIL: libgomp.fortran/reduction2.f90  -O3 -g  execution test
> FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer  execution test
> FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-loops  execution test
> FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
> FAIL: libgomp.fortran/reduction2.f90  -O3 -g  execution test

Yep, problem here is the *2/3 heuristic in estimated unroller body size. I am
back to internet access, so  I will look into it today or tomorrow.

Honza
> 
> for both -m32 and -m64 +
> 
> FAIL: gcc.dg/tree-ssa/loadpre6.c scan-tree-dump-times pre "Insertions: 2" 1
> 
> with -m32.
> 
> Dominique

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
@ 2012-11-21 13:47 Dominique Dhumieres
  2012-11-21 14:16 ` Jan Hubicka
  0 siblings, 1 reply; 19+ messages in thread
From: Dominique Dhumieres @ 2012-11-21 13:47 UTC (permalink / raw)
  To: gcc-patches; +Cc: ebotcazou, hubicka

Hi Jan,

> this is patch I will try to test once I have chance :)
> It simply prevents unroller from analyzing loops when they are already too large.
> ...

This patch breaks bootstrap with

...
/opt/gcc/p_build/./prev-gcc/g++ -B/opt/gcc/p_build/./prev-gcc/ -B/opt/gcc/gcc4.8p-193652p3/x86_64-apple-darwin10.8.0/bin/ -nostdinc++ -B/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/src/.libs -B/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/libsupc++/.libs -I/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/include/x86_64-apple-darwin10.8.0 -I/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/include -I/opt/gcc/p_work/libstdc++-v3/libsupc++ -L/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/src/.libs -L/opt/gcc/p_build/prev-x86_64-apple-darwin10.8.0/libstdc++-v3/libsupc++/.libs -c   -g -O2 -mdynamic-no-pic -gtoggle -DIN_GCC   -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror   -DHAVE_CONFIG_H -I. -I. -I../../p_work/gcc -I../../p_work/gcc/. -I../../p_work/gcc/../include -I./../intl -I../../p_work/gcc/../libcpp/include -I/opt/mp/include  -I../../p_work/gcc/../libdecnumber -I../../p_work/gcc/../libdecnumber/dpd -I../libdecnumber -I../../p_work/gcc/../libbacktrace -DCLOOG_INT_GMP  -I/opt/mp/include  ../../p_work/gcc/tree-ssa-loop-ivopts.c -o tree-ssa-loop-ivopts.o
../../p_work/gcc/tree-ssa-loop-ivcanon.c: In function 'bool canonicalize_loop_induction_variables(loop*, bool, unroll_level, bool)':
../../p_work/gcc/tree-ssa-loop-ivcanon.c:690:62: error: 'n_unroll' may be used uninitialized in this function [-Werror=maybe-uninitialized]
       && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
                                                              ^
../../p_work/gcc/tree-ssa-loop-ivcanon.c:656:26: note: 'n_unroll' was declared here
   unsigned HOST_WIDE_INT n_unroll, ninsns, max_unroll, unr_insns;
                          ^
cc1plus: all warnings being treated as errors
...

I have completed bootstrap with the following change

--- ../_clean/gcc/tree-ssa-loop-ivcanon.c	2012-11-18 11:27:28.000000000 +0100
+++ gcc/tree-ssa-loop-ivcanon.c	2012-11-20 16:27:07.000000000 +0100
@@ -641,9 +641,10 @@ try_unroll_loop_completely (struct loop 
 			    enum unroll_level ul,
 			    HOST_WIDE_INT maxiter)
 {
-  unsigned HOST_WIDE_INT n_unroll, ninsns, max_unroll, unr_insns;
+  unsigned HOST_WIDE_INT ninsns, max_unroll, unr_insns;
   gimple cond;
   struct loop_size size;
+  unsigned HOST_WIDE_INT n_unroll = 0;
   bool n_unroll_found = false;
   edge edge_to_cancel = NULL;
   int num = loop->num;

After that the compilation of gcc.c-torture/compile/pr43186.c is back to
a fraction of a second, but I see the following regressions:

FAIL: gcc.dg/graphite/interchange-8.c scan-tree-dump-times graphite "will be interchanged" 2
FAIL: gcc.dg/graphite/pr42530.c (internal compiler error)
FAIL: gcc.dg/graphite/pr42530.c (test for excess errors)
FAIL: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli "Unrolled loop 1 completely .duplicated 2 times.."
FAIL: gcc.dg/tree-ssa/cunroll-1.c scan-tree-dump cunrolli "Last iteration exit edge was proved true."
FAIL: gcc.dg/tree-ssa/cunroll-3.c scan-tree-dump cunrolli "Unrolled loop 1 completely .duplicated 1 times.."
FAIL: gcc.dg/tree-ssa/loop-36.c scan-tree-dump-not dce2 "c.array"
FAIL: gcc.dg/tree-ssa/loop-37.c scan-tree-dump-not optimized "my_array"
FAIL: gcc.dg/tree-ssa/pr21829.c scan-tree-dump-not optimized "if \\("
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -g  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: libgomp.fortran/reduction2.f90  -O3 -g  execution test

for both -m32 and -m64 +

FAIL: gcc.dg/tree-ssa/loadpre6.c scan-tree-dump-times pre "Insertions: 2" 1

with -m32.

Dominique

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Reduce complette unrolling & peeling limits
@ 2012-11-18 13:35 Dominique Dhumieres
  2012-11-21 16:27 ` Jan Hubicka
  0 siblings, 1 reply; 19+ messages in thread
From: Dominique Dhumieres @ 2012-11-18 13:35 UTC (permalink / raw)
  To: gcc-patches; +Cc: ebotcazou, hubicka

> Did you notice that gcc.c-torture/compile/pr43186.c regressed?  It now again
> takes a while to compile, so times out on slow machines:
> ...

On a 2.5Ghz Core2Duo, compiling the test with revision 192891 (2012-10-28)
takes a small fraction of a second, while with revision 193270 (2012-11-06)
it takes ~25s.

However this patch makes gfortran.dg/reassoc_4.f to fail

FAIL: gfortran.dg/reassoc_4.f  -O   scan-tree-dump-times reassoc1 "[0-9] \\\\* " 22

After it 22 should be replaced with 16 (thresshold max-completely-peeled-insns=138
gives 16, =139 gives 22).

Dominique

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2012-12-06  9:35 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-14 23:34 Reduce complette unrolling & peeling limits Jan Hubicka
2012-11-15 13:19 ` Jakub Jelinek
2012-11-18  8:09 ` Eric Botcazou
2012-11-18 16:52   ` Jan Hubicka
2012-11-18 17:09     ` Jan Hubicka
2012-11-18 18:47       ` Eric Botcazou
2012-11-19 12:46         ` Jan Hubicka
2012-11-21 16:25           ` Jan Hubicka
2012-12-03 12:06             ` Eric Botcazou
2012-12-04 18:28               ` Jan Hubicka
2012-12-06  9:35                 ` Richard Biener
2012-11-23 18:46       ` Hans-Peter Nilsson
2012-11-24  7:47         ` Jan Hubicka
2012-11-18 13:35 Dominique Dhumieres
2012-11-21 16:27 ` Jan Hubicka
2012-11-25 12:22   ` Dominique Dhumieres
2012-11-25 12:26     ` Dominique Dhumieres
2012-11-21 13:47 Dominique Dhumieres
2012-11-21 14:16 ` Jan Hubicka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).