Compile-time and execution-time impact of CSE

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Compile-time and execution-time impact of CSE - some numbers
@ 2004-08-31 13:52 Paolo Bonzini
  2004-08-31 14:39 ` Steven Bosscher
  0 siblings, 1 reply; 3+ messages in thread
From: Paolo Bonzini @ 2004-08-31 13:52 UTC (permalink / raw)
  To: gcc

[-- Attachment #1: Type: text/plain, Size: 3863 bytes --]

Hello, these are the results of a simple attempt at trimming the time
spent in CSE passes.  Not very encouraging really, but maybe it can
help more experienced people than me. 

The first thing I tried is to remove CSE1 and move EBB CSE to -O3,
using the attached patch.  This also meant that we do not run local CSE
at -O1 anymore.  Here are the results of this and other experiments;
all times were taken on a Pentium 4 machine running at 1.7 GHz.

As for bootstrapping, I have only timed a C-only --disable-checking
bootstrap.  Bootstrapping times are very similar but they are not very
representative of the effect of the patch, due to the large time spent
compiling stage2; but compiling stage3 takes 10:22 minutes instead of
11:00, which is about 6% faster.

I then timed combine.i files.  I ran the compiler five times, took out
the runs with the best and worst overall time, and averaged the other
three (the machine was very lightly loaded and has plenty of memory,
so system time did not matter).  Times are in the following table. The
headings are different at -O1 than for other optimization levels,
because CSE and/or GCSE are not run there:

             -O1       | -O2             | -O3
             tot   CSE | tot   GCSE  CSE | tot   GCSE  CSE
patched      6.74  --- | 10.07 0.43 0.35 | 15.19 1.16 0.90
HEAD         6.99 0.16 | 10.86 0.48 1.04 | 16.00 1.10 1.59
improvement  4.1%      |  7.3%           |  5.0%

For -O2 I got run-time numbers too, which I took from a CPU-intensive
sed benchmark (I used sed 4.1.1, compiled with IMA including the regex
matcher), doing measurements in the same way as above for both
compilation and the sed benchmark dc.sed.

The results are in the table that follows and are for several compilers:

1) "patched" is as above

2) "HEAD, no EBB" is mainline with -fno-cse-skip-blocks -fno-cse-follow-jumps:
the results are even worse.

3) "patched+EBB" uses the attached patch but without the hunks that move
-fcse-skip-blocks and -fcse-follow-jumps to -O3, since it looks like CSE
on EBBs is (still :-( ...) doing good, but CSE1 is not.

4) "HEAD, no CSE2" is a final try... let's disable CSE2 instead, and run a
full-power CSE1 (no GCSE column in the table since the two GCSE's look
at exactly the same things): this means moving -frerun-cse-after-loop to
-O3 (or using HEAD's compiler with -O2 -fno-rerun-cse-after-loop).

               combine.i             sed
               tot    GCSE    CSE  | compile  GCSE    CSE     dc.sed
-----------------------------------+---------------------------------
HEAD           10.86  0.48   1.04  | 10.50    0.33   1.08     11.77
-----------------------------------+---------------------------------
patched        10.07  0.43   0.35  |  9.69    0.35   0.38     11.96
improvement     7.3%               |  7.7%                    -1.6%
-----------------------------------+---------------------------------
HEAD, no EBB   10.28  0.46   0.67  |  9.99    0.31   0.67     12.03
improvement     5.3%               |  4.8%                    -2.1%
-----------------------------------+---------------------------------
patched+EBB    10.31  0.46   0.66  | 10.00    0.34   0.65     11.89
improvement     5.1%               |  4.8%                    -1.0%
-----------------------------------+---------------------------------
HEAD, no CSE2  10.47  0.48   0.62  | 10.05    0.33   0.76     11.85
improvement     3.6%               |  4.3%                    -0.7%

4.1% on -O1 looks good to me, and I think we can safely lose 1-2% of
execution time at -O1.  But for -O2 only the last two are worth running
SPEC on.  If anybody wants to try, for the latter there's not even a
patch to apply.  But it looks like at -O2 the RTL passes are not going
away soon. :-(

Regards,

Paolo


-------------------------------------------------
This mail sent through IMP: http://horde.org/imp/

[-- Attachment #2: rtl-passes.patch --]
[-- Type: text/plain, Size: 9029 bytes --]

2004-08-13  Paolo Bonzini  <bonzini@gnu.org>

	* opts.c (decode_options): Move -fcse-skip-blocks and
	-fcse-follow-jumps to -O3.
	* passes.c (rest_of_handle_cse): Remove.
	(rest_of_handle_cse2): Rename to rest_of_handle_cse, use
	TV_CSE and DFI_cse.
	(rest_of_compilation): Adjust for the above.
	(rest_of_handle_gcse): Clean up loop to call CSE.
	(dump_file_info): Remove DFI_cse2.
	* timevar.def (TV_CSE2): Remove.
	* toplev.c (process_options): Make -fcse-skip-blocks or
	-fcse-follow-jumps imply -frerun-cse-after-loop.

diff -rup gcc-backup/gcc/opts.c gcc/gcc/opts.c
--- gcc-backup/gcc/opts.c	2004-08-08 10:20:19.000000000 +0200
+++ gcc/gcc/opts.c	2004-08-13 09:17:03.000000000 +0200
@@ -517,8 +517,6 @@ decode_options (unsigned int argc, const
     {
       flag_crossjumping = 1;
       flag_optimize_sibling_calls = 1;
-      flag_cse_follow_jumps = 1;
-      flag_cse_skip_blocks = 1;
       flag_gcse = 1;
       flag_expensive_optimizations = 1;
       flag_strength_reduce = 1;
@@ -544,6 +542,8 @@ decode_options (unsigned int argc, const
       flag_inline_functions = 1;
       flag_unswitch_loops = 1;
       flag_gcse_after_reload = 1;
+      flag_cse_follow_jumps = 1;
+      flag_cse_skip_blocks = 1;
     }
 
   if (optimize < 2 || optimize_size)
diff -rup gcc-backup/gcc/passes.c gcc/gcc/passes.c
--- gcc-backup/gcc/passes.c	2004-08-08 10:20:19.000000000 +0200
+++ gcc/gcc/passes.c	2004-08-13 09:31:26.000000000 +0200
@@ -138,7 +138,6 @@ enum dump_file_index
   DFI_eh,
   DFI_jump,
   DFI_null,
-  DFI_cse,
   DFI_gcse,
   DFI_loop,
   DFI_bypass,
@@ -149,7 +148,7 @@ enum dump_file_index
   DFI_tracer,
   DFI_loop2,
   DFI_web,
-  DFI_cse2,
+  DFI_cse,
   DFI_life,
   DFI_combine,
   DFI_ce2,
@@ -179,7 +178,7 @@ enum dump_file_index
 
    Remaining -d letters:
 
-	"   e            q         "
+	"   e            q s       "
 	"    F     K   O Q     WXY "
 */
 
@@ -191,7 +190,6 @@ static struct dump_file_info dump_file_t
   { "eh",	'h', 0, 0, 0 },
   { "jump",	'j', 0, 0, 0 },
   { "null",	'u', 0, 0, 0 },
-  { "cse",	's', 0, 0, 0 },
   { "gcse",	'G', 1, 0, 0 },
   { "loop",	'L', 1, 0, 0 },
   { "bypass",   'G', 1, 0, 0 }, /* Yes, duplicate enable switch.  */
@@ -1129,50 +1127,14 @@ rest_of_handle_life (void)
   ggc_collect ();
 }
 
-/* Perform common subexpression elimination.  Nonzero value from
-   `cse_main' means that jumps were simplified and some code may now
-   be unreachable, so do jump optimization again.  */
+/* Run local CSE pass after loop optimizations.  */
 static void
 rest_of_handle_cse (void)
 {
   int tem;
 
-  open_dump_file (DFI_cse, current_function_decl);
-  if (dump_file)
-    dump_flow_info (dump_file);
   timevar_push (TV_CSE);
-
-  reg_scan (get_insns (), max_reg_num (), 1);
-
-  tem = cse_main (get_insns (), max_reg_num (), 0, dump_file);
-  if (tem)
-    rebuild_jump_labels (get_insns ());
-  if (purge_all_dead_edges (0))
-    delete_unreachable_blocks ();
-
-  delete_trivially_dead_insns (get_insns (), max_reg_num ());
-
-  /* If we are not running more CSE passes, then we are no longer
-     expecting CSE to be run.  But always rerun it in a cheap mode.  */
-  cse_not_expected = !flag_rerun_cse_after_loop && !flag_gcse;
-
-  if (tem || optimize > 1)
-    cleanup_cfg (CLEANUP_EXPENSIVE | CLEANUP_PRE_LOOP);
-
-  timevar_pop (TV_CSE);
-  close_dump_file (DFI_cse, print_rtl_with_bb, get_insns ());
-
-  ggc_collect ();
-}
-
-/* Run second CSE pass after loop optimizations.  */
-static void
-rest_of_handle_cse2 (void)
-{
-  int tem;
-
-  timevar_push (TV_CSE2);
-  open_dump_file (DFI_cse2, current_function_decl);
+  open_dump_file (DFI_cse, current_function_decl);
   if (dump_file)
     dump_flow_info (dump_file);
   /* CFG is no longer maintained up-to-date.  */
@@ -1184,6 +1146,8 @@ rest_of_handle_cse2 (void)
      bypassed safely.  */
   cse_condition_code_reg ();
 
+  cse_not_expected = 1;
+
   purge_all_dead_edges (0);
   delete_trivially_dead_insns (get_insns (), max_reg_num ());
 
@@ -1195,8 +1159,8 @@ rest_of_handle_cse2 (void)
       timevar_pop (TV_JUMP);
     }
   reg_scan (get_insns (), max_reg_num (), 0);
-  close_dump_file (DFI_cse2, print_rtl_with_bb, get_insns ());
-  timevar_pop (TV_CSE2);
+  close_dump_file (DFI_cse, print_rtl_with_bb, get_insns ());
+  timevar_pop (TV_CSE);
 
   ggc_collect ();
 }
@@ -1206,7 +1170,7 @@ static void
 rest_of_handle_gcse (void)
 {
   int save_csb, save_cfj;
-  int tem2 = 0, tem;
+  int tem;
 
   timevar_push (TV_GCSE);
   open_dump_file (DFI_gcse, current_function_decl);
@@ -1219,40 +1183,30 @@ rest_of_handle_gcse (void)
   save_cfj = flag_cse_follow_jumps;
   flag_cse_skip_blocks = flag_cse_follow_jumps = 0;
 
-  /* If -fexpensive-optimizations, re-run CSE to clean up things done
-     by gcse.  */
-  if (flag_expensive_optimizations)
-    {
-      timevar_push (TV_CSE);
-      reg_scan (get_insns (), max_reg_num (), 1);
-      tem2 = cse_main (get_insns (), max_reg_num (), 0, dump_file);
-      purge_all_dead_edges (0);
-      delete_trivially_dead_insns (get_insns (), max_reg_num ());
-      timevar_pop (TV_CSE);
-      cse_not_expected = !flag_rerun_cse_after_loop;
-    }
-
   /* If gcse or cse altered any jumps, rerun jump optimizations to clean
-     things up.  Then possibly re-run CSE again.  */
-  while (tem || tem2)
+     things up.  Then possibly run CSE again to cleanup things left by
+     GCSE.  */
+  while (tem)
     {
-      tem = tem2 = 0;
-      timevar_push (TV_JUMP);
-      rebuild_jump_labels (get_insns ());
-      cleanup_cfg (CLEANUP_EXPENSIVE | CLEANUP_PRE_LOOP);
-      timevar_pop (TV_JUMP);
-
       if (flag_expensive_optimizations)
 	{
 	  timevar_push (TV_CSE);
 	  reg_scan (get_insns (), max_reg_num (), 1);
-	  tem2 = cse_main (get_insns (), max_reg_num (), 0, dump_file);
+	  tem = cse_main (get_insns (), max_reg_num (), 0, dump_file);
 	  purge_all_dead_edges (0);
 	  delete_trivially_dead_insns (get_insns (), max_reg_num ());
 	  timevar_pop (TV_CSE);
 	}
+      else
+	tem = 0;
+
+      timevar_push (TV_JUMP);
+      rebuild_jump_labels (get_insns ());
+      cleanup_cfg (CLEANUP_EXPENSIVE | CLEANUP_PRE_LOOP);
+      timevar_pop (TV_JUMP);
     }
 
+  cse_not_expected = !flag_rerun_cse_after_loop;
   close_dump_file (DFI_gcse, print_rtl_with_bb, get_insns ());
   timevar_pop (TV_GCSE);
 
@@ -1791,9 +1745,6 @@ rest_of_compilation (void)
   rest_of_handle_jump2 ();
 
   if (optimize > 0)
-    rest_of_handle_cse ();
-
-  if (optimize > 0)
     {
       if (flag_gcse)
 	rest_of_handle_gcse ();
@@ -1840,9 +1791,7 @@ rest_of_compilation (void)
     rest_of_handle_web ();
 
   if (optimize > 0 && flag_rerun_cse_after_loop)
-    rest_of_handle_cse2 ();
-
-  cse_not_expected = 1;
+    rest_of_handle_cse ();
 
   rest_of_handle_life ();
   timevar_pop (TV_FLOW);
diff -rup gcc-backup/gcc/timevar.def gcc/gcc/timevar.def
--- gcc-backup/gcc/timevar.def	2004-08-08 10:20:19.000000000 +0200
+++ gcc/gcc/timevar.def	2004-08-13 09:28:12.000000000 +0200
@@ -97,13 +97,12 @@ DEFTIMEVAR (TV_TEMPLATE_INSTANTIATION, "
 DEFTIMEVAR (TV_EXPAND		     , "expand")
 DEFTIMEVAR (TV_VARCONST              , "varconst")
 DEFTIMEVAR (TV_JUMP                  , "jump")
-DEFTIMEVAR (TV_CSE                   , "CSE")
 DEFTIMEVAR (TV_GCSE                  , "global CSE")
 DEFTIMEVAR (TV_LOOP                  , "loop analysis")
 DEFTIMEVAR (TV_BYPASS                , "bypass jumps")
 DEFTIMEVAR (TV_TRACER                , "tracer")
 DEFTIMEVAR (TV_WEB                   , "web")
-DEFTIMEVAR (TV_CSE2                  , "CSE 2")
+DEFTIMEVAR (TV_CSE                   , "local CSE")
 DEFTIMEVAR (TV_BRANCH_PROB           , "branch prediction")
 DEFTIMEVAR (TV_VPT                   , "value profile opts")
 DEFTIMEVAR (TV_FLOW                  , "flow analysis")
diff -rup gcc-backup/gcc/toplev.c gcc/gcc/toplev.c
--- gcc-backup/gcc/toplev.c	2004-08-08 10:20:19.000000000 +0200
+++ gcc/gcc/toplev.c	2004-08-13 09:19:30.000000000 +0200
@@ -1679,15 +1679,18 @@ process_options (void)
     flag_old_unroll_loops = 1;
 
   /* Old loop unrolling requires that strength_reduction be on also.  Silently
-     turn on strength reduction here if it isn't already on.  Also, the loop
-     unrolling code assumes that cse will be run after loop, so that must
-     be turned on also.  */
+     turn on strength reduction here if it isn't already on.  */
   if (flag_old_unroll_loops)
-    {
-      flag_strength_reduce = 1;
-      flag_rerun_cse_after_loop = 1;
-    }
-  if (flag_unroll_loops || flag_peel_loops)
+    flag_strength_reduce = 1;
+
+  /* Also, the loop unrolling code assumes that cse will be run after loop, so
+     that must be turned on also.  Ditto if we requested a CSE pass on extended
+     basic blocks, since at all other times we limit CSE to local analysis.  */
+  if (flag_unroll_loops
+      || flag_old_unroll_loops
+      || flag_peel_loops
+      || flag_cse_skip_blocks
+      || flag_cse_follow_jumps)
     flag_rerun_cse_after_loop = 1;
 
   /* If explicitly asked to run new loop optimizer, switch off the old

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Compile-time and execution-time impact of CSE - some numbers
  2004-08-31 13:52 Compile-time and execution-time impact of CSE - some numbers Paolo Bonzini
@ 2004-08-31 14:39 ` Steven Bosscher
  2004-08-31 21:03   ` Paolo Bonzini
  0 siblings, 1 reply; 3+ messages in thread
From: Steven Bosscher @ 2004-08-31 14:39 UTC (permalink / raw)
  To: Paolo Bonzini, gcc

On Tuesday 31 August 2004 15:29, Paolo Bonzini wrote:
> Hello, these are the results of a simple attempt at trimming the time
> spent in CSE passes.  Not very encouraging really, but maybe it can
> help more experienced people than me.
>
> The first thing I tried is to remove CSE1 and move EBB CSE to -O3,
> using the attached patch.  This also meant that we do not run local CSE
> at -O1 anymore.  Here are the results of this and other experiments;
> all times were taken on a Pentium 4 machine running at 1.7 GHz.
>
> As for bootstrapping, I have only timed a C-only --disable-checking
> bootstrap.  Bootstrapping times are very similar but they are not very
> representative of the effect of the patch, due to the large time spent
> compiling stage2; but compiling stage3 takes 10:22 minutes instead of
> 11:00, which is about 6% faster.
>
> I then timed combine.i files.  I ran the compiler five times, took out
> the runs with the best and worst overall time, and averaged the other
> three (the machine was very lightly loaded and has plenty of memory,
> so system time did not matter).  Times are in the following table. The
> headings are different at -O1 than for other optimization levels,
> because CSE and/or GCSE are not run there:
>
>              -O1       | -O2             | -O3
>              tot   CSE | tot   GCSE  CSE | tot   GCSE  CSE
> patched      6.74  --- | 10.07 0.43 0.35 | 15.19 1.16 0.90
> HEAD         6.99 0.16 | 10.86 0.48 1.04 | 16.00 1.10 1.59
> improvement  4.1%      |  7.3%           |  5.0%
>
> For -O2 I got run-time numbers too, which I took from a CPU-intensive
> sed benchmark (I used sed 4.1.1, compiled with IMA including the regex
> matcher), doing measurements in the same way as above for both
> compilation and the sed benchmark dc.sed.
>
> The results are in the table that follows and are for several compilers:
>
> 1) "patched" is as above
>
> 2) "HEAD, no EBB" is mainline with -fno-cse-skip-blocks
> -fno-cse-follow-jumps: the results are even worse.
>
> 3) "patched+EBB" uses the attached patch but without the hunks that move
> -fcse-skip-blocks and -fcse-follow-jumps to -O3, since it looks like CSE
> on EBBs is (still :-( ...) doing good, but CSE1 is not.
>
> 4) "HEAD, no CSE2" is a final try... let's disable CSE2 instead, and run a
> full-power CSE1 (no GCSE column in the table since the two GCSE's look
> at exactly the same things): this means moving -frerun-cse-after-loop to
> -O3 (or using HEAD's compiler with -O2 -fno-rerun-cse-after-loop).
>
>                combine.i             sed
>                tot    GCSE    CSE  | compile  GCSE    CSE     dc.sed
> -----------------------------------+---------------------------------
> HEAD           10.86  0.48   1.04  | 10.50    0.33   1.08     11.77
> -----------------------------------+---------------------------------
> patched        10.07  0.43   0.35  |  9.69    0.35   0.38     11.96
> improvement     7.3%               |  7.7%                    -1.6%
> -----------------------------------+---------------------------------
> HEAD, no EBB   10.28  0.46   0.67  |  9.99    0.31   0.67     12.03
> improvement     5.3%               |  4.8%                    -2.1%
> -----------------------------------+---------------------------------
> patched+EBB    10.31  0.46   0.66  | 10.00    0.34   0.65     11.89
> improvement     5.1%               |  4.8%                    -1.0%
> -----------------------------------+---------------------------------
> HEAD, no CSE2  10.47  0.48   0.62  | 10.05    0.33   0.76     11.85
> improvement     3.6%               |  4.3%                    -0.7%
>
> 4.1% on -O1 looks good to me, and I think we can safely lose 1-2% of
> execution time at -O1.  But for -O2 only the last two are worth running
> SPEC on.  If anybody wants to try, for the latter there's not even a
> patch to apply.  But it looks like at -O2 the RTL passes are not going
> away soon. :-(

You will find that you can spend your time better on fixing the bugs
marked as "tree-optimization" and "missed-optimization".  That is still
where most of the things CSE1 and GCSE catch come from.

Also, the only *real* good way of speeding up CSE is by making it work
on extended basic blocks (ie. kill -fskip-blocks) and then teaching it
to not rescan already visited blocks (by using a scoped hash table).

Unfortunately, simply disabling -fskip-blocks doesn't give much speedup
either.  But that would be the first step.   The second step would be to
make cse.c use the CFG (ie. FOR_BB_INSNS/BB_HEAD/BB_ END/etc.) instead
of relying on block notes.  Next you'd clean up the path following code
to track back only to the last visited block before following a jump.

Gr.
Steven


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Compile-time and execution-time impact of CSE - some numbers
  2004-08-31 14:39 ` Steven Bosscher
@ 2004-08-31 21:03   ` Paolo Bonzini
  0 siblings, 0 replies; 3+ messages in thread
From: Paolo Bonzini @ 2004-08-31 21:03 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: gcc

> The second step would be to
> make cse.c use the CFG (ie. FOR_BB_INSNS/BB_HEAD/BB_ END/etc.) instead
> of relying on block notes.

Block notes are the lesser problem.  It's using barriers and labels to detect 
edge types, that would drive me nuts :-)

> Next you'd clean up the path following code
> to track back only to the last visited block before following a jump.

Is it different from a very low --param max-cse-path-length?

Paolo

-------------------------------------------------
This mail sent through IMP: http://horde.org/imp/

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2004-08-31 20:53 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-08-31 13:52 Compile-time and execution-time impact of CSE - some numbers Paolo Bonzini
2004-08-31 14:39 ` Steven Bosscher
2004-08-31 21:03   ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).