Where does the time go?

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Where does the time go?
@ 2010-05-20 15:55 Steven Bosscher
  2010-05-20 19:16 ` Vladimir Makarov
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Steven Bosscher @ 2010-05-20 15:55 UTC (permalink / raw)
  To: GCC Mailing List

Hello,

For some time now, I've wanted to see where compile time goes in a
typical GCC build, because nobody really seems to know what the
compiler spends its time on. The impressions that get published about
gcc usually indicate that there is at least a feeling that GCC is not
getting faster, and that parts of the compiler are unreasonably slow.
I was hoping to maybe shed some light on what parts that could be.

What I've done is this:
* Build GCC 4.6.0 (trunk r159624) with --enable-checking=release and
with -O2 and install it
* Build GCC 4.6.0 (trunk r159624) again, with the installed compiler
and with "-O2 -g3 -ftime-report". The time reports (along with
everything else on stderr) are piped to an output file
* Extract, sum, and sort time consumed per timevar

Host was cfarm gcc14 (8 x 3GHz Xeon). Target was
x86_64-unknown-linux-gnu. "Build" means non-bootstrap.

Results at the bottom of this mail.

Conclusions:

* There are quite a few timevars for parts of the compiler that have
been removed: TV_SEQABSTR, TV_GLOBAL_ALLOC, TV_LOCAL_ALLOC are the
ones I've spotted so far.  I will go through the whole list, remove
all timevars that are unused, and submit a patch.

* The "slow" parts of the compiler are not exactly news: tree-PRE,
scheduling, register allocation

* Variable tracking costs ~7.8% of compile time. This more than the
cost of the register allocation (IRA+reload)

* The C front end (preprocessing+lexing+parsing) costs ~17%. For an
optimizing compiler with so many passes, this is quite a lot.

* The GIMPLE optimizers (done with egrep
"tree|dominator_opt|alias_stmt_walking|alias_analysis|inline_heuristics|PHI_merge")
together cost ~16%.

* Adding and subtracting the above numbers, the rest of the compiler,
which is mostly the RTL parts, still account for 100-17-16-8=59% of
the total compile time. This was the most surprising result for me.

Ciao!
Steven


auto_inc_dec                                    0.00    0%
callgraph_verifier                              0.00    0%
cfg_construction                                0.00    0%
CFG_verifier                                    0.00    0%
delay_branch_sched                              0.00    0%
df_live_byte_regs                               0.00    0%
df_scan_insns                                   0.00    0%
df_uninitialized_regs_2                         0.00    0%
dump_files                                      0.00    0%
global_alloc                                    0.00    0%
Graphite_code_generation                        0.00    0%
Graphite_data_dep_analysis                      0.00    0%
Graphite_loop_transforms                        0.00    0%
ipa_free_lang_data                              0.00    0%
ipa_lto_cgraph_IO                               0.00    0%
ipa_lto_cgraph_merge                            0.00    0%
ipa_lto_decl_init_IO                            0.00    0%
ipa_lto_decl_IO                                 0.00    0%
ipa_lto_decl_merge                              0.00    0%
ipa_lto_gimple_IO                               0.00    0%
ipa_points_to                                   0.00    0%
ipa_profile                                     0.00    0%
ipa_type_escape                                 0.00    0%
life_analysis                                   0.00    0%
life_info_update                                0.00    0%
load_CSE_after_reload                           0.00    0%
local_alloc                                     0.00    0%
loop_doloop                                     0.00    0%
loop_unrolling                                  0.00    0%
loop_unswitching                                0.00    0%
LSM                                             0.00    0%
lto                                             0.00    0%
name_lookup                                     0.00    0%
overload_resolution                             0.00    0%
PCH_main_state_restore                          0.00    0%
PCH_main_state_save                             0.00    0%
PCH_pointer_reallocation                        0.00    0%
PCH_pointer_sort                                0.00    0%
PCH_preprocessor_state_restore                  0.00    0%
PCH_preprocessor_state_save                     0.00    0%
plugin_execution                                0.00    0%
plugin_initialization                           0.00    0%
predictive_commoning                            0.00    0%
reg_stack                                       0.00    0%
rename_registers                                0.00    0%
rest_of_compilation                             0.00    0%
sequence_abstraction                            0.00    0%
shorten_branches                                0.00    0%
sms_modulo_scheduling                           0.00    0%
template_instantiation                          0.00    0%
total_time                                      0.00    0%
tracer                                          0.00    0%
tree_check_data_dependences                     0.00    0%
tree_loop_distribution                          0.00    0%
tree_loop_linear                                0.00    0%
tree_loop_optimization                          0.00    0%
tree_loop_unswitching                           0.00    0%
tree_parallelize_loops                          0.00    0%
tree_prefetching                                0.00    0%
tree_redundant_PHIs                             0.00    0%
tree_slp_vectorization                          0.00    0%
tree_SSA_to_normal                              0.00    0%
tree_SSA_verifier                               0.00    0%
tree_STMT_verifier                              0.00    0%
tree_STORE_CCP                                  0.00    0%
tree_store_copy_prop                            0.00    0%
tree_vectorization                              0.00    0%
value_profile_opts                              0.00    0%
web                                             0.00    0%
whopr_ltrans                                    0.00    0%
whopr_wpa                                       0.00    0%
whopr_wpa_IO                                    0.00    0%
whopr_wpa_ltrans                                0.00    0%
mode_switching                                  0.01    0.00261117%
tree_NRV_optimization                           0.01    0.00261117%
tree_loop_fini                                  0.03    0.00783351%
tree_switch_initialization_conversion           0.03    0.00783351%
lower_subreg                                    0.04    0.0104447%
tree_buildin_call_DCE                           0.05    0.0130559%
code_hoisting                                   0.06    0.015667%
ipa_reference                                   0.06    0.015667%
tree_canonical_iv                               0.06    0.015667%
tree_if_combine                                 0.06    0.015667%
PHI_merge                                       0.07    0.0182782%
tree_phiprop                                    0.07    0.0182782%
uninit_var_anaysis                              0.07    0.0182782%
control_dependences                             0.08    0.0208894%
varconst                                        0.09    0.0235005%
tree_PHI_const_copy_prop                        0.16    0.0417787%
tree_eh                                         0.19    0.0496122%
tree_split_crit_edges                           0.19    0.0496122%
scev_constant_prop                              0.20    0.0522234%
tree_PHI_insertion                              0.20    0.0522234%
tree_copy_headers                               0.23    0.0600569%
tree_loop_bounds                                0.24    0.0626681%
tree_loop_invariant_motion                      0.27    0.0705016%
variable_output                                 0.27    0.0705016%
combine_stack_adjustments                       0.28    0.0731128%
garbage_collection                              0.28    0.0731128%
loop_analysis                                   0.28    0.0731128%
tree_SSA_uncprop                                0.28    0.0731128%
tree_SRA                                        0.30    0.0783351%
ipa_cp                                          0.34    0.0887798%
tree_linearize_phis                             0.34    0.0887798%
tree_DSE                                        0.39    0.101836%
tree_find_ref._vars                             0.39    0.101836%
varpool_construction                            0.39    0.101836%
tree_rename_SSA_copies                          0.47    0.122725%
complete_unrolling                              0.50    0.130559%
tree_SSA_other                                  0.53    0.138392%
tree_loop_init                                  0.57    0.148837%
tree_CFG_construction                           0.59    0.154059%
tree_code_sinking                               0.60    0.15667%
zee                                             0.62    0.161893%
dominance_frontiers                             0.65    0.169726%
loop_invariant_motion                           0.66    0.172337%
register_scan                                   0.67    0.174948%
ipa_pure_const                                  0.71    0.185393%
tree_reassociation                              0.72    0.188004%
callgraph_construction                          0.73    0.190615%
if_conversion_2                                 0.74    0.193227%
tree_forward_propagate                          0.77    0.20106%
ipa_SRA                                         0.91    0.237617%
peephole_2                                      0.95    0.248061%
tree_conservative_DCE                           0.96    0.250672%
regmove                                         1.02    0.266339%
thread_pro_and_epilogue                         1.28    0.33423%
tree_iv_optimization                            1.31    0.342063%
tree_operand_scan                               1.32    0.344675%
rebuild_jump_labels                             1.33    0.347286%
jump                                            1.34    0.349897%
branch_prediction                               1.35    0.352508%
machine_dep_reorg                               1.36    0.355119%
inline_heuristics                               1.42    0.370786%
df_multiple_defs                                1.50    0.391676%
dead_code_elimination                           1.74    0.454344%
tree_SSA_rewrite                                1.80    0.470011%
df_use_def_def_use_chains                       1.86    0.485678%
trivially_dead_code                             2.00    0.522234%
reorder_blocks                                  2.07    0.540512%
hard_reg_cprop                                  2.10    0.548346%
alias_stmt_walking                              2.15    0.561402%
tree_copy_propagation                           2.19    0.571846%
register_information                            2.27    0.592736%
tree_aggressive_DCE                             2.29    0.597958%
dead_store_elim1                                2.38    0.621459%
dead_store_elim2                                2.40    0.626681%
integration                                     2.73    0.71285%
if_conversion                                   2.80    0.731128%
tree_CCP                                        2.89    0.754628%
tree_gimplify                                   3.03    0.791185%
callgraph_optimization                          3.22    0.840797%
forward_prop                                    3.25    0.84863%
alias_analysis                                  3.41    0.890409%
df_reaching_defs                                3.44    0.898243%
dominator_optimization                          3.49    0.911299%
tree_SSA_incremental                            3.50    0.91391%
tree_FRE                                        3.90    1.01836%
CSE_2                                           4.71    1.22986%
tree_PTA                                        4.80    1.25336%
CPROP                                           4.98    1.30036%
reload_CSE_regs                                 5.23    1.36564%
final                                           5.26    1.37348%
dominance_computation                           5.44    1.42048%
df_reg_dead_unused_notes                        5.62    1.46748%
tree_CFG_cleanup                                5.69    1.48576%
cfg_cleanup                                     6.28    1.63982%
PRE                                             6.64    1.73382%
lexical_analysis                                6.65    1.73643%
CSE                                             8.16    2.13072%
tree_VRP                                        8.36    2.18294%
symout                                          8.94    2.33439%
combiner                                        10.17   2.65556%
tree_PRE                                        11.42   2.98196%
scheduling_2                                    11.44   2.98718%
reload                                          11.7    3.05507%
df_live_initialized_regs                        12.92   3.37363%
integrated_RA                                   16.31   4.25882%
df_live_regs                                    17.52   4.57477%
expand                                          24.18   6.31381%
preprocessing                                   27.59   7.20422%
variable_tracking                               29.17   7.61678%
parser                                          31.53   8.23302%
TOTAL                                           382.97  100%

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 15:55 Where does the time go? Steven Bosscher
@ 2010-05-20 19:16 ` Vladimir Makarov
  2010-05-20 19:57   ` Toon Moene
  2010-05-20 20:36   ` Steven Bosscher
  2010-05-20 19:36 ` Joseph S. Myers
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 26+ messages in thread
From: Vladimir Makarov @ 2010-05-20 19:16 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: GCC Mailing List

Steven Bosscher wrote:
> Hello,
>
> For some time now, I've wanted to see where compile time goes in a
> typical GCC build, because nobody really seems to know what the
> compiler spends its time on. The impressions that get published about
> gcc usually indicate that there is at least a feeling that GCC is not
> getting faster, and that parts of the compiler are unreasonably slow.
>   
It is just a feeling.  In fact, starting since 4.2, gcc becomes faster 
(if we ignore LTO).  My feeling is that LLVM becomes slower.  The gap in 
compilation speed between GCC4.5 and LLVM2.7 achieves less 10% on x86_64 
SPECInt2000 for -O2.

Feeling that GCC becomes slower probably occurs for people who switched 
recently from 3.x versions because in comparison with theses versions 
gcc became slower achieving maximum slowdown 25 and 40% slowdown on 
gcc4.2 correspondingly on SPECInt2000 and SPECFP2000 on x86_64 for -O2.

All these GCC version comparison can be found on 
http://vmakarov.fedorapeople.org/spec/.

As I wrote, GCC is not such slower compiler.  People sometime exaggerate 
this problem.  For example, x86_64 Sun compiler generating about the 
same quality code as GCC is 2 times slower than GCC.

We should work on speeding GCC up but the first priority (at least for 
me) should be improvement of generated code.
> Host was cfarm gcc14 (8 x 3GHz Xeon). Target was
> x86_64-unknown-linux-gnu. "Build" means non-bootstrap.
>
> Results at the bottom of this mail.
>
> Conclusions:
>
>
> * The "slow" parts of the compiler are not exactly news: tree-PRE,
> scheduling, register allocation
>
>   
RA and scheduling is usually the slowest part of optimizing compiler 
because they solve NP-problems and widely used algorithms (list 
scheduling and graph coloring) has worst quadratic complexity.  For 
example, here is comparison of how many time LLVM-2.7 passes and 
analogous GCC passes (although sometime it is hard to find full 
correspondence) spent on

                    LTO                                            GCC4.6
RA (linear scan RA + simple register coalescing) 7.2%        IRA         
       9%
Instruction Selection                           10.7%        
combiner+reload    9%

The data are from compilation all_cp2k_gfortran.f90 (420Kline fortran 
with hundreds functions) on x86_64 in -O3 mode.  By the way on Core2 
GCC4.6 spent 235 user sec on compilation of this file vs 265 sec by LLVM.

Also linear scan RA is one of the fastest RA algorithm but it is much 
worse (at least 2-3% on SPEC2000) than graph coloring one.

I wanted to look at the same time distribution for OPEN64 because it has 
sophisticated graph coloring RA algorithm.  May be I'll do it when I 
learn OPEN64 more.
> * Adding and subtracting the above numbers, the rest of the compiler,
> which is mostly the RTL parts, still account for 100-17-16-8=59% of
> the total compile time. This was the most surprising result for me.
>
>   
I don't know is it big or not to have such time spend in RTL parts.  But 
I think that this RTL part could be decreased if RTL (magically :) would 
have smaller footprint and contain less details.
> Ciao!
> Steven
>
>   
Thanks for the data.  That was interesting to look one more time (from 
many others) at this.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 15:55 Where does the time go? Steven Bosscher
  2010-05-20 19:16 ` Vladimir Makarov
@ 2010-05-20 19:36 ` Joseph S. Myers
  2010-05-20 20:35 ` Eric Botcazou
  2010-05-21 20:43 ` Diego Novillo
  3 siblings, 0 replies; 26+ messages in thread
From: Joseph S. Myers @ 2010-05-20 19:36 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: GCC Mailing List

On Thu, 20 May 2010, Steven Bosscher wrote:

> * The C front end (preprocessing+lexing+parsing) costs ~17%. For an
> optimizing compiler with so many passes, this is quite a lot.

Unlike C++ where a lot of speedups were done around 2004-5, I don't think 
much has been done about speeding up the C front end outside the 
preprocessor, so there are likely to be lots of opportunities available 
(there are probably still a fair number for C++ as well).

Preprocessor (including preprocessed output) speed is of relevance for use 
with tools such as distcc, and whole-compilation speed is of obvious 
relevance for general use of the compiler.  Front-end-only speed 
(-fsyntax-only, approximately) is less important on its own as that's not 
a normal compiler use (I think) - it's only relevant as a contribution to 
the speed of a whole compilation.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 19:16 ` Vladimir Makarov
@ 2010-05-20 19:57   ` Toon Moene
  2010-05-20 20:36   ` Steven Bosscher
  1 sibling, 0 replies; 26+ messages in thread
From: Toon Moene @ 2010-05-20 19:57 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: Steven Bosscher, GCC Mailing List

On 05/20/2010 09:17 PM, Vladimir Makarov wrote:

> Steven Bosscher wrote:

>> For some time now, I've wanted to see where compile time goes in a
>> typical GCC build, because nobody really seems to know what the
>> compiler spends its time on. The impressions that get published about
>> gcc usually indicate that there is at least a feeling that GCC is not
>> getting faster, and that parts of the compiler are unreasonably slow.

> It is just a feeling. In fact, starting since 4.2, gcc becomes faster
> (if we ignore LTO). My feeling is that LLVM becomes slower. The gap in
> compilation speed between GCC4.5 and LLVM2.7 achieves less 10% on x86_64
> SPECInt2000 for -O2.
>
> Feeling that GCC becomes slower probably occurs for people who switched
> recently from 3.x versions because in comparison with theses versions
> gcc became slower achieving maximum slowdown 25 and 40% slowdown on
> gcc4.2 correspondingly on SPECInt2000 and SPECFP2000 on x86_64 for -O2.

I never understood the problem of "gcc getting slower", but that might 
just be my "the economic life-time of computers is 3 years" replacement 
policy.

It's very interesting, Vladimir, to couple this to switching from gcc 3 
to gcc 4.

Because I was mainly interested in what gcc 4 (+ gfortran) could *do* 
(in terms of the code it could compile), I never really bothered about 
the compiler's speed of doing so.

What I *did* notice, though, is that as of gcc 4.4, I can recompile our 
complete weather forecasting code (~ 1 million lines of Fortran and ~ 
30,000 lines of C) with -O3 -ffast-math (therefore, with vectorization) 
within 5 minutes [quad core Core 2].

This means that recompiling "everything" before every run (4 times a 
day) is now a no-brainer (currently, I do it only once a day, max).

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html#Fortran

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 15:55 Where does the time go? Steven Bosscher
  2010-05-20 19:16 ` Vladimir Makarov
  2010-05-20 19:36 ` Joseph S. Myers
@ 2010-05-20 20:35 ` Eric Botcazou
  2010-05-20 20:42   ` Eric Botcazou
  2010-05-21 20:43 ` Diego Novillo
  3 siblings, 1 reply; 26+ messages in thread
From: Eric Botcazou @ 2010-05-20 20:35 UTC (permalink / raw)
  To: gcc; +Cc: Steven Bosscher

> * Adding and subtracting the above numbers, the rest of the compiler,
> which is mostly the RTL parts, still account for 100-17-16-8=59% of
> the total compile time. This was the most surprising result for me.

That figure is a little skewed though, the rest is not entirely RTL.


Front-end (3):
lexical_analysis                         6.65
preprocessing                           27.59
parser                                  31.53
                                        65.77     17.16%

IPA (9):
ipa_reference                            0.06
ipa_cp                                   0.34
varpool_construction                     0.39
ipa_pure_const                           0.71
callgraph_construction                   0.73
ipa_SRA                                  0.91
inline_heuristics                        1.42
integration                              2.73
callgraph_optimization                   3.22
                                        10.51      2.74%

dominance_computation                    5.44      1.42%

Tree optimizers (47):
tree_NRV_optimization                    0.01
tree_loop_fini                           0.03
tree_switch_initialization_convers       0.03
tree_buildin_call_DCE                    0.05
tree_canonical_iv                        0.06
tree_if_combine                          0.06
PHI_merge                                0.07
tree_phiprop                             0.07
uninit_var_anaysis                       0.07
control_dependences                      0.08
tree_PHI_const_copy_prop                 0.16
tree_eh                                  0.19
tree_split_crit_edges                    0.19
scev_constant_prop                        0.2
tree_PHI_insertion                        0.2
tree_copy_headers                        0.23
tree_loop_bounds                         0.24
tree_loop_invariant_motion               0.27
tree_SSA_uncprop                         0.28
tree_SRA                                  0.3
tree_linearize_phis                      0.34
tree_DSE                                 0.39
tree_find_ref._vars                      0.39
tree_rename_SSA_copies                   0.47
tree_SSA_other                           0.53
tree_loop_init                           0.57
tree_CFG_construction                    0.59
tree_code_sinking                         0.6
tree_reassociation                       0.72
tree_forward_propagate                   0.77
tree_conservative_DCE                    0.96
tree_iv_optimization                     1.31
tree_operand_scan                        1.32
tree_SSA_rewrite                          1.8
alias_stmt_walking                       2.15
tree_copy_propagation                    2.19
tree_aggressive_DCE                      2.29
tree_CCP                                 2.89
tree_gimplify                            3.03
alias_analysis                           3.41
dominator_optimization                   3.49
tree_SSA_incremental                      3.5
tree_FRE                                  3.9
tree_PTA                                  4.8
tree_CFG_cleanup                         5.69
tree_VRP                                 8.36
tree_PRE                                11.42
                                        70.67     18.44%

expand                                  24.18      6.31%

RTL optimizers (43)
mode_switching                           0.01
lower_subreg                             0.04
code_hoisting                            0.06
combine_stack_adjustments                0.28
loop_analysis                            0.28
complete_unrolling                        0.5
zee                                      0.62
dominance_frontiers                      0.65
loop_invariant_motion                    0.66
register_scan                            0.67
if_conversion_2                          0.74
Peephole_2                               0.95
regmove                                  1.02
thread_pro_and_epilogue                  1.28
rebuild_jump_labels                      1.33
jump                                     1.34
branch_prediction                        1.35
machine_dep_reorg                        1.36
df_multiple_defs                          1.5
dead_code_elimination                    1.74
df_use_def_def_use_chains                1.86
trivially_dead_code                         2
reorder_blocks                           2.07
hard_reg_cprop                            2.1
register_information                     2.27
dead_store_elim1                         2.38
dead_store_elim2                          2.4
if_conversion                             2.8
forward_prop                             3.25
df_reaching_defs                         3.44
CSE_2                                    4.71
CPROP                                    4.98
reload_CSE_regs                          5.23
df_reg_dead_unused_notes                 5.62
cfg_cleanup                              6.28
PRE                                      6.64
CSE                                      8.16
combiner                                10.17
scheduling_2                            11.44
reload                                   11.7
df_live_initialized_regs                12.92
integrated_RA                           16.31
df_live_regs                            17.52
                                       162.63     42.44%

variable_tracking                       29.17      7.61%

Output (4):
varconst                                 0.09
variable_output                          0.27
final                                    5.26
symout                                   8.94
                                        14.56     11.29%

garbage_collection                       0.28      0.07%

total                                  383.21    100.00%


-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 19:16 ` Vladimir Makarov
  2010-05-20 19:57   ` Toon Moene
@ 2010-05-20 20:36   ` Steven Bosscher
  2010-05-20 20:54     ` Duncan Sands
  2010-05-20 21:09     ` Ian Lance Taylor
  1 sibling, 2 replies; 26+ messages in thread
From: Steven Bosscher @ 2010-05-20 20:36 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: GCC Mailing List

Hi Vlad,

On Thu, May 20, 2010 at 9:17 PM, Vladimir Makarov wrote:
>> For some time now, I've wanted to see where compile time goes in a
>> typical GCC build, because nobody really seems to know what the
>> compiler spends its time on. The impressions that get published about
>> gcc usually indicate that there is at least a feeling that GCC is not
>> getting faster, and that parts of the compiler are unreasonably slow.
>>
>
> It is just a feeling.  In fact, starting since 4.2, gcc becomes faster (if
> we ignore LTO).  My feeling is that LLVM becomes slower.  The gap in
> compilation speed between GCC4.5 and LLVM2.7 achieves less 10% on x86_64
> SPECInt2000 for -O2.
>
> Feeling that GCC becomes slower probably occurs for people who switched
> recently from 3.x versions because in comparison with theses versions gcc
> became slower achieving maximum slowdown 25 and 40% slowdown on gcc4.2
> correspondingly on SPECInt2000 and SPECFP2000 on x86_64 for -O2.

Yes, I agree with you on this point. But I think that doesn't mean the
GCC community can ignore the feeling.  It would be great if, for once,
the release notes could say "this release is 15% faster than the
previous release" or something like that. For PR if nothing else ;-)

> RA and scheduling is usually the slowest part of optimizing compiler because
> they solve NP-problems and widely used algorithms (list scheduling and graph
> coloring) has worst quadratic complexity.

Yes.
Actually I'm more worried about things like the DF LR/LIVE problems,
var-tracking, and expand.

DF used to be better than this, so I think there is a regression
somewhere.  I guess I'll have to look at the file-by-file data to see
if there is one file that triggers bad performance.

For var-tracking, I'll confess I'm biased, but I think it's one of the
the worst thing that happened to GCC in a long time: too invasive,
perhaps I'm not looking in the right places but I don't see the
benefit and the crowds cheering about better debug info, and on top of
all that it's slow. But that's one battle lost...  I do feel that this
slowdown (see also bugzilla) was not properly addressed before the GCC
4.5 release.

And finally: expand. This should be just a change of IR format, from
GIMPLE to RTL. I have no idea why this pass always shows up in the top
10 of slowest parts of GCC.  Lowering passes on e.g. WHIRL, or GENERIC
lowering to GIMPLE, never show up in the compile time overviews.

>  For example, here is comparison
> of how many time LLVM-2.7 passes and analogous GCC passes (although sometime
> it is hard to find full correspondence) spent on
>
>                   LTO                                            GCC4.6
> RA (linear scan RA + simple register coalescing) 7.2%        IRA
>   9%
> Instruction Selection                           10.7%        combiner+reload
>    9%
>
> The data are from compilation all_cp2k_gfortran.f90 (420Kline fortran with
> hundreds functions) on x86_64 in -O3 mode.  By the way on Core2 GCC4.6 spent
> 235 user sec on compilation of this file vs 265 sec by LLVM.

Interesting data, thanks!

> I don't know is it big or not to have such time spend in RTL parts.  But I
> think that this RTL part could be decreased if RTL (magically :) would have
> smaller footprint and contain less details.

<checks pockets...>
Bah, no wand... :-)

Ciao!
Steven

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 20:35 ` Eric Botcazou
@ 2010-05-20 20:42   ` Eric Botcazou
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Botcazou @ 2010-05-20 20:42 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: gcc

> That figure is a little skewed though, the rest is not entirely RTL.

Now without some annoying typo in a formula...


Front-end (3):
lexical_analysis                         6.65
preprocessing                           27.59
parser                                  31.53
                                        65.77     17.16%

IPA (9):
ipa_reference                            0.06
ipa_cp                                   0.34
varpool_construction                     0.39
ipa_pure_const                           0.71
callgraph_construction                   0.73
ipa_SRA                                  0.91
inline_heuristics                        1.42
integration                              2.73
callgraph_optimization                   3.22
                                        10.51      2.74%

dominance_computation                    5.44      1.42%

Tree optimizers (47):
tree_NRV_optimization                    0.01
tree_loop_fini                           0.03
tree_switch_initialization_convers       0.03
tree_buildin_call_DCE                    0.05
tree_canonical_iv                        0.06
tree_if_combine                          0.06
PHI_merge                                0.07
tree_phiprop                             0.07
uninit_var_anaysis                       0.07
control_dependences                      0.08
tree_PHI_const_copy_prop                 0.16
tree_eh                                  0.19
tree_split_crit_edges                    0.19
scev_constant_prop                        0.2
tree_PHI_insertion                        0.2
tree_copy_headers                        0.23
tree_loop_bounds                         0.24
tree_loop_invariant_motion               0.27
tree_SSA_uncprop                         0.28
tree_SRA                                  0.3
tree_linearize_phis                      0.34
tree_DSE                                 0.39
tree_find_ref._vars                      0.39
tree_rename_SSA_copies                   0.47
tree_SSA_other                           0.53
tree_loop_init                           0.57
tree_CFG_construction                    0.59
tree_code_sinking                         0.6
tree_reassociation                       0.72
tree_forward_propagate                   0.77
tree_conservative_DCE                    0.96
tree_iv_optimization                     1.31
tree_operand_scan                        1.32
tree_SSA_rewrite                          1.8
alias_stmt_walking                       2.15
tree_copy_propagation                    2.19
tree_aggressive_DCE                      2.29
tree_CCP                                 2.89
tree_gimplify                            3.03
alias_analysis                           3.41
dominator_optimization                   3.49
tree_SSA_incremental                      3.5
tree_FRE                                  3.9
tree_PTA                                  4.8
tree_CFG_cleanup                         5.69
tree_VRP                                 8.36
tree_PRE                                11.42
                                        70.67     18.44%

expand                                  24.18      6.31%

RTL optimizers (43)
mode_switching                           0.01
lower_subreg                             0.04
code_hoisting                            0.06
combine_stack_adjustments                0.28
loop_analysis                            0.28
complete_unrolling                        0.5
zee                                      0.62
dominance_frontiers                      0.65
loop_invariant_motion                    0.66
register_scan                            0.67
if_conversion_2                          0.74
Peephole_2                               0.95
regmove                                  1.02
thread_pro_and_epilogue                  1.28
rebuild_jump_labels                      1.33
jump                                     1.34
branch_prediction                        1.35
machine_dep_reorg                        1.36
df_multiple_defs                          1.5
dead_code_elimination                    1.74
df_use_def_def_use_chains                1.86
trivially_dead_code                         2
reorder_blocks                           2.07
hard_reg_cprop                            2.1
register_information                     2.27
dead_store_elim1                         2.38
dead_store_elim2                          2.4
if_conversion                             2.8
forward_prop                             3.25
df_reaching_defs                         3.44
CSE_2                                    4.71
CPROP                                    4.98
reload_CSE_regs                          5.23
df_reg_dead_unused_notes                 5.62
cfg_cleanup                              6.28
PRE                                      6.64
CSE                                      8.16
combiner                                10.17
scheduling_2                            11.44
reload                                   11.7
df_live_initialized_regs                12.92
integrated_RA                           16.31
df_live_regs                            17.52
                                       162.63     42.44%

variable_tracking                       29.17      7.61%

Output (4):
varconst                                 0.09
variable_output                          0.27
final                                    5.26
symout                                   8.94
                                        14.56      3.80%

garbage_collection                       0.28      0.07%

total                                  383.21    100.00%


-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 20:36   ` Steven Bosscher
@ 2010-05-20 20:54     ` Duncan Sands
  2010-05-20 21:14       ` Steven Bosscher
  2010-05-20 21:09     ` Ian Lance Taylor
  1 sibling, 1 reply; 26+ messages in thread
From: Duncan Sands @ 2010-05-20 20:54 UTC (permalink / raw)
  To: gcc

Hi,

>> I don't know is it big or not to have such time spend in RTL parts.  But I
>> think that this RTL part could be decreased if RTL (magically :) would have
>> smaller footprint and contain less details.
>
> <checks pockets...>
> Bah, no wand... :-)

I noticed while working on the dragonegg plugin that replacing gimple -> RTL
with gimple -> LLVM IR significantly reduced the amount of memory used by the
compiler at -O0.  I didn't investigate where the memory was going, but it seems
likely that RTL either contains a whole lot more information than the LLVM IR,
or doesn't represent it in a very memory efficient way.

Ciao,

Duncan.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 20:36   ` Steven Bosscher
  2010-05-20 20:54     ` Duncan Sands
@ 2010-05-20 21:09     ` Ian Lance Taylor
  2010-05-20 21:14       ` Xinliang David Li
  1 sibling, 1 reply; 26+ messages in thread
From: Ian Lance Taylor @ 2010-05-20 21:09 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Vladimir Makarov, GCC Mailing List

Steven Bosscher <stevenb.gcc@gmail.com> writes:

> And finally: expand. This should be just a change of IR format, from
> GIMPLE to RTL. I have no idea why this pass always shows up in the top
> 10 of slowest parts of GCC.  Lowering passes on e.g. WHIRL, or GENERIC
> lowering to GIMPLE, never show up in the compile time overviews.

expand unfortunately does more than just convert IR.  It also does
things like changing division by a constant into multiplication by a
constant, and changing multiplication by a constant into a set of
shifts and adds.  It does various optimizations involving store flags
(a = b == c;).  When expanding function calls, it generates
instructions to put arguments in the right place in registers and on
the stack.  Etc.

Ian

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 20:54     ` Duncan Sands
@ 2010-05-20 21:14       ` Steven Bosscher
  2010-05-23 19:09         ` Joseph S. Myers
  0 siblings, 1 reply; 26+ messages in thread
From: Steven Bosscher @ 2010-05-20 21:14 UTC (permalink / raw)
  To: Duncan Sands; +Cc: gcc

On Thu, May 20, 2010 at 10:54 PM, Duncan Sands <baldrick@free.fr> wrote:
> I noticed while working on the dragonegg plugin that replacing gimple -> RTL
> with gimple -> LLVM IR significantly reduced the amount of memory used by
> the compiler at -O0.  I didn't investigate where the memory was going, but
> it seems likely that RTL either contains a whole lot more information than
> the LLVM IR, or doesn't represent it in a very memory efficient way.

The latter. LLVM IR contains a bit more information (or at least,
contains it in a more natural way) but the problem with RTL is, I
think, the tree-like representation. If you have an instruction like
(set (a) (b+c)) you could have, at the simples, three integers (insn
uid, basic block, instruction code) and three pointers for operands.
In total, on a 64 bits host: 3*4+3*8 = 36 bytes.

An RTL instruction of that form, assuming all operands are registers,
is 6*sizeof(struct rtx_def) = 6*48 = 288 bytes, give or take a few.
Those 6 rtx'en are for:

1. insn
2. set
3. set_dest operand
4. set_source: a plus
5. source operand 1
6. source operand 2

All in all, perhaps not the most efficient representation for memory
foot print, and the pointer chasing probably doesn't help (cache!).
But changing it is a lot more difficult than the GIMPLE tuples
project. I don't think it can be done.

Ciao!
Steven

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 21:09     ` Ian Lance Taylor
@ 2010-05-20 21:14       ` Xinliang David Li
  2010-05-20 21:18         ` Steven Bosscher
  0 siblings, 1 reply; 26+ messages in thread
From: Xinliang David Li @ 2010-05-20 21:14 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: Steven Bosscher, Vladimir Makarov, GCC Mailing List

On Thu, May 20, 2010 at 2:09 PM, Ian Lance Taylor <iant@google.com> wrote:
> Steven Bosscher <stevenb.gcc@gmail.com> writes:
>
>> And finally: expand. This should be just a change of IR format, from
>> GIMPLE to RTL. I have no idea why this pass always shows up in the top
>> 10 of slowest parts of GCC.  Lowering passes on e.g. WHIRL, or GENERIC
>> lowering to GIMPLE, never show up in the compile time overviews.
>
> expand unfortunately does more than just convert IR.  It also does
> things like changing division by a constant into multiplication by a
> constant, and changing multiplication by a constant into a set of
> shifts and adds.  It does various optimizations involving store flags
> (a = b == c;).  When expanding function calls, it generates
> instructions to put arguments in the right place in registers and on
> the stack.  Etc.


stack variable overlay and stack slot assignments is here too.

David


>
> Ian
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 21:14       ` Xinliang David Li
@ 2010-05-20 21:18         ` Steven Bosscher
  2010-05-20 21:21           ` Xinliang David Li
  0 siblings, 1 reply; 26+ messages in thread
From: Steven Bosscher @ 2010-05-20 21:18 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Ian Lance Taylor, Vladimir Makarov, GCC Mailing List

On Thu, May 20, 2010 at 11:14 PM, Xinliang David Li <davidxl@google.com> wrote:
> stack variable overlay and stack slot assignments is here too.

Yes, and for these I would like to add a separate timevar. Agree?

Ciao!
Steven

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 21:18         ` Steven Bosscher
@ 2010-05-20 21:21           ` Xinliang David Li
  2010-05-21 10:54             ` Richard Guenther
  0 siblings, 1 reply; 26+ messages in thread
From: Xinliang David Li @ 2010-05-20 21:21 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Ian Lance Taylor, Vladimir Makarov, GCC Mailing List

On Thu, May 20, 2010 at 2:18 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Thu, May 20, 2010 at 11:14 PM, Xinliang David Li <davidxl@google.com> wrote:
>> stack variable overlay and stack slot assignments is here too.
>
> Yes, and for these I would like to add a separate timevar. Agree?

Yes.  (By the way, we are rewriting this pass to eliminate the code
motion/aliasing problem -- but that is a different topic).

David

>
> Ciao!
> Steven
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 21:21           ` Xinliang David Li
@ 2010-05-21 10:54             ` Richard Guenther
  2010-05-21 13:26               ` Jan Hubicka
  2010-05-21 17:06               ` Xinliang David Li
  0 siblings, 2 replies; 26+ messages in thread
From: Richard Guenther @ 2010-05-21 10:54 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Steven Bosscher, Ian Lance Taylor, Vladimir Makarov, GCC Mailing List

On Thu, May 20, 2010 at 11:21 PM, Xinliang David Li <davidxl@google.com> wrote:
> On Thu, May 20, 2010 at 2:18 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>> On Thu, May 20, 2010 at 11:14 PM, Xinliang David Li <davidxl@google.com> wrote:
>>> stack variable overlay and stack slot assignments is here too.
>>
>> Yes, and for these I would like to add a separate timevar. Agree?
>
> Yes.  (By the way, we are rewriting this pass to eliminate the code
> motion/aliasing problem -- but that is a different topic).

Btw, we want to address the same problem by representing the
points where (big) variables go out-of scope in the IL, also to
help DSE.  The original idea was to simply drop in an aggregate
assignment from an undefined value at the end of the scope
during lowering, like

  var = {undefined};

which we'd expand to nothing.  Of course shifting the problem to
the RTL optimizers, so better expand to a similar RTL construct.
But then are you addressing the similar problem on the RTL side?

Richard.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-21 10:54             ` Richard Guenther
@ 2010-05-21 13:26               ` Jan Hubicka
  2010-05-21 15:06                 ` Richard Guenther
  2010-05-21 17:06               ` Xinliang David Li
  1 sibling, 1 reply; 26+ messages in thread
From: Jan Hubicka @ 2010-05-21 13:26 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Xinliang David Li, Steven Bosscher, Ian Lance Taylor,
	Vladimir Makarov, GCC Mailing List

> On Thu, May 20, 2010 at 11:21 PM, Xinliang David Li <davidxl@google.com> wrote:
> > On Thu, May 20, 2010 at 2:18 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> >> On Thu, May 20, 2010 at 11:14 PM, Xinliang David Li <davidxl@google.com> wrote:
> >>> stack variable overlay and stack slot assignments is here too.
> >>
> >> Yes, and for these I would like to add a separate timevar. Agree?
> >
> > Yes. Â (By the way, we are rewriting this pass to eliminate the code
> > motion/aliasing problem -- but that is a different topic).
> 
> Btw, we want to address the same problem by representing the
> points where (big) variables go out-of scope in the IL, also to
> help DSE.  The original idea was to simply drop in an aggregate
> assignment from an undefined value at the end of the scope
> during lowering, like
> 
>   var = {undefined};
> 
> which we'd expand to nothing.  Of course shifting the problem to
> the RTL optimizers, so better expand to a similar RTL construct.
> But then are you addressing the similar problem on the RTL side?

Well, RTL already has the clobber insns, that are pretty much what you need
(and RTL land plays with this at least for double word arithmetic). But
I am not quite sure how much of DSE we want to deffer to RTL.

Honza
> 
> Richard.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-21 13:26               ` Jan Hubicka
@ 2010-05-21 15:06                 ` Richard Guenther
  2010-05-21 15:49                   ` Jan Hubicka
  0 siblings, 1 reply; 26+ messages in thread
From: Richard Guenther @ 2010-05-21 15:06 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Xinliang David Li, Steven Bosscher, Ian Lance Taylor,
	Vladimir Makarov, GCC Mailing List

2010/5/21 Jan Hubicka <hubicka@ucw.cz>:
>> On Thu, May 20, 2010 at 11:21 PM, Xinliang David Li <davidxl@google.com> wrote:
>> > On Thu, May 20, 2010 at 2:18 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>> >> On Thu, May 20, 2010 at 11:14 PM, Xinliang David Li <davidxl@google.com> wrote:
>> >>> stack variable overlay and stack slot assignments is here too.
>> >>
>> >> Yes, and for these I would like to add a separate timevar. Agree?
>> >
>> > Yes.  (By the way, we are rewriting this pass to eliminate the code
>> > motion/aliasing problem -- but that is a different topic).
>>
>> Btw, we want to address the same problem by representing the
>> points where (big) variables go out-of scope in the IL, also to
>> help DSE.  The original idea was to simply drop in an aggregate
>> assignment from an undefined value at the end of the scope
>> during lowering, like
>>
>>   var = {undefined};
>>
>> which we'd expand to nothing.  Of course shifting the problem to
>> the RTL optimizers, so better expand to a similar RTL construct.
>> But then are you addressing the similar problem on the RTL side?
>
> Well, RTL already has the clobber insns, that are pretty much what you need
> (and RTL land plays with this at least for double word arithmetic). But
> I am not quite sure how much of DSE we want to deffer to RTL.

But a clobber for (stack) memory objects doesn't work in the granularity
we want it to be applied, does it?

Richard.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-21 15:06                 ` Richard Guenther
@ 2010-05-21 15:49                   ` Jan Hubicka
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Hubicka @ 2010-05-21 15:49 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Jan Hubicka, Xinliang David Li, Steven Bosscher,
	Ian Lance Taylor, Vladimir Makarov, GCC Mailing List

> 2010/5/21 Jan Hubicka <hubicka@ucw.cz>:
> >> On Thu, May 20, 2010 at 11:21 PM, Xinliang David Li <davidxl@google.com> wrote:
> >> > On Thu, May 20, 2010 at 2:18 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> >> >> On Thu, May 20, 2010 at 11:14 PM, Xinliang David Li <davidxl@google.com> wrote:
> >> >>> stack variable overlay and stack slot assignments is here too.
> >> >>
> >> >> Yes, and for these I would like to add a separate timevar. Agree?
> >> >
> >> > Yes. Â (By the way, we are rewriting this pass to eliminate the code
> >> > motion/aliasing problem -- but that is a different topic).
> >>
> >> Btw, we want to address the same problem by representing the
> >> points where (big) variables go out-of scope in the IL, also to
> >> help DSE. Â The original idea was to simply drop in an aggregate
> >> assignment from an undefined value at the end of the scope
> >> during lowering, like
> >>
> >> Â  var = {undefined};
> >>
> >> which we'd expand to nothing. Â Of course shifting the problem to
> >> the RTL optimizers, so better expand to a similar RTL construct.
> >> But then are you addressing the similar problem on the RTL side?
> >
> > Well, RTL already has the clobber insns, that are pretty much what you need
> > (and RTL land plays with this at least for double word arithmetic). But
> > I am not quite sure how much of DSE we want to deffer to RTL.
> 
> But a clobber for (stack) memory objects doesn't work in the granularity
> we want it to be applied, does it?

Clobber takes memory expression that has alias info in it, so you can make it
specific to single var.  But I guess problem will be that memory clobber
(unlike reg clobber) has by default the "may" interpretation (i.e. it may or
may not change the memory in random way).

Honza

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-21 10:54             ` Richard Guenther
  2010-05-21 13:26               ` Jan Hubicka
@ 2010-05-21 17:06               ` Xinliang David Li
  2010-05-21 17:07                 ` Richard Guenther
  1 sibling, 1 reply; 26+ messages in thread
From: Xinliang David Li @ 2010-05-21 17:06 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Steven Bosscher, Ian Lance Taylor, Vladimir Makarov, GCC Mailing List

On Fri, May 21, 2010 at 2:24 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Thu, May 20, 2010 at 11:21 PM, Xinliang David Li <davidxl@google.com> wrote:
>> On Thu, May 20, 2010 at 2:18 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>>> On Thu, May 20, 2010 at 11:14 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>> stack variable overlay and stack slot assignments is here too.
>>>
>>> Yes, and for these I would like to add a separate timevar. Agree?
>>
>> Yes.  (By the way, we are rewriting this pass to eliminate the code
>> motion/aliasing problem -- but that is a different topic).
>


> Btw, we want to address the same problem by representing the
> points where (big) variables go out-of scope in the IL, also to
> help DSE.  The original idea was to simply drop in an aggregate
> assignment from an undefined value at the end of the scope
> during lowering, like
>
>  var = {undefined};
>
> which we'd expand to nothing.  Of course shifting the problem to
> the RTL optimizers, so better expand to a similar RTL construct.
> But then are you addressing the similar problem on the RTL side?

We are probably talking about different problems -- the one I
mentioned is that code motion leading to overlapping live range for
variables in different scopes which invalidates the scope based stack
variable overlay. What is the problem you referred? Any PR? I wonder
why a dummy assignment is needed.

Thanks,

David


>
> Richard.
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-21 17:06               ` Xinliang David Li
@ 2010-05-21 17:07                 ` Richard Guenther
  0 siblings, 0 replies; 26+ messages in thread
From: Richard Guenther @ 2010-05-21 17:07 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Steven Bosscher, Ian Lance Taylor, Vladimir Makarov, GCC Mailing List

On Fri, May 21, 2010 at 6:13 PM, Xinliang David Li <davidxl@google.com> wrote:
> On Fri, May 21, 2010 at 2:24 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Thu, May 20, 2010 at 11:21 PM, Xinliang David Li <davidxl@google.com> wrote:
>>> On Thu, May 20, 2010 at 2:18 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>>>> On Thu, May 20, 2010 at 11:14 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>>> stack variable overlay and stack slot assignments is here too.
>>>>
>>>> Yes, and for these I would like to add a separate timevar. Agree?
>>>
>>> Yes.  (By the way, we are rewriting this pass to eliminate the code
>>> motion/aliasing problem -- but that is a different topic).
>>
>
>
>> Btw, we want to address the same problem by representing the
>> points where (big) variables go out-of scope in the IL, also to
>> help DSE.  The original idea was to simply drop in an aggregate
>> assignment from an undefined value at the end of the scope
>> during lowering, like
>>
>>  var = {undefined};
>>
>> which we'd expand to nothing.  Of course shifting the problem to
>> the RTL optimizers, so better expand to a similar RTL construct.
>> But then are you addressing the similar problem on the RTL side?
>
> We are probably talking about different problems -- the one I
> mentioned is that code motion leading to overlapping live range for
> variables in different scopes which invalidates the scope based stack
> variable overlay. What is the problem you referred? Any PR? I wonder
> why a dummy assignment is needed.

Yes, it's the same problem.  A dummy assignment would prevent the
code motion on the tree level and thus keep the scope block information
valid.

Richard.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 15:55 Where does the time go? Steven Bosscher
                   ` (2 preceding siblings ...)
  2010-05-20 20:35 ` Eric Botcazou
@ 2010-05-21 20:43 ` Diego Novillo
  3 siblings, 0 replies; 26+ messages in thread
From: Diego Novillo @ 2010-05-21 20:43 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: GCC Mailing List

Interesting.  Thanks for gathering this.

I did a similar study internally on our C++ codebase.  The results are
fairly different.  In our case, the front end takes a LARGE chunk of
the compile time.  The numbers below are taken from a full build of
one of our applications, consisting of ~4,500 C++ files.  These are
the aggregated times for an -O0 build (only the first 10 main
contributors shown):

TOTAL                  :  11583.19 cpu sec,   3705.69 sys sec,
16012.79 wall sec, 841456705 kB
parser                 :   6057.47 cpu sec,   1717.66 sys sec,
7955.42 wall sec, 580829675 kB
name lookup            :   1688.17 cpu sec,   1011.80 sys sec,
2750.34 wall sec,  93255238 kB
preprocessing          :    796.18 cpu sec,    575.73 sys sec,
1777.63 wall sec,  14743549 kB
tree |^ dominator      :   1067.33 cpu sec,    197.20 sys sec,
1290.34 wall sec, 101280196 kB
tree gimplify          :    973.18 cpu sec,    175.87 sys sec,
1172.88 wall sec,  96888948 kB
garbage collection     :    471.20 cpu sec,     22.20 sys sec,
499.94 wall sec,         0 kB
expand                 :    339.34 cpu sec,     39.46 sys sec,
386.19 wall sec,  17818728 kB
varconst               :    336.20 cpu sec,     45.11 sys sec,
383.70 wall sec,  10064103 kB
final                  :    111.79 cpu sec,     13.64 sys sec,
131.18 wall sec,    801421 kB

The front end (parser + name lookup + preprocessing) is taking 78% of
the wall clock time.  What was surprising is that this ratio is fairly
consistently maintained across three kinds of builds: optimized (-O2),
not optimized (-O0) and debug (-g -O0).

In an optimized build, the difference was that the front end reduced
its contribution from 78% to 70%.

In all cases, the gimplifier took about 6% of the total compile time.

For debug builds, the difference was a 6% of time taken by symout.

We are currently looking at ways of speeding up builds internally.
Based on this, we are mostly looking at ways of making the front end
faster.  This profile is fairly consistent across much of our code
base.

Diego.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-20 21:14       ` Steven Bosscher
@ 2010-05-23 19:09         ` Joseph S. Myers
  2010-05-24 17:00           ` Mark Mitchell
  0 siblings, 1 reply; 26+ messages in thread
From: Joseph S. Myers @ 2010-05-23 19:09 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Duncan Sands, gcc

On Thu, 20 May 2010, Steven Bosscher wrote:

> think, the tree-like representation. If you have an instruction like
> (set (a) (b+c)) you could have, at the simples, three integers (insn
> uid, basic block, instruction code) and three pointers for operands.
> In total, on a 64 bits host: 3*4+3*8 = 36 bytes.

(Plus four bytes padding for alignment.)

> An RTL instruction of that form, assuming all operands are registers,
> is 6*sizeof(struct rtx_def) = 6*48 = 288 bytes, give or take a few.
> Those 6 rtx'en are for:
> 
> 1. insn
> 2. set
> 3. set_dest operand
> 4. set_source: a plus
> 5. source operand 1
> 6. source operand 2
> 
> All in all, perhaps not the most efficient representation for memory
> foot print, and the pointer chasing probably doesn't help (cache!).
> But changing it is a lot more difficult than the GIMPLE tuples
> project. I don't think it can be done.

I don't see any reason technically why it can't be done.  There would be 
several large projects, certainly, and nontrivial work in actually 
producing a design for conversion, but there are also clear incremental 
steps, such as static typing of some different kinds of RTL and moving to 
more specific accessors for parts of an RTX in place of generic ones such 
as XEXP used at present.  If it can't be done then that would be more for 
economic reasons - no-one benefiting enough from the change, potential 
benefits being gained more cheaply in other ways - than because of 
intrinsic technical obstacles.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-23 19:09         ` Joseph S. Myers
@ 2010-05-24 17:00           ` Mark Mitchell
  2010-05-24 21:07             ` Steven Bosscher
  0 siblings, 1 reply; 26+ messages in thread
From: Mark Mitchell @ 2010-05-24 17:00 UTC (permalink / raw)
  To: Joseph S. Myers; +Cc: Steven Bosscher, Duncan Sands, gcc

Joseph S. Myers wrote:

>> All in all, perhaps not the most efficient representation for memory
>> foot print, and the pointer chasing probably doesn't help (cache!).
>> But changing it is a lot more difficult than the GIMPLE tuples
>> project. I don't think it can be done.
> 
> I don't see any reason technically why it can't be done.  There would be 
> several large projects, certainly, and nontrivial work in actually 
> producing a design for conversion, but there are also clear incremental 
> steps, such as static typing of some different kinds of RTL and moving to 
> more specific accessors for parts of an RTX in place of generic ones such 
> as XEXP used at present.  If it can't be done then that would be more for 
> economic reasons - no-one benefiting enough from the change, potential 
> benefits being gained more cheaply in other ways - than because of 
> intrinsic technical obstacles.

I completely agree.  This is a tractable problem, approximately on the
same scale as GIMPLE tuples.  I would guess approximately a person-year,
perhaps spread out over a longer time.

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-24 17:00           ` Mark Mitchell
@ 2010-05-24 21:07             ` Steven Bosscher
  2010-05-24 23:22               ` Mark Mitchell
  0 siblings, 1 reply; 26+ messages in thread
From: Steven Bosscher @ 2010-05-24 21:07 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: Joseph S. Myers, Duncan Sands, gcc

On Mon, May 24, 2010 at 6:20 PM, Mark Mitchell <mark@codesourcery.com> wrote:
> Joseph S. Myers wrote:
>
>>> All in all, perhaps not the most efficient representation for memory
>>> foot print, and the pointer chasing probably doesn't help (cache!).
>>> But changing it is a lot more difficult than the GIMPLE tuples
>>> project. I don't think it can be done.
>>
>> I don't see any reason technically why it can't be done.  There would be
>> several large projects, certainly, and nontrivial work in actually
>> producing a design for conversion, but there are also clear incremental
>> steps, such as static typing of some different kinds of RTL and moving to
>> more specific accessors for parts of an RTX in place of generic ones such
>> as XEXP used at present.  If it can't be done then that would be more for
>> economic reasons - no-one benefiting enough from the change, potential
>> benefits being gained more cheaply in other ways - than because of
>> intrinsic technical obstacles.
>
> I completely agree.  This is a tractable problem, approximately on the
> same scale as GIMPLE tuples.

Great you all do semantics :-)

So I said "impossible". Perhaps I should have said "of course
technically feasible (for the non-faint of heart) but unlikely to
happen given the considerable investment required, the technical
complexity and risk involved, the unclear and uncertain
return-on-investment in both the short and long terms, and perhaps the
availability of better options such as the GIMPLE backend (xf.
http://gcc.gnu.org/wiki/gimplebackend)".

>  I would guess approximately a person-year,
> perhaps spread out over a longer time.

No-one will know for sure, but I think this approximation is a bit too
optimistic:

* Just in gcc/, all gimple/tree code is less than 200,000 lines of
code and all rtl code is >400,000 lines of code. I am not even going
to count in config/*.
*  For RTL if you access XEXP you don't know if you will see an
operand or an expression. The GIMPLE tuples project was relatively
easy because most GIMPLE passes already access operands through the
SSA operands system.
* For RTL, expressions can be of arbitrary complexity (machine
dependent forms), while for GIMPLE the number of instruction templates
is limited.
* For RTL, one would have to test many different configurations to
avoid breaking things. The code to be modified for GIMPLE tuples was
mostly target-independent (compared to all the RTL code in config/*).

The GIMPLE tuples work took man-years (note: plural). There was less
code to convert and the process of conversion was easier, relatively,
than the conversion of RTL would be. So your one person-year seems
grossly underestimated.

Not that I would discourage anyone from doing the work, but it seems
to me that resources would be better spent on something that actually
has clear pay-off in the hopefully shorter term (such as
http://gcc.gnu.org/wiki/ModularGCC, or indeed
http://gcc.gnu.org/wiki/gimplebackend).

Ciao!
Steven

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-24 21:07             ` Steven Bosscher
@ 2010-05-24 23:22               ` Mark Mitchell
  2010-05-25  1:20                 ` Joseph S. Myers
  0 siblings, 1 reply; 26+ messages in thread
From: Mark Mitchell @ 2010-05-24 23:22 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Joseph S. Myers, Duncan Sands, gcc

Steven Bosscher wrote:

> The GIMPLE tuples work took man-years (note: plural). There was less
> code to convert and the process of conversion was easier, relatively,
> than the conversion of RTL would be. So your one person-year seems
> grossly underestimated.

I dunno.  To get good project estimates, you have to break things down
into small enough subpieces that you have good confidence in the
estimates.  I certainly don't claim a high level of confidence in my
estimate.  But, it's a largely mechanical activity; we're not talking
about changing algorithms, but rather about changing data representations.

As to whether this is a better choice than working on GIMPLE back-ends,
I think that's unclear.  There's no question that a GIMPLE back-end
would be prettier.  I think it's a question of what your goals are.  If
your goal is a faster compiler with less memory usage, I bet converting
RTL to a better representation is a faster path to that solution.  If
your goal is to have one representation all the way through to ease
future maintenance and make it easier to do plugins at all stages of
compilation and so forth, then a better representation for RTL doesn't
really help you.

I'd accept either set of patches, if someone were to provide them.

-- 
Mark Mitchell
CodeSourcery
mark@codesourcery.com
(650) 331-3385 x713

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
  2010-05-24 23:22               ` Mark Mitchell
@ 2010-05-25  1:20                 ` Joseph S. Myers
  0 siblings, 0 replies; 26+ messages in thread
From: Joseph S. Myers @ 2010-05-25  1:20 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: Steven Bosscher, Duncan Sands, gcc

On Mon, 24 May 2010, Mark Mitchell wrote:

> As to whether this is a better choice than working on GIMPLE back-ends,
> I think that's unclear.  There's no question that a GIMPLE back-end
> would be prettier.  I think it's a question of what your goals are.  If

I don't think of the two as being mutually exclusive; the end point of a 
series of improvements to RTL types and datastructures could be that RTL 
ends up as a form of GIMPLE, if the changes are planned that way (and if 
you are making such changes to RTL, the advantages of moving towards 
low-GIMPLE are clear).

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Where does the time go?
@ 2010-05-20 21:28 Bradley Lucier
  0 siblings, 0 replies; 26+ messages in thread
From: Bradley Lucier @ 2010-05-20 21:28 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: Bradley Lucier, GCC Mailing List

On my codes, pre-RA instruction scheduling on X86-64 (a) improves run
times by roughly 10%, and (b) costs a lot of compile time.

The -fscheduling option didn't seem to be on in your time tests (I think
it's not on by default on that architecture at -O2).

Brad

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2010-05-24 22:59 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-20 15:55 Where does the time go? Steven Bosscher
2010-05-20 19:16 ` Vladimir Makarov
2010-05-20 19:57   ` Toon Moene
2010-05-20 20:36   ` Steven Bosscher
2010-05-20 20:54     ` Duncan Sands
2010-05-20 21:14       ` Steven Bosscher
2010-05-23 19:09         ` Joseph S. Myers
2010-05-24 17:00           ` Mark Mitchell
2010-05-24 21:07             ` Steven Bosscher
2010-05-24 23:22               ` Mark Mitchell
2010-05-25  1:20                 ` Joseph S. Myers
2010-05-20 21:09     ` Ian Lance Taylor
2010-05-20 21:14       ` Xinliang David Li
2010-05-20 21:18         ` Steven Bosscher
2010-05-20 21:21           ` Xinliang David Li
2010-05-21 10:54             ` Richard Guenther
2010-05-21 13:26               ` Jan Hubicka
2010-05-21 15:06                 ` Richard Guenther
2010-05-21 15:49                   ` Jan Hubicka
2010-05-21 17:06               ` Xinliang David Li
2010-05-21 17:07                 ` Richard Guenther
2010-05-20 19:36 ` Joseph S. Myers
2010-05-20 20:35 ` Eric Botcazou
2010-05-20 20:42   ` Eric Botcazou
2010-05-21 20:43 ` Diego Novillo
2010-05-20 21:28 Bradley Lucier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).