* if-conversion a performance bottleneck
@ 2000-05-01 10:54 Brad Lucier
2000-05-01 15:29 ` Richard Henderson
0 siblings, 1 reply; 32+ messages in thread
From: Brad Lucier @ 2000-05-01 10:54 UTC (permalink / raw)
To: gcc; +Cc: lucier
The patch
http://gcc.gnu.org/ml/gcc-patches/2000-04/msg00248.html
killed one of the biggest time hogs in gcc, but now Richard
Henderson's if-conversion pass has affected performance adversely.
Perhaps with the calls to verify_flow_info, this is a temporary effect,
I don't know.
On this file
http://www.math.purdue.edu/~lucier/_t-c-2.i.gz
called with these options
/export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -fno-math-errno -mieee -g -mcpu=ev6 -O1 _t-c-2.i > & cc1.out &
with this compiler
popov-41% gcc -v
Reading specs from /export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/specs
gcc version 2.96 20000430 (experimental)
I get the following performance on a 500 MHz alpha 21264:
__copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___t_2d_c_2d_2 {GC 47533k -> 10293k} {GC 15386k -> 10788k} {GC 14832k -> 11255k} ___init_proc {GC 24248k -> 2107k} ____20___t_2d_c_2d_2
Execution times (seconds)
garbage collection : 0.75 ( 0%) usr 0.00 ( 0%) sys 0.75 ( 0%) wall
parser : 6.55 ( 0%) usr 0.18 (15%) sys 6.73 ( 0%) wall
varconst : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall
integration : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
jump : 6.81 ( 0%) usr 0.40 (32%) sys 7.21 ( 0%) wall
CSE : 2.29 ( 0%) usr 0.00 ( 0%) sys 2.29 ( 0%) wall
global CSE : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
loop analysis : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
CSE 2 : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
flow analysis : 61.00 ( 2%) usr 0.07 ( 6%) sys 61.05 ( 2%) wall
combiner : 2.82 ( 0%) usr 0.00 ( 0%) sys 2.82 ( 0%) wall
if-conversion :2306.07 (64%) usr 0.44 (35%) sys2306.74 (64%) wall
local alloc : 1.31 ( 0%) usr 0.00 ( 0%) sys 1.31 ( 0%) wall
global alloc : 2.42 ( 0%) usr 0.04 ( 4%) sys 2.46 ( 0%) wall
reload CSE regs : 3.62 ( 0%) usr 0.00 ( 0%) sys 3.62 ( 0%) wall
flow 2 : 30.20 ( 1%) usr 0.00 ( 0%) sys 30.20 ( 1%) wall
if-conversion 2 :1196.27 (33%) usr 0.09 ( 7%) sys1196.01 (33%) wall
shorten branches : 0.12 ( 0%) usr 0.00 ( 0%) sys 0.13 ( 0%) wall
final : 2.69 ( 0%) usr 0.00 ( 1%) sys 2.70 ( 0%) wall
symout : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall
rest of compilation : 1.06 ( 0%) usr 0.00 ( 0%) sys 1.06 ( 0%) wall
TOTAL :3624.12 1.27 3625.23
Some of the details are:
Flat profile:
Each sample counts as 0.000976562 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
49.50 192.40 192.40 9 21378.26 21381.98 verify_flow_info
28.10 301.64 109.24 40604 2.69 2.69 sbitmap_intersection_of_succs
6.75 327.90 26.25 15025 1.75 1.75 sbitmap_intersection_of_preds
5.38 348.79 20.89 20981 1.00 1.00 for_each_successor_phi
1.87 356.06 7.27 25079015 0.00 0.00 bitmap_operation
...
-----------------------------------------------
0.00 288.44 9/9 rest_of_compilation [5]
[6] 74.2 0.00 288.44 9 if_convert [6]
192.40 0.03 9/9 verify_flow_info [7]
2.17 90.46 6/9 compute_flow_dominators [8]
0.00 3.27 1/10 update_life_info [11]
0.00 0.03 9/37 free_basic_block_vars [99]
0.03 0.00 9/18 compute_bb_for_insn [177]
0.00 0.02 15384/15384 find_if_header [331]
0.01 0.00 6/21 sbitmap_vector_alloc [222]
0.00 0.00 1/25 allocate_reg_info [536]
0.00 0.00 2/64864 max_reg_num [371]
0.00 0.00 9/5201 get_max_uid [1093]
0.00 0.00 1/974 sbitmap_alloc [1124]
0.00 0.00 1/31743 sbitmap_zero [1070]
0.00 0.00 1/1 count_or_remove_death_notes [1414]
-----------------------------------------------
192.40 0.03 9/9 if_convert [6]
[7] 49.5 192.40 0.03 9 verify_flow_info [7]
0.00 0.03 15008/26961 returnjump_p [183]
0.00 0.00 3/87970 condjump_p [388]
0.00 0.00 9/5201 get_max_uid [1093]
0.00 0.00 9/231 get_insns [1157]
-----------------------------------------------
1.09 45.23 3/9 flow_loops_find [10]
2.17 90.46 6/9 if_convert [6]
[8] 35.7 3.26 135.69 9 compute_flow_dominators [8]
109.24 0.01 40604/40604 sbitmap_intersection_of_succs [9]
26.25 0.00 15025/15025 sbitmap_intersection_of_preds [13]
0.17 0.00 55638/55638 sbitmap_a_and_b [88]
0.02 0.00 9/21 sbitmap_vector_alloc [222]
0.00 0.00 9/9 sbitmap_vector_ones [688]
0.00 0.00 9/12 sbitmap_vector_zero [788]
0.00 0.00 9/31743 sbitmap_zero [1070]
-----------------------------------------------
109.24 0.01 40604/40604 compute_flow_dominators [8]
[9] 28.1 109.24 0.01 40604 sbitmap_intersection_of_succs [9]
0.01 0.00 40604/55629 sbitmap_copy [396]
-----------------------------------------------
Brad Lucier
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-01 10:54 if-conversion a performance bottleneck Brad Lucier
@ 2000-05-01 15:29 ` Richard Henderson
2000-05-01 18:01 ` Brad Lucier
0 siblings, 1 reply; 32+ messages in thread
From: Richard Henderson @ 2000-05-01 15:29 UTC (permalink / raw)
To: Brad Lucier; +Cc: gcc
On Mon, May 01, 2000 at 12:54:35PM -0500, Brad Lucier wrote:
> Perhaps with the calls to verify_flow_info, this is a temporary effect,
> I don't know.
Oops. That call was supposed to be ifdef ENABLE_CHECKING.
With that fixed, are compilation times mostly sane?
r~
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-01 15:29 ` Richard Henderson
@ 2000-05-01 18:01 ` Brad Lucier
2000-05-02 5:45 ` Michael Matz
0 siblings, 1 reply; 32+ messages in thread
From: Brad Lucier @ 2000-05-01 18:01 UTC (permalink / raw)
To: Richard Henderson; +Cc: Brad Lucier, gcc
> Oops. That call was supposed to be ifdef ENABLE_CHECKING.
> With that fixed, are compilation times mostly sane?
> r~
Yes, that helped a lot :-), but if-conversion still takes almost half
the CPU time to compile that file:
popov-211% /export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O1 -mcpu=ev6 -fno-math-errno -mieee _t-c-2.i
__copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___t_2d_c_2d_2 {GC 46791k -> 9596k} {GC 14695k -> 10090k} {GC 14135k -> 10558k} ___init_proc {GC 23543k -> 2096k} ____20___t_2d_c_2d_2
Execution times (seconds)
garbage collection : 0.70 ( 0%) usr 0.00 ( 0%) sys 0.70 ( 0%) wall
parser : 6.25 ( 3%) usr 0.18 (22%) sys 6.43 ( 3%) wall
varconst : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall
integration : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
jump : 6.50 ( 3%) usr 0.38 (48%) sys 6.88 ( 3%) wall
CSE : 2.28 ( 1%) usr 0.00 ( 0%) sys 2.28 ( 1%) wall
global CSE : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
loop analysis : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
CSE 2 : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
flow analysis : 58.86 (25%) usr 0.11 (14%) sys 58.99 (25%) wall
combiner : 2.87 ( 1%) usr 0.00 ( 0%) sys 2.87 ( 1%) wall
if-conversion : 59.53 (26%) usr 0.02 ( 3%) sys 59.59 (26%) wall
local alloc : 1.24 ( 1%) usr 0.00 ( 0%) sys 1.24 ( 1%) wall
global alloc : 2.37 ( 1%) usr 0.05 ( 6%) sys 2.42 ( 1%) wall
reload CSE regs : 3.88 ( 2%) usr 0.00 ( 0%) sys 3.88 ( 2%) wall
flow 2 : 27.86 (12%) usr 0.00 ( 0%) sys 27.86 (12%) wall
if-conversion 2 : 55.02 (24%) usr 0.01 ( 2%) sys 55.02 (24%) wall
shorten branches : 0.12 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 0%) wall
final : 2.77 ( 1%) usr 0.01 ( 1%) sys 2.78 ( 1%) wall
symout : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
rest of compilation : 1.02 ( 0%) usr 0.00 ( 0%) sys 1.02 ( 0%) wall
TOTAL : 231.37 0.81 232.20
Flat profile:
Each sample counts as 0.000976562 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
57.51 111.67 111.67 40604 2.75 2.75 sbitmap_intersection_of_succs
12.83 136.59 24.92 15025 1.66 1.66 sbitmap_intersection_of_preds
9.58 155.19 18.60 20981 0.89 0.89 for_each_successor_phi
3.78 162.54 7.35 25079015 0.00 0.00 bitmap_operation
1.99 166.40 3.86 6 643.72 5013.82 calculate_global_regs_live
1.71 169.73 3.33 18 184.84 184.84 mark_critical_edges
1.57 172.78 3.05 9 338.87 15538.80 compute_flow_dominators
1.49 175.67 2.89 3 963.87 16982.59 flow_loops_find
...
-----------------------------------------------
1.02 45.60 3/9 flow_loops_find [9]
2.03 91.20 6/9 if_convert [8]
[6] 72.0 3.05 136.80 9 compute_flow_dominators [6]
111.67 0.00 40604/40604 sbitmap_intersection_of_succs [7]
24.92 0.00 15025/15025 sbitmap_intersection_of_preds [12]
0.18 0.00 55638/55638 sbitmap_a_and_b [77]
0.02 0.00 9/21 sbitmap_vector_alloc [217]
0.00 0.00 9/9 sbitmap_vector_ones [744]
0.00 0.00 9/12 sbitmap_vector_zero [758]
0.00 0.00 9/31743 sbitmap_zero [752]
-----------------------------------------------
111.67 0.00 40604/40604 compute_flow_dominators [6]
[7] 57.5 111.67 0.00 40604 sbitmap_intersection_of_succs [7]
0.00 0.00 40604/55629 sbitmap_copy [470]
-----------------------------------------------
0.00 96.36 9/9 rest_of_compilation [5]
[8] 49.6 0.00 96.36 9 if_convert [8]
2.03 91.20 6/9 compute_flow_dominators [6]
0.00 3.05 1/10 update_life_info [10]
0.03 0.00 9/18 compute_bb_for_insn [190]
0.00 0.02 9/37 free_basic_block_vars [135]
0.00 0.02 15384/15384 find_if_header [297]
0.01 0.00 6/21 sbitmap_vector_alloc [217]
0.00 0.00 1/25 allocate_reg_info [481]
0.00 0.00 9/5192 get_max_uid [727]
0.00 0.00 2/64864 max_reg_num [454]
0.00 0.00 1/31743 sbitmap_zero [752]
0.00 0.00 1/974 sbitmap_alloc [1076]
0.00 0.00 1/1 count_or_remove_death_notes [1385]
-----------------------------------------------
2.89 48.06 3/3 rest_of_compilation [5]
[9] 26.2 2.89 48.06 3 flow_loops_find [9]
1.02 45.60 3/9 compute_flow_dominators [6]
0.71 0.00 964/964 flow_loop_exits_find [31]
0.71 0.00 1/1 flow_depth_first_order_compute [32]
0.01 0.00 3/21 sbitmap_vector_alloc [217]
0.00 0.00 3/3 flow_loops_tree_build [604]
0.00 0.00 964/964 flow_loop_nodes_find [698]
0.00 0.00 964/964 sbitmap_first_set_bit [739]
0.00 0.00 964/964 flow_loop_pre_header_find [738]
0.00 0.00 2/31743 sbitmap_zero [752]
0.00 0.00 966/974 sbitmap_alloc [1076]
0.00 0.00 964/964 sbitmap_last_set_bit [1078]
0.00 0.00 3/3 flow_loops_level_compute [1331]
-----------------------------------------------
0.00 3.05 1/10 if_convert [8]
0.00 9.14 3/10 rest_of_compilation [5]
0.00 18.28 6/10 life_analysis [14]
[10] 15.7 0.00 30.47 10 update_life_info [10]
3.86 26.22 6/6 calculate_global_regs_live [11]
0.02 0.35 15374/27234 propagate_block [34]
0.00 0.00 15374/27239 free_propagate_block_info [389]
0.01 0.00 15374/72751 bitmap_copy [262]
0.00 0.00 5125/5125 verify_local_live_at_start [573]
0.00 0.00 10/257366 bitmap_clear [283]
0.00 0.00 10/124375 bitmap_initialize [407]
-----------------------------------------------
3.86 26.22 6/6 update_life_info [10]
[11] 15.5 3.86 26.22 6 calculate_global_regs_live [11]
18.60 0.00 20981/20981 for_each_successor_phi [13]
7.30 0.00 24929375/25079015 bitmap_operation [15]
0.01 0.27 11855/27234 propagate_block [34]
0.01 0.00 24210/72751 bitmap_copy [262]
0.00 0.00 52694/257366 bitmap_clear [283]
0.00 0.00 11855/27239 free_propagate_block_info [389]
0.00 0.00 11855/11855 bitmap_equal_p [525]
0.00 0.00 20981/488036 bitmap_set_bit [167]
0.00 0.00 10261/124375 bitmap_initialize [407]
-----------------------------------------------
24.92 0.00 15025/15025 compute_flow_dominators [6]
[12] 12.8 24.92 0.00 15025 sbitmap_intersection_of_preds [12]
0.00 0.00 15025/55629 sbitmap_copy [470]
-----------------------------------------------
18.60 0.00 20981/20981 calculate_global_regs_live [11]
[13] 9.6 18.60 0.00 20981 for_each_successor_phi [13]
-----------------------------------------------
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-01 18:01 ` Brad Lucier
@ 2000-05-02 5:45 ` Michael Matz
2000-05-02 22:58 ` Michael Matz
0 siblings, 1 reply; 32+ messages in thread
From: Michael Matz @ 2000-05-02 5:45 UTC (permalink / raw)
To: Brad Lucier; +Cc: Richard Henderson, gcc
On Mon, 1 May 2000, Brad Lucier wrote:
> Each sample counts as 0.000976562 seconds.
> % cumulative self self total
> time seconds seconds calls ms/call ms/call name
> 57.51 111.67 111.67 40604 2.75 2.75 sbitmap_intersection_of_succs
> 12.83 136.59 24.92 15025 1.66 1.66 sbitmap_intersection_of_preds
I once had faster versions of these two functions, if I get home I'll see
if they make any difference on your input data.
Ciao,
Michael.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-02 5:45 ` Michael Matz
@ 2000-05-02 22:58 ` Michael Matz
2000-05-03 5:50 ` Brad Lucier
` (3 more replies)
0 siblings, 4 replies; 32+ messages in thread
From: Michael Matz @ 2000-05-02 22:58 UTC (permalink / raw)
To: Brad Lucier; +Cc: Richard Henderson, gcc
Hi,
On Tue, 2 May 2000, Michael Matz wrote:
> > time seconds seconds calls ms/call ms/call name
> > 57.51 111.67 111.67 40604 2.75 2.75 sbitmap_intersection_of_succs
> > 12.83 136.59 24.92 15025 1.66 1.66 sbitmap_intersection_of_preds
>
> I once had faster versions of these two functions, if I get home I'll see
> if they make any difference on your input data.
Before fiddling with sbitmap_intersection_of_xx() I first reworked
compute_flow_dominators() to also behave normally in calculating the
post_doms. At least the order of work-queue initialization was wrong (in
post_dom the changes are propagating from the _end_). I then also
implemented a poor man's topological sort for cyclic graphs ;), which
again gave a better performance. (I also did this once for doms, but there
it didn't make a great difference on Brads test cases)
Please try the attached diff (against actual CVS) if they make also a
difference for you ;)
Ciao,
Michael.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-02 22:58 ` Michael Matz
@ 2000-05-03 5:50 ` Brad Lucier
2000-05-03 20:05 ` Brad Lucier
` (2 subsequent siblings)
3 siblings, 0 replies; 32+ messages in thread
From: Brad Lucier @ 2000-05-03 5:50 UTC (permalink / raw)
To: Michael Matz; +Cc: Brad Lucier, Richard Henderson, gcc
> Please try the attached diff (against actual CVS) if they make also a
> difference for you ;)
The diff didn't apply cleanly, I don't know why, so I applied it by hand.
I didn't have time this morning to install a new profiled version, so
I ran make bootstrap with the usual BOOT_CFLAGS.
Your patch did help. Here is a comparison between the a profiled version
without your patch and an unprofiled version with your patch:
popov-726% /export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O1 -mcpu=ev6 -fno-math-errno -mieee _t-c-2.i
__copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___t_2d_c_2d_2 {GC 46791k -> 9596k} {GC 14695k -> 10090k} {GC 14135k -> 10558k} ___init_proc {GC 23543k -> 2096k} ____20___t_2d_c_2d_2
Execution times (seconds)
garbage collection : 0.71 ( 0%) usr 0.00 ( 1%) sys 0.71 ( 0%) wall
parser : 6.13 ( 3%) usr 0.19 (25%) sys 6.32 ( 3%) wall
varconst : 0.02 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall
integration : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
jump : 6.45 ( 3%) usr 0.39 (51%) sys 6.84 ( 3%) wall
CSE : 2.20 ( 1%) usr 0.00 ( 1%) sys 2.20 ( 1%) wall
global CSE : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
loop analysis : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
CSE 2 : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
flow analysis : 63.17 (27%) usr 0.06 ( 8%) sys 63.21 (27%) wall
combiner : 2.81 ( 1%) usr 0.00 ( 0%) sys 2.81 ( 1%) wall
if-conversion : 56.39 (24%) usr 0.01 ( 2%) sys 56.39 (24%) wall
local alloc : 1.24 ( 1%) usr 0.00 ( 0%) sys 1.24 ( 1%) wall
global alloc : 2.35 ( 1%) usr 0.05 ( 8%) sys 2.41 ( 1%) wall
reload CSE regs : 3.68 ( 2%) usr 0.00 ( 0%) sys 3.68 ( 2%) wall
flow 2 : 29.86 (13%) usr 0.00 ( 1%) sys 29.86 (13%) wall
if-conversion 2 : 58.60 (25%) usr 0.01 ( 2%) sys 58.60 (25%) wall
shorten branches : 0.11 ( 0%) usr 0.00 ( 0%) sys 0.11 ( 0%) wall
final : 2.68 ( 1%) usr 0.00 ( 1%) sys 2.70 ( 1%) wall
symout : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
rest of compilation : 1.02 ( 0%) usr 0.00 ( 0%) sys 1.02 ( 0%) wall
TOTAL : 237.52 0.77 238.47
popov-727% gcc -v
Reading specs from /export/u10/gcc-2.95.1/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.95.1/specs
gcc version 2.95.1 19990816 (release)
popov-728% /export/u10/egcs-test/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O1 -mcpu=ev6 -fno-math-errno -mieee _t-c-2.i
__copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___t_2d_c_2d_2 {GC 46791k -> 9596k} {GC 14695k -> 10090k} {GC 14135k -> 10558k} ___init_proc {GC 23543k -> 2096k} ____20___t_2d_c_2d_2
Execution times (seconds)
garbage collection : 0.32 ( 0%) usr 0.00 ( 1%) sys 0.32 ( 0%) wall
parser : 2.92 ( 2%) usr 0.18 (25%) sys 3.10 ( 2%) wall
varconst : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall
integration : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
jump : 4.56 ( 3%) usr 0.38 (52%) sys 4.94 ( 3%) wall
CSE : 0.81 ( 1%) usr 0.00 ( 0%) sys 0.81 ( 1%) wall
global CSE : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
loop analysis : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
CSE 2 : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
flow analysis : 51.69 (32%) usr 0.06 ( 8%) sys 51.74 (32%) wall
combiner : 0.99 ( 1%) usr 0.00 ( 0%) sys 0.99 ( 1%) wall
if-conversion : 33.69 (21%) usr 0.01 ( 2%) sys 33.70 (21%) wall
local alloc : 0.54 ( 0%) usr 0.00 ( 0%) sys 0.54 ( 0%) wall
global alloc : 1.11 ( 1%) usr 0.05 ( 7%) sys 1.16 ( 1%) wall
reload CSE regs : 3.15 ( 2%) usr 0.00 ( 1%) sys 3.15 ( 2%) wall
flow 2 : 23.85 (15%) usr 0.00 ( 0%) sys 23.84 (15%) wall
if-conversion 2 : 33.45 (21%) usr 0.01 ( 2%) sys 33.45 (21%) wall
shorten branches : 0.06 ( 0%) usr 0.00 ( 0%) sys 0.06 ( 0%) wall
final : 2.46 ( 2%) usr 0.00 ( 0%) sys 2.46 ( 2%) wall
symout : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
rest of compilation : 0.43 ( 0%) usr 0.00 ( 0%) sys 0.43 ( 0%) wall
TOTAL : 160.09 0.73 160.78
I can get more information this afternoon (this evening in Germany :-).
Brad
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-02 22:58 ` Michael Matz
2000-05-03 5:50 ` Brad Lucier
@ 2000-05-03 20:05 ` Brad Lucier
2000-05-04 11:46 ` Michael Matz
2000-05-04 11:55 ` if-conversion a performance bottleneck Richard Henderson
2000-05-05 12:01 ` Jeffrey A Law
3 siblings, 1 reply; 32+ messages in thread
From: Brad Lucier @ 2000-05-03 20:05 UTC (permalink / raw)
To: Michael Matz; +Cc: Brad Lucier, Richard Henderson, gcc
> Please try the attached diff (against actual CVS) if they make also a
> difference for you ;)
Your changes to flow.c have cut the number of calls to
sbitmap_intersection_of_succs from 40604 to 24225, so they are definitely
worthwhile. Bootstrapped on alphaev6-unknown-linux-gnu.
Brad
Here are the new statistics with your changes:
Execution times (seconds)
garbage collection : 1.30 ( 1%) usr 0.00 ( 0%) sys 1.30 ( 1%) wall
parser : 6.45 ( 3%) usr 0.19 (15%) sys 6.64 ( 3%) wall
varconst : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall
integration : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
jump : 20.06 ( 9%) usr 0.84 (66%) sys 20.90 ( 9%) wall
CSE : 2.77 ( 1%) usr 0.00 ( 0%) sys 2.77 ( 1%) wall
global CSE : 5.51 ( 2%) usr 0.01 ( 1%) sys 5.52 ( 2%) wall
loop analysis : 0.24 ( 0%) usr 0.00 ( 0%) sys 0.24 ( 0%) wall
CSE 2 : 2.29 ( 1%) usr 0.00 ( 0%) sys 2.29 ( 1%) wall
flow analysis : 52.11 (23%) usr 0.06 ( 5%) sys 52.16 (23%) wall
combiner : 2.80 ( 1%) usr 0.00 ( 0%) sys 2.80 ( 1%) wall
if-conversion : 44.42 (20%) usr 0.02 ( 2%) sys 44.43 (20%) wall
regmove : 0.50 ( 0%) usr 0.00 ( 0%) sys 0.50 ( 0%) wall
scheduling : 6.26 ( 3%) usr 0.01 ( 1%) sys 6.27 ( 3%) wall
local alloc : 1.49 ( 1%) usr 0.00 ( 0%) sys 1.49 ( 1%) wall
global alloc : 2.78 ( 1%) usr 0.07 ( 6%) sys 2.86 ( 1%) wall
reload CSE regs : 7.92 ( 4%) usr 0.01 ( 1%) sys 7.93 ( 3%) wall
flow 2 : 19.36 ( 9%) usr 0.00 ( 0%) sys 19.35 ( 9%) wall
if-conversion 2 : 36.38 (16%) usr 0.01 ( 1%) sys 36.38 (16%) wall
peephole 2 : 0.03 ( 0%) usr 0.00 ( 0%) sys 0.03 ( 0%) wall
schedulding 2 : 8.88 ( 4%) usr 0.00 ( 0%) sys 8.88 ( 4%) wall
shorten branches : 0.12 ( 0%) usr 0.00 ( 0%) sys 0.12 ( 0%) wall
final : 3.22 ( 1%) usr 0.00 ( 0%) sys 3.22 ( 1%) wall
symout : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall
rest of compilation : 1.07 ( 0%) usr 0.00 ( 0%) sys 1.07 ( 0%) wall
TOTAL : 226.10 1.27 227.32
Flat profile:
Each sample counts as 0.000976562 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
42.54 69.80 69.80 24225 2.88 2.88 sbitmap_intersection_of_succs
17.24 98.09 28.29 15025 1.88 1.88 sbitmap_intersection_of_preds
5.03 106.34 8.25 42 196.36 196.36 mark_critical_edges
3.98 112.86 6.52 25085231 0.00 0.00 bitmap_operation
3.56 118.70 5.84 10248 0.57 0.62 compute_block_backward_dependences
3.44 124.35 5.65 9 627.60 11546.70 compute_flow_dominators
2.37 128.24 3.89 21 185.36 185.93 delete_unreachable_blocks
2.31 132.03 3.79 6 631.02 1760.19 calculate_global_regs_live
...
-----------------------------------------------
1.88 32.76 3/9 flow_loops_find [9]
3.77 65.51 6/9 if_convert [8]
[6] 63.3 5.65 98.27 9 compute_flow_dominators [6]
69.80 0.00 24225/24225 sbitmap_intersection_of_succs [7]
28.29 0.00 15025/15025 sbitmap_intersection_of_preds [10]
0.16 0.00 39259/39259 sbitmap_a_and_b [135]
0.02 0.00 9/27 sbitmap_vector_alloc [255]
0.00 0.00 9/18 sbitmap_vector_zero [771]
0.00 0.00 9/74716 sbitmap_zero [725]
0.00 0.00 9/9 sbitmap_vector_ones [1365]
-----------------------------------------------
69.80 0.00 24225/24225 compute_flow_dominators [6]
[7] 42.5 69.80 0.00 24225 sbitmap_intersection_of_succs [7]
0.00 0.00 24225/39250 sbitmap_copy [612]
-----------------------------------------------
0.00 69.40 12/12 rest_of_compilation [5]
[8] 42.3 0.00 69.40 12 if_convert [8]
3.77 65.51 6/9 compute_flow_dominators [6]
0.00 0.06 12/43 free_basic_block_vars [112]
0.03 0.00 12/48 compute_bb_for_insn [147]
0.00 0.01 20509/20509 find_if_header [391]
0.01 0.00 6/27 sbitmap_vector_alloc [255]
0.00 0.00 1/10258 update_life_info [12]
0.00 0.00 1/37 allocate_reg_info [540]
0.00 0.00 1/20500 count_or_remove_death_notes [36]
0.00 0.00 1/995 sbitmap_alloc [679]
0.00 0.00 2/145008 max_reg_num [374]
0.00 0.00 1/74716 sbitmap_zero [725]
0.00 0.00 12/5225 get_max_uid [1134]
-----------------------------------------------
2.96 36.17 3/3 rest_of_compilation [5]
[9] 23.8 2.96 36.17 3 flow_loops_find [9]
1.88 32.76 3/9 compute_flow_dominators [6]
0.76 0.00 964/964 flow_loop_exits_find [38]
0.75 0.00 1/1 flow_depth_first_order_compute [39]
0.01 0.00 3/27 sbitmap_vector_alloc [255]
0.00 0.00 964/964 flow_loop_pre_header_find [627]
0.00 0.00 966/995 sbitmap_alloc [679]
0.00 0.00 3/3 flow_loops_tree_build [747]
0.00 0.00 964/964 flow_loop_nodes_find [791]
0.00 0.00 964/964 sbitmap_last_set_bit [834]
0.00 0.00 2/74716 sbitmap_zero [725]
0.00 0.00 964/964 sbitmap_first_set_bit [1174]
0.00 0.00 3/3 flow_loops_level_compute [1435]
-----------------------------------------------
28.29 0.00 15025/15025 compute_flow_dominators [6]
[10] 17.2 28.29 0.00 15025 sbitmap_intersection_of_preds [10]
0.00 0.00 15025/39250 sbitmap_copy [612]
-----------------------------------------------
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-03 20:05 ` Brad Lucier
@ 2000-05-04 11:46 ` Michael Matz
2000-05-04 19:39 ` Brad Lucier
0 siblings, 1 reply; 32+ messages in thread
From: Michael Matz @ 2000-05-04 11:46 UTC (permalink / raw)
To: Brad Lucier; +Cc: Richard Henderson, gcc
Hi,
On Wed, 3 May 2000, Brad Lucier wrote:
> Your changes to flow.c have cut the number of calls to
> sbitmap_intersection_of_succs from 40604 to 24225, so they are definitely
> worthwhile.
Yes, that was the purpose of the patch ;)
Anyway, I saw some issues of that patch with one of your other test files,
where the runtime exploded, and on my HDD at home I have another version
which even more reduces the time. If I find time, I'll rewrite the sbitmap
routines for large homogeneous bitmaps, clean up all that stuff and see if
anything comes out of it ;)
Btw. do you have also slightly smaller test cases (a quarter or so)? At
home I only have a slow machine and waiting 10 minutes everytime to see,
if one change was worthwhile is, erhh... slow.
> rest of compilation : 1.07 ( 0%) usr 0.00 ( 0%) sys 1.07 ( 0%) wall
> TOTAL : 226.10 1.27 227.32
God! For me that was ca. 1110 seconds ;) (Now I'm down to 580)
Ciao,
Michael.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-04 11:46 ` Michael Matz
@ 2000-05-04 19:39 ` Brad Lucier
2000-05-04 22:16 ` Richard Henderson
0 siblings, 1 reply; 32+ messages in thread
From: Brad Lucier @ 2000-05-04 19:39 UTC (permalink / raw)
To: Michael Matz; +Cc: Brad Lucier, Richard Henderson, gcc
> Btw. do you have also slightly smaller test cases (a quarter or so)? At
> home I only have a slow machine and waiting 10 minutes everytime to see,
> if one change was worthwhile is, erhh... slow.
Try
http://www.math.purdue.edu/~lucier/_t-c-1.i.gz
I get
popov-1089% /export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O1 -mcpu=ev6 -fno-math-errno -mieee _t-c-1.i
...
TOTAL : 31.78 0.36 32.13
BTW, with the introduction of libgcc1, I can no longer
make -j bootstrap
on my Linux box---I run out of processes with the standard value of
NR_TASKS=512! I install a new kernel tomorrow with NR_TASKS=4096.
Brad
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-04 19:39 ` Brad Lucier
@ 2000-05-04 22:16 ` Richard Henderson
2000-05-10 8:30 ` Andreas Schwab
0 siblings, 1 reply; 32+ messages in thread
From: Richard Henderson @ 2000-05-04 22:16 UTC (permalink / raw)
To: Brad Lucier; +Cc: Michael Matz, gcc
On Thu, May 04, 2000 at 09:39:36PM -0500, Brad Lucier wrote:
> BTW, with the introduction of libgcc1, I can no longer
>
> make -j bootstrap
>
> on my Linux box---I run out of processes with the standard value of
> NR_TASKS=512! I install a new kernel tomorrow with NR_TASKS=4096.
Err.. don't do that? Try make -j3 instead.
r~
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-04 22:16 ` Richard Henderson
@ 2000-05-10 8:30 ` Andreas Schwab
2000-05-10 9:36 ` Joe Buck
0 siblings, 1 reply; 32+ messages in thread
From: Andreas Schwab @ 2000-05-10 8:30 UTC (permalink / raw)
To: Richard Henderson; +Cc: Brad Lucier, Michael Matz, gcc
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 704 bytes --]
Richard Henderson <rth@cygnus.com> writes:
|> On Thu, May 04, 2000 at 09:39:36PM -0500, Brad Lucier wrote:
|> > BTW, with the introduction of libgcc1, I can no longer
|> >
|> > make -j bootstrap
|> >
|> > on my Linux box---I run out of processes with the standard value of
|> > NR_TASKS=512! I install a new kernel tomorrow with NR_TASKS=4096.
|>
|> Err.. don't do that? Try make -j3 instead.
Or say make -j -l <some small value> to limit depending on the load.
Andreas.
--
Andreas Schwab "And now for something
SuSE Labs completely different."
Andreas.Schwab@suse.de
SuSE GmbH, Schanzäckerstr. 10, D-90443 Nürnberg
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-10 8:30 ` Andreas Schwab
@ 2000-05-10 9:36 ` Joe Buck
2000-05-10 9:52 ` Jeffrey A Law
2000-05-10 14:17 ` Joern Rennecke
0 siblings, 2 replies; 32+ messages in thread
From: Joe Buck @ 2000-05-10 9:36 UTC (permalink / raw)
To: Andreas Schwab; +Cc: Richard Henderson, Brad Lucier, Michael Matz, gcc
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 1475 bytes --]
> Richard Henderson <rth@cygnus.com> writes:
>
> |> On Thu, May 04, 2000 at 09:39:36PM -0500, Brad Lucier wrote:
> |> > BTW, with the introduction of libgcc1, I can no longer
> |> >
> |> > make -j bootstrap
> |> >
> |> > on my Linux box---I run out of processes with the standard value of
> |> > NR_TASKS=512! I install a new kernel tomorrow with NR_TASKS=4096.
> |>
> |> Err.. don't do that? Try make -j3 instead.
>
> Or say make -j -l <some small value> to limit depending on the load.
No, -l does not work and never did. The reason is that the load value
increases very slowly as processes are launched; as a result, if you
say make -j -l, a very large number of processes get launched initially,
and it may take 10 seconds or so for the load to climb to the level
specified in -l -- but then the load climbs to 20x that figure or more.
At this point, no more processes get launched until at least a minute
after all of the first batch dies, and then another process storm is
launched. This is becuase the number checked by -l is based on an average
load over a full minute, not an instantaneous figure.
Solution: get make 3.79 and use -j<N>. The GNU make folks should
deprecate -l and eventually remove it.
> Andreas.
>
> --
> Andreas Schwab "And now for something
> SuSE Labs completely different."
> Andreas.Schwab@suse.de
> SuSE GmbH, Schanzäckerstr. 10, D-90443 Nürnberg
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-10 9:36 ` Joe Buck
@ 2000-05-10 9:52 ` Jeffrey A Law
2000-05-10 10:49 ` Joe Buck
2000-05-10 14:17 ` Joern Rennecke
1 sibling, 1 reply; 32+ messages in thread
From: Jeffrey A Law @ 2000-05-10 9:52 UTC (permalink / raw)
To: Joe Buck
Cc: Andreas Schwab, Richard Henderson, Brad Lucier, Michael Matz, gcc
In message < 200005101628.JAA25911@possibly.synopsys.com >you write:
> Solution: get make 3.79 and use -j<N>. The GNU make folks should
> deprecate -l and eventually remove it.
Most definitely. I upgraded to 3.79 a few days ago and have really liked
the way -j works in the new version. I strongly recommend anyone building
gcc on an SMP machine upgrade to make-3.79.
jeff
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-10 9:52 ` Jeffrey A Law
@ 2000-05-10 10:49 ` Joe Buck
0 siblings, 0 replies; 32+ messages in thread
From: Joe Buck @ 2000-05-10 10:49 UTC (permalink / raw)
To: law; +Cc: Andreas Schwab, Richard Henderson, Brad Lucier, Michael Matz, gcc
> In message < 200005101628.JAA25911@possibly.synopsys.com >you write:
> > Solution: get make 3.79 and use -j<N>. The GNU make folks should
> > deprecate -l and eventually remove it.
> Most definitely. I upgraded to 3.79 a few days ago and have really liked
> the way -j works in the new version. I strongly recommend anyone building
> gcc on an SMP machine upgrade to make-3.79.
Even a 1-processor machine can benefit from -j2, as the compiler will
spend a significant portion of its time reading or writing disk, leaving
idle CPU for the other process.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-10 9:36 ` Joe Buck
2000-05-10 9:52 ` Jeffrey A Law
@ 2000-05-10 14:17 ` Joern Rennecke
2000-05-10 14:24 ` GNU make options (was Re: if-conversion ...) Joe Buck
1 sibling, 1 reply; 32+ messages in thread
From: Joern Rennecke @ 2000-05-10 14:17 UTC (permalink / raw)
To: Joe Buck
Cc: Andreas Schwab, Richard Henderson, Brad Lucier, Michael Matz, gcc
> No, -l does not work and never did. The reason is that the load value
> increases very slowly as processes are launched; as a result, if you
> say make -j -l, a very large number of processes get launched initially,
> and it may take 10 seconds or so for the load to climb to the level
> specified in -l -- but then the load climbs to 20x that figure or more.
> At this point, no more processes get launched until at least a minute
> after all of the first batch dies, and then another process storm is
> launched. This is becuase the number checked by -l is based on an average
> load over a full minute, not an instantaneous figure.
>
> Solution: get make 3.79 and use -j<N>. The GNU make folks should
> deprecate -l and eventually remove it.
I wouldn't go so far. It seems clear that -l is not useful to make make
react to the load it creates itself, but you might use it instead or in
addition to nice to control how much a batch job taxes a system that is
used for interactive stuff as well.
^ permalink raw reply [flat|nested] 32+ messages in thread
* GNU make options (was Re: if-conversion ...)
2000-05-10 14:17 ` Joern Rennecke
@ 2000-05-10 14:24 ` Joe Buck
0 siblings, 0 replies; 32+ messages in thread
From: Joe Buck @ 2000-05-10 14:24 UTC (permalink / raw)
To: Joern Rennecke
Cc: Andreas Schwab, Richard Henderson, Brad Lucier, Michael Matz, gcc
I wrote (re GNU make options)
> > Solution: get make 3.79 and use -j<N>. The GNU make folks should
> > deprecate -l and eventually remove it.
>
> I wouldn't go so far. It seems clear that -l is not useful to make make
> react to the load it creates itself, but you might use it instead or in
> addition to nice to control how much a batch job taxes a system that is
> used for interactive stuff as well.
OK, the combination -j<N> -l <limit> might be useful for that purpose.
But -j without an argument and -l together is a recipe for major problems.
If there is a high degree of parallelism it may result in exhausting
virtual memory, which can be very bad on systems that optimistically
allocate virtual memory (e.g. Linux): the system thrashes severely and
then tries to kill processes at random. This huge load and memory
exhaustion can strike before the load average rises very much.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-02 22:58 ` Michael Matz
2000-05-03 5:50 ` Brad Lucier
2000-05-03 20:05 ` Brad Lucier
@ 2000-05-04 11:55 ` Richard Henderson
2000-05-05 12:01 ` Jeffrey A Law
3 siblings, 0 replies; 32+ messages in thread
From: Richard Henderson @ 2000-05-04 11:55 UTC (permalink / raw)
To: Michael Matz; +Cc: Brad Lucier, gcc
On Wed, May 03, 2000 at 07:53:00AM +0200, Michael Matz wrote:
> I then also implemented a poor man's topological sort for cyclic graphs
> ;), which again gave a better performance. (I also did this once for doms,
> but there it didn't make a great difference on Brads test cases)
Would you split this off into a new function? Perhaps called
order_bb_for_backward_flow or something? This could be useful elsewhere.
r~
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-02 22:58 ` Michael Matz
` (2 preceding siblings ...)
2000-05-04 11:55 ` if-conversion a performance bottleneck Richard Henderson
@ 2000-05-05 12:01 ` Jeffrey A Law
2000-05-05 14:32 ` flow_d_f_o_compute misnamed? (was: if-conversion a performance...) Michael Matz
3 siblings, 1 reply; 32+ messages in thread
From: Jeffrey A Law @ 2000-05-05 12:01 UTC (permalink / raw)
To: Michael Matz; +Cc: Brad Lucier, Richard Henderson, gcc
In message < Pine.SOL.4.10.10005030741450.1121-200000@platon >you write:
> This message is in MIME format. The first part should be readable text,
> while the remaining parts are likely unreadable without MIME-aware tools.
> Send mail to mime@docserver.cac.washington.edu for more info.
>
> ---559023410-33463914-957333180=:1121
> Content-Type: TEXT/PLAIN; charset=US-ASCII
>
> Hi,
>
> On Tue, 2 May 2000, Michael Matz wrote:
> > > time seconds seconds calls ms/call ms/call name
> > > 57.51 111.67 111.67 40604 2.75 2.75 sbitmap_intersection_of_succs
> > > 12.83 136.59 24.92 15025 1.66 1.66 sbitmap_intersection_of_preds
> >
> > I once had faster versions of these two functions, if I get home I'll see
> > if they make any difference on your input data.
>
> Before fiddling with sbitmap_intersection_of_xx() I first reworked
> compute_flow_dominators() to also behave normally in calculating the
> post_doms. At least the order of work-queue initialization was wrong (in
> post_dom the changes are propagating from the _end_). I then also
> implemented a poor man's topological sort for cyclic graphs ;), which
> again gave a better performance. (I also did this once for doms, but there
> it didn't make a great difference on Brads test cases)
>
> Please try the attached diff (against actual CVS) if they make also a
> difference for you ;)
Like Richard, I'd like to see you submit this as a function which can be
called from multiple locations in the compiler.
We've got a number of routines that _might_ benefit from this code, but
I'd also like to see some more general benchmarking. I don't want to
see us slow down the compiler for the common cases just to make Brad's
one test run faster.
jeff
^ permalink raw reply [flat|nested] 32+ messages in thread
* flow_d_f_o_compute misnamed? (was: if-conversion a performance...)
2000-05-05 12:01 ` Jeffrey A Law
@ 2000-05-05 14:32 ` Michael Matz
2000-05-05 14:53 ` Jeffrey A Law
0 siblings, 1 reply; 32+ messages in thread
From: Michael Matz @ 2000-05-05 14:32 UTC (permalink / raw)
To: Jeffrey A Law; +Cc: Richard Henderson, gcc
Hi,
On Fri, 5 May 2000, Jeffrey A Law wrote:
> > post_dom the changes are propagating from the _end_). I then also
> > implemented a poor man's topological sort for cyclic graphs ;), which
> >
> Like Richard, I'd like to see you submit this as a function which can be
> called from multiple locations in the compiler.
Will do so on the weekend.
> We've got a number of routines that _might_ benefit from this code, but
> I'd also like to see some more general benchmarking. I don't want to
> see us slow down the compiler for the common cases just to make Brad's
> one test run faster.
Yes, right now this sorting is linear with number of edges, while the
simple reverse filling of the worklist is linear with n_basic_blocks. May
be it would be worthwhile to make it
order_bb_for_backward_flow(basic_block *dest, int simple_reverse)
and if the arg is true, only fill dest backwards instead of trying to be
intelligent.
Meanwhile I noticed a slight strangeness with
flow_depth_first_order_compute, in that it does not compute any DFS order
but more a width first one. E.g. if presented with a graph like
{(a,b),(a,c),(b,d),(c,d),(ENTRY,a),(d,EXIT)} (i.e. a simple if) it
produces an order of [a,b,c,d] which is not according to any DF definition
I know of ;) which would result in e.g. [a,b,d,c].
In case anybody is interested, I'm currently trying to implement a linear
time dominator algorithm, and wanted to use some stuff already there. (As
I also need a backward DFS for post-dom, I anyway would have to implement
it myself, but for the beginning ...)
Ciao,
Michael.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: flow_d_f_o_compute misnamed? (was: if-conversion a performance...)
2000-05-05 14:32 ` flow_d_f_o_compute misnamed? (was: if-conversion a performance...) Michael Matz
@ 2000-05-05 14:53 ` Jeffrey A Law
2000-05-05 16:10 ` Michael Matz
0 siblings, 1 reply; 32+ messages in thread
From: Jeffrey A Law @ 2000-05-05 14:53 UTC (permalink / raw)
To: Michael Matz; +Cc: Richard Henderson, gcc
In message < Pine.SOL.4.10.10005052304100.4463-100000@platon >you write:
> Hi,
>
> On Fri, 5 May 2000, Jeffrey A Law wrote:
> > > post_dom the changes are propagating from the _end_). I then also
> > > implemented a poor man's topological sort for cyclic graphs ;), which
> > >
> > Like Richard, I'd like to see you submit this as a function which can be
> > called from multiple locations in the compiler.
>
> Will do so on the weekend.
Great!
> > We've got a number of routines that _might_ benefit from this code, but
> > I'd also like to see some more general benchmarking. I don't want to
> > see us slow down the compiler for the common cases just to make Brad's
> > one test run faster.
>
> Yes, right now this sorting is linear with number of edges, while the
> simple reverse filling of the worklist is linear with n_basic_blocks. May
> be it would be worthwhile to make it
> order_bb_for_backward_flow(basic_block *dest, int simple_reverse)
> and if the arg is true, only fill dest backwards instead of trying to be
> intelligent.
I didn't look at the code in any detail, it's entirely possible we're loading
up the worklist in a silly order right now, which can be easily resolved and
may provide most of the benefit of your patch with lower cost.
Most CFGs are relatively well behaved with fast convergence for iterative
dataflow equations -- convergence is bad typically when the CFG has a lot
of nested backedges, and even then those backedges have to enclose a large
number of blocks for iterative dataflow analysis to slow down significantly.
There are certainly other possibilities for speeding up the various dataflow
solvers, whether it's a better dominator algorithm or more efficient code
to solve the LCM dataflow equations.
> Meanwhile I noticed a slight strangeness with
> flow_depth_first_order_compute, in that it does not compute any DFS order
> but more a width first one. E.g. if presented with a graph like
> {(a,b),(a,c),(b,d),(c,d),(ENTRY,a),(d,EXIT)} (i.e. a simple if) it
> produces an order of [a,b,c,d] which is not according to any DF definition
> I know of ;) which would result in e.g. [a,b,d,c].
If that's what it's doing, then it certainly doesn't sound like DFS to me ;-)
> In case anybody is interested, I'm currently trying to implement a linear
> time dominator algorithm, and wanted to use some stuff already there. (As
> I also need a backward DFS for post-dom, I anyway would have to implement
> it myself, but for the beginning ...)
Which algorithm? Improving dominator computation time is going to be more
important going forward as we use that information in more and more places.
It would probably be useful for you to go ahead and get the legal paperwork
filed with the FSF so that we can use your new dominator code when it's
ready. http://gcc.gnu.org/contribute.html will provide you with the details
you need for your copyright assignment and/or employer disclaimer.
jeff
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: flow_d_f_o_compute misnamed? (was: if-conversion a performance...)
2000-05-05 14:53 ` Jeffrey A Law
@ 2000-05-05 16:10 ` Michael Matz
2000-05-05 17:05 ` Jeffrey A Law
0 siblings, 1 reply; 32+ messages in thread
From: Michael Matz @ 2000-05-05 16:10 UTC (permalink / raw)
To: Jeffrey A Law; +Cc: Richard Henderson, gcc
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 3093 bytes --]
Hi,
On Fri, 5 May 2000, Jeffrey A Law wrote:
> I didn't look at the code in any detail, it's entirely possible we're loading
> up the worklist in a silly order right now, which can be easily resolved and
> may provide most of the benefit of your patch with lower cost.
Right now the worklist is filled in natural order, which couldn't be worse
for a ´backward' problem (post dominators) ;) Reversing this already helps
alot. Filling it in an intelligent way again reduces the blocks which get
reevaluated, and additionally checking for changes minimizes that number.
I'm now down to only 386 blocks reinserted into the worklist out of ca.
5000 bb in Brads testcase (originally there were around 15000
reinsertions). I doubt that I would get anywhere further with the
bitvector based algorithm. May be another representation of the bitvectors
would help (one which accounts for large blocks of same bits) in case of
_many_ basic blocks. I was going to implement these, but I decided that my
time is better spent in replacing the whole bitvector based algorithm.
It's also more interesting ;)
> There are certainly other possibilities for speeding up the various dataflow
> solvers, whether it's a better dominator algorithm or more efficient code
> to solve the LCM dataflow equations.
I also noticed in reading through the code, that there are many places,
where data flow is calculated in some way, that do slightly the same but
nevertheless do not reuse any code. It would be good to uniform that
middle end (I mean the language and machine independent things), but gcc
is such a huge thing ;)
> > {(a,b),(a,c),(b,d),(c,d),(ENTRY,a),(d,EXIT)} (i.e. a simple if) it
> > produces an order of [a,b,c,d] which is not according to any DF definition
> > I know of ;) which would result in e.g. [a,b,d,c].
> If that's what it's doing, then it certainly doesn't sound like DFS to me ;-)
So may be it's only misnamed? But then again, it's used in
flow_loops_find() to find outer loops before inner ones. I don't see right
now where a normal DFS would help that, may be I'm too tired :)
> > In case anybody is interested, I'm currently trying to implement a linear
> > time dominator algorithm, and wanted to use some stuff already there. (As
> > I also need a backward DFS for post-dom, I anyway would have to implement
> > it myself, but for the beginning ...)
> Which algorithm? Improving dominator computation time is going to be more
> important going forward as we use that information in more and more places.
From Alstrup/Lauridsen/Thorup ( http://www.diku.dk/research-groups/topps/ ,
I think report 320 or so). Do you have any other accessible? I haven't
found more (well, I only searched the net).
> It would probably be useful for you to go ahead and get the legal paperwork
> filed with the FSF so that we can use your new dominator code when it's
> ready. http://gcc.gnu.org/contribute.html will provide you with the details
> you need for your copyright assignment and/or employer disclaimer.
I feared somebody would say that. Bloody bureaucracy ;)
Ciao,
Michael.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: flow_d_f_o_compute misnamed? (was: if-conversion a performance...)
2000-05-05 16:10 ` Michael Matz
@ 2000-05-05 17:05 ` Jeffrey A Law
2000-05-05 18:21 ` Michael Matz
0 siblings, 1 reply; 32+ messages in thread
From: Jeffrey A Law @ 2000-05-05 17:05 UTC (permalink / raw)
To: Michael Matz; +Cc: Richard Henderson, gcc
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 2626 bytes --]
In message < Pine.SOL.4.10.10005060022080.4463-100000@platon >you write:
> Hi,
>
> On Fri, 5 May 2000, Jeffrey A Law wrote:
> > I didn't look at the code in any detail, it's entirely possible we're loa
> ding
> > up the worklist in a silly order right now, which can be easily resolved
> and
> > may provide most of the benefit of your patch with lower cost.
>
> Right now the worklist is filled in natural order, which couldn't be worse
> for a ´backward' problem (post dominators) ;) Reversing this already helps
> alot.
Yea, that was exactly what I was asking. The cost is almost non-existent
with notable benefits.
> Filling it in an intelligent way again reduces the blocks which get
> reevaluated, and additionally checking for changes minimizes that number.
> I'm now down to only 386 blocks reinserted into the worklist out of ca.
> 5000 bb in Brads testcase (originally there were around 15000
> reinsertions).
Right. This is typical behavior when we're looking at stuff in the opposite
of the intended order.
How many reinsertions do we get by just reversing the order of insertion
into the worklist?
> > > {(a,b),(a,c),(b,d),(c,d),(ENTRY,a),(d,EXIT)} (i.e. a simple if) it
> > > produces an order of [a,b,c,d] which is not according to any DF defin
> ition
> > > I know of ;) which would result in e.g. [a,b,d,c].
> > If that's what it's doing, then it certainly doesn't sound like DFS to me
> ;-)
>
> So may be it's only misnamed? But then again, it's used in
> flow_loops_find() to find outer loops before inner ones. I don't see right
> now where a normal DFS would help that, may be I'm too tired :)
A DFS numbering will detect inner loops first, I haven't worked through
whether or not a BFS would detect outer loops first or not.
> >From Alstrup/Lauridsen/Thorup ( http://www.diku.dk/research-groups/topps/ ,
> I think report 320 or so). Do you have any other accessible? I haven't
> found more (well, I only searched the net).
Are they based on the Lengauer & Tarjan algorithm? That's the one I hear
the most about in papers & texts for fast dominator computations.
> > It would probably be useful for you to go ahead and get the legal paperwo
> rk
> > filed with the FSF so that we can use your new dominator code when it's
> > ready. http://gcc.gnu.org/contribute.html will provide you with the deta
> ils
> > you need for your copyright assignment and/or employer disclaimer.
>
> I feared somebody would say that. Bloody bureaucracy ;)
:-) For better or worse, legal paperwork is a requirement.
jeff
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: flow_d_f_o_compute misnamed? (was: if-conversion a performance...)
2000-05-05 17:05 ` Jeffrey A Law
@ 2000-05-05 18:21 ` Michael Matz
2000-05-05 19:04 ` Michael Hayes
2000-05-11 17:14 ` Jeffrey A Law
0 siblings, 2 replies; 32+ messages in thread
From: Michael Matz @ 2000-05-05 18:21 UTC (permalink / raw)
To: Jeffrey A Law; +Cc: Richard Henderson, gcc
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 3393 bytes --]
Hi,
On Fri, 5 May 2000, Jeffrey A Law wrote:
> > Right now the worklist is filled in natural order, which couldn't be worse
> > for a ´backward' problem (post dominators) ;) Reversing this already helps
> > alot.
> Yea, that was exactly what I was asking. The cost is almost non-existent
> with notable benefits.
But why restrict ourself when we can do better ;) Note that the time for
sorting the worklist in any way (may be even randomly ;)) is by far
outweighted (sp?) by the time it takes to do all the bitvectors
intersections, which are mostly useless, in the sense that some components
of the intersection didn't contribute to the result.
E.g. I measured the number of sbitmap_a_and_b's which didn't result in any
change, and my finding was horrible: of around 5 millions intersections
4.8 million were effectless. But the algorithm must try it anyway, so the
bitvector based approach leads to horrors and slowness :)
> > Filling it in an intelligent way again reduces the blocks which get
> > reevaluated, and additionally checking for changes minimizes that number.
> > I'm now down to only 386 blocks reinserted into the worklist out of ca.
> > 5000 bb in Brads testcase (originally there were around 15000
> > reinsertions).
> Right. This is typical behavior when we're looking at stuff in the opposite
> of the intended order.
>
> How many reinsertions do we get by just reversing the order of insertion
> into the worklist?
Quick rundown of number of reinserted blocks:
8523 with simple reverse
7003 with more intelligent "topological" sort
386 with avoiding insert as much as possible
The last one I implement with before actually reinserting a block I test,
if the block forcing the reinsert has any effect on the BB in question.
That again squeezed some minutes out of that stuff on my slow machine ;)
But it comes with the price that it now is obvious how many intersections
are done without use. With the other cases it is hidden in
intersection_of_succs (and _preds), and only noticable by the fact that it
is painfully slow ;)
> > So may be it's only misnamed? But then again, it's used in
> > flow_loops_find() to find outer loops before inner ones. I don't see right
> > now where a normal DFS would help that, may be I'm too tired :)
> A DFS numbering will detect inner loops first, I haven't worked through
> whether or not a BFS would detect outer loops first or not.
If DFS first detects inner loops (after some thinking I see that too :))
then some comments in flow_loops_find() are wrong (some say, they are
searching for outer first), and flow_depth_first_order_compute() should be
corrected to do what the name promises ;) Or do I miss something obvious?
I mean it works somehow right now, I'm not really sure if I can believe
that such a central function is incorrect and nobody noticed.
> > >From Alstrup/Lauridsen/Thorup ( http://www.diku.dk/research-groups/topps/ ,
> > I think report 320 or so). Do you have any other accessible? I haven't
> > found more (well, I only searched the net).
> Are they based on the Lengauer & Tarjan algorithm? That's the one I hear
> the most about in papers & texts for fast dominator computations.
Indeed they based on work from the god of linear graph algorithms, but
while L&T is O(m*alpha(m,n)) (n=nodes, m=edges, alpha a more than
const-function) the A&L&T one is truly linear O(m+n).
Ciao,
Michael.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: flow_d_f_o_compute misnamed? (was: if-conversion a performance...)
2000-05-05 18:21 ` Michael Matz
@ 2000-05-05 19:04 ` Michael Hayes
2000-05-11 17:14 ` Jeffrey A Law
1 sibling, 0 replies; 32+ messages in thread
From: Michael Hayes @ 2000-05-05 19:04 UTC (permalink / raw)
To: Michael Matz; +Cc: Jeffrey A Law, Richard Henderson, gcc
Michael Matz writes:
> If DFS first detects inner loops (after some thinking I see that too :))
> then some comments in flow_loops_find() are wrong (some say, they are
> searching for outer first), and flow_depth_first_order_compute() should be
> corrected to do what the name promises ;) Or do I miss something obvious?
Even though I wrote this, I cannot remember the details and
unfortunately I do not have my work book with me at present.
It is quite possible that my depth first search algorithm is wrong as
I am naive when it comes to graph theory. I vaguely recall converting
my original recursive algorithm to an iterative algorithm so I could
easily have made a mistake.
> I mean it works somehow right now, I'm not really sure if I can believe
> that such a central function is incorrect and nobody noticed.
Although the DFS algorithm may be wrong, it may have no effect on the
loop finding algorithm. For the latter, it is not essential that all
the outer loops in the CFG are found first but that with nested loops,
an inner loop is found after its containing outer loop. Possibly
my `DFS' algorithm also has this property?
Michael.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: flow_d_f_o_compute misnamed? (was: if-conversion a performance...)
2000-05-05 18:21 ` Michael Matz
2000-05-05 19:04 ` Michael Hayes
@ 2000-05-11 17:14 ` Jeffrey A Law
1 sibling, 0 replies; 32+ messages in thread
From: Jeffrey A Law @ 2000-05-11 17:14 UTC (permalink / raw)
To: Michael Matz; +Cc: Richard Henderson, gcc
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 4833 bytes --]
In message < Pine.SOL.4.10.10005060226560.4463-100000@platon >you write:
> > > Right now the worklist is filled in natural order, which couldn't be wo
> rse
> > > for a ´backward' problem (post dominators) ;) Reversing this already
hel
> ps
> > > alot.
> > Yea, that was exactly what I was asking. The cost is almost non-existent
> > with notable benefits.
>
> But why restrict ourself when we can do better ;)
I'm not necessarily restricting what we can/will do, only trying to
look at cost/benefit of the various possibilities. A solution that
is less efficient in pathological cases, but more efficient in the
general case may be the preferred solution, it depends on a number of
factors -- and we need data to try and find reasonable tradeoff points.
> Note that the time for
> sorting the worklist in any way (may be even randomly ;)) is by far
> outweighted (sp?) by the time it takes to do all the bitvectors
> intersections, which are mostly useless, in the sense that some components
> of the intersection didn't contribute to the result.
>
> E.g. I measured the number of sbitmap_a_and_b's which didn't result in any
> change, and my finding was horrible: of around 5 millions intersections
> 4.8 million were effectless. But the algorithm must try it anyway, so the
> bitvector based approach leads to horrors and slowness :)
Right. The tradeoff is time/space vs code complexity -- choosing the right
tradeoff point is sometimes difficult.
Other approaches using "blocking factors" like in the null pointer check
elimination optimization may significantly improve this situaton. Then
again, it may not. Aside from being more memory efficient, the blocking
code _may_ ultimately result in significantly fewer bitvector ops than the
current implementation. See the null pointer check code in gcse.c.
Converting the rest of gcse to use the blocking code ought to be
reasonably easy and would be greatly appreciated.
> > How many reinsertions do we get by just reversing the order of insertion
> > into the worklist?
>
> Quick rundown of number of reinserted blocks:
> 8523 with simple reverse
> 7003 with more intelligent "topological" sort
> 386 with avoiding insert as much as possible
OK. Assuming the code to avoid reinsertions isn't too expensive and
difficult to understand/maintain, then it probably makes sense. One
way to get a feel for the expense of the code is to time a bootstrap
with and without your change. Presumably the time to bootstrap should
be reduced.
Note you need to do bootstraps anyway for testing purposes :-)
> The last one I implement with before actually reinserting a block I test,
> if the block forcing the reinsert has any effect on the BB in question.
Can you explain this in more detail? I'm presuming that you only insert
a block if something in its bitvector changed? I thought we already did that.
> > A DFS numbering will detect inner loops first, I haven't worked through
> > whether or not a BFS would detect outer loops first or not.
>
> If DFS first detects inner loops (after some thinking I see that too :))
No need to think about it -- just refer to literature and take it as the
gospel :-)
> then some comments in flow_loops_find() are wrong (some say, they are
> searching for outer first), and flow_depth_first_order_compute() should be
> corrected to do what the name promises ;) Or do I miss something obvious?
I don't think you've missed anything. I think the code is wrong. I'm not
sure of an algorithm to find the outermost loops first other than to find the
innermost loops and reverse the loop nest when we're done.
> I mean it works somehow right now, I'm not really sure if I can believe
> that such a central function is incorrect and nobody noticed.
I suspect it "works" in the sense that even if it doesn't find the right loop
nest the results of an incorrect nest are merely less efficient code, not
incorrect code. Regardless it needs to be fixed.
>
> > > >From Alstrup/Lauridsen/Thorup ( http://www.diku.dk/research-groups/topp
> s/,
> > > I think report 320 or so). Do you have any other accessible? I haven't
> > > found more (well, I only searched the net).
> > Are they based on the Lengauer & Tarjan algorithm? That's the one I hear
> > the most about in papers & texts for fast dominator computations.
>
> Indeed they based on work from the god of linear graph algorithms, but
> while L&T is O(m*alpha(m,n)) (n=nodes, m=edges, alpha a more than
> const-function) the A&L&T one is truly linear O(m+n).
I certainly hope you document both the code and the algorithm well. You might
also submit a patch to the readings.html web page to reference the paper you
mentioned above. It would be greatly appreciated.
jeff
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
@ 2000-05-04 22:32 Mike Stump
2000-05-04 22:35 ` Richard Henderson
0 siblings, 1 reply; 32+ messages in thread
From: Mike Stump @ 2000-05-04 22:32 UTC (permalink / raw)
To: lucier, rth; +Cc: gcc, matzmich
> Date: Thu, 4 May 2000 22:16:17 -0700
> From: Richard Henderson <rth@cygnus.com>
> > make -j bootstrap
> Err.. don't do that? Try make -j3 instead.
Unfortunately -j3 gives you not three jobs, but an exponential cascade
of 3 jobs per recursive make level. Yes, I have read the paper
`Recursive make considered harmful'. :-)
I found that -j3 -l10 will at least try and limit them from expanding
too much, which is useful if your swap limited (just how did those 2
emacen grow to be 60M each, and netscape inflate to 90M? :-().
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-04 22:32 if-conversion a performance bottleneck Mike Stump
@ 2000-05-04 22:35 ` Richard Henderson
2000-05-05 6:12 ` Brad Lucier
0 siblings, 1 reply; 32+ messages in thread
From: Richard Henderson @ 2000-05-04 22:35 UTC (permalink / raw)
To: Mike Stump; +Cc: lucier, gcc, matzmich
On Thu, May 04, 2000 at 10:31:59PM -0700, Mike Stump wrote:
> Unfortunately -j3 gives you not three jobs, but an exponential cascade
> of 3 jobs per recursive make level.
I know. But 3 or 15 is still a lot better than 1000, which might
be what you get with just -j.
> I found that -j3 -l10 will at least try and limit them from expanding
> too much, which is useful if your swap limited (just how did those 2
> emacen grow to be 60M each, and netscape inflate to 90M? :-().
That's what gigabytes of RAM are for. ;-)
r~
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-04 22:35 ` Richard Henderson
@ 2000-05-05 6:12 ` Brad Lucier
2000-05-05 10:37 ` Richard Henderson
2000-05-05 13:35 ` Gerald Pfeifer
0 siblings, 2 replies; 32+ messages in thread
From: Brad Lucier @ 2000-05-05 6:12 UTC (permalink / raw)
To: Richard Henderson; +Cc: Mike Stump, lucier, gcc, matzmich
> On Thu, May 04, 2000 at 10:31:59PM -0700, Mike Stump wrote:
> > Unfortunately -j3 gives you not three jobs, but an exponential cascade
> > of 3 jobs per recursive make level.
>
> I know. But 3 or 15 is still a lot better than 1000, which might
> be what you get with just -j.
>
> > I found that -j3 -l10 will at least try and limit them from expanding
> > too much, which is useful if your swap limited (just how did those 2
> > emacen grow to be 60M each, and netscape inflate to 90M? :-().
>
> That's what gigabytes of RAM are for. ;-)
I have gigabytes of RAM; I've watched what happens when I say, e.g.,
make -j9 bootstrap
or
make -j 9 bootstrap
or whatever, and it doesn't seem to actually set off a bunch of parallel
jobs---definitely, by the time the stage1 compiler starts compiling
things, I'm down to one job at a time. That's with the 2.2.13 kernel
on alpha with make 3.77; is this a known problem? Should I upgrade
make?
BTW, the maximum load with make -j is < 80 (without libgcc) and
things seem to stay rational because make gets less time as the
load rises and can't spawn jobs at such a high rate.
Brad
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-05 6:12 ` Brad Lucier
@ 2000-05-05 10:37 ` Richard Henderson
2000-05-05 13:35 ` Gerald Pfeifer
1 sibling, 0 replies; 32+ messages in thread
From: Richard Henderson @ 2000-05-05 10:37 UTC (permalink / raw)
To: Brad Lucier; +Cc: Mike Stump, gcc, matzmich
On Fri, May 05, 2000 at 08:12:34AM -0500, Brad Lucier wrote:
> make -j 9 bootstrap
>
> or whatever, and it doesn't seem to actually set off a bunch of parallel
> jobs--
make MAKE='make -j4' bootstrap -- the -j doesn't normally get
passed down to submakes.
> Should I upgrade make?
Apparently both of us should. I'm interested in this "build server"
thingy...
r~
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
2000-05-05 6:12 ` Brad Lucier
2000-05-05 10:37 ` Richard Henderson
@ 2000-05-05 13:35 ` Gerald Pfeifer
1 sibling, 0 replies; 32+ messages in thread
From: Gerald Pfeifer @ 2000-05-05 13:35 UTC (permalink / raw)
To: Brad Lucier; +Cc: Richard Henderson, Mike Stump, gcc, matzmich
On Fri, 5 May 2000, Brad Lucier wrote:
> make -j 9 bootstrap
>
> or whatever, and it doesn't seem to actually set off a bunch of parallel
> jobs---definitely, by the time the stage1 compiler starts compiling
> things, I'm down to one job at a time. That's with the 2.2.13 kernel
> on alpha with make 3.77; is this a known problem? Should I upgrade
> make?
Please do so and let us know the results!
It would be very useful to document properly, how to use parallel
makes for GCC -- perhaps you could come up with a patch for our
documentation at?
http://gcc.gnu.org/install/build.html
Gerald
--
Gerald "Jerry" pfeifer@dbai.tuwien.ac.at http://www.dbai.tuwien.ac.at/~pfeifer/
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
@ 2000-05-04 23:28 Mathias Froehlich
0 siblings, 0 replies; 32+ messages in thread
From: Mathias Froehlich @ 2000-05-04 23:28 UTC (permalink / raw)
To: gcc
> On Thu, May 04, 2000 at 10:31:59PM -0700, Mike Stump wrote:
> > Unfortunately -j3 gives you not three jobs, but an exponential cascade
> > of 3 jobs per recursive make level.
> I know. But 3 or 15 is still a lot better than 1000, which might
> be what you get with just -j.
> > I found that -j3 -l10 will at least try and limit them from expanding
> > too much, which is useful if your swap limited (just how did those 2
> > emacen grow to be 60M each, and netscape inflate to 90M? :-().
> That's what gigabytes of RAM are for. ;-)
Or use a recent version of gmake (>=3.78.1), which implements a so
called "jobserver mode". This limits ther _overall_ number of working
make jobs to the number given by -j. This does also work for recursive
makes.
Try it out it works great ...
Regards,
Mathias Froehlich
--
Mathias Fr"ohlich e-mail: frohlich@na.uni-tuebingen.de
Institut f"ur Mathematik, Universit"at T"ubingen, D-72076 T"ubingen
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: if-conversion a performance bottleneck
@ 2000-05-05 13:46 Brad Lucier
0 siblings, 0 replies; 32+ messages in thread
From: Brad Lucier @ 2000-05-05 13:46 UTC (permalink / raw)
To: lucier, pfeifer; +Cc: rth, mrs, gcc, matzmich
> On Fri, 5 May 2000, Brad Lucier wrote:
> > make -j 9 bootstrap
> >
> > or whatever, and it doesn't seem to actually set off a bunch of parallel
> > jobs---definitely, by the time the stage1 compiler starts compiling
> > things, I'm down to one job at a time. That's with the 2.2.13 kernel
> > on alpha with make 3.77; is this a known problem? Should I upgrade
> > make?
>
> Please do so and let us know the results!
>
> It would be very useful to document properly, how to use parallel
> makes for GCC -- perhaps you could come up with a patch for our
> documentation at?
>
> http://gcc.gnu.org/install/build.html
After upgrading make from version 3.77 to version 3.79,
make -j 9 bootstrap
works as I think it should---it consistently has about 9 processes
running in various subdirectories throughout the build process.
Brad
^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2000-05-11 17:14 UTC | newest]
Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-05-01 10:54 if-conversion a performance bottleneck Brad Lucier
2000-05-01 15:29 ` Richard Henderson
2000-05-01 18:01 ` Brad Lucier
2000-05-02 5:45 ` Michael Matz
2000-05-02 22:58 ` Michael Matz
2000-05-03 5:50 ` Brad Lucier
2000-05-03 20:05 ` Brad Lucier
2000-05-04 11:46 ` Michael Matz
2000-05-04 19:39 ` Brad Lucier
2000-05-04 22:16 ` Richard Henderson
2000-05-10 8:30 ` Andreas Schwab
2000-05-10 9:36 ` Joe Buck
2000-05-10 9:52 ` Jeffrey A Law
2000-05-10 10:49 ` Joe Buck
2000-05-10 14:17 ` Joern Rennecke
2000-05-10 14:24 ` GNU make options (was Re: if-conversion ...) Joe Buck
2000-05-04 11:55 ` if-conversion a performance bottleneck Richard Henderson
2000-05-05 12:01 ` Jeffrey A Law
2000-05-05 14:32 ` flow_d_f_o_compute misnamed? (was: if-conversion a performance...) Michael Matz
2000-05-05 14:53 ` Jeffrey A Law
2000-05-05 16:10 ` Michael Matz
2000-05-05 17:05 ` Jeffrey A Law
2000-05-05 18:21 ` Michael Matz
2000-05-05 19:04 ` Michael Hayes
2000-05-11 17:14 ` Jeffrey A Law
2000-05-04 22:32 if-conversion a performance bottleneck Mike Stump
2000-05-04 22:35 ` Richard Henderson
2000-05-05 6:12 ` Brad Lucier
2000-05-05 10:37 ` Richard Henderson
2000-05-05 13:35 ` Gerald Pfeifer
2000-05-04 23:28 Mathias Froehlich
2000-05-05 13:46 Brad Lucier
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).