if-conversion a performance bottleneck

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* if-conversion a performance bottleneck
@ 2000-05-01 10:54 Brad Lucier
  2000-05-01 15:29 ` Richard Henderson
  0 siblings, 1 reply; 32+ messages in thread
From: Brad Lucier @ 2000-05-01 10:54 UTC (permalink / raw)
  To: gcc; +Cc: lucier

The patch

http://gcc.gnu.org/ml/gcc-patches/2000-04/msg00248.html

killed one of the biggest time hogs in gcc, but now Richard
Henderson's if-conversion pass has affected performance adversely.
Perhaps with the calls to verify_flow_info, this is a temporary effect,
I don't know.

On this file

http://www.math.purdue.edu/~lucier/_t-c-2.i.gz

called with these options

/export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -fno-math-errno -mieee -g -mcpu=ev6 -O1 _t-c-2.i > & cc1.out &

with this compiler

popov-41% gcc -v
Reading specs from /export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/specs
gcc version 2.96 20000430 (experimental)

I get the following performance on a 500 MHz alpha 21264:

 __copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___t_2d_c_2d_2 {GC 47533k -> 10293k} {GC 15386k -> 10788k} {GC 14832k -> 11255k} ___init_proc {GC 24248k -> 2107k} ____20___t_2d_c_2d_2
Execution times (seconds)
 garbage collection    :   0.75 ( 0%) usr   0.00 ( 0%) sys   0.75 ( 0%) wall
 parser                :   6.55 ( 0%) usr   0.18 (15%) sys   6.73 ( 0%) wall
 varconst              :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
 integration           :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 jump                  :   6.81 ( 0%) usr   0.40 (32%) sys   7.21 ( 0%) wall
 CSE                   :   2.29 ( 0%) usr   0.00 ( 0%) sys   2.29 ( 0%) wall
 global CSE            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 loop analysis         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 CSE 2                 :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 flow analysis         :  61.00 ( 2%) usr   0.07 ( 6%) sys  61.05 ( 2%) wall
 combiner              :   2.82 ( 0%) usr   0.00 ( 0%) sys   2.82 ( 0%) wall
 if-conversion         :2306.07 (64%) usr   0.44 (35%) sys2306.74 (64%) wall
 local alloc           :   1.31 ( 0%) usr   0.00 ( 0%) sys   1.31 ( 0%) wall
 global alloc          :   2.42 ( 0%) usr   0.04 ( 4%) sys   2.46 ( 0%) wall
 reload CSE regs       :   3.62 ( 0%) usr   0.00 ( 0%) sys   3.62 ( 0%) wall
 flow 2                :  30.20 ( 1%) usr   0.00 ( 0%) sys  30.20 ( 1%) wall
 if-conversion 2       :1196.27 (33%) usr   0.09 ( 7%) sys1196.01 (33%) wall
 shorten branches      :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.13 ( 0%) wall
 final                 :   2.69 ( 0%) usr   0.00 ( 1%) sys   2.70 ( 0%) wall
 symout                :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall
 rest of compilation   :   1.06 ( 0%) usr   0.00 ( 0%) sys   1.06 ( 0%) wall
 TOTAL                 :3624.12             1.27          3625.23

Some of the details are:

Flat profile:

Each sample counts as 0.000976562 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 49.50    192.40   192.40        9 21378.26 21381.98  verify_flow_info
 28.10    301.64   109.24    40604     2.69     2.69  sbitmap_intersection_of_succs
  6.75    327.90    26.25    15025     1.75     1.75  sbitmap_intersection_of_preds
  5.38    348.79    20.89    20981     1.00     1.00  for_each_successor_phi
  1.87    356.06     7.27 25079015     0.00     0.00  bitmap_operation
...
-----------------------------------------------
                0.00  288.44       9/9           rest_of_compilation [5]
[6]     74.2    0.00  288.44       9         if_convert [6]
              192.40    0.03       9/9           verify_flow_info [7]
                2.17   90.46       6/9           compute_flow_dominators [8]
                0.00    3.27       1/10          update_life_info [11]
                0.00    0.03       9/37          free_basic_block_vars [99]
                0.03    0.00       9/18          compute_bb_for_insn [177]
                0.00    0.02   15384/15384       find_if_header [331]
                0.01    0.00       6/21          sbitmap_vector_alloc [222]
                0.00    0.00       1/25          allocate_reg_info [536]
                0.00    0.00       2/64864       max_reg_num [371]
                0.00    0.00       9/5201        get_max_uid [1093]
                0.00    0.00       1/974         sbitmap_alloc [1124]
                0.00    0.00       1/31743       sbitmap_zero [1070]
                0.00    0.00       1/1           count_or_remove_death_notes [1414]
-----------------------------------------------
              192.40    0.03       9/9           if_convert [6]
[7]     49.5  192.40    0.03       9         verify_flow_info [7]
                0.00    0.03   15008/26961       returnjump_p [183]
                0.00    0.00       3/87970       condjump_p [388]
                0.00    0.00       9/5201        get_max_uid [1093]
                0.00    0.00       9/231         get_insns [1157]
-----------------------------------------------
                1.09   45.23       3/9           flow_loops_find [10]
                2.17   90.46       6/9           if_convert [6]
[8]     35.7    3.26  135.69       9         compute_flow_dominators [8]
              109.24    0.01   40604/40604       sbitmap_intersection_of_succs [9]
               26.25    0.00   15025/15025       sbitmap_intersection_of_preds [13]
                0.17    0.00   55638/55638       sbitmap_a_and_b [88]
                0.02    0.00       9/21          sbitmap_vector_alloc [222]
                0.00    0.00       9/9           sbitmap_vector_ones [688]
                0.00    0.00       9/12          sbitmap_vector_zero [788]
                0.00    0.00       9/31743       sbitmap_zero [1070]
-----------------------------------------------
              109.24    0.01   40604/40604       compute_flow_dominators [8]
[9]     28.1  109.24    0.01   40604         sbitmap_intersection_of_succs [9]
                0.01    0.00   40604/55629       sbitmap_copy [396]
-----------------------------------------------


Brad Lucier

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-01 10:54 if-conversion a performance bottleneck Brad Lucier
@ 2000-05-01 15:29 ` Richard Henderson
  2000-05-01 18:01   ` Brad Lucier
  0 siblings, 1 reply; 32+ messages in thread
From: Richard Henderson @ 2000-05-01 15:29 UTC (permalink / raw)
  To: Brad Lucier; +Cc: gcc

On Mon, May 01, 2000 at 12:54:35PM -0500, Brad Lucier wrote:
> Perhaps with the calls to verify_flow_info, this is a temporary effect,
> I don't know.

Oops.  That call was supposed to be ifdef ENABLE_CHECKING.
With that fixed, are compilation times mostly sane?



r~

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-01 15:29 ` Richard Henderson
@ 2000-05-01 18:01   ` Brad Lucier
  2000-05-02  5:45     ` Michael Matz
  0 siblings, 1 reply; 32+ messages in thread
From: Brad Lucier @ 2000-05-01 18:01 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Brad Lucier, gcc

> Oops.  That call was supposed to be ifdef ENABLE_CHECKING.
> With that fixed, are compilation times mostly sane?
> r~

Yes, that helped a lot :-), but if-conversion still takes almost half
the CPU time to compile that file:

popov-211% /export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O1 -mcpu=ev6 -fno-math-errno -mieee _t-c-2.i
 __copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___t_2d_c_2d_2 {GC 46791k -> 9596k} {GC 14695k -> 10090k} {GC 14135k -> 10558k} ___init_proc {GC 23543k -> 2096k} ____20___t_2d_c_2d_2
Execution times (seconds)
 garbage collection    :   0.70 ( 0%) usr   0.00 ( 0%) sys   0.70 ( 0%) wall
 parser                :   6.25 ( 3%) usr   0.18 (22%) sys   6.43 ( 3%) wall
 varconst              :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
 integration           :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 jump                  :   6.50 ( 3%) usr   0.38 (48%) sys   6.88 ( 3%) wall
 CSE                   :   2.28 ( 1%) usr   0.00 ( 0%) sys   2.28 ( 1%) wall
 global CSE            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 loop analysis         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 CSE 2                 :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 flow analysis         :  58.86 (25%) usr   0.11 (14%) sys  58.99 (25%) wall
 combiner              :   2.87 ( 1%) usr   0.00 ( 0%) sys   2.87 ( 1%) wall
 if-conversion         :  59.53 (26%) usr   0.02 ( 3%) sys  59.59 (26%) wall
 local alloc           :   1.24 ( 1%) usr   0.00 ( 0%) sys   1.24 ( 1%) wall
 global alloc          :   2.37 ( 1%) usr   0.05 ( 6%) sys   2.42 ( 1%) wall
 reload CSE regs       :   3.88 ( 2%) usr   0.00 ( 0%) sys   3.88 ( 2%) wall
 flow 2                :  27.86 (12%) usr   0.00 ( 0%) sys  27.86 (12%) wall
 if-conversion 2       :  55.02 (24%) usr   0.01 ( 2%) sys  55.02 (24%) wall
 shorten branches      :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall
 final                 :   2.77 ( 1%) usr   0.01 ( 1%) sys   2.78 ( 1%) wall
 symout                :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 rest of compilation   :   1.02 ( 0%) usr   0.00 ( 0%) sys   1.02 ( 0%) wall
 TOTAL                 : 231.37             0.81           232.20

Flat profile:

Each sample counts as 0.000976562 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 57.51    111.67   111.67    40604     2.75     2.75  sbitmap_intersection_of_succs
 12.83    136.59    24.92    15025     1.66     1.66  sbitmap_intersection_of_preds
  9.58    155.19    18.60    20981     0.89     0.89  for_each_successor_phi
  3.78    162.54     7.35 25079015     0.00     0.00  bitmap_operation
  1.99    166.40     3.86        6   643.72  5013.82  calculate_global_regs_live
  1.71    169.73     3.33       18   184.84   184.84  mark_critical_edges
  1.57    172.78     3.05        9   338.87 15538.80  compute_flow_dominators
  1.49    175.67     2.89        3   963.87 16982.59  flow_loops_find
...
-----------------------------------------------
                1.02   45.60       3/9           flow_loops_find [9]
                2.03   91.20       6/9           if_convert [8]
[6]     72.0    3.05  136.80       9         compute_flow_dominators [6]
              111.67    0.00   40604/40604       sbitmap_intersection_of_succs [7]
               24.92    0.00   15025/15025       sbitmap_intersection_of_preds [12]
                0.18    0.00   55638/55638       sbitmap_a_and_b [77]
                0.02    0.00       9/21          sbitmap_vector_alloc [217]
                0.00    0.00       9/9           sbitmap_vector_ones [744]
                0.00    0.00       9/12          sbitmap_vector_zero [758]
                0.00    0.00       9/31743       sbitmap_zero [752]
-----------------------------------------------
              111.67    0.00   40604/40604       compute_flow_dominators [6]
[7]     57.5  111.67    0.00   40604         sbitmap_intersection_of_succs [7]
                0.00    0.00   40604/55629       sbitmap_copy [470]
-----------------------------------------------
                0.00   96.36       9/9           rest_of_compilation [5]
[8]     49.6    0.00   96.36       9         if_convert [8]
                2.03   91.20       6/9           compute_flow_dominators [6]
                0.00    3.05       1/10          update_life_info [10]
                0.03    0.00       9/18          compute_bb_for_insn [190]
                0.00    0.02       9/37          free_basic_block_vars [135]
                0.00    0.02   15384/15384       find_if_header [297]
                0.01    0.00       6/21          sbitmap_vector_alloc [217]
                0.00    0.00       1/25          allocate_reg_info [481]
                0.00    0.00       9/5192        get_max_uid [727]
                0.00    0.00       2/64864       max_reg_num [454]
                0.00    0.00       1/31743       sbitmap_zero [752]
                0.00    0.00       1/974         sbitmap_alloc [1076]
                0.00    0.00       1/1           count_or_remove_death_notes [1385]
-----------------------------------------------
                2.89   48.06       3/3           rest_of_compilation [5]
[9]     26.2    2.89   48.06       3         flow_loops_find [9]
                1.02   45.60       3/9           compute_flow_dominators [6]
                0.71    0.00     964/964         flow_loop_exits_find [31]
                0.71    0.00       1/1           flow_depth_first_order_compute [32]
                0.01    0.00       3/21          sbitmap_vector_alloc [217]
                0.00    0.00       3/3           flow_loops_tree_build [604]
                0.00    0.00     964/964         flow_loop_nodes_find [698]
                0.00    0.00     964/964         sbitmap_first_set_bit [739]
                0.00    0.00     964/964         flow_loop_pre_header_find [738]
                0.00    0.00       2/31743       sbitmap_zero [752]
                0.00    0.00     966/974         sbitmap_alloc [1076]
                0.00    0.00     964/964         sbitmap_last_set_bit [1078]
                0.00    0.00       3/3           flow_loops_level_compute [1331]
-----------------------------------------------
                0.00    3.05       1/10          if_convert [8]
                0.00    9.14       3/10          rest_of_compilation [5]
                0.00   18.28       6/10          life_analysis [14]
[10]    15.7    0.00   30.47      10         update_life_info [10]
                3.86   26.22       6/6           calculate_global_regs_live [11]
                0.02    0.35   15374/27234       propagate_block [34]
                0.00    0.00   15374/27239       free_propagate_block_info [389]
                0.01    0.00   15374/72751       bitmap_copy [262]
                0.00    0.00    5125/5125        verify_local_live_at_start [573]
                0.00    0.00      10/257366      bitmap_clear [283]
                0.00    0.00      10/124375      bitmap_initialize [407]
-----------------------------------------------
                3.86   26.22       6/6           update_life_info [10]
[11]    15.5    3.86   26.22       6         calculate_global_regs_live [11]
               18.60    0.00   20981/20981       for_each_successor_phi [13]
                7.30    0.00 24929375/25079015     bitmap_operation [15]
                0.01    0.27   11855/27234       propagate_block [34]
                0.01    0.00   24210/72751       bitmap_copy [262]
                0.00    0.00   52694/257366      bitmap_clear [283]
                0.00    0.00   11855/27239       free_propagate_block_info [389]
                0.00    0.00   11855/11855       bitmap_equal_p [525]
                0.00    0.00   20981/488036      bitmap_set_bit [167]
                0.00    0.00   10261/124375      bitmap_initialize [407]
-----------------------------------------------
               24.92    0.00   15025/15025       compute_flow_dominators [6]
[12]    12.8   24.92    0.00   15025         sbitmap_intersection_of_preds [12]
                0.00    0.00   15025/55629       sbitmap_copy [470]
-----------------------------------------------
               18.60    0.00   20981/20981       calculate_global_regs_live [11]
[13]     9.6   18.60    0.00   20981         for_each_successor_phi [13]
-----------------------------------------------

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-01 18:01   ` Brad Lucier
@ 2000-05-02  5:45     ` Michael Matz
  2000-05-02 22:58       ` Michael Matz
  0 siblings, 1 reply; 32+ messages in thread
From: Michael Matz @ 2000-05-02  5:45 UTC (permalink / raw)
  To: Brad Lucier; +Cc: Richard Henderson, gcc

On Mon, 1 May 2000, Brad Lucier wrote:
> Each sample counts as 0.000976562 seconds.
>   %   cumulative   self              self     total           
>  time   seconds   seconds    calls  ms/call  ms/call  name    
>  57.51 111.67 111.67 40604 2.75 2.75 sbitmap_intersection_of_succs
>  12.83 136.59 24.92  15025 1.66 1.66 sbitmap_intersection_of_preds

I once had faster versions of these two functions, if I get home I'll see
if they make any difference on your input data.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-02  5:45     ` Michael Matz
@ 2000-05-02 22:58       ` Michael Matz
  2000-05-03  5:50         ` Brad Lucier
                           ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Michael Matz @ 2000-05-02 22:58 UTC (permalink / raw)
  To: Brad Lucier; +Cc: Richard Henderson, gcc

Hi,

On Tue, 2 May 2000, Michael Matz wrote:
> >  time   seconds   seconds    calls  ms/call  ms/call  name    
> >  57.51 111.67 111.67 40604 2.75 2.75 sbitmap_intersection_of_succs
> >  12.83 136.59 24.92  15025 1.66 1.66 sbitmap_intersection_of_preds
> 
> I once had faster versions of these two functions, if I get home I'll see
> if they make any difference on your input data.

Before fiddling with sbitmap_intersection_of_xx() I first reworked
compute_flow_dominators() to also behave normally in calculating the
post_doms. At least the order of work-queue initialization was wrong (in
post_dom the changes are propagating from the _end_). I then also
implemented a poor man's topological sort for cyclic graphs ;), which
again gave a better performance. (I also did this once for doms, but there
it didn't make a great difference on Brads test cases)

Please try the attached diff (against actual CVS) if they make also a
difference for you ;)

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-02 22:58       ` Michael Matz
@ 2000-05-03  5:50         ` Brad Lucier
  2000-05-03 20:05         ` Brad Lucier
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Brad Lucier @ 2000-05-03  5:50 UTC (permalink / raw)
  To: Michael Matz; +Cc: Brad Lucier, Richard Henderson, gcc

> Please try the attached diff (against actual CVS) if they make also a
> difference for you ;)

The diff didn't apply cleanly, I don't know why, so I applied it by hand.
I didn't have time this morning to install a new profiled version, so
I ran make bootstrap with the usual BOOT_CFLAGS.

Your patch did help.  Here is a comparison between the a profiled version
without your patch and an unprofiled version with your patch:

popov-726% /export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O1 -mcpu=ev6 -fno-math-errno -mieee _t-c-2.i
 __copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___t_2d_c_2d_2 {GC 46791k -> 9596k} {GC 14695k -> 10090k} {GC 14135k -> 10558k} ___init_proc {GC 23543k -> 2096k} ____20___t_2d_c_2d_2
Execution times (seconds)
 garbage collection    :   0.71 ( 0%) usr   0.00 ( 1%) sys   0.71 ( 0%) wall
 parser                :   6.13 ( 3%) usr   0.19 (25%) sys   6.32 ( 3%) wall
 varconst              :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall
 integration           :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 jump                  :   6.45 ( 3%) usr   0.39 (51%) sys   6.84 ( 3%) wall
 CSE                   :   2.20 ( 1%) usr   0.00 ( 1%) sys   2.20 ( 1%) wall
 global CSE            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 loop analysis         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 CSE 2                 :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 flow analysis         :  63.17 (27%) usr   0.06 ( 8%) sys  63.21 (27%) wall
 combiner              :   2.81 ( 1%) usr   0.00 ( 0%) sys   2.81 ( 1%) wall
 if-conversion         :  56.39 (24%) usr   0.01 ( 2%) sys  56.39 (24%) wall
 local alloc           :   1.24 ( 1%) usr   0.00 ( 0%) sys   1.24 ( 1%) wall
 global alloc          :   2.35 ( 1%) usr   0.05 ( 8%) sys   2.41 ( 1%) wall
 reload CSE regs       :   3.68 ( 2%) usr   0.00 ( 0%) sys   3.68 ( 2%) wall
 flow 2                :  29.86 (13%) usr   0.00 ( 1%) sys  29.86 (13%) wall
 if-conversion 2       :  58.60 (25%) usr   0.01 ( 2%) sys  58.60 (25%) wall
 shorten branches      :   0.11 ( 0%) usr   0.00 ( 0%) sys   0.11 ( 0%) wall
 final                 :   2.68 ( 1%) usr   0.00 ( 1%) sys   2.70 ( 1%) wall
 symout                :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 rest of compilation   :   1.02 ( 0%) usr   0.00 ( 0%) sys   1.02 ( 0%) wall
 TOTAL                 : 237.52             0.77           238.47
popov-727% gcc -v
Reading specs from /export/u10/gcc-2.95.1/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.95.1/specs
gcc version 2.95.1 19990816 (release)
popov-728% /export/u10/egcs-test/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O1 -mcpu=ev6 -fno-math-errno -mieee _t-c-2.i
 __copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___t_2d_c_2d_2 {GC 46791k -> 9596k} {GC 14695k -> 10090k} {GC 14135k -> 10558k} ___init_proc {GC 23543k -> 2096k} ____20___t_2d_c_2d_2
Execution times (seconds)
 garbage collection    :   0.32 ( 0%) usr   0.00 ( 1%) sys   0.32 ( 0%) wall
 parser                :   2.92 ( 2%) usr   0.18 (25%) sys   3.10 ( 2%) wall
 varconst              :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
 integration           :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 jump                  :   4.56 ( 3%) usr   0.38 (52%) sys   4.94 ( 3%) wall
 CSE                   :   0.81 ( 1%) usr   0.00 ( 0%) sys   0.81 ( 1%) wall
 global CSE            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 loop analysis         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 CSE 2                 :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 flow analysis         :  51.69 (32%) usr   0.06 ( 8%) sys  51.74 (32%) wall
 combiner              :   0.99 ( 1%) usr   0.00 ( 0%) sys   0.99 ( 1%) wall
 if-conversion         :  33.69 (21%) usr   0.01 ( 2%) sys  33.70 (21%) wall
 local alloc           :   0.54 ( 0%) usr   0.00 ( 0%) sys   0.54 ( 0%) wall
 global alloc          :   1.11 ( 1%) usr   0.05 ( 7%) sys   1.16 ( 1%) wall
 reload CSE regs       :   3.15 ( 2%) usr   0.00 ( 1%) sys   3.15 ( 2%) wall
 flow 2                :  23.85 (15%) usr   0.00 ( 0%) sys  23.84 (15%) wall
 if-conversion 2       :  33.45 (21%) usr   0.01 ( 2%) sys  33.45 (21%) wall
 shorten branches      :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall
 final                 :   2.46 ( 2%) usr   0.00 ( 0%) sys   2.46 ( 2%) wall
 symout                :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 rest of compilation   :   0.43 ( 0%) usr   0.00 ( 0%) sys   0.43 ( 0%) wall
 TOTAL                 : 160.09             0.73           160.78

I can get more information this afternoon (this evening in Germany :-).

Brad

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-02 22:58       ` Michael Matz
  2000-05-03  5:50         ` Brad Lucier
@ 2000-05-03 20:05         ` Brad Lucier
  2000-05-04 11:46           ` Michael Matz
  2000-05-04 11:55         ` if-conversion a performance bottleneck Richard Henderson
  2000-05-05 12:01         ` Jeffrey A Law
  3 siblings, 1 reply; 32+ messages in thread
From: Brad Lucier @ 2000-05-03 20:05 UTC (permalink / raw)
  To: Michael Matz; +Cc: Brad Lucier, Richard Henderson, gcc

> Please try the attached diff (against actual CVS) if they make also a
> difference for you ;)

Your changes to flow.c have cut the number of calls to 
sbitmap_intersection_of_succs from 40604 to 24225, so they are definitely
worthwhile.  Bootstrapped on alphaev6-unknown-linux-gnu.

Brad

Here are the new statistics with your changes:
Execution times (seconds)
 garbage collection    :   1.30 ( 1%) usr   0.00 ( 0%) sys   1.30 ( 1%) wall
 parser                :   6.45 ( 3%) usr   0.19 (15%) sys   6.64 ( 3%) wall
 varconst              :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
 integration           :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 jump                  :  20.06 ( 9%) usr   0.84 (66%) sys  20.90 ( 9%) wall
 CSE                   :   2.77 ( 1%) usr   0.00 ( 0%) sys   2.77 ( 1%) wall
 global CSE            :   5.51 ( 2%) usr   0.01 ( 1%) sys   5.52 ( 2%) wall
 loop analysis         :   0.24 ( 0%) usr   0.00 ( 0%) sys   0.24 ( 0%) wall
 CSE 2                 :   2.29 ( 1%) usr   0.00 ( 0%) sys   2.29 ( 1%) wall
 flow analysis         :  52.11 (23%) usr   0.06 ( 5%) sys  52.16 (23%) wall
 combiner              :   2.80 ( 1%) usr   0.00 ( 0%) sys   2.80 ( 1%) wall
 if-conversion         :  44.42 (20%) usr   0.02 ( 2%) sys  44.43 (20%) wall
 regmove               :   0.50 ( 0%) usr   0.00 ( 0%) sys   0.50 ( 0%) wall
 scheduling            :   6.26 ( 3%) usr   0.01 ( 1%) sys   6.27 ( 3%) wall
 local alloc           :   1.49 ( 1%) usr   0.00 ( 0%) sys   1.49 ( 1%) wall
 global alloc          :   2.78 ( 1%) usr   0.07 ( 6%) sys   2.86 ( 1%) wall
 reload CSE regs       :   7.92 ( 4%) usr   0.01 ( 1%) sys   7.93 ( 3%) wall
 flow 2                :  19.36 ( 9%) usr   0.00 ( 0%) sys  19.35 ( 9%) wall
 if-conversion 2       :  36.38 (16%) usr   0.01 ( 1%) sys  36.38 (16%) wall
 peephole 2            :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall
 schedulding 2         :   8.88 ( 4%) usr   0.00 ( 0%) sys   8.88 ( 4%) wall
 shorten branches      :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall
 final                 :   3.22 ( 1%) usr   0.00 ( 0%) sys   3.22 ( 1%) wall
 symout                :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall
 rest of compilation   :   1.07 ( 0%) usr   0.00 ( 0%) sys   1.07 ( 0%) wall
 TOTAL                 : 226.10             1.27           227.32


Flat profile:

Each sample counts as 0.000976562 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 42.54     69.80    69.80    24225     2.88     2.88  sbitmap_intersection_of_succs
 17.24     98.09    28.29    15025     1.88     1.88  sbitmap_intersection_of_preds
  5.03    106.34     8.25       42   196.36   196.36  mark_critical_edges
  3.98    112.86     6.52 25085231     0.00     0.00  bitmap_operation
  3.56    118.70     5.84    10248     0.57     0.62  compute_block_backward_dependences
  3.44    124.35     5.65        9   627.60 11546.70  compute_flow_dominators
  2.37    128.24     3.89       21   185.36   185.93  delete_unreachable_blocks
  2.31    132.03     3.79        6   631.02  1760.19  calculate_global_regs_live
...
-----------------------------------------------
                1.88   32.76       3/9           flow_loops_find [9]
                3.77   65.51       6/9           if_convert [8]
[6]     63.3    5.65   98.27       9         compute_flow_dominators [6]
               69.80    0.00   24225/24225       sbitmap_intersection_of_succs [7]
               28.29    0.00   15025/15025       sbitmap_intersection_of_preds [10]
                0.16    0.00   39259/39259       sbitmap_a_and_b [135]
                0.02    0.00       9/27          sbitmap_vector_alloc [255]
                0.00    0.00       9/18          sbitmap_vector_zero [771]
                0.00    0.00       9/74716       sbitmap_zero [725]
                0.00    0.00       9/9           sbitmap_vector_ones [1365]
-----------------------------------------------
               69.80    0.00   24225/24225       compute_flow_dominators [6]
[7]     42.5   69.80    0.00   24225         sbitmap_intersection_of_succs [7]
                0.00    0.00   24225/39250       sbitmap_copy [612]
-----------------------------------------------
                0.00   69.40      12/12          rest_of_compilation [5]
[8]     42.3    0.00   69.40      12         if_convert [8]
                3.77   65.51       6/9           compute_flow_dominators [6]
                0.00    0.06      12/43          free_basic_block_vars [112]
                0.03    0.00      12/48          compute_bb_for_insn [147]
                0.00    0.01   20509/20509       find_if_header [391]
                0.01    0.00       6/27          sbitmap_vector_alloc [255]
                0.00    0.00       1/10258       update_life_info [12]
                0.00    0.00       1/37          allocate_reg_info [540]
                0.00    0.00       1/20500       count_or_remove_death_notes [36]
                0.00    0.00       1/995         sbitmap_alloc [679]
                0.00    0.00       2/145008      max_reg_num [374]
                0.00    0.00       1/74716       sbitmap_zero [725]
                0.00    0.00      12/5225        get_max_uid [1134]
-----------------------------------------------
                2.96   36.17       3/3           rest_of_compilation [5]
[9]     23.8    2.96   36.17       3         flow_loops_find [9]
                1.88   32.76       3/9           compute_flow_dominators [6]
                0.76    0.00     964/964         flow_loop_exits_find [38]
                0.75    0.00       1/1           flow_depth_first_order_compute [39]
                0.01    0.00       3/27          sbitmap_vector_alloc [255]
                0.00    0.00     964/964         flow_loop_pre_header_find [627]
                0.00    0.00     966/995         sbitmap_alloc [679]
                0.00    0.00       3/3           flow_loops_tree_build [747]
                0.00    0.00     964/964         flow_loop_nodes_find [791]
                0.00    0.00     964/964         sbitmap_last_set_bit [834]
                0.00    0.00       2/74716       sbitmap_zero [725]
                0.00    0.00     964/964         sbitmap_first_set_bit [1174]
                0.00    0.00       3/3           flow_loops_level_compute [1435]
-----------------------------------------------
               28.29    0.00   15025/15025       compute_flow_dominators [6]
[10]    17.2   28.29    0.00   15025         sbitmap_intersection_of_preds [10]
                0.00    0.00   15025/39250       sbitmap_copy [612]
-----------------------------------------------

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-03 20:05         ` Brad Lucier
@ 2000-05-04 11:46           ` Michael Matz
  2000-05-04 19:39             ` Brad Lucier
  0 siblings, 1 reply; 32+ messages in thread
From: Michael Matz @ 2000-05-04 11:46 UTC (permalink / raw)
  To: Brad Lucier; +Cc: Richard Henderson, gcc

Hi,

On Wed, 3 May 2000, Brad Lucier wrote:
> Your changes to flow.c have cut the number of calls to 
> sbitmap_intersection_of_succs from 40604 to 24225, so they are definitely
> worthwhile.

Yes, that was the purpose of the patch ;)
Anyway, I saw some issues of that patch with one of your other test files,
where the runtime exploded, and on my HDD at home I have another version
which even more reduces the time. If I find time, I'll rewrite the sbitmap
routines for large homogeneous bitmaps, clean up all that stuff and see if
anything comes out of it ;)

Btw. do you have also slightly smaller test cases (a quarter or so)? At
home I only have a slow machine and waiting 10 minutes everytime to see,
if one change was worthwhile is, erhh... slow.

>  rest of compilation   :   1.07 ( 0%) usr   0.00 ( 0%) sys   1.07 ( 0%) wall
>  TOTAL                 : 226.10             1.27           227.32

God! For me that was ca. 1110 seconds ;) (Now I'm down to 580)

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-04 11:46           ` Michael Matz
@ 2000-05-04 19:39             ` Brad Lucier
  2000-05-04 22:16               ` Richard Henderson
  0 siblings, 1 reply; 32+ messages in thread
From: Brad Lucier @ 2000-05-04 19:39 UTC (permalink / raw)
  To: Michael Matz; +Cc: Brad Lucier, Richard Henderson, gcc

> Btw. do you have also slightly smaller test cases (a quarter or so)? At
> home I only have a slow machine and waiting 10 minutes everytime to see,
> if one change was worthwhile is, erhh... slow.

Try

http://www.math.purdue.edu/~lucier/_t-c-1.i.gz

I get

popov-1089% /export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O1 -mcpu=ev6 -fno-math-errno -mieee _t-c-1.i
...
 TOTAL                 :  31.78             0.36            32.13

BTW, with the introduction of libgcc1, I can no longer

make -j bootstrap

on my Linux box---I run out of processes with the standard value of
NR_TASKS=512!  I install a new kernel tomorrow with NR_TASKS=4096.

Brad

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-04 19:39             ` Brad Lucier
@ 2000-05-04 22:16               ` Richard Henderson
  2000-05-10  8:30                 ` Andreas Schwab
  0 siblings, 1 reply; 32+ messages in thread
From: Richard Henderson @ 2000-05-04 22:16 UTC (permalink / raw)
  To: Brad Lucier; +Cc: Michael Matz, gcc

On Thu, May 04, 2000 at 09:39:36PM -0500, Brad Lucier wrote:
> BTW, with the introduction of libgcc1, I can no longer
> 
> make -j bootstrap
> 
> on my Linux box---I run out of processes with the standard value of
> NR_TASKS=512!  I install a new kernel tomorrow with NR_TASKS=4096.

Err.. don't do that?  Try make -j3 instead.


r~

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-04 22:16               ` Richard Henderson
@ 2000-05-10  8:30                 ` Andreas Schwab
  2000-05-10  9:36                   ` Joe Buck
  0 siblings, 1 reply; 32+ messages in thread
From: Andreas Schwab @ 2000-05-10  8:30 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Brad Lucier, Michael Matz, gcc

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 704 bytes --]

Richard Henderson <rth@cygnus.com> writes:

|> On Thu, May 04, 2000 at 09:39:36PM -0500, Brad Lucier wrote:
|> > BTW, with the introduction of libgcc1, I can no longer
|> > 
|> > make -j bootstrap
|> > 
|> > on my Linux box---I run out of processes with the standard value of
|> > NR_TASKS=512!  I install a new kernel tomorrow with NR_TASKS=4096.
|> 
|> Err.. don't do that?  Try make -j3 instead.

Or say make -j -l <some small value> to limit depending on the load.

Andreas.

-- 
Andreas Schwab                                  "And now for something
SuSE Labs                                        completely different."
Andreas.Schwab@suse.de
SuSE GmbH, SchanzÃ¤ckerstr. 10, D-90443 NÃ¼rnberg

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-10  8:30                 ` Andreas Schwab
@ 2000-05-10  9:36                   ` Joe Buck
  2000-05-10  9:52                     ` Jeffrey A Law
  2000-05-10 14:17                     ` Joern Rennecke
  0 siblings, 2 replies; 32+ messages in thread
From: Joe Buck @ 2000-05-10  9:36 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Richard Henderson, Brad Lucier, Michael Matz, gcc

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 1475 bytes --]

> Richard Henderson <rth@cygnus.com> writes:
> 
> |> On Thu, May 04, 2000 at 09:39:36PM -0500, Brad Lucier wrote:
> |> > BTW, with the introduction of libgcc1, I can no longer
> |> > 
> |> > make -j bootstrap
> |> > 
> |> > on my Linux box---I run out of processes with the standard value of
> |> > NR_TASKS=512!  I install a new kernel tomorrow with NR_TASKS=4096.
> |> 
> |> Err.. don't do that?  Try make -j3 instead.
> 
> Or say make -j -l <some small value> to limit depending on the load.

No, -l does not work and never did.  The reason is that the load value
increases very slowly as processes are launched; as a result, if you
say make -j -l, a very large number of processes get launched initially,
and it may take 10 seconds or so for the load to climb to the level
specified in -l -- but then the load climbs to 20x that figure or more.
At this point, no more processes get launched until at least a minute
after all of the first batch dies, and then another process storm is
launched.  This is becuase the number checked by -l is based on an average
load over a full minute, not an instantaneous figure.

Solution: get make 3.79 and use -j<N>.  The GNU make folks should
deprecate -l and eventually remove it.

> Andreas.
> 
> -- 
> Andreas Schwab                                  "And now for something
> SuSE Labs                                        completely different."
> Andreas.Schwab@suse.de
> SuSE GmbH, SchanzÃ¤ckerstr. 10, D-90443 NÃ¼rnberg
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-10  9:36                   ` Joe Buck
@ 2000-05-10  9:52                     ` Jeffrey A Law
  2000-05-10 10:49                       ` Joe Buck
  2000-05-10 14:17                     ` Joern Rennecke
  1 sibling, 1 reply; 32+ messages in thread
From: Jeffrey A Law @ 2000-05-10  9:52 UTC (permalink / raw)
  To: Joe Buck
  Cc: Andreas Schwab, Richard Henderson, Brad Lucier, Michael Matz, gcc

  In message < 200005101628.JAA25911@possibly.synopsys.com >you write:
  > Solution: get make 3.79 and use -j<N>.  The GNU make folks should
  > deprecate -l and eventually remove it.
Most definitely.  I upgraded to 3.79 a few days ago and have really liked
the way -j works in the new version.  I strongly recommend anyone building
gcc on an SMP machine upgrade to make-3.79.


jeff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-10  9:52                     ` Jeffrey A Law
@ 2000-05-10 10:49                       ` Joe Buck
  0 siblings, 0 replies; 32+ messages in thread
From: Joe Buck @ 2000-05-10 10:49 UTC (permalink / raw)
  To: law; +Cc: Andreas Schwab, Richard Henderson, Brad Lucier, Michael Matz, gcc

>   In message < 200005101628.JAA25911@possibly.synopsys.com >you write:
>   > Solution: get make 3.79 and use -j<N>.  The GNU make folks should
>   > deprecate -l and eventually remove it.
> Most definitely.  I upgraded to 3.79 a few days ago and have really liked
> the way -j works in the new version.  I strongly recommend anyone building
> gcc on an SMP machine upgrade to make-3.79.

Even a 1-processor machine can benefit from -j2, as the compiler will
spend a significant portion of its time reading or writing disk, leaving
idle CPU for the other process.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-10  9:36                   ` Joe Buck
  2000-05-10  9:52                     ` Jeffrey A Law
@ 2000-05-10 14:17                     ` Joern Rennecke
  2000-05-10 14:24                       ` GNU make options (was Re: if-conversion ...) Joe Buck
  1 sibling, 1 reply; 32+ messages in thread
From: Joern Rennecke @ 2000-05-10 14:17 UTC (permalink / raw)
  To: Joe Buck
  Cc: Andreas Schwab, Richard Henderson, Brad Lucier, Michael Matz, gcc

> No, -l does not work and never did.  The reason is that the load value
> increases very slowly as processes are launched; as a result, if you
> say make -j -l, a very large number of processes get launched initially,
> and it may take 10 seconds or so for the load to climb to the level
> specified in -l -- but then the load climbs to 20x that figure or more.
> At this point, no more processes get launched until at least a minute
> after all of the first batch dies, and then another process storm is
> launched.  This is becuase the number checked by -l is based on an average
> load over a full minute, not an instantaneous figure.
> 
> Solution: get make 3.79 and use -j<N>.  The GNU make folks should
> deprecate -l and eventually remove it.

I wouldn't go so far.  It seems clear that -l is not useful to make make
react to the load it creates itself, but you might use it instead or in
addition to nice to control how much a batch job taxes a system that is
used for interactive stuff as well.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* GNU make options (was Re: if-conversion ...)
  2000-05-10 14:17                     ` Joern Rennecke
@ 2000-05-10 14:24                       ` Joe Buck
  0 siblings, 0 replies; 32+ messages in thread
From: Joe Buck @ 2000-05-10 14:24 UTC (permalink / raw)
  To: Joern Rennecke
  Cc: Andreas Schwab, Richard Henderson, Brad Lucier, Michael Matz, gcc

I wrote (re GNU make options) 
> > Solution: get make 3.79 and use -j<N>.  The GNU make folks should
> > deprecate -l and eventually remove it.
> 
> I wouldn't go so far.  It seems clear that -l is not useful to make make
> react to the load it creates itself, but you might use it instead or in
> addition to nice to control how much a batch job taxes a system that is
> used for interactive stuff as well.

OK, the combination -j<N> -l <limit> might be useful for that purpose.
But -j without an argument and -l together is a recipe for major problems.
If there is a high degree of parallelism it may result in exhausting
virtual memory, which can be very bad on systems that optimistically
allocate virtual memory (e.g. Linux): the system thrashes severely and
then tries to kill processes at random.  This huge load and memory
exhaustion can strike before the load average rises very much.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-02 22:58       ` Michael Matz
  2000-05-03  5:50         ` Brad Lucier
  2000-05-03 20:05         ` Brad Lucier
@ 2000-05-04 11:55         ` Richard Henderson
  2000-05-05 12:01         ` Jeffrey A Law
  3 siblings, 0 replies; 32+ messages in thread
From: Richard Henderson @ 2000-05-04 11:55 UTC (permalink / raw)
  To: Michael Matz; +Cc: Brad Lucier, gcc

On Wed, May 03, 2000 at 07:53:00AM +0200, Michael Matz wrote:
> I then also implemented a poor man's topological sort for cyclic graphs
> ;), which again gave a better performance. (I also did this once for doms,
> but there it didn't make a great difference on Brads test cases)

Would you split this off into a new function?  Perhaps called
order_bb_for_backward_flow or something?  This could be useful elsewhere.



r~

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-02 22:58       ` Michael Matz
                           ` (2 preceding siblings ...)
  2000-05-04 11:55         ` if-conversion a performance bottleneck Richard Henderson
@ 2000-05-05 12:01         ` Jeffrey A Law
  2000-05-05 14:32           ` flow_d_f_o_compute misnamed? (was: if-conversion a performance...) Michael Matz
  3 siblings, 1 reply; 32+ messages in thread
From: Jeffrey A Law @ 2000-05-05 12:01 UTC (permalink / raw)
  To: Michael Matz; +Cc: Brad Lucier, Richard Henderson, gcc

  In message < Pine.SOL.4.10.10005030741450.1121-200000@platon >you write:
  >   This message is in MIME format.  The first part should be readable text,
  >   while the remaining parts are likely unreadable without MIME-aware tools.
  >   Send mail to mime@docserver.cac.washington.edu for more info.
  > 
  > ---559023410-33463914-957333180=:1121
  > Content-Type: TEXT/PLAIN; charset=US-ASCII
  > 
  > Hi,
  > 
  > On Tue, 2 May 2000, Michael Matz wrote:
  > > >  time   seconds   seconds    calls  ms/call  ms/call  name    
  > > >  57.51 111.67 111.67 40604 2.75 2.75 sbitmap_intersection_of_succs
  > > >  12.83 136.59 24.92  15025 1.66 1.66 sbitmap_intersection_of_preds
  > > 
  > > I once had faster versions of these two functions, if I get home I'll see
  > > if they make any difference on your input data.
  > 
  > Before fiddling with sbitmap_intersection_of_xx() I first reworked
  > compute_flow_dominators() to also behave normally in calculating the
  > post_doms. At least the order of work-queue initialization was wrong (in
  > post_dom the changes are propagating from the _end_). I then also
  > implemented a poor man's topological sort for cyclic graphs ;), which
  > again gave a better performance. (I also did this once for doms, but there
  > it didn't make a great difference on Brads test cases)
  > 
  > Please try the attached diff (against actual CVS) if they make also a
  > difference for you ;)
Like Richard, I'd like to see you submit this as a function which can be
called from multiple locations in the compiler.

We've got a number of routines that _might_ benefit from this code, but
I'd also like to see some more general benchmarking.  I don't want to
see us slow down the compiler for the common cases just to make Brad's
one test run faster.

jeff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* flow_d_f_o_compute misnamed? (was: if-conversion a performance...)
  2000-05-05 12:01         ` Jeffrey A Law
@ 2000-05-05 14:32           ` Michael Matz
  2000-05-05 14:53             ` Jeffrey A Law
  0 siblings, 1 reply; 32+ messages in thread
From: Michael Matz @ 2000-05-05 14:32 UTC (permalink / raw)
  To: Jeffrey A Law; +Cc: Richard Henderson, gcc

Hi,

On Fri, 5 May 2000, Jeffrey A Law wrote:
> > post_dom the changes are propagating from the _end_). I then also
> > implemented a poor man's topological sort for cyclic graphs ;), which
> >
> Like Richard, I'd like to see you submit this as a function which can be
> called from multiple locations in the compiler.

Will do so on the weekend.

> We've got a number of routines that _might_ benefit from this code, but
> I'd also like to see some more general benchmarking.  I don't want to
> see us slow down the compiler for the common cases just to make Brad's
> one test run faster.

Yes, right now this sorting is linear with number of edges, while the
simple reverse filling of the worklist is linear with n_basic_blocks. May
be it would be worthwhile to make it
order_bb_for_backward_flow(basic_block *dest, int simple_reverse)
and if the arg is true, only fill dest backwards instead of trying to be
intelligent.

Meanwhile I noticed a slight strangeness with
flow_depth_first_order_compute, in that it does not compute any DFS order
but more a width first one. E.g. if presented with a graph like
{(a,b),(a,c),(b,d),(c,d),(ENTRY,a),(d,EXIT)} (i.e. a simple if) it
produces an order of [a,b,c,d] which is not according to any DF definition
I know of ;) which would result in e.g. [a,b,d,c].

In case anybody is interested, I'm currently trying to implement a linear
time dominator algorithm, and wanted to use some stuff already there. (As
I also need a backward DFS for post-dom, I anyway would have to implement
it myself, but for the beginning ...)

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: flow_d_f_o_compute misnamed? (was: if-conversion a performance...)
  2000-05-05 14:32           ` flow_d_f_o_compute misnamed? (was: if-conversion a performance...) Michael Matz
@ 2000-05-05 14:53             ` Jeffrey A Law
  2000-05-05 16:10               ` Michael Matz
  0 siblings, 1 reply; 32+ messages in thread
From: Jeffrey A Law @ 2000-05-05 14:53 UTC (permalink / raw)
  To: Michael Matz; +Cc: Richard Henderson, gcc

  In message < Pine.SOL.4.10.10005052304100.4463-100000@platon >you write:
  > Hi,
  > 
  > On Fri, 5 May 2000, Jeffrey A Law wrote:
  > > > post_dom the changes are propagating from the _end_). I then also
  > > > implemented a poor man's topological sort for cyclic graphs ;), which
  > > >
  > > Like Richard, I'd like to see you submit this as a function which can be
  > > called from multiple locations in the compiler.
  > 
  > Will do so on the weekend.
Great!


  > > We've got a number of routines that _might_ benefit from this code, but
  > > I'd also like to see some more general benchmarking.  I don't want to
  > > see us slow down the compiler for the common cases just to make Brad's
  > > one test run faster.
  > 
  > Yes, right now this sorting is linear with number of edges, while the
  > simple reverse filling of the worklist is linear with n_basic_blocks. May
  > be it would be worthwhile to make it
  > order_bb_for_backward_flow(basic_block *dest, int simple_reverse)
  > and if the arg is true, only fill dest backwards instead of trying to be
  > intelligent.
I didn't look at the code in any detail, it's entirely possible we're loading
up the worklist in a silly order right now, which can be easily resolved and
may provide most of the benefit of your patch with lower cost.

Most CFGs are relatively well behaved with fast convergence for iterative
dataflow equations -- convergence is bad typically when the CFG has a lot
of nested backedges, and even then those backedges have to enclose a large
number of blocks for iterative dataflow analysis to slow down significantly.

There are certainly other possibilities for speeding up the various dataflow
solvers, whether it's a better dominator algorithm or more efficient code
to solve the LCM dataflow equations.

  > Meanwhile I noticed a slight strangeness with
  > flow_depth_first_order_compute, in that it does not compute any DFS order
  > but more a width first one. E.g. if presented with a graph like
  > {(a,b),(a,c),(b,d),(c,d),(ENTRY,a),(d,EXIT)} (i.e. a simple if) it
  > produces an order of [a,b,c,d] which is not according to any DF definition
  > I know of ;) which would result in e.g. [a,b,d,c].
If that's what it's doing, then it certainly doesn't sound like DFS to me ;-)

  > In case anybody is interested, I'm currently trying to implement a linear
  > time dominator algorithm, and wanted to use some stuff already there. (As
  > I also need a backward DFS for post-dom, I anyway would have to implement
  > it myself, but for the beginning ...)
Which algorithm?  Improving dominator computation time is going to be more
important going forward as we use that information in more and more places.

It would probably be useful for you to go ahead and get the legal paperwork
filed with the FSF so that we can use your new dominator code when it's
ready.  http://gcc.gnu.org/contribute.html will provide you with the details
you need for your copyright assignment and/or employer disclaimer.

jeff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: flow_d_f_o_compute misnamed? (was: if-conversion a  performance...)
  2000-05-05 14:53             ` Jeffrey A Law
@ 2000-05-05 16:10               ` Michael Matz
  2000-05-05 17:05                 ` Jeffrey A Law
  0 siblings, 1 reply; 32+ messages in thread
From: Michael Matz @ 2000-05-05 16:10 UTC (permalink / raw)
  To: Jeffrey A Law; +Cc: Richard Henderson, gcc

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 3093 bytes --]

Hi,

On Fri, 5 May 2000, Jeffrey A Law wrote:
> I didn't look at the code in any detail, it's entirely possible we're loading
> up the worklist in a silly order right now, which can be easily resolved and
> may provide most of the benefit of your patch with lower cost.

Right now the worklist is filled in natural order, which couldn't be worse
for a Â´backward' problem (post dominators) ;) Reversing this already helps
alot. Filling it in an intelligent way again reduces the blocks which get
reevaluated, and additionally checking for changes minimizes that number.
I'm now down to only 386 blocks reinserted into the worklist out of ca.
5000 bb in Brads testcase (originally there were around 15000
reinsertions). I doubt that I would get anywhere further with the
bitvector based algorithm. May be another representation of the bitvectors
would help (one which accounts for large blocks of same bits) in case of
_many_ basic blocks. I was going to implement these, but I decided that my
time is better spent in replacing the whole bitvector based algorithm.
It's also more interesting ;)

> There are certainly other possibilities for speeding up the various dataflow
> solvers, whether it's a better dominator algorithm or more efficient code
> to solve the LCM dataflow equations.

I also noticed in reading through the code, that there are many places,
where data flow is calculated in some way, that do slightly the same but
nevertheless do not reuse any code. It would be good to uniform that
middle end (I mean the language and machine independent things), but gcc
is such a huge thing ;)

>   > {(a,b),(a,c),(b,d),(c,d),(ENTRY,a),(d,EXIT)} (i.e. a simple if) it
>   > produces an order of [a,b,c,d] which is not according to any DF definition
>   > I know of ;) which would result in e.g. [a,b,d,c].
> If that's what it's doing, then it certainly doesn't sound like DFS to me ;-)

So may be it's only misnamed? But then again, it's used in
flow_loops_find() to find outer loops before inner ones. I don't see right
now where a normal DFS would help that, may be I'm too tired :)

> > In case anybody is interested, I'm currently trying to implement a linear
> > time dominator algorithm, and wanted to use some stuff already there. (As
> > I also need a backward DFS for post-dom, I anyway would have to implement
> > it myself, but for the beginning ...)
> Which algorithm?  Improving dominator computation time is going to be more
> important going forward as we use that information in more and more places.

From Alstrup/Lauridsen/Thorup ( http://www.diku.dk/research-groups/topps/ ,
I think report 320 or so). Do you have any other accessible? I haven't
found more (well, I only searched the net).

> It would probably be useful for you to go ahead and get the legal paperwork
> filed with the FSF so that we can use your new dominator code when it's
> ready.  http://gcc.gnu.org/contribute.html will provide you with the details
> you need for your copyright assignment and/or employer disclaimer.

I feared somebody would say that. Bloody bureaucracy ;)

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: flow_d_f_o_compute misnamed? (was: if-conversion a performance...)
  2000-05-05 16:10               ` Michael Matz
@ 2000-05-05 17:05                 ` Jeffrey A Law
  2000-05-05 18:21                   ` Michael Matz
  0 siblings, 1 reply; 32+ messages in thread
From: Jeffrey A Law @ 2000-05-05 17:05 UTC (permalink / raw)
  To: Michael Matz; +Cc: Richard Henderson, gcc

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 2626 bytes --]

  In message < Pine.SOL.4.10.10005060022080.4463-100000@platon >you write:
  > Hi,
  > 
  > On Fri, 5 May 2000, Jeffrey A Law wrote:
  > > I didn't look at the code in any detail, it's entirely possible we're loa
  > ding
  > > up the worklist in a silly order right now, which can be easily resolved 
  > and
  > > may provide most of the benefit of your patch with lower cost.
  > 
  > Right now the worklist is filled in natural order, which couldn't be worse
  > for a Â´backward' problem (post dominators) ;) Reversing this already helps
  > alot.
Yea, that was exactly what I was asking.  The cost is almost non-existent
with notable benefits.

  > Filling it in an intelligent way again reduces the blocks which get
  > reevaluated, and additionally checking for changes minimizes that number.
  > I'm now down to only 386 blocks reinserted into the worklist out of ca.
  > 5000 bb in Brads testcase (originally there were around 15000
  > reinsertions).
Right.  This is typical behavior when we're looking at stuff in the opposite
of the intended order.

How many reinsertions do we get by just reversing the order of insertion
into the worklist?


  > >   > {(a,b),(a,c),(b,d),(c,d),(ENTRY,a),(d,EXIT)} (i.e. a simple if) it
  > >   > produces an order of [a,b,c,d] which is not according to any DF defin
  > ition
  > >   > I know of ;) which would result in e.g. [a,b,d,c].
  > > If that's what it's doing, then it certainly doesn't sound like DFS to me
  >  ;-)
  > 
  > So may be it's only misnamed? But then again, it's used in
  > flow_loops_find() to find outer loops before inner ones. I don't see right
  > now where a normal DFS would help that, may be I'm too tired :)
A DFS numbering will detect inner loops first, I haven't worked through
whether or not a BFS would detect outer loops first or not.

  > >From Alstrup/Lauridsen/Thorup ( http://www.diku.dk/research-groups/topps/ ,
  > I think report 320 or so). Do you have any other accessible? I haven't
  > found more (well, I only searched the net).
Are they based on the Lengauer & Tarjan algorithm?  That's the one I hear
the most about in papers & texts for fast dominator computations.


  > > It would probably be useful for you to go ahead and get the legal paperwo
  > rk
  > > filed with the FSF so that we can use your new dominator code when it's
  > > ready.  http://gcc.gnu.org/contribute.html will provide you with the deta
  > ils
  > > you need for your copyright assignment and/or employer disclaimer.
  > 
  > I feared somebody would say that. Bloody bureaucracy ;)
:-)  For better or worse, legal paperwork is a requirement.
jeff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: flow_d_f_o_compute misnamed? (was: if-conversion a  performance...)
  2000-05-05 17:05                 ` Jeffrey A Law
@ 2000-05-05 18:21                   ` Michael Matz
  2000-05-05 19:04                     ` Michael Hayes
  2000-05-11 17:14                     ` Jeffrey A Law
  0 siblings, 2 replies; 32+ messages in thread
From: Michael Matz @ 2000-05-05 18:21 UTC (permalink / raw)
  To: Jeffrey A Law; +Cc: Richard Henderson, gcc

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 3393 bytes --]

Hi,

On Fri, 5 May 2000, Jeffrey A Law wrote:

> > Right now the worklist is filled in natural order, which couldn't be worse
> > for a Â´backward' problem (post dominators) ;) Reversing this already helps
> > alot.
> Yea, that was exactly what I was asking.  The cost is almost non-existent
> with notable benefits.

But why restrict ourself when we can do better ;) Note that the time for
sorting the worklist in any way (may be even randomly ;)) is by far
outweighted (sp?) by the time it takes to do all the bitvectors
intersections, which are mostly useless, in the sense that some components
of the intersection didn't contribute to the result.

E.g. I measured the number of sbitmap_a_and_b's which didn't result in any
change, and my finding was horrible: of around 5 millions intersections
4.8 million were effectless. But the algorithm must try it anyway, so the
bitvector based approach leads to horrors and slowness :)

> > Filling it in an intelligent way again reduces the blocks which get
> > reevaluated, and additionally checking for changes minimizes that number.
> > I'm now down to only 386 blocks reinserted into the worklist out of ca.
> > 5000 bb in Brads testcase (originally there were around 15000
> > reinsertions).
> Right.  This is typical behavior when we're looking at stuff in the opposite
> of the intended order.
> 
> How many reinsertions do we get by just reversing the order of insertion
> into the worklist?

Quick rundown of number of reinserted blocks:
8523 with simple reverse
7003 with more intelligent "topological" sort
386  with avoiding insert as much as possible

The last one I implement with before actually reinserting a block I test,
if the block forcing the reinsert has any effect on the BB in question.
That again squeezed some minutes out of that stuff on my slow machine ;)
But it comes with the price that it now is obvious how many intersections
are done without use. With the other cases it is hidden in
intersection_of_succs (and _preds), and only noticable by the fact that it
is painfully slow ;)

> > So may be it's only misnamed? But then again, it's used in
> > flow_loops_find() to find outer loops before inner ones. I don't see right
> > now where a normal DFS would help that, may be I'm too tired :)
> A DFS numbering will detect inner loops first, I haven't worked through
> whether or not a BFS would detect outer loops first or not.

If DFS first detects inner loops (after some thinking I see that too :))
then some comments in flow_loops_find() are wrong (some say, they are
searching for outer first), and flow_depth_first_order_compute() should be
corrected to do what the name promises ;) Or do I miss something obvious?
I mean it works somehow right now, I'm not really sure if I can believe
that such a central function is incorrect and nobody noticed.

> > >From Alstrup/Lauridsen/Thorup ( http://www.diku.dk/research-groups/topps/ ,
> > I think report 320 or so). Do you have any other accessible? I haven't
> > found more (well, I only searched the net).
> Are they based on the Lengauer & Tarjan algorithm?  That's the one I hear
> the most about in papers & texts for fast dominator computations.

Indeed they based on work from the god of linear graph algorithms, but
while L&T is O(m*alpha(m,n)) (n=nodes, m=edges, alpha a more than
const-function) the A&L&T one is truly linear O(m+n).

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: flow_d_f_o_compute misnamed? (was: if-conversion a  performance...)
  2000-05-05 18:21                   ` Michael Matz
@ 2000-05-05 19:04                     ` Michael Hayes
  2000-05-11 17:14                     ` Jeffrey A Law
  1 sibling, 0 replies; 32+ messages in thread
From: Michael Hayes @ 2000-05-05 19:04 UTC (permalink / raw)
  To: Michael Matz; +Cc: Jeffrey A Law, Richard Henderson, gcc

Michael Matz writes:

 > If DFS first detects inner loops (after some thinking I see that too :))
 > then some comments in flow_loops_find() are wrong (some say, they are
 > searching for outer first), and flow_depth_first_order_compute() should be
 > corrected to do what the name promises ;) Or do I miss something obvious?

Even though I wrote this, I cannot remember the details and
unfortunately I do not have my work book with me at present.

It is quite possible that my depth first search algorithm is wrong as
I am naive when it comes to graph theory. I vaguely recall converting
my original recursive algorithm to an iterative algorithm so I could
easily have made a mistake.

 > I mean it works somehow right now, I'm not really sure if I can believe
 > that such a central function is incorrect and nobody noticed.

Although the DFS algorithm may be wrong, it may have no effect on the
loop finding algorithm.  For the latter, it is not essential that all
the outer loops in the CFG are found first but that with nested loops,
an inner loop is found after its containing outer loop.  Possibly
my `DFS' algorithm also has this property?

Michael.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: flow_d_f_o_compute misnamed? (was: if-conversion a performance...)
  2000-05-05 18:21                   ` Michael Matz
  2000-05-05 19:04                     ` Michael Hayes
@ 2000-05-11 17:14                     ` Jeffrey A Law
  1 sibling, 0 replies; 32+ messages in thread
From: Jeffrey A Law @ 2000-05-11 17:14 UTC (permalink / raw)
  To: Michael Matz; +Cc: Richard Henderson, gcc

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 4833 bytes --]

  In message < Pine.SOL.4.10.10005060226560.4463-100000@platon >you write:
  > > > Right now the worklist is filled in natural order, which couldn't be wo
  > rse
  > > > for a Â´backward' problem (post dominators) ;) Reversing this already 
hel
  > ps
  > > > alot.
  > > Yea, that was exactly what I was asking.  The cost is almost non-existent
  > > with notable benefits.
  > 
  > But why restrict ourself when we can do better ;)
I'm not necessarily restricting what we can/will do, only trying to
look at cost/benefit of the various possibilities.  A solution that
is less efficient in pathological cases, but more efficient in the
general case may be the preferred solution, it depends on a number of
factors -- and we need data to try and find reasonable tradeoff points.

  > Note that the time for
  > sorting the worklist in any way (may be even randomly ;)) is by far
  > outweighted (sp?) by the time it takes to do all the bitvectors
  > intersections, which are mostly useless, in the sense that some components
  > of the intersection didn't contribute to the result.
  > 
  > E.g. I measured the number of sbitmap_a_and_b's which didn't result in any
  > change, and my finding was horrible: of around 5 millions intersections
  > 4.8 million were effectless. But the algorithm must try it anyway, so the
  > bitvector based approach leads to horrors and slowness :)
Right.  The tradeoff is time/space vs code complexity -- choosing the right
tradeoff point is sometimes difficult.

Other approaches using "blocking factors" like in the null pointer check
elimination optimization may significantly improve this situaton.  Then
again, it may not.  Aside from being more memory efficient, the blocking
code _may_ ultimately result in significantly fewer bitvector ops than the
current implementation.  See the null pointer check code in gcse.c.  
Converting the rest of gcse to use the blocking code ought to be
reasonably easy and would be greatly appreciated.

  > > How many reinsertions do we get by just reversing the order of insertion
  > > into the worklist?
  > 
  > Quick rundown of number of reinserted blocks:
  > 8523 with simple reverse
  > 7003 with more intelligent "topological" sort
  > 386  with avoiding insert as much as possible
OK.  Assuming the code to avoid reinsertions isn't too  expensive and
difficult to understand/maintain, then it probably makes sense.  One
way to get a feel for the expense of the code is to time a bootstrap
with and without your change.  Presumably the time to bootstrap should
be reduced.

Note you need to do bootstraps anyway for testing purposes :-)


  > The last one I implement with before actually reinserting a block I test,
  > if the block forcing the reinsert has any effect on the BB in question.
Can you explain this in more detail?  I'm presuming that you only insert
a block if something in its bitvector changed?  I thought we already did that.


  > > A DFS numbering will detect inner loops first, I haven't worked through
  > > whether or not a BFS would detect outer loops first or not.
  > 
  > If DFS first detects inner loops (after some thinking I see that too :))
No need to think about it -- just refer to literature and take it as the
gospel :-)

  > then some comments in flow_loops_find() are wrong (some say, they are
  > searching for outer first), and flow_depth_first_order_compute() should be
  > corrected to do what the name promises ;) Or do I miss something obvious?
I don't think you've missed anything.  I think the code is wrong.  I'm not
sure of an algorithm to find the outermost loops first other than to find the
innermost loops and reverse the loop nest when we're done.

  > I mean it works somehow right now, I'm not really sure if I can believe
  > that such a central function is incorrect and nobody noticed.
I suspect it "works" in the sense that even if it doesn't find the right loop
nest the results of an incorrect nest are merely less efficient code, not
incorrect code.  Regardless it needs to be fixed.

  > 
  > > > >From Alstrup/Lauridsen/Thorup ( http://www.diku.dk/research-groups/topp
  > s/,
  > > > I think report 320 or so). Do you have any other accessible? I haven't
  > > > found more (well, I only searched the net).
  > > Are they based on the Lengauer & Tarjan algorithm?  That's the one I hear
  > > the most about in papers & texts for fast dominator computations.
  > 
  > Indeed they based on work from the god of linear graph algorithms, but
  > while L&T is O(m*alpha(m,n)) (n=nodes, m=edges, alpha a more than
  > const-function) the A&L&T one is truly linear O(m+n).
I certainly hope you document both the code and the algorithm well.  You might
also submit a patch to the readings.html web page to reference the paper you
mentioned above.  It would be greatly appreciated.

jeff


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
@ 2000-05-04 22:32 Mike Stump
  2000-05-04 22:35 ` Richard Henderson
  0 siblings, 1 reply; 32+ messages in thread
From: Mike Stump @ 2000-05-04 22:32 UTC (permalink / raw)
  To: lucier, rth; +Cc: gcc, matzmich

> Date: Thu, 4 May 2000 22:16:17 -0700
> From: Richard Henderson <rth@cygnus.com>

> > make -j bootstrap

> Err.. don't do that?  Try make -j3 instead.

Unfortunately -j3 gives you not three jobs, but an exponential cascade
of 3 jobs per recursive make level.  Yes, I have read the paper
`Recursive make considered harmful'.  :-)

I found that -j3 -l10 will at least try and limit them from expanding
too much, which is useful if your swap limited (just how did those 2
emacen grow to be 60M each, and netscape inflate to 90M? :-().

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-04 22:32 if-conversion a performance bottleneck Mike Stump
@ 2000-05-04 22:35 ` Richard Henderson
  2000-05-05  6:12   ` Brad Lucier
  0 siblings, 1 reply; 32+ messages in thread
From: Richard Henderson @ 2000-05-04 22:35 UTC (permalink / raw)
  To: Mike Stump; +Cc: lucier, gcc, matzmich

On Thu, May 04, 2000 at 10:31:59PM -0700, Mike Stump wrote:
> Unfortunately -j3 gives you not three jobs, but an exponential cascade
> of 3 jobs per recursive make level.

I know.  But 3 or 15 is still a lot better than 1000, which might
be what you get with just -j.

> I found that -j3 -l10 will at least try and limit them from expanding
> too much, which is useful if your swap limited (just how did those 2
> emacen grow to be 60M each, and netscape inflate to 90M? :-().

That's what gigabytes of RAM are for.  ;-)


r~

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-04 22:35 ` Richard Henderson
@ 2000-05-05  6:12   ` Brad Lucier
  2000-05-05 10:37     ` Richard Henderson
  2000-05-05 13:35     ` Gerald Pfeifer
  0 siblings, 2 replies; 32+ messages in thread
From: Brad Lucier @ 2000-05-05  6:12 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Mike Stump, lucier, gcc, matzmich

> On Thu, May 04, 2000 at 10:31:59PM -0700, Mike Stump wrote:
> > Unfortunately -j3 gives you not three jobs, but an exponential cascade
> > of 3 jobs per recursive make level.
> 
> I know.  But 3 or 15 is still a lot better than 1000, which might
> be what you get with just -j.
> 
> > I found that -j3 -l10 will at least try and limit them from expanding
> > too much, which is useful if your swap limited (just how did those 2
> > emacen grow to be 60M each, and netscape inflate to 90M? :-().
> 
> That's what gigabytes of RAM are for.  ;-)

I have gigabytes of RAM; I've watched what happens when I say, e.g.,

make -j9 bootstrap

or

make -j 9 bootstrap

or whatever, and it doesn't seem to actually set off a bunch of parallel
jobs---definitely, by the time the stage1 compiler starts compiling
things, I'm down to one job at a time.  That's with the 2.2.13 kernel
on alpha with make 3.77; is this a known problem?  Should I upgrade
make?

BTW, the maximum load with make -j is < 80 (without libgcc) and
things seem to stay rational because make gets less time as the
load rises and can't spawn jobs at such a high rate.

Brad

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-05  6:12   ` Brad Lucier
@ 2000-05-05 10:37     ` Richard Henderson
  2000-05-05 13:35     ` Gerald Pfeifer
  1 sibling, 0 replies; 32+ messages in thread
From: Richard Henderson @ 2000-05-05 10:37 UTC (permalink / raw)
  To: Brad Lucier; +Cc: Mike Stump, gcc, matzmich

On Fri, May 05, 2000 at 08:12:34AM -0500, Brad Lucier wrote:
> make -j 9 bootstrap
> 
> or whatever, and it doesn't seem to actually set off a bunch of parallel
> jobs--

make MAKE='make -j4' bootstrap -- the -j doesn't normally get
passed down to submakes.

> Should I upgrade make?

Apparently both of us should.  I'm interested in this "build server"
thingy...


r~

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
  2000-05-05  6:12   ` Brad Lucier
  2000-05-05 10:37     ` Richard Henderson
@ 2000-05-05 13:35     ` Gerald Pfeifer
  1 sibling, 0 replies; 32+ messages in thread
From: Gerald Pfeifer @ 2000-05-05 13:35 UTC (permalink / raw)
  To: Brad Lucier; +Cc: Richard Henderson, Mike Stump, gcc, matzmich

On Fri, 5 May 2000, Brad Lucier wrote:
> make -j 9 bootstrap
> 
> or whatever, and it doesn't seem to actually set off a bunch of parallel
> jobs---definitely, by the time the stage1 compiler starts compiling
> things, I'm down to one job at a time.  That's with the 2.2.13 kernel
> on alpha with make 3.77; is this a known problem?  Should I upgrade
> make?

Please do so and let us know the results! 

It would be very useful to document properly, how to use parallel
makes for GCC -- perhaps you could come up with a patch for our
documentation at?

  http://gcc.gnu.org/install/build.html

Gerald
-- 
Gerald "Jerry" pfeifer@dbai.tuwien.ac.at http://www.dbai.tuwien.ac.at/~pfeifer/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
@ 2000-05-04 23:28 Mathias Froehlich
  0 siblings, 0 replies; 32+ messages in thread
From: Mathias Froehlich @ 2000-05-04 23:28 UTC (permalink / raw)
  To: gcc

> On Thu, May 04, 2000 at 10:31:59PM -0700, Mike Stump wrote:
> > Unfortunately -j3 gives you not three jobs, but an exponential cascade
> > of 3 jobs per recursive make level.

> I know.  But 3 or 15 is still a lot better than 1000, which might
> be what you get with just -j.

> > I found that -j3 -l10 will at least try and limit them from expanding
> > too much, which is useful if your swap limited (just how did those 2
> > emacen grow to be 60M each, and netscape inflate to 90M? :-().

> That's what gigabytes of RAM are for.  ;-)

Or use a recent version of gmake (>=3.78.1), which implements a so
called "jobserver mode". This limits ther _overall_ number of working
make jobs to the number given by -j. This does also work for recursive
makes. 

Try it out it works great ...

  Regards,

     Mathias Froehlich

-- 
Mathias Fr"ohlich              e-mail: frohlich@na.uni-tuebingen.de
Institut f"ur Mathematik, Universit"at T"ubingen, D-72076 T"ubingen

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: if-conversion a performance bottleneck
@ 2000-05-05 13:46 Brad Lucier
  0 siblings, 0 replies; 32+ messages in thread
From: Brad Lucier @ 2000-05-05 13:46 UTC (permalink / raw)
  To: lucier, pfeifer; +Cc: rth, mrs, gcc, matzmich

> On Fri, 5 May 2000, Brad Lucier wrote:
> > make -j 9 bootstrap
> > 
> > or whatever, and it doesn't seem to actually set off a bunch of parallel
> > jobs---definitely, by the time the stage1 compiler starts compiling
> > things, I'm down to one job at a time.  That's with the 2.2.13 kernel
> > on alpha with make 3.77; is this a known problem?  Should I upgrade
> > make?
> 
> Please do so and let us know the results! 
> 
> It would be very useful to document properly, how to use parallel
> makes for GCC -- perhaps you could come up with a patch for our
> documentation at?
> 
>   http://gcc.gnu.org/install/build.html

After upgrading make from version 3.77 to version 3.79,

make -j 9 bootstrap

works as I think it should---it consistently has about 9 processes
running in various subdirectories throughout the build process.

Brad

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2000-05-11 17:14 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2000-05-01 10:54 if-conversion a performance bottleneck Brad Lucier
2000-05-01 15:29 ` Richard Henderson
2000-05-01 18:01   ` Brad Lucier
2000-05-02  5:45     ` Michael Matz
2000-05-02 22:58       ` Michael Matz
2000-05-03  5:50         ` Brad Lucier
2000-05-03 20:05         ` Brad Lucier
2000-05-04 11:46           ` Michael Matz
2000-05-04 19:39             ` Brad Lucier
2000-05-04 22:16               ` Richard Henderson
2000-05-10  8:30                 ` Andreas Schwab
2000-05-10  9:36                   ` Joe Buck
2000-05-10  9:52                     ` Jeffrey A Law
2000-05-10 10:49                       ` Joe Buck
2000-05-10 14:17                     ` Joern Rennecke
2000-05-10 14:24                       ` GNU make options (was Re: if-conversion ...) Joe Buck
2000-05-04 11:55         ` if-conversion a performance bottleneck Richard Henderson
2000-05-05 12:01         ` Jeffrey A Law
2000-05-05 14:32           ` flow_d_f_o_compute misnamed? (was: if-conversion a performance...) Michael Matz
2000-05-05 14:53             ` Jeffrey A Law
2000-05-05 16:10               ` Michael Matz
2000-05-05 17:05                 ` Jeffrey A Law
2000-05-05 18:21                   ` Michael Matz
2000-05-05 19:04                     ` Michael Hayes
2000-05-11 17:14                     ` Jeffrey A Law
2000-05-04 22:32 if-conversion a performance bottleneck Mike Stump
2000-05-04 22:35 ` Richard Henderson
2000-05-05  6:12   ` Brad Lucier
2000-05-05 10:37     ` Richard Henderson
2000-05-05 13:35     ` Gerald Pfeifer
2000-05-04 23:28 Mathias Froehlich
2000-05-05 13:46 Brad Lucier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).