On Fri, 19 Oct 2018, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On October 18, 2018 11:05:32 PM GMT+02:00, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>Richard Biener <rguenther@suse.de> writes:
> >>> On Thu, 18 Oct 2018, Richard Sandiford wrote:
> >>>
> >>>> Richard Biener <rguenther@suse.de> writes:
> >>>> > PR63155 made me pick up this old work from Steven, it turns our
> >>>> > linked-list implementation to a two-mode one with one being a
> >>>> > splay tree featuring O(log N) complexity for find/remove.
> >>>> >
> >>>> > Over Stevens original patch I added a bitmap_tree_to_vec helper
> >>>> > that I use from the debug/print methods to avoid changing view
> >>>> > there.  In theory the bitmap iterator could get a "stack"
> >>>> > as well and we could at least support EXECUTE_IF_SET_IN_BITMAP.
> >>>> >
> >>>> > This can be used to fix the two biggest bottlenecks in the PRs
> >>>> > testcase, namely SSA propagator worklist handling and out-of-SSA
> >>>> > coalesce list building.  perf shows the following data, first
> >>>> > unpatched, second patched - also watch the thrid coulumn (samples)
> >>>> > when comparing percentages.
> >>>> >
> >>>> > -O0
> >>>> > -   18.19%    17.35%           407  cc1      cc1               [.]
> >>bitmap_set_bâ
> >>>> >    - bitmap_set_bit                                               
> >>            â
> >>>> >       + 8.77% create_coalesce_list_for_region                     
> >>            â
> >>>> >       + 4.21% calculate_live_ranges                               
> >>            â
> >>>> >       + 2.02% build_ssa_conflict_graph                            
> >>            â
> >>>> >       + 1.66% insert_phi_nodes_for                                
> >>            â
> >>>> >       + 0.86% coalesce_ssa_name                      
> >>>> > patched:
> >>>> > -   12.39%    10.48%           129  cc1      cc1               [.]
> >>bitmap_set_bâ
> >>>> >    - bitmap_set_bit                                               
> >>            â
> >>>> >       + 5.27% calculate_live_ranges                               
> >>            â
> >>>> >       + 2.76% insert_phi_nodes_for                                
> >>            â
> >>>> >       + 1.90% create_coalesce_list_for_region                     
> >>            â
> >>>> >       + 1.63% build_ssa_conflict_graph                            
> >>            â
> >>>> >       + 0.35% coalesce_ssa_name                           
> >>>> >
> >>>> > -O1
> >>>> > -   17.53%    17.53%           842  cc1      cc1               [.]
> >>bitmap_set_bâ
> >>>> >    - bitmap_set_bit                                               
> >>            â
> >>>> >       + 12.39% add_ssa_edge                                       
> >>            â
> >>>> >       + 1.48% create_coalesce_list_for_region                     
> >>            â
> >>>> >       + 0.82% solve_constraints                                   
> >>            â
> >>>> >       + 0.71% calculate_live_ranges                               
> >>            â
> >>>> >       + 0.64% add_implicit_graph_edge                             
> >>            â
> >>>> >       + 0.41% insert_phi_nodes_for                                
> >>            â
> >>>> >       + 0.34% build_ssa_conflict_graph                      
> >>>> > patched:
> >>>> > -    5.79%     5.00%           167  cc1      cc1               [.]
> >>bitmap_set_bâ
> >>>> >    - bitmap_set_bit                                               
> >>            â
> >>>> >       + 1.41% add_ssa_edge                                        
> >>            â
> >>>> >       + 0.88% calculate_live_ranges                               
> >>            â
> >>>> >       + 0.75% add_implicit_graph_edge                             
> >>            â
> >>>> >       + 0.68% solve_constraints                                   
> >>            â
> >>>> >       + 0.48% insert_phi_nodes_for                                
> >>            â
> >>>> >       + 0.45% build_ssa_conflict_graph                   
> >>>> >
> >>>> > -O3
> >>>> > -   12.37%    12.34%          1145  cc1      cc1               [.]
> >>bitmap_set_bâ
> >>>> >    - bitmap_set_bit                                               
> >>            â
> >>>> >       + 9.14% add_ssa_edge                                        
> >>            â
> >>>> >       + 0.80% create_coalesce_list_for_region                     
> >>            â
> >>>> >       + 0.69% add_implicit_graph_edge                             
> >>            â
> >>>> >       + 0.54% solve_constraints                                   
> >>            â
> >>>> >       + 0.34% calculate_live_ranges                               
> >>            â
> >>>> >       + 0.27% insert_phi_nodes_for                                
> >>            â
> >>>> >       + 0.21% build_ssa_conflict_graph                 
> >>>> > -    4.36%     3.86%           227  cc1      cc1               [.]
> >>bitmap_set_bâ
> >>>> >    - bitmap_set_bit                                               
> >>            â
> >>>> >       + 0.98% add_ssa_edge                                        
> >>            â
> >>>> >       + 0.86% add_implicit_graph_edge                             
> >>            â
> >>>> >       + 0.64% solve_constraints                                   
> >>            â
> >>>> >       + 0.57% calculate_live_ranges                               
> >>            â
> >>>> >       + 0.32% build_ssa_conflict_graph                            
> >>            â
> >>>> >       + 0.29% mark_all_vars_used_1                                
> >>            â
> >>>> >       + 0.20% insert_phi_nodes_for                                
> >>            â
> >>>> >       + 0.16% create_coalesce_list_for_region             
> >>>> >
> >>>> >
> >>>> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >>>> 
> >>>> What's the performance like for more reasonable testcases?  E.g. is
> >>there
> >>>> a measurable change either way in --enable-checking=release for some
> >>gcc
> >>>> .iis at -g or -O2 -g?
> >>>
> >>> I did a quick check on my set of cc1files (still .i, .ii ones tend to
> >>> be unusable quite quickly...).  Most of them compile too quickly
> >>> to make any difference appear other than noise.  Multi-second
> >>differences
> >>> like for PR63155 should be the exception but our O(n) bitmap
> >>> implementation really makes some parts of GCC quadratic where it
> >>> doesn't appear so.
> >>>
> >>> Is there a reason you expect it to be ever slower?
> >>
> >>During recent stage3s I've tried to look at profiles of cc1plus
> >>to see whether there was something easy we could do to improve
> >>compile times.  And bitmap operations always showed up near the
> >>top of the profile.  There were always some pathological queries
> >>in which the linear search really hurt, but whenever I tried "simple"
> >>ways to avoid the obvious cases, they made those queries quicker
> >>but slowed things down overall.  It seemed that adding any extra logic
> >>to the queries hurted because even a small slowdown in common lookups
> >>overwhelmed a big saving in less common lookups.
> >
> > Yeah. I also noticed some 'obvious' shortcomings in the heuristics...
> > I guess in the end well predicted branches in the out of line code are
> > important...
> >
> >>
> >>But there again I was looking to speed up more "normal" cases, not ones
> >>like the PR.
> >>
> >>FWIW I've tried it on a local x86_64 box and it slows down current
> >>optabs.ii at -O2 -g by ~0.35% (reproducable).  I see similar slowdowns
> >>for the other files I tried.  But that's hardly a huge amount, and
> >>probably a price worth paying for the speed-up in the PR.
> >
> > Just to make sure what to reproduce - this is with checking disabled?
> > And with or without the hunks actually making use of the splay tree
> > path?
> 
> Yeah, with an --enable-checking=release stage3:
> 
>    ./cc1plus optabs.ii -o /dev/null -O2 -g
> 
> using the optabs.ii from the unpatched --enable-checking=release build.
> 
> It was the whole patch vs. without the patch.

OK, so there are almost no hits from the SSA propagator or out-of-SSA
but only from "unchanged" paths:

-    2.90%     2.90%            23  cc1plus  cc1plus           [.] 
bitmap_set_bâ
   - bitmap_set_bit                                                            
â
      + 0.79% df_lr_bb_local_compute                                           
â
      + 0.38% insert_updated_phi_nodes_for                                     
â
      + 0.27% sched_analyze_reg                                                
â
      + 0.23% walk_aliased_vdefs_1                                             
â
      + 0.13% get_continuation_for_phi                                         
â
      + 0.13% add_control_edge                                                 
â
      + 0.13% df_md_bb_local_compute_process_def                               
â
      + 0.13% mark_for_renaming                                                
â
      + 0.13% (anonymous namespace)::pass_rtl_cprop::execute                   
â
      + 0.13% compute_dominance_frontiers                                      
â
      + 0.12% df_simulate_initialize_backwards      

it's also interesting that most branch misses (for bitmap_set_bit)
are from

      bool res = (ptr->bits[word_num] & bit_val) == 0;
      if (res)
        ptr->bits[word_num] |= bit_val;
      return res;

I'm not sure how "bad" a mispredicted branch is here.  I guess
if it is predicted to be taken (skip the write) then it's bad
but if it is predicted the other way around it should be a matter
of not retiring the store...  But I am not a CPU guy.  I guess
unconditionally updating the memory wouldn't be so bad after all
and it might also help combine in using architecture specific
optimizations like using some CC flags of the OR operation
to get at the comparison result.  Can you benchmark a difference
for this?

Individual compiles of optabs.ii are too fast (<4s for me) to
make out overall performance changes and noise is quite high
for me as well (+-6%).

It looks like you cannot even use perfs precise counters (Haswell here)
of retired instructions to measure differences reliably.

That is, I cannot really reproduce your findings on optabs.ii...
for me, the best runtime of both patched and unpatched is the
same 3.22s.

Richard.