public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r
@ 2021-10-26 11:13 rguenth at gcc dot gnu.org
  2021-10-26 11:15 ` [Bug tree-optimization/102943] [12 Regression] " rguenth at gcc dot gnu.org
                   ` (53 more replies)
  0 siblings, 54 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-10-26 11:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

            Bug ID: 102943
           Summary: VRP threader compile-time hog with 521.wrf_r
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

Compiling 521.wrf_r with -Ofast -march=znver2 -flto results in one LTRANS unit
(unit 105) showing

 tree VRP threader                  : 122.22 ( 44%)   0.04 (  3%) 124.84 ( 44%)   18M (  2%)
 TOTAL                              : 280.88          1.23        286.88        849M

Note that WRF is also prone to RTL DF scalability issues, where other LTRANS
units spend 40% of their compile time, but the VRP threader issue is new.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] VRP threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
@ 2021-10-26 11:15 ` rguenth at gcc dot gnu.org
  2021-10-26 11:25 ` rguenth at gcc dot gnu.org
                   ` (52 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-10-26 11:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |aldyh at gcc dot gnu.org
           Keywords|                            |compile-time-hog
   Target Milestone|---                         |12.0
            Summary|VRP threader compile-time   |[12 Regression] VRP
                   |hog with 521.wrf_r          |threader compile-time hog
                   |                            |with 521.wrf_r

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The compiler had checking enabled but I built the LTRANS unit with
-fno-checking.

Not sure what's special about the VRP threader.


* [Bug tree-optimization/102943] [12 Regression] VRP threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
  2021-10-26 11:15 ` [Bug tree-optimization/102943] [12 Regression] " rguenth at gcc dot gnu.org
@ 2021-10-26 11:25 ` rguenth at gcc dot gnu.org
  2021-10-26 11:49 ` rguenth at gcc dot gnu.org
                   ` (51 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-10-26 11:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
The biggest LTRANS unit (36) also shows this:

 df reaching defs                   :  56.27 (  6%)   0.30 (  9%)  57.56 (  6%)     0  (  0%)
 df live regs                       : 101.65 ( 10%)   0.19 (  6%) 103.05 ( 10%)     0  (  0%)
 df live&initialized regs           : 108.24 ( 11%)   0.12 (  4%) 109.43 ( 11%)     0  (  0%)
...
 tree VRP threader                  : 237.97 ( 24%)   0.11 (  3%) 243.81 ( 24%)    60M (  2%)
...
 TOTAL                              : 976.77          3.27        996.18       2618M


* [Bug tree-optimization/102943] [12 Regression] VRP threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
  2021-10-26 11:15 ` [Bug tree-optimization/102943] [12 Regression] " rguenth at gcc dot gnu.org
  2021-10-26 11:25 ` rguenth at gcc dot gnu.org
@ 2021-10-26 11:49 ` rguenth at gcc dot gnu.org
  2021-10-26 14:57 ` pinskia at gcc dot gnu.org
                   ` (50 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-10-26 11:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
And just in case it helps, perf for ltrans36

Samples: 1M of event 'cycles', Event count (approx.): 2082526448741
Overhead       Samples  Command      Shared Object     Symbol
   6.50%        118620  lto1-ltrans  lto1              [.] bitmap_bit_p
   5.46%         99073  lto1-ltrans  lto1              [.] bitmap_and_into
   5.04%         91631  lto1-ltrans  lto1              [.] bitmap_list_insert_element_after
   4.65%         84591  lto1-ltrans  lto1              [.] bitmap_ior_into
   4.62%         84382  lto1-ltrans  lto1              [.] bitmap_get_aligned_chunk
   4.24%         77368  lto1-ltrans  lto1              [.] bitmap_set_bit
   3.54%         64592  lto1-ltrans  lto1              [.] df_count_refs
   3.28%         59956  lto1-ltrans  lto1              [.] ranger_cache::add_to_update
   2.88%         52329  lto1-ltrans  lto1              [.] bitmap_and
   2.15%         39376  lto1-ltrans  lto1              [.] process_bb_lives
   1.88%         34289  lto1-ltrans  lto1              [.] bitmap_elt_ior
   1.77%         32334  lto1-ltrans  lto1              [.] update_pseudo_point
   1.60%         29081  lto1-ltrans  lto1              [.] bitmap_ior_and_compl
   1.60%         29189  lto1-ltrans  lto1              [.] gimple_outgoing_range_stmt_p
   1.58%         28678  lto1-ltrans  lto1              [.] pre_and_rev_post_order_compute_fn
   1.51%         27683  lto1-ltrans  lto1              [.] get_immediate_dominator
   1.40%         25392  lto1-ltrans  lto1              [.] bitmap_and_compl_into
   1.27%         23132  lto1-ltrans  lto1              [.] ranger_cache::propagate_cache
   1.26%         22904  lto1-ltrans  lto1              [.] bitmap_intersect_p
   1.19%         21655  lto1-ltrans  lto1              [.] df_worklist_dataflow
   1.12%         20432  lto1-ltrans  lto1              [.] bitmap_copy
   1.03%         18933  lto1-ltrans  lto1              [.] determine_value_range
   0.88%         16151  lto1-ltrans  lto1              [.] update_ssa
   0.87%         15865  lto1-ltrans  lto1              [.] bitmap_set_aligned_chunk
   0.77%         14157  lto1-ltrans  lto1              [.] rewrite_update_dom_walker::before_dom_children


* [Bug tree-optimization/102943] [12 Regression] VRP threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2021-10-26 11:49 ` rguenth at gcc dot gnu.org
@ 2021-10-26 14:57 ` pinskia at gcc dot gnu.org
  2021-10-26 14:58 ` marxin at gcc dot gnu.org
                   ` (49 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-10-26 14:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
*** Bug 102947 has been marked as a duplicate of this bug. ***


* [Bug tree-optimization/102943] [12 Regression] VRP threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2021-10-26 14:57 ` pinskia at gcc dot gnu.org
@ 2021-10-26 14:58 ` marxin at gcc dot gnu.org
  2021-10-26 15:06 ` marxin at gcc dot gnu.org
                   ` (48 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-10-26 14:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

Martin Liška <marxin at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-10-26
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1


* [Bug tree-optimization/102943] [12 Regression] VRP threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2021-10-26 14:58 ` marxin at gcc dot gnu.org
@ 2021-10-26 15:06 ` marxin at gcc dot gnu.org
  2021-10-30  6:31 ` aldyh at gcc dot gnu.org
                   ` (47 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-10-26 15:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #5 from Martin Liška <marxin at gcc dot gnu.org> ---
Created attachment 51669
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51669&action=edit
Memory and CPU utilization with LTO (--enable-checking=release)


* [Bug tree-optimization/102943] [12 Regression] VRP threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2021-10-26 15:06 ` marxin at gcc dot gnu.org
@ 2021-10-30  6:31 ` aldyh at gcc dot gnu.org
  2021-10-31 20:06 ` hubicka at gcc dot gnu.org
                   ` (46 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: aldyh at gcc dot gnu.org @ 2021-10-30  6:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #6 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
Can this be re-checked now that the forward threader has been dropped post-VRP?

BTW, please CC me on any compile-time hogs related to the threader, especially
if they're not SPEC related, as I have yet to hunt down a copy of SPEC.  So far
this is the only non-duplicate PR I'm CC'ed on where the threader is a suspect
in compile speed.


* [Bug tree-optimization/102943] [12 Regression] VRP threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2021-10-30  6:31 ` aldyh at gcc dot gnu.org
@ 2021-10-31 20:06 ` hubicka at gcc dot gnu.org
  2021-11-02  7:25 ` [Bug tree-optimization/102943] [12 Regression] Jump " rguenth at gcc dot gnu.org
                   ` (45 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-10-31 20:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #7 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Here is the compile-time plot:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=227.270.8
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=289.270.8
(-O2 and -Ofast with LTO)
Things have improved, but build times are still a lot worse than before.


* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2021-10-31 20:06 ` hubicka at gcc dot gnu.org
@ 2021-11-02  7:25 ` rguenth at gcc dot gnu.org
  2021-11-02  7:29 ` aldyh at gcc dot gnu.org
                   ` (44 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-11-02  7:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[12 Regression] VRP         |[12 Regression] Jump
                   |threader compile-time hog   |threader compile-time hog
                   |with 521.wrf_r              |with 521.wrf_r

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
The 'tree VRP threader' instances are now gone (well, obviously..).  There's
now

 backwards jump threading           :  15.98 ( 13%) 
 TOTAL                              : 120.33     

 backwards jump threading           :  41.23 ( 33%) 
 TOTAL                              : 125.43         

 backwards jump threading           :  89.97 ( 19%)  
 TOTAL                              : 473.55     

in the three biggest LTRANS units (all others are <10s compile time).  It might
be that the VRP threading opportunities are now simply taken by backwards
threader instances.

So, re-confirmed.


* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2021-11-02  7:25 ` [Bug tree-optimization/102943] [12 Regression] Jump " rguenth at gcc dot gnu.org
@ 2021-11-02  7:29 ` aldyh at gcc dot gnu.org
  2021-11-03 10:57 ` aldyh at gcc dot gnu.org
                   ` (43 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: aldyh at gcc dot gnu.org @ 2021-11-02  7:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #9 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #8)
> The 'tree VRP threader' instances are now gone (well, obviously..).  There's
> now
> 
>  backwards jump threading           :  15.98 ( 13%) 
>  TOTAL                              : 120.33     
> 
>  backwards jump threading           :  41.23 ( 33%) 
>  TOTAL                              : 125.43         
> 
>  backwards jump threading           :  89.97 ( 19%)  
>  TOTAL                              : 473.55     
> 
> in the three biggest LTRANS units (all others are <10s compile time).  It
> might
> be that the VRP threading opportunities are now simply taken by backwards
> threader instances.

Correct.  There is no longer a post-VRP threader based on the forward
threader.  Now there's just one pre-VRP pass based on the backward threader.

Thanks for re-confirming.  I have hunted down a copy of SPEC2017 and will be
looking at this today.  If anyone has specific configury tricks/options for
SPEC, please mail me privately, as it's been decades since I last ran SPEC.


* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2021-11-02  7:29 ` aldyh at gcc dot gnu.org
@ 2021-11-03 10:57 ` aldyh at gcc dot gnu.org
  2021-11-03 10:58 ` aldyh at gcc dot gnu.org
                   ` (42 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: aldyh at gcc dot gnu.org @ 2021-11-03 10:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

Aldy Hernandez <aldyh at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amacleod at redhat dot com

--- Comment #10 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
I tried all sorts of knobs limiting the behavior for large BBs (one function
has over 20k blocks), a large number of imports (dependencies on the final
conditional), and even the max number of blocks to look back.  None of them
made a difference.

Then I realized that this PR was originally reported against the hybrid VRP
threader, which used a different path discovery engine altogether (the old
forward threader).  So, the problem can't be in the backward threader path
discovery bits, but in something the solver is doing.

I timed all the threaders using the solver by functionality (simple versus
fully resolving mode):

backwards simple                   :   4.85 (  2%)   0.00 (  0%)   4.84 (  2%)  932k (  0%)
backwards full                     :  54.60 ( 17%)   0.01 (  1%)  54.70 ( 17%)  664k (  0%)

This confirms my hypothesis that it's not the backward threader discovery bits,
since the above two entries use the same engine.  So clearly, it's something
that the fully resolving threader does that was common with the hybrid
threader, i.e. our use of the ranger.

A callgrind session shows that the majority of the back threader's time is
being spent in:

  path_range_query::range_on_path_entry (irange &r, tree name)

...which is understandable, because when we can't resolve an SSA within the
path, we ask the ranger what the range on entry to the path is.

Curiously though, most of the time is spent in propagate_cache, especially
add_to_update, which is accounting for 37.5% of the threader's time:

-  if (!bitmap_bit_p (m_propfail, bb->index) &&  !m_update_list.contains (bb))
-    m_update_list.quick_push (bb);

This is a large CFG, so a linear search for a BB is bound to be slow.  Just
replacing it with an sbitmap knocks off a good 12 seconds:

 backwards jump threading           :  48.40 ( 28%)   0.02 (  1%)  48.57 ( 27%)  1597k (  0%)
 backwards jump threading           :  32.96 ( 22%)   0.09 (  4%)  33.12 ( 22%)  1499k (  0%)

Not ideal, but a good improvement IMO.

I'll post my proposed patch, but I suspect Andrew may have other tricks up his
sleeve.
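As a rough sketch of the shape of the fix (standard C++ containers stand in
for GCC's vec and sbitmap here; all names are illustrative, not the actual
patch), pairing the worklist with a membership flag array turns the
"already queued?" test into O(1):

```cpp
#include <cassert>
#include <vector>

// Sketch only: std::vector<int> stands in for GCC's vec<basic_block>
// and std::vector<bool> for the sbitmap.  The membership flags make
// the "already queued?" test O(1) instead of the O(n) scan that
// vec::contains performs.
struct worklist
{
  std::vector<int> stack;    // pending BB indices
  std::vector<bool> queued;  // queued[i] is true iff i is in stack

  explicit worklist (int num_bbs) : queued (num_bbs, false) {}

  void add (int bb_index)
  {
    if (!queued[bb_index])   // O(1) membership test
      {
        queued[bb_index] = true;
        stack.push_back (bb_index);
      }
  }

  int pop ()
  {
    int bb_index = stack.back ();
    stack.pop_back ();
    queued[bb_index] = false;  // keep flags in sync with the vector
    return bb_index;
  }

  bool empty () const { return stack.empty (); }
};
```

With a function of over 20k blocks, replacing the linear scan of the pending
vector with one flag test per add_to_update call is where the measured
saving comes from.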


* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2021-11-03 10:57 ` aldyh at gcc dot gnu.org
@ 2021-11-03 10:58 ` aldyh at gcc dot gnu.org
  2021-11-03 13:17 ` rguenther at suse dot de
                   ` (41 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: aldyh at gcc dot gnu.org @ 2021-11-03 10:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #11 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
Created attachment 51726
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51726&action=edit
untested improvement to ranger cache


* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2021-11-03 10:58 ` aldyh at gcc dot gnu.org
@ 2021-11-03 13:17 ` rguenther at suse dot de
  2021-11-03 14:33 ` amacleod at redhat dot com
                   ` (40 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenther at suse dot de @ 2021-11-03 13:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #12 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 3 Nov 2021, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943
> 
> Aldy Hernandez <aldyh at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |amacleod at redhat dot com
> 
> --- Comment #10 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> I tried all sorts of knobs limiting the behavior for large BBs (one function
> has over 20k blocks), a large number of imports (dependencies on the final
> conditional), and even the max number of blocks to look back.  None of them
> made a difference.
> 
> Then I realized that this PR was originally reported against the hybrid VRP
> threader, which used a different path discovery engine altogether (the old
> forward threader).  So, the problem can't be in the backward threader path
> discovery bits, but in something the solver is doing.
> 
> I timed all the threaders using the solver by functionality (simple versus
> fully resolving mode):
> 
> backwards simple                   :   4.85 (  2%)   0.00 (  0%)   4.84 (  2%)  932k (  0%)
> backwards full                     :  54.60 ( 17%)   0.01 (  1%)  54.70 ( 17%)  664k (  0%)
> 
> This confirms my hypothesis that it's not the backward threader discovery bits,
> since the above two entries use the same engine.  So clearly, it's something
> that the fully resolving threader does that was common with the hybrid
> threader, i.e. our use of the ranger.
> 
> A callgrind session shows that the majority of the back threader's time is
> being spent in:
> 
>   path_range_query::range_on_path_entry (irange &r, tree name)
> 
> ...which is understandable, because when we can't resolve an SSA within the
> path, we ask the ranger what the range on entry to the path is.
> 
> Curiously though, most of the time is spent in propagate_cache, especially
> add_to_update, which is accounting for 37.5% of the threader's time:
> 
> -  if (!bitmap_bit_p (m_propfail, bb->index) &&  !m_update_list.contains (bb))
> -    m_update_list.quick_push (bb);
> 
> This is a large CFG, so a linear search for a BB is bound to be slow.

Indeed, vec should never have gotten ::contains () ... I'd have
used a regular bitmap, not sbitmap, because we do

      bb = m_update_list.pop ();

and bitmap_first_set_bit is O(n) for an sbitmap but O(1) for a bitmap.



* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2021-11-03 13:17 ` rguenther at suse dot de
@ 2021-11-03 14:33 ` amacleod at redhat dot com
  2021-11-03 14:42 ` rguenther at suse dot de
                   ` (39 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: amacleod at redhat dot com @ 2021-11-03 14:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #13 from Andrew Macleod <amacleod at redhat dot com> ---


> > 
> > This is a large CFG, so a linear search of a BB, is bound to be slow.
> 
> Indeed, vec should never have gotten ::contains () ... I'd have
> used a regular bitmap, not sbitmap, because we do
> 
>       bb = m_update_list.pop ();
> 
> and bitmap_first_set_bit is O(n) for an sbitmap bit O(1) for a bitmap.
> 

If we replaced the current vector with just the bitmap implementation, then
this would be the ideal way to do it. 

However, the propagation engine is supposed to call add_to_update in a
(mostly) breadth-first way so that the changes being pushed minimize turmoil.
As I look closer, I see that this code doesn't really end up doing that
properly now anyway.  I'll do some experiments and either fix the current code
to do breadth-first ordering right, or just switch to the bitmap solution you
suggest, which will then be quasi-random.

And you are right.. that contains should never have gotten in there, let alone
been used by me. :-)  Must have been in a hurry that day.


* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2021-11-03 14:33 ` amacleod at redhat dot com
@ 2021-11-03 14:42 ` rguenther at suse dot de
  2021-11-04 14:40 ` cvs-commit at gcc dot gnu.org
                   ` (38 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenther at suse dot de @ 2021-11-03 14:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #14 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 3 Nov 2021, amacleod at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943
> 
> --- Comment #13 from Andrew Macleod <amacleod at redhat dot com> ---
> 
> 
> > > 
> > > This is a large CFG, so a linear search of a BB, is bound to be slow.
> > 
> > Indeed, vec should never have gotten ::contains () ... I'd have
> > used a regular bitmap, not sbitmap, because we do
> > 
> >       bb = m_update_list.pop ();
> > 
> > and bitmap_first_set_bit is O(n) for an sbitmap but O(1) for a bitmap.
> > 
> 
> If we replaced the current vector with just the bitmap implementation, then
> this would be the ideal way to do it. 
> 
> However, the propagation engine is supposed to call add_to_update in a (mostly)
> breadth-first way so that changes being pushed minimize turmoil.  As I look
> closer, I see that this code doesn't really end up doing that properly now
> anyway.  I'll do some experiments and either fix the current code to do breadth
> right, or just switch to the bitmap solution you suggest which will then be
> quasi-random.

In other places we use one level of indirection when we want to visit in
a particular CFG order.  We have a mapping bb-index to CFG-order
and a back-mapping CFG-order to bb-index, then the bitmap is set up
to contain CFG-order[bb_index] and the bitmap_first_set_bit result
is indirected via bb_index[CFG-order].  That gives the desired ordering
of the bitmap based worklist at the expense of two int->int maps.


* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2021-11-03 14:42 ` rguenther at suse dot de
@ 2021-11-04 14:40 ` cvs-commit at gcc dot gnu.org
  2021-11-04 14:40 ` cvs-commit at gcc dot gnu.org
                   ` (37 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-11-04 14:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #15 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Aldy Hernandez <aldyh@gcc.gnu.org>:

https://gcc.gnu.org/g:5ea1ce43b6070aaa94882e8b15f3340344aaa6b2

commit r12-4903-g5ea1ce43b6070aaa94882e8b15f3340344aaa6b2
Author: Aldy Hernandez <aldyh@redhat.com>
Date:   Wed Nov 3 08:23:25 2021 +0100

    path solver: Only compute relations for imports.

    We are currently calculating implicit PHI relations for all PHI
    arguments.  This creates unnecessary work, as we only care about SSA
    names in the import bitmap.  Similarly for inter-path relationals.  We
    can avoid things not in the bitmap.

    Tested on x86-64 and ppc64le Linux with the usual regstrap.  I also
    verified that the before and after number of threads was the same
    in a suite of .ii files from a bootstrap.

    gcc/ChangeLog:

            PR tree-optimization/102943
            * gimple-range-path.cc (path_range_query::compute_phi_relations):
            Only compute relations for SSA names in the import list.
            (path_range_query::compute_outgoing_relations): Same.
            * gimple-range-path.h (path_range_query::import_p): New.
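The idea behind the commit can be sketched as follows (all names here are
illustrative stand-ins, not GCC's actual interfaces):

```cpp
#include <cassert>
#include <unordered_set>

// Sketch only: the path solver keeps a set of "imports" -- the SSA
// names the path's final conditional depends on -- and skips relation
// computation entirely for anything outside that set.
struct path_solver_sketch
{
  std::unordered_set<int> imports;  // SSA version numbers we care about
  int relations_computed = 0;

  bool import_p (int ssa_version) const  // role of the new import_p helper
  {
    return imports.count (ssa_version) != 0;
  }

  void maybe_compute_relation (int ssa_version)
  {
    if (!import_p (ssa_version))
      return;                  // not an import: skip the work
    ++relations_computed;      // stand-in for the real relation analysis
  }
};
```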


* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2021-11-04 14:40 ` cvs-commit at gcc dot gnu.org
@ 2021-11-04 14:40 ` cvs-commit at gcc dot gnu.org
  2021-11-04 14:40 ` cvs-commit at gcc dot gnu.org
                   ` (36 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-11-04 14:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #16 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Aldy Hernandez <aldyh@gcc.gnu.org>:

https://gcc.gnu.org/g:e4411622690654cdc530c6262c7115a9e15dc359

commit r12-4904-ge4411622690654cdc530c6262c7115a9e15dc359
Author: Aldy Hernandez <aldyh@redhat.com>
Date:   Thu Nov 4 11:34:55 2021 +0100

    Avoid repeating calculations in threader.

    We already attempt to resolve the current path on entry to
    find_paths_to_name(), so there's no need to do so again for each
    exported range since nothing has changed.

    Removing this redundant calculation avoids 22% of calls into the path
    solver.

    Tested on x86-64 and ppc64le Linux with the usual regstrap.  I also
    verified that the before and after number of threads was the same
    in a suite of .ii files from a bootstrap.

    gcc/ChangeLog:

            PR tree-optimization/102943
            * tree-ssa-threadbackward.c (back_threader::find_paths_to_names):
            Avoid duplicate calculation of paths.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2021-11-04 14:40 ` cvs-commit at gcc dot gnu.org
@ 2021-11-04 14:40 ` cvs-commit at gcc dot gnu.org
  2021-11-04 15:24 ` aldyh at gcc dot gnu.org
                   ` (35 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-11-04 14:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #17 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Aldy Hernandez <aldyh@gcc.gnu.org>:

https://gcc.gnu.org/g:6a9678f0b30d36ae13259ad635e175a1e24917a1

commit r12-4905-g6a9678f0b30d36ae13259ad635e175a1e24917a1
Author: Aldy Hernandez <aldyh@redhat.com>
Date:   Thu Nov 4 12:37:16 2021 +0100

    path solver: Prefer range_of_expr instead of range_on_edge.

    The range_of_expr method provides better caching than range_on_edge.
    If we have a statement, we can just use it and avoid the range_on_edge
    dance.  Plus we can use all the range_of_expr fanciness.

    Tested on x86-64 and ppc64le Linux with the usual regstrap.  I also
    verified that the before and after number of threads was the same or
    greater in a suite of .ii files from a bootstrap.

    gcc/ChangeLog:

            PR tree-optimization/102943
            * gimple-range-path.cc (path_range_query::range_on_path_entry):
            Prefer range_of_expr unless there are no statements in the BB.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2021-11-04 14:40 ` cvs-commit at gcc dot gnu.org
@ 2021-11-04 15:24 ` aldyh at gcc dot gnu.org
  2021-11-04 17:00   ` Jan Hubicka
  2021-11-04 17:00 ` hubicka at kam dot mff.cuni.cz
                   ` (34 subsequent siblings)
  53 siblings, 1 reply; 57+ messages in thread
From: aldyh at gcc dot gnu.org @ 2021-11-04 15:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

Aldy Hernandez <aldyh at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |103058

--- Comment #18 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
521.wrf_r is no longer building.  Seems to be the same issue as in PR103058.

during GIMPLE pass: alias
module_fr_fire_phys.fppized.f90: In function 'init_fuel_cats':
module_fr_fire_phys.fppized.f90:136:25: internal compiler error: in
gimple_call_static_chain_flags, at gimple.c:1669
  136 | subroutine init_fuel_cats
      |                         ^
0x6957b5 gimple_call_static_chain_flags(gcall const*)
        /home/aldyh/src/clean/gcc/gimple.c:1669
0xf72359 handle_rhs_call
        /home/aldyh/src/clean/gcc/tree-ssa-structalias.c:4258
0xf75014 find_func_aliases_for_call
        /home/aldyh/src/clean/gcc/tree-ssa-structalias.c:4921
0xf75014 find_func_aliases
        /home/aldyh/src/clean/gcc/tree-ssa-structalias.c:5024
0xf76906 compute_points_to_sets
        /home/aldyh/src/clean/gcc/tree-ssa-structalias.c:7440
0xf76906 compute_may_aliases()
        /home/aldyh/src/clean/gcc/tree-ssa-structalias.c:7948
0xc84884 execute_function_todo
        /home/aldyh/src/clean/gcc/passes.c:2014
0xc851fe execute_todo
        /home/aldyh/src/clean/gcc/passes.c:2096
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
make[1]: *** [/tmp/cc3vtkEO.mk:116: /tmp/ccUtPK6E.ltrans38.ltrans.o] Error 1
make[1]: *** Waiting for unfinished jobs....


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103058
[Bug 103058] [12 Regression] ICE in gimple_call_static_chain_flags, at
gimple.c:1669 when building 527.cam4_r

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-11-04 15:24 ` aldyh at gcc dot gnu.org
@ 2021-11-04 17:00   ` Jan Hubicka
  0 siblings, 0 replies; 57+ messages in thread
From: Jan Hubicka @ 2021-11-04 17:00 UTC (permalink / raw)
  To: aldyh at gcc dot gnu.org; +Cc: gcc-bugs

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943
> 
> Aldy Hernandez <aldyh at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>          Depends on|                            |103058
> 
> --- Comment #18 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> 251.wrf_r is no longer building.  Seems to be the same issue in PR103058.
> 
> during GIMPLE pass: alias
> module_fr_fire_phys.fppized.f90: In function 'init_fuel_cats':
> module_fr_fire_phys.fppized.f90:136:25: internal compiler error: in
> gimple_call_static_chain_flags, at gimple.c:1669
>   136 | subroutine init_fuel_cats
>       |                         ^
> 0x6957b5 gimple_call_static_chain_flags(gcall const*)
>         /home/aldyh/src/clean/gcc/gimple.c:1669

I have committed a workaround for this.
However, here it looks like a frontend issue - I do not think Fortran
should produce nested functions with external linkage.  At least there
seems to be no good reason for doing so, since they cannot be called
cross-module.

Honza


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2021-11-04 15:24 ` aldyh at gcc dot gnu.org
@ 2021-11-04 17:00 ` hubicka at kam dot mff.cuni.cz
  2021-11-05  9:08 ` aldyh at gcc dot gnu.org
                   ` (33 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-11-04 17:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #19 from hubicka at kam dot mff.cuni.cz ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943
> 
> Aldy Hernandez <aldyh at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>          Depends on|                            |103058
> 
> --- Comment #18 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> 251.wrf_r is no longer building.  Seems to be the same issue in PR103058.
> 
> during GIMPLE pass: alias
> module_fr_fire_phys.fppized.f90: In function 'init_fuel_cats':
> module_fr_fire_phys.fppized.f90:136:25: internal compiler error: in
> gimple_call_static_chain_flags, at gimple.c:1669
>   136 | subroutine init_fuel_cats
>       |                         ^
> 0x6957b5 gimple_call_static_chain_flags(gcall const*)
>         /home/aldyh/src/clean/gcc/gimple.c:1669

I have committed a workaround for this.
However, here it looks like a frontend issue - I do not think Fortran
should produce nested functions with external linkage.  At least there
seems to be no good reason for doing so, since they cannot be called
cross-module.

Honza

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (19 preceding siblings ...)
  2021-11-04 17:00 ` hubicka at kam dot mff.cuni.cz
@ 2021-11-05  9:08 ` aldyh at gcc dot gnu.org
  2021-11-05 11:10 ` marxin at gcc dot gnu.org
                   ` (32 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: aldyh at gcc dot gnu.org @ 2021-11-05  9:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #20 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
With attachment 51726 and current trunk, the present damage is 22% for the
ltrans105 unit, which, AFAICT, is the worst offender.  This is much better
than the original 44%, but still not ideal.

After some more poking, it looks more like a ranger issue than a threader
issue, so I'll write some notes on how to reproduce for the SPEC-challenged
(ahem, me).

I have the following in my config/myconfig.cfg file to avoid LD_LIBRARY_PATH
magic and to save all the build temporary files, since I'm LTO impaired ;-)

default:
BASEDIR = /home/aldyh/bld/mycompiler/install
LDFLAGS = -Wl,-rpath=/usr/lib/gcc/x86_64-redhat-linux/11/
-L/usr/lib/gcc/x86_64-redhat-linux/11/
...
...
default=base:
OPTIMIZE    = -Ofast -flto=20 -march=native -ftime-report -fno-checking
-save-temps -v
CXXOPTIMIZE  = -fpermissive
FOPTIMIZE    = -std=legacy

$ runcpu --config=myconfig -a build -D --noreportable -I -T base -i test  -n 1
521.wrf_r

Then I hunt down some make.wrf_r.out file which has the lto1 magic for the
ltrans105 unit:

/some/dir/lto1 [lots-of-flags] -ftime-report -fno-checking -fltrans
./wrf_r.ltrans105.o -o ./wrf_r.ltrans105.ltrans.s

After which, I can re-run lto1 from the directory with all the LTO/.o files
(the one with the make.wrf_r.out directory).

For the record, I hate the SPEC build system :).

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (20 preceding siblings ...)
  2021-11-05  9:08 ` aldyh at gcc dot gnu.org
@ 2021-11-05 11:10 ` marxin at gcc dot gnu.org
  2021-11-05 11:13 ` aldyh at gcc dot gnu.org
                   ` (31 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-11-05 11:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #21 from Martin Liška <marxin at gcc dot gnu.org> ---
> For the record, I hate the SPEC build system :).

Then you're the first one! No, just kidding, it's cumbersome, and feel free to
contact me with questions regarding that...

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (21 preceding siblings ...)
  2021-11-05 11:10 ` marxin at gcc dot gnu.org
@ 2021-11-05 11:13 ` aldyh at gcc dot gnu.org
  2021-11-05 11:23 ` marxin at gcc dot gnu.org
                   ` (30 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: aldyh at gcc dot gnu.org @ 2021-11-05 11:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #22 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
(In reply to Martin Liška from comment #21)
> > For the record, I hate the SPEC build system :).
> 
> Then you're the first one! No, just kidding, it's cumbersome, and feel free
> to contact me with questions regarding that...

:-P

Thanks so much for your work narrowing down these bugs.  It's been a real time
saver, especially your work on PR103061 which is also a SPEC issue.  I just
reproduced it thanks to your awesome tips.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (22 preceding siblings ...)
  2021-11-05 11:13 ` aldyh at gcc dot gnu.org
@ 2021-11-05 11:23 ` marxin at gcc dot gnu.org
  2021-11-05 17:16 ` cvs-commit at gcc dot gnu.org
                   ` (29 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: marxin at gcc dot gnu.org @ 2021-11-05 11:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #23 from Martin Liška <marxin at gcc dot gnu.org> ---
(In reply to Aldy Hernandez from comment #22)
> (In reply to Martin Liška from comment #21)
> > > For the record, I hate the SPEC build system :).
> > 
> > Then you're the first one! No, just kidding, it's cumbersome, and feel free
> > to contact me with questions regarding that...
> 
> :-P
> 
> Thanks so much for your work narrowing down these bugs.

You're welcome.

>  It's been a real
> time saver, especially your work on PR103061 which is also a SPEC issue.  I
> just reproduced it thanks to your awesome tips.

Good! Looking forward to the fix for it :P

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (23 preceding siblings ...)
  2021-11-05 11:23 ` marxin at gcc dot gnu.org
@ 2021-11-05 17:16 ` cvs-commit at gcc dot gnu.org
  2021-11-07 17:17 ` hubicka at gcc dot gnu.org
                   ` (28 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-11-05 17:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #24 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Andrew Macleod <amacleod@gcc.gnu.org>:

https://gcc.gnu.org/g:98244c68e77cf75f93b66ee02df059f718c3fbc0

commit r12-4947-g98244c68e77cf75f93b66ee02df059f718c3fbc0
Author: Andrew MacLeod <amacleod@redhat.com>
Date:   Thu Nov 4 15:08:06 2021 -0400

    Abstract ranger cache update list.

    Make it more efficient by removing the call to vec::contains.

            PR tree-optimization/102943
            * gimple-range-cache.cc (class update_list): New.
            (update_list::add): Replace add_to_update.
            (update_list::pop): New.
            (ranger_cache::ranger_cache): Adjust.
            (ranger_cache::~ranger_cache): Adjust.
            (ranger_cache::add_to_update): Delete.
            (ranger_cache::propagate_cache): Adjust to new class.
            (ranger_cache::propagate_updated_value): Ditto.
            (ranger_cache::fill_block_cache): Ditto.
            * gimple-range-cache.h (class ranger_cache): Adjust to update
class.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (24 preceding siblings ...)
  2021-11-05 17:16 ` cvs-commit at gcc dot gnu.org
@ 2021-11-07 17:17 ` hubicka at gcc dot gnu.org
  2021-11-07 18:16 ` aldyh at gcc dot gnu.org
                   ` (27 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-11-07 17:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #25 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
LNT sees a new regression in WRF build times (around 6%) on Nov 3

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=287.548.8
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=289.270.8

The revision range is 
g:2d01bef2f214bb80dca0e91c14e95cf4d76b0afb..cd389e5f9447d21084abff5d4b7db6cf1da74e57
which contains the switch of vrp2 to ranger, so I guess that is the likely
reason.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (25 preceding siblings ...)
  2021-11-07 17:17 ` hubicka at gcc dot gnu.org
@ 2021-11-07 18:16 ` aldyh at gcc dot gnu.org
  2021-11-07 18:59   ` Jan Hubicka
  2021-11-07 18:59 ` hubicka at kam dot mff.cuni.cz
                   ` (26 subsequent siblings)
  53 siblings, 1 reply; 57+ messages in thread
From: aldyh at gcc dot gnu.org @ 2021-11-07 18:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #26 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #25)
> LNT sees new regresion on WRF build times (around 6%) at Nov 3
> 
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=287.548.8
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=289.270.8
> 
> The revision range is 
> g:2d01bef2f214bb80dca0e91c14e95cf4d76b0afb..
> cd389e5f9447d21084abff5d4b7db6cf1da74e57 which contains switch of vrp2 to
> ranger, so I guess it is the likely reason.

This PR is still open, at least for the slowdown in the threader with LTO.  The
issue is ranger-wide, so it may also cause slowdowns on non-LTO builds for
WRF, though I haven't checked.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-11-07 18:16 ` aldyh at gcc dot gnu.org
@ 2021-11-07 18:59   ` Jan Hubicka
  0 siblings, 0 replies; 57+ messages in thread
From: Jan Hubicka @ 2021-11-07 18:59 UTC (permalink / raw)
  To: aldyh at gcc dot gnu.org; +Cc: gcc-bugs

> 
> This PR is still open, at least for slowdown in the threader with LTO.  The
> issue is ranger wide, so it may also cause slowdowns  on non-LTO builds for
> WRF, though I haven't checked.
I just wanted to record the fact somewhere since I was looking up the
revision range mostly to figure out if there was modref change that may
cause this.

Non-LTO builds seem fine.  I suppose LTO is needed to make big enough
CFGs.  Thanks for looking into it.

Honza


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (26 preceding siblings ...)
  2021-11-07 18:16 ` aldyh at gcc dot gnu.org
@ 2021-11-07 18:59 ` hubicka at kam dot mff.cuni.cz
  2021-11-12 22:14 ` hubicka at gcc dot gnu.org
                   ` (25 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2021-11-07 18:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #27 from hubicka at kam dot mff.cuni.cz ---
> 
> This PR is still open, at least for slowdown in the threader with LTO.  The
> issue is ranger wide, so it may also cause slowdowns  on non-LTO builds for
> WRF, though I haven't checked.
I just wanted to record the fact somewhere since I was looking up the
revision range mostly to figure out if there was modref change that may
cause this.

Non-LTO builds seem fine.  I suppose LTO is needed to make big enough
CFGs.  Thanks for looking into it.

Honza

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (27 preceding siblings ...)
  2021-11-07 18:59 ` hubicka at kam dot mff.cuni.cz
@ 2021-11-12 22:14 ` hubicka at gcc dot gnu.org
  2021-11-14  9:58 ` hubicka at gcc dot gnu.org
                   ` (24 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-11-12 22:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943
Bug 102943 depends on bug 103058, which changed state.

Bug 103058 Summary: [12 Regression] ICE in gimple_call_static_chain_flags, at gimple.c:1669 when building 527.cam4_r
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103058

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (28 preceding siblings ...)
  2021-11-12 22:14 ` hubicka at gcc dot gnu.org
@ 2021-11-14  9:58 ` hubicka at gcc dot gnu.org
  2021-11-26 12:38 ` cvs-commit at gcc dot gnu.org
                   ` (23 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-11-14  9:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #28 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
A bit unrelated, but it shows that the threader seems a bit expensive on other
builds too.  Getting stats from a cc1plus LTO link with -flto-partition=one,
it seems that the backwards threader and DOM are the two slowest tree passes.

We get
 - 1% of build time for CCP, forward propagate, slp vectrization 
 - 2% of build time for cfgcleanup, VRP, PTA, PRE, FRE
 - 3% of build time for dominator optimization
 - 4% of build time for backwards jump threading

For RTL we get
 - 1% of build time for fwprop, dse1, dse2, loop init, CPROP, CSE2, LRA live
ranges
 - 2% of build time for CSE, PRE, combiner, LRA non-specific, reload CSE, 
 - 3% for combiner
 - 4% for IRA and scheduler

Time variable                                   usr           sys          wall
          GGC
 phase setup                        :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)
 2301k (  0%)
 phase opt and generate             :1312.92 ( 98%)  70.40 ( 97%)1386.30 ( 98%)
26038M ( 97%)
 phase last asm                     :  27.07 (  2%)   1.63 (  2%)  28.77 (  2%)
  376M (  1%)
 phase stream in                    :   0.96 (  0%)   0.32 (  0%)   1.29 (  0%)
  464M (  2%)
 phase finalize                     :   3.64 (  0%)   0.47 (  1%)   4.12 (  0%)
    0  (  0%)
 garbage collection                 :  27.45 (  2%)   0.04 (  0%)  27.54 (  2%)
    0  (  0%)
 dump files                         :   3.53 (  0%)   0.35 (  0%)   4.37 (  0%)
    0  (  0%)
 callgraph functions expansion      :1311.82 ( 98%)  70.34 ( 97%)1385.15 ( 98%)
26022M ( 97%)
 callgraph ipa passes               :   0.18 (  0%)   0.00 (  0%)   0.18 (  0%)
    0  (  0%)
 ipa dead code removal              :   0.35 (  0%)   0.01 (  0%)   0.37 (  0%)
    0  (  0%)
 ipa virtual call target            :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)
  272  (  0%)
 ipa cp                             :   0.12 (  0%)   0.01 (  0%)   0.16 (  0%)
   49M (  0%)
 ipa inlining heuristics            :   9.04 (  1%)   0.98 (  1%)   9.75 (  1%)
  402M (  1%)
 lto stream decompression           :   1.66 (  0%)   0.17 (  0%)   1.60 (  0%)
    0  (  0%)
 ipa lto gimple in                  :  36.64 (  3%)   3.55 (  5%)  40.05 (  3%)
 3138M ( 12%)
 ipa lto decl in                    :   0.39 (  0%)   0.21 (  0%)   0.61 (  0%)
  137M (  1%)
 ipa lto constructors in            :   0.45 (  0%)   0.05 (  0%)   0.45 (  0%)
   60M (  0%)
 ipa lto cgraph I/O                 :   0.36 (  0%)   0.09 (  0%)   0.44 (  0%)
  274M (  1%)
 ipa reference                      :   0.00 (  0%)   0.01 (  0%)   0.00 (  0%)
    0  (  0%)
 ipa pure const                     :   0.51 (  0%)   0.06 (  0%)   0.65 (  0%)
  342k (  0%)
 ipa modref                         :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)
 4272k (  0%)
 cfg construction                   :   0.84 (  0%)   0.03 (  0%)   0.87 (  0%)
   86M (  0%)
 cfg cleanup                        :  26.67 (  2%)   0.53 (  1%)  28.48 (  2%)
  199M (  1%)
 trivially dead code                :   8.90 (  1%)   0.36 (  0%)   9.26 (  1%)
  166k (  0%)
 df scan insns                      :   5.46 (  0%)   0.23 (  0%)   5.41 (  0%)
 1498k (  0%)
 df reaching defs                   :  13.04 (  1%)   0.17 (  0%)  12.73 (  1%)
    0  (  0%)
 df live regs                       :  48.91 (  4%)   0.75 (  1%)  49.22 (  3%)
   14M (  0%)
 df live&initialized regs           :  20.46 (  2%)   0.33 (  0%)  20.80 (  1%)
    0  (  0%)
 df must-initialized regs           :   1.05 (  0%)   0.01 (  0%)   1.00 (  0%)
    0  (  0%)
 df use-def / def-use chains        :   5.91 (  0%)   0.08 (  0%)   6.25 (  0%)
    0  (  0%)
 df live reg subwords               :   0.00 (  0%)   0.00 (  0%)   0.05 (  0%)
    0  (  0%)
 df reg dead/unused notes           :  18.21 (  1%)   0.24 (  0%)  18.56 (  1%)
  223M (  1%)
 register information               :   4.60 (  0%)   0.08 (  0%)   4.21 (  0%)
    0  (  0%)
 alias analysis                     :  19.76 (  1%)   0.28 (  0%)  19.63 (  1%)
  478M (  2%)
 alias stmt walking                 :  29.06 (  2%)   2.30 (  3%)  30.32 (  2%)
   65M (  0%)
 register scan                      :   2.12 (  0%)   0.06 (  0%)   2.50 (  0%)
   20M (  0%)
 rebuild jump labels                :   2.65 (  0%)   0.05 (  0%)   2.86 (  0%)
  576  (  0%)
 integration                        :  35.83 (  3%)   6.81 (  9%)  41.90 (  3%)
 2650M ( 10%)
 tree CFG cleanup                   :  21.87 (  2%)   2.23 (  3%)  24.72 (  2%)
   35M (  0%)
 tree tail merge                    :   3.02 (  0%)   0.18 (  0%)   3.04 (  0%)
  132M (  0%)
 tree VRP                           :  29.83 (  2%)   1.92 (  3%)  32.77 (  2%)
  354M (  1%)
 tree copy propagation              :   5.06 (  0%)   0.36 (  0%)   4.95 (  0%)
 6205k (  0%)
 tree PTA                           :  22.85 (  2%)   1.54 (  2%)  24.96 (  2%)
  107M (  0%)
 tree SSA incremental               :  35.53 (  3%)   1.74 (  2%)  36.94 (  3%)
  381M (  1%)
 tree operand scan                  :  44.83 (  3%)   6.17 (  8%)  50.12 (  4%)
 1028M (  4%)
 dominator optimization             :  43.80 (  3%)   3.02 (  4%)  47.50 (  3%)
  566M (  2%)
 backwards jump threading           :  49.72 (  4%)   2.27 (  3%)  53.04 (  4%)
  412M (  2%)
 tree SRA                           :   0.61 (  0%)   0.07 (  0%)   0.64 (  0%)
   14M (  0%)
 isolate eroneous paths             :   0.70 (  0%)   0.04 (  0%)   0.68 (  0%)
 7987k (  0%)
 tree CCP                           :  18.10 (  1%)   1.54 (  2%)  18.86 (  1%)
   62M (  0%)
 tree split crit edges              :   0.18 (  0%)   0.01 (  0%)   0.20 (  0%)
   37M (  0%)
 tree reassociation                 :   2.34 (  0%)   0.21 (  0%)   2.60 (  0%)
   10M (  0%)
 tree PRE                           :  24.31 (  2%)   1.57 (  2%)  26.42 (  2%)
  394M (  1%)
 tree FRE                           :  24.13 (  2%)   2.07 (  3%)  26.24 (  2%)
  119M (  0%)
 tree code sinking                  :   2.74 (  0%)   0.24 (  0%)   3.18 (  0%)
  315M (  1%)
 tree linearize phis                :   1.43 (  0%)   0.08 (  0%)   1.69 (  0%)
   42M (  0%)
 tree backward propagate            :   0.45 (  0%)   0.07 (  0%)   0.53 (  0%)
   64  (  0%)
 tree forward propagate             :   9.53 (  1%)   0.97 (  1%)  10.80 (  1%)
   65M (  0%)
 tree phiprop                       :   0.16 (  0%)   0.01 (  0%)   0.16 (  0%)
  299k (  0%)
 tree conservative DCE              :   6.41 (  0%)   0.68 (  1%)   7.26 (  1%)
 9555k (  0%)
 tree buildin call DCE              :   0.05 (  0%)   0.02 (  0%)   0.13 (  0%)
    0  (  0%)
 tree DSE                           :   5.29 (  0%)   0.32 (  0%)   5.65 (  0%)
   35M (  0%)
 PHI merge                          :   1.85 (  0%)   0.02 (  0%)   1.94 (  0%)
   25M (  0%)
 tree loop optimization             :   0.03 (  0%)   0.00 (  0%)   0.03 (  0%)
    0  (  0%)
 loopless fn                        :   0.02 (  0%)   0.00 (  0%)   0.01 (  0%)
    0  (  0%)
 tree loop invariant motion         :   2.18 (  0%)   0.13 (  0%)   2.35 (  0%)
 3995k (  0%)
 tree canonical iv                  :   0.95 (  0%)   0.05 (  0%)   0.90 (  0%)
   21M (  0%)
 scev constant prop                 :   0.29 (  0%)   0.02 (  0%)   0.23 (  0%)
 5916k (  0%)
 complete unrolling                 :   3.86 (  0%)   0.31 (  0%)   3.79 (  0%)
  100M (  0%)
 tree vectorization                 :   0.45 (  0%)   0.01 (  0%)   0.47 (  0%)
   16M (  0%)
 tree slp vectorization             :  10.89 (  1%)   5.59 (  8%)  16.42 (  1%)
  767M (  3%)
 tree loop distribution             :   0.68 (  0%)   0.07 (  0%)   0.87 (  0%)
   10M (  0%)
 tree iv optimization               :   5.06 (  0%)   0.23 (  0%)   5.58 (  0%)
  144M (  1%)
 predictive commoning               :   0.86 (  0%)   0.09 (  0%)   0.86 (  0%)
   22M (  0%)
 tree copy headers                  :   1.35 (  0%)   0.11 (  0%)   1.58 (  0%)
   46M (  0%)
 tree SSA uncprop                   :   1.05 (  0%)   0.14 (  0%)   0.91 (  0%)
   79k (  0%)
 tree NRV optimization              :   0.06 (  0%)   0.01 (  0%)   0.06 (  0%)
  543k (  0%)
 tree switch lowering               :   0.79 (  0%)   0.03 (  0%)   0.75 (  0%)
   37M (  0%)
 gimple CSE sin/cos                 :   0.19 (  0%)   0.00 (  0%)   0.11 (  0%)
    0  (  0%)
 gimple widening/fma detection      :   0.62 (  0%)   0.04 (  0%)   0.69 (  0%)
  828k (  0%)
 tree strlen optimization           :   1.52 (  0%)   0.13 (  0%)   1.49 (  0%)
   78M (  0%)
 tree modref                        :   1.06 (  0%)   0.09 (  0%)   1.12 (  0%)
   14M (  0%)
 dominance frontiers                :   1.22 (  0%)   0.09 (  0%)   1.61 (  0%)
    0  (  0%)
 dominance computation              :  14.46 (  1%)   0.98 (  1%)  15.53 (  1%)
    0  (  0%)
 control dependences                :   0.26 (  0%)   0.01 (  0%)   0.21 (  0%)
    0  (  0%)
 out of ssa                         :   3.31 (  0%)   0.26 (  0%)   3.53 (  0%)
 6688k (  0%)
 expand vars                        :  10.59 (  1%)   0.19 (  0%)  10.92 (  1%)
  132M (  0%)
 expand                             :  23.83 (  2%)   1.16 (  2%)  24.71 (  2%)
 2393M (  9%)
 post expand cleanups               :   1.95 (  0%)   0.10 (  0%)   2.23 (  0%)
   83M (  0%)
 varconst                           :   0.00 (  0%)   0.02 (  0%)   0.03 (  0%)
    0  (  0%)
 lower subreg                       :   0.33 (  0%)   0.01 (  0%)   0.33 (  0%)
  570k (  0%)
 jump                               :   0.20 (  0%)   0.02 (  0%)   0.13 (  0%)
    0  (  0%)
 forward prop                       :  13.81 (  1%)   0.44 (  1%)  14.32 (  1%)
   20M (  0%)
 CSE                                :  24.85 (  2%)   0.58 (  1%)  26.76 (  2%)
   75M (  0%)
 dead code elimination              :   3.46 (  0%)   0.10 (  0%)   3.59 (  0%)
   16k (  0%)
 dead store elim1                   :   7.91 (  1%)   0.20 (  0%)   7.89 (  1%)
  127M (  0%)
 dead store elim2                   :   7.83 (  1%)   0.19 (  0%)   7.90 (  1%)
  163M (  1%)
 loop analysis                      :   0.12 (  0%)   0.01 (  0%)   0.11 (  0%)
    0  (  0%)
 loop init                          :   8.77 (  1%)   0.77 (  1%)   9.89 (  1%)
  498M (  2%)
 loop invariant motion              :   1.31 (  0%)   0.02 (  0%)   1.25 (  0%)
 3328k (  0%)
 loop fini                          :   0.85 (  0%)   0.06 (  0%)   1.07 (  0%)
  228k (  0%)
 CPROP                              :  18.61 (  1%)   0.56 (  1%)  19.28 (  1%)
  434M (  2%)
 PRE                                :  25.73 (  2%)   0.42 (  1%)  26.09 (  2%)
   14M (  0%)
 CSE 2                              :  14.80 (  1%)   0.36 (  0%)  14.99 (  1%)
   40M (  0%)
 branch prediction                  :   0.13 (  0%)   0.02 (  0%)   0.18 (  0%)
 1024k (  0%)
 combiner                           :  38.90 (  3%)   0.75 (  1%)  39.58 (  3%)
  729M (  3%)
 if-conversion                      :   4.21 (  0%)   0.11 (  0%)   4.23 (  0%)
   83M (  0%)
 mode switching                     :   0.02 (  0%)   0.00 (  0%)   0.03 (  0%)
    0  (  0%)
 integrated RA                      :  59.64 (  4%)   1.08 (  1%)  59.95 (  4%)
 1714M (  6%)
 LRA non-specific                   :  21.09 (  2%)   0.27 (  0%)  21.33 (  2%)
  123M (  0%)
 LRA virtuals elimination           :   2.74 (  0%)   0.06 (  0%)   2.68 (  0%)
   64M (  0%)
 LRA reload inheritance             :   4.85 (  0%)   0.03 (  0%)   4.47 (  0%)
   70M (  0%)
 LRA create live ranges             :  12.15 (  1%)   0.09 (  0%)  12.34 (  1%)
   15M (  0%)
 LRA hard reg assignment            :   2.17 (  0%)   0.03 (  0%)   2.38 (  0%)
    0  (  0%)
 LRA coalesce pseudo regs           :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)
    0  (  0%)
 LRA rematerialization              :   3.05 (  0%)   0.06 (  0%)   3.17 (  0%)
 9312  (  0%)
 reload                             :   0.22 (  0%)   0.01 (  0%)   0.32 (  0%)
 1134k (  0%)
 reload CSE regs                    :  21.03 (  2%)   0.38 (  1%)  22.08 (  2%)
  229M (  1%)
 ree                                :   2.00 (  0%)   0.04 (  0%)   1.93 (  0%)
 8158k (  0%)
 thread pro- & epilogue             :   4.69 (  0%)   0.11 (  0%)   4.79 (  0%)
  100M (  0%)
 if-conversion 2                    :   0.94 (  0%)   0.02 (  0%)   1.11 (  0%)
 2090k (  0%)
 combine stack adjustments          :   1.22 (  0%)   0.02 (  0%)   0.89 (  0%)
   35k (  0%)
 peephole 2                         :   4.54 (  0%)   0.07 (  0%)   5.08 (  0%)
   45M (  0%)
 hard reg cprop                     :   6.08 (  0%)   0.15 (  0%)   6.07 (  0%)
 7038k (  0%)
 scheduling 2                       :  54.76 (  4%)   0.99 (  1%)  57.50 (  4%)
  100M (  0%)
 machine dep reorg                  :   3.49 (  0%)   0.07 (  0%)   3.93 (  0%)
 1502k (  0%)
 reorder blocks                     :   6.27 (  0%)   0.09 (  0%)   5.70 (  0%)
  127M (  0%)
 shorten branches                   :   4.66 (  0%)   0.06 (  0%)   4.75 (  0%)
   41k (  0%)
 reg stack                          :   0.05 (  0%)   0.00 (  0%)   0.10 (  0%)
    0  (  0%)
 final                              :  22.18 (  2%)   1.11 (  2%)  23.16 (  2%)
 1246M (  5%)
 variable output                    :   0.74 (  0%)   0.02 (  0%)   0.76 (  0%)
   14M (  0%)
 symout                             :  35.49 (  3%)   1.93 (  3%)  37.27 (  3%)
 2915M ( 11%)
 variable tracking                  :  23.14 (  2%)   0.44 (  1%)  23.53 (  2%)
  801M (  3%)
 var-tracking dataflow              :  34.59 (  3%)   0.18 (  0%)  34.46 (  2%)
   21M (  0%)
 var-tracking emit                  :  25.94 (  2%)   0.22 (  0%)  26.23 (  2%)
  671M (  2%)
 tree if-combine                    :   0.64 (  0%)   0.09 (  0%)   0.65 (  0%)
   12M (  0%)
 uninit var analysis                :   0.19 (  0%)   0.01 (  0%)   0.10 (  0%)
   42k (  0%)
 straight-line strength reduction   :   1.18 (  0%)   0.05 (  0%)   1.08 (  0%)
 8911k (  0%)
 store merging                      :   1.09 (  0%)   0.15 (  0%)   1.09 (  0%)
   16M (  0%)
 initialize rtl                     :   0.03 (  0%)   0.00 (  0%)   0.04 (  0%)
   34k (  0%)
 address lowering                   :   0.09 (  0%)   0.01 (  0%)   0.22 (  0%)
 2001k (  0%)
 tree loop if-conversion            :   0.25 (  0%)   0.03 (  0%)   0.35 (  0%)
 8491k (  0%)
 unaccounted optimizations          :   0.03 (  0%)   0.00 (  0%)   0.03 (  0%)
    0  (  0%)
 rest of compilation                :  24.17 (  2%)   1.66 (  2%)  26.17 (  2%)
  142M (  1%)
 unaccounted late compilation       :   0.02 (  0%)   0.00 (  0%)   0.00 (  0%)
    0  (  0%)
 remove unused locals               :   4.95 (  0%)   0.68 (  1%)   5.75 (  0%)
  420k (  0%)
 address taken                      :   3.96 (  0%)   1.38 (  2%)   5.19 (  0%)
    0  (  0%)
 rebuild frequencies                :   0.76 (  0%)   0.13 (  0%)   1.18 (  0%)
 2998k (  0%)
 repair loop structures             :   0.36 (  0%)   0.02 (  0%)   0.47 (  0%)
   82k (  0%)
 TOTAL                              :1344.59         72.82       1420.49       
26881M

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (29 preceding siblings ...)
  2021-11-14  9:58 ` hubicka at gcc dot gnu.org
@ 2021-11-26 12:38 ` cvs-commit at gcc dot gnu.org
  2021-11-30 10:55 ` aldyh at gcc dot gnu.org
                   ` (22 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-11-26 12:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #29 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:

https://gcc.gnu.org/g:a70faf6e4df7481c2c9a08a06657c20beb3043de

commit r12-5538-ga70faf6e4df7481c2c9a08a06657c20beb3043de
Author: Jan Hubicka <jh@suse.cz>
Date:   Fri Nov 26 13:36:35 2021 +0100

    Fix handling of in_flags in update_escape_summary_1

    update_escape_summary_1 has a thinko where it computes the proper
    min_flags but then stores the original value (ignoring whether there
    was a dereference in the escape point).

            PR ipa/102943
            * ipa-modref.c (update_escape_summary_1): Fix handling of
            min_flags.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (30 preceding siblings ...)
  2021-11-26 12:38 ` cvs-commit at gcc dot gnu.org
@ 2021-11-30 10:55 ` aldyh at gcc dot gnu.org
  2021-12-09 20:17 ` hubicka at gcc dot gnu.org
                   ` (21 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: aldyh at gcc dot gnu.org @ 2021-11-30 10:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #30 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #28)
> Bit unrelated but shows that threader seems bit expensive on other builds
> too.
> Getting stats from cc1plus LTO-link with -flto-partition=one it seems that
> backwards threader and dom are two slowest tree passes.
> 
> We get
>  - 1% of build time for CCP, forward propagate, slp vectrization 
>  - 2% of build time for cfgcleanup, VRP, PTA, PRE, FRE
>  - 3% of build time for dominator optimization
>  - 4% of build time for backwards jump threading

>  tree VRP                           :  29.83 (  2%)   1.92 (  3%)  32.77 (  2%)   354M (  1%)
...
...
> 3%)   566M (  2%)
>  backwards jump threading           :  49.72 (  4%)   2.27 (  3%)  53.04 ( 

This looks like the issue in PR103409.

Does this fix the problem?

https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585658.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (31 preceding siblings ...)
  2021-11-30 10:55 ` aldyh at gcc dot gnu.org
@ 2021-12-09 20:17 ` hubicka at gcc dot gnu.org
  2022-01-03  8:47 ` rguenth at gcc dot gnu.org
                   ` (20 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: hubicka at gcc dot gnu.org @ 2021-12-09 20:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #31 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
With g:r12-5872-gf157c5362b4844f7676cae2aba81a4cf75bd68d5 I limited the
inliner to not produce such a large function, so bumping up --param
max-inline-functions-called-once-insns will be necessary to reproduce the
problem.

I will re-do the time stats for the 1-partition cc1plus LTO build, but I did
the same for linking clang with LTO; the passes taking over 2% of the time
are:
Time variable                                   usr           sys          wall
          GGC
 garbage collection                 :  26.88 (  3%)   0.27 (  1%)  28.53 (  3%)
    0  (  0%)
 ipa lto gimple in                  :  19.84 (  2%)   1.86 (  4%)  28.42 (  3%)
 3660M ( 13%)
 cfg cleanup                        :  15.36 (  2%)   0.39 (  1%)  16.38 (  2%)
  154M (  1%)
 df live&initialized regs           :  16.20 (  2%)   0.35 (  1%)  18.29 (  2%)
    0  (  0%)
 df reg dead/unused notes           :  17.94 (  2%)   0.43 (  1%)  19.35 (  2%)
  361M (  1%)
 alias analysis                     :  19.87 (  2%)   0.24 (  1%)  21.05 (  2%)
  853M (  3%)
 alias stmt walking                 :  22.90 (  2%)   1.25 (  3%)  25.44 (  2%)
   68M (  0%)
 tree VRP                           :  22.79 (  2%)   1.06 (  2%)  24.22 (  2%)
  488M (  2%)
 tree PTA                           :  30.37 (  3%)   1.84 (  4%)  32.71 (  3%)
  215M (  1%)
 tree operand scan                  :  19.04 (  2%)   0.86 (  2%)  21.70 (  2%)
 1233M (  5%)
 dominator optimization             :  28.85 (  3%)   1.66 (  4%)  32.00 (  3%)
  531M (  2%)
 backwards jump threading           :  23.00 (  2%)   1.37 (  3%)  24.52 (  2%)
  154M (  1%)
 tree PRE                           :  21.81 (  2%)   1.11 (  3%)  24.55 (  2%)
  500M (  2%)
 tree FRE                           :  18.69 (  2%)   1.40 (  3%)  20.12 (  2%)
  221M (  1%)
 expand                             :  21.35 (  2%)   1.14 (  3%)  23.27 (  2%)
 2399M (  9%)
 CSE                                :  20.69 (  2%)   0.78 (  2%)  21.82 (  2%)
  124M (  0%)
 CPROP                              :  16.27 (  2%)   0.62 (  1%)  17.10 (  2%)
  419M (  2%)
 combiner                           :  27.46 (  3%)   0.81 (  2%)  30.46 (  3%)
  854M (  3%)
 integrated RA                      :  70.76 (  7%)   1.82 (  4%)  72.74 (  7%)
 3795M ( 14%)
 LRA non-specific                   :  19.51 (  2%)   0.54 (  1%)  20.89 (  2%)
  159M (  1%)
 reload CSE regs                    :  20.32 (  2%)   0.49 (  1%)  19.98 (  2%)
  290M (  1%)
 scheduling 2                       :  44.95 (  5%)   0.98 (  2%)  48.23 (  4%)
  163M (  1%)
 rest of compilation                :  30.44 (  3%)   1.92 (  4%)  32.03 (  3%)
  366M (  1%)
 TOTAL                              : 985.14         42.92       1082.89       
27226M

So dominator optimization, VRP, and the backwards threader's search still
seem to be slow on trunk from 20211204.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (32 preceding siblings ...)
  2021-12-09 20:17 ` hubicka at gcc dot gnu.org
@ 2022-01-03  8:47 ` rguenth at gcc dot gnu.org
  2022-01-03 11:20 ` hubicka at kam dot mff.cuni.cz
                   ` (19 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-03  8:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #32 from Richard Biener <rguenth at gcc dot gnu.org> ---
But they are in reasonable territory now, no longer slowest of all but in the
same ballpark as others.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (33 preceding siblings ...)
  2022-01-03  8:47 ` rguenth at gcc dot gnu.org
@ 2022-01-03 11:20 ` hubicka at kam dot mff.cuni.cz
  2022-01-19  7:06 ` rguenth at gcc dot gnu.org
                   ` (18 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2022-01-03 11:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #33 from hubicka at kam dot mff.cuni.cz ---
With the inliner tweaks (which I hope to make a bit more aggressive this
week) we "solved" the wrf compile time with LTO by simply not building
the gigantic functions.  However we still have significant regressions
without LTO.

According to
https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/branch
with spec2006 on kabylake with -O2 we have (build times; later columns
relative to the gcc-6 baseline):
Test Name    gcc-6      gcc-7    gcc-8     gcc-9     gcc-10    gcc-11    gcc-trunk
SPECFP       256.464    3.59%    15.50%    19.29%    29.53%    33.44%    43.50%
SPECint      119.368    3.17%    13.60%    17.23%    14.17%    26.58%    33.58%
and spec2017:
SPECFP       638.337    5.39%    21.20%    25.40%    50.18%    45.00%    58.72%
SPECint      217.977    4.03%    11.47%    16.17%    13.29%    22.52%    27.28%

Growing the SPECFP -O2 build time by over 10% in one release is quite a lot,
though it happened previously with gcc7->gcc8 and gcc9->gcc10 (if I remember
correctly, gcc10 was caused by Fortran revisiting the libgfortran API and
increasing binaries significantly).

The bigger increase in SPECfp compared to SPECint seems to be attributable
to wrf, which regresses 122% compared to gcc6.  The build time graph is
here:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=116.548.8&plot.1=161.548.8&plot.2=68.548.8&plot.3=206.548.8&plot.4=373.548.8&plot.5=430.548.8&plot.6=30.548.8&

So I think this is still about the most important compile time issue we
have.  I plan to look at the profile updating issues in the threader and
hopefully get a bit more familiar with the code to possibly help here.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (34 preceding siblings ...)
  2022-01-03 11:20 ` hubicka at kam dot mff.cuni.cz
@ 2022-01-19  7:06 ` rguenth at gcc dot gnu.org
  2022-03-10 11:37 ` rguenth at gcc dot gnu.org
                   ` (17 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-01-19  7:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2021-10-26 00:00:00         |2022-1-18

--- Comment #34 from Richard Biener <rguenth at gcc dot gnu.org> ---
We are about as fast as GCC 10 now, where GCC 11 was significantly better in
overall compile-time.  But note this cannot really be attributed just to jump
threading anymore (unless of course we thread a lot more).  There is a 1% code
size "jump" around the time the compile-time increases, which can hardly
explain all of the difference.  Note we are now talking about -O2.

I'm leaving this open for more investigation.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (35 preceding siblings ...)
  2022-01-19  7:06 ` rguenth at gcc dot gnu.org
@ 2022-03-10 11:37 ` rguenth at gcc dot gnu.org
  2022-03-10 12:40 ` cvs-commit at gcc dot gnu.org
                   ` (16 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-10 11:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2022-01-18 00:00:00         |2022-3-10

--- Comment #35 from Richard Biener <rguenth at gcc dot gnu.org> ---
So I've re-measured -Ofast -march=znver2 -flto on today's trunk with release
checking (built with GCC 7, not bootstrapped) and the largest LTRANS unit
(ltrans22 at the moment) still has

 tree VRP                           :  15.52 ( 20%)   0.03 (  5%)  15.57 ( 20%)
   28M (  4%)
 backwards jump threading           :  16.17 ( 21%)   0.00 (  0%)  16.15 ( 21%)
 1475k (  0%)
 TOTAL                              :  77.29          0.59         77.92       
  744M

and the 2nd largest (ltrans86 at the moment)

 alias stmt walking                 :   7.70 ( 16%)   0.03 (  8%)   7.70 ( 16%)
  703k (  0%)
 tree VRP                           :   8.25 ( 18%)   0.01 (  3%)   8.27 ( 17%)
   14M (  3%)
 backwards jump threading           :   8.79 ( 19%)   0.00 (  0%)   8.82 ( 19%)
 1645k (  0%)
 TOTAL                              :  46.97          0.38         47.38       
  438M

so it's still by far jump threading/VRP dominating compile times (I wonder
if we should separate "old" and "new" [E]VRP timevars).  Given that VRP
shows up as well, it's more likely the underlying ranger infrastructure.

perf thrown on ltrans22 shows

Samples: 302K of event 'cycles', Event count (approx.): 331301505627
Overhead       Samples  Command      Shared Object     Symbol
  10.34%         31299  lto1-ltrans  lto1              [.] bitmap_get_aligned_chunk
   7.44%         22540  lto1-ltrans  lto1              [.] bitmap_bit_p
   3.17%          9593  lto1-ltrans  lto1              [.] get_immediate_dominator
   2.87%          8668  lto1-ltrans  lto1              [.] determine_value_range
   2.36%          7143  lto1-ltrans  lto1              [.] ranger_cache::propagate_cache
   2.32%          7031  lto1-ltrans  lto1              [.] bitmap_set_bit
   2.20%          6664  lto1-ltrans  lto1              [.] operand_compare::operand_equal_p
   1.88%          5692  lto1-ltrans  lto1              [.] bitmap_set_aligned_chunk
   1.79%          5390  lto1-ltrans  lto1              [.] number_of_iterations_exit_assumptions
   1.66%          5048  lto1-ltrans  lto1              [.] get_continuation_for_phi

callgraph info in perf is a mixed bag, but maybe it helps to pinpoint things:

-   10.20%    10.18%         30364  lto1-ltrans  lto1              [.] bitmap_get_aligned_chunk
   - 10.18% 0xffffffffffffffff
      + 9.16% ranger_cache::propagate_cache
      + 1.01% ranger_cache::fill_block_cache

-    7.84%     7.83%         23509  lto1-ltrans  lto1              [.] bitmap_bit_p
   - 6.20% 0xffffffffffffffff
      + 1.85% fold_using_range::range_of_range_op
      + 1.64% ranger_cache::range_on_edge
      + 1.29% gimple_ranger::range_of_expr

and the most prominent get_immediate_dominator calls are from
back_propagate_equivalences which does

  FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
...
      /* Profiling has shown the domination tests here can be fairly
         expensive.  We get significant improvements by building the
         set of blocks that dominate BB.  We can then just test
         for set membership below.

         We also initialize the set lazily since often the only uses
         are going to be in the same block as DEST.  */
      if (!domby)
        {
          domby = BITMAP_ALLOC (NULL);
          basic_block bb = get_immediate_dominator (CDI_DOMINATORS, dest);
          while (bb)
            {
              bitmap_set_bit (domby, bb->index);
              bb = get_immediate_dominator (CDI_DOMINATORS, bb);
            }
        }

      /* This tests if USE_STMT does not dominate DEST.  */
      if (!bitmap_bit_p (domby, gimple_bb (use_stmt)->index))
        continue;

I think that "optimization" is flawed - a dominance check is cheap if
the DFS numbers are up-to-date:

bool
dominated_by_p (enum cdi_direction dir, const_basic_block bb1,
const_basic_block bb2)
{       
  unsigned int dir_index = dom_convert_dir_to_idx (dir);
  struct et_node *n1 = bb1->dom[dir_index], *n2 = bb2->dom[dir_index];

  gcc_checking_assert (dom_computed[dir_index]);

  if (dom_computed[dir_index] == DOM_OK)
    return (n1->dfs_num_in >= n2->dfs_num_in
            && n1->dfs_num_out <= n2->dfs_num_out);

  return et_below (n1, n2);
}

it's just the fallback that is not.  Also recording _all_ dominators of
'dest' is expensive for a large CFG, but you'll only ever need
dominators up to the definition of 'lhs', which we know will dominate
all use_stmts, so if that does _not_ dominate e->dest no use will
(but I think that's always the case in the current code).  Note
the caller iterates over simple equivalences on an edge, so this
bitmap is populated multiple times (but if we cache it we cannot
prune from the top).  For FP we usually have multiple equivalences,
so caching pays off more than pruning for WRF.  Note this is only
a minor part of the slowness; I'm testing a patch for this part.
Note that for WRF, always going the "slow" dominated_by_p way is as
fast as caching.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (36 preceding siblings ...)
  2022-03-10 11:37 ` rguenth at gcc dot gnu.org
@ 2022-03-10 12:40 ` cvs-commit at gcc dot gnu.org
  2022-03-10 13:22 ` rguenth at gcc dot gnu.org
                   ` (15 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-03-10 12:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #36 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:83bc478d3ba6a40fe3cec72dc9057ceca4dc9137

commit r12-7590-g83bc478d3ba6a40fe3cec72dc9057ceca4dc9137
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Mar 10 12:40:02 2022 +0100

    tree-optimization/102943 - avoid (re-)computing dominance bitmap

    Currently back_propagate_equivalences tries to optimize dominance
    queries in a smart way but it fails to notice that when fast indexes
    are available the dominance query is fast (when called from DOM).
    It also re-computes the dominance bitmap for each equivalence recorded
    on an edge, which for FP are usually several.  Finally it fails to
    use the tree bitmap view for efficiency.  Overall this cuts 7
    seconds of compile-time from originally 77 in the slowest LTRANS
    unit when building 521.wrf_r.

    2022-03-10  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/102943
            * tree-ssa-dom.cc (back_propagate_equivalences): Only
            populate the dominance bitmap if fast queries are not
            available.  Use a tree view bitmap.
            (record_temporary_equivalences): Cache the dominance bitmap
            across all equivalences on the edge.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (37 preceding siblings ...)
  2022-03-10 12:40 ` cvs-commit at gcc dot gnu.org
@ 2022-03-10 13:22 ` rguenth at gcc dot gnu.org
  2022-03-10 13:42 ` cvs-commit at gcc dot gnu.org
                   ` (14 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-10 13:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #37 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'm looking at range_def_chain::m_def_chain; its use is well obfuscated by
inheritance, but comments suggest that we have one such structure either for
each edge in the CFG or for each basic block.  In particular this
m_def_chain vector looks very sparse and fat; replacing it with a

  hash_map<int, rdc *>

and allocating rdcs from another obstack (in principle re-using
m_bitmap.obstack would be possible but somewhat ugly) should make this
more cache- and memory-friendly (whether the SSA name version or the
pointer is used as key remains to be determined).

The ssa1 and ssa2 members also look quite odd; we always record into the
bitmap, so those seem to be a waste of time?  Changing the allocation the
above way would also enable embedding bitmap_head, removing one pointer
indirection.  Unfortunately we use bitmap_ior_into, so using the more
efficient tree form for bitmap queries isn't possible until somebody
implements an (efficient!) bitmap_ior_into on tree form.

It wouldn't fix the apparent algorithmic issues of course, so this is just
food for thought.  Complexity-wise it would reduce O (n-edges * n-ssa-names)
to O (n-edges * n-deps/imports-on-edge).

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (38 preceding siblings ...)
  2022-03-10 13:22 ` rguenth at gcc dot gnu.org
@ 2022-03-10 13:42 ` cvs-commit at gcc dot gnu.org
  2022-03-10 13:45 ` rguenth at gcc dot gnu.org
                   ` (13 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-03-10 13:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #38 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:ee34ffa429a399f292ad1421333721a92b998772

commit r12-7592-gee34ffa429a399f292ad1421333721a92b998772
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Mar 10 13:43:19 2022 +0100

    tree-optimization/102943 - use tree form for sbr_sparse_bitmap

    The following arranges to remove an indirection to the bitvector
    in sbr_sparse_bitmap by embedding bitmap_head instead of bitmap
    and using the tree form (since we only ever set/query individual
    aligned bit chunks).  That shaves off 6 seconds from 70 seconds
    of the slowest 521.wrf_r LTRANS unit build.

    2022-03-10  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/102943
            * gimple-range-cache.cc (sbr_sparse_bitmap::bitvec):
            Make a bitmap_head.
            (sbr_sparse_bitmap::sbr_sparse_bitmap): Adjust and switch
            to tree view.
            (sbr_sparse_bitmap::set_bb_range): Adjust.
            (sbr_sparse_bitmap::get_bb_range): Likewise.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (39 preceding siblings ...)
  2022-03-10 13:42 ` cvs-commit at gcc dot gnu.org
@ 2022-03-10 13:45 ` rguenth at gcc dot gnu.org
  2022-03-10 13:49 ` rguenth at gcc dot gnu.org
                   ` (12 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-10 13:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #39 from Richard Biener <rguenth at gcc dot gnu.org> ---
For the second largest LTRANS unit we also have

 tree SSA incremental               :  10.89 ( 10%)   0.02 (  3%)  10.74 ( 10%)
 5030k (  1%)
 tree loop unswitching              :   1.39 (  1%)   0.00 (  0%)   1.39 (  1%)
 8332k (  2%)
 `- tree SSA incremental            :   9.58 (  9%)   0.01 (  1%)   9.53 (  9%)
    0  (  0%)

showing that almost all update_ssa load is from the unswitching pass, which
updates SSA form for each unswitching.  One should be able to delay that
until we want to process an outer loop; that should be much easier after the
pending rewrite.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (40 preceding siblings ...)
  2022-03-10 13:45 ` rguenth at gcc dot gnu.org
@ 2022-03-10 13:49 ` rguenth at gcc dot gnu.org
  2022-03-10 14:01 ` amacleod at redhat dot com
                   ` (11 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-10 13:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #40 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so after the two micro-optimizations we are now at

 tree VRP                           :  10.92 ( 17%)   0.03 (  5%)  10.94 ( 17%)
   28M (  4%)
 backwards jump threading           :  11.16 ( 18%)   0.00 (  0%)  11.15 ( 17%)
 1475k (  0%)
 TOTAL                              :  63.32          0.56         63.92       
  744M

and

 alias stmt walking                 :   6.91 ( 18%)   0.04 (  9%)   7.02 ( 18%)
  703k (  0%)
 tree VRP                           :   5.80 ( 15%)   0.03 (  7%)   5.83 ( 15%)
   14M (  3%)
 backwards jump threading           :   6.18 ( 16%)   0.01 (  2%)   6.17 ( 16%)
 1645k (  0%)
 TOTAL                              :  38.46          0.43         38.93       
  438M

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (41 preceding siblings ...)
  2022-03-10 13:49 ` rguenth at gcc dot gnu.org
@ 2022-03-10 14:01 ` amacleod at redhat dot com
  2022-03-10 14:17 ` amacleod at redhat dot com
                   ` (10 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: amacleod at redhat dot com @ 2022-03-10 14:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #41 from Andrew Macleod <amacleod at redhat dot com> ---

> 
> so it's still by far jump-threading/VRP dominating compile-times (I wonder
> if we should separate "old" and "new" [E]VRP timevars).  Given that VRP
> shows up as well it's more likely the underlying ranger infrastructure?

Yeah, I'd be tempted to just label them vrp1 (evrp), vrp2 (current vrp1) and
vrp3 (current vrp2) and track them separately.  I have noticed significant
behaviour differences between the code we see at VRP2 time vs EVRP.


> 
> perf thrown on ltrans22 shows
> 
> Samples: 302K of event 'cycles', Event count (approx.): 331301505627        
> 
> Overhead       Samples  Command      Shared Object     Symbol               
> 
>   10.34%         31299  lto1-ltrans  lto1              [.] bitmap_get_aligned_chunk
>    7.44%         22540  lto1-ltrans  lto1              [.] bitmap_bit_p
>    3.17%          9593  lto1-ltrans  lto1              [.]

> 
> callgraph info in perf is a mixed bag, but maybe it helps to pinpoint things:
> 
> -   10.20%    10.18%         30364  lto1-ltrans  lto1              [.] bitmap_get_aligned_chunk
>    - 10.18% 0xffffffffffffffff
>       + 9.16% ranger_cache::propagate_cache
>       + 1.01% ranger_cache::fill_block_cache
> 

I am currently looking at reworking the cache again so that propagation is
limited only to actual changes.  It can still get out of hand in massive
CFGs, and that's already using the sparse representation.  There may be some
minor tweaks that can make a big difference here.  I'll have a look over the
next couple of days.

It's probably safe to assume the threading performance is directly related to
this as well.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (42 preceding siblings ...)
  2022-03-10 14:01 ` amacleod at redhat dot com
@ 2022-03-10 14:17 ` amacleod at redhat dot com
  2022-03-10 14:23 ` rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: amacleod at redhat dot com @ 2022-03-10 14:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #42 from Andrew Macleod <amacleod at redhat dot com> ---
(In reply to Richard Biener from comment #37)
> I'm looking at range_def_chain::m_def_chain; its use is well obfuscated by
> inheritance but comments suggest that we have one such structure either for
> each edge in the CFG or for each basic-block.  In particular this

There is one structure per ssa-name globally.  It is the dependency list for
an ssa-name, containing the list of other ssa-names in the chain of stmts
which are used to construct it.  Meaning, if one of those dependent names
changes, this name could change.

The dependent chain of statements does not extend beyond a basic block
boundary.



> m_def_chain vector looks very sparse and fat, replacing that with a
> 
>   hash_map<int, rdc *>
> 
> and allocating rdcs from another obstack (in principle re-using
> m_bitmap.obstack would be possible but somewhat ugly) should make this
> more cache and memory friendly (whether SSA name version or pointer is
> used as key would remain to be determined).
> 
> The ssa1 and ssa2 members are also quite odd, we always record into the
> bitmap so those seem to be a waste of time?  Changing allocation the

The bitmap is the exhaustive set of dependencies (up to a limit) within the
block.  ssa1/ssa2 are basically cached names for fast direct dependency
lookup.

  i_7 = .....

<bb4>
  _1 = i_7 < 0;
  _2 = j_8 < 0;
  _3 = _1 | _2;
  if (_3 != 0)

Imports: i_7  j_8
Exports: _1  _2  _3  i_7  j_8
depchains:
         _1 : i_7(I)                      // ssa1 = i_7
         _2 : j_8(I)                      // ssa1 = j_8
         _3 : _1  _2  i_7(I)  j_8(I)      // ssa1 = _1  ssa2 = _2


The ssa1 and ssa2 fields are used to specify up to 2 ssa-names that occur on
the def stmt itself, and are used during global cache lookup in conjunction
with the timestamp to determine whether the current global value is stale.

I.e., it's a fast check.  Ask for the range of _2: the ssa1 field is set to
j_8, so we simply compare the timestamp on j_8 with the timestamp on _2 to
ensure it's up to date.  If it's stale, we recalculate _2.

Otherwise we would have to either parse the stmt or loop through the bitmap
and check each element.  They were once in their own data structure, but it
was more efficient to simply include them here in this structure.



> above way would also enable embedding bitmap_head, removing one pointer
> indirection.  Unfortunately we use bitmap_ior_into so using the more
> efficient tree form for bitmap queries isn't possible until somebody
> implements (efficient!) bitmap_ior_into on tree form.
> 
> It wouldn't fix the apparent algorithmic issues of course, so this is just
> food for thought.  Complexity-wise it would reduce O (n-edges * n-ssa-names)
> to O (n-edges * n-deps/imports-on-edge).

So it's just O(ssa-name) already.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (43 preceding siblings ...)
  2022-03-10 14:17 ` amacleod at redhat dot com
@ 2022-03-10 14:23 ` rguenth at gcc dot gnu.org
  2022-03-10 14:26 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-10 14:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #43 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Andrew Macleod from comment #42)
> (In reply to Richard Biener from comment #37)
> > I'm looking at range_def_chain::m_def_chain; its use is well obfuscated by
> > inheritance but comments suggest that we have one such structure either for
> > each edge in the CFG or for each basic-block.  In particular this
> 
> There is one structure per ssa-name globally.
[...] 
> so its just O(ssa-name) already.

So you mean O(num-ssa-names^2), since if it exists for each SSA name then
we have m_def_chain (of length num-ssa-names) for each SSA name?  That's
what I originally feared, but I failed to find the array(?) that stores
the range_def_chains.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (44 preceding siblings ...)
  2022-03-10 14:23 ` rguenth at gcc dot gnu.org
@ 2022-03-10 14:26 ` rguenth at gcc dot gnu.org
  2022-03-10 14:33 ` amacleod at redhat dot com
                   ` (7 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-10 14:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #44 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #43)
> (In reply to Andrew Macleod from comment #42)
> > (In reply to Richard Biener from comment #37)
> > > I'm looking at range_def_chain::m_def_chain; its use is well obfuscated by
> > > inheritance but comments suggest that we have one such structure either for
> > > each edge in the CFG or for each basic-block.  In particular this
> > 
> > There is one structure per ssa-name globally.
> [...] 
> > so its just O(ssa-name) already.
> 
> so you mean O(num-ssa-names^2) since if it exists for each SSA name then
> we have m_def_chain (of length num-ssa-names) for each SSA name?  That's
> what I originally feared, but I failed to find the array(?) that stores
> the range_def_chains.

That is, I wondered what the lifetime of the gori_compute : gori_map :
range_def_chain object(s) is, where they are allocated and/or freed and
maintained.  I've seen m_bitmaps in range_def_chain which should (maybe) be
of longer lifetime, for example.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (45 preceding siblings ...)
  2022-03-10 14:26 ` rguenth at gcc dot gnu.org
@ 2022-03-10 14:33 ` amacleod at redhat dot com
  2022-03-10 14:36 ` amacleod at redhat dot com
                   ` (6 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: amacleod at redhat dot com @ 2022-03-10 14:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #45 from Andrew Macleod <amacleod at redhat dot com> ---
(In reply to Richard Biener from comment #43)
> (In reply to Andrew Macleod from comment #42)
> > (In reply to Richard Biener from comment #37)
> > > I'm looking at range_def_chain::m_def_chain; its use is well obfuscated by
> > > inheritance but comments suggest that we have one such structure either for
> > > each edge in the CFG or for each basic-block.  In particular this
> > 
> > There is one structure per ssa-name globally.
> [...] 
> > so its just O(ssa-name) already.
> 
> so you mean O(num-ssa-names^2) since if it exists for each SSA name then
> we have m_def_chain (of length num-ssa-names) for each SSA name?  That's
> what I originally feared, but I failed to find the array(?) that stores
> the range_def_chains.

No, O(num_ssa_names).

There is a single gori_compute object in ranger, which inherits from a
gori_map (which manages the import/export lists for blocks), which inherits
from the range_def_chain class.

There is a single m_def_chain[] vector for all of ranger, so one entry per
ssa-name.  Each ssa-name has a single rdc structure.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (46 preceding siblings ...)
  2022-03-10 14:33 ` amacleod at redhat dot com
@ 2022-03-10 14:36 ` amacleod at redhat dot com
  2022-03-16 19:48 ` amacleod at redhat dot com
                   ` (5 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: amacleod at redhat dot com @ 2022-03-10 14:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #46 from Andrew Macleod <amacleod at redhat dot com> ---
(In reply to Richard Biener from comment #44)
> (In reply to Richard Biener from comment #43)
> > (In reply to Andrew Macleod from comment #42)
> > > (In reply to Richard Biener from comment #37)
> > > > I'm looking at range_def_chain::m_def_chain; its use is well obfuscated by
> > > > inheritance but comments suggest that we have one such structure either for
> > > > each edge in the CFG or for each basic-block.  In particular this
> > > 
> > > There is one structure per ssa-name globally.
> > [...] 
> > > so its just O(ssa-name) already.
> > 
> > so you mean O(num-ssa-names^2) since if it exists for each SSA name then
> > we have m_def_chain (of length num-ssa-names) for each SSA name?  That's
> > what I originally feared, but I failed to find the array(?) that stores
> > the range_def_chains.
> 
> That is, I wondered what's the lifetime of the gori_compute : gori_map :
> range_def_chain object(s), where they are allocated and/or freed and
> maintained.  I've seen m_bitmaps in range_def_chain which should be
> of longer lifetime (maybe) for example.


They all live and die as one with ranger.

A gimple ranger object has a single ranger_cache:
    ranger_cache m_cache;
which is constructed when ranger is created, and the cache has a single
gori_compute object:
  gori_compute m_gori;

So they all have identical lifetimes.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (47 preceding siblings ...)
  2022-03-10 14:36 ` amacleod at redhat dot com
@ 2022-03-16 19:48 ` amacleod at redhat dot com
  2022-03-17 11:14 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: amacleod at redhat dot com @ 2022-03-16 19:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #47 from Andrew Macleod <amacleod at redhat dot com> ---
Created attachment 52637
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52637&action=edit
new patch

I am working on an alternative cache for GCC 13, but along the way I have
changes to the ranger_cache::range_from_dom() routine.  The original version
gave up when it hit a block which had outgoing edges.  The new version is
smarter and basically goes back until it finds a cache entry, and then
intersects all outgoing edges between the two places.  It also removes the
recursion, and does not SET any cache values during the lookup (making it a
true query).

The net effect of this is a significant improvement in cache performance
because the cache is used far less, but there is more time spent doing
calculations.  This bootstraps and passes all regression tests.  We do miss
out on a few minor opportunities (30 out of 4400 in all of EVRP over the GCC
source) which occur as a result of updated values not being propagated
properly, as the cache is no longer "full" like it was before.

In GCC 13 I will address this, but I thought you might be interested in trying
this patch against this PR.

In building 380 GCC source files, I see the following average speedups:
evrp : -22.57%
VRP2 : -5.4%
thread_jumps_full : -14.16%
total : -0.44%

So it is not insignificant.

It is likely to be most effective in large CFGs.
This is the *total* compile-time percentage speedup for the 5 most significant cases:

expr.ii  -2.62%
lra-constraints.ii -3.75%
caller-save.ii -3.98%
reload.ii -4.04%
optabs.ii -5.05%

EVRP isolated speedups (yes, these are *percentage* speedups):
expr.ii -62.38
simplify-rtx.ii  -65.97
lra-constraints.ii  -67.87
reload.ii trunk  -68.67
caller-save.ii trunk  -71.93
optabs.ii trunk  -78.69

I think those times are probably worth the odd miss.

Anyway, next time you are checking performance for this PR maybe also try this
patch and see how it performs.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (48 preceding siblings ...)
  2022-03-16 19:48 ` amacleod at redhat dot com
@ 2022-03-17 11:14 ` rguenth at gcc dot gnu.org
  2022-03-17 13:05 ` amacleod at redhat dot com
                   ` (3 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-17 11:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #48 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Andrew Macleod from comment #47)
> Created attachment 52637 [details]
> new patch
> 
> I am working on an alternative cache for GCC 13, but along the way I have
> changes to the ranger_cache::range_from_dom() routine.  The original version
> gave up when it hit a block which had outgoing edges.  The new version is
> smarter and basically goes back until it finds a cache entry, and then
> intersects all outgoing edges between the two places.  It also removes the
> recursion, and does not SET any cache values during the lookup (making it a
> true query).
> 
> The net effect of this is a significant improvement in cache performance
> because the cache is used far less, but there is more time spent doing
> calculations.  This bootstraps and passes all regression tests.  We do miss
> out on a few minor opportunities (30 out of 4400 in all of EVRP over the GCC
> source) which occur as a result of updated values not being propagated
> properly, as the cache is no longer "full" like it was before.
> 
> In GCC 13 I will address this, but I thought you might be interested in
> trying this patch against this PR.
> 
> In building 380 GCC source files, I see the following avg speedups
> evrp : -22.57%
> VRP2 : -5.4%
> thread_jumps_full : -14.16%
> total : -0.44%
> 
> So it is not insignificant.
> 
> It is likely to be most effective in large CFGs.
> This is *total* compile time percent speed up for the 5 most significant
> cases:
> 
> expr.ii  -2.62%
> lra-constraints.ii -3.75%
> caller-save.ii -3.98%
> reload.ii -4.04%
> optabs.ii -5.05%
> 
> EVRP isolated speedups (yes, these are *percentage* speedups):
> expr.ii -62.38
> simplify-rtx.ii  -65.97
> lra-constraints.ii  -67.87
> reload.ii trunk  -68.67
> caller-save.ii trunk  -71.93
> optabs.ii trunk  -78.69
> 
> I think those times are probably worth the odd miss.
> 
> Anyway, next time you are checking performance for this PR maybe also try
> this patch and see how it performs.

It helps quite a bit, the worst case is now

 tree VRP                           :   5.14 (  7%)   0.02 (  3%)   5.15 (  7%)
   29M (  3%)
 backwards jump threading           :   4.05 (  6%)   0.00 (  0%)   4.06 (  6%)
 2220k (  0%)

Overall the patch reduces compile time from 766s to 749s (parallel compile,
serial LTO, release checking).  So IMHO definitely worth it if you are happy
with it.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (49 preceding siblings ...)
  2022-03-17 11:14 ` rguenth at gcc dot gnu.org
@ 2022-03-17 13:05 ` amacleod at redhat dot com
  2022-03-17 14:18 ` hubicka at kam dot mff.cuni.cz
                   ` (2 subsequent siblings)
  53 siblings, 0 replies; 57+ messages in thread
From: amacleod at redhat dot com @ 2022-03-17 13:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #49 from Andrew Macleod <amacleod at redhat dot com> ---
Let me clean it up a little and I'll post it.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (50 preceding siblings ...)
  2022-03-17 13:05 ` amacleod at redhat dot com
@ 2022-03-17 14:18 ` hubicka at kam dot mff.cuni.cz
  2022-03-17 20:44 ` cvs-commit at gcc dot gnu.org
  2022-03-23 10:40 ` rguenth at gcc dot gnu.org
  53 siblings, 0 replies; 57+ messages in thread
From: hubicka at kam dot mff.cuni.cz @ 2022-03-17 14:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #50 from hubicka at kam dot mff.cuni.cz ---
> It helps quite a bit, the worst case is now
> 
>  tree VRP                           :   5.14 (  7%)   0.02 (  3%)   5.15 (  7%)
>    29M (  3%)
>  backwards jump threading           :   4.05 (  6%)   0.00 (  0%)   4.06 (  6%)
>  2220k (  0%)
> 
> overall the patch reduces compile time from 766s to 749 (parallel compile,
> serial LTO, release checking).  So IMHO definitely worth it if you are happy
> with it.
This looks really promising.  Does it also solve the situation with
--param param_inline_functions_called_once_insns=1000000?

I will benchmark how compile time now behaves with respect to increasing
this bound.

Honza

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (51 preceding siblings ...)
  2022-03-17 14:18 ` hubicka at kam dot mff.cuni.cz
@ 2022-03-17 20:44 ` cvs-commit at gcc dot gnu.org
  2022-03-23 10:40 ` rguenth at gcc dot gnu.org
  53 siblings, 0 replies; 57+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-03-17 20:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #51 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Andrew Macleod <amacleod@gcc.gnu.org>:

https://gcc.gnu.org/g:8db155ddf8cec9e31f0a4b8d80cc67db2c7a26f9

commit r12-7692-g8db155ddf8cec9e31f0a4b8d80cc67db2c7a26f9
Author: Andrew MacLeod <amacleod@redhat.com>
Date:   Thu Mar 17 10:52:10 2022 -0400

    Always use dominators in the cache when available.

    This patch adjusts range_from_dom to follow the dominator tree through the
    cache until a value is found, then apply any outgoing ranges encountered
    along the way.  This reduces the amount of cache storage required.

            PR tree-optimization/102943
            * gimple-range-cache.cc (ranger_cache::range_from_dom): Find
            range via dominators and apply intermediary outgoing edge ranges.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [Bug tree-optimization/102943] [12 Regression] Jump threader compile-time hog with 521.wrf_r
  2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
                   ` (52 preceding siblings ...)
  2022-03-17 20:44 ` cvs-commit at gcc dot gnu.org
@ 2022-03-23 10:40 ` rguenth at gcc dot gnu.org
  53 siblings, 0 replies; 57+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-03-23 10:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #52 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed.

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2022-03-23 10:40 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-26 11:13 [Bug tree-optimization/102943] New: VRP threader compile-time hog with 521.wrf_r rguenth at gcc dot gnu.org
2021-10-26 11:15 ` [Bug tree-optimization/102943] [12 Regression] " rguenth at gcc dot gnu.org
2021-10-26 11:25 ` rguenth at gcc dot gnu.org
2021-10-26 11:49 ` rguenth at gcc dot gnu.org
2021-10-26 14:57 ` pinskia at gcc dot gnu.org
2021-10-26 14:58 ` marxin at gcc dot gnu.org
2021-10-26 15:06 ` marxin at gcc dot gnu.org
2021-10-30  6:31 ` aldyh at gcc dot gnu.org
2021-10-31 20:06 ` hubicka at gcc dot gnu.org
2021-11-02  7:25 ` [Bug tree-optimization/102943] [12 Regression] Jump " rguenth at gcc dot gnu.org
2021-11-02  7:29 ` aldyh at gcc dot gnu.org
2021-11-03 10:57 ` aldyh at gcc dot gnu.org
2021-11-03 10:58 ` aldyh at gcc dot gnu.org
2021-11-03 13:17 ` rguenther at suse dot de
2021-11-03 14:33 ` amacleod at redhat dot com
2021-11-03 14:42 ` rguenther at suse dot de
2021-11-04 14:40 ` cvs-commit at gcc dot gnu.org
2021-11-04 14:40 ` cvs-commit at gcc dot gnu.org
2021-11-04 14:40 ` cvs-commit at gcc dot gnu.org
2021-11-04 15:24 ` aldyh at gcc dot gnu.org
2021-11-04 17:00   ` Jan Hubicka
2021-11-04 17:00 ` hubicka at kam dot mff.cuni.cz
2021-11-05  9:08 ` aldyh at gcc dot gnu.org
2021-11-05 11:10 ` marxin at gcc dot gnu.org
2021-11-05 11:13 ` aldyh at gcc dot gnu.org
2021-11-05 11:23 ` marxin at gcc dot gnu.org
2021-11-05 17:16 ` cvs-commit at gcc dot gnu.org
2021-11-07 17:17 ` hubicka at gcc dot gnu.org
2021-11-07 18:16 ` aldyh at gcc dot gnu.org
2021-11-07 18:59   ` Jan Hubicka
2021-11-07 18:59 ` hubicka at kam dot mff.cuni.cz
2021-11-12 22:14 ` hubicka at gcc dot gnu.org
2021-11-14  9:58 ` hubicka at gcc dot gnu.org
2021-11-26 12:38 ` cvs-commit at gcc dot gnu.org
2021-11-30 10:55 ` aldyh at gcc dot gnu.org
2021-12-09 20:17 ` hubicka at gcc dot gnu.org
2022-01-03  8:47 ` rguenth at gcc dot gnu.org
2022-01-03 11:20 ` hubicka at kam dot mff.cuni.cz
2022-01-19  7:06 ` rguenth at gcc dot gnu.org
2022-03-10 11:37 ` rguenth at gcc dot gnu.org
2022-03-10 12:40 ` cvs-commit at gcc dot gnu.org
2022-03-10 13:22 ` rguenth at gcc dot gnu.org
2022-03-10 13:42 ` cvs-commit at gcc dot gnu.org
2022-03-10 13:45 ` rguenth at gcc dot gnu.org
2022-03-10 13:49 ` rguenth at gcc dot gnu.org
2022-03-10 14:01 ` amacleod at redhat dot com
2022-03-10 14:17 ` amacleod at redhat dot com
2022-03-10 14:23 ` rguenth at gcc dot gnu.org
2022-03-10 14:26 ` rguenth at gcc dot gnu.org
2022-03-10 14:33 ` amacleod at redhat dot com
2022-03-10 14:36 ` amacleod at redhat dot com
2022-03-16 19:48 ` amacleod at redhat dot com
2022-03-17 11:14 ` rguenth at gcc dot gnu.org
2022-03-17 13:05 ` amacleod at redhat dot com
2022-03-17 14:18 ` hubicka at kam dot mff.cuni.cz
2022-03-17 20:44 ` cvs-commit at gcc dot gnu.org
2022-03-23 10:40 ` rguenth at gcc dot gnu.org
