public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/65478] New: crafty performance regression
@ 2015-03-19 22:38 hubicka at gcc dot gnu.org
  2015-03-19 22:45 ` [Bug tree-optimization/65478] " hubicka at gcc dot gnu.org
                   ` (24 more replies)
  0 siblings, 25 replies; 26+ messages in thread
From: hubicka at gcc dot gnu.org @ 2015-03-19 22:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

            Bug ID: 65478
           Summary: crafty performance regression
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org

As reported to me privately by Igor, the crafty SPEC2k benchmark has been
slower since r219863, which decreased inline-unit-growth.

I looked into the reason, and it is the fact that we do not inline the
NextMove function. The function itself is big:

Inline summary for NextMove/18 inlinable                                        
  self time:       177                                                          
  global time:     0                                                            
  self size:       284                                                          
  global size:     0                                                            
  min size:       0                                                             
  self stack:      0                                                            
  global stack:    0                                                            
    size:228.000000, time:150.114000, predicate:(true)                          
    size:3.000000, time:2.000000, predicate:(not inlined)                       
    size:4.000000, time:0.649000, predicate:(op0 changed)                       
    size:4.000000, time:5.946000, predicate:(op1 changed)                       
    size:2.000000, time:1.486000, predicate:(op1 == 0)                          
    size:2.000000, time:1.486000, predicate:(op1 != 0)                          

One quite wrong estimate I see is the following:

  switch (_45) <default: <L90>, case 1: <L0>, case 2: <L4>, case 3: <L35>, case
4: <L40>, case 5: <L43>, case 6: <L50>, case 8: <L51>, case 9: <L68>, case 10:
<L92>>
                freq:1.00 size: 20 time:  6                                     
                Accounting size:20.00, time:6.00 on predicate:(true)            

This assumes a jump-tree implementation of the switch.  We should detect dense
switch statements and estimate them as a jump table.
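
As an illustration of what "dense" could mean for the switch above, here is a
minimal sketch in C. It is not GCC's heuristic: the helper name
switch_is_dense and the percentage threshold are assumptions made up for this
example. Only the case labels (1-6 and 8-10) come from the dump.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not GCC code: return 1 when the case labels fill at
   least min_density_pct percent of the range [lowest, highest], i.e. when a
   jump table would waste few slots.  Labels must be sorted and distinct.  */
static int
switch_is_dense (const long *labels, size_t n, int min_density_pct)
{
  assert (labels != NULL && n > 0);
  long range = labels[n - 1] - labels[0] + 1;
  return (long) n * 100 >= range * min_density_pct;
}

/* Case labels of the switch in NextMove, as listed in the dump.  */
static const long nextmove_cases[] = { 1, 2, 3, 4, 5, 6, 8, 9, 10 };
```

With these labels, 9 of the 10 slots in the range are used, so any threshold
up to 90% would classify the switch as dense.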

The function overall is estimated as relatively uncool for inlining:
Considering NextMove/2405 with 284 size                                         
 to be inlined into Search.constprop/4352 in unknown:-1                         
 Estimated badness is -0.081348, frequency 0.69.                                
    Badness calculation for Search.constprop/4352 -> NextMove/2405              
      size growth 273, time 174 inline hints: cross_module array_index          
      -0.040674: guessed profile. frequency 0.694000, count 0 caller count 0
time w/o inlining 397.838000, time w inlining 386.734000 overall growth 277
(current) 0 (original)
      Adjusted by hints -0.081348                                               
  not inlinable: Search.constprop/4352 -> NextMove/2405, --param
inline-unit-growth limit reached

I am not 100% sure what makes it interesting, though it sounds natural as it is
part of the innermost loop.

The function is called once, but because ipa-cp decides to clone the Search
function, we turn it into a function called twice:
Estimating effects for Search/3356, base_time: 245.                             
   Estimating body: Search/3356                                                 
   Known to be false:                                                           
   size:725 time:245                                                            
 - estimates for value -32768 for param #0: time_benefit: 1, size: 725          
   Estimating body: Search/3356                                                 
   Known to be false: op4 > 62, op4 changed                                     
   size:707 time:242                                                            
 - estimates for value 2 for param #4: time_benefit: 52, size: 707              
   Estimating body: Search/3356                                                 
   Known to be false:                                                           
   size:725 time:245                                                            
 - estimates for value 0 for param #5: time_benefit: 1, size: 725               
   Estimating body: Search/3356                                                 
   Known to be false:                                                           
   size:725 time:245                                                            
 - estimates for value 1 for param #5: time_benefit: 1, size: 725               

Evaluating opportunities for Search/3356.                                       
 - considering value 2 for param #4 (caller_count: 3)                           
     good_cloning_opportunity_p (time: 52, size: 707, freq_sum: 11298) ->
evaluation: 830, threshold: 500
  Creating a specialized node of Search/3356.                                   
    adding an extra known scalar value 1 for param #5                           
    replacing param #4 with const 2                                             
    replacing param #5 with const 1                                             
                Accounting size:3.00, time:2.00 on new predicate:(not inlined)  
                Accounting size:352.50, time:100.61 on new predicate:(op4 <=
62)
                Accounting size:21.00, time:6.87 on new predicate:(op2 changed)
&& (op4 <= 62)
                Accounting size:15.00, time:1.57 on new predicate:(op2 == 0) &&
(op4 <= 62)
                Accounting size:15.00, time:1.65 on new predicate:(op2 != 0) &&
(op4 <= 62)
                Accounting size:8.00, time:3.13 on new predicate:(op3 changed)
&& (op4 <= 62)
                Accounting size:1.00, time:0.00 on new predicate:(op2 changed)
&& (op3 <= 239) && (op4 <= 62)
                Accounting size:5.00, time:0.01 on new predicate:(op3 <= 239)
&& (op4 <= 62)
                Accounting size:1.00, time:0.00 on new predicate:(op3 changed)
&& (op3 > 239) && (op4 <= 62)
                Accounting size:1.00, time:0.00 on new predicate:(op2 changed)
&& (op3 > 239) && (op4 <= 62)
                Accounting size:5.00, time:0.01 on new predicate:(op3 > 239) &&
(op4 <= 62)
                Accounting size:41.00, time:0.09 on new predicate:(op3 > 120)
&& (op4 <= 62)
                Accounting size:4.00, time:0.00 on new predicate:(op3 changed)
&& (op3 > 120) && (op4 <= 62)
                Accounting size:3.00, time:0.00 on new predicate:(op3 <= 179)
&& (op3 > 120) && (op4 <= 62)
                Accounting size:2.00, time:0.00 on new predicate:(op3 changed)
&& (op3 > 179) && (op3 > 120) && (op4 <= 62)
                Accounting size:3.00, time:0.00 on new predicate:(op3 > 179) &&
(op3 > 120) && (op4 <= 62)
                Accounting size:1.00, time:0.28 on new predicate:(op2 == 1) &&
(op4 <= 62)
                Accounting size:1.00, time:0.73 on new predicate:(op2 != 1) &&
(op4 <= 62)
                Accounting size:0.50, time:0.24 on new predicate:(op4 <= 62) &&
(not inlined)
     the new node is Search.constprop/4352.                                     
 - considering value 1 for param #5 (caller_count: 10)                          
     good_cloning_opportunity_p (time: 1, size: 725, freq_sum: 1144) ->
evaluation: 1, threshold: 500
     good_cloning_opportunity_p (time: 1, size: 725, freq_sum: 1144) ->
evaluation: 1, threshold: 500


Evaluating opportunities for Search/3356.                                       
 - adding an extra caller Search.constprop/4352 of Search.constprop/4352        
 - adding an extra caller Search.constprop/4352 of Search.constprop/4352        
 - considering value 1 for param #5 (caller_count: 8)                           
     good_cloning_opportunity_p (time: 1, size: 725, freq_sum: 1144) ->
evaluation: 1, threshold: 500
     good_cloning_opportunity_p (time: 1, size: 725, freq_sum: 1144) ->
evaluation: 1, threshold: 500


Inline summary for Search/30 inlinable                                          
  self time:       245                                                          
  global time:     0                                                            
  self size:       725                                                          
  global size:     0                                                            
  min size:       0                                                             
  self stack:      0                                                            
  global stack:    0                                                            
    size:0.000000, time:0.000000, predicate:(true)                              
    size:3.000000, time:2.000000, predicate:(not inlined)                       
    size:2.000000, time:2.000000, predicate:(op4 changed)                       
    size:352.500000, time:100.615000, predicate:(op4 <= 62)                     
    size:10.000000, time:0.552000, predicate:(op4 changed) && (op4 <= 62)       
    size:21.000000, time:6.867000, predicate:(op2 changed) && (op4 <= 62)       
    size:15.000000, time:1.573000, predicate:(op2 == 0) && (op4 <= 62)          
    size:15.000000, time:1.654000, predicate:(op2 != 0) && (op4 <= 62)          
    size:8.000000, time:3.133000, predicate:(op3 changed) && (op4 <= 62)        
    size:1.000000, time:0.001000, predicate:(op2 changed) && (op3 <= 239) &&
(op4 <= 62)
    size:5.000000, time:0.005000, predicate:(op3 <= 239) && (op4 <= 62)         
    size:1.000000, time:0.001000, predicate:(op3 changed) && (op3 > 239) &&
(op4 <= 62)
    size:1.000000, time:0.001000, predicate:(op2 changed) && (op3 > 239) &&
(op4 <= 62)
    size:5.000000, time:0.005000, predicate:(op3 > 239) && (op4 <= 62)          
    size:5.000000, time:0.073000, predicate:(op4 changed) && (op3 > 120) &&
(op4 <= 62)
    size:41.000000, time:0.085000, predicate:(op3 > 120) && (op4 <= 62)         
    size:4.000000, time:0.002000, predicate:(op3 changed) && (op3 > 120) &&
(op4 <= 62)
    size:3.000000, time:0.000000, predicate:(op3 <= 179) && (op3 > 120) && (op4
<= 62)
    size:2.000000, time:0.000000, predicate:(op3 changed) && (op3 > 179) &&
(op3 > 120) && (op4 <= 62)
    size:3.000000, time:0.000000, predicate:(op3 > 179) && (op3 > 120) && (op4
<= 62)
    size:1.000000, time:0.285000, predicate:(op2 == 1) && (op4 <= 62)           
    size:1.000000, time:0.734000, predicate:(op2 != 1) && (op4 <= 62)           
    size:0.500000, time:0.244000, predicate:(op4 <= 62) && (not inlined)        
    size:1.000000, time:0.389000, predicate:(op1 changed) && (op4 > 62) && (not
inlined)
  array index:(op4 changed)                                                     

I suppose we can do the following:
 - bump inline-unit-growth to 25 so the function is inlined (this is quite
   ugly)
 - consider crafty not to be a large unit
   Unit growth for small function inlining: 27793->31749 (14%)
 - convince ipa-cp to either not duplicate Search or duplicate NextMove too
 - make the array_index hint more important than it is
 - analyze switch statements better
 - figure out why inlining happens and strengthen the analysis.
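
The 14% figure in the list above can be reproduced from the two unit sizes.
The sketch below only shows that arithmetic and a plain comparison against a
percentage limit. The helper names are hypothetical, and the real accounting
behind --param inline-unit-growth in the inliner is more involved than this.

```c
#include <assert.h>

/* Hypothetical helper: percentage by which a unit grew, rounded down.  */
static long
unit_growth_pct (long size_before, long size_after)
{
  assert (size_before > 0);
  return (size_after - size_before) * 100 / size_before;
}

/* Hypothetical check: does growing from size_before to size_after stay
   within limit_pct percent?  A simplification, not the inliner's logic.  */
static int
within_growth_limit (long size_before, long size_after, long limit_pct)
{
  return unit_growth_pct (size_before, size_after) <= limit_pct;
}
```

For example, unit_growth_pct (27793, 31749) gives 14, matching the dump line.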

I will continue looking into this tomorrow.
Honza



* [Bug tree-optimization/65478] crafty performance regression
  2015-03-19 22:38 [Bug tree-optimization/65478] New: crafty performance regression hubicka at gcc dot gnu.org
@ 2015-03-19 22:45 ` hubicka at gcc dot gnu.org
  2015-03-20  2:51 ` [Bug ipa/65478] [5 regression] " hubicka at gcc dot gnu.org
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: hubicka at gcc dot gnu.org @ 2015-03-19 22:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mjambor at suse dot cz

--- Comment #1 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Martin, can you take a look at the sanity of ipa-cp's decisions?



* [Bug ipa/65478] [5 regression] crafty performance regression
  2015-03-19 22:38 [Bug tree-optimization/65478] New: crafty performance regression hubicka at gcc dot gnu.org
  2015-03-19 22:45 ` [Bug tree-optimization/65478] " hubicka at gcc dot gnu.org
@ 2015-03-20  2:51 ` hubicka at gcc dot gnu.org
  2015-03-20 10:24 ` rguenth at gcc dot gnu.org
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: hubicka at gcc dot gnu.org @ 2015-03-20  2:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |ipa
            Summary|crafty performance          |[5 regression] crafty
                   |regression                  |performance regression

--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Martin: Looking into it, parameter 4 (ply) = 2 and donull = true seem to be
used in the calls starting the recursion:


/*                                                                              
 ----------------------------------------------------------                     
|                                                          |                    
|   now call Search to produce a value for this move.      |                    
|                                                          |                    
 ----------------------------------------------------------                     
*/                                                                              
    begin_root_nodes=nodes_searched;                                            
    if (first_move) {                                                           
      value=-ABSearch(-beta,-alpha,ChangeSide(wtm),                             
                      depth+extensions,2,DO_NULL);                              
      if (abort_search) {                                                       
        UnMakeMove(1,current_move[1],wtm);                                      
        return(alpha);                                                          
      }                                                                         
      first_move=0;                                                             
    }                                                                           
    else {                                                                      
      value=-ABSearch(-alpha-1,-alpha,ChangeSide(wtm),                          
                      depth+extensions,2,DO_NULL);                              
      if (abort_search) {                                                       
        UnMakeMove(1,current_move[1],wtm);                                      
        return(alpha);                                                          
      }                                                                         
      if ((value > alpha) && (value < beta)) {                                  
        value=-ABSearch(-beta,-alpha,ChangeSide(wtm),                           
                        depth+extensions,2,DO_NULL);                            
        if (abort_search) {                                                     
          UnMakeMove(1,current_move[1],wtm);                                    
          return(alpha);                                                        
        }                                                                       
      }                                                                         
    }                                                                           

The function then recursively calls itself with alternating players and
sometimes drops to !DO_NULL.

Intuitively, cloning the function to specialize it for the first iteration of
the recursion is like loop peeling, and that is already done (not particularly
well) by recursive inlining.

I would suggest we disable cloning, or add a negative hint for it, in the case
where the specialized function will end up calling the unspecialized version
of itself via a non-cold edge.
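
To make the "loop peeling" comparison concrete, here is a small C sketch. The
functions search and search_ply2 are hypothetical stand-ins, not crafty code.
Only the idea of a ply parameter that is 2 at the root and changes on every
recursive call is taken from the description above.

```c
#include <assert.h>

static int generic_calls;   /* invocations of the unspecialized body */
static int clone_calls;     /* invocations of the ply == 2 clone */

/* Stand-in for the recursive search: ply grows by one per level.  */
static int
search (int depth, int ply)
{
  assert (depth >= 0);
  generic_calls++;
  if (depth == 0)
    return ply;
  return search (depth - 1, ply + 1);
}

/* What a clone specialized for ply == 2 amounts to.  The recursive call
   passes ply == 3, so it has to go back to the unspecialized search.  */
static int
search_ply2 (int depth)
{
  assert (depth >= 0);
  clone_calls++;
  if (depth == 0)
    return 2;
  return search (depth - 1, 3);
}
```

Calling search_ply2 (5) runs the clone exactly once and the unspecialized body
for every deeper level, so the specialization only covers the first iteration
while the whole body is duplicated.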

We may also consider adding a bit of a negative hint for cases where cloning
would turn a function called once (by a non-cold edge) into a function called
twice.  The same may be done in the inliner, but that would further reduce the
chances that split functions produced by ipa-split actually get partially
inlined.

The function is inlined by 4.9:
Considering NextMove/2405 with 284 size                                         
 to be inlined into Search.constprop/4352 in unknown:-1                         
 Estimated badness is -128, frequency 0.69.                                     
    Badness calculation for Search.constprop/4352 -> NextMove/2405              
      size growth 273, time 174 inline hints: cross_module array_index          
      -128: guessed profile. frequency 0.694000, benefit 1.771337%, time w/o
inlining 621, time w inlining 610 overall growth 266 (current) 266 (original)
                Accounting size:228.00, time:104.18 on predicate:(op4 <= 62)    
                Accounting size:4.00, time:4.13 on predicate:(op2 changed) &&
(op4 <= 62)
                Accounting size:2.00, time:1.03 on predicate:(op2 == 0) && (op4
<= 62)
                Accounting size:2.00, time:1.03 on predicate:(op2 != 0) && (op4
<= 62)

I am thus marking it as a regression and changing the component to IPA.



* [Bug ipa/65478] [5 regression] crafty performance regression
  2015-03-19 22:38 [Bug tree-optimization/65478] New: crafty performance regression hubicka at gcc dot gnu.org
  2015-03-19 22:45 ` [Bug tree-optimization/65478] " hubicka at gcc dot gnu.org
  2015-03-20  2:51 ` [Bug ipa/65478] [5 regression] " hubicka at gcc dot gnu.org
@ 2015-03-20 10:24 ` rguenth at gcc dot gnu.org
  2015-03-20 18:25 ` hubicka at ucw dot cz
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-03-20 10:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
   Target Milestone|---                         |5.0

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Which options (LTO?)?  I can't see the regression on our testers.



* [Bug ipa/65478] [5 regression] crafty performance regression
  2015-03-19 22:38 [Bug tree-optimization/65478] New: crafty performance regression hubicka at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2015-03-20 10:24 ` rguenth at gcc dot gnu.org
@ 2015-03-20 18:25 ` hubicka at ucw dot cz
  2015-03-20 19:19 ` hubicka at ucw dot cz
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: hubicka at ucw dot cz @ 2015-03-20 18:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #4 from Jan Hubicka <hubicka at ucw dot cz> ---
> Which options (LTO?)?  I can't see the regression on our testers.
-Ofast -flto -funroll-loops

Honza



* [Bug ipa/65478] [5 regression] crafty performance regression
  2015-03-19 22:38 [Bug tree-optimization/65478] New: crafty performance regression hubicka at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2015-03-20 18:25 ` hubicka at ucw dot cz
@ 2015-03-20 19:19 ` hubicka at ucw dot cz
  2015-03-24 14:10 ` jamborm at gcc dot gnu.org
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: hubicka at ucw dot cz @ 2015-03-20 19:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #5 from Jan Hubicka <hubicka at ucw dot cz> ---
The regression seems to be visible at
http://gcc.opensuse.org/SPEC/CINT/sb-frescobaldi.suse.de-ai-64/186_crafty_big.png



* [Bug ipa/65478] [5 regression] crafty performance regression
  2015-03-19 22:38 [Bug tree-optimization/65478] New: crafty performance regression hubicka at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2015-03-20 19:19 ` hubicka at ucw dot cz
@ 2015-03-24 14:10 ` jamborm at gcc dot gnu.org
  2015-03-24 17:23 ` hubicka at ucw dot cz
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: jamborm at gcc dot gnu.org @ 2015-03-24 14:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

Martin Jambor <jamborm at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamborm at gcc dot gnu.org

--- Comment #6 from Martin Jambor <jamborm at gcc dot gnu.org> ---
I can confirm I can see a fairly consistent 4% run time increase
caused by r219863 on my desktop (from ~22.74s to ~23.64s).  However,
when I disable cloning of the Search function, for example by using
the attribute noclone, I only get to ~23.31s, which is still 2.5%
slower.  (All the times are of course subject to noise but I have
measured them repeatedly and as I said, they are fairly consistent).
This suggests that cloning of function Search and not inlining
NextMove is only part of the story.
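
For readers unfamiliar with it, noclone is a GCC function attribute that
prevents a function from being cloned. Below is a minimal sketch of such an
annotation. The function search_stub is a hypothetical stand-in rather than
crafty's Search, and the NOCLONE macro is an assumption added so that the
snippet also builds with compilers that do not know the attribute.

```c
#include <assert.h>

/* noclone is specific to GCC, so hide it behind a macro elsewhere.  */
#if defined(__GNUC__) && !defined(__clang__)
# define NOCLONE __attribute__ ((noclone))
#else
# define NOCLONE
#endif

/* Hypothetical stand-in for Search: declare it with the attribute ...  */
static int search_stub (int depth, int ply) NOCLONE;

/* ... and define it as usual.  */
static int
search_stub (int depth, int ply)
{
  assert (depth >= 0);
  if (depth == 0)
    return ply;
  return search_stub (depth - 1, ply + 1);
}
```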


> I would suggest we may disable/add negative hint for cloning in the
> case where the specialized function will end up calling
> unspecialized version of itself with non-cold edge.

Recursion is handled by iterating over SCCs in call graph in IPA-CP,
and the redirection of the final call to "close" the SCCs is done in a
different iteration than the first cloning.  This unfortunately means
that when function decide_about_value reasons about cloning or not, it
does not know what recursive calls are going to be redirected and
which are not.  Making it aware of this would require a hack in
cgraph_edge_brings_value_p functions.  I may try writing it but I
wonder whether it is really easier than undoing all cloning in an SCC,
which is the right way to implement this as it would also work for
recursions involving two or more functions.

> We also may consider adding bit of negative hints for cases where
> cloning would turn function called once (by noncold edge) to a
> function called twice.

This would be much easier, although the penalty would have to be quite
big because the goodness number calculated by
good_cloning_opportunity_p is 830 and the threshold is 500.

But given the above, perhaps, for gcc 5 at least, we might want to
introduce a 0.7 factor penalty for this and another 0.7 factor penalty
just for being within an SCC?
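
A worked version of this arithmetic, as a hedged sketch in C. The division in
cloning_evaluation is an assumption inferred from the dump in the original
report, whose numbers it reproduces (time 52, freq_sum 11298, size 707 gives
830); it is not claimed to be the exact code of good_cloning_opportunity_p.
The helper names and the percentage encoding of the 0.7 factor are likewise
made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed shape of the evaluation, chosen because it matches the dump.  */
static int64_t
cloning_evaluation (int64_t time_benefit, int64_t freq_sum, int64_t size_cost)
{
  assert (size_cost > 0);
  return time_benefit * freq_sum / size_cost;
}

/* Hypothetical penalty: scale an evaluation by factor_pct percent,
   so 70 stands for the proposed 0.7 factor.  */
static int64_t
apply_penalty (int64_t evaluation, int factor_pct)
{
  return evaluation * factor_pct / 100;
}
```

Under these assumptions a single 0.7 penalty takes 830 down to 581, which is
still above the threshold of 500, while applying both penalties gives 406 and
would stop the cloning of Search.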



* [Bug ipa/65478] [5 regression] crafty performance regression
  2015-03-19 22:38 [Bug tree-optimization/65478] New: crafty performance regression hubicka at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2015-03-24 14:10 ` jamborm at gcc dot gnu.org
@ 2015-03-24 17:23 ` hubicka at ucw dot cz
  2015-03-24 18:48 ` jamborm at gcc dot gnu.org
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: hubicka at ucw dot cz @ 2015-03-24 17:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #7 from Jan Hubicka <hubicka at ucw dot cz> ---
> > We also may consider adding bit of negative hints for cases where
> > cloning would turn function called once (by noncold edge) to a
> > function called twice.
> 
> This would be much easier, although the penalty would have to be quite
> big because the goodness number calculated by
> good_cloning_opportunity_p is 830 and the threshold is 500.
> 
> But given the above, perhaps, for gcc 5 at least, we might want to
> introduce a 0.7 factor penalty for this and another 0.7 factor penalty
> just for being within an SCC?

Yep, that sounds like a reasonable thing to try to me.

Honza



* [Bug ipa/65478] [5 regression] crafty performance regression
  2015-03-19 22:38 [Bug tree-optimization/65478] New: crafty performance regression hubicka at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2015-03-24 17:23 ` hubicka at ucw dot cz
@ 2015-03-24 18:48 ` jamborm at gcc dot gnu.org
  2015-03-25  7:57 ` hubicka at ucw dot cz
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: jamborm at gcc dot gnu.org @ 2015-03-24 18:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #8 from Martin Jambor <jamborm at gcc dot gnu.org> ---
Created attachment 35127
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35127&action=edit
Inlining decisions difference

(In reply to Martin Jambor from comment #6)
> This suggests that cloning of function Search and not inlining
> NextMove is only part of the story.
> 

I'm attaching output of my script that compares inlining decisions.
"File 1" is wpa inlining dump file generated by r219862, "File 2" is
wpa inlining dump generated by r219863 on source where Search was
annotated as noclone.

(In reply to Jan Hubicka from comment #7)
> Yep, that sounds like a reasonable thing to try to me.
> 

OK, I'll prepare a patch for this part.



* [Bug ipa/65478] [5 regression] crafty performance regression
  2015-03-19 22:38 [Bug tree-optimization/65478] New: crafty performance regression hubicka at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2015-03-24 18:48 ` jamborm at gcc dot gnu.org
@ 2015-03-25  7:57 ` hubicka at ucw dot cz
  2015-03-27  9:45 ` jamborm at gcc dot gnu.org
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: hubicka at ucw dot cz @ 2015-03-25  7:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #9 from Jan Hubicka <hubicka at ucw dot cz> ---
> > This suggests that cloning of function Search and not inlining
> > NextMove is only part of the story.
> > 
> 
> I'm attaching output of my script that compares inlining decisions.
> "File 1" is wpa inlining dump file generated by r219862, "File 2" is
> wpa inlining dump generated by r219863 on source where Search was
> annotated as noclone.

Thanks. I suppose we need to profile the resulting binary with oprofile and
see.
> 
> (In reply to Jan Hubicka from comment #7)
> > Yep, that sounds like a reasonable thing to try to me.
> > 
> 
> OK, I'll prepare a patch for this part.

Actually, there is one detail.  Cloning an SCC and keeping it an SCC is a cool
thing (as one avoids passing a constant parameter across the recursive loop),
but cloning a function from an SCC and keeping all calls within the connected
component going to the original SCC is not cool.  It would be nice to
distinguish between these.
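
The "cool" case can be sketched in C as follows. The functions walk and
walk_step1 are hypothetical and unrelated to crafty. The point is only that
when the constant parameter is passed unchanged across the recursion, the
clone can call itself and the parameter disappears from the recursive loop.

```c
#include <assert.h>

static int generic_calls;   /* invocations of the unspecialized body */
static int clone_calls;     /* invocations of the step == 1 clone */

/* Recursive function whose step parameter is passed on unchanged.  */
static int
walk (int n, int step)
{
  assert (n >= 0 && step > 0);
  generic_calls++;
  if (n < step)
    return n;
  return walk (n - step, step);
}

/* Clone specialized for step == 1.  The recursive call also has step == 1,
   so it stays inside the clone and the recursion remains closed.  */
static int
walk_step1 (int n)
{
  assert (n >= 0);
  clone_calls++;
  if (n < 1)
    return n;
  return walk_step1 (n - 1);
}
```

Here walk_step1 (4) never enters the unspecialized walk, in contrast to a
clone whose recursive calls all return to the original function.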

Thanks!
Honza



* [Bug ipa/65478] [5 regression] crafty performance regression
  2015-03-19 22:38 [Bug tree-optimization/65478] New: crafty performance regression hubicka at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2015-03-25  7:57 ` hubicka at ucw dot cz
@ 2015-03-27  9:45 ` jamborm at gcc dot gnu.org
  2015-03-27  9:49 ` jamborm at gcc dot gnu.org
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: jamborm at gcc dot gnu.org @ 2015-03-27  9:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #11 from Martin Jambor <jamborm at gcc dot gnu.org> ---
Created attachment 35159
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35159&action=edit
Patch implementing cloning penalties

(In reply to Martin Jambor from comment #8)
> (In reply to Jan Hubicka from comment #7)
> > Yep, that sounds like a reasonable thing to try to me.
> > 
> 
> OK, I'll prepare a patch for this part.

This is the patch, it has survived bootstrap and regtest on
x86_64-linux.  I have not measured crafty run-time with it yet.



* [Bug ipa/65478] [5 regression] crafty performance regression
  2015-03-19 22:38 [Bug tree-optimization/65478] New: crafty performance regression hubicka at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2015-03-27  9:45 ` jamborm at gcc dot gnu.org
@ 2015-03-27  9:49 ` jamborm at gcc dot gnu.org
  2015-03-29 14:15 ` hubicka at gcc dot gnu.org
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: jamborm at gcc dot gnu.org @ 2015-03-27  9:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #12 from Martin Jambor <jamborm at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #9)
> Actually, there is one detail.  Cloning an SCC and keeping it an SCC is a
> cool thing (as one avoids passing a constant parameter across the recursive
> loop), but cloning a function from an SCC and keeping all calls within the
> connected component going to the original SCC is not cool.  It would be nice
> to distinguish between these.

I understand.  I have actually spent most of Monday trying to do this
at least for the simple cases.  However, I quickly found out simple
hacks did not work and more serious attempts quickly snowballed into
big patches that I might not finish soon and that might not be deemed
reasonable for stage4.  We basically really need to be able to
roll-back all decisions taken when evaluating an SCC.

In the long run, I agree we certainly want to limit or disallow
completely cloning that would leave calls to the original functions of
an SCC.



* [Bug ipa/65478] [5 regression] crafty performance regression
  2015-03-19 22:38 [Bug tree-optimization/65478] New: crafty performance regression hubicka at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2015-03-27  9:49 ` jamborm at gcc dot gnu.org
@ 2015-03-29 14:15 ` hubicka at gcc dot gnu.org
  2015-03-29 17:46 ` hubicka at gcc dot gnu.org
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 26+ messages in thread
From: hubicka at gcc dot gnu.org @ 2015-03-29 14:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenther at suse dot de

--- Comment #13 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Thanks, Martin.  I now get Search unduplicated by ipa-cp.  Funnily enough,
however, the fix for the vortex slowdown (caused by a bug in the inliner's LTO
inline_failed bookkeeping) changed the inline decisions, and now we do not
inline FirstOne and LastOne.
This seems to be a very stupid implementation of clz:
  int FirstOne(BITBOARD arg1)                                                   
  {                                                                             
    union doub {                                                                
      unsigned short i[4];                                                      
      BITBOARD d;                                                               
    };                                                                          
#ifndef SPEC_CPU2000                                                            
    register union doub x;                                                      
#else                                                                           
    union doub x;                                                               
#endif /* SPEC_CPU2000 */                                                       
    x.d=arg1;                                                                   
#  if defined(LITTLE_ENDIAN_ARCH)                                               
    if (x.i[3])                                                                 
      return (first_ones[x.i[3]]);                                              
    if (x.i[2])                                                                 
      return (first_ones[x.i[2]]+16);                                           
    if (x.i[1])                                                                 
      return (first_ones[x.i[1]]+32);                                           
    if (x.i[0])                                                                 
      return (first_ones[x.i[0]]+48);                                           
#  endif                                                                        
#  if !defined(LITTLE_ENDIAN_ARCH)                                              
    if (x.i[0])                                                                 
      return (first_ones[x.i[0]]);                                              
    if (x.i[1])                                                                 
      return (first_ones[x.i[1]]+16);                                           
    if (x.i[2])                                                                 
      return (first_ones[x.i[2]]+32);                                           
    if (x.i[3])                                                                 
      return (first_ones[x.i[3]]+48);                                           
#  endif                                                                        
    return(64);                                                                 
  }                                                                             
which unfortunately gets estimated as quite large by the inliner:

Analyzing function body size: FirstOne                                          
                Accounting size:2.00, time:0.00 on new predicate:(not inlined)  

 BB 2 predicate:(true)                                                          
  x.d = arg1_3(D);                                                              
                freq:1.00 size:  1 time:  1                                     
                Accounting size:1.00, time:1.00 on predicate:(true)             
  _5 = x.i[3];                                                                  
                freq:1.00 size:  1 time:  1                                     
                Accounting size:1.00, time:1.00 on predicate:(true)             
  if (_5 != 0)                                                                  
                freq:1.00 size:  2 time:  2                                     
                Accounting size:2.00, time:2.00 on predicate:(true)             

 BB 4 predicate:(true)                                                          
  _9 = x.i[2];                                                                  
                freq:0.61 size:  1 time:  1                                     
                Accounting size:1.00, time:0.61 on predicate:(true)             
  if (_9 != 0)                                                                  
                freq:0.61 size:  2 time:  2                                     
                Accounting size:2.00, time:1.22 on predicate:(true)             
...
So at this point we do not even see that x.d holds the value of arg1. If this
were implemented via view_convert_expr, the inliner would at least see that the
return value of FirstOne depends on its parameter.
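
For reference, a minimal sketch of what FirstOne computes when written
directly in terms of its argument.  It assumes that BITBOARD is a 64-bit
unsigned integer and that first_ones[v] holds the number of leading zero
bits of the 16-bit value v; both are assumptions about the benchmark
source, and the table is rebuilt here only for illustration:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t BITBOARD;              /* assumed 64-bit board type */

/* Assumed semantics: leading-zero count of a 16-bit value. */
static unsigned char first_ones[65536];

void init_first_ones(void)
{
    first_ones[0] = 16;                 /* never consulted by the lookups */
    for (unsigned v = 1; v < 65536; v++) {
        unsigned char n = 0;
        while (!(v & (0x8000u >> n)))
            n++;
        first_ones[v] = n;
    }
}

/* Same table lookups as FirstOne, but each halfword is taken from a shift
   of the argument rather than from a load out of a union. */
int first_one_shift(BITBOARD arg1)
{
    for (int k = 0; k < 4; k++) {
        unsigned short h = (unsigned short)(arg1 >> (48 - 16 * k));
        if (h)
            return first_ones[h] + 16 * k;
    }
    return 64;
}

/* The whole function as a count of leading zeros (GCC/Clang builtin). */
int first_one_clz(BITBOARD arg1)
{
    return arg1 ? __builtin_clzll(arg1) : 64;
}

/* The two formulations agree for any input once the table is built. */
void check_first_one(BITBOARD b)
{
    assert(first_one_shift(b) == first_one_clz(b));
}
```

In both variants the result is visibly a function of arg1 alone, which is
the property the inliner's predicates cannot see through the union store.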

Tree optimizers produce:

Removing basic block 11
FirstOne (BITBOARD arg1)
{
  union doub x;
  int _1;
  short unsigned int _5;
  int _6;
  unsigned char _7;
  int _8;
  short unsigned int _9;
  int _10;
  unsigned char _11;
  int _12;
  int _13;
  short unsigned int _14;
  int _15;
  unsigned char _16;
  int _17;
  int _18;
  short unsigned int _19;
  int _20;
  unsigned char _21;
  int _22;
  int _23;

  <bb 2>:
  x.d = arg1_3(D);
  _5 = x.i[3];
  if (_5 != 0)
    goto <bb 3>;
  else
    goto <bb 4>;

  <bb 3>:
  _6 = (int) _5;
  _7 = first_ones[_6];
  _8 = (int) _7;
  goto <bb 10>;
...
This is somewhat lame.  Richard, I believed we should synthesize
view_convert_expr in this case?

I am checking how much I need to bump up inline unit growth.


* [Bug ipa/65478] [5 regression] crafty performance regression
From: hubicka at gcc dot gnu.org @ 2015-03-29 17:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #14 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Author: hubicka
Date: Sun Mar 29 15:38:52 2015
New Revision: 221763

URL: https://gcc.gnu.org/viewcvs?rev=221763&root=gcc&view=rev
Log:

    PR ipa/65478
    * params.def (PARAM_IPA_CP_RECURSION_PENALTY) : New.
    (PARAM_IPA_CP_SINGLE_CALL_PENALTY): Likewise.
    * ipa-prop.h (ipa_node_params): New flags node_within_scc and
    node_calling_single_call.
    * ipa-cp.c (count_callers): New function.
    (set_single_call_flag): Likewise.
    (initialize_node_lattices): Count callers and set single_flag_call if
    necessary.
    (incorporate_penalties): New function.
    (good_cloning_opportunity_p): Use it, dump new flags.
    (propagate_constants_topo): Set node_within_scc flag if appropriate.
    * doc/invoke.texi (ipa-cp-recursion-penalty,
    ipa-cp-single-call-penalty): Document.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/invoke.texi
    trunk/gcc/ipa-cp.c
    trunk/gcc/ipa-prop.h
    trunk/gcc/params.def
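
A minimal sketch of how such percentage penalties could scale down a
cloning-benefit estimate, assuming each param is a percentage that is
subtracted from 100.  The struct, function name and signature below are
hypothetical stand-ins, not the code committed to ipa-cp.c:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two new ipa_node_params flags. */
struct node_flags {
    bool node_within_scc;
    bool node_calling_single_call;
};

/* Reduce an evaluation by each applicable penalty, given in percent.
   The integer arithmetic is an assumption made for this sketch. */
int64_t apply_penalties(struct node_flags f, int64_t evaluation,
                        int recursion_penalty, int single_call_penalty)
{
    assert(recursion_penalty >= 0 && recursion_penalty <= 100);
    assert(single_call_penalty >= 0 && single_call_penalty <= 100);

    if (f.node_within_scc)
        evaluation = evaluation * (100 - recursion_penalty) / 100;
    if (f.node_calling_single_call)
        evaluation = evaluation * (100 - single_call_penalty) / 100;
    return evaluation;
}
```

With penalties of 40 and 15 (values chosen only for illustration), an
estimate of 1000 for a node with both flags set becomes
1000 * 60 / 100 = 600 and then 600 * 85 / 100 = 510.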


* [Bug ipa/65478] [5 regression] crafty performance regression
From: hubicka at gcc dot gnu.org @ 2015-03-30  2:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #15 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
The inline bump needed is about 23.  Richard, I guess convincing the early
optimizers to turn that hack into shifts (which GCC does, but only at RTL
time) is out of reach for this release, right?


* [Bug ipa/65478] [5 regression] crafty performance regression
From: rguenth at gcc dot gnu.org @ 2015-03-30 11:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jan Hubicka from comment #15)
> The inline bump needed is about 23.  Richard, I guess convincing the early
> optimizers to turn that hack into shifts (which GCC does, but only at RTL
> time) is out of reach for this release, right?

Humm, you mean

  <bb 2>:
  x.d = arg1_3(D);
  _5 = x.i[3];
  if (_5 != 0)
    goto <bb 3>;
  else
    goto <bb 4>;
...
  <bb 4>:
  _12 = x.i[2];
  if (_12 != 0)
    goto <bb 5>;
  else
    goto <bb 6>;

to sth like

  <bb 4>:
  _12 = (unsigned short)(arg1_3(D) >> 32);
  if (_12 != 0)
    goto <bb 5>;
  else
    goto <bb 6>;

?

SCCVN doesn't handle sth "fancy" for the case of union accesses with
not matching offset/size.  We could add that, but I suppose in your
case it's just for the sake of inliner predicates as the actual generated
code might be worse on some targets?

But yes, in principle we can do sth fancy for union loads, though I'd
use BIT_FIELD_REFs (hoping no issues wrt endian...) as the canonical
and "easy" way to represent things here.  Thus

  _12 = BIT_FIELD_REF <arg1_3(D), ...>

(or REALPART/IMAGPART for special cases where that is valid).
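
At the source level, the rewrite discussed above amounts to replacing a
store to the union plus a halfword load with a shift of the incoming
value.  A small sketch, assuming a 16-bit unsigned short and a compiler
that predefines `__BYTE_ORDER__` (GCC and Clang do); the helper names are
hypothetical:

```c
#include <assert.h>
#include <stdint.h>

union doub {
    unsigned short i[4];
    uint64_t d;
};

/* Halfword read through the union, as FirstOne does: one store, one load. */
unsigned short half_via_union(uint64_t arg, int idx)
{
    union doub x;
    x.d = arg;
    return x.i[idx];
}

/* The same halfword computed from the value alone.  Which shift
   corresponds to which index depends on the target byte order. */
unsigned short half_via_shift(uint64_t arg, int idx)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return (unsigned short)(arg >> (16 * idx));
#else
    return (unsigned short)(arg >> (16 * (3 - idx)));
#endif
}

/* Both forms must agree for every halfword index. */
void check_halves(uint64_t arg)
{
    for (int idx = 0; idx < 4; idx++)
        assert(half_via_union(arg, idx) == half_via_shift(arg, idx));
}
```

On a little-endian target `half_via_shift(arg, 2)` is exactly the
`(unsigned short)(arg1_3(D) >> 32)` form shown above; the byte-order
dependence is the endian concern mentioned for BIT_FIELD_REF.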


* [Bug ipa/65478] [5 regression] crafty performance regression
From: hubicka at ucw dot cz @ 2015-03-30 17:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #17 from Jan Hubicka <hubicka at ucw dot cz> ---
>   <bb 2>:
>   x.d = arg1_3(D);
>   _5 = x.i[3];
>   if (_5 != 0)
>     goto <bb 3>;
>   else
>     goto <bb 4>;
> ...
>   <bb 4>:
>   _12 = x.i[2];
>   if (_12 != 0)
>     goto <bb 5>;
>   else
>     goto <bb 6>;
> 
> to sth like
> 
>   <bb 4>:
>   _12 = (unsigned short)(arg1_3(D) >> 32);
>   if (_12 != 0)
>     goto <bb 5>;
>   else
>     goto <bb 6>;
> 
> ?
> 
> SCCVN doesn't handle sth "fancy" for the case of union accesses with
> not matching offset/size.  We could add that, but I suppose in your
> case it's just for the sake of inliner predicates as the actual generated
> code might be worse on some targets?

Currently the function has 1 store and 8 loads; with the change it would have
4 loads, which would make it cheaper (and would indeed make the predicates work).
> 
>   _12 = BIT_FIELD_REF <arg1_3(D), ...>
> 
> (or REALPART/IMAGPART for special cases where that is valid).

Yeah, that would be nice!


* [Bug ipa/65478] [5 regression] crafty performance regression
From: glisse at gcc dot gnu.org @ 2015-03-30 20:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #18 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #16)
> But yes, in principle we can do sth fancy for union loads, though I'd
> use BIT_FIELD_REFs (hoping no issues wrt endian...) as the canonical
> and "easy" way to represent things here.

Just adding a link to PR 28367.


* [Bug ipa/65478] [5 regression] crafty performance regression
From: hubicka at gcc dot gnu.org @ 2015-03-30 21:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #19 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Actually, on second thought, would BIT_FIELD_REF <arg1_3(D), ...> allow us to
avoid the actual memory store? I thought that, like COMPONENT_REF, it takes an
address as parameter. What I am hoping for is to fully optimize out union doub x;
at early optimization time.


* [Bug ipa/65478] [5 regression] crafty performance regression
From: rguenther at suse dot de @ 2015-03-31 12:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #20 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 30 Mar 2015, hubicka at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478
> 
> --- Comment #19 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
> Actually, on second thought, would BIT_FIELD_REF <arg1_3(D), ...> allow us to
> avoid the actual memory store? I thought that, like COMPONENT_REF, it takes an
> address as parameter. What I am hoping for is to fully optimize out union doub x;
> at early optimization time.

Yes


* [Bug ipa/65478] [5 regression] crafty performance regression
From: rguenth at gcc dot gnu.org @ 2015-03-31 13:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |WAITING

--- Comment #21 from Richard Biener <rguenth at gcc dot gnu.org> ---
Is the slowdown fixed?


* [Bug ipa/65478] [5 regression] crafty performance regression
From: rguenth at gcc dot gnu.org @ 2015-04-01 14:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW

--- Comment #22 from Richard Biener <rguenth at gcc dot gnu.org> ---
Seems to be a regression with -flto only?  I also see EON regressing without
-flto.

http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64/index.html


* [Bug ipa/65478] [5 regression] crafty performance regression
From: hubicka at ucw dot cz @ 2015-04-01 17:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #23 from Jan Hubicka <hubicka at ucw dot cz> ---
> Seems to be a regression with -flto only?  I also see EON regressing without
> -flto.
Yes, the inlining is cross file.
> 
> http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64/index.html

Saw that one too. It is in between 
                Feb 10, 2015 18:20 UTC    Feb 10, 2015 09:20 UTC
Benchmark          Base      Peak            Base      Peak
164.gzip           1558      1546            1549      1555
175.vpr            2392      2397            2452      2241
176.gcc            2845      1994            2734      2800
181.mcf            3766      3819            3458      3821
186.crafty         2926      2737            2833      2681
197.parser         1975      1911            1962      1905
252.eon            3726      4461            4083      4415
255.vortex         3305      4364            3378      4363
256.bzip2          2218      2348            2059      2379
300.twolf          3257      3265            3231      3220

So it does not seem to point to inliner changes (fortunately).  At Megrez, you
should be able to access the diff?
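
To put the eon numbers in relative terms, here is a small helper for the
percentage change between two scores.  It reads the 09:20 run as the
earlier one and the 18:20 run as the later one; the function name is
hypothetical:

```c
#include <assert.h>

/* Relative change from an older score to a newer one, in percent.
   Higher SPEC scores are better, so a negative result is a slowdown. */
double pct_change(double older, double newer)
{
    assert(older != 0.0);
    return (newer - older) * 100.0 / older;
}
```

For the 252.eon base score, pct_change(4083, 3726) is about -8.7, while
186.crafty base moves from 2833 to 2926 over the same interval, roughly
+3.3.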

Honza


* [Bug ipa/65478] [5 regression] crafty performance regression
From: rguenther at suse dot de @ 2015-04-02  8:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #24 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 1 Apr 2015, hubicka at ucw dot cz wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478
> 
> --- Comment #23 from Jan Hubicka <hubicka at ucw dot cz> ---
> > Seems to be a regression with -flto only?  I also see EON regressing without
> > -flto.
> Yes, the inlining is cross file.
> > 
> > http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64/index.html
> 
> Saw that one too. It is in between 
>                 Feb 10, 2015 18:20 UTC    Feb 10, 2015 09:20 UTC
> Benchmark          Base      Peak            Base      Peak
> 164.gzip           1558      1546            1549      1555
> 175.vpr            2392      2397            2452      2241
> 176.gcc            2845      1994            2734      2800
> 181.mcf            3766      3819            3458      3821
> 186.crafty         2926      2737            2833      2681
> 197.parser         1975      1911            1962      1905
> 252.eon            3726      4461            4083      4415
> 255.vortex         3305      4364            3378      4363
> 256.bzip2          2218      2348            2059      2379
> 300.twolf          3257      3265            3231      3220
> 
> So it does not seem to point to inliner changes (fortunately).  At Megrez, you
> should be able to access the diff?

Yes, it's -r220566:220590, so it is very likely

+2015-02-10  Richard Biener  <rguenther@suse.de>
+
+       PR tree-optimization/64909
+       * tree-vect-loop.c (vect_estimate_min_profitable_iters): Properly
+       pass a scalar-stmt count estimate to the cost model.
+       * tree-vect-data-refs.c (vect_peeling_hash_get_lowest_cost): Likewise.

which fixed a pretty serious vectorizer cost-model issue on all AMD archs.

Might be worth investigating (not that the cost modeling is very good...).

On megrez -march=native expands to

-march=bdver2 -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -msse4a -mcx16 
-msahf -mno-movbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mlwp -mfma 
-mfma4 -mxop -mbmi -mno-bmi2 -mtbm -mavx -mno-avx2 -msse4.2 -msse4.1 
-mlzcnt -mno-rtm -mno-hle -mno-rdrnd -mf16c -mno-fsgsbase -mno-rdseed 
-mprfchw -mno-adx -mfxsr -mxsave -mno-xsaveopt -mno-avx512f -mno-avx512er 
-mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec 
-mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma 
-mno-avx512vbmi -mno-clwb -mno-pcommit --param l1-cache-size=16 --param 
l1-cache-line-size=64 --param l2-cache-size=2048 -mtune=bdver2


* [Bug ipa/65478] [5 regression] crafty performance regression
From: hubicka at gcc dot gnu.org @ 2015-04-05 23:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

--- Comment #25 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Crafty performance is back (with a combination of better heuristics and
increased inlining limits); eon is not, at least not in all configurations.
We have a separate eon PR, so I am closing this one.


* [Bug ipa/65478] [5 regression] crafty performance regression
From: rguenth at gcc dot gnu.org @ 2015-04-07  8:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65478

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #26 from Richard Biener <rguenth at gcc dot gnu.org> ---
Closing.


Thread overview: 26+ messages
2015-03-19 22:38 [Bug tree-optimization/65478] New: crafty performance regression hubicka at gcc dot gnu.org
2015-03-19 22:45 ` [Bug tree-optimization/65478] " hubicka at gcc dot gnu.org
2015-03-20  2:51 ` [Bug ipa/65478] [5 regression] " hubicka at gcc dot gnu.org
2015-03-20 10:24 ` rguenth at gcc dot gnu.org
2015-03-20 18:25 ` hubicka at ucw dot cz
2015-03-20 19:19 ` hubicka at ucw dot cz
2015-03-24 14:10 ` jamborm at gcc dot gnu.org
2015-03-24 17:23 ` hubicka at ucw dot cz
2015-03-24 18:48 ` jamborm at gcc dot gnu.org
2015-03-25  7:57 ` hubicka at ucw dot cz
2015-03-27  9:45 ` jamborm at gcc dot gnu.org
2015-03-27  9:49 ` jamborm at gcc dot gnu.org
2015-03-29 14:15 ` hubicka at gcc dot gnu.org
2015-03-29 17:46 ` hubicka at gcc dot gnu.org
2015-03-30  2:23 ` hubicka at gcc dot gnu.org
2015-03-30 11:27 ` rguenth at gcc dot gnu.org
2015-03-30 17:53 ` hubicka at ucw dot cz
2015-03-30 20:02 ` glisse at gcc dot gnu.org
2015-03-30 21:40 ` hubicka at gcc dot gnu.org
2015-03-31 12:14 ` rguenther at suse dot de
2015-03-31 13:08 ` rguenth at gcc dot gnu.org
2015-04-01 14:43 ` rguenth at gcc dot gnu.org
2015-04-01 17:51 ` hubicka at ucw dot cz
2015-04-02  8:38 ` rguenther at suse dot de
2015-04-05 23:55 ` hubicka at gcc dot gnu.org
2015-04-07  8:49 ` rguenth at gcc dot gnu.org
