[Bug c/67435] New: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug c/67435] New: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
@ 2015-09-02 13:24 yann.collet.73 at gmail dot com
  2015-09-02 14:01 ` [Bug c/67435] " trippels at gcc dot gnu.org
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: yann.collet.73 at gmail dot com @ 2015-09-02 13:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

            Bug ID: 67435
           Summary: Large performance drop on apparently unrelated changes
                    (probable cause : strange inlining side-effect)
           Product: gcc
           Version: 4.8.4
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: yann.collet.73 at gmail dot com
  Target Milestone: ---

Some weird effect with gcc (tested version : 4.8.4).

I've got a performance oriented code, which runs pretty fast. Its speed depends
for a large part on inlining many small functions.
There is no inline statement. All functions are either normal or static.
Automatic inlining decision is solely within compiler's realm, which has worked
fine so far (functions to inline are very small, typically from 1 to 5 lines).

Since inlining across multiple .c files is difficult (-flto is not yet widely
available), I've kept a lot of small functions into a single `.c` file, into
which I'm also developing a codec, and its associated decoder. It's
"relatively" large by my standard (about ~2000 lines, although a lot of them
are mere comments and blank lines), but breaking it into smaller parts opens
new problems, so I would prefer to avoid that, if that is possible.

Encoder and Decoder are related, since they are inverse operations. But from a
programming perspective, they are completely separated, sharing nothing in
common, except a few typedef and very low-level functions (such as reading from
unaligned memory position).

The strange effect is this one :

I recently added a new function fnew to the encoder side. It's a new "entry
point". It's not used nor called from anywhere within the .c file.

The simple fact that it exists makes the performance of the decoder function
fdec drops substantially, by more than 20%, which is way too much to be
ignored.

Now, keep in mind that encoding and decoding operations are completely
separated, they share almost nothing, save some minor typedef (u32, u16 and
such) and associated operations (read/write).

When defining the new encoding function fnew as static, performance of the
decoder fdec increases back to normal. Since fnew isn't called from the .c, I
guess it's the same as if it was not there (dead code elimination).

If static fnew is now called from the encoder side, performance of fdec remains
good.
But as soon as fnew is modified, fdec performance just drops substantially.

Presuming fnew modifications crossed a threshold, I increased the following gcc
parameter : --param max-inline-insns-auto=60 (by default, its value is supposed
to be 40.) And it worked : performance of fdec is now back to normal.

But I guess this game will continue forever with each little modification of
fnew or anything else similar, requiring further tweak on some customized
advance parameter. So I want to avoid that.

I tried another variant : I'm adding another completely useless function, just
to play with. Its content is strictly exactly a copy-paste of fnew, but the
name of the function is obviously different, so let's call it wtf.

When wtf exists (on top of fnew), it doesn't matter if fnew is static or not,
nor what is the value of max-inline-insns-auto : performance of fdec is just
back to normal. Even though wtf is not used nor called from anywhere... :'(

All these effects look plain weird. There is no logical reason for some little
modification in function fnew to have knock-on effect on completely unrelated
function fdec, which only relation is to be in the same file.

I'm trying to understand what could be going on, in order to develop the codec
more reliably. 
For the time being, any modification in function A can have large ripple
effects (positive or negative) on completely unrelated function B, making each
step a tedious process with random outcome. A developer's nightmare.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/67435] Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
  2015-09-02 13:24 [Bug c/67435] New: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect) yann.collet.73 at gmail dot com
@ 2015-09-02 14:01 ` trippels at gcc dot gnu.org
  2015-09-02 14:03 ` pinskia at gcc dot gnu.org
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: trippels at gcc dot gnu.org @ 2015-09-02 14:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #2 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
And you can use -fdump-ipa-inline to look at gcc's inline decisions in detail.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/67435] Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
  2015-09-02 13:24 [Bug c/67435] New: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect) yann.collet.73 at gmail dot com
  2015-09-02 14:01 ` [Bug c/67435] " trippels at gcc dot gnu.org
@ 2015-09-02 14:03 ` pinskia at gcc dot gnu.org
  2015-09-02 14:51 ` yann.collet.73 at gmail dot com
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2015-09-02 14:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Gcc also tries to limit code growth for the unit also which might be something
you are seeing. You can also try -Winline. Also -flto is becoming more
prevalent and more popular. So you might want to give that a try also with 4.9
and above.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/67435] Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
  2015-09-02 13:24 [Bug c/67435] New: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect) yann.collet.73 at gmail dot com
  2015-09-02 14:01 ` [Bug c/67435] " trippels at gcc dot gnu.org
  2015-09-02 14:03 ` pinskia at gcc dot gnu.org
@ 2015-09-02 14:51 ` yann.collet.73 at gmail dot com
  2015-09-03  2:44 ` yann.collet.73 at gmail dot com
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: yann.collet.73 at gmail dot com @ 2015-09-02 14:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #5 from Yann Collet <yann.collet.73 at gmail dot com> ---
Complementary information :

-Winline : does not output anything (is that normal ?)

-fdump-ipa-inline : produce several large files, the interesting one being 1.5
MB long. That's a huge dump to analyze.

Nonetheless, I had a deeper look directly at the function which speed is
affected.
Looking at both slow and fast versions, I could spot *no difference* regarding
inline decisions. From what I can tell, the dump file seems strictly identical.

(note : there could be some differences somewhere else that I did not spotted).


Since then, I've also been suggested that maybe this effect could related to
something else, instruction cache line alignment.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/67435] Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect)
  2015-09-02 13:24 [Bug c/67435] New: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect) yann.collet.73 at gmail dot com
                   ` (2 preceding siblings ...)
  2015-09-02 14:51 ` yann.collet.73 at gmail dot com
@ 2015-09-03  2:44 ` yann.collet.73 at gmail dot com
  2015-09-03  7:19 ` [Bug c/67435] Large performance drop on apparently unrelated changes (potential cause : critical loop instruction alignment) trippels at gcc dot gnu.org
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: yann.collet.73 at gmail dot com @ 2015-09-03  2:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #6 from Yann Collet <yann.collet.73 at gmail dot com> ---
The issue seems in fact related to _instruction alignment_.
More precisely, to alignment of some critical loop.

That's basically why adding some code in the file would just "pushes" some
other code into another position, potentially into a less favorable path (hence
the appearance of "random impact").

The following GCC command saved the day :
-falign-loops=32

Note that -falign-loops=16 doesn't work.
I'm suspecting it might be the default value, but can't be sure.
I'm also suspecting that -falign-loops=32 is primarily useful for Broadwell
cpu.

Now, the problem is, `-falign-loops=32` is a gcc-only command line parameter.
It seems not possible to apply this optimization from within the source file,
such as using :
#pragma GCC optimize ("align-loops=32")
or the function targeted :
__attribute__((optimize("align-loops=32")))

None of these alternatives does work.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/67435] Large performance drop on apparently unrelated changes (potential cause : critical loop instruction alignment)
  2015-09-02 13:24 [Bug c/67435] New: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect) yann.collet.73 at gmail dot com
                   ` (3 preceding siblings ...)
  2015-09-03  2:44 ` yann.collet.73 at gmail dot com
@ 2015-09-03  7:19 ` trippels at gcc dot gnu.org
  2015-09-03 18:47 ` yann.collet.73 at gmail dot com
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: trippels at gcc dot gnu.org @ 2015-09-03  7:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #7 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
(In reply to Yann Collet from comment #6)
> The issue seems in fact related to _instruction alignment_.
> More precisely, to alignment of some critical loop.
> 
> That's basically why adding some code in the file would just "pushes" some
> other code into another position, potentially into a less favorable path
> (hence the appearance of "random impact").
> 
> 
> The following GCC command saved the day :
> -falign-loops=32
> 
> Note that -falign-loops=16 doesn't work.
> I'm suspecting it might be the default value, but can't be sure.
> I'm also suspecting that -falign-loops=32 is primarily useful for Broadwell
> cpu.

Here are the default values (from gcc/config/i386/i386.c):

 2540 /* Processor target table, indexed by processor number */                 
 2541 struct ptt                                                                
 2542 {                                                                         
 2543   const char *const name;                       /* processor name  */     
 2544   const struct processor_costs *cost;           /* Processor costs */     
 2545   const int align_loop;                         /* Default alignments. 
*/                                                                              
 2546   const int align_loop_max_skip;                                          
 2547   const int align_jump;                                                   
 2548   const int align_jump_max_skip;                                          
 2549   const int align_func;                                                   
 2550 };                                                                        
 2551                                                                           
 2552 /* This table must be in sync with enum processor_type in i386.h.  */     
 2553 static const struct ptt processor_target_table[PROCESSOR_max] =           
 2554 {                                                                         
 2555   {"generic", &generic_cost, 16, 10, 16, 10, 16},                         
 2556   {"i386", &i386_cost, 4, 3, 4, 3, 4},                                    
 2557   {"i486", &i486_cost, 16, 15, 16, 15, 16},                               
 2558   {"pentium", &pentium_cost, 16, 7, 16, 7, 16},                           
 2559   {"iamcu", &iamcu_cost, 16, 7, 16, 7, 16},                               
 2560   {"pentiumpro", &pentiumpro_cost, 16, 15, 16, 10, 16},                   
 2561   {"pentium4", &pentium4_cost, 0, 0, 0, 0, 0},                            
 2562   {"nocona", &nocona_cost, 0, 0, 0, 0, 0},                                
 2563   {"core2", &core_cost, 16, 10, 16, 10, 16},                              
 2564   {"nehalem", &core_cost, 16, 10, 16, 10, 16},                            
 2565   {"sandybridge", &core_cost, 16, 10, 16, 10, 16},                        
 2566   {"haswell", &core_cost, 16, 10, 16, 10, 16},                            
 2567   {"bonnell", &atom_cost, 16, 15, 16, 7, 16},                             
 2568   {"silvermont", &slm_cost, 16, 15, 16, 7, 16},                           
 2569   {"knl", &slm_cost, 16, 15, 16, 7, 16},                                  
 2570   {"intel", &intel_cost, 16, 15, 16, 7, 16},                              
 2571   {"geode", &geode_cost, 0, 0, 0, 0, 0},                                  
 2572   {"k6", &k6_cost, 32, 7, 32, 7, 32},                                     
 2573   {"athlon", &athlon_cost, 16, 7, 16, 7, 16},                             
 2574   {"k8", &k8_cost, 16, 7, 16, 7, 16},                                     
 2575   {"amdfam10", &amdfam10_cost, 32, 24, 32, 7, 32},                        
 2576   {"bdver1", &bdver1_cost, 16, 10, 16, 7, 11},                            
 2577   {"bdver2", &bdver2_cost, 16, 10, 16, 7, 11},                            
 2578   {"bdver3", &bdver3_cost, 16, 10, 16, 7, 11},                            
 2579   {"bdver4", &bdver4_cost, 16, 10, 16, 7, 11},                            
 2580   {"btver1", &btver1_cost, 16, 10, 16, 7, 11},                            
 2581   {"btver2", &btver2_cost, 16, 10, 16, 7, 11}                             
 2582 };  

As you can see only AMD's k6 and amdfam10 default to align_loop=32.

> Now, the problem is, `-falign-loops=32` is a gcc-only command line parameter.
> It seems not possible to apply this optimization from within the source file,
> such as using :
> #pragma GCC optimize ("align-loops=32")
> or the function targeted :
> __attribute__((optimize("align-loops=32")))
> 
> None of these alternatives does work.

I don't think this makes much sense for a binary that should run on
any X86 processor anyway. Optimizing for just one specific model will
negatively affect performance on an other.
If you want maximal performance you need to offer different binaries for
different CPUs.

See also (for a similar issue):
http://pzemtsov.github.io/2014/05/12/mystery-of-unstable-performance.html


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/67435] Large performance drop on apparently unrelated changes (potential cause : critical loop instruction alignment)
  2015-09-02 13:24 [Bug c/67435] New: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect) yann.collet.73 at gmail dot com
                   ` (4 preceding siblings ...)
  2015-09-03  7:19 ` [Bug c/67435] Large performance drop on apparently unrelated changes (potential cause : critical loop instruction alignment) trippels at gcc dot gnu.org
@ 2015-09-03 18:47 ` yann.collet.73 at gmail dot com
  2015-09-04 10:23 ` [Bug c/67435] Large performance drop on apparently unrelated changes (potential cause : hot loop alignment) trippels at gcc dot gnu.org
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: yann.collet.73 at gmail dot com @ 2015-09-03 18:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #8 from Yann Collet <yann.collet.73 at gmail dot com> ---
Thanks for the link.
It's a very good read, and indeed, completely in line with my recent
experience.
Recommended solution seems to be the same : "-falign-loops=32"

The article also mentions that the issue is valid for Sandy Bridge cpus.
This broadens the scope : it's not just about Broadwell, but also Haswell, Ivy
Bridge and sandy Bridge. All new cpus from Intel since 2011. It looks like a
large enough installed base to care about.

However, for some reason, in the table provided, both Sandy Bridge and Haswell
get a default loop alignment value of 16. not 32.

Is there a reason for that choice ?

> Optimizing for just one specific model will negatively affect performance on an other.

Well, this issue is apparently important for more than one architecture.
Moreover, being inlined on 32 imply being inlined on 16 too, so it doesn't
introduce drawback for older siblings.

Since then, I could find a few other complaints about the same issue. One
example here : https://software.intel.com/en-us/forums/topic/479392

and a close cousin here :
http://stackoverflow.com/questions/9881002/is-this-a-gcc-bug-when-using-falign-loops-option

This last one introduce a good question : while it's possible to use
"-falign-loops=32" to set the preference for the whole program, it seems not
possible to set it precisely for a single loop.

It looks like a good feature request, as this loop-alignment issue can have a
pretty large impact on performance (~20%), but only matters for a few selected
critical loops. The programmer is typically in good position to know which loop
matters the most. Hence, we don't necessarily need *all* loops to be 32-bytes
aligned, just a handful ones.

Less precise but still great, having the ability to set this optimization
parameter for a function or a section code would be great. But my experiment
seem to show that using #pragma or __attribute__ with align-loops does not
work, as if the optimization setting was simply ignored.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/67435] Large performance drop on apparently unrelated changes (potential cause : hot loop alignment)
  2015-09-02 13:24 [Bug c/67435] New: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect) yann.collet.73 at gmail dot com
                   ` (5 preceding siblings ...)
  2015-09-03 18:47 ` yann.collet.73 at gmail dot com
@ 2015-09-04 10:23 ` trippels at gcc dot gnu.org
  2015-09-04 10:34 ` [Bug c/67435] Feature request: Implement align-loops attribute trippels at gcc dot gnu.org
  2015-10-20 13:09 ` yann.collet.73 at gmail dot com
  8 siblings, 0 replies; 10+ messages in thread
From: trippels at gcc dot gnu.org @ 2015-09-04 10:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #9 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
(In reply to Yann Collet from comment #8)
> However, for some reason, in the table provided, both Sandy Bridge and
> Haswell get a default loop alignment value of 16. not 32.
> 
> Is there a reason for that choice ?

These values are normally strait out of the Vendors manuals.
And there are also drawbacks to high alignment values.

> Less precise but still great, having the ability to set this optimization
> parameter for a function or a section code would be great. But my experiment
> seem to show that using #pragma or __attribute__ with align-loops does not
> work, as if the optimization setting was simply ignored.

Well, there already is an aligned attribute for functions, variables and
fields,
see: 
http://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/67435] Feature request: Implement align-loops attribute
  2015-09-02 13:24 [Bug c/67435] New: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect) yann.collet.73 at gmail dot com
                   ` (6 preceding siblings ...)
  2015-09-04 10:23 ` [Bug c/67435] Large performance drop on apparently unrelated changes (potential cause : hot loop alignment) trippels at gcc dot gnu.org
@ 2015-09-04 10:34 ` trippels at gcc dot gnu.org
  2015-10-20 13:09 ` yann.collet.73 at gmail dot com
  8 siblings, 0 replies; 10+ messages in thread
From: trippels at gcc dot gnu.org @ 2015-09-04 10:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

Markus Trippelsdorf <trippels at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-09-04
            Summary|Large performance drop on   |Feature request: Implement
                   |apparently unrelated        |align-loops attribute
                   |changes (potential cause :  |
                   |hot loop alignment)         |
     Ever confirmed|0                           |1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/67435] Feature request: Implement align-loops attribute
  2015-09-02 13:24 [Bug c/67435] New: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect) yann.collet.73 at gmail dot com
                   ` (7 preceding siblings ...)
  2015-09-04 10:34 ` [Bug c/67435] Feature request: Implement align-loops attribute trippels at gcc dot gnu.org
@ 2015-10-20 13:09 ` yann.collet.73 at gmail dot com
  8 siblings, 0 replies; 10+ messages in thread
From: yann.collet.73 at gmail dot com @ 2015-10-20 13:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435

--- Comment #10 from Yann Collet <yann.collet.73 at gmail dot com> ---
> there already is an aligned attribute for functions, variables and fields,

Sure, but none of them is related to aligning the start of an hot instruction
loop. Aligning the function instead looks like a poor proxy.

> there are also drawbacks to high alignment values

Yes. I could test that using -falign-loops=32 on a larger code base produces
drawbacks. Not just larger code size, worse speed speed.

This makes it all the more relevant to have the ability to select which loop
should be aligned, instead of relying on a unique program-wide compilation
flag.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-10-20 13:09 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-02 13:24 [Bug c/67435] New: Large performance drop on apparently unrelated changes (probable cause : strange inlining side-effect) yann.collet.73 at gmail dot com
2015-09-02 14:01 ` [Bug c/67435] " trippels at gcc dot gnu.org
2015-09-02 14:03 ` pinskia at gcc dot gnu.org
2015-09-02 14:51 ` yann.collet.73 at gmail dot com
2015-09-03  2:44 ` yann.collet.73 at gmail dot com
2015-09-03  7:19 ` [Bug c/67435] Large performance drop on apparently unrelated changes (potential cause : critical loop instruction alignment) trippels at gcc dot gnu.org
2015-09-03 18:47 ` yann.collet.73 at gmail dot com
2015-09-04 10:23 ` [Bug c/67435] Large performance drop on apparently unrelated changes (potential cause : hot loop alignment) trippels at gcc dot gnu.org
2015-09-04 10:34 ` [Bug c/67435] Feature request: Implement align-loops attribute trippels at gcc dot gnu.org
2015-10-20 13:09 ` yann.collet.73 at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).