public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug other/49194] New: Trivially stupid inlining decisions
@ 2011-05-27 15:32 torvalds@linux-foundation.org
  2011-05-27 16:12 ` [Bug other/49194] " jakub at gcc dot gnu.org
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: torvalds@linux-foundation.org @ 2011-05-27 15:32 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

           Summary: Trivially stupid inlining decisions
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: torvalds@linux-foundation.org


So gcc inlining heuristics seem to include the rule that if it's called from a
single call-site, and is static, then it should be inlined.

That is actually really horrible for some cases. 

In particular, if you have a small function that has an uncommon case handled
by a much bigger function, you absolutely do *not* want to inline the big
function because it ends up creating a total monster of a stack frame -
something that the small function on its own doesn't need.

So for example, in the kernel we have various of these kinds of situations
where we decrement a refcount or a lock, and then in the unlikely situation
that something is true, we need to so much more processing as a result. An
example of this would be "__rcu_read_unlock()", which basically does

 - decrement the per-thread "rcu_read_lock_nesting" variable
 - if that variable is now zero, *and* we have a pending RCU event, we need to
do some special complicated stuff.

The code is basically written (we have various barriers and helper macros etc
that are irrelevant to the inlining issue) as

    --t->rcu_read_lock_nesting;
    if (t->rcu_read_lock_nesting == 0 &&
     __builtin_expect(!!t->rcu_read_unlock_special, 0))
        rcu_read_unlock_special(t);

(where "t" is the thread pointer).

And that's all. However, because gcc inlines the special case, the function
prologue ends up looking like this (that "current_task" is from our inline asm,
it's just us getting the thread variable):

  __rcu_read_unlock:
        pushq   %rbp    #
        movq    %rsp, %rbp      #,
        subq    $48, %rsp       #,
        movq    %r13, -24(%rbp) #,
        movq    %rbx, -40(%rbp) #,
        movq    %r12, -32(%rbp) #,
        movq    %r14, -16(%rbp) #,
        movq    %r15, -8(%rbp)  #,
        movq %gs:current_task,%r13      #, t
        decl    256(%r13)       # t_15->rcu_read_lock_nesting
        ...

which is pretty horrid. It uses a lot of stack, and has stupid useless
instructions.

Now, I can just mark that rcu_read_unlock_special() function as "noinline", and
as a result I get this:

__rcu_read_unlock:
        pushq   %rbp    #
        movq    %rsp, %rbp      #,
        movq %gs:current_task,%rdi      #, t
        decl    256(%rdi)       # t_15->rcu_read_lock_nesting
        ...

which is obviously much superior code for the case that actually matters.

So rather than have to find all of these things manually (yes, those things do
actually show up on profiles - the stack expansion in particular ends up
showing up as extra costs), maybe gcc could just have a simple check:

 - if the call is in an unlikely section
 - .. and the callee is much bigger (in stack frame or whatever) than the
caller
 - .. simply don't inline

Hmm? I realize that this may sound specialized, but I've been looking at kernel
profiles for the last few weeks now, and this particular issue has come up in
two independent hotspots. It turns out that excessive stack use is *expensive*.
It's not just the normal "the kernel stack is limited", it's actually a matter
of "the L1 D$ is pretty small, and a big stack frame actually causes real
costs".

I really didn't expect register saves on the stack to show up in my profiles,
but they actually do. I expect the stack to basically always be in the L1 D$
and dirty (so that an access to it should be almost free), but that is only
true for a _dense_ stack. Not for leaf functions that then have extra stack
frame wasting code in them.

Comments?


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug other/49194] Trivially stupid inlining decisions
  2011-05-27 15:32 [Bug other/49194] New: Trivially stupid inlining decisions torvalds@linux-foundation.org
@ 2011-05-27 16:12 ` jakub at gcc dot gnu.org
  2011-05-27 16:15 ` matz at gcc dot gnu.org
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: jakub at gcc dot gnu.org @ 2011-05-27 16:12 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.05.27 16:02:09
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |jakub at gcc dot gnu.org
     Ever Confirmed|0                           |1

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-05-27 16:02:09 UTC ---
I agree if the called function is big and it is very unlikely (most probably
just in PROB_VERY_UNLIKELY cases) -finline-functions-called-once shouldn't
inline.

E.g.
void baz (void);

static void
foo (int x)
{
  char buf[65536];
  int a, b, c, d, e, f, g, h, i, j, k, l, m;
#define A asm volatile ("nop" : "+r" (x), "=r" (a), "=r" (b), "=r" (c), \
"=r" (d), "=r" (e), "=r" (f), "=r" (g), "=r" (h), \
"=r" (i), "=r" (j), "=r" (k), "=r" (l), "=r" (m) \
: "r" (buf));
#define B A A A A A A A A A A A
#define C B B B B B B B B B B B
  C
}

int
bar (int x)
{
  if (__builtin_expect (x > 65, 0))
    foo (x + 6);
  baz ();
  return x;
}

inlines foo at -O2.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug other/49194] Trivially stupid inlining decisions
  2011-05-27 15:32 [Bug other/49194] New: Trivially stupid inlining decisions torvalds@linux-foundation.org
  2011-05-27 16:12 ` [Bug other/49194] " jakub at gcc dot gnu.org
@ 2011-05-27 16:15 ` matz at gcc dot gnu.org
  2011-05-27 16:20 ` hubicka at ucw dot cz
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: matz at gcc dot gnu.org @ 2011-05-27 16:15 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

Michael Matz <matz at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matz at gcc dot gnu.org

--- Comment #2 from Michael Matz <matz at gcc dot gnu.org> 2011-05-27 16:12:12 UTC ---
For adjusting GCCs idea about what stack frame growth is acceptable also 
fiddle with the large-stack-frame and large-stack-frame-growth parameters.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug other/49194] Trivially stupid inlining decisions
  2011-05-27 15:32 [Bug other/49194] New: Trivially stupid inlining decisions torvalds@linux-foundation.org
  2011-05-27 16:12 ` [Bug other/49194] " jakub at gcc dot gnu.org
  2011-05-27 16:15 ` matz at gcc dot gnu.org
@ 2011-05-27 16:20 ` hubicka at ucw dot cz
  2011-05-27 16:25 ` hubicka at gcc dot gnu.org
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: hubicka at ucw dot cz @ 2011-05-27 16:20 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

--- Comment #3 from Jan Hubicka <hubicka at ucw dot cz> 2011-05-27 16:15:02 UTC ---
> I agree if the called function is big and it is very unlikely (most probably
> just in PROB_VERY_UNLIKELY cases) -finline-functions-called-once shouldn't
> inline.

-finline-functions-called-once  is trottled down by the large-function-growth
and large-stack-frame-growth limits. The  Kernel case coupld proably be handled
by the second. Does kernel bump down that limits?
It still won't help in case function doesn't have any on-stack aggregates,
since we optimistically assume that all gimple registers will disappear.
Probably
even that could be change, though estimating reload's stack frame usage so
early would
be iffy.

I agree that both limits are bit high for this particular case. However I
experimented with bumping this down some time ago with an extra parameter and
did not find any practictical benefit that would justify adding yet another
limit to the inliner.

So testcases would be welcome.
Honza


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug other/49194] Trivially stupid inlining decisions
  2011-05-27 15:32 [Bug other/49194] New: Trivially stupid inlining decisions torvalds@linux-foundation.org
                   ` (2 preceding siblings ...)
  2011-05-27 16:20 ` hubicka at ucw dot cz
@ 2011-05-27 16:25 ` hubicka at gcc dot gnu.org
  2011-05-27 16:32 ` jakub at gcc dot gnu.org
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: hubicka at gcc dot gnu.org @ 2011-05-27 16:25 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

--- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-05-27 16:19:04 UTC ---
BTW mainline won't inline foo in that testcase:

Deciding on functions called once:
  not inlinable: bar/1 -> foo/0, --param large-stack-frame-growth limit reached

I fixed some stack-growth estimation issues while rewriting inliner for 4.7, if
4.6 inlines, perhaps I can figure out what gets wrong there.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug other/49194] Trivially stupid inlining decisions
  2011-05-27 15:32 [Bug other/49194] New: Trivially stupid inlining decisions torvalds@linux-foundation.org
                   ` (3 preceding siblings ...)
  2011-05-27 16:25 ` hubicka at gcc dot gnu.org
@ 2011-05-27 16:32 ` jakub at gcc dot gnu.org
  2011-05-27 16:45 ` torvalds@linux-foundation.org
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: jakub at gcc dot gnu.org @ 2011-05-27 16:32 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-05-27 16:24:45 UTC ---
Oops, s/65536/128/, I've changed the testcase too late without retesting.

Anyway, the point is that the limits should be adjusted somewhat if the call is
PROB_VERY_UNLIKELY.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug other/49194] Trivially stupid inlining decisions
  2011-05-27 15:32 [Bug other/49194] New: Trivially stupid inlining decisions torvalds@linux-foundation.org
                   ` (4 preceding siblings ...)
  2011-05-27 16:32 ` jakub at gcc dot gnu.org
@ 2011-05-27 16:45 ` torvalds@linux-foundation.org
  2011-05-27 18:52 ` hubicka at ucw dot cz
  2012-03-25  9:57 ` hubicka at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: torvalds@linux-foundation.org @ 2011-05-27 16:45 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

--- Comment #6 from Linus Torvalds <torvalds@linux-foundation.org> 2011-05-27 16:38:22 UTC ---
(In reply to comment #3)
> 
> -finline-functions-called-once  is trottled down by the large-function-growth
> and large-stack-frame-growth limits. The  Kernel case coupld proably be handled
> by the second. Does kernel bump down that limits?

We used to play with inlining limits (gcc had some really bad decisions), but
the meaning of the numbers kept changing from one gcc version to another, and
the heuristics gcc used kept changing too. Which made it practically impossible
to use sanely - you could tweak it for one particular architecture, and one
particular version of gcc, but it would then be worse for others.

Quite frankly, with that kind of history, I'm not very eager to start playing
around with random gcc internal variables again.

So I'd much rather have gcc have good heuristics by default, possibly helped by
the kinds of obvious hints we can give ("unlikely()" in particular is something
we can add for things like this).

Obviously, we can (and do) use the "force the decision" with either "noinline"
or "__always_inline" (which are just the kernel macros to make the gcc
attribute syntax slightly more readable), but since I've been doing those other
bug reports about bad gcc code generation, I thought I'd point out this one
too.

> It still won't help in case function doesn't have any on-stack aggregates,
> since we optimistically assume that all gimple registers will disappear.
> Probably
> even that could be change, though estimating reload's stack frame usage so
> early would
> be iffy.

Yes, early stack estimation might not work all that well.

In the kernel, we do end up having a few complex functions that we basically
expect to inline to almost nothing - simply because we end up depending on
compile-time constant issues (sometimes very explicitly, with
__builtin_constant_p() followed by a largish "switch ()" statement).

That said, this is something where the call-site really can make a big
difference. Not just the fact that the call site might be marked "unlikely()"
(again, that's just the kernel making __builtin_expect() readable), but things
like "none of the arguments are constants" could easily be a good heuristic to
use as a basis for whether to inline or not.

IOW, start out with whatever 'large-stack-frame-growth' and
'large-function-growth' values, but if the call-site is in an unlikely region,
cut those values in half (or whatever). And if none of the arguments are
constants, cut it in half again.

This is an example of why giving these limits as compiler options really
doesn't work: the choice should probably be much more dynamic than just a
single number.

I dunno. As mentioned, we can fix this problem by just marking things noinline
by hand. But I do think that there are fairly obvious cases where inlining
really isn't worth it, and gcc might as well just get those cases right.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug other/49194] Trivially stupid inlining decisions
  2011-05-27 15:32 [Bug other/49194] New: Trivially stupid inlining decisions torvalds@linux-foundation.org
                   ` (5 preceding siblings ...)
  2011-05-27 16:45 ` torvalds@linux-foundation.org
@ 2011-05-27 18:52 ` hubicka at ucw dot cz
  2012-03-25  9:57 ` hubicka at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: hubicka at ucw dot cz @ 2011-05-27 18:52 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

--- Comment #7 from Jan Hubicka <hubicka at ucw dot cz> 2011-05-27 18:47:56 UTC ---
> 
> We used to play with inlining limits (gcc had some really bad decisions), but
> the meaning of the numbers kept changing from one gcc version to another, and
> the heuristics gcc used kept changing too. Which made it practically impossible
> to use sanely - you could tweak it for one particular architecture, and one
> particular version of gcc, but it would then be worse for others.

Well, the --param limits are really meant to be more for GCC development than
for being adjusted to random values by random codebases, so this won't really
help indeed. The heuristics needs to keep envolving since the expectations
on it are changing,too (i.e. 10 years ago people didn't really cared about sane
boost/pooma performance and just year ago no one really loaded 200000 functions
into compiler at once as we do now with LTO on mozilla).

However concerning the stack growth, this was implemented on request of Andi
Kleen for the kernel, since the stack size constraints are special. There
is -fconserve-stack argument to avoid need to use magic --param values (that
are not that magic for stack usage after all given that they count bytes).
> 
> So I'd much rather have gcc have good heuristics by default, possibly helped by
> the kinds of obvious hints we can give ("unlikely()" in particular is something
> we can add for things like this).

Agreed.  In cases like this you might find handy the cold attribute that has
the effect of making function optimized for size as well as telling branch
predictor to predicts calls leading to call of cold function as unlikely.
It would save need sfor some of the unlikely() calls.
> 
> Yes, early stack estimation might not work all that well.

Yep, one would have to do something like computing the number of live
values at given place of function that is doable but moderately expensive
for such a special purpose situation.
> 
> That said, this is something where the call-site really can make a big
> difference. Not just the fact that the call site might be marked "unlikely()"
> (again, that's just the kernel making __builtin_expect() readable), but things
> like "none of the arguments are constants" could easily be a good heuristic to
> use as a basis for whether to inline or not.
> 
> IOW, start out with whatever 'large-stack-frame-growth' and
> 'large-function-growth' values, but if the call-site is in an unlikely region,
> cut those values in half (or whatever). And if none of the arguments are
> constants, cut it in half again.
> 
> This is an example of why giving these limits as compiler options really
> doesn't work: the choice should probably be much more dynamic than just a
> single number.
> 
> I dunno. As mentioned, we can fix this problem by just marking things noinline
> by hand. But I do think that there are fairly obvious cases where inlining
> really isn't worth it, and gcc might as well just get those cases right.

Well, this model won't work as suggested. The problem is that in GCC inliner
simplistic POV inlining is always a win, unless it hits the large
function/stack/unit bounds. Only tradeoff it understand is the code size
growth.

Here inlining causes regalloc to produce bigger stack frame because it is
stupid and doesn't know how to do shrink wraping.  (Bernd recently has posted
patches for this, I duno about their status and if they will help here) This is
important because the hot path through the outer function is extremely short
and the outer function is simple so it won't need the registers otherwise.

The large function/unit limits don't really worry about actual code quality,
just the fact that we don't want non-linear algorithms in the compiler to
become too prominent.  So the starting values are high for this purpose. Large
function is currently set to amusing constant of 2700insns. Dividing by 4 won't
do much help. The real reason is that we are really mixing two concepts
together (compiler nonlinearity and code quality considerations). This is not
good idea.

Some years ago I introduced the notion of hot and cold basic blocks to the
GCC inliner and told it to not inline functions into cold basic blocks unless
caller size (or overall program size) is expected to shrink. This has also
introduced number of regression I had to get through.  (think of not inlining
destructor of object in EH handling code that prevents the object from being
scalar replaced and optimized away).

At the moment I can come up with the following suggestions:

1) inline functions called once from cold basic blocks only when they are small
so caller size will shring (like inlining of small functions does)
2) introduce new function body size limit used only for cold functions called
once
3) try to somehow get very rough stack frame estimate into our current stack
frame growth limits.

I guess I will try 1) and see how it affects other code and if it is not
too bad we can stay with it.
I think 3) won't work in practice as stack growth limits are too large by
default and we really worry about cost of prologue rather than cost of stack
frame. 2) might be alternative if everything fails, its negative size is need
for another parameter and we already have too many.

In any case thanks for analysis and PR. I worried about this scenario at a time
inlining functions called once was introduced (about GCC 3.4 I think)
but since I did not find any benchmark that would regress because of this I
decided to worry about something else.  Actually I think this is the first PR
related to the topic of stack grow that is rather surprising. (we already
solved
problem with the outrageous stack frame growth that hit glibc and fact that
inlining sometimes makes us to mispredict hot part of program as cold because
there is large loop nest somewhere else that hit some fortran benchmarks)

Honza


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug other/49194] Trivially stupid inlining decisions
  2011-05-27 15:32 [Bug other/49194] New: Trivially stupid inlining decisions torvalds@linux-foundation.org
                   ` (6 preceding siblings ...)
  2011-05-27 18:52 ` hubicka at ucw dot cz
@ 2012-03-25  9:57 ` hubicka at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-03-25  9:57 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

--- Comment #8 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-03-25 09:43:46 UTC ---
GCC 4.7 has now shrink wrapping that should reduce effect of inlining large
cold functions called once.  Realistic testcases where we still kill code
quality would be welcome.

I tested patch to disable inlining once for cold calls, but it does not help in
general. What happens is that we stop inlining
constructors/destructors/initialization loops that eventually kills code
quality of some benchmarks since known values are no longer propagated.  I will
do more tunning of this for 4.8.

Honza


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2012-03-25  9:44 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-27 15:32 [Bug other/49194] New: Trivially stupid inlining decisions torvalds@linux-foundation.org
2011-05-27 16:12 ` [Bug other/49194] " jakub at gcc dot gnu.org
2011-05-27 16:15 ` matz at gcc dot gnu.org
2011-05-27 16:20 ` hubicka at ucw dot cz
2011-05-27 16:25 ` hubicka at gcc dot gnu.org
2011-05-27 16:32 ` jakub at gcc dot gnu.org
2011-05-27 16:45 ` torvalds@linux-foundation.org
2011-05-27 18:52 ` hubicka at ucw dot cz
2012-03-25  9:57 ` hubicka at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).