public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug other/49194] New: Trivially stupid inlining decisions
@ 2011-05-27 15:32 torvalds@linux-foundation.org
  2011-05-27 16:12 ` [Bug other/49194] " jakub at gcc dot gnu.org
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: torvalds@linux-foundation.org @ 2011-05-27 15:32 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

           Summary: Trivially stupid inlining decisions
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: torvalds@linux-foundation.org


So gcc inlining heuristics seem to include the rule that if it's called from a
single call-site, and is static, then it should be inlined.

That is actually really horrible for some cases. 

In particular, if you have a small function that has an uncommon case handled
by a much bigger function, you absolutely do *not* want to inline the big
function because it ends up creating a total monster of a stack frame -
something that the small function on its own doesn't need.

So for example, in the kernel we have various of these kinds of situations
where we decrement a refcount or a lock, and then in the unlikely situation
that something is true, we need to so much more processing as a result. An
example of this would be "__rcu_read_unlock()", which basically does

 - decrement the per-thread "rcu_read_lock_nesting" variable
 - if that variable is now zero, *and* we have a pending RCU event, we need to
do some special complicated stuff.

The code is basically written (we have various barriers and helper macros etc
that are irrelevant to the inlining issue) as

    --t->rcu_read_lock_nesting;
    if (t->rcu_read_lock_nesting == 0 &&
     __builtin_expect(!!t->rcu_read_unlock_special, 0))
        rcu_read_unlock_special(t);

(where "t" is the thread pointer).

And that's all. However, because gcc inlines the special case, the function
prologue ends up looking like this (that "current_task" is from our inline asm,
it's just us getting the thread variable):

  __rcu_read_unlock:
        pushq   %rbp    #
        movq    %rsp, %rbp      #,
        subq    $48, %rsp       #,
        movq    %r13, -24(%rbp) #,
        movq    %rbx, -40(%rbp) #,
        movq    %r12, -32(%rbp) #,
        movq    %r14, -16(%rbp) #,
        movq    %r15, -8(%rbp)  #,
        movq %gs:current_task,%r13      #, t
        decl    256(%r13)       # t_15->rcu_read_lock_nesting
        ...

which is pretty horrid. It uses a lot of stack, and has stupid useless
instructions.

Now, I can just mark that rcu_read_unlock_special() function as "noinline", and
as a result I get this:

__rcu_read_unlock:
        pushq   %rbp    #
        movq    %rsp, %rbp      #,
        movq %gs:current_task,%rdi      #, t
        decl    256(%rdi)       # t_15->rcu_read_lock_nesting
        ...

which is obviously much superior code for the case that actually matters.

So rather than have to find all of these things manually (yes, those things do
actually show up on profiles - the stack expansion in particular ends up
showing up as extra costs), maybe gcc could just have a simple check:

 - if the call is in an unlikely section
 - .. and the callee is much bigger (in stack frame or whatever) than the
caller
 - .. simply don't inline

Hmm? I realize that this may sound specialized, but I've been looking at kernel
profiles for the last few weeks now, and this particular issue has come up in
two independent hotspots. It turns out that excessive stack use is *expensive*.
It's not just the normal "the kernel stack is limited", it's actually a matter
of "the L1 D$ is pretty small, and a big stack frame actually causes real
costs".

I really didn't expect register saves on the stack to show up in my profiles,
but they actually do. I expect the stack to basically always be in the L1 D$
and dirty (so that an access to it should be almost free), but that is only
true for a _dense_ stack. Not for leaf functions that then have extra stack
frame wasting code in them.

Comments?


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2012-03-25  9:44 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-27 15:32 [Bug other/49194] New: Trivially stupid inlining decisions torvalds@linux-foundation.org
2011-05-27 16:12 ` [Bug other/49194] " jakub at gcc dot gnu.org
2011-05-27 16:15 ` matz at gcc dot gnu.org
2011-05-27 16:20 ` hubicka at ucw dot cz
2011-05-27 16:25 ` hubicka at gcc dot gnu.org
2011-05-27 16:32 ` jakub at gcc dot gnu.org
2011-05-27 16:45 ` torvalds@linux-foundation.org
2011-05-27 18:52 ` hubicka at ucw dot cz
2012-03-25  9:57 ` hubicka at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).