From: "torvalds@linux-foundation.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug other/49194] New: Trivially stupid inlining decisions
Date: Fri, 27 May 2011 15:32:00 -0000

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

           Summary: Trivially stupid inlining decisions
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: torvalds@linux-foundation.org


So gcc inlining heuristics seem to include the rule that if a function is
called from a single call-site, and is static, then it should be inlined.

That is actually really horrible for some cases. In particular, if you have a
small function that has an uncommon case handled by a much bigger function,
you absolutely do *not* want to inline the big function, because it ends up
creating a total monster of a stack frame - something that the small function
on its own doesn't need.

So for example, in the kernel we have various situations of this kind, where
we drop a refcount or a lock, and then, in the unlikely case that some
condition is true, we need to do much more processing as a result.

An example of this would be "__rcu_read_unlock()", which basically does

 - decrement the per-thread "rcu_read_lock_nesting" variable
 - if that variable is now zero, *and* we have a pending RCU event, we need
   to do some special complicated stuff.

The code is basically written (we have various barriers and helper macros etc
that are irrelevant to the inlining issue) as

	--t->rcu_read_lock_nesting;
	if (t->rcu_read_lock_nesting == 0 &&
	    __builtin_expect(!!t->rcu_read_unlock_special, 0))
		rcu_read_unlock_special(t);

(where "t" is the thread pointer). And that's all.

However, because gcc inlines the special case, the function prologue ends up
looking like this (that "current_task" is from our inline asm, it's just us
getting the thread variable):

__rcu_read_unlock:
	pushq	%rbp	#
	movq	%rsp, %rbp	#,
	subq	$48, %rsp	#,
	movq	%r13, -24(%rbp)	#,
	movq	%rbx, -40(%rbp)	#,
	movq	%r12, -32(%rbp)	#,
	movq	%r14, -16(%rbp)	#,
	movq	%r15, -8(%rbp)	#,
	movq	%gs:current_task,%r13	#, t
	decl	256(%r13)	# t_15->rcu_read_lock_nesting
	...

which is pretty horrid. It uses a lot of stack, and executes a pile of
useless instructions in the common case.
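
For anyone who wants to play with this outside the kernel, here's a minimal
stand-alone sketch of the pattern (not the real kernel code - the struct, the
function names and the big local buffer are made up purely to give the slow
path a large stack frame):

struct ctx {
	int nesting;
	int special;
};

/* Slow path: the volatile buffer forces a real ~256-byte stack frame */
static void do_slow_path(struct ctx *c)
{
	volatile char buf[256];
	int i;

	for (i = 0; i < 256; i++)
		buf[i] = (char)i;
	c->special = buf[0];
}

/* Fast path: decrement, and only rarely take the slow path */
void my_unlock(struct ctx *c)
{
	--c->nesting;
	if (c->nesting == 0 && __builtin_expect(!!c->special, 0))
		do_slow_path(c);	/* static, single call site -> inlined */
}

Compile that with something like "gcc -O2 -S" and look at the prologue of
my_unlock(): with the slow path inlined, the stack reservation for its buffer
tends to land in the prologue, i.e. on the fast path, exactly like the
__rcu_read_unlock() case above.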
Now, I can just mark that rcu_read_unlock_special() function as "noinline",
and as a result I get this:

__rcu_read_unlock:
	pushq	%rbp	#
	movq	%rsp, %rbp	#,
	movq	%gs:current_task,%rdi	#, t
	decl	256(%rdi)	# t_15->rcu_read_lock_nesting
	...

which is obviously much superior code for the case that actually matters.

So rather than having to find all of these things manually (yes, they really
do show up in profiles - the stack expansion in particular ends up showing up
as extra cost), maybe gcc could just have a simple check:

 - if the call is in an unlikely section
 - .. and the callee is much bigger (in stack frame or whatever) than the
   caller
 - .. simply don't inline

Hmm?

I realize that this may sound specialized, but I've been looking at kernel
profiles for the last few weeks now, and this particular issue has come up in
two independent hotspots. It turns out that excessive stack use is
*expensive*.

It's not just the usual "the kernel stack is limited" concern; it's that the
L1 D$ is pretty small, and a big stack frame causes real costs. I really
didn't expect register saves on the stack to show up in my profiles, but they
actually do. I expect the stack to basically always be in the L1 D$ and dirty
(so that an access to it should be almost free), but that is only true for a
_dense_ stack - not for leaf functions that carry extra stack-frame-wasting
code in them.

Comments?
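
P.S. To be explicit about the workaround in terms of the stand-alone sketch
above: the fix is just the standard gcc function attribute (which is what the
kernel's "noinline" helper boils down to):

/* Same slow path as in the sketch above, but explicitly kept out of line
 * so its 256-byte buffer stays off the fast path's stack frame */
static void __attribute__((noinline)) do_slow_path(struct ctx *c)
{
	volatile char buf[256];
	int i;

	for (i = 0; i < 256; i++)
		buf[i] = (char)i;
	c->special = buf[0];
}

With that one change, the prologue of my_unlock() stops reserving stack for
the slow path's buffer, which is the whole point.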