From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 11425 invoked by alias); 6 Feb 2008 13:29:05 -0000 Received: (qmail 11259 invoked by uid 48); 6 Feb 2008 13:28:21 -0000 Date: Wed, 06 Feb 2008 13:29:00 -0000 Message-ID: <20080206132821.11258.qmail@sourceware.org> X-Bugzilla-Reason: CC References: Subject: [Bug tree-optimization/33761] non-optimal inlining heuristics pessimizes gzip SPEC score at -O3 In-Reply-To: Reply-To: gcc-bugzilla@gcc.gnu.org To: gcc-bugs@gcc.gnu.org From: "hubicka at gcc dot gnu dot org" Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-bugs-owner@gcc.gnu.org X-SW-Source: 2008-02/txt/msg00669.txt.bz2 ------- Comment #17 from hubicka at gcc dot gnu dot org 2008-02-06 13:28 ------- One problem is the following: do { ; match = window + cur_match; if (match[best_len] != scan_end || match[best_len-1] != scan_end1 || *match != *scan || *++match != scan[1]) continue; scan += 2, match++; do { } while (*++scan == *++match && *++scan == *++match && *++scan == *++match && *++scan == *++match && *++scan == *++match && *++scan == *++match && *++scan == *++match && *++scan == *++match && scan < strend); .... The internal loop is the string comparsion thingy, while the branch prediction logic completely misses it: the continue statement looks like it is forming 4 nested loops, so it concludes that this is the internal loop. We used to have prediction heuristic guessing that continue statement is not used to form a loop. This was killed when gimplification was introduced. Perhaps we should bring it back, since this is resonably common scenario. Looking at longest_match in not unrolled version, the "loops" formed by continue statement has frequencies: 298, 961, 2139, 3100, 6900, 1000 so every loop is predicted to iterate about twice. The outer real loop now gets frequency 92, ie small enough to be predicted as cold. The string comparsion loop now get freuqnecy 344, predicted to iterate 3 times (quite realistically). But because the frequency is so small we end up allocating one of the two pointers in memory: .L9: leal 1(%ecx), %eax movl %eax, -16(%ebp) movzbl 1(%ecx), %eax cmpb 1(%edx), %al jne .L8 leal 2(%ecx), %eax movl %eax, -16(%ebp) movzbl 2(%ecx), %eax cmpb 2(%edx), %al jne .L8 leal 3(%ecx), %eax movl %eax, -16(%ebp) movzbl 3(%ecx), %eax cmpb 3(%edx), %al jne .L8 leal 4(%ecx), %eax movl %eax, -16(%ebp) movzbl 4(%ecx), %eax cmpb 4(%edx), %al jne .L8 leal 5(%ecx), %eax movl %eax, -16(%ebp) movzbl 5(%ecx), %eax cmpb 5(%edx), %al jne .L8 This happens in offline copy of longest_match. The inline gets this detail right, but frequencies of the deflate functions are all crazy, naturally. I guess I should revive the patch for language scope branch predictors. Honza -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761