From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 21687 invoked by alias); 16 Feb 2013 19:12:30 -0000
Received: (qmail 21397 invoked by uid 48); 16 Feb 2013 19:12:09 -0000
From: "jake.stine at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/54073] [4.7 Regression] SciMark Monte Carlo test performance has seriously decreased in recent GCC releases
Date: Sat, 16 Feb 2013 19:12:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: jake.stine at gmail dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 4.7.3
X-Bugzilla-Changed-Fields: CC
Message-ID:
In-Reply-To:
References:
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2013-02/txt/msg01709.txt.bz2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54073

Jake Stine changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jake.stine at gmail dot com

--- Comment #16 from Jake Stine 2013-02-16 19:12:05 UTC ---
Hi,

I have done quite a bit of analysis on cmov performance across x86
architectures, so I will share here in case it helps:

Quick summary: Conditional moves on Intel Core/Xeon and AMD Bulldozer
architectures should probably be avoided "as a rule."

History: Conditional moves were beneficial for the Intel Pentium 4, and also
(but less so) for AMD Athlon/Phenom chips.
In the AMD Athlon/Phenom case, the performance of cmov vs. cmp+branch is
determined more by the alignment of the branch target than by the prediction
rate of the branch. The instruction decoders would incur penalties on certain
types of unaligned branch targets (when taken), or when decoding sequences of
instructions that contained multiple branches within a 16-byte "fetch" window
(taken or not). cmov was sometimes handy for avoiding those.

With regard to the more current Intel Core and AMD Bulldozer/Bobcat
architectures:

I have found that use of conditional moves (cmov) is only beneficial if the
branch that the move is replacing is badly mispredicted. In my tests, cmov
only became clearly "optimal" when the branch was predicted correctly less
than 92% of the time, which is abysmal by modern branch predictor standards
and rarely occurs in practice. Above 97% prediction rates, cmov is typically
slower than cmp+branch. Inside loops that contain branches with prediction
rates approaching 100% (as is the case presented by the OP), cmov becomes a
severe performance bottleneck. This holds true for both Core and Bulldozer.
Bulldozer has less efficient branching than the i7, but is also severely
bottlenecked by its limited fetch/decode. Cmov requires executing more total
instructions, and that makes Bulldozer very unhappy.

Note that my tests involved relatively simple loops that did not suffer from
the added register pressure that cmov introduces. In practice, the prognosis
for cmov being "optimal" is even worse than what I've observed in a
controlled environment.

Furthermore, to my knowledge the status of cmov vs. branch performance on x86
will not be changing anytime soon. cmov will continue to be a liability well
into the next couple of architecture releases from Intel and AMD. Piledriver
will have added fetch/decode resources but should also have a smaller
mispredict penalty, so it's doubtful cmov will gain much advantage there
either.
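
As a rough illustration of the two code shapes being compared above, here is
a minimal C sketch of a loop with a loop-carried conditional update. It is
not taken from the bug report or from the tests described in this comment;
the function names max_branch and max_select are hypothetical, and both
assume n >= 1. The compiler is free to emit a branch or a cmov for either
version, so the generated assembly (for example from gcc -S) is what decides
which form is actually being measured.

```c
#include <stddef.h>

/* Hypothetical example: running maximum written with an explicit branch.
 * Once a large value has been seen, the condition is rarely true, so the
 * branch tends to be well predicted. Assumes n >= 1. */
int max_branch(const int *v, size_t n)
{
    int best = v[0];
    for (size_t i = 1; i < n; i++) {
        if (v[i] > best)
            best = v[i];
    }
    return best;
}

/* Hypothetical example: the same computation written as a select.
 * If this is lowered to cmov, every iteration depends on the previous
 * value of `best` through the cmov, whether or not it changes.
 * Assumes n >= 1. */
int max_select(const int *v, size_t n)
{
    int best = v[0];
    for (size_t i = 1; i < n; i++)
        best = (v[i] > best) ? v[i] : best;
    return best;
}
```

Both functions return the same value for the same input; only the
instruction sequence the compiler chooses for the conditional update can
differ.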
Therefore I would recommend setting -fno-tree-loop-if-convert for all -march
values matching the Intel Core and AMD Bulldozer/Bobcat families.

There is one good use case for cmov on x86: mispredicted conditions inside
loops. Currently there's no way to force that behavior in situations where I,
the programmer, am fully aware that the condition is chaotic/random. A
builtin cmov or condition hint would be nice. For now I'm forced to address
those (fortunately infrequent) situations via inline asm.
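
For readers unfamiliar with the inline asm workaround mentioned above, it
might look roughly like the following. This is only a sketch and not the
code used by the author of this comment: select_cmov is a hypothetical name,
the asm path assumes GCC-style extended asm on an x86 target with cmov
support, and other targets fall back to plain C.

```c
/* Hypothetical helper: returns a if cond is non-zero, otherwise b.
 * On x86 with GCC-style extended asm, the selection is done with an
 * explicit cmov so that no conditional branch is emitted for it. */
int select_cmov(int cond, int a, int b)
{
    int result = b;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ ("testl %1, %1\n\t"   /* set ZF when cond == 0            */
             "cmovnel %2, %0"     /* if cond != 0, result = a         */
             : "+r"(result)
             : "r"(cond), "r"(a)
             : "cc");
#else
    /* Portable fallback (assumption: no cmov guarantee here). */
    if (cond)
        result = a;
#endif
    return result;
}
```

A call such as select_cmov(x < y, x, y) would then pick the smaller of two
values without relying on how the optimizer lowers the condition.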