From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 17942 invoked by alias); 17 Feb 2015 02:56:04 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 17869 invoked by uid 48); 17 Feb 2015 02:56:00 -0000
From: "solar-gcc at openwall dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/51017] GCC 4.6 performance regression (vs. 4.4/4.5), PRE increases register pressure
Date: Tue, 17 Feb 2015 02:56:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 4.6.2
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: solar-gcc at openwall dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2015-02/txt/msg01846.txt.bz2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #13 from Alexander Peslyak ---
(In reply to Richard Biener from comment #11)
> We are putting quite heavy register-pressure on the thing by means of
> partial redundancy elimination, thus disabling PRE using -fno-tree-pre
> might help (we still spill a lot).

It looks like -fno-tree-pre or equivalent was implied in the options I was using, which were "-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions" - yes, with -Os added after -O2 when compiling this specific source file. IIRC, this was experimentally derived as producing best performance with 4.6.x or older.
Adding -fno-tree-pre after all of these options merely changes the label names in the generated assembly code, while resulting in identical object files (and obviously no performance change).

Also, I now realize -Os was probably the reason why GCC preferred SSE "floating-point" bitwise ops and MOVs here, instead of SSE2's integer ones (they have longer encodings). Omitting -Os results in usage of the SSE2 instructions (both bitwise and MOVs), with correspondingly larger code. And yes, when I omit -Os, I do need to add -fno-tree-pre to regain roughly the same performance, and then to s/movdqu/movdqa/g to regain almost the full speed (movdqu is just as slow as movups on this CPU).

I've just tested all of this with GCC 4.8.4 to possibly match yours (you mentioned you used 4.8).

So I think you uncovered yet another performance regression I had already worked around with -Os.

FWIW, here are the generated assembly code sizes ("wc" output) with GCC 4.8.4:

-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions
  5870  17420 137636 1.s
-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions -fno-tree-pre
  5870  17420 137636 2.s
-O2 -fomit-frame-pointer -funroll-loops -finline-functions
  6814  20193 156837 a.s
-O2 -fomit-frame-pointer -funroll-loops -finline-functions -fno-tree-pre
  6028  17842 138284 b.s

As you can see, -fno-tree-pre reduces the size almost to the -Os level. (But the .text size would be significantly larger because of the SSE2 instruction encodings. This is why I show the assembly code sizes for this comparison.)
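
To put a number on "almost to the -Os level", here is a minimal Python sketch that computes how much each assembly file grows relative to the -Os build (1.s). The three numbers per file are the "wc" columns (lines, words, bytes) copied from the table above. The names SIZES and growth_percent are made up for this illustration and are not part of any tool used in this bug.

```python
# "wc" output (lines, words, bytes) per generated assembly file, copied from
# the GCC 4.8.4 table in this comment.
SIZES = {
    "1.s": (5870, 17420, 137636),  # -O2 ... -Os
    "2.s": (5870, 17420, 137636),  # -O2 ... -Os -fno-tree-pre
    "a.s": (6814, 20193, 156837),  # -O2 ... (no -Os)
    "b.s": (6028, 17842, 138284),  # -O2 ... (no -Os) -fno-tree-pre
}


def growth_percent(name, baseline="1.s"):
    """Return (lines, words, bytes) growth of `name` over `baseline`, in percent."""
    return tuple(
        round(100.0 * (new - old) / old, 2)
        for new, old in zip(SIZES[name], SIZES[baseline])
    )


if __name__ == "__main__":
    for name in ("2.s", "a.s", "b.s"):
        lines, words, size = growth_percent(name)
        print(f"{name}: lines {lines:+.2f}%, words {words:+.2f}%, bytes {size:+.2f}%")
```

By this measure a.s is roughly 14% larger than 1.s in bytes, while b.s is under 0.5% larger. As noted above, these are assembly source sizes, not .text sizes.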