From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-477513-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 17942 invoked by alias); 17 Feb 2015 02:56:04 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 17869 invoked by uid 48); 17 Feb 2015 02:56:00 -0000
From: "solar-gcc at openwall dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/51017] GCC 4.6 performance regression (vs. 4.4/4.5), PRE increases register pressure
Date: Tue, 17 Feb 2015 02:56:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 4.6.2
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: solar-gcc at openwall dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID: <bug-51017-4-U2e2oZRr0a@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-51017-4@http.gcc.gnu.org/bugzilla/>
References: <bug-51017-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2015-02/txt/msg01846.txt.bz2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017
--- Comment #13 from Alexander Peslyak <solar-gcc at openwall dot com> ---
(In reply to Richard Biener from comment #11)
> We are putting quite heavy register-pressure on the thing by means of
> partial redundancy elimination, thus disabling PRE using -fno-tree-pre
> might help (we still spill a lot).

It looks like -fno-tree-pre or its equivalent was already implied by the
options I was using, which were "-O2 -fomit-frame-pointer -Os -funroll-loops
-finline-functions" - yes, with -Os added after -O2 when compiling this
specific source file.  IIRC, this combination was derived experimentally as
producing the best performance with 4.6.x or older.  Adding -fno-tree-pre after
all of these options merely changes the label names in the generated assembly
code and results in identical object files (and obviously no performance
change).

Also, I now realize -Os was probably the reason why GCC preferred SSE
"floating-point" bitwise ops and MOVs here, instead of SSE2's integer ones
(which have longer encodings).  Omitting -Os results in usage of the SSE2
instructions (both bitwise and MOVs), with correspondingly larger code.  And
yes, when I omit -Os, I do need to add -fno-tree-pre to regain roughly the same
performance, and then to apply s/movdqu/movdqa/g to regain almost the full
speed (movdqu is just as slow as movups on this CPU).

I've just tested all of this with GCC 4.8.4 to possibly match yours (you
mentioned you used 4.8).  So I think you uncovered yet another performance
regression that I had already worked around with -Os.
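
For illustration only, the s/movdqu/movdqa/g step mentioned above could be
scripted roughly as below.  This is a sketch, not the command actually used:
the function name and the sample assembly are made up for the example.  Note
that movdqa faults on memory operands that are not 16-byte aligned, so the
substitution is only safe when all the affected data is known to be aligned.

```python
import re

# Match the mnemonic as a whole word only, so that e.g. the AVX form
# "vmovdqu" is left untouched by this simple rewrite.
_MOVDQU = re.compile(r"\bmovdqu\b")


def force_aligned_moves(asm_text):
    """Return (rewritten_text, replacement_count) for a GCC .s listing.

    Hypothetical helper: it blindly assumes every movdqu operand is
    16-byte aligned, which the caller has to guarantee.
    """
    return _MOVDQU.subn("movdqa", asm_text)


# Made-up sample input, not taken from the real generated code.
sample = (
    "\tmovdqu\t(%rsi), %xmm0\n"
    "\tpxor\t%xmm1, %xmm0\n"
    "\tmovdqu\t%xmm0, 16(%rdi)\n"
)
rewritten, count = force_aligned_moves(sample)
```

The replacement count makes it easy to confirm that the listing actually
contained unaligned moves before timing the result.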

FWIW, here are the generated assembly code sizes ("wc" output: lines, words,
bytes) with GCC 4.8.4:

-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions
  5870  17420 137636 1.s
-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions -fno-tree-pre
  5870  17420 137636 2.s
-O2 -fomit-frame-pointer -funroll-loops -finline-functions
  6814  20193 156837 a.s
-O2 -fomit-frame-pointer -funroll-loops -finline-functions -fno-tree-pre
  6028  17842 138284 b.s

As you can see, -fno-tree-pre reduces the size almost to the -Os level (138284
vs. 137636 bytes, down from 156837 without it).  (But the .text size would be
significantly larger because of the SSE2 instruction encodings.  This is why I
show the assembly code sizes for this comparison.)
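
To put rough numbers on that comparison, the relative differences can be
computed from the "wc" figures quoted above.  The snippet below is just a
sketch of that arithmetic: the dictionary layout and the helper name are
chosen for the example, while the numbers are copied from the table.

```python
# (lines, words, bytes) per file, copied from the "wc" output above.
sizes = {
    "1.s": (5870, 17420, 137636),  # ... -Os ...
    "2.s": (5870, 17420, 137636),  # ... -Os ... -fno-tree-pre
    "a.s": (6814, 20193, 156837),  # no -Os
    "b.s": (6028, 17842, 138284),  # no -Os, -fno-tree-pre
}


def pct_change(new, old):
    """Percentage change from old to new (negative means smaller)."""
    return 100.0 * (new - old) / old


# Index 2 selects the byte count.
pre_effect = pct_change(sizes["b.s"][2], sizes["a.s"][2])
gap_to_os = pct_change(sizes["b.s"][2], sizes["1.s"][2])

print(f"-fno-tree-pre without -Os: {pre_effect:+.1f}% bytes")
print(f"remaining gap to -Os:      {gap_to_os:+.2f}% bytes")
```

In other words, without -Os the option shrinks the assembly by roughly 12%,
leaving it within about half a percent of the -Os output, while with -Os it
makes no difference in size at all.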