From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-481042-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 119336 invoked by alias); 20 Mar 2015 22:15:59 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 119282 invoked by uid 48); 20 Mar 2015 22:15:54 -0000
From: "linux at carewolf dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/65492] Bad optimization in -O3 due to if-conversion and/or unrolling
Date: Sat, 21 Mar 2015 02:09:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 5.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: linux at carewolf dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID: <bug-65492-4-VIKYMBm7fA@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-65492-4@http.gcc.gnu.org/bugzilla/>
References: <bug-65492-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2015-03/txt/msg02186.txt.bz2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65492
--- Comment #9 from Allan Jensen <linux at carewolf dot com> ---
Looking at the generated assembly, it does indeed appear that the only
difference is loop unrolling and if-conversion.

After testing on another machine (an old Phenom II, as opposed to the
Sandy Bridge), I can report that disabling tree-loop-if-convert, either
directly or indirectly via tree-loop-vectorize, lets -O3 regain all of the
speed difference to -O2 on the Phenom II.

My guess is that the small loop unrolling conflicts with the op-cache that
Intel introduced with Sandy Bridge and newer architectures, which speeds up
small tight loops. On architectures without an op-cache, the loop unrolling is
probably still slightly faster.

Unfortunately, using -mtune=sandybridge does not improve the situation, so
maybe there should be some architecture-specific tuning of even trivial loop
unrolling, and possibly a discussion about making it part of generic-x64
tuning.