From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-416118-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 24819 invoked by alias); 22 Feb 2013 23:41:02 -0000
Received: (qmail 24766 invoked by uid 48); 22 Feb 2013 23:40:38 -0000
From: "steven at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/54000] [4.6/4.7/4.8 Regression] Performance breakdown for gcc-4.{6,7} vs. gcc-4.5 using std::vector in matrix vector multiplication (IVopts / inliner)
Date: Fri, 22 Feb 2013 23:41:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: steven at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 4.6.4
X-Bugzilla-Changed-Fields: CC Summary
Message-ID: <bug-54000-4-seemoLf7rc@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-54000-4@http.gcc.gnu.org/bugzilla/>
References: <bug-54000-4@http.gcc.gnu.org/bugzilla/>
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2013-02/txt/msg02283.txt.bz2


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54000

Steven Bosscher <steven at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
            Summary|[4.6/4.7/4.8                |[4.6/4.7/4.8 Regression]
                   |Regression][IVOPTS]         |Performance breakdown for
                   |Performance breakdown for   |gcc-4.{6,7} vs. gcc-4.5
                   |gcc-4.{6,7} vs. gcc-4.5     |using std::vector in matrix
                   |using std::vector in matrix |vector multiplication
                   |vector multiplication       |(IVopts / inliner)
--- Comment #9 from Steven Bosscher <steven at gcc dot gnu.org> 2013-02-22 23:40:35 UTC ---
(In reply to comment #8)
> Thanks for the reduced testcase.  The innermost loops compare as follows:
> 
> 4.5:
> 
> .L7:
>         movsd   (%rbx,%rcx), %xmm0
>         addq    $8, %rcx
>         mulsd   0(%rbp,%rdx), %xmm0
>         addq    $8, %rdx
>         cmpq    $24, %rdx
>         addsd   %xmm0, %xmm1
>         movsd   %xmm1, (%rsi)
>         jne     .L7

4.8 r196182 with "--param early-inlining-insns=2" (2 x the default value):

.L13:   
        movsd   (%rdx), %xmm0
        addq    $8, %rdx
        mulsd   (%rsi,%rax), %xmm0
        addq    $8, %rax
        cmpq    $24, %rax
        addsd   %xmm0, %xmm1
        movsd   %xmm1, 8(%rdi,%rcx)
        jne     .L13


> 
> 4.7:
> 
> .L13:
>         movq    64(%rsp), %rdi
>         movq    80(%rsp), %rdx
>         addq    %rcx, %rdi
>         addq    %r8, %rdx
>         movsd   -8(%rax,%rdi), %xmm0
>         mulsd   (%rsi,%rax), %xmm0
>         addq    $8, %rax
>         cmpq    $24, %rax
>         addsd   (%rdx), %xmm0
>         movsd   %xmm0, (%rdx)
>         jne     .L13

This is similar to what 4.8 r196182 produces without inliner tweaks:

.L18:   
        movq    %rcx, %rdi
        addq    64(%rsp), %rdi
        movq    %r8, %rdx
        addq    80(%rsp), %rdx
        movsd   -8(%rax,%rdi), %xmm0
        mulsd   (%rsi,%rax), %xmm0
        addq    $8, %rax
        cmpq    $24, %rax
        addsd   (%rdx), %xmm0
        movsd   %xmm0, (%rdx)
        jne     .L18


> so we seem to have a register allocation / spilling issue here as well
> as a bad induction variable choice.  GCC 4.8 is not any better here.

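All true, but in the end this looks primarily like an inliner heuristics
issue (as also suggested by comment #3): with the early inliner limit
raised, 4.8 r196182 produces an inner loop comparable to the 4.5 one.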
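
For readers without the reduced testcase at hand, here is a minimal sketch
of the kind of code under discussion. It is NOT the testcase from this bug:
the Matrix wrapper and both function names are hypothetical and exist only
for illustration. The first function goes through small accessor functions
on every iteration, which is the pattern whose code quality depends on those
accessors being inlined early. The second hoists the base pointers and the
accumulator into locals by hand, which is roughly the shape of the 4.5 loop
above. Whether a given GCC version emits the listings above for this sketch
has not been checked.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix wrapper (an assumption for illustration,
// not taken from the bug report's testcase).
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}

    // Small accessors: cheap only if the compiler inlines them.
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
};

// y += A * x, written through the accessors and accumulating directly
// into y[i]. Every access goes through a std::vector member function.
void multiply_via_accessors(const Matrix& a, const std::vector<double>& x,
                            std::vector<double>& y) {
    assert(x.size() == a.cols && y.size() == a.rows);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < a.cols; ++j)
            y[i] += a(i, j) * x[j];
}

// Same computation with the base pointers and the running sum kept in
// locals, so the inner loop needs only two loads, a multiply and an add.
void multiply_hoisted(const Matrix& a, const std::vector<double>& x,
                      std::vector<double>& y) {
    assert(x.size() == a.cols && y.size() == a.rows);
    const double* ap = a.data.data();
    const double* xp = x.data();
    double* yp = y.data();
    for (std::size_t i = 0; i < a.rows; ++i) {
        double acc = yp[i];
        const double* row = ap + i * a.cols;
        for (std::size_t j = 0; j < a.cols; ++j)
            acc += row[j] * xp[j];
        yp[i] = acc;
    }
}
```

With a.cols == 3 the inner loop runs three times over 8-byte doubles, which
matches the "addq $8" / "cmpq $24" pair in the listings above.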