From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-187641-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 20986 invoked by alias); 31 May 2006 10:57:20 -0000
Received: (qmail 20262 invoked by uid 48); 31 May 2006 10:56:32 -0000
Date: Wed, 31 May 2006 10:57:00 -0000
Message-ID: <20060531105632.20260.qmail@sourceware.org>
X-Bugzilla-Reason: CC
References: <bug-27827-12761@http.gcc.gnu.org/bugzilla/>
Subject: [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
In-Reply-To: <bug-27827-12761@http.gcc.gnu.org/bugzilla/>
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "uros at kss-loka dot si" <gcc-bugzilla@gcc.gnu.org>
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2006-05/txt/msg03197.txt.bz2
List-Id: <gcc-bugs.sourceware.org>


------- Comment #7 from uros at kss-loka dot si  2006-05-31 10:56 -------
IMO the fact that gcc 3.x beats 4.x on this code could be attributed to pure
luck.

Looking at the 3.x RTL dumps, the following can be observed:

The instruction that multiplies pA0 and rB0 is described as:

__.20.combine:

(insn 75 73 76 2 (set (reg:DF 84)
        (mult:DF (mem:DF (reg/v/f:DI 70 [ pA0 ]) [0 S8 A64])
            (reg/v:DF 78 [ rB0 ]))) 551 {*fop_df_comm_nosse} (insn_list 65
(nil))
    (nil))

At this point, the first input operand does not satisfy its operand constraint,
so the register allocator loads the memory operand into a register:

__.25.greg:

(insn 703 73 75 2 (set (reg:DF 8 st [84])
        (mem:DF (reg/v/f:DI 0 ax [orig:70 pA0 ] [70]) [0 S8 A64])) 96
{*movdf_integer} (nil)
    (nil))

(insn 75 703 76 2 (set (reg:DF 8 st [84])
        (mult:DF (reg:DF 8 st [84])
            (reg/v:DF 9 st(1) [orig:78 rB0 ] [78]))) 551 {*fop_df_comm_nosse}
(insn_list 65 (nil))
    (nil))

This RTL produces the following asm sequence:

        fldl    (%rax)  #* pA0
        fmul    %st(1), %st     #
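
For orientation, here is a reduced C loop of the kind that could give rise to
the multiply above. It is only a sketch: the names pA0 and rB0 are taken from
the RTL dumps, while the function name, the loop shape and the accumulator are
assumptions and are not the code from the original report.

```c
#include <stddef.h>

/* Hypothetical reduction: multiply a value loaded from memory (pA0[i])
   by a value that stays in a register across the loop (rB0). */
double scaled_sum(const double *pA0, double rB0, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += pA0[i] * rB0;   /* mult:DF with one mem:DF and one reg:DF input */
    return acc;
}
```

Compiling something like this with -O2 -mfpmath=387 -S under both compilers
should show which of the two operand orders the multiply ends up with, although
the exact sequence may differ by version and surrounding code.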


In the 4.x case, we have:

__.127r.combine:

(insn 60 58 61 4 (set (reg:DF 207)
        (mult:DF (reg/v:DF 187 [ rB0 ])
            (mem:DF (plus:DI (reg/v/f:DI 178 [ pA0.161 ])
                    (const_int 960 [0x3c0])) [0 S8 A64]))) 591
{*fop_df_comm_i387} (nil)
    (nil))

This instruction almost satisfies its operand constraints, so the register
allocator only adds a register-to-register copy:

__.138r.greg:

(insn 470 58 60 5 (set (reg:DF 12 st(4) [207])
        (reg/v:DF 8 st [orig:187 rB0 ] [187])) 94 {*movdf_integer} (nil)
    (nil))

(insn 60 470 61 5 (set (reg:DF 12 st(4) [207])
        (mult:DF (reg:DF 12 st(4) [207])
            (mem:DF (plus:DI (reg/v/f:DI 0 ax [orig:178 pA0.161 ] [178])
                    (const_int 960 [0x3c0])) [0 S8 A64]))) 591
{*fop_df_comm_i387} (nil)
    (nil))

The stack-register pass then converts this RTL to:

__.151r.stack:

(insn 470 58 60 4 (set (reg:DF 8 st)
        (reg:DF 8 st)) 94 {*movdf_integer} (nil)
    (nil))

(insn 60 470 61 4 (set (reg:DF 8 st)
        (mult:DF (reg:DF 8 st)
            (mem:DF (plus:DI (reg/v/f:DI 0 ax [orig:178 pA0.161 ] [178])
                    (const_int 960 [0x3c0])) [0 S8 A64]))) 591
{*fop_df_comm_i387} (nil)
    (nil))


Judging from your measurements, it appears that instead of:

        fld     %st(0)  #
        fmull   (%rax)  #* pA0.161

it is faster to emit

        fldl    (%rax)  #* pA0
        fmul    %st(1), %st     #,
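
To make it explicit that the two sequences are interchangeable as far as the
result is concerned, here is a toy C model of the x87 register stack. The X87
struct and all function names are made up for illustration; the model tracks
values only and says nothing about timing, which is the actual question here.

```c
#include <string.h>

/* Toy model of the x87 register stack: st[0] is the top of stack. */
typedef struct { double st[8]; int depth; } X87;

/* Push a value, shifting the existing entries down by one slot. */
void x87_push(X87 *f, double v)
{
    memmove(&f->st[1], &f->st[0], 7 * sizeof(double));
    f->st[0] = v;
    f->depth++;
}

/* 4.x sequence: fld %st(0); fmull (mem) */
void seq_dup_then_mul_mem(X87 *f, const double *mem)
{
    x87_push(f, f->st[0]);   /* fld %st(0): duplicate the top of stack */
    f->st[0] *= *mem;        /* fmull (mem): st(0) = st(0) * mem */
}

/* 3.x sequence: fldl (mem); fmul %st(1), %st */
void seq_load_then_mul_reg(X87 *f, const double *mem)
{
    x87_push(f, *mem);       /* fldl (mem): load the memory operand */
    f->st[0] *= f->st[1];    /* fmul %st(1), %st: st(0) = st(0) * st(1) */
}
```

Starting with rB0 on top of the stack, both functions leave the product in
st(0) and rB0 in st(1), so any difference between them is in instruction
timing, not in the values computed.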


-- 

uros at kss-loka dot si changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |uros at kss-loka dot si


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827