From mboxrd@z Thu Jan  1 00:00:00 1970
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
In-Reply-To: <B7C12C01-DBA8-4F56-9DF3-4BE0C45D5ADD@apple.com>
References: <B7C12C01-DBA8-4F56-9DF3-4BE0C45D5ADD@apple.com>
Mime-Version: 1.0 (Apple Message framework v728)
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
Message-Id: <6269B965-AF17-4FBD-B3ED-B1BA96936380@apple.com>
Cc: gcc@gcc.gnu.org
Content-Transfer-Encoding: 7bit
From: Fariborz Jahanian <fjahanian@apple.com>
Subject: Re: [RFH] - Less than optimal code compiling 252.eon -O2 for x86
Date: Mon, 27 Jun 2005 19:20:00 -0000
To: Fariborz Jahanian <fjahanian@apple.com>
X-SW-Source: 2005-06/txt/msg01063.txt.bz2

FYI, the difference in the RTL between -O1 and -O2 is that -O2
enables -fforce-mem, which forces memory operands into registers so
that memory references can be treated as common subexpressions. In
this case, the constant double value is loaded into an xmm register,
which is then used wherever the value is needed. So I would say this
behavior is expected, but not ideal for x86, where a pair of
'movl   $0x0, mem' may be preferable to a single 'movsd   %xmm7, mem'
for 252.eon on x86-darwin.
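To illustrate why the two 'movl $0x0, mem' stores are an acceptable
replacement, here is a minimal C sketch (all function names are made
up for illustration). It assumes IEEE 754 binary64 doubles and 32-bit
words, as on x86: +0.0 is encoded as all zero bits, so two 4-byte
zero stores leave the same bytes in memory as one 8-byte store. -0.0
has its sign bit set, so the trick applies to +0.0 only.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: IEEE 754 binary64 double made of two 32-bit words (x86). */
_Static_assert(sizeof(double) == 2 * sizeof(uint32_t),
               "double must be exactly two 32-bit words");

/* One 8-byte store, like 'movsd %xmm7, mem'. */
static void store_as_double(double *dst, double value) {
    *dst = value;
}

/* Two 4-byte zero stores, like a pair of 'movl $0x0, mem'.
   Only equivalent to storing +0.0, whose encoding is all zero bits. */
static void store_zero_as_two_words(double *dst) {
    uint32_t words[2] = {0u, 0u};
    memcpy(dst, words, sizeof words);
}

/* Returns 1 if storing +0.0 directly and via two zero words
   leaves identical bytes in memory. */
static int zero_stores_agree(void) {
    double a, b;
    store_as_double(&a, 0.0);
    store_zero_as_two_words(&b);
    return memcmp(&a, &b, sizeof a) == 0;
}

/* Returns 1 if -0.0 matched two zero words; it should not,
   because its sign bit is set. */
static int negative_zero_matches_zero_words(void) {
    double a, b;
    store_as_double(&a, -0.0);
    store_zero_as_two_words(&b);
    return memcmp(&a, &b, sizeof a) == 0;
}
```

Under those assumptions zero_stores_agree() returns 1 and
negative_zero_matches_zero_words() returns 0.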

- fariborz

On Jun 24, 2005, at 3:07 PM, Fariborz Jahanian wrote:

> The source file mrSurfaceList.cc in 252.eon produces less efficient
> code for initializing instance objects to 0 at -O2 than at -O1. The
> behavior is erratic: it does not happen on all x86 platforms, and
> reducing the test case makes the problem go away. But here is what
> I found to be the cause.
>
> When the source is compiled with -O1 -march=pentium4, the 'cse' pass
> sees the following pattern, which initializes a 'double' with 0:
>
> (insn 18 13 19 0 (set (reg:SF 109)
>         (mem/u/i:SF (symbol_ref/u:SI ("*LC11") [flags 0x2]) [0 S4 A32])) -1 (nil)
>     (nil))
>
> (insn 19 18 20 0 (set (mem/s/j:DF (plus:SI (reg/f:SI 20 frame)
>                 (const_int -32 [0xffffffffffffffe0])) [0 objectBox.pmin.e+16 S8 A128])
>         (float_extend:DF (reg:SF 109))) 86 {*extendsfdf2_sse} (nil)
>     (nil))
>
> The fold_rtx routine then converts it into its reduced form, resulting
> in optimal code:
>
> (insn 19 13 21 0 (set (mem/s/j:DF (plus:SI (reg/f:SI 20 frame)
>                 (const_int -32 [0xffffffffffffffe0])) [0 objectBox.pmin.e+16 S8 A128])
>         (const_double:DF 0.0 [0x0.0p+0])) 64 {*movdf_nointeger} (nil)
>     (nil))
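The -O1 fold is safe because widening a float to double is exact:
every SF value, 0.0 included, has an exact DF representation, so
(float_extend:DF <SF 0.0>) and (const_double:DF 0.0) denote the same
bits. The C sketch below checks that property; fold_float_extend is a
hypothetical stand-in for the arithmetic only, not GCC's fold_rtx.

```c
#include <string.h>

/* Hypothetical stand-in for folding (float_extend:DF <SF constant>).
   C's float -> double conversion is exact, like the RTL operation. */
static double fold_float_extend(float sf_constant) {
    return (double)sf_constant;
}

/* Returns 1 if the folded SF 0.0 has the same bytes as a DF 0.0. */
static int folded_zero_matches_df_zero(void) {
    double folded = fold_float_extend(0.0f);
    double literal = 0.0;
    return memcmp(&folded, &literal, sizeof folded) == 0;
}

/* Returns 1 if narrowing the widened value recovers the original
   float, which holds for every finite float because widening is exact. */
static int widening_round_trips(float f) {
    return (float)fold_float_extend(f) == f;
}
```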
>
>
> But when the same source is compiled with -O2 -march=pentium4, the
> 'cse' pass sees a slightly different pattern (note that the
> float_extend:DF has moved from insn 19 into insn 18):
>
> (insn 18 13 19 0 (set (reg:DF 109)
>         (float_extend:DF (mem/u/i:SF (symbol_ref/u:SI ("*LC13") [flags 0x2]) [0 S4 A32]))) -1 (nil)
>     (nil))
>
> (insn 19 18 20 0 (set (mem/s/j:DF (plus:SI (reg/f:SI 20 frame)
>                 (const_int -32 [0xffffffffffffffe0])) [0 objectBox.pmin.e+16 S8 A128])
>         (reg:DF 109)) 64 {*movdf_nointeger} (nil)
>     (nil))
>
> This cannot be simplified by fold_rtx, resulting in less efficient  
> code.
>
> The change in pattern is most likely due to the additional tree
> optimization passes that run at -O2. If so, should cse be taught to
> simplify the new RTL pattern, or should the tree optimizer pass
> responsible for the less-than-optimal tree be tweaked to generate the
> same tree as at -O1?
>
> Thanks, fariborz
>
>