Date: Thu, 11 Jun 2009 20:55:00 -0000
From: Jean Christophe Beyler
To: Paolo Bonzini
Cc: "gcc@gcc.gnu.org"
Subject: Re: Code optimization only in loops
References: <4A03F8EA.5070705@gnu.org>

I've gone back to this problem (since I've solved another one ;-)), and
I've moved forward a bit.

It seems that if I use an array of characters, there are no longer any
shifts, and I do get my two loads using an offset:

Code:

    char data[1312];

    uint64_t goo (uint64_t i)
    {
        return data[i] - data[i+13];
    }

This generates the right code: two loads with the same base but
different offsets.

If I use anything other than a char type, I get the problem in code
generation. This seems to confirm that the way the address computation
is generated prevents the subsequent optimization passes from seeing
that the addresses are related.

Right now I'm trying to figure out why I'm getting shifts instead of a
multiply, and whether that is the problem.
The shift was one of the differences between what I get and what the
i386 port gets.

If you've got any ideas, thanks again,
Jean Christophe

On Wed, May 13, 2009 at 4:58 PM, Jean Christophe Beyler wrote:
> Ok, for the i386 port, I use uint32_t instead of uint64_t because
> otherwise the generated assembly code is a bit complicated (I'm on a
> 32-bit machine).
>
> The tree dumps from final_cleanup are the same for the goo function:
> goo (i)
> {
> :
>   return data[i + 13] + data[i];
>
> }
>
>
> However, the first RTL dump from expand gives this for the i386 port:
>
> (insn 6 5 7 3 ld.c:17 (parallel [
>             (set (reg:SI 61)
>                 (plus:SI (reg/v:SI 59 [ i ])
>                     (const_int 13 [0xd])))
>             (clobber (reg:CC 17 flags))
>         ]) -1 (nil))
>
> (insn 7 6 8 3 ld.c:17 (set (reg/f:SI 62)
>         (symbol_ref:SI ("data") )) -1 (nil))
>
> (insn 8 7 9 3 ld.c:17 (set (reg/f:SI 63)
>         (symbol_ref:SI ("data") )) -1 (nil))
>
> (insn 9 8 10 3 ld.c:17 (set (reg:SI 64)
>         (mem/s:SI (plus:SI (mult:SI (reg/v:SI 59 [ i ])
>                     (const_int 4 [0x4]))
>                 (reg/f:SI 63)) [3 data S4 A32])) -1 (nil))
>
> (insn 10 9 11 3 ld.c:17 (set (reg:SI 65)
>         (mem/s:SI (plus:SI (mult:SI (reg:SI 61)
>                     (const_int 4 [0x4]))
>                 (reg/f:SI 62)) [3 data S4 A32])) -1 (nil))
>
> (insn 11 10 12 3 ld.c:17 (parallel [
>             (set (reg:SI 60)
>                 (plus:SI (reg:SI 65)
>                     (reg:SI 64)))
>             (clobber (reg:CC 17 flags))
>         ]) -1 (expr_list:REG_EQUAL (plus:SI (mem/s:SI (plus:SI (mult:SI (reg:SI 61)
>                         (const_int 4 [0x4]))
>                     (reg/f:SI 62)) [3 data S4 A32])
>             (mem/s:SI (plus:SI (mult:SI (reg/v:SI 59 [ i ])
>                         (const_int 4 [0x4]))
>                     (reg/f:SI 63)) [3 data S4 A32]))
>         (nil)))
>
> As we can see, the compiler adds 13 to i, loads the address of data,
> multiplies each index by 4 to scale it to the element size, performs
> the two loads, and finally adds the results.
>
> In my port, I get:
>
> (insn 6 5 7 3 ld.c:17 (set (reg:DI 75)
>         (plus:DI (reg/v:DI 73 [ i ])
>             (const_int 13 [0xd]))) -1 (nil))
>
> (insn 7 6 8 3 ld.c:17 (set (reg/f:DI 76)
>         (symbol_ref:DI ("data") )) -1 (nil))
>
> (insn 8 7 9 3 ld.c:17 (set (reg:DI 78)
>         (const_int 3 [0x3])) -1 (nil))
>
> (insn 9 8 10 3 ld.c:17 (set (reg:DI 77)
>         (ashift:DI (reg:DI 75)
>             (reg:DI 78))) -1 (nil))
>
> (insn 10 9 11 3 ld.c:17 (set (reg/f:DI 79)
>         (plus:DI (reg/f:DI 76)
>             (reg:DI 77))) -1 (nil))
>
> (insn 11 10 12 3 ld.c:17 (set (reg/f:DI 80)
>         (symbol_ref:DI ("data") )) -1 (nil))
>
> (insn 12 11 13 3 ld.c:17 (set (reg:DI 82)
>         (const_int 3 [0x3])) -1 (nil))
>
> (insn 13 12 14 3 ld.c:17 (set (reg:DI 81)
>         (ashift:DI (reg/v:DI 73 [ i ])
>             (reg:DI 82))) -1 (nil))
>
> (insn 14 13 15 3 ld.c:17 (set (reg/f:DI 83)
>         (plus:DI (reg/f:DI 80)
>             (reg:DI 81))) -1 (nil))
>
> (insn 15 14 16 3 ld.c:17 (set (reg:DI 84)
>         (mem/s:DI (reg/f:DI 79) [2 data S8 A64])) -1 (nil))
>
> (insn 16 15 17 3 ld.c:17 (set (reg:DI 85)
>         (mem/s:DI (reg/f:DI 83) [2 data S8 A64])) -1 (nil))
>
> (insn 17 16 18 3 ld.c:17 (set (reg:DI 74)
>         (plus:DI (reg:DI 84)
>             (reg:DI 85))) -1 (nil))
>
>
> This seems to be the same idea, except that the constant 3 gets loaded
> into a register and a shift is performed. Is it possible that this is
> what is causing my problem in code generation?
>
> I'm trying to figure out why my port is generating a shift instead of
> simply a mult. I actually changed the cost of a shift to a large value,
> and then it uses adds instead of simply a mult. I suspect that this is
> an rtx_cost problem, where I'm not telling the compiler that a
> multiplication is the right choice in this case.
>
> I've been playing with rtx_cost but have been unable to get it to
> generate the right code.
>
> Thanks again for your help and insight,
> Jc
>
> On Fri, May 8, 2009 at 5:18 AM, Paolo Bonzini wrote:
>>
>>> It seems that when set in a loop, the program is able to perform some
>>> type of optimization to actually get the use of the offsets, whereas
>>> in the case of no loop, we have twice the instructions for each
>>> address calculation.
>>
>> I suggest you look at the dumps for i386 to see which pass does the
>> changes, and then see what happens in your port.
>>
>> Paolo
>>
>
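
To make the address arithmetic in the two expand dumps above concrete, here is a minimal C sketch that is not part of the original thread. It assumes 8-byte elements, as in the DI-mode dump, and the helper names offset_as_expanded, offset_shared_base and offsets_agree are hypothetical, chosen only for illustration. It checks that adding 13 and then shifting by 3 gives the same byte offset as shifting i once and adding the constant displacement 13 * 8 = 104, which is the form that would let both loads share one base register.

```c
#include <assert.h>
#include <stdint.h>

#define N_ELEMS 1312

/* Same array size as in the thread, but with 8-byte elements (assumption). */
uint64_t data[N_ELEMS];

/* Byte offset as the port's expand dump computes it: (i + 13) << 3. */
static uint64_t offset_as_expanded(uint64_t i)
{
    return (i + 13) << 3;
}

/* Equivalent form: one scaled index shared by both loads,
   plus a constant displacement of 13 * 8 = 104 bytes. */
static uint64_t offset_shared_base(uint64_t i)
{
    return (i << 3) + 13 * sizeof(uint64_t);
}

/* Returns 1 if both forms agree with each other and with the real
   address of data[i + 13] for every in-bounds index, 0 otherwise. */
static int offsets_agree(void)
{
    for (uint64_t i = 0; i + 13 < N_ELEMS; i++) {
        if (offset_as_expanded(i) != offset_shared_base(i))
            return 0;
        if ((char *)&data[i + 13] != (char *)data + offset_shared_base(i))
            return 0;
    }
    return 1;
}
```

Calling offsets_agree() from a small test harness should return 1. The sketch only shows that the two forms are arithmetically equivalent; whether a particular GCC pass performs this rewrite for a given port is a separate question that it does not answer.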