Code optimization only in loops

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Code optimization only in loops
@ 2009-05-07 20:46 Jean Christophe Beyler
  2009-05-08 11:40 ` Paolo Bonzini
  0 siblings, 1 reply; 6+ messages in thread
From: Jean Christophe Beyler @ 2009-05-07 20:46 UTC (permalink / raw)
  To: gcc

Dear all,

I come back to you with another weirdness due to bad code generation
on my target architecture. I have a very simplified (for the moment)
rtx_costs and my address_cost is inspired by the i386 version.
However, I actually patched in the whole i386_rtx_cost function,
constraints, predicates to see if it was something I had done wrongly
but I seem to get the same results.

This is my two functions:

uint64_t foo (uint64_t n, uint64_t m)
{
    uint64_t sum = 0,i;
    for(i=n;i<n+m;i++)
    {
     sum +=  data[i] + data[i+13];
    }
    return sum;
}

After the prologue of the loop, I get :

    mov r1, theCorrectStartAddress
    load r2,0(r1)
    load r3,104(r1)

However, if I do this:

uint64_t goo (uint64_t i)
{
    return data[i] + data[i+13];
}

I get :
    mov r1, Calculation of data[i]
    mov r2, Calculation of data[i+13]

    ldd r3,0(r1)
    ldd r4,0(r2)

It seems that when set in a loop, the program is able to perform some
type of optimization to actually get the use of the offsets where as
in the case of no loop, we have twice the calculations of instructions
for each address calculations.

Like I said, I replaced the cost function with the x86 version and I
got the same thing, so I don't really know where to look? Could the
expansion of my movDI/SI and instruction definition have an
implication like this?

Thanks for your input,
Jean Christophe Beyler

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Code optimization only in loops
  2009-05-07 20:46 Code optimization only in loops Jean Christophe Beyler
@ 2009-05-08 11:40 ` Paolo Bonzini
  2009-05-14  2:11   ` Jean Christophe Beyler
  0 siblings, 1 reply; 6+ messages in thread
From: Paolo Bonzini @ 2009-05-08 11:40 UTC (permalink / raw)
  To: Jean Christophe Beyler; +Cc: gcc


> It seems that when set in a loop, the program is able to perform some
> type of optimization to actually get the use of the offsets where as
> in the case of no loop, we have twice the calculations of instructions
> for each address calculations.

I suggest you look at the dumps for i386 to see which pass does the
changes, and then see what happens in your port.

Paolo

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Code optimization only in loops
  2009-05-08 11:40 ` Paolo Bonzini
@ 2009-05-14  2:11   ` Jean Christophe Beyler
  2009-06-11 20:55     ` Jean Christophe Beyler
  0 siblings, 1 reply; 6+ messages in thread
From: Jean Christophe Beyler @ 2009-05-14  2:11 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: gcc

Ok, for the i386 port, I use uint32_t instead of uint64_t because
otherwise the assembly code generated is a bit complicated (I'm on a
32 bit machine).

The tree dump from final_cleanup are the same for the goo function:
goo (i)
{
<bb 2>:
  return data[i + 13] + data[i];

}


However, the first RTL dump from expand gives this for the i386 port:

(insn 6 5 7 3 ld.c:17 (parallel [
            (set (reg:SI 61)
                (plus:SI (reg/v:SI 59 [ i ])
                    (const_int 13 [0xd])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil))

(insn 7 6 8 3 ld.c:17 (set (reg/f:SI 62)
        (symbol_ref:SI ("data") <var_decl 0xb7e7ce60 data>)) -1 (nil))

(insn 8 7 9 3 ld.c:17 (set (reg/f:SI 63)
        (symbol_ref:SI ("data") <var_decl 0xb7e7ce60 data>)) -1 (nil))

(insn 9 8 10 3 ld.c:17 (set (reg:SI 64)
        (mem/s:SI (plus:SI (mult:SI (reg/v:SI 59 [ i ])
                    (const_int 4 [0x4]))
                (reg/f:SI 63)) [3 data S4 A32])) -1 (nil))

(insn 10 9 11 3 ld.c:17 (set (reg:SI 65)
        (mem/s:SI (plus:SI (mult:SI (reg:SI 61)
                    (const_int 4 [0x4]))
                (reg/f:SI 62)) [3 data S4 A32])) -1 (nil))

(insn 11 10 12 3 ld.c:17 (parallel [
            (set (reg:SI 60)
                (plus:SI (reg:SI 65)
                    (reg:SI 64)))
            (clobber (reg:CC 17 flags))
        ]) -1 (expr_list:REG_EQUAL (plus:SI (mem/s:SI (plus:SI
(mult:SI (reg:SI 61)
                        (const_int 4 [0x4]))
                    (reg/f:SI 62)) [3 data S4 A32])
            (mem/s:SI (plus:SI (mult:SI (reg/v:SI 59 [ i ])
                        (const_int 4 [0x4]))
                    (reg/f:SI 63)) [3 data S4 A32]))
        (nil)))

As we can see, the compiler moves 13, and the @ of data, then
muliplies the 13 with 4 to get the right size and then performs the 2
loads and finally has a plus.

In my port, I get:

(insn 6 5 7 3 ld.c:17 (set (reg:DI 75)
        (plus:DI (reg/v:DI 73 [ i ])
            (const_int 13 [0xd]))) -1 (nil))

(insn 7 6 8 3 ld.c:17 (set (reg/f:DI 76)
        (symbol_ref:DI ("data") <var_decl 0xb7c85bb0 data>)) -1 (nil))

(insn 8 7 9 3 ld.c:17 (set (reg:DI 78)
        (const_int 3 [0x3])) -1 (nil))

(insn 9 8 10 3 ld.c:17 (set (reg:DI 77)
        (ashift:DI (reg:DI 75)
            (reg:DI 78))) -1 (nil))

(insn 10 9 11 3 ld.c:17 (set (reg/f:DI 79)
        (plus:DI (reg/f:DI 76)
            (reg:DI 77))) -1 (nil))

(insn 11 10 12 3 ld.c:17 (set (reg/f:DI 80)
        (symbol_ref:DI ("data") <var_decl 0xb7c85bb0 data>)) -1 (nil))

(insn 12 11 13 3 ld.c:17 (set (reg:DI 82)
        (const_int 3 [0x3])) -1 (nil))

(insn 13 12 14 3 ld.c:17 (set (reg:DI 81)
        (ashift:DI (reg/v:DI 73 [ i ])
            (reg:DI 82))) -1 (nil))

(insn 14 13 15 3 ld.c:17 (set (reg/f:DI 83)
        (plus:DI (reg/f:DI 80)
            (reg:DI 81))) -1 (nil))

(insn 15 14 16 3 ld.c:17 (set (reg:DI 84)
        (mem/s:DI (reg/f:DI 79) [2 data S8 A64])) -1 (nil))

(insn 16 15 17 3 ld.c:17 (set (reg:DI 85)
        (mem/s:DI (reg/f:DI 83) [2 data S8 A64])) -1 (nil))

(insn 17 16 18 3 ld.c:17 (set (reg:DI 74)
        (plus:DI (reg:DI 84)
            (reg:DI 85))) -1 (nil))


Which seems to be the same idea, except that constant 3 gets load up
and a shift is performed. Is it possible that it's that that is
causing my problem in code generation?

I'm trying to figure out why my port is generating a shift instead of
simply a mult. I actually changed the cost of shift to a large value
and then it uses adds instead of simply a mult. I seem to think that
this is then an rtx_cost problem where I'm not telling the compiler
that a multiplication in this case is correct.

I've been playing with rtx_cost but have been unable to really get it
to generate the right code.

Thanks again for your help and insight,
Jc

On Fri, May 8, 2009 at 5:18 AM, Paolo Bonzini <bonzini@gnu.org> wrote:
>
>> It seems that when set in a loop, the program is able to perform some
>> type of optimization to actually get the use of the offsets where as
>> in the case of no loop, we have twice the calculations of instructions
>> for each address calculations.
>
> I suggest you look at the dumps for i386 to see which pass does the
> changes, and then see what happens in your port.
>
> Paolo
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Code optimization only in loops
  2009-05-14  2:11   ` Jean Christophe Beyler
@ 2009-06-11 20:55     ` Jean Christophe Beyler
  2009-06-12  7:56       ` Paolo Bonzini
  0 siblings, 1 reply; 6+ messages in thread
From: Jean Christophe Beyler @ 2009-06-11 20:55 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: gcc

I've gone back to this problem (since I've solved another one ;-)).
And I've moved forward a bit:

It seems that if I consider an array of characters, there are no
longer any shifts and therefore I do get my two loads with the use of
an offset:

Code:

char data[1312];

uint64_t goo (uint64_t i)
{
    return data[i] - data[i+13];
}

generates the right code with two loads with the same base but
different offsets.

If I use anything else than a char type, I get the problem in
generation. This seems to confirm that somehow, the way things are
generated blocks the subsequent optimization passes in seeing that the
addresses are linked.

Right now, I'm trying to figure out why I'm getting shifts and is this
the problem instead of a multiply. Since this was one of the
differences between what I get and what the i386 port gets.

If you've got any ideas, thanks again,
Jean Christophe

On Wed, May 13, 2009 at 4:58 PM, Jean Christophe
Beyler<jean.christophe.beyler@gmail.com> wrote:
> Ok, for the i386 port, I use uint32_t instead of uint64_t because
> otherwise the assembly code generated is a bit complicated (I'm on a
> 32 bit machine).
>
> The tree dump from final_cleanup are the same for the goo function:
> goo (i)
> {
> <bb 2>:
>  return data[i + 13] + data[i];
>
> }
>
>
> However, the first RTL dump from expand gives this for the i386 port:
>
> (insn 6 5 7 3 ld.c:17 (parallel [
>            (set (reg:SI 61)
>                (plus:SI (reg/v:SI 59 [ i ])
>                    (const_int 13 [0xd])))
>            (clobber (reg:CC 17 flags))
>        ]) -1 (nil))
>
> (insn 7 6 8 3 ld.c:17 (set (reg/f:SI 62)
>        (symbol_ref:SI ("data") <var_decl 0xb7e7ce60 data>)) -1 (nil))
>
> (insn 8 7 9 3 ld.c:17 (set (reg/f:SI 63)
>        (symbol_ref:SI ("data") <var_decl 0xb7e7ce60 data>)) -1 (nil))
>
> (insn 9 8 10 3 ld.c:17 (set (reg:SI 64)
>        (mem/s:SI (plus:SI (mult:SI (reg/v:SI 59 [ i ])
>                    (const_int 4 [0x4]))
>                (reg/f:SI 63)) [3 data S4 A32])) -1 (nil))
>
> (insn 10 9 11 3 ld.c:17 (set (reg:SI 65)
>        (mem/s:SI (plus:SI (mult:SI (reg:SI 61)
>                    (const_int 4 [0x4]))
>                (reg/f:SI 62)) [3 data S4 A32])) -1 (nil))
>
> (insn 11 10 12 3 ld.c:17 (parallel [
>            (set (reg:SI 60)
>                (plus:SI (reg:SI 65)
>                    (reg:SI 64)))
>            (clobber (reg:CC 17 flags))
>        ]) -1 (expr_list:REG_EQUAL (plus:SI (mem/s:SI (plus:SI
> (mult:SI (reg:SI 61)
>                        (const_int 4 [0x4]))
>                    (reg/f:SI 62)) [3 data S4 A32])
>            (mem/s:SI (plus:SI (mult:SI (reg/v:SI 59 [ i ])
>                        (const_int 4 [0x4]))
>                    (reg/f:SI 63)) [3 data S4 A32]))
>        (nil)))
>
> As we can see, the compiler moves 13, and the @ of data, then
> muliplies the 13 with 4 to get the right size and then performs the 2
> loads and finally has a plus.
>
> In my port, I get:
>
> (insn 6 5 7 3 ld.c:17 (set (reg:DI 75)
>        (plus:DI (reg/v:DI 73 [ i ])
>            (const_int 13 [0xd]))) -1 (nil))
>
> (insn 7 6 8 3 ld.c:17 (set (reg/f:DI 76)
>        (symbol_ref:DI ("data") <var_decl 0xb7c85bb0 data>)) -1 (nil))
>
> (insn 8 7 9 3 ld.c:17 (set (reg:DI 78)
>        (const_int 3 [0x3])) -1 (nil))
>
> (insn 9 8 10 3 ld.c:17 (set (reg:DI 77)
>        (ashift:DI (reg:DI 75)
>            (reg:DI 78))) -1 (nil))
>
> (insn 10 9 11 3 ld.c:17 (set (reg/f:DI 79)
>        (plus:DI (reg/f:DI 76)
>            (reg:DI 77))) -1 (nil))
>
> (insn 11 10 12 3 ld.c:17 (set (reg/f:DI 80)
>        (symbol_ref:DI ("data") <var_decl 0xb7c85bb0 data>)) -1 (nil))
>
> (insn 12 11 13 3 ld.c:17 (set (reg:DI 82)
>        (const_int 3 [0x3])) -1 (nil))
>
> (insn 13 12 14 3 ld.c:17 (set (reg:DI 81)
>        (ashift:DI (reg/v:DI 73 [ i ])
>            (reg:DI 82))) -1 (nil))
>
> (insn 14 13 15 3 ld.c:17 (set (reg/f:DI 83)
>        (plus:DI (reg/f:DI 80)
>            (reg:DI 81))) -1 (nil))
>
> (insn 15 14 16 3 ld.c:17 (set (reg:DI 84)
>        (mem/s:DI (reg/f:DI 79) [2 data S8 A64])) -1 (nil))
>
> (insn 16 15 17 3 ld.c:17 (set (reg:DI 85)
>        (mem/s:DI (reg/f:DI 83) [2 data S8 A64])) -1 (nil))
>
> (insn 17 16 18 3 ld.c:17 (set (reg:DI 74)
>        (plus:DI (reg:DI 84)
>            (reg:DI 85))) -1 (nil))
>
>
> Which seems to be the same idea, except that constant 3 gets load up
> and a shift is performed. Is it possible that it's that that is
> causing my problem in code generation?
>
> I'm trying to figure out why my port is generating a shift instead of
> simply a mult. I actually changed the cost of shift to a large value
> and then it uses adds instead of simply a mult. I seem to think that
> this is then an rtx_cost problem where I'm not telling the compiler
> that a multiplication in this case is correct.
>
> I've been playing with rtx_cost but have been unable to really get it
> to generate the right code.
>
> Thanks again for your help and insight,
> Jc
>
> On Fri, May 8, 2009 at 5:18 AM, Paolo Bonzini <bonzini@gnu.org> wrote:
>>
>>> It seems that when set in a loop, the program is able to perform some
>>> type of optimization to actually get the use of the offsets where as
>>> in the case of no loop, we have twice the calculations of instructions
>>> for each address calculations.
>>
>> I suggest you look at the dumps for i386 to see which pass does the
>> changes, and then see what happens in your port.
>>
>> Paolo
>>
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Code optimization only in loops
  2009-06-11 20:55     ` Jean Christophe Beyler
@ 2009-06-12  7:56       ` Paolo Bonzini
  2009-07-13 21:30         ` Jean Christophe Beyler
  0 siblings, 1 reply; 6+ messages in thread
From: Paolo Bonzini @ 2009-06-12  7:56 UTC (permalink / raw)
  To: Jean Christophe Beyler; +Cc: gcc

Jean Christophe Beyler wrote:
> I've gone back to this problem (since I've solved another one ;-)).
> And I've moved forward a bit:
> 
> It seems that if I consider an array of characters, there are no
> longer any shifts and therefore I do get my two loads with the use of
> an offset:

The reason there are shifts instead of multiplies is that 
multiplications are canonicalized to shifts whenever possible outside 
addresses, because a shift instruction should be more efficient.

The interesting dump should be fwprop which is where the address 
generation happens.

 From your dumps, however, the problem seems to be that you do not have 
a shift-by-immediate instruction.  You may consider adding it even 
though it does not apply to your assembly, either with 
define_insn_and_split or by loosening the predicate and keeping a "r" 
constraint (or whatever you're using now).

Paolo

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Code optimization only in loops
  2009-06-12  7:56       ` Paolo Bonzini
@ 2009-07-13 21:30         ` Jean Christophe Beyler
  0 siblings, 0 replies; 6+ messages in thread
From: Jean Christophe Beyler @ 2009-07-13 21:30 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: gcc

Sorry to be coming back to this problem, I'm working on many projects
at the same time at the moment.

The port actually has a shift-immediate. It actually sees it later in
the fwprop pass:

In insn 14, replacing
 (ashift:DI (reg:DI 79)
        (reg:DI 77))
 with (ashift:DI (reg:DI 79)
        (const_int 3 [0x3]))
Changed insn 14
deferring rescan insn with uid = 14.

(not the same numbers but you get the idea). I don't understand why
there is a problem here but it does seem that it is linked to the way
my port handles the calculations of addresses.

Thanks again if you have any ideas, or anything I can do to give
information or ideas,
Jc


On Fri, Jun 12, 2009 at 3:55 AM, Paolo Bonzini<paolo.bonzini@gmail.com> wrote:
> Jean Christophe Beyler wrote:
>>
>> I've gone back to this problem (since I've solved another one ;-)).
>> And I've moved forward a bit:
>>
>> It seems that if I consider an array of characters, there are no
>> longer any shifts and therefore I do get my two loads with the use of
>> an offset:
>
> The reason there are shifts instead of multiplies is that multiplications
> are canonicalized to shifts whenever possible outside addresses, because a
> shift instruction should be more efficient.
>
> The interesting dump should be fwprop which is where the address generation
> happens.
>
> From your dumps, however, the problem seems to be that you do not have a
> shift-by-immediate instruction.  You may consider adding it even though it
> does not apply to your assembly, either with define_insn_and_split or by
> loosening the predicate and keeping a "r" constraint (or whatever you're
> using now).
>
> Paolo
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2009-07-13 21:30 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-07 20:46 Code optimization only in loops Jean Christophe Beyler
2009-05-08 11:40 ` Paolo Bonzini
2009-05-14  2:11   ` Jean Christophe Beyler
2009-06-11 20:55     ` Jean Christophe Beyler
2009-06-12  7:56       ` Paolo Bonzini
2009-07-13 21:30         ` Jean Christophe Beyler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).