Code optimization with GCSE

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Code optimization with GCSE
@ 2009-07-14 20:01 Jean Christophe Beyler
  2009-07-15  5:29 ` Paolo Bonzini
  0 siblings, 1 reply; 6+ messages in thread
From: Jean Christophe Beyler @ 2009-07-14 20:01 UTC (permalink / raw)
  To: gcc

Dear all,

As some might know, I've been concentrating on optimizing the handling
of loads for my port of GCC. I'm now considering this code:
uint64_t data[107];
uint64_t foo (void)
{
    uint64_t x0, x1, x2, x3, x4, x5,x6,x7;
    uint64_t i;

    for(i=0;i<107;i++) {
        data[i] = i;
    }

    return data[0] + data[1] + data[2];
}

However, the code I get is:

    la  r9,data
    mov r8, r9
    #LOOP, of no consequence, it uses r8 for the stores...

    la  r7,data+8    #Loads data+8 in r7 instead of using r9 directly
    ldd r9,0(r9)       #This loads from r9 and the two next from r7
    ldd r6,0(r7)
    ldd r8,8(r7)


As you can see, the compiler uses r9 to store data and then uses that
for data[0] but also loads in r7 data+8 instead of directly using r9.
If I remove the loop then it does not do this.

Any ideas where I can look for this, or is this going to be difficult to fix?
Thanks !
Jean Christophe Beyler

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Code optimization with GCSE
  2009-07-14 20:01 Code optimization with GCSE Jean Christophe Beyler
@ 2009-07-15  5:29 ` Paolo Bonzini
  2009-07-15 15:21   ` Jean Christophe Beyler
  0 siblings, 1 reply; 6+ messages in thread
From: Paolo Bonzini @ 2009-07-15  5:29 UTC (permalink / raw)
  To: Jean Christophe Beyler; +Cc: gcc


> As you can see, the compiler uses r9 to store data and then uses that
> for data[0] but also loads in r7 data+8 instead of directly using r9.
> If I remove the loop then it does not do this.

This optimization is done by CSE only, currently.  That's why it cannot 
look through loops.

Paolo

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Code optimization with GCSE
  2009-07-15  5:29 ` Paolo Bonzini
@ 2009-07-15 15:21   ` Jean Christophe Beyler
  2009-07-15 16:35     ` Adam Nemet
  0 siblings, 1 reply; 6+ messages in thread
From: Jean Christophe Beyler @ 2009-07-15 15:21 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: gcc

Ah ok, so I can see why it would not be able to perform that
optimization around the loop but I changed the code to simply have
this:

uint64_t foo (void)
{
    return data[0] + data[1] + data[2];
}

And this generates :

    la  r9,data
    la  r7,data+8
    ldd r6,0(r7)
    ldd r8,0(r9)
    ldd r7,16(r9)

I'm trying to see if there is a problem with my rtx costs function
because again, I don't understand why it would generate 2 la instead
of using an offset of 8 and 16.

Thanks for any input,
Jc

On Wed, Jul 15, 2009 at 1:29 AM, Paolo Bonzini<bonzini@gnu.org> wrote:
>
>> As you can see, the compiler uses r9 to store data and then uses that
>> for data[0] but also loads in r7 data+8 instead of directly using r9.
>> If I remove the loop then it does not do this.
>
> This optimization is done by CSE only, currently.  That's why it cannot look
> through loops.
>
> Paolo
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Code optimization with GCSE
  2009-07-15 15:21   ` Jean Christophe Beyler
@ 2009-07-15 16:35     ` Adam Nemet
  2009-07-15 17:13       ` Jean Christophe Beyler
  0 siblings, 1 reply; 6+ messages in thread
From: Adam Nemet @ 2009-07-15 16:35 UTC (permalink / raw)
  To: Jean Christophe Beyler; +Cc: paolo.bonzini, gcc

Jean Christophe Beyler <jean.christophe.beyler@gmail.com> writes:
> uint64_t foo (void)
> {
>     return data[0] + data[1] + data[2];
> }
>
> And this generates :
>
>     la  r9,data
>     la  r7,data+8
>     ldd r6,0(r7)
>     ldd r8,0(r9)
>     ldd r7,16(r9)
>
> I'm trying to see if there is a problem with my rtx costs function
> because again, I don't understand why it would generate 2 la instead
> of using an offset of 8 and 16.

You probably want to look at the RTL dumps.  This code should have been
expanded with the correct offsets (at least that is what happens on
MIPS).  I don't see how later passes would modify the code other than
removing 2 of the 3 "la rX, data" insns.

Adam

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Code optimization with GCSE
  2009-07-15 16:35     ` Adam Nemet
@ 2009-07-15 17:13       ` Jean Christophe Beyler
  2009-07-16  7:04         ` Adam Nemet
  0 siblings, 1 reply; 6+ messages in thread
From: Jean Christophe Beyler @ 2009-07-15 17:13 UTC (permalink / raw)
  To: Adam Nemet; +Cc: paolo.bonzini, gcc

The subreg pass has this :

(insn 5 2 6 2 ex1b.c:8 (set (reg/f:DI 74)
        (const:DI (plus:DI (symbol_ref:DI ("data") <var_decl 0xb7d35058 data>)
                (const_int 8 [0x8])))) 71 {movdi_internal} (nil))

(insn 6 5 7 2 ex1b.c:8 (set (reg/f:DI 75)
        (symbol_ref:DI ("data") <var_decl 0xb7d35058 data>)) 71
{movdi_internal} (nil))

...

(insn 10 9 11 2 ex1b.c:8 (set (reg/f:DI 79)
        (const:DI (plus:DI (symbol_ref:DI ("data") <var_decl 0xb7d35058 data>)
                (const_int 16 [0x10])))) 71 {movdi_internal} (nil))


As we can see, all three are using the symbol_ref data before adding
their offset. But after cse, we get this:

(insn 5 2 6 2 ex1b.c:8 (set (reg/f:DI 74)
        (const:DI (plus:DI (symbol_ref:DI ("data") <var_decl 0xb7d35058 data>)
                (const_int 8 [0x8])))) 71 {movdi_internal} (nil))

(insn 6 5 7 2 ex1b.c:8 (set (reg/f:DI 75)
        (symbol_ref:DI ("data") <var_decl 0xb7d35058 data>)) 71
{movdi_internal} (nil))

...

(insn 10 9 11 2 ex1b.c:8 (set (reg/f:DI 79)
        (plus:DI (reg/f:DI 75)
            (const_int 16 [0x10]))) 2 {adddi3_port}
(expr_list:REG_EQUAL (const:DI (plus:DI (symbol_ref:DI ("data")
<var_decl 0
                (const_int 16 [0x10])))
        (nil)))


As we can see, the CSE pass, instead of putting the three in function
of 74, puts only the last one in function of 75.

I put the whole dump of cse at the end of this email, I didn't want to
make this one too long...

Thanks again,
Jean Christophe Beyler

------------------ Dump of cse1 ------------------

;; Function foo (foo)


3 basic blocks, 2 edges.

Basic block 0 , next 2, loop_depth 0, count 0, freq 10000, maybe hot.
Predecessors:
Successors:  2 [100.0%]  (fallthru)

Basic block 2 , prev 0, next 1, loop_depth 0, count 0, freq 10000, maybe hot.
Predecessors:  ENTRY [100.0%]  (fallthru)
Successors:  EXIT [100.0%]  (fallthru)

Basic block 1 , prev 2, loop_depth 0, count 0, freq 10000, maybe hot.
Predecessors:  2 [100.0%]  (fallthru)
Successors:

starting the processing of deferred insns
ending the processing of deferred insns
df_analyze called
df_worklist_dataflow_overeager:n_basic_blocks 3 n_edges 2 count 3 (    1)


foo

Dataflow summary:
def_info->table_size = 0, use_info->table_size = 0
;;  invalidated by call 	 2 [r2] 4 [r4] 5 [r5] 6 [r6] 7 [r7] 8 [r8] 9
[r9] 10 [r10] 11 [r11] 12 [r12] 13 [r13] 14 [r14] 15 [r15] 16 [r16] 17
[r17] 18 [r18] 19 [r19] 20 [r20] 21 [r21] 22 [r22] 23 [r23] 24 [r24]
25 [r25] 26 [r26] 27 [r27] 28 [r28] 29 [r29] 30 [r30] 31 [r31] 32
[r32] 33 [r33] 34 [r34] 35 [r35] 36 [r36] 37 [r37] 38 [r38] 39 [r39]
40 [r40] 41 [r41] 42 [r42] 43 [r43] 44 [r44] 45 [r45] 46 [r46] 47
[r47] 63 [r63] 64 [$rap] 65 [cc] 66 [acc]
;;  hardware regs used 	 0 [r0] 1 [r1] 3 [r3]
;;  regular block artificial uses 	 0 [r0] 1 [r1] 3 [r3] 62 [r62]
;;  eh block artificial uses 	 0 [r0] 1 [r1] 3 [r3] 62 [r62]
;;  entry block defs 	 0 [r0] 1 [r1] 3 [r3] 6 [r6] 8 [r8] 9 [r9] 10
[r10] 11 [r11] 12 [r12] 13 [r13] 14 [r14] 15 [r15] 62 [r62] 63 [r63]
;;  exit block uses 	 1 [r1] 3 [r3] 6 [r6] 62 [r62]
;;  regs ever live 	 6[r6]

( )->[0]->( 2 )
;; bb 0 artificial_defs: { d-1(0){ }d-1(1){ }d-1(3){ }d-1(6){ }d-1(8){
}d-1(9){ }d-1(10){ }d-1(11){ }d-1(12){ }d-1(13){ }d-1(14){ }d-1(15){
}d-1(62){ }d-1(63){ }}
;; bb 0 artificial_uses: { }

( 0 )->[2]->( 1 )
;; bb 2 artificial_defs: { }
;; bb 2 artificial_uses: { u-1(0){ }u-1(1){ }u-1(3){ }u-1(62){ }}

( 2 )->[1]->( )
;; bb 1 artificial_defs: { }
;; bb 1 artificial_uses: { u-1(1){ }u-1(3){ }u-1(6){ }u-1(62){ }}

Finding needed instructions:
  Adding insn 23 to worklist
Finished finding needed instructions:
processing block 2 live out =  0 [r0] 1 [r1] 3 [r3] 6 [r6] 62 [r62]
  Adding insn 17 to worklist
  Adding insn 13 to worklist
  Adding insn 12 to worklist
  Adding insn 11 to worklist
  Adding insn 10 to worklist
  Adding insn 9 to worklist
  Adding insn 8 to worklist
  Adding insn 7 to worklist
  Adding insn 6 to worklist
  Adding insn 5 to worklist
df_worklist_dataflow_overeager:n_basic_blocks 3 n_edges 2 count 3 (    1)
;; Following path with 11 sets: 2
deferring rescan insn with uid = 10.
deferring rescan insn with uid = 17.


try_optimize_cfg iteration 1

(note 3 0 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)

(note 2 3 5 2 NOTE_INSN_FUNCTION_BEG)

(insn 5 2 6 2 ex1b.c:8 (set (reg/f:DI 74)
        (const:DI (plus:DI (symbol_ref:DI ("data") <var_decl 0xb7d35058 data>)
                (const_int 8 [0x8])))) 71 {movdi_internal} (nil))

(insn 6 5 7 2 ex1b.c:8 (set (reg/f:DI 75)
        (symbol_ref:DI ("data") <var_decl 0xb7d35058 data>)) 71
{movdi_internal} (nil))

(insn 7 6 8 2 ex1b.c:8 (set (reg:DI 77 [ data+8 ])
        (mem/s:DI (reg/f:DI 74) [2 data+8 S8 A64])) 71 {movdi_internal} (nil))

(insn 8 7 9 2 ex1b.c:8 (set (reg:DI 78 [ data ])
        (mem/s:DI (reg/f:DI 75) [2 data+0 S8 A64])) 71 {movdi_internal} (nil))

(insn 9 8 10 2 ex1b.c:8 (set (reg:DI 76)
        (plus:DI (reg:DI 77 [ data+8 ])
            (reg:DI 78 [ data ]))) 2 {adddi3_port} (nil))

(insn 10 9 11 2 ex1b.c:8 (set (reg/f:DI 79)
        (plus:DI (reg/f:DI 75)
            (const_int 16 [0x10]))) 2 {adddi3_port}
(expr_list:REG_EQUAL (const:DI (plus:DI (symbol_ref:DI ("data")
<var_decl 0xb7d35058 data>)
                (const_int 16 [0x10])))
        (nil)))

(insn 11 10 12 2 ex1b.c:8 (set (reg:DI 80 [ data+16 ])
        (mem/s:DI (reg/f:DI 79) [2 data+16 S8 A64])) 71 {movdi_internal} (nil))

(insn 12 11 13 2 ex1b.c:8 (set (reg:DI 73)
        (plus:DI (reg:DI 76)
            (reg:DI 80 [ data+16 ]))) 2 {adddi3_port} (nil))

(insn 13 12 17 2 ex1b.c:8 (set (reg:DI 72 [ <result> ])
        (reg:DI 73)) 71 {movdi_internal} (nil))

(insn 17 13 23 2 ex1b.c:10 (set (reg/i:DI 6 r6)
        (reg:DI 73)) 71 {movdi_internal} (nil))

(insn 23 17 0 2 ex1b.c:10 (use (reg/i:DI 6 r6)) -1 (nil))
starting the processing of deferred insns
rescanning insn with uid = 10.
deleting insn with uid = 10.
rescanning insn with uid = 17.
deleting insn with uid = 17.
ending the processing of deferred insns




On Wed, Jul 15, 2009 at 12:25 PM, Adam Nemet<anemet@caviumnetworks.com> wrote:
> Jean Christophe Beyler <jean.christophe.beyler@gmail.com> writes:
>> uint64_t foo (void)
>> {
>>     return data[0] + data[1] + data[2];
>> }
>>
>> And this generates :
>>
>>     la  r9,data
>>     la  r7,data+8
>>     ldd r6,0(r7)
>>     ldd r8,0(r9)
>>     ldd r7,16(r9)
>>
>> I'm trying to see if there is a problem with my rtx costs function
>> because again, I don't understand why it would generate 2 la instead
>> of using an offset of 8 and 16.
>
> You probably want to look at the RTL dumps.  This code should have been
> expanded with the correct offsets (at least that is what happens on
> MIPS).  I don't see how later passes would modify the code other than
> removing 2 of the 3 "la rX, data" insns.
>
> Adam
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Code optimization with GCSE
  2009-07-15 17:13       ` Jean Christophe Beyler
@ 2009-07-16  7:04         ` Adam Nemet
  0 siblings, 0 replies; 6+ messages in thread
From: Adam Nemet @ 2009-07-16  7:04 UTC (permalink / raw)
  To: Jean Christophe Beyler; +Cc: paolo.bonzini, gcc

Jean Christophe Beyler writes:
> As we can see, all three are using the symbol_ref data before adding
> their offset. But after cse, we get this:
> 
> (insn 5 2 6 2 ex1b.c:8 (set (reg/f:DI 74)
>         (const:DI (plus:DI (symbol_ref:DI ("data") <var_decl 0xb7d35058 data>)
>                 (const_int 8 [0x8])))) 71 {movdi_internal} (nil))
> 
> (insn 6 5 7 2 ex1b.c:8 (set (reg/f:DI 75)
>         (symbol_ref:DI ("data") <var_decl 0xb7d35058 data>)) 71
> {movdi_internal} (nil))

You will have to debug CSE.  The actual replacment is done in this loop in
cse_insn:

      /* Terminate loop when replacement made.  This must terminate since
         the current contents will be tested and will always be valid.  */
      while (1)
	{

Adam

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2009-07-16  7:04 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-14 20:01 Code optimization with GCSE Jean Christophe Beyler
2009-07-15  5:29 ` Paolo Bonzini
2009-07-15 15:21   ` Jean Christophe Beyler
2009-07-15 16:35     ` Adam Nemet
2009-07-15 17:13       ` Jean Christophe Beyler
2009-07-16  7:04         ` Adam Nemet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).