[Bug target/49473] New: [arm] poor scheduling of loads

public inbox for gcc-bugs@sourceware.org

* [Bug target/49473] New: [arm] poor scheduling of loads
@ 2011-06-20 11:43 philb at gnu dot org
  2011-06-20 11:44 ` [Bug target/49473] " philb at gnu dot org
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: philb at gnu dot org @ 2011-06-20 11:43 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49473

           Summary: [arm] poor scheduling of loads
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: target
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: philb@gnu.org
            Target: arm-linux


The instruction scheduler doesn't seem to be doing a very good job of
accounting for the load delay slots on ARM1136JF-S.  See for example the
attached testcase:

$ ./cc1 -fPIC -O2 -mtune=arm1136jf-s -march=armv6 -mfpu=vfp -mfloat-abi=soft

which yields:

gst_mpegts_demux_sink_setcaps:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    stmfd    sp!, {r4, r5, r6, r7, r8, lr}
    sub    sp, sp, #16
    mov    r7, r1
    bl    gst_object_get_parent(PLT)
    mov    r1, #0
    ldr    r4, .L7
.LPIC0:
    add    r4, pc, r4
    mov    r5, r0
    mov    r0, r7
    bl    gst_caps_get_structure(PLT)
    ldr    r3, .L7+4
    ldr    r6, [r4, r3]
    ldr    r3, [r6, #0]
    cmp    r3, #3
    mov    r8, r0
    bls    .L5
    ldr    r3, .L7+8
    ldr    r1, .L7+12
.LPIC2:
    add    r3, pc, r3
    add    r2, r3, #64
    stmia    sp, {r1, r5}
    str    r2, [sp, #8]
    str    r7, [sp, #12]
    add    r2, r3, #12
    mov    r0, #0
    mov    r1, #4
    add    r3, r3, #32
    bl    gst_debug_log(PLT)
.L5:
    ldr    r4, .L7+16
    add    r2, r5, #32768
.LPIC1:
    add    r4, pc, r4
    mov    r0, r8
    mov    r1, r4
    add    r2, r2, #172
    bl    gst_structure_get_int(PLT)
    cmp    r0, #0
    bne    .L3
    ldr    r3, [r6, #0]
    cmp    r3, #3
    bls    .L3
    mov    r2, #484
    add    r3, r4, #88
    stmia    sp, {r2, r5}
    str    r3, [sp, #8]
    mov    r1, #4
    add    r2, r4, #12
    add    r3, r4, #32
    bl    gst_debug_log(PLT)
.L3:
    mov    r0, r5
    bl    gst_object_unref(PLT)
    mov    r0, #1
    add    sp, sp, #16
    ldmfd    sp!, {r4, r5, r6, r7, r8, pc}

Note that:

- the add at .LPIC0 will stall for two cycles because the preceding load has a
result latency of three.  The two subsequent MOVs could have been scheduled in
these slots since they don't have any data dependency on the ADD;

- the add at .LPIC1 will stall for one cycle for the same reason, and here too
the following MOV (mov r0, r8) could have been scheduled ahead of it.
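
As a rough illustration of the two stall counts above, here is a minimal
Python sketch of a single-issue, in-order pipeline.  It is not a model of the
real ARM1136JF-S pipeline or of GCC's scheduler: the one-cycle latency for
non-load instructions, the instruction tuples and the function name
count_stalls are all assumptions made for the example.  Only the load result
latency of three comes from the report.

```python
# Toy single-issue, in-order model: one instruction issues per cycle unless
# one of its source registers is not yet available.
LOAD_RESULT_LATENCY = 3   # figure quoted in the report for LDR
DEFAULT_LATENCY = 1       # assumption: every other instruction in the example


def count_stalls(insns):
    """insns is a list of (opcode, dest, sources); returns {index: stall cycles}."""
    ready = {}    # register -> first cycle at which its value can be consumed
    cycle = 0     # cycle at which the next instruction would issue if not stalled
    stalls = {}
    for i, (op, dest, srcs) in enumerate(insns):
        earliest = max([ready.get(r, 0) for r in srcs] + [cycle])
        stalls[i] = earliest - cycle
        latency = LOAD_RESULT_LATENCY if op == "ldr" else DEFAULT_LATENCY
        if dest is not None:
            ready[dest] = earliest + latency
        cycle = earliest + 1
    return stalls


# Sequence around .LPIC0 as emitted, and with the two MOVs moved up.
PIC0_EMITTED = [
    ("ldr", "r4", []),            # ldr r4, .L7
    ("add", "r4", ["pc", "r4"]),  # add r4, pc, r4
    ("mov", "r5", ["r0"]),        # mov r5, r0
    ("mov", "r0", ["r7"]),        # mov r0, r7
]
PIC0_REORDERED = [PIC0_EMITTED[0], PIC0_EMITTED[2], PIC0_EMITTED[3], PIC0_EMITTED[1]]

# Sequence around .LPIC1 as emitted, and with "mov r0, r8" moved up.
PIC1_EMITTED = [
    ("ldr", "r4", []),            # ldr r4, .L7+16
    ("add", "r2", ["r5"]),        # add r2, r5, #32768
    ("add", "r4", ["pc", "r4"]),  # add r4, pc, r4
    ("mov", "r0", ["r8"]),        # mov r0, r8
    ("mov", "r1", ["r4"]),        # mov r1, r4
]
PIC1_REORDERED = [PIC1_EMITTED[0], PIC1_EMITTED[1], PIC1_EMITTED[3],
                  PIC1_EMITTED[2], PIC1_EMITTED[4]]

if __name__ == "__main__":
    for name, seq in [("LPIC0 emitted", PIC0_EMITTED),
                      ("LPIC0 reordered", PIC0_REORDERED),
                      ("LPIC1 emitted", PIC1_EMITTED),
                      ("LPIC1 reordered", PIC1_REORDERED)]:
        print(name, sum(count_stalls(seq).values()))
```

Under these assumptions the emitted sequences stall for two cycles and one
cycle respectively, and the reordered ones do not stall at all.  The sketch
treats r0 as available immediately after the call, so it says nothing about
how the scheduler models the call itself.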

On this topic I noticed that arm1136jfs.md has:

;; An alu op can start sooner after a load, if that alu op does not
;; have an early register dependency on the load
(define_bypass 2 "11_load1"
           "11_alu_op")
(define_bypass 2 "11_load1"
           "11_alu_shift_op"
           "arm_no_early_alu_shift_value_dep")
(define_bypass 2 "11_load1"
           "11_alu_shift_reg_op"
           "arm_no_early_alu_shift_dep")

... which seems a little strange, since the result latency of LDR is three, not
two, according to the documentation.  The above bypasses look like they would be
correct for instructions where the dependency is a Late Reg, but that isn't the
case for alu_ops.



* [Bug target/49473] [arm] poor scheduling of loads
From: philb at gnu dot org @ 2011-06-20 11:44 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49473

--- Comment #1 from philb at gnu dot org 2011-06-20 11:43:48 UTC ---
Created attachment 24564
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24564
testcase



* [Bug target/49473] [arm] poor scheduling of loads
From: ramana at gcc dot gnu.org @ 2011-07-20 16:00 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49473

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.07.20 15:59:59
                 CC|                            |ramana at gcc dot gnu.org
     Ever Confirmed|0                           |1

--- Comment #2 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> 2011-07-20 15:59:59 UTC ---

> - the add at .LPIC0 will stall for two cycles because the preceding load has a
> result latency of three.  The two subsequent MOVs could have been scheduled in
> these slots since they don't have any data dependency on the ADD;

This looks like it might have to do with the latency of the call instruction, at
least for the LPIC0 case. The scheduler thinks that r0 isn't really ready until
cycle 34 or so, and hence the compiler can't hoist the mov r5, r0 above the add
r4, pc, r4.


The case around LPIC1 doesn't seem to show up in a recent build of trunk I have:

.L5:
        ldr     r1, .L7+24      @ 135   pic_load_addr_32bit     [length = 4]
        add     r2, r5, #32768  @ 25    *arm_addsi3/1   [length = 4]
        mov     r0, r7  @ 27    *arm_movsi_insn/1       [length = 4]
.LPIC1:
        add     r1, pc, r1      @ 28    pic_add_dot_plus_eight  [length = 4]
        add     r2, r2, #180    @ 29    *arm_addsi3/1   [length = 4]
        bl      gst_structure_get_int(PLT)      @ 30    *call_value_symbol


This is what I see with a more recent version of trunk, and it looks better
than what was shown in the original report.

We need to dig further into the 1136 TRM for the other comments in this report. 


Ramana



* [Bug target/49473] [arm] poor scheduling of loads
From: philb at gnu dot org @ 2011-08-03 10:38 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49473

--- Comment #3 from philb at gnu dot org 2011-08-03 10:38:28 UTC ---
(In reply to comment #2)
> This looks like it might have to do with the latency of the call instruction, at
> least for the LPIC0 case. The scheduler thinks that r0 isn't really ready until
> cycle 34 or so, and hence the compiler can't hoist the mov r5, r0 above the add
> r4, pc, r4.

That seems rather peculiar.  The worst case behaviour that the called function
is likely to have would be something like:

ldr r0, [r1]
bx lr

It's possible that the ldr might have a result latency of up to four cycles (if
it were an ARM1136 unaligned access), but the bx will take a minimum of four
cycles even if it is correctly predicted by the return stack, and hence the
result latency of the ldr will effectively be annulled.  So, as far as the
scheduler is concerned, it seems as though the result latency of the call
instruction should be considered to be one.
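
The arithmetic behind that argument can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the callee is the
two-instruction example above, the bx is assumed to issue in the cycle after
the ldr, and the caller is assumed to resume as soon as the bx completes.  The
function name residual_call_latency is hypothetical.

```python
def residual_call_latency(load_result_latency, return_branch_cycles):
    """Cycles the caller would still wait for r0 after the callee returns.

    Callee modelled as:  ldr r0, [r1]  (issues at cycle 0)
                         bx  lr        (assumed to issue at cycle 1)
    """
    r0_ready = 0 + load_result_latency         # first cycle r0 can be consumed
    caller_resumes = 1 + return_branch_cycles  # first cycle back in the caller
    return max(0, r0_ready - caller_resumes)


if __name__ == "__main__":
    # Figures from the message: ldr latency of up to four, bx of at least four.
    print(residual_call_latency(4, 4))
    # Usual ldr result latency of three, same bx cost.
    print(residual_call_latency(3, 4))
```

With the figures quoted above the residual wait is zero in both cases, which
is the sense in which the load latency is hidden behind the return branch.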

