public inbox for gcc-bugs@sourceware.org
* [Bug target/49473] New: [arm] poor scheduling of loads
@ 2011-06-20 11:43 philb at gnu dot org
2011-06-20 11:44 ` [Bug target/49473] " philb at gnu dot org
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: philb at gnu dot org @ 2011-06-20 11:43 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49473
Summary: [arm] poor scheduling of loads
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: target
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: philb@gnu.org
Target: arm-linux
The instruction scheduler doesn't seem to be doing a very good job of
accounting for the load delay slots on ARM1136JF-S. See for example the
attached testcase:
$ ./cc1 -fPIC -O2 -mtune=arm1136jf-s -march=armv6 -mfpu=vfp -mfloat-abi=soft
which yields:
gst_mpegts_demux_sink_setcaps:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        stmfd sp!, {r4, r5, r6, r7, r8, lr}
        sub sp, sp, #16
        mov r7, r1
        bl gst_object_get_parent(PLT)
        mov r1, #0
        ldr r4, .L7
.LPIC0:
        add r4, pc, r4
        mov r5, r0
        mov r0, r7
        bl gst_caps_get_structure(PLT)
        ldr r3, .L7+4
        ldr r6, [r4, r3]
        ldr r3, [r6, #0]
        cmp r3, #3
        mov r8, r0
        bls .L5
        ldr r3, .L7+8
        ldr r1, .L7+12
.LPIC2:
        add r3, pc, r3
        add r2, r3, #64
        stmia sp, {r1, r5}
        str r2, [sp, #8]
        str r7, [sp, #12]
        add r2, r3, #12
        mov r0, #0
        mov r1, #4
        add r3, r3, #32
        bl gst_debug_log(PLT)
.L5:
        ldr r4, .L7+16
        add r2, r5, #32768
.LPIC1:
        add r4, pc, r4
        mov r0, r8
        mov r1, r4
        add r2, r2, #172
        bl gst_structure_get_int(PLT)
        cmp r0, #0
        bne .L3
        ldr r3, [r6, #0]
        cmp r3, #3
        bls .L3
        mov r2, #484
        add r3, r4, #88
        stmia sp, {r2, r5}
        str r3, [sp, #8]
        mov r1, #4
        add r2, r4, #12
        add r3, r4, #32
        bl gst_debug_log(PLT)
.L3:
        mov r0, r5
        bl gst_object_unref(PLT)
        mov r0, #1
        add sp, sp, #16
        ldmfd sp!, {r4, r5, r6, r7, r8, pc}
Note that:
- the add at .LPIC0 will stall for two cycles because the preceding load has a
result latency of three. The two subsequent MOVs could have been scheduled in
these slots since they don't have any data dependency on the ADD;
- the add at .LPIC1 will stall for one cycle for the same reason, and the same
applies to the following MOV.
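The stall counts in the two notes above can be checked with a toy timing model. The following is only a sketch under stated assumptions: a single-issue, in-order pipeline in which a load result becomes usable three cycles after issue (the figure quoted in this report) and MOV/ADD results after one cycle (an assumption). The function name `count_stalls` and the tuple layout are hypothetical, and this is not a description of the real ARM1136 pipeline or of GCC's scheduler.

```python
# Toy single-issue, in-order timing model (an assumption, not the real
# ARM1136 pipeline). Latencies: LDR = 3 as quoted in the report,
# MOV/ADD = 1 as an assumption.
LOAD_LATENCY = 3
ALU_LATENCY = 1


def count_stalls(insns):
    """Return the stall cycles seen by each instruction, in order.

    Each instruction is a tuple (text, dest_reg, source_regs, latency).
    """
    ready = {}   # register -> first cycle at which its value can be used
    cycle = 0    # earliest cycle at which the next instruction may issue
    stalls = []
    for _text, dest, sources, latency in insns:
        # Issue no earlier than the issue slot and no earlier than the
        # cycle at which every source register is available.
        issue = max([cycle] + [ready.get(reg, 0) for reg in sources])
        stalls.append(issue - cycle)
        if dest is not None:
            ready[dest] = issue + latency
        cycle = issue + 1
    return stalls


# Sequence around .LPIC0 as emitted by the compiler.
as_compiled = [
    ("ldr r4, .L7", "r4", [], LOAD_LATENCY),
    ("add r4, pc, r4", "r4", ["r4"], ALU_LATENCY),
    ("mov r5, r0", "r5", ["r0"], ALU_LATENCY),
    ("mov r0, r7", "r0", ["r7"], ALU_LATENCY),
]

# Same instructions with the two MOVs hoisted above the ADD.
rescheduled = [
    ("ldr r4, .L7", "r4", [], LOAD_LATENCY),
    ("mov r5, r0", "r5", ["r0"], ALU_LATENCY),
    ("mov r0, r7", "r0", ["r7"], ALU_LATENCY),
    ("add r4, pc, r4", "r4", ["r4"], ALU_LATENCY),
]

# Sequence around .LPIC1: one independent ADD sits between load and use.
lpic1 = [
    ("ldr r4, .L7+16", "r4", [], LOAD_LATENCY),
    ("add r2, r5, #32768", "r2", ["r5"], ALU_LATENCY),
    ("add r4, pc, r4", "r4", ["r4"], ALU_LATENCY),
]

print(sum(count_stalls(as_compiled)))
print(sum(count_stalls(rescheduled)))
print(sum(count_stalls(lpic1)))
```

Under these assumptions the model reports two stall cycles for the emitted .LPIC0 sequence, none once the two MOVs are hoisted, and one for the .LPIC1 sequence, which matches the figures given above.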
On this topic I noticed that arm1136jfs.md has:
;; An alu op can start sooner after a load, if that alu op does not
;; have an early register dependency on the load
(define_bypass 2 "11_load1"
               "11_alu_op")
(define_bypass 2 "11_load1"
               "11_alu_shift_op"
               "arm_no_early_alu_shift_value_dep")
(define_bypass 2 "11_load1"
               "11_alu_shift_reg_op"
               "arm_no_early_alu_shift_dep")
... which seems a little strange, since the result latency of LDR is three, not
two, according to the documentation. The above bypasses look like they would be
correct for instructions where the dependency is a Late Reg, but that isn't the
case for alu_ops.
* [Bug target/49473] [arm] poor scheduling of loads
From: philb at gnu dot org @ 2011-06-20 11:44 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49473
--- Comment #1 from philb at gnu dot org 2011-06-20 11:43:48 UTC ---
Created attachment 24564
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24564
testcase
* [Bug target/49473] [arm] poor scheduling of loads
From: ramana at gcc dot gnu.org @ 2011-07-20 16:00 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49473
Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Keywords| |missed-optimization
Status|UNCONFIRMED |NEW
Last reconfirmed| |2011.07.20 15:59:59
CC| |ramana at gcc dot gnu.org
Ever Confirmed|0 |1
--- Comment #2 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> 2011-07-20 15:59:59 UTC ---
> - the add at .LPIC0 will stall for two cycles because the preceding load has a
> result latency of three. The two subsequent MOVs could have been scheduled in
> these slots since they don't have any data dependency on the ADD;
This looks like it might be due to the latency of the call instruction, at
least for the LPIC0 case. The scheduler thinks that r0 isn't really ready until
cycle 34 or so, and hence the compiler can't hoist the mov r5, r0 above the add
r4, pc, r4.
The case around LPIC1 doesn't seem to show up in a recent build of trunk I have:
.L5:
        ldr r1, .L7+24 @ 135 pic_load_addr_32bit [length = 4]
        add r2, r5, #32768 @ 25 *arm_addsi3/1 [length = 4]
        mov r0, r7 @ 27 *arm_movsi_insn/1 [length = 4]
.LPIC1:
        add r1, pc, r1 @ 28 pic_add_dot_plus_eight [length = 4]
        add r2, r2, #180 @ 29 *arm_addsi3/1 [length = 4]
        bl gst_structure_get_int(PLT) @ 30 *call_value_symbol
This is the bit I see with a more recent version of trunk and that looks better
than what was shown in this case.
We need to dig further into the 1136 TRM for the other comments in this report.
Ramana
* [Bug target/49473] [arm] poor scheduling of loads
From: philb at gnu dot org @ 2011-08-03 10:38 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49473
--- Comment #3 from philb at gnu dot org 2011-08-03 10:38:28 UTC ---
(In reply to comment #2)
> This looks like it might be due to the latency of the call instruction, at
> least for the LPIC0 case. The scheduler thinks that r0 isn't really ready until
> cycle 34 or so, and hence the compiler can't hoist the mov r5, r0 above the add
> r4, pc, r4.
That seems rather peculiar. The worst-case behaviour that the called function
is likely to have would be something like:
        ldr r0, [r1]
        bx lr
It's possible that the ldr might have a result latency of up to four cycles (if
it were an ARM1136 unaligned access), but the bx will take a minimum of four
cycles even if it is correctly predicted by the return stack, and hence the
result latency of the ldr will effectively be annulled. So, as far as the
scheduler is concerned, it seems as though the result latency of the call
instruction should be considered to be one.
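The arithmetic behind this argument can be written out as a small sketch. The timeline is an assumed simplification rather than a statement about the real pipeline: the callee's last producer issues at cycle 0, the return branch issues at cycle 1 and occupies `return_cycles` cycles, and the caller resumes immediately afterwards. The function name `residual_wait` is hypothetical.

```python
def residual_wait(result_latency, return_cycles):
    """Cycles the caller's first consumer still has to wait after the return.

    Assumed timeline (a simplification): the callee's last producer issues
    at cycle 0, the return branch issues at cycle 1 and occupies
    `return_cycles` cycles, so the caller resumes at 1 + return_cycles.
    """
    resume = 1 + return_cycles
    return max(0, result_latency - resume)


# Figures from the message: ldr latency of up to four cycles, bx taking
# at least four cycles.
print(residual_wait(4, 4))
```

With the figures quoted above, the residual wait is zero: the return branch alone covers the load latency, which is the sense in which the latency is "annulled" for the caller.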