public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/45622] Suboptimal code generation on arm
[not found] <bug-45622-4@http.gcc.gnu.org/bugzilla/>
@ 2010-10-04 9:15 ` ramana at gcc dot gnu.org
0 siblings, 0 replies; 2+ messages in thread
From: ramana at gcc dot gnu.org @ 2010-10-04 9:15 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45622
Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING
   Last reconfirmed|                            |2010.10.04 09:15:04
                 CC|                            |ramana at gcc dot gnu.org
     Ever Confirmed|0                           |1
--- Comment #2 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> 2010-10-04 09:15:04 UTC ---
(In reply to comment #0)
> Vorbis author Timothy Terriberry was complaining about gcc inefficiency on
> arm, so I asked him to write it up in case it would be of use to gcc devs in
> improving gcc. Is this useful?
>
The description is rather useful, but it would be good to have a copy of the
pre-processed file, as HP says in Comment #1. It also helps keep the bug
report self-contained.
What command line options are being used? i.e. what are the architecture-specific
flags (-march=?, -mfpu=?, -mfloat-abi=?)? If you use -save-temps you can pick
up the .i file.
>
> This is the result of using arm-none-linux-gnueabi-gcc (Gentoo 4.5.0 p1.1) 4.5.0
> with -O3 -fomit-frame-pointer -funroll-loops (this last makes a fairly big
> difference on x86, and so is part of the default flags for the library).
-fomit-frame-pointer is typically not required on ARM. It's one of the ABIs
that never had a frame pointer in normal functions to begin with, so removing
the frame pointer doesn't make sense.
cheers
Ramana
^ permalink raw reply [flat|nested] 2+ messages in thread
* [Bug tree-optimization/45622] New: Suboptimal code generation on arm
@ 2010-09-09 21:39 tglek at mozilla dot com
2010-09-20 1:58 ` [Bug tree-optimization/45622] " hp at gcc dot gnu dot org
0 siblings, 1 reply; 2+ messages in thread
From: tglek at mozilla dot com @ 2010-09-09 21:39 UTC (permalink / raw)
To: gcc-bugs
Vorbis author Timothy Terriberry was complaining about gcc inefficiency on
arm, so I asked him to write it up in case it would be of use to gcc devs in
improving gcc. Is this useful?
His description follows:
You asked me at the summit to describe some of the problems gcc has on ARM.
I'll start with the function oc_huff_token_decode from libtheora, which is the
last function in
http://svn.xiph.org/experimental/derf/theora-ptalarbvorm/lib/huffdec.c
This is the primary function for bitstream decoding, and may get called a few
hundred thousand times per frame. Unlike most of the other time-critical
functions in a video codec, it can't really be accelerated with SIMD asm,
though given how badly gcc does, regular asm might be worth it.
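For concreteness, the kind of bit-refill loop being discussed looks roughly
like this in C. This is an illustrative sketch only, not the actual libtheora
source: the struct layout, the field names, and the 24-bit refill threshold
are all assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical bit-reader state; layout and names are illustrative,
   not the actual libtheora definitions. */
typedef struct {
    const unsigned char *ptr;  /* next byte to consume */
    const unsigned char *stop; /* one past the last input byte */
    unsigned long window;      /* buffered bits, high bits of a 32-bit view */
    int available;             /* number of valid bits in window */
} bit_reader;

/* Pull in whole bytes until more than 16 bits are buffered or the input
   runs out.  Starting from available >= 0, the body can execute at most
   three times (0 -> 8 -> 16 -> 24). */
static void refill(bit_reader *br) {
    while (br->available <= 16 && br->ptr < br->stop) {
        br->window |= (unsigned long)*br->ptr++ << (24 - br->available);
        br->available += 8;
    }
}
```

With three input bytes and an empty window, refill() performs exactly its
maximum of three iterations.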
This is the result of using arm-none-linux-gnueabi-gcc (Gentoo 4.5.0 p1.1) 4.5.0
with -O3 -fomit-frame-pointer -funroll-loops (this last makes a fairly big
difference on x86, and so is part of the default flags for the library).
oc_huff_token_decode:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
sub sp, sp, #8
So, the next two lines already start off bad:
str r0, [sp, #4]
ldr r4, [sp, #4]
Spilling the register to the stack might be okay, but then loading it back from
the stack when it's still in r0?
ldr ip, [r0, #4]
ldr r3, [r0, #8]
ldr r2, [r4, #12]
ldr r0, [r0, #0]
And then continuing to use _both_ the copy in r0 and r4? What did we even make
the copy for? r0 is now clobbered, and r4 gets clobbered before it gets used
again.
Maybe I'm missing some subtlety here:
mov fp, r1
but it seems to me this mov could easily be eliminated by swapping the roles of
fp and r1 everywhere below (that may not be the best solution, but it's
certainly _a_ solution). This is one of the larger problems with gcc on ARM: it
emits way too many moves for something that's supposed to be compiling for a
three-address ISA.
mov r7, #0
.L418:
mov r1, r7, asl #1
ldrsh r5, [fp, r1]
cmp r2, r5
The "fast path" here takes this branch instead of falling through, but that's
probably my fault for not using PGO or __builtin_expect().
bge .L414
cmp ip, r0
bcs .L419
rsb r1, r2, #24
ldrb r6, [ip], #1 @ zero_extendqisi2
subs r4, r1, #8
add r2, r2, #8
orr r3, r3, r6, asl r1
bmi .L414
So here we're checking if ptr - stop has 4 byte alignment so that we can do 4
iterations at once without testing the if-block inside the loop. This is all
well and good, except that the loop can only iterate 3 times. All of the
unrolling it does to get the alignment right would already have been enough to
execute the entire loop. gcc continues to generate code like this even when n
and available are declared as unsigned (so that overflow would be required to
get "shift" larger than 24, instead of just loading a negative value for
available) and when -funsafe-loop-optimizations is used (though I don't think
this is what that flag actually controls). I'd be happy to hear other
suggestions for rewriting this so that gcc can recognize the iteration limit,
as I doubt that's just an ARM problem.
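One candidate rewrite (purely illustrative; the names and the 24-bit
threshold are assumptions, and I'm not claiming this is the fix libtheora
should adopt) is to make the trip count structural, so the unroller sees a
counted loop instead of having to derive the bound from the value range of
"available":

```c
#include <assert.h>

/* Hypothetical rewrite: express the "at most 3 iterations" fact as a
   counted loop, rather than leaving the compiler to deduce it from the
   range of `available`.  Parameter names are illustrative. */
static int refill_bounded(const unsigned char **ptr,
                          const unsigned char *stop,
                          unsigned long *window, int available) {
    int i;
    /* Each iteration adds 8 bits; from available >= 0 the condition
       available <= 16 can hold at most three times. */
    for (i = 0; i < 3 && available <= 16 && *ptr < stop; i++) {
        *window |= (unsigned long)*(*ptr)++ << (24 - available);
        available += 8;
    }
    return available;
}
```

The explicit i < 3 bound is redundant with the available <= 16 test, but it
hands the unroller a trip count it can see without value-range analysis.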
rsb r6, ip, r0
ands r6, r6, #3
mov r1, ip
beq .L434
cmp r0, ip
bls .L419
So here we... copy ip into r1, then immediately clobber the original ip, then
put r1 _back_ into ip. Two more wasted instructions.
mov r1, ip
ldrb ip, [r1], #1 @ zero_extendqisi2
add r2, r2, #8
orr r3, r3, ip, asl r4
subs r4, r4, #8
mov ip, r1
bmi .L414
cmp r6, #1
beq .L434
cmp r6, #2
beq .L429
ldrb ip, [r1], #1 @ zero_extendqisi2
add r2, r2, #8
orr r3, r3, ip, asl r4
subs r4, r4, #8
mov ip, r1
bmi .L414
.L429:
ldrb ip, [r1], #1 @ zero_extendqisi2
add r2, r2, #8
orr r3, r3, ip, asl r4
subs r4, r4, #8
mov ip, r1
bmi .L414
So here's another minor disaster of excess mov's. r7 is copied into r9 so that
r7 can be used inside the loop... except this loop is self-contained and could
just as well have used r9.
.L434:
mov r9, r7
.L415:
Also, since we just branched here so we could compare ptr to stop once every
four (of three total) iterations, the first thing we do is... compare ptr
against stop. So in the best case this "optimization" actually does three
comparisons instead of two if the loop iterates twice (including the alignment
test), and only if we iterate three times (the maximum) and happened to have
the correct alignment (1 in 4 chance) does it break even.
cmp r0, r1
What is actually put in r7? Why, one of _two_ copies of r1. r1 really contains
the value of ip, which we uselessly copied to it above, and using it here means
we needed to add _another_ useless copy of ip to r1 in the other path to this
label.
mov r7, r1
add r2, r2, #8
This is the one that really takes the cake: just in case we branch here, we'll
copy r1 back into ip (note that ip already contains r1), and then if we don't
branch we immediately clobber it. It does this every time.
mov ip, r1
bls .L437
ldrb ip, [r7], #1 @ zero_extendqisi2
subs r6, r4, #8
orr r3, r3, ip, asl r4
mov r8, r2
mov ip, r7
This one is also pretty fantastic. So, this time we're going to do the load
using r1, instead of r7 like we did above (why do we have two copies? I still
don't know! actually three if you count ip). That means we need to increment r7
manually instead of doing it as part of the load. And just for giggles, we do
that _before_ the branch here, which goes to an instruction that immediately
clobbers r7.
add r7, r7, #1
bmi .L436
ldrb ip, [r1, #1] @ zero_extendqisi2
subs sl, r6, #8
orr r3, r3, ip, asl r6
add r2, r2, #8
mov ip, r7
bmi .L436
ldrb r6, [r7, #0] @ zero_extendqisi2
subs r7, r4, #24
add ip, r1, #3
add r2, r8, #16
orr r3, r3, r6, asl sl
bmi .L436
ldrb ip, [r1, #3] @ zero_extendqisi2
subs r4, r4, #32
add r1, r1, #4
orr r3, r3, ip, asl r7
add r2, r8, #24
Yeah, better copy r1 into ip again so we can... copy r1 into ip again after the
jump. Fortunately the iteration limit means we can never actually get here.
mov ip, r1
bpl .L415
.L436:
mov r7, r9
.L414:
rsb r1, r5, #32
add r1, r7, r3, lsr r1
add r7, fp, r1, asl #1
ldrsh r7, [r7, #2]
cmp r7, #0
Again, the fast path takes this branch.
ble .L417
.L438:
mov r3, r3, asl r5
rsb r2, r5, r2
b .L418
.L437:
mov r7, r9
.L419:
rsb r1, r5, #32
add r1, r7, r3, lsr r1
add r7, fp, r1, asl #1
ldrsh r7, [r7, #2]
mov r2, #1073741824
cmp r7, #0
bgt .L438
.L417:
rsb r7, r7, #0
mov r0, r7, asr #8
mov r3, r3, asl r0
Yeah! We can finally load back that value we stashed on the stack! Maybe if we
hadn't made three or four copies of ptr, we might have been able to keep it in
a register.
ldr r1, [sp, #4]
rsb r2, r0, r2
str ip, [r1, #4]
and r0, r7, #255
str r3, [r1, #8]
str r2, [r1, #12]
add sp, sp, #8
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
bx lr
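For reference, the epilogue above appears to correspond to something like the
following C. This is my reading of the assembly; the packing of the table
entry and all of the names are guesses, not the actual libtheora source.

```c
#include <assert.h>

/* A reading of the epilogue: the negated table entry seems to hold the
   consumed bit count in its upper bits and the token in its low 8 bits.
   Comments map each line back to the instruction above. */
static int finish_token(long entry, unsigned long *window, int *available) {
    long n = -entry;               /* rsb r7, r7, #0     */
    int consumed = (int)(n >> 8);  /* mov r0, r7, asr #8 */
    *window <<= consumed;          /* mov r3, r3, asl r0 */
    *available -= consumed;        /* rsb r2, r0, r2     */
    return (int)(n & 255);         /* and r0, r7, #255   */
}
```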
Things are a little better without -funroll-loops, mostly because the function
is much shorter, but there are still obvious problems, e.g., in the epilogue:
and r8, r8, #255
stmib r0, {r3, ip} @ phole stm
str r2, [r0, #12]
mov r0, r8
If the and was re-ordered, it could take the place of the mov (the
-funroll-loops version gets this right). Or the prologue:
ldmib r0, {r3, ip} @ phole ldm
ldr r4, [r0, #0]
ldr r2, [r0, #12]
It isn't clear to me why it picked registers that were out of order, when it
could have loaded all four with one ldm. But hey, at least that's better than
the -funroll-loops version, which did four individual loads! Fixing this would
also likely have allowed combining the stmib/str in the epilogue.
--
Summary: Suboptimal code generation on arm
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: tglek at mozilla dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45622