public inbox for gcc@gcc.gnu.org
* Inefficient loop unrolling.
@ 2008-07-02 11:59 Bingfeng Mei
  2008-07-02 12:08 ` Richard Guenther
  2008-07-02 16:14 ` Steven Bosscher
  0 siblings, 2 replies; 7+ messages in thread
From: Bingfeng Mei @ 2008-07-02 11:59 UTC (permalink / raw)
  To: gcc

Hello,
I am looking at GCC's loop unrolling and find it quite inefficient
compared with a manually unrolled loop, even for very simple loops.
Below are a simple loop and its manually unrolled version. I applied no
tricks to the manually unrolled one; it is an exact replication of the
original loop body. I had expected that with -funroll-loops the first
version would produce code of similar quality to the second one.
However, compiled for the ARM target with mainline GCC, the two
functions produce very different results.

The GCC-unrolled version mainly suffers from two issues. First, the
load/store offsets are registers: extra ADD instructions are needed to
advance the offset on each iteration. In contrast, the manually unrolled
code makes efficient use of immediate offsets and needs only one ADD to
adjust the base register at the end. Second, the alias (dependence)
analysis is overly conservative: the LOAD instruction of the next
unrolled iteration cannot be moved past the previous STORE instruction
even though they clearly do not alias. I suspect the failure of alias
analysis is related to the first issue of handling base and offset
addresses. The .sched2 file shows that the first loop body requires 57
cycles whereas the second one takes 50 cycles on an ARM9 (56 vs. 34
cycles on XScale). It becomes even worse for our VLIW port due to the
longer latencies of MUL and load instructions and the inability to fill
all slots (120 vs. 20 cycles).

By analyzing the compilation phases, I believe that if loop unrolling
happened at the tree level, or if we had an optimizing pass like
"ivopts" after loop unrolling at the RTL level, GCC could produce far
more efficient loop-unrolled code. The "ivopts" pass really does a
wonderful job of optimizing induction variables. Strangely, I found some
unrolling functions at the tree level, but there is no independent
tree-level loop unrolling pass except "cunroll", which performs only
complete unrolling. What prevents such a tree-level unrolling pass? Or
is there any suggestion for improving the existing RTL-level unrolling?
Thanks in advance.

Cheers,
Bingfeng Mei
Broadcom UK


void Unroll(short s, int *restrict b_inout, int *restrict out)
{
	int i;
	for (i = 0; i < 64; i++)
	{
		b_inout[i] = b_inout[i] * s;
	}
}


void ManualUnroll(short s, int *restrict b_inout, int *restrict out)
{
	int i;
	for (i = 0; i < 64;)
	{
		b_inout[i] = b_inout[i] * s;
		i++;
		b_inout[i] = b_inout[i] * s;
		i++;
		b_inout[i] = b_inout[i] * s;
		i++;
		b_inout[i] = b_inout[i] * s;
		i++;
		b_inout[i] = b_inout[i] * s;
		i++;
		b_inout[i] = b_inout[i] * s;
		i++;
		b_inout[i] = b_inout[i] * s;
		i++;
		b_inout[i] = b_inout[i] * s;
		i++;
	}
}
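
For reference, the addressing pattern that the manually unrolled assembly
achieves can also be written directly in C. The following is an
illustrative sketch only (the function name PtrUnroll is made up and is
not part of the original test case): one moving base pointer with
constant displacements, and a single pointer bump per unrolled body.

```c
#include <assert.h>

/* Illustrative sketch (not from the original mail): express the
 * manually unrolled loop with one moving base pointer so every access
 * uses a constant displacement, matching the immediate-offset
 * addressing of the hand-written assembly. */
void PtrUnroll(short s, int *restrict b_inout)
{
	int *p = b_inout;
	int *end = b_inout + 64;
	while (p != end) {
		p[0] = p[0] * s;
		p[1] = p[1] * s;
		p[2] = p[2] * s;
		p[3] = p[3] * s;
		p[4] = p[4] * s;
		p[5] = p[5] * s;
		p[6] = p[6] * s;
		p[7] = p[7] * s;
		p += 8;	/* the single ADD per iteration */
	}
}
```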


arm-elf-gcc tst2.c -O2  -std=c99 -S  -v -fdump-tree-all  -da  -mcpu=arm9
-funroll-loops
Unroll:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	mov	r0, r0, asl #16
	stmfd	sp!, {r4, r5, r6}
	mov	r4, r1
	mov	r6, r0, asr #16
	mov	r5, #0
.L2:
	ldr	r1, [r4, r5]
	add	ip, r5, #4
	mul	r0, r6, r1
	str	r0, [r4, r5]
	ldr	r3, [r4, ip]
	add	r0, ip, #4
	mul	r2, r6, r3
	str	r2, [r4, ip]
	ldr	r1, [r4, r0]
	add	ip, r5, #12
	mul	r3, r6, r1
	str	r3, [r4, r0]
	ldr	r2, [r4, ip]
	add	r1, r5, #16
	mul	r3, r6, r2
	str	r3, [r4, ip]
	ldr	r0, [r4, r1]
	add	ip, r5, #20
	mul	r3, r6, r0
	str	r3, [r4, r1]
	ldr	r2, [r4, ip]
	add	r1, r5, #24
	mul	r0, r6, r2
	str	r0, [r4, ip]
	ldr	r3, [r4, r1]
	add	ip, r5, #28
	mul	r0, r6, r3
	str	r0, [r4, r1]
	ldr	r2, [r4, ip]
	add	r5, r5, #32
	mul	r3, r6, r2
	cmp	r5, #256
	str	r3, [r4, ip]
	bne	.L2
	ldmfd	sp!, {r4, r5, r6}
	bx	lr
	.size	Unroll, .-Unroll

arm-elf-gcc tst2.c -O2  -std=c99 -S  -v -fdump-tree-all  -da  -mcpu=arm9

ManualUnroll:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	mov	r0, r0, asl #16
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
	mov	sl, r1
	mov	r9, r0, asr #16
	add	fp, r1, #256
.L7:
	ldr	r3, [sl, #0]
	ldr	r2, [sl, #4]
	ldr	r1, [sl, #8]
	ldr	r0, [sl, #12]
	ldr	ip, [sl, #16]
	add	r4, sl, #20
	ldmia	r4, {r4, r5, r6}	@ phole ldm
	mul	r7, r9, r3
	mul	r8, r9, r2
	mul	r3, r9, r1
	mul	r2, r9, r0
	mul	r1, r9, ip
	mul	r0, r9, r4
	mul	ip, r9, r5
	mul	r4, r9, r6
	stmia	sl, {r7, r8}	@ phole stm
	str	r3, [sl, #8]
	str	r2, [sl, #12]
	str	r1, [sl, #16]
	str	r0, [sl, #20]
	str	ip, [sl, #24]
	str	r4, [sl, #28]
	add	sl, sl, #32
	cmp	sl, fp
	bne	.L7
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
	bx	lr
	.size	ManualUnroll, .-ManualUnroll
	.ident	"GCC: (GNU) 4.4.0 20080530 (experimental)"

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Inefficient loop unrolling.
  2008-07-02 11:59 Inefficient loop unrolling Bingfeng Mei
@ 2008-07-02 12:08 ` Richard Guenther
  2008-07-02 13:40   ` Bingfeng Mei
  2008-07-02 16:14 ` Steven Bosscher
  1 sibling, 1 reply; 7+ messages in thread
From: Richard Guenther @ 2008-07-02 12:08 UTC (permalink / raw)
  To: Bingfeng Mei; +Cc: gcc

On Wed, Jul 2, 2008 at 1:13 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> Hello,
> I am looking at GCC's loop unrolling and find it quite inefficient
> compared with manually unrolled loop even for very simple loop. The
> followings are a simple loop and its manually unrolled version. I didn't
> apply any trick on manually unrolled one as it is exact replications of
> original loop body. I have expected by -funroll-loops the first version
> should produce code of similar quality as the second one. However,
> compiled with ARM target of mainline GCC, both functions produce very
> different results.
>
> GCC-unrolled version mainly suffers from two issues. First, the
> load/store offsets are registers. Extra ADD instructions are needed to
> increase offset over iteration. In the contrast, manually unrolled code
> makes use of immediate offset efficiently and only need one ADD to
> adjust base register in the end. Second, the alias (dependence) analysis
> is over conservative. The LOAD instruction of next unrolled iteration
> cannot be moved beyond previous STORE instruction even they are clearly
> not aliased. I suspect the failure of alias analysis is related to the
> first issue of handling base and offset address. The .sched2 file shows
> that the first loop body requires 57 cycles whereas the second one takes
> 50 cycles for arm9 (56 cycles vs 34 cycles for Xscale).  It become even
> worse for our VLIW porting due to longer latency of MUL and Load
> instructions and incapability of filling all slots (120 cycles vs. 20
> cycles)
>
> By analyzing compilation phases, I believe if the loop unrolling happens
> at the tree-level, or if we have an optimizing pass like "ivopts" after
> loop unrolling in RTL level, GCC can produce far more efficient
> loop-unrolled code.  "ivopts" pass really does a wonderful job in
> optimizing induction variables. Strangely, I found some unrolling
> functions at tree-level, but there is no independent tree-level loop
> unrolling pass except "cunroll", which is complete unrolling.  What
> prevents such a tree-level unrolling pass? Or is there any suggestion to
> improve existing RTL level unrolling? Thanks in advance.

On the tree level only complete unrolling is done.  The reason for this
was the difficulty (or our unwillingness) to properly tune this for the
target (loop unrolling is not a generally profitable optimization,
except for complete unrolling of small loops).

I would suggest looking into doing partial unrolling on the tree level
as well.

Richard.
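
As a sketch of what such a tree-level partial unrolling pass would emit
for a loop whose trip count is not known at compile time: an unroll-by-4
main loop plus a scalar remainder loop. The function name scale4 is
hypothetical and not from the thread.

```c
#include <assert.h>

/* Hypothetical sketch of a partial unroll-by-4 with a remainder loop,
 * the transformation being discussed for the tree level. */
void scale4(short s, int *restrict b, int n)
{
	int i = 0;
	for (; i + 4 <= n; i += 4) {	/* unrolled main loop */
		b[i]     *= s;
		b[i + 1] *= s;
		b[i + 2] *= s;
		b[i + 3] *= s;
	}
	for (; i < n; i++)		/* remainder iterations */
		b[i] *= s;
}
```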

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: Inefficient loop unrolling.
  2008-07-02 12:08 ` Richard Guenther
@ 2008-07-02 13:40   ` Bingfeng Mei
  0 siblings, 0 replies; 7+ messages in thread
From: Bingfeng Mei @ 2008-07-02 13:40 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc

Loop unrolling is often a big deal for embedded processors. It makes a
10x performance difference for our VLIW processor in many loops. I will
look into partial loop unrolling at the tree level. If possible, I would
like to contribute to GCC. I just learned that my company (Broadcom) has
signed a company-wide FSF copyright assignment.

Cheers,
Bingfeng

-----Original Message-----
From: Richard Guenther [mailto:richard.guenther@gmail.com] 
Sent: 02 July 2008 12:59
To: Bingfeng Mei
Cc: gcc@gcc.gnu.org
Subject: Re: Inefficient loop unrolling.

On Wed, Jul 2, 2008 at 1:13 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> Hello,
> I am looking at GCC's loop unrolling and find it quite inefficient
> compared with manually unrolled loop even for very simple loop. The
> followings are a simple loop and its manually unrolled version. I didn't
> apply any trick on manually unrolled one as it is exact replications of
> original loop body. I have expected by -funroll-loops the first version
> should produce code of similar quality as the second one. However,
> compiled with ARM target of mainline GCC, both functions produce very
> different results.
>
> GCC-unrolled version mainly suffers from two issues. First, the
> load/store offsets are registers. Extra ADD instructions are needed to
> increase offset over iteration. In the contrast, manually unrolled code
> makes use of immediate offset efficiently and only need one ADD to
> adjust base register in the end. Second, the alias (dependence) analysis
> is over conservative. The LOAD instruction of next unrolled iteration
> cannot be moved beyond previous STORE instruction even they are clearly
> not aliased. I suspect the failure of alias analysis is related to the
> first issue of handling base and offset address. The .sched2 file shows
> that the first loop body requires 57 cycles whereas the second one takes
> 50 cycles for arm9 (56 cycles vs 34 cycles for Xscale).  It become even
> worse for our VLIW porting due to longer latency of MUL and Load
> instructions and incapability of filling all slots (120 cycles vs. 20
> cycles)
>
> By analyzing compilation phases, I believe if the loop unrolling happens
> at the tree-level, or if we have an optimizing pass like "ivopts" after
> loop unrolling in RTL level, GCC can produce far more efficient
> loop-unrolled code.  "ivopts" pass really does a wonderful job in
> optimizing induction variables. Strangely, I found some unrolling
> functions at tree-level, but there is no independent tree-level loop
> unrolling pass except "cunroll", which is complete unrolling.  What
> prevents such a tree-level unrolling pass? Or is there any suggestion to
> improve existing RTL level unrolling? Thanks in advance.

On the tree level only complete unrolling is done.  The reason for this
was the difficulty (or our unwillingness) to properly tune this for the
target (loop unrolling is not a generally profitable optimization, unless
complete unrolling for small loops).

I would suggest to look into doing also partial unrolling on the tree
level.

Richard.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Inefficient loop unrolling.
  2008-07-02 11:59 Inefficient loop unrolling Bingfeng Mei
  2008-07-02 12:08 ` Richard Guenther
@ 2008-07-02 16:14 ` Steven Bosscher
  2008-07-03 13:48   ` Bingfeng Mei
  1 sibling, 1 reply; 7+ messages in thread
From: Steven Bosscher @ 2008-07-02 16:14 UTC (permalink / raw)
  To: Bingfeng Mei; +Cc: gcc

On Wed, Jul 2, 2008 at 1:13 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> Hello,
> I am looking at GCC's loop unrolling and find it quite inefficient
> compared with manually unrolled loop even for very simple loop. The
> followings are a simple loop and its manually unrolled version. I didn't
> apply any trick on manually unrolled one as it is exact replications of
> original loop body. I have expected by -funroll-loops the first version
> should produce code of similar quality as the second one. However,
> compiled with ARM target of mainline GCC, both functions produce very
> different results.
>
> GCC-unrolled version mainly suffers from two issues. First, the
> load/store offsets are registers. Extra ADD instructions are needed to
> increase offset over iteration. In the contrast, manually unrolled code
> makes use of immediate offset efficiently and only need one ADD to
> adjust base register in the end. Second, the alias (dependence) analysis
> is over conservative. The LOAD instruction of next unrolled iteration
> cannot be moved beyond previous STORE instruction even they are clearly
> not aliased. I suspect the failure of alias analysis is related to the
> first issue of handling base and offset address. The .sched2 file shows
> that the first loop body requires 57 cycles whereas the second one takes
> 50 cycles for arm9 (56 cycles vs 34 cycles for Xscale).  It become even
> worse for our VLIW porting due to longer latency of MUL and Load
> instructions and incapability of filling all slots (120 cycles vs. 20
> cycles)

Both issues should be solvable for RTL (where unrolling belongs IMHO).
If you file a PR in bugzilla (with this test case, target, compiler
options, etc.), I promise I will analyze why we don't fold away the
ADDs, and why the scheduler doesn't glob the loads (it may be due to bad
alias analysis, but maybe something else is not working properly).

Gr.
Steven

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: Inefficient loop unrolling.
  2008-07-02 16:14 ` Steven Bosscher
@ 2008-07-03 13:48   ` Bingfeng Mei
  2008-07-10 13:04     ` Paolo Bonzini
  0 siblings, 1 reply; 7+ messages in thread
From: Bingfeng Mei @ 2008-07-03 13:48 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: gcc

Steven,
I just created a bug report. You should receive a CCed mail now.

I can see these issues are solvable at the RTL level, but they require a
lot of effort. The main optimization in the loop unrolling pass,
induction variable splitting ("split iv"), can reduce the dependence
chain but does not address the extra ADDs or the alias issue. What is
the main reason that loop unrolling should belong at the RTL level? Is
it fundamental?
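
For context, "split iv" rewrites the single induction variable of an
unrolled loop into one independent induction variable per unrolled copy,
which shortens the serial dependence chain. A rough C-level picture (the
function name is hypothetical, not from the mail):

```c
#include <assert.h>

/* Hypothetical sketch of induction variable splitting: each unrolled
 * copy gets its own IV stepped by the full unroll stride, so the
 * copies do not all serialize on a single incremented register. */
void split_iv(short s, int *restrict b)
{
	int i0 = 0, i1 = 1, i2 = 2, i3 = 3;	/* one IV per copy */
	for (; i0 < 64; i0 += 4, i1 += 4, i2 += 4, i3 += 4) {
		b[i0] *= s;
		b[i1] *= s;
		b[i2] *= s;
		b[i3] *= s;
	}
}
```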

Cheers,
Bingfeng

-----Original Message-----
From: Steven Bosscher [mailto:stevenb.gcc@gmail.com] 
Sent: 02 July 2008 17:01
To: Bingfeng Mei
Cc: gcc@gcc.gnu.org
Subject: Re: Inefficient loop unrolling.

On Wed, Jul 2, 2008 at 1:13 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> Hello,
> I am looking at GCC's loop unrolling and find it quite inefficient
> compared with manually unrolled loop even for very simple loop. The
> followings are a simple loop and its manually unrolled version. I didn't
> apply any trick on manually unrolled one as it is exact replications of
> original loop body. I have expected by -funroll-loops the first version
> should produce code of similar quality as the second one. However,
> compiled with ARM target of mainline GCC, both functions produce very
> different results.
>
> GCC-unrolled version mainly suffers from two issues. First, the
> load/store offsets are registers. Extra ADD instructions are needed to
> increase offset over iteration. In the contrast, manually unrolled code
> makes use of immediate offset efficiently and only need one ADD to
> adjust base register in the end. Second, the alias (dependence) analysis
> is over conservative. The LOAD instruction of next unrolled iteration
> cannot be moved beyond previous STORE instruction even they are clearly
> not aliased. I suspect the failure of alias analysis is related to the
> first issue of handling base and offset address. The .sched2 file shows
> that the first loop body requires 57 cycles whereas the second one takes
> 50 cycles for arm9 (56 cycles vs 34 cycles for Xscale).  It become even
> worse for our VLIW porting due to longer latency of MUL and Load
> instructions and incapability of filling all slots (120 cycles vs. 20
> cycles)

Both issues should be solvable for RTL (where unrolling belongs IMHO).
 If you file a PR in bugzilla (with this test case, target, compiler
options, etc), I promise I will analyze why we don't fold away the
ADDs, and why the scheduler doesn't glob the loads (may be due to bad
alias analysis, but maybe something else is not working properly).

Gr.
Steven


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Inefficient loop unrolling.
  2008-07-03 13:48   ` Bingfeng Mei
@ 2008-07-10 13:04     ` Paolo Bonzini
  2008-07-10 13:44       ` Bingfeng Mei
  0 siblings, 1 reply; 7+ messages in thread
From: Paolo Bonzini @ 2008-07-10 13:04 UTC (permalink / raw)
  To: Bingfeng Mei; +Cc: Steven Bosscher, gcc

Bingfeng Mei wrote:
> Steven,
> I just created a bug report. You should receive a CCed mail now.
> 
> I can see these issues are solvable at RTL-level, but require lots of
> efforts. The main optimization in loop unrolling pass, split iv, can
> reduce dependence chain but not extra ADDs and alias issue. What is the
> main reason that loop unrolling should belong to RTL level? Is it
> fundamental?

No, it is just the effectiveness of the code-size expansion heuristics.
Ivopts is already complex enough at the tree level that doing it on RTL
would be insane.  But other low-level loop optimizations had already
been written at the RTL level, and since there were no compelling
reasons to move them, they were left there.

That said, this is a bug -- fwprop should have folded the ADDs, at the
very least.  I'll look at the PR.

Paolo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: Inefficient loop unrolling.
  2008-07-10 13:04     ` Paolo Bonzini
@ 2008-07-10 13:44       ` Bingfeng Mei
  0 siblings, 0 replies; 7+ messages in thread
From: Bingfeng Mei @ 2008-07-10 13:44 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Steven Bosscher, gcc

Paolo,
Thanks for the reply. However, I am not sure it is a simple folding
issue.

For example,

B1 = B + 4;
 = [A, B1]
B2 = B + 8;
 = [A, B2]
B3 = B + 12;
 = [A, B3]

should be transformed into

C = A + B
= [C, 4]
= [C, 8]
= [C, 12]

The loop exit condition needs to be changed accordingly.
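
The same rewrite can be spelled out on C pointers. This is an
illustrative sketch only (the function names are made up): the "before"
form recomputes an offset register for every access, while the "after"
form folds base and offset once so each access becomes a constant
displacement.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch (hypothetical function names): the base/offset
 * folding described above, expressed in C.  Both functions compute
 * the same sum; only the addressing shape differs. */
int sum_before(const int *a, intptr_t b)
{
	intptr_t b1 = b + 1;	/* extra ADD per access */
	intptr_t b2 = b + 2;
	intptr_t b3 = b + 3;
	return a[b1] + a[b2] + a[b3];
}

int sum_after(const int *a, intptr_t b)
{
	const int *c = a + b;		/* one combined base */
	return c[1] + c[2] + c[3];	/* constant displacements */
}
```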

BTW, I just added an experimental tree-level loop unrolling pass to my
port, right before the ivopts pass. The results are very promising
except for a few quirky things, which I believe to be problems in
ivopts. The produced assembly code is now as good as manual unrolling.

Cheers,
Bingfeng


-----Original Message-----
From: Paolo Bonzini [mailto:paolo.bonzini@gmail.com] On Behalf Of Paolo
Bonzini
Sent: 10 July 2008 13:34
To: Bingfeng Mei
Cc: Steven Bosscher; gcc@gcc.gnu.org
Subject: Re: Inefficient loop unrolling.

Bingfeng Mei wrote:
> Steven,
> I just created a bug report. You should receive a CCed mail now.
> 
> I can see these issues are solvable at RTL-level, but require lots of
> efforts. The main optimization in loop unrolling pass, split iv, can
> reduce dependence chain but not extra ADDs and alias issue. What is the
> main reason that loop unrolling should belong to RTL level? Is it
> fundamental?

No, it is just effectiveness of the code size expansion heuristics.
Ivopts is already complex enough on the tree level, that doing it on RTL
would be insane.  But other low-level loop optimizations had already
been written on the RTL level and since there were no compelling
reasons, they were left there.

That said, this is a bug -- fwprop should have folded the ADDs, at the
very least.  I'll look at the PR.

Paolo


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2008-07-10 13:04 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-07-02 11:59 Inefficient loop unrolling Bingfeng Mei
2008-07-02 12:08 ` Richard Guenther
2008-07-02 13:40   ` Bingfeng Mei
2008-07-02 16:14 ` Steven Bosscher
2008-07-03 13:48   ` Bingfeng Mei
2008-07-10 13:04     ` Paolo Bonzini
2008-07-10 13:44       ` Bingfeng Mei
