public inbox for gcc-help@gcc.gnu.org
* RE:  Re: better load/store scheduling
@ 2007-03-01 21:08 Ben Cheng
  2007-03-01 22:12 ` Vladimir Makarov
From: Ben Cheng @ 2007-03-01 21:08 UTC
  To: gcc-help

Well, I guess the real question is: how do I get gcc to produce a better
schedule when loop unrolling is enabled?

My original code is actually:

    for (i = 0; i < 4096; i++) {
        g[i]   = h[i] + 10;
    }

After gcc unrolls the loop, the bodies of different iterations do not
overlap, because the loads from later iterations are not scheduled above
the stores from earlier ones. I thought this might be a phase-ordering
issue between optimization passes, so I unrolled the loop by hand.
Unfortunately, I still cannot get gcc to schedule the loads and stores
more aggressively.
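
To make the desired overlap concrete, here is a minimal sketch in C of the
ordering being asked for, written out by hand: the load for iteration i+1
is issued before the store for iteration i. The names add10,
add10_overlapped and N are hypothetical, introduced only for this
illustration. This is not output that gcc is claimed to produce, and it
uses exactly the kind of temporary that the rolled source is meant to
avoid.

```c
#define N 4096

/* Rolled loop, as in the original code. */
void add10(int *__restrict g, const int *__restrict h)
{
    int i;
    for (i = 0; i < N; i++) {
        g[i] = h[i] + 10;
    }
}

/* Same computation with the load for the next iteration moved above
   the store for the current one (hand-written illustration only). */
void add10_overlapped(int *__restrict g, const int *__restrict h)
{
    int i;
    int cur = h[0];
    for (i = 0; i < N - 1; i++) {
        int next = h[i + 1];   /* load for iteration i+1 */
        g[i] = cur + 10;       /* store for iteration i */
        cur = next;
    }
    g[N - 1] = cur + 10;       /* last element has no following load */
}
```

Both functions should fill g with the same values as long as g and h do
not overlap, which is what the __restrict qualifiers promise.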

Since I want gcc to unroll the loop for me, I cannot create temporaries
for h[i]. Therefore I am still hoping for some magic command line
options to make gcc produce better scheduling.

Thanks,
-Ben

-----Original Message-----
From: gcc-help-owner@gcc.gnu.org [mailto:gcc-help-owner@gcc.gnu.org] On
Behalf Of Sergei Organov
Sent: Thursday, March 01, 2007 3:22 AM
To: gcc-help@gcc.gnu.org
Subject: Re: better load/store scheduling

"Ben Cheng" <bccheng@peakstreaminc.com> writes:
> I am trying to tune the performance of hand-unrolled code. I was
> wondering what cmd-line options should I specify in order to get
> h[i+1] loaded before the store to g[i]:
>
>
> Code:
>
> void foo(int * __restrict g, int * __restrict h)
> {
>     int i;
>     for (i = 0; i < 4096; i+=2) {
>         g[i]   = h[i] + 10;
>         g[i+1] = h[i+1] + 10;
>     }
> }

Use temporaries:

void foo(int * __restrict g, int * __restrict h)
{
    int i;
    for (i = 0; i < 4096; i+=2) {
        int a = h[i];
        int b = h[i+1];
        g[i]   = a + 10;
        g[i+1] = b + 10;
    }
}
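
The rewrite with temporaries changes the source order so that both loads
already precede both stores, which means the compiler does not have to
prove that g and h are disjoint before it can issue them that way. A small
sketch of a check that the two versions agree on non-overlapping arrays
follows. The names foo_orig, foo_temps and versions_agree are hypothetical
and are not part of the original post.

```c
#define N 4096

/* Hand-unrolled loop as originally posted. */
static void foo_orig(int *__restrict g, int *__restrict h)
{
    int i;
    for (i = 0; i < N; i += 2) {
        g[i]   = h[i] + 10;
        g[i+1] = h[i+1] + 10;
    }
}

/* Same loop with both loads hoisted into temporaries. */
static void foo_temps(int *__restrict g, int *__restrict h)
{
    int i;
    for (i = 0; i < N; i += 2) {
        int a = h[i];
        int b = h[i+1];
        g[i]   = a + 10;
        g[i+1] = b + 10;
    }
}

/* Returns 1 if both versions produce identical output for separate
   (non-overlapping) input and output arrays, 0 otherwise. */
int versions_agree(void)
{
    static int h[N], g1[N], g2[N];
    int i;
    for (i = 0; i < N; i++)
        h[i] = i * 3 - 7;          /* arbitrary test data */
    foo_orig(g1, h);
    foo_temps(g2, h);
    for (i = 0; i < N; i++)
        if (g1[i] != g2[i])
            return 0;
    return 1;
}
```

If g and h did overlap (which __restrict forbids), the two versions could
give different results, for example when g == h + 1, because the first
store would then modify h[i+1] before the original code reads it.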

>
> Command line:
>
> gcc-4.0.2 -O3 loop.c -fargument-noalias-global -fstrict-aliasing -S
> loop.s
>
> Assembly code of the loop body:
>
> .L2:
>         leal    0(,%ebx,4), %eax
>         leal    (%eax,%esi), %ecx
>         leal    (%edi,%eax), %eax
>         movl    -8(%ecx), %edx                  // = h[i]
>         addl    $10, %edx                       // + 10
>         movl    %edx, -8(%eax)                  // g[i] = 
>         movl    -4(%ecx), %edx                  // = h[i+1]
>         addl    $10, %edx                       // + 10
>         movl    %edx, -4(%eax)                  // g[i+1] =
>         addl    $2, %ebx
>         cmpl    $4098, %ebx
>         jne     .L2

With gcc 4.0.4, the version with temporaries gives:

.L2:
	leal	0(,%ebx,4), %edx
	addl	$2, %ebx
	leal	(%esi,%edx), %eax
	addl	%edi, %edx
	movl	-4(%eax), %ecx
	movl	-8(%eax), %eax
	addl	$10, %ecx
	addl	$10, %eax
	cmpl	$4098, %ebx
	movl	%eax, -8(%edx)
	movl	%ecx, -4(%edx)
	jne	.L2

With gcc 4.1.2, the version with temporaries gives:

.L2:
	movl	-4(%ebx,%ecx,4), %eax
	movl	-8(%ebx,%ecx,4), %edx
	addl	$10, %eax
	addl	$10, %edx
	movl	%edx, -8(%esi,%ecx,4)
	movl	%eax, -4(%esi,%ecx,4)
	addl	$2, %ecx
	cmpl	$4098, %ecx
	jne	.L2

-- Sergei.
