RE: Re: better load/store scheduling

public inbox for gcc-help@gcc.gnu.org

* RE:  Re: better load/store scheduling
@ 2007-03-01 21:08 Ben Cheng
  2007-03-01 22:12 ` Vladimir Makarov
  0 siblings, 1 reply; 3+ messages in thread
From: Ben Cheng @ 2007-03-01 21:08 UTC
  To: gcc-help

Well, I guess the real question is how to make gcc schedule better code
if loop unrolling is enabled?

My original code is actually 

    for (i = 0; i < 4096; i++) {
        g[i]   = h[i] + 10;
    }

After gcc unrolls the loop, the bodies of different iterations do not
overlap, because the loads from later iterations are not scheduled
across the earlier stores. I thought this might be a phase-ordering
issue between optimization passes, so I unrolled the loop by hand.
Unfortunately, I still cannot get gcc to schedule the loads and stores
more aggressively.

Since I want gcc to unroll the loop for me, I cannot create temporaries
for h[i]. Therefore I am still hoping for some magic command line
options to make gcc produce better scheduling.

Thanks,
-Ben


* Re: better load/store scheduling
  2007-03-01 21:08 Re: better load/store scheduling Ben Cheng
@ 2007-03-01 22:12 ` Vladimir Makarov
  0 siblings, 0 replies; 3+ messages in thread
From: Vladimir Makarov @ 2007-03-01 22:12 UTC
  To: Ben Cheng; +Cc: gcc-help

Ben Cheng wrote:

>Well, I guess the real question is how to make gcc schedule better code
>if loop unrolling is enabled?
>
>My original code is actually 
>
>    for (i = 0; i < 4096; i++) {
>        g[i]   = h[i] + 10;
>    }
>
>After gcc unrolls the loop, the bodies of different iterations do not
>overlap, because the loads from later iterations are not scheduled
>across the earlier stores. I thought this might be a phase-ordering
>issue between optimization passes, so I unrolled the loop by hand.
>Unfortunately, I still cannot get gcc to schedule the loads and stores
>more aggressively.
>
>Since I want gcc to unroll the loop for me, I cannot create temporaries
>for h[i]. Therefore I am still hoping for some magic command line
>options to make gcc produce better scheduling.
>
>  
>
There is no such magic option.  The problem is not in the scheduler 
itself.  Loads can be moved across earlier stores only when/if we have 
more accurate aliasing info at the RTL level.
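
To illustrate why the aliasing info matters, here is a minimal sketch. It
is not taken from GCC, and the names add10_in_order and add10_hoisted are
made up for this example. The pointers are deliberately not marked
__restrict, so g and h may overlap. In that case hoisting the loads above
the stores changes the result, which is why a scheduler may do the hoist
only when it can prove that the accesses do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Program order: the load of h[i+1] happens after the store to g[i]. */
void add10_in_order(int *g, const int *h, size_t n)
{
    assert(n % 2 == 0);            /* sketch assumes an even trip count */
    for (size_t i = 0; i < n; i += 2) {
        g[i]     = h[i] + 10;
        g[i + 1] = h[i + 1] + 10;
    }
}

/* Both loads hoisted above both stores (the schedule asked for). */
void add10_hoisted(int *g, const int *h, size_t n)
{
    assert(n % 2 == 0);            /* sketch assumes an even trip count */
    for (size_t i = 0; i < n; i += 2) {
        int a = h[i];
        int b = h[i + 1];
        g[i]     = a + 10;
        g[i + 1] = b + 10;
    }
}
```

With int buf[5] = {0}, g = buf + 1, h = buf and n = 4, the in-order
version leaves buf as {0, 10, 20, 30, 40}, while the hoisted version
leaves {0, 10, 10, 20, 10}. The two agree only when g and h are disjoint.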

Another problem is that even if we had more accurate alias analysis, it 
might still be impossible to move loads/stores after register 
allocation (RA) has run.  Insn scheduling before RA is switched off for 
x86 and x86_64 because of a bug that eventually shows up in reload, when 
reload cannot find a hard register for an insn operand.  To get rid of 
this bug, the insn scheduler would have to be register-pressure sensitive.

Also, software pipelining is a better fit for this loop.  You can try 
-fmodulo-sched and see what happens.  It might work.
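
As a rough picture of what software pipelining means for this loop, here
is a hand-written sketch at the source level. It is an illustration only,
not what -fmodulo-sched emits; the name add10_pipelined and the n
parameter are assumptions made for the example. The load for iteration
i+1 is issued before the store of iteration i, with a prologue for the
first load and an epilogue for the last store.

```c
#include <assert.h>
#include <stddef.h>

/* g[i] = h[i] + 10, with the next load overlapped with the current store. */
void add10_pipelined(int * __restrict g, const int * __restrict h, size_t n)
{
    assert(g != NULL && h != NULL);
    if (n == 0)
        return;

    int cur = h[0];                    /* prologue: first load */
    for (size_t i = 0; i + 1 < n; i++) {
        int next = h[i + 1];           /* load for iteration i+1 ... */
        g[i] = cur + 10;               /* ... placed before the store of iteration i */
        cur = next;
    }
    g[n - 1] = cur + 10;               /* epilogue: last store */
}
```

Writing the source this way does not force the compiler to keep this
order in the generated code; it only shows the overlap that a modulo
scheduler would try to build.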



[parent not found: <96CDC40E4321F84FA0FB83A1EF2A422864B93F@Hermes.shaktisystems.com>]

* Re: better load/store scheduling
       [not found] <96CDC40E4321F84FA0FB83A1EF2A422864B93F@Hermes.shaktisystems.com>
@ 2007-03-01 11:22 ` Sergei Organov
  0 siblings, 0 replies; 3+ messages in thread
From: Sergei Organov @ 2007-03-01 11:22 UTC
  To: gcc-help

"Ben Cheng" <bccheng@peakstreaminc.com> writes:
> I am trying to tune the performance of hand-unrolled code. I was
> wondering what cmd-line options should I specify in order to get h[i+1]
> loaded before the store to g[i]:
>
>
> Code:
>
> void foo(int * __restrict g, int * __restrict h)
> {
>     int i;
>     for (i = 0; i < 4096; i+=2) {
>         g[i]   = h[i] + 10;
>         g[i+1] = h[i+1] + 10;
>     }
> }

Use temporaries:

void foo(int * __restrict g, int * __restrict h)
{
    int i;
    for (i = 0; i < 4096; i+=2) {
        int a = h[i];
        int b = h[i+1];
        g[i]   = a + 10;
        g[i+1] = b + 10;
    }
}

>
> Command line:
>
> gcc-4.0.2 -O3 loop.c -fargument-noalias-global -fstrict-aliasing -S
> loop.s
>
> Assembly code of the loop body:
>
> .L2:
>         leal    0(,%ebx,4), %eax
>         leal    (%eax,%esi), %ecx
>         leal    (%edi,%eax), %eax
>         movl    -8(%ecx), %edx                  // = h[i]
>         addl    $10, %edx                       // + 10
>         movl    %edx, -8(%eax)                  // g[i] = 
>         movl    -4(%ecx), %edx                  // = h[i+1]
>         addl    $10, %edx                       // + 10
>         movl    %edx, -4(%eax)                  // g[i+1] =
>         addl    $2, %ebx
>         cmpl    $4098, %ebx
>         jne     .L2

With gcc 4.0.4, it gives:

.L2:
	leal	0(,%ebx,4), %edx
	addl	$2, %ebx
	leal	(%esi,%edx), %eax
	addl	%edi, %edx
	movl	-4(%eax), %ecx
	movl	-8(%eax), %eax
	addl	$10, %ecx
	addl	$10, %eax
	cmpl	$4098, %ebx
	movl	%eax, -8(%edx)
	movl	%ecx, -4(%edx)
	jne	.L2

With gcc 4.1.2, it gives:

.L2:
	movl	-4(%ebx,%ecx,4), %eax
	movl	-8(%ebx,%ecx,4), %edx
	addl	$10, %eax
	addl	$10, %edx
	movl	%edx, -8(%esi,%ecx,4)
	movl	%eax, -4(%esi,%ecx,4)
	addl	$2, %ecx
	cmpl	$4098, %ecx
	jne	.L2

-- Sergei.

