loop optimizations, mark 9

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* loop optimizations, mark 9
@ 1998-05-19  8:03 Richard Henderson
       [not found] ` <87k97ingv2.fsf@cytocin.hubbe.net>
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Richard Henderson @ 1998-05-19  8:03 UTC (permalink / raw)
  To: egcs

[-- Attachment #1: Type: text/plain, Size: 733 bytes --]

I'd left off this patch set for some time, and was reminded of it
while running spec95 last night testing other optimizations.

This is version 9 of the patch; previous versions can be found in
the mailing list archives.

Highlights of this patch are:

  * Tweeks to fold() to handle some of the "interesting" output from
    loop, via expand_mult_add, when expanding the giv increments.

  * A tweek to combine_givs to prefer (r+0),(r+8) over (b+r1),(r2+0),
    as would tend to happen on hosts that support reg+reg addressing.

The result is that, on our good friend RESID, I believe the resulting
code is as good as it can possibly get without GCSE (which ought to 
elimitate two redundancies that normal CSE cannot fix up).

r~

[-- Attachment #2: d-loop-9.gz --]
[-- Type: application/x-gzip, Size: 8621 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: loop optimizations, mark 9
       [not found] ` <87k97ingv2.fsf@cytocin.hubbe.net>
@ 1998-05-19 21:17   ` Richard Henderson
  1998-05-19 21:17     ` Fredrik Hubinette
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Henderson @ 1998-05-19 21:17 UTC (permalink / raw)
  To: Fredrik Hubinette, Richard Henderson; +Cc: egcs

On Tue, May 19, 1998 at 01:30:25PM -0700, Fredrik Hubinette wrote:
> I don't keep up on all the patches and stuff, and I just joined the
> egcs list yesterday, but I've been trying to figure out why egcs
> fails to optimize this example:
> 
> If you take out -DSUBOPTIMAL, it works just fine.
> 
> ----test.c---------------8<-------------------------------------------
> 
> extern int d_flag;
> 
> struct X { short a,b; int *d; };
> 
> void Xcopy(struct X *to, struct X *from, int num, int type)
> {
> #if SUBOPTIMAL
>   if(d_flag)
>   {
>     int e;
>     for(e=0;e<num;e++)
>       if(!(type & (1<<from[e].a)))
> 	 fatal("Bla bla bla (%ld %ld %d)!\n",(long)e,(long)type,from[e].a);
>   }
> #endif
> 
>   while(--num >= 0)
>   {
>     struct X tmp;
>     tmp=*(from++);
>     *(to++)=tmp;
>     tmp.d[0]++;
>   }
> }
> 
> -----------------------8<---------------------------------------------

This is an unfortunate side effect of a rather dumb register allocator.
This will be helped a bit by something called "Live Range Splitting",
which Mike Meissner implemented and will be folded into EGCS within the
next couple of months, iirc.

In the mean time, you can do manually what LRS would have done:

  {
    struct X *tmp_to = to, *tmp_from = from;
    while(--num >= 0)
    {
      struct X tmp;
      tmp=*(tmp_from++);
      *(tmp_to++)=tmp;
      tmp.d[0]++;
    }
  }

which produces the desired optimized output.


r~

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: loop optimizations, mark 9
  1998-05-19 21:17   ` Richard Henderson
@ 1998-05-19 21:17     ` Fredrik Hubinette
  1998-05-21  2:21       ` Jeffrey A Law
  0 siblings, 1 reply; 9+ messages in thread
From: Fredrik Hubinette @ 1998-05-19 21:17 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Fredrik Hubinette, egcs

> In the mean time, you can do manually what LRS would have done:
> 
>  {
>     struct X *tmp_to = to, *tmp_from = from;
>     while(--num >= 0)
>     {
>       struct X tmp;
>       tmp=*(tmp_from++);
>       *(tmp_to++)=tmp;
>       tmp.d[0]++;
>     }
>   }

Too bad I have 40000 lines of code to go through...
It will probably be faster to wait for the LRS to be implemented.
I was also thinking, if egcs could handle split the struct copy
into word pieces, it would save one register, which would allow this
particular loop to be implemented more efficiently. (And saving a
register is always good, right, especially on x86... :)

Basically, the idea is to transform:

	struct X tmp;
	tmp=*(from++);
	*(to++)=tmp;
	tmp.d[0]++;

Into:
	int* tmp;
	tmp=((int *)from)[0];
	((int *)to)[0]=tmp;
	tmp=((int *)from)[1];
	((int *)to)[1]=tmp;
	from++;
	to++;
	tmp[0]++;

I don't know how egcs copies larger structs, but I imagine that it would
use up a whole lot of registers if you try to copy 2 Kb unless something
like this is implemented.

The question: is something like this feasible, or would it require
re-designing somthing?

	/Fredrik Hubinette

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: loop optimizations, mark 9
  1998-05-19 21:17     ` Fredrik Hubinette
@ 1998-05-21  2:21       ` Jeffrey A Law
  0 siblings, 0 replies; 9+ messages in thread
From: Jeffrey A Law @ 1998-05-21  2:21 UTC (permalink / raw)
  To: Fredrik Hubinette; +Cc: Richard Henderson, egcs

  In message < 87hg2lojxc.fsf@cytocin.hubbe.net >you write:
  > Too bad I have 40000 lines of code to go through...
  > It will probably be faster to wait for the LRS to be implemented.
I'd bet Cygnus will be able to contribute LRS code in late
1998/early 1999.  Basically it has to go to our customers first
since they're the ones that pay the bills :-)

  > I was also thinking, if egcs could handle split the struct copy
  > into word pieces, it would save one register, which would allow this
  > particular loop to be implemented more efficiently. (And saving a
  > register is always good, right, especially on x86... :)
  > 
  > Basically, the idea is to transform:
  > 
  > 	struct X tmp;
  > 	tmp=*(from++);
  > 	*(to++)=tmp;
  > 	tmp.d[0]++;
  > 
  > Into:
  > 	int* tmp;
  > 	tmp=((int *)from)[0];
  > 	((int *)to)[0]=tmp;
  > 	tmp=((int *)from)[1];
  > 	((int *)to)[1]=tmp;
  > 	from++;
  > 	to++;
  > 	tmp[0]++;
This does happen in some cases.  In addition to saving a register, it
exposes more ILP to the scheduler and the various address computations
to the CSE & loop optimizers.

Determining exactly under what circumstances this happens is tricky
as it's dependent on the size of the structure, its alignment, 
and various target macros.

jeff

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: loop optimizations, mark 9
  1998-05-21 23:17 ` Jeffrey A Law
@ 1998-05-21 20:15   ` Richard Henderson
  0 siblings, 0 replies; 9+ messages in thread
From: Richard Henderson @ 1998-05-21 20:15 UTC (permalink / raw)
  To: law, Richard Henderson; +Cc: egcs

> The x86 doesn't bootstrap with these changes.  Unfortunately c-torture
> doesn't show any regressions with these changes, so we don't have a
> simple testcase handy :(

That's a shame.  Something to look into.

> I thought we had addressed the issue which originally led to your
> disabling of the 2nd loop optimization pass?  Did I mis-understand
> something the last time we discussed these patches?

No, we'd never addressed that.

> Performance on spec92 on the x86 was kind-of hit or miss.

With spec95 on alpha, not all even ran successfully, though I don't
recall any performance regressions.  Though that test run also included
the new gcse bits, so it's not as clear-cut as it might have been.


r~

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: loop optimizations, mark 9
  1998-05-19  8:03 loop optimizations, mark 9 Richard Henderson
       [not found] ` <87k97ingv2.fsf@cytocin.hubbe.net>
@ 1998-05-21 23:17 ` Jeffrey A Law
  1998-05-21 20:15   ` Richard Henderson
  1998-05-23  9:15 ` Toon Moene
  2 siblings, 1 reply; 9+ messages in thread
From: Jeffrey A Law @ 1998-05-21 23:17 UTC (permalink / raw)
  To: Richard Henderson; +Cc: egcs

  In message < 19980519011620.A20650@dot.cygnus.com >you write:
  > 
  > --LQksG6bCIzRHxTLp
  > Content-Type: text/plain; charset=us-ascii
  > 
  > I'd left off this patch set for some time, and was reminded of it
  > while running spec95 last night testing other optimizations.
  > 
  > This is version 9 of the patch; previous versions can be found in
  > the mailing list archives.
  > 
  > Highlights of this patch are:
  > 
  >   * Tweeks to fold() to handle some of the "interesting" output from
  >     loop, via expand_mult_add, when expanding the giv increments.
  > 
  >   * A tweek to combine_givs to prefer (r+0),(r+8) over (b+r1),(r2+0),
  >     as would tend to happen on hosts that support reg+reg addressing.
  > 
  > The result is that, on our good friend RESID, I believe the resulting
  > code is as good as it can possibly get without GCSE (which ought to 
  > elimitate two redundancies that normal CSE cannot fix up).
I poked around with these a little last night.

The x86 doesn't bootstrap with these changes.  Unfortunately c-torture
doesn't show any regressions with these changes, so we don't have a
simple testcase handy :(

I thought we had addressed the issue which originally led to your
disabling of the 2nd loop optimization pass?  Did I mis-understand
something the last time we discussed these patches?

Performance on spec92 on the x86 was kind-of hit or miss.  Some got
better, some got worse.  I didn't try without the toplev change to
see if any of the regressions were due to not rerunning the loop
optimizer.

jeff

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: loop optimizations, mark 9
  1998-05-19  8:03 loop optimizations, mark 9 Richard Henderson
       [not found] ` <87k97ingv2.fsf@cytocin.hubbe.net>
  1998-05-21 23:17 ` Jeffrey A Law
@ 1998-05-23  9:15 ` Toon Moene
  1998-05-24 13:19   ` Toon Moene
  2 siblings, 1 reply; 9+ messages in thread
From: Toon Moene @ 1998-05-23  9:15 UTC (permalink / raw)
  To: Richard Henderson; +Cc: egcs

Richard,

>  I'd left off this patch set for some time, and was
>  reminded of it while running spec95 last night testing
>  other optimizations.
>
>  This is version 9 of the patch; previous versions can be
>  found in the mailing list archives.

I'm now running a snapshot CVS'd 199805222000UTC with your mark 9  
patches applied.

Indeed, it does The Right Thing for resid:

% /usr/snp/bin/g77 -O2 -S resid.f
% cat resid.s
[ ... ]
L9:
        fmoved a0@,fp0
        faddd a0@(-16),fp0
        faddd a1@+,fp0
        faddd a2@+,fp0
        fmoved a0@(-8),fp1
        fmulx fp3,fp1
        fsubx fp1,fp0
        fmulx fp2,fp0
        fnegx fp0,fp0
        faddd a3@+,fp0
        fmoved fp0,a4@+
        addql #8,a0
        dbra d1,L9
[ ... ]

[ Note how it uses 5 different address registers, in accordance
  with the number of array accesses that are a non-compile-time-
  constant number of bytes apart ]

But for simpler codes:

% cat average.f
      subroutine average(x, y, n)
      integer i, n
      real x(n), y(n)
      y(1) = x(1)
      y(n) = x(n)
      do i = 2, n-1
         y(i) = ( x(i-1)+x(i)+x(i+1) ) / 3.0
      enddo
      end

% /usr/snp/bin/g77 -O2 -S average.f
% cat average.s
[ ... ]
L5:
        fmoves a2@(a0:l),fp0
        fadds a2@(d1:l),fp0
        fadds a2@(a1:l),fp0
        fsgldivx fp1,fp0
        fmoves fp0,a3@(d1:l)
        addql #4,a1
        addql #4,a0
        addql #4,d1
        dbra d0,L5
[ ... ]

which certainly doesn't show the same shrewdness.

Perhaps you've been tuning your code too much to the `resid' case :-(

HTH,
Toon.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: loop optimizations, mark 9
  1998-05-24 13:19   ` Toon Moene
@ 1998-05-24 13:19     ` Richard Henderson
  0 siblings, 0 replies; 9+ messages in thread
From: Richard Henderson @ 1998-05-24 13:19 UTC (permalink / raw)
  To: Toon Moene, Richard Henderson; +Cc: egcs

On Sun, May 24, 1998 at 10:06:03PM +0200, Toon Moene wrote:
> 1. I use the compile time option -fforce-addr, because that'll
>    assure me of the optimal use of John Carr's alias analysis
>    (private communication, May '96).
>    However, when I compile `resid' with -O2 -fforce-addr, I clearly
>    see that your stuff isn't effective (when comparing with -O2).

Hmm, ok, obviously I need to check out m68k and see where things
are going off.


r~

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: loop optimizations, mark 9
  1998-05-23  9:15 ` Toon Moene
@ 1998-05-24 13:19   ` Toon Moene
  1998-05-24 13:19     ` Richard Henderson
  0 siblings, 1 reply; 9+ messages in thread
From: Toon Moene @ 1998-05-24 13:19 UTC (permalink / raw)
  To: Richard Henderson; +Cc: egcs

Yesterday, I wrote:

>  I'm now running a snapshot CVS'd 199805222000UTC with
>  your mark 9 patches applied.

[ It bootstrapped successfully ]

I actually ran a complete HIRLAM run with it:

0SUPOBS TOOK :       19.85659790039
0DATACH TOOK :      9907.8935546875		*)
0ANAEVA TOOK :     11532.0419921875
0GRPEVA TOOK :     22830.1308593750
0HUMSUP TOOK :        1.65625000000
0DATACH TOOK :       561.8164062500
0HUMEVA TOOK :       289.7539062500
0GRPEVA TOOK :       385.2695312500
 PREPARATIONS TOOK         50.7206 SECONDS
 FORECAST TOOK        289.4115 SECONDS
 PREPARATIONS TOOK         54.0532 SECONDS
 FORECAST TOOK      32278.8711 SECONDS

Contrast with (snapshot of 199805210900UTC):

0SUPOBS TOOK :       19.66827392578
0DATACH TOOK :      9275.4941406250		*)
0ANAEVA TOOK :     11315.5214843750
0GRPEVA TOOK :     23142.7968750000
0HUMSUP TOOK :        1.60937500000
0DATACH TOOK :       581.2851562500
0HUMEVA TOOK :       289.7539062500
0GRPEVA TOOK :       385.1562500000
 PREPARATIONS TOOK         48.5590 SECONDS
 FORECAST TOOK        278.5487 SECONDS
 PREPARATIONS TOOK         53.5047 SECONDS
 FORECAST TOOK      32280.9922 SECONDS

Note the timing with *)  Definitely slower with your changes.

Basically, there are two things that are not well, at least on  
m68k-next-nextstep3:

1. I use the compile time option -fforce-addr, because that'll
   assure me of the optimal use of John Carr's alias analysis
   (private communication, May '96).
   However, when I compile `resid' with -O2 -fforce-addr, I clearly
   see that your stuff isn't effective (when comparing with -O2).

2. Rerunning loop optimisation still seems to be useful for my code;
   however, this also could be a result of (1.)

HTH,
Toon.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~1998-05-24 13:19 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1998-05-19  8:03 loop optimizations, mark 9 Richard Henderson
     [not found] ` <87k97ingv2.fsf@cytocin.hubbe.net>
1998-05-19 21:17   ` Richard Henderson
1998-05-19 21:17     ` Fredrik Hubinette
1998-05-21  2:21       ` Jeffrey A Law
1998-05-21 23:17 ` Jeffrey A Law
1998-05-21 20:15   ` Richard Henderson
1998-05-23  9:15 ` Toon Moene
1998-05-24 13:19   ` Toon Moene
1998-05-24 13:19     ` Richard Henderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).