Optimizations

public inbox for gcc@gcc.gnu.org

* Optimizations
@ 1997-12-09  9:52 David M. Ronis
  1997-12-09 11:19 ` Optimizations Jeffrey A Law
  1997-12-10 10:46 ` [EGCS] Optimizations Marc Lehmann
  0 siblings, 2 replies; 27+ messages in thread
From: David M. Ronis @ 1997-12-09  9:52 UTC (permalink / raw)
  To: egcs

There's been some discussion of egcs vs gcc vs MSC benchmark results
on comp.os.linux.development.apps recently, much of which ends up with
various suggestions of what compiler flags should be specified.  For
example, one poster suggests:

gcc -O6 -mpentium -fomit-frame-pointer -fexpensive-optimizations \
    -ffast-math

To that, add:
        -march=pentium
        -fschedule-insns
        -fschedule-insns2
        -fregmove
        -fdelayed-branch

According to the gcc info description, all the -f options are enabled
(if supported) when -O2 (and I presume -O6) is specified.  Is this
correct?  

To the above I normally add the following:

-malign-double -malign-loops=0 -malign-jumps=0 -malign-functions=0 \
-mno-ieee-fp

Are the -malign directives implied by -march=pentium? (They probably
should be, and in either case, this should be described in the info
pages.)

Is -mno-ieee-fp implied by -ffast-math?

David Ronis

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Optimizations
  1997-12-09  9:52 Optimizations David M. Ronis
@ 1997-12-09 11:19 ` Jeffrey A Law
  1997-12-10 10:46 ` [EGCS] Optimizations Marc Lehmann
  1 sibling, 0 replies; 27+ messages in thread
From: Jeffrey A Law @ 1997-12-09 11:19 UTC (permalink / raw)
  To: ronis; +Cc: egcs

  In message < 199712091751.MAA18879@ronispc.chem.mcgill.ca >you write:
  > gcc -O6 -mpentium -fomit-frame-pointer -fexpensive-optimizations \
  >     -ffast-math
-fexpensive-optimizations is on automatically if you specify -O2 or
higher.

Similarly for -fregmove.

-fdelayed-branch does nothing on x86 machines since they do not have
delayed branches.

-fschedule-insns* are normally enabled for -O2 or higher.

  > Is -mno-ieee-fp implied by -ffast-math?
No.  -ffast-math is a machine independent option while -mno-ieee-fp is
an x86 specific option.  They are separate.

jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [EGCS] Optimizations
  1997-12-09  9:52 Optimizations David M. Ronis
  1997-12-09 11:19 ` Optimizations Jeffrey A Law
@ 1997-12-10 10:46 ` Marc Lehmann
  1997-12-14  5:39   ` Philipp Thomas
  1 sibling, 1 reply; 27+ messages in thread
From: Marc Lehmann @ 1997-12-10 10:46 UTC (permalink / raw)
  To: ronis; +Cc: egcs

> gcc -O6 -mpentium -fomit-frame-pointer -fexpensive-optimizations \
>     -ffast-math
> 
> To that, add:
>         -march=pentium
>         -fschedule-insns
>         -fschedule-insns2
>         -fregmove
>         -fdelayed-branch
> 
> According to the gcc info description, all the -f options are enabled
> (if supported) when -O2 (and I presume -O6) is specified.  Is this
> correct?  

no, you misread the info files. almost everything
except -finline-functions and -fomit-frame-pointer is enabled at -O2,
-O3 enables -finline-functions.

-fschedule-insns is a *loss* on x86 cpu's!

> -malign-double -malign-loops=0 -malign-jumps=0 -malign-functions=0\
> -mno-ieee-fp 
> 
> Are the -malign directives implied by -march=pentium? (they probably

on pgcc, yes; on egcs, no. -mno-ieee-fp is a bit buggy.

the -mpentium should be selected automatically when you compile
for i586-*-*, but I'm not exactly sure here.

> Is -mno-ieee-fp implied by -ffast-math?

no.

      -----==-                                              |
      ----==-- _                                            |
      ---==---(_)__  __ ____  __       Marc Lehmann       +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com       |e|
      -=====/_/_//_/\_,_/ /_/\_\                          --+
    The choice of a GNU generation                        |
                                                          |

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [EGCS] Optimizations
  1997-12-10 10:46 ` [EGCS] Optimizations Marc Lehmann
@ 1997-12-14  5:39   ` Philipp Thomas
  1997-12-14 15:14     ` Optimizations Marc Lehmann
  0 siblings, 1 reply; 27+ messages in thread
From: Philipp Thomas @ 1997-12-14  5:39 UTC (permalink / raw)
  To: egcs

Marc Lehmann wrote:
> -fschedule-insns is a *loss* on x86 cpu's!

care to explain why it is a loss (and most probably also -fschedule-insns2)
?


-- 
Philipp

************************************************************
 If builders would build houses like programmers did       
 their programs, the first woodpecker to come along would
 mean the end of all civilization

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Optimizations
  1997-12-14  5:39   ` Philipp Thomas
@ 1997-12-14 15:14     ` Marc Lehmann
  1997-12-14 20:14       ` Optimizations Jeffrey A Law
  0 siblings, 1 reply; 27+ messages in thread
From: Marc Lehmann @ 1997-12-14 15:14 UTC (permalink / raw)
  To: egcs; +Cc: kthomas

On Sun, Dec 14, 1997 at 02:34:36PM +0100, Philipp Thomas wrote:
> Marc Lehmann wrote:
> > -fschedule-insns is a *loss* on x86 cpu's!
> 
> care to explain why it is a loss (and most probably also -fschedule-insns2)
> ?

AFAIR -fschedule-insns (as opposed to -fschedule-insns2) is normally a loss
since the first scheduling pass is done before register allocation, so the
register pressure increases and local/global alloc gets into trouble. (for
fpu code it _could_ be beneficial, though).

the second scheduling pass is done after register allocation, so no
"new" hardware registers are created.

      -----==-                                              |
      ----==-- _                                            |
      ---==---(_)__  __ ____  __       Marc Lehmann       +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com       |e|
      -=====/_/_//_/\_,_/ /_/\_\                          --+
    The choice of a GNU generation                        |
                                                          |

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Optimizations
  1997-12-14 15:14     ` Optimizations Marc Lehmann
@ 1997-12-14 20:14       ` Jeffrey A Law
  0 siblings, 0 replies; 27+ messages in thread
From: Jeffrey A Law @ 1997-12-14 20:14 UTC (permalink / raw)
  To: egcs; +Cc: kthomas

  In message < 19971215000809.60319@cerebro.laendle >you write:
  > On Sun, Dec 14, 1997 at 02:34:36PM +0100, Philipp Thomas wrote:
  > > Marc Lehmann wrote:
  > > > -fschedule-insns is a *loss* on x86 cpu's!
  > > 
  > > care to explain why it is a loss (and most probably also -fschedule-insns2)
  > > ?
  > 
  > AFAIR -fschedule-insns (as opposed to -fschedule-insns2) is normally a loss
  > since the first scheduling pass is done before register allocation, so the
  > register pressure increases and local/global alloc gets into trouble. (for
  > fpu code it _could_ be beneficial, though).
To be more correct it may be a loss for machines with a limited number of
registers (such as the x86).  On machines with a generous number of registers
-fschedule-insns is generally a win.

jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [EGCS] Optimizations
@ 1997-12-14 14:30 meissner
  1997-12-15  5:38 ` Optimizations Marc Lehmann
       [not found] ` <19971216000653.24186.cygnus.egcs@cerebro.laendle>
  0 siblings, 2 replies; 27+ messages in thread
From: meissner @ 1997-12-14 14:30 UTC (permalink / raw)
  To: egcs

| Marc Lehmann wrote:
| > -fschedule-insns is a *loss* on x86 cpu's!
| 
| care to explain why it is a loss (and most probably also -fschedule-insns2)

The problem is that -fschedule-insns, -funroll-{,all-}loops, and
-fstrength-reduce all tend to work by creating more registers to hold
intermediate results (in compiler speak this is known as register pressure).
Obviously, -fschedule-insns2 doesn't suffer from this problem, since it only
schedules things after register allocation has been done (and thus on a machine
that has plenty of registers has little effect, other than to move spills
around).
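
To make the register-pressure point concrete, here is a small C sketch. It
is not from the thread, and the function name dot8 is made up for
illustration. A scheduler running before register allocation may move the
eight loads and multiplies ahead of the additions to hide latency, which
would keep t0..t7 live at the same time. 32-bit x86 has only eight
general-purpose registers, some of them tied up as stack and frame pointers,
so the allocator would then probably have to spill. Whether a given compiler
version really schedules it this way is an assumption.

```c
/* Hypothetical kernel, for illustration only: eight independent products
   are formed, then summed.  If all eight are computed before the first
   addition, eight temporaries (plus the pointers a and b) are live at once. */
int dot8(const int *a, const int *b)
{
    int t0 = a[0] * b[0];
    int t1 = a[1] * b[1];
    int t2 = a[2] * b[2];
    int t3 = a[3] * b[3];
    int t4 = a[4] * b[4];
    int t5 = a[5] * b[5];
    int t6 = a[6] * b[6];
    int t7 = a[7] * b[7];

    /* Balanced sum: no addition depends on more than two earlier results. */
    return ((t0 + t1) + (t2 + t3)) + ((t4 + t5) + (t6 + t7));
}
```

Compiling this once with and once without -fschedule-insns and comparing
the assembler output (gcc -S) should show whether extra spills appear on the
target in question.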

--
Michael Meissner, Cygnus Solutions (Massachusetts office)
4th floor, 955 Massachusetts Avenue, Cambridge, MA 02139, USA
meissner@cygnus.com,	617-354-5416 (office),	617-354-7161 (fax)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Optimizations
  1997-12-14 14:30 [EGCS] Optimizations meissner
@ 1997-12-15  5:38 ` Marc Lehmann
  1997-12-15 11:29   ` Optimizations Dave Love
       [not found] ` <19971216000653.24186.cygnus.egcs@cerebro.laendle>
  1 sibling, 1 reply; 27+ messages in thread
From: Marc Lehmann @ 1997-12-15  5:38 UTC (permalink / raw)
  To: egcs

On Sun, Dec 14, 1997 at 05:30:48PM -0500, meissner@cygnus.com wrote:
> | Marc Lehmann wrote:
> | > -fschedule-insns is a *loss* on x86 cpu's!
> | 
> | care to explain why it is a loss (and most probably also -fschedule-insns2)
> 
> The problem is that -fschedule-insns, -funroll-{,all-}loops, and
> -fstrength-reduce all tend to work by creating more registers to hold

The really interesting point is that -fschedule-insns is generally a loss
on x86, while -funroll-all-loops is generally a win! (even more
so than -funroll-loops)

I guess loop unrolling should be more clever, i.e. while it should
unroll loops without a constant number of iterations, it shouldn't
unroll all of them.

      -----==-                                              |
      ----==-- _                                            |
      ---==---(_)__  __ ____  __       Marc Lehmann       +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com       |e|
      -=====/_/_//_/\_,_/ /_/\_\                          --+
    The choice of a GNU generation                        |
                                                          |

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Optimizations
  1997-12-15  5:38 ` Optimizations Marc Lehmann
@ 1997-12-15 11:29   ` Dave Love
  1997-12-15 15:43     ` Optimizations Marc Lehmann
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Love @ 1997-12-15 11:29 UTC (permalink / raw)
  To: egcs

>>>>> "Marc" == Marc Lehmann <pcg@goof.com> writes:

 Marc> on x86, while -funroll-all-loops is generally a win! (even more
 Marc> so than -funroll-loops)

Could someone explain exactly what is the difference between the sort
of loops unrolled by -funroll-all-loops and -funroll-loops?  The doc
sentence isn't clear to me and ISTR that grovelling unroll.c wasn't
immediately enlightening.  Is it something like for(;;) versus
for(i=1;i<=n;i++)?
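
As an illustration of the distinction being asked about, going by the
manual's wording: -funroll-loops covers loops whose iteration count can be
determined at compile time or upon entry to the loop, while
-funroll-all-loops also unrolls loops whose count is uncertain on entry.
The sketch below shows one loop of each kind. The names sum_array, sum_list
and struct node are made up for the example, and whether a particular
compiler version actually unrolls either loop is not something this shows.

```c
#include <stddef.h>

/* Hypothetical list type, only here to get a loop with no known trip count. */
struct node {
    int value;
    struct node *next;
};

/* Trip count (n) is known when the loop is entered:
   the kind of loop -funroll-loops is documented to handle. */
int sum_array(const int *v, int n)
{
    int i;
    int s = 0;

    for (i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Trip count depends on the data and is unknown on entry:
   only -funroll-all-loops is documented to unroll this one. */
int sum_list(const struct node *p)
{
    int s = 0;

    while (p != NULL) {
        s += p->value;
        p = p->next;
    }
    return s;
}
```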

BTW, in case the original question was Fortran-related, at least for
Fortran with the default (non-)aliasing model, -fforce-addr can be a
win on x86 as Toon has pointed out.  BTW2, the suggestions for [56]86
options with -malign-...=2 are propagated -- bother -- in the g77
manual, AFAIR after the Linux GCC-HOWTO.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Optimizations
  1997-12-15 11:29   ` Optimizations Dave Love
@ 1997-12-15 15:43     ` Marc Lehmann
  0 siblings, 0 replies; 27+ messages in thread
From: Marc Lehmann @ 1997-12-15 15:43 UTC (permalink / raw)
  To: egcs

> fortran with the default (non-)aliasing model, -fforce-addr can be a
> win on x86 as Toon has pointed out.  BTW2, the suggestions for [56]86
> options with -malign-...=2 are propagated -- bother -- in the g77

could someone check this? I'm pretty sure that (on pentiums!),
4 byte alignment is worse than either zero or 16 byte alignment,
and there is no point in wasting space when zero alignment is at
least equally good.

I never saw benchmark data (except mine) that said 4 byte is better than
zero byte alignment (and intel itself recommends 0 on pentiums).

      -----==-                                              |
      ----==-- _                                            |
      ---==---(_)__  __ ____  __       Marc Lehmann       +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com       |e|
      -=====/_/_//_/\_,_/ /_/\_\                          --+
    The choice of a GNU generation                        |
                                                          |

^ permalink raw reply	[flat|nested] 27+ messages in thread

[parent not found: <19971216000653.24186.cygnus.egcs@cerebro.laendle>]

* Re: Optimizations
       [not found] ` <19971216000653.24186.cygnus.egcs@cerebro.laendle>
@ 1997-12-23  7:51   ` Stan Cox
  0 siblings, 0 replies; 27+ messages in thread
From: Stan Cox @ 1997-12-23  7:51 UTC (permalink / raw)
  To: egcs

>I never saw benchmark data (except mine) that said 4 byte is better than
>zero byte alignment (and intel itself recommends 0 on pentiums).

gas and the svr4 assembler have pseudoops that say align to X bytes if it takes no
more than Y bytes to do so.  We used this in the DG/UX configuration
after doing quite a bit of benchmark analysis.
-- 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Optimizations
@ 2000-03-10  1:46 Virgil Palanciuc
  0 siblings, 0 replies; 27+ messages in thread
From: Virgil Palanciuc @ 2000-03-10  1:46 UTC (permalink / raw)
  To: mrs, gcc

>Well, optimizing gcc's memory allocation should be an interesting
>project.  Certainly we do little in this area, so any improvement you
>could contribute would be great.  Also, with the advent of 200+ cycle
>memory misses, it might be fairly profitable.  As for register
>allocation, I would hope that it would be hard to improve gcc's
>scheme, but the right person with the right algorithm...
   My project is specifically for the SC100 family. I never dreamed of being
able to do any general optimizations (I thought people have been trying to
optimize gcc for too long for me to be able to bring something new in only
two months). However, if you think optimizing gcc's memory allocation is
possible (in AT MOST 2 months), write me some details (frankly, I don't know
how gcc does the memory allocation - my approach was to write a new phase
that discards whatever memory allocation gcc did so far).

> > schemes for a specific architecture (Motorola's SC100 family).
>Our code isn't specific to any processor and any code you submit that
>we included, would have to be fairly machine independent,or at least,
>it would have to not harm performance on other machines.
    I already explained: I didn't think I could make machine-independent
optimizations. Of course, I would be glad to be wrong in this matter.

                            Virgil.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* optimizations
@ 2003-01-14 22:58 Reza Roboubi
  2003-01-15  0:15 ` optimizations Andrew Pinski
  0 siblings, 1 reply; 27+ messages in thread
From: Reza Roboubi @ 2003-01-14 22:58 UTC (permalink / raw)
  To: gcc-help, gcc

In the following code, it is clear that the return value of mm() can be
eliminated.  In fact, many optimizations are possible here.  Yet gcc seems not
to be able to do these optimizations.  Below, I posted the assembly code that
gcc generated (for the while() loop).

I compiled this code with gcc -O2 -Wall.  

I was wondering if I am doing something wrong.  If not, then please comment on
current gcc developments in this regard, and what it takes to add some of these
features.

Please also comment on how other compilers would compare with gcc in this case.

Are there any non-obvious remedies you have for this case?

PS: Please tell me if I must report this as a gcc bug.

Thanks in advance for any help you provide.

inline int mm(int *i)
{
        if((*i)==0x10){return 0;}
        (*i)++;return 1;
}

int  
main(){  

	int k=0;
	while (mm(&k)){}
	write(1,&k,1);

	return 0;
}  

Associated assembly code for the while() loop:

0x80483b0 <main+16>:	mov    0xfffffffc(%ebp),%eax
0x80483b3 <main+19>:	xor    %edx,%edx
0x80483b5 <main+21>:	cmp    $0x10,%eax
0x80483b8 <main+24>:	je     0x80483c3 <main+35>
0x80483ba <main+26>:	inc    %eax
0x80483bb <main+27>:	mov    $0x1,%edx
0x80483c0 <main+32>:	mov    %eax,0xfffffffc(%ebp)
0x80483c3 <main+35>:	test   %edx,%edx
0x80483c5 <main+37>:	jne    0x80483b0 <main+16>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: optimizations
  2003-01-14 22:58 optimizations Reza Roboubi
@ 2003-01-15  0:15 ` Andrew Pinski
  2003-01-15  5:10   ` optimizations Reza Roboubi
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Pinski @ 2003-01-15  0:15 UTC (permalink / raw)
  To: Reza Roboubi; +Cc: gcc-help, gcc

What version of gcc?
This seems like it was fixed at one point during the 3.x series because 
it does not happen with 3.3 (prerelease) or 3.4 (experimental).

Thanks,
Andrew Pinski

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: optimizations
  2003-01-15  0:15 ` optimizations Andrew Pinski
@ 2003-01-15  5:10   ` Reza Roboubi
  2003-01-15  6:31     ` optimizations Reza Roboubi
  0 siblings, 1 reply; 27+ messages in thread
From: Reza Roboubi @ 2003-01-15  5:10 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: gcc-help, gcc

Andrew Pinski wrote:
> 
> What version of gcc?
> This seems like it was fixed at one point during the 3.x series because
> it does not happen with 3.3 (prerelease) or 3.4 (experimental).

Ah. I am using gcc version 3.2. It's very good if this has been fixed under 3.3
and 3.4. 

I would still appreciate any comments regarding the status of these
optimizations. Are these new features, or are they old ones that temporarily got
broken during gcc 3.2?

What part of the gcc source tree deals with these optimizations?

Thanks again.

-- 
Reza Roboubi
IT Solution Provider: 
   Software Development; Data Servers; Embedded Devices.
www.linisoft.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: optimizations
  2003-01-15  5:10   ` optimizations Reza Roboubi
@ 2003-01-15  6:31     ` Reza Roboubi
  2003-01-15 17:37       ` optimizations Andrew Pinski
  0 siblings, 1 reply; 27+ messages in thread
From: Reza Roboubi @ 2003-01-15  6:31 UTC (permalink / raw)
  To: Andrew Pinski, gcc-help, gcc

Reza Roboubi wrote:
> 
> Andrew Pinski wrote:
> >
> > What version of gcc?
> > This seems like it was fixed at one point during the 3.x series because
> > it does not happen with 3.3 (prerelease) or 3.4 (experimental).
> 
> Ah. I am using gcc version 3.2. It's very good if this has been fixed under 3.3
> and 3.4.
> 
> I would still appreciate any comments regarding the status of these
> optimizations. Are these new features, or are they old ones that temporarily got
> broken during gcc 3.2?
> 
> What part of the gcc source tree deals with these optimizations?

Could you please also tell me if 3.3 and 3.4 remove the extra mov's in and out
of %eax. Ideally, there should be no more than 4 instructions in the critical
loop.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: optimizations
  2003-01-15  6:31     ` optimizations Reza Roboubi
@ 2003-01-15 17:37       ` Andrew Pinski
  2003-01-15 17:46         ` optimizations Reza Roboubi
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Pinski @ 2003-01-15 17:37 UTC (permalink / raw)
  To: Reza Roboubi; +Cc: Andrew Pinski, gcc-help, gcc


On Tuesday, Jan 14, 2003, at 16:49 US/Pacific, Reza Roboubi wrote:

> Could you please also tell me if 3.3 and 3.4 remove the extra mov's in 
> and out
> of %eax. Ideally, there should be no more than 4 instructions in the 
> critical
> loop.
>

For some reason it is not (even with -fnew-ra), but on PPC there is no 
extra load/store.

Thanks,
Andrew Pinski


PS here is the asm for the loop on i[3-6]86, pentium4:

.L2:
         movl    -4(%ebp), %eax  <== still does the load
         cmpl    $16, %eax
         je      .L7
         incl    %eax
         movl    %eax, -4(%ebp) <== and store
         jmp     .L2
.L7:

I do not have access to the machine with 3.{3,4} on PPC right now.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: optimizations
  2003-01-15 17:37       ` optimizations Andrew Pinski
@ 2003-01-15 17:46         ` Reza Roboubi
  0 siblings, 0 replies; 27+ messages in thread
From: Reza Roboubi @ 2003-01-15 17:46 UTC (permalink / raw)
  To: Andrew Pinski; +Cc: gcc-help, gcc, gcc_bugs

Andrew Pinski wrote:
> 
> On Tuesday, Jan 14, 2003, at 16:49 US/Pacific, Reza Roboubi wrote:
> 
> > Could you please also tell me if 3.3 and 3.4 remove the extra mov's in
> > and out
> > of %eax. Ideally, there should be no more than 4 instructions in the
> > critical
> > loop.
> >
> 
> For some reason it is not (even with -fnew-ra), but on PPC there is no
> extra load/store.

Hmmm. That's interesting. It might be a bug (or overlooked opportunity) on the
PC.


> 
> Thanks,
> Andrew Pinski
> 
> PS here is the asm for the loop on i[3-6]86, pentium4:
> 
> .L2:
>          movl    -4(%ebp), %eax  <== still does the load
>          cmpl    $16, %eax
>          je      .L7
>          incl    %eax
>          movl    %eax, -4(%ebp) <== and store
>          jmp     .L2
> .L7:
> 
> I do not have access to the machine with 3.{3,4} on PPC right now.

I really appreciate your help Andrew.

Reza.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: optimizations
@ 2003-01-15 23:20 Bonzini
  2003-01-16 10:53 ` optimizations Reza Roboubi
  0 siblings, 1 reply; 27+ messages in thread
From: Bonzini @ 2003-01-15 23:20 UTC (permalink / raw)
  To: gcc, gcc-help; +Cc: reza

> > Could you please also tell me if 3.3 and 3.4 remove the extra mov's in
> > and out of %eax. Ideally, there should be no more than 4 instructions in
> > the critical loop.
>
> .L2:
> movl -4(%ebp), %eax <== still does the load
> cmpl $16, %eax
> je .L7
> incl %eax
> movl %eax, -4(%ebp) <== and store
> jmp .L2
> .L7:
>
> For some reason it is not (even with -fnew-ra), but on PPC there
> is no extra load/store.

Instruction counts do not tell the whole story; gcc is simply putting more
pressure on the decoding unit but less pressure on the execution unit (which
otherwise would execute two loads in the `taken' case).  Things might be
different if gcc is given other options like -mtune=i386.

|_  _  _ __
|_)(_)| ),'
------- '---


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: optimizations
  2003-01-15 23:20 optimizations Bonzini
@ 2003-01-16 10:53 ` Reza Roboubi
  2003-01-16 11:03   ` optimizations tm_gccmail
  2003-01-16 11:53   ` optimizations Paolo Bonzini
  0 siblings, 2 replies; 27+ messages in thread
From: Reza Roboubi @ 2003-01-16 10:53 UTC (permalink / raw)
  To: Bonzini; +Cc: gcc, gcc-help

Bonzini wrote:
> 
> > > Could you please also tell me if 3.3 and 3.4 remove the extra mov's in
> > > and out of %eax. Ideally, there should be no more than 4 instructions in
> > > the critical loop.
> >
> > .L2:
> > movl -4(%ebp), %eax <== still does the load
> > cmpl $16, %eax
> > je .L7
> > incl %eax
> > movl %eax, -4(%ebp) <== and store
> > jmp .L2
> > .L7:
> >
> > For some reason it is not (even with -fnew-ra), but on PPC there
> > is no extra load/store.
> 
> Instruction counts do not tell the whole story; gcc is simply putting more
> pressure on the decoding unit but less pressure on the execution unit (which
> otherwise would execute two loads in the `taken' case).  Things might be

Would you please elaborate on that?  I don't understand what you mean by the
"taken case."  The suggested optimization is:

CHANGE:
-------
.L2:
movl -4(%ebp), %eax <== still does the load
cmpl $16, %eax
je .L7
incl %eax
movl %eax, -4(%ebp) <== and store
jmp .L2
.L7:

TO:
-------
movl -4(%ebp), %eax
.L2:
cmpl $16, %eax
je .L7
incl %eax
jmp .L2
.L7:
movl %eax, -4(%ebp)

The mov's have moved _outside_ of the critical loop, and the register allocator
may still be able to remove the extra mov at entry to the loop.

The total number of instructions, and hence total program size will remain the
same even in the worst possible case.

Furthermore, an extra jump can be removed from the critical loop. If you
compile:
i=0;
for(;i<10;i++);
write(1,&i,4);  //make i volatile

then you will see that gcc optimizes away even this redundant jump, hence
producing only _three_ lines of code. But when a while() loop is used instead of
the equivalent for() loop that does not happen.

This seems like a crystal clear case for optimization, unless I am missing
something that you should kindly explain to  me in more detail.

Thanks, Reza.

> different if gcc is given other options like -mtune=i386.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: optimizations
  2003-01-16 10:53 ` optimizations Reza Roboubi
@ 2003-01-16 11:03   ` tm_gccmail
  2003-01-16 12:34     ` optimizations Reza Roboubi
  2003-01-16 11:53   ` optimizations Paolo Bonzini
  1 sibling, 1 reply; 27+ messages in thread
From: tm_gccmail @ 2003-01-16 11:03 UTC (permalink / raw)
  To: Reza Roboubi; +Cc: Bonzini, gcc, gcc-help

On Wed, 15 Jan 2003, Reza Roboubi wrote:

> Bonzini wrote:
> 
> CHANGE:
> -------
> .L2:
> movl -4(%ebp), %eax <== still does the load
> cmpl $16, %eax
> je .L7
> incl %eax
> movl %eax, -4(%ebp) <== and store
> jmp .L2
> .L7:
> 
> TO:
> -------
> movl -4(%ebp), %eax
> .L2:
> cmpl $16, %eax
> je .L7
> incl %eax
> jmp .L2
> .L7:
> movl %eax, -4(%ebp)

The optimization you are suggesting is called "load hoist/store sink" if I
remember correctly.

Here is the story as I remember it:

When egcs-1.0 or 1.1 was released, people noticed a large performance
drop from gcc-2.7.2. I did a little investigation, and verified a large
performance drop on Whetstone. I did a comprehensive analysis of it, and
isolated a case similar to yours where a variable in a critical loop was
entirely contained in registers in 2.7.2 but was loaded/saved from/to
memory in gcc-2.95.

I mentioned this on the gcc-bugs mailing list, and Mark Mitchell
contributed a fairly simple load hoisting improvement to the loop
optimizer which restored performance on Whetstone.

If you look at the gcc-bugs archives for 1998, you may be able to find
this message thread.

This load-hoisting optimization seems to be responsible for the hoisted
load in your testcase.  However, the corresponding store sink portion of
the optimizer has never been written, and I believe that is why the store
is not sunk out of the loop on your testcase.
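
For readers who want to see what load hoist/store sink amounts to at the
source level, here is a rough C sketch. It is hand-written for illustration
and is not how the loop optimizer is implemented. mm() is the function from
the original post (made static so the file links on its own);
count_in_memory and count_in_register are made-up names.

```c
/* mm() as posted earlier in the thread; static so it links stand-alone. */
static int mm(int *i)
{
    if (*i == 0x10)
        return 0;
    (*i)++;
    return 1;
}

/* Original shape: k's address is passed to mm(), so a compiler that does
   not hoist the load and sink the store may touch memory every iteration. */
int count_in_memory(void)
{
    int k = 0;

    while (mm(&k)) {
    }
    return k;
}

/* The same loop with the transformation applied by hand: load once before
   the loop, work on a local copy, store once after the loop. */
int count_in_register(void)
{
    int k = 0;
    int tmp = k;            /* hoisted load */

    while (tmp != 0x10)     /* body of mm(), inlined by hand */
        tmp++;
    k = tmp;                /* sunk store */
    return k;
}
```

Both functions return 16. Writing the loop in the second form is one
possible workaround when the compiler at hand does not sink the store
itself.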

Toshi

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: optimizations
  2003-01-16 11:03   ` optimizations tm_gccmail
@ 2003-01-16 12:34     ` Reza Roboubi
  2003-02-18 18:13       ` optimizations Håkan Hjort
  0 siblings, 1 reply; 27+ messages in thread
From: Reza Roboubi @ 2003-01-16 12:34 UTC (permalink / raw)
  To: tm_gccmail; +Cc: gcc, gcc-help

tm_gccmail@mail.kloo.net wrote:
[snap]
> I mentioned this on the gcc-bugs mailing list, and Mark Mitchell
> contributed a fairly simple load hoisting improvement to the loop
> optimizer which restored performance on Whetstone.
> 
> If you look at the gcc-bugs archives for 1998, you may be able to find
> this message thread.
[snap]

Thanks for this input. It would be interesting to see how the issue was fixed.

* Re: optimizations
  2003-01-16 12:34     ` optimizations Reza Roboubi
@ 2003-02-18 18:13       ` Håkan Hjort
  2003-02-18 18:16         ` optimizations Andrew Pinski
  2003-02-18 18:17         ` optimizations Zack Weinberg
  0 siblings, 2 replies; 27+ messages in thread
From: Håkan Hjort @ 2003-02-18 18:13 UTC (permalink / raw)
  To: Reza Roboubi; +Cc: gcc

Wed Jan 15 2003, Reza Roboubi wrote:
> tm_gccmail@mail.kloo.net wrote:
> [snap]
> > I mentioned this on the gcc-bugs mailing list, and Mark Mitchell
> > contributed a fairly simple load hoisting improvement to the loop
> > optimizer which restored performance on Whetstone.
> > 
> > If you look at the gcc-bugs archives for 1998, you may be able to find
> > this message thread.
> [snap]
> 
> Thanks for this input. It would be interesting to see how the issue was fixed.
Sorry for getting into this so late.
Nobody actually posted the code generated by 3.3/3.4...

#include <unistd.h>     /* declares write() */

inline int mm(int *i) {
        if((*i)==0x10) return 0;
        (*i)++; return 1;
}

int main() {
        int k=0;
        while (mm(&k)) {}
        write(1,&k,1);
        return 0;
}

For Sun's Forte compiler one gets the following:

main:
         save    %sp,-104,%sp
         or      %g0,16,%g1
         st      %g1,[%fp-4]
         add     %fp,-4,%o1
         or      %g0,1,%o0
         call    write   ! params =  %o0 %o1 %o2 ! Result
         or      %g0,1,%o2
         ret     ! Result =  %i0
         restore %g0,0,%o0

I.e. it just stores '16' in k before the call to write, no trace left
of mm() or any loop, as should be.

Perhaps GCC now does the same after hoisting both the load and the store?
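
In C terms, the Forte output amounts to replacing the loop by its result.
A small sketch of that equivalence follows; k_after_loop() and k_folded()
are names invented here, and mm() is marked static so the fragment links
on its own.

```c
/* mm() from the test case above. */
static inline int mm(int *i)
{
    if (*i == 0x10) return 0;
    (*i)++;
    return 1;
}

/* The loop as written in main(): k counts up from 0 until mm() returns 0. */
static int k_after_loop(void)
{
    int k = 0;
    while (mm(&k)) {}
    return k;
}

/* What the Forte code effectively does: no call, no loop, just the
   constant 16 stored into k before write() is called. */
static int k_folded(void)
{
    return 16;
}
```

Both functions return 16, so a compiler that inlines mm() and evaluates
the loop at compile time may emit only the constant store.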

-- 
/Håkan

* Re: optimizations
  2003-02-18 18:13       ` optimizations Håkan Hjort
@ 2003-02-18 18:16         ` Andrew Pinski
  2003-02-18 18:17         ` optimizations Zack Weinberg
  1 sibling, 0 replies; 27+ messages in thread
From: Andrew Pinski @ 2003-02-18 18:16 UTC (permalink / raw)
  To: Håkan Hjort; +Cc: Reza Roboubi, gcc

Here is the generation for 3.4 (20030214) on ppc (darwin/mac os x):
_main:
         mflr r2
         li r9,0
         stw r2,8(r1)
         li r2,0
         stwu r1,-64(r1)
         stw r2,56(r1)		<== store 0 in &k before the loop. WHY?
         b L2
L10:
         addi r9,r9,1
L2:
         cmpwi cr0,r9,16
         bne+ cr0,L10
         addi r4,r1,56	<== the second parm to write
         li r3,1		<== the first parm to write
         li r5,1		<== the third parm to write
         stw r9,56(r1)	<== store k into the slot that &k points to
         bl _write		<== `call' write
         addi r1,r1,64	<== restore the stack pointer
         lwz r4,8(r1)	
         li r3,0		<== the return value from main
         mtlr r4
         blr

Here is the generation for 3.4 (20030215) on x86 (linux):

main:
         pushl   %ebp
         movl    %esp, %ebp
         subl    $24, %esp
         movl    $0, -4(%ebp)	<=== store 0 into &k
         andl    $-16, %esp
         jmp     .L2
         .p2align 4,,7
.L9:
         incl    %eax			<=== increment k
         movl    %eax, -4(%ebp)	<=== store k into &k WHY?
.L2:
         movl    -4(%ebp), %eax	<=== load k from &k WHY?
         cmpl    $16, %eax		<=== compare k to 16
         jne     .L9			<=== jump not equal to L9
         movl    $1, 8(%esp)		<=== 3rd parm to write (1)
         leal    -4(%ebp), %edx	<=== (&k) into edx
         movl    %edx, 4(%esp)	<=== 2nd parm to write (&k/edx)
         movl    $1, (%esp)		<=== 1st parm to write (1)
         call    write			<=== call write
         leave
         ret

From the looks of it, gcc does a better job optimizing this on ppc than
on i686, for some reason.


Thanks,
Andrew Pinski



On Tuesday, Feb 18, 2003, at 09:55 US/Pacific, Håkan Hjort wrote:

> Wed Jan 15 2003, Reza Roboubi wrote:
>> tm_gccmail@mail.kloo.net wrote:
>> [snap]
>>> I mentioned this on the gcc-bugs mailing list, and Mark Mitchell
>>> contributed a fairly simple load hoisting improvement to the loop
>>> optimizer which restored performance on Whetstone.
>>>
>>> If you look at the gcc-bugs archives for 1998, you may be able to 
>>> find
>>> this message thread.
>> [snap]
>>
>> Thanks for this input. It would be interesting to see how the issue 
>> was fixed.
> Sorry for getting into this so late.
> Nobody actually posted the code generated by 3.3/3.4...
>
> inline int mm(int *i) {
>         if((*i)==0x10) return 0;
>         (*i)++; return 1;
> }
>
> int main() {
>         int k=0;
>         while (mm(&k)) {}
>         write(1,&k,1);
>         return 0;
> }
>
> For Sun's Forte compiler one gets the following:
>
> main:
>          save    %sp,-104,%sp
>          or      %g0,16,%g1
>          st      %g1,[%fp-4]
>          add     %fp,-4,%o1
>          or      %g0,1,%o0
>          call    write   ! params =  %o0 %o1 %o2 ! Result
>          or      %g0,1,%o2
>          ret     ! Result =  %i0
>          restore %g0,0,%o0
>
> I.e. it just stores '16' in k before the call to write, no trace left
> of mm() or any loop, as should be.
>
> Perhaps GCC now does the same after hoisting both the load and the 
> store?
>
> -- 
> /Håkan
>

* Re: optimizations
  2003-02-18 18:13       ` optimizations Håkan Hjort
  2003-02-18 18:16         ` optimizations Andrew Pinski
@ 2003-02-18 18:17         ` Zack Weinberg
  2003-02-18 18:40           ` optimizations Håkan Hjort
  2003-02-19  5:02           ` optimizations David Edelsohn
  1 sibling, 2 replies; 27+ messages in thread
From: Zack Weinberg @ 2003-02-18 18:17 UTC (permalink / raw)
  To: Håkan Hjort; +Cc: Reza Roboubi, gcc

Håkan Hjort <hakan@safelogic.se> writes:

> For Sun's Forte compiler one gets the following:
>
> main:
>          save    %sp,-104,%sp
>          or      %g0,16,%g1
>          st      %g1,[%fp-4]
>          add     %fp,-4,%o1
>          or      %g0,1,%o0
>          call    write   ! params =  %o0 %o1 %o2 ! Result
>          or      %g0,1,%o2
>          ret     ! Result =  %i0
>          restore %g0,0,%o0
>
> I.e. it just stores '16' in k before the call to write, no trace left
> of mm() or any loop, as should be.
>
> Perhaps GCC now does the same after hoisting both the load and the store?

Unfortunately not.  On x86, with -O2, 3.4 20030211 produces

main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $24, %esp
        movl    $0, -4(%ebp)
        andl    $-16, %esp
        jmp     .L2
        .p2align 4,,7
.L9:
        incl    %eax
        movl    %eax, -4(%ebp)
.L2:
        movl    -4(%ebp), %eax
        cmpl    $16, %eax
        jne     .L9
        movl    $1, 8(%esp)
        leal    -4(%ebp), %eax
        movl    %eax, 4(%esp)
        movl    $1, (%esp)
        call    write
        leave
        xorl    %eax, %eax
        ret

so you can see that not only is the loop still present, but the memory
write has not been sunk.  

What happens at -O2 -fssa -fssa-ccp -fssa-dce is interesting:

main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $24, %esp
        andl    $-16, %esp
        jmp     .L2
        .p2align 4,,7
.L9:
        incl    %eax
.L2:
        cmpl    $16, %eax
        jne     .L9
        movl    $1, 8(%esp)
        leal    -4(%ebp), %eax
        movl    %eax, 4(%esp)
        movl    $1, (%esp)
        call    write
        leave
        xorl    %eax, %eax
        ret

The unnecessary memory references are now gone, but the loop remains;
also you can see what may appear to be a bug at first glance -- %eax
is never initialized.  This is not actually a correctness bug: no
matter what value %eax happened to have before the loop, it will leave
the loop with the value 16.  However, I think you'll agree that this
is poor optimization.
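
To make the "any starting value ends at 16" argument concrete, here is a
small sketch. It assumes a hypothetical 8-bit register instead of the
32-bit %eax so that every starting value can be checked quickly; the
reasoning is the same, since the increment wraps around and must
eventually hit 16. The function names are invented for this example.

```c
#include <stdint.h>

/* Same shape as the .L9/.L2 loop, on a hypothetical 8-bit register. */
static uint8_t run_loop(uint8_t reg)
{
    while (reg != 16) {
        reg++;            /* unsigned arithmetic wraps from 255 to 0 */
    }
    return reg;
}

/* Returns 1 if every possible starting value leaves the loop as 16. */
static int all_starts_exit_with_16(void)
{
    for (unsigned start = 0; start < 256; start++) {
        if (run_loop((uint8_t)start) != 16) return 0;
    }
    return 1;
}
```

So the result is correct, but a bad starting value costs up to a full
wrap of the register before the loop exits.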

RTL-SSA is, I believe, considered somewhat of a failed experiment -
the interesting work is happening on the tree-ssa branch.  I do not
have that branch checked out to experiment with.  Also, the loop
optimizer has been overhauled on the rtlopt branch, which again I do
not have to hand.

zw

* Re: optimizations
  2003-02-18 18:17         ` optimizations Zack Weinberg
@ 2003-02-18 18:40           ` Håkan Hjort
  2003-02-19  5:02           ` optimizations David Edelsohn
  1 sibling, 0 replies; 27+ messages in thread
From: Håkan Hjort @ 2003-02-18 18:40 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Reza Roboubi, gcc

Tue Feb 18 2003, Zack Weinberg wrote:
> Håkan Hjort <hakan@safelogic.se> writes:
[snip]
> 
>         jmp     .L2
>         .p2align 4,,7
> .L9:
>         incl    %eax
> .L2:
>         cmpl    $16, %eax
>         jne     .L9
[snip]
> 
> The unnecessary memory references are now gone, but the loop remains;
> also you can see what may appear to be a bug at first glance -- %eax
> is never initialized.  This is not actually a correctness bug: no
> matter what value %eax happened to have before the loop, it will leave
> the loop with the value 16.  However, I think you'll agree that this
> is poor optimization.
> 
This might not be common enough to care about, but...
We do know the exit value of a loop with no side effects, only one
exit edge and an EQ exit condition: it simply must be that value.
Here, though, we had the start value too, since k was set to 0.

Not having working loop hoisting/sinking seems like it could hurt
performance quite a bit, though things are never so easy on modern CPUs
with store queues, load bypassing, register renaming, out-of-order
execution and so on.  What's clear is that the instruction count and
text size will be higher than needed.

Let's see if there is someone that can report on the loop-opt and/or
tree-ssa branches.

-- 
/Håkan

* Re: optimizations
  2003-02-18 18:17         ` optimizations Zack Weinberg
  2003-02-18 18:40           ` optimizations Håkan Hjort
@ 2003-02-19  5:02           ` David Edelsohn
  1 sibling, 0 replies; 27+ messages in thread
From: David Edelsohn @ 2003-02-19  5:02 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: Håkan Hjort, Reza Roboubi, gcc

>>>>> Zack Weinberg writes:

Zack> so you can see that not only is the loop still present, but the memory
Zack> write has not been sunk.  

	Which is why I am eager to fix store motion utilizing Zdenek and
Daniel Berlin's efforts.

David

* Re: optimizations
  2003-01-16 10:53 ` optimizations Reza Roboubi
  2003-01-16 11:03   ` optimizations tm_gccmail
@ 2003-01-16 11:53   ` Paolo Bonzini
  1 sibling, 0 replies; 27+ messages in thread
From: Paolo Bonzini @ 2003-01-16 11:53 UTC (permalink / raw)
  To: Reza Roboubi; +Cc: gcc, gcc-help

> Would you please elaborate on that?  I don't understand what you mean by
> the "taken case."  The suggested optimization is:
>
> [snip hoisting load/save of -4(%ebp)]

Ah... I thought you were considering

.L2:
cmpl $16, -4(%ebp)
je .L7
incl -4(%ebp)
jmp .L2
.L7:

Paolo


Thread overview: 27+ messages
1997-12-09  9:52 Optimizations David M. Ronis
1997-12-09 11:19 ` Optimizations Jeffrey A Law
1997-12-10 10:46 ` [EGCS] Optimizations Marc Lehmann
1997-12-14  5:39   ` Philipp Thomas
1997-12-14 15:14     ` Optimizations Marc Lehmann
1997-12-14 20:14       ` Optimizations Jeffrey A Law
1997-12-14 14:30 [EGCS] Optimizations meissner
1997-12-15  5:38 ` Optimizations Marc Lehmann
1997-12-15 11:29   ` Optimizations Dave Love
1997-12-15 15:43     ` Optimizations Marc Lehmann
     [not found] ` <19971216000653.24186.cygnus.egcs@cerebro.laendle>
1997-12-23  7:51   ` Optimizations Stan Cox
2000-03-10  1:46 Optimizations Virgil Palanciuc
2003-01-14 22:58 optimizations Reza Roboubi
2003-01-15  0:15 ` optimizations Andrew Pinski
2003-01-15  5:10   ` optimizations Reza Roboubi
2003-01-15  6:31     ` optimizations Reza Roboubi
2003-01-15 17:37       ` optimizations Andrew Pinski
2003-01-15 17:46         ` optimizations Reza Roboubi
2003-01-15 23:20 optimizations Bonzini
2003-01-16 10:53 ` optimizations Reza Roboubi
2003-01-16 11:03   ` optimizations tm_gccmail
2003-01-16 12:34     ` optimizations Reza Roboubi
2003-02-18 18:13       ` optimizations Håkan Hjort
2003-02-18 18:16         ` optimizations Andrew Pinski
2003-02-18 18:17         ` optimizations Zack Weinberg
2003-02-18 18:40           ` optimizations Håkan Hjort
2003-02-19  5:02           ` optimizations David Edelsohn
2003-01-16 11:53   ` optimizations Paolo Bonzini