Performance measurements

public inbox for gcc@gcc.gnu.org

* Performance measurements
@ 1998-06-24  2:28 Martin Kahlert
  1998-06-24  8:51 ` David S. Miller
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Martin Kahlert @ 1998-06-24  2:28 UTC (permalink / raw)
  To: egcs; +Cc: axp-list

Hi,
I tried to compare different compilers on my numerical code.
To do so, I extracted an FPU-intensive function and wrapped
it in a loop while measuring the execution time.

I will provide the source and the Makefile at the end of this mail.

I work on a Linux 2.0.34 SMP Kernel (libc5). My hardware is a
dual Pentium Pro 200MHz system with 128MB RAM.

pgcc is the Portland Group compiler and
tcc is the free TenDRA compiler system
(from http://alph.dera.gov.uk/TenDRA/ )

Here are my results:
%> make
pgcc:
90.11 MFLOPS
gcc-2.7.2.1:
95.46 MFLOPS
gcc-without double align:
23.52 MFLOPS
egcs-2.91.42:
69.96 MFLOPS
tcc:
92.33 MFLOPS

- The difference -malign-double makes on gcc-2.7.2.1 is very 
  impressive. On egcs it doesn't change the result.

- egcs seems to produce worse code than gcc-2.7.2.1.
- In my experience, pgcc produces very good code and is very
  reliable. Its code usually runs about 25% faster than egcs code.
- For tcc I have to say that the result depends heavily on the
  order in which you declare the variables:
  If you put the declarations int i,j; and double *wksph=wksp+n/2;
  in front of the one for c[], the result drops to 24.70 MFLOPS.
  For all other compilers, this doesn't make much difference.

Could anybody comment on that?

If anyone is interested in the asm code that pgcc produces, i 
can send it offline (388 lines of asm statements are too 
long for the list)


For the axp list: It would be very kind, if anybody could provide
some values for a Linux Alpha (e.g. 533MHz) for comparison.

Thanks in advance,
Martin


Here is the testfile m.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <math.h>

#define N 1024

void trafo(double *a,const int n,double *wksp,double *const ops)
{
 double c[]={ 3.52262918857095333469e-02,
             -8.54412738820266443041e-02,
             -1.35011020010254584323e-01,
              4.59877502118491543470e-01,
              8.06891509311092547385e-01,
              3.32670552950082631938e-01 };
 int i,j;
 double *wksph=wksp+n/2;

 if(n<6)
    return;

 for(i=j=0;i<n/2-2;i++,j+=2)
    {wksph[i]= c[0]*a[j]+c[1]*a[j+1]
     +c[2]*a[j+2]+c[3]*a[j+3]
         +c[4]*a[j+4]+c[5]*a[j+5];
     wksp[i] = c[5]*a[j]-c[4]*a[j+1]
         +c[3]*a[j+2]-c[2]*a[j+3]
         +c[1]*a[j+4]-c[0]*a[j+5];
    }
 wksph[i]= c[0]*a[j]+c[1]*a[j+1]
     +c[2]*a[j+2]+c[3]*a[j+3]
     +c[4]*a[0]+c[5]*a[1];
 wksp[i] = c[5]*a[j]-c[4]*a[j+1]
     +c[3]*a[j+2]-c[2]*a[j+3]
     +c[1]*a[0]-c[0]*a[1];
 i++;j+=2;
 wksph[i]= c[0]*a[j]+c[1]*a[j+1]
     +c[2]*a[0]+c[3]*a[1]
     +c[4]*a[2]+c[5]*a[3];
 wksp[i] = c[5]*a[j]-c[4]*a[j+1]
     +c[3]*a[0]-c[2]*a[1]
     +c[1]*a[2]-c[0]*a[3];

 memcpy(a,wksp,sizeof(double)*n);
 (*ops)+=((double)11)*n;
 return;
}

int main(int argc,const char *argv[])
{
 double *x,h,ops=0;
 int i;
 clock_t start;

 if(!(x=malloc(2*N*sizeof(double))))
    {
     fputs("out of memory\n",stderr);
     return EXIT_FAILURE;
    }

 for(i=0;i<N;i++)
    {
     h=(double)i;
     x[i]=sin(h*PI/(double)N);
    }

 start=clock();
 for(i=0;i<10000;i++)
     trafo(x,N,x+N,&ops);
 h=(double)(clock()-start)/(double)CLOCKS_PER_SEC;
 printf("%.2f MFLOPS\n",1.0e-6*ops/h);

 free(x);
 return 0;
}

And here is the Makefile:

EXECUTABLES = m.pgcc m.egcs-2.91.42 m.gcc-2.7.2.1 m.not_aligned m.tcc
GCC_OPTS = -O3 -malign-double -fomit-frame-pointer -Wall -malign-loops=2 -malign-jumps=2 -malign-functions=2
all: $(EXECUTABLES) test
clean:
	rm -f  $(EXECUTABLES)
test:
	@echo "pgcc:"
	@./m.pgcc
	@echo "gcc-2.7.2.1:"
	@./m.gcc-2.7.2.1
	@echo "gcc-without double align:"
	@./m.not_aligned
	@echo "egcs-2.91.42:"
	@./m.egcs-2.91.42
	@echo "tcc:"
	@./m.tcc
m.pgcc: m.c
	pgcc -O2 -tp p6 -Mnoframe -o $@ $< -lm
m.gcc-2.7.2.1: m.c
	/usr/bin/gcc $(GCC_OPTS) -o $@ $< -lm
m.not_aligned: m.c
	/usr/bin/gcc -O3 -fomit-frame-pointer -Wall -malign-loops=2 -malign-jumps=2 -malign-functions=2 -o $@ $< -lm
m.egcs-2.91.42: m.c
	/sw/egcs/bin/gcc $(GCC_OPTS) -o $@ $< -lm
m.tcc: m.c
	tcc -Ysystem -O2 -o $@ $< -lm

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Performance measurements
  1998-06-24  2:28 Performance measurements Martin Kahlert
@ 1998-06-24  8:51 ` David S. Miller
  1998-06-24 10:08 ` Gerald Pfeifer
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: David S. Miller @ 1998-06-24  8:51 UTC (permalink / raw)
  To: martin.kahlert; +Cc: egcs, axp-list

   Date: Wed, 24 Jun 1998 10:51:03 +0200
   From: Martin Kahlert <martin.kahlert@mchp.siemens.de>

   For the axp list: It would be very kind, if anybody could provide
   some values for a Linux Alpha (e.g. 533MHz) for comparison.

I couldn't resist, with current CVS egcs sources, on a 300Mhz
UltraSparc w/512K L2 cache running Linux:

? ./fpubench
142.58 MFLOPS

Later,
David S. Miller
davem@dm.cobaltmicro.com


* Re: Performance measurements
  1998-06-24  2:28 Performance measurements Martin Kahlert
  1998-06-24  8:51 ` David S. Miller
@ 1998-06-24 10:08 ` Gerald Pfeifer
  1998-06-25  3:09 ` Jeffrey A Law
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Gerald Pfeifer @ 1998-06-24 10:08 UTC (permalink / raw)
  To: Martin Kahlert; +Cc: egcs, axp-list

On Wed, 24 Jun 1998, Martin Kahlert wrote:
> pgcc:
> 90.11 MFLOPS
> gcc-2.7.2.1:
> 95.46 MFLOPS
> gcc-without double align:
> 23.52 MFLOPS
> egcs-2.91.42:
> 69.96 MFLOPS

Do you happen to have a copy of egcs-1.0.3a available? I'd be rather
curious whether that one performs that badly, too.

Gerald
-- 
Gerald Pfeifer (Jerry)      Vienna University of Technology
pfeifer@dbai.tuwien.ac.at   http://www.dbai.tuwien.ac.at/~pfeifer/



* Re: Performance measurements
  1998-06-24  2:28 Performance measurements Martin Kahlert
  1998-06-24  8:51 ` David S. Miller
  1998-06-24 10:08 ` Gerald Pfeifer
@ 1998-06-25  3:09 ` Jeffrey A Law
       [not found] ` <3590D5AE.167EB0E7@iis.fhg.de>
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Jeffrey A Law @ 1998-06-25  3:09 UTC (permalink / raw)
  To: Martin Kahlert; +Cc: egcs, axp-list

  In message < 199806240851.KAA06049@keksy.mchp.siemens.de >you write:
  > Hi,
  > i tried to compare different compilers on my numerical code.
  > Therefore i extracted a FPU intensive function and surrounded
  > it with a loop while measuring the execution time.
  > 
  > I will provide the source and the Makefile at the end of this mail.
  > 
  > I work on a Linux 2.0.34 SMP Kernel (libc5). My hardware is a
  > dual Pentium Pro 200MHz system with 128MB RAM.
  > 
  > pgcc is the portland group compiler and 
  > tcc the free Tendra compiler system
  > (from http://alph.dera.gov.uk/TenDRA/ )
  > 
  > Here are my results:
  > %> make
  > pgcc:
  > 90.11 MFLOPS
  > gcc-2.7.2.1:
  > 95.46 MFLOPS
  > gcc-without double align:
  > 23.52 MFLOPS
  > egcs-2.91.42:
  > 69.96 MFLOPS
  > tcc:
  > 92.33 MFLOPS

Here's some more info.  PPro200

egcs-1.0.3	 69.96
today's sources	 70.40

Note I get 73.14 if I remove all the various -malign switches.

Not particularly good.    Can someone look into this?  It might
be another case of double alignment losing badly.  I don't know
x86 issues well enough.

Jeff


[parent not found: <3590D5AE.167EB0E7@iis.fhg.de>]

[parent not found: <19980624124843.A15248@keksy.mchp.siemens.de>]

[parent not found: <3591031A.2781E494@iis.fhg.de>]

[parent not found: <19980624170051.21290@haegar.physiol.med.tu-muenchen.de>]

* Re: Performance measurements (thanks and conclusion)
       [not found]       ` <19980624170051.21290@haegar.physiol.med.tu-muenchen.de>
@ 1998-06-25  3:09         ` Martin Kahlert
  0 siblings, 0 replies; 18+ messages in thread
From: Martin Kahlert @ 1998-06-25  3:09 UTC (permalink / raw)
  To: axp-list; +Cc: scr, robert, egcs

Quoting Robert Wilhelm (robert@physiol.med.tu-muenchen.de):
> > [Robert: could you please compile my code on your Alpha using
> >          egcs and report your results to me]
> 
> I get about 275 MFLOPS for my 533MHz 21164a for both egcs 1.0 and
> egcs-current with haifa enabled.
> 
> If I use different local variables lfA*, egcs seems to schedule a bit
> better and I get 290 MFLOPS.
> 
> Robert

I was really overwhelmed by the response to this thread on
axp-list. Thanks a lot to all the people who tried my source
and even tried to get more out of the compilers.

I tried both versions on my PPro 200:
Stefan Schroepfer's version:
pgcc:
85.98 MFLOPS
gcc-2.7.2.1:
97.10 MFLOPS
gcc-without double align:
95.46 MFLOPS
egcs-2.91.42:
84.06 MFLOPS
tcc:
17.20 MFLOPS

Robert Wilhelm's version:
pgcc:
81.62 MFLOPS
gcc-2.7.2.1:
98.81 MFLOPS
gcc-without double align:
98.81 MFLOPS
egcs-2.91.42:
83.44 MFLOPS
tcc:
16.44 MFLOPS

It seems that tcc is not the fastest and the most reliable 
under the sun...

Can I conclude that it's a good idea to introduce as many local
variables as possible to get good results from compilers?

Now i have two questions:

-Why is it so difficult for gcc to transform the code
 for(i=0;i<n;i++)
    result[i]=a[i]+2*a[i+1]+3*a[i+2];

 into something like

 _tmp0=a[0];_tmp1=a[1];_tmp2=a[2];
 for(i=0;i<n;i++)
    {
     result[i]=_tmp0+2*_tmp1+3*_tmp2;
     _tmp0=_tmp1;
     _tmp1=_tmp2;
     _tmp2=a[i+3];
    }
 by itself? I think, especially in Fortran, such things are a
 common task.
-What's the reason for the performance loss between gcc-2.7.2.1
 and egcs-2.91.42? gcc-2.7.2.1 is nearly 20% better.

Thanks a lot,
Martin.


* Re: Performance measurements
  1998-06-24  2:28 Performance measurements Martin Kahlert
                   ` (3 preceding siblings ...)
       [not found] ` <3590D5AE.167EB0E7@iis.fhg.de>
@ 1998-06-26  1:05 ` Aubert Pierre
  1998-06-27  2:25   ` Jeffrey A Law
  1998-06-29 22:34 ` Rask Ingemann Lambertsen
  1998-07-01 21:20 ` Marc Lehmann
  6 siblings, 1 reply; 18+ messages in thread
From: Aubert Pierre @ 1998-06-26  1:05 UTC (permalink / raw)
  To: Martin Kahlert; +Cc: egcs

Martin Kahlert <martin.kahlert@mchp.siemens.de> writes:

> Hi,
> i tried to compare different compilers on my numerical code.
> Therefore i extracted a FPU intensive function and surrounded
> it with a loop while measuring the execution time.
> 
> I will provide the source and the Makefile at the end of this mail.
> 
> I work on a Linux 2.0.34 SMP Kernel (libc5). My hardware is a
> dual Pentium Pro 200MHz system with 128MB RAM.
> 
> pgcc is the portland group compiler and 
> tcc the free Tendra compiler system
> (from http://alph.dera.gov.uk/TenDRA/ )
> 
> Here are my results:
> %> make
> pgcc:
> 90.11 MFLOPS
> gcc-2.7.2.1:
> 95.46 MFLOPS
> gcc-without double align:
> 23.52 MFLOPS
> egcs-2.91.42:
> 69.96 MFLOPS
> tcc:
> 92.33 MFLOPS

Just for information, on HP

PA7200: gcc-2.7.2.2              -O3  :  26.63 MFLOPS
PA7200: egcs-2.91.42             -O3  :  47.13 MFLOPS
PA7200: cc                   -Ae +O4  :  52.88 MFLOPS

PA8000: gcc-2.7.2.2              -O3  : 108.31 MFLOPS
PA8000: egcs-2.91.42             -O3  :  97.10 MFLOPS
PA8000: cc            +DA1.1 -Ae +O4  : 129.47 MFLOPS 
PA8000: cc            +DA2.0 -Ae +O4  : 216.62 MFLOPS

egcs faster on PA8000 and slower on PA7200 than gcc-2.7.2.
cc is faster due to better support of special instructions.

Is there any planned support for PA2.0 on HP?




* Re: Performance measurements
  1998-06-26  1:05 ` Performance measurements Aubert Pierre
@ 1998-06-27  2:25   ` Jeffrey A Law
  0 siblings, 0 replies; 18+ messages in thread
From: Jeffrey A Law @ 1998-06-27  2:25 UTC (permalink / raw)
  To: Aubert Pierre; +Cc: Martin Kahlert, egcs

  > Just for information, on HP
  > 
  > PA7200: gcc-2.7.2.2              -O3  :  26.63 MFLOPS
  > PA7200: egcs-2.91.42             -O3  :  47.13 MFLOPS
  > PA7200: cc                   -Ae +O4  :  52.88 MFLOPS
  > 
  > PA8000: gcc-2.7.2.2              -O3  : 108.31 MFLOPS
  > PA8000: egcs-2.91.42             -O3  :  97.10 MFLOPS
  > PA8000: cc            +DA1.1 -Ae +O4  : 129.47 MFLOPS 
  > PA8000: cc            +DA2.0 -Ae +O4  : 216.62 MFLOPS
  > 
  > egcs faster on PA8000 and slower on PA7200 than gcc-2.7.2.
  > cc is faster due to better support of special instructions.
Err, you got that backwards :-)  egcs is faster than gcc2 on
the PA7200, but slower on the PA8000 series.

Note that you can get about a 30% improvement in this code on a
PA8000 by disabling the fmpyadd/fmpysub instructions.  They're
reorder buffer killers.

In fact, if someone wanted to submit a patch which added flags for
PA2.0 scheduling and codegen I'd accept it -- even if it did nothing
at this point.   Just having the flags allows us to start experimenting
with the code gen issues.

  > Is there any planned support for PA2.0 on HP?
I'd like to do it, but I don't have the time.  I'd happily accept
contributions.

Note first you have to add PA2.0 support in bfd/binutils/gas if
you're going to use any of the new instructions.

jeff


* Re: Performance measurements
  1998-06-24  2:28 Performance measurements Martin Kahlert
                   ` (4 preceding siblings ...)
  1998-06-26  1:05 ` Performance measurements Aubert Pierre
@ 1998-06-29 22:34 ` Rask Ingemann Lambertsen
  1998-07-01  3:42   ` Nicholas Lee
  1998-07-01 21:20 ` Marc Lehmann
  6 siblings, 1 reply; 18+ messages in thread
From: Rask Ingemann Lambertsen @ 1998-06-29 22:34 UTC (permalink / raw)
  To: EGCS mailing list

On 24-Jun-98 10:51:03, Martin Kahlert wrote the following about "Performance measurements":
> Hi,
> i tried to compare different compilers on my numerical code.
> Therefore i extracted a FPU intensive function and surrounded
> it with a loop while measuring the execution time.

It is also very memcpy() intensive. If your C library's memcpy() isn't very
good, you can gain quite some performance. I went from 34 MFLOPS to 59 MFLOPS
just by optimizing memcpy().

Here are some PowerPC 604e, 200 MHz results:

AmigaOS:
gcc version egcs-2.90.23 980102 (egcs-1.0.1 release)
-DPI=M_PI -O3 -fomit-frame-pointer -mcpu=604 -mmultiple -mstring
59.28 MFLOPS

AmigaOS:
gcc version 2.7.2.1
-DPI=M_PI -O3 -fomit-frame-pointer -mcpu=604 -mmultiple -mstring
54.15 MFLOPS

AIX (RS6000 43P):
gcc version 2.6.3
-DPI=M_PI -O3 -fomit-frame-pointer -mcpu=604 -Wa,-m604
54.68 MFLOPS

AIX (RS6000 43P):
Vendor supplied cc
-DPI=M_PI -O3 -qstrict
63.28 MFLOPS

AIX (RS6000 43P):
Vendor supplied cc
-DPI=M_PI -O3 -qstrict -qarch=ppc -qtune=604
60.89 MFLOPS

(?)

At least the performance is heading in the right direction, but it looks on
the low side to me. Perhaps someone with a more recent PowerPC EGCS
build could be persuaded to post some results?

Regards,

/Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯TÂ¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯Â¯\
| Rask Ingemann Lambertsen       | E-mail: mailto:rask@kampsax.k-net.dk  |
| Registered Phase5 developer    | WWW: http://www.gbar.dtu.dk/~c948374/ |
| A4000, 775 kkeys/s (RC5-64)    | "ThrustMe" on XPilot, ARCnet and IRC  |
|                           LOAD "emacs",8,1                             |



* Re: Performance measurements
  1998-06-29 22:34 ` Rask Ingemann Lambertsen
@ 1998-07-01  3:42   ` Nicholas Lee
  0 siblings, 0 replies; 18+ messages in thread
From: Nicholas Lee @ 1998-07-01  3:42 UTC (permalink / raw)
  To: EGCS mailing list

Just to join the fun: 8)

Stampede 0.79 P(classic)-133, 64 Megs EDO, 512KB L2.

Note that pgcc here is the Pentium-optimised version of gcc, patched against
egcs 1.0.3.  I made one modification to the initial m.c source: I changed PI
to M_PI so it would compile.


sands:~/tmp$ make
pgcc -O6 -o m.pgcc.O6 m.c -lm
pgcc -O2 -malign-double -fomit-frame-pointer -Wall -malign-loops=2
-malign-jumps=2 -malign-functions=2 -o m.pgcc m.c -lm
gcc -O2 -malign-double -fomit-frame-pointer -Wall -malign-loops=2
-malign-jumps=2 -malign-functions=2 -o m.egcs-2.90.29 m.c -lm
gcc -V 2.8.1 -O2 -malign-double -fomit-frame-pointer -Wall -malign-loops=2
-malign-jumps=2 -malign-functions=2 -o m.gcc-2.8.1 m.c -lm
gcc -O3 -fomit-frame-pointer -Wall -malign-loops=2 -malign-jumps=2
-malign-functions=2 -o m.not_aligned m.c -lm
pgcc.O6:
32.84 MFLOPS
pgcc:
29.80 MFLOPS
gcc-2.8.1:
31.82 MFLOPS
gcc-without double align:
32.46 MFLOPS
egcs-2.90.29:
29.72 MFLOPS

==
sands:~/tmp$ gcc -v
Reading specs from /usr/lib/gcc-lib/i586-pc-linux-gnu/pgcc-2.90.29/specs
gcc version pgcc-2.90.29 980515 (egcs-1.0.3 release)
sands:~/tmp$ pgcc -v
Reading specs from /usr/lib/gcc-lib/i586-pc-linux-gnu/pgcc-2.90.29/specs
gcc version pgcc-2.90.29 980515 (egcs-1.0.3 release)

--
        "I reserve the right to contradict myself" 
n.lee with math.auckland.ac in nz
NO commercial email please! No SPAM for me.


* Re: Performance measurements
  1998-06-24  2:28 Performance measurements Martin Kahlert
                   ` (5 preceding siblings ...)
  1998-06-29 22:34 ` Rask Ingemann Lambertsen
@ 1998-07-01 21:20 ` Marc Lehmann
  1998-07-02  7:14   ` Craig Burley
  1998-07-02 15:15   ` Joern Rennecke
  6 siblings, 2 replies; 18+ messages in thread
From: Marc Lehmann @ 1998-07-01 21:20 UTC (permalink / raw)
  To: egcs; +Cc: axp-list

I couldn't resist as well:

-V2.7.2.3 -O6 -mno-align-double					70.99 MFLOPS
-V2.7.2.3 -O6 -malign-double					113.78 MFLOPS

-V1.0.3 -O6 -mno-align-double					125.62 MFLOPS
-V1.0.3 -O6 -malign-double					122.43 MFLOPS

egcs-980628 -O6 -malign-double -funroll-all-loops		135.17 MFLOPS
egcs-980628 -O6 -mno-align-double				141.39 MFLOPS
egcs-980628 -O6 -malign-double -mpentiumpro -march=pentiumpro	144.41 MFLOPS

pgcc-980628 -O2 -malign-double -mno-stack-align-double		155.72 MFLOPS
pgcc-980628 -O2 -malign-double -mstack-align-double		155.72 MFLOPS
pgcc-980628 -O6 -malign-double -mstack-align-double		157.91 MFLOPS
pgcc-980628 -O6 -malign-double -funroll-all-loops		162.46 MFLOPS
					
-B/root/cc/egcs-mmx/gcc/ -O6 -mmx				142.58 MFLOPS

(hardware is a P-II 333) 

now, what is _quite_ interesting is that pgcc (Which means egcs+pentium
patches) seems to perform better with -O2 than egcs with -O6, which is
illogical, since I always believed egcs-O2 should be as fast as pgcc-O2 (and
in general, pgcc's fpu performance is sometimes better, sometimes worse than
corresponding egcs versions, and many programs perform only slightly better
with pgcc on p-ii than with egcs).

I guess I should use that program to show that pgcc is soo much faster ;)

I also don't see the bad performance of egcs vs. gcc-2.7.2.3

      -----==-                                              |
      ----==-- _                                            |
      ---==---(_)__  __ ____  __       Marc Lehmann       +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com       |e|
      -=====/_/_//_/\_,_/ /_/\_\                          --+
    The choice of a GNU generation                        |
                                                          |


* Re: Performance measurements
  1998-07-01 21:20 ` Marc Lehmann
@ 1998-07-02  7:14   ` Craig Burley
  1998-07-02 22:44     ` Marc Lehmann
  1998-07-02 15:15   ` Joern Rennecke
  1 sibling, 1 reply; 18+ messages in thread
From: Craig Burley @ 1998-07-02  7:14 UTC (permalink / raw)
  To: pcg; +Cc: egcs, burley

[Sorry I haven't tracked this thread; also I removed axp-list@redhat.com,
since my comments apply to only the x86 processor.]

>I also don't see the bad performance of egcs vs. gcc-2.7.2.3

My guess, not knowing what the code is (even its language), is that
this is due to the program being dependent mostly upon *static*
double-precision variables and arrays.

At least, that would explain the big difference in using
-malign-double over -mno-align-double on 2.7.2.3, especially
if g77 is the compiler being used.  Although, I'm at a loss
to understand why a small *opposite* difference is seen
using egcs 1.0.3a, which doesn't default to aligning static
doubles (based on my `align' package's output, anyway).

Even if stack-based doubles are used a lot, whether the bad
performance actually shows up in a benchmark can depend on
fairly random things -- the "original case" reported to
<fortran@gnu.org> being that the run-times differed *significantly*
depending on the length of the *name* of the executable!!

So it might be that 2.7.2.3 happened to naturally hit bad
alignments with -mno-align-double, nailed great ones (as should
normally, but not always, be the case) with -malign-double,
but none of the other runs happened to hit such black & white
alignments (e.g. maybe they are all either white, eggshell,
beige, or ecru ;-).

At this point, with the 1.1 freeze looming, I feel we have too
much more to learn about solving this problem right to try to
rush a quick fix into 1.1.  Though, I'm going to try to keep
it a fairly top priority, to avoid having it slip until it's
too late to fix properly (or at least decide we can't do so) in 1.2.

        tq vm, (burley)


* Re: Performance measurements
  1998-07-02  7:14   ` Craig Burley
@ 1998-07-02 22:44     ` Marc Lehmann
  1998-07-03  7:20       ` Toon Moene
  0 siblings, 1 reply; 18+ messages in thread
From: Marc Lehmann @ 1998-07-02 22:44 UTC (permalink / raw)
  To: egcs

On Thu, Jul 02, 1998 at 10:14:27AM -0400, Craig Burley wrote:
> [Sorry I haven't tracked this thread; also I removed axp-list@redhat.com,
> since my comments apply to only the x86 processor.]

This is the code Martin Kahlert sent to egcs on 19980624.

> My guess, not knowing what the code is (even its language), is that

C. I've attached the code (slightly massaged to compile with glibc)

> this is due to the program being dependent mostly upon *static*

no, it uses auto variables *exclusively*.

> At least, that would explain the big difference in using
> -malign-double over -mno-align-double on 2.7.2.3, especially

unfortunately, it doesn't

> Even if stack-based doubles are used a lot, whether the bad performance
> actually shows up in a benchmark can depend on fairly random things -- the
> "original case" reported to <fortran@gnu.org> being that the run-times
> differed *significantly* depending on the length of the *name* of the
> executable!!

I got a testcase written in C in early 1996! (three different matrix
multiply algorithms). This testcase resulted in the -mstack-align-double
switch of pgcc, and support for this in glibc and djgpp (not yet
distributed, though)

> At this point, with the 1.1 freeze looming, I feel we have too
> much more to learn about solving this problem right to try to
> rush a quick fix into 1.1.  Though, I'm going to try to keep
> it a fairly top priority, to avoid having it slip until it's
> too late to fix properly (or at least decide we can't do so) in 1.2.

we need startup (libc) support. Bernd Schmidt once sent me a patch that
forced stack alignment in "main()", but I didn't put this into pgcc, thinking
that isn't the right way to fix things.

At least for glibc, I could imagine having a switch for this (the proposed
-mstack-align-double) might help so the glibc people can put support
for it into glibc (i.e. compiling affected functions with that switch).

This would have the benefit of only forcing possible speed penalties on
functions with callbacks/startup functions.

      -----==-                                              |
      ----==-- _                                            |
      ---==---(_)__  __ ____  __       Marc Lehmann       +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com       |e|
      -=====/_/_//_/\_,_/ /_/\_\                          --+
    The choice of a GNU generation                        |
                                                          |


* Re: Performance measurements
  1998-07-02 22:44     ` Marc Lehmann
@ 1998-07-03  7:20       ` Toon Moene
  0 siblings, 0 replies; 18+ messages in thread
From: Toon Moene @ 1998-07-03  7:20 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: egcs

>  I got a testcase written in C in early 1996! (three
>  different matrix multiply algorithms). This testcase
>  resulted in the -mstack-align-double switch of pgcc, and
>  support for this in glibc and djgpp (not yet distributed,
>  though)

We got one at the end of May / beginning of June '96.  Fortunately, we
were able to help the guy by telling him to "SAVE", i.e. "make static"
in C-speak, a handful of variables.

However, the general problem had already been brought to the
attention of the gcc2 mailing list on the 24th of February '96 by
Robert Dewar (dewar@gnat.com):

>  For P5 and P6 machines, it is important to 8-byte align
>  double floats, I have to check the alignment requirement
>  for 80-bit reals, I would guess it is also 64-bits but
>  I am not sure about that.

Cheers,
Toon.


* Re: Performance measurements
  1998-07-01 21:20 ` Marc Lehmann
  1998-07-02  7:14   ` Craig Burley
@ 1998-07-02 15:15   ` Joern Rennecke
  1 sibling, 0 replies; 18+ messages in thread
From: Joern Rennecke @ 1998-07-02 15:15 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: egcs, axp-list

> 
> I couldn't resist as well:
> 
> -V2.7.2.3 -O6 -mno-align-double					70.99 MFLOPS
> -V2.7.2.3 -O6 -malign-double					113.78 MFLOPS
> 
> -V1.0.3 -O6 -mno-align-double					125.62 MFLOPS
> -V1.0.3 -O6 -malign-double					122.43 MFLOPS
> 
> egcs-980628 -O6 -malign-double -funroll-all-loops		135.17 MFLOPS
> egcs-980628 -O6 -mno-align-double				141.39 MFLOPS
> egcs-980628 -O6 -malign-double -mpentiumpro -march=pentiumpro	144.41 MFLOPS
> 
> pgcc-980628 -O2 -malign-double -mno-stack-align-double		155.72 MFLOPS
> pgcc-980628 -O2 -malign-double -mstack-align-double		155.72 MFLOPS
> pgcc-980628 -O6 -malign-double -mstack-align-double		157.91 MFLOPS
> pgcc-980628 -O6 -malign-double -funroll-all-loops		162.46 MFLOPS
> 					
> -B/root/cc/egcs-mmx/gcc/ -O6 -mmx				142.58 MFLOPS
> 
> (hardware is a P-II 333) 
> 
> now, what is _quite_ interesting is that pgcc (Which means egcs+pentium
> patches) seems to perform better with -O2 than egcs with -O6, which is
> illogical, since I always believed egcs-O2 should be as fast as pgcc-O2 (and
> in general, pgcc's fpu performance is sometimes better, sometimes worse than
> corresponding egcs versions, and many programs perform only slightly better
> with pgcc on p-ii than with egcs).
> 
> I guess I should use that program to show that pgcc is soo much faster ;)
> 
> I also don't see the bad performance of egcs vs. gcc-2.7.2.3
> 
>       -----==-                                              |
>       ----==-- _                                            |
>       ---==---(_)__  __ ____  __       Marc Lehmann       +--
>       --==---/ / _ \/ // /\ \/ /       pcg@goof.com       |e|
>       -=====/_/_//_/\_,_/ /_/\_\                          --+
>     The choice of a GNU generation                        |
>                                                           |
> 



* Re: Performance measurements
@ 1998-06-24 21:23 N8TM
  0 siblings, 0 replies; 18+ messages in thread
From: N8TM @ 1998-06-24 21:23 UTC (permalink / raw)
  To: martin.kahlert; +Cc: egcs

In a message dated 6/24/98 11:05:35 AM Pacific Daylight Time,
martin.kahlert@mchp.siemens.de writes:

I will provide the source and the Makefile 

I tested this with RH5.0, with libc5 and binutils-2.9.1 installed, on a single
P II/233 64 MB.  

egcs-19980517	93.09 MFLOPS
gcc-not-aligned	76.63 MFLOPS
gcc-2.8.1		123.78 MFLOPS

I don't normally see gcc-2.8.1 running so much faster than egcs snapshots.
These speeds and yours are consistent with peak results I have obtained on
Livermore Kernels with g77.


* Re: Performance measurements
@ 1998-06-25  3:09 Christian Iseli
  0 siblings, 0 replies; 18+ messages in thread
From: Christian Iseli @ 1998-06-25  3:09 UTC (permalink / raw)
  To: martin.kahlert, davem; +Cc: egcs, axp-list

> I couldn't resist, with current CVS egcs sources, on a 300Mhz
> UltraSparc w/512K L2 cache running Linux:
> 
> ? ./fpubench
> 142.58 MFLOPS

Ah well, guess I couldn't resist either...
On my alpha box, with RH Linux 5.0, egcs 1.0.3, 600 MHz alpha, 2 MB L2 cache
(Aspen Durango II) I get 191.71 MFLOPS...

Cheers,
					Christian


* Re: Performance measurements
@ 1998-06-25  6:50 Brad M. Garcia
  0 siblings, 0 replies; 18+ messages in thread
From: Brad M. Garcia @ 1998-06-25  6:50 UTC (permalink / raw)
  To: egcs

I ran Martin's test program on my machine.
This should help answer Gerald's question about egcs 1.0.3a.

Single PPro 200, 128MB ram, 2.0.34 kernel, glibc 2 (RH5.0).

gcc2.7.2.3:   77.68 MFLOPS  
egcs1.0.3a:   64.00 MFLOPS
gcc2.7.2.3, not aligned:   23.86 MFLOPS
egcs1.0.3a, not aligned:   61.55 MFLOPS


Brad Garcia
   ___/  __ /  __ /  ___/ "Being the Linux of digital media
  __/   /  /  / _/  __/    would be a very good life."
_/    ____/ _/ _| ____/      - Jean-Louis Gassee, CEO of Be, Inc.



* Re: Performance measurements
@ 1998-06-27 15:52 John Wehle
  0 siblings, 0 replies; 18+ messages in thread
From: John Wehle @ 1998-06-27 15:52 UTC (permalink / raw)
  To: law; +Cc: egcs

> Here's some more info.  PPro200
> 
> egcs-1.0.3	 69.96
> today's sources	 70.40
> 
> Note I get 73.14 if I remove all the various -malign switches.
> 
> Not particularly good.    Can someone look into this?  It might
> be another case of double alignment losing badly.  I don't know
> x86 issues well enough.

Part of the problem is due to loop turning:

(insn 73 71 74 (set (reg:DF 49)
        (mem/s:DF (plus:SI (plus:SI (mult:SI (reg/v:SI 29)
                        (const_int 8))
                    (reg/v:SI 21))
                (const_int 16)))) 74 {movdf+1} (nil)
    (nil))

(insn 74 73 75 (set (reg:DF 50)
        (mult:DF (reg:DF 48)
            (reg:DF 49))) 360 {ffshi_1+1} (nil)
    (nil))

into:

(insn 73 69 74 (set (reg:DF 49)
        (mem/s:DF (reg:SI 104))) -1 (nil)
    (nil))

(insn 74 73 75 (set (reg:DF 50)
        (mult:DF (reg:DF 48)
            (reg:DF 49))) -1 (nil)
    (nil))

which global register allocation turns into:

(insn 295 298 74 (set (reg:SI 2 %ecx)
        (mem:SI (plus:SI (reg:SI 7 %esp)
                (const_int 16)))) -1 (nil)
    (nil))

(insn:HI 74 295 75 (set (reg:DF 9 %st(1))
        (mult:DF (reg:DF 9 %st(1))
            (mem/s:DF (reg:SI 2 %ecx)))) 360 {ffshi_1+1} (nil)
    (nil))

because it had to spill (reg:SI 104) since the Intel 386 is a
register poor machine.  Defining DONT_REDUCE_ADDR when building
egcs results in:

(insn:HI 74 379 75 (set (reg:DF 9 %st(1))
        (mult:DF (reg:DF 9 %st(1))
            (mem/s:DF (plus:SI (plus:SI (mult:SI (reg/v:SI 0 %eax)
                            (const_int 8))
                        (reg:SI 2 %ecx))
                    (const_int 16))))) 360 {ffshi_1+1} (nil)
    (nil))

after global register allocation.  The corresponding benchmark
results on a 233 MHz Pentium II running FreeBSD 3.0 are:

egcs-19980621 aout: 87.91 MFLOPS
egcs-19980621 elf: 86.85 MFLOPS

egcs-19980621 DONT_REDUCE_ADDR aout: 105.24 MFLOPS
egcs-19980621 DONT_REDUCE_ADDR elf: 105.24 MFLOPS

Possible solutions:

  1) Don't call find_mem_givs if SMALL_REGISTER_CLASSES.

  2) Don't consider the giv if SMALL_REGISTER_CLASSES and it's a valid
     memory address for the machine.

  3) Consider the giv but don't take an action which will result in
     a new register / (more registers than before) if SMALL_REGISTER_CLASSES
     and the giv is a valid memory address for the machine.

BTW, I'm pulling these solutions out of thin air as I'm not up to speed
with the operation of loop.

-- John
-------------------------------------------------------------------------
|   Feith Systems  |   Voice: 1-215-646-8000  |  Email: john@feith.com  |
|    John Wehle    |     Fax: 1-215-540-5495  |                         |
-------------------------------------------------------------------------


