public inbox for gcc@gcc.gnu.org
* RE: benchmarking (or almabench)
@ 2003-04-22 15:38 S. Bosscher
  2003-04-22 16:00 ` Daniel Berlin
  2003-04-22 16:09 ` Jeremy Sanders
  0 siblings, 2 replies; 5+ messages in thread
From: S. Bosscher @ 2003-04-22 15:38 UTC (permalink / raw)
  To: 'Jeremy Sanders ', 'gcc@gcc.gnu.org '

-march=pentium4 is known to pessimise code compared to -march=i686 for some
benchmarks, see PR 8474.  Maybe you're seeing the same problem?

Greetz
Steven


-----Original Message-----
From: Jeremy Sanders
To: gcc@gcc.gnu.org
Sent: 22-4-03 16:43
Subject: benchmarking (or almabench)

I've been looking at compiling the almabench benchmark again with gcc.
See:

http://gcc.gnu.org/ml/gcc/2003-01/msg00037.html

With a pentium4 processor I'm getting drastically different times for
running the code output from icc and gcc. icc produces code which is up
to 2.7 times faster than gcc code for this program.

(with gcc mainline)

/data/jss/gcc-3.3/bin/g++ -o almabench.o -O2 -mfpmath=sse -msse -msse2 -march=pentium4 -finline-limit=10000 -c almabench.cpp
/data/jss/gcc-3.3/bin/g++ -o almabench -O2 -mfpmath=sse -msse -msse2 -march=pentium4 -finline-limit=10000 almabench.o
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
31.121u 0.060s 0:33.31 93.6%	0+0k 0+0io 212pf+0w
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
31.148u 0.052s 0:33.61 92.7%	0+0k 0+0io 212pf+0w

(I've also tried without the SSE and -march options, and there's little
difference. I've also tried -fprofile-arcs, which makes no difference,
and -finline-limit has no real effect.)

With icc 7.1.

xpc5:/<3>almabench-1.0.1/cpp> make
icc -o almabench.o -O2 -c almabench.cpp
icc -o almabench -O2 almabench.o
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
16.494u 0.013s 0:17.71 93.1%	0+0k 0+0io 116pf+0w
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
16.445u 0.029s 0:17.53 93.8%	0+0k 0+0io 116pf+0w

That's 88% faster than gcc.


Enabling P4 optimisation (okay gcc can't do vectorization):

xpc5:/<3>almabench-1.0.1/cpp> make
icc -o almabench.o -O2 -tpp7 -xW -march=pentium4 -c almabench.cpp
almabench.cpp(219) : (col. 5) remark: LOOP WAS VECTORIZED.
almabench.cpp(230) : (col. 5) remark: LOOP WAS VECTORIZED.
icc -o almabench -O2 -tpp7 -xW -march=pentium4 almabench.o
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
11.318u 0.005s 0:12.09 93.5%	0+0k 0+0io 116pf+0w
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
11.277u 0.007s 0:12.08 93.2%	0+0k 0+0io 116pf+0w

That's 2.75 times faster than gcc's code.
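
The two ratios quoted above can be reproduced from the raw timings. What
follows is a minimal sketch, assuming the lines come from the tcsh built-in
`time` (user seconds, system seconds, elapsed m:ss, CPU share) and that mean
user CPU time is the figure being compared. The helper names
`parse_time_line` and `mean_user` are illustrative only; they are not part
of almabench or of either compiler.

```python
import re

# tcsh built-in `time` format: "<user>u <system>s <m:ss.cc> <cpu>% ..."
# Only the m:ss elapsed form seen in this thread is handled.
TIME_RE = re.compile(r"^(\d+\.\d+)u\s+(\d+\.\d+)s\s+(\d+):(\d+\.\d+)\s+(\d+\.\d+)%")


def parse_time_line(line):
    """Return (user_s, system_s, elapsed_s) from one tcsh `time` line."""
    m = TIME_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a tcsh time line: {line!r}")
    user, system, minutes, seconds, _cpu = m.groups()
    return float(user), float(system), int(minutes) * 60 + float(seconds)


def mean_user(lines):
    """Mean user CPU seconds over several runs."""
    users = [parse_time_line(line)[0] for line in lines]
    return sum(users) / len(users)


# Timings copied from the message above.
gcc_o2 = [
    "31.121u 0.060s 0:33.31 93.6%\t0+0k 0+0io 212pf+0w",
    "31.148u 0.052s 0:33.61 92.7%\t0+0k 0+0io 212pf+0w",
]
icc_o2 = [
    "16.494u 0.013s 0:17.71 93.1%\t0+0k 0+0io 116pf+0w",
    "16.445u 0.029s 0:17.53 93.8%\t0+0k 0+0io 116pf+0w",
]
icc_p4 = [
    "11.318u 0.005s 0:12.09 93.5%\t0+0k 0+0io 116pf+0w",
    "11.277u 0.007s 0:12.08 93.2%\t0+0k 0+0io 116pf+0w",
]

ratio_o2 = mean_user(gcc_o2) / mean_user(icc_o2)
ratio_p4 = mean_user(gcc_o2) / mean_user(icc_p4)
print(f"gcc -O2 / icc -O2:          {ratio_o2:.2f}x")
print(f"gcc -O2 / icc -O2 -tpp7 -xW: {ratio_p4:.2f}x")
```

On the figures in this message the ratios come out at roughly 1.89 and
2.76, in line with the 88% and 2.75 times quoted above.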


Obviously this benchmark is synthetic, but it suggests gcc isn't
optimising something in this code very well. We've also seen similar
effects with other floating-point intensive code. Any suggestions? I can
supply assembler output for both if anyone would like a look!

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053


* Re: benchmarking (or almabench)
  2003-04-22 15:38 benchmarking (or almabench) S. Bosscher
@ 2003-04-22 16:00 ` Daniel Berlin
  2003-04-22 16:11   ` Jeremy Sanders
  2003-04-22 16:09 ` Jeremy Sanders
  1 sibling, 1 reply; 5+ messages in thread
From: Daniel Berlin @ 2003-04-22 16:00 UTC (permalink / raw)
  To: S. Bosscher; +Cc: 'Jeremy Sanders ', 'gcc@gcc.gnu.org '


On Tuesday, April 22, 2003, at 11:08  AM, S. Bosscher wrote:

> -march=pentium4 is known to pessimise code compared to -march=i686 for 
> some
> benchmarks, see PR 8474.  Maybe you're seeing the same problem?

Actually, if i had to guess, i'd put my money on the vectorization.
Notice ICC vectorized two loops in his example, and obviously, we 
vectorized 0.
:)

If those were compute intensive loops, ....



* RE: benchmarking (or almabench)
  2003-04-22 15:38 benchmarking (or almabench) S. Bosscher
  2003-04-22 16:00 ` Daniel Berlin
@ 2003-04-22 16:09 ` Jeremy Sanders
  1 sibling, 0 replies; 5+ messages in thread
From: Jeremy Sanders @ 2003-04-22 16:09 UTC (permalink / raw)
  To: S. Bosscher; +Cc: 'gcc@gcc.gnu.org '

On Tue, 22 Apr 2003, S. Bosscher wrote:

> -march=pentium4 is known to pessimise code compared to -march=i686 for some
> benchmarks, see PR 8474.  Maybe you're seeing the same problem?

No, I get this using -march=i686 instead of pentium4.

xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
31.742u 0.226s 0:35.01 91.2%	0+0k 0+0io 89pf+0w
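
This single i686 run is within a few percent of the pentium4 timings, so
run-to-run noise matters when comparing the two. A minimal sketch of
repeating the measurement, assuming a POSIX system where `os.times()`
reports child CPU time; the function name `user_cpu_seconds` is
hypothetical and the command to run is whatever binary is being compared:

```python
import os
import subprocess


def user_cpu_seconds(cmd, runs=3):
    """Run `cmd` several times; return the child user-CPU seconds per run."""
    samples = []
    for _ in range(runs):
        before = os.times().children_user
        # subprocess.run waits for the child, so its CPU time is
        # included in children_user by the time it returns.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        after = os.times().children_user
        samples.append(after - before)
    return samples
```

Calling `user_cpu_seconds(["./almabench"])` once per build would give a
small sample for each set of flags rather than a single figure.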

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053


* Re: benchmarking (or almabench)
  2003-04-22 16:00 ` Daniel Berlin
@ 2003-04-22 16:11   ` Jeremy Sanders
  0 siblings, 0 replies; 5+ messages in thread
From: Jeremy Sanders @ 2003-04-22 16:11 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: S. Bosscher, 'gcc@gcc.gnu.org '

On Tue, 22 Apr 2003, Daniel Berlin wrote:

> On Tuesday, April 22, 2003, at 11:08  AM, S. Bosscher wrote:
>
> > -march=pentium4 is known to pessimise code compared to -march=i686 for
> > some
> > benchmarks, see PR 8474.  Maybe you're seeing the same problem?
>
> Actually, if i had to guess, i'd put my money on the vectorization.
> Notice ICC vectorized two loops in his example, and obviously, we
> vectorized 0.
> :)

The intel compiler doesn't seem to vectorize with just "-O2" (by default
it should report whether it is using vectorization), and that's still 88%
faster than gcc. I can't absolutely confirm there's no vectorization as I
can't see a switch to turn it off.

icc says it's vectorizing when the P4 specific options are enabled (which
gcc can't do yet).

Even with all optimization turned off in icc (-O0), its code is still
faster than gcc's!

xpc5:/<3>almabench-1.0.1/cpp> make
icc -o almabench.o -O0 -c almabench.cpp
icc -o almabench -O0 almabench.o
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
23.853u 0.134s 0:25.82 92.8%	0+0k 0+0io 121pf+0w
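
The argument above can be put in numbers. This is a rough sketch using the
first user-CPU figure reported for each configuration in this thread. It
assumes, as suggested above but not confirmed, that icc does not vectorize
at plain -O2, so the saving that appears only with the P4 flags is at most
an upper bound on what vectorization contributes (those flags also change
other tuning). The dictionary labels are shorthand, not exact command lines.

```python
# User CPU seconds, first run of each configuration reported in this thread.
user_seconds = {
    "gcc -O2 -march=pentium4": 31.121,
    "gcc -O2 -march=i686": 31.742,
    "icc -O0": 23.853,
    "icc -O2": 16.494,
    "icc -O2 -tpp7 -xW": 11.318,
}

baseline = user_seconds["gcc -O2 -march=pentium4"]


def speedup_over_baseline(name):
    """How many times faster `name` is than the gcc pentium4 build."""
    return baseline / user_seconds[name]


for name, secs in user_seconds.items():
    print(f"{name:26s} {secs:7.3f}s  {speedup_over_baseline(name):4.2f}x")

# Split the gcc-to-fastest-icc saving into the part already present at
# plain -O2 and the part that appears only with the P4-specific flags.
total_saving = baseline - user_seconds["icc -O2 -tpp7 -xW"]
p4_saving = user_seconds["icc -O2"] - user_seconds["icc -O2 -tpp7 -xW"]
p4_share = p4_saving / total_saving
print(f"share of saving seen only with P4 flags: {p4_share:.0%}")
```

With these numbers about a quarter of the total saving appears only when
the P4 flags are enabled; the rest is already present at plain -O2.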

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053


* benchmarking (or almabench)
@ 2003-04-22 15:36 Jeremy Sanders
  0 siblings, 0 replies; 5+ messages in thread
From: Jeremy Sanders @ 2003-04-22 15:36 UTC (permalink / raw)
  To: gcc

I've been looking at compiling the almabench benchmark again with gcc.
See:

http://gcc.gnu.org/ml/gcc/2003-01/msg00037.html

With a pentium4 processor I'm getting drastically different times for
running the code output from icc and gcc. icc produces code which is up to
2.7 times faster than gcc code for this program.

(with gcc mainline)

/data/jss/gcc-3.3/bin/g++ -o almabench.o -O2 -mfpmath=sse -msse -msse2 -march=pentium4 -finline-limit=10000 -c almabench.cpp
/data/jss/gcc-3.3/bin/g++ -o almabench -O2 -mfpmath=sse -msse -msse2 -march=pentium4 -finline-limit=10000 almabench.o
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
31.121u 0.060s 0:33.31 93.6%	0+0k 0+0io 212pf+0w
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
31.148u 0.052s 0:33.61 92.7%	0+0k 0+0io 212pf+0w

(I've also tried without the SSE and -march options, and there's little
difference. I've also tried -fprofile-arcs, which makes no difference,
and -finline-limit has no real effect.)

With icc 7.1.

xpc5:/<3>almabench-1.0.1/cpp> make
icc -o almabench.o -O2 -c almabench.cpp
icc -o almabench -O2 almabench.o
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
16.494u 0.013s 0:17.71 93.1%	0+0k 0+0io 116pf+0w
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
16.445u 0.029s 0:17.53 93.8%	0+0k 0+0io 116pf+0w

That's 88% faster than gcc.


Enabling P4 optimisation (okay gcc can't do vectorization):

xpc5:/<3>almabench-1.0.1/cpp> make
icc -o almabench.o -O2 -tpp7 -xW -march=pentium4 -c almabench.cpp
almabench.cpp(219) : (col. 5) remark: LOOP WAS VECTORIZED.
almabench.cpp(230) : (col. 5) remark: LOOP WAS VECTORIZED.
icc -o almabench -O2 -tpp7 -xW -march=pentium4 almabench.o
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
11.318u 0.005s 0:12.09 93.5%	0+0k 0+0io 116pf+0w
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
11.277u 0.007s 0:12.08 93.2%	0+0k 0+0io 116pf+0w

That's 2.75 times faster than gcc's code.


Obviously this benchmark is synthetic, but it suggests gcc isn't
optimising something in this code very well. We've also seen similar
effects with other floating-point intensive code. Any suggestions? I can
supply assembler output for both if anyone would like a look!

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053



Thread overview: 5+ messages
2003-04-22 15:38 benchmarking (or almabench) S. Bosscher
2003-04-22 16:00 ` Daniel Berlin
2003-04-22 16:11   ` Jeremy Sanders
2003-04-22 16:09 ` Jeremy Sanders
  -- strict thread matches above, loose matches on Subject: below --
2003-04-22 15:36 Jeremy Sanders
