From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-72317-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 22361 invoked by alias); 22 Apr 2003 14:43:32 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 22347 invoked from network); 22 Apr 2003 14:43:31 -0000
Received: from unknown (HELO gold.csi.cam.ac.uk) (131.111.8.12)
  by sources.redhat.com with SMTP; 22 Apr 2003 14:43:31 -0000
Received: from cass41.ast.cam.ac.uk ([131.111.69.186])
	by gold.csi.cam.ac.uk with esmtp (Exim 4.12)
	id 197yzb-0008Mk-00
	for gcc@gcc.gnu.org; Tue, 22 Apr 2003 15:43:31 +0100
Received: from xserv1.ast.cam.ac.uk (IDENT:9OEvmMl9X48wazWlyfQtGXgOGGzjmpwL@xserv1.ast.cam.ac.uk [131.111.69.235])
	by cass41.ast.cam.ac.uk (8.12.9+Sun/8.12.9) with ESMTP id h3MEhVgD020484
	for <gcc@gcc.gnu.org>; Tue, 22 Apr 2003 15:43:31 +0100 (BST)
Received: from xpc5.ast.cam.ac.uk (IDENT:SqKPE5ihYHTMBAn3z5RobB4yi2f1oaOw@xpc5.ast.cam.ac.uk [131.111.68.220])
	by xserv1.ast.cam.ac.uk (8.11.6/8.11.6) with ESMTP id h3MEhUl19649
	for <gcc@gcc.gnu.org>; Tue, 22 Apr 2003 15:43:30 +0100
Date: Tue, 22 Apr 2003 15:36:00 -0000
From: Jeremy Sanders <jss@ast.cam.ac.uk>
To: gcc@gcc.gnu.org
Subject: benchmarking (or almabench)
Message-ID: <Pine.LNX.4.55.0304221530120.13881@xpc5.ast.cam.ac.uk>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-SW-Source: 2003-04/txt/msg01060.txt.bz2

I've been looking at compiling the almabench benchmark again with gcc.
See:

http://gcc.gnu.org/ml/gcc/2003-01/msg00037.html

On a Pentium 4 processor I'm getting drastically different run times for
the code generated by icc and gcc. icc produces code which is up to
2.75 times faster than gcc's code for this program.

(with gcc mainline)

/data/jss/gcc-3.3/bin/g++ -o almabench.o -O2 -mfpmath=sse -msse -msse2 -march=pentium4 -finline-limit=10000 -c almabench.cpp
/data/jss/gcc-3.3/bin/g++ -o almabench -O2 -mfpmath=sse -msse -msse2 -march=pentium4 -finline-limit=10000 almabench.o
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
31.121u 0.060s 0:33.31 93.6%	0+0k 0+0io 212pf+0w
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
31.148u 0.052s 0:33.61 92.7%	0+0k 0+0io 212pf+0w

(I've also tried without the SSE and -march options, and there's little
difference. I've also tried -fprofile-arcs, which makes no difference, and
-finline-limit has no real effect.)

With icc 7.1.

xpc5:/<3>almabench-1.0.1/cpp> make
icc -o almabench.o -O2 -c almabench.cpp
icc -o almabench -O2 almabench.o
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
16.494u 0.013s 0:17.71 93.1%	0+0k 0+0io 116pf+0w
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
16.445u 0.029s 0:17.53 93.8%	0+0k 0+0io 116pf+0w

That's nearly 1.9 times faster than gcc's code.


Enabling P4 optimisation in icc (admittedly, gcc can't do vectorization):

xpc5:/<3>almabench-1.0.1/cpp> make
icc -o almabench.o -O2 -tpp7 -xW -march=pentium4 -c almabench.cpp
almabench.cpp(219) : (col. 5) remark: LOOP WAS VECTORIZED.
almabench.cpp(230) : (col. 5) remark: LOOP WAS VECTORIZED.
icc -o almabench -O2 -tpp7 -xW -march=pentium4 almabench.o
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
11.318u 0.005s 0:12.09 93.5%	0+0k 0+0io 116pf+0w
xpc5:/<3>almabench-1.0.1/cpp> time ./almabench
11.277u 0.007s 0:12.08 93.2%	0+0k 0+0io 116pf+0w

That's 2.75 times faster than gcc's code.


Obviously this benchmark is synthetic, but it suggests gcc isn't
optimising something in this code very well. We've also seen similar
effects with other floating-point intensive code. Any suggestions? I can
supply assembler output for both if anyone would like a look!

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053