From: ebiederm@xmission.com (Eric W. Biederman)
To: gcc@gcc.gnu.org
Subject: streams is slow under gcc
Date: Sat, 12 Oct 2002 17:20:00 -0000

STREAM (http://www.cs.virginia.edu/stream/) is a benchmark that measures
the effective memory bandwidth of a platform.  Effective in this case
means how much sustained memory bandwidth the compiler can get out of
the system.  The goal is to give some indication of how well
memory-bandwidth-limited applications will run.

I have done some investigation of the results for gcc on x86.  On
recent platforms (Athlon, P4 Xeon, and x86_64) gcc generally achieves
50% of the achievable memory bandwidth.  This includes gcc-3.2 with
SSE support.

To come to the above conclusion I wrote a hand-optimized memory copy
to see what the achievable, as opposed to theoretical, memory
bandwidth numbers were, as well as a hand-optimized memory read and a
hand-optimized memory write.
Rough numbers:

Memory               CPU      Theoretical  Achievable  gcc       intel-7beta
PC2100               Athlon   2100MB/s     2000MB/s     800MB/s
PC2700               x86_64   2700MB/s     2670MB/s    1200MB/s  1500MB/s
Dual Channel PC1600  P4 Xeon  3200MB/s     2800MB/s    1400MB/s  1700MB/s

For a hand-optimized memory read or a hand-optimized memory write I
only get about 2/3 of the theoretical bandwidth, while I come very
close to the theoretical for copy operations.

The hand-optimized assembly does the following things:

1) Processes data in chunks small enough to fit in the L1 cache.
2) For each chunk, first walks backwards through the chunk reading one
   32-bit word per cache line, forcing the data into the cache.
3) Fills all 8 SSE registers with data from the chunk, going forward.
4) Stores all 8 SSE registers using non-temporal stores.

Using a non-temporal store for this kind of application raises my
store speed by 3x.  A non-intuitive but very interesting result.

Is there a gcc option, other than -msse2, that I could use to tell it
a loop is memory bound, so that it applies the appropriate
optimizations?  If not, what would be the recommended course to fix
gcc so that it performs well in memory-bandwidth-limited code?

Eric