From: ebiederm@xmission.com (Eric W. Biederman)
To: gcc@gcc.gnu.org
Subject: streams is slow under gcc
Date: Sat, 12 Oct 2002 17:20:00 -0000

STREAM (http://www.cs.virginia.edu/stream/) is a benchmark that measures
the effective memory bandwidth of a platform.  Effective in this case
means how much sustained memory bandwidth the compiler can get out of
the system.  The goal is to give some indication of how well
memory-bandwidth-limited applications will run.

I have done some investigation of the results for gcc on x86.  On
recent platforms (Athlon, P4 Xeon, and x86_64) gcc generally achieves
50% of the achievable memory bandwidth.  This includes gcc-3.2 with
SSE support.

To come to the above conclusion I wrote a hand-optimized memory copy
to see what the achievable, as opposed to theoretical, memory
bandwidth numbers were, as well as a hand-optimized memory read and a
hand-optimized memory write.
Rough numbers:

Memory               CPU      Theoretical  Achievable  gcc       intel-7beta
PC2100               Athlon   2100MB/s     2000MB/s     800MB/s
PC2700               x86_64   2700MB/s     2670MB/s    1200MB/s  1500MB/s
Dual Channel PC1600  P4 Xeon  3200MB/s     2800MB/s    1400MB/s  1700MB/s

For a hand-optimized memory read or a hand-optimized memory write I
only get about 2/3 of the theoretical bandwidth, while I come very
close to the theoretical for copy operations.

The hand-optimized assembly does the following things:

1) Processes data in chunks small enough to fit in the L1 cache.
2) For each chunk, first walks backwards through the chunk reading one
   32-bit word per cache line, forcing the data into the cache.
3) Fills all 8 SSE registers with data from the chunk, going forward.
4) Stores all 8 SSE registers using non-temporal stores.

Using a non-temporal store for this kind of application raises my
store speed by 3x.  A non-intuitive but very interesting result.

Is there a gcc option, other than -msse2, that I could use to tell it
a loop is memory bound, so that it applies the appropriate
optimizations?  If not, what would be the recommended course to fix
gcc so that it performs well in memory-bandwidth-limited code?

Eric