public inbox for gcc@gcc.gnu.org
* streams is slow under gcc
@ 2002-10-12 17:20 Eric W. Biederman
  2002-10-15 15:59 ` Richard Henderson
  0 siblings, 1 reply; 2+ messages in thread
From: Eric W. Biederman @ 2002-10-12 17:20 UTC
  To: gcc


Streams (http://www.cs.virginia.edu/stream/) is a benchmark that
measures the effective memory bandwidth of a platform.  Effective in
this case means how much sustained memory bandwidth the compiler can
get out of the system.  The goal is to give some indication of how
well memory bandwidth limited applications will run.
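
To make "sustained memory bandwidth" concrete, here is a minimal
sketch of how such a number can be measured for a plain copy loop.
It is not the benchmark's own code: the function names, the best-of-
several-trials timing, and the 1 MB = 10^6 bytes convention are all
assumptions of this sketch.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Convert a byte count and an elapsed time to MB/s.
   Assumption of this sketch: 1 MB = 10^6 bytes. */
double mb_per_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1.0e6;
}

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
}

/* Time a plain copy loop `trials` times and return the best rate.
   Each element is read once and written once, so one pass moves
   2 * sizeof(double) * n bytes.  Returns 0.0 if no pass took a
   measurable amount of time. */
double copy_bandwidth_mb_s(double *dst, const double *src,
                           size_t n, int trials)
{
    double best = 0.0;
    assert(trials > 0);
    for (int t = 0; t < trials; t++) {
        double start = now_seconds();
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
        double elapsed = now_seconds() - start;
        if (elapsed > 0.0) {
            double rate = mb_per_s(2.0 * sizeof(double) * (double)n,
                                   elapsed);
            if (rate > best)
                best = rate;
        }
    }
    return best;
}
```

The arrays should be much larger than the last level of cache,
otherwise the loop measures cache bandwidth rather than memory
bandwidth.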

I have done some investigation of the results for gcc on x86.  On
recent platforms (Athlon, P4 Xeon, and x86_64) gcc generally achieves
50% of the achievable memory bandwidth.  This includes gcc-3.2 with
SSE support.

To come to the above conclusion I wrote a hand optimized memory copy,
as well as a hand optimized memory read and a hand optimized memory
write, to see what the achievable, as opposed to theoretical, memory
bandwidth numbers were.

Rough numbers:
Memory               CPU     Theoretical  Achievable  gcc       intel-7beta
PC2100               Athlon  2100MB/s     2000MB/s    800MB/s
PC2700               x86_64  2700MB/s     2670MB/s    1200MB/s  1500MB/s
Dual Channel PC1600  P4Xeon  3200MB/s     2800MB/s    1400MB/s  1700MB/s
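
As a quick check of the ratios in the table, the sketch below
expresses the gcc figure as a percentage of the achievable figure
for each row.  The struct and function names are made up for this
illustration; the numbers are the ones from the table.  The result
is 40% for the Athlon, about 45% for x86_64, and 50% for the P4 Xeon.

```c
#include <assert.h>

/* Express `part` as a percentage of `whole`. */
double percent_of(double part, double whole)
{
    assert(whole > 0.0);
    return 100.0 * part / whole;
}

/* One row of the table, bandwidths in MB/s. */
struct row {
    const char *cpu;
    double achievable;
    double gcc;
};

const struct row rows[] = {
    { "Athlon", 2000.0,  800.0 },
    { "x86_64", 2670.0, 1200.0 },
    { "P4Xeon", 2800.0, 1400.0 },
};
```

Calling percent_of(rows[i].gcc, rows[i].achievable) for each row
gives the percentages quoted above.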

For a hand optimized memory read or a hand optimized memory write I
only get about 2/3 of the theoretical, while I come very close to the
theoretical for copy operations.

The hand optimized assembly does the following things:
1) Processes data in chunks small enough to fit in the L1 cache.
2) For each chunk, first walks backwards through the chunk reading
   one 32-bit word per cache line, forcing the data into the cache.
3) Fills all 8 SSE registers with data from the chunk, going forward.
4) Stores all 8 SSE registers using a non-temporal store.
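
The four steps above can be sketched in C with SSE intrinsics
instead of assembly.  This is an illustration under stated
assumptions, not the original routine: the function name, the
64-byte cache line, and the 4 KiB chunk size are guesses made for
the sketch, and the compiler (not the programmer) decides whether
the eight values really stay in eight registers.  Without SSE it
falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* _mm_load_ps, _mm_stream_ps, _mm_sfence */
#endif

#define FLOATS_PER_LINE ((size_t)16)    /* assumed 64-byte cache line */
#define CHUNK_FLOATS    ((size_t)1024)  /* assumed 4 KiB chunk (step 1) */

/* Copy n floats from src to dst.  Both pointers must be 16-byte
   aligned and n must be a multiple of 32 (8 registers x 4 floats). */
void stream_copy(float *dst, const float *src, size_t n)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(n % 32 == 0);
#if defined(__SSE__)
    for (size_t base = 0; base < n; base += CHUNK_FLOATS) {
        size_t len = (n - base < CHUNK_FLOATS) ? n - base : CHUNK_FLOATS;
        const volatile float *touch = src + base;

        /* Step 2: walk backwards, reading one 32-bit word per
           (assumed) cache line to pull the chunk into the cache. */
        for (size_t i = len; i >= FLOATS_PER_LINE; i -= FLOATS_PER_LINE)
            (void)touch[i - FLOATS_PER_LINE];

        for (size_t i = 0; i < len; i += 32) {
            const float *s = src + base + i;
            float *d = dst + base + i;
            /* Step 3: load eight 16-byte values, going forward. */
            __m128 r0 = _mm_load_ps(s +  0), r1 = _mm_load_ps(s +  4);
            __m128 r2 = _mm_load_ps(s +  8), r3 = _mm_load_ps(s + 12);
            __m128 r4 = _mm_load_ps(s + 16), r5 = _mm_load_ps(s + 20);
            __m128 r6 = _mm_load_ps(s + 24), r7 = _mm_load_ps(s + 28);
            /* Step 4: non-temporal stores, which bypass the cache. */
            _mm_stream_ps(d +  0, r0); _mm_stream_ps(d +  4, r1);
            _mm_stream_ps(d +  8, r2); _mm_stream_ps(d + 12, r3);
            _mm_stream_ps(d + 16, r4); _mm_stream_ps(d + 20, r5);
            _mm_stream_ps(d + 24, r6); _mm_stream_ps(d + 28, r7);
        }
    }
    _mm_sfence();  /* make the streamed stores globally visible */
#else
    memcpy(dst, src, n * sizeof *dst);  /* portable fallback */
#endif
}
```

If src is not cache line aligned, the backward walk may miss the
last line of a chunk; that only costs speed, not correctness.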

Using a non-temporal store for this kind of application raises
my store speed by 3x.  A non-intuitive but very interesting result.

Is there a gcc option other than -msse2 that I could use to tell it a
loop is memory bound, and so apply appropriate optimizations?

If not, what would be the recommended course to fix gcc so that it
performs well in memory bandwidth limited code?

Eric


* Re: streams is slow under gcc
  2002-10-12 17:20 streams is slow under gcc Eric W. Biederman
@ 2002-10-15 15:59 ` Richard Henderson
  0 siblings, 0 replies; 2+ messages in thread
From: Richard Henderson @ 2002-10-15 15:59 UTC
  To: Eric W. Biederman; +Cc: gcc

On Sat, Oct 12, 2002 at 11:41:51AM -0600, Eric W. Biederman wrote:
> Is there a gcc option other than -msse2 that I could use to tell it a
> loop is memory bound, and so apply appropriate optimizations?

-fprefetch-loop-arrays; YMMV.


r~
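
For anyone wanting to try the suggested option, a minimal sketch of
two STREAM-style kernels and a possible command line follows.  The
file name and function names are hypothetical, and whether the
option helps depends on the target and the gcc version, so compare
timings with and without it.

```c
/* Build, for example (the file name is hypothetical):
 *   gcc -O2 -msse2 -fprefetch-loop-arrays stream_kernels.c
 * and compare against a build without -fprefetch-loop-arrays.
 */
#include <assert.h>
#include <stddef.h>

/* STREAM-style "copy" kernel: c[i] = a[i]. */
void kernel_copy(double *c, const double *a, size_t n)
{
    assert(n == 0 || (c != NULL && a != NULL));
    for (size_t i = 0; i < n; i++)
        c[i] = a[i];
}

/* STREAM-style "triad" kernel: a[i] = b[i] + scalar * c[i]. */
void kernel_triad(double *a, const double *b, const double *c,
                  double scalar, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}
```

Both loops walk large arrays with unit stride, which is the access
pattern the option is meant for.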
