From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-148086-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 28103 invoked by alias); 24 Jul 2008 08:04:20 -0000
Received: (qmail 28095 invoked by uid 22791); 24 Jul 2008 08:04:18 -0000
X-Spam-Check-By: sourceware.org
Received: from smtp.fullrate.dk (HELO dns2.fullrate.dk) (89.150.129.5)     by sourceware.org (qpsmtpd/0.31) with ESMTP; Thu, 24 Jul 2008 08:03:48 +0000
Received: from [192.168.1.33] (3604ds3-fb.0.fullrate.dk [90.184.27.253]) 	by dns2.fullrate.dk (Postfix) with ESMTP id C638A5CE28; 	Thu, 24 Jul 2008 10:03:43 +0200 (CEST)
Message-ID: <4888375A.30601@agner.org>
Date: Thu, 24 Jul 2008 09:41:00 -0000
From: Agner Fog <agner@agner.org>
User-Agent: Thunderbird 2.0.0.14 (Windows/20080421)
MIME-Version: 1.0
To: dclarke@opensolaris.org
CC: gcc@gcc.gnu.org, TimothyPrince@sbcglobal.net
Subject: Re: gcc will become the best optimizing x86 compiler
References: <2E073B3ABB3F664DBA1D1C4D5FB47EF40EBDAD8E@NT-IRVA-0752.brcm.ad.broadcom.com> 	 <4887592E.4040804@agner.org> <a6265da20807231908h106c44a0s6271c09152f92ce3@mail.gmail.com>
In-Reply-To: <a6265da20807231908h106c44a0s6271c09152f92ce3@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-IsSubscribed: yes
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
X-SW-Source: 2008-07/txt/msg00426.txt.bz2

Dennis Clarke wrote:
 >The Sun Studio 12 compiler with Solaris 10 on AMD Opteron or
 >UltraSparc beats GCC in almost every single test case that I have
 >seen.

This is memcpy on Solaris:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/i386/gen/memcpy.s

It uses exactly the same method as the memcpy in the GNU libc used with 
gcc, with only minor differences that have no influence on performance.

> Also, you have provided no data at all.  

I have linked to the data rather than copying it here to save space on 
the mailing list. Here is the link again:
http://www.agner.org/optimize/optimizing_cpp.pdf  section 2.6, page 12.

> So your assertions are those of a marketing person at the moment.

Who sounds like a marketing person, you or me? :-)

 > Please post some code that can be compiled and then tested with high
 > resolution timers and perhaps we can compare notes.

Here is my code, again:
http://www.agner.org/optimize/asmlib.zip
My test results, referred to above, use the "core clock cycles" 
performance counter on Intel and RDTSC on AMD. It's the highest 
resolution you can get. Feel free to do your own tests; it's as simple as 
linking my library into your test program.
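
For anyone who wants to repeat the measurement, here is a minimal sketch 
of a timing harness in C. It is not the test program behind the numbers 
in my manual. The names read_ticks, time_copy and copy_fn are made up for 
this illustration. It reads RDTSC through the __rdtsc() intrinsic on x86 
and falls back to clock_gettime() elsewhere; reading the "core clock 
cycles" performance counter needs extra privileges and is not shown. 
Note that on many CPUs RDTSC counts at a fixed reference frequency rather 
than in core clock cycles.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Hypothetical helper: read a high-resolution tick counter.
   RDTSC on x86, nanoseconds from a monotonic clock elsewhere. */
static uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Any function with the memcpy signature can be plugged in here. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Time `runs` copies of n bytes with the given source and destination
   offsets (use nonzero offsets to get unaligned operands).
   Stores the smallest tick count in *best_ticks.
   Returns 1 if the copied data is correct, 0 on bad arguments or mismatch. */
static int time_copy(copy_fn copy, size_t n, size_t src_off, size_t dst_off,
                     int runs, uint64_t *best_ticks)
{
    static unsigned char src_buf[1 << 16], dst_buf[1 << 16];
    uint64_t best = UINT64_MAX;

    if (n + src_off > sizeof src_buf || n + dst_off > sizeof dst_buf)
        return 0;

    for (size_t i = 0; i < n; i++)
        src_buf[src_off + i] = (unsigned char)i;

    for (int r = 0; r < runs; r++) {
        uint64_t t0 = read_ticks();
        copy(dst_buf + dst_off, src_buf + src_off, n);
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }

    *best_ticks = best;
    /* Reading the destination back also keeps the copy from being optimized away. */
    return memcmp(dst_buf + dst_off, src_buf + src_off, n) == 0;
}
```

Call it as time_copy(memcpy, 4096, 1, 3, 100, &ticks) for the library 
under test, then again with the replacement function after linking the 
other library, and compare the two tick counts. Taking the minimum over 
many runs filters out interrupts and cold-cache effects.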

Tim Prince wrote:
 >you identify the library you tested only as "ubuntu g++ 4.2.3."
Where can I see the libc version?

 >The corresponding 64-bit linux will see vastly different levels of
 >performance, depending on the glibc version, as it doesn't use a
 >builtin string move.
Yes, this is exactly what my tests show. 64-bit libc is better than 
32-bit libc, but still 3-4 times slower than the best library for 
unaligned operands on an Intel CPU.

 >Certain newer CPUs aim to improve performance of the 32-bit gcc
 >builtin string moves, but don't entirely eliminate the situations
 >where it isn't optimum.

The Intel manuals are not clear about this. The Intel optimization 
reference manual says:
 >In most cases, applications should take advantage of the default
 >memory routines provided by Intel compilers.
What excellent advice - the Intel compiler puts in a library with an 
automatic run-slowly-on-AMD feature!
The Intel library does not use rep movs when running on an Intel CPU.

The AMD software optimization guide mentions specific situations where 
rep movs is optimal. However, my tests on an Opteron (K8) show that rep 
movs is never optimal on AMD either. I have not been able to test on the 
new AMD K10, but I expect the XMM register code to run much faster on 
K10 than on K8 because K10 has 128-bit data paths where K8 has only 64-bit.
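
To make the two approaches concrete, here is a rough sketch in C of a 
rep movs copy and a copy through XMM registers. This is only an 
illustration of the two techniques, not the code in my library, in glibc 
or in the Intel library; tuned library routines do more than this. The 
function names copy_rep_movs and copy_xmm are made up for this example. 
The inline assembly assumes a GCC-compatible compiler and that the 
direction flag is clear, as the ABI requires.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical example: copy with the string instruction REP MOVSB. */
static void *copy_rep_movs(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if defined(__x86_64__) || defined(__i386__)
    /* RDI/EDI = destination, RSI/ESI = source, RCX/ECX = byte count. */
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory");
#else
    memcpy(dst, src, n);   /* non-x86 fallback so the sketch still builds */
#endif
    return ret;
}

/* Hypothetical example: copy 16 bytes at a time through an XMM register
   using unaligned SSE2 loads and stores, then finish byte by byte. */
static void *copy_xmm(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
#if defined(__SSE2__)
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif
    while (n > 0) {          /* tail, or the whole copy without SSE2 */
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Both functions have the same signature as memcpy, so either one can be 
timed against the library memcpy on a given CPU with unaligned source 
and destination pointers.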

Evidently, the problem with memcpy has been ignored for years, see 
http://softwarecommunity.intel.com/Wiki/Linux/719.htm