Date: Sun, 06 Apr 2003 22:17:00 -0000
From: Kevin Atkinson
Subject: Re: Slow memcmp for aligned strings on Pentium 3
In-reply-to: <16015.39995.61148.812914@gargle.gargle.HOWL>
To: gcc@gcc.gnu.org, Jerry Quinn

On Sat, 5 Apr 2003, Jerry Quinn wrote:

> Kevin Atkinson writes:
> > On Fri, 4 Apr 2003, Jerry Quinn wrote:
> >
> > > I just tried the same benchmark on a Pentium 4 out of curiosity.  Slightly
> > > different results:
> > >
> > > Memory compare int:
> > >   10000
> > >   130000
> > > Speed up: 0.076923
> > > Memory compare 15 bytes:
> > >   10000
> > >   370000
> > > Speed up: 0.027027
> > > Memory compare 16 bytes:
> > >   20000
> > >   330000
> > > Speed up: 0.060606
> > > Memory compare 64 bytes:
> > >   10000
> > >   1040000
> > > Speed up: 0.009615
> > > Memory compare 256 bytes:
> > >   20000
> > >   2300000
> > > Speed up: 0.008696
> > >
> > > Perhaps this is to be expected since the routine uses shifts.
> >
> > The shift are only used in the case size is not divisible by 4.  It seams
> > that on the Pentium 4 cmps is the way to go.  You might also want to
> > increase the number of loop iterations to get more meaning full results
> > due the limited precision of clock().
>
> Adding iterations didn't change the relative scores significantly.  It
> still loses big on P4.  It also loses big on Athlon.  Here are Athlon
> results using the later version you posted with 10x iterations:
>
> jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -march=athlon cmps.c
> jlquinn@smaug:~/gcc/test$ ./a.out
> Memory compare 15 bytes:
>   310000
>   5810000
> Speed up: 0.053356
> Memory compare 16 bytes:
>   300000
>   5290000
> Speed up: 0.056711
> Memory compare 64 bytes:
>   460000
>   13770000
> Speed up: 0.033406
> Memory compare 256 bytes:
>   470000

This is extremely interesting.  Does anyone have any documentation on
cmps behavior on the P4 and Athlon?

It could be that the processor is somehow "caching" the results of
cmps.  Maybe it has to do with the fact that the strings are all 0
except for the end, or with the fact that the strings do not change.
Or maybe cmps is just extremely fast, but how?  Or it could be that my
loop needs unrolling for better pipeline performance.

I don't have a P4 or Athlon, so if someone could play around with my
code and test some of these theories I would appreciate it.

I just ran the test on a Pentium MMX and I got results similar to those
on my P3.  So at the very least it seems that something similar to my
code is the way to go for Pentiums up to the P3.

---
http://kevin.atkinson.dhs.org
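
For readers without cmps.c at hand, here is a minimal sketch of the kind
of routine being compared against cmps: check four bytes at a time, then
finish with single bytes.  It is not the code from this thread.  The name
wordwise_memcmp is hypothetical, and where the routine discussed above
uses shifts when the size is not divisible by 4, this sketch swaps in a
plain byte loop so that it does not depend on byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not taken from cmps.c: compares n bytes and
   returns <0, 0 or >0 with the same meaning as memcmp. */
static int wordwise_memcmp(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;

    /* Fast path: test equality one 32-bit word at a time. */
    while (n >= 4) {
        uint32_t wa, wb;
        memcpy(&wa, pa, 4);   /* memcpy keeps the load alignment-safe */
        memcpy(&wb, pb, 4);
        if (wa != wb)
            break;            /* the first differing byte is in this word */
        pa += 4;
        pb += 4;
        n -= 4;
    }

    /* Slow path: locate the differing byte after a word mismatch, or
       compare the last 0 to 3 bytes when every full word matched. */
    while (n > 0) {
        if (*pa != *pb)
            return (int)*pa - (int)*pb;
        pa++;
        pb++;
        n--;
    }
    return 0;
}
```

It is called exactly like memcmp, e.g. wordwise_memcmp(s1, s2, 16).  When
timing it against memcmp with clock(), run enough iterations that each
measurement spans many clock ticks, for the reason given in the quoted
text above.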