public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Slow memcmp for aligned strings on Pentium 3
@ 2003-04-04  6:22 Kevin Atkinson
  2003-04-04  8:52 ` Zack Weinberg
  0 siblings, 1 reply; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-04  6:22 UTC (permalink / raw)
  To: gcc

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4774 bytes --]


I was doing some tests with with memcmp and it seams that gcc 3.2.2 always
uses cmps, even for aligned strings.

I did some tests and discovered that using cmps was rather slow, compared 
to a simple loop and then a bswap and subtract at the end.

My method was is the following:

  int cmpa2(unsigned int * x, unsigned int * y, size_t size)
  {
    int i = 0;
    size /= 4;
    while (i < size && x[i] == y[i]) ++i;
    if (i == size) return 0;
    asm("bswap %0" : "+r"(x[i]));
    asm("bswap %0" : "+r"(y[i]));
    return x[i] - y[i];
  }

as oppose to:

  int cmpa(unsigned int * x, unsigned int * y, size_t size)
  {
    return __builtin_memcmp(x,y,size);
  }

Even though It takes integer pointers it really is comparing them as if 
they were byte strings.  I used integer points to avoid having to cast and 
to let the compiler know that the strings are 4 byte aligned. 
This method is 3-5 times as fast.  The longer the string the faster my 
method is.

I also wrote on specialized compare for comparing two ints AS IF they 
were a 4 byte string:

  int cmpi2(unsigned int x, unsigned int y)
  {
    asm("bswap %0" : "+r"(x));
    asm("bswap %0" : "+r"(y));
    return x - y;
  }

as oppose to:

  int cmpi(unsigned int x, unsigned int y) 
  {
    return __builtin_memcmp(&x,&y,4);
  }

Which was several order of magnitude faster.  It is 4 - 6 times faster.  
However I believe the function call is significant here as 
-fomit-frame-pointer increased the ratio (from 4 to 6).  And when I 
tried to account for the cost of the function call the results varies form 
any where to 10 to 30 times faster.

If a better memcmp is not already implemented in yet-to-be-released gcc I 
hope someone will seriously consider rewriting the memcpy implementation.  
Even though I only tested in on a Pentium 3 I am sure the results will 
also hold for a Pentium Classic based on the instruction timings.  I believe 
th bswap instruction was implemented in the 486. So only generic 386 code 
will have to use something other than bswap.

The test program is attached.  Here are the results when compiled with:
gcc -fomit-frame-pointer -O3 -march=pentium3 cmps.c (attribute noinline 
was used on test functions)

Memory compare int:
  2170000
  350000
  Speed up: 6.200000
Memory compare 16 bytes:
  4100000
  1240000
  Speed up: 3.306452
Memory compare 64 bytes:
  10930000
  2990000
  Speed up: 3.655518
Memory compare 256 bytes:
  37510000
  7700000
  Speed up: 4.871429

Here is the relevant assembly code:

cmpi: // integer as byte string compare using memcmp
        cld
        subl    $8, %esp
        movl    $4, %ecx
        movl    %esi, (%esp)
        leal    12(%esp), %esi
        movl    %edi, 4(%esp)
        leal    16(%esp), %edi
        repz
        cmpsb
        movl    4(%esp), %edi
        movl    (%esp), %esi
        seta    %dl
        setb    %cl
        addl    $8, %esp
        subb    %cl, %dl
        movsbl  %dl,%eax
        ret

cmpi: // interger as byte string compare my method
        movl    4(%esp), %eax
        movl    8(%esp), %ecx
        bswap %eax
        bswap %ecx
        subl    %ecx, %eax
        ret

cmpa: // aligned byte compare using memcmp
        cld
        subl    $8, %esp
        movl    20(%esp), %ecx
        movl    %esi, (%esp)
        movl    12(%esp), %esi
        cmpl    %ecx, %ecx
        movl    %edi, 4(%esp)
        movl    16(%esp), %edi
        repz
        cmpsb
        movl    4(%esp), %edi
        movl    (%esp), %esi
        seta    %dl
        setb    %cl
        addl    $8, %esp
        subb    %cl, %dl
        movsbl  %dl,%eax
        ret

cmpa2: // aligned byte compare: my method
        subl    $12, %esp
        movl    24(%esp), %ecx
        xorl    %edx, %edx
        movl    %esi, 4(%esp)
        movl    16(%esp), %esi
        shrl    $2, %ecx
        movl    %edi, 8(%esp)
        cmpl    %ecx, %edx
        movl    20(%esp), %edi
        movl    %ebx, (%esp)
        jae     .L6
        movl    (%edi), %eax
        cmpl    %eax, (%esi)
        je      .L9
.L6:
        xorl    %ebx, %ebx
        cmpl    %ecx, %edx
        je      .L4
        movl    (%esi,%edx,4), %eax
        bswap %eax
        movl    %eax, (%esi,%edx,4)
        movl    (%edi,%edx,4), %eax
        bswap %eax
        movl    %eax, (%edi,%edx,4)
        movl    (%esi,%edx,4), %ebx
        subl    %eax, %ebx
.L4:
        movl    %ebx, %eax
        movl    4(%esp), %esi
        movl    (%esp), %ebx
        movl    8(%esp), %edi
        addl    $12, %esp
        ret
        .p2align 4,,7
.L9:
        incl    %edx
        cmpl    %ecx, %edx
        jae     .L6
        movl    (%edi,%edx,4), %eax
        cmpl    %eax, (%esi,%edx,4)
        je      .L9
        jmp     .L6


--- 
http://kevin.atkinson.dhs.org

[-- Attachment #2: Type: TEXT/PLAIN, Size: 2172 bytes --]

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

__attribute__((noinline))
int cmpi(unsigned int x, unsigned int y) 
{
  return __builtin_memcmp(&x,&y,4);
}

__attribute__((noinline))
int cmpi2(unsigned int x, unsigned int y)
{
  asm("bswap %0" : "+r"(x));
  asm("bswap %0" : "+r"(y));
  return x - y;
}

__attribute__((noinline))
int cmpa(unsigned int * x, unsigned int * y, size_t size)
{
  return __builtin_memcmp(x,y,size);
}

__attribute__((noinline))
int cmpa2(unsigned int * x, unsigned int * y, size_t size)
{
  int i = 0;
  size /= 4;
  while (i < size && x[i] == y[i]) ++i;
  if (i == size) return 0;
  asm("bswap %0" : "+r"(x[i]));
  asm("bswap %0" : "+r"(y[i]));
  return x[i] - y[i];
}

int main()
{
  unsigned int x,y,i;
  int xa[256] = {}, ya[256] = {};
  int s,f;
  clock_t t;
  x = *(unsigned int *)("a  b");
  y = *(unsigned int *)("b  a");

  printf("Memory compare int:\n");
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpi(x,y);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpi2(x,y);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);


  printf("Memory compare 16 bytes:\n");
  xa[3] = 6;
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa(xa, ya, 16);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa2(xa, ya, 16);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);
  xa[3] = 0;

  xa[15] = 6;
  printf("Memory compare 64 bytes:\n");
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa(xa, ya, 64);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa2(xa, ya, 64);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);
  xa[15] = 0;

  xa[255] = 6;
  printf("Memory compare 256 bytes:\n");
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa(xa, ya, 256);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa2(xa, ya, 256);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);
}

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3
@ 2003-04-04 13:18 Roger Sayle
  2003-04-04 14:51 ` Kevin Atkinson
  2003-04-04 17:08 ` Kevin Atkinson
  0 siblings, 2 replies; 14+ messages in thread
From: Roger Sayle @ 2003-04-04 13:18 UTC (permalink / raw)
  To: Kevin Atkinson; +Cc: gcc


Hi Kevin,
> I did some tests and discovered that using cmps was rather slow,
> compared to a simple loop and then a bswap and subtract at the end.

I'm sure that GCC's memcmp implementations could be improved, but
from reading the code examples in your patch it looks like you
are always assuming that either the length is a multiple of four,
or that the bytes following the memory sections to be compared
contain identical values (i.e. you're hoping they're all zero).

i.e., if p and q are suitably 4-byte aligned

  memset(p,"abcd",4);
  memset(q,"abef",4);
  memcmp(p,q,2)

should compare equal but don't using bswaps and subtractions.
Similarly, when two words mismatch their return value <0 or
>0 should depend upon the first byte that differs, not the
values of the bytes that come after it.

I suspect it should be possible to fix your code to handle these
termination conditions correctly, and a comparison of your
routine's performance with these fixes vs. __builtin_memcmp
would be of interest.

Roger
--
Roger Sayle,                         E-mail: roger@eyesopen.com
OpenEye Scientific Software,         WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road,     Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507.         Fax: (+1) 505-473-0833

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3
@ 2003-04-04 14:33 Bonzini
  0 siblings, 0 replies; 14+ messages in thread
From: Bonzini @ 2003-04-04 14:33 UTC (permalink / raw)
  To: kevin; +Cc: gcc

> I was doing some tests with with memcmp and it seams that gcc 3.2.2
> always uses cmps, even for aligned strings.

I also had a very fast memcmp that also worked word-by-word in the unaligned case
(using some bit-twiddling tricks with SHRD), but I never had the time to polish it for
inclusion in glibc.  It turned out however that CMPS is as fast as mine (or even
faster) on AMD chips.

I do think anyway that memcmp expansion in gcc should be disabled unless optimizing for
size or unless the size is known to be small, say 16 bytes.

See these posts:

http://gcc.gnu.org/ml/gcc/2002-10/msg01616.html (me)
http://gcc.gnu.org/ml/gcc/2002-10/msg01637.html (Roger)
http://sources.redhat.com/ml/libc-alpha/2002-10/msg00511.html (source)

Paolo

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3
@ 2003-04-04 17:13 Jerry Quinn
  2003-04-04 17:29 ` Kevin Atkinson
  0 siblings, 1 reply; 14+ messages in thread
From: Jerry Quinn @ 2003-04-04 17:13 UTC (permalink / raw)
  To: gcc

I just tried the same benchmark on a Pentium 4 out of curiosity.  Slightly
different results:

Memory compare int:
  10000
  130000
  Speed up: 0.076923
Memory compare 15 bytes:
  10000
  370000
  Speed up: 0.027027
Memory compare 16 bytes:
  20000
  330000
  Speed up: 0.060606
Memory compare 64 bytes:
  10000
  1040000
  Speed up: 0.009615
Memory compare 256 bytes:
  20000
  2300000
  Speed up: 0.008696


Perhaps this is to be expected since the routine uses shifts.

Jerry Quinn

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2003-04-07 15:23 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-04-04  6:22 Slow memcmp for aligned strings on Pentium 3 Kevin Atkinson
2003-04-04  8:52 ` Zack Weinberg
2003-04-04 15:16   ` Kevin Atkinson
2003-04-04 13:18 Roger Sayle
2003-04-04 14:51 ` Kevin Atkinson
2003-04-04 17:08 ` Kevin Atkinson
2003-04-04 18:58   ` Marcel Cox
2003-04-04 19:27     ` Kevin Atkinson
2003-04-04 14:33 Bonzini
2003-04-04 17:13 Jerry Quinn
2003-04-04 17:29 ` Kevin Atkinson
2003-04-06  5:54   ` Jerry Quinn
2003-04-06 22:17     ` Kevin Atkinson
2003-04-07 17:30       ` Jerry Quinn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).