Re: Slow memcmp for aligned strings on Pentium 3

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: Slow memcmp for aligned strings on Pentium 3
@ 2003-04-04 13:18 Roger Sayle
  2003-04-04 14:51 ` Kevin Atkinson
  2003-04-04 17:08 ` Kevin Atkinson
  0 siblings, 2 replies; 14+ messages in thread
From: Roger Sayle @ 2003-04-04 13:18 UTC (permalink / raw)
  To: Kevin Atkinson; +Cc: gcc

Hi Kevin,
> I did some tests and discovered that using cmps was rather slow,
> compared to a simple loop and then a bswap and subtract at the end.

I'm sure that GCC's memcmp implementations could be improved, but
from reading the code examples in your patch it looks like you
are always assuming that either the length is a multiple of four,
or that the bytes following the memory sections to be compared
contain identical values (i.e. you're hoping they're all zero).

i.e., if p and q are suitably 4-byte aligned

  memset(p,"abcd",4);
  memset(q,"abef",4);
  memcmp(p,q,2)

should compare equal but don't using bswaps and subtractions.
Similarly, when two words mismatch their return value <0 or
>0 should depend upon the first byte that differs, not the
values of the bytes that come after it.

I suspect it should be possible to fix your code to handle these
termination conditions correctly, and a comparison of your
routine's performance with these fixes vs. __builtin_memcmp
would be of interest.

Roger
--
Roger Sayle,                         E-mail: roger@eyesopen.com
OpenEye Scientific Software,         WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road,     Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507.         Fax: (+1) 505-473-0833

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Slow memcmp for aligned strings on Pentium 3
  2003-04-04 13:18 Slow memcmp for aligned strings on Pentium 3 Roger Sayle
@ 2003-04-04 14:51 ` Kevin Atkinson
  2003-04-04 17:08 ` Kevin Atkinson
  1 sibling, 0 replies; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-04 14:51 UTC (permalink / raw)
  To: Roger Sayle; +Cc: gcc

On Thu, 3 Apr 2003, Roger Sayle wrote:

> 
> Hi Kevin,
> > I did some tests and discovered that using cmps was rather slow,
> > compared to a simple loop and then a bswap and subtract at the end.
> 
> I'm sure that GCC's memcmp implementations could be improved, but
> from reading the code examples in your patch it looks like you
> are always assuming that either the length is a multiple of four,
> or that the bytes following the memory sections to be compared
> contain identical values (i.e. you're hoping they're all zero).
> 
> i.e., if p and q are suitably 4-byte aligned
> 
>   memset(p,"abcd",4);
>   memset(q,"abef",4);
>   memcmp(p,q,2)
> 
> should compare equal but don't using bswaps and subtractions.
> Similarly, when two words mismatch their return value <0 or
> >0 should depend upon the first byte that differs, not the
> values of the bytes that come after it.

Your right.  It can be fixed by a test if it is a multiple of 4 and if not 
do a byte wise comparison at the end.

> I suspect it should be possible to fix your code to handle these
> termination conditions correctly, and a comparison of your
> routine's performance with these fixes vs. __builtin_memcmp
> would be of interest.

I will see what I can do.

--- 
http://kevin.atkinson.dhs.org

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Slow memcmp for aligned strings on Pentium 3
  2003-04-04 13:18 Slow memcmp for aligned strings on Pentium 3 Roger Sayle
  2003-04-04 14:51 ` Kevin Atkinson
@ 2003-04-04 17:08 ` Kevin Atkinson
  2003-04-04 18:58   ` Marcel Cox
  1 sibling, 1 reply; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-04 17:08 UTC (permalink / raw)
  To: Roger Sayle; +Cc: gcc

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1831 bytes --]

On Thu, 3 Apr 2003, Roger Sayle wrote:

> I suspect it should be possible to fix your code to handle these
> termination conditions correctly, and a comparison of your
> routine's performance with these fixes vs. __builtin_memcmp
> would be of interest.

Here you go.  Still assume the memory is alligned:

This is what I did:

int cmpa2(const unsigned int * x, const unsigned int * y, size_t size)
{
  int i = 0;
  size_t s = size / 4;
  while (i < s && x[i] == y[i]) ++i;
  size -= i * 4;
  if (size == 0) return 0;
  // hopefully if this is inline expanded when size is known
  // the compiler can eliminate many of these conditionals
  else if (size >= 4) { // if original size % 4 == 0 this should
                        // always be the case
    unsigned int xx = x[i], yy = y[i];
    asm("bswap %0" : "+r"(xx));
    asm("bswap %0" : "+r"(yy));
    return xx - yy;
  } else {
    const unsigned char * xb = (const unsigned char *)(x + i);
    const unsigned char * yb = (const unsigned char *)(y + i);
    // if size is known at compile time then the compiler should be
    // able to select the correct choice at compile time
    switch (size) {
    case 1:
      return *xb - *yb;
    case 2:
      return ((xb[0] - yb[0]) <<  8) + (xb[1] - yb[1]);
    case 3:
      return ((xb[0] - yb[0]) << 16) + ((xb[1] - yb[1]) << 8)
        + xb[2] - yb[2];}
  }
}

compiled with gcc -fomit-frame-pointer -O3 -march=pentium3 cmps.c

Memory compare int:
  2160000
  360000
  Speed up: 6.000000
Memory compare 15 bytes:
  4120000
  1550000
  Speed up: 2.658065
Memory compare 16 bytes:
  4150000
  1310000
  Speed up: 3.167939
Memory compare 64 bytes:
  11240000
  3470000
  Speed up: 3.239193
Memory compare 256 bytes:
  37930000
  10050000
  Speed up: 3.774129

Code and assembly output attached.

-- 
http://kevin.atkinson.dhs.org

[-- Attachment #2: Type: TEXT/PLAIN, Size: 3315 bytes --]

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

__attribute__((noinline))
int cmpi(unsigned int x, unsigned int y) 
{
  return __builtin_memcmp(&x,&y,4);
}

__attribute__((noinline))
int cmpi2(unsigned int x, unsigned int y)
{
  asm("bswap %0" : "+r"(x));
  asm("bswap %0" : "+r"(y));
  return x - y;
}

__attribute__((noinline))
int cmpa(const unsigned int * x, const unsigned int * y, size_t size)
{
  return __builtin_memcmp(x,y,size);
}

__attribute__((noinline))
int cmpa2(const unsigned int * x, const unsigned int * y, size_t size)
{
  int i = 0;
  size_t s = size / 4;
  while (i < s && x[i] == y[i]) ++i;
  size -= i * 4;
  if (size == 0) return 0;
  // hopefully if this is inline expanded when size is known
  // the compiler can eliminate many of these conditionals
  else if (size >= 4) { // if original size % 4 == 0 this should
                        // always be the case
    unsigned int xx = x[i], yy = y[i];
    asm("bswap %0" : "+r"(xx));
    asm("bswap %0" : "+r"(yy));
    return xx - yy;
  } else {
    const unsigned char * xb = (const unsigned char *)(x + i);
    const unsigned char * yb = (const unsigned char *)(y + i);
    // if size is known at compile time then the compiler should be
    // able to select the correct choice at compile time
    switch (size) {
    case 1:
      return *xb - *yb;
    case 2:
      return ((xb[0] - yb[0]) <<  8) + (xb[1] - yb[1]);
    case 3:
      return ((xb[0] - yb[0]) << 16) + ((xb[1] - yb[1]) << 8)
        + xb[2] - yb[2];}
  }
}

int main()
{
  unsigned int x,y,i;
  int xa[256] = {}, ya[256] = {};
  int s,f;
  clock_t t;
  x = *(unsigned int *)("a  b");
  y = *(unsigned int *)("b  a");

  printf("Memory compare int:\n");
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpi(x,y);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpi2(x,y);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);


  printf("Memory compare 15 bytes:\n");
  xa[3] = 6;
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa(xa, ya, 15);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa2(xa, ya, 15);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);
  xa[3] = 0;

  printf("Memory compare 16 bytes:\n");
  xa[3] = 6;
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa(xa, ya, 16);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa2(xa, ya, 16);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);
  xa[3] = 0;

  printf("Memory compare 64 bytes:\n");
  xa[15] = 6;
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa(xa, ya, 64);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa2(xa, ya, 64);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);
  xa[15] = 0;

  printf("Memory compare 256 bytes:\n");
  xa[63] = 6;
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa(xa, ya, 256);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa2(xa, ya, 256);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);
}

[-- Attachment #3: Type: TEXT/PLAIN, Size: 7710 bytes --]

	.file	"cmps.c"
	.text
	.align 2
	.p2align 4,,15
.globl cmpi
	.type	cmpi,@function
cmpi:
	cld
	subl	$8, %esp
	movl	$4, %ecx
	movl	%esi, (%esp)
	leal	12(%esp), %esi
	movl	%edi, 4(%esp)
	leal	16(%esp), %edi
	repz
	cmpsb
	movl	4(%esp), %edi
	movl	(%esp), %esi
	seta	%dl
	setb	%cl
	addl	$8, %esp
	subb	%cl, %dl
	movsbl	%dl,%eax
	ret
.Lfe1:
	.size	cmpi,.Lfe1-cmpi
	.align 2
	.p2align 4,,15
.globl cmpi2
	.type	cmpi2,@function
cmpi2:
	movl	4(%esp), %eax
	movl	8(%esp), %ecx
#APP
	bswap %eax
	bswap %ecx
#NO_APP
	subl	%ecx, %eax
	ret
.Lfe2:
	.size	cmpi2,.Lfe2-cmpi2
	.align 2
	.p2align 4,,15
.globl cmpa
	.type	cmpa,@function
cmpa:
	cld
	subl	$8, %esp
	movl	20(%esp), %ecx
	movl	%esi, (%esp)
	movl	12(%esp), %esi
	cmpl	%ecx, %ecx
	movl	%edi, 4(%esp)
	movl	16(%esp), %edi
	repz
	cmpsb
	movl	4(%esp), %edi
	movl	(%esp), %esi
	seta	%dl
	setb	%cl
	addl	$8, %esp
	subb	%cl, %dl
	movsbl	%dl,%eax
	ret
.Lfe3:
	.size	cmpa,.Lfe3-cmpa
	.align 2
	.p2align 4,,15
.globl cmpa2
	.type	cmpa2,@function
cmpa2:
	pushl	%edi
	xorl	%edx, %edx
	pushl	%esi
	pushl	%ebx
	movl	24(%esp), %ecx
	movl	16(%esp), %esi
	movl	20(%esp), %edi
	movl	%ecx, %ebx
	shrl	$2, %ebx
	cmpl	%ebx, %edx
	jae	.L6
	movl	(%edi), %eax
	cmpl	%eax, (%esi)
	je	.L9
.L6:
	leal	0(,%edx,4), %eax
	subl	%eax, %ecx
	jne	.L10
	xorl	%ecx, %ecx
.L4:
	popl	%ebx
	movl	%ecx, %eax
	popl	%esi
	popl	%edi
	ret
	.p2align 4,,7
.L10:
	cmpl	$3, %ecx
	jbe	.L12
	movl	(%esi,%edx,4), %ecx
	movl	(%edi,%edx,4), %eax
#APP
	bswap %ecx
	bswap %eax
#NO_APP
.L22:
	subl	%eax, %ecx
	jmp	.L4
	.p2align 4,,7
.L12:
	cmpl	$2, %ecx
	leal	(%eax,%esi), %ebx
	leal	(%eax,%edi), %esi
	je	.L16
	cmpl	$2, %ecx
	ja	.L20
	cmpl	$1, %ecx
	jne	.L4
	movzbl	(%ebx), %ecx
	movzbl	(%esi), %eax
	jmp	.L22
.L20:
	cmpl	$3, %ecx
	jne	.L4
	movzbl	1(%esi), %eax
	movzbl	(%ebx), %ecx
	movzbl	(%esi), %edx
	movzbl	1(%ebx), %edi
	subl	%edx, %ecx
	subl	%eax, %edi
	movzbl	2(%esi), %eax
	sall	$8, %edi
	sall	$16, %ecx
	addl	%edi, %ecx
	movzbl	2(%ebx), %edi
	addl	%edi, %ecx
	jmp	.L22
.L16:
	movzbl	(%ebx), %ecx
	movzbl	(%esi), %edx
	movzbl	1(%ebx), %edi
	movzbl	1(%esi), %eax
	subl	%edx, %ecx
	sall	$8, %ecx
	subl	%eax, %edi
	addl	%edi, %ecx
	jmp	.L4
	.p2align 4,,7
.L9:
	incl	%edx
	cmpl	%ebx, %edx
	jae	.L6
	movl	(%edi,%edx,4), %eax
	cmpl	%eax, (%esi,%edx,4)
	je	.L9
	jmp	.L6
.Lfe4:
	.size	cmpa2,.Lfe4-cmpa2
	.section	.rodata.str1.1,"aMS",@progbits,1
.LC0:
	.string	"a  b"
.LC2:
	.string	"Memory compare int:"
.LC1:
	.string	"b  a"
.LC3:
	.string	"  %d\n"
.LC4:
	.string	"  Speed up: %f\n"
.LC5:
	.string	"Memory compare 15 bytes:"
.LC6:
	.string	"Memory compare 16 bytes:"
.LC7:
	.string	"Memory compare 64 bytes:"
.LC8:
	.string	"Memory compare 256 bytes:"
	.text
	.align 2
	.p2align 4,,15
.globl main
	.type	main,@function
main:
	pushl	%ebp
	xorl	%eax, %eax
	movl	%esp, %ebp
	pushl	%edi
	movl	$256, %ecx
	leal	-1048(%ebp), %edi
	pushl	%esi
	pushl	%ebx
	subl	$2092, %esp
	andl	$-16, %esp
	cld
	rep
	stosl
	movl	$256, %ecx
	leal	-2072(%ebp), %edi
	rep
	stosl
	xorl	%edi, %edi
	movl	.LC0, %esi
	movl	$.LC2, (%esp)
	movl	.LC1, %ebx
	call	puts
	call	clock
	movl	%eax, -2084(%ebp)
	.p2align 4,,15
.L28:
	movl	%esi, (%esp)
	incl	%edi
	movl	%ebx, 4(%esp)
	call	cmpi
	cmpl	$16777216, %edi
	jne	.L28
	call	clock
	movl	%eax, -2080(%ebp)
	xorl	%edi, %edi
	movl	-2084(%ebp), %eax
	subl	%eax, -2080(%ebp)
	movl	$.LC3, (%esp)
	movl	-2080(%ebp), %eax
	movl	%eax, 4(%esp)
	call	printf
	call	clock
	movl	%eax, -2084(%ebp)
.L33:
	movl	%esi, (%esp)
	incl	%edi
	movl	%ebx, 4(%esp)
	call	cmpi2
	cmpl	$16777216, %edi
	jne	.L33
	call	clock
	movl	$.LC3, (%esp)
	movl	%eax, %esi
	movl	-2084(%ebp), %eax
	xorl	%edi, %edi
	subl	%eax, %esi
	movl	%esi, 4(%esp)
	call	printf
	fildl	-2080(%ebp)
	movl	$.LC4, (%esp)
	pushl	%esi
	fildl	(%esp)
	addl	$4, %esp
	fdivrp	%st, %st(1)
	fstpl	4(%esp)
	call	printf
	movl	$.LC5, (%esp)
	call	puts
	movl	$6, %eax
	movl	%eax, -1036(%ebp)
	call	clock
	movl	%eax, -2084(%ebp)
.L38:
	movl	$15, 8(%esp)
	leal	-1048(%ebp), %ecx
	incl	%edi
	movl	%ecx, (%esp)
	leal	-2072(%ebp), %edx
	movl	%edx, 4(%esp)
	call	cmpa
	cmpl	$16777216, %edi
	jne	.L38
	call	clock
	movl	%eax, -2080(%ebp)
	xorl	%edi, %edi
	movl	-2084(%ebp), %eax
	subl	%eax, -2080(%ebp)
	movl	$.LC3, (%esp)
	movl	-2080(%ebp), %ebx
	movl	%ebx, 4(%esp)
	call	printf
	call	clock
	movl	%eax, -2084(%ebp)
.L43:
	movl	$15, 8(%esp)
	leal	-1048(%ebp), %edx
	incl	%edi
	movl	%edx, (%esp)
	leal	-2072(%ebp), %esi
	movl	%esi, 4(%esp)
	call	cmpa2
	cmpl	$16777216, %edi
	jne	.L43
	call	clock
	movl	$.LC3, (%esp)
	movl	%eax, %esi
	movl	-2084(%ebp), %eax
	xorl	%edi, %edi
	subl	%eax, %esi
	movl	%esi, 4(%esp)
	call	printf
	fildl	-2080(%ebp)
	movl	$.LC4, (%esp)
	pushl	%esi
	fildl	(%esp)
	addl	$4, %esp
	fdivrp	%st, %st(1)
	fstpl	4(%esp)
	call	printf
	movl	%edi, -1036(%ebp)
	xorl	%edi, %edi
	movl	$.LC6, (%esp)
	call	puts
	movl	$6, %eax
	movl	%eax, -1036(%ebp)
	call	clock
	movl	%eax, -2084(%ebp)
.L48:
	movl	$16, 8(%esp)
	leal	-1048(%ebp), %ecx
	incl	%edi
	movl	%ecx, (%esp)
	leal	-2072(%ebp), %eax
	movl	%eax, 4(%esp)
	call	cmpa
	cmpl	$16777216, %edi
	jne	.L48
	call	clock
	movl	%eax, -2080(%ebp)
	xorl	%edi, %edi
	movl	-2084(%ebp), %eax
	subl	%eax, -2080(%ebp)
	movl	$.LC3, (%esp)
	movl	-2080(%ebp), %ebx
	movl	%ebx, 4(%esp)
	call	printf
	call	clock
	movl	%eax, -2084(%ebp)
.L53:
	movl	$16, 8(%esp)
	leal	-2072(%ebp), %edx
	incl	%edi
	movl	%edx, 4(%esp)
	leal	-1048(%ebp), %esi
	movl	%esi, (%esp)
	call	cmpa2
	cmpl	$16777216, %edi
	jne	.L53
	call	clock
	movl	$.LC3, (%esp)
	movl	%eax, %esi
	movl	-2084(%ebp), %eax
	xorl	%edi, %edi
	subl	%eax, %esi
	movl	%esi, 4(%esp)
	call	printf
	fildl	-2080(%ebp)
	movl	$.LC4, (%esp)
	pushl	%esi
	fildl	(%esp)
	addl	$4, %esp
	fdivrp	%st, %st(1)
	fstpl	4(%esp)
	call	printf
	movl	%edi, -1036(%ebp)
	xorl	%edi, %edi
	movl	$.LC7, (%esp)
	call	puts
	movl	$6, %eax
	movl	%eax, -988(%ebp)
	call	clock
	movl	%eax, -2084(%ebp)
.L58:
	movl	$64, 8(%esp)
	leal	-1048(%ebp), %ecx
	incl	%edi
	movl	%ecx, (%esp)
	leal	-2072(%ebp), %ebx
	movl	%ebx, 4(%esp)
	call	cmpa
	cmpl	$16777216, %edi
	jne	.L58
	call	clock
	movl	%eax, -2080(%ebp)
	xorl	%edi, %edi
	movl	-2084(%ebp), %eax
	subl	%eax, -2080(%ebp)
	movl	$.LC3, (%esp)
	movl	-2080(%ebp), %edx
	movl	%edx, 4(%esp)
	call	printf
	call	clock
	movl	%eax, -2084(%ebp)
.L63:
	movl	$64, 8(%esp)
	leal	-2072(%ebp), %eax
	incl	%edi
	movl	%eax, 4(%esp)
	leal	-1048(%ebp), %esi
	movl	%esi, (%esp)
	call	cmpa2
	cmpl	$16777216, %edi
	jne	.L63
	call	clock
	movl	$.LC3, (%esp)
	movl	-2084(%ebp), %ecx
	movl	%eax, %edi
	xorl	%ebx, %ebx
	subl	%ecx, %edi
	movl	%edi, 4(%esp)
	call	printf
	fildl	-2080(%ebp)
	movl	$.LC4, (%esp)
	pushl	%edi
	movl	$6, %edi
	fildl	(%esp)
	addl	$4, %esp
	fdivrp	%st, %st(1)
	fstpl	4(%esp)
	call	printf
	movl	%ebx, -988(%ebp)
	movl	$.LC8, (%esp)
	call	puts
	movl	%edi, -796(%ebp)
	xorl	%edi, %edi
	call	clock
	movl	%eax, -2084(%ebp)
.L68:
	movl	$256, 8(%esp)
	leal	-2072(%ebp), %edx
	incl	%edi
	movl	%edx, 4(%esp)
	leal	-1048(%ebp), %esi
	movl	%esi, (%esp)
	call	cmpa
	cmpl	$16777216, %edi
	jne	.L68
	call	clock
	movl	%eax, -2080(%ebp)
	xorl	%edi, %edi
	movl	-2084(%ebp), %eax
	subl	%eax, -2080(%ebp)
	movl	$.LC3, (%esp)
	movl	-2080(%ebp), %eax
	movl	%eax, 4(%esp)
	call	printf
	call	clock
	movl	%eax, -2084(%ebp)
.L73:
	movl	$256, 8(%esp)
	leal	-1048(%ebp), %ecx
	incl	%edi
	movl	%ecx, (%esp)
	leal	-2072(%ebp), %ebx
	movl	%ebx, 4(%esp)
	call	cmpa2
	cmpl	$16777216, %edi
	jne	.L73
	call	clock
	movl	$.LC3, (%esp)
	movl	%eax, %edi
	movl	-2084(%ebp), %eax
	subl	%eax, %edi
	movl	%edi, 4(%esp)
	call	printf
	fildl	-2080(%ebp)
	movl	$.LC4, (%esp)
	pushl	%edi
	fildl	(%esp)
	addl	$4, %esp
	fdivrp	%st, %st(1)
	fstpl	4(%esp)
	call	printf
	leal	-12(%ebp), %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
.Lfe5:
	.size	main,.Lfe5-main
	.ident	"GCC: (GNU) 3.2 20020903 (Red Hat Linux 8.0 3.2-7)"

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Slow memcmp for aligned strings on Pentium 3
  2003-04-04 17:08 ` Kevin Atkinson
@ 2003-04-04 18:58   ` Marcel Cox
  2003-04-04 19:27     ` Kevin Atkinson
  0 siblings, 1 reply; 14+ messages in thread
From: Marcel Cox @ 2003-04-04 18:58 UTC (permalink / raw)
  To: gcc


"Kevin Atkinson" <kevina@gnu.org> wrote in message
news:Pine.LNX.4.44.0304040923350.912-300000@kevin-pc.atkinson.dhs.org...
> Here you go.  Still assume the memory is alligned:
>
> This is what I did:
>
> int cmpa2(const unsigned int * x, const unsigned int * y, size_t size)
> {
>   int i = 0;
>   size_t s = size / 4;
>   while (i < s && x[i] == y[i]) ++i;
>   size -= i * 4;
>   if (size == 0) return 0;
>   // hopefully if this is inline expanded when size is known
>   // the compiler can eliminate many of these conditionals
>   else if (size >= 4) { // if original size % 4 == 0 this should
>                         // always be the case
>     unsigned int xx = x[i], yy = y[i];
>     asm("bswap %0" : "+r"(xx));
>     asm("bswap %0" : "+r"(yy));
>     return xx - yy;
>   } else {
>     const unsigned char * xb = (const unsigned char *)(x + i);
>     const unsigned char * yb = (const unsigned char *)(y + i);
>     // if size is known at compile time then the compiler should be
>     // able to select the correct choice at compile time
>     switch (size) {
>     case 1:
>       return *xb - *yb;
>     case 2:
>       return ((xb[0] - yb[0]) <<  8) + (xb[1] - yb[1]);
>     case 3:
>       return ((xb[0] - yb[0]) << 16) + ((xb[1] - yb[1]) << 8)
>         + xb[2] - yb[2];}
>   }
> }

There is still a flaw in that you assume that the difference of 2 unsigned
integers will return the correct signed result. This will not work if for
instance in your "return xx - yy" statement, xx for instance is 0xf0000000
and yy is 0x100000. In that case, xx is clearly greater than yy, but the
difference is 0xe0000000 which cast to signed integer will be a negative
number and will indicate that xx is smaller than yy which is clearly wrong.

Marcel



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Slow memcmp for aligned strings on Pentium 3
  2003-04-04 18:58   ` Marcel Cox
@ 2003-04-04 19:27     ` Kevin Atkinson
  0 siblings, 0 replies; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-04 19:27 UTC (permalink / raw)
  To: gcc

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1441 bytes --]

On Fri, 4 Apr 2003, Marcel Cox wrote:

> There is still a flaw in that you assume that the difference of 2 unsigned
> integers will return the correct signed result. This will not work if for
> instance in your "return xx - yy" statement, xx for instance is 0xf0000000
> and yy is 0x100000. In that case, xx is clearly greater than yy, but the
> difference is 0xe0000000 which cast to signed integer will be a negative
> number and will indicate that xx is smaller than yy which is clearly wrong.

Your right.  I relized that after I posted the code.  Here is a better 
version which also faster when size % 4 != 0:

int cmpa2(const unsigned int * x, const unsigned int * y, size_t size)
{
  int i = 0;
  size_t s = size / 4;
  while (i < s && x[i] == y[i]) ++i;
  size -= i * 4;
  if (size == 0) return 0;
  unsigned int xx = x[i], yy = y[i];
  asm("bswap %0" : "+r"(xx));
  asm("bswap %0" : "+r"(yy));
  if (size >= 4) {
    return xx < yy ? -1 : 1;
  } else {
    unsigned int dis = 8*(4-size);
    xx >>= dis;
    yy >>= dis;
    return xx - yy;
  }
}

Test results (I removed the integer as byte strings functions):

Memory compare 15 bytes:
  4120000
  1370000
  Speed up: 3.007299
Memory compare 16 bytes:
  4140000
  1460000
  Speed up: 2.835616
Memory compare 64 bytes:
  11220000
  3920000
  Speed up: 2.862245
Memory compare 256 bytes:
  38550000
  12410000
  Speed up: 3.106366

Code attached.
-- 
http://kevin.atkinson.dhs.org


[-- Attachment #2: Type: TEXT/PLAIN, Size: 2102 bytes --]

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

__attribute__((noinline))
int cmpa(const unsigned int * x, const unsigned int * y, size_t size)
{
  return __builtin_memcmp(x,y,size);
}

__attribute__((noinline))
int cmpa2(const unsigned int * x, const unsigned int * y, size_t size)
{
  int i = 0;
  size_t s = size / 4;
  while (i < s && x[i] == y[i]) ++i;
  size -= i * 4;
  if (size == 0) return 0;
  unsigned int xx = x[i], yy = y[i];
  asm("bswap %0" : "+r"(xx));
  asm("bswap %0" : "+r"(yy));
  if (size >= 4) {
    return xx < yy ? -1 : 1;
  } else {
    unsigned int dis = 8*(4-size);
    xx >>= dis;
    yy >>= dis;
    return xx - yy;
  }
}

int main()
{
  unsigned int x,y,i;
  int xa[256] = {}, ya[256] = {};
  int s,f;
  clock_t t;

  printf("Memory compare 15 bytes:\n");
  xa[3] = 6;
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa(xa, ya, 15);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa2(xa, ya, 15);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);
  xa[3] = 0;

  printf("Memory compare 16 bytes:\n");
  xa[3] = 6;
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa(xa, ya, 16);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa2(xa, ya, 16);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);
  xa[3] = 0;

  printf("Memory compare 64 bytes:\n");
  xa[15] = 6;
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa(xa, ya, 64);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa2(xa, ya, 64);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);
  xa[15] = 0;

  printf("Memory compare 256 bytes:\n");
  xa[63] = 6;
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa(xa, ya, 256);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa2(xa, ya, 256);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);
}

[-- Attachment #3: Type: TEXT/PLAIN, Size: 5496 bytes --]

	.file	"cmps.c"
	.text
	.align 2
	.p2align 4,,15
.globl cmpa
	.type	cmpa,@function
cmpa:
	cld
	subl	$8, %esp
	movl	20(%esp), %ecx
	movl	%esi, (%esp)
	movl	12(%esp), %esi
	cmpl	%ecx, %ecx
	movl	%edi, 4(%esp)
	movl	16(%esp), %edi
	repz
	cmpsb
	movl	4(%esp), %edi
	movl	(%esp), %esi
	seta	%dl
	setb	%cl
	addl	$8, %esp
	subb	%cl, %dl
	movsbl	%dl,%eax
	ret
.Lfe1:
	.size	cmpa,.Lfe1-cmpa
	.align 2
	.p2align 4,,15
.globl cmpa2
	.type	cmpa2,@function
cmpa2:
	pushl	%edi
	xorl	%edx, %edx
	pushl	%esi
	pushl	%ebx
	movl	24(%esp), %esi
	movl	16(%esp), %ebx
	movl	20(%esp), %edi
	movl	%esi, %ecx
	shrl	$2, %ecx
	cmpl	%ecx, %edx
	jae	.L4
	movl	(%edi), %eax
	cmpl	%eax, (%ebx)
	je	.L7
.L4:
	leal	0(,%edx,4), %ecx
	subl	%ecx, %esi
	jne	.L8
	xorl	%edx, %edx
.L2:
	popl	%ebx
	movl	%edx, %eax
	popl	%esi
	popl	%edi
	ret
	.p2align 4,,7
.L8:
	movl	(%ebx,%edx,4), %ebx
	movl	(%edi,%edx,4), %eax
#APP
	bswap %ebx
	bswap %eax
#NO_APP
	cmpl	$3, %esi
	jbe	.L9
	cmpl	%eax, %ebx
	sbbl	%edx, %edx
	orl	$1, %edx
	jmp	.L2
	.p2align 4,,7
.L9:
	movl	$4, %ecx
	subl	%esi, %ecx
	sall	$3, %ecx
	movl	%ebx, %edx
	shrl	%cl, %edx
	shrl	%cl, %eax
	subl	%eax, %edx
	jmp	.L2
	.p2align 4,,7
.L7:
	incl	%edx
	cmpl	%ecx, %edx
	jae	.L4
	movl	(%edi,%edx,4), %eax
	cmpl	%eax, (%ebx,%edx,4)
	je	.L7
	jmp	.L4
.Lfe2:
	.size	cmpa2,.Lfe2-cmpa2
	.section	.rodata.str1.1,"aMS",@progbits,1
.LC0:
	.string	"Memory compare 15 bytes:"
.LC1:
	.string	"  %d\n"
.LC2:
	.string	"  Speed up: %f\n"
.LC3:
	.string	"Memory compare 16 bytes:"
.LC4:
	.string	"Memory compare 64 bytes:"
.LC5:
	.string	"Memory compare 256 bytes:"
	.text
	.align 2
	.p2align 4,,15
.globl main
	.type	main,@function
main:
	pushl	%ebp
	xorl	%eax, %eax
	movl	%esp, %ebp
	pushl	%edi
	movl	$256, %ecx
	pushl	%esi
	leal	-1048(%ebp), %esi
	movl	%esi, %edi
	pushl	%ebx
	leal	-2072(%ebp), %ebx
	subl	$2092, %esp
	cld
	andl	$-16, %esp
	rep
	stosl
	movl	$256, %ecx
	movl	%ebx, %edi
	rep
	stosl
	xorl	%edi, %edi
	movl	$.LC0, (%esp)
	call	puts
	movl	$6, %eax
	movl	%eax, -1036(%ebp)
	call	clock
	movl	%eax, -2084(%ebp)
	.p2align 4,,15
.L19:
	movl	%esi, (%esp)
	incl	%edi
	movl	%ebx, 4(%esp)
	movl	$15, 8(%esp)
	call	cmpa
	cmpl	$16777216, %edi
	jne	.L19
	call	clock
	movl	%eax, -2080(%ebp)
	xorl	%edi, %edi
	movl	-2084(%ebp), %eax
	subl	%eax, -2080(%ebp)
	movl	$.LC1, (%esp)
	movl	-2080(%ebp), %eax
	movl	%eax, 4(%esp)
	call	printf
	call	clock
	movl	%eax, -2084(%ebp)
.L24:
	movl	%esi, (%esp)
	incl	%edi
	movl	%ebx, 4(%esp)
	movl	$15, 8(%esp)
	call	cmpa2
	cmpl	$16777216, %edi
	jne	.L24
	call	clock
	movl	$.LC1, (%esp)
	movl	-2084(%ebp), %ecx
	movl	%eax, %edi
	subl	%ecx, %edi
	movl	%edi, 4(%esp)
	call	printf
	fildl	-2080(%ebp)
	movl	$.LC2, (%esp)
	pushl	%edi
	xorl	%edi, %edi
	fildl	(%esp)
	addl	$4, %esp
	fdivrp	%st, %st(1)
	fstpl	4(%esp)
	call	printf
	xorl	%edx, %edx
	movl	%edx, -1036(%ebp)
	movl	$.LC3, (%esp)
	call	puts
	movl	$6, %eax
	movl	%eax, -1036(%ebp)
	call	clock
	movl	%eax, -2084(%ebp)
.L29:
	movl	%esi, (%esp)
	incl	%edi
	movl	%ebx, 4(%esp)
	movl	$16, 8(%esp)
	call	cmpa
	cmpl	$16777216, %edi
	jne	.L29
	call	clock
	movl	$.LC1, (%esp)
	movl	%eax, -2080(%ebp)
	movl	-2084(%ebp), %eax
	subl	%eax, -2080(%ebp)
	movl	-2080(%ebp), %edi
	movl	%edi, 4(%esp)
	xorl	%edi, %edi
	call	printf
	call	clock
	movl	%eax, -2084(%ebp)
.L34:
	movl	%esi, (%esp)
	incl	%edi
	movl	%ebx, 4(%esp)
	movl	$16, 8(%esp)
	call	cmpa2
	cmpl	$16777216, %edi
	jne	.L34
	call	clock
	movl	$.LC1, (%esp)
	movl	%eax, %edi
	movl	-2084(%ebp), %eax
	subl	%eax, %edi
	movl	%edi, 4(%esp)
	call	printf
	fildl	-2080(%ebp)
	movl	$.LC2, (%esp)
	pushl	%edi
	xorl	%edi, %edi
	fildl	(%esp)
	addl	$4, %esp
	fdivrp	%st, %st(1)
	fstpl	4(%esp)
	call	printf
	xorl	%edx, %edx
	movl	%edx, -1036(%ebp)
	movl	$.LC4, (%esp)
	call	puts
	movl	$6, %eax
	movl	%eax, -988(%ebp)
	call	clock
	movl	%eax, -2084(%ebp)
.L39:
	movl	%esi, (%esp)
	incl	%edi
	movl	%ebx, 4(%esp)
	movl	$64, 8(%esp)
	call	cmpa
	cmpl	$16777216, %edi
	jne	.L39
	call	clock
	movl	%eax, -2080(%ebp)
	xorl	%edi, %edi
	movl	-2084(%ebp), %eax
	subl	%eax, -2080(%ebp)
	movl	$.LC1, (%esp)
	movl	-2080(%ebp), %eax
	movl	%eax, 4(%esp)
	call	printf
	call	clock
	movl	%eax, -2084(%ebp)
.L44:
	movl	%esi, (%esp)
	incl	%edi
	movl	%ebx, 4(%esp)
	movl	$64, 8(%esp)
	call	cmpa2
	cmpl	$16777216, %edi
	jne	.L44
	call	clock
	movl	$.LC1, (%esp)
	movl	%eax, %edi
	movl	-2084(%ebp), %eax
	subl	%eax, %edi
	movl	%edi, 4(%esp)
	call	printf
	fildl	-2080(%ebp)
	movl	$.LC2, (%esp)
	pushl	%edi
	xorl	%edi, %edi
	fildl	(%esp)
	addl	$4, %esp
	fdivrp	%st, %st(1)
	fstpl	4(%esp)
	call	printf
	movl	%edi, -988(%ebp)
	xorl	%edi, %edi
	movl	$.LC5, (%esp)
	call	puts
	movl	$6, %ecx
	movl	%ecx, -796(%ebp)
	call	clock
	movl	%eax, -2084(%ebp)
.L49:
	movl	%esi, (%esp)
	incl	%edi
	movl	%ebx, 4(%esp)
	movl	$256, 8(%esp)
	call	cmpa
	cmpl	$16777216, %edi
	jne	.L49
	call	clock
	movl	%eax, -2080(%ebp)
	xorl	%edi, %edi
	movl	-2084(%ebp), %eax
	subl	%eax, -2080(%ebp)
	movl	$.LC1, (%esp)
	movl	-2080(%ebp), %edx
	movl	%edx, 4(%esp)
	call	printf
	call	clock
	movl	%eax, -2084(%ebp)
.L54:
	movl	%esi, (%esp)
	incl	%edi
	movl	%ebx, 4(%esp)
	movl	$256, 8(%esp)
	call	cmpa2
	cmpl	$16777216, %edi
	jne	.L54
	call	clock
	movl	$.LC1, (%esp)
	movl	%eax, %ebx
	movl	-2084(%ebp), %eax
	subl	%eax, %ebx
	movl	%ebx, 4(%esp)
	call	printf
	fildl	-2080(%ebp)
	movl	$.LC2, (%esp)
	pushl	%ebx
	fildl	(%esp)
	addl	$4, %esp
	fdivrp	%st, %st(1)
	fstpl	4(%esp)
	call	printf
	leal	-12(%ebp), %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
.Lfe3:
	.size	main,.Lfe3-main
	.ident	"GCC: (GNU) 3.2 20020903 (Red Hat Linux 8.0 3.2-7)"

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Slow memcmp for aligned strings on Pentium 3
@ 2003-04-04 17:13 Jerry Quinn
  2003-04-04 17:29 ` Kevin Atkinson
  0 siblings, 1 reply; 14+ messages in thread
From: Jerry Quinn @ 2003-04-04 17:13 UTC (permalink / raw)
  To: gcc

I just tried the same benchmark on a Pentium 4 out of curiosity.  Slightly
different results:

Memory compare int:
  10000
  130000
  Speed up: 0.076923
Memory compare 15 bytes:
  10000
  370000
  Speed up: 0.027027
Memory compare 16 bytes:
  20000
  330000
  Speed up: 0.060606
Memory compare 64 bytes:
  10000
  1040000
  Speed up: 0.009615
Memory compare 256 bytes:
  20000
  2300000
  Speed up: 0.008696


Perhaps this is to be expected since the routine uses shifts.

Jerry Quinn

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Slow memcmp for aligned strings on Pentium 3
  2003-04-04 17:13 Jerry Quinn
@ 2003-04-04 17:29 ` Kevin Atkinson
  2003-04-06  5:54   ` Jerry Quinn
  0 siblings, 1 reply; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-04 17:29 UTC (permalink / raw)
  To: Jerry Quinn; +Cc: gcc

On Fri, 4 Apr 2003, Jerry Quinn wrote:

> I just tried the same benchmark on a Pentium 4 out of curiosity.  Slightly
> different results:
> 
> Memory compare int:
>   10000
>   130000
>   Speed up: 0.076923
> Memory compare 15 bytes:
>   10000
>   370000
>   Speed up: 0.027027
> Memory compare 16 bytes:
>   20000
>   330000
>   Speed up: 0.060606
> Memory compare 64 bytes:
>   10000
>   1040000
>   Speed up: 0.009615
> Memory compare 256 bytes:
>   20000
>   2300000
>   Speed up: 0.008696
> 
> Perhaps this is to be expected since the routine uses shifts.

The shift are only used in the case size is not divisible by 4.  It seams 
that on the Pentium 4 cmps is the way to go.  You might also want to 
increase the number of loop iterations to get more meaning full results 
due the limited precision of clock().

--- 
http://kevin.atkinson.dhs.org

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Slow memcmp for aligned strings on Pentium 3
  2003-04-04 17:29 ` Kevin Atkinson
@ 2003-04-06  5:54   ` Jerry Quinn
  2003-04-06 22:17     ` Kevin Atkinson
  0 siblings, 1 reply; 14+ messages in thread
From: Jerry Quinn @ 2003-04-06  5:54 UTC (permalink / raw)
  To: Kevin Atkinson; +Cc: gcc

Kevin Atkinson writes:
 > On Fri, 4 Apr 2003, Jerry Quinn wrote:
 > 
 > > I just tried the same benchmark on a Pentium 4 out of curiosity.  Slightly
 > > different results:
 > > 
 > > Memory compare int:
 > >   10000
 > >   130000
 > >   Speed up: 0.076923
 > > Memory compare 15 bytes:
 > >   10000
 > >   370000
 > >   Speed up: 0.027027
 > > Memory compare 16 bytes:
 > >   20000
 > >   330000
 > >   Speed up: 0.060606
 > > Memory compare 64 bytes:
 > >   10000
 > >   1040000
 > >   Speed up: 0.009615
 > > Memory compare 256 bytes:
 > >   20000
 > >   2300000
 > >   Speed up: 0.008696
 > > 
 > > Perhaps this is to be expected since the routine uses shifts.
 > 
 > The shift are only used in the case size is not divisible by 4.  It seams 
 > that on the Pentium 4 cmps is the way to go.  You might also want to 
 > increase the number of loop iterations to get more meaning full results 
 > due the limited precision of clock().

Adding iterations didn't change the relative scores significantly.  It
still loses big on P4.  It also loses big on Athlon.  Here are Athlon
results using the later version you posted with 10x iterations:

jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -march=athlon cmps.c 
jlquinn@smaug:~/gcc/test$ ./a.out 
Memory compare 15 bytes:
  310000
  5810000
  Speed up: 0.053356
Memory compare 16 bytes:
  300000
  5290000
  Speed up: 0.056711
Memory compare 64 bytes:
  460000
  13770000
  Speed up: 0.033406
Memory compare 256 bytes:
  470000

Jerry Quinn

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Slow memcmp for aligned strings on Pentium 3
  2003-04-06  5:54   ` Jerry Quinn
@ 2003-04-06 22:17     ` Kevin Atkinson
  2003-04-07 17:30       ` Jerry Quinn
  0 siblings, 1 reply; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-06 22:17 UTC (permalink / raw)
  To: gcc, Jerry Quinn

On Sat, 5 Apr 2003, Jerry Quinn wrote:

> Kevin Atkinson writes:
>  > On Fri, 4 Apr 2003, Jerry Quinn wrote:
>  > 
>  > > I just tried the same benchmark on a Pentium 4 out of curiosity.  Slightly
>  > > different results:
>  > > 
>  > > Memory compare int:
>  > >   10000
>  > >   130000
>  > >   Speed up: 0.076923
>  > > Memory compare 15 bytes:
>  > >   10000
>  > >   370000
>  > >   Speed up: 0.027027
>  > > Memory compare 16 bytes:
>  > >   20000
>  > >   330000
>  > >   Speed up: 0.060606
>  > > Memory compare 64 bytes:
>  > >   10000
>  > >   1040000
>  > >   Speed up: 0.009615
>  > > Memory compare 256 bytes:
>  > >   20000
>  > >   2300000
>  > >   Speed up: 0.008696
>  > > 
>  > > Perhaps this is to be expected since the routine uses shifts.
>  > 
>  > The shift are only used in the case size is not divisible by 4.  It seams 
>  > that on the Pentium 4 cmps is the way to go.  You might also want to 
>  > increase the number of loop iterations to get more meaning full results 
>  > due the limited precision of clock().
> 
> Adding iterations didn't change the relative scores significantly.  It
> still loses big on P4.  It also loses big on Athlon.  Here are Athlon
> results using the later version you posted with 10x iterations:
> 
> jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -march=athlon cmps.c 
> jlquinn@smaug:~/gcc/test$ ./a.out 
> Memory compare 15 bytes:
>   310000
>   5810000
>   Speed up: 0.053356
> Memory compare 16 bytes:
>   300000
>   5290000
>   Speed up: 0.056711
> Memory compare 64 bytes:
>   460000
>   13770000
>   Speed up: 0.033406
> Memory compare 256 bytes:
>   470000

This is extremely interesting.  Does anyone have any documentation on cmps 
behavior on P4 and Athlon?  It could be that the processor is somehow 
"caching" the results of cmps.  Maybe it has to do with the fact that the 
strings are all 0 except for the end or because the strings do not change.  
Or maybe cmps is just extremely fast, but how?  Or it could be that the 
loop needs unrolling for better pipeline performance.  I don't have a P4 
or Athlon so if someone could play around with my code by testing by 
testing some of my theories I would appreciate it.

I just ran the test on a Pentium MMX and i got similar results as I did 
for my P3.  So at very least it seams that something similar to my code is 
the way to go for Pentiums up to P3.

---
http://kevin.atkinson.dhs.org

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Slow memcmp for aligned strings on Pentium 3
  2003-04-06 22:17     ` Kevin Atkinson
@ 2003-04-07 17:30       ` Jerry Quinn
  0 siblings, 0 replies; 14+ messages in thread
From: Jerry Quinn @ 2003-04-07 17:30 UTC (permalink / raw)
  To: Kevin Atkinson; +Cc: gcc

Kevin Atkinson writes:
 > On Sat, 5 Apr 2003, Jerry Quinn wrote:
 > 
 > > Kevin Atkinson writes:
 > >  > On Fri, 4 Apr 2003, Jerry Quinn wrote:
 > >  > 
 > >  > > I just tried the same benchmark on a Pentium 4 out of curiosity.  Slightly
 > >  > > different results:
 > >  > > 
 > >  > > Memory compare int:
 > >  > >   10000
 > >  > >   130000
 > >  > >   Speed up: 0.076923
 > >  > > Memory compare 15 bytes:
 > >  > >   10000
 > >  > >   370000
 > >  > >   Speed up: 0.027027
 > >  > > Memory compare 16 bytes:
 > >  > >   20000
 > >  > >   330000
 > >  > >   Speed up: 0.060606
 > >  > > Memory compare 64 bytes:
 > >  > >   10000
 > >  > >   1040000
 > >  > >   Speed up: 0.009615
 > >  > > Memory compare 256 bytes:
 > >  > >   20000
 > >  > >   2300000
 > >  > >   Speed up: 0.008696
 > >  > > 
 > >  > > Perhaps this is to be expected since the routine uses shifts.
 > >  > 
 > >  > The shift are only used in the case size is not divisible by 4.  It seams 
 > >  > that on the Pentium 4 cmps is the way to go.  You might also want to 
 > >  > increase the number of loop iterations to get more meaning full results 
 > >  > due the limited precision of clock().
 > > 
 > > Adding iterations didn't change the relative scores significantly.  It
 > > still loses big on P4.  It also loses big on Athlon.  Here are Athlon
 > > results using the later version you posted with 10x iterations:
 > > 
 > > jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -march=athlon cmps.c 
 > > jlquinn@smaug:~/gcc/test$ ./a.out 
 > > Memory compare 15 bytes:
 > >   310000
 > >   5810000
 > >   Speed up: 0.053356
 > > Memory compare 16 bytes:
 > >   300000
 > >   5290000
 > >   Speed up: 0.056711
 > > Memory compare 64 bytes:
 > >   460000
 > >   13770000
 > >   Speed up: 0.033406
 > > Memory compare 256 bytes:
 > >   470000
 > 
 > This is extremely interesting.  Does anyone have any documentation on cmps 
 > behavior on P4 and Athlon?  It could be that the processor is somehow 
 > "caching" the results of cmps.  Maybe it has to do with the fact that the 
 > strings are all 0 except for the end or because the strings do not change.  
 > Or maybe cmps is just extremely fast, but how?  Or it could be that the 
 > loop needs unrolling for better pipeline performance.  I don't have a P4 
 > or Athlon so if someone could play around with my code by testing by 
 > testing some of my theories I would appreciate it.
 > 
 > I just ran the test on a Pentium MMX and i got similar results as I did 
 > for my P3.  So at very least it seams that something similar to my code is 
 > the way to go for Pentiums up to P3.

Unrolling makes things even worse for Athlon - or rather MUCH better
for the builtin function:

jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -funroll-loops -mcpu=athlon cmps.c 
jlquinn@smaug:~/gcc/test$ ./a.out 
Memory compare 15 bytes:
  40000
  5440000
  Speed up: 0.007353
Memory compare 16 bytes:
  60000
  5070000
  Speed up: 0.011834
Memory compare 64 bytes:
  60000
  13660000
  Speed up: 0.004392
Memory compare 256 bytes:
  60000
  39300000
  Speed up: 0.001527

If I assign both arrays to the same set of random bytes, the results
tilt a bit more towards your routine :-)

jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -mcpu=athlon cmps.c 
jlquinn@smaug:~/gcc/test$ ./a.out 
Memory compare 15 bytes:
  460000
  5640000
  Speed up: 0.081560
Memory compare 16 bytes:
  450000
  5390000
  Speed up: 0.083488
Memory compare 64 bytes:
  300000
  5250000
  Speed up: 0.057143
Memory compare 256 bytes:
  310000
  5090000
  Speed up: 0.060904

Making the two arrays randomly generated and different from each other
provides yet more improvement:

jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -mcpu=athlon cmps.c 
jlquinn@smaug:~/gcc/test$ ./a.out 
Memory compare 15 bytes:
  310000
  3660000
  Speed up: 0.084699
Memory compare 16 bytes:
  310000
  3510000
  Speed up: 0.088319
Memory compare 64 bytes:
  460000
  3510000
  Speed up: 0.131054
Memory compare 256 bytes:
  460000
  3500000


Still not even close.  I just scanned the Athlon optimization guide
and there is a comment about being sure to align the inputs for
repeated string ops.  That makes it sound like work was done to make
aligned cmpsb efficient.

I also tried commenting out the final int cleanup so that cmpa2 looks
like:

__attribute__((noinline))
int cmpa2(const unsigned int * x, const unsigned int * y, size_t size)
{
  int i = 0;
  size_t s = size / 4;
  while (i < s && x[i] == y[i]) ++i;
  size -= i * 4;
  if (size == 0) return 0;
  return 1;
/*   unsigned int xx = x[i], yy = y[i]; */
/*   asm("bswap %0" : "+r"(xx)); */
/*   asm("bswap %0" : "+r"(yy)); */
/*   if (size >= 4) { */
/*     return xx < yy ? -1 : 1; */
/*   } else { */
/*     unsigned int dis = 8*(4-size); */
/*     xx >>= dis; */
/*     yy >>= dis; */
/*     return xx - yy; */
/*   } */
}


I wanted to see if the compiler could make the simple integer compare
fast.  This is an (expected) improvement, but doesn't come close to
narrowing the gap:

jlquinn@smaug:~/gcc/test$ gcc3.3 cmps.c -O3 -fomit-frame-pointer -mcpu=athlon  
jlquinn@smaug:~/gcc/test$ ./a.out 
Memory compare 15 bytes:
  300000
  2740000
  Speed up: 0.109489
Memory compare 16 bytes:
  310000
  2520000
  Speed up: 0.123016
Memory compare 64 bytes:
  480000
  2590000
  Speed up: 0.185328
Memory compare 256 bytes:
  460000
  2680000
  Speed up: 0.171642

It seems the conclusion is that rep cmpsb is very fast on Athlon.
Also probably for P4, but I haven't tried the same tests beyond the
first one.

Jerry Quinn

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Slow memcmp for aligned strings on Pentium 3
@ 2003-04-04 14:33 Bonzini
  0 siblings, 0 replies; 14+ messages in thread
From: Bonzini @ 2003-04-04 14:33 UTC (permalink / raw)
  To: kevin; +Cc: gcc

> I was doing some tests with with memcmp and it seams that gcc 3.2.2
> always uses cmps, even for aligned strings.

I also had a very fast memcmp that also worked word-by-word in the unaligned case
(using some bit-twiddling tricks with SHRD), but I never had the time to polish it for
inclusion in glibc.  It turned out however that CMPS is as fast as mine (or even
faster) on AMD chips.

I do think anyway that memcmp expansion in gcc should be disabled unless optimizing for
size or unless the size is known to be small, say 16 bytes.

See these posts:

http://gcc.gnu.org/ml/gcc/2002-10/msg01616.html (me)
http://gcc.gnu.org/ml/gcc/2002-10/msg01637.html (Roger)
http://sources.redhat.com/ml/libc-alpha/2002-10/msg00511.html (source)

Paolo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Slow memcmp for aligned strings on Pentium 3
@ 2003-04-04  6:22 Kevin Atkinson
  2003-04-04  8:52 ` Zack Weinberg
  0 siblings, 1 reply; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-04  6:22 UTC (permalink / raw)
  To: gcc

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4774 bytes --]


I was doing some tests with with memcmp and it seams that gcc 3.2.2 always
uses cmps, even for aligned strings.

I did some tests and discovered that using cmps was rather slow, compared 
to a simple loop and then a bswap and subtract at the end.

My method was is the following:

  int cmpa2(unsigned int * x, unsigned int * y, size_t size)
  {
    int i = 0;
    size /= 4;
    while (i < size && x[i] == y[i]) ++i;
    if (i == size) return 0;
    asm("bswap %0" : "+r"(x[i]));
    asm("bswap %0" : "+r"(y[i]));
    return x[i] - y[i];
  }

as oppose to:

  int cmpa(unsigned int * x, unsigned int * y, size_t size)
  {
    return __builtin_memcmp(x,y,size);
  }

Even though It takes integer pointers it really is comparing them as if 
they were byte strings.  I used integer points to avoid having to cast and 
to let the compiler know that the strings are 4 byte aligned. 
This method is 3-5 times as fast.  The longer the string the faster my 
method is.

I also wrote on specialized compare for comparing two ints AS IF they 
were a 4 byte string:

  int cmpi2(unsigned int x, unsigned int y)
  {
    asm("bswap %0" : "+r"(x));
    asm("bswap %0" : "+r"(y));
    return x - y;
  }

as oppose to:

  int cmpi(unsigned int x, unsigned int y) 
  {
    return __builtin_memcmp(&x,&y,4);
  }

Which was several order of magnitude faster.  It is 4 - 6 times faster.  
However I believe the function call is significant here as 
-fomit-frame-pointer increased the ratio (from 4 to 6).  And when I 
tried to account for the cost of the function call the results varies form 
any where to 10 to 30 times faster.

If a better memcmp is not already implemented in yet-to-be-released gcc I 
hope someone will seriously consider rewriting the memcpy implementation.  
Even though I only tested in on a Pentium 3 I am sure the results will 
also hold for a Pentium Classic based on the instruction timings.  I believe 
th bswap instruction was implemented in the 486. So only generic 386 code 
will have to use something other than bswap.

The test program is attached.  Here are the results when compiled with:
gcc -fomit-frame-pointer -O3 -march=pentium3 cmps.c (attribute noinline 
was used on test functions)

Memory compare int:
  2170000
  350000
  Speed up: 6.200000
Memory compare 16 bytes:
  4100000
  1240000
  Speed up: 3.306452
Memory compare 64 bytes:
  10930000
  2990000
  Speed up: 3.655518
Memory compare 256 bytes:
  37510000
  7700000
  Speed up: 4.871429

Here is the relevant assembly code:

cmpi: // integer as byte string compare using memcmp
        cld
        subl    $8, %esp
        movl    $4, %ecx
        movl    %esi, (%esp)
        leal    12(%esp), %esi
        movl    %edi, 4(%esp)
        leal    16(%esp), %edi
        repz
        cmpsb
        movl    4(%esp), %edi
        movl    (%esp), %esi
        seta    %dl
        setb    %cl
        addl    $8, %esp
        subb    %cl, %dl
        movsbl  %dl,%eax
        ret

cmpi: // interger as byte string compare my method
        movl    4(%esp), %eax
        movl    8(%esp), %ecx
        bswap %eax
        bswap %ecx
        subl    %ecx, %eax
        ret

cmpa: // aligned byte compare using memcmp
        cld
        subl    $8, %esp
        movl    20(%esp), %ecx
        movl    %esi, (%esp)
        movl    12(%esp), %esi
        cmpl    %ecx, %ecx
        movl    %edi, 4(%esp)
        movl    16(%esp), %edi
        repz
        cmpsb
        movl    4(%esp), %edi
        movl    (%esp), %esi
        seta    %dl
        setb    %cl
        addl    $8, %esp
        subb    %cl, %dl
        movsbl  %dl,%eax
        ret

cmpa2: // aligned byte compare: my method
        subl    $12, %esp
        movl    24(%esp), %ecx
        xorl    %edx, %edx
        movl    %esi, 4(%esp)
        movl    16(%esp), %esi
        shrl    $2, %ecx
        movl    %edi, 8(%esp)
        cmpl    %ecx, %edx
        movl    20(%esp), %edi
        movl    %ebx, (%esp)
        jae     .L6
        movl    (%edi), %eax
        cmpl    %eax, (%esi)
        je      .L9
.L6:
        xorl    %ebx, %ebx
        cmpl    %ecx, %edx
        je      .L4
        movl    (%esi,%edx,4), %eax
        bswap %eax
        movl    %eax, (%esi,%edx,4)
        movl    (%edi,%edx,4), %eax
        bswap %eax
        movl    %eax, (%edi,%edx,4)
        movl    (%esi,%edx,4), %ebx
        subl    %eax, %ebx
.L4:
        movl    %ebx, %eax
        movl    4(%esp), %esi
        movl    (%esp), %ebx
        movl    8(%esp), %edi
        addl    $12, %esp
        ret
        .p2align 4,,7
.L9:
        incl    %edx
        cmpl    %ecx, %edx
        jae     .L6
        movl    (%edi,%edx,4), %eax
        cmpl    %eax, (%esi,%edx,4)
        je      .L9
        jmp     .L6


--- 
http://kevin.atkinson.dhs.org

[-- Attachment #2: Type: TEXT/PLAIN, Size: 2172 bytes --]

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

__attribute__((noinline))
int cmpi(unsigned int x, unsigned int y) 
{
  return __builtin_memcmp(&x,&y,4);
}

__attribute__((noinline))
int cmpi2(unsigned int x, unsigned int y)
{
  asm("bswap %0" : "+r"(x));
  asm("bswap %0" : "+r"(y));
  return x - y;
}

__attribute__((noinline))
int cmpa(unsigned int * x, unsigned int * y, size_t size)
{
  return __builtin_memcmp(x,y,size);
}

__attribute__((noinline))
int cmpa2(unsigned int * x, unsigned int * y, size_t size)
{
  int i = 0;
  size /= 4;
  while (i < size && x[i] == y[i]) ++i;
  if (i == size) return 0;
  asm("bswap %0" : "+r"(x[i]));
  asm("bswap %0" : "+r"(y[i]));
  return x[i] - y[i];
}

int main()
{
  unsigned int x,y,i;
  int xa[256] = {}, ya[256] = {};
  int s,f;
  clock_t t;
  x = *(unsigned int *)("a  b");
  y = *(unsigned int *)("b  a");

  printf("Memory compare int:\n");
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpi(x,y);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpi2(x,y);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);


  printf("Memory compare 16 bytes:\n");
  xa[3] = 6;
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa(xa, ya, 16);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa2(xa, ya, 16);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);
  xa[3] = 0;

  xa[15] = 6;
  printf("Memory compare 64 bytes:\n");
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa(xa, ya, 64);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa2(xa, ya, 64);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);
  xa[15] = 0;

  xa[255] = 6;
  printf("Memory compare 256 bytes:\n");
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa(xa, ya, 256);
  s = clock() - t;
  printf("  %d\n", s);
  t = clock();
  for (i = 0; i != 0x1000000; ++i)
    cmpa2(xa, ya, 256);
  f = clock() - t;
  printf("  %d\n",f);
  printf("  Speed up: %f\n", (double)s/f);
}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Slow memcmp for aligned strings on Pentium 3
  2003-04-04  6:22 Kevin Atkinson
@ 2003-04-04  8:52 ` Zack Weinberg
  2003-04-04 15:16   ` Kevin Atkinson
  0 siblings, 1 reply; 14+ messages in thread
From: Zack Weinberg @ 2003-04-04  8:52 UTC (permalink / raw)
  To: Kevin Atkinson; +Cc: gcc

Kevin Atkinson <kevin@atkinson.dhs.org> writes:

> I was doing some tests with with memcmp and it seams that gcc 3.2.2 always
> uses cmps, even for aligned strings.
>
> I did some tests and discovered that using cmps was rather slow, compared 
> to a simple loop and then a bswap and subtract at the end.

Patches are welcome.

zw

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Slow memcmp for aligned strings on Pentium 3
  2003-04-04  8:52 ` Zack Weinberg
@ 2003-04-04 15:16   ` Kevin Atkinson
  0 siblings, 0 replies; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-04 15:16 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: gcc

On Thu, 3 Apr 2003, Zack Weinberg wrote:

> Kevin Atkinson <kevin@atkinson.dhs.org> writes:
> 
> > I was doing some tests with with memcmp and it seams that gcc 3.2.2 always
> > uses cmps, even for aligned strings.
> >
> > I did some tests and discovered that using cmps was rather slow, compared 
> > to a simple loop and then a bswap and subtract at the end.
> 
> Patches are welcome.
> 
The best I can give you is a memcmp implementation.  I know nothing about 
gcc internals.

-- 
http://kevin.atkinson.dhs.org

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2003-04-07 15:23 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-04-04 13:18 Slow memcmp for aligned strings on Pentium 3 Roger Sayle
2003-04-04 14:51 ` Kevin Atkinson
2003-04-04 17:08 ` Kevin Atkinson
2003-04-04 18:58   ` Marcel Cox
2003-04-04 19:27     ` Kevin Atkinson
  -- strict thread matches above, loose matches on Subject: below --
2003-04-04 17:13 Jerry Quinn
2003-04-04 17:29 ` Kevin Atkinson
2003-04-06  5:54   ` Jerry Quinn
2003-04-06 22:17     ` Kevin Atkinson
2003-04-07 17:30       ` Jerry Quinn
2003-04-04 14:33 Bonzini
2003-04-04  6:22 Kevin Atkinson
2003-04-04  8:52 ` Zack Weinberg
2003-04-04 15:16   ` Kevin Atkinson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).