* Re: Slow memcmp for aligned strings on Pentium 3 @ 2003-04-04 13:18 Roger Sayle 2003-04-04 14:51 ` Kevin Atkinson 2003-04-04 17:08 ` Kevin Atkinson 0 siblings, 2 replies; 14+ messages in thread From: Roger Sayle @ 2003-04-04 13:18 UTC (permalink / raw) To: Kevin Atkinson; +Cc: gcc Hi Kevin, > I did some tests and discovered that using cmps was rather slow, > compared to a simple loop and then a bswap and subtract at the end. I'm sure that GCC's memcmp implementations could be improved, but from reading the code examples in your patch it looks like you are always assuming that either the length is a multiple of four, or that the bytes following the memory sections to be compared contain identical values (i.e. you're hoping they're all zero). i.e., if p and q are suitably 4-byte aligned memset(p,"abcd",4); memset(q,"abef",4); memcmp(p,q,2) should compare equal but don't using bswaps and subtractions. Similarly, when two words mismatch their return value <0 or >0 should depend upon the first byte that differs, not the values of the bytes that come after it. I suspect it should be possible to fix your code to handle these termination conditions correctly, and a comparison of your routine's performance with these fixes vs. __builtin_memcmp would be of interest. Roger -- Roger Sayle, E-mail: roger@eyesopen.com OpenEye Scientific Software, WWW: http://www.eyesopen.com/ Suite 1107, 3600 Cerrillos Road, Tel: (+1) 505-473-7385 Santa Fe, New Mexico, 87507. Fax: (+1) 505-473-0833 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3 2003-04-04 13:18 Slow memcmp for aligned strings on Pentium 3 Roger Sayle @ 2003-04-04 14:51 ` Kevin Atkinson 2003-04-04 17:08 ` Kevin Atkinson 1 sibling, 0 replies; 14+ messages in thread From: Kevin Atkinson @ 2003-04-04 14:51 UTC (permalink / raw) To: Roger Sayle; +Cc: gcc On Thu, 3 Apr 2003, Roger Sayle wrote: > > Hi Kevin, > > I did some tests and discovered that using cmps was rather slow, > > compared to a simple loop and then a bswap and subtract at the end. > > I'm sure that GCC's memcmp implementations could be improved, but > from reading the code examples in your patch it looks like you > are always assuming that either the length is a multiple of four, > or that the bytes following the memory sections to be compared > contain identical values (i.e. you're hoping they're all zero). > > i.e., if p and q are suitably 4-byte aligned > > memset(p,"abcd",4); > memset(q,"abef",4); > memcmp(p,q,2) > > should compare equal but don't using bswaps and subtractions. > Similarly, when two words mismatch their return value <0 or > >0 should depend upon the first byte that differs, not the > values of the bytes that come after it. Your right. It can be fixed by a test if it is a multiple of 4 and if not do a byte wise comparison at the end. > I suspect it should be possible to fix your code to handle these > termination conditions correctly, and a comparison of your > routine's performance with these fixes vs. __builtin_memcmp > would be of interest. I will see what I can do. --- http://kevin.atkinson.dhs.org ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3 2003-04-04 13:18 Slow memcmp for aligned strings on Pentium 3 Roger Sayle 2003-04-04 14:51 ` Kevin Atkinson @ 2003-04-04 17:08 ` Kevin Atkinson 2003-04-04 18:58 ` Marcel Cox 1 sibling, 1 reply; 14+ messages in thread From: Kevin Atkinson @ 2003-04-04 17:08 UTC (permalink / raw) To: Roger Sayle; +Cc: gcc [-- Attachment #1: Type: TEXT/PLAIN, Size: 1831 bytes --] On Thu, 3 Apr 2003, Roger Sayle wrote: > I suspect it should be possible to fix your code to handle these > termination conditions correctly, and a comparison of your > routine's performance with these fixes vs. __builtin_memcmp > would be of interest. Here you go. Still assume the memory is alligned: This is what I did: int cmpa2(const unsigned int * x, const unsigned int * y, size_t size) { int i = 0; size_t s = size / 4; while (i < s && x[i] == y[i]) ++i; size -= i * 4; if (size == 0) return 0; // hopefully if this is inline expanded when size is known // the compiler can eliminate many of these conditionals else if (size >= 4) { // if original size % 4 == 0 this should // always be the case unsigned int xx = x[i], yy = y[i]; asm("bswap %0" : "+r"(xx)); asm("bswap %0" : "+r"(yy)); return xx - yy; } else { const unsigned char * xb = (const unsigned char *)(x + i); const unsigned char * yb = (const unsigned char *)(y + i); // if size is known at compile time then the compiler should be // able to select the correct choice at compile time switch (size) { case 1: return *xb - *yb; case 2: return ((xb[0] - yb[0]) << 8) + (xb[1] - yb[1]); case 3: return ((xb[0] - yb[0]) << 16) + ((xb[1] - yb[1]) << 8) + xb[2] - yb[2];} } } compiled with gcc -fomit-frame-pointer -O3 -march=pentium3 cmps.c Memory compare int: 2160000 360000 Speed up: 6.000000 Memory compare 15 bytes: 4120000 1550000 Speed up: 2.658065 Memory compare 16 bytes: 4150000 1310000 Speed up: 3.167939 Memory compare 64 bytes: 11240000 3470000 Speed up: 3.239193 Memory compare 256 bytes: 37930000 10050000 Speed up: 3.774129 Code and assembly output attached. -- http://kevin.atkinson.dhs.org [-- Attachment #2: Type: TEXT/PLAIN, Size: 3315 bytes --] #include <stdio.h> #include <time.h> #include <stdlib.h> __attribute__((noinline)) int cmpi(unsigned int x, unsigned int y) { return __builtin_memcmp(&x,&y,4); } __attribute__((noinline)) int cmpi2(unsigned int x, unsigned int y) { asm("bswap %0" : "+r"(x)); asm("bswap %0" : "+r"(y)); return x - y; } __attribute__((noinline)) int cmpa(const unsigned int * x, const unsigned int * y, size_t size) { return __builtin_memcmp(x,y,size); } __attribute__((noinline)) int cmpa2(const unsigned int * x, const unsigned int * y, size_t size) { int i = 0; size_t s = size / 4; while (i < s && x[i] == y[i]) ++i; size -= i * 4; if (size == 0) return 0; // hopefully if this is inline expanded when size is known // the compiler can eliminate many of these conditionals else if (size >= 4) { // if original size % 4 == 0 this should // always be the case unsigned int xx = x[i], yy = y[i]; asm("bswap %0" : "+r"(xx)); asm("bswap %0" : "+r"(yy)); return xx - yy; } else { const unsigned char * xb = (const unsigned char *)(x + i); const unsigned char * yb = (const unsigned char *)(y + i); // if size is known at compile time then the compiler should be // able to select the correct choice at compile time switch (size) { case 1: return *xb - *yb; case 2: return ((xb[0] - yb[0]) << 8) + (xb[1] - yb[1]); case 3: return ((xb[0] - yb[0]) << 16) + ((xb[1] - yb[1]) << 8) + xb[2] - yb[2];} } } int main() { unsigned int x,y,i; int xa[256] = {}, ya[256] = {}; int s,f; clock_t t; x = *(unsigned int *)("a b"); y = *(unsigned int *)("b a"); printf("Memory compare int:\n"); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpi(x,y); s = clock() - t; printf(" %d\n", s); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpi2(x,y); f = clock() - t; printf(" %d\n",f); printf(" Speed up: %f\n", (double)s/f); printf("Memory compare 15 bytes:\n"); xa[3] = 6; t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa(xa, ya, 15); s = clock() - t; printf(" %d\n", s); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa2(xa, ya, 15); f = clock() - t; printf(" %d\n",f); printf(" Speed up: %f\n", (double)s/f); xa[3] = 0; printf("Memory compare 16 bytes:\n"); xa[3] = 6; t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa(xa, ya, 16); s = clock() - t; printf(" %d\n", s); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa2(xa, ya, 16); f = clock() - t; printf(" %d\n",f); printf(" Speed up: %f\n", (double)s/f); xa[3] = 0; printf("Memory compare 64 bytes:\n"); xa[15] = 6; t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa(xa, ya, 64); s = clock() - t; printf(" %d\n", s); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa2(xa, ya, 64); f = clock() - t; printf(" %d\n",f); printf(" Speed up: %f\n", (double)s/f); xa[15] = 0; printf("Memory compare 256 bytes:\n"); xa[63] = 6; t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa(xa, ya, 256); s = clock() - t; printf(" %d\n", s); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa2(xa, ya, 256); f = clock() - t; printf(" %d\n",f); printf(" Speed up: %f\n", (double)s/f); } [-- Attachment #3: Type: TEXT/PLAIN, Size: 7710 bytes --] .file "cmps.c" .text .align 2 .p2align 4,,15 .globl cmpi .type cmpi,@function cmpi: cld subl $8, %esp movl $4, %ecx movl %esi, (%esp) leal 12(%esp), %esi movl %edi, 4(%esp) leal 16(%esp), %edi repz cmpsb movl 4(%esp), %edi movl (%esp), %esi seta %dl setb %cl addl $8, %esp subb %cl, %dl movsbl %dl,%eax ret .Lfe1: .size cmpi,.Lfe1-cmpi .align 2 .p2align 4,,15 .globl cmpi2 .type cmpi2,@function cmpi2: movl 4(%esp), %eax movl 8(%esp), %ecx #APP bswap %eax bswap %ecx #NO_APP subl %ecx, %eax ret .Lfe2: .size cmpi2,.Lfe2-cmpi2 .align 2 .p2align 4,,15 .globl cmpa .type cmpa,@function cmpa: cld subl $8, %esp movl 20(%esp), %ecx movl %esi, (%esp) movl 12(%esp), %esi cmpl %ecx, %ecx movl %edi, 4(%esp) movl 16(%esp), %edi repz cmpsb movl 4(%esp), %edi movl (%esp), %esi seta %dl setb %cl addl $8, %esp subb %cl, %dl movsbl %dl,%eax ret .Lfe3: .size cmpa,.Lfe3-cmpa .align 2 .p2align 4,,15 .globl cmpa2 .type cmpa2,@function cmpa2: pushl %edi xorl %edx, %edx pushl %esi pushl %ebx movl 24(%esp), %ecx movl 16(%esp), %esi movl 20(%esp), %edi movl %ecx, %ebx shrl $2, %ebx cmpl %ebx, %edx jae .L6 movl (%edi), %eax cmpl %eax, (%esi) je .L9 .L6: leal 0(,%edx,4), %eax subl %eax, %ecx jne .L10 xorl %ecx, %ecx .L4: popl %ebx movl %ecx, %eax popl %esi popl %edi ret .p2align 4,,7 .L10: cmpl $3, %ecx jbe .L12 movl (%esi,%edx,4), %ecx movl (%edi,%edx,4), %eax #APP bswap %ecx bswap %eax #NO_APP .L22: subl %eax, %ecx jmp .L4 .p2align 4,,7 .L12: cmpl $2, %ecx leal (%eax,%esi), %ebx leal (%eax,%edi), %esi je .L16 cmpl $2, %ecx ja .L20 cmpl $1, %ecx jne .L4 movzbl (%ebx), %ecx movzbl (%esi), %eax jmp .L22 .L20: cmpl $3, %ecx jne .L4 movzbl 1(%esi), %eax movzbl (%ebx), %ecx movzbl (%esi), %edx movzbl 1(%ebx), %edi subl %edx, %ecx subl %eax, %edi movzbl 2(%esi), %eax sall $8, %edi sall $16, %ecx addl %edi, %ecx movzbl 2(%ebx), %edi addl %edi, %ecx jmp .L22 .L16: movzbl (%ebx), %ecx movzbl (%esi), %edx movzbl 1(%ebx), %edi movzbl 1(%esi), %eax subl %edx, %ecx sall $8, %ecx subl %eax, %edi addl %edi, %ecx jmp .L4 .p2align 4,,7 .L9: incl %edx cmpl %ebx, %edx jae .L6 movl (%edi,%edx,4), %eax cmpl %eax, (%esi,%edx,4) je .L9 jmp .L6 .Lfe4: .size cmpa2,.Lfe4-cmpa2 .section .rodata.str1.1,"aMS",@progbits,1 .LC0: .string "a b" .LC2: .string "Memory compare int:" .LC1: .string "b a" .LC3: .string " %d\n" .LC4: .string " Speed up: %f\n" .LC5: .string "Memory compare 15 bytes:" .LC6: .string "Memory compare 16 bytes:" .LC7: .string "Memory compare 64 bytes:" .LC8: .string "Memory compare 256 bytes:" .text .align 2 .p2align 4,,15 .globl main .type main,@function main: pushl %ebp xorl %eax, %eax movl %esp, %ebp pushl %edi movl $256, %ecx leal -1048(%ebp), %edi pushl %esi pushl %ebx subl $2092, %esp andl $-16, %esp cld rep stosl movl $256, %ecx leal -2072(%ebp), %edi rep stosl xorl %edi, %edi movl .LC0, %esi movl $.LC2, (%esp) movl .LC1, %ebx call puts call clock movl %eax, -2084(%ebp) .p2align 4,,15 .L28: movl %esi, (%esp) incl %edi movl %ebx, 4(%esp) call cmpi cmpl $16777216, %edi jne .L28 call clock movl %eax, -2080(%ebp) xorl %edi, %edi movl -2084(%ebp), %eax subl %eax, -2080(%ebp) movl $.LC3, (%esp) movl -2080(%ebp), %eax movl %eax, 4(%esp) call printf call clock movl %eax, -2084(%ebp) .L33: movl %esi, (%esp) incl %edi movl %ebx, 4(%esp) call cmpi2 cmpl $16777216, %edi jne .L33 call clock movl $.LC3, (%esp) movl %eax, %esi movl -2084(%ebp), %eax xorl %edi, %edi subl %eax, %esi movl %esi, 4(%esp) call printf fildl -2080(%ebp) movl $.LC4, (%esp) pushl %esi fildl (%esp) addl $4, %esp fdivrp %st, %st(1) fstpl 4(%esp) call printf movl $.LC5, (%esp) call puts movl $6, %eax movl %eax, -1036(%ebp) call clock movl %eax, -2084(%ebp) .L38: movl $15, 8(%esp) leal -1048(%ebp), %ecx incl %edi movl %ecx, (%esp) leal -2072(%ebp), %edx movl %edx, 4(%esp) call cmpa cmpl $16777216, %edi jne .L38 call clock movl %eax, -2080(%ebp) xorl %edi, %edi movl -2084(%ebp), %eax subl %eax, -2080(%ebp) movl $.LC3, (%esp) movl -2080(%ebp), %ebx movl %ebx, 4(%esp) call printf call clock movl %eax, -2084(%ebp) .L43: movl $15, 8(%esp) leal -1048(%ebp), %edx incl %edi movl %edx, (%esp) leal -2072(%ebp), %esi movl %esi, 4(%esp) call cmpa2 cmpl $16777216, %edi jne .L43 call clock movl $.LC3, (%esp) movl %eax, %esi movl -2084(%ebp), %eax xorl %edi, %edi subl %eax, %esi movl %esi, 4(%esp) call printf fildl -2080(%ebp) movl $.LC4, (%esp) pushl %esi fildl (%esp) addl $4, %esp fdivrp %st, %st(1) fstpl 4(%esp) call printf movl %edi, -1036(%ebp) xorl %edi, %edi movl $.LC6, (%esp) call puts movl $6, %eax movl %eax, -1036(%ebp) call clock movl %eax, -2084(%ebp) .L48: movl $16, 8(%esp) leal -1048(%ebp), %ecx incl %edi movl %ecx, (%esp) leal -2072(%ebp), %eax movl %eax, 4(%esp) call cmpa cmpl $16777216, %edi jne .L48 call clock movl %eax, -2080(%ebp) xorl %edi, %edi movl -2084(%ebp), %eax subl %eax, -2080(%ebp) movl $.LC3, (%esp) movl -2080(%ebp), %ebx movl %ebx, 4(%esp) call printf call clock movl %eax, -2084(%ebp) .L53: movl $16, 8(%esp) leal -2072(%ebp), %edx incl %edi movl %edx, 4(%esp) leal -1048(%ebp), %esi movl %esi, (%esp) call cmpa2 cmpl $16777216, %edi jne .L53 call clock movl $.LC3, (%esp) movl %eax, %esi movl -2084(%ebp), %eax xorl %edi, %edi subl %eax, %esi movl %esi, 4(%esp) call printf fildl -2080(%ebp) movl $.LC4, (%esp) pushl %esi fildl (%esp) addl $4, %esp fdivrp %st, %st(1) fstpl 4(%esp) call printf movl %edi, -1036(%ebp) xorl %edi, %edi movl $.LC7, (%esp) call puts movl $6, %eax movl %eax, -988(%ebp) call clock movl %eax, -2084(%ebp) .L58: movl $64, 8(%esp) leal -1048(%ebp), %ecx incl %edi movl %ecx, (%esp) leal -2072(%ebp), %ebx movl %ebx, 4(%esp) call cmpa cmpl $16777216, %edi jne .L58 call clock movl %eax, -2080(%ebp) xorl %edi, %edi movl -2084(%ebp), %eax subl %eax, -2080(%ebp) movl $.LC3, (%esp) movl -2080(%ebp), %edx movl %edx, 4(%esp) call printf call clock movl %eax, -2084(%ebp) .L63: movl $64, 8(%esp) leal -2072(%ebp), %eax incl %edi movl %eax, 4(%esp) leal -1048(%ebp), %esi movl %esi, (%esp) call cmpa2 cmpl $16777216, %edi jne .L63 call clock movl $.LC3, (%esp) movl -2084(%ebp), %ecx movl %eax, %edi xorl %ebx, %ebx subl %ecx, %edi movl %edi, 4(%esp) call printf fildl -2080(%ebp) movl $.LC4, (%esp) pushl %edi movl $6, %edi fildl (%esp) addl $4, %esp fdivrp %st, %st(1) fstpl 4(%esp) call printf movl %ebx, -988(%ebp) movl $.LC8, (%esp) call puts movl %edi, -796(%ebp) xorl %edi, %edi call clock movl %eax, -2084(%ebp) .L68: movl $256, 8(%esp) leal -2072(%ebp), %edx incl %edi movl %edx, 4(%esp) leal -1048(%ebp), %esi movl %esi, (%esp) call cmpa cmpl $16777216, %edi jne .L68 call clock movl %eax, -2080(%ebp) xorl %edi, %edi movl -2084(%ebp), %eax subl %eax, -2080(%ebp) movl $.LC3, (%esp) movl -2080(%ebp), %eax movl %eax, 4(%esp) call printf call clock movl %eax, -2084(%ebp) .L73: movl $256, 8(%esp) leal -1048(%ebp), %ecx incl %edi movl %ecx, (%esp) leal -2072(%ebp), %ebx movl %ebx, 4(%esp) call cmpa2 cmpl $16777216, %edi jne .L73 call clock movl $.LC3, (%esp) movl %eax, %edi movl -2084(%ebp), %eax subl %eax, %edi movl %edi, 4(%esp) call printf fildl -2080(%ebp) movl $.LC4, (%esp) pushl %edi fildl (%esp) addl $4, %esp fdivrp %st, %st(1) fstpl 4(%esp) call printf leal -12(%ebp), %esp popl %ebx popl %esi popl %edi popl %ebp ret .Lfe5: .size main,.Lfe5-main .ident "GCC: (GNU) 3.2 20020903 (Red Hat Linux 8.0 3.2-7)" ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3 2003-04-04 17:08 ` Kevin Atkinson @ 2003-04-04 18:58 ` Marcel Cox 2003-04-04 19:27 ` Kevin Atkinson 0 siblings, 1 reply; 14+ messages in thread From: Marcel Cox @ 2003-04-04 18:58 UTC (permalink / raw) To: gcc "Kevin Atkinson" <kevina@gnu.org> wrote in message news:Pine.LNX.4.44.0304040923350.912-300000@kevin-pc.atkinson.dhs.org... > Here you go. Still assume the memory is alligned: > > This is what I did: > > int cmpa2(const unsigned int * x, const unsigned int * y, size_t size) > { > int i = 0; > size_t s = size / 4; > while (i < s && x[i] == y[i]) ++i; > size -= i * 4; > if (size == 0) return 0; > // hopefully if this is inline expanded when size is known > // the compiler can eliminate many of these conditionals > else if (size >= 4) { // if original size % 4 == 0 this should > // always be the case > unsigned int xx = x[i], yy = y[i]; > asm("bswap %0" : "+r"(xx)); > asm("bswap %0" : "+r"(yy)); > return xx - yy; > } else { > const unsigned char * xb = (const unsigned char *)(x + i); > const unsigned char * yb = (const unsigned char *)(y + i); > // if size is known at compile time then the compiler should be > // able to select the correct choice at compile time > switch (size) { > case 1: > return *xb - *yb; > case 2: > return ((xb[0] - yb[0]) << 8) + (xb[1] - yb[1]); > case 3: > return ((xb[0] - yb[0]) << 16) + ((xb[1] - yb[1]) << 8) > + xb[2] - yb[2];} > } > } There is still a flaw in that you assume that the difference of 2 unsigned integers will return the correct signed result. This will not work if for instance in your "return xx - yy" statement, xx for instance is 0xf0000000 and yy is 0x100000. In that case, xx is clearly greater than yy, but the difference is 0xe0000000 which cast to signed integer will be a negative number and will indicate that xx is smaller than yy which is clearly wrong. Marcel ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3 2003-04-04 18:58 ` Marcel Cox @ 2003-04-04 19:27 ` Kevin Atkinson 0 siblings, 0 replies; 14+ messages in thread From: Kevin Atkinson @ 2003-04-04 19:27 UTC (permalink / raw) To: gcc [-- Attachment #1: Type: TEXT/PLAIN, Size: 1441 bytes --] On Fri, 4 Apr 2003, Marcel Cox wrote: > There is still a flaw in that you assume that the difference of 2 unsigned > integers will return the correct signed result. This will not work if for > instance in your "return xx - yy" statement, xx for instance is 0xf0000000 > and yy is 0x100000. In that case, xx is clearly greater than yy, but the > difference is 0xe0000000 which cast to signed integer will be a negative > number and will indicate that xx is smaller than yy which is clearly wrong. Your right. I relized that after I posted the code. Here is a better version which also faster when size % 4 != 0: int cmpa2(const unsigned int * x, const unsigned int * y, size_t size) { int i = 0; size_t s = size / 4; while (i < s && x[i] == y[i]) ++i; size -= i * 4; if (size == 0) return 0; unsigned int xx = x[i], yy = y[i]; asm("bswap %0" : "+r"(xx)); asm("bswap %0" : "+r"(yy)); if (size >= 4) { return xx < yy ? -1 : 1; } else { unsigned int dis = 8*(4-size); xx >>= dis; yy >>= dis; return xx - yy; } } Test results (I removed the integer as byte strings functions): Memory compare 15 bytes: 4120000 1370000 Speed up: 3.007299 Memory compare 16 bytes: 4140000 1460000 Speed up: 2.835616 Memory compare 64 bytes: 11220000 3920000 Speed up: 2.862245 Memory compare 256 bytes: 38550000 12410000 Speed up: 3.106366 Code attached. -- http://kevin.atkinson.dhs.org [-- Attachment #2: Type: TEXT/PLAIN, Size: 2102 bytes --] #include <stdio.h> #include <time.h> #include <stdlib.h> __attribute__((noinline)) int cmpa(const unsigned int * x, const unsigned int * y, size_t size) { return __builtin_memcmp(x,y,size); } __attribute__((noinline)) int cmpa2(const unsigned int * x, const unsigned int * y, size_t size) { int i = 0; size_t s = size / 4; while (i < s && x[i] == y[i]) ++i; size -= i * 4; if (size == 0) return 0; unsigned int xx = x[i], yy = y[i]; asm("bswap %0" : "+r"(xx)); asm("bswap %0" : "+r"(yy)); if (size >= 4) { return xx < yy ? -1 : 1; } else { unsigned int dis = 8*(4-size); xx >>= dis; yy >>= dis; return xx - yy; } } int main() { unsigned int x,y,i; int xa[256] = {}, ya[256] = {}; int s,f; clock_t t; printf("Memory compare 15 bytes:\n"); xa[3] = 6; t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa(xa, ya, 15); s = clock() - t; printf(" %d\n", s); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa2(xa, ya, 15); f = clock() - t; printf(" %d\n",f); printf(" Speed up: %f\n", (double)s/f); xa[3] = 0; printf("Memory compare 16 bytes:\n"); xa[3] = 6; t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa(xa, ya, 16); s = clock() - t; printf(" %d\n", s); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa2(xa, ya, 16); f = clock() - t; printf(" %d\n",f); printf(" Speed up: %f\n", (double)s/f); xa[3] = 0; printf("Memory compare 64 bytes:\n"); xa[15] = 6; t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa(xa, ya, 64); s = clock() - t; printf(" %d\n", s); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa2(xa, ya, 64); f = clock() - t; printf(" %d\n",f); printf(" Speed up: %f\n", (double)s/f); xa[15] = 0; printf("Memory compare 256 bytes:\n"); xa[63] = 6; t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa(xa, ya, 256); s = clock() - t; printf(" %d\n", s); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa2(xa, ya, 256); f = clock() - t; printf(" %d\n",f); printf(" Speed up: %f\n", (double)s/f); } [-- Attachment #3: Type: TEXT/PLAIN, Size: 5496 bytes --] .file "cmps.c" .text .align 2 .p2align 4,,15 .globl cmpa .type cmpa,@function cmpa: cld subl $8, %esp movl 20(%esp), %ecx movl %esi, (%esp) movl 12(%esp), %esi cmpl %ecx, %ecx movl %edi, 4(%esp) movl 16(%esp), %edi repz cmpsb movl 4(%esp), %edi movl (%esp), %esi seta %dl setb %cl addl $8, %esp subb %cl, %dl movsbl %dl,%eax ret .Lfe1: .size cmpa,.Lfe1-cmpa .align 2 .p2align 4,,15 .globl cmpa2 .type cmpa2,@function cmpa2: pushl %edi xorl %edx, %edx pushl %esi pushl %ebx movl 24(%esp), %esi movl 16(%esp), %ebx movl 20(%esp), %edi movl %esi, %ecx shrl $2, %ecx cmpl %ecx, %edx jae .L4 movl (%edi), %eax cmpl %eax, (%ebx) je .L7 .L4: leal 0(,%edx,4), %ecx subl %ecx, %esi jne .L8 xorl %edx, %edx .L2: popl %ebx movl %edx, %eax popl %esi popl %edi ret .p2align 4,,7 .L8: movl (%ebx,%edx,4), %ebx movl (%edi,%edx,4), %eax #APP bswap %ebx bswap %eax #NO_APP cmpl $3, %esi jbe .L9 cmpl %eax, %ebx sbbl %edx, %edx orl $1, %edx jmp .L2 .p2align 4,,7 .L9: movl $4, %ecx subl %esi, %ecx sall $3, %ecx movl %ebx, %edx shrl %cl, %edx shrl %cl, %eax subl %eax, %edx jmp .L2 .p2align 4,,7 .L7: incl %edx cmpl %ecx, %edx jae .L4 movl (%edi,%edx,4), %eax cmpl %eax, (%ebx,%edx,4) je .L7 jmp .L4 .Lfe2: .size cmpa2,.Lfe2-cmpa2 .section .rodata.str1.1,"aMS",@progbits,1 .LC0: .string "Memory compare 15 bytes:" .LC1: .string " %d\n" .LC2: .string " Speed up: %f\n" .LC3: .string "Memory compare 16 bytes:" .LC4: .string "Memory compare 64 bytes:" .LC5: .string "Memory compare 256 bytes:" .text .align 2 .p2align 4,,15 .globl main .type main,@function main: pushl %ebp xorl %eax, %eax movl %esp, %ebp pushl %edi movl $256, %ecx pushl %esi leal -1048(%ebp), %esi movl %esi, %edi pushl %ebx leal -2072(%ebp), %ebx subl $2092, %esp cld andl $-16, %esp rep stosl movl $256, %ecx movl %ebx, %edi rep stosl xorl %edi, %edi movl $.LC0, (%esp) call puts movl $6, %eax movl %eax, -1036(%ebp) call clock movl %eax, -2084(%ebp) .p2align 4,,15 .L19: movl %esi, (%esp) incl %edi movl %ebx, 4(%esp) movl $15, 8(%esp) call cmpa cmpl $16777216, %edi jne .L19 call clock movl %eax, -2080(%ebp) xorl %edi, %edi movl -2084(%ebp), %eax subl %eax, -2080(%ebp) movl $.LC1, (%esp) movl -2080(%ebp), %eax movl %eax, 4(%esp) call printf call clock movl %eax, -2084(%ebp) .L24: movl %esi, (%esp) incl %edi movl %ebx, 4(%esp) movl $15, 8(%esp) call cmpa2 cmpl $16777216, %edi jne .L24 call clock movl $.LC1, (%esp) movl -2084(%ebp), %ecx movl %eax, %edi subl %ecx, %edi movl %edi, 4(%esp) call printf fildl -2080(%ebp) movl $.LC2, (%esp) pushl %edi xorl %edi, %edi fildl (%esp) addl $4, %esp fdivrp %st, %st(1) fstpl 4(%esp) call printf xorl %edx, %edx movl %edx, -1036(%ebp) movl $.LC3, (%esp) call puts movl $6, %eax movl %eax, -1036(%ebp) call clock movl %eax, -2084(%ebp) .L29: movl %esi, (%esp) incl %edi movl %ebx, 4(%esp) movl $16, 8(%esp) call cmpa cmpl $16777216, %edi jne .L29 call clock movl $.LC1, (%esp) movl %eax, -2080(%ebp) movl -2084(%ebp), %eax subl %eax, -2080(%ebp) movl -2080(%ebp), %edi movl %edi, 4(%esp) xorl %edi, %edi call printf call clock movl %eax, -2084(%ebp) .L34: movl %esi, (%esp) incl %edi movl %ebx, 4(%esp) movl $16, 8(%esp) call cmpa2 cmpl $16777216, %edi jne .L34 call clock movl $.LC1, (%esp) movl %eax, %edi movl -2084(%ebp), %eax subl %eax, %edi movl %edi, 4(%esp) call printf fildl -2080(%ebp) movl $.LC2, (%esp) pushl %edi xorl %edi, %edi fildl (%esp) addl $4, %esp fdivrp %st, %st(1) fstpl 4(%esp) call printf xorl %edx, %edx movl %edx, -1036(%ebp) movl $.LC4, (%esp) call puts movl $6, %eax movl %eax, -988(%ebp) call clock movl %eax, -2084(%ebp) .L39: movl %esi, (%esp) incl %edi movl %ebx, 4(%esp) movl $64, 8(%esp) call cmpa cmpl $16777216, %edi jne .L39 call clock movl %eax, -2080(%ebp) xorl %edi, %edi movl -2084(%ebp), %eax subl %eax, -2080(%ebp) movl $.LC1, (%esp) movl -2080(%ebp), %eax movl %eax, 4(%esp) call printf call clock movl %eax, -2084(%ebp) .L44: movl %esi, (%esp) incl %edi movl %ebx, 4(%esp) movl $64, 8(%esp) call cmpa2 cmpl $16777216, %edi jne .L44 call clock movl $.LC1, (%esp) movl %eax, %edi movl -2084(%ebp), %eax subl %eax, %edi movl %edi, 4(%esp) call printf fildl -2080(%ebp) movl $.LC2, (%esp) pushl %edi xorl %edi, %edi fildl (%esp) addl $4, %esp fdivrp %st, %st(1) fstpl 4(%esp) call printf movl %edi, -988(%ebp) xorl %edi, %edi movl $.LC5, (%esp) call puts movl $6, %ecx movl %ecx, -796(%ebp) call clock movl %eax, -2084(%ebp) .L49: movl %esi, (%esp) incl %edi movl %ebx, 4(%esp) movl $256, 8(%esp) call cmpa cmpl $16777216, %edi jne .L49 call clock movl %eax, -2080(%ebp) xorl %edi, %edi movl -2084(%ebp), %eax subl %eax, -2080(%ebp) movl $.LC1, (%esp) movl -2080(%ebp), %edx movl %edx, 4(%esp) call printf call clock movl %eax, -2084(%ebp) .L54: movl %esi, (%esp) incl %edi movl %ebx, 4(%esp) movl $256, 8(%esp) call cmpa2 cmpl $16777216, %edi jne .L54 call clock movl $.LC1, (%esp) movl %eax, %ebx movl -2084(%ebp), %eax subl %eax, %ebx movl %ebx, 4(%esp) call printf fildl -2080(%ebp) movl $.LC2, (%esp) pushl %ebx fildl (%esp) addl $4, %esp fdivrp %st, %st(1) fstpl 4(%esp) call printf leal -12(%ebp), %esp popl %ebx popl %esi popl %edi popl %ebp ret .Lfe3: .size main,.Lfe3-main .ident "GCC: (GNU) 3.2 20020903 (Red Hat Linux 8.0 3.2-7)" ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3 @ 2003-04-04 17:13 Jerry Quinn 2003-04-04 17:29 ` Kevin Atkinson 0 siblings, 1 reply; 14+ messages in thread From: Jerry Quinn @ 2003-04-04 17:13 UTC (permalink / raw) To: gcc I just tried the same benchmark on a Pentium 4 out of curiosity. Slightly different results: Memory compare int: 10000 130000 Speed up: 0.076923 Memory compare 15 bytes: 10000 370000 Speed up: 0.027027 Memory compare 16 bytes: 20000 330000 Speed up: 0.060606 Memory compare 64 bytes: 10000 1040000 Speed up: 0.009615 Memory compare 256 bytes: 20000 2300000 Speed up: 0.008696 Perhaps this is to be expected since the routine uses shifts. Jerry Quinn ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3 2003-04-04 17:13 Jerry Quinn @ 2003-04-04 17:29 ` Kevin Atkinson 2003-04-06 5:54 ` Jerry Quinn 0 siblings, 1 reply; 14+ messages in thread From: Kevin Atkinson @ 2003-04-04 17:29 UTC (permalink / raw) To: Jerry Quinn; +Cc: gcc On Fri, 4 Apr 2003, Jerry Quinn wrote: > I just tried the same benchmark on a Pentium 4 out of curiosity. Slightly > different results: > > Memory compare int: > 10000 > 130000 > Speed up: 0.076923 > Memory compare 15 bytes: > 10000 > 370000 > Speed up: 0.027027 > Memory compare 16 bytes: > 20000 > 330000 > Speed up: 0.060606 > Memory compare 64 bytes: > 10000 > 1040000 > Speed up: 0.009615 > Memory compare 256 bytes: > 20000 > 2300000 > Speed up: 0.008696 > > Perhaps this is to be expected since the routine uses shifts. The shift are only used in the case size is not divisible by 4. It seams that on the Pentium 4 cmps is the way to go. You might also want to increase the number of loop iterations to get more meaning full results due the limited precision of clock(). --- http://kevin.atkinson.dhs.org ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3 2003-04-04 17:29 ` Kevin Atkinson @ 2003-04-06 5:54 ` Jerry Quinn 2003-04-06 22:17 ` Kevin Atkinson 0 siblings, 1 reply; 14+ messages in thread From: Jerry Quinn @ 2003-04-06 5:54 UTC (permalink / raw) To: Kevin Atkinson; +Cc: gcc Kevin Atkinson writes: > On Fri, 4 Apr 2003, Jerry Quinn wrote: > > > I just tried the same benchmark on a Pentium 4 out of curiosity. Slightly > > different results: > > > > Memory compare int: > > 10000 > > 130000 > > Speed up: 0.076923 > > Memory compare 15 bytes: > > 10000 > > 370000 > > Speed up: 0.027027 > > Memory compare 16 bytes: > > 20000 > > 330000 > > Speed up: 0.060606 > > Memory compare 64 bytes: > > 10000 > > 1040000 > > Speed up: 0.009615 > > Memory compare 256 bytes: > > 20000 > > 2300000 > > Speed up: 0.008696 > > > > Perhaps this is to be expected since the routine uses shifts. > > The shift are only used in the case size is not divisible by 4. It seams > that on the Pentium 4 cmps is the way to go. You might also want to > increase the number of loop iterations to get more meaning full results > due the limited precision of clock(). Adding iterations didn't change the relative scores significantly. It still loses big on P4. It also loses big on Athlon. Here are Athlon results using the later version you posted with 10x iterations: jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -march=athlon cmps.c jlquinn@smaug:~/gcc/test$ ./a.out Memory compare 15 bytes: 310000 5810000 Speed up: 0.053356 Memory compare 16 bytes: 300000 5290000 Speed up: 0.056711 Memory compare 64 bytes: 460000 13770000 Speed up: 0.033406 Memory compare 256 bytes: 470000 Jerry Quinn ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3 2003-04-06 5:54 ` Jerry Quinn @ 2003-04-06 22:17 ` Kevin Atkinson 2003-04-07 17:30 ` Jerry Quinn 0 siblings, 1 reply; 14+ messages in thread From: Kevin Atkinson @ 2003-04-06 22:17 UTC (permalink / raw) To: gcc, Jerry Quinn On Sat, 5 Apr 2003, Jerry Quinn wrote: > Kevin Atkinson writes: > > On Fri, 4 Apr 2003, Jerry Quinn wrote: > > > > > I just tried the same benchmark on a Pentium 4 out of curiosity. Slightly > > > different results: > > > > > > Memory compare int: > > > 10000 > > > 130000 > > > Speed up: 0.076923 > > > Memory compare 15 bytes: > > > 10000 > > > 370000 > > > Speed up: 0.027027 > > > Memory compare 16 bytes: > > > 20000 > > > 330000 > > > Speed up: 0.060606 > > > Memory compare 64 bytes: > > > 10000 > > > 1040000 > > > Speed up: 0.009615 > > > Memory compare 256 bytes: > > > 20000 > > > 2300000 > > > Speed up: 0.008696 > > > > > > Perhaps this is to be expected since the routine uses shifts. > > > > The shift are only used in the case size is not divisible by 4. It seams > > that on the Pentium 4 cmps is the way to go. You might also want to > > increase the number of loop iterations to get more meaning full results > > due the limited precision of clock(). > > Adding iterations didn't change the relative scores significantly. It > still loses big on P4. It also loses big on Athlon. Here are Athlon > results using the later version you posted with 10x iterations: > > jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -march=athlon cmps.c > jlquinn@smaug:~/gcc/test$ ./a.out > Memory compare 15 bytes: > 310000 > 5810000 > Speed up: 0.053356 > Memory compare 16 bytes: > 300000 > 5290000 > Speed up: 0.056711 > Memory compare 64 bytes: > 460000 > 13770000 > Speed up: 0.033406 > Memory compare 256 bytes: > 470000 This is extremely interesting. Does anyone have any documentation on cmps behavior on P4 and Athlon? It could be that the processor is somehow "caching" the results of cmps. Maybe it has to do with the fact that the strings are all 0 except for the end or because the strings do not change. Or maybe cmps is just extremely fast, but how? Or it could be that the loop needs unrolling for better pipeline performance. I don't have a P4 or Athlon so if someone could play around with my code by testing by testing some of my theories I would appreciate it. I just ran the test on a Pentium MMX and i got similar results as I did for my P3. So at very least it seams that something similar to my code is the way to go for Pentiums up to P3. --- http://kevin.atkinson.dhs.org ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3 2003-04-06 22:17 ` Kevin Atkinson @ 2003-04-07 17:30 ` Jerry Quinn 0 siblings, 0 replies; 14+ messages in thread From: Jerry Quinn @ 2003-04-07 17:30 UTC (permalink / raw) To: Kevin Atkinson; +Cc: gcc Kevin Atkinson writes: > On Sat, 5 Apr 2003, Jerry Quinn wrote: > > > Kevin Atkinson writes: > > > On Fri, 4 Apr 2003, Jerry Quinn wrote: > > > > > > > I just tried the same benchmark on a Pentium 4 out of curiosity. Slightly > > > > different results: > > > > > > > > Memory compare int: > > > > 10000 > > > > 130000 > > > > Speed up: 0.076923 > > > > Memory compare 15 bytes: > > > > 10000 > > > > 370000 > > > > Speed up: 0.027027 > > > > Memory compare 16 bytes: > > > > 20000 > > > > 330000 > > > > Speed up: 0.060606 > > > > Memory compare 64 bytes: > > > > 10000 > > > > 1040000 > > > > Speed up: 0.009615 > > > > Memory compare 256 bytes: > > > > 20000 > > > > 2300000 > > > > Speed up: 0.008696 > > > > > > > > Perhaps this is to be expected since the routine uses shifts. > > > > > > The shift are only used in the case size is not divisible by 4. It seams > > > that on the Pentium 4 cmps is the way to go. You might also want to > > > increase the number of loop iterations to get more meaning full results > > > due the limited precision of clock(). > > > > Adding iterations didn't change the relative scores significantly. It > > still loses big on P4. It also loses big on Athlon. Here are Athlon > > results using the later version you posted with 10x iterations: > > > > jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -march=athlon cmps.c > > jlquinn@smaug:~/gcc/test$ ./a.out > > Memory compare 15 bytes: > > 310000 > > 5810000 > > Speed up: 0.053356 > > Memory compare 16 bytes: > > 300000 > > 5290000 > > Speed up: 0.056711 > > Memory compare 64 bytes: > > 460000 > > 13770000 > > Speed up: 0.033406 > > Memory compare 256 bytes: > > 470000 > > This is extremely interesting. Does anyone have any documentation on cmps > behavior on P4 and Athlon? It could be that the processor is somehow > "caching" the results of cmps. Maybe it has to do with the fact that the > strings are all 0 except for the end or because the strings do not change. > Or maybe cmps is just extremely fast, but how? Or it could be that the > loop needs unrolling for better pipeline performance. I don't have a P4 > or Athlon so if someone could play around with my code by testing by > testing some of my theories I would appreciate it. > > I just ran the test on a Pentium MMX and i got similar results as I did > for my P3. So at very least it seams that something similar to my code is > the way to go for Pentiums up to P3. Unrolling makes things even worse for Athlon - or rather MUCH better for the builtin function: jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -funroll-loops -mcpu=athlon cmps.c jlquinn@smaug:~/gcc/test$ ./a.out Memory compare 15 bytes: 40000 5440000 Speed up: 0.007353 Memory compare 16 bytes: 60000 5070000 Speed up: 0.011834 Memory compare 64 bytes: 60000 13660000 Speed up: 0.004392 Memory compare 256 bytes: 60000 39300000 Speed up: 0.001527 If I assign both arrays to the same set of random bytes, the results tilt a bit more towards your routine :-) jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -mcpu=athlon cmps.c jlquinn@smaug:~/gcc/test$ ./a.out Memory compare 15 bytes: 460000 5640000 Speed up: 0.081560 Memory compare 16 bytes: 450000 5390000 Speed up: 0.083488 Memory compare 64 bytes: 300000 5250000 Speed up: 0.057143 Memory compare 256 bytes: 310000 5090000 Speed up: 0.060904 Making the two arrays randomly generated and different from each other provides yet more improvement: jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -mcpu=athlon cmps.c jlquinn@smaug:~/gcc/test$ ./a.out Memory compare 15 bytes: 310000 3660000 Speed up: 0.084699 Memory compare 16 bytes: 310000 3510000 Speed up: 0.088319 Memory compare 64 bytes: 460000 3510000 Speed up: 0.131054 Memory compare 256 bytes: 460000 3500000 Still not even close. I just scanned the Athlon optimization guide and there is a comment about being sure to align the inputs for repeated string ops. That makes it sound like work was done to make aligned cmpsb efficient. I also tried commenting out the final int cleanup so that cmpa2 looks like: __attribute__((noinline)) int cmpa2(const unsigned int * x, const unsigned int * y, size_t size) { int i = 0; size_t s = size / 4; while (i < s && x[i] == y[i]) ++i; size -= i * 4; if (size == 0) return 0; return 1; /* unsigned int xx = x[i], yy = y[i]; */ /* asm("bswap %0" : "+r"(xx)); */ /* asm("bswap %0" : "+r"(yy)); */ /* if (size >= 4) { */ /* return xx < yy ? -1 : 1; */ /* } else { */ /* unsigned int dis = 8*(4-size); */ /* xx >>= dis; */ /* yy >>= dis; */ /* return xx - yy; */ /* } */ } I wanted to see if the compiler could make the simple integer compare fast. This is an (expected) improvement, but doesn't come close to narrowing the gap: jlquinn@smaug:~/gcc/test$ gcc3.3 cmps.c -O3 -fomit-frame-pointer -mcpu=athlon jlquinn@smaug:~/gcc/test$ ./a.out Memory compare 15 bytes: 300000 2740000 Speed up: 0.109489 Memory compare 16 bytes: 310000 2520000 Speed up: 0.123016 Memory compare 64 bytes: 480000 2590000 Speed up: 0.185328 Memory compare 256 bytes: 460000 2680000 Speed up: 0.171642 It seems the conclusion is that rep cmpsb is very fast on Athlon. Also probably for P4, but I haven't tried the same tests beyond the first one. Jerry Quinn ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3
@ 2003-04-04 14:33 Bonzini
0 siblings, 0 replies; 14+ messages in thread
From: Bonzini @ 2003-04-04 14:33 UTC (permalink / raw)
To: kevin; +Cc: gcc
> I was doing some tests with with memcmp and it seams that gcc 3.2.2
> always uses cmps, even for aligned strings.
I also had a very fast memcmp that also worked word-by-word in the unaligned case
(using some bit-twiddling tricks with SHRD), but I never had the time to polish it for
inclusion in glibc. It turned out however that CMPS is as fast as mine (or even
faster) on AMD chips.
I do think anyway that memcmp expansion in gcc should be disabled unless optimizing for
size or unless the size is known to be small, say 16 bytes.
See these posts:
http://gcc.gnu.org/ml/gcc/2002-10/msg01616.html (me)
http://gcc.gnu.org/ml/gcc/2002-10/msg01637.html (Roger)
http://sources.redhat.com/ml/libc-alpha/2002-10/msg00511.html (source)
Paolo
^ permalink raw reply [flat|nested] 14+ messages in thread
* Slow memcmp for aligned strings on Pentium 3 @ 2003-04-04 6:22 Kevin Atkinson 2003-04-04 8:52 ` Zack Weinberg 0 siblings, 1 reply; 14+ messages in thread From: Kevin Atkinson @ 2003-04-04 6:22 UTC (permalink / raw) To: gcc [-- Attachment #1: Type: TEXT/PLAIN, Size: 4774 bytes --] I was doing some tests with with memcmp and it seams that gcc 3.2.2 always uses cmps, even for aligned strings. I did some tests and discovered that using cmps was rather slow, compared to a simple loop and then a bswap and subtract at the end. My method was is the following: int cmpa2(unsigned int * x, unsigned int * y, size_t size) { int i = 0; size /= 4; while (i < size && x[i] == y[i]) ++i; if (i == size) return 0; asm("bswap %0" : "+r"(x[i])); asm("bswap %0" : "+r"(y[i])); return x[i] - y[i]; } as oppose to: int cmpa(unsigned int * x, unsigned int * y, size_t size) { return __builtin_memcmp(x,y,size); } Even though It takes integer pointers it really is comparing them as if they were byte strings. I used integer points to avoid having to cast and to let the compiler know that the strings are 4 byte aligned. This method is 3-5 times as fast. The longer the string the faster my method is. I also wrote on specialized compare for comparing two ints AS IF they were a 4 byte string: int cmpi2(unsigned int x, unsigned int y) { asm("bswap %0" : "+r"(x)); asm("bswap %0" : "+r"(y)); return x - y; } as oppose to: int cmpi(unsigned int x, unsigned int y) { return __builtin_memcmp(&x,&y,4); } Which was several order of magnitude faster. It is 4 - 6 times faster. However I believe the function call is significant here as -fomit-frame-pointer increased the ratio (from 4 to 6). And when I tried to account for the cost of the function call the results varies form any where to 10 to 30 times faster. If a better memcmp is not already implemented in yet-to-be-released gcc I hope someone will seriously consider rewriting the memcpy implementation. Even though I only tested in on a Pentium 3 I am sure the results will also hold for a Pentium Classic based on the instruction timings. I believe th bswap instruction was implemented in the 486. So only generic 386 code will have to use something other than bswap. The test program is attached. Here are the results when compiled with: gcc -fomit-frame-pointer -O3 -march=pentium3 cmps.c (attribute noinline was used on test functions) Memory compare int: 2170000 350000 Speed up: 6.200000 Memory compare 16 bytes: 4100000 1240000 Speed up: 3.306452 Memory compare 64 bytes: 10930000 2990000 Speed up: 3.655518 Memory compare 256 bytes: 37510000 7700000 Speed up: 4.871429 Here is the relevant assembly code: cmpi: // integer as byte string compare using memcmp cld subl $8, %esp movl $4, %ecx movl %esi, (%esp) leal 12(%esp), %esi movl %edi, 4(%esp) leal 16(%esp), %edi repz cmpsb movl 4(%esp), %edi movl (%esp), %esi seta %dl setb %cl addl $8, %esp subb %cl, %dl movsbl %dl,%eax ret cmpi: // interger as byte string compare my method movl 4(%esp), %eax movl 8(%esp), %ecx bswap %eax bswap %ecx subl %ecx, %eax ret cmpa: // aligned byte compare using memcmp cld subl $8, %esp movl 20(%esp), %ecx movl %esi, (%esp) movl 12(%esp), %esi cmpl %ecx, %ecx movl %edi, 4(%esp) movl 16(%esp), %edi repz cmpsb movl 4(%esp), %edi movl (%esp), %esi seta %dl setb %cl addl $8, %esp subb %cl, %dl movsbl %dl,%eax ret cmpa2: // aligned byte compare: my method subl $12, %esp movl 24(%esp), %ecx xorl %edx, %edx movl %esi, 4(%esp) movl 16(%esp), %esi shrl $2, %ecx movl %edi, 8(%esp) cmpl %ecx, %edx movl 20(%esp), %edi movl %ebx, (%esp) jae .L6 movl (%edi), %eax cmpl %eax, (%esi) je .L9 .L6: xorl %ebx, %ebx cmpl %ecx, %edx je .L4 movl (%esi,%edx,4), %eax bswap %eax movl %eax, (%esi,%edx,4) movl (%edi,%edx,4), %eax bswap %eax movl %eax, (%edi,%edx,4) movl (%esi,%edx,4), %ebx subl %eax, %ebx .L4: movl %ebx, %eax movl 4(%esp), %esi movl (%esp), %ebx movl 8(%esp), %edi addl $12, %esp ret .p2align 4,,7 .L9: incl %edx cmpl %ecx, %edx jae .L6 movl (%edi,%edx,4), %eax cmpl %eax, (%esi,%edx,4) je .L9 jmp .L6 --- http://kevin.atkinson.dhs.org [-- Attachment #2: Type: TEXT/PLAIN, Size: 2172 bytes --] #include <stdio.h> #include <time.h> #include <stdlib.h> __attribute__((noinline)) int cmpi(unsigned int x, unsigned int y) { return __builtin_memcmp(&x,&y,4); } __attribute__((noinline)) int cmpi2(unsigned int x, unsigned int y) { asm("bswap %0" : "+r"(x)); asm("bswap %0" : "+r"(y)); return x - y; } __attribute__((noinline)) int cmpa(unsigned int * x, unsigned int * y, size_t size) { return __builtin_memcmp(x,y,size); } __attribute__((noinline)) int cmpa2(unsigned int * x, unsigned int * y, size_t size) { int i = 0; size /= 4; while (i < size && x[i] == y[i]) ++i; if (i == size) return 0; asm("bswap %0" : "+r"(x[i])); asm("bswap %0" : "+r"(y[i])); return x[i] - y[i]; } int main() { unsigned int x,y,i; int xa[256] = {}, ya[256] = {}; int s,f; clock_t t; x = *(unsigned int *)("a b"); y = *(unsigned int *)("b a"); printf("Memory compare int:\n"); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpi(x,y); s = clock() - t; printf(" %d\n", s); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpi2(x,y); f = clock() - t; printf(" %d\n",f); printf(" Speed up: %f\n", (double)s/f); printf("Memory compare 16 bytes:\n"); xa[3] = 6; t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa(xa, ya, 16); s = clock() - t; printf(" %d\n", s); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa2(xa, ya, 16); f = clock() - t; printf(" %d\n",f); printf(" Speed up: %f\n", (double)s/f); xa[3] = 0; xa[15] = 6; printf("Memory compare 64 bytes:\n"); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa(xa, ya, 64); s = clock() - t; printf(" %d\n", s); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa2(xa, ya, 64); f = clock() - t; printf(" %d\n",f); printf(" Speed up: %f\n", (double)s/f); xa[15] = 0; xa[255] = 6; printf("Memory compare 256 bytes:\n"); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa(xa, ya, 256); s = clock() - t; printf(" %d\n", s); t = clock(); for (i = 0; i != 0x1000000; ++i) cmpa2(xa, ya, 256); f = clock() - t; printf(" %d\n",f); printf(" Speed up: %f\n", (double)s/f); } ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3 2003-04-04 6:22 Kevin Atkinson @ 2003-04-04 8:52 ` Zack Weinberg 2003-04-04 15:16 ` Kevin Atkinson 0 siblings, 1 reply; 14+ messages in thread From: Zack Weinberg @ 2003-04-04 8:52 UTC (permalink / raw) To: Kevin Atkinson; +Cc: gcc Kevin Atkinson <kevin@atkinson.dhs.org> writes: > I was doing some tests with with memcmp and it seams that gcc 3.2.2 always > uses cmps, even for aligned strings. > > I did some tests and discovered that using cmps was rather slow, compared > to a simple loop and then a bswap and subtract at the end. Patches are welcome. zw ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3 2003-04-04 8:52 ` Zack Weinberg @ 2003-04-04 15:16 ` Kevin Atkinson 0 siblings, 0 replies; 14+ messages in thread From: Kevin Atkinson @ 2003-04-04 15:16 UTC (permalink / raw) To: Zack Weinberg; +Cc: gcc On Thu, 3 Apr 2003, Zack Weinberg wrote: > Kevin Atkinson <kevin@atkinson.dhs.org> writes: > > > I was doing some tests with with memcmp and it seams that gcc 3.2.2 always > > uses cmps, even for aligned strings. > > > > I did some tests and discovered that using cmps was rather slow, compared > > to a simple loop and then a bswap and subtract at the end. > > Patches are welcome. > The best I can give you is a memcmp implementation. I know nothing about gcc internals. -- http://kevin.atkinson.dhs.org ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2003-04-07 15:23 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2003-04-04 13:18 Slow memcmp for aligned strings on Pentium 3 Roger Sayle 2003-04-04 14:51 ` Kevin Atkinson 2003-04-04 17:08 ` Kevin Atkinson 2003-04-04 18:58 ` Marcel Cox 2003-04-04 19:27 ` Kevin Atkinson -- strict thread matches above, loose matches on Subject: below -- 2003-04-04 17:13 Jerry Quinn 2003-04-04 17:29 ` Kevin Atkinson 2003-04-06 5:54 ` Jerry Quinn 2003-04-06 22:17 ` Kevin Atkinson 2003-04-07 17:30 ` Jerry Quinn 2003-04-04 14:33 Bonzini 2003-04-04 6:22 Kevin Atkinson 2003-04-04 8:52 ` Zack Weinberg 2003-04-04 15:16 ` Kevin Atkinson
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).