* Re: Slow memcmp for aligned strings on Pentium 3
@ 2003-04-04 17:13 Jerry Quinn
2003-04-04 17:29 ` Kevin Atkinson
0 siblings, 1 reply; 14+ messages in thread
From: Jerry Quinn @ 2003-04-04 17:13 UTC (permalink / raw)
To: gcc
I just tried the same benchmark on a Pentium 4 out of curiosity. Slightly
different results:
Memory compare int:
10000
130000
Speed up: 0.076923
Memory compare 15 bytes:
10000
370000
Speed up: 0.027027
Memory compare 16 bytes:
20000
330000
Speed up: 0.060606
Memory compare 64 bytes:
10000
1040000
Speed up: 0.009615
Memory compare 256 bytes:
20000
2300000
Speed up: 0.008696
Perhaps this is to be expected since the routine uses shifts.
Jerry Quinn
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3
2003-04-04 17:13 Slow memcmp for aligned strings on Pentium 3 Jerry Quinn
@ 2003-04-04 17:29 ` Kevin Atkinson
2003-04-06 5:54 ` Jerry Quinn
0 siblings, 1 reply; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-04 17:29 UTC (permalink / raw)
To: Jerry Quinn; +Cc: gcc
On Fri, 4 Apr 2003, Jerry Quinn wrote:
> I just tried the same benchmark on a Pentium 4 out of curiosity. Slightly
> different results:
>
> Memory compare int:
> 10000
> 130000
> Speed up: 0.076923
> Memory compare 15 bytes:
> 10000
> 370000
> Speed up: 0.027027
> Memory compare 16 bytes:
> 20000
> 330000
> Speed up: 0.060606
> Memory compare 64 bytes:
> 10000
> 1040000
> Speed up: 0.009615
> Memory compare 256 bytes:
> 20000
> 2300000
> Speed up: 0.008696
>
> Perhaps this is to be expected since the routine uses shifts.
The shifts are only used when the size is not divisible by 4. It seems
that on the Pentium 4 cmps is the way to go. You might also want to
increase the number of loop iterations to get more meaningful results
given the limited precision of clock().
---
http://kevin.atkinson.dhs.org
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3
2003-04-04 17:29 ` Kevin Atkinson
@ 2003-04-06 5:54 ` Jerry Quinn
2003-04-06 22:17 ` Kevin Atkinson
0 siblings, 1 reply; 14+ messages in thread
From: Jerry Quinn @ 2003-04-06 5:54 UTC (permalink / raw)
To: Kevin Atkinson; +Cc: gcc
Kevin Atkinson writes:
> On Fri, 4 Apr 2003, Jerry Quinn wrote:
>
> > I just tried the same benchmark on a Pentium 4 out of curiosity. Slightly
> > different results:
> >
> > Memory compare int:
> > 10000
> > 130000
> > Speed up: 0.076923
> > Memory compare 15 bytes:
> > 10000
> > 370000
> > Speed up: 0.027027
> > Memory compare 16 bytes:
> > 20000
> > 330000
> > Speed up: 0.060606
> > Memory compare 64 bytes:
> > 10000
> > 1040000
> > Speed up: 0.009615
> > Memory compare 256 bytes:
> > 20000
> > 2300000
> > Speed up: 0.008696
> >
> > Perhaps this is to be expected since the routine uses shifts.
>
> The shifts are only used when the size is not divisible by 4. It seems
> that on the Pentium 4 cmps is the way to go. You might also want to
> increase the number of loop iterations to get more meaningful results
> given the limited precision of clock().
Adding iterations didn't change the relative scores significantly. It
still loses big on P4. It also loses big on Athlon. Here are Athlon
results using the later version you posted with 10x iterations:
jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -march=athlon cmps.c
jlquinn@smaug:~/gcc/test$ ./a.out
Memory compare 15 bytes:
310000
5810000
Speed up: 0.053356
Memory compare 16 bytes:
300000
5290000
Speed up: 0.056711
Memory compare 64 bytes:
460000
13770000
Speed up: 0.033406
Memory compare 256 bytes:
470000
Jerry Quinn
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3
2003-04-06 5:54 ` Jerry Quinn
@ 2003-04-06 22:17 ` Kevin Atkinson
2003-04-07 17:30 ` Jerry Quinn
0 siblings, 1 reply; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-06 22:17 UTC (permalink / raw)
To: gcc, Jerry Quinn
On Sat, 5 Apr 2003, Jerry Quinn wrote:
> Kevin Atkinson writes:
> > On Fri, 4 Apr 2003, Jerry Quinn wrote:
> >
> > > I just tried the same benchmark on a Pentium 4 out of curiosity. Slightly
> > > different results:
> > >
> > > Memory compare int:
> > > 10000
> > > 130000
> > > Speed up: 0.076923
> > > Memory compare 15 bytes:
> > > 10000
> > > 370000
> > > Speed up: 0.027027
> > > Memory compare 16 bytes:
> > > 20000
> > > 330000
> > > Speed up: 0.060606
> > > Memory compare 64 bytes:
> > > 10000
> > > 1040000
> > > Speed up: 0.009615
> > > Memory compare 256 bytes:
> > > 20000
> > > 2300000
> > > Speed up: 0.008696
> > >
> > > Perhaps this is to be expected since the routine uses shifts.
> >
> > The shifts are only used when the size is not divisible by 4. It seems
> > that on the Pentium 4 cmps is the way to go. You might also want to
> > increase the number of loop iterations to get more meaningful results
> > given the limited precision of clock().
>
> Adding iterations didn't change the relative scores significantly. It
> still loses big on P4. It also loses big on Athlon. Here are Athlon
> results using the later version you posted with 10x iterations:
>
> jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -march=athlon cmps.c
> jlquinn@smaug:~/gcc/test$ ./a.out
> Memory compare 15 bytes:
> 310000
> 5810000
> Speed up: 0.053356
> Memory compare 16 bytes:
> 300000
> 5290000
> Speed up: 0.056711
> Memory compare 64 bytes:
> 460000
> 13770000
> Speed up: 0.033406
> Memory compare 256 bytes:
> 470000
This is extremely interesting. Does anyone have any documentation on cmps
behavior on P4 and Athlon? It could be that the processor is somehow
"caching" the results of cmps. Maybe it has to do with the fact that the
strings are all 0 except at the end, or because the strings do not change.
Or maybe cmps is just extremely fast, but how? Or it could be that the
loop needs unrolling for better pipeline performance. I don't have a P4
or Athlon, so if someone could play around with my code and test some of
my theories I would appreciate it.
I just ran the test on a Pentium MMX and got similar results to those on
my P3. So at the very least it seems that something similar to my code is
the way to go for Pentiums up to the P3.
---
http://kevin.atkinson.dhs.org
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3
2003-04-06 22:17 ` Kevin Atkinson
@ 2003-04-07 17:30 ` Jerry Quinn
0 siblings, 0 replies; 14+ messages in thread
From: Jerry Quinn @ 2003-04-07 17:30 UTC (permalink / raw)
To: Kevin Atkinson; +Cc: gcc
Kevin Atkinson writes:
> On Sat, 5 Apr 2003, Jerry Quinn wrote:
>
> > Kevin Atkinson writes:
> > > On Fri, 4 Apr 2003, Jerry Quinn wrote:
> > >
> > > > I just tried the same benchmark on a Pentium 4 out of curiosity. Slightly
> > > > different results:
> > > >
> > > > Memory compare int:
> > > > 10000
> > > > 130000
> > > > Speed up: 0.076923
> > > > Memory compare 15 bytes:
> > > > 10000
> > > > 370000
> > > > Speed up: 0.027027
> > > > Memory compare 16 bytes:
> > > > 20000
> > > > 330000
> > > > Speed up: 0.060606
> > > > Memory compare 64 bytes:
> > > > 10000
> > > > 1040000
> > > > Speed up: 0.009615
> > > > Memory compare 256 bytes:
> > > > 20000
> > > > 2300000
> > > > Speed up: 0.008696
> > > >
> > > > Perhaps this is to be expected since the routine uses shifts.
> > >
> > > The shift are only used in the case size is not divisible by 4. It seams
> > > that on the Pentium 4 cmps is the way to go. You might also want to
> > > increase the number of loop iterations to get more meaning full results
> > > due the limited precision of clock().
> >
> > Adding iterations didn't change the relative scores significantly. It
> > still loses big on P4. It also loses big on Athlon. Here are Athlon
> > results using the later version you posted with 10x iterations:
> >
> > jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -march=athlon cmps.c
> > jlquinn@smaug:~/gcc/test$ ./a.out
> > Memory compare 15 bytes:
> > 310000
> > 5810000
> > Speed up: 0.053356
> > Memory compare 16 bytes:
> > 300000
> > 5290000
> > Speed up: 0.056711
> > Memory compare 64 bytes:
> > 460000
> > 13770000
> > Speed up: 0.033406
> > Memory compare 256 bytes:
> > 470000
>
> This is extremely interesting. Does anyone have any documentation on cmps
> behavior on P4 and Athlon? It could be that the processor is somehow
> "caching" the results of cmps. Maybe it has to do with the fact that the
> strings are all 0 except at the end, or because the strings do not change.
> Or maybe cmps is just extremely fast, but how? Or it could be that the
> loop needs unrolling for better pipeline performance. I don't have a P4
> or Athlon, so if someone could play around with my code and test some of
> my theories I would appreciate it.
>
> I just ran the test on a Pentium MMX and got similar results to those on
> my P3. So at the very least it seems that something similar to my code is
> the way to go for Pentiums up to the P3.
Unrolling makes things even worse for Athlon - or rather MUCH better
for the builtin function:
jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -funroll-loops -mcpu=athlon cmps.c
jlquinn@smaug:~/gcc/test$ ./a.out
Memory compare 15 bytes:
40000
5440000
Speed up: 0.007353
Memory compare 16 bytes:
60000
5070000
Speed up: 0.011834
Memory compare 64 bytes:
60000
13660000
Speed up: 0.004392
Memory compare 256 bytes:
60000
39300000
Speed up: 0.001527
If I assign both arrays to the same set of random bytes, the results
tilt a bit more towards your routine :-)
jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -mcpu=athlon cmps.c
jlquinn@smaug:~/gcc/test$ ./a.out
Memory compare 15 bytes:
460000
5640000
Speed up: 0.081560
Memory compare 16 bytes:
450000
5390000
Speed up: 0.083488
Memory compare 64 bytes:
300000
5250000
Speed up: 0.057143
Memory compare 256 bytes:
310000
5090000
Speed up: 0.060904
Making the two arrays randomly generated and different from each other
provides yet more improvement:
jlquinn@smaug:~/gcc/test$ gcc3.3 -O3 -fomit-frame-pointer -mcpu=athlon cmps.c
jlquinn@smaug:~/gcc/test$ ./a.out
Memory compare 15 bytes:
310000
3660000
Speed up: 0.084699
Memory compare 16 bytes:
310000
3510000
Speed up: 0.088319
Memory compare 64 bytes:
460000
3510000
Speed up: 0.131054
Memory compare 256 bytes:
460000
3500000
Still not even close. I just scanned the Athlon optimization guide
and there is a comment about being sure to align the inputs for
repeated string ops. That makes it sound like work was done to make
aligned cmpsb efficient.
I also tried commenting out the final int cleanup so that cmpa2 looks
like:
__attribute__((noinline))
int cmpa2(const unsigned int * x, const unsigned int * y, size_t size)
{
int i = 0;
size_t s = size / 4;
while (i < s && x[i] == y[i]) ++i;
size -= i * 4;
if (size == 0) return 0;
return 1;
/* unsigned int xx = x[i], yy = y[i]; */
/* asm("bswap %0" : "+r"(xx)); */
/* asm("bswap %0" : "+r"(yy)); */
/* if (size >= 4) { */
/* return xx < yy ? -1 : 1; */
/* } else { */
/* unsigned int dis = 8*(4-size); */
/* xx >>= dis; */
/* yy >>= dis; */
/* return xx - yy; */
/* } */
}
I wanted to see if the compiler could make the simple integer compare
fast. This is an (expected) improvement, but doesn't come close to
narrowing the gap:
jlquinn@smaug:~/gcc/test$ gcc3.3 cmps.c -O3 -fomit-frame-pointer -mcpu=athlon
jlquinn@smaug:~/gcc/test$ ./a.out
Memory compare 15 bytes:
300000
2740000
Speed up: 0.109489
Memory compare 16 bytes:
310000
2520000
Speed up: 0.123016
Memory compare 64 bytes:
480000
2590000
Speed up: 0.185328
Memory compare 256 bytes:
460000
2680000
Speed up: 0.171642
It seems the conclusion is that rep cmpsb is very fast on Athlon.
Also probably for P4, but I haven't tried the same tests beyond the
first one.
Jerry Quinn
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3
2003-04-04 18:58 ` Marcel Cox
@ 2003-04-04 19:27 ` Kevin Atkinson
0 siblings, 0 replies; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-04 19:27 UTC (permalink / raw)
To: gcc
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1441 bytes --]
On Fri, 4 Apr 2003, Marcel Cox wrote:
> There is still a flaw in that you assume that the difference of two unsigned
> integers will return the correct signed result. This will not work if, for
> instance, in your "return xx - yy" statement xx is 0xf0000000 and yy is
> 0x10000000. In that case xx is clearly greater than yy, but the difference
> is 0xe0000000, which, cast to a signed integer, is negative and indicates
> that xx is smaller than yy, which is clearly wrong.
You're right. I realized that after I posted the code. Here is a better
version, which is also faster when size % 4 != 0:
int cmpa2(const unsigned int * x, const unsigned int * y, size_t size)
{
int i = 0;
size_t s = size / 4;
while (i < s && x[i] == y[i]) ++i;
size -= i * 4;
if (size == 0) return 0;
unsigned int xx = x[i], yy = y[i];
asm("bswap %0" : "+r"(xx));
asm("bswap %0" : "+r"(yy));
if (size >= 4) {
return xx < yy ? -1 : 1;
} else {
unsigned int dis = 8*(4-size);
xx >>= dis;
yy >>= dis;
return xx - yy;
}
}
Test results (I removed the integer-as-byte-strings functions):
Memory compare 15 bytes:
4120000
1370000
Speed up: 3.007299
Memory compare 16 bytes:
4140000
1460000
Speed up: 2.835616
Memory compare 64 bytes:
11220000
3920000
Speed up: 2.862245
Memory compare 256 bytes:
38550000
12410000
Speed up: 3.106366
Code attached.
--
http://kevin.atkinson.dhs.org
[-- Attachment #2: Type: TEXT/PLAIN, Size: 2102 bytes --]
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
__attribute__((noinline))
int cmpa(const unsigned int * x, const unsigned int * y, size_t size)
{
return __builtin_memcmp(x,y,size);
}
__attribute__((noinline))
int cmpa2(const unsigned int * x, const unsigned int * y, size_t size)
{
int i = 0;
size_t s = size / 4;
while (i < s && x[i] == y[i]) ++i;
size -= i * 4;
if (size == 0) return 0;
unsigned int xx = x[i], yy = y[i];
asm("bswap %0" : "+r"(xx));
asm("bswap %0" : "+r"(yy));
if (size >= 4) {
return xx < yy ? -1 : 1;
} else {
unsigned int dis = 8*(4-size);
xx >>= dis;
yy >>= dis;
return xx - yy;
}
}
int main()
{
unsigned int x,y,i;
int xa[256] = {}, ya[256] = {};
int s,f;
clock_t t;
printf("Memory compare 15 bytes:\n");
xa[3] = 6;
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa(xa, ya, 15);
s = clock() - t;
printf(" %d\n", s);
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa2(xa, ya, 15);
f = clock() - t;
printf(" %d\n",f);
printf(" Speed up: %f\n", (double)s/f);
xa[3] = 0;
printf("Memory compare 16 bytes:\n");
xa[3] = 6;
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa(xa, ya, 16);
s = clock() - t;
printf(" %d\n", s);
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa2(xa, ya, 16);
f = clock() - t;
printf(" %d\n",f);
printf(" Speed up: %f\n", (double)s/f);
xa[3] = 0;
printf("Memory compare 64 bytes:\n");
xa[15] = 6;
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa(xa, ya, 64);
s = clock() - t;
printf(" %d\n", s);
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa2(xa, ya, 64);
f = clock() - t;
printf(" %d\n",f);
printf(" Speed up: %f\n", (double)s/f);
xa[15] = 0;
printf("Memory compare 256 bytes:\n");
xa[63] = 6;
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa(xa, ya, 256);
s = clock() - t;
printf(" %d\n", s);
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa2(xa, ya, 256);
f = clock() - t;
printf(" %d\n",f);
printf(" Speed up: %f\n", (double)s/f);
}
[-- Attachment #3: Type: TEXT/PLAIN, Size: 5496 bytes --]
.file "cmps.c"
.text
.align 2
.p2align 4,,15
.globl cmpa
.type cmpa,@function
cmpa:
cld
subl $8, %esp
movl 20(%esp), %ecx
movl %esi, (%esp)
movl 12(%esp), %esi
cmpl %ecx, %ecx
movl %edi, 4(%esp)
movl 16(%esp), %edi
repz
cmpsb
movl 4(%esp), %edi
movl (%esp), %esi
seta %dl
setb %cl
addl $8, %esp
subb %cl, %dl
movsbl %dl,%eax
ret
.Lfe1:
.size cmpa,.Lfe1-cmpa
.align 2
.p2align 4,,15
.globl cmpa2
.type cmpa2,@function
cmpa2:
pushl %edi
xorl %edx, %edx
pushl %esi
pushl %ebx
movl 24(%esp), %esi
movl 16(%esp), %ebx
movl 20(%esp), %edi
movl %esi, %ecx
shrl $2, %ecx
cmpl %ecx, %edx
jae .L4
movl (%edi), %eax
cmpl %eax, (%ebx)
je .L7
.L4:
leal 0(,%edx,4), %ecx
subl %ecx, %esi
jne .L8
xorl %edx, %edx
.L2:
popl %ebx
movl %edx, %eax
popl %esi
popl %edi
ret
.p2align 4,,7
.L8:
movl (%ebx,%edx,4), %ebx
movl (%edi,%edx,4), %eax
#APP
bswap %ebx
bswap %eax
#NO_APP
cmpl $3, %esi
jbe .L9
cmpl %eax, %ebx
sbbl %edx, %edx
orl $1, %edx
jmp .L2
.p2align 4,,7
.L9:
movl $4, %ecx
subl %esi, %ecx
sall $3, %ecx
movl %ebx, %edx
shrl %cl, %edx
shrl %cl, %eax
subl %eax, %edx
jmp .L2
.p2align 4,,7
.L7:
incl %edx
cmpl %ecx, %edx
jae .L4
movl (%edi,%edx,4), %eax
cmpl %eax, (%ebx,%edx,4)
je .L7
jmp .L4
.Lfe2:
.size cmpa2,.Lfe2-cmpa2
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "Memory compare 15 bytes:"
.LC1:
.string " %d\n"
.LC2:
.string " Speed up: %f\n"
.LC3:
.string "Memory compare 16 bytes:"
.LC4:
.string "Memory compare 64 bytes:"
.LC5:
.string "Memory compare 256 bytes:"
.text
.align 2
.p2align 4,,15
.globl main
.type main,@function
main:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
pushl %edi
movl $256, %ecx
pushl %esi
leal -1048(%ebp), %esi
movl %esi, %edi
pushl %ebx
leal -2072(%ebp), %ebx
subl $2092, %esp
cld
andl $-16, %esp
rep
stosl
movl $256, %ecx
movl %ebx, %edi
rep
stosl
xorl %edi, %edi
movl $.LC0, (%esp)
call puts
movl $6, %eax
movl %eax, -1036(%ebp)
call clock
movl %eax, -2084(%ebp)
.p2align 4,,15
.L19:
movl %esi, (%esp)
incl %edi
movl %ebx, 4(%esp)
movl $15, 8(%esp)
call cmpa
cmpl $16777216, %edi
jne .L19
call clock
movl %eax, -2080(%ebp)
xorl %edi, %edi
movl -2084(%ebp), %eax
subl %eax, -2080(%ebp)
movl $.LC1, (%esp)
movl -2080(%ebp), %eax
movl %eax, 4(%esp)
call printf
call clock
movl %eax, -2084(%ebp)
.L24:
movl %esi, (%esp)
incl %edi
movl %ebx, 4(%esp)
movl $15, 8(%esp)
call cmpa2
cmpl $16777216, %edi
jne .L24
call clock
movl $.LC1, (%esp)
movl -2084(%ebp), %ecx
movl %eax, %edi
subl %ecx, %edi
movl %edi, 4(%esp)
call printf
fildl -2080(%ebp)
movl $.LC2, (%esp)
pushl %edi
xorl %edi, %edi
fildl (%esp)
addl $4, %esp
fdivrp %st, %st(1)
fstpl 4(%esp)
call printf
xorl %edx, %edx
movl %edx, -1036(%ebp)
movl $.LC3, (%esp)
call puts
movl $6, %eax
movl %eax, -1036(%ebp)
call clock
movl %eax, -2084(%ebp)
.L29:
movl %esi, (%esp)
incl %edi
movl %ebx, 4(%esp)
movl $16, 8(%esp)
call cmpa
cmpl $16777216, %edi
jne .L29
call clock
movl $.LC1, (%esp)
movl %eax, -2080(%ebp)
movl -2084(%ebp), %eax
subl %eax, -2080(%ebp)
movl -2080(%ebp), %edi
movl %edi, 4(%esp)
xorl %edi, %edi
call printf
call clock
movl %eax, -2084(%ebp)
.L34:
movl %esi, (%esp)
incl %edi
movl %ebx, 4(%esp)
movl $16, 8(%esp)
call cmpa2
cmpl $16777216, %edi
jne .L34
call clock
movl $.LC1, (%esp)
movl %eax, %edi
movl -2084(%ebp), %eax
subl %eax, %edi
movl %edi, 4(%esp)
call printf
fildl -2080(%ebp)
movl $.LC2, (%esp)
pushl %edi
xorl %edi, %edi
fildl (%esp)
addl $4, %esp
fdivrp %st, %st(1)
fstpl 4(%esp)
call printf
xorl %edx, %edx
movl %edx, -1036(%ebp)
movl $.LC4, (%esp)
call puts
movl $6, %eax
movl %eax, -988(%ebp)
call clock
movl %eax, -2084(%ebp)
.L39:
movl %esi, (%esp)
incl %edi
movl %ebx, 4(%esp)
movl $64, 8(%esp)
call cmpa
cmpl $16777216, %edi
jne .L39
call clock
movl %eax, -2080(%ebp)
xorl %edi, %edi
movl -2084(%ebp), %eax
subl %eax, -2080(%ebp)
movl $.LC1, (%esp)
movl -2080(%ebp), %eax
movl %eax, 4(%esp)
call printf
call clock
movl %eax, -2084(%ebp)
.L44:
movl %esi, (%esp)
incl %edi
movl %ebx, 4(%esp)
movl $64, 8(%esp)
call cmpa2
cmpl $16777216, %edi
jne .L44
call clock
movl $.LC1, (%esp)
movl %eax, %edi
movl -2084(%ebp), %eax
subl %eax, %edi
movl %edi, 4(%esp)
call printf
fildl -2080(%ebp)
movl $.LC2, (%esp)
pushl %edi
xorl %edi, %edi
fildl (%esp)
addl $4, %esp
fdivrp %st, %st(1)
fstpl 4(%esp)
call printf
movl %edi, -988(%ebp)
xorl %edi, %edi
movl $.LC5, (%esp)
call puts
movl $6, %ecx
movl %ecx, -796(%ebp)
call clock
movl %eax, -2084(%ebp)
.L49:
movl %esi, (%esp)
incl %edi
movl %ebx, 4(%esp)
movl $256, 8(%esp)
call cmpa
cmpl $16777216, %edi
jne .L49
call clock
movl %eax, -2080(%ebp)
xorl %edi, %edi
movl -2084(%ebp), %eax
subl %eax, -2080(%ebp)
movl $.LC1, (%esp)
movl -2080(%ebp), %edx
movl %edx, 4(%esp)
call printf
call clock
movl %eax, -2084(%ebp)
.L54:
movl %esi, (%esp)
incl %edi
movl %ebx, 4(%esp)
movl $256, 8(%esp)
call cmpa2
cmpl $16777216, %edi
jne .L54
call clock
movl $.LC1, (%esp)
movl %eax, %ebx
movl -2084(%ebp), %eax
subl %eax, %ebx
movl %ebx, 4(%esp)
call printf
fildl -2080(%ebp)
movl $.LC2, (%esp)
pushl %ebx
fildl (%esp)
addl $4, %esp
fdivrp %st, %st(1)
fstpl 4(%esp)
call printf
leal -12(%ebp), %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.Lfe3:
.size main,.Lfe3-main
.ident "GCC: (GNU) 3.2 20020903 (Red Hat Linux 8.0 3.2-7)"
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3
2003-04-04 17:08 ` Kevin Atkinson
@ 2003-04-04 18:58 ` Marcel Cox
2003-04-04 19:27 ` Kevin Atkinson
0 siblings, 1 reply; 14+ messages in thread
From: Marcel Cox @ 2003-04-04 18:58 UTC (permalink / raw)
To: gcc
"Kevin Atkinson" <kevina@gnu.org> wrote in message
news:Pine.LNX.4.44.0304040923350.912-300000@kevin-pc.atkinson.dhs.org...
> Here you go. Still assumes the memory is aligned:
>
> This is what I did:
>
> int cmpa2(const unsigned int * x, const unsigned int * y, size_t size)
> {
> int i = 0;
> size_t s = size / 4;
> while (i < s && x[i] == y[i]) ++i;
> size -= i * 4;
> if (size == 0) return 0;
> // hopefully if this is inline expanded when size is known
> // the compiler can eliminate many of these conditionals
> else if (size >= 4) { // if original size % 4 == 0 this should
> // always be the case
> unsigned int xx = x[i], yy = y[i];
> asm("bswap %0" : "+r"(xx));
> asm("bswap %0" : "+r"(yy));
> return xx - yy;
> } else {
> const unsigned char * xb = (const unsigned char *)(x + i);
> const unsigned char * yb = (const unsigned char *)(y + i);
> // if size is known at compile time then the compiler should be
> // able to select the correct choice at compile time
> switch (size) {
> case 1:
> return *xb - *yb;
> case 2:
> return ((xb[0] - yb[0]) << 8) + (xb[1] - yb[1]);
> case 3:
> return ((xb[0] - yb[0]) << 16) + ((xb[1] - yb[1]) << 8)
> + xb[2] - yb[2];}
> }
> }
There is still a flaw in that you assume that the difference of two unsigned
integers will return the correct signed result. This will not work if, for
instance, in your "return xx - yy" statement xx is 0xf0000000 and yy is
0x10000000. In that case xx is clearly greater than yy, but the difference
is 0xe0000000, which, cast to a signed integer, is negative and indicates
that xx is smaller than yy, which is clearly wrong.
Marcel
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Slow memcmp for aligned strings on Pentium 3
2003-04-04 13:18 Roger Sayle
2003-04-04 14:51 ` Kevin Atkinson
@ 2003-04-04 17:08 ` Kevin Atkinson
2003-04-04 18:58 ` Marcel Cox
1 sibling, 1 reply; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-04 17:08 UTC (permalink / raw)
To: Roger Sayle; +Cc: gcc
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1831 bytes --]
On Thu, 3 Apr 2003, Roger Sayle wrote:
> I suspect it should be possible to fix your code to handle these
> termination conditions correctly, and a comparison of your
> routine's performance with these fixes vs. __builtin_memcmp
> would be of interest.
Here you go. Still assumes the memory is aligned:
This is what I did:
int cmpa2(const unsigned int * x, const unsigned int * y, size_t size)
{
int i = 0;
size_t s = size / 4;
while (i < s && x[i] == y[i]) ++i;
size -= i * 4;
if (size == 0) return 0;
// hopefully if this is inline expanded when size is known
// the compiler can eliminate many of these conditionals
else if (size >= 4) { // if original size % 4 == 0 this should
// always be the case
unsigned int xx = x[i], yy = y[i];
asm("bswap %0" : "+r"(xx));
asm("bswap %0" : "+r"(yy));
return xx - yy;
} else {
const unsigned char * xb = (const unsigned char *)(x + i);
const unsigned char * yb = (const unsigned char *)(y + i);
// if size is known at compile time then the compiler should be
// able to select the correct choice at compile time
switch (size) {
case 1:
return *xb - *yb;
case 2:
return ((xb[0] - yb[0]) << 8) + (xb[1] - yb[1]);
case 3:
return ((xb[0] - yb[0]) << 16) + ((xb[1] - yb[1]) << 8)
+ xb[2] - yb[2];}
}
}
compiled with gcc -fomit-frame-pointer -O3 -march=pentium3 cmps.c
Memory compare int:
2160000
360000
Speed up: 6.000000
Memory compare 15 bytes:
4120000
1550000
Speed up: 2.658065
Memory compare 16 bytes:
4150000
1310000
Speed up: 3.167939
Memory compare 64 bytes:
11240000
3470000
Speed up: 3.239193
Memory compare 256 bytes:
37930000
10050000
Speed up: 3.774129
Code and assembly output attached.
--
http://kevin.atkinson.dhs.org
[-- Attachment #2: Type: TEXT/PLAIN, Size: 3315 bytes --]
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
__attribute__((noinline))
int cmpi(unsigned int x, unsigned int y)
{
return __builtin_memcmp(&x,&y,4);
}
__attribute__((noinline))
int cmpi2(unsigned int x, unsigned int y)
{
asm("bswap %0" : "+r"(x));
asm("bswap %0" : "+r"(y));
return x - y;
}
__attribute__((noinline))
int cmpa(const unsigned int * x, const unsigned int * y, size_t size)
{
return __builtin_memcmp(x,y,size);
}
__attribute__((noinline))
int cmpa2(const unsigned int * x, const unsigned int * y, size_t size)
{
int i = 0;
size_t s = size / 4;
while (i < s && x[i] == y[i]) ++i;
size -= i * 4;
if (size == 0) return 0;
// hopefully if this is inline expanded when size is known
// the compiler can eliminate many of these conditionals
else if (size >= 4) { // if original size % 4 == 0 this should
// always be the case
unsigned int xx = x[i], yy = y[i];
asm("bswap %0" : "+r"(xx));
asm("bswap %0" : "+r"(yy));
return xx - yy;
} else {
const unsigned char * xb = (const unsigned char *)(x + i);
const unsigned char * yb = (const unsigned char *)(y + i);
// if size is known at compile time then the compiler should be
// able to select the correct choice at compile time
switch (size) {
case 1:
return *xb - *yb;
case 2:
return ((xb[0] - yb[0]) << 8) + (xb[1] - yb[1]);
case 3:
return ((xb[0] - yb[0]) << 16) + ((xb[1] - yb[1]) << 8)
+ xb[2] - yb[2];}
}
}
int main()
{
unsigned int x,y,i;
int xa[256] = {}, ya[256] = {};
int s,f;
clock_t t;
x = *(unsigned int *)("a b");
y = *(unsigned int *)("b a");
printf("Memory compare int:\n");
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpi(x,y);
s = clock() - t;
printf(" %d\n", s);
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpi2(x,y);
f = clock() - t;
printf(" %d\n",f);
printf(" Speed up: %f\n", (double)s/f);
printf("Memory compare 15 bytes:\n");
xa[3] = 6;
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa(xa, ya, 15);
s = clock() - t;
printf(" %d\n", s);
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa2(xa, ya, 15);
f = clock() - t;
printf(" %d\n",f);
printf(" Speed up: %f\n", (double)s/f);
xa[3] = 0;
printf("Memory compare 16 bytes:\n");
xa[3] = 6;
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa(xa, ya, 16);
s = clock() - t;
printf(" %d\n", s);
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa2(xa, ya, 16);
f = clock() - t;
printf(" %d\n",f);
printf(" Speed up: %f\n", (double)s/f);
xa[3] = 0;
printf("Memory compare 64 bytes:\n");
xa[15] = 6;
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa(xa, ya, 64);
s = clock() - t;
printf(" %d\n", s);
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa2(xa, ya, 64);
f = clock() - t;
printf(" %d\n",f);
printf(" Speed up: %f\n", (double)s/f);
xa[15] = 0;
printf("Memory compare 256 bytes:\n");
xa[63] = 6;
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa(xa, ya, 256);
s = clock() - t;
printf(" %d\n", s);
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa2(xa, ya, 256);
f = clock() - t;
printf(" %d\n",f);
printf(" Speed up: %f\n", (double)s/f);
}
[-- Attachment #3: Type: TEXT/PLAIN, Size: 7710 bytes --]
.file "cmps.c"
.text
.align 2
.p2align 4,,15
.globl cmpi
.type cmpi,@function
cmpi:
cld
subl $8, %esp
movl $4, %ecx
movl %esi, (%esp)
leal 12(%esp), %esi
movl %edi, 4(%esp)
leal 16(%esp), %edi
repz
cmpsb
movl 4(%esp), %edi
movl (%esp), %esi
seta %dl
setb %cl
addl $8, %esp
subb %cl, %dl
movsbl %dl,%eax
ret
.Lfe1:
.size cmpi,.Lfe1-cmpi
.align 2
.p2align 4,,15
.globl cmpi2
.type cmpi2,@function
cmpi2:
movl 4(%esp), %eax
movl 8(%esp), %ecx
#APP
bswap %eax
bswap %ecx
#NO_APP
subl %ecx, %eax
ret
.Lfe2:
.size cmpi2,.Lfe2-cmpi2
.align 2
.p2align 4,,15
.globl cmpa
.type cmpa,@function
cmpa:
cld
subl $8, %esp
movl 20(%esp), %ecx
movl %esi, (%esp)
movl 12(%esp), %esi
cmpl %ecx, %ecx
movl %edi, 4(%esp)
movl 16(%esp), %edi
repz
cmpsb
movl 4(%esp), %edi
movl (%esp), %esi
seta %dl
setb %cl
addl $8, %esp
subb %cl, %dl
movsbl %dl,%eax
ret
.Lfe3:
.size cmpa,.Lfe3-cmpa
.align 2
.p2align 4,,15
.globl cmpa2
.type cmpa2,@function
cmpa2:
pushl %edi
xorl %edx, %edx
pushl %esi
pushl %ebx
movl 24(%esp), %ecx
movl 16(%esp), %esi
movl 20(%esp), %edi
movl %ecx, %ebx
shrl $2, %ebx
cmpl %ebx, %edx
jae .L6
movl (%edi), %eax
cmpl %eax, (%esi)
je .L9
.L6:
leal 0(,%edx,4), %eax
subl %eax, %ecx
jne .L10
xorl %ecx, %ecx
.L4:
popl %ebx
movl %ecx, %eax
popl %esi
popl %edi
ret
.p2align 4,,7
.L10:
cmpl $3, %ecx
jbe .L12
movl (%esi,%edx,4), %ecx
movl (%edi,%edx,4), %eax
#APP
bswap %ecx
bswap %eax
#NO_APP
.L22:
subl %eax, %ecx
jmp .L4
.p2align 4,,7
.L12:
cmpl $2, %ecx
leal (%eax,%esi), %ebx
leal (%eax,%edi), %esi
je .L16
cmpl $2, %ecx
ja .L20
cmpl $1, %ecx
jne .L4
movzbl (%ebx), %ecx
movzbl (%esi), %eax
jmp .L22
.L20:
cmpl $3, %ecx
jne .L4
movzbl 1(%esi), %eax
movzbl (%ebx), %ecx
movzbl (%esi), %edx
movzbl 1(%ebx), %edi
subl %edx, %ecx
subl %eax, %edi
movzbl 2(%esi), %eax
sall $8, %edi
sall $16, %ecx
addl %edi, %ecx
movzbl 2(%ebx), %edi
addl %edi, %ecx
jmp .L22
.L16:
movzbl (%ebx), %ecx
movzbl (%esi), %edx
movzbl 1(%ebx), %edi
movzbl 1(%esi), %eax
subl %edx, %ecx
sall $8, %ecx
subl %eax, %edi
addl %edi, %ecx
jmp .L4
.p2align 4,,7
.L9:
incl %edx
cmpl %ebx, %edx
jae .L6
movl (%edi,%edx,4), %eax
cmpl %eax, (%esi,%edx,4)
je .L9
jmp .L6
.Lfe4:
.size cmpa2,.Lfe4-cmpa2
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "a b"
.LC2:
.string "Memory compare int:"
.LC1:
.string "b a"
.LC3:
.string " %d\n"
.LC4:
.string " Speed up: %f\n"
.LC5:
.string "Memory compare 15 bytes:"
.LC6:
.string "Memory compare 16 bytes:"
.LC7:
.string "Memory compare 64 bytes:"
.LC8:
.string "Memory compare 256 bytes:"
.text
.align 2
.p2align 4,,15
.globl main
.type main,@function
main:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
pushl %edi
movl $256, %ecx
leal -1048(%ebp), %edi
pushl %esi
pushl %ebx
subl $2092, %esp
andl $-16, %esp
cld
rep
stosl
movl $256, %ecx
leal -2072(%ebp), %edi
rep
stosl
xorl %edi, %edi
movl .LC0, %esi
movl $.LC2, (%esp)
movl .LC1, %ebx
call puts
call clock
movl %eax, -2084(%ebp)
.p2align 4,,15
.L28:
movl %esi, (%esp)
incl %edi
movl %ebx, 4(%esp)
call cmpi
cmpl $16777216, %edi
jne .L28
call clock
movl %eax, -2080(%ebp)
xorl %edi, %edi
movl -2084(%ebp), %eax
subl %eax, -2080(%ebp)
movl $.LC3, (%esp)
movl -2080(%ebp), %eax
movl %eax, 4(%esp)
call printf
call clock
movl %eax, -2084(%ebp)
.L33:
movl %esi, (%esp)
incl %edi
movl %ebx, 4(%esp)
call cmpi2
cmpl $16777216, %edi
jne .L33
call clock
movl $.LC3, (%esp)
movl %eax, %esi
movl -2084(%ebp), %eax
xorl %edi, %edi
subl %eax, %esi
movl %esi, 4(%esp)
call printf
fildl -2080(%ebp)
movl $.LC4, (%esp)
pushl %esi
fildl (%esp)
addl $4, %esp
fdivrp %st, %st(1)
fstpl 4(%esp)
call printf
movl $.LC5, (%esp)
call puts
movl $6, %eax
movl %eax, -1036(%ebp)
call clock
movl %eax, -2084(%ebp)
.L38:
movl $15, 8(%esp)
leal -1048(%ebp), %ecx
incl %edi
movl %ecx, (%esp)
leal -2072(%ebp), %edx
movl %edx, 4(%esp)
call cmpa
cmpl $16777216, %edi
jne .L38
call clock
movl %eax, -2080(%ebp)
xorl %edi, %edi
movl -2084(%ebp), %eax
subl %eax, -2080(%ebp)
movl $.LC3, (%esp)
movl -2080(%ebp), %ebx
movl %ebx, 4(%esp)
call printf
call clock
movl %eax, -2084(%ebp)
.L43:
movl $15, 8(%esp)
leal -1048(%ebp), %edx
incl %edi
movl %edx, (%esp)
leal -2072(%ebp), %esi
movl %esi, 4(%esp)
call cmpa2
cmpl $16777216, %edi
jne .L43
call clock
movl $.LC3, (%esp)
movl %eax, %esi
movl -2084(%ebp), %eax
xorl %edi, %edi
subl %eax, %esi
movl %esi, 4(%esp)
call printf
fildl -2080(%ebp)
movl $.LC4, (%esp)
pushl %esi
fildl (%esp)
addl $4, %esp
fdivrp %st, %st(1)
fstpl 4(%esp)
call printf
movl %edi, -1036(%ebp)
xorl %edi, %edi
movl $.LC6, (%esp)
call puts
movl $6, %eax
movl %eax, -1036(%ebp)
call clock
movl %eax, -2084(%ebp)
.L48:
movl $16, 8(%esp)
leal -1048(%ebp), %ecx
incl %edi
movl %ecx, (%esp)
leal -2072(%ebp), %eax
movl %eax, 4(%esp)
call cmpa
cmpl $16777216, %edi
jne .L48
call clock
movl %eax, -2080(%ebp)
xorl %edi, %edi
movl -2084(%ebp), %eax
subl %eax, -2080(%ebp)
movl $.LC3, (%esp)
movl -2080(%ebp), %ebx
movl %ebx, 4(%esp)
call printf
call clock
movl %eax, -2084(%ebp)
.L53:
movl $16, 8(%esp)
leal -2072(%ebp), %edx
incl %edi
movl %edx, 4(%esp)
leal -1048(%ebp), %esi
movl %esi, (%esp)
call cmpa2
cmpl $16777216, %edi
jne .L53
call clock
movl $.LC3, (%esp)
movl %eax, %esi
movl -2084(%ebp), %eax
xorl %edi, %edi
subl %eax, %esi
movl %esi, 4(%esp)
call printf
fildl -2080(%ebp)
movl $.LC4, (%esp)
pushl %esi
fildl (%esp)
addl $4, %esp
fdivrp %st, %st(1)
fstpl 4(%esp)
call printf
movl %edi, -1036(%ebp)
xorl %edi, %edi
movl $.LC7, (%esp)
call puts
movl $6, %eax
movl %eax, -988(%ebp)
call clock
movl %eax, -2084(%ebp)
.L58:
movl $64, 8(%esp)
leal -1048(%ebp), %ecx
incl %edi
movl %ecx, (%esp)
leal -2072(%ebp), %ebx
movl %ebx, 4(%esp)
call cmpa
cmpl $16777216, %edi
jne .L58
call clock
movl %eax, -2080(%ebp)
xorl %edi, %edi
movl -2084(%ebp), %eax
subl %eax, -2080(%ebp)
movl $.LC3, (%esp)
movl -2080(%ebp), %edx
movl %edx, 4(%esp)
call printf
call clock
movl %eax, -2084(%ebp)
.L63:
movl $64, 8(%esp)
leal -2072(%ebp), %eax
incl %edi
movl %eax, 4(%esp)
leal -1048(%ebp), %esi
movl %esi, (%esp)
call cmpa2
cmpl $16777216, %edi
jne .L63
call clock
movl $.LC3, (%esp)
movl -2084(%ebp), %ecx
movl %eax, %edi
xorl %ebx, %ebx
subl %ecx, %edi
movl %edi, 4(%esp)
call printf
fildl -2080(%ebp)
movl $.LC4, (%esp)
pushl %edi
movl $6, %edi
fildl (%esp)
addl $4, %esp
fdivrp %st, %st(1)
fstpl 4(%esp)
call printf
movl %ebx, -988(%ebp)
movl $.LC8, (%esp)
call puts
movl %edi, -796(%ebp)
xorl %edi, %edi
call clock
movl %eax, -2084(%ebp)
.L68:
movl $256, 8(%esp)
leal -2072(%ebp), %edx
incl %edi
movl %edx, 4(%esp)
leal -1048(%ebp), %esi
movl %esi, (%esp)
call cmpa
cmpl $16777216, %edi
jne .L68
call clock
movl %eax, -2080(%ebp)
xorl %edi, %edi
movl -2084(%ebp), %eax
subl %eax, -2080(%ebp)
movl $.LC3, (%esp)
movl -2080(%ebp), %eax
movl %eax, 4(%esp)
call printf
call clock
movl %eax, -2084(%ebp)
.L73:
movl $256, 8(%esp)
leal -1048(%ebp), %ecx
incl %edi
movl %ecx, (%esp)
leal -2072(%ebp), %ebx
movl %ebx, 4(%esp)
call cmpa2
cmpl $16777216, %edi
jne .L73
call clock
movl $.LC3, (%esp)
movl %eax, %edi
movl -2084(%ebp), %eax
subl %eax, %edi
movl %edi, 4(%esp)
call printf
fildl -2080(%ebp)
movl $.LC4, (%esp)
pushl %edi
fildl (%esp)
addl $4, %esp
fdivrp %st, %st(1)
fstpl 4(%esp)
call printf
leal -12(%ebp), %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.Lfe5:
.size main,.Lfe5-main
.ident "GCC: (GNU) 3.2 20020903 (Red Hat Linux 8.0 3.2-7)"
* Re: Slow memcmp for aligned strings on Pentium 3
2003-04-04 8:52 ` Zack Weinberg
@ 2003-04-04 15:16 ` Kevin Atkinson
0 siblings, 0 replies; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-04 15:16 UTC (permalink / raw)
To: Zack Weinberg; +Cc: gcc
On Thu, 3 Apr 2003, Zack Weinberg wrote:
> Kevin Atkinson <kevin@atkinson.dhs.org> writes:
>
> > I was doing some tests with memcmp and it seems that gcc 3.2.2 always
> > uses cmps, even for aligned strings.
> >
> > I did some tests and discovered that using cmps was rather slow, compared
> > to a simple loop and then a bswap and subtract at the end.
>
> Patches are welcome.
>
The best I can give you is a memcmp implementation. I know nothing about
gcc internals.
--
http://kevin.atkinson.dhs.org
* Re: Slow memcmp for aligned strings on Pentium 3
2003-04-04 13:18 Roger Sayle
@ 2003-04-04 14:51 ` Kevin Atkinson
2003-04-04 17:08 ` Kevin Atkinson
1 sibling, 0 replies; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-04 14:51 UTC (permalink / raw)
To: Roger Sayle; +Cc: gcc
On Thu, 3 Apr 2003, Roger Sayle wrote:
>
> Hi Kevin,
> > I did some tests and discovered that using cmps was rather slow,
> > compared to a simple loop and then a bswap and subtract at the end.
>
> I'm sure that GCC's memcmp implementations could be improved, but
> from reading the code examples in your patch it looks like you
> are always assuming that either the length is a multiple of four,
> or that the bytes following the memory sections to be compared
> contain identical values (i.e. you're hoping they're all zero).
>
> i.e., if p and q are suitably 4-byte aligned
>
> memcpy(p,"abcd",4);
> memcpy(q,"abef",4);
> memcmp(p,q,2)
>
> should compare equal, but doesn't when using bswaps and subtractions.
> Similarly, when two words mismatch, the return value <0 or
> >0 should depend upon the first byte that differs, not the
> values of the bytes that come after it.
You're right. It can be fixed by testing whether the size is a multiple of 4
and, if not, doing a byte-wise comparison at the end.
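A minimal C sketch of that fix (my illustration, not the actual revised patch; the name cmpa2_fixed is invented): compare word-by-word while the words match, then finish with a byte-wise scan that covers both a mismatching word and any trailing bytes, so the result depends only on the first differing byte and never reads past size:

```c
#include <stddef.h>
#include <string.h>

int cmpa2_fixed(const unsigned char *x, const unsigned char *y, size_t size)
{
    size_t words = size / 4;
    size_t i;
    for (i = 0; i < words; ++i) {
        unsigned int a, b;
        memcpy(&a, x + 4 * i, 4);  /* word-sized loads */
        memcpy(&b, y + 4 * i, 4);
        if (a != b)
            break;                 /* locate the differing byte below */
    }
    /* Byte-wise scan over the mismatching word (if any) and the tail
       bytes left over when size is not a multiple of 4. */
    for (i *= 4; i < size; ++i)
        if (x[i] != y[i])
            return (int)x[i] - (int)y[i];
    return 0;
}
```

This keeps memcmp's contract (sign taken from the first differing byte) at the cost of at most a few extra byte compares on a mismatch.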
> I suspect it should be possible to fix your code to handle these
> termination conditions correctly, and a comparison of your
> routine's performance with these fixes vs. __builtin_memcmp
> would be of interest.
I will see what I can do.
---
http://kevin.atkinson.dhs.org
* Re: Slow memcmp for aligned strings on Pentium 3
@ 2003-04-04 14:33 Bonzini
0 siblings, 0 replies; 14+ messages in thread
From: Bonzini @ 2003-04-04 14:33 UTC (permalink / raw)
To: kevin; +Cc: gcc
> I was doing some tests with memcmp and it seems that gcc 3.2.2
> always uses cmps, even for aligned strings.
I also had a very fast memcmp that worked word-by-word in the unaligned case
as well (using some bit-twiddling tricks with SHRD), but I never had the time
to polish it for inclusion in glibc. It turned out, however, that CMPS is as
fast as mine (or even faster) on AMD chips.
In any case, I do think that memcmp expansion in gcc should be disabled unless
optimizing for size, or unless the size is known to be small, say 16 bytes.
See these posts:
http://gcc.gnu.org/ml/gcc/2002-10/msg01616.html (me)
http://gcc.gnu.org/ml/gcc/2002-10/msg01637.html (Roger)
http://sources.redhat.com/ml/libc-alpha/2002-10/msg00511.html (source)
Paolo
* Re: Slow memcmp for aligned strings on Pentium 3
@ 2003-04-04 13:18 Roger Sayle
2003-04-04 14:51 ` Kevin Atkinson
2003-04-04 17:08 ` Kevin Atkinson
0 siblings, 2 replies; 14+ messages in thread
From: Roger Sayle @ 2003-04-04 13:18 UTC (permalink / raw)
To: Kevin Atkinson; +Cc: gcc
Hi Kevin,
> I did some tests and discovered that using cmps was rather slow,
> compared to a simple loop and then a bswap and subtract at the end.
I'm sure that GCC's memcmp implementations could be improved, but
from reading the code examples in your patch it looks like you
are always assuming that either the length is a multiple of four,
or that the bytes following the memory sections to be compared
contain identical values (i.e. you're hoping they're all zero).
i.e., if p and q are suitably 4-byte aligned
memcpy(p,"abcd",4);
memcpy(q,"abef",4);
memcmp(p,q,2)
should compare equal, but doesn't when using bswaps and subtractions.
Similarly, when two words mismatch, the return value <0 or
>0 should depend upon the first byte that differs, not the
values of the bytes that come after it.
I suspect it should be possible to fix your code to handle these
termination conditions correctly, and a comparison of your
routine's performance with these fixes vs. __builtin_memcmp
would be of interest.
Roger
--
Roger Sayle, E-mail: roger@eyesopen.com
OpenEye Scientific Software, WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road, Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507. Fax: (+1) 505-473-0833
* Re: Slow memcmp for aligned strings on Pentium 3
2003-04-04 6:22 Kevin Atkinson
@ 2003-04-04 8:52 ` Zack Weinberg
2003-04-04 15:16 ` Kevin Atkinson
0 siblings, 1 reply; 14+ messages in thread
From: Zack Weinberg @ 2003-04-04 8:52 UTC (permalink / raw)
To: Kevin Atkinson; +Cc: gcc
Kevin Atkinson <kevin@atkinson.dhs.org> writes:
> I was doing some tests with memcmp and it seems that gcc 3.2.2 always
> uses cmps, even for aligned strings.
>
> I did some tests and discovered that using cmps was rather slow, compared
> to a simple loop and then a bswap and subtract at the end.
Patches are welcome.
zw
* Slow memcmp for aligned strings on Pentium 3
@ 2003-04-04 6:22 Kevin Atkinson
2003-04-04 8:52 ` Zack Weinberg
0 siblings, 1 reply; 14+ messages in thread
From: Kevin Atkinson @ 2003-04-04 6:22 UTC (permalink / raw)
To: gcc
[-- Attachment #1: Type: TEXT/PLAIN, Size: 4774 bytes --]
I was doing some tests with memcmp and it seems that gcc 3.2.2 always
uses cmps, even for aligned strings.
I did some tests and discovered that using cmps was rather slow, compared
to a simple loop and then a bswap and subtract at the end.
My method is the following:
int cmpa2(unsigned int * x, unsigned int * y, size_t size)
{
int i = 0;
size /= 4;
while (i < size && x[i] == y[i]) ++i;
if (i == size) return 0;
asm("bswap %0" : "+r"(x[i]));
asm("bswap %0" : "+r"(y[i]));
return x[i] - y[i];
}
as opposed to:
int cmpa(unsigned int * x, unsigned int * y, size_t size)
{
return __builtin_memcmp(x,y,size);
}
Even though it takes integer pointers, it really compares the data as if
it were byte strings. I used integer pointers to avoid having to cast and
to let the compiler know that the strings are 4-byte aligned.
This method is 3-5 times as fast; the longer the string, the larger the
speedup.
I also wrote a specialized compare that compares two ints AS IF they
were 4-byte strings:
int cmpi2(unsigned int x, unsigned int y)
{
asm("bswap %0" : "+r"(x));
asm("bswap %0" : "+r"(y));
return x - y;
}
as opposed to:
int cmpi(unsigned int x, unsigned int y)
{
return __builtin_memcmp(&x,&y,4);
}
This was 4-6 times faster. However, I believe the function-call overhead
is significant here, as -fomit-frame-pointer increased the ratio (from 4
to 6), and when I tried to account for the cost of the function call the
results varied anywhere from 10 to 30 times faster.
If a better memcmp is not already implemented in the yet-to-be-released gcc,
I hope someone will seriously consider rewriting the memcmp expansion.
Even though I only tested it on a Pentium 3, I am sure the results will
also hold for a Pentium Classic, based on the instruction timings. I believe
the bswap instruction was introduced with the 486, so only generic 386 code
will have to use something other than bswap.
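For that generic 386 case, the byte reversal that bswap performs can be written portably in C with shifts and masks (a sketch; the name bswap32 is mine):

```c
/* Portable equivalent of the 486+ bswap instruction: reverses the
   four bytes of a 32-bit value using only shifts and masks, so it
   also works on a plain 386. */
unsigned int bswap32(unsigned int v)
{
    return (v >> 24)
         | ((v >> 8) & 0x0000ff00u)
         | ((v << 8) & 0x00ff0000u)
         | (v << 24);
}
```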
The test program is attached. Here are the results when compiled with:
gcc -fomit-frame-pointer -O3 -march=pentium3 cmps.c (attribute noinline
was used on test functions)
Memory compare int:
2170000
350000
Speed up: 6.200000
Memory compare 16 bytes:
4100000
1240000
Speed up: 3.306452
Memory compare 64 bytes:
10930000
2990000
Speed up: 3.655518
Memory compare 256 bytes:
37510000
7700000
Speed up: 4.871429
Here is the relevant assembly code:
cmpi: // integer as byte string compare using memcmp
cld
subl $8, %esp
movl $4, %ecx
movl %esi, (%esp)
leal 12(%esp), %esi
movl %edi, 4(%esp)
leal 16(%esp), %edi
repz
cmpsb
movl 4(%esp), %edi
movl (%esp), %esi
seta %dl
setb %cl
addl $8, %esp
subb %cl, %dl
movsbl %dl,%eax
ret
cmpi2: // integer as byte string compare: my method
movl 4(%esp), %eax
movl 8(%esp), %ecx
bswap %eax
bswap %ecx
subl %ecx, %eax
ret
cmpa: // aligned byte compare using memcmp
cld
subl $8, %esp
movl 20(%esp), %ecx
movl %esi, (%esp)
movl 12(%esp), %esi
cmpl %ecx, %ecx
movl %edi, 4(%esp)
movl 16(%esp), %edi
repz
cmpsb
movl 4(%esp), %edi
movl (%esp), %esi
seta %dl
setb %cl
addl $8, %esp
subb %cl, %dl
movsbl %dl,%eax
ret
cmpa2: // aligned byte compare: my method
subl $12, %esp
movl 24(%esp), %ecx
xorl %edx, %edx
movl %esi, 4(%esp)
movl 16(%esp), %esi
shrl $2, %ecx
movl %edi, 8(%esp)
cmpl %ecx, %edx
movl 20(%esp), %edi
movl %ebx, (%esp)
jae .L6
movl (%edi), %eax
cmpl %eax, (%esi)
je .L9
.L6:
xorl %ebx, %ebx
cmpl %ecx, %edx
je .L4
movl (%esi,%edx,4), %eax
bswap %eax
movl %eax, (%esi,%edx,4)
movl (%edi,%edx,4), %eax
bswap %eax
movl %eax, (%edi,%edx,4)
movl (%esi,%edx,4), %ebx
subl %eax, %ebx
.L4:
movl %ebx, %eax
movl 4(%esp), %esi
movl (%esp), %ebx
movl 8(%esp), %edi
addl $12, %esp
ret
.p2align 4,,7
.L9:
incl %edx
cmpl %ecx, %edx
jae .L6
movl (%edi,%edx,4), %eax
cmpl %eax, (%esi,%edx,4)
je .L9
jmp .L6
---
http://kevin.atkinson.dhs.org
[-- Attachment #2: Type: TEXT/PLAIN, Size: 2172 bytes --]
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
__attribute__((noinline))
int cmpi(unsigned int x, unsigned int y)
{
return __builtin_memcmp(&x,&y,4);
}
__attribute__((noinline))
int cmpi2(unsigned int x, unsigned int y)
{
asm("bswap %0" : "+r"(x));
asm("bswap %0" : "+r"(y));
return x - y;
}
__attribute__((noinline))
int cmpa(unsigned int * x, unsigned int * y, size_t size)
{
return __builtin_memcmp(x,y,size);
}
__attribute__((noinline))
int cmpa2(unsigned int * x, unsigned int * y, size_t size)
{
int i = 0;
size /= 4;
while (i < size && x[i] == y[i]) ++i;
if (i == size) return 0;
asm("bswap %0" : "+r"(x[i]));
asm("bswap %0" : "+r"(y[i]));
return x[i] - y[i];
}
int main()
{
unsigned int x,y,i;
int xa[256] = {}, ya[256] = {};
int s,f;
clock_t t;
x = *(unsigned int *)("a b");
y = *(unsigned int *)("b a");
printf("Memory compare int:\n");
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpi(x,y);
s = clock() - t;
printf(" %d\n", s);
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpi2(x,y);
f = clock() - t;
printf(" %d\n",f);
printf(" Speed up: %f\n", (double)s/f);
printf("Memory compare 16 bytes:\n");
xa[3] = 6;
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa(xa, ya, 16);
s = clock() - t;
printf(" %d\n", s);
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa2(xa, ya, 16);
f = clock() - t;
printf(" %d\n",f);
printf(" Speed up: %f\n", (double)s/f);
xa[3] = 0;
xa[15] = 6;
printf("Memory compare 64 bytes:\n");
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa(xa, ya, 64);
s = clock() - t;
printf(" %d\n", s);
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa2(xa, ya, 64);
f = clock() - t;
printf(" %d\n",f);
printf(" Speed up: %f\n", (double)s/f);
xa[15] = 0;
xa[255] = 6;
printf("Memory compare 256 bytes:\n");
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa(xa, ya, 256);
s = clock() - t;
printf(" %d\n", s);
t = clock();
for (i = 0; i != 0x1000000; ++i)
cmpa2(xa, ya, 256);
f = clock() - t;
printf(" %d\n",f);
printf(" Speed up: %f\n", (double)s/f);
}
end of thread, other threads:[~2003-04-07 15:23 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-04-04 17:13 Slow memcmp for aligned strings on Pentium 3 Jerry Quinn
2003-04-04 17:29 ` Kevin Atkinson
2003-04-06 5:54 ` Jerry Quinn
2003-04-06 22:17 ` Kevin Atkinson
2003-04-07 17:30 ` Jerry Quinn
-- strict thread matches above, loose matches on Subject: below --
2003-04-04 14:33 Bonzini
2003-04-04 13:18 Roger Sayle
2003-04-04 14:51 ` Kevin Atkinson
2003-04-04 17:08 ` Kevin Atkinson
2003-04-04 18:58 ` Marcel Cox
2003-04-04 19:27 ` Kevin Atkinson
2003-04-04 6:22 Kevin Atkinson
2003-04-04 8:52 ` Zack Weinberg
2003-04-04 15:16 ` Kevin Atkinson