* swap() on x86 platforms.
@ 1998-01-05 19:22 Iain McClatchie
1998-01-15 16:30 ` amylaar
0 siblings, 1 reply; 4+ messages in thread
From: Iain McClatchie @ 1998-01-05 19:22 UTC (permalink / raw)
To: egcs
Pal-Kristian> Testing on a Pentium revealed that using xchgl for
Pal-Kristian> swapping numbers is not the best way!
Pal-Kristian> Results: (Don't count on them!)
Pal-Kristian> Normal = ~5.5 secs (+/- 0.1)
Pal-Kristian> XOR = ~8.5 secs (+/- 0.1)
Here I've included some timings on a 150MHz PPro, egcs-2.91.02 971216
vs gcc-2.7.2. I translated the code into C for comparison as well.
gcc-2.7.2:
in C++
-O -O2 -O3 -O6
swap1: 4.22 4.93 4.93 4.92
swap2: 8.67 7.11 7.10 7.11
swap3: 5.39 5.36 5.38 5.40
swap4: 7.06 7.07 7.04 7.06
in C
swap1: 4.93 4.22 4.22 4.24
swap2: 9.63 7.76 7.74 7.77
egcs-2.91.02 971216:
in C++:
swap1: 4.95 4.23 4.24 4.23
swap2: 9.73 7.76 7.74 7.79
swap3: 5.53 4.97 4.99 4.93
swap4: 7.10 7.03 7.06 7.05
in C
swap1: 3.59 4.23 4.23 4.23
swap2: 9.87 6.40 6.34 6.34
Obviously, swap1 is the way to go, which is good; it's also the way
everybody does it right now in existing code. -O3 and -O6 don't do
anything to this kernel.
C is faster than C++. It's a little interesting to note the
performance loss in C going from -O to -O2. In an effort to discover
the problem, I extracted the inner loops from the C and C++ versions
of swap1 at -O and -O2:
C++:
-O: 7 cycles/loop -O2: 6 cycles/loop
.L60: .L60:
cmpl $10238,%esi cmpl $10238,%esi
jg .L58 jg .L58
leal -40960(%ebp,%esi,4),%eax leal
-40956(%ebp,%esi,4),%eax
leal -40956(%ebp,%esi,4),%ecx movl (%ebx),%ecx
movl (%eax),%ebx movl (%eax),%edx
movl (%ecx),%edx movl %edx,(%ebx)
movl %edx,(%eax) movl %ecx,(%eax)
movl %ebx,(%ecx) addl $4,%ebx
incl %esi incl %esi
jmp .L60 jmp .L60
C:
-O: 5 cycles/loop -O2: 6 cycles/loop
.L39: .L39:
leal -40960(%ebp,%esi,4),%eax movl (%ecx),%edx
leal -40956(%ebp,%esi,4),%edx movl (%ebx),%eax
movl (%eax),%ebx movl %eax,(%ecx)
movl (%edx),%ecx movl %edx,(%ebx)
movl %ecx,(%eax) addl $4,%ebx
movl %ebx,(%edx) addl $4,%ecx
incl %esi incl %esi
cmpl $10238,%esi cmpl $10238,%esi
jle .L39 jle .L39
You can see that C++ fails to combine the loop exit branch with the
loop closure branch, although since both branches are almost perfectly
predicted, and we have spare operational units, there is no loss.
I suspect that without higher level optimization, 5 cycles/loop is
optimal on the PPro. Unrolling the loop does not help. The machine
can do one memory operation per cycle, and is probably spending a
cycle bypassing the result of one store to a successive load. You can
imagine a compiler that notices that one iteration's stored ebx will
be the next iteration's loaded ebx, and thereby eliminates a load and
potentially even a store. On some numeric code this might be an
interesting optimization, but on this example it is somewhat
contrived.
Eliminating that load is worth one cycle per iteration, and since the
machine is no longer bypassing stores to loads, we can unroll the loop
to amortize what appears to be a one cycle per taken branch cost over
several iterations. Eliminating that store is worth another cycle per
iteration. The resulting loop averages 2.34 cycles per iteration of
the source (Why not 2.5 cycles? I don't know). I've included the
loop setup and exit code.
leal -40960(%ebp),%ecx
leal -8(%ebp),%esi
movl (%ecx),%edx
.align 16
.L39:
movl 4(%ecx),%eax
movl %eax,(%ecx)
movl 8(%ecx),%eax
movl %eax,4(%ecx)
addl $8,%ecx
cmpl %esi,%ecx
jle .L39
movl %edx,(%ecx)
I cannot imagine why the "C -O2" loop runs a cycle slower than the "C
-O" loop.
Here's GCC-2.7.2. (I formatted the table differently.) You can see
that the older GCC could actually do induction variable elimination in
C at -O2. The newer egcs-121697 cannot do it -- perhaps someone will
dig around to find out why.
at -O, inner loop:
C++:6 cycles/loop C:7 cycles/loop
.L60: .L44:
cmpl $10238,%esi
jg .L58 leal 0(,%esi,4),%eax
leal 0(,%esi,4),%eax leal -40960(%ebp,%eax),%ebx
leal -40960(%ebp,%eax),%ebx leal -40956(%ebp,%eax),%ecx
leal -40956(%eax,%ebp),%ecx movl (%ebx),%edx
movl (%ebx),%edx movl (%ecx),%eax
movl (%ecx),%eax movl %eax,(%ebx)
movl %eax,(%ebx) movl %edx,(%ecx)
movl %edx,(%ecx) incl %esi
incl %esi cmpl $10238,%esi
jmp .L60 jle .L44
at -O2, inner loop:
C++:7 cycles/loop C:6 cycles/loop
.L60: .L44:
cmpl $10238,%edi leal -40960(%ebp,%esi),%ebx
jg .L58 leal -40956(%ebp,%esi),%ecx
leal -40960(%ebp,%esi),%ebx movl (%ebx),%edx
leal -40956(%esi,%ebp),%ecx movl (%ecx),%eax
movl (%ebx),%edx movl %eax,(%ebx)
movl (%ecx),%eax movl %edx,(%ecx)
movl %eax,(%ebx) addl $4,%esi
movl %edx,(%ecx) cmpl $40952,%esi
addl $4,%esi jle .L44
incl %edi
jmp .L60
-Iain
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: swap() on x86 platforms.
1998-01-05 19:22 swap() on x86 platforms Iain McClatchie
@ 1998-01-15 16:30 ` amylaar
1998-01-19 15:31 ` amylaar
0 siblings, 1 reply; 4+ messages in thread
From: amylaar @ 1998-01-15 16:30 UTC (permalink / raw)
To: Iain McClatchie; +Cc: egcs
> Here's GCC-2.7.2. (I formatted the table differently.) You can see
> that the older GCC could actually do induction variable elimination in
> C at -O2. The newer egcs-121697 cannot do it -- perhaps someone will
> dig around to find out why.
No need to dig around - it was disabled on purpose, to allegedly fix the
bugs associated with biv elimination. I think the overall result is
to disable about 95% of the biv elimination and about 98% of the biv
elimination bugs.
It is possible to solve this properly, but it needs some register
lifetime information at loop optimization time.
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: swap() on x86 platforms.
1998-01-15 16:30 ` amylaar
@ 1998-01-19 15:31 ` amylaar
0 siblings, 0 replies; 4+ messages in thread
From: amylaar @ 1998-01-19 15:31 UTC (permalink / raw)
To: amylaar; +Cc: egcs
> > Here's GCC-2.7.2. (I formatted the table differently.) You can see
> > that the older GCC could actually do induction variable elimination in
> > C at -O2. The newer egcs-121697 cannot do it -- perhaps someone will
> > dig around to find out why.
>
> No need to dig around - it was disabled on purpose, to allegedly fix the
> bugs associated with biv elimination. I think the overall result is
> to disable about 95% of the biv elimination and about 98% of the biv
> elimination bugs.
>
> It is possible to solve this properly, but it needs some register
> lifetime information at loop optimization time.
Actually, there would b an alternative way to handle this. My original
approach (about 2.5 years ago) was to only do some transformations that
change the value kept in the basic induction variable (biv) when it could
be shown that its value is not used after the loop.
Alternatively, we could change the biv anyways, and adjust the value
after the loop. Flow will later remove superflous adjustments in the
case where the value is not used. If the value is indeed used, we
might end up with a tighter loop and more complex code after the loop.
Of course, if the use of the value is just a simple test, the extra
complexity might be optimized away by a later arithmetic transformation
(e.g. cse2 or combine) .
The only ceveat that I am aware of is that we have to make sure to handle
multiple loop exits properly. Btw, does anyone have a guess how important
it is to actually perform biv elimination in the multiple loop exit case?
^ permalink raw reply [flat|nested] 4+ messages in thread
* swap() on x86 platforms.
@ 1998-01-05 11:13 Pal-Kristian Engstad
0 siblings, 0 replies; 4+ messages in thread
From: Pal-Kristian Engstad @ 1998-01-05 11:13 UTC (permalink / raw)
To: egcs
Testing on a Pentium revealed that using xchgl for swapping numbers is
not the best way! The reason is that it does not pair very well with
other instructions. On a pentium, if there is a register available,
you should always use:
movl reg1, reg3 ; t = a;
movl reg2, reg1 ; a = b;
movl reg3, reg2 ; b = t;
If a register is not available, it might be faster using the stack:
pushl reg1
pushl reg2
popl reg1
popl reg2
The down-side is of course that you have to use the stack.
Results: (Don't count on them!)
Normal = ~5.5 secs (+/- 0.1)
XOR = ~8.5 secs (+/- 0.1)
XCHG = ~6.0 secs (+/- 0.1)
PUSH/POP = ~5.5 secs (+/- 0.1)
Code:
#include <stdlib.h>
#define FAST_SWAP 1
inline void
swap1 (int & a, int & b)
{
int tmp = a; a = b; b = tmp;
}
inline void
swap2 (int & a, int & b)
{
if (&a != &b) { a ^= b; b ^= a; a ^= b; }
}
inline void
swap3 (int & a, int & b)
{
__asm ( "xchgl %0, %1" : "=r" (a), "=r" (b) : "0" (a), "1" (b) );
}
inline void
swap4 (int & a, int & b)
{
__asm ( "pushl %0; pushl %1; popl %0; popl %1" : "=r" (a), "=r" (b) : "0" (a), "1" (b) );
}
int
main (int argc, char** argv)
{
int data [10240];
for (int i=0; i < 10240; i++)
data[i] = rand();
for (int j=0; j < 10240; j++)
for (int i=0; i < 10239; i++)
{
#if FAST_SWAP == 1
swap1 (data[i], data[i+1]);
#elif FAST_SWAP == 2
swap2 (data[i], data[i+1]);
#elif FAST_SWAP == 3
swap3 (data[i], data[i+1]);
#else
swap4 (data[i], data[i+1]);
#endif
}
}
PKE.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~1998-01-19 15:31 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1998-01-05 19:22 swap() on x86 platforms Iain McClatchie
1998-01-15 16:30 ` amylaar
1998-01-19 15:31 ` amylaar
-- strict thread matches above, loose matches on Subject: below --
1998-01-05 11:13 Pal-Kristian Engstad
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).