swap() on x86 platforms.

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* swap() on x86 platforms.
@ 1998-01-05 19:22 Iain McClatchie
  1998-01-15 16:30 ` amylaar
  0 siblings, 1 reply; 4+ messages in thread
From: Iain McClatchie @ 1998-01-05 19:22 UTC (permalink / raw)
  To: egcs

Pal-Kristian> Testing on a Pentium revealed that using xchgl for
Pal-Kristian> swapping numbers is not the best way!
Pal-Kristian> Results: (Don't count on them!)

Pal-Kristian> Normal   = ~5.5 secs (+/- 0.1)
Pal-Kristian> XOR      = ~8.5 secs (+/- 0.1)

Here I've included some timings on a 150MHz PPro, egcs-2.91.02 971216
vs gcc-2.7.2.  I translated the code into C for comparison as well.

gcc-2.7.2:
in C++
	-O	-O2	-O3	-O6
swap1:  4.22	4.93	4.93	4.92
swap2:	8.67	7.11	7.10	7.11
swap3:	5.39	5.36	5.38	5.40
swap4:	7.06	7.07	7.04	7.06

in C
swap1:	4.93	4.22	4.22	4.24
swap2:	9.63	7.76	7.74	7.77

egcs-2.91.02 971216:
in C++:
swap1:	4.95	4.23	4.24	4.23
swap2:	9.73	7.76	7.74	7.79
swap3:	5.53	4.97	4.99	4.93
swap4:	7.10	7.03	7.06	7.05

in C
swap1:	3.59	4.23	4.23	4.23
swap2:	9.87	6.40	6.34	6.34

Obviously, swap1 is the way to go, which is good; it's also the way
everybody does it right now in existing code.  -O3 and -O6 don't do
anything to this kernel.

C is faster than C++.  It's a little interesting to note the
performance loss in C going from -O to -O2.  In an effort to discover
the problem, I extracted the inner loops from the C and C++ versions
of swap1 at -O and -O2:

C++:
	-O: 7 cycles/loop			-O2: 6 cycles/loop
.L60:					.L60:
        cmpl $10238,%esi		        cmpl $10238,%esi
        jg .L58				        jg .L58
        leal -40960(%ebp,%esi,4),%eax	        leal
-40956(%ebp,%esi,4),%eax
        leal -40956(%ebp,%esi,4),%ecx	        movl (%ebx),%ecx
        movl (%eax),%ebx		        movl (%eax),%edx
        movl (%ecx),%edx		        movl %edx,(%ebx)
        movl %edx,(%eax)		        movl %ecx,(%eax)
        movl %ebx,(%ecx)		        addl $4,%ebx
        incl %esi			        incl %esi
        jmp .L60			        jmp .L60

C:
	-O: 5 cycles/loop			-O2: 6 cycles/loop
.L39:					.L39:
        leal -40960(%ebp,%esi,4),%eax	        movl (%ecx),%edx
        leal -40956(%ebp,%esi,4),%edx	        movl (%ebx),%eax
        movl (%eax),%ebx		        movl %eax,(%ecx)
        movl (%edx),%ecx		        movl %edx,(%ebx)
        movl %ecx,(%eax)		        addl $4,%ebx
        movl %ebx,(%edx)		        addl $4,%ecx
        incl %esi			        incl %esi
        cmpl $10238,%esi		        cmpl $10238,%esi
        jle .L39			        jle .L39

You can see that C++ fails to combine the loop exit branch with the
loop closure branch, although since both branches are almost perfectly
predicted, and we have spare operational units, there is no loss.

I suspect that without higher level optimization, 5 cycles/loop is
optimal on the PPro.  Unrolling the loop does not help.  The machine
can do one memory operation per cycle, and is probably spending a
cycle bypassing the result of one store to a successive load.  You can
imagine a compiler that notices that one iteration's stored ebx will
be the next iteration's loaded ebx, and thereby eliminates a load and
potentially even a store.  On some numeric code this might be an
interesting optimization, but on this example it is somewhat
contrived.

Eliminating that load is worth one cycle per iteration, and since the
machine is no longer bypassing stores to loads, we can unroll the loop
to amortize what appears to be a one cycle per taken branch cost over
several iterations.  Eliminating that store is worth another cycle per
iteration.  The resulting loop averages 2.34 cycles per iteration of
the source (Why not 2.5 cycles?  I don't know).  I've included the
loop setup and exit code.

        leal -40960(%ebp),%ecx
        leal -8(%ebp),%esi
        movl (%ecx),%edx
        .align 16
.L39:
        movl 4(%ecx),%eax
        movl %eax,(%ecx)
        movl 8(%ecx),%eax
        movl %eax,4(%ecx)
        addl $8,%ecx
        cmpl %esi,%ecx
        jle .L39
        movl %edx,(%ecx)

I cannot imagine why the "C -O2" loop runs a cycle slower than the "C
-O" loop.

Here's GCC-2.7.2.  (I formatted the table differently.)  You can see
that the older GCC could actually do induction variable elimination in
C at -O2.  The newer egcs-121697 cannot do it -- perhaps someone will
dig around to find out why.

at -O, inner loop:
	C++:6 cycles/loop			C:7 cycles/loop
.L60:					.L44:
        cmpl $10238,%esi			
        jg .L58				        leal 0(,%esi,4),%eax
        leal 0(,%esi,4),%eax		        leal -40960(%ebp,%eax),%ebx
        leal -40960(%ebp,%eax),%ebx	        leal -40956(%ebp,%eax),%ecx
        leal -40956(%eax,%ebp),%ecx	        movl (%ebx),%edx
        movl (%ebx),%edx			movl (%ecx),%eax
        movl (%ecx),%eax		        movl %eax,(%ebx)
        movl %eax,(%ebx)		        movl %edx,(%ecx)
        movl %edx,(%ecx)		        incl %esi
        incl %esi			        cmpl $10238,%esi
        jmp .L60			        jle .L44

at -O2, inner loop:
	C++:7 cycles/loop			C:6 cycles/loop
.L60:					.L44:
        cmpl $10238,%edi		        leal -40960(%ebp,%esi),%ebx
        jg .L58				        leal -40956(%ebp,%esi),%ecx
        leal -40960(%ebp,%esi),%ebx	        movl (%ebx),%edx
        leal -40956(%esi,%ebp),%ecx	        movl (%ecx),%eax
        movl (%ebx),%edx		        movl %eax,(%ebx)
        movl (%ecx),%eax		        movl %edx,(%ecx)
        movl %eax,(%ebx)		        addl $4,%esi
        movl %edx,(%ecx)		        cmpl $40952,%esi
        addl $4,%esi			        jle .L44
        incl %edi
        jmp .L60

-Iain

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: swap() on x86 platforms.
  1998-01-05 19:22 swap() on x86 platforms Iain McClatchie
@ 1998-01-15 16:30 ` amylaar
  1998-01-19 15:31   ` amylaar
  0 siblings, 1 reply; 4+ messages in thread
From: amylaar @ 1998-01-15 16:30 UTC (permalink / raw)
  To: Iain McClatchie; +Cc: egcs

> Here's GCC-2.7.2.  (I formatted the table differently.)  You can see
> that the older GCC could actually do induction variable elimination in
> C at -O2.  The newer egcs-121697 cannot do it -- perhaps someone will
> dig around to find out why.

No need to dig around - it was disabled on purpose, to allegedly fix the
bugs associated with biv elimination.  I think the overall result is
to disable about 95% of the biv elimination and about 98% of the biv
elimination bugs.

It is possible to solve this properly, but it needs some register
lifetime information at loop optimization time.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: swap() on x86 platforms.
  1998-01-15 16:30 ` amylaar
@ 1998-01-19 15:31   ` amylaar
  0 siblings, 0 replies; 4+ messages in thread
From: amylaar @ 1998-01-19 15:31 UTC (permalink / raw)
  To: amylaar; +Cc: egcs

> > Here's GCC-2.7.2.  (I formatted the table differently.)  You can see
> > that the older GCC could actually do induction variable elimination in
> > C at -O2.  The newer egcs-121697 cannot do it -- perhaps someone will
> > dig around to find out why.
> 
> No need to dig around - it was disabled on purpose, to allegedly fix the
> bugs associated with biv elimination.  I think the overall result is
> to disable about 95% of the biv elimination and about 98% of the biv
> elimination bugs.
> 
> It is possible to solve this properly, but it needs some register
> lifetime information at loop optimization time.

Actually, there would b an alternative way to handle this.  My original
approach (about 2.5 years ago) was to only do some transformations that
change the value kept in the basic induction variable (biv) when it could
be shown that its value is not used after the loop.

Alternatively, we could change the biv anyways, and adjust the value
after the loop.  Flow will later remove superflous adjustments in the
case where the value is not used.  If the value is indeed used, we
might end up with a tighter loop and more complex code after the loop.
Of course, if the use of the value is just a simple test, the extra
complexity might be optimized away by a later arithmetic transformation
(e.g. cse2 or combine) .

The only ceveat that I am aware of is that we have to make sure to handle
multiple loop exits properly.  Btw, does anyone have a guess how important
it is to actually perform biv elimination in the multiple loop exit case?

^ permalink raw reply	[flat|nested] 4+ messages in thread

* swap() on x86 platforms.
@ 1998-01-05 11:13 Pal-Kristian Engstad
  0 siblings, 0 replies; 4+ messages in thread
From: Pal-Kristian Engstad @ 1998-01-05 11:13 UTC (permalink / raw)
  To: egcs

Testing on a Pentium revealed that using xchgl for swapping numbers is
not the best way! The reason is that it does not pair very well with
other instructions. On a pentium, if there is a register available,
you should always use:

	movl	reg1, reg3	; t = a;
	movl	reg2, reg1	; a = b;
	movl	reg3, reg2	; b = t;

If a register is not available, it might be faster using the stack:

	pushl	reg1
	pushl	reg2
	popl	reg1
	popl	reg2

The down-side is of course that you have to use the stack.

Results: (Don't count on them!)

 Normal   = ~5.5 secs (+/- 0.1)
 XOR      = ~8.5 secs (+/- 0.1)
 XCHG     = ~6.0 secs (+/- 0.1)
 PUSH/POP = ~5.5 secs (+/- 0.1)

Code:

#include <stdlib.h>

#define FAST_SWAP 1

inline void 
swap1 (int & a, int & b)
{
    int tmp = a; a = b; b = tmp;
}

inline void 
swap2 (int & a, int & b)
{
    if (&a != &b) { a ^= b; b ^= a; a ^= b; }
}

inline void
swap3 (int & a, int & b)
{
    __asm ( "xchgl %0, %1" : "=r" (a), "=r" (b) : "0" (a), "1" (b) );
}

inline void
swap4 (int & a, int & b)
{
    __asm ( "pushl %0; pushl %1; popl %0; popl %1" : "=r" (a), "=r" (b) : "0" (a), "1" (b) );
}

int 
main (int argc, char** argv)
{
    int data [10240];

    for (int i=0; i < 10240; i++)
        data[i] = rand();
    
    for (int j=0; j < 10240; j++)
        for (int i=0; i < 10239; i++)
        {
#if FAST_SWAP == 1
            swap1 (data[i], data[i+1]);
#elif FAST_SWAP == 2
            swap2 (data[i], data[i+1]);
#elif FAST_SWAP == 3
            swap3 (data[i], data[i+1]);
#else
            swap4 (data[i], data[i+1]);
#endif
        }
}

PKE.


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~1998-01-19 15:31 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1998-01-05 19:22 swap() on x86 platforms Iain McClatchie
1998-01-15 16:30 ` amylaar
1998-01-19 15:31   ` amylaar
  -- strict thread matches above, loose matches on Subject: below --
1998-01-05 11:13 Pal-Kristian Engstad

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).