* gcc 4.1.1 poor optimization
@ 2007-01-10 23:17 Greg Smith
2007-01-10 23:31 ` Ian Lance Taylor
0 siblings, 1 reply; 5+ messages in thread
From: Greg Smith @ 2007-01-10 23:17 UTC (permalink / raw)
To: gcc-help
The snippet of code below is part of a much larger module and was
compiled on an FC6 system with gcc 4.1.1 20061011 on linux, kernel
2.6.18-1.2869.
To say that we were disappointed with the emitted assembler would be an
understatement.
Compile options were -O3 -fomit-frame-pointer -march=i686 -fPIC
void (__attribute__ ((regparm(2))) z900_load) (BYTE inst[], REGS *regs)
{
int r1;
int b2;
U64 effective_addr2;
U32 temp = bswap_32(*(U32*)inst);
r1 = (temp >> 20) & 0xf;
b2 = (temp >> 16) & 0xf;
effective_addr2 = temp & 0xfff;
if (b2) effective_addr2 += regs->gr[b2].D; // U64
b2 = (temp >> 12) & 0xf;
if (b2) effective_addr2 += regs->gr[b2].D; // U64
effective_addr2 &= regs->psw.amask.D; // U64
regs->ip += 4;
regs->ilc = 4;
if ((effective_addr2 & 3) == 0)
. . . .
The assembler is below with noted lines:
z900_load:
pushl %ebp
pushl %edi
xorl %edi, %edi
pushl %esi
subl $96, %esp
movl (%eax), %eax
[ 7] movl %edx, 24(%esp)
[ 8] movl %eax, 28(%esp)
#APP
bswap %eax
#NO_APP
movl %eax, %ecx
[11] movl %eax, 28(%esp)
[12] movl 28(%esp), %eax
shrl $16, %ecx
movl %eax, %esi
movl %ecx, %eax
andl $4095, %esi
andl $15, %eax
je .L13528
[19] movl 24(%esp), %edx
addl 80(%edx,%eax,8), %esi
adcl 84(%edx,%eax,8), %edi
.L13528:
movl 28(%esp), %eax
shrl $12, %eax
andl $15, %eax
movl %eax, 76(%esp)
je .L13530
movl %eax, %ecx
[28] movl 24(%esp), %eax
addl 80(%eax,%ecx,8), %esi
adcl 84(%eax,%ecx,8), %edi
.L13530:
[31] movl 24(%esp), %edx
movl 32(%edx), %ecx
addl $4, 44(%edx)
andl %esi, %ecx
movl %ecx, 64(%esp)
movl 36(%edx), %eax
andl %edi, %eax
movl %eax, 68(%esp)
movb $4, 42(%edx)
movl 64(%esp), %eax
[41] xorl %edx, %edx
[42]     movl    %edx, %ecx
[43]     andl    $3, %eax
[44]     orl     %eax, %ecx
[45] jne .L13602
On entry, %eax points to inst, %edx points to REGS.
Variable regs (%edx) is stacked on line 7 and is reloaded from the stack
in lines 19, 28 and 31 despite %edx not being clobbered until line 41.
The 4-byte value pointed to by inst (%eax) is loaded into %eax and then
stacked before the bswap (line 8), then stacked again after the bswap
(line 11). To add insult to injury, line 12 reloads %eax from the stack.
Lines 41..45 all deal with trying to figure out if the low-order 2 bits
of effective_addr2 are zero. All I can say is, WTF? I can get around
this one by casting effective_addr2 to U32 and then testl/jne is
emitted, but I shouldn't have to do this??
Does anyone have any explanations? I was drawn to this particular code
because an automated benchmark started flagging this routine because the
performance decreased so much.
Greg Smith
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: gcc 4.1.1 poor optimization
2007-01-10 23:17 gcc 4.1.1 poor optimization Greg Smith
@ 2007-01-10 23:31 ` Ian Lance Taylor
2007-01-11 0:56 ` Greg Smith
2007-01-18 4:01 ` Greg Smith
0 siblings, 2 replies; 5+ messages in thread
From: Ian Lance Taylor @ 2007-01-10 23:31 UTC (permalink / raw)
To: Greg Smith; +Cc: gcc-help
Greg Smith <gsmith@nc.rr.com> writes:
> [ 7] movl %edx, 24(%esp)
> [ 8] movl %eax, 28(%esp)
> #APP
> bswap %eax
> #NO_APP
This makes it look like the bswap asm clobbered %edx.
Can you post a small complete standalone preprocessed test case? It's
quite difficult to analyze an incomplete one. Thanks.
> Lines 41..45 all deal with trying to figure out if the low-order 2 bits
> of effective_addr2 are zero. All I can say is, WTF? I can get around
> this one by casting effective_addr2 to U32 and then testl/jne is
> emitted, but I shouldn't have to do this??
This is most likely fixed by the lower-subreg patch I have been
working on. Latest version here:
http://gcc.gnu.org/ml/gcc-patches/2006-12/msg01858.html
Ian
* Re: gcc 4.1.1 poor optimization
2007-01-10 23:31 ` Ian Lance Taylor
@ 2007-01-11 0:56 ` Greg Smith
2007-01-18 4:01 ` Greg Smith
1 sibling, 0 replies; 5+ messages in thread
From: Greg Smith @ 2007-01-11 0:56 UTC (permalink / raw)
To: Ian Lance Taylor; +Cc: gcc-help
On Wed, 2007-01-10 at 15:30 -0800, Ian Lance Taylor wrote:
> Greg Smith <gsmith@nc.rr.com> writes:
>
> > [ 7] movl %edx, 24(%esp)
> > [ 8] movl %eax, 28(%esp)
> > #APP
> > bswap %eax
> > #NO_APP
>
> This makes it look like the bswap asm clobbered %edx.
>
> Can you post a small complete standalone preprocessed test case? It's
> quite difficult to analyze an incomplete one. Thanks.
>
Thanks for the quick response!!
The test case below may not be as small as you want, but it does show a
couple (but not all) of the issues.
typedef unsigned char BYTE;
typedef unsigned long U32;
typedef unsigned long long U64;
#define my_bswap_32(x) \
(__extension__ \
({ register unsigned int __v, __x = (x); \
__asm__ ("bswap %0" : "=r" (__v) : "0" (__x)); \
__v; }))
struct REGS {
U64 gr[16];
U64 amask;
void *ip;
BYTE ilc;
};
typedef struct REGS REGS;
int (__attribute__ (( regparm(2) )) z900_load) (BYTE inst[], REGS *regs)
{
int r1;
int b2;
U64 effective_addr2;
U32 temp = my_bswap_32(*(U32*)inst);
r1 = (temp >> 20) & 0xf;
b2 = (temp >> 16) & 0xf;
effective_addr2 = temp & 0xfff;
if (b2) effective_addr2 += regs->gr[b2];
b2 = (temp >> 12) & 0xf;
if (b2) effective_addr2 += regs->gr[b2];
effective_addr2 &= regs->amask;
regs->ip += 4;
regs->ilc = 4;
if ((effective_addr2 & 3) == 0)
return 1;
return 0;
}
compiled with gcc -O3 -fomit-frame-pointer -fPIC -S testload.c
The assembler:
z900_load:
subl $16, %esp
movl %esi, 4(%esp)
movl %edi, 8(%esp)
xorl %edi, %edi
movl %ebp, 12(%esp)
movl (%eax), %ebp
#APP
bswap %ebp
#NO_APP
movl %ebp, %ecx
movl %ebp, %esi
shrl $16, %ecx
andl $4095, %esi
movl %ecx, %eax
andl $15, %eax
[14] movl %edx, (%esp)
je .L2
[16] movl (%esp), %edx
addl (%edx,%eax,8), %esi
adcl 4(%edx,%eax,8), %edi
.L2:
shrl $12, %ebp
movl %ebp, %eax
andl $15, %eax
je .L4
[23] movl (%esp), %ecx
addl (%ecx,%eax,8), %esi
adcl 4(%ecx,%eax,8), %edi
.L4:
[26] movl (%esp), %ecx
[27] xorl %edx, %edx
andl %edi, %edx
movl 128(%ecx), %eax
addl $4, 136(%ecx)
movb $4, 140(%ecx)
movl 8(%esp), %edi
andl $3, %eax
movl 12(%esp), %ebp
andl %esi, %eax
movl 4(%esp), %esi
orl %edx, %eax
sete %al
addl $16, %esp
movzbl %al, %eax
ret
Original %edx is stacked on line 14 and is reloaded on lines 16, 23 and
26, even though it is not clobbered until line 27. Line 27 seems to be
related to the fix you've been working on.
Greg Smith
* Re: gcc 4.1.1 poor optimization
2007-01-10 23:31 ` Ian Lance Taylor
2007-01-11 0:56 ` Greg Smith
@ 2007-01-18 4:01 ` Greg Smith
2007-01-18 4:26 ` Greg Smith
1 sibling, 1 reply; 5+ messages in thread
From: Greg Smith @ 2007-01-18 4:01 UTC (permalink / raw)
To: Ian Lance Taylor; +Cc: gcc-help
On Wed, 2007-01-10 at 15:30 -0800, Ian Lance Taylor wrote:
> > Lines 41..45 all deal with trying to figure out if the low-order 2 bits
> > of effective_addr2 are zero. All I can say is, WTF? I can get around
> > this one by casting effective_addr2 to U32 and then testl/jne is
> > emitted, but I shouldn't have to do this??
>
> This is most likely fixed by the lower-subreg patch I have been
> working on. Latest version here:
>
> http://gcc.gnu.org/ml/gcc-patches/2006-12/msg01858.html
I pulled the source from subversion today. There is no improvement in
the emitted assembler or my benchmarks.
However, with your subreg patch above against today's source, my
benchmarks show a 10% improvement!! The emitted code looks (mostly)
sane, with no reloading of registers whose values are already held in registers.
It seems that once a certain level of complexity is reached, the
quality of the emitted assembler goes down, and your subreg patch is
enough to bring that complexity back below the threshold.
The particular routine, after preprocessing, looks like:
void (__attribute__ ((regparm(2))) z900_load) (BYTE inst[], REGS *regs)
{
int r1;
int b2;
U64 effective_addr2;
U32 temp = fetch_fw(inst);
r1 = (temp >> 20) & 0xf;
b2 = (temp >> 16) & 0xf;
effective_addr2 = temp & 0xfff;
if(b2) effective_addr2 += (regs)->gr[((b2))].D; // U64
b2 = (temp >> 12) & 0xf;
if(b2) effective_addr2 += regs->gr[b2].D; // U64
effective_addr2 &= regs->psw.amask.D; // U64
regs->ip += 4;
regs->psw.ilc = 4;
regs->gr[r1].F.L.F = z900_vfetch4(effective_addr2, b2, regs); // U32
}
fetch_fw is inlined and looks like:
static __inline__ U32 fetch_fw(volatile void *ptr) {
return (__extension__ (
{ register unsigned int __v, __x = (*(U32 *)ptr);
if (__builtin_constant_p (__x))
__v = ((((__x) & 0xff000000) >> 24)
| (((__x) & 0x00ff0000) >> 8)
| (((__x) & 0x0000ff00) << 8)
| (((__x) & 0x000000ff) << 24));
else __asm__ ("bswap %0" : "=r" (__v) : "0" (__x));
__v;
}
));
}
The value is not a constant here, so the bswap instruction is generated.
z900_vfetch4 is also inlined but too complicated to include as yet; the
first thing it does is check if effective_addr2 is on a 4 byte boundary.
So the i686 assembler looks like:
z900_load:
pushl %ebp
movl %edx, %ebp
pushl %edi
xorl %edi, %edi
pushl %esi
subl $80, %esp
movl (%eax), %eax
movl %eax, 36(%esp)
#APP
bswap %eax
#NO_APP
movl %eax, %ecx
movl %eax, 36(%esp)
movl 36(%esp), %eax
shrl $16, %ecx
movl %eax, %esi
movl %ecx, %eax
andl $4095, %esi
andl $15, %eax
je .L5434
addl 80(%edx,%eax,8), %esi
adcl 84(%edx,%eax,8), %edi
.L5434:
movl 36(%esp), %eax
shrl $12, %eax
andl $15, %eax
movl %eax, 52(%esp)
je .L5436
addl 80(%ebp,%eax,8), %esi
adcl 84(%ebp,%eax,8), %edi
.L5436:
movl 32(%ebp), %edx
addl $4, 44(%ebp)
andl %esi, %edx
movl %edx, 56(%esp)
movl 36(%ebp), %eax
andl %edi, %eax
movl %eax, 60(%esp)
movb $4, 42(%ebp)
movl 56(%esp), %edi
testl $3, %edi
jne .L5476
which is a vast improvement over the original assembler I reported
(where %edx or parm 2 was constantly reloaded). I am still disappointed
in the code surrounding fetch_fw:
movl %eax, 36(%esp)
bswap %eax
movl %eax, %ecx
movl %eax, 36(%esp)
movl 36(%esp), %eax
That stores %eax in 36(%esp), byte-swaps it, stores it back into
36(%esp), and then reloads %eax from 36(%esp).
Greg Smith
* Re: gcc 4.1.1 poor optimization
2007-01-18 4:01 ` Greg Smith
@ 2007-01-18 4:26 ` Greg Smith
0 siblings, 0 replies; 5+ messages in thread
From: Greg Smith @ 2007-01-18 4:26 UTC (permalink / raw)
To: Ian Lance Taylor; +Cc: gcc-help
> movl %edx, 56(%esp)
> movl 36(%ebp), %eax
> andl %edi, %eax
> movl %eax, 60(%esp)
> movb $4, 42(%ebp)
> movl 56(%esp), %edi
> testl $3, %edi
Oh, here's one I missed: %edx is stored into 56(%esp), then %edi is
loaded from 56(%esp) for the testl, even though %edx is still unchanged.
Greg Smith
end of thread, other threads:[~2007-01-18 4:26 UTC | newest]
Thread overview: 5+ messages
2007-01-10 23:17 gcc 4.1.1 poor optimization Greg Smith
2007-01-10 23:31 ` Ian Lance Taylor
2007-01-11 0:56 ` Greg Smith
2007-01-18 4:01 ` Greg Smith
2007-01-18 4:26 ` Greg Smith