public inbox for gcc-help@gcc.gnu.org
* gcc 4.1.1 poor optimization
@ 2007-01-10 23:17 Greg Smith
  2007-01-10 23:31 ` Ian Lance Taylor
  0 siblings, 1 reply; 5+ messages in thread
From: Greg Smith @ 2007-01-10 23:17 UTC (permalink / raw)
  To: gcc-help

The snippet of code below is part of a much larger module and was
compiled on an FC6 system with gcc 4.1.1 20061011 on linux, kernel
2.6.18-1.2869.

To say that we were disappointed with the emitted assembler would be an
understatement.

Compile options were -O3 -fomit-frame-pointer -march=i686 -fPIC

void (__attribute__ ((regparm(2))) z900_load) (BYTE inst[], REGS *regs)
{
int r1;
int b2;
U64 effective_addr2;

    U32 temp = bswap_32(*(U32*)inst);
    r1 = (temp >> 20) & 0xf;
    b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs->gr[b2].D; // U64
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs->gr[b2].D; // U64
    effective_addr2 &= regs->psw.amask.D;      // U64
    regs->ip += 4;
    regs->ilc = 4;
    if ((effective_addr2 & 3) == 0)
    .  .  .  .

The assembler is below with noted lines:

z900_load:
        pushl   %ebp
        pushl   %edi
        xorl    %edi, %edi
        pushl   %esi
        subl    $96, %esp
        movl    (%eax), %eax
[ 7]    movl    %edx, 24(%esp)
[ 8]    movl    %eax, 28(%esp)
#APP
        bswap %eax
#NO_APP
        movl    %eax, %ecx
[11]    movl    %eax, 28(%esp)
[12]    movl    28(%esp), %eax
        shrl    $16, %ecx
        movl    %eax, %esi
        movl    %ecx, %eax
        andl    $4095, %esi
        andl    $15, %eax
        je      .L13528
[19]    movl    24(%esp), %edx
        addl    80(%edx,%eax,8), %esi
        adcl    84(%edx,%eax,8), %edi
.L13528:
        movl    28(%esp), %eax
        shrl    $12, %eax
        andl    $15, %eax
        movl    %eax, 76(%esp)
        je      .L13530
        movl    %eax, %ecx
[28]    movl    24(%esp), %eax
        addl    80(%eax,%ecx,8), %esi
        adcl    84(%eax,%ecx,8), %edi
.L13530:
[31]    movl    24(%esp), %edx
        movl    32(%edx), %ecx
        addl    $4, 44(%edx)
        andl    %esi, %ecx
        movl    %ecx, 64(%esp)
        movl    36(%edx), %eax
        andl    %edi, %eax
        movl    %eax, 68(%esp)
        movb    $4, 42(%edx)
        movl    64(%esp), %eax
[41]    xorl    %edx, %edx
 .      movl    %edx, %ecx
 .      andl    $3, %eax
 .      orl     %eax, %ecx
[45]    jne     .L13602

On entry, %eax points to inst, %edx points to REGS.

Variable regs (%edx) is stacked on line 7 and is reloaded from the stack
in lines 19, 28 and 31 despite %edx not being clobbered until line 41.

The 4 byte value pointed to by inst (%eax) is loaded into %eax and then
stacked before bswap (line 8), then stacked again after bswap (line 11).
To add insult to injury, line 12 reloads %eax from the stack.

Lines 41..45 all deal with trying to figure out if the low-order 2 bits
of effective_addr2 are zero.  All I can say is, WTF?  I can get around
this one by casting effective_addr2 to U32 and then testl/jne is
emitted, but I shouldn't have to do this??

Does anyone have an explanation?  I was drawn to this particular code
because an automated benchmark flagged this routine after its
performance dropped so sharply.

Greg Smith

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: gcc 4.1.1 poor optimization
  2007-01-10 23:17 gcc 4.1.1 poor optimization Greg Smith
@ 2007-01-10 23:31 ` Ian Lance Taylor
  2007-01-11  0:56   ` Greg Smith
  2007-01-18  4:01   ` Greg Smith
  0 siblings, 2 replies; 5+ messages in thread
From: Ian Lance Taylor @ 2007-01-10 23:31 UTC (permalink / raw)
  To: Greg Smith; +Cc: gcc-help

Greg Smith <gsmith@nc.rr.com> writes:

> [ 7]    movl    %edx, 24(%esp)
> [ 8]    movl    %eax, 28(%esp)
> #APP
>         bswap %eax
> #NO_APP

This makes it look like the bswap asm clobbered %edx.

Can you post a small complete standalone preprocessed test case?  It's
quite difficult to analyze an incomplete one.  Thanks.

> Lines 41..45 all deal with trying to figure out if the low-order 2 bits
> of effective_addr2 are zero.  All I can say is, WTF?  I can get around
> this one by casting effective_addr2 to U32 and then testl/jne is
> emitted, but I shouldn't have to do this??

This is most likely fixed by the lower-subreg patch I have been
working on.  Latest version here:

http://gcc.gnu.org/ml/gcc-patches/2006-12/msg01858.html

Ian


* Re: gcc 4.1.1 poor optimization
  2007-01-10 23:31 ` Ian Lance Taylor
@ 2007-01-11  0:56   ` Greg Smith
  2007-01-18  4:01   ` Greg Smith
  1 sibling, 0 replies; 5+ messages in thread
From: Greg Smith @ 2007-01-11  0:56 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: gcc-help

On Wed, 2007-01-10 at 15:30 -0800, Ian Lance Taylor wrote:
> Greg Smith <gsmith@nc.rr.com> writes:
> 
> > [ 7]    movl    %edx, 24(%esp)
> > [ 8]    movl    %eax, 28(%esp)
> > #APP
> >         bswap %eax
> > #NO_APP
> 
> This makes it look like the bswap asm clobbered %edx.
> 
> Can you post a small complete standalone preprocessed test case?  It's
> quite difficult to analyze an incomplete one.  Thanks.
> 

Thanks for the quick response!!

The test case below may not be as small as you want, but it does show a
couple (but not all) of the issues.

typedef unsigned char BYTE;
typedef unsigned long U32;
typedef unsigned long long U64;

#define my_bswap_32(x) \
 (__extension__ \
  ({ register unsigned int __v, __x = (x); \
   __asm__ ("bswap %0" : "=r" (__v) : "0" (__x)); \
   __v; }))

struct REGS {
 U64   gr[16];
 U64   amask;
 void *ip;
 BYTE  ilc;
};
typedef struct REGS REGS;

int (__attribute__ (( regparm(2) )) z900_load) (BYTE inst[], REGS *regs)
{
int r1;
int b2;
U64 effective_addr2;

    U32 temp = my_bswap_32(*(U32*)inst);
    r1 = (temp >> 20) & 0xf;
    b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs->gr[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs->gr[b2];
    effective_addr2 &= regs->amask;
    regs->ip += 4;
    regs->ilc = 4;
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

compiled with gcc -O3 -fomit-frame-pointer -fPIC -S testload.c

The assembler:

z900_load:
        subl    $16, %esp
        movl    %esi, 4(%esp)
        movl    %edi, 8(%esp)
        xorl    %edi, %edi
        movl    %ebp, 12(%esp)
        movl    (%eax), %ebp
#APP
        bswap %ebp
#NO_APP
        movl    %ebp, %ecx
        movl    %ebp, %esi
        shrl    $16, %ecx
        andl    $4095, %esi
        movl    %ecx, %eax
        andl    $15, %eax
[14]    movl    %edx, (%esp)
        je      .L2
[16]    movl    (%esp), %edx
        addl    (%edx,%eax,8), %esi
        adcl    4(%edx,%eax,8), %edi
.L2:
        shrl    $12, %ebp
        movl    %ebp, %eax
        andl    $15, %eax
        je      .L4
[23]    movl    (%esp), %ecx
        addl    (%ecx,%eax,8), %esi
        adcl    4(%ecx,%eax,8), %edi
.L4:
[26]    movl    (%esp), %ecx
[27]    xorl    %edx, %edx
        andl    %edi, %edx
        movl    128(%ecx), %eax
        addl    $4, 136(%ecx)
        movb    $4, 140(%ecx)
        movl    8(%esp), %edi
        andl    $3, %eax
        movl    12(%esp), %ebp
        andl    %esi, %eax
        movl    4(%esp), %esi
        orl     %edx, %eax
        sete    %al
        addl    $16, %esp
        movzbl  %al, %eax
        ret

Original %edx is stacked on line 14 and is reloaded on lines 16, 23 and
26 until it's clobbered in line 27.  Line 27 seems to be related to the
fix you've been working on.

Greg Smith


* Re: gcc 4.1.1 poor optimization
  2007-01-10 23:31 ` Ian Lance Taylor
  2007-01-11  0:56   ` Greg Smith
@ 2007-01-18  4:01   ` Greg Smith
  2007-01-18  4:26     ` Greg Smith
  1 sibling, 1 reply; 5+ messages in thread
From: Greg Smith @ 2007-01-18  4:01 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: gcc-help

On Wed, 2007-01-10 at 15:30 -0800, Ian Lance Taylor wrote:

> > Lines 41..45 all deal with trying to figure out if the low-order 2 bits
> > of effective_addr2 are zero.  All I can say is, WTF?  I can get around
> > this one by casting effective_addr2 to U32 and then testl/jne is
> > emitted, but I shouldn't have to do this??
> 
> This is most likely fixed by the lower-subreg patch I have been
> working on.  Latest version here:
> 
> http://gcc.gnu.org/ml/gcc-patches/2006-12/msg01858.html

I pulled the source from subversion today.  There is no improvement in
the emitted assembler or my benchmarks.

However, with your subreg patch above applied against today's source,
my benchmarks show a 10% improvement!!  The emitted code looks
(mostly) sane, with no reloading of registers whose values are already
live in registers.

It seems that once a certain level of complexity is reached, the
quality of the emitted assembler goes down; your subreg patch is
enough to bring this routine back below that threshold.

The particular routine, after preprocessing, looks like:

void (__attribute__ ((regparm(2))) z900_load) (BYTE inst[], REGS *regs)
{
int r1;
int b2;
U64 effective_addr2;
    U32 temp = fetch_fw(inst); 
    r1 = (temp >> 20) & 0xf;
    b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if(b2) effective_addr2 += (regs)->gr[((b2))].D; // U64
    b2 = (temp >> 12) & 0xf;
    if(b2) effective_addr2 += regs->gr[b2].D; // U64
    effective_addr2 &= regs->psw.amask.D; // U64
    regs->ip += 4;
    regs->psw.ilc = 4;
    regs->gr[r1].F.L.F = z900_vfetch4(effective_addr2, b2, regs); // U32
}

fetch_fw is inlined and looks like:

static __inline__ U32 fetch_fw(volatile void *ptr)
{
    return (__extension__ ({
        register unsigned int __v, __x = (*(U32 *)ptr);
        if (__builtin_constant_p (__x))
            __v = ((((__x) & 0xff000000) >> 24)
                |  (((__x) & 0x00ff0000) >>  8)
                |  (((__x) & 0x0000ff00) <<  8)
                |  (((__x) & 0x000000ff) << 24));
        else
            __asm__ ("bswap %0" : "=r" (__v) : "0" (__x));
        __v;
    }));
}

The value is not a constant here so bswap is generated.

z900_vfetch4 is also inlined but too complicated to include here; the
first thing it does is check whether effective_addr2 is on a 4-byte
boundary.

So the i686 assembler looks like:

z900_load:
        pushl   %ebp
        movl    %edx, %ebp
        pushl   %edi
        xorl    %edi, %edi
        pushl   %esi
        subl    $80, %esp
        movl    (%eax), %eax
        movl    %eax, 36(%esp)
#APP
        bswap %eax
#NO_APP
        movl    %eax, %ecx
        movl    %eax, 36(%esp)
        movl    36(%esp), %eax
        shrl    $16, %ecx
        movl    %eax, %esi
        movl    %ecx, %eax
        andl    $4095, %esi
        andl    $15, %eax
        je      .L5434
        addl    80(%edx,%eax,8), %esi
        adcl    84(%edx,%eax,8), %edi
.L5434:
        movl    36(%esp), %eax
        shrl    $12, %eax
        andl    $15, %eax
        movl    %eax, 52(%esp)
        je      .L5436
        addl    80(%ebp,%eax,8), %esi
        adcl    84(%ebp,%eax,8), %edi
.L5436:
        movl    32(%ebp), %edx
        addl    $4, 44(%ebp)
        andl    %esi, %edx
        movl    %edx, 56(%esp)
        movl    36(%ebp), %eax
        andl    %edi, %eax
        movl    %eax, 60(%esp)
        movb    $4, 42(%ebp)
        movl    56(%esp), %edi
        testl   $3, %edi
        jne     .L5476

which is a vast improvement over the original assembler I reported
(where %edx or parm 2 was constantly reloaded).  I am still disappointed
in the code surrounding fetch_fw:

        movl    %eax, 36(%esp)
        bswap %eax
        movl    %eax, %ecx
        movl    %eax, 36(%esp)
        movl    36(%esp), %eax

That stores %eax into 36(%esp), swaps it, stores it back into
36(%esp), and then reloads %eax from the very slot it was just stored
to.

Greg Smith


* Re: gcc 4.1.1 poor optimization
  2007-01-18  4:01   ` Greg Smith
@ 2007-01-18  4:26     ` Greg Smith
  0 siblings, 0 replies; 5+ messages in thread
From: Greg Smith @ 2007-01-18  4:26 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: gcc-help

>         movl    %edx, 56(%esp)
>         movl    36(%ebp), %eax
>         andl    %edi, %eax
>         movl    %eax, 60(%esp)
>         movb    $4, 42(%ebp)
>         movl    56(%esp), %edi
>         testl   $3, %edi 

Oh, here's one I missed.  %edx is stored into 56(%esp), then %edi is
loaded from 56(%esp) for the testl, even though %edx is still
unchanged at that point.

Greg Smith


