From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: gcc 4.1.1 poor optimization
From: Greg Smith
To: gcc-help@gcc.gnu.org
Date: Wed, 10 Jan 2007 23:17:00 -0000
Message-Id: <1168471038.9980.31.camel@localhost.localdomain>
Mailing-List: contact gcc-help-help@gcc.gnu.org; run by ezmlm
X-SW-Source: 2007-01/txt/msg00102.txt.bz2

The snippet of code below is part of a much larger module. It was compiled
on an FC6 system with gcc 4.1.1 20061011 on Linux, kernel 2.6.18-1.2869.
To say that we were disappointed with the emitted assembler would be an
understatement. The compile options were:

  -O3 -fomit-frame-pointer -march=i686 -fPIC

void (__attribute__ ((regparm(2))) z900_load) (BYTE inst[], REGS *regs)
{
    int r1;
    int b2;
    U64 effective_addr2;
    U32 temp = bswap_32(*(U32*)inst);

    r1 = (temp >> 20) & 0xf;
    b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2)
        effective_addr2 += regs->gr[b2].D;   // U64

    b2 = (temp >> 12) & 0xf;
    if (b2)
        effective_addr2 += regs->gr[b2].D;   // U64

    effective_addr2 &= regs->psw.amask.D;    // U64

    regs->ip += 4;
    regs->ilc = 4;
    if ((effective_addr2 & 3) == 0)
    . . . .
The assembler is below with noted lines:

z900_load:
        pushl   %ebp
        pushl   %edi
        xorl    %edi, %edi
        pushl   %esi
        subl    $96, %esp
        movl    (%eax), %eax
[ 7]    movl    %edx, 24(%esp)
[ 8]    movl    %eax, 28(%esp)
#APP
        bswap   %eax
#NO_APP
        movl    %eax, %ecx
[11]    movl    %eax, 28(%esp)
[12]    movl    28(%esp), %eax
        shrl    $16, %ecx
        movl    %eax, %esi
        movl    %ecx, %eax
        andl    $4095, %esi
        andl    $15, %eax
        je      .L13528
[19]    movl    24(%esp), %edx
        addl    80(%edx,%eax,8), %esi
        adcl    84(%edx,%eax,8), %edi
.L13528:
        movl    28(%esp), %eax
        shrl    $12, %eax
        andl    $15, %eax
        movl    %eax, 76(%esp)
        je      .L13530
        movl    %eax, %ecx
[28]    movl    24(%esp), %eax
        addl    80(%eax,%ecx,8), %esi
        adcl    84(%eax,%ecx,8), %edi
.L13530:
[31]    movl    24(%esp), %edx
        movl    32(%edx), %ecx
        addl    $4, 44(%edx)
        andl    %esi, %ecx
        movl    %ecx, 64(%esp)
        movl    36(%edx), %eax
        andl    %edi, %eax
        movl    %eax, 68(%esp)
        movb    $4, 42(%edx)
        movl    64(%esp), %eax
[41]    xorl    %edx, %edx
   .    movl    %edx, %ecx
   .    andl    $3, %eax
   .    orl     %eax, %ecx
[45]    jne     .L13602

On entry, %eax points to inst and %edx points to the REGS structure. The
regs pointer (%edx) is stored to the stack on line 7 and reloaded from the
stack on lines 19, 28 and 31, despite %edx not being clobbered until line
41. The 4-byte value pointed to by inst (%eax) is loaded into %eax and
stored to the stack before the bswap (line 8), then stored again after the
bswap (line 11). To add insult to injury, line 12 immediately reloads %eax
from the very slot just written. Lines 41..45 are all spent deciding
whether the low-order 2 bits of effective_addr2 are zero. All I can say
is, WTF? I can get around that one by casting effective_addr2 to U32,
which makes gcc emit a single testl/jne, but I shouldn't have to do that.
Does anyone have an explanation?

I was drawn to this particular routine because an automated benchmark
started flagging it after its performance dropped so much.

Greg Smith