From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (qmail 1341 invoked by alias); 31 Jan 2013 10:43:34 -0000
Received: (qmail 1304 invoked by uid 48); 31 Jan 2013 10:43:18 -0000
From: "jtaylor.debian at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug c/56160] New: unnecessary additions in loop [x86, x86_64]
Date: Thu, 31 Jan 2013 10:43:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: c
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: jtaylor.debian at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields:
Message-ID:
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2013-01/txt/msg02859.txt.bz2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56160

             Bug #: 56160
            Summary: unnecessary additions in loop [x86, x86_64]
     Classification: Unclassified
            Product: gcc
            Version: 4.4.1
             Status: UNCONFIRMED
           Severity: normal
           Priority: P3
          Component: c
         AssignedTo: unassigned@gcc.gnu.org
         ReportedBy: jtaylor.debian@gmail.com


The attached code, which does complex float multiplication using SSE3,
produces four unnecessary integer additions if the NaN fallback function
comp_mult is inlined.

The assembly for the loop, generated with -msse3 -O3 -std=c99 by gcc 4.4,
4.6, 4.7 and 4.8 (svn 195604), looks like this:

  28:   0f 28 0e                movaps   (%esi),%xmm1
  2b:   f3 0f 12 c1             movsldup %xmm1,%xmm0
  2f:   8b 55 08                mov      0x8(%ebp),%edx
  32:   0f 28 13                movaps   (%ebx),%xmm2
  35:   f3 0f 16 c9             movshdup %xmm1,%xmm1
  39:   0f 59 c2                mulps    %xmm2,%xmm0
  3c:   0f c6 d2 b1             shufps   $0xb1,%xmm2,%xmm2
  40:   0f 59 ca                mulps    %xmm2,%xmm1
  43:   f2 0f d0 c1             addsubps %xmm1,%xmm0
  47:   0f 29 04 fa             movaps   %xmm0,(%edx,%edi,8)
  4b:   0f c2 c0 04             cmpneqps %xmm0,%xmm0
  4f:   0f 50 c0                movmskps %xmm0,%eax
  52:   85 c0                   test     %eax,%eax
  54:   75 1d                   jne      73   // inlined comp_mult
  56:   83 c7 02                add      $0x2,%edi
  59:   83 c6 10                add      $0x10,%esi
  5c:   83 c3 10                add      $0x10,%ebx
  5f:   83 c1 10                add      $0x10,%ecx
  62:   83 45 e4 10             addl     $0x10,-0x1c(%ebp)
  66:   39 7d 14                cmp      %edi,0x14(%ebp)
  69:   7f bd                   jg       28
  ...

The four adds for esi, ebx, ecx and the stack slot at -0x1c(%ebp) are
completely unnecessary and reduce performance by about 20% on my Core 2 Duo.
On amd64 it also creates two seemingly unnecessary additions, but I did not
test the performance there.

A way to coax gcc into emitting proper code is to not allow it to inline the
fallback; it then generates the following good assembly with only one integer
add:

  a8:   0f 28 0c df             movaps   (%edi,%ebx,8),%xmm1
  ac:   f3 0f 12 c1             movsldup %xmm1,%xmm0
  b0:   8b 45 08                mov      0x8(%ebp),%eax
  b3:   0f 28 14 de             movaps   (%esi,%ebx,8),%xmm2
  b7:   f3 0f 16 c9             movshdup %xmm1,%xmm1
  bb:   0f 59 c2                mulps    %xmm2,%xmm0
  be:   0f c6 d2 b1             shufps   $0xb1,%xmm2,%xmm2
  c2:   0f 59 ca                mulps    %xmm2,%xmm1
  c5:   f2 0f d0 c1             addsubps %xmm1,%xmm0
  c9:   0f 29 04 d8             movaps   %xmm0,(%eax,%ebx,8)
  cd:   0f c2 c0 04             cmpneqps %xmm0,%xmm0
  d1:   0f 50 c0                movmskps %xmm0,%eax
  d4:   85 c0                   test     %eax,%eax
  d6:   75 10                   jne      e8   // non-inlined comp_mult
  d8:   83 c3 02                add      $0x2,%ebx
  db:   39 5d 14                cmp      %ebx,0x14(%ebp)
  de:   7f c8                   jg       a8
  ...
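
For reference, below is a minimal sketch of the kind of loop that produces
code like the above.  The actual attachment is not reproduced here, so the
function and variable names, the body of the scalar fallback and the noinline
hint are assumptions for illustration only; the intrinsics, however, map
directly onto the movsldup/movshdup/shufps/addsubps/cmpneqps sequence in the
dumps.

/* Sketch only, not the attached testcase.  Assumes n is even and that the
   arrays are 16-byte aligned (movaps in the dumps above). */
#include <pmmintrin.h>   /* SSE3: _mm_moveldup_ps, _mm_movehdup_ps, _mm_addsub_ps */
#include <complex.h>

/* hypothetical scalar fallback, re-run when the vector result contains NaN;
   marking it __attribute__((noinline)) is one way to keep gcc from inlining
   it and get the second, single-add loop shown above */
static void comp_mult(float complex *r, const float complex *a,
                      const float complex *b)
{
    r[0] = a[0] * b[0];
    r[1] = a[1] * b[1];
}

void cmul(float complex *restrict r, const float complex *restrict a,
          const float complex *restrict b, int n)
{
    for (int i = 0; i < n; i += 2) {                /* two complex floats per __m128 */
        __m128 va = _mm_load_ps((const float *)&a[i]);
        __m128 vb = _mm_load_ps((const float *)&b[i]);
        __m128 re = _mm_moveldup_ps(va);            /* a.re duplicated   (movsldup) */
        __m128 im = _mm_movehdup_ps(va);            /* a.im duplicated   (movshdup) */
        __m128 t0 = _mm_mul_ps(re, vb);
        __m128 sw = _mm_shuffle_ps(vb, vb, 0xb1);   /* swap re/im of b   (shufps)   */
        __m128 t1 = _mm_mul_ps(im, sw);
        __m128 rs = _mm_addsub_ps(t0, t1);          /* complex product   (addsubps) */
        _mm_store_ps((float *)&r[i], rs);
        /* NaN check: rs != rs is true only for NaN lanes (cmpneqps + movmskps) */
        if (_mm_movemask_ps(_mm_cmpneq_ps(rs, rs)))
            comp_mult(&r[i], &a[i], &b[i]);
    }
}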