From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-403803-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 10236 invoked by alias); 16 Oct 2012 14:22:26 -0000
Received: (qmail 10154 invoked by uid 48); 16 Oct 2012 14:22:07 -0000
From: "ysrumyan at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/54939] New: Very poor vectorization of loops with complex arithmetic
Date: Tue, 16 Oct 2012 14:22:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: ysrumyan at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields:
Message-ID: <bug-54939-4@http.gcc.gnu.org/bugzilla/>
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2012-10/txt/msg01479.txt.bz2


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939

             Bug #: 54939
           Summary: Very poor vectorization of loops with complex
                    arithmetic
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: ysrumyan@gmail.com


While analyzing a performance anomaly in spec2000, I found that 168.wupwise
is slower with vectorization than without it on x86. The main problem is that
gcc does not recognize some special idioms of complex addition and
multiplication during loop vectorization. For example, for a simple zaxpy
loop icc generates code that is 1.6X faster than the code gcc generates. Here
is the assembly for the zaxpy loop produced by icc:

..B1.4:                         # Preds ..B1.2 ..B1.4
        movups    (%rsi,%rdx), %xmm2                            #7.28
        movups    16(%rsi,%rdx), %xmm5                          #7.28
        movups    (%rsi,%rcx), %xmm4                            #7.17
        movups    16(%rsi,%rcx), %xmm7                          #7.17
        movddup   (%rsi,%rdx), %xmm3                            #7.27
        incq      %r8                                           #6.10
        movddup   16(%rsi,%rdx), %xmm6                          #7.27
        unpckhpd  %xmm2, %xmm2                                  #7.27
        unpckhpd  %xmm5, %xmm5                                  #7.27
        mulpd     %xmm1, %xmm3                                  #7.27
        mulpd     %xmm0, %xmm2                                  #7.27
        mulpd     %xmm1, %xmm6                                  #7.27
        mulpd     %xmm0, %xmm5                                  #7.27
        addsubpd  %xmm2, %xmm3                                  #7.27
        addsubpd  %xmm5, %xmm6                                  #7.27
        addpd     %xmm3, %xmm4                                  #7.9
        addpd     %xmm6, %xmm7                                  #7.9
        movups    %xmm4, (%rsi,%rcx)                            #7.9
        movups    %xmm7, 16(%rsi,%rcx)                          #7.9
        addq      $32, %rsi                                     #6.10
        cmpq      %rdi, %r8                                     #6.10
        jb        ..B1.4        # Prob 64%                      #6.10
(I got it with the -xSSE4.2 -O3 options.) For the gcc compiler the following
options were used: -m64 -mfpmath=sse -march=corei7 -O3 -ffast-math.
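
For readers without the benchmark source, here is a minimal C sketch of a
zaxpy loop (y[i] += a * x[i] on double-precision complex values). It is not
the test case from 168.wupwise, which is Fortran; the function names
zaxpy_ref and zaxpy_addsub are hypothetical. The second function spells out
in scalar form the sequence that the icc listing above appears to use. That
reading assumes %xmm1 holds (re(a), im(a)) and %xmm0 holds (im(a), re(a)),
which the listing does not show because the registers are set up outside the
loop.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Reference zaxpy: y[i] += a * x[i] on double-precision complex values. */
void zaxpy_ref(size_t n, double complex a,
               const double complex *x, double complex *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/*
 * Same computation on interleaved (re, im) doubles, written step by step.
 * The comments name the instruction each step is assumed to correspond to
 * in the icc listing; the mapping is an interpretation, not compiler output.
 */
void zaxpy_addsub(size_t n, double ar, double ai,
                  const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        double xr = x[2 * i];       /* real part of x[i] */
        double xi = x[2 * i + 1];   /* imaginary part of x[i] */

        /* movddup + mulpd: (xr, xr) * (ar, ai) */
        double p0 = xr * ar;
        double p1 = xr * ai;

        /* unpckhpd + mulpd: (xi, xi) * (ai, ar) */
        double q0 = xi * ai;
        double q1 = xi * ar;

        /* addsubpd: subtract in the low lane, add in the high lane */
        double r0 = p0 - q0;        /* re(a * x[i]) */
        double r1 = p1 + q1;        /* im(a * x[i]) */

        /* addpd + store: accumulate into y[i] */
        y[2 * i]     += r0;
        y[2 * i + 1] += r1;
    }
}
```

Each complex multiply needs one subtraction and one addition on adjacent
lanes, which is exactly what addsubpd provides in a single instruction.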