From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 10236 invoked by alias); 16 Oct 2012 14:22:26 -0000
Received: (qmail 10154 invoked by uid 48); 16 Oct 2012 14:22:07 -0000
From: "ysrumyan at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/54939] New: Very poor vectorization of loops with complex arithmetic
Date: Tue, 16 Oct 2012 14:22:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: ysrumyan at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields:
Message-ID:
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Archive:
List-Post:
List-Help:
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2012-10/txt/msg01479.txt.bz2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939

             Bug #: 54939
           Summary: Very poor vectorization of loops with complex arithmetic
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: ysrumyan@gmail.com

While analyzing a performance anomaly in SPEC2000, I found that 168.wupwise
is slower with vectorization than without it on x86. The main problem is that
gcc does not recognize certain idioms of complex addition and multiplication
during loop vectorization. For example, for a simple zaxpy loop icc generates
code that is 1.6X faster than the code gcc generates.
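
For reference, the kind of zaxpy loop discussed here can be sketched in C99
as follows. The message as quoted does not include the source, so the function
name, signature and use of `restrict` below are assumptions made for
illustration and are not taken from 168.wupwise.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical zaxpy kernel: y[i] += a * x[i] over double-precision
   complex vectors. Name and signature are illustrative assumptions. */
void zaxpy_sketch(size_t n, double complex a,
                  const double complex *restrict x,
                  double complex *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];   /* one complex multiply and one complex add */
}
```

Each iteration expands to four real multiplications and four real additions
or subtractions on interleaved (re, im) pairs, which is the pattern a
vectorizer has to recognize.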
Here is the assembly for the zaxpy loop produced by icc:

..B1.4:                         # Preds ..B1.2 ..B1.4
        movups    (%rsi,%rdx), %xmm2                            #7.28
        movups    16(%rsi,%rdx), %xmm5                          #7.28
        movups    (%rsi,%rcx), %xmm4                            #7.17
        movups    16(%rsi,%rcx), %xmm7                          #7.17
        movddup   (%rsi,%rdx), %xmm3                            #7.27
        incq      %r8                                           #6.10
        movddup   16(%rsi,%rdx), %xmm6                          #7.27
        unpckhpd  %xmm2, %xmm2                                  #7.27
        unpckhpd  %xmm5, %xmm5                                  #7.27
        mulpd     %xmm1, %xmm3                                  #7.27
        mulpd     %xmm0, %xmm2                                  #7.27
        mulpd     %xmm1, %xmm6                                  #7.27
        mulpd     %xmm0, %xmm5                                  #7.27
        addsubpd  %xmm2, %xmm3                                  #7.27
        addsubpd  %xmm5, %xmm6                                  #7.27
        addpd     %xmm3, %xmm4                                  #7.9
        addpd     %xmm6, %xmm7                                  #7.9
        movups    %xmm4, (%rsi,%rcx)                            #7.9
        movups    %xmm7, 16(%rsi,%rcx)                          #7.9
        addq      $32, %rsi                                     #6.10
        cmpq      %rdi, %r8                                     #6.10
        jb        ..B1.4        # Prob 64%                      #6.10

(I got it with the -xSSE4.2 -O3 options.)

For the gcc compiler, the following options were used:
-m64 -mfpmath=sse -march=corei7 -O3 -ffast-math.
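
As a reading aid, the per-element idiom in the icc loop above can be modeled
in portable C. This is only a sketch of how the instruction sequence appears
to compute a complex multiply-add: the `v2d` type, the function names, and the
assumption that %xmm1 holds (re(a), im(a)) while %xmm0 holds the swapped pair
(im(a), re(a)) are inferred from the listing and are not stated in the report.
The icc loop handles two complex elements per iteration; the model handles one.

```c
#include <stddef.h>

/* Two-lane stand-in for one 128-bit SSE register holding two doubles. */
typedef struct { double lo, hi; } v2d;

/* Models addsubpd: subtract in the low lane, add in the high lane. */
static v2d addsub(v2d a, v2d b)
{
    return (v2d){ a.lo - b.lo, a.hi + b.hi };
}

/* y[i] += a * x[i], with x and y stored as interleaved (re, im) pairs. */
void zaxpy_addsub_model(size_t n, double ar, double ai,
                        const double *x, double *y)
{
    const v2d a_ri = { ar, ai };   /* assumed contents of %xmm1 */
    const v2d a_ir = { ai, ar };   /* assumed contents of %xmm0 */

    for (size_t i = 0; i < n; i++) {
        double xr = x[2 * i], xi = x[2 * i + 1];
        v2d re_dup = { xr, xr };   /* movddup: duplicate the real part */
        v2d im_dup = { xi, xi };   /* unpckhpd reg,reg: duplicate the imaginary part */
        v2d p = { re_dup.lo * a_ri.lo, re_dup.hi * a_ri.hi };   /* mulpd */
        v2d q = { im_dup.lo * a_ir.lo, im_dup.hi * a_ir.hi };   /* mulpd */
        v2d prod = addsub(p, q);   /* (xr*ar - xi*ai, xr*ai + xi*ar) */
        y[2 * i]     += prod.lo;   /* addpd, low lane */
        y[2 * i + 1] += prod.hi;   /* addpd, high lane */
    }
}
```

The point of the idiom is that one complex product costs two packed
multiplies and one addsubpd, with no separate shuffle of the result.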