[Bug tree-optimization/54939] New: Very poor vectorization of loops with complex arithmetic

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/54939] New: Very poor vectorization of loops with complex arithmetic
@ 2012-10-16 14:22 ysrumyan at gmail dot com
  2012-10-16 14:37 ` [Bug tree-optimization/54939] " rguenth at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: ysrumyan at gmail dot com @ 2012-10-16 14:22 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939

             Bug #: 54939
           Summary: Very poor vectorization of loops with complex
                    arithmetic
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: ysrumyan@gmail.com


While analyzing a performance anomaly in SPEC2000, I found that 168.wupwise
is slower with vectorization than without it on x86. The main problem is that
gcc does not recognize some special idioms of complex addition and
multiplication during loop vectorization. For example, for a simple zaxpy
loop icc generates code that is 1.6X faster than gcc's. Here is the assembly
for the zaxpy loop produced by icc:

..B1.4:                         # Preds ..B1.2 ..B1.4
        movups    (%rsi,%rdx), %xmm2                            #7.28
        movups    16(%rsi,%rdx), %xmm5                          #7.28
        movups    (%rsi,%rcx), %xmm4                            #7.17
        movups    16(%rsi,%rcx), %xmm7                          #7.17
        movddup   (%rsi,%rdx), %xmm3                            #7.27
        incq      %r8                                           #6.10
        movddup   16(%rsi,%rdx), %xmm6                          #7.27
        unpckhpd  %xmm2, %xmm2                                  #7.27
        unpckhpd  %xmm5, %xmm5                                  #7.27
        mulpd     %xmm1, %xmm3                                  #7.27
        mulpd     %xmm0, %xmm2                                  #7.27
        mulpd     %xmm1, %xmm6                                  #7.27
        mulpd     %xmm0, %xmm5                                  #7.27
        addsubpd  %xmm2, %xmm3                                  #7.27
        addsubpd  %xmm5, %xmm6                                  #7.27
        addpd     %xmm3, %xmm4                                  #7.9
        addpd     %xmm6, %xmm7                                  #7.9
        movups    %xmm4, (%rsi,%rcx)                            #7.9
        movups    %xmm7, 16(%rsi,%rcx)                          #7.9
        addq      $32, %rsi                                     #6.10
        cmpq      %rdi, %r8                                     #6.10
        jb        ..B1.4        # Prob 64%                      #6.10
(I got it with the -xSSE4.2 -O3 options.) For the gcc compiler the following
options were used: -m64 -mfpmath=sse -march=corei7 -O3 -ffast-math.
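
For orientation, the kind of loop being discussed can be sketched in C as
below. This is only an assumed shape of a unit-stride zaxpy (y += a*x on
double-precision complex vectors); the reproducer attached to the bug is not
shown in this thread and may differ. The function name and signature are
illustrative, not taken from the report.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical zaxpy kernel: y[i] += a * x[i] for n complex doubles.
   restrict tells the compiler that x and y do not alias, which is an
   assumption of this sketch rather than something stated in the report. */
void zaxpy(size_t n, double complex a,
           const double complex *restrict x,
           double complex *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Compiling a file containing such a loop with the gcc options quoted above and
inspecting the generated assembly (for example with -S) should show whether
the addsubpd idiom is used.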



* [Bug tree-optimization/54939] Very poor vectorization of loops with complex arithmetic
  2012-10-16 14:22 [Bug tree-optimization/54939] New: Very poor vectorization of loops with complex arithmetic ysrumyan at gmail dot com
@ 2012-10-16 14:37 ` rguenth at gcc dot gnu.org
  2012-10-16 14:55 ` ysrumyan at gmail dot com
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-10-16 14:37 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2012-10-16
             Blocks|                            |53947
     Ever Confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> 2012-10-16 14:36:52 UTC ---
Can you please post the zaxpy source here?  Also, please see the list of bugs
referenced from PR53947; there is likely a duplicate of this issue.



* [Bug tree-optimization/54939] Very poor vectorization of loops with complex arithmetic
  2012-10-16 14:22 [Bug tree-optimization/54939] New: Very poor vectorization of loops with complex arithmetic ysrumyan at gmail dot com
  2012-10-16 14:37 ` [Bug tree-optimization/54939] " rguenth at gcc dot gnu.org
@ 2012-10-16 14:55 ` ysrumyan at gmail dot com
  2012-10-16 15:06 ` ysrumyan at gmail dot com
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: ysrumyan at gmail dot com @ 2012-10-16 14:55 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939

--- Comment #2 from Yuri Rumyantsev <ysrumyan at gmail dot com> 2012-10-16 14:54:50 UTC ---
Created attachment 28455
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28455
test reproducer



* [Bug tree-optimization/54939] Very poor vectorization of loops with complex arithmetic
  2012-10-16 14:22 [Bug tree-optimization/54939] New: Very poor vectorization of loops with complex arithmetic ysrumyan at gmail dot com
  2012-10-16 14:37 ` [Bug tree-optimization/54939] " rguenth at gcc dot gnu.org
  2012-10-16 14:55 ` ysrumyan at gmail dot com
@ 2012-10-16 15:06 ` ysrumyan at gmail dot com
  2012-10-16 15:32 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: ysrumyan at gmail dot com @ 2012-10-16 15:06 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939

--- Comment #3 from Yuri Rumyantsev <ysrumyan at gmail dot com> 2012-10-16 15:06:19 UTC ---
I looked through the list of all issues related to vectorization but could not
find a duplicate.



* [Bug tree-optimization/54939] Very poor vectorization of loops with complex arithmetic
  2012-10-16 14:22 [Bug tree-optimization/54939] New: Very poor vectorization of loops with complex arithmetic ysrumyan at gmail dot com
                   ` (2 preceding siblings ...)
  2012-10-16 15:06 ` ysrumyan at gmail dot com
@ 2012-10-16 15:32 ` rguenth at gcc dot gnu.org
  2013-03-27 11:19 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-10-16 15:32 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> 2012-10-16 15:31:52 UTC ---
Thanks.



* [Bug tree-optimization/54939] Very poor vectorization of loops with complex arithmetic
  2012-10-16 14:22 [Bug tree-optimization/54939] New: Very poor vectorization of loops with complex arithmetic ysrumyan at gmail dot com
                   ` (3 preceding siblings ...)
  2012-10-16 15:32 ` rguenth at gcc dot gnu.org
@ 2013-03-27 11:19 ` rguenth at gcc dot gnu.org
  2023-07-21 12:28 ` rguenth at gcc dot gnu.org
  2023-07-21 12:31 ` rguenth at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2013-03-27 11:19 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
             Blocks|                            |37021
         AssignedTo|unassigned at gcc dot       |rguenth at gcc dot gnu.org
                   |gnu.org                     |

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> 2013-03-27 11:19:49 UTC ---
Confirmed.  GCC vectorizes this using hybrid SLP: it unrolls the loop once
to be able to vectorize the two subtractions and two additions resulting
from the complex multiplication.
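
To make the operation count concrete, here is a hedged scalar expansion of
the complex multiply-add on interleaved (re, im) doubles. Each complex
element contributes one subtraction (real part) and one addition (imaginary
part), so two elements, i.e. the loop unrolled once, give the two
subtractions and two additions mentioned above. The function name and the
interleaved layout are assumptions made for this illustration.

```c
#include <stddef.h>

/* Scalar expansion of y[i] += a * x[i], with a = ar + i*ai.
   x and y each hold 2*n doubles laid out as re0, im0, re1, im1, ...
   (assumed layout for this sketch). */
void zaxpy_expanded(size_t n, double ar, double ai,
                    const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++) {
        double xr = x[2 * i];
        double xi = x[2 * i + 1];
        y[2 * i]     += ar * xr - ai * xi;  /* real part: one subtraction */
        y[2 * i + 1] += ar * xi + ai * xr;  /* imaginary part: one addition */
    }
}
```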

This PR is more or less a duplicate of PR37021, which additionally involves
a reduction and a variable stride.  So fixing this bug is required to
vectorize PR37021 more efficiently.

Note that even this bug has multiple issues that need to be tackled.
I happen to be working on them.



* [Bug tree-optimization/54939] Very poor vectorization of loops with complex arithmetic
  2012-10-16 14:22 [Bug tree-optimization/54939] New: Very poor vectorization of loops with complex arithmetic ysrumyan at gmail dot com
                   ` (4 preceding siblings ...)
  2013-03-27 11:19 ` rguenth at gcc dot gnu.org
@ 2023-07-21 12:28 ` rguenth at gcc dot gnu.org
  2023-07-21 12:31 ` rguenth at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-21 12:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
                 CC|                            |crazylht at gmail dot com

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
With SSE4.2 we now get

.L3:
        movupd  (%rdx,%rax), %xmm0
        movupd  (%rcx,%rax), %xmm4
        movapd  %xmm0, %xmm1
        palignr $8, %xmm0, %xmm0
        mulpd   %xmm3, %xmm1
        mulpd   %xmm2, %xmm0
        addpd   %xmm4, %xmm1
        addsubpd        %xmm0, %xmm1
        movups  %xmm1, (%rcx,%rax)
        addq    $16, %rax
        cmpq    %rsi, %rax
        jne     .L3

with AVX and FMA

.L4:
        vmovupd (%rdx,%rax), %ymm0
        vmovapd %ymm4, %ymm1
        vfmadd213pd     (%rcx,%rax), %ymm0, %ymm1
        vpermilpd       $5, %ymm0, %ymm0
        vmulpd  %ymm3, %ymm0, %ymm0
        vaddsubpd       %ymm0, %ymm1, %ymm1
        vmovupd %ymm1, (%rcx,%rax)
        addq    $32, %rax
        cmpq    %rsi, %rax
        jne     .L4

so I'd say fixed.  But.  With AVX512 we now get

.L4:
        vmovupd (%rdi,%rax), %zmm0
        vmovapd %zmm7, %zmm2
        vmovapd %zmm4, %zmm6
        vfmadd213pd     (%rcx,%rax), %zmm0, %zmm2
        vpermilpd       $85, %zmm0, %zmm0
        vfmadd132pd     %zmm0, %zmm2, %zmm6
        vfnmadd132pd    %zmm4, %zmm2, %zmm0
        vmovapd %zmm6, %zmm0{%k1}
        vmovupd %zmm0, (%rcx,%rax)
        addq    $64, %rax
        cmpq    %rax, %rsi
        jne     .L4

It's odd that this only happens with -mprefer-vector-width=512, though.  Are
we possibly missing vec_{fm,}{addsub,subadd} for those?  It looks like it.

Tracking in PR110767.  The vectorizer side is fixed.
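
As a reading aid for the SSE4.2 listing above, the following sketch models
its loop body with two-lane scalar helpers. It assumes that %xmm3 holds
(ar, ar) and %xmm2 holds (ai, ai) for a = ar + i*ai; the listing does not
show how those registers are set up, so this is an interpretation rather
than something stated in the comment. The type and helper names are
hypothetical.

```c
/* Two-lane stand-in for an SSE register holding packed doubles. */
typedef struct { double lane[2]; } v2df;

static v2df splat(double s)     { return (v2df){{ s, s }}; }
static v2df swap_halves(v2df v) { return (v2df){{ v.lane[1], v.lane[0] }}; }

static v2df mul(v2df a, v2df b)
{
    return (v2df){{ a.lane[0] * b.lane[0], a.lane[1] * b.lane[1] }};
}

static v2df add(v2df a, v2df b)
{
    return (v2df){{ a.lane[0] + b.lane[0], a.lane[1] + b.lane[1] }};
}

/* addsubpd semantics: subtract in lane 0, add in lane 1. */
static v2df addsub(v2df a, v2df b)
{
    return (v2df){{ a.lane[0] - b.lane[0], a.lane[1] + b.lane[1] }};
}

/* One complex element of y += a * x, mirroring the instruction order of
   the SSE4.2 loop.  x and y are (re, im) pairs. */
void zaxpy_addsub_step(double ar, double ai, const double x[2], double y[2])
{
    v2df vx = {{ x[0], x[1] }};                     /* movupd: (xr, xi)   */
    v2df vy = {{ y[0], y[1] }};                     /* movupd: (yr, yi)   */
    v2df t1 = mul(vx, splat(ar));                   /* (ar*xr, ar*xi)     */
    v2df t0 = mul(swap_halves(vx), splat(ai));      /* palignr, then mulpd:
                                                       (ai*xi, ai*xr)     */
    t1 = add(t1, vy);                               /* addpd              */
    t1 = addsub(t1, t0);                            /* addsubpd           */
    y[0] = t1.lane[0];                              /* yr + ar*xr - ai*xi */
    y[1] = t1.lane[1];                              /* yi + ar*xi + ai*xr */
}
```

The alternating subtract/add in a single instruction is what lets one vector
carry both the real and the imaginary part of a complex result.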


* [Bug tree-optimization/54939] Very poor vectorization of loops with complex arithmetic
  2012-10-16 14:22 [Bug tree-optimization/54939] New: Very poor vectorization of loops with complex arithmetic ysrumyan at gmail dot com
                   ` (5 preceding siblings ...)
  2023-07-21 12:28 ` rguenth at gcc dot gnu.org
@ 2023-07-21 12:31 ` rguenth at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-21 12:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939
Bug 54939 depends on bug 84361, which changed state.

Bug 84361 Summary: Fails to use vfmaddsub* for complex multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84361

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |DUPLICATE
