[Bug tree-optimization/47657] New: missed vectorization


* [Bug tree-optimization/47657] New: missed vectorization
@ 2011-02-09  9:33 Joost.VandeVondele at pci dot uzh.ch
  2011-02-09 11:25 ` [Bug tree-optimization/47657] " rguenth at gcc dot gnu.org
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Joost.VandeVondele at pci dot uzh.ch @ 2011-02-09  9:33 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47657

           Summary: missed vectorization
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: Joost.VandeVondele@pci.uzh.ch


The following is not vectorized by gfortran (4.6 / 4.5) when compiled with:

gfortran -O3 -ffast-math -ftree-vectorizer-verbose=6 -S -march=native 
( -march=amdfam10 -mcx16 -msahf -mpopcnt -mabm )

   SUBROUTINE smm_dnn_8_8_8_4_1_2_1(A,B,C)
      REAL(KIND=8) :: C(8,8), B(8,8), A(8,8)
      INTEGER ::i,j,l
      DO j= 1 , 8 , 2
      DO l= 1 , 8 , 1
      DO i= 1 , 8 , 1
        C(i+0,j+0)=C(i+0,j+0)+A(i+0,l+0)*B(l+0,j+0)
        C(i+0,j+1)=C(i+0,j+1)+A(i+0,l+0)*B(l+0,j+1)
      ENDDO
      ENDDO
      ENDDO
    END SUBROUTINE

The Cray ftn compiler does vectorize it, yielding about twice the speed.

reference asm:
0000000000000000 <smm_dnn_8_8_8_4_1_2_1_>:
   0:   53                      push   %rbx
   1:   48 89 7c 24 f8          mov    %rdi,-0x8(%rsp)
   6:   48 89 74 24 f0          mov    %rsi,-0x10(%rsp)
   b:   48 89 54 24 e8          mov    %rdx,-0x18(%rsp)
  10:   31 c0                   xor    %eax,%eax
  12:   48 89 d1                mov    %rdx,%rcx
  15:   49 89 c0                mov    %rax,%r8
  18:   49 89 c1                mov    %rax,%r9
  1b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  20:   66 0f 10 04 02          movupd (%rdx,%rax,1),%xmm0
  25:   66 0f 10 4c 02 40       movupd 0x40(%rdx,%rax,1),%xmm1
  2b:   66 0f 10 54 02 10       movupd 0x10(%rdx,%rax,1),%xmm2
  31:   66 0f 10 5c 02 50       movupd 0x50(%rdx,%rax,1),%xmm3
  37:   66 0f 10 64 02 20       movupd 0x20(%rdx,%rax,1),%xmm4
  3d:   66 0f 10 6c 02 60       movupd 0x60(%rdx,%rax,1),%xmm5
  43:   66 0f 10 74 02 30       movupd 0x30(%rdx,%rax,1),%xmm6
  49:   66 0f 10 7c 02 70       movupd 0x70(%rdx,%rax,1),%xmm7
  4f:   45 31 d2                xor    %r10d,%r10d
  52:   4d 89 d3                mov    %r10,%r11
  55:   66 66 2e 0f 1f 84 00    nopw   %cs:0x0(%rax,%rax,1)
  5c:   00 00 00 00
  60:   66 46 0f 10 44 1f 30    movupd 0x30(%rdi,%r11,1),%xmm8
  67:   4b 8d 1c 02             lea    (%r10,%r8,1),%rbx
  6b:   f2 44 0f 12 4c de 40    movddup 0x40(%rsi,%rbx,8),%xmm9
  72:   66 45 0f 28 d1          movapd %xmm9,%xmm10
  77:   66 45 0f 59 d0          mulpd  %xmm8,%xmm10
  7c:   66 41 0f 58 fa          addpd  %xmm10,%xmm7
  81:   f2 44 0f 12 14 de       movddup (%rsi,%rbx,8),%xmm10
  87:   66 45 0f 59 c2          mulpd  %xmm10,%xmm8
  8c:   66 41 0f 58 f0          addpd  %xmm8,%xmm6
  91:   66 46 0f 10 44 1f 20    movupd 0x20(%rdi,%r11,1),%xmm8
  98:   66 45 0f 28 d9          movapd %xmm9,%xmm11
  9d:   66 45 0f 59 d8          mulpd  %xmm8,%xmm11
  a2:   66 41 0f 58 eb          addpd  %xmm11,%xmm5
  a7:   66 45 0f 59 c2          mulpd  %xmm10,%xmm8
  ac:   66 41 0f 58 e0          addpd  %xmm8,%xmm4
  b1:   66 46 0f 10 44 1f 10    movupd 0x10(%rdi,%r11,1),%xmm8
  b8:   66 45 0f 28 d9          movapd %xmm9,%xmm11
  bd:   66 45 0f 59 d8          mulpd  %xmm8,%xmm11
  c2:   66 41 0f 58 db          addpd  %xmm11,%xmm3
  c7:   66 45 0f 59 c2          mulpd  %xmm10,%xmm8
  cc:   66 41 0f 58 d0          addpd  %xmm8,%xmm2
  d1:   66 46 0f 10 04 1f       movupd (%rdi,%r11,1),%xmm8
  d7:   66 45 0f 59 c8          mulpd  %xmm8,%xmm9
  dc:   66 41 0f 58 c9          addpd  %xmm9,%xmm1
  e1:   66 45 0f 59 d0          mulpd  %xmm8,%xmm10
  e6:   66 41 0f 58 c2          addpd  %xmm10,%xmm0
  eb:   49 83 c3 40             add    $0x40,%r11
  ef:   49 ff c2                inc    %r10
  f2:   49 83 fa 08             cmp    $0x8,%r10
  f6:   0f 8c 64 ff ff ff       jl     60 <smm_dnn_8_8_8_4_1_2_1_+0x60>
  fc:   f2 0f 11 7c 01 70       movsd  %xmm7,0x70(%rcx,%rax,1)
 102:   66 0f 17 7c 01 78       movhpd %xmm7,0x78(%rcx,%rax,1)
 108:   f2 0f 11 74 02 30       movsd  %xmm6,0x30(%rdx,%rax,1)
 10e:   66 0f 17 74 02 38       movhpd %xmm6,0x38(%rdx,%rax,1)
 114:   f2 0f 11 6c 02 60       movsd  %xmm5,0x60(%rdx,%rax,1)
 11a:   66 0f 17 6c 02 68       movhpd %xmm5,0x68(%rdx,%rax,1)
 120:   f2 0f 11 64 02 20       movsd  %xmm4,0x20(%rdx,%rax,1)
 126:   66 0f 17 64 02 28       movhpd %xmm4,0x28(%rdx,%rax,1)
 12c:   f2 0f 11 5c 02 50       movsd  %xmm3,0x50(%rdx,%rax,1)
 132:   66 0f 17 5c 02 58       movhpd %xmm3,0x58(%rdx,%rax,1)
 138:   f2 0f 11 54 02 10       movsd  %xmm2,0x10(%rdx,%rax,1)
 13e:   66 0f 17 54 02 18       movhpd %xmm2,0x18(%rdx,%rax,1)
 144:   f2 0f 11 4c 02 40       movsd  %xmm1,0x40(%rdx,%rax,1)
 14a:   66 0f 17 4c 02 48       movhpd %xmm1,0x48(%rdx,%rax,1)
 150:   f2 0f 11 04 02          movsd  %xmm0,(%rdx,%rax,1)
 155:   66 0f 17 44 02 08       movhpd %xmm0,0x8(%rdx,%rax,1)
 15b:   49 83 c0 10             add    $0x10,%r8
 15f:   48 83 e8 80             sub    $0xffffffffffffff80,%rax
 163:   49 ff c1                inc    %r9
 166:   49 83 f9 04             cmp    $0x4,%r9
 16a:   0f 8c b0 fe ff ff       jl     20 <smm_dnn_8_8_8_4_1_2_1_+0x20>
 170:   5b                      pop    %rbx
 171:   c3                      retq
 172:   66 66 66 66 66 2e 0f    nopw   %cs:0x0(%rax,%rax,1)
 179:   1f 84 00 00 00 00 00



* [Bug tree-optimization/47657] missed vectorization
  2011-02-09  9:33 [Bug tree-optimization/47657] New: missed vectorization Joost.VandeVondele at pci dot uzh.ch
@ 2011-02-09 11:25 ` rguenth at gcc dot gnu.org
  2011-02-09 11:48 ` Joost.VandeVondele at pci dot uzh.ch
  2012-06-30 13:34 ` Joost.VandeVondele at mat dot ethz.ch
  2 siblings, 0 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-02-09 11:25 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47657

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.02.09 11:06:56
     Ever Confirmed|0                           |1

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-02-09 11:06:56 UTC ---
It is vectorized with -fno-vect-cost-model.  The Cray compiler probably
interchanged loops and/or unrolled more aggressively than we do here
(the code generated with -fno-vect-cost-model shows the cost model is right ;))
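
One possible reading of the reference asm in the original report, written
back as Fortran: the i loop is fully unrolled into whole-column operations
and the two C columns are held in temporaries across the l loop, so they are
loaded once and stored once per j step. This is only a sketch of that shape,
not a statement of what either compiler actually does internally; the names
smm_ref, smm_regs, c0 and c1 are hypothetical and not from the report.

```fortran
module smm_sketch
  implicit none
contains

  ! Kernel as posted in the report (loop order j, l, i).
  subroutine smm_ref(A, B, C)
    real(kind=8), intent(in)    :: A(8,8), B(8,8)
    real(kind=8), intent(inout) :: C(8,8)
    integer :: i, j, l
    do j = 1, 8, 2
      do l = 1, 8
        do i = 1, 8
          C(i,j)   = C(i,j)   + A(i,l)*B(l,j)
          C(i,j+1) = C(i,j+1) + A(i,l)*B(l,j+1)
        end do
      end do
    end do
  end subroutine smm_ref

  ! Hypothetical hand-restructured variant: the i loop becomes array
  ! operations on whole columns, and the two columns of C live in
  ! temporaries (c0, c1) for the duration of the l loop.
  subroutine smm_regs(A, B, C)
    real(kind=8), intent(in)    :: A(8,8), B(8,8)
    real(kind=8), intent(inout) :: C(8,8)
    real(kind=8) :: c0(8), c1(8)
    integer :: j, l
    do j = 1, 8, 2
      c0 = C(:,j)
      c1 = C(:,j+1)
      do l = 1, 8
        c0 = c0 + A(:,l)*B(l,j)     ! B(l,j) is a scalar broadcast
        c1 = c1 + A(:,l)*B(l,j+1)
      end do
      C(:,j)   = c0
      C(:,j+1) = c1
    end do
  end subroutine smm_regs

end module smm_sketch

program check_smm_sketch
  use smm_sketch
  implicit none
  real(kind=8) :: A(8,8), B(8,8), C1(8,8), C2(8,8)

  call random_number(A)
  call random_number(B)
  call random_number(C1)
  C2 = C1

  call smm_ref(A, B, C1)
  call smm_regs(A, B, C2)

  ! Both variants accumulate over l in the same order, so they should
  ! agree up to rounding (e.g. from FMA contraction).
  if (maxval(abs(C1 - C2)) > 1.0d-12) error stop "kernels disagree"
  print *, "max abs difference:", maxval(abs(C1 - C2))
end program check_smm_sketch
```

Whether gfortran vectorizes either form still depends on the version, the
flags, and the cost model discussed above.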



* [Bug tree-optimization/47657] missed vectorization
  2011-02-09  9:33 [Bug tree-optimization/47657] New: missed vectorization Joost.VandeVondele at pci dot uzh.ch
  2011-02-09 11:25 ` [Bug tree-optimization/47657] " rguenth at gcc dot gnu.org
@ 2011-02-09 11:48 ` Joost.VandeVondele at pci dot uzh.ch
  2012-06-30 13:34 ` Joost.VandeVondele at mat dot ethz.ch
  2 siblings, 0 replies; 4+ messages in thread
From: Joost.VandeVondele at pci dot uzh.ch @ 2011-02-09 11:48 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47657

--- Comment #2 from Joost VandeVondele <Joost.VandeVondele at pci dot uzh.ch> 2011-02-09 11:25:42 UTC ---
Created attachment 23283
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23283
testcase including timing routine,  last number is flop rate.

the cray compiler is supposed to *not* interchange loops, as I'm using:

ftn -O3,ipa0,nointerchange,vector3  testcase.f90

to compile. This gives about 5.6Gflops.

Unrolling still seems to happen (there are 16 mults in the inner loop), and
disabling it with

ftn -O3,ipa0,nointerchange,vector3,unroll0  testcase.f90

yields poor performance (2.3 Gflops).

gfortran 4.5 yields 3.424 Gflops with:

gfortran -O3 -ffast-math -funroll-loops -march=native  testcase.f90
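
For readers without access to the attachment, a flop rate for this kernel
can be estimated roughly as follows. This is not the attached testcase, only
a minimal sketch under stated assumptions: each call performs 8*8*8
multiply-adds (1024 flops), the repeat count ncalls is an arbitrary choice,
and the program name and timing method (system_clock) are hypothetical.

```fortran
program flop_rate_sketch
  implicit none
  integer, parameter :: ncalls = 1000000   ! arbitrary repeat count (assumption)
  real(kind=8)    :: A(8,8), B(8,8), C(8,8)
  real(kind=8)    :: seconds, flops
  integer(kind=8) :: t0, t1, rate
  integer         :: n

  call random_number(A)
  call random_number(B)
  C = 0.0d0

  call system_clock(t0, rate)
  do n = 1, ncalls
    call smm_dnn_8_8_8_4_1_2_1(A, B, C)
  end do
  call system_clock(t1)

  seconds = real(t1 - t0, kind=8) / real(rate, kind=8)
  ! One multiply and one add per (i,j,l) triple: 2*8*8*8 flops per call.
  flops = 2.0d0 * 8 * 8 * 8 * real(ncalls, kind=8)

  ! Print a result element so the kernel calls cannot be discarded as dead code.
  print *, "C(1,1) =", C(1,1)
  if (seconds > 0.0d0) then
    print *, "Gflops:", flops / seconds / 1.0d9
  else
    print *, "elapsed time below timer resolution; increase ncalls"
  end if

contains

  ! Kernel from the original report.
  subroutine smm_dnn_8_8_8_4_1_2_1(A, B, C)
    real(kind=8) :: C(8,8), B(8,8), A(8,8)
    integer :: i, j, l
    do j = 1, 8, 2
      do l = 1, 8, 1
        do i = 1, 8, 1
          C(i+0,j+0) = C(i+0,j+0) + A(i+0,l+0)*B(l+0,j+0)
          C(i+0,j+1) = C(i+0,j+1) + A(i+0,l+0)*B(l+0,j+1)
        end do
      end do
    end do
  end subroutine smm_dnn_8_8_8_4_1_2_1

end program flop_rate_sketch
```

Because the kernel is an internal procedure here, the compiler is free to
inline it, so numbers from this sketch are not directly comparable to the
figures quoted in this comment.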



* [Bug tree-optimization/47657] missed vectorization
  2011-02-09  9:33 [Bug tree-optimization/47657] New: missed vectorization Joost.VandeVondele at pci dot uzh.ch
  2011-02-09 11:25 ` [Bug tree-optimization/47657] " rguenth at gcc dot gnu.org
  2011-02-09 11:48 ` Joost.VandeVondele at pci dot uzh.ch
@ 2012-06-30 13:34 ` Joost.VandeVondele at mat dot ethz.ch
  2 siblings, 0 replies; 4+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2012-06-30 13:34 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47657

Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #3 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2012-06-30 13:34:24 UTC ---
Performance seems good with GCC 4.8.


