public inbox for gcc-bugs@sourceware.org
* [Bug target/51179] New: poor vectorization on interlagos.
@ 2011-11-16 19:25 Joost.VandeVondele at mat dot ethz.ch
2011-11-22 11:03 ` [Bug target/51179] " ubizjak at gmail dot com
` (11 more replies)
0 siblings, 12 replies; 13+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2011-11-16 19:25 UTC
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
Bug #: 51179
Summary: poor vectorization on interlagos.
Classification: Unclassified
Product: gcc
Version: 4.6.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: Joost.VandeVondele@mat.ethz.ch
The following code executes significantly faster when compiled with the Cray
compiler (gcc: 43.4 s, Cray: 7.7 s for 100000000 calls):
SUBROUTINE smm_dnn_4_10_10_4_1_2_1(A,B,C)
REAL(KIND=KIND(0.0D0)) :: C(4,10), B(10,10), A(4,10)
INTEGER ::i,j,l
DO j= 1 , 10 , 2
DO l= 1 , 10 , 1
DO i= 1 , 4 , 1
C(i+0,j+0)=C(i+0,j+0)+A(i+0,l+0)*B(l+0,j+0)
C(i+0,j+1)=C(i+0,j+1)+A(i+0,l+0)*B(l+0,j+1)
ENDDO
ENDDO
ENDDO
END SUBROUTINE
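For scale, the timings above can be turned into a sustained floating-point rate. The following is a minimal C sketch of that arithmetic, assuming each innermost update counts as two flops (one multiply, one add) and that every call performs the full 4 x 10 x 10 update. The helper names `flops_per_call` and `gflops` are made up for illustration.

```c
#include <assert.h>

/* Flops for one M x N x K multiply-accumulate: a multiply and an add
   for every (i, j, l) triple.  Counting 2 flops per update is an
   assumption, not something stated in the report. */
double flops_per_call(int m, int n, int k)
{
    return 2.0 * m * n * k;
}

/* Sustained rate in Gflops for `calls` invocations taking `seconds`. */
double gflops(int m, int n, int k, double calls, double seconds)
{
    assert(seconds > 0.0);
    return flops_per_call(m, n, k) * calls / seconds / 1e9;
}
```

With the figures from the report, `gflops(4, 10, 10, 1e8, 43.4)` comes out at roughly 1.8 Gflops for gfortran and `gflops(4, 10, 10, 1e8, 7.7)` at roughly 10.4 Gflops for the Cray compiler.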
cray options: -h noomp -e m -F -ra -O2 -Oipa1 -v tst.f90
gfortran: -O3 -march=native -ffast-math
which yields for gfortran:
-march=bdver1 -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mabm -mlwp
-mno-fma -mfma4 -mxop -mno-bmi -mno-tbm -mavx -msse4.2 -msse4.1
The cray code looks nice:
0000000000000000 <smm_dnn_4_10_10_4_1_2_1_>:
0: 48 89 7c 24 f8 mov %rdi,-0x8(%rsp)
5: 48 89 74 24 f0 mov %rsi,-0x10(%rsp)
a: 48 89 54 24 e8 mov %rdx,-0x18(%rsp)
f: c5 fc 10 02 vmovups (%rdx),%ymm0
13: c5 fc 10 4a 20 vmovups 0x20(%rdx),%ymm1
18: c5 fc 10 52 40 vmovups 0x40(%rdx),%ymm2
1d: c5 fc 10 5a 60 vmovups 0x60(%rdx),%ymm3
22: c5 fc 10 a2 80 00 00 vmovups 0x80(%rdx),%ymm4
29: 00
2a: c5 fc 10 aa a0 00 00 vmovups 0xa0(%rdx),%ymm5
31: 00
32: c5 fc 10 b2 c0 00 00 vmovups 0xc0(%rdx),%ymm6
39: 00
3a: c5 fc 10 ba e0 00 00 vmovups 0xe0(%rdx),%ymm7
41: 00
42: c5 7c 10 82 00 01 00 vmovups 0x100(%rdx),%ymm8
49: 00
4a: c5 7c 10 8a 20 01 00 vmovups 0x120(%rdx),%ymm9
51: 00
52: 31 c0 xor %eax,%eax
54: 48 89 c1 mov %rax,%rcx
57: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
5e: 00 00
60: c4 62 7d 19 94 c6 d0 vbroadcastsd 0x2d0(%rsi,%rax,8),%ymm10
67: 02 00 00
6a: c5 7c 10 1c 0f vmovups (%rdi,%rcx,1),%ymm11
6f: c4 43 a5 69 c9 a0 vfmaddpd %ymm9,%ymm10,%ymm11,%ymm9
75: c4 62 7d 19 94 c6 80 vbroadcastsd 0x280(%rsi,%rax,8),%ymm10
7c: 02 00 00
7f: c4 43 a5 69 c0 a0 vfmaddpd %ymm8,%ymm10,%ymm11,%ymm8
85: c4 62 7d 19 94 c6 30 vbroadcastsd 0x230(%rsi,%rax,8),%ymm10
8c: 02 00 00
8f: c4 e3 a5 69 ff a0 vfmaddpd %ymm7,%ymm10,%ymm11,%ymm7
95: c4 62 7d 19 94 c6 e0 vbroadcastsd 0x1e0(%rsi,%rax,8),%ymm10
9c: 01 00 00
9f: c4 e3 a5 69 f6 a0 vfmaddpd %ymm6,%ymm10,%ymm11,%ymm6
a5: c4 62 7d 19 94 c6 90 vbroadcastsd 0x190(%rsi,%rax,8),%ymm10
ac: 01 00 00
af: c4 e3 a5 69 ed a0 vfmaddpd %ymm5,%ymm10,%ymm11,%ymm5
b5: c4 62 7d 19 94 c6 40 vbroadcastsd 0x140(%rsi,%rax,8),%ymm10
bc: 01 00 00
bf: c4 e3 a5 69 e4 a0 vfmaddpd %ymm4,%ymm10,%ymm11,%ymm4
c5: c4 62 7d 19 94 c6 f0 vbroadcastsd 0xf0(%rsi,%rax,8),%ymm10
cc: 00 00 00
cf: c4 e3 a5 69 db a0 vfmaddpd %ymm3,%ymm10,%ymm11,%ymm3
d5: c4 62 7d 19 94 c6 a0 vbroadcastsd 0xa0(%rsi,%rax,8),%ymm10
dc: 00 00 00
df: c4 e3 a5 69 d2 a0 vfmaddpd %ymm2,%ymm10,%ymm11,%ymm2
e5: c4 62 7d 19 54 c6 50 vbroadcastsd 0x50(%rsi,%rax,8),%ymm10
ec: c4 e3 a5 69 c9 a0 vfmaddpd %ymm1,%ymm10,%ymm11,%ymm1
f2: c4 62 7d 19 14 c6 vbroadcastsd (%rsi,%rax,8),%ymm10
f8: c4 e3 a5 69 c0 a0 vfmaddpd %ymm0,%ymm10,%ymm11,%ymm0
fe: 48 83 c1 20 add $0x20,%rcx
102: 48 ff c0 inc %rax
105: 48 83 f8 0a cmp $0xa,%rax
109: 0f 8c 51 ff ff ff jl 60 <smm_dnn_4_10_10_4_1_2_1_+0x60>
10f: c5 78 11 8a 20 01 00 vmovups %xmm9,0x120(%rdx)
116: 00
117: c4 63 7d 19 8a 30 01 vextractf128 $0x1,%ymm9,0x130(%rdx)
11e: 00 00 01
121: c5 78 11 82 00 01 00 vmovups %xmm8,0x100(%rdx)
128: 00
129: c4 63 7d 19 82 10 01 vextractf128 $0x1,%ymm8,0x110(%rdx)
130: 00 00 01
133: c5 f8 11 ba e0 00 00 vmovups %xmm7,0xe0(%rdx)
13a: 00
13b: c4 e3 7d 19 ba f0 00 vextractf128 $0x1,%ymm7,0xf0(%rdx)
142: 00 00 01
145: c5 f8 11 b2 c0 00 00 vmovups %xmm6,0xc0(%rdx)
14c: 00
14d: c4 e3 7d 19 b2 d0 00 vextractf128 $0x1,%ymm6,0xd0(%rdx)
154: 00 00 01
157: c5 f8 11 aa a0 00 00 vmovups %xmm5,0xa0(%rdx)
15e: 00
15f: c4 e3 7d 19 aa b0 00 vextractf128 $0x1,%ymm5,0xb0(%rdx)
166: 00 00 01
169: c5 f8 11 a2 80 00 00 vmovups %xmm4,0x80(%rdx)
170: 00
171: c4 e3 7d 19 a2 90 00 vextractf128 $0x1,%ymm4,0x90(%rdx)
178: 00 00 01
17b: c5 f8 11 5a 60 vmovups %xmm3,0x60(%rdx)
180: c4 e3 7d 19 5a 70 01 vextractf128 $0x1,%ymm3,0x70(%rdx)
187: c5 f8 11 52 40 vmovups %xmm2,0x40(%rdx)
18c: c4 e3 7d 19 52 50 01 vextractf128 $0x1,%ymm2,0x50(%rdx)
193: c5 f8 11 4a 20 vmovups %xmm1,0x20(%rdx)
198: c4 e3 7d 19 4a 30 01 vextractf128 $0x1,%ymm1,0x30(%rdx)
19f: c5 f8 11 02 vmovups %xmm0,(%rdx)
1a3: c4 e3 7d 19 42 10 01 vextractf128 $0x1,%ymm0,0x10(%rdx)
1aa: c5 f8 77 vzeroupper
1ad: c3 retq
1ae: 66 90 xchg %ax,%ax
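Read as C, the Cray listing appears to keep all ten columns of C in registers for the whole call, walk l once, and for each l load one column of A and multiply it by a broadcast element of B for every j. A rough sketch of that structure follows, with a plain array standing in for the ymm accumulators. The function name and the `c[j][i]`, `a[l][i]`, `b[j][l]` views of the column-major Fortran arrays are assumptions made for illustration.

```c
#include <assert.h>

enum { M = 4, N = 10, K = 10 };

/* Register-blocked form of the kernel (illustrative, not the Cray source). */
void smm_register_blocked(double a[K][M], double b[N][K], double c[N][M])
{
    double acc[N][M];               /* stands in for ymm0..ymm9 */
    int i, j, l;

    assert(a && b && c);

    for (j = 0; j < N; j++)         /* load all of C once, before the loop */
        for (i = 0; i < M; i++)
            acc[j][i] = c[j][i];

    for (l = 0; l < K; l++) {
        const double *col = a[l];   /* one vector load of A(:,l) */
        for (j = 0; j < N; j++) {
            double s = b[j][l];     /* vbroadcastsd of B(l,j) */
            for (i = 0; i < M; i++) /* one vfmaddpd on four lanes */
                acc[j][i] += col[i] * s;
        }
    }

    for (j = 0; j < N; j++)         /* store all of C once, after the loop */
        for (i = 0; i < M; i++)
            c[j][i] = acc[j][i];
}
```

The point of this shape is that C is read and written exactly once, and the loop body contains only loads of A, broadcasts of B, and fused multiply-adds.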
gcc's code looks more involved:
smm_dnn_4_10_10_4_1_2_1_:
.LFB0:
pushq %rbp
.LCFI0:
movl $1, %eax
movq %rsp, %rbp
.LCFI1:
andq $-32, %rsp
subq $616, %rsp
.LCFI2:
vmovupd 96(%rdi), %ymm0
vmovupd (%rdi), %ymm3
vmovupd 32(%rdi), %ymm1
vmovsd 280(%rdi), %xmm13
vmovupd 64(%rdi), %ymm2
vmovsd 288(%rdi), %xmm15
vmovsd 256(%rdi), %xmm4
vmovsd 264(%rdi), %xmm6
vmovsd 272(%rdi), %xmm7
vmovupd 128(%rdi), %ymm12
vmovsd %xmm13, 296(%rsp)
vmovupd 160(%rdi), %ymm11
vperm2f128 $32, %ymm1, %ymm3, %ymm13
vmovsd %xmm15, 288(%rsp)
vperm2f128 $49, %ymm1, %ymm3, %ymm1
vmovsd %xmm4, 320(%rsp)
vperm2f128 $32, %ymm0, %ymm2, %ymm15
vmovsd 296(%rdi), %xmm4
vperm2f128 $49, %ymm0, %ymm2, %ymm2
vmovsd %xmm6, 312(%rsp)
vmovaps %ymm1, 40(%rsp)
vunpcklpd %ymm1, %ymm13, %ymm1
vmovsd 304(%rdi), %xmm6
vunpcklpd %ymm2, %ymm15, %ymm0
vmovsd %xmm7, 304(%rsp)
vmovsd 312(%rdi), %xmm7
vmovaps %ymm2, -24(%rsp)
vperm2f128 $32, %ymm0, %ymm1, %ymm2
vmovupd 192(%rdi), %ymm10
vperm2f128 $49, %ymm0, %ymm1, %ymm0
vmovsd %xmm4, 280(%rsp)
vmovsd %xmm6, 336(%rsp)
vmovaps %ymm13, %ymm4
vmovsd %xmm7, 328(%rsp)
vmovaps %ymm15, %ymm6
vmovaps %ymm2, %ymm7
vunpcklpd %ymm0, %ymm2, %ymm8
vmovupd 224(%rdi), %ymm9
vmovaps %ymm13, 72(%rsp)
vmovaps %ymm15, 8(%rsp)
vmovaps %ymm2, -56(%rsp)
vmovaps %ymm0, -88(%rsp)
vxorps %xmm0, %xmm0, %xmm0
.L3:
vunpckhpd 40(%rsp), %ymm4, %ymm3
vmovupd (%rsi), %ymm4
vunpckhpd -24(%rsp), %ymm6, %ymm1
vunpckhpd -88(%rsp), %ymm7, %ymm5
vperm2f128 $32, %ymm1, %ymm3, %ymm2
vperm2f128 $49, %ymm1, %ymm3, %ymm1
vfmaddpd %ymm0, %ymm5, %ymm4, %ymm15
vfmaddpd %ymm0, %ymm8, %ymm4, %ymm3
vunpcklpd %ymm1, %ymm2, %ymm6
vunpckhpd %ymm1, %ymm2, %ymm2
vmovupd 80(%rsi), %ymm1
vfmaddpd %ymm0, %ymm6, %ymm4, %ymm13
vfmaddpd %ymm0, %ymm2, %ymm4, %ymm4
vmovaps %ymm15, 200(%rsp)
vmovsd 320(%rsp), %xmm15
vfmaddpd %ymm0, %ymm8, %ymm1, %ymm14
vfmaddpd %ymm0, %ymm6, %ymm1, %ymm6
vfmaddpd %ymm0, %ymm5, %ymm1, %ymm5
vfmaddpd %ymm0, %ymm2, %ymm1, %ymm1
vperm2f128 $32, %ymm11, %ymm12, %ymm2
vmovaps %ymm13, -120(%rsp)
vmovsd 64(%rsi), %xmm13
vmovaps %ymm4, 136(%rsp)
vmovaps %ymm6, 232(%rsp)
vfmaddsd (%rdx), %xmm15, %xmm13, %xmm15
vmovaps %ymm1, 104(%rsp)
vperm2f128 $49, %ymm11, %ymm12, %ymm1
vmovaps %ymm5, 168(%rsp)
vperm2f128 $32, %ymm9, %ymm10, %ymm5
vunpcklpd %ymm1, %ymm2, %ymm6
vmovsd %xmm13, 344(%rsp)
vunpckhpd %ymm1, %ymm2, %ymm2
vperm2f128 $49, %ymm9, %ymm10, %ymm1
vunpcklpd %ymm1, %ymm5, %ymm4
vmovsd %xmm15, 352(%rsp)
vunpckhpd %ymm1, %ymm5, %ymm1
vperm2f128 $32, %ymm4, %ymm6, %ymm5
vperm2f128 $49, %ymm4, %ymm6, %ymm4
vunpcklpd %ymm4, %ymm5, %ymm7
vunpckhpd %ymm4, %ymm5, %ymm5
vperm2f128 $32, %ymm1, %ymm2, %ymm4
vperm2f128 $49, %ymm1, %ymm2, %ymm1
vmovupd 32(%rsi), %ymm2
vunpcklpd %ymm1, %ymm4, %ymm6
vunpckhpd %ymm1, %ymm4, %ymm4
vmovupd 112(%rsi), %ymm1
vfmaddpd %ymm3, %ymm7, %ymm2, %ymm3
vfmaddpd %ymm14, %ymm7, %ymm1, %ymm7
vhaddpd %ymm3, %ymm3, %ymm3
vhaddpd %ymm7, %ymm7, %ymm7
vperm2f128 $1, %ymm3, %ymm3, %ymm15
vaddpd %ymm15, %ymm3, %ymm3
vmovaps %ymm3, 584(%rsp)
vmovsd 352(%rsp), %xmm3
vaddsd 584(%rsp), %xmm3, %xmm3
vmovsd 144(%rsi), %xmm15
vmovsd %xmm3, 264(%rsp)
vmovsd 320(%rsp), %xmm3
vfmaddsd 32(%rdx), %xmm3, %xmm15, %xmm13
vperm2f128 $1, %ymm7, %ymm7, %ymm3
vaddpd %ymm3, %ymm7, %ymm3
vmovaps %ymm3, 552(%rsp)
vmovsd 312(%rsp), %xmm3
vaddsd 552(%rsp), %xmm13, %xmm13
vmovsd %xmm13, 272(%rsp)
vmovsd 344(%rsp), %xmm13
vfmaddsd 8(%rdx), %xmm3, %xmm13, %xmm7
vfmaddpd -120(%rsp), %ymm6, %ymm2, %ymm13
vhaddpd %ymm13, %ymm13, %ymm13
vperm2f128 $1, %ymm13, %ymm13, %ymm3
vaddpd %ymm3, %ymm13, %ymm3
vmovaps %ymm3, 520(%rsp)
vaddsd 520(%rsp), %xmm7, %xmm7
vmovsd %xmm7, 352(%rsp)
vfmaddpd 232(%rsp), %ymm6, %ymm1, %ymm6
vmovsd 312(%rsp), %xmm13
vfmaddsd 40(%rdx), %xmm13, %xmm15, %xmm7
vhaddpd %ymm6, %ymm6, %ymm6
vperm2f128 $1, %ymm6, %ymm6, %ymm3
vaddpd %ymm3, %ymm6, %ymm3
vmovsd 304(%rsp), %xmm6
vmovaps %ymm3, 488(%rsp)
vmovsd 344(%rsp), %xmm3
vaddsd 488(%rsp), %xmm7, %xmm13
vfmaddsd 16(%rdx), %xmm6, %xmm3, %xmm7
vfmaddpd 200(%rsp), %ymm5, %ymm2, %ymm3
vfmaddpd 168(%rsp), %ymm5, %ymm1, %ymm5
vfmaddpd 136(%rsp), %ymm4, %ymm2, %ymm2
vfmaddpd 104(%rsp), %ymm4, %ymm1, %ymm1
vmovsd 288(%rsp), %xmm4
vhaddpd %ymm3, %ymm3, %ymm3
vhaddpd %ymm5, %ymm5, %ymm5
vhaddpd %ymm2, %ymm2, %ymm2
vhaddpd %ymm1, %ymm1, %ymm1
vperm2f128 $1, %ymm3, %ymm3, %ymm6
vaddpd %ymm6, %ymm3, %ymm3
vmovaps %ymm3, 456(%rsp)
vperm2f128 $1, %ymm5, %ymm5, %ymm3
vaddpd %ymm3, %ymm5, %ymm3
vaddsd 456(%rsp), %xmm7, %xmm14
vmovsd 304(%rsp), %xmm7
vfmaddsd 48(%rdx), %xmm7, %xmm15, %xmm6
vmovsd 296(%rsp), %xmm7
vmovaps %ymm3, 424(%rsp)
vmovsd 344(%rsp), %xmm3
vfmaddsd 24(%rdx), %xmm7, %xmm3, %xmm5
vperm2f128 $1, %ymm2, %ymm2, %ymm3
vaddpd %ymm3, %ymm2, %ymm2
vaddsd 424(%rsp), %xmm6, %xmm6
vmovaps %ymm2, 392(%rsp)
vperm2f128 $1, %ymm1, %ymm1, %ymm2
vaddpd %ymm2, %ymm1, %ymm1
vfmaddsd 56(%rdx), %xmm7, %xmm15, %xmm15
vaddsd 392(%rsp), %xmm5, %xmm5
vmovaps %ymm1, 360(%rsp)
vmovsd 72(%rsi), %xmm2
vmovsd 152(%rsi), %xmm1
addq $160, %rsi
vaddsd 360(%rsp), %xmm15, %xmm15
vfmaddsd 264(%rsp), %xmm4, %xmm2, %xmm3
vmovsd %xmm3, (%rdx)
vfmaddsd 272(%rsp), %xmm4, %xmm1, %xmm3
vmovsd %xmm3, 32(%rdx)
vmovsd 280(%rsp), %xmm3
vfmaddsd 352(%rsp), %xmm3, %xmm2, %xmm7
vmovsd %xmm7, 8(%rdx)
vfmaddsd %xmm13, %xmm3, %xmm1, %xmm7
vfmaddsd %xmm6, 336(%rsp), %xmm1, %xmm6
vfmaddsd %xmm5, 328(%rsp), %xmm2, %xmm5
vfmaddsd %xmm15, 328(%rsp), %xmm1, %xmm1
vmovsd %xmm7, 40(%rdx)
vfmaddsd %xmm14, 336(%rsp), %xmm2, %xmm7
vmovsd %xmm6, 48(%rdx)
vmovsd %xmm5, 24(%rdx)
vmovsd %xmm1, 56(%rdx)
vmovsd %xmm7, 16(%rdx)
addq $64, %rdx
cmpl $9, %eax
je .L1
addl $2, %eax
vmovaps 72(%rsp), %ymm4
vmovaps 8(%rsp), %ymm6
vmovaps -56(%rsp), %ymm7
jmp .L3
.p2align 5,,7
.p2align 3
.L1:
leave
.LCFI3:
* [Bug target/51179] poor vectorization on interlagos.
From: ubizjak at gmail dot com @ 2011-11-22 11:03 UTC
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
--- Comment #1 from Uros Bizjak <ubizjak at gmail dot com> 2011-11-22 10:19:40 UTC ---
What about current 4.7 SVN?
* [Bug tree-optimization/51179] poor vectorization on interlagos.
From: ubizjak at gmail dot com @ 2011-11-22 12:31 UTC
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
Uros Bizjak <ubizjak at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Last reconfirmed| |2011-11-22
CC| |irar at il dot ibm.com
Component|target |tree-optimization
Ever Confirmed|0 |1
--- Comment #2 from Uros Bizjak <ubizjak at gmail dot com> 2011-11-22 11:33:24 UTC ---
We can start here with something that hopefully resembles your original Fortran
code:
--cut here--
double C[10][4], B[10][10], A[10][4];
void test (void)
{
int i = 0, j = 0, l = 0;
//for (; j < 10; j += 2)
// for (; l < 10; l++)
for (; i < 4; i++)
{
C[j+0][i] = C[j+0][i] + A[l][i] * B[j+0][l];
C[j+1][i] = C[j+1][i] + A[l][i] * B[j+1][l];
}
}
--cut here--
gcc -O3 -ffast-math -mfma4 -mavx:
test:
vmovapd A(%rip), %ymm0
vbroadcastsd B(%rip), %ymm1
vfmaddpd C(%rip), %ymm1, %ymm0, %ymm1
vmovapd %ymm1, C(%rip)
vbroadcastsd B+80(%rip), %ymm1
vfmaddpd C+32(%rip), %ymm1, %ymm0, %ymm0
vmovapd %ymm0, C+32(%rip)
vzeroupper
ret
Nice.
Now uncomment the second loop ("l" index) and this kernel will break:
< ... lots of code deleted ... >
.L3:
vmovupd (%r8,%rax), %xmm1
addl $1, %esi
vinsertf128 $0x1, 16(%r8,%rax), %ymm1, %ymm1
vfmaddpd %ymm0, %ymm5, %ymm1, %ymm0
vmovapd %ymm0, (%rbx,%rax)
vmovupd (%rcx,%rax), %xmm0
vinsertf128 $0x1, 16(%rcx,%rax), %ymm0, %ymm0
vfmaddpd %ymm0, %ymm4, %ymm1, %ymm0
vmovupd %xmm0, (%rcx,%rax)
vextractf128 $0x1, %ymm0, 16(%rcx,%rax)
addq $32, %rax
cmpl %r10d, %esi
jb .L3
< ... lots of code deleted ... >
This already happens in the tree optimizers (vectorizer); RTL is just following
this trail.
Confirmed as a vectorizer problem.
* [Bug tree-optimization/51179] poor vectorization on interlagos.
From: jakub at gcc dot gnu.org @ 2011-11-22 17:57 UTC
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jakub at gcc dot gnu.org
--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-11-22 17:13:26 UTC ---
Your testcase doesn't resemble the original; the inner for loops need
their iteration variables reset.
double C[10][4], B[10][10], A[10][4];
void
test (void)
{
int i = 0, j = 0, l = 0;
for (l = 0; l < 10; l++)
for (j = 0; j < 10; j += 2)
for (i = 0; i < 4; i++)
{
C[j + 0][i] = C[j + 0][i] + A[l][i] * B[j + 0][l];
C[j + 1][i] = C[j + 1][i] + A[l][i] * B[j + 1][l];
}
}
is IMHO just a matter of whether graphite can -floop-interchange this or not.
If you manually swap the l and j for lines, the generated code looks better,
though for some reason we unroll even the l loop which increases register
pressure too much.
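The manual swap mentioned above can be sketched as follows. This is only an illustration of the interchange, assuming the arrays are passed as parameters rather than being globals as in the testcase; the names `kernel_l_outer` and `kernel_j_outer` are invented here. Both orderings accumulate each element of C over l in the same order, so they should compute the same result.

```c
#include <assert.h>

enum { M = 4, N = 10, K = 10 };

/* Loop order of the testcase above: l outermost, then j, then i. */
void kernel_l_outer(double a[K][M], double b[N][K], double c[N][M])
{
    int i, j, l;

    assert(a && b && c);
    for (l = 0; l < K; l++)
        for (j = 0; j < N; j += 2)
            for (i = 0; i < M; i++) {
                c[j + 0][i] = c[j + 0][i] + a[l][i] * b[j + 0][l];
                c[j + 1][i] = c[j + 1][i] + a[l][i] * b[j + 1][l];
            }
}

/* l and j swapped by hand: j outermost, as in the Fortran original. */
void kernel_j_outer(double a[K][M], double b[N][K], double c[N][M])
{
    int i, j, l;

    assert(a && b && c);
    for (j = 0; j < N; j += 2)
        for (l = 0; l < K; l++)
            for (i = 0; i < M; i++) {
                c[j + 0][i] = c[j + 0][i] + a[l][i] * b[j + 0][l];
                c[j + 1][i] = c[j + 1][i] + a[l][i] * b[j + 1][l];
            }
}
```

Whether -floop-interchange performs this transformation automatically for a given GCC version is exactly the open question in the comment; the sketch only shows what the hand-written swap looks like.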
* [Bug tree-optimization/51179] poor vectorization on interlagos.
From: Joost.VandeVondele at mat dot ethz.ch @ 2011-11-22 18:53 UTC
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
--- Comment #4 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2011-11-22 18:34:03 UTC ---
Created attachment 25887
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25887
general code
the more general code used to find the most efficient matrix multiply for sizes
4,10,10
* [Bug tree-optimization/51179] poor vectorization on interlagos.
From: Joost.VandeVondele at mat dot ethz.ch @ 2011-11-22 18:55 UTC
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
--- Comment #5 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2011-11-22 18:34:48 UTC ---
(In reply to comment #3)
> is IMHO just a matter of whether graphite can -floop-interchange this or not.
> If you manually swap the l and j for lines, the generated code looks better,
> though for some reason we unroll even the l loop which increases register
> pressure too much.
Unfortunately, the issue is not just loop ordering or loop unrolling. I have a
code generator which systematically tries all possible loop orderings and all
possible unroll factors. For this testcase (matrix sizes 4,10,10) the best Cray
output (this one) runs at 10.8 Gflops. The best gcc compiled version runs at
4.7 Gflops (smm_dnn_4_10_10_1_1_10_2). I attach the test code, which I use for
testing.
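The search space such a generator walks can be sketched in a few lines of C. This is not the attached generator, only an illustration of the idea: it assumes three loops (i, j, l) and that unroll factors are restricted to divisors of each trip count, which the comment above does not state. The `struct variant`, `variant_fn` and `enumerate_variants` names are hypothetical.

```c
#include <assert.h>

/* One candidate kernel: a loop nest order plus an unroll factor per loop. */
struct variant {
    int order[3];   /* permutation of 0 (i), 1 (j), 2 (l), outermost first */
    int unroll[3];  /* unroll factors for i, j, l */
};

typedef void (*variant_fn)(const struct variant *v, void *ctx);

/* Calls `visit` (if non-null) for every ordering/unroll combination and
   returns how many combinations there are.  Limiting unroll factors to
   divisors of the trip count is an assumption made for this sketch. */
long enumerate_variants(const int trip[3], variant_fn visit, void *ctx)
{
    static const int perms[6][3] = {
        {0, 1, 2}, {0, 2, 1}, {1, 0, 2}, {1, 2, 0}, {2, 0, 1}, {2, 1, 0}
    };
    struct variant v;
    long count = 0;
    int p, ui, uj, ul;

    assert(trip[0] > 0 && trip[1] > 0 && trip[2] > 0);

    for (p = 0; p < 6; p++)
        for (ui = 1; ui <= trip[0]; ui++) {
            if (trip[0] % ui != 0)
                continue;
            for (uj = 1; uj <= trip[1]; uj++) {
                if (trip[1] % uj != 0)
                    continue;
                for (ul = 1; ul <= trip[2]; ul++) {
                    if (trip[2] % ul != 0)
                        continue;
                    v.order[0] = perms[p][0];
                    v.order[1] = perms[p][1];
                    v.order[2] = perms[p][2];
                    v.unroll[0] = ui;
                    v.unroll[1] = uj;
                    v.unroll[2] = ul;
                    if (visit)
                        visit(&v, ctx);
                    count++;
                }
            }
        }
    return count;
}
```

Under these assumptions, trip counts 4, 10, 10 give 6 x 3 x 4 x 4 = 288 candidates, each of which would then be emitted as source, compiled, and timed.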
* [Bug tree-optimization/51179] poor vectorization on interlagos.
From: dominiq at lps dot ens.fr @ 2011-11-22 21:38 UTC
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
--- Comment #6 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-11-22 21:08:47 UTC ---
> ... I attach the test code, which I use for testing.
Compiling the code with -O3 gives the following ICE:
pr51179_1.f90: In function 'tiny_find':
pr51179_1.f90:3594:0: internal compiler error: in check_loop_closed_ssa_use, at tree-ssa-loop-manip.c:422
(if nobody beats me, I'll try to reduce the code and open a new pr). On a
2.5 GHz Core2Duo, I get 4.1 Gflops with -O2 -ftree-vectorize.
* [Bug tree-optimization/51179] poor vectorization on interlagos.
From: ubizjak at gmail dot com @ 2011-11-22 23:13 UTC
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
--- Comment #7 from Uros Bizjak <ubizjak at gmail dot com> 2011-11-22 22:00:36 UTC ---
(In reply to comment #3)
> Your testcase doesn't resemble the original; the inner for loops need
> their iteration variables reset.
Ah, indeed... fingers were too fast.
One additional data point with -O2 -ftree-vectorize -mfma4 -mavx with all
loops:
movslq %r8d, %rax
movl $C+32, %edx
xorl %esi, %esi
leaq B(,%rax,8), %rcx
movl $C, %eax
.L3:
>> vmovsd 80(%rcx), %xmm1
addl $2, %esi
vmovapd A(%rdi), %ymm0
>> vmovddup %xmm1, %xmm1
vbroadcastsd (%rcx), %ymm2
addq $160, %rcx
>> vinsertf128 $1, %xmm1, %ymm1, %ymm1
vfmaddpd (%rax), %ymm2, %ymm0, %ymm2
vmovapd %ymm2, (%rax)
addq $64, %rax
vfmaddpd (%rdx), %ymm1, %ymm0, %ymm0
vmovapd %ymm0, (%rdx)
addq $64, %rdx
cmpl $10, %esi
jne .L3
This could be just "vbroadcastsd 80(%rcx), %ymm1". For some reason the combine pass
does not form it.
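The equivalence claimed here can be checked with a small lane-level model in C. This is a sketch, not compiler output: `vec4` is a made-up stand-in for a 256-bit ymm register, and the two functions model the single broadcast and the three marked instructions respectively.

```c
#include <assert.h>

typedef struct { double lane[4]; } vec4;   /* models one 256-bit ymm register */

/* vbroadcastsd mem, %ymm: copy one double into all four lanes. */
vec4 broadcast_direct(const double *p)
{
    vec4 r;
    int i;

    assert(p);
    for (i = 0; i < 4; i++)
        r.lane[i] = *p;
    return r;
}

/* The three marked instructions, modeled one step at a time. */
vec4 broadcast_three_step(const double *p)
{
    vec4 r = { { 0.0, 0.0, 0.0, 0.0 } };

    assert(p);
    r.lane[0] = *p;          /* vmovsd: load into lane 0, other lanes zeroed */
    r.lane[1] = r.lane[0];   /* vmovddup: duplicate lane 0 into lane 1 */
    r.lane[2] = r.lane[0];   /* vinsertf128 $1: copy the low 128 bits */
    r.lane[3] = r.lane[1];   /*   (lanes 0-1) into the high 128 bits */
    return r;
}
```

Both functions leave the same value in all four lanes, which is why a single broadcast from memory would be enough in the loop above.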
* [Bug tree-optimization/51179] poor vectorization on interlagos.
From: Joost.VandeVondele at mat dot ethz.ch @ 2011-11-23 9:28 UTC
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
--- Comment #8 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2011-11-23 08:34:59 UTC ---
(In reply to comment #6)
> (if nobody beats me, I'll try to reduce the code and open a new pr).
I reproduced the ICE with 4.7, and started a delta reduce. It goes very
slowly, so not sure if this is going to beat you ;-).
* [Bug tree-optimization/51179] poor vectorization on interlagos.
From: Joost.VandeVondele at mat dot ethz.ch @ 2011-11-23 17:50 UTC
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
--- Comment #9 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2011-11-23 17:19:28 UTC ---
(In reply to comment #8)
> (In reply to comment #6)
> > (if nobody beats me, I'll try to reduce the code and open a new pr).
> I reproduced the ICE with 4.7, and started a delta reduce. It goes very
> slowly, so not sure if this is going to beat you ;-).
FYI, this ICE on tiny_find is now PR51285
* [Bug tree-optimization/51179] poor vectorization on interlagos.
From: Joost.VandeVondele at mat dot ethz.ch @ 2011-11-23 20:30 UTC
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
--- Comment #10 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2011-11-23 20:11:17 UTC ---
(In reply to comment #1)
> What about current 4.7 SVN?
The fastest 4x10 . 10x10 multiply as found with tiny_find.f90 yields somewhat
better results with 4.7, but not quite as effective as Cray's compiler.
Cray: 11.363 Gflops
4.7.0: 6.250 Gflops
4.6.1: 5.102 Gflops
* [Bug tree-optimization/51179] poor vectorization on interlagos.
From: Joost.VandeVondele at mat dot ethz.ch @ 2012-06-30 11:27 UTC
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
--- Comment #11 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2012-06-30 11:26:59 UTC ---
It looks like this problem is solved in the current 4.7 and 4.8 branches. At
least on an AVX machine, the best performance found by the code in comment #4
jumps from 5.3 Gflops in 4.6 to 13.9 Gflops in 4.7/4.8. Great work.
I can't test this right now on interlagos, but I guess this could be OK as
well.
* [Bug tree-optimization/51179] poor vectorization on interlagos.
From: rguenth at gcc dot gnu.org @ 2012-07-19 10:36 UTC
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51179
Richard Guenther <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
Target Milestone|--- |4.7.2
--- Comment #12 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-07-19 10:36:13 UTC ---
Thus, fixed.