From: "jtaylor.debian at googlemail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2
Date: Thu, 22 Jan 2015 15:15:00 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731

            Bug ID: 64731
           Summary: poor code when using vector_size((32)) for sse2
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jtaylor.debian at googlemail dot com

It would be nice if, for some simple cases, a vector_size too large for the
selected instruction set still produced efficient code. E.g. a vector_size of
32 on SSE2 amounts to essentially once-unrolled vector_size-16 code, and the
same source simply uses AVX if one compiles with the appropriate option.
But with current gcc 5.0 and this code:

typedef double double4 __attribute__((vector_size(32)));

void fun(double * a, double * b)
{
    for (int i = 0; i < 1024; i += 4) {
        *(double4*)&a[i] += *(double4*)&b[i];
    }
}

with AVX this turns into the expected code, but with only SSE2 enabled one
gets this:

gcc -O3 test2.c -c -std=c99

0000000000000000 <fun>:
   0:  4c 8d 54 24 05          lea    0x8(%rsp),%r10
   5:  48 83 e4 e0             and    $0xffffffffffffffe0,%rsp
   9:  31 c0                   xor    %eax,%eax
   b:  41 ff 72 f8             pushq  -0x8(%r10)
   f:  55                      push   %rbp
  10:  48 89 e5                mov    %rsp,%rbp
  13:  41 52                   push   %r10
  15:  48 83 ec 10             sub    $0x10,%rsp
  19:  0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  20:  48 8b 14 07             mov    (%rdi,%rax,1),%rdx
  24:  48 89 55 90             mov    %rdx,-0x70(%rbp)
  28:  48 8b 54 07 08          mov    0x8(%rdi,%rax,1),%rdx
  2d:  48 89 55 98             mov    %rdx,-0x68(%rbp)
  31:  48 8b 54 07 10          mov    0x10(%rdi,%rax,1),%rdx
  36:  48 89 55 a0             mov    %rdx,-0x60(%rbp)
  3a:  48 8b 54 07 18          mov    0x18(%rdi,%rax,1),%rdx
  3f:  48 89 55 a8             mov    %rdx,-0x58(%rbp)
  43:  48 8b 14 06             mov    (%rsi,%rax,1),%rdx
  47:  48 89 55 b0             mov    %rdx,-0x50(%rbp)
  4b:  48 8b 54 06 08          mov    0x8(%rsi,%rax,1),%rdx
  50:  48 89 55 b8             mov    %rdx,-0x48(%rbp)
  54:  48 8b 54 06 10          mov    0x10(%rsi,%rax,1),%rdx
  59:  66 0f 28 45 b0          movapd -0x50(%rbp),%xmm0
  5e:  66 0f 58 45 90          addpd  -0x70(%rbp),%xmm0
  63:  48 89 55 c0             mov    %rdx,-0x40(%rbp)
  67:  48 8b 54 06 18          mov    0x18(%rsi,%rax,1),%rdx
  6c:  48 89 55 c8             mov    %rdx,-0x38(%rbp)
  70:  0f 29 85 70 ff ff ff    movaps %xmm0,-0x90(%rbp)
  77:  66 48 0f 7e c2          movq   %xmm0,%rdx
  7c:  66 0f 28 45 c0          movapd -0x40(%rbp),%xmm0
  81:  48 89 14 07             mov    %rdx,(%rdi,%rax,1)
  85:  48 8b 95 78 ff ff ff    mov    -0x88(%rbp),%rdx
  8c:  66 0f 58 45 a0          addpd  -0x60(%rbp),%xmm0
  91:  0f 29 45 80             movaps %xmm0,-0x80(%rbp)
  95:  48 89 54 07 08          mov    %rdx,0x8(%rdi,%rax,1)
  9a:  48 8b 55 80             mov    -0x80(%rbp),%rdx
  9e:  48 89 54 07 10          mov    %rdx,0x10(%rdi,%rax,1)
  a3:  48 8b 55 88             mov    -0x78(%rbp),%rdx
  a7:  48 89 54 07 18          mov    %rdx,0x18(%rdi,%rax,1)
  ac:  48 83 c0 20             add    $0x20,%rax
  b0:  48 3d 00 20 00 00       cmp    $0x2000,%rax
  b6:  0f 85 64 ff ff ff       jne    20 <fun+0x20>
  bc:  48 83 c4 10             add    $0x10,%rsp
  c0:  41 5a                   pop    %r10
  c2:  5d                      pop    %rbp
  c3:  49 8d 62 f8             lea    -0x8(%r10),%rsp
  c7:  c3                      retq
  c8:  0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  cf:  00

while I would have hoped for something along the lines of this:

  10:  66 0f 28 44 c6 10       movapd 0x10(%rsi,%rax,8),%xmm0
  16:  66 0f 28 0c c6          movapd (%rsi,%rax,8),%xmm1
  1b:  66 0f 58 0c c7          addpd  (%rdi,%rax,8),%xmm1
  20:  66 0f 58 44 c7 10       addpd  0x10(%rdi,%rax,8),%xmm0
  26:  66 0f 29 44 c7 10       movapd %xmm0,0x10(%rdi,%rax,8)
  2c:  66 0f 29 0c c7          movapd %xmm1,(%rdi,%rax,8)
  31:  48 83 c0 04             add    $0x4,%rax
  35:  3d 00 04 00 00          cmp    $0x400,%eax
  3a:  7c d4                   jl     10
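For comparison, the kernel written with a vector width SSE2 supports natively
already compiles to roughly the hoped-for loop today. A minimal sketch (the
double2 typedef and the fun16 name are mine for illustration, not part of the
test case above):

typedef double double2 __attribute__((vector_size(16)));

void fun16(double * a, double * b)
{
    /* Two doubles fit one 16-byte SSE2 register, so each iteration
       should lower to a plain movapd/addpd/movapd sequence. */
    for (int i = 0; i < 1024; i += 2) {
        *(double2*)&a[i] += *(double2*)&b[i];
    }
}

Built with the same command (gcc -O3 test2.c -c -std=c99), this is essentially
the once-unrolled vector_size-16 code the vector_size-32 version could be
lowered to on SSE2.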