From: "jtaylor.debian at googlemail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2
Date: Thu, 22 Jan 2015 15:15:00 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731

            Bug ID: 64731
           Summary: poor code when using vector_size((32)) for sse2
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jtaylor.debian at googlemail dot com

It would be nice if, for some simple cases, a vector_size too large for the
selected instruction set still produced efficient code. E.g. a vector_size of
32 on SSE2 amounts to essentially once-unrolled vector_size-16 code, and the
same source simply uses AVX if one compiles with the appropriate option.
But with current gcc 5.0 and this code:

typedef double double4 __attribute__((vector_size(32)));

void fun(double * a, double * b)
{
    for (int i = 0; i < 1024; i += 4) {
        *(double4*)&a[i] += *(double4*)&b[i];
    }
}

with AVX this turns into the expected code, but with only SSE2 enabled one
gets this:

gcc -O3 test2.c -c -std=c99

0000000000000000 <fun>:
   0:  4c 8d 54 24 05          lea    0x8(%rsp),%r10
   5:  48 83 e4 e0             and    $0xffffffffffffffe0,%rsp
   9:  31 c0                   xor    %eax,%eax
   b:  41 ff 72 f8             pushq  -0x8(%r10)
   f:  55                      push   %rbp
  10:  48 89 e5                mov    %rsp,%rbp
  13:  41 52                   push   %r10
  15:  48 83 ec 10             sub    $0x10,%rsp
  19:  0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  20:  48 8b 14 07             mov    (%rdi,%rax,1),%rdx
  24:  48 89 55 90             mov    %rdx,-0x70(%rbp)
  28:  48 8b 54 07 08          mov    0x8(%rdi,%rax,1),%rdx
  2d:  48 89 55 98             mov    %rdx,-0x68(%rbp)
  31:  48 8b 54 07 10          mov    0x10(%rdi,%rax,1),%rdx
  36:  48 89 55 a0             mov    %rdx,-0x60(%rbp)
  3a:  48 8b 54 07 18          mov    0x18(%rdi,%rax,1),%rdx
  3f:  48 89 55 a8             mov    %rdx,-0x58(%rbp)
  43:  48 8b 14 06             mov    (%rsi,%rax,1),%rdx
  47:  48 89 55 b0             mov    %rdx,-0x50(%rbp)
  4b:  48 8b 54 06 08          mov    0x8(%rsi,%rax,1),%rdx
  50:  48 89 55 b8             mov    %rdx,-0x48(%rbp)
  54:  48 8b 54 06 10          mov    0x10(%rsi,%rax,1),%rdx
  59:  66 0f 28 45 b0          movapd -0x50(%rbp),%xmm0
  5e:  66 0f 58 45 90          addpd  -0x70(%rbp),%xmm0
  63:  48 89 55 c0             mov    %rdx,-0x40(%rbp)
  67:  48 8b 54 06 18          mov    0x18(%rsi,%rax,1),%rdx
  6c:  48 89 55 c8             mov    %rdx,-0x38(%rbp)
  70:  0f 29 85 70 ff ff ff    movaps %xmm0,-0x90(%rbp)
  77:  66 48 0f 7e c2          movq   %xmm0,%rdx
  7c:  66 0f 28 45 c0          movapd -0x40(%rbp),%xmm0
  81:  48 89 14 07             mov    %rdx,(%rdi,%rax,1)
  85:  48 8b 95 78 ff ff ff    mov    -0x88(%rbp),%rdx
  8c:  66 0f 58 45 a0          addpd  -0x60(%rbp),%xmm0
  91:  0f 29 45 80             movaps %xmm0,-0x80(%rbp)
  95:  48 89 54 07 08          mov    %rdx,0x8(%rdi,%rax,1)
  9a:  48 8b 55 80             mov    -0x80(%rbp),%rdx
  9e:  48 89 54 07 10          mov    %rdx,0x10(%rdi,%rax,1)
  a3:  48 8b 55 88             mov    -0x78(%rbp),%rdx
  a7:  48 89 54 07 18          mov    %rdx,0x18(%rdi,%rax,1)
  ac:  48 83 c0 20             add    $0x20,%rax
  b0:  48 3d 00 20 00 00       cmp    $0x2000,%rax
  b6:  0f 85 64 ff ff ff       jne    20 <fun+0x20>
  bc:  48 83 c4 10             add    $0x10,%rsp
  c0:  41 5a                   pop    %r10
  c2:  5d                      pop    %rbp
  c3:  49 8d 62 f8             lea    -0x8(%r10),%rsp
  c7:  c3                      retq
  c8:  0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  cf:  00

while I would have hoped for something along the lines of this:

  10:  66 0f 28 44 c6 10       movapd 0x10(%rsi,%rax,8),%xmm0
  16:  66 0f 28 0c c6          movapd (%rsi,%rax,8),%xmm1
  1b:  66 0f 58 0c c7          addpd  (%rdi,%rax,8),%xmm1
  20:  66 0f 58 44 c7 10       addpd  0x10(%rdi,%rax,8),%xmm0
  26:  66 0f 29 44 c7 10       movapd %xmm0,0x10(%rdi,%rax,8)
  2c:  66 0f 29 0c c7          movapd %xmm1,(%rdi,%rax,8)
  31:  48 83 c0 04             add    $0x4,%rax
  35:  3d 00 04 00 00          cmp    $0x400,%eax
  3a:  7c d4                   jl     10
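For comparison, the kernel written with a vector width SSE2 supports natively
already compiles to roughly the hoped-for loop today. A minimal sketch (the
double2 typedef and the fun16 name are mine for illustration, not part of the
test case above):

typedef double double2 __attribute__((vector_size(16)));

void fun16(double * a, double * b)
{
    /* Two doubles fit one 16-byte SSE2 register, so each iteration
       should lower to a plain movapd/addpd/movapd sequence. */
    for (int i = 0; i < 1024; i += 2) {
        *(double2*)&a[i] += *(double2*)&b[i];
    }
}

Built with the same command (gcc -O3 test2.c -c -std=c99), this is essentially
the once-unrolled vector_size-16 code the vector_size-32 version could be
lowered to on SSE2.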