public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2
@ 2015-01-22 15:15 jtaylor.debian at googlemail dot com
  2015-01-22 16:17 ` [Bug tree-optimization/64731] vector lowering should split loads and stores rguenth at gcc dot gnu.org
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: jtaylor.debian at googlemail dot com @ 2015-01-22 15:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731

            Bug ID: 64731
           Summary: poor code when using vector_size((32)) for sse2
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jtaylor.debian at googlemail dot com

It would be nice if, at least for some simple cases, a vector_size that is
too large for the selected instruction set still produced efficient code.
E.g. using a vector_size of 32 with only SSE2 available should ideally result
in what is essentially vector_size-16 code unrolled once, and the same source
still simply uses AVX if one compiles with the appropriate option.

But with current gcc 5.0, given this code:

/* a 32-byte vector of four doubles */
typedef double double4 __attribute__((vector_size(32)));

void fun(double * a, double * b)
{
    for (int i = 0; i < 1024; i+=4) {
        *(double4*)&a[i] += *(double4*)&b[i];
    }
}

With AVX enabled this turns into the expected code, but with only SSE2
enabled one gets the following:

gcc -O3 test2.c -c -std=c99

0000000000000000 <fun>:
   0:    4c 8d 54 24 08           lea    0x8(%rsp),%r10
   5:    48 83 e4 e0              and    $0xffffffffffffffe0,%rsp
   9:    31 c0                    xor    %eax,%eax
   b:    41 ff 72 f8              pushq  -0x8(%r10)
   f:    55                       push   %rbp
  10:    48 89 e5                 mov    %rsp,%rbp
  13:    41 52                    push   %r10
  15:    48 83 ec 10              sub    $0x10,%rsp
  19:    0f 1f 80 00 00 00 00     nopl   0x0(%rax)
  20:    48 8b 14 07              mov    (%rdi,%rax,1),%rdx
  24:    48 89 55 90              mov    %rdx,-0x70(%rbp)
  28:    48 8b 54 07 08           mov    0x8(%rdi,%rax,1),%rdx
  2d:    48 89 55 98              mov    %rdx,-0x68(%rbp)
  31:    48 8b 54 07 10           mov    0x10(%rdi,%rax,1),%rdx
  36:    48 89 55 a0              mov    %rdx,-0x60(%rbp)
  3a:    48 8b 54 07 18           mov    0x18(%rdi,%rax,1),%rdx
  3f:    48 89 55 a8              mov    %rdx,-0x58(%rbp)
  43:    48 8b 14 06              mov    (%rsi,%rax,1),%rdx
  47:    48 89 55 b0              mov    %rdx,-0x50(%rbp)
  4b:    48 8b 54 06 08           mov    0x8(%rsi,%rax,1),%rdx
  50:    48 89 55 b8              mov    %rdx,-0x48(%rbp)
  54:    48 8b 54 06 10           mov    0x10(%rsi,%rax,1),%rdx
  59:    66 0f 28 45 b0           movapd -0x50(%rbp),%xmm0
  5e:    66 0f 58 45 90           addpd  -0x70(%rbp),%xmm0
  63:    48 89 55 c0              mov    %rdx,-0x40(%rbp)
  67:    48 8b 54 06 18           mov    0x18(%rsi,%rax,1),%rdx
  6c:    48 89 55 c8              mov    %rdx,-0x38(%rbp)
  70:    0f 29 85 70 ff ff ff     movaps %xmm0,-0x90(%rbp)
  77:    66 48 0f 7e c2           movq   %xmm0,%rdx
  7c:    66 0f 28 45 c0           movapd -0x40(%rbp),%xmm0
  81:    48 89 14 07              mov    %rdx,(%rdi,%rax,1)
  85:    48 8b 95 78 ff ff ff     mov    -0x88(%rbp),%rdx
  8c:    66 0f 58 45 a0           addpd  -0x60(%rbp),%xmm0
  91:    0f 29 45 80              movaps %xmm0,-0x80(%rbp)
  95:    48 89 54 07 08           mov    %rdx,0x8(%rdi,%rax,1)
  9a:    48 8b 55 80              mov    -0x80(%rbp),%rdx
  9e:    48 89 54 07 10           mov    %rdx,0x10(%rdi,%rax,1)
  a3:    48 8b 55 88              mov    -0x78(%rbp),%rdx
  a7:    48 89 54 07 18           mov    %rdx,0x18(%rdi,%rax,1)
  ac:    48 83 c0 20              add    $0x20,%rax
  b0:    48 3d 00 20 00 00        cmp    $0x2000,%rax
  b6:    0f 85 64 ff ff ff        jne    20 <fun+0x20>
  bc:    48 83 c4 10              add    $0x10,%rsp
  c0:    41 5a                    pop    %r10
  c2:    5d                       pop    %rbp
  c3:    49 8d 62 f8              lea    -0x8(%r10),%rsp
  c7:    c3                       retq   
  c8:    0f 1f 84 00 00 00 00     nopl   0x0(%rax,%rax,1)
  cf:    00 


That is, both 32-byte operands are copied to the stack eight bytes at a time
with scalar integer moves, the two addpd instructions operate on the stack
copies, and the result is copied back out the same way.  Instead I would have
hoped for something along the lines of this:

  10:    66 0f 28 44 c6 10        movapd 0x10(%rsi,%rax,8),%xmm0
  16:    66 0f 28 0c c6           movapd (%rsi,%rax,8),%xmm1
  1b:    66 0f 58 0c c7           addpd  (%rdi,%rax,8),%xmm1
  20:    66 0f 58 44 c7 10        addpd  0x10(%rdi,%rax,8),%xmm0
  26:    66 0f 29 44 c7 10        movapd %xmm0,0x10(%rdi,%rax,8)
  2c:    66 0f 29 0c c7           movapd %xmm1,(%rdi,%rax,8)
  31:    48 83 c0 04              add    $0x4,%rax
  35:    3d 00 04 00 00           cmp    $0x400,%eax
  3a:    7c d4                    jl     10 <fun+0x10>
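
For comparison, that hoped-for loop is roughly what one gets by splitting the
32-byte vector into two 16-byte halves by hand, as in this sketch (the double2
type and fun_split are illustrative names, not from the original report):

typedef double double2 __attribute__((vector_size(16)));

/* same computation as fun(), with the 32-byte vector split by hand into
   two 16-byte halves that SSE2 can operate on directly */
void fun_split(double * a, double * b)
{
    for (int i = 0; i < 1024; i += 4) {
        *(double2*)&a[i]     += *(double2*)&b[i];
        *(double2*)&a[i + 2] += *(double2*)&b[i + 2];
    }
}

Compiled with -O3 and only SSE2 enabled, this should come out as a
movapd/addpd/movapd pair per half, i.e. essentially the vector_size-16 code
unrolled once that is asked for above.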


Thread overview: 9+ messages
2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
2015-01-22 16:17 ` [Bug tree-optimization/64731] vector lowering should split loads and stores rguenth at gcc dot gnu.org
2015-01-22 16:23 ` jakub at gcc dot gnu.org
2015-01-22 16:29 ` rguenther at suse dot de
2015-01-22 18:21 ` glisse at gcc dot gnu.org
2023-05-12  4:38 ` pinskia at gcc dot gnu.org
2023-05-12  6:42 ` rguenth at gcc dot gnu.org
2023-05-12 13:04 ` cvs-commit at gcc dot gnu.org
2023-05-12 13:05 ` rguenth at gcc dot gnu.org
