public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2
@ 2015-01-22 15:15 jtaylor.debian at googlemail dot com
  2015-01-22 16:17 ` [Bug tree-optimization/64731] vector lowering should split loads and stores rguenth at gcc dot gnu.org
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: jtaylor.debian at googlemail dot com @ 2015-01-22 15:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731

            Bug ID: 64731
           Summary: poor code when using vector_size((32)) for sse2
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jtaylor.debian at googlemail dot com

It would be nice if, at least for some simple cases, a vector_size too large for
the selected instruction set still produced efficient code.
E.g., vector_size 32 with only SSE2 enabled would ideally result in essentially
once-unrolled vector_size-16 code, while the same source still simply uses AVX
if one compiles with the appropriate option.

But with current GCC 5.0 and this code:

typedef double double4 __attribute__((vector_size(32)));

void fun(double * a, double * b)
{
    for (int i = 0; i < 1024; i+=4) {
        *(double4*)&a[i] += *(double4*)&b[i];
    }
}

With AVX this turns into the expected code, but with only SSE2 enabled one gets
the following:

gcc -O3 test2.c -c -std=c99

0000000000000000 <fun>:
   0:    4c 8d 54 24 08           lea    0x8(%rsp),%r10
   5:    48 83 e4 e0              and    $0xffffffffffffffe0,%rsp
   9:    31 c0                    xor    %eax,%eax
   b:    41 ff 72 f8              pushq  -0x8(%r10)
   f:    55                       push   %rbp
  10:    48 89 e5                 mov    %rsp,%rbp
  13:    41 52                    push   %r10
  15:    48 83 ec 10              sub    $0x10,%rsp
  19:    0f 1f 80 00 00 00 00     nopl   0x0(%rax)
  20:    48 8b 14 07              mov    (%rdi,%rax,1),%rdx
  24:    48 89 55 90              mov    %rdx,-0x70(%rbp)
  28:    48 8b 54 07 08           mov    0x8(%rdi,%rax,1),%rdx
  2d:    48 89 55 98              mov    %rdx,-0x68(%rbp)
  31:    48 8b 54 07 10           mov    0x10(%rdi,%rax,1),%rdx
  36:    48 89 55 a0              mov    %rdx,-0x60(%rbp)
  3a:    48 8b 54 07 18           mov    0x18(%rdi,%rax,1),%rdx
  3f:    48 89 55 a8              mov    %rdx,-0x58(%rbp)
  43:    48 8b 14 06              mov    (%rsi,%rax,1),%rdx
  47:    48 89 55 b0              mov    %rdx,-0x50(%rbp)
  4b:    48 8b 54 06 08           mov    0x8(%rsi,%rax,1),%rdx
  50:    48 89 55 b8              mov    %rdx,-0x48(%rbp)
  54:    48 8b 54 06 10           mov    0x10(%rsi,%rax,1),%rdx
  59:    66 0f 28 45 b0           movapd -0x50(%rbp),%xmm0
  5e:    66 0f 58 45 90           addpd  -0x70(%rbp),%xmm0
  63:    48 89 55 c0              mov    %rdx,-0x40(%rbp)
  67:    48 8b 54 06 18           mov    0x18(%rsi,%rax,1),%rdx
  6c:    48 89 55 c8              mov    %rdx,-0x38(%rbp)
  70:    0f 29 85 70 ff ff ff     movaps %xmm0,-0x90(%rbp)
  77:    66 48 0f 7e c2           movq   %xmm0,%rdx
  7c:    66 0f 28 45 c0           movapd -0x40(%rbp),%xmm0
  81:    48 89 14 07              mov    %rdx,(%rdi,%rax,1)
  85:    48 8b 95 78 ff ff ff     mov    -0x88(%rbp),%rdx
  8c:    66 0f 58 45 a0           addpd  -0x60(%rbp),%xmm0
  91:    0f 29 45 80              movaps %xmm0,-0x80(%rbp)
  95:    48 89 54 07 08           mov    %rdx,0x8(%rdi,%rax,1)
  9a:    48 8b 55 80              mov    -0x80(%rbp),%rdx
  9e:    48 89 54 07 10           mov    %rdx,0x10(%rdi,%rax,1)
  a3:    48 8b 55 88              mov    -0x78(%rbp),%rdx
  a7:    48 89 54 07 18           mov    %rdx,0x18(%rdi,%rax,1)
  ac:    48 83 c0 20              add    $0x20,%rax
  b0:    48 3d 00 20 00 00        cmp    $0x2000,%rax
  b6:    0f 85 64 ff ff ff        jne    20 <fun+0x20>
  bc:    48 83 c4 10              add    $0x10,%rsp
  c0:    41 5a                    pop    %r10
  c2:    5d                       pop    %rbp
  c3:    49 8d 62 f8              lea    -0x8(%r10),%rsp
  c7:    c3                       retq   
  c8:    0f 1f 84 00 00 00 00     nopl   0x0(%rax,%rax,1)
  cf:    00 


while I would have hoped for something along the lines of this:

  10:    66 0f 28 44 c6 10        movapd 0x10(%rsi,%rax,8),%xmm0
  16:    66 0f 28 0c c6           movapd (%rsi,%rax,8),%xmm1
  1b:    66 0f 58 0c c7           addpd  (%rdi,%rax,8),%xmm1
  20:    66 0f 58 44 c7 10        addpd  0x10(%rdi,%rax,8),%xmm0
  26:    66 0f 29 44 c7 10        movapd %xmm0,0x10(%rdi,%rax,8)
  2c:    66 0f 29 0c c7           movapd %xmm1,(%rdi,%rax,8)
  31:    48 83 c0 04              add    $0x4,%rax
  35:    3d 00 04 00 00           cmp    $0x400,%eax
  3a:    7c d4                    jl     10 <fun+0x10>
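
That hoped-for loop is just the original one with the 32-byte vectors split by
hand into 16-byte halves, which SSE2 handles natively.  A source-level sketch
(fun_split and double2 are illustrative names, not part of the report) that
gcc -O3 with plain SSE2 compiles to roughly the code above:

typedef double double2 __attribute__((vector_size(16)));

void fun_split(double * a, double * b)
{
    for (int i = 0; i < 1024; i += 4) {
        /* two native 16-byte loads per operand instead of one 32-byte load */
        double2 a0 = *(double2*)&a[i];
        double2 a1 = *(double2*)&a[i + 2];
        double2 b0 = *(double2*)&b[i];
        double2 b1 = *(double2*)&b[i + 2];
        /* one addpd per half, then two native 16-byte stores */
        *(double2*)&a[i]     = a0 + b0;
        *(double2*)&a[i + 2] = a1 + b1;
    }
}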



* [Bug tree-optimization/64731] vector lowering should split loads and stores
  2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
@ 2015-01-22 16:17 ` rguenth at gcc dot gnu.org
  2015-01-22 16:23 ` jakub at gcc dot gnu.org
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-01-22 16:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-01-22
                 CC|                            |rguenth at gcc dot gnu.org
            Summary|poor code when using        |vector lowering should
                   |vector_size((32)) for sse2  |split loads and stores
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, the issue is "simple": veclower doesn't split the loads/stores themselves,
only the registers:

  <bb 3>:
  # ivtmp.11_24 = PHI <ivtmp.11_23(3), 0(2)>
  _8 = MEM[base: a_6(D), index: ivtmp.11_24, offset: 0B];
  _11 = MEM[base: b_9(D), index: ivtmp.11_24, offset: 0B];
  _17 = BIT_FIELD_REF <_8, 128, 0>;
  _4 = BIT_FIELD_REF <_11, 128, 0>;
  _5 = _4 + _17;
  _29 = BIT_FIELD_REF <_8, 128, 128>;
  _28 = BIT_FIELD_REF <_11, 128, 128>;
  _14 = _28 + _29;
  _12 = {_5, _14};
  MEM[base: a_6(D), index: ivtmp.11_24, offset: 0B] = _12;
  ivtmp.11_23 = ivtmp.11_24 + 32;
  if (ivtmp.11_23 != 8192)
    goto <bb 3>;
  else
    goto <bb 4>;

In this case it would also have a moderately hard time splitting the
loads/stores, as it is faced with TARGET_MEM_REFs already.
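
In source terms, the lowered form above corresponds to roughly this hand-written
analogy (a sketch for illustration only, with local illustrative names; the
BIT_FIELD_REFs become half extracts from a full 32-byte value and the CTOR
reassembles one, which on SSE2 forces the spills seen in the objdump):

typedef double double2 __attribute__((vector_size(16)));
typedef double double4 __attribute__((vector_size(32)));

void fun_lowered(double * a, double * b)
{
    for (int i = 0; i < 1024; i += 4) {
        double4 va = *(double4*)&a[i];   /* whole 32-byte load, kept intact */
        double4 vb = *(double4*)&b[i];
        /* the BIT_FIELD_REFs: extract the two 16-byte halves */
        double2 lo = (double2){va[0], va[1]} + (double2){vb[0], vb[1]};
        double2 hi = (double2){va[2], va[3]} + (double2){vb[2], vb[3]};
        /* the CTOR {_5, _14}: reassemble, then whole 32-byte store */
        *(double4*)&a[i] = (double4){lo[0], lo[1], hi[0], hi[1]};
    }
}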

Nothing combines this back into a sane form.  I've recently added code that
handles exactly the same situation but only for complex arithmetic
(in tree-ssa-forwprop.c for PR64568).

I wonder why, with only -msse2, IVOPTs produces TARGET_MEM_REFs for the loads.
Surely x86_64 cannot load a V4DF in one instruction...



* [Bug tree-optimization/64731] vector lowering should split loads and stores
  2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
  2015-01-22 16:17 ` [Bug tree-optimization/64731] vector lowering should split loads and stores rguenth at gcc dot gnu.org
@ 2015-01-22 16:23 ` jakub at gcc dot gnu.org
  2015-01-22 16:29 ` rguenther at suse dot de
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: jakub at gcc dot gnu.org @ 2015-01-22 16:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731

--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
This doesn't look like a regression; I see roughly the same code quality all the
way from 4.1 (the first version I tried) to current trunk.



* [Bug tree-optimization/64731] vector lowering should split loads and stores
  2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
  2015-01-22 16:17 ` [Bug tree-optimization/64731] vector lowering should split loads and stores rguenth at gcc dot gnu.org
  2015-01-22 16:23 ` jakub at gcc dot gnu.org
@ 2015-01-22 16:29 ` rguenther at suse dot de
  2015-01-22 18:21 ` glisse at gcc dot gnu.org
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenther at suse dot de @ 2015-01-22 16:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 22 Jan 2015, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731
> 
> --- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Doesn't look like a regression, I see roughly same code quality all the way
> from 4.1 which I tried first to current trunk.

Yes, it's not a regression and a fix has to wait for GCC 6.



* [Bug tree-optimization/64731] vector lowering should split loads and stores
  2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
                   ` (2 preceding siblings ...)
  2015-01-22 16:29 ` rguenther at suse dot de
@ 2015-01-22 18:21 ` glisse at gcc dot gnu.org
  2023-05-12  4:38 ` pinskia at gcc dot gnu.org
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: glisse at gcc dot gnu.org @ 2015-01-22 18:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731

--- Comment #4 from Marc Glisse <glisse at gcc dot gnu.org> ---
Seems strongly related to PR55266 (which also references PR52436).



* [Bug tree-optimization/64731] vector lowering should split loads and stores
  2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
                   ` (3 preceding siblings ...)
  2015-01-22 18:21 ` glisse at gcc dot gnu.org
@ 2023-05-12  4:38 ` pinskia at gcc dot gnu.org
  2023-05-12  6:42 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-05-12  4:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
On trunk at -O2 we get:
.L2:
        leaq    (%rdi,%rax), %rcx
        leaq    (%rsi,%rax), %rdx
        movapd  (%rdx), %xmm0
        addpd   (%rcx), %xmm0
        movaps  %xmm0, -64(%rsp)
        movapd  16(%rcx), %xmm0
        addpd   16(%rdx), %xmm0
        movaps  %xmm0, -48(%rsp)
        movdqa  -64(%rsp), %xmm0
        movaps  %xmm0, (%rdi,%rax)
        movdqa  -48(%rsp), %xmm0
        movaps  %xmm0, 16(%rdi,%rax)
        addq    $32, %rax
        cmpq    $8192, %rax
        jne     .L2

This is definitely better than before, though the CTOR store still goes through
a stack temporary (the movaps/movdqa round trips via -64(%rsp) and -48(%rsp)).


* [Bug tree-optimization/64731] vector lowering should split loads and stores
  2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
                   ` (4 preceding siblings ...)
  2023-05-12  4:38 ` pinskia at gcc dot gnu.org
@ 2023-05-12  6:42 ` rguenth at gcc dot gnu.org
  2023-05-12 13:04 ` cvs-commit at gcc dot gnu.org
  2023-05-12 13:05 ` rguenth at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-05-12  6:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Yes, forwprop splits the vector loads:

  _5 = &MEM[(double4 *)a_11(D) + ivtmp.14_22 * 1];
  _1 = BIT_FIELD_REF <MEM[(double4 *)_5], 128, 128>;
  _25 = BIT_FIELD_REF <MEM[(double4 *)_5], 128, 0>;
  _14 = &MEM[(double4 *)b_12(D) + ivtmp.14_22 * 1];
  _2 = BIT_FIELD_REF <MEM[(double4 *)_14], 128, 128>;
  _17 = BIT_FIELD_REF <MEM[(double4 *)_14], 128, 0>;
  _24 = _17 + _25;
  _3 = _1 + _2;

but not the store from the CTOR:

  _7 = {_24, _3};
  MEM[(double4 *)a_11(D) + ivtmp.14_22 * 1] = _7;

forwprop would also split that, but we have

          else if (code == CONSTRUCTOR
                   && VECTOR_TYPE_P (TREE_TYPE (rhs))
                   && TYPE_MODE (TREE_TYPE (rhs)) == BLKmode
                   && CONSTRUCTOR_NELTS (rhs) > 0
                   && (!VECTOR_TYPE_P (TREE_TYPE (CONSTRUCTOR_ELT (rhs, 0)->value))
                       || (TYPE_MODE (TREE_TYPE (CONSTRUCTOR_ELT (rhs, 0)->value))
                           != BLKmode)))
            {
              /* Rewrite stores of a single-use vector constructors
                 to component-wise stores if the mode isn't supported.  */
              use_operand_p use_p;
              gimple *use_stmt;
              if (single_imm_use (lhs, &use_p, &use_stmt)
                  && gimple_store_p (use_stmt)
                  && !gimple_has_volatile_ops (use_stmt)
                  && !stmt_can_throw_internal (fun, use_stmt)
                  && is_gimple_assign (use_stmt)
                  && (TREE_CODE (gimple_assign_lhs (use_stmt))
                      != TARGET_MEM_REF))

and in this case there's a TARGET_MEM_REF on the LHS.  With -fno-ivopts we get:

.L2:
        movslq  %ecx, %rax
        addl    $4, %ecx
        salq    $3, %rax
        leaq    (%rdi,%rax), %rdx
        addq    %rsi, %rax
        movapd  16(%rax), %xmm0
        movapd  (%rdx), %xmm1
        addpd   16(%rdx), %xmm0
        addpd   (%rax), %xmm1
        movaps  %xmm0, 16(%rdx)
        movaps  %xmm1, (%rdx)
        subl    $1, %r8d
        jne     .L2

We could use the same trick as optimize_vector_load: instead of using the
TARGET_MEM_REF as a memory reference, use it only for address generation.
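
In source terms, that trick would amount to keeping the TARGET_MEM_REF only for
computing the store address and then storing the CTOR's two V2DF elements
piecewise.  A sketch under that assumption (reusing the lo/hi halves from the
analogy in comment #1; not the actual patch):

        /* use the TARGET_MEM_REF only as address generation ...  */
        double *p = &a[i];
        /* ... then split the CTOR store into two supported 16-byte
           component stores instead of one unsupported 32-byte store  */
        *(double2 *)p       = lo;
        *(double2 *)(p + 2) = hi;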


* [Bug tree-optimization/64731] vector lowering should split loads and stores
  2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
                   ` (5 preceding siblings ...)
  2023-05-12  6:42 ` rguenth at gcc dot gnu.org
@ 2023-05-12 13:04 ` cvs-commit at gcc dot gnu.org
  2023-05-12 13:05 ` rguenth at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-05-12 13:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731

--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:cc0e22b3f25d4b2a326322bce711179c02377e6c

commit r14-788-gcc0e22b3f25d4b2a326322bce711179c02377e6c
Author: Richard Biener <rguenther@suse.de>
Date:   Fri May 12 13:43:27 2023 +0200

    tree-optimization/64731 - extend store-from CTOR lowering to TARGET_MEM_REF

    The following also covers TARGET_MEM_REF when decomposing stores from
    CTORs to supported elementwise operations.  This avoids spilling
    and cleans up after vector lowering which doesn't touch loads or
    stores.  It also mimics what we already do for loads.

            PR tree-optimization/64731
            * tree-ssa-forwprop.cc (pass_forwprop::execute): Also
            handle TARGET_MEM_REF destinations of stores from vector
            CTORs.

            * gcc.target/i386/pr64731.c: New testcase.


* [Bug tree-optimization/64731] vector lowering should split loads and stores
  2015-01-22 15:15 [Bug tree-optimization/64731] New: poor code when using vector_size((32)) for sse2 jtaylor.debian at googlemail dot com
                   ` (6 preceding siblings ...)
  2023-05-12 13:04 ` cvs-commit at gcc dot gnu.org
@ 2023-05-12 13:05 ` rguenth at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-05-12 13:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|---                         |14.0
      Known to fail|                            |13.1.0

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed for GCC 14.


