public inbox for gcc-bugs@sourceware.org
* [Bug target/65847] SSE2 code for adding two structs is much worse at -O3 than at -O2
       [not found] <bug-65847-4@http.gcc.gnu.org/bugzilla/>
@ 2015-04-22 14:03 ` rguenth at gcc dot gnu.org
  2021-03-24 12:57 ` rguenth at gcc dot gnu.org
  1 sibling, 0 replies; 2+ messages in thread
From: rguenth at gcc dot gnu.org @ 2015-04-22 14:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65847

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-04-22
                 CC|                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  The issue is that the vectorizer thinks x and y reside in memory
and thus it vectorizes the code as

  <bb 2>:
  vect__2.5_11 = MEM[(double *)&x];
  vect__3.8_13 = MEM[(double *)&y];
  vect__4.9_14 = vect__2.5_11 + vect__3.8_13;
  MEM[(double *)&D.1840] = vect__4.9_14;
  return D.1840;

which looks good.  But now comes the ABI and passes x, y and the return
value in registers ...
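The testcase itself is not quoted in this comment; a minimal reconstruction consistent with the GIMPLE dump above (the struct and field names are my assumption) would be:

```c
/* Hypothetical reconstruction of the testcase (names assumed, the
   thread does not quote the source): a struct of two doubles, added
   member-wise and returned by value.  */
struct X { double x, y; };

struct X add (struct X a, struct X b)
{
  struct X r;
  r.x = a.x + b.x;
  r.y = a.y + b.y;
  return r;
}
```

Under the x86_64 psABI such a struct is passed and returned in SSE registers, not memory, which is the mismatch described above.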

But even the best vectorized sequence would have four stmts: two to
pack the arguments into vector registers, one add and one unpack for
the return value.

Thus it seems the vectorizer should be informed of this ABI detail,
or, as a simple heuristic, never consider function arguments to be
"memory" it can perform vector loads on (which probably means
disabling group analysis on them?).

On i?86 with SSE2 we get

        movupd  8(%esp), %xmm1
        movl    4(%esp), %eax
        movupd  24(%esp), %xmm0
        addpd   %xmm1, %xmm0
        movups  %xmm0, (%eax)

vs.

        movsd   16(%esp), %xmm0
        movl    4(%esp), %eax
        movsd   8(%esp), %xmm1
        addsd   32(%esp), %xmm0
        addsd   24(%esp), %xmm1
        movsd   %xmm0, 8(%eax)
        movsd   %xmm1, (%eax)

which in the end looks even profitable (with -mfpmath=sse).

So a simple heuristic might pessimize things too much.

Replicating calls.c code to compute how the arguments are passed sounds
odd though...

Possibly the target can pessimize the loads in the target cost model
though (at least it can apply a more reasonable heuristic there).



* [Bug target/65847] SSE2 code for adding two structs is much worse at -O3 than at -O2
       [not found] <bug-65847-4@http.gcc.gnu.org/bugzilla/>
  2015-04-22 14:03 ` [Bug target/65847] SSE2 code for adding two structs is much worse at -O3 than at -O2 rguenth at gcc dot gnu.org
@ 2021-03-24 12:57 ` rguenth at gcc dot gnu.org
  1 sibling, 0 replies; 2+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-24 12:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65847

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Similarly

struct X { int a; int b; int c; int d; };

struct X foo (struct X x, struct X y)
{
  struct X res;
  res.a = x.a + y.a;
  res.b = x.b + y.b;
  res.c = x.c + y.c;
  res.d = x.d + y.d;
  return res;
}

is vectorized as

foo:
.LFB0:
        .cfi_startproc
        movq    %rdi, -40(%rsp)
        movq    %rsi, -32(%rsp)
        movdqa  -40(%rsp), %xmm0
        movq    %rdx, -24(%rsp)
        movq    %rcx, -16(%rsp)
        paddd   -24(%rsp), %xmm0
        movaps  %xmm0, -40(%rsp)
        movq    -40(%rsp), %rax
        movq    -32(%rsp), %rdx
        ret

which is bad because the on-stack construction of %xmm0 causes a
store-to-load forwarding (STLF) failure.  Unvectorized code isn't
necessarily worse, but the vectorized sequence can be improved to

foo:
.LFB0:
        .cfi_startproc
        movq    %rdi, %rax
        movq    %rdi, %r10
        movq    %rdx, %rdi
        movq    %rsi, %r9
        sarq    $32, %r10
        sarq    $32, %rdi
        addl    %edx, %eax
        movq    %rcx, %r8
        addl    %r10d, %edi
        sarq    $32, %r9
        movl    %eax, %eax
        leal    (%rsi,%rcx), %edx
        movl    %edi, %edi
        sarq    $32, %r8
        salq    $32, %rdi
        orq     %rdi, %rax
        leal    (%r9,%r8), %edi
        salq    $32, %rdi
        orq     %rdi, %rdx
        ret

In this case the spill is caused by LRA not knowing how to reload
the TImode reg built piecewise by the RTL expansion code.
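For illustration, the packing step the improved sequence needs can also be expressed with SSE2 intrinsics (a sketch of mine, not from the thread: building the XMM operands with register-to-register moves instead of the stack round-trip that triggers the STLF failure):

```c
#include <emmintrin.h>
#include <stdint.h>

/* Sketch (not from the thread): build the XMM operands from the four
   64-bit GPR halves of the two structs via _mm_set_epi64x (movq +
   punpcklqdq, register-to-register) rather than spilling them to the
   stack, then do the vector add of the four int lanes.  */
static void pack_add (uint64_t xlo, uint64_t xhi,
                      uint64_t ylo, uint64_t yhi, int32_t out[4])
{
  __m128i x = _mm_set_epi64x ((int64_t) xhi, (int64_t) xlo);
  __m128i y = _mm_set_epi64x ((int64_t) yhi, (int64_t) ylo);
  _mm_storeu_si128 ((__m128i *) out, _mm_add_epi32 (x, y));
}
```

Whether RTL expansion or LRA can be taught to emit this shape is the open question above.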

