public inbox for gcc-bugs@sourceware.org
* [Bug target/47754] New: [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand
@ 2011-02-15 15:37 kretz at kde dot org
  2011-02-15 16:22 ` [Bug target/47754] " rguenth at gcc dot gnu.org
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: kretz at kde dot org @ 2011-02-15 15:37 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

           Summary: [missed optimization] AVX allows unaligned memory
                    operands but GCC uses unaligned load and register
                    operand
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: target
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: kretz@kde.org


According to the AVX docs: "With the exception of explicitly aligned 16 or 32
byte SIMD load/store instructions, most VEX-encoded, arithmetic and data
processing instructions operate in a flexible environment regarding memory
address alignment, i.e. VEX-encoded instruction with 32-byte or 16-byte load
semantics will support unaligned load operation by default. Memory arguments
for most instructions with VEX prefix operate normally without causing #GP(0)
on any byte-granularity alignment (unlike Legacy SSE instructions)."

I tested whether GCC would take advantage of this, and found that it doesn't:

_mm256_store_ps(&data[3],
  _mm256_add_ps(_mm256_load_ps(&data[0]), _mm256_load_ps(&data[1]))
);
compiles to:
vmovaps 0x200b18(%rip),%ymm0
vaddps 0x200b13(%rip),%ymm0,%ymm0
vmovaps %ymm0,0x200b10(%rip)

whereas

_mm256_store_ps(&data[3],
  _mm256_add_ps(_mm256_loadu_ps(&data[0]), _mm256_loadu_ps(&data[1]))
);
compiles to:
vmovups 0x200b4c(%rip),%ymm0
vmovups 0x200b40(%rip),%ymm1
vaddps %ymm0,%ymm1,%ymm0
vmovaps %ymm0,0x200b3c(%rip)

GCC could use an unaligned memory operand in the vaddps here instead. According to
the AVX docs this doesn't hurt performance, and as far as I understand it reduces
register pressure.

Would be nice to have.
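
For reference, a self-contained version of the testcase (the declaration of `data`,
its alignment, and the function names are my assumptions; the report only shows the
store/add/load expressions):

/* Minimal sketch; compile with e.g. gcc -O2 -mavx and inspect the generated
   code.  The `data` array and its 32-byte alignment are assumptions;
   &data[1] and &data[3] are deliberately not 32-byte aligned, so this is
   only meant for looking at the codegen, not for running.  */
#include <immintrin.h>

float data[16] __attribute__((aligned(32)));

void aligned_loads(void)
{
    _mm256_store_ps(&data[3],
        _mm256_add_ps(_mm256_load_ps(&data[0]), _mm256_load_ps(&data[1])));
}

void unaligned_loads(void)
{
    _mm256_store_ps(&data[3],
        _mm256_add_ps(_mm256_loadu_ps(&data[0]), _mm256_loadu_ps(&data[1])));
}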



* [Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand
  2011-02-15 15:37 [Bug target/47754] New: [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand kretz at kde dot org
@ 2011-02-15 16:22 ` rguenth at gcc dot gnu.org
  2011-02-15 16:40 ` kretz at kde dot org
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-02-15 16:22 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.02.15 16:21:49
     Ever Confirmed|0                           |1

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-02-15 16:21:49 UTC ---
Confirmed.  Not sure whether it really wouldn't be slower for a non-load/store
instruction to need assistance with unaligned loads/stores.



* [Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand
  2011-02-15 15:37 [Bug target/47754] New: [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand kretz at kde dot org
  2011-02-15 16:22 ` [Bug target/47754] " rguenth at gcc dot gnu.org
@ 2011-02-15 16:40 ` kretz at kde dot org
  2011-02-15 16:51 ` kretz at kde dot org
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: kretz at kde dot org @ 2011-02-15 16:40 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #2 from Matthias Kretz <kretz at kde dot org> 2011-02-15 16:31:39 UTC ---
True, the Optimization Reference Manual and the AVX docs are not very specific
about the performance impact of this. But as far as I understand the docs, it will
internally be no slower than an unaligned load + op, but also no faster - except,
of course, where memory fetch latency is involved. So it's just about having more
registers available - again, as far as I understand.
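
A hypothetical illustration of the register-pressure point (the `buf` array and the
function are mine, not from the report): loads emitted as separate vmovups
instructions each tie up a ymm register until consumed, whereas loads folded into
vaddps/vmulps as memory operands do not.

#include <immintrin.h>

extern float buf[64];   /* hypothetical input; only 4-byte-aligned offsets used */

__m256 combine(void)
{
    /* Four unaligned loads.  Emitted as separate vmovups instructions, each
       loaded value occupies a ymm register until it is consumed; folded into
       the arithmetic instructions as memory operands (legal under AVX), only
       the two intermediate sums would need registers.  */
    __m256 a = _mm256_loadu_ps(&buf[1]);
    __m256 b = _mm256_loadu_ps(&buf[9]);
    __m256 c = _mm256_loadu_ps(&buf[17]);
    __m256 d = _mm256_loadu_ps(&buf[25]);
    return _mm256_mul_ps(_mm256_add_ps(a, b), _mm256_add_ps(c, d));
}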

If you want I can try the same testcase on ICC...



* [Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand
  2011-02-15 15:37 [Bug target/47754] New: [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand kretz at kde dot org
  2011-02-15 16:22 ` [Bug target/47754] " rguenth at gcc dot gnu.org
  2011-02-15 16:40 ` kretz at kde dot org
@ 2011-02-15 16:51 ` kretz at kde dot org
  2011-02-16 10:56 ` rguenth at gcc dot gnu.org
  2012-02-22 13:54 ` xiaoyuanbo at yeah dot net
  4 siblings, 0 replies; 6+ messages in thread
From: kretz at kde dot org @ 2011-02-15 16:51 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #3 from Matthias Kretz <kretz at kde dot org> 2011-02-15 16:40:38 UTC ---
ICC??? Whatever, I stopped trusting that compiler long ago:
<unaligned()>:
vmovups 0x2039b8(%rip),%xmm0
vmovups 0x2039b4(%rip),%xmm1
vinsertf128 $0x1,0x2039b6(%rip),%ymm0,%ymm2
vinsertf128 $0x1,0x2039b0(%rip),%ymm1,%ymm3
vaddps %ymm3,%ymm2,%ymm4
vmovups %ymm4,0x20399c(%rip)
vzeroupper
retq

<aligned()>:
vmovups 0x203978(%rip),%ymm0
vaddps 0x203974(%rip),%ymm0,%ymm1
vmovups %ymm1,0x203974(%rip)
vzeroupper
retq

Nice optimization of unaligned loads there... not.
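
For reference, each vmovups/vinsertf128 pair in ICC's unaligned version amounts to
loading the 256-bit value as two 128-bit halves, roughly this pattern in intrinsics
(a sketch of mine, not derived from ICC's output or sources):

#include <immintrin.h>

/* Load an unaligned 256-bit vector as two 128-bit halves: vmovups for the
   low half, vinsertf128 to merge in the high half.  */
static inline __m256 loadu_as_two_halves(const float *p)
{
    __m256 lo = _mm256_castps128_ps256(_mm_loadu_ps(p));      /* vmovups     */
    return _mm256_insertf128_ps(lo, _mm_loadu_ps(p + 4), 1);  /* vinsertf128 */
}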


Just a small side note for your enjoyment: I wrote a C++ abstraction for SSE, and
with GCC it gives an almost four-fold speedup for Mandelbrot. ICC, on the other
hand, compiles such awful code that - even with SSE in use - it actually produces a
four-fold slowdown compared to the non-SSE code.

GCC really is a nice compiler! Keep on rocking!



* [Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand
  2011-02-15 15:37 [Bug target/47754] New: [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand kretz at kde dot org
                   ` (2 preceding siblings ...)
  2011-02-15 16:51 ` kretz at kde dot org
@ 2011-02-16 10:56 ` rguenth at gcc dot gnu.org
  2012-02-22 13:54 ` xiaoyuanbo at yeah dot net
  4 siblings, 0 replies; 6+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-02-16 10:56 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rth at gcc dot gnu.org

--- Comment #4 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-02-16 10:49:30 UTC ---
Note that GCC doesn't use unaligned memory operands because it doesn't implement
the knowledge that this is ok for AVX; it simply treats the AVX case the same as
the SSE case, where memory operands are required to be aligned.  In addition,
unaligned SSE and AVX moves are implemented using UNSPECs, so they will never be
combined with other instructions.  I don't know whether there is a way to still
distinguish unaligned from aligned loads/stores while letting them appear as
regular RTL moves at the same time.

Richard, is that even possible?



* [Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand
  2011-02-15 15:37 [Bug target/47754] New: [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand kretz at kde dot org
                   ` (3 preceding siblings ...)
  2011-02-16 10:56 ` rguenth at gcc dot gnu.org
@ 2012-02-22 13:54 ` xiaoyuanbo at yeah dot net
  4 siblings, 0 replies; 6+ messages in thread
From: xiaoyuanbo at yeah dot net @ 2012-02-22 13:54 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

xiaoyuanbo <xiaoyuanbo at yeah dot net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |xiaoyuanbo at yeah dot net

--- Comment #5 from xiaoyuanbo <xiaoyuanbo at yeah dot net> 2012-02-22 13:04:03 UTC ---
so you are boss

