public inbox for gcc-bugs@sourceware.org
* [Bug target/52572] New: suboptimal assignment to avx element
@ 2012-03-12 22:50 marc.glisse at normalesup dot org
  2012-03-13  7:55 ` [Bug target/52572] " jakub at gcc dot gnu.org
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: marc.glisse at normalesup dot org @ 2012-03-12 22:50 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52572

             Bug #: 52572
           Summary: suboptimal assignment to avx element
    Classification: Unclassified
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: marc.glisse@normalesup.org


For the following program:
#include <x86intrin.h>
__m256d f(__m256d x){
  x[0]=0;
  return x;
}

gcc -O3 generates:
    vmovlpd    .LC0(%rip), %xmm0, %xmm1
    vinsertf128    $0x0, %xmm1, %ymm0, %ymm0
or with -Os:
    vxorps    %xmm2, %xmm2, %xmm2
    vmovsd    %xmm2, %xmm0, %xmm1
    vinsertf128    $0x0, %xmm1, %ymm0, %ymm0

If I understand correctly, it first constructs {0,x[1],0,0} and then merges it
with the upper part of x. However, the legacy (non-VEX) movlpd instruction
leaves the upper 128 bits of the register untouched, so with it the
vinsertf128 wouldn't be needed.

Is there a policy not to generate the non-VEX instructions anymore, or is this
a missed optimization?

Setting x[1] is similar. For x[2] or x[3], we get extract+mov+insert, but it
might be better to do something with vblendpd.
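
To make the vblendpd idea concrete, here is a minimal sketch using the
_mm256_blend_pd intrinsic. The helper names zero_lane and zero_lane_blend are
hypothetical, and the scalar fallback is only there so the example also runs on
machines without AVX. Whether this is actually faster than the sequences above
has not been measured here.

```c
#include <assert.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: zero one lane of four doubles with a VEX-encoded blend.
   Bit i of the immediate selects lane i from the second operand (all zeros);
   clear bits keep the lane from the first operand. The immediate must be a
   compile-time constant, hence the switch. */
__attribute__((target("avx")))
static void zero_lane_blend(const double *in, double *out, int lane) {
    __m256d x = _mm256_loadu_pd(in);
    __m256d z = _mm256_setzero_pd();
    __m256d r;
    switch (lane) {
    case 0:  r = _mm256_blend_pd(x, z, 0x1); break;
    case 1:  r = _mm256_blend_pd(x, z, 0x2); break;
    case 2:  r = _mm256_blend_pd(x, z, 0x4); break;
    default: r = _mm256_blend_pd(x, z, 0x8); break;
    }
    _mm256_storeu_pd(out, r);
}
#endif

/* Portable wrapper: uses the blend when AVX is available, scalar code otherwise. */
void zero_lane(const double *in, double *out, int lane) {
    assert(in != 0 && out != 0 && lane >= 0 && lane < 4);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx")) {
        zero_lane_blend(in, out, lane);
        return;
    }
#endif
    for (int i = 0; i < 4; i++)
        out[i] = (i == lane) ? 0.0 : in[i];
}
```

The same pattern covers x[0] through x[3]; only the immediate changes.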



* [Bug target/52572] suboptimal assignment to avx element
From: jakub at gcc dot gnu.org @ 2012-03-13  7:55 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52572

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-03-13 07:54:14 UTC ---
Have you actually tried that?  Mixing VEX-encoded insns with legacy-encoded
SSE* insns is very costly; for good performance there needs to be a vzeroupper
in between (but then you lose the upper bits).  See e.g. section 2.8 of the AVX
Programming Reference.



* [Bug target/52572] suboptimal assignment to avx element
From: marc.glisse at normalesup dot org @ 2012-03-13  8:17 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52572

--- Comment #2 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-13 08:16:58 UTC ---
(In reply to comment #1)
> Have you actually tried that?

Ah, no, sorry, I only have occasional access to such a machine to benchmark the
code. From a -Os perspective it is still shorter (but indeed that matters less
to me than -O3 performance).

>  Mixing VEX-encoded insns with legacy-encoded
> SSE* insns is very costly; for good performance there needs to be a vzeroupper
> in between (but then you lose the upper bits).  See e.g. section 2.8 of the AVX
> Programming Reference.

Thanks, I'd missed that.

The vblendpd solution should still apply, no? (The leading 'v' suggests it is
VEX-encoded, so it should not incur the transition penalty.)



* [Bug target/52572] suboptimal assignment to avx element
From: marc.glisse at normalesup dot org @ 2012-03-13 17:58 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52572

--- Comment #3 from Marc Glisse <marc.glisse at normalesup dot org> 2012-03-13 17:57:58 UTC ---
Or for this variant:
__m256d f(__m256d *y){
  __m256d x=*y;
  x[0]=0; // or x[3]
  return x;
}
it looks like vmaskmovpd could replace:
    vmovapd    (%rdi), %ymm0
    vmovapd    %xmm0, %xmm1
    vmovlpd    .LC0(%rip), %xmm1, %xmm1
    vinsertf128    $0x0, %xmm1, %ymm0, %ymm0
(I tried a version with __builtin_shuffle but it wouldn't generate vmaskmovpd
either)

(sorry for the naive suggestions, there are too many possibilities to optimize
them all...)
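
As a rough illustration of the vmaskmovpd suggestion for the x[0] case, the
sketch below uses the _mm256_maskload_pd intrinsic, which loads only the lanes
whose mask element has its top bit set and zeroes the others. The function
names load_zero_lane0 and load_zero_lane0_avx are hypothetical, and the scalar
fallback exists only so the example runs without AVX. No claim is made that
this is faster than the load plus insert sequence above.

```c
#include <assert.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: load four doubles from src, forcing lane 0 to zero.
   Lanes whose mask element has the top bit clear are not read from memory
   and come back as zero. */
__attribute__((target("avx")))
static void load_zero_lane0_avx(const double *src, double *dst) {
    /* _mm256_set_epi64x lists elements from lane 3 down to lane 0. */
    __m256i mask = _mm256_set_epi64x(-1, -1, -1, 0);
    _mm256_storeu_pd(dst, _mm256_maskload_pd(src, mask));
}
#endif

/* Portable wrapper: masked load when AVX is available, scalar code otherwise. */
void load_zero_lane0(const double *src, double *dst) {
    assert(src != 0 && dst != 0);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx")) {
        load_zero_lane0_avx(src, dst);
        return;
    }
#endif
    dst[0] = 0.0;
    for (int i = 1; i < 4; i++)
        dst[i] = src[i];
}
```

For x[3], the zero would move to the first argument of _mm256_set_epi64x.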



* [Bug target/52572] suboptimal assignment to avx element
From: pinskia at gcc dot gnu.org @ 2021-12-25 22:30 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52572

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
   Last reconfirmed|                            |2021-12-25
             Target|                            |x86_64-linux-gnu
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
LLVM produces:

        vxorps  %xmm1, %xmm1, %xmm1
        vblendps        $3, %ymm1, %ymm0, %ymm0         # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]

and

        vxorps  %xmm0, %xmm0, %xmm0
        vblendps        $252, (%rdi), %ymm0, %ymm0      # ymm0 = ymm0[0,1],mem[2,3,4,5,6,7]

Which I suspect is better.
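
The two immediates ($3 and $252) can look inconsistent at first. A small scalar
model of the blend, sketched below, shows that they describe the same result
with the operands swapped: bit i of the immediate selects float lane i from the
second source, and a memory operand can only be that second source. The name
blend8_model is hypothetical; this models the selection rule only and is not
how any compiler implements it.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of an 8-lane float blend (vblendps on ymm registers):
   out[i] = b[i] if bit i of imm is set, otherwise a[i].
   Here a is the first source and b the second (register-or-memory) source. */
void blend8_model(const float *a, const float *b, uint8_t imm, float *out) {
    assert(a != 0 && b != 0 && out != 0);
    for (int i = 0; i < 8; i++)
        out[i] = ((imm >> i) & 1) ? b[i] : a[i];
}
```

With x as the eight floats of the input and zero as an all-zero vector,
blend8_model(x, zero, 3, out) models the first listing and
blend8_model(zero, x, 252, out) models the second. Both zero float lanes 0 and
1, which together form double element 0.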

