[Bug middle-end/44551] New: [missed optimization] AVX vextractf128 after vinsertf128

public inbox for gcc-bugs@sourceware.org

* [Bug middle-end/44551]  New: [missed optimization] AVX vextractf128 after vinsertf128
@ 2010-06-15 22:31 kretz at kde dot org
  2010-06-16  9:02 ` [Bug target/44551] " rguenth at gcc dot gnu dot org
                   ` (10 more replies)
  0 siblings, 11 replies; 15+ messages in thread
From: kretz at kde dot org @ 2010-06-15 22:31 UTC (permalink / raw)
  To: gcc-bugs

Consider the following testcase:

#include <immintrin.h>

static inline __m256i __attribute__((always_inline))
my_add(__m256i a0, __m256i b0)
{
    __m128i a1 = _mm256_extractf128_si256(a0, 1);
    __m128i b1 = _mm256_extractf128_si256(b0, 1);
    __m256i r  = _mm256_castsi128_si256(
        _mm_add_epi32(_mm256_castsi256_si128(a0), _mm256_castsi256_si128(b0)));
    r = _mm256_insertf128_si256(r, _mm_add_epi32(a1, b1), 1);
    return r;
}

extern int DATA[];

void use_insert_extract()
{
    __m256i x = _mm256_loadu_si256((__m256i*)&DATA[0]);
    __m256i y = _mm256_loadu_si256((__m256i*)&DATA[1]);
    x = my_add(x, y);
    x = my_add(x, y);
    _mm256_storeu_si256((__m256i*)&DATA[0], x);
}

int main()
{
    return DATA[1];
}

Compiled with "g++ -mavx -O3 -Wall -S" one gets the following output:
        vmovdqu DATA(%rip), %ymm1
        pushq   %rbp
        vmovdqu DATA+4(%rip), %ymm0
        vextractf128    $0x1, %ymm1, %xmm3
        vmovdqa %xmm1, %xmm2
        movq    %rsp, %rbp
        vmovdqa %xmm0, %xmm1
        vextractf128    $0x1, %ymm0, %xmm0
        vpaddd  %xmm1, %xmm2, %xmm2
        vpaddd  %xmm0, %xmm3, %xmm3
        vinsertf128     $0x1, %xmm3, %ymm2, %ymm2
        vextractf128    $0x1, %ymm2, %xmm3
        vpaddd  %xmm1, %xmm2, %xmm1
        vpaddd  %xmm0, %xmm3, %xmm0
        vinsertf128     $0x1, %xmm0, %ymm1, %ymm0
        vmovdqu %ymm0, DATA(%rip)

ICC 11.1 compiles the same source ("-xavx -O3 -Wall -S") to:
        vmovdqu   DATA(%rip), %ymm1
        vmovdqu   4+DATA(%rip), %ymm0
        vextractf128 $1, %ymm1, %xmm2
        vextractf128 $1, %ymm0, %xmm6
        vpaddd    %xmm0, %xmm1, %xmm3
        vpaddd    %xmm6, %xmm2, %xmm5
        vpaddd    %xmm0, %xmm3, %xmm4
        vpaddd    %xmm6, %xmm5, %xmm7
        vinsertf128 $1, %xmm7, %ymm4, %ymm8
        vmovdqu   %ymm8, DATA(%rip)

Note especially the extract after the insert, which results from applying
my_add twice. This kind of optimization (which ICC is able to apply here) is
important because AVX introduces 256-bit vector registers, but
arithmetic/logic/comparison operations on integers remain the 128-bit SSE
variants. Thus, if you want to handle integers in YMM registers, you will end
up with a lot of vinsertf128 and vextractf128 operations.
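
To make the redundancy concrete, here is a minimal scalar sketch in plain C. It is only a model: the types v128 and v256 and all function names are hypothetical stand-ins for the intrinsics in the testcase, chosen so that the code runs without AVX hardware. It shows that applying the split/add/reassemble pattern twice gives the same result as a fused version in which the intermediate insert/extract pair has been cancelled, which is roughly what the ICC output above does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model: a 256-bit vector as two 128-bit halves. */
typedef struct { int32_t lane[4]; } v128;
typedef struct { v128 lo, hi; } v256;

/* Models vpaddd on one 128-bit half (wraps around like the hardware). */
static v128 add128(v128 a, v128 b)
{
    v128 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = (int32_t)((uint32_t)a.lane[i] + (uint32_t)b.lane[i]);
    return r;
}

/* Same shape as my_add in the testcase: split, add per half, reassemble. */
static v256 my_add_model(v256 a, v256 b)
{
    v128 a1 = a.hi;              /* models vextractf128 $1 */
    v128 b1 = b.hi;              /* models vextractf128 $1 */
    v256 r;
    r.lo = add128(a.lo, b.lo);   /* models cast + vpaddd   */
    r.hi = add128(a1, b1);       /* models vinsertf128 $1  */
    return r;
}

/* Two applications with the intermediate insert/extract pair cancelled. */
static v256 my_add_twice_fused(v256 x, v256 y)
{
    v256 r;
    r.lo = add128(add128(x.lo, y.lo), y.lo);
    r.hi = add128(add128(x.hi, y.hi), y.hi);
    return r;
}

/* Bytewise comparison; the model structs contain no padding. */
static int same256(v256 a, v256 b)
{
    return memcmp(&a, &b, sizeof a) == 0;
}

/* Asserts that the naive and the fused form agree for the given inputs. */
static void check_fusion(v256 x, v256 y)
{
    assert(same256(my_add_model(my_add_model(x, y), y),
                   my_add_twice_fused(x, y)));
}
```

Calling check_fusion with arbitrary inputs should never trip the assertion, since the high half written by the first call is exactly what the second call reads back.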


-- 
           Summary: [missed optimization] AVX vextractf128 after vinsertf128
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: kretz at kde dot org
 GCC build triplet: x86_64-unknown-linux-gnu
  GCC host triplet: x86_64-unknown-linux-gnu
GCC target triplet: x86_64-unknown-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551



* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
  2010-06-15 22:31 [Bug middle-end/44551] New: [missed optimization] AVX vextractf128 after vinsertf128 kretz at kde dot org
@ 2010-06-16  9:02 ` rguenth at gcc dot gnu dot org
  2010-06-16 19:50 ` hjl dot tools at gmail dot com
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2010-06-16  9:02 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from rguenth at gcc dot gnu dot org  2010-06-16 09:02 -------
This is probably missing combiner patterns in sse.md.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu dot
                   |                            |org, hjl at gcc dot gnu dot
                   |                            |org
           Severity|normal                      |enhancement
             Status|UNCONFIRMED                 |NEW
          Component|middle-end                  |target
     Ever Confirmed|0                           |1
   Last reconfirmed|0000-00-00 00:00:00         |2010-06-16 09:02:24
               date|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551



* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
  2010-06-15 22:31 [Bug middle-end/44551] New: [missed optimization] AVX vextractf128 after vinsertf128 kretz at kde dot org
  2010-06-16  9:02 ` [Bug target/44551] " rguenth at gcc dot gnu dot org
@ 2010-06-16 19:50 ` hjl dot tools at gmail dot com
  2010-06-16 20:01 ` pinskia at gcc dot gnu dot org
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: hjl dot tools at gmail dot com @ 2010-06-16 19:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from hjl dot tools at gmail dot com  2010-06-16 19:50 -------
The problem is UNSPEC_CAST.  There is no good way to model it.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551



* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
  2010-06-15 22:31 [Bug middle-end/44551] New: [missed optimization] AVX vextractf128 after vinsertf128 kretz at kde dot org
  2010-06-16  9:02 ` [Bug target/44551] " rguenth at gcc dot gnu dot org
  2010-06-16 19:50 ` hjl dot tools at gmail dot com
@ 2010-06-16 20:01 ` pinskia at gcc dot gnu dot org
  2010-06-16 20:42 ` hjl dot tools at gmail dot com
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2010-06-16 20:01 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from pinskia at gcc dot gnu dot org  2010-06-16 20:00 -------
Well, for one, you could have a splitter for the case which_alternative == 0
so that register renaming can do its magic.

Also, what does UNSPEC_CAST really do?  From the looks of it, it is just a
move, which you could use a splitter on, at least after reload.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551



* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
  2010-06-15 22:31 [Bug middle-end/44551] New: [missed optimization] AVX vextractf128 after vinsertf128 kretz at kde dot org
                   ` (2 preceding siblings ...)
  2010-06-16 20:01 ` pinskia at gcc dot gnu dot org
@ 2010-06-16 20:42 ` hjl dot tools at gmail dot com
  2010-06-16 20:46 ` pinskia at gcc dot gnu dot org
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: hjl dot tools at gmail dot com @ 2010-06-16 20:42 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from hjl dot tools at gmail dot com  2010-06-16 20:42 -------
You can cast 256bit to 128bit to get the lower 128bit. You can also
cast 128bit to 256bit with the upper 128bit undefined. If I use a union,
it will always generate 2 moves via memory.
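
For readers unfamiliar with the union approach mentioned here, the following is a rough sketch of what a union-based cast looks like at the source level. It is a plain-C model, not the code in GCC's headers: v128, v256, v256_view and the function names are hypothetical, and the widening cast zeroes the upper half only so that the model is deterministic (the real intrinsic leaves it undefined).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-ins for __m128i and __m256i. */
typedef struct { int32_t lane[4]; } v128;
typedef struct { int32_t lane[8]; } v256;

/* Overlay of the full vector with its two halves (type punning via union). */
typedef union {
    v256 full;
    struct { v128 lo, hi; } half;
} v256_view;

/* Models _mm256_castsi256_si128: keep the lower 128 bits. */
static v128 cast256_to_128(v256 x)
{
    v256_view u;
    u.full = x;
    return u.half.lo;
}

/* Models _mm256_castsi128_si256. The intrinsic leaves the upper half
   undefined; this model zeroes it so the result is deterministic. */
static v256 cast128_to_256(v128 x)
{
    v256_view u = { .full = { {0} } };
    u.half.lo = x;
    return u.full;
}

/* Round trip: narrowing keeps lanes 0-3, widening restores them. */
static void check_casts(void)
{
    v256 x = { {1, 2, 3, 4, 5, 6, 7, 8} };
    v128 lo = cast256_to_128(x);
    assert(lo.lane[0] == 1 && lo.lane[3] == 4);
    v256 back = cast128_to_256(lo);
    assert(back.lane[3] == 4 && back.lane[4] == 0);
}
```

Semantically both casts are no-ops on the register contents, which is why a store/load pair through memory is wasteful.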


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551



* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
  2010-06-15 22:31 [Bug middle-end/44551] New: [missed optimization] AVX vextractf128 after vinsertf128 kretz at kde dot org
                   ` (3 preceding siblings ...)
  2010-06-16 20:42 ` hjl dot tools at gmail dot com
@ 2010-06-16 20:46 ` pinskia at gcc dot gnu dot org
  2010-06-16 21:21 ` kretz at kde dot org
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2010-06-16 20:46 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from pinskia at gcc dot gnu dot org  2010-06-16 20:46 -------
(In reply to comment #4)
> You can cast 256bit to 128bit to get the lower 128bit.

This direction can be represented using vec_select, and then later turned into
a move by a split (after reload).

> You can also cast 128bit to 256bit with upper 128bit undefined. 
Still use an UNSPEC, but use define_insn_and_split to split it (after reload)
into a move, since it is a move after all (the registers overlap).

This should improve code generation.  Also penalize the alternative that does
not match operand 0 in both insns.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551



* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
  2010-06-15 22:31 [Bug middle-end/44551] New: [missed optimization] AVX vextractf128 after vinsertf128 kretz at kde dot org
                   ` (4 preceding siblings ...)
  2010-06-16 20:46 ` pinskia at gcc dot gnu dot org
@ 2010-06-16 21:21 ` kretz at kde dot org
  2010-06-17 22:01 ` hjl dot tools at gmail dot com
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: kretz at kde dot org @ 2010-06-16 21:21 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #6 from kretz at kde dot org  2010-06-16 21:21 -------
(In reply to comment #4)
> You can also cast 128bit to 256bit with upper 128bit undefined.
If you cast from xmm to ymm after a 128-bit instruction encoded with the VEX
prefix, then the upper 128 bits are actually guaranteed to be zero. If the SSE
instruction does not use the VEX prefix, then the upper 128 bits are not
modified. Thus there is never really an undefined state. That might be useful
information for other optimizations?
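
The two behaviours described above can be sketched with a small scalar model. This is only an illustration in plain C: ymm_model and both function names are hypothetical, and a register is modelled as eight 32-bit lanes with lanes 0-3 being the xmm part.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a YMM register; lanes 0-3 alias the XMM register. */
typedef struct { int32_t lane[8]; } ymm_model;

/* Wrapping 32-bit add, as paddd/vpaddd perform per lane. */
static int32_t add32(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* VEX-encoded 128-bit add (vpaddd on xmm operands): the destination's
   upper 128 bits are zeroed. */
static ymm_model vex_add128(ymm_model a, ymm_model b)
{
    ymm_model r = { {0} };
    for (int i = 0; i < 4; ++i)
        r.lane[i] = add32(a.lane[i], b.lane[i]);
    return r;
}

/* Legacy SSE add (paddd, destructive two-operand form): the destination's
   upper 128 bits are left unchanged. */
static ymm_model legacy_add128(ymm_model dst, ymm_model src)
{
    for (int i = 0; i < 4; ++i)
        dst.lane[i] = add32(dst.lane[i], src.lane[i]);
    return dst;
}

/* Checks the upper-half behaviour of both encodings. */
static void check_upper_half(void)
{
    ymm_model a = { {1, 2, 3, 4, 5, 6, 7, 8} };
    ymm_model b = { {10, 10, 10, 10, 10, 10, 10, 10} };
    assert(vex_add128(a, b).lane[4] == 0);     /* zeroed    */
    assert(legacy_add128(a, b).lane[4] == 5);  /* preserved */
}
```

Either way the upper half has a well-defined value after the instruction, so a compiler could in principle treat the following widening cast as zero-extending or as preserving, depending on the encoding.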

> If I use union, it will always generate 2 moves via memory.
Yes, I noticed that unions are not a good choice for performance-critical code.
They result in way more memory moves than necessary. BTW, ICC also generates
memory moves when implementing the testcase with unions.

PS: Thanks a lot for looking into this!


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551



* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
  2010-06-15 22:31 [Bug middle-end/44551] New: [missed optimization] AVX vextractf128 after vinsertf128 kretz at kde dot org
                   ` (5 preceding siblings ...)
  2010-06-16 21:21 ` kretz at kde dot org
@ 2010-06-17 22:01 ` hjl dot tools at gmail dot com
  2010-06-18  0:46 ` hjl dot tools at gmail dot com
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: hjl dot tools at gmail dot com @ 2010-06-17 22:01 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #7 from hjl dot tools at gmail dot com  2010-06-17 22:01 -------
Created an attachment (id=20934)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20934&action=view)
A patch to split cast

Here is a patch to split cast. But it doesn't remove
redundant vinsertf128/vextractf128. I am not sure which
pass can optimize setting/extracting higher elements of
a vector.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551



* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
  2010-06-15 22:31 [Bug middle-end/44551] New: [missed optimization] AVX vextractf128 after vinsertf128 kretz at kde dot org
                   ` (6 preceding siblings ...)
  2010-06-17 22:01 ` hjl dot tools at gmail dot com
@ 2010-06-18  0:46 ` hjl dot tools at gmail dot com
  2010-06-18  0:50 ` pinskia at gcc dot gnu dot org
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: hjl dot tools at gmail dot com @ 2010-06-18  0:46 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #8 from hjl dot tools at gmail dot com  2010-06-18 00:46 -------
Can we use subreg instead of vec_select?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551



* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
  2010-06-15 22:31 [Bug middle-end/44551] New: [missed optimization] AVX vextractf128 after vinsertf128 kretz at kde dot org
                   ` (7 preceding siblings ...)
  2010-06-18  0:46 ` hjl dot tools at gmail dot com
@ 2010-06-18  0:50 ` pinskia at gcc dot gnu dot org
  2010-06-28 19:17 ` hjl dot tools at gmail dot com
  2010-06-28 19:18 ` hjl dot tools at gmail dot com
  10 siblings, 0 replies; 15+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2010-06-18  0:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #9 from pinskia at gcc dot gnu dot org  2010-06-18 00:49 -------
(In reply to comment #8)
> Can we use subreg instead of vec_select?

Kind of: you need to do triple subregs, first to an integer mode, then to a
smaller integer mode, and then to the other vector mode.  Subregs on vector
types are only valid for the same size.  I tried doing this for another target
and it did not really work, so I ended up using vec_select instead and
penalizing the non-matching constraint case.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551



* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
  2010-06-15 22:31 [Bug middle-end/44551] New: [missed optimization] AVX vextractf128 after vinsertf128 kretz at kde dot org
                   ` (8 preceding siblings ...)
  2010-06-18  0:50 ` pinskia at gcc dot gnu dot org
@ 2010-06-28 19:17 ` hjl dot tools at gmail dot com
  2010-06-28 19:18 ` hjl dot tools at gmail dot com
  10 siblings, 0 replies; 15+ messages in thread
From: hjl dot tools at gmail dot com @ 2010-06-28 19:17 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #10 from hjl dot tools at gmail dot com  2010-06-28 19:17 -------
Here is a small testcase:

[hjl@gnu-6 44551]$ cat c.s
        .file   "c.c"
        .text
        .p2align 4,,15
.globl foo
        .type   foo, @function
foo:
.LFB798:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        vinsertf128     $0x1, %xmm1, %ymm0, %ymm0
        movq    %rsp, %rbp
        .cfi_offset 6, -16
        .cfi_def_cfa_register 6
        vextractf128    $0x1, %ymm0, %xmm0
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE798:
        .size   foo, .-foo
        .ident  "GCC: (GNU) 4.6.0 20100625 (experimental)"
        .section        .note.GNU-stack,"",@progbits
[hjl@gnu-6 44551]$ 

The optimized code is

        vmovaps   %xmm1, %xmm0
        ret         


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551



* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
  2010-06-15 22:31 [Bug middle-end/44551] New: [missed optimization] AVX vextractf128 after vinsertf128 kretz at kde dot org
                   ` (9 preceding siblings ...)
  2010-06-28 19:17 ` hjl dot tools at gmail dot com
@ 2010-06-28 19:18 ` hjl dot tools at gmail dot com
  10 siblings, 0 replies; 15+ messages in thread
From: hjl dot tools at gmail dot com @ 2010-06-28 19:18 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #11 from hjl dot tools at gmail dot com  2010-06-28 19:17 -------
Testcase is

[hjl@gnu-6 44551]$ cat c.c
#include <immintrin.h>

__m128i
foo (__m256i x, __m128i y)
{
  __m256i r = _mm256_insertf128_si256(x, y, 1);
  __m128i a = _mm256_extractf128_si256(r, 1);
  return a;
}
[hjl@gnu-6 44551]$ make c.s
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -mavx -O2 -S c.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551



* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
       [not found] <bug-44551-4@http.gcc.gnu.org/bugzilla/>
  2012-12-01 22:38 ` glisse at gcc dot gnu.org
  2014-06-10 17:02 ` glisse at gcc dot gnu.org
@ 2014-07-26  9:01 ` glisse at gcc dot gnu.org
  2 siblings, 0 replies; 15+ messages in thread
From: glisse at gcc dot gnu.org @ 2014-07-26  9:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551

--- Comment #14 from Marc Glisse <glisse at gcc dot gnu.org> ---
Author: glisse
Date: Sat Jul 26 09:00:31 2014
New Revision: 213076

URL: https://gcc.gnu.org/viewcvs?rev=213076&root=gcc&view=rev
Log:
2014-07-26  Marc Glisse  <marc.glisse@inria.fr>

    PR target/44551
gcc/
    * simplify-rtx.c (simplify_binary_operation_1) <VEC_SELECT>:
    Optimize inverse of a VEC_CONCAT.
gcc/testsuite/
    * gcc.target/i386/pr44551-1.c: New file.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr44551-1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/simplify-rtx.c
    trunk/gcc/testsuite/ChangeLog



* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
       [not found] <bug-44551-4@http.gcc.gnu.org/bugzilla/>
  2012-12-01 22:38 ` glisse at gcc dot gnu.org
@ 2014-06-10 17:02 ` glisse at gcc dot gnu.org
  2014-07-26  9:01 ` glisse at gcc dot gnu.org
  2 siblings, 0 replies; 15+ messages in thread
From: glisse at gcc dot gnu.org @ 2014-06-10 17:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551

--- Comment #13 from Marc Glisse <glisse at gcc dot gnu.org> ---
Created attachment 32915
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=32915&action=edit
simplify vec_select(vec_concat)

A simpler/safer version of the patch linked in comment #12 (untested). It
optimizes the example in comment #11, but fails to optimize the original
testcase because simplify-rtx operations are only done on single-use operands,
and I don't know where in the RTL optimizers we can apply transformations
without this constraint.
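
As an illustration of the kind of rule involved, here is a small standalone sketch in plain C. It is not the code from the attachment or from simplify-rtx.c; the function name, the enum and the index-array representation are all assumptions made for the example. The idea is that a selection of consecutive indices from the concatenation of two vectors A and B, covering exactly one of them, can be replaced by that operand.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of trying to simplify select(concat(A, B), sel). */
typedef enum { PICK_NONE, PICK_FIRST, PICK_SECOND } pick;

/* Hypothetical model: sel[0..n) are the element indices selected from the
   concatenation of A (len_a elements) and B (len_b elements). Returns which
   operand the selection is identical to, or PICK_NONE if it is neither. */
static pick simplify_select_of_concat(const int *sel, size_t n,
                                      size_t len_a, size_t len_b)
{
    assert(sel != NULL || n == 0);
    if (n == 0)
        return PICK_NONE;

    size_t base;
    pick result;
    if (n == len_a && sel[0] == 0) {
        base = 0;                 /* candidate: all of A */
        result = PICK_FIRST;
    } else if (n == len_b && sel[0] == (int)len_a) {
        base = len_a;             /* candidate: all of B */
        result = PICK_SECOND;
    } else {
        return PICK_NONE;
    }

    /* The indices must run consecutively from the candidate's first lane. */
    for (size_t i = 0; i < n; ++i)
        if (sel[i] != (int)(base + i))
            return PICK_NONE;
    return result;
}
```

For the testcase in comment #11, the extract of the upper half corresponds to selecting indices 4..7 from a concatenation of two 4-element vectors, so the model returns PICK_SECOND, i.e. the inserted value y.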


* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
       [not found] <bug-44551-4@http.gcc.gnu.org/bugzilla/>
@ 2012-12-01 22:38 ` glisse at gcc dot gnu.org
  2014-06-10 17:02 ` glisse at gcc dot gnu.org
  2014-07-26  9:01 ` glisse at gcc dot gnu.org
  2 siblings, 0 replies; 15+ messages in thread
From: glisse at gcc dot gnu.org @ 2012-12-01 22:38 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551

Marc Glisse <glisse at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |glisse at gcc dot gnu.org

--- Comment #12 from Marc Glisse <glisse at gcc dot gnu.org> 2012-12-01 22:38:15 UTC ---
Hmm, maybe this patch:

http://gcc.gnu.org/ml/gcc-patches/2012-11/msg00373.html

would help with the testcase in comment #11? I'll have to try and resurrect
it.



