public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/44551] New: [missed optimization] AVX vextractf128 after vinsertf128
@ 2010-06-15 22:31 kretz at kde dot org
2010-06-16 9:02 ` [Bug target/44551] " rguenth at gcc dot gnu dot org
` (10 more replies)
0 siblings, 11 replies; 15+ messages in thread
From: kretz at kde dot org @ 2010-06-15 22:31 UTC (permalink / raw)
To: gcc-bugs
Consider the following testcase:
#include <immintrin.h>

static inline __m256i __attribute__((always_inline))
my_add(__m256i a0, __m256i b0)
{
    __m128i a1 = _mm256_extractf128_si256(a0, 1);
    __m128i b1 = _mm256_extractf128_si256(b0, 1);
    __m256i r =
        _mm256_castsi128_si256(_mm_add_epi32(_mm256_castsi256_si128(a0),
                                             _mm256_castsi256_si128(b0)));
    r = _mm256_insertf128_si256(r, _mm_add_epi32(a1, b1), 1);
    return r;
}

extern int DATA[];

void use_insert_extract()
{
    __m256i x = _mm256_loadu_si256((__m256i*)&DATA[0]);
    __m256i y = _mm256_loadu_si256((__m256i*)&DATA[1]);
    x = my_add(x, y);
    x = my_add(x, y);
    _mm256_storeu_si256((__m256i*)&DATA[0], x);
}

int main()
{
    return DATA[1];
}
Compiled with "g++ -mavx -O3 -Wall -S" one gets the following output:
vmovdqu DATA(%rip), %ymm1
pushq %rbp
vmovdqu DATA+4(%rip), %ymm0
vextractf128 $0x1, %ymm1, %xmm3
vmovdqa %xmm1, %xmm2
movq %rsp, %rbp
vmovdqa %xmm0, %xmm1
vextractf128 $0x1, %ymm0, %xmm0
vpaddd %xmm1, %xmm2, %xmm2
vpaddd %xmm0, %xmm3, %xmm3
vinsertf128 $0x1, %xmm3, %ymm2, %ymm2
vextractf128 $0x1, %ymm2, %xmm3
vpaddd %xmm1, %xmm2, %xmm1
vpaddd %xmm0, %xmm3, %xmm0
vinsertf128 $0x1, %xmm0, %ymm1, %ymm0
vmovdqu %ymm0, DATA(%rip)
ICC 11.1 compiles the same source ("-xavx -O3 -Wall -S") to:
vmovdqu DATA(%rip), %ymm1
vmovdqu 4+DATA(%rip), %ymm0
vextractf128 $1, %ymm1, %xmm2
vextractf128 $1, %ymm0, %xmm6
vpaddd %xmm0, %xmm1, %xmm3
vpaddd %xmm6, %xmm2, %xmm5
vpaddd %xmm0, %xmm3, %xmm4
vpaddd %xmm6, %xmm5, %xmm7
vinsertf128 $1, %xmm7, %ymm4, %ymm8
vmovdqu %ymm8, DATA(%rip)
Note especially the extract directly after the insert, which results from the
double application of my_add. This kind of optimization (which ICC is able to
apply here) is important because AVX introduces 256-bit vector registers, but
arithmetic/logic/comparison operations on integers remain the 128-bit SSE
variants. Thus, if you want to handle integers in YMM registers, you will find a
lot of vinsertf128 and vextractf128 operations.
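For illustration, the redundancy can be modeled in plain C without intrinsics.
This is only a sketch: the v128/v256 structs and the helper names (add128,
extract_hi, insert_hi, model_my_add, model_fused, model_selfcheck) are
hypothetical stand-ins for the registers and for vpaddd, vextractf128 and
vinsertf128; none of them come from GCC or immintrin.h. It shows that
extracting the upper half right after inserting it simply yields the inserted
value, so two applications of my_add reduce to four 128-bit adds and one insert.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar stand-ins for an XMM and a YMM register. */
typedef struct { int32_t e[4]; } v128;
typedef struct { v128 lo, hi; } v256;

/* Models vpaddd on a 128-bit value (wrapping 32-bit adds). */
static v128 add128(v128 a, v128 b)
{
    v128 r;
    for (int i = 0; i < 4; ++i)
        r.e[i] = (int32_t)((uint32_t)a.e[i] + (uint32_t)b.e[i]);
    return r;
}

/* Model vextractf128 $1 and vinsertf128 $1. */
static v128 extract_hi(v256 v) { return v.hi; }
static v256 insert_hi(v256 v, v128 x) { v.hi = x; return v; }

/* Same shape as my_add in the testcase: add both halves, reassemble. */
static v256 model_my_add(v256 a, v256 b)
{
    v256 r = { add128(a.lo, b.lo), { { 0 } } };
    return insert_hi(r, add128(extract_hi(a), extract_hi(b)));
}

/* The fused form: no insert/extract between the two additions. */
static v256 model_fused(v256 x, v256 y)
{
    v256 r = { add128(add128(x.lo, y.lo), y.lo),
               add128(add128(x.hi, y.hi), y.hi) };
    return r;
}

/* Returns 1 if applying model_my_add twice matches the fused form. */
int model_selfcheck(void)
{
    v256 x = { { { 1, 2, 3, 4 } }, { { 5, 6, 7, 8 } } };
    v256 y = { { { 10, 20, 30, 40 } }, { { 50, 60, 70, 80 } } };
    v256 twice = model_my_add(model_my_add(x, y), y);
    v256 fused = model_fused(x, y);
    assert(memcmp(&twice, &fused, sizeof twice) == 0);
    return 1;
}
```

Calling model_selfcheck() from a test driver exercises both paths on one fixed
input; it is a sanity check of the model, not a proof about the generated code.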
--
Summary: [missed optimization] AVX vextractf128 after vinsertf128
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: kretz at kde dot org
GCC build triplet: x86_64-unknown-linux-gnu
GCC host triplet: x86_64-unknown-linux-gnu
GCC target triplet: x86_64-unknown-linux-gnu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551
* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
From: rguenth at gcc dot gnu dot org @ 2010-06-16 9:02 UTC (permalink / raw)
To: gcc-bugs
------- Comment #1 from rguenth at gcc dot gnu dot org 2010-06-16 09:02 -------
This is probably caused by missing combiner patterns in sse.md.
--
rguenth at gcc dot gnu dot org changed:
What                  |Removed             |Added
----------------------------------------------------------------------------
CC                    |                    |rguenth at gcc dot gnu dot org,
                      |                    |hjl at gcc dot gnu dot org
Severity              |normal              |enhancement
Status                |UNCONFIRMED         |NEW
Component             |middle-end          |target
Ever Confirmed        |0                   |1
Last reconfirmed date |0000-00-00 00:00:00 |2010-06-16 09:02:24
* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
From: hjl dot tools at gmail dot com @ 2010-06-16 19:50 UTC (permalink / raw)
To: gcc-bugs
------- Comment #2 from hjl dot tools at gmail dot com 2010-06-16 19:50 -------
The problem is UNSPEC_CAST. There is no good way to model it.
* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
From: pinskia at gcc dot gnu dot org @ 2010-06-16 20:01 UTC (permalink / raw)
To: gcc-bugs
------- Comment #3 from pinskia at gcc dot gnu dot org 2010-06-16 20:00 -------
Well, for one, you could have a splitter for the case which_alternative == 0 so
that register renaming can do its magic.
Also, what does UNSPEC_CAST really do? From the looks of it, it is just a move,
which you could use a splitter on, at least after reload.
* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
From: hjl dot tools at gmail dot com @ 2010-06-16 20:42 UTC (permalink / raw)
To: gcc-bugs
------- Comment #4 from hjl dot tools at gmail dot com 2010-06-16 20:42 -------
You can cast 256-bit to 128-bit to get the lower 128 bits. You can also
cast 128-bit to 256-bit with the upper 128 bits undefined. If I use a union,
it will always generate 2 moves via memory.
* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
From: pinskia at gcc dot gnu dot org @ 2010-06-16 20:46 UTC (permalink / raw)
To: gcc-bugs
------- Comment #5 from pinskia at gcc dot gnu dot org 2010-06-16 20:46 -------
(In reply to comment #4)
> You can cast 256-bit to 128-bit to get the lower 128 bits.
This direction can be represented using vec_select and then turned into a move
later on by a split (after reload).
> You can also cast 128-bit to 256-bit with the upper 128 bits undefined.
Still use an UNSPEC, but use define_insn_and_split, which splits (after reload)
into a move, since it is a move after all (the registers are overlapping).
This should improve code generation. Also penalize the non-matching "0" operand
case in both insns.
* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
From: kretz at kde dot org @ 2010-06-16 21:21 UTC (permalink / raw)
To: gcc-bugs
------- Comment #6 from kretz at kde dot org 2010-06-16 21:21 -------
(In reply to comment #4)
> You can also cast 128-bit to 256-bit with the upper 128 bits undefined.
If you cast from xmm to ymm after a 128-bit instruction encoded with the VEX
prefix, then the upper 128 bits are actually guaranteed to be zero. If the SSE
instruction does not use the VEX prefix, then the upper 128 bits are not
modified. Thus there is never really an undefined state. That might be useful
information for other optimizations?
> If I use a union, it will always generate 2 moves via memory.
Yes, I noticed that unions are not a good choice for performance-critical code.
They result in way more memory moves than necessary. BTW, ICC also generates
memory moves when implementing the testcase with unions.
PS: Thanks a lot for looking into this!
* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
From: hjl dot tools at gmail dot com @ 2010-06-17 22:01 UTC (permalink / raw)
To: gcc-bugs
------- Comment #7 from hjl dot tools at gmail dot com 2010-06-17 22:01 -------
Created an attachment (id=20934)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=20934&action=view)
A patch to split cast
Here is a patch to split cast. But it doesn't remove
redundant vinsertf128/vextractf128. I am not sure which
pass can optimize setting/extracting higher elements of
a vector.
* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
From: hjl dot tools at gmail dot com @ 2010-06-18 0:46 UTC (permalink / raw)
To: gcc-bugs
------- Comment #8 from hjl dot tools at gmail dot com 2010-06-18 00:46 -------
Can we use subreg instead of vec_select?
* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
From: pinskia at gcc dot gnu dot org @ 2010-06-18 0:50 UTC (permalink / raw)
To: gcc-bugs
------- Comment #9 from pinskia at gcc dot gnu dot org 2010-06-18 00:49 -------
(In reply to comment #8)
> Can we use subreg instead of vec_select?
Kinda, you need to do triple subregs: first to an integer mode, then to a
smaller integer mode, and then to the other vector mode. Subregs on vector types
are only valid for the same size. I tried doing this for another target and it
did not really work, so I ended up using vec_select instead and penalizing the
non-matching constraint case.
* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
From: hjl dot tools at gmail dot com @ 2010-06-28 19:17 UTC (permalink / raw)
To: gcc-bugs
------- Comment #10 from hjl dot tools at gmail dot com 2010-06-28 19:17 -------
Here is a small testcase:
[hjl@gnu-6 44551]$ cat c.s
.file "c.c"
.text
.p2align 4,,15
.globl foo
.type foo, @function
foo:
.LFB798:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
vinsertf128 $0x1, %xmm1, %ymm0, %ymm0
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
vextractf128 $0x1, %ymm0, %xmm0
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE798:
.size foo, .-foo
.ident "GCC: (GNU) 4.6.0 20100625 (experimental)"
.section .note.GNU-stack,"",@progbits
[hjl@gnu-6 44551]$
The optimized code is
vmovaps %xmm1, %xmm0
ret
* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
From: hjl dot tools at gmail dot com @ 2010-06-28 19:18 UTC (permalink / raw)
To: gcc-bugs
------- Comment #11 from hjl dot tools at gmail dot com 2010-06-28 19:17 -------
Testcase is
[hjl@gnu-6 44551]$ cat c.c
#include <immintrin.h>

__m128i
foo (__m256i x, __m128i y)
{
  __m256i r = _mm256_insertf128_si256(x, y, 1);
  __m128i a = _mm256_extractf128_si256(r, 1);
  return a;
}
[hjl@gnu-6 44551]$ make c.s
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/gcc/build-x86_64-linux/gcc/ -mavx -O2 -S c.c
* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
From: glisse at gcc dot gnu.org @ 2014-07-26 9:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551
--- Comment #14 from Marc Glisse <glisse at gcc dot gnu.org> ---
Author: glisse
Date: Sat Jul 26 09:00:31 2014
New Revision: 213076
URL: https://gcc.gnu.org/viewcvs?rev=213076&root=gcc&view=rev
Log:
2014-07-26 Marc Glisse <marc.glisse@inria.fr>
PR target/44551
gcc/
* simplify-rtx.c (simplify_binary_operation_1) <VEC_SELECT>:
Optimize inverse of a VEC_CONCAT.
gcc/testsuite/
* gcc.target/i386/pr44551-1.c: New file.
Added:
trunk/gcc/testsuite/gcc.target/i386/pr44551-1.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/simplify-rtx.c
trunk/gcc/testsuite/ChangeLog
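
The rule named in the log, simplifying a VEC_SELECT that undoes a VEC_CONCAT,
can be sketched on a toy expression tree. The following is an illustration only
and is not the code committed to simplify-rtx.c: the expr type, the K_* node
kinds, and the functions simplify_select_half and toy_selfcheck are hypothetical
names, and the model only handles selecting one whole half of a two-operand
concat.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy IR, loosely modeled on RTL expressions. */
enum kind { K_REG, K_VEC_CONCAT, K_SELECT_HALF };

struct expr {
    enum kind kind;
    int regno;               /* K_REG: register number */
    const struct expr *op0;  /* K_VEC_CONCAT: low half; K_SELECT_HALF: source */
    const struct expr *op1;  /* K_VEC_CONCAT: high half */
    int half;                /* K_SELECT_HALF: 0 = low half, 1 = high half */
};

/* If X selects a whole half of a concat, return that half directly;
   otherwise return X unchanged. */
const struct expr *simplify_select_half(const struct expr *x)
{
    if (x->kind != K_SELECT_HALF || x->op0->kind != K_VEC_CONCAT)
        return x;
    return x->half == 0 ? x->op0->op0 : x->op0->op1;
}

/* Models foo() from comment #11: extract(insert(x, y, 1), 1) folds to y. */
int toy_selfcheck(void)
{
    struct expr x    = { K_REG, 0, NULL, NULL, 0 };
    struct expr y    = { K_REG, 1, NULL, NULL, 0 };
    struct expr x_lo = { K_SELECT_HALF, 0, &x, NULL, 0 };
    struct expr ins  = { K_VEC_CONCAT, 0, &x_lo, &y, 0 };   /* y becomes the high half */
    struct expr ext  = { K_SELECT_HALF, 0, &ins, NULL, 1 }; /* take the high half back */

    assert(simplify_select_half(&ext) == &y);     /* folded to y */
    assert(simplify_select_half(&x_lo) == &x_lo); /* source is not a concat: unchanged */
    return 1;
}
```

The toy rule needs no use-count check because the trees are immutable; where a
real RTL pass is allowed to apply such a fold is a separate question that this
sketch does not address.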
* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
From: glisse at gcc dot gnu.org @ 2014-06-10 17:02 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551
--- Comment #13 from Marc Glisse <glisse at gcc dot gnu.org> ---
Created attachment 32915
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=32915&action=edit
simplify vec_select(vec_concat)
A simpler/safer version of the patch linked in comment #12 (untested). It
optimizes the example in comment #11, but fails to optimize the original
testcase because simplify-rtx operations are only done on single-use operands,
and I don't know where in the RTL optimizers we can apply transformations
without this constraint.
* [Bug target/44551] [missed optimization] AVX vextractf128 after vinsertf128
From: glisse at gcc dot gnu.org @ 2012-12-01 22:38 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44551
Marc Glisse <glisse at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |glisse at gcc dot gnu.org
--- Comment #12 from Marc Glisse <glisse at gcc dot gnu.org> 2012-12-01 22:38:15 UTC ---
Hmm, maybe this patch:
http://gcc.gnu.org/ml/gcc-patches/2012-11/msg00373.html
would help with the testcase in comment #11 ? I'll have to try and resurrect
it.
Thread overview: 15+ messages
2010-06-15 22:31 [Bug middle-end/44551] New: [missed optimization] AVX vextractf128 after vinsertf128 kretz at kde dot org
2010-06-16 9:02 ` [Bug target/44551] " rguenth at gcc dot gnu dot org
2010-06-16 19:50 ` hjl dot tools at gmail dot com
2010-06-16 20:01 ` pinskia at gcc dot gnu dot org
2010-06-16 20:42 ` hjl dot tools at gmail dot com
2010-06-16 20:46 ` pinskia at gcc dot gnu dot org
2010-06-16 21:21 ` kretz at kde dot org
2010-06-17 22:01 ` hjl dot tools at gmail dot com
2010-06-18 0:46 ` hjl dot tools at gmail dot com
2010-06-18 0:50 ` pinskia at gcc dot gnu dot org
2010-06-28 19:17 ` hjl dot tools at gmail dot com
2010-06-28 19:18 ` hjl dot tools at gmail dot com
[not found] <bug-44551-4@http.gcc.gnu.org/bugzilla/>
2012-12-01 22:38 ` glisse at gcc dot gnu.org
2014-06-10 17:02 ` glisse at gcc dot gnu.org
2014-07-26 9:01 ` glisse at gcc dot gnu.org