public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/88873] missing vectorization for decomposed operations on a vector type
[not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
@ 2021-08-21 21:25 ` pinskia at gcc dot gnu.org
2023-06-21 13:33 ` [Bug middle-end/88873] " rguenth at gcc dot gnu.org
` (6 subsequent siblings)
7 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-21 21:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Blocks| |101926
--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
So we now vectorize both functions but we mess up foo's code gen.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101926
[Bug 101926] [meta-bug] struct/complex argument passing and return should be
improved
* [Bug middle-end/88873] missing vectorization for decomposed operations on a vector type
[not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
2021-08-21 21:25 ` [Bug tree-optimization/88873] missing vectorization for decomposed operations on a vector type pinskia at gcc dot gnu.org
@ 2023-06-21 13:33 ` rguenth at gcc dot gnu.org
2023-06-21 22:18 ` roger at nextmovesoftware dot com
` (5 subsequent siblings)
7 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-06-21 13:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |linkw at gcc dot gnu.org,
| |rguenth at gcc dot gnu.org,
| |sayle at gcc dot gnu.org,
| |vmakarov at gcc dot gnu.org
Component|tree-optimization |middle-end
Keywords| |ra
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
So we "like"
v2df bar (v2df a, v2df b, v2df c)
{
vector(2) double vect__4.19;
vect__4.19_19 = .FMA (b_10(D), a_11(D), c_9(D)); [tail call]
return vect__4.19_19;
}
but foo has the usual ABI issues:
struct s_t foo (struct s_t a, struct s_t b, struct s_t c)
{
vector(2) double vect__4.13;
vector(2) double vect__1.12;
vector(2) double vect__3.9;
vector(2) double vect__2.6;
struct s_t D.4355;
vect__1.12_14 = MEM <vector(2) double> [(double *)&c];
vect__2.6_12 = MEM <vector(2) double> [(double *)&b];
vect__3.9_13 = MEM <vector(2) double> [(double *)&a];
vect__4.13_15 = .FMA (vect__2.6_12, vect__3.9_13, vect__1.12_14);
MEM <vector(2) double> [(double *)&D.4355] = vect__4.13_15;
return D.4355;
}
where the argument passing / return value handling gets us
foo:
vmovq %xmm3, %rax
vmovq %xmm0, -24(%rsp)
vpinsrq $1, %rax, %xmm2, %xmm7
vmovq %xmm5, %rax
vmovq %xmm1, -16(%rsp)
vmovapd %xmm7, %xmm6
vpinsrq $1, %rax, %xmm4, %xmm2
vmovq %xmm4, -40(%rsp)
vfmadd132pd -24(%rsp), %xmm2, %xmm6
vmovq %xmm5, -32(%rsp)
vmovapd %xmm6, -56(%rsp)
vmovsd -48(%rsp), %xmm1
vmovsd -56(%rsp), %xmm0
ret
that's very weird, we also seem to half-way clean up things but fail to
eliminate the useless vmovq %xmm5, -32(%rsp) spill for example.
The IBM folks who want to use SRA-style analysis at RTL expansion time
might in the end deal with this as well.
We expand to
(insn 2 21 3 2 (set (reg:DF 91)
(reg:DF 20 xmm0 [ a ])) "t2.c":8:1 -1
(nil))
(insn 3 2 4 2 (set (reg:DF 92)
(reg:DF 21 xmm1 [ a+8 ])) "t2.c":8:1 -1
(nil))
(insn 4 3 5 2 (set (reg:TI 90)
(const_int 0 [0])) "t2.c":8:1 -1
(nil))
(insn 5 4 6 2 (set (subreg:DF (reg:TI 90) 0)
(reg:DF 91)) "t2.c":8:1 -1
(nil))
(insn 6 5 7 2 (set (subreg:DF (reg:TI 90) 8)
(reg:DF 92)) "t2.c":8:1 -1
(nil))
so we're using TImode pseudos because the aggregate has TImode but the
accesses should tell us that V2DFmode would be a way better choice
(or alternatively V2DImode in case float modes are too dangerous).
The actual single use is then
(insn 23 20 24 2 (set (reg:V2DF 85 [ vect__4.13 ])
(fma:V2DF (subreg:V2DF (reg/v:TI 93 [ b ]) 0)
(subreg:V2DF (reg/v:TI 89 [ a ]) 0)
(subreg:V2DF (reg/v:TI 97 [ c ]) 0))) "t2.c":9:18 -1
(nil))
and of course IRA/LRA are not able to deal with this situation nicely,
possibly because of the subreg sets of the TImode pseudo which we
do not split (well, we can't). We could possibly use STV to handle
some of this though(?)
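For readers without the testcase at hand, the two shapes being compared can be sketched roughly as follows. This is a reconstruction, not the file attached to the bug: a plain multiply-add stands in for fma so the sketch builds without libm, and s_to_v is a hypothetical helper that mimics the MEM <vector(2) double> [(double *)&a] loads in the GIMPLE dump above.

```c
#include <assert.h>
#include <string.h>

typedef struct { double x, y; } s_t;
typedef double v2df __attribute__ ((vector_size (2 * sizeof (double))));

static_assert (sizeof (s_t) == sizeof (v2df),
               "s_t and v2df must have the same size");

/* Vector form: the whole operation stays in a single V2DF value.
   (a * b + c is used instead of fma; an assumption of this sketch.)  */
v2df bar (v2df a, v2df b, v2df c)
{
  return a * b + c;
}

/* Struct form: the same arithmetic, but on x86_64 each s_t argument
   arrives as two separate doubles in two SSE registers.  */
s_t foo (s_t a, s_t b, s_t c)
{
  return (s_t) { a.x * b.x + c.x, a.y * b.y + c.y };
}

/* Hypothetical helper: view the bytes of an s_t as a v2df, which is
   what the vectorized GIMPLE for foo does with its arguments.  */
v2df s_to_v (s_t s)
{
  v2df v;
  memcpy (&v, &s, sizeof v);
  return v;
}
```

Both functions compute the same two lanes; the difference discussed in this bug is only in how the arguments and the return value reach the FMA.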
* [Bug middle-end/88873] missing vectorization for decomposed operations on a vector type
[not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
2021-08-21 21:25 ` [Bug tree-optimization/88873] missing vectorization for decomposed operations on a vector type pinskia at gcc dot gnu.org
2023-06-21 13:33 ` [Bug middle-end/88873] " rguenth at gcc dot gnu.org
@ 2023-06-21 22:18 ` roger at nextmovesoftware dot com
2023-07-10 8:09 ` cvs-commit at gcc dot gnu.org
` (4 subsequent siblings)
7 siblings, 0 replies; 8+ messages in thread
From: roger at nextmovesoftware dot com @ 2023-06-21 22:18 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873
Roger Sayle <roger at nextmovesoftware dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |roger at nextmovesoftware dot com
--- Comment #5 from Roger Sayle <roger at nextmovesoftware dot com> ---
I have a patch (series) that improves some of the TImode parameter passing
issues with the ABI. I'll check/investigate whether this fixes DFmode in the
same way that it improves DImode. I worry that the (hi<<64)|lo idiom might not
be applicable for FP (without SUBREGs), but something similar (with vec_merge)
may resolve this issue during RTL expansion.
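To make the concern concrete, here is a minimal sketch of the (hi<<64)|lo idiom at the C level, assuming a target with __int128. The function names concat_di and concat_df are made up for illustration. The integer version is a plain shift and OR; the floating-point version first has to reinterpret each double as a 64-bit integer, which is roughly the role SUBREGs play in RTL.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert (sizeof (double) == sizeof (uint64_t),
               "this sketch assumes a 64-bit double");

/* Integer case: build a 128-bit value from two 64-bit halves.  */
unsigned __int128 concat_di (uint64_t hi, uint64_t lo)
{
  return ((unsigned __int128) hi << 64) | lo;
}

/* FP case: doubles cannot be shifted or ORed, so each half is
   bit-copied into an integer before the same idiom is applied.  */
unsigned __int128 concat_df (double hi, double lo)
{
  uint64_t h, l;
  memcpy (&h, &hi, sizeof h);
  memcpy (&l, &lo, sizeof l);
  return concat_di (h, l);
}
```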
* [Bug middle-end/88873] missing vectorization for decomposed operations on a vector type
[not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
` (2 preceding siblings ...)
2023-06-21 22:18 ` roger at nextmovesoftware dot com
@ 2023-07-10 8:09 ` cvs-commit at gcc dot gnu.org
2023-07-12 11:33 ` rguenth at gcc dot gnu.org
` (3 subsequent siblings)
7 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-07-10 8:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873
--- Comment #6 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:
https://gcc.gnu.org/g:12b78b0b42d53019eb2c500d386094194e90ad16
commit r14-2406-g12b78b0b42d53019eb2c500d386094194e90ad16
Author: Roger Sayle <roger@nextmovesoftware.com>
Date: Mon Jul 10 09:06:52 2023 +0100
i386: Add new insvti_lowpart_1 and insvdi_lowpart_1 patterns.
This patch implements another of Uros' suggestions, to investigate an
insvti_lowpart_1 pattern to improve TImode parameter passing on x86_64.
In PR 88873, the RTL the middle-end expands for passing V2DF in TImode
is subtly different from what it does for V2DI in TImode, sufficiently so
that my explanations for why insvti_lowpart_1 isn't required don't apply
in this case.
This patch adds an insvti_lowpart_1 pattern, complementing the existing
insvti_highpart_1 pattern, and also a 32-bit variant, insvdi_lowpart_1.
Because the middle-end represents 128-bit constants using CONST_WIDE_INT
and 64-bit constants using CONST_INT, it's easiest to treat these as
different patterns, rather than attempt <dwi> parameterization.
This patch also includes a peephole2 (actually a pair) to transform
xchg instructions into mov instructions, when one of the destinations
is unused. This optimization is required to produce the optimal code
sequences below.
For the 64-bit case:
__int128 foo(__int128 x, unsigned long long y)
{
__int128 m = ~((__int128)~0ull);
__int128 t = x & m;
__int128 r = t | y;
return r;
}
Before:
xchgq %rdi, %rsi
movq %rdx, %rax
xorl %esi, %esi
xorl %edx, %edx
orq %rsi, %rax
orq %rdi, %rdx
ret
After:
movq %rdx, %rax
movq %rsi, %rdx
ret
For the 32-bit case:
long long bar(long long x, int y)
{
long long mask = ~0ull << 32;
long long t = x & mask;
long long r = t | (unsigned int)y;
return r;
}
Before:
pushl %ebx
movl 12(%esp), %edx
xorl %ebx, %ebx
xorl %eax, %eax
movl 16(%esp), %ecx
orl %ebx, %edx
popl %ebx
orl %ecx, %eax
ret
After:
movl 12(%esp), %eax
movl 8(%esp), %edx
ret
2023-07-10 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386.md (peephole2): Transform xchg insn with a
REG_UNUSED note to a (simple) move.
(*insvti_lowpart_1): New define_insn_and_split.
(*insvdi_lowpart_1): Likewise.
gcc/testsuite/ChangeLog
* gcc.target/i386/insvdi_lowpart-1.c: New test case.
* gcc.target/i386/insvti_lowpart-1.c: Likewise.
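As a rough illustration of what a lowpart insertion means, the mask-and-or form from the 64-bit test case can be compared with simply overwriting the low eight bytes. The function names are hypothetical and the byte-copy variant assumes a little-endian target such as x86_64; this is only a sketch of the semantics, not of how the pattern is implemented in i386.md.

```c
#include <assert.h>
#include <string.h>

static_assert (sizeof (unsigned long long) == 8,
               "this sketch assumes a 64-bit unsigned long long");

/* Mask-and-or form, as in the commit's 64-bit test case.  */
__int128 insv_lowpart_mask (__int128 x, unsigned long long y)
{
  __int128 m = ~((__int128) ~0ull);   /* high 64 bits set, low 64 clear */
  return (x & m) | y;
}

/* Equivalent on little-endian targets: overwrite the low 8 bytes of x
   with y and leave the high 8 bytes untouched.  */
__int128 insv_lowpart_copy (__int128 x, unsigned long long y)
{
  memcpy (&x, &y, sizeof y);
  return x;
}
```

Seen this way, the "After" sequence needs only two moves: the high half of x is already in a register and the new low half is y.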
* [Bug middle-end/88873] missing vectorization for decomposed operations on a vector type
[not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
` (3 preceding siblings ...)
2023-07-10 8:09 ` cvs-commit at gcc dot gnu.org
@ 2023-07-12 11:33 ` rguenth at gcc dot gnu.org
2023-07-14 17:13 ` cvs-commit at gcc dot gnu.org
` (2 subsequent siblings)
7 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-12 11:33 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Didn't yet help for the original testcase in the description. We RTL expand
from
vect__1.11_14 = MEM <vector(2) double> [(double *)&c];
vect__2.5_12 = MEM <vector(2) double> [(double *)&b];
vect__3.8_13 = MEM <vector(2) double> [(double *)&a];
vect__4.12_15 = .FMA (vect__2.5_12, vect__3.8_13, vect__1.11_14);
MEM <vector(2) double> [(double *)&D.4349] = vect__4.12_15;
return D.4349;
and get
(insn 2 21 3 2 (set (reg:DF 91)
(reg:DF 20 xmm0 [ a ])) "t.c":8:1 -1
(nil))
(insn 3 2 4 2 (set (reg:DF 92)
(reg:DF 21 xmm1 [ a+8 ])) "t.c":8:1 -1
(nil))
(insn 4 3 5 2 (set (reg:TI 90)
(const_int 0 [0])) "t.c":8:1 -1
(nil))
(insn 5 4 6 2 (set (subreg:DF (reg:TI 90) 0)
(reg:DF 91)) "t.c":8:1 -1
(nil))
(insn 6 5 7 2 (set (subreg:DF (reg:TI 90) 8)
(reg:DF 92)) "t.c":8:1 -1
(nil))
(insn 7 6 8 2 (set (reg/v:TI 89 [ a ])
(reg:TI 90)) "t.c":8:1 -1
(nil))
...
(insn 23 20 24 2 (set (reg:V2DF 85 [ vect__4.12 ])
(fma:V2DF (subreg:V2DF (reg/v:TI 93 [ b ]) 0)
(subreg:V2DF (reg/v:TI 89 [ a ]) 0)
(subreg:V2DF (reg/v:TI 97 [ c ]) 0))) "t.c":9:18 -1
(nil))
so the ABI passes struct s_t in two %xmm regs but the backend gives it
TImode. Nothing cleans this up, we end up with horrible code in the end.
The subreg pass is likely "confused" by the V2DFmode subreg of the TImode
pseudos, maybe it needs to learn to turn the TImode pseudo into a V2DFmode
one ...
* [Bug middle-end/88873] missing vectorization for decomposed operations on a vector type
[not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
` (4 preceding siblings ...)
2023-07-12 11:33 ` rguenth at gcc dot gnu.org
@ 2023-07-14 17:13 ` cvs-commit at gcc dot gnu.org
2023-07-20 8:25 ` cvs-commit at gcc dot gnu.org
2023-08-04 15:24 ` cvs-commit at gcc dot gnu.org
7 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-07-14 17:13 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873
--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:
https://gcc.gnu.org/g:8911879415d6c2a7baad88235554a912887a1c5c
commit r14-2526-g8911879415d6c2a7baad88235554a912887a1c5c
Author: Roger Sayle <roger@nextmovesoftware.com>
Date: Fri Jul 14 18:10:05 2023 +0100
i386: Improved insv of DImode/DFmode {high,low}parts into TImode.
This is the next piece towards a fix for (the x86_64 ABI issues affecting)
PR 88873. This patch generalizes the recent tweak to ix86_expand_move
for setting the highpart of a TImode reg from a DImode source using
*insvti_highpart_1, to handle both DImode and DFmode sources, and also
use the recently added *insvti_lowpart_1 for setting the lowpart.
Although this is another intermediate step (not yet a fix), towards
enabling *insvti and *concat* patterns to be candidates for TImode STV
(by using V2DI/V2DF instructions), it already improves things a little.
For the test case from PR 88873
#include <math.h>
typedef struct { double x, y; } s_t;
typedef double v2df __attribute__ ((vector_size (2 * sizeof(double))));
s_t foo (s_t a, s_t b, s_t c)
{
return (s_t) { fma(a.x, b.x, c.x), fma (a.y, b.y, c.y) };
}
With -O2 -march=cascadelake, GCC currently generates:
Before (29 instructions):
vmovq %xmm2, -56(%rsp)
movq -56(%rsp), %rdx
vmovq %xmm4, -40(%rsp)
movq $0, -48(%rsp)
movq %rdx, -56(%rsp)
movq -40(%rsp), %rdx
vmovq %xmm0, -24(%rsp)
movq %rdx, -40(%rsp)
movq -24(%rsp), %rsi
movq -56(%rsp), %rax
movq $0, -32(%rsp)
vmovq %xmm3, -48(%rsp)
movq -48(%rsp), %rcx
vmovq %xmm5, -32(%rsp)
vmovq %rax, %xmm6
movq -40(%rsp), %rax
movq $0, -16(%rsp)
movq %rsi, -24(%rsp)
movq -32(%rsp), %rsi
vpinsrq $1, %rcx, %xmm6, %xmm6
vmovq %rax, %xmm7
vmovq %xmm1, -16(%rsp)
vmovapd %xmm6, %xmm3
vpinsrq $1, %rsi, %xmm7, %xmm7
vfmadd132pd -24(%rsp), %xmm7, %xmm3
vmovapd %xmm3, -56(%rsp)
vmovsd -48(%rsp), %xmm1
vmovsd -56(%rsp), %xmm0
ret
After (20 instructions):
vmovq %xmm2, -56(%rsp)
movq -56(%rsp), %rax
vmovq %xmm3, -48(%rsp)
vmovq %xmm4, -40(%rsp)
movq -48(%rsp), %rcx
vmovq %xmm5, -32(%rsp)
vmovq %rax, %xmm6
movq -40(%rsp), %rax
movq -32(%rsp), %rsi
vpinsrq $1, %rcx, %xmm6, %xmm6
vmovq %xmm0, -24(%rsp)
vmovq %rax, %xmm7
vmovq %xmm1, -16(%rsp)
vmovapd %xmm6, %xmm2
vpinsrq $1, %rsi, %xmm7, %xmm7
vfmadd132pd -24(%rsp), %xmm7, %xmm2
vmovapd %xmm2, -56(%rsp)
vmovsd -48(%rsp), %xmm1
vmovsd -56(%rsp), %xmm0
ret
2023-07-14 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_move): Generalize special
case inserting of 64-bit values into a TImode register, to handle
both DImode and DFmode using either *insvti_lowpart_1
or *insvti_highpart_1.
* [Bug middle-end/88873] missing vectorization for decomposed operations on a vector type
[not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
` (5 preceding siblings ...)
2023-07-14 17:13 ` cvs-commit at gcc dot gnu.org
@ 2023-07-20 8:25 ` cvs-commit at gcc dot gnu.org
2023-08-04 15:24 ` cvs-commit at gcc dot gnu.org
7 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-07-20 8:25 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873
--- Comment #9 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:
https://gcc.gnu.org/g:097106972f243ddcbddbbddd9a6bcc076f58b453
commit r14-2668-g097106972f243ddcbddbbddd9a6bcc076f58b453
Author: Roger Sayle <roger@nextmovesoftware.com>
Date: Thu Jul 20 09:23:11 2023 +0100
i386: More TImode parameter passing improvements.
This patch is the next piece of a solution to the x86_64 ABI issues in
PR 88873. This splits the *concat<mode><dwi>3_3 define_insn_and_split
into two patterns, a TARGET_64BIT *concatditi3_3 and a !TARGET_64BIT
*concatsidi3_3. This allows us to add an additional alternative to the
64-bit version, enabling the register allocator to perform this
operation using SSE registers, which is implemented/split after reload
using vec_concatv2di.
To demonstrate the improvement, the test case from PR88873:
typedef struct { double x, y; } s_t;
s_t foo (s_t a, s_t b, s_t c)
{
return (s_t){ __builtin_fma(a.x, b.x, c.x), __builtin_fma (a.y, b.y, c.y) };
}
when compiled with -O2 -march=cascadelake, currently generates:
foo: vmovq %xmm2, -56(%rsp)
movq -56(%rsp), %rax
vmovq %xmm3, -48(%rsp)
vmovq %xmm4, -40(%rsp)
movq -48(%rsp), %rcx
vmovq %xmm5, -32(%rsp)
vmovq %rax, %xmm6
movq -40(%rsp), %rax
movq -32(%rsp), %rsi
vpinsrq $1, %rcx, %xmm6, %xmm6
vmovq %xmm0, -24(%rsp)
vmovq %rax, %xmm7
vmovq %xmm1, -16(%rsp)
vmovapd %xmm6, %xmm2
vpinsrq $1, %rsi, %xmm7, %xmm7
vfmadd132pd -24(%rsp), %xmm7, %xmm2
vmovapd %xmm2, -56(%rsp)
vmovsd -48(%rsp), %xmm1
vmovsd -56(%rsp), %xmm0
ret
with this change, we avoid many of the reloads via memory,
foo: vpunpcklqdq %xmm3, %xmm2, %xmm7
vpunpcklqdq %xmm1, %xmm0, %xmm6
vpunpcklqdq %xmm5, %xmm4, %xmm2
vmovdqa %xmm7, -24(%rsp)
vmovdqa %xmm6, %xmm1
movq -16(%rsp), %rax
vpinsrq $1, %rax, %xmm7, %xmm4
vmovapd %xmm4, %xmm6
vfmadd132pd %xmm1, %xmm2, %xmm6
vmovapd %xmm6, -24(%rsp)
vmovsd -16(%rsp), %xmm1
vmovsd -24(%rsp), %xmm0
ret
2023-07-20 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_move): Don't call
force_reg, to use SUBREG rather than create a new pseudo when
inserting DFmode fields into TImode with insvti_{high,low}part.
* config/i386/i386.md (*concat<mode><dwi>3_3): Split into two
define_insn_and_split...
(*concatditi3_3): 64-bit implementation. Provide alternative
that allows register allocation to use SSE registers that is
split into vec_concatv2di after reload.
(*concatsidi3_3): 32-bit implementation.
gcc/testsuite/ChangeLog
* gcc.target/i386/pr88873.c: New test case.
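The effect of doing the concatenation in SSE registers can be pictured at the source level with GCC's vector extension. In the sketch below, concat_v2df and foo_by_hand are hypothetical names, and a plain multiply-add again stands in for fma; it shows the shape of the code the vpunpcklqdq-based sequence corresponds to, not code taken from the patch.

```c
#include <assert.h>

typedef struct { double x, y; } s_t;
typedef double v2df __attribute__ ((vector_size (2 * sizeof (double))));

static_assert (sizeof (v2df) == 2 * sizeof (double),
               "v2df must hold exactly two doubles");

/* Build a V2DF from the two halves that the ABI delivers in separate
   SSE registers; conceptually a vec_concat of lo and hi.  */
v2df concat_v2df (double lo, double hi)
{
  return (v2df) { lo, hi };
}

/* foo written by hand in the shape the improved code follows: three
   concatenations, one vector multiply-add, then two lane extracts.  */
s_t foo_by_hand (s_t a, s_t b, s_t c)
{
  v2df va = concat_v2df (a.x, a.y);
  v2df vb = concat_v2df (b.x, b.y);
  v2df vc = concat_v2df (c.x, c.y);
  v2df r = va * vb + vc;
  return (s_t) { r[0], r[1] };
}
```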
* [Bug middle-end/88873] missing vectorization for decomposed operations on a vector type
[not found] <bug-88873-4@http.gcc.gnu.org/bugzilla/>
` (6 preceding siblings ...)
2023-07-20 8:25 ` cvs-commit at gcc dot gnu.org
@ 2023-08-04 15:24 ` cvs-commit at gcc dot gnu.org
7 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-08-04 15:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88873
--- Comment #10 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:
https://gcc.gnu.org/g:faa2202ee7fcf039b2016ce5766a2927526c5f78
commit r14-2997-gfaa2202ee7fcf039b2016ce5766a2927526c5f78
Author: Roger Sayle <roger@nextmovesoftware.com>
Date: Fri Aug 4 16:23:38 2023 +0100
i386: Split SUBREGs of SSE vector registers into vec_select insns.
This patch is the final piece in the series to improve the ABI issues
affecting PR 88873. The previous patches tackled inserting DFmode
values into V2DFmode registers, by introducing insvti_{low,high}part
patterns. This patch improves the extraction of DFmode values from
V2DFmode registers via TImode intermediates.
I'd initially thought this would require new extvti_{low,high}part
patterns to be defined, but all that's required is to recognize that
the SUBREG idioms produced by combine are equivalent to (forms of)
vec_select patterns. The target-independent middle-end can't be sure
that the appropriate vec_select instruction exists on the target,
hence doesn't canonicalize a SUBREG of a vector mode as a vec_select,
but the backend can provide a define_split stating where and when
this is useful, for example, considering whether the operand is in
memory, or whether !TARGET_SSE_MATH and the destination is i387.
For pr88873.c, gcc -O2 -march=cascadelake currently generates:
foo: vpunpcklqdq %xmm3, %xmm2, %xmm7
vpunpcklqdq %xmm1, %xmm0, %xmm6
vpunpcklqdq %xmm5, %xmm4, %xmm2
vmovdqa %xmm7, -24(%rsp)
vmovdqa %xmm6, %xmm1
movq -16(%rsp), %rax
vpinsrq $1, %rax, %xmm7, %xmm4
vmovapd %xmm4, %xmm6
vfmadd132pd %xmm1, %xmm2, %xmm6
vmovapd %xmm6, -24(%rsp)
vmovsd -16(%rsp), %xmm1
vmovsd -24(%rsp), %xmm0
ret
with this patch, we now generate:
foo: vpunpcklqdq %xmm1, %xmm0, %xmm6
vpunpcklqdq %xmm3, %xmm2, %xmm7
vpunpcklqdq %xmm5, %xmm4, %xmm2
vmovdqa %xmm6, %xmm1
vfmadd132pd %xmm7, %xmm2, %xmm1
vmovsd %xmm1, %xmm1, %xmm0
vunpckhpd %xmm1, %xmm1, %xmm1
ret
The improvement is even more dramatic when compared to the original
29 instructions shown in comment #8. GCC 13, for example, required
12 transfers to/from memory.
2023-08-04 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/sse.md (define_split): Convert highpart:DF extract
from V2DFmode register into a sse2_storehpd instruction.
(define_split): Likewise, convert lowpart:DF extract from V2DF
register into a sse2_storelpd instruction.
gcc/testsuite/ChangeLog
* gcc.target/i386/pr88873.c: Tweak to check for improved code.
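The equivalence between a SUBREG of a vector register and a vec_select can be illustrated in C, under the assumption of a little-endian target with __int128. lane_select and lane_via_ti are hypothetical names: the first picks a lane directly, the second goes through a 128-bit integer view, which is approximately the route the TImode intermediates take.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef double v2df __attribute__ ((vector_size (2 * sizeof (double))));

static_assert (sizeof (v2df) == sizeof (unsigned __int128),
               "v2df must be 128 bits wide");

/* vec_select style: pick lane i (0 or 1) directly.  */
double lane_select (v2df v, int i)
{
  return v[i];
}

/* TImode style: view the vector as a 128-bit integer, shift the wanted
   half down, and reinterpret the low 64 bits as a double.  Lane 0 is
   the low half on little-endian targets.  */
double lane_via_ti (v2df v, int i)
{
  unsigned __int128 t;
  uint64_t bits;
  double d;
  memcpy (&t, &v, sizeof t);
  bits = (uint64_t) (t >> (64 * i));
  memcpy (&d, &bits, sizeof d);
  return d;
}
```

Both functions return the same value for i equal to 0 or 1; the define_splits let the backend use the direct form, so the round trip through memory or a general register is no longer needed.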