* [Bug tree-optimization/94908] Failure to optimally optimize certain shuffle patterns
2020-05-01 19:02 [Bug tree-optimization/94908] New: Failure to optimally optimize certain shuffle patterns gabravier at gmail dot com
@ 2020-05-01 21:24 ` glisse at gcc dot gnu.org
2020-05-04 6:30 ` rguenth at gcc dot gnu.org
` (10 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: glisse at gcc dot gnu.org @ 2020-05-01 21:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908
--- Comment #1 from Marc Glisse <glisse at gcc dot gnu.org> ---
Even if we write __builtin_shuffle, the vector lowering pass turns it into the
same code (constructor of BIT_FIELD_REFs), which seems to indicate that the
target does not handle this pattern.
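For reference, the kind of shuffle under discussion can be sketched as below. The original test case is not quoted in this part of the thread, so the names f_ctor and f_shuffle, the stand-in body of g, and the mask are assumptions reconstructed from the later comments (take element 1 of g() and place it in element 0 of a). Both forms rely on GCC vector extensions.

```c
/* GCC vector extensions; __builtin_shuffle is GCC-specific. */
typedef float v4sf __attribute__((vector_size(16)));
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical stand-in for the external function g() in the report. */
v4sf g(void)
{
    return (v4sf){10.0f, 11.0f, 12.0f, 13.0f};
}

/* Element-wise form: a vector constructor built from lane extracts. */
v4sf f_ctor(v4sf a)
{
    v4sf b = g();
    return (v4sf){b[1], a[1], a[2], a[3]};
}

/* Explicit permute: mask indices 0-3 select from a, 4-7 select from b. */
v4sf f_shuffle(v4sf a)
{
    v4sf b = g();
    const v4si mask = {5, 1, 2, 3};
    return __builtin_shuffle(a, b, mask);
}
```

Both functions compute the same vector, which is why the lowering pass can end up with the same constructor of BIT_FIELD_REFs for either spelling.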
* [Bug tree-optimization/94908] Failure to optimally optimize certain shuffle patterns
From: rguenth at gcc dot gnu.org @ 2020-05-04 6:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Last reconfirmed| |2020-05-04
Ever confirmed|0 |1
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Hmm, ideally it would be extract g()[1], insert at a[0]. But yes, we're not
trying to split an unhandled shuffle into two but leave that for targets
to sort out ... (x86 has code for many 3-insn shuffles, for example).
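The "extract, then insert" form can be written directly with GCC vector subscripting. This is only a sketch of the intended operation; the function name is made up for illustration, and b stands for the value returned by g().

```c
typedef float v4sf __attribute__((vector_size(16)));

/* Extract lane 1 of b and insert it at lane 0 of a.
   The other three lanes of a are kept unchanged. */
v4sf insert_lane1_into_lane0(v4sf a, v4sf b)
{
    a[0] = b[1];
    return a;
}
```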
* [Bug tree-optimization/94908] Failure to optimally optimize certain shuffle patterns
From: gabravier at gmail dot com @ 2023-02-17 20:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908
--- Comment #3 from Gabriel Ravier <gabravier at gmail dot com> ---
Looks like this gives much better output now.
* [Bug target/94908] Failure to optimally optimize certain shuffle patterns
From: pinskia at gcc dot gnu.org @ 2023-02-17 21:05 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|normal |enhancement
See Also| |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53346,
| |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93720
Component|tree-optimization |target
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I think this was a target issue and maybe should be split into a couple of
different bugs.
For GCC 8, aarch64 produces:
dup v0.4s, v0.s[1]
ldr q1, [sp, 16]
ldp x29, x30, [sp], 32
ins v0.s[1], v1.s[1]
ins v0.s[2], v1.s[2]
ins v0.s[3], v1.s[3]
GCC 9/10 produced the following (which is OK, though it could be improved, which
GCC 11 did):
adrp x0, .LC0
ldr q1, [sp, 16]
ldr q2, [x0, #:lo12:.LC0]
ldp x29, x30, [sp], 32
tbl v0.16b, {v0.16b - v1.16b}, v2.16b
For GCC 11+, aarch64 produces:
ldr q1, [sp, 16]
ins v1.s[0], v0.s[1]
mov v0.16b, v1.16b
Which means for aarch64, this was changed in GCC 10 and fixed fully for GCC 11
(by r11-2192-gc9c87e6f9c795b aka PR 93720 which was my patch in fact).
For x86_64, the trunk produces:
movaps (%rsp), %xmm1
addq $24, %rsp
shufps $85, %xmm1, %xmm0
shufps $232, %xmm1, %xmm0
While GCC 12 produces:
movaps (%rsp), %xmm1
addq $24, %rsp
shufps $85, %xmm0, %xmm0
movaps %xmm1, %xmm2
shufps $85, %xmm1, %xmm2
movaps %xmm2, %xmm3
movaps %xmm1, %xmm2
unpckhps %xmm1, %xmm2
unpcklps %xmm3, %xmm0
shufps $255, %xmm1, %xmm1
unpcklps %xmm1, %xmm2
movlhps %xmm2, %xmm0
This was changed with r13-2843-g3db8e9c2422d92 (aka PR 53346).
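To check that the two-shufps trunk sequence really computes the wanted permutation, here is a small scalar model. It assumes, as the surrounding listing suggests, that %xmm1 holds a (reloaded from the stack) and %xmm0 holds the result of g(). The struct vec4 and the function names are illustrative only.

```c
#include <stdint.h>

typedef struct { float e[4]; } vec4;

/* Scalar model of SHUFPS in AT&T operand order: shufps $imm, src, dst.
   The low two result lanes come from dst, the high two from src. */
vec4 shufps_model(uint8_t imm, vec4 src, vec4 dst)
{
    vec4 r;
    r.e[0] = dst.e[imm & 3];
    r.e[1] = dst.e[(imm >> 2) & 3];
    r.e[2] = src.e[(imm >> 4) & 3];
    r.e[3] = src.e[(imm >> 6) & 3];
    return r;
}

/* The trunk sequence, with a in xmm1 and the result of g() in xmm0. */
vec4 two_shufps(vec4 a, vec4 g_result)
{
    vec4 xmm0 = shufps_model(85, a, g_result);  /* {g[1], g[1], a[1], a[1]} */
    xmm0 = shufps_model(232, a, xmm0);          /* {g[1], a[1], a[2], a[3]} */
    return xmm0;
}
```

Under that assumption the sequence is correct, but it still takes two dependent shuffles where a single lane insert would do.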
For powerpc64le, it looks ok for GCC 11:
addis 9,2,.LC0@toc@ha
addi 1,1,48
addi 9,9,.LC0@toc@l
li 0,-16
lvx 0,0,9
vperm 2,31,2,0
Both the x86_64 and the PowerPC PERM implementations could be improved to
support the insertion, like the aarch64 backend does.
* [Bug target/94908] Failure to optimally optimize certain shuffle patterns
From: ubizjak at gmail dot com @ 2023-02-18 9:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908
Uroš Bizjak <ubizjak at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |crazylht at gmail dot com
--- Comment #5 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Andrew Pinski from comment #4)
> Both the x86_64 and the PowerPC PERM implementations could be improved to
> support the insertion, like the aarch64 backend does.
Cc Hongtao for x86 part.
* [Bug target/94908] Failure to optimally optimize certain shuffle patterns
From: crazylht at gmail dot com @ 2023-02-20 3:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908
--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---
Yes, insertps can select any element from src and insert it into any place of
the dest. Under SSE4.1, x86 can generate:
vinsertps xmm0, xmm1, xmm0, 64 # xmm0 = xmm0[1],xmm1[1,2,3]
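The immediate 64 follows from the INSERTPS encoding: bits 7:6 pick the source element, bits 5:4 pick the destination element, and bits 3:0 are a zero mask. A small scalar sketch of the register form is below; the helper names are made up for illustration.

```c
#include <stdint.h>

/* Build the INSERTPS immediate: bits 7:6 select the source element,
   bits 5:4 select the destination element, bits 3:0 are the zero mask. */
uint8_t insertps_imm8(unsigned src_elt, unsigned dst_elt, unsigned zmask)
{
    return (uint8_t)(((src_elt & 3u) << 6) | ((dst_elt & 3u) << 4)
                     | (zmask & 0xFu));
}

/* Scalar model of the register form: overwrite one lane of dst with one
   lane of src, then zero every lane whose bit is set in the zero mask. */
void insertps_model(float dst[4], const float src[4], uint8_t imm)
{
    dst[(imm >> 4) & 3] = src[(imm >> 6) & 3];
    for (int i = 0; i < 4; i++)
        if (imm & (1u << i))
            dst[i] = 0.0f;
}
```

With source element 1, destination element 0 and an empty zero mask, insertps_imm8 yields 64, the value used in the instruction above.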
* [Bug target/94908] Failure to optimally optimize certain shuffle patterns
From: ubizjak at gmail dot com @ 2023-03-08 13:19 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908
--- Comment #7 from Uroš Bizjak <ubizjak at gmail dot com> ---
Created attachment 54607
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54607&action=edit
Proposed patch
Patch in testing.
Attached patch produces (-O2 -msse4.1):
f:
subq $24, %rsp
xorl %eax, %eax
vmovaps %xmm0, (%rsp)
call g
vmovaps (%rsp), %xmm1
addq $24, %rsp
vinsertps $64, %xmm0, %xmm1, %xmm0
ret
* [Bug target/94908] Failure to optimally optimize certain shuffle patterns
From: crazylht at gmail dot com @ 2023-03-09 4:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908
--- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Uroš Bizjak from comment #7)
> Created attachment 54607 [details]
> Proposed patch
>
> Patch in testing.
>
> Attached patch produces (-O2 -msse4.1):
>
> f:
> subq $24, %rsp
> xorl %eax, %eax
> vmovaps %xmm0, (%rsp)
> call g
> vmovaps (%rsp), %xmm1
> addq $24, %rsp
> vinsertps $64, %xmm0, %xmm1, %xmm0
> ret
I'm thinking of something like below so it can be matched both by
expand_vselect_vconcat in ix86_expand_vec_perm_const_1 and patterns created by
pass_combine (theoretically).
+(define_insn_and_split "*sse4_1_insertps_1"
+ [(set (match_operand:VI4F_128 0 "register_operand")
+ (vec_select:VI4F_128
+ (vec_concat:<ssedoublevecmode>
+ (match_operand:VI4F_128 1 "register_operand")
+ (match_operand:VI4F_128 2 "register_operand"))
+ (match_parallel 3 "insertps_parallel"
+ [(match_operand 4 "const_int_operand")])))]
+ "TARGET_SSE4_1 && ix86_pre_reload_split ()"
+ "#"
+ "&& 1"
* [Bug target/94908] Failure to optimally optimize certain shuffle patterns
From: ubizjak at gmail dot com @ 2023-03-09 14:27 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908
--- Comment #9 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Hongtao.liu from comment #8)
> I'm thinking of something like below so it can be matched both by
> expand_vselect_vconcat in ix86_expand_vec_perm_const_1 and patterns created
> by pass_combine (theoretically).
>
> +(define_insn_and_split "*sse4_1_insertps_1"
> + [(set (match_operand:VI4F_128 0 "register_operand")
> + (vec_select:VI4F_128
> + (vec_concat:<ssedoublevecmode>
> + (match_operand:VI4F_128 1 "register_operand")
> + (match_operand:VI4F_128 2 "register_operand"))
> + (match_parallel 3 "insertps_parallel"
> + [(match_operand 4 "const_int_operand")])))]
> + "TARGET_SSE4_1 && ix86_pre_reload_split ()"
> + "#"
> + "&& 1"
If you want to go that way, then the resulting pattern should look like a
combination of:
(define_insn "*vec_setv4sf_sse4_1"
[(set (match_operand:V4SF 0 "register_operand" "=Yr,*x,v")
(vec_merge:V4SF
(vec_duplicate:V4SF
(match_operand:SF 2 "nonimmediate_operand" "Yrm,*xm,vm"))
(match_operand:V4SF 1 "register_operand" "0,0,v")
(match_operand:SI 3 "const_0_to_3_operand")))]
"TARGET_SSE4_1
&& ((unsigned) exact_log2 (INTVAL (operands[3]))
< GET_MODE_NUNITS (V4SFmode))"
(define_insn_and_split "*sse4_1_extractps"
[(set (match_operand:SF 0 "nonimmediate_operand" "=rm,rm,rm,Yv,Yv")
(vec_select:SF
(match_operand:V4SF 1 "register_operand" "Yr,*x,v,0,v")
(parallel [(match_operand:SI 2 "const_0_to_3_operand")])))]
"TARGET_SSE4_1"
where the latter pattern propagates into the former in place of operand 2. This
combination is created only for scalar insert of an extracted value, so I doubt
it is ever created...
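The condition on the vec_setv4sf pattern quoted above treats operand 3 as a one-hot lane mask: exact_log2 gives the lane number for a power of two and -1 otherwise, and the cast to unsigned turns -1 into a huge value that fails the comparison. The following is a standalone model of that check, not GCC's own code; both function names are invented for the example.

```c
/* Model of exact_log2: log2 of x if x is a power of two, else -1. */
int exact_log2_model(unsigned long x)
{
    if (x == 0 || (x & (x - 1)) != 0)
        return -1;
    int n = 0;
    while (x >>= 1)
        n++;
    return n;
}

/* Mirrors the insn condition for a 4-lane vector: the mask must be
   one-hot and name one of the four lanes. A result of -1 converts to a
   very large unsigned value, so it fails the comparison. */
int valid_v4sf_merge_mask(long mask)
{
    return (unsigned) exact_log2_model((unsigned long) mask) < 4u;
}
```

So masks 1, 2, 4 and 8 are accepted, while 0, 3 or 16 are rejected.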
* [Bug target/94908] Failure to optimally optimize certain shuffle patterns
From: ubizjak at gmail dot com @ 2023-03-09 14:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908
Uroš Bizjak <ubizjak at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #54607|0 |1
is obsolete| |
--- Comment #10 from Uroš Bizjak <ubizjak at gmail dot com> ---
Created attachment 54624
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54624&action=edit
Proposed patch v2
New version with some code shamelessly stolen from aarch64.
* [Bug target/94908] Failure to optimally optimize certain shuffle patterns
From: cvs-commit at gcc dot gnu.org @ 2023-04-18 16:59 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908
--- Comment #11 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Uros Bizjak <uros@gcc.gnu.org>:
https://gcc.gnu.org/g:95b99e47f4f2df2d0c5680f45e3ec0a3170218ad
commit r14-47-g95b99e47f4f2df2d0c5680f45e3ec0a3170218ad
Author: Uros Bizjak <ubizjak@gmail.com>
Date: Tue Apr 18 17:50:37 2023 +0200
i386: Improve permutations with INSERTPS instruction [PR94908]
INSERTPS can select any element from src and insert into any place
of the dest. For SSE4.1 targets, compiler can generate e.g.
insertps $64, %xmm0, %xmm1
to insert element 1 from %xmm0 to element 0 of %xmm1.
gcc/ChangeLog:
PR target/94908
* config/i386/i386-builtin.def (__builtin_ia32_insertps128):
Use CODE_FOR_sse4_1_insertps_v4sf.
* config/i386/i386-expand.cc (expand_vec_perm_insertps): New.
(expand_vec_perm_1): Call expand_vec_perm_insertps.
* config/i386/i386.md ("unspec"): Declare UNSPEC_INSERTPS here.
* config/i386/mmx.md (mmxscalarmode): New mode attribute.
(@sse4_1_insertps_<mode>): New insn pattern.
* config/i386/sse.md (@sse4_1_insertps_<mode>): Macroize insn
pattern from sse4_1_insertps using VI4F_128 mode iterator.
gcc/testsuite/ChangeLog:
PR target/94908
* gcc.target/i386/pr94908.c: New test.
* gcc.target/i386/sse4_1-insertps-5.c: New test.
* gcc.target/i386/vperm-v4sf-2-sse4.c: New test.
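As a rough illustration of what such a permutation recognizer has to decide, the sketch below checks whether a constant 4-element permutation is "one operand kept in place, with a single lane taken from elsewhere" and, if so, derives the INSERTPS immediate. This is a hypothetical helper written for this note, not the code of expand_vec_perm_insertps; the function name and out-parameters are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not GCC's implementation.
   perm[i] in 0..3 selects op0[perm[i]]; 4..7 selects op1[perm[i] - 4]. */
bool match_single_insert(const unsigned perm[4], unsigned *base_op,
                         unsigned *src_op, uint8_t *imm8)
{
    for (unsigned base = 0; base < 2; base++) {
        int mismatch = -1;
        unsigned count = 0;

        /* Count lanes that are not the identity lane of this operand. */
        for (unsigned i = 0; i < 4; i++) {
            if (perm[i] != i + 4 * base) {
                mismatch = (int) i;
                count++;
            }
        }
        if (count != 1)
            continue;

        *base_op = base;                   /* operand kept in place */
        *src_op = perm[mismatch] / 4;      /* operand supplying the lane */
        *imm8 = (uint8_t)(((perm[mismatch] % 4) << 6)
                          | ((unsigned) mismatch << 4));
        return true;
    }
    return false;
}
```

For the permutation {5, 1, 2, 3} discussed in this bug, the sketch reports operand 0 as the base, operand 1 as the source, and an immediate of 64.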
* [Bug target/94908] Failure to optimally optimize certain shuffle patterns
From: ubizjak at gmail dot com @ 2023-04-18 17:01 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908
--- Comment #12 from Uroš Bizjak <ubizjak at gmail dot com> ---
Implemented also for x86.