public inbox for gcc-bugs@sourceware.org
* [Bug target/112384] New: a non-constant vec dup should be improved
@ 2023-11-04 23:09 pinskia at gcc dot gnu.org
2023-11-04 23:12 ` [Bug target/112384] " pinskia at gcc dot gnu.org
2023-11-06 8:21 ` rguenth at gcc dot gnu.org
0 siblings, 2 replies; 3+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-11-04 23:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112384
Bug ID: 112384
Summary: a non-constant vec dup should be improved
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Target: aarch64
Take:
```
#define vector __attribute__((vector_size(16)))

vector int f1(vector int t, int i)
{
  i &= 3;
  vector int tt = {i, i, i, i};
  vector int r = __builtin_shuffle(t, tt);
  return r;
}

vector int f2(vector int t, int i)
{
  i &= 3;
  i = t[i];
  vector int tt = {i, i, i, i};
  return tt;
}
```
Both of these produce suboptimal code.
f1 has:
```
dup v31.4s, w0
...
shl v31.4s, v31.4s, 2
tbl v31.16b, {v31.16b}, v28.16b
add v31.16b, v31.16b, v29.16b
```
But we could do better by combining the dup and the shl, shifting the scalar before duplicating it.
At the RTL level:
Trying 11 -> 12:
11: r98:V4SI=vec_duplicate(r92:SI)
REG_DEAD r92:SI
12: r101:V4SI=r98:V4SI<<const_vector
REG_DEAD r98:V4SI
Failed to match this instruction:
(set (reg:V4SI 101)
(ashift:V4SI (vec_duplicate:V4SI (reg/v:SI 92 [ iD.4390 ]))
(const_vector:V4SI [
(const_int 2 [0x2]) repeated x4
])))
Changing that into:
(set (reg:V4SI 101)
(vec_duplicate:V4SI (ashift:SI (reg/v:SI 92 [ iD.4390 ]) (const_int 2 [0x2]))))
would improve things.
The first tbl also looks removable.
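As a rough source-level illustration of the proposed rewrite, the sketch below compares shifting after the duplicate with shifting the scalar first. The names `v4si`, `shift_after_dup`, `shift_before_dup` and `shift_forms_agree` are made up for this example and do not appear in GCC; the code only shows that both shapes compute the same lanes for the masked index range, not how combine would implement the match.

```c
/* Hypothetical names throughout; a sketch, not GCC code. */
typedef int v4si __attribute__((vector_size(16)));

/* Shape combine currently fails to match:
   (ashift (vec_duplicate i) (const_vector 2 ...)). */
v4si shift_after_dup(int i)
{
    v4si tt = {i, i, i, i};
    return tt << 2;            /* vector shifted by a scalar count */
}

/* Proposed shape: (vec_duplicate (ashift i 2)). */
v4si shift_before_dup(int i)
{
    int s = i << 2;            /* one scalar shift */
    return (v4si){s, s, s, s}; /* then a single duplicate */
}

/* Returns 1 if both forms agree in every lane for i in 0..3,
   the range left after the i &= 3 in the test case. */
int shift_forms_agree(void)
{
    for (int i = 0; i < 4; i++) {
        v4si a = shift_after_dup(i);
        v4si b = shift_before_dup(i);
        for (int lane = 0; lane < 4; lane++)
            if (a[lane] != b[lane])
                return 0;
    }
    return 1;
}
```

With the shift done on the scalar side, the vector shl disappears and only the dup remains on the vector side.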
* [Bug target/112384] a non-constant vec dup should be improved
2023-11-04 23:09 [Bug target/112384] New: a non-constant vec dup should be improved pinskia at gcc dot gnu.org
@ 2023-11-04 23:12 ` pinskia at gcc dot gnu.org
2023-11-06 8:21 ` rguenth at gcc dot gnu.org
1 sibling, 0 replies; 3+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-11-04 23:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112384
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Oh, f2 just goes through memory.
GCC produces:
```
and x0, x0, 3
str q0, [sp]
ldr s0, [sp, x0, lsl 2]
dup v0.4s, v0.s[0]
```
clang (LLVM) produces:
```
mov x8, sp
and w9, w0, #0x3
str q0, [sp]
orr x8, x8, x9, lsl #2
ld1r { v0.4s }, [x8]
```
I don't know which is better; GCC's sequence might be the better one on
some microarchitectures.
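For readers less used to the assembly, here is a loose C model of what both compilers do for f2: spill the vector, load one element at a variable index, then broadcast it. `f2_via_memory` and `f2_via_memory_ok` are hypothetical names for this sketch, and the comments map to the GCC sequence only approximately.

```c
#include <string.h>

/* Hypothetical names; a model of the through-memory lowering. */
typedef int v4si __attribute__((vector_size(16)));

v4si f2_via_memory(v4si t, int i)
{
    int spill[4];
    memcpy(spill, &t, sizeof spill); /* roughly: str q0, [sp] */
    int e = spill[i & 3];            /* roughly: and, then indexed ldr */
    return (v4si){e, e, e, e};       /* roughly: dup */
}

/* Returns 1 if every lane of the result equals element i of the input. */
int f2_via_memory_ok(void)
{
    v4si t = {10, 20, 30, 40};
    for (int i = 0; i < 4; i++) {
        v4si r = f2_via_memory(t, i);
        for (int lane = 0; lane < 4; lane++)
            if (r[lane] != t[i])
                return 0;
    }
    return 1;
}
```

The clang sequence differs mainly in folding the load and the broadcast into one ld1r after computing the address separately.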
* [Bug target/112384] a non-constant vec dup should be improved
2023-11-04 23:09 [Bug target/112384] New: a non-constant vec dup should be improved pinskia at gcc dot gnu.org
2023-11-04 23:12 ` [Bug target/112384] " pinskia at gcc dot gnu.org
@ 2023-11-06 8:21 ` rguenth at gcc dot gnu.org
1 sibling, 0 replies; 3+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-11-06 8:21 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112384
Richard Biener <rguenth at gcc dot gnu.org> changed:
              What|Removed                     |Added
----------------------------------------------------------------------------
            Target|aarch64                     |aarch64, x86_64-*-*
    Ever confirmed|0                           |1
            Status|UNCONFIRMED                 |NEW
  Last reconfirmed|                            |2023-11-06
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed. Note for f2 the target needs to support .VEC_EXTRACT with variable
index.
OTOH we fail to transform
i_4 = VIEW_CONVERT_EXPR<int[4]>(t)[i_2];
tt_5 = {i_4, i_4, i_4, i_4};
into
tt_3 = {i_2, i_2, i_2, i_2};
r_6 = VEC_PERM_EXPR <t_4(D), t_4(D), tt_3>;
but the complication is that 't' isn't in SSA form (which is also why
it goes through memory here).
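The equivalence behind that transform can be checked at the source level. In the sketch below, `extract_then_splat` mirrors f2 and `splat_index_then_permute` mirrors f1; both names, and `perm_forms_agree`, are invented for this example. `__builtin_shuffle` is the GCC builtin used in the test case; the scalar loop is only a stand-in for compilers that lack it.

```c
/* Hypothetical names; a sketch of the f2 -> f1 equivalence. */
typedef int v4si __attribute__((vector_size(16)));

/* Extract one element at a variable index, then splat it (as in f2). */
v4si extract_then_splat(v4si t, int i)
{
    int e = t[i & 3];
    return (v4si){e, e, e, e};
}

/* Splat the index, then permute t by it (as in f1). */
v4si splat_index_then_permute(v4si t, int i)
{
    i &= 3;
    v4si sel = {i, i, i, i};
#if defined(__GNUC__) && !defined(__clang__)
    return __builtin_shuffle(t, sel);
#else
    /* Scalar model of the permute for compilers without the builtin. */
    v4si r = {0, 0, 0, 0};
    for (int lane = 0; lane < 4; lane++)
        r[lane] = t[sel[lane]];
    return r;
#endif
}

/* Returns 1 if both forms agree in every lane for i in 0..3. */
int perm_forms_agree(void)
{
    v4si t = {10, 20, 30, 40};
    for (int i = 0; i < 4; i++) {
        v4si a = extract_then_splat(t, i);
        v4si b = splat_index_then_permute(t, i);
        for (int lane = 0; lane < 4; lane++)
            if (a[lane] != b[lane])
                return 0;
    }
    return 1;
}
```

Whether the permute form is actually cheaper depends on the target, which is the open question in the rest of this comment.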
On x86_64 with SSE4.1 we get
f1:
.LFB0:
.cfi_startproc
andl $3, %edi
movd %edi, %xmm2
pshufd $0, %xmm2, %xmm1
pslld $2, %xmm1
pshufb .LC1(%rip), %xmm1
paddb .LC2(%rip), %xmm1
pshufb %xmm1, %xmm0
ret
f2:
.LFB1:
.cfi_startproc
andl $3, %edi
movaps %xmm0, -24(%rsp)
movd -24(%rsp,%rdi,4), %xmm1
pshufd $0, %xmm1, %xmm0
ret
I suspect the memory case is actually faster. With AVX512VL this
improves to
f1:
.LFB0:
.cfi_startproc
andl $3, %edi
vmovdqa %xmm0, %xmm1
vpbroadcastd %edi, %xmm0
vpermi2d %xmm1, %xmm1, %xmm0
ret
f2:
.LFB1:
.cfi_startproc
andl $3, %edi
vmovdqa %xmm0, -24(%rsp)
vpbroadcastd -24(%rsp,%rdi,4), %xmm0
ret
AVX2 has the odd
f1:
.LFB0:
.cfi_startproc
andl $3, %edi
vinserti128 $1, %xmm0, %ymm0, %ymm0
vmovd %edi, %xmm2
vpbroadcastd %xmm2, %xmm1
vinserti128 $1, %xmm1, %ymm1, %ymm1
vpermd %ymm0, %ymm1, %ymm0
vzeroupper
ret
where something feels wrong. f2 is similar to AVX512. It's not clear whether
the f1 IL is better in the end.