public inbox for gcc-bugs@sourceware.org
* [Bug target/114514] New: v16qi >> 7 can be optimized with vpcmpgtb
@ 2024-03-28 9:38 liuhongt at gcc dot gnu.org
2024-03-28 23:09 ` [Bug target/114514] " pinskia at gcc dot gnu.org
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-03-28 9:38 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514
Bug ID: 114514
Summary: v16qi >> 7 can be optimized with vpcmpgtb
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: liuhongt at gcc dot gnu.org
Target Milestone: ---
typedef signed char v16qi __attribute__ ((vector_size (16)));

v16qi
foo2 (v16qi a, v16qi b)
{
  return a >> 7;
}
It can be optimized to:
vpxor xmm1, xmm1, xmm1
vpcmpgtb xmm0, xmm1, xmm0
ret
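The equivalence behind this suggestion can be checked with a small sketch using GNU vector extensions: an arithmetic shift by 7 replicates the sign bit across the lane, which is the same all-ones/zero mask that a signed `0 > a` compare produces. The helper names (`shift_by_7`, `sign_mask_by_compare`, `check_sign_mask_forms`) are hypothetical and not part of the report.

```c
#include <assert.h>

typedef signed char v16qi __attribute__ ((vector_size (16)));

/* Arithmetic right shift by 7: each lane becomes 0 or -1
   (its sign bit replicated across the byte).  */
static v16qi
shift_by_7 (v16qi a)
{
  return a >> 7;
}

/* Signed compare 0 > a: true lanes are all ones (-1), false lanes are 0.
   This is the operation vpcmpgtb performs against a zeroed register.  */
static v16qi
sign_mask_by_compare (v16qi a)
{
  const v16qi zero = { 0 };
  return (v16qi) (zero > a);
}

/* Compare both forms over all 256 byte values, 16 lanes at a time.
   Returns the number of lanes checked.  */
static int
check_sign_mask_forms (void)
{
  int checked = 0;
  for (int base = -128; base < 128; base += 16)
    {
      v16qi a = { 0 };
      for (int i = 0; i < 16; i++)
        a[i] = (signed char) (base + i);
      v16qi x = shift_by_7 (a);
      v16qi y = sign_mask_by_compare (a);
      for (int i = 0; i < 16; i++, checked++)
        assert (x[i] == y[i]);
    }
  return checked;
}
```

Calling `check_sign_mask_forms ()` walks every possible byte value, so agreement here covers the whole input space of a single lane.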
Currently we generate (emulated with v16hi):
movl $16843009, %eax
vpsraw $7, %xmm0, %xmm0
vmovd %eax, %xmm1
vpbroadcastd %xmm1, %xmm1
vpandn %xmm1, %xmm0, %xmm0
vpsubb %xmm1, %xmm0, %xmm0
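As a reading aid, the sequence above can be modelled step by step with GNU vector extensions. This is only a sketch of what the three vector instructions appear to compute, assuming the 0x01010101 broadcast yields a vector of byte ones; the function names are hypothetical.

```c
#include <assert.h>

typedef signed char v16qi __attribute__ ((vector_size (16)));
typedef short v8hi __attribute__ ((vector_size (16)));

static v16qi
shift_by_7_via_words (v16qi a)
{
  const v16qi ones = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
  /* vpsraw $7: after the 16-bit shift, bit 0 of every byte holds
     that byte's original sign bit.  */
  v16qi t = (v16qi) ((v8hi) a >> 7);
  /* vpandn: keep only bit 0, inverted
     (1 for non-negative lanes, 0 for negative lanes).  */
  t = ~t & ones;
  /* vpsubb: 1 - 1 = 0 for non-negative lanes, 0 - 1 = -1 for negative.  */
  return t - ones;
}

/* Compare against the plain byte shift for every byte value.
   Returns the number of lanes checked.  */
static int
check_shift_by_7_via_words (void)
{
  int checked = 0;
  for (int base = -128; base < 128; base += 16)
    {
      v16qi a = { 0 };
      for (int i = 0; i < 16; i++)
        a[i] = (signed char) (base + i);
      v16qi x = shift_by_7_via_words (a);
      v16qi y = a >> 7;
      for (int i = 0; i < 16; i++, checked++)
        assert (x[i] == y[i]);
    }
  return checked;
}
```

The model makes the cost difference visible: three dependent vector operations plus a constant materialization, against one zeroing idiom and one compare in the proposed form.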
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: pinskia at gcc dot gnu.org @ 2024-03-28 23:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
   Last reconfirmed|                            |2024-03-28
                 CC|                            |pinskia at gcc dot gnu.org
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.
Note that the non-sign-bit case (a shift count other than 7) can be improved too:
```
#define vector __attribute__((vector_size(16)))
typedef vector signed char v16qi;
typedef vector unsigned char v16uqi;
v16qi
foo2 (v16qi a, v16qi b)
{
return a >> 6;
}
v16uqi
foo1 (v16uqi a, v16uqi b)
{
return a >> 6;
}
```
clang produces:
```
_Z4foo2Dv16_aS_:
psrlw $6, %xmm0
pand .LCPI0_0(%rip), %xmm0 #{3,3,3,...}
movdqa .LCPI0_1(%rip), %xmm1 #{2,2,2,...}
pxor %xmm1, %xmm0
psubb %xmm1, %xmm0
retq
_Z4foo1Dv16_hS_:
psrlw $6, %xmm0
pand .LCPI1_0(%rip), %xmm0 #{3,3,3,...}
retq
```
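The clang sequences above can be read as a 16-bit logical shift, a mask that discards the bits leaking in from the neighbouring byte, and (for the signed case) an xor/subtract pair that sign-extends the remaining 2-bit value. A minimal sketch of that reading with GNU vector extensions follows; the helper names are hypothetical.

```c
#include <assert.h>

typedef signed char v16qi __attribute__ ((vector_size (16)));
typedef unsigned char v16uqi __attribute__ ((vector_size (16)));
typedef unsigned short v8uhi __attribute__ ((vector_size (16)));

/* psrlw $6 + pand {3,3,...}: word shift, then drop the bits that
   crossed over from the other byte of the word.  */
static v16uqi
lshr6_via_words (v16uqi a)
{
  const v16uqi mask = { 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3 };
  return (v16uqi) ((v8uhi) a >> 6) & mask;
}

/* pxor {2,2,...} + psubb {2,2,...}: sign-extend the 2-bit result,
   where 2 = 0x80 >> 6 is the shifted position of the sign bit.  */
static v16qi
ashr6_via_words (v16qi a)
{
  const v16qi sign = { 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 };
  v16qi t = (v16qi) lshr6_via_words ((v16uqi) a);
  return (t ^ sign) - sign;
}

/* Compare both helpers with the plain byte shifts for every byte value.
   Returns the number of lanes checked.  */
static int
check_shift_by_6 (void)
{
  int checked = 0;
  for (int base = 0; base < 256; base += 16)
    {
      v16uqi u = { 0 };
      for (int i = 0; i < 16; i++)
        u[i] = (unsigned char) (base + i);
      v16qi s = (v16qi) u;
      v16uqi lu = lshr6_via_words (u), ru = u >> 6;
      v16qi ls = ashr6_via_words (s), rs = s >> 6;
      for (int i = 0; i < 16; i++, checked++)
        {
          assert (lu[i] == ru[i]);
          assert (ls[i] == rs[i]);
        }
    }
  return checked;
}
```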
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: pinskia at gcc dot gnu.org @ 2024-03-28 23:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
For a non-constant shift count, clang produces:
```
signedshiftright:
movzbl %dil, %eax
movd %eax, %xmm1
psrlw %xmm1, %xmm0
pcmpeqd %xmm2, %xmm2
psrlw %xmm1, %xmm2
movdqa .LCPI0_0(%rip), %xmm3 # xmm3 = [32896,32896,32896,32896,32896,32896,32896,32896]
psrlw %xmm1, %xmm3
psrlw $8, %xmm2
punpcklbw %xmm2, %xmm2 # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw $0, %xmm2, %xmm1 # xmm1 = xmm2[0,0,0,0,4,5,6,7]
pshufd $0, %xmm1, %xmm1 # xmm1 = xmm1[0,0,0,0]
pand %xmm1, %xmm0
pxor %xmm3, %xmm0
psubb %xmm3, %xmm0
retq
unsignedshiftrtight:
movzbl %dil, %eax
movd %eax, %xmm1
psrlw %xmm1, %xmm0
pcmpeqd %xmm2, %xmm2
psrlw %xmm1, %xmm2
psrlw $8, %xmm2
punpcklbw %xmm2, %xmm2 # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw $0, %xmm2, %xmm1 # xmm1 = xmm2[0,0,0,0,4,5,6,7]
pshufd $0, %xmm1, %xmm1 # xmm1 = xmm1[0,0,0,0]
pand %xmm1, %xmm0
retq
```
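For a variable count n, the signed sequence above seems to generalize the constant case: the byte mask becomes 0xFF >> n and the sign constant becomes 0x80 >> n (32896 is 0x8080, i.e. 0x80 in each byte of a word). The sketch below models that lane by lane, assuming n is in 0..7; it computes the mask and sign bytes directly instead of mirroring clang's shuffle sequence, and the function names are hypothetical.

```c
#include <assert.h>

typedef signed char v16qi __attribute__ ((vector_size (16)));
typedef unsigned char v16uqi __attribute__ ((vector_size (16)));
typedef unsigned short v8uhi __attribute__ ((vector_size (16)));

/* Byte-wise arithmetic shift right by n (0..7) built from a 16-bit
   logical shift, a mask, and an xor/subtract sign extension.  */
static v16qi
ashr_var_via_words (v16qi a, int n)
{
  v8uhi w = (v8uhi) a;
  v16uqi mask = { 0 }, sign = { 0 };
  for (int i = 0; i < 8; i++)
    w[i] = (unsigned short) (w[i] >> n);     /* psrlw %xmm1, %xmm0 */
  for (int i = 0; i < 16; i++)
    {
      mask[i] = (unsigned char) (0xFF >> n); /* bits that belong to the lane */
      sign[i] = (unsigned char) (0x80 >> n); /* shifted sign-bit position */
    }
  v16uqi t = (v16uqi) w & mask;              /* pand */
  return (v16qi) ((t ^ sign) - sign);        /* pxor, psubb */
}

/* Compare with a scalar arithmetic shift for every count and byte value.
   Returns the number of lanes checked.  */
static int
check_ashr_var (void)
{
  int checked = 0;
  for (int n = 0; n < 8; n++)
    for (int base = -128; base < 128; base += 16)
      {
        v16qi a = { 0 };
        for (int i = 0; i < 16; i++)
          a[i] = (signed char) (base + i);
        v16qi r = ashr_var_via_words (a, n);
        for (int i = 0; i < 16; i++, checked++)
          assert (r[i] == (signed char) (a[i] >> n));
      }
  return checked;
}
```

Dropping the xor/subtract step leaves the unsigned variant, which matches the shorter second listing.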
I am not sure which way is faster here though.
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: liuhongt at gcc dot gnu.org @ 2024-03-29 1:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514
--- Comment #3 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> Confirmed.
>
> Note non sign bit can be improved too:
I assume you're talking about broadcasting from an immediate versus loading
directly from the constant pool. GCC chooses the former; with -Os we can also
generate the latter. According to a microbenchmark, the former is better. I
also tried disabling broadcasting from an immediate and testing with stress-ng
vecmath; the performance is similar.