public inbox for gcc-bugs@sourceware.org
* [Bug target/114514] New: v16qi >> 7 can be optimized with vpcmpgtb
@ 2024-03-28 9:38 liuhongt at gcc dot gnu.org
2024-03-28 23:09 ` [Bug target/114514] " pinskia at gcc dot gnu.org
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: liuhongt at gcc dot gnu.org @ 2024-03-28 9:38 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514
Bug ID: 114514
Summary: v16qi >> 7 can be optimized with vpcmpgtb
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: liuhongt at gcc dot gnu.org
Target Milestone: ---
typedef signed char v16qi __attribute__((vector_size(16)));
v16qi
foo2 (v16qi a, v16qi b)
{
return a >> 7;
}
This can be optimized to:
vpxor xmm1, xmm1, xmm1
vpcmpgtb xmm0, xmm1, xmm0
ret
Currently we generate (emulated with v16hi):
movl $16843009, %eax
vpsraw $7, %xmm0, %xmm0
vmovd %eax, %xmm1
vpbroadcastd %xmm1, %xmm1
vpandn %xmm1, %xmm0, %xmm0
vpsubb %xmm1, %xmm0, %xmm0
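As a sanity check on the claim above, here is a minimal scalar sketch in C that models one 16-bit lane of the currently generated sequence and compares it with the proposed compare-against-zero. The function names are hypothetical, and the model assumes two's complement with arithmetic `>>` on negative values, as GCC provides; it is an illustration, not the code GCC emits.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: per-byte arithmetic shift right by 7.
   Assumes >> on a negative int sign-extends (GCC's documented behaviour). */
int8_t ref_sar7(int8_t x) { return (int8_t)(x >> 7); }

/* Model of vpxor + vpcmpgtb: a byte becomes all-ones when 0 > x, else zero. */
int8_t cmpgt_zero(int8_t x) { return (0 > x) ? (int8_t)-1 : (int8_t)0; }

/* Model of the current sequence on one 16-bit lane holding two bytes. */
uint16_t current_seq_lane(uint16_t w)
{
    /* vpsraw $7: arithmetic shift of the whole 16-bit lane. */
    uint16_t s = (uint16_t)((int16_t)w >> 7);
    /* vpandn with 0x01 in every byte: (~s) & 1 per byte. */
    uint16_t t = (uint16_t)(~s & 0x0101u);
    /* vpsubb with the same constant: per-byte t - 1. */
    uint8_t lo = (uint8_t)((t & 0xFF) - 1);
    uint8_t hi = (uint8_t)((t >> 8) - 1);
    return (uint16_t)((hi << 8) | lo);
}

/* Count disagreements over every possible 16-bit lane. */
int sar7_mismatches(void)
{
    int bad = 0;
    for (uint32_t w = 0; w <= 0xFFFFu; w++) {
        int8_t lo = (int8_t)(uint8_t)(w & 0xFFu);
        int8_t hi = (int8_t)(uint8_t)(w >> 8);
        uint16_t want = (uint16_t)(((uint8_t)cmpgt_zero(hi) << 8)
                                   | (uint8_t)cmpgt_zero(lo));
        if (current_seq_lane((uint16_t)w) != want)
            bad++;
        if (cmpgt_zero(lo) != ref_sar7(lo))
            bad++;
    }
    return bad;
}

/* Convenience wrapper: aborts if the two forms ever differ. */
void sar7_self_test(void) { assert(sar7_mismatches() == 0); }
```

Calling `sar7_self_test()` from a small `main` is enough to exercise all 65536 lane values.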
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: pinskia at gcc dot gnu.org @ 2024-03-28 23:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Severity|normal |enhancement
Last reconfirmed| |2024-03-28
CC| |pinskia at gcc dot gnu.org
Ever confirmed|0 |1
Status|UNCONFIRMED |NEW
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.
Note that shifts by other counts (not only the sign-bit case) can be improved too:
```
#define vector __attribute__((vector_size(16)))
typedef vector signed char v16qi;
typedef vector unsigned char v16uqi;
v16qi
foo2 (v16qi a, v16qi b)
{
return a >> 6;
}
v16uqi
foo1 (v16uqi a, v16uqi b)
{
return a >> 6;
}
```
clang produces:
```
_Z4foo2Dv16_aS_:
psrlw $6, %xmm0
pand .LCPI0_0(%rip), %xmm0 #{3,3,3,...}
movdqa .LCPI0_1(%rip), %xmm1 #{2,2,2,...}
pxor %xmm1, %xmm0
psubb %xmm1, %xmm0
retq
_Z4foo1Dv16_hS_:
psrlw $6, %xmm0
pand .LCPI1_0(%rip), %xmm0 #{3,3,3,...}
retq
```
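The signed clang sequence above relies on a sign-extension identity: after a logical shift that leaves a k-bit field, `(t ^ m) - m` with `m = 1 << (k - 1)` sign-extends that field. A per-byte sketch in C for the `>> 6` case follows; the function names are hypothetical, and the model assumes two's complement narrowing conversions as GCC implements them.

```c
#include <assert.h>
#include <stdint.h>

/* psrlw $6 followed by pand {3,3,3,...}: per byte this is a logical
   shift right by 6 that keeps only the two surviving bits. */
uint8_t lshr6(uint8_t x) { return (uint8_t)((x >> 6) & 3u); }

/* pxor {2,...} then psubb {2,...}: 2 is where the original sign bit
   lands after the shift, so (t ^ 2) - 2 sign-extends the 2-bit field. */
int8_t sar6_via_xor_sub(uint8_t x)
{
    uint8_t t = lshr6(x);
    return (int8_t)(uint8_t)((t ^ 2u) - 2u);
}

/* Reference: floor(x / 64), written without shifting a negative value. */
int8_t ref_sar6(int8_t x)
{
    int v = x;
    return (int8_t)(v >= 0 ? v / 64 : -((-v + 63) / 64));
}

/* Count disagreements over all 256 byte values. */
int sar6_mismatches(void)
{
    int bad = 0;
    for (int v = -128; v <= 127; v++) {
        if (sar6_via_xor_sub((uint8_t)v) != ref_sar6((int8_t)v))
            bad++;
    }
    return bad;
}

/* Convenience wrapper: aborts if the identity ever fails. */
void sar6_self_test(void) { assert(sar6_mismatches() == 0); }
```

The unsigned variant is just `lshr6`, which matches the shorter `foo1` sequence.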
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: pinskia at gcc dot gnu.org @ 2024-03-28 23:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
For a non-constant shift count, clang produces:
```
signedshiftright:
movzbl %dil, %eax
movd %eax, %xmm1
psrlw %xmm1, %xmm0
pcmpeqd %xmm2, %xmm2
psrlw %xmm1, %xmm2
movdqa .LCPI0_0(%rip), %xmm3 # xmm3 = [32896,32896,32896,32896,32896,32896,32896,32896]
psrlw %xmm1, %xmm3
psrlw $8, %xmm2
punpcklbw %xmm2, %xmm2 # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw $0, %xmm2, %xmm1 # xmm1 = xmm2[0,0,0,0,4,5,6,7]
pshufd $0, %xmm1, %xmm1 # xmm1 = xmm1[0,0,0,0]
pand %xmm1, %xmm0
pxor %xmm3, %xmm0
psubb %xmm3, %xmm0
retq
unsignedshiftrtight:
movzbl %dil, %eax
movd %eax, %xmm1
psrlw %xmm1, %xmm0
pcmpeqd %xmm2, %xmm2
psrlw %xmm1, %xmm2
psrlw $8, %xmm2
punpcklbw %xmm2, %xmm2 # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw $0, %xmm2, %xmm1 # xmm1 = xmm2[0,0,0,0,4,5,6,7]
pshufd $0, %xmm1, %xmm1 # xmm1 = xmm1[0,0,0,0]
pand %xmm1, %xmm0
retq
```
I am not sure which way is faster here though.
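To make the variable-count sequence easier to follow, here is a scalar sketch in C of one 16-bit lane, with each step mapped to the instruction it is meant to model. The names are hypothetical, the count is assumed to be in 0..7, and the sketch says nothing about which sequence is faster.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-bit lane (two bytes) of the signed variable-count sequence. */
uint16_t sar_bytes_in_lane(uint16_t w, unsigned n)
{
    /* psrlw %xmm1, %xmm0: word shift; bits of the high byte leak into
       the top n bits of the low byte. */
    uint16_t shifted = (uint16_t)(w >> n);
    /* pcmpeqd; psrlw n; psrlw $8: yields 0xFF >> n in the low byte. */
    uint8_t m = (uint8_t)((uint16_t)(0xFFFFu >> n) >> 8);
    /* punpcklbw / pshuflw / pshufd: broadcast that byte everywhere. */
    uint16_t mask = (uint16_t)(m * 0x0101u);
    /* movdqa 32896 (0x8080); psrlw n: shifted sign-bit position per byte. */
    uint16_t sign = (uint16_t)(0x8080u >> n);
    /* pand; pxor. */
    uint16_t t = (uint16_t)((shifted & mask) ^ sign);
    /* psubb: per-byte subtraction completes the sign extension. */
    uint8_t lo = (uint8_t)((t & 0xFFu) - (sign & 0xFFu));
    uint8_t hi = (uint8_t)((t >> 8) - (sign >> 8));
    return (uint16_t)((hi << 8) | lo);
}

/* Reference: floor(x / 2^n) for one signed byte, no negative shifts. */
uint8_t ref_sar_byte(uint8_t b, unsigned n)
{
    int v = (b < 128) ? (int)b : (int)b - 256;
    int r = (v >= 0) ? (v >> n) : -((-v + (1 << n) - 1) >> n);
    return (uint8_t)r;
}

/* Count disagreements over every lane value and every count 0..7. */
int var_shift_mismatches(void)
{
    int bad = 0;
    for (unsigned n = 0; n < 8; n++) {
        for (uint32_t w = 0; w <= 0xFFFFu; w++) {
            uint16_t want = (uint16_t)((ref_sar_byte((uint8_t)(w >> 8), n) << 8)
                                       | ref_sar_byte((uint8_t)(w & 0xFFu), n));
            if (sar_bytes_in_lane((uint16_t)w, n) != want)
                bad++;
        }
    }
    return bad;
}

/* Convenience wrapper: aborts on any disagreement. */
void var_shift_self_test(void) { assert(var_shift_mismatches() == 0); }
```

The unsigned sequence corresponds to stopping after the `pand` step, that is, returning `shifted & mask`.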
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: liuhongt at gcc dot gnu.org @ 2024-03-29 1:03 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514
--- Comment #3 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> Confirmed.
>
> Note non sign bit can be improved too:
I assume you're talking about broadcasting from an immediate versus loading
directly from the constant pool. GCC chooses the former; with -Os we can also
generate the latter. According to a microbenchmark, the former is better. I also
tried disabling broadcasting from an immediate and testing with stress-ng
vecmath; the performance is similar.
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: cvs-commit at gcc dot gnu.org @ 2024-05-16 0:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514
--- Comment #4 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:
https://gcc.gnu.org/g:0cc0956b3bb8bcbc9196075b9073a227d799e042
commit r15-529-g0cc0956b3bb8bcbc9196075b9073a227d799e042
Author: liuhongt <hongtao.liu@intel.com>
Date: Tue May 14 18:39:54 2024 +0800
Optimize ashift >> 7 to vpcmpgtb for vector int8.
Since there is no corresponding instruction, the shift operation for
vector int8 is implemented using the instructions for vector int16,
but for some special shift counts, it can be transformed into vpcmpgtb.
gcc/ChangeLog:
PR target/114514
* config/i386/i386-expand.cc
(ix86_expand_vec_shift_qihi_constant): Optimize ashift >> 7 to
vpcmpgtb.
(ix86_expand_vecop_qihi_partial): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr114514-shift.c: New test.
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: cvs-commit at gcc dot gnu.org @ 2024-05-16 0:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514
--- Comment #5 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:
https://gcc.gnu.org/g:090714e6cf8029f4ff8883dce687200024adbaeb
commit r15-530-g090714e6cf8029f4ff8883dce687200024adbaeb
Author: liuhongt <hongtao.liu@intel.com>
Date: Wed May 15 10:56:24 2024 +0800
Set d.one_operand_p to true when TARGET_SSSE3 in
ix86_expand_vecop_qihi_partial.
pshufb is available under TARGET_SSSE3, so
ix86_expand_vec_perm_const_1 must return true when TARGET_SSSE3.
With the patch under -march=x86-64-v2
v8qi
foo (v8qi a)
{
return a >> 5;
}
< pmovsxbw %xmm0, %xmm0
< psraw $5, %xmm0
< pshufb .LC0(%rip), %xmm0
vs.
> movdqa %xmm0, %xmm1
> pcmpeqd %xmm0, %xmm0
> pmovsxbw %xmm1, %xmm1
> psrlw $8, %xmm0
> psraw $5, %xmm1
> pand %xmm1, %xmm0
> packuswb %xmm0, %xmm0
Although there's a memory load from the constant pool, it should be
better when it's inside a loop, since the load from the constant pool can be
hoisted out. It's 1 instruction vs. 4 instructions.
< pshufb .LC0(%rip), %xmm0
vs.
> pcmpeqd %xmm0, %xmm0
> psrlw $8, %xmm0
> pand %xmm1, %xmm0
> packuswb %xmm0, %xmm0
gcc/ChangeLog:
PR target/114514
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial):
Set d.one_operand_p to true when TARGET_SSSE3.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr114514-shufb.c: New test.
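The two narrowing strategies compared in this commit message can be modelled in scalar C to see that they pick the same bytes. This is only a sketch: the names are hypothetical, the shuffle indices 0,2,4,... are an assumption about the contents of .LC0, and little-endian byte order is modelled explicitly.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 8 };

/* pmovsxbw + psraw $5: sign-extend to 16 bits, then floor-divide by 32. */
void widen_and_shift(const int8_t in[LANES], int16_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        int x = in[i];
        out[i] = (int16_t)(x >= 0 ? x / 32 : -((-x + 31) / 32));
    }
}

/* New sequence: pshufb selects the low byte of every 16-bit word. */
void narrow_pshufb(const int16_t w[LANES], uint8_t out[LANES])
{
    uint8_t bytes[2 * LANES];
    for (int i = 0; i < LANES; i++) {      /* little-endian byte view */
        uint16_t u = (uint16_t)w[i];
        bytes[2 * i]     = (uint8_t)(u & 0xFFu);
        bytes[2 * i + 1] = (uint8_t)(u >> 8);
    }
    for (int i = 0; i < LANES; i++)
        out[i] = bytes[2 * i];             /* assumed shuffle index 2*i */
}

/* Old sequence: pand with 0x00FF (built by pcmpeqd + psrlw $8), then
   packuswb, which saturates each signed word to the range 0..255. */
void narrow_pand_packuswb(const int16_t w[LANES], uint8_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        int16_t masked = (int16_t)((uint16_t)w[i] & 0x00FFu);
        out[i] = masked < 0 ? 0 : masked > 255 ? 255 : (uint8_t)masked;
    }
}

/* Count lanes where either strategy differs from the expected low byte. */
int narrowing_mismatches(void)
{
    int bad = 0;
    for (int x = -128; x <= 127; x++) {
        int8_t in[LANES];
        int16_t wide[LANES];
        uint8_t a[LANES], b[LANES];
        for (int i = 0; i < LANES; i++)
            in[i] = (int8_t)(uint8_t)(x + 37 * i);   /* vary the lanes */
        widen_and_shift(in, wide);
        narrow_pshufb(wide, a);
        narrow_pand_packuswb(wide, b);
        for (int i = 0; i < LANES; i++) {
            uint8_t want = (uint8_t)wide[i];
            if (a[i] != want || b[i] != want)
                bad++;
        }
    }
    return bad;
}

/* Convenience wrapper: aborts on any disagreement. */
void narrowing_self_test(void) { assert(narrowing_mismatches() == 0); }
```

Because the `pand` leaves every word in 0..255, the saturation in `packuswb` never triggers, which is why a single byte shuffle can replace the four-instruction tail.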
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: liuhongt at gcc dot gnu.org @ 2024-05-16 0:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514
Hongtao Liu <liuhongt at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Resolution|--- |FIXED
Status|NEW |RESOLVED
--- Comment #6 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
Fixed in GCC 15.