public inbox for gcc-bugs@sourceware.org
* [Bug target/114514] New: v16qi >> 7 can be optimized with vpcmpgtb
From: liuhongt at gcc dot gnu.org @ 2024-03-28  9:38 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

            Bug ID: 114514
           Summary: v16qi >> 7 can be optimized with vpcmpgtb
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

v16qi
foo2 (v16qi a, v16qi b)
{
  return a >> 7;
}

It can be optimized with

        vpxor   xmm1, xmm1, xmm1
        vpcmpgtb        xmm0, xmm1, xmm0
        ret

Currently we generate (emulated with v16hi)

        movl    $16843009, %eax
        vpsraw  $7, %xmm0, %xmm0
        vmovd   %eax, %xmm1
        vpbroadcastd    %xmm1, %xmm1
        vpandn  %xmm1, %xmm0, %xmm0
        vpsubb  %xmm1, %xmm0, %xmm0
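The equivalence behind the proposed sequence can be checked directly: an arithmetic shift by 7 leaves every lane either all ones or zero, which is the same result as the signed comparison `0 > a`. The following is a minimal sketch, assuming a compiler that supports GCC-style vector extensions. The helper names `shift_by_7`, `compare_with_zero`, and `check_forms_agree` are illustrative and do not come from GCC.

```c
#include <assert.h>

/* 16 signed 8-bit lanes, matching the v16qi type used in the report. */
typedef signed char v16qi __attribute__((vector_size(16)));

/* Form from the report: arithmetic right shift of every lane by 7. */
static v16qi shift_by_7(v16qi a)
{
    return a >> 7;
}

/* Candidate replacement: 0 > a gives all ones (-1) in negative lanes,
   and 0 in every other lane. */
static v16qi compare_with_zero(v16qi a)
{
    v16qi zero = {0};
    return (v16qi)(zero > a);
}

/* Illustrative check: both forms agree for every possible byte value. */
static void check_forms_agree(void)
{
    for (int base = -128; base < 128; base += 16) {
        v16qi a = {0};
        for (int i = 0; i < 16; i++)
            a[i] = (signed char)(base + i);
        v16qi x = shift_by_7(a);
        v16qi y = compare_with_zero(a);
        for (int i = 0; i < 16; i++)
            assert(x[i] == y[i]);
    }
}
```

Calling `check_forms_agree()` from a test driver exercises all 256 byte values, 16 lanes at a time.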
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: pinskia at gcc dot gnu.org @ 2024-03-28 23:09 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
   Last reconfirmed|                            |2024-03-28
                 CC|                            |pinskia at gcc dot gnu.org
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Confirmed.

Note non sign bit can be improved too:
```
#define vector __attribute__((vector_size(16)))

typedef vector signed char v16qi;
typedef vector unsigned char v16uqi;

v16qi
foo2 (v16qi a, v16qi b)
{
  return a >> 6;
}

v16uqi
foo1 (v16uqi a, v16uqi b)
{
  return a >> 6;
}
```
clang produces:
```
_Z4foo2Dv16_aS_:
        psrlw   $6, %xmm0
        pand    .LCPI0_0(%rip), %xmm0 #{3,3,3,...}
        movdqa  .LCPI0_1(%rip), %xmm1 #{2,2,2,...}
        pxor    %xmm1, %xmm0
        psubb   %xmm1, %xmm0
        retq
_Z4foo1Dv16_hS_:
        psrlw   $6, %xmm0
        pand    .LCPI1_0(%rip), %xmm0 #{3,3,3,...}
        retq
```
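The clang sequence above can be modelled in scalar C to see why each step is needed. This is a sketch under stated assumptions: the function names are hypothetical, the 16-bit word stands for one `psrlw` lane holding two packed bytes, and the reference result relies on signed right shift being arithmetic, as it is in GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned case: one 16-bit lane holds two packed bytes.
   psrlw $6 shifts the whole word, so bits of the high byte leak into the
   low byte; pand with {3,3,...} keeps only the two valid bits per byte. */
static uint16_t lshr6_two_bytes(uint16_t w)
{
    return (uint16_t)((w >> 6) & 0x0303u);
}

/* Signed case for a single byte: after the logical shift and mask, the
   pxor/psubb pair with {2,2,...} sign-extends the 2-bit field. */
static int8_t ashr6_byte(uint8_t b)
{
    uint8_t t = (uint8_t)((b >> 6) & 3u);   /* 0..3 */
    t = (uint8_t)((t ^ 2u) - 2u);           /* bit 1 is the old sign bit */
    return (int8_t)t;  /* assumes the usual two's-complement conversion */
}

/* Illustrative exhaustive check of both models. */
static void check_shift_by_6(void)
{
    for (unsigned w = 0; w <= 0xFFFFu; w++) {
        uint16_t r = lshr6_two_bytes((uint16_t)w);
        assert((r & 0xFFu) == ((w & 0xFFu) >> 6));
        assert((unsigned)(r >> 8) == ((w >> 8) >> 6));
    }
    for (unsigned b = 0; b <= 0xFFu; b++) {
        /* Reference: arithmetic shift of the signed byte (assumed). */
        int expected = (int8_t)b >> 6;
        assert(ashr6_byte((uint8_t)b) == expected);
    }
}
```

The xor/subtract pair is the standard sign-extension trick for a narrow field: with the sign bit of the 2-bit result at value 2, `(t ^ 2) - 2` maps 2 and 3 to -2 and -1 and leaves 0 and 1 unchanged.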
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: pinskia at gcc dot gnu.org @ 2024-03-28 23:14 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
For non constant clang produces:
```
signedshiftright:
        movzbl  %dil, %eax
        movd    %eax, %xmm1
        psrlw   %xmm1, %xmm0
        pcmpeqd %xmm2, %xmm2
        psrlw   %xmm1, %xmm2
        movdqa  .LCPI0_0(%rip), %xmm3   # xmm3 = [32896,32896,32896,32896,32896,32896,32896,32896]
        psrlw   %xmm1, %xmm3
        psrlw   $8, %xmm2
        punpcklbw       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
        pshuflw $0, %xmm2, %xmm1        # xmm1 = xmm2[0,0,0,0,4,5,6,7]
        pshufd  $0, %xmm1, %xmm1        # xmm1 = xmm1[0,0,0,0]
        pand    %xmm1, %xmm0
        pxor    %xmm3, %xmm0
        psubb   %xmm3, %xmm0
        retq
unsignedshiftrtight:
        movzbl  %dil, %eax
        movd    %eax, %xmm1
        psrlw   %xmm1, %xmm0
        pcmpeqd %xmm2, %xmm2
        psrlw   %xmm1, %xmm2
        psrlw   $8, %xmm2
        punpcklbw       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
        pshuflw $0, %xmm2, %xmm1        # xmm1 = xmm2[0,0,0,0,4,5,6,7]
        pshufd  $0, %xmm1, %xmm1        # xmm1 = xmm1[0,0,0,0]
        pand    %xmm1, %xmm0
        retq
```

I am not sure which way is faster here though.
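For a variable count the same scheme applies, except that the mask and the sign constant are themselves produced by shifting: the mask is `0xFF >> n` and the sign constant is `0x80 >> n` in every byte (32896 is 0x8080). A minimal scalar sketch of the signed sequence follows, assuming a count `n` in the range 0..7; the function names are hypothetical and one 16-bit word stands for one `psrlw` lane.

```c
#include <assert.h>
#include <stdint.h>

/* Models the signed variable-count sequence on one 16-bit lane that holds
   two packed bytes. n must be in 0..7. */
static uint16_t ashr_two_bytes(uint16_t w, unsigned n)
{
    /* pcmpeqd; psrlw n; psrlw $8 leaves 0xFF >> n in the low byte. */
    uint16_t m = (uint16_t)((0xFFFFu >> n) >> 8);
    /* punpcklbw/pshuflw/pshufd broadcast that byte to every byte lane. */
    uint16_t mask = (uint16_t)(m | (m << 8));
    /* 0x8080 >> n: the shifted sign bit position in each byte. */
    uint16_t sign = (uint16_t)(0x8080u >> n);

    uint16_t t = (uint16_t)((w >> n) & mask);   /* psrlw; pand */
    t ^= sign;                                  /* pxor */

    /* psubb subtracts bytewise, with no borrow between the two bytes. */
    uint8_t lo = (uint8_t)((t & 0xFFu) - (sign & 0xFFu));
    uint8_t hi = (uint8_t)((t >> 8) - (sign >> 8));
    return (uint16_t)(lo | (hi << 8));
}

/* Illustrative exhaustive check against a per-byte arithmetic shift
   (signed right shift is assumed to be arithmetic). */
static void check_variable_shift(void)
{
    for (unsigned n = 0; n < 8; n++) {
        for (unsigned w = 0; w <= 0xFFFFu; w++) {
            uint8_t lo = (uint8_t)((int8_t)(w & 0xFFu) >> n);
            uint8_t hi = (uint8_t)((int8_t)(w >> 8) >> n);
            assert(ashr_two_bytes((uint16_t)w, n) == (uint16_t)(lo | (hi << 8)));
        }
    }
}
```

The unsigned sequence is the same model without the xor and subtract steps.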
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: liuhongt at gcc dot gnu.org @ 2024-03-29  1:03 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

--- Comment #3 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #1)
> Confirmed.
>
> Note non sign bit can be improved too:
> ```

I assume you're talking about broadcasting from an immediate versus loading
directly from the constant pool. GCC chooses the former; with -Os we can also
generate the latter. According to a microbenchmark, the former is better.
I also tried disabling broadcasting from an immediate and testing with
stress-ng vecmath; the performance is similar.
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: cvs-commit at gcc dot gnu.org @ 2024-05-16  0:41 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

--- Comment #4 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:0cc0956b3bb8bcbc9196075b9073a227d799e042

commit r15-529-g0cc0956b3bb8bcbc9196075b9073a227d799e042
Author: liuhongt <hongtao.liu@intel.com>
Date:   Tue May 14 18:39:54 2024 +0800

    Optimize ashift >> 7 to vpcmpgtb for vector int8.

    Since there is no corresponding instruction, the shift operation for
    vector int8 is implemented using the instructions for vector int16,
    but for some special shift counts, it can be transformed into vpcmpgtb.

    gcc/ChangeLog:

            PR target/114514
            * config/i386/i386-expand.cc
            (ix86_expand_vec_shift_qihi_constant): Optimize ashift >> 7 to
            vpcmpgtb.
            (ix86_expand_vecop_qihi_partial): Ditto.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr114514-shift.c: New test.
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: cvs-commit at gcc dot gnu.org @ 2024-05-16  0:41 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

--- Comment #5 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:090714e6cf8029f4ff8883dce687200024adbaeb

commit r15-530-g090714e6cf8029f4ff8883dce687200024adbaeb
Author: liuhongt <hongtao.liu@intel.com>
Date:   Wed May 15 10:56:24 2024 +0800

    Set d.one_operand_p to true when TARGET_SSSE3 in
    ix86_expand_vecop_qihi_partial.

    pshufb is available under TARGET_SSSE3, so
    ix86_expand_vec_perm_const_1 must return true when TARGET_SSSE3.

    With the patch under -march=x86-64-v2

    v8qi
    foo (v8qi a)
    {
      return a >> 5;
    }

    <       pmovsxbw        %xmm0, %xmm0
    <       psraw   $5, %xmm0
    <       pshufb  .LC0(%rip), %xmm0

    vs.

    >       movdqa  %xmm0, %xmm1
    >       pcmpeqd %xmm0, %xmm0
    >       pmovsxbw        %xmm1, %xmm1
    >       psrlw   $8, %xmm0
    >       psraw   $5, %xmm1
    >       pand    %xmm1, %xmm0
    >       packuswb        %xmm0, %xmm0

    Although there's a memory load from the constant pool, it should be
    better when it's inside a loop, because the load from the constant
    pool can be hoisted out. It's 1 instruction vs. 4 instructions.

    <       pshufb  .LC0(%rip), %xmm0

    vs.

    >       pcmpeqd %xmm0, %xmm0
    >       psrlw   $8, %xmm0
    >       pand    %xmm1, %xmm0
    >       packuswb        %xmm0, %xmm0

    gcc/ChangeLog:

            PR target/114514
            * config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial):
            Set d.one_operand_p to true when TARGET_SSSE3.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr114514-shufb.c: New test.
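The two narrowing strategies compared in this commit message can be modelled in scalar C to show that they produce the same bytes. This is only a sketch: the function names are hypothetical, the shuffle control `{0,2,4,...,14}` is an assumption about what `.LC0` contains, and signed right shift is assumed to be arithmetic.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 8 };

/* pmovsxbw + psraw $5: sign-extend each byte to 16 bits, then shift. */
static void widen_and_shift(const int8_t in[LANES], int16_t out[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = (int16_t)(in[i] >> 5);   /* arithmetic shift assumed */
}

/* New path: a single byte shuffle (pshufb) picks the low byte of each word.
   The words are laid out little-endian, so the assumed control is
   {0,2,4,...,14}. */
static void narrow_by_shuffle(const int16_t w[LANES], uint8_t out[LANES])
{
    uint8_t bytes[2 * LANES];
    for (int i = 0; i < LANES; i++) {
        bytes[2 * i]     = (uint8_t)((uint16_t)w[i] & 0xFFu);
        bytes[2 * i + 1] = (uint8_t)((uint16_t)w[i] >> 8);
    }
    for (int i = 0; i < LANES; i++)
        out[i] = bytes[2 * i];
}

/* packuswb saturates a signed word to the range 0..255. */
static uint8_t saturate_u8(int16_t v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Old path: build a 0x00FF mask (pcmpeqd + psrlw $8), pand, then packuswb.
   The mask is what keeps saturation from changing any value. */
static void narrow_by_pack(const int16_t w[LANES], uint8_t out[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = saturate_u8((int16_t)(w[i] & 0x00FF));
}

/* Illustrative check over every byte value, eight lanes at a time. */
static void check_narrowing_paths_agree(void)
{
    for (int base = -128; base < 128; base += LANES) {
        int8_t in[LANES];
        int16_t wide[LANES];
        uint8_t a[LANES], b[LANES];
        for (int i = 0; i < LANES; i++)
            in[i] = (int8_t)(base + i);
        widen_and_shift(in, wide);
        narrow_by_shuffle(wide, a);
        narrow_by_pack(wide, b);
        for (int i = 0; i < LANES; i++) {
            assert(a[i] == b[i]);
            assert(a[i] == (uint8_t)(in[i] >> 5));
        }
    }
}
```

Without the mask step, `packuswb` would clamp every negative word to 0, which is why the older sequence needs three extra instructions before the pack.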
* [Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb
From: liuhongt at gcc dot gnu.org @ 2024-05-16  0:42 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514

Hongtao Liu <liuhongt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #6 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
Fixed in GCC15.