From: "munroesj at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/104124] New: Poor optimization for vector splat DW with small consts
Date: Wed, 19 Jan 2022 17:40:36 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124

            Bug ID: 104124
           Summary: Poor optimization for vector splat DW with small consts
           Product: gcc
           Version: 11.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

It looks to me like the compiler is seeing register pressure caused by loading
all the vector long long constants I need in my code. This is leaf code of a
size that can run out of volatile registers (no stack frame).
But this puts more pressure on volatile VRs, VSRs, and GPRs. Especially GPRs,
because it is loading from .rodata when it could (and should) use a vector
immediate.

For example:

vui64_t
__test_splatudi_0_V0 (void)
{
  return vec_splats ((unsigned long long) 0);
}

vi64_t
__test_splatudi_1_V0 (void)
{
  return vec_splats ((signed long long) -1);
}

These generate:

00000000000001a0 <__test_splatudi_0_V0>:
 1a0:   8c 03 40 10     vspltisw v2,0
 1a4:   20 00 80 4e     blr

00000000000001c0 <__test_splatudi_1_V0>:
 1c0:   8c 03 5f 10     vspltisw v2,-1
 1c4:   20 00 80 4e     blr
        ...

But other cases that could use immediates, like:

vui64_t
__test_splatudi_12_V0 (void)
{
  return vec_splats ((unsigned long long) 12);
}

GCC 9/10/11 generates for power8:

0000000000000170 <__test_splatudi_12_V0>:
 170:   00 00 4c 3c     addis   r2,r12,0
                        170: R_PPC64_REL16_HA   .TOC.
 174:   00 00 42 38     addi    r2,r2,0
                        174: R_PPC64_REL16_LO   .TOC.+0x4
 178:   00 00 22 3d     addis   r9,r2,0
                        178: R_PPC64_TOC16_HA   .rodata.cst16+0x20
 17c:   00 00 29 39     addi    r9,r9,0
                        17c: R_PPC64_TOC16_LO   .rodata.cst16+0x20
 180:   ce 48 40 7c     lvx     v2,0,r9
 184:   20 00 80 4e     blr

and for Power9:

0000000000000000 <__test_splatisd_12_PWR9>:
   0:   d1 62 40 f0     xxspltib vs34,12
   4:   02 16 58 10     vextsb2d v2,v2
   8:   20 00 80 4e     blr

So why can't the power8 target generate:

00000000000000f0 <__test_splatudi_12_V1>:
  f0:   8c 03 4c 10     vspltisw v2,12
  f4:   4e 16 40 10     vupkhsw v2,v2
  f8:   20 00 80 4e     blr

This is 4 cycles vs 9 (best case, and it is always 9 cycles because GCC does
not exploit immediate fusion). In fact GCC 8 (AT12) does this.

So I tried defining my own vec_splatudi:

vi64_t
__test_splatudi_12_V1 (void)
{
  vi32_t vwi = vec_splat_s32 (12);
  return vec_unpackl (vwi);
}

Which generates the <__test_splatudi_12_V1> sequence above for GCC 8. But for
GCC 9/10/11 it generates:

0000000000000110 <__test_splatudi_12_V1>:
 110:   00 00 4c 3c     addis   r2,r12,0
                        110: R_PPC64_REL16_HA   .TOC.
 114:   00 00 42 38     addi    r2,r2,0
                        114: R_PPC64_REL16_LO   .TOC.+0x4
 118:   00 00 22 3d     addis   r9,r2,0
                        118: R_PPC64_TOC16_HA   .rodata.cst16+0x20
 11c:   00 00 29 39     addi    r9,r9,0
                        11c: R_PPC64_TOC16_LO   .rodata.cst16+0x20
 120:   ce 48 40 7c     lvx     v2,0,r9
 124:   20 00 80 4e     blr

Again! GCC has gone out of its way to be this clever! Badly! While it can be
appropriately clever for power9!

I have tried many permutations of this and the only way I have found to prevent
this (GCC 9/10/11) cleverness is to use inline __asm (which has other bad side
effects).
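
For reference, here is a rough portable C model of why the two-instruction
power8 sequence (vspltisw followed by vupkhsw) produces the doubleword splat,
and for which constants it applies. It is only a sketch: it uses no AltiVec
intrinsics, all function names are hypothetical and chosen for illustration,
and it assumes the usual semantics of a 5-bit signed splat immediate
(-16 to 15) and a sign-extending word-to-doubleword unpack.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: does the constant fit the 5-bit signed immediate
   field assumed for vspltisw (range -16..15)? */
static bool fits_splat_imm5(int64_t c)
{
    return c >= -16 && c <= 15;
}

/* Model of vspltisw: replicate the sign-extended immediate into all
   four 32-bit word elements. */
static void model_vspltisw(int32_t out[4], int imm5)
{
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)imm5;
}

/* Model of a signed word unpack (vupkhsw/vupklsw): sign-extend two
   32-bit words into two 64-bit doublewords. */
static void model_unpack_sw(int64_t out[2], const int32_t in[2])
{
    out[0] = (int64_t)in[0];
    out[1] = (int64_t)in[1];
}

/* Model of the whole sequence. Returns false when the constant does not
   fit the immediate, i.e. when some other strategy would be needed. */
static bool model_splat_dw(int64_t out[2], int64_t c)
{
    int32_t words[4];

    if (!fits_splat_imm5(c))
        return false;
    model_vspltisw(words, (int)c);
    /* All four words are equal, so either half gives the same result. */
    model_unpack_sw(out, words);
    return true;
}
```

With this model, 12 and -1 both come out as a doubleword splat, while 16 and
-17 are rejected because they fall outside the assumed immediate range.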