From mboxrd@z Thu Jan  1 00:00:00 1970
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
Date: Fri, 21 Jan 2022 08:28:35 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC|                            |rsandifo at gcc dot gnu.org,
                  |                            |vmakarov at gcc dot gnu.org
          Priority|P3                          |P2
            Target|                            |x86_64-*-*

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
With just SSE2 we get only the store vectorized:

bswap:
.LFB0:
        .cfi_startproc
        bswap   %rsi
        bswap   %rdi
        movq    %rsi, %xmm0
        movq    %rdi, %xmm1
        punpcklqdq      %xmm1, %xmm0
        movaps  %xmm0, -24(%rsp)
        movq    -24(%rsp), %rax
        movq    -16(%rsp), %rdx
        ret

The

  1 times vec_perm costs 4 in body
  BIT_FIELD_REF 1 times scalar_stmt costs 4 in body
  BIT_FIELD_REF 1 times scalar_stmt costs 4 in body

costs are what we cost for building the initial vector from the __int128
compared to splitting that into a low/high part.

  [local count: 1073741824]:
  _8 = BIT_FIELD_REF ;
  _11 = VIEW_CONVERT_EXPR(a_3(D));
  _13 = VIEW_CONVERT_EXPR(a_3(D));
  _12 = VEC_PERM_EXPR <_11, _13, { 1, 0 }>;
  _14 = VIEW_CONVERT_EXPR(_12);
  _15 = VEC_PERM_EXPR <_14, _14, { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 }>;
  _16 = VIEW_CONVERT_EXPR(_15);
  _1 = __builtin_bswap64 (_8);
  _10 = BIT_FIELD_REF ;
  _2 = __builtin_bswap64 (_10);
  MEM [(long long unsigned int *)&y] = _16;
  _7 = MEM [(char * {ref-all})&y];

The vectorizer doesn't realize that it could maybe move the hi/lo swap
across the two permutes to before the store; otherwise the code looks as
expected.  Yes, the vectorizer doesn't account for ABI details on the
function boundary, but it's very hard to do that in a sensible way.

Practically the worst part of the generated code is

        movq    %rdi, -24(%rsp)
        movq    %rsi, -16(%rsp)
        movdqa  -24(%rsp), %xmm0

because the store will fail to forward to the wider load, causing a huge
performance issue.  I wonder why we fail to merge those.  We face

(insn 26 25 27 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 0)
        (reg:DI 97)) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 97)
        (nil)))
(insn 27 26 7 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 8)
        (reg:DI 98)) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 98)
        (nil)))
(note 7 27 12 2 NOTE_INSN_FUNCTION_BEG)
(insn 12 7 14 2 (set (reg:V2DI 91)
        (vec_select:V2DI (subreg:V2DI (reg/v:TI 87 [ a ]) 0)
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) "t.c":6:12 7927 {*ssse3_palignrv2di_perm}
     (expr_list:REG_DEAD (reg/v:TI 87 [ a ])
        (nil)))

where

Trying 26, 27 -> 12:
   26: r87:TI#0=r97:DI
      REG_DEAD r97:DI
   27: r87:TI#8=r98:DI
      REG_DEAD r98:DI
   12: r91:V2DI=vec_select(r87:TI#0,parallel)
      REG_DEAD r87:TI
Can't combine i2 into i3

possibly because 27 is a partial def of r87.
We expand to

(insn 4 3 5 2 (set (reg:TI 88)
        (subreg:TI (reg:DI 89) 0)) "t.c":2:1 -1
     (nil))
(insn 5 4 6 2 (set (subreg:DI (reg:TI 88) 8)
        (reg:DI 90)) "t.c":2:1 -1
     (nil))
(insn 6 5 7 2 (set (reg/v:TI 87 [ a ])
        (reg:TI 88)) "t.c":2:1 -1
     (nil))
(note 7 6 10 2 NOTE_INSN_FUNCTION_BEG)
(insn 10 7 12 2 (set (reg:V2DI 82 [ _11 ])
        (subreg:V2DI (reg/v:TI 87 [ a ]) 0)) -1
     (nil))
(insn 12 10 13 2 (set (reg:V2DI 91)
        (vec_select:V2DI (reg:V2DI 82 [ _11 ])
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) "t.c":6:12 -1
     (nil))

initially from

  _11 = VIEW_CONVERT_EXPR(a_3(D));

and fwprop1 still sees

(insn 26 25 27 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 0)
        (reg:DI 89 [ a ])) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 95 [ a ])
        (nil)))
(insn 27 26 7 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 8)
        (reg:DI 90 [ a+8 ])) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 96 [+8 ])
        (nil)))
(note 7 27 10 2 NOTE_INSN_FUNCTION_BEG)
(insn 10 7 12 2 (set (reg:V2DI 82 [ _11 ])
        (subreg:V2DI (reg/v:TI 87 [ a ]) 0)) 1700 {movv2di_internal}
     (expr_list:REG_DEAD (reg/v:TI 87 [ a ])
        (nil)))

so that would be the best place to fix this up, realizing that reg 87 dies
after insn 10.

Richard - I'm sure we can construct a similar case for aarch64 where
argument passing and vector mode use cause spilling?  On aarch64 the
simplest testcase showing this is

typedef unsigned long long v2di __attribute__((vector_size(16)));
v2di bswap(__uint128_t a)
{
  return *(v2di *)&a;
}

which produces

bswap:
.LFB0:
        .cfi_startproc
        sub     sp, sp, #16
        .cfi_def_cfa_offset 16
        stp     x0, x1, [sp]
        ldr     q0, [sp]
        add     sp, sp, 16
        .cfi_def_cfa_offset 0
        ret

on arm for me.  Maybe the stp x0, x1 store can forward to the ldr load,
though, and I'm not sure there's another way to move x0/x1 to q0.

Providing LRA with a way to move TImode to VnDImode would of course also
avoid the spilling, but getting rid of the TImode pseudo when it's only
there as an intermediary for moving two DImode values to V2DImode sounds
like a useful transform to me.
combine is too late since fwprop has already merged the subreg with the
following shuffle for the larger testcase.

Alternatively LRA could also be taught to spill to %xmm by somehow telling
it of the vastly increased cost of the double-spill, single-reload
sequence?  But I guess it would still need to be taught how to reload
V2DImode from a {DImode, DImode} pair in %xmm regs ...