From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/104151] [9/10/11/12 Regression] x86: excessive code generated for 128-bit byteswap
Date: Fri, 21 Jan 2022 08:28:35 +0000
Message-ID: <bug-104151-4-WBJNOFyN7e@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-104151-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org
           Priority|P3                          |P2
             Target|                            |x86_64-*-*

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
With just SSE2 we get the store vectorized only:

bswap:
.LFB0:
        .cfi_startproc
        bswap   %rsi
        bswap   %rdi
        movq    %rsi, %xmm0
        movq    %rdi, %xmm1
        punpcklqdq      %xmm1, %xmm0
        movaps  %xmm0, -24(%rsp)
        movq    -24(%rsp), %rax
        movq    -16(%rsp), %rdx
        ret

the

<unknown> 1 times vec_perm costs 4 in body
BIT_FIELD_REF <a_3(D), 64, 64> 1 times scalar_stmt costs 4 in body
BIT_FIELD_REF <a_3(D), 64, 0> 1 times scalar_stmt costs 4 in body

costs are what we cost building the initial vector from __int128 compared to
splitting that into a low/high part.

  <bb 2> [local count: 1073741824]:
  _8 = BIT_FIELD_REF <a_3(D), 64, 64>;
  _11 = VIEW_CONVERT_EXPR<vector(2) long long unsigned int>(a_3(D));
  _13 = VIEW_CONVERT_EXPR<vector(2) long long unsigned int>(a_3(D));
  _12 = VEC_PERM_EXPR <_11, _13, { 1, 0 }>;
  _14 = VIEW_CONVERT_EXPR<vector(16) char>(_12);
  _15 = VEC_PERM_EXPR <_14, _14, { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 }>;
  _16 = VIEW_CONVERT_EXPR<vector(2) long unsigned int>(_15);
  _1 = __builtin_bswap64 (_8);
  _10 = BIT_FIELD_REF <a_3(D), 64, 0>;
  _2 = __builtin_bswap64 (_10);
  MEM <vector(2) long long unsigned int> [(long long unsigned int *)&y] = _16;
  _7 = MEM <uint128_t> [(char * {ref-all})&y];

doesn't realize that it can maybe move the hi/lo swap across the two permutes
to before the store, otherwise it looks as expected.

Yes, the vectorizer doesn't account for ABI details on the function boundary
but it's very hard to do that in a sensible way.

Practically the worst part of the generated code is

        movq    %rdi, -24(%rsp)
        movq    %rsi, -16(%rsp)
        movdqa  -24(%rsp), %xmm0

because the store will fail to forward, causing a huge performance issue.
I wonder why we fail to merge those. We face

(insn 26 25 27 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 0)
        (reg:DI 97)) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 97)
        (nil)))
(insn 27 26 7 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 8)
        (reg:DI 98)) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 98)
        (nil)))
(note 7 27 12 2 NOTE_INSN_FUNCTION_BEG)
(insn 12 7 14 2 (set (reg:V2DI 91)
        (vec_select:V2DI (subreg:V2DI (reg/v:TI 87 [ a ]) 0)
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) "t.c":6:12 7927 {*ssse3_palignrv2di_perm}
     (expr_list:REG_DEAD (reg/v:TI 87 [ a ])
        (nil)))

where

Trying 26, 27 -> 12:
   26: r87:TI#0=r97:DI
      REG_DEAD r97:DI
   27: r87:TI#8=r98:DI
      REG_DEAD r98:DI
   12: r91:V2DI=vec_select(r87:TI#0,parallel)
      REG_DEAD r87:TI
Can't combine i2 into i3

possibly because 27 is a partial def of r87.
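For readers without the original testcase at hand, here is a minimal sketch of
source that would plausibly lower to GIMPLE of the shape shown above. It is a
guess from the dump, not the reporter's code: the names u128, bswap128 and
bswap128_selfcheck are hypothetical, the local array y only mirrors the &y in
the dump, and the memcpy stands in for the ref-all load. It assumes a
little-endian target such as x86_64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension, as in the dump */

/* Hypothetical reconstruction: byte-swap each 64-bit half and exchange
   the halves, going through a two-element array like the dump's `y`.  */
u128 bswap128(u128 a)
{
    uint64_t y[2];
    u128 r;

    y[0] = __builtin_bswap64((uint64_t)(a >> 64)); /* BIT_FIELD_REF <a, 64, 64> */
    y[1] = __builtin_bswap64((uint64_t)a);         /* BIT_FIELD_REF <a, 64, 0>  */

    /* Little-endian assumption: y[0] becomes the low half of the result.  */
    memcpy(&r, y, sizeof r);
    return r;
}

/* Small sanity check: swapping twice must give back the input.  */
void bswap128_selfcheck(void)
{
    u128 a = ((u128)0x0102030405060708ULL << 64) | 0x090a0b0c0d0e0f10ULL;
    assert(bswap128(bswap128(a)) == a);
}
```

Compiling something like this with -O2 and inspecting the vectorizer and RTL
dumps is one way to follow the discussion; the exact dump contents will depend
on the GCC version and target flags.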
We expand to

(insn 4 3 5 2 (set (reg:TI 88)
        (subreg:TI (reg:DI 89) 0)) "t.c":2:1 -1
     (nil))
(insn 5 4 6 2 (set (subreg:DI (reg:TI 88) 8)
        (reg:DI 90)) "t.c":2:1 -1
     (nil))
(insn 6 5 7 2 (set (reg/v:TI 87 [ a ])
        (reg:TI 88)) "t.c":2:1 -1
     (nil))
(note 7 6 10 2 NOTE_INSN_FUNCTION_BEG)
(insn 10 7 12 2 (set (reg:V2DI 82 [ _11 ])
        (subreg:V2DI (reg/v:TI 87 [ a ]) 0)) -1
     (nil))
(insn 12 10 13 2 (set (reg:V2DI 91)
        (vec_select:V2DI (reg:V2DI 82 [ _11 ])
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) "t.c":6:12 -1
     (nil))

initially from

  _11 = VIEW_CONVERT_EXPR<vector(2) long long unsigned int>(a_3(D));

and fwprop1 still sees

(insn 26 25 27 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 0)
        (reg:DI 89 [ a ])) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 95 [ a ])
        (nil)))
(insn 27 26 7 2 (set (subreg:DI (reg/v:TI 87 [ a ]) 8)
        (reg:DI 90 [ a+8 ])) "t.c":2:1 80 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 96 [+8 ])
        (nil)))
(note 7 27 10 2 NOTE_INSN_FUNCTION_BEG)
(insn 10 7 12 2 (set (reg:V2DI 82 [ _11 ])
        (subreg:V2DI (reg/v:TI 87 [ a ]) 0)) 1700 {movv2di_internal}
     (expr_list:REG_DEAD (reg/v:TI 87 [ a ])
        (nil)))

so that would be the best place to fix this up, realizing reg 87 dies
after insn 10.

Richard - I'm sure we can construct a similar case for aarch64 where
argument passing and vector mode use cause spilling?

On x86 the simplest testcase showing this is

typedef unsigned long long v2di __attribute__((vector_size(16)));
v2di bswap(__uint128_t a)
{
    return *(v2di *)&a;
}

that produces

bswap:
.LFB0:
        .cfi_startproc
        sub     sp, sp, #16
        .cfi_def_cfa_offset 16
        stp     x0, x1, [sp]
        ldr     q0, [sp]
        add     sp, sp, 16
        .cfi_def_cfa_offset 0
        ret

on arm for me. Maybe the stp x0, x1 store can forward to the ldr load
though and I'm not sure there's another way to move x0/x1 to q0.
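A source-level variation on the testcase above, offered as a sketch only. The
first function keeps the original shape but uses memcpy in place of the
pointer cast; the second builds the vector from the two 64-bit halves
directly, so no 128-bit scalar has to be reinterpreted as a vector. The
function names and the same_lanes helper are hypothetical, and whether a given
GCC version avoids the stack round trip for either form is not something this
sketch claims. The equivalence check assumes a little-endian target.

```c
#include <assert.h>
#include <string.h>

typedef unsigned long long v2di __attribute__((vector_size(16)));

/* Same idea as the testcase: reinterpret the 128-bit argument as a
   vector, here via memcpy rather than a pointer cast.  */
v2di to_v2di_via_memory(unsigned __int128 a)
{
    v2di v;
    memcpy(&v, &a, sizeof v);
    return v;
}

/* Hypothetical alternative: name the low and high 64-bit halves
   explicitly and let the compiler build the vector from them.  */
v2di to_v2di_from_halves(unsigned __int128 a)
{
    v2di v = { (unsigned long long)a, (unsigned long long)(a >> 64) };
    return v;
}

/* Lane-wise comparison helper for the check below.  */
int same_lanes(v2di x, v2di y)
{
    return x[0] == y[0] && x[1] == y[1];
}

/* On little-endian targets both forms must yield the same vector.  */
void check_equivalence(unsigned __int128 a)
{
    assert(same_lanes(to_v2di_via_memory(a), to_v2di_from_halves(a)));
}
```

Comparing the assembly of the two functions at -O2 on x86_64 and aarch64 shows
whether the register allocator still goes through the stack for the
TImode-to-vector move on the compiler at hand.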

Providing LRA with a way to move TImode to VnmImode would of course also
avoid the spilling, but getting rid of the TImode pseudo when it's only there
as an intermediary for moving two DImode vals to V2DImode sounds like a useful
transform to me. combine is too late since fwprop already merged the subreg
with the following shuffle for the larger testcase.

Alternatively LRA could also be taught to spill to %xmm by somehow telling
it of the vastly increased cost of the double-spill, single-reload sequence?
But I guess it would still need to be taught how to reload V2DImode from
a {DImode, DImode} pair in %xmm regs ...