From: "jakub at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/99434] std::bit_cast generates more instructions than __builtin_bit_cast and memcpy with -march=native
Date: Mon, 08 Mar 2021 16:49:25 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 11.0
X-Bugzilla-Keywords: missed-optimization, ra
X-Bugzilla-Severity: enhancement
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99434

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |jamborm at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-03-08

--- Comment #4 from Jakub Jelinek ---
The umul5 case in #c0 is worse because of SRA.
With -O2 -fno-tree-sra the optimized dump looks like:
  _3 = a_4(D) w* b_5(D);
  D.2396 = VIEW_CONVERT_EXPR(_3);
  D.2383 = D.2396;
  return D.2383;
and
  _3 = a_4(D) w* b_5(D);
  D.2389 = VIEW_CONVERT_EXPR(_3);
  return D.2389;
for the two functions, and even when there is the superfluous copying we emit
the same assembly.
But with SRA the former becomes:
  _3 = a_4(D) w* b_5(D);
  D.2396 = VIEW_CONVERT_EXPR(_3);
  SR.6_12 = D.2396.low;
  SR.7_13 = D.2396.high;
  D.2383.low = SR.6_12;
  D.2383.high = SR.7_13;
  return D.2383;

In the -fno-tree-sra case the IL just contains one extra TImode pseudo ->
pseudo assignment which is shortly optimized away, so we have just:
(insn 7 4 13 2 (parallel [
            (set (reg:TI 87)
                (mult:TI (zero_extend:TI (reg:DI 89))
                    (zero_extend:TI (reg:DI 90))))
            (clobber (reg:CC 17 flags))
        ]) "pr99434.C":23:66 426 {*umulditi3_1}
     (expr_list:REG_DEAD (reg:DI 90)
        (expr_list:REG_DEAD (reg:DI 89)
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))
(insn 13 7 14 2 (set (reg/i:TI 0 ax)
        (reg:TI 87)) "pr99434.C":24:1 73 {*movti_internal}
     (expr_list:REG_DEAD (reg:TI 87)
        (nil)))
(insn 14 13 0 2 (use (reg/i:TI 0 ax)) "pr99434.C":24:1 -1
     (nil))
before reload, while with SRA we have:
(insn 7 4 19 2 (parallel [
            (set (reg:TI 90)
                (mult:TI (zero_extend:TI (reg:DI 98))
                    (zero_extend:TI (reg:DI 99))))
            (clobber (reg:CC 17 flags))
        ]) "pr99434.C":18:57 426 {*umulditi3_1}
     (expr_list:REG_DEAD (reg:DI 99)
        (expr_list:REG_DEAD (reg:DI 98)
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))
(insn 19 7 20 2 (set (reg:DI 92 [ D.2396 ])
        (subreg:DI (reg:TI 90) 0)) "pr99434.C":5:40 74 {*movdi_internal}
     (nil))
(insn 20 19 23 2 (set (reg:DI 93 [ D.2396+8 ])
        (subreg:DI (reg:TI 90) 8)) "pr99434.C":5:40 74 {*movdi_internal}
     (expr_list:REG_DEAD (reg:TI 90)
        (nil)))
(insn 23 20 24 2 (set (reg:DI 0 ax)
        (reg:DI 92 [ D.2396 ])) "pr99434.C":19:1 74 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 92 [ D.2396 ])
        (nil)))
(insn 24 23 17 2 (set (reg:DI 1 dx [+8 ])
        (reg:DI 93 [ D.2396+8 ])) "pr99434.C":19:1 74 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 93 [ D.2396+8 ])
        (nil)))
(insn 17 24 0 2 (use (reg/i:TI 0 ax)) "pr99434.C":19:1 -1
     (nil))

In both cases we get the same (right) RA decisions about the umulditi3_1,
and the IRA decisions seem to be good too:
      Popping a2(r90,l0)  -- assign reg 0
      Popping a4(r98,l0)  -- assign reg 5
      Popping a0(r93,l0)  -- assign reg 1
      Popping a1(r92,l0)  -- assign reg 0
      Popping a3(r99,l0)  -- assign reg 4
but for some reason LRA then decides to use different registers...
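
For reference, the two shapes compared in the dumps above can be approximated by the sketch below. It is not the testcase from #c0: the names DoubleReg, bit_cast_like, umul_wrapped and umul_direct are made up for illustration, and only the .low/.high field names come from the dumps. It assumes a 64-bit little-endian target with the GCC/Clang unsigned __int128 extension, and falls back to memcpy where __builtin_bit_cast is not available. The wrapped variant returns the aggregate through an inline by-value helper, which is where the extra D.2383 = D.2396 style copy would come from.

```cpp
#include <cstdint>
#include <cstring>

// Assumption: GCC/Clang 128-bit integer extension on a 64-bit target.
__extension__ typedef unsigned __int128 u128;

// Hypothetical aggregate; on little-endian targets `low` holds bits 0..63.
struct DoubleReg {
    std::uint64_t low;
    std::uint64_t high;
};
static_assert(sizeof(DoubleReg) == sizeof(u128), "layout assumption");

// Use the builtin only when the compiler reports it.
#if defined(__has_builtin)
#  if __has_builtin(__builtin_bit_cast)
#    define SKETCH_HAVE_BIT_CAST 1
#  endif
#endif

// By-value helper in the style of std::bit_cast (illustrative only).
template <class To, class From>
inline To bit_cast_like(const From& from) noexcept {
#ifdef SKETCH_HAVE_BIT_CAST
    return __builtin_bit_cast(To, from);
#else
    To to;
    std::memcpy(&to, &from, sizeof(To));
    return to;
#endif
}

// Variant 1: the product goes through the helper, so the result is first
// materialized in the helper's return slot and then copied out.
DoubleReg umul_wrapped(std::uint64_t a, std::uint64_t b) {
    return bit_cast_like<DoubleReg>(static_cast<u128>(a) * b);
}

// Variant 2: the product is converted in place, with no intermediate copy.
DoubleReg umul_direct(std::uint64_t a, std::uint64_t b) {
    u128 p = static_cast<u128>(a) * b;
#ifdef SKETCH_HAVE_BIT_CAST
    return __builtin_bit_cast(DoubleReg, p);
#else
    DoubleReg r;
    std::memcpy(&r, &p, sizeof r);
    return r;
#endif
}
```

Both functions compute the same value; any difference should only show up in the generated code, e.g. when comparing -O2 against -O2 -fno-tree-sra.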