From: "gjl at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/114252] Introducing bswapsi reduces code performance
Date: Thu, 07 Mar 2024 08:45:09 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #8 from Georg-Johann Lay ---
(In reply to Richard Biener from comment #7)
> Note I do understand what you are saying, just the middle-end in detecting
> and using __builtin_bswap32 does what it does everywhere else - it checks
> whether the target implements the operation.
>
> The middle-end doesn't try to actually compare costs (it has no idea of the
> bswapsi costs),

But even when the bswapsi insn costs nothing, the v14 code has these
additional 6 movqi insns 32...37 compared to v13 code.  In order to have the
same performance as v13 code, a bswapsi would have to cost negative 6 insns.
And an optimizer that assumes negative costs is not reasonable, in particular
because the recognition of bswap opportunities serves optimization -- or is
supposed to serve it as far as I understand.

> and it most definitely doesn't see how AVR is special in
> having only QImode registers and thus the created SImode load (which the
> target supports!) will end up as four registers.

Even when the bswap insn would cost nothing the code is worse.

> The only thing that maybe would make sense with AVR exposing bswapsi is
> users calling __builtin_bswap but since it always expands as a libcall
> even that makes no sense.

It makes perfect sense when C/C++ code uses __builtin_bswap32:

* With current bswapsi insn, the code does a call that performs
  SI:22 = bswap(SI:22) with NO additional register pressure.

* Without bswap insn, the code does a real ABI call that performs
  SI:22 = bswap(SI:22) PLUS IT CLOBBERS r18, r19, r20, r21, r26, r27, r30
  and r31; which are the most powerful GPRs.

> So my preferred fix would be to remove bswapsi from avr.md?

Is there a way that the backend can fold a call to an insn that performs
better than a call?  Like in TARGET_FOLD_BUILTIN?  As far as I know, the
backend can only fold target builtins, but not common builtins?  Tree fold
cannot fold to an insn obviously, but it could fold to inline asm, no?

Or can the target change an optabs entry so it expands to an insn that's more
profitable than a respective call? (like avr.md's bswap insn with transparent
call is more profitable than a real call).
The avr backend does this for many other things, too: divmod, SI and PSI
multiplications, parity, popcount, clz, ffs,

> Does it benefit from recognizing bswap done with shifts on an int?

I don't fully understand that question.  You mean to write code that shifts
bytes around like in

    uint32_t res = 0;
    res |= ((uint32_t) buf[0]) << 24;
    res |= ((uint32_t) buf[1]) << 16;
    res |= (uint32_t) buf[2] << 8;
    res |= buf[3];
    return res;

is better than a bswapsi call?
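
For reference, a minimal sketch of the two forms being compared, assuming a
GCC-compatible compiler that provides __builtin_bswap32 and the
__BYTE_ORDER__ macros.  The function names load_be32_shifts and
load_be32_bswap are hypothetical and only serve illustration; the second
function approximates what bswap recognition makes of the first one on a
little-endian target (one 32-bit load plus a byte swap), it is not the code
GCC actually emits for AVR.

```c
#include <stdint.h>
#include <string.h>

/* Byte-wise assembly of a big-endian 32-bit value using shifts,
   as in the snippet above.  */
static uint32_t load_be32_shifts (const uint8_t *buf)
{
    uint32_t res = 0;
    res |= ((uint32_t) buf[0]) << 24;
    res |= ((uint32_t) buf[1]) << 16;
    res |= ((uint32_t) buf[2]) << 8;
    res |= buf[3];
    return res;
}

/* Same result expressed as a single 32-bit load followed by a byte swap.
   The swap is only needed when the host stores integers little-endian.  */
static uint32_t load_be32_bswap (const uint8_t *buf)
{
    uint32_t v;
    memcpy (&v, buf, sizeof v);   /* 32-bit load without alignment issues */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    v = __builtin_bswap32 (v);    /* reverse the four bytes */
#endif
    return v;
}
```

Both functions return the same value for any 4-byte buffer; which of them
results in better code on a target with only 8-bit registers is exactly the
question of this PR.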