From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 6841A3858401; Thu,  7 Mar 2024 08:45:10 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 6841A3858401
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1709801110;
	bh=EKOYF3zmNwhhyLlfiX4CutuZrmOnftTWV6SAh/VkQfw=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=hB4UWk7+WH81LnHkSLrAwE70aEsEJ/es7FOUl6nMeHMVFBAA90zEoLuu7nUQbrv0N
	 Hmyvj86NS/h9X0VS3PL/pDF2Vqv48EAdEhDaIAhOvf9bney3slWcozG9dzbCIdYiOn
	 +cWb+8i9AJLAehlJx/BfI4NZ8tgiDlcTdvq8swMM=
From: "gjl at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/114252] Introducing bswapsi reduces code performance
Date: Thu, 07 Mar 2024 08:45:09 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: gjl at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-114252-4-KXiWzLoQ2w@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-114252-4@http.gcc.gnu.org/bugzilla/>
References: <bug-114252-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252
--- Comment #8 from Georg-Johann Lay <gjl at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #7)
> Note I do understand what you are saying, just the middle-end in detecting
> and using __builtin_bswap32 does what it does everywhere else - it checks
> whether the target implements the operation.
>
> The middle-end doesn't try to actually compare costs (it has no idea of the
> bswapsi costs),

But even when the bswapsi insn costs nothing, the v14 code has these
additional 6 movqi insns 32...37 compared to v13 code.  In order to have the
same performance as the v13 code, a bswapsi would have to cost negative
6 insns.  And an optimizer that assumes negative costs is not reasonable, in
particular because the recognition of bswap opportunities serves
optimization -- or is supposed to serve it as far as I understand.

> and it most definitely doesn't see how AVR is special in
> having only QImode registers and thus the created SImode load (which the
> target supports!) will end up as four registers.

Even if the bswap insn cost nothing, the code would still be worse.

> The only thing that maybe would make sense with AVR exposing bswapsi is
> users calling __builtin_bswap but since it always expands as a libcall
> even that makes no sense.

It makes perfect sense when C/C++ code uses __builtin_bswap32:

* With current bswapsi insn, the code does a call that performs SI:22 =
bswap(SI:22) with NO additional register pressure.

* Without bswap insn, the code does a real ABI call that performs SI:22 =
bswap(SI:22) PLUS IT CLOBBERS r18, r19, r20, r21, r26, r27, r30 and r31,
which are the most powerful GPRs.
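
To make the user-level case concrete, here is a minimal sketch of such code
(the function name swap_endianness is made up for illustration; which of the
two expansions above the call gets depends on whether the backend provides
the bswapsi insn):

```c
#include <stdint.h>

/* Hypothetical helper: an explicit, user-requested 32-bit byte swap.
   The compiler expands the builtin either through the target's bswapsi
   pattern or, if there is none, as an ordinary library call.  */
uint32_t swap_endianness (uint32_t x)
{
    return __builtin_bswap32 (x);
}
```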

> So my preferred fix would be to remove bswapsi from avr.md?

Is there a way that the backend can fold a call to an insn that performs
better than a call? Like in TARGET_FOLD_BUILTIN?  As far as I know, the
backend can only fold target builtins, but not common builtins?  Tree fold
cannot fold to an insn obviously, but it could fold to inline asm, no?

Or can the target change an optabs entry so it expands to an insn that's
more profitable than a respective call? (like avr.md's bswap insn with
transparent call is more profitable than a real call).

The avr backend does this for many other operations, too:

divmod, SI and PSI multiplications, parity, popcount, clz, ffs,

> Does it benefit from recognizing bswap done with shifts on an int?

I don't fully understand that question. Do you mean that code which shifts
bytes around, like
    uint32_t res = 0;
    res |= ((uint32_t) buf[0]) << 24;
    res |= ((uint32_t) buf[1]) << 16;
    res |= (uint32_t) buf[2] << 8;
    res |= buf[3];
    return res;
is better than a bswapsi call?
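
For reference, a sketch of the two variants as complete functions. The names
load_be32_shifts and load_be32_bswap are made up for illustration, and the
second variant is only equivalent to the first on a little-endian target
(such as AVR); it is meant to approximate the "SImode load plus bswap" form
mentioned in comment #7, not to reproduce the exact code the middle-end
emits.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name: assemble a big-endian 32-bit value byte by byte.  */
uint32_t load_be32_shifts (const uint8_t *buf)
{
    uint32_t res = 0;
    res |= ((uint32_t) buf[0]) << 24;
    res |= ((uint32_t) buf[1]) << 16;
    res |= ((uint32_t) buf[2]) << 8;
    res |= buf[3];
    return res;
}

/* Hypothetical name: the same value via one 32-bit load followed by a
   byte swap.  ASSUMPTION: little-endian target; on a big-endian target
   the swap would have to be omitted.  */
uint32_t load_be32_bswap (const uint8_t *buf)
{
    uint32_t raw;
    memcpy (&raw, buf, sizeof raw);   /* alignment-safe 32-bit load */
    return __builtin_bswap32 (raw);
}
```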