public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/114252] New: Introducing bswapsi reduces code performance
@ 2024-03-06 10:01 gjl at gcc dot gnu.org
  2024-03-06 11:55 ` [Bug target/114252] " rguenth at gcc dot gnu.org
                   ` (15 more replies)
  0 siblings, 16 replies; 17+ messages in thread
From: gjl at gcc dot gnu.org @ 2024-03-06 10:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

            Bug ID: 114252
           Summary: Introducing bswapsi reduces code performance
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gjl at gcc dot gnu.org
  Target Milestone: ---

Created attachment 57628
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57628&action=edit
GNU-C test case

typedef __UINT8_TYPE__ uint8_t;
typedef __UINT32_TYPE__ uint32_t;

typedef uint8_t __attribute__((vector_size(4))) v4u8_t;

uint32_t func1 (const uint8_t *buf) {
    v4u8_t v4 = { buf[1], buf[0], buf[3], buf[2] };

    return (uint32_t) v4;
}

Compile the code with

$ avr-gcc code.c -S -Os -dp

with v13 the result is:


func1:
        mov r30,r24      ;  37  [c=4 l=1]  movqi_insn/0
        mov r31,r25      ;  38  [c=4 l=1]  movqi_insn/0
        ldd r22,Z+1      ;  39  [c=4 l=1]  movqi_insn/3
        ld r23,Z                 ;  40  [c=4 l=1]  movqi_insn/3
        ldd r24,Z+3      ;  41  [c=4 l=1]  movqi_insn/3
        ldd r25,Z+2      ;  42  [c=4 l=1]  movqi_insn/3
/* epilogue start */
        ret              ;  45  [c=0 l=1]  return

which is good code: insns 37 and 38 move the address to pointer register Z,
followed by four loads, one for each byte.

When compiled with v14, however, the result is:

func1:
        mov r30,r24      ;  23  [c=4 l=2]  *movhi/0
        mov r31,r25
        ld r22,Z         ;  24  [c=16 l=4]  *movsi/2
        ldd r23,Z+1
        ldd r24,Z+2
        ldd r25,Z+3
        rcall __bswapsi2         ;  25  [c=16 l=1]  *bswapsi2.libgcc
        mov r31,r23      ;  32  [c=4 l=1]  movqi_insn/0
        mov r23,r25      ;  33  [c=4 l=1]  movqi_insn/0
        mov r25,r31      ;  34  [c=4 l=1]  movqi_insn/0
        mov r31,r22      ;  35  [c=4 l=1]  movqi_insn/0
        mov r22,r24      ;  36  [c=4 l=1]  movqi_insn/0
        mov r24,r31      ;  37  [c=4 l=1]  movqi_insn/0
/* epilogue start */
        ret              ;  40  [c=0 l=1]  return


Target: avr
Configured with: ../../source/gcc-master/configure --target=avr --disable-nls
--with-dwarf2 --with-gnu-as --with-gnu-ld --disable-shared
--enable-languages=c,c++
Thread model: single
Supported LTO compression algorithms: zlib
gcc version 14.0.1 20240303 (experimental) (GCC)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [Bug target/114252] Introducing bswapsi reduces code performance
From: rguenth at gcc dot gnu.org @ 2024-03-06 11:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-03-06
             Status|UNCONFIRMED                 |NEW
          Component|tree-optimization           |target
           Keywords|                            |missed-optimization

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  It looks like we do no cost evaluation in
maybe_optimize_vector_constructor, only checking that there's an optab
for bswap with SImode.

insn-flags.h:#define HAVE_bswapsi2 1

but somehow we end up doing a libcall?

We expand as

;; bswapdst_10 = __builtin_bswap32 (load_dst_9);

(insn 6 5 7 (set (reg:SI 47)
        (mem:SI (reg/v/f:HI 46 [ buf ]) [0 MEM <long unsigned int> [(const
uint8_t *)buf_5(D)]+0 S4 A8])) -1
     (nil))

(insn 7 6 8 (set (reg:SI 22 r22)
        (reg:SI 47)) -1
     (nil))

(insn 8 7 9 (set (reg:SI 22 r22)
        (bswap:SI (reg:SI 22 r22))) -1
     (nil))

^^^

so why does that turn into a library call?

I think this is mis-communication between the middle-end and the target.


* [Bug target/114252] Introducing bswapsi reduces code performance
From: rguenth at gcc dot gnu.org @ 2024-03-06 11:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Looking at avr.md there's no bswap implementation, only the libcall.  Why
expose it this way?

I suppose the pattern was added to get bswap recognition, so this is what you
get if you do that?


* [Bug target/114252] Introducing bswapsi reduces code performance
From: gjl at gcc dot gnu.org @ 2024-03-06 12:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #3 from Georg-Johann Lay <gjl at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> but somehow we end up doing a libcall?

It's not a libcall in the GCC sense; for the compiler it's just an ordinary
insn.  The backend then prints this as a transparent call to libgcc.

The purpose is that many functions have a small, known footprint because they
are implemented in assembly.  An ordinary call would clobber all call-clobbered
regs, so using a transparent call gives better code than a real call.  Notice
this is the insn:

(define_insn "*bswapsi2.libgcc"
  [(set (reg:SI 22)
        (bswap:SI (reg:SI 22)))
   (clobber (reg:CC REG_CC))]
  "reload_completed"
  "%~call __bswapsi2"
  [(set_attr "type" "xcall")])

However, for the purpose of this PR, no bswap is needed in the first place;
just have a look at the v13 code.  It simply loads the bytes where they belong
in the target value, while v14 loads all 32 bits in one chunk and then starts
fiddling around with the constituent bytes.


* [Bug target/114252] Introducing bswapsi reduces code performance
From: rguenth at gcc dot gnu.org @ 2024-03-06 12:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Yes, the missing "detail" for the middle-end is that the uint32_t is actually
4 separate byte registers.  And the 'int' argument to bswap32 requires
4 registers as well.

So bswap on a value is just register shuffling, right?  And thus this
libcall will never be better than doing it inline as you probably
cannot expect the incoming arguments and the outgoing return registers to be
allocated in a way such that no reg-reg moves remain?

Of course since it's still SImode pseudos on RTL you might want to write
an expander that populates 4 QImode pseudos from the SImode one and
composes that back to a byte-swapped SImode one.  Hoping register allocation
can then elide everything again?

I'd try to avoid using subregs if possible, though using them would be easiest,
I think (but you might fall foul of RA, similar to -fsplit-wide-types).
Shifts and truncates/zero_extends are possibly superior.  Who knows.  Or
split it only after reload and have the pattern consume one scratch you
need for the register-register moves.

Hey, maybe the RA itself can know how to allocate a bswap:SI optimally
and "reload" it to be reg-reg moves ...


* [Bug target/114252] Introducing bswapsi reduces code performance
From: gjl at gcc dot gnu.org @ 2024-03-06 16:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #5 from Georg-Johann Lay <gjl at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> So bswap on a value is just register shuffling, right?

The point is that there is no need for bswap in the first place, just have a
look at the code that v13 generates.  It's 4 QI loads and that's it, no
shuffling required at all.

But v14 dropped that, and the bswapsi (presumably due to previous flawed tree
optimizations) is introduced by some tree pass.

There's nothing the backend can do about it.  So would you explain why you
think it's a "target" issue?

Maybe the PR title I used is confusing and does not hit the point?


* [Bug target/114252] Introducing bswapsi reduces code performance
From: rguenther at suse dot de @ 2024-03-06 17:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #6 from rguenther at suse dot de <rguenther at suse dot de> ---
> Am 06.03.2024 um 17:12 schrieb gjl at gcc dot gnu.org <gcc-bugzilla@gcc.gnu.org>:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252
> 
> --- Comment #5 from Georg-Johann Lay <gjl at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #4)
>> So bswap on a value is just register shuffling, right?
> 
> The point is that there is no need for bswap in the first place, just have a
> look at the code that v13 generates.  It's 4 QI loads and that's it, no
> shuffling required at all.
> 
> But v14 dropped that, and the bswapsi (presumably due to previous flawed tree
> optimizations) is introduced by some tree pass.
> 
> There's nothing the backend can do about it.  So would you explain why you
> think it's a "target" issue?
> 
> Maybe the PR title I used is confusing and does not hit the point?

Why does the target say it has bswapsi then? In which case is that profitable?



* [Bug target/114252] Introducing bswapsi reduces code performance
From: rguenth at gcc dot gnu.org @ 2024-03-07  7:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note I do understand what you are saying, just the middle-end in detecting and
using __builtin_bswap32 does what it does everywhere else - it checks whether
the target implements the operation.

The middle-end doesn't try to actually compare costs (it has no idea of the
bswapsi costs), and it most definitely doesn't see how AVR is special in
having only QImode registers and thus the created SImode load (which the
target supports!) will end up as four registers.  To me a 'bswap' on
AVR never makes sense since whatever is swapped will be _always_ available
as a set of byte registers.

That's why I question AVR exposing bswapsi to the middle-end rather than
suggesting the middle-end should maybe see whether AVR has any regs of
HImode or larger.  Note that would break for targets that could eventually
do a load-multiple byteswapped to a set of QImode regs (guess there's no
such one in GCC at least), but it's the only heuristic that might work here.

The only thing that maybe would make sense with AVR exposing bswapsi is
users calling __builtin_bswap but since it always expands as a libcall
even that makes no sense.

So my preferred fix would be to remove bswapsi from avr.md?

Does it benefit from recognizing bswap done with shifts on an int?


* [Bug target/114252] Introducing bswapsi reduces code performance
From: gjl at gcc dot gnu.org @ 2024-03-07  8:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #8 from Georg-Johann Lay <gjl at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #7)
> Note I do understand what you are saying, just the middle-end in detecting
> and using __builtin_bswap32 does what it does everywhere else - it checks
> whether the target implements the operation.
> 
> The middle-end doesn't try to actually compare costs (it has no idea of the
> bswapsi costs),

But even when the bswapsi insn costs nothing, the v14 code has these additional
6 movqi insns 32...37 compared to the v13 code.  In order to have the same
performance as the v13 code, a bswapsi would have to cost minus 6 insns.  And
an optimizer that assumes negative costs is not reasonable, in particular
because the recognition of bswap opportunities is supposed to serve
optimization, as far as I understand.

> and it most definitely doesn't see how AVR is special in
> having only QImode registers and thus the created SImode load (which the
> target supports!) will end up as four registers.

Even if the bswap insn cost nothing, the code would be worse.

> The only thing that maybe would make sense with AVR exposing bswapsi is
> users calling __builtin_bswap but since it always expands as a libcall
> even that makes no sense.

It makes perfect sense when C/C++ code uses __builtin_bswap32:

* With the current bswapsi insn, the code does a call that performs SI:22 =
bswap(SI:22) with NO additional register pressure.

* Without bswap insn, the code does a real ABI call that performs SI:22 =
bswap(SI:22) PLUS IT CLOBBERS r18, r19, r20, r21, r26, r27, r30 and r31; which
are the most powerful GPRs.

> So my preferred fix would be to remove bswapsi from avr.md?

Is there a way that the backend can fold a call to an insn that performs better
than a call? Like in TARGET_FOLD_BUILTIN?  As far as I know, the backend can
only fold target builtins, but not common builtins?  Tree fold cannot fold to
an insn obviously, but it could fold to inline asm, no?

Or can the target change an optabs entry so it expands to an insn that's more
profitable than the respective call? (like avr.md's bswap insn with transparent
call is more profitable than a real call).

The avr backend does this for a lot of other stuff, too:

divmod, SI and PSI multiplications, parity, popcount, clz, ffs, ...

> Does it benefit from recognizing bswap done with shifts on an int?

I don't fully understand that question.  You mean whether code that shifts
bytes around, like
    uint32_t res = 0;
    res |= ((uint32_t) buf[0]) << 24;
    res |= ((uint32_t) buf[1]) << 16;
    res |= (uint32_t) buf[2] << 8;
    res |= buf[3];
    return res;
is better than a bswapsi call?


* [Bug target/114252] Introducing bswapsi reduces code performance
From: gjl at gcc dot gnu.org @ 2024-03-07  9:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #9 from Georg-Johann Lay <gjl at gcc dot gnu.org> ---
...and I don't see why a register allocator would or should fix flaws from tree
optimizers.


* [Bug target/114252] Introducing bswapsi reduces code performance
From: rguenth at gcc dot gnu.org @ 2024-03-07  9:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sayle at gcc dot gnu.org

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Georg-Johann Lay from comment #8)
> (In reply to Richard Biener from comment #7)
> > Note I do understand what you are saying, just the middle-end in detecting
> > and using __builtin_bswap32 does what it does everywhere else - it checks
> > whether the target implements the operation.
> > 
> > The middle-end doesn't try to actually compare costs (it has no idea of the
> > bswapsi costs),
> 
> But even when the bswapsi insn costs nothing, the v14 code has these
> additional 6 movqi insns 32...37 compared to the v13 code.  In order to have
> the same performance as the v13 code, a bswapsi would have to cost minus 6
> insns.  And an optimizer that assumes negative costs is not reasonable, in
> particular because the recognition of bswap opportunities is supposed to
> serve optimization, as far as I understand.
> 
> > and it most definitely doesn't see how AVR is special in
> > having only QImode registers and thus the created SImode load (which the
> > target supports!) will end up as four registers.
> 
> Even when the bswap insn would cost nothing the code is worse.

Yes, I know.

> > The only thing that maybe would make sense with AVR exposing bswapsi is
> > users calling __builtin_bswap but since it always expands as a libcall
> > even that makes no sense.
> 
> It makes perfect sense when C/C++ code uses __builtin_bswap32:
> 
> * With the current bswapsi insn, the code does a call that performs SI:22 =
> bswap(SI:22) with NO additional register pressure.
> 
> * Without bswap insn, the code does a real ABI call that performs SI:22 =
> bswap(SI:22) PLUS IT CLOBBERS r18, r19, r20, r21, r26, r27, r30 and r31;
> which are the most powerful GPRs.

I think the target controls the "libcall" ABI that's used for calls to
libgcc, but somehow we fail to go that path (but I can see __bswapsi
and __bswapdi even in the x86_64 libgcc).  In particular

OPTAB_NC(bswap_optab, "bswap$a2", BSWAP)

doesn't list bswap as having a libfunc ...

> > So my preferred fix would be to remove bswapsi from avr.md?
> 
> Is there a way that the backend can fold a call to an insn that performs
> better that a call? Like in TARGET_FOLD_BUILTIN?  As far as I know, the
> backend can only fold target builtins, but not common builtins?  Tree fold
> cannot fold to an insn obviously, but it could fold to inline asm, no?
> 
> Or can the target change an optabs entry so it expands to an insn that's
> more profitable that a respective call? (like avr.md's bswap insn with
> transparent call is more profitable than a real call).

I think the target should implement an inline bswap, possibly via a
define_insn_and_split or define_split so the byte ops are only exposed
at a desired point;  important points being lower_subreg (split-wide-types)
and register allocation - possibly lower_subreg should itself know
how to handle bswap (though the degenerate AVR case is quite special).

I've CCed Roger who might know the traps with "implementing" an SImode
bswap on a target with just QImode regs but multi-reg operations not
decomposed during most of the RTL pipeline(?)

> The avr backend does this for many other stuff, too:
> 
> divmod, SI and PSI multiplications, parity, popcount, clz, ffs, 

Indeed.  Maybe it's never the case that a loop implementing clz is
better than a libcall or separate div/mod are better than divmod
(oddly divmod also lacks the libcall entry for the optabs...).

> > Does it benefit from recognizing bswap done with shifts on an int?
> 
> I don't fully understand that question. You mean to write code that shifts
> bytes around like in
>     uint32_t res = 0;
>     res |= ((uint32_t) buf[0]) << 24;
>     res |= ((uint32_t) buf[1]) << 16;
>     res |= (uint32_t) buf[2] << 8;
>     res |= buf[3];
>     return res;
> is better than a bswapsi call?

Yeah.  Or comparing to open-coding the bswap without going through the call.
I don't have an AVR libgcc around, but libgcc2.c has

#ifdef L_bswapsi2
SItype
__bswapsi2 (SItype u)
{
  return ((((u) & 0xff000000u) >> 24)
          | (((u) & 0x00ff0000u) >>  8)
          | (((u) & 0x0000ff00u) <<  8)
          | (((u) & 0x000000ffu) << 24));
}
#endif 

and that's compiled to

__bswapsi2:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
        rcall __bswapsi2
/* epilogue start */
        ret

so this can't be it ;)


* [Bug target/114252] Introducing bswapsi reduces code performance
From: rguenth at gcc dot gnu.org @ 2024-03-07  9:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
diff --git a/gcc/gimple-ssa-store-merging.cc b/gcc/gimple-ssa-store-merging.cc
index 42b68abf61b..c9d4662656f 100644
--- a/gcc/gimple-ssa-store-merging.cc
+++ b/gcc/gimple-ssa-store-merging.cc
@@ -170,6 +170,7 @@
 #include "optabs-tree.h"
 #include "dbgcnt.h"
 #include "selftest.h"
+#include "regs.h"

 /* The maximum size (in bits) of the stores this pass should generate.  */
 #define MAX_STORE_BITSIZE (BITS_PER_WORD)
@@ -1484,7 +1485,8 @@ maybe_optimize_vector_constructor (gimple *cur_stmt)
       break;
     case 32:
       if (builtin_decl_explicit_p (BUILT_IN_BSWAP32)
-         && optab_handler (bswap_optab, SImode) != CODE_FOR_nothing)
+         && optab_handler (bswap_optab, SImode) != CODE_FOR_nothing
+         && have_regs_of_mode[SImode])
        {
          load_type = uint32_type_node;
          fndecl = builtin_decl_explicit (BUILT_IN_BSWAP32);
@@ -1545,7 +1547,8 @@ pass_optimize_bswap::execute (function *fun)
   tree bswap32_type = NULL_TREE, bswap64_type = NULL_TREE;

   bswap32_p = (builtin_decl_explicit_p (BUILT_IN_BSWAP32)
-              && optab_handler (bswap_optab, SImode) != CODE_FOR_nothing);
+              && optab_handler (bswap_optab, SImode) != CODE_FOR_nothing
+              && have_regs_of_mode[SImode]);
   bswap64_p = (builtin_decl_explicit_p (BUILT_IN_BSWAP64)
               && (optab_handler (bswap_optab, DImode) != CODE_FOR_nothing
                   || (bswap32_p && word_mode == SImode)));


doesn't work.  AVR has regs of SImode.  There doesn't seem to be a way to
query the (maximum?) number of hardregs used for a mode.  Using

  bswap32_p = (builtin_decl_explicit_p (BUILT_IN_BSWAP32)
               && optab_handler (bswap_optab, SImode) != CODE_FOR_nothing
               && have_regs_of_mode[SImode]
               && hard_regno_nregs (0, SImode) == 1);

"works" but is surely wrong (whatever hardreg zero corresponds to).
Looking only at word_mode, requiring SImode size >= word_mode size like with

  bswap32_p = (builtin_decl_explicit_p (BUILT_IN_BSWAP32)
               && optab_handler (bswap_optab, SImode) != CODE_FOR_nothing
               && known_ge (GET_MODE_SIZE (word_mode), GET_MODE_SIZE
(SImode)));

"works" but would affect many more targets.  Maybe && word_mode != QImode
is better.

Note that this will cut off _all_ bswap detection.  Thus my question on
profitability of detecting cases like those in libgcc2.c which then produces

__bswapsi2:
        push r12
        push r13
        push r14
        push r15
        push r16
        push r17
/* prologue: function */
/* frame size = 0 */
/* stack size = 6 */
.L__stack_usage = 6
        mov r16,r22
        mov r17,r23
        mov r18,r24
        mov r19,r25
        mov r22,r19
        clr r23
        clr r24
        clr r25
        mov r15,r16
        clr r14
        clr r13
        clr r12
        or r22,r12
        or r23,r13
        or r24,r14
        or r25,r15
        mov r12,r17
        mov r13,r18
        mov r14,r19
        clr r15
        clr r12
        clr r14
        clr r15
        or r22,r12
        or r23,r13
        or r24,r14
        or r25,r15
        mov r19,r18
        mov r18,r17
        mov r17,r16
        clr r16
        clr r16
        clr r17
        clr r19
        or r22,r16
        or r23,r17
        or r24,r18
        or r25,r19
/* epilogue start */
        pop r17
        pop r16
        pop r15
        pop r14
        pop r13
        pop r12
        ret

then.

bswap detection does not try to do any sophisticated evaluation of costs.


* [Bug target/114252] Introducing bswapsi reduces code performance
From: gjl at gcc dot gnu.org @ 2024-03-07 10:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #12 from Georg-Johann Lay <gjl at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #10)
> I think the target controls the "libcall" ABI that's used for calls to
> libgcc,

Do you have a pointer on how to do it, or an example?  IIRC I looked into it
quite a while ago, and it didn't allow specifying/adjusting call_used_regs[] etc.

> I think the target should implement an inline bswap, possibly via a
> define_insn_and_split or define_split so the byte ops are only exposed
> at a desired point;  important points being lower_subreg (split-wide-types)
> and register allocation - possibly lower_subreg should itself know
> how to handle bswap (though the degenerate AVR case is quite special).

That would result in SUBREGs all over the place.  As Vladimir pointed out in 

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110093#c5

DFA doesn't handle subregs properly, and register alloc then uses extra
reloads, bloating the code (not only in PR110093 but also in PR114243).  It's
unlikely any pass will untangle the mess of four (set (subreg:QI (SI))
(subreg:QI (SI))) insns.



> Yeah.  Or comparing to open-coding the bswap without going through the call.
> I don't have a AVR libgcc around, but libgcc2.s has
> 
> #ifdef L_bswapsi2
> SItype
> __bswapsi2 (SItype u)
> {
>   return ((((u) & 0xff000000u) >> 24)
>           | (((u) & 0x00ff0000u) >>  8)
>           | (((u) & 0x0000ff00u) <<  8)
>           | (((u) & 0x000000ffu) << 24));
> }
> #endif 

The libgcc side is not a problem at all, libgcc/config/avr/lib1funcs.S has:

;; swap two registers with different register number
.macro bswap a, b
    eor \a, \b
    eor \b, \a
    eor \a, \b
.endm

#if defined (L_bswapsi2)
;; swap bytes
;; r25:r22 = bswap32 (r25:r22)
DEFUN __bswapsi2
    bswap r22, r25
    bswap r23, r24
    ret
ENDF __bswapsi2
#endif /* defined (L_bswapsi2) */

#if defined (L_bswapdi2)
;; swap bytes
;; r25:r18 = bswap64 (r25:r18)
DEFUN __bswapdi2
    bswap r18, r25
    bswap r19, r24
    bswap r20, r23
    bswap r21, r22
    ret
ENDF __bswapdi2
#endif /* defined (L_bswapdi2) */


There's currently no handcrafted bswap16 though.


* [Bug target/114252] Introducing bswapsi reduces code performance
From: rguenth at gcc dot gnu.org @ 2024-03-07 10:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Georg-Johann Lay from comment #12)
> (In reply to Richard Biener from comment #10)
> > I think the target controls the "libcall" ABI that's used for calls to
> > libgcc,
> 
> You have a pointer how to do it or an example? IIRC I looked into it quite a
> while ago, and it didn't allow to specify/adjust call_used_regs[] etc.
> 
> > I think the target should implement an inline bswap, possibly via a
> > define_insn_and_split or define_split so the byte ops are only exposed
> > at a desired point;  important points being lower_subreg (split-wide-types)
> > and register allocation - possibly lower_subreg should itself know
> > how to handle bswap (though the degenerate AVR case is quite special).
> 
> That would result in SUBREGs all over the place.  As Vladimir pointed out in 
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110093#c5
> 
> DFA doesn't handle subregs properly, and register alloc then uses extra
> reloads, bloating the code (not only in PR110093 but also in PR114243).
> It's unlikely any pass will untangle the mess of four
> (set (subreg:QI (SI)) (subreg:QI (SI))) insns.

Yep.  Which is why I was toying with the idea of having (bswap:SI ..)
handled during reload itself ...

The alternative would be to have SImode hard regs by using consecutive
registers and special constraints.  That reduces RA freedom, but it would
allow bswap:SI to be split after reload.  Or not split at all but
emitted directly as the sequence of those eor's - of course that makes
the assembly quite big and doesn't take advantage of the fact that we can
probably elide most reg-reg moves.  So splitting after reload might
allow the moves to be eliminated while avoiding the subreg dataflow.

That said, it probably needs (a lot of) experimenting.

What I've tried to communicate with the store-merging patch attempt is
that the GIMPLE optimizations do not have enough information to decide
whether a bswap replacement is profitable.  Or at least I can think of no
sophisticated criterion that would work for AVR and other targets.


* [Bug target/114252] Introducing bswapsi reduces code performance
  2024-03-06 10:01 [Bug tree-optimization/114252] New: Introducing bswapsi reduces code performance gjl at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2024-03-07 10:56 ` rguenth at gcc dot gnu.org
@ 2024-03-07 14:12 ` gjl at gcc dot gnu.org
  2024-03-07 15:02 ` pinskia at gcc dot gnu.org
  2024-03-07 17:52 ` rguenther at suse dot de
  15 siblings, 0 replies; 17+ messages in thread
From: gjl at gcc dot gnu.org @ 2024-03-07 14:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #14 from Georg-Johann Lay <gjl at gcc dot gnu.org> ---
The code in the example is not a perfect bswap; it needs additional shuffling
of bytes on top.  The tree passes must recognize when a bswap is not a perfect
fit.  There must be *some* criterion that depends on the permutation: apply a
bswap only when the bswapped byte order is closer to the required permutation
than the original byte order is, i.e. when the residual shuffle left after the
bswap is cheaper than performing the original permutation directly.
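One way such a criterion could look, as a sketch only (this is not what GCC implements; `matches` and `bswap_is_closer` are hypothetical names): represent a 4-byte load as a permutation perm[i] = source byte index of result byte i, and prefer a bswap only when the permutation agrees with a byte reversal in strictly more positions than with a plain (identity) load:

```c
/* Count positions where the permutation already matches a target order.  */
static int matches (const int perm[4], const int target[4])
{
    int n = 0;
    for (int i = 0; i < 4; i++)
        if (perm[i] == target[i])
            n++;
    return n;
}

/* Prefer a bswap only when the permutation is a strictly better fit for
   a byte reversal than for a plain load.  */
static int bswap_is_closer (const int perm[4])
{
    static const int identity[4] = { 0, 1, 2, 3 };
    static const int bswap[4]    = { 3, 2, 1, 0 };
    return matches (perm, bswap) > matches (perm, identity);
}
```

Under this sketch, the testcase's permutation { buf[1], buf[0], buf[3], buf[2] } matches neither order in any position, so the criterion would reject the bswap.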


* [Bug target/114252] Introducing bswapsi reduces code performance
  2024-03-06 10:01 [Bug tree-optimization/114252] New: Introducing bswapsi reduces code performance gjl at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2024-03-07 14:12 ` gjl at gcc dot gnu.org
@ 2024-03-07 15:02 ` pinskia at gcc dot gnu.org
  2024-03-07 17:52 ` rguenther at suse dot de
  15 siblings, 0 replies; 17+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-03-07 15:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #15 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Note one thing that might help is to define an alternative for bswap that
takes a memory operand and does the load that way.  That would definitely
help with the original code.


* [Bug target/114252] Introducing bswapsi reduces code performance
  2024-03-06 10:01 [Bug tree-optimization/114252] New: Introducing bswapsi reduces code performance gjl at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2024-03-07 15:02 ` pinskia at gcc dot gnu.org
@ 2024-03-07 17:52 ` rguenther at suse dot de
  15 siblings, 0 replies; 17+ messages in thread
From: rguenther at suse dot de @ 2024-03-07 17:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #16 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 7 Mar 2024, gjl at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252
> 
> --- Comment #14 from Georg-Johann Lay <gjl at gcc dot gnu.org> ---
> The code in the example is not a perfect bswap, it needs additional shuffling
> of bytes.  The tree passes must know that bswap is not a perfect fit.  There
> must be *some* criterion that depends on the permutation, and when a bswap is
> closer to the bswapped-permutation that a non-bswapped permutation is to the
> original one.

I think we only support bswap + rotate as a replacement (and just use that
if it matches).
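That combination does cover the testcase: since AVR is little-endian, the permutation { buf[1], buf[0], buf[3], buf[2] } equals a 32-bit bswap of the plain load followed by a 16-bit rotate, which would explain why the pass matched it.  A sketch (`func1_model` and the helpers are illustrative names, not GCC code):

```c
#include <stdint.h>

static uint32_t bswap32 (uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u)
         | ((v << 8) & 0x00ff0000u) | (v << 24);
}

static uint32_t rotl32 (uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32 - n));
}

/* The testcase's result { buf[1], buf[0], buf[3], buf[2] }, read back
   as a little-endian 32-bit value, as on AVR.  */
static uint32_t func1_model (const uint8_t *buf)
{
    return (uint32_t) buf[1] | ((uint32_t) buf[0] << 8)
         | ((uint32_t) buf[3] << 16) | ((uint32_t) buf[2] << 24);
}
```

For buf = {0x11, 0x22, 0x33, 0x44} the plain little-endian load is 0x44332211, and rotl32(bswap32(0x44332211), 16) reproduces func1_model's result.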


end of thread, other threads:[~2024-03-07 17:52 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-06 10:01 [Bug tree-optimization/114252] New: Introducing bswapsi reduces code performance gjl at gcc dot gnu.org
2024-03-06 11:55 ` [Bug target/114252] " rguenth at gcc dot gnu.org
2024-03-06 11:59 ` rguenth at gcc dot gnu.org
2024-03-06 12:15 ` gjl at gcc dot gnu.org
2024-03-06 12:37 ` rguenth at gcc dot gnu.org
2024-03-06 16:12 ` gjl at gcc dot gnu.org
2024-03-06 17:18 ` rguenther at suse dot de
2024-03-07  7:45 ` rguenth at gcc dot gnu.org
2024-03-07  8:45 ` gjl at gcc dot gnu.org
2024-03-07  9:05 ` gjl at gcc dot gnu.org
2024-03-07  9:14 ` rguenth at gcc dot gnu.org
2024-03-07  9:42 ` rguenth at gcc dot gnu.org
2024-03-07 10:47 ` gjl at gcc dot gnu.org
2024-03-07 10:56 ` rguenth at gcc dot gnu.org
2024-03-07 14:12 ` gjl at gcc dot gnu.org
2024-03-07 15:02 ` pinskia at gcc dot gnu.org
2024-03-07 17:52 ` rguenther at suse dot de
