[Bug c/113779] New: Very inefficient m68k code generated for simple copy loop

public inbox for gcc-bugs@sourceware.org

* [Bug c/113779] New: Very inefficient m68k code generated for simple copy loop
From: miro.kropacek at gmail dot com @ 2024-02-05 21:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

            Bug ID: 113779
           Summary: Very inefficient m68k code generated for simple copy
                    loop
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: miro.kropacek at gmail dot com
  Target Milestone: ---

Even a loop as simple as this:

void f(const long* src, long* dst, int count) {
        for (int i = 0; i < count; i++) {
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
        }
}

is compiled to:

#NO_APP
        .file   "test.c"
        .text
        .align  2
        .globl  f
        .type   f, @function
f:
        move.l 4(%sp),%a0
        move.l 8(%sp),%a1
        move.l 12(%sp),%d1
        jle .L1
        clr.l %d0
.L3:
        move.l (%a0),(%a1)
        move.l 4(%a0),4(%a1)
        move.l 8(%a0),8(%a1)
        move.l 12(%a0),12(%a1)
        move.l 16(%a0),16(%a1)
        move.l 20(%a0),20(%a1)
        move.l 24(%a0),24(%a1)
        move.l 28(%a0),28(%a1)
        move.l 32(%a0),32(%a1)
        move.l 36(%a0),36(%a1)
        move.l 40(%a0),40(%a1)
        move.l 44(%a0),44(%a1)
        move.l 48(%a0),48(%a1)
        move.l 52(%a0),52(%a1)
        move.l 56(%a0),56(%a1)
        add.w #64,%a0
        add.w #64,%a1
        move.l -4(%a0),-4(%a1)
        addq.l #1,%d0
        cmp.l %d1,%d0
        jne .L3
.L1:
        rts
        .size   f, .-f
        .ident  "GCC: (GNU) 13.2.0"

This has been the case for ages: gcc 4.6.4, gcc 7.2.0 and most recently
gcc 13.2.0 all behave this way. The last versions reported to generate
move.l (a0)+,(a1)+ were gcc 2.95 and gcc 3.x.

So what's the catch here? Why does gcc hate move.l (ax)+,(ay)+ so much? Tested
with m68k-elf-gcc -O2 -fomit-frame-pointer -m68020-60.

* [Bug target/113779] Very inefficient m68k code generated for simple copy loop
From: pinskia at gcc dot gnu.org @ 2024-02-05 21:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> So what's the catch here? Why does gcc hate move.l (ax)+,(ay)+ so much?

At one point (before GCC 8 or 9, I think), GCC's IVOPTs optimization did not
take post/pre-increment into account, but now it does.
But if the target cost model does not take those into account, then IVOPTs
could decide not to use them.
m68k is a target that few GCC developers work on, so it is up to someone to
look into why post-increment is no longer being used.
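
To illustrate the cost-model point, here is a toy sketch in C. It is not GCC
code: the enum, struct and function names are hypothetical and do not
correspond to GCC's internals or target hooks. It only shows that a selector
which picks the cheapest addressing mode never chooses post-increment unless
the cost table actually makes it cheaper.

```c
#include <assert.h>

/* Hypothetical addressing-mode kinds; not GCC's internal representation. */
enum addr_mode { MODE_REG_OFFSET, MODE_POST_INC, MODE_COUNT };

/* A toy cost table: one cost per addressing mode (illustrative values). */
struct cost_model { int cost[MODE_COUNT]; };

/* Pick the cheapest mode. On a tie the earlier candidate (reg+offset)
   is kept, so post-increment must be strictly cheaper to be chosen. */
static enum addr_mode pick_mode(const struct cost_model *m)
{
    assert(m);
    enum addr_mode best = MODE_REG_OFFSET;
    for (int i = 1; i < MODE_COUNT; i++)
        if (m->cost[i] < m->cost[best])
            best = (enum addr_mode)i;
    return best;
}

/* A model that does not distinguish the two modes, and one that does. */
static const struct cost_model flat_model  = { { 1, 1 } };
static const struct cost_model aware_model = { { 3, 1 } };
```

With flat_model the selector stays with reg+offset; with aware_model it picks
post-increment.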

* [Bug target/113779] Very inefficient m68k code generated for simple copy loop
From: rguenth at gcc dot gnu.org @ 2024-02-06  7:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
I don't think IVOPTs would use postinc for the intermediate increments.  It's
constant propagation/forwarding that accumulates the increments into a constant
offset, which removes the dependences between the instructions and thus would
allow the loads/stores to be executed in parallel (well, not that m68k uarchs
are likely to do any of that ...).

I wonder if the code we emit is measurably slower though?  It's possibly
a little bit larger due to the two IV increments.
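
To make the two shapes concrete, here is a minimal C sketch. The function
names are made up for illustration, and it uses 4 copies instead of 16. The
first function is what the test case says; the second is what it effectively
becomes once the increments are folded into constant offsets. Both store the
same values, but in the second form no access depends on an earlier pointer
update.

```c
#include <assert.h>
#include <stddef.h>

/* Form 1: chained post-increments, as written in the test case. */
static void copy4_postinc(const long *src, long *dst)
{
    assert(src && dst);
    *dst++ = *src++;
    *dst++ = *src++;
    *dst++ = *src++;
    *dst++ = *src++;
}

/* Form 2: constant offsets from unchanged base pointers, which is what
   the accumulated increments amount to. */
static void copy4_offsets(const long *src, long *dst)
{
    assert(src && dst);
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

/* Returns 1 if both forms copy the same four values, 0 otherwise. */
static int forms_agree(void)
{
    const long in[4] = { 10, 20, 30, 40 };
    long a[4] = { 0 }, b[4] = { 0 };

    copy4_postinc(in, a);
    copy4_offsets(in, b);
    for (size_t i = 0; i < 4; i++)
        if (a[i] != in[i] || b[i] != in[i])
            return 0;
    return 1;
}
```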

* [Bug target/113779] Very inefficient m68k code generated for simple copy loop
From: miro.kropacek at gmail dot com @ 2024-02-06  8:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

--- Comment #3 from Miro Kropacek <miro.kropacek at gmail dot com> ---
> I wonder if the code we emit is measurably slower though?  It's possibly
> a little bit larger due to the two IV increments.

It's definitely slower, as each of the two offsets next to the An registers
needs its own extension word. So instead of the 2-byte instruction "move.l
(a0)+,(a1)+" we have a 6-byte instruction "move.l off(a0),off(a1)", and that
hurts a lot even on the 68060, not to mention the poor 68000.
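
As a rough size check, the copy instructions in the loop from the original
report can be tallied as below. This assumes the usual encodings: a 16-bit
opcode word plus one 16-bit extension word per displacement or immediate.
The constant and function names are illustrative only. It counts bytes for
the moves and pointer updates, not the loop counter or branch, and it says
nothing about cycles.

```c
#include <assert.h>

enum {
    WORD_BYTES    = 2,              /* opcode and extension words are 16 bits */
    MOVE_POSTINC  = WORD_BYTES,     /* move.l (a0)+,(a1)+ : opcode word only */
    MOVE_INDIRECT = WORD_BYTES,     /* move.l (a0),(a1)   : opcode word only */
    MOVE_DISP16   = 3 * WORD_BYTES, /* move.l off(a0),off(a1): opcode + 2 displacements */
    ADD_W_IMM     = 2 * WORD_BYTES  /* add.w #64,%aN: opcode + 16-bit immediate */
};

/* Bytes for n copies in the shape gcc 13.2.0 emitted: one plain indirect
   move, n-1 moves with displacements, and two pointer updates. */
static int emitted_copy_bytes(int n)
{
    assert(n >= 1);
    return MOVE_INDIRECT + (n - 1) * MOVE_DISP16 + 2 * ADD_W_IMM;
}

/* Bytes for n copies using post-increment on both operands. */
static int postinc_copy_bytes(int n)
{
    assert(n >= 1);
    return n * MOVE_POSTINC;
}
```

For the 16 copies in the test case this gives 100 bytes for the emitted
sequence against 32 bytes for the post-increment version.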

* [Bug target/113779] Very inefficient m68k code generated for simple copy loop
From: mikpelinux at gmail dot com @ 2024-02-06 12:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

--- Comment #4 from Mikael Pettersson <mikpelinux at gmail dot com> ---
I'm not sure this is an m68k bug. I tried several targets that have
auto-increment addressing modes (m68k, pdp11, msp430, vax, aarch64) and none of
them would use auto-increment for this test case.

* [Bug target/113779] Very inefficient m68k code generated for simple copy loop
From: miro.kropacek at gmail dot com @ 2024-02-06 12:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

--- Comment #5 from Miro Kropacek <miro.kropacek at gmail dot com> ---
I have been told that one of the reasons why post-incrementing modes are not
supported / preferred these days is that they stall the CPU pipeline (of
course, totally not applicable on m68k). With constant offsets the moves can
be executed in parallel, whereas with post-increment each instruction has to
wait for the previous one to finish updating the address register.

So I could understand why this has been changed, but it definitely shouldn't
be a change that applies to all possible CPUs.

* [Bug target/113779] Very inefficient m68k code generated for simple copy loop
From: rguenth at gcc dot gnu.org @ 2024-02-06 13:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2024-02-06
     Ever confirmed|0                           |1

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
It's already visible with a simple

void f(const long* src, long* dst)
{
  *dst++ = *src++;
  *dst = *src;
}

where we expand to RTL from

  _1 = *src_3(D);
  *dst_4(D) = _1;
  _2 = MEM[(const long int *)src_3(D) + 4B];
  MEM[(long int *)dst_4(D) + 4B] = _2;

There's nothing on GIMPLE that would split the add, and RTL's auto-inc-dec
pass doesn't do anything either.  We'd need a form of "strength-reduction",
or maybe targets preferring auto-inc/dec should not legitimize constant
offsets before reload ...

Note with one more copy you then see

  _1 = *src_4(D);
  *dst_5(D) = _1;
  _2 = MEM[(const long int *)src_4(D) + 4B];
  MEM[(long int *)dst_5(D) + 4B] = _2;
  _3 = MEM[(const long int *)src_4(D) + 8B];
  MEM[(long int *)dst_5(D) + 8B] = _3;

and naively splitting gives you

  src_6 = src_4(D) + 4;
  src_7 = src_4(D) + 8;

That said, it's really something for RTL, since which form is more efficient
is going to be highly target dependent.  The auto-inc pass is well
structured, so it should be possible to extend it.
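
As a toy illustration of the check such an extension would need, the C
sketch below tests whether a run of constant-offset accesses from one base
could be rewritten as a chain of post-increments. This is not the
auto-inc-dec pass or any GCC code; the function name and the byte-offset
representation are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if n accesses at the given byte offsets from a single base,
   each access_size bytes wide, start at offset 0 and are exactly
   contiguous in order, i.e. each could be done as "(An)+" with the
   register advancing by access_size after every access. */
static int is_postinc_chain(const long *offsets, size_t n, long access_size)
{
    assert(offsets && access_size > 0);
    for (size_t i = 0; i < n; i++)
        if (offsets[i] != (long)i * access_size)
            return 0;
    return 1;
}
```

The offsets 0, 4, 8 from the three-copy GIMPLE above, with 4-byte accesses,
pass this check; offsets 0, 8 would not, since the second access is not
adjacent to the first.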

* [Bug target/113779] Very inefficient m68k code generated for simple copy loop
From: hp at gcc dot gnu.org @ 2024-02-17  0:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779

Hans-Peter Nilsson <hp at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hp at gcc dot gnu.org

--- Comment #7 from Hans-Peter Nilsson <hp at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #6)
> The auto-inc pass is well
> structured, so it should be possible to extend it.
Or just replace it, as it doesn't look far enough to be able to handle all
incdec-opportunities.
