public inbox for gcc-bugs@sourceware.org
* [Bug c/113779] New: Very inefficient m68k code generated for simple copy loop
@ 2024-02-05 21:07 miro.kropacek at gmail dot com
2024-02-05 21:20 ` [Bug target/113779] " pinskia at gcc dot gnu.org
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: miro.kropacek at gmail dot com @ 2024-02-05 21:07 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779
Bug ID: 113779
Summary: Very inefficient m68k code generated for simple copy
loop
Product: gcc
Version: 13.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: miro.kropacek at gmail dot com
Target Milestone: ---
Even a loop as simple as this:
void f(const long* src, long* dst, int count) {
    for (int i = 0; i < count; i++) {
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
    }
}
is compiled to:
#NO_APP
        .file   "test.c"
        .text
        .align  2
        .globl  f
        .type   f, @function
f:
        move.l 4(%sp),%a0
        move.l 8(%sp),%a1
        move.l 12(%sp),%d1
        jle .L1
        clr.l %d0
.L3:
        move.l (%a0),(%a1)
        move.l 4(%a0),4(%a1)
        move.l 8(%a0),8(%a1)
        move.l 12(%a0),12(%a1)
        move.l 16(%a0),16(%a1)
        move.l 20(%a0),20(%a1)
        move.l 24(%a0),24(%a1)
        move.l 28(%a0),28(%a1)
        move.l 32(%a0),32(%a1)
        move.l 36(%a0),36(%a1)
        move.l 40(%a0),40(%a1)
        move.l 44(%a0),44(%a1)
        move.l 48(%a0),48(%a1)
        move.l 52(%a0),52(%a1)
        move.l 56(%a0),56(%a1)
        add.w #64,%a0
        add.w #64,%a1
        move.l -4(%a0),-4(%a1)
        addq.l #1,%d0
        cmp.l %d1,%d0
        jne .L3
.L1:
        rts
        .size   f, .-f
        .ident  "GCC: (GNU) 13.2.0"
It has been like this for ages: gcc 4.6.4, gcc 7.2.0 and most recently gcc
13.2.0. The last versions reported to turn this into move.l (a0)+,(a1)+ were
gcc 2.95 and gcc 3.x.
So what's the catch here? Why gcc hates move.l (ax)+,(ay)+ so much? Tested with
m68k-elf-gcc -O2 -fomit-frame-pointer -m68020-60.
* [Bug target/113779] Very inefficient m68k code generated for simple copy loop
From: pinskia at gcc dot gnu.org @ 2024-02-05 21:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> So what's the catch here? Why gcc hates move.l (ax)+,(ay)+ so much?
At one point in time (before GCC 9 or 8 or so, I think), GCC's IVOPTs
optimization did not take post/pre-increment into account, but now it does.
However, if the target cost model does not take them into account, IVOPTs
could decide not to use them.
m68k is a target that not many GCC developers work on, so it is up to someone
to look into why the post-increment is no longer being used.
* [Bug target/113779] Very inefficient m68k code generated for simple copy loop
From: rguenth at gcc dot gnu.org @ 2024-02-06 7:58 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
I don't think IVOPTs would use postinc for the intermediate increments. It's
constant propagation/forwarding that accumulates the increments to a constant
offset which removes dependences on the instructions and thus would allow the
loads/stores to be executed in parallel (well, not that m68k uarchs likely can
do any of that ...).
I wonder if the code we emit is measurably slower though? It's possibly
a little bit larger due to the two IV increments.
* [Bug target/113779] Very inefficient m68k code generated for simple copy loop
From: miro.kropacek at gmail dot com @ 2024-02-06 8:16 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779
--- Comment #3 from Miro Kropacek <miro.kropacek at gmail dot com> ---
> I wonder if the code we emit is measurably slower though? It's possibly
> a little bit larger due to the two IV increments.
It's definitely slower, as each of the two offsets next to the An registers
takes a separate extension word. So instead of the 2-byte instruction "move.l
(a0)+,(a1)+" we have a 6-byte instruction "move.l off(a0),off(a1)", and that
hurts a lot even on the 68060, not to mention the poor 68000.
* [Bug target/113779] Very inefficient m68k code generated for simple copy loop
From: mikpelinux at gmail dot com @ 2024-02-06 12:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779
--- Comment #4 from Mikael Pettersson <mikpelinux at gmail dot com> ---
I'm not sure this is an m68k bug. I tried several targets that have
auto-increment addressing modes (m68k, pdp11, msp430, vax, aarch64) and none of
them would use auto-increment for this test case.
* [Bug target/113779] Very inefficient m68k code generated for simple copy loop
From: miro.kropacek at gmail dot com @ 2024-02-06 12:58 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779
--- Comment #5 from Miro Kropacek <miro.kropacek at gmail dot com> ---
I have been told that one of the reasons post-incrementing modes are not
supported / preferred these days is that they stall the CPU pipeline (which, of
course, does not apply to m68k at all). With offsets you can parallelize the
moves, whereas when post-incrementing a1 you always have to wait for the
previous instruction to finish.
So I could understand that this has been changed, but it definitely shouldn't
be a change that affects all possible CPUs.
* [Bug target/113779] Very inefficient m68k code generated for simple copy loop
From: rguenth at gcc dot gnu.org @ 2024-02-06 13:14 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2024-02-06
     Ever confirmed|0                           |1
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
It's already visible with a simple
void f(const long* src, long* dst)
{
*dst++ = *src++;
*dst = *src;
}
where we expand to RTL from
_1 = *src_3(D);
*dst_4(D) = _1;
_2 = MEM[(const long int *)src_3(D) + 4B];
MEM[(long int *)dst_4(D) + 4B] = _2;
There's nothing on GIMPLE that would split the add, and RTL's auto-inc-dec
pass doesn't do anything either. We'd need a form of "strength reduction",
or maybe targets preferring auto-inc/dec should not legitimize constant
offsets before reload ...
Note with one more copy you then see
_1 = *src_4(D);
*dst_5(D) = _1;
_2 = MEM[(const long int *)src_4(D) + 4B];
MEM[(long int *)dst_5(D) + 4B] = _2;
_3 = MEM[(const long int *)src_4(D) + 8B];
MEM[(long int *)dst_5(D) + 8B] = _3;
and naively splitting gives you
src_6 = src_4(D) + 4;
src_7 = src_4(D) + 8;
That said, it's really something for RTL, since which form is more efficient
is going to be highly target dependent. The auto-inc pass is well
structured, so it should be possible to extend it.
* [Bug target/113779] Very inefficient m68k code generated for simple copy loop
From: hp at gcc dot gnu.org @ 2024-02-17 0:39 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113779
Hans-Peter Nilsson <hp at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hp at gcc dot gnu.org
--- Comment #7 from Hans-Peter Nilsson <hp at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #6)
> The auto-inc pass is well
> structured, so it should be possible to extend it.
Or just replace it, as it doesn't look far enough to be able to handle all
incdec-opportunities.