public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/111502] New: Suboptimal unaligned 2/4-byte memcpy on strict-align targets
@ 2023-09-20 17:51 lasse.collin at tukaani dot org
2023-09-20 17:56 ` [Bug tree-optimization/111502] " andrew at sifive dot com
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: lasse.collin at tukaani dot org @ 2023-09-20 17:51 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502
Bug ID: 111502
Summary: Suboptimal unaligned 2/4-byte memcpy on strict-align
targets
Product: gcc
Version: 13.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: lasse.collin at tukaani dot org
Target Milestone: ---
I was playing with RISC-V GCC 12.2.0 from Arch Linux. I noticed
inefficient-looking assembly output in code that uses memcpy to access 32-bit
unaligned integers. I tried Godbolt with 16/32-bit integers, and it seems that
the same weirdness happens on RV32 & RV64 with GCC 13.2.0 and trunk, and also
on a few other targets. (Clang's output looks OK.)
For a little endian target:
#include <stdint.h>
#include <string.h>

uint32_t bytes16(const uint8_t *b)
{
    return (uint32_t)b[0]
        | ((uint32_t)b[1] << 8);
}

uint32_t copy16(const uint8_t *b)
{
    uint16_t v;
    memcpy(&v, b, sizeof(v));
    return v;
}
riscv64-linux-gnu-gcc -march=rv64gc -O2 -mtune=size
bytes16:
lhu a0,0(a0)
ret
copy16:
lhu a0,0(a0)
ret
That looks good because -mno-strict-align is the default.
After omitting -mtune=size, unaligned access isn't used (the output is the same
as with -mstrict-align):
riscv64-linux-gnu-gcc -march=rv64gc -O2
bytes16:
lbu a5,1(a0)
lbu a0,0(a0)
slli a5,a5,8
or a0,a5,a0
ret
copy16:
lbu a4,0(a0)
lbu a5,1(a0)
addi sp,sp,-16
sb a4,14(sp)
sb a5,15(sp)
lhu a0,14(sp)
addi sp,sp,16
jr ra
bytes16 looks good, but copy16 is weird: the bytes are copied to an aligned
location on the stack and then loaded back.
On Godbolt it happens with GCC 13.2.0 on RV32, RV64, ARM64 (but only if using
-mstrict-align), MIPS64EL, and SPARC & SPARC64 (comparison needs big endian
bytes16). For ARM64 and MIPS64EL the oldest GCC on Godbolt is GCC 5.4 and the
same thing happens with that too.
32-bit reads with -O2 behave similarly. With -Os a call to memcpy is emitted
for copy32 but not for bytes32.
#include <stdint.h>
#include <string.h>

uint32_t bytes32(const uint8_t *b)
{
    return (uint32_t)b[0]
        | ((uint32_t)b[1] << 8)
        | ((uint32_t)b[2] << 16)
        | ((uint32_t)b[3] << 24);
}

uint32_t copy32(const uint8_t *b)
{
    uint32_t v;
    memcpy(&v, b, sizeof(v));
    return v;
}
riscv64-linux-gnu-gcc -march=rv64gc -O2
bytes32:
lbu a4,1(a0)
lbu a3,0(a0)
lbu a5,2(a0)
lbu a0,3(a0)
slli a4,a4,8
or a4,a4,a3
slli a5,a5,16
or a5,a5,a4
slli a0,a0,24
or a0,a0,a5
sext.w a0,a0
ret
copy32:
lbu a2,0(a0)
lbu a3,1(a0)
lbu a4,2(a0)
lbu a5,3(a0)
addi sp,sp,-16
sb a2,12(sp)
sb a3,13(sp)
sb a4,14(sp)
sb a5,15(sp)
lw a0,12(sp)
addi sp,sp,16
jr ra
riscv64-linux-gnu-gcc -march=rv64gc -Os
bytes32:
lbu a4,1(a0)
lbu a5,0(a0)
slli a4,a4,8
or a4,a4,a5
lbu a5,2(a0)
lbu a0,3(a0)
slli a5,a5,16
or a5,a5,a4
slli a0,a0,24
or a0,a0,a5
sext.w a0,a0
ret
copy32:
addi sp,sp,-32
mv a1,a0
li a2,4
addi a0,sp,12
sd ra,24(sp)
call memcpy@plt
ld ra,24(sp)
lw a0,12(sp)
addi sp,sp,32
jr ra
I probably cannot test any proposed fixes but I hope this report is still
useful. Thanks!
^ permalink raw reply [flat|nested] 8+ messages in thread
* [Bug tree-optimization/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets
2023-09-20 17:51 [Bug tree-optimization/111502] New: Suboptimal unaligned 2/4-byte memcpy on strict-align targets lasse.collin at tukaani dot org
@ 2023-09-20 17:56 ` andrew at sifive dot com
2023-09-20 18:37 ` [Bug target/111502] " lasse.collin at tukaani dot org
` (5 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: andrew at sifive dot com @ 2023-09-20 17:56 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502
Andrew Waterman <andrew at sifive dot com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrew at sifive dot com
--- Comment #1 from Andrew Waterman <andrew at sifive dot com> ---
This isn't actually a bug. Quoting the RVA profile spec, "misaligned loads and
stores might execute extremely slowly"--which is code for the possibility that
they might be trapped and emulated, taking hundreds of clock cycles apiece. So
the default behavior of emitting byte accesses is best when generating generic
code. (Of course, when tuning for a particular microarchitecture, the shorter
code sequence may be emitted.)
* [Bug target/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets
2023-09-20 17:51 [Bug tree-optimization/111502] New: Suboptimal unaligned 2/4-byte memcpy on strict-align targets lasse.collin at tukaani dot org
2023-09-20 17:56 ` [Bug tree-optimization/111502] " andrew at sifive dot com
@ 2023-09-20 18:37 ` lasse.collin at tukaani dot org
2023-09-20 18:42 ` [Bug middle-end/111502] " pinskia at gcc dot gnu.org
` (4 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: lasse.collin at tukaani dot org @ 2023-09-20 18:37 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502
--- Comment #2 from Lasse Collin <lasse.collin at tukaani dot org> ---
Byte access by default is good when the compiler doesn't know if unaligned is
fast on the target processor. There is no disagreement here.
What I suspect is a bug is the instruction sequence used for byte access in
copy16 and copy32 cases. copy16 uses 2 * lbu + 2 * sb + 1 * lhu, that is, five
memory operations to load an unaligned 16-bit integer. copy32 uses 4 * lbu + 4
* sb + 1 * lw, that is, nine memory operations to load a 32-bit integer.
bytes16 needs two memory operations and bytes32 needs four. Clang generates
this kind of code from both bytesxx and copyxx.
* [Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets
2023-09-20 17:51 [Bug tree-optimization/111502] New: Suboptimal unaligned 2/4-byte memcpy on strict-align targets lasse.collin at tukaani dot org
2023-09-20 17:56 ` [Bug tree-optimization/111502] " andrew at sifive dot com
2023-09-20 18:37 ` [Bug target/111502] " lasse.collin at tukaani dot org
@ 2023-09-20 18:42 ` pinskia at gcc dot gnu.org
2023-09-20 18:49 ` pinskia at gcc dot gnu.org
` (3 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-09-20 18:42 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |middle-end
--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is a dup of another bug: basically, memcpy is not changed into an
unaligned load ...
* [Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets
2023-09-20 17:51 [Bug tree-optimization/111502] New: Suboptimal unaligned 2/4-byte memcpy on strict-align targets lasse.collin at tukaani dot org
` (2 preceding siblings ...)
2023-09-20 18:42 ` [Bug middle-end/111502] " pinskia at gcc dot gnu.org
@ 2023-09-20 18:49 ` pinskia at gcc dot gnu.org
2023-09-20 20:06 ` lasse.collin at tukaani dot org
` (2 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-09-20 18:49 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |50417
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
See PR 50417.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50417
[Bug 50417] [11/12/13/14 regression]: memcpy with known alignment
* [Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets
2023-09-20 17:51 [Bug tree-optimization/111502] New: Suboptimal unaligned 2/4-byte memcpy on strict-align targets lasse.collin at tukaani dot org
` (3 preceding siblings ...)
2023-09-20 18:49 ` pinskia at gcc dot gnu.org
@ 2023-09-20 20:06 ` lasse.collin at tukaani dot org
2023-09-20 20:20 ` andrew at sifive dot com
2023-09-21 6:22 ` rguenth at gcc dot gnu.org
6 siblings, 0 replies; 8+ messages in thread
From: lasse.collin at tukaani dot org @ 2023-09-20 20:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502
--- Comment #5 from Lasse Collin <lasse.collin at tukaani dot org> ---
If I understood correctly, PR 50417 is about wishing that GCC would infer that
a pointer given to memcpy has alignment higher than one. In my examples the
alignment of the uint8_t *b argument is one and thus byte-by-byte access is
needed (if the target processor doesn't have fast unaligned access, determined
from -mtune and -mno-strict-align).
My report is about the instruction sequence used for the byte-by-byte access.
Omitting the stack pointer manipulation and return instruction, this is
bytes16:
lbu a5,1(a0)
lbu a0,0(a0)
slli a5,a5,8
or a0,a5,a0
And copy16:
lbu a4,0(a0)
lbu a5,1(a0)
sb a4,14(sp)
sb a5,15(sp)
lhu a0,14(sp)
Is the latter as good as the former? If it is, then this report might be
invalid and I apologize for the noise.
PR 50417 includes a case where a memcpy(a, b, 4) generates an actual call to
memcpy, so that is the same detail as the -Os case in my first message. Calling
memcpy instead of expanding it inline saves six bytes in RV64C. On ARM64 with
-Os -mstrict-align the call doesn't save space:
bytes32:
ldrb w1, [x0]
ldrb w2, [x0, 1]
orr x2, x1, x2, lsl 8
ldrb w1, [x0, 2]
ldrb w0, [x0, 3]
orr x1, x2, x1, lsl 16
orr w0, w1, w0, lsl 24
ret
copy32:
stp x29, x30, [sp, -32]!
mov x1, x0
mov x2, 4
mov x29, sp
add x0, sp, 28
bl memcpy
ldr w0, [sp, 28]
ldp x29, x30, [sp], 32
ret
And on ARM64 with -O2 -mstrict-align, shuffling via the stack is longer too:
bytes32:
ldrb w4, [x0]
ldrb w2, [x0, 1]
ldrb w1, [x0, 2]
ldrb w3, [x0, 3]
orr x2, x4, x2, lsl 8
orr x0, x2, x1, lsl 16
orr w0, w0, w3, lsl 24
ret
copy32:
sub sp, sp, #16
ldrb w3, [x0]
ldrb w2, [x0, 1]
ldrb w1, [x0, 2]
ldrb w0, [x0, 3]
strb w3, [sp, 12]
strb w2, [sp, 13]
strb w1, [sp, 14]
strb w0, [sp, 15]
ldr w0, [sp, 12]
add sp, sp, 16
ret
ARM64 with -mstrict-align might be a contrived example in practice though.
* [Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets
2023-09-20 17:51 [Bug tree-optimization/111502] New: Suboptimal unaligned 2/4-byte memcpy on strict-align targets lasse.collin at tukaani dot org
` (4 preceding siblings ...)
2023-09-20 20:06 ` lasse.collin at tukaani dot org
@ 2023-09-20 20:20 ` andrew at sifive dot com
2023-09-21 6:22 ` rguenth at gcc dot gnu.org
6 siblings, 0 replies; 8+ messages in thread
From: andrew at sifive dot com @ 2023-09-20 20:20 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502
--- Comment #6 from Andrew Waterman <andrew at sifive dot com> ---
Ack, I misunderstood your earlier message. You're of course right that the
load/load/shift/or sequence is preferable to the load/load/store/store/load
sequence on just about any practical implementation. That the memcpy version
is optimized less effectively does seem to be disjoint from the issue Andrew
mentioned.
* [Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets
2023-09-20 17:51 [Bug tree-optimization/111502] New: Suboptimal unaligned 2/4-byte memcpy on strict-align targets lasse.collin at tukaani dot org
` (5 preceding siblings ...)
2023-09-20 20:20 ` andrew at sifive dot com
@ 2023-09-21 6:22 ` rguenth at gcc dot gnu.org
6 siblings, 0 replies; 8+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-09-21 6:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
You need to trace RTL expansion where it decides to use a temporary stack slot,
there's likely some instruction pattern missing to compose the larger words.