public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/111502] New: Suboptimal unaligned 2/4-byte memcpy on strict-align targets
@ 2023-09-20 17:51 lasse.collin at tukaani dot org
From: lasse.collin at tukaani dot org @ 2023-09-20 17:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

            Bug ID: 111502
           Summary: Suboptimal unaligned 2/4-byte memcpy on strict-align
                    targets
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lasse.collin at tukaani dot org
  Target Milestone: ---

I was playing with RISC-V GCC 12.2.0 from Arch Linux. I noticed
inefficient-looking assembly output in code that uses memcpy to access 32-bit
unaligned integers. I tried Godbolt with 16/32-bit integers and it seems that
the same weirdness happens on RV32 & RV64 with GCC 13.2.0 and trunk, and also
on a few other targets. (Clang's output looks OK.)

For a little endian target:

#include <stdint.h>
#include <string.h>

uint32_t bytes16(const uint8_t *b)
{
    return (uint32_t)b[0]
        | ((uint32_t)b[1] << 8);
}

uint32_t copy16(const uint8_t *b)
{
    uint16_t v;
    memcpy(&v, b, sizeof(v));
    return v;
}

riscv64-linux-gnu-gcc -march=rv64gc -O2 -mtune=size

bytes16:
        lhu     a0,0(a0)
        ret

copy16:
        lhu     a0,0(a0)
        ret

That looks good because -mno-strict-align is the default.

After omitting -mtune=size, unaligned access isn't used (the output is the same
as with -mstrict-align):

riscv64-linux-gnu-gcc -march=rv64gc -O2

bytes16:
        lbu     a5,1(a0)
        lbu     a0,0(a0)
        slli    a5,a5,8
        or      a0,a5,a0
        ret

copy16:
        lbu     a4,0(a0)
        lbu     a5,1(a0)
        addi    sp,sp,-16
        sb      a4,14(sp)
        sb      a5,15(sp)
        lhu     a0,14(sp)
        addi    sp,sp,16
        jr      ra

bytes16 looks good but copy16 is weird: the bytes are copied to an aligned
location on the stack and then loaded back.

On Godbolt this happens with GCC 13.2.0 on RV32, RV64, ARM64 (but only with
-mstrict-align), MIPS64EL, and SPARC & SPARC64 (the comparison needs a
big-endian bytes16). For ARM64 and MIPS64EL the oldest GCC on Godbolt is GCC
5.4, and the same thing happens with it too.

32-bit reads with -O2 behave similarly. With -Os a call to memcpy is emitted
for copy32 but not for bytes32.

#include <stdint.h>
#include <string.h>

uint32_t bytes32(const uint8_t *b)
{
    return (uint32_t)b[0]
        | ((uint32_t)b[1] << 8)
        | ((uint32_t)b[2] << 16)
        | ((uint32_t)b[3] << 24);
}

uint32_t copy32(const uint8_t *b)
{
    uint32_t v;
    memcpy(&v, b, sizeof(v));
    return v;
}

riscv64-linux-gnu-gcc -march=rv64gc -O2

bytes32:
        lbu     a4,1(a0)
        lbu     a3,0(a0)
        lbu     a5,2(a0)
        lbu     a0,3(a0)
        slli    a4,a4,8
        or      a4,a4,a3
        slli    a5,a5,16
        or      a5,a5,a4
        slli    a0,a0,24
        or      a0,a0,a5
        sext.w  a0,a0
        ret

copy32:
        lbu     a2,0(a0)
        lbu     a3,1(a0)
        lbu     a4,2(a0)
        lbu     a5,3(a0)
        addi    sp,sp,-16
        sb      a2,12(sp)
        sb      a3,13(sp)
        sb      a4,14(sp)
        sb      a5,15(sp)
        lw      a0,12(sp)
        addi    sp,sp,16
        jr      ra

riscv64-linux-gnu-gcc -march=rv64gc -Os

bytes32:
        lbu     a4,1(a0)
        lbu     a5,0(a0)
        slli    a4,a4,8
        or      a4,a4,a5
        lbu     a5,2(a0)
        lbu     a0,3(a0)
        slli    a5,a5,16
        or      a5,a5,a4
        slli    a0,a0,24
        or      a0,a0,a5
        sext.w  a0,a0
        ret

copy32:
        addi    sp,sp,-32
        mv      a1,a0
        li      a2,4
        addi    a0,sp,12
        sd      ra,24(sp)
        call    memcpy@plt
        ld      ra,24(sp)
        lw      a0,12(sp)
        addi    sp,sp,32
        jr      ra

I probably cannot test any proposed fixes but I hope this report is still
useful. Thanks!


* [Bug tree-optimization/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets
From: andrew at sifive dot com @ 2023-09-20 17:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

Andrew Waterman <andrew at sifive dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrew at sifive dot com

--- Comment #1 from Andrew Waterman <andrew at sifive dot com> ---
This isn't actually a bug.  Quoting the RVA profile spec, "misaligned loads and
stores might execute extremely slowly"--which is code for the possibility that
they might be trapped and emulated, taking hundreds of clock cycles apiece.  So
the default behavior of emitting byte accesses is best when generating generic
code.  (Of course, when tuning for a particular microarchitecture, the shorter
code sequence may be emitted.)
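
For illustration, a minimal sketch (not from the profile spec) of the kind of
direct access that generic code must avoid; memcpy is the portable way to
express the same load:

#include <stdint.h>

uint32_t direct32(const uint8_t *b)
{
    /* If b is not 4-byte aligned, this dereference is undefined behavior
       in ISO C, and the single lw it would compile to is exactly the
       misaligned load that may be trapped and emulated on some
       implementations. */
    return *(const uint32_t *)b;
}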


* [Bug target/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets
From: lasse.collin at tukaani dot org @ 2023-09-20 18:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

--- Comment #2 from Lasse Collin <lasse.collin at tukaani dot org> ---
Byte access by default is good when the compiler doesn't know whether
unaligned access is fast on the target processor. There is no disagreement
here.

What I suspect is a bug is the instruction sequence used for the byte access
in the copy16 and copy32 cases. copy16 uses 2 * lbu + 2 * sb + 1 * lhu, that
is, five memory operations to load an unaligned 16-bit integer. copy32 uses
4 * lbu + 4 * sb + 1 * lw, that is, nine memory operations to load an
unaligned 32-bit integer.

bytes16 needs two memory operations and bytes32 needs four. Clang generates
this kind of code from both bytesxx and copyxx.


* [Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets
From: pinskia at gcc dot gnu.org @ 2023-09-20 18:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |middle-end

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is a dup of another bug. Basically, memcpy is not changed into an
unaligned load.


* [Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets
From: pinskia at gcc dot gnu.org @ 2023-09-20 18:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |50417

--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
See PR 50417.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50417
[Bug 50417] [11/12/13/14 regression]: memcpy with known alignment


* [Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets
From: lasse.collin at tukaani dot org @ 2023-09-20 20:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

--- Comment #5 from Lasse Collin <lasse.collin at tukaani dot org> ---
If I understood correctly, PR 50417 is about wishing that GCC would infer
that a pointer given to memcpy has an alignment greater than one. In my
examples the alignment of the uint8_t *b argument is one, and thus
byte-by-byte access is needed (if the target processor doesn't have fast
unaligned access, as determined from -mtune and -mno-strict-align).
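
For contrast, a minimal sketch (my own, not taken from PR 50417; the function
name is made up) of the known-alignment situation that PR 50417 seems to be
about:

#include <stdint.h>
#include <string.h>

uint32_t copy_from_words(const uint32_t *words)
{
    uint32_t v;
    /* The source pointer type already guarantees 4-byte alignment, so a
       single lw would be fine even on strict-align targets; as I read it,
       the wish in PR 50417 is that GCC tracks this kind of alignment
       information through memcpy. In my examples b has alignment 1. */
    memcpy(&v, words, sizeof(v));
    return v;
}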

My report is about the instruction sequence used for the byte-by-byte access.

Omitting the stack pointer manipulation and return instruction, this is
bytes16:

        lbu     a5,1(a0)
        lbu     a0,0(a0)
        slli    a5,a5,8
        or      a0,a5,a0

And copy16:

        lbu     a4,0(a0)
        lbu     a5,1(a0)
        sb      a4,14(sp)
        sb      a5,15(sp)
        lhu     a0,14(sp)

Is the latter as good as the former? If yes, then this report might be
invalid and I apologize for the noise.

PR 50417 includes a case where a memcpy(a, b, 4) generates an actual call to
memcpy, so that is the same detail as the -Os case in my first message. Calling
memcpy instead of expanding it inline saves six bytes in RV64C. On ARM64 with
-Os -mstrict-align the call doesn't save space:

bytes32:
        ldrb    w1, [x0]
        ldrb    w2, [x0, 1]
        orr     x2, x1, x2, lsl 8
        ldrb    w1, [x0, 2]
        ldrb    w0, [x0, 3]
        orr     x1, x2, x1, lsl 16
        orr     w0, w1, w0, lsl 24
        ret

copy32:
        stp     x29, x30, [sp, -32]!
        mov     x1, x0
        mov     x2, 4
        mov     x29, sp
        add     x0, sp, 28
        bl      memcpy
        ldr     w0, [sp, 28]
        ldp     x29, x30, [sp], 32
        ret

And on ARM64 with -O2 -mstrict-align, shuffling via the stack is longer too:

bytes32:
        ldrb    w4, [x0]
        ldrb    w2, [x0, 1]
        ldrb    w1, [x0, 2]
        ldrb    w3, [x0, 3]
        orr     x2, x4, x2, lsl 8
        orr     x0, x2, x1, lsl 16
        orr     w0, w0, w3, lsl 24
        ret

copy32:
        sub     sp, sp, #16
        ldrb    w3, [x0]
        ldrb    w2, [x0, 1]
        ldrb    w1, [x0, 2]
        ldrb    w0, [x0, 3]
        strb    w3, [sp, 12]
        strb    w2, [sp, 13]
        strb    w1, [sp, 14]
        strb    w0, [sp, 15]
        ldr     w0, [sp, 12]
        add     sp, sp, 16
        ret

ARM64 with -mstrict-align might be a contrived example in practice though.


* [Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets
From: andrew at sifive dot com @ 2023-09-20 20:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

--- Comment #6 from Andrew Waterman <andrew at sifive dot com> ---
Ack, I misunderstood your earlier message.  You're of course right that the
load/load/shift/or sequence is preferable to the load/load/store/store/load
sequence on just about any practical implementation.  That the memcpy version
is optimized less well does seem to be a separate issue from the one Andrew
mentioned.


* [Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets
From: rguenth at gcc dot gnu.org @ 2023-09-21  6:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
You need to trace RTL expansion to see where it decides to use a temporary
stack slot; there's likely some instruction pattern missing to compose the
larger words.
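
A minimal way to start that tracing (my suggestion; test.c is assumed to
contain the copy32 example from comment #0):

riscv64-linux-gnu-gcc -march=rv64gc -O2 -S test.c \
    -fdump-tree-optimized -fdump-rtl-expand

The test.c.*t.optimized dump shows what GIMPLE hands to expansion, and the
test.c.*r.expand dump (the exact dump numbers vary) should be where the
temporary stack slot first appears.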

