public inbox for gcc-bugs@sourceware.org
* [Bug tree-optimization/108322] New: Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat
From: gerbilsoft at gerbilsoft dot com @ 2023-01-06 23:27 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

            Bug ID: 108322
           Summary: Using __restrict parameter with -ftree-vectorize
                    (default with -O2) results in massive code bloat
           Product: gcc
           Version: 12.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gerbilsoft at gerbilsoft dot com
  Target Milestone: ---

While examining some code using the bloaty tool, I found that a function for
deinterleaving Super Magic Drive ROM images was taking up ~5 KB when it should
have been less than 1 KB. On examining the disassembly, there appeared to be a
lot of unnecessary instructions; compiling with clang and MSVC resulted in
significantly fewer instructions. Either removing __restrict from the function
parameters (two pointers), or specifying -fno-tree-vectorize to disable
auto-vectorization, fixes this issue with gcc-12.

The generated code isn't buggy as far as I can tell, and it benchmarks around
the same as the non-bloated version.

I've narrowed it down to the following minimal test case:

#include <stdint.h>
#define SMD_BLOCK_SIZE 16384

void decodeBlock_cpp(uint8_t *__restrict pDest, const uint8_t *__restrict pSrc)
{
        // First 8 KB of the source block is ODD bytes.
        const uint8_t *pSrc_end = pSrc + (SMD_BLOCK_SIZE / 2);
        for (uint8_t *pDest_odd = pDest + 1; pSrc < pSrc_end;
             pDest_odd += 2, pSrc += 1) {
                pDest_odd[0] = pSrc[0];
        }
}
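
To back up the observation that the output is not miscompiled, a small check along the following lines can be built both with and without -fno-tree-vectorize. This is only a sketch: the checkDecodeBlock() helper, the fill pattern, and the sentinel value are assumptions made for illustration and are not part of the original code base.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <vector>

#define SMD_BLOCK_SIZE 16384

// Function under test, copied from the test case above.
void decodeBlock_cpp(uint8_t *__restrict pDest, const uint8_t *__restrict pSrc)
{
        // First 8 KB of the source block is ODD bytes.
        const uint8_t *pSrc_end = pSrc + (SMD_BLOCK_SIZE / 2);
        for (uint8_t *pDest_odd = pDest + 1; pSrc < pSrc_end;
             pDest_odd += 2, pSrc += 1) {
                pDest_odd[0] = pSrc[0];
        }
}

// Hypothetical helper (not in the original report): returns true if every
// odd destination byte equals the matching source byte and every even
// destination byte still holds the sentinel it was filled with.
bool checkDecodeBlock()
{
        const uint8_t sentinel = 0xAA;  // arbitrary marker value (assumption)
        std::vector<uint8_t> src(SMD_BLOCK_SIZE / 2);
        std::vector<uint8_t> dest(SMD_BLOCK_SIZE, sentinel);

        // Arbitrary, non-constant fill pattern for the source half-block.
        for (size_t i = 0; i < src.size(); i++) {
                src[i] = static_cast<uint8_t>(i * 7 + 3);
        }

        decodeBlock_cpp(dest.data(), src.data());

        for (size_t i = 0; i < src.size(); i++) {
                if (dest[2 * i + 1] != src[i]) return false;  // odd byte copied
                if (dest[2 * i] != sentinel) return false;    // even byte untouched
        }
        return true;
}
```

Calling assert(checkDecodeBlock()); from main() in each build should pass in both cases if the two code paths really are equivalent.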

Assembly output with `g++ -O2 -fno-tree-vectorize` (or removing the __restrict
qualifiers):

decodeBlock_cpp(unsigned char*, unsigned char const*):
        xor     eax, eax
.L2:
        movzx   edx, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+1+rax*2], dl
        add     rax, 1
        cmp     rax, 8192
        jne     .L2
        ret
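
If passing -fno-tree-vectorize for the whole translation unit is too broad, the same effect can probably be scoped to this one function. The sketch below assumes GCC's `optimize` function attribute, where a string that does not start with "O" is treated as an -f option; the NO_TREE_VECTORIZE macro and the decodeBlock_novec and spotCheck_novec names are made up for illustration. I have not confirmed that it yields exactly the listing above, and GCC's documentation cautions that this attribute is intended for debugging rather than production use.

```cpp
#include <assert.h>
#include <stdint.h>

#define SMD_BLOCK_SIZE 16384

// Assumption: only GCC proper gets the attribute; other compilers that
// define __GNUC__ (such as clang) may not accept it, so it expands to nothing.
#if defined(__GNUC__) && !defined(__clang__)
#  define NO_TREE_VECTORIZE __attribute__((optimize("no-tree-vectorize")))
#else
#  define NO_TREE_VECTORIZE
#endif

// Same body as the test case, with auto-vectorization disabled per function.
NO_TREE_VECTORIZE
void decodeBlock_novec(uint8_t *__restrict pDest, const uint8_t *__restrict pSrc)
{
        const uint8_t *pSrc_end = pSrc + (SMD_BLOCK_SIZE / 2);
        for (uint8_t *pDest_odd = pDest + 1; pSrc < pSrc_end;
             pDest_odd += 2, pSrc += 1) {
                pDest_odd[0] = pSrc[0];
        }
}

// Hypothetical spot check: behavior should be unchanged by the attribute.
void spotCheck_novec()
{
        static uint8_t src[SMD_BLOCK_SIZE / 2];
        static uint8_t dest[SMD_BLOCK_SIZE];  // static storage, starts zeroed
        for (int i = 0; i < SMD_BLOCK_SIZE / 2; i++) {
                src[i] = static_cast<uint8_t>(i + 1);
        }
        decodeBlock_novec(dest, src);
        assert(dest[0] == 0);  // even bytes are never written
        assert(dest[1] == 1);  // src[0] lands at dest[1]
        assert(dest[3] == 2);  // src[1] lands at dest[3]
}
```

This keeps __restrict on the parameters and leaves vectorization enabled for the rest of the file.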

Assembly output with `g++ -O2` (implying -ftree-vectorize with gcc-12) and
__restrict qualifiers:

decodeBlock_cpp(unsigned char*, unsigned char const*):
        push    r15
        lea     rax, [rsi+8192]
        add     rdi, 1
        push    r14
        push    r13
        push    r12
        push    rbp
        push    rbx
        mov     QWORD PTR [rsp-8], rax
.L2:
        movzx   ecx, BYTE PTR [rsi+10]
        movzx   eax, BYTE PTR [rsi+14]
        add     rsi, 16
        add     rdi, 32
        movzx   edx, BYTE PTR [rsi-3]
        movzx   r15d, BYTE PTR [rsi-1]
        movzx   r11d, BYTE PTR [rsi-10]
        movzx   ebx, BYTE PTR [rsi-11]
        mov     BYTE PTR [rsp-11], cl
        movzx   ecx, BYTE PTR [rsi-16]
        movzx   ebp, BYTE PTR [rsi-12]
        mov     BYTE PTR [rsp-9], al
        movzx   r12d, BYTE PTR [rsi-13]
        movzx   eax, BYTE PTR [rsi-4]
        mov     BYTE PTR [rsp-10], dl
        movzx   r13d, BYTE PTR [rsi-14]
        movzx   edx, BYTE PTR [rsi-5]
        movzx   r14d, BYTE PTR [rsi-15]
        movzx   r8d, BYTE PTR [rsi-7]
        movzx   r9d, BYTE PTR [rsi-8]
        movzx   r10d, BYTE PTR [rsi-9]
        mov     BYTE PTR [rdi-32], cl
        movzx   ecx, BYTE PTR [rsp-11]
        mov     BYTE PTR [rdi-10], dl
        mov     BYTE PTR [rdi-30], r14b
        mov     BYTE PTR [rdi-28], r13b
        mov     BYTE PTR [rdi-26], r12b
        mov     BYTE PTR [rdi-24], bpl
        mov     BYTE PTR [rdi-22], bl
        mov     BYTE PTR [rdi-20], r11b
        mov     BYTE PTR [rdi-18], r10b
        mov     BYTE PTR [rdi-16], r9b
        mov     BYTE PTR [rdi-14], r8b
        mov     BYTE PTR [rdi-12], cl
        mov     BYTE PTR [rdi-8], al
        movzx   eax, BYTE PTR [rsp-9]
        movzx   edx, BYTE PTR [rsp-10]
        mov     BYTE PTR [rdi-2], r15b
        mov     BYTE PTR [rdi-4], al
        mov     rax, QWORD PTR [rsp-8]
        mov     BYTE PTR [rdi-6], dl
        cmp     rsi, rax
        jne     .L2
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        ret

$ gcc --version
gcc (Gentoo Hardened 12.2.1_p20221008 p1) 12.2.1 20221008
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


