[Bug tree-optimization/108322] New: Using __register parameter with -ftree-vectorize (default with -O2) results in massive code bloat

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/108322] New: Using __register parameter with -ftree-vectorize (default with -O2) results in massive code bloat
@ 2023-01-06 23:27 gerbilsoft at gerbilsoft dot com
  2023-01-06 23:28 ` [Bug tree-optimization/108322] " gerbilsoft at gerbilsoft dot com
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: gerbilsoft at gerbilsoft dot com @ 2023-01-06 23:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

            Bug ID: 108322
           Summary: Using __register parameter with -ftree-vectorize
                    (default with -O2) results in massive code bloat
           Product: gcc
           Version: 12.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gerbilsoft at gerbilsoft dot com
  Target Milestone: ---

While examining some code using the bloaty tool, I found that a function for
deinterleaving Super Magic Drive ROM images was taking up ~5 KB when it should
have been less than 1 KB. On examining the disassembly, there appeared to be a
lot of unnecessary instructions; compiling with clang and MSVC resulted in
significantly fewer instructions. Either removing __restrict from the function
parameters (two pointers), or specifying -fno-tree-vectorize to disable
auto-vectorization, fixes this issue with gcc-12.

The generated code isn't buggy as far as I can tell, and it benchmarks around
the same as the non-bloated version.

I've narrowed it down to the following minimal test case:

#include <stdint.h>
#define SMD_BLOCK_SIZE 16384

void decodeBlock_cpp(uint8_t *__restrict pDest, const uint8_t *__restrict pSrc)
{
        // First 8 KB of the source block is ODD bytes.
        const uint8_t *pSrc_end = pSrc + (SMD_BLOCK_SIZE / 2);
        for (uint8_t *pDest_odd = pDest + 1; pSrc < pSrc_end; pDest_odd += 2,
pSrc += 1) {
                pDest_odd[0] = pSrc[0];
        }
}

Assembly output with `g++ -O2 -fno-tree-vectorize` (or removing the __restrict
qualifiers):

decodeBlock_cpp(unsigned char*, unsigned char const*):
        xor     eax, eax
.L2:
        movzx   edx, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+1+rax*2], dl
        add     rax, 1
        cmp     rax, 8192
        jne     .L2
        ret

Assembly output with `g++ -O2` (implying -ftree-vectorize with gcc-12) and
__restrict qualifiers:

decodeBlock_cpp(unsigned char*, unsigned char const*):
        push    r15
        lea     rax, [rsi+8192]
        add     rdi, 1
        push    r14
        push    r13
        push    r12
        push    rbp
        push    rbx
        mov     QWORD PTR [rsp-8], rax
.L2:
        movzx   ecx, BYTE PTR [rsi+10]
        movzx   eax, BYTE PTR [rsi+14]
        add     rsi, 16
        add     rdi, 32
        movzx   edx, BYTE PTR [rsi-3]
        movzx   r15d, BYTE PTR [rsi-1]
        movzx   r11d, BYTE PTR [rsi-10]
        movzx   ebx, BYTE PTR [rsi-11]
        mov     BYTE PTR [rsp-11], cl
        movzx   ecx, BYTE PTR [rsi-16]
        movzx   ebp, BYTE PTR [rsi-12]
        mov     BYTE PTR [rsp-9], al
        movzx   r12d, BYTE PTR [rsi-13]
        movzx   eax, BYTE PTR [rsi-4]
        mov     BYTE PTR [rsp-10], dl
        movzx   r13d, BYTE PTR [rsi-14]
        movzx   edx, BYTE PTR [rsi-5]
        movzx   r14d, BYTE PTR [rsi-15]
        movzx   r8d, BYTE PTR [rsi-7]
        movzx   r9d, BYTE PTR [rsi-8]
        movzx   r10d, BYTE PTR [rsi-9]
        mov     BYTE PTR [rdi-32], cl
        movzx   ecx, BYTE PTR [rsp-11]
        mov     BYTE PTR [rdi-10], dl
        mov     BYTE PTR [rdi-30], r14b
        mov     BYTE PTR [rdi-28], r13b
        mov     BYTE PTR [rdi-26], r12b
        mov     BYTE PTR [rdi-24], bpl
        mov     BYTE PTR [rdi-22], bl
        mov     BYTE PTR [rdi-20], r11b
        mov     BYTE PTR [rdi-18], r10b
        mov     BYTE PTR [rdi-16], r9b
        mov     BYTE PTR [rdi-14], r8b
        mov     BYTE PTR [rdi-12], cl
        mov     BYTE PTR [rdi-8], al
        movzx   eax, BYTE PTR [rsp-9]
        movzx   edx, BYTE PTR [rsp-10]
        mov     BYTE PTR [rdi-2], r15b
        mov     BYTE PTR [rdi-4], al
        mov     rax, QWORD PTR [rsp-8]
        mov     BYTE PTR [rdi-6], dl
        cmp     rsi, rax
        jne     .L2
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        ret

$ gcc --version
gcc (Gentoo Hardened 12.2.1_p20221008 p1) 12.2.1 20221008
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug tree-optimization/108322] Using __register parameter with -ftree-vectorize (default with -O2) results in massive code bloat
  2023-01-06 23:27 [Bug tree-optimization/108322] New: Using __register parameter with -ftree-vectorize (default with -O2) results in massive code bloat gerbilsoft at gerbilsoft dot com
@ 2023-01-06 23:28 ` gerbilsoft at gerbilsoft dot com
  2023-01-06 23:35 ` [Bug target/108322] Using __restrict " pinskia at gcc dot gnu.org
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: gerbilsoft at gerbilsoft dot com @ 2023-01-06 23:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

--- Comment #1 from David Korth <gerbilsoft at gerbilsoft dot com> ---
Some quick testing with Compiler Explorer at godbolt.org shows that this
behavior started with gcc-8.1, and it doesn't happen with gcc-7.x or earlier.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/108322] Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat
  2023-01-06 23:27 [Bug tree-optimization/108322] New: Using __register parameter with -ftree-vectorize (default with -O2) results in massive code bloat gerbilsoft at gerbilsoft dot com
  2023-01-06 23:28 ` [Bug tree-optimization/108322] " gerbilsoft at gerbilsoft dot com
@ 2023-01-06 23:35 ` pinskia at gcc dot gnu.org
  2023-01-07  7:30 ` amonakov at gcc dot gnu.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-01-06 23:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-linux-gnu
          Component|tree-optimization           |target
           Keywords|                            |missed-optimization

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is a cost model issue with x86_64.

on aarch64, this is not vectorized unless you use -fno-vect-cost-model.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/108322] Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat
  2023-01-06 23:27 [Bug tree-optimization/108322] New: Using __register parameter with -ftree-vectorize (default with -O2) results in massive code bloat gerbilsoft at gerbilsoft dot com
  2023-01-06 23:28 ` [Bug tree-optimization/108322] " gerbilsoft at gerbilsoft dot com
  2023-01-06 23:35 ` [Bug target/108322] Using __restrict " pinskia at gcc dot gnu.org
@ 2023-01-07  7:30 ` amonakov at gcc dot gnu.org
  2023-01-10  7:58 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-01-07  7:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
With '-fdisable-tree-forwprop4 -msse4.1' you see what the vectorizer perhaps
wanted to achieve.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/108322] Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat
  2023-01-06 23:27 [Bug tree-optimization/108322] New: Using __register parameter with -ftree-vectorize (default with -O2) results in massive code bloat gerbilsoft at gerbilsoft dot com
                   ` (2 preceding siblings ...)
  2023-01-07  7:30 ` amonakov at gcc dot gnu.org
@ 2023-01-10  7:58 ` rguenth at gcc dot gnu.org
  2023-01-10  8:02 ` amonakov at gcc dot gnu.org
  2023-01-10  9:09 ` rguenther at suse dot de
  5 siblings, 0 replies; 7+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-01-10  7:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2023-01-10
             Status|UNCONFIRMED                 |NEW
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
The vectorizer vectorizes this with a strided store, costing

*pSrc_16 1 times unaligned_load (misalign -1) costs 12 in body
_1 16 times scalar_store costs 192 in body
_1 16 times vec_to_scalar costs 64 in body
t.c:8:44: note:  operating only on full vectors.
t.c:8:44: note:  Cost model analysis: 
  Vector inside of loop cost: 268
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 24
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 0

now later forwprop figures it can replace the element extracts from the
vector load with scalar loads which then results in effective unrolling
of the loop by a factor of 16.

The vectorizer misses the fact that w/o SSE 4.1 it cannot do efficient
lane extracts.  With SSE 4.1 and disabling the forwprop you'd get

.L3:
        movdqu  (%rsi), %xmm0
        addq    $16, %rsi
        addq    $32, %rax
        pextrb  $0, %xmm0, -32(%rax)
        pextrb  $1, %xmm0, -30(%rax)
        pextrb  $2, %xmm0, -28(%rax)
        pextrb  $3, %xmm0, -26(%rax)
        pextrb  $4, %xmm0, -24(%rax)
        pextrb  $5, %xmm0, -22(%rax)
        pextrb  $6, %xmm0, -20(%rax)
        pextrb  $7, %xmm0, -18(%rax)
        pextrb  $8, %xmm0, -16(%rax)
        pextrb  $9, %xmm0, -14(%rax)
        pextrb  $10, %xmm0, -12(%rax)
        pextrb  $11, %xmm0, -10(%rax)
        pextrb  $12, %xmm0, -8(%rax)
        pextrb  $13, %xmm0, -6(%rax)
        pextrb  $14, %xmm0, -4(%rax)
        pextrb  $15, %xmm0, -2(%rax)
        cmpq    %rdx, %rsi
        jne     .L3

which is what the vectorizer thinks is going to be generated.  But with
just SSE2 we are spilling to memory for the lane extract.

For the case at hand loading two vectors from the destination and then
punpck{h,l}bw and storing them again might be the most efficient thing
to do here.

On the cost model side 'vec_to_scalar' is ambiguous, the x86 backend
tries to compensate with

  /* If we do elementwise loads into a vector then we are bound by
     latency and execution resources for the many scalar loads 
     (AGU and load ports).  Try to account for this by scaling the
     construction cost by the number of elements involved.  */
  if ((kind == vec_construct || kind == vec_to_scalar)
      && stmt_info
      && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
          || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
      && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
    { 
      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
    }

but that doesn't trigger here because the step is constant two.

RTL expansion will eventually use the vec_extract optab and that succeeds
even for SSE2 by spilling, so it isn't useful to query support:

void
ix86_expand_vector_extract (bool mmx_ok, rtx target, rtx vec, int elt)
{   
...
  if (use_vec_extr)
    { 
...
    }
  else
    {
      rtx mem = assign_stack_temp (mode, GET_MODE_SIZE (mode));

      emit_move_insn (mem, vec);

      tmp = adjust_address (mem, inner_mode, elt*GET_MODE_SIZE (inner_mode));
      emit_move_insn (target, tmp);
    }
}

the fallback is eventually done by RTL expansion anyway.

Note fixing that and querying vec_extract support (the vectorizer doesn't
do that - it relies on expands fallback here but could do better costing
and also generate a single spill slot rather than one for each extract).

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/108322] Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat
  2023-01-06 23:27 [Bug tree-optimization/108322] New: Using __register parameter with -ftree-vectorize (default with -O2) results in massive code bloat gerbilsoft at gerbilsoft dot com
                   ` (3 preceding siblings ...)
  2023-01-10  7:58 ` rguenth at gcc dot gnu.org
@ 2023-01-10  8:02 ` amonakov at gcc dot gnu.org
  2023-01-10  9:09 ` rguenther at suse dot de
  5 siblings, 0 replies; 7+ messages in thread
From: amonakov at gcc dot gnu.org @ 2023-01-10  8:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> 
> For the case at hand loading two vectors from the destination and then
> punpck{h,l}bw and storing them again might be the most efficient thing
> to do here.

I think such read-modify-write on the destination introduces a data race for
bytes that are not accessed in the original program, so that would be okay only
under -fallow-store-data-races?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Bug target/108322] Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat
  2023-01-06 23:27 [Bug tree-optimization/108322] New: Using __register parameter with -ftree-vectorize (default with -O2) results in massive code bloat gerbilsoft at gerbilsoft dot com
                   ` (4 preceding siblings ...)
  2023-01-10  8:02 ` amonakov at gcc dot gnu.org
@ 2023-01-10  9:09 ` rguenther at suse dot de
  5 siblings, 0 replies; 7+ messages in thread
From: rguenther at suse dot de @ 2023-01-10  9:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

--- Comment #6 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 10 Jan 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322
> 
> --- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #4)
> > 
> > For the case at hand loading two vectors from the destination and then
> > punpck{h,l}bw and storing them again might be the most efficient thing
> > to do here.
> 
> I think such read-modify-write on the destination introduces a data race for
> bytes that are not accessed in the original program, so that would be okay only
> under -fallow-store-data-races?

Yes, obviously.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-01-10  9:09 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-06 23:27 [Bug tree-optimization/108322] New: Using __register parameter with -ftree-vectorize (default with -O2) results in massive code bloat gerbilsoft at gerbilsoft dot com
2023-01-06 23:28 ` [Bug tree-optimization/108322] " gerbilsoft at gerbilsoft dot com
2023-01-06 23:35 ` [Bug target/108322] Using __restrict " pinskia at gcc dot gnu.org
2023-01-07  7:30 ` amonakov at gcc dot gnu.org
2023-01-10  7:58 ` rguenth at gcc dot gnu.org
2023-01-10  8:02 ` amonakov at gcc dot gnu.org
2023-01-10  9:09 ` rguenther at suse dot de

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).