public inbox for gcc-bugs@sourceware.org
* [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE
@ 2023-02-08 19:17 gbs at canishe dot com
  2023-02-08 19:30 ` [Bug tree-optimization/108724] " pinskia at gcc dot gnu.org
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: gbs at canishe dot com @ 2023-02-08 19:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

            Bug ID: 108724
           Summary: [11 regression] Poor codegen when summing two arrays
                    without AVX or SSE
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gbs at canishe dot com
  Target Milestone: ---

This program:

void foo(int *a, const int *__restrict b, const int *__restrict c)
{
  for (int i = 0; i < 16; i++) {
    a[i] = b[i] + c[i];
  }
}


When compiled for x86-64 by GCC 11.1+ with -O3 -mno-avx -mno-sse, it produces:

foo:
        movq    %rdx, %rax
        subq    $8, %rsp
        movl    (%rsi), %edx
        movq    %rsi, %rcx
        addl    (%rax), %edx
        movl    4(%rax), %esi
        movq    $0, (%rsp)
        movl    %edx, (%rsp)
        movq    (%rsp), %rdx
        addl    4(%rcx), %esi
        movq    %rdx, -8(%rsp)
        movl    %esi, -4(%rsp)
        movq    -8(%rsp), %rdx
        movq    %rdx, (%rdi)
        movl    8(%rax), %edx
        addl    8(%rcx), %edx
        movq    $0, -16(%rsp)
        movl    %edx, -16(%rsp)
        movq    -16(%rsp), %rdx
        movl    12(%rcx), %esi
        addl    12(%rax), %esi
        movq    %rdx, -24(%rsp)
        movl    %esi, -20(%rsp)
        movq    -24(%rsp), %rdx
        movq    %rdx, 8(%rdi)
        [snip more of the same]
        movl    48(%rcx), %edx
        movq    $0, -96(%rsp)
        addl    48(%rax), %edx
        movl    %edx, -96(%rsp)
        movq    -96(%rsp), %rdx
        movl    52(%rcx), %esi
        addl    52(%rax), %esi
        movq    %rdx, -104(%rsp)
        movl    %esi, -100(%rsp)
        movq    -104(%rsp), %rdx
        movq    %rdx, 48(%rdi)
        movl    56(%rcx), %edx
        movq    $0, -112(%rsp)
        addl    56(%rax), %edx
        movl    %edx, -112(%rsp)
        movq    -112(%rsp), %rdx
        movl    60(%rcx), %ecx
        addl    60(%rax), %ecx
        movq    %rdx, -120(%rsp)
        movl    %ecx, -116(%rsp)
        movq    -120(%rsp), %rdx
        movq    %rdx, 56(%rdi)
        addq    $8, %rsp
        ret

(Godbolt link: https://godbolt.org/z/qq9dbP8ed)

This is bizarre - it's storing intermediate results on the stack, instead of
keeping them in registers or writing them directly to *a, which is bound to be
slow. (GCC 10.4, and Clang, produce more or less what I would expect, using
only the provided arrays and a register.) I haven't done any benchmarking
myself, but Jonathan Wakely's results (on list:
https://gcc.gnu.org/pipermail/gcc-help/2023-February/142181.html) seem to bear
this out.

From a bisect, this behavior seems to have been introduced by commit
33c0f246f799b7403171e97f31276a8feddd05c9 (tree-optimization/97626 - handle SCCs
properly in SLP stmt analysis) from Oct 2020, and persists into GCC trunk.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug tree-optimization/108724] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
@ 2023-02-08 19:30 ` pinskia at gcc dot gnu.org
  2023-02-09  9:37 ` crazylht at gmail dot com
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: pinskia at gcc dot gnu.org @ 2023-02-08 19:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
            Summary|[11 regression] Poor        |Poor codegen when summing
                   |codegen when summing two    |two arrays without AVX or
                   |arrays without AVX or SSE   |SSE
          Component|target                      |tree-optimization

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I noticed that on aarch64, with -mgeneral-regs-only, the problem of attempting
this vectorization has been present since at least GCC 5 ...


* [Bug tree-optimization/108724] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
  2023-02-08 19:30 ` [Bug tree-optimization/108724] " pinskia at gcc dot gnu.org
@ 2023-02-09  9:37 ` crazylht at gmail dot com
  2023-02-09 13:54 ` rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: crazylht at gmail dot com @ 2023-02-09  9:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Hongtao.liu <crazylht at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #2 from Hongtao.liu <crazylht at gmail dot com> ---
Guess it's related to

bool
vect_can_vectorize_without_simd_p (tree_code code)
{
  switch (code)
    {
    case PLUS_EXPR:
    case MINUS_EXPR:
    case NEGATE_EXPR:
    case BIT_AND_EXPR:
    case BIT_IOR_EXPR:
    case BIT_XOR_EXPR:
    case BIT_NOT_EXPR:
      return true;

    default:
      return false;
    }
}

so the vectorizer will still vectorize even without SIMD instructions.


* [Bug tree-optimization/108724] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
  2023-02-08 19:30 ` [Bug tree-optimization/108724] " pinskia at gcc dot gnu.org
  2023-02-09  9:37 ` crazylht at gmail dot com
@ 2023-02-09 13:54 ` rguenth at gcc dot gnu.org
  2023-02-10 10:00 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-02-09 13:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2023-02-09

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Adding -fopt-info shows

t.c:3:21: optimized: loop vectorized using 8 byte vectors
t.c:1:6: optimized: loop with 7 iterations completely unrolled (header
execution count 63136016)

disabling unrolling instead shows

.L2:
        leaq    (%rsi,%rax), %r8
        leaq    (%rdx,%rax), %rdi
        movl    (%r8), %ecx
        addl    (%rdi), %ecx
        movq    %r10, -8(%rsp)
        movl    %ecx, -8(%rsp)
        movq    -8(%rsp), %rcx
        movl    4(%rdi), %edi
        addl    4(%r8), %edi
        movq    %rcx, -16(%rsp)
        movl    %edi, -12(%rsp)
        movq    -16(%rsp), %rcx
        movq    %rcx, (%r9,%rax)
        addq    $8, %rax
        cmpq    $64, %rax
        jne     .L2

and what happens is that vector lowering fails to perform generic vector
addition (vector lowering is supposed to materialize that), but instead
decomposes the vector, doing scalar adds, which eventually results in
us spilling ...

The reason is that vector lowering does

/* Expand a vector operation to scalars; for integer types we can use
   special bit twiddling tricks to do the sums a word at a time, using
   function F_PARALLEL instead of F.  These tricks are done only if
   they can process at least four items, that is, only if the vector
   holds at least four items and if a word can hold four items.  */
static tree
expand_vector_addition (gimple_stmt_iterator *gsi,
                        elem_op_func f, elem_op_func f_parallel,
                        tree type, tree a, tree b, enum tree_code code)
{
  int parts_per_word = BITS_PER_WORD / vector_element_bits (type);

  if (INTEGRAL_TYPE_P (TREE_TYPE (type))
      && parts_per_word >= 4
      && nunits_for_known_piecewise_op (type) >= 4)
    return expand_vector_parallel (gsi, f_parallel,
                                   type, a, b, code);
  else
    return expand_vector_piecewise (gsi, f,
                                    type, TREE_TYPE (type),
                                    a, b, code, false);
}

so it only treats >= 4 elements as profitable to vectorize this way, but the
vectorizer doesn't seem to know that; it instead applies its own cost model
here, while vector lowering doesn't have any.


* [Bug tree-optimization/108724] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
                   ` (2 preceding siblings ...)
  2023-02-09 13:54 ` rguenth at gcc dot gnu.org
@ 2023-02-10 10:00 ` rguenth at gcc dot gnu.org
  2023-02-10 10:07 ` [Bug tree-optimization/108724] [11/12/13 Regression] " rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-02-10 10:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=101801

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the vectorizer thinks that

foo:
.LFB0:
        .cfi_startproc
        movabsq $9223372034707292159, %rax
        movq    %rdi, %rcx
        movq    (%rdx), %r10
        movq    (%rsi), %rdi
        movq    %rax, %r8
        movq    %rax, %r9
        movq    %rax, %r11
        andq    %r10, %r8
        andq    %rdi, %r9
        addq    %r8, %r9
        movq    %rdi, %r8
        movabsq $-9223372034707292160, %rdi
        xorq    %r10, %r8
        movq    8(%rdx), %r10
        andq    %rdi, %r8
        xorq    %r9, %r8
        movq    %rax, %r9
        movq    %r8, (%rcx)
        movq    8(%rsi), %r8
        andq    %r10, %r9
        andq    %r8, %r11
        xorq    %r10, %r8
        movq    16(%rdx), %r10
        addq    %r11, %r9
        andq    %rdi, %r8
        movq    %rax, %r11
        xorq    %r9, %r8
        movq    %rax, %r9
        andq    %r10, %r11
        movq    %r8, 8(%rcx)
        movq    16(%rsi), %r8
        andq    %r8, %r9
        xorq    %r10, %r8
        movq    24(%rdx), %r10
        addq    %r11, %r9
        andq    %rdi, %r8
        movq    %rax, %r11
        xorq    %r9, %r8
        movq    %rax, %r9
        andq    %r10, %r11
        movq    %r8, 16(%rcx)
        movq    24(%rsi), %r8
        andq    %r8, %r9
        xorq    %r10, %r8
        movq    32(%rdx), %r10
        addq    %r11, %r9
        andq    %rdi, %r8
        movq    %rax, %r11
        xorq    %r9, %r8
        movq    %rax, %r9
        andq    %r10, %r11
        movq    %r8, 24(%rcx)
        movq    32(%rsi), %r8
        andq    %r8, %r9
        xorq    %r10, %r8
        movq    40(%rdx), %r10
        addq    %r11, %r9
        andq    %rdi, %r8
        movq    %rax, %r11
        xorq    %r9, %r8
        movq    %rax, %r9
        andq    %r10, %r11
        movq    %r8, 32(%rcx)
        movq    40(%rsi), %r8
        andq    %r8, %r9
        addq    %r11, %r9
        xorq    %r10, %r8
        movq    48(%rsi), %r10
        movq    %rax, %r11
        andq    %rdi, %r8
        xorq    %r9, %r8
        movq    %rax, %r9
        andq    %r10, %r11
        movq    %r8, 40(%rcx)
        movq    48(%rdx), %r8
        movq    56(%rdx), %rdx
        andq    %r8, %r9
        xorq    %r10, %r8
        addq    %r11, %r9
        andq    %rdi, %r8
        xorq    %r9, %r8
        movq    %r8, 48(%rcx)
        movq    56(%rsi), %r8
        movq    %rax, %rsi
        andq    %rdx, %rsi
        andq    %r8, %rax
        xorq    %r8, %rdx
        addq    %rsi, %rax
        andq    %rdi, %rdx
        xorq    %rdx, %rax
        movq    %rax, 56(%rcx)
        ret

will be faster than when not vectorizing.  Not vectorizing produces

foo:
.LFB0:
        .cfi_startproc
        movq    %rsi, %rcx
        movl    (%rsi), %esi
        addl    (%rdx), %esi
        movl    %esi, (%rdi)
        movl    4(%rdx), %esi
        addl    4(%rcx), %esi
        movl    %esi, 4(%rdi)
        movl    8(%rdx), %esi
        addl    8(%rcx), %esi
        movl    %esi, 8(%rdi)
        movl    12(%rdx), %esi
        addl    12(%rcx), %esi
        movl    %esi, 12(%rdi)
        movl    16(%rdx), %esi
        addl    16(%rcx), %esi
        movl    %esi, 16(%rdi)
        movl    20(%rdx), %esi
        addl    20(%rcx), %esi
        movl    %esi, 20(%rdi)
        movl    24(%rdx), %esi
        addl    24(%rcx), %esi
        movl    %esi, 24(%rdi)
        movl    28(%rdx), %esi
        addl    28(%rcx), %esi
        movl    %esi, 28(%rdi)
        movl    32(%rdx), %esi
        addl    32(%rcx), %esi
        movl    %esi, 32(%rdi)
        movl    36(%rdx), %esi
        addl    36(%rcx), %esi
        movl    %esi, 36(%rdi)
        movl    40(%rdx), %esi
        addl    40(%rcx), %esi
        movl    %esi, 40(%rdi)
        movl    44(%rdx), %esi
        addl    44(%rcx), %esi
        movl    %esi, 44(%rdi)
        movl    48(%rdx), %esi
        addl    48(%rcx), %esi
        movl    %esi, 48(%rdi)
        movl    52(%rdx), %esi
        addl    52(%rcx), %esi
        movl    %esi, 52(%rdi)
        movl    56(%rdx), %esi
        movl    60(%rdx), %edx
        addl    56(%rcx), %esi
        addl    60(%rcx), %edx
        movl    %esi, 56(%rdi)
        movl    %edx, 60(%rdi)
        ret

The vectorizer produces un-lowered vector adds, which is good when follow-up
optimizations are possible (the ops are not obfuscated), but also bad
because unrolling then estimates the size incorrectly.  Costs go

*_3 1 times scalar_load costs 12 in prologue
*_5 1 times scalar_load costs 12 in prologue 
_4 + _6 1 times scalar_stmt costs 4 in prologue
_8 1 times scalar_store costs 12 in prologue 

vs

*_3 1 times unaligned_load (misalign -1) costs 12 in body
*_5 1 times unaligned_load (misalign -1) costs 12 in body
_4 + _6 1 times vector_stmt costs 4 in body
_4 + _6 5 times scalar_stmt costs 20 in body
_8 1 times unaligned_store (misalign -1) costs 12 in body

and as usual the savings from the wide loads and store outweigh the
loss on the arithmetic side, but it's a close call:

  Vector inside of loop cost: 60
  Scalar iteration cost: 40

so 2*40 > 60 and the vectorization is profitable.

The easiest fix is to avoid vectorizing; another possibility is to adhere
to the vectorizer's decision and expand to the lowered sequence immediately
from within the vectorizer itself.  The original goal was to remove
this hard cap on the number of elements, but this bug shows that work was
incomplete.

I'm going to re-instantiate the hard cap and revisit for GCC 14.


* [Bug tree-optimization/108724] [11/12/13 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
                   ` (3 preceding siblings ...)
  2023-02-10 10:00 ` rguenth at gcc dot gnu.org
@ 2023-02-10 10:07 ` rguenth at gcc dot gnu.org
  2023-02-10 11:22 ` cvs-commit at gcc dot gnu.org
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-02-10 10:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
            Summary|Poor codegen when summing   |[11/12/13 Regression] Poor
                   |two arrays without AVX or   |codegen when summing two
                   |SSE                         |arrays without AVX or SSE
   Target Milestone|---                         |11.4


* [Bug tree-optimization/108724] [11/12/13 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
                   ` (4 preceding siblings ...)
  2023-02-10 10:07 ` [Bug tree-optimization/108724] [11/12/13 Regression] " rguenth at gcc dot gnu.org
@ 2023-02-10 11:22 ` cvs-commit at gcc dot gnu.org
  2023-02-10 11:22 ` [Bug tree-optimization/108724] [11/12 " rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-02-10 11:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

--- Comment #5 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:dc87e1391c55c666c7ff39d4f0dea87666f25468

commit r13-5771-gdc87e1391c55c666c7ff39d4f0dea87666f25468
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Feb 10 11:07:30 2023 +0100

    tree-optimization/108724 - vectorized code getting piecewise expanded

    This fixes an oversight to when removing the hard limits on using
    generic vectors for the vectorizer to enable both SLP and BB
    vectorization to use those.  The vectorizer relies on vector lowering
    to expand plus, minus and negate to bit operations but vector
    lowering has a hard limit on the minimum number of elements per
    work item.  Vectorizer costs for the testcase at hand work out
    to vectorize a loop with just two work items per vector and that
    causes element wise expansion and spilling.

    The fix for now is to re-instantiate the hard limit, matching what
    vector lowering does.  For the future the way to go is to emit the
    lowered sequence directly from the vectorizer instead.

            PR tree-optimization/108724
            * tree-vect-stmts.cc (vectorizable_operation): Avoid
            using word_mode vectors when vector lowering will
            decompose them to elementwise operations.

            * gcc.target/i386/pr108724.c: New testcase.


* [Bug tree-optimization/108724] [11/12 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
                   ` (5 preceding siblings ...)
  2023-02-10 11:22 ` cvs-commit at gcc dot gnu.org
@ 2023-02-10 11:22 ` rguenth at gcc dot gnu.org
  2023-03-15  9:48 ` cvs-commit at gcc dot gnu.org
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-02-10 11:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |13.0
            Summary|[11/12/13 Regression] Poor  |[11/12 Regression] Poor
                   |codegen when summing two    |codegen when summing two
                   |arrays without AVX or SSE   |arrays without AVX or SSE

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Fixed on trunk so far.


* [Bug tree-optimization/108724] [11/12 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
                   ` (6 preceding siblings ...)
  2023-02-10 11:22 ` [Bug tree-optimization/108724] [11/12 " rguenth at gcc dot gnu.org
@ 2023-03-15  9:48 ` cvs-commit at gcc dot gnu.org
  2023-05-05  8:34 ` [Bug tree-optimization/108724] [11 " rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-03-15  9:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

--- Comment #7 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-12 branch has been updated by Richard Biener
<rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:21e7145aaf582c263e69a3ee05dfa9d42bdbd1aa

commit r12-9258-g21e7145aaf582c263e69a3ee05dfa9d42bdbd1aa
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Feb 10 11:07:30 2023 +0100

    tree-optimization/108724 - vectorized code getting piecewise expanded

    This fixes an oversight to when removing the hard limits on using
    generic vectors for the vectorizer to enable both SLP and BB
    vectorization to use those.  The vectorizer relies on vector lowering
    to expand plus, minus and negate to bit operations but vector
    lowering has a hard limit on the minimum number of elements per
    work item.  Vectorizer costs for the testcase at hand work out
    to vectorize a loop with just two work items per vector and that
    causes element wise expansion and spilling.

    The fix for now is to re-instantiate the hard limit, matching what
    vector lowering does.  For the future the way to go is to emit the
    lowered sequence directly from the vectorizer instead.

            PR tree-optimization/108724
            * tree-vect-stmts.cc (vectorizable_operation): Avoid
            using word_mode vectors when vector lowering will
            decompose them to elementwise operations.

            * gcc.target/i386/pr108724.c: New testcase.

    (cherry picked from commit dc87e1391c55c666c7ff39d4f0dea87666f25468)


* [Bug tree-optimization/108724] [11 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
                   ` (7 preceding siblings ...)
  2023-03-15  9:48 ` cvs-commit at gcc dot gnu.org
@ 2023-05-05  8:34 ` rguenth at gcc dot gnu.org
  2023-05-05 12:06 ` [Bug target/108724] " rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-05-05  8:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that the GCC 11 branch isn't affected by what's fixed on trunk and GCC 12;
the 11 branch still uses vect_worthwhile_without_simd_p, which checks
vect_min_worthwhile_factor.

Instead, the GCC 11 branch correctly doesn't vectorize the add, but it
vectorizes the stores only, which results in stack spilling to build
the DImode values we then store.  We're also doing this in odd ways.

In GCC 13/12 we have fixed this in the costing:

node 0x4156628 1 times vec_construct costs 100 in prologue

while the GCC 11 branch costs this as 8.  Not sure where this happens in
the target.

Hmm, I think this happens in error.


* [Bug target/108724] [11 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
                   ` (8 preceding siblings ...)
  2023-05-05  8:34 ` [Bug tree-optimization/108724] [11 " rguenth at gcc dot gnu.org
@ 2023-05-05 12:06 ` rguenth at gcc dot gnu.org
  2023-05-23 12:55 ` rguenth at gcc dot gnu.org
  2023-05-29 10:08 ` jakub at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-05-05 12:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target
             Target|                            |x86_64-*-* i?86-*-*

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
And the remaining issue with GCC 11 would be that we fail to account for the
GPR -> XMM move.  Or the remaining issue for _all_ branches is that we fail
to realize that emulated "vector" CTORs are even more expensive since we lack
a good way to materialize the CTOR in a GPR (generic RTL expansion fails to
consider using shift + and for example).

Not sure what a good expansion of a V2SImode, V4HImode or V8QImode
CTOR to a GPR DImode reg would look like.


* [Bug target/108724] [11 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
                   ` (9 preceding siblings ...)
  2023-05-05 12:06 ` [Bug target/108724] " rguenth at gcc dot gnu.org
@ 2023-05-23 12:55 ` rguenth at gcc dot gnu.org
  2023-05-29 10:08 ` jakub at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-05-23 12:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
On trunk we're back to vectorizing, but as intended with DImode, which makes us
save half of the loads and stores; we think that makes up for the extra
arithmetic required (by quite some margin).

        movabsq $9223372034707292159, %rcx
        movq    (%rdx), %rax
        movq    (%rsi), %rsi
        movq    %rcx, %rdx
        andq    %rax, %rdx
        andq    %rsi, %rcx
        xorq    %rsi, %rax
        addq    %rcx, %rdx
        movabsq $-9223372034707292160, %rcx
        andq    %rcx, %rax
        xorq    %rdx, %rax
        movq    %rax, (%rdi)

vs

        movl    (%rdx), %eax
        addl    (%rsi), %eax
        movl    %eax, (%rdi)
        movl    4(%rdx), %eax
        addl    4(%rsi), %eax
        movl    %eax, 4(%rdi)


* [Bug target/108724] [11 Regression] Poor codegen when summing two arrays without AVX or SSE
  2023-02-08 19:17 [Bug target/108724] New: [11 regression] Poor codegen when summing two arrays without AVX or SSE gbs at canishe dot com
                   ` (10 preceding siblings ...)
  2023-05-23 12:55 ` rguenth at gcc dot gnu.org
@ 2023-05-29 10:08 ` jakub at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-05-29 10:08 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|11.4                        |11.5

--- Comment #11 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC 11.4 is being released, retargeting bugs to GCC 11.5.

