public inbox for gcc-bugs@sourceware.org
* [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers
@ 2021-04-05  1:53 schnetter at gmail dot com
  2021-04-05  1:54 ` [Bug target/99912] " schnetter at gmail dot com
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: schnetter at gmail dot com @ 2021-04-05  1:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

            Bug ID: 99912
           Summary: Unnecessary / inefficient spilling of AVX2 ymm
                    registers
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: schnetter at gmail dot com
  Target Milestone: ---

Created attachment 50507
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50507&action=edit
Compressed preprocessed source code

I am using "g++ (Spack GCC) 11.0.1 20210404 (experimental)" (a fresh checkout) on
macOS 11.2.3 with an x86-64 Skylake CPU.

I am manually SIMD-vectorizing a loop kernel using AVX2 intrinsics. The
generated code is correct, but has obvious inefficiencies. I find these issues:

1. There are spills (?) of AVX2 ymm registers that are overwritten by another
spill a few instructions later, without being read in the meantime.

2. The same register is spilled into multiple stack slots in consecutive
instructions.

3. After spilling a ymm register, the stack slot is copied to another stack
slot using xmm registers (i.e. using two loads/stores).

I tried to reproduce the issue in a small example, but failed. If this issue is
really due to spilling, then it might not be possible to have a small test
case.



Here is an example of issues 1 and 2; I show a few lines from the attached
disassembled file to clarify:
{{{
    1520: c5 fd 29 8c 24 a0 24 00 00    vmovapd %ymm1, 9376(%rsp)
    1529: c5 fd 29 8c 24 20 29 00 00    vmovapd %ymm1, 10528(%rsp)
    1532: c5 fd 29 b4 24 80 28 00 00    vmovapd %ymm6, 10368(%rsp)
    153b: c5 fd 29 ac 24 a0 28 00 00    vmovapd %ymm5, 10400(%rsp)
    1544: c5 fd 29 a4 24 c0 28 00 00    vmovapd %ymm4, 10432(%rsp)
    154d: c5 fd 29 9c 24 e0 28 00 00    vmovapd %ymm3, 10464(%rsp)
    1556: c5 fd 29 94 24 00 29 00 00    vmovapd %ymm2, 10496(%rsp)
    155f: c4 a2 1d 2d 34 30             vmaskmovpd      (%rax,%r14), %ymm12, %ymm6
    1565: 48 8b 84 24 00 05 00 00       movq    1280(%rsp), %rax
    156d: c5 fd 29 b4 24 00 24 00 00    vmovapd %ymm6, 9216(%rsp)
    1576: c4 a2 1d 2d 2c 30             vmaskmovpd      (%rax,%r14), %ymm12, %ymm5
    157c: 48 8b 84 24 38 07 00 00       movq    1848(%rsp), %rax
    1584: c5 fd 29 ac 24 20 24 00 00    vmovapd %ymm5, 9248(%rsp)
    158d: c4 a2 1d 2d 24 30             vmaskmovpd      (%rax,%r14), %ymm12, %ymm4
    1593: 48 8b 84 24 60 04 00 00       movq    1120(%rsp), %rax
    159b: c5 fd 29 a4 24 40 24 00 00    vmovapd %ymm4, 9280(%rsp)
    15a4: c4 a2 1d 2d 1c 30             vmaskmovpd      (%rax,%r14), %ymm12, %ymm3
    15aa: 48 8b 84 24 68 04 00 00       movq    1128(%rsp), %rax
    15b2: c5 fd 29 9c 24 60 24 00 00    vmovapd %ymm3, 9312(%rsp)
    15bb: c4 a2 1d 2d 14 30             vmaskmovpd      (%rax,%r14), %ymm12, %ymm2
    15c1: c5 fd 29 94 24 80 24 00 00    vmovapd %ymm2, 9344(%rsp)
    15ca: 48 8b 84 24 08 05 00 00       movq    1288(%rsp), %rax
    15d2: c4 a2 1d 2d 0c 30             vmaskmovpd      (%rax,%r14), %ymm12, %ymm1
    15d8: 48 8b 84 24 70 04 00 00       movq    1136(%rsp), %rax
    15e0: c5 fd 29 8c 24 a0 24 00 00    vmovapd %ymm1, 9376(%rsp)
    15e9: c5 fd 29 b4 24 40 29 00 00    vmovapd %ymm6, 10560(%rsp)
    15f2: c5 fd 29 ac 24 60 29 00 00    vmovapd %ymm5, 10592(%rsp)
    15fb: c5 fd 29 a4 24 80 29 00 00    vmovapd %ymm4, 10624(%rsp)
    1604: c5 fd 29 9c 24 a0 29 00 00    vmovapd %ymm3, 10656(%rsp)
    160d: c5 fd 29 94 24 c0 29 00 00    vmovapd %ymm2, 10688(%rsp)
    1616: c5 fd 29 8c 24 e0 29 00 00    vmovapd %ymm1, 10720(%rsp)
}}}

The beginning and end of this sample are what I think might be spill
instructions. The instruction at 1520 writes to 9376(%rsp), and the instruction
at 15e0 overwrites this stack slot. Also, the register %ymm1 is written
multiple times to different stack slots. (That by itself could be fine, but it
looks strange.)

A few instructions later I find this code:
{{{
    16d7: c5 79 6f 84 24 80 28 00 00    vmovdqa 10368(%rsp), %xmm8
    16e0: c5 79 6f ac 24 20 29 00 00    vmovdqa 10528(%rsp), %xmm13
    16e9: c5 79 7f 84 24 e0 19 00 00    vmovdqa %xmm8, 6624(%rsp)
    16f2: c5 79 6f 84 24 90 28 00 00    vmovdqa 10384(%rsp), %xmm8
    16fb: c5 79 7f ac 24 80 1a 00 00    vmovdqa %xmm13, 6784(%rsp)
    1704: c5 79 7f 84 24 f0 19 00 00    vmovdqa %xmm8, 6640(%rsp)
}}}
This copies the 32 bytes at 10368(%rsp) (written above), but uses %xmm8 to copy
the stack slot in 16-byte chunks. This shouldn't happen; there is no reason to
copy from one stack slot to another (presumably, since I know the code, but I
could be mistaken here). There is also no reason to copy in 16-byte chunks.
(All relevant local variables are ultimately of type __m256d, wrapped in C++
structs, and should thus be correctly aligned.)



To give some background information: The loop is quite large; it is part of a
complex numerical kernel for the Einstein equations
<http://einsteintoolkit.org>. I expect there to be a significant number of
local variables / stack spill slots, but these should still fit into the L1
data cache. The kernel's instructions currently occupy about 44 kB. I
plan to reduce this later, and removing unnecessary stack spills would help.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers
  2021-04-05  1:53 [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers schnetter at gmail dot com
@ 2021-04-05  1:54 ` schnetter at gmail dot com
  2021-04-05  2:03 ` schnetter at gmail dot com
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: schnetter at gmail dot com @ 2021-04-05  1:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #1 from Erik Schnetter <schnetter at gmail dot com> ---
Created attachment 50508
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50508&action=edit
Compressed disassembled object file


* [Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers
  2021-04-05  1:53 [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers schnetter at gmail dot com
  2021-04-05  1:54 ` [Bug target/99912] " schnetter at gmail dot com
@ 2021-04-05  2:03 ` schnetter at gmail dot com
  2021-04-06  8:25 ` rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: schnetter at gmail dot com @ 2021-04-05  2:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #2 from Erik Schnetter <schnetter at gmail dot com> ---
I did not describe the scale of the issue. There are more than just a few
inefficient or unnecessary operations:

The loop kernel (a single basic block) extends from address 0x1240 to 0xbf27 in
the attached disassembled object file.

Out of about 6000 instructions in the loop, 1000 are inefficient (and likely
superfluous) moves that copy one 32-byte stack slot into another, using 16-byte
wide copies.

For example, the stack slot 9376(%rsp) is written 9 times in the loop kernel,
but is read only once.


* [Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers
  2021-04-05  1:53 [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers schnetter at gmail dot com
  2021-04-05  1:54 ` [Bug target/99912] " schnetter at gmail dot com
  2021-04-05  2:03 ` schnetter at gmail dot com
@ 2021-04-06  8:25 ` rguenth at gcc dot gnu.org
  2021-04-06 14:42 ` schnetter at gmail dot com
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-04-06  8:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Which function does the loop kernel reside in?  I see you have some lambdas
in Z4c_RHS, done fancy as out-of-line functions, that do look like they
could comprise the actual kernels.  In apply_upwind_diss I see cases without
stack usage.

I'm looking at -O2 -march=skylake compiles

Note that with C++ it's easy to retain some abstraction and thus to
misinterpret stack accesses as spilling when they are really accesses to
aggregates that were not eliminated.  For example, in one of the lambdas I see

  _61489 = __builtin_ia32_maskloadpd256 (_104487, _61513);
  D.545024[1].elts.car = _61489;
...
  MEM[(struct vect *)&D.544982].elts._M_elems[1] = MEM[(const struct simd &)&D.545024 + 32];
...
  MEM[(struct mat3 *)&vars + 992B] = MEM[(const struct mat3 &)&D.544982];

and D.544982 is later variable-indexed in some MIN/MAX, FMA-using code
(instead of using 'vars' there).  Looking at what -fdump-tree-optimized
produces sometimes points at such problems.

That said, the code is large so please point at some source lines within the
important kernel(s) (of the preprocessed source, that is) and the compile
options used.


* [Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers
  2021-04-05  1:53 [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers schnetter at gmail dot com
                   ` (2 preceding siblings ...)
  2021-04-06  8:25 ` rguenth at gcc dot gnu.org
@ 2021-04-06 14:42 ` schnetter at gmail dot com
  2021-04-06 16:33 ` schnetter at gmail dot com
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: schnetter at gmail dot com @ 2021-04-06 14:42 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #4 from Erik Schnetter <schnetter at gmail dot com> ---
I build with the compiler options

/Users/eschnett/src/CarpetX/Cactus/view-compilers/bin/g++  -fopenmp -Wall -pipe
-g -march=skylake -std=gnu++17 -O3 -fcx-limited-range -fexcess-precision=fast
-fno-math-errno -fno-rounding-math -fno-signaling-nans
-funsafe-math-optimizations   -c -o configs/sim/build/Z4c/rhs.cxx.o
configs/sim/build/Z4c/rhs.cxx.ii

One of the kernels in question (the one I describe above) is the C++ lambda in
lines 281013 to 281119. The call to the "noinline" function ensures that the
kernel (and surrounding for loops) is compiled as a separate function, which
produces more efficient code. The function "grid.loop_int_device" contains
essentially three nested for loops, and the actual kernel is the C++ lambda in
lines 281015 to 281118.

I'll have a look at -fdump-tree-optimized.


* [Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers
  2021-04-05  1:53 [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers schnetter at gmail dot com
                   ` (3 preceding siblings ...)
  2021-04-06 14:42 ` schnetter at gmail dot com
@ 2021-04-06 16:33 ` schnetter at gmail dot com
  2021-04-07  9:36 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: schnetter at gmail dot com @ 2021-04-06 16:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #5 from Erik Schnetter <schnetter at gmail dot com> ---
As you suggested, the problem is probably not caused by register spills, but by
stores into a struct that are not optimized away. In this case, the respective
struct elements are unused in the code.

I traced the results of the first __builtin_ia32_maskloadpd256:

  _63940 = __builtin_ia32_maskloadpd256 (_63955, prephitmp_86203);
  MEM <const vector(4) double> [(struct mat3 *)&vars + 992B] = _63940;
  _178613 = .FMA (_63940, _64752, _178609);
  MEM <const vector(4) double> [(struct mat3 *)&vars + 1312B] = _63940;

The respective struct locations (+ 992B, + 1312B) are indeed not used anywhere
else.

The struct is of type z4c_vars. It (and its parent) are defined in lines 279837
to 280818. It is large.

Is there e.g. a parameter I could set to make GCC try harder to avoid
unnecessary stores?


* [Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers
  2021-04-05  1:53 [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers schnetter at gmail dot com
                   ` (4 preceding siblings ...)
  2021-04-06 16:33 ` schnetter at gmail dot com
@ 2021-04-07  9:36 ` rguenth at gcc dot gnu.org
  2021-04-07 16:48 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-04-07  9:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
   Last reconfirmed|                            |2021-04-07
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |ASSIGNED

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Erik Schnetter from comment #5)
> As you suggested, the problem is probably not caused by register spills, but
> by stores into a struct that are not optimized away. In this case, the
> respective struct elements are unused in the code.
> 
> I traced the results of the first __builtin_ia32_maskloadpd256:
> 
>   _63940 = __builtin_ia32_maskloadpd256 (_63955, prephitmp_86203);
>   MEM <const vector(4) double> [(struct mat3 *)&vars + 992B] = _63940;
>   _178613 = .FMA (_63940, _64752, _178609);
>   MEM <const vector(4) double> [(struct mat3 *)&vars + 1312B] = _63940;
> 
> The respective struct locations (+ 992B, + 1312B) are indeed not used
> anywhere else.
> 
> The struct is of type z4c_vars. It (and its parent) are defined in lines
> 279837 to 280818. It is large.
> 
> Is there e.g. a parameter I could set to make GCC try harder to avoid
> unnecessary stores?

Yes, there's --param dse-max-alias-queries-per-store=N, where setting N
to 10000 seems to help quite a bit (its default is 256; the parameter caps
the alias queries per store, bounding DSE's otherwise quadratic complexity).

There are still some other temporaries left, notably we have aggregate
copies like

  MEM[(struct vect *)&D.662088 + 96B clique 23 base 1].elts._M_elems[0] = MEM[(const struct simd &)&D.662080 clique 23 base 0];

and later used as

  SR.3537_174107 = MEM <simd_vector> [(const struct vec3 &)&D.662088];

the aggregate copying keeps the stores into D.662080 live (instead of
directly storing into D.662088).  At SRA time we still take the address
of both so they are not considered for decomposition.  That's from
dead stores to some std::initializer_list.  Thus there's a pass ordering
issue and SRA would benefit from another DSE pass before it.

For the fun of it I did

diff --git a/gcc/passes.def b/gcc/passes.def
index e9ed3c7bc57..0c8a50e7a07 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -221,6 +221,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_tail_recursion);
       NEXT_PASS (pass_ch);
       NEXT_PASS (pass_lower_complex);
+      NEXT_PASS (pass_dse);
       NEXT_PASS (pass_sra);
       /* The dom pass will also resolve all __builtin_constant_p calls
          that are still there to 0.  This has to be done after some
@@ -236,7 +237,6 @@ along with GCC; see the file COPYING3.  If not see
       /* Identify paths that should never be executed in a conforming
         program and isolate those paths.  */
       NEXT_PASS (pass_isolate_erroneous_paths);
-      NEXT_PASS (pass_dse);
       NEXT_PASS (pass_reassoc, true /* insert_powi_p */);
       NEXT_PASS (pass_dce);
       NEXT_PASS (pass_forwprop);

which helps SRA in this case.  It takes quite some magic to remove all
the C++ abstraction in this code.  There are also lots of stray CLOBBERs
of otherwise unused variables in the code, which skew the DSE alias-query
limit; we should look at doing TODO_remove_unused_locals more often (for
example after DSE) - that alone is enough to get almost all of the
improvements without increasing any of the DSE walking limits.


* [Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers
  2021-04-05  1:53 [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers schnetter at gmail dot com
                   ` (5 preceding siblings ...)
  2021-04-07  9:36 ` rguenth at gcc dot gnu.org
@ 2021-04-07 16:48 ` rguenth at gcc dot gnu.org
  2021-04-27 13:17 ` cvs-commit at gcc dot gnu.org
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-04-07 16:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
I've posted a series of patches that will improve things for GCC 12.

https://gcc.gnu.org/pipermail/gcc-patches/2021-April/567743.html
https://gcc.gnu.org/pipermail/gcc-patches/2021-April/567731.html
https://gcc.gnu.org/pipermail/gcc-patches/2021-April/567744.html

The last will eventually see adjustments and/or a different implementation
approach.


* [Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers
  2021-04-05  1:53 [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers schnetter at gmail dot com
                   ` (6 preceding siblings ...)
  2021-04-07 16:48 ` rguenth at gcc dot gnu.org
@ 2021-04-27 13:17 ` cvs-commit at gcc dot gnu.org
  2021-04-27 13:17 ` cvs-commit at gcc dot gnu.org
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-04-27 13:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #8 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:d8e1f1d24179690fd9c0f63c27b12e030010d9ea

commit r12-155-gd8e1f1d24179690fd9c0f63c27b12e030010d9ea
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Apr 7 12:09:44 2021 +0200

    tree-optimization/99912 - schedule DSE before SRA

    For the testcase in the PR the main SRA pass is unable to do some
    important scalarizations because dead stores of addresses make
    the candidate variables disqualified.  The following patch adds
    another DSE pass before SRA forming a DCE/DSE pair and moves the
    DSE pass that is currently closely after SRA up to after the
    next DCE pass, forming another DCE/DSE pair now residing after PRE.

    2021-04-07  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/99912
            * passes.def (pass_all_optimizations): Add pass_dse before
            the first pass_dce, move the first pass_dse before the
            pass_dce following pass_pre.

            * gcc.dg/tree-ssa/ldist-33.c: Disable PRE and LIM.
            * gcc.dg/tree-ssa/pr96789.c: Adjust dump file scanned.
            * gcc.dg/tree-ssa/ssa-dse-28.c: Likewise.
            * gcc.dg/tree-ssa/ssa-dse-29.c: Likewise.


* [Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers
  2021-04-05  1:53 [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers schnetter at gmail dot com
                   ` (7 preceding siblings ...)
  2021-04-27 13:17 ` cvs-commit at gcc dot gnu.org
@ 2021-04-27 13:17 ` cvs-commit at gcc dot gnu.org
  2021-04-27 14:02 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-04-27 13:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #9 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:8d4c374c4419a8751cfae18d6b58169c62dea49f

commit r12-156-g8d4c374c4419a8751cfae18d6b58169c62dea49f
Author: Richard Biener <rguenther@suse.de>
Date:   Tue Apr 27 14:27:40 2021 +0200

    tree-optimization/99912 - schedule another TODO_remove_unused_locals

    This makes sure to remove unused locals and prune CLOBBERs after
    the first scalar cleanup phase after IPA optimizations.  On the
    testcase in the PR this results in 8000 CLOBBERs removed which
    in turn unleashes more DSE which otherwise hits its walking limit
    of 256 too early on this testcase.

    2021-04-27  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/99912
            * passes.def: Add comment about new TODO_remove_unused_locals.
            * tree-stdarg.c (pass_data_stdarg): Run TODO_remove_unused_locals
            at start.


* [Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers
  2021-04-05  1:53 [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers schnetter at gmail dot com
                   ` (8 preceding siblings ...)
  2021-04-27 13:17 ` cvs-commit at gcc dot gnu.org
@ 2021-04-27 14:02 ` rguenth at gcc dot gnu.org
  2021-04-27 16:03 ` schnetter at gmail dot com
  2021-04-29  6:32 ` cvs-commit at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-04-27 14:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
So with the latest patches I now see real spilling dominating (oops).  I also
see, on the GIMPLE level

  _64425 = (unsigned long) SR.3210_122492;
  _64416 = _64425 + ivtmp.5307_121062;
  _62971 = (double &) _64416;
  __builtin_ia32_maskloadpd256 (_62971, _61513);

that is, dead masked loads (that's odd).

There's also still some dead / redundant code from the abstraction to remove.


* [Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers
  2021-04-05  1:53 [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers schnetter at gmail dot com
                   ` (9 preceding siblings ...)
  2021-04-27 14:02 ` rguenth at gcc dot gnu.org
@ 2021-04-27 16:03 ` schnetter at gmail dot com
  2021-04-29  6:32 ` cvs-commit at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: schnetter at gmail dot com @ 2021-04-27 16:03 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #11 from Erik Schnetter <schnetter at gmail dot com> ---
The number of active local variables is likely much larger than the number of
registers, and I expect there to be a lot of spilling. I hope that the compiler
is clever about changing the order in which expressions are evaluated to reduce
spilling as much as possible.

Because the loop is so large, I split it into two, each calculating about half
of the output variables. The code here looks at one of the loops. To simplify
the code, each loop still loads all variables (via masked loads), but may not
use all of them. The unused masked loads do not surprise me per se, but I
expect the compiler to remove them.


* [Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers
  2021-04-05  1:53 [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers schnetter at gmail dot com
                   ` (10 preceding siblings ...)
  2021-04-27 16:03 ` schnetter at gmail dot com
@ 2021-04-29  6:32 ` cvs-commit at gcc dot gnu.org
  11 siblings, 0 replies; 13+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2021-04-29  6:32 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

--- Comment #12 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:b58dc0b803057c0e6032e0d9bd92cd834f72c75c

commit r12-248-gb58dc0b803057c0e6032e0d9bd92cd834f72c75c
Author: Richard Biener <rguenther@suse.de>
Date:   Tue Apr 27 14:32:27 2021 +0200

    tree-optimization/99912 - delete trivially dead stmts during DSE

    DSE performs a backwards walk over stmts removing stores but it
    leaves removing resulting dead SSA defs to later passes.  This
    eats into its own alias walking budget if the removed stores kept
    loads live.  The following patch adds removal of trivially dead
    SSA defs which helps in this situation and reduces the amount of
    garbage followup passes need to deal with.

    2021-04-28  Richard Biener  <rguenther@suse.de>

            PR tree-optimization/99912
            * tree-ssa-dse.c (dse_dom_walker::m_need_cfg_cleanup): New.
            (dse_dom_walker::todo): Likewise.
            (dse_dom_walker::dse_optimize_stmt): Move VDEF check to the
            caller.
            (dse_dom_walker::before_dom_children): Remove trivially
            dead SSA defs and schedule CFG cleanup if we removed all
            PHIs in a block.
            (pass_dse::execute): Get TODO as computed by the DOM walker
            and return it.  Wipe dominator info earlier.

            * gcc.dg/pr95580.c: Disable DSE.
            * gcc.dg/Wrestrict-8.c: Place a use after each memcpy.
            * c-c++-common/ubsan/overflow-negate-3.c: Make asms volatile
            to prevent them from being removed.
            * c-c++-common/ubsan/overflow-sub-4.c: Likewise.

