public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers
@ 2021-04-05  1:53 schnetter at gmail dot com
  2021-04-05  1:54 ` [Bug target/99912] " schnetter at gmail dot com
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: schnetter at gmail dot com @ 2021-04-05  1:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

            Bug ID: 99912
           Summary: Unnecessary / inefficient spilling of AVX2 ymm
                    registers
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: schnetter at gmail dot com
  Target Milestone: ---

Created attachment 50507
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50507&action=edit
Compressed preprocessed source code

I am using "g++ (Spack GCC) 11.0.1 20210404 (experimental)" (fresh checkout) on
MacOS 11.2.3 with a x86-64 Skylake CPU.

I am manually SIMD-vectorizing a loop kernel using AVX2 intrinsics. The
generated code is correct, but has obvious inefficiencies. I find these issues:

1. There are spills (?) of AVX2 ymm registers that are overwritten by another
spill a few instructions later, without being read in the mean time

2. The same register is spilled into multiple stack slots in consecutive
instructions

3. After spilling an ymm register, the stack slot is copied to another stack
slot, using xmm registers (i.e. using two loads/stores)

I tried to reproduce the issue in a small example, but failed. If this issue is
really due to spilling, then it might not be possible to have a small test
case.



Here is an example of issues 1 and 2; I show a few lines from the attached
disassembled file to clarify:
{{{
    1520: c5 fd 29 8c 24 a0 24 00 00    vmovapd %ymm1, 9376(%rsp)
    1529: c5 fd 29 8c 24 20 29 00 00    vmovapd %ymm1, 10528(%rsp)
    1532: c5 fd 29 b4 24 80 28 00 00    vmovapd %ymm6, 10368(%rsp)
    153b: c5 fd 29 ac 24 a0 28 00 00    vmovapd %ymm5, 10400(%rsp)
    1544: c5 fd 29 a4 24 c0 28 00 00    vmovapd %ymm4, 10432(%rsp)
    154d: c5 fd 29 9c 24 e0 28 00 00    vmovapd %ymm3, 10464(%rsp)
    1556: c5 fd 29 94 24 00 29 00 00    vmovapd %ymm2, 10496(%rsp)
    155f: c4 a2 1d 2d 34 30             vmaskmovpd      (%rax,%r14), %ymm12,
%ymm6
    1565: 48 8b 84 24 00 05 00 00       movq    1280(%rsp), %rax
    156d: c5 fd 29 b4 24 00 24 00 00    vmovapd %ymm6, 9216(%rsp)
    1576: c4 a2 1d 2d 2c 30             vmaskmovpd      (%rax,%r14), %ymm12,
%ymm5
    157c: 48 8b 84 24 38 07 00 00       movq    1848(%rsp), %rax
    1584: c5 fd 29 ac 24 20 24 00 00    vmovapd %ymm5, 9248(%rsp)
    158d: c4 a2 1d 2d 24 30             vmaskmovpd      (%rax,%r14), %ymm12,
%ymm4
    1593: 48 8b 84 24 60 04 00 00       movq    1120(%rsp), %rax
    159b: c5 fd 29 a4 24 40 24 00 00    vmovapd %ymm4, 9280(%rsp)
    15a4: c4 a2 1d 2d 1c 30             vmaskmovpd      (%rax,%r14), %ymm12,
%ymm3
    15aa: 48 8b 84 24 68 04 00 00       movq    1128(%rsp), %rax
    15b2: c5 fd 29 9c 24 60 24 00 00    vmovapd %ymm3, 9312(%rsp)
    15bb: c4 a2 1d 2d 14 30             vmaskmovpd      (%rax,%r14), %ymm12,
%ymm2
    15c1: c5 fd 29 94 24 80 24 00 00    vmovapd %ymm2, 9344(%rsp)
    15ca: 48 8b 84 24 08 05 00 00       movq    1288(%rsp), %rax
    15d2: c4 a2 1d 2d 0c 30             vmaskmovpd      (%rax,%r14), %ymm12,
%ymm1
    15d8: 48 8b 84 24 70 04 00 00       movq    1136(%rsp), %rax
    15e0: c5 fd 29 8c 24 a0 24 00 00    vmovapd %ymm1, 9376(%rsp)
    15e9: c5 fd 29 b4 24 40 29 00 00    vmovapd %ymm6, 10560(%rsp)
    15f2: c5 fd 29 ac 24 60 29 00 00    vmovapd %ymm5, 10592(%rsp)
    15fb: c5 fd 29 a4 24 80 29 00 00    vmovapd %ymm4, 10624(%rsp)
    1604: c5 fd 29 9c 24 a0 29 00 00    vmovapd %ymm3, 10656(%rsp)
    160d: c5 fd 29 94 24 c0 29 00 00    vmovapd %ymm2, 10688(%rsp)
    1616: c5 fd 29 8c 24 e0 29 00 00    vmovapd %ymm1, 10720(%rsp)
}}}

The beginning and end of this sample are what I think might be spill
instructions. The instruction at 1520 writes to 9376(%rsp), and the instruction
at 15e0 overwrites this stack slot. Also, the register %ymm1 is written
multiple times to different stack slots. (That by itself could be fine, but it
looks strange.)

A few instructions later I find this code:
{{{
    16d7: c5 79 6f 84 24 80 28 00 00    vmovdqa 10368(%rsp), %xmm8
    16e0: c5 79 6f ac 24 20 29 00 00    vmovdqa 10528(%rsp), %xmm13
    16e9: c5 79 7f 84 24 e0 19 00 00    vmovdqa %xmm8, 6624(%rsp)
    16f2: c5 79 6f 84 24 90 28 00 00    vmovdqa 10384(%rsp), %xmm8
    16fb: c5 79 7f ac 24 80 1a 00 00    vmovdqa %xmm13, 6784(%rsp)
    1704: c5 79 7f 84 24 f0 19 00 00    vmovdqa %xmm8, 6640(%rsp)
}}}
This copies the 32 bytes at 10368(%rsp) (written above), but uses %xmm8 to copy
the stack slot in 16-byte chunks. This shouldn't happen; there is no reason to
copy from one stack slot to another (presumably, since I know the code, but I
could be mistaken here). There is also no reason to copy in 16-byte chunks.
(All relevant local variables are ultimately of type __m256d, wrapped in C++
structs, and should thus be correctly aligned.)



To give some background information: The loop is quite large; it is part of a
complex numerical kernel for the Einstein equations
<http://einsteintoolkit.org>. I expect there to be a significant number of
local variables / stack spill slots, but these should still fit into the L1
data cache. The instructions for the kernel occupy currently about 44 kB. I
plan to reduce this later, and removing unnecessary stack spills would help.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-04-29  6:32 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-05  1:53 [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers schnetter at gmail dot com
2021-04-05  1:54 ` [Bug target/99912] " schnetter at gmail dot com
2021-04-05  2:03 ` schnetter at gmail dot com
2021-04-06  8:25 ` rguenth at gcc dot gnu.org
2021-04-06 14:42 ` schnetter at gmail dot com
2021-04-06 16:33 ` schnetter at gmail dot com
2021-04-07  9:36 ` rguenth at gcc dot gnu.org
2021-04-07 16:48 ` rguenth at gcc dot gnu.org
2021-04-27 13:17 ` cvs-commit at gcc dot gnu.org
2021-04-27 13:17 ` cvs-commit at gcc dot gnu.org
2021-04-27 14:02 ` rguenth at gcc dot gnu.org
2021-04-27 16:03 ` schnetter at gmail dot com
2021-04-29  6:32 ` cvs-commit at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).