From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id 845333858018; Mon, 5 Apr 2021 01:53:16 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 845333858018 From: "schnetter at gmail dot com" To: gcc-bugs@gcc.gnu.org Subject: [Bug target/99912] New: Unnecessary / inefficient spilling of AVX2 ymm registers Date: Mon, 05 Apr 2021 01:53:16 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: new X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: target X-Bugzilla-Version: 11.0 X-Bugzilla-Keywords: X-Bugzilla-Severity: normal X-Bugzilla-Who: schnetter at gmail dot com X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Resolution: X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter target_milestone attachments.created Message-ID: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: gcc-bugs@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-bugs mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 05 Apr 2021 01:53:16 -0000 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D99912 Bug ID: 99912 Summary: Unnecessary / inefficient spilling of AVX2 ymm registers Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: schnetter at gmail dot com Target Milestone: --- Created attachment 50507 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=3D50507&action=3Dedit Compressed preprocessed source code I am using "g++ (Spack GCC) 11.0.1 20210404 (experimental)" (fresh checkout= ) on MacOS 11.2.3 with a x86-64 Skylake CPU. I am manually SIMD-vectorizing a loop kernel using AVX2 intrinsics. The generated code is correct, but has obvious inefficiencies. I find these iss= ues: 1. There are spills (?) of AVX2 ymm registers that are overwritten by anoth= er spill a few instructions later, without being read in the mean time 2. The same register is spilled into multiple stack slots in consecutive instructions 3. After spilling an ymm register, the stack slot is copied to another stack slot, using xmm registers (i.e. using two loads/stores) I tried to reproduce the issue in a small example, but failed. If this issu= e is really due to spilling, then it might not be possible to have a small test case. Here is an example of issues 1 and 2; I show a few lines from the attached disassembled file to clarify: {{{ 1520: c5 fd 29 8c 24 a0 24 00 00 vmovapd %ymm1, 9376(%rsp) 1529: c5 fd 29 8c 24 20 29 00 00 vmovapd %ymm1, 10528(%rsp) 1532: c5 fd 29 b4 24 80 28 00 00 vmovapd %ymm6, 10368(%rsp) 153b: c5 fd 29 ac 24 a0 28 00 00 vmovapd %ymm5, 10400(%rsp) 1544: c5 fd 29 a4 24 c0 28 00 00 vmovapd %ymm4, 10432(%rsp) 154d: c5 fd 29 9c 24 e0 28 00 00 vmovapd %ymm3, 10464(%rsp) 1556: c5 fd 29 94 24 00 29 00 00 vmovapd %ymm2, 10496(%rsp) 155f: c4 a2 1d 2d 34 30 vmaskmovpd (%rax,%r14), %ymm12, %ymm6 1565: 48 8b 84 24 00 05 00 00 movq 1280(%rsp), %rax 156d: c5 fd 29 b4 24 00 24 00 00 vmovapd %ymm6, 9216(%rsp) 1576: c4 a2 1d 2d 2c 30 vmaskmovpd (%rax,%r14), %ymm12, %ymm5 157c: 48 8b 84 24 38 07 00 00 movq 1848(%rsp), %rax 1584: c5 fd 29 ac 24 20 24 00 00 vmovapd %ymm5, 9248(%rsp) 158d: c4 a2 1d 2d 24 30 vmaskmovpd (%rax,%r14), %ymm12, %ymm4 1593: 48 8b 84 24 60 04 00 00 movq 1120(%rsp), %rax 159b: c5 fd 29 a4 24 40 24 00 00 vmovapd %ymm4, 9280(%rsp) 15a4: c4 a2 1d 2d 1c 30 vmaskmovpd (%rax,%r14), %ymm12, %ymm3 15aa: 48 8b 84 24 68 04 00 00 movq 1128(%rsp), %rax 15b2: c5 fd 29 9c 24 60 24 00 00 vmovapd %ymm3, 9312(%rsp) 15bb: c4 a2 1d 2d 14 30 vmaskmovpd (%rax,%r14), %ymm12, %ymm2 15c1: c5 fd 29 94 24 80 24 00 00 vmovapd %ymm2, 9344(%rsp) 15ca: 48 8b 84 24 08 05 00 00 movq 1288(%rsp), %rax 15d2: c4 a2 1d 2d 0c 30 vmaskmovpd (%rax,%r14), %ymm12, %ymm1 15d8: 48 8b 84 24 70 04 00 00 movq 1136(%rsp), %rax 15e0: c5 fd 29 8c 24 a0 24 00 00 vmovapd %ymm1, 9376(%rsp) 15e9: c5 fd 29 b4 24 40 29 00 00 vmovapd %ymm6, 10560(%rsp) 15f2: c5 fd 29 ac 24 60 29 00 00 vmovapd %ymm5, 10592(%rsp) 15fb: c5 fd 29 a4 24 80 29 00 00 vmovapd %ymm4, 10624(%rsp) 1604: c5 fd 29 9c 24 a0 29 00 00 vmovapd %ymm3, 10656(%rsp) 160d: c5 fd 29 94 24 c0 29 00 00 vmovapd %ymm2, 10688(%rsp) 1616: c5 fd 29 8c 24 e0 29 00 00 vmovapd %ymm1, 10720(%rsp) }}} The beginning and end of this sample are what I think might be spill instructions. The instruction at 1520 writes to 9376(%rsp), and the instruc= tion at 15e0 overwrites this stack slot. Also, the register %ymm1 is written multiple times to different stack slots. (That by itself could be fine, but= it looks strange.) A few instructions later I find this code: {{{ 16d7: c5 79 6f 84 24 80 28 00 00 vmovdqa 10368(%rsp), %xmm8 16e0: c5 79 6f ac 24 20 29 00 00 vmovdqa 10528(%rsp), %xmm13 16e9: c5 79 7f 84 24 e0 19 00 00 vmovdqa %xmm8, 6624(%rsp) 16f2: c5 79 6f 84 24 90 28 00 00 vmovdqa 10384(%rsp), %xmm8 16fb: c5 79 7f ac 24 80 1a 00 00 vmovdqa %xmm13, 6784(%rsp) 1704: c5 79 7f 84 24 f0 19 00 00 vmovdqa %xmm8, 6640(%rsp) }}} This copies the 32 bytes at 10368(%rsp) (written above), but uses %xmm8 to = copy the stack slot in 16-byte chunks. This shouldn't happen; there is no reason= to copy from one stack slot to another (presumably, since I know the code, but= I could be mistaken here). There is also no reason to copy in 16-byte chunks. (All relevant local variables are ultimately of type __m256d, wrapped in C++ structs, and should thus be correctly aligned.) To give some background information: The loop is quite large; it is part of= a complex numerical kernel for the Einstein equations . I expect there to be a significant number of local variables / stack spill slots, but these should still fit into the L1 data cache. The instructions for the kernel occupy currently about 44 kB. I plan to reduce this later, and removing unnecessary stack spills would help= .=