From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id BEFC23896C0B; Wed, 7 Apr 2021 09:36:12 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org BEFC23896C0B From: "rguenth at gcc dot gnu.org" To: gcc-bugs@gcc.gnu.org Subject: [Bug target/99912] Unnecessary / inefficient spilling of AVX2 ymm registers Date: Wed, 07 Apr 2021 09:36:12 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: target X-Bugzilla-Version: 11.0 X-Bugzilla-Keywords: missed-optimization X-Bugzilla-Severity: normal X-Bugzilla-Who: rguenth at gcc dot gnu.org X-Bugzilla-Status: ASSIGNED X-Bugzilla-Resolution: X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: rguenth at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: assigned_to cf_reconfirmed_on everconfirmed bug_status Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: gcc-bugs@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-bugs mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 07 Apr 2021 09:36:12 -0000 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D99912 Richard Biener changed: What |Removed |Added ---------------------------------------------------------------------------- Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot = gnu.org Last reconfirmed| |2021-04-07 Ever confirmed|0 |1 Status|UNCONFIRMED |ASSIGNED --- Comment #6 from Richard Biener --- (In reply to Erik Schnetter from comment #5) > As you suggested, the problem is probably not caused by register spills, = but > by stores into a struct that are not optimized away. In this case, the > respective struct elements are unused in the code. >=20 > I traced the results of the first __builtin_ia32_maskloadpd256: >=20 > _63940 =3D __builtin_ia32_maskloadpd256 (_63955, prephitmp_86203); > MEM [(struct mat3 *)&vars + 992B] =3D _63940; > _178613 =3D .FMA (_63940, _64752, _178609); > MEM [(struct mat3 *)&vars + 1312B] =3D _63940; >=20 > The respective struct locations (+ 992B, + 1312B) are indeed not used > anywhere else. >=20 > The struct is of type z4c_vars. It (and its parent) are defined in lines > 279837 to 280818. It is large. >=20 > Is there e.g. a parameter I could set to make GCC try harder avoid > unnecessary stores? Yes, there's --param dse-max-alias-queries-per-store=3DN where setting N to 10000 seems to help quite a bit (it's default is 256 and it's used to limit the quadratic complexity of DSE to constant-time). There are still some other temporaries left, notably we have aggregate copies like MEM[(struct vect *)&D.662088 + 96B clique 23 base 1].elts._M_elems[0] =3D MEM[(const struct simd &)&D.662080 clique 23 base 0]; and later used as SR.3537_174107 =3D MEM [(const struct vec3 &)&D.662088]; the aggregate copying keeps the stores into D.662080 live (instead of directly storing into D.662088). At SRA time we still take the address of both so they are not considered for decomposition. That's from dead stores to some std::initializer_list. Thus there's a pass ordering issue and SRA would benefit from another DSE pass before it. For the fun I did diff --git a/gcc/passes.def b/gcc/passes.def index e9ed3c7bc57..0c8a50e7a07 100644 --- a/gcc/passes.def +++ b/gcc/passes.def @@ -221,6 +221,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_tail_recursion); NEXT_PASS (pass_ch); NEXT_PASS (pass_lower_complex); + NEXT_PASS (pass_dse); NEXT_PASS (pass_sra); /* The dom pass will also resolve all __builtin_constant_p calls that are still there to 0. This has to be done after some @@ -236,7 +237,6 @@ along with GCC; see the file COPYING3. If not see /* Identify paths that should never be executed in a conforming program and isolate those paths. */ NEXT_PASS (pass_isolate_erroneous_paths); - NEXT_PASS (pass_dse); NEXT_PASS (pass_reassoc, true /* insert_powi_p */); NEXT_PASS (pass_dce); NEXT_PASS (pass_forwprop); which helps SRA in this case. It does need quite some magic to remove all the C++ abstraction in this code. There's also lots of stray CLOBBERS of otherwise unused variables in the code which skews the DSE alias query parameter limit, we should look at doing TODO_remove_unused_locals more often (for example after DSE) - that alone is enough to get almost all the improvements without increasing any of the DSE walking limits.=