From mboxrd@z Thu Jan 1 00:00:00 1970
Received: by sourceware.org (Postfix, from userid 48)
	id 96DFD38930F3; Mon, 8 Feb 2021 09:46:37 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 96DFD38930F3
From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/98863] [11 Regression] WRF with LTO consumes a lot of memory in REE, FWPROP and x86 specific passes
Date: Mon, 08 Feb 2021 09:46:37 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: rtl-optimization
X-Bugzilla-Version: 11.0
X-Bugzilla-Keywords: memory-hog, ra
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: rsandifo at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 11.0
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list
X-List-Received-Date: Mon, 08 Feb 2021 09:46:37 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98863

--- Comment #42 from Richard Biener ---
(In reply to richard.sandiford from comment #41)
> "rguenther at suse dot de" writes:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98863
> >
> > --- Comment #40 from rguenther at suse dot de ---
> > On Mon, 8 Feb 2021, rsandifo at gcc dot gnu.org wrote:
> >
> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98863
> >>
> >> --- Comment #39 from rsandifo at gcc dot gnu.org ---
> >> Just to give an update on this: I have a patch that reduces the
> >> amount of memory consumed by fwprop so that it no longer seems
> >> to be an outlier.  However, it involves doing more bitmap operations.
> >> In this testcase we have a large number of registers that
> >> seem to be live but unused across a large region of code,
> >> so bitmap ANDs with the live-in sets are expensive and hit
> >> the worst case O(nblocks×nregisters).  I'm still trying to find
> >> a way of reducing the effect of that.
> >
> > But isn't this what the RD problem does as well (yeah, DF shows
> > up as quite compile-time expensive here), and thus all UD/DU chain
> > users suffer from the same issue?
> Sure, it certainly isn't specific to the RTL-SSA code :-)
> I just think we can do better than my current WIP patch does.
>
> > What I didn't explore further is re-doing the way RD numbers defs
> > in the bitmaps, with the idea that defs only used inside a
> > single BB need not be represented (the local problems
> > take care of them).  But that of course only helps if there are
> > a significant number of such defs (shadowed by later defs of the
> > same reg in the same BB) - which usually should be the case.
> Yeah.  And I think the problem here is that we have a large
> number of non-local defs and uses.  It doesn't look like there
> are an excessive number of uses per def, just that the defs are
> live across a large region before being used.
>
> > There's extra overhead for re-numbering things of course (but my hope
> > was to make the RD problem fit in the cache for a nice speedup...)
>
> Has anyone looked into how we end up in this situation for this
> testcase?  E.g. did we make bad inlining decisions?  Or is it just
> a natural consequence of the way the source is written?

I don't think it's a natural consequence.  Last time I looked at WRF's
excessive compile-time issues, the root cause was that _lots_ of loops
eventually get cloned for prologues/epilogues by the vectorizer, so we
end up with _lots_ of loops.  I think the issue is also visible without
LTO, just less pronounced.

> We should cope with the situation better regardless, but since
> extreme cases like this tend to trigger --param limits, it would
> be good to avoid getting into the situation too. :-)

Yeah.  One source of extra lifetime is of course demotion of memory to
registers done by the GIMPLE optimizers (store-motion, PRE, hoisting,
but also simply FRE).  IIRC WRF once was a bad hitter at store-motion.
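(To make that concrete, a minimal made-up fragment - not from WRF;
assume a and b are double arrays and i, j, n are ints with j and n
loop-invariant.  Store-motion turns

  for (i = 0; i < n; ++i)
    a[j] += b[i];

into

  tem = a[j];
  for (i = 0; i < n; ++i)
    tem += b[i];
  a[j] = tem;

so the new register tem is live across the whole loop, and after
unrolling and loop cloning potentially across a much larger region.)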
> FWIW, as far as compile-time goes, the outlier in a release build
> seems to be do_rpo_vn.

Yeah, I've looked at profiles and the outlier was

static void
do_unwind (unwind_state *to, int rpo_idx, rpo_elim &avail, int *bb_to_rpo)
{
...
  /* Prune [rpo_idx, ] from avail.  */
  /* ???  This is O(number-of-values-in-region) which is
     O(region-size) rather than O(iteration-piece).  */
  for (hash_table<vn_ssa_aux_hasher>::iterator i = vn_ssa_aux_hash->begin ();
       i != vn_ssa_aux_hash->end (); ++i)
    {
      while ((*i)->avail)
        {
          if (bb_to_rpo[(*i)->avail->location] < rpo_idx)
            break;
          vn_avail *av = (*i)->avail;
          (*i)->avail = (*i)->avail->next;
          av->next = avail.m_avail_freelist;
          avail.m_avail_freelist = av;
        }
    }

which has tons of cache misses (and it is triggered from the
region-based call from unrolling a lot of times, so the ??? comment
eventually applies as well).

But for me the "total" outlier is the sum of the DF times.  Again,
almost all dataflow problems do not fit in the cache since the
function is so large.  That adds quite some constant factor to all
compile-time cost, making it memory bound :/
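(For reference, a minimal sketch of the kind of per-block bitmap work
being discussed - illustrative only, not the actual fwprop patch;
`tracked' is an invented bitmap naming the registers the pass cares
about:

  basic_block bb;
  bitmap tmp = BITMAP_ALLOC (NULL);
  FOR_EACH_BB_FN (bb, cfun)
    {
      /* One AND per block, each walking both bitmaps - with most
         registers live but unused across the region this costs
         O(nregisters) per block, O(nblocks×nregisters) overall.  */
      bitmap_and (tmp, df_get_live_in (bb), tracked);
      /* ... use tmp ... */
    }
  BITMAP_FREE (tmp);

When the live-in sets do not fit in the cache each of those walks
misses, which is where the memory-boundedness shows up.)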