From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/106514] [12/13 Regression] ranger slowness in path query
Date: Tue, 09 Aug 2022 09:09:11 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106514

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
So one now needs to bump the limit to 60 to get enough samples for perf.
Then we see

Samples: 55K of event 'cycles:u', Event count (approx.): 49013411833
Overhead  Samples  Command  Shared Object  Symbol
  51.19%    28195  cc1      cc1            [.] path_range_query::compute_ranges_in_block
  11.67%     6427  cc1      cc1            [.] path_range_query::adjust_for_non_null_uses
   9.20%     5069  cc1      cc1            [.] path_range_query::range_defined_in_block
   3.39%     1869  cc1      cc1            [.] bitmap_set_bit
   1.95%     1072  cc1      cc1            [.] back_threader::find_paths_to_names
   1.93%     1066  cc1      cc1            [.] bitmap_bit_p

compute_ranges_in_block is also on top with a limit of 30, but
adjust_for_non_null_uses only pops up with 60.

The compute_ranges_in_block slowness is attributed to

  // ...and then the rest of the imports.
  EXECUTE_IF_SET_IN_BITMAP (m_imports, 0, i, bi)
    {
      tree name = ssa_name (i);
      Value_Range r (TREE_TYPE (name));

      if (gimple_code (SSA_NAME_DEF_STMT (name)) != GIMPLE_PHI
          && range_defined_in_block (r, name, bb))
             ^^^^

plus

  gori_compute &g = m_ranger->gori ();
  bitmap exports = g.exports (bb);
  EXECUTE_IF_AND_IN_BITMAP (m_imports, exports, 0, i, bi)
    {
      tree name = ssa_name (i);
      Value_Range r (TREE_TYPE (name));
      if (g.outgoing_edge_range_p (r, e, name, *this))
         ^^^^

For this testcase there seem to be a lot of imports but not many exports,
so range_defined_in_block is called very many times compared to
outgoing_edge_range_p, but the latter is comparatively more expensive.

For the path query I wonder why we are interested in computing (aka
updating the cache for) anything but the exports?  When we compute the
exports, why is the cache not computed lazily, just for the interesting
names?  AFAICS we invalidate all local defs (but even then, why?  We get
to see a def exactly once; why do we have to even think about clearing
something we should not have seen?).  That is, in
path_range_query::compute_ranges

  while (1)
    {
      basic_block bb = curr_bb ();

      compute_ranges_in_block (bb);
      adjust_for_non_null_uses (bb);

      if (at_exit ())
        break;

      move_next ();
    }

I'd expect only a small portion of the actual compute_ranges_in_block
work to be done for all blocks, and the real resolving work only for the
block ending the path.  Maybe the backwards threader is just using the
wrong (expensive) API here?
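The lazy scheme suggested above can be sketched generically.  This is
illustrative C++ only, not GCC's ranger API: LazySolver, Range and the
defs map are made-up names.  The point is that resolution starts from the
requested exports and recurses only into their dependencies, so the cache
is never touched for names that no export depends on:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Range { long lo, hi; };  // stand-in for Value_Range

struct LazySolver
{
  // Hypothetical IR: name -> (dependencies, combiner over their ranges).
  std::map<std::string,
	   std::pair<std::vector<std::string>,
		     std::function<Range (const std::vector<Range> &)>>> defs;
  std::map<std::string, Range> cache;  // filled only on demand

  Range range_of (const std::string &name)
  {
    if (auto it = cache.find (name); it != cache.end ())
      return it->second;		// already resolved
    auto &[deps, combine] = defs.at (name);
    std::vector<Range> in;
    for (const std::string &d : deps)
      in.push_back (range_of (d));	// recurse only into what's needed
    return cache[name] = combine (in);
  }
};
```

In this model an eager compute_ranges_in_block would fill the cache for
every import in every block; the demand-driven variant fills exactly the
entries on a dependency path from an export.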
The threader currently does

  m_solver->compute_ranges (path, m_imports);
  m_solver->range_of_stmt (r, cond);

--

Btw, I wondered whether path-range-query can handle parts of the path
being a "black box", i.e. skipping to the immediate dominator instead of
following one of the predecessor edges?  I _think_ analysis-wise this
would be quite straightforward, but of course we'd have to represent this
somehow in the path.  Maybe it works by simply leaving out the
intermediate blocks?  Thus with

    B
    |\
    A \
   / \
  C   D
   \ /
    E
     \

the path would be from B to E, but we don't care whether we go the C or
the D way.  When duplicating the path we'd simply duplicate the whole
diamond instead of duplicating only one branch, say A->D, while keeping
the edge A->C to the original block C, which defeats the threading of E
to its successor if we happen to go that way.
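For the "black box" idea, the range entering E would have to be
conservative over everything inside the skipped region.  A minimal
sketch of the obvious way to get that (plain C++, not GCC code;
join_black_box is a made-up helper): union the ranges flowing out of
each branch inside the diamond, here C and D:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Range { long lo, hi; };  // stand-in for Value_Range

// Conservative range of a name at the join block E: the union of the
// ranges produced along each branch inside the black box.
Range
join_black_box (const std::vector<Range> &branch_ranges)
{
  Range r = branch_ranges.front ();
  for (const Range &b : branch_ranges)
    {
      r.lo = std::min (r.lo, b.lo);
      r.hi = std::max (r.hi, b.hi);
    }
  return r;
}
```

So the analysis would stay simple; the open question is only how to
represent the skipped region in the path itself.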