From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id 05DBB3858D20; Wed, 31 May 2023 10:37:16 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 05DBB3858D20 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1685529436; bh=1JQzroz1Vz9ocIAU3su2S4KLDG6inwXjKz8U8VyxLbQ=; h=From:To:Subject:Date:In-Reply-To:References:From; b=Bx1/PJpJV9MMahNs+2moFi8PoV5h6IaRdMnvE7oPamBZgQrjtA2OF3OQL3ZrqTfUt QuH6Wvu80gUDQyxa4SCpEYPsYQ4kw+7DGqFDtRyMxnFUAlxtmNM/PK5hGbbRaRFwOD 3SAywlyfSSD6KEAtI5rvm6Yh5ShN8IIRx3rYNwWQ= From: "rguenth at gcc dot gnu.org" To: gcc-bugs@gcc.gnu.org Subject: [Bug ipa/109983] [12/13/14 regression] Wireshark compilation hangs with -O2 -fipa-pta Date: Wed, 31 May 2023 10:37:14 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: ipa X-Bugzilla-Version: 14.0 X-Bugzilla-Keywords: compile-time-hog X-Bugzilla-Severity: normal X-Bugzilla-Who: rguenth at gcc dot gnu.org X-Bugzilla-Status: ASSIGNED X-Bugzilla-Resolution: X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: rguenth at gcc dot gnu.org X-Bugzilla-Target-Milestone: 12.4 X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 List-Id: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D109983 --- Comment #12 from Richard Biener --- Bad: Samples: 4M of event 'cycles:u', Event count (approx.): 6574968845607=20=20= =20=20=20=20=20=20=20=20=20 Overhead Samples Command Shared Object Symbol=20=20=20=20=20= =20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20 59.44% 2803407 cc1 cc1 [.] bitmap_ior_into 23.71% 1119062 cc1 cc1 [.] bitmap_bit_p 12.30% 583673 cc1 cc1 [.] bitmap_set_bit 1.13% 53698 cc1 cc1 [.] find 1.09% 51770 cc1 cc1 [.] solve_constraints 0.55% 26265 cc1 cc1 [.] add_graph_edge 0.52% 24726 cc1 cc1 [.] solve_add_graph_ed= ge 0.34% 16328 cc1 cc1 [.] find_what_var_points_to 0.23% 11259 cc1 cc1 [.] solution_set_expand 0.14% 7015 cc1 cc1 [.] ipa_pta_execute 0.13% 6296 cc1 cc1 [.] topo_visit 0.06% 2708 cc1 cc1 [.] auto_var_in_fn_p 1157768.85 msec task-clock:u # 1.000 CPUs utilized=20=20=20=20=20=20=20=20=20=20 0 context-switches:u # 0.000 /sec=20= =20=20=20=20=20=20=20 0 cpu-migrations:u # 0.000 /sec=20= =20=20=20=20=20=20=20 712310 page-faults:u # 615.244 /sec=20= =20=20=20=20=20=20=20 6492736607297 cycles:u # 5.608 GHz=20= =20=20=20=20=20=20=20 (83.33%) 629989262 stalled-cycles-frontend:u # 0.01% frontend cycles idle (83.33%) 1079567639 stalled-cycles-backend:u # 0.02% backend cycles idle (83.33%) 3618200824104 instructions:u # 0.56 insn per cycle=20=20=20=20=20=20=20=20=20 # 0.00 stalled cycles= per insn (83.33%) 976059766707 branches:u # 843.052 M/sec= =20=20=20=20=20=20 (83.33%) 2658800911 branch-misses:u # 0.27% of all branches (83.33%) 1158.070038218 seconds time elapsed 1157.021441000 seconds user 0.743804000 seconds sys Samples: 4M of event 'cache-misses:u', Event count (approx.): 30071477326= =20=20=20=20=20=20=20 Overhead Samples Command Shared Object Symbol=20=20=20=20=20= =20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20 67.54% 2723606 cc1 cc1 [.] bitmap_ior_into 16.50% 994335 cc1 cc1 [.] bitmap_bit_p 3.33% 121544 cc1 cc1 [.] bitmap_set_bit Good: Samples: 390K of event 'cycles:u', Event count (approx.): 532564428833=20= =20=20=20=20=20=20=20=20=20 Overhead Samples Command Shared Object Symbol=20=20=20=20=20= =20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20 73.09% 280949 cc1 cc1 [.] bitmap_set_bit 3.70% 14919 cc1 cc1 [.] find 3.54% 13591 cc1 cc1 [.] find_what_var_points_to 3.48% 13952 cc1 cc1 [.] bitmap_bit_p 2.26% 8969 cc1 cc1 [.] bitmap_ior_into 2.24% 9027 cc1 cc1 [.] add_graph_edge 2.03% 8176 cc1 cc1 [.] solve_add_graph_ed= ge 1.75% 7072 cc1 cc1 [.] solution_set_expand 1.41% 5620 cc1 cc1 [.] ipa_pta_execute 1.34% 5406 cc1 cc1 [.] solve_constraints 98853.84 msec task-clock:u # 1.000 CPUs utilized=20=20=20=20=20=20=20=20=20=20 0 context-switches:u # 0.000 /sec=20= =20=20=20=20=20=20=20 0 cpu-migrations:u # 0.000 /sec=20= =20=20=20=20=20=20=20 558875 page-faults:u # 5.654 K/sec= =20=20=20=20=20=20=20 533990826593 cycles:u # 5.402 GHz=20= =20=20=20=20=20=20=20 (83.33%) 387751641 stalled-cycles-frontend:u # 0.07% frontend cycles idle (83.33%) 236682949 stalled-cycles-backend:u # 0.04% backend cycles idle (83.33%) 1084066822948 instructions:u # 2.03 insn per cycle=20=20=20=20=20=20=20=20=20 # 0.00 stalled cycles= per insn (83.33%) 276015818957 branches:u # 2.792 G/sec= =20=20=20=20=20=20 (83.34%) 899265351 branch-misses:u # 0.33% of all branches (83.33%) 98.878776190 seconds time elapsed 98.459437000 seconds user 0.395912000 seconds sys Samples: 140K of event 'cache-misses:u', Event count (approx.): 4272440656= =20=20=20=20=20=20 Overhead Samples Command Shared Object Symbol=20=20=20=20=20= =20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20 17.92% 6124 cc1 cc1 [.] ipa_pta_execute 11.93% 28566 cc1 cc1 [.] bitmap_set_bit 10.48% 14489 cc1 cc1 [.] find But the key observation is from the stats. In the bad case we have Points-to Stats: Total vars: 108033 Non-pointer vars: 59 Statically unified vars: 20147 Dynamically unified vars: 0 Iterations: 4 Number of edges: 141447426 Number of implicit edges: 215676 Points to sets created:15930 and the good case sees Points-to Stats: Total vars: 108033 Non-pointer vars: 59 Statically unified vars: 20147 Dynamically unified vars: 0 Iterations: 10 Number of edges: 582729 Number of implicit edges: 215676 Points to sets created:15930 so we have a _lot_ more (242x) edges in the slow case (but also only 4 solver iterations) while patched we somehow do with less edges (and thus way less propagations aka bitmap_ior) but need 2.5x the number of solver iterations. The final points-to solutions are 1:1 identical (semantically, I did not look at the ESCAPED set here). The key is the heuristic in add_graph_edge triggers way more often: /* The graph solving process does not avoid "triangles", thus there can be multiple paths from a node to another involving intermediate other nodes. That causes extra copying which is most difficult to avoid when the intermediate node is ESCAPED because there are no edges added from ESCAPED. Avoid adding the direct edge FROM -> TO when we have FROM -> ESCAPED and TO contains ESCAPED.=20 ??? Note this is only a heuristic, it does not prevent the situation from occuring. The heuristic helps PR38474 and PR99912 significantly. */ if (to < FIRST_REF_NODE && bitmap_bit_p (graph->succs[from], find (escaped_id)) && bitmap_bit_p (get_varinfo (find (to))->solution, escaped_id)) { stats.num_avoided_edges++; return false; } I've added num_avoided_edges and for the fast case we see Number of avoided edges: 870215056 (that's 6x the number of edges in the sl= ow case!) For the slow case we have Number of avoided edges: 1113769444 which is even more. But I think that should be the only reason the number of edges created differs. Converging more slowly means executing ESCAPED =3D *ESCAPED more often which creates edges to escaped earlier and thus possibly triggering the above. But it also means the 'to' solution might not yet contain ESCAPED ... ICK. I can get it to ipa points-to : 0.40 ( 14%) 0.03 ( 5%) 0.43 ( = 12%) 1950k ( 1%) TOTAL : 2.87 0.61 3.49=20= =20=20=20=20=20=20 253M when I avoid defering copies to ESCAPED itself (that makes the first ESCAPED =3D *ESCAPED do more but delays increasing other solutions to followup iterations). In fact it probably special-cases the ESCAPED =3D *ESCAPED complex constraint which we also locally iterate until convergence before visiting other constraints. Note the initial graph tends to be very unconnected thus the order is somewhat random but the way we compute the topological sorting and process the worklist ensures ESCAPED =3D *ESCAPED is done as something very last in the first iteration (if there are no existing edges to ESCAPED - I think we do have those for calls though). It does help to order the ESCAPED connected component first: Points-to Stats: Total vars: 108033 Non-pointer vars: 59 Statically unified vars: 20147 Dynamically unified vars: 0 Iterations: 3 Number of edges: 239596 Number of implicit edges: 215676 Points to sets created:15930 ipa points-to : 0.28 ( 10%) 0.05 ( 8%) 0.34 ( = 10%) 2191k ( 1%) TOTAL : 2.85 0.60 3.47=20= =20=20=20=20=20=20 253M I'm testing this approach now.=