From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48) id 9A3CD3858D28; Thu, 28 Mar 2024 10:39:11 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 9A3CD3858D28
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1711622351; bh=3W7yzsiRss2gL0lMb3JZEsjWm1lKKVeowQWNN0VC7cs=; h=From:To:Subject:Date:In-Reply-To:References:From; b=L+XCjXsxP/GUCN18Xowm0QA34oOGUdm4zQ3CBUhNZU9QkDqxipUk86ZvRIsN6eOM2 oV880fxzcEilgpoKOaY/f4+0KVq5OHrqRjCTVGFshmkxz0Ej87F4EgKUgE08VPjv5o 7Pj/waXbvY+nlFBgtGqSTyxNR2ncK7Qtwa+qdkEk=
From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug c++/114480] g++: internal compiler error: Segmentation fault signal terminated program cc1plus
Date: Thu, 28 Mar 2024 10:39:11 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: c++
X-Bugzilla-Version: 11.4.0
X-Bugzilla-Keywords: compile-time-hog, memory-hog, ra
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: rguenth at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: assigned_to bug_status
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114480

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org   |rguenth at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #15 from Richard Biener ---
(In reply to Richard Biener from comment #14)
> Created attachment 57829 [details]
> smaller testcase
>
> Smaller testcase, shows the same compile-time
issue at -O0.  At -O1 it's a lot
> less bad but memory usage is better (8GB), so the slowness of the full testcase
> is likely memory bandwidth related.
>
> -O1 is then
>
>  tree PTA                           :  20.59 ( 21%)
>  expand vars                        :   9.19 (  9%)
>  expand                             :  14.26 ( 15%)

The memory use goes into RTXen created during RTL expansion.  The compile-time
part is add_scope_conflicts.  One possibility is to do as var-tracking does and
use rev_post_order_and_mark_dfs_back_seme, avoiding iteration for non-loops and
getting better cache locality.

We have half of the profile hits on ggc_internal_alloc and it's

    17 |   d8:+- mov    %r14,%rax
       |      |  mov    (%r14),%r14
  1440 |      |  test   %r14,%r14
     4 |      |  je     530
       |      |if (p->bytes == entry_size)
       |   e7:|  cmp    0x10(%r14),%r12
 65582 |      +--jne    d8

which is the linear walk

  /* Check the list of free pages for one we can use.  */
  for (pp = &G.free_pages, p = *pp; p; pp = &p->next, p = *pp)
    if (p->bytes == entry_size)
      break;

so we seem to have many free pages for some reason, but the free pages pool
is global and not per order?!

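To illustrate the "global and not per order" remark, here is a minimal sketch of a free-page pool that is bucketed by size, so a request never has to walk past pages of other sizes. This is not the ggc-page code: `Page`, `BucketedFreePool`, `release` and `take` are hypothetical names made up for the illustration, and keying on the byte size rather than on an order index is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a free page; only the size matters here.
struct Page {
    std::size_t bytes;
};

// Free pages kept in one bucket per size instead of one global list.
class BucketedFreePool {
public:
    // Put a page back into the bucket matching its size.
    void release(Page* p) {
        assert(p != nullptr);
        buckets_[p->bytes].push_back(p);
    }

    // Return a free page of exactly `bytes` bytes, or nullptr if none.
    // Cost is one hash lookup plus a pop, independent of how many
    // free pages of other sizes exist.
    Page* take(std::size_t bytes) {
        auto it = buckets_.find(bytes);
        if (it == buckets_.end() || it->second.empty())
            return nullptr;
        Page* p = it->second.back();
        it->second.pop_back();
        return p;
    }

private:
    std::unordered_map<std::size_t, std::vector<Page*>> buckets_;
};
```

With a single shared list the lookup cost grows with the number of free pages of the wrong size, which is what the hot cmp/jne loop above is paying for; with buckets that cost goes away at the price of a little bookkeeping per size.
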
Samples: 299K of event 'cycles', Event count (approx.): 338413178083
Overhead       Samples  Command  Shared Object  Symbol
  23.16%         67756  cc1plus  cc1plus        [.] ggc_internal_alloc
   6.98%         21637  cc1plus  cc1plus        [.] bitmap_tree_splay
   6.89%         20413  cc1plus  cc1plus        [.] bitmap_ior_into
   4.05%         11989  cc1plus  cc1plus        [.] bitmap_elt_ior
   3.16%          9840  cc1plus  cc1plus        [.] mergesort
   2.90%          8860  cc1plus  cc1plus        [.] bitmap_set_bit
   2.76%          8281  cc1plus  cc1plus        [.] get_ref_base_and_extent
   1.37%          4071  cc1plus  cc1plus        [.] stmt_may_clobber_ref_p_1
   1.32%          4095  cc1plus  cc1plus        [.] dominated_by_p
   1.16%          3597  cc1plus  cc1plus        [.] bitmap_tree_unlink_element
   1.06%          3128  cc1plus  cc1plus        [.] walk_aliased_vdefs_1

The bitmap_tree_splay is from compute_idf; refactoring that some more, also
avoiding the duplicate processing and doing away with the bitmap for the
workset, might help a bit there (not using the tree view just moves set-bit up,
with no overall positive change).

I will look into the above things more (but not the RA slowness at -O0).
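
As a rough illustration of the workset idea above, the sketch below computes an iterated dominance frontier with a plain vector as the work stack and a per-block flag instead of a bitmap, so each block is processed at most once. It is a generic textbook formulation, not GCC's compute_idf: the function name `iterated_df`, the precomputed `df` table and the use of block indices are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// df[b] lists the dominance frontier of block b (assumed precomputed).
// Returns the iterated dominance frontier of def_blocks, in discovery order.
std::vector<std::size_t>
iterated_df(const std::vector<std::vector<std::size_t>>& df,
            const std::vector<std::size_t>& def_blocks) {
    std::vector<char> queued(df.size(), 0);     // block was ever pushed
    std::vector<char> in_result(df.size(), 0);  // block is in the IDF
    std::vector<std::size_t> work;              // vector used as a stack
    std::vector<std::size_t> result;

    for (std::size_t b : def_blocks) {
        assert(b < df.size());
        if (!queued[b]) {
            queued[b] = 1;
            work.push_back(b);
        }
    }

    while (!work.empty()) {
        std::size_t b = work.back();
        work.pop_back();
        for (std::size_t f : df[b]) {
            assert(f < df.size());
            if (in_result[f])
                continue;
            in_result[f] = 1;
            result.push_back(f);
            // The flag is never cleared, so no block is processed twice.
            if (!queued[f]) {
                queued[f] = 1;
                work.push_back(f);
            }
        }
    }
    return result;
}
```

For a diamond CFG where blocks 1 and 2 both have block 3 as their frontier, passing definitions in blocks 1 and 2 yields just block 3. The flag arrays cost one byte per block, which trades some memory for constant-time membership tests without any tree or list manipulation.
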