From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id D907E3858D28; Fri, 10 Feb 2023 14:05:59 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org D907E3858D28
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1676037959;
	bh=9WEQdMY6FV6XOmIDg0Pq6NBxT5GIUoVhkZ/jHX4z0II=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=Izu8dB0D5UJpAbiAn3f1TT3c2lth/f+mkRoryZT+ybOb8njHCoNuVWcBsSKw9D5W3
	 K5niHWLg7d2k34cn1eCgwpIbWWpJO7eXT8VB2ZFHWKHIaHNCiz+5K7KR9Rx90KiNX7
	 13o1Jvo8RRkuHAOoE+WZMsil+Qw/Pjiw18BVuF5E=
From: "vmakarov at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/108500] [11/12 Regression] -O
 -finline-small-functions results in "internal compiler error: Segmentation
 fault" on a very large program (700k function calls)
Date: Fri, 10 Feb 2023 14:05:57 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 12.2.0
X-Bugzilla-Keywords: compile-time-hog, ice-on-valid-code, memory-hog
X-Bugzilla-Severity: normal
X-Bugzilla-Who: vmakarov at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: rguenth at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 11.4
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-108500-4-Lx228mwY5c@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-108500-4@http.gcc.gnu.org/bugzilla/>
References: <bug-108500-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108500
--- Comment #20 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #14)
> Thanks for the new testcase.  With -O0 (and a --enable-checking=release
> built compiler) this builds in ~11 minutes (on a Ryzen 9 7900X) with
>
>  integrated RA                      :  38.96 (  6%)   1.94 ( 20%)  42.00 (  6%)  3392M ( 23%)
>  LRA non-specific                   :  18.93 (  3%)   1.24 ( 13%)  23.78 (  4%)   450M (  3%)
>  LRA virtuals elimination           :   5.67 (  1%)   0.05 (  1%)   5.75 (  1%)   457M (  3%)
>  LRA reload inheritance             : 318.25 ( 49%)   0.24 (  2%) 318.51 ( 48%)     0  (  0%)
>  LRA create live ranges             : 199.24 ( 31%)   0.12 (  1%) 199.38 ( 30%)   228M (  2%)
> 645.67user 10.29system 11:04.42elapsed 98%CPU (0avgtext+0avgdata 30577844maxresident)k
> 3936200inputs+1091808outputs (122053major+10664929minor)pagefaults 0swaps
>

I tried test-1M.i with -O0 using clang-14.  It took about 12 hours on an
E5-2697 v3, versus about 30 minutes for GCC.  Almost all of clang's time (99%)
is spent in the "fast register allocator":

  Total Execution Time: 42103.9395 seconds (42243.9819 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  41533.7657 ( 99.5%)  269.5347 ( 78.6%)  41803.3005 ( 99.3%)  41942.4177 ( 99.3%)  Fast Register Allocator
  139.1669 (  0.3%)  16.4785 (  4.8%)  155.6454 (  0.4%)  156.3196 (  0.4%)  X86 DAG->DAG Instruction Selection
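
As a quick sanity check on the figures in the clang report, the totals can be
converted to hours and the allocator's share recomputed.  This is only a small
Python sketch over the numbers printed in the report; the helper name `hours`
and the variable names are mine, and the 30 minute GCC figure is the rough
estimate from this comment, not a measured value.

```python
def hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / 3600.0

# Figures copied from the clang timing report.
total_s = 42103.9395      # Total Execution Time (user+system)
fast_ra_s = 41803.3005    # User+System time of "Fast Register Allocator"

# Rough GCC figure from the text ("about 30min"); an approximation.
gcc_s = 30 * 60

total_hours = hours(total_s)       # a little under 12 hours
ra_share = fast_ra_s / total_s     # fraction of time spent in the allocator
slowdown = total_s / gcc_s         # clang time relative to GCC, roughly

print(f"clang total: {total_hours:.1f} h, RA share: {ra_share:.1%}, "
      f"vs GCC: ~{slowdown:.0f}x")
```

The ratio is only as good as the "about 30min" estimate, so it should be read
as an order of magnitude rather than a benchmark result.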

I tried the same with -O1.  Again GCC took about 30 minutes, and I stopped
clang (which uses a different RA algorithm there) after 120 hours.

So the situation with RA is not so bad for GCC.  But in any case I'll try to
improve the speed for this case.

> so register allocation taking all of the time.  There's maybe the possibility
> to gate some of its features on the # of BBs or insns (or whatever the actual
> "bad" thing is - I didn't look closer yet).
>
> It also seems to use 30GB of peak memory at -O0 ...
>

I see only 3GB.  Improving this is a hard task.  IRA at -O0 uses a very simple
algorithm that needs very few resources.  We could use an even simpler method
(assigning memory to all pseudos), but I don't think it is worth doing, as the
generated code would be much bigger and probably 1.5-2 times slower.
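
To illustrate why giving every pseudo a memory slot would produce bigger code,
here is a toy sketch in Python.  It is not GCC code and does not reflect how
IRA is implemented: the `Insn` type, the `spill_everything` helper, the
scratch registers r0/r1 and the 8-byte slot size are all hypothetical, chosen
only to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    """Toy three-address instruction: dst = op(src1, src2) over pseudos."""
    op: str
    dst: str
    src1: str
    src2: str


def spill_everything(insns, word_size=8):
    """Give every pseudo its own stack slot; wrap each insn in loads/stores."""
    slots = {}

    def slot(pseudo):
        # Allocate a fresh stack offset the first time a pseudo is seen.
        if pseudo not in slots:
            slots[pseudo] = len(slots) * word_size
        return slots[pseudo]

    out = []
    for insn in insns:
        # Both operands are reloaded from memory before every use ...
        out.append(f"load  r0, [sp+{slot(insn.src1)}]")
        out.append(f"load  r1, [sp+{slot(insn.src2)}]")
        out.append(f"{insn.op:<5} r0, r0, r1")
        # ... and the result is written back to memory immediately.
        out.append(f"store [sp+{slot(insn.dst)}], r0")
    return out, slots


program = [
    Insn("add", "p2", "p0", "p1"),
    Insn("mul", "p3", "p2", "p0"),
]
lowered, slots = spill_everything(program)
print("\n".join(lowered))
```

In this model no allocation decisions are needed at all, but every insn turns
into four, and values such as p0 are reloaded on each use.  The 4x growth is
an artifact of the toy instruction set, not a measurement of GCC output.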