From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 57D713858C2D; Thu,  2 Feb 2023 10:22:58 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 57D713858C2D
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1675333378;
	bh=AohORTa8QrnvQPaaqJy0YdMPAQ4Td0edmo/ME/hQ7y8=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=lcSh6hYaeBNCMpNWMJOqgDwAl9TigKYDe78FNWP/UpOOcCHhT11/G8SDvmtgr/XA3
	 dqVPie4bwA+4V3DljsXfdP21+8Kfh6jIrJqx+YL2lj7Rba99h0yug8YBc8x8zkneuV
	 6rTyOG7gVIDuWo75ztJqK43a1qAayg5+L8jmrnkc=
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/108500] [11/12 Regression] -O
 -finline-small-functions results in "internal compiler error: Segmentation
 fault" on a very large program (700k function calls)
Date: Thu, 02 Feb 2023 10:22:55 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 12.2.0
X-Bugzilla-Keywords: compile-time-hog, ice-on-valid-code, memory-hog
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: rguenth at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 11.4
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: cc
Message-ID: <bug-108500-4-GM1K1pTq0l@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-108500-4@http.gcc.gnu.org/bugzilla/>
References: <bug-108500-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108500

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamborm at gcc dot gnu.org,
                   |                            |vmakarov at gcc dot gnu.org
--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
Thanks for the new testcase.  With -O0 (and a --enable-checking=release built
compiler) this builds in ~11 minutes (on a Ryzen 9 7900X) with

 integrated RA                      :  38.96 (  6%)   1.94 ( 20%)  42.00 (  6%)  3392M ( 23%)
 LRA non-specific                   :  18.93 (  3%)   1.24 ( 13%)  23.78 (  4%)   450M (  3%)
 LRA virtuals elimination           :   5.67 (  1%)   0.05 (  1%)   5.75 (  1%)   457M (  3%)
 LRA reload inheritance             : 318.25 ( 49%)   0.24 (  2%) 318.51 ( 48%)     0  (  0%)
 LRA create live ranges             : 199.24 ( 31%)   0.12 (  1%) 199.38 ( 30%)   228M (  2%)
645.67user 10.29system 11:04.42elapsed 98%CPU (0avgtext+0avgdata 30577844maxresident)k
3936200inputs+1091808outputs (122053major+10664929minor)pagefaults 0swaps

so register allocation accounts for about 90% of the compile time, mostly
in LRA reload inheritance and live range creation.  It may be possible to
gate some of its features on the number of BBs or insns (or whatever the
actual "bad" thing is - I didn't look closer yet).

It also seems to use 30GB of peak memory at -O0 ...

For -O the situation is "better":

 tree PTA                           : 987.21 ( 99%)   0.41 ( 12%) 987.70 ( 99%)   128  (  0%)
992.56user 3.53system 16:36.20elapsed 99%CPU (0avgtext+0avgdata 2968740maxresident)k
42576inputs+8outputs (28major+717414minor)pagefaults 0swaps

which suggests a clear workaround, -fno-tree-pta, which makes it compile
in 5s for me.

With -O -finline-small-functions -fno-tree-pta we get a very high
compile time in SRA's propagate_all_subaccesses, which probably sees
a very large struct copy chain

  tem1 = s2;
  s2 = tem1;
  tem2 = s2;
  s2 = tem2;
...

and somehow ends up quadratic.  I tried switching the candidate_bitmap
to tree form at the start of propagate_all_subaccesses, but the tree-form
bitmap doesn't help.  I guess we end up queueing all elements in the copy
chain to the worklist and, via the chains, end up with an O(n^2) working
set.  The testcase can probably be shortened to get at this problem.  SRA
is actually quite important here, so disabling SRA as a workaround doesn't
look like it would improve the situation a lot.

Still, with -fno-tree-sra added we get good compile time, DCE/DSE remove
all code, and -fno-tree-pta isn't required.

Martin, can you look at the SRA issue?  Do you want me to create a separate
bug report for this?  The IL into SRA looks like

  <bb 2> :
  s2D.2755 = {};
  s1D.2756 = {};
  _unusedD.2002766 = s1D.2756;
  sD.2002767 = s2D.2755;
  s2D.2755 = sD.2002767;
  _unusedD.2002766 ={v} {CLOBBER(eol)};
  sD.2002767 ={v} {CLOBBER(eol)};
  _unusedD.2002764 = s1D.2756;
  sD.2002765 = s2D.2755;
  s2D.2755 = sD.2002765;
  _unusedD.2002764 ={v} {CLOBBER(eol)};
  sD.2002765 ={v} {CLOBBER(eol)};
  _unusedD.2002762 = s1D.2756;
  sD.2002763 = s2D.2755;
  s2D.2755 = sD.2002763;
  _unusedD.2002762 ={v} {CLOBBER(eol)};
  sD.2002763 ={v} {CLOBBER(eol)};
  _unusedD.2002760 = s1D.2756;
  sD.2002761 = s2D.2755;
  s2D.2755 = sD.2002761;
  _unusedD.2002760 ={v} {CLOBBER(eol)};
  sD.2002761 ={v} {CLOBBER(eol)};
...
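
For illustration, a source shape that would plausibly produce IL like the
dump above after inlining is sketched here.  This is a hypothetical
reduction, not the reporter's testcase: the struct layout and the names S,
step and chain are assumptions, and only the variable names s1, s2, s and
_unused are taken from the dump.  Each inlined call contributes one
"_unused = s1; s = s2; s2 = s;" group to the copy chain.

```c
/* Hypothetical reduction; S, step and chain are illustrative names. */
struct S { int a; int b; };

/* Takes two structs by value and returns the second one unchanged.
   Once inlined, every call turns into plain aggregate copies. */
static inline struct S step(struct S _unused, struct S s)
{
  (void) _unused;
  return s;
}

int chain(void)
{
  struct S s2 = {0};
  struct S s1 = {0};

  s2 = step(s1, s2);
  s2 = step(s1, s2);
  s2 = step(s1, s2);
  s2 = step(s1, s2);
  /* ... the reported program repeats this for ~700k calls ... */

  return s2.a + s2.b;
}
```

Scaling the number of repeated calls in such a file should make it possible
to check whether compile time in SRA grows faster than linearly, assuming
the sketch really exercises the same path as the original testcase.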