From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id 1F5D63858D33; Wed, 8 Feb 2023 00:00:27 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 1F5D63858D33 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1675814427; bh=IPC4dknwlH8uAff1jvkBrqiFslGm7NVkqsD94qSBdB4=; h=From:To:Subject:Date:From; b=GcFQXXao7sPfoe/4yyOliuiad3sHI5kdufgnEvAPOq7fsMAzOmu/Z+WcYKgFGmL5F qfsEvvUV83ELoRA07O8rsjF+njeNtF479wghZF73j0Q1oFX9ZmcLNhzB8f5qb7jG5I Z2d5JUNqrTMhXc05wXEZBQyJg97mDCtIIa78dxtY= From: "rimvydas.jas at gmail dot com" To: gcc-bugs@gcc.gnu.org Subject: [Bug tree-optimization/108705] New: Unexpected CPU time usage with LTO in ranger propagation Date: Wed, 08 Feb 2023 00:00:25 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: new X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: tree-optimization X-Bugzilla-Version: 13.0 X-Bugzilla-Keywords: X-Bugzilla-Severity: normal X-Bugzilla-Who: rimvydas.jas at gmail dot com X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Resolution: X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter target_milestone Message-ID: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 List-Id: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D108705 Bug ID: 108705 Summary: Unexpected CPU time usage with LTO in ranger propagation Product: gcc Version: 13.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: rimvydas.jas at gmail dot com Target Milestone: --- Very trivialized reduced testcase that still works with --enable-checking=3Drelease configured trunk. $ cat hog.f90 # foo() and bar() are in separate units in original case subroutine bar(n,m,p,s) ! in bar.f90 implicit none integer :: n,m real,intent(inout) :: p(n),s(*) !real,intent(inout) :: p(:),s(:) ! gives slower growth call foo(n,m,p,s) call foo(n,m,p,s) call foo(n,m,p,s) call foo(n,m,p,s) call foo(n,m,p,s) call foo(n,m,p,s) call foo(n,m,p,s) call foo(n,m,p,s) call foo(n,m,p,s) call foo(n,m,p,s) call foo(n,m,p,s) call foo(n,m,p,s) call foo(n,m,p,s) call foo(n,m,p,s) call foo(n,m,p,s) ! ... !call foo(n,m,p,s) end subroutine bar subroutine foo(n,m,p,b) ! in foo.f90 implicit none integer :: n,m,j real,intent(inout) :: p(n),b(*) do j=3D1,n b(m+j-1)=3Dp(j) enddo m=3Dm+n ! <---- problematic part end subroutine foo $ gfortran -Wall -Wextra -O2 -flto -c hog.f90 # mimic ccBemfcW.ltrans23.o $ lto1 -ftime-report -fchecking=3D0 hog.o # -fltrans /tmp/ccBemfcW.ltrans23= .o Reading object files: hog.o {GC 2518k} {heap 1028k} Reading the symbol table: Merging declarations: {GC 2520k} {heap 1028k} Reading summaries: {GC 2520k} {heap 1028k} {GC 25= 20k} {heap 1028k} {GC 2520k} {heap 1028k} {GC 2530k} {heap 1168k} {GC 2530k} {heap 1168k} {GC 2532k} {heap 116= 8k} {GC 2532k}=20 Merging symbols: {heap 1168k}Reading function bodies: Performing interprocedural optimizations {heap 1168k} {heap 1168k} {heap 1168k} {heap 1360k} {heap 1360k} {heap 1360k} {= heap 1360k} {heap 1360k} {heap 1360k} {heap 13= 60k} {heap 1360k} {heap 1360k} {heap 1360= k} {heap 1360k} {heap 1360k}Assembling functions: {heap 1360k} foo in:foo_ bar in:bar_ Time variable usr sys = wall GGC phase setup : 0.00 ( 0%) 0.00 ( 0%) 0.00 ( = 0%) 2583k ( 67%) phase opt and generate :1490.95 (100%) 0.00 ( 0%)1491.02 (1= 00%) 1230k ( 32%) callgraph functions expansion :1490.95 (100%) 0.00 ( 0%)1491.02 (1= 00%) 1201k ( 31%) tree VRP : 5.08 ( 0%) 0.00 ( 0%) 5.07 ( = 0%) 84k ( 2%) dominator optimization :1485.85 (100%) 0.00 ( 0%)1485.92 (1= 00%) 16k ( 0%) tree canonical iv : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( = 0%) 71k ( 2%) tree loop distribution : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( = 0%) 115k ( 3%) dead store elim1 : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( = 0%) 2304 ( 0%) combiner : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( = 0%) 42k ( 1%) initialize rtl : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( = 0%) 12k ( 0%) TOTAL :1490.95 0.00 1491.02=20= =20=20=20=20=20=20 3846k For 10+ calls adding extra new call adds to total time: 0.21, 1.07, 6.29, 38.74, 238.24, 1490.95, 9424.12, ... seconds of CPU time. This is not a problem for non-LTO builds since objects need to be compiled = only once for all executables in the project. However with LTO it means that for any executable having problematic subroutines in call graph *and* having su= ch unit pairs in the same ltrans partition would need to be compiled from scra= tch over and over. In original case final LTO link of a first executable with -flto=3D20 was still compiling last 3 ltrans partitions after 25h+ (gcc-12 = is fine). It is quite hard to say where lto1 gets "stuck" (as in infinite loop, no new output gets added to assembly outputs either). One has to ps(1) the full command line, grab /tmp/ccBemfcW.ltrans23.o and manually invoke lto1 to see where problematic code units could be. Also there are no support for attributes to deal with such problems e.g.: !GCC$ ATTRIBUTES noclone,noinline :: foo while LTO is getting better at detecting "strategically placed debug write() statements".=