public inbox for gcc-bugs@sourceware.org
* [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program
@ 2010-09-27 13:13 burnus at gcc dot gnu.org
  2010-09-27 15:48 ` [Bug lto/45810] " Joost.VandeVondele at pci dot uzh.ch
                   ` (26 more replies)
  0 siblings, 27 replies; 28+ messages in thread
From: burnus at gcc dot gnu.org @ 2010-09-27 13:13 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

           Summary: 40% slowdown when using LTO for a single-file program
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: lto
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: burnus@gcc.gnu.org


That's on an Intel Core(TM)2 Duo CPU E8400 @ 3.00GHz and using CentOS Linux 5.5
(x86-64)  with glibc-2.5-49.el5_5.2, binutils-2.17.50.0.6-14.el5 and
gcc version 4.6.0 20100921 (experimental) [trunk revision 164472] (GCC)

The performance of the fatigue test case from the Polyhedron suite drops by
40% if one enables LTO (-flto in addition to -fwhole-program):

gfortran -march=native -ffast-math -funroll-loops -fwhole-program -fno-protect-parens -O3

real    0m5.115s / user    0m5.071s / sys     0m0.015s

gfortran -march=native -ffast-math -funroll-loops -flto -fwhole-program -fno-protect-parens -O3

real    0m7.225s / user    0m7.129s / sys     0m0.017s
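
As a quick check of the headline figure, the two user times above can be turned into a relative slowdown. This is a minimal Python sketch; the function name `slowdown_percent` is made up for illustration and is not part of any benchmark harness.

```python
def slowdown_percent(baseline_s: float, candidate_s: float) -> float:
    """Relative slowdown of candidate versus baseline, in percent."""
    return (candidate_s / baseline_s - 1.0) * 100.0


# User times reported above, in seconds.
no_lto_user = 5.071
with_lto_user = 7.129

# Prints "LTO slowdown: 40.6%"
print(f"LTO slowdown: {slowdown_percent(no_lto_user, with_lto_user):.1f}%")
```

Using the user time rather than the wall-clock time keeps the comparison less sensitive to other load on the machine.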

For the other test cases, the results with and without LTO are mostly similar,
though the non-LTO version tends to be slightly slower. Note that other
programs are running on the machine now, so the results are not 100%
comparable with my previous ones at
https://users.physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/iff/



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
@ 2010-09-27 15:48 ` Joost.VandeVondele at pci dot uzh.ch
  2010-09-27 15:54 ` rguenth at gcc dot gnu.org
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Joost.VandeVondele at pci dot uzh.ch @ 2010-09-27 15:48 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Joost VandeVondele <Joost.VandeVondele at pci dot uzh.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Joost.VandeVondele at pci
                   |                            |dot uzh.ch

--- Comment #1 from Joost VandeVondele <Joost.VandeVondele at pci dot uzh.ch> 2010-09-27 10:39:05 UTC ---
I have observed a similar 40% slowdown in CP2K as a result of LTO. I haven't
investigated it yet.



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
  2010-09-27 15:48 ` [Bug lto/45810] " Joost.VandeVondele at pci dot uzh.ch
@ 2010-09-27 15:54 ` rguenth at gcc dot gnu.org
  2010-09-28 15:35 ` burnus at gcc dot gnu.org
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: rguenth at gcc dot gnu.org @ 2010-09-27 15:54 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #2 from Richard Guenther <rguenth at gcc dot gnu.org> 2010-09-27 10:48:33 UTC ---
For single-file programs -fwhole-program and -flto should be basically
equivalent if the Frontend provides correctly merged decls.  I suppose
it does not and thus we do less inlining with -fwhole-program compared
to -flto.



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
  2010-09-27 15:48 ` [Bug lto/45810] " Joost.VandeVondele at pci dot uzh.ch
  2010-09-27 15:54 ` rguenth at gcc dot gnu.org
@ 2010-09-28 15:35 ` burnus at gcc dot gnu.org
  2010-09-28 16:24 ` rguenth at gcc dot gnu.org
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: burnus at gcc dot gnu.org @ 2010-09-28 15:35 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Tobias Burnus <burnus at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org

--- Comment #3 from Tobias Burnus <burnus at gcc dot gnu.org> 2010-09-28 12:23:06 UTC ---
(In reply to comment #2)
> For single-file programs -fwhole-program and -flto should be basically
> equivalent if the Frontend provides correctly merged decls.  I suppose
> it does not and thus we do less inlining with -fwhole-program compared
> to -flto.

It might well be the reason that one does less inlining without LTO - but
that's then not only a FE bug (not correctly merged decls) but also a ME/target
bug as the LTO program is _slower_.


Cf. also PR 44334, which is about a -fwhole-program slowdown (w/ and w/o
-flto). For the latter program, it helped to use "--param
hot-bb-frequency-fraction=2000". However, for this PR, the option does not seem
to help.



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2010-09-28 15:35 ` burnus at gcc dot gnu.org
@ 2010-09-28 16:24 ` rguenth at gcc dot gnu.org
  2010-09-28 16:25 ` Joost.VandeVondele at pci dot uzh.ch
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: rguenth at gcc dot gnu.org @ 2010-09-28 16:24 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #4 from Richard Guenther <rguenth at gcc dot gnu.org> 2010-09-28 13:38:58 UTC ---
(In reply to comment #3)
> (In reply to comment #2)
> > For single-file programs -fwhole-program and -flto should be basically
> > equivalent if the Frontend provides correctly merged decls.  I suppose
> > it does not and thus we do less inlining with -fwhole-program compared
> > to -flto.
> 
> It might well be the reason that one does less inlining without LTO - but

more inlining with LTO.  You read my stmt wrong.

> that's then not only a FE bug (not correctly merged decls) but also a ME/target
> bug as the LTO program is _slower_.

Sure.  As with all performance related bugs this needs analysis and is
unlikely an "LTO" problem - LTO does not (not-)optimize, optimization
passes do.

> 
> Cf. also PR 44334, which is about a -fwhole-program slowdown (w/ and w/o
> -flto). For the latter program, it helped to use "--param
> hot-bb-frequency-fraction=2000". However, for this PR, the option does not seem
> to help.



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2010-09-28 16:24 ` rguenth at gcc dot gnu.org
@ 2010-09-28 16:25 ` Joost.VandeVondele at pci dot uzh.ch
  2010-09-28 16:50 ` rguenth at gcc dot gnu.org
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Joost.VandeVondele at pci dot uzh.ch @ 2010-09-28 16:25 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #5 from Joost VandeVondele <Joost.VandeVondele at pci dot uzh.ch> 2010-09-28 13:58:18 UTC ---
(In reply to comment #4)
> Sure.  As with all performance related bugs this needs analysis and is
> unlikely an "LTO" problem - LTO does not (not-)optimize, optimization
> passes do.

I'm wondering if there is any description on how to do this. For example, how
do I get the assembly of a function and the -fdump-tree-all files from a gold
based linking that goes as:

rm -f test.s test2.s test.o test2.o ;
gfortran -c -flto test.f90 ; 
gfortran -c -flto test2.f90 ;  
gfortran -O3 -march=native -fuse-linker-plugin -fwhopr=2 test.o test2.o

just using -S or -fdump-tree-all doesn't work. 

Is 'objdump -d' the only tool ?



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2010-09-28 16:50 ` rguenth at gcc dot gnu.org
@ 2010-09-28 16:50 ` Joost.VandeVondele at pci dot uzh.ch
  2010-09-28 16:55 ` burnus at gcc dot gnu.org
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: Joost.VandeVondele at pci dot uzh.ch @ 2010-09-28 16:50 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #7 from Joost VandeVondele <Joost.VandeVondele at pci dot uzh.ch> 2010-09-28 14:19:38 UTC ---
(In reply to comment #6)
> No, -fdump-tree-all works

great... I forgot to look in /tmp, and -save-temps also works fine.



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2010-09-28 16:25 ` Joost.VandeVondele at pci dot uzh.ch
@ 2010-09-28 16:50 ` rguenth at gcc dot gnu.org
  2010-09-28 16:50 ` Joost.VandeVondele at pci dot uzh.ch
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: rguenth at gcc dot gnu.org @ 2010-09-28 16:50 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #6 from Richard Guenther <rguenth at gcc dot gnu.org> 2010-09-28 14:07:54 UTC ---
(In reply to comment #5)
> (In reply to comment #4)
> > Sure.  As with all performance related bugs this needs analysis and is
> > unlikely an "LTO" problem - LTO does not (not-)optimize, optimization
> > passes do.
> 
> I'm wondering if there is any description on how to do this. For example, how
> do I get the assembly of a function and the -fdump-tree-all files from a gold
> based linking that goes as:
> 
> rm -f test.s test2.s test.o test2.o ;
> gfortran -c -flto test.f90 ; 
> gfortran -c -flto test2.f90 ;  
> gfortran -O3 -march=native -fuse-linker-plugin -fwhopr=2 test.o test2.o
> 
> just using -S or -fdump-tree-all doesn't work. 
> 
> Is 'objdump -d' the only tool ?

No, -fdump-tree-all works; it just uses base names that may be unintuitive.
Append -v to see them. For -fwhopr the base name should be the output file
specified with -o (you leave it out, which causes us to use not a.out but
some temporary file in /tmp); with -o t I get
t.ltrans[01].147t.optimized, etc.  With -flto it's just t.147t.optimized.
To retain the assembler output you have to use -save-temps, which retains
t.ltrans[01].s; with -flto it retains t1.s (using the base of the first
object file).
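
The naming scheme described above can be summarised in a few lines of Python. This is only a sketch of the pattern reported in this comment: the helper name `expected_dump_names` is hypothetical, and the pass number in `147t.optimized` is simply the one observed here, so it is likely to differ between GCC revisions.

```python
def expected_dump_names(output_base, pass_dump="147t.optimized", ltrans_units=None):
    """Return the dump-file names reported in this comment.

    ltrans_units=None models plain -flto (a single unit named after the
    output file); an integer models -fwhopr with that many LTRANS units.
    The default pass_dump is the value seen in this thread, not a constant.
    """
    if ltrans_units is None:
        return [f"{output_base}.{pass_dump}"]
    return [f"{output_base}.ltrans{i}.{pass_dump}" for i in range(ltrans_units)]


# With "-o t" and two LTRANS units, as in the example above.
print(expected_dump_names("t", ltrans_units=2))
# With plain -flto and "-o t".
print(expected_dump_names("t"))
```

Without -o, the base name is a temporary file in /tmp, which is why the dumps are easy to miss.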



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2010-09-28 16:50 ` Joost.VandeVondele at pci dot uzh.ch
@ 2010-09-28 16:55 ` burnus at gcc dot gnu.org
  2010-09-30  3:27 ` dominiq at lps dot ens.fr
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: burnus at gcc dot gnu.org @ 2010-09-28 16:55 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #8 from Tobias Burnus <burnus at gcc dot gnu.org> 2010-09-28 14:57:34 UTC ---
Using -fno-inline-functions, the program recovers the speed of the no-LTO
version.

Notes from #gcc:
(dominiq) For fatigue the key for speed-up is inlining of
generalized_hookes_law and you need -finline-limit=400
(richi) "Considering inline candidate generalized_hookes_law." / "Inlining
failed: --param max-inline-insns-auto limit reached"



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2010-09-28 16:55 ` burnus at gcc dot gnu.org
@ 2010-09-30  3:27 ` dominiq at lps dot ens.fr
  2010-09-30 19:54 ` dominiq at lps dot ens.fr
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dominiq at lps dot ens.fr @ 2010-09-30  3:27 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #9 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2010-09-29 20:27:36 UTC ---
(In reply to comment #8)
> Using -fno-inline-functions, the program recovers the speed of the no-LTO
> version.

This is weird!-( I have done the following profiling and it shows that -flto
prevents the inlining of __perdida_m_MOD_perdida, while -fno-inline-functions
restores it. This contradicts what the manual says:

-finline-functions
Integrate all simple functions into their callers. The compiler heuristically
decides which functions are simple enough to be worth integrating in this way.

Note also that in order to inline __perdida_m_MOD_generalized_hookes_law one
needs -finline-limit=600 (actually some number between 300 and 400).


[macbook] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g fatigue.f90
[macbook] lin/test% time a.out > /dev/null
6.547u 0.024s 0:06.57 99.8%    0+0k 0+2io 0pf+0w

+ 70.8%, MAIN__, a.out
| + 10.1%, free, libSystem.B.dylib
| |   7.9%, szone_size, libSystem.B.dylib
| + 8.0%, malloc, libSystem.B.dylib
| | + 6.4%, malloc_zone_malloc, libSystem.B.dylib
| | |   4.4%, szone_malloc_should_clear, libSystem.B.dylib
| | |   0.4%, szone_malloc, libSystem.B.dylib
| |   0.4%, dyld_stub_malloc_zone_malloc, libSystem.B.dylib
| |   0.1%, szone_malloc_should_clear, libSystem.B.dylib
|   4.1%, szone_free_definite_size, libSystem.B.dylib
|   2.4%, cosisin, libSystem.B.dylib
| + 0.7%, cexp, libSystem.B.dylib
| |   0.1%, exp$fenv_access_off, libSystem.B.dylib
| |   0.0%, dyld_stub_exp, libSystem.B.dylib
  27.2%, __perdida_m_MOD_generalized_hookes_law, a.out
  0.5%, dyld_stub_malloc, a.out
  0.4%, free, libSystem.B.dylib
  0.4%, dyld_stub_free, a.out
  0.4%, szone_free_definite_size, libSystem.B.dylib
  0.2%, malloc, libSystem.B.dylib
  0.1%, dyld_stub_cexp, a.out
  0.0%, cexp, libSystem.B.dylib

[macbook] lin/test% gfc -Ofast -funroll-loops -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
9.013u 0.027s 0:09.04 99.8%    0+0k 0+2io 0pf+0w

+ 64.8%, __perdida_m_MOD_perdida, a.out    <-------
| + 6.8%, free, libSystem.B.dylib
| |   4.9%, szone_size, libSystem.B.dylib
| + 5.2%, malloc, libSystem.B.dylib
| | + 4.1%, malloc_zone_malloc, libSystem.B.dylib
| | |   2.5%, szone_malloc_should_clear, libSystem.B.dylib
| | |   0.5%, szone_malloc, libSystem.B.dylib
| |   0.3%, dyld_stub_malloc_zone_malloc, libSystem.B.dylib
|   3.1%, szone_free_definite_size, libSystem.B.dylib
  19.3%, __perdida_m_MOD_generalized_hookes_law, a.out
+ 14.6%, MAIN__.2130, a.out
|   1.8%, cosisin, libSystem.B.dylib
| + 0.4%, cexp, libSystem.B.dylib
| |   0.1%, exp$fenv_access_off, libSystem.B.dylib
| |   0.0%, dyld_stub_exp, libSystem.B.dylib
| |   0.0%, cosisin, libSystem.B.dylib
  0.3%, szone_free_definite_size, libSystem.B.dylib
  0.3%, dyld_stub_malloc, a.out
  0.3%, dyld_stub_free, a.out
  0.2%, free, libSystem.B.dylib
  0.2%, malloc, libSystem.B.dylib
  0.0%, cexp, libSystem.B.dylib
  0.0%, data_transfer_init, libgfortran.3.dylib

[macbook] lin/test% gfc -Ofast -funroll-loops -fwhole-program -flto -fno-inline-functions fatigue.f90
[macbook] lin/test% time a.out > /dev/null
6.575u 0.021s 0:06.61 99.6%    0+0k 0+2io 0pf+0w

+ 71.0%, MAIN__.2130, a.out
| + 8.9%, free, libSystem.B.dylib
| |   6.6%, szone_size, libSystem.B.dylib
| + 8.1%, malloc, libSystem.B.dylib
| | + 6.4%, malloc_zone_malloc, libSystem.B.dylib
| | |   4.5%, szone_malloc_should_clear, libSystem.B.dylib
| | |   0.6%, szone_malloc, libSystem.B.dylib
| |   0.4%, dyld_stub_malloc_zone_malloc, libSystem.B.dylib
| |   0.2%, szone_malloc_should_clear, libSystem.B.dylib
|   4.4%, szone_free_definite_size, libSystem.B.dylib
|   1.9%, cosisin, libSystem.B.dylib
| + 1.0%, cexp, libSystem.B.dylib
| |   0.1%, exp$fenv_access_off, libSystem.B.dylib
| |   0.1%, cosisin, libSystem.B.dylib
| |   0.0%, dyld_stub_exp, libSystem.B.dylib
  27.3%, __perdida_m_MOD_generalized_hookes_law, a.out
  0.4%, free, libSystem.B.dylib
  0.3%, dyld_stub_malloc, a.out
  0.3%, dyld_stub_free, a.out
  0.3%, szone_free_definite_size, libSystem.B.dylib
  0.2%, malloc, libSystem.B.dylib
  0.1%, dyld_stub_cexp, a.out
  0.0%, cexp, libSystem.B.dylib

[macbook] lin/test% gfc -Ofast -funroll-loops -fwhole-program -flto -finline-limit=600 fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.768u 0.018s 0:04.79 99.5%    0+0k 0+1io 0pf+0w

+ 97.5%, MAIN__.2133, a.out
| + 15.4%, free, libSystem.B.dylib
| |   10.6%, szone_size, libSystem.B.dylib
| + 11.4%, malloc, libSystem.B.dylib
| | + 9.6%, malloc_zone_malloc, libSystem.B.dylib
| | |   4.9%, szone_malloc_should_clear, libSystem.B.dylib
| | |   0.9%, szone_malloc, libSystem.B.dylib
| |   0.4%, dyld_stub_malloc_zone_malloc, libSystem.B.dylib
|   6.4%, szone_free_definite_size, libSystem.B.dylib
|   2.7%, cosisin, libSystem.B.dylib
| + 0.8%, cexp, libSystem.B.dylib
| |   0.1%, exp$fenv_access_off, libSystem.B.dylib
| |   0.1%, cosisin, libSystem.B.dylib
| |   0.0%, dyld_stub_exp, libSystem.B.dylib
  0.5%, szone_free_definite_size, libSystem.B.dylib
  0.5%, dyld_stub_malloc, a.out
  0.5%, dyld_stub_free, a.out
  0.4%, free, libSystem.B.dylib
  0.4%, malloc, libSystem.B.dylib
  0.1%, dyld_stub_cexp, a.out



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2010-09-30  3:27 ` dominiq at lps dot ens.fr
@ 2010-09-30 19:54 ` dominiq at lps dot ens.fr
  2011-01-08 20:41 ` hubicka at gcc dot gnu.org
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dominiq at lps dot ens.fr @ 2010-09-30 19:54 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #10 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2010-09-30 17:28:19 UTC ---
(In reply to comment #8)
> Using -fno-inline-functions, the program recovers the speed of the no-LTO
> version.

This does not work on powerpc-apple-darwin9:

[karma] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g fatigue.f90
[karma] lin/test% time a.out > /dev/null
15.942u 0.052s 0:16.54 96.6%    0+0k 2+1io 40pf+0w
[karma] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g -flto fatigue.f90
[karma] lin/test% time a.out > /dev/null
20.330u 0.063s 0:21.06 96.8%    0+0k 0+2io 0pf+0w
[karma] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g -flto -fno-inline-functions fatigue.f90
[karma] lin/test% time a.out > /dev/null
20.678u 0.063s 0:21.33 97.1%    0+0k 0+2io 0pf+0w
[karma] lin/test% gfc -Ofast -funroll-loops -fwhole-program -g -flto -finline-limit=600 fatigue.f90
[karma] lin/test% time a.out > /dev/null
10.903u 0.036s 0:11.30 96.7%    0+0k 0+2io 0pf+0w



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2010-09-30 19:54 ` dominiq at lps dot ens.fr
@ 2011-01-08 20:41 ` hubicka at gcc dot gnu.org
  2011-01-23 16:36 ` hubicka at gcc dot gnu.org
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: hubicka at gcc dot gnu.org @ 2011-01-08 20:41 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #11 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-01-08 20:08:26 UTC ---
Does --param hot-bb-frequency-fraction=100000 work here?

> This is weird!-( I have done the following profiling and it shows that -flto
> prevents the inlining of __perdida_m_MOD_perdida, while -fno-inline-functions
> restores it. This contradicts what the manual says:
> 
> -finline-functions
> Integrate all simple functions into their callers. The compiler heuristically
> decides which functions are simple enough to be worth integrating in this way.

Disabling autoinlining of small functions can allow other inlining (inlining
functions called once or inlining for size), so this is not completely
unexpected.



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2011-01-08 20:41 ` hubicka at gcc dot gnu.org
@ 2011-01-23 16:36 ` hubicka at gcc dot gnu.org
  2011-01-23 18:08 ` hubicka at gcc dot gnu.org
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: hubicka at gcc dot gnu.org @ 2011-01-23 16:36 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.01.23 15:59:30
     Ever Confirmed|0                           |1

--- Comment #12 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-01-23 15:59:30 UTC ---
Reproduces for me.

perdida is a function called once. With default settings, perdida is not
considered an inline candidate for small-function inlining (it is estimated
at over 700 instructions, so it is huge).

Later we try to inline it as a function called once, but hit the
large-function-growth limit. Compiling with --param large-function-growth=1000000
gets past that limit, but it does not make the testcase faster.
So the problem is elsewhere.



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2011-01-23 16:36 ` hubicka at gcc dot gnu.org
@ 2011-01-23 18:08 ` hubicka at gcc dot gnu.org
  2011-01-23 19:38 ` dominiq at lps dot ens.fr
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: hubicka at gcc dot gnu.org @ 2011-01-23 18:08 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #13 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-01-23 16:45:23 UTC ---
OK, the slowdown goes away when both hookes_law and perdida are inlined.
The first needs -finline-limit=380, the second needs
large-function-growth=10000000 (or a large increase of the inline limit, so
that perdida is considered a small function and inlined before iztaccihuatl
grows that much).

Without large-function-growth we fail at:
Considering perdida size 1056.
 Called once from iztaccihuatl 6151 insns.
 Not inlining: --param large-function-growth limit reached.

This is because inlining of functions called once first processes read_input:
Considering read_input size 3099.
 Called once from iztaccihuatl 3128 insns.
 Inlined into iztaccihuatl which now has 6151 size for a net change of -76 size.

That makes it too large.

large-function-insns is 2700, large-function-growth is 100%, so iztaccihuatl
can't grow past 3128*2 insns.
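
The arithmetic behind that limit can be sketched as follows. This is a simplified model built only from the numbers quoted in this comment, not GCC's actual inliner code: the function names are hypothetical, and the size after inlining is approximated as a plain sum, whereas the real estimate can come out lower (read_input's inlining above shows a net change of -76).

```python
def growth_limit(caller_insns, large_function_insns=2700, large_function_growth=100):
    """Largest size the caller may reach under the rule quoted above."""
    base = max(caller_insns, large_function_insns)
    return base + base * large_function_growth // 100


def fits(caller_before, caller_now, callee_insns, **params):
    """Simplified check: the summed sizes must stay within the growth limit."""
    return caller_now + callee_insns <= growth_limit(caller_before, **params)


IZTACCIHUATL = 3128      # size before any called-once inlining
AFTER_READ_INPUT = 6151  # size once read_input has been inlined
PERDIDA = 1056

print(growth_limit(IZTACCIHUATL))                   # limit: 3128 * 2
print(fits(IZTACCIHUATL, AFTER_READ_INPUT, PERDIDA))  # perdida after read_input
print(fits(IZTACCIHUATL, IZTACCIHUATL, PERDIDA))      # perdida first
print(fits(IZTACCIHUATL, AFTER_READ_INPUT, PERDIDA,
           large_function_growth=10_000_000))
```

Under this model perdida fits when it is considered before read_input, which is the ordering argument made in the next paragraph.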

We might increase large-function-growth (I will give it a try on our
benchmarks) or we might convince the inliner to inline perdida before
read_input, because perdida is smaller...

Honza



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2011-01-23 18:08 ` hubicka at gcc dot gnu.org
@ 2011-01-23 19:38 ` dominiq at lps dot ens.fr
  2011-01-23 20:00 ` hubicka at gcc dot gnu.org
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-01-23 19:38 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #14 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-01-23 17:04:07 UTC ---
After removing the comments, generalized_hookes_law reads

      function generalized_hookes_law (strain_tensor, lambda, mu) result (stress_tensor)
!
      real (kind = LONGreal), dimension(:,:), intent(in) :: strain_tensor
      real (kind = LONGreal), intent(in) :: lambda, mu
      real (kind = LONGreal), dimension(3,3) :: stress_tensor
      real (kind = LONGreal), dimension(6) :: generalized_strain_vector,      &
                                              generalized_stress_vector
      real (kind = LONGreal), dimension(6,6) :: generalized_constitutive_tensor
      integer :: i
!
      generalized_constitutive_tensor(:,:) = 0.0_LONGreal
      generalized_constitutive_tensor(1,1) = lambda + 2.0_LONGreal * mu
      generalized_constitutive_tensor(1,2) = lambda
      generalized_constitutive_tensor(1,3) = lambda
      generalized_constitutive_tensor(2,1) = lambda
      generalized_constitutive_tensor(2,2) = lambda + 2.0_LONGreal * mu
      generalized_constitutive_tensor(2,3) = lambda
      generalized_constitutive_tensor(3,1) = lambda
      generalized_constitutive_tensor(3,2) = lambda
      generalized_constitutive_tensor(3,3) = lambda + 2.0_LONGreal * mu
      generalized_constitutive_tensor(4,4) = mu
      generalized_constitutive_tensor(5,5) = mu
      generalized_constitutive_tensor(6,6) = mu
!
      generalized_strain_vector(1) = strain_tensor(1,1)
      generalized_strain_vector(2) = strain_tensor(2,2)
      generalized_strain_vector(3) = strain_tensor(3,3)
      generalized_strain_vector(4) = strain_tensor(2,3)
      generalized_strain_vector(5) = strain_tensor(1,3)
      generalized_strain_vector(6) = strain_tensor(1,2)
!
      do i = 1, 6
          generalized_stress_vector(i) =                                      &
              dot_product(generalized_constitutive_tensor(i,:),               &
                          generalized_strain_vector(:))
      end do
!
      stress_tensor(1,1) = generalized_stress_vector(1)
      stress_tensor(2,2) = generalized_stress_vector(2)
      stress_tensor(3,3) = generalized_stress_vector(3)
      stress_tensor(2,3) = generalized_stress_vector(4)
      stress_tensor(1,3) = generalized_stress_vector(5)
      stress_tensor(1,2) = generalized_stress_vector(6)
      stress_tensor(3,2) = stress_tensor(2,3)
      stress_tensor(3,1) = stress_tensor(1,3)
      stress_tensor(2,1) = stress_tensor(1,2)
!
      end function generalized_hookes_law

Note that 24 of the 36 elements of generalized_constitutive_tensor are zero.
Using that, the function can be replaced with

      function generalized_hookes_law (strain_tensor, lambda, mu) result (stress_tensor)
!
      real (kind = LONGreal), dimension(:,:), intent(in) :: strain_tensor
      real (kind = LONGreal), intent(in) :: lambda, mu
      real (kind = LONGreal), dimension(3,3) :: stress_tensor
      real (kind = LONGreal) :: tmp
!
      stress_tensor(:,:) = mu * strain_tensor(:,:)
      tmp = lambda * (strain_tensor(1,1) + strain_tensor(2,2) +               &
                      strain_tensor(3,3))
      stress_tensor(1,1) = tmp + 2.0_LONGreal * stress_tensor(1,1)
      stress_tensor(2,2) = tmp + 2.0_LONGreal * stress_tensor(2,2)
      stress_tensor(3,3) = tmp + 2.0_LONGreal * stress_tensor(3,3)
!
      end function generalized_hookes_law

end module perdida_m

which is inlined at -finline-limit=320.
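
One caveat can be read off the two listings: the original fills stress_tensor(3,2), (3,1) and (2,1) from the upper triangle, while the replacement scales every element of strain_tensor, so the two agree only when strain_tensor is symmetric. A minimal Python sketch of that comparison follows; it is a hand port for illustration, with made-up input values and hypothetical function names, not code from the benchmark.

```python
# 0-based (row, col) pairs matching the Fortran ordering
# (1,1) (2,2) (3,3) (2,3) (1,3) (1,2) used for the 6-vectors.
PAIRS = [(0, 0), (1, 1), (2, 2), (1, 2), (0, 2), (0, 1)]


def hookes_law_original(strain, lam, mu):
    """Port of the original: 6x6 constitutive tensor times a 6-vector."""
    c = [[0.0] * 6 for _ in range(6)]
    for i in range(3):
        for j in range(3):
            c[i][j] = lam
        c[i][i] = lam + 2.0 * mu
    for i in range(3, 6):
        c[i][i] = mu
    e = [strain[i][j] for i, j in PAIRS]
    s = [sum(c[i][k] * e[k] for k in range(6)) for i in range(6)]
    stress = [[0.0] * 3 for _ in range(3)]
    for (i, j), value in zip(PAIRS, s):
        stress[i][j] = value
        stress[j][i] = value  # lower triangle is copied from the upper one
    return stress


def hookes_law_simplified(strain, lam, mu):
    """Port of the replacement: scale by mu, then fix up the diagonal."""
    stress = [[mu * strain[i][j] for j in range(3)] for i in range(3)]
    tmp = lam * (strain[0][0] + strain[1][1] + strain[2][2])
    for i in range(3):
        stress[i][i] = tmp + 2.0 * stress[i][i]
    return stress


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 3x3 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))


# Illustrative inputs only; the second differs from the first in element (2,1).
symmetric = [[1.0, 0.2, 0.3], [0.2, 2.0, 0.4], [0.3, 0.4, 3.0]]
asymmetric = [[1.0, 0.2, 0.3], [0.5, 2.0, 0.4], [0.3, 0.4, 3.0]]
lam, mu = 1.5, 0.75

for name, strain in (("symmetric", symmetric), ("asymmetric", asymmetric)):
    diff = max_abs_diff(hookes_law_original(strain, lam, mu),
                        hookes_law_simplified(strain, lam, mu))
    print(name, diff)
```

Whether fatigue always passes a symmetric tensor is not established in this thread, so the replacement should be read with that assumption in mind.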



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2011-01-23 19:38 ` dominiq at lps dot ens.fr
@ 2011-01-23 20:00 ` hubicka at gcc dot gnu.org
  2011-01-23 21:02 ` hubicka at gcc dot gnu.org
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: hubicka at gcc dot gnu.org @ 2011-01-23 20:00 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #15 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-01-23 17:56:31 UTC ---
Enabling early FRE
Index: passes.c
===================================================================
--- passes.c    (revision 169136)
+++ passes.c    (working copy)
@@ -760,6 +760,7 @@
          NEXT_PASS (pass_remove_cgraph_callee_edges);
          NEXT_PASS (pass_rename_ssa_copies);
          NEXT_PASS (pass_ccp);
+      NEXT_PASS (pass_fre);
          NEXT_PASS (pass_forwprop);
          /* pass_build_ealias is a dummy pass that ensures that we
             execute TODO_rebuild_alias at this point.  Re-building
@@ -782,7 +783,7 @@

reduces the perdida size estimate to 694 (so by about 30%) and that of
hookes_law to 141 (by 11%). Still not enough to make inlining happen.



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2011-01-23 20:00 ` hubicka at gcc dot gnu.org
@ 2011-01-23 21:02 ` hubicka at gcc dot gnu.org
  2011-01-23 21:12 ` dominiq at lps dot ens.fr
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: hubicka at gcc dot gnu.org @ 2011-01-23 21:02 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #16 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-01-23 17:57:58 UTC ---
Also, without inlining hookes_law but with perdida inlined (using only the
large-function-growth parameter and the patch above), I get a 30% speedup,
not 50% as with inlining both. So it seems that without early FRE we miss
some optimization that is independent of inlining.



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2011-01-23 21:02 ` hubicka at gcc dot gnu.org
@ 2011-01-23 21:12 ` dominiq at lps dot ens.fr
  2011-01-23 22:12 ` hubicka at gcc dot gnu.org
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-01-23 21:12 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #17 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-01-23 19:38:30 UTC ---
With the patch in comment #15 and -finline-limit=300, I get

================================================================================
Date & Time     : 23 Jan 2011 20:18:02
Test Name       : pbharness
Compile Command : gfcp %n.f90 -Ofast -funroll-loops -ftree-loop-linear
-fomit-frame-pointer -finline-limit=300 -fwhole-program -flto -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct
linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :      300.0
Target Error %  :      0.200
Minimum Repeats :     2
Maximum Repeats :     5

   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      3.55       54576      8.12       2  0.0062
      aermod    103.51     1595448     18.87       2  0.0079
         air      8.87       90048      6.89       2  0.0798
    capacita      5.84       89056     40.27       2  0.0199
     channel      1.62       34448      2.98       2  0.0168
       doduc     14.30      203936     27.79       2  0.0162
     fatigue      4.89       89264      4.74       2  0.0106
     gas_dyn     11.72      148176      4.64       5  0.0535
      induct     10.87      205976     14.00       2  0.0036
       linpk      1.58       21536     21.71       2  0.0415
        mdbx      5.60       84752     12.56       2  0.1871
          nf      7.24       83712     29.23       5  0.0744
     protein     11.81      163760     35.10       2  0.0342
      rnflow     14.86      171392     26.91       2  0.0223
    test_fpu     11.35      145848     11.03       2  0.0952
        tfft      1.10       22072      3.30       2  0.1817

Geometric Mean Execution Time =      12.36 seconds

to be compared to the lowest Geometric Mean I have got so far (most of the
difference is due to nf, which depends a lot on the mood of my laptop)

================================================================================
Date & Time     : 22 Dec 2010 10:33:08
Test Name       : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear
-fomit-frame-pointer -finline-limit=600 --param hot-bb-frequency-fraction=2000
-fwhole-program -flto -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct
linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :      300.0
Target Error %  :      0.200
Minimum Repeats :     2
Maximum Repeats :     5

   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac     11.55       58672      8.11       2  0.0123
      aermod    164.78     1522240     19.11       2  0.1151
         air     20.73       85984      6.87       5  0.1914
    capacita     14.66      105472     40.22       2  0.0584
     channel      3.22       34448      2.92       4  0.1714
       doduc     24.70      212360     27.81       2  0.1025
     fatigue      9.81       85144      4.70       3  0.1862
     gas_dyn     24.13      144240      4.66       5  0.4507
      induct     22.50      214136     13.69       2  0.1096
       linpk      2.56       21536     21.68       2  0.0231
        mdbx      8.93       84744     12.52       2  0.0080
          nf     22.61      104136     27.63       2  0.0778
     protein     26.19      155768     35.51       2  0.0127
      rnflow     30.99      163200     26.15       2  0.0248
    test_fpu     18.79      145848     10.98       2  0.0182
        tfft      1.92       22072      3.29       2  0.0304

Geometric Mean Execution Time =      12.27 seconds
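
For reference, the geometric mean printed by the harness can be reproduced from the rounded run times in the first table of this comment (the -finline-limit=300 run). This is only a sketch: `geometric_mean` is a name chosen here, not part of pbharness, and the harness presumably works from unrounded timings, so the last digit could differ.

```python
import math

# Average run times (secs) copied from the 23 Jan 2011 table above,
# in the order ac ... tfft.
run_times = [8.12, 18.87, 6.89, 40.27, 2.98, 27.79, 4.74, 4.64,
             14.00, 21.71, 12.56, 29.23, 35.10, 26.91, 11.03, 3.30]

def geometric_mean(values):
    """Return the n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

gm = geometric_mean(run_times)
print(f"Geometric mean: {gm:.2f} seconds")
```

Because the mean is multiplicative, a single noisy benchmark such as nf shifts it by the 16th root of that benchmark's ratio, which is why small differences between the two tables should not be over-interpreted.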


* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2011-01-23 21:12 ` dominiq at lps dot ens.fr
@ 2011-01-23 22:12 ` hubicka at gcc dot gnu.org
  2011-01-23 22:39 ` hubicka at gcc dot gnu.org
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: hubicka at gcc dot gnu.org @ 2011-01-23 22:12 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2011-01-23 15:59:30         |
                 CC|                            |rguenther at suse dot de

--- Comment #18 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-01-23 20:00:23 UTC ---
We produce very lousy code for the out-of-line copy of
__perdida_m_MOD_generalized_hookes_law. This seems to be the reason why we inline
it.

The code is a bit better with early FRE, but we still get in
vect_pgeneralized_constitutive_tensor (optimized dump):

  generalized_constitutive_tensor = {};
  D.4502_45 = *lambda_44(D);
  D.4503_47 = *mu_46(D);
  D.4504_48 = D.4503_47 * 2.0e+0;
  D.4505_49 = D.4504_48 + D.4502_45;
  generalized_constitutive_tensor[0] = D.4505_49;
  generalized_constitutive_tensor[6] = D.4502_45;
  generalized_constitutive_tensor[12] = D.4502_45;
  generalized_constitutive_tensor[1] = D.4502_45;
  generalized_constitutive_tensor[7] = D.4505_49;
  generalized_constitutive_tensor[13] = D.4502_45;
  generalized_constitutive_tensor[2] = D.4502_45;
  generalized_constitutive_tensor[8] = D.4502_45;
  generalized_constitutive_tensor[14] = D.4505_49;
  generalized_constitutive_tensor[21] = D.4503_47;
  generalized_constitutive_tensor[28] = D.4503_47;
  generalized_constitutive_tensor[35] = D.4503_47;

This initializes the array with mostly zeros, and then we use it in a vectorized loop:

  vect_cst_.855_301 = {D.4508_69, D.4508_69};
  vect_cst_.862_295 = {D.4511_73, D.4511_73};
  vect_cst_.870_288 = {D.4514_77, D.4514_77};
  vect_cst_.878_323 = {D.4519_82, D.4519_82};
  vect_cst_.886_330 = {D.4522_86, D.4522_86};
  vect_cst_.894_337 = {D.4526_90, D.4526_90};
  vect_var_.853_205 = MEM[(real(kind=8)[36] *)&generalized_constitutive_tensor];
  vect_var_.854_210 = vect_var_.853_205 * vect_cst_.855_301;
  vect_var_.860_211 = MEM[(real(kind=8)[36] *)&generalized_constitutive_tensor + 48B];
  vect_var_.861_214 = vect_var_.860_211 * vect_cst_.862_295;
  vect_var_.863_215 = vect_var_.861_214 + vect_var_.854_210;
  vect_var_.868_220 = MEM[(real(kind=8)[36] *)&generalized_constitutive_tensor + 96B];
  vect_var_.869_221 = vect_var_.868_220 * vect_cst_.870_288;
  vect_var_.871_224 = vect_var_.863_215 + vect_var_.869_221;
  vect_var_.876_225 = MEM[(real(kind=8)[36] *)&generalized_constitutive_tensor + 144B];

We would do better to unroll this and optimize away the zero terms.
Without -ftree-vectorize, however, we still don't do this transform. We end up with:

  generalized_constitutive_tensor = {};
  D.4502_45 = *lambda_44(D);
  D.4503_47 = *mu_46(D);
  D.4504_48 = D.4503_47 * 2.0e+0;
  D.4505_49 = D.4504_48 + D.4502_45;
  generalized_constitutive_tensor[1] = D.4502_45;
  generalized_constitutive_tensor[7] = D.4505_49;
  generalized_constitutive_tensor[13] = D.4502_45;
  generalized_constitutive_tensor[2] = D.4502_45;
  generalized_constitutive_tensor[8] = D.4502_45;
  generalized_constitutive_tensor[14] = D.4505_49;
  generalized_constitutive_tensor[21] = D.4503_47;
  generalized_constitutive_tensor[28] = D.4503_47;
  generalized_constitutive_tensor[35] = D.4503_47;
....
  pretmp.827_334 = generalized_constitutive_tensor[1];
  pretmp.830_336 = generalized_constitutive_tensor[7];
  pretmp.832_338 = generalized_constitutive_tensor[13];
  pretmp.834_340 = generalized_constitutive_tensor[19];
  pretmp.836_342 = generalized_constitutive_tensor[25];
  pretmp.838_344 = generalized_constitutive_tensor[31];

So copy propagation and SRA are missing. Moreover, we can't figure out that
generalized_constitutive_tensor[31] is 0.
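
To illustrate what the missing transform would buy, here is a small Python model (not GCC output) of the tensor built by the stores in the first dump: only 12 of the 36 entries are nonzero. The names `build_tensor`, `dense_apply` and `unrolled_apply` are hypothetical, and treating the flat `real(kind=8)[36]` array as a 6x6 column-major matrix is an assumption; since the stored pattern is symmetric, the choice of layout does not change the result.

```python
import math

def build_tensor(lam, mu):
    """Flat 36-element tensor, filled as in the GIMPLE stores above."""
    t = [0.0] * 36                    # generalized_constitutive_tensor = {}
    diag = 2.0 * mu + lam             # D.4505_49
    for idx in (0, 7, 14):
        t[idx] = diag
    for idx in (6, 12, 1, 13, 2, 8):
        t[idx] = lam                  # D.4502_45
    for idx in (21, 28, 35):
        t[idx] = mu                   # D.4503_47
    return t

def dense_apply(t, strain):
    """Generic 6x6 matrix-vector product: 36 multiplies, most of them by zero."""
    return [sum(t[i + 6 * j] * strain[j] for j in range(6)) for i in range(6)]

def unrolled_apply(lam, mu, strain):
    """Same result after unrolling by hand and dropping the zero terms."""
    trace = strain[0] + strain[1] + strain[2]
    normal = [lam * trace + 2.0 * mu * strain[i] for i in range(3)]
    shear = [mu * strain[i] for i in range(3, 6)]
    return normal + shear

strain = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
dense = dense_apply(build_tensor(2.0, 3.0), strain)
sparse = unrolled_apply(2.0, 3.0, strain)
print(dense, sparse)
```

The unrolled form needs no temporary array at all, which is the effect complete unrolling followed by constant and copy propagation through memory would have to achieve in the compiler.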

So it is quite a good testcase for optimization queue ordering.
Honza


* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2011-01-23 22:12 ` hubicka at gcc dot gnu.org
@ 2011-01-23 22:39 ` hubicka at gcc dot gnu.org
  2011-01-24  2:04 ` dominiq at lps dot ens.fr
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: hubicka at gcc dot gnu.org @ 2011-01-23 22:39 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2011-01-23 15:59:30

--- Comment #19 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-01-23 21:05:51 UTC ---
This adds enough passes so we generate sane code for hookes_law.
(and we do that before inlining)
Index: passes.c
===================================================================
--- passes.c    (revision 169136)
+++ passes.c    (working copy)
@@ -775,6 +775,14 @@
          NEXT_PASS (pass_convert_switch);
           NEXT_PASS (pass_cleanup_eh);
           NEXT_PASS (pass_profile);
+         NEXT_PASS (pass_tree_loop_init);
+         NEXT_PASS (pass_complete_unroll);
+         NEXT_PASS (pass_tree_loop_done);
+          NEXT_PASS (pass_ccp);
+          NEXT_PASS (pass_fre);
+          NEXT_PASS (pass_dse);
+          NEXT_PASS (pass_fre);
+          NEXT_PASS (pass_cd_dce);
           NEXT_PASS (pass_local_pure_const);
          /* Split functions creates parts that are not run through
             early optimizations again.  It is thus good idea to do this
@@ -782,7 +790,7 @@

We need to unroll the loop, do CCP to get constant array indexes, and run FRE to
propagate through memory accesses. For some reason FRE is needed twice; otherwise
the loads from the temporary array are not copy propagated.

I didn't test whether DSE is really needed or whether cd_dce gets rid of the dead
store into the array. Still, a lot of copyprop opportunity is left.

This makes hookes_law estimate to be 91 instructions, so -finline-limit=183
should be enough.
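
The arithmetic behind that number: per the GCC manual, -finline-limit=n sets both max-inline-insns-single and max-inline-insns-auto to n/2. A tiny sketch of that mapping follows. The helper name is made up, rounding down is an assumption about how n/2 is evaluated, and whether the size comparison is strict is not modelled.

```python
def inline_insns_from_limit(n):
    """Size limit implied by -finline-limit=n (n/2, assumed to round down)."""
    return n // 2

# 183 is the value suggested above; 300 and 600 appear in the benchmark runs.
for limit in (183, 300, 600):
    print(f"-finline-limit={limit} -> {inline_insns_from_limit(limit)} insns")
```

This only shows where the estimate of 91 instructions meets the limit on paper; the threshold observed in practice can be higher, since other inliner heuristics also apply.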


* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2011-01-23 22:39 ` hubicka at gcc dot gnu.org
@ 2011-01-24  2:04 ` dominiq at lps dot ens.fr
  2011-01-24  9:43 ` dominiq at lps dot ens.fr
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-01-24  2:04 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #20 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-01-23 23:20:34 UTC ---
> This makes hookes_law estimate to be 91 instructions, so -finline-limit=183
> should be enough.

With the patch in comment #19, I rather find a threshold of -finline-limit=256.
On top of that, as shown by the timings below, the patch increases the threshold
for ac.f90 and breaks the vectorization for induct.f90.

Would the patch in comment #15 and an increase of the default value for
-finline-limit to 300 be acceptable at this stage (with the usual bells and
whistles: SPEC, ...)?

================================================================================
Date & Time     : 23 Jan 2011 23:18:23
Test Name       : pbharness
Compile Command : gfcp %n.f90 -Ofast -funroll-loops -ftree-loop-linear
-fomit-frame-pointer -finline-limit=300 -fwhole-program -flto -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct
linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :      300.0
Target Error %  :      0.200
Minimum Repeats :     2
Maximum Repeats :     5

   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      3.15       50536      9.58       2  0.0156
      aermod    104.98     1652280     18.79       2  0.1011
         air      8.83       90048      6.99       5  0.7334
    capacita      5.95       89056     40.21       2  0.0174
     channel      1.65       34448      2.99       2  0.0502
       doduc     14.59      208056     27.91       2  0.0036
     fatigue      4.80       89264      4.72       2  0.0212
     gas_dyn     11.65      148176      4.66       5  0.4391
      induct     11.20      205976     22.34       2  0.0672
       linpk      1.59       21536     21.70       2  0.0299
        mdbx      5.78       84760     12.58       2  0.0119
          nf      7.60       83712     29.53       5  0.3854
     protein     11.69      163760     35.18       2  0.1109
      rnflow     15.23      167296     26.97       2  0.0890
    test_fpu     11.33      145848     11.06       5  0.3715
        tfft      1.13       22072      3.30       2  0.0607

Geometric Mean Execution Time =      12.89 seconds

================================================================================
Date & Time     : 23 Jan 2011 23:54:28
Test Name       : pbharness
Compile Command : gfcp %n.f90 -Ofast -funroll-loops -ftree-loop-linear
-fomit-frame-pointer -finline-limit=600 -fwhole-program -flto -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct
linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :      300.0
Target Error %  :      0.200
Minimum Repeats :     2
Maximum Repeats :     5

   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      3.59       54576      8.10       2  0.0062
      aermod    103.73     1558344     18.91       2  0.0238
         air     10.47       89992      6.77       5  0.1563
    capacita      7.47      101344     40.08       2  0.0137
     channel      1.65       34448      2.97       5  0.5872
       doduc     15.82      216376     27.61       2  0.0000
     fatigue      5.10       89264      4.73       2  0.0000
     gas_dyn     12.09      152264      4.69       5  0.6428
      induct     11.10      205976     22.33       2  0.0403
       linpk      1.59       21536     21.72       2  0.0368
        mdbx      5.85       84760     12.58       2  0.0517
          nf     11.34      108280     28.98       2  0.1087
     protein     11.65      163760     35.18       3  0.1422
      rnflow     17.39      183696     26.71       2  0.0243
    test_fpu     11.49      145816     11.02       2  0.1226
        tfft      1.43       22072      3.29       2  0.0911

Geometric Mean Execution Time =      12.70 seconds


* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (19 preceding siblings ...)
  2011-01-24  2:04 ` dominiq at lps dot ens.fr
@ 2011-01-24  9:43 ` dominiq at lps dot ens.fr
  2011-01-24 14:37 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-01-24  9:43 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #21 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-01-24 09:29:00 UTC ---
I have regtested my working tree (with other patches) with the patch in comment
#15 and got 180 new failures (likely 90 each for -m32 and -m64, but I have not
checked that carefully).

Among them, 124 are of the kind "scan-tree-dump-times fre *: dump file does not
exist" and seem to be due to the extra pass producing fre1 and fre2. I can
adjust the test for say fre2 and see what's happening.

Then I see

FAIL: gcc.dg/ipa/ipa-pta-14.c scan-ipa-dump pta "foo.result = { NULL a[^ ]* a[^
]* c[^ ]* }"

FAIL: gcc.dg/matrix/matrix-1.c scan-ipa-dump-times matrix-reorg "Flattened 3
dimensions" 1
FAIL: gcc.dg/matrix/matrix-2.c scan-ipa-dump-times matrix-reorg "Flattened 2
dimensions" 1
FAIL: gcc.dg/matrix/matrix-3.c scan-ipa-dump-times matrix-reorg "Flattened 2
dimensions" 1
FAIL: gcc.dg/matrix/matrix-6.c scan-ipa-dump-times matrix-reorg "Flattened 2
dimensions" 1
FAIL: gcc.dg/matrix/transpose-1.c scan-ipa-dump-times matrix-reorg "Flattened 3
dimensions" 1
FAIL: gcc.dg/matrix/transpose-1.c scan-ipa-dump-times matrix-reorg "Transposed"
3
FAIL: gcc.dg/matrix/transpose-2.c scan-ipa-dump-times matrix-reorg "Flattened 3
dimensions" 1
FAIL: gcc.dg/matrix/transpose-3.c scan-ipa-dump-times matrix-reorg "Flattened 2
dimensions" 1
FAIL: gcc.dg/matrix/transpose-3.c scan-ipa-dump-times matrix-reorg "Transposed"
2
FAIL: gcc.dg/matrix/transpose-4.c scan-ipa-dump-times matrix-reorg "Flattened 3
dimensions" 1
FAIL: gcc.dg/matrix/transpose-4.c scan-ipa-dump-times matrix-reorg "Transposed"
2
FAIL: gcc.dg/matrix/transpose-5.c scan-ipa-dump-times matrix-reorg "Flattened 3
dimensions" 1
FAIL: gcc.dg/matrix/transpose-6.c scan-ipa-dump-times matrix-reorg "Flattened 3
dimensions" 1

FAIL: gcc.dg/torture/pta-structcopy-1.c  -O2  scan-tree-dump alias "points-to
vars: { i }"
FAIL: gcc.dg/torture/pta-structcopy-1.c  -O3 -fomit-frame-pointer 
scan-tree-dump alias "points-to vars: { i }"
FAIL: gcc.dg/torture/pta-structcopy-1.c  -O3 -g  scan-tree-dump alias
"points-to vars: { i }"
FAIL: gcc.dg/torture/pta-structcopy-1.c  -Os  scan-tree-dump alias "points-to
vars: { i }"
FAIL: gcc.dg/torture/pta-structcopy-1.c  -O2 -flto -flto-partition=none 
scan-tree-dump alias "points-to vars: { i }"
FAIL: gcc.dg/torture/pta-structcopy-1.c  -O2 -flto  scan-tree-dump alias
"points-to vars: { i }"

FAIL: gcc.dg/tree-ssa/pta-ptrarith-1.c scan-tree-dump ealias "q_., points-to
vars: { k }"
FAIL: gcc.dg/tree-ssa/sra-9.c scan-tree-dump-times optimized "= s.b" 0
FAIL: gcc.dg/tree-ssa/ssa-dce-4.c scan-tree-dump-times cddce1 "a\[[^

FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg "f6: va_list escapes 0,
needs to save (3|12|24) GPR units"
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg "f11: va_list escapes 0,
needs to save (3|12|24) GPR units"
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg "f12: va_list escapes 0,
needs to save [1-9][0-9]* GPR units"
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg "f13: va_list escapes 0,
needs to save [1-9][0-9]* GPR units"
FAIL: gcc.dg/tree-ssa/stdarg-2.c scan-tree-dump stdarg "f14: va_list escapes 0,
needs to save [1-9][0-9]* GPR units"

FAIL: g++.dg/ipa/iinline-1.C scan-ipa-dump inline "String::funcOne[^\n]*inline
copy in int main"
FAIL: g++.dg/ipa/iinline-2.C scan-ipa-dump inline "String::funcOne[^\n]*inline
copy in int main"

So far I have only looked at gcc.dg/ipa/ipa-pta-14.c, for which grepping
foo.result yields

p_1 = foo.result
foo.result = foo.arg1
Equivalence classes for Direct node node id 15:foo.result are pointer: 8,
location:0
Unifying foo.result to foo.arg0
foo.result = { a.0+32 } same as foo.arg0

instead of

p_1 = foo.result
foo.result = D.2736_3
Equivalence classes for Direct node node id 15:foo.result are pointer: 13,
location:0
Unifying foo.result to p_1
foo.result = { NULL a.0+32 a.64+64 c.0+32 } same as p_1

Is it a missed optimization or wrong-code?


* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (20 preceding siblings ...)
  2011-01-24  9:43 ` dominiq at lps dot ens.fr
@ 2011-01-24 14:37 ` rguenth at gcc dot gnu.org
  2011-01-24 18:09 ` howarth at nitro dot med.uc.edu
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-01-24 14:37 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #22 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-01-24 14:07:14 UTC ---
(In reply to comment #15)
> Enabling early FRE
> Index: passes.c
> ===================================================================
> --- passes.c    (revision 169136)
> +++ passes.c    (working copy)
> @@ -760,6 +760,7 @@
>           NEXT_PASS (pass_remove_cgraph_callee_edges);
>           NEXT_PASS (pass_rename_ssa_copies);
>           NEXT_PASS (pass_ccp);
> +      NEXT_PASS (pass_fre);
>           NEXT_PASS (pass_forwprop);
>           /* pass_build_ealias is a dummy pass that ensures that we
>              execute TODO_rebuild_alias at this point.  Re-building
> @@ -782,7 +783,7 @@
> 
> reduces the perdida size estimate to 694 (so by about 30%) and hookes_law to 141 (by
> 11%). Still not enough to make inlining happen.

That FRE pass should be after pass_sra_early (certainly after
pass_build_ealias).
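
The ordering constraint can be written down as a small check. This is a toy model in Python, not GCC code: the pass list contains only names that are quoted in the patches in this thread, it assumes pass_build_ealias directly follows the comment block shown in the diff, and `insert_after` and `runs_after` are hypothetical helpers.

```python
# Abridged early-optimization sequence, using only pass names quoted in this thread.
early_passes = [
    "pass_ccp",
    "pass_forwprop",
    "pass_build_ealias",
    "pass_sra_early",
    "pass_copy_prop",
    "pass_merge_phi",
    "pass_cd_dce",
]

def insert_after(passes, anchor, new_pass):
    """Return a copy of passes with new_pass placed right after anchor."""
    i = passes.index(anchor)
    return passes[: i + 1] + [new_pass] + passes[i + 1:]

def runs_after(passes, later, earlier):
    """True if `later` is scheduled after `earlier`."""
    return passes.index(later) > passes.index(earlier)

comment15 = insert_after(early_passes, "pass_ccp", "pass_fre")        # quoted patch
suggested = insert_after(early_passes, "pass_sra_early", "pass_fre")  # placement asked for

for name, seq in (("comment #15", comment15), ("suggested", suggested)):
    ok = all(runs_after(seq, "pass_fre", p)
             for p in ("pass_build_ealias", "pass_sra_early"))
    print(f"{name}: FRE after ealias and early SRA = {ok}")
```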


* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (21 preceding siblings ...)
  2011-01-24 14:37 ` rguenth at gcc dot gnu.org
@ 2011-01-24 18:09 ` howarth at nitro dot med.uc.edu
  2011-01-24 18:38 ` dominiq at lps dot ens.fr
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: howarth at nitro dot med.uc.edu @ 2011-01-24 18:09 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Jack Howarth <howarth at nitro dot med.uc.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |howarth at nitro dot
                   |                            |med.uc.edu

--- Comment #23 from Jack Howarth <howarth at nitro dot med.uc.edu> 2011-01-24 17:58:00 UTC ---
(In reply to comment #22)

> That FRE pass should be after pass_sra_early (certainly after
> pass_build_ealias).

Index: gcc/passes.c
===================================================================
--- gcc/passes.c    (revision 169145)
+++ gcc/passes.c    (working copy)
@@ -767,6 +767,7 @@ init_optimization_passes (void)
          locals into SSA form if possible.  */
       NEXT_PASS (pass_build_ealias);
       NEXT_PASS (pass_sra_early);
+          NEXT_PASS (pass_fre);
       NEXT_PASS (pass_copy_prop);
       NEXT_PASS (pass_merge_phi);
       NEXT_PASS (pass_cd_dce);

gives Elapsed CPU time  =     8.43600E+00 for

gfortran -O3 -ffast-math -funroll-loops -flto -fwhole-program fatigue.f90 -o
fatigue

and Elapsed CPU time  =     4.16600E+00 for

gfortran -O3 -ffast-math -funroll-loops -finline-limit=250 --param
large-function-growth=250 -flto -fwhole-program fatigue.f90 -o fatigue


* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (22 preceding siblings ...)
  2011-01-24 18:09 ` howarth at nitro dot med.uc.edu
@ 2011-01-24 18:38 ` dominiq at lps dot ens.fr
  2011-02-16 18:44 ` dominiq at lps dot ens.fr
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 28+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-01-24 18:38 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #24 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-01-24 18:16:47 UTC ---
(In reply to comment #22)
> That FRE pass should be after pass_sra_early (certainly after
> pass_build_ealias).

Moving pass_fre after pass_sra_early does not fix the failures in the test
suite reported in comment #21.


* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (23 preceding siblings ...)
  2011-01-24 18:38 ` dominiq at lps dot ens.fr
@ 2011-02-16 18:44 ` dominiq at lps dot ens.fr
  2011-09-22 15:53 ` dominiq at lps dot ens.fr
  2011-09-26 10:37 ` rguenth at gcc dot gnu.org
  26 siblings, 0 replies; 28+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-02-16 18:44 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #25 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-02-16 18:38:19 UTC ---
AFAICT the patch in http://gcc.gnu.org/ml/gcc-patches/2011-02/msg00973.html
seems to fix most of the fatigue.f90 problems:

At revision 170178 without the patch, I get

[macbook] lin/test% gfcp -Ofast fatigue.f90
[macbook] lin/test% time a.out > /dev/null
8.903u 0.005s 0:08.91 99.8%    0+0k 0+2io 0pf+0w
[macbook] lin/test% gfcp -Ofast -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
6.392u 0.002s 0:06.39 100.0%    0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -Ofast -finline-limit=322 -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.653u 0.002s 0:04.65 100.0%    0+0k 0+1io 0pf+0w
[macbook] lin/test% gfcp -Ofast -finline-limit=322 -fwhole-program -flto
fatigue.f90
[macbook] lin/test% time a.out > /dev/null
8.212u 0.004s 0:08.22 99.8%    0+0k 0+2io 0pf+0w
[macbook] lin/test% gfcp -Ofast -finline-limit=322 --param
large-function-growth=132 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.526u 0.004s 0:04.53 99.7%    0+0k 0+1io 0pf+0w

At revision 170212 with the patch, I get

[macbook] lin/test% gfc -Ofast fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.628u 0.002s 0:04.63 99.7%    0+0k 0+0io 0pf+0w
[macbook] lin/test% gfc -Ofast -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.654u 0.002s 0:04.65 100.0%    0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -finline-limit=322 -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.657u 0.002s 0:04.66 99.7%    0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -finline-limit=322 -fwhole-program -flto
fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.715u 0.003s 0:04.72 99.7%    0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -finline-limit=322 --param
large-function-growth=132 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.713u 0.003s 0:04.71 100.0%    0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast -finline-limit=322 --param
large-function-growth=137 -fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.524u 0.003s 0:04.52 100.0%    0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast --param large-function-growth=137
-fwhole-program -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.564u 0.003s 0:04.57 99.7%    0+0k 0+1io 0pf+0w
[macbook] lin/test% gfc -Ofast --param large-function-growth=137
-fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
4.479u 0.003s 0:04.48 99.7%    0+0k 0+2io 0pf+0w

A quick check of the other tests does not show any obvious slowdown with the
patch.
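
The timings above are in the default tcsh `time` format (user seconds, system seconds, elapsed time, CPU percentage, and so on). A small sketch for pulling the user time out of such lines and comparing two runs; `parse_user_seconds` is a hypothetical helper and only handles the `<float>u <float>s` prefix seen here.

```python
import re

def parse_user_seconds(line):
    """Extract user CPU seconds from a tcsh `time` line such as '8.903u 0.005s ...'."""
    m = re.match(r"\s*(\d+(?:\.\d+)?)u\s+(\d+(?:\.\d+)?)s\b", line)
    if m is None:
        raise ValueError(f"not a tcsh time line: {line!r}")
    return float(m.group(1))

# Plain -Ofast runs quoted above: r170178 without the patch, r170212 with it.
before = parse_user_seconds("8.903u 0.005s 0:08.91 99.8%    0+0k 0+2io 0pf+0w")
after = parse_user_seconds("4.628u 0.002s 0:04.63 99.7%    0+0k 0+0io 0pf+0w")
print(f"speedup from the patch at -Ofast: {before / after:.2f}x")
```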


* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (24 preceding siblings ...)
  2011-02-16 18:44 ` dominiq at lps dot ens.fr
@ 2011-09-22 15:53 ` dominiq at lps dot ens.fr
  2011-09-26 10:37 ` rguenth at gcc dot gnu.org
  26 siblings, 0 replies; 28+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-09-22 15:53 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

--- Comment #26 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-09-22 15:25:48 UTC ---
AFAICT this PR has been fixed for some time. Here are the results I get on
x86_64-apple-darwin10 (Core2Duo 2.53GHz, 3MB cache, 4GB RAM) at revision
179079:

Compile options : -fprotect-parens -Ofast -funroll-loops -fwhole-program

                   without -flto                     with -flto

Benchmark   Compile  Executable   Ave Run   Compile  Executable   Ave Run
     Name    (secs)     (bytes)    (secs)    (secs)     (bytes)    (secs)
---------   -------  ----------   -------   -------  ----------   -------
       ac      3.28       54936      8.81      6.64       54968      8.81
   aermod     75.46     1184280     18.65    131.50     1212648     18.20
      air     11.24      106336      7.26     22.38      106904      7.39
 capacita      3.87       77152     41.29      7.36       77200     41.31
  channel      1.25       34744      3.03      2.39       34864      3.03
    doduc     12.40      200016     28.02     22.47      200496     27.69
  fatigue      4.06       77400      4.83      8.17       77488      4.84
  gas_dyn      9.32      119256      4.92     16.64      119816      4.92
   induct      7.37      148840     13.83     14.76      153224     13.84
    linpk      0.70       26024     21.64      1.93       26064     21.64
     mdbx      3.77       80864     12.46      7.21       81040     12.46
       nf      4.08       71848     19.34      8.07       71896     19.35
  protein     15.17      131304     35.30     26.05      127224     35.48
   rnflow     12.58      130888     28.25     23.76      131000     26.92
 test_fpu      4.78       92968     10.63     13.35       93024     10.64
     tfft      0.74       22352      3.28      1.98       22432      3.28

Geometric Mean Execution Time =     12.23 secs                      12.18 secs

Compile options : -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear 
-fomit-frame-pointer --param max-inline-insns-auto=200 -fwhole-program

                   without -flto                     with -flto

Benchmark   Compile  Executable   Ave Run   Compile  Executable   Ave Run
     Name    (secs)     (bytes)    (secs)    (secs)     (bytes)    (secs)
---------   -------  ----------   -------   -------  ----------   -------
       ac      4.05       54904      8.11      8.18       54920      8.11
   aermod    101.55     1494688     18.17    169.63     1527120     18.12
      air     14.46      114328      7.05     30.35      114912      7.04
 capacita      5.39       97552     40.24     10.80       97584     40.21
  channel      1.68       38792      2.91      3.17       38888      2.91
    doduc     12.98      208112     27.47     25.77      208584     27.52
  fatigue      4.84       81440      2.95     10.27       81504      2.93
  gas_dyn     13.55      143776      4.86     25.03      144392      4.86
   induct     12.95      189872     13.78     24.32      190176     13.96
    linpk      0.73       21856     21.69      2.44       21888     21.69
     mdbx      4.32       84928     12.45      9.39       85104     12.54
       nf      7.41       92248     18.93     17.82       92272     18.91
  protein     17.26      160040     35.51     31.08      155984     35.47
   rnflow     15.16      138880     28.27     27.28      139040     26.85
 test_fpu      5.05       92872     10.65     14.65       92928     10.65
     tfft      0.75       22352      3.28      1.72       22432      3.28

Geometric Mean Execution Time =     11.67 secs                      11.64 secs

The option -flto improves the run time for rnflow.f90 by ~5% with no notable
slowdown for the other tests. Could these results be checked on other platforms,
and this PR closed if they agree with mine?
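
For anyone repeating the comparison on another platform, here is a minimal sketch of how the summary figures can be recomputed from the "Ave Run" columns. It is not part of the Polyhedron harness; the function and variable names are illustrative assumptions, and the timings are copied from the second table above (the -Ofast set of options).

```python
import math

# "Ave Run (secs)" columns copied from the second table above
# (-fprotect-parens -Ofast ... -fwhole-program), in table order.
BENCHMARKS = ["ac", "aermod", "air", "capacita", "channel", "doduc",
              "fatigue", "gas_dyn", "induct", "linpk", "mdbx", "nf",
              "protein", "rnflow", "test_fpu", "tfft"]
WITHOUT_LTO = [8.11, 18.17, 7.05, 40.24, 2.91, 27.47, 2.95, 4.86,
               13.78, 21.69, 12.45, 18.93, 35.51, 28.27, 10.65, 3.28]
WITH_LTO = [8.11, 18.12, 7.04, 40.21, 2.91, 27.52, 2.93, 4.86,
            13.96, 21.69, 12.54, 18.91, 35.47, 26.85, 10.65, 3.28]


def geometric_mean(values):
    """Geometric mean computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def percent_change(before, after):
    """Relative change in percent; negative means 'after' is faster."""
    return 100.0 * (after - before) / before


# Per-benchmark run-time change when -flto is added.
changes = {
    name: percent_change(without, with_lto)
    for name, without, with_lto in zip(BENCHMARKS, WITHOUT_LTO, WITH_LTO)
}

for name in BENCHMARKS:
    print(f"{name:>9}  {changes[name]:+6.2f}%")
print(f"geometric mean without -flto: {geometric_mean(WITHOUT_LTO):.2f} secs")
print(f"geometric mean with -flto:    {geometric_mean(WITH_LTO):.2f} secs")
```

Swapping in timings from another machine only requires replacing the two lists; the per-benchmark changes then show directly whether any test regresses with -flto.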



* [Bug lto/45810] 40% slowdown when using LTO for a single-file program
  2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
                   ` (25 preceding siblings ...)
  2011-09-22 15:53 ` dominiq at lps dot ens.fr
@ 2011-09-26 10:37 ` rguenth at gcc dot gnu.org
  26 siblings, 0 replies; 28+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-09-26 10:37 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45810

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #27 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-09-26 10:16:20 UTC ---
Yes, I think I analyzed the reason for this at some point (IPA profile) and
fixed it.



end of thread, other threads:[~2011-09-26 10:17 UTC | newest]

Thread overview: 28+ messages
2010-09-27 13:13 [Bug lto/45810] New: 40% slowdown when using LTO for a single-file program burnus at gcc dot gnu.org
2010-09-27 15:48 ` [Bug lto/45810] " Joost.VandeVondele at pci dot uzh.ch
2010-09-27 15:54 ` rguenth at gcc dot gnu.org
2010-09-28 15:35 ` burnus at gcc dot gnu.org
2010-09-28 16:24 ` rguenth at gcc dot gnu.org
2010-09-28 16:25 ` Joost.VandeVondele at pci dot uzh.ch
2010-09-28 16:50 ` rguenth at gcc dot gnu.org
2010-09-28 16:50 ` Joost.VandeVondele at pci dot uzh.ch
2010-09-28 16:55 ` burnus at gcc dot gnu.org
2010-09-30  3:27 ` dominiq at lps dot ens.fr
2010-09-30 19:54 ` dominiq at lps dot ens.fr
2011-01-08 20:41 ` hubicka at gcc dot gnu.org
2011-01-23 16:36 ` hubicka at gcc dot gnu.org
2011-01-23 18:08 ` hubicka at gcc dot gnu.org
2011-01-23 19:38 ` dominiq at lps dot ens.fr
2011-01-23 20:00 ` hubicka at gcc dot gnu.org
2011-01-23 21:02 ` hubicka at gcc dot gnu.org
2011-01-23 21:12 ` dominiq at lps dot ens.fr
2011-01-23 22:12 ` hubicka at gcc dot gnu.org
2011-01-23 22:39 ` hubicka at gcc dot gnu.org
2011-01-24  2:04 ` dominiq at lps dot ens.fr
2011-01-24  9:43 ` dominiq at lps dot ens.fr
2011-01-24 14:37 ` rguenth at gcc dot gnu.org
2011-01-24 18:09 ` howarth at nitro dot med.uc.edu
2011-01-24 18:38 ` dominiq at lps dot ens.fr
2011-02-16 18:44 ` dominiq at lps dot ens.fr
2011-09-22 15:53 ` dominiq at lps dot ens.fr
2011-09-26 10:37 ` rguenth at gcc dot gnu.org
