public inbox for gcc-bugs@sourceware.org
* [Bug fortran/48636] New: Enable more inlining with -O2 and higher
@ 2011-04-15 21:05 tkoenig at gcc dot gnu.org
2011-04-16 11:22 ` [Bug fortran/48636] " rguenth at gcc dot gnu.org
` (43 more replies)
0 siblings, 44 replies; 46+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2011-04-15 21:05 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
Summary: Enable more inlining with -O2 and higher
Product: gcc
Version: 4.7.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: fortran
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: tkoenig@gcc.gnu.org
See http://gcc.gnu.org/ml/fortran/2011-04/msg00144.html .
We should inline more for Fortran at higher optimization levels.
Options to consider:
- Set -finline-limit=600 for -O2 and higher
- Consider heuristics to mark functions as inline in the front end:
  - If it has many arguments (argument processing has a lot of effect)
  - If it uses assumed-shape arrays (setting up that array descriptor
    may take a lot of time)
- Mark everything as inline
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: rguenth at gcc dot gnu.org @ 2011-04-16 11:22 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
Richard Guenther <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.04.16 09:59:01
     Ever Confirmed|0                           |1
--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-04-16 09:59:01 UTC ---
Changing params from the Frontend won't work for LTO. Declaring some or all
functions inline would work.
See also http://gcc.gnu.org/ml/gcc-patches/2011-02/msg00973.html and
discussion (the patch probably doesn't apply anymore). Thus, I'd like to
address this in the middle-end, but if there are suitable heuristics on
when to set DECL_DECLARED_INLINE for the frontend by all means go ahead.
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: dominiq at lps dot ens.fr @ 2011-04-17 10:23 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #2 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-04-17 10:23:03 UTC ---
As shown by the following results it seems that --param max-inline-insns-auto=*
is the way to go.
Date & Time : 17 Apr 2011 11:22:05
Test Name : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear
-fomit-frame-pointer --param max-inline-insns-auto=400 -fwhole-program -flto
-fstack-arrays -o %n
Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct
linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times : 300.0
Target Error % : 0.200
Minimum Repeats : 2
Maximum Repeats : 5
Benchmark  Compile  Executable  Ave Run   Number    Estim
Name        (secs)     (bytes)   (secs)  Repeats    Err %
---------  -------  ----------  -------  -------   ------
ac            8.07       54576     8.11        2   0.0062
aermod      175.22     1472624    18.83        2   0.1647
air          25.65       89992     6.78        5   0.1871
capacita     14.02      109536    40.36        2   0.0483
channel       3.11       34448     2.94        5   0.6012
doduc        29.46      224584    27.44        2   0.0437
fatigue       9.85       77032     2.74        2   0.0365
gas_dyn      26.09      144112     4.68        5   0.6928
induct       24.32      189696    14.24        2   0.1193
linpk         3.13       21536    21.69        2   0.0254
mdbx          9.18       84776    12.55        2   0.0678
nf           34.14      124640    18.38        2   0.1034
protein      28.14      155624    35.48        2   0.0789
rnflow       43.93      204176    26.70        2   0.0262
test_fpu     21.90      141696    11.18        2   0.0045
tfft          1.71       22072     3.29        5   0.1369
Geometric Mean Execution Time = 11.60 seconds
================================================================================
Date & Time : 17 Apr 2011 11:50:20
Test Name : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear
-fomit-frame-pointer -finline-limit=600 -fwhole-program -flto -fstack-arrays -o
%n
Benchmarks : ac aermod air capacita channel doduc fatigue gas_dyn induct
linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times : 300.0
Target Error % : 0.200
Minimum Repeats : 2
Maximum Repeats : 5
Benchmark  Compile  Executable  Ave Run   Number    Estim
Name        (secs)     (bytes)   (secs)  Repeats    Err %
---------  -------  ----------  -------  -------   ------
ac            8.06       54576     8.11        2   0.0062
aermod      175.54     1480632    18.92        2   0.0106
air          25.36       89992     6.76        2   0.0740
capacita     13.95      109536    40.32        2   0.0161
channel       3.13       34448     2.95        5   0.1703
doduc        27.31      212280    27.18        2   0.0331
fatigue       9.82       77032     2.74        2   0.0182
gas_dyn      24.86      144112     4.67        5   0.3052
induct       24.25      189696    14.21        2   0.0035
linpk         2.55       21536    21.69        2   0.0023
mdbx          9.17       84776    12.53        2   0.0239
nf           34.21      124640    18.41        4   0.1634
protein      28.01      155624    35.46        2   0.0310
rnflow       38.11      183696    26.74        2   0.0037
test_fpu     19.63      141720    10.84        2   0.0323
tfft          1.69       22072     3.29        2   0.0152
Geometric Mean Execution Time = 11.57 seconds
================================================================================
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2011-04-17 10:44 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-04-17 10:44:23 UTC ---
I am slowly starting to look into Fortran issues now. For years this was a
non-issue, since we had the non-one-decl-per-function problem. This is finally
solved.
One additional problem is that we often hit large-stack-frame limits because
the Fortran I/O drops large data structures on the stack. Consequently, any
functions that do I/O (for debug purposes, for example) are not inlined into
functions that don't. We will need to relax this.
> - Consider heuristics to mark functions as inline in the front end:
>   - If it has many arguments (argument processing has a lot of effect)
>   - If it uses assumed-shape arrays (setting up that array descriptor
>     may take a lot of time
> - Mark everything as inline
I briefly discussed the option of marking everything as inline on IRC for the
4.6.x series, but it did not go well with Richard. The observation is that the
dominant coding style in Fortran is to not care that much about code size if
performance improves, and inlining does help here.
In the longer term it would be cool if the inliner was able to work out as much
as possible itself, w/o frontend help.
The first item you mention is something the backend can do on its own (and it
already knows how to benefit from many arguments, but it probably does not do
it enough to make a difference for Fortran). I am just about to commit a patch
that makes the backend a hair more sensitive to this.
The second item is interesting - it would be cool if backend was able to work
out that the code is supposed to simplify after inlining. Either by itself or
by frontend hint.
Can you provide me a very simple testcase for that, so I can look into how it
looks in the backend? Perhaps some kind of frontend hinting would work well here.
Honza
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: tkoenig at gcc dot gnu.org @ 2011-04-17 13:32 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #4 from Thomas Koenig <tkoenig at gcc dot gnu.org> 2011-04-17 13:32:11 UTC ---
(In reply to comment #3)
> The second item is interesting - it would be cool if backend was able to work
> out that the code is supposed to simplify after inlining. Either by itself or
> by frontend hint.
> Can you provide me very simple testcase for that I can look into how it looks
> like in backend? Perhaps some kind of frontend hinting would work well here.
Here is some sample code (extreme, I admit) which profits a lot from
inlining:
- Strides are known to be one when inlining (a common case, but you can
never be sure if the user doesn't call a(1:5:2))
- Expensive setting up of, and reading from the array descriptor
- Loops can be completely unrolled
module foo
  implicit none
contains
  subroutine bar(a,x)
    real, dimension(:,:), intent(in) :: a
    real, intent(out) :: x
    integer :: i,j

    x = 0
    do j=1,ubound(a,2)
      do i=1,ubound(a,1)
        x = x + a(i,j)**2
      end do
    end do
  end subroutine bar
end module foo

program main
  use foo
  implicit none
  real, dimension(2,3) :: a
  real :: x

  data a /1.0, 2.0, 3.0, -1.0, -2.0, -3.0/

  call bar(a,x)
  print *,x
end program main
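
To make the three points above concrete, here is a hand-written sketch of
roughly what main reduces to once bar has been inlined and both loops have been
unrolled. It is only an illustration under that assumption, not actual compiler
output, and the program name main_inlined is made up.

```fortran
! Hand-written illustration only, not compiler output.
! With bar inlined into main, the extents (2 and 3) and the unit stride
! are compile-time constants, so no descriptor has to be built or read
! and both loops can be unrolled completely.
program main_inlined
  implicit none
  real, dimension(2,3) :: a
  real :: x

  data a /1.0, 2.0, 3.0, -1.0, -2.0, -3.0/

  ! Same summation order as the original j/i loop nest.
  x = a(1,1)**2 + a(2,1)**2 + a(1,2)**2 + a(2,2)**2 + a(1,3)**2 + a(2,3)**2
  print *, x
end program main_inlined
```

The sum of squares is 28 in both versions; only the amount of work needed to
get there differs.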
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: dominiq at lps dot ens.fr @ 2011-04-17 14:12 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #5 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-04-17 14:12:30 UTC ---
I have investigated why test_fpu is slower with --param
max-inline-insns-auto=400 (11.18s) compared to -finline-limit=600 (10.84s) in
the timings of comment #2. This is due to the inlining of dgemm in the fourth
test Lapack 2:
[macbook] lin/test% gfc -Ofast -funroll-loops -fstack-arrays --param
max-inline-insns-auto=385 test_lap.f90
[macbook] lin/test% time a.out
Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts 2.6 sec Err= 0.000000000000250
total = 2.6 sec
2.824u 0.081s 0:02.90 100.0% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfc -Ofast -funroll-loops -fstack-arrays --param
max-inline-insns-auto=386 test_lap.f90
[macbook] lin/test% time a.out
Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts 3.0 sec Err= 0.000000000000250
total = 3.0 sec
3.214u 0.082s 0:03.29 100.0% 0+0k 0+0io 0pf+0w
Looking at the assembly, I see 'call _dgemm_' three times for 385 and none
for 386 (note that there are only two calls in the code: one in dgetri, always
inlined, and one in dgetrf, not inlined). It would be interesting to understand
why inlining dgemm slows down the code.
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: jb at gcc dot gnu.org @ 2011-04-20 11:22 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
Janne Blomqvist <jb at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jb at gcc dot gnu.org
--- Comment #6 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-04-20 11:21:15 UTC ---
Note that some of these issues might change with the new array descriptor that
we must introduce at some point (the hope is that it'll get in for 4.7, but it
remains to be seen if there is enough time). See
http://gcc.gnu.org/wiki/ArrayDescriptorUpdate
For instance (comments inline):
(In reply to comment #4)
> (In reply to comment #3)
>
> > The second item is interesting - it would be cool if backend was able to work
> > out that the code is supposed to simplify after inlining. Either by itself or
> > by frontend hint.
> > Can you provide me very simple testcase for that I can look into how it looks
> > like in backend? Perhaps some kind of frontend hinting would work well here.
>
> Here is some sample code (extreme, I admit) which profits a lot from
> inlining:
>
> - Strides are known to be one when inlining (a common case, but you can
> never be sure if the user doesn't call a(1:5:2))
Not strictly related to inlining, but in the new descriptor we'll have a field
specifying whether the array is simply contiguous, so it might make sense to
generate two loops for each loop over the array in the source, one for the
contiguous case where it can be vectorized etc. and another loop for the
general case. This might reduce the profitability of inlining.
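
The "two loops" idea can be sketched at source level, assuming a compiler that
supports the Fortran 2008 is_contiguous intrinsic and the CONTIGUOUS attribute.
The module and procedure names (versioned_sum, sumsq, sumsq_contig,
sumsq_strided) are made up for illustration; a compiler doing this internally
would version the loop itself rather than call helper functions.

```fortran
! Sketch only: source-level loop versioning on contiguity.
module versioned_sum
  implicit none
contains
  ! Fast version: CONTIGUOUS promises unit stride, so the loop is easy
  ! to vectorize.
  function sumsq_contig(a) result(s)
    real, contiguous, intent(in) :: a(:)
    real :: s
    integer :: i
    s = 0.0
    do i = 1, size(a)
      s = s + a(i)**2
    end do
  end function sumsq_contig

  ! General version: the stride is only known at run time.
  function sumsq_strided(a) result(s)
    real, intent(in) :: a(:)
    real :: s
    integer :: i
    s = 0.0
    do i = 1, size(a)
      s = s + a(i)**2
    end do
  end function sumsq_strided

  ! Dispatcher: a single run-time check selects the loop version.
  function sumsq(a) result(s)
    real, intent(in) :: a(:)
    real :: s
    if (is_contiguous(a)) then
      s = sumsq_contig(a)
    else
      s = sumsq_strided(a)
    end if
  end function sumsq
end module versioned_sum

program demo
  use versioned_sum
  implicit none
  real :: v(6)
  v = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
  print *, sumsq(v)          ! whole array: takes the contiguous version
  print *, sumsq(v(1:5:2))   ! stride-2 section: takes the general version
end program demo
```

The cost is duplicated code for every such loop, which is the size trade-off
discussed in the replies.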
> - Expensive setting up of, and reading from the array descriptor
As we're planning to use the TR 29113 descriptor as the native one, this has
some implications for the procedure call interface as well. See
http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html
This will reduce the procedure call overhead substantially, at the cost of some
extra work in the caller in the case of non-default lower bounds.
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: burnus at gcc dot gnu.org @ 2011-04-20 12:29 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #7 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-04-20 12:29:02 UTC ---
(In reply to comment #6)
> > Here is some sample code (extreme, I admit) which profits a lot from
> > inlining:
> >
> > - Strides are known to be one when inlining (a common case, but you can
> > never be sure if the user doesn't call a(1:5:2))
First, you do not have any issue with strides if the dummy argument is either
allocatable, has the contiguous attribute, or is an explicit-shape or
assumed-size array.
For inlining, I see only one place where information loss happens: If a
simply-contiguous array is passed as actual argument to an assumed-shape dummy.
Then the Fortran front-end knows that the stride of the actual argument is 1,
but the callee needs to assume an arbitrary stride. The middle-end will
continue to do so as the "simply contiguous" information is lost - even though
it would be profitable for inlining.
> Not strictly related to inlining, but in the new descriptor we'll have a field
> specifying whether the array is simply contiguous
I am not sure we will indeed have one; initially I thought one should, but I am
no longer convinced that it is the right approach. My impression is now that
setting and updating the flag all the time is more expensive than doing an
is_contiguous() check once. The TR descriptor also does not have such a flag -
thus one needs to handle such arrays - if they come from C - with extra care.
(Unless one requires the C side to call a function, which could set this flag.
I think one does not need to do so.)
By the way, the latest version of the TR draft is linked at
http://j3-fortran.org/pipermail/interop-tr/2011-April/000582.html
> so it might make sense to
> generate two loops for each loop over the array in the source, one for the
> contiguous case where it can be vectorized etc. and another loop for the
> general case.
Maybe. Definitely not for -Os. Best would be if the middle end would be able to
generate automatically a stride-free version when it thinks that it is
profitable. The FE could also do it, if one had a way to tell the ME that it
might drop the stride-free version, if it thinks that it is more profitable.
> As we're planning to use the TR 29113 descriptor as the native one, this has
> some implications for the procedure call interface as well. See
> http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html
Regarding:
"For a descriptor of an assumed-shape array, the value of the
lower-bound member of each element of the dim member of the descriptor
shall be zero."
That's actually also not that different from the current situation: In Fortran,
the lower bound of assumed-shape arrays is also always the same: It is 1. Which
makes sense as one can then do the following w/o worrying about the lbound:
  subroutine bar(a)
    real :: a(:)
    do i = 1, ubound(a, dim=1)
      a(i) = ...
For explicit-shape/assumed-size arrays one does not have a descriptor and for
deferred-shape arrays (allocatables, pointers) the TR keeps the lbound - which
is the same as currently in Fortran.
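
A small self-contained check of these bound rules, as a sketch; the names
bounds_demo, lb_assumed and lb_deferred are made up for illustration.

```fortran
module bounds_demo
  implicit none
contains
  ! Assumed-shape dummy: the lower bound seen by the callee is 1,
  ! whatever the bounds of the actual argument are.
  integer function lb_assumed(a)
    real, intent(in) :: a(:)
    lb_assumed = lbound(a, dim=1)
  end function lb_assumed

  ! Deferred-shape (allocatable) dummy: the caller's bounds are kept.
  integer function lb_deferred(a)
    real, allocatable, intent(in) :: a(:)
    lb_deferred = lbound(a, dim=1)
  end function lb_deferred
end module bounds_demo

program demo
  use bounds_demo
  implicit none
  real, allocatable :: a(:)

  allocate(a(-5:5))
  a = 0.0
  print *, lb_assumed(a)    ! prints 1
  print *, lb_deferred(a)   ! prints -5
end program demo
```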
> This will reduce the procedure call overhead substantially, at the cost
> of some extra work in the caller in the case of non-default lower bounds.
Which is actually nothing new ... That's the reason that one often creates a
new descriptor for procedure calls.
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: jb at gcc dot gnu.org @ 2011-04-20 13:10 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #8 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-04-20 13:09:51 UTC ---
(In reply to comment #7)
> (In reply to comment #6)
> > > Here is some sample code (extreme, I admit) which profits a lot from
> > > inlining:
> > >
> > > - Strides are known to be one when inlining (a common case, but you can
> > > never be sure if the user doesn't call a(1:5:2))
>
> First, you do not have any issue with strides if the dummy argument is either
> allocatable, has the contiguous attribute, or is an explicit or assumed-sized
> array.
>
> For inlining, I see only one place where information loss happens: If a
> simply-contiguous array is passed as actual argument to a assumed-shape dummy.
> Then the Fortran front-end knows that the stride of the actual argument is 1,
> but the callee needs to assume an arbitrary stride. The middle-end will
> continue to do so as the "simply contiguous" information is lost - even though
> it would be profitable for inlining.
Passing simply contiguous arrays to assumed-shape dummies is a fairly common
case in "modern Fortran", so it would be nice if we could generate fast code
for this.
> > Not strictly related to inlining, but in the new descriptor we'll have a field
> > specifying whether the array is simply contiguous
>
> I am not sure we will indeed have one; initially I thought one should, but I am
> no longer convinced that it is the right approach. My impression is now that
> setting and updating the flag all the time is more expensive then doing once a
> is_contiguous() check.
Hmm, maybe. Shouldn't it be necessary to update the contiguous flag only when
passing slices to procedures with explicit interfaces? But OTOH, calculating
whether an array is simply contiguous at procedure entry is just a few
arithmetic operations anyway. But, in any case I don't have any profiling data
to argue which approach would be better.
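
As a rough sketch of what "a few arithmetic operations" could look like, the
following models a descriptor dimension as a stride/lbound/ubound triplet and
checks whether each stride equals the product of the lower extents. The type
dim_t and the function is_contig are hypothetical stand-ins, not the real
descriptor layout or any libgfortran routine, and corner cases such as
zero-sized arrays or extent-1 dimensions are ignored.

```fortran
module desc_model
  implicit none
  ! Hypothetical model of one descriptor dimension.
  type :: dim_t
    integer :: stride   ! distance between elements, in units of elements
    integer :: lbound
    integer :: ubound
  end type dim_t
contains
  ! An array is contiguous if the first stride is 1 and every further
  ! stride equals the product of the extents of the lower dimensions.
  pure function is_contig(dims) result(ok)
    type(dim_t), intent(in) :: dims(:)
    logical :: ok
    integer :: d, expected

    expected = 1
    ok = .true.
    do d = 1, size(dims)
      if (dims(d)%stride /= expected) then
        ok = .false.
        return
      end if
      expected = expected * (dims(d)%ubound - dims(d)%lbound + 1)
    end do
  end function is_contig
end module desc_model

program demo
  use desc_model
  implicit none
  type(dim_t) :: full(2), section(2)

  ! Model of a whole 4x6 array: strides 1 and 4.
  full = [dim_t(1, 1, 4), dim_t(4, 1, 6)]
  ! Model of the section a(1:4:2,:): strides 2 and 4, extents 2 and 6.
  section = [dim_t(2, 1, 2), dim_t(4, 1, 6)]

  print *, is_contig(full)      ! T
  print *, is_contig(section)   ! F
end program demo
```

That is one compare and one multiply per dimension, which supports the point
that the check is cheap compared with maintaining a flag everywhere.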
> The TR descriptor also does not such an flag - thus one
> needs to handle such arrays - if they come from C - with extra care. (Unless
> one requires the C side to call a function, which could set this flag. I think
> one does not need to do so.)
I suppose one cannot require the C side to set such a flag, as the TR doesn't
require its presence? Thus we'd need to calculate whether the array is simply
contiguous anyway if it's possible the array comes from C. Do such procedures
have to be marked with BIND(C) in some way or how does this work?
In any case, maybe this is what tips it in favor of always calculating the
contiguousness instead of having a flag in the descriptor - it would be one
single way of handling it, reducing the possibility of bugs. Also, if the
contiguousness isn't used for anything in the procedure, the dead code
elimination should delete it anyway.
> > As we're planning to use the TR 29113 descriptor as the native one, this has
> > some implications for the procedure call interface as well. See
> > http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html
>
> Regarding:
> "For a descriptor of an assumed-shape array, the value of the
> lower-bound member of each element of the dim member of the descriptor
> shall be zero."
>
> That's actually also not that different from the current situation: In Fortran,
> the lower bound of assumed-shape arrays is also always the same: It is 1.
Yes. But what is different from the current situation is that the above is what
the standard requires semantically, and the implementation is free to implement
it as it sees fit. In the TR, OTOH, we have the explicit requirement that on
procedure entry the lower bounds in the descriptor should be 0. This of course
applies only to inter-operable procedures, for "pure Fortran" we're still free
to do as we please. But again, it might make sense to do it the same way in
both cases in order to reduce the implementation and maintenance burden.
> For explicit-shape/assumed-size arrays one does not have a descriptor and for
> deferred-shape arrays (allocatables, pointers) the TR keeps the lbound - which
> is the same as currently in Fortran.
Yes.
> > This will reduce the procedure call overhead substantially, at the cost
> > of some extra work in the caller in the case of non-default lower bounds.
>
> Which is actually nothing new ... That's the reason that one often creates a
> new descriptor for procedure calls.
But do we actually do this? I did some tests a while ago, and IIRC for assumed
shape dummy arguments the procedure always calculates new bounds such that they
start from 1. That is, the procedure assumes that the actual argument
descriptor may have lower bounds != 1.
So my argument is basically that with the new descriptor it might make sense to
switch the responsibility around such that it's the caller who makes sure that
all lower bounds are 0 (as we must have the capability to do this anyway in
order to call inter-operable procedures, no?) instead of the callee.
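
A sketch of the caller-side fix-up, using a made-up descriptor model (dim_t,
desc_t, rebase_to_zero and linear_index are all hypothetical names). It assumes
that an element is located at offset plus the sum of index times stride, as in
the dumps in this thread, and shows that shifting all lower bounds to 0 leaves
every element at the same position.

```fortran
module rebase_model
  implicit none
  type :: dim_t
    integer :: stride, lbound, ubound
  end type dim_t
  type :: desc_t
    integer :: offset          ! added to the sum of index*stride
    type(dim_t) :: dims(2)
  end type desc_t
contains
  ! Position of element (i,j) relative to the data pointer.
  pure integer function linear_index(d, i, j)
    type(desc_t), intent(in) :: d
    integer, intent(in) :: i, j
    linear_index = d%offset + i*d%dims(1)%stride + j*d%dims(2)%stride
  end function linear_index

  ! Caller-side fix-up: shift every dimension so its lower bound is 0
  ! and fold the shift into the offset.
  pure function rebase_to_zero(d) result(r)
    type(desc_t), intent(in) :: d
    type(desc_t) :: r
    integer :: k

    r = d
    do k = 1, size(d%dims)
      r%offset = r%offset + d%dims(k)%lbound * d%dims(k)%stride
      r%dims(k)%ubound = d%dims(k)%ubound - d%dims(k)%lbound
      r%dims(k)%lbound = 0
    end do
  end function rebase_to_zero
end module rebase_model

program demo
  use rebase_model
  implicit none
  type(desc_t) :: d, z

  ! Model of a contiguous array declared as a(-1:2, 5:7).
  d%dims = [dim_t(1, -1, 2), dim_t(4, 5, 7)]
  d%offset = -((-1)*1 + 5*4)
  z = rebase_to_zero(d)

  ! Element a(0,6) is element (1,1) of the zero-based view;
  ! both map to the same position.
  print *, linear_index(d, 0, 6), linear_index(z, 1, 1)
end program demo
```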
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: burnus at gcc dot gnu.org @ 2011-04-20 15:42 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #9 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-04-20 15:39:47 UTC ---
> But do we actually do this? I did some tests a while ago, and IIRC for assumed
> shape dummy arguments the procedure always calculates new bounds such that they
> start from 1. That is, the procedure assumes that the actual argument
> descriptor may have lower bounds != 1.
> So my argument is basically that with the new descriptor it might make sense to
> switch the responsibility around such that it's the caller who makes sure that
> all lower bounds are 0 (as we must have the capability to do this anyway in
> order to call inter-operable procedures, no?) instead of the callee.
No, the conversion is already done in the caller:
  subroutine bar(B)
    interface; subroutine foo(a); integer :: a(:); end subroutine foo
    end interface
    integer :: B(:)
    call foo(B)
  end subroutine bar
Shows:
parm.4.dim[0].lbound = 1;
[...]
foo (&parm.4);
For assumed-shape actual arguments, creating a new descriptor is actually not
needed - only for deferred shape ones - or if one does not have a full array
ref.
Cf. gfc_conv_array_parameter, which is called by gfc_conv_procedure_call.
However, some additional calculation is also done in the callee to
determine the stride and offset; e.g.
ubound.0 = (b->dim[0].ubound - b->dim[0].lbound) + 1;
again, if the dummy argument is not deferred-shaped (allocatable or pointer),
one actually knows that "b->dim[0].lbound" == 1. I think we have some
redundancy here -> missed optimization.
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: tkoenig at gcc dot gnu.org @ 2011-04-20 16:41 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #10 from Thomas Koenig <tkoenig at gcc dot gnu.org> 2011-04-20 16:40:46 UTC ---
(In reply to comment #6)
> Not strictly related to inlining, but in the new descriptor we'll have a field
> specifying whether the array is simply contiguous, so it might make sense to
> generate two loops for each loop over the array in the source, one for the
> contiguous case where it can be vectorized etc. and another loop for the
> general case. This might reduce the profitability of inlining.
Consider the following, hand-crafted matmul:
Here, we have three nested loops. The most interesting one is
the innermost loop of the matmul, which we vectorize by inlining if we omit
the call to my_matmul with non-unity stride for a when compiling with
-fwhole-program -O3.
How many versions of the loop should we generate? Two or eight, depending
on what the caller may do? ;-)
module foo
  implicit none
contains
  subroutine my_matmul(a,b,c)
    implicit none
    integer :: count, m, n
    real, dimension(:,:), intent(in) :: a,b
    real, dimension(:,:), intent(out) :: c
    integer :: i,j,k

    m = ubound(a,1)
    n = ubound(b,2)
    count = ubound(a,2)
    c = 0
    do j=1,n
      do k=1, count
        do i=1,m
          c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
      end do
    end do
  end subroutine my_matmul
end module foo

program main
  use foo
  implicit none
  integer, parameter :: factor=100
  integer, parameter :: n = 2*factor, m = 3*factor, count = 4*factor
  real, dimension(m, count) :: a
  real, dimension(count, n) :: b
  real, dimension(m,n) :: c1, c2
  real, dimension(m/2, n) :: ch_1, ch_2

  call random_number(a)
  call random_number(b)
  call my_matmul(a,b,c1)
  c2 = matmul(a,b)
  if (any(abs(c1 - c2) > 1e-5)) call abort
  call my_matmul(a(1:m:2,:),b,ch_1)
  ch_2 = matmul(a(1:m:2,:),b)
  if (any(abs(ch_1 - ch_2) > 1e-5)) call abort
end program main
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: jb at gcc dot gnu.org @ 2011-04-20 18:15 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #11 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-04-20 18:14:20 UTC ---
(In reply to comment #9)
> > But do we actually do this? I did some tests a while ago, and IIRC for assumed
> > shape dummy arguments the procedure always calculates new bounds such that they
> > start from 1. That is, the procedure assumes that the actual argument
> > descriptor may have lower bounds != 1.
> > So my argument is basically that with the new descriptor it might make sense to
> > switch the responsibility around such that it's the caller who makes sure that
> > all lower bounds are 0 (as we must have the capability to do this anyway in
> > order to call inter-operable procedures, no?) instead of the callee.
>
> No, the conversion is already done in the caller:
>
> subroutine bar(B)
> interface; subroutine foo(a); integer :: a(:); end subroutine foo
> end interface
> integer :: B(:)
> call foo(B)
> end subroutine bar
>
> Shows:
> parm.4.dim[0].lbound = 1;
> [...]
> foo (&parm.4);
> For assumed-shape actual arguments, creating a new descriptor is actually not
> needed - only for deferred shape ones - or if one does not have a full array
> ref.
> Cf. gfc_conv_array_parameter, which is called by gfc_conv_procedure_call.
>
> However, some additional calculation is also done in the the callee to
> determine the stride and offset; e.g.
> ubound.0 = (b->dim[0].ubound - b->dim[0].lbound) + 1;
> again, if the dummy argument is not deferred-shaped (allocatable or pointer),
> one actually knows that "b->dim[0].lbound" == 1. I think we have some
> redundancy here -> missed optimization.
Yes, there seems to be some redundancy indeed in that case. I dug up my
testcase:
module asstest
  implicit none
contains
  subroutine assub(a, r)
    real, intent(in) :: a(:,:)
    real, intent(out) :: r
    r = a(42,43)
  end subroutine assub

  subroutine assub2(a, r)
    real, intent(in), allocatable :: a(:,:)
    real, intent(out) :: r
    r = a(42,43)
  end subroutine assub2
end module asstest
The -fdump-tree-original tree for this module is:
assub2 (struct array2_real(kind=4) & a, real(kind=4) & r)
{
  *r = (*(real(kind=4)[0:] *) a->data)[(a->dim[0].stride * 42 +
       a->dim[1].stride * 43) + a->offset];
}

assub (struct array2_real(kind=4) & a, real(kind=4) & r)
{
  integer(kind=8) ubound.0;
  integer(kind=8) stride.1;
  integer(kind=8) ubound.2;
  integer(kind=8) stride.3;
  integer(kind=8) offset.4;
  integer(kind=8) size.5;
  real(kind=4)[0:D.1567] * a.0;
  integer(kind=8) D.1567;
  bit_size_type D.1568;
  <unnamed-unsigned:64> D.1569;

  {
    integer(kind=8) D.1566;

    D.1566 = a->dim[0].stride;
    stride.1 = D.1566 != 0 ? D.1566 : 1;
    a.0 = (real(kind=4)[0:D.1567] *) a->data;
    ubound.0 = (a->dim[0].ubound - a->dim[0].lbound) + 1;
    stride.3 = a->dim[1].stride;
    ubound.2 = (a->dim[1].ubound - a->dim[1].lbound) + 1;
    size.5 = stride.3 * NON_LVALUE_EXPR <ubound.2>;
    offset.4 = -stride.1 - NON_LVALUE_EXPR <stride.3>;
    D.1567 = size.5 + -1;
    D.1568 = (bit_size_type) size.5 * 32;
    D.1569 = (<unnamed-unsigned:64>) size.5 * 4;
  }
  *r = (*a.0)[(stride.1 * 42 + stride.3 * 43) + offset.4];
}
So if we make sure that the caller fixes up the descriptor so that bounds are
correct for assumed-shape parameters (as the TR requires for inter-operable
procedures), then assub could be as simple as assub2.
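
For reference, the index arithmetic in the assub dump can be replayed by hand.
The sketch below assumes a full, contiguous 100x50 actual argument, so that the
two strides are 1 and 100; the program name and the variables stride1, stride3
and offset4 simply mirror stride.1, stride.3 and offset.4 from the dump and are
otherwise made up.

```fortran
program assub_index
  implicit none
  integer, parameter :: n1 = 100, n2 = 50
  real :: a(n1, n2)
  real :: flat(0:n1*n2-1)       ! zero-based copy in column-major order
  integer :: stride1, stride3, offset4, idx

  call random_number(a)
  flat = reshape(a, [n1*n2])

  ! Values the prologue of assub would compute for this actual argument.
  stride1 = 1
  stride3 = n1
  offset4 = -stride1 - stride3

  ! Same expression as the final statement of the dump.
  idx = stride1*42 + stride3*43 + offset4
  print *, idx, flat(idx) == a(42,43)   ! prints 4241 and T
end program assub_index
```

The offset of -stride.1 - stride.3 is exactly the correction for lower bounds
of 1 in both dimensions.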
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2011-05-04 16:23 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #12 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-05-04 16:09:19 UTC ---
Hi,
I discussed some of the issues today with Martin. For the array descriptor
testcase, we really want ipa-cp to propagate the constant array bounds
instead of making the inliner blindly inline enough in all cases.
For that we need
1) Make ipa-prop work on aggregates. For aggregates passed by value we
can have jump functions that define known constants at known offsets
2) Make ipa-inline-analysis produce predicates on constantness of
aggregate fields in the same format
3) Array descriptors are passed by reference, rather than by value. This
needs further work, since we need to be sure that the value passed does not change
by aliasing. IPA-SRA would help here if it was really SRA and we had
-fwhole-program, but that is weak. We would need IPA-PTA to solve this in
general. Perhaps the frontend could help us here since the descriptors are probably
constant after they are initialized, or is there a way to change an existing
descriptor?
4) Make ipa-inline-analysis understand that determining loop bounds is
very cool to do.
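The aliasing concern in point 3 can be sketched in C. With a by-value copy the callee owns its bounds, while with a by-reference descriptor any opaque call in between may change them, so a bound known at the call site cannot simply be assumed inside the callee. The types and functions below (desc1, shared, shrink_shared, extent_by_value, extent_by_ref) are hypothetical and only illustrate the problem; they are not GCC internals.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rank-1 descriptor, for illustration only. */
typedef struct { double *data; ptrdiff_t lbound, ubound; } desc1;

static desc1 shared = { NULL, 1, 10 };

/* Stands in for an opaque callee that can reach the same descriptor. */
static void shrink_shared(void)
{
    assert(shared.ubound >= shared.lbound);
    shared.ubound -= 1;
}

/* By value: the callee works on a private copy, so hook() cannot change it. */
static ptrdiff_t extent_by_value(desc1 a, void (*hook)(void))
{
    hook();
    return a.ubound - a.lbound + 1;
}

/* By reference: the bounds must be loaded after hook(), which may alias *a. */
static ptrdiff_t extent_by_ref(const desc1 *a, void (*hook)(void))
{
    hook();
    return a->ubound - a->lbound + 1;
}
```

Calling extent_by_value(shared, shrink_shared) still sees the original extent, whereas extent_by_ref(&shared, shrink_shared) sees the modified one; a propagation pass has to prove that the second situation cannot occur before treating the bounds as constants.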
I also looked into the dumps of fatigue. One obvious problem is that we
overestimate stack sizes by about a factor of 10. This seems to be mostly due to
I/O routine structures that get packed later.
We used to take the results of stack frame packing, but Steven reverted this
behaviour and now we estimate stack sizes by simply summing up the sizes of
local arrays. I wonder whether we want to revert to the original way, at least when
optimizing and when generating the summary for the late inliner (the early inliner
probably does not care, and Steven's main concern was that this is computed 3 times,
twice for the early inliner and once for the real inliner).
Honza
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: burnus at gcc dot gnu.org @ 2011-05-04 17:31 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #13 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-05-04 17:30:31 UTC ---
(In reply to comment #12)
> Perhaps the frontend could help us here since the descriptors are probably
> constant after they are initialized, or is there a way to change an existing
> descriptor?
Only if the dummy/formal argument is a pointer or an allocatable.
Regarding ipa-cp, wasn't there a problem with "fn spec" (cf. PR 45579)? And
many Fortran procedures have this attribute.
> This seems to be mostly due to I/O routine structures that get packed later.
We really need to start reusing them ... (Cf. PR 48419, PR 34705, and
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg01762.html )
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2011-06-04 18:08 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #14 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-06-04 18:06:01 UTC ---
Yeah, the fnspec issue is something we ought to solve. ipa-cp should be
effective on Fortran, so it should not disable itself there ;)
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: jamborm at gcc dot gnu.org @ 2012-07-03 17:44 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #15 from Martin Jambor <jamborm at gcc dot gnu.org> 2012-07-03 17:43:35 UTC ---
Hi,
(In reply to comment #12)
> Hi,
> I discussed some of the issues today with Martin. For the array descriptor
> testcase, we really want ipa-cp to propagate the constant array bounds
> instead of making the inliner blindly inline enough in all cases.
> For that we need
>
> 1) Make ipa-prop work on aggregates. For aggregates passed by value we
> can have jump functions that define known constants at known offsets
I posted a patch implementing this yesterday:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00039.html
> 2) Make ipa-inline-analysis produce predicates on constantness of
> aggregate fields in the same format
I posted a patch implementing this yesterday too:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00041.html
With this patch, the test case from comment #4 is inlined at -O3
without any parameter tweaking. It is in fact a testcase in that
patch. However, functions in real benchmarks are bigger. We will see
to what extent IPA-CP can help with them.
> 3) Array descriptors are passed by reference, rather than by value. This
> needs further work, since we need to be sure that the value passed does not change
> by aliasing. IPA-SRA would help here if it was really SRA and we had
> -fwhole-program, but that is weak. We would need IPA-PTA to solve this in
> general. Perhaps the frontend could help us here since the descriptors are probably
> constant after they are initialized, or is there a way to change an existing
> descriptor?
At the moment I'm relying on a slightly sophisticated intra-PTA and
TBAA. I'll try to investigate where this does not work in Fortran and
perhaps will have some suggestions afterwards.
> 4) Make ipa-inline-analysis understand that determining loop bounds is
> very cool to do.
Yep, knowing what constants should get a profitability "boost" would
be indeed very beneficial.
Meanwhile, the following patch also helps ipa-inline-analysis.c with
cases like fatigue2:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00052.html
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: jamborm at gcc dot gnu.org @ 2012-08-11 10:50 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #16 from Martin Jambor <jamborm at gcc dot gnu.org> 2012-08-11 10:50:29 UTC ---
Author: jamborm
Date: Sat Aug 11 10:50:24 2012
New Revision: 190313
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190313
Log:
2012-08-11 Martin Jambor <mjambor@suse.cz>
PR fortran/48636
* ipa-inline.h (condition): New fields offset, agg_contents and by_ref.
* ipa-inline-analysis.c (agg_position_info): New type.
(add_condition): New parameter aggpos, also store agg_contents, by_ref
and offset.
(dump_condition): Also dump aggregate conditions.
(evaluate_conditions_for_known_args): Also handle aggregate
conditions. New parameter known_aggs.
(evaluate_properties_for_edge): Gather known aggregate contents.
(inline_node_duplication_hook): Pass NULL known_aggs to
evaluate_conditions_for_known_args.
(unmodified_parm): Split into unmodified_parm and unmodified_parm_1.
(unmodified_parm_or_parm_agg_item): New function.
(set_cond_stmt_execution_predicate): Handle values passed in
aggregates.
(set_switch_stmt_execution_predicate): Likewise.
(will_be_nonconstant_predicate): Likewise.
(estimate_edge_devirt_benefit): Pass new parameter known_aggs to
ipa_get_indirect_edge_target.
(estimate_calls_size_and_time): New parameter known_aggs, pass it
recursively to itself and to estimate_edge_devirt_benefit.
(estimate_node_size_and_time): New vector known_aggs, pass it to
functions which need it.
(remap_predicate): New parameter offset_map, use it to remap aggregate
conditions.
(remap_edge_summaries): New parameter offset_map, pass it recursively
to itself and to remap_predicate.
(inline_merge_summary): Also create and populate vector offset_map.
(do_estimate_edge_time): New vector of known aggregate contents,
passed to functions which need it.
(inline_read_section): Stream new fields of condition.
(inline_write_summary): Likewise.
* ipa-cp.c (ipa_get_indirect_edge_target): Also examine the aggregate
contents. Let all local callers pass NULL for known_aggs.
* testsuite/gfortran.dg/pr48636.f90: New test.
Added:
trunk/gcc/testsuite/gfortran.dg/pr48636.f90
Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-cp.c
trunk/gcc/ipa-inline-analysis.c
trunk/gcc/ipa-inline.h
trunk/gcc/ipa-prop.h
trunk/gcc/testsuite/ChangeLog
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2012-08-21 6:54 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #17 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-08-21 06:54:09 UTC ---
Author: hubicka
Date: Tue Aug 21 06:54:01 2012
New Revision: 190556
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190556
Log:
PR fortran/48636
* ipa-inline.c (want_inline_small_function_p): Take loop_iterations hint.
(edge_badness): Likewise.
* ipa-inline.h (inline_hints_vals): Add INLINE_HINT_loop_iterations.
(inline_summary): Add loop_iterations.
* ipa-inline-analysis.c: Include tree-scalar-evolution.h.
(dump_inline_hints): Dump loop_iterations.
(reset_inline_summary): Free loop_iterations.
(inline_node_duplication_hook): Update loop_iterations.
(dump_inline_summary): Dump loop_iterations.
(will_be_nonconstant_expr_predicate): New function.
(estimate_function_body_sizes): Analyze loops.
(estimate_node_size_and_time): Set hint loop_iterations.
(inline_merge_summary): Merge loop iterations.
(inline_read_section): Stream in loop_iterations.
(inline_write_summary): Stream out loop_iterations.
Added:
trunk/gcc/testsuite/gcc.dg/ipa/inlinehint-1.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-inline-analysis.c
trunk/gcc/ipa-inline.c
trunk/gcc/ipa-inline.h
trunk/gcc/testsuite/ChangeLog
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2012-08-21 8:15 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #18 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-08-21 08:14:33 UTC ---
With the loop_iterations hint, we should now hint the bar function of the testcase in
comment #4, but we don't because the value is used conditionally:
# iftmp.11_3 = PHI <_12(3), 1(2)>
a.0_16 = a_11(D)->data;
_17 = a_11(D)->dim[0].ubound;
_18 = a_11(D)->dim[0].lbound;
_19 = _17 - _18;
ubound.0_20 = _19 + 1;
stride.3_21 = a_11(D)->dim[1].stride;
_22 = a_11(D)->dim[1].ubound;
_23 = a_11(D)->dim[1].lbound;
_24 = _22 - _23;
ubound.2_25 = _24 + 1;
_27 = -iftmp.11_3;
offset.4_28 = _27 - stride.3_21;
*x_35(D) = 0.0;
_53 = stride.3_21 >= 0;
_56 = ubound.2_25 > 0;
_57 = _53 & _56;
_59 = stride.3_21 < 0;
_60 = _57 | _59;
_66 = _59 | _60;
if (_66 != 0)
goto <bb 5>;
else
goto <bb 6>;
<bb 5>:
iftmp.13_68 = (integer(kind=4)) ubound.2_25;
<bb 6>:
# iftmp.13_4 = PHI <iftmp.13_68(5), 0(4)>
if (iftmp.13_4 > 0)
goto <bb 7>;
else
goto <bb 14>;
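Read as C, the dump amounts to roughly the following. This is a hand-written sketch with hypothetical names (trip_count, stride, extent), where stride stands for stride.3_21 and extent for ubound.2_25; it only mirrors the control flow shown above and is not compiler output.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the conditional trip count from the GIMPLE dump.
   stride ~ stride.3_21, extent ~ ubound.2_25 (hypothetical C names). */
static int32_t trip_count(int64_t stride, int64_t extent)
{
    /* Guard the narrowing conversion done by (integer(kind=4)) ubound.2_25. */
    assert(extent >= INT32_MIN && extent <= INT32_MAX);

    int nonneg_and_nonempty = (stride >= 0) & (extent > 0);          /* _57 */
    int negative = stride < 0;                                       /* _59 */
    int take_extent = negative | (nonneg_and_nonempty | negative);   /* _66 */

    /* iftmp.13_4 = PHI <iftmp.13_68, 0>: the bound reaches the loop test
       only through this selection. */
    return take_extent ? (int32_t) extent : 0;
}
```

The loop is entered only when the returned value is positive, so the descriptor bound reaches the loop test through a PHI selecting between the extent and 0 rather than directly.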
Martin, does this work with your PHI patch? (i.e. do you get "loop iterations"
in -fdump-ipa-inline?)
The next step will be to teach the inliner to inline into functions called once when
the callee is important and some propagation happens.
Honza
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2012-09-12 21:52 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #19 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-09-12 21:51:21 UTC ---
Author: hubicka
Date: Wed Sep 12 21:51:14 2012
New Revision: 191232
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=191232
Log:
PR fortran/48636
* gcc.dg/ipa/inlinehint-2.c: New testcase.
* ipa-inline-analysis.c (dump_inline_hints): Dump loop stride.
(set_hint_predicate): New function.
(reset_inline_summary): Reset loop stride.
(remap_predicate_after_duplication): New function.
(remap_hint_predicate_after_duplication): New function.
(inline_node_duplication_hook): Update.
(dump_inline_summary): Dump stride summaries.
(estimate_function_body_sizes): Compute strides.
(remap_hint_predicate): New function.
(inline_merge_summary): Use it.
(inline_read_section): Read stride.
(inline_write_summary): Write stride.
* ipa-inline.c (want_inline_small_function_p): Handle strides.
(edge_badness): Likewise.
* ipa-inline.h (inline_hints_vals): Add stride hint.
(inline_summary): Update stride.
Added:
trunk/gcc/testsuite/gcc.dg/ipa/inlinehint-2.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-inline-analysis.c
trunk/gcc/ipa-inline.c
trunk/gcc/ipa-inline.h
trunk/gcc/testsuite/ChangeLog
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2012-10-16 16:39 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #20 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-10-16 16:38:27 UTC ---
Created attachment 28456
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28456
Patch I am considering
Hi,
I am considering enabling inlining when inline-analysis says that the inlined
function will get significantly faster, regardless of the
inline-insns-single/inline-insns-auto limits. The patch is attached.
In general it may make more sense to gradually set the limits based on expected
speedup, but I am afraid this will become hard to understand and maintain. So
for now let's have simple boolean decisions.
It would be nice to know where it helps (i.e. for fatigue and cray) and where
it doesn't. It also causes quite considerable code size growth on some of
SPEC2000 for relatively little benefit, so I guess it will need more evaluation
and a reduction of the inline-insns-auto limits. It may also be a problem of
unrealistic estimates in ipa-inline-analysis; this is the first time we take them
really seriously.
Comments/ideas are welcome.
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: dominiq at lps dot ens.fr @ 2012-10-16 17:58 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #21 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2012-10-16 17:57:52 UTC ---
Before the patch in comment #20, I get
[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops
-ftree-loop-linear -fomit-frame-pointer --param max-inline-insns-auto=150
-fwhole-program -flto -fno-tree-loop-if-convert fatigue.f90
8.485u 0.205s 0:08.73 99.4% 0+0k 0+29io 0pf+0w
[macbook] lin/test% ll a.out
-rwxr-xr-x 1 dominiq staff 73336 Oct 16 19:19 a.out*
[macbook] lin/test% time a.out > /dev/null
2.916u 0.003s 0:02.92 99.6% 0+0k 0+1io 0pf+0w
[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops
-ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto
-fno-tree-loop-if-convert fatigue.f90
6.822u 0.193s 0:07.06 99.2% 0+0k 0+30io 0pf+0w
[macbook] lin/test% ll a.out
-rwxr-xr-x 1 dominiq staff 69312 Oct 16 19:21 a.out*
[macbook] lin/test% time a.out > /dev/null
4.851u 0.004s 0:04.86 99.7% 0+0k 0+1io 0pf+0w
After the patch I get
[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops
-ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto
-fno-tree-loop-if-convert fatigue.f90
7.277u 0.217s 0:07.52 99.4% 0+0k 0+28io 0pf+0w
[macbook] lin/test% ll a.out
-rwxr-xr-x 1 dominiq staff 69248 Oct 16 19:46 a.out*
[macbook] lin/test% time a.out > /dev/null
2.912u 0.003s 0:02.91 100.0% 0+0k 0+2io 0pf+0w
So for this particular test with the same options, after the patch the
compilation time is ~6% slower, the size is about the same (actually smaller;-)
and the run time ~40% faster. Without the patch and with --param
max-inline-insns-auto=150 compared to with the patch without this option, the
compilation time is ~20% slower, the size is ~6% larger, and the runtime is the
same.
Further testing coming, thanks for the patch.
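The percentages can be recomputed from the raw figures in the transcript with a few lines of C. This is only a sketch: pct_change and print_fatigue_summary are hypothetical helper names, and since the ratios are recomputed from the unrounded timings they may differ slightly from the rounded percentages quoted in the text.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change of `now` with respect to `base`, in percent. */
static double pct_change(double base, double now)
{
    assert(base > 0.0);
    return (now - base) / base * 100.0;
}

/* Figures copied from the fatigue.f90 transcript above. */
static void print_fatigue_summary(void)
{
    printf("same options, patched vs unpatched:\n");
    printf("  compile time %+.1f%%\n", pct_change(6.822, 7.277));
    printf("  size         %+.1f%%\n", pct_change(69312.0, 69248.0));
    printf("  run time     %+.1f%%\n", pct_change(4.851, 2.912));
    printf("unpatched with max-inline-insns-auto=150 vs patched:\n");
    printf("  compile time %+.1f%%\n", pct_change(7.277, 8.485));
    printf("  size         %+.1f%%\n", pct_change(69248.0, 73336.0));
    printf("  run time     %+.1f%%\n", pct_change(2.912, 2.916));
}
```

Negative values mean the second figure is smaller than the first, so a negative run-time change is a speedup.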
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: dominiq at lps dot ens.fr @ 2012-10-16 20:59 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #22 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2012-10-16 20:58:58 UTC ---
With the patch I see a ~10% slowdown in the Test4 - Lapack 2 (1001x1001) of
test_fpu.f90 compared to revision 192449
[macbook] lin/test% time /opt/gcc/gcc4.8c/bin/gfortran -fprotect-parens -Ofast
-funroll-loops test_lap.f90
6.742u 0.097s 0:06.87 99.4% 0+0k 0+20io 0pf+0w
[macbook] lin/test% a.out
Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts 2.6 sec Err= 0.000000000000250
total = 2.6 sec
[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-all-loops
test_lap.f90
9.489u 0.116s 0:09.62 99.6% 0+0k 0+16io 0pf+0w
[macbook] lin/test% a.out
Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts 2.8 sec Err= 0.000000000000250
total = 2.8 sec
This looks similar to what I saw in comment #5. However now dgetri is never
inlined while dgetrf is inlined with the patch. Also dtrmv and dscal are
inlined with the patch (respectively 20 and 21 occurrences without the patch).
The last difference I see is 35 occurrences of dswap with the patch compared to
32 without.
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: jakub at gcc dot gnu.org @ 2012-10-17 12:20 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jakub at gcc dot gnu.org
--- Comment #23 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-10-17 12:20:05 UTC ---
If the middle-end were given hints about the array descriptors (which fields
in the struct are important and which are not), perhaps IPA-CP could also handle
arguments pointing to array descriptors if all callers fill the descriptor in a
certain important way (where important would be constant strides and/or
constant start/end in some of the dimension(s)). I guess many Fortran routines
are just too large for inlining, but optimizing constant strides etc. might
still be useful.
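As a rough picture of what such a specialization could buy, here is a C sketch with hypothetical names (vec_desc, sum_generic, sum_unit_stride). The second function is a hand-written stand-in for a clone that a propagation pass might create when every caller passes a descriptor with stride 1; it is not what IPA-CP actually emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rank-1 descriptor, for illustration only. */
typedef struct { const double *data; ptrdiff_t stride, n; } vec_desc;

/* Generic version: the stride is loaded from the descriptor at run time. */
static double sum_generic(const vec_desc *v)
{
    assert(v->n >= 0);
    double s = 0.0;
    for (ptrdiff_t i = 0; i < v->n; ++i)
        s += v->data[i * v->stride];
    return s;
}

/* Hand-written stand-in for a clone specialized for stride == 1:
   the accesses are contiguous, which is friendlier to vectorization. */
static double sum_unit_stride(const vec_desc *v)
{
    assert(v->stride == 1 && v->n >= 0);
    double s = 0.0;
    for (ptrdiff_t i = 0; i < v->n; ++i)
        s += v->data[i];
    return s;
}
```

The specialized body stays out of line, so this kind of gain would be available even for routines that are too large to inline.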
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: dominiq at lps dot ens.fr @ 2012-10-17 13:13 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #24 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2012-10-17 13:13:24 UTC ---
Summary for the polyhedron tests (pb05):
(a) revision 192449 unpatched
(b) revision 192516 with patch in comment #20
options: -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear \
-fomit-frame-pointer -fwhole-program -flto -fno-tree-loop-if-convert
Benchmark  Compile           Executable              Ave Run
Name       (secs)            (bytes)                 (secs)
               (a)      (b)         (a)         (b)      (a)      (b)
---------  -------  -------  ----------  ----------  -------  -------
ac            6.30     9.66       54904       54856     9.12     8.49
aermod      129.56   178.13     1158904     1527680    17.69    17.28
air          25.64    29.95      102752      106712     7.15     7.16
capacita      6.34    18.99       69096      130536    42.00    40.57
channel       2.99     3.51       34736       34736     3.03     3.02
doduc        19.51    23.70      163528      196016    27.54    27.65
fatigue       7.03     7.54       69312       69248     4.92     2.94
gas_dyn      14.65    14.91       99320       99280     3.72     3.88
induct       19.50    21.02      149160      161552    13.18    13.14
linpk         1.84     2.91       22008       17832    21.65    21.69
mdbx          6.59    10.14       68800       80968    12.85    12.78
nf            6.34    24.25       59544      104544    18.95    18.88
protein      22.27    46.34      119056      156048    35.65    35.41
rnflow       17.75    31.23      114600      147280    23.56    23.54
test_fpu     10.69    20.66       76640      117512    10.89    10.95
tfft          1.62     2.93       22424       22384     3.34     3.33
Geometric Mean Execution Time = (a) 11.88 seconds (b) 11.43 seconds
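The geometric means can be checked from the "Ave Run" columns. The sketch below is not part of the pb05 harness; root16 and geomean16 are hypothetical helpers, and the 16th root is taken by bisection only so that the example does not depend on libm.

```c
#include <assert.h>
#include <stddef.h>

/* 16th root by bisection, so the sketch needs no libm. */
static double root16(double x)
{
    assert(x >= 0.0);
    double lo = 0.0, hi = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 200; ++i) {
        double mid = 0.5 * (lo + hi);
        double p = mid * mid;   /* mid^2  */
        p *= p;                 /* mid^4  */
        p *= p;                 /* mid^8  */
        p *= p;                 /* mid^16 */
        if (p < x)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}

/* Geometric mean of exactly 16 positive values. */
static double geomean16(const double t[16])
{
    double prod = 1.0;
    for (size_t i = 0; i < 16; ++i)
        prod *= t[i];
    return root16(prod);
}

/* "Ave Run" columns from the table above: (a) unpatched, (b) patched. */
static const double run_a[16] = {  9.12, 17.69,  7.15, 42.00,  3.03, 27.54,
                                   4.92,  3.72, 13.18, 21.65, 12.85, 18.95,
                                  35.65, 23.56, 10.89,  3.34 };
static const double run_b[16] = {  8.49, 17.28,  7.16, 40.57,  3.02, 27.65,
                                   2.94,  3.88, 13.14, 21.69, 12.78, 18.88,
                                  35.41, 23.54, 10.95,  3.33 };
```

Calling geomean16(run_a) and geomean16(run_b) reproduces the two figures reported on the line above to within rounding.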
I also see many failures for the gcc.dg/tree-ssa/slsr-* tests: slsr-2.c to
slsr-11.c, slsr-14.c to slsr-20.c, slsr-24.c, and slsr-25.c, and for
gfortran.dg/vect/vect-8.f90: the loop
do while(ii > 1)
   ipnt= ipntp
   ipntp= ipntp+ii
   ii= ishft(ii,-1)
   i= ipntp+1
!dir$ vector always
   x(ipntp+2:ipntp+ii+1)=x(ipnt+2:ipntp:2)-v(ipnt+2:ipntp:2) &
   &*x(ipnt+1:ipntp-1:2)-v(ipnt+3:ipntp+1:2)*x(ipnt+3:ipntp+1:2)
END DO
is not vectorized because
154: not vectorized: complicated access pattern.
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: dominiq at lps dot ens.fr @ 2012-10-17 14:06 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #25 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2012-10-17 14:05:51 UTC ---
> I also see many failures for the gcc.dg/tree-ssa/slsr-* tests: slsr-2.c to
> slsr-11.c, slsr-14.c to slsr-20.c, slsr-24.c, and slsr-25.c, and for
> gfortran.dg/vect/vect-8.f90: the loop ...
This seems to be a glitch which appeared between r192440 and r192516 and
disappeared before or at r192531.
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: vincenzo.innocente at cern dot ch @ 2012-10-19 8:45 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #26 from vincenzo Innocente <vincenzo.innocente at cern dot ch> 2012-10-19 08:45:03 UTC ---
I'm interested in testing the patch on our large application, currently compiled
with 4.7.2.
Would it be possible to get the same patch against gcc-4_7-branch?
Thanks
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2012-10-20 10:35 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #27 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-10-20 10:34:58 UTC ---
Thank you for testing. It seems that the patch works well for small benchmarks;
I will look into the lapack/test_fpu slowdown.
There is a problem in that it causes unacceptable growth on SPEC2k6 and SPEC2k in
non-LTO mode. I will need to analyze some of these testcases and see why we
predict so much speedup when there are no benefits at runtime.
Jakub: the plan is to make ipa-cp handle propagation across aggregates in
general (jump functions are already in place); that will handle the array
descriptors, too.
fatigue is, however, a different case: the values happen to be loop invariants
of the outer loop the function is called from. So inlining enables a lot of
invariant code motion. This is similar to cray. Both of these cases are now
understood by ipa-inline-analysis, but the fact is not really used without this
patch.
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: dominiq at lps dot ens.fr @ 2012-10-20 11:22 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #28 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2012-10-20 11:22:16 UTC ---
If I understand the patch correctly, the default value for
max-inline-min-speedup is 20. This could be over-aggressive: for fatigue.f90 the
threshold is between 94 (fast) and 95 (slow). I see a similar threshold for
test_fpu.f90 from 100 (fast) to ~97 (slow), unfortunately above 94.
Also the choice of true or false for big_speedup_p is based on a test
't-t1>a*t' which is equivalent to 't1<b*t' with b=1-a. Any reason for the
choice?
Last point, AFAICT the behavior of the different params tuning the inlining is
often non-monotonic (I am trying to investigate that in more detail).
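The equivalence of the two forms of the test can be checked with a small C sketch. The function names and the integer-percent formulation are assumptions made for illustration (they are not taken from the patch): t is the estimated time without inlining, t1 the estimated time with inlining, and min_speedup_pct plays the role of a expressed in percent. Note that subtracting t and rearranging gives a strict inequality, t1 < (1-a)*t.

```c
#include <assert.h>
#include <stdbool.h>

/* "Saving" form: t - t1 > a*t, with a given in percent. */
static bool big_speedup_saving_form(long t, long t1, long min_speedup_pct)
{
    assert(t > 0 && t1 >= 0);
    assert(min_speedup_pct > 0 && min_speedup_pct < 100);
    return (t - t1) * 100 > min_speedup_pct * t;
}

/* "Ratio" form: t1 < b*t with b = 1 - a, again in percent. */
static bool big_speedup_ratio_form(long t, long t1, long min_speedup_pct)
{
    assert(t > 0 && t1 >= 0);
    assert(min_speedup_pct > 0 && min_speedup_pct < 100);
    long b_pct = 100 - min_speedup_pct;
    return t1 * 100 < b_pct * t;
}
```

With integer arithmetic the two predicates agree for every input, including the boundary case where the saving is exactly a*t, so the choice between them is one of presentation rather than behaviour.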
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: tkoenig at gcc dot gnu.org @ 2012-10-20 12:11 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #29 from Thomas Koenig <tkoenig at gcc dot gnu.org> 2012-10-20 12:10:49 UTC ---
Another approach (not for the benchmarks) would be to
make inlining tunable by the user, e.g. support
!GCC$ ATTRIBUTES always_inline :: procedure_name
See PR 41209.
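For comparison, GNU C already offers this kind of per-function control through __attribute__((always_inline)). The sketch below is a C analogue only, assuming a GCC-compatible compiler; axpy1 and dot3 are hypothetical names, and it says nothing about how the Fortran directive would be implemented.

```c
#include <assert.h>
#include <stddef.h>

/* GNU C attribute; assumes a GCC-compatible compiler. The attribute asks
   for inlining regardless of the usual size limits. */
static inline __attribute__((always_inline))
double axpy1(double a, double x, double y)
{
    return a * x + y;
}

/* Small caller: each call to axpy1 is expected to be inlined here. */
static double dot3(const double *u, const double *v)
{
    assert(u != NULL && v != NULL);
    double s = 0.0;
    for (int i = 0; i < 3; ++i)
        s = axpy1(u[i], v[i], s);
    return s;
}
```

The user takes over the inlining decision for the marked function, which is the kind of tuning knob the directive above would give Fortran programmers.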
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2012-10-28 10:08 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #30 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-10-28 10:08:23 UTC ---
Created attachment 28543
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28543
Updated patch
Hi,
this is the updated patch I am testing. It fixes the big speedup test and also
changes the badness metric completely to be based on the expected speedup of the
caller. It seems to work pretty well in my tests, with the catch that I still
can't beat 4.6 on tramp3d.
* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2012-10-28 10:11 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #31 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-10-28 10:11:13 UTC ---
Concerning Vincenzo's request for a 4.7 version: it won't work, because the
patch depends on improvements to the inline metric and ipa-prop that we made
for 4.8.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Bug fortran/48636] Enable more inlining with -O2 and higher
2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
` (30 preceding siblings ...)
2012-10-28 10:11 ` hubicka at gcc dot gnu.org
@ 2012-10-28 11:27 ` vincenzo.innocente at cern dot ch
2012-11-07 9:34 ` hubicka at gcc dot gnu.org
` (11 subsequent siblings)
43 siblings, 0 replies; 46+ messages in thread
From: vincenzo.innocente at cern dot ch @ 2012-10-28 11:27 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #32 from vincenzo Innocente <vincenzo.innocente at cern dot ch> 2012-10-28 11:27:22 UTC ---
In a small test (that I will eventually publish here) the new patch at -O2
looks superior to 4.7.2 at -O3.
I would like to build a test with multiple source files where LTO matters,
though.
We will also try to build our whole software stack with 4.8 (we have a
production version with 4.7.2 at this point, so we can move to experimental
builds with 4.8…)
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Bug fortran/48636] Enable more inlining with -O2 and higher
2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
` (31 preceding siblings ...)
2012-10-28 11:27 ` vincenzo.innocente at cern dot ch
@ 2012-11-07 9:34 ` hubicka at gcc dot gnu.org
2012-11-07 11:18 ` hubicka at gcc dot gnu.org
` (10 subsequent siblings)
43 siblings, 0 replies; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-11-07 9:34 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #33 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-11-07 09:34:25 UTC ---
Created attachment 28628
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28628
Final patch (I hope)
This is the version of the patch I will commit today or tomorrow (depending on
when the autotesters get Martin's aggregate changes). Of course it no longer
helps fatigue: we now inline the loop and so do not hit the loop-based
heuristics. I made a separate patch for this. It does, however, help several
other testcases and should be a win for Fortran in general.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Bug fortran/48636] Enable more inlining with -O2 and higher
2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
` (32 preceding siblings ...)
2012-11-07 9:34 ` hubicka at gcc dot gnu.org
@ 2012-11-07 11:18 ` hubicka at gcc dot gnu.org
2012-11-08 16:46 ` hubicka at gcc dot gnu.org
` (9 subsequent siblings)
43 siblings, 0 replies; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-11-07 11:18 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #34 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-11-07 11:17:28 UTC ---
Created attachment 28629
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28629
Array index hint
This patch should help inlining when array descriptors become known, such as
in fatigue.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Bug fortran/48636] Enable more inlining with -O2 and higher
2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
` (33 preceding siblings ...)
2012-11-07 11:18 ` hubicka at gcc dot gnu.org
@ 2012-11-08 16:46 ` hubicka at gcc dot gnu.org
2012-11-11 18:15 ` hubicka at gcc dot gnu.org
` (8 subsequent siblings)
43 siblings, 0 replies; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-11-08 16:46 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #35 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-11-08 16:46:28 UTC ---
Author: hubicka
Date: Thu Nov 8 16:46:18 2012
New Revision: 193331
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193331
Log:
PR middle-end/48636
* ipa-inline.c (big_speedup_p): New function.
(want_inline_small_function_p): Use it.
(edge_badness): Dump it.
* params.def (inline-min-speedup): New parameter.
* doc/invoke.texi (inline-min-speedup): Document.
Modified:
trunk/gcc/ChangeLog
trunk/gcc/doc/invoke.texi
trunk/gcc/ipa-inline.c
trunk/gcc/params.def
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/winline-3.c
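
The idea of a "big speedup" test gated by a minimum-speedup parameter can be
sketched as below. This is only an illustrative Python sketch, not the code in
ipa-inline.c: the function name, its arguments, and the default threshold are
all assumptions made for the example. The log above only states that
big_speedup_p and the inline-min-speedup parameter were added.

```python
def big_speedup(uninlined_time: float, inlined_time: float,
                min_speedup_percent: float = 10.0) -> bool:
    """Hypothetical check: does inlining save more than a given percentage?

    uninlined_time      -- estimated time of the call when it is not inlined
    inlined_time        -- estimated time of the same code after inlining
    min_speedup_percent -- threshold in percent (the default is an assumption)
    """
    saved = uninlined_time - inlined_time
    # The saving must strictly exceed the requested fraction of the
    # uninlined time for the call to count as a "big speedup".
    return saved > uninlined_time * min_speedup_percent / 100.0


if __name__ == "__main__":
    # A call estimated at 100 time units that drops to 80 when inlined
    # saves 20%, which is above the assumed 10% threshold.
    print(big_speedup(100.0, 80.0))
    # A saving of only 5% stays below the threshold.
    print(big_speedup(100.0, 95.0))
```

A relative threshold like this scales with the cost of the call, so cheap and
expensive call sites are judged by the same percentage rather than by an
absolute saving.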
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Bug fortran/48636] Enable more inlining with -O2 and higher
2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
` (34 preceding siblings ...)
2012-11-08 16:46 ` hubicka at gcc dot gnu.org
@ 2012-11-11 18:15 ` hubicka at gcc dot gnu.org
2012-11-12 12:16 ` hubicka at gcc dot gnu.org
` (7 subsequent siblings)
43 siblings, 0 replies; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-11-11 18:15 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #36 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-11-11 18:14:40 UTC ---
Author: hubicka
Date: Sun Nov 11 18:14:35 2012
New Revision: 193406
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193406
Log:
PR middle-end/48636
* ipa-inline.c (want_inline_small_function_p): Take array index hint.
(edge_badness): Likewise.
* ipa-inline.h (inline_hints_vals): Add array_index and comments.
(inline_summary_: Add ARRAY_INDEX.
* ipa-inline-analysis.c (dump_inline_hints): Dump array_index hint.
(reset_inline_summary): Handle array_index hint.
(inline_node_duplication_hook): Likewise.
(dump_inline_summary): Likewise.
(array_index_predicate): New function.
(estimate_function_body_sizes): Use it.
(estimate_node_size_and_time): Use array_index hint.
(inline_merge_summary, inline_read_section): Likewise.
Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-inline-analysis.c
trunk/gcc/ipa-inline.c
trunk/gcc/ipa-inline.h
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Bug fortran/48636] Enable more inlining with -O2 and higher
2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
` (35 preceding siblings ...)
2012-11-11 18:15 ` hubicka at gcc dot gnu.org
@ 2012-11-12 12:16 ` hubicka at gcc dot gnu.org
2012-11-12 12:45 ` izamyatin at gmail dot com
` (6 subsequent siblings)
43 siblings, 0 replies; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-11-12 12:16 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #37 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-11-12 12:16:18 UTC ---
Fatigue now gets all of the inlining with -O3 -fwhole-program; with plain -O3
it gets only half of the inlining, because jump functions are not able to track
array descriptors in both calls to generalized_hookes_law.
What are the other testcases to look at?
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Bug fortran/48636] Enable more inlining with -O2 and higher
2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
` (36 preceding siblings ...)
2012-11-12 12:16 ` hubicka at gcc dot gnu.org
@ 2012-11-12 12:45 ` izamyatin at gmail dot com
2012-11-14 23:22 ` hubicka at gcc dot gnu.org
` (5 subsequent siblings)
43 siblings, 0 replies; 46+ messages in thread
From: izamyatin at gmail dot com @ 2012-11-12 12:45 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
Igor Zamyatin <izamyatin at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |izamyatin at gmail dot com
--- Comment #38 from Igor Zamyatin <izamyatin at gmail dot com> 2012-11-12 12:44:43 UTC ---
It looks like r193331 led to a significant regression on 172.mgrid on x86 with
-m32 -O3 -funroll-loops.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Bug fortran/48636] Enable more inlining with -O2 and higher
2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
` (37 preceding siblings ...)
2012-11-12 12:45 ` izamyatin at gmail dot com
@ 2012-11-14 23:22 ` hubicka at gcc dot gnu.org
2012-11-14 23:54 ` Jan Hubicka
2012-11-14 23:55 ` hubicka at ucw dot cz
` (4 subsequent siblings)
43 siblings, 1 reply; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-11-14 23:22 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #39 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-11-14 23:22:40 UTC ---
Hmm, indeed. Good catch. I will look into it.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [Bug fortran/48636] Enable more inlining with -O2 and higher
2012-11-14 23:22 ` hubicka at gcc dot gnu.org
@ 2012-11-14 23:54 ` Jan Hubicka
0 siblings, 0 replies; 46+ messages in thread
From: Jan Hubicka @ 2012-11-14 23:54 UTC (permalink / raw)
To: hubicka at gcc dot gnu.org; +Cc: gcc-bugs
mgrid does not seem to be sensitive to --param inline-min-speedup, so this looks like a regression independent of that parameter.
No idea what goes wrong.
Honza
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Bug fortran/48636] Enable more inlining with -O2 and higher
2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
` (38 preceding siblings ...)
2012-11-14 23:22 ` hubicka at gcc dot gnu.org
@ 2012-11-14 23:55 ` hubicka at ucw dot cz
2012-11-15 2:29 ` hubicka at gcc dot gnu.org
` (3 subsequent siblings)
43 siblings, 0 replies; 46+ messages in thread
From: hubicka at ucw dot cz @ 2012-11-14 23:55 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #40 from Jan Hubicka <hubicka at ucw dot cz> 2012-11-14 23:54:44 UTC ---
mgrid does not seem to be sensitive to --param inline-min-speedup, so this
looks like a regression independent of that parameter.
No idea what goes wrong.
Honza
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Bug fortran/48636] Enable more inlining with -O2 and higher
2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
` (39 preceding siblings ...)
2012-11-14 23:55 ` hubicka at ucw dot cz
@ 2012-11-15 2:29 ` hubicka at gcc dot gnu.org
2012-11-16 14:43 ` dominiq at lps dot ens.fr
` (2 subsequent siblings)
43 siblings, 0 replies; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-11-15 2:29 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #41 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-11-15 02:28:26 UTC ---
The mgrid regression is now tracked as PR55334.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Bug fortran/48636] Enable more inlining with -O2 and higher
2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
` (40 preceding siblings ...)
2012-11-15 2:29 ` hubicka at gcc dot gnu.org
@ 2012-11-16 14:43 ` dominiq at lps dot ens.fr
2013-03-01 17:49 ` wschmidt at gcc dot gnu.org
2013-03-04 17:54 ` wschmidt at gcc dot gnu.org
43 siblings, 0 replies; 46+ messages in thread
From: dominiq at lps dot ens.fr @ 2012-11-16 14:43 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
--- Comment #42 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2012-11-16 14:42:33 UTC ---
> Fatigue now gets all inlining with -O3 -fwhole-program, with -O3 it gets only
> half of inlining because jump functions are not able to track array descriptors
> in both calls to generalized_hookes_law.
The same applies to the tests in comments #4 and #10.
What is the status of "assumed-shape" arrays with respect to "array_index
hint"?
> What are the other testcases to look at?
ac.f90 and aermod.f90 give a 5-10% faster runtime when compiled with '--param
max-inline-insns-auto=150' (and -Ofast -fwhole-program).
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Bug fortran/48636] Enable more inlining with -O2 and higher
2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
` (41 preceding siblings ...)
2012-11-16 14:43 ` dominiq at lps dot ens.fr
@ 2013-03-01 17:49 ` wschmidt at gcc dot gnu.org
2013-03-04 17:54 ` wschmidt at gcc dot gnu.org
43 siblings, 0 replies; 46+ messages in thread
From: wschmidt at gcc dot gnu.org @ 2013-03-01 17:49 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
William J. Schmidt <wschmidt at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |wschmidt at gcc dot gnu.org
--- Comment #43 from William J. Schmidt <wschmidt at gcc dot gnu.org> 2013-03-01 17:48:51 UTC ---
(In reply to comment #38)
> Looks like for x86 r193331 led to significant regression on 172.mgrid for -m32
> -O3 -funroll-loops
The same degradation was seen on powerpc64-unknown-linux-gnu with r193331. The
fix by Martin Jambor for PR55334 did not help for -m32. It did give a slight
bump to -m64, but did not return the performance to pre-r193331 levels. So
there still seems to be a problem with 172.mgrid related to this change.
^ permalink raw reply [flat|nested] 46+ messages in thread
* [Bug fortran/48636] Enable more inlining with -O2 and higher
2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
` (42 preceding siblings ...)
2013-03-01 17:49 ` wschmidt at gcc dot gnu.org
@ 2013-03-04 17:54 ` wschmidt at gcc dot gnu.org
43 siblings, 0 replies; 46+ messages in thread
From: wschmidt at gcc dot gnu.org @ 2013-03-04 17:54 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636
William J. Schmidt <wschmidt at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |bergner at gcc dot gnu.org
--- Comment #44 from William J. Schmidt <wschmidt at gcc dot gnu.org> 2013-03-04 17:53:17 UTC ---
Compiling mgrid.f on powerpc64-unknown-linux-gnu as follows:
$ gfortran -S -m32 -O3 -mcpu=power7 -fpeel-loops -funroll-loops -ffast-math \
    -fvect-cost-model mgrid.f
I examined the assembly generated for revisions 193330, 193331 (this issue),
and 196171 (PR55334). What I'm seeing is that for both 193331 and 196171, the
inliner is much more aggressive, and in particular is inlining several copies
of some pretty large functions.
For -m32, I am not seeing any specialization of resid_, so although the change
in 196171 helped a little, it appears that this was by reducing overall code
size. There weren't any changes in inlining decisions. Of course there is a
lot of distance between 193331 and 196171, so it is not a perfect comparison,
though it appears 196171 is where -m32 received a slight boost.
Anyway, the non-inlined call tree for 193330 is:
main
  MAIN__
    resid_ (x4)
      comm3_
    psinv_ (x3)
      comm3_
    norm2u3_ (x2)
    interp_ (x2)
    setup_
    rprj3_ (x4)
    zran3_
The non-inlined call tree for 193331 is:
main
  MAIN__
    comm3_ (x5)
    resid_
      comm3_
    norm2u3_ (x2)
    setup_
    zran3_
So with 193331 we have the following additional inlines:
3 inlines of resid_, size = 1068, total size = 3204
3 inlines of psinv_, size = 1046, total size = 3138
2 inlines of interp_, size = 1544, total size = 3088
4 inlines of rprj3_, size = 220, total size = 880
Here "size" is the number of lines of assembly code of the called procedure,
including labels, so it's just a rough measure. The number of static call
sites of comm3_ was also reduced by one, but I don't know whether it was
inlined or specialized away.
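
The per-procedure totals above can be rechecked, and summed, with a few lines
of Python. The copy counts and sizes are taken directly from the list; the
variable names are just for this sketch.

```python
# (number of additional inlined copies, size in assembly lines) per procedure,
# as reported for r193331 in the list above.
extra_inlines = {
    "resid_": (3, 1068),
    "psinv_": (3, 1046),
    "interp_": (2, 1544),
    "rprj3_": (4, 220),
}

# Lines of assembly duplicated into callers, per procedure.
totals = {name: copies * size for name, (copies, size) in extra_inlines.items()}

# Overall amount of duplicated code, by the same rough line-count measure.
grand_total = sum(totals.values())

if __name__ == "__main__":
    for name, total in totals.items():
        print(f"{name:8s} {total:5d}")
    print(f"{'total':8s} {grand_total:5d}")
```

By this rough measure the additional inlines account for a little over ten
thousand lines of duplicated assembly, of which rprj3_ contributes less than a
tenth.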
These are pretty large procedures to be duplicating, particularly to be
duplicating more than once. Looking at resid_, it already generates spill code
on its own, so putting 3 copies of this in its caller isn't likely to be very
helpful. Of these, I think only rprj3_ looks like a reasonable inline
candidate.
Total lines of the assembly files are:
8660 r193330/mgrid.s
16398 r193331/mgrid.s
14592 r196171/mgrid.s
Inlining creates unreachable code, so removing the unreachable procedures
gives:
7765 r193330/mgrid.s
12591 r193331/mgrid.s
10795 r196171/mgrid.s
With r196171 the reachable code is still about 40% larger than r193330 (where
some reasonable inlining was already being done). This is better than the 60%
bloat with r193331 but still seems too high. Again, these are rough measures
but I think they are indicative.
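
The percentages quoted above follow from the reachable-code line counts. A
short Python check, using only the numbers listed in this comment (the names
are illustrative):

```python
# Reachable assembly lines per revision, from the second listing above.
reachable = {"r193330": 7765, "r193331": 12591, "r196171": 10795}

baseline = reachable["r193330"]

# Growth of each revision relative to r193330, in percent.
growth = {
    rev: 100.0 * (lines - baseline) / baseline
    for rev, lines in reachable.items()
}

if __name__ == "__main__":
    for rev, pct in growth.items():
        print(f"{rev}: {pct:+.1f}% relative to r193330")
```

This gives roughly 62% growth for r193331 and 39% for r196171, matching the
"about 60%" and "about 40%" figures in the text.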
Without knowing anything about the inliner, I think the inlining heuristics
probably need to take more account of code size than they seem to do at the
moment, particularly when making more than one copy of a procedure and thus
reducing spatial locality.
^ permalink raw reply [flat|nested] 46+ messages in thread