[Bug fortran/48636] New: Enable more inlining with -O2 and higher

public inbox for gcc-bugs@sourceware.org

* [Bug fortran/48636] New: Enable more inlining with -O2 and higher
@ 2011-04-15 21:05 tkoenig at gcc dot gnu.org
  2011-04-16 11:22 ` [Bug fortran/48636] " rguenth at gcc dot gnu.org
                   ` (43 more replies)
  0 siblings, 44 replies; 46+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2011-04-15 21:05 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

           Summary: Enable more inlining with -O2 and higher
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: fortran
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: tkoenig@gcc.gnu.org


See http://gcc.gnu.org/ml/fortran/2011-04/msg00144.html .

We should inline more for Fortran at higher optimization levels.

Options to consider:

- Set -finline-limit=600 for -O2 and higher

- Consider heuristics to mark functions as inline in the front end:

    - If it has many arguments (argument processing has a lot of effect)

    - If it uses assumed-shape arrays (setting up that array descriptor
      may take a lot of time)

    - Mark everything as inline



* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: rguenth at gcc dot gnu.org @ 2011-04-16 11:22 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.04.16 09:59:01
     Ever Confirmed|0                           |1

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-04-16 09:59:01 UTC ---
Changing params from the Frontend won't work for LTO.  Declaring some or all
functions inline would work.

See also http://gcc.gnu.org/ml/gcc-patches/2011-02/msg00973.html and
discussion (the patch probably doesn't apply anymore).  Thus, I'd like to
address this in the middle-end, but if there are suitable heuristics on
when to set DECL_DECLARED_INLINE for the frontend by all means go ahead.



* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: dominiq at lps dot ens.fr @ 2011-04-17 10:23 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #2 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-04-17 10:23:03 UTC ---
As shown by the following results it seems that --param max-inline-insns-auto=*
is the way to go.

Date & Time     : 17 Apr 2011 11:22:05
Test Name       : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear
-fomit-frame-pointer --param max-inline-insns-auto=400 -fwhole-program -flto
-fstack-arrays -o %n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct
linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :      300.0
Target Error %  :      0.200
Minimum Repeats :     2
Maximum Repeats :     5

   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      8.07       54576      8.11       2  0.0062
      aermod    175.22     1472624     18.83       2  0.1647
         air     25.65       89992      6.78       5  0.1871
    capacita     14.02      109536     40.36       2  0.0483
     channel      3.11       34448      2.94       5  0.6012
       doduc     29.46      224584     27.44       2  0.0437
     fatigue      9.85       77032      2.74       2  0.0365
     gas_dyn     26.09      144112      4.68       5  0.6928
      induct     24.32      189696     14.24       2  0.1193
       linpk      3.13       21536     21.69       2  0.0254
        mdbx      9.18       84776     12.55       2  0.0678
          nf     34.14      124640     18.38       2  0.1034
     protein     28.14      155624     35.48       2  0.0789
      rnflow     43.93      204176     26.70       2  0.0262
    test_fpu     21.90      141696     11.18       2  0.0045
        tfft      1.71       22072      3.29       5  0.1369

Geometric Mean Execution Time =      11.60 seconds

================================================================================

Date & Time     : 17 Apr 2011 11:50:20
Test Name       : pbharness
Compile Command : gfc %n.f90 -Ofast -funroll-loops -ftree-loop-linear
-fomit-frame-pointer -finline-limit=600 -fwhole-program -flto -fstack-arrays -o
%n
Benchmarks      : ac aermod air capacita channel doduc fatigue gas_dyn induct
linpk mdbx nf protein rnflow test_fpu tfft
Maximum Times   :      300.0
Target Error %  :      0.200
Minimum Repeats :     2
Maximum Repeats :     5

   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      8.06       54576      8.11       2  0.0062
      aermod    175.54     1480632     18.92       2  0.0106
         air     25.36       89992      6.76       2  0.0740
    capacita     13.95      109536     40.32       2  0.0161
     channel      3.13       34448      2.95       5  0.1703
       doduc     27.31      212280     27.18       2  0.0331
     fatigue      9.82       77032      2.74       2  0.0182
     gas_dyn     24.86      144112      4.67       5  0.3052
      induct     24.25      189696     14.21       2  0.0035
       linpk      2.55       21536     21.69       2  0.0023
        mdbx      9.17       84776     12.53       2  0.0239
          nf     34.21      124640     18.41       4  0.1634
     protein     28.01      155624     35.46       2  0.0310
      rnflow     38.11      183696     26.74       2  0.0037
    test_fpu     19.63      141720     10.84       2  0.0323
        tfft      1.69       22072      3.29       2  0.0152

Geometric Mean Execution Time =      11.57 seconds

================================================================================



* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2011-04-17 10:44 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #3 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-04-17 10:44:23 UTC ---
I am slowly starting to look into Fortran issues now.  For years it was a
non-issue since we had the non-one-decl-per-function problem.  This is finally
solved.

One additional problem is that we often hit the large-stack-frame limits because
the Fortran I/O puts large data structures on the stack.  Consequently, any
functions that do I/O (for debug purposes, for example) are not inlined into
functions that don't.  We will need to relax this.

> - Consider heuristics to mark functions as inline in the front end:
>
>     - If it has many arguments (argument processing has a lot of effect)
>
>     - If it uses assumed-shape arrays (setting up that array descriptor
>       may take a lot of time)
>
>     - Mark everything as inline

I briefly discussed the option of marking everything as inline on IRC for the
4.6.x series, but it did not go well with Richard.  The observation is that the
dominant coding style in Fortran is to not care that much about code size if
performance improves, and inlining does help here.

In the longer term it would be cool if the inliner were able to work out as much
as possible by itself, without frontend help.
The first item you mention is something the backend can do on its own (and it
already knows how to benefit from many arguments, but it probably does not do it
enough to make a difference for Fortran).  I am just about to commit a patch that
makes the backend a hair more sensitive to this.

The second item is interesting - it would be cool if the backend were able to
work out that the code is supposed to simplify after inlining, either by itself
or via a frontend hint.
Can you provide me with a very simple testcase so that I can look into what it
looks like in the backend?  Perhaps some kind of frontend hinting would work well
here.

Honza


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: tkoenig at gcc dot gnu.org @ 2011-04-17 13:32 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #4 from Thomas Koenig <tkoenig at gcc dot gnu.org> 2011-04-17 13:32:11 UTC ---
(In reply to comment #3)

> The second item is interesting - it would be cool if backend was able to work
> out that the code is supposed to simplify after inlining. Either by itself or
> by frontend hint.
> Can you provide me very simple testcase for that I can look into how it looks
> like in backend?  Perhaps some kind of frontend hinting would work well here.

Here is some sample code (extreme, I admit) which profits a lot from
inlining:

- Strides are known to be one when inlining (a common case, but you can
  never be sure if the user doesn't call a(1:5:2))

- Expensive setting up of, and reading from the array descriptor

- Loops can be completely unrolled

module foo
  implicit none
contains
  subroutine bar(a,x)
    real, dimension(:,:), intent(in) :: a
    real, intent(out) :: x
    integer :: i,j

    x = 0
    do j=1,ubound(a,2)
       do i=1,ubound(a,1)
          x = x + a(i,j)**2
       end do
    end do
  end subroutine bar
end module foo

program main
  use foo
  implicit none
  real, dimension(2,3) :: a
  real :: x

  data a /1.0, 2.0, 3.0, -1.0, -2.0, -3.0/

  call bar(a,x)
  print *,x
end program main
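
For illustration only (a hand-written sketch, not compiler output): once bar
is inlined into main, the extents 2 and 3 and the unit strides of a are
compile-time constants, so the loop nest can in principle collapse to a
straight-line sum.  The program name inlined_sketch and the variables x_loop
and x_unrolled are made up for this example.

```fortran
program inlined_sketch
  implicit none
  real, dimension(2,3) :: a
  real :: x_loop, x_unrolled
  integer :: i, j

  data a /1.0, 2.0, 3.0, -1.0, -2.0, -3.0/

  ! Reference: the loop nest from bar, written directly on the caller's
  ! array, so no descriptor has to be set up or read.
  x_loop = 0
  do j = 1, ubound(a,2)
     do i = 1, ubound(a,1)
        x_loop = x_loop + a(i,j)**2
     end do
  end do

  ! What complete unrolling amounts to once the bounds are known constants.
  x_unrolled = a(1,1)**2 + a(2,1)**2 + a(1,2)**2 &
             + a(2,2)**2 + a(1,3)**2 + a(2,3)**2

  ! Both forms sum the same six squares in the same order.
  if (x_loop /= x_unrolled) stop 1
  print *, x_unrolled
end program inlined_sketch
```

Both sums evaluate to 28, since the six squares are 1, 4, 9, 1, 4 and 9.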



* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: dominiq at lps dot ens.fr @ 2011-04-17 14:12 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #5 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-04-17 14:12:30 UTC ---
I have investigated why test_fpu is slower with --param
max-inline-insns-auto=400 (11.18s) compared to -finline-limit=600 (10.84s) in
the timings of comment #2. This is due to the inlining of dgemm in the fourth
test Lapack 2:

[macbook] lin/test% gfc -Ofast -funroll-loops -fstack-arrays --param
max-inline-insns-auto=385 test_lap.f90
[macbook] lin/test% time a.out
  Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts  2.6 sec  Err= 0.000000000000250
                             total =  2.6 sec

2.824u 0.081s 0:02.90 100.0%    0+0k 0+0io 0pf+0w
[macbook] lin/test% gfc -Ofast -funroll-loops -fstack-arrays --param
max-inline-insns-auto=386 test_lap.f90
[macbook] lin/test% time a.out
  Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts  3.0 sec  Err= 0.000000000000250
                             total =  3.0 sec

3.214u 0.082s 0:03.29 100.0%    0+0k 0+0io 0pf+0w

Looking at the assembly, I see 'call    _dgemm_' three times for 385 and none
for 386 (note that there are only two calls in the source: one in dgetri, always
inlined, and one in dgetrf, not inlined).  It would be interesting to understand
why inlining dgemm slows down the code.



* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: jb at gcc dot gnu.org @ 2011-04-20 11:22 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

Janne Blomqvist <jb at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jb at gcc dot gnu.org

--- Comment #6 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-04-20 11:21:15 UTC ---
Note that some of these issues might change with the new array descriptor that
we must introduce at some point (the hope is that it'll get in for 4.7, but it
remains to be seen if there is enough time). See

http://gcc.gnu.org/wiki/ArrayDescriptorUpdate

For instance (comments inline):

(In reply to comment #4)
> (In reply to comment #3)
> 
> > The second item is interesting - it would be cool if backend was able to work
> > out that the code is supposed to simplify after inlining. Either by itself or
> > by frontend hint.
> > Can you provide me very simple testcase for that I can look into how it looks
> > like in backend?  Perhaps some kind of frontend hinting would work well here.
> 
> Here is some sample code (extreme, I admit) which profits a lot from
> inlining:
> 
> - Strides are known to be one when inlining (a common case, but you can
>   never be sure if the user doesn't call a(1:5:2))

Not strictly related to inlining, but in the new descriptor we'll have a field
specifying whether the array is simply contiguous, so it might make sense to
generate two loops for each loop over the array in the source, one for the
contiguous case where it can be vectorized etc. and another loop for the
general case.  This might reduce the profitability of inlining.

> - Expensive setting up of, and reading from the array descriptor

As we're planning to use the TR 29113 descriptor as the native one, this has
some implications for the procedure call interface as well. See

http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html

This will reduce the procedure call overhead substantially, at the cost of some
extra work in the caller in the case of non-default lower bounds.



* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: burnus at gcc dot gnu.org @ 2011-04-20 12:29 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #7 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-04-20 12:29:02 UTC ---
(In reply to comment #6)
> > Here is some sample code (extreme, I admit) which profits a lot from
> > inlining:
> > 
> > - Strides are known to be one when inlining (a common case, but you can
> >   never be sure if the user doesn't call a(1:5:2))

First, you do not have any issue with strides if the dummy argument is either
allocatable, has the contiguous attribute, or is an explicit-shape or
assumed-size array.

For inlining, I see only one place where information loss happens: If a
simply-contiguous array is passed as actual argument to an assumed-shape dummy.
Then the Fortran front-end knows that the stride of the actual argument is 1,
but the callee needs to assume an arbitrary stride. The middle-end will
continue to do so as the "simply contiguous" information is lost - even though
it would be profitable for inlining.
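
A minimal sketch of the two kinds of dummy argument involved, assuming a
compiler that supports the Fortran 2008 CONTIGUOUS attribute.  The names
sumsq_mod, sumsq_any and sumsq_contig are hypothetical and only serve this
example.

```fortran
module sumsq_mod
  implicit none
contains
  ! Plain assumed-shape dummy: the callee has to allow for an arbitrary
  ! stride, even when every caller passes a whole array.
  function sumsq_any(a) result(x)
    real, dimension(:), intent(in) :: a
    real :: x
    integer :: i
    x = 0
    do i = 1, ubound(a,1)
       x = x + a(i)**2
    end do
  end function sumsq_any

  ! CONTIGUOUS assumed-shape dummy: the callee may rely on unit stride.
  function sumsq_contig(a) result(x)
    real, dimension(:), contiguous, intent(in) :: a
    real :: x
    integer :: i
    x = 0
    do i = 1, ubound(a,1)
       x = x + a(i)**2
    end do
  end function sumsq_contig
end module sumsq_mod

program main
  use sumsq_mod
  implicit none
  real :: a(6)

  a = [1.0, 2.0, 3.0, -1.0, -2.0, -3.0]

  ! Whole array: simply contiguous, and the front end knows it at the call.
  print *, sumsq_any(a), sumsq_contig(a)

  ! Strided section: only the plain assumed-shape version is used here.
  print *, sumsq_any(a(1:5:2))
end program main
```

The first line prints 28 twice; the strided call sums the squares of 1, 3
and -2, giving 14.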

> Not strictly related to inlining, but in the new descriptor we'll have a field
> specifying whether the array is simply contiguous

I am not sure we will indeed have one; initially I thought one should, but I am
no longer convinced that it is the right approach. My impression is now that
setting and updating the flag all the time is more expensive than doing an
is_contiguous() check once. The TR descriptor also does not have such a flag -
thus one needs to handle such arrays - if they come from C - with extra care.
(Unless one requires the C side to call a function, which could set this flag.
I think one does not need to do so.)

By the way, the latest version of the TR draft is linked at
http://j3-fortran.org/pipermail/interop-tr/2011-April/000582.html

> so it might make sense to
> generate two loops for each loop over the array in the source, one for the
> contiguous case where it can be vectorized etc. and another loop for the
> general case.

Maybe. Definitely not for -Os. Best would be if the middle end would be able to
generate automatically a stride-free version when it thinks that it is
profitable. The FE could also do it, if one had a way to tell the ME that it
might drop the stride-free version, if it thinks that it is more profitable.
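
The "two loops" idea can be written by hand at source level.  The following is
only a sketch, assuming a compiler that provides the Fortran 2008
is_contiguous intrinsic; the names versioned_mod, scale_by and scale_by_unit
are hypothetical.  It shows the shape of the transformation, not what the
front end or middle end actually generates.

```fortran
module versioned_mod
  implicit none
contains
  ! Dispatch once at procedure entry between a unit-stride version and a
  ! general version of the same loop.
  subroutine scale_by(a, s)
    real, dimension(:), intent(inout) :: a
    real, intent(in) :: s
    integer :: i

    if (is_contiguous(a)) then
       ! Stride-free version: the explicit-shape dummy has no descriptor.
       call scale_by_unit(a, size(a), s)
    else
       ! General version: every access goes through the stride.
       do i = 1, size(a)
          a(i) = s * a(i)
       end do
    end if
  end subroutine scale_by

  subroutine scale_by_unit(a, n, s)
    integer, intent(in) :: n
    real, intent(inout) :: a(n)
    real, intent(in) :: s
    integer :: i
    do i = 1, n
       a(i) = s * a(i)
    end do
  end subroutine scale_by_unit
end module versioned_mod

program main
  use versioned_mod
  implicit none
  real :: a(6)
  integer :: i

  a = [(real(i), i = 1, 6)]
  call scale_by(a, 2.0)           ! whole array: unit-stride path
  call scale_by(a(1:5:2), 10.0)   ! strided section: general path
  print *, a                      ! values: 20, 4, 60, 8, 100, 12
end program main
```

Doing this by hand doubles the loop body, which is the code-size concern
raised for -Os.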

> As we're planning to use the TR 29113 descriptor as the native one, this has
> some implications for the procedure call interface as well. See
> http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html

Regarding:
"For a descriptor of an assumed-shape array, the value of the
lower-bound member of each element of the dim member of the descriptor
shall be zero."

That's actually also not that different from the current situation: In Fortran,
the lower bound of assumed-shape arrays is also always the same: It is 1. Which
makes sense as one can then do the following w/o worrying about the lbound:
  subroutine bar(a)
    real :: a(:)
    integer :: i
    do i = 1, ubound(a, dim=1)
      a(i) = ...
    end do
  end subroutine bar

For explicit-shape/assumed-size arrays one does not have a descriptor and for
deferred-shape arrays (allocatables, pointers) the TR keeps the lbound - which
is the same as currently in Fortran.
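
A small sketch of this difference in lower bounds, using only standard
Fortran 2003 features.  The names bounds_mod, show_assumed_shape and
show_allocatable are made up for the example.

```fortran
module bounds_mod
  implicit none
contains
  ! Assumed-shape dummy: the lower bound seen by the callee is always 1.
  subroutine show_assumed_shape(a)
    real, intent(in) :: a(:)
    print *, 'assumed-shape: ', lbound(a,1), ubound(a,1)
  end subroutine show_assumed_shape

  ! Deferred-shape (allocatable) dummy: the caller's bounds are kept.
  subroutine show_allocatable(a)
    real, allocatable, intent(in) :: a(:)
    print *, 'allocatable:   ', lbound(a,1), ubound(a,1)
  end subroutine show_allocatable
end module bounds_mod

program main
  use bounds_mod
  implicit none
  real, allocatable :: a(:)

  allocate(a(-5:4))
  a = 0.0

  call show_assumed_shape(a)   ! bounds 1 and 10
  call show_allocatable(a)     ! bounds -5 and 4
end program main
```

The same ten-element array is seen with bounds 1:10 through the assumed-shape
dummy and with its original bounds -5:4 through the allocatable dummy.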

> This will reduce the procedure call overhead substantially, at the cost
> of some extra work in the caller in the case of non-default lower bounds.

Which is actually nothing new ... That's the reason that one often creates a
new descriptor for procedure calls.


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: jb at gcc dot gnu.org @ 2011-04-20 13:10 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #8 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-04-20 13:09:51 UTC ---
(In reply to comment #7)
> (In reply to comment #6)
> > > Here is some sample code (extreme, I admit) which profits a lot from
> > > inlining:
> > > 
> > > - Strides are known to be one when inlining (a common case, but you can
> > >   never be sure if the user doesn't call a(1:5:2))
> 
> First, you do not have any issue with strides if the dummy argument is either
> allocatable, has the contiguous attribute, or is an explicit-shape or
> assumed-size array.
>
> For inlining, I see only one place where information loss happens: If a
> simply-contiguous array is passed as actual argument to an assumed-shape dummy.
> Then the Fortran front-end knows that the stride of the actual argument is 1,
> but the callee needs to assume an arbitrary stride. The middle-end will
> continue to do so as the "simply contiguous" information is lost - even though
> it would be profitable for inlining.

Passing simply contiguous arrays to assumed-shape dummies is a fairly common
case in "modern Fortran", so it would be nice if we could generate fast code
for this.

> > Not strictly related to inlining, but in the new descriptor we'll have a field
> > specifying whether the array is simply contiguous
> 
> I am not sure we will indeed have one; initially I thought one should, but I am
> no longer convinced that it is the right approach. My impression is now that
> setting and updating the flag all the time is more expensive than doing an
> is_contiguous() check once.

Hmm, maybe. Shouldn't it be necessary to update the contiguous flag only when
passing slices to procedures with explicit interfaces? But OTOH, calculating
whether an array is simply contiguous at procedure entry is just a few
arithmetic operations anyway. But, in any case I don't have any profiling data
to argue which approach would be better. 

> The TR descriptor also does not have such a flag - thus one
> needs to handle such arrays - if they come from C - with extra care. (Unless
> one requires the C side to call a function, which could set this flag. I think
> one does not need to do so.)

I suppose one cannot require the C side to set such a flag, as the TR doesn't
require its presence? Thus we'd need to calculate whether the array is simply
contiguous anyway if it's possible the array comes from C. Do such procedures
have to be marked with BIND(C) in some way or how does this work? 

In any case, maybe this is what tips it in favor of always calculating the
contiguousness instead of having a flag in the descriptor - it would be one
single way of handling it, reducing the possibility of bugs. Also, if the
contiguousness isn't used for anything in the procedure, the dead code
elimination should delete it anyway.

> > As we're planning to use the TR 29113 descriptor as the native one, this has
> > some implications for the procedure call interface as well. See
> > http://gcc.gnu.org/ml/fortran/2011-03/msg00215.html
> 
> Regarding:
> "For a descriptor of an assumed-shape array, the value of the
> lower-bound member of each element of the dim member of the descriptor
> shall be zero."
> 
> That's actually also not that different from the current situation: In Fortran,
> the lower bound of assumed-shape arrays is also always the same: It is 1.

Yes. But what is different from the current situation is that the above is what
the standard requires semantically, and the implementation is free to implement
it as it sees fit. In the TR, OTOH, we have the explicit requirement that on
procedure entry the lower bounds in the descriptor should be 0. This of course
applies only to inter-operable procedures, for "pure Fortran" we're still free
to do as we please. But again, it might make sense to do it the same way in
both cases in order to reduce the implementation and maintenance burden.

> For explicit-shape/assumed-size arrays one does not have a descriptor and for
> deferred-shape arrays (allocatables, pointers) the TR keeps the lbound - which
> is the same as currently in Fortran.

Yes.

> > This will reduce the procedure call overhead substantially, at the cost
> > of some extra work in the caller in the case of non-default lower bounds.
> 
> Which is actually nothing new ... That's the reason that one often creates a
> new descriptor for procedure calls.

But do we actually do this? I did some tests a while ago, and IIRC for assumed
shape dummy arguments the procedure always calculates new bounds such that they
start from 1. That is, the procedure assumes that the actual argument
descriptor may have lower bounds != 1. 

So my argument is basically that with the new descriptor it might make sense to
switch the responsibility around such that it's the caller who makes sure that
all lower bounds are 0 (as we must have the capability to do this anyway in
order to call inter-operable procedures, no?) instead of the callee.


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: burnus at gcc dot gnu.org @ 2011-04-20 15:42 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #9 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-04-20 15:39:47 UTC ---
> But do we actually do this? I did some tests a while ago, and IIRC for assumed
> shape dummy arguments the procedure always calculates new bounds such that they
> start from 1. That is, the procedure assumes that the actual argument
> descriptor may have lower bounds != 1. 
> So my argument is basically that with the new descriptor it might make sense to
> switch the responsibility around such that it's the caller who makes sure that
> all lower bounds are 0 (as we must have the capability to do this anyway in
> order to call inter-operable procedures, no?) instead of the callee.

No, the conversion is already done in the caller:

subroutine bar(B)
  interface;   subroutine foo(a); integer :: a(:); end subroutine foo
  end interface
integer :: B(:)
call foo(B)
end subroutine bar

Shows:
    parm.4.dim[0].lbound = 1;
    [...]
    foo (&parm.4);
For assumed-shape actual arguments, creating a new descriptor is actually not
needed - only for deferred shape ones - or if one does not have a full array
ref.
Cf. gfc_conv_array_parameter, which is called by gfc_conv_procedure_call.

However, some additional calculation is also done in the callee to
determine the stride and offset; e.g.
  ubound.0 = (b->dim[0].ubound - b->dim[0].lbound) + 1;
again, if the dummy argument is not deferred-shape (allocatable or pointer),
one actually knows that "b->dim[0].lbound" == 1. I think we have some
redundancy here -> missed optimization.



* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: tkoenig at gcc dot gnu.org @ 2011-04-20 16:41 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #10 from Thomas Koenig <tkoenig at gcc dot gnu.org> 2011-04-20 16:40:46 UTC ---
(In reply to comment #6)

> Not strictly related to inlining, but in the new descriptor we'll have a field
> specifying whether the array is simply contiguous, so it might make sense to
> generate two loops for each loop over the array in the source, one for the
> contiguous case where it can be vectorized etc. and another loop for the
> general case.  This might reduce the profitability of inlining.

Consider the following, hand-crafted matmul:

Here, we have three nested loops. The most interesting one is
the innermost loop of the matmul, which we vectorize by inlining if we omit
the call to my_matmul with non-unity stride for a when compiling with
-fwhole-program -O3.

How many versions of the loop should we generate?  Two or eight, depending
on what the caller may do? ;-)

module foo
  implicit none
contains
  subroutine my_matmul(a,b,c)
    implicit none
    integer :: count, m, n
    real, dimension(:,:), intent(in) :: a,b
    real, dimension(:,:), intent(out) :: c
    integer :: i,j,k

    m = ubound(a,1)
    n = ubound(b,2)
    count = ubound(a,2)
    c = 0
    do j=1,n
       do k=1, count
          do i=1,m
             c(i,j) = c(i,j) + a(i,k) * b(k,j)
          end do
       end do
    end do
  end subroutine my_matmul
end module foo

program main
  use foo
  implicit none
  integer, parameter :: factor=100
  integer, parameter :: n = 2*factor, m = 3*factor, count = 4*factor
  real, dimension(m, count) :: a
  real, dimension(count, n) :: b
  real, dimension(m,n) :: c1, c2
  real, dimension(m/2, n) :: ch_1, ch_2

  call random_number(a)
  call random_number(b)
  call my_matmul(a,b,c1)
  c2 = matmul(a,b)
  if (any(abs(c1 - c2) > 1e-5)) call abort
  call my_matmul(a(1:m:2,:),b,ch_1)
  ch_2 = matmul(a(1:m:2,:),b)
  if (any(abs(ch_1 - ch_2) > 1e-5)) call abort
end program main



* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: jb at gcc dot gnu.org @ 2011-04-20 18:15 UTC
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #11 from Janne Blomqvist <jb at gcc dot gnu.org> 2011-04-20 18:14:20 UTC ---
(In reply to comment #9)
> > But do we actually do this? I did some tests a while ago, and IIRC for assumed
> > shape dummy arguments the procedure always calculates new bounds such that they
> > start from 1. That is, the procedure assumes that the actual argument
> > descriptor may have lower bounds != 1. 
> > So my argument is basically that with the new descriptor it might make sense to
> > switch the responsibility around such that it's the caller who makes sure that
> > all lower bounds are 0 (as we must have the capability to do this anyway in
> > order to call inter-operable procedures, no?) instead of the callee.
> 
> No, the conversion is already done in the caller:
> 
> subroutine bar(B)
>   interface;   subroutine foo(a); integer :: a(:); end subroutine foo
>   end interface
> integer :: B(:)
> call foo(B)
> end subroutine bar
> 
> Shows:
>     parm.4.dim[0].lbound = 1;
>     [...]
>     foo (&parm.4);
> For assumed-shape actual arguments, creating a new descriptor is actually not
> needed - only for deferred shape ones - or if one does not have a full array
> ref.
> Cf. gfc_conv_array_parameter, which is called by gfc_conv_procedure_call.
> 
> However, some additional calculation is also done in the callee to
> determine the stride and offset; e.g.
>   ubound.0 = (b->dim[0].ubound - b->dim[0].lbound) + 1;
> again, if the dummy argument is not deferred-shaped (allocatable or pointer),
> one actually knows that "b->dim[0].lbound" == 1. I think we have some
> redundancy here -> missed optimization.

Yes, there seems to be some redundancy indeed in that case. I dug up my
testcase:

module asstest
  implicit none
contains
  subroutine assub(a, r)
    real, intent(in) :: a(:,:)
    real, intent(out) :: r

    r = a(42,43)
  end subroutine assub

  subroutine assub2(a, r)
    real, intent(in), allocatable :: a(:,:)
    real, intent(out) :: r

    r = a(42,43)
  end subroutine assub2
end module asstest

The -fdump-tree-original tree for this module is:

assub2 (struct array2_real(kind=4) & a, real(kind=4) & r)
{
  *r = (*(real(kind=4)[0:] *) a->data)[(a->dim[0].stride * 42 +
a->dim[1].stride * 43) + a->offset];
}


assub (struct array2_real(kind=4) & a, real(kind=4) & r)
{
  integer(kind=8) ubound.0;
  integer(kind=8) stride.1;
  integer(kind=8) ubound.2;
  integer(kind=8) stride.3;
  integer(kind=8) offset.4;
  integer(kind=8) size.5;
  real(kind=4)[0:D.1567] * a.0;
  integer(kind=8) D.1567;
  bit_size_type D.1568;
  <unnamed-unsigned:64> D.1569;

  {
    integer(kind=8) D.1566;

    D.1566 = a->dim[0].stride;
    stride.1 = D.1566 != 0 ? D.1566 : 1;
    a.0 = (real(kind=4)[0:D.1567] *) a->data;
    ubound.0 = (a->dim[0].ubound - a->dim[0].lbound) + 1;
    stride.3 = a->dim[1].stride;
    ubound.2 = (a->dim[1].ubound - a->dim[1].lbound) + 1;
    size.5 = stride.3 * NON_LVALUE_EXPR <ubound.2>;
    offset.4 = -stride.1 - NON_LVALUE_EXPR <stride.3>;
    D.1567 = size.5 + -1;
    D.1568 = (bit_size_type) size.5 * 32;
    D.1569 = (<unnamed-unsigned:64>) size.5 * 4;
  }
  *r = (*a.0)[(stride.1 * 42 + stride.3 * 43) + offset.4];
}

So if we make sure that the caller fixes up the descriptor so that bounds are
correct for assumed-shape parameters (as the TR requires for inter-operable
procedures), then assub could be as simple as assub2.
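
The redundancy can be illustrated with a small Python model of the two index
computations. This is only a sketch: the field names follow the dump above,
while Dim, Descriptor, make_descriptor, assub_like and assub2_like are made-up
names, and the model assumes a contiguous column-major array whose lower bounds
are all 1.

```python
from dataclasses import dataclass


@dataclass
class Dim:
    stride: int
    lbound: int
    ubound: int


@dataclass
class Descriptor:
    data: list
    offset: int
    dim: list


def make_descriptor(rows, cols):
    """Build a column-major rows x cols descriptor with lower bounds of 1 (assumption)."""
    data = [float(k) for k in range(rows * cols)]
    dim = [Dim(1, 1, rows), Dim(rows, 1, cols)]
    # The offset makes element (lbound, lbound) land on data[0].
    offset = -(dim[0].stride * dim[0].lbound + dim[1].stride * dim[1].lbound)
    return Descriptor(data, offset, dim)


def assub2_like(a):
    """Allocatable dummy: index directly with the descriptor's own strides and offset."""
    return a.data[a.dim[0].stride * 42 + a.dim[1].stride * 43 + a.offset]


def assub_like(a):
    """Assumed-shape dummy: recompute stride and offset in the callee, as in the dump."""
    stride_1 = a.dim[0].stride if a.dim[0].stride != 0 else 1
    stride_3 = a.dim[1].stride
    offset_4 = -stride_1 - stride_3  # lower bounds taken to be 1
    return a.data[stride_1 * 42 + stride_3 * 43 + offset_4]


a = make_descriptor(50, 50)
print(assub_like(a), assub2_like(a))
```

Under the stated assumption both functions read the same element, so the extra
set-up in assub buys nothing once the caller guarantees the bounds.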


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2011-05-04 16:23 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #12 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-05-04 16:09:19 UTC ---
Hi,
I discussed some of the issues today with Martin.  For the array descriptor
testcase, we really want ipa-cp to propagate the constant array bounds
instead of making the inliner blindly inline enough in all cases.
For that we need

  1) Make ipa-prop work on aggregates.  For aggregates passed by value we
can have jump functions that define known constants at known offsets
  2) Make ipa-inline-analysis produce predicates on constantness of
aggregate fields in the same format
  3) Array descriptors are passed by reference, rather than by value.  This
needs further work, since we need to be sure that the value passed does not
change by aliasing.  IPA-SRA would help here if it was really SRA and we had
-fwhole-program, but that is weak. We would need IPA-PTA to solve this in
general. Perhaps the frontend could help us here since the descriptors are
probably constant after they are initialized, or is there a way to change an
existing descriptor?
  4) Make ipa-inline-analysis understand that determining loop bounds is
very cool to do.

I also looked into the dumps of fatigue. One obvious problem is that we
overestimate stack sizes by about a factor of 10.  This seems to be mostly due
to I/O routine structures that get packed later.
We used to take the results of stack frame packing, but Steven reverted this
behaviour and now we estimate stack sizes by simply summing up the sizes of
local arrays. I wonder, perhaps we want to revert to the original way at least
when optimizing and when generating the summary for the late inliner (the early
inliner probably does not care, and Steven's main concern was that this is
computed 3 times, twice for the early inliner and once for the real inliner).

Honza


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: burnus at gcc dot gnu.org @ 2011-05-04 17:31 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #13 from Tobias Burnus <burnus at gcc dot gnu.org> 2011-05-04 17:30:31 UTC ---
(In reply to comment #12)
> Perhaps the frontend could help us here since the descriptors are probably
> constant after they are initialized, or is there a way to change an existing
> descriptor?

Only if the dummy/formal argument is a pointer or an allocatable.

Regarding ipa-cp, wasn't there a problem with "fn spec" (cf. PR 45579)? And
many Fortran procedures have this attribute.

>  This seems to be mostly due to I/O routine structures that get packed later.

We really need to start reusing them ... (Cf. PR 48419, PR 34705, and
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg01762.html )


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2011-06-04 18:08 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #14 from Jan Hubicka <hubicka at gcc dot gnu.org> 2011-06-04 18:06:01 UTC ---
Yeah, the fnspec issue is something we ought to solve. ipa-cp should be
effective on Fortran, so it should not disable itself there ;)


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: jamborm at gcc dot gnu.org @ 2012-07-03 17:44 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #15 from Martin Jambor <jamborm at gcc dot gnu.org> 2012-07-03 17:43:35 UTC ---
Hi,

(In reply to comment #12)
> Hi,
> I discussed some of the issues today with Martin.  For the array descriptor
> testcase, we really want ipa-cp to propagate the constant array bounds
> instead of making the inliner blindly inline enough in all cases.
> For that we need
> 
>   1) Make ipa-prop work on aggregates.  For aggregates passed by value we
> can have jump functions that define known constants at known offsets

I posted a patch implementing this yesterday:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00039.html

>   2) Make ipa-inline-analysis produce predicates on constantness of
> aggregate fields in the same format

I posted a patch implementing this yesterday too:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00041.html

With this patch, the test case from comment #4 is inlined at -O3
without any parameter tweaking.  It is in fact a testcase in that
patch.  However, functions in real benchmarks are bigger.  We will see
to what extent IPA-CP can help with them.

>   3) Array descriptors are passed by reference, rather than by value.  This
> needs further work, since we need to be sure that the value passed does not
> change by aliasing.  IPA-SRA would help here if it was really SRA and we had
> -fwhole-program, but that is weak. We would need IPA-PTA to solve this in
> general. Perhaps the frontend could help us here since the descriptors are
> probably constant after they are initialized, or is there a way to change an
> existing descriptor?

At the moment I'm relying on a slightly sophisticated intra-PTA and
TBAA.  I'll try to investigate where this does not work in Fortran and
perhaps will have some suggestions afterwards.

>   4) Make ipa-inline-analysis understand that determining loop bounds is
> very cool to do.

Yep, knowing what constants should get a profitability "boost" would
indeed be very beneficial.

Meanwhile, the following patch also helps ipa-inline-analysis.c with
cases like fatigue2:
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00052.html


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: jamborm at gcc dot gnu.org @ 2012-08-11 10:50 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #16 from Martin Jambor <jamborm at gcc dot gnu.org> 2012-08-11 10:50:29 UTC ---
Author: jamborm
Date: Sat Aug 11 10:50:24 2012
New Revision: 190313

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190313
Log:
2012-08-11  Martin Jambor  <mjambor@suse.cz>

    PR fortran/48636
    * ipa-inline.h (condition): New fields offset, agg_contents and by_ref.
    * ipa-inline-analysis.c (agg_position_info): New type.
    (add_condition): New parameter aggpos, also store agg_contents, by_ref
    and offset.
    (dump_condition): Also dump aggregate conditions.
    (evaluate_conditions_for_known_args): Also handle aggregate
    conditions.  New parameter known_aggs.
    (evaluate_properties_for_edge): Gather known aggregate contents.
    (inline_node_duplication_hook): Pass NULL known_aggs to
    evaluate_conditions_for_known_args.
    (unmodified_parm): Split into unmodified_parm and unmodified_parm_1.
    (unmodified_parm_or_parm_agg_item): New function.
    (set_cond_stmt_execution_predicate): Handle values passed in
    aggregates.
    (set_switch_stmt_execution_predicate): Likewise.
    (will_be_nonconstant_predicate): Likewise.
    (estimate_edge_devirt_benefit): Pass new parameter known_aggs to
    ipa_get_indirect_edge_target.
    (estimate_calls_size_and_time): New parameter known_aggs, pass it
    recursively to itself and to estimate_edge_devirt_benefit.
    (estimate_node_size_and_time): New vector known_aggs, pass it to
    functions which need it.
    (remap_predicate): New parameter offset_map, use it to remap aggregate
    conditions.
    (remap_edge_summaries): New parameter offset_map, pass it recursively
    to itself and to remap_predicate.
    (inline_merge_summary): Also create and populate vector offset_map.
    (do_estimate_edge_time): New vector of known aggregate contents,
    passed to functions which need it.
    (inline_read_section): Stream new fields of condition.
    (inline_write_summary): Likewise.
    * ipa-cp.c (ipa_get_indirect_edge_target): Also examine the aggregate
    contents.  Let all local callers pass NULL for known_aggs.

    * testsuite/gfortran.dg/pr48636.f90: New test.


Added:
    trunk/gcc/testsuite/gfortran.dg/pr48636.f90
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-cp.c
    trunk/gcc/ipa-inline-analysis.c
    trunk/gcc/ipa-inline.h
    trunk/gcc/ipa-prop.h
    trunk/gcc/testsuite/ChangeLog


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2012-08-21  6:54 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #17 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-08-21 06:54:09 UTC ---
Author: hubicka
Date: Tue Aug 21 06:54:01 2012
New Revision: 190556

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190556
Log:
    PR fortran/48636
    * ipa-inline.c (want_inline_small_function_p): Take loop_iterations hint.
    (edge_badness): Likewise.
    * ipa-inline.h (inline_hints_vals): Add INLINE_HINT_loop_iterations.
    (inline_summary): Add loop_iterations.
    * ipa-inline-analysis.c: Include tree-scalar-evolution.h.
    (dump_inline_hints): Dump loop_iterations.
    (reset_inline_summary): Free loop_iterations.
    (inline_node_duplication_hook): Update loop_iterations.
    (dump_inline_summary): Dump loop_iterations.
    (will_be_nonconstant_expr_predicate): New function.
    (estimate_function_body_sizes): Analyze loops.
    (estimate_node_size_and_time): Set hint loop_iterations.
    (inline_merge_summary): Merge loop iterations.
    (inline_read_section): Stream in loop_iterations.
    (inline_write_summary): Stream out loop_iterations.

Added:
    trunk/gcc/testsuite/gcc.dg/ipa/inlinehint-1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-inline-analysis.c
    trunk/gcc/ipa-inline.c
    trunk/gcc/ipa-inline.h
    trunk/gcc/testsuite/ChangeLog


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2012-08-21  8:15 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #18 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-08-21 08:14:33 UTC ---
With the loop_iterations hint, we should now hint the bar function of the
testcase in comment #4, but we don't because the value is used conditionally:

  # iftmp.11_3 = PHI <_12(3), 1(2)>
  a.0_16 = a_11(D)->data;
  _17 = a_11(D)->dim[0].ubound;
  _18 = a_11(D)->dim[0].lbound;
  _19 = _17 - _18;
  ubound.0_20 = _19 + 1;
  stride.3_21 = a_11(D)->dim[1].stride;
  _22 = a_11(D)->dim[1].ubound;
  _23 = a_11(D)->dim[1].lbound;
  _24 = _22 - _23;
  ubound.2_25 = _24 + 1;
  _27 = -iftmp.11_3;
  offset.4_28 = _27 - stride.3_21;
  *x_35(D) = 0.0;
  _53 = stride.3_21 >= 0;
  _56 = ubound.2_25 > 0;
  _57 = _53 & _56;
  _59 = stride.3_21 < 0;
  _60 = _57 | _59;
  _66 = _59 | _60;
  if (_66 != 0)
    goto <bb 5>;
  else
    goto <bb 6>;

  <bb 5>:
  iftmp.13_68 = (integer(kind=4)) ubound.2_25;

  <bb 6>:
  # iftmp.13_4 = PHI <iftmp.13_68(5), 0(4)>
  if (iftmp.13_4 > 0)
    goto <bb 7>;
  else
    goto <bb 14>;

Martin, does this work with your PHI patch? (i.e. do you get "loop iterations"
in -fdump-ipa-inline?)

The next step will be to teach the inliner to inline into functions called once
when the callee is important and some propagation happens.

Honza
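
The conditional use in the dump can be restated as a small Python function.
This is a hedged sketch rather than anything GCC does internally: the name
loop_trip_count and its parameters are made up, the local names mirror the SSA
names in the dump, and the three inputs stand for the dim[1] fields of the
descriptor.

```python
def loop_trip_count(stride, lbound, ubound):
    """Value that reaches the loop test, following the dump step by step."""
    extent = (ubound - lbound) + 1   # ubound.2_25
    c53 = stride >= 0                # _53
    c56 = extent > 0                 # _56
    c57 = c53 and c56                # _57
    c59 = stride < 0                 # _59
    c60 = c57 or c59                 # _60
    c66 = c59 or c60                 # _66
    # iftmp.13_4 is a PHI: the extent when bb 5 runs, otherwise 0.
    return extent if c66 else 0


for args in [(1, 1, 10), (1, 1, 0), (-1, 1, 10), (1, 5, 4)]:
    print(args, loop_trip_count(*args))
```

The loop bound reaches the loop test only through the PHI node, which is why
it counts as being used conditionally. Once the descriptor fields are known
constants, the whole expression folds to a constant.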


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2012-09-12 21:52 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #19 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-09-12 21:51:21 UTC ---
Author: hubicka
Date: Wed Sep 12 21:51:14 2012
New Revision: 191232

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=191232
Log:

    PR fortran/48636
    * gcc.dg/ipa/inlinehint-2.c: New testcase.
    * ipa-inline-analysis.c (dump_inline_hints): Dump loop stride.
    (set_hint_predicate): New function.
    (reset_inline_summary): Reset loop stride.
    (remap_predicate_after_duplication): New function.
    (remap_hint_predicate_after_duplication): New function.
    (inline_node_duplication_hook): Update.
    (dump_inline_summary): Dump stride summaries.
    (estimate_function_body_sizes): Compute strides.
    (remap_hint_predicate): New function.
    (inline_merge_summary): Use it.
    (inline_read_section): Read stride.
    (inline_write_summary): Write stride.
    * ipa-inline.c (want_inline_small_function_p): Handle strides.
    (edge_badness): Likewise.
    * ipa-inline.h (inline_hints_vals): Add stride hint.
    (inline_summary): Update stride.

Added:
    trunk/gcc/testsuite/gcc.dg/ipa/inlinehint-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-inline-analysis.c
    trunk/gcc/ipa-inline.c
    trunk/gcc/ipa-inline.h
    trunk/gcc/testsuite/ChangeLog


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2012-10-16 16:39 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #20 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-10-16 16:38:27 UTC ---
Created attachment 28456
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28456
Patch I am considering

Hi,
I am considering enabling inlining when inline-analysis says that the inlined
function will get significantly faster, regardless of the
inline-insns-single/inline-insns-auto limits. The patch is attached.

In general it may make more sense to gradually set the limits based on expected
speedup, but I am afraid this will become hard to understand and maintain. So
for now let's have simple boolean decisions.

It would be nice to know where it helps (i.e. for fatigue and cray) and where
it doesn't.  It also causes quite considerable code size growth on some of
SPEC2000 for relatively little benefit, so I guess it will need more evaluation
and reduction of inline-insns-auto limits. The problem may also lie in
unrealistic estimates in ipa-inline-analysis; this is the first time we take
them really seriously.

Comments/ideas are welcome.
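
The "simple boolean decision" described above can be sketched in Python. This
is not the attached patch, whose contents are not reproduced here: the function
name want_inline, its parameters and the numbers in the usage lines are all
hypothetical and only show the shape of the rule.

```python
def want_inline(callee_size, size_limit, est_speedup_percent, min_speedup_percent):
    """Inline if the callee fits the size limit, or if the estimated speedup is large enough."""
    if callee_size <= size_limit:
        return True
    # Over the limit: a sufficiently large estimated speedup overrides the size check.
    return est_speedup_percent >= min_speedup_percent


print(want_inline(30, 40, 0, 20))    # small callee
print(want_inline(100, 40, 5, 20))   # large callee, little estimated gain
print(want_inline(100, 40, 35, 20))  # large callee, large estimated gain
```

A graded rule would instead scale size_limit with the estimated speedup, which
is the alternative set aside above as harder to understand and maintain.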

* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: dominiq at lps dot ens.fr @ 2012-10-16 17:58 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #21 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2012-10-16 17:57:52 UTC ---
Before the patch in comment #20, I get

[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops
-ftree-loop-linear -fomit-frame-pointer --param max-inline-insns-auto=150
-fwhole-program -flto -fno-tree-loop-if-convert fatigue.f90
8.485u 0.205s 0:08.73 99.4%    0+0k 0+29io 0pf+0w
[macbook] lin/test% ll a.out
-rwxr-xr-x 1 dominiq staff 73336 Oct 16 19:19 a.out*
[macbook] lin/test% time a.out > /dev/null
2.916u 0.003s 0:02.92 99.6%    0+0k 0+1io 0pf+0w

[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops
-ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto
-fno-tree-loop-if-convert fatigue.f90
6.822u 0.193s 0:07.06 99.2%    0+0k 0+30io 0pf+0w
[macbook] lin/test% ll a.out
-rwxr-xr-x 1 dominiq staff 69312 Oct 16 19:21 a.out*
[macbook] lin/test% time a.out > /dev/null
4.851u 0.004s 0:04.86 99.7%    0+0k 0+1io 0pf+0w

After the patch I get

[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-loops
-ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto
-fno-tree-loop-if-convert fatigue.f90
7.277u 0.217s 0:07.52 99.4%    0+0k 0+28io 0pf+0w
[macbook] lin/test% ll a.out
-rwxr-xr-x 1 dominiq staff 69248 Oct 16 19:46 a.out*
[macbook] lin/test% time a.out > /dev/null
2.912u 0.003s 0:02.91 100.0%    0+0k 0+2io 0pf+0w

So for this particular test with the same options, after the patch the
compilation time is ~6% slower, the size is about the same (actually smaller ;-)
and the run time is ~40% faster. Compared with the patched compiler without
--param max-inline-insns-auto=150, the unpatched compiler with this option
compiles ~20% slower and produces a ~6% larger executable, while the run time
is the same.

Further testing coming, thanks for the patch.
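
The relative changes can be recomputed from the user times and file sizes
listed above. This is a small Python sketch for checking the arithmetic; the
helper percent_change and the variable names are made up, and the numbers are
copied from the listings.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent (negative means smaller/faster)."""
    return 100.0 * (after - before) / before


# Same options, unpatched -> patched.
compile_patch = percent_change(6.822, 7.277)
size_patch = percent_change(69312, 69248)
run_patch = percent_change(4.851, 2.912)

# Patched without the --param -> unpatched with max-inline-insns-auto=150.
compile_param = percent_change(7.277, 8.485)
size_param = percent_change(69248, 73336)
run_param = percent_change(2.912, 2.916)

for name, value in [("compile (patch)", compile_patch), ("size (patch)", size_patch),
                    ("run (patch)", run_patch), ("compile (param)", compile_param),
                    ("size (param)", size_param), ("run (param)", run_param)]:
    print(f"{name:16s} {value:+.1f}%")
```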


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: dominiq at lps dot ens.fr @ 2012-10-16 20:59 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #22 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2012-10-16 20:58:58 UTC ---
With the patch I see a ~10% slowdown in Test4 - Lapack 2 (1001x1001) of
test_fpu.f90 compared to revision 192449:

[macbook] lin/test% time /opt/gcc/gcc4.8c/bin/gfortran -fprotect-parens -Ofast
-funroll-loops test_lap.f90
6.742u 0.097s 0:06.87 99.4%    0+0k 0+20io 0pf+0w
[macbook] lin/test% a.out
  Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts  2.6 sec  Err= 0.000000000000250
                             total =  2.6 sec

[macbook] lin/test% time gfc -fprotect-parens -Ofast -funroll-all-loops
test_lap.f90
9.489u 0.116s 0:09.62 99.6%    0+0k 0+16io 0pf+0w
[macbook] lin/test% a.out
  Benchmark running, hopefully as only ACTIVE task
Test4 - Lapack 2 (1001x1001) inverts  2.8 sec  Err= 0.000000000000250
                             total =  2.8 sec

This looks similar to what I saw in comment #5. However now dgetri is never
inlined while dgetrf is inlined with the patch. Also dtrmv and dscal are
inlined with the patch (respectively 20 and 21 occurrences without the patch).
The last difference I see is 35 occurrences of dswap with the patch compared to
32 without.


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: jakub at gcc dot gnu.org @ 2012-10-17 12:20 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #23 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-10-17 12:20:05 UTC ---
If the middle-end were given hints about the array descriptors (which fields
in the struct are important and which are not), perhaps IPA-CP could also
handle arguments pointing to array descriptors if all callers fill the
descriptor in a certain important way (where important would mean constant
strides and/or constant start/end in some of the dimension(s)).  I guess many
Fortran routines are just too large for inlining, but optimizing constant
strides etc. might still be useful.
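
The idea of "all callers fill the descriptor in a certain way" can be sketched
as an intersection over call sites. The Python below is a toy model only: it
does not reflect how IPA-CP represents jump functions, and the function name,
the dictionary layout and the example call sites are all hypothetical.

```python
def fields_constant_across_callers(call_sites):
    """Return the descriptor fields that hold the same known constant at every call site."""
    if not call_sites:
        return {}
    common = dict(call_sites[0])
    for known in call_sites[1:]:
        # Keep a field only if this caller also passes the same constant.
        common = {field: value for field, value in common.items()
                  if known.get(field) == value}
    return common


# Each dict lists the descriptor fields a caller is known to set to a constant.
call_sites = [
    {"dim[0].stride": 1, "dim[0].lbound": 1, "dim[0].ubound": 100},
    {"dim[0].stride": 1, "dim[0].lbound": 1, "dim[0].ubound": 200},
    {"dim[0].stride": 1, "dim[0].lbound": 1},
]
print(fields_constant_across_callers(call_sites))
```

In this example only the stride and the lower bound survive, which is the kind
of constant a specialised clone of a large routine could still exploit without
being inlined.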


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: dominiq at lps dot ens.fr @ 2012-10-17 13:13 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #24 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2012-10-17 13:13:24 UTC ---
Summary for the polyhedron tests (pb05):

(a) revision 192449 unpatched
(b) revision 192516 with patch in comment #20
options: -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear \
         -fomit-frame-pointer -fwhole-program -flto -fno-tree-loop-if-convert

Benchmark         Compile            Executable              Ave Run   
     Name          (secs)               (bytes)               (secs)  
              (a)       (b)        (a)         (b)        (a)       (b)     
---------   -------   -------  ----------  ----------   -------   -------
       ac      6.30      9.66       54904       54856      9.12      8.49
   aermod    129.56    178.13     1158904     1527680     17.69     17.28
      air     25.64     29.95      102752      106712      7.15      7.16
 capacita      6.34     18.99       69096      130536     42.00     40.57
  channel      2.99      3.51       34736       34736      3.03      3.02
    doduc     19.51     23.70      163528      196016     27.54     27.65
  fatigue      7.03      7.54       69312       69248      4.92      2.94
  gas_dyn     14.65     14.91       99320       99280      3.72      3.88
   induct     19.50     21.02      149160      161552     13.18     13.14
    linpk      1.84      2.91       22008       17832     21.65     21.69
     mdbx      6.59     10.14       68800       80968     12.85     12.78
       nf      6.34     24.25       59544      104544     18.95     18.88
  protein     22.27     46.34      119056      156048     35.65     35.41
   rnflow     17.75     31.23      114600      147280     23.56     23.54
 test_fpu     10.69     20.66       76640      117512     10.89     10.95
     tfft      1.62      2.93       22424       22384      3.34      3.33

Geometric Mean Execution Time =  (a) 11.88 seconds (b) 11.43 seconds
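
The geometric means can be recomputed from the "Ave Run" columns of the table.
The Python below is a sketch for checking the summary line; geometric_mean and
the list names are made up, and the values are copied from the table in row
order.

```python
import math


def geometric_mean(values):
    """Geometric mean computed via the mean of logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Average run times in seconds, rows ac .. tfft.
run_a = [9.12, 17.69, 7.15, 42.00, 3.03, 27.54, 4.92, 3.72,
         13.18, 21.65, 12.85, 18.95, 35.65, 23.56, 10.89, 3.34]
run_b = [8.49, 17.28, 7.16, 40.57, 3.02, 27.65, 2.94, 3.88,
         13.14, 21.69, 12.78, 18.88, 35.41, 23.54, 10.95, 3.33]

gm_a = geometric_mean(run_a)
gm_b = geometric_mean(run_b)
print(f"(a) {gm_a:.2f} seconds  (b) {gm_b:.2f} seconds")
```

Most of the difference between the two means comes from the fatigue row; the
other benchmarks move by a few percent at most.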

I also see many failures for the gcc.dg/tree-ssa/slsr-* tests: slsr-2.c to
slsr-11.c, slsr-14.c to slsr-20.c, slsr-24.c, and slsr-25.c, and for
gfortran.dg/vect/vect-8.f90: the loop

do while(ii >  1)
ipnt= ipntp
ipntp= ipntp+ii
ii= ishft(ii,-1)
i= ipntp+1
!dir$ vector always
       x(ipntp+2:ipntp+ii+1)=x(ipnt+2:ipntp:2)-v(ipnt+2:ipntp:2) &
     &*x(ipnt+1:ipntp-1:2)-v(ipnt+3:ipntp+1:2)*x(ipnt+3:ipntp+1:2)
END DO

is not vectorized because

154: not vectorized: complicated access pattern.


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: dominiq at lps dot ens.fr @ 2012-10-17 14:06 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #25 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2012-10-17 14:05:51 UTC ---
> I also see many failures for the gcc.dg/tree-ssa/slsr-* tests: slsr-2.c to
> slsr-11.c, slsr-14.c to slsr-20.c, slsr-24.c, and slsr-25.c, and for
> gfortran.dg/vect/vect-8.f90: the loop ...

This seems to be a glitch which appeared between r192440 and r192516 and
disappeared before or at r192531.


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: vincenzo.innocente at cern dot ch @ 2012-10-19  8:45 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #26 from vincenzo Innocente <vincenzo.innocente at cern dot ch> 2012-10-19 08:45:03 UTC ---
I'm interested in testing the patch on our large application, currently
compiled with 4.7.2.
Would it be possible to get the same patch against gcc-4_7-branch?
Thanks


* [Bug fortran/48636] Enable more inlining with -O2 and higher
From: hubicka at gcc dot gnu.org @ 2012-10-20 10:35 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #27 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-10-20 10:34:58 UTC ---
Thank you for testing. It seems that the patch works well for small benchmarks;
I will look into the lapack/test_fpu slowdown.
There is a problem in that it causes unacceptable growth on SPEC2k6 and 2k in
non-LTO mode.  I will need to analyze some of these testcases and see why we
predict so much speedup when there is no benefit at runtime.

Jakub: the plan is to make ipa-cp handle propagation across aggregates in
general (jump functions are already in place); that will handle the array
descriptors, too.

Fatigue is, however, a different case - the values happen to be loop invariant
in the outer loop the function is called from. So inlining enables a lot of
invariant code motion.  This is similar to cray. Both these cases are now
understood by ipa-inline-analysis, but the fact is not really used w/o this
patch.
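
The loop-invariant point above can be illustrated with a small sketch. This is
only a hand-written analogy in Python, not what GCC generates; the names
scale_factor, callee and the two caller functions are hypothetical and do not
come from fatigue.f90.

```python
import math

def scale_factor(stiffness):
    # Depends only on a value the caller never changes inside its loop.
    return math.sqrt(stiffness) * 2.0

def callee(x, stiffness):
    # Out of line: the invariant part is recomputed on every call.
    return x * scale_factor(stiffness)

def caller_with_calls(xs, stiffness):
    # The loop cannot hoist anything, because the work is hidden in callee.
    return [callee(x, stiffness) for x in xs]

def caller_after_inlining(xs, stiffness):
    # Body of callee pasted into the loop by hand; the invariant part is
    # now visible and can be moved out of the loop.
    k = scale_factor(stiffness)
    return [x * k for x in xs]
```

Both callers return the same values; the second evaluates the invariant part
once instead of once per iteration, which is the kind of gain the inliner has
to predict.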

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (26 preceding siblings ...)
  2012-10-20 10:35 ` hubicka at gcc dot gnu.org
@ 2012-10-20 11:22 ` dominiq at lps dot ens.fr
  2012-10-20 12:11 ` tkoenig at gcc dot gnu.org
                   ` (15 subsequent siblings)
  43 siblings, 0 replies; 46+ messages in thread
From: dominiq at lps dot ens.fr @ 2012-10-20 11:22 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #28 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2012-10-20 11:22:16 UTC ---
If I understand the patch correctly, the default value for
max-inline-min-speedup is 20. This could be over-aggressive: for fatigue.f90 the
threshold is between 94 (fast) and 95 (slow). I see a similar threshold for
test_fpu.f90 from 100 (fast) to ~97 (slow), unfortunately above 94.

Also the choice of true or false for big_speedup_p is based on a test
't-t1>a*t' which is equivalent to 't1<=b*t' with b=1-a. Any reason for the
choice?
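
As a sanity check on the algebra, here is a minimal sketch of the two forms of
the test. It is not the code from the patch: the function names are
hypothetical, t and t1 stand for the call time without and with inlining, and
the threshold is kept as an integer percentage so the comparison is exact.

```python
def big_speedup(t, t1, a_percent):
    # Form quoted above: the time saved by inlining exceeds a_percent of t.
    return 100 * (t - t1) > a_percent * t

def big_speedup_rewritten(t, t1, a_percent):
    # Same test solved for t1, with b = 1 - a (kept in percent).
    b_percent = 100 - a_percent
    return 100 * t1 < b_percent * t

# Exhaustive check over small integer times and all percentage thresholds.
agree = all(
    big_speedup(t, t1, a) == big_speedup_rewritten(t, t1, a)
    for t in range(1, 60)
    for t1 in range(0, 60)
    for a in range(0, 101)
)
```

Note that the exact rewrite is strict (t1 < b*t); it differs from a '<=' form
only when t1 equals b*t exactly, e.g. t=100, t1=80 with a threshold of 20.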

Last point: AFAICT the behavior of the different params tuning the inlining is
often non-monotonic (I am trying to investigate that in more detail).


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (27 preceding siblings ...)
  2012-10-20 11:22 ` dominiq at lps dot ens.fr
@ 2012-10-20 12:11 ` tkoenig at gcc dot gnu.org
  2012-10-28 10:08 ` hubicka at gcc dot gnu.org
                   ` (14 subsequent siblings)
  43 siblings, 0 replies; 46+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2012-10-20 12:11 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #29 from Thomas Koenig <tkoenig at gcc dot gnu.org> 2012-10-20 12:10:49 UTC ---
Another approach (not for the benchmarks) would be to
make inlining tunable by the user, e.g. support

!GCC$ ATTRIBUTES always_inline :: procedure_name

See PR 41209.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (28 preceding siblings ...)
  2012-10-20 12:11 ` tkoenig at gcc dot gnu.org
@ 2012-10-28 10:08 ` hubicka at gcc dot gnu.org
  2012-10-28 10:11 ` hubicka at gcc dot gnu.org
                   ` (13 subsequent siblings)
  43 siblings, 0 replies; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-10-28 10:08 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #30 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-10-28 10:08:23 UTC ---
Created attachment 28543
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28543
Updated patch

Hi,
this is the updated patch I am testing. It fixes the big speedup test and also
changes the badness metric completely, to be based on the expected speedup of
the caller. It seems to work pretty well in my tests, with the catch that I
still can't beat 4.6 on tramp3d.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (29 preceding siblings ...)
  2012-10-28 10:08 ` hubicka at gcc dot gnu.org
@ 2012-10-28 10:11 ` hubicka at gcc dot gnu.org
  2012-10-28 11:27 ` vincenzo.innocente at cern dot ch
                   ` (12 subsequent siblings)
  43 siblings, 0 replies; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-10-28 10:11 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #31 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-10-28 10:11:13 UTC ---
Concerning Vincenzo's request for a 4.7 version: it won't work - it depends on
improvements to the inline metric and ipa-prop that we made for 4.8.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (30 preceding siblings ...)
  2012-10-28 10:11 ` hubicka at gcc dot gnu.org
@ 2012-10-28 11:27 ` vincenzo.innocente at cern dot ch
  2012-11-07  9:34 ` hubicka at gcc dot gnu.org
                   ` (11 subsequent siblings)
  43 siblings, 0 replies; 46+ messages in thread
From: vincenzo.innocente at cern dot ch @ 2012-10-28 11:27 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #32 from vincenzo Innocente <vincenzo.innocente at cern dot ch> 2012-10-28 11:27:22 UTC ---
In a small test (that I will eventually publish here) the new patch at -O2
looks superior to 4.7.2 at -O3.
I would like to build a test with multiple source files where LTO matters,
though.
We will also try to build our whole software stack with 4.8 (we have a
production version with 4.7.2 at this point, so we can move to experimental
builds with 4.8…)


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (31 preceding siblings ...)
  2012-10-28 11:27 ` vincenzo.innocente at cern dot ch
@ 2012-11-07  9:34 ` hubicka at gcc dot gnu.org
  2012-11-07 11:18 ` hubicka at gcc dot gnu.org
                   ` (10 subsequent siblings)
  43 siblings, 0 replies; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-11-07  9:34 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #33 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-11-07 09:34:25 UTC ---
Created attachment 28628
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28628
Final patch (I hope)

This is the version of the patch I will commit today or tomorrow (depending on
when the autotesters get Martin's aggregate changes). Of course it no longer
helps fatigue - we now inline the loop and do not hit the loop-based
heuristics.  I made a separate patch for this.  It does, however, help several
other testcases and should be a win for Fortran in general.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (32 preceding siblings ...)
  2012-11-07  9:34 ` hubicka at gcc dot gnu.org
@ 2012-11-07 11:18 ` hubicka at gcc dot gnu.org
  2012-11-08 16:46 ` hubicka at gcc dot gnu.org
                   ` (9 subsequent siblings)
  43 siblings, 0 replies; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-11-07 11:18 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #34 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-11-07 11:17:28 UTC ---
Created attachment 28629
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28629
Array index hint

This patch should help to inline when array descriptors become known, such as
in fatigue


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (33 preceding siblings ...)
  2012-11-07 11:18 ` hubicka at gcc dot gnu.org
@ 2012-11-08 16:46 ` hubicka at gcc dot gnu.org
  2012-11-11 18:15 ` hubicka at gcc dot gnu.org
                   ` (8 subsequent siblings)
  43 siblings, 0 replies; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-11-08 16:46 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #35 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-11-08 16:46:28 UTC ---
Author: hubicka
Date: Thu Nov  8 16:46:18 2012
New Revision: 193331

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193331
Log:
    PR middle-end/48636
    * ipa-inline.c (big_speedup_p): New function.
    (want_inline_small_function_p): Use it.
    (edge_badness): Dump it.
    * params.def (inline-min-speedup): New parameter.
    * doc/invoke.texi (inline-min-speedup): Document.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/doc/invoke.texi
    trunk/gcc/ipa-inline.c
    trunk/gcc/params.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/winline-3.c


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (34 preceding siblings ...)
  2012-11-08 16:46 ` hubicka at gcc dot gnu.org
@ 2012-11-11 18:15 ` hubicka at gcc dot gnu.org
  2012-11-12 12:16 ` hubicka at gcc dot gnu.org
                   ` (7 subsequent siblings)
  43 siblings, 0 replies; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-11-11 18:15 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #36 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-11-11 18:14:40 UTC ---
Author: hubicka
Date: Sun Nov 11 18:14:35 2012
New Revision: 193406

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193406
Log:

    PR middle-end/48636
    * ipa-inline.c (want_inline_small_function_p): Take aray index hint.
    (edge_badness): Likewise.
    * ipa-inline.h (inline_hints_vals): Add array_index and comments.
    (inline_summary_: Add ARRAY_INDEX.
    * ipa-inline-analysis.c (dump_inline_hints): Dump array_index hint.
    (reset_inline_summary): Handle array_index hint.
    (inline_node_duplication_hook): Likewise.
    (dump_inline_summary): Likewise.
    (array_index_predicate): New function.
    (estimate_function_body_sizes): Use it.
    (estimate_node_size_and_time): Use array_index hint.
    (inline_merge_summary, inline_read_section): Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-inline-analysis.c
    trunk/gcc/ipa-inline.c
    trunk/gcc/ipa-inline.h


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (35 preceding siblings ...)
  2012-11-11 18:15 ` hubicka at gcc dot gnu.org
@ 2012-11-12 12:16 ` hubicka at gcc dot gnu.org
  2012-11-12 12:45 ` izamyatin at gmail dot com
                   ` (6 subsequent siblings)
  43 siblings, 0 replies; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-11-12 12:16 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #37 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-11-12 12:16:18 UTC ---
Fatigue now gets all inlining with -O3 -fwhole-program; with -O3 alone it gets
only half of the inlining, because jump functions are not able to track array
descriptors in both calls to generalized_hookes_law.

What are the other testcases to look at?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (36 preceding siblings ...)
  2012-11-12 12:16 ` hubicka at gcc dot gnu.org
@ 2012-11-12 12:45 ` izamyatin at gmail dot com
  2012-11-14 23:22 ` hubicka at gcc dot gnu.org
                   ` (5 subsequent siblings)
  43 siblings, 0 replies; 46+ messages in thread
From: izamyatin at gmail dot com @ 2012-11-12 12:45 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

Igor Zamyatin <izamyatin at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |izamyatin at gmail dot com

--- Comment #38 from Igor Zamyatin <izamyatin at gmail dot com> 2012-11-12 12:44:43 UTC ---
Looks like for x86 r193331 led to significant regression on 172.mgrid for -m32
-O3 -funroll-loops


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (37 preceding siblings ...)
  2012-11-12 12:45 ` izamyatin at gmail dot com
@ 2012-11-14 23:22 ` hubicka at gcc dot gnu.org
  2012-11-14 23:54   ` Jan Hubicka
  2012-11-14 23:55 ` hubicka at ucw dot cz
                   ` (4 subsequent siblings)
  43 siblings, 1 reply; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-11-14 23:22 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #39 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-11-14 23:22:40 UTC ---
Hmm, indeed. Good catch. I will look into it.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Bug fortran/48636] Enable more inlining with -O2 and higher
  2012-11-14 23:22 ` hubicka at gcc dot gnu.org
@ 2012-11-14 23:54   ` Jan Hubicka
  0 siblings, 0 replies; 46+ messages in thread
From: Jan Hubicka @ 2012-11-14 23:54 UTC (permalink / raw)
  To: hubicka at gcc dot gnu.org; +Cc: gcc-bugs

mgrid does not seem to be sensitive to --param inline-min-speedup, so it seems
to be a regression independent of this change.
No idea what goes wrong.

Honza


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (38 preceding siblings ...)
  2012-11-14 23:22 ` hubicka at gcc dot gnu.org
@ 2012-11-14 23:55 ` hubicka at ucw dot cz
  2012-11-15  2:29 ` hubicka at gcc dot gnu.org
                   ` (3 subsequent siblings)
  43 siblings, 0 replies; 46+ messages in thread
From: hubicka at ucw dot cz @ 2012-11-14 23:55 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #40 from Jan Hubicka <hubicka at ucw dot cz> 2012-11-14 23:54:44 UTC ---
mgrid does not seem to be sensitive to --param inline-min-speedup, so it seems
to be a regression independent of this change.
No idea what goes wrong.

Honza


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (39 preceding siblings ...)
  2012-11-14 23:55 ` hubicka at ucw dot cz
@ 2012-11-15  2:29 ` hubicka at gcc dot gnu.org
  2012-11-16 14:43 ` dominiq at lps dot ens.fr
                   ` (2 subsequent siblings)
  43 siblings, 0 replies; 46+ messages in thread
From: hubicka at gcc dot gnu.org @ 2012-11-15  2:29 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #41 from Jan Hubicka <hubicka at gcc dot gnu.org> 2012-11-15 02:28:26 UTC ---
mgrid regression is now PR55334


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (40 preceding siblings ...)
  2012-11-15  2:29 ` hubicka at gcc dot gnu.org
@ 2012-11-16 14:43 ` dominiq at lps dot ens.fr
  2013-03-01 17:49 ` wschmidt at gcc dot gnu.org
  2013-03-04 17:54 ` wschmidt at gcc dot gnu.org
  43 siblings, 0 replies; 46+ messages in thread
From: dominiq at lps dot ens.fr @ 2012-11-16 14:43 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

--- Comment #42 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2012-11-16 14:42:33 UTC ---
> Fatigue now gets all inlining with -O3 -fwhole-program, with -O3 it gets only
> half of inlining because jump functions are not able to track array descriptors
> in both calls to generalized_hookes_law.  

The same applies to the tests in comments #4 and #10.
What is the status of "assumed-shape" arrays with respect to "array_index
hint"?

> What are the other testcases to look at?

ac.f90 and aermod.f90 give a 5-10% faster runtime when compiled with '--param
max-inline-insns-auto=150' (and -Ofast -fwhole-program).


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (41 preceding siblings ...)
  2012-11-16 14:43 ` dominiq at lps dot ens.fr
@ 2013-03-01 17:49 ` wschmidt at gcc dot gnu.org
  2013-03-04 17:54 ` wschmidt at gcc dot gnu.org
  43 siblings, 0 replies; 46+ messages in thread
From: wschmidt at gcc dot gnu.org @ 2013-03-01 17:49 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

William J. Schmidt <wschmidt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wschmidt at gcc dot gnu.org

--- Comment #43 from William J. Schmidt <wschmidt at gcc dot gnu.org> 2013-03-01 17:48:51 UTC ---
(In reply to comment #38)
> Looks like for x86 r193331 led to significant regression on 172.mgrid for -m32
> -O3 -funroll-loops

The same degradation was seen on powerpc64-unknown-linux-gnu with r193331.  The
fix by Martin Jambor for PR55334 did not help for -m32.  It did give a slight
bump to -m64, but did not return the performance to pre-r193331 levels.  So
there still seems to be a problem with 172.mgrid related to this change.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Bug fortran/48636] Enable more inlining with -O2 and higher
  2011-04-15 21:05 [Bug fortran/48636] New: Enable more inlining with -O2 and higher tkoenig at gcc dot gnu.org
                   ` (42 preceding siblings ...)
  2013-03-01 17:49 ` wschmidt at gcc dot gnu.org
@ 2013-03-04 17:54 ` wschmidt at gcc dot gnu.org
  43 siblings, 0 replies; 46+ messages in thread
From: wschmidt at gcc dot gnu.org @ 2013-03-04 17:54 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48636

William J. Schmidt <wschmidt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bergner at gcc dot gnu.org

--- Comment #44 from William J. Schmidt <wschmidt at gcc dot gnu.org> 2013-03-04 17:53:17 UTC ---
Compiling mgrid.f on powerpc64-unknown-linux-gnu as follows:

$ gfortran -S -m32 -O3 -mcpu=power7 -fpeel-loops -funroll-loops -ffast-math \
    -fvect-cost-model mgrid.f

I examined the assembly generated for revisions 193330, 193331 (this issue),
and 196171 (PR55334).  What I'm seeing is that for both 193331 and 196171, the
inliner is much more aggressive, and in particular is inlining several copies
of some pretty large functions.

For -m32, I am not seeing any specialization of resid_, so although the change
in 196171 helped a little, it appears that this was by reducing overall code
size.  There weren't any changes in inlining decisions.  Of course there is a
lot of distance between 193331 and 196171, so it is not a perfect comparison,
though it appears 196171 is where -m32 received a slight boost.

Anyway, the non-inlined call tree for 193330 is:

 main
  MAIN__
   resid_ (x4)
    comm3_
   psinv_ (x3)
    comm3_
   norm2u3_ (x2)
   interp_ (x2)
   setup_
   rprj3_ (x4)
   zran3_

The non-inlined call tree for 193331 is:

 main
  MAIN__
   comm3_ (x5)
   resid_
    comm3_
   norm2u3_ (x2)
   setup_
   zran3_

So with 193331 we have the following additional inlines:

  3 inlines of resid_,  size = 1068, total size = 3204
  3 inlines of psinv_,  size = 1046, total size = 3138
  2 inlines of interp_, size = 1544, total size = 3088
  4 inlines of rprj3_,  size = 220,  total size = 880

Here "size" is the number of lines of assembly code of the called procedure,
including labels, so it's just a rough measure.  The number of static call
sites of comm3_ was also reduced by one, but I don't know whether it was
inlined or specialized away.
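
For reference, the per-procedure totals in the list above can be re-derived
with a few lines of Python. This is only a sketch that restates the reported
numbers; the variable names are made up for illustration.

```python
# (procedure, additional inlined copies, lines of assembly per copy),
# taken from the list reported above.
inlines = [
    ("resid_", 3, 1068),
    ("psinv_", 3, 1046),
    ("interp_", 2, 1544),
    ("rprj3_", 4, 220),
]

# Lines of assembly duplicated into callers, per procedure and overall.
totals = {name: copies * size for name, copies, size in inlines}
grand_total = sum(totals.values())

for name, total in totals.items():
    print(f"{name:8s} {total:5d}")
print(f"{'all':8s} {grand_total:5d}")
```

The sum is only a rough upper estimate of the duplication, since it uses the
out-of-line size of each procedure rather than its size after inlining.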

These are pretty large procedures to be duplicating, particularly to be
duplicating more than once.  Looking at resid_, it already generates spill code
on its own, so putting 3 copies of this in its caller isn't likely to be very
helpful.  Of these, I think only rprj3_ looks like a reasonable inline
candidate.

Total lines of the assembly files are:

  8660 r193330/mgrid.s
 16398 r193331/mgrid.s
 14592 r196171/mgrid.s

Inlining creates unreachable code, so removing the unreachable procedures
gives:

  7765 r193330/mgrid.s
 12591 r193331/mgrid.s
 10795 r196171/mgrid.s

With r196171 the reachable code is still about 40% larger than r193330 (where
some reasonable inlining was already being done).  This is better than the 60%
bloat with r193331 but still seems too high.  Again, these are rough measures
but I think they are indicative.
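
The growth figures quoted above follow from the reachable line counts; a
minimal sketch of the arithmetic, using only the numbers listed in this
comment:

```python
# Reachable lines of assembly per revision, from the second listing above.
reachable = {"r193330": 7765, "r193331": 12591, "r196171": 10795}

base = reachable["r193330"]
# Percentage growth of each revision relative to r193330.
growth = {rev: 100.0 * (n - base) / base for rev, n in reachable.items()}

for rev, pct in growth.items():
    print(f"{rev}: {pct:.1f}% larger than r193330")
```

This gives roughly 62% for r193331 and 39% for r196171, consistent with the
rounded "60%" and "about 40%" above.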

Without knowing anything about the inliner, I think the inlining heuristics
probably need to take more account of code size than they seem to do at the
moment, particularly when making more than one copy of a procedure and thus
reducing spatial locality.

^ permalink raw reply	[flat|nested] 46+ messages in thread
