public inbox for gcc-bugs@sourceware.org
* [Bug lto/51497] New: [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107
@ 2011-12-10 13:27 dominiq at lps dot ens.fr
  2011-12-10 18:42 ` [Bug lto/51497] " dominiq at lps dot ens.fr
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-12-10 13:27 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51497

             Bug #: 51497
           Summary: [4.7 Regression] The run time for the polyhedron test
                    nf.f90 is ~10% slower with -flto after revision 182107
    Classification: Unclassified
           Product: gcc
           Version: 4.7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: lto
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: dominiq@lps.ens.fr
                CC: rguenther@suse.de
              Host: x86_64-apple-darwin10
            Target: x86_64-apple-darwin10
             Build: x86_64-apple-darwin10


After revision 182107 the run time for the polyhedron test nf.f90 is ~10%
slower with -flto:

[macbook] lin/test% gfc -Ofast -funroll-loops nf.f90
[macbook] lin/test% time a.out > /dev/null
18.795u 0.202s 0:19.00 99.9%    0+0k 0+0io 0pf+0w
[macbook] lin/test% gfc -Ofast -funroll-loops nf.f90 -flto
[macbook] lin/test% time a.out > /dev/null
20.640u 0.173s 0:20.82 99.9%    0+0k 0+0io 0pf+0w

This slowdown disappears if I revert the first 'else if' added block:

[macbook] lin/test% gfcp -Ofast -funroll-loops nf.f90
[macbook] lin/test% time a.out > /dev/null
18.820u 0.198s 0:19.02 99.9%    0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -Ofast -funroll-loops nf.f90 -flto
[macbook] lin/test% time a.out > /dev/null
18.821u 0.202s 0:19.02 100.0%    0+0k 0+0io 0pf+0w

but not if I revert the second one:

[macbook] lin/test% /opt/gcc/gcc4.7p-182107r1a/bin/gfortran -Ofast
-funroll-loops nf.f90
[macbook] lin/test% time a.out > /dev/null
18.809u 0.199s 0:19.01 99.8%    0+0k 0+0io 0pf+0w
[macbook] lin/test% /opt/gcc/gcc4.7p-182107r1a/bin/gfortran -Ofast
-funroll-loops nf.f90 -flto
[macbook] lin/test% time a.out > /dev/null
20.601u 0.177s 0:20.78 99.9%    0+0k 0+0io 0pf+0w
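
As a quick check of the "~10%" figure, the user times above can be turned into relative slowdowns. This is only a sketch: the helper name slowdown_pct and the run labels are made up for the illustration, and it assumes the user-CPU column printed by time is the right quantity to compare.

```python
def slowdown_pct(base_seconds, lto_seconds):
    """Relative slowdown of the -flto build, in percent of the non-LTO time."""
    return 100.0 * (lto_seconds - base_seconds) / base_seconds


# (user time without -flto, user time with -flto), copied from the runs above.
# The labels are descriptive only; they are not names used in the report.
runs = {
    "after r182107": (18.795, 20.640),
    "first 'else if' block reverted": (18.820, 18.821),
    "second 'else if' block reverted": (18.809, 20.601),
}

for label, (base, lto) in runs.items():
    print(f"{label}: {slowdown_pct(base, lto):+.2f}%")
```

Under that assumption, only the build with the first block reverted brings the -flto time back to the non-LTO level.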



* [Bug lto/51497] [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107
  2011-12-10 13:27 [Bug lto/51497] New: [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107 dominiq at lps dot ens.fr
@ 2011-12-10 18:42 ` dominiq at lps dot ens.fr
  2011-12-11 14:14 ` dominiq at lps dot ens.fr
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-12-10 18:42 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51497

--- Comment #1 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-12-10 18:39:15 UTC ---
Without -flto, the profile is:

+ 34.6%, nf3dprecon.2105.constprop.1, a.out
|   34.6%, nf2dprecon.2116, a.out
  33.5%, spmmult.2139, a.out
+ 29.8%, nfcg_, a.out
| + 7.6%, nf3dprecon.2105.constprop.1, a.out
| |   0.4%, nf2dprecon.2116, a.out
|   0.4%, nf2dprecon.2116, a.out
  0.9%, mattest_, a.out

and with -flto

+ 37.7%, nf3dprecon.2105.2457.constprop.1.2435, a.out
|   37.7%, nf2dprecon.2116.2442.2436, a.out
  32.7%, spmmult.2139.2426.2446, a.out
+ 27.6%, nfcg_, a.out
| + 7.0%, nf3dprecon.2105.2457.constprop.1.2435, a.out
| |   0.4%, nf2dprecon.2116.2442.2436, a.out
|   0.4%, nf2dprecon.2116.2442.2436, a.out
|   0.0%, free, libSystem.B.dylib
  0.8%, mattest_, a.out

So the slow routines are nf2dprecon, accounting for ~1.2s, and spmmult,
accounting for ~0.5s. If I am reading the assembly correctly, in nf2dprecon,
the implicit loop

x(i:i+nx-1) = x(i:i+nx-1) - au2(i-nx:i-1)*x(i-nx:i-1)

is unrolled eight times without -flto and four times with -flto. In spmmult,
the implicit loop

b = ad*x

is unrolled four times and vectorized without -flto and eight times, but not
vectorized, with -flto.

Note that --param max-unroll-times=4 does not change the times.
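
The ~1.2s and ~0.5s estimates can be reproduced from the profile shares. The sketch below assumes the percentages can be applied to the user times from the initial report (18.795s without -flto, 20.640s with it), even though the profiled runs may have been separate ones. The helper name routine_seconds is hypothetical.

```python
def routine_seconds(user_seconds, share_pct):
    """Time attributed to one routine, given its share of the profile."""
    return user_seconds * share_pct / 100.0


# User times from the initial report (assumed to match the profiled runs).
USER_NO_LTO = 18.795
USER_LTO = 20.640

# routine: (share without -flto, share with -flto), in percent
shares = {
    "nf2dprecon": (34.6, 37.7),
    "spmmult": (33.5, 32.7),
}

# Extra seconds spent in each routine when building with -flto.
extra = {
    name: routine_seconds(USER_LTO, lto) - routine_seconds(USER_NO_LTO, base)
    for name, (base, lto) in shares.items()
}

for name, seconds in extra.items():
    print(f"{name}: {seconds:+.2f}s with -flto")
```

Note that spmmult's share drops with -flto while its absolute time still grows, because the total run time is larger.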



* [Bug lto/51497] [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107
  2011-12-10 13:27 [Bug lto/51497] New: [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107 dominiq at lps dot ens.fr
  2011-12-10 18:42 ` [Bug lto/51497] " dominiq at lps dot ens.fr
@ 2011-12-11 14:14 ` dominiq at lps dot ens.fr
  2011-12-12 10:40 ` rguenth at gcc dot gnu.org
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-12-11 14:14 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51497

--- Comment #2 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-12-11 14:07:59 UTC ---
Upon further looking at the assembly, I have found that the seven loops in
spmmult are all vectorized without -flto, while none of them are with -flto. 

For nf2dprecon after trisolve inlining, the code looks like

subroutine NF2DPrecon(x,gi,au1,au2,i1,i2,nx)       ! 2D NF Preconditioning matrix
implicit none
integer :: i1,i2,nx
real(8),dimension(i2)::x,t,gi,au1,au2
integer :: i,j
do i = i1 , i2 , nx
   if ( i>i1 ) x(i:i+nx-1) = x(i:i+nx-1) - au2(i-nx:i-1)*x(i-nx:i-1)
   x(i) = gi(i)* x(i)
   do j = i+1 , i+nx-1
      x(j) = gi(j)*(x(j)-au1(j-1)*x(j-1))
   enddo
   do j = i+nx-2 , i , -1
      x(j) = x(j) - gi(j)*au1(j)*x(j+1)
   enddo
enddo 
do i = i2-2*nx+1 , i1 , -nx
   t(i:i+nx-1) = au2(i:i+nx-1)*x(i+nx:i+2*nx-1)
   t(i) = gi(i)* t(i)
   do j = i+1 , i+nx-1
      t(j) = gi(j)*(t(j)-au1(j-1)*t(j-1))
   enddo
   do j = i+nx-2 , i , -1
      t(j) = t(j) - gi(j)*au1(j)*t(j+1)
   enddo
   x(i:i+nx-1) = x(i:i+nx-1) - t(i:i+nx-1)
enddo
end subroutine NF2DPrecon            !=========================================

where none of the explicit 'do j' loops are vectorized ("possible dependence
between data-refs"). The three implicit loops are all vectorized without
-flto, but only the last two are with -flto. Note that the first loop, which is
not vectorized with -flto:

x(i:i+nx-1) = x(i:i+nx-1) - au2(i-nx:i-1)*x(i-nx:i-1)

is vectorized without it, with the note "created 1 versioning for alias checks."
(Is this a check for aliasing between au2 and x? If so, valid Fortran code
guarantees that there is no such aliasing.)
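
For readers unfamiliar with the "versioning for alias checks" note: the vectorizer keeps two copies of the loop and picks one at run time, depending on whether the memory ranges overlap. The following is a rough model of that idea, not GCC's code. Arrays are offsets into one flat list mem, and every function name here is invented for the illustration.

```python
def overlaps(start_a, start_b, n):
    """True if the n-element ranges starting at start_a and start_b intersect."""
    return start_a < start_b + n and start_b < start_a + n


def scalar_loop(mem, dst, coef, src, n):
    """Reference semantics: one element at a time, in order."""
    for k in range(n):
        mem[dst + k] -= mem[coef + k] * mem[src + k]


def vector_loop(mem, dst, coef, src, n, width=2):
    """Process `width` elements per step: load a whole chunk, then store it."""
    for k in range(0, n, width):
        m = min(width, n - k)
        chunk = [mem[dst + k + j] - mem[coef + k + j] * mem[src + k + j]
                 for j in range(m)]
        mem[dst + k:dst + k + m] = chunk


def versioned_loop(mem, dst, coef, src, n):
    """Use the vector copy only if the stores cannot touch the loaded ranges."""
    if not overlaps(dst, coef, n) and not overlaps(dst, src, n):
        vector_loop(mem, dst, coef, src, n)
        return "vector"
    scalar_loop(mem, dst, coef, src, n)
    return "scalar"
```

When the compiler can prove the ranges are disjoint, as the Fortran aliasing rules would allow for distinct dummy arguments, the runtime test and the scalar copy are unnecessary.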



* [Bug lto/51497] [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107
  2011-12-10 13:27 [Bug lto/51497] New: [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107 dominiq at lps dot ens.fr
  2011-12-10 18:42 ` [Bug lto/51497] " dominiq at lps dot ens.fr
  2011-12-11 14:14 ` dominiq at lps dot ens.fr
@ 2011-12-12 10:40 ` rguenth at gcc dot gnu.org
  2011-12-12 13:11 ` dominiq at lps dot ens.fr
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-12-12 10:40 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51497

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |4.7.0

--- Comment #3 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-12-12 10:37:51 UTC ---
I can't see any vectorizer differences for the testcase in comment #2 and the
patch you cite only (should) have debuginfo changes, no changes to the produced
IL at statement level (eventually it has better type-based alias analysis).

Not confirmed.

The two else if blocks are related, not independent, independently reverting
them makes no sense.



* [Bug lto/51497] [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107
  2011-12-10 13:27 [Bug lto/51497] New: [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107 dominiq at lps dot ens.fr
                   ` (2 preceding siblings ...)
  2011-12-12 10:40 ` rguenth at gcc dot gnu.org
@ 2011-12-12 13:11 ` dominiq at lps dot ens.fr
  2011-12-12 14:46 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-12-12 13:11 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51497

--- Comment #4 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-12-12 13:09:49 UTC ---
> I can't see any vectorizer differences for the testcase in comment #2 and the
> patch you cite only (should) have debuginfo changes, no changes to the produced
> IL at statement level (eventually it has better type-based alias analysis).
>
> Not confirmed.

I have just done the following check:

(1) gfc -Ofast -funroll-loops nf.f90 -ftree-vectorizer-verbose=1 > & tmp1
(2) gfc -Ofast -funroll-loops nf.f90 -ftree-vectorizer-verbose=1 -flto > & tmp2
    I noticed that the tmp2 file contains two sets of annotations, likely one for
    the usual vectorization (up to line 334) and a second one for the LTO stage.
(3) I have split the file tmp2 into a new tmp2, keeping only the first 334 lines,
    and a second file, tmp3, containing the second part.
(4) I have used diff to compare the files: tmp1 and the new tmp2 are identical,
    while I see missing vectorizations in tmp3:

--- tmp1    2011-12-12 13:49:06.000000000 +0100
+++ tmp3    2011-12-12 13:54:12.000000000 +0100
...
-206: LOOP VECTORIZED.
-nf.f90:204: note: vectorized 7 loops in function.
...
-nf.f90:256: note: vectorized 3 loops in function.
+nf.f90:256: note: vectorized 2 loops in function.
...
-nf.f90:288: note: vectorized 3 loops in function.
+nf.f90:288: note: vectorized 2 loops in function.

This confirms what I have seen in the disassembled executable.
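
The split-and-compare procedure above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the split position (334 in the report) is passed in as a parameter, and the dumps are treated as plain lists of lines.

```python
import difflib


def split_dump(lines, first_part_len):
    """Split a combined dump into the compile-stage part and the LTO-stage part."""
    return lines[:first_part_len], lines[first_part_len:]


def changed_lines(reference, candidate):
    """Return only the added/removed lines of a unified diff, without headers."""
    diff = difflib.unified_diff(reference, candidate,
                                fromfile="tmp1", tofile="tmp3",
                                n=0, lineterm="")
    # Note: this simple filter would also drop a content line starting with
    # "--" or "++"; good enough for vectorizer notes.
    return [line for line in diff
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]
```

An empty result from changed_lines means the two dumps report the same vectorization decisions.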

Questions:
(1) do you see the slowdown with -flto?
(2) can you reproduce the above?

> The two else if blocks are related, not independent, independently reverting
> them makes no sense.

I am not suggesting removing one block. I was only interested in finding which
part of the patch caused/exposed the problem (which looks like yet another
instance of a bad choice of optimization for size: as pointed out in PR 51499,
the vectorization generates two loops, one vectorized and one not, hence
~doubling the code size).



* [Bug lto/51497] [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107
  2011-12-10 13:27 [Bug lto/51497] New: [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107 dominiq at lps dot ens.fr
                   ` (3 preceding siblings ...)
  2011-12-12 13:11 ` dominiq at lps dot ens.fr
@ 2011-12-12 14:46 ` rguenth at gcc dot gnu.org
  2011-12-12 15:19 ` dominiq at lps dot ens.fr
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-12-12 14:46 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51497

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011-12-12
     Ever Confirmed|0                           |1

--- Comment #5 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-12-12 14:40:38 UTC ---
I can't reproduce anything with the testcase from comment #2.  I can confirm,
for the whole nf.f90 testcase:

 191: LOOP VECTORIZED.
 192: LOOP VECTORIZED.
 193: LOOP VECTORIZED.
-206: LOOP VECTORIZED.
-207: LOOP VECTORIZED.
-208: LOOP VECTORIZED.
-209: LOOP VECTORIZED.
-210: LOOP VECTORIZED.
-211: LOOP VECTORIZED.
-212: LOOP VECTORIZED.
 220: LOOP VECTORIZED.
 248: LOOP VECTORIZED.
-261: LOOP VECTORIZED.
 265: LOOP VECTORIZED.
 267: LOOP VECTORIZED.
 280: LOOP VECTORIZED.
-293: LOOP VECTORIZED.
 297: LOOP VECTORIZED.
 299: LOOP VECTORIZED.

for -flto vs. -fno-lto.

I see differences in the dumps, mainly around references to parent
function variables apparently no longer hoisted out of the loop.

Maybe you can reduce the testcase with this information?  A simple attempt
failed:

subroutine nfcg(nx,nxy,nxyz,ad,x,maxiter)
implicit none ; integer,parameter :: dpkind=kind(1.0D0)
integer :: nx , nxy , nxyz , maxiter
real(dpkind),dimension(nxyz):: ad,x
real(dpkind),allocatable,dimension(:) :: r
allocate (r(nxyz))
CALL SPMMULT(r,x)
contains
subroutine spmmult(x,b)
real(dpkind),dimension(nxyz):: x,b
b = ad*x
end subroutine spmmult
end subroutine

With this reduced version there are no CHAIN pointers left.



* [Bug lto/51497] [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107
  2011-12-10 13:27 [Bug lto/51497] New: [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107 dominiq at lps dot ens.fr
                   ` (4 preceding siblings ...)
  2011-12-12 14:46 ` rguenth at gcc dot gnu.org
@ 2011-12-12 15:19 ` dominiq at lps dot ens.fr
  2011-12-14 13:06 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-12-12 15:19 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51497

--- Comment #6 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-12-12 15:17:19 UTC ---
> I can't reproduce anything with the testcase from comment #2.

Sorry for the confusion. The code in comment #2 was here only to show the
actual code after the inlining of trisolve and to try to point to the loops
that were vectorized or not with/without -flto. 

> Maybe you can reduce the testcase with this information?

Well, I'll try, but I won't have too much time for it in the coming weeks.



* [Bug lto/51497] [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107
  2011-12-10 13:27 [Bug lto/51497] New: [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107 dominiq at lps dot ens.fr
                   ` (5 preceding siblings ...)
  2011-12-12 15:19 ` dominiq at lps dot ens.fr
@ 2011-12-14 13:06 ` rguenth at gcc dot gnu.org
  2011-12-14 15:33 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-12-14 13:06 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51497

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
         AssignedTo|unassigned at gcc dot       |rguenth at gcc dot gnu.org
                   |gnu.org                     |

--- Comment #7 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-12-14 12:50:52 UTC ---
Created attachment 26080
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26080
patch candidate

Ok, so the reason is that we now stream the VLA types for, for example, 'x'
locally.  Thus they do not go through the type merging machinery (no
problem), but they also do not get a TYPE_CANONICAL computed.  That would
not be bad either if TYPE_CANONICAL were the type itself, but it is NULL_TREE,
and thus references to such VLA arrays get alias-set zero.

The function-local LTO sections are not structured in a way that lets us
arrange to call uniquify_nodes, but we should be able to fix up
canonical types (and variant types).
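
To see why a missing TYPE_CANONICAL costs vectorization, here is a toy model of type-based alias analysis. It is not GCC's implementation: the class Type and the functions alias_set and may_conflict are invented for the illustration, None stands in for NULL_TREE, and the real rules (for example alias-set subsets) are more involved. The one behaviour it is meant to show is that alias set zero conflicts with everything.

```python
class Type:
    """Stand-in for a type node; `canonical` plays the role of TYPE_CANONICAL."""

    def __init__(self, name, canonical=None):
        self.name = name
        self.canonical = canonical  # None models NULL_TREE


_alias_sets = {}  # canonical type object -> alias set number (1, 2, ...)


def alias_set(t):
    """Types without a canonical type fall back to alias set zero."""
    if t.canonical is None:
        return 0
    if t.canonical not in _alias_sets:
        _alias_sets[t.canonical] = len(_alias_sets) + 1
    return _alias_sets[t.canonical]


def may_conflict(a, b):
    """Set zero conflicts with every set; otherwise only equal sets conflict."""
    set_a, set_b = alias_set(a), alias_set(b)
    return set_a == 0 or set_b == 0 or set_a == set_b


integer = Type("integer")
integer.canonical = integer
local_vla = Type("real(8), dimension(nxyz)")  # streamed locally, no canonical type

print(may_conflict(local_vla, integer))  # conservative answer before the fixup
local_vla.canonical = local_vla          # models fixing up the canonical type
print(may_conflict(local_vla, integer))
```

In this model, a store through the local VLA type must be assumed to clobber any other load, which fits the dumps showing parent-function variables no longer hoisted out of the loop.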



* [Bug lto/51497] [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107
  2011-12-10 13:27 [Bug lto/51497] New: [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107 dominiq at lps dot ens.fr
                   ` (6 preceding siblings ...)
  2011-12-14 13:06 ` rguenth at gcc dot gnu.org
@ 2011-12-14 15:33 ` rguenth at gcc dot gnu.org
  2011-12-14 15:36 ` rguenth at gcc dot gnu.org
  2011-12-14 15:39 ` dominiq at lps dot ens.fr
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-12-14 15:33 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51497

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED

--- Comment #9 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-12-14 15:31:48 UTC ---
Fixed.



* [Bug lto/51497] [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107
  2011-12-10 13:27 [Bug lto/51497] New: [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107 dominiq at lps dot ens.fr
                   ` (7 preceding siblings ...)
  2011-12-14 15:33 ` rguenth at gcc dot gnu.org
@ 2011-12-14 15:36 ` rguenth at gcc dot gnu.org
  2011-12-14 15:39 ` dominiq at lps dot ens.fr
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-12-14 15:36 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51497

--- Comment #8 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-12-14 15:31:31 UTC ---
Author: rguenth
Date: Wed Dec 14 15:31:24 2011
New Revision: 182336

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=182336
Log:
2011-12-14  Richard Guenther  <rguenther@suse.de>

    PR lto/51497
    * lto-streamer-in.c (lto_read_body): Fixup local types
    TYPE_CANONICAL and variant chain.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/lto-streamer-in.c



* [Bug lto/51497] [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107
  2011-12-10 13:27 [Bug lto/51497] New: [4.7 Regression] The run time for the polyhedron test nf.f90 is ~10% slower with -flto after revision 182107 dominiq at lps dot ens.fr
                   ` (8 preceding siblings ...)
  2011-12-14 15:36 ` rguenth at gcc dot gnu.org
@ 2011-12-14 15:39 ` dominiq at lps dot ens.fr
  9 siblings, 0 replies; 11+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-12-14 15:39 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51497

--- Comment #10 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-12-14 15:36:19 UTC ---
> Created attachment 26080
>   --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26080
> patch candidate
> ...

The patch fixes this PR without any visible side effect on the other tests of
the polyhedron suite. Thanks.

