[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
@ 2013-03-29 11:03 ` Joost.VandeVondele at mat dot ethz.ch
  2013-07-14  7:11 ` spop at gcc dot gnu.org
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2013-03-29 11:03 UTC (permalink / raw)
  To: gcc-bugs


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2006-04-23 17:57:20         |2013-03-29
                 CC|                            |Joost.VandeVondele at mat
                   |                            |dot ethz.ch
             Blocks|                            |51119
            Summary|missing transformations     |graphite with loop blocking
                   |lead to poorly optimized    |and interchanging doesn't
                   |code                        |optimize a matrix
                   |                            |multiplication loop
      Known to fail|tree-ssa                    |4.9.0

--- Comment #17 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> 2013-03-29 11:03:40 UTC ---
With 4.9 trunk using 

gfortran -ffast-math -O3 -march=native -fgraphite -floop-interchange
-floop-block PR14741.f90

The testcase still runs about 8x slower than what ifort produces at -O3 -xHost.

This is too bad, and in my opinion blocks as simple solution for PR51119


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
  2013-03-29 11:03 ` [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop Joost.VandeVondele at mat dot ethz.ch
@ 2013-07-14  7:11 ` spop at gcc dot gnu.org
  2013-07-16  2:25 ` spop at gcc dot gnu.org
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: spop at gcc dot gnu.org @ 2013-07-14  7:11 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #18 from Sebastian Pop <spop at gcc dot gnu.org> ---
On my laptop ARM Exynos5 at 1.6GHz I get this:

gfortran -ffast-math -O3 t.f90
./a.out
   192.75000000000000        10.239999999999826     

gfortran -ffast-math -O3 -fgraphite -floop-interchange -floop-block t.f90
./a.out
   193.77500000000001        10.239999999999826     

gfortran -ffast-math -O3 -floop-nest-optimize t.f90
t.f90: In function ‘MAIN__’:
t.f90:5:0: warning: iteration 31 invokes undefined behavior
[-Waggressive-loop-optimizations]
 B=0.1D0
 ^
f951: note: containing loop
t.f90:4:0: warning: iteration 31 invokes undefined behavior
[-Waggressive-loop-optimizations]
 A=0.1D0
 ^
f951: note: containing loop
./a.out
./a.out: No such file or directory

I don't know why the compiler does not produce an executable: -S does produce a
.s file.

Adding -fdump-tree-graphite-all produces a file t.f90.106t.graphite containing
the information about what graphite has done: I see that we do loop block the
loop nest like this:

gfortran -ffast-math -O3 -floop-nest-optimize -fdump-tree-graphite-all t.f90

CLAST generated by CLooG: 
for (scat_0=0;scat_0<=1023;scat_0+=32) {
  for (scat_1=0;scat_1<=1023;scat_1+=32) {
    for (scat_2=scat_0;scat_2<=scat_0+31;scat_2++) {
      for (scat_3=scat_1;scat_3<=scat_1+31;scat_3++) {
        (scat_2,scat_3);
      }
    }
  }
}

I see that the tile size is hard coded as a constant in graphite-optimize-isl.c

  TileMap = getTileMap(ctx, *Dimensions, 32);

that should be replaced by a param and tuned.
>From gcc-bugs-return-426199-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org Sun Jul 14 07:47:37 2013
Return-Path: <gcc-bugs-return-426199-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Delivered-To: listarch-gcc-bugs@gcc.gnu.org
Received: (qmail 12209 invoked by alias); 14 Jul 2013 07:47:37 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Delivered-To: mailing list gcc-bugs@gcc.gnu.org
Received: (qmail 6560 invoked by uid 48); 14 Jul 2013 07:45:34 -0000
From: "frankhb1989 at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug c++/53402] [C++11] non-inline namespace can be wrongly re-opened as inline
Date: Sun, 14 Jul 2013 07:47:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: c++
X-Bugzilla-Version: 4.8.0
X-Bugzilla-Keywords: accepts-invalid
X-Bugzilla-Severity: minor
X-Bugzilla-Who: frankhb1989 at gmail dot com
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: cc
Message-ID: <bug-53402-4-sZZHa3r89Z@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-53402-4@http.gcc.gnu.org/bugzilla/>
References: <bug-53402-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2013-07/txt/msg00706.txt.bz2
Content-length: 652

http://gcc.gnu.org/bugzilla/show_bug.cgi?idS402

frankhb1989 at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |frankhb1989 at gmail dot com

--- Comment #2 from frankhb1989 at gmail dot com ---
Confirmed in 4.8.1.
PS. Clang(trunk) treats 'non-inline namespace cannot be reopened as a inline'
an error (correctly), but also complains 'inline namespace cannot be reopened
as a non-inline namespace'. I'm curious why the latter should trigger a
warning. G++ have no diagnostic for them at all.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
  2013-03-29 11:03 ` [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop Joost.VandeVondele at mat dot ethz.ch
  2013-07-14  7:11 ` spop at gcc dot gnu.org
@ 2013-07-16  2:25 ` spop at gcc dot gnu.org
  2013-07-16  2:29 ` spop at gcc dot gnu.org
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: spop at gcc dot gnu.org @ 2013-07-16  2:25 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #19 from Sebastian Pop <spop at gcc dot gnu.org> ---
Default: tile_size = 32
gfortran -ffast-math -O3 -floop-nest-optimize t.f90 -v
./a.out
   176.42500000000001        10.239999999999826     

and then with a trivial patch that replaces that default constant 32 with the
param that we already have in -floop-block we can see that there is not much
impact of the tile size:

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=51
./a.out
   173.49000000000001        10.239999999999826     

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=10
./a.out
   175.32499999999999        10.239999999999826     

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=20
./a.out
   173.91000000000000        10.239999999999826     

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=30
./a.out
   177.07499999999999        10.239999999999826     

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=40
./a.out
   173.77500000000001        10.239999999999826     

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=50
./a.out
   176.96500000000000        10.239999999999826     

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=60
./a.out
   176.78999999999999        10.239999999999826     

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=70
./a.out
   176.12500000000000        10.239999999999826     

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=80
./a.out
   175.55000000000001        10.239999999999826     

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=90
./a.out
   184.63500000000002        10.239999999999826     

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=100
./a.out
   187.30500000000001        10.239999999999826     

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=110
./a.out
   187.99000000000001        10.239999999999826     

gfortran -ffast-math -O3 -floop-nest-optimize t.f90 --param
loop-block-tile-size=120
./a.out
   188.13999999999999        10.239999999999826


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (2 preceding siblings ...)
  2013-07-16  2:25 ` spop at gcc dot gnu.org
@ 2013-07-16  2:29 ` spop at gcc dot gnu.org
  2013-07-21  9:15 ` spop at gcc dot gnu.org
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: spop at gcc dot gnu.org @ 2013-07-16  2:29 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #20 from Sebastian Pop <spop at gcc dot gnu.org> ---
-floop-nest-optimize does not have a cost model and so it blocks all loop
nests.
These two stmts are blocked:
A=0.1D0
B=0.1D0

This other stmt is not blocked:
   C(I,J)=C(I,J)+A(I,K)*B(K,J)

The only thing we can get from the dump file for this stmt is:
number of SCoPs: 0
most likely scop detection did not like something in that loop nest.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (3 preceding siblings ...)
  2013-07-16  2:29 ` spop at gcc dot gnu.org
@ 2013-07-21  9:15 ` spop at gcc dot gnu.org
  2013-07-21  9:59 ` spop at gcc dot gnu.org
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: spop at gcc dot gnu.org @ 2013-07-21  9:15 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #21 from Sebastian Pop <spop at gcc dot gnu.org> ---
Scop detection does not detect this loop because we now require the scev of the
data references to be analyzable in all the loops around:

commit e97c4b0daa932558eae4d9b9794cdd549e6d40bd
Author: rguenth <rguenth@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Tue Jan 10 09:14:51 2012 +0000

    2012-01-10  Richard Guenther  <rguenther@suse.de>

        PR tree-optimization/50913
        * graphite-scop-detection.c (stmt_has_simple_data_refs_p):
        Require data-refs to be representable by Graphite with respect
        to any loop nest.

        * gcc.dg/graphite/interchange-16.c: New testcase.
        * gcc.dg/graphite/scop-20.c: XFAIL.
        * gfortran.dg/graphite/interchange-1.f: Likewise.
        * gfortran.dg/graphite/block-1.f90: Likewise.
        * gfortran.dg/graphite/block-2.f: Likewise.

So for this gimple stmt:
# VUSE <.MEM_6>
_29 = *a_28(D)[_27];

the data reference when analyzed in the innermost loop is:
{(stride.12_14 + offset.13_15) + _19, +, stride.12_14}_3
and that cannot be represented in graphite as it has a non constant
(parametric) stride.

Note that this data reference can be represented in an outer loop, as the
stride is then a multidimensional access with constant strides.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (4 preceding siblings ...)
  2013-07-21  9:15 ` spop at gcc dot gnu.org
@ 2013-07-21  9:59 ` spop at gcc dot gnu.org
  2013-07-21 10:14 ` spop at gcc dot gnu.org
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: spop at gcc dot gnu.org @ 2013-07-21  9:59 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #22 from Sebastian Pop <spop at gcc dot gnu.org> ---
Once we revert that patch, the remaining problem is that
graphite_can_represent_scev returns false on this scev:

{{(stride.12_14 + offset.13_15) + 1, +, stride.12_14}_1, +, 1}_2

the parameters are defined like this:

    _12 = *n_11(D);
    ubound.10_13 = (integer(kind=8)) _12;
    stride.12_14 = MAX_EXPR <ubound.10_13, 0>;
    offset.13_15 = ~stride.12_14;


#(Data Ref: 
#  bb: 5 
#  stmt: c__I_lsm.30_8 = *c_21(D)[_20];
#  ref: *c_21(D)[_20];
#  base_object: *c_21(D);
#  Access function 0: {{(stride.12_14 + offset.13_15) + 1, +, stride.12_14}_1,
+, 1}_2
#)

#(Data Ref: 
#  bb: 6 
#  stmt: _29 = *a_28(D)[_27];
#  ref: *a_28(D)[_27];
#  base_object: *a_28(D);
#  Access function 0: {{(stride.12_14 + offset.13_15) + 1, +, 1}_2, +,
stride.12_14}_3
#)

Why have these arrays been linearized?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (5 preceding siblings ...)
  2013-07-21  9:59 ` spop at gcc dot gnu.org
@ 2013-07-21 10:14 ` spop at gcc dot gnu.org
  2013-07-21 10:34 ` spop at gcc dot gnu.org
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: spop at gcc dot gnu.org @ 2013-07-21 10:14 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #23 from Sebastian Pop <spop at gcc dot gnu.org> ---
If possible, we need to maintain the subscripted version of arrays:
C(I,J)
A(I,K)
B(K,J)

Without a representation of multi dimensional arrays, we would need to
delinearize the arrays prior to graphite.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (6 preceding siblings ...)
  2013-07-21 10:14 ` spop at gcc dot gnu.org
@ 2013-07-21 10:34 ` spop at gcc dot gnu.org
  2013-07-21 22:01 ` spop at gcc dot gnu.org
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: spop at gcc dot gnu.org @ 2013-07-21 10:34 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #24 from Sebastian Pop <spop at gcc dot gnu.org> ---
Looking at t.f90.003t.original (the first dump file of -fdump-tree-all-all)
I see that the array c is already in linear form:

(*cD.1876)[((integer(kind=8)D.9) jD.1897 * stride.12D.1892 + offset.13D.1893) +
(integer(kind=8)D.9) iD.1896] = (*cD.1876)[((integer(kind=8)D.9) jD.1897 *
stride.12D.1892 + offset.13D.1893) + (integer(kind=8)D.9) iD.1896] +
(*aD.1874)[((integer(kind=8)D.9) kD.1898 * stride.2D.1880 + offset.3D.1881) +
(integer(kind=8)D.9) iD.1896] * (*bD.1875)[((integer(kind=8)D.9) jD.1897 *
stride.7D.1886 + offset.8D.1887) + (integer(kind=8)D.9) kD.1898];

Can the Fortran front-end build multi-dimensional arrays instead of this linear
form?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (7 preceding siblings ...)
  2013-07-21 10:34 ` spop at gcc dot gnu.org
@ 2013-07-21 22:01 ` spop at gcc dot gnu.org
  2013-07-27 15:19 ` dushistov at mail dot ru
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: spop at gcc dot gnu.org @ 2013-07-21 22:01 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #25 from Sebastian Pop <spop at gcc dot gnu.org> ---
I think the linearization of array subscripts problem is linked to passing
arguments to a function in Fortran: by inlining the mult function call in the
main program, the main loop on C(I,J)=C(I,J)+A(I,K)*B(K,J) is blocked as well

$ gfortran -ffast-math -O3 -floop-nest-optimize tt.f90 -fdump-tree-graphite-all

$ cat tt.f90
INTEGER, PARAMETER :: N=1024
REAL*8 :: A(N,N), B(N,N), C(N,N)
REAL*8 :: t1,t2
INTEGER :: I,J,K
A=0.1D0
B=0.1D0
C=0.0D0
CALL cpu_time(t1)
DO J=1,N
DO I=1,N
DO K=1,N
  C(I,J)=C(I,J)+A(I,K)*B(K,J)
ENDDO
ENDDO
ENDDO
CALL cpu_time(t2)
write(6,*) t2-t1,C(1,1)
END


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (8 preceding siblings ...)
  2013-07-21 22:01 ` spop at gcc dot gnu.org
@ 2013-07-27 15:19 ` dushistov at mail dot ru
  2013-07-27 15:22 ` dushistov at mail dot ru
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: dushistov at mail dot ru @ 2013-07-27 15:19 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #26 from Evgeniy Dushistov <dushistov at mail dot ru> ---
I try such simple C++ function, compiled in separate object file(-march=native
-Ofast): 

void mult(const double * const __restrict__ A, const double * const
__restrict__ B, double * const __restrict__ C, const size_t N)
{
    for (size_t j = 0; j < N; ++j)
        for (size_t i = 0; i < N; ++i)
            for (size_t k = 0; k < N; ++k)
                C[i * N + j] += A[i * N + k] + B[k * N + j];
}

$ time ./test_gcc 
204.800000

real    0m9.628s
user    0m9.620s
sys     0m0.000s

$ time ./test_icc 
204.800000

real    0m0.637s
user    0m0.630s
sys     0m0.000s


Difference 15.2 times

Looks like the difference here:
GCC:
Analyzing loop at mult.cpp:5
Analyzing loop at mult.cpp:6
Analyzing loop at mult.cpp:7

mult.cpp:3: note: vectorized 0 loops in function.

ICC:
mult.cpp(5): (col. 2) remark: PERMUTED LOOP WAS VECTORIZED.
mult.cpp(5): (col. 2) remark: PERMUTED LOOP WAS VECTORIZED.
mult.cpp(5): (col. 2) remark: PERMUTED LOOP WAS VECTORIZED.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (9 preceding siblings ...)
  2013-07-27 15:19 ` dushistov at mail dot ru
@ 2013-07-27 15:22 ` dushistov at mail dot ru
  2013-07-29 22:16 ` spop at gcc dot gnu.org
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: dushistov at mail dot ru @ 2013-07-27 15:22 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #27 from Evgeniy Dushistov <dushistov at mail dot ru> ---
Created attachment 30563
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30563&action=edit
icc -c -Ofast -march=native objdump


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (10 preceding siblings ...)
  2013-07-27 15:22 ` dushistov at mail dot ru
@ 2013-07-29 22:16 ` spop at gcc dot gnu.org
  2014-03-13 15:03 ` spop at gcc dot gnu.org
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: spop at gcc dot gnu.org @ 2013-07-29 22:16 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #28 from Sebastian Pop <spop at gcc dot gnu.org> ---
(In reply to Evgeniy Dushistov from comment #26)
> void mult(const double * const __restrict__ A, const double * const
> __restrict__ B, double * const __restrict__ C, const size_t N)
> {
> 	for (size_t j = 0; j < N; ++j)
> 		for (size_t i = 0; i < N; ++i)
> 			for (size_t k = 0; k < N; ++k)
> 				C[i * N + j] += A[i * N + k] + B[k * N + j];

This code has the same problem as the Fortran program: it has linearized memory
access functions.  This code won't be transformed by Graphite unless you
instantiate N with a compile time constant, or otherwise you'll have to
delinearize the array accesses, and we don't have code to do that yet.
>From gcc-bugs-return-426940-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org Mon Jul 29 22:34:33 2013
Return-Path: <gcc-bugs-return-426940-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Delivered-To: listarch-gcc-bugs@gcc.gnu.org
Received: (qmail 6101 invoked by alias); 29 Jul 2013 22:34:32 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Delivered-To: mailing list gcc-bugs@gcc.gnu.org
Received: (qmail 6038 invoked by uid 48); 29 Jul 2013 22:34:29 -0000
From: "abensonca at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug fortran/58023] New: ICE on invalid with missing NOPASS
Date: Mon, 29 Jul 2013 22:34:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: fortran
X-Bugzilla-Version: 4.9.0
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: abensonca at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status bug_severity priority component assigned_to reporter
Message-ID: <bug-58023-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2013-07/txt/msg01447.txt.bz2
Content-length: 2931

http://gcc.gnu.org/bugzilla/show_bug.cgi?idX023

            Bug ID: 58023
           Summary: ICE on invalid with missing NOPASS
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: abensonca at gmail dot com

The following causes and ICE using gfortran 4.9.0 (r201320):

module m
  implicit none

  abstract interface
     double precision function mr()
     end function mr
  end interface

  type :: sfd
     procedure(mr), pointer :: mr1
     procedure(mr), pointer :: mr2
  end type sfd

contains

  subroutine go()
    implicit none
    type(sfd):: d

    write (0,*) d%mr2()
    return
  end subroutine go

end module m



$ gfortran -v
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/home/abenson/Galacticus/Tools/libexec/gcc/x86_64-unknown-linux-gnu/4.9.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-trunk/configure --prefix=/home/abenson/Galacticus/Tools
--enable-languages=c,c++,fortran --disable-multilib
--with-gmp=/home/abenson/Galacticus/Tools
Thread model: posix
gcc version 4.9.0 20130729 (experimental) (GCC)

$ gfortran -c bug.F90 -o bug.o
bug.F90:10.34:

     procedure(mr), pointer :: mr1
                                  1
Error: Procedure pointer component 'mr1' with PASS at (1) must have at least
one argument
f951: internal compiler error: in update_ppc_arglist, at fortran/resolve.c:5346
0x58332a update_ppc_arglist
        ../../gcc-trunk/gcc/fortran/resolve.c:5346
0x57f63c resolve_expr_ppc
        ../../gcc-trunk/gcc/fortran/resolve.c:6020
0x57f63c gfc_resolve_expr(gfc_expr*)
        ../../gcc-trunk/gcc/fortran/resolve.c:6119
0x5860bb resolve_code
        ../../gcc-trunk/gcc/fortran/resolve.c:9692
0x585e0b gfc_resolve_blocks(gfc_code*, gfc_namespace*)
        ../../gcc-trunk/gcc/fortran/resolve.c:9006
0x586099 resolve_code
        ../../gcc-trunk/gcc/fortran/resolve.c:9682
0x588c8e resolve_codes
        ../../gcc-trunk/gcc/fortran/resolve.c:14471
0x588b97 resolve_codes
        ../../gcc-trunk/gcc/fortran/resolve.c:14457
0x588d72 gfc_resolve
        ../../gcc-trunk/gcc/fortran/resolve.c:14499
0x5755bf gfc_parse_file()
        ../../gcc-trunk/gcc/fortran/parse.c:4645
0x5b1545 gfc_be_parse_file
        ../../gcc-trunk/gcc/fortran/f95-lang.c:189
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.

The code is invalid because both procedure pointers in the "sfd" derived type
should have the NOPASS attribute (since the "mr" abstract interface declares no
arguments).

The ICE only occurs if two (or more) procedure points are declared in this way,
AND the procedure pointer is accessed (via the write statement in this
example).


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (11 preceding siblings ...)
  2013-07-29 22:16 ` spop at gcc dot gnu.org
@ 2014-03-13 15:03 ` spop at gcc dot gnu.org
  2014-04-29 12:29 ` dominiq at lps dot ens.fr
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: spop at gcc dot gnu.org @ 2014-03-13 15:03 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #29 from Sebastian Pop <spop at gcc dot gnu.org> ---
(In reply to Sebastian Pop from comment #28)
> delinearize the array accesses, and we don't have code to do that yet.

There is code to delinearize array accesses in LLVM now: it works on top of
SCEV, so it should be portable to GCC.  If somebody is interested to contribute
the delinearization to GCC, I can help. The LLVM code is in
llvm/lib/Analysis/ScalarEvolution.cpp SCEVAddRecExpr::delinearize()


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (12 preceding siblings ...)
  2014-03-13 15:03 ` spop at gcc dot gnu.org
@ 2014-04-29 12:29 ` dominiq at lps dot ens.fr
  2015-05-16  8:48 ` tkoenig at gcc dot gnu.org
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: dominiq at lps dot ens.fr @ 2014-04-29 12:29 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #30 from Dominique d'Humieres <dominiq at lps dot ens.fr> ---
Since at least gcc 4.8.2, -floop-interchange is able to apply it to the test in
comment 25 (see also pr60997):

[Book15] f90/bug% gfortran-fsf-4.8 -Ofast pr14741.f90 -floop-interchange
[Book15] f90/bug% time a.out
  0.48877399999999999        10.239999999999826     
0.492u 0.007s 0:00.50 98.0%    0+0k 0+0io 0pf+0w
[Book15] f90/bug% gfortran-fsf-4.8 -Ofast pr14741.f90 
[Book15] f90/bug% time a.out
   3.9007499999999999        10.239999999999826     
3.897u 0.015s 0:03.91 99.7%    0+0k 0+0io 1pf+0w

However compiling it with -floop-nest-optimize --param loop-block-tile-size=xx,
gives the following error (see comment 18)

f951: warning: iteration 31 invokes undefined behavior
[-Waggressive-loop-optimizations]
f951: note: containing loop
f951: warning: iteration 31 invokes undefined behavior
[-Waggressive-loop-optimizations]
f951: note: containing loop
pr14741.f90:6:0: warning: iteration 31 invokes undefined behavior
[-Waggressive-loop-optimizations]
 B=0.1D0
 ^
f951: note: containing loop
pr14741.f90:5:0: warning: iteration 31 invokes undefined behavior
[-Waggressive-loop-optimizations]
 A=0.1D0
 ^
f951: note: containing loop

for any values of xx I have tested. If I add
-fno-aggressive-loop-optimizations, the code segfaults (Access to an undefined
portion of a memory object) at run time.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (13 preceding siblings ...)
  2014-04-29 12:29 ` dominiq at lps dot ens.fr
@ 2015-05-16  8:48 ` tkoenig at gcc dot gnu.org
  2015-05-18  6:28 ` Joost.VandeVondele at mat dot ethz.ch
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2015-05-16  8:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #31 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
If the middle end is not up to this, should we be looking at doing loop
blocking in the Fortran front end, at least for the Matmul intrinsic?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (14 preceding siblings ...)
  2015-05-16  8:48 ` tkoenig at gcc dot gnu.org
@ 2015-05-18  6:28 ` Joost.VandeVondele at mat dot ethz.ch
  2015-09-11  9:26 ` manfred99 at gmx dot ch
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: Joost.VandeVondele at mat dot ethz.ch @ 2015-05-18  6:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #32 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> ---
(In reply to Thomas Koenig from comment #31)
> If the middle end is not up to this, should we be looking at doing loop
> blocking in the Fortran front end, at least for the Matmul intrinsic?

I think this makes sense, fixing this issue in the middle end seems to be a
project on a different timescale. Ideally, matmul expands to something that
generates good code even at e.g. -O2 -march=native (which would require both
blocking and unrolling). At that point, the inlined code would be faster than
the runtime library...for all sizes.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (15 preceding siblings ...)
  2015-05-18  6:28 ` Joost.VandeVondele at mat dot ethz.ch
@ 2015-09-11  9:26 ` manfred99 at gmx dot ch
  2015-09-11 19:27 ` spop at gcc dot gnu.org
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: manfred99 at gmx dot ch @ 2015-09-11  9:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

Manfred Schwarb <manfred99 at gmx dot ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |manfred99 at gmx dot ch

--- Comment #33 from Manfred Schwarb <manfred99 at gmx dot ch> ---
There is some slight improvement since 2015-09-08:

before:
#> a
   17.848000000000003        10.239999999999871
#> grep "number of SCoPs" a.f90.124t.graphite                                  
      number of SCoPs: 0
number of SCoPs: 2

after:
#> a
   15.847999999999999        10.239999999999871
#> grep "number of SCoPs" a.f90.121t.graphite                                  
      number of SCoPs: 0
number of SCoPs: 1

which also gives a testsuite warning on my box:
XPASS: gfortran.dg/graphite/pr14741.f90   -O   scan-tree-dump-times graphite
"number of SCoPs: 1" 1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (16 preceding siblings ...)
  2015-09-11  9:26 ` manfred99 at gmx dot ch
@ 2015-09-11 19:27 ` spop at gcc dot gnu.org
  2015-09-11 21:55 ` dominiq at lps dot ens.fr
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 21+ messages in thread
From: spop at gcc dot gnu.org @ 2015-09-11 19:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #34 from Sebastian Pop <spop at gcc dot gnu.org> ---
r227567 extends the limits of a scop, and now we can detect a scop in the
MAIN__ function corresponding to the following code:

  A=0.1D0
  B=0.1D0

-fdump-tree-graphite-all shows that the loops have been tiled:

tiled by 51
tiled by 51

ISL AST generated by ISL: 
{
  for (int c1 = 0; c1 <= 1023; c1 += 51)
    for (int c2 = 0; c2 <= 1023; c2 += 51)
      for (int c3 = c1; c3 <= min(1023, c1 + 50); c3 += 1)
        for (int c4 = c2; c4 <= min(1023, c2 + 50); c4 += 1)
          S_4(c3, c4);
  for (int c1 = 0; c1 <= 1023; c1 += 51)
    for (int c2 = 0; c2 <= 1023; c2 += 51)
      for (int c3 = c1; c3 <= min(1023, c1 + 50); c3 += 1)
        for (int c4 = c2; c4 <= min(1023, c2 + 50); c4 += 1)
          S_10(c3, c4);
}

What makes me wondering is why for memset kind of loops when tiling gets us a
better performance as reported:

before:
   17.848000000000003
after:
   15.847999999999999

Btw, what architecture have you used for this experiment?

The same happens on an AArch64 machine where I was able to reproduce your
results: the loop blocked initialization of arrays is consistently faster by
about 10%.

I noted that on a recent Intel x86_64 machine the first runs show some 10%
speedup with loop blocking and then the speedup disappears in subsequent runs
(I was alternating runs with and without loop block 10 times).


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (17 preceding siblings ...)
  2015-09-11 19:27 ` spop at gcc dot gnu.org
@ 2015-09-11 21:55 ` dominiq at lps dot ens.fr
  2015-09-12 12:06 ` dominiq at lps dot ens.fr
  2015-09-12 12:19 ` spop at gcc dot gnu.org
  20 siblings, 0 replies; 21+ messages in thread
From: dominiq at lps dot ens.fr @ 2015-09-11 21:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #35 from Dominique d'Humieres <dominiq at lps dot ens.fr> ---
I get

[Book15] f90/bug% /opt/gcc/gcc6p-227264p1/bin/gfortran -Ofast pr14741.f90
-floop-interchange -march=native -Wa,-q                                    
[Book15] f90/bug% time a.out
  0.48728300000000002        10.239999999999826     
0.491u 0.006s 0:00.50 98.0%     0+0k 0+0io 0pf+0w
[Book15] f90/bug% /opt/gcc/gcc6p-227383p1/bin/gfortran -Ofast pr14741.f90
-floop-interchange -march=native -Wa,-q                                    
[Book15] f90/bug% time a.out
   1.4271590000000001        10.239999999999826     
1.430u 0.008s 0:01.44 99.3%     0+0k 0+0io 0pf+0w

i.e., r227264 (or 4.8, 4.9, and 5.2) is ~3 time faster than r227383.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (18 preceding siblings ...)
  2015-09-11 21:55 ` dominiq at lps dot ens.fr
@ 2015-09-12 12:06 ` dominiq at lps dot ens.fr
  2015-09-12 12:19 ` spop at gcc dot gnu.org
  20 siblings, 0 replies; 21+ messages in thread
From: dominiq at lps dot ens.fr @ 2015-09-12 12:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #36 from Dominique d'Humieres <dominiq at lps dot ens.fr> ---
> i.e., r227264 (or 4.8, 4.9, and 5.2) is ~3 time faster than r227383.

It is due to r227277 (with r227265 reverted in order to bootstrap Ada).


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop
       [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
                   ` (19 preceding siblings ...)
  2015-09-12 12:06 ` dominiq at lps dot ens.fr
@ 2015-09-12 12:19 ` spop at gcc dot gnu.org
  20 siblings, 0 replies; 21+ messages in thread
From: spop at gcc dot gnu.org @ 2015-09-12 12:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

--- Comment #37 from Sebastian Pop <spop at gcc dot gnu.org> ---
I think gcc should stop blocking initialization loops: it should only block a
loop when it contains more than two arrays, one array accessing non-contiguous
elements in memory, and another array accessing contiguous elements in memory.


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-09-12 12:19 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-14741-4@http.gcc.gnu.org/bugzilla/>
2013-03-29 11:03 ` [Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop Joost.VandeVondele at mat dot ethz.ch
2013-07-14  7:11 ` spop at gcc dot gnu.org
2013-07-16  2:25 ` spop at gcc dot gnu.org
2013-07-16  2:29 ` spop at gcc dot gnu.org
2013-07-21  9:15 ` spop at gcc dot gnu.org
2013-07-21  9:59 ` spop at gcc dot gnu.org
2013-07-21 10:14 ` spop at gcc dot gnu.org
2013-07-21 10:34 ` spop at gcc dot gnu.org
2013-07-21 22:01 ` spop at gcc dot gnu.org
2013-07-27 15:19 ` dushistov at mail dot ru
2013-07-27 15:22 ` dushistov at mail dot ru
2013-07-29 22:16 ` spop at gcc dot gnu.org
2014-03-13 15:03 ` spop at gcc dot gnu.org
2014-04-29 12:29 ` dominiq at lps dot ens.fr
2015-05-16  8:48 ` tkoenig at gcc dot gnu.org
2015-05-18  6:28 ` Joost.VandeVondele at mat dot ethz.ch
2015-09-11  9:26 ` manfred99 at gmx dot ch
2015-09-11 19:27 ` spop at gcc dot gnu.org
2015-09-11 21:55 ` dominiq at lps dot ens.fr
2015-09-12 12:06 ` dominiq at lps dot ens.fr
2015-09-12 12:19 ` spop at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).