Re: [patch, fortran, RFC] First steps towards inlining matmul

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: [patch, fortran, RFC] First steps towards inlining matmul
@ 2015-04-05 13:55 Dominique Dhumieres
  2015-04-05 14:33 ` Thomas Koenig
  0 siblings, 1 reply; 16+ messages in thread
From: Dominique Dhumieres @ 2015-04-05 13:55 UTC (permalink / raw)
  To: tkoenig; +Cc: gcc-patches, fortran, jvdelisle

> > So, what do you think about this?
>
> I am curious about what performance gain results from this?
> I can see saving a library call to our runtime libraries.
> Do you have some timing results?
>
> Jerry

IMO the inlining of MATMUL should be restricted to small matrices (less than 4x4, 9x9
or 16x16 depending of your field!-). I have a variant of the polyhedron test induct.f90
in which I have replaced three dot products by the equivalent matmul and which executes
ten times slower than the original test.

Dominique

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch, fortran, RFC] First steps towards inlining matmul
  2015-04-05 13:55 [patch, fortran, RFC] First steps towards inlining matmul Dominique Dhumieres
@ 2015-04-05 14:33 ` Thomas Koenig
  2015-04-05 18:25   ` Dominique d'Humières
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Koenig @ 2015-04-05 14:33 UTC (permalink / raw)
  To: Dominique Dhumieres; +Cc: gcc-patches, fortran, jvdelisle

Hi Dominique,

> IMO the inlining of MATMUL should be restricted to small matrices (less than 4x4, 9x9
> or 16x16 depending of your field!-)

The problem with the library function we have is that it is quite
general; it can deal with all the complexity of assumed-shape array
arguments.  Inlining allows the compiler to take advantage of
contiguous memory, compile-time knowledge of array sizes, etc.
to allow for vectorization or even complete unrolling.

Of course, if you use c=matmul(a,b) where all three are assumed-shape
arrays, your advantage over the library function will be minimal at
best.

It will be interesting to see at what size calling BLAS directly
will be better.

	Thomas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch, fortran, RFC] First steps towards inlining matmul
  2015-04-05 14:33 ` Thomas Koenig
@ 2015-04-05 18:25   ` Dominique d'Humières
  2015-04-05 20:37     ` Thomas Koenig
  0 siblings, 1 reply; 16+ messages in thread
From: Dominique d'Humières @ 2015-04-05 18:25 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: gcc-patches, fortran, jvdelisle

I have done some timings

(1) with the test given below, before the patch I get (last column in Gflops)

[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:    373.708008       373.69497100000001        4.2815668504139435     
 Time, MATMUL:    172.086609       23.117919000000001        69.210381782201068     
545.374u 0.537s 6:36.93 137.5%	0+0k 0+0io 0pf+0w
[Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90  -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:    176.327881       23.855111999999998        67.071577781735002     
 Time, MATMUL:    182.086746       24.453551000000001        65.430170039516952     
358.059u 0.471s 0:48.43 740.2%	0+0k 0+0io 0pf+0w

after the patch

[Book15] f90/bug% time a.out
 Time, MATMUL:    392.415436       392.41728799999999        4.0772923337669056     
 Time, MATMUL:    171.690399       22.905118000000002        69.853383859450091     
563.671u 0.551s 6:55.44 135.8%	0+0k 0+0io 0pf+0w
[Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:    392.850342       392.84190799999999        4.0728852177349673     
 Time, MATMUL:    174.534821       23.784797000000001        67.269861500184334     
566.824u 0.674s 6:56.74 136.1%	0+0k 0+0io 0pf+0w

which means that -fexternal-blas should disable the inlining.

program t2 
implicit none 
REAL time_begin, time_end 
integer, parameter :: n=2000; 
integer(8) :: ts, te, rate8, cmax8
real(8) :: elapsed
REAL(8) :: a(n,n), b(n,n), c(n,n) 
integer, parameter :: m = 100 
integer :: i 
call RANDOM_NUMBER(a) 
call RANDOM_NUMBER(b) 
call cpu_time(time_begin) 
call SYSTEM_CLOCK (ts, rate8, cmax8)
do i = 1,m 
    a(1,1) = a(1,1) + 0.1 
    c = MATMUL(a,b) 
enddo 
call SYSTEM_CLOCK (te, rate8, cmax8)
call cpu_time(time_end) 
elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
PRINT *, 'Time, MATMUL: ',time_end-time_begin, elapsed , 2*m*real(n, kind=8)**3/(10**9*elapsed)
call cpu_time(time_begin) 
call SYSTEM_CLOCK (ts, rate8, cmax8)
do i = 1,m 
    a(1,1) = a(1,1) + 0.1 
    call dgemm('n','n',n, n, n, dble(1.0), a, n, b, n, dble(0.0), c, n) 
enddo 
call SYSTEM_CLOCK (te, rate8, cmax8)
call cpu_time(time_end) 
elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
PRINT *, 'Time, MATMUL: ',time_end-time_begin, elapsed , 2*m*real(n, kind=8)**3/(10**9*elapsed)
end program 
! http://groups.google.com/group/comp.lang.fortran/browse_thread/thread/1cba8e6ce5080197

(2) The polyhedron test compiled with -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto runs in ~6.5s. After applying the patch below, it runs in ~66s with/without the patch.

Dominique

--- induct.f90	2005-10-11 22:53:32.000000000 +0200
+++ induct_vmc.f90	2015-04-05 19:06:30.000000000 +0200
@@ -1644,18 +1644,17 @@ contains
           coil_tmp_vector(1) = -sin(theta)
           coil_tmp_vector(2) = cos(theta)
           coil_tmp_vector(3) = 0.0_longreal
-          coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:))
-          coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:))
-          coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:))
+          coil_current_vec = matmul(rotate_coil,coil_tmp_vector)
 !
           do j = 1, 9
               c_vector(3) = 0.5 * h_coil * z1gauss(j)
 !
 !       rotate coil vector into the global coordinate system and translate it
 !
-              rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
-              rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
-              rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
+              rot_c_vector = matmul(rotate_coil,c_vector)
+              rot_c_vector(1) = rot_c_vector(1) + dx
+              rot_c_vector(2) = rot_c_vector(2) + dy
+              rot_c_vector(3) = rot_c_vector(3) + dz
 !
               do k = 1, 9
                   q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
@@ -1664,9 +1663,7 @@ contains
 !
 !       rotate quad vector into the global coordinate system
 !
-                  rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
-                  rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
-                  rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
+                  rot_q_vector = matmul(rotate_quad,q_vector)
 !
 !       compute and add in quadrature term
 !
@@ -1756,18 +1753,17 @@ contains
           coil_tmp_vector(1) = -sin(theta)
           coil_tmp_vector(2) = cos(theta)
           coil_tmp_vector(3) = 0.0_longreal
-          coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:))
-          coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:))
-          coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:))
+          coil_current_vec = matmul(rotate_coil,coil_tmp_vector)
 !
           do j = 1, 9
               c_vector(3) = 0.5 * h_coil * z1gauss(j)
 !
 !       rotate coil vector into the global coordinate system and translate it
 !
-              rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
-              rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
-              rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
+              rot_c_vector = matmul(rotate_coil,c_vector)
+              rot_c_vector(1) = rot_c_vector(1) + dx
+              rot_c_vector(2) = rot_c_vector(2) + dy
+              rot_c_vector(3) = rot_c_vector(3) + dz
 !
               do k = 1, 9
                   q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
@@ -1776,9 +1772,7 @@ contains
 !
 !       rotate quad vector into the global coordinate system
 !
-                  rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
-                  rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
-                  rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
+                  rot_q_vector = matmul(rotate_quad,q_vector)
 !
 !       compute and add in quadrature term
 !
@@ -2061,18 +2055,17 @@ contains
 !
 !       compute current vector for the coil in the global coordinate system
 !
-          coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:))
-          coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:))
-          coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:))
+          coil_current_vec = matmul(rotate_coil,coil_tmp_vector)
 !
           do j = 1, 9
               c_vector(3) = 0.5 * h_coil * z1gauss(j)
 !
 !       rotate coil vector into the global coordinate system and translate it
 !
-              rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
-              rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
-              rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
+              rot_c_vector = matmul(rotate_coil,c_vector)
+              rot_c_vector(1) = rot_c_vector(1) + dx
+              rot_c_vector(2) = rot_c_vector(2) + dy
+              rot_c_vector(3) = rot_c_vector(3) + dz
 !
               do k = 1, 9
                   q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
@@ -2081,9 +2074,7 @@ contains
 !
 !       rotate quad vector into the global coordinate system
 !
-                  rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
-                  rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
-                  rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
+                  rot_q_vector = matmul(rotate_quad,q_vector)
 !
 !       compute and add in quadrature term
 !
@@ -2204,18 +2195,17 @@ contains
 !
 !       compute current vector for the coil in the global coordinate system
 !
-          coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:))
-          coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:))
-          coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:))
+          coil_current_vec = matmul(rotate_coil,coil_tmp_vector)
 !
           do j = 1, 9
               c_vector(3) = 0.5 * h_coil * z1gauss(j)
 !
 !       rotate coil vector into the global coordinate system and translate it
 !
-              rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
-              rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
-              rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
+              rot_c_vector = matmul(rotate_coil,c_vector)
+              rot_c_vector(1) = rot_c_vector(1) + dx
+              rot_c_vector(2) = rot_c_vector(2) + dy
+              rot_c_vector(3) = rot_c_vector(3) + dz
 !
               do k = 1, 9
                   q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
@@ -2224,9 +2214,7 @@ contains
 !
 !       rotate quad vector into the global coordinate system
 !
-                  rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
-                  rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
-                  rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
+                  rot_q_vector = matmul(rotate_quad,q_vector)
 !
 !       compute and add in quadrature term
 !

> Le 5 avr. 2015 à 16:33, Thomas Koenig <tkoenig@netcologne.de> a écrit :
> 
> Hi Dominique,
> 
>> IMO the inlining of MATMUL should be restricted to small matrices (less than 4x4, 9x9
>> or 16x16 depending of your field!-)
> 
> The problem with the library function we have is that it is quite
> general; it can deal with all the complexity of assumed-shape array
> arguments.  Inlining allows the compiler to take advantage of
> contiguous memory, compile-time knowledge of array sizes, etc.
> to allow for vectorization or even complete unrolling.
> 
> Of course, if you use c=matmul(a,b) where all three are assumed-shape
> arrays, your advantage over the library function will be minimal at
> best.
> 
> It will be interesting to see at what size calling BLAS directly
> will be better.
> 
> 	Thomas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch, fortran, RFC] First steps towards inlining matmul
  2015-04-05 18:25   ` Dominique d'Humières
@ 2015-04-05 20:37     ` Thomas Koenig
  2015-04-05 23:15       ` Dominique d'Humières
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Koenig @ 2015-04-05 20:37 UTC (permalink / raw)
  To: Dominique d'Humières; +Cc: gcc-patches, fortran, jvdelisle

Hi Dominique,

> which means that -fexternal-blas should disable the inlining.

It is not surprising that a higly tuned BLAS library is better than
a simple inlining for large matrices.

I did some tests by adjusting n; it seems the inline version is
faster for n<=22, which is not too bad.

Regarding your other test case:  This tests matrix*vector
multiplication, which is not implemented yet :-)

Regards,

	Thomas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch, fortran, RFC] First steps towards inlining matmul
  2015-04-05 20:37     ` Thomas Koenig
@ 2015-04-05 23:15       ` Dominique d'Humières
  2015-04-06 12:18         ` Dominique d'Humières
  0 siblings, 1 reply; 16+ messages in thread
From: Dominique d'Humières @ 2015-04-05 23:15 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: gcc-patches, fortran, jvdelisle

The patch causes the following regressions:

FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=single  -O2  -latomic (internal compiler error)
FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=single  -O2  -latomic (test for excess errors)
FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=lib  -O2  -lcaf_single -latomic (internal compiler error)
FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=lib  -O2  -lcaf_single -latomic (test for excess errors)
FAIL: gfortran.dg/array_function_3.f90   -O  (internal compiler error)
FAIL: gfortran.dg/array_function_3.f90   -O  (test for excess errors)
FAIL: gfortran.dg/bind_c_vars.f90   -g -flto  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -O0  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -O0  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -O1  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -O1  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -O2  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -O2  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -O3 -fomit-frame-pointer  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -O3 -fomit-frame-pointer  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -O3 -fomit-frame-pointer -funroll-loops  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -O3 -fomit-frame-pointer -funroll-loops  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -O3 -g  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -O3 -g  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -Os  (internal compiler error)
FAIL: gfortran.dg/bound_2.f90   -Os  (test for excess errors)
FAIL: gfortran.dg/bound_2.f90   -g -flto  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -O0  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -O1  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -O1  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -O2  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -O2  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -O3 -fomit-frame-pointer  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -O3 -fomit-frame-pointer  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -O3 -fomit-frame-pointer -funroll-loops  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -O3 -fomit-frame-pointer -funroll-loops  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -O3 -g  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -O3 -g  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -Os  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -Os  (test for excess errors)
FAIL: gfortran.dg/bound_7.f90   -g -flto  (internal compiler error)
FAIL: gfortran.dg/bound_7.f90   -g -flto  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -O0  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -O0  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -O1  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -O1  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -O2  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -O2  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -O3 -fomit-frame-pointer  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -O3 -fomit-frame-pointer  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -O3 -fomit-frame-pointer -funroll-loops  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -O3 -fomit-frame-pointer -funroll-loops  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -O3 -g  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -O3 -g  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -Os  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -Os  (test for excess errors)
FAIL: gfortran.dg/bound_8.f90   -g -flto  (internal compiler error)
FAIL: gfortran.dg/bound_8.f90   -g -flto  (test for excess errors)

I think the line

  if (!upper && as->type == AS_ASSUMED_SHAPE && dim)

should be something such as (untested)

  if (!upper && dim && as && as->type == AS_ASSUMED_SHAPE)

Dominique

> Le 5 avr. 2015 à 22:37, Thomas Koenig <tkoenig@netcologne.de> a écrit :
> 
> Hi Dominique,
> 
>> which means that -fexternal-blas should disable the inlining.
> 
> It is not surprising that a higly tuned BLAS library is better than
> a simple inlining for large matrices.
> 
> I did some tests by adjusting n; it seems the inline version is
> faster for n<=22, which is not too bad.
> 
> Regarding your other test case:  This tests matrix*vector
> multiplication, which is not implemented yet :-)
> 
> Regards,
> 
> 	Thomas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch, fortran, RFC] First steps towards inlining matmul
  2015-04-05 23:15       ` Dominique d'Humières
@ 2015-04-06 12:18         ` Dominique d'Humières
  2015-04-09 22:18           ` Thomas Koenig
  0 siblings, 1 reply; 16+ messages in thread
From: Dominique d'Humières @ 2015-04-06 12:18 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: GCC Patches, GNU GFortran, jvdelisle


> Le 6 avr. 2015 à 01:15, Dominique d'Humières <dominiq@lps.ens.fr> a écrit :
> 
> The patch causes the following regressions:
> 
> FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=single  -O2  -latomic (internal compiler error)
> …
> FAIL: gfortran.dg/bound_8.f90   -g -flto  (test for excess errors)
> 
> I think the line
> 
>  if (!upper && as->type == AS_ASSUMED_SHAPE && dim)
> 
> should be something such as (untested)
> 
>  if (!upper && dim && as && as->type == AS_ASSUMED_SHAPE)

This fixes a first batch of ICEs. A second one if fixed by

          if (kind && lower->expr_type == EXPR_CONSTANT)

While the first change is obvious, I am not sure about the second one.

With this changes I am left with the following regressions

FAIL: gfortran.dg/function_optimize_1.f90   -O   scan-tree-dump-times original "matmul_r4" 1
FAIL: gfortran.dg/function_optimize_7.f90   -O   scan-tree-dump-times original "matmul_r4" 1
FAIL: gfortran.dg/function_optimize_2.f90   -O   scan-tree-dump-times original "matmul_r4" 1
FAIL: gfortran.dg/matmul_bounds_2.f90   -O1  output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90   -O2  output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90   -O3 -fomit-frame-pointer  output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90   -O3 -fomit-frame-pointer -funroll-loops  output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90   -O3 -g  output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90   -Os  output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90   -O1  output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90   -O2  output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90   -O3 -fomit-frame-pointer  output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90   -O3 -fomit-frame-pointer -funroll-loops  output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90   -O3 -g  output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90   -Os  output pattern test
FAIL: gfortran.dg/realloc_on_assign_11.f90   -O3 -fomit-frame-pointer  execution test
FAIL: gfortran.dg/realloc_on_assign_11.f90   -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: gfortran.dg/realloc_on_assign_11.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: gfortran.dg/realloc_on_assign_11.f90   -O3 -g  execution test
FAIL: gfortran.dg/realloc_on_assign_11.f90   -Os  execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03   -O1  execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03   -O2  execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03   -O3 -fomit-frame-pointer  execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03   -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03   -O3 -g  execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03   -Os  execution test

The gfortran.dg/matmul_bounds_* are failures due to a change in the run time error (is there a good reason for that?) and the gfortran.dg/realloc_on_assign_* failures are due to segfaults at run time.

Dominique

>> Le 5 avr. 2015 à 22:37, Thomas Koenig <tkoenig@netcologne.de> a écrit :
>> 
>> Hi Dominique,
>> 
>>> which means that -fexternal-blas should disable the inlining.
>> 
>> It is not surprising that a higly tuned BLAS library is better than
>> a simple inlining for large matrices.
>> 
>> I did some tests by adjusting n; it seems the inline version is
>> faster for n<=22, which is not too bad.
>> 
>> Regarding your other test case:  This tests matrix*vector
>> multiplication, which is not implemented yet :-)
>> 
>> Regards,
>> 
>> 	Thomas
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch, fortran, RFC] First steps towards inlining matmul
  2015-04-06 12:18         ` Dominique d'Humières
@ 2015-04-09 22:18           ` Thomas Koenig
  2015-04-10 14:15             ` Dominique d'Humières
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Koenig @ 2015-04-09 22:18 UTC (permalink / raw)
  To: GCC Patches, GNU GFortran; +Cc: Dominique d'Humières

[-- Attachment #1: Type: text/plain, Size: 691 bytes --]

Hello world,

here is an update on the matmul inlining patch.  Now, the
rank one + rank two cases are also handled.  Reallocation
on assignment also works now.

Still missing:

1. Control via an option, BLAS inlining has to take precedence
2. handling of matmul(a,b) occurring in the middle of an expression.
3. Bounds checking from the front end pass (basically, calling
   gfortran_runtime_error).
4. More test cases

What do you think?  1. is straightforward, we don't really need to
do 2. for committing early in stage one.  For 3, we could maybe
just issue a STOP.  4. is required, I need to look at some more
corner cases where bugs may still be lurking.

What do you think?

	Thomas


[-- Attachment #2: matmul-8.diff --]
[-- Type: text/x-patch, Size: 27830 bytes --]

Index: frontend-passes.c
===================================================================
--- frontend-passes.c	(Revision 221944)
+++ frontend-passes.c	(Arbeitskopie)
@@ -43,7 +43,11 @@ static void doloop_warn (gfc_namespace *);
 static void optimize_reduction (gfc_namespace *);
 static int callback_reduction (gfc_expr **, int *, void *);
 static void realloc_strings (gfc_namespace *);
-static gfc_expr *create_var (gfc_expr *);
+static gfc_expr *create_var (gfc_expr *, const char *vname=NULL);
+static int optimize_matmul_assign (gfc_code **, int *, void *);
+static gfc_code * create_do_loop (gfc_expr *, gfc_expr *, gfc_expr *,
+				  locus *, gfc_namespace *, 
+				  char *vname=NULL);
 
 /* How deep we are inside an argument list.  */
 
@@ -93,8 +97,24 @@ struct my_struct *evec;
 
 static bool in_assoc_list;
 
+/* Counter for temporary variables.  */
+
+static int var_num = 1;
+
+/* What sort of matrix we are dealing with when optimizing MATMUL.  */
+
+enum matrix_case { none=0, A2B2, A2B1, A1B2 };
+
+/* Keep track of the number of expressions we have inserted so far 
+   using create_var.  */
+
+int n_vars;
+
+void gfc_debug_expr (gfc_expr *);
+
 /* Entry point - run all passes for a namespace.  */
 
+
 void
 gfc_run_passes (gfc_namespace *ns)
 {
@@ -157,7 +177,7 @@ realloc_string_callback (gfc_code **c, int *walk_s
     return 0;
   
   current_code = c;
-  n = create_var (expr2);
+  n = create_var (expr2, "trim");
   co->expr2 = n;
   return 0;
 }
@@ -524,29 +544,11 @@ constant_string_length (gfc_expr *e)
 
 }
 
-/* Returns a new expression (a variable) to be used in place of the old one,
-   with an assignment statement before the current statement to set
-   the value of the variable. Creates a new BLOCK for the statement if
-   that hasn't already been done and puts the statement, plus the
-   newly created variables, in that block.  Special cases:  If the
-   expression is constant or a temporary which has already
-   been created, just copy it.  */
-
-static gfc_expr*
-create_var (gfc_expr * e)
+static gfc_namespace*
+insert_block ()
 {
-  char name[GFC_MAX_SYMBOL_LEN +1];
-  static int num = 1;
-  gfc_symtree *symtree;
-  gfc_symbol *symbol;
-  gfc_expr *result;
-  gfc_code *n;
   gfc_namespace *ns;
-  int i;
 
-  if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e))
-    return gfc_copy_expr (e);
-
   /* If the block hasn't already been created, do so.  */
   if (inserted_block == NULL)
     {
@@ -578,7 +580,37 @@ constant_string_length (gfc_expr *e)
   else
     ns = inserted_block->ext.block.ns;
 
-  sprintf(name, "__var_%d",num++);
+  return ns;
+}
+
+/* Returns a new expression (a variable) to be used in place of the old one,
+   with an optional assignment statement before the current statement to set
+   the value of the variable. Creates a new BLOCK for the statement if that
+   hasn't already been done and puts the statement, plus the newly created
+   variables, in that block.  Special cases: If the expression is constant or
+   a temporary which has already been created, just copy it.  */
+
+static gfc_expr*
+create_var (gfc_expr * e, const char *vname)
+{
+  char name[GFC_MAX_SYMBOL_LEN +1];
+  gfc_symtree *symtree;
+  gfc_symbol *symbol;
+  gfc_expr *result;
+  gfc_code *n;
+  gfc_namespace *ns;
+  int i;
+
+  if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e))
+    return gfc_copy_expr (e);
+
+  ns = insert_block ();
+
+  if (vname)
+    snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d_%s", var_num++, vname);
+  else
+    snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d", var_num++);
+
   if (gfc_get_sym_tree (name, ns, &symtree, false) != 0)
     gcc_unreachable ();
 
@@ -651,6 +683,7 @@ constant_string_length (gfc_expr *e)
       result->ref->type = REF_ARRAY;
       result->ref->u.ar.type = AR_FULL;
       result->ref->u.ar.where = e->where;
+      result->ref->u.ar.dimen = e->rank;
       result->ref->u.ar.as = symbol->ts.type == BT_CLASS
 			     ? CLASS_DATA (symbol)->as : symbol->as;
       if (warn_array_temporaries)
@@ -658,6 +691,7 @@ constant_string_length (gfc_expr *e)
 		     "Creating array temporary at %L", &(e->where));
     }
 
+
   /* Generate the new assignment.  */
   n = XCNEW (gfc_code);
   n->op = EXEC_ASSIGN;
@@ -666,6 +700,7 @@ constant_string_length (gfc_expr *e)
   n->expr1 = gfc_copy_expr (result);
   n->expr2 = e;
   *changed_statement = n;
+  n_vars ++;
 
   return result;
 }
@@ -724,7 +759,7 @@ cfe_expr_0 (gfc_expr **e, int *walk_subtrees,
 	  if (gfc_dep_compare_functions (*ei, *ej, true) == 0)
 	    {
 	      if (newvar == NULL)
-		newvar = create_var (*ei);
+		newvar = create_var (*ei, "fcn");
 
 	      if (warn_function_elimination)
 		do_warn_function_elimination (*ej);
@@ -931,6 +966,7 @@ convert_elseif (gfc_code **c, int *walk_subtrees A
   /*  Don't walk subtrees.  */
   return 0;
 }
+
 /* Optimize a namespace, including all contained namespaces.  */
 
 static void
@@ -947,6 +983,7 @@ optimize_namespace (gfc_namespace *ns)
   gfc_code_walker (&ns->code, convert_elseif, dummy_expr_callback, NULL);
   gfc_code_walker (&ns->code, cfe_code, cfe_expr_0, NULL);
   gfc_code_walker (&ns->code, optimize_code, optimize_expr, NULL);
+  gfc_code_walker (&ns->code, optimize_matmul_assign, dummy_expr_callback, NULL);
 
   /* BLOCKs are handled in the expression walker below.  */
   for (ns = ns->contained; ns; ns = ns->sibling)
@@ -1222,7 +1259,7 @@ combine_array_constructor (gfc_expr *e)
   if (op2->ts.type == BT_CHARACTER)
     return false;
 
-  scalar = create_var (gfc_copy_expr (op2));
+  scalar = create_var (gfc_copy_expr (op2), "constr");
 
   oldbase = op1->value.constructor;
   newbase = NULL;
@@ -1939,7 +1976,798 @@ doloop_warn (gfc_namespace *ns)
   gfc_code_walker (&ns->code, doloop_code, do_function, NULL);
 }
 
+/* Auxiliary function to build and simplify an array inquiry function.
+   dim is zero-based.  */
 
+static gfc_expr *
+get_array_inq_function (gfc_expr *e, int dim, gfc_isym_id id)
+{
+
+  gfc_expr *fcn;
+  gfc_expr *dim_arg, *kind;
+  const char *name;
+
+  switch (id)
+    {
+    case GFC_ISYM_LBOUND:
+      name = "_gfortran_lbound";
+      break;
+
+    case GFC_ISYM_UBOUND:
+      name = "_gfortran_ubound";
+      break;
+
+    case GFC_ISYM_SIZE:
+      name = "_gfortran_size";
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  dim_arg =  gfc_get_int_expr (gfc_default_integer_kind, &e->where, dim);
+
+  kind = gfc_get_int_expr (gfc_default_integer_kind, &e->where,
+			   gfc_index_integer_kind);
+  
+  fcn = gfc_build_intrinsic_call (current_ns, id, name, e->where, 3,
+				  gfc_copy_expr(e), dim_arg,  kind);
+
+  gfc_simplify_expr (fcn, 0);
+  return fcn;
+}
+
+/* Build a not equals - expression.  */
+
+static gfc_expr*
+build_logical_expr (gfc_expr *e1, gfc_expr *e2, gfc_intrinsic_op op)
+{
+  gfc_typespec ts;
+  gfc_expr *res;
+
+  ts.type = BT_LOGICAL;
+  ts.kind = gfc_default_logical_kind;
+  res = gfc_get_expr ();
+  res->where = e1->where;
+  res->expr_type = EXPR_OP;
+  res->value.op.op = op;
+  res->value.op.op1 = e1;
+  res->value.op.op2 = e2;
+  res->ts = ts;
+
+  return res;
+}
+/* Handle matrix reallocation.  Caller is responsible to insert into
+   the code tree.
+
+   For the two-dimensional case, build 
+
+  if (allocated(c)) then
+     if (size(c,1) /= size(a,1) .or. size(c,2) /= size(b,2)) then
+        deallocate(c)
+        allocate (c(size(a,1), size(b,2)))
+     end if
+  else
+     allocate (c(size(a,1),size(b,2)))
+  end if
+
+*/
+
+static gfc_code *
+matmul_lhs_realloc (gfc_expr *c, gfc_expr *a, gfc_expr *b,
+		    enum matrix_case m_case)
+{
+
+  gfc_expr *allocated, *alloc_expr;
+  gfc_code *if_alloc_1, *if_alloc_2, *if_size_1, *if_size_2;
+  gfc_code *else_alloc;
+  gfc_code *deallocate, *allocate1, *allocate_else;
+  gfc_ref *ref;
+  gfc_array_ref *ar;
+  gfc_expr *cond, *ne1, *ne2;
+
+  alloc_expr = gfc_copy_expr (c);
+
+  ref = alloc_expr->ref;
+
+  while (ref)
+    {
+      if (ref->type == REF_ARRAY && ref->u.ar.type != AR_ELEMENT)
+	break;
+
+      ref = ref->next;
+
+    }
+  ar = &ref->u.ar;
+  gcc_assert (ar && ar->type == AR_FULL);
+
+  /* c comes in as a full ref.  Change it into a copy and make it into an
+     element ref so it has the right for for ALLOCATE.  In the same switch,
+     also generate the size comparison for the secod IF statement.  */
+
+  ar->type = AR_ELEMENT;
+
+  switch (m_case)
+    {
+    case A2B2:
+      ar->start[0] = get_array_inq_function (a, 1, GFC_ISYM_SIZE);
+      ar->start[1] = get_array_inq_function (b, 2, GFC_ISYM_SIZE);
+      ne1 = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE),
+				get_array_inq_function (a, 1, GFC_ISYM_SIZE),
+				INTRINSIC_NE);
+      ne2 = build_logical_expr (get_array_inq_function (c, 2, GFC_ISYM_SIZE),
+				get_array_inq_function (b, 2, GFC_ISYM_SIZE),
+				INTRINSIC_NE);
+      cond = build_logical_expr (ne1, ne2, INTRINSIC_OR);
+      break;
+
+    case A2B1:
+      ar->start[0] = get_array_inq_function (a, 0, GFC_ISYM_SIZE);
+      cond = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE),
+				 get_array_inq_function (a, 1, GFC_ISYM_SIZE),
+				 INTRINSIC_NE);
+      break;
+
+    case A1B2:
+      ar->start[0] = get_array_inq_function (b, 1, GFC_ISYM_SIZE);
+      cond = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE),
+				 get_array_inq_function (b, 2, GFC_ISYM_SIZE),
+				 INTRINSIC_NE);
+      break;
+
+    default:
+      gcc_unreachable();
+
+    }
+
+  gfc_simplify_expr (cond, 0);
+
+  /* We need two identical allocate statements in two
+     branches of the IF statement.  */
+  
+  allocate1 = XCNEW (gfc_code);
+  allocate1->op = EXEC_ALLOCATE;
+  allocate1->ext.alloc.list = gfc_get_alloc ();
+  allocate1->loc = c->where;
+  allocate1->ext.alloc.list->expr = gfc_copy_expr (alloc_expr);
+
+  allocate_else = XCNEW (gfc_code);
+  allocate_else->op = EXEC_ALLOCATE;
+  allocate_else->ext.alloc.list = gfc_get_alloc ();
+  allocate_else->loc = c->where;
+  allocate_else->ext.alloc.list->expr = alloc_expr;
+
+  allocated = gfc_build_intrinsic_call (current_ns, GFC_ISYM_ALLOCATED,
+					"_gfortran_allocated", c->where,
+					1, gfc_copy_expr (c));
+
+  deallocate = XCNEW (gfc_code);
+  deallocate->op = EXEC_DEALLOCATE;
+  deallocate->ext.alloc.list = gfc_get_alloc ();
+  deallocate->ext.alloc.list->expr = gfc_copy_expr (c);
+  deallocate->next = allocate1;
+  deallocate->loc = c->where;
+  
+  if_size_2 = XCNEW (gfc_code);
+  if_size_2->op = EXEC_IF;
+  if_size_2->expr1 = cond;
+  if_size_2->loc = c->where;
+  if_size_2->next = deallocate;
+
+  if_size_1 = XCNEW (gfc_code);
+  if_size_1->op = EXEC_IF;
+  if_size_1->block = if_size_2;
+  if_size_1->loc = c->where;
+
+  else_alloc = XCNEW (gfc_code);
+  else_alloc->op = EXEC_IF;
+  else_alloc->loc = c->where;
+  else_alloc->next = allocate_else;
+
+  if_alloc_2 = XCNEW (gfc_code);
+  if_alloc_2->op = EXEC_IF;
+  if_alloc_2->expr1 = allocated;
+  if_alloc_2->loc = c->where;
+  if_alloc_2->next = if_size_1;
+  if_alloc_2->block = else_alloc;
+
+  if_alloc_1 = XCNEW (gfc_code);
+  if_alloc_1->op = EXEC_IF;
+  if_alloc_1->block = if_alloc_2;
+  if_alloc_1->loc = c->where;
+
+  return if_alloc_1;
+}
+
+
+/* Callback function for has_function_or_op.  */
+
+static int
+is_function_or_op (gfc_expr **e, int *walk_subtrees ATTRIBUTE_UNUSED,
+	     void *data ATTRIBUTE_UNUSED)
+{
+  if ((*e) == 0)
+    return 0;
+  else
+    return (*e)->expr_type == EXPR_FUNCTION
+      || (*e)->expr_type == EXPR_OP;
+}
+
+/* Returns true if the expression contains a function.  */
+
+static bool
+has_function_or_op (gfc_expr **e)
+{
+  if (e == NULL)
+    return false;
+  else
+    return gfc_expr_walker (e, is_function_or_op, NULL);
+}
+
+/* Freeze a single expression.  */
+
+static void
+freeze_expr (gfc_expr **ep)
+{
+  gfc_expr *ne;
+  if (has_function_or_op (ep))
+    {
+      ne = create_var (*ep, "freeze");
+      *ep = ne;
+    }
+}
+
+/* Go through an expression's references and assign them to temporary
+   variables if they contain functions.  This is usually done prior to
+   front-end scalarization to avoid multiple invocations of functions.  */
+
+static void
+freeze_references (gfc_expr *e)
+{
+  gfc_ref *r;
+  gfc_array_ref *ar;
+  int i;
+
+  for (r=e->ref; r; r=r->next)
+    {
+      if (r->type == REF_SUBSTRING)
+	{
+	  if (r->u.ss.start != NULL)
+	    freeze_expr (&r->u.ss.start);
+
+	  if (r->u.ss.end != NULL)
+	    freeze_expr (&r->u.ss.end);
+	}
+      else if (r->type == REF_ARRAY)
+	{
+	  ar = &r->u.ar;
+	  switch (ar->type)
+	    {
+	    case AR_FULL:
+	      break;
+
+	    case AR_SECTION:
+	      for (i=0; i<ar->dimen; i++)
+		{
+		  if (ar->dimen_type[i] == DIMEN_RANGE)
+		    {
+		      freeze_expr (&ar->start[i]);
+		      freeze_expr (&ar->end[i]);
+		      freeze_expr (&ar->stride[i]);
+		    }
+		}
+	    default:
+	      break;
+	    }
+	}
+    }
+}
+
+/* Convert to gfc_index_integer_kind if needed, just do a copy otherwise.  */
+
+static gfc_expr *
+convert_to_index_kind (gfc_expr *e)
+{
+  gfc_expr *res;
+
+  gcc_assert (e != NULL);
+
+  res = gfc_copy_expr (e);
+
+  gcc_assert (e->ts.type == BT_INTEGER);
+
+  if (res->ts.kind != gfc_index_integer_kind)
+    {
+      gfc_typespec ts;
+      gfc_clear_ts (&ts);
+      ts.type = BT_INTEGER;
+      ts.kind = gfc_index_integer_kind;
+
+      gfc_convert_type_warn (e, &ts, 2, 0);
+    }
+
+  return res;
+}
+
+/* Function to create a DO loop including creation of the
+   iteration variable.  gfc_expr are copied.*/
+
+static gfc_code *
+create_do_loop (gfc_expr *start, gfc_expr *end, gfc_expr *step, locus *where,
+		gfc_namespace *ns, char *vname)
+{
+
+  char name[GFC_MAX_SYMBOL_LEN +1];
+  gfc_symtree *symtree;
+  gfc_symbol *symbol;
+  gfc_expr *i;
+  gfc_code *n, *n2;
+
+  /* Create an expression for the iteration variable.  */
+  if (vname)
+    sprintf (name, "__var_%d_do_%s", var_num++, vname);
+  else
+    sprintf (name, "__var_%d_do", var_num++);
+
+
+  if (gfc_get_sym_tree (name, ns, &symtree, false) != 0)
+    gcc_unreachable ();
+
+  symbol = symtree->n.sym;
+  symbol->ts.type = BT_INTEGER;
+  symbol->ts.kind = gfc_index_integer_kind;
+  symbol->attr.flavor = FL_VARIABLE;
+  symbol->attr.referenced = 1;
+  symbol->attr.dimension = 0;
+  symbol->attr.fe_temp = 1;
+  gfc_commit_symbol (symbol);
+
+  i = gfc_get_expr ();
+  i->expr_type = EXPR_VARIABLE;
+  i->ts = symbol->ts;
+  i->rank = 0;
+  i->where = *where;
+  i->symtree = symtree;
+
+  n = XCNEW (gfc_code);
+  n->op = EXEC_DO;
+  n->loc = *where;
+  n->ext.iterator = gfc_get_iterator ();
+  n->ext.iterator->var = i;
+  n->ext.iterator->start = convert_to_index_kind (start);
+  n->ext.iterator->end = convert_to_index_kind (end);
+  if (step)
+    n->ext.iterator->step = convert_to_index_kind (step);
+  else
+    n->ext.iterator->step = gfc_get_int_expr (gfc_index_integer_kind,
+					      where, 1);
+
+  n2 = XCNEW (gfc_code);
+  n2->op = EXEC_DO;
+  n2->loc = *where;
+  n2->next = NULL;
+  n->block = n2;
+  return n;
+}
+
+/* Return an operation of one two gfc_expr (one if e2 is NULL. This assumes
+   compatible typespecs.  */
+
+static gfc_expr *
+get_operand (gfc_intrinsic_op op, gfc_expr *e1, gfc_expr *e2)
+{
+  gfc_expr *res;
+
+  res = gfc_get_expr ();
+  res->ts = e1->ts;
+  res->where = e1->where;
+  res->expr_type = EXPR_OP;
+  res->value.op.op = op;
+  res->value.op.op1 = e1;
+  res->value.op.op2 = e2;
+  gfc_simplify_expr (res, 0);
+  return res;
+}
+
+/* Get the upper bound of the DO loops for matmul along
+   a dimension, i.e. size minus one.  */
+static gfc_expr*
+get_size_m1 (gfc_expr *e, int dimen)
+{
+  mpz_t size;
+  gfc_expr *res;
+
+  if (gfc_array_dimen_size (e, dimen, &size))
+    {
+      res = gfc_get_constant_expr (BT_INTEGER,
+				   gfc_index_integer_kind, &e->where);
+      mpz_sub_ui (res->value.integer, size, 1);
+      mpz_clear (size);
+    }
+  else
+    {
+      res = get_operand (INTRINSIC_MINUS,
+			 get_array_inq_function (e, dimen + 1, GFC_ISYM_SIZE),
+			 gfc_get_int_expr (gfc_index_integer_kind,
+					   &e->where, 1));
+      gfc_simplify_expr (res, 0);
+    }
+
+  return res;
+}
+
+/* Function to return a scalarized expression. It is assumed that indices are
+ zero based to make generation of DO loops easier.  A zero as index will
+ access the first element along a dimension.  Single element references will
+ be skipped.  A NULL as an expression will be replaced by a full reference.
+ This assumes that the index loops have gfc_index_integer_kind, and that all
+ references have been frozen.  */
+
+static gfc_expr*
+scalarized_expr (gfc_expr *e_in, gfc_expr **index, int count_index)
+{
+  gfc_ref *ref;
+  gfc_array_ref *ar;
+  int i;
+  int rank;
+  gfc_expr *e;
+  int i_index;
+  bool was_fullref;
+
+  e = gfc_copy_expr(e_in);
+
+  rank = e->rank;
+
+  ref = e->ref;
+
+  while (ref)
+    {
+      if (ref->type == REF_ARRAY
+	  && (ref->u.ar.type == AR_FULL || ref->u.ar.type == AR_SECTION))
+	break;
+      ref = ref->next;
+    }
+  ar = &ref->u.ar;
+
+  /* We scalarize count_index variables, reducing the rank by count_index.  */
+
+  e->rank = rank - count_index;
+
+  was_fullref = ar->type == AR_FULL;
+
+  if (e->rank == 0)
+    ar->type = AR_ELEMENT;
+  else
+    ar->type = AR_SECTION;
+
+  /* Loop over the indices.  For each index, create the expression
+     index + lbound(e, dim).  */
+  
+  i_index = 0;
+  for (i=0; i < ar->dimen; i++)
+    {
+      if (was_fullref || ar->dimen_type[i] == DIMEN_RANGE)
+	{
+	  if (index[i_index] != NULL)
+	    {
+	      gfc_expr *lbound, *nindex;
+	      gfc_expr *loopvar;
+	      
+	      loopvar = gfc_copy_expr (index[i_index]); 
+	      
+	      if (ar->stride[i])
+		{
+		  gfc_expr *tmp;
+
+		  tmp = gfc_copy_expr(ar->stride[i]);
+		  if (tmp->ts.kind != gfc_index_integer_kind)
+		    {
+		      gfc_typespec ts;
+		      gfc_clear_ts (&ts);
+		      ts.type = BT_INTEGER;
+		      ts.kind = gfc_index_integer_kind;
+		      gfc_convert_type (tmp, &ts, 2);
+		    }
+		  nindex = get_operand (INTRINSIC_TIMES, loopvar, tmp);
+		}
+	      else
+		nindex = loopvar;
+	      
+	      if (ar->start[i])
+		{
+		  lbound = gfc_copy_expr (ar->start[i]);
+		  if (lbound->ts.kind != gfc_index_integer_kind)
+		    {
+		      gfc_typespec ts;
+		      gfc_clear_ts (&ts);
+		      ts.type = BT_INTEGER;
+		      ts.kind = gfc_index_integer_kind;
+		      gfc_convert_type (lbound, &ts, 2);
+
+		    }
+		}
+	      else
+		lbound = get_array_inq_function (e_in, i+1, GFC_ISYM_LBOUND);
+
+	      ar->dimen_type[i] = DIMEN_ELEMENT;
+	      ar->start[i] = get_operand (INTRINSIC_PLUS, nindex, lbound);
+	      ar->end[i] = NULL;
+	      ar->stride[i] = NULL;
+	      gfc_simplify_expr (ar->start[i], 0);
+	    }
+	  else if (was_fullref)
+	    {
+	      ar->dimen_type[i] = DIMEN_RANGE;
+	      ar->start[i] = NULL;
+	      ar->end[i] = NULL;
+	      ar->stride[i] = NULL;
+	    }
+	  i_index ++;
+	}
+    }
+  return e;
+}
+
+
+/* Optimize assignments of the form c = matmul(a,b).
+   Handle only the cases currently where b and c are rank-two arrays.
+
+   This translates the code to
+
+   BLOCK
+     integer i,j,k
+     c = 0
+     do j=0, size(b,2)-1
+       do k=0, size(a, 2)-1
+         do i=0, size(a, 1)-1
+            c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) =
+	    c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) +
+            a(i * stride(a,1) + lbound(a,1), k * stride(a,2) + lbound(a,2)) *
+            b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2))
+         end do
+       end do
+     end do
+   END BLOCK
+   
+*/
+
+static int
+optimize_matmul_assign (gfc_code **c,
+			  int *walk_subtrees ATTRIBUTE_UNUSED,
+			  void *data ATTRIBUTE_UNUSED)
+{
+  gfc_code *co = *c;
+  gfc_expr *expr1, *expr2;
+  gfc_expr *matrix_a, *matrix_b;
+  gfc_actual_arglist *a, *b;
+  gfc_code *do_1, *do_2, *do_3, *assign_zero, *assign_matmul;
+  gfc_expr *zero_e;
+  gfc_expr *u1, *u2, *u3;
+  gfc_expr *list[2];
+  gfc_expr *ascalar, *bscalar, *cscalar;
+  gfc_expr *mult;
+  gfc_expr *var_1, *var_2, *var_3;
+  gfc_expr *zero;
+  gfc_namespace *ns;
+  gfc_intrinsic_op op_times, op_plus;
+  enum matrix_case m_case;
+  gfc_code *next_code;
+  gfc_code *p;
+  int i;
+
+
+  if (co->op != EXEC_ASSIGN)
+    return 0;
+
+  expr1 = co->expr1;
+  expr2 = co->expr2;
+  if (expr2->expr_type != EXPR_FUNCTION
+      || expr2->value.function.isym == NULL
+      || expr2->value.function.isym->id != GFC_ISYM_MATMUL)
+    return 0;
+
+  /* Handling only two matrices of dimension two for the time being.  */
+
+  current_code = c;
+  inserted_block = NULL;
+  changed_statement = NULL;
+
+  a = expr2->value.function.actual;
+  matrix_a = a->expr;
+  b = a->next;
+  matrix_b = b->expr;
+
+  if (matrix_a->expr_type != EXPR_VARIABLE
+      || matrix_b->expr_type != EXPR_VARIABLE)
+    return 0;
+
+  if (matrix_a->rank == 2)
+    m_case = matrix_b->rank == 1 ? A2B1 : A2B2;
+  else
+    m_case = A1B2;
+
+  if (gfc_check_dependency (expr1, matrix_a, true)
+      || gfc_check_dependency (expr1, matrix_b, true))
+    return 0;
+
+  ns = insert_block ();
+
+  switch(expr1->ts.type)
+    {
+    case BT_INTEGER:
+      zero_e = gfc_get_int_expr (expr1->ts.kind, &expr1->where, 0);
+      op_times = INTRINSIC_TIMES;
+      op_plus = INTRINSIC_PLUS;
+      break;
+
+    case BT_LOGICAL:
+      op_times = INTRINSIC_AND;
+      op_plus = INTRINSIC_OR;
+      zero_e = gfc_get_logical_expr (expr1->ts.kind, &expr1->where,
+				     0);
+      break;
+    case BT_REAL:
+      zero_e = gfc_get_constant_expr (BT_REAL, expr1->ts.kind,
+				      &expr1->where);
+      mpfr_set_si (zero_e->value.real, 0, GFC_RND_MODE);
+      op_times = INTRINSIC_TIMES;
+      op_plus = INTRINSIC_PLUS;
+      break;
+
+    case BT_COMPLEX:
+      zero_e = gfc_get_constant_expr (BT_COMPLEX, expr1->ts.kind,
+				      &expr1->where);
+      mpc_set_si_si (zero_e->value.complex, 0, 0, GFC_RND_MODE);
+      op_times = INTRINSIC_TIMES;
+      op_plus = INTRINSIC_PLUS;
+
+      break;
+
+    default:
+      gcc_unreachable();
+    }
+
+  current_code = &ns->code;
+
+  n_vars = 0;
+  freeze_references (matrix_a);
+  freeze_references (matrix_b);
+
+  assign_zero = XCNEW (gfc_code);
+  assign_zero->op = EXEC_ASSIGN;
+  assign_zero->loc = co->loc;
+  assign_zero->expr1 = gfc_copy_expr (expr1);
+  assign_zero->expr2 = zero_e;
+
+  if (flag_realloc_lhs && gfc_is_reallocatable_lhs (expr1))
+    {
+      gfc_code *lhs_alloc;
+
+      lhs_alloc = matmul_lhs_realloc (expr1, matrix_a, matrix_b, m_case);
+      lhs_alloc->next = assign_zero;
+      next_code = lhs_alloc;
+    }
+  else
+    next_code = assign_zero;
+
+  if (n_vars == 0)
+    *current_code = next_code;
+  else
+    {
+      p = ns->code;
+      for (i=0; i<n_vars-1; i++)
+	  p = p->next;
+
+      p->next = next_code;
+    }
+
+  zero = gfc_get_int_expr (gfc_index_integer_kind, &co->loc, 0);
+
+  assign_matmul = XCNEW (gfc_code);
+  assign_matmul->op = EXEC_ASSIGN;
+  assign_matmul->loc = co->loc;
+
+  switch (m_case)
+    {
+    case A2B2:
+      u1 = get_size_m1 (matrix_b, 1);
+      u2 = get_size_m1 (matrix_a, 1);
+      u3 = get_size_m1 (matrix_a, 0);
+
+      do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns);
+      do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns);
+      do_3 = create_do_loop (gfc_copy_expr (zero), u3, NULL, &co->loc, ns);
+
+      do_1->block->next = do_2;
+      do_2->block->next = do_3;
+      do_3->block->next = assign_matmul;
+
+      var_1 = do_1->ext.iterator->var;
+      var_2 = do_2->ext.iterator->var;
+      var_3 = do_3->ext.iterator->var;
+
+      list[0] = var_3;
+      list[1] = var_1;
+      cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 2);
+
+      list[0] = var_3;
+      list[1] = var_2;
+      ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2);
+
+      list[0] = var_2;
+      list[1] = var_1;
+      bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2);
+
+      break;
+
+    case A2B1:
+      u1 = get_size_m1 (matrix_b, 0);
+      u2 = get_size_m1 (matrix_a, 0);
+
+      do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns);
+      do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns);
+
+      do_1->block->next = do_2;
+      do_2->block->next = assign_matmul;
+
+      var_1 = do_1->ext.iterator->var;
+      var_2 = do_2->ext.iterator->var;
+
+      list[0] = var_2;
+      cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 1);
+
+      list[0] = var_2;
+      list[1] = var_1;
+      ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2);
+
+      list[0] = var_1;
+      bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 1);
+
+      break;
+
+    case A1B2:
+      u1 = get_size_m1 (matrix_b, 1);
+      u2 = get_size_m1 (matrix_a, 0);
+
+      do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns);
+      do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns);
+
+      do_1->block->next = do_2;
+      do_2->block->next = assign_matmul;
+
+      var_1 = do_1->ext.iterator->var;
+      var_2 = do_2->ext.iterator->var;
+
+      list[0] = var_1;
+      cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 1);
+
+      list[0] = var_2;
+      ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 1);
+
+      list[0] = var_2;
+      list[1] = var_1;
+      bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2);
+
+      break;
+
+    default:
+      gcc_unreachable();
+    }
+
+  assign_zero->next = do_1;
+
+
+  assign_matmul->expr1 = gfc_copy_expr (cscalar);
+
+  mult = get_operand (op_times, ascalar, bscalar);
+  assign_matmul->expr2 = get_operand (op_plus, cscalar, mult);
+
+  co->next = NULL;
+  gfc_free_statements(co);
+  gfc_free_expr (zero);
+  return 0;
+}
+
 #define WALK_SUBEXPR(NODE) \
   do							\
     {							\
Index: gfortran.h
===================================================================
--- gfortran.h	(Revision 221944)
+++ gfortran.h	(Arbeitskopie)
@@ -3225,4 +3225,8 @@ int gfc_code_walker (gfc_code **, walk_code_fn_t,
 
 void gfc_convert_mpz_to_signed (mpz_t, int);
 
+/* trans-array.c  */
+
+bool gfc_is_reallocatable_lhs (gfc_expr *);
+
 #endif /* GCC_GFORTRAN_H  */
Index: simplify.c
===================================================================
--- simplify.c	(Revision 221944)
+++ simplify.c	(Arbeitskopie)
@@ -3445,6 +3445,31 @@ simplify_bound (gfc_expr *array, gfc_expr *dim, gf
 
  done:
 
+  if (!upper && as && as->type == AS_ASSUMED_SHAPE && dim
+      && dim->expr_type == EXPR_CONSTANT)
+    {
+      if (!(array->symtree && array->symtree->n.sym
+	    && (array->symtree->n.sym->attr.allocatable
+		|| array->symtree->n.sym->attr.pointer)))
+	{
+	  unsigned long int ndim;
+	  gfc_expr *lower, *res;
+
+	  ndim = mpz_get_si (dim->value.integer) - 1;
+	  lower = as->lower[ndim];
+	  if (lower->expr_type == EXPR_CONSTANT)
+	    {
+	      res = gfc_copy_expr (lower);
+	      if (kind)
+		{
+		  int nkind = mpz_get_si (kind->value.integer);
+		  res->ts.kind = nkind;
+		}
+	      return res;
+	    }
+	}
+    }
+
   if (as && (as->type == AS_DEFERRED || as->type == AS_ASSUMED_SHAPE
 	     || as->type == AS_ASSUMED_RANK))
     return NULL;
Index: trans-array.h
===================================================================
--- trans-array.h	(Revision 221944)
+++ trans-array.h	(Arbeitskopie)
@@ -64,8 +64,6 @@ tree gfc_copy_only_alloc_comp (gfc_symbol *, tree,
 
 tree gfc_alloc_allocatable_for_assignment (gfc_loopinfo*, gfc_expr*, gfc_expr*);
 
-bool gfc_is_reallocatable_lhs (gfc_expr *);
-
 /* Add initialization for deferred arrays.  */
 void gfc_trans_deferred_array (gfc_symbol *, gfc_wrapped_block *);
 /* Generate an initializer for a static pointer or allocatable array.  */

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch, fortran, RFC] First steps towards inlining matmul
  2015-04-09 22:18           ` Thomas Koenig
@ 2015-04-10 14:15             ` Dominique d'Humières
  2015-04-10 16:57               ` Dominique d'Humières
  2015-04-11 12:24               ` Thomas Koenig
  0 siblings, 2 replies; 16+ messages in thread
From: Dominique d'Humières @ 2015-04-10 14:15 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: GCC Patches, GNU GFortran


> Le 10 avr. 2015 à 00:18, Thomas Koenig <tkoenig@netcologne.de> a écrit :
> 
> Hello world,
> 
> here is an update on the matmul inlining patch.  Now, the
> rank one + rank two cases are also handled.  

Preliminary tests show that my variant of fatigue.f90 with matmul is as fast (if not faster) as the original test with dot products.

However this causes new failures

FAIL: gfortran.dg/matmul_bounds_4.f90   -O1  execution test
FAIL: gfortran.dg/matmul_bounds_4.f90   -O2  execution test
FAIL: gfortran.dg/matmul_bounds_4.f90   -O3 -fomit-frame-pointer  execution test
FAIL: gfortran.dg/matmul_bounds_4.f90   -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: gfortran.dg/matmul_bounds_4.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: gfortran.dg/matmul_bounds_4.f90   -O3 -g  execution test
FAIL: gfortran.dg/matmul_bounds_4.f90   -Os  execution test
FAIL: gfortran.dg/matmul_bounds_5.f90   -O1  execution test
FAIL: gfortran.dg/matmul_bounds_5.f90   -O2  execution test
FAIL: gfortran.dg/matmul_bounds_5.f90   -O3 -fomit-frame-pointer  execution test
FAIL: gfortran.dg/matmul_bounds_5.f90   -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: gfortran.dg/matmul_bounds_5.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: gfortran.dg/matmul_bounds_5.f90   -O3 -g  execution test
FAIL: gfortran.dg/matmul_bounds_5.f90   -Os  execution test

and may be

FAIL: gfortran.dg/coarray_lib_this_image_2.f90   -O   scan-tree-dump-times original "mylbound = parm...dim\\[0\\].stride >= 0 && parm...dim\\[0\\].ubound >= parm...dim\\[0\\].lbound \\|\\| parm...dim\\[0\\].stride < 0 \\?[^\n\r]* parm...dim\\[0\\].lbound : 1;" 1
FAIL: gfortran.dg/dependency_26.f90   -O   scan-tree-dump-times original "&a" 1

> Reallocation on assignment also works now.

Confirmed.

> 
> Still missing:
> 
> 1. Control via an option, BLAS inlining has to take precedence
> 2. handling of matmul(a,b) occurring in the middle of an expression.
> 3. Bounds checking from the front end pass (basically, calling
>   gfortran_runtime_error).
> 4. More test cases
> 
> What do you think?  1. is straightforward,

I agree if the bounds are known at compile time (it will be nice to use -fblas-matmul-limit=n for the threshold).
However for bounds know at run time only, I don’t see how we can escape some king of versioning.

> we don't really need to do 2. for committing early in stage one.  

Agreed: no point to mess with complicated expressions if the simple ones are not properly debugged!

> For 3, we could maybe just issue a STOP.  

I have silenced these failures (gfortran.dg/matmul_bounds_*.f90) by adding -fno-frontend-optimize to the list of options. IMO having the same message with/without inlining is needed.

> 4. is required, I need to look at some more corner cases where bugs may still be lurking.

I also see the following failures

FAIL: gfortran.dg/shape_2.f90   -O0  execution test
FAIL: gfortran.dg/shape_2.f90   -O1  execution test
FAIL: gfortran.dg/shape_2.f90   -O2  execution test
FAIL: gfortran.dg/shape_2.f90   -O3 -fomit-frame-pointer  execution test
FAIL: gfortran.dg/shape_2.f90   -O3 -fomit-frame-pointer -funroll-loops  execution test
FAIL: gfortran.dg/shape_2.f90   -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions  execution test
FAIL: gfortran.dg/shape_2.f90   -O3 -g  execution test
FAIL: gfortran.dg/shape_2.f90   -Os  execution test
FAIL: gfortran.dg/shape_2.f90   -g -flto  execution test

A reduced test is:

program main
  integer, dimension (40, 80) :: a = 1
  call test (a)
contains
  subroutine test (b)
    integer, dimension (11:, -8:), target :: b
    integer, dimension (:, :), pointer :: ptr

    print *, lbound (b (:, :), 1)
!    if (lbound (b (:, :), 1) .ne. 1) call abort
    print *, lbound (b (:, :), 2)
!    if (lbound (b (:, :), 2) .ne. 1) call abort

    print *, lbound (b (20:30:3, 40), 1)
    if (lbound (b (20:30:3, 40), 1) .ne. 1) call abort
  end subroutine test
end program main

(the other tests in the original file succeed).

Thanks for working of this,

Dominique

> What do you think?
> 
> 	Thomas
> 
> <matmul-8.diff>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch, fortran, RFC] First steps towards inlining matmul
  2015-04-10 14:15             ` Dominique d'Humières
@ 2015-04-10 16:57               ` Dominique d'Humières
  2015-04-11 12:24               ` Thomas Koenig
  1 sibling, 0 replies; 16+ messages in thread
From: Dominique d'Humières @ 2015-04-10 16:57 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: GCC Patches, GNU GFortran


> Le 10 avr. 2015 à 16:15, Dominique d'Humières <dominiq@lps.ens.fr> a écrit :
> 
> 
>> 4. is required, I need to look at some more corner cases where bugs may still be lurking.
> 

Compiling the polyhedron test rnflow.f90 after the patch gives an ICE

f951: internal compiler error: gfc_array_dimen_size(): Bad dimension

Reduced test extracted from the subroutine evlrnf:

      subroutine evlrnf (ptrs0t, nclsm, prnf0t) 
      real, dimension (1:nclsm,1:nclsm), intent (in) :: ptrs0t
      integer, intent (in)            :: nclsm ! nombre de classes
      real, dimension (1:nclsm,1:nclsm), intent (out):: prnf0t
      real, allocatable, dimension (:,:) :: utrsft ! probas up
      real, allocatable, dimension (:)   :: vwrkft
      real, allocatable, dimension (:)   :: vwrk2t
      ncls = nclsm
      allocate (utrsft (1:ncls,1:ncls))
      allocate (vwrkft (1:ncls))
      allocate (vwrk2t (1:ncls))
         vwrk2t = matmul (utrsft, vwrkft)   ! U Fk
      deallocate (utrsft)
      deallocate (vwrk2t)
      deallocate (vwrkft)
      end subroutine evlrnf

Dominique

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch, fortran, RFC] First steps towards inlining matmul
  2015-04-10 14:15             ` Dominique d'Humières
  2015-04-10 16:57               ` Dominique d'Humières
@ 2015-04-11 12:24               ` Thomas Koenig
  2015-04-11 13:43                 ` Mikael Morin
  1 sibling, 1 reply; 16+ messages in thread
From: Thomas Koenig @ 2015-04-11 12:24 UTC (permalink / raw)
  To: Dominique d'Humières; +Cc: GCC Patches, GNU GFortran

[-- Attachment #1: Type: text/plain, Size: 513 bytes --]

OK, here is a new version.

There is now an option for setting a maximum on the array size,
which takes its default from the BLAS limit (if specified).

Currently, only setting the maximum size to zero as a way of
disabling the unrolling is supported.  I have done this in a
few test cases.

The bugs reported by Dominique have now been fixed.

Still to do:  Bounds checking (a rather big one), honoring the
maximum size, test cases, ChangeLog, some more comments to the code.

Anything else?

Regards

	Thomas



[-- Attachment #2: matmul-10.diff --]
[-- Type: text/x-patch, Size: 31446 bytes --]

Index: fortran/frontend-passes.c
===================================================================
--- fortran/frontend-passes.c	(Revision 221944)
+++ fortran/frontend-passes.c	(Arbeitskopie)
@@ -43,7 +43,11 @@ static void doloop_warn (gfc_namespace *);
 static void optimize_reduction (gfc_namespace *);
 static int callback_reduction (gfc_expr **, int *, void *);
 static void realloc_strings (gfc_namespace *);
-static gfc_expr *create_var (gfc_expr *);
+static gfc_expr *create_var (gfc_expr *, const char *vname=NULL);
+static int optimize_matmul_assign (gfc_code **, int *, void *);
+static gfc_code * create_do_loop (gfc_expr *, gfc_expr *, gfc_expr *,
+				  locus *, gfc_namespace *, 
+				  char *vname=NULL);
 
 /* How deep we are inside an argument list.  */
 
@@ -93,8 +97,24 @@ struct my_struct *evec;
 
 static bool in_assoc_list;
 
+/* Counter for temporary variables.  */
+
+static int var_num = 1;
+
+/* What sort of matrix we are dealing with when optimizing MATMUL.  */
+
+enum matrix_case { none=0, A2B2, A2B1, A1B2 };
+
+/* Keep track of the number of expressions we have inserted so far 
+   using create_var.  */
+
+int n_vars;
+
+void gfc_debug_expr (gfc_expr *);
+
 /* Entry point - run all passes for a namespace.  */
 
+
 void
 gfc_run_passes (gfc_namespace *ns)
 {
@@ -157,7 +177,7 @@ realloc_string_callback (gfc_code **c, int *walk_s
     return 0;
   
   current_code = c;
-  n = create_var (expr2);
+  n = create_var (expr2, "trim");
   co->expr2 = n;
   return 0;
 }
@@ -524,29 +544,11 @@ constant_string_length (gfc_expr *e)
 
 }
 
-/* Returns a new expression (a variable) to be used in place of the old one,
-   with an assignment statement before the current statement to set
-   the value of the variable. Creates a new BLOCK for the statement if
-   that hasn't already been done and puts the statement, plus the
-   newly created variables, in that block.  Special cases:  If the
-   expression is constant or a temporary which has already
-   been created, just copy it.  */
-
-static gfc_expr*
-create_var (gfc_expr * e)
+static gfc_namespace*
+insert_block ()
 {
-  char name[GFC_MAX_SYMBOL_LEN +1];
-  static int num = 1;
-  gfc_symtree *symtree;
-  gfc_symbol *symbol;
-  gfc_expr *result;
-  gfc_code *n;
   gfc_namespace *ns;
-  int i;
 
-  if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e))
-    return gfc_copy_expr (e);
-
   /* If the block hasn't already been created, do so.  */
   if (inserted_block == NULL)
     {
@@ -578,7 +580,37 @@ constant_string_length (gfc_expr *e)
   else
     ns = inserted_block->ext.block.ns;
 
-  sprintf(name, "__var_%d",num++);
+  return ns;
+}
+
+/* Returns a new expression (a variable) to be used in place of the old one,
+   with an optional assignment statement before the current statement to set
+   the value of the variable. Creates a new BLOCK for the statement if that
+   hasn't already been done and puts the statement, plus the newly created
+   variables, in that block.  Special cases: If the expression is constant or
+   a temporary which has already been created, just copy it.  */
+
+static gfc_expr*
+create_var (gfc_expr * e, const char *vname)
+{
+  char name[GFC_MAX_SYMBOL_LEN +1];
+  gfc_symtree *symtree;
+  gfc_symbol *symbol;
+  gfc_expr *result;
+  gfc_code *n;
+  gfc_namespace *ns;
+  int i;
+
+  if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e))
+    return gfc_copy_expr (e);
+
+  ns = insert_block ();
+
+  if (vname)
+    snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d_%s", var_num++, vname);
+  else
+    snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d", var_num++);
+
   if (gfc_get_sym_tree (name, ns, &symtree, false) != 0)
     gcc_unreachable ();
 
@@ -651,6 +683,7 @@ constant_string_length (gfc_expr *e)
       result->ref->type = REF_ARRAY;
       result->ref->u.ar.type = AR_FULL;
       result->ref->u.ar.where = e->where;
+      result->ref->u.ar.dimen = e->rank;
       result->ref->u.ar.as = symbol->ts.type == BT_CLASS
 			     ? CLASS_DATA (symbol)->as : symbol->as;
       if (warn_array_temporaries)
@@ -658,6 +691,7 @@ constant_string_length (gfc_expr *e)
 		     "Creating array temporary at %L", &(e->where));
     }
 
+
   /* Generate the new assignment.  */
   n = XCNEW (gfc_code);
   n->op = EXEC_ASSIGN;
@@ -666,6 +700,7 @@ constant_string_length (gfc_expr *e)
   n->expr1 = gfc_copy_expr (result);
   n->expr2 = e;
   *changed_statement = n;
+  n_vars ++;
 
   return result;
 }
@@ -724,7 +759,7 @@ cfe_expr_0 (gfc_expr **e, int *walk_subtrees,
 	  if (gfc_dep_compare_functions (*ei, *ej, true) == 0)
 	    {
 	      if (newvar == NULL)
-		newvar = create_var (*ei);
+		newvar = create_var (*ei, "fcn");
 
 	      if (warn_function_elimination)
 		do_warn_function_elimination (*ej);
@@ -931,6 +966,7 @@ convert_elseif (gfc_code **c, int *walk_subtrees A
   /*  Don't walk subtrees.  */
   return 0;
 }
+
 /* Optimize a namespace, including all contained namespaces.  */
 
 static void
@@ -947,6 +983,9 @@ optimize_namespace (gfc_namespace *ns)
   gfc_code_walker (&ns->code, convert_elseif, dummy_expr_callback, NULL);
   gfc_code_walker (&ns->code, cfe_code, cfe_expr_0, NULL);
   gfc_code_walker (&ns->code, optimize_code, optimize_expr, NULL);
+  if (flag_inline_matmul_limit != 0)
+    gfc_code_walker (&ns->code, optimize_matmul_assign, dummy_expr_callback,
+		     NULL);
 
   /* BLOCKs are handled in the expression walker below.  */
   for (ns = ns->contained; ns; ns = ns->sibling)
@@ -1222,7 +1261,7 @@ combine_array_constructor (gfc_expr *e)
   if (op2->ts.type == BT_CHARACTER)
     return false;
 
-  scalar = create_var (gfc_copy_expr (op2));
+  scalar = create_var (gfc_copy_expr (op2), "constr");
 
   oldbase = op1->value.constructor;
   newbase = NULL;
@@ -1939,7 +1978,795 @@ doloop_warn (gfc_namespace *ns)
   gfc_code_walker (&ns->code, doloop_code, do_function, NULL);
 }
 
+/* Auxiliary function to build and simplify an array inquiry function.
+   dim is zero-based.  */
 
+static gfc_expr *
+get_array_inq_function (gfc_expr *e, int dim, gfc_isym_id id)
+{
+  gfc_expr *fcn;
+  gfc_expr *dim_arg, *kind;
+  const char *name;
+
+  switch (id)
+    {
+    case GFC_ISYM_LBOUND:
+      name = "_gfortran_lbound";
+      break;
+
+    case GFC_ISYM_UBOUND:
+      name = "_gfortran_ubound";
+      break;
+
+    case GFC_ISYM_SIZE:
+      name = "_gfortran_size";
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  dim_arg =  gfc_get_int_expr (gfc_default_integer_kind, &e->where, dim);
+  kind = gfc_get_int_expr (gfc_default_integer_kind, &e->where,
+			   gfc_index_integer_kind);
+
+  fcn = gfc_build_intrinsic_call (current_ns, id, name, e->where, 3,
+				  gfc_copy_expr(e), dim_arg,  kind);
+  gfc_simplify_expr (fcn, 0);
+  return fcn;
+}
+
+/* Build a not equals - expression.  */
+
+static gfc_expr*
+build_logical_expr (gfc_expr *e1, gfc_expr *e2, gfc_intrinsic_op op)
+{
+  gfc_typespec ts;
+  gfc_expr *res;
+
+  ts.type = BT_LOGICAL;
+  ts.kind = gfc_default_logical_kind;
+  res = gfc_get_expr ();
+  res->where = e1->where;
+  res->expr_type = EXPR_OP;
+  res->value.op.op = op;
+  res->value.op.op1 = e1;
+  res->value.op.op2 = e2;
+  res->ts = ts;
+
+  return res;
+}
+/* Handle matrix reallocation.  Caller is responsible to insert into
+   the code tree.
+
+   For the two-dimensional case, build 
+
+  if (allocated(c)) then
+     if (size(c,1) /= size(a,1) .or. size(c,2) /= size(b,2)) then
+        deallocate(c)
+        allocate (c(size(a,1), size(b,2)))
+     end if
+  else
+     allocate (c(size(a,1),size(b,2)))
+  end if
+
+*/
+
+static gfc_code *
+matmul_lhs_realloc (gfc_expr *c, gfc_expr *a, gfc_expr *b,
+		    enum matrix_case m_case)
+{
+
+  gfc_expr *allocated, *alloc_expr;
+  gfc_code *if_alloc_1, *if_alloc_2, *if_size_1, *if_size_2;
+  gfc_code *else_alloc;
+  gfc_code *deallocate, *allocate1, *allocate_else;
+  gfc_ref *ref;
+  gfc_array_ref *ar;
+  gfc_expr *cond, *ne1, *ne2;
+
+  alloc_expr = gfc_copy_expr (c);
+
+  ref = alloc_expr->ref;
+
+  while (ref)
+    {
+      if (ref->type == REF_ARRAY && ref->u.ar.type != AR_ELEMENT)
+	break;
+
+      ref = ref->next;
+
+    }
+  ar = &ref->u.ar;
+  gcc_assert (ar && ar->type == AR_FULL);
+
+  /* c comes in as a full ref.  Change it into a copy and make it into an
+     element ref so it has the right for for ALLOCATE.  In the same switch,
+     also generate the size comparison for the secod IF statement.  */
+
+  ar->type = AR_ELEMENT;
+
+  switch (m_case)
+    {
+    case A2B2:
+      ar->start[0] = get_array_inq_function (a, 1, GFC_ISYM_SIZE);
+      ar->start[1] = get_array_inq_function (b, 2, GFC_ISYM_SIZE);
+      ne1 = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE),
+				get_array_inq_function (a, 1, GFC_ISYM_SIZE),
+				INTRINSIC_NE);
+      ne2 = build_logical_expr (get_array_inq_function (c, 2, GFC_ISYM_SIZE),
+				get_array_inq_function (b, 2, GFC_ISYM_SIZE),
+				INTRINSIC_NE);
+      cond = build_logical_expr (ne1, ne2, INTRINSIC_OR);
+      break;
+
+    case A2B1:
+      ar->start[0] = get_array_inq_function (a, 1, GFC_ISYM_SIZE);
+      cond = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE),
+				 get_array_inq_function (a, 2, GFC_ISYM_SIZE),
+				 INTRINSIC_NE);
+      break;
+
+    case A1B2:
+      ar->start[0] = get_array_inq_function (b, 1, GFC_ISYM_SIZE);
+      cond = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE),
+				 get_array_inq_function (b, 2, GFC_ISYM_SIZE),
+				 INTRINSIC_NE);
+      break;
+
+    default:
+      gcc_unreachable();
+
+    }
+
+  gfc_simplify_expr (cond, 0);
+
+  /* We need two identical allocate statements in two
+     branches of the IF statement.  */
+  
+  allocate1 = XCNEW (gfc_code);
+  allocate1->op = EXEC_ALLOCATE;
+  allocate1->ext.alloc.list = gfc_get_alloc ();
+  allocate1->loc = c->where;
+  allocate1->ext.alloc.list->expr = gfc_copy_expr (alloc_expr);
+
+  allocate_else = XCNEW (gfc_code);
+  allocate_else->op = EXEC_ALLOCATE;
+  allocate_else->ext.alloc.list = gfc_get_alloc ();
+  allocate_else->loc = c->where;
+  allocate_else->ext.alloc.list->expr = alloc_expr;
+
+  allocated = gfc_build_intrinsic_call (current_ns, GFC_ISYM_ALLOCATED,
+					"_gfortran_allocated", c->where,
+					1, gfc_copy_expr (c));
+
+  deallocate = XCNEW (gfc_code);
+  deallocate->op = EXEC_DEALLOCATE;
+  deallocate->ext.alloc.list = gfc_get_alloc ();
+  deallocate->ext.alloc.list->expr = gfc_copy_expr (c);
+  deallocate->next = allocate1;
+  deallocate->loc = c->where;
+  
+  if_size_2 = XCNEW (gfc_code);
+  if_size_2->op = EXEC_IF;
+  if_size_2->expr1 = cond;
+  if_size_2->loc = c->where;
+  if_size_2->next = deallocate;
+
+  if_size_1 = XCNEW (gfc_code);
+  if_size_1->op = EXEC_IF;
+  if_size_1->block = if_size_2;
+  if_size_1->loc = c->where;
+
+  else_alloc = XCNEW (gfc_code);
+  else_alloc->op = EXEC_IF;
+  else_alloc->loc = c->where;
+  else_alloc->next = allocate_else;
+
+  if_alloc_2 = XCNEW (gfc_code);
+  if_alloc_2->op = EXEC_IF;
+  if_alloc_2->expr1 = allocated;
+  if_alloc_2->loc = c->where;
+  if_alloc_2->next = if_size_1;
+  if_alloc_2->block = else_alloc;
+
+  if_alloc_1 = XCNEW (gfc_code);
+  if_alloc_1->op = EXEC_IF;
+  if_alloc_1->block = if_alloc_2;
+  if_alloc_1->loc = c->where;
+
+  return if_alloc_1;
+}
+
+
+/* Callback function for has_function_or_op.  */
+
+static int
+is_function_or_op (gfc_expr **e, int *walk_subtrees ATTRIBUTE_UNUSED,
+	     void *data ATTRIBUTE_UNUSED)
+{
+  if ((*e) == 0)
+    return 0;
+  else
+    return (*e)->expr_type == EXPR_FUNCTION
+      || (*e)->expr_type == EXPR_OP;
+}
+
+/* Returns true if the expression contains a function.  */
+
+static bool
+has_function_or_op (gfc_expr **e)
+{
+  if (e == NULL)
+    return false;
+  else
+    return gfc_expr_walker (e, is_function_or_op, NULL);
+}
+
+/* Freeze a single expression.  */
+
+static void
+freeze_expr (gfc_expr **ep)
+{
+  gfc_expr *ne;
+  if (has_function_or_op (ep))
+    {
+      ne = create_var (*ep, "freeze");
+      *ep = ne;
+    }
+}
+
+/* Go through an expression's references and assign them to temporary
+   variables if they contain functions.  This is usually done prior to
+   front-end scalarization to avoid multiple invocations of functions.  */
+
+static void
+freeze_references (gfc_expr *e)
+{
+  gfc_ref *r;
+  gfc_array_ref *ar;
+  int i;
+
+  for (r=e->ref; r; r=r->next)
+    {
+      if (r->type == REF_SUBSTRING)
+	{
+	  if (r->u.ss.start != NULL)
+	    freeze_expr (&r->u.ss.start);
+
+	  if (r->u.ss.end != NULL)
+	    freeze_expr (&r->u.ss.end);
+	}
+      else if (r->type == REF_ARRAY)
+	{
+	  ar = &r->u.ar;
+	  switch (ar->type)
+	    {
+	    case AR_FULL:
+	      break;
+
+	    case AR_SECTION:
+	      for (i=0; i<ar->dimen; i++)
+		{
+		  if (ar->dimen_type[i] == DIMEN_RANGE)
+		    {
+		      freeze_expr (&ar->start[i]);
+		      freeze_expr (&ar->end[i]);
+		      freeze_expr (&ar->stride[i]);
+		    }
+		}
+	    default:
+	      break;
+	    }
+	}
+    }
+}
+
+/* Convert to gfc_index_integer_kind if needed, just do a copy otherwise.  */
+
+static gfc_expr *
+convert_to_index_kind (gfc_expr *e)
+{
+  gfc_expr *res;
+
+  gcc_assert (e != NULL);
+
+  res = gfc_copy_expr (e);
+
+  gcc_assert (e->ts.type == BT_INTEGER);
+
+  if (res->ts.kind != gfc_index_integer_kind)
+    {
+      gfc_typespec ts;
+      gfc_clear_ts (&ts);
+      ts.type = BT_INTEGER;
+      ts.kind = gfc_index_integer_kind;
+
+      gfc_convert_type_warn (e, &ts, 2, 0);
+    }
+
+  return res;
+}
+
+/* Function to create a DO loop including creation of the
+   iteration variable.  gfc_expr are copied.*/
+
+static gfc_code *
+create_do_loop (gfc_expr *start, gfc_expr *end, gfc_expr *step, locus *where,
+		gfc_namespace *ns, char *vname)
+{
+
+  char name[GFC_MAX_SYMBOL_LEN +1];
+  gfc_symtree *symtree;
+  gfc_symbol *symbol;
+  gfc_expr *i;
+  gfc_code *n, *n2;
+
+  /* Create an expression for the iteration variable.  */
+  if (vname)
+    sprintf (name, "__var_%d_do_%s", var_num++, vname);
+  else
+    sprintf (name, "__var_%d_do", var_num++);
+
+
+  if (gfc_get_sym_tree (name, ns, &symtree, false) != 0)
+    gcc_unreachable ();
+
+  symbol = symtree->n.sym;
+  symbol->ts.type = BT_INTEGER;
+  symbol->ts.kind = gfc_index_integer_kind;
+  symbol->attr.flavor = FL_VARIABLE;
+  symbol->attr.referenced = 1;
+  symbol->attr.dimension = 0;
+  symbol->attr.fe_temp = 1;
+  gfc_commit_symbol (symbol);
+
+  i = gfc_get_expr ();
+  i->expr_type = EXPR_VARIABLE;
+  i->ts = symbol->ts;
+  i->rank = 0;
+  i->where = *where;
+  i->symtree = symtree;
+
+  n = XCNEW (gfc_code);
+  n->op = EXEC_DO;
+  n->loc = *where;
+  n->ext.iterator = gfc_get_iterator ();
+  n->ext.iterator->var = i;
+  n->ext.iterator->start = convert_to_index_kind (start);
+  n->ext.iterator->end = convert_to_index_kind (end);
+  if (step)
+    n->ext.iterator->step = convert_to_index_kind (step);
+  else
+    n->ext.iterator->step = gfc_get_int_expr (gfc_index_integer_kind,
+					      where, 1);
+
+  n2 = XCNEW (gfc_code);
+  n2->op = EXEC_DO;
+  n2->loc = *where;
+  n2->next = NULL;
+  n->block = n2;
+  return n;
+}
+
+/* Return an operation of one two gfc_expr (one if e2 is NULL. This assumes
+   compatible typespecs.  */
+
+static gfc_expr *
+get_operand (gfc_intrinsic_op op, gfc_expr *e1, gfc_expr *e2)
+{
+  gfc_expr *res;
+
+  res = gfc_get_expr ();
+  res->ts = e1->ts;
+  res->where = e1->where;
+  res->expr_type = EXPR_OP;
+  res->value.op.op = op;
+  res->value.op.op1 = e1;
+  res->value.op.op2 = e2;
+  gfc_simplify_expr (res, 0);
+  return res;
+}
+
+/* Get the upper bound of the DO loops for matmul along
+   a dimension, i.e. size minus one.  */
+static gfc_expr*
+get_size_m1 (gfc_expr *e, int dimen)
+{
+  mpz_t size;
+  gfc_expr *res;
+
+  if (gfc_array_dimen_size (e, dimen, &size))
+    {
+      res = gfc_get_constant_expr (BT_INTEGER,
+				   gfc_index_integer_kind, &e->where);
+      mpz_sub_ui (res->value.integer, size, 1);
+      mpz_clear (size);
+    }
+  else
+    {
+      res = get_operand (INTRINSIC_MINUS,
+			 get_array_inq_function (e, dimen + 1, GFC_ISYM_SIZE),
+			 gfc_get_int_expr (gfc_index_integer_kind,
+					   &e->where, 1));
+      gfc_simplify_expr (res, 0);
+    }
+
+  return res;
+}
+
+/* Function to return a scalarized expression. It is assumed that indices are
+ zero based to make generation of DO loops easier.  A zero as index will
+ access the first element along a dimension.  Single element references will
+ be skipped.  A NULL as an expression will be replaced by a full reference.
+ This assumes that the index loops have gfc_index_integer_kind, and that all
+ references have been frozen.  */
+
+static gfc_expr*
+scalarized_expr (gfc_expr *e_in, gfc_expr **index, int count_index)
+{
+  gfc_ref *ref;
+  gfc_array_ref *ar;
+  int i;
+  int rank;
+  gfc_expr *e;
+  int i_index;
+  bool was_fullref;
+
+  e = gfc_copy_expr(e_in);
+
+  rank = e->rank;
+
+  ref = e->ref;
+
+  while (ref)
+    {
+      if (ref->type == REF_ARRAY
+	  && (ref->u.ar.type == AR_FULL || ref->u.ar.type == AR_SECTION))
+	break;
+      ref = ref->next;
+    }
+  ar = &ref->u.ar;
+
+  /* We scalarize count_index variables, reducing the rank by count_index.  */
+
+  e->rank = rank - count_index;
+
+  was_fullref = ar->type == AR_FULL;
+
+  if (e->rank == 0)
+    ar->type = AR_ELEMENT;
+  else
+    ar->type = AR_SECTION;
+
+  /* Loop over the indices.  For each index, create the expression
+     index + lbound(e, dim).  */
+  
+  i_index = 0;
+  for (i=0; i < ar->dimen; i++)
+    {
+      if (was_fullref || ar->dimen_type[i] == DIMEN_RANGE)
+	{
+	  if (index[i_index] != NULL)
+	    {
+	      gfc_expr *lbound, *nindex;
+	      gfc_expr *loopvar;
+	      
+	      loopvar = gfc_copy_expr (index[i_index]); 
+	      
+	      if (ar->stride[i])
+		{
+		  gfc_expr *tmp;
+
+		  tmp = gfc_copy_expr(ar->stride[i]);
+		  if (tmp->ts.kind != gfc_index_integer_kind)
+		    {
+		      gfc_typespec ts;
+		      gfc_clear_ts (&ts);
+		      ts.type = BT_INTEGER;
+		      ts.kind = gfc_index_integer_kind;
+		      gfc_convert_type (tmp, &ts, 2);
+		    }
+		  nindex = get_operand (INTRINSIC_TIMES, loopvar, tmp);
+		}
+	      else
+		nindex = loopvar;
+	      
+	      if (ar->start[i])
+		{
+		  lbound = gfc_copy_expr (ar->start[i]);
+		  if (lbound->ts.kind != gfc_index_integer_kind)
+		    {
+		      gfc_typespec ts;
+		      gfc_clear_ts (&ts);
+		      ts.type = BT_INTEGER;
+		      ts.kind = gfc_index_integer_kind;
+		      gfc_convert_type (lbound, &ts, 2);
+
+		    }
+		}
+	      else
+		lbound = get_array_inq_function (e_in, i+1, GFC_ISYM_LBOUND);
+
+	      ar->dimen_type[i] = DIMEN_ELEMENT;
+	      ar->start[i] = get_operand (INTRINSIC_PLUS, nindex, lbound);
+	      ar->end[i] = NULL;
+	      ar->stride[i] = NULL;
+	      gfc_simplify_expr (ar->start[i], 0);
+	    }
+	  else if (was_fullref)
+	    {
+	      ar->dimen_type[i] = DIMEN_RANGE;
+	      ar->start[i] = NULL;
+	      ar->end[i] = NULL;
+	      ar->stride[i] = NULL;
+	    }
+	  i_index ++;
+	}
+    }
+  return e;
+}
+
+
+/* Optimize assignments of the form c = matmul(a,b).
+   Handle only the cases currently where b and c are rank-two arrays.
+
+   This translates the code to
+
+   BLOCK
+     integer i,j,k
+     c = 0
+     do j=0, size(b,2)-1
+       do k=0, size(a, 2)-1
+         do i=0, size(a, 1)-1
+            c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) =
+	    c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) +
+            a(i * stride(a,1) + lbound(a,1), k * stride(a,2) + lbound(a,2)) *
+            b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2))
+         end do
+       end do
+     end do
+   END BLOCK
+   
+*/
+
+static int
+optimize_matmul_assign (gfc_code **c,
+			  int *walk_subtrees ATTRIBUTE_UNUSED,
+			  void *data ATTRIBUTE_UNUSED)
+{
+  gfc_code *co = *c;
+  gfc_expr *expr1, *expr2;
+  gfc_expr *matrix_a, *matrix_b;
+  gfc_actual_arglist *a, *b;
+  gfc_code *do_1, *do_2, *do_3, *assign_zero, *assign_matmul;
+  gfc_expr *zero_e;
+  gfc_expr *u1, *u2, *u3;
+  gfc_expr *list[2];
+  gfc_expr *ascalar, *bscalar, *cscalar;
+  gfc_expr *mult;
+  gfc_expr *var_1, *var_2, *var_3;
+  gfc_expr *zero;
+  gfc_namespace *ns;
+  gfc_intrinsic_op op_times, op_plus;
+  enum matrix_case m_case;
+  gfc_code *next_code;
+  gfc_code *p;
+  int i;
+
+
+  if (co->op != EXEC_ASSIGN)
+    return 0;
+
+  expr1 = co->expr1;
+  expr2 = co->expr2;
+  if (expr2->expr_type != EXPR_FUNCTION
+      || expr2->value.function.isym == NULL
+      || expr2->value.function.isym->id != GFC_ISYM_MATMUL)
+    return 0;
+
+  /* Handling only two matrices of dimension two for the time being.  */
+
+  current_code = c;
+  inserted_block = NULL;
+  changed_statement = NULL;
+
+  a = expr2->value.function.actual;
+  matrix_a = a->expr;
+  b = a->next;
+  matrix_b = b->expr;
+
+  if (matrix_a->expr_type != EXPR_VARIABLE
+      || matrix_b->expr_type != EXPR_VARIABLE)
+    return 0;
+
+  if (matrix_a->rank == 2)
+    m_case = matrix_b->rank == 1 ? A2B1 : A2B2;
+  else
+    m_case = A1B2;
+
+  if (gfc_check_dependency (expr1, matrix_a, true)
+      || gfc_check_dependency (expr1, matrix_b, true))
+    return 0;
+
+  ns = insert_block ();
+
+  switch(expr1->ts.type)
+    {
+    case BT_INTEGER:
+      zero_e = gfc_get_int_expr (expr1->ts.kind, &expr1->where, 0);
+      op_times = INTRINSIC_TIMES;
+      op_plus = INTRINSIC_PLUS;
+      break;
+
+    case BT_LOGICAL:
+      op_times = INTRINSIC_AND;
+      op_plus = INTRINSIC_OR;
+      zero_e = gfc_get_logical_expr (expr1->ts.kind, &expr1->where,
+				     0);
+      break;
+    case BT_REAL:
+      zero_e = gfc_get_constant_expr (BT_REAL, expr1->ts.kind,
+				      &expr1->where);
+      mpfr_set_si (zero_e->value.real, 0, GFC_RND_MODE);
+      op_times = INTRINSIC_TIMES;
+      op_plus = INTRINSIC_PLUS;
+      break;
+
+    case BT_COMPLEX:
+      zero_e = gfc_get_constant_expr (BT_COMPLEX, expr1->ts.kind,
+				      &expr1->where);
+      mpc_set_si_si (zero_e->value.complex, 0, 0, GFC_RND_MODE);
+      op_times = INTRINSIC_TIMES;
+      op_plus = INTRINSIC_PLUS;
+
+      break;
+
+    default:
+      gcc_unreachable();
+    }
+
+  current_code = &ns->code;
+
+  n_vars = 0;
+  freeze_references (matrix_a);
+  freeze_references (matrix_b);
+
+  assign_zero = XCNEW (gfc_code);
+  assign_zero->op = EXEC_ASSIGN;
+  assign_zero->loc = co->loc;
+  assign_zero->expr1 = gfc_copy_expr (expr1);
+  assign_zero->expr2 = zero_e;
+
+  if (flag_realloc_lhs && gfc_is_reallocatable_lhs (expr1))
+    {
+      gfc_code *lhs_alloc;
+
+      lhs_alloc = matmul_lhs_realloc (expr1, matrix_a, matrix_b, m_case);
+      lhs_alloc->next = assign_zero;
+      next_code = lhs_alloc;
+    }
+  else
+    next_code = assign_zero;
+
+  if (n_vars == 0)
+    *current_code = next_code;
+  else
+    {
+      p = ns->code;
+      for (i=0; i<n_vars-1; i++)
+	  p = p->next;
+
+      p->next = next_code;
+    }
+
+  zero = gfc_get_int_expr (gfc_index_integer_kind, &co->loc, 0);
+
+  assign_matmul = XCNEW (gfc_code);
+  assign_matmul->op = EXEC_ASSIGN;
+  assign_matmul->loc = co->loc;
+
+  switch (m_case)
+    {
+    case A2B2:
+      u1 = get_size_m1 (matrix_b, 1);
+      u2 = get_size_m1 (matrix_a, 1);
+      u3 = get_size_m1 (matrix_a, 0);
+
+      do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns);
+      do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns);
+      do_3 = create_do_loop (gfc_copy_expr (zero), u3, NULL, &co->loc, ns);
+
+      do_1->block->next = do_2;
+      do_2->block->next = do_3;
+      do_3->block->next = assign_matmul;
+
+      var_1 = do_1->ext.iterator->var;
+      var_2 = do_2->ext.iterator->var;
+      var_3 = do_3->ext.iterator->var;
+
+      list[0] = var_3;
+      list[1] = var_1;
+      cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 2);
+
+      list[0] = var_3;
+      list[1] = var_2;
+      ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2);
+
+      list[0] = var_2;
+      list[1] = var_1;
+      bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2);
+
+      break;
+
+    case A2B1:
+      u1 = get_size_m1 (matrix_b, 0);
+      u2 = get_size_m1 (matrix_a, 0);
+
+      do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns);
+      do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns);
+
+      do_1->block->next = do_2;
+      do_2->block->next = assign_matmul;
+
+      var_1 = do_1->ext.iterator->var;
+      var_2 = do_2->ext.iterator->var;
+
+      list[0] = var_2;
+      cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 1);
+
+      list[0] = var_2;
+      list[1] = var_1;
+      ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2);
+
+      list[0] = var_1;
+      bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 1);
+
+      break;
+
+    case A1B2:
+      u1 = get_size_m1 (matrix_b, 1);
+      u2 = get_size_m1 (matrix_a, 0);
+
+      do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns);
+      do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns);
+
+      do_1->block->next = do_2;
+      do_2->block->next = assign_matmul;
+
+      var_1 = do_1->ext.iterator->var;
+      var_2 = do_2->ext.iterator->var;
+
+      list[0] = var_1;
+      cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 1);
+
+      list[0] = var_2;
+      ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 1);
+
+      list[0] = var_2;
+      list[1] = var_1;
+      bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2);
+
+      break;
+
+    default:
+      gcc_unreachable();
+    }
+
+  assign_zero->next = do_1;
+
+
+  assign_matmul->expr1 = gfc_copy_expr (cscalar);
+
+  mult = get_operand (op_times, ascalar, bscalar);
+  assign_matmul->expr2 = get_operand (op_plus, cscalar, mult);
+
+  co->next = NULL;
+  gfc_free_statements(co);
+  gfc_free_expr (zero);
+  return 0;
+}
+
 #define WALK_SUBEXPR(NODE) \
   do							\
     {							\
Index: fortran/gfortran.h
===================================================================
--- fortran/gfortran.h	(Revision 221944)
+++ fortran/gfortran.h	(Arbeitskopie)
@@ -3225,4 +3225,8 @@ int gfc_code_walker (gfc_code **, walk_code_fn_t,
 
 void gfc_convert_mpz_to_signed (mpz_t, int);
 
+/* trans-array.c  */
+
+bool gfc_is_reallocatable_lhs (gfc_expr *);
+
 #endif /* GCC_GFORTRAN_H  */
Index: fortran/lang.opt
===================================================================
--- fortran/lang.opt	(Revision 221944)
+++ fortran/lang.opt	(Arbeitskopie)
@@ -542,6 +542,10 @@ Enum(gfc_init_local_real) String(inf) Value(GFC_IN
 EnumValue
 Enum(gfc_init_local_real) String(-inf) Value(GFC_INIT_REAL_NEG_INF)
 
+finline-matmul-limit=
+Fortran RejectNegative Joined UInteger Var(flag_inline_matmul_limit) Init(-1)
+-finline-matmul-limit=<n>	Specify the size of the largest matrix for which matmul will be inlined
+
 fmax-array-constructor=
 Fortran RejectNegative Joined UInteger Var(flag_max_array_constructor) Init(65535)
 -fmax-array-constructor=<n>	Maximum number of objects in an array constructor
Index: fortran/options.c
===================================================================
--- fortran/options.c	(Revision 221944)
+++ fortran/options.c	(Arbeitskopie)
@@ -378,6 +378,11 @@ gfc_post_options (const char **pfilename)
   if (!flag_automatic)
     flag_max_stack_var_size = 0;
   
+  /* If we call BLAS directly, only inline up to the BLAS limit.  */
+
+  if (flag_external_blas && flag_inline_matmul_limit < 0)
+    flag_inline_matmul_limit = flag_blas_matmul_limit;
+
   /* Optimization implies front end optimization, unless the user
      specified it directly.  */
 
Index: fortran/simplify.c
===================================================================
--- fortran/simplify.c	(Revision 221944)
+++ fortran/simplify.c	(Arbeitskopie)
@@ -3445,6 +3445,32 @@ simplify_bound (gfc_expr *array, gfc_expr *dim, gf
 
  done:
 
+
+  if (!upper && as && as->type == AS_ASSUMED_SHAPE && dim
+      && dim->expr_type == EXPR_CONSTANT && ref->u.ar.type != AR_SECTION)
+    {
+      if (!(array->symtree && array->symtree->n.sym
+	    && (array->symtree->n.sym->attr.allocatable
+		|| array->symtree->n.sym->attr.pointer)))
+	{
+	  unsigned long int ndim;
+	  gfc_expr *lower, *res;
+
+	  ndim = mpz_get_si (dim->value.integer) - 1;
+	  lower = as->lower[ndim];
+	  if (lower->expr_type == EXPR_CONSTANT)
+	    {
+	      res = gfc_copy_expr (lower);
+	      if (kind)
+		{
+		  int nkind = mpz_get_si (kind->value.integer);
+		  res->ts.kind = nkind;
+		}
+	      return res;
+	    }
+	}
+    }
+
   if (as && (as->type == AS_DEFERRED || as->type == AS_ASSUMED_SHAPE
 	     || as->type == AS_ASSUMED_RANK))
     return NULL;
Index: fortran/trans-array.h
===================================================================
--- fortran/trans-array.h	(Revision 221944)
+++ fortran/trans-array.h	(Arbeitskopie)
@@ -64,8 +64,6 @@ tree gfc_copy_only_alloc_comp (gfc_symbol *, tree,
 
 tree gfc_alloc_allocatable_for_assignment (gfc_loopinfo*, gfc_expr*, gfc_expr*);
 
-bool gfc_is_reallocatable_lhs (gfc_expr *);
-
 /* Add initialization for deferred arrays.  */
 void gfc_trans_deferred_array (gfc_symbol *, gfc_wrapped_block *);
 /* Generate an initializer for a static pointer or allocatable array.  */
Index: testsuite/gfortran.dg/function_optimize_1.f90
===================================================================
--- testsuite/gfortran.dg/function_optimize_1.f90	(Revision 221944)
+++ testsuite/gfortran.dg/function_optimize_1.f90	(Arbeitskopie)
@@ -1,5 +1,5 @@
 ! { dg-do compile }
-! { dg-options "-O -fdump-tree-original -Warray-temporaries" }
+! { dg-options "-O -fdump-tree-original -finline-matmul-limit=0 -Warray-temporaries" }
 program main
   implicit none
   real, dimension(2,2) :: a, b, c, d
Index: testsuite/gfortran.dg/function_optimize_2.f90
===================================================================
--- testsuite/gfortran.dg/function_optimize_2.f90	(Revision 221944)
+++ testsuite/gfortran.dg/function_optimize_2.f90	(Arbeitskopie)
@@ -1,5 +1,5 @@
 ! { dg-do compile }
-! { dg-options "-O -faggressive-function-elimination -fdump-tree-original" }
+! { dg-options "-O -finline-matmul-limit=0 -faggressive-function-elimination -fdump-tree-original" }
 program main
   implicit none
   real, dimension(2,2) :: a, b, c, d
Index: testsuite/gfortran.dg/function_optimize_5.f90
===================================================================
--- testsuite/gfortran.dg/function_optimize_5.f90	(Revision 221944)
+++ testsuite/gfortran.dg/function_optimize_5.f90	(Arbeitskopie)
@@ -1,5 +1,5 @@
 ! { dg-do compile }
-! { dg-options "-ffrontend-optimize -Wfunction-elimination" }
+! { dg-options "-ffrontend-optimize -finline-matmul-limit=0 -Wfunction-elimination" }
 ! Check the -ffrontend-optimize (in the absence of -O) and
 ! -Wfunction-elimination options.
 program main
Index: testsuite/gfortran.dg/function_optimize_7.f90
===================================================================
--- testsuite/gfortran.dg/function_optimize_7.f90	(Revision 221944)
+++ testsuite/gfortran.dg/function_optimize_7.f90	(Arbeitskopie)
@@ -1,5 +1,5 @@
 ! { dg-do compile }
-! { dg-options "-O -fdump-tree-original -Warray-temporaries" }
+! { dg-options "-O -fdump-tree-original -Warray-temporaries -finline-matmul-limit=0" }
 subroutine xx(n, m, a, b, c, d, x, z, i, s_in, s_out)
   implicit none
   integer, intent(in) :: n, m

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch, fortran, RFC] First steps towards inlining matmul
  2015-04-11 12:24               ` Thomas Koenig
@ 2015-04-11 13:43                 ` Mikael Morin
  2015-04-11 17:01                   ` Thomas Koenig
  0 siblings, 1 reply; 16+ messages in thread
From: Mikael Morin @ 2015-04-11 13:43 UTC (permalink / raw)
  To: Thomas Koenig, Dominique d'Humières; +Cc: GCC Patches, GNU GFortran

Hello, I haven't looked at the patch in detail yet, but...

Le 11/04/2015 14:24, Thomas Koenig a Ã©crit :
> Still to do:  Bounds checking (a rather big one),
... as you do a front-end to front-end transformation, you get bounds
checking for free, don't you?

Mikael

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch, fortran, RFC] First steps towards inlining matmul
  2015-04-11 13:43                 ` Mikael Morin
@ 2015-04-11 17:01                   ` Thomas Koenig
  0 siblings, 0 replies; 16+ messages in thread
From: Thomas Koenig @ 2015-04-11 17:01 UTC (permalink / raw)
  To: Mikael Morin, Dominique d'Humières; +Cc: GCC Patches, GNU GFortran

Hi Mikael,

>> Still to do:  Bounds checking (a rather big one),
> ... as you do a front-end to front-end transformation, you get bounds
> checking for free, don't you?

Only partially.

What the patch does is

     integer i,j,k
     c = 0
     do j=0, size(b,2)-1
       do k=0, size(a, 2)-1
         do i=0, size(a, 1)-1
            c(i * stride(c,1) + lbound(c,1), j * stride(c,2) +
lbound(c,2)) =
           c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) +
            a(i * stride(a,1) + lbound(a,1), k * stride(a,2) +
lbound(a,2)) *
            b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2))
         end do
       end do
     end do

If size(b,2) < size(c,2) or size(a,1) < size(c,1) or
size(a,2) < size(b,1), this will not get caught - no
array bounds violation in the DO loops, but illegal
code nonetheless.

Also, the error message is different, which should also be
changed.

What I would like to add to check before the loop, and then
add a "do not bounds-check" flag to the reference.

	Thomas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [patch, fortran, RFC] First steps towards inlining matmul
@ 2015-04-05 12:32 Thomas Koenig
  2015-04-05 13:20 ` Jerry DeLisle
  2015-04-07 19:24 ` David Malcolm
  0 siblings, 2 replies; 16+ messages in thread
From: Thomas Koenig @ 2015-04-05 12:32 UTC (permalink / raw)
  To: fortran, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3687 bytes --]

Hello world,

this is a first draft of a patch to inline matmul (PR 37171). This is
preliminary, but functional as far as it goes.  Definitely for the
next stage one :-)

Basically, it takes

c = matmul(a,b)

and converts this into

   BLOCK
     integer i,j,k
     c = 0
     do j=0, size(b,2)-1
       do k=0, size(a, 2)-1
         do i=0, size(a, 1)-1
            c(i * stride(c,1) + lbound(c,1), j * stride(c,2) +
lbound(c,2)) =
	    c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) +
            a(i * stride(a,1) + lbound(a,1), k * stride(a,2) +
lbound(a,2)) *
            b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2))
         end do
       end do
     end do
   END BLOCK

For this, I wrote something like a scalarizer that only operates on the
AST only.  I find it rather straigtforward to use (in contrast to the
one we have), but that may also be due to the fact that I wrote it
myself :-)

I chose zero-based loop variables because I didn't want to special-case
too many strange array bounds, and I trusted the induction variable
optimization and CSE in the middle end and in the RTL optimization.
If this approach suboptimizes something, let me know.

References are frozen to make sure that no additional evaluations are
done on.

The patch to simplify.c is not strictly necessary, but it should open
up some optimization possiblities and, frankly, makes the
-fdump-fortran-optimized output at least halfway readable.

This patch needs to be extended in different directions:

- Currently, it only handles the case of two-dimensional arguments

- Special cases like transpose(a) and spread(a) in the arguments
  arguments should be added, possibly with a reordering of loops
  and addition of some more variables.

- Temporary handling for arguments and results, which does not depend
  on allocatable RHS.  I am thinking of creating nested blocks,
  where the inner block has arrays with the bounds of the
  references.

- Cases like matmul(a+1,b) can also be handled in the loop,
  without a temporary

- This needs command-line arguments, -finline-matmul and
  -finline-matmul-limit=, similar to the handling of BLAS.

- Bounds checking could/should also be handled.  Currently, this
  causes the only regression because of a different error message.
  We need to call gfortran_runtime_error directly from the front end,
  and to be able to tell the bounds-checking code "do not worry about
  bounds checking this, it has already been taken care of".

- I already have some test cases, but for this, I would really like
  to cover all code paths.  This also needs some more work.

So, what do you think about this?

	Thomas

2015-04-05  Thomas Koenig  <tkoenig@gcc.gnu.org>

        * frontend-passes.c (create_var):  Add optional argument
        vname as part of the name.  Split off block creation into
        (insert_block): New function.
        (cfe_expr_0):  Use "fcn" as part of tempoary name.
        (optimize_namesapce):  Call optimize_matmul_assign.
        (combine_array_constructor):  Use "constr" as part of
        temporary name.
        (get_array_inq_function):  New function.
        (is_function_or_op):  New function.
        (has_function_or_op):  New function.
        (freeze_expr):  New function.
        (freeze_references):  New function.
        (convert_to_index_kind):  New function.
        (create_do_loop):  New function.
        (get_operand):  New function.
        (get_size_1):  New function.
        (scalarize_expr):  New function.
        (optimize_matmul_assign): New function.
        * simplify.c (simplify_bound):  Simplify the case of the
        lower bound of an assumed-shape argument.

[-- Attachment #2: matmul-1.diff --]
[-- Type: text/x-patch, Size: 19198 bytes --]

Index: frontend-passes.c
===================================================================
--- frontend-passes.c	(Revision 221757)
+++ frontend-passes.c	(Arbeitskopie)
@@ -43,7 +43,11 @@ static void doloop_warn (gfc_namespace *);
 static void optimize_reduction (gfc_namespace *);
 static int callback_reduction (gfc_expr **, int *, void *);
 static void realloc_strings (gfc_namespace *);
-static gfc_expr *create_var (gfc_expr *);
+static gfc_expr *create_var (gfc_expr *, const char *vname=NULL);
+static int optimize_matmul_assign (gfc_code **, int *, void *);
+static gfc_code * create_do_loop (gfc_expr *, gfc_expr *, gfc_expr *,
+				  locus *, gfc_namespace *, 
+				  char *vname=NULL);
 
 /* How deep we are inside an argument list.  */
 
@@ -95,6 +99,10 @@ static bool in_assoc_list;
 
 /* Entry point - run all passes for a namespace.  */
 
+/* Counter for temporary variables.  */
+
+static int var_num = 1;
+
 void
 gfc_run_passes (gfc_namespace *ns)
 {
@@ -157,7 +165,7 @@ realloc_string_callback (gfc_code **c, int *walk_s
     return 0;
   
   current_code = c;
-  n = create_var (expr2);
+  n = create_var (expr2, "trim");
   co->expr2 = n;
   return 0;
 }
@@ -524,29 +532,11 @@ constant_string_length (gfc_expr *e)
 
 }
 
-/* Returns a new expression (a variable) to be used in place of the old one,
-   with an assignment statement before the current statement to set
-   the value of the variable. Creates a new BLOCK for the statement if
-   that hasn't already been done and puts the statement, plus the
-   newly created variables, in that block.  Special cases:  If the
-   expression is constant or a temporary which has already
-   been created, just copy it.  */
-
-static gfc_expr*
-create_var (gfc_expr * e)
+static gfc_namespace*
+insert_block ()
 {
-  char name[GFC_MAX_SYMBOL_LEN +1];
-  static int num = 1;
-  gfc_symtree *symtree;
-  gfc_symbol *symbol;
-  gfc_expr *result;
-  gfc_code *n;
   gfc_namespace *ns;
-  int i;
 
-  if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e))
-    return gfc_copy_expr (e);
-
   /* If the block hasn't already been created, do so.  */
   if (inserted_block == NULL)
     {
@@ -578,7 +568,37 @@ constant_string_length (gfc_expr *e)
   else
     ns = inserted_block->ext.block.ns;
 
-  sprintf(name, "__var_%d",num++);
+  return ns;
+}
+
+/* Returns a new expression (a variable) to be used in place of the old one,
+   with an optional assignment statement before the current statement to set
+   the value of the variable. Creates a new BLOCK for the statement if that
+   hasn't already been done and puts the statement, plus the newly created
+   variables, in that block.  Special cases: If the expression is constant or
+   a temporary which has already been created, just copy it.  */
+
+static gfc_expr*
+create_var (gfc_expr * e, const char *vname)
+{
+  char name[GFC_MAX_SYMBOL_LEN +1];
+  gfc_symtree *symtree;
+  gfc_symbol *symbol;
+  gfc_expr *result;
+  gfc_code *n;
+  gfc_namespace *ns;
+  int i;
+
+  if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e))
+    return gfc_copy_expr (e);
+
+  ns = insert_block ();
+
+  if (vname)
+    snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d_%s", var_num++, vname);
+    else
+  sprintf (name, "__var_%d", var_num++);
+
   if (gfc_get_sym_tree (name, ns, &symtree, false) != 0)
     gcc_unreachable ();
 
@@ -658,6 +678,7 @@ constant_string_length (gfc_expr *e)
 		     "Creating array temporary at %L", &(e->where));
     }
 
+
   /* Generate the new assignment.  */
   n = XCNEW (gfc_code);
   n->op = EXEC_ASSIGN;
@@ -724,7 +745,7 @@ cfe_expr_0 (gfc_expr **e, int *walk_subtrees,
 	  if (gfc_dep_compare_functions (*ei, *ej, true) == 0)
 	    {
 	      if (newvar == NULL)
-		newvar = create_var (*ei);
+		newvar = create_var (*ei, "fcn");
 
 	      if (warn_function_elimination)
 		do_warn_function_elimination (*ej);
@@ -931,6 +952,7 @@ convert_elseif (gfc_code **c, int *walk_subtrees A
   /*  Don't walk subtrees.  */
   return 0;
 }
+
 /* Optimize a namespace, including all contained namespaces.  */
 
 static void
@@ -947,6 +969,7 @@ optimize_namespace (gfc_namespace *ns)
   gfc_code_walker (&ns->code, convert_elseif, dummy_expr_callback, NULL);
   gfc_code_walker (&ns->code, cfe_code, cfe_expr_0, NULL);
   gfc_code_walker (&ns->code, optimize_code, optimize_expr, NULL);
+  gfc_code_walker (&ns->code, optimize_matmul_assign, dummy_expr_callback, NULL);
 
   /* BLOCKs are handled in the expression walker below.  */
   for (ns = ns->contained; ns; ns = ns->sibling)
@@ -1222,7 +1245,7 @@ combine_array_constructor (gfc_expr *e)
   if (op2->ts.type == BT_CHARACTER)
     return false;
 
-  scalar = create_var (gfc_copy_expr (op2));
+  scalar = create_var (gfc_copy_expr (op2), "constr");
 
   oldbase = op1->value.constructor;
   newbase = NULL;
@@ -1939,7 +1962,576 @@ doloop_warn (gfc_namespace *ns)
   gfc_code_walker (&ns->code, doloop_code, do_function, NULL);
 }
 
+/* Auxiliary function to build and simplify an array inquiry function.
+   dim is zero-based.  */
 
+static gfc_expr *
+get_array_inq_function (gfc_expr *e, int dim, gfc_isym_id id)
+{
+
+  gfc_expr *fcn;
+  gfc_actual_arglist *actual_arglist, *n1, *n2;
+  gfc_expr *dim_arg, *kind;
+  const char *name;
+
+  fcn = gfc_get_expr ();
+  fcn->expr_type = EXPR_FUNCTION;
+  fcn->value.function.isym = gfc_intrinsic_function_by_id (id);
+  actual_arglist = gfc_get_actual_arglist ();
+  actual_arglist->expr = gfc_copy_expr (e);
+
+  n1 = gfc_get_actual_arglist ();
+  dim_arg = gfc_get_int_expr (gfc_default_integer_kind, &e->where, dim);
+  n1->expr = dim_arg;
+  actual_arglist->next = n1;
+
+  n2 = gfc_get_actual_arglist ();
+  kind = gfc_get_int_expr (gfc_default_integer_kind, &e->where,
+			   gfc_index_integer_kind);
+  n2->expr = kind;
+  n1->next = n2;
+
+  fcn->value.function.actual = actual_arglist;
+  fcn->where = e->where;
+  fcn->ts.type = BT_INTEGER;
+  fcn->ts.kind = gfc_index_integer_kind;
+
+  switch (id)
+    {
+    case GFC_ISYM_LBOUND:
+      name = "__internal_lbound";
+      break;
+
+    case GFC_ISYM_UBOUND:
+      name = "__internal_ubound";
+      break;
+
+    case GFC_ISYM_SIZE:
+      name = "__internal_size";
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  gfc_get_sym_tree (name, current_ns, &fcn->symtree, false);
+  fcn->symtree->n.sym->ts = fcn->ts;
+  fcn->symtree->n.sym->attr.flavor = FL_PROCEDURE;
+  fcn->symtree->n.sym->attr.function = 1;
+  fcn->symtree->n.sym->attr.elemental = 1;
+  fcn->symtree->n.sym->attr.referenced = 1;
+  fcn->symtree->n.sym->attr.access = ACCESS_PRIVATE;
+  gfc_commit_symbol (fcn->symtree->n.sym);
+  gfc_simplify_expr (fcn, 0);
+  return fcn;
+}
+
+/* Callback function for has_function_or_op.  */
+
+static int
+is_function_or_op (gfc_expr **e, int *walk_subtrees ATTRIBUTE_UNUSED,
+	     void *data ATTRIBUTE_UNUSED)
+{
+  if ((*e) == 0)
+    return 0;
+  else
+    return (*e)->expr_type == EXPR_FUNCTION
+      || (*e)->expr_type == EXPR_OP;
+}
+
+/* Returns true if the expression contains a function.  */
+
+static bool
+has_function_or_op (gfc_expr **e)
+{
+  if (e == NULL)
+    return false;
+  else
+    return gfc_expr_walker (e, is_function_or_op, NULL);
+}
+
+/* Freeze a single expression.  */
+
+static void
+freeze_expr (gfc_expr **ep)
+{
+  gfc_expr *ne;
+  if (has_function_or_op (ep))
+    {
+      ne = create_var (*ep, "freeze");
+      *ep = ne;
+    }
+}
+
+/* Go through an expression's references and assign them to temporary
+   variables if they contain functions.  This is usually done prior to
+   front-end scalarization to avoid multiple invocations of functions.  */
+
+static void
+freeze_references (gfc_expr *e)
+{
+  gfc_ref *r;
+  gfc_array_ref *ar;
+  int i;
+
+  for (r=e->ref; r; r=r->next)
+    {
+      if (r->type == REF_SUBSTRING)
+	{
+	  if (r->u.ss.start != NULL)
+	    freeze_expr (&r->u.ss.start);
+
+	  if (r->u.ss.end != NULL)
+	    freeze_expr (&r->u.ss.end);
+	}
+      else if (r->type == REF_ARRAY)
+	{
+	  ar = &r->u.ar;
+	  switch (ar->type)
+	    {
+	    case AR_FULL:
+	      break;
+
+	    case AR_SECTION:
+	      for (i=0; i<ar->dimen; i++)
+		{
+		  if (ar->dimen_type[i] == DIMEN_RANGE)
+		    {
+		      freeze_expr (&ar->start[i]);
+		      freeze_expr (&ar->end[i]);
+		      freeze_expr (&ar->stride[i]);
+		    }
+		}
+	    default:
+	      break;
+	    }
+	}
+    }
+}
+
+/* Convert to gfc_index_integer_kind if needed, just do a copy otherwise.  */
+
+static gfc_expr *
+convert_to_index_kind (gfc_expr *e)
+{
+  gfc_expr *res;
+
+  gcc_assert (e != NULL);
+
+  res = gfc_copy_expr (e);
+
+  gcc_assert (e->ts.type == BT_INTEGER);
+
+  if (res->ts.kind != gfc_index_integer_kind)
+    {
+      gfc_typespec ts;
+      gfc_clear_ts (&ts);
+      ts.type = BT_INTEGER;
+      ts.kind = gfc_index_integer_kind;
+
+      gfc_convert_type_warn (e, &ts, 2, 0);
+    }
+
+  return res;
+}
+
+/* Function to create a DO loop including creation of the
+   iteration variable.  gfc_expr are copied.*/
+
+static gfc_code *
+create_do_loop (gfc_expr *start, gfc_expr *end, gfc_expr *step, locus *where,
+		gfc_namespace *ns, char *vname)
+{
+
+  char name[GFC_MAX_SYMBOL_LEN +1];
+  gfc_symtree *symtree;
+  gfc_symbol *symbol;
+  gfc_expr *i;
+  gfc_code *n, *n2;
+
+  /* Create an expression for the iteration variable.  */
+  if (vname)
+    sprintf (name, "__var_%d_do_%s", var_num++, vname);
+  else
+    sprintf (name, "__var_%d_do", var_num++);
+
+
+  if (gfc_get_sym_tree (name, ns, &symtree, false) != 0)
+    gcc_unreachable ();
+
+  symbol = symtree->n.sym;
+  symbol->ts.type = BT_INTEGER;
+  symbol->ts.kind = gfc_index_integer_kind;
+  symbol->attr.flavor = FL_VARIABLE;
+  symbol->attr.referenced = 1;
+  symbol->attr.dimension = 0;
+  symbol->attr.fe_temp = 1;
+  gfc_commit_symbol (symbol);
+
+  i = gfc_get_expr ();
+  i->expr_type = EXPR_VARIABLE;
+  i->ts = symbol->ts;
+  i->rank = 0;
+  i->where = *where;
+  i->symtree = symtree;
+
+  n = XCNEW (gfc_code);
+  n->op = EXEC_DO;
+  n->loc = *where;
+  n->ext.iterator = gfc_get_iterator ();
+  n->ext.iterator->var = i;
+  n->ext.iterator->start = convert_to_index_kind (start);
+  n->ext.iterator->end = convert_to_index_kind (end);
+  if (step)
+    n->ext.iterator->step = convert_to_index_kind (step);
+  else
+    n->ext.iterator->step = gfc_get_int_expr (gfc_index_integer_kind,
+					      where, 1);
+
+  n2 = XCNEW (gfc_code);
+  n2->op = EXEC_DO;
+  n2->loc = *where;
+  n2->next = NULL;
+  n->block = n2;
+  return n;
+}
+
+/* Return an operation of one two gfc_expr (one if e2 is NULL. This assumes
+   compatible typespecs.  */
+
+static gfc_expr *
+get_operand (gfc_intrinsic_op op, gfc_expr *e1, gfc_expr *e2)
+{
+  gfc_expr *res;
+
+  res = gfc_get_expr ();
+  res->ts = e1->ts;
+  res->where = e1->where;
+  res->expr_type = EXPR_OP;
+  res->value.op.op = op;
+  res->value.op.op1 = e1;
+  res->value.op.op2 = e2;
+  gfc_simplify_expr (res, 0);
+  return res;
+}
+
+/* Get the upper bound of the DO loops for matmul along
+   a dimension, i.e. size minus one.  */
+static gfc_expr*
+get_size_m1 (gfc_expr *e, int dimen)
+{
+  mpz_t size;
+  gfc_expr *res;
+
+  if (gfc_array_dimen_size (e, dimen, &size))
+    {
+      res = gfc_get_constant_expr (BT_INTEGER,
+				   gfc_index_integer_kind, &e->where);
+      mpz_sub_ui (res->value.integer, size, 1);
+      mpz_clear (size);
+    }
+  else
+    {
+      res = get_operand (INTRINSIC_MINUS,
+			 get_array_inq_function (e, dimen + 1, GFC_ISYM_SIZE),
+			 gfc_get_int_expr (gfc_index_integer_kind,
+					   &e->where, 1));
+      gfc_simplify_expr (res, 0);
+    }
+
+  return res;
+}
+
+/* Function to return a scalarized expression. It is assumed that indices are
+ zero based to make generation of DO loops easier.  A zero as index will
+ access the first element along a dimension.  Single element references will
+ be skipped.  A NULL as an expression will be replaced by a full reference.
+ This assumes that the index loops have gfc_index_integer_kind, and that all
+ references have been frozen.  */
+
+static gfc_expr*
+scalarized_expr (gfc_expr *e_in, gfc_expr **index, int count_index)
+{
+  gfc_ref *ref;
+  gfc_array_ref *ar;
+  int i;
+  int rank;
+  gfc_expr *e;
+  int i_index;
+  bool was_fullref;
+
+  e = gfc_copy_expr(e_in);
+
+  rank = e->rank;
+
+  ref = e->ref;
+
+  while (ref)
+    {
+      if (ref->type == REF_ARRAY
+	  && (ref->u.ar.type == AR_FULL || ref->u.ar.type == AR_SECTION))
+	break;
+    }
+  ar = &ref->u.ar;
+
+  /* We scalarize count_index variables, reducing the rank by count_index.  */
+
+  e->rank = rank - count_index;
+
+  was_fullref = ar->type == AR_FULL;
+
+  if (e->rank == 0)
+    ar->type = AR_ELEMENT;
+  else
+    ar->type = AR_SECTION;
+
+  ar->dimen = rank;
+
+  /* Loop over the indices.  For each index, create the expression
+     index + lbound(e, dim).  */
+  
+  i_index = 0;
+  for (i=0; i<rank; i++)
+    {
+      if (was_fullref || ar->dimen_type[i] == DIMEN_RANGE)
+	{
+	  if (index[i_index] != NULL)
+	    {
+	      gfc_expr *lbound, *nindex;
+	      gfc_expr *loopvar;
+	      
+	      loopvar = gfc_copy_expr (index[i_index]); 
+	      
+	      if (ar->stride[i])
+		{
+		  gfc_expr *tmp;
+
+		  tmp = gfc_copy_expr(ar->stride[i]);
+		  if (tmp->ts.kind != gfc_index_integer_kind)
+		    {
+		      gfc_typespec ts;
+		      gfc_clear_ts (&ts);
+		      ts.type = BT_INTEGER;
+		      ts.kind = gfc_index_integer_kind;
+		      gfc_convert_type (tmp, &ts, 2);
+		    }
+		  nindex = get_operand (INTRINSIC_TIMES, loopvar, tmp);
+		}
+	      else
+		nindex = loopvar;
+	      
+	      if (ar->start[i])
+		{
+		  lbound = gfc_copy_expr (ar->start[i]);
+		  if (lbound->ts.kind != gfc_index_integer_kind)
+		    {
+		      gfc_typespec ts;
+		      gfc_clear_ts (&ts);
+		      ts.type = BT_INTEGER;
+		      ts.kind = gfc_index_integer_kind;
+		      gfc_convert_type (lbound, &ts, 2);
+
+		    }
+		}
+	      else
+		lbound = get_array_inq_function (e_in, i+1, GFC_ISYM_LBOUND);
+
+	      ar->dimen_type[i] = DIMEN_ELEMENT;
+	      ar->start[i] = get_operand (INTRINSIC_PLUS, nindex, lbound);
+	      ar->end[i] = NULL;
+	      ar->stride[i] = NULL;
+	      gfc_simplify_expr (ar->start[i], 0);
+	    }
+	  else if (was_fullref)
+	    {
+	      ar->dimen_type[i] = DIMEN_RANGE;
+	      ar->start[i] = NULL;
+	      ar->end[i] = NULL;
+	      ar->stride[i] = NULL;
+	    }
+	  i_index ++;
+	}
+    }
+  return e;
+}
+
+
+/* Optimize assignments of the form c = matmul(a,b).
+   Handle only the cases currently where b and c are rank-two arrays.
+
+   This translates the code to
+
+   BLOCK
+     integer i,j,k
+     c = 0
+     do j=0, size(b,2)-1
+       do k=0, size(a, 2)-1
+         do i=0, size(a, 1)-1
+            c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) =
+	    c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) +
+            a(i * stride(a,1) + lbound(a,1), k * stride(a,2) + lbound(a,2)) *
+            b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2))
+         end do
+       end do
+     end do
+   END BLOCK
+   
+*/
+
+static int
+optimize_matmul_assign (gfc_code **c,
+			  int *walk_subtrees ATTRIBUTE_UNUSED,
+			  void *data ATTRIBUTE_UNUSED)
+{
+  gfc_code *co = *c;
+  gfc_expr *expr1, *expr2;
+  gfc_expr *matrix_a, *matrix_b;
+  gfc_actual_arglist *a, *b;
+  gfc_code *do_1, *do_2, *do_3, *assign_zero, *assign_matmul;
+  gfc_expr *zero_e;
+  gfc_expr *u1, *u2, *u3;
+  gfc_expr *list[2];
+  gfc_expr *ascalar, *bscalar, *cscalar;
+  gfc_expr *mult;
+  gfc_expr *var_1, *var_2, *var_3;
+  gfc_namespace *ns;
+  gfc_intrinsic_op op_times, op_plus;
+
+  if (co->op != EXEC_ASSIGN)
+    return 0;
+
+  expr1 = co->expr1;
+  expr2 = co->expr2;
+  if (expr2->expr_type != EXPR_FUNCTION
+      || expr2->value.function.isym == NULL
+      || expr2->value.function.isym->id != GFC_ISYM_MATMUL)
+    return 0;
+
+  /* Handling only two matrices of dimension two for the time being.  */
+
+  current_code = c;
+  inserted_block = NULL;
+  changed_statement = NULL;
+
+  a = expr2->value.function.actual;
+  matrix_a = a->expr;
+  b = a->next;
+  matrix_b = b->expr;
+
+  if (matrix_a->expr_type != EXPR_VARIABLE
+      || matrix_b->expr_type != EXPR_VARIABLE)
+    return 0;
+
+  if (matrix_a->rank != 2 || matrix_b->rank != 2)
+    return 0;
+
+  if (gfc_check_dependency (expr1, matrix_a, true)
+      || gfc_check_dependency (expr1, matrix_b, true))
+    return 0;
+
+  ns = insert_block ();
+
+  switch(expr1->ts.type)
+    {
+    case BT_INTEGER:
+      zero_e = gfc_get_int_expr (expr1->ts.kind, &expr1->where, 0);
+      op_times = INTRINSIC_TIMES;
+      op_plus = INTRINSIC_PLUS;
+      break;
+
+    case BT_LOGICAL:
+      op_times = INTRINSIC_AND;
+      op_plus = INTRINSIC_OR;
+      zero_e = gfc_get_logical_expr (expr1->ts.kind, &expr1->where,
+				     0);
+      break;
+    case BT_REAL:
+      zero_e = gfc_get_constant_expr (BT_REAL, expr1->ts.kind,
+				      &expr1->where);
+      mpfr_set_si (zero_e->value.real, 0, GFC_RND_MODE);
+      op_times = INTRINSIC_TIMES;
+      op_plus = INTRINSIC_PLUS;
+      break;
+
+    case BT_COMPLEX:
+      zero_e = gfc_get_constant_expr (BT_COMPLEX, expr1->ts.kind,
+				      &expr1->where);
+      mpc_set_si_si (zero_e->value.complex, 0, 0, GFC_RND_MODE);
+      op_times = INTRINSIC_TIMES;
+      op_plus = INTRINSIC_PLUS;
+
+      break;
+
+    default:
+      gcc_unreachable();
+    }
+
+  assign_zero = XCNEW (gfc_code);
+  assign_zero->op = EXEC_ASSIGN;
+  assign_zero->loc = co->loc;
+  assign_zero->expr1 = gfc_copy_expr (expr1);
+  assign_zero->expr2 = zero_e;
+
+  ns->code = assign_zero;
+  current_code = &ns->code;
+
+  freeze_references (matrix_a);
+  freeze_references (matrix_b);
+
+  u1 = get_size_m1 (matrix_b, 1);
+  u2 = get_size_m1 (matrix_a, 1);
+  u3 = get_size_m1 (matrix_a, 0);
+
+  do_1 = create_do_loop (gfc_get_int_expr (gfc_index_integer_kind,
+  					    &co->loc, 0),
+			 u1, NULL, &co->loc, ns);
+
+  do_2 = create_do_loop (gfc_get_int_expr (gfc_index_integer_kind,
+					   &co->loc, 0),
+			 u2, NULL, &co->loc, ns);
+
+  do_1->block->next = do_2;
+
+  do_3 = create_do_loop (gfc_get_int_expr (gfc_index_integer_kind,
+                                            &co->loc, 0),
+			 u3, NULL, &co->loc, ns);
+
+  do_2->block->next = do_3;
+
+  var_1 = do_1->ext.iterator->var;
+  var_2 = do_2->ext.iterator->var;
+  var_3 = do_3->ext.iterator->var;
+
+  assign_zero->next = do_1;
+
+  assign_matmul = XCNEW (gfc_code);
+  assign_matmul->op = EXEC_ASSIGN;
+  assign_matmul->loc = co->loc;
+
+  list[0] = var_3;
+  list[1] = var_1;
+  cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 2);
+
+  list[0] = var_3;
+  list[1] = var_2;
+  ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2);
+
+  list[0] = var_2;
+  list[1] = var_1;
+  bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2);
+
+  assign_matmul->expr1 = gfc_copy_expr (cscalar);
+
+  mult = get_operand (op_times, ascalar, bscalar);
+  assign_matmul->expr2 = get_operand (op_plus, cscalar, mult);
+
+  do_3->block->next = assign_matmul;
+
+  co->next = NULL;
+  gfc_free_statements(co);
+  return 0;
+}
+
 #define WALK_SUBEXPR(NODE) \
   do							\
     {							\

[-- Attachment #3: simplify-1.diff --]
[-- Type: text/x-patch, Size: 819 bytes --]

Index: simplify.c
===================================================================
--- simplify.c	(Revision 221757)
+++ simplify.c	(Arbeitskopie)
@@ -3445,6 +3445,25 @@ simplify_bound (gfc_expr *array, gfc_expr *dim, gf
 
  done:
 
+  if (!upper && as->type == AS_ASSUMED_SHAPE && dim)
+    {
+      if (dim->expr_type == EXPR_CONSTANT)
+	{
+	  unsigned long int ndim;
+	  gfc_expr *lower, *res;
+
+	  ndim = mpz_get_si (dim->value.integer) - 1;
+	  lower = as->lower[ndim];
+	  if (lower->expr_type == EXPR_CONSTANT)
+	    {
+	      int nkind = mpz_get_si (kind->value.integer);
+	      res = gfc_copy_expr (lower);
+	      res->ts.kind = nkind;
+	      return res;
+	    }
+	}
+    }
+
   if (as && (as->type == AS_DEFERRED || as->type == AS_ASSUMED_SHAPE
 	     || as->type == AS_ASSUMED_RANK))
     return NULL;

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch, fortran, RFC] First steps towards inlining matmul
  2015-04-05 12:32 Thomas Koenig
@ 2015-04-05 13:20 ` Jerry DeLisle
  2015-04-05 13:48   ` Thomas Koenig
  2015-04-07 19:24 ` David Malcolm
  1 sibling, 1 reply; 16+ messages in thread
From: Jerry DeLisle @ 2015-04-05 13:20 UTC (permalink / raw)
  To: Thomas Koenig, fortran, gcc-patches

On 04/05/2015 05:32 AM, Thomas Koenig wrote:
--- snip ---
>
> So, what do you think about this?
>
> 	Thomas

I am curious about what performance gain results from this? I can see saving a 
library call to our runtime libraries.  Do you have some timing results?

Jerry

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch, fortran, RFC] First steps towards inlining matmul
  2015-04-05 13:20 ` Jerry DeLisle
@ 2015-04-05 13:48   ` Thomas Koenig
  0 siblings, 0 replies; 16+ messages in thread
From: Thomas Koenig @ 2015-04-05 13:48 UTC (permalink / raw)
  To: Jerry DeLisle, fortran, gcc-patches

Hi Jerry,

> I am curious about what performance gain results from this? I can see
> saving a library call to our runtime libraries.  Do you have some timing
> results?

The speedup can be quite drastic for small matrices which can be
completely unrolled by -O3:

b1.f90:

program main
  use b2
  implicit none
  real, dimension(3,3) :: a, b, c
  integer :: i

  call random_number(a)
  call random_number(b)
  do i=1,10**8
     c = matmul(a,b)
     call bar(b,c)
  end do
end program main

b2.f90:

module b2
contains
  subroutine bar(b,c)
    real, dimension(3,3) :: b,c
  end subroutine bar
end module b2

ig25@linux-fd1f:~/Krempel/Matmul> gfortran -O3 -fno-frontend-optimize
b2.f90 b1.f90 && time ./a.out

real    0m15.411s
user    0m15.404s
sys     0m0.001s

ig25@linux-fd1f:~/Krempel/Matmul> gfortran -O3 b2.f90 b1.f90 && time ./a.out

real    0m1.736s
user    0m1.735s
sys     0m0.001s

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch, fortran, RFC] First steps towards inlining matmul
  2015-04-05 12:32 Thomas Koenig
  2015-04-05 13:20 ` Jerry DeLisle
@ 2015-04-07 19:24 ` David Malcolm
  1 sibling, 0 replies; 16+ messages in thread
From: David Malcolm @ 2015-04-07 19:24 UTC (permalink / raw)
  To: Thomas Koenig; +Cc: fortran, gcc-patches

On Sun, 2015-04-05 at 14:32 +0200, Thomas Koenig wrote:
> Hello world,
> 
> this is a first draft of a patch to inline matmul (PR 37171). This is

(FWIW, the above PR# looks like it should be PR 37131)


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-04-11 17:01 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-05 13:55 [patch, fortran, RFC] First steps towards inlining matmul Dominique Dhumieres
2015-04-05 14:33 ` Thomas Koenig
2015-04-05 18:25   ` Dominique d'Humières
2015-04-05 20:37     ` Thomas Koenig
2015-04-05 23:15       ` Dominique d'Humières
2015-04-06 12:18         ` Dominique d'Humières
2015-04-09 22:18           ` Thomas Koenig
2015-04-10 14:15             ` Dominique d'Humières
2015-04-10 16:57               ` Dominique d'Humières
2015-04-11 12:24               ` Thomas Koenig
2015-04-11 13:43                 ` Mikael Morin
2015-04-11 17:01                   ` Thomas Koenig
  -- strict thread matches above, loose matches on Subject: below --
2015-04-05 12:32 Thomas Koenig
2015-04-05 13:20 ` Jerry DeLisle
2015-04-05 13:48   ` Thomas Koenig
2015-04-07 19:24 ` David Malcolm

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).