* Re: [patch, fortran, RFC] First steps towards inlining matmul
@ 2015-04-05 13:55 Dominique Dhumieres
2015-04-05 14:33 ` Thomas Koenig
0 siblings, 1 reply; 16+ messages in thread
From: Dominique Dhumieres @ 2015-04-05 13:55 UTC (permalink / raw)
To: tkoenig; +Cc: gcc-patches, fortran, jvdelisle
> > So, what do you think about this?
>
> I am curious about what performance gain results from this?
> I can see saving a library call to our runtime libraries.
> Do you have some timing results?
>
> Jerry
IMO the inlining of MATMUL should be restricted to small matrices (less than 4x4, 9x9
or 16x16 depending of your field!-). I have a variant of the polyhedron test induct.f90
in which I have replaced three dot products by the equivalent matmul and which executes
ten times slower than the original test.
Dominique
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul
2015-04-05 13:55 [patch, fortran, RFC] First steps towards inlining matmul Dominique Dhumieres
@ 2015-04-05 14:33 ` Thomas Koenig
2015-04-05 18:25 ` Dominique d'Humières
0 siblings, 1 reply; 16+ messages in thread
From: Thomas Koenig @ 2015-04-05 14:33 UTC (permalink / raw)
To: Dominique Dhumieres; +Cc: gcc-patches, fortran, jvdelisle
Hi Dominique,
> IMO the inlining of MATMUL should be restricted to small matrices (less than 4x4, 9x9
> or 16x16 depending of your field!-)
The problem with the library function we have is that it is quite
general; it can deal with all the complexity of assumed-shape array
arguments. Inlining allows the compiler to take advantage of
contiguous memory, compile-time knowledge of array sizes, etc.
to allow for vectorization or even complete unrolling.
Of course, if you use c=matmul(a,b) where all three are assumed-shape
arrays, your advantage over the library function will be minimal at
best.
It will be interesting to see at what size calling BLAS directly
will be better.
Thomas
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul
2015-04-05 14:33 ` Thomas Koenig
@ 2015-04-05 18:25 ` Dominique d'Humières
2015-04-05 20:37 ` Thomas Koenig
0 siblings, 1 reply; 16+ messages in thread
From: Dominique d'Humières @ 2015-04-05 18:25 UTC (permalink / raw)
To: Thomas Koenig; +Cc: gcc-patches, fortran, jvdelisle
I have done some timings
(1) with the test given below, before the patch I get (last column in Gflops)
[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
Time, MATMUL: 373.708008 373.69497100000001 4.2815668504139435
Time, MATMUL: 172.086609 23.117919000000001 69.210381782201068
545.374u 0.537s 6:36.93 137.5% 0+0k 0+0io 0pf+0w
[Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
Time, MATMUL: 176.327881 23.855111999999998 67.071577781735002
Time, MATMUL: 182.086746 24.453551000000001 65.430170039516952
358.059u 0.471s 0:48.43 740.2% 0+0k 0+0io 0pf+0w
after the patch
[Book15] f90/bug% time a.out
Time, MATMUL: 392.415436 392.41728799999999 4.0772923337669056
Time, MATMUL: 171.690399 22.905118000000002 69.853383859450091
563.671u 0.551s 6:55.44 135.8% 0+0k 0+0io 0pf+0w
[Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
Time, MATMUL: 392.850342 392.84190799999999 4.0728852177349673
Time, MATMUL: 174.534821 23.784797000000001 67.269861500184334
566.824u 0.674s 6:56.74 136.1% 0+0k 0+0io 0pf+0w
which means that -fexternal-blas should disable the inlining.
program t2
implicit none
REAL time_begin, time_end
integer, parameter :: n=2000;
integer(8) :: ts, te, rate8, cmax8
real(8) :: elapsed
REAL(8) :: a(n,n), b(n,n), c(n,n)
integer, parameter :: m = 100
integer :: i
call RANDOM_NUMBER(a)
call RANDOM_NUMBER(b)
call cpu_time(time_begin)
call SYSTEM_CLOCK (ts, rate8, cmax8)
do i = 1,m
a(1,1) = a(1,1) + 0.1
c = MATMUL(a,b)
enddo
call SYSTEM_CLOCK (te, rate8, cmax8)
call cpu_time(time_end)
elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
PRINT *, 'Time, MATMUL: ',time_end-time_begin, elapsed , 2*m*real(n, kind=8)**3/(10**9*elapsed)
call cpu_time(time_begin)
call SYSTEM_CLOCK (ts, rate8, cmax8)
do i = 1,m
a(1,1) = a(1,1) + 0.1
call dgemm('n','n',n, n, n, dble(1.0), a, n, b, n, dble(0.0), c, n)
enddo
call SYSTEM_CLOCK (te, rate8, cmax8)
call cpu_time(time_end)
elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
PRINT *, 'Time, MATMUL: ',time_end-time_begin, elapsed , 2*m*real(n, kind=8)**3/(10**9*elapsed)
end program
! http://groups.google.com/group/comp.lang.fortran/browse_thread/thread/1cba8e6ce5080197
(2) The polyhedron test compiled with -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto runs in ~6.5s. After applying the patch below, it runs in ~66s with/without the patch.
Dominique
--- induct.f90 2005-10-11 22:53:32.000000000 +0200
+++ induct_vmc.f90 2015-04-05 19:06:30.000000000 +0200
@@ -1644,18 +1644,17 @@ contains
coil_tmp_vector(1) = -sin(theta)
coil_tmp_vector(2) = cos(theta)
coil_tmp_vector(3) = 0.0_longreal
- coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:))
- coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:))
- coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:))
+ coil_current_vec = matmul(rotate_coil,coil_tmp_vector)
!
do j = 1, 9
c_vector(3) = 0.5 * h_coil * z1gauss(j)
!
! rotate coil vector into the global coordinate system and translate it
!
- rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
- rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
- rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
+ rot_c_vector = matmul(rotate_coil,c_vector)
+ rot_c_vector(1) = rot_c_vector(1) + dx
+ rot_c_vector(2) = rot_c_vector(2) + dy
+ rot_c_vector(3) = rot_c_vector(3) + dz
!
do k = 1, 9
q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
@@ -1664,9 +1663,7 @@ contains
!
! rotate quad vector into the global coordinate system
!
- rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
- rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
- rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
+ rot_q_vector = matmul(rotate_quad,q_vector)
!
! compute and add in quadrature term
!
@@ -1756,18 +1753,17 @@ contains
coil_tmp_vector(1) = -sin(theta)
coil_tmp_vector(2) = cos(theta)
coil_tmp_vector(3) = 0.0_longreal
- coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:))
- coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:))
- coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:))
+ coil_current_vec = matmul(rotate_coil,coil_tmp_vector)
!
do j = 1, 9
c_vector(3) = 0.5 * h_coil * z1gauss(j)
!
! rotate coil vector into the global coordinate system and translate it
!
- rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
- rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
- rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
+ rot_c_vector = matmul(rotate_coil,c_vector)
+ rot_c_vector(1) = rot_c_vector(1) + dx
+ rot_c_vector(2) = rot_c_vector(2) + dy
+ rot_c_vector(3) = rot_c_vector(3) + dz
!
do k = 1, 9
q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
@@ -1776,9 +1772,7 @@ contains
!
! rotate quad vector into the global coordinate system
!
- rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
- rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
- rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
+ rot_q_vector = matmul(rotate_quad,q_vector)
!
! compute and add in quadrature term
!
@@ -2061,18 +2055,17 @@ contains
!
! compute current vector for the coil in the global coordinate system
!
- coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:))
- coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:))
- coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:))
+ coil_current_vec = matmul(rotate_coil,coil_tmp_vector)
!
do j = 1, 9
c_vector(3) = 0.5 * h_coil * z1gauss(j)
!
! rotate coil vector into the global coordinate system and translate it
!
- rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
- rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
- rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
+ rot_c_vector = matmul(rotate_coil,c_vector)
+ rot_c_vector(1) = rot_c_vector(1) + dx
+ rot_c_vector(2) = rot_c_vector(2) + dy
+ rot_c_vector(3) = rot_c_vector(3) + dz
!
do k = 1, 9
q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
@@ -2081,9 +2074,7 @@ contains
!
! rotate quad vector into the global coordinate system
!
- rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
- rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
- rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
+ rot_q_vector = matmul(rotate_quad,q_vector)
!
! compute and add in quadrature term
!
@@ -2204,18 +2195,17 @@ contains
!
! compute current vector for the coil in the global coordinate system
!
- coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:))
- coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:))
- coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:))
+ coil_current_vec = matmul(rotate_coil,coil_tmp_vector)
!
do j = 1, 9
c_vector(3) = 0.5 * h_coil * z1gauss(j)
!
! rotate coil vector into the global coordinate system and translate it
!
- rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx
- rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy
- rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz
+ rot_c_vector = matmul(rotate_coil,c_vector)
+ rot_c_vector(1) = rot_c_vector(1) + dx
+ rot_c_vector(2) = rot_c_vector(2) + dy
+ rot_c_vector(3) = rot_c_vector(3) + dz
!
do k = 1, 9
q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal)
@@ -2224,9 +2214,7 @@ contains
!
! rotate quad vector into the global coordinate system
!
- rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:))
- rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:))
- rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:))
+ rot_q_vector = matmul(rotate_quad,q_vector)
!
! compute and add in quadrature term
!
> Le 5 avr. 2015 à 16:33, Thomas Koenig <tkoenig@netcologne.de> a écrit :
>
> Hi Dominique,
>
>> IMO the inlining of MATMUL should be restricted to small matrices (less than 4x4, 9x9
>> or 16x16 depending of your field!-)
>
> The problem with the library function we have is that it is quite
> general; it can deal with all the complexity of assumed-shape array
> arguments. Inlining allows the compiler to take advantage of
> contiguous memory, compile-time knowledge of array sizes, etc.
> to allow for vectorization or even complete unrolling.
>
> Of course, if you use c=matmul(a,b) where all three are assumed-shape
> arrays, your advantage over the library function will be minimal at
> best.
>
> It will be interesting to see at what size calling BLAS directly
> will be better.
>
> Thomas
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul
2015-04-05 18:25 ` Dominique d'Humières
@ 2015-04-05 20:37 ` Thomas Koenig
2015-04-05 23:15 ` Dominique d'Humières
0 siblings, 1 reply; 16+ messages in thread
From: Thomas Koenig @ 2015-04-05 20:37 UTC (permalink / raw)
To: Dominique d'Humières; +Cc: gcc-patches, fortran, jvdelisle
Hi Dominique,
> which means that -fexternal-blas should disable the inlining.
It is not surprising that a higly tuned BLAS library is better than
a simple inlining for large matrices.
I did some tests by adjusting n; it seems the inline version is
faster for n<=22, which is not too bad.
Regarding your other test case: This tests matrix*vector
multiplication, which is not implemented yet :-)
Regards,
Thomas
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul
2015-04-05 20:37 ` Thomas Koenig
@ 2015-04-05 23:15 ` Dominique d'Humières
2015-04-06 12:18 ` Dominique d'Humières
0 siblings, 1 reply; 16+ messages in thread
From: Dominique d'Humières @ 2015-04-05 23:15 UTC (permalink / raw)
To: Thomas Koenig; +Cc: gcc-patches, fortran, jvdelisle
The patch causes the following regressions:
FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=single -O2 -latomic (internal compiler error)
FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=single -O2 -latomic (test for excess errors)
FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=lib -O2 -lcaf_single -latomic (internal compiler error)
FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=lib -O2 -lcaf_single -latomic (test for excess errors)
FAIL: gfortran.dg/array_function_3.f90 -O (internal compiler error)
FAIL: gfortran.dg/array_function_3.f90 -O (test for excess errors)
FAIL: gfortran.dg/bind_c_vars.f90 -g -flto (test for excess errors)
FAIL: gfortran.dg/bound_2.f90 -O0 (internal compiler error)
FAIL: gfortran.dg/bound_2.f90 -O0 (test for excess errors)
FAIL: gfortran.dg/bound_2.f90 -O1 (internal compiler error)
FAIL: gfortran.dg/bound_2.f90 -O1 (test for excess errors)
FAIL: gfortran.dg/bound_2.f90 -O2 (internal compiler error)
FAIL: gfortran.dg/bound_2.f90 -O2 (test for excess errors)
FAIL: gfortran.dg/bound_2.f90 -O3 -fomit-frame-pointer (internal compiler error)
FAIL: gfortran.dg/bound_2.f90 -O3 -fomit-frame-pointer (test for excess errors)
FAIL: gfortran.dg/bound_2.f90 -O3 -fomit-frame-pointer -funroll-loops (internal compiler error)
FAIL: gfortran.dg/bound_2.f90 -O3 -fomit-frame-pointer -funroll-loops (test for excess errors)
FAIL: gfortran.dg/bound_2.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions (internal compiler error)
FAIL: gfortran.dg/bound_2.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions (test for excess errors)
FAIL: gfortran.dg/bound_2.f90 -O3 -g (internal compiler error)
FAIL: gfortran.dg/bound_2.f90 -O3 -g (test for excess errors)
FAIL: gfortran.dg/bound_2.f90 -Os (internal compiler error)
FAIL: gfortran.dg/bound_2.f90 -Os (test for excess errors)
FAIL: gfortran.dg/bound_2.f90 -g -flto (internal compiler error)
FAIL: gfortran.dg/bound_7.f90 -O0 (test for excess errors)
FAIL: gfortran.dg/bound_7.f90 -O1 (internal compiler error)
FAIL: gfortran.dg/bound_7.f90 -O1 (test for excess errors)
FAIL: gfortran.dg/bound_7.f90 -O2 (internal compiler error)
FAIL: gfortran.dg/bound_7.f90 -O2 (test for excess errors)
FAIL: gfortran.dg/bound_7.f90 -O3 -fomit-frame-pointer (internal compiler error)
FAIL: gfortran.dg/bound_7.f90 -O3 -fomit-frame-pointer (test for excess errors)
FAIL: gfortran.dg/bound_7.f90 -O3 -fomit-frame-pointer -funroll-loops (internal compiler error)
FAIL: gfortran.dg/bound_7.f90 -O3 -fomit-frame-pointer -funroll-loops (test for excess errors)
FAIL: gfortran.dg/bound_7.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions (internal compiler error)
FAIL: gfortran.dg/bound_7.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions (test for excess errors)
FAIL: gfortran.dg/bound_7.f90 -O3 -g (internal compiler error)
FAIL: gfortran.dg/bound_7.f90 -O3 -g (test for excess errors)
FAIL: gfortran.dg/bound_7.f90 -Os (internal compiler error)
FAIL: gfortran.dg/bound_7.f90 -Os (test for excess errors)
FAIL: gfortran.dg/bound_7.f90 -g -flto (internal compiler error)
FAIL: gfortran.dg/bound_7.f90 -g -flto (test for excess errors)
FAIL: gfortran.dg/bound_8.f90 -O0 (internal compiler error)
FAIL: gfortran.dg/bound_8.f90 -O0 (test for excess errors)
FAIL: gfortran.dg/bound_8.f90 -O1 (internal compiler error)
FAIL: gfortran.dg/bound_8.f90 -O1 (test for excess errors)
FAIL: gfortran.dg/bound_8.f90 -O2 (internal compiler error)
FAIL: gfortran.dg/bound_8.f90 -O2 (test for excess errors)
FAIL: gfortran.dg/bound_8.f90 -O3 -fomit-frame-pointer (internal compiler error)
FAIL: gfortran.dg/bound_8.f90 -O3 -fomit-frame-pointer (test for excess errors)
FAIL: gfortran.dg/bound_8.f90 -O3 -fomit-frame-pointer -funroll-loops (internal compiler error)
FAIL: gfortran.dg/bound_8.f90 -O3 -fomit-frame-pointer -funroll-loops (test for excess errors)
FAIL: gfortran.dg/bound_8.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions (internal compiler error)
FAIL: gfortran.dg/bound_8.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions (test for excess errors)
FAIL: gfortran.dg/bound_8.f90 -O3 -g (internal compiler error)
FAIL: gfortran.dg/bound_8.f90 -O3 -g (test for excess errors)
FAIL: gfortran.dg/bound_8.f90 -Os (internal compiler error)
FAIL: gfortran.dg/bound_8.f90 -Os (test for excess errors)
FAIL: gfortran.dg/bound_8.f90 -g -flto (internal compiler error)
FAIL: gfortran.dg/bound_8.f90 -g -flto (test for excess errors)
I think the line
if (!upper && as->type == AS_ASSUMED_SHAPE && dim)
should be something such as (untested)
if (!upper && dim && as && as->type == AS_ASSUMED_SHAPE)
Dominique
> Le 5 avr. 2015 à 22:37, Thomas Koenig <tkoenig@netcologne.de> a écrit :
>
> Hi Dominique,
>
>> which means that -fexternal-blas should disable the inlining.
>
> It is not surprising that a higly tuned BLAS library is better than
> a simple inlining for large matrices.
>
> I did some tests by adjusting n; it seems the inline version is
> faster for n<=22, which is not too bad.
>
> Regarding your other test case: This tests matrix*vector
> multiplication, which is not implemented yet :-)
>
> Regards,
>
> Thomas
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul
2015-04-05 23:15 ` Dominique d'Humières
@ 2015-04-06 12:18 ` Dominique d'Humières
2015-04-09 22:18 ` Thomas Koenig
0 siblings, 1 reply; 16+ messages in thread
From: Dominique d'Humières @ 2015-04-06 12:18 UTC (permalink / raw)
To: Thomas Koenig; +Cc: GCC Patches, GNU GFortran, jvdelisle
> Le 6 avr. 2015 à 01:15, Dominique d'Humières <dominiq@lps.ens.fr> a écrit :
>
> The patch causes the following regressions:
>
> FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=single -O2 -latomic (internal compiler error)
> …
> FAIL: gfortran.dg/bound_8.f90 -g -flto (test for excess errors)
>
> I think the line
>
> if (!upper && as->type == AS_ASSUMED_SHAPE && dim)
>
> should be something such as (untested)
>
> if (!upper && dim && as && as->type == AS_ASSUMED_SHAPE)
This fixes a first batch of ICEs. A second one if fixed by
if (kind && lower->expr_type == EXPR_CONSTANT)
While the first change is obvious, I am not sure about the second one.
With this changes I am left with the following regressions
FAIL: gfortran.dg/function_optimize_1.f90 -O scan-tree-dump-times original "matmul_r4" 1
FAIL: gfortran.dg/function_optimize_7.f90 -O scan-tree-dump-times original "matmul_r4" 1
FAIL: gfortran.dg/function_optimize_2.f90 -O scan-tree-dump-times original "matmul_r4" 1
FAIL: gfortran.dg/matmul_bounds_2.f90 -O1 output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90 -O2 output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90 -O3 -fomit-frame-pointer output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90 -O3 -fomit-frame-pointer -funroll-loops output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90 -O3 -g output pattern test
FAIL: gfortran.dg/matmul_bounds_2.f90 -Os output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90 -O1 output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90 -O2 output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90 -O3 -fomit-frame-pointer output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90 -O3 -fomit-frame-pointer -funroll-loops output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90 -O3 -g output pattern test
FAIL: gfortran.dg/matmul_bounds_3.f90 -Os output pattern test
FAIL: gfortran.dg/realloc_on_assign_11.f90 -O3 -fomit-frame-pointer execution test
FAIL: gfortran.dg/realloc_on_assign_11.f90 -O3 -fomit-frame-pointer -funroll-loops execution test
FAIL: gfortran.dg/realloc_on_assign_11.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions execution test
FAIL: gfortran.dg/realloc_on_assign_11.f90 -O3 -g execution test
FAIL: gfortran.dg/realloc_on_assign_11.f90 -Os execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03 -O1 execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03 -O2 execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03 -O3 -fomit-frame-pointer execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03 -O3 -fomit-frame-pointer -funroll-loops execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03 -O3 -g execution test
FAIL: gfortran.dg/realloc_on_assign_7.f03 -Os execution test
The gfortran.dg/matmul_bounds_* are failures due to a change in the run time error (is there a good reason for that?) and the gfortran.dg/realloc_on_assign_* failures are due to segfaults at run time.
Dominique
>> Le 5 avr. 2015 à 22:37, Thomas Koenig <tkoenig@netcologne.de> a écrit :
>>
>> Hi Dominique,
>>
>>> which means that -fexternal-blas should disable the inlining.
>>
>> It is not surprising that a higly tuned BLAS library is better than
>> a simple inlining for large matrices.
>>
>> I did some tests by adjusting n; it seems the inline version is
>> faster for n<=22, which is not too bad.
>>
>> Regarding your other test case: This tests matrix*vector
>> multiplication, which is not implemented yet :-)
>>
>> Regards,
>>
>> Thomas
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul
2015-04-06 12:18 ` Dominique d'Humières
@ 2015-04-09 22:18 ` Thomas Koenig
2015-04-10 14:15 ` Dominique d'Humières
0 siblings, 1 reply; 16+ messages in thread
From: Thomas Koenig @ 2015-04-09 22:18 UTC (permalink / raw)
To: GCC Patches, GNU GFortran; +Cc: Dominique d'Humières
[-- Attachment #1: Type: text/plain, Size: 691 bytes --]
Hello world,
here is an update on the matmul inlining patch. Now, the
rank one + rank two cases are also handled. Reallocation
on assignment also works now.
Still missing:
1. Control via an option, BLAS inlining has to take precedence
2. handling of matmul(a,b) occurring in the middle of an expression.
3. Bounds checking from the front end pass (basically, calling
gfortran_runtime_error).
4. More test cases
What do you think? 1. is straightforward, we don't really need to
do 2. for committing early in stage one. For 3, we could maybe
just issue a STOP. 4. is required, I need to look at some more
corner cases where bugs may still be lurking.
What do you think?
Thomas
[-- Attachment #2: matmul-8.diff --]
[-- Type: text/x-patch, Size: 27830 bytes --]
Index: frontend-passes.c
===================================================================
--- frontend-passes.c (Revision 221944)
+++ frontend-passes.c (Arbeitskopie)
@@ -43,7 +43,11 @@ static void doloop_warn (gfc_namespace *);
static void optimize_reduction (gfc_namespace *);
static int callback_reduction (gfc_expr **, int *, void *);
static void realloc_strings (gfc_namespace *);
-static gfc_expr *create_var (gfc_expr *);
+static gfc_expr *create_var (gfc_expr *, const char *vname=NULL);
+static int optimize_matmul_assign (gfc_code **, int *, void *);
+static gfc_code * create_do_loop (gfc_expr *, gfc_expr *, gfc_expr *,
+ locus *, gfc_namespace *,
+ char *vname=NULL);
/* How deep we are inside an argument list. */
@@ -93,8 +97,24 @@ struct my_struct *evec;
static bool in_assoc_list;
+/* Counter for temporary variables. */
+
+static int var_num = 1;
+
+/* What sort of matrix we are dealing with when optimizing MATMUL. */
+
+enum matrix_case { none=0, A2B2, A2B1, A1B2 };
+
+/* Keep track of the number of expressions we have inserted so far
+ using create_var. */
+
+int n_vars;
+
+void gfc_debug_expr (gfc_expr *);
+
/* Entry point - run all passes for a namespace. */
+
void
gfc_run_passes (gfc_namespace *ns)
{
@@ -157,7 +177,7 @@ realloc_string_callback (gfc_code **c, int *walk_s
return 0;
current_code = c;
- n = create_var (expr2);
+ n = create_var (expr2, "trim");
co->expr2 = n;
return 0;
}
@@ -524,29 +544,11 @@ constant_string_length (gfc_expr *e)
}
-/* Returns a new expression (a variable) to be used in place of the old one,
- with an assignment statement before the current statement to set
- the value of the variable. Creates a new BLOCK for the statement if
- that hasn't already been done and puts the statement, plus the
- newly created variables, in that block. Special cases: If the
- expression is constant or a temporary which has already
- been created, just copy it. */
-
-static gfc_expr*
-create_var (gfc_expr * e)
+static gfc_namespace*
+insert_block ()
{
- char name[GFC_MAX_SYMBOL_LEN +1];
- static int num = 1;
- gfc_symtree *symtree;
- gfc_symbol *symbol;
- gfc_expr *result;
- gfc_code *n;
gfc_namespace *ns;
- int i;
- if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e))
- return gfc_copy_expr (e);
-
/* If the block hasn't already been created, do so. */
if (inserted_block == NULL)
{
@@ -578,7 +580,37 @@ constant_string_length (gfc_expr *e)
else
ns = inserted_block->ext.block.ns;
- sprintf(name, "__var_%d",num++);
+ return ns;
+}
+
+/* Returns a new expression (a variable) to be used in place of the old one,
+ with an optional assignment statement before the current statement to set
+ the value of the variable. Creates a new BLOCK for the statement if that
+ hasn't already been done and puts the statement, plus the newly created
+ variables, in that block. Special cases: If the expression is constant or
+ a temporary which has already been created, just copy it. */
+
+static gfc_expr*
+create_var (gfc_expr * e, const char *vname)
+{
+ char name[GFC_MAX_SYMBOL_LEN +1];
+ gfc_symtree *symtree;
+ gfc_symbol *symbol;
+ gfc_expr *result;
+ gfc_code *n;
+ gfc_namespace *ns;
+ int i;
+
+ if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e))
+ return gfc_copy_expr (e);
+
+ ns = insert_block ();
+
+ if (vname)
+ snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d_%s", var_num++, vname);
+ else
+ snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d", var_num++);
+
if (gfc_get_sym_tree (name, ns, &symtree, false) != 0)
gcc_unreachable ();
@@ -651,6 +683,7 @@ constant_string_length (gfc_expr *e)
result->ref->type = REF_ARRAY;
result->ref->u.ar.type = AR_FULL;
result->ref->u.ar.where = e->where;
+ result->ref->u.ar.dimen = e->rank;
result->ref->u.ar.as = symbol->ts.type == BT_CLASS
? CLASS_DATA (symbol)->as : symbol->as;
if (warn_array_temporaries)
@@ -658,6 +691,7 @@ constant_string_length (gfc_expr *e)
"Creating array temporary at %L", &(e->where));
}
+
/* Generate the new assignment. */
n = XCNEW (gfc_code);
n->op = EXEC_ASSIGN;
@@ -666,6 +700,7 @@ constant_string_length (gfc_expr *e)
n->expr1 = gfc_copy_expr (result);
n->expr2 = e;
*changed_statement = n;
+ n_vars ++;
return result;
}
@@ -724,7 +759,7 @@ cfe_expr_0 (gfc_expr **e, int *walk_subtrees,
if (gfc_dep_compare_functions (*ei, *ej, true) == 0)
{
if (newvar == NULL)
- newvar = create_var (*ei);
+ newvar = create_var (*ei, "fcn");
if (warn_function_elimination)
do_warn_function_elimination (*ej);
@@ -931,6 +966,7 @@ convert_elseif (gfc_code **c, int *walk_subtrees A
/* Don't walk subtrees. */
return 0;
}
+
/* Optimize a namespace, including all contained namespaces. */
static void
@@ -947,6 +983,7 @@ optimize_namespace (gfc_namespace *ns)
gfc_code_walker (&ns->code, convert_elseif, dummy_expr_callback, NULL);
gfc_code_walker (&ns->code, cfe_code, cfe_expr_0, NULL);
gfc_code_walker (&ns->code, optimize_code, optimize_expr, NULL);
+ gfc_code_walker (&ns->code, optimize_matmul_assign, dummy_expr_callback, NULL);
/* BLOCKs are handled in the expression walker below. */
for (ns = ns->contained; ns; ns = ns->sibling)
@@ -1222,7 +1259,7 @@ combine_array_constructor (gfc_expr *e)
if (op2->ts.type == BT_CHARACTER)
return false;
- scalar = create_var (gfc_copy_expr (op2));
+ scalar = create_var (gfc_copy_expr (op2), "constr");
oldbase = op1->value.constructor;
newbase = NULL;
@@ -1939,7 +1976,798 @@ doloop_warn (gfc_namespace *ns)
gfc_code_walker (&ns->code, doloop_code, do_function, NULL);
}
+/* Auxiliary function to build and simplify an array inquiry function.
+ dim is zero-based. */
+static gfc_expr *
+get_array_inq_function (gfc_expr *e, int dim, gfc_isym_id id)
+{
+
+ gfc_expr *fcn;
+ gfc_expr *dim_arg, *kind;
+ const char *name;
+
+ switch (id)
+ {
+ case GFC_ISYM_LBOUND:
+ name = "_gfortran_lbound";
+ break;
+
+ case GFC_ISYM_UBOUND:
+ name = "_gfortran_ubound";
+ break;
+
+ case GFC_ISYM_SIZE:
+ name = "_gfortran_size";
+ break;
+
+ default:
+ gcc_unreachable ();
+ }
+
+ dim_arg = gfc_get_int_expr (gfc_default_integer_kind, &e->where, dim);
+
+ kind = gfc_get_int_expr (gfc_default_integer_kind, &e->where,
+ gfc_index_integer_kind);
+
+ fcn = gfc_build_intrinsic_call (current_ns, id, name, e->where, 3,
+ gfc_copy_expr(e), dim_arg, kind);
+
+ gfc_simplify_expr (fcn, 0);
+ return fcn;
+}
+
+/* Build a not equals - expression. */
+
+static gfc_expr*
+build_logical_expr (gfc_expr *e1, gfc_expr *e2, gfc_intrinsic_op op)
+{
+ gfc_typespec ts;
+ gfc_expr *res;
+
+ ts.type = BT_LOGICAL;
+ ts.kind = gfc_default_logical_kind;
+ res = gfc_get_expr ();
+ res->where = e1->where;
+ res->expr_type = EXPR_OP;
+ res->value.op.op = op;
+ res->value.op.op1 = e1;
+ res->value.op.op2 = e2;
+ res->ts = ts;
+
+ return res;
+}
+/* Handle matrix reallocation. Caller is responsible to insert into
+ the code tree.
+
+ For the two-dimensional case, build
+
+ if (allocated(c)) then
+ if (size(c,1) /= size(a,1) .or. size(c,2) /= size(b,2)) then
+ deallocate(c)
+ allocate (c(size(a,1), size(b,2)))
+ end if
+ else
+ allocate (c(size(a,1),size(b,2)))
+ end if
+
+*/
+
+static gfc_code *
+matmul_lhs_realloc (gfc_expr *c, gfc_expr *a, gfc_expr *b,
+ enum matrix_case m_case)
+{
+
+ gfc_expr *allocated, *alloc_expr;
+ gfc_code *if_alloc_1, *if_alloc_2, *if_size_1, *if_size_2;
+ gfc_code *else_alloc;
+ gfc_code *deallocate, *allocate1, *allocate_else;
+ gfc_ref *ref;
+ gfc_array_ref *ar;
+ gfc_expr *cond, *ne1, *ne2;
+
+ alloc_expr = gfc_copy_expr (c);
+
+ ref = alloc_expr->ref;
+
+ while (ref)
+ {
+ if (ref->type == REF_ARRAY && ref->u.ar.type != AR_ELEMENT)
+ break;
+
+ ref = ref->next;
+
+ }
+ ar = &ref->u.ar;
+ gcc_assert (ar && ar->type == AR_FULL);
+
+ /* c comes in as a full ref. Change it into a copy and make it into an
+ element ref so it has the right for for ALLOCATE. In the same switch,
+ also generate the size comparison for the secod IF statement. */
+
+ ar->type = AR_ELEMENT;
+
+ switch (m_case)
+ {
+ case A2B2:
+ ar->start[0] = get_array_inq_function (a, 1, GFC_ISYM_SIZE);
+ ar->start[1] = get_array_inq_function (b, 2, GFC_ISYM_SIZE);
+ ne1 = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE),
+ get_array_inq_function (a, 1, GFC_ISYM_SIZE),
+ INTRINSIC_NE);
+ ne2 = build_logical_expr (get_array_inq_function (c, 2, GFC_ISYM_SIZE),
+ get_array_inq_function (b, 2, GFC_ISYM_SIZE),
+ INTRINSIC_NE);
+ cond = build_logical_expr (ne1, ne2, INTRINSIC_OR);
+ break;
+
+ case A2B1:
+ ar->start[0] = get_array_inq_function (a, 0, GFC_ISYM_SIZE);
+ cond = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE),
+ get_array_inq_function (a, 1, GFC_ISYM_SIZE),
+ INTRINSIC_NE);
+ break;
+
+ case A1B2:
+ ar->start[0] = get_array_inq_function (b, 1, GFC_ISYM_SIZE);
+ cond = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE),
+ get_array_inq_function (b, 2, GFC_ISYM_SIZE),
+ INTRINSIC_NE);
+ break;
+
+ default:
+ gcc_unreachable();
+
+ }
+
+ gfc_simplify_expr (cond, 0);
+
+ /* We need two identical allocate statements in two
+ branches of the IF statement. */
+
+ allocate1 = XCNEW (gfc_code);
+ allocate1->op = EXEC_ALLOCATE;
+ allocate1->ext.alloc.list = gfc_get_alloc ();
+ allocate1->loc = c->where;
+ allocate1->ext.alloc.list->expr = gfc_copy_expr (alloc_expr);
+
+ allocate_else = XCNEW (gfc_code);
+ allocate_else->op = EXEC_ALLOCATE;
+ allocate_else->ext.alloc.list = gfc_get_alloc ();
+ allocate_else->loc = c->where;
+ allocate_else->ext.alloc.list->expr = alloc_expr;
+
+ allocated = gfc_build_intrinsic_call (current_ns, GFC_ISYM_ALLOCATED,
+ "_gfortran_allocated", c->where,
+ 1, gfc_copy_expr (c));
+
+ deallocate = XCNEW (gfc_code);
+ deallocate->op = EXEC_DEALLOCATE;
+ deallocate->ext.alloc.list = gfc_get_alloc ();
+ deallocate->ext.alloc.list->expr = gfc_copy_expr (c);
+ deallocate->next = allocate1;
+ deallocate->loc = c->where;
+
+ if_size_2 = XCNEW (gfc_code);
+ if_size_2->op = EXEC_IF;
+ if_size_2->expr1 = cond;
+ if_size_2->loc = c->where;
+ if_size_2->next = deallocate;
+
+ if_size_1 = XCNEW (gfc_code);
+ if_size_1->op = EXEC_IF;
+ if_size_1->block = if_size_2;
+ if_size_1->loc = c->where;
+
+ else_alloc = XCNEW (gfc_code);
+ else_alloc->op = EXEC_IF;
+ else_alloc->loc = c->where;
+ else_alloc->next = allocate_else;
+
+ if_alloc_2 = XCNEW (gfc_code);
+ if_alloc_2->op = EXEC_IF;
+ if_alloc_2->expr1 = allocated;
+ if_alloc_2->loc = c->where;
+ if_alloc_2->next = if_size_1;
+ if_alloc_2->block = else_alloc;
+
+ if_alloc_1 = XCNEW (gfc_code);
+ if_alloc_1->op = EXEC_IF;
+ if_alloc_1->block = if_alloc_2;
+ if_alloc_1->loc = c->where;
+
+ return if_alloc_1;
+}
+
+
+/* Callback function for has_function_or_op. */
+
+static int
+is_function_or_op (gfc_expr **e, int *walk_subtrees ATTRIBUTE_UNUSED,
+ void *data ATTRIBUTE_UNUSED)
+{
+ if ((*e) == 0)
+ return 0;
+ else
+ return (*e)->expr_type == EXPR_FUNCTION
+ || (*e)->expr_type == EXPR_OP;
+}
+
+/* Returns true if the expression contains a function. */
+
+static bool
+has_function_or_op (gfc_expr **e)
+{
+ if (e == NULL)
+ return false;
+ else
+ return gfc_expr_walker (e, is_function_or_op, NULL);
+}
+
+/* Freeze a single expression. */
+
+static void
+freeze_expr (gfc_expr **ep)
+{
+ gfc_expr *ne;
+ if (has_function_or_op (ep))
+ {
+ ne = create_var (*ep, "freeze");
+ *ep = ne;
+ }
+}
+
+/* Go through an expression's references and assign them to temporary
+ variables if they contain functions. This is usually done prior to
+ front-end scalarization to avoid multiple invocations of functions. */
+
+static void
+freeze_references (gfc_expr *e)
+{
+ gfc_ref *r;
+ gfc_array_ref *ar;
+ int i;
+
+ for (r=e->ref; r; r=r->next)
+ {
+ if (r->type == REF_SUBSTRING)
+ {
+ if (r->u.ss.start != NULL)
+ freeze_expr (&r->u.ss.start);
+
+ if (r->u.ss.end != NULL)
+ freeze_expr (&r->u.ss.end);
+ }
+ else if (r->type == REF_ARRAY)
+ {
+ ar = &r->u.ar;
+ switch (ar->type)
+ {
+ case AR_FULL:
+ break;
+
+ case AR_SECTION:
+ for (i=0; i<ar->dimen; i++)
+ {
+ if (ar->dimen_type[i] == DIMEN_RANGE)
+ {
+ freeze_expr (&ar->start[i]);
+ freeze_expr (&ar->end[i]);
+ freeze_expr (&ar->stride[i]);
+ }
+ }
+ default:
+ break;
+ }
+ }
+ }
+}
+
+/* Convert to gfc_index_integer_kind if needed, just do a copy otherwise. */
+
+static gfc_expr *
+convert_to_index_kind (gfc_expr *e)
+{
+ gfc_expr *res;
+
+ gcc_assert (e != NULL);
+
+ res = gfc_copy_expr (e);
+
+ gcc_assert (e->ts.type == BT_INTEGER);
+
+ if (res->ts.kind != gfc_index_integer_kind)
+ {
+ gfc_typespec ts;
+ gfc_clear_ts (&ts);
+ ts.type = BT_INTEGER;
+ ts.kind = gfc_index_integer_kind;
+
+ gfc_convert_type_warn (e, &ts, 2, 0);
+ }
+
+ return res;
+}
+
+/* Function to create a DO loop including creation of the
+ iteration variable. gfc_expr are copied.*/
+
+static gfc_code *
+create_do_loop (gfc_expr *start, gfc_expr *end, gfc_expr *step, locus *where,
+ gfc_namespace *ns, char *vname)
+{
+
+ char name[GFC_MAX_SYMBOL_LEN +1];
+ gfc_symtree *symtree;
+ gfc_symbol *symbol;
+ gfc_expr *i;
+ gfc_code *n, *n2;
+
+ /* Create an expression for the iteration variable. */
+ if (vname)
+ sprintf (name, "__var_%d_do_%s", var_num++, vname);
+ else
+ sprintf (name, "__var_%d_do", var_num++);
+
+
+ if (gfc_get_sym_tree (name, ns, &symtree, false) != 0)
+ gcc_unreachable ();
+
+ symbol = symtree->n.sym;
+ symbol->ts.type = BT_INTEGER;
+ symbol->ts.kind = gfc_index_integer_kind;
+ symbol->attr.flavor = FL_VARIABLE;
+ symbol->attr.referenced = 1;
+ symbol->attr.dimension = 0;
+ symbol->attr.fe_temp = 1;
+ gfc_commit_symbol (symbol);
+
+ i = gfc_get_expr ();
+ i->expr_type = EXPR_VARIABLE;
+ i->ts = symbol->ts;
+ i->rank = 0;
+ i->where = *where;
+ i->symtree = symtree;
+
+ n = XCNEW (gfc_code);
+ n->op = EXEC_DO;
+ n->loc = *where;
+ n->ext.iterator = gfc_get_iterator ();
+ n->ext.iterator->var = i;
+ n->ext.iterator->start = convert_to_index_kind (start);
+ n->ext.iterator->end = convert_to_index_kind (end);
+ if (step)
+ n->ext.iterator->step = convert_to_index_kind (step);
+ else
+ n->ext.iterator->step = gfc_get_int_expr (gfc_index_integer_kind,
+ where, 1);
+
+ n2 = XCNEW (gfc_code);
+ n2->op = EXEC_DO;
+ n2->loc = *where;
+ n2->next = NULL;
+ n->block = n2;
+ return n;
+}
+
+/* Return an operation of one two gfc_expr (one if e2 is NULL. This assumes
+ compatible typespecs. */
+
+static gfc_expr *
+get_operand (gfc_intrinsic_op op, gfc_expr *e1, gfc_expr *e2)
+{
+ gfc_expr *res;
+
+ res = gfc_get_expr ();
+ res->ts = e1->ts;
+ res->where = e1->where;
+ res->expr_type = EXPR_OP;
+ res->value.op.op = op;
+ res->value.op.op1 = e1;
+ res->value.op.op2 = e2;
+ gfc_simplify_expr (res, 0);
+ return res;
+}
+
+/* Get the upper bound of the DO loops for matmul along
+ a dimension, i.e. size minus one. */
+static gfc_expr*
+get_size_m1 (gfc_expr *e, int dimen)
+{
+ mpz_t size;
+ gfc_expr *res;
+
+ if (gfc_array_dimen_size (e, dimen, &size))
+ {
+ res = gfc_get_constant_expr (BT_INTEGER,
+ gfc_index_integer_kind, &e->where);
+ mpz_sub_ui (res->value.integer, size, 1);
+ mpz_clear (size);
+ }
+ else
+ {
+ res = get_operand (INTRINSIC_MINUS,
+ get_array_inq_function (e, dimen + 1, GFC_ISYM_SIZE),
+ gfc_get_int_expr (gfc_index_integer_kind,
+ &e->where, 1));
+ gfc_simplify_expr (res, 0);
+ }
+
+ return res;
+}
+
+/* Function to return a scalarized expression. It is assumed that indices are
+ zero based to make generation of DO loops easier. A zero as index will
+ access the first element along a dimension. Single element references will
+ be skipped. A NULL as an expression will be replaced by a full reference.
+ This assumes that the index loops have gfc_index_integer_kind, and that all
+ references have been frozen. */
+
+static gfc_expr*
+scalarized_expr (gfc_expr *e_in, gfc_expr **index, int count_index)
+{
+ gfc_ref *ref;
+ gfc_array_ref *ar;
+ int i;
+ int rank;
+ gfc_expr *e;
+ int i_index;
+ bool was_fullref;
+
+ e = gfc_copy_expr(e_in);
+
+ rank = e->rank;
+
+ ref = e->ref;
+
+ while (ref)
+ {
+ if (ref->type == REF_ARRAY
+ && (ref->u.ar.type == AR_FULL || ref->u.ar.type == AR_SECTION))
+ break;
+ ref = ref->next;
+ }
+ ar = &ref->u.ar;
+
+ /* We scalarize count_index variables, reducing the rank by count_index. */
+
+ e->rank = rank - count_index;
+
+ was_fullref = ar->type == AR_FULL;
+
+ if (e->rank == 0)
+ ar->type = AR_ELEMENT;
+ else
+ ar->type = AR_SECTION;
+
+ /* Loop over the indices. For each index, create the expression
+ index + lbound(e, dim). */
+
+ i_index = 0;
+ for (i=0; i < ar->dimen; i++)
+ {
+ if (was_fullref || ar->dimen_type[i] == DIMEN_RANGE)
+ {
+ if (index[i_index] != NULL)
+ {
+ gfc_expr *lbound, *nindex;
+ gfc_expr *loopvar;
+
+ loopvar = gfc_copy_expr (index[i_index]);
+
+ if (ar->stride[i])
+ {
+ gfc_expr *tmp;
+
+ tmp = gfc_copy_expr(ar->stride[i]);
+ if (tmp->ts.kind != gfc_index_integer_kind)
+ {
+ gfc_typespec ts;
+ gfc_clear_ts (&ts);
+ ts.type = BT_INTEGER;
+ ts.kind = gfc_index_integer_kind;
+ gfc_convert_type (tmp, &ts, 2);
+ }
+ nindex = get_operand (INTRINSIC_TIMES, loopvar, tmp);
+ }
+ else
+ nindex = loopvar;
+
+ if (ar->start[i])
+ {
+ lbound = gfc_copy_expr (ar->start[i]);
+ if (lbound->ts.kind != gfc_index_integer_kind)
+ {
+ gfc_typespec ts;
+ gfc_clear_ts (&ts);
+ ts.type = BT_INTEGER;
+ ts.kind = gfc_index_integer_kind;
+ gfc_convert_type (lbound, &ts, 2);
+
+ }
+ }
+ else
+ lbound = get_array_inq_function (e_in, i+1, GFC_ISYM_LBOUND);
+
+ ar->dimen_type[i] = DIMEN_ELEMENT;
+ ar->start[i] = get_operand (INTRINSIC_PLUS, nindex, lbound);
+ ar->end[i] = NULL;
+ ar->stride[i] = NULL;
+ gfc_simplify_expr (ar->start[i], 0);
+ }
+ else if (was_fullref)
+ {
+ ar->dimen_type[i] = DIMEN_RANGE;
+ ar->start[i] = NULL;
+ ar->end[i] = NULL;
+ ar->stride[i] = NULL;
+ }
+ i_index ++;
+ }
+ }
+ return e;
+}
+
+
+/* Optimize assignments of the form c = matmul(a,b).
+ Handle only the cases currently where b and c are rank-two arrays.
+
+ This translates the code to
+
+ BLOCK
+ integer i,j,k
+ c = 0
+ do j=0, size(b,2)-1
+ do k=0, size(a, 2)-1
+ do i=0, size(a, 1)-1
+ c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) =
+ c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) +
+ a(i * stride(a,1) + lbound(a,1), k * stride(a,2) + lbound(a,2)) *
+ b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2))
+ end do
+ end do
+ end do
+ END BLOCK
+
+*/
+
+static int
+optimize_matmul_assign (gfc_code **c,
+ int *walk_subtrees ATTRIBUTE_UNUSED,
+ void *data ATTRIBUTE_UNUSED)
+{
+ gfc_code *co = *c;
+ gfc_expr *expr1, *expr2;
+ gfc_expr *matrix_a, *matrix_b;
+ gfc_actual_arglist *a, *b;
+ gfc_code *do_1, *do_2, *do_3, *assign_zero, *assign_matmul;
+ gfc_expr *zero_e;
+ gfc_expr *u1, *u2, *u3;
+ gfc_expr *list[2];
+ gfc_expr *ascalar, *bscalar, *cscalar;
+ gfc_expr *mult;
+ gfc_expr *var_1, *var_2, *var_3;
+ gfc_expr *zero;
+ gfc_namespace *ns;
+ gfc_intrinsic_op op_times, op_plus;
+ enum matrix_case m_case;
+ gfc_code *next_code;
+ gfc_code *p;
+ int i;
+
+
+ if (co->op != EXEC_ASSIGN)
+ return 0;
+
+ expr1 = co->expr1;
+ expr2 = co->expr2;
+ if (expr2->expr_type != EXPR_FUNCTION
+ || expr2->value.function.isym == NULL
+ || expr2->value.function.isym->id != GFC_ISYM_MATMUL)
+ return 0;
+
+ /* Handling only two matrices of dimension two for the time being. */
+
+ current_code = c;
+ inserted_block = NULL;
+ changed_statement = NULL;
+
+ a = expr2->value.function.actual;
+ matrix_a = a->expr;
+ b = a->next;
+ matrix_b = b->expr;
+
+ if (matrix_a->expr_type != EXPR_VARIABLE
+ || matrix_b->expr_type != EXPR_VARIABLE)
+ return 0;
+
+ if (matrix_a->rank == 2)
+ m_case = matrix_b->rank == 1 ? A2B1 : A2B2;
+ else
+ m_case = A1B2;
+
+ if (gfc_check_dependency (expr1, matrix_a, true)
+ || gfc_check_dependency (expr1, matrix_b, true))
+ return 0;
+
+ ns = insert_block ();
+
+ switch(expr1->ts.type)
+ {
+ case BT_INTEGER:
+ zero_e = gfc_get_int_expr (expr1->ts.kind, &expr1->where, 0);
+ op_times = INTRINSIC_TIMES;
+ op_plus = INTRINSIC_PLUS;
+ break;
+
+ case BT_LOGICAL:
+ op_times = INTRINSIC_AND;
+ op_plus = INTRINSIC_OR;
+ zero_e = gfc_get_logical_expr (expr1->ts.kind, &expr1->where,
+ 0);
+ break;
+ case BT_REAL:
+ zero_e = gfc_get_constant_expr (BT_REAL, expr1->ts.kind,
+ &expr1->where);
+ mpfr_set_si (zero_e->value.real, 0, GFC_RND_MODE);
+ op_times = INTRINSIC_TIMES;
+ op_plus = INTRINSIC_PLUS;
+ break;
+
+ case BT_COMPLEX:
+ zero_e = gfc_get_constant_expr (BT_COMPLEX, expr1->ts.kind,
+ &expr1->where);
+ mpc_set_si_si (zero_e->value.complex, 0, 0, GFC_RND_MODE);
+ op_times = INTRINSIC_TIMES;
+ op_plus = INTRINSIC_PLUS;
+
+ break;
+
+ default:
+ gcc_unreachable();
+ }
+
+ current_code = &ns->code;
+
+ n_vars = 0;
+ freeze_references (matrix_a);
+ freeze_references (matrix_b);
+
+ assign_zero = XCNEW (gfc_code);
+ assign_zero->op = EXEC_ASSIGN;
+ assign_zero->loc = co->loc;
+ assign_zero->expr1 = gfc_copy_expr (expr1);
+ assign_zero->expr2 = zero_e;
+
+ if (flag_realloc_lhs && gfc_is_reallocatable_lhs (expr1))
+ {
+ gfc_code *lhs_alloc;
+
+ lhs_alloc = matmul_lhs_realloc (expr1, matrix_a, matrix_b, m_case);
+ lhs_alloc->next = assign_zero;
+ next_code = lhs_alloc;
+ }
+ else
+ next_code = assign_zero;
+
+ if (n_vars == 0)
+ *current_code = next_code;
+ else
+ {
+ p = ns->code;
+ for (i=0; i<n_vars-1; i++)
+ p = p->next;
+
+ p->next = next_code;
+ }
+
+ zero = gfc_get_int_expr (gfc_index_integer_kind, &co->loc, 0);
+
+ assign_matmul = XCNEW (gfc_code);
+ assign_matmul->op = EXEC_ASSIGN;
+ assign_matmul->loc = co->loc;
+
+ switch (m_case)
+ {
+ case A2B2:
+ u1 = get_size_m1 (matrix_b, 1);
+ u2 = get_size_m1 (matrix_a, 1);
+ u3 = get_size_m1 (matrix_a, 0);
+
+ do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns);
+ do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns);
+ do_3 = create_do_loop (gfc_copy_expr (zero), u3, NULL, &co->loc, ns);
+
+ do_1->block->next = do_2;
+ do_2->block->next = do_3;
+ do_3->block->next = assign_matmul;
+
+ var_1 = do_1->ext.iterator->var;
+ var_2 = do_2->ext.iterator->var;
+ var_3 = do_3->ext.iterator->var;
+
+ list[0] = var_3;
+ list[1] = var_1;
+ cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 2);
+
+ list[0] = var_3;
+ list[1] = var_2;
+ ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2);
+
+ list[0] = var_2;
+ list[1] = var_1;
+ bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2);
+
+ break;
+
+ case A2B1:
+ u1 = get_size_m1 (matrix_b, 0);
+ u2 = get_size_m1 (matrix_a, 0);
+
+ do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns);
+ do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns);
+
+ do_1->block->next = do_2;
+ do_2->block->next = assign_matmul;
+
+ var_1 = do_1->ext.iterator->var;
+ var_2 = do_2->ext.iterator->var;
+
+ list[0] = var_2;
+ cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 1);
+
+ list[0] = var_2;
+ list[1] = var_1;
+ ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2);
+
+ list[0] = var_1;
+ bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 1);
+
+ break;
+
+ case A1B2:
+ u1 = get_size_m1 (matrix_b, 1);
+ u2 = get_size_m1 (matrix_a, 0);
+
+ do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns);
+ do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns);
+
+ do_1->block->next = do_2;
+ do_2->block->next = assign_matmul;
+
+ var_1 = do_1->ext.iterator->var;
+ var_2 = do_2->ext.iterator->var;
+
+ list[0] = var_1;
+ cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 1);
+
+ list[0] = var_2;
+ ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 1);
+
+ list[0] = var_2;
+ list[1] = var_1;
+ bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2);
+
+ break;
+
+ default:
+ gcc_unreachable();
+ }
+
+ assign_zero->next = do_1;
+
+
+ assign_matmul->expr1 = gfc_copy_expr (cscalar);
+
+ mult = get_operand (op_times, ascalar, bscalar);
+ assign_matmul->expr2 = get_operand (op_plus, cscalar, mult);
+
+ co->next = NULL;
+ gfc_free_statements(co);
+ gfc_free_expr (zero);
+ return 0;
+}
+
#define WALK_SUBEXPR(NODE) \
do \
{ \
Index: gfortran.h
===================================================================
--- gfortran.h (Revision 221944)
+++ gfortran.h (Arbeitskopie)
@@ -3225,4 +3225,8 @@ int gfc_code_walker (gfc_code **, walk_code_fn_t,
void gfc_convert_mpz_to_signed (mpz_t, int);
+/* trans-array.c */
+
+bool gfc_is_reallocatable_lhs (gfc_expr *);
+
#endif /* GCC_GFORTRAN_H */
Index: simplify.c
===================================================================
--- simplify.c (Revision 221944)
+++ simplify.c (Arbeitskopie)
@@ -3445,6 +3445,31 @@ simplify_bound (gfc_expr *array, gfc_expr *dim, gf
done:
+ if (!upper && as && as->type == AS_ASSUMED_SHAPE && dim
+ && dim->expr_type == EXPR_CONSTANT)
+ {
+ if (!(array->symtree && array->symtree->n.sym
+ && (array->symtree->n.sym->attr.allocatable
+ || array->symtree->n.sym->attr.pointer)))
+ {
+ unsigned long int ndim;
+ gfc_expr *lower, *res;
+
+ ndim = mpz_get_si (dim->value.integer) - 1;
+ lower = as->lower[ndim];
+ if (lower->expr_type == EXPR_CONSTANT)
+ {
+ res = gfc_copy_expr (lower);
+ if (kind)
+ {
+ int nkind = mpz_get_si (kind->value.integer);
+ res->ts.kind = nkind;
+ }
+ return res;
+ }
+ }
+ }
+
if (as && (as->type == AS_DEFERRED || as->type == AS_ASSUMED_SHAPE
|| as->type == AS_ASSUMED_RANK))
return NULL;
Index: trans-array.h
===================================================================
--- trans-array.h (Revision 221944)
+++ trans-array.h (Arbeitskopie)
@@ -64,8 +64,6 @@ tree gfc_copy_only_alloc_comp (gfc_symbol *, tree,
tree gfc_alloc_allocatable_for_assignment (gfc_loopinfo*, gfc_expr*, gfc_expr*);
-bool gfc_is_reallocatable_lhs (gfc_expr *);
-
/* Add initialization for deferred arrays. */
void gfc_trans_deferred_array (gfc_symbol *, gfc_wrapped_block *);
/* Generate an initializer for a static pointer or allocatable array. */
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul
2015-04-09 22:18 ` Thomas Koenig
@ 2015-04-10 14:15 ` Dominique d'Humières
2015-04-10 16:57 ` Dominique d'Humières
2015-04-11 12:24 ` Thomas Koenig
0 siblings, 2 replies; 16+ messages in thread
From: Dominique d'Humières @ 2015-04-10 14:15 UTC (permalink / raw)
To: Thomas Koenig; +Cc: GCC Patches, GNU GFortran
> Le 10 avr. 2015 à 00:18, Thomas Koenig <tkoenig@netcologne.de> a écrit :
>
> Hello world,
>
> here is an update on the matmul inlining patch. Now, the
> rank one + rank two cases are also handled.
Preliminary tests show that my variant of fatigue.f90 with matmul is as fast (if not faster) as the original test with dot products.
However this causes new failures
FAIL: gfortran.dg/matmul_bounds_4.f90 -O1 execution test
FAIL: gfortran.dg/matmul_bounds_4.f90 -O2 execution test
FAIL: gfortran.dg/matmul_bounds_4.f90 -O3 -fomit-frame-pointer execution test
FAIL: gfortran.dg/matmul_bounds_4.f90 -O3 -fomit-frame-pointer -funroll-loops execution test
FAIL: gfortran.dg/matmul_bounds_4.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions execution test
FAIL: gfortran.dg/matmul_bounds_4.f90 -O3 -g execution test
FAIL: gfortran.dg/matmul_bounds_4.f90 -Os execution test
FAIL: gfortran.dg/matmul_bounds_5.f90 -O1 execution test
FAIL: gfortran.dg/matmul_bounds_5.f90 -O2 execution test
FAIL: gfortran.dg/matmul_bounds_5.f90 -O3 -fomit-frame-pointer execution test
FAIL: gfortran.dg/matmul_bounds_5.f90 -O3 -fomit-frame-pointer -funroll-loops execution test
FAIL: gfortran.dg/matmul_bounds_5.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions execution test
FAIL: gfortran.dg/matmul_bounds_5.f90 -O3 -g execution test
FAIL: gfortran.dg/matmul_bounds_5.f90 -Os execution test
and may be
FAIL: gfortran.dg/coarray_lib_this_image_2.f90 -O scan-tree-dump-times original "mylbound = parm...dim\\[0\\].stride >= 0 && parm...dim\\[0\\].ubound >= parm...dim\\[0\\].lbound \\|\\| parm...dim\\[0\\].stride < 0 \\?[^\n\r]* parm...dim\\[0\\].lbound : 1;" 1
FAIL: gfortran.dg/dependency_26.f90 -O scan-tree-dump-times original "&a" 1
> Reallocation on assignment also works now.
Confirmed.
>
> Still missing:
>
> 1. Control via an option, BLAS inlining has to take precedence
> 2. handling of matmul(a,b) occurring in the middle of an expression.
> 3. Bounds checking from the front end pass (basically, calling
> gfortran_runtime_error).
> 4. More test cases
>
> What do you think? 1. is straightforward,
I agree if the bounds are known at compile time (it will be nice to use -fblas-matmul-limit=n for the threshold).
However for bounds know at run time only, I don’t see how we can escape some king of versioning.
> we don't really need to do 2. for committing early in stage one.
Agreed: no point to mess with complicated expressions if the simple ones are not properly debugged!
> For 3, we could maybe just issue a STOP.
I have silenced these failures (gfortran.dg/matmul_bounds_*.f90) by adding -fno-frontend-optimize to the list of options. IMO having the same message with/without inlining is needed.
> 4. is required, I need to look at some more corner cases where bugs may still be lurking.
I also see the following failures
FAIL: gfortran.dg/shape_2.f90 -O0 execution test
FAIL: gfortran.dg/shape_2.f90 -O1 execution test
FAIL: gfortran.dg/shape_2.f90 -O2 execution test
FAIL: gfortran.dg/shape_2.f90 -O3 -fomit-frame-pointer execution test
FAIL: gfortran.dg/shape_2.f90 -O3 -fomit-frame-pointer -funroll-loops execution test
FAIL: gfortran.dg/shape_2.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions execution test
FAIL: gfortran.dg/shape_2.f90 -O3 -g execution test
FAIL: gfortran.dg/shape_2.f90 -Os execution test
FAIL: gfortran.dg/shape_2.f90 -g -flto execution test
A reduced test is:
program main
integer, dimension (40, 80) :: a = 1
call test (a)
contains
subroutine test (b)
integer, dimension (11:, -8:), target :: b
integer, dimension (:, :), pointer :: ptr
print *, lbound (b (:, :), 1)
! if (lbound (b (:, :), 1) .ne. 1) call abort
print *, lbound (b (:, :), 2)
! if (lbound (b (:, :), 2) .ne. 1) call abort
print *, lbound (b (20:30:3, 40), 1)
if (lbound (b (20:30:3, 40), 1) .ne. 1) call abort
end subroutine test
end program main
(the other tests in the original file succeed).
Thanks for working of this,
Dominique
> What do you think?
>
> Thomas
>
> <matmul-8.diff>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul
2015-04-10 14:15 ` Dominique d'Humières
@ 2015-04-10 16:57 ` Dominique d'Humières
2015-04-11 12:24 ` Thomas Koenig
1 sibling, 0 replies; 16+ messages in thread
From: Dominique d'Humières @ 2015-04-10 16:57 UTC (permalink / raw)
To: Thomas Koenig; +Cc: GCC Patches, GNU GFortran
> Le 10 avr. 2015 à 16:15, Dominique d'Humières <dominiq@lps.ens.fr> a écrit :
>
>
>> 4. is required, I need to look at some more corner cases where bugs may still be lurking.
>
Compiling the polyhedron test rnflow.f90 after the patch gives an ICE
f951: internal compiler error: gfc_array_dimen_size(): Bad dimension
Reduced test extracted from the subroutine evlrnf:
subroutine evlrnf (ptrs0t, nclsm, prnf0t)
real, dimension (1:nclsm,1:nclsm), intent (in) :: ptrs0t
integer, intent (in) :: nclsm ! nombre de classes
real, dimension (1:nclsm,1:nclsm), intent (out):: prnf0t
real, allocatable, dimension (:,:) :: utrsft ! probas up
real, allocatable, dimension (:) :: vwrkft
real, allocatable, dimension (:) :: vwrk2t
ncls = nclsm
allocate (utrsft (1:ncls,1:ncls))
allocate (vwrkft (1:ncls))
allocate (vwrk2t (1:ncls))
vwrk2t = matmul (utrsft, vwrkft) ! U Fk
deallocate (utrsft)
deallocate (vwrk2t)
deallocate (vwrkft)
end subroutine evlrnf
Dominique
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul
2015-04-10 14:15 ` Dominique d'Humières
2015-04-10 16:57 ` Dominique d'Humières
@ 2015-04-11 12:24 ` Thomas Koenig
2015-04-11 13:43 ` Mikael Morin
1 sibling, 1 reply; 16+ messages in thread
From: Thomas Koenig @ 2015-04-11 12:24 UTC (permalink / raw)
To: Dominique d'Humières; +Cc: GCC Patches, GNU GFortran
[-- Attachment #1: Type: text/plain, Size: 513 bytes --]
OK, here is a new version.
There is now an option for setting a maximum on the array size,
which takes its default from the BLAS limit (if specified).
Currently, only setting the maximum size to zero as a way of
disabling the unrolling is supported. I have done this in a
few test cases.
The bugs reported by Dominique have now been fixed.
Still to do: Bounds checking (a rather big one), honoring the
maximum size, test cases, ChangeLog, some more comments to the code.
Anything else?
Regards
Thomas
[-- Attachment #2: matmul-10.diff --]
[-- Type: text/x-patch, Size: 31446 bytes --]
Index: fortran/frontend-passes.c
===================================================================
--- fortran/frontend-passes.c (Revision 221944)
+++ fortran/frontend-passes.c (Arbeitskopie)
@@ -43,7 +43,11 @@ static void doloop_warn (gfc_namespace *);
static void optimize_reduction (gfc_namespace *);
static int callback_reduction (gfc_expr **, int *, void *);
static void realloc_strings (gfc_namespace *);
-static gfc_expr *create_var (gfc_expr *);
+static gfc_expr *create_var (gfc_expr *, const char *vname=NULL);
+static int optimize_matmul_assign (gfc_code **, int *, void *);
+static gfc_code * create_do_loop (gfc_expr *, gfc_expr *, gfc_expr *,
+ locus *, gfc_namespace *,
+ char *vname=NULL);
/* How deep we are inside an argument list. */
@@ -93,8 +97,24 @@ struct my_struct *evec;
static bool in_assoc_list;
+/* Counter for temporary variables. */
+
+static int var_num = 1;
+
+/* What sort of matrix we are dealing with when optimizing MATMUL. */
+
+enum matrix_case { none=0, A2B2, A2B1, A1B2 };
+
+/* Keep track of the number of expressions we have inserted so far
+ using create_var. */
+
+int n_vars;
+
+void gfc_debug_expr (gfc_expr *);
+
/* Entry point - run all passes for a namespace. */
+
void
gfc_run_passes (gfc_namespace *ns)
{
@@ -157,7 +177,7 @@ realloc_string_callback (gfc_code **c, int *walk_s
return 0;
current_code = c;
- n = create_var (expr2);
+ n = create_var (expr2, "trim");
co->expr2 = n;
return 0;
}
@@ -524,29 +544,11 @@ constant_string_length (gfc_expr *e)
}
-/* Returns a new expression (a variable) to be used in place of the old one,
- with an assignment statement before the current statement to set
- the value of the variable. Creates a new BLOCK for the statement if
- that hasn't already been done and puts the statement, plus the
- newly created variables, in that block. Special cases: If the
- expression is constant or a temporary which has already
- been created, just copy it. */
-
-static gfc_expr*
-create_var (gfc_expr * e)
+static gfc_namespace*
+insert_block ()
{
- char name[GFC_MAX_SYMBOL_LEN +1];
- static int num = 1;
- gfc_symtree *symtree;
- gfc_symbol *symbol;
- gfc_expr *result;
- gfc_code *n;
gfc_namespace *ns;
- int i;
- if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e))
- return gfc_copy_expr (e);
-
/* If the block hasn't already been created, do so. */
if (inserted_block == NULL)
{
@@ -578,7 +580,37 @@ constant_string_length (gfc_expr *e)
else
ns = inserted_block->ext.block.ns;
- sprintf(name, "__var_%d",num++);
+ return ns;
+}
+
+/* Returns a new expression (a variable) to be used in place of the old one,
+ with an optional assignment statement before the current statement to set
+ the value of the variable. Creates a new BLOCK for the statement if that
+ hasn't already been done and puts the statement, plus the newly created
+ variables, in that block. Special cases: If the expression is constant or
+ a temporary which has already been created, just copy it. */
+
+static gfc_expr*
+create_var (gfc_expr * e, const char *vname)
+{
+ char name[GFC_MAX_SYMBOL_LEN +1];
+ gfc_symtree *symtree;
+ gfc_symbol *symbol;
+ gfc_expr *result;
+ gfc_code *n;
+ gfc_namespace *ns;
+ int i;
+
+ if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e))
+ return gfc_copy_expr (e);
+
+ ns = insert_block ();
+
+ if (vname)
+ snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d_%s", var_num++, vname);
+ else
+ snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d", var_num++);
+
if (gfc_get_sym_tree (name, ns, &symtree, false) != 0)
gcc_unreachable ();
@@ -651,6 +683,7 @@ constant_string_length (gfc_expr *e)
result->ref->type = REF_ARRAY;
result->ref->u.ar.type = AR_FULL;
result->ref->u.ar.where = e->where;
+ result->ref->u.ar.dimen = e->rank;
result->ref->u.ar.as = symbol->ts.type == BT_CLASS
? CLASS_DATA (symbol)->as : symbol->as;
if (warn_array_temporaries)
@@ -658,6 +691,7 @@ constant_string_length (gfc_expr *e)
"Creating array temporary at %L", &(e->where));
}
+
/* Generate the new assignment. */
n = XCNEW (gfc_code);
n->op = EXEC_ASSIGN;
@@ -666,6 +700,7 @@ constant_string_length (gfc_expr *e)
n->expr1 = gfc_copy_expr (result);
n->expr2 = e;
*changed_statement = n;
+ n_vars ++;
return result;
}
@@ -724,7 +759,7 @@ cfe_expr_0 (gfc_expr **e, int *walk_subtrees,
if (gfc_dep_compare_functions (*ei, *ej, true) == 0)
{
if (newvar == NULL)
- newvar = create_var (*ei);
+ newvar = create_var (*ei, "fcn");
if (warn_function_elimination)
do_warn_function_elimination (*ej);
@@ -931,6 +966,7 @@ convert_elseif (gfc_code **c, int *walk_subtrees A
/* Don't walk subtrees. */
return 0;
}
+
/* Optimize a namespace, including all contained namespaces. */
static void
@@ -947,6 +983,9 @@ optimize_namespace (gfc_namespace *ns)
gfc_code_walker (&ns->code, convert_elseif, dummy_expr_callback, NULL);
gfc_code_walker (&ns->code, cfe_code, cfe_expr_0, NULL);
gfc_code_walker (&ns->code, optimize_code, optimize_expr, NULL);
+ if (flag_inline_matmul_limit != 0)
+ gfc_code_walker (&ns->code, optimize_matmul_assign, dummy_expr_callback,
+ NULL);
/* BLOCKs are handled in the expression walker below. */
for (ns = ns->contained; ns; ns = ns->sibling)
@@ -1222,7 +1261,7 @@ combine_array_constructor (gfc_expr *e)
if (op2->ts.type == BT_CHARACTER)
return false;
- scalar = create_var (gfc_copy_expr (op2));
+ scalar = create_var (gfc_copy_expr (op2), "constr");
oldbase = op1->value.constructor;
newbase = NULL;
@@ -1939,7 +1978,795 @@ doloop_warn (gfc_namespace *ns)
gfc_code_walker (&ns->code, doloop_code, do_function, NULL);
}
+/* Auxiliary function to build and simplify an array inquiry function.
+ dim is zero-based. */
+static gfc_expr *
+get_array_inq_function (gfc_expr *e, int dim, gfc_isym_id id)
+{
+ gfc_expr *fcn;
+ gfc_expr *dim_arg, *kind;
+ const char *name;
+
+ switch (id)
+ {
+ case GFC_ISYM_LBOUND:
+ name = "_gfortran_lbound";
+ break;
+
+ case GFC_ISYM_UBOUND:
+ name = "_gfortran_ubound";
+ break;
+
+ case GFC_ISYM_SIZE:
+ name = "_gfortran_size";
+ break;
+
+ default:
+ gcc_unreachable ();
+ }
+
+ dim_arg = gfc_get_int_expr (gfc_default_integer_kind, &e->where, dim);
+ kind = gfc_get_int_expr (gfc_default_integer_kind, &e->where,
+ gfc_index_integer_kind);
+
+ fcn = gfc_build_intrinsic_call (current_ns, id, name, e->where, 3,
+ gfc_copy_expr(e), dim_arg, kind);
+ gfc_simplify_expr (fcn, 0);
+ return fcn;
+}
+
+/* Build a not equals - expression. */
+
+static gfc_expr*
+build_logical_expr (gfc_expr *e1, gfc_expr *e2, gfc_intrinsic_op op)
+{
+ gfc_typespec ts;
+ gfc_expr *res;
+
+ ts.type = BT_LOGICAL;
+ ts.kind = gfc_default_logical_kind;
+ res = gfc_get_expr ();
+ res->where = e1->where;
+ res->expr_type = EXPR_OP;
+ res->value.op.op = op;
+ res->value.op.op1 = e1;
+ res->value.op.op2 = e2;
+ res->ts = ts;
+
+ return res;
+}
+/* Handle matrix reallocation. Caller is responsible to insert into
+ the code tree.
+
+ For the two-dimensional case, build
+
+ if (allocated(c)) then
+ if (size(c,1) /= size(a,1) .or. size(c,2) /= size(b,2)) then
+ deallocate(c)
+ allocate (c(size(a,1), size(b,2)))
+ end if
+ else
+ allocate (c(size(a,1),size(b,2)))
+ end if
+
+*/
+
+static gfc_code *
+matmul_lhs_realloc (gfc_expr *c, gfc_expr *a, gfc_expr *b,
+ enum matrix_case m_case)
+{
+
+ gfc_expr *allocated, *alloc_expr;
+ gfc_code *if_alloc_1, *if_alloc_2, *if_size_1, *if_size_2;
+ gfc_code *else_alloc;
+ gfc_code *deallocate, *allocate1, *allocate_else;
+ gfc_ref *ref;
+ gfc_array_ref *ar;
+ gfc_expr *cond, *ne1, *ne2;
+
+ alloc_expr = gfc_copy_expr (c);
+
+ ref = alloc_expr->ref;
+
+ while (ref)
+ {
+ if (ref->type == REF_ARRAY && ref->u.ar.type != AR_ELEMENT)
+ break;
+
+ ref = ref->next;
+
+ }
+ ar = &ref->u.ar;
+ gcc_assert (ar && ar->type == AR_FULL);
+
+ /* c comes in as a full ref. Change it into a copy and make it into an
+ element ref so it has the right for for ALLOCATE. In the same switch,
+ also generate the size comparison for the secod IF statement. */
+
+ ar->type = AR_ELEMENT;
+
+ switch (m_case)
+ {
+ case A2B2:
+ ar->start[0] = get_array_inq_function (a, 1, GFC_ISYM_SIZE);
+ ar->start[1] = get_array_inq_function (b, 2, GFC_ISYM_SIZE);
+ ne1 = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE),
+ get_array_inq_function (a, 1, GFC_ISYM_SIZE),
+ INTRINSIC_NE);
+ ne2 = build_logical_expr (get_array_inq_function (c, 2, GFC_ISYM_SIZE),
+ get_array_inq_function (b, 2, GFC_ISYM_SIZE),
+ INTRINSIC_NE);
+ cond = build_logical_expr (ne1, ne2, INTRINSIC_OR);
+ break;
+
+ case A2B1:
+ ar->start[0] = get_array_inq_function (a, 1, GFC_ISYM_SIZE);
+ cond = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE),
+ get_array_inq_function (a, 2, GFC_ISYM_SIZE),
+ INTRINSIC_NE);
+ break;
+
+ case A1B2:
+ ar->start[0] = get_array_inq_function (b, 1, GFC_ISYM_SIZE);
+ cond = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE),
+ get_array_inq_function (b, 2, GFC_ISYM_SIZE),
+ INTRINSIC_NE);
+ break;
+
+ default:
+ gcc_unreachable();
+
+ }
+
+ gfc_simplify_expr (cond, 0);
+
+ /* We need two identical allocate statements in two
+ branches of the IF statement. */
+
+ allocate1 = XCNEW (gfc_code);
+ allocate1->op = EXEC_ALLOCATE;
+ allocate1->ext.alloc.list = gfc_get_alloc ();
+ allocate1->loc = c->where;
+ allocate1->ext.alloc.list->expr = gfc_copy_expr (alloc_expr);
+
+ allocate_else = XCNEW (gfc_code);
+ allocate_else->op = EXEC_ALLOCATE;
+ allocate_else->ext.alloc.list = gfc_get_alloc ();
+ allocate_else->loc = c->where;
+ allocate_else->ext.alloc.list->expr = alloc_expr;
+
+ allocated = gfc_build_intrinsic_call (current_ns, GFC_ISYM_ALLOCATED,
+ "_gfortran_allocated", c->where,
+ 1, gfc_copy_expr (c));
+
+ deallocate = XCNEW (gfc_code);
+ deallocate->op = EXEC_DEALLOCATE;
+ deallocate->ext.alloc.list = gfc_get_alloc ();
+ deallocate->ext.alloc.list->expr = gfc_copy_expr (c);
+ deallocate->next = allocate1;
+ deallocate->loc = c->where;
+
+ if_size_2 = XCNEW (gfc_code);
+ if_size_2->op = EXEC_IF;
+ if_size_2->expr1 = cond;
+ if_size_2->loc = c->where;
+ if_size_2->next = deallocate;
+
+ if_size_1 = XCNEW (gfc_code);
+ if_size_1->op = EXEC_IF;
+ if_size_1->block = if_size_2;
+ if_size_1->loc = c->where;
+
+ else_alloc = XCNEW (gfc_code);
+ else_alloc->op = EXEC_IF;
+ else_alloc->loc = c->where;
+ else_alloc->next = allocate_else;
+
+ if_alloc_2 = XCNEW (gfc_code);
+ if_alloc_2->op = EXEC_IF;
+ if_alloc_2->expr1 = allocated;
+ if_alloc_2->loc = c->where;
+ if_alloc_2->next = if_size_1;
+ if_alloc_2->block = else_alloc;
+
+ if_alloc_1 = XCNEW (gfc_code);
+ if_alloc_1->op = EXEC_IF;
+ if_alloc_1->block = if_alloc_2;
+ if_alloc_1->loc = c->where;
+
+ return if_alloc_1;
+}
+
+
+/* Callback function for has_function_or_op. */
+
+static int
+is_function_or_op (gfc_expr **e, int *walk_subtrees ATTRIBUTE_UNUSED,
+ void *data ATTRIBUTE_UNUSED)
+{
+ if ((*e) == 0)
+ return 0;
+ else
+ return (*e)->expr_type == EXPR_FUNCTION
+ || (*e)->expr_type == EXPR_OP;
+}
+
+/* Returns true if the expression contains a function. */
+
+static bool
+has_function_or_op (gfc_expr **e)
+{
+ if (e == NULL)
+ return false;
+ else
+ return gfc_expr_walker (e, is_function_or_op, NULL);
+}
+
+/* Freeze a single expression. */
+
+static void
+freeze_expr (gfc_expr **ep)
+{
+ gfc_expr *ne;
+ if (has_function_or_op (ep))
+ {
+ ne = create_var (*ep, "freeze");
+ *ep = ne;
+ }
+}
+
+/* Go through an expression's references and assign them to temporary
+ variables if they contain functions. This is usually done prior to
+ front-end scalarization to avoid multiple invocations of functions. */
+
+static void
+freeze_references (gfc_expr *e)
+{
+ gfc_ref *r;
+ gfc_array_ref *ar;
+ int i;
+
+ for (r=e->ref; r; r=r->next)
+ {
+ if (r->type == REF_SUBSTRING)
+ {
+ if (r->u.ss.start != NULL)
+ freeze_expr (&r->u.ss.start);
+
+ if (r->u.ss.end != NULL)
+ freeze_expr (&r->u.ss.end);
+ }
+ else if (r->type == REF_ARRAY)
+ {
+ ar = &r->u.ar;
+ switch (ar->type)
+ {
+ case AR_FULL:
+ break;
+
+ case AR_SECTION:
+ for (i=0; i<ar->dimen; i++)
+ {
+ if (ar->dimen_type[i] == DIMEN_RANGE)
+ {
+ freeze_expr (&ar->start[i]);
+ freeze_expr (&ar->end[i]);
+ freeze_expr (&ar->stride[i]);
+ }
+ }
+ default:
+ break;
+ }
+ }
+ }
+}
+
+/* Convert to gfc_index_integer_kind if needed, just do a copy otherwise. */
+
+static gfc_expr *
+convert_to_index_kind (gfc_expr *e)
+{
+ gfc_expr *res;
+
+ gcc_assert (e != NULL);
+
+ res = gfc_copy_expr (e);
+
+ gcc_assert (e->ts.type == BT_INTEGER);
+
+ if (res->ts.kind != gfc_index_integer_kind)
+ {
+ gfc_typespec ts;
+ gfc_clear_ts (&ts);
+ ts.type = BT_INTEGER;
+ ts.kind = gfc_index_integer_kind;
+
+ gfc_convert_type_warn (e, &ts, 2, 0);
+ }
+
+ return res;
+}
+
+/* Function to create a DO loop including creation of the
+ iteration variable. gfc_expr are copied.*/
+
+static gfc_code *
+create_do_loop (gfc_expr *start, gfc_expr *end, gfc_expr *step, locus *where,
+ gfc_namespace *ns, char *vname)
+{
+
+ char name[GFC_MAX_SYMBOL_LEN +1];
+ gfc_symtree *symtree;
+ gfc_symbol *symbol;
+ gfc_expr *i;
+ gfc_code *n, *n2;
+
+ /* Create an expression for the iteration variable. */
+ if (vname)
+ sprintf (name, "__var_%d_do_%s", var_num++, vname);
+ else
+ sprintf (name, "__var_%d_do", var_num++);
+
+
+ if (gfc_get_sym_tree (name, ns, &symtree, false) != 0)
+ gcc_unreachable ();
+
+ symbol = symtree->n.sym;
+ symbol->ts.type = BT_INTEGER;
+ symbol->ts.kind = gfc_index_integer_kind;
+ symbol->attr.flavor = FL_VARIABLE;
+ symbol->attr.referenced = 1;
+ symbol->attr.dimension = 0;
+ symbol->attr.fe_temp = 1;
+ gfc_commit_symbol (symbol);
+
+ i = gfc_get_expr ();
+ i->expr_type = EXPR_VARIABLE;
+ i->ts = symbol->ts;
+ i->rank = 0;
+ i->where = *where;
+ i->symtree = symtree;
+
+ n = XCNEW (gfc_code);
+ n->op = EXEC_DO;
+ n->loc = *where;
+ n->ext.iterator = gfc_get_iterator ();
+ n->ext.iterator->var = i;
+ n->ext.iterator->start = convert_to_index_kind (start);
+ n->ext.iterator->end = convert_to_index_kind (end);
+ if (step)
+ n->ext.iterator->step = convert_to_index_kind (step);
+ else
+ n->ext.iterator->step = gfc_get_int_expr (gfc_index_integer_kind,
+ where, 1);
+
+ n2 = XCNEW (gfc_code);
+ n2->op = EXEC_DO;
+ n2->loc = *where;
+ n2->next = NULL;
+ n->block = n2;
+ return n;
+}
+
+/* Return an operation of one two gfc_expr (one if e2 is NULL. This assumes
+ compatible typespecs. */
+
+static gfc_expr *
+get_operand (gfc_intrinsic_op op, gfc_expr *e1, gfc_expr *e2)
+{
+ gfc_expr *res;
+
+ res = gfc_get_expr ();
+ res->ts = e1->ts;
+ res->where = e1->where;
+ res->expr_type = EXPR_OP;
+ res->value.op.op = op;
+ res->value.op.op1 = e1;
+ res->value.op.op2 = e2;
+ gfc_simplify_expr (res, 0);
+ return res;
+}
+
+/* Get the upper bound of the DO loops for matmul along
+ a dimension, i.e. size minus one. */
+static gfc_expr*
+get_size_m1 (gfc_expr *e, int dimen)
+{
+ mpz_t size;
+ gfc_expr *res;
+
+ if (gfc_array_dimen_size (e, dimen, &size))
+ {
+ res = gfc_get_constant_expr (BT_INTEGER,
+ gfc_index_integer_kind, &e->where);
+ mpz_sub_ui (res->value.integer, size, 1);
+ mpz_clear (size);
+ }
+ else
+ {
+ res = get_operand (INTRINSIC_MINUS,
+ get_array_inq_function (e, dimen + 1, GFC_ISYM_SIZE),
+ gfc_get_int_expr (gfc_index_integer_kind,
+ &e->where, 1));
+ gfc_simplify_expr (res, 0);
+ }
+
+ return res;
+}
+
+/* Function to return a scalarized expression. It is assumed that indices are
+ zero based to make generation of DO loops easier. A zero as index will
+ access the first element along a dimension. Single element references will
+ be skipped. A NULL as an expression will be replaced by a full reference.
+ This assumes that the index loops have gfc_index_integer_kind, and that all
+ references have been frozen. */
+
+static gfc_expr*
+scalarized_expr (gfc_expr *e_in, gfc_expr **index, int count_index)
+{
+ gfc_ref *ref;
+ gfc_array_ref *ar;
+ int i;
+ int rank;
+ gfc_expr *e;
+ int i_index;
+ bool was_fullref;
+
+ e = gfc_copy_expr(e_in);
+
+ rank = e->rank;
+
+ ref = e->ref;
+
+ while (ref)
+ {
+ if (ref->type == REF_ARRAY
+ && (ref->u.ar.type == AR_FULL || ref->u.ar.type == AR_SECTION))
+ break;
+ ref = ref->next;
+ }
+ ar = &ref->u.ar;
+
+ /* We scalarize count_index variables, reducing the rank by count_index. */
+
+ e->rank = rank - count_index;
+
+ was_fullref = ar->type == AR_FULL;
+
+ if (e->rank == 0)
+ ar->type = AR_ELEMENT;
+ else
+ ar->type = AR_SECTION;
+
+ /* Loop over the indices. For each index, create the expression
+ index + lbound(e, dim). */
+
+ i_index = 0;
+ for (i=0; i < ar->dimen; i++)
+ {
+ if (was_fullref || ar->dimen_type[i] == DIMEN_RANGE)
+ {
+ if (index[i_index] != NULL)
+ {
+ gfc_expr *lbound, *nindex;
+ gfc_expr *loopvar;
+
+ loopvar = gfc_copy_expr (index[i_index]);
+
+ if (ar->stride[i])
+ {
+ gfc_expr *tmp;
+
+ tmp = gfc_copy_expr(ar->stride[i]);
+ if (tmp->ts.kind != gfc_index_integer_kind)
+ {
+ gfc_typespec ts;
+ gfc_clear_ts (&ts);
+ ts.type = BT_INTEGER;
+ ts.kind = gfc_index_integer_kind;
+ gfc_convert_type (tmp, &ts, 2);
+ }
+ nindex = get_operand (INTRINSIC_TIMES, loopvar, tmp);
+ }
+ else
+ nindex = loopvar;
+
+ if (ar->start[i])
+ {
+ lbound = gfc_copy_expr (ar->start[i]);
+ if (lbound->ts.kind != gfc_index_integer_kind)
+ {
+ gfc_typespec ts;
+ gfc_clear_ts (&ts);
+ ts.type = BT_INTEGER;
+ ts.kind = gfc_index_integer_kind;
+ gfc_convert_type (lbound, &ts, 2);
+
+ }
+ }
+ else
+ lbound = get_array_inq_function (e_in, i+1, GFC_ISYM_LBOUND);
+
+ ar->dimen_type[i] = DIMEN_ELEMENT;
+ ar->start[i] = get_operand (INTRINSIC_PLUS, nindex, lbound);
+ ar->end[i] = NULL;
+ ar->stride[i] = NULL;
+ gfc_simplify_expr (ar->start[i], 0);
+ }
+ else if (was_fullref)
+ {
+ ar->dimen_type[i] = DIMEN_RANGE;
+ ar->start[i] = NULL;
+ ar->end[i] = NULL;
+ ar->stride[i] = NULL;
+ }
+ i_index ++;
+ }
+ }
+ return e;
+}
+
+
+/* Optimize assignments of the form c = matmul(a,b).
+ Handle only the cases currently where b and c are rank-two arrays.
+
+ This translates the code to
+
+ BLOCK
+ integer i,j,k
+ c = 0
+ do j=0, size(b,2)-1
+ do k=0, size(a, 2)-1
+ do i=0, size(a, 1)-1
+ c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) =
+ c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) +
+ a(i * stride(a,1) + lbound(a,1), k * stride(a,2) + lbound(a,2)) *
+ b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2))
+ end do
+ end do
+ end do
+ END BLOCK
+
+*/
+
+static int
+optimize_matmul_assign (gfc_code **c,
+ int *walk_subtrees ATTRIBUTE_UNUSED,
+ void *data ATTRIBUTE_UNUSED)
+{
+ gfc_code *co = *c;
+ gfc_expr *expr1, *expr2;
+ gfc_expr *matrix_a, *matrix_b;
+ gfc_actual_arglist *a, *b;
+ gfc_code *do_1, *do_2, *do_3, *assign_zero, *assign_matmul;
+ gfc_expr *zero_e;
+ gfc_expr *u1, *u2, *u3;
+ gfc_expr *list[2];
+ gfc_expr *ascalar, *bscalar, *cscalar;
+ gfc_expr *mult;
+ gfc_expr *var_1, *var_2, *var_3;
+ gfc_expr *zero;
+ gfc_namespace *ns;
+ gfc_intrinsic_op op_times, op_plus;
+ enum matrix_case m_case;
+ gfc_code *next_code;
+ gfc_code *p;
+ int i;
+
+
+ if (co->op != EXEC_ASSIGN)
+ return 0;
+
+ expr1 = co->expr1;
+ expr2 = co->expr2;
+ if (expr2->expr_type != EXPR_FUNCTION
+ || expr2->value.function.isym == NULL
+ || expr2->value.function.isym->id != GFC_ISYM_MATMUL)
+ return 0;
+
+ /* Handling only two matrices of dimension two for the time being. */
+
+ current_code = c;
+ inserted_block = NULL;
+ changed_statement = NULL;
+
+ a = expr2->value.function.actual;
+ matrix_a = a->expr;
+ b = a->next;
+ matrix_b = b->expr;
+
+ if (matrix_a->expr_type != EXPR_VARIABLE
+ || matrix_b->expr_type != EXPR_VARIABLE)
+ return 0;
+
+ if (matrix_a->rank == 2)
+ m_case = matrix_b->rank == 1 ? A2B1 : A2B2;
+ else
+ m_case = A1B2;
+
+ if (gfc_check_dependency (expr1, matrix_a, true)
+ || gfc_check_dependency (expr1, matrix_b, true))
+ return 0;
+
+ ns = insert_block ();
+
+ switch(expr1->ts.type)
+ {
+ case BT_INTEGER:
+ zero_e = gfc_get_int_expr (expr1->ts.kind, &expr1->where, 0);
+ op_times = INTRINSIC_TIMES;
+ op_plus = INTRINSIC_PLUS;
+ break;
+
+ case BT_LOGICAL:
+ op_times = INTRINSIC_AND;
+ op_plus = INTRINSIC_OR;
+ zero_e = gfc_get_logical_expr (expr1->ts.kind, &expr1->where,
+ 0);
+ break;
+ case BT_REAL:
+ zero_e = gfc_get_constant_expr (BT_REAL, expr1->ts.kind,
+ &expr1->where);
+ mpfr_set_si (zero_e->value.real, 0, GFC_RND_MODE);
+ op_times = INTRINSIC_TIMES;
+ op_plus = INTRINSIC_PLUS;
+ break;
+
+ case BT_COMPLEX:
+ zero_e = gfc_get_constant_expr (BT_COMPLEX, expr1->ts.kind,
+ &expr1->where);
+ mpc_set_si_si (zero_e->value.complex, 0, 0, GFC_RND_MODE);
+ op_times = INTRINSIC_TIMES;
+ op_plus = INTRINSIC_PLUS;
+
+ break;
+
+ default:
+ gcc_unreachable();
+ }
+
+ current_code = &ns->code;
+
+ n_vars = 0;
+ freeze_references (matrix_a);
+ freeze_references (matrix_b);
+
+ assign_zero = XCNEW (gfc_code);
+ assign_zero->op = EXEC_ASSIGN;
+ assign_zero->loc = co->loc;
+ assign_zero->expr1 = gfc_copy_expr (expr1);
+ assign_zero->expr2 = zero_e;
+
+ if (flag_realloc_lhs && gfc_is_reallocatable_lhs (expr1))
+ {
+ gfc_code *lhs_alloc;
+
+ lhs_alloc = matmul_lhs_realloc (expr1, matrix_a, matrix_b, m_case);
+ lhs_alloc->next = assign_zero;
+ next_code = lhs_alloc;
+ }
+ else
+ next_code = assign_zero;
+
+ if (n_vars == 0)
+ *current_code = next_code;
+ else
+ {
+ p = ns->code;
+ for (i=0; i<n_vars-1; i++)
+ p = p->next;
+
+ p->next = next_code;
+ }
+
+ zero = gfc_get_int_expr (gfc_index_integer_kind, &co->loc, 0);
+
+ assign_matmul = XCNEW (gfc_code);
+ assign_matmul->op = EXEC_ASSIGN;
+ assign_matmul->loc = co->loc;
+
+ switch (m_case)
+ {
+ case A2B2:
+ u1 = get_size_m1 (matrix_b, 1);
+ u2 = get_size_m1 (matrix_a, 1);
+ u3 = get_size_m1 (matrix_a, 0);
+
+ do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns);
+ do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns);
+ do_3 = create_do_loop (gfc_copy_expr (zero), u3, NULL, &co->loc, ns);
+
+ do_1->block->next = do_2;
+ do_2->block->next = do_3;
+ do_3->block->next = assign_matmul;
+
+ var_1 = do_1->ext.iterator->var;
+ var_2 = do_2->ext.iterator->var;
+ var_3 = do_3->ext.iterator->var;
+
+ list[0] = var_3;
+ list[1] = var_1;
+ cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 2);
+
+ list[0] = var_3;
+ list[1] = var_2;
+ ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2);
+
+ list[0] = var_2;
+ list[1] = var_1;
+ bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2);
+
+ break;
+
+ case A2B1:
+ u1 = get_size_m1 (matrix_b, 0);
+ u2 = get_size_m1 (matrix_a, 0);
+
+ do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns);
+ do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns);
+
+ do_1->block->next = do_2;
+ do_2->block->next = assign_matmul;
+
+ var_1 = do_1->ext.iterator->var;
+ var_2 = do_2->ext.iterator->var;
+
+ list[0] = var_2;
+ cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 1);
+
+ list[0] = var_2;
+ list[1] = var_1;
+ ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2);
+
+ list[0] = var_1;
+ bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 1);
+
+ break;
+
+ case A1B2:
+ u1 = get_size_m1 (matrix_b, 1);
+ u2 = get_size_m1 (matrix_a, 0);
+
+ do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns);
+ do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns);
+
+ do_1->block->next = do_2;
+ do_2->block->next = assign_matmul;
+
+ var_1 = do_1->ext.iterator->var;
+ var_2 = do_2->ext.iterator->var;
+
+ list[0] = var_1;
+ cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 1);
+
+ list[0] = var_2;
+ ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 1);
+
+ list[0] = var_2;
+ list[1] = var_1;
+ bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2);
+
+ break;
+
+ default:
+ gcc_unreachable();
+ }
+
+ assign_zero->next = do_1;
+
+
+ assign_matmul->expr1 = gfc_copy_expr (cscalar);
+
+ mult = get_operand (op_times, ascalar, bscalar);
+ assign_matmul->expr2 = get_operand (op_plus, cscalar, mult);
+
+ co->next = NULL;
+ gfc_free_statements(co);
+ gfc_free_expr (zero);
+ return 0;
+}
+
#define WALK_SUBEXPR(NODE) \
do \
{ \
Index: fortran/gfortran.h
===================================================================
--- fortran/gfortran.h (Revision 221944)
+++ fortran/gfortran.h (Arbeitskopie)
@@ -3225,4 +3225,8 @@ int gfc_code_walker (gfc_code **, walk_code_fn_t,
void gfc_convert_mpz_to_signed (mpz_t, int);
+/* trans-array.c */
+
+bool gfc_is_reallocatable_lhs (gfc_expr *);
+
#endif /* GCC_GFORTRAN_H */
Index: fortran/lang.opt
===================================================================
--- fortran/lang.opt (Revision 221944)
+++ fortran/lang.opt (Arbeitskopie)
@@ -542,6 +542,10 @@ Enum(gfc_init_local_real) String(inf) Value(GFC_IN
EnumValue
Enum(gfc_init_local_real) String(-inf) Value(GFC_INIT_REAL_NEG_INF)
+finline-matmul-limit=
+Fortran RejectNegative Joined UInteger Var(flag_inline_matmul_limit) Init(-1)
+-finline-matmul-limit=<n> Specify the size of the largest matrix for which matmul will be inlined
+
fmax-array-constructor=
Fortran RejectNegative Joined UInteger Var(flag_max_array_constructor) Init(65535)
-fmax-array-constructor=<n> Maximum number of objects in an array constructor
Index: fortran/options.c
===================================================================
--- fortran/options.c (Revision 221944)
+++ fortran/options.c (Arbeitskopie)
@@ -378,6 +378,11 @@ gfc_post_options (const char **pfilename)
if (!flag_automatic)
flag_max_stack_var_size = 0;
+ /* If we call BLAS directly, only inline up to the BLAS limit. */
+
+ if (flag_external_blas && flag_inline_matmul_limit < 0)
+ flag_inline_matmul_limit = flag_blas_matmul_limit;
+
/* Optimization implies front end optimization, unless the user
specified it directly. */
Index: fortran/simplify.c
===================================================================
--- fortran/simplify.c (Revision 221944)
+++ fortran/simplify.c (Arbeitskopie)
@@ -3445,6 +3445,32 @@ simplify_bound (gfc_expr *array, gfc_expr *dim, gf
done:
+
+ if (!upper && as && as->type == AS_ASSUMED_SHAPE && dim
+ && dim->expr_type == EXPR_CONSTANT && ref->u.ar.type != AR_SECTION)
+ {
+ if (!(array->symtree && array->symtree->n.sym
+ && (array->symtree->n.sym->attr.allocatable
+ || array->symtree->n.sym->attr.pointer)))
+ {
+ unsigned long int ndim;
+ gfc_expr *lower, *res;
+
+ ndim = mpz_get_si (dim->value.integer) - 1;
+ lower = as->lower[ndim];
+ if (lower->expr_type == EXPR_CONSTANT)
+ {
+ res = gfc_copy_expr (lower);
+ if (kind)
+ {
+ int nkind = mpz_get_si (kind->value.integer);
+ res->ts.kind = nkind;
+ }
+ return res;
+ }
+ }
+ }
+
if (as && (as->type == AS_DEFERRED || as->type == AS_ASSUMED_SHAPE
|| as->type == AS_ASSUMED_RANK))
return NULL;
Index: fortran/trans-array.h
===================================================================
--- fortran/trans-array.h (Revision 221944)
+++ fortran/trans-array.h (Arbeitskopie)
@@ -64,8 +64,6 @@ tree gfc_copy_only_alloc_comp (gfc_symbol *, tree,
tree gfc_alloc_allocatable_for_assignment (gfc_loopinfo*, gfc_expr*, gfc_expr*);
-bool gfc_is_reallocatable_lhs (gfc_expr *);
-
/* Add initialization for deferred arrays. */
void gfc_trans_deferred_array (gfc_symbol *, gfc_wrapped_block *);
/* Generate an initializer for a static pointer or allocatable array. */
Index: testsuite/gfortran.dg/function_optimize_1.f90
===================================================================
--- testsuite/gfortran.dg/function_optimize_1.f90 (Revision 221944)
+++ testsuite/gfortran.dg/function_optimize_1.f90 (Arbeitskopie)
@@ -1,5 +1,5 @@
! { dg-do compile }
-! { dg-options "-O -fdump-tree-original -Warray-temporaries" }
+! { dg-options "-O -fdump-tree-original -finline-matmul-limit=0 -Warray-temporaries" }
program main
implicit none
real, dimension(2,2) :: a, b, c, d
Index: testsuite/gfortran.dg/function_optimize_2.f90
===================================================================
--- testsuite/gfortran.dg/function_optimize_2.f90 (Revision 221944)
+++ testsuite/gfortran.dg/function_optimize_2.f90 (Arbeitskopie)
@@ -1,5 +1,5 @@
! { dg-do compile }
-! { dg-options "-O -faggressive-function-elimination -fdump-tree-original" }
+! { dg-options "-O -finline-matmul-limit=0 -faggressive-function-elimination -fdump-tree-original" }
program main
implicit none
real, dimension(2,2) :: a, b, c, d
Index: testsuite/gfortran.dg/function_optimize_5.f90
===================================================================
--- testsuite/gfortran.dg/function_optimize_5.f90 (Revision 221944)
+++ testsuite/gfortran.dg/function_optimize_5.f90 (Arbeitskopie)
@@ -1,5 +1,5 @@
! { dg-do compile }
-! { dg-options "-ffrontend-optimize -Wfunction-elimination" }
+! { dg-options "-ffrontend-optimize -finline-matmul-limit=0 -Wfunction-elimination" }
! Check the -ffrontend-optimize (in the absence of -O) and
! -Wfunction-elimination options.
program main
Index: testsuite/gfortran.dg/function_optimize_7.f90
===================================================================
--- testsuite/gfortran.dg/function_optimize_7.f90 (Revision 221944)
+++ testsuite/gfortran.dg/function_optimize_7.f90 (Arbeitskopie)
@@ -1,5 +1,5 @@
! { dg-do compile }
-! { dg-options "-O -fdump-tree-original -Warray-temporaries" }
+! { dg-options "-O -fdump-tree-original -Warray-temporaries -finline-matmul-limit=0" }
subroutine xx(n, m, a, b, c, d, x, z, i, s_in, s_out)
implicit none
integer, intent(in) :: n, m
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul
2015-04-11 12:24 ` Thomas Koenig
@ 2015-04-11 13:43 ` Mikael Morin
2015-04-11 17:01 ` Thomas Koenig
0 siblings, 1 reply; 16+ messages in thread
From: Mikael Morin @ 2015-04-11 13:43 UTC (permalink / raw)
To: Thomas Koenig, Dominique d'Humières; +Cc: GCC Patches, GNU GFortran
Hello, I haven't looked at the patch in detail yet, but...
Le 11/04/2015 14:24, Thomas Koenig a écrit :
> Still to do: Bounds checking (a rather big one),
... as you do a front-end to front-end transformation, you get bounds
checking for free, don't you?
Mikael
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul
2015-04-11 13:43 ` Mikael Morin
@ 2015-04-11 17:01 ` Thomas Koenig
0 siblings, 0 replies; 16+ messages in thread
From: Thomas Koenig @ 2015-04-11 17:01 UTC (permalink / raw)
To: Mikael Morin, Dominique d'Humières; +Cc: GCC Patches, GNU GFortran
Hi Mikael,
>> Still to do: Bounds checking (a rather big one),
> ... as you do a front-end to front-end transformation, you get bounds
> checking for free, don't you?
Only partially.
What the patch does is
integer i,j,k
c = 0
do j=0, size(b,2)-1
do k=0, size(a, 2)-1
do i=0, size(a, 1)-1
c(i * stride(c,1) + lbound(c,1), j * stride(c,2) +
lbound(c,2)) =
c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) +
a(i * stride(a,1) + lbound(a,1), k * stride(a,2) +
lbound(a,2)) *
b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2))
end do
end do
end do
If size(b,2) < size(c,2) or size(a,1) < size(c,1) or
size(a,2) < size(b,1), this will not get caught - no
array bounds violation in the DO loops, but illegal
code nonetheless.
Also, the error message is different, which should also be
changed.
What I would like to add to check before the loop, and then
add a "do not bounds-check" flag to the reference.
Thomas
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul
2015-04-05 12:32 Thomas Koenig
2015-04-05 13:20 ` Jerry DeLisle
@ 2015-04-07 19:24 ` David Malcolm
1 sibling, 0 replies; 16+ messages in thread
From: David Malcolm @ 2015-04-07 19:24 UTC (permalink / raw)
To: Thomas Koenig; +Cc: fortran, gcc-patches
On Sun, 2015-04-05 at 14:32 +0200, Thomas Koenig wrote:
> Hello world,
>
> this is a first draft of a patch to inline matmul (PR 37171). This is
(FWIW, the above PR# looks like it should be PR 37131)
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul
2015-04-05 13:20 ` Jerry DeLisle
@ 2015-04-05 13:48 ` Thomas Koenig
0 siblings, 0 replies; 16+ messages in thread
From: Thomas Koenig @ 2015-04-05 13:48 UTC (permalink / raw)
To: Jerry DeLisle, fortran, gcc-patches
Hi Jerry,
> I am curious about what performance gain results from this? I can see
> saving a library call to our runtime libraries. Do you have some timing
> results?
The speedup can be quite drastic for small matrices which can be
completely unrolled by -O3:
b1.f90:
program main
use b2
implicit none
real, dimension(3,3) :: a, b, c
integer :: i
call random_number(a)
call random_number(b)
do i=1,10**8
c = matmul(a,b)
call bar(b,c)
end do
end program main
b2.f90:
module b2
contains
subroutine bar(b,c)
real, dimension(3,3) :: b,c
end subroutine bar
end module b2
ig25@linux-fd1f:~/Krempel/Matmul> gfortran -O3 -fno-frontend-optimize
b2.f90 b1.f90 && time ./a.out
real 0m15.411s
user 0m15.404s
sys 0m0.001s
ig25@linux-fd1f:~/Krempel/Matmul> gfortran -O3 b2.f90 b1.f90 && time ./a.out
real 0m1.736s
user 0m1.735s
sys 0m0.001s
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul
2015-04-05 12:32 Thomas Koenig
@ 2015-04-05 13:20 ` Jerry DeLisle
2015-04-05 13:48 ` Thomas Koenig
2015-04-07 19:24 ` David Malcolm
1 sibling, 1 reply; 16+ messages in thread
From: Jerry DeLisle @ 2015-04-05 13:20 UTC (permalink / raw)
To: Thomas Koenig, fortran, gcc-patches
On 04/05/2015 05:32 AM, Thomas Koenig wrote:
--- snip ---
>
> So, what do you think about this?
>
> Thomas
I am curious about what performance gain results from this? I can see saving a
library call to our runtime libraries. Do you have some timing results?
Jerry
^ permalink raw reply [flat|nested] 16+ messages in thread
* [patch, fortran, RFC] First steps towards inlining matmul
@ 2015-04-05 12:32 Thomas Koenig
2015-04-05 13:20 ` Jerry DeLisle
2015-04-07 19:24 ` David Malcolm
0 siblings, 2 replies; 16+ messages in thread
From: Thomas Koenig @ 2015-04-05 12:32 UTC (permalink / raw)
To: fortran, gcc-patches
[-- Attachment #1: Type: text/plain, Size: 3687 bytes --]
Hello world,
this is a first draft of a patch to inline matmul (PR 37171). This is
preliminary, but functional as far as it goes. Definitely for the
next stage one :-)
Basically, it takes
c = matmul(a,b)
and converts this into
BLOCK
integer i,j,k
c = 0
do j=0, size(b,2)-1
do k=0, size(a, 2)-1
do i=0, size(a, 1)-1
c(i * stride(c,1) + lbound(c,1), j * stride(c,2) +
lbound(c,2)) =
c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) +
a(i * stride(a,1) + lbound(a,1), k * stride(a,2) +
lbound(a,2)) *
b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2))
end do
end do
end do
END BLOCK
For this, I wrote something like a scalarizer that only operates on the
AST only. I find it rather straigtforward to use (in contrast to the
one we have), but that may also be due to the fact that I wrote it
myself :-)
I chose zero-based loop variables because I didn't want to special-case
too many strange array bounds, and I trusted the induction variable
optimization and CSE in the middle end and in the RTL optimization.
If this approach suboptimizes something, let me know.
References are frozen to make sure that no additional evaluations are
done on.
The patch to simplify.c is not strictly necessary, but it should open
up some optimization possiblities and, frankly, makes the
-fdump-fortran-optimized output at least halfway readable.
This patch needs to be extended in different directions:
- Currently, it only handles the case of two-dimensional arguments
- Special cases like transpose(a) and spread(a) in the arguments
arguments should be added, possibly with a reordering of loops
and addition of some more variables.
- Temporary handling for arguments and results, which does not depend
on allocatable RHS. I am thinking of creating nested blocks,
where the inner block has arrays with the bounds of the
references.
- Cases like matmul(a+1,b) can also be handled in the loop,
without a temporary
- This needs command-line arguments, -finline-matmul and
-finline-matmul-limit=, similar to the handling of BLAS.
- Bounds checking could/should also be handled. Currently, this
causes the only regression because of a different error message.
We need to call gfortran_runtime_error directly from the front end,
and to be able to tell the bounds-checking code "do not worry about
bounds checking this, it has already been taken care of".
- I already have some test cases, but for this, I would really like
to cover all code paths. This also needs some more work.
So, what do you think about this?
Thomas
2015-04-05 Thomas Koenig <tkoenig@gcc.gnu.org>
* frontend-passes.c (create_var): Add optional argument
vname as part of the name. Split off block creation into
(insert_block): New function.
(cfe_expr_0): Use "fcn" as part of tempoary name.
(optimize_namesapce): Call optimize_matmul_assign.
(combine_array_constructor): Use "constr" as part of
temporary name.
(get_array_inq_function): New function.
(is_function_or_op): New function.
(has_function_or_op): New function.
(freeze_expr): New function.
(freeze_references): New function.
(convert_to_index_kind): New function.
(create_do_loop): New function.
(get_operand): New function.
(get_size_1): New function.
(scalarize_expr): New function.
(optimize_matmul_assign): New function.
* simplify.c (simplify_bound): Simplify the case of the
lower bound of an assumed-shape argument.
[-- Attachment #2: matmul-1.diff --]
[-- Type: text/x-patch, Size: 19198 bytes --]
Index: frontend-passes.c
===================================================================
--- frontend-passes.c (Revision 221757)
+++ frontend-passes.c (Arbeitskopie)
@@ -43,7 +43,11 @@ static void doloop_warn (gfc_namespace *);
static void optimize_reduction (gfc_namespace *);
static int callback_reduction (gfc_expr **, int *, void *);
static void realloc_strings (gfc_namespace *);
-static gfc_expr *create_var (gfc_expr *);
+static gfc_expr *create_var (gfc_expr *, const char *vname=NULL);
+static int optimize_matmul_assign (gfc_code **, int *, void *);
+static gfc_code * create_do_loop (gfc_expr *, gfc_expr *, gfc_expr *,
+ locus *, gfc_namespace *,
+ char *vname=NULL);
/* How deep we are inside an argument list. */
@@ -95,6 +99,10 @@ static bool in_assoc_list;
/* Entry point - run all passes for a namespace. */
+/* Counter for temporary variables. */
+
+static int var_num = 1;
+
void
gfc_run_passes (gfc_namespace *ns)
{
@@ -157,7 +165,7 @@ realloc_string_callback (gfc_code **c, int *walk_s
return 0;
current_code = c;
- n = create_var (expr2);
+ n = create_var (expr2, "trim");
co->expr2 = n;
return 0;
}
@@ -524,29 +532,11 @@ constant_string_length (gfc_expr *e)
}
-/* Returns a new expression (a variable) to be used in place of the old one,
- with an assignment statement before the current statement to set
- the value of the variable. Creates a new BLOCK for the statement if
- that hasn't already been done and puts the statement, plus the
- newly created variables, in that block. Special cases: If the
- expression is constant or a temporary which has already
- been created, just copy it. */
-
-static gfc_expr*
-create_var (gfc_expr * e)
+static gfc_namespace*
+insert_block ()
{
- char name[GFC_MAX_SYMBOL_LEN +1];
- static int num = 1;
- gfc_symtree *symtree;
- gfc_symbol *symbol;
- gfc_expr *result;
- gfc_code *n;
gfc_namespace *ns;
- int i;
- if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e))
- return gfc_copy_expr (e);
-
/* If the block hasn't already been created, do so. */
if (inserted_block == NULL)
{
@@ -578,7 +568,37 @@ constant_string_length (gfc_expr *e)
else
ns = inserted_block->ext.block.ns;
- sprintf(name, "__var_%d",num++);
+ return ns;
+}
+
+/* Returns a new expression (a variable) to be used in place of the old one,
+ with an optional assignment statement before the current statement to set
+ the value of the variable. Creates a new BLOCK for the statement if that
+ hasn't already been done and puts the statement, plus the newly created
+ variables, in that block. Special cases: If the expression is constant or
+ a temporary which has already been created, just copy it. */
+
+static gfc_expr*
+create_var (gfc_expr * e, const char *vname)
+{
+ char name[GFC_MAX_SYMBOL_LEN +1];
+ gfc_symtree *symtree;
+ gfc_symbol *symbol;
+ gfc_expr *result;
+ gfc_code *n;
+ gfc_namespace *ns;
+ int i;
+
+ if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e))
+ return gfc_copy_expr (e);
+
+ ns = insert_block ();
+
+ if (vname)
+ snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d_%s", var_num++, vname);
+ else
+ sprintf (name, "__var_%d", var_num++);
+
if (gfc_get_sym_tree (name, ns, &symtree, false) != 0)
gcc_unreachable ();
@@ -658,6 +678,7 @@ constant_string_length (gfc_expr *e)
"Creating array temporary at %L", &(e->where));
}
+
/* Generate the new assignment. */
n = XCNEW (gfc_code);
n->op = EXEC_ASSIGN;
@@ -724,7 +745,7 @@ cfe_expr_0 (gfc_expr **e, int *walk_subtrees,
if (gfc_dep_compare_functions (*ei, *ej, true) == 0)
{
if (newvar == NULL)
- newvar = create_var (*ei);
+ newvar = create_var (*ei, "fcn");
if (warn_function_elimination)
do_warn_function_elimination (*ej);
@@ -931,6 +952,7 @@ convert_elseif (gfc_code **c, int *walk_subtrees A
/* Don't walk subtrees. */
return 0;
}
+
/* Optimize a namespace, including all contained namespaces. */
static void
@@ -947,6 +969,7 @@ optimize_namespace (gfc_namespace *ns)
gfc_code_walker (&ns->code, convert_elseif, dummy_expr_callback, NULL);
gfc_code_walker (&ns->code, cfe_code, cfe_expr_0, NULL);
gfc_code_walker (&ns->code, optimize_code, optimize_expr, NULL);
+ gfc_code_walker (&ns->code, optimize_matmul_assign, dummy_expr_callback, NULL);
/* BLOCKs are handled in the expression walker below. */
for (ns = ns->contained; ns; ns = ns->sibling)
@@ -1222,7 +1245,7 @@ combine_array_constructor (gfc_expr *e)
if (op2->ts.type == BT_CHARACTER)
return false;
- scalar = create_var (gfc_copy_expr (op2));
+ scalar = create_var (gfc_copy_expr (op2), "constr");
oldbase = op1->value.constructor;
newbase = NULL;
@@ -1939,7 +1962,576 @@ doloop_warn (gfc_namespace *ns)
gfc_code_walker (&ns->code, doloop_code, do_function, NULL);
}
+/* Auxiliary function to build and simplify an array inquiry function.
+ dim is zero-based. */
+static gfc_expr *
+get_array_inq_function (gfc_expr *e, int dim, gfc_isym_id id)
+{
+
+ gfc_expr *fcn;
+ gfc_actual_arglist *actual_arglist, *n1, *n2;
+ gfc_expr *dim_arg, *kind;
+ const char *name;
+
+ fcn = gfc_get_expr ();
+ fcn->expr_type = EXPR_FUNCTION;
+ fcn->value.function.isym = gfc_intrinsic_function_by_id (id);
+ actual_arglist = gfc_get_actual_arglist ();
+ actual_arglist->expr = gfc_copy_expr (e);
+
+ n1 = gfc_get_actual_arglist ();
+ dim_arg = gfc_get_int_expr (gfc_default_integer_kind, &e->where, dim);
+ n1->expr = dim_arg;
+ actual_arglist->next = n1;
+
+ n2 = gfc_get_actual_arglist ();
+ kind = gfc_get_int_expr (gfc_default_integer_kind, &e->where,
+ gfc_index_integer_kind);
+ n2->expr = kind;
+ n1->next = n2;
+
+ fcn->value.function.actual = actual_arglist;
+ fcn->where = e->where;
+ fcn->ts.type = BT_INTEGER;
+ fcn->ts.kind = gfc_index_integer_kind;
+
+ switch (id)
+ {
+ case GFC_ISYM_LBOUND:
+ name = "__internal_lbound";
+ break;
+
+ case GFC_ISYM_UBOUND:
+ name = "__internal_ubound";
+ break;
+
+ case GFC_ISYM_SIZE:
+ name = "__internal_size";
+ break;
+
+ default:
+ gcc_unreachable ();
+ }
+
+ gfc_get_sym_tree (name, current_ns, &fcn->symtree, false);
+ fcn->symtree->n.sym->ts = fcn->ts;
+ fcn->symtree->n.sym->attr.flavor = FL_PROCEDURE;
+ fcn->symtree->n.sym->attr.function = 1;
+ fcn->symtree->n.sym->attr.elemental = 1;
+ fcn->symtree->n.sym->attr.referenced = 1;
+ fcn->symtree->n.sym->attr.access = ACCESS_PRIVATE;
+ gfc_commit_symbol (fcn->symtree->n.sym);
+ gfc_simplify_expr (fcn, 0);
+ return fcn;
+}
+
+/* Callback function for has_function_or_op. */
+
+static int
+is_function_or_op (gfc_expr **e, int *walk_subtrees ATTRIBUTE_UNUSED,
+ void *data ATTRIBUTE_UNUSED)
+{
+ if ((*e) == 0)
+ return 0;
+ else
+ return (*e)->expr_type == EXPR_FUNCTION
+ || (*e)->expr_type == EXPR_OP;
+}
+
+/* Returns true if the expression contains a function. */
+
+static bool
+has_function_or_op (gfc_expr **e)
+{
+ if (e == NULL)
+ return false;
+ else
+ return gfc_expr_walker (e, is_function_or_op, NULL);
+}
+
+/* Freeze a single expression. */
+
+static void
+freeze_expr (gfc_expr **ep)
+{
+ gfc_expr *ne;
+ if (has_function_or_op (ep))
+ {
+ ne = create_var (*ep, "freeze");
+ *ep = ne;
+ }
+}
+
+/* Go through an expression's references and assign them to temporary
+ variables if they contain functions. This is usually done prior to
+ front-end scalarization to avoid multiple invocations of functions. */
+
+static void
+freeze_references (gfc_expr *e)
+{
+ gfc_ref *r;
+ gfc_array_ref *ar;
+ int i;
+
+ for (r=e->ref; r; r=r->next)
+ {
+ if (r->type == REF_SUBSTRING)
+ {
+ if (r->u.ss.start != NULL)
+ freeze_expr (&r->u.ss.start);
+
+ if (r->u.ss.end != NULL)
+ freeze_expr (&r->u.ss.end);
+ }
+ else if (r->type == REF_ARRAY)
+ {
+ ar = &r->u.ar;
+ switch (ar->type)
+ {
+ case AR_FULL:
+ break;
+
+ case AR_SECTION:
+ for (i=0; i<ar->dimen; i++)
+ {
+ if (ar->dimen_type[i] == DIMEN_RANGE)
+ {
+ freeze_expr (&ar->start[i]);
+ freeze_expr (&ar->end[i]);
+ freeze_expr (&ar->stride[i]);
+ }
+ }
+ default:
+ break;
+ }
+ }
+ }
+}
+
+/* Convert to gfc_index_integer_kind if needed, just do a copy otherwise. */
+
+static gfc_expr *
+convert_to_index_kind (gfc_expr *e)
+{
+ gfc_expr *res;
+
+ gcc_assert (e != NULL);
+
+ res = gfc_copy_expr (e);
+
+ gcc_assert (e->ts.type == BT_INTEGER);
+
+ if (res->ts.kind != gfc_index_integer_kind)
+ {
+ gfc_typespec ts;
+ gfc_clear_ts (&ts);
+ ts.type = BT_INTEGER;
+ ts.kind = gfc_index_integer_kind;
+
+ gfc_convert_type_warn (e, &ts, 2, 0);
+ }
+
+ return res;
+}
+
+/* Function to create a DO loop including creation of the
+ iteration variable. gfc_expr are copied.*/
+
+static gfc_code *
+create_do_loop (gfc_expr *start, gfc_expr *end, gfc_expr *step, locus *where,
+ gfc_namespace *ns, char *vname)
+{
+
+ char name[GFC_MAX_SYMBOL_LEN +1];
+ gfc_symtree *symtree;
+ gfc_symbol *symbol;
+ gfc_expr *i;
+ gfc_code *n, *n2;
+
+ /* Create an expression for the iteration variable. */
+ if (vname)
+ sprintf (name, "__var_%d_do_%s", var_num++, vname);
+ else
+ sprintf (name, "__var_%d_do", var_num++);
+
+
+ if (gfc_get_sym_tree (name, ns, &symtree, false) != 0)
+ gcc_unreachable ();
+
+ symbol = symtree->n.sym;
+ symbol->ts.type = BT_INTEGER;
+ symbol->ts.kind = gfc_index_integer_kind;
+ symbol->attr.flavor = FL_VARIABLE;
+ symbol->attr.referenced = 1;
+ symbol->attr.dimension = 0;
+ symbol->attr.fe_temp = 1;
+ gfc_commit_symbol (symbol);
+
+ i = gfc_get_expr ();
+ i->expr_type = EXPR_VARIABLE;
+ i->ts = symbol->ts;
+ i->rank = 0;
+ i->where = *where;
+ i->symtree = symtree;
+
+ n = XCNEW (gfc_code);
+ n->op = EXEC_DO;
+ n->loc = *where;
+ n->ext.iterator = gfc_get_iterator ();
+ n->ext.iterator->var = i;
+ n->ext.iterator->start = convert_to_index_kind (start);
+ n->ext.iterator->end = convert_to_index_kind (end);
+ if (step)
+ n->ext.iterator->step = convert_to_index_kind (step);
+ else
+ n->ext.iterator->step = gfc_get_int_expr (gfc_index_integer_kind,
+ where, 1);
+
+ n2 = XCNEW (gfc_code);
+ n2->op = EXEC_DO;
+ n2->loc = *where;
+ n2->next = NULL;
+ n->block = n2;
+ return n;
+}
+
+/* Return an operation of one two gfc_expr (one if e2 is NULL. This assumes
+ compatible typespecs. */
+
+static gfc_expr *
+get_operand (gfc_intrinsic_op op, gfc_expr *e1, gfc_expr *e2)
+{
+ gfc_expr *res;
+
+ res = gfc_get_expr ();
+ res->ts = e1->ts;
+ res->where = e1->where;
+ res->expr_type = EXPR_OP;
+ res->value.op.op = op;
+ res->value.op.op1 = e1;
+ res->value.op.op2 = e2;
+ gfc_simplify_expr (res, 0);
+ return res;
+}
+
+/* Get the upper bound of the DO loops for matmul along
+ a dimension, i.e. size minus one. */
+static gfc_expr*
+get_size_m1 (gfc_expr *e, int dimen)
+{
+ mpz_t size;
+ gfc_expr *res;
+
+ if (gfc_array_dimen_size (e, dimen, &size))
+ {
+ res = gfc_get_constant_expr (BT_INTEGER,
+ gfc_index_integer_kind, &e->where);
+ mpz_sub_ui (res->value.integer, size, 1);
+ mpz_clear (size);
+ }
+ else
+ {
+ res = get_operand (INTRINSIC_MINUS,
+ get_array_inq_function (e, dimen + 1, GFC_ISYM_SIZE),
+ gfc_get_int_expr (gfc_index_integer_kind,
+ &e->where, 1));
+ gfc_simplify_expr (res, 0);
+ }
+
+ return res;
+}
+
+/* Function to return a scalarized expression. It is assumed that indices are
+ zero based to make generation of DO loops easier. A zero as index will
+ access the first element along a dimension. Single element references will
+ be skipped. A NULL as an expression will be replaced by a full reference.
+ This assumes that the index loops have gfc_index_integer_kind, and that all
+ references have been frozen. */
+
+static gfc_expr*
+scalarized_expr (gfc_expr *e_in, gfc_expr **index, int count_index)
+{
+ gfc_ref *ref;
+ gfc_array_ref *ar;
+ int i;
+ int rank;
+ gfc_expr *e;
+ int i_index;
+ bool was_fullref;
+
+ e = gfc_copy_expr(e_in);
+
+ rank = e->rank;
+
+ ref = e->ref;
+
+ while (ref)
+ {
+ if (ref->type == REF_ARRAY
+ && (ref->u.ar.type == AR_FULL || ref->u.ar.type == AR_SECTION))
+ break;
+ }
+ ar = &ref->u.ar;
+
+ /* We scalarize count_index variables, reducing the rank by count_index. */
+
+ e->rank = rank - count_index;
+
+ was_fullref = ar->type == AR_FULL;
+
+ if (e->rank == 0)
+ ar->type = AR_ELEMENT;
+ else
+ ar->type = AR_SECTION;
+
+ ar->dimen = rank;
+
+ /* Loop over the indices. For each index, create the expression
+ index + lbound(e, dim). */
+
+ i_index = 0;
+ for (i=0; i<rank; i++)
+ {
+ if (was_fullref || ar->dimen_type[i] == DIMEN_RANGE)
+ {
+ if (index[i_index] != NULL)
+ {
+ gfc_expr *lbound, *nindex;
+ gfc_expr *loopvar;
+
+ loopvar = gfc_copy_expr (index[i_index]);
+
+ if (ar->stride[i])
+ {
+ gfc_expr *tmp;
+
+ tmp = gfc_copy_expr(ar->stride[i]);
+ if (tmp->ts.kind != gfc_index_integer_kind)
+ {
+ gfc_typespec ts;
+ gfc_clear_ts (&ts);
+ ts.type = BT_INTEGER;
+ ts.kind = gfc_index_integer_kind;
+ gfc_convert_type (tmp, &ts, 2);
+ }
+ nindex = get_operand (INTRINSIC_TIMES, loopvar, tmp);
+ }
+ else
+ nindex = loopvar;
+
+ if (ar->start[i])
+ {
+ lbound = gfc_copy_expr (ar->start[i]);
+ if (lbound->ts.kind != gfc_index_integer_kind)
+ {
+ gfc_typespec ts;
+ gfc_clear_ts (&ts);
+ ts.type = BT_INTEGER;
+ ts.kind = gfc_index_integer_kind;
+ gfc_convert_type (lbound, &ts, 2);
+
+ }
+ }
+ else
+ lbound = get_array_inq_function (e_in, i+1, GFC_ISYM_LBOUND);
+
+ ar->dimen_type[i] = DIMEN_ELEMENT;
+ ar->start[i] = get_operand (INTRINSIC_PLUS, nindex, lbound);
+ ar->end[i] = NULL;
+ ar->stride[i] = NULL;
+ gfc_simplify_expr (ar->start[i], 0);
+ }
+ else if (was_fullref)
+ {
+ ar->dimen_type[i] = DIMEN_RANGE;
+ ar->start[i] = NULL;
+ ar->end[i] = NULL;
+ ar->stride[i] = NULL;
+ }
+ i_index ++;
+ }
+ }
+ return e;
+}
+
+
+/* Optimize assignments of the form c = matmul(a,b).
+ Handle only the cases currently where b and c are rank-two arrays.
+
+ This translates the code to
+
+ BLOCK
+ integer i,j,k
+ c = 0
+ do j=0, size(b,2)-1
+ do k=0, size(a, 2)-1
+ do i=0, size(a, 1)-1
+ c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) =
+ c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) +
+ a(i * stride(a,1) + lbound(a,1), k * stride(a,2) + lbound(a,2)) *
+ b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2))
+ end do
+ end do
+ end do
+ END BLOCK
+
+*/
+
+static int
+optimize_matmul_assign (gfc_code **c,
+ int *walk_subtrees ATTRIBUTE_UNUSED,
+ void *data ATTRIBUTE_UNUSED)
+{
+ gfc_code *co = *c;
+ gfc_expr *expr1, *expr2;
+ gfc_expr *matrix_a, *matrix_b;
+ gfc_actual_arglist *a, *b;
+ gfc_code *do_1, *do_2, *do_3, *assign_zero, *assign_matmul;
+ gfc_expr *zero_e;
+ gfc_expr *u1, *u2, *u3;
+ gfc_expr *list[2];
+ gfc_expr *ascalar, *bscalar, *cscalar;
+ gfc_expr *mult;
+ gfc_expr *var_1, *var_2, *var_3;
+ gfc_namespace *ns;
+ gfc_intrinsic_op op_times, op_plus;
+
+ if (co->op != EXEC_ASSIGN)
+ return 0;
+
+ expr1 = co->expr1;
+ expr2 = co->expr2;
+ if (expr2->expr_type != EXPR_FUNCTION
+ || expr2->value.function.isym == NULL
+ || expr2->value.function.isym->id != GFC_ISYM_MATMUL)
+ return 0;
+
+ /* Handling only two matrices of dimension two for the time being. */
+
+ current_code = c;
+ inserted_block = NULL;
+ changed_statement = NULL;
+
+ a = expr2->value.function.actual;
+ matrix_a = a->expr;
+ b = a->next;
+ matrix_b = b->expr;
+
+ if (matrix_a->expr_type != EXPR_VARIABLE
+ || matrix_b->expr_type != EXPR_VARIABLE)
+ return 0;
+
+ if (matrix_a->rank != 2 || matrix_b->rank != 2)
+ return 0;
+
+ if (gfc_check_dependency (expr1, matrix_a, true)
+ || gfc_check_dependency (expr1, matrix_b, true))
+ return 0;
+
+ ns = insert_block ();
+
+ switch(expr1->ts.type)
+ {
+ case BT_INTEGER:
+ zero_e = gfc_get_int_expr (expr1->ts.kind, &expr1->where, 0);
+ op_times = INTRINSIC_TIMES;
+ op_plus = INTRINSIC_PLUS;
+ break;
+
+ case BT_LOGICAL:
+ op_times = INTRINSIC_AND;
+ op_plus = INTRINSIC_OR;
+ zero_e = gfc_get_logical_expr (expr1->ts.kind, &expr1->where,
+ 0);
+ break;
+ case BT_REAL:
+ zero_e = gfc_get_constant_expr (BT_REAL, expr1->ts.kind,
+ &expr1->where);
+ mpfr_set_si (zero_e->value.real, 0, GFC_RND_MODE);
+ op_times = INTRINSIC_TIMES;
+ op_plus = INTRINSIC_PLUS;
+ break;
+
+ case BT_COMPLEX:
+ zero_e = gfc_get_constant_expr (BT_COMPLEX, expr1->ts.kind,
+ &expr1->where);
+ mpc_set_si_si (zero_e->value.complex, 0, 0, GFC_RND_MODE);
+ op_times = INTRINSIC_TIMES;
+ op_plus = INTRINSIC_PLUS;
+
+ break;
+
+ default:
+ gcc_unreachable();
+ }
+
+ assign_zero = XCNEW (gfc_code);
+ assign_zero->op = EXEC_ASSIGN;
+ assign_zero->loc = co->loc;
+ assign_zero->expr1 = gfc_copy_expr (expr1);
+ assign_zero->expr2 = zero_e;
+
+ ns->code = assign_zero;
+ current_code = &ns->code;
+
+ freeze_references (matrix_a);
+ freeze_references (matrix_b);
+
+ u1 = get_size_m1 (matrix_b, 1);
+ u2 = get_size_m1 (matrix_a, 1);
+ u3 = get_size_m1 (matrix_a, 0);
+
+ do_1 = create_do_loop (gfc_get_int_expr (gfc_index_integer_kind,
+ &co->loc, 0),
+ u1, NULL, &co->loc, ns);
+
+ do_2 = create_do_loop (gfc_get_int_expr (gfc_index_integer_kind,
+ &co->loc, 0),
+ u2, NULL, &co->loc, ns);
+
+ do_1->block->next = do_2;
+
+ do_3 = create_do_loop (gfc_get_int_expr (gfc_index_integer_kind,
+ &co->loc, 0),
+ u3, NULL, &co->loc, ns);
+
+ do_2->block->next = do_3;
+
+ var_1 = do_1->ext.iterator->var;
+ var_2 = do_2->ext.iterator->var;
+ var_3 = do_3->ext.iterator->var;
+
+ assign_zero->next = do_1;
+
+ assign_matmul = XCNEW (gfc_code);
+ assign_matmul->op = EXEC_ASSIGN;
+ assign_matmul->loc = co->loc;
+
+ list[0] = var_3;
+ list[1] = var_1;
+ cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 2);
+
+ list[0] = var_3;
+ list[1] = var_2;
+ ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2);
+
+ list[0] = var_2;
+ list[1] = var_1;
+ bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2);
+
+ assign_matmul->expr1 = gfc_copy_expr (cscalar);
+
+ mult = get_operand (op_times, ascalar, bscalar);
+ assign_matmul->expr2 = get_operand (op_plus, cscalar, mult);
+
+ do_3->block->next = assign_matmul;
+
+ co->next = NULL;
+ gfc_free_statements(co);
+ return 0;
+}
+
#define WALK_SUBEXPR(NODE) \
do \
{ \
[-- Attachment #3: simplify-1.diff --]
[-- Type: text/x-patch, Size: 819 bytes --]
Index: simplify.c
===================================================================
--- simplify.c (Revision 221757)
+++ simplify.c (Arbeitskopie)
@@ -3445,6 +3445,25 @@ simplify_bound (gfc_expr *array, gfc_expr *dim, gf
done:
+ if (!upper && as->type == AS_ASSUMED_SHAPE && dim)
+ {
+ if (dim->expr_type == EXPR_CONSTANT)
+ {
+ unsigned long int ndim;
+ gfc_expr *lower, *res;
+
+ ndim = mpz_get_si (dim->value.integer) - 1;
+ lower = as->lower[ndim];
+ if (lower->expr_type == EXPR_CONSTANT)
+ {
+ int nkind = mpz_get_si (kind->value.integer);
+ res = gfc_copy_expr (lower);
+ res->ts.kind = nkind;
+ return res;
+ }
+ }
+ }
+
if (as && (as->type == AS_DEFERRED || as->type == AS_ASSUMED_SHAPE
|| as->type == AS_ASSUMED_RANK))
return NULL;
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2015-04-11 17:01 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-05 13:55 [patch, fortran, RFC] First steps towards inlining matmul Dominique Dhumieres
2015-04-05 14:33 ` Thomas Koenig
2015-04-05 18:25 ` Dominique d'Humières
2015-04-05 20:37 ` Thomas Koenig
2015-04-05 23:15 ` Dominique d'Humières
2015-04-06 12:18 ` Dominique d'Humières
2015-04-09 22:18 ` Thomas Koenig
2015-04-10 14:15 ` Dominique d'Humières
2015-04-10 16:57 ` Dominique d'Humières
2015-04-11 12:24 ` Thomas Koenig
2015-04-11 13:43 ` Mikael Morin
2015-04-11 17:01 ` Thomas Koenig
-- strict thread matches above, loose matches on Subject: below --
2015-04-05 12:32 Thomas Koenig
2015-04-05 13:20 ` Jerry DeLisle
2015-04-05 13:48 ` Thomas Koenig
2015-04-07 19:24 ` David Malcolm
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).