* Re: [patch, fortran, RFC] First steps towards inlining matmul
@ 2015-04-05 13:55 Dominique Dhumieres
2015-04-05 14:33 ` Thomas Koenig
0 siblings, 1 reply; 16+ messages in thread
From: Dominique Dhumieres @ 2015-04-05 13:55 UTC (permalink / raw)
To: tkoenig; +Cc: gcc-patches, fortran, jvdelisle
> > So, what do you think about this?
>
> I am curious about what performance gain results from this?
> I can see saving a library call to our runtime libraries.
> Do you have some timing results?
>
> Jerry
IMO the inlining of MATMUL should be restricted to small matrices (less than 4x4, 9x9
or 16x16 depending of your field!-). I have a variant of the polyhedron test induct.f90
in which I have replaced three dot products by the equivalent matmul and which executes
ten times slower than the original test.
Dominique
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul 2015-04-05 13:55 [patch, fortran, RFC] First steps towards inlining matmul Dominique Dhumieres @ 2015-04-05 14:33 ` Thomas Koenig 2015-04-05 18:25 ` Dominique d'Humières 0 siblings, 1 reply; 16+ messages in thread From: Thomas Koenig @ 2015-04-05 14:33 UTC (permalink / raw) To: Dominique Dhumieres; +Cc: gcc-patches, fortran, jvdelisle Hi Dominique, > IMO the inlining of MATMUL should be restricted to small matrices (less than 4x4, 9x9 > or 16x16 depending of your field!-) The problem with the library function we have is that it is quite general; it can deal with all the complexity of assumed-shape array arguments. Inlining allows the compiler to take advantage of contiguous memory, compile-time knowledge of array sizes, etc. to allow for vectorization or even complete unrolling. Of course, if you use c=matmul(a,b) where all three are assumed-shape arrays, your advantage over the library function will be minimal at best. It will be interesting to see at what size calling BLAS directly will be better. Thomas ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul 2015-04-05 14:33 ` Thomas Koenig @ 2015-04-05 18:25 ` Dominique d'Humières 2015-04-05 20:37 ` Thomas Koenig 0 siblings, 1 reply; 16+ messages in thread From: Dominique d'Humières @ 2015-04-05 18:25 UTC (permalink / raw) To: Thomas Koenig; +Cc: gcc-patches, fortran, jvdelisle I have done some timings (1) with the test given below, before the patch I get (last column in Gflops) [Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate [Book15] f90/bug% time a.out Time, MATMUL: 373.708008 373.69497100000001 4.2815668504139435 Time, MATMUL: 172.086609 23.117919000000001 69.210381782201068 545.374u 0.537s 6:36.93 137.5% 0+0k 0+0io 0pf+0w [Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90 -framework Accelerate [Book15] f90/bug% time a.out Time, MATMUL: 176.327881 23.855111999999998 67.071577781735002 Time, MATMUL: 182.086746 24.453551000000001 65.430170039516952 358.059u 0.471s 0:48.43 740.2% 0+0k 0+0io 0pf+0w after the patch [Book15] f90/bug% time a.out Time, MATMUL: 392.415436 392.41728799999999 4.0772923337669056 Time, MATMUL: 171.690399 22.905118000000002 69.853383859450091 563.671u 0.551s 6:55.44 135.8% 0+0k 0+0io 0pf+0w [Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90 -framework Accelerate [Book15] f90/bug% time a.out Time, MATMUL: 392.850342 392.84190799999999 4.0728852177349673 Time, MATMUL: 174.534821 23.784797000000001 67.269861500184334 566.824u 0.674s 6:56.74 136.1% 0+0k 0+0io 0pf+0w which means that -fexternal-blas should disable the inlining. program t2 implicit none REAL time_begin, time_end integer, parameter :: n=2000; integer(8) :: ts, te, rate8, cmax8 real(8) :: elapsed REAL(8) :: a(n,n), b(n,n), c(n,n) integer, parameter :: m = 100 integer :: i call RANDOM_NUMBER(a) call RANDOM_NUMBER(b) call cpu_time(time_begin) call SYSTEM_CLOCK (ts, rate8, cmax8) do i = 1,m a(1,1) = a(1,1) + 0.1 c = MATMUL(a,b) enddo call SYSTEM_CLOCK (te, rate8, cmax8) call cpu_time(time_end) elapsed = real(te-ts, kind=8)/real(rate8, kind=8) PRINT *, 'Time, MATMUL: ',time_end-time_begin, elapsed , 2*m*real(n, kind=8)**3/(10**9*elapsed) call cpu_time(time_begin) call SYSTEM_CLOCK (ts, rate8, cmax8) do i = 1,m a(1,1) = a(1,1) + 0.1 call dgemm('n','n',n, n, n, dble(1.0), a, n, b, n, dble(0.0), c, n) enddo call SYSTEM_CLOCK (te, rate8, cmax8) call cpu_time(time_end) elapsed = real(te-ts, kind=8)/real(rate8, kind=8) PRINT *, 'Time, MATMUL: ',time_end-time_begin, elapsed , 2*m*real(n, kind=8)**3/(10**9*elapsed) end program ! http://groups.google.com/group/comp.lang.fortran/browse_thread/thread/1cba8e6ce5080197 (2) The polyhedron test compiled with -fprotect-parens -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto runs in ~6.5s. After applying the patch below, it runs in ~66s with/without the patch. Dominique --- induct.f90 2005-10-11 22:53:32.000000000 +0200 +++ induct_vmc.f90 2015-04-05 19:06:30.000000000 +0200 @@ -1644,18 +1644,17 @@ contains coil_tmp_vector(1) = -sin(theta) coil_tmp_vector(2) = cos(theta) coil_tmp_vector(3) = 0.0_longreal - coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:)) - coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:)) - coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:)) + coil_current_vec = matmul(rotate_coil,coil_tmp_vector) ! do j = 1, 9 c_vector(3) = 0.5 * h_coil * z1gauss(j) ! ! rotate coil vector into the global coordinate system and translate it ! - rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx - rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy - rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz + rot_c_vector = matmul(rotate_coil,c_vector) + rot_c_vector(1) = rot_c_vector(1) + dx + rot_c_vector(2) = rot_c_vector(2) + dy + rot_c_vector(3) = rot_c_vector(3) + dz ! do k = 1, 9 q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal) @@ -1664,9 +1663,7 @@ contains ! ! rotate quad vector into the global coordinate system ! - rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:)) - rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:)) - rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:)) + rot_q_vector = matmul(rotate_quad,q_vector) ! ! compute and add in quadrature term ! @@ -1756,18 +1753,17 @@ contains coil_tmp_vector(1) = -sin(theta) coil_tmp_vector(2) = cos(theta) coil_tmp_vector(3) = 0.0_longreal - coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:)) - coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:)) - coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:)) + coil_current_vec = matmul(rotate_coil,coil_tmp_vector) ! do j = 1, 9 c_vector(3) = 0.5 * h_coil * z1gauss(j) ! ! rotate coil vector into the global coordinate system and translate it ! - rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx - rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy - rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz + rot_c_vector = matmul(rotate_coil,c_vector) + rot_c_vector(1) = rot_c_vector(1) + dx + rot_c_vector(2) = rot_c_vector(2) + dy + rot_c_vector(3) = rot_c_vector(3) + dz ! do k = 1, 9 q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal) @@ -1776,9 +1772,7 @@ contains ! ! rotate quad vector into the global coordinate system ! - rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:)) - rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:)) - rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:)) + rot_q_vector = matmul(rotate_quad,q_vector) ! ! compute and add in quadrature term ! @@ -2061,18 +2055,17 @@ contains ! ! compute current vector for the coil in the global coordinate system ! - coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:)) - coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:)) - coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:)) + coil_current_vec = matmul(rotate_coil,coil_tmp_vector) ! do j = 1, 9 c_vector(3) = 0.5 * h_coil * z1gauss(j) ! ! rotate coil vector into the global coordinate system and translate it ! - rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx - rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy - rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz + rot_c_vector = matmul(rotate_coil,c_vector) + rot_c_vector(1) = rot_c_vector(1) + dx + rot_c_vector(2) = rot_c_vector(2) + dy + rot_c_vector(3) = rot_c_vector(3) + dz ! do k = 1, 9 q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal) @@ -2081,9 +2074,7 @@ contains ! ! rotate quad vector into the global coordinate system ! - rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:)) - rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:)) - rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:)) + rot_q_vector = matmul(rotate_quad,q_vector) ! ! compute and add in quadrature term ! @@ -2204,18 +2195,17 @@ contains ! ! compute current vector for the coil in the global coordinate system ! - coil_current_vec(1) = dot_product(rotate_coil(1,:),coil_tmp_vector(:)) - coil_current_vec(2) = dot_product(rotate_coil(2,:),coil_tmp_vector(:)) - coil_current_vec(3) = dot_product(rotate_coil(3,:),coil_tmp_vector(:)) + coil_current_vec = matmul(rotate_coil,coil_tmp_vector) ! do j = 1, 9 c_vector(3) = 0.5 * h_coil * z1gauss(j) ! ! rotate coil vector into the global coordinate system and translate it ! - rot_c_vector(1) = dot_product(rotate_coil(1,:),c_vector(:)) + dx - rot_c_vector(2) = dot_product(rotate_coil(2,:),c_vector(:)) + dy - rot_c_vector(3) = dot_product(rotate_coil(3,:),c_vector(:)) + dz + rot_c_vector = matmul(rotate_coil,c_vector) + rot_c_vector(1) = rot_c_vector(1) + dx + rot_c_vector(2) = rot_c_vector(2) + dy + rot_c_vector(3) = rot_c_vector(3) + dz ! do k = 1, 9 q_vector(1) = 0.5_longreal * a * (x2gauss(k) + 1.0_longreal) @@ -2224,9 +2214,7 @@ contains ! ! rotate quad vector into the global coordinate system ! - rot_q_vector(1) = dot_product(rotate_quad(1,:),q_vector(:)) - rot_q_vector(2) = dot_product(rotate_quad(2,:),q_vector(:)) - rot_q_vector(3) = dot_product(rotate_quad(3,:),q_vector(:)) + rot_q_vector = matmul(rotate_quad,q_vector) ! ! compute and add in quadrature term ! > Le 5 avr. 2015 à 16:33, Thomas Koenig <tkoenig@netcologne.de> a écrit : > > Hi Dominique, > >> IMO the inlining of MATMUL should be restricted to small matrices (less than 4x4, 9x9 >> or 16x16 depending of your field!-) > > The problem with the library function we have is that it is quite > general; it can deal with all the complexity of assumed-shape array > arguments. Inlining allows the compiler to take advantage of > contiguous memory, compile-time knowledge of array sizes, etc. > to allow for vectorization or even complete unrolling. > > Of course, if you use c=matmul(a,b) where all three are assumed-shape > arrays, your advantage over the library function will be minimal at > best. > > It will be interesting to see at what size calling BLAS directly > will be better. > > Thomas ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul 2015-04-05 18:25 ` Dominique d'Humières @ 2015-04-05 20:37 ` Thomas Koenig 2015-04-05 23:15 ` Dominique d'Humières 0 siblings, 1 reply; 16+ messages in thread From: Thomas Koenig @ 2015-04-05 20:37 UTC (permalink / raw) To: Dominique d'Humières; +Cc: gcc-patches, fortran, jvdelisle Hi Dominique, > which means that -fexternal-blas should disable the inlining. It is not surprising that a higly tuned BLAS library is better than a simple inlining for large matrices. I did some tests by adjusting n; it seems the inline version is faster for n<=22, which is not too bad. Regarding your other test case: This tests matrix*vector multiplication, which is not implemented yet :-) Regards, Thomas ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul 2015-04-05 20:37 ` Thomas Koenig @ 2015-04-05 23:15 ` Dominique d'Humières 2015-04-06 12:18 ` Dominique d'Humières 0 siblings, 1 reply; 16+ messages in thread From: Dominique d'Humières @ 2015-04-05 23:15 UTC (permalink / raw) To: Thomas Koenig; +Cc: gcc-patches, fortran, jvdelisle The patch causes the following regressions: FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=single -O2 -latomic (internal compiler error) FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=single -O2 -latomic (test for excess errors) FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=lib -O2 -lcaf_single -latomic (internal compiler error) FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=lib -O2 -lcaf_single -latomic (test for excess errors) FAIL: gfortran.dg/array_function_3.f90 -O (internal compiler error) FAIL: gfortran.dg/array_function_3.f90 -O (test for excess errors) FAIL: gfortran.dg/bind_c_vars.f90 -g -flto (test for excess errors) FAIL: gfortran.dg/bound_2.f90 -O0 (internal compiler error) FAIL: gfortran.dg/bound_2.f90 -O0 (test for excess errors) FAIL: gfortran.dg/bound_2.f90 -O1 (internal compiler error) FAIL: gfortran.dg/bound_2.f90 -O1 (test for excess errors) FAIL: gfortran.dg/bound_2.f90 -O2 (internal compiler error) FAIL: gfortran.dg/bound_2.f90 -O2 (test for excess errors) FAIL: gfortran.dg/bound_2.f90 -O3 -fomit-frame-pointer (internal compiler error) FAIL: gfortran.dg/bound_2.f90 -O3 -fomit-frame-pointer (test for excess errors) FAIL: gfortran.dg/bound_2.f90 -O3 -fomit-frame-pointer -funroll-loops (internal compiler error) FAIL: gfortran.dg/bound_2.f90 -O3 -fomit-frame-pointer -funroll-loops (test for excess errors) FAIL: gfortran.dg/bound_2.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions (internal compiler error) FAIL: gfortran.dg/bound_2.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions (test for excess errors) FAIL: gfortran.dg/bound_2.f90 -O3 -g (internal compiler error) FAIL: gfortran.dg/bound_2.f90 -O3 -g (test for excess errors) FAIL: gfortran.dg/bound_2.f90 -Os (internal compiler error) FAIL: gfortran.dg/bound_2.f90 -Os (test for excess errors) FAIL: gfortran.dg/bound_2.f90 -g -flto (internal compiler error) FAIL: gfortran.dg/bound_7.f90 -O0 (test for excess errors) FAIL: gfortran.dg/bound_7.f90 -O1 (internal compiler error) FAIL: gfortran.dg/bound_7.f90 -O1 (test for excess errors) FAIL: gfortran.dg/bound_7.f90 -O2 (internal compiler error) FAIL: gfortran.dg/bound_7.f90 -O2 (test for excess errors) FAIL: gfortran.dg/bound_7.f90 -O3 -fomit-frame-pointer (internal compiler error) FAIL: gfortran.dg/bound_7.f90 -O3 -fomit-frame-pointer (test for excess errors) FAIL: gfortran.dg/bound_7.f90 -O3 -fomit-frame-pointer -funroll-loops (internal compiler error) FAIL: gfortran.dg/bound_7.f90 -O3 -fomit-frame-pointer -funroll-loops (test for excess errors) FAIL: gfortran.dg/bound_7.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions (internal compiler error) FAIL: gfortran.dg/bound_7.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions (test for excess errors) FAIL: gfortran.dg/bound_7.f90 -O3 -g (internal compiler error) FAIL: gfortran.dg/bound_7.f90 -O3 -g (test for excess errors) FAIL: gfortran.dg/bound_7.f90 -Os (internal compiler error) FAIL: gfortran.dg/bound_7.f90 -Os (test for excess errors) FAIL: gfortran.dg/bound_7.f90 -g -flto (internal compiler error) FAIL: gfortran.dg/bound_7.f90 -g -flto (test for excess errors) FAIL: gfortran.dg/bound_8.f90 -O0 (internal compiler error) FAIL: gfortran.dg/bound_8.f90 -O0 (test for excess errors) FAIL: gfortran.dg/bound_8.f90 -O1 (internal compiler error) FAIL: gfortran.dg/bound_8.f90 -O1 (test for excess errors) FAIL: gfortran.dg/bound_8.f90 -O2 (internal compiler error) FAIL: gfortran.dg/bound_8.f90 -O2 (test for excess errors) FAIL: gfortran.dg/bound_8.f90 -O3 -fomit-frame-pointer (internal compiler error) FAIL: gfortran.dg/bound_8.f90 -O3 -fomit-frame-pointer (test for excess errors) FAIL: gfortran.dg/bound_8.f90 -O3 -fomit-frame-pointer -funroll-loops (internal compiler error) FAIL: gfortran.dg/bound_8.f90 -O3 -fomit-frame-pointer -funroll-loops (test for excess errors) FAIL: gfortran.dg/bound_8.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions (internal compiler error) FAIL: gfortran.dg/bound_8.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions (test for excess errors) FAIL: gfortran.dg/bound_8.f90 -O3 -g (internal compiler error) FAIL: gfortran.dg/bound_8.f90 -O3 -g (test for excess errors) FAIL: gfortran.dg/bound_8.f90 -Os (internal compiler error) FAIL: gfortran.dg/bound_8.f90 -Os (test for excess errors) FAIL: gfortran.dg/bound_8.f90 -g -flto (internal compiler error) FAIL: gfortran.dg/bound_8.f90 -g -flto (test for excess errors) I think the line if (!upper && as->type == AS_ASSUMED_SHAPE && dim) should be something such as (untested) if (!upper && dim && as && as->type == AS_ASSUMED_SHAPE) Dominique > Le 5 avr. 2015 à 22:37, Thomas Koenig <tkoenig@netcologne.de> a écrit : > > Hi Dominique, > >> which means that -fexternal-blas should disable the inlining. > > It is not surprising that a higly tuned BLAS library is better than > a simple inlining for large matrices. > > I did some tests by adjusting n; it seems the inline version is > faster for n<=22, which is not too bad. > > Regarding your other test case: This tests matrix*vector > multiplication, which is not implemented yet :-) > > Regards, > > Thomas ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul 2015-04-05 23:15 ` Dominique d'Humières @ 2015-04-06 12:18 ` Dominique d'Humières 2015-04-09 22:18 ` Thomas Koenig 0 siblings, 1 reply; 16+ messages in thread From: Dominique d'Humières @ 2015-04-06 12:18 UTC (permalink / raw) To: Thomas Koenig; +Cc: GCC Patches, GNU GFortran, jvdelisle > Le 6 avr. 2015 à 01:15, Dominique d'Humières <dominiq@lps.ens.fr> a écrit : > > The patch causes the following regressions: > > FAIL: gfortran.dg/coarray/dummy_1.f90 -fcoarray=single -O2 -latomic (internal compiler error) > … > FAIL: gfortran.dg/bound_8.f90 -g -flto (test for excess errors) > > I think the line > > if (!upper && as->type == AS_ASSUMED_SHAPE && dim) > > should be something such as (untested) > > if (!upper && dim && as && as->type == AS_ASSUMED_SHAPE) This fixes a first batch of ICEs. A second one if fixed by if (kind && lower->expr_type == EXPR_CONSTANT) While the first change is obvious, I am not sure about the second one. With this changes I am left with the following regressions FAIL: gfortran.dg/function_optimize_1.f90 -O scan-tree-dump-times original "matmul_r4" 1 FAIL: gfortran.dg/function_optimize_7.f90 -O scan-tree-dump-times original "matmul_r4" 1 FAIL: gfortran.dg/function_optimize_2.f90 -O scan-tree-dump-times original "matmul_r4" 1 FAIL: gfortran.dg/matmul_bounds_2.f90 -O1 output pattern test FAIL: gfortran.dg/matmul_bounds_2.f90 -O2 output pattern test FAIL: gfortran.dg/matmul_bounds_2.f90 -O3 -fomit-frame-pointer output pattern test FAIL: gfortran.dg/matmul_bounds_2.f90 -O3 -fomit-frame-pointer -funroll-loops output pattern test FAIL: gfortran.dg/matmul_bounds_2.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions output pattern test FAIL: gfortran.dg/matmul_bounds_2.f90 -O3 -g output pattern test FAIL: gfortran.dg/matmul_bounds_2.f90 -Os output pattern test FAIL: gfortran.dg/matmul_bounds_3.f90 -O1 output pattern test FAIL: gfortran.dg/matmul_bounds_3.f90 -O2 output pattern test FAIL: gfortran.dg/matmul_bounds_3.f90 -O3 -fomit-frame-pointer output pattern test FAIL: gfortran.dg/matmul_bounds_3.f90 -O3 -fomit-frame-pointer -funroll-loops output pattern test FAIL: gfortran.dg/matmul_bounds_3.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions output pattern test FAIL: gfortran.dg/matmul_bounds_3.f90 -O3 -g output pattern test FAIL: gfortran.dg/matmul_bounds_3.f90 -Os output pattern test FAIL: gfortran.dg/realloc_on_assign_11.f90 -O3 -fomit-frame-pointer execution test FAIL: gfortran.dg/realloc_on_assign_11.f90 -O3 -fomit-frame-pointer -funroll-loops execution test FAIL: gfortran.dg/realloc_on_assign_11.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions execution test FAIL: gfortran.dg/realloc_on_assign_11.f90 -O3 -g execution test FAIL: gfortran.dg/realloc_on_assign_11.f90 -Os execution test FAIL: gfortran.dg/realloc_on_assign_7.f03 -O1 execution test FAIL: gfortran.dg/realloc_on_assign_7.f03 -O2 execution test FAIL: gfortran.dg/realloc_on_assign_7.f03 -O3 -fomit-frame-pointer execution test FAIL: gfortran.dg/realloc_on_assign_7.f03 -O3 -fomit-frame-pointer -funroll-loops execution test FAIL: gfortran.dg/realloc_on_assign_7.f03 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions execution test FAIL: gfortran.dg/realloc_on_assign_7.f03 -O3 -g execution test FAIL: gfortran.dg/realloc_on_assign_7.f03 -Os execution test The gfortran.dg/matmul_bounds_* are failures due to a change in the run time error (is there a good reason for that?) and the gfortran.dg/realloc_on_assign_* failures are due to segfaults at run time. Dominique >> Le 5 avr. 2015 à 22:37, Thomas Koenig <tkoenig@netcologne.de> a écrit : >> >> Hi Dominique, >> >>> which means that -fexternal-blas should disable the inlining. >> >> It is not surprising that a higly tuned BLAS library is better than >> a simple inlining for large matrices. >> >> I did some tests by adjusting n; it seems the inline version is >> faster for n<=22, which is not too bad. >> >> Regarding your other test case: This tests matrix*vector >> multiplication, which is not implemented yet :-) >> >> Regards, >> >> Thomas > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul 2015-04-06 12:18 ` Dominique d'Humières @ 2015-04-09 22:18 ` Thomas Koenig 2015-04-10 14:15 ` Dominique d'Humières 0 siblings, 1 reply; 16+ messages in thread From: Thomas Koenig @ 2015-04-09 22:18 UTC (permalink / raw) To: GCC Patches, GNU GFortran; +Cc: Dominique d'Humières [-- Attachment #1: Type: text/plain, Size: 691 bytes --] Hello world, here is an update on the matmul inlining patch. Now, the rank one + rank two cases are also handled. Reallocation on assignment also works now. Still missing: 1. Control via an option, BLAS inlining has to take precedence 2. handling of matmul(a,b) occurring in the middle of an expression. 3. Bounds checking from the front end pass (basically, calling gfortran_runtime_error). 4. More test cases What do you think? 1. is straightforward, we don't really need to do 2. for committing early in stage one. For 3, we could maybe just issue a STOP. 4. is required, I need to look at some more corner cases where bugs may still be lurking. What do you think? Thomas [-- Attachment #2: matmul-8.diff --] [-- Type: text/x-patch, Size: 27830 bytes --] Index: frontend-passes.c =================================================================== --- frontend-passes.c (Revision 221944) +++ frontend-passes.c (Arbeitskopie) @@ -43,7 +43,11 @@ static void doloop_warn (gfc_namespace *); static void optimize_reduction (gfc_namespace *); static int callback_reduction (gfc_expr **, int *, void *); static void realloc_strings (gfc_namespace *); -static gfc_expr *create_var (gfc_expr *); +static gfc_expr *create_var (gfc_expr *, const char *vname=NULL); +static int optimize_matmul_assign (gfc_code **, int *, void *); +static gfc_code * create_do_loop (gfc_expr *, gfc_expr *, gfc_expr *, + locus *, gfc_namespace *, + char *vname=NULL); /* How deep we are inside an argument list. */ @@ -93,8 +97,24 @@ struct my_struct *evec; static bool in_assoc_list; +/* Counter for temporary variables. */ + +static int var_num = 1; + +/* What sort of matrix we are dealing with when optimizing MATMUL. */ + +enum matrix_case { none=0, A2B2, A2B1, A1B2 }; + +/* Keep track of the number of expressions we have inserted so far + using create_var. */ + +int n_vars; + +void gfc_debug_expr (gfc_expr *); + /* Entry point - run all passes for a namespace. */ + void gfc_run_passes (gfc_namespace *ns) { @@ -157,7 +177,7 @@ realloc_string_callback (gfc_code **c, int *walk_s return 0; current_code = c; - n = create_var (expr2); + n = create_var (expr2, "trim"); co->expr2 = n; return 0; } @@ -524,29 +544,11 @@ constant_string_length (gfc_expr *e) } -/* Returns a new expression (a variable) to be used in place of the old one, - with an assignment statement before the current statement to set - the value of the variable. Creates a new BLOCK for the statement if - that hasn't already been done and puts the statement, plus the - newly created variables, in that block. Special cases: If the - expression is constant or a temporary which has already - been created, just copy it. */ - -static gfc_expr* -create_var (gfc_expr * e) +static gfc_namespace* +insert_block () { - char name[GFC_MAX_SYMBOL_LEN +1]; - static int num = 1; - gfc_symtree *symtree; - gfc_symbol *symbol; - gfc_expr *result; - gfc_code *n; gfc_namespace *ns; - int i; - if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e)) - return gfc_copy_expr (e); - /* If the block hasn't already been created, do so. */ if (inserted_block == NULL) { @@ -578,7 +580,37 @@ constant_string_length (gfc_expr *e) else ns = inserted_block->ext.block.ns; - sprintf(name, "__var_%d",num++); + return ns; +} + +/* Returns a new expression (a variable) to be used in place of the old one, + with an optional assignment statement before the current statement to set + the value of the variable. Creates a new BLOCK for the statement if that + hasn't already been done and puts the statement, plus the newly created + variables, in that block. Special cases: If the expression is constant or + a temporary which has already been created, just copy it. */ + +static gfc_expr* +create_var (gfc_expr * e, const char *vname) +{ + char name[GFC_MAX_SYMBOL_LEN +1]; + gfc_symtree *symtree; + gfc_symbol *symbol; + gfc_expr *result; + gfc_code *n; + gfc_namespace *ns; + int i; + + if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e)) + return gfc_copy_expr (e); + + ns = insert_block (); + + if (vname) + snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d_%s", var_num++, vname); + else + snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d", var_num++); + if (gfc_get_sym_tree (name, ns, &symtree, false) != 0) gcc_unreachable (); @@ -651,6 +683,7 @@ constant_string_length (gfc_expr *e) result->ref->type = REF_ARRAY; result->ref->u.ar.type = AR_FULL; result->ref->u.ar.where = e->where; + result->ref->u.ar.dimen = e->rank; result->ref->u.ar.as = symbol->ts.type == BT_CLASS ? CLASS_DATA (symbol)->as : symbol->as; if (warn_array_temporaries) @@ -658,6 +691,7 @@ constant_string_length (gfc_expr *e) "Creating array temporary at %L", &(e->where)); } + /* Generate the new assignment. */ n = XCNEW (gfc_code); n->op = EXEC_ASSIGN; @@ -666,6 +700,7 @@ constant_string_length (gfc_expr *e) n->expr1 = gfc_copy_expr (result); n->expr2 = e; *changed_statement = n; + n_vars ++; return result; } @@ -724,7 +759,7 @@ cfe_expr_0 (gfc_expr **e, int *walk_subtrees, if (gfc_dep_compare_functions (*ei, *ej, true) == 0) { if (newvar == NULL) - newvar = create_var (*ei); + newvar = create_var (*ei, "fcn"); if (warn_function_elimination) do_warn_function_elimination (*ej); @@ -931,6 +966,7 @@ convert_elseif (gfc_code **c, int *walk_subtrees A /* Don't walk subtrees. */ return 0; } + /* Optimize a namespace, including all contained namespaces. */ static void @@ -947,6 +983,7 @@ optimize_namespace (gfc_namespace *ns) gfc_code_walker (&ns->code, convert_elseif, dummy_expr_callback, NULL); gfc_code_walker (&ns->code, cfe_code, cfe_expr_0, NULL); gfc_code_walker (&ns->code, optimize_code, optimize_expr, NULL); + gfc_code_walker (&ns->code, optimize_matmul_assign, dummy_expr_callback, NULL); /* BLOCKs are handled in the expression walker below. */ for (ns = ns->contained; ns; ns = ns->sibling) @@ -1222,7 +1259,7 @@ combine_array_constructor (gfc_expr *e) if (op2->ts.type == BT_CHARACTER) return false; - scalar = create_var (gfc_copy_expr (op2)); + scalar = create_var (gfc_copy_expr (op2), "constr"); oldbase = op1->value.constructor; newbase = NULL; @@ -1939,7 +1976,798 @@ doloop_warn (gfc_namespace *ns) gfc_code_walker (&ns->code, doloop_code, do_function, NULL); } +/* Auxiliary function to build and simplify an array inquiry function. + dim is zero-based. */ +static gfc_expr * +get_array_inq_function (gfc_expr *e, int dim, gfc_isym_id id) +{ + + gfc_expr *fcn; + gfc_expr *dim_arg, *kind; + const char *name; + + switch (id) + { + case GFC_ISYM_LBOUND: + name = "_gfortran_lbound"; + break; + + case GFC_ISYM_UBOUND: + name = "_gfortran_ubound"; + break; + + case GFC_ISYM_SIZE: + name = "_gfortran_size"; + break; + + default: + gcc_unreachable (); + } + + dim_arg = gfc_get_int_expr (gfc_default_integer_kind, &e->where, dim); + + kind = gfc_get_int_expr (gfc_default_integer_kind, &e->where, + gfc_index_integer_kind); + + fcn = gfc_build_intrinsic_call (current_ns, id, name, e->where, 3, + gfc_copy_expr(e), dim_arg, kind); + + gfc_simplify_expr (fcn, 0); + return fcn; +} + +/* Build a not equals - expression. */ + +static gfc_expr* +build_logical_expr (gfc_expr *e1, gfc_expr *e2, gfc_intrinsic_op op) +{ + gfc_typespec ts; + gfc_expr *res; + + ts.type = BT_LOGICAL; + ts.kind = gfc_default_logical_kind; + res = gfc_get_expr (); + res->where = e1->where; + res->expr_type = EXPR_OP; + res->value.op.op = op; + res->value.op.op1 = e1; + res->value.op.op2 = e2; + res->ts = ts; + + return res; +} +/* Handle matrix reallocation. Caller is responsible to insert into + the code tree. + + For the two-dimensional case, build + + if (allocated(c)) then + if (size(c,1) /= size(a,1) .or. size(c,2) /= size(b,2)) then + deallocate(c) + allocate (c(size(a,1), size(b,2))) + end if + else + allocate (c(size(a,1),size(b,2))) + end if + +*/ + +static gfc_code * +matmul_lhs_realloc (gfc_expr *c, gfc_expr *a, gfc_expr *b, + enum matrix_case m_case) +{ + + gfc_expr *allocated, *alloc_expr; + gfc_code *if_alloc_1, *if_alloc_2, *if_size_1, *if_size_2; + gfc_code *else_alloc; + gfc_code *deallocate, *allocate1, *allocate_else; + gfc_ref *ref; + gfc_array_ref *ar; + gfc_expr *cond, *ne1, *ne2; + + alloc_expr = gfc_copy_expr (c); + + ref = alloc_expr->ref; + + while (ref) + { + if (ref->type == REF_ARRAY && ref->u.ar.type != AR_ELEMENT) + break; + + ref = ref->next; + + } + ar = &ref->u.ar; + gcc_assert (ar && ar->type == AR_FULL); + + /* c comes in as a full ref. Change it into a copy and make it into an + element ref so it has the right for for ALLOCATE. In the same switch, + also generate the size comparison for the secod IF statement. */ + + ar->type = AR_ELEMENT; + + switch (m_case) + { + case A2B2: + ar->start[0] = get_array_inq_function (a, 1, GFC_ISYM_SIZE); + ar->start[1] = get_array_inq_function (b, 2, GFC_ISYM_SIZE); + ne1 = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE), + get_array_inq_function (a, 1, GFC_ISYM_SIZE), + INTRINSIC_NE); + ne2 = build_logical_expr (get_array_inq_function (c, 2, GFC_ISYM_SIZE), + get_array_inq_function (b, 2, GFC_ISYM_SIZE), + INTRINSIC_NE); + cond = build_logical_expr (ne1, ne2, INTRINSIC_OR); + break; + + case A2B1: + ar->start[0] = get_array_inq_function (a, 0, GFC_ISYM_SIZE); + cond = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE), + get_array_inq_function (a, 1, GFC_ISYM_SIZE), + INTRINSIC_NE); + break; + + case A1B2: + ar->start[0] = get_array_inq_function (b, 1, GFC_ISYM_SIZE); + cond = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE), + get_array_inq_function (b, 2, GFC_ISYM_SIZE), + INTRINSIC_NE); + break; + + default: + gcc_unreachable(); + + } + + gfc_simplify_expr (cond, 0); + + /* We need two identical allocate statements in two + branches of the IF statement. */ + + allocate1 = XCNEW (gfc_code); + allocate1->op = EXEC_ALLOCATE; + allocate1->ext.alloc.list = gfc_get_alloc (); + allocate1->loc = c->where; + allocate1->ext.alloc.list->expr = gfc_copy_expr (alloc_expr); + + allocate_else = XCNEW (gfc_code); + allocate_else->op = EXEC_ALLOCATE; + allocate_else->ext.alloc.list = gfc_get_alloc (); + allocate_else->loc = c->where; + allocate_else->ext.alloc.list->expr = alloc_expr; + + allocated = gfc_build_intrinsic_call (current_ns, GFC_ISYM_ALLOCATED, + "_gfortran_allocated", c->where, + 1, gfc_copy_expr (c)); + + deallocate = XCNEW (gfc_code); + deallocate->op = EXEC_DEALLOCATE; + deallocate->ext.alloc.list = gfc_get_alloc (); + deallocate->ext.alloc.list->expr = gfc_copy_expr (c); + deallocate->next = allocate1; + deallocate->loc = c->where; + + if_size_2 = XCNEW (gfc_code); + if_size_2->op = EXEC_IF; + if_size_2->expr1 = cond; + if_size_2->loc = c->where; + if_size_2->next = deallocate; + + if_size_1 = XCNEW (gfc_code); + if_size_1->op = EXEC_IF; + if_size_1->block = if_size_2; + if_size_1->loc = c->where; + + else_alloc = XCNEW (gfc_code); + else_alloc->op = EXEC_IF; + else_alloc->loc = c->where; + else_alloc->next = allocate_else; + + if_alloc_2 = XCNEW (gfc_code); + if_alloc_2->op = EXEC_IF; + if_alloc_2->expr1 = allocated; + if_alloc_2->loc = c->where; + if_alloc_2->next = if_size_1; + if_alloc_2->block = else_alloc; + + if_alloc_1 = XCNEW (gfc_code); + if_alloc_1->op = EXEC_IF; + if_alloc_1->block = if_alloc_2; + if_alloc_1->loc = c->where; + + return if_alloc_1; +} + + +/* Callback function for has_function_or_op. */ + +static int +is_function_or_op (gfc_expr **e, int *walk_subtrees ATTRIBUTE_UNUSED, + void *data ATTRIBUTE_UNUSED) +{ + if ((*e) == 0) + return 0; + else + return (*e)->expr_type == EXPR_FUNCTION + || (*e)->expr_type == EXPR_OP; +} + +/* Returns true if the expression contains a function. */ + +static bool +has_function_or_op (gfc_expr **e) +{ + if (e == NULL) + return false; + else + return gfc_expr_walker (e, is_function_or_op, NULL); +} + +/* Freeze a single expression. */ + +static void +freeze_expr (gfc_expr **ep) +{ + gfc_expr *ne; + if (has_function_or_op (ep)) + { + ne = create_var (*ep, "freeze"); + *ep = ne; + } +} + +/* Go through an expression's references and assign them to temporary + variables if they contain functions. This is usually done prior to + front-end scalarization to avoid multiple invocations of functions. */ + +static void +freeze_references (gfc_expr *e) +{ + gfc_ref *r; + gfc_array_ref *ar; + int i; + + for (r=e->ref; r; r=r->next) + { + if (r->type == REF_SUBSTRING) + { + if (r->u.ss.start != NULL) + freeze_expr (&r->u.ss.start); + + if (r->u.ss.end != NULL) + freeze_expr (&r->u.ss.end); + } + else if (r->type == REF_ARRAY) + { + ar = &r->u.ar; + switch (ar->type) + { + case AR_FULL: + break; + + case AR_SECTION: + for (i=0; i<ar->dimen; i++) + { + if (ar->dimen_type[i] == DIMEN_RANGE) + { + freeze_expr (&ar->start[i]); + freeze_expr (&ar->end[i]); + freeze_expr (&ar->stride[i]); + } + } + default: + break; + } + } + } +} + +/* Convert to gfc_index_integer_kind if needed, just do a copy otherwise. */ + +static gfc_expr * +convert_to_index_kind (gfc_expr *e) +{ + gfc_expr *res; + + gcc_assert (e != NULL); + + res = gfc_copy_expr (e); + + gcc_assert (e->ts.type == BT_INTEGER); + + if (res->ts.kind != gfc_index_integer_kind) + { + gfc_typespec ts; + gfc_clear_ts (&ts); + ts.type = BT_INTEGER; + ts.kind = gfc_index_integer_kind; + + gfc_convert_type_warn (e, &ts, 2, 0); + } + + return res; +} + +/* Function to create a DO loop including creation of the + iteration variable. gfc_expr are copied.*/ + +static gfc_code * +create_do_loop (gfc_expr *start, gfc_expr *end, gfc_expr *step, locus *where, + gfc_namespace *ns, char *vname) +{ + + char name[GFC_MAX_SYMBOL_LEN +1]; + gfc_symtree *symtree; + gfc_symbol *symbol; + gfc_expr *i; + gfc_code *n, *n2; + + /* Create an expression for the iteration variable. */ + if (vname) + sprintf (name, "__var_%d_do_%s", var_num++, vname); + else + sprintf (name, "__var_%d_do", var_num++); + + + if (gfc_get_sym_tree (name, ns, &symtree, false) != 0) + gcc_unreachable (); + + symbol = symtree->n.sym; + symbol->ts.type = BT_INTEGER; + symbol->ts.kind = gfc_index_integer_kind; + symbol->attr.flavor = FL_VARIABLE; + symbol->attr.referenced = 1; + symbol->attr.dimension = 0; + symbol->attr.fe_temp = 1; + gfc_commit_symbol (symbol); + + i = gfc_get_expr (); + i->expr_type = EXPR_VARIABLE; + i->ts = symbol->ts; + i->rank = 0; + i->where = *where; + i->symtree = symtree; + + n = XCNEW (gfc_code); + n->op = EXEC_DO; + n->loc = *where; + n->ext.iterator = gfc_get_iterator (); + n->ext.iterator->var = i; + n->ext.iterator->start = convert_to_index_kind (start); + n->ext.iterator->end = convert_to_index_kind (end); + if (step) + n->ext.iterator->step = convert_to_index_kind (step); + else + n->ext.iterator->step = gfc_get_int_expr (gfc_index_integer_kind, + where, 1); + + n2 = XCNEW (gfc_code); + n2->op = EXEC_DO; + n2->loc = *where; + n2->next = NULL; + n->block = n2; + return n; +} + +/* Return an operation of one two gfc_expr (one if e2 is NULL. This assumes + compatible typespecs. */ + +static gfc_expr * +get_operand (gfc_intrinsic_op op, gfc_expr *e1, gfc_expr *e2) +{ + gfc_expr *res; + + res = gfc_get_expr (); + res->ts = e1->ts; + res->where = e1->where; + res->expr_type = EXPR_OP; + res->value.op.op = op; + res->value.op.op1 = e1; + res->value.op.op2 = e2; + gfc_simplify_expr (res, 0); + return res; +} + +/* Get the upper bound of the DO loops for matmul along + a dimension, i.e. size minus one. */ +static gfc_expr* +get_size_m1 (gfc_expr *e, int dimen) +{ + mpz_t size; + gfc_expr *res; + + if (gfc_array_dimen_size (e, dimen, &size)) + { + res = gfc_get_constant_expr (BT_INTEGER, + gfc_index_integer_kind, &e->where); + mpz_sub_ui (res->value.integer, size, 1); + mpz_clear (size); + } + else + { + res = get_operand (INTRINSIC_MINUS, + get_array_inq_function (e, dimen + 1, GFC_ISYM_SIZE), + gfc_get_int_expr (gfc_index_integer_kind, + &e->where, 1)); + gfc_simplify_expr (res, 0); + } + + return res; +} + +/* Function to return a scalarized expression. It is assumed that indices are + zero based to make generation of DO loops easier. A zero as index will + access the first element along a dimension. Single element references will + be skipped. A NULL as an expression will be replaced by a full reference. + This assumes that the index loops have gfc_index_integer_kind, and that all + references have been frozen. */ + +static gfc_expr* +scalarized_expr (gfc_expr *e_in, gfc_expr **index, int count_index) +{ + gfc_ref *ref; + gfc_array_ref *ar; + int i; + int rank; + gfc_expr *e; + int i_index; + bool was_fullref; + + e = gfc_copy_expr(e_in); + + rank = e->rank; + + ref = e->ref; + + while (ref) + { + if (ref->type == REF_ARRAY + && (ref->u.ar.type == AR_FULL || ref->u.ar.type == AR_SECTION)) + break; + ref = ref->next; + } + ar = &ref->u.ar; + + /* We scalarize count_index variables, reducing the rank by count_index. */ + + e->rank = rank - count_index; + + was_fullref = ar->type == AR_FULL; + + if (e->rank == 0) + ar->type = AR_ELEMENT; + else + ar->type = AR_SECTION; + + /* Loop over the indices. For each index, create the expression + index + lbound(e, dim). */ + + i_index = 0; + for (i=0; i < ar->dimen; i++) + { + if (was_fullref || ar->dimen_type[i] == DIMEN_RANGE) + { + if (index[i_index] != NULL) + { + gfc_expr *lbound, *nindex; + gfc_expr *loopvar; + + loopvar = gfc_copy_expr (index[i_index]); + + if (ar->stride[i]) + { + gfc_expr *tmp; + + tmp = gfc_copy_expr(ar->stride[i]); + if (tmp->ts.kind != gfc_index_integer_kind) + { + gfc_typespec ts; + gfc_clear_ts (&ts); + ts.type = BT_INTEGER; + ts.kind = gfc_index_integer_kind; + gfc_convert_type (tmp, &ts, 2); + } + nindex = get_operand (INTRINSIC_TIMES, loopvar, tmp); + } + else + nindex = loopvar; + + if (ar->start[i]) + { + lbound = gfc_copy_expr (ar->start[i]); + if (lbound->ts.kind != gfc_index_integer_kind) + { + gfc_typespec ts; + gfc_clear_ts (&ts); + ts.type = BT_INTEGER; + ts.kind = gfc_index_integer_kind; + gfc_convert_type (lbound, &ts, 2); + + } + } + else + lbound = get_array_inq_function (e_in, i+1, GFC_ISYM_LBOUND); + + ar->dimen_type[i] = DIMEN_ELEMENT; + ar->start[i] = get_operand (INTRINSIC_PLUS, nindex, lbound); + ar->end[i] = NULL; + ar->stride[i] = NULL; + gfc_simplify_expr (ar->start[i], 0); + } + else if (was_fullref) + { + ar->dimen_type[i] = DIMEN_RANGE; + ar->start[i] = NULL; + ar->end[i] = NULL; + ar->stride[i] = NULL; + } + i_index ++; + } + } + return e; +} + + +/* Optimize assignments of the form c = matmul(a,b). + Handle only the cases currently where b and c are rank-two arrays. + + This translates the code to + + BLOCK + integer i,j,k + c = 0 + do j=0, size(b,2)-1 + do k=0, size(a, 2)-1 + do i=0, size(a, 1)-1 + c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) = + c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) + + a(i * stride(a,1) + lbound(a,1), k * stride(a,2) + lbound(a,2)) * + b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2)) + end do + end do + end do + END BLOCK + +*/ + +static int +optimize_matmul_assign (gfc_code **c, + int *walk_subtrees ATTRIBUTE_UNUSED, + void *data ATTRIBUTE_UNUSED) +{ + gfc_code *co = *c; + gfc_expr *expr1, *expr2; + gfc_expr *matrix_a, *matrix_b; + gfc_actual_arglist *a, *b; + gfc_code *do_1, *do_2, *do_3, *assign_zero, *assign_matmul; + gfc_expr *zero_e; + gfc_expr *u1, *u2, *u3; + gfc_expr *list[2]; + gfc_expr *ascalar, *bscalar, *cscalar; + gfc_expr *mult; + gfc_expr *var_1, *var_2, *var_3; + gfc_expr *zero; + gfc_namespace *ns; + gfc_intrinsic_op op_times, op_plus; + enum matrix_case m_case; + gfc_code *next_code; + gfc_code *p; + int i; + + + if (co->op != EXEC_ASSIGN) + return 0; + + expr1 = co->expr1; + expr2 = co->expr2; + if (expr2->expr_type != EXPR_FUNCTION + || expr2->value.function.isym == NULL + || expr2->value.function.isym->id != GFC_ISYM_MATMUL) + return 0; + + /* Handling only two matrices of dimension two for the time being. */ + + current_code = c; + inserted_block = NULL; + changed_statement = NULL; + + a = expr2->value.function.actual; + matrix_a = a->expr; + b = a->next; + matrix_b = b->expr; + + if (matrix_a->expr_type != EXPR_VARIABLE + || matrix_b->expr_type != EXPR_VARIABLE) + return 0; + + if (matrix_a->rank == 2) + m_case = matrix_b->rank == 1 ? A2B1 : A2B2; + else + m_case = A1B2; + + if (gfc_check_dependency (expr1, matrix_a, true) + || gfc_check_dependency (expr1, matrix_b, true)) + return 0; + + ns = insert_block (); + + switch(expr1->ts.type) + { + case BT_INTEGER: + zero_e = gfc_get_int_expr (expr1->ts.kind, &expr1->where, 0); + op_times = INTRINSIC_TIMES; + op_plus = INTRINSIC_PLUS; + break; + + case BT_LOGICAL: + op_times = INTRINSIC_AND; + op_plus = INTRINSIC_OR; + zero_e = gfc_get_logical_expr (expr1->ts.kind, &expr1->where, + 0); + break; + case BT_REAL: + zero_e = gfc_get_constant_expr (BT_REAL, expr1->ts.kind, + &expr1->where); + mpfr_set_si (zero_e->value.real, 0, GFC_RND_MODE); + op_times = INTRINSIC_TIMES; + op_plus = INTRINSIC_PLUS; + break; + + case BT_COMPLEX: + zero_e = gfc_get_constant_expr (BT_COMPLEX, expr1->ts.kind, + &expr1->where); + mpc_set_si_si (zero_e->value.complex, 0, 0, GFC_RND_MODE); + op_times = INTRINSIC_TIMES; + op_plus = INTRINSIC_PLUS; + + break; + + default: + gcc_unreachable(); + } + + current_code = &ns->code; + + n_vars = 0; + freeze_references (matrix_a); + freeze_references (matrix_b); + + assign_zero = XCNEW (gfc_code); + assign_zero->op = EXEC_ASSIGN; + assign_zero->loc = co->loc; + assign_zero->expr1 = gfc_copy_expr (expr1); + assign_zero->expr2 = zero_e; + + if (flag_realloc_lhs && gfc_is_reallocatable_lhs (expr1)) + { + gfc_code *lhs_alloc; + + lhs_alloc = matmul_lhs_realloc (expr1, matrix_a, matrix_b, m_case); + lhs_alloc->next = assign_zero; + next_code = lhs_alloc; + } + else + next_code = assign_zero; + + if (n_vars == 0) + *current_code = next_code; + else + { + p = ns->code; + for (i=0; i<n_vars-1; i++) + p = p->next; + + p->next = next_code; + } + + zero = gfc_get_int_expr (gfc_index_integer_kind, &co->loc, 0); + + assign_matmul = XCNEW (gfc_code); + assign_matmul->op = EXEC_ASSIGN; + assign_matmul->loc = co->loc; + + switch (m_case) + { + case A2B2: + u1 = get_size_m1 (matrix_b, 1); + u2 = get_size_m1 (matrix_a, 1); + u3 = get_size_m1 (matrix_a, 0); + + do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns); + do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns); + do_3 = create_do_loop (gfc_copy_expr (zero), u3, NULL, &co->loc, ns); + + do_1->block->next = do_2; + do_2->block->next = do_3; + do_3->block->next = assign_matmul; + + var_1 = do_1->ext.iterator->var; + var_2 = do_2->ext.iterator->var; + var_3 = do_3->ext.iterator->var; + + list[0] = var_3; + list[1] = var_1; + cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 2); + + list[0] = var_3; + list[1] = var_2; + ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2); + + list[0] = var_2; + list[1] = var_1; + bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2); + + break; + + case A2B1: + u1 = get_size_m1 (matrix_b, 0); + u2 = get_size_m1 (matrix_a, 0); + + do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns); + do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns); + + do_1->block->next = do_2; + do_2->block->next = assign_matmul; + + var_1 = do_1->ext.iterator->var; + var_2 = do_2->ext.iterator->var; + + list[0] = var_2; + cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 1); + + list[0] = var_2; + list[1] = var_1; + ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2); + + list[0] = var_1; + bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 1); + + break; + + case A1B2: + u1 = get_size_m1 (matrix_b, 1); + u2 = get_size_m1 (matrix_a, 0); + + do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns); + do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns); + + do_1->block->next = do_2; + do_2->block->next = assign_matmul; + + var_1 = do_1->ext.iterator->var; + var_2 = do_2->ext.iterator->var; + + list[0] = var_1; + cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 1); + + list[0] = var_2; + ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 1); + + list[0] = var_2; + list[1] = var_1; + bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2); + + break; + + default: + gcc_unreachable(); + } + + assign_zero->next = do_1; + + + assign_matmul->expr1 = gfc_copy_expr (cscalar); + + mult = get_operand (op_times, ascalar, bscalar); + assign_matmul->expr2 = get_operand (op_plus, cscalar, mult); + + co->next = NULL; + gfc_free_statements(co); + gfc_free_expr (zero); + return 0; +} + #define WALK_SUBEXPR(NODE) \ do \ { \ Index: gfortran.h =================================================================== --- gfortran.h (Revision 221944) +++ gfortran.h (Arbeitskopie) @@ -3225,4 +3225,8 @@ int gfc_code_walker (gfc_code **, walk_code_fn_t, void gfc_convert_mpz_to_signed (mpz_t, int); +/* trans-array.c */ + +bool gfc_is_reallocatable_lhs (gfc_expr *); + #endif /* GCC_GFORTRAN_H */ Index: simplify.c =================================================================== --- simplify.c (Revision 221944) +++ simplify.c (Arbeitskopie) @@ -3445,6 +3445,31 @@ simplify_bound (gfc_expr *array, gfc_expr *dim, gf done: + if (!upper && as && as->type == AS_ASSUMED_SHAPE && dim + && dim->expr_type == EXPR_CONSTANT) + { + if (!(array->symtree && array->symtree->n.sym + && (array->symtree->n.sym->attr.allocatable + || array->symtree->n.sym->attr.pointer))) + { + unsigned long int ndim; + gfc_expr *lower, *res; + + ndim = mpz_get_si (dim->value.integer) - 1; + lower = as->lower[ndim]; + if (lower->expr_type == EXPR_CONSTANT) + { + res = gfc_copy_expr (lower); + if (kind) + { + int nkind = mpz_get_si (kind->value.integer); + res->ts.kind = nkind; + } + return res; + } + } + } + if (as && (as->type == AS_DEFERRED || as->type == AS_ASSUMED_SHAPE || as->type == AS_ASSUMED_RANK)) return NULL; Index: trans-array.h =================================================================== --- trans-array.h (Revision 221944) +++ trans-array.h (Arbeitskopie) @@ -64,8 +64,6 @@ tree gfc_copy_only_alloc_comp (gfc_symbol *, tree, tree gfc_alloc_allocatable_for_assignment (gfc_loopinfo*, gfc_expr*, gfc_expr*); -bool gfc_is_reallocatable_lhs (gfc_expr *); - /* Add initialization for deferred arrays. */ void gfc_trans_deferred_array (gfc_symbol *, gfc_wrapped_block *); /* Generate an initializer for a static pointer or allocatable array. */ ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul 2015-04-09 22:18 ` Thomas Koenig @ 2015-04-10 14:15 ` Dominique d'Humières 2015-04-10 16:57 ` Dominique d'Humières 2015-04-11 12:24 ` Thomas Koenig 0 siblings, 2 replies; 16+ messages in thread From: Dominique d'Humières @ 2015-04-10 14:15 UTC (permalink / raw) To: Thomas Koenig; +Cc: GCC Patches, GNU GFortran > Le 10 avr. 2015 à 00:18, Thomas Koenig <tkoenig@netcologne.de> a écrit : > > Hello world, > > here is an update on the matmul inlining patch. Now, the > rank one + rank two cases are also handled. Preliminary tests show that my variant of fatigue.f90 with matmul is as fast (if not faster) as the original test with dot products. However this causes new failures FAIL: gfortran.dg/matmul_bounds_4.f90 -O1 execution test FAIL: gfortran.dg/matmul_bounds_4.f90 -O2 execution test FAIL: gfortran.dg/matmul_bounds_4.f90 -O3 -fomit-frame-pointer execution test FAIL: gfortran.dg/matmul_bounds_4.f90 -O3 -fomit-frame-pointer -funroll-loops execution test FAIL: gfortran.dg/matmul_bounds_4.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions execution test FAIL: gfortran.dg/matmul_bounds_4.f90 -O3 -g execution test FAIL: gfortran.dg/matmul_bounds_4.f90 -Os execution test FAIL: gfortran.dg/matmul_bounds_5.f90 -O1 execution test FAIL: gfortran.dg/matmul_bounds_5.f90 -O2 execution test FAIL: gfortran.dg/matmul_bounds_5.f90 -O3 -fomit-frame-pointer execution test FAIL: gfortran.dg/matmul_bounds_5.f90 -O3 -fomit-frame-pointer -funroll-loops execution test FAIL: gfortran.dg/matmul_bounds_5.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions execution test FAIL: gfortran.dg/matmul_bounds_5.f90 -O3 -g execution test FAIL: gfortran.dg/matmul_bounds_5.f90 -Os execution test and may be FAIL: gfortran.dg/coarray_lib_this_image_2.f90 -O scan-tree-dump-times original "mylbound = parm...dim\\[0\\].stride >= 0 && parm...dim\\[0\\].ubound >= parm...dim\\[0\\].lbound \\|\\| parm...dim\\[0\\].stride < 0 \\?[^\n\r]* parm...dim\\[0\\].lbound : 1;" 1 FAIL: gfortran.dg/dependency_26.f90 -O scan-tree-dump-times original "&a" 1 > Reallocation on assignment also works now. Confirmed. > > Still missing: > > 1. Control via an option, BLAS inlining has to take precedence > 2. handling of matmul(a,b) occurring in the middle of an expression. > 3. Bounds checking from the front end pass (basically, calling > gfortran_runtime_error). > 4. More test cases > > What do you think? 1. is straightforward, I agree if the bounds are known at compile time (it will be nice to use -fblas-matmul-limit=n for the threshold). However for bounds know at run time only, I don’t see how we can escape some king of versioning. > we don't really need to do 2. for committing early in stage one. Agreed: no point to mess with complicated expressions if the simple ones are not properly debugged! > For 3, we could maybe just issue a STOP. I have silenced these failures (gfortran.dg/matmul_bounds_*.f90) by adding -fno-frontend-optimize to the list of options. IMO having the same message with/without inlining is needed. > 4. is required, I need to look at some more corner cases where bugs may still be lurking. I also see the following failures FAIL: gfortran.dg/shape_2.f90 -O0 execution test FAIL: gfortran.dg/shape_2.f90 -O1 execution test FAIL: gfortran.dg/shape_2.f90 -O2 execution test FAIL: gfortran.dg/shape_2.f90 -O3 -fomit-frame-pointer execution test FAIL: gfortran.dg/shape_2.f90 -O3 -fomit-frame-pointer -funroll-loops execution test FAIL: gfortran.dg/shape_2.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions execution test FAIL: gfortran.dg/shape_2.f90 -O3 -g execution test FAIL: gfortran.dg/shape_2.f90 -Os execution test FAIL: gfortran.dg/shape_2.f90 -g -flto execution test A reduced test is: program main integer, dimension (40, 80) :: a = 1 call test (a) contains subroutine test (b) integer, dimension (11:, -8:), target :: b integer, dimension (:, :), pointer :: ptr print *, lbound (b (:, :), 1) ! if (lbound (b (:, :), 1) .ne. 1) call abort print *, lbound (b (:, :), 2) ! if (lbound (b (:, :), 2) .ne. 1) call abort print *, lbound (b (20:30:3, 40), 1) if (lbound (b (20:30:3, 40), 1) .ne. 1) call abort end subroutine test end program main (the other tests in the original file succeed). Thanks for working of this, Dominique > What do you think? > > Thomas > > <matmul-8.diff> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul 2015-04-10 14:15 ` Dominique d'Humières @ 2015-04-10 16:57 ` Dominique d'Humières 2015-04-11 12:24 ` Thomas Koenig 1 sibling, 0 replies; 16+ messages in thread From: Dominique d'Humières @ 2015-04-10 16:57 UTC (permalink / raw) To: Thomas Koenig; +Cc: GCC Patches, GNU GFortran > Le 10 avr. 2015 à 16:15, Dominique d'Humières <dominiq@lps.ens.fr> a écrit : > > >> 4. is required, I need to look at some more corner cases where bugs may still be lurking. > Compiling the polyhedron test rnflow.f90 after the patch gives an ICE f951: internal compiler error: gfc_array_dimen_size(): Bad dimension Reduced test extracted from the subroutine evlrnf: subroutine evlrnf (ptrs0t, nclsm, prnf0t) real, dimension (1:nclsm,1:nclsm), intent (in) :: ptrs0t integer, intent (in) :: nclsm ! nombre de classes real, dimension (1:nclsm,1:nclsm), intent (out):: prnf0t real, allocatable, dimension (:,:) :: utrsft ! probas up real, allocatable, dimension (:) :: vwrkft real, allocatable, dimension (:) :: vwrk2t ncls = nclsm allocate (utrsft (1:ncls,1:ncls)) allocate (vwrkft (1:ncls)) allocate (vwrk2t (1:ncls)) vwrk2t = matmul (utrsft, vwrkft) ! U Fk deallocate (utrsft) deallocate (vwrk2t) deallocate (vwrkft) end subroutine evlrnf Dominique ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul 2015-04-10 14:15 ` Dominique d'Humières 2015-04-10 16:57 ` Dominique d'Humières @ 2015-04-11 12:24 ` Thomas Koenig 2015-04-11 13:43 ` Mikael Morin 1 sibling, 1 reply; 16+ messages in thread From: Thomas Koenig @ 2015-04-11 12:24 UTC (permalink / raw) To: Dominique d'Humières; +Cc: GCC Patches, GNU GFortran [-- Attachment #1: Type: text/plain, Size: 513 bytes --] OK, here is a new version. There is now an option for setting a maximum on the array size, which takes its default from the BLAS limit (if specified). Currently, only setting the maximum size to zero as a way of disabling the unrolling is supported. I have done this in a few test cases. The bugs reported by Dominique have now been fixed. Still to do: Bounds checking (a rather big one), honoring the maximum size, test cases, ChangeLog, some more comments to the code. Anything else? Regards Thomas [-- Attachment #2: matmul-10.diff --] [-- Type: text/x-patch, Size: 31446 bytes --] Index: fortran/frontend-passes.c =================================================================== --- fortran/frontend-passes.c (Revision 221944) +++ fortran/frontend-passes.c (Arbeitskopie) @@ -43,7 +43,11 @@ static void doloop_warn (gfc_namespace *); static void optimize_reduction (gfc_namespace *); static int callback_reduction (gfc_expr **, int *, void *); static void realloc_strings (gfc_namespace *); -static gfc_expr *create_var (gfc_expr *); +static gfc_expr *create_var (gfc_expr *, const char *vname=NULL); +static int optimize_matmul_assign (gfc_code **, int *, void *); +static gfc_code * create_do_loop (gfc_expr *, gfc_expr *, gfc_expr *, + locus *, gfc_namespace *, + char *vname=NULL); /* How deep we are inside an argument list. */ @@ -93,8 +97,24 @@ struct my_struct *evec; static bool in_assoc_list; +/* Counter for temporary variables. */ + +static int var_num = 1; + +/* What sort of matrix we are dealing with when optimizing MATMUL. */ + +enum matrix_case { none=0, A2B2, A2B1, A1B2 }; + +/* Keep track of the number of expressions we have inserted so far + using create_var. */ + +int n_vars; + +void gfc_debug_expr (gfc_expr *); + /* Entry point - run all passes for a namespace. */ + void gfc_run_passes (gfc_namespace *ns) { @@ -157,7 +177,7 @@ realloc_string_callback (gfc_code **c, int *walk_s return 0; current_code = c; - n = create_var (expr2); + n = create_var (expr2, "trim"); co->expr2 = n; return 0; } @@ -524,29 +544,11 @@ constant_string_length (gfc_expr *e) } -/* Returns a new expression (a variable) to be used in place of the old one, - with an assignment statement before the current statement to set - the value of the variable. Creates a new BLOCK for the statement if - that hasn't already been done and puts the statement, plus the - newly created variables, in that block. Special cases: If the - expression is constant or a temporary which has already - been created, just copy it. */ - -static gfc_expr* -create_var (gfc_expr * e) +static gfc_namespace* +insert_block () { - char name[GFC_MAX_SYMBOL_LEN +1]; - static int num = 1; - gfc_symtree *symtree; - gfc_symbol *symbol; - gfc_expr *result; - gfc_code *n; gfc_namespace *ns; - int i; - if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e)) - return gfc_copy_expr (e); - /* If the block hasn't already been created, do so. */ if (inserted_block == NULL) { @@ -578,7 +580,37 @@ constant_string_length (gfc_expr *e) else ns = inserted_block->ext.block.ns; - sprintf(name, "__var_%d",num++); + return ns; +} + +/* Returns a new expression (a variable) to be used in place of the old one, + with an optional assignment statement before the current statement to set + the value of the variable. Creates a new BLOCK for the statement if that + hasn't already been done and puts the statement, plus the newly created + variables, in that block. Special cases: If the expression is constant or + a temporary which has already been created, just copy it. */ + +static gfc_expr* +create_var (gfc_expr * e, const char *vname) +{ + char name[GFC_MAX_SYMBOL_LEN +1]; + gfc_symtree *symtree; + gfc_symbol *symbol; + gfc_expr *result; + gfc_code *n; + gfc_namespace *ns; + int i; + + if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e)) + return gfc_copy_expr (e); + + ns = insert_block (); + + if (vname) + snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d_%s", var_num++, vname); + else + snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d", var_num++); + if (gfc_get_sym_tree (name, ns, &symtree, false) != 0) gcc_unreachable (); @@ -651,6 +683,7 @@ constant_string_length (gfc_expr *e) result->ref->type = REF_ARRAY; result->ref->u.ar.type = AR_FULL; result->ref->u.ar.where = e->where; + result->ref->u.ar.dimen = e->rank; result->ref->u.ar.as = symbol->ts.type == BT_CLASS ? CLASS_DATA (symbol)->as : symbol->as; if (warn_array_temporaries) @@ -658,6 +691,7 @@ constant_string_length (gfc_expr *e) "Creating array temporary at %L", &(e->where)); } + /* Generate the new assignment. */ n = XCNEW (gfc_code); n->op = EXEC_ASSIGN; @@ -666,6 +700,7 @@ constant_string_length (gfc_expr *e) n->expr1 = gfc_copy_expr (result); n->expr2 = e; *changed_statement = n; + n_vars ++; return result; } @@ -724,7 +759,7 @@ cfe_expr_0 (gfc_expr **e, int *walk_subtrees, if (gfc_dep_compare_functions (*ei, *ej, true) == 0) { if (newvar == NULL) - newvar = create_var (*ei); + newvar = create_var (*ei, "fcn"); if (warn_function_elimination) do_warn_function_elimination (*ej); @@ -931,6 +966,7 @@ convert_elseif (gfc_code **c, int *walk_subtrees A /* Don't walk subtrees. */ return 0; } + /* Optimize a namespace, including all contained namespaces. */ static void @@ -947,6 +983,9 @@ optimize_namespace (gfc_namespace *ns) gfc_code_walker (&ns->code, convert_elseif, dummy_expr_callback, NULL); gfc_code_walker (&ns->code, cfe_code, cfe_expr_0, NULL); gfc_code_walker (&ns->code, optimize_code, optimize_expr, NULL); + if (flag_inline_matmul_limit != 0) + gfc_code_walker (&ns->code, optimize_matmul_assign, dummy_expr_callback, + NULL); /* BLOCKs are handled in the expression walker below. */ for (ns = ns->contained; ns; ns = ns->sibling) @@ -1222,7 +1261,7 @@ combine_array_constructor (gfc_expr *e) if (op2->ts.type == BT_CHARACTER) return false; - scalar = create_var (gfc_copy_expr (op2)); + scalar = create_var (gfc_copy_expr (op2), "constr"); oldbase = op1->value.constructor; newbase = NULL; @@ -1939,7 +1978,795 @@ doloop_warn (gfc_namespace *ns) gfc_code_walker (&ns->code, doloop_code, do_function, NULL); } +/* Auxiliary function to build and simplify an array inquiry function. + dim is zero-based. */ +static gfc_expr * +get_array_inq_function (gfc_expr *e, int dim, gfc_isym_id id) +{ + gfc_expr *fcn; + gfc_expr *dim_arg, *kind; + const char *name; + + switch (id) + { + case GFC_ISYM_LBOUND: + name = "_gfortran_lbound"; + break; + + case GFC_ISYM_UBOUND: + name = "_gfortran_ubound"; + break; + + case GFC_ISYM_SIZE: + name = "_gfortran_size"; + break; + + default: + gcc_unreachable (); + } + + dim_arg = gfc_get_int_expr (gfc_default_integer_kind, &e->where, dim); + kind = gfc_get_int_expr (gfc_default_integer_kind, &e->where, + gfc_index_integer_kind); + + fcn = gfc_build_intrinsic_call (current_ns, id, name, e->where, 3, + gfc_copy_expr(e), dim_arg, kind); + gfc_simplify_expr (fcn, 0); + return fcn; +} + +/* Build a not equals - expression. */ + +static gfc_expr* +build_logical_expr (gfc_expr *e1, gfc_expr *e2, gfc_intrinsic_op op) +{ + gfc_typespec ts; + gfc_expr *res; + + ts.type = BT_LOGICAL; + ts.kind = gfc_default_logical_kind; + res = gfc_get_expr (); + res->where = e1->where; + res->expr_type = EXPR_OP; + res->value.op.op = op; + res->value.op.op1 = e1; + res->value.op.op2 = e2; + res->ts = ts; + + return res; +} +/* Handle matrix reallocation. Caller is responsible to insert into + the code tree. + + For the two-dimensional case, build + + if (allocated(c)) then + if (size(c,1) /= size(a,1) .or. size(c,2) /= size(b,2)) then + deallocate(c) + allocate (c(size(a,1), size(b,2))) + end if + else + allocate (c(size(a,1),size(b,2))) + end if + +*/ + +static gfc_code * +matmul_lhs_realloc (gfc_expr *c, gfc_expr *a, gfc_expr *b, + enum matrix_case m_case) +{ + + gfc_expr *allocated, *alloc_expr; + gfc_code *if_alloc_1, *if_alloc_2, *if_size_1, *if_size_2; + gfc_code *else_alloc; + gfc_code *deallocate, *allocate1, *allocate_else; + gfc_ref *ref; + gfc_array_ref *ar; + gfc_expr *cond, *ne1, *ne2; + + alloc_expr = gfc_copy_expr (c); + + ref = alloc_expr->ref; + + while (ref) + { + if (ref->type == REF_ARRAY && ref->u.ar.type != AR_ELEMENT) + break; + + ref = ref->next; + + } + ar = &ref->u.ar; + gcc_assert (ar && ar->type == AR_FULL); + + /* c comes in as a full ref. Change it into a copy and make it into an + element ref so it has the right for for ALLOCATE. In the same switch, + also generate the size comparison for the secod IF statement. */ + + ar->type = AR_ELEMENT; + + switch (m_case) + { + case A2B2: + ar->start[0] = get_array_inq_function (a, 1, GFC_ISYM_SIZE); + ar->start[1] = get_array_inq_function (b, 2, GFC_ISYM_SIZE); + ne1 = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE), + get_array_inq_function (a, 1, GFC_ISYM_SIZE), + INTRINSIC_NE); + ne2 = build_logical_expr (get_array_inq_function (c, 2, GFC_ISYM_SIZE), + get_array_inq_function (b, 2, GFC_ISYM_SIZE), + INTRINSIC_NE); + cond = build_logical_expr (ne1, ne2, INTRINSIC_OR); + break; + + case A2B1: + ar->start[0] = get_array_inq_function (a, 1, GFC_ISYM_SIZE); + cond = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE), + get_array_inq_function (a, 2, GFC_ISYM_SIZE), + INTRINSIC_NE); + break; + + case A1B2: + ar->start[0] = get_array_inq_function (b, 1, GFC_ISYM_SIZE); + cond = build_logical_expr (get_array_inq_function (c, 1, GFC_ISYM_SIZE), + get_array_inq_function (b, 2, GFC_ISYM_SIZE), + INTRINSIC_NE); + break; + + default: + gcc_unreachable(); + + } + + gfc_simplify_expr (cond, 0); + + /* We need two identical allocate statements in two + branches of the IF statement. */ + + allocate1 = XCNEW (gfc_code); + allocate1->op = EXEC_ALLOCATE; + allocate1->ext.alloc.list = gfc_get_alloc (); + allocate1->loc = c->where; + allocate1->ext.alloc.list->expr = gfc_copy_expr (alloc_expr); + + allocate_else = XCNEW (gfc_code); + allocate_else->op = EXEC_ALLOCATE; + allocate_else->ext.alloc.list = gfc_get_alloc (); + allocate_else->loc = c->where; + allocate_else->ext.alloc.list->expr = alloc_expr; + + allocated = gfc_build_intrinsic_call (current_ns, GFC_ISYM_ALLOCATED, + "_gfortran_allocated", c->where, + 1, gfc_copy_expr (c)); + + deallocate = XCNEW (gfc_code); + deallocate->op = EXEC_DEALLOCATE; + deallocate->ext.alloc.list = gfc_get_alloc (); + deallocate->ext.alloc.list->expr = gfc_copy_expr (c); + deallocate->next = allocate1; + deallocate->loc = c->where; + + if_size_2 = XCNEW (gfc_code); + if_size_2->op = EXEC_IF; + if_size_2->expr1 = cond; + if_size_2->loc = c->where; + if_size_2->next = deallocate; + + if_size_1 = XCNEW (gfc_code); + if_size_1->op = EXEC_IF; + if_size_1->block = if_size_2; + if_size_1->loc = c->where; + + else_alloc = XCNEW (gfc_code); + else_alloc->op = EXEC_IF; + else_alloc->loc = c->where; + else_alloc->next = allocate_else; + + if_alloc_2 = XCNEW (gfc_code); + if_alloc_2->op = EXEC_IF; + if_alloc_2->expr1 = allocated; + if_alloc_2->loc = c->where; + if_alloc_2->next = if_size_1; + if_alloc_2->block = else_alloc; + + if_alloc_1 = XCNEW (gfc_code); + if_alloc_1->op = EXEC_IF; + if_alloc_1->block = if_alloc_2; + if_alloc_1->loc = c->where; + + return if_alloc_1; +} + + +/* Callback function for has_function_or_op. */ + +static int +is_function_or_op (gfc_expr **e, int *walk_subtrees ATTRIBUTE_UNUSED, + void *data ATTRIBUTE_UNUSED) +{ + if ((*e) == 0) + return 0; + else + return (*e)->expr_type == EXPR_FUNCTION + || (*e)->expr_type == EXPR_OP; +} + +/* Returns true if the expression contains a function. */ + +static bool +has_function_or_op (gfc_expr **e) +{ + if (e == NULL) + return false; + else + return gfc_expr_walker (e, is_function_or_op, NULL); +} + +/* Freeze a single expression. */ + +static void +freeze_expr (gfc_expr **ep) +{ + gfc_expr *ne; + if (has_function_or_op (ep)) + { + ne = create_var (*ep, "freeze"); + *ep = ne; + } +} + +/* Go through an expression's references and assign them to temporary + variables if they contain functions. This is usually done prior to + front-end scalarization to avoid multiple invocations of functions. */ + +static void +freeze_references (gfc_expr *e) +{ + gfc_ref *r; + gfc_array_ref *ar; + int i; + + for (r=e->ref; r; r=r->next) + { + if (r->type == REF_SUBSTRING) + { + if (r->u.ss.start != NULL) + freeze_expr (&r->u.ss.start); + + if (r->u.ss.end != NULL) + freeze_expr (&r->u.ss.end); + } + else if (r->type == REF_ARRAY) + { + ar = &r->u.ar; + switch (ar->type) + { + case AR_FULL: + break; + + case AR_SECTION: + for (i=0; i<ar->dimen; i++) + { + if (ar->dimen_type[i] == DIMEN_RANGE) + { + freeze_expr (&ar->start[i]); + freeze_expr (&ar->end[i]); + freeze_expr (&ar->stride[i]); + } + } + default: + break; + } + } + } +} + +/* Convert to gfc_index_integer_kind if needed, just do a copy otherwise. */ + +static gfc_expr * +convert_to_index_kind (gfc_expr *e) +{ + gfc_expr *res; + + gcc_assert (e != NULL); + + res = gfc_copy_expr (e); + + gcc_assert (e->ts.type == BT_INTEGER); + + if (res->ts.kind != gfc_index_integer_kind) + { + gfc_typespec ts; + gfc_clear_ts (&ts); + ts.type = BT_INTEGER; + ts.kind = gfc_index_integer_kind; + + gfc_convert_type_warn (e, &ts, 2, 0); + } + + return res; +} + +/* Function to create a DO loop including creation of the + iteration variable. gfc_expr are copied.*/ + +static gfc_code * +create_do_loop (gfc_expr *start, gfc_expr *end, gfc_expr *step, locus *where, + gfc_namespace *ns, char *vname) +{ + + char name[GFC_MAX_SYMBOL_LEN +1]; + gfc_symtree *symtree; + gfc_symbol *symbol; + gfc_expr *i; + gfc_code *n, *n2; + + /* Create an expression for the iteration variable. */ + if (vname) + sprintf (name, "__var_%d_do_%s", var_num++, vname); + else + sprintf (name, "__var_%d_do", var_num++); + + + if (gfc_get_sym_tree (name, ns, &symtree, false) != 0) + gcc_unreachable (); + + symbol = symtree->n.sym; + symbol->ts.type = BT_INTEGER; + symbol->ts.kind = gfc_index_integer_kind; + symbol->attr.flavor = FL_VARIABLE; + symbol->attr.referenced = 1; + symbol->attr.dimension = 0; + symbol->attr.fe_temp = 1; + gfc_commit_symbol (symbol); + + i = gfc_get_expr (); + i->expr_type = EXPR_VARIABLE; + i->ts = symbol->ts; + i->rank = 0; + i->where = *where; + i->symtree = symtree; + + n = XCNEW (gfc_code); + n->op = EXEC_DO; + n->loc = *where; + n->ext.iterator = gfc_get_iterator (); + n->ext.iterator->var = i; + n->ext.iterator->start = convert_to_index_kind (start); + n->ext.iterator->end = convert_to_index_kind (end); + if (step) + n->ext.iterator->step = convert_to_index_kind (step); + else + n->ext.iterator->step = gfc_get_int_expr (gfc_index_integer_kind, + where, 1); + + n2 = XCNEW (gfc_code); + n2->op = EXEC_DO; + n2->loc = *where; + n2->next = NULL; + n->block = n2; + return n; +} + +/* Return an operation of one two gfc_expr (one if e2 is NULL. This assumes + compatible typespecs. */ + +static gfc_expr * +get_operand (gfc_intrinsic_op op, gfc_expr *e1, gfc_expr *e2) +{ + gfc_expr *res; + + res = gfc_get_expr (); + res->ts = e1->ts; + res->where = e1->where; + res->expr_type = EXPR_OP; + res->value.op.op = op; + res->value.op.op1 = e1; + res->value.op.op2 = e2; + gfc_simplify_expr (res, 0); + return res; +} + +/* Get the upper bound of the DO loops for matmul along + a dimension, i.e. size minus one. */ +static gfc_expr* +get_size_m1 (gfc_expr *e, int dimen) +{ + mpz_t size; + gfc_expr *res; + + if (gfc_array_dimen_size (e, dimen, &size)) + { + res = gfc_get_constant_expr (BT_INTEGER, + gfc_index_integer_kind, &e->where); + mpz_sub_ui (res->value.integer, size, 1); + mpz_clear (size); + } + else + { + res = get_operand (INTRINSIC_MINUS, + get_array_inq_function (e, dimen + 1, GFC_ISYM_SIZE), + gfc_get_int_expr (gfc_index_integer_kind, + &e->where, 1)); + gfc_simplify_expr (res, 0); + } + + return res; +} + +/* Function to return a scalarized expression. It is assumed that indices are + zero based to make generation of DO loops easier. A zero as index will + access the first element along a dimension. Single element references will + be skipped. A NULL as an expression will be replaced by a full reference. + This assumes that the index loops have gfc_index_integer_kind, and that all + references have been frozen. */ + +static gfc_expr* +scalarized_expr (gfc_expr *e_in, gfc_expr **index, int count_index) +{ + gfc_ref *ref; + gfc_array_ref *ar; + int i; + int rank; + gfc_expr *e; + int i_index; + bool was_fullref; + + e = gfc_copy_expr(e_in); + + rank = e->rank; + + ref = e->ref; + + while (ref) + { + if (ref->type == REF_ARRAY + && (ref->u.ar.type == AR_FULL || ref->u.ar.type == AR_SECTION)) + break; + ref = ref->next; + } + ar = &ref->u.ar; + + /* We scalarize count_index variables, reducing the rank by count_index. */ + + e->rank = rank - count_index; + + was_fullref = ar->type == AR_FULL; + + if (e->rank == 0) + ar->type = AR_ELEMENT; + else + ar->type = AR_SECTION; + + /* Loop over the indices. For each index, create the expression + index + lbound(e, dim). */ + + i_index = 0; + for (i=0; i < ar->dimen; i++) + { + if (was_fullref || ar->dimen_type[i] == DIMEN_RANGE) + { + if (index[i_index] != NULL) + { + gfc_expr *lbound, *nindex; + gfc_expr *loopvar; + + loopvar = gfc_copy_expr (index[i_index]); + + if (ar->stride[i]) + { + gfc_expr *tmp; + + tmp = gfc_copy_expr(ar->stride[i]); + if (tmp->ts.kind != gfc_index_integer_kind) + { + gfc_typespec ts; + gfc_clear_ts (&ts); + ts.type = BT_INTEGER; + ts.kind = gfc_index_integer_kind; + gfc_convert_type (tmp, &ts, 2); + } + nindex = get_operand (INTRINSIC_TIMES, loopvar, tmp); + } + else + nindex = loopvar; + + if (ar->start[i]) + { + lbound = gfc_copy_expr (ar->start[i]); + if (lbound->ts.kind != gfc_index_integer_kind) + { + gfc_typespec ts; + gfc_clear_ts (&ts); + ts.type = BT_INTEGER; + ts.kind = gfc_index_integer_kind; + gfc_convert_type (lbound, &ts, 2); + + } + } + else + lbound = get_array_inq_function (e_in, i+1, GFC_ISYM_LBOUND); + + ar->dimen_type[i] = DIMEN_ELEMENT; + ar->start[i] = get_operand (INTRINSIC_PLUS, nindex, lbound); + ar->end[i] = NULL; + ar->stride[i] = NULL; + gfc_simplify_expr (ar->start[i], 0); + } + else if (was_fullref) + { + ar->dimen_type[i] = DIMEN_RANGE; + ar->start[i] = NULL; + ar->end[i] = NULL; + ar->stride[i] = NULL; + } + i_index ++; + } + } + return e; +} + + +/* Optimize assignments of the form c = matmul(a,b). + Handle only the cases currently where b and c are rank-two arrays. + + This translates the code to + + BLOCK + integer i,j,k + c = 0 + do j=0, size(b,2)-1 + do k=0, size(a, 2)-1 + do i=0, size(a, 1)-1 + c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) = + c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) + + a(i * stride(a,1) + lbound(a,1), k * stride(a,2) + lbound(a,2)) * + b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2)) + end do + end do + end do + END BLOCK + +*/ + +static int +optimize_matmul_assign (gfc_code **c, + int *walk_subtrees ATTRIBUTE_UNUSED, + void *data ATTRIBUTE_UNUSED) +{ + gfc_code *co = *c; + gfc_expr *expr1, *expr2; + gfc_expr *matrix_a, *matrix_b; + gfc_actual_arglist *a, *b; + gfc_code *do_1, *do_2, *do_3, *assign_zero, *assign_matmul; + gfc_expr *zero_e; + gfc_expr *u1, *u2, *u3; + gfc_expr *list[2]; + gfc_expr *ascalar, *bscalar, *cscalar; + gfc_expr *mult; + gfc_expr *var_1, *var_2, *var_3; + gfc_expr *zero; + gfc_namespace *ns; + gfc_intrinsic_op op_times, op_plus; + enum matrix_case m_case; + gfc_code *next_code; + gfc_code *p; + int i; + + + if (co->op != EXEC_ASSIGN) + return 0; + + expr1 = co->expr1; + expr2 = co->expr2; + if (expr2->expr_type != EXPR_FUNCTION + || expr2->value.function.isym == NULL + || expr2->value.function.isym->id != GFC_ISYM_MATMUL) + return 0; + + /* Handling only two matrices of dimension two for the time being. */ + + current_code = c; + inserted_block = NULL; + changed_statement = NULL; + + a = expr2->value.function.actual; + matrix_a = a->expr; + b = a->next; + matrix_b = b->expr; + + if (matrix_a->expr_type != EXPR_VARIABLE + || matrix_b->expr_type != EXPR_VARIABLE) + return 0; + + if (matrix_a->rank == 2) + m_case = matrix_b->rank == 1 ? A2B1 : A2B2; + else + m_case = A1B2; + + if (gfc_check_dependency (expr1, matrix_a, true) + || gfc_check_dependency (expr1, matrix_b, true)) + return 0; + + ns = insert_block (); + + switch(expr1->ts.type) + { + case BT_INTEGER: + zero_e = gfc_get_int_expr (expr1->ts.kind, &expr1->where, 0); + op_times = INTRINSIC_TIMES; + op_plus = INTRINSIC_PLUS; + break; + + case BT_LOGICAL: + op_times = INTRINSIC_AND; + op_plus = INTRINSIC_OR; + zero_e = gfc_get_logical_expr (expr1->ts.kind, &expr1->where, + 0); + break; + case BT_REAL: + zero_e = gfc_get_constant_expr (BT_REAL, expr1->ts.kind, + &expr1->where); + mpfr_set_si (zero_e->value.real, 0, GFC_RND_MODE); + op_times = INTRINSIC_TIMES; + op_plus = INTRINSIC_PLUS; + break; + + case BT_COMPLEX: + zero_e = gfc_get_constant_expr (BT_COMPLEX, expr1->ts.kind, + &expr1->where); + mpc_set_si_si (zero_e->value.complex, 0, 0, GFC_RND_MODE); + op_times = INTRINSIC_TIMES; + op_plus = INTRINSIC_PLUS; + + break; + + default: + gcc_unreachable(); + } + + current_code = &ns->code; + + n_vars = 0; + freeze_references (matrix_a); + freeze_references (matrix_b); + + assign_zero = XCNEW (gfc_code); + assign_zero->op = EXEC_ASSIGN; + assign_zero->loc = co->loc; + assign_zero->expr1 = gfc_copy_expr (expr1); + assign_zero->expr2 = zero_e; + + if (flag_realloc_lhs && gfc_is_reallocatable_lhs (expr1)) + { + gfc_code *lhs_alloc; + + lhs_alloc = matmul_lhs_realloc (expr1, matrix_a, matrix_b, m_case); + lhs_alloc->next = assign_zero; + next_code = lhs_alloc; + } + else + next_code = assign_zero; + + if (n_vars == 0) + *current_code = next_code; + else + { + p = ns->code; + for (i=0; i<n_vars-1; i++) + p = p->next; + + p->next = next_code; + } + + zero = gfc_get_int_expr (gfc_index_integer_kind, &co->loc, 0); + + assign_matmul = XCNEW (gfc_code); + assign_matmul->op = EXEC_ASSIGN; + assign_matmul->loc = co->loc; + + switch (m_case) + { + case A2B2: + u1 = get_size_m1 (matrix_b, 1); + u2 = get_size_m1 (matrix_a, 1); + u3 = get_size_m1 (matrix_a, 0); + + do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns); + do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns); + do_3 = create_do_loop (gfc_copy_expr (zero), u3, NULL, &co->loc, ns); + + do_1->block->next = do_2; + do_2->block->next = do_3; + do_3->block->next = assign_matmul; + + var_1 = do_1->ext.iterator->var; + var_2 = do_2->ext.iterator->var; + var_3 = do_3->ext.iterator->var; + + list[0] = var_3; + list[1] = var_1; + cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 2); + + list[0] = var_3; + list[1] = var_2; + ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2); + + list[0] = var_2; + list[1] = var_1; + bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2); + + break; + + case A2B1: + u1 = get_size_m1 (matrix_b, 0); + u2 = get_size_m1 (matrix_a, 0); + + do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns); + do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns); + + do_1->block->next = do_2; + do_2->block->next = assign_matmul; + + var_1 = do_1->ext.iterator->var; + var_2 = do_2->ext.iterator->var; + + list[0] = var_2; + cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 1); + + list[0] = var_2; + list[1] = var_1; + ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2); + + list[0] = var_1; + bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 1); + + break; + + case A1B2: + u1 = get_size_m1 (matrix_b, 1); + u2 = get_size_m1 (matrix_a, 0); + + do_1 = create_do_loop (gfc_copy_expr (zero), u1, NULL, &co->loc, ns); + do_2 = create_do_loop (gfc_copy_expr (zero), u2, NULL, &co->loc, ns); + + do_1->block->next = do_2; + do_2->block->next = assign_matmul; + + var_1 = do_1->ext.iterator->var; + var_2 = do_2->ext.iterator->var; + + list[0] = var_1; + cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 1); + + list[0] = var_2; + ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 1); + + list[0] = var_2; + list[1] = var_1; + bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2); + + break; + + default: + gcc_unreachable(); + } + + assign_zero->next = do_1; + + + assign_matmul->expr1 = gfc_copy_expr (cscalar); + + mult = get_operand (op_times, ascalar, bscalar); + assign_matmul->expr2 = get_operand (op_plus, cscalar, mult); + + co->next = NULL; + gfc_free_statements(co); + gfc_free_expr (zero); + return 0; +} + #define WALK_SUBEXPR(NODE) \ do \ { \ Index: fortran/gfortran.h =================================================================== --- fortran/gfortran.h (Revision 221944) +++ fortran/gfortran.h (Arbeitskopie) @@ -3225,4 +3225,8 @@ int gfc_code_walker (gfc_code **, walk_code_fn_t, void gfc_convert_mpz_to_signed (mpz_t, int); +/* trans-array.c */ + +bool gfc_is_reallocatable_lhs (gfc_expr *); + #endif /* GCC_GFORTRAN_H */ Index: fortran/lang.opt =================================================================== --- fortran/lang.opt (Revision 221944) +++ fortran/lang.opt (Arbeitskopie) @@ -542,6 +542,10 @@ Enum(gfc_init_local_real) String(inf) Value(GFC_IN EnumValue Enum(gfc_init_local_real) String(-inf) Value(GFC_INIT_REAL_NEG_INF) +finline-matmul-limit= +Fortran RejectNegative Joined UInteger Var(flag_inline_matmul_limit) Init(-1) +-finline-matmul-limit=<n> Specify the size of the largest matrix for which matmul will be inlined + fmax-array-constructor= Fortran RejectNegative Joined UInteger Var(flag_max_array_constructor) Init(65535) -fmax-array-constructor=<n> Maximum number of objects in an array constructor Index: fortran/options.c =================================================================== --- fortran/options.c (Revision 221944) +++ fortran/options.c (Arbeitskopie) @@ -378,6 +378,11 @@ gfc_post_options (const char **pfilename) if (!flag_automatic) flag_max_stack_var_size = 0; + /* If we call BLAS directly, only inline up to the BLAS limit. */ + + if (flag_external_blas && flag_inline_matmul_limit < 0) + flag_inline_matmul_limit = flag_blas_matmul_limit; + /* Optimization implies front end optimization, unless the user specified it directly. */ Index: fortran/simplify.c =================================================================== --- fortran/simplify.c (Revision 221944) +++ fortran/simplify.c (Arbeitskopie) @@ -3445,6 +3445,32 @@ simplify_bound (gfc_expr *array, gfc_expr *dim, gf done: + + if (!upper && as && as->type == AS_ASSUMED_SHAPE && dim + && dim->expr_type == EXPR_CONSTANT && ref->u.ar.type != AR_SECTION) + { + if (!(array->symtree && array->symtree->n.sym + && (array->symtree->n.sym->attr.allocatable + || array->symtree->n.sym->attr.pointer))) + { + unsigned long int ndim; + gfc_expr *lower, *res; + + ndim = mpz_get_si (dim->value.integer) - 1; + lower = as->lower[ndim]; + if (lower->expr_type == EXPR_CONSTANT) + { + res = gfc_copy_expr (lower); + if (kind) + { + int nkind = mpz_get_si (kind->value.integer); + res->ts.kind = nkind; + } + return res; + } + } + } + if (as && (as->type == AS_DEFERRED || as->type == AS_ASSUMED_SHAPE || as->type == AS_ASSUMED_RANK)) return NULL; Index: fortran/trans-array.h =================================================================== --- fortran/trans-array.h (Revision 221944) +++ fortran/trans-array.h (Arbeitskopie) @@ -64,8 +64,6 @@ tree gfc_copy_only_alloc_comp (gfc_symbol *, tree, tree gfc_alloc_allocatable_for_assignment (gfc_loopinfo*, gfc_expr*, gfc_expr*); -bool gfc_is_reallocatable_lhs (gfc_expr *); - /* Add initialization for deferred arrays. */ void gfc_trans_deferred_array (gfc_symbol *, gfc_wrapped_block *); /* Generate an initializer for a static pointer or allocatable array. */ Index: testsuite/gfortran.dg/function_optimize_1.f90 =================================================================== --- testsuite/gfortran.dg/function_optimize_1.f90 (Revision 221944) +++ testsuite/gfortran.dg/function_optimize_1.f90 (Arbeitskopie) @@ -1,5 +1,5 @@ ! { dg-do compile } -! { dg-options "-O -fdump-tree-original -Warray-temporaries" } +! { dg-options "-O -fdump-tree-original -finline-matmul-limit=0 -Warray-temporaries" } program main implicit none real, dimension(2,2) :: a, b, c, d Index: testsuite/gfortran.dg/function_optimize_2.f90 =================================================================== --- testsuite/gfortran.dg/function_optimize_2.f90 (Revision 221944) +++ testsuite/gfortran.dg/function_optimize_2.f90 (Arbeitskopie) @@ -1,5 +1,5 @@ ! { dg-do compile } -! { dg-options "-O -faggressive-function-elimination -fdump-tree-original" } +! { dg-options "-O -finline-matmul-limit=0 -faggressive-function-elimination -fdump-tree-original" } program main implicit none real, dimension(2,2) :: a, b, c, d Index: testsuite/gfortran.dg/function_optimize_5.f90 =================================================================== --- testsuite/gfortran.dg/function_optimize_5.f90 (Revision 221944) +++ testsuite/gfortran.dg/function_optimize_5.f90 (Arbeitskopie) @@ -1,5 +1,5 @@ ! { dg-do compile } -! { dg-options "-ffrontend-optimize -Wfunction-elimination" } +! { dg-options "-ffrontend-optimize -finline-matmul-limit=0 -Wfunction-elimination" } ! Check the -ffrontend-optimize (in the absence of -O) and ! -Wfunction-elimination options. program main Index: testsuite/gfortran.dg/function_optimize_7.f90 =================================================================== --- testsuite/gfortran.dg/function_optimize_7.f90 (Revision 221944) +++ testsuite/gfortran.dg/function_optimize_7.f90 (Arbeitskopie) @@ -1,5 +1,5 @@ ! { dg-do compile } -! { dg-options "-O -fdump-tree-original -Warray-temporaries" } +! { dg-options "-O -fdump-tree-original -Warray-temporaries -finline-matmul-limit=0" } subroutine xx(n, m, a, b, c, d, x, z, i, s_in, s_out) implicit none integer, intent(in) :: n, m ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul 2015-04-11 12:24 ` Thomas Koenig @ 2015-04-11 13:43 ` Mikael Morin 2015-04-11 17:01 ` Thomas Koenig 0 siblings, 1 reply; 16+ messages in thread From: Mikael Morin @ 2015-04-11 13:43 UTC (permalink / raw) To: Thomas Koenig, Dominique d'Humières; +Cc: GCC Patches, GNU GFortran Hello, I haven't looked at the patch in detail yet, but... Le 11/04/2015 14:24, Thomas Koenig a écrit : > Still to do: Bounds checking (a rather big one), ... as you do a front-end to front-end transformation, you get bounds checking for free, don't you? Mikael ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul 2015-04-11 13:43 ` Mikael Morin @ 2015-04-11 17:01 ` Thomas Koenig 0 siblings, 0 replies; 16+ messages in thread From: Thomas Koenig @ 2015-04-11 17:01 UTC (permalink / raw) To: Mikael Morin, Dominique d'Humières; +Cc: GCC Patches, GNU GFortran Hi Mikael, >> Still to do: Bounds checking (a rather big one), > ... as you do a front-end to front-end transformation, you get bounds > checking for free, don't you? Only partially. What the patch does is integer i,j,k c = 0 do j=0, size(b,2)-1 do k=0, size(a, 2)-1 do i=0, size(a, 1)-1 c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) = c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) + a(i * stride(a,1) + lbound(a,1), k * stride(a,2) + lbound(a,2)) * b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2)) end do end do end do If size(b,2) < size(c,2) or size(a,1) < size(c,1) or size(a,2) < size(b,1), this will not get caught - no array bounds violation in the DO loops, but illegal code nonetheless. Also, the error message is different, which should also be changed. What I would like to add to check before the loop, and then add a "do not bounds-check" flag to the reference. Thomas ^ permalink raw reply [flat|nested] 16+ messages in thread
* [patch, fortran, RFC] First steps towards inlining matmul @ 2015-04-05 12:32 Thomas Koenig 2015-04-05 13:20 ` Jerry DeLisle 2015-04-07 19:24 ` David Malcolm 0 siblings, 2 replies; 16+ messages in thread From: Thomas Koenig @ 2015-04-05 12:32 UTC (permalink / raw) To: fortran, gcc-patches [-- Attachment #1: Type: text/plain, Size: 3687 bytes --] Hello world, this is a first draft of a patch to inline matmul (PR 37171). This is preliminary, but functional as far as it goes. Definitely for the next stage one :-) Basically, it takes c = matmul(a,b) and converts this into BLOCK integer i,j,k c = 0 do j=0, size(b,2)-1 do k=0, size(a, 2)-1 do i=0, size(a, 1)-1 c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) = c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) + a(i * stride(a,1) + lbound(a,1), k * stride(a,2) + lbound(a,2)) * b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2)) end do end do end do END BLOCK For this, I wrote something like a scalarizer that only operates on the AST only. I find it rather straigtforward to use (in contrast to the one we have), but that may also be due to the fact that I wrote it myself :-) I chose zero-based loop variables because I didn't want to special-case too many strange array bounds, and I trusted the induction variable optimization and CSE in the middle end and in the RTL optimization. If this approach suboptimizes something, let me know. References are frozen to make sure that no additional evaluations are done on. The patch to simplify.c is not strictly necessary, but it should open up some optimization possiblities and, frankly, makes the -fdump-fortran-optimized output at least halfway readable. This patch needs to be extended in different directions: - Currently, it only handles the case of two-dimensional arguments - Special cases like transpose(a) and spread(a) in the arguments arguments should be added, possibly with a reordering of loops and addition of some more variables. - Temporary handling for arguments and results, which does not depend on allocatable RHS. I am thinking of creating nested blocks, where the inner block has arrays with the bounds of the references. - Cases like matmul(a+1,b) can also be handled in the loop, without a temporary - This needs command-line arguments, -finline-matmul and -finline-matmul-limit=, similar to the handling of BLAS. - Bounds checking could/should also be handled. Currently, this causes the only regression because of a different error message. We need to call gfortran_runtime_error directly from the front end, and to be able to tell the bounds-checking code "do not worry about bounds checking this, it has already been taken care of". - I already have some test cases, but for this, I would really like to cover all code paths. This also needs some more work. So, what do you think about this? Thomas 2015-04-05 Thomas Koenig <tkoenig@gcc.gnu.org> * frontend-passes.c (create_var): Add optional argument vname as part of the name. Split off block creation into (insert_block): New function. (cfe_expr_0): Use "fcn" as part of tempoary name. (optimize_namesapce): Call optimize_matmul_assign. (combine_array_constructor): Use "constr" as part of temporary name. (get_array_inq_function): New function. (is_function_or_op): New function. (has_function_or_op): New function. (freeze_expr): New function. (freeze_references): New function. (convert_to_index_kind): New function. (create_do_loop): New function. (get_operand): New function. (get_size_1): New function. (scalarize_expr): New function. (optimize_matmul_assign): New function. * simplify.c (simplify_bound): Simplify the case of the lower bound of an assumed-shape argument. [-- Attachment #2: matmul-1.diff --] [-- Type: text/x-patch, Size: 19198 bytes --] Index: frontend-passes.c =================================================================== --- frontend-passes.c (Revision 221757) +++ frontend-passes.c (Arbeitskopie) @@ -43,7 +43,11 @@ static void doloop_warn (gfc_namespace *); static void optimize_reduction (gfc_namespace *); static int callback_reduction (gfc_expr **, int *, void *); static void realloc_strings (gfc_namespace *); -static gfc_expr *create_var (gfc_expr *); +static gfc_expr *create_var (gfc_expr *, const char *vname=NULL); +static int optimize_matmul_assign (gfc_code **, int *, void *); +static gfc_code * create_do_loop (gfc_expr *, gfc_expr *, gfc_expr *, + locus *, gfc_namespace *, + char *vname=NULL); /* How deep we are inside an argument list. */ @@ -95,6 +99,10 @@ static bool in_assoc_list; /* Entry point - run all passes for a namespace. */ +/* Counter for temporary variables. */ + +static int var_num = 1; + void gfc_run_passes (gfc_namespace *ns) { @@ -157,7 +165,7 @@ realloc_string_callback (gfc_code **c, int *walk_s return 0; current_code = c; - n = create_var (expr2); + n = create_var (expr2, "trim"); co->expr2 = n; return 0; } @@ -524,29 +532,11 @@ constant_string_length (gfc_expr *e) } -/* Returns a new expression (a variable) to be used in place of the old one, - with an assignment statement before the current statement to set - the value of the variable. Creates a new BLOCK for the statement if - that hasn't already been done and puts the statement, plus the - newly created variables, in that block. Special cases: If the - expression is constant or a temporary which has already - been created, just copy it. */ - -static gfc_expr* -create_var (gfc_expr * e) +static gfc_namespace* +insert_block () { - char name[GFC_MAX_SYMBOL_LEN +1]; - static int num = 1; - gfc_symtree *symtree; - gfc_symbol *symbol; - gfc_expr *result; - gfc_code *n; gfc_namespace *ns; - int i; - if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e)) - return gfc_copy_expr (e); - /* If the block hasn't already been created, do so. */ if (inserted_block == NULL) { @@ -578,7 +568,37 @@ constant_string_length (gfc_expr *e) else ns = inserted_block->ext.block.ns; - sprintf(name, "__var_%d",num++); + return ns; +} + +/* Returns a new expression (a variable) to be used in place of the old one, + with an optional assignment statement before the current statement to set + the value of the variable. Creates a new BLOCK for the statement if that + hasn't already been done and puts the statement, plus the newly created + variables, in that block. Special cases: If the expression is constant or + a temporary which has already been created, just copy it. */ + +static gfc_expr* +create_var (gfc_expr * e, const char *vname) +{ + char name[GFC_MAX_SYMBOL_LEN +1]; + gfc_symtree *symtree; + gfc_symbol *symbol; + gfc_expr *result; + gfc_code *n; + gfc_namespace *ns; + int i; + + if (e->expr_type == EXPR_CONSTANT || is_fe_temp (e)) + return gfc_copy_expr (e); + + ns = insert_block (); + + if (vname) + snprintf (name, GFC_MAX_SYMBOL_LEN, "__var_%d_%s", var_num++, vname); + else + sprintf (name, "__var_%d", var_num++); + if (gfc_get_sym_tree (name, ns, &symtree, false) != 0) gcc_unreachable (); @@ -658,6 +678,7 @@ constant_string_length (gfc_expr *e) "Creating array temporary at %L", &(e->where)); } + /* Generate the new assignment. */ n = XCNEW (gfc_code); n->op = EXEC_ASSIGN; @@ -724,7 +745,7 @@ cfe_expr_0 (gfc_expr **e, int *walk_subtrees, if (gfc_dep_compare_functions (*ei, *ej, true) == 0) { if (newvar == NULL) - newvar = create_var (*ei); + newvar = create_var (*ei, "fcn"); if (warn_function_elimination) do_warn_function_elimination (*ej); @@ -931,6 +952,7 @@ convert_elseif (gfc_code **c, int *walk_subtrees A /* Don't walk subtrees. */ return 0; } + /* Optimize a namespace, including all contained namespaces. */ static void @@ -947,6 +969,7 @@ optimize_namespace (gfc_namespace *ns) gfc_code_walker (&ns->code, convert_elseif, dummy_expr_callback, NULL); gfc_code_walker (&ns->code, cfe_code, cfe_expr_0, NULL); gfc_code_walker (&ns->code, optimize_code, optimize_expr, NULL); + gfc_code_walker (&ns->code, optimize_matmul_assign, dummy_expr_callback, NULL); /* BLOCKs are handled in the expression walker below. */ for (ns = ns->contained; ns; ns = ns->sibling) @@ -1222,7 +1245,7 @@ combine_array_constructor (gfc_expr *e) if (op2->ts.type == BT_CHARACTER) return false; - scalar = create_var (gfc_copy_expr (op2)); + scalar = create_var (gfc_copy_expr (op2), "constr"); oldbase = op1->value.constructor; newbase = NULL; @@ -1939,7 +1962,576 @@ doloop_warn (gfc_namespace *ns) gfc_code_walker (&ns->code, doloop_code, do_function, NULL); } +/* Auxiliary function to build and simplify an array inquiry function. + dim is zero-based. */ +static gfc_expr * +get_array_inq_function (gfc_expr *e, int dim, gfc_isym_id id) +{ + + gfc_expr *fcn; + gfc_actual_arglist *actual_arglist, *n1, *n2; + gfc_expr *dim_arg, *kind; + const char *name; + + fcn = gfc_get_expr (); + fcn->expr_type = EXPR_FUNCTION; + fcn->value.function.isym = gfc_intrinsic_function_by_id (id); + actual_arglist = gfc_get_actual_arglist (); + actual_arglist->expr = gfc_copy_expr (e); + + n1 = gfc_get_actual_arglist (); + dim_arg = gfc_get_int_expr (gfc_default_integer_kind, &e->where, dim); + n1->expr = dim_arg; + actual_arglist->next = n1; + + n2 = gfc_get_actual_arglist (); + kind = gfc_get_int_expr (gfc_default_integer_kind, &e->where, + gfc_index_integer_kind); + n2->expr = kind; + n1->next = n2; + + fcn->value.function.actual = actual_arglist; + fcn->where = e->where; + fcn->ts.type = BT_INTEGER; + fcn->ts.kind = gfc_index_integer_kind; + + switch (id) + { + case GFC_ISYM_LBOUND: + name = "__internal_lbound"; + break; + + case GFC_ISYM_UBOUND: + name = "__internal_ubound"; + break; + + case GFC_ISYM_SIZE: + name = "__internal_size"; + break; + + default: + gcc_unreachable (); + } + + gfc_get_sym_tree (name, current_ns, &fcn->symtree, false); + fcn->symtree->n.sym->ts = fcn->ts; + fcn->symtree->n.sym->attr.flavor = FL_PROCEDURE; + fcn->symtree->n.sym->attr.function = 1; + fcn->symtree->n.sym->attr.elemental = 1; + fcn->symtree->n.sym->attr.referenced = 1; + fcn->symtree->n.sym->attr.access = ACCESS_PRIVATE; + gfc_commit_symbol (fcn->symtree->n.sym); + gfc_simplify_expr (fcn, 0); + return fcn; +} + +/* Callback function for has_function_or_op. */ + +static int +is_function_or_op (gfc_expr **e, int *walk_subtrees ATTRIBUTE_UNUSED, + void *data ATTRIBUTE_UNUSED) +{ + if ((*e) == 0) + return 0; + else + return (*e)->expr_type == EXPR_FUNCTION + || (*e)->expr_type == EXPR_OP; +} + +/* Returns true if the expression contains a function. */ + +static bool +has_function_or_op (gfc_expr **e) +{ + if (e == NULL) + return false; + else + return gfc_expr_walker (e, is_function_or_op, NULL); +} + +/* Freeze a single expression. */ + +static void +freeze_expr (gfc_expr **ep) +{ + gfc_expr *ne; + if (has_function_or_op (ep)) + { + ne = create_var (*ep, "freeze"); + *ep = ne; + } +} + +/* Go through an expression's references and assign them to temporary + variables if they contain functions. This is usually done prior to + front-end scalarization to avoid multiple invocations of functions. */ + +static void +freeze_references (gfc_expr *e) +{ + gfc_ref *r; + gfc_array_ref *ar; + int i; + + for (r=e->ref; r; r=r->next) + { + if (r->type == REF_SUBSTRING) + { + if (r->u.ss.start != NULL) + freeze_expr (&r->u.ss.start); + + if (r->u.ss.end != NULL) + freeze_expr (&r->u.ss.end); + } + else if (r->type == REF_ARRAY) + { + ar = &r->u.ar; + switch (ar->type) + { + case AR_FULL: + break; + + case AR_SECTION: + for (i=0; i<ar->dimen; i++) + { + if (ar->dimen_type[i] == DIMEN_RANGE) + { + freeze_expr (&ar->start[i]); + freeze_expr (&ar->end[i]); + freeze_expr (&ar->stride[i]); + } + } + default: + break; + } + } + } +} + +/* Convert to gfc_index_integer_kind if needed, just do a copy otherwise. */ + +static gfc_expr * +convert_to_index_kind (gfc_expr *e) +{ + gfc_expr *res; + + gcc_assert (e != NULL); + + res = gfc_copy_expr (e); + + gcc_assert (e->ts.type == BT_INTEGER); + + if (res->ts.kind != gfc_index_integer_kind) + { + gfc_typespec ts; + gfc_clear_ts (&ts); + ts.type = BT_INTEGER; + ts.kind = gfc_index_integer_kind; + + gfc_convert_type_warn (e, &ts, 2, 0); + } + + return res; +} + +/* Function to create a DO loop including creation of the + iteration variable. gfc_expr are copied.*/ + +static gfc_code * +create_do_loop (gfc_expr *start, gfc_expr *end, gfc_expr *step, locus *where, + gfc_namespace *ns, char *vname) +{ + + char name[GFC_MAX_SYMBOL_LEN +1]; + gfc_symtree *symtree; + gfc_symbol *symbol; + gfc_expr *i; + gfc_code *n, *n2; + + /* Create an expression for the iteration variable. */ + if (vname) + sprintf (name, "__var_%d_do_%s", var_num++, vname); + else + sprintf (name, "__var_%d_do", var_num++); + + + if (gfc_get_sym_tree (name, ns, &symtree, false) != 0) + gcc_unreachable (); + + symbol = symtree->n.sym; + symbol->ts.type = BT_INTEGER; + symbol->ts.kind = gfc_index_integer_kind; + symbol->attr.flavor = FL_VARIABLE; + symbol->attr.referenced = 1; + symbol->attr.dimension = 0; + symbol->attr.fe_temp = 1; + gfc_commit_symbol (symbol); + + i = gfc_get_expr (); + i->expr_type = EXPR_VARIABLE; + i->ts = symbol->ts; + i->rank = 0; + i->where = *where; + i->symtree = symtree; + + n = XCNEW (gfc_code); + n->op = EXEC_DO; + n->loc = *where; + n->ext.iterator = gfc_get_iterator (); + n->ext.iterator->var = i; + n->ext.iterator->start = convert_to_index_kind (start); + n->ext.iterator->end = convert_to_index_kind (end); + if (step) + n->ext.iterator->step = convert_to_index_kind (step); + else + n->ext.iterator->step = gfc_get_int_expr (gfc_index_integer_kind, + where, 1); + + n2 = XCNEW (gfc_code); + n2->op = EXEC_DO; + n2->loc = *where; + n2->next = NULL; + n->block = n2; + return n; +} + +/* Return an operation of one two gfc_expr (one if e2 is NULL. This assumes + compatible typespecs. */ + +static gfc_expr * +get_operand (gfc_intrinsic_op op, gfc_expr *e1, gfc_expr *e2) +{ + gfc_expr *res; + + res = gfc_get_expr (); + res->ts = e1->ts; + res->where = e1->where; + res->expr_type = EXPR_OP; + res->value.op.op = op; + res->value.op.op1 = e1; + res->value.op.op2 = e2; + gfc_simplify_expr (res, 0); + return res; +} + +/* Get the upper bound of the DO loops for matmul along + a dimension, i.e. size minus one. */ +static gfc_expr* +get_size_m1 (gfc_expr *e, int dimen) +{ + mpz_t size; + gfc_expr *res; + + if (gfc_array_dimen_size (e, dimen, &size)) + { + res = gfc_get_constant_expr (BT_INTEGER, + gfc_index_integer_kind, &e->where); + mpz_sub_ui (res->value.integer, size, 1); + mpz_clear (size); + } + else + { + res = get_operand (INTRINSIC_MINUS, + get_array_inq_function (e, dimen + 1, GFC_ISYM_SIZE), + gfc_get_int_expr (gfc_index_integer_kind, + &e->where, 1)); + gfc_simplify_expr (res, 0); + } + + return res; +} + +/* Function to return a scalarized expression. It is assumed that indices are + zero based to make generation of DO loops easier. A zero as index will + access the first element along a dimension. Single element references will + be skipped. A NULL as an expression will be replaced by a full reference. + This assumes that the index loops have gfc_index_integer_kind, and that all + references have been frozen. */ + +static gfc_expr* +scalarized_expr (gfc_expr *e_in, gfc_expr **index, int count_index) +{ + gfc_ref *ref; + gfc_array_ref *ar; + int i; + int rank; + gfc_expr *e; + int i_index; + bool was_fullref; + + e = gfc_copy_expr(e_in); + + rank = e->rank; + + ref = e->ref; + + while (ref) + { + if (ref->type == REF_ARRAY + && (ref->u.ar.type == AR_FULL || ref->u.ar.type == AR_SECTION)) + break; + } + ar = &ref->u.ar; + + /* We scalarize count_index variables, reducing the rank by count_index. */ + + e->rank = rank - count_index; + + was_fullref = ar->type == AR_FULL; + + if (e->rank == 0) + ar->type = AR_ELEMENT; + else + ar->type = AR_SECTION; + + ar->dimen = rank; + + /* Loop over the indices. For each index, create the expression + index + lbound(e, dim). */ + + i_index = 0; + for (i=0; i<rank; i++) + { + if (was_fullref || ar->dimen_type[i] == DIMEN_RANGE) + { + if (index[i_index] != NULL) + { + gfc_expr *lbound, *nindex; + gfc_expr *loopvar; + + loopvar = gfc_copy_expr (index[i_index]); + + if (ar->stride[i]) + { + gfc_expr *tmp; + + tmp = gfc_copy_expr(ar->stride[i]); + if (tmp->ts.kind != gfc_index_integer_kind) + { + gfc_typespec ts; + gfc_clear_ts (&ts); + ts.type = BT_INTEGER; + ts.kind = gfc_index_integer_kind; + gfc_convert_type (tmp, &ts, 2); + } + nindex = get_operand (INTRINSIC_TIMES, loopvar, tmp); + } + else + nindex = loopvar; + + if (ar->start[i]) + { + lbound = gfc_copy_expr (ar->start[i]); + if (lbound->ts.kind != gfc_index_integer_kind) + { + gfc_typespec ts; + gfc_clear_ts (&ts); + ts.type = BT_INTEGER; + ts.kind = gfc_index_integer_kind; + gfc_convert_type (lbound, &ts, 2); + + } + } + else + lbound = get_array_inq_function (e_in, i+1, GFC_ISYM_LBOUND); + + ar->dimen_type[i] = DIMEN_ELEMENT; + ar->start[i] = get_operand (INTRINSIC_PLUS, nindex, lbound); + ar->end[i] = NULL; + ar->stride[i] = NULL; + gfc_simplify_expr (ar->start[i], 0); + } + else if (was_fullref) + { + ar->dimen_type[i] = DIMEN_RANGE; + ar->start[i] = NULL; + ar->end[i] = NULL; + ar->stride[i] = NULL; + } + i_index ++; + } + } + return e; +} + + +/* Optimize assignments of the form c = matmul(a,b). + Handle only the cases currently where b and c are rank-two arrays. + + This translates the code to + + BLOCK + integer i,j,k + c = 0 + do j=0, size(b,2)-1 + do k=0, size(a, 2)-1 + do i=0, size(a, 1)-1 + c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) = + c(i * stride(c,1) + lbound(c,1), j * stride(c,2) + lbound(c,2)) + + a(i * stride(a,1) + lbound(a,1), k * stride(a,2) + lbound(a,2)) * + b(k * stride(b,1) + lbound(b,1), j * stride(b,2) + lbound(b,2)) + end do + end do + end do + END BLOCK + +*/ + +static int +optimize_matmul_assign (gfc_code **c, + int *walk_subtrees ATTRIBUTE_UNUSED, + void *data ATTRIBUTE_UNUSED) +{ + gfc_code *co = *c; + gfc_expr *expr1, *expr2; + gfc_expr *matrix_a, *matrix_b; + gfc_actual_arglist *a, *b; + gfc_code *do_1, *do_2, *do_3, *assign_zero, *assign_matmul; + gfc_expr *zero_e; + gfc_expr *u1, *u2, *u3; + gfc_expr *list[2]; + gfc_expr *ascalar, *bscalar, *cscalar; + gfc_expr *mult; + gfc_expr *var_1, *var_2, *var_3; + gfc_namespace *ns; + gfc_intrinsic_op op_times, op_plus; + + if (co->op != EXEC_ASSIGN) + return 0; + + expr1 = co->expr1; + expr2 = co->expr2; + if (expr2->expr_type != EXPR_FUNCTION + || expr2->value.function.isym == NULL + || expr2->value.function.isym->id != GFC_ISYM_MATMUL) + return 0; + + /* Handling only two matrices of dimension two for the time being. */ + + current_code = c; + inserted_block = NULL; + changed_statement = NULL; + + a = expr2->value.function.actual; + matrix_a = a->expr; + b = a->next; + matrix_b = b->expr; + + if (matrix_a->expr_type != EXPR_VARIABLE + || matrix_b->expr_type != EXPR_VARIABLE) + return 0; + + if (matrix_a->rank != 2 || matrix_b->rank != 2) + return 0; + + if (gfc_check_dependency (expr1, matrix_a, true) + || gfc_check_dependency (expr1, matrix_b, true)) + return 0; + + ns = insert_block (); + + switch(expr1->ts.type) + { + case BT_INTEGER: + zero_e = gfc_get_int_expr (expr1->ts.kind, &expr1->where, 0); + op_times = INTRINSIC_TIMES; + op_plus = INTRINSIC_PLUS; + break; + + case BT_LOGICAL: + op_times = INTRINSIC_AND; + op_plus = INTRINSIC_OR; + zero_e = gfc_get_logical_expr (expr1->ts.kind, &expr1->where, + 0); + break; + case BT_REAL: + zero_e = gfc_get_constant_expr (BT_REAL, expr1->ts.kind, + &expr1->where); + mpfr_set_si (zero_e->value.real, 0, GFC_RND_MODE); + op_times = INTRINSIC_TIMES; + op_plus = INTRINSIC_PLUS; + break; + + case BT_COMPLEX: + zero_e = gfc_get_constant_expr (BT_COMPLEX, expr1->ts.kind, + &expr1->where); + mpc_set_si_si (zero_e->value.complex, 0, 0, GFC_RND_MODE); + op_times = INTRINSIC_TIMES; + op_plus = INTRINSIC_PLUS; + + break; + + default: + gcc_unreachable(); + } + + assign_zero = XCNEW (gfc_code); + assign_zero->op = EXEC_ASSIGN; + assign_zero->loc = co->loc; + assign_zero->expr1 = gfc_copy_expr (expr1); + assign_zero->expr2 = zero_e; + + ns->code = assign_zero; + current_code = &ns->code; + + freeze_references (matrix_a); + freeze_references (matrix_b); + + u1 = get_size_m1 (matrix_b, 1); + u2 = get_size_m1 (matrix_a, 1); + u3 = get_size_m1 (matrix_a, 0); + + do_1 = create_do_loop (gfc_get_int_expr (gfc_index_integer_kind, + &co->loc, 0), + u1, NULL, &co->loc, ns); + + do_2 = create_do_loop (gfc_get_int_expr (gfc_index_integer_kind, + &co->loc, 0), + u2, NULL, &co->loc, ns); + + do_1->block->next = do_2; + + do_3 = create_do_loop (gfc_get_int_expr (gfc_index_integer_kind, + &co->loc, 0), + u3, NULL, &co->loc, ns); + + do_2->block->next = do_3; + + var_1 = do_1->ext.iterator->var; + var_2 = do_2->ext.iterator->var; + var_3 = do_3->ext.iterator->var; + + assign_zero->next = do_1; + + assign_matmul = XCNEW (gfc_code); + assign_matmul->op = EXEC_ASSIGN; + assign_matmul->loc = co->loc; + + list[0] = var_3; + list[1] = var_1; + cscalar = scalarized_expr (gfc_copy_expr (co->expr1), list, 2); + + list[0] = var_3; + list[1] = var_2; + ascalar = scalarized_expr (gfc_copy_expr (matrix_a), list, 2); + + list[0] = var_2; + list[1] = var_1; + bscalar = scalarized_expr (gfc_copy_expr (matrix_b), list, 2); + + assign_matmul->expr1 = gfc_copy_expr (cscalar); + + mult = get_operand (op_times, ascalar, bscalar); + assign_matmul->expr2 = get_operand (op_plus, cscalar, mult); + + do_3->block->next = assign_matmul; + + co->next = NULL; + gfc_free_statements(co); + return 0; +} + #define WALK_SUBEXPR(NODE) \ do \ { \ [-- Attachment #3: simplify-1.diff --] [-- Type: text/x-patch, Size: 819 bytes --] Index: simplify.c =================================================================== --- simplify.c (Revision 221757) +++ simplify.c (Arbeitskopie) @@ -3445,6 +3445,25 @@ simplify_bound (gfc_expr *array, gfc_expr *dim, gf done: + if (!upper && as->type == AS_ASSUMED_SHAPE && dim) + { + if (dim->expr_type == EXPR_CONSTANT) + { + unsigned long int ndim; + gfc_expr *lower, *res; + + ndim = mpz_get_si (dim->value.integer) - 1; + lower = as->lower[ndim]; + if (lower->expr_type == EXPR_CONSTANT) + { + int nkind = mpz_get_si (kind->value.integer); + res = gfc_copy_expr (lower); + res->ts.kind = nkind; + return res; + } + } + } + if (as && (as->type == AS_DEFERRED || as->type == AS_ASSUMED_SHAPE || as->type == AS_ASSUMED_RANK)) return NULL; ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul 2015-04-05 12:32 Thomas Koenig @ 2015-04-05 13:20 ` Jerry DeLisle 2015-04-05 13:48 ` Thomas Koenig 2015-04-07 19:24 ` David Malcolm 1 sibling, 1 reply; 16+ messages in thread From: Jerry DeLisle @ 2015-04-05 13:20 UTC (permalink / raw) To: Thomas Koenig, fortran, gcc-patches On 04/05/2015 05:32 AM, Thomas Koenig wrote: --- snip --- > > So, what do you think about this? > > Thomas I am curious about what performance gain results from this? I can see saving a library call to our runtime libraries. Do you have some timing results? Jerry ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul 2015-04-05 13:20 ` Jerry DeLisle @ 2015-04-05 13:48 ` Thomas Koenig 0 siblings, 0 replies; 16+ messages in thread From: Thomas Koenig @ 2015-04-05 13:48 UTC (permalink / raw) To: Jerry DeLisle, fortran, gcc-patches Hi Jerry, > I am curious about what performance gain results from this? I can see > saving a library call to our runtime libraries. Do you have some timing > results? The speedup can be quite drastic for small matrices which can be completely unrolled by -O3: b1.f90: program main use b2 implicit none real, dimension(3,3) :: a, b, c integer :: i call random_number(a) call random_number(b) do i=1,10**8 c = matmul(a,b) call bar(b,c) end do end program main b2.f90: module b2 contains subroutine bar(b,c) real, dimension(3,3) :: b,c end subroutine bar end module b2 ig25@linux-fd1f:~/Krempel/Matmul> gfortran -O3 -fno-frontend-optimize b2.f90 b1.f90 && time ./a.out real 0m15.411s user 0m15.404s sys 0m0.001s ig25@linux-fd1f:~/Krempel/Matmul> gfortran -O3 b2.f90 b1.f90 && time ./a.out real 0m1.736s user 0m1.735s sys 0m0.001s ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch, fortran, RFC] First steps towards inlining matmul 2015-04-05 12:32 Thomas Koenig 2015-04-05 13:20 ` Jerry DeLisle @ 2015-04-07 19:24 ` David Malcolm 1 sibling, 0 replies; 16+ messages in thread From: David Malcolm @ 2015-04-07 19:24 UTC (permalink / raw) To: Thomas Koenig; +Cc: fortran, gcc-patches On Sun, 2015-04-05 at 14:32 +0200, Thomas Koenig wrote: > Hello world, > > this is a first draft of a patch to inline matmul (PR 37171). This is (FWIW, the above PR# looks like it should be PR 37131) ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2015-04-11 17:01 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-04-05 13:55 [patch, fortran, RFC] First steps towards inlining matmul Dominique Dhumieres 2015-04-05 14:33 ` Thomas Koenig 2015-04-05 18:25 ` Dominique d'Humières 2015-04-05 20:37 ` Thomas Koenig 2015-04-05 23:15 ` Dominique d'Humières 2015-04-06 12:18 ` Dominique d'Humières 2015-04-09 22:18 ` Thomas Koenig 2015-04-10 14:15 ` Dominique d'Humières 2015-04-10 16:57 ` Dominique d'Humières 2015-04-11 12:24 ` Thomas Koenig 2015-04-11 13:43 ` Mikael Morin 2015-04-11 17:01 ` Thomas Koenig -- strict thread matches above, loose matches on Subject: below -- 2015-04-05 12:32 Thomas Koenig 2015-04-05 13:20 ` Jerry DeLisle 2015-04-05 13:48 ` Thomas Koenig 2015-04-07 19:24 ` David Malcolm
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).