* Loop fusion. @ 2018-04-22 15:23 Toon Moene 2018-04-23 11:00 ` Bin.Cheng 2018-04-23 11:02 ` Richard Biener 0 siblings, 2 replies; 13+ messages in thread From: Toon Moene @ 2018-04-22 15:23 UTC (permalink / raw) To: gcc mailing list A few days ago there was a rant on the Fortran Standardization Committee's e-mail list about Fortran's "whole array arithmetic" being unoptimizable. An example picked at random from our weather forecasting code: ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP) ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP) ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP) ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP) The reaction from one of the members of the committee (about "their" compiler): 'And multiple consecutive array statements with the same shape are âfusedâ exactly so that the compiler can generate good cache use. This sort of optimization is pretty low hanging fruit.' As far as I can see loop fusion as a stand-alone optimization is not supported as yet, although some mention is made in the context of graphite. Is this something that should be pursued ? Kind regards, -- Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290 Saturnushof 14, 3738 XG Maartensdijk, The Netherlands At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/ Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Loop fusion. 2018-04-22 15:23 Loop fusion Toon Moene @ 2018-04-23 11:00 ` Bin.Cheng 2018-04-23 12:31 ` Richard Biener 2018-04-23 11:02 ` Richard Biener 1 sibling, 1 reply; 13+ messages in thread From: Bin.Cheng @ 2018-04-23 11:00 UTC (permalink / raw) To: Toon Moene; +Cc: gcc mailing list On Sun, Apr 22, 2018 at 3:27 PM, Toon Moene <toon@moene.org> wrote: > A few days ago there was a rant on the Fortran Standardization Committee's > e-mail list about Fortran's "whole array arithmetic" being unoptimizable. > > An example picked at random from our weather forecasting code: > > ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP) > ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP) > ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP) > ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP) > > The reaction from one of the members of the committee (about "their" > compiler): > > 'And multiple consecutive array statements with the same shape are “fused” > exactly so that the compiler can generate good cache use. This sort of > optimization is pretty low hanging fruit.' > > As far as I can see loop fusion as a stand-alone optimization is not > supported as yet, although some mention is made in the context of graphite. > > Is this something that should be pursued ? Hi, I don't know the current status of fusion in graphite. As for traditional fusion transformation, I think it's not very difficult to be implemented along with existing distribution, actually, quite lot of code should be shared. What we do need are something like: more motivation cases, good/conservative cost model. Thanks, bin > > Kind regards, > > -- > Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290 > Saturnushof 14, 3738 XG Maartensdijk, The Netherlands > At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/ > Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Loop fusion. 2018-04-23 11:00 ` Bin.Cheng @ 2018-04-23 12:31 ` Richard Biener 2018-04-23 12:47 ` Janne Blomqvist 0 siblings, 1 reply; 13+ messages in thread From: Richard Biener @ 2018-04-23 12:31 UTC (permalink / raw) To: Bin.Cheng; +Cc: Toon Moene, gcc mailing list On Mon, Apr 23, 2018 at 12:59 PM, Bin.Cheng <amker.cheng@gmail.com> wrote: > On Sun, Apr 22, 2018 at 3:27 PM, Toon Moene <toon@moene.org> wrote: >> A few days ago there was a rant on the Fortran Standardization Committee's >> e-mail list about Fortran's "whole array arithmetic" being unoptimizable. >> >> An example picked at random from our weather forecasting code: >> >> ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP) >> ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP) >> ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP) >> ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP) >> >> The reaction from one of the members of the committee (about "their" >> compiler): >> >> 'And multiple consecutive array statements with the same shape are “fused” >> exactly so that the compiler can generate good cache use. This sort of >> optimization is pretty low hanging fruit.' >> >> As far as I can see loop fusion as a stand-alone optimization is not >> supported as yet, although some mention is made in the context of graphite. >> >> Is this something that should be pursued ? > Hi, > I don't know the current status of fusion in graphite. As for > traditional fusion transformation, I think it's not very difficult to > be implemented along with existing distribution, actually, quite lot > of code should be shared. What we do need are something like: more > motivation cases, good/conservative cost model. Yes, I guess before distribution you want to do maximum fusion and then apply (re-)distribution on the fused loop. The cost model should be the very same for distribution/fusion. Richard. > Thanks, > bin >> >> Kind regards, >> >> -- >> Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290 >> Saturnushof 14, 3738 XG Maartensdijk, The Netherlands >> At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/ >> Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Loop fusion. 2018-04-23 12:31 ` Richard Biener @ 2018-04-23 12:47 ` Janne Blomqvist 2018-04-23 14:11 ` Richard Biener 0 siblings, 1 reply; 13+ messages in thread From: Janne Blomqvist @ 2018-04-23 12:47 UTC (permalink / raw) To: Richard Biener; +Cc: Bin.Cheng, Toon Moene, gcc mailing list On Mon, Apr 23, 2018 at 2:02 PM, Richard Biener <richard.guenther@gmail.com> wrote: > On Mon, Apr 23, 2018 at 12:59 PM, Bin.Cheng <amker.cheng@gmail.com> wrote: > > On Sun, Apr 22, 2018 at 3:27 PM, Toon Moene <toon@moene.org> wrote: > >> A few days ago there was a rant on the Fortran Standardization > Committee's > >> e-mail list about Fortran's "whole array arithmetic" being > unoptimizable. > >> > >> An example picked at random from our weather forecasting code: > >> > >> ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP) > >> ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP) > >> ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP) > >> ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP) > >> > >> The reaction from one of the members of the committee (about "their" > >> compiler): > >> > >> 'And multiple consecutive array statements with the same shape are > “fused” > >> exactly so that the compiler can generate good cache use. This sort of > >> optimization is pretty low hanging fruit.' > >> > >> As far as I can see loop fusion as a stand-alone optimization is not > >> supported as yet, although some mention is made in the context of > graphite. > >> > >> Is this something that should be pursued ? > > Hi, > > I don't know the current status of fusion in graphite. As for > > traditional fusion transformation, I think it's not very difficult to > > be implemented along with existing distribution, actually, quite lot > > of code should be shared. What we do need are something like: more > > motivation cases, good/conservative cost model. > > Yes, I guess before distribution you want to do maximum fusion and then > apply (re-)distribution on the fused loop. The cost model should be the > very same for distribution/fusion. > > Richard. > I recall Fujitsu bragging that the key to them getting good application performance (read: outside linpack) on the K computer is extensive use of loop FISSION + software pipelining. Though I guess sw-pipelining is only useful if you have lots of architectural registers, which disqualifies x86-64.. -- Janne Blomqvist ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Loop fusion. 2018-04-23 12:47 ` Janne Blomqvist @ 2018-04-23 14:11 ` Richard Biener 0 siblings, 0 replies; 13+ messages in thread From: Richard Biener @ 2018-04-23 14:11 UTC (permalink / raw) To: Janne Blomqvist; +Cc: Bin.Cheng, Toon Moene, gcc mailing list On Mon, Apr 23, 2018 at 2:31 PM, Janne Blomqvist <blomqvist.janne@gmail.com> wrote: > On Mon, Apr 23, 2018 at 2:02 PM, Richard Biener <richard.guenther@gmail.com> > wrote: >> >> On Mon, Apr 23, 2018 at 12:59 PM, Bin.Cheng <amker.cheng@gmail.com> wrote: >> > On Sun, Apr 22, 2018 at 3:27 PM, Toon Moene <toon@moene.org> wrote: >> >> A few days ago there was a rant on the Fortran Standardization >> >> Committee's >> >> e-mail list about Fortran's "whole array arithmetic" being >> >> unoptimizable. >> >> >> >> An example picked at random from our weather forecasting code: >> >> >> >> ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP) >> >> ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP) >> >> ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP) >> >> ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP) >> >> >> >> The reaction from one of the members of the committee (about "their" >> >> compiler): >> >> >> >> 'And multiple consecutive array statements with the same shape are >> >> “fused” >> >> exactly so that the compiler can generate good cache use. This sort of >> >> optimization is pretty low hanging fruit.' >> >> >> >> As far as I can see loop fusion as a stand-alone optimization is not >> >> supported as yet, although some mention is made in the context of >> >> graphite. >> >> >> >> Is this something that should be pursued ? >> > Hi, >> > I don't know the current status of fusion in graphite. As for >> > traditional fusion transformation, I think it's not very difficult to >> > be implemented along with existing distribution, actually, quite lot >> > of code should be shared. What we do need are something like: more >> > motivation cases, good/conservative cost model. >> >> Yes, I guess before distribution you want to do maximum fusion and then >> apply (re-)distribution on the fused loop. The cost model should be the >> very same for distribution/fusion. >> >> Richard. > > > > I recall Fujitsu bragging that the key to them getting good application > performance (read: outside linpack) on the K computer is extensive use of > loop FISSION + software pipelining. Though I guess sw-pipelining is only > useful if you have lots of architectural registers, which disqualifies > x86-64.. FISSION we can do quite well (though we lack a cost model here), that's what loop distribution does. Richard. > > -- > Janne Blomqvist ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Loop fusion. 2018-04-22 15:23 Loop fusion Toon Moene 2018-04-23 11:00 ` Bin.Cheng @ 2018-04-23 11:02 ` Richard Biener 2018-04-24 2:22 ` Toon Moene 1 sibling, 1 reply; 13+ messages in thread From: Richard Biener @ 2018-04-23 11:02 UTC (permalink / raw) To: Toon Moene; +Cc: gcc mailing list On Sun, Apr 22, 2018 at 4:27 PM, Toon Moene <toon@moene.org> wrote: > A few days ago there was a rant on the Fortran Standardization Committee's > e-mail list about Fortran's "whole array arithmetic" being unoptimizable. > > An example picked at random from our weather forecasting code: > > ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP) > ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP) > ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP) > ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP) > > The reaction from one of the members of the committee (about "their" > compiler): > > 'And multiple consecutive array statements with the same shape are “fused” > exactly so that the compiler can generate good cache use. This sort of > optimization is pretty low hanging fruit.' > > As far as I can see loop fusion as a stand-alone optimization is not > supported as yet, although some mention is made in the context of graphite. > > Is this something that should be pursued ? In principle GRAPHITE can handle loop fusion but yes, standalone fusion is sth useful. Note that while it looks "obvious" in the above source fragment the IL that is presented to optimizers may make things a lot less "low-hanging". Richard. > Kind regards, > > -- > Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290 > Saturnushof 14, 3738 XG Maartensdijk, The Netherlands > At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/ > Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Loop fusion. 2018-04-23 11:02 ` Richard Biener @ 2018-04-24 2:22 ` Toon Moene 2018-04-24 12:58 ` Richard Biener 0 siblings, 1 reply; 13+ messages in thread From: Toon Moene @ 2018-04-24 2:22 UTC (permalink / raw) To: Richard Biener; +Cc: gcc mailing list On 04/23/2018 01:00 PM, Richard Biener wrote: > On Sun, Apr 22, 2018 at 4:27 PM, Toon Moene <toon@moene.org> wrote: >> A few days ago there was a rant on the Fortran Standardization Committee's >> e-mail list about Fortran's "whole array arithmetic" being unoptimizable. >> >> An example picked at random from our weather forecasting code: >> >> ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP) >> ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP) >> ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP) >> ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP) >> >> The reaction from one of the members of the committee (about "their" >> compiler): >> >> 'And multiple consecutive array statements with the same shape are âfusedâ >> exactly so that the compiler can generate good cache use. This sort of >> optimization is pretty low hanging fruit.' >> >> As far as I can see loop fusion as a stand-alone optimization is not >> supported as yet, although some mention is made in the context of graphite. >> >> Is this something that should be pursued ? > > In principle GRAPHITE can handle loop fusion but yes, standalone fusion > is sth useful. > > Note that while it looks "obvious" in the above source fragment the IL > that is presented to optimizers may make things a lot less "low-hanging". Well, the loops are generated by the front end, so I *assume* they are basically the same ... Probably the largest problem to address is the heuristic for preventing register pressure going through the roof ... -- Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290 Saturnushof 14, 3738 XG Maartensdijk, The Netherlands At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/ Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Loop fusion. 2018-04-24 2:22 ` Toon Moene @ 2018-04-24 12:58 ` Richard Biener 2018-04-25 8:06 ` Toon Moene 0 siblings, 1 reply; 13+ messages in thread From: Richard Biener @ 2018-04-24 12:58 UTC (permalink / raw) To: Toon Moene; +Cc: gcc mailing list On Mon, Apr 23, 2018 at 8:35 PM, Toon Moene <toon@moene.org> wrote: > On 04/23/2018 01:00 PM, Richard Biener wrote: > >> On Sun, Apr 22, 2018 at 4:27 PM, Toon Moene <toon@moene.org> wrote: >>> >>> A few days ago there was a rant on the Fortran Standardization >>> Committee's >>> e-mail list about Fortran's "whole array arithmetic" being unoptimizable. >>> >>> An example picked at random from our weather forecasting code: >>> >>> ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP) >>> ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP) >>> ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP) >>> ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP) >>> >>> The reaction from one of the members of the committee (about "their" >>> compiler): >>> >>> 'And multiple consecutive array statements with the same shape are >>> “fused” >>> exactly so that the compiler can generate good cache use. This sort of >>> optimization is pretty low hanging fruit.' >>> >>> As far as I can see loop fusion as a stand-alone optimization is not >>> supported as yet, although some mention is made in the context of >>> graphite. >>> >>> Is this something that should be pursued ? >> >> >> In principle GRAPHITE can handle loop fusion but yes, standalone fusion >> is sth useful. >> >> Note that while it looks "obvious" in the above source fragment the IL >> that is presented to optimizers may make things a lot less "low-hanging". > > > Well, the loops are generated by the front end, so I *assume* they are > basically the same ... The issue will be boiler-plate code like duplicated loop header checks. That said, it's perfectly understandable that other Fortran compilers have high-level loop opts deeply embedded within their frontends... > Probably the largest problem to address is the heuristic for preventing > register pressure going through the roof ... Yes. As Bin said loop fusion and fission are closely related and should share their cost modeling - they both have the goal to optimize the number of input and output memory streams and their re-use within the constraints of the CPU architecture (which includes number of available registers). Note that the difficulty with fusion compared to fission is that our dependence analysis framework cannot handle the case of sibling loops (as required by fusion). So we have to either apply some "tricks" to use the current framework or move over to more capable dependence testing infrastructure like that of ISL. Richard. > > -- > Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290 > Saturnushof 14, 3738 XG Maartensdijk, The Netherlands > At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/ > Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Loop fusion. 2018-04-24 12:58 ` Richard Biener @ 2018-04-25 8:06 ` Toon Moene 0 siblings, 0 replies; 13+ messages in thread From: Toon Moene @ 2018-04-25 8:06 UTC (permalink / raw) To: Richard Biener; +Cc: gcc mailing list On 04/24/2018 09:18 AM, Richard Biener wrote: > On Mon, Apr 23, 2018 at 8:35 PM, Toon Moene <toon@moene.org> wrote: >> On 04/23/2018 01:00 PM, Richard Biener wrote: >>> Note that while it looks "obvious" in the above source fragment the IL >>> that is presented to optimizers may make things a lot less "low-hanging". >> >> >> Well, the loops are generated by the front end, so I *assume* they are >> basically the same ... > > The issue will be boiler-plate code like duplicated loop header checks. > That said, it's perfectly understandable that other Fortran compilers have > high-level loop opts deeply embedded within their frontends... I agree that this would be more easily handled in the Fortran front end. However, for that it would first have to get a (high level) basic block finder, because it has to be established that consecutive array expressions are part of the same basic block. I discussed high (i.e., Fortran-) level basic blocks briefly in my 2007 GCC Summit talk (http://moene.org/~toon/GCCSummit-2007.pdf, paragraph 7), but I do not think anyone really worked on it. -- Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290 Saturnushof 14, 3738 XG Maartensdijk, The Netherlands At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/ Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news ^ permalink raw reply [flat|nested] 13+ messages in thread
* Loop fusion. @ 2015-04-22 19:19 Toon Moene 2015-04-22 20:05 ` Steven Bosscher 0 siblings, 1 reply; 13+ messages in thread From: Toon Moene @ 2015-04-22 19:19 UTC (permalink / raw) To: gcc mailing list L.S., Last week, a colleague of mine from Meteo France held a talk at the yearly meeting of all researchers working on HARMONIE (see http://hirlam.org) discussing the performance of our code when compiled with each of the supported compilers on the Cray XC30 at ECMWF (http://www.ecmwf.int/en/computing/our-facilities). In the context of GCC this is relevant, because one of the three compilers is gfortran (version 4.9.2). One of his slides discussed the differences in optimizations that the three compilers offer; I was surprised to learn that GCC/gfortran doesn't do loop fusion *at all*. Note, I discussed loop fusion (among other optimizations) at LinuxExpo 99 (http://moene.org/~toon/nwp.ps) which, unsurprisingly, was held 16 years ago :-) Why is loop fusion important, especially in Fortran 90 and later programs ? Because without it, every array assignment is a single loop nest, isolated from related, same-shape assignments. Consider this (artificial, but typical) example [updating atmospheric quantities after the computation of the rate of change during a time step of the integration]: SUBROUTINE UPDATE_DT(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, & & NLON, NLAT, NLEV, TSTEP) ... REAL, DIMENSION(NLON, NLAT, NLEV) :: T, U, V, Q, DTDT, DUDT, DVDT, DQDT ... T = T + TSTEP*DTDT ! Update temperature U = U + TSTEP*DUDT ! Update east-west wind component V = V + TSTEP*DVDT ! Update north-south wind component Q = Q + TSTEP*DQDT ! Update specific humidity ... END This generates four consecutive 3 deep loop nests over NLEV, NLAT, NLON. Of course, it would be much more efficient if this were just one loop nest, as Fortran 77 programmers would write it: DO JLEV = 1, NLEV DO JLAT = 1, NLAT DO JLON = 1, NLON T(JLON, JLAT, JLEV) = T(JLON, JLAT, JLEV) + TSTEP*DTDT(JLON, JLAT, JLEV) U(JLON, JLAT, JLEV) = U(JLON, JLAT, JLEV) + TSTEP*DUDT(JLON, JLAT, JLEV) V(JLON, JLAT, JLEV) = V(JLON, JLAT, JLEV) + TSTEP*DVDT(JLON, JLAT, JLEV) Q(JLON, JLAT, JLEV) = Q(JLON, JLAT, JLEV) + TSTEP*DQDT(JLON, JLAT, JLEV) ENDDO ENDDO ENDDO After a loop fusion optimization pass the Fortran 90 and the Fortran 77 code should result in the same assembler output. Is this something the Graphite infrastructure could help with ? From the wiki documentation I get the impression that it only works on single loop nests, but I must confess that I am not familiar with the nomenclature in its description ... Would it be hard to write a loop fusion pass otherwise ? Kind regards, -- Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290 Saturnushof 14, 3738 XG Maartensdijk, The Netherlands At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/ Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Loop fusion. 2015-04-22 19:19 Toon Moene @ 2015-04-22 20:05 ` Steven Bosscher 2015-04-23 4:58 ` Toon Moene 0 siblings, 1 reply; 13+ messages in thread From: Steven Bosscher @ 2015-04-22 20:05 UTC (permalink / raw) To: Toon Moene; +Cc: gcc mailing list On Wed, Apr 22, 2015 at 6:59 PM, Toon Moene wrote: > Why is loop fusion important, especially in Fortran 90 and later programs ? > > Because without it, every array assignment is a single loop nest, isolated > from related, same-shape assignments. Why is this a bad thing? When you're talking about single-node machines, separate loops is probably faster if your arrays are large enough: better cache locality and easier to vectorize. ----- 8< ----- $ cat test.f90; gfortran.exe -O2 test.f90 ; ./a.exe PROGRAM TEST_FUSION IMPLICIT NONE REAL, PARAMETER :: TSTEP = 0.01 CALL ONE_TEST(100) CALL ONE_TEST(200) CALL ONE_TEST(400) STOP CONTAINS SUBROUTINE ONE_TEST(N) IMPLICIT NONE INTEGER :: N, I REAL, DIMENSION(:,:,:), ALLOCATABLE :: T, U, V, Q REAL, DIMENSION(:,:,:), ALLOCATABLE :: DTDT, DUDT, DVDT, DQDT REAL :: START, FINISH PRINT '("Test with N=",I3)', N ALLOCATE (T(N,N,N), U(N,N,N), V(N,N,N), Q(N,N,N)) ALLOCATE (DTDT(N,N,N), DUDT(N,N,N), DVDT(N,N,N), DQDT(N,N,N)) ! CALL CPU_TIME(START) DO I=1,100 CALL UPDATE_DT_1(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, N, N, N, TSTEP) END DO CALL CPU_TIME(FINISH) PRINT '("F90-style array assignments - time=",f6.3,"seconds.")', (FINISH - START) ! CALL CPU_TIME(START) DO I=1,100 CALL UPDATE_DT_2(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, N, N, N, TSTEP) END DO CALL CPU_TIME(FINISH) PRINT '("F77-style loopy assignments - time=",f6.3,"seconds.")', (FINISH - START) ! ! DEALLOCATE (T, U, V, Q) DEALLOCATE (DTDT, DUDT, DVDT, DQDT) PRINT * END SUBROUTINE ONE_TEST SUBROUTINE UPDATE_DT_1(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, & & NLON, NLAT, NLEV, TSTEP) IMPLICIT NONE INTEGER :: NLON, NLAT, NLEV REAL :: TSTEP REAL, DIMENSION(NLON, NLAT, NLEV) :: T, U, V, Q, DTDT, DUDT, DVDT, DQDT T = T + TSTEP*DTDT ! Update temperature U = U + TSTEP*DUDT ! Update east-west wind component V = V + TSTEP*DVDT ! Update north-south wind component Q = Q + TSTEP*DQDT ! Update specific humidity END SUBROUTINE UPDATE_DT_1 SUBROUTINE UPDATE_DT_2(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, & & NLON, NLAT, NLEV, TSTEP) IMPLICIT NONE INTEGER :: NLON, NLAT, NLEV REAL :: TSTEP REAL, DIMENSION(NLON, NLAT, NLEV) :: T, U, V, Q, DTDT, DUDT, DVDT, DQDT INTEGER :: JLON, JLAT, JLEV DO JLEV = 1, NLEV DO JLAT = 1, NLAT DO JLON = 1, NLON T(JLON, JLAT, JLEV) = T(JLON, JLAT, JLEV) + TSTEP*DTDT(JLON, JLAT, JLEV) U(JLON, JLAT, JLEV) = U(JLON, JLAT, JLEV) + TSTEP*DUDT(JLON, JLAT, JLEV) V(JLON, JLAT, JLEV) = V(JLON, JLAT, JLEV) + TSTEP*DVDT(JLON, JLAT, JLEV) Q(JLON, JLAT, JLEV) = Q(JLON, JLAT, JLEV) + TSTEP*DQDT(JLON, JLAT, JLEV) ENDDO ENDDO ENDDO END SUBROUTINE UPDATE_DT_2 END PROGRAM Test with N=100 F90-style array assignments - time= 0.390seconds. F77-style loopy assignments - time= 0.578seconds. Test with N=200 F90-style array assignments - time= 2.969seconds. F77-style loopy assignments - time= 4.765seconds. Test with N=400 F90-style array assignments - time=24.344seconds. F77-style loopy assignments - time=38.672seconds. $ ----- 8< ----- Loop fusion is only a win if you iterate through the same array variables. Writing such a pass is not so hard for the simple, most common cases. The front end could do some of the rewriting from F90-style array assignments to fused loops if it notices consecutive array assignments/operations on the same variables. Ciao! Steven ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Loop fusion. 2015-04-22 20:05 ` Steven Bosscher @ 2015-04-23 4:58 ` Toon Moene 2015-04-23 17:17 ` Richard Biener 0 siblings, 1 reply; 13+ messages in thread From: Toon Moene @ 2015-04-23 4:58 UTC (permalink / raw) To: Steven Bosscher; +Cc: gcc mailing list On 04/22/2015 09:10 PM, Steven Bosscher wrote: > On Wed, Apr 22, 2015 at 6:59 PM, Toon Moene wrote: >> Why is loop fusion important, especially in Fortran 90 and later programs ? >> >> Because without it, every array assignment is a single loop nest, isolated >> from related, same-shape assignments. > > Why is this a bad thing? When you're talking about single-node > machines, separate loops is probably faster if your arrays are large > enough: better cache locality and easier to vectorize. > Loop fusion is only a win if you iterate through the same array > variables. Writing such a pass is not so hard for the simple, most > common cases. The front end could do some of the rewriting from > F90-style array assignments to fused loops if it notices consecutive > array assignments/operations on the same variables. It could well be that my artificial example was not what my colleague measured ... Indeed, I thought about the front end doing this, but that would limit it to those that the front end could recognize; on the other hand, that might be the right limitation. Thanks ! -- Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290 Saturnushof 14, 3738 XG Maartensdijk, The Netherlands At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/ Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Loop fusion. 2015-04-23 4:58 ` Toon Moene @ 2015-04-23 17:17 ` Richard Biener 0 siblings, 0 replies; 13+ messages in thread From: Richard Biener @ 2015-04-23 17:17 UTC (permalink / raw) To: Toon Moene; +Cc: Steven Bosscher, gcc mailing list On Wed, Apr 22, 2015 at 10:05 PM, Toon Moene <toon@moene.org> wrote: > On 04/22/2015 09:10 PM, Steven Bosscher wrote: > >> On Wed, Apr 22, 2015 at 6:59 PM, Toon Moene wrote: > > >>> Why is loop fusion important, especially in Fortran 90 and later programs >>> ? >>> >>> Because without it, every array assignment is a single loop nest, >>> isolated >>> from related, same-shape assignments. >> >> >> Why is this a bad thing? When you're talking about single-node >> machines, separate loops is probably faster if your arrays are large >> enough: better cache locality and easier to vectorize. > > >> Loop fusion is only a win if you iterate through the same array >> variables. Writing such a pass is not so hard for the simple, most >> common cases. The front end could do some of the rewriting from >> F90-style array assignments to fused loops if it notices consecutive >> array assignments/operations on the same variables. > > > It could well be that my artificial example was not what my colleague > measured ... > > Indeed, I thought about the front end doing this, but that would limit it to > those that the front end could recognize; on the other hand, that might be > the right limitation. Generally loop fusion is a nice-to-have thing but as Steven points out your example isn't particularly well-suited here. In fact loop distribution (which performs fission) would undo such fusion. But it also points to infrastructure we already have - loop distribution can do the reverse transform so doing fusion there as well looks natural (given you might fuse parts of a loop with parts of another loop but keep parts separate as well - all based on data locality analysis loop distribution performs). But usually fusion requires some enablement transform(s) to make the loop iteration spaces match. Oh, and it's probably time to commit the patches I developed some years ago to make loop distribution handle loop nests... Oh, and I'm probably going to look into this and related transforms (that enablement stuff) for GCC 6. Richard. > Thanks ! > > > -- > Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290 > Saturnushof 14, 3738 XG Maartensdijk, The Netherlands > At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/ > Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2018-04-25 6:04 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-04-22 15:23 Loop fusion Toon Moene 2018-04-23 11:00 ` Bin.Cheng 2018-04-23 12:31 ` Richard Biener 2018-04-23 12:47 ` Janne Blomqvist 2018-04-23 14:11 ` Richard Biener 2018-04-23 11:02 ` Richard Biener 2018-04-24 2:22 ` Toon Moene 2018-04-24 12:58 ` Richard Biener 2018-04-25 8:06 ` Toon Moene -- strict thread matches above, loose matches on Subject: below -- 2015-04-22 19:19 Toon Moene 2015-04-22 20:05 ` Steven Bosscher 2015-04-23 4:58 ` Toon Moene 2015-04-23 17:17 ` Richard Biener
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).