public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Loop fusion.
@ 2018-04-22 15:23 Toon Moene
  2018-04-23 11:00 ` Bin.Cheng
  2018-04-23 11:02 ` Richard Biener
  0 siblings, 2 replies; 13+ messages in thread
From: Toon Moene @ 2018-04-22 15:23 UTC (permalink / raw)
  To: gcc mailing list

A few days ago there was a rant on the Fortran Standardization 
Committee's e-mail list about Fortran's "whole array arithmetic" being 
unoptimizable.

An example picked at random from our weather forecasting code:

     ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
     ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
     ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
     ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)

The reaction from one of the members of the committee (about "their" 
compiler):

'And multiple consecutive array statements with the same shape are 
“fused” exactly so that the compiler can generate good cache use. This 
sort of optimization is pretty low hanging fruit.'

As far as I can see loop fusion as a stand-alone optimization is not 
supported as yet, although some mention is made in the context of graphite.

Is this something that should be pursued ?

Kind regards,

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Loop fusion.
  2018-04-22 15:23 Loop fusion Toon Moene
@ 2018-04-23 11:00 ` Bin.Cheng
  2018-04-23 12:31   ` Richard Biener
  2018-04-23 11:02 ` Richard Biener
  1 sibling, 1 reply; 13+ messages in thread
From: Bin.Cheng @ 2018-04-23 11:00 UTC (permalink / raw)
  To: Toon Moene; +Cc: gcc mailing list

On Sun, Apr 22, 2018 at 3:27 PM, Toon Moene <toon@moene.org> wrote:
> A few days ago there was a rant on the Fortran Standardization Committee's
> e-mail list about Fortran's "whole array arithmetic" being unoptimizable.
>
> An example picked at random from our weather forecasting code:
>
>     ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
>     ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
>     ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
>     ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)
>
> The reaction from one of the members of the committee (about "their"
> compiler):
>
> 'And multiple consecutive array statements with the same shape are “fused”
> exactly so that the compiler can generate good cache use. This sort of
> optimization is pretty low hanging fruit.'
>
> As far as I can see loop fusion as a stand-alone optimization is not
> supported as yet, although some mention is made in the context of graphite.
>
> Is this something that should be pursued ?
Hi,
I don't know the current status of fusion in graphite.  As for
traditional fusion transformation, I think it's not very difficult to
be implemented along with existing distribution, actually, quite lot
of code should be shared.  What we do need are something like: more
motivation cases, good/conservative cost model.

Thanks,
bin
>
> Kind regards,
>
> --
> Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
> Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
> At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
> Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Loop fusion.
  2018-04-22 15:23 Loop fusion Toon Moene
  2018-04-23 11:00 ` Bin.Cheng
@ 2018-04-23 11:02 ` Richard Biener
  2018-04-24  2:22   ` Toon Moene
  1 sibling, 1 reply; 13+ messages in thread
From: Richard Biener @ 2018-04-23 11:02 UTC (permalink / raw)
  To: Toon Moene; +Cc: gcc mailing list

On Sun, Apr 22, 2018 at 4:27 PM, Toon Moene <toon@moene.org> wrote:
> A few days ago there was a rant on the Fortran Standardization Committee's
> e-mail list about Fortran's "whole array arithmetic" being unoptimizable.
>
> An example picked at random from our weather forecasting code:
>
>     ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
>     ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
>     ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
>     ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)
>
> The reaction from one of the members of the committee (about "their"
> compiler):
>
> 'And multiple consecutive array statements with the same shape are “fused”
> exactly so that the compiler can generate good cache use. This sort of
> optimization is pretty low hanging fruit.'
>
> As far as I can see loop fusion as a stand-alone optimization is not
> supported as yet, although some mention is made in the context of graphite.
>
> Is this something that should be pursued ?

In principle GRAPHITE can handle loop fusion but yes, standalone fusion
is sth useful.

Note that while it looks "obvious" in the above source fragment the IL
that is presented to optimizers may make things a lot less "low-hanging".

Richard.

> Kind regards,
>
> --
> Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
> Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
> At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
> Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Loop fusion.
  2018-04-23 11:00 ` Bin.Cheng
@ 2018-04-23 12:31   ` Richard Biener
  2018-04-23 12:47     ` Janne Blomqvist
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Biener @ 2018-04-23 12:31 UTC (permalink / raw)
  To: Bin.Cheng; +Cc: Toon Moene, gcc mailing list

On Mon, Apr 23, 2018 at 12:59 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Sun, Apr 22, 2018 at 3:27 PM, Toon Moene <toon@moene.org> wrote:
>> A few days ago there was a rant on the Fortran Standardization Committee's
>> e-mail list about Fortran's "whole array arithmetic" being unoptimizable.
>>
>> An example picked at random from our weather forecasting code:
>>
>>     ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
>>     ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
>>     ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
>>     ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)
>>
>> The reaction from one of the members of the committee (about "their"
>> compiler):
>>
>> 'And multiple consecutive array statements with the same shape are “fused”
>> exactly so that the compiler can generate good cache use. This sort of
>> optimization is pretty low hanging fruit.'
>>
>> As far as I can see loop fusion as a stand-alone optimization is not
>> supported as yet, although some mention is made in the context of graphite.
>>
>> Is this something that should be pursued ?
> Hi,
> I don't know the current status of fusion in graphite.  As for
> traditional fusion transformation, I think it's not very difficult to
> be implemented along with existing distribution, actually, quite lot
> of code should be shared.  What we do need are something like: more
> motivation cases, good/conservative cost model.

Yes, I guess before distribution you want to do maximum fusion and then
apply (re-)distribution on the fused loop.  The cost model should be the
very same for distribution/fusion.

Richard.

> Thanks,
> bin
>>
>> Kind regards,
>>
>> --
>> Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
>> Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
>> At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
>> Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Loop fusion.
  2018-04-23 12:31   ` Richard Biener
@ 2018-04-23 12:47     ` Janne Blomqvist
  2018-04-23 14:11       ` Richard Biener
  0 siblings, 1 reply; 13+ messages in thread
From: Janne Blomqvist @ 2018-04-23 12:47 UTC (permalink / raw)
  To: Richard Biener; +Cc: Bin.Cheng, Toon Moene, gcc mailing list

On Mon, Apr 23, 2018 at 2:02 PM, Richard Biener <richard.guenther@gmail.com>
wrote:

> On Mon, Apr 23, 2018 at 12:59 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> > On Sun, Apr 22, 2018 at 3:27 PM, Toon Moene <toon@moene.org> wrote:
> >> A few days ago there was a rant on the Fortran Standardization
> Committee's
> >> e-mail list about Fortran's "whole array arithmetic" being
> unoptimizable.
> >>
> >> An example picked at random from our weather forecasting code:
> >>
> >>     ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
> >>     ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
> >>     ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
> >>     ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)
> >>
> >> The reaction from one of the members of the committee (about "their"
> >> compiler):
> >>
> >> 'And multiple consecutive array statements with the same shape are
> “fused”
> >> exactly so that the compiler can generate good cache use. This sort of
> >> optimization is pretty low hanging fruit.'
> >>
> >> As far as I can see loop fusion as a stand-alone optimization is not
> >> supported as yet, although some mention is made in the context of
> graphite.
> >>
> >> Is this something that should be pursued ?
> > Hi,
> > I don't know the current status of fusion in graphite.  As for
> > traditional fusion transformation, I think it's not very difficult to
> > be implemented along with existing distribution, actually, quite lot
> > of code should be shared.  What we do need are something like: more
> > motivation cases, good/conservative cost model.
>
> Yes, I guess before distribution you want to do maximum fusion and then
> apply (re-)distribution on the fused loop.  The cost model should be the
> very same for distribution/fusion.
>
> Richard.
>


I recall Fujitsu bragging that the key to them getting good application
performance (read: outside linpack) on the K computer is extensive use of
loop FISSION + software pipelining. Though I guess sw-pipelining is only
useful if you have lots of architectural registers, which disqualifies
x86-64..


-- 
Janne Blomqvist

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Loop fusion.
  2018-04-23 12:47     ` Janne Blomqvist
@ 2018-04-23 14:11       ` Richard Biener
  0 siblings, 0 replies; 13+ messages in thread
From: Richard Biener @ 2018-04-23 14:11 UTC (permalink / raw)
  To: Janne Blomqvist; +Cc: Bin.Cheng, Toon Moene, gcc mailing list

On Mon, Apr 23, 2018 at 2:31 PM, Janne Blomqvist
<blomqvist.janne@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 2:02 PM, Richard Biener <richard.guenther@gmail.com>
> wrote:
>>
>> On Mon, Apr 23, 2018 at 12:59 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> > On Sun, Apr 22, 2018 at 3:27 PM, Toon Moene <toon@moene.org> wrote:
>> >> A few days ago there was a rant on the Fortran Standardization
>> >> Committee's
>> >> e-mail list about Fortran's "whole array arithmetic" being
>> >> unoptimizable.
>> >>
>> >> An example picked at random from our weather forecasting code:
>> >>
>> >>     ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
>> >>     ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
>> >>     ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
>> >>     ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)
>> >>
>> >> The reaction from one of the members of the committee (about "their"
>> >> compiler):
>> >>
>> >> 'And multiple consecutive array statements with the same shape are
>> >> “fused”
>> >> exactly so that the compiler can generate good cache use. This sort of
>> >> optimization is pretty low hanging fruit.'
>> >>
>> >> As far as I can see loop fusion as a stand-alone optimization is not
>> >> supported as yet, although some mention is made in the context of
>> >> graphite.
>> >>
>> >> Is this something that should be pursued ?
>> > Hi,
>> > I don't know the current status of fusion in graphite.  As for
>> > traditional fusion transformation, I think it's not very difficult to
>> > be implemented along with existing distribution, actually, quite lot
>> > of code should be shared.  What we do need are something like: more
>> > motivation cases, good/conservative cost model.
>>
>> Yes, I guess before distribution you want to do maximum fusion and then
>> apply (re-)distribution on the fused loop.  The cost model should be the
>> very same for distribution/fusion.
>>
>> Richard.
>
>
>
> I recall Fujitsu bragging that the key to them getting good application
> performance (read: outside linpack) on the K computer is extensive use of
> loop FISSION + software pipelining. Though I guess sw-pipelining is only
> useful if you have lots of architectural registers, which disqualifies
> x86-64..

FISSION we can do quite well (though we lack a cost model here), that's
what loop distribution does.

Richard.

>
> --
> Janne Blomqvist

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Loop fusion.
  2018-04-23 11:02 ` Richard Biener
@ 2018-04-24  2:22   ` Toon Moene
  2018-04-24 12:58     ` Richard Biener
  0 siblings, 1 reply; 13+ messages in thread
From: Toon Moene @ 2018-04-24  2:22 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc mailing list

On 04/23/2018 01:00 PM, Richard Biener wrote:

> On Sun, Apr 22, 2018 at 4:27 PM, Toon Moene <toon@moene.org> wrote:
>> A few days ago there was a rant on the Fortran Standardization Committee's
>> e-mail list about Fortran's "whole array arithmetic" being unoptimizable.
>>
>> An example picked at random from our weather forecasting code:
>>
>>      ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
>>      ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
>>      ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
>>      ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)
>>
>> The reaction from one of the members of the committee (about "their"
>> compiler):
>>
>> 'And multiple consecutive array statements with the same shape are “fused”
>> exactly so that the compiler can generate good cache use. This sort of
>> optimization is pretty low hanging fruit.'
>>
>> As far as I can see loop fusion as a stand-alone optimization is not
>> supported as yet, although some mention is made in the context of graphite.
>>
>> Is this something that should be pursued ?
> 
> In principle GRAPHITE can handle loop fusion but yes, standalone fusion
> is sth useful.
> 
> Note that while it looks "obvious" in the above source fragment the IL
> that is presented to optimizers may make things a lot less "low-hanging".

Well, the loops are generated by the front end, so I *assume* they are 
basically the same ...

Probably the largest problem to address is the heuristic for preventing 
register pressure going through the roof ...

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Loop fusion.
  2018-04-24  2:22   ` Toon Moene
@ 2018-04-24 12:58     ` Richard Biener
  2018-04-25  8:06       ` Toon Moene
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Biener @ 2018-04-24 12:58 UTC (permalink / raw)
  To: Toon Moene; +Cc: gcc mailing list

On Mon, Apr 23, 2018 at 8:35 PM, Toon Moene <toon@moene.org> wrote:
> On 04/23/2018 01:00 PM, Richard Biener wrote:
>
>> On Sun, Apr 22, 2018 at 4:27 PM, Toon Moene <toon@moene.org> wrote:
>>>
>>> A few days ago there was a rant on the Fortran Standardization
>>> Committee's
>>> e-mail list about Fortran's "whole array arithmetic" being unoptimizable.
>>>
>>> An example picked at random from our weather forecasting code:
>>>
>>>      ZQICE(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YI%MP)
>>>      ZQLI(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YL%MP)
>>>      ZQRAIN(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YR%MP)
>>>      ZQSNOW(1:NPROMA,1:NFLEVG) = PGFL(1:NPROMA,1:NFLEVG,YS%MP)
>>>
>>> The reaction from one of the members of the committee (about "their"
>>> compiler):
>>>
>>> 'And multiple consecutive array statements with the same shape are
>>> “fused”
>>> exactly so that the compiler can generate good cache use. This sort of
>>> optimization is pretty low hanging fruit.'
>>>
>>> As far as I can see loop fusion as a stand-alone optimization is not
>>> supported as yet, although some mention is made in the context of
>>> graphite.
>>>
>>> Is this something that should be pursued ?
>>
>>
>> In principle GRAPHITE can handle loop fusion but yes, standalone fusion
>> is sth useful.
>>
>> Note that while it looks "obvious" in the above source fragment the IL
>> that is presented to optimizers may make things a lot less "low-hanging".
>
>
> Well, the loops are generated by the front end, so I *assume* they are
> basically the same ...

The issue will be boiler-plate code like duplicated loop header checks.
That said, it's perfectly understandable that other Fortran compilers have
high-level loop opts deeply embedded within their frontends...

> Probably the largest problem to address is the heuristic for preventing
> register pressure going through the roof ...

Yes.  As Bin said loop fusion and fission are closely related and should
share their cost modeling - they both have the goal to optimize the
number of input and output memory streams and their re-use within
the constraints of the CPU architecture (which includes number of
available registers).

Note that the difficulty with fusion compared to fission is that our
dependence analysis framework cannot handle the case of sibling
loops (as required by fusion).  So we have to either apply some
"tricks" to use the current framework or move over to more capable
dependence testing infrastructure like that of ISL.

Richard.

>
> --
> Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
> Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
> At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
> Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Loop fusion.
  2018-04-24 12:58     ` Richard Biener
@ 2018-04-25  8:06       ` Toon Moene
  0 siblings, 0 replies; 13+ messages in thread
From: Toon Moene @ 2018-04-25  8:06 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc mailing list

On 04/24/2018 09:18 AM, Richard Biener wrote:

> On Mon, Apr 23, 2018 at 8:35 PM, Toon Moene <toon@moene.org> wrote:

>> On 04/23/2018 01:00 PM, Richard Biener wrote:

>>> Note that while it looks "obvious" in the above source fragment the IL
>>> that is presented to optimizers may make things a lot less "low-hanging".
>>
>>
>> Well, the loops are generated by the front end, so I *assume* they are
>> basically the same ...
> 
> The issue will be boiler-plate code like duplicated loop header checks.
> That said, it's perfectly understandable that other Fortran compilers have
> high-level loop opts deeply embedded within their frontends...

I agree that this would be more easily handled in the Fortran front end. 
However, for that it would first have to get a (high level) basic block 
finder, because it has to be established that consecutive array 
expressions are part of the same basic block.

I discussed high (i.e., Fortran-) level basic blocks briefly in my 2007 
GCC Summit talk (http://moene.org/~toon/GCCSummit-2007.pdf, paragraph 
7), but I do not think anyone really worked on it.

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Loop fusion.
  2015-04-23  4:58   ` Toon Moene
@ 2015-04-23 17:17     ` Richard Biener
  0 siblings, 0 replies; 13+ messages in thread
From: Richard Biener @ 2015-04-23 17:17 UTC (permalink / raw)
  To: Toon Moene; +Cc: Steven Bosscher, gcc mailing list

On Wed, Apr 22, 2015 at 10:05 PM, Toon Moene <toon@moene.org> wrote:
> On 04/22/2015 09:10 PM, Steven Bosscher wrote:
>
>> On Wed, Apr 22, 2015 at 6:59 PM, Toon Moene wrote:
>
>
>>> Why is loop fusion important, especially in Fortran 90 and later programs
>>> ?
>>>
>>> Because without it, every array assignment is a single loop nest,
>>> isolated
>>> from related, same-shape assignments.
>>
>>
>> Why is this a bad thing? When you're talking about single-node
>> machines, separate loops is probably faster if your arrays are large
>> enough: better cache locality and easier to vectorize.
>
>
>> Loop fusion is only a win if you iterate through the same array
>> variables. Writing such a pass is not so hard for the simple, most
>> common cases. The front end could do some of the rewriting from
>> F90-style array assignments to fused loops if it notices consecutive
>> array assignments/operations on the same variables.
>
>
> It could well be that my artificial example was not what my colleague
> measured ...
>
> Indeed, I thought about the front end doing this, but that would limit it to
> those that the front end could recognize; on the other hand, that might be
> the right limitation.

Generally loop fusion is a nice-to-have thing but as Steven points out
your example isn't particularly well-suited here.  In fact loop distribution
(which performs fission) would undo such fusion.

But it also points to infrastructure we already have - loop distribution
can do the reverse transform so doing fusion there as well looks
natural (given you might fuse parts of a loop with parts of another
loop but keep parts separate as well - all based on data locality
analysis loop distribution performs).

But usually fusion requires some enablement transform(s) to make
the loop iteration spaces match.

Oh, and it's probably time to commit the patches I developed some
years ago to make loop distribution handle loop nests...

Oh, and I'm probably going to look into this and related transforms
(that enablement stuff) for GCC 6.

Richard.

> Thanks !
>
>
> --
> Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
> Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
> At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
> Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Loop fusion.
  2015-04-22 20:05 ` Steven Bosscher
@ 2015-04-23  4:58   ` Toon Moene
  2015-04-23 17:17     ` Richard Biener
  0 siblings, 1 reply; 13+ messages in thread
From: Toon Moene @ 2015-04-23  4:58 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: gcc mailing list

On 04/22/2015 09:10 PM, Steven Bosscher wrote:

> On Wed, Apr 22, 2015 at 6:59 PM, Toon Moene wrote:

>> Why is loop fusion important, especially in Fortran 90 and later programs ?
>>
>> Because without it, every array assignment is a single loop nest, isolated
>> from related, same-shape assignments.
>
> Why is this a bad thing? When you're talking about single-node
> machines, separate loops is probably faster if your arrays are large
> enough: better cache locality and easier to vectorize.

> Loop fusion is only a win if you iterate through the same array
> variables. Writing such a pass is not so hard for the simple, most
> common cases. The front end could do some of the rewriting from
> F90-style array assignments to fused loops if it notices consecutive
> array assignments/operations on the same variables.

It could well be that my artificial example was not what my colleague 
measured ...

Indeed, I thought about the front end doing this, but that would limit 
it to those that the front end could recognize; on the other hand, that 
might be the right limitation.

Thanks !

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Loop fusion.
  2015-04-22 19:19 Toon Moene
@ 2015-04-22 20:05 ` Steven Bosscher
  2015-04-23  4:58   ` Toon Moene
  0 siblings, 1 reply; 13+ messages in thread
From: Steven Bosscher @ 2015-04-22 20:05 UTC (permalink / raw)
  To: Toon Moene; +Cc: gcc mailing list

On Wed, Apr 22, 2015 at 6:59 PM, Toon Moene wrote:
> Why is loop fusion important, especially in Fortran 90 and later programs ?
>
> Because without it, every array assignment is a single loop nest, isolated
> from related, same-shape assignments.

Why is this a bad thing? When you're talking about single-node
machines, separate loops is probably faster if your arrays are large
enough: better cache locality and easier to vectorize.


----- 8< -----
$ cat test.f90; gfortran.exe -O2 test.f90 ; ./a.exe
        PROGRAM TEST_FUSION
        IMPLICIT NONE
        REAL, PARAMETER :: TSTEP = 0.01

        CALL ONE_TEST(100)
        CALL ONE_TEST(200)
        CALL ONE_TEST(400)
        STOP


        CONTAINS

        SUBROUTINE ONE_TEST(N)
        IMPLICIT NONE
        INTEGER :: N, I
        REAL, DIMENSION(:,:,:), ALLOCATABLE :: T, U, V, Q
        REAL, DIMENSION(:,:,:), ALLOCATABLE :: DTDT, DUDT, DVDT, DQDT
        REAL :: START, FINISH

        PRINT '("Test with N=",I3)', N
        ALLOCATE (T(N,N,N), U(N,N,N), V(N,N,N), Q(N,N,N))
        ALLOCATE (DTDT(N,N,N), DUDT(N,N,N), DVDT(N,N,N), DQDT(N,N,N))
        !
        CALL CPU_TIME(START)
        DO I=1,100
          CALL UPDATE_DT_1(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, N, N, N, TSTEP)
        END DO
        CALL CPU_TIME(FINISH)
        PRINT '("F90-style array assignments -
time=",f6.3,"seconds.")', (FINISH - START)
        !
        CALL CPU_TIME(START)
        DO I=1,100
          CALL UPDATE_DT_2(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, N, N, N, TSTEP)
        END DO
        CALL CPU_TIME(FINISH)
        PRINT '("F77-style loopy assignments -
time=",f6.3,"seconds.")', (FINISH - START)
        !
        !
        DEALLOCATE (T, U, V, Q)
        DEALLOCATE (DTDT, DUDT, DVDT, DQDT)
        PRINT *
        END SUBROUTINE ONE_TEST


        SUBROUTINE UPDATE_DT_1(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, &
                               & NLON, NLAT, NLEV, TSTEP)
        IMPLICIT NONE
        INTEGER :: NLON, NLAT, NLEV
        REAL :: TSTEP
        REAL, DIMENSION(NLON, NLAT, NLEV) :: T, U, V, Q, DTDT, DUDT, DVDT, DQDT
        T = T + TSTEP*DTDT ! Update temperature
        U = U + TSTEP*DUDT ! Update east-west wind component
        V = V + TSTEP*DVDT ! Update north-south wind component
        Q = Q + TSTEP*DQDT ! Update specific humidity
        END SUBROUTINE UPDATE_DT_1

        SUBROUTINE UPDATE_DT_2(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, &
                               & NLON, NLAT, NLEV, TSTEP)
        IMPLICIT NONE
        INTEGER :: NLON, NLAT, NLEV
        REAL :: TSTEP
        REAL, DIMENSION(NLON, NLAT, NLEV) :: T, U, V, Q, DTDT, DUDT, DVDT, DQDT
        INTEGER :: JLON, JLAT, JLEV
        DO JLEV = 1, NLEV
          DO JLAT = 1, NLAT
            DO JLON = 1, NLON
              T(JLON, JLAT, JLEV) = T(JLON, JLAT, JLEV) +
TSTEP*DTDT(JLON, JLAT, JLEV)
              U(JLON, JLAT, JLEV) = U(JLON, JLAT, JLEV) +
TSTEP*DUDT(JLON, JLAT, JLEV)
              V(JLON, JLAT, JLEV) = V(JLON, JLAT, JLEV) +
TSTEP*DVDT(JLON, JLAT, JLEV)
              Q(JLON, JLAT, JLEV) = Q(JLON, JLAT, JLEV) +
TSTEP*DQDT(JLON, JLAT, JLEV)
            ENDDO
          ENDDO
        ENDDO
        END SUBROUTINE UPDATE_DT_2

        END PROGRAM
Test with N=100
F90-style array assignments - time= 0.390seconds.
F77-style loopy assignments - time= 0.578seconds.

Test with N=200
F90-style array assignments - time= 2.969seconds.
F77-style loopy assignments - time= 4.765seconds.

Test with N=400
F90-style array assignments - time=24.344seconds.
F77-style loopy assignments - time=38.672seconds.


$
----- 8< -----

Loop fusion is only a win if you iterate through the same array
variables. Writing such a pass is not so hard for the simple, most
common cases. The front end could do some of the rewriting from
F90-style array assignments to fused loops if it notices consecutive
array assignments/operations on the same variables.

Ciao!
Steven

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Loop fusion.
@ 2015-04-22 19:19 Toon Moene
  2015-04-22 20:05 ` Steven Bosscher
  0 siblings, 1 reply; 13+ messages in thread
From: Toon Moene @ 2015-04-22 19:19 UTC (permalink / raw)
  To: gcc mailing list

L.S.,

Last week, a colleague of mine from Meteo France held a talk at the 
yearly meeting of all researchers working on HARMONIE (see 
http://hirlam.org) discussing the performance of our code when compiled 
with each of the supported compilers on the Cray XC30 at ECMWF 
(http://www.ecmwf.int/en/computing/our-facilities).

In the context of GCC this is relevant, because one of the three 
compilers is gfortran (version 4.9.2).

One of his slides discussed the differences in optimizations that the 
three compilers offer; I was surprised to learn that GCC/gfortran 
doesn't do loop fusion *at all*. Note, I discussed loop fusion (among 
other optimizations) at LinuxExpo 99 (http://moene.org/~toon/nwp.ps) 
which, unsurprisingly, was held 16 years ago :-)

Why is loop fusion important, especially in Fortran 90 and later programs ?

Because without it, every array assignment is a single loop nest, 
isolated from related, same-shape assignments.

Consider this (artificial, but typical) example [updating atmospheric 
quantities after the computation of the rate of change during a time 
step of the integration]:

SUBROUTINE UPDATE_DT(T, U, V, Q, DTDT, DUDT, DVDT, DQDT, &
    & NLON, NLAT, NLEV, TSTEP)
...
REAL, DIMENSION(NLON, NLAT, NLEV) :: T, U, V, Q, DTDT, DUDT, DVDT, DQDT
...
T = T + TSTEP*DTDT ! Update temperature
U = U + TSTEP*DUDT ! Update east-west wind component
V = V + TSTEP*DVDT ! Update north-south wind component
Q = Q + TSTEP*DQDT ! Update specific humidity
...
END

This generates four consecutive 3 deep loop nests over NLEV, NLAT, NLON.
Of course, it would be much more efficient if this were just one loop 
nest, as Fortran 77 programmers would write it:

DO JLEV = 1, NLEV
   DO JLAT = 1, NLAT
     DO JLON = 1, NLON
       T(JLON, JLAT, JLEV) = T(JLON, JLAT, JLEV) + TSTEP*DTDT(JLON, 
JLAT, JLEV)
       U(JLON, JLAT, JLEV) = U(JLON, JLAT, JLEV) + TSTEP*DUDT(JLON, 
JLAT, JLEV)
       V(JLON, JLAT, JLEV) = V(JLON, JLAT, JLEV) + TSTEP*DVDT(JLON, 
JLAT, JLEV)
       Q(JLON, JLAT, JLEV) = Q(JLON, JLAT, JLEV) + TSTEP*DQDT(JLON, 
JLAT, JLEV)
     ENDDO
   ENDDO
ENDDO

After a loop fusion optimization pass the Fortran 90 and the Fortran 77 
code should result in the same assembler output.

Is this something the Graphite infrastructure could help with ? From the 
wiki documentation I get the impression that it only works on single 
loop nests, but I must confess that I am not familiar with the 
nomenclature in its description ...

Would it be hard to write a loop fusion pass otherwise ?

Kind regards,

-- 
Toon Moene - e-mail: toon@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2018-04-25  6:04 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-22 15:23 Loop fusion Toon Moene
2018-04-23 11:00 ` Bin.Cheng
2018-04-23 12:31   ` Richard Biener
2018-04-23 12:47     ` Janne Blomqvist
2018-04-23 14:11       ` Richard Biener
2018-04-23 11:02 ` Richard Biener
2018-04-24  2:22   ` Toon Moene
2018-04-24 12:58     ` Richard Biener
2018-04-25  8:06       ` Toon Moene
  -- strict thread matches above, loose matches on Subject: below --
2015-04-22 19:19 Toon Moene
2015-04-22 20:05 ` Steven Bosscher
2015-04-23  4:58   ` Toon Moene
2015-04-23 17:17     ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).