Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower

public inbox for fortran@gcc.gnu.org

* Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower
From: Chris Elrod @ 2018-07-21 12:44 UTC
  To: fortran

Here is the code:
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.f90
for a 16x32 * 32x14 matrix multiplication kernel (meant for AVX-512
processors).

Compiling with:

gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512
-funroll-loops -S -shared -fPIC kernels.f90 -o kernels.s

results in this assembly:
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.s

where
$ gfortran --version
GNU Fortran (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The manually unrolled version runs in about 135 ns, while the loop version
takes just over 1.4 microseconds on my computer.

Looking at the assembly of the manually unrolled version:

There are 13 vmovapd instructions in total: 8 move from one zmm register
to another, while 5 store a zmm register to the stack via %rsp, e.g.:

vmovapd    %zmm20, 136(%rsp)

Is there a good reason for this? The source of the move is then almost
immediately overwritten by another instruction (and there's no reason to
have two copies anyway), so I'd have thought the optimal code would have no
such instructions.

The only reason I can see is a restriction on where FMA instructions can
store their result: for example, if they can't always write to the register
holding the number being summed (i.e., if they can't always compute
z = x * y + z, but sometimes need to overwrite x or y instead, perhaps
because of an architectural restriction).

Assuming it's better not to have them, is there any way to diagnose why
they're generated, and to avoid them?
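
Counts like the ones above can be reproduced from an assembly listing with a few grep patterns. The following is only a sketch: the helper name classify_vmovapd is hypothetical, and the patterns assume the AT&T operand order shown in the vmovapd line quoted above (source first, destination second).

```shell
# classify_vmovapd FILE
# Hypothetical helper: count vmovapd instructions in an AT&T-syntax listing,
# split into zmm-to-zmm copies and stores to a stack slot addressed via %rsp.
classify_vmovapd() {
    file=$1
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    total=$(grep -cE '^[[:space:]]*vmovapd[[:space:]]' "$file" || true)
    reg_to_reg=$(grep -cE '^[[:space:]]*vmovapd[[:space:]]+%zmm[0-9]+,[[:space:]]*%zmm[0-9]+' "$file" || true)
    to_stack=$(grep -cE '^[[:space:]]*vmovapd[[:space:]]+%zmm[0-9]+,[[:space:]]*-?[0-9]*\(%rsp\)' "$file" || true)
    echo "total=$total reg_to_reg=$reg_to_reg to_stack=$to_stack"
}

# Usage (file name taken from the compile command above):
#   classify_vmovapd kernels.s
```

Since kernels.s appears to contain both the looped and the manually unrolled kernels, the listing would need to be cut down to the function of interest before the counts are comparable.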

Otherwise, the assembly looks great: repeated blocks containing
2x vmovupd
7x vbroadcastsd
14x vfmadd231pd
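
As a rough cross-check of that block structure, the arithmetic below estimates how many such blocks a fully unrolled kernel would contain. It is a sketch under stated assumptions, not something taken from the linked source: it assumes double precision (the "pd" suffix), 8 doubles per 512-bit zmm register, and that every multiply-add in the kernel sits in one of these 14-FMA blocks.

```shell
# Kernel shape from the post: (16x32) * (32x14).
M=16; K=32; N=14
LANES=8                                  # doubles per zmm register (512 / 64)

fma_total=$(( M * K * N / LANES ))       # vector FMAs needed in total
blocks=$(( fma_total / 14 ))             # blocks of 14 vfmadd231pd each
broadcasts=$(( blocks * 7 ))             # 7 vbroadcastsd per block
loads=$(( blocks * 2 ))                  # 2 vmovupd loads per block

echo "vfmadd231pd=$fma_total blocks=$blocks vbroadcastsd=$broadcasts vmovupd_in_blocks=$loads"
# prints: vfmadd231pd=896 blocks=64 vbroadcastsd=448 vmovupd_in_blocks=128
```

The vmovupd figure covers only the loads inside the blocks; any vmovupd used to store the result matrix would come on top of it.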

(For comparison, ifort produces much slower code (220 ns), with its
assembly annotation noting lots of register spills.)

The loop version, on the other hand, has massive piles of instructions
just moving data around between registers, between registers and memory,
etc.

The looped version's assembly is actually over 3 times longer than the
entirely (manually) unrolled version!

I had hoped that `-funroll-loops` or `-funroll-all-loops` would save the
effort of unrolling manually, or that the plain loop would also generate
clean code.

But instead, I'd need a preprocessor or some other means of generating
kernels if I wanted to experiment with optimization.

Is this a bug, expected behaviour that loops manage memory like that, or
something that can easily be worked around another way?

Thanks,
Chris


* Re: Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower
From: Jerry DeLisle @ 2018-07-21 19:14 UTC
  To: Chris Elrod, fortran

This is the gfortran list, but these optimizations are handled by the GCC
optimizers rather than the compiler front end. You probably need to file a
report in Bugzilla here:

https://gcc.gnu.org/bugzilla/

Jerry



* Re: Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower
From: n8tm via fortran @ 2018-07-21 19:38 UTC
  To: Jerry DeLisle, Chris Elrod, fortran

If the application requires a specific unroll factor, the max-unroll-times parameter should be set. It does look like a bug if the default unrolling is so aggressive as to force spills.
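
The unroll limit mentioned here presumably corresponds to GCC's `--param max-unroll-times=N`, which caps how many times a single loop is unrolled. A minimal sketch of trying a few values with the flags from the original post follows; the helper name try_unroll, the output file names, and the particular values of N are assumptions for illustration only.

```shell
# try_unroll N
# Hypothetical helper: build the original compile command with unrolling
# capped at N, print it, and run it only if gfortran and kernels.f90 exist.
try_unroll() {
    n=$1
    cmd="gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -funroll-loops --param max-unroll-times=$n -S -shared -fPIC kernels.f90 -o kernels_unroll$n.s"
    echo "$cmd"
    if command -v gfortran >/dev/null 2>&1 && [ -f kernels.f90 ]; then
        $cmd
    fi
}

# Usage: compare the generated assembly for a few caps, e.g.
#   for n in 2 4 8; do try_unroll "$n"; done
```

Whether any particular cap avoids the spills described above would have to be checked in the resulting assembly.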


