Re: Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower

From: "n8tm via fortran" <fortran@gcc.gnu.org>
To: Jerry DeLisle <jvdelisle@charter.net>,
	Chris Elrod <elrodc@gmail.com>,
	fortran@gcc.gnu.org
Subject: Re: Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower
Date: Sat, 21 Jul 2018 19:38:00 -0000	[thread overview]
Message-ID: <20180721193800._xj7n1i3QneX9E4Rwc5pugxAQVfDzzkMVW9TBCs-XEA@z> (raw)
In-Reply-To: <4fb3e3ad-886e-b905-d755-1f0049dd6162@charter.net>

If the application requires a specific unroll factor, the max-unroll-times parameter (--param max-unroll-times=N) should be set. It does look like a bug if the default unrolling is so aggressive that it forces register spills.
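
A minimal sketch of what setting the factor could look like, written in C rather than Fortran because the unroller sits in GCC's language-independent middle end. The --param max-unroll-times=N option caps how many times a single loop may be unrolled, and #pragma GCC unroll N (GCC 8 and later) requests a factor for one loop. The function name dot and the factor 4 are illustrative assumptions, not taken from kernels.f90.

```c
#include <stddef.h>

/* Hypothetical example loop; not taken from kernels.f90.
 * One possible build line, capping the unroller at 4 copies per loop:
 *   gcc -O2 -funroll-loops --param max-unroll-times=4 -c dot.c
 */
double dot(const double *x, const double *y, size_t n)
{
    double acc = 0.0;
    /* Per-loop request (GCC 8+): ask for this loop to be unrolled 4 times. */
#pragma GCC unroll 4
    for (size_t i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}
```

Comparing the generated assembly (-S) for a few values of N is probably the quickest way to see where spills start to appear.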

-------- Original message --------
From: Jerry DeLisle <jvdelisle@charter.net>
Date: 7/21/18 3:14 PM (GMT-05:00)
To: Chris Elrod <elrodc@gmail.com>, fortran@gcc.gnu.org
Subject: Re: Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower

This is the gfortran list, but these optimizations are handled by the gcc
optimizers and not the compiler front-end. You probably need to post this
to bugzilla here:

https://gcc.gnu.org/bugzilla/

Jerry


On 07/21/2018 05:44 AM, Chris Elrod wrote:
> Here is code:
> https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.f90
> for a 16x32 * 32x14 matrix multiplication kernel (meant for avx-512
> processors)
> 
> Compiling with:
> 
> gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512
> -funroll-loops -S -shared -fPIC kernels.f90 -o kernels.s
> 
> results in this assembly:
> https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.s
> 
> where
> $ gfortran --version
> GNU Fortran (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
> Copyright (C) 2018 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> 
> 
> The manually unrolled version runs in about 135 ns, while the for loop
> takes just over 1.4 microseconds on my computer.
> 
> Looking at the assembly of the manually unrolled version:
> 
> There are 13 vmovapd instructions in total: 8 of them move from one zmm
> register to another, while 5 move from a zmm register to %rsp, e.g.:
> 
> vmovapd    %zmm20, 136(%rsp)
> 
> Is there a good reason for this? The source of the move is then almost
> immediately overwritten by another instruction (and there's no reason to
> have 2 copies anyway). So I'd have thought the optimal code would have 0
> such instructions.
> The only reason I can see is if there's a restriction on where fma
> instructions can store their result. For example, if they can't always
> store it in the register of the number being summed (i.e., if they're not
> capable of always doing z = x * y + z, but sometimes need to overwrite x
> or y instead for some reason, like an architectural restriction?)
> Assuming it's better not to have them, any way to try and diagnose why
> they're generated, and avoid it?
> 
> Otherwise, the assembly looks great: repeated blocks containing
> 2x vmovupd
> 7x vbroadcastsd
> 14x vfmadd231pd
> 
> (For comparison, ifort produces much slower code (220 ns), with their
> assembly annotation noting lots of register spills.)
> 
> 
> The looped code, on the other hand, has massive piles of instructions
> just moving data around between registers, between registers and memory,
> etc.
> 
> The looped version's assembly is actually over 3 times longer than the
> entirely (manually) unrolled version!
> 
> I would have hoped that `-funroll-loops` or `-funroll-all-loops` would
> save me the effort of unrolling manually, or at least that the plain
> loop would generate clean code.
> 
> But instead, I'd need a preprocessor or some other means of generating
> kernels if I wanted to experiment with optimization.
> 
> Is this a bug, expected behaviour that loops manage memory like that, or
> something that can easily be worked around another way?
> 
> Thanks,
> Chris
>
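
For readers without the linked source open, here is a hedged scalar sketch in C of the kind of kernel being discussed. It is an assumption about the general shape, not the code in kernels.f90, and the name kernel_16x32x14 is hypothetical. Each inner update has the accumulate-in-place form z = x * y + z, which is the form vfmadd231pd computes (destination = source * source + destination). With 16 rows of doubles (two 512-bit vectors per column) and 14 columns, a fully register-resident result would need 28 of the 32 zmm registers as accumulators, so register pressure is tight.

```c
enum { M = 16, K = 32, N = 14 };

/* C(M x N) = A(M x K) * B(K x N), all stored column-major as in Fortran.
 * Hypothetical scalar sketch; not the code in kernels.f90. */
void kernel_16x32x14(const double *A, const double *B, double *C)
{
    /* Zero the 16x14 result. */
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < M; ++i)
            C[i + M * j] = 0.0;

    for (int k = 0; k < K; ++k) {            /* one column of A per step */
        for (int j = 0; j < N; ++j) {
            /* Scalar that a vectorized version would broadcast. */
            double b = B[k + K * j];
            /* 16 doubles per column = two 512-bit vectors. */
            for (int i = 0; i < M; ++i)
                /* z = x * y + z: accumulator is both input and output. */
                C[i + M * j] += A[i + M * k] * b;
        }
    }
}
```

Whether the compiler keeps all 28 accumulators in registers for the looped form is exactly what the assembly comparison in the message is about; this sketch only fixes the arithmetic being performed.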
