From: "n8tm via fortran" <fortran@gcc.gnu.org>
To: Jerry DeLisle <jvdelisle@charter.net>,
Chris Elrod <elrodc@gmail.com>,
fortran@gcc.gnu.org
Subject: Re: Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower
Date: Sat, 21 Jul 2018 19:38:00 -0000
Message-ID: <20180721193800._xj7n1i3QneX9E4Rwc5pugxAQVfDzzkMVW9TBCs-XEA@z>
In-Reply-To: <4fb3e3ad-886e-b905-d755-1f0049dd6162@charter.net>
If the application requires a specific unroll factor, GCC's max-unroll-times parameter should be set (--param max-unroll-times=N). It does look like a bug if the default unrolling is so aggressive that it forces register spills.
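
As a rough sketch of the suggestion above, the original compile line could be extended with GCC's `--param max-unroll-times=N`. The cap of 4 and the output name `kernels_capped.s` are assumptions chosen for illustration, not values recommended anywhere in this thread; the other flags are copied from the quoted command.

```shell
# Hypothetical unroll cap; the right value depends on the kernel and register pressure.
UNROLL_CAP=4

# Flags from the quoted compile line, plus a cap on how often a single loop is unrolled.
FLAGS="-Ofast -march=skylake-avx512 -mprefer-vector-width=512 -funroll-loops --param max-unroll-times=${UNROLL_CAP}"

# Only compile when gfortran and the source file are actually present.
if command -v gfortran >/dev/null 2>&1 && [ -f kernels.f90 ]; then
    # $FLAGS is intentionally unquoted so that it splits into separate arguments.
    # kernels_capped.s is a hypothetical output name, kept apart from kernels.s.
    gfortran $FLAGS -S -shared -fPIC kernels.f90 -o kernels_capped.s
fi
```

Comparing `kernels_capped.s` against the uncapped `kernels.s` for a few cap values would show whether the spills depend on the unroll factor.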
-------- Original message --------
From: Jerry DeLisle <jvdelisle@charter.net>
Date: 7/21/18 3:14 PM (GMT-05:00)
To: Chris Elrod <elrodc@gmail.com>, fortran@gcc.gnu.org
Subject: Re: Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower
This is the gfortran list, but these optimizations are handled by the GCC
optimizers, not the compiler front end. You probably need to report this
in Bugzilla:
https://gcc.gnu.org/bugzilla/
Jerry
On 07/21/2018 05:44 AM, Chris Elrod wrote:
> Here is code:
> https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.f90
> for a 16x32 * 32x14 matrix multiplication kernel (meant for AVX-512
> processors)
>
> Compiling with:
>
> gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512
> -funroll-loops -S -shared -fPIC kernels.f90 -o kernels.s
>
> results in this assembly:
> https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.s
>
> where
> $ gfortran --version
> GNU Fortran (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
> Copyright (C) 2018 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions. There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>
>
> The manually unrolled version runs in about 135 ns, while the for loop
> takes just over 1.4 microseconds on my computer.
>
> Looking at the assembly of the manually unrolled version:
>
> There are 13 vmovapd instructions in total: 8 move from one zmm register
> to another, while 5 move from a zmm register to the stack (via %rsp), e.g.:
>
> vmovapd %zmm20, 136(%rsp)
>
> Is there a good reason for this? The source of the move is then almost
> immediately overwritten by another instruction (and there's no reason to
> have 2 copies anyway). So I'd have thought the optimal code would have 0
> such instructions.
> The only reason I can see is a restriction on where fma instructions can
> store their result. For example, perhaps they can't always store it in
> the register of the number being summed (i.e., they aren't always capable
> of doing z = x * y + z, but sometimes need to overwrite x or y instead,
> say because of an architectural restriction).
> Assuming it's better not to have them, any way to try and diagnose why
> they're generated, and avoid it?
>
> Otherwise, the assembly looks great: repeated blocks containing
> 2x vmovupd
> 7x vbroadcastsd
> 14x vfmadd231pd
>
> (For comparison, ifort produces much slower code (220 ns), with their
> assembly annotation noting lots of register spills.)
>
>
> The looped code, on the other hand, has massive piles of instructions
> just moving data around between registers, between registers and memory,
> etc.
>
> The looped version's assembly is actually over 3 times longer than the
> entirely (manually) unrolled version!
>
> I would have hoped that `-funroll-loops` or `-funroll-all-loops` would
> save the effort of unrolling manually, or that the plain loop would at
> least generate clean code.
>
> But instead, I'd need a preprocessor or some other means of generating
> kernels if I wanted to experiment with optimization.
>
> Is this a bug, expected behaviour that loops manage memory like that, or
> something that can easily be worked around another way?
>
> Thanks,
> Chris
>
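
One simple way to approach the diagnosis question in the quoted message is to count the zmm-to-stack stores in the generated assembly, so that different flag combinations can be compared by number. The sketch below assumes AT&T-syntax output as in the quoted `vmovapd %zmm20, 136(%rsp)` line; the function name `count_zmm_stack_stores` is made up for this example.

```shell
# Count vmovapd stores from a zmm register to an %rsp-relative stack slot,
# e.g. "vmovapd %zmm20, 136(%rsp)". Register-to-register moves are not counted.
count_zmm_stack_stores() {
    grep -cE 'vmovapd[[:space:]]+%zmm[0-9]+,[[:space:]]*-?[0-9]*\(%rsp\)' "$1"
}

# Report the count for kernels.s if it exists in the current directory.
if [ -f kernels.s ]; then
    echo "zmm-to-stack stores: $(count_zmm_stack_stores kernels.s)"
fi
```

This only matches the store pattern shown in the message; reloads from the stack or spills through other instructions would need their own patterns.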