Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower


From: Chris Elrod <elrodc@gmail.com>
To: fortran@gcc.gnu.org
Subject: Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower
Date: Sat, 21 Jul 2018 12:44:00 -0000
Message-ID: <CA+pTmbCyxKH7acLXgxVt+EcxSB+o1ncqvWit4ueGx_cu9moHGQ@mail.gmail.com>

Here is the code:
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.f90
for a 16x32 * 32x14 matrix multiplication kernel (meant for AVX-512
processors).
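
For readers who don't want to open the link, a looped kernel of this
shape might look roughly like the sketch below. This is not copied from
kernels.f90: the module name, the subroutine name and the loop order are
assumptions made for illustration only.

```fortran
! Hypothetical looped kernel. Names and loop order are assumptions,
! not taken from the linked kernels.f90.
module kernel_sketch
  use iso_fortran_env, only: real64
  implicit none
  integer, parameter :: M = 16, K = 32, N = 14
contains
  ! Computes C = A * B for fixed-size, column-major operands.
  subroutine mul_16x32_32x14(C, A, B)
    real(real64), intent(out) :: C(M, N)
    real(real64), intent(in)  :: A(M, K), B(K, N)
    integer :: i, j, p

    C = 0.0_real64
    do j = 1, N
      do p = 1, K
        ! Innermost index is contiguous in memory, so this is the
        ! natural loop for the compiler to vectorize.
        do i = 1, M
          C(i, j) = C(i, j) + A(i, p) * B(p, j)
        end do
      end do
    end do
  end subroutine mul_16x32_32x14
end module kernel_sketch
```

Because every extent is a compile-time constant, the compiler knows all
trip counts up front, which is why one might hope it could unroll the
loops on its own.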

Compiling with:

gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512
-funroll-loops -S -shared -fPIC kernels.f90 -o kernels.s

results in this assembly:
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.s

where
$ gfortran --version
GNU Fortran (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The manually unrolled version runs in about 135 ns, while the looped
version takes just over 1.4 microseconds on my computer.
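
How these times were measured is not shown here. As one possible way to
get a per-call figure from Fortran itself, the sketch below times many
repetitions with system_clock and reports the mean. The module and
subroutine names are hypothetical, and matmul is only a stand-in for the
kernel under test.

```fortran
! Hypothetical timing harness. matmul stands in for the real kernel.
module timing_sketch
  use iso_fortran_env, only: real64, int64
  implicit none
contains
  ! Returns the mean wall-clock time per call in nanoseconds over
  ! `reps` repetitions (reps must be positive), plus a checksum that
  ! keeps the results live.
  subroutine time_product(reps, ns_per_call, checksum)
    integer, intent(in)       :: reps
    real(real64), intent(out) :: ns_per_call, checksum
    real(real64)   :: A(16, 32), B(32, 14), C(16, 14)
    integer(int64) :: t0, t1, rate
    integer        :: r

    call random_number(A)
    call random_number(B)
    checksum = 0.0_real64

    call system_clock(t0, rate)   ! rate = clock ticks per second
    do r = 1, reps
      C = matmul(A, B)
      checksum = checksum + C(1, 1)
    end do
    call system_clock(t1)

    ns_per_call = 1.0e9_real64 * real(t1 - t0, real64) &
                  / (real(rate, real64) * real(reps, real64))
  end subroutine time_product
end module timing_sketch
```

Note that an optimizing compiler may hoist or drop the repeated call
since its inputs never change; the checksum is only a weak guard, so the
numbers from a harness like this need a sanity check against the
generated assembly.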

Looking at the assembly of the manually unrolled version:

There are 13 vmovapd instructions in total: 8 of them move from one zmm
register to another, while 5 move from a zmm register to the stack (an
offset from %rsp), e.g.:

vmovapd    %zmm20, 136(%rsp)

Is there a good reason for this? The source of the move is then almost
immediately overwritten by another instruction (and there's no reason to
have 2 copies anyway), so I'd have thought the optimal code would have 0
such instructions.
The only reason I can see is a restriction on where fma instructions can
store their result: for example, if they can't always store it in the
register of the number being summed (i.e., if they can't always do
z = x * y + z, but sometimes need to overwrite x or y instead, perhaps
because of an architectural restriction).
Assuming it's better not to have them, is there any way to diagnose why
they're generated, and to avoid them?

Otherwise, the assembly looks great: repeated blocks containing
2x vmovupd
7x vbroadcastsd
14x vfmadd231pd
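
One way to read that instruction mix, assuming each vmovupd loads eight
doubles of a column of A and each vbroadcastsd replicates one element of
B: two loads cover the 16 rows, and each of the 7 broadcast values feeds
2 FMAs, which gives the 14. In source terms that would be a rank-1
update of a 16x7 tile of C, sketched below. This is an interpretation of
the counts, not code from kernels.f90, and all names are hypothetical.

```fortran
! Hypothetical illustration of one 2-load / 7-broadcast / 14-FMA block.
module tile_update_sketch
  use iso_fortran_env, only: real64
  implicit none
contains
  ! Ctile(:, j) = Ctile(:, j) + a_col * b_row(j) for a 16x7 tile.
  subroutine rank1_update_16x7(Ctile, a_col, b_row)
    real(real64), intent(inout) :: Ctile(16, 7)
    real(real64), intent(in)    :: a_col(16)  ! 16 doubles = 2 zmm loads
    real(real64), intent(in)    :: b_row(7)   ! 7 scalars to broadcast
    integer :: j

    do j = 1, 7
      ! 16 doubles per column = 2 zmm-wide FMAs, so 14 across the tile.
      Ctile(:, j) = Ctile(:, j) + a_col * b_row(j)
    end do
  end subroutine rank1_update_16x7
end module tile_update_sketch
```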

(For comparison, ifort produces much slower code (220 ns), with their
assembly annotation noting lots of register spills.)

The looped code, on the other hand, has massive piles of instructions
just moving data around between registers, between registers and memory,
etc.

The looped version's assembly is actually over 3 times longer than the
entirely (manually) unrolled version!

I would have hoped that `-funroll-loops` or `-funroll-all-loops` would
save the effort of unrolling manually, or that the plain loop would
itself generate clean code.

But instead, I'd need a preprocessor or some other means of generating
kernels if I wanted to experiment with optimization.

Is this a bug, expected behaviour for how loops manage memory, or
something that can easily be worked around another way?

Thanks,
Chris

Thread overview: 3+ messages
2018-07-21 12:44 Chris Elrod [this message]
2018-07-21 19:14 ` Jerry DeLisle
2018-07-21 19:38   ` n8tm via fortran
