From mboxrd@z Thu Jan  1 00:00:00 1970
List-Id: <fortran.gcc.gnu.org>
MIME-Version: 1.0
From: Chris Elrod <elrodc@gmail.com>
Date: Sat, 21 Jul 2018 12:44:00 -0000
Message-ID: <CA+pTmbCyxKH7acLXgxVt+EcxSB+o1ncqvWit4ueGx_cu9moHGQ@mail.gmail.com>
Subject: Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower
To: fortran@gcc.gnu.org
Content-Type: text/plain; charset="UTF-8"

Here is code:
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.f90
for a 16x32 * 32x14 matrix multiplication kernel (meant for AVX-512
processors).
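
For readers who do not want to follow the link, the looped variant is
roughly of the following shape. This is a minimal sketch under stated
assumptions, not the contents of kernels.f90: the names kernel_sketch,
kernel_loop_sketch, A, B and C, the loop order, and the small self-check
program are all hypothetical and only illustrate the 16x32 * 32x14 problem
size.

```fortran
! Hypothetical stand-in for the looped kernel; the real source is kernels.f90.
module kernel_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: M = 16, K = 32, N = 14
contains
  subroutine kernel_loop_sketch(C, A, B)
    real(dp), intent(out) :: C(M, N)
    real(dp), intent(in)  :: A(M, K), B(K, N)
    integer :: j, p
    C = 0.0_dp
    do j = 1, N
      do p = 1, K
        ! A column of A scaled by one element of B, accumulated into a column of C.
        C(:, j) = C(:, j) + A(:, p) * B(p, j)
      end do
    end do
  end subroutine kernel_loop_sketch
end module kernel_sketch

program check_kernel
  use kernel_sketch
  implicit none
  real(dp) :: A(M, K), B(K, N), C(M, N)
  call random_number(A)
  call random_number(B)
  call kernel_loop_sketch(C, A, B)
  ! Compare with the intrinsic; the tolerance is loose because -Ofast may reorder sums.
  if (maxval(abs(C - matmul(A, B))) > 1.0e-10_dp) error stop "mismatch"
  print *, "ok"
end program check_kernel
```

The manually unrolled variant computes the same product with the loops
written out by hand.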

Compiling with:

gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512
-funroll-loops -S -shared -fPIC kernels.f90 -o kernels.s

results in this assembly:
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.s

where
$ gfortran --version
GNU Fortran (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


The manually unrolled version runs in about 135 ns, while the looped
version takes just over 1.4 microseconds on my computer.

Looking at the assembly of the manually unrolled version:

There are 13 vmovapd instructions in total: 8 of them move from one zmm
register to another, while 5 move from a zmm register to the stack
(addressed relative to %rsp), e.g.:

vmovapd    %zmm20, 136(%rsp)

Is there a good reason for this? The source of the move is then almost
immediately overwritten by another instruction (and there's no reason to
have 2 copies anyway). So I'd have thought the optimal code would have 0
such instructions.
The only reason I can see is a restriction on where FMA instructions can
store their result: for example, if they can't always write to the register
holding the accumulator (i.e., if they can't always do z = x * y + z, but
sometimes have to overwrite x or y instead, perhaps because of an
architectural restriction).
Assuming it's better not to have these moves, is there any way to diagnose
why they're generated, and to avoid them?

Otherwise, the assembly looks great: repeated blocks containing
2x vmovupd
7x vbroadcastsd
14x vfmadd231pd
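
One plausible reading of that instruction mix, stated as an assumption
rather than something checked against the source: each block handles one
column of A for 7 columns of the result. A 16-element double precision
column is two 512-bit vectors (2 loads), each of the 7 elements of B is
broadcast once (7 broadcasts), and each broadcast feeds two accumulators
(7 x 2 = 14 FMAs). The sketch below writes that step out in Fortran; the
names a_col, c1 to c7 and NB are hypothetical.

```fortran
! Hypothetical sketch of one unrolled step; not taken from kernels.f90.
program unrolled_step_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: M = 16, K = 32, NB = 7
  real(dp) :: A(M, K), B(K, NB), Cref(M, NB)
  real(dp) :: a_col(M), c1(M), c2(M), c3(M), c4(M), c5(M), c6(M), c7(M)
  integer :: p

  call random_number(A)
  call random_number(B)
  c1 = 0; c2 = 0; c3 = 0; c4 = 0; c5 = 0; c6 = 0; c7 = 0

  do p = 1, K
    a_col = A(:, p)              ! 16 doubles: two 512-bit loads
    c1 = c1 + a_col * B(p, 1)    ! one broadcast of B(p, j), two FMAs per line
    c2 = c2 + a_col * B(p, 2)
    c3 = c3 + a_col * B(p, 3)
    c4 = c4 + a_col * B(p, 4)
    c5 = c5 + a_col * B(p, 5)
    c6 = c6 + a_col * B(p, 6)
    c7 = c7 + a_col * B(p, 7)
  end do

  ! Check the seven accumulated columns against the intrinsic.
  Cref = matmul(A, B)
  if (maxval(abs(c1 - Cref(:, 1))) > 1.0e-10_dp) error stop "mismatch in column 1"
  if (maxval(abs(c4 - Cref(:, 4))) > 1.0e-10_dp) error stop "mismatch in column 4"
  if (maxval(abs(c7 - Cref(:, 7))) > 1.0e-10_dp) error stop "mismatch in column 7"
  print *, "ok"
end program unrolled_step_sketch
```

Under this reading, the 7 result columns occupy 14 zmm accumulators, which
together with the two vectors of A and a broadcast register fits within the
32 zmm registers of AVX-512.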

(For comparison, ifort produces slower code (220 ns), with its assembly
annotation noting lots of register spills.)


The looped code, on the other hand, has massive piles of instructions
that just move data around between registers, and between registers and
memory.

The looped version's assembly is actually over 3 times longer than the
entirely (manually) unrolled version!

I had hoped that `-funroll-loops` or `-funroll-all-loops` would save the
effort of unrolling manually, or at least that the plain loop would
generate clean code.

As it stands, I'd need a preprocessor or some other means of generating
kernels if I want to experiment with optimization.

Is this a bug, is it expected behaviour for loops to manage memory like
that, or is there an easy way to work around it?

Thanks,
Chris