From: Chris Elrod
Date: Sat, 21 Jul 2018 12:44:00 -0000
Subject: Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower
To: fortran@gcc.gnu.org

Here is code:
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.f90
for a 16x32 * 32x14 matrix multiplication kernel
(meant for AVX-512 processors).

Compiling with:

    gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -funroll-loops -S -shared -fPIC kernels.f90 -o kernels.s

results in this assembly:
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.s

where

    $ gfortran --version
    GNU Fortran (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
    Copyright (C) 2018 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The manually unrolled version runs in about 135 ns, while the looped version takes just over 1.4 microseconds on my computer.

Looking at the assembly of the manually unrolled version: there are 13 vmovapd instructions in total. 8 of them move from one zmm register to another, while 5 move from a zmm register to the stack, e.g.:

    vmovapd %zmm20, 136(%rsp)

Is there a good reason for this? The source of the move is then almost immediately overwritten by another instruction (and there's no reason to have 2 copies anyway), so I'd have thought the optimal code would have 0 such instructions.
The only reason I can see is if there's a restriction on where fma instructions can store their result: for example, if they can't always store it in the register of the number being summed (i.e., if they're not always capable of doing z = x * y + z, but sometimes need to overwrite x or y instead, perhaps because of an architectural restriction).
Assuming it's better not to have them, is there any way to diagnose why they're generated, and to avoid them?

Otherwise, the assembly looks great: repeated blocks containing
2x vmovupd
7x vbroadcastsd
14x vfmadd231pd

(For comparison, ifort produces much slower code (220 ns), with its assembly annotation noting lots of register spills.)

The looped code, on the other hand, has massive piles of instructions just moving data around between registers, between registers and memory, etc.
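The instruction counts quoted above can be reproduced roughly with a small shell sketch like the one below. The helper names count_insn and count_stack_spills are made up for illustration, and the patterns assume GCC's AT&T-syntax output with one instruction per line. Since kernels.s holds both kernels, the counts cover the whole file unless one function's lines are extracted first.

```shell
# Count lines in an assembly file whose mnemonic is exactly $2.
# $1: assembly file (e.g. kernels.s), $2: mnemonic (e.g. vmovapd)
count_insn() {
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true" hides that.
    grep -cE "^[[:space:]]*$2[[:space:]]" "$1" || true
}

# Count vmovapd stores from a zmm register to a stack slot,
# e.g. "vmovapd %zmm20, 136(%rsp)".
# $1: assembly file
count_stack_spills() {
    grep -cE '^[[:space:]]*vmovapd[[:space:]]+%zmm[0-9]+,[[:space:]]*-?[0-9]*\(%rsp\)' "$1" || true
}
```

For example, `count_insn kernels.s vmovapd` and `count_stack_spills kernels.s` give the total number of vmovapd instructions and the number of those that store to the stack.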
The looped version's assembly is actually over 3 times longer than the entirely (manually) unrolled version!

I would have hoped that `-funroll-loops` or `-funroll-all-loops` would save the effort of unrolling manually, or that the plain loop would also generate clean code. Instead, I'd need a preprocessor or some other means of generating kernels if I wanted to experiment with optimization.

Is this a bug, expected behaviour (that loops manage memory like that), or something that can easily be worked around another way?

Thanks,
Chris