Subject: Re: autovectorization in gcc
From: "Kay F. Jahnke"
To: Kyrill Tkachov, gcc@gcc.gnu.org
Date: Wed, 09 Jan 2019 10:56:00 -0000
In-Reply-To: <5C35C2C2.1050106@foss.arm.com>

On 09.01.19 10:45, Kyrill Tkachov wrote:

> There's plenty of work being done on auto-vectorisation in GCC.
> Auto-vectorisation is a performance optimisation and as such is not
> really a user-visible feature that absolutely requires user
> documentation.

Since I'm trying to exploit it deliberately, a more user-visible guise
would help ;)

>> - repeated use of vectorizable functions
>>
>> for ( int i = 0 ; i < vsz ; i++ )
>>    A [ i ] = sqrt ( B [ i ] ) ;
>>
>> Here, replacing the repeated call of sqrt with the vectorized
>> equivalent gives a dramatic speedup (ca. 4X)

The above is a typical example. To give a complete source,
'vec_sqrt.cc':

#include <cmath>

extern float data [ 32768 ] ;

extern void vf1()
{
  #pragma vectorize enable
  for ( int i = 0 ; i < 32768 ; i++ )
    data [ i ] = std::sqrt ( data [ i ] ) ;
}

The trip count is large and the loop is trivial, so it's an ideal
candidate for autovectorization.
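As an aside, and not something I have benchmarked for the numbers
below: to request vectorization explicitly rather than rely on the
autovectorizer's own judgement, the same loop can also be annotated
with the OpenMP simd directive, which GCC honours when compiling with
-fopenmp-simd. A minimal sketch, with the function renamed to
vf1_simd purely for illustration:

#include <cmath>

extern float data [ 32768 ] ;

// same trivial loop as vf1, but explicitly marked as a SIMD loop;
// build with -fopenmp-simd (or -fopenmp) so the pragma is honoured
extern void vf1_simd()
{
  #pragma omp simd
  for ( int i = 0 ; i < 32768 ; i++ )
    data [ i ] = std::sqrt ( data [ i ] ) ;
}

All measurements below use the plain vec_sqrt.cc version, not this
variant.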
When I compile this source with

  g++ -O3 -mavx2 -S -o sqrt.s sqrt_gcc.cc

the inner loop translates to:

.L2:
  vmovss   (%rbx), %xmm0
  vucomiss %xmm0, %xmm2
  vsqrtss  %xmm0, %xmm1, %xmm1
  jbe      .L3
  vmovss   %xmm2, 12(%rsp)
  addq     $4, %rbx
  vmovss   %xmm1, 8(%rsp)
  call     sqrtf@PLT
  vmovss   8(%rsp), %xmm1
  vmovss   %xmm1, -4(%rbx)
  cmpq     %rbp, %rbx
  vmovss   12(%rsp), %xmm2
  jne      .L2

AFAICT this is not vectorized; it processes only a single float at a
time. In vector code, I'd expect the vsqrtps mnemonic to show up.

> I believe GCC will do some of that already given a high-enough
> optimisation level and floating-point constraints.
> Do you have examples where it doesn't? Testcases with self-contained
> source code and compiler flags would be useful to analyse.

See above. With -Ofast the output is similar, only the inner loop is
unrolled. But maybe I'm missing something? Any hints for additional
flags?

>> If the compiler were to provide the autovectorization facilities,
>> and if the patterns it recognizes were well-documented, users could
>> rely on certain code patterns being recognized and autovectorized -
>> sort of a contract between the user and the compiler. With a
>> well-chosen spectrum of patterns, this would make it unnecessary to
>> have to rely on explicit vectorization in many cases. My hope is
>> that such an interface would help vectorization to become more
>> frequently used - as I understand the status quo, this is still a
>> niche topic, even though many processors provide suitable hardware
>> nowadays.

> I wouldn't say it's a niche topic :)
> From my monitoring of the GCC development over the last few years
> there's been lots of improvements in auto-vectorisation in compilers
> (at least in GCC).

Okay, I'll take your word for it.

> The thing is, auto-vectorisation is not always profitable for
> performance. Sometimes the runtime loop iteration count is so low
> that setting up the vectorised loop (alignment checks,
> loads/permutes) is slower than just doing the scalar form,
> especially since SIMD performance varies from CPU to CPU.
> So we would want the compiler to have the freedom to make its own
> judgement on when to auto-vectorise rather than enforce a
> "contract". If the user really only wants vector code, they should
> use one of the explicit programming paradigms.

I know these issues are important. I am using Vc for explicit
vectorization, so I can easily produce vector code for common targets,
and I can compare the performance. I have tried the example given
above on my AVX2 machine, linking it with a main program which calls
'vf1' 32768 times, to get one gigaroot (giggle). The vectorized
version takes about half a second, the unvectorized one about three.
With functions like sqrt, the trigonometric functions, exp and pow,
vectorization is very profitable.
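The other functions I mention follow the same pattern. Purely as a
hypothetical companion test (not part of the measurements that
follow), the exp case would look like this:

#include <cmath>

extern float data [ 32768 ] ;

// hypothetical companion to vf1: the same trivial loop, with exp
// instead of sqrt
extern void vf2()
{
  for ( int i = 0 ; i < 32768 ; i++ )
    data [ i ] = std::exp ( data [ i ] ) ;
}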
Some further details. Here's the main program, 'memaxs.cc':

float data [ 32768 ] ;

extern void vf1() ;

int main ( int argc , char * argv[] )
{
  for ( int k = 0 ; k < 32768 ; k++ )
  {
    vf1() ;
  }
}

And the compiler call to get a binary:

  g++ -O3 -mavx2 -o memaxs sqrt.s memaxs.cc

Here's the performance:

$ time ./memaxs

real    0m3,205s
user    0m3,200s
sys     0m0,004s

This variant of vec_sqrt.cc uses Vc ('vc_vec_sqrt.cc'):

#include <Vc/Vc>

extern float data [ 32768 ] ;

extern void vf1()
{
  for ( int k = 0 ; k < 32768 ; k += 8 )
  {
    Vc::float_v fv ( data + k ) ;
    fv = sqrt ( fv ) ;
    fv.store ( data + k ) ;
  }
}

Translated to assembler, I get this inner loop:

.L2:
  vmovups      (%rax), %xmm0
  addq         $32, %rax
  vinsertf128  $0x1, -16(%rax), %ymm0, %ymm0
  vsqrtps      %ymm0, %ymm0
  vmovups      %xmm0, -32(%rax)
  vextractf128 $0x1, %ymm0, -16(%rax)
  cmpq         %rax, %rdx
  jne          .L2
  vzeroupper
  ret
  .cfi_endproc

Note how the data are read 32 bytes at a time and processed with
vsqrtps. Creating the corresponding binary and executing it:

$ g++ -O3 -mavx2 -o memaxs sqrt_vc.s memaxs.cc -lVc
$ time ./memaxs

real    0m0,548s
user    0m0,544s
sys     0m0,004s

So I think this performance difference is a large enough gain to
consider my vectorization-of-math-functions proposal. When it comes
to gather/scatter with arbitrary indexes, I suppose that's less
profitable and probably harder to scan for.

Kay
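P.S. For concreteness, the gather pattern I have in mind is roughly
the following sketch; the names idx, out and vf_gather are made up
for illustration, and I have not benchmarked this case:

#include <cmath>

extern float data [ 32768 ] ;
extern int   idx  [ 32768 ] ;   // arbitrary indexes into data
extern float out  [ 32768 ] ;

// gather with arbitrary indexes, then the same vectorizable function
extern void vf_gather()
{
  for ( int i = 0 ; i < 32768 ; i++ )
    out [ i ] = std::sqrt ( data [ idx [ i ] ] ) ;
}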