From: "Kay F. Jahnke" <kfjahnke@gmail.com>
To: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>,
"gcc@gcc.gnu.org" <gcc@gcc.gnu.org>
Subject: Re: autovectorization in gcc
Date: Wed, 09 Jan 2019 10:56:00 -0000 [thread overview]
Message-ID: <a03f97fc-0014-2af4-6185-5a10d6fa2cad@gmail.com> (raw)
In-Reply-To: <5C35C2C2.1050106@foss.arm.com>
On 09.01.19 10:45, Kyrill Tkachov wrote:
> There's plenty of work being done on auto-vectorisation in GCC.
> Auto-vectorisation is a performance optimisation and as such is not really
> a user-visible feature that absolutely requires user documentation.
Since I'm trying to deliberately exploit it, a more user-visible guise
would help ;)
>> - repeated use of vectorizable functions
>>
>> for ( int i = 0 ; i < vsz ; i++ )
>>   A [ i ] = sqrt ( B [ i ] ) ;
>>
>> Here, replacing the repeated call of sqrt with the vectorized equivalent
>> gives a dramatic speedup (ca. 4X)
The above is a typical example. So, to give a complete source 'vec_sqrt.cc':
#include <cmath>
extern float data [ 32768 ] ;
extern void vf1()
{
#pragma vectorize enable
for ( int i = 0 ; i < 32768 ; i++ )
data [ i ] = std::sqrt ( data [ i ] ) ;
}
This has a large trip count, and the loop body is trivial. It's an ideal
candidate for autovectorization. When I compile this source, using
g++ -O3 -mavx2 -S -o sqrt.s vec_sqrt.cc
the inner loop translates to:
.L2:
vmovss (%rbx), %xmm0
vucomiss %xmm0, %xmm2
vsqrtss %xmm0, %xmm1, %xmm1
jbe .L3
vmovss %xmm2, 12(%rsp)
addq $4, %rbx
vmovss %xmm1, 8(%rsp)
call sqrtf@PLT
vmovss 8(%rsp), %xmm1
vmovss %xmm1, -4(%rbx)
cmpq %rbp, %rbx
vmovss 12(%rsp), %xmm2
jne .L2
AFAICT this is not vectorized; it only processes a single float at a time.
In vector code, I'd expect the vsqrtps mnemonic to show up.
> I believe GCC will do some of that already given a high-enough
> optimisation level
> and floating-point constraints.
> Do you have examples where it doesn't? Testcases with self-contained
> source code
> and compiler flags would be useful to analyse.
So, see above. With -Ofast the output is similar; just the inner loop is
unrolled. But maybe I'm missing something? Any hints for additional flags?
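For what it's worth, one flag I would try here (an assumption on my
part, not something confirmed in this thread) is -fno-math-errno, which
tells GCC it need not maintain errno for math calls and should remove
the main obstacle to vectorizing the sqrt call:

```shell
# assumption: dropping the errno bookkeeping lets the vectorizer use vsqrtps
g++ -O3 -mavx2 -fno-math-errno -S -o sqrt.s vec_sqrt.cc
# note that -ffast-math (and hence -Ofast) implies -fno-math-errno,
# along with other, more aggressive relaxations
```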
>> If the compiler were to provide the autovectorization facilities, and if
>> the patterns it recognizes were well-documented, users could rely on
>> certain code patterns being recognized and autovectorized - sort of a
>> contract between the user and the compiler. With a well-chosen spectrum
>> of patterns, this would make it unnecessary to have to rely on explicit
>> vectorization in many cases. My hope is that such an interface would
>> help vectorization to become more frequently used - as I understand the
>> status quo, this is still a niche topic, even though many processors
>> provide suitable hardware nowadays.
>>
>
> I wouldn't say it's a niche topic :)
> From my monitoring of the GCC development over the last few years
> there's been lots
> of improvements in auto-vectorisation in compilers (at least in GCC).
Okay, I'll take your word for it.
> The thing is, auto-vectorisation is not always profitable for performance.
> Sometimes the runtime loop iteration count is so low that setting up the
> vectorised loop
> (alignment checks, loads/permutes) is slower than just doing the scalar
> form,
> especially since SIMD performance varies from CPU to CPU.
> So we would want the compiler to have the freedom to make its own
> judgement on when
> to auto-vectorise rather than enforce a "contract". If the user really
> only wants
> vector code, they should use one of the explicit programming paradigms.
I know that these issues are important. I am using Vc for explicit
vectorization, so I can easily code to produce vector code for common
targets. And I can compare the performance. I have tried the example
given above on my AVX2 machine, linking with a main program which calls
'vf1' 32768 times, to get one gigaroot (giggle). The vectorized version
takes about half a second; the unvectorized one takes about three. With
functions like sqrt, the trigonometric functions, exp and pow, vectorization
is very profitable. Some further details:
Here's the main program 'memaxs.cc':
float data [ 32768 ] ;
extern void vf1() ;
int main ( int argc , char * argv[] )
{
for ( int k = 0 ; k < 32768 ; k++ )
{
vf1() ;
}
}
And the compiler call to get a binary:
g++ -O3 -mavx2 -o memaxs sqrt.s memaxs.cc
Here's the performance:
$ time ./memaxs
real 0m3,205s
user 0m3,200s
sys 0m0,004s
This variant of vec_sqrt.cc uses Vc ('vc_vec_sqrt.cc'):
#include <Vc/Vc>
extern float data [ 32768 ] ;
extern void vf1()
{
for ( int k = 0 ; k < 32768 ; k += 8 )
{
Vc::float_v fv ( data + k ) ;
fv = sqrt ( fv ) ;
fv.store ( data + k ) ;
}
}
Translated to assembler, I get the inner loop:
.L2:
vmovups (%rax), %xmm0
addq $32, %rax
vinsertf128 $0x1, -16(%rax), %ymm0, %ymm0
vsqrtps %ymm0, %ymm0
vmovups %xmm0, -32(%rax)
vextractf128 $0x1, %ymm0, -16(%rax)
cmpq %rax, %rdx
jne .L2
vzeroupper
ret
.cfi_endproc
Note how the data are read 32 bytes at a time and processed with vsqrtps.
Creating the corresponding binary and executing it:
$ g++ -O3 -mavx2 -o memaxs sqrt_vc.s memaxs.cc -lVc
$ time ./memaxs
real 0m0,548s
user 0m0,544s
sys 0m0,004s
So, I think this performance difference is a good enough gain to justify
considering my vectorization-of-math-functions proposal. When it comes to
gather/scatter with arbitrary indexes, I suppose that's less profitable and
probably harder to scan for.
Kay