From: "Kay F. Jahnke" <kfjahnke@gmail.com>
To: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>,
"gcc@gcc.gnu.org" <gcc@gcc.gnu.org>
Subject: Re: autovectorization in gcc
Date: Wed, 09 Jan 2019 10:56:00 -0000 [thread overview]
Message-ID: <a03f97fc-0014-2af4-6185-5a10d6fa2cad@gmail.com> (raw)
In-Reply-To: <5C35C2C2.1050106@foss.arm.com>
On 09.01.19 10:45, Kyrill Tkachov wrote:
> There's plenty of work being done on auto-vectorisation in GCC.
> Auto-vectorisation is a performance optimisation and as such is not really
> a user-visible feature that absolutely requires user documentation.
Since I'm trying to deliberately exploit it, a more user-visible guise
would help ;)
>> - repeated use of vectorizable functions
>>
>> for ( int i = 0 ; i < vsz ; i++ )
>>   A [ i ] = sqrt ( B [ i ] ) ;
>>
>> Here, replacing the repeated call of sqrt with the vectorized equivalent
>> gives a dramatic speedup (ca. 4X)
The above is a typical example. So, to give a complete source 'vec_sqrt.cc':
#include <cmath>
extern float data [ 32768 ] ;
extern void vf1()
{
#pragma vectorize enable
for ( int i = 0 ; i < 32768 ; i++ )
data [ i ] = std::sqrt ( data [ i ] ) ;
}
This has a large trip count, and the loop body is trivial. It's an ideal
candidate for autovectorization. When I compile this source, using
g++ -O3 -mavx2 -S -o sqrt.s vec_sqrt.cc
the inner loop translates to:
.L2:
vmovss (%rbx), %xmm0
vucomiss %xmm0, %xmm2
vsqrtss %xmm0, %xmm1, %xmm1
jbe .L3
vmovss %xmm2, 12(%rsp)
addq $4, %rbx
vmovss %xmm1, 8(%rsp)
call sqrtf@PLT
vmovss 8(%rsp), %xmm1
vmovss %xmm1, -4(%rbx)
cmpq %rbp, %rbx
vmovss 12(%rsp), %xmm2
jne .L2
AFAICT this is not vectorized; it only processes a single float at a time.
In vector code, I'd expect the vsqrtps mnemonic to show up.
> I believe GCC will do some of that already given a high-enough
> optimisation level
> and floating-point constraints.
> Do you have examples where it doesn't? Testcases with self-contained
> source code
> and compiler flags would be useful to analyse.
So, see above. With -Ofast the output is similar; just the inner loop is
unrolled. But maybe I'm missing something? Any hints for additional flags?
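For what it's worth, one flag I would try here (an assumption on my
part, not something confirmed in this thread) is -fno-math-errno, which
tells GCC it need not maintain errno for math calls and should remove
the main obstacle to vectorizing the sqrt call:

```shell
# assumption: dropping the errno bookkeeping lets the vectorizer use vsqrtps
g++ -O3 -mavx2 -fno-math-errno -S -o sqrt.s vec_sqrt.cc
# note that -ffast-math (and hence -Ofast) implies -fno-math-errno,
# along with other, more aggressive relaxations
```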
>> If the compiler were to provide the autovectorization facilities, and if
>> the patterns it recognizes were well-documented, users could rely on
>> certain code patterns being recognized and autovectorized - sort of a
>> contract between the user and the compiler. With a well-chosen spectrum
>> of patterns, this would make it unnecessary to have to rely on explicit
>> vectorization in many cases. My hope is that such an interface would
>> help vectorization to become more frequently used - as I understand the
>> status quo, this is still a niche topic, even though many processors
>> provide suitable hardware nowadays.
>>
>
> I wouldn't say it's a niche topic :)
> From my monitoring of the GCC development over the last few years
> there's been lots
> of improvements in auto-vectorisation in compilers (at least in GCC).
Okay, I'll take your word for it.
> The thing is, auto-vectorisation is not always profitable for performance.
> Sometimes the runtime loop iteration count is so low that setting up the
> vectorised loop
> (alignment checks, loads/permutes) is slower than just doing the scalar
> form,
> especially since SIMD performance varies from CPU to CPU.
> So we would want the compiler to have the freedom to make its own
> judgement on when
> to auto-vectorise rather than enforce a "contract". If the user really
> only wants
> vector code, they should use one of the explicit programming paradigms.
I know that these issues are important. I am using Vc for explicit
vectorization, so I can easily code to produce vector code for common
targets. And I can compare the performance. I have tried the example
given above on my AVX2 machine, linking with a main program which calls
'vf1' 32768 times, to get one gigaroot (giggle). The vectorized version
takes about half a second; the unvectorized one takes about three. With
functions like sqrt, the trigonometric functions, exp and pow, vectorization
is very profitable. Some further details:
Here's the main program 'memaxs.cc':
float data [ 32768 ] ;
extern void vf1() ;
int main ( int argc , char * argv[] )
{
for ( int k = 0 ; k < 32768 ; k++ )
{
vf1() ;
}
}
And the compiler call to get a binary:
g++ -O3 -mavx2 -o memaxs sqrt.s memaxs.cc
Here's the performance:
$ time ./memaxs
real 0m3,205s
user 0m3,200s
sys 0m0,004s
This variant of vec_sqrt.cc uses Vc ('vc_vec_sqrt.cc'):
#include <Vc/Vc>
extern float data [ 32768 ] ;
extern void vf1()
{
for ( int k = 0 ; k < 32768 ; k += 8 )
{
Vc::float_v fv ( data + k ) ;
fv = sqrt ( fv ) ;
fv.store ( data + k ) ;
}
}
Translated to assembler, I get the inner loop:
.L2:
vmovups (%rax), %xmm0
addq $32, %rax
vinsertf128 $0x1, -16(%rax), %ymm0, %ymm0
vsqrtps %ymm0, %ymm0
vmovups %xmm0, -32(%rax)
vextractf128 $0x1, %ymm0, -16(%rax)
cmpq %rax, %rdx
jne .L2
vzeroupper
ret
.cfi_endproc
Note how the data are read 32 bytes at a time and processed with vsqrtps.
Creating the corresponding binary and executing it:
$ g++ -O3 -mavx2 -o memaxs sqrt_vc.s memaxs.cc -lVc
$ time ./memaxs
real 0m0,548s
user 0m0,544s
sys 0m0,004s
So, I think this performance difference is a good enough gain to justify
considering my vectorization-of-math-functions proposal. When it comes to
gather/scatter with arbitrary indexes, I suppose that's less profitable and
probably harder to scan for.
Kay