Subject: Re: autovectorization in gcc
From: "Kay F. Jahnke"
To: Kyrill Tkachov, gcc@gcc.gnu.org
Date: Wed, 09 Jan 2019 10:56:00 -0000
In-Reply-To: <5C35C2C2.1050106@foss.arm.com>

On 09.01.19 10:45, Kyrill Tkachov wrote:

> There's plenty of work being done on auto-vectorisation in GCC.
> Auto-vectorisation is a performance optimisation and as such is not
> really a user-visible feature that absolutely requires user
> documentation.

Since I'm trying to exploit it deliberately, a more user-visible guise
would help ;)

>> - repeated use of vectorizable functions
>>
>> for ( int i = 0 ; i < vsz ; i++ )
>>    A [ i ] = sqrt ( B [ i ] ) ;
>>
>> Here, replacing the repeated call of sqrt with the vectorized
>> equivalent gives a dramatic speedup (ca. 4X)

The above is a typical example. To give a complete source,
'vec_sqrt.cc':

#include <cmath>

extern float data [ 32768 ] ;

extern void vf1()
{
  #pragma vectorize enable
  for ( int i = 0 ; i < 32768 ; i++ )
    data [ i ] = std::sqrt ( data [ i ] ) ;
}

The trip count is large and the loop is trivial, so it's an ideal
candidate for autovectorization.
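As an aside, and not something I have benchmarked for the numbers
below: to request vectorization explicitly rather than rely on the
autovectorizer's own judgement, the same loop can also be annotated
with the OpenMP simd directive, which GCC honours when compiling with
-fopenmp-simd. A minimal sketch, with the function renamed to
vf1_simd purely for illustration:

#include <cmath>

extern float data [ 32768 ] ;

// same trivial loop as vf1, but explicitly marked as a SIMD loop;
// build with -fopenmp-simd (or -fopenmp) so the pragma is honoured
extern void vf1_simd()
{
  #pragma omp simd
  for ( int i = 0 ; i < 32768 ; i++ )
    data [ i ] = std::sqrt ( data [ i ] ) ;
}

All measurements below use the plain vec_sqrt.cc version, not this
variant.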
When I compile this source with

  g++ -O3 -mavx2 -S -o sqrt.s sqrt_gcc.cc

the inner loop translates to:

.L2:
  vmovss   (%rbx), %xmm0
  vucomiss %xmm0, %xmm2
  vsqrtss  %xmm0, %xmm1, %xmm1
  jbe      .L3
  vmovss   %xmm2, 12(%rsp)
  addq     $4, %rbx
  vmovss   %xmm1, 8(%rsp)
  call     sqrtf@PLT
  vmovss   8(%rsp), %xmm1
  vmovss   %xmm1, -4(%rbx)
  cmpq     %rbp, %rbx
  vmovss   12(%rsp), %xmm2
  jne      .L2

AFAICT this is not vectorized; it processes only a single float at a
time. In vector code, I'd expect the vsqrtps mnemonic to show up.

> I believe GCC will do some of that already given a high-enough
> optimisation level and floating-point constraints.
> Do you have examples where it doesn't? Testcases with self-contained
> source code and compiler flags would be useful to analyse.

See above. With -Ofast the output is similar, only the inner loop is
unrolled. But maybe I'm missing something? Any hints for additional
flags?

>> If the compiler were to provide the autovectorization facilities,
>> and if the patterns it recognizes were well-documented, users could
>> rely on certain code patterns being recognized and autovectorized -
>> sort of a contract between the user and the compiler. With a
>> well-chosen spectrum of patterns, this would make it unnecessary to
>> have to rely on explicit vectorization in many cases. My hope is
>> that such an interface would help vectorization to become more
>> frequently used - as I understand the status quo, this is still a
>> niche topic, even though many processors provide suitable hardware
>> nowadays.

> I wouldn't say it's a niche topic :)
> From my monitoring of the GCC development over the last few years
> there's been lots of improvements in auto-vectorisation in compilers
> (at least in GCC).

Okay, I'll take your word for it.

> The thing is, auto-vectorisation is not always profitable for
> performance. Sometimes the runtime loop iteration count is so low
> that setting up the vectorised loop (alignment checks,
> loads/permutes) is slower than just doing the scalar form,
> especially since SIMD performance varies from CPU to CPU.
> So we would want the compiler to have the freedom to make its own
> judgement on when to auto-vectorise rather than enforce a
> "contract". If the user really only wants vector code, they should
> use one of the explicit programming paradigms.

I know these issues are important. I am using Vc for explicit
vectorization, so I can easily produce vector code for common targets,
and I can compare the performance. I have tried the example given
above on my AVX2 machine, linking it with a main program which calls
'vf1' 32768 times, to get one gigaroot (giggle). The vectorized
version takes about half a second, the unvectorized one about three.
With functions like sqrt, the trigonometric functions, exp and pow,
vectorization is very profitable.
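The other functions I mention follow the same pattern. Purely as a
hypothetical companion test (not part of the measurements that
follow), the exp case would look like this:

#include <cmath>

extern float data [ 32768 ] ;

// hypothetical companion to vf1: the same trivial loop, with exp
// instead of sqrt
extern void vf2()
{
  for ( int i = 0 ; i < 32768 ; i++ )
    data [ i ] = std::exp ( data [ i ] ) ;
}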
Some further details. Here's the main program, 'memaxs.cc':

float data [ 32768 ] ;

extern void vf1() ;

int main ( int argc , char * argv[] )
{
  for ( int k = 0 ; k < 32768 ; k++ )
  {
    vf1() ;
  }
}

And the compiler call to get a binary:

  g++ -O3 -mavx2 -o memaxs sqrt.s memaxs.cc

Here's the performance:

$ time ./memaxs

real    0m3,205s
user    0m3,200s
sys     0m0,004s

This variant of vec_sqrt.cc uses Vc ('vc_vec_sqrt.cc'):

#include <Vc/Vc>

extern float data [ 32768 ] ;

extern void vf1()
{
  for ( int k = 0 ; k < 32768 ; k += 8 )
  {
    Vc::float_v fv ( data + k ) ;
    fv = sqrt ( fv ) ;
    fv.store ( data + k ) ;
  }
}

Translated to assembler, I get this inner loop:

.L2:
  vmovups      (%rax), %xmm0
  addq         $32, %rax
  vinsertf128  $0x1, -16(%rax), %ymm0, %ymm0
  vsqrtps      %ymm0, %ymm0
  vmovups      %xmm0, -32(%rax)
  vextractf128 $0x1, %ymm0, -16(%rax)
  cmpq         %rax, %rdx
  jne          .L2
  vzeroupper
  ret
  .cfi_endproc

Note how the data are read 32 bytes at a time and processed with
vsqrtps. Creating the corresponding binary and executing it:

$ g++ -O3 -mavx2 -o memaxs sqrt_vc.s memaxs.cc -lVc
$ time ./memaxs

real    0m0,548s
user    0m0,544s
sys     0m0,004s

So I think this performance difference is a large enough gain to
consider my vectorization-of-math-functions proposal. When it comes
to gather/scatter with arbitrary indexes, I suppose that's less
profitable and probably harder to scan for.

Kay
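P.S. For concreteness, the gather pattern I have in mind is roughly
the following sketch; the names idx, out and vf_gather are made up
for illustration, and I have not benchmarked this case:

#include <cmath>

extern float data [ 32768 ] ;
extern int   idx  [ 32768 ] ;   // arbitrary indexes into data
extern float out  [ 32768 ] ;

// gather with arbitrary indexes, then the same vectorizable function
extern void vf_gather()
{
  for ( int i = 0 ; i < 32768 ; i++ )
    out [ i ] = std::sqrt ( data [ idx [ i ] ] ) ;
}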