autovectorization in gcc

public inbox for gcc@gcc.gnu.org

* autovectorization in gcc
@ 2019-01-09  8:29 Kay F. Jahnke
  2019-01-09  9:46 ` Kyrill Tkachov
  0 siblings, 1 reply; 16+ messages in thread
From: Kay F. Jahnke @ 2019-01-09  8:29 UTC (permalink / raw)
  To: gcc

Hi there!

I am developing software which tries to deliberately exploit the 
compiler's autovectorization facilities by feeding data in 
autovectorization-friendly loops. I'm currently using both g++ and 
clang++ to see how well this approach works. Using simple arithmetic, I 
often get good results. To widen the scope of my work, I was looking for 
documentation on which constructs would be recognized by the 
autovectorization stage, and found

https://www.gnu.org/software/gcc/projects/tree-ssa/vectorization.html

By the looks of it, this document has not seen any changes for several 
years. Has development on the autovectorization stage stopped, or is 
there simply no documentation?

In my experience, vectorization is essential to speed up arithmetic on 
the CPU, and reliable recognition of vectorization opportunities by the 
compiler can provide vectorization to programs which don't bother to 
code it explicitly. I feel the topic is being neglected - at least the 
documentation I found suggests this. To demonstrate what I mean, I have 
two concrete scenarios which I'd like to be handled by the 
autovectorization stage:

- gather/scatter with arbitrary indexes

In C, this would be loops like

// gather from B to A using gather indexes

for ( int i = 0 ; i < vsz ; i++ )
   A [ i ] = B [ indexes [ i ] ] ;

From the AVX2 ISA onwards, there are hardware gather operations (scatter
arrived with AVX-512), which can speed things up a good deal.
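
For experimenting, the gather loop above can be put into a self-contained function, as in the sketch below. The function name and the use of `__restrict` (a GCC/Clang extension) are assumptions added for illustration; whether hardware gather instructions are actually emitted depends on the compiler version, the target flags (e.g. -mavx2) and the cost model.

```cpp
#include <cassert>
#include <cstddef>

// Gather: A[i] = B[indexes[i]] for every i in [0, vsz).
// __restrict (GCC/Clang extension, an assumption here) promises that A does
// not overlap B or indexes, which removes one obstacle to vectorization.
void gather(float* __restrict A, const float* __restrict B,
            const int* __restrict indexes, std::size_t vsz)
{
    assert(A != nullptr && B != nullptr && indexes != nullptr);
    for (std::size_t i = 0; i < vsz; ++i)
        A[i] = B[indexes[i]];
}
```

Compiling such a file with an option like -fopt-info-vec reports whether the loop was vectorized.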

- repeated use of vectorizable functions

for ( int i = 0 ; i < vsz ; i++ )
   A [ i ] = sqrt ( B [ i ] ) ;

Here, replacing the repeated call of sqrt with the vectorized equivalent 
gives a dramatic speedup (ca. 4X)

If the compiler were to provide the autovectorization facilities, and if 
the patterns it recognizes were well-documented, users could rely on 
certain code patterns being recognized and autovectorized - sort of a 
contract between the user and the compiler. With a well-chosen spectrum 
of patterns, this would make it unnecessary to have to rely on explicit 
vectorization in many cases. My hope is that such an interface would 
help vectorization to become more frequently used - as I understand the 
status quo, this is still a niche topic, even though many processors 
provide suitable hardware nowadays.

Can you point me to where 'the action is' in this regard?

With regards

Kay F. Jahnke


* Re: autovectorization in gcc
  2019-01-09  8:29 autovectorization in gcc Kay F. Jahnke
@ 2019-01-09  9:46 ` Kyrill Tkachov
  2019-01-09  9:50   ` Andrew Haley
  2019-01-09 10:56   ` Kay F. Jahnke
  0 siblings, 2 replies; 16+ messages in thread
From: Kyrill Tkachov @ 2019-01-09  9:46 UTC (permalink / raw)
  To: Kay F. Jahnke, gcc

Hi Kay,

On 09/01/19 08:29, Kay F. Jahnke wrote:
> Hi there!
>
> I am developing software which tries to deliberately exploit the
> compiler's autovectorization facilities by feeding data in
> autovectorization-friendly loops. I'm currently using both g++ and
> clang++ to see how well this approach works. Using simple arithmetic, I
> often get good results. To widen the scope of my work, I was looking for
> documentation on which constructs would be recognized by the
> autovectorization stage, and found
>
> https://www.gnu.org/software/gcc/projects/tree-ssa/vectorization.html
>

Yeah, that page hasn't been updated in ages AFAIK.

> By the looks of it, this document has not seen any changes for several
> years. Has development on the autovectorization stage stopped, or is
> there simply no documentation?
>

There's plenty of work being done on auto-vectorisation in GCC.
Auto-vectorisation is a performance optimisation and as such is not really
a user-visible feature that absolutely requires user documentation.

> In my experience, vectorization is essential to speed up arithmetic on
> the CPU, and reliable recognition of vectorization opportunities by the
> compiler can provide vectorization to programs which don't bother to
> code it explicitly. I feel the topic is being neglected - at least the
> documentation I found suggests this. To demonstrate what I mean, I have
> two concrete scenarios which I'd like to be handled by the
> autovectorization stage:
>
> - gather/scatter with arbitrary indexes
>
> In C, this would be loops like
>
> // gather from B to A using gather indexes
>
> for ( int i = 0 ; i < vsz ; i++ )
>    A [ i ] = B [ indexes [ i ] ] ;
>
>  From the AVX2 ISA onwards, there are hardware gather/scatter
> operations, which can speed things up a good deal.
>
> - repeated use of vectorizable functions
>
> for ( int i = 0 ; i < vsz ; i++ )
>    A [ i ] = sqrt ( B [ i ] ) ;
>
> Here, replacing the repeated call of sqrt with the vectorized equivalent
> gives a dramatic speedup (ca. 4X)
>

I believe GCC will already do some of that, given a high enough optimisation level
(e.g. -O3) and relaxed floating-point constraints (e.g. -ffast-math).
Do you have examples where it doesn't? Testcases with self-contained source code
and compiler flags would be useful to analyse.

> If the compiler were to provide the autovectorization facilities, and if
> the patterns it recognizes were well-documented, users could rely on
> certain code patterns being recognized and autovectorized - sort of a
> contract between the user and the compiler. With a well-chosen spectrum
> of patterns, this would make it unnecessary to have to rely on explicit
> vectorization in many cases. My hope is that such an interface would
> help vectorization to become more frequently used - as I understand the
> status quo, this is still a niche topic, even though many processors
> provide suitable hardware nowadays.
>

I wouldn't say it's a niche topic :)
From my monitoring of GCC development over the last few years, there have been lots
of improvements in auto-vectorisation in compilers (at least in GCC).

The thing is, auto-vectorisation is not always profitable for performance.
Sometimes the runtime loop iteration count is so low that setting up the vectorised loop
(alignment checks, loads/permutes) is slower than just doing the scalar form,
especially since SIMD performance varies from CPU to CPU.
So we would want the compiler to have the freedom to make its own judgement on when
to auto-vectorise rather than enforce a "contract". If the user really only wants
vector code, they should use one of the explicit programming paradigms.
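
As one illustration of an explicit paradigm, here is a minimal sketch using the GCC/Clang vector extension. The type name `v4sf` and the function `scale` are made up for this example, and `vector_size` is a compiler extension rather than standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Four packed floats; vector_size is a GCC/Clang extension, not standard C++.
typedef float v4sf __attribute__((vector_size(16)));

// Multiply every element of data[0..n) by factor, four lanes at a time.
void scale(float* data, std::size_t n, float factor)
{
    assert(data != nullptr || n == 0);
    const v4sf f = {factor, factor, factor, factor};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v4sf v;
        std::memcpy(&v, data + i, sizeof v);   // unaligned load
        v = v * f;                             // element-wise multiply on all four lanes
        std::memcpy(data + i, &v, sizeof v);   // unaligned store
    }
    for (; i < n; ++i)                         // scalar tail for the leftover elements
        data[i] *= factor;
}
```

Here the vector width is fixed by the programmer, so the compiler's cost model no longer decides whether vector code is used.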

HTH,
Kyrill

> Can you point me to where 'the action is' in this regard?
>
> With regards
>
> Kay F. Jahnke
>
>


* Re: autovectorization in gcc
  2019-01-09  9:46 ` Kyrill Tkachov
@ 2019-01-09  9:50   ` Andrew Haley
  2019-01-09  9:56     ` Jonathan Wakely
                       ` (2 more replies)
  2019-01-09 10:56   ` Kay F. Jahnke
  1 sibling, 3 replies; 16+ messages in thread
From: Andrew Haley @ 2019-01-09  9:50 UTC (permalink / raw)
  To: Kyrill Tkachov, Kay F. Jahnke, gcc

On 1/9/19 9:45 AM, Kyrill Tkachov wrote:
> Hi Kay,
> 
> On 09/01/19 08:29, Kay F. Jahnke wrote:
>> Hi there!
>>
>> I am developing software which tries to deliberately exploit the
>> compiler's autovectorization facilities by feeding data in
>> autovectorization-friendly loops. I'm currently using both g++ and
>> clang++ to see how well this approach works. Using simple arithmetic, I
>> often get good results. To widen the scope of my work, I was looking for
>> documentation on which constructs would be recognized by the
>> autovectorization stage, and found
>>
>> https://www.gnu.org/software/gcc/projects/tree-ssa/vectorization.html
>>
> 
> Yeah, that page hasn't been updated in ages AFAIK.
> 
>> By the looks of it, this document has not seen any changes for several
>> years. Has development on the autovectorization stage stopped, or is
>> there simply no documentation?
>>
> 
> There's plenty of work being done on auto-vectorisation in GCC.
> Auto-vectorisation is a performance optimisation and as such is not really
> a user-visible feature that absolutely requires user documentation.

I don't agree. Sometimes vectorization is critical. It would be nice
to have a warning which would fire if vectorization failed. That would
surely help the OP.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


* Re: autovectorization in gcc
  2019-01-09  9:50   ` Andrew Haley
@ 2019-01-09  9:56     ` Jonathan Wakely
  2019-01-09 16:10       ` David Malcolm
  2019-01-09 10:47     ` Ramana Radhakrishnan
  2019-01-10  9:24     ` Kay F. Jahnke
  2 siblings, 1 reply; 16+ messages in thread
From: Jonathan Wakely @ 2019-01-09  9:56 UTC (permalink / raw)
  To: Andrew Haley; +Cc: Kyrill Tkachov, Kay F. Jahnke, gcc

On Wed, 9 Jan 2019 at 09:50, Andrew Haley wrote:
> I don't agree. Sometimes vectorization is critical. It would be nice
> to have a warning which would fire if vectorization failed. That would
> surely help the OP.

Dave Malcolm has been working on something like that:
https://gcc.gnu.org/ml/gcc-patches/2018-09/msg01749.html


* Re: autovectorization in gcc
  2019-01-09  9:56     ` Jonathan Wakely
@ 2019-01-09 16:10       ` David Malcolm
  2019-01-09 16:25         ` Jakub Jelinek
  2019-01-09 16:26         ` David Malcolm
  0 siblings, 2 replies; 16+ messages in thread
From: David Malcolm @ 2019-01-09 16:10 UTC (permalink / raw)
  To: Jonathan Wakely, Andrew Haley; +Cc: Kyrill Tkachov, Kay F. Jahnke, gcc

On Wed, 2019-01-09 at 09:56 +0000, Jonathan Wakely wrote:
> On Wed, 9 Jan 2019 at 09:50, Andrew Haley wrote:
> > I don't agree. Sometimes vectorization is critical. It would be
> > nice
> > to have a warning which would fire if vectorization failed. That
> > would
> > surely help the OP.
> 
> Dave Malcolm has been working on something like that:
> https://gcc.gnu.org/ml/gcc-patches/2018-09/msg01749.html

Yes: this code is in trunk for gcc 9, but it doesn't help much for the
case given elsewhere in this thread:

#include <cmath>

extern float data [ 32768 ] ;

extern void vf1()
{
   #pragma vectorize enable
   for ( int i = 0 ; i < 32768 ; i++ )
     data [ i ] = std::sqrt ( data [ i ] ) ;
}

Compiling on this x86_64 box with -fopt-info-vec-missed shows the
rather cryptic:

g++ -c /tmp/sqrt-test.cc -O3 -mavx2 -fopt-info-vec-missed
/tmp/sqrt-test.cc:8:24: missed: couldn't vectorize loop
/tmp/sqrt-test.cc:8:24: missed: not vectorized: control flow in loop.
/home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27: missed: statement clobbers memory: __builtin_sqrtf (_1);

and with -fopt-info-vec-all-internals shows:

g++ -c /tmp/sqrt-test.cc -O3 -mavx2 -fopt-info-vec-all-internals

Analyzing loop at /tmp/sqrt-test.cc:8
/tmp/sqrt-test.cc:8:24: note:  === analyze_loop_nest ===
/tmp/sqrt-test.cc:8:24: note:   === vect_analyze_loop_form ===
/tmp/sqrt-test.cc:8:24: missed:   not vectorized: control flow in loop.
/tmp/sqrt-test.cc:8:24: missed:  bad loop form.
/tmp/sqrt-test.cc:8:24: missed: couldn't vectorize loop
/tmp/sqrt-test.cc:8:24: missed: not vectorized: control flow in loop.
/tmp/sqrt-test.cc:5:13: note: vectorized 0 loops in function.
/home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27: note:  === vect_slp_analyze_bb ===
/home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27: note:   === vect_analyze_data_refs ===
/home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27: note:   got vectype for stmt: _1 = data[i_12];
vector(8) float
/home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27: missed:  not vectorized: not enough data-refs in basic block.
/home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27: missed: statement clobbers memory: __builtin_sqrtf (_1);
/tmp/sqrt-test.cc:8:24: note:  === vect_slp_analyze_bb ===
/tmp/sqrt-test.cc:8:24: note:   === vect_analyze_data_refs ===
/tmp/sqrt-test.cc:8:24: note:   got vectype for stmt: data[i_12] = _7;
vector(8) float
/tmp/sqrt-test.cc:8:24: missed:  not vectorized: not enough data-refs in basic block.
/tmp/sqrt-test.cc:10:1: note:  === vect_slp_analyze_bb ===
/tmp/sqrt-test.cc:10:1: note:   === vect_analyze_data_refs ===
/tmp/sqrt-test.cc:10:1: missed:  not vectorized: not enough data-refs in basic block.

I had to turn on -fdump-tree-all to try to figure out what that
"control flow in loop" was; it seems to be a guard against the input
value being negative:

  <bb 3> [local count: 1063004407]:
  # i_12 = PHI <0(2), i_6(7)>
  # ivtmp_10 = PHI <32768(2), ivtmp_2(7)>
  # DEBUG i => i_12
  # DEBUG BEGIN_STMT
  _1 = data[i_12];
  # DEBUG __x => _1
  # DEBUG BEGIN_STMT
  _7 = .SQRT (_1);
  if (_1 u>= 0.0)
    goto <bb 8>; [99.95%]
  else
    goto <bb 4>; [0.05%]

  <bb 8> [local count: 1062472912]:
  goto <bb 5>; [100.00%]

  <bb 4> [local count: 531495]:
  __builtin_sqrtf (_1);

I'm not sure where that control flow came from: it isn't in
  sqrt-test.cc.104t.stdarg
but is in
  sqrt-test.cc.105t.cdce
so I think it's coming from the argument-range code in cdce.

Arguably the location on the statement is wrong: it's on the loop
header, when it presumably should be on the std::sqrt call.

Shall I file a bugzilla about this?

Dave


* Re: autovectorization in gcc
  2019-01-09 16:10       ` David Malcolm
@ 2019-01-09 16:25         ` Jakub Jelinek
  2019-01-10  8:19           ` Richard Biener
  2019-01-09 16:26         ` David Malcolm
  1 sibling, 1 reply; 16+ messages in thread
From: Jakub Jelinek @ 2019-01-09 16:25 UTC (permalink / raw)
  To: David Malcolm, Richard Biener
  Cc: Jonathan Wakely, Andrew Haley, Kyrill Tkachov, Kay F. Jahnke, gcc

On Wed, Jan 09, 2019 at 11:10:25AM -0500, David Malcolm wrote:
> extern void vf1()
> {
>    #pragma vectorize enable
>    for ( int i = 0 ; i < 32768 ; i++ )
>      data [ i ] = std::sqrt ( data [ i ] ) ;
> }
> 
> Compiling on this x86_64 box with -fopt-info-vec-missed shows the

>   _7 = .SQRT (_1);
>   if (_1 u>= 0.0)
>     goto <bb 8>; [99.95%]
>   else
>     goto <bb 4>; [0.05%]
> 
>   <bb 8> [local count: 1062472912]:
>   goto <bb 5>; [100.00%]
> 
>   <bb 4> [local count: 531495]:
>   __builtin_sqrtf (_1);
> 
> I'm not sure where that control flow came from: it isn't in
>   sqrt-test.cc.104t.stdarg
> but is in
>   sqrt-test.cc.105t.cdce
> so I think it's coming from the argument-range code in cdce.
> 
> Arguably the location on the statement is wrong: it's on the loop
> header, when it presumably should be on the std::sqrt call.

See my other mail: it is the result of the -fmath-errno default.
The inline emitted sqrt doesn't handle errno setting, so we emit
essentially x = sqrt (arg); if (__builtin_expect (arg < 0.0, 0)) sqrt (arg); where
the former sqrt is inline using HW instructions and the latter is the
library call.

With some extra work we could vectorize it; e.g. if we make it handle
OpenMP #pragma omp ordered simd efficiently, it would be the same thing:
allow non-vectorizable portions of vectorized loops by running a scalar
loop from 0 to vf-1 over the non-vectorizable stuff, and drop the limitation
that the vectorized loop is a single bb.  Essentially, in this case it would
be
  vec1 = vec_load (data + i);
  vec2 = vec_sqrt (vec1);
  if (__builtin_expect (any (vec2 < 0.0)))
    {
      for (int i = 0; i < vf; i++)
        sqrt (vec2[i]);
    }
  vec_store (data + i, vec2);
If that turns out to be way too hard, we could, for vectorization
purposes, hide that in the .SQRT internal fn, say by adding a fndecl argument
to it for when it should treat the exceptional cases in some way, so that the
control flow isn't visible in the vectorized loop.
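
The blocked scheme can be modelled in plain C++ roughly as follows. This is only a sketch under stated assumptions: `vf = 8` is a hypothetical vectorization factor, the function name is invented, the inner loop stands in for the vector instructions, and a direct `errno = EDOM` stands in for the scalar library calls. The check is written against the input values, since the result of a square root is never negative.

```cpp
#include <cassert>
#include <cerrno>
#include <cmath>
#include <cstddef>

// Hypothetical vectorization factor (8 floats, as in one AVX2 register).
constexpr std::size_t vf = 8;

// In-place square root, processed in blocks of vf elements.
void sqrt_blocks(float* data, std::size_t n)
{
    assert(data != nullptr || n == 0);
    std::size_t i = 0;
    for (; i + vf <= n; i += vf) {
        bool any_negative = false;
        // Stand-in for the vector part: straight-line work on one block.
        for (std::size_t j = 0; j < vf; ++j) {
            any_negative |= data[i + j] < 0.0f;   // test the input before overwriting it
            data[i + j] = std::sqrt(data[i + j]);
        }
        // Rare slow path, taken at most once per block
        // (__builtin_expect is a GCC/Clang builtin).
        if (__builtin_expect(any_negative, 0))
            errno = EDOM;
    }
    for (; i < n; ++i) {                          // scalar remainder
        if (data[i] < 0.0f)
            errno = EDOM;
        data[i] = std::sqrt(data[i]);
    }
}
```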

	Jakub


* Re: autovectorization in gcc
  2019-01-09 16:25         ` Jakub Jelinek
@ 2019-01-10  8:19           ` Richard Biener
  2019-01-10 11:11             ` Szabolcs Nagy
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2019-01-10  8:19 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: David Malcolm, Jonathan Wakely, Andrew Haley, Kyrill Tkachov,
	Kay F. Jahnke, gcc

On Wed, 9 Jan 2019, Jakub Jelinek wrote:

> On Wed, Jan 09, 2019 at 11:10:25AM -0500, David Malcolm wrote:
> > extern void vf1()
> > {
> >    #pragma vectorize enable
> >    for ( int i = 0 ; i < 32768 ; i++ )
> >      data [ i ] = std::sqrt ( data [ i ] ) ;
> > }
> > 
> > Compiling on this x86_64 box with -fopt-info-vec-missed shows the
> 
> >   _7 = .SQRT (_1);
> >   if (_1 u>= 0.0)
> >     goto <bb 8>; [99.95%]
> >   else
> >     goto <bb 4>; [0.05%]
> > 
> >   <bb 8> [local count: 1062472912]:
> >   goto <bb 5>; [100.00%]
> > 
> >   <bb 4> [local count: 531495]:
> >   __builtin_sqrtf (_1);
> > 
> > I'm not sure where that control flow came from: it isn't in
> >   sqrt-test.cc.104t.stdarg
> > but is in
> >   sqrt-test.cc.105t.cdce
> > so I think it's coming from the argument-range code in cdce.
> > 
> > Arguably the location on the statement is wrong: it's on the loop
> > header, when it presumably should be on the std::sqrt call.
> 
> See my other mail, it is the result of the -fmath-errno default,
> the inline emitted sqrt doesn't handle errno setting and we emit
> essentially x = sqrt (arg); if (__builtin_expect (arg < 0.0, 0)) sqrt (arg); where
> the former sqrt is inline using HW instructions and the latter is the
> library call.
> 
> With some extra work we could vectorize it; e.g. if we make it handle
> OpenMP #pragma omp ordered simd efficiently, it would be the same thing
> - allow non-vectorizable portions of vectorized loops by doing there a
> scalar loop from 0 to vf-1 doing the non-vectorizable stuff + drop the limitation
> that the vectorized loop is a single bb.  Essentially, in this case it would
> be
>   vec1 = vec_load (data + i);
>   vec2 = vec_sqrt (vec1);
>   if (__builtin_expect (any (vec2 < 0.0)))
>     {
>       for (int i = 0; i < vf; i++)
>         sqrt (vec2[i]);
>     }
>   vec_store (data + i, vec2);
> If that would turn to be way too hard, we could for the vectorization
> purposes hide that into the .SQRT internal fn, say add a fndecl argument to
> it if it should treat the exceptional cases some way so that the control
> flow isn't visible in the vectorized loop.

If we decide it's worth the trouble I'd rather do that in the epilogue
and thus make the any (vec2 < 0.0) a reduction.  Like

   smallest = min(smallest, vec1);

and after the loop do the errno thing on the smallest element.

That said, this is a transform that is probably worthwhile even
on scalar code, possibly easiest to code-gen right from the start
in the call-dce pass.
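
A rough model of the reduction variant, again only as a sketch: the function name is invented, and `errno = EDOM` stands in for "the errno thing" done once after the loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cmath>
#include <cstddef>

// In-place square root; the errno handling is deferred to after the loop.
void sqrt_min_reduction(float* data, std::size_t n)
{
    assert(data != nullptr || n == 0);
    float smallest = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        // Min reduction over the inputs; a NaN input leaves smallest unchanged.
        smallest = std::min(smallest, data[i]);
        data[i] = std::sqrt(data[i]);
    }
    // One check for the whole loop instead of one branch per element.
    if (smallest < 0.0f)
        errno = EDOM;
}
```

The loop body then has no branch at all, and the reduction itself is a pattern that vectorizers commonly handle.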

Richard.


* Re: autovectorization in gcc
  2019-01-10  8:19           ` Richard Biener
@ 2019-01-10 11:11             ` Szabolcs Nagy
  0 siblings, 0 replies; 16+ messages in thread
From: Szabolcs Nagy @ 2019-01-10 11:11 UTC (permalink / raw)
  To: Richard Biener, Jakub Jelinek
  Cc: nd, David Malcolm, Jonathan Wakely, Andrew Haley, Kyrill Tkachov,
	Kay F. Jahnke, gcc

On 10/01/2019 08:19, Richard Biener wrote:
> On Wed, 9 Jan 2019, Jakub Jelinek wrote:
> 
>> On Wed, Jan 09, 2019 at 11:10:25AM -0500, David Malcolm wrote:
>>> extern void vf1()
>>> {
>>>    #pragma vectorize enable
>>>    for ( int i = 0 ; i < 32768 ; i++ )
>>>      data [ i ] = std::sqrt ( data [ i ] ) ;
>>> }
>>>
>>> Compiling on this x86_64 box with -fopt-info-vec-missed shows the
>>
>>>   _7 = .SQRT (_1);
>>>   if (_1 u>= 0.0)
>>>     goto <bb 8>; [99.95%]
>>>   else
>>>     goto <bb 4>; [0.05%]
>>>
>>>   <bb 8> [local count: 1062472912]:
>>>   goto <bb 5>; [100.00%]
>>>
>>>   <bb 4> [local count: 531495]:
>>>   __builtin_sqrtf (_1);
>>>
>>> I'm not sure where that control flow came from: it isn't in
>>>   sqrt-test.cc.104t.stdarg
>>> but is in
>>>   sqrt-test.cc.105t.cdce
>>> so I think it's coming from the argument-range code in cdce.
>>>
>>> Arguably the location on the statement is wrong: it's on the loop
>>> header, when it presumably should be on the std::sqrt call.
>>
>> See my other mail, it is the result of the -fmath-errno default,
>> the inline emitted sqrt doesn't handle errno setting and we emit
>> essentially x = sqrt (arg); if (__builtin_expect (arg < 0.0, 0)) sqrt (arg); where
>> the former sqrt is inline using HW instructions and the latter is the
>> library call.
>>
>> With some extra work we could vectorize it; e.g. if we make it handle
>> OpenMP #pragma omp ordered simd efficiently, it would be the same thing
>> - allow non-vectorizable portions of vectorized loops by doing there a
>> scalar loop from 0 to vf-1 doing the non-vectorizable stuff + drop the limitation
>> that the vectorized loop is a single bb.  Essentially, in this case it would
>> be
>>   vec1 = vec_load (data + i);
>>   vec2 = vec_sqrt (vec1);
>>   if (__builtin_expect (any (vec2 < 0.0)))
>>     {
>>       for (int i = 0; i < vf; i++)
>>         sqrt (vec2[i]);
>>     }
>>   vec_store (data + i, vec2);
>> If that would turn to be way too hard, we could for the vectorization
>> purposes hide that into the .SQRT internal fn, say add a fndecl argument to
>> it if it should treat the exceptional cases some way so that the control
>> flow isn't visible in the vectorized loop.
> 
> If we decide it's worth the trouble I'd rather do that in the epilogue
> and thus make the any (vec2 < 0.0) a reduction.  Like
> 
>    smallest = min(smallest, vec1);
> 
> and after the loop do the errno thing on the smallest element.
> 
> That said, this is a transform that is probably worthwhile even
> on scalar code, possibly easiest to code-gen right from the start
> in the call-dce pass.

If this is useful for something other than errno handling then fine,
but I think it's a really bad idea to add optimization
complexity because of errno handling: nobody checks
errno after sqrt (other than conformance test code).

-fno-math-errno is almost surely closer to what the user
wants than trying to vectorize the errno handling.
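
A minimal test case for trying this out is sketched below; the function name and the file name in the command are illustrative. Built with something like `g++ -O3 -mavx2 -fno-math-errno -fopt-info-vec -c sqrt-test.cc`, the call no longer has an errno side effect, so the loop should be reported as vectorized, although the exact output depends on the GCC version.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Take the square root of every element in place.
// With -fno-math-errno the compiler may assume std::sqrt never sets errno.
void sqrt_all(float* data, std::size_t n)
{
    assert(data != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i)
        data[i] = std::sqrt(data[i]);
}
```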


* Re: autovectorization in gcc
  2019-01-09 16:10       ` David Malcolm
  2019-01-09 16:25         ` Jakub Jelinek
@ 2019-01-09 16:26         ` David Malcolm
  1 sibling, 0 replies; 16+ messages in thread
From: David Malcolm @ 2019-01-09 16:26 UTC (permalink / raw)
  To: Jonathan Wakely, Andrew Haley; +Cc: Kyrill Tkachov, Kay F. Jahnke, gcc

On Wed, 2019-01-09 at 11:10 -0500, David Malcolm wrote:
> On Wed, 2019-01-09 at 09:56 +0000, Jonathan Wakely wrote:
> > On Wed, 9 Jan 2019 at 09:50, Andrew Haley wrote:
> > > I don't agree. Sometimes vectorization is critical. It would be
> > > nice
> > > to have a warning which would fire if vectorization failed. That
> > > would
> > > surely help the OP.
> > 
> > Dave Malcolm has been working on something like that:
> > https://gcc.gnu.org/ml/gcc-patches/2018-09/msg01749.html
> 
> Yes: this code is in trunk for gcc 9, but it doesn't help much for
> the
> case given elsewhere in this thread:
> 
> #include <cmath>
> 
> extern float data [ 32768 ] ;
> 
> extern void vf1()
> {
>    #pragma vectorize enable
>    for ( int i = 0 ; i < 32768 ; i++ )
>      data [ i ] = std::sqrt ( data [ i ] ) ;
> }
> 
> Compiling on this x86_64 box with -fopt-info-vec-missed shows the
> rather cryptic:
> 
> g++ -c /tmp/sqrt-test.cc -O3 -mavx2 -fopt-info-vec-missed
> /tmp/sqrt-test.cc:8:24: missed: couldn't vectorize loop
> /tmp/sqrt-test.cc:8:24: missed: not vectorized: control flow in loop.
> /home/david/coding/gcc-python/gcc-svn-trunk/install-
> dogfood/include/c++/9.0.0/cmath:464:27: missed: statement clobbers
> memory: __builtin_sqrtf (_1);
> 
> and with -fopt-info-vec-all-internals shows:
> 
> g++ -c /tmp/sqrt-test.cc -O3 -mavx2 -fopt-info-vec-all-internals
> 
> Analyzing loop at /tmp/sqrt-test.cc:8
> /tmp/sqrt-test.cc:8:24: note:  === analyze_loop_nest ===
> /tmp/sqrt-test.cc:8:24: note:   === vect_analyze_loop_form ===
> /tmp/sqrt-test.cc:8:24: missed:   not vectorized: control flow in
> loop.
> /tmp/sqrt-test.cc:8:24: missed:  bad loop form.
> /tmp/sqrt-test.cc:8:24: missed: couldn't vectorize loop
> /tmp/sqrt-test.cc:8:24: missed: not vectorized: control flow in loop.
> /tmp/sqrt-test.cc:5:13: note: vectorized 0 loops in function.
> /home/david/coding/gcc-python/gcc-svn-trunk/install-
> dogfood/include/c++/9.0.0/cmath:464:27: note:  ===
> vect_slp_analyze_bb ===
> /home/david/coding/gcc-python/gcc-svn-trunk/install-
> dogfood/include/c++/9.0.0/cmath:464:27: note:   ===
> vect_analyze_data_refs ===
> /home/david/coding/gcc-python/gcc-svn-trunk/install-
> dogfood/include/c++/9.0.0/cmath:464:27: note:   got vectype for stmt:
> _1 = data[i_12];
> vector(8) float
> /home/david/coding/gcc-python/gcc-svn-trunk/install-
> dogfood/include/c++/9.0.0/cmath:464:27: missed:  not vectorized: not
> enough data-refs in basic block.
> /home/david/coding/gcc-python/gcc-svn-trunk/install-
> dogfood/include/c++/9.0.0/cmath:464:27: missed: statement clobbers
> memory: __builtin_sqrtf (_1);
> /tmp/sqrt-test.cc:8:24: note:  === vect_slp_analyze_bb ===
> /tmp/sqrt-test.cc:8:24: note:   === vect_analyze_data_refs ===
> /tmp/sqrt-test.cc:8:24: note:   got vectype for stmt: data[i_12] =
> _7;
> vector(8) float
> /tmp/sqrt-test.cc:8:24: missed:  not vectorized: not enough data-refs 
> in basic block.
> /tmp/sqrt-test.cc:10:1: note:  === vect_slp_analyze_bb ===
> /tmp/sqrt-test.cc:10:1: note:   === vect_analyze_data_refs ===
> /tmp/sqrt-test.cc:10:1: missed:  not vectorized: not enough data-refs 
> in basic block.
> 
> I had to turn on -fdump-tree-all to try to figure out what that
> "control flow in loop" was; it seems to be a guard against the input
> value being negative:
> 
>   <bb 3> [local count: 1063004407]:
>   # i_12 = PHI <0(2), i_6(7)>
>   # ivtmp_10 = PHI <32768(2), ivtmp_2(7)>
>   # DEBUG i => i_12
>   # DEBUG BEGIN_STMT
>   _1 = data[i_12];
>   # DEBUG __x => _1
>   # DEBUG BEGIN_STMT
>   _7 = .SQRT (_1);
>   if (_1 u>= 0.0)
>     goto <bb 8>; [99.95%]
>   else
>     goto <bb 4>; [0.05%]
> 
>   <bb 8> [local count: 1062472912]:
>   goto <bb 5>; [100.00%]
> 
>   <bb 4> [local count: 531495]:
>   __builtin_sqrtf (_1);
> 
> I'm not sure where that control flow came from: it isn't in
>   sqrt-test.cc.104t.stdarg
> but is in
>   sqrt-test.cc.105t.cdce
> so I think it's coming from the argument-range code in cdce.
> 
> Arguably the location on the statement is wrong: it's on the loop
> header, when it presumably should be on the std::sqrt call.
> 
> Shall I file a bugzilla about this?

...and -fno-tree-builtin-call-dce eliminates the control flow, but it
still doesn't vectorize the loop; on godbolt.org with:
  -O3 -mavx2 -fopt-info-vec-all -fno-tree-builtin-call-dce
gcc trunk x86_64 gives:

<source>:8:24: missed: couldn't vectorize loop
/opt/compiler-explorer/gcc-trunk-20190109/include/c++/9.0.0/cmath:464:27: missed: statement clobbers memory: _7 = __builtin_sqrtf (_1);
<source>:5:13: note: vectorized 0 loops in function.
/opt/compiler-explorer/gcc-trunk-20190109/include/c++/9.0.0/cmath:464:27: missed: statement clobbers memory: _7 = __builtin_sqrtf (_1);
Compiler returned: 0

...so presumably it doesn't know how to vectorize that builtin call.

Dave


* Re: autovectorization in gcc
  2019-01-09  9:50   ` Andrew Haley
  2019-01-09  9:56     ` Jonathan Wakely
@ 2019-01-09 10:47     ` Ramana Radhakrishnan
  2019-01-10  9:24     ` Kay F. Jahnke
  2 siblings, 0 replies; 16+ messages in thread
From: Ramana Radhakrishnan @ 2019-01-09 10:47 UTC (permalink / raw)
  To: Andrew Haley; +Cc: Kyrill Tkachov, Kay F. Jahnke, gcc

On Wed, Jan 9, 2019 at 9:50 AM Andrew Haley <aph@redhat.com> wrote:
>
> On 1/9/19 9:45 AM, Kyrill Tkachov wrote:
> > Hi Kay,
> >
> > On 09/01/19 08:29, Kay F. Jahnke wrote:
> >> Hi there!
> >>
> >> I am developing software which tries to deliberately exploit the
> >> compiler's autovectorization facilities by feeding data in
> >> autovectorization-friendly loops. I'm currently using both g++ and
> >> clang++ to see how well this approach works. Using simple arithmetic, I
> >> often get good results. To widen the scope of my work, I was looking for
> >> documentation on which constructs would be recognized by the
> >> autovectorization stage, and found
> >>
> >> https://www.gnu.org/software/gcc/projects/tree-ssa/vectorization.html
> >>
> >
> > Yeah, that page hasn't been updated in ages AFAIK.
> >
> >> By the looks of it, this document has not seen any changes for several
> >> years. Has development on the autovectorization stage stopped, or is
> >> there simply no documentation?
> >>
> >
> > There's plenty of work being done on auto-vectorisation in GCC.
> > Auto-vectorisation is a performance optimisation and as such is not really
> > a user-visible feature that absolutely requires user documentation.
>
> I don't agree. Sometimes vectorization is critical. It would be nice
> to have a warning which would fire if vectorization failed. That would
> surely help the OP.

That would certainly help: the user can get some information out
today with the debug dumps; however, they are designed more for
compiler writers than for users.

regards
Ramana


* Re: autovectorization in gcc
  2019-01-09  9:50   ` Andrew Haley
  2019-01-09  9:56     ` Jonathan Wakely
  2019-01-09 10:47     ` Ramana Radhakrishnan
@ 2019-01-10  9:24     ` Kay F. Jahnke
  2019-01-10 11:18       ` Jonathan Wakely
  2 siblings, 1 reply; 16+ messages in thread
From: Kay F. Jahnke @ 2019-01-10  9:24 UTC (permalink / raw)
  To: Andrew Haley, Kyrill Tkachov, gcc

On 09.01.19 10:50, Andrew Haley wrote:
> On 1/9/19 9:45 AM, Kyrill Tkachov wrote:
>> There's plenty of work being done on auto-vectorisation in GCC.
>> Auto-vectorisation is a performance optimisation and as such is not really
>> a user-visible feature that absolutely requires user documentation.
> 
> I don't agree. Sometimes vectorization is critical. It would be nice
> to have a warning which would fire if vectorization failed. That would
> surely help the OP. 

Further down this thread, some g++ flags were used which produced 
meaningful information about vectorization failures, so the facility is 
there - maybe it's not very prominent.

When it comes to user visibility, I'd like to add that there are great 
differences between different users. I spend most of my time writing 
library code, using template metaprogramming in C++. It's essential for 
my code to perform well (real-time visualization), but I don't have 
intimate compiler knowledge - I'm aiming at writing portable, 
standard-compliant code. I'd like the compilers I use to provide 
extensive documentation if I need to track down a problem, and I dislike 
it if I have to use 'special' commands to get things done. Other users 
may produce target-specific code with one specific compiler, and they 
have different needs. It's better to have documentation and not need it 
than the other way round.

So my idea of a 'contract' regarding vectorization is like this:

- the documentation states the scope of vectorization
- the use of a feature can be forced or disallowed
- or left up to a cost model
- the compiler can be made to produce diagnostic output
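
Some of these controls already exist in GCC in one form or another. The sketch below shows two per-loop hints; the function names are invented for illustration. On the command line, -fno-tree-vectorize turns the vectorizer off, -fvect-cost-model= selects the cost model, and -fopt-info-vec-missed gives the diagnostic output used elsewhere in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Explicit request via OpenMP SIMD. The pragma only takes effect when
// compiling with -fopenmp-simd or -fopenmp; otherwise it is ignored.
// The caller must make sure a and b do not overlap.
void add_simd(float* a, const float* b, std::size_t n)
{
    assert((a != nullptr && b != nullptr) || n == 0);
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        a[i] += b[i];
}

// GCC-specific hint: assume there are no loop-carried dependences.
// The cost model still decides whether vectorizing is worthwhile.
void add_ivdep(float* a, const float* b, std::size_t n)
{
    assert((a != nullptr && b != nullptr) || n == 0);
    #pragma GCC ivdep
    for (std::size_t i = 0; i < n; ++i)
        a[i] += b[i];
}
```

Neither hint is a documented guarantee that a particular loop shape will be vectorized, which is the gap the list above describes.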

Documentation is absolutely essential. If there is lots of development 
in autovectorization, not documenting this work in a way users can 
simply find is - in my eyes - a grave omission. The text 
'Auto-vectorization in GCC' looks like it has last been updated in 2011 
(according to the 'Latest News' section). I'm curious to know what new 
capabilities have been added since then. It makes my life much easier if 
I can write loops to follow a given pattern and rely on the 
autovectorizer, rather than having to use explicit vector code or having 
to rely on a library. There is also another aspect to being dependent on 
external libraries. When a new architecture comes around, chances are 
the compiler writers will be first to support it. It may take years for 
an external library to add a new target ISA, more time until this runs 
smoothly, and then more time until it has trickled down to the package 
repos of most distributions - if this happens at all. Plus you have the 
danger of betting on the wrong horse, and when the very promising 
library you've used to code your stuff goes offline or commercial, 
you've wasted your precious time. Relying only on the compiler brings 
innovation out most reliably and quickly, and is a good strategy to 
avoid wasting resources.

Now I may be missing things here because I haven't dug deeply enough to 
find documentation about autovectorization in gcc. This is why I 
asked to be pointed to 'where the action is'. I was hoping to maybe get 
some helpful hints. My main objective is, after all, to 'deliberately 
exploit the compiler's autovectorization facilities by feeding data in
autovectorization-friendly loops'. The code will run, vectorized or not, 
but it would be great to have good guidelines on what will or will not be 
autovectorized with a given compiler, rather than having to look at the 
assembler output.

Kay

* Re: autovectorization in gcc
  2019-01-10  9:24     ` Kay F. Jahnke
@ 2019-01-10 11:18       ` Jonathan Wakely
  2019-08-18 10:59         ` [wwwdocs PATCH] for " Gerald Pfeifer
  0 siblings, 1 reply; 16+ messages in thread
From: Jonathan Wakely @ 2019-01-10 11:18 UTC (permalink / raw)
  To: Kay F. Jahnke; +Cc: Andrew Haley, Kyrill Tkachov, gcc

On Thu, 10 Jan 2019 at 09:25, Kay F. Jahnke wrote:
> Documentation is absolutely essential. If there is lots of development
> in autovectorization, not documenting this work in a way users can
> simply find is - in my eyes - a grave omission. The text
> 'Auto-vectorization in GCC' looks like it has last been updated in 2011
> (according to the 'Latest News' section). I'm curious to know what new
> capabilities have been added since then.

The page you're looking at documents the project to *add*
autovectorization to GCC. That project was completed many years ago,
and the feature has been present in GCC for years.

I'm not disputing that there could be better documentation, but that
page is not the place to find it. That page should probably get a
notice added saying that the project is complete and that the page is
now only of historical interest.

* [wwwdocs PATCH] for Re: autovectorization in gcc
  2019-01-10 11:18       ` Jonathan Wakely
@ 2019-08-18 10:59         ` Gerald Pfeifer
  0 siblings, 0 replies; 16+ messages in thread
From: Gerald Pfeifer @ 2019-08-18 10:59 UTC (permalink / raw)
  To: Jonathan Wakely
  Cc: Kay F. Jahnke, Andrew Haley, Kyrill Tkachov, gcc, gcc-patches

On Thu, 10 Jan 2019, Jonathan Wakely wrote:
>> [ https://gcc.gnu.org/projects/tree-ssa/vectorization.html ]
> I'm not disputing that there could be better documentation, but that
> page is not the place to find it. That page should probably get a
> notice added saying that the project is complete and that the page is
> now only of historical interest.

Like this? ;-)

Committed.

Gerald

Index: projects/tree-ssa/vectorization.html
===================================================================
RCS file: /cvs/gcc/wwwdocs/htdocs/projects/tree-ssa/vectorization.html,v
retrieving revision 1.42
diff -u -r1.42 vectorization.html
--- projects/tree-ssa/vectorization.html	30 Sep 2018 14:38:57 -0000	1.42
+++ projects/tree-ssa/vectorization.html	18 Aug 2019 10:55:46 -0000
@@ -2,15 +2,17 @@
 <html lang="en">
 
 <head>
-    <title>Auto-vectorization in GCC</title>
+<title>Auto-vectorization in GCC</title>
 <link rel="stylesheet" type="text/css" href="https://gcc.gnu.org/gcc.css" />
 </head>
 
 <body>
     <h1>Auto-vectorization in GCC<br /></h1>
 
-    <p>The goal of this project is to develop a loop and basic block vectorizer in
-    GCC, based on the <a href="./">tree-ssa</a> framework.</p>
+    <p>The goal of this project was to develop a loop and basic block
+    vectorizer in GCC, based on the <a href="./">tree-ssa</a> framework.
+    It has been completed and the functionality has been part of GCC
+    for years.</p>
 
     <h2>Table of Contents</h2>

* Re: autovectorization in gcc
  2019-01-09  9:46 ` Kyrill Tkachov
  2019-01-09  9:50   ` Andrew Haley
@ 2019-01-09 10:56   ` Kay F. Jahnke
  2019-01-09 11:03     ` Jakub Jelinek
  1 sibling, 1 reply; 16+ messages in thread
From: Kay F. Jahnke @ 2019-01-09 10:56 UTC (permalink / raw)
  To: Kyrill Tkachov, gcc

On 09.01.19 10:45, Kyrill Tkachov wrote:

> There's plenty of work being done on auto-vectorisation in GCC.
> Auto-vectorisation is a performance optimisation and as such is not really
> a user-visible feature that absolutely requires user documentation.

Since I'm trying to deliberately exploit it, a more user-visible guise 
would help ;)

>> - repeated use of vectorizable functions
>>
>> for ( int i = 0 ; i < vsz ; i++ )
>>    A [ i ] = sqrt ( B [ i ] ) ;
>>
>> Here, replacing the repeated call of sqrt with the vectorized equivalent
>> gives a dramatic speedup (ca. 4X)

The above is a typical example. So, to give a complete source 'vec_sqrt.cc':

#include <cmath>

extern float data [ 32768 ] ;

extern void vf1()
{
   #pragma vectorize enable
   for ( int i = 0 ; i < 32768 ; i++ )
     data [ i ] = std::sqrt ( data [ i ] ) ;
}

This has a large trip count, the loop is trivial. It's an ideal 
candidate for autovectorization. When I compile this source, using

g++ -O3 -mavx2 -S -o sqrt.s sqrt_gcc.cc

the inner loop translates to:

.L2:
         vmovss  (%rbx), %xmm0
         vucomiss        %xmm0, %xmm2
         vsqrtss %xmm0, %xmm1, %xmm1
         jbe     .L3
         vmovss  %xmm2, 12(%rsp)
         addq    $4, %rbx
         vmovss  %xmm1, 8(%rsp)
         call    sqrtf@PLT
         vmovss  8(%rsp), %xmm1
         vmovss  %xmm1, -4(%rbx)
         cmpq    %rbp, %rbx
         vmovss  12(%rsp), %xmm2
         jne     .L2

AFAICT this is not vectorized; it only processes a single float at a time
(vsqrtss is the scalar form). In vector code, I'd expect the vsqrtps
mnemonic to show up.

> I believe GCC will do some of that already given a high-enough 
> optimisation level
> and floating-point constraints.
> Do you have examples where it doesn't? Testcases with self-contained 
> source code
> and compiler flags would be useful to analyse.

So, see above. With -Ofast the output is similar, just with the inner loop 
unrolled. But maybe I'm missing something? Any hints for additional flags?

>> If the compiler were to provide the autovectorization facilities, and if
>> the patterns it recognizes were well-documented, users could rely on
>> certain code patterns being recognized and autovectorized - sort of a
>> contract between the user and the compiler. With a well-chosen spectrum
>> of patterns, this would make it unnecessary to have to rely on explicit
>> vectorization in many cases. My hope is that such an interface would
>> help vectorization to become more frequently used - as I understand the
>> status quo, this is still a niche topic, even though many processors
>> provide suitable hardware nowadays.
>>
> 
> I wouldn't say it's a niche topic :)
> From my monitoring of the GCC development over the last few years 
> there's been lots
> of improvements in auto-vectorisation in compilers (at least in GCC).

Okay, I'll take your word for it.

> The thing is, auto-vectorisation is not always profitable for performance.
> Sometimes the runtime loop iteration count is so low that setting up the 
> vectorised loop
> (alignment checks, loads/permutes) is slower than just doing the scalar 
> form,
> especially since SIMD performance varies from CPU to CPU.
> So we would want the compiler to have the freedom to make its own 
> judgement on when
> to auto-vectorise rather than enforce a "contract". If the user really 
> only wants
> vector code, they should use one of the explicit programming paradigms.

I know that these issues are important. I am using Vc for explicit 
vectorization, so I can easily code to produce vector code for common 
targets. And I can compare the performance. I have tried the example 
given above on my AVX2 machine, linking with a main program which calls 
'vf1' 32768 times, to get one gigaroot (giggle). The vectorized version 
takes about half a second, the unvectorized about three seconds. With 
functions like sqrt, trigonometric functions, exp and pow, vectorization 
is very profitable. Some further details:

Here's the main program 'memaxs.cc':

float data [ 32768 ] ;
extern void vf1() ;

int main ( int argc , char * argv[] )
{
   for ( int k = 0 ; k < 32768 ; k++ )
   {
     vf1() ;
   }
}

And the compiler call to get a binary:

g++ -O3 -mavx2 -o memaxs sqrt.s memaxs.cc

Here's the performance:

$ time ./memaxs

real    0m3,205s
user    0m3,200s
sys     0m0,004s
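
The shell's 'time' measures the whole process. As an alternative, here is 
a rough sketch of timing the calls from inside the program with 
std::chrono. The helper name time_vf1 and the way the data are initialized 
are assumptions made for illustration, not part of the test above. With 
everything in one file the compiler may also inline vf1, which the 
two-file setup above avoids.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>

constexpr std::size_t N = 32768;
float data[N];

// Same loop as in vec_sqrt.cc, without the pragma.
void vf1()
{
    for (std::size_t i = 0; i < N; i++)
        data[i] = std::sqrt(data[i]);
}

// Hypothetical helper: fill the array, call vf1() 'reps' times and
// return the elapsed wall-clock time in seconds.
double time_vf1(int reps)
{
    for (std::size_t i = 0; i < N; i++)
        data[i] = static_cast<float>(i);

    auto start = std::chrono::steady_clock::now();
    for (int k = 0; k < reps; k++)
        vf1();
    auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_vf1 ( 32768 ) corresponds to the run above. Comparing the 
returned value for builds with different flags gives the same kind of 
comparison without the process start-up cost.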

This variant of vec_sqrt.cc uses Vc ('vc_vec_sqrt.cc')

#include <Vc/Vc>

extern float data [ 32768 ] ;

extern void vf1()
{
   for ( int k = 0 ; k < 32768 ; k += 8 )
   {
     Vc::float_v fv ( data + k ) ;
     fv = sqrt ( fv ) ;
     fv.store ( data + k ) ;
   }
}

Translated to assembler, I get the inner loop

.L2:
         vmovups (%rax), %xmm0
         addq    $32, %rax
         vinsertf128     $0x1, -16(%rax), %ymm0, %ymm0
         vsqrtps %ymm0, %ymm0
         vmovups %xmm0, -32(%rax)
         vextractf128    $0x1, %ymm0, -16(%rax)
         cmpq    %rax, %rdx
         jne     .L2
         vzeroupper
         ret
         .cfi_endproc

Note how the data are read 32 bytes at a time and processed with vsqrtps.

Creating the corresponding binary and executing it:

$ g++ -O3 -mavx2 -o memaxs sqrt_vc.s memaxs.cc -lVc
$ time ./memaxs

real    0m0,548s
user    0m0,544s
sys     0m0,004s

So, I think this performance difference looks like a good enough gain to 
consider my vectorization-of-math-functions proposal. When it comes to 
the gather/scatter with arbitrary indexes, I suppose that's less 
profitable and probably harder to scan for.

Kay

* Re: autovectorization in gcc
  2019-01-09 10:56   ` Kay F. Jahnke
@ 2019-01-09 11:03     ` Jakub Jelinek
  2019-01-09 11:21       ` Jakub Jelinek
  0 siblings, 1 reply; 16+ messages in thread
From: Jakub Jelinek @ 2019-01-09 11:03 UTC (permalink / raw)
  To: Kay F. Jahnke; +Cc: Kyrill Tkachov, gcc

On Wed, Jan 09, 2019 at 11:56:03AM +0100, Kay F. Jahnke wrote:
> The above is a typical example. So, to give a complete source 'vec_sqrt.cc':
> 
> #include <cmath>
> 
> extern float data [ 32768 ] ;
> 
> extern void vf1()
> {
>   #pragma vectorize enable
>   for ( int i = 0 ; i < 32768 ; i++ )
>     data [ i ] = std::sqrt ( data [ i ] ) ;
> }
> 
> This has a large trip count, the loop is trivial. It's an ideal candidate
> for autovectorization. When I compile this source, using
> 
> g++ -O3 -mavx2 -S -o sqrt.s sqrt_gcc.cc

Generally you want -Ofast or -ffast-math or at least some suboptions of that
if you want to vectorize floating point loops, because vectorization in many
cases changes where FPU exceptions would be generated, can affect precision
by reordering the ops etc. In the above case it is just that glibc
declares the vector math functions for #ifdef __FAST_MATH__ only, as they
have worse precision.

Note, gcc doesn't recognize #pragma vectorize, you can use e.g.
#pragma omp simd
or
#pragma GCC ivdep
if you want to assert some properties of the loop the compiler can't easily
prove itself that would help the vectorization.
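
As a minimal sketch of the second pragma (the function sqrt_all and its 
signature are made up for illustration): with two pointer arguments the 
compiler cannot know whether the arrays overlap, and the pragma asserts 
that there is no loop-carried dependence it has to worry about. The pragma 
does not change the floating-point or errno rules, so the flags discussed 
above still matter.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper, not from the thread: out[i] = sqrt(in[i]).
void sqrt_all(float* out, const float* in, std::size_t n)
{
    // Promise to GCC that iterations are independent of each other,
    // e.g. that 'out' and 'in' do not overlap in a harmful way.
#pragma GCC ivdep
    for (std::size_t i = 0; i < n; i++)
        out[i] = std::sqrt(in[i]);
}
```

'#pragma omp simd' is placed in the same position, but it only takes 
effect when compiling with -fopenmp-simd or -fopenmp.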

	Jakub

* Re: autovectorization in gcc
  2019-01-09 11:03     ` Jakub Jelinek
@ 2019-01-09 11:21       ` Jakub Jelinek
  0 siblings, 0 replies; 16+ messages in thread
From: Jakub Jelinek @ 2019-01-09 11:21 UTC (permalink / raw)
  To: Kay F. Jahnke; +Cc: Kyrill Tkachov, gcc

On Wed, Jan 09, 2019 at 12:03:45PM +0100, Jakub Jelinek wrote:
> > The above is a typical example. So, to give a complete source 'vec_sqrt.cc':
> > 
> > #include <cmath>
> > 
> > extern float data [ 32768 ] ;
> > 
> > extern void vf1()
> > {
> >   #pragma vectorize enable
> >   for ( int i = 0 ; i < 32768 ; i++ )
> >     data [ i ] = std::sqrt ( data [ i ] ) ;
> > }
> > 
> > This has a large trip count, the loop is trivial. It's an ideal candidate
> > for autovectorization. When I compile this source, using
> > 
> > g++ -O3 -mavx2 -S -o sqrt.s sqrt_gcc.cc
> 
> Generally you want -Ofast or -ffast-math or at least some suboptions of that
> if you want to vectorize floating point loops, because vectorization in many
> cases changes where FPU exceptions would be generated, can affect precision
> by reordering the ops etc. In the above case it is just that glibc
> declares the vector math functions for #ifdef __FAST_MATH__ only, as they
> have worse precision.

Actually, the last sentence was a wrong guess in this case: for sqrt no
glibc libcall is needed (that applies to trigonometric functions and the
like). All you need from -ffast-math for the above to vectorize is
-fno-math-errno, which tells the compiler you don't need errno set if you
call sqrt on a negative argument etc.
With -fopt-info-vec-missed the compiler would tell you:
/tmp/1.c:5:3: note: not vectorized: control flow in loop.
/tmp/1.c:5:3: note: bad loop form.
and you could look at the dumps to see that there is
  _2 = .SQRT (_1);
  if (_1 u>= 0.0)
    goto <bb 8>; [99.95%]
  else
    goto <bb 4>; [0.05%]
...
  <bb 4> [local count: 531495]:
  __builtin_sqrt (_1);
which is the idiom to do sqrt inline using the hardware instruction, but in
the unlikely case when the argument is negative, also call the library
function so that it handles the errno setting.
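
Put together, a sketch of the test case with the unrecognized pragma 
dropped and the build line adjusted. The array is defined in the same file 
here only to keep the example self-contained, and the file names in the 
comment are just the ones used earlier in the thread.

```cpp
// Build sketch:
//   g++ -O3 -mavx2 -fno-math-errno -fopt-info-vec -S -o sqrt.s vec_sqrt.cc
// -fopt-info-vec reports the loops that were vectorized;
// -fopt-info-vec-missed reports the ones that were not, with a reason.
#include <cmath>

float data[32768];

void vf1()
{
    // Without -fno-math-errno this loop contains the errno fallback
    // branch described above, i.e. control flow inside the loop.
    for (int i = 0; i < 32768; i++)
        data[i] = std::sqrt(data[i]);
}
```

If the loop is vectorized, the generated assembler should contain vsqrtps 
instead of vsqrtss.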

	Jakub

end of thread, other threads:[~2019-08-18 10:59 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
2019-01-09  8:29 autovectorization in gcc Kay F. Jahnke
2019-01-09  9:46 ` Kyrill Tkachov
2019-01-09  9:50   ` Andrew Haley
2019-01-09  9:56     ` Jonathan Wakely
2019-01-09 16:10       ` David Malcolm
2019-01-09 16:25         ` Jakub Jelinek
2019-01-10  8:19           ` Richard Biener
2019-01-10 11:11             ` Szabolcs Nagy
2019-01-09 16:26         ` David Malcolm
2019-01-09 10:47     ` Ramana Radhakrishnan
2019-01-10  9:24     ` Kay F. Jahnke
2019-01-10 11:18       ` Jonathan Wakely
2019-08-18 10:59         ` [wwwdocs PATCH] for " Gerald Pfeifer
2019-01-09 10:56   ` Kay F. Jahnke
2019-01-09 11:03     ` Jakub Jelinek
2019-01-09 11:21       ` Jakub Jelinek
