[Bug tree-optimization/51499] New: vectorizer missing simple case

public inbox for gcc-bugs@sourceware.org

* [Bug tree-optimization/51499] New: vectorizer missing simple case
@ 2011-12-10 18:39 fb.programming at gmail dot com
  2011-12-11  8:29 ` [Bug tree-optimization/51499] " irar at il dot ibm.com
                   ` (14 more replies)
  0 siblings, 15 replies; 16+ messages in thread
From: fb.programming at gmail dot com @ 2011-12-10 18:39 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

             Bug #: 51499
           Summary: vectorizer missing simple case
    Classification: Unclassified
           Product: gcc
           Version: 4.6.2
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: fb.programming@gmail.com


The SSE vectorizer seems to miss one of the simplest cases:

#include <cstdio>
#include <cstdlib>

double loop(double a, size_t n){
   // initialise differently so compiler doesn't simplify
   double sum1=0.1, sum2=0.2, sum3=0.3, sum4=0.4, sum5=0.5, sum6=0.6;
   for(size_t i=0; i<n; i++){
      sum1+=a; sum2+=a; sum3+=a; sum4+=a; sum5+=a; sum6+=a;
   }
   return sum1+sum2+sum3+sum4+sum5+sum6-2.1-6.0*a*n;
}

int main(int argc, char** argv) {
   size_t n=1000000;
   double a=1.1;
   printf("res=%f\n", loop(a,n));
   return EXIT_SUCCESS;
}

g++-4.6.2 -Wall -O2 -ftree-vectorize -ftree-vectorizer-verbose=2 test.cpp

test.cpp:7: note: not vectorized: unsupported use in stmt.
test.cpp:4: note: vectorized 0 loops in function.

We get six addsd operations - whereas an optimisation should have
given us three addpd operations.

.L3:
    addq    $1, %rax
    addsd    %xmm0, %xmm6
    cmpq    %rdi, %rax
    addsd    %xmm0, %xmm5
    addsd    %xmm0, %xmm4
    addsd    %xmm0, %xmm3
    addsd    %xmm0, %xmm2
    addsd    %xmm0, %xmm1
    jne    .L3



* [Bug tree-optimization/51499] vectorizer missing simple case
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
@ 2011-12-11  8:29 ` irar at il dot ibm.com
  2011-12-11  8:48 ` fb.programming at gmail dot com
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: irar at il dot ibm.com @ 2011-12-11  8:29 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

Ira Rosen <irar at il dot ibm.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |irar at il dot ibm.com

--- Comment #1 from Ira Rosen <irar at il dot ibm.com> 2011-12-11 07:37:36 UTC ---
You need -ffast-math to allow floating point reduction.
You also need -fno-vect-cost-model, because the vectorization is not profitable
in this case.
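
For background on why a floating-point reduction needs such a flag: IEEE-754 addition is not associative, so regrouping a sum can change its value. A minimal sketch, assuming IEEE-754 doubles with round-to-nearest; the function names are illustrative and not part of GCC:

```cpp
// Illustrative only: these helpers are hypothetical, not GCC internals.

// The grouping the source code asks for.
double sum_left_to_right(double a, double b, double c) {
    return (a + b) + c;
}

// A regrouping, such as a reordered reduction might produce.
double sum_regrouped(double a, double b, double c) {
    return a + (b + c);
}
```

With the inputs (1e16, -1e16, 1.0), the first function returns 1.0 and the second returns 0.0, because -1e16 + 1.0 rounds back to -1e16.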



* [Bug tree-optimization/51499] vectorizer missing simple case
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
  2011-12-11  8:29 ` [Bug tree-optimization/51499] " irar at il dot ibm.com
@ 2011-12-11  8:48 ` fb.programming at gmail dot com
  2011-12-11  9:05 ` irar at il dot ibm.com
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: fb.programming at gmail dot com @ 2011-12-11  8:48 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

--- Comment #2 from fb.programming at gmail dot com 2011-12-11 08:33:40 UTC ---
(In reply to comment #1)

g++-4.6.2 -S -Wall -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 \
          -ffast-math  -fno-vect-cost-model

gives me exactly the same assembly code as above (which surprises me a
bit, as -funsafe-math-optimizations might as well have eliminated the
loop completely).

The optimal assembly, however, I would expect to be something like:

.L3:
    addq    $1, %rax
    addpd    %xmm0, %xmm3
    cmpq    %rdi, %rax
    addpd    %xmm0, %xmm2
    addpd    %xmm0, %xmm1
    jne    .L3

Here the vector (sum1,sum2) is stored in xmm1, (sum3,sum4) in xmm2,
etc., and (a,a) in xmm0. This speeds it up by a factor of 2
and is completely equivalent to the scalar case, so I don't see why
-ffast-math (which implies -funsafe-math-optimizations) should be
necessary in this case either.
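
The equivalence claimed here can be checked with a portable model. The sketch below is not SSE code: `Pair` and `add_lanes` are hypothetical stand-ins for a 128-bit register and one addpd, and the starting values are taken from the test case.

```cpp
#include <array>
#include <cstddef>

// Six independent scalar accumulators, as in the reported test case.
std::array<double, 6> accumulate_scalar(double a, std::size_t n) {
    std::array<double, 6> sum = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6};
    for (std::size_t i = 0; i < n; ++i)
        for (double& s : sum) s += a;
    return sum;
}

// Hypothetical stand-in for a 128-bit register holding two doubles.
struct Pair { double lo, hi; };

// Lane-wise addition, modelling what a single addpd does.
inline Pair add_lanes(Pair x, Pair y) { return {x.lo + y.lo, x.hi + y.hi}; }

// The same six sums, kept as three two-lane values.
std::array<double, 6> accumulate_paired(double a, std::size_t n) {
    Pair s12 = {0.1, 0.2}, s34 = {0.3, 0.4}, s56 = {0.5, 0.6};
    const Pair aa = {a, a};
    for (std::size_t i = 0; i < n; ++i) {
        s12 = add_lanes(s12, aa);
        s34 = add_lanes(s34, aa);
        s56 = add_lanes(s56, aa);
    }
    return {s12.lo, s12.hi, s34.lo, s34.hi, s56.lo, s56.hi};
}
```

Each accumulator receives the same additions in the same order in both versions, so the two arrays should compare equal element by element without any relaxed-math flag.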



* [Bug tree-optimization/51499] vectorizer missing simple case
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
  2011-12-11  8:29 ` [Bug tree-optimization/51499] " irar at il dot ibm.com
  2011-12-11  8:48 ` fb.programming at gmail dot com
@ 2011-12-11  9:05 ` irar at il dot ibm.com
  2011-12-11 12:13 ` fb.programming at gmail dot com
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: irar at il dot ibm.com @ 2011-12-11  9:05 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

--- Comment #3 from Ira Rosen <irar at il dot ibm.com> 2011-12-11 08:48:24 UTC ---
It gets vectorized with 4.7.
I guess, due to this 4.7 patch
http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00620.html.



* [Bug tree-optimization/51499] vectorizer missing simple case
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
                   ` (2 preceding siblings ...)
  2011-12-11  9:05 ` irar at il dot ibm.com
@ 2011-12-11 12:13 ` fb.programming at gmail dot com
  2011-12-11 13:39 ` irar at il dot ibm.com
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: fb.programming at gmail dot com @ 2011-12-11 12:13 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

--- Comment #4 from fb.programming at gmail dot com 2011-12-11 11:52:30 UTC ---
Looks like there has been some great progress in gcc 4.7!

Still I think it behaves slightly buggy.

(1) In this case it should work without -funsafe-math-optimizations but
    it doesn't. gcc 4.7 requires -fno-signed-zeros -fno-trapping-math
   -fassociative-math to make it work.

(2) The prediction:
       7: not vectorized: vectorization not profitable.
    is just wrong. Forcing it with -fno-vect-cost-model shows it speeds up
    by factor of 2.

(3) If I change all double's into float's in the code above it seems to
    work without forcing it (-fno-vect-cost-model):


   g++-4.7 -S -Wall -O2  -ftree-vectorize -ftree-vectorizer-verbose=2 \
           -funsafe-math-optimizations test.cpp

   Analyzing loop at test.cpp:7


   Vectorizing loop at test.cpp:7

   7: vectorizing stmts using SLP.
   7: LOOP VECTORIZED.
   test.cpp:4: note: vectorized 1 loops in function.


    However, it hasn't vectorized it at all as the assembly shows:

.L11:
    addq    $1, %rax
    addss    %xmm0, %xmm3
    cmpq    %rax, %rdi
    addss    %xmm0, %xmm4
    addss    %xmm0, %xmm7
    addss    %xmm0, %xmm6
    addss    %xmm0, %xmm5
    addss    %xmm0, %xmm1
    ja    .L11



* [Bug tree-optimization/51499] vectorizer missing simple case
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
                   ` (3 preceding siblings ...)
  2011-12-11 12:13 ` fb.programming at gmail dot com
@ 2011-12-11 13:39 ` irar at il dot ibm.com
  2011-12-11 14:55 ` dominiq at lps dot ens.fr
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: irar at il dot ibm.com @ 2011-12-11 13:39 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

--- Comment #5 from Ira Rosen <irar at il dot ibm.com> 2011-12-11 13:30:41 UTC ---
(In reply to comment #4)
> Looks like there has been some great progress in gcc 4.7!
> 
> Still I think it behaves slightly buggy.
> 
> (1) In this case it should work without -funsafe-math-optimizations but
>     it doesn't. gcc 4.7 requires -fno-signed-zeros -fno-trapping-math
>    -fassociative-math to make it work.
> 

It's reduction, when we vectorize we change the order of computation. In order
to be able to do that for floating point we need flag_associative_math.

> (2) The prediction:
>        7: not vectorized: vectorization not profitable.
>     is just wrong. Forcing it with -fno-vect-cost-model shows it speeds up
>     by factor of 2.
> 
> (3) If I change all double's into float's in the code above it seems to
>     work without forcing it (-fno-vect-cost-model):
> 
> 
>    g++-4.7 -S -Wall -O2  -ftree-vectorize -ftree-vectorizer-verbose=2 \
>            -funsafe-math-optimizations test.cpp
> 
>    Analyzing loop at test.cpp:7
> 
> 
>    Vectorizing loop at test.cpp:7
> 
>    7: vectorizing stmts using SLP.
>    7: LOOP VECTORIZED.
>    test.cpp:4: note: vectorized 1 loops in function.
> 
> 
>     However, it hasn't vectorized it at all as the assembly shows:
> 
> .L11:
>     addq    $1, %rax
>     addss    %xmm0, %xmm3
>     cmpq    %rax, %rdi
>     addss    %xmm0, %xmm4
>     addss    %xmm0, %xmm7
>     addss    %xmm0, %xmm6
>     addss    %xmm0, %xmm5
>     addss    %xmm0, %xmm1
>     ja    .L11


I think you are looking at the scalar epilogue. The number of iterations is
unknown, so we need an epilogue loop for the case that number of iterations is
not a multiple of 4.
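
The main-loop-plus-epilogue shape described here can be sketched in plain C++. This is a simplified model on an array rather than the reduction from the test case; the vectorization factor of 4 (four floats per 128-bit register) and the function name are assumptions made for illustration.

```cpp
#include <cstddef>

// Adds `a` to x[0..n) and returns how many elements the scalar
// epilogue had to handle (always n % 4 in this sketch).
std::size_t add_to_all(float* x, std::size_t n, float a) {
    constexpr std::size_t vf = 4;  // assumed vectorization factor
    std::size_t i = 0;
    // Main loop: each trip stands for one vector iteration over vf elements.
    for (; i + vf <= n; i += vf)
        for (std::size_t lane = 0; lane < vf; ++lane)
            x[i + lane] += a;
    // Scalar epilogue: whatever is left when n is not a multiple of vf.
    const std::size_t leftover = n - i;
    for (; i < n; ++i)
        x[i] += a;
    return leftover;
}
```

Because n is only known at run time, both loops have to exist in the generated code, which is why a scalar loop still shows up in the assembly of a vectorized function.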



* [Bug tree-optimization/51499] vectorizer missing simple case
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
                   ` (4 preceding siblings ...)
  2011-12-11 13:39 ` irar at il dot ibm.com
@ 2011-12-11 14:55 ` dominiq at lps dot ens.fr
  2011-12-11 16:58 ` fb.programming at gmail dot com
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-12-11 14:55 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

--- Comment #6 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-12-11 14:14:01 UTC ---
> I think you are looking at the scalar epilogue. The number of iterations is
> unknown, so we need an epilogue loop for the case that number of iterations is
> not a multiple of 4.

While investigating pr51597, I have found that vectorized loops in programs as
simple as

subroutine spmmult(x,b,ad)
implicit none
integer, parameter :: nxyz=1008315
real(8),dimension(nxyz):: x,b,ad
b = ad*x
end subroutine spmmult               !=========================================

always have an additional non-vectorized loop, i.e. a vectorized one

L3:
        movsd   (%r9,%rax), %xmm1
        addq    $1, %rcx
        movapd  (%r10,%rax), %xmm0
        movhpd  8(%r9,%rax), %xmm1
        mulpd   %xmm1, %xmm0
        movlpd  %xmm0, (%r8,%rax)
        movhpd  %xmm0, 8(%r8,%rax)
        addq    $16, %rax
        cmpq    $504156, %rcx
        jbe     L3

and a non-vectorized one

L5:
        movsd   -8(%rdi,%rax,8), %xmm0
        mulsd   -8(%rdx,%rax,8), %xmm0
        movsd   %xmm0, -8(%rsi,%rax,8)
        addq    $1, %rax
        cmpq    %rcx, %rax
        jne     L5

even when the above loops are unrolled. How can the loop L5 be unrolled if it
is only there for a "scalar epilogue"?



* [Bug tree-optimization/51499] vectorizer missing simple case
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
                   ` (5 preceding siblings ...)
  2011-12-11 14:55 ` dominiq at lps dot ens.fr
@ 2011-12-11 16:58 ` fb.programming at gmail dot com
  2011-12-12 11:13 ` irar at il dot ibm.com
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: fb.programming at gmail dot com @ 2011-12-11 16:58 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

--- Comment #7 from fb.programming at gmail dot com 2011-12-11 14:55:13 UTC ---
(In reply to comment #5)

> > (3) If I change all double's into float's in the code above it seems to

> I think you are looking at the scalar epilogue. The number of iterations is
> unknown, so we need an epilogue loop for the case that number of iterations is
> not a multiple of 4.

Yes you're right. Sorry about that, my mistake.


> > (1) In this case it should work without -funsafe-math-optimizations but
> >     it doesn't. gcc 4.7 requires -fno-signed-zeros -fno-trapping-math
> >    -fassociative-math to make it work.
> > 
> 
> It's reduction, when we vectorize we change the order of computation. In order
> to be able to do that for floating point we need flag_associative_math.

In some cases it might be necessary but not here:

 sum1+=a;
 sum2+=a;

gives exactly the same result as

 (sum1, sum2) += (a, a);

Let's take a more applied example, say calculating the sum of 1/i:

   double harmon(int n) {
      double sum=0.0;
      for(int i=1; i<n; i++){
         sum += 1.0/i;
      }
      return sum;
   }

This requires reordering of the sum to be vectorized, so in this case
I agree we need -funsafe-math-optimizations.
However, one could manually split the sum:

   double harmon(int n) {
      assert(n%2==1);   // odd n, so the pairs (i, i+1) cover exactly 1..n-1
      double sum1=0.0, sum2=0.0;
      for(int i=1; i<n; i+=2){
         sum1 += 1.0/i;
         sum2 += 1.0/(i+1);
      }
      return sum1+sum2;
   }

and now I'd expect the compiler to vectorize this without
-funsafe-math-optimizations as it doesn't change any computational
results:

         (sum1, sum2) += (1.0/i, 1.0/(i+1));

I can attach a test case with that example if that'd be useful?



* [Bug tree-optimization/51499] vectorizer missing simple case
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
                   ` (6 preceding siblings ...)
  2011-12-11 16:58 ` fb.programming at gmail dot com
@ 2011-12-12 11:13 ` irar at il dot ibm.com
  2011-12-12 11:23 ` irar at il dot ibm.com
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: irar at il dot ibm.com @ 2011-12-12 11:13 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

--- Comment #8 from Ira Rosen <irar at il dot ibm.com> 2011-12-12 11:03:59 UTC ---
(In reply to comment #6)

> While investigating pr51597, I have found that vectorized loops in programs as
> simple as
> 
> subroutine spmmult(x,b,ad)
> implicit none
> integer, parameter :: nxyz=1008315
> real(8),dimension(nxyz):: x,b,ad
> b = ad*x
> end subroutine spmmult               !=========================================
> 
> always have an additional non-vectorized loop,

This loop has a prologue loop for alignment purposes.

> L5:
>         movsd   -8(%rdi,%rax,8), %xmm0
>         mulsd   -8(%rdx,%rax,8), %xmm0
>         movsd   %xmm0, -8(%rsi,%rax,8)
>         addq    $1, %rax
>         cmpq    %rcx, %rax
>         jne     L5
> 
> even when the above loops are unrolled. How can the loop L5 be unrolled if it
> is only there for a "scalar epilogue"?

It can't be unrolled, since the alignment is unknown, so we don't know the
number of iterations of the prologue loop, and, therefore, we don't know the
number of iterations of the epilogue.
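
The dependence of the epilogue trip count on the run-time alignment can be shown with a small model. The sketch assumes 8-byte doubles, a 16-byte alignment target, a vectorization factor of 2, and a pointer that is at least 8-byte aligned; `LoopSplit` and `split_for_alignment` are hypothetical names, and GCC's actual peeling logic is more involved.

```cpp
#include <cstddef>
#include <cstdint>

// How n scalar iterations are divided between the three loops.
struct LoopSplit {
    std::size_t prologue;      // scalar iterations peeled for alignment
    std::size_t vector_iters;  // iterations of the vectorized loop
    std::size_t epilogue;      // scalar iterations left at the end
};

LoopSplit split_for_alignment(const double* p, std::size_t n) {
    constexpr std::size_t vf = 2;  // two doubles per 128-bit vector
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    // Peel one element if p is not yet on a 16-byte boundary.
    std::size_t prologue = (addr % 16 == 0) ? 0 : 1;
    if (prologue > n) prologue = n;
    const std::size_t rest = n - prologue;
    return {prologue, rest / vf, rest % vf};
}
```

For the same n = 8, an aligned pointer gives (0, 4, 0) while a pointer offset by one double gives (1, 3, 1), so neither scalar trip count is a compile-time constant.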



* [Bug tree-optimization/51499] vectorizer missing simple case
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
                   ` (7 preceding siblings ...)
  2011-12-12 11:13 ` irar at il dot ibm.com
@ 2011-12-12 11:23 ` irar at il dot ibm.com
  2011-12-12 11:27 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: irar at il dot ibm.com @ 2011-12-12 11:23 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

--- Comment #9 from Ira Rosen <irar at il dot ibm.com> 2011-12-12 11:13:24 UTC ---
(In reply to comment #7)
> 
> In some cases it might be necessary but not here:
> 
>  sum1+=a;
>  sum2+=a;
> 
> gives exactly the same result as
> 
>  (sum1, sum2) += (a, a);
> 

So, you are suggesting to remove the need in flag_associative_math for fp for
cases when a reduction computation is already unrolled by the vectorization
factor. Sounds reasonable to me.



* [Bug tree-optimization/51499] vectorizer missing simple case
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
                   ` (8 preceding siblings ...)
  2011-12-12 11:23 ` irar at il dot ibm.com
@ 2011-12-12 11:27 ` rguenth at gcc dot gnu.org
  2011-12-12 12:21 ` irar at il dot ibm.com
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: rguenth at gcc dot gnu.org @ 2011-12-12 11:27 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

--- Comment #10 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-12-12 11:24:22 UTC ---
Hmm.  But we are vectorizing

  sum += a[i]
  sum += a[i+1]

the same as

  sum += a[i+1]
  sum += a[i]

no?  Thus you have to check whether the summation occurs in "memory order"?



* [Bug tree-optimization/51499] vectorizer missing simple case
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
                   ` (9 preceding siblings ...)
  2011-12-12 11:27 ` rguenth at gcc dot gnu.org
@ 2011-12-12 12:21 ` irar at il dot ibm.com
  2011-12-12 13:10 ` dominiq at lps dot ens.fr
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: irar at il dot ibm.com @ 2011-12-12 12:21 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

--- Comment #11 from Ira Rosen <irar at il dot ibm.com> 2011-12-12 11:27:26 UTC ---
Right. We need to check that there is no load permutation.
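
Why the order matters for a single accumulator can be seen with a small experiment. A minimal sketch, assuming IEEE-754 doubles with round-to-nearest; the function names and the input values are chosen for illustration only.

```cpp
#include <cstddef>

// Adds each pair in memory order: a[i] first, then a[i+1].
double sum_in_order(const double* a, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i + 1 < n; i += 2) {
        sum += a[i];
        sum += a[i + 1];
    }
    return sum;
}

// Adds each pair with the two loads swapped: a[i+1] first, then a[i].
double sum_swapped(const double* a, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i + 1 < n; i += 2) {
        sum += a[i + 1];
        sum += a[i];
    }
    return sum;
}
```

For the input {1.0, 1e16, -1e16, 1.0}, `sum_in_order` returns 1.0 and `sum_swapped` returns 0.0, since adding 1.0 to 1e16 is lost to rounding. If both source orders map to the same vector code, at least one of them has been reordered, and that is only valid under -fassociative-math.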



* [Bug tree-optimization/51499] vectorizer missing simple case
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
                   ` (10 preceding siblings ...)
  2011-12-12 12:21 ` irar at il dot ibm.com
@ 2011-12-12 13:10 ` dominiq at lps dot ens.fr
  2011-12-12 14:31 ` fb.programming at gmail dot com
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: dominiq at lps dot ens.fr @ 2011-12-12 13:10 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

--- Comment #12 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-12-12 12:47:54 UTC ---
> > even when the above loops are unrolled. How can the loop L5 be unrolled if it
> > is only there for a "scalar epilogue"?
>
> It can't be unrolled, since the alignment is unknown, so we don't know the
> number of iterations of the prologue loop, and, therefore, we don't know the
> number of iterations of the epilogue.

Well, it is unrolled with -funroll-loops, for instance if I compile with
'-Ofast -funroll-loops --param max-unroll-times=4', I get

L3:
        movsd   (%r8,%r11), %xmm3
        addq    $4, %r10
        movsd   16(%r8,%r11), %xmm5
        movsd   32(%r8,%r11), %xmm7
        movhpd  8(%r8,%r11), %xmm3
        movsd   48(%r8,%r11), %xmm9
        movhpd  24(%r8,%r11), %xmm5
        movapd  (%r9,%r11), %xmm4
        movhpd  40(%r8,%r11), %xmm7
        movapd  16(%r9,%r11), %xmm6
        movhpd  56(%r8,%r11), %xmm9
        movapd  32(%r9,%r11), %xmm8
        mulpd   %xmm3, %xmm4
        movapd  48(%r9,%r11), %xmm10
        mulpd   %xmm5, %xmm6
        mulpd   %xmm7, %xmm8
        mulpd   %xmm9, %xmm10
        movlpd  %xmm4, (%rcx,%r11)
        movhpd  %xmm4, 8(%rcx,%r11)
        movlpd  %xmm6, 16(%rcx,%r11)
        movhpd  %xmm6, 24(%rcx,%r11)
        movlpd  %xmm8, 32(%rcx,%r11)
        movhpd  %xmm8, 40(%rcx,%r11)
        movlpd  %xmm10, 48(%rcx,%r11)
        movhpd  %xmm10, 56(%rcx,%r11)
        addq    $64, %r11
        cmpq    $504156, %r10
        jbe     L3

and

L5:
        movsd   -8(%rdi,%r9,8), %xmm15
        leaq    1(%r9), %rbx
        leaq    2(%r9), %r8
        movsd   -8(%rdi,%rbx,8), %xmm0
        leaq    3(%r9), %rcx
        movsd   -8(%rdi,%r8,8), %xmm1
        mulsd   -8(%rdx,%r9,8), %xmm15
        movsd   -8(%rdi,%rcx,8), %xmm2
        mulsd   -8(%rdx,%rbx,8), %xmm0
        mulsd   -8(%rdx,%r8,8), %xmm1
        mulsd   -8(%rdx,%rcx,8), %xmm2
        movsd   %xmm15, -8(%rsi,%r9,8)
        addq    $4, %r9
        cmpq    %r12, %r9
        movsd   %xmm0, -8(%rsi,%rbx,8)
        movsd   %xmm1, -8(%rsi,%r8,8)
        movsd   %xmm2, -8(%rsi,%rcx,8)
        jne     L5

So both the vectorized and the unvectorized loops are unrolled four times. This
does not seem logical to me if the L5 loop were there only to handle a leftover
scalar (AFAIU %xmm* store only one or two doubles and there is at most one left
if the length is odd or if the length is even and the first one has been peeled
for alignment).

I am also puzzled by the way the vectors are stored back as a pair

        movlpd  %xmm4, (%rcx,%r11)
        movhpd  %xmm4, 8(%rcx,%r11)

Why not a 'movapd' instead?



* [Bug tree-optimization/51499] vectorizer missing simple case
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
                   ` (11 preceding siblings ...)
  2011-12-12 13:10 ` dominiq at lps dot ens.fr
@ 2011-12-12 14:31 ` fb.programming at gmail dot com
  2011-12-13 16:33 ` irar at il dot ibm.com
  2021-08-07  5:19 ` [Bug tree-optimization/51499] -Ofast does not vectorize while -O3 does pinskia at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: fb.programming at gmail dot com @ 2011-12-12 14:31 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

--- Comment #13 from fb.programming at gmail dot com 2011-12-12 14:20:58 UTC ---
(In reply to comment #9)

> So, you are suggesting to remove the need in flag_associative_math for fp for
> cases when a reduction computation is already unrolled by the vectorization
> factor. Sounds reasonable to me.

Yes I think that's it, basically only require flag_associative_math if
the order of summation or products is changed by the vectorizer. That is
quite important I think, as most of the time
 -ffast-math / -funsafe-math-optimizations / -fassociative-math
might not be acceptable for many projects.

However, I don't fully understand Richard Guenther's example. Yes his
example requires -fassociative-math to be vectorized, however, my example
would translate to something like

  sum1 += a[i];
  sum2 += a[i+1];

and now it doesn't matter if it's executed this way or the other way
around

  sum2 += a[i+1];
  sum1 += a[i];

The second issue is to double-check the profitability calculation,
as it wrongly decided:

  7: not vectorized: vectorization not profitable.



* [Bug tree-optimization/51499] vectorizer missing simple case
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
                   ` (12 preceding siblings ...)
  2011-12-12 14:31 ` fb.programming at gmail dot com
@ 2011-12-13 16:33 ` irar at il dot ibm.com
  2021-08-07  5:19 ` [Bug tree-optimization/51499] -Ofast does not vectorize while -O3 does pinskia at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: irar at il dot ibm.com @ 2011-12-13 16:33 UTC (permalink / raw)
  To: gcc-bugs

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

--- Comment #14 from Ira Rosen <irar at il dot ibm.com> 2011-12-13 16:27:19 UTC ---
(In reply to comment #13)
> 
> However, I don't fully understand Richard Guenther's example. Yes his
> example requires -fassociative-math to be vectorized, however, my example
> would translate to something like
> 
>   sum1 += a[i];
>   sum2 += a[i+1];
> 
> and now it doesn't matter if it's executed this way or the other way
> around
> 
>   sum2 += a[i+1];
>   sum1 += a[i];

The problem is probably more in the implementation. The change of order will
also swap sum1 and sum2, so when you want to return sum1+sum2, the
vectorized version will return sum2+sum1.



* [Bug tree-optimization/51499] -Ofast does not vectorize while -O3 does.
  2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
                   ` (13 preceding siblings ...)
  2011-12-13 16:33 ` irar at il dot ibm.com
@ 2021-08-07  5:19 ` pinskia at gcc dot gnu.org
  14 siblings, 0 replies; 16+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-07  5:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-08-07
     Ever confirmed|0                           |1
            Summary|vectorizer missing simple   |-Ofast does not vectorize
                   |case                        |while -O3 does.
             Status|UNCONFIRMED                 |NEW

--- Comment #15 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
So here is the interesting part for trunk:
With -O3 we can vectorize the loop because we are using the SLP vectorizer, but
with -Ofast we don't, as we consider the vectorization too costly.

The innermost loop for -O3:
.L3:
        addq    $1, %rax
        addpd   %xmm1, %xmm2
        addpd   %xmm1, %xmm3
        addpd   %xmm1, %xmm4
        cmpq    %rax, %rdi
        jne     .L3

The SLP vectorizer has done this since GCC 11.

Here is the inner loop for -Ofast:
.L3:
        addq    $1, %rax
        addsd   %xmm0, %xmm3
        addsd   %xmm0, %xmm6
        addsd   %xmm0, %xmm1
        addsd   %xmm0, %xmm5
        addsd   %xmm0, %xmm2
        addsd   %xmm0, %xmm4
        cmpq    %rax, %rdi
        jne     .L3

As you can see, we don't vectorize it.
