* [Bug tree-optimization/51499] vectorizer missing simple case
2011-12-10 18:39 [Bug tree-optimization/51499] New: vectorizer missing simple case fb.programming at gmail dot com
@ 2011-12-11 8:29 ` irar at il dot ibm.com
2011-12-11 8:48 ` fb.programming at gmail dot com
` (13 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: irar at il dot ibm.com @ 2011-12-11 8:29 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
Ira Rosen <irar at il dot ibm.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |irar at il dot ibm.com
--- Comment #1 from Ira Rosen <irar at il dot ibm.com> 2011-12-11 07:37:36 UTC ---
You need -ffast-math to allow vectorization of a floating point reduction.
You also need -fno-vect-cost-model, because the cost model considers the
vectorization not profitable in this case.
* [Bug tree-optimization/51499] vectorizer missing simple case
From: fb.programming at gmail dot com @ 2011-12-11 8:48 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
--- Comment #2 from fb.programming at gmail dot com 2011-12-11 08:33:40 UTC ---
(In reply to comment #1)
g++-4.6.2 -S -Wall -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 \
-ffast-math -fno-vect-cost-model
gives me exactly the same assembly code as above (which surprises me a
bit, as -funsafe-math-optimizations could just as well have eliminated
the loop completely).
The optimal assembly, however, I would expect to be something like:
.L3:
addq $1, %rax
addpd %xmm0, %xmm3
cmpq %rdi, %rax
addpd %xmm0, %xmm2
addpd %xmm0, %xmm1
jne .L3
where the vector (sum1,sum2) is stored in xmm1, (sum3,sum4) in xmm2,
etc., and (a,a) in xmm0. This speeds it up by a factor of 2 and is
completely equivalent to the scalar case, so I don't see why
-ffast-math (which implies -funsafe-math-optimizations) should be
necessary in this case either.
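The equivalence claimed here can be checked with a small C++ sketch. It assumes the test loop keeps six independent accumulators that each add the same value a (the three addpd instructions above suggest that shape, but the original test case is not reproduced in this message). Pair, scalar_sums and paired_sums are hypothetical names, and Pair only emulates one two-lane xmm register in portable code:

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for one 128-bit register holding two doubles.
struct Pair {
    double lo, hi;
    // Lane-wise add, like addpd: each lane is an independent IEEE addition.
    Pair& operator+=(const Pair& o) {
        lo += o.lo;
        hi += o.hi;
        return *this;
    }
};

// Scalar form (assumed shape of the test loop): six independent accumulators.
std::array<double, 6> scalar_sums(double a, std::size_t n) {
    double s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0;
    for (std::size_t i = 0; i < n; ++i) {
        s1 += a; s2 += a; s3 += a;
        s4 += a; s5 += a; s6 += a;
    }
    return {{s1, s2, s3, s4, s5, s6}};
}

// Paired form: (sum1,sum2), (sum3,sum4), (sum5,sum6) each live in one Pair,
// and (a,a) is added to all three per iteration.
std::array<double, 6> paired_sums(double a, std::size_t n) {
    const Pair aa{a, a};
    Pair p12{0, 0}, p34{0, 0}, p56{0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        p12 += aa;
        p34 += aa;
        p56 += aa;
    }
    return {{p12.lo, p12.hi, p34.lo, p34.hi, p56.lo, p56.hi}};
}
```

Each lane performs the same sequence of IEEE additions as its scalar counterpart, so scalar_sums(0.1, 1000) and paired_sums(0.1, 1000) compare equal element by element when compiled without -ffast-math.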
* [Bug tree-optimization/51499] vectorizer missing simple case
From: irar at il dot ibm.com @ 2011-12-11 9:05 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
--- Comment #3 from Ira Rosen <irar at il dot ibm.com> 2011-12-11 08:48:24 UTC ---
It gets vectorized with 4.7.
I guess this is due to the following 4.7 patch:
http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00620.html
* [Bug tree-optimization/51499] vectorizer missing simple case
From: fb.programming at gmail dot com @ 2011-12-11 12:13 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
--- Comment #4 from fb.programming at gmail dot com 2011-12-11 11:52:30 UTC ---
Looks like there has been some great progress in gcc 4.7!
Still I think it behaves slightly buggy.
(1) In this case it should work without -funsafe-math-optimizations but
it doesn't. gcc 4.7 requires -fno-signed-zeros -fno-trapping-math
-fassociative-math to make it work.
(2) The prediction:
7: not vectorized: vectorization not profitable.
is just wrong. Forcing it with -fno-vect-cost-model shows it speeds up
by a factor of 2.
(3) If I change all doubles into floats in the code above, it seems to
work without forcing it (-fno-vect-cost-model):
g++-4.7 -S -Wall -O2 -ftree-vectorize -ftree-vectorizer-verbose=2 \
-funsafe-math-optimizations test.cpp
Analyzing loop at test.cpp:7
Vectorizing loop at test.cpp:7
7: vectorizing stmts using SLP.
7: LOOP VECTORIZED.
test.cpp:4: note: vectorized 1 loops in function.
However, it hasn't vectorized it at all as the assembly shows:
.L11:
addq $1, %rax
addss %xmm0, %xmm3
cmpq %rax, %rdi
addss %xmm0, %xmm4
addss %xmm0, %xmm7
addss %xmm0, %xmm6
addss %xmm0, %xmm5
addss %xmm0, %xmm1
ja .L11
* [Bug tree-optimization/51499] vectorizer missing simple case
From: irar at il dot ibm.com @ 2011-12-11 13:39 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
--- Comment #5 from Ira Rosen <irar at il dot ibm.com> 2011-12-11 13:30:41 UTC ---
(In reply to comment #4)
> Looks like there has been some great progress in gcc 4.7!
>
> Still I think it behaves slightly buggy.
>
> (1) In this case it should work without -funsafe-math-optimizations but
> it doesn't. gcc 4.7 requires -fno-signed-zeros -fno-trapping-math
> -fassociative-math to make it work.
>
It's a reduction: when we vectorize, we change the order of computation. To be
able to do that for floating point, we need flag_associative_math.
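A small C++ sketch of why the order matters. The function names and the input values are made up for illustration; sum_two_lanes only imitates the order a 2-wide vectorized reduction would use and is not what GCC emits:

```cpp
#include <cstddef>
#include <vector>

// Sequential reduction: (((0 + v0) + v1) + v2) + v3 ...
double sum_sequential(const std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum;
}

// Two-lane reduction: even elements accumulate in lane 0, odd elements in
// lane 1, and the two lanes are added once at the end.
double sum_two_lanes(const std::vector<double>& v) {
    double lane0 = 0.0, lane1 = 0.0;
    std::size_t i = 0;
    for (; i + 1 < v.size(); i += 2) {
        lane0 += v[i];
        lane1 += v[i + 1];
    }
    if (i < v.size()) lane0 += v[i];  // leftover element for odd sizes
    return lane0 + lane1;
}
```

With IEEE doubles and round-to-nearest, the input {1e16, 1.0, -1e16, 1.0} gives 1.0 sequentially but 2.0 with two lanes, because 1e16 + 1.0 rounds back to 1e16. That difference is what flag_associative_math gives the compiler permission to introduce.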
> (2) The prediction:
> 7: not vectorized: vectorization not profitable.
> is just wrong. Forcing it with -fno-vect-cost-model shows it speeds up
> by a factor of 2.
>
> (3) If I change all doubles into floats in the code above, it seems to
> work without forcing it (-fno-vect-cost-model):
>
>
> g++-4.7 -S -Wall -O2 -ftree-vectorize -ftree-vectorizer-verbose=2 \
> -funsafe-math-optimizations test.cpp
>
> Analyzing loop at test.cpp:7
>
>
> Vectorizing loop at test.cpp:7
>
> 7: vectorizing stmts using SLP.
> 7: LOOP VECTORIZED.
> test.cpp:4: note: vectorized 1 loops in function.
>
>
> However, it hasn't vectorized it at all as the assembly shows:
>
> .L11:
> addq $1, %rax
> addss %xmm0, %xmm3
> cmpq %rax, %rdi
> addss %xmm0, %xmm4
> addss %xmm0, %xmm7
> addss %xmm0, %xmm6
> addss %xmm0, %xmm5
> addss %xmm0, %xmm1
> ja .L11
I think you are looking at the scalar epilogue. The number of iterations is
unknown, so we need an epilogue loop for the case that the number of iterations
is not a multiple of 4.
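The split into a main loop and a scalar epilogue can be sketched in plain C++ as follows. This is only an illustration of the loop structure, assuming a vectorization factor of 4 (four floats per 128-bit register); LoopSplit and add_scalar_to_all are hypothetical names, and the inner lane loop stands in for a single vector instruction:

```cpp
#include <cstddef>

// Assumed vectorization factor: 4 floats per 128-bit register.
constexpr std::size_t kLanes = 4;

struct LoopSplit {
    std::size_t vector_iterations;    // trips through the 4-wide main loop
    std::size_t epilogue_iterations;  // leftover scalar trips, always < kLanes
};

// Computes y[i] += a for all i, laid out the way a vectorizer might do it.
LoopSplit add_scalar_to_all(float* y, std::size_t n, float a) {
    LoopSplit split{0, 0};
    std::size_t i = 0;
    // Main loop: handles kLanes elements per trip.
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) y[i + lane] += a;
        ++split.vector_iterations;
    }
    // Scalar epilogue: runs only when n is not a multiple of kLanes.
    for (; i < n; ++i) {
        y[i] += a;
        ++split.epilogue_iterations;
    }
    return split;
}
```

For n = 10 the main loop runs twice and the epilogue handles the remaining two elements. Since n is only known at run time, both loops have to exist in the generated code, which is why a scalar loop still shows up in the assembly.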
* [Bug tree-optimization/51499] vectorizer missing simple case
From: dominiq at lps dot ens.fr @ 2011-12-11 14:55 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
--- Comment #6 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-12-11 14:14:01 UTC ---
> I think you are looking at the scalar epilogue. The number of iterations is
> unknown, so we need an epilogue loop for the case that the number of iterations
> is not a multiple of 4.
While investigating pr51597, I have found that vectorized loops in programs as
simple as
subroutine spmmult(x,b,ad)
implicit none
integer, parameter :: nxyz=1008315
real(8),dimension(nxyz):: x,b,ad
b = ad*x
end subroutine spmmult !=========================================
always have an additional non-vectorized loop, i.e. a vectorized one
L3:
movsd (%r9,%rax), %xmm1
addq $1, %rcx
movapd (%r10,%rax), %xmm0
movhpd 8(%r9,%rax), %xmm1
mulpd %xmm1, %xmm0
movlpd %xmm0, (%r8,%rax)
movhpd %xmm0, 8(%r8,%rax)
addq $16, %rax
cmpq $504156, %rcx
jbe L3
and a non-vectorized one
L5:
movsd -8(%rdi,%rax,8), %xmm0
mulsd -8(%rdx,%rax,8), %xmm0
movsd %xmm0, -8(%rsi,%rax,8)
addq $1, %rax
cmpq %rcx, %rax
jne L5
even when the above loops are unrolled. How can the loop L5 be unrolled if it
is only there for a "scalar epilogue"?
* [Bug tree-optimization/51499] vectorizer missing simple case
From: fb.programming at gmail dot com @ 2011-12-11 16:58 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
--- Comment #7 from fb.programming at gmail dot com 2011-12-11 14:55:13 UTC ---
(In reply to comment #5)
> > (3) If I change all doubles into floats in the code above it seems to
> I think you are looking at the scalar epilogue. The number of iterations is
> unknown, so we need an epilogue loop for the case that the number of iterations
> is not a multiple of 4.
Yes you're right. Sorry about that, my mistake.
> > (1) In this case it should work without -funsafe-math-optimizations but
> > it doesn't. gcc 4.7 requires -fno-signed-zeros -fno-trapping-math
> > -fassociative-math to make it work.
> >
>
> It's a reduction: when we vectorize, we change the order of computation. To be
> able to do that for floating point, we need flag_associative_math.
In some cases it might be necessary but not here:
sum1+=a;
sum2+=a;
gives exactly the same result as
(sum1, sum2) += (a, a);
Let's take a more applied example, say calculating the sum of 1/i:
double harmon(int n) {
double sum=0.0;
for(int i=1; i<n; i++){
sum += 1.0/i;
}
return sum;
}
This requires reordering of the sum to be vectorized, so in this case
I agree we need -funsafe-math-optimizations.
However, one could manually split the sum
double harmon(int n) {
assert(n%2==1); // odd n, so the pairs (i, i+1) cover exactly 1..n-1
double sum1=0.0, sum2=0.0;
for(int i=1; i<n; i+=2){
sum1 += 1.0/i;
sum2 += 1.0/(i+1);
}
return sum1+sum2;
}
and now I'd expect the compiler to vectorize this without
-funsafe-math-optimizations as it doesn't change any computational
results:
(sum1, sum2) += (1.0/i, 1.0/(i+1));
I can attach a test case with that example if that'd be useful?
* [Bug tree-optimization/51499] vectorizer missing simple case
From: irar at il dot ibm.com @ 2011-12-12 11:13 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
--- Comment #8 from Ira Rosen <irar at il dot ibm.com> 2011-12-12 11:03:59 UTC ---
(In reply to comment #6)
> While investigating pr51597, I have found that vectorized loops in programs as
> simple as
>
> subroutine spmmult(x,b,ad)
> implicit none
> integer, parameter :: nxyz=1008315
> real(8),dimension(nxyz):: x,b,ad
> b = ad*x
> end subroutine spmmult !=========================================
>
> always have an additional non-vectorized loop,
This loop has a prologue loop for alignment purposes.
> L5:
> movsd -8(%rdi,%rax,8), %xmm0
> mulsd -8(%rdx,%rax,8), %xmm0
> movsd %xmm0, -8(%rsi,%rax,8)
> addq $1, %rax
> cmpq %rcx, %rax
> jne L5
>
> even when the above loops are unrolled. How can the loop L5 be unrolled if it
> is only there for a "scalar epilogue"?
It can't be unrolled, since the alignment is unknown, so we don't know the
number of iterations of the prologue loop, and, therefore, we don't know the
number of iterations of the epilogue.
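A rough C++ sketch of how the three trip counts depend on the run-time address. It assumes 16-byte vectors of two doubles and a pointer that is at least 8-byte aligned; PeelPlan and plan_peeling are hypothetical names used only for illustration, not GCC internals:

```cpp
#include <cstddef>
#include <cstdint>

struct PeelPlan {
    std::size_t prologue;  // scalar trips until the pointer is 16-byte aligned
    std::size_t vector;    // trips through the 2-wide main loop
    std::size_t epilogue;  // scalar trips for whatever remains
};

// Assumes p is at least 8-byte aligned (the natural alignment of double).
PeelPlan plan_peeling(const double* p, std::size_t n) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::size_t misaligned_bytes = addr % 16;
    // Peel scalar iterations until the address reaches a 16-byte boundary.
    std::size_t prologue =
        misaligned_bytes ? (16 - misaligned_bytes) / sizeof(double) : 0;
    if (prologue > n) prologue = n;
    const std::size_t rest = n - prologue;
    return {prologue, rest / 2, rest % 2};
}
```

For a 16-byte aligned buffer and n = 7 this yields 0 prologue, 3 vector and 1 epilogue iterations; starting one element later yields 1, 3 and 0. The same n gives different epilogue counts depending on the address, so neither count is a compile-time constant.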
* [Bug tree-optimization/51499] vectorizer missing simple case
From: irar at il dot ibm.com @ 2011-12-12 11:23 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
--- Comment #9 from Ira Rosen <irar at il dot ibm.com> 2011-12-12 11:13:24 UTC ---
(In reply to comment #7)
>
> In some cases it might be necessary but not here:
>
> sum1+=a;
> sum2+=a;
>
> gives exactly the same result as
>
> (sum1, sum2) += (a, a);
>
So you are suggesting removing the need for flag_associative_math for fp in
cases where a reduction computation is already unrolled by the vectorization
factor. Sounds reasonable to me.
* [Bug tree-optimization/51499] vectorizer missing simple case
From: rguenth at gcc dot gnu.org @ 2011-12-12 11:27 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
--- Comment #10 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-12-12 11:24:22 UTC ---
Hmm. But we are vectorizing
sum += a[i]
sum += a[i+1]
the same as
sum += a[i+1]
sum += a[i]
no? Thus you have to check whether the summation occurs in "memory order"?
* [Bug tree-optimization/51499] vectorizer missing simple case
From: irar at il dot ibm.com @ 2011-12-12 12:21 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
--- Comment #11 from Ira Rosen <irar at il dot ibm.com> 2011-12-12 11:27:26 UTC ---
Right. We need to check that there is no load permutation.
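To illustrate why the permutation check matters, here is a minimal C++ sketch with a single accumulator. The function names and input values are invented for the example; both functions assume an even number of elements:

```cpp
#include <cstddef>
#include <vector>

// Unrolled-by-two reduction in memory order: a[i] first, then a[i+1].
double sum_memory_order(const std::vector<double>& a) {
    double sum = 0.0;
    for (std::size_t i = 0; i + 1 < a.size(); i += 2) {
        sum += a[i];
        sum += a[i + 1];
    }
    return sum;
}

// Same elements, but each pair is consumed in swapped order,
// which corresponds to a load permutation.
double sum_permuted(const std::vector<double>& a) {
    double sum = 0.0;
    for (std::size_t i = 0; i + 1 < a.size(); i += 2) {
        sum += a[i + 1];
        sum += a[i];
    }
    return sum;
}
```

For {1e16, 1.0, -1e16, 1.0} with IEEE doubles, the memory-order version returns 1.0 and the permuted version returns 0.0. If the vectorizer treats both source forms identically, it can only preserve the scalar result for one of them.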
* [Bug tree-optimization/51499] vectorizer missing simple case
From: dominiq at lps dot ens.fr @ 2011-12-12 13:10 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
--- Comment #12 from Dominique d'Humieres <dominiq at lps dot ens.fr> 2011-12-12 12:47:54 UTC ---
> > even when the above loops are unrolled. How can the loop L5 be unrolled if it
> > is only there for a "scalar epilogue"?
>
> It can't be unrolled, since the alignment is unknown, so we don't know the
> number of iterations of the prologue loop, and, therefore, we don't know the
> number of iterations of the epilogue.
Well, it is unrolled with -funroll-loops, for instance if I compile with
'-Ofast -funroll-loops --param max-unroll-times=4', I get
L3:
movsd (%r8,%r11), %xmm3
addq $4, %r10
movsd 16(%r8,%r11), %xmm5
movsd 32(%r8,%r11), %xmm7
movhpd 8(%r8,%r11), %xmm3
movsd 48(%r8,%r11), %xmm9
movhpd 24(%r8,%r11), %xmm5
movapd (%r9,%r11), %xmm4
movhpd 40(%r8,%r11), %xmm7
movapd 16(%r9,%r11), %xmm6
movhpd 56(%r8,%r11), %xmm9
movapd 32(%r9,%r11), %xmm8
mulpd %xmm3, %xmm4
movapd 48(%r9,%r11), %xmm10
mulpd %xmm5, %xmm6
mulpd %xmm7, %xmm8
mulpd %xmm9, %xmm10
movlpd %xmm4, (%rcx,%r11)
movhpd %xmm4, 8(%rcx,%r11)
movlpd %xmm6, 16(%rcx,%r11)
movhpd %xmm6, 24(%rcx,%r11)
movlpd %xmm8, 32(%rcx,%r11)
movhpd %xmm8, 40(%rcx,%r11)
movlpd %xmm10, 48(%rcx,%r11)
movhpd %xmm10, 56(%rcx,%r11)
addq $64, %r11
cmpq $504156, %r10
jbe L3
and
L5:
movsd -8(%rdi,%r9,8), %xmm15
leaq 1(%r9), %rbx
leaq 2(%r9), %r8
movsd -8(%rdi,%rbx,8), %xmm0
leaq 3(%r9), %rcx
movsd -8(%rdi,%r8,8), %xmm1
mulsd -8(%rdx,%r9,8), %xmm15
movsd -8(%rdi,%rcx,8), %xmm2
mulsd -8(%rdx,%rbx,8), %xmm0
mulsd -8(%rdx,%r8,8), %xmm1
mulsd -8(%rdx,%rcx,8), %xmm2
movsd %xmm15, -8(%rsi,%r9,8)
addq $4, %r9
cmpq %r12, %r9
movsd %xmm0, -8(%rsi,%rbx,8)
movsd %xmm1, -8(%rsi,%r8,8)
movsd %xmm2, -8(%rsi,%rcx,8)
jne L5
So both the vectorized and the unvectorized loops are unrolled four times. This
does not seem logical to me if the L5 loop was there only to handle a leftover
scalar (AFAIU %xmm* store only one or two doubles and there is at most one left
if the length is odd or if the length is even and the first one has been peeled
for alignment).
I am also puzzled by the way the vectors are stored back as a pair
movlpd %xmm4, (%rcx,%r11)
movhpd %xmm4, 8(%rcx,%r11)
Why not a 'movapd' instead?
* [Bug tree-optimization/51499] vectorizer missing simple case
From: fb.programming at gmail dot com @ 2011-12-12 14:31 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
--- Comment #13 from fb.programming at gmail dot com 2011-12-12 14:20:58 UTC ---
(In reply to comment #9)
> So you are suggesting removing the need for flag_associative_math for fp in
> cases where a reduction computation is already unrolled by the vectorization
> factor. Sounds reasonable to me.
Yes I think that's it, basically only require flag_associative_math if
the order of summation or products is changed by the vectorizer. That is
quite important I think, as most of the time
-ffast-math / -funsafe-math-optimizations / -fassociative-math
might not be acceptable for many projects.
However, I don't fully understand Richard Guenther's example. Yes his
example requires -fassociative-math to be vectorized, however, my example
would translate to something like
sum1 += a[i];
sum2 += a[i+1];
and now it doesn't matter if it's executed this way or the other way
around
sum2 += a[i+1];
sum1 += a[i];
The second issue is just to double-check the profitability calculation,
as it wrongly decided:
7: not vectorized: vectorization not profitable.
* [Bug tree-optimization/51499] vectorizer missing simple case
From: irar at il dot ibm.com @ 2011-12-13 16:33 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
--- Comment #14 from Ira Rosen <irar at il dot ibm.com> 2011-12-13 16:27:19 UTC ---
(In reply to comment #13)
>
> However, I don't fully understand Richard Guenther's example. Yes his
> example requires -fassociative-math to be vectorized, however, my example
> would translate to something like
>
> sum1 += a[i];
> sum2 += a[i+1];
>
> and now it doesn't matter if it's executed this way or the other way
> around
>
> sum2 += a[i+1];
> sum1 += a[i];
The problem is probably more in the implementation. The change of order will also
change between sum1 and sum2, so when you want to return sum1+sum2, the
vectorized version will return sum2+sum1.
* [Bug tree-optimization/51499] -Ofast does not vectorize while -O3 does.
From: pinskia at gcc dot gnu.org @ 2021-08-07 5:19 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2021-08-07
Ever confirmed|0 |1
Summary|vectorizer missing simple |-Ofast does not vectorize
|case |while -O3 does.
Status|UNCONFIRMED |NEW
--- Comment #15 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
So here is the interesting part for the trunk:
with -O3 we can vectorize the loop because we are using the SLP vectorizer, but
with -Ofast we don't, as we say the vectorization is too costly.
The innermost loop for -O3:
.L3:
addq $1, %rax
addpd %xmm1, %xmm2
addpd %xmm1, %xmm3
addpd %xmm1, %xmm4
cmpq %rax, %rdi
jne .L3
The SLP vectorizer has done this since GCC 11.
Here is the inner loop for -Ofast:
.L3:
addq $1, %rax
addsd %xmm0, %xmm3
addsd %xmm0, %xmm6
addsd %xmm0, %xmm1
addsd %xmm0, %xmm5
addsd %xmm0, %xmm2
addsd %xmm0, %xmm4
cmpq %rax, %rdi
jne .L3
As you can see, we don't vectorize it.