* [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
2008-08-04 17:57 [Bug tree-optimization/37021] New: Fortran Complex reduction / multiplication not vectorized rguenth at gcc dot gnu dot org
@ 2008-08-04 17:59 ` rguenth at gcc dot gnu dot org
2008-08-19 15:31 ` rguenth at gcc dot gnu dot org
` (7 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2008-08-04 17:59 UTC (permalink / raw)
To: gcc-bugs
------- Comment #1 from rguenth at gcc dot gnu dot org 2008-08-04 17:58 -------
Patch for 1) http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00221.html
Patch for 2) http://gcc.gnu.org/ml/gcc-patches/2008-08/msg00226.html
I haven't started on 3) yet, so 4) is still unknown (as are 5, 6, ... ;))
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021
From: rguenth at gcc dot gnu dot org @ 2008-08-19 15:31 UTC (permalink / raw)
To: gcc-bugs
------- Comment #2 from rguenth at gcc dot gnu dot org 2008-08-19 15:29 -------
3) fails because data-ref analysis requires a constant step:

  else if (!simple_iv (loop, stmt, poffset, &offset_iv, false))
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
        fprintf (dump_file, "failed: evolution of offset is not affine.\n");
      return;
    }

but here the step is

  (<unnamed-signed:64>) ((<unnamed-unsigned:64>) stride.3_36 * 16)

because we are dealing with general incoming arrays, which are arbitrarily
strided. Fixing this requires, for example, versioning the loop for a
constant stride.
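To illustrate what versioning for a constant stride means, here is a minimal C sketch. It is not GCC code; the function name `sum_strided` and its parameters are made up for this example. The loop is duplicated and guarded by a run-time check, so that one copy has a step the compiler can treat as a constant.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only: sum n doubles that are
   `stride` elements apart, starting at base.  */
double sum_strided(const double *base, ptrdiff_t stride, size_t n)
{
    assert(base != NULL || n == 0);
    double sum = 0.0;

    if (stride == 1) {
        /* Versioned copy: the step is a compile-time constant here and the
           accesses are contiguous, so the evolution of the address is
           affine and a vectorizer has a chance to handle the loop.  */
        for (size_t i = 0; i < n; i++)
            sum += base[i];
    } else {
        /* Fallback copy: the step is only known at run time.  */
        for (size_t i = 0; i < n; i++)
            sum += base[(ptrdiff_t)i * stride];
    }
    return sum;
}
```

Both copies compute the same result; only the stride == 1 copy is a candidate for vectorization.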
From: rguenth at gcc dot gnu dot org @ 2009-01-21 15:43 UTC (permalink / raw)
To: gcc-bugs
------- Comment #3 from rguenth at gcc dot gnu dot org 2009-01-21 15:43 -------
Mine. I am working on adding versioning for non-constant strides.
--
rguenth at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |irar at il dot ibm dot com
AssignedTo|unassigned at gcc dot gnu |rguenth at gcc dot gnu dot
|dot org |org
Status|UNCONFIRMED |ASSIGNED
Ever Confirmed|0 |1
Last reconfirmed|0000-00-00 00:00:00 |2009-01-21 15:43:08
date| |
From: rguenth at gcc dot gnu dot org @ 2009-01-23 15:33 UTC (permalink / raw)
To: gcc-bugs
------- Comment #4 from rguenth at gcc dot gnu dot org 2009-01-23 15:33 -------
The testcase should be
subroutine to_product_of(self,a,b,a1,a2)
  complex(kind=8) :: self (:)
  complex(kind=8), intent(in) :: a(:,:)
  complex(kind=8), intent(in) :: b(:)
  integer a1,a2
  do i = 1,a1
    do j = 1,a2
      self(i) = self(i) + a(j,i)*b(j)
    end do
  end do
end subroutine
to be meaningful - otherwise we are accessing a in a non-contiguous way in the
inner loop, which would prevent vectorization.
With versioning for stride == 1 I then get
.L13:
        movupd 16(%rax), %xmm1
        movupd (%rax), %xmm3
        incl %ecx
        movupd (%rdx), %xmm4
        addq $32, %rax
        movapd %xmm3, %xmm0
        unpckhpd %xmm1, %xmm3
        unpcklpd %xmm1, %xmm0
        movupd 16(%rdx), %xmm1
        movapd %xmm4, %xmm2
        addq $32, %rdx
        movapd %xmm3, %xmm9
        cmpl %ecx, %r8d
        unpcklpd %xmm1, %xmm2
        unpckhpd %xmm1, %xmm4
        movapd %xmm4, %xmm1
        movapd %xmm2, %xmm4
        mulpd %xmm1, %xmm9
        mulpd %xmm0, %xmm4
        mulpd %xmm3, %xmm2
        mulpd %xmm1, %xmm0
        subpd %xmm9, %xmm4
        addpd %xmm2, %xmm0
        addpd %xmm4, %xmm6
        addpd %xmm0, %xmm5
        ja .L13
        haddpd %xmm5, %xmm5
        cmpl %r15d, %edi
        movl -4(%rsp), %ecx
        haddpd %xmm6, %xmm6
        addsd %xmm5, %xmm8
        addsd %xmm6, %xmm7
        jne .L12
        jmp .L14
for the innermost loop, followed by a tail loop (peel for niters). This is
about 15% faster on AMD K10 than the non-vectorized loop (if you disable
the cost model and make sure there are enough iterations in the inner loop
to pay back the extra guard conditions).
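For readers who do not want to trace the assembly, the following C sketch shows roughly what the vectorized inner loop computes. It is written by hand as an illustration, not taken from a GCC dump, and the name `cdot_vf2` is made up. Two complex elements are processed per iteration: the real and imaginary parts are separated (the unpcklpd/unpckhpd pairs), four lane-wise products are formed, two vector accumulators are updated, and the lanes are summed after the loop (haddpd). A scalar tail handles an odd trip count.

```c
#include <assert.h>
#include <stddef.h>

/* Illustration only: accumulate sum_j a(j)*b(j) for n complex doubles
   stored interleaved as re, im, re, im, ...  */
void cdot_vf2(const double *a, const double *b, size_t n,
              double *sum_re, double *sum_im)
{
    assert(sum_re != NULL && sum_im != NULL);
    double acc_re[2] = {0.0, 0.0};   /* one accumulator lane per element */
    double acc_im[2] = {0.0, 0.0};
    size_t j = 0;

    /* Main loop: vectorization factor 2, i.e. two elements per iteration.  */
    for (; j + 2 <= n; j += 2) {
        for (int lane = 0; lane < 2; lane++) {
            double ar = a[2 * (j + lane)], ai = a[2 * (j + lane) + 1];
            double br = b[2 * (j + lane)], bi = b[2 * (j + lane) + 1];
            acc_re[lane] += ar * br - ai * bi;
            acc_im[lane] += ar * bi + ai * br;
        }
    }

    /* Horizontal reduction of the two lanes.  */
    double re = acc_re[0] + acc_re[1];
    double im = acc_im[0] + acc_im[1];

    /* Tail loop (peel for niters): at most one leftover element.  */
    for (; j < n; j++) {
        double ar = a[2 * j], ai = a[2 * j + 1];
        double br = b[2 * j], bi = b[2 * j + 1];
        re += ar * br - ai * bi;
        im += ar * bi + ai * br;
    }

    *sum_re += re;
    *sum_im += im;
}
```

Note that the lanes are summed in a different order than in the scalar loop, so the result can differ in the last bits for general floating-point input.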
From: rguenth at gcc dot gnu dot org @ 2009-01-23 15:36 UTC (permalink / raw)
To: gcc-bugs
------- Comment #5 from rguenth at gcc dot gnu dot org 2009-01-23 15:36 -------
So,
4) The vectorized version sucks because we have to use peeling for niters
because we need to unroll the loop once and cannot apply SLP here.
Q1: does SLP work with reductions at all?
Q2: does SLP do pattern recognition?
First of all we would need to recognize a complex reduction as a single
vectorized reduction. Second we need to vectorize the complex multiplication
with SLP, feeding the reduction with one resulting complex vector.
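A rough C sketch of that idea, under the assumption that one complex value is treated as a single two-lane vector (re, im). This is hand-written for illustration, does not reflect what the vectorizer emits, and `cdot_slp` is a made-up name. Because each iteration consumes exactly one complex element, no unrolling and no epilogue loop are needed.

```c
#include <assert.h>
#include <stddef.h>

/* Illustration only: acc[0] accumulates the real part and acc[1] the
   imaginary part of sum_j a(j)*b(j); a and b are interleaved re, im.  */
void cdot_slp(const double *a, const double *b, size_t n, double acc[2])
{
    assert(acc != NULL);
    static const double sign[2] = {-1.0, 1.0};

    for (size_t j = 0; j < n; j++) {
        const double ar = a[2 * j], ai = a[2 * j + 1];
        const double bv[2] = {b[2 * j], b[2 * j + 1]};  /* (br, bi) as loaded */
        const double bs[2] = {bv[1], bv[0]};            /* (bi, br), swapped  */

        /* Lane 0: ar*br - ai*bi.  Lane 1: ar*bi + ai*br.  */
        for (int lane = 0; lane < 2; lane++)
            acc[lane] += ar * bv[lane] + sign[lane] * ai * bs[lane];
    }
}
```

The complex reduction is a single two-lane accumulator here, which is the "one resulting complex vector" feeding the reduction.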
From: irar at il dot ibm dot com @ 2009-01-25 9:13 UTC (permalink / raw)
To: gcc-bugs
------- Comment #6 from irar at il dot ibm dot com 2009-01-25 09:12 -------
(In reply to comment #5)
> So,
> 4) The vectorized version sucks because we have to use peeling for niters
> because we need to unroll the loop once and cannot apply SLP here.
What do you mean by "unroll the loop once"?
> Q1: does SLP work with reductions at all?
No. SLP currently originates from groups of strided stores.
> Q2: does SLP do pattern recognition?
Pattern recognition is done before SLP, and SLP handles stmts that were marked
as part of a pattern. There is no SLP-specific pattern recognition.
> First of all we would need to recognize a complex reduction as a single
> vectorized reduction. Second we need to vectorize the complex multiplication
> with SLP, feeding the reduction with one resulting complex vector.
From: rguenther at suse dot de @ 2009-01-25 11:04 UTC (permalink / raw)
To: gcc-bugs
------- Comment #7 from rguenther at suse dot de 2009-01-25 11:04 -------
Subject: Re: Fortran Complex reduction /
multiplication not vectorized
On Sun, 25 Jan 2009, irar at il dot ibm dot com wrote:
>
>
> ------- Comment #6 from irar at il dot ibm dot com 2009-01-25 09:12 -------
> (In reply to comment #5)
> > So,
> > 4) The vectorized version sucks because we have to use peeling for niters
> > because we need to unroll the loop once and cannot apply SLP here.
>
> What do you mean by "unroll the loop once"?
The vectorization factor is two, so we need two copies of the loop body
(which means unrolling it once and creating an epilogue loop, because
niter may be odd).
> > Q1: does SLP work with reductions at all?
>
> No. SLP currently originates from groups of strided stores.
Ah, I see. In this loop we have two reductions, so to apply SLP
we would need to see that we can use a group of reductions for SLP?
> > Q2: does SLP do pattern recognition?
>
> Pattern recognition is done before SLP, and SLP handles stmts that were marked
> as part of a pattern. There is no SLP-specific pattern recognition.
Ok, but with a reduction it won't help me here.
Can a loop be vectorized with just pattern recognition? Hm, if I
remember correctly we detect scalar patterns and then vectorize them.
We don't support detecting "vector patterns" from scalar code, correct?
Thanks,
Richard.
From: irar at il dot ibm dot com @ 2009-01-25 12:17 UTC (permalink / raw)
To: gcc-bugs
------- Comment #8 from irar at il dot ibm dot com 2009-01-25 12:17 -------
(In reply to comment #7)
> > > Q1: does SLP work with reductions at all?
> >
> > No. SLP currently originates from groups of strided stores.
> Ah, I see. In this loop we have two reductions, so to apply SLP
> we would need to see that we can use a group of reductions for SLP?
Yes, I think this will work.
> > > Q2: does SLP do pattern recognition?
> >
> > Pattern recognition is done before SLP, and SLP handles stmts that were marked
> > as part of a pattern. There is no SLP-specific pattern recognition.
> Ok, but with a reduction it won't help me here.
> Can a loop be vectorized with just pattern recognition? Hm, if I
> remember correctly we detect scalar patterns and then vectorize them.
> We don't support detecting "vector patterns" from scalar code, correct?
Yes, if I understand you correctly, we detect scalar patterns, but adding
vector pattern detection does not seem to be complicated.
From: dorit at gcc dot gnu dot org @ 2009-01-27 12:40 UTC (permalink / raw)
To: gcc-bugs
------- Comment #9 from dorit at gcc dot gnu dot org 2009-01-27 12:40 -------
(In reply to comment #4)
> The testcase should be
> subroutine to_product_of(self,a,b,a1,a2)
>   complex(kind=8) :: self (:)
>   complex(kind=8), intent(in) :: a(:,:)
>   complex(kind=8), intent(in) :: b(:)
>   integer a1,a2
>   do i = 1,a1
>     do j = 1,a2
>       self(i) = self(i) + a(j,i)*b(j)
>     end do
>   end do
> end subroutine
> to be meaningful - otherwise we are accessing a in a non-contiguous way in the
> inner loop, which would prevent vectorization.
This change from a(i,j) to a(j,i) is not required if we vectorize the
outer loop instead, where the stride is 1. That is also a better way to
vectorize the reduction. There are a few limitations along the way, though:
1) GCC must somehow be kept from creating guard code around the innermost
loop to check that it executes more than zero iterations, because this creates
a complicated control-flow structure within the outer loop. For now you need
a constant number of iterations for the inner loop because of that, or you
have to insert a statement like "if (a2<=0) return;" before the loop.
2) Use -fno-tree-sink, because otherwise the sinking pass moves the loop iv
increment to the latch block, and the vectorizer likes to have the latch block
empty (see also PR33113 for a related reference).
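A hand-written C sketch of what outer-loop vectorization means for the a(i,j) variant follows. To keep it short it makes several assumptions that are not in the testcase: real instead of complex data, a contiguous column-major matrix with leading dimension `lda`, and the made-up name `matvec_outer_vf2`. Each vector lane owns one self(i), so consecutive lanes read consecutive elements of a column (stride 1) and no horizontal reduction is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Illustration only: self(i) += sum_j a(i,j)*b(j), with a stored
   column-major, i.e. a(i,j) lives at a[i + j*lda] (0-based).  */
void matvec_outer_vf2(double *self, const double *a, const double *b,
                      size_t a1, size_t a2, size_t lda)
{
    assert(lda >= a1);
    if (a2 == 0)
        return;   /* plays the role of the "if (a2<=0) return;" workaround */

    size_t i = 0;
    /* Outer loop vectorized by 2: lanes i and i+1 advance together.  */
    for (; i + 2 <= a1; i += 2) {
        double acc[2] = {self[i], self[i + 1]};
        for (size_t j = 0; j < a2; j++)
            for (int lane = 0; lane < 2; lane++)
                acc[lane] += a[(i + lane) + j * lda] * b[j];
        self[i] = acc[0];
        self[i + 1] = acc[1];
    }

    /* Scalar epilogue for an odd a1.  */
    for (; i < a1; i++)
        for (size_t j = 0; j < a2; j++)
            self[i] += a[i + j * lda] * b[j];
}
```

Unlike inner-loop vectorization, each lane adds its terms in the original order, so every self(i) gets the same value as in the scalar loop.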