From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 27 Jan 2009 12:40:00 -0000
Message-ID: <20090127124026.4093.qmail@sourceware.org>
Subject: [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "dorit at gcc dot gnu dot org" <gcc-bugzilla@gcc.gnu.org>
X-SW-Source: 2009-01/txt/msg02946.txt.bz2

------- Comment #9 from dorit at gcc dot gnu dot org  2009-01-27 12:40 -------

(In reply to comment #4)
> The testcase should be
>
>   subroutine to_product_of(self,a,b,a1,a2)
>     complex(kind=8) :: self (:)
>     complex(kind=8), intent(in) :: a(:,:)
>     complex(kind=8), intent(in) :: b(:)
>     integer a1,a2
>     do i = 1,a1
>       do j = 1,a2
>         self(i) = self(i) + a(j,i)*b(j)
>       end do
>     end do
>   end subroutine
>
> to be meaningful - otherwise we are accessing a in non-contiguous ways in the
> inner loop, which would prevent vectorization.

This change from a(i,j) to a(j,i) is not required if we vectorize the outer
loop instead, where the stride is already 1. Outer-loop vectorization is also
a better way to vectorize the reduction. A few limitations along the way,
though:

1) Somehow keep gcc from creating guard code around the innermost loop that
checks it executes more than zero iterations: that guard creates a complicated
control-flow structure within the outer loop. For now this means the inner
loop must have a constant number of iterations, or you must insert a statement
like "if (a2<=0) return;" before the loop...

2) Use -fno-tree-sink, because otherwise the loop IV increment is sunk into
the latch block, and the vectorizer wants the latch block to be empty...
(see also PR33113 for a related reference).
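The manual guard suggested in point (1) can be sketched in C (an illustrative translation of the Fortran testcase, not code from the PR; the row-major indexing makes the `j` accesses contiguous, matching the Fortran `a(j,i)` column-major access):

```c
#include <assert.h>
#include <complex.h>

/* Illustrative C sketch of the Fortran testcase (not GCC output).
 * The explicit early return mirrors the "if (a2<=0) return;"
 * workaround: it tells the compiler up front that the inner loop
 * always runs at least once, so no zero-trip guard has to be
 * generated inside the outer loop. */
void to_product_of(double complex *self,
                   const double complex *a,  /* a1 rows of a2 contiguous elements */
                   const double complex *b,
                   int a1, int a2)
{
    if (a2 <= 0)          /* manual guard replaces the compiler-generated one */
        return;
    for (int i = 0; i < a1; i++)
        for (int j = 0; j < a2; j++)
            self[i] += a[i * a2 + j] * b[j];
}
```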
> With the versioning for stride == 1 I get then
>
> .L13:
>         movupd  16(%rax), %xmm1
>         movupd  (%rax), %xmm3
>         incl    %ecx
>         movupd  (%rdx), %xmm4
>         addq    $32, %rax
>         movapd  %xmm3, %xmm0
>         unpckhpd %xmm1, %xmm3
>         unpcklpd %xmm1, %xmm0
>         movupd  16(%rdx), %xmm1
>         movapd  %xmm4, %xmm2
>         addq    $32, %rdx
>         movapd  %xmm3, %xmm9
>         cmpl    %ecx, %r8d
>         unpcklpd %xmm1, %xmm2
>         unpckhpd %xmm1, %xmm4
>         movapd  %xmm4, %xmm1
>         movapd  %xmm2, %xmm4
>         mulpd   %xmm1, %xmm9
>         mulpd   %xmm0, %xmm4
>         mulpd   %xmm3, %xmm2
>         mulpd   %xmm1, %xmm0
>         subpd   %xmm9, %xmm4
>         addpd   %xmm2, %xmm0
>         addpd   %xmm4, %xmm6
>         addpd   %xmm0, %xmm5
>         ja      .L13
>         haddpd  %xmm5, %xmm5
>         cmpl    %r15d, %edi
>         movl    -4(%rsp), %ecx
>         haddpd  %xmm6, %xmm6
>         addsd   %xmm5, %xmm8
>         addsd   %xmm6, %xmm7
>         jne     .L12
>         jmp     .L14
>
> for the innermost loop, followed by a tail loop (peel for niters). This is
> about 15% faster on AMD K10 than the non-vectorized loop (if you disable
> the cost model and make sure to have enough iterations in the inner loop
> to pay back for the extra guarding conditions).


-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021