From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-273647-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 23187 invoked by alias); 23 Jan 2009 15:33:21 -0000
Received: (qmail 23116 invoked by uid 48); 23 Jan 2009 15:33:07 -0000
Date: Fri, 23 Jan 2009 15:33:00 -0000
Message-ID: <20090123153307.23115.qmail@sourceware.org>
X-Bugzilla-Reason: CC
References: <bug-37021-10053@http.gcc.gnu.org/bugzilla/>
Subject: [Bug tree-optimization/37021] Fortran Complex reduction / multiplication not vectorized
In-Reply-To: <bug-37021-10053@http.gcc.gnu.org/bugzilla/>
Reply-To: gcc-bugzilla@gcc.gnu.org
To: gcc-bugs@gcc.gnu.org
From: "rguenth at gcc dot gnu dot org" <gcc-bugzilla@gcc.gnu.org>
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2009-01/txt/msg02449.txt.bz2


------- Comment #4 from rguenth at gcc dot gnu dot org  2009-01-23 15:33 -------
The testcase should be

subroutine to_product_of(self,a,b,a1,a2)
  complex(kind=8) :: self (:)
  complex(kind=8), intent(in) :: a(:,:)
  complex(kind=8), intent(in) :: b(:)
  integer a1,a2
  do i = 1,a1
    do j = 1,a2
      self(i) = self(i) + a(j,i)*b(j)
    end do
  end do
end subroutine

to be meaningful - otherwise we are accessing a in non-continuous ways in the
inner loop which would prevent vectorization.

With the versioning for stride == 1 I get then

.L13:
        movupd  16(%rax), %xmm1
        movupd  (%rax), %xmm3
        incl    %ecx
        movupd  (%rdx), %xmm4
        addq    $32, %rax
        movapd  %xmm3, %xmm0
        unpckhpd        %xmm1, %xmm3
        unpcklpd        %xmm1, %xmm0
        movupd  16(%rdx), %xmm1
        movapd  %xmm4, %xmm2
        addq    $32, %rdx
        movapd  %xmm3, %xmm9
        cmpl    %ecx, %r8d
        unpcklpd        %xmm1, %xmm2
        unpckhpd        %xmm1, %xmm4
        movapd  %xmm4, %xmm1
        movapd  %xmm2, %xmm4
        mulpd   %xmm1, %xmm9
        mulpd   %xmm0, %xmm4
        mulpd   %xmm3, %xmm2
        mulpd   %xmm1, %xmm0
        subpd   %xmm9, %xmm4
        addpd   %xmm2, %xmm0
        addpd   %xmm4, %xmm6
        addpd   %xmm0, %xmm5
        ja      .L13
        haddpd  %xmm5, %xmm5
        cmpl    %r15d, %edi
        movl    -4(%rsp), %ecx
        haddpd  %xmm6, %xmm6
        addsd   %xmm5, %xmm8
        addsd   %xmm6, %xmm7
        jne     .L12
        jmp     .L14

for the innermost loop, followed by a tail loop (peel for niters).  This is
about 15% faster on AMD K10 than the non-vectorized loop (if you disable
the cost-model and make sure to have enough iterations in the inner loop
to pay back for the extra guarding conditions).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021