From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-363852-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 12724 invoked by alias); 27 Jul 2011 12:39:08 -0000
Received: (qmail 12716 invoked by uid 22791); 27 Jul 2011 12:39:07 -0000
X-SWARE-Spam-Status: No, hits=-2.9 required=5.0	tests=ALL_TRUSTED,AWL,BAYES_00
X-Spam-Check-By: sourceware.org
Received: from localhost (HELO gcc.gnu.org) (127.0.0.1)    by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Wed, 27 Jul 2011 12:38:53 +0000
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/18437] vectorizer failed for matrix multiplication
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields:
Message-ID: <bug-18437-4-AWlmHWZB65@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-18437-4@http.gcc.gnu.org/bugzilla/>
References: <bug-18437-4@http.gcc.gnu.org/bugzilla/>
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
Date: Wed, 27 Jul 2011 12:39:00 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
X-SW-Source: 2011-07/txt/msg02336.txt.bz2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18437
--- Comment #5 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-07-27 12:38:20 UTC ---
The initial testcase is probably a bad example (3x3 matrix).  The following
testcase is borrowed from Polyhedron rnflow and is vectorized by ICC but
not by GCC (the ICC variant is 15% faster):

      function trs2a2 (j, k, u, d, m)
      real, dimension (1:m,1:m) :: trs2a2  
      real, dimension (1:m,1:m) :: u, d
      integer, intent (in)      :: j, k, m
      real (kind = selected_real_kind (10,50)) :: dtmp
      trs2a2 = 0.0
      do iclw1 = j, k - 1
         do iclw2 = j, k - 1
            dtmp = 0.0d0
            do iclww = j, k - 1
               dtmp = dtmp + u (iclw1, iclww) * d (iclww, iclw2)
            enddo
            trs2a2 (iclw1, iclw2) = dtmp
         enddo
      enddo
      return
      end function trs2a2

The reason GCC cannot vectorize this is that the load from U has
a non-constant stride (in the inner loop it steps by the leading
dimension m, which is only known at run time), so vectorization would
need to load two scalars and build up a vector from them (ICC does
that).  If the stride were constant but not a power of two, GCC would
reject that as well, probably so as not to confuse the interleaving
code.  Data dependence analysis also rejects non-constant strides.
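
To make "load two scalars and build up a vector" concrete, here is a
minimal sketch in C (not the Fortran of the testcase, and not what ICC
or GCC actually emit).  It assumes column-major storage with leading
dimension m and 0-based indices; the names dot_scalar and dot_two_lane
are made up for illustration.  The two-lane version keeps two partial
sums, so it reorders the floating-point reduction, which a compiler
would only do when allowed to reassociate.

```c
#include <stddef.h>

/* Scalar reference for the inner loop: row i of u times column c of d.
   u is read with a run-time stride of m elements, d is contiguous. */
static double dot_scalar(const float *u, const float *d, size_t m,
                         size_t i, size_t c, size_t lo, size_t hi)
{
    double acc = 0.0;
    for (size_t k = lo; k < hi; k++)
        acc += (double)(u[i + k * m] * d[k + c * m]);
    return acc;
}

/* Hypothetical two-lane emulation of the vectorized loop. */
static double dot_two_lane(const float *u, const float *d, size_t m,
                           size_t i, size_t c, size_t lo, size_t hi)
{
    double acc0 = 0.0, acc1 = 0.0;   /* the two lanes of the accumulator */
    size_t k = lo;
    for (; k + 1 < hi; k += 2) {
        /* Strided operand: two separate scalar loads form the "vector". */
        float u0 = u[i + k * m];
        float u1 = u[i + (k + 1) * m];
        /* Contiguous operand: adjacent elements, i.e. one vector load. */
        float d0 = d[k + c * m];
        float d1 = d[k + 1 + c * m];
        acc0 += (double)(u0 * d0);
        acc1 += (double)(u1 * d1);
    }
    for (; k < hi; k++)              /* leftover iteration, if any */
        acc0 += (double)(u[i + k * m] * d[k + c * m]);
    return acc0 + acc1;              /* final horizontal reduction */
}
```

The only difference between the two operands is the address
computation: d advances by one element per iteration, while u advances
by m elements, which is what forces the element-wise loads.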

A further complication (for the cost model) is that the accumulator
is of type double while the data is of type float.  ICC uses only
half of each float vector here to handle the mixed float/double
loop (but it still unrolls the loop).
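
The lane-count mismatch behind this can be sketched in C as follows.
This is an illustration under assumptions, not ICC's actual code: it
assumes 128-bit vectors, 4-byte float and 8-byte double, uses
contiguous inputs to keep the focus on the types, and leaves out the
unrolling.  VEC_BYTES, FLOAT_LANES, DOUBLE_LANES and dot_mixed are
made-up names.

```c
#include <stddef.h>

enum { VEC_BYTES = 16 };                             /* assumed vector size */
enum { FLOAT_LANES  = VEC_BYTES / sizeof(float)  };  /* 4 with 4-byte float */
enum { DOUBLE_LANES = VEC_BYTES / sizeof(double) };  /* 2 with 8-byte double */

/* Dot product of float data with a double accumulator.  Each step
   consumes only DOUBLE_LANES floats, i.e. half of a float vector,
   because that is all one double vector can hold after widening. */
static double dot_mixed(const float *a, const float *b, size_t n)
{
    double acc[DOUBLE_LANES] = { 0.0 };
    size_t k = 0;
    for (; k + DOUBLE_LANES <= n; k += DOUBLE_LANES) {
        for (int lane = 0; lane < DOUBLE_LANES; lane++) {
            float prod = a[k + lane] * b[k + lane];  /* multiply in float */
            acc[lane] += (double)prod;               /* widen, accumulate */
        }
    }
    double sum = 0.0;
    for (int lane = 0; lane < DOUBLE_LANES; lane++)
        sum += acc[lane];                            /* reduce the lanes */
    for (; k < n; k++)
        sum += (double)(a[k] * b[k]);                /* scalar remainder */
    return sum;
}
```

With these sizes the multiply could use four float lanes but the
accumulate only two double lanes, so a cost model has to weigh
half-used float vectors against the extra conversions needed to keep
full float vectors busy.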