From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugs-return-458768-listarch-gcc-bugs=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 20053 invoked by alias); 19 Aug 2014 10:52:00 -0000
Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-bugs.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-help@gcc.gnu.org>
Sender: gcc-bugs-owner@gcc.gnu.org
Received: (qmail 19983 invoked by uid 48); 19 Aug 2014 10:51:54 -0000
From: "amker.cheng at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/62178] [AArch64] Performance regression on matrix matrix multiply due to r211211
Date: Tue, 19 Aug 2014 10:52:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 5.0
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: amker.cheng at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: cc
Message-ID: <bug-62178-4-9SvjbqK52g@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-62178-4@http.gcc.gnu.org/bugzilla/>
References: <bug-62178-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-08/txt/msg01265.txt.bz2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178

bin.cheng <amker.cheng at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker.cheng at gmail dot com

--- Comment #3 from bin.cheng <amker.cheng at gmail dot com> ---
I think it's a flaw in iv candidates choosing algorithm revealed by my patch.
Though r211211 does change cost of addressing modes, it doesn't change the cost
of optimal candidate set.  The problem with iv candidates choosing algorithm is
it's a heuristic one and would fail to find the optimal set for this specific
case.

In details, the only cost differences between r211210/r211211 is like below 
***************
*** 1,8 ****
  Use 1:
    cand    cost    compl.    depends on
!   1    16    1    1
    9    1    0    
!   10    4    1    1
    11    1    0    
!   12    5    0    
!   14    8    1    1
--- 1,8 ----
  Use 1:
    cand    cost    compl.    depends on
!   1    13    1    1
    9    1    0    
!   10    1    1    1
    11    1    0    
!   12    1    1    
!   14    5    1    1

The final candidates set choosed by r211210 is like below.

Initial set of candidates:
  cost: 19 (complexity 2)
  cand_cost: 10
  cand_use_cost: 5 (complexity 2)
  candidates: 11, 14
   use:0 --> iv_cand:14, cost=(4,2)
   use:1 --> iv_cand:11, cost=(1,0)
   use:2 --> iv_cand:11, cost=(0,0)
  invariants 1

Improved to:
  cost: 15 (complexity 0)
  cand_cost: 10
  cand_use_cost: 2 (complexity 0)
  candidates: 11, 13
   use:0 --> iv_cand:13, cost=(1,0)
   use:1 --> iv_cand:11, cost=(1,0)
   use:2 --> iv_cand:11, cost=(0,0)
  invariants 1

The final candidates set choosed by r211211 is like below.

Initial set of candidates:
  cost: 17 (complexity 3)
  cand_cost: 5
  cand_use_cost: 9 (complexity 3)
  candidates: 14
   use:0 --> iv_cand:14, cost=(4,2)
   use:1 --> iv_cand:14, cost=(5,1)
   use:2 --> iv_cand:14, cost=(0,0)

It is clear, r211211 doesn't change the optimal candidates set (which is
13/11).  It is the algorithm that can't find out the optimal set because it's
heuristic and would fail on this one.

With manual changes of candidates set, the diff of assembly code is like below.
*** 6,46 ****
      .type    Intmm, %function
  Intmm:
      movi    v3.2s, 0
      adrp    x6, a+128
!     adrp    x8, r+128
!     adrp    x10, r+3848
!     adrp    x9, b+128
!     adrp    x7, b+248
      add    x6, x6, :lo12:a+128
!     add    x8, x8, :lo12:r+128
!     add    x10, x10, :lo12:r+3848
!     add    x9, x9, :lo12:b+128
!     add    x7, x7, :lo12:b+248
  .L2:
!     mov    x5, x8
!     mov    x4, x8
!     mov    x3, x9
  .L4:
!     str    d3, [x4]
!     add    x2, x3, 3720
      movi    v0.2s, 0
      mov    x1, x6
!     mov    x0, x3
  .L3:
!     ldr    d1, [x0]
!     add    x0, x0, 124
      ld1r    {v2.2s}, [x1], 4
!     cmp    x0, x2
      mla    v0.2s, v2.2s, v1.2s
      bne    .L3
!     str    d0, [x5], 8
      add    x3, x3, 8
!     cmp    x3, x7
!     add    x4, x4, 8
      bne    .L4
!     add    x8, x8, 124
      add    x6, x6, 124
!     cmp    x8, x10
      bne    .L2
      ret
      .size    Intmm, .-Intmm
--- 6,42 ----
      .type    Intmm, %function
  Intmm:
      movi    v3.2s, 0
+     adrp    x4, r+128
      adrp    x6, a+128
!     adrp    x7, r+3848
!     adrp    x5, b
!     add    x4, x4, :lo12:r+128
      add    x6, x6, :lo12:a+128
!     add    x7, x7, :lo12:r+3848
!     add    x5, x5, :lo12:b
  .L2:
!     mov    x3, 0
  .L4:
!     str    d3, [x4, x3]
!     add    x0, x3, 128
      movi    v0.2s, 0
+     add    x2, x3, 3848
+     add    x2, x2, x5
      mov    x1, x6
!     add    x0, x5, x0
  .L3:
!     ldr    d1, [x0], 124
      ld1r    {v2.2s}, [x1], 4
!     cmp    x2, x0
      mla    v0.2s, v2.2s, v1.2s
      bne    .L3
!     str    d0, [x4, x3]
      add    x3, x3, 8
!     cmp    x3, 120
      bne    .L4
!     add    x4, x4, 124
      add    x6, x6, 124
!     cmp    x4, x7
      bne    .L2
      ret
      .size    Intmm, .-Intmm

You can see the inner most loop is back to optimized.  The additinal
instruction in the second loop is caused by addressing mode change, but I think
it can be fixed by enabling auto-increment addressing mode for IVOPT on
aarch64.  YES, we hasn't done that yet.


I will see how the IVOPT candidates choosing algorithm should be improved.