From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 20053 invoked by alias); 19 Aug 2014 10:52:00 -0000 Mailing-List: contact gcc-bugs-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-bugs-owner@gcc.gnu.org Received: (qmail 19983 invoked by uid 48); 19 Aug 2014 10:51:54 -0000 From: "amker.cheng at gmail dot com" To: gcc-bugs@gcc.gnu.org Subject: [Bug target/62178] [AArch64] Performance regression on matrix matrix multiply due to r211211 Date: Tue, 19 Aug 2014 10:52:00 -0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: target X-Bugzilla-Version: 5.0 X-Bugzilla-Keywords: X-Bugzilla-Severity: normal X-Bugzilla-Who: amker.cheng at gmail dot com X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: cc Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-SW-Source: 2014-08/txt/msg01265.txt.bz2 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178 bin.cheng changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |amker.cheng at gmail dot com --- Comment #3 from bin.cheng --- I think it's a flaw in iv candidates choosing algorithm revealed by my patch. Though r211211 does change cost of addressing modes, it doesn't change the cost of optimal candidate set. The problem with iv candidates choosing algorithm is it's a heuristic one and would fail to find the optimal set for this specific case. In details, the only cost differences between r211210/r211211 is like below *************** *** 1,8 **** Use 1: cand cost compl. depends on ! 1 16 1 1 9 1 0 ! 10 4 1 1 11 1 0 ! 12 5 0 ! 14 8 1 1 --- 1,8 ---- Use 1: cand cost compl. depends on ! 1 13 1 1 9 1 0 ! 10 1 1 1 11 1 0 ! 12 1 1 ! 14 5 1 1 The final candidates set choosed by r211210 is like below. Initial set of candidates: cost: 19 (complexity 2) cand_cost: 10 cand_use_cost: 5 (complexity 2) candidates: 11, 14 use:0 --> iv_cand:14, cost=(4,2) use:1 --> iv_cand:11, cost=(1,0) use:2 --> iv_cand:11, cost=(0,0) invariants 1 Improved to: cost: 15 (complexity 0) cand_cost: 10 cand_use_cost: 2 (complexity 0) candidates: 11, 13 use:0 --> iv_cand:13, cost=(1,0) use:1 --> iv_cand:11, cost=(1,0) use:2 --> iv_cand:11, cost=(0,0) invariants 1 The final candidates set choosed by r211211 is like below. Initial set of candidates: cost: 17 (complexity 3) cand_cost: 5 cand_use_cost: 9 (complexity 3) candidates: 14 use:0 --> iv_cand:14, cost=(4,2) use:1 --> iv_cand:14, cost=(5,1) use:2 --> iv_cand:14, cost=(0,0) It is clear, r211211 doesn't change the optimal candidates set (which is 13/11). It is the algorithm that can't find out the optimal set because it's heuristic and would fail on this one. With manual changes of candidates set, the diff of assembly code is like below. *** 6,46 **** .type Intmm, %function Intmm: movi v3.2s, 0 adrp x6, a+128 ! adrp x8, r+128 ! adrp x10, r+3848 ! adrp x9, b+128 ! adrp x7, b+248 add x6, x6, :lo12:a+128 ! add x8, x8, :lo12:r+128 ! add x10, x10, :lo12:r+3848 ! add x9, x9, :lo12:b+128 ! add x7, x7, :lo12:b+248 .L2: ! mov x5, x8 ! mov x4, x8 ! mov x3, x9 .L4: ! str d3, [x4] ! add x2, x3, 3720 movi v0.2s, 0 mov x1, x6 ! mov x0, x3 .L3: ! ldr d1, [x0] ! add x0, x0, 124 ld1r {v2.2s}, [x1], 4 ! cmp x0, x2 mla v0.2s, v2.2s, v1.2s bne .L3 ! str d0, [x5], 8 add x3, x3, 8 ! cmp x3, x7 ! add x4, x4, 8 bne .L4 ! add x8, x8, 124 add x6, x6, 124 ! cmp x8, x10 bne .L2 ret .size Intmm, .-Intmm --- 6,42 ---- .type Intmm, %function Intmm: movi v3.2s, 0 + adrp x4, r+128 adrp x6, a+128 ! adrp x7, r+3848 ! adrp x5, b ! add x4, x4, :lo12:r+128 add x6, x6, :lo12:a+128 ! add x7, x7, :lo12:r+3848 ! add x5, x5, :lo12:b .L2: ! mov x3, 0 .L4: ! str d3, [x4, x3] ! add x0, x3, 128 movi v0.2s, 0 + add x2, x3, 3848 + add x2, x2, x5 mov x1, x6 ! add x0, x5, x0 .L3: ! ldr d1, [x0], 124 ld1r {v2.2s}, [x1], 4 ! cmp x2, x0 mla v0.2s, v2.2s, v1.2s bne .L3 ! str d0, [x4, x3] add x3, x3, 8 ! cmp x3, 120 bne .L4 ! add x4, x4, 124 add x6, x6, 124 ! cmp x4, x7 bne .L2 ret .size Intmm, .-Intmm You can see the inner most loop is back to optimized. The additinal instruction in the second loop is caused by addressing mode change, but I think it can be fixed by enabling auto-increment addressing mode for IVOPT on aarch64. YES, we hasn't done that yet. I will see how the IVOPT candidates choosing algorithm should be improved.