public inbox for gcc-bugs@sourceware.org
* [Bug target/62178] New: [AArch64] Performance regression on matrix matrix multiply due to r211211
From: spop at gcc dot gnu.org @ 2014-08-18 22:47 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178

            Bug ID: 62178
           Summary: [AArch64] Performance regression on matrix matrix
                    multiply due to r211211
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: spop at gcc dot gnu.org

int a[30 +1][30 +1], b[30 +1][30 +1], r[30 +1][30 +1];

void Intmm (int run) {
  int i, j, k;
  for ( i = 1; i <= 30; i++ )
    for ( j = 1; j <= 30; j++ ) {
      r[i][j] = 0;
      for(k = 1; k <= 30; k++ )
        r[i][j] += a[i][k]*b[k][j];
    }
}

Compile this at -O3 with the last good compiler (r211210) and with the first
bad compiler (r211211), then diff the assembly:

--- good.s 2014-08-18 17:44:26.179506000 -0500
+++ bad.s 2014-08-18 17:44:26.213807000 -0500
@@ -6,45 +6,44 @@
         .type Intmm, %function
 Intmm:
         movi v3.2s, 0
-        adrp x6, a+128
-        adrp x8, r+128
-        adrp x10, r+3848
-        adrp x9, b+128
-        adrp x7, b+248
-        add x6, x6, :lo12:a+128
-        add x8, x8, :lo12:r+128
-        add x10, x10, :lo12:r+3848
-        add x9, x9, :lo12:b+128
-        add x7, x7, :lo12:b+248
+        adrp x6, r+128
+        adrp x4, a+124
+        adrp x8, r+3848
+        adrp x7, b
+        add x6, x6, :lo12:r+128
+        add x4, x4, :lo12:a+124
+        add x8, x8, :lo12:r+3848
+        add x7, x7, :lo12:b
 .L2:
-        mov x5, x8
-        mov x4, x8
-        mov x3, x9
+        mov x5, 0
 .L4:
-        str d3, [x4]
-        add x2, x3, 3720
-        movi v0.2s, 0
-        mov x1, x6
-        mov x0, x3
+        str d3, [x6, x5]
+        add x3, x5, 128
+        movi v1.2s, 0
+        add x3, x3, x7
+        mov x0, 0
 .L3:
-        ldr d1, [x0]
-        add x0, x0, 124
-        ld1r {v2.2s}, [x1], 4
-        cmp x0, x2
-        mla v0.2s, v2.2s, v1.2s
+        add x1, x4, x0
+        lsl x2, x0, 5
+        sub x2, x2, x0
+        add x0, x0, 4
+        cmp x0, 120
+        ldr w1, [x1, 4]
+        ldr d2, [x3, x2]
+        dup v0.2s, w1
+        mla v1.2s, v0.2s, v2.2s
         bne .L3
-        str d0, [x5], 8
-        add x3, x3, 8
-        cmp x3, x7
-        add x4, x4, 8
+        str d1, [x6, x5]
+        add x5, x5, 8
+        cmp x5, 120
         bne .L4
-        add x8, x8, 124
         add x6, x6, 124
-        cmp x8, x10
+        add x4, x4, 124
+        cmp x6, x8
         bne .L2
         ret
         .size Intmm, .-Intmm
         .comm r,3844,8
         .comm b,3844,8
         .comm a,3844,8

Note that the innermost loop .L3 contains four more instructions with the bad
compiler (ten instead of six), due to more scalar computations for the
addressing modes:

 .L3:
-        ldr d1, [x0]
-        add x0, x0, 124
-        ld1r {v2.2s}, [x1], 4
-        cmp x0, x2
-        mla v0.2s, v2.2s, v1.2s
+        add x1, x4, x0
+        lsl x2, x0, 5
+        sub x2, x2, x0
+        add x0, x0, 4
+        cmp x0, 120
+        ldr w1, [x1, 4]
+        ldr d2, [x3, x2]
+        dup v0.2s, w1
+        mla v1.2s, v0.2s, v2.2s
         bne .L3
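The difference between the two inner loops can also be sketched at the C level. The following is only an illustration under stated assumptions, not compiler output: the names `dot_bump` and `dot_offset` are hypothetical, and the code is scalar where the generated code is vectorized. `dot_bump` mirrors the good loop (two pointers, each advanced by a constant), while `dot_offset` mirrors the bad loop (a single byte counter from which the column offset is recomputed as `off*32 - off`, matching the `lsl`/`sub` pair).

```c
#include <assert.h>
#include <stddef.h>

#define N 31                      /* 30 + 1, as in the test case */

int a[N][N], b[N][N];

/* Pointer-bumping form: one pointer per array, each advanced by a constant. */
static int dot_bump(int i, int j)
{
    const int *pa = &a[i][1];     /* walks a row: +1 int per step */
    const int *pb = &b[1][j];     /* walks a column: +N ints (124 bytes) per step */
    int sum = 0;
    for (int k = 1; ; k++) {
        sum += *pa * *pb;
        if (k == 30)
            break;                /* stop before stepping past the arrays */
        pa += 1;
        pb += N;
    }
    return sum;
}

/* Single-counter form: one byte offset drives both accesses, so the
   column offset has to be recomputed from it on every iteration. */
static int dot_offset(int i, int j)
{
    const char *base_a = (const char *)a + ((size_t)i * N + 1) * sizeof(int);
    const char *base_b = (const char *)b + ((size_t)N + (size_t)j) * sizeof(int);
    int sum = 0;
    for (size_t off = 0; off != 120; off += 4) {
        size_t col = (off << 5) - off;      /* off * 31, i.e. 124 bytes per k */
        sum += *(const int *)(base_a + off) * *(const int *)(base_b + col);
    }
    return sum;
}
```

Both functions compute the same dot product of row `i` of `a` with column `j` of `b`; the second simply spends extra scalar arithmetic per iteration on address computation, which is the effect visible in the .L3 diff.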
* [Bug target/62178] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: pinskia at gcc dot gnu.org @ 2014-08-18 22:58 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Even -fno-ivopts produces better code:

.L3:
        add x0, x5, x1, sxtw
        add w1, w1, 1
        ldr d2, [x3], 124
        ldr w0, [x4, x0, lsl 2]
        dup v0.2s, w0
        mla v1.2s, v0.2s, v2.2s
        subs w2, w2, #1
        bne .L3

Compared with:

.L3:
        lsl x2, x0, 5
        add x1, x0, x4
        sub x2, x2, x0
        add x1, x1, x5
        ldr d2, [x3, x2]
        add x0, x0, 4
        ldr w1, [x1, 4]
        dup v0.2s, w1
        mla v1.2s, v0.2s, v2.2s
        cmp x0, 120
        bne .L3

But I think the main reason for the performance regression is:

        ldr w1, [x1, 4]
        dup v0.2s, w1

If the compiler had used ld1r instead, the performance would be back to normal.
* [Bug target/62178] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: amker.cheng at gmail dot com @ 2014-08-19 10:52 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178

bin.cheng <amker.cheng at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker.cheng at gmail dot com

--- Comment #3 from bin.cheng <amker.cheng at gmail dot com> ---
I think this is a flaw in the IV candidate selection algorithm that my patch
exposed. Although r211211 does change the cost of addressing modes, it doesn't
change the cost of the optimal candidate set. The problem is that the selection
algorithm is heuristic and fails to find the optimal set in this specific case.

In detail, the only cost differences between r211210 and r211211 are shown
below:

***************
*** 1,8 ****
  Use 1:
    cand  cost  compl.  depends on
!   1     16    1       1
    9     1     0
!   10    4     1       1
    11    1     0
!   12    5     0
!   14    8     1       1
--- 1,8 ----
  Use 1:
    cand  cost  compl.  depends on
!   1     13    1       1
    9     1     0
!   10    1     1       1
    11    1     0
!   12    1     1
!   14    5     1       1

The final candidate set chosen by r211210 is shown below.

Initial set of candidates:
  cost: 19 (complexity 2)
  cand_cost: 10
  cand_use_cost: 5 (complexity 2)
  candidates: 11, 14
   use:0 --> iv_cand:14, cost=(4,2)
   use:1 --> iv_cand:11, cost=(1,0)
   use:2 --> iv_cand:11, cost=(0,0)
  invariants 1

Improved to:
  cost: 15 (complexity 0)
  cand_cost: 10
  cand_use_cost: 2 (complexity 0)
  candidates: 11, 13
   use:0 --> iv_cand:13, cost=(1,0)
   use:1 --> iv_cand:11, cost=(1,0)
   use:2 --> iv_cand:11, cost=(0,0)
  invariants 1

The final candidate set chosen by r211211 is shown below.

Initial set of candidates:
  cost: 17 (complexity 3)
  cand_cost: 5
  cand_use_cost: 9 (complexity 3)
  candidates: 14
   use:0 --> iv_cand:14, cost=(4,2)
   use:1 --> iv_cand:14, cost=(5,1)
   use:2 --> iv_cand:14, cost=(0,0)

Clearly, r211211 doesn't change the optimal candidate set (which is 13/11). It
is the algorithm that can't find the optimal set, because it is heuristic and
fails on this case. With the candidate set changed manually, the diff of the
assembly code is as follows.

*** 6,46 ****
          .type Intmm, %function
  Intmm:
          movi v3.2s, 0
          adrp x6, a+128
!         adrp x8, r+128
!         adrp x10, r+3848
!         adrp x9, b+128
!         adrp x7, b+248
          add x6, x6, :lo12:a+128
!         add x8, x8, :lo12:r+128
!         add x10, x10, :lo12:r+3848
!         add x9, x9, :lo12:b+128
!         add x7, x7, :lo12:b+248
  .L2:
!         mov x5, x8
!         mov x4, x8
!         mov x3, x9
  .L4:
!         str d3, [x4]
!         add x2, x3, 3720
          movi v0.2s, 0
          mov x1, x6
!         mov x0, x3
  .L3:
!         ldr d1, [x0]
!         add x0, x0, 124
          ld1r {v2.2s}, [x1], 4
!         cmp x0, x2
          mla v0.2s, v2.2s, v1.2s
          bne .L3
!         str d0, [x5], 8
          add x3, x3, 8
!         cmp x3, x7
!         add x4, x4, 8
          bne .L4
!         add x8, x8, 124
          add x6, x6, 124
!         cmp x8, x10
          bne .L2
          ret
          .size Intmm, .-Intmm
--- 6,42 ----
          .type Intmm, %function
  Intmm:
          movi v3.2s, 0
+         adrp x4, r+128
          adrp x6, a+128
!         adrp x7, r+3848
!         adrp x5, b
!         add x4, x4, :lo12:r+128
          add x6, x6, :lo12:a+128
!         add x7, x7, :lo12:r+3848
!         add x5, x5, :lo12:b
  .L2:
!         mov x3, 0
  .L4:
!         str d3, [x4, x3]
!         add x0, x3, 128
          movi v0.2s, 0
+         add x2, x3, 3848
+         add x2, x2, x5
          mov x1, x6
!         add x0, x5, x0
  .L3:
!         ldr d1, [x0], 124
          ld1r {v2.2s}, [x1], 4
!         cmp x2, x0
          mla v0.2s, v2.2s, v1.2s
          bne .L3
!         str d0, [x4, x3]
          add x3, x3, 8
!         cmp x3, 120
          bne .L4
!         add x4, x4, 124
          add x6, x6, 124
!         cmp x4, x7
          bne .L2
          ret
          .size Intmm, .-Intmm

You can see that the innermost loop is optimized again. The additional
instruction in the second loop is caused by the addressing mode change, but I
think it can be fixed by enabling auto-increment addressing modes for IVOPT on
aarch64. Yes, we haven't done that yet.

I will see how the IVOPT candidate selection algorithm should be improved.
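The way a heuristic search can stop short of the optimal candidate set can be illustrated with a small toy model in C. This is a sketch under loud assumptions and not the IVOPT implementation: `set_cost`, `improve_once`, `replace_once` and `local_search` are hypothetical names, the cost model is simplified (so its totals do not match the 15 and 17 in the dumps above), and the cost table only partly echoes the dump, with the missing entries invented. It shows a search that only adds or removes one candidate at a time getting stuck on the single-candidate set, and a compound "replace" move reaching the cheaper two-candidate set.

```c
#include <assert.h>
#include <limits.h>

#define NCAND 3   /* toy candidates 0, 1, 2 stand in for "11", "13", "14" */
#define NUSE  3

/* Hypothetical costs: some entries echo the dump, the rest are invented. */
static const int cand_cost[NCAND] = {5, 5, 5};
static const int use_cost[NUSE][NCAND] = {
    {9, 1, 4},   /* use 0 */
    {1, 6, 5},   /* use 1 */
    {0, 6, 0},   /* use 2 */
};

/* Cost of a candidate set (bitmask): each chosen candidate pays its own
   cost, and each use is served by its cheapest chosen candidate. */
static int set_cost(unsigned set)
{
    if (set == 0)
        return INT_MAX;                  /* an empty set serves no use */
    int total = 0;
    for (int c = 0; c < NCAND; c++)
        if (set & (1u << c))
            total += cand_cost[c];
    for (int u = 0; u < NUSE; u++) {
        int best = INT_MAX;
        for (int c = 0; c < NCAND; c++)
            if ((set & (1u << c)) && use_cost[u][c] < best)
                best = use_cost[u][c];
        total += best;
    }
    return total;
}

/* Simple move: add or remove a single candidate if that lowers the cost. */
static unsigned improve_once(unsigned set)
{
    unsigned best = set;
    for (int c = 0; c < NCAND; c++) {
        unsigned trial = set ^ (1u << c);
        if (set_cost(trial) < set_cost(best))
            best = trial;
    }
    return best;
}

/* Compound move: drop one chosen candidate and, for every use, add the
   cheapest other candidate for that use. */
static unsigned replace_once(unsigned set)
{
    unsigned best = set;
    for (int drop = 0; drop < NCAND; drop++) {
        if (!(set & (1u << drop)))
            continue;
        unsigned trial = set & ~(1u << drop);
        for (int u = 0; u < NUSE; u++) {
            int pick = -1;
            for (int c = 0; c < NCAND; c++)
                if (c != drop && (pick < 0 || use_cost[u][c] < use_cost[u][pick]))
                    pick = c;
            assert(pick >= 0);
            trial |= 1u << pick;
        }
        if (set_cost(trial) < set_cost(best))
            best = trial;
    }
    return best;
}

/* Iterate simple moves to a fixed point; optionally try the compound
   move whenever the simple moves make no progress. */
static unsigned local_search(unsigned set, int try_replace)
{
    for (;;) {
        unsigned next = improve_once(set);
        if (next == set && try_replace)
            next = replace_once(set);
        if (next == set)
            return set;
        set = next;
    }
}
```

Starting from the bitmask `4u` (only toy candidate 2, standing in for 14), `local_search(4u, 0)` stays at that set because neither adding a candidate nor removing the only one lowers the toy cost, while `local_search(4u, 1)` moves to `3u` (toy candidates 0 and 1, standing in for 11 and 13).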
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: ramana at gcc dot gnu.org @ 2014-10-28 11:27 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-10-28
                 CC|                            |ramana at gcc dot gnu.org
   Target Milestone|---                         |5.0
            Summary|[AArch64] Performance       |[5.0 regression] [AArch64]
                   |regression on matrix matrix |Performance regression on
                   |multiply due to r211211     |matrix matrix multiply due
                   |                            |to r211211
     Ever confirmed|0                           |1

--- Comment #4 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
Patch pending for review here:
https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: rguenth at gcc dot gnu.org @ 2014-11-24 13:06 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: amker.cheng at gmail dot com @ 2014-11-25 1:35 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178

--- Comment #5 from bin.cheng <amker.cheng at gmail dot com> ---
I now think the proposed patch isn't good enough. I am revisiting the
implementation to see if I can improve the existing algorithm, rather than
just adding another heuristic pass.
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: ramana at gcc dot gnu.org @ 2014-11-27 14:35 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178

Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: amker at gcc dot gnu.org @ 2014-12-18 2:54 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178

--- Comment #6 from amker at gcc dot gnu.org ---
Author: amker
Date: Thu Dec 18 02:53:42 2014
New Revision: 218855

URL: https://gcc.gnu.org/viewcvs?rev=218855&root=gcc&view=rev
Log:
    PR tree-optimization/62178
    * tree-ssa-loop-ivopts.c (cheaper_cost_with_cand): New function.
    (iv_ca_replace): New function.
    (try_improve_iv_set): New parameter try_replace_p.
    Break local optimal fixed-point by calling iv_ca_replace.
    (find_optimal_iv_set_1): Pass new argument to try_improve_iv_set.

    gcc/testsuite/ChangeLog
    PR tree-optimization/62178
    * gcc.target/aarch64/pr62178.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/aarch64/pr62178.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-loop-ivopts.c
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: amker at gcc dot gnu.org @ 2014-12-18 2:57 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178

--- Comment #7 from amker at gcc dot gnu.org ---
Should be fixed.
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: amker at gcc dot gnu.org @ 2014-12-22 10:30 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178

amker at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #8 from amker at gcc dot gnu.org ---
Close.
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: yroux at gcc dot gnu.org @ 2015-04-02 6:46 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178

--- Comment #9 from Yvan Roux <yroux at gcc dot gnu.org> ---
Author: yroux
Date: Thu Apr 2 06:45:24 2015
New Revision: 221820

URL: https://gcc.gnu.org/viewcvs?rev=221820&root=gcc&view=rev
Log:
gcc/
2015-04-02  Yvan Roux  <yvan.roux@linaro.org>

    Backport from trunk r218855.
    2014-12-18  Bin Cheng  <bin.cheng@arm.com>

    PR tree-optimization/62178
    * tree-ssa-loop-ivopts.c (cheaper_cost_with_cand): New function.
    (iv_ca_replace): New function.
    (try_improve_iv_set): New parameter try_replace_p.
    Break local optimal fixed-point by calling iv_ca_replace.
    (find_optimal_iv_set_1): Pass new argument to try_improve_iv_set.

gcc/testsuite/
2015-04-02  Yvan Roux  <yvan.roux@linaro.org>

    Backport from trunk r218855.
    2014-12-18  Bin Cheng  <bin.cheng@arm.com>

    PR tree-optimization/62178
    * gcc.target/aarch64/pr62178.c: New test.

Added:
    branches/linaro/gcc-4_9-branch/gcc/testsuite/gcc.target/aarch64/pr62178.c
Modified:
    branches/linaro/gcc-4_9-branch/gcc/ChangeLog.linaro
    branches/linaro/gcc-4_9-branch/gcc/testsuite/ChangeLog.linaro
    branches/linaro/gcc-4_9-branch/gcc/tree-ssa-loop-ivopts.c