public inbox for gcc-bugs@sourceware.org
* [Bug target/62178] New: [AArch64] Performance regression on matrix matrix multiply due to r211211
@ 2014-08-18 22:47 spop at gcc dot gnu.org
2014-08-18 22:58 ` [Bug target/62178] " pinskia at gcc dot gnu.org
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: spop at gcc dot gnu.org @ 2014-08-18 22:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178
Bug ID: 62178
Summary: [AArch64] Performance regression on matrix matrix
multiply due to r211211
Product: gcc
Version: 5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: spop at gcc dot gnu.org
int a[30 +1][30 +1], b[30 +1][30 +1], r[30 +1][30 +1];
void Intmm (int run) {
  int i, j, k;
  for ( i = 1; i <= 30; i++ )
    for ( j = 1; j <= 30; j++ ) {
      r[i][j] = 0;
      for(k = 1; k <= 30; k++ )
        r[i][j] += a[i][k]*b[k][j];
    }
}
Compile this at -O3 with the last good compiler (r211210) and with the first bad
compiler (r211211), then diff the assembly:
--- good.s 2014-08-18 17:44:26.179506000 -0500
+++ bad.s 2014-08-18 17:44:26.213807000 -0500
@@ -6,45 +6,44 @@
.type Intmm, %function
Intmm:
movi v3.2s, 0
- adrp x6, a+128
- adrp x8, r+128
- adrp x10, r+3848
- adrp x9, b+128
- adrp x7, b+248
- add x6, x6, :lo12:a+128
- add x8, x8, :lo12:r+128
- add x10, x10, :lo12:r+3848
- add x9, x9, :lo12:b+128
- add x7, x7, :lo12:b+248
+ adrp x6, r+128
+ adrp x4, a+124
+ adrp x8, r+3848
+ adrp x7, b
+ add x6, x6, :lo12:r+128
+ add x4, x4, :lo12:a+124
+ add x8, x8, :lo12:r+3848
+ add x7, x7, :lo12:b
.L2:
- mov x5, x8
- mov x4, x8
- mov x3, x9
+ mov x5, 0
.L4:
- str d3, [x4]
- add x2, x3, 3720
- movi v0.2s, 0
- mov x1, x6
- mov x0, x3
+ str d3, [x6, x5]
+ add x3, x5, 128
+ movi v1.2s, 0
+ add x3, x3, x7
+ mov x0, 0
.L3:
- ldr d1, [x0]
- add x0, x0, 124
- ld1r {v2.2s}, [x1], 4
- cmp x0, x2
- mla v0.2s, v2.2s, v1.2s
+ add x1, x4, x0
+ lsl x2, x0, 5
+ sub x2, x2, x0
+ add x0, x0, 4
+ cmp x0, 120
+ ldr w1, [x1, 4]
+ ldr d2, [x3, x2]
+ dup v0.2s, w1
+ mla v1.2s, v0.2s, v2.2s
bne .L3
- str d0, [x5], 8
- add x3, x3, 8
- cmp x3, x7
- add x4, x4, 8
+ str d1, [x6, x5]
+ add x5, x5, 8
+ cmp x5, 120
bne .L4
- add x8, x8, 124
add x6, x6, 124
- cmp x8, x10
+ add x4, x4, 124
+ cmp x6, x8
bne .L2
ret
.size Intmm, .-Intmm
.comm r,3844,8
.comm b,3844,8
.comm a,3844,8
Note that the innermost loop .L3 grows from 5 to 9 instructions (not counting
the branch) with the bad compiler, due to extra scalar computations for the
addressing modes:
.L3:
- ldr d1, [x0]
- add x0, x0, 124
- ld1r {v2.2s}, [x1], 4
- cmp x0, x2
- mla v0.2s, v2.2s, v1.2s
+ add x1, x4, x0
+ lsl x2, x0, 5
+ sub x2, x2, x0
+ add x0, x0, 4
+ cmp x0, 120
+ ldr w1, [x1, 4]
+ ldr d2, [x3, x2]
+ dup v0.2s, w1
+ mla v1.2s, v0.2s, v2.2s
bne .L3
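To make the addressing difference easier to see, here is a minimal scalar sketch in C of the two loop shapes. It is only an illustration under assumptions: the function names are hypothetical, the arrays are flattened to one dimension, integer indices stand in for the pointer registers, and the real code handles two columns per iteration in vector registers.

```c
enum { N = 31 };  /* row length in ints: the report's arrays are [30+1][30+1] */

/* Shape of the r211210 loop: two induction variables, each advanced by a
   constant add, so every address is simply "base + iv". */
int dot_two_ivs(const int *a_row, const int *b_col)
{
    int sum = 0;
    int ia = 0;                 /* walks a row:    +1 int  (4 bytes)   */
    int ib = 0;                 /* walks a column: +N ints (124 bytes) */
    while (ib != 30 * N) {      /* exit test on ib, like "cmp x0, x2"  */
        sum += a_row[ia] * b_col[ib];
        ia += 1;
        ib += N;
    }
    return sum;
}

/* Shape of the r211211 loop: a single induction variable; the column
   offset is recomputed every iteration as k * 31 = (k << 5) - k. */
int dot_one_iv(const int *a_row, const int *b_col)
{
    int sum = 0;
    for (int k = 0; k != 30; k++) {
        int ib = (k << 5) - k;  /* mirrors "lsl x2, x0, 5; sub x2, x2, x0" */
        sum += a_row[k] * b_col[ib];
    }
    return sum;
}
```

Both functions compute the same dot product; the second one simply spends extra scalar instructions per iteration deriving the column offset from the single counter.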
* [Bug target/62178] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: pinskia at gcc dot gnu.org @ 2014-08-18 22:58 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Even -fno-ivopts produces better code:
.L3:
add x0, x5, x1, sxtw
add w1, w1, 1
ldr d2, [x3], 124
ldr w0, [x4, x0, lsl 2]
dup v0.2s, w0
mla v1.2s, v0.2s, v2.2s
subs w2, w2, #1
bne .L3
Compared with:
.L3:
lsl x2, x0, 5
add x1, x0, x4
sub x2, x2, x0
add x1, x1, x5
ldr d2, [x3, x2]
add x0, x0, 4
ldr w1, [x1, 4]
dup v0.2s, w1
mla v1.2s, v0.2s, v2.2s
cmp x0, 120
bne .L3
But I think the main reason for the performance regression is:
ldr w1, [x1, 4]
dup v0.2s, w1
If the compiler had used ld1r instead, the performance would be back to normal.
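As a rough illustration of what the load-and-broadcast step computes, here is a small sketch using GCC/Clang vector extensions. The helper names are hypothetical, and nothing here guarantees which instructions a given compiler emits: whether load_dup becomes a single ld1r or an ldr followed by a dup is exactly the code generation choice discussed above. With NEON intrinsics the broadcast load would usually be written as vld1_dup_s32.

```c
#include <string.h>

/* Two 32-bit lanes, matching the .2s register arrangement in the report. */
typedef int v2si __attribute__((vector_size(8)));

/* Load one int and replicate it into both lanes. */
static v2si load_dup(const int *p)
{
    int x = *p;
    v2si v = { x, x };
    return v;
}

/* Load two adjacent ints into the two lanes. */
static v2si load_pair(const int *p)
{
    v2si v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* acc += a_row[k] * { b_col[k*stride], b_col[k*stride + 1] } for k in [0, n):
   one multiply-accumulate per iteration, two output columns at a time. */
static v2si dot2(const int *a_row, const int *b_col, int stride, int n)
{
    v2si acc = { 0, 0 };
    for (int k = 0; k < n; k++)
        acc += load_dup(&a_row[k]) * load_pair(&b_col[k * stride]);
    return acc;
}
```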
* [Bug target/62178] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: amker.cheng at gmail dot com @ 2014-08-19 10:52 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178
bin.cheng <amker.cheng at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |amker.cheng at gmail dot com
--- Comment #3 from bin.cheng <amker.cheng at gmail dot com> ---
I think this is a flaw in the IV candidate selection algorithm that my patch
exposed. Although r211211 does change the cost of addressing modes, it doesn't
change the cost of the optimal candidate set. The problem is that the selection
algorithm is heuristic, and it fails to find the optimal set in this specific
case.
In detail, the only cost differences between r211210 and r211211 are shown below:
***************
*** 1,8 ****
Use 1:
cand cost compl. depends on
! 1 16 1 1
9 1 0
! 10 4 1 1
11 1 0
! 12 5 0
! 14 8 1 1
--- 1,8 ----
Use 1:
cand cost compl. depends on
! 1 13 1 1
9 1 0
! 10 1 1 1
11 1 0
! 12 1 1
! 14 5 1 1
The final candidate set chosen by r211210 is shown below.
Initial set of candidates:
cost: 19 (complexity 2)
cand_cost: 10
cand_use_cost: 5 (complexity 2)
candidates: 11, 14
use:0 --> iv_cand:14, cost=(4,2)
use:1 --> iv_cand:11, cost=(1,0)
use:2 --> iv_cand:11, cost=(0,0)
invariants 1
Improved to:
cost: 15 (complexity 0)
cand_cost: 10
cand_use_cost: 2 (complexity 0)
candidates: 11, 13
use:0 --> iv_cand:13, cost=(1,0)
use:1 --> iv_cand:11, cost=(1,0)
use:2 --> iv_cand:11, cost=(0,0)
invariants 1
The final candidate set chosen by r211211 is shown below.
Initial set of candidates:
cost: 17 (complexity 3)
cand_cost: 5
cand_use_cost: 9 (complexity 3)
candidates: 14
use:0 --> iv_cand:14, cost=(4,2)
use:1 --> iv_cand:14, cost=(5,1)
use:2 --> iv_cand:14, cost=(0,0)
Clearly, r211211 doesn't change the optimal candidate set (which is candidates
11 and 13). It is the algorithm that fails to find the optimal set, because it
is heuristic and fails on this case.
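The way a heuristic search can stall on a single candidate can be illustrated with a toy model. This is a sketch under stated assumptions, not GCC's implementation: candidate indices 0, 1 and 2 stand in for iv_cand 11, 13 and 14; the candidate cost of 5 each and the use costs (4, 5, 0) and (1, 1, 0) follow the dumps above; the remaining table entries, the INF placeholder and all function names are invented for illustration; and the other terms GCC folds into the total cost are left out.

```c
enum { NCAND = 3, NUSE = 3, INF = 1000 };

/* Toy cost tables. Index 0, 1, 2 stands in for iv_cand 11, 13, 14.
   INF and the value 2 are invented; the other numbers follow the dumps. */
static const int cand_cost[NCAND] = { 5, 5, 5 };
static const int use_cost[NUSE][NCAND] = {
    { INF, 1,   4 },   /* use 0 */
    { 1,   INF, 5 },   /* use 1 */
    { 0,   2,   0 },   /* use 2 */
};

/* Cost of a candidate set (bitmask): cost of every chosen candidate plus,
   for each use, the cheapest chosen candidate that can express it. */
static int set_cost(unsigned set)
{
    int total = 0;
    for (int c = 0; c < NCAND; c++)
        if (set & (1u << c))
            total += cand_cost[c];
    for (int u = 0; u < NUSE; u++) {
        int best = INF;
        for (int c = 0; c < NCAND; c++)
            if ((set & (1u << c)) && use_cost[u][c] < best)
                best = use_cost[u][c];
        total += best;
    }
    return total;
}

/* One round of local search: try adding or removing a single candidate and
   keep the cheapest neighbour, or the set itself if nothing is cheaper. */
static unsigned improve_add_remove(unsigned set)
{
    unsigned best = set;
    for (int c = 0; c < NCAND; c++) {
        unsigned next = set ^ (1u << c);   /* add if absent, remove if present */
        if (set_cost(next) < set_cost(best))
            best = next;
    }
    return best;
}

/* Replacement move: drop `victim` and give every use its cheapest
   candidate other than `victim`. */
static unsigned replace_candidate(unsigned set, int victim)
{
    unsigned next = set & ~(1u << victim);
    for (int u = 0; u < NUSE; u++) {
        int best_c = -1;
        for (int c = 0; c < NCAND; c++) {
            if (c == victim)
                continue;
            if (best_c < 0 || use_cost[u][c] < use_cost[u][best_c])
                best_c = c;
        }
        next |= 1u << best_c;
    }
    return next;
}
```

Starting from the single-candidate set (bitmask 4, standing in for candidate 14), improve_add_remove returns the set unchanged, because each single addition raises the cost and removal leaves the uses unexpressed. replace_candidate(4, 2) jumps directly to the set of the first two candidates (standing in for 11 and 13), which is cheaper in this toy model.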
With the candidate set changed manually, the assembly diff is shown below.
*** 6,46 ****
.type Intmm, %function
Intmm:
movi v3.2s, 0
adrp x6, a+128
! adrp x8, r+128
! adrp x10, r+3848
! adrp x9, b+128
! adrp x7, b+248
add x6, x6, :lo12:a+128
! add x8, x8, :lo12:r+128
! add x10, x10, :lo12:r+3848
! add x9, x9, :lo12:b+128
! add x7, x7, :lo12:b+248
.L2:
! mov x5, x8
! mov x4, x8
! mov x3, x9
.L4:
! str d3, [x4]
! add x2, x3, 3720
movi v0.2s, 0
mov x1, x6
! mov x0, x3
.L3:
! ldr d1, [x0]
! add x0, x0, 124
ld1r {v2.2s}, [x1], 4
! cmp x0, x2
mla v0.2s, v2.2s, v1.2s
bne .L3
! str d0, [x5], 8
add x3, x3, 8
! cmp x3, x7
! add x4, x4, 8
bne .L4
! add x8, x8, 124
add x6, x6, 124
! cmp x8, x10
bne .L2
ret
.size Intmm, .-Intmm
--- 6,42 ----
.type Intmm, %function
Intmm:
movi v3.2s, 0
+ adrp x4, r+128
adrp x6, a+128
! adrp x7, r+3848
! adrp x5, b
! add x4, x4, :lo12:r+128
add x6, x6, :lo12:a+128
! add x7, x7, :lo12:r+3848
! add x5, x5, :lo12:b
.L2:
! mov x3, 0
.L4:
! str d3, [x4, x3]
! add x0, x3, 128
movi v0.2s, 0
+ add x2, x3, 3848
+ add x2, x2, x5
mov x1, x6
! add x0, x5, x0
.L3:
! ldr d1, [x0], 124
ld1r {v2.2s}, [x1], 4
! cmp x2, x0
mla v0.2s, v2.2s, v1.2s
bne .L3
! str d0, [x4, x3]
add x3, x3, 8
! cmp x3, 120
bne .L4
! add x4, x4, 124
add x6, x6, 124
! cmp x4, x7
bne .L2
ret
.size Intmm, .-Intmm
You can see that the innermost loop is back to its optimized form. The additional
instruction in the second loop is caused by the addressing mode change, but I
think it can be fixed by enabling auto-increment addressing modes for IVOPT on
aarch64. Yes, we haven't done that yet.
I will look into how the IVOPT candidate selection algorithm should be improved.
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: ramana at gcc dot gnu.org @ 2014-10-28 11:27 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178
Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Keywords| |missed-optimization
Status|UNCONFIRMED |NEW
Last reconfirmed| |2014-10-28
CC| |ramana at gcc dot gnu.org
Target Milestone|--- |5.0
Summary|[AArch64] Performance |[5.0 regression] [AArch64]
|regression on matrix matrix |Performance regression on
|multiply due to r211211 |matrix matrix multiply due
| |to r211211
Ever confirmed|0 |1
--- Comment #4 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
A patch is pending review here:
https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: rguenth at gcc dot gnu.org @ 2014-11-24 13:06 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Priority|P3 |P1
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: amker.cheng at gmail dot com @ 2014-11-25 1:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178
--- Comment #5 from bin.cheng <amker.cheng at gmail dot com> ---
I now think the proposed patch isn't good enough. I am revisiting the
implementation to see if I can improve the existing algorithm, rather than just
adding another heuristic pass.
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: ramana at gcc dot gnu.org @ 2014-11-27 14:35 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178
Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |ASSIGNED
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: amker at gcc dot gnu.org @ 2014-12-18 2:54 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178
--- Comment #6 from amker at gcc dot gnu.org ---
Author: amker
Date: Thu Dec 18 02:53:42 2014
New Revision: 218855
URL: https://gcc.gnu.org/viewcvs?rev=218855&root=gcc&view=rev
Log:
PR tree-optimization/62178
* tree-ssa-loop-ivopts.c (cheaper_cost_with_cand): New function.
(iv_ca_replace): New function.
(try_improve_iv_set): New parameter try_replace_p.
Break local optimal fixed-point by calling iv_ca_replace.
(find_optimal_iv_set_1): Pass new argument to try_improve_iv_set.
gcc/testsuite/ChangeLog
PR tree-optimization/62178
* gcc.target/aarch64/pr62178.c: New test.
Added:
trunk/gcc/testsuite/gcc.target/aarch64/pr62178.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-loop-ivopts.c
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: amker at gcc dot gnu.org @ 2014-12-18 2:57 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178
--- Comment #7 from amker at gcc dot gnu.org ---
Should be fixed.
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: amker at gcc dot gnu.org @ 2014-12-22 10:30 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178
amker at gcc dot gnu.org changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |RESOLVED
Resolution|--- |FIXED
--- Comment #8 from amker at gcc dot gnu.org ---
Close.
* [Bug middle-end/62178] [5.0 regression] [AArch64] Performance regression on matrix matrix multiply due to r211211
From: yroux at gcc dot gnu.org @ 2015-04-02 6:46 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178
--- Comment #9 from Yvan Roux <yroux at gcc dot gnu.org> ---
Author: yroux
Date: Thu Apr 2 06:45:24 2015
New Revision: 221820
URL: https://gcc.gnu.org/viewcvs?rev=221820&root=gcc&view=rev
Log:
gcc/
2015-04-02 Yvan Roux <yvan.roux@linaro.org>
Backport from trunk r218855.
2014-12-18 Bin Cheng <bin.cheng@arm.com>
PR tree-optimization/62178
* tree-ssa-loop-ivopts.c (cheaper_cost_with_cand): New function.
(iv_ca_replace): New function.
(try_improve_iv_set): New parameter try_replace_p.
Break local optimal fixed-point by calling iv_ca_replace.
(find_optimal_iv_set_1): Pass new argument to try_improve_iv_set.
gcc/testsuite/
2015-04-02 Yvan Roux <yvan.roux@linaro.org>
Backport from trunk r218855.
2014-12-18 Bin Cheng <bin.cheng@arm.com>
PR tree-optimization/62178
* gcc.target/aarch64/pr62178.c: New test.
Added:
branches/linaro/gcc-4_9-branch/gcc/testsuite/gcc.target/aarch64/pr62178.c
Modified:
branches/linaro/gcc-4_9-branch/gcc/ChangeLog.linaro
branches/linaro/gcc-4_9-branch/gcc/testsuite/ChangeLog.linaro
branches/linaro/gcc-4_9-branch/gcc/tree-ssa-loop-ivopts.c
Thread overview: 11+ messages
2014-08-18 22:47 [Bug target/62178] New: [AArch64] Performance regression on matrix matrix multiply due to r211211 spop at gcc dot gnu.org
2014-08-18 22:58 ` [Bug target/62178] " pinskia at gcc dot gnu.org
2014-08-19 10:52 ` amker.cheng at gmail dot com
2014-10-28 11:27 ` [Bug middle-end/62178] [5.0 regression] " ramana at gcc dot gnu.org
2014-11-24 13:06 ` rguenth at gcc dot gnu.org
2014-11-25 1:35 ` amker.cheng at gmail dot com
2014-11-27 14:35 ` ramana at gcc dot gnu.org
2014-12-18 2:54 ` amker at gcc dot gnu.org
2014-12-18 2:57 ` amker at gcc dot gnu.org
2014-12-22 10:30 ` amker at gcc dot gnu.org
2015-04-02 6:46 ` yroux at gcc dot gnu.org