public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug tree-optimization/55623] New: [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors
@ 2012-12-09 10:00 siarhei.siamashka at gmail dot com
2012-12-09 10:01 ` [Bug tree-optimization/55623] " siarhei.siamashka at gmail dot com
` (8 more replies)
0 siblings, 9 replies; 10+ messages in thread
From: siarhei.siamashka at gmail dot com @ 2012-12-09 10:00 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623
Bug #: 55623
Summary: [ARM] GCC should not prefer long dependency chains,
they inhibit performance on superscalar processors
Classification: Unclassified
Product: gcc
Version: 4.7.2
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: tree-optimization
AssignedTo: unassigned@gcc.gnu.org
ReportedBy: siarhei.siamashka@gmail.com
This is a missing optimization. Or in this particular case, it's more like GCC
is reversing an attempt of a programmer to optimize the code for superscalar
dual-issue processors.
$ arm-none-linux-gnueabi-gcc -O2 -mcpu=cortex-a8 -o badsched badsched.c
$ objdump -d badsched
00000000 <f1>:
0: e1a03120 lsr r3, r0, #2
4: e08330a0 add r3, r3, r0, lsr #1
8: e08331a0 add r3, r3, r0, lsr #3
c: e0833220 add r3, r3, r0, lsr #4
10: e08332a0 add r3, r3, r0, lsr #5
14: e0833320 add r3, r3, r0, lsr #6
18: e08333a0 add r3, r3, r0, lsr #7
1c: e0833420 add r3, r3, r0, lsr #8
20: e08334a0 add r3, r3, r0, lsr #9
24: e0833520 add r3, r3, r0, lsr #10
28: e08335a0 add r3, r3, r0, lsr #11
2c: e0833620 add r3, r3, r0, lsr #12
30: e08336a0 add r3, r3, r0, lsr #13
34: e0833720 add r3, r3, r0, lsr #14
38: e08337a0 add r3, r3, r0, lsr #15
3c: e0833820 add r3, r3, r0, lsr #16
40: e08338a0 add r3, r3, r0, lsr #17
44: e0833920 add r3, r3, r0, lsr #18
48: e08339a0 add r3, r3, r0, lsr #19
4c: e0833a20 add r3, r3, r0, lsr #20
50: e0833aa0 add r3, r3, r0, lsr #21
54: e0833b20 add r3, r3, r0, lsr #22
58: e0833ba0 add r3, r3, r0, lsr #23
5c: e0830c20 add r0, r3, r0, lsr #24
60: e12fff1e bx lr
00000064 <f2>:
64: e1a031a0 lsr r3, r0, #3
68: e1a02220 lsr r2, r0, #4
6c: e08330a0 add r3, r3, r0, lsr #1
70: e0822120 add r2, r2, r0, lsr #2
74: e08332a0 add r3, r3, r0, lsr #5
78: e0822320 add r2, r2, r0, lsr #6
7c: e08333a0 add r3, r3, r0, lsr #7
80: e0822420 add r2, r2, r0, lsr #8
84: e08334a0 add r3, r3, r0, lsr #9
88: e0822520 add r2, r2, r0, lsr #10
8c: e08335a0 add r3, r3, r0, lsr #11
90: e0822620 add r2, r2, r0, lsr #12
94: e08336a0 add r3, r3, r0, lsr #13
98: e0822720 add r2, r2, r0, lsr #14
9c: e08337a0 add r3, r3, r0, lsr #15
a0: e0822820 add r2, r2, r0, lsr #16
a4: e08338a0 add r3, r3, r0, lsr #17
a8: e0822920 add r2, r2, r0, lsr #18
ac: e08339a0 add r3, r3, r0, lsr #19
b0: e0822a20 add r2, r2, r0, lsr #20
b4: e0833aa0 add r3, r3, r0, lsr #21
b8: e0822b20 add r2, r2, r0, lsr #22
bc: e0833ba0 add r3, r3, r0, lsr #23
c0: e0820c20 add r0, r2, r0, lsr #24
c4: e0800003 add r0, r0, r3
c8: e12fff1e bx lr
Guess which one of these two functions will be faster?
=== Cortex-A8 @1000MHz ===
$ time ./badsched 1
real 0m2.512s
user 0m2.500s
sys 0m0.000s
$ time ./badsched 2
real 0m2.064s
user 0m2.008s
sys 0m0.008s
=== Cortex-A15 @1700MHz ===
real 0m2.786s
user 0m2.770s
sys 0m0.005s
real 0m1.451s
user 0m1.440s
sys 0m0.005s
There is a function call and loop overhead which prevents Cortex-A8 from
showing ~2x better performance in the case of using "f2" function. We can try
to mark these function as static in order to get them inlined, but in this case
the asm workaround becomes ineffective in a rather interesting way, which also
demonstrates instructions scheduling issues.
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug tree-optimization/55623] [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors
2012-12-09 10:00 [Bug tree-optimization/55623] New: [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors siarhei.siamashka at gmail dot com
@ 2012-12-09 10:01 ` siarhei.siamashka at gmail dot com
2012-12-09 10:48 ` [Bug middle-end/55623] " pinskia at gcc dot gnu.org
` (7 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: siarhei.siamashka at gmail dot com @ 2012-12-09 10:01 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623
--- Comment #1 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2012-12-09 10:00:59 UTC ---
Created attachment 28904
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28904
badsched.c
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug middle-end/55623] [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors
2012-12-09 10:00 [Bug tree-optimization/55623] New: [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors siarhei.siamashka at gmail dot com
2012-12-09 10:01 ` [Bug tree-optimization/55623] " siarhei.siamashka at gmail dot com
@ 2012-12-09 10:48 ` pinskia at gcc dot gnu.org
2012-12-09 11:19 ` siarhei.siamashka at gmail dot com
` (6 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: pinskia at gcc dot gnu.org @ 2012-12-09 10:48 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target| |arm*-*-*, aarch64-*-*
Status|UNCONFIRMED |NEW
Keywords| |missed-optimization
Last reconfirmed| |2012-12-09
Component|tree-optimization |middle-end
Ever Confirmed|0 |1
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> 2012-12-09 10:48:06 UTC ---
This is an ARM (both arm32 and arm64) specific issue due to the shifts being
"free". If you look at the mips assembly, it looks good for a dual issue
processor as it is scheduled as an add followed by a shift.
I think the issue is reassocdoes not know that shifts are free on arm.
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug middle-end/55623] [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors
2012-12-09 10:00 [Bug tree-optimization/55623] New: [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors siarhei.siamashka at gmail dot com
2012-12-09 10:01 ` [Bug tree-optimization/55623] " siarhei.siamashka at gmail dot com
2012-12-09 10:48 ` [Bug middle-end/55623] " pinskia at gcc dot gnu.org
@ 2012-12-09 11:19 ` siarhei.siamashka at gmail dot com
2012-12-09 11:22 ` siarhei.siamashka at gmail dot com
` (5 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: siarhei.siamashka at gmail dot com @ 2012-12-09 11:19 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623
--- Comment #3 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2012-12-09 11:18:56 UTC ---
(In reply to comment #2)
> This is an ARM (both arm32 and arm64) specific issue due to the shifts being
> "free". If you look at the mips assembly, it looks good for a dual issue
> processor as it is scheduled as an add followed by a shift.
>
> I think the issue is reassocdoes not know that shifts are free on arm.
This does not look like only an ARM issue. To properly demonstrate it on MIPS
and even without dual-issue, all the additions can be just changed with
multiplications (because it is a long latency instruction). In this case we
get:
unsigned int f1(unsigned int x)
{
unsigned int a, b;
a = x >> 1;
b = x >> 2;
a *= x >> 3;
b *= x >> 4;
a *= x >> 5;
b *= x >> 6;
a *= x >> 7;
b *= x >> 8;
a *= x >> 9;
b *= x >> 10;
a *= x >> 11;
b *= x >> 12;
a *= x >> 13;
b *= x >> 14;
a *= x >> 15;
b *= x >> 16;
a *= x >> 17;
b *= x >> 18;
a *= x >> 19;
b *= x >> 20;
a *= x >> 21;
b *= x >> 22;
a *= x >> 23;
b *= x >> 24;
return a * b;
}
unsigned int f2(unsigned int x)
{
unsigned int a, b;
a = x >> 1;
b = x >> 2;
a *= x >> 3;
b *= x >> 4;
a *= x >> 5;
b *= x >> 6;
a *= x >> 7;
b *= x >> 8;
a *= x >> 9;
b *= x >> 10;
a *= x >> 11;
b *= x >> 12;
a *= x >> 13;
b *= x >> 14;
a *= x >> 15;
b *= x >> 16;
a *= x >> 17;
b *= x >> 18;
a *= x >> 19;
b *= x >> 20;
a *= x >> 21;
b *= x >> 22;
a *= x >> 23;
b *= x >> 24;
asm ("" : "+r" (a));
return a * b;
}
And the benchmark run on MIPS 74K:
$ gcc -O2 -march=mips32r2 -mtune=74kc -o badschedmul badschedmul.c
$ time ./badchedmul 1
real 0m34.934s
user 0m34.689s
sys 0m0.073s
$ time ./badchedmul 2
real 0m19.261s
user 0m19.122s
sys 0m0.050s
The symptoms are still the same. GCC just merges two independent calculations
into a single dependency chain. While I would have expected it to be the other
way around (breaking dependency chains to run faster on the target CPU).
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug middle-end/55623] [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors
2012-12-09 10:00 [Bug tree-optimization/55623] New: [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors siarhei.siamashka at gmail dot com
` (2 preceding siblings ...)
2012-12-09 11:19 ` siarhei.siamashka at gmail dot com
@ 2012-12-09 11:22 ` siarhei.siamashka at gmail dot com
2012-12-09 11:36 ` steven at gcc dot gnu.org
` (4 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: siarhei.siamashka at gmail dot com @ 2012-12-09 11:22 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623
--- Comment #4 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2012-12-09 11:21:42 UTC ---
Created attachment 28905
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28905
badschedmul.c
The testcase, converted to use multiplications. Can be used to demonstrates the
problem on all architectures, even including x86-64.
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug middle-end/55623] [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors
2012-12-09 10:00 [Bug tree-optimization/55623] New: [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors siarhei.siamashka at gmail dot com
` (3 preceding siblings ...)
2012-12-09 11:22 ` siarhei.siamashka at gmail dot com
@ 2012-12-09 11:36 ` steven at gcc dot gnu.org
2012-12-09 12:13 ` steven at gcc dot gnu.org
` (3 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: steven at gcc dot gnu.org @ 2012-12-09 11:36 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623
Steven Bosscher <steven at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Target|arm*-*-*, aarch64-*-* |
--- Comment #5 from Steven Bosscher <steven at gcc dot gnu.org> 2012-12-09 11:36:39 UTC ---
Not target dependent, also shows on powerpc64.
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug middle-end/55623] [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors
2012-12-09 10:00 [Bug tree-optimization/55623] New: [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors siarhei.siamashka at gmail dot com
` (4 preceding siblings ...)
2012-12-09 11:36 ` steven at gcc dot gnu.org
@ 2012-12-09 12:13 ` steven at gcc dot gnu.org
2012-12-09 12:15 ` steven at gcc dot gnu.org
` (2 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: steven at gcc dot gnu.org @ 2012-12-09 12:13 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623
Steven Bosscher <steven at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |steven at gcc dot gnu.org
--- Comment #6 from Steven Bosscher <steven at gcc dot gnu.org> 2012-12-09 12:13:03 UTC ---
Can you try using -frename-registers?
Without vs with -frename-registers:
mov r3, r0, lsr #2 mov r3, r0, lsr #2
add r3, r3, r0, lsr #1 | add r1, r3, r0, lsr #1
add r3, r3, r0, lsr #3 | add r2, r1, r0, lsr #3
add r3, r3, r0, lsr #4 | add ip, r2, r0, lsr #4
add r3, r3, r0, lsr #5 | add r3, ip, r0, lsr #5
add r3, r3, r0, lsr #6 | add r1, r3, r0, lsr #6
add r3, r3, r0, lsr #7 | add r2, r1, r0, lsr #7
add r3, r3, r0, lsr #8 | add ip, r2, r0, lsr #8
add r3, r3, r0, lsr #9 | add r3, ip, r0, lsr #9
add r3, r3, r0, lsr #10 | add r1, r3, r0, lsr #10
add r3, r3, r0, lsr #11 | add r2, r1, r0, lsr #11
add r3, r3, r0, lsr #12 | add ip, r2, r0, lsr #12
add r3, r3, r0, lsr #13 | add r3, ip, r0, lsr #13
add r3, r3, r0, lsr #14 | add r1, r3, r0, lsr #14
add r3, r3, r0, lsr #15 | add r2, r1, r0, lsr #15
add r3, r3, r0, lsr #16 | add ip, r2, r0, lsr #16
add r3, r3, r0, lsr #17 | add r3, ip, r0, lsr #17
add r3, r3, r0, lsr #18 | add r1, r3, r0, lsr #18
add r3, r3, r0, lsr #19 | add r2, r1, r0, lsr #19
add r3, r3, r0, lsr #20 | add ip, r2, r0, lsr #20
add r3, r3, r0, lsr #21 | add r3, ip, r0, lsr #21
add r3, r3, r0, lsr #22 | add r1, r3, r0, lsr #22
add r3, r3, r0, lsr #23 | add r2, r1, r0, lsr #23
add r0, r3, r0, lsr #24 | add r0, r2, r0, lsr #24
bx lr bx lr
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug middle-end/55623] [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors
2012-12-09 10:00 [Bug tree-optimization/55623] New: [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors siarhei.siamashka at gmail dot com
` (5 preceding siblings ...)
2012-12-09 12:13 ` steven at gcc dot gnu.org
@ 2012-12-09 12:15 ` steven at gcc dot gnu.org
2012-12-10 9:47 ` rguenth at gcc dot gnu.org
2012-12-11 16:33 ` ramana at gcc dot gnu.org
8 siblings, 0 replies; 10+ messages in thread
From: steven at gcc dot gnu.org @ 2012-12-09 12:15 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623
Steven Bosscher <steven at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC|steven at gcc dot gnu.org |
--- Comment #7 from Steven Bosscher <steven at gcc dot gnu.org> 2012-12-09 12:14:50 UTC ---
(In reply to comment #6)
> Can you try using -frename-registers?
(Not that I expect it to help because the dependencies are not really
broken, but maybe the in-core reg-renamer likes this form better...)
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug middle-end/55623] [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors
2012-12-09 10:00 [Bug tree-optimization/55623] New: [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors siarhei.siamashka at gmail dot com
` (6 preceding siblings ...)
2012-12-09 12:15 ` steven at gcc dot gnu.org
@ 2012-12-10 9:47 ` rguenth at gcc dot gnu.org
2012-12-11 16:33 ` ramana at gcc dot gnu.org
8 siblings, 0 replies; 10+ messages in thread
From: rguenth at gcc dot gnu.org @ 2012-12-10 9:47 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> 2012-12-10 09:47:42 UTC ---
You can also adjust --param tree-reassoc-width or have the target implement
the sched.reassociation_width target hook (for the default).
^ permalink raw reply [flat|nested] 10+ messages in thread
* [Bug middle-end/55623] [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors
2012-12-09 10:00 [Bug tree-optimization/55623] New: [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors siarhei.siamashka at gmail dot com
` (7 preceding siblings ...)
2012-12-10 9:47 ` rguenth at gcc dot gnu.org
@ 2012-12-11 16:33 ` ramana at gcc dot gnu.org
8 siblings, 0 replies; 10+ messages in thread
From: ramana at gcc dot gnu.org @ 2012-12-11 16:33 UTC (permalink / raw)
To: gcc-bugs
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623
Ramana Radhakrishnan <ramana at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |ramana at gcc dot gnu.org
--- Comment #9 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> 2012-12-11 16:33:25 UTC ---
(In reply to comment #8)
> You can also adjust --param tree-reassoc-width or have the target implement
> the sched.reassociation_width target hook (for the default).
I did try it on an A9 when the hook came out and it made bugger all difference
to the benchmarks I cared about on a range of values.
It's possible this might help on some of the newer cores.
ramana
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2012-12-11 16:33 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-09 10:00 [Bug tree-optimization/55623] New: [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors siarhei.siamashka at gmail dot com
2012-12-09 10:01 ` [Bug tree-optimization/55623] " siarhei.siamashka at gmail dot com
2012-12-09 10:48 ` [Bug middle-end/55623] " pinskia at gcc dot gnu.org
2012-12-09 11:19 ` siarhei.siamashka at gmail dot com
2012-12-09 11:22 ` siarhei.siamashka at gmail dot com
2012-12-09 11:36 ` steven at gcc dot gnu.org
2012-12-09 12:13 ` steven at gcc dot gnu.org
2012-12-09 12:15 ` steven at gcc dot gnu.org
2012-12-10 9:47 ` rguenth at gcc dot gnu.org
2012-12-11 16:33 ` ramana at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).