public inbox for gcc-bugs@sourceware.org
* [Bug c/31396] New: Inline code performance much worse than out-of-line
@ 2007-03-29 22:15 jamagallon at ono dot com
2007-03-29 22:17 ` [Bug c/31396] " jamagallon at ono dot com
` (11 more replies)
0 siblings, 12 replies; 13+ messages in thread
From: jamagallon at ono dot com @ 2007-03-29 22:15 UTC (permalink / raw)
To: gcc-bugs
A simple function that just sums over a vector is much slower if inlined than
out of line. The out-of-line version keeps the sum in an xmm register; the inline
version keeps reading and storing the stack variable on each iteration (judging
from the assembler).
Timings on a 2.4 GHz P4 Xeon:
out-of-line:
T0: 3117.44 ms
T1: 653.93 ms
inline:
T0: 3097.05 ms
T1: 3104.18 ms
--
Summary: Inline code performance much worse than out-of-line
Product: gcc
Version: 4.1.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: jamagallon at ono dot com
GCC target triplet: i586-mandriva-linux-gnu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396
* [Bug c/31396] Inline code performance much worse than out-of-line
From: jamagallon at ono dot com @ 2007-03-29 22:17 UTC (permalink / raw)
To: gcc-bugs
------- Comment #1 from jamagallon at ono dot com 2007-03-29 23:17 -------
Created an attachment (id=13298)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13298&action=view)
testcase
Simple test case with a loop in main() and a call to a function.
Both just calculate the sum of all elements of a vector.
The code in main() is much slower than the function.
If the function is inlined (-DINLINE), it becomes equally slow.
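The attachment itself is not quoted in this thread. As a rough sketch of the kind of test described above, assuming single-precision data summed into a double accumulator (which is what the assembler in the later comments suggests), something like the following would do. The names sum_vector and fill, the value of N, and the use of the noinline attribute are illustrative assumptions, not taken from tst.c.

```c
#include <stddef.h>

/* Hypothetical stand-in for the attached tst.c, which is not quoted in the
   thread.  N is kept small here; the assembler in comment #3 suggests the
   original loop ran 268435456 times. */
#define N 1024

/* Assumption: with -DINLINE the function may be inlined into its caller;
   without it, the GCC noinline attribute keeps it out of line. */
#ifdef INLINE
#define MAYBE_INLINE static inline
#else
#define MAYBE_INLINE static __attribute__((noinline))
#endif

/* Sum single-precision elements into a double accumulator. */
MAYBE_INLINE double sum_vector(const float *data, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Fill a buffer with a known value so the result can be checked. */
static void fill(float *data, size_t n, float value)
{
    for (size_t i = 0; i < n; i++)
        data[i] = value;
}
```

Building the same file once with and once without -DINLINE and comparing the generated loops is the experiment the report describes.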
* [Bug c/31396] Inline code performance much worse than out-of-line
From: jamagallon at ono dot com @ 2007-03-29 22:18 UTC (permalink / raw)
To: gcc-bugs
------- Comment #2 from jamagallon at ono dot com 2007-03-29 23:18 -------
Created an attachment (id=13299)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13299&action=view)
Makefile for testcase
Makefile to build tst.c.
* [Bug c/31396] Inline code performance much worse than out-of-line
From: jamagallon at ono dot com @ 2007-03-29 22:23 UTC (permalink / raw)
To: gcc-bugs
------- Comment #3 from jamagallon at ono dot com 2007-03-29 23:22 -------
Sample assembler for the loops.
For the function, out of line:
#APP
#FBGN
#NO_APP
movl data, %edx
fldz
movl $1, %eax
.L2:
fadds -4(%edx,%eax,4)
addl $1, %eax
cmpl $268435457, %eax
jne .L2
#APP
#FEND
#NO_APP
For the loop in main():
.L11:
fldl -56(%ebp) <= look here
fadds -4(%edx,%eax,4)
fstpl -56(%ebp) <= and here
addl $1, %eax
cmpl $268435457, %eax
jne .L11
* [Bug middle-end/31396] Inline code performance much worse than out-of-line
From: jamagallon at ono dot com @ 2007-03-29 22:47 UTC (permalink / raw)
To: gcc-bugs
------- Comment #4 from jamagallon at ono dot com 2007-03-29 23:47 -------
Assembler for the opteron.
out-of-line:
.L2:
cvtss2sd (%rdx,%rax,4), %xmm0
incq %rax
cmpq $268435456, %rax
addsd %xmm0, %xmm1
jne .L2
inline:
.L11:
cvtss2sd (%rdx,%rax,4), %xmm0
incq %rax
cmpq $268435456, %rax
addsd 24(%rsp), %xmm0
movsd %xmm0, 24(%rsp)
jne .L11
* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
From: pinskia at gcc dot gnu dot org @ 2007-04-03 4:49 UTC (permalink / raw)
To: gcc-bugs
------- Comment #5 from pinskia at gcc dot gnu dot org 2007-04-03 05:49 -------
The same thing happens on PPC also:
L6:
lfsx f0,r2,r9
addi r2,r2,4
lfd f13,104(r1)
fadd f13,f13,f0
stfd f13,104(r1)
bdnz L6
Why are you storing to the stack? Ok, part of the problem is how we represent
vararg function passing. But I think this one issue is a regression on the
mainline only.
--
pinskia at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
Component|middle-end |rtl-optimization
Keywords| |missed-optimization, ra
* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
From: pinskia at gcc dot gnu dot org @ 2007-04-03 5:03 UTC (permalink / raw)
To: gcc-bugs
------- Comment #6 from pinskia at gcc dot gnu dot org 2007-04-03 06:03 -------
(In reply to comment #5)
> Why are you storing to the stack?
The PPC issue is only an issue on the trunk, so I filed PR 31455 for that bug.
But I bet this bug is related to some extent.
* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
From: ubizjak at gmail dot com @ 2007-04-04 7:05 UTC (permalink / raw)
To: gcc-bugs
------- Comment #7 from ubizjak at gmail dot com 2007-04-04 08:05 -------
This is the minimal test case for this bug:
--cut here--
extern void foo(void);
double *data;
double test()
{
double sum = 123.321;
int i;
for (i=0; i<4; i++)
sum += data[i];
foo();
foo();
return sum;
}
--cut here--
Without the second call to foo(), the function compiles to (gcc version 4.3.0
20070403 (experimental)):
test:
subq $24, %rsp
movq data(%rip), %rdx
movl $1, %eax
movsd .LC0(%rip), %xmm0
addsd (%rdx), %xmm0
.L2:
addsd (%rdx,%rax,8), %xmm0
addq $1, %rax
cmpq $4, %rax
jne .L2
movsd %xmm0, (%rsp)
call foo
movsd (%rsp), %xmm0
addq $24, %rsp
ret
When the second call to foo() is added, RA gets confused and pushes
the sum variable to the stack:
test:
subq $8, %rsp
movq data(%rip), %rdx
movl $1, %eax
movsd .LC0(%rip), %xmm0
addsd (%rdx), %xmm0
movsd %xmm0, (%rsp) <= here
.L2:
movsd (%rsp), %xmm0 <= here
addsd (%rdx,%rax,8), %xmm0
addq $1, %rax
cmpq $4, %rax
movsd %xmm0, (%rsp) <= here
jne .L2
call foo
call foo
movsd (%rsp), %xmm0
addq $8, %rsp
ret
--
ubizjak at gmail dot com changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |ubizjak at gmail dot com
Status|UNCONFIRMED |NEW
Ever Confirmed|0 |1
Known to fail| |4.3.0
Last reconfirmed|0000-00-00 00:00:00 |2007-04-04 08:05:01
date| |
* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
From: ubizjak at gmail dot com @ 2007-04-04 8:21 UTC (permalink / raw)
To: gcc-bugs
------- Comment #8 from ubizjak at gmail dot com 2007-04-04 09:21 -------
The difference is in the CALLER_SAVE_PROFITABLE condition. The pseudo that holds
sum is referenced 6 times. When only one foo() is called, the default
CALLER_SAVE_PROFITABLE condition causes RA to allocate a call-clobbered register
(fp and xmm regs are all call-clobbered for x86 targets). When two calls to
foo() are present, the default heuristic
#define CALLER_SAVE_PROFITABLE(REFS, CALLS) (4 * (CALLS) < (REFS))
pushes the pseudo to memory, as RA does not consider the fact that the pseudo is
used inside the loop.
The default heuristic is _wrong_. When a pseudo is accessed inside a loop, a
call-clobbered register should be allocated, no matter how many calls it
crosses.
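To make the arithmetic concrete, here is a small sketch that evaluates the macro quoted above for the two cases discussed. The helper name gets_call_clobbered_reg is hypothetical and exists only for illustration; only the macro comes from this comment.

```c
/* The default heuristic, as quoted in comment #8. */
#define CALLER_SAVE_PROFITABLE(REFS, CALLS) (4 * (CALLS) < (REFS))

/* Hypothetical helper: returns 1 if a pseudo with `refs` references that is
   live across `calls` calls passes the heuristic (and so may be given a
   call-clobbered register), 0 if the heuristic rejects it. */
static int gets_call_clobbered_reg(int refs, int calls)
{
    return CALLER_SAVE_PROFITABLE(refs, calls);
}
```

With 6 references and one call, 4 * 1 = 4 < 6 holds and the register is used. With two calls, 4 * 2 = 8 < 6 fails, so the pseudo goes to memory even though two of its references sit in the loop.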
This can be confirmed by changing the "double" keyword to "int" in the example of
comment #7. gcc now chooses the ebx register (call-preserved) and the loop
compiles to the expected tight sequence:
test:
pushl %ebp
movl %esp, %ebp
pushl %ebx
subl $4, %esp
movl data, %edx
movl (%edx), %eax
leal 123(%eax), %ebx
movl $2, %eax
.L2:
addl -4(%edx,%eax,4), %ebx
addl $1, %eax
cmpl $5, %eax
jne .L2
call foo
call foo
movl %ebx, %eax
addl $4, %esp
popl %ebx
popl %ebp
ret
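For reference, the int variant of the comment #7 test case would look roughly like this. The initializer 123.321 truncates to 123 when sum is an int, which matches the leal 123(%eax) in the listing. The foo() stub and its counter are assumptions added only so that the file builds on its own. In a real experiment foo should stay extern in a separate translation unit, otherwise the compiler can see through the calls and the register allocation may differ.

```c
/* Hypothetical stub so the example links standalone; in the thread, foo()
   is an extern function defined elsewhere. */
static int foo_calls;
void foo(void) { foo_calls++; }

int *data;

/* Comment #7's test case with "double" replaced by "int". */
int test(void)
{
    int sum = 123;          /* 123.321 truncated to int */
    int i;
    for (i = 0; i < 4; i++)
        sum += data[i];
    foo();
    foo();
    return sum;
}
```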
* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
From: hubicka at gcc dot gnu dot org @ 2008-01-12 19:14 UTC (permalink / raw)
To: gcc-bugs
------- Comment #9 from hubicka at gcc dot gnu dot org 2008-01-12 19:00 -------
Created an attachment (id=14930)
--> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14930&action=view)
tentative fix
I am testing the attached patch. It is obvious that we should use the profile
here. The PR is most likely a regression from 2.95, which used to multiply
n_refs by 3 inside loops.
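A rough illustration of why the old loop weighting would have avoided the problem: the sketch below counts references inside a loop three times, as described above for 2.95. The function name weighted_refs is hypothetical, and the split of the six references into four outside the loop and two inside is an assumption based on reading the comment #7 test case, not something stated in the thread.

```c
/* The heuristic as quoted in comment #8. */
#define CALLER_SAVE_PROFITABLE(REFS, CALLS) (4 * (CALLS) < (REFS))

/* Hypothetical: count each reference inside a loop three times, in the
   spirit of the 2.95 behaviour recalled in comment #9. */
static int weighted_refs(int refs_outside_loop, int refs_inside_loop)
{
    return refs_outside_loop + 3 * refs_inside_loop;
}
```

Under that assumed split, the weighted count is 4 + 3 * 2 = 10, and 4 * 2 = 8 < 10 holds for two calls, so the pseudo would keep its register. The unweighted count of 6 fails the same test.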
* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
From: hubicka at gcc dot gnu dot org @ 2008-01-16 17:20 UTC (permalink / raw)
To: gcc-bugs
------- Comment #10 from hubicka at gcc dot gnu dot org 2008-01-16 16:32 -------
Subject: Bug 31396
Author: hubicka
Date: Wed Jan 16 16:32:05 2008
New Revision: 131576
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=131576
Log:
PR rtl-optimization/31396
* regstat.c (regstat_bb_compute_ri): Compute FREQ_CALLS_CROSSED.
* cfg.c (dump_reg_info): Print it.
* regs.h (struct reg_info_t): add freq_calls_crossed.
(REG_FREQ_CALLS_CROSSED): New macro.
* global.c (global_alloc): Compute freq_calls_crossed for allocno.
(find_reg): Update call of CALLER_SAVE_PROFITABLE.
* regmove.c (optimize_reg_copy_1, optimize_reg_copy_2, fixup_match_2,
regmove_optimize): Update call crossed frequencies.
* local-alloc.c (struct qty): Add freq_calls_crossed.
(alloc_qty): Compute freq_calls_crossed.
(update_equiv_regs, combine_regs): Update REG_FREQ_CALLS_CROSSED.
(find_free_reg): Update call of CALLER_SAVE_PROFITABLE.
* ra.h (struct allocno): Add freq_calls_crossed.
Modified:
trunk/gcc/ChangeLog
trunk/gcc/cfg.c
trunk/gcc/global.c
trunk/gcc/local-alloc.c
trunk/gcc/ra.h
trunk/gcc/regmove.c
trunk/gcc/regs.h
trunk/gcc/regstat.c
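The patch itself is not quoted here, so the following is only a sketch of the general idea behind a frequency-weighted "calls crossed" count, not the committed code. The helper sum_freq and all frequency values are made up for illustration. Whether the real change keeps the factor 4, and how it scales frequencies, is not shown in this thread.

```c
#include <stddef.h>

/* Same shape as the heuristic quoted in comment #8. */
#define CALLER_SAVE_PROFITABLE(REFS, CALLS) (4 * (CALLS) < (REFS))

/* Hypothetical helper: add up the execution frequencies of the basic
   blocks in which each event (a reference or a crossed call) occurs. */
static int sum_freq(const int *freqs, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += freqs[i];
    return total;
}
```

With made-up frequencies of 1 for straight-line code and 4 for the loop body, four references outside the loop and two inside sum to 12, and two calls outside the loop sum to 2. Then 4 * 2 = 8 < 12 holds, so weighting both sides by frequency would favour keeping the pseudo in a register.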
* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
From: hubicka at gcc dot gnu dot org @ 2008-01-16 17:26 UTC (permalink / raw)
To: gcc-bugs
------- Comment #11 from hubicka at gcc dot gnu dot org 2008-01-16 16:33 -------
Fixed on mainline.
--
hubicka at gcc dot gnu dot org changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
From: ubizjak at gmail dot com @ 2008-01-18 8:59 UTC (permalink / raw)
To: gcc-bugs
------- Comment #12 from ubizjak at gmail dot com 2008-01-18 07:12 -------
Part of the problem described here is caused by PR 23322.
--
ubizjak at gmail dot com changed:
What |Removed |Added
----------------------------------------------------------------------------
BugsThisDependsOn| |23322