[Bug c/31396] New: Inline code performance much worse than out-of-line

public inbox for gcc-bugs@sourceware.org

* [Bug c/31396]  New: Inline code performance much worse than out-of-line
@ 2007-03-29 22:15 jamagallon at ono dot com
  2007-03-29 22:17 ` [Bug c/31396] " jamagallon at ono dot com
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: jamagallon at ono dot com @ 2007-03-29 22:15 UTC (permalink / raw)
  To: gcc-bugs

A simple function that just sums over a vector is much slower when inlined
than when called out of line. The out-of-line version keeps the sum in an xmm
register; the inline version reloads and stores the stack variable on every
iteration (judging from the assembler output).

Timings on a 2.4 GHz P4 Xeon:
out-of-line:
T0: 3117.44 ms
T1: 653.93 ms
inline:
T0: 3097.05 ms
T1: 3104.18 ms


-- 
           Summary: Inline code performance much worse than out-of-line
           Product: gcc
           Version: 4.1.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: jamagallon at ono dot com
GCC target triplet: i586-mandriva-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396



* [Bug c/31396] Inline code performance much worse than out-of-line
  2007-03-29 22:15 [Bug c/31396] New: Inline code performance much worse than out-of-line jamagallon at ono dot com
@ 2007-03-29 22:17 ` jamagallon at ono dot com
  2007-03-29 22:18 ` jamagallon at ono dot com
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: jamagallon at ono dot com @ 2007-03-29 22:17 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from jamagallon at ono dot com  2007-03-29 23:17 -------
Created an attachment (id=13298)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13298&action=view)
testcase

Simple test case with a loop in main() and a call to a function.
Both just calculate the sum of all elements of a vector.
The code in main() is much slower than the function.
If the function is inlined (-DINLINE), it becomes equally slow.
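
The attachment itself is not reproduced in this thread. A minimal sketch of what such a test case might look like follows, assuming a float vector summed into a double accumulator (as the assembler in later comments suggests). The names vec_sum, sum_in_caller and SUM_LINKAGE are hypothetical and not taken from tst.c; timing code is omitted.

```c
#include <stddef.h>

/* Hypothetical reconstruction: the real tst.c is not shown in the thread. */
#ifdef INLINE
#define SUM_LINKAGE static inline                /* let GCC inline the helper */
#else
#define SUM_LINKAGE __attribute__((noinline))    /* force an out-of-line call */
#endif

/* The "function" variant: sums n floats into a double accumulator. */
SUM_LINKAGE double vec_sum(const float *v, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Stand-in for the loop written directly in main(): same reduction. */
double sum_in_caller(const float *v, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += v[i];
    return sum;
}
```

Both variants compute the same value; the report concerns only how the accumulator is allocated in the generated code.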


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396



* [Bug c/31396] Inline code performance much worse than out-of-line
  2007-03-29 22:15 [Bug c/31396] New: Inline code performance much worse than out-of-line jamagallon at ono dot com
  2007-03-29 22:17 ` [Bug c/31396] " jamagallon at ono dot com
@ 2007-03-29 22:18 ` jamagallon at ono dot com
  2007-03-29 22:23 ` jamagallon at ono dot com
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: jamagallon at ono dot com @ 2007-03-29 22:18 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from jamagallon at ono dot com  2007-03-29 23:18 -------
Created an attachment (id=13299)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=13299&action=view)
Makefile for testcase

Makefile to build tst.c.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396



* [Bug c/31396] Inline code performance much worse than out-of-line
  2007-03-29 22:15 [Bug c/31396] New: Inline code performance much worse than out-of-line jamagallon at ono dot com
  2007-03-29 22:17 ` [Bug c/31396] " jamagallon at ono dot com
  2007-03-29 22:18 ` jamagallon at ono dot com
@ 2007-03-29 22:23 ` jamagallon at ono dot com
  2007-03-29 22:47 ` [Bug middle-end/31396] " jamagallon at ono dot com
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: jamagallon at ono dot com @ 2007-03-29 22:23 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from jamagallon at ono dot com  2007-03-29 23:22 -------
Sample assembler for the loops.
For the function, out of line:

#APP
    #FBGN
#NO_APP
    movl    data, %edx
    fldz
    movl    $1, %eax
.L2:
    fadds   -4(%edx,%eax,4)
    addl    $1, %eax
    cmpl    $268435457, %eax
    jne .L2
#APP
    #FEND   
#NO_APP

For the loop in main():

.L11:
    fldl    -56(%ebp)         <= look here
    fadds   -4(%edx,%eax,4)
    fstpl   -56(%ebp)         <= and here
    addl    $1, %eax
    cmpl    $268435457, %eax
    jne .L11


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396



* [Bug middle-end/31396] Inline code performance much worse than out-of-line
  2007-03-29 22:15 [Bug c/31396] New: Inline code performance much worse than out-of-line jamagallon at ono dot com
                   ` (2 preceding siblings ...)
  2007-03-29 22:23 ` jamagallon at ono dot com
@ 2007-03-29 22:47 ` jamagallon at ono dot com
  2007-04-03  4:49 ` [Bug rtl-optimization/31396] " pinskia at gcc dot gnu dot org
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: jamagallon at ono dot com @ 2007-03-29 22:47 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from jamagallon at ono dot com  2007-03-29 23:47 -------
Assembler for the opteron.

out-of-line:
.L2:
    cvtss2sd    (%rdx,%rax,4), %xmm0
    incq    %rax
    cmpq    $268435456, %rax
    addsd   %xmm0, %xmm1
    jne .L2

inline:

.L11:
    cvtss2sd    (%rdx,%rax,4), %xmm0
    incq    %rax
    cmpq    $268435456, %rax
    addsd   24(%rsp), %xmm0
    movsd   %xmm0, 24(%rsp)
    jne .L11


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396



* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
  2007-03-29 22:15 [Bug c/31396] New: Inline code performance much worse than out-of-line jamagallon at ono dot com
                   ` (3 preceding siblings ...)
  2007-03-29 22:47 ` [Bug middle-end/31396] " jamagallon at ono dot com
@ 2007-04-03  4:49 ` pinskia at gcc dot gnu dot org
  2007-04-03  5:03 ` pinskia at gcc dot gnu dot org
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2007-04-03  4:49 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from pinskia at gcc dot gnu dot org  2007-04-03 05:49 -------
The same thing happens on PPC also:
L6:
        lfsx f0,r2,r9
        addi r2,r2,4
        lfd f13,104(r1)
        fadd f13,f13,f0
        stfd f13,104(r1)
        bdnz L6

Why are you storing to the stack?  OK, part of the problem is how we represent
vararg function passing, but I think this particular issue is a regression on
the mainline only.


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |rtl-optimization
           Keywords|                            |missed-optimization, ra


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396



* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
  2007-03-29 22:15 [Bug c/31396] New: Inline code performance much worse than out-of-line jamagallon at ono dot com
                   ` (4 preceding siblings ...)
  2007-04-03  4:49 ` [Bug rtl-optimization/31396] " pinskia at gcc dot gnu dot org
@ 2007-04-03  5:03 ` pinskia at gcc dot gnu dot org
  2007-04-04  7:05 ` ubizjak at gmail dot com
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2007-04-03  5:03 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #6 from pinskia at gcc dot gnu dot org  2007-04-03 06:03 -------
(In reply to comment #5)
> Why are you storing to the stack?  
The PPC issue occurs only on the trunk, so I filed PR 31455 for that bug.
But I bet this bug is related to some extent.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396



* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
  2007-03-29 22:15 [Bug c/31396] New: Inline code performance much worse than out-of-line jamagallon at ono dot com
                   ` (5 preceding siblings ...)
  2007-04-03  5:03 ` pinskia at gcc dot gnu dot org
@ 2007-04-04  7:05 ` ubizjak at gmail dot com
  2007-04-04  8:21 ` ubizjak at gmail dot com
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: ubizjak at gmail dot com @ 2007-04-04  7:05 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #7 from ubizjak at gmail dot com  2007-04-04 08:05 -------
This is the minimal test case for this bug:

--cut here--
extern void foo(void);

double *data;

double test()
{
  double sum = 123.321;
  int i;

  for (i=0; i<4; i++)
    sum += data[i];

  foo();
  foo();

  return sum;
}
--cut here--

Without the second call to foo(), the function compiles to (gcc version 4.3.0
20070403 (experimental)):

test:
        subq    $24, %rsp
        movq    data(%rip), %rdx
        movl    $1, %eax
        movsd   .LC0(%rip), %xmm0
        addsd   (%rdx), %xmm0
.L2:
        addsd   (%rdx,%rax,8), %xmm0
        addq    $1, %rax
        cmpq    $4, %rax
        jne     .L2

        movsd   %xmm0, (%rsp)
        call    foo
        movsd   (%rsp), %xmm0
        addq    $24, %rsp
        ret

When the second call to foo() is added, the register allocator (RA) gets
confused and spills the sum variable to the stack:

test:
        subq    $8, %rsp
        movq    data(%rip), %rdx
        movl    $1, %eax
        movsd   .LC0(%rip), %xmm0
        addsd   (%rdx), %xmm0
        movsd   %xmm0, (%rsp)          <= here
.L2:
        movsd   (%rsp), %xmm0          <= here
        addsd   (%rdx,%rax,8), %xmm0
        addq    $1, %rax
        cmpq    $4, %rax
        movsd   %xmm0, (%rsp)          <= here
        jne     .L2

        call    foo
        call    foo
        movsd   (%rsp), %xmm0
        addq    $8, %rsp
        ret


-- 

ubizjak at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ubizjak at gmail dot com
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
      Known to fail|                            |4.3.0
   Last reconfirmed|0000-00-00 00:00:00         |2007-04-04 08:05:01
               date|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396



* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
  2007-03-29 22:15 [Bug c/31396] New: Inline code performance much worse than out-of-line jamagallon at ono dot com
                   ` (6 preceding siblings ...)
  2007-04-04  7:05 ` ubizjak at gmail dot com
@ 2007-04-04  8:21 ` ubizjak at gmail dot com
  2008-01-12 19:14 ` hubicka at gcc dot gnu dot org
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: ubizjak at gmail dot com @ 2007-04-04  8:21 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #8 from ubizjak at gmail dot com  2007-04-04 09:21 -------
The difference is in the CALLER_SAVE_PROFITABLE condition. The pseudo that
holds sum is referenced 6 times.  When only one foo() is called, the default
CALLER_SAVE_PROFITABLE condition causes RA to allocate a call-clobbered
register (fp and xmm regs are all call-clobbered for x86 targets). When two
calls to foo() are present, the default heuristic

#define CALLER_SAVE_PROFITABLE(REFS, CALLS)  (4 * (CALLS) < (REFS))

pushes the pseudo to memory, as RA does not consider the fact that the pseudo
is used inside the loop.
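
The arithmetic behind the two outcomes can be checked directly. The sketch below copies the macro as quoted above and plugs in the reference count from this comment (6); the helper name keeps_call_clobbered_reg is hypothetical and only wraps the macro for readability.

```c
#include <stdbool.h>

/* Default heuristic, as quoted in this comment. */
#define CALLER_SAVE_PROFITABLE(REFS, CALLS)  (4 * (CALLS) < (REFS))

/* Hypothetical wrapper: true when the heuristic lets the pseudo live in a
   call-clobbered register that is saved and restored around each call. */
static bool keeps_call_clobbered_reg(int refs, int calls)
{
    return CALLER_SAVE_PROFITABLE(refs, calls);
}
```

With 6 references and one call, 4 * 1 < 6 holds and the register is kept. With two calls, 4 * 2 < 6 fails, so the pseudo goes to memory even though some of those references sit inside the loop.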

The default heuristic is _wrong_. When a pseudo is accessed inside a loop, a
call-clobbered register should be allocated, no matter how many calls it
crosses.

This can be confirmed by changing the "double" keyword to "int" in the example
of comment #7. gcc now chooses the ebx register (call-preserved) and the loop
compiles to the expected tight sequence:

test:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ebx
        subl    $4, %esp
        movl    data, %edx
        movl    (%edx), %eax
        leal    123(%eax), %ebx
        movl    $2, %eax
.L2:
        addl    -4(%edx,%eax,4), %ebx
        addl    $1, %eax
        cmpl    $5, %eax
        jne     .L2
        call    foo
        call    foo
        movl    %ebx, %eax
        addl    $4, %esp
        popl    %ebx
        popl    %ebp
        ret
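
For reference, the "int" variant of the comment #7 test case can be sketched as follows. The initializer is written as 123 on the assumption that 123.321 is truncated when assigned to an int, which agrees with the "leal 123(%eax)" in the listing above. The body of foo() and the foo_calls counter are hypothetical stand-ins so that the file links on its own; in the original, foo() is external.

```c
/* Hypothetical stand-in for the external foo() of comment #7. */
static int foo_calls;

void foo(void)
{
    foo_calls++;
}

int *data;

int test(void)
{
    int sum = 123;   /* 123.321 truncated to int */
    int i;

    for (i = 0; i < 4; i++)
        sum += data[i];

    /* sum is live across both calls, as in the double version */
    foo();
    foo();

    return sum;
}
```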


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396



* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
  2007-03-29 22:15 [Bug c/31396] New: Inline code performance much worse than out-of-line jamagallon at ono dot com
                   ` (7 preceding siblings ...)
  2007-04-04  8:21 ` ubizjak at gmail dot com
@ 2008-01-12 19:14 ` hubicka at gcc dot gnu dot org
  2008-01-16 17:20 ` hubicka at gcc dot gnu dot org
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: hubicka at gcc dot gnu dot org @ 2008-01-12 19:14 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #9 from hubicka at gcc dot gnu dot org  2008-01-12 19:00 -------
Created an attachment (id=14930)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14930&action=view)
tentative fix

I am testing the attached patch.  It is obvious that we should use the profile
here.  The PR is most likely a regression from 2.95, which used to multiply
n_refs by 3 inside loops.
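
To illustrate why a loop weight changes the decision, here is a minimal sketch of the 2.95-style weighting described above. It is not the attached patch and not the 2.95 source. The helper loop_weighted_refs is hypothetical, and so is the split of the 6 references from comment #8 into 4 outside and 2 inside the loop.

```c
#include <stdbool.h>

/* Default heuristic, as quoted in comment #8. */
#define CALLER_SAVE_PROFITABLE(REFS, CALLS)  (4 * (CALLS) < (REFS))

/* Hypothetical helper: count each in-loop reference three times. */
static int loop_weighted_refs(int refs_outside_loop, int refs_inside_loop)
{
    return refs_outside_loop + 3 * refs_inside_loop;
}
```

Under the assumed 4/2 split the weighted count is 4 + 3 * 2 = 10, so with two calls 4 * 2 < 10 holds and the pseudo would stay in a register, whereas the unweighted count of 6 fails the same test.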


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396



* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
  2007-03-29 22:15 [Bug c/31396] New: Inline code performance much worse than out-of-line jamagallon at ono dot com
                   ` (8 preceding siblings ...)
  2008-01-12 19:14 ` hubicka at gcc dot gnu dot org
@ 2008-01-16 17:20 ` hubicka at gcc dot gnu dot org
  2008-01-16 17:26 ` hubicka at gcc dot gnu dot org
  2008-01-18  8:59 ` ubizjak at gmail dot com
  11 siblings, 0 replies; 13+ messages in thread
From: hubicka at gcc dot gnu dot org @ 2008-01-16 17:20 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #10 from hubicka at gcc dot gnu dot org  2008-01-16 16:32 -------
Subject: Bug 31396

Author: hubicka
Date: Wed Jan 16 16:32:05 2008
New Revision: 131576

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=131576
Log:

        PR rtl-optimization/31396
        * regstat.c (regstat_bb_compute_ri): Compute FREQ_CALLS_CROSSED.
        * cfg.c (dump_reg_info): Print it.
        * regs.h (struct reg_info_t): add freq_calls_crossed.
        (REG_FREQ_CALLS_CROSSED): New macro.
        * global.c (global_alloc): Compute freq_calls_crossed for allocno.
        (find_reg): Update call of CALLER_SAVE_PROFITABLE.
        * regmove.c (optimize_reg_copy_1, optimize_reg_copy_2, fixup_match_2,
        regmove_optimize): Update call crossed frequencies.
        * local-alloc.c (struct qty): Add freq_calls_crossed.
        (alloc_qty): Compute freq_calls_crossed.
        (update_equiv_regs, combine_regs): Update REG_FREQ_CALLS_CROSSED.
        (find_free_reg): Update call of CALLER_SAVE_PROFITABLE.
        * ra.h (struct allocno): Add freq_calls_crossed.
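
The patch itself is not shown in this thread. As a rough sketch of the idea the log describes, the fragment below weights both references and crossed calls by an estimated block frequency before applying the heuristic. Only the name freq_calls_crossed comes from the log; struct block_use, struct pseudo_stats, accumulate, profitable_by_freq and all frequency values in the usage note are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Default heuristic, as quoted in comment #8. */
#define CALLER_SAVE_PROFITABLE(REFS, CALLS)  (4 * (CALLS) < (REFS))

/* Hypothetical per-block summary for one pseudo-register. */
struct block_use {
    int block_freq;      /* estimated execution frequency of the block */
    int refs;            /* references to the pseudo in this block */
    int calls_crossed;   /* calls the pseudo is live across in this block */
};

/* Hypothetical totals; freq_calls_crossed mirrors the field named in the log. */
struct pseudo_stats {
    int freq;                /* frequency-weighted references */
    int freq_calls_crossed;  /* frequency-weighted calls crossed */
};

/* Sum references and crossed calls, each scaled by its block's frequency. */
static struct pseudo_stats accumulate(const struct block_use *blocks, size_t n)
{
    struct pseudo_stats s = {0, 0};
    size_t i;

    for (i = 0; i < n; i++) {
        s.freq += blocks[i].block_freq * blocks[i].refs;
        s.freq_calls_crossed += blocks[i].block_freq * blocks[i].calls_crossed;
    }
    return s;
}

/* Apply the unchanged macro to the weighted totals. */
static bool profitable_by_freq(const struct pseudo_stats *s)
{
    return CALLER_SAVE_PROFITABLE(s->freq, s->freq_calls_crossed);
}
```

For example, with an entry block {1, 2, 0}, a loop body {10, 2, 0} and an exit block {1, 2, 2}, the weighted totals are 24 references against 2 calls crossed, so the heuristic now favours keeping the pseudo in a register.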

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfg.c
    trunk/gcc/global.c
    trunk/gcc/local-alloc.c
    trunk/gcc/ra.h
    trunk/gcc/regmove.c
    trunk/gcc/regs.h
    trunk/gcc/regstat.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396



* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
  2007-03-29 22:15 [Bug c/31396] New: Inline code performance much worse than out-of-line jamagallon at ono dot com
                   ` (9 preceding siblings ...)
  2008-01-16 17:20 ` hubicka at gcc dot gnu dot org
@ 2008-01-16 17:26 ` hubicka at gcc dot gnu dot org
  2008-01-18  8:59 ` ubizjak at gmail dot com
  11 siblings, 0 replies; 13+ messages in thread
From: hubicka at gcc dot gnu dot org @ 2008-01-16 17:26 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #11 from hubicka at gcc dot gnu dot org  2008-01-16 16:33 -------
Fixed on mainline.


-- 

hubicka at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396



* [Bug rtl-optimization/31396] Inline code performance much worse than out-of-line
  2007-03-29 22:15 [Bug c/31396] New: Inline code performance much worse than out-of-line jamagallon at ono dot com
                   ` (10 preceding siblings ...)
  2008-01-16 17:26 ` hubicka at gcc dot gnu dot org
@ 2008-01-18  8:59 ` ubizjak at gmail dot com
  11 siblings, 0 replies; 13+ messages in thread
From: ubizjak at gmail dot com @ 2008-01-18  8:59 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #12 from ubizjak at gmail dot com  2008-01-18 07:12 -------
Part of the problem described here is caused by PR 23322.


-- 

ubizjak at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |23322


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31396


