[Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code

public inbox for gcc-bugs@sourceware.org

* [Bug regression/33928]  New: 33% performance slowdown from 4.2.2 in floating-point code
@ 2007-10-28  1:46 lucier at math dot purdue dot edu
  2007-10-28  1:49 ` [Bug regression/33928] " lucier at math dot purdue dot edu
                   ` (115 more replies)
  0 siblings, 116 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2007-10-28  1:46 UTC (permalink / raw)
  To: gcc-bugs

With these compile options

-Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math
-fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp
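
A minimal sketch, not part of the original report, of how the command lines for the variants tried later in this thread (-O2 instead of -O1, and -O1 with -fno-ivopts) could be generated from the flags above. The flag list and the mainline compiler path are copied from the report; the source file name `fft.i`, the output names and the helper `gcc_command` are hypothetical.

```python
# Flags exactly as listed in the report.
BASE_FLAGS = [
    "-Wall", "-W", "-Wno-unused", "-O1", "-fno-math-errno",
    "-fschedule-insns2", "-fno-trapping-math", "-fno-strict-aliasing",
    "-fwrapv", "-fomit-frame-pointer", "-fPIC", "-fno-common", "-mieee-fp",
]


def gcc_command(compiler, source, output, opt_level="-O1", extra=()):
    """Return the argv list that compiles `source` to assembly (-S)."""
    # Swap the optimization level in place so the flag order is preserved.
    flags = [opt_level if flag == "-O1" else flag for flag in BASE_FLAGS]
    return [compiler, *flags, *extra, "-S", source, "-o", output]


if __name__ == "__main__":
    cc = "/pkgs/gcc-mainline/bin/gcc"  # path shown in the report
    src = "fft.i"                      # hypothetical file name
    variants = [
        ("O1", {}),
        ("O2", {"opt_level": "-O2"}),
        ("O1-no-ivopts", {"extra": ["-fno-ivopts"]}),
    ]
    for name, kwargs in variants:
        print(" ".join(gcc_command(cc, src, f"fft-{name}.s", **kwargs)))
```

The lists can be passed to `subprocess.run` unchanged; only the compiler path needs to change to build the same variants with 4.2.2.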

With this compiler:

euler-44% /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --prefix=/pkgs/gcc-mainline
--enable-languages=c --enable-checking=release --with-gmp=/pkgs/gmp-4.2.2
--with-mpfr=/pkgs/gmp-4.2.2
Thread model: posix
gcc version 4.3.0 20071026 (experimental) [trunk revision 129664] (GCC) 

With the following routine compiled with gcc-4.2.2 you get

(time (direct-fft-recursive-4 a table))
    366 ms real time
    366 ms cpu time (366 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

while with today's mainline you get

(time (direct-fft-recursive-4 a table))
    448 ms real time
    448 ms cpu time (448 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

I've isolated that one routine and I'll add it at the end of an attachment;
unfortunately there are a lot of declarations and global data that are
difficult to winnow.

There is really only one main loop in the routine, the one that begins at
___L19_direct_2d_fft_2d_recursive_2d_4.  This loop was scheduled in 102 cycles
(sched2) on 4.2.2 and in 134 cycles in mainline.


-- 
           Summary: 33% performance slowdown from 4.2.2 in floating-point
                    code
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: regression
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: lucier at math dot purdue dot edu
 GCC build triplet: x86_64-unknown-linux-gnu
  GCC host triplet: x86_64-unknown-linux-gnu
GCC target triplet: x86_64-unknown-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


^ permalink raw reply	[flat|nested] 117+ messages in thread

* [Bug regression/33928] 33% performance slowdown from 4.2.2 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
@ 2007-10-28  1:49 ` lucier at math dot purdue dot edu
  2007-10-28 12:05 ` [Bug regression/33928] 33% performance slowdown from 4.2.2 to 4.3.0 " rguenth at gcc dot gnu dot org
                   ` (114 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2007-10-28  1:49 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from lucier at math dot purdue dot edu  2007-10-28 01:49 -------
Created an attachment (id=14418)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14418&action=view)
.i file for fft routine


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] 33% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
  2007-10-28  1:49 ` [Bug regression/33928] " lucier at math dot purdue dot edu
@ 2007-10-28 12:05 ` rguenth at gcc dot gnu dot org
  2007-10-28 15:41 ` [Bug regression/33928] 33% performance slowdown from 4.2.2 to 4.3.0 in floating-point code with computed gotos lucier at math dot purdue dot edu
                   ` (113 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2007-10-28 12:05 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from rguenth at gcc dot gnu dot org  2007-10-28 12:05 -------
Can you attach assembler files?  What happens if you use -O2?  Why do you need
-fno-strict-aliasing?  Does -fno-ivopts help?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] 33% performance slowdown from 4.2.2 to 4.3.0 in floating-point code with computed gotos
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
  2007-10-28  1:49 ` [Bug regression/33928] " lucier at math dot purdue dot edu
  2007-10-28 12:05 ` [Bug regression/33928] 33% performance slowdown from 4.2.2 to 4.3.0 " rguenth at gcc dot gnu dot org
@ 2007-10-28 15:41 ` lucier at math dot purdue dot edu
  2007-10-28 15:42 ` lucier at math dot purdue dot edu
                   ` (112 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2007-10-28 15:41 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from lucier at math dot purdue dot edu  2007-10-28 15:41 -------
Created an attachment (id=14423)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14423&action=view)
Assembly from 4.2.2


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] 33% performance slowdown from 4.2.2 to 4.3.0 in floating-point code with computed gotos
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (2 preceding siblings ...)
  2007-10-28 15:41 ` [Bug regression/33928] 33% performance slowdown from 4.2.2 to 4.3.0 in floating-point code with computed gotos lucier at math dot purdue dot edu
@ 2007-10-28 15:42 ` lucier at math dot purdue dot edu
  2007-10-28 15:45 ` lucier at math dot purdue dot edu
                   ` (111 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2007-10-28 15:42 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from lucier at math dot purdue dot edu  2007-10-28 15:42 -------
Created an attachment (id=14424)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14424&action=view)
assembly from 4.3.0

I had to remove the "static" from the declaration of direct-fft-recursive to
get assembly.  (In the larger file the address of direct-fft-recursive is
eventually put into an array.)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] 33% performance slowdown from 4.2.2 to 4.3.0 in floating-point code with computed gotos
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (3 preceding siblings ...)
  2007-10-28 15:42 ` lucier at math dot purdue dot edu
@ 2007-10-28 15:45 ` lucier at math dot purdue dot edu
  2007-10-28 15:46 ` lucier at math dot purdue dot edu
                   ` (110 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2007-10-28 15:45 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from lucier at math dot purdue dot edu  2007-10-28 15:45 -------
Created an attachment (id=14425)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14425&action=view)
assembly after replacing -O1 with -O2


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] 33% performance slowdown from 4.2.2 to 4.3.0 in floating-point code with computed gotos
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (4 preceding siblings ...)
  2007-10-28 15:45 ` lucier at math dot purdue dot edu
@ 2007-10-28 15:46 ` lucier at math dot purdue dot edu
  2007-10-28 16:05 ` lucier at math dot purdue dot edu
                   ` (109 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2007-10-28 15:46 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #6 from lucier at math dot purdue dot edu  2007-10-28 15:45 -------
Created an attachment (id=14426)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14426&action=view)
assembly after replacing -O1 with -O2


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] 33% performance slowdown from 4.2.2 to 4.3.0 in floating-point code with computed gotos
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (5 preceding siblings ...)
  2007-10-28 15:46 ` lucier at math dot purdue dot edu
@ 2007-10-28 16:05 ` lucier at math dot purdue dot edu
  2007-10-28 16:09 ` lucier at math dot purdue dot edu
                   ` (108 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2007-10-28 16:05 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #7 from lucier at math dot purdue dot edu  2007-10-28 16:05 -------
time with -O2 instead of -O1:

with 4.2.2:

(time (direct-fft-recursive-4 a table))
    426 ms real time
    426 ms cpu time (425 user, 1 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

with 4.3.0:

(time (direct-fft-recursive-4 a table))
    433 ms real time
    433 ms cpu time (433 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

With -O1 -fno-ivopts:

with 4.2.2:

(time (direct-fft-recursive-4 a table))
    374 ms real time
    374 ms cpu time (374 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

with 4.3.0:

(time (direct-fft-recursive-4 a table))
    443 ms real time
    443 ms cpu time (443 user, 0 system)
    no collections
    64 bytes allocated
    1 minor fault
    no major faults

Why -fno-strict-aliasing: I don't need it for this particular routine, but
the rest of the file includes part of a bignum library that accesses the bignum
digits as arrays of either 8-, 32-, or 64-bit unsigned ints, and it hasn't been
rewritten to use unions of arrays.  (This is part of the runtime system of a
Scheme implementation, and there are other places that just cast pointers to
achieve low-level things.)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] 33% performance slowdown from 4.2.2 to 4.3.0 in floating-point code with computed gotos
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (6 preceding siblings ...)
  2007-10-28 16:05 ` lucier at math dot purdue dot edu
@ 2007-10-28 16:09 ` lucier at math dot purdue dot edu
  2007-10-28 16:38 ` rguenth at gcc dot gnu dot org
                   ` (107 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2007-10-28 16:09 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #8 from lucier at math dot purdue dot edu  2007-10-28 16:08 -------
Subject: Re:  33% performance slowdown from 4.2.2 to 4.3.0 in floating-point
code


On Oct 28, 2007, at 8:05 AM, rguenth at gcc dot gnu dot org wrote:

> ------- Comment #2 from rguenth at gcc dot gnu dot org  2007-10-28 12:05 -------
> Can you attach assembler files?  What happens if you use -O2?  Why do you need
> -fno-strict-aliasing?  Does -fno-ivopts help?

I think I've answered your questions in the attachments and comments  
to the PR.

Brad


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] 33% performance slowdown from 4.2.2 to 4.3.0 in floating-point code with computed gotos
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (7 preceding siblings ...)
  2007-10-28 16:09 ` lucier at math dot purdue dot edu
@ 2007-10-28 16:38 ` rguenth at gcc dot gnu dot org
  2007-10-28 16:39 ` [Bug regression/33928] [4.3 Regression] " rguenth at gcc dot gnu dot org
                   ` (106 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2007-10-28 16:38 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #9 from rguenth at gcc dot gnu dot org  2007-10-28 16:38 -------
The main difference I see is that 4.2 avoids re-use of %rax as index register:

.L34:
        movq    %r11, %rdi
        addq    8(%r10), %rdi
        movq    8(%r10), %rsi
        movq    8(%r10), %rdx
        movq    40(%r10), %rax
        leaq    4(%r11), %rbx
        addq    %rdi, %rsi
        leaq    4(%rdi), %r9
        movq    %rdi, -8(%r10)
        addq    %rsi, %rdx
        leaq    4(%rsi), %r8
        movq    %rsi, -24(%r10)
        leaq    4(%rdx), %rcx
        movq    %r9, -16(%r10)
        movq    %rdx, -40(%r10)
        movq    %r8, -32(%r10)
        addq    $7, %rax
        movq    %rcx, -48(%r10)
        movsd   (%rax,%rcx,2), %xmm12
        leaq    (%rbx,%rbx), %rcx
        movsd   (%rax,%rdx,2), %xmm3
        leaq    (%rax,%r11,2), %rdx
        addq    $8, %r11
        movsd   (%rax,%r8,2), %xmm14
        cmpq    %r11, %r13
        movsd   (%rax,%rsi,2), %xmm13
        movsd   (%rax,%r9,2), %xmm11
        movsd   (%rax,%rdi,2), %xmm10
        movsd   (%rax,%rcx), %xmm8
...

while 4.3 always re-loads %rax as index:

.L26:
        leaq    4(%rdi), %rdx
        movq    %rdi, %rax
        movq    %rdx, -8(%rsp)
        addq    (%r8), %rax
        movq    %rax, (%r9)
        addq    $4, %rax
        movq    %rax, (%rbp)
        movq    (%r9), %rax
        addq    (%r8), %rax
        movq    %rax, (%r10)
        addq    $4, %rax
        movq    %rax, (%rbx)
        movq    (%r10), %rax
        addq    (%r8), %rax
        movq    %rax, (%r11)
        movq    -64(%rsp), %rcx
        addq    $4, %rax
        movq    %rax, (%rcx)
        movq    (%rsi), %rdx
        movq    -8(%rsp), %rcx
        addq    $7, %rdx
        movsd   (%rdx,%rax,2), %xmm13
        movq    (%r11), %rax
        addq    %rcx, %rcx
        movsd   (%rdx,%rcx), %xmm8
        movsd   (%rdx,%rax,2), %xmm3
        movq    (%rbx), %rax
        movsd   (%rdx,%rax,2), %xmm14
        movq    (%r10), %rax
        movsd   (%rdx,%rax,2), %xmm12
        movq    (%rbp), %rax
        movsd   (%rdx,%rax,2), %xmm11
        movq    (%r9), %rax
        movsd   (%rdx,%rax,2), %xmm10
        movq    (%r12), %rax
        leaq    (%rdx,%rdi,2), %rdx
...

The root cause still needs to be investigated.
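
As a rough way to quantify the difference between the two excerpts, the sketch below counts `movq` instructions that load a given register from memory. It is only an illustration: the function name `count_loads` and the two short samples (cut from the listings above) are chosen for this example, and a load count is not a performance model.

```python
import re

# Matches AT&T-syntax "movq <memory operand>, %reg", e.g. "movq (%r9), %rax".
# Register-to-register moves and stores do not match.
LOAD_RE = re.compile(r"^\s*movq\s+(-?\d*\([^)]*\)),\s*%(\w+)\s*$")


def count_loads(asm, reg):
    """Count movq instructions in `asm` that load `reg` from memory."""
    count = 0
    for line in asm.splitlines():
        match = LOAD_RE.match(line)
        if match and match.group(2) == reg:
            count += 1
    return count


# A few lines from the 4.2 listing: %rax is loaded once and used as the base.
SAMPLE_42 = """\
        movq    40(%r10), %rax
        addq    $7, %rax
        movsd   (%rax,%rcx,2), %xmm12
        movsd   (%rax,%rdx,2), %xmm3
        movsd   (%rax,%r8,2), %xmm14
"""

# A few lines from the 4.3 listing: %rax is reloaded before each movsd.
SAMPLE_43 = """\
        movq    (%rbx), %rax
        movsd   (%rdx,%rax,2), %xmm14
        movq    (%r10), %rax
        movsd   (%rdx,%rax,2), %xmm12
        movq    (%rbp), %rax
        movsd   (%rdx,%rax,2), %xmm11
"""

if __name__ == "__main__":
    print("4.2 sample:", count_loads(SAMPLE_42, "rax"))
    print("4.3 sample:", count_loads(SAMPLE_43, "rax"))
```

Running the same function over the complete attached .s files, rather than these short samples, would give the per-loop totals.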


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 33% performance slowdown from 4.2.2 to 4.3.0 in floating-point code with computed gotos
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (8 preceding siblings ...)
  2007-10-28 16:38 ` rguenth at gcc dot gnu dot org
@ 2007-10-28 16:39 ` rguenth at gcc dot gnu dot org
  2007-11-12 21:50 ` [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code lucier at math dot purdue dot edu
                   ` (105 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2007-10-28 16:39 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #10 from rguenth at gcc dot gnu dot org  2007-10-28 16:39 -------
So, confirmed.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
           Keywords|                            |missed-optimization
   Last reconfirmed|0000-00-00 00:00:00         |2007-10-28 16:39:27
               date|                            |
            Summary|33% performance slowdown    |[4.3 Regression] 33%
                   |from 4.2.2 to 4.3.0 in      |performance slowdown from
                   |floating-point code with    |4.2.2 to 4.3.0 in floating-
                   |computed gotos              |point code with computed
                   |                            |gotos


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (9 preceding siblings ...)
  2007-10-28 16:39 ` [Bug regression/33928] [4.3 Regression] " rguenth at gcc dot gnu dot org
@ 2007-11-12 21:50 ` lucier at math dot purdue dot edu
  2007-11-12 21:51 ` lucier at math dot purdue dot edu
                   ` (104 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2007-11-12 21:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #11 from lucier at math dot purdue dot edu  2007-11-12 21:50 -------
I suspected that the slowdown had nothing to do with computed gotos, so I
regenerated the C code using a switch instead of the computed gotos and got the
following:

For that same copy of mainline

gcc version 4.3.0 20071026 (experimental) [trunk revision 129664] (GCC) 

:

(time (direct-fft-recursive-4 a table))
    470 ms real time
    470 ms cpu time (470 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

For 4.2.2:

(time (direct-fft-recursive-4 a table))
    384 ms real time
    384 ms cpu time (383 user, 1 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

So that's almost exactly the same slowdown as with computed gotos.

I changed the subject line to use 22% instead of 33% (I don't know how I got
33% before, perhaps I just mistyped it) and removed the phrase "with computed
gotos".
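
For reference, the percentages can be recomputed from the figures quoted in this thread. The sketch below uses only those numbers; the helper name `slowdown_percent` is made up for the example. Both timing pairs come out at roughly 22%, while the sched2 cycle counts from the original report (102 and 134) give roughly 31%. Whether the cycle counts are where the original 33% came from is not stated in the thread.

```python
def slowdown_percent(baseline, regressed):
    """Relative slowdown of `regressed` over `baseline`, in percent."""
    return (regressed - baseline) / baseline * 100.0


# (label, 4.2.2 figure, mainline figure), all taken from this thread.
MEASUREMENTS = [
    ("computed gotos, ms", 366, 448),
    ("switch, ms", 384, 470),
    ("sched2 cycles for the main loop", 102, 134),
]

for label, old, new in MEASUREMENTS:
    print(f"{label}: {slowdown_percent(old, new):.1f}%")
```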

I'll include the new .i and .s files as attachments.


-- 

lucier at math dot purdue dot edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[4.3 Regression] 33%        |[4.3 Regression] 22%
                   |performance slowdown from   |performance slowdown from
                   |4.2.2 to 4.3.0 in floating- |4.2.2 to 4.3.0 in floating-
                   |point code with computed    |point code
                   |gotos                       |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (10 preceding siblings ...)
  2007-11-12 21:50 ` [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code lucier at math dot purdue dot edu
@ 2007-11-12 21:51 ` lucier at math dot purdue dot edu
  2007-11-12 21:52 ` lucier at math dot purdue dot edu
                   ` (103 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2007-11-12 21:51 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #12 from lucier at math dot purdue dot edu  2007-11-12 21:51 -------
Created an attachment (id=14534)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14534&action=view)
.i file using a switch instead of computed gotos

This is the generated code with a switch instead of computed gotos.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (11 preceding siblings ...)
  2007-11-12 21:51 ` lucier at math dot purdue dot edu
@ 2007-11-12 21:52 ` lucier at math dot purdue dot edu
  2007-11-12 21:53 ` lucier at math dot purdue dot edu
                   ` (102 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2007-11-12 21:52 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #13 from lucier at math dot purdue dot edu  2007-11-12 21:52 -------
Created an attachment (id=14535)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14535&action=view)
4.2.2 assembly for code using switch.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (12 preceding siblings ...)
  2007-11-12 21:52 ` lucier at math dot purdue dot edu
@ 2007-11-12 21:53 ` lucier at math dot purdue dot edu
  2007-11-19  6:06 ` pinskia at gcc dot gnu dot org
                   ` (101 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2007-11-12 21:53 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #14 from lucier at math dot purdue dot edu  2007-11-12 21:53 -------
Created an attachment (id=14536)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14536&action=view)
4.3.0 assembly for code using a switch


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (13 preceding siblings ...)
  2007-11-12 21:53 ` lucier at math dot purdue dot edu
@ 2007-11-19  6:06 ` pinskia at gcc dot gnu dot org
  2007-11-27  5:53 ` mmitchel at gcc dot gnu dot org
                   ` (100 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2007-11-19  6:06 UTC (permalink / raw)
  To: gcc-bugs



-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pinskia at gcc dot gnu dot
                   |                            |org
   Target Milestone|---                         |4.3.0


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (14 preceding siblings ...)
  2007-11-19  6:06 ` pinskia at gcc dot gnu dot org
@ 2007-11-27  5:53 ` mmitchel at gcc dot gnu dot org
  2007-11-30  5:39 ` bonzini at gnu dot org
                   ` (99 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: mmitchel at gcc dot gnu dot org @ 2007-11-27  5:53 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #15 from mmitchel at gcc dot gnu dot org  2007-11-27 05:53 -------
I've marked this P1 because I'd like to see us start to explain these kinds of
dramatic performance changes.  If we can explain the issue coherently, we may
well decide that it's not important to fix it, but I think we ought to force
ourselves to figure out what's going on.


-- 

mmitchel at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (15 preceding siblings ...)
  2007-11-27  5:53 ` mmitchel at gcc dot gnu dot org
@ 2007-11-30  5:39 ` bonzini at gnu dot org
  2007-11-30 14:47 ` lucier at math dot purdue dot edu
                   ` (98 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2007-11-30  5:39 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #16 from bonzini at gnu dot org  2007-11-30 05:39 -------
One suspect is fwprop.  Anyone can confirm?


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bonzini at gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (16 preceding siblings ...)
  2007-11-30  5:39 ` bonzini at gnu dot org
@ 2007-11-30 14:47 ` lucier at math dot purdue dot edu
  2007-11-30 14:58 ` bonzini at gnu dot org
                   ` (97 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2007-11-30 14:47 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #17 from lucier at math dot purdue dot edu  2007-11-30 14:47 -------
Subject: Re:  [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in
floating-point code

On Nov 30, 2007, at 12:39 AM, bonzini at gnu dot org wrote:

> One suspect is fwprop.  Anyone can confirm?

How does one turn off fwprop?  It doesn't seem to like "-fno-fwprop".


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (17 preceding siblings ...)
  2007-11-30 14:47 ` lucier at math dot purdue dot edu
@ 2007-11-30 14:58 ` bonzini at gnu dot org
  2007-12-01 18:59 ` lucier at math dot purdue dot edu
                   ` (96 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2007-11-30 14:58 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #18 from bonzini at gnu dot org  2007-11-30 14:58 -------
It would be -fno-forward-propagate, but what I meant is that the changes
*connected to* fwprop could be the culprit.  One has to look at dumps to
understand if this is the case.

It would be possible, maybe, to put an asm around the problematic basic block,
so that one could plot the number of instructions in that basic block over
time?
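
One way this suggestion might be carried out, sketched under assumptions: if the source places an asm statement that emits only a comment (for example `asm volatile("# BEGIN hot-block");`) before the block and a matching one after it, the comments show up in the generated .s file and the instructions between them can be counted with a small script. The marker strings and the function name `count_between_markers` are hypothetical, and the asm statements themselves may perturb scheduling, so the counts are indicative only.

```python
def count_between_markers(asm, begin="# BEGIN hot-block", end="# END hot-block"):
    """Count instruction lines between two marker comments in assembly text."""
    counting = False
    count = 0
    for raw in asm.splitlines():
        line = raw.strip()
        if begin in line:
            counting = True
            continue
        if end in line:
            counting = False
            continue
        if not counting or not line:
            continue
        # Skip comments ("#APP"), directives (".p2align") and labels (".L26:").
        if line.startswith(("#", ".")) or line.endswith(":"):
            continue
        count += 1
    return count


# Hand-written sample in the style of GCC output; not taken from the attachments.
SAMPLE = """\
        movq    %rdi, %rax
#APP
        # BEGIN hot-block
#NO_APP
.L26:
        leaq    4(%rdi), %rdx
        movq    %rdi, %rax
        addq    (%r8), %rax
#APP
        # END hot-block
#NO_APP
        ret
"""

if __name__ == "__main__":
    print(count_between_markers(SAMPLE))
```

Applying the function to assembly produced by a series of compiler revisions would give the per-revision counts that could then be plotted.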


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (18 preceding siblings ...)
  2007-11-30 14:58 ` bonzini at gnu dot org
@ 2007-12-01 18:59 ` lucier at math dot purdue dot edu
  2008-01-09 14:18 ` rguenth at gcc dot gnu dot org
                   ` (95 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2007-12-01 18:59 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #19 from lucier at math dot purdue dot edu  2007-12-01 18:59 -------
Subject: Re:  [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in
floating-point code


On Nov 30, 2007, at 9:58 AM, bonzini at gnu dot org wrote:

> -fno-forward-propagate

I don't know how to debug this, that's clear enough, but adding
-fno-forward-propagate as an option doesn't change the code at all.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (19 preceding siblings ...)
  2007-12-01 18:59 ` lucier at math dot purdue dot edu
@ 2008-01-09 14:18 ` rguenth at gcc dot gnu dot org
  2008-01-09 19:21 ` lucier at math dot purdue dot edu
                   ` (94 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2008-01-09 14:18 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #20 from rguenth at gcc dot gnu dot org  2008-01-09 12:45 -------
Can we have updated measurements please?  Also I don't think this bug should be
P1.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (20 preceding siblings ...)
  2008-01-09 14:18 ` rguenth at gcc dot gnu dot org
@ 2008-01-09 19:21 ` lucier at math dot purdue dot edu
  2008-01-12 18:03 ` rguenth at gcc dot gnu dot org
                   ` (93 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2008-01-09 19:21 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #21 from lucier at math dot purdue dot edu  2008-01-09 18:44 -------
The assembler is identical to that in the third attachment and the time is
basically the same (other things were going on at the same time):

(time (direct-fft-recursive-4 a table))
    465 ms real time
    466 ms cpu time (466 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

euler-86% /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --prefix=/pkgs/gcc-mainline
--enable-languages=c --enable-checking=release --with-gmp=/pkgs/gmp-4.2.2
--with-mpfr=/pkgs/gmp-4.2.2 --enable-gather-detailed-mem-stats
Thread model: posix
gcc version 4.3.0 20080109 (experimental) [trunk revision 131427] (GCC) 


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (21 preceding siblings ...)
  2008-01-09 19:21 ` lucier at math dot purdue dot edu
@ 2008-01-12 18:03 ` rguenth at gcc dot gnu dot org
  2008-01-21 20:01 ` ubizjak at gmail dot com
                   ` (92 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2008-01-12 18:03 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #22 from rguenth at gcc dot gnu dot org  2008-01-12 17:56 -------
I'm downgrading this to P2.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P1                          |P2


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (22 preceding siblings ...)
  2008-01-12 18:03 ` rguenth at gcc dot gnu dot org
@ 2008-01-21 20:01 ` ubizjak at gmail dot com
  2008-01-21 23:12 ` lucier at math dot purdue dot edu
                   ` (91 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: ubizjak at gmail dot com @ 2008-01-21 20:01 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #23 from ubizjak at gmail dot com  2008-01-21 19:21 -------
It is not possible to create an executable from direct.i. My compilation fails:

(.text+0x20): undefined reference to `main'
/tmp/cc0VOLHm.o: In function `___H_direct_2d_fft_2d_recursive_2d_4':
_num.c:(.text+0xf1): undefined reference to `___gstate'
_num.c:(.text+0x18e): undefined reference to `___gstate'
_num.c:(.text+0x1c7): undefined reference to `___gstate'
_num.c:(.text+0x27b): undefined reference to `___gstate'
_num.c:(.text+0x2e0): undefined reference to `___gstate'
/tmp/cc0VOLHm.o:_num.c:(.text+0x6f0): more undefined references to `___gstate'
follow

Could you attach the source that can be used to create the executable? Or
perhaps detailed instructions on how to create one from the sources you already
posted?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (23 preceding siblings ...)
  2008-01-21 20:01 ` ubizjak at gmail dot com
@ 2008-01-21 23:12 ` lucier at math dot purdue dot edu
  2008-01-22 12:23 ` ubizjak at gmail dot com
                   ` (90 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2008-01-21 23:12 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #24 from lucier at math dot purdue dot edu  2008-01-21 22:43 -------
Subject: Re:  [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in
floating-point code


On Jan 21, 2008, at 2:21 PM, ubizjak at gmail dot com wrote:

> It is not possible to create an executable from direct.i.

That's correct, sorry.

> Could you attach the source that can be used to create the executable?

Here are instructions on how to build and test a modified version of  
Gambit, from which I derived direct.i.

Download the file

http://www.math.purdue.edu/~lucier/gcc/test-files/bugzilla/33928/gambc-v4_1_2.tgz

Build it with the following commands:

> tar zxf gambc-v4_1_2.tgz
> cd gambc-v4_1_2
> ./configure CC='/pkgs/gcc-mainline/bin/gcc -save-temps'
> make -j

If you want to recompile the source after reconfiguring, do

> make mostlyclean


not 'make clean', unfortunately.

Then test it with

> gsi/gsi -e '(define a (time (expt 3 10000000)))(define b (time (* a a)))'

The output ends with something like

> (time (##bignum.make (##fixnum.quotient result-length (##fixnum.quotient ##bignum.adigit-width ##bignum.fdigit-width)) #f #f))
>     4 ms real time
>     5 ms cpu time (3 user, 2 system)
>     no collections
>     3962448 bytes allocated
>     968 minor faults
>     no major faults
> (time (##make-f64vector (##fixnum.* two^n 2)))
>     5 ms real time
>     5 ms cpu time (1 user, 4 system)
>     1 collection accounting for 5 ms real time (1 user, 4 system)
>     33554464 bytes allocated
>     59 minor faults
>     no major faults
> (time (make-w (##fixnum.- log-two^n 1)))
>     30 ms real time
>     31 ms cpu time (17 user, 14 system)
>     no collections
>     16810144 bytes allocated
>     4097 minor faults
>     no major faults
> (time (make-w-rac log-two^n))
>     28 ms real time
>     28 ms cpu time (16 user, 12 system)
>     no collections
>     16826272 bytes allocated
>     4097 minor faults
>     no major faults
> (time (bignum->f64vector-rac x a))
>     45 ms real time
>     45 ms cpu time (20 user, 25 system)
>     no collections
>     -16 bytes allocated
>     8192 minor faults
>     no major faults
> (time (componentwise-rac-multiply a rac-table))
>     26 ms real time
>     26 ms cpu time (26 user, 0 system)
>     no collections
>     -16 bytes allocated
>     no minor faults
>     no major faults
> (time (direct-fft-recursive-4 a table))
>     445 ms real time
>     445 ms cpu time (445 user, 0 system)
>     no collections
>     64 bytes allocated
>     no minor faults
>     no major faults
> (time (componentwise-complex-multiply a a))
>     24 ms real time
>     24 ms cpu time (24 user, 0 system)
>     no collections
>     -16 bytes allocated
>     no minor faults
>     no major faults
> (time (inverse-fft-recursive-4 a table))
>     418 ms real time
>     418 ms cpu time (418 user, 0 system)
>     no collections
>     64 bytes allocated
>     no minor faults
>     no major faults
> (time (componentwise-rac-multiply-conjugate a rac-table))
>     26 ms real time
>     26 ms cpu time (26 user, 0 system)
>     no collections
>     -16 bytes allocated
>     no minor faults
>     no major faults
> (time (bignum<-f64vector-rac a result result-length))
>     108 ms real time
>     108 ms cpu time (108 user, 0 system)
>     no collections
>     112 bytes allocated
>     no minor faults
>     no major faults
> (time (* a a))
>     1170 ms real time
>     1170 ms cpu time (1105 user, 65 system)
>     1 collection accounting for 5 ms real time (1 user, 4 system)
>     71266896 bytes allocated
>     17413 minor faults
>     no major faults


The time for the routine in direct.i is the time reported for
direct-fft-recursive-4:

> (time (direct-fft-recursive-4 a table))
>     445 ms real time
>     445 ms cpu time (445 user, 0 system)
>     no collections
>     64 bytes allocated
>     no minor faults
>     no major faults

The name of the routine in the .i and .s files is  
___H_direct_2d_fft_2d_recursive_2d_4.

By the way, ___H_inverse_2d_fft_2d_recursive_2d_4 is a similar  
routine implementing the inverse fft, which, for some reason, goes  
faster than the direct (forward) fft.

Brad


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug regression/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (24 preceding siblings ...)
  2008-01-21 23:12 ` lucier at math dot purdue dot edu
@ 2008-01-22 12:23 ` ubizjak at gmail dot com
  2008-01-22 12:29 ` [Bug target/33928] " pinskia at gcc dot gnu dot org
                   ` (89 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: ubizjak at gmail dot com @ 2008-01-22 12:23 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #25 from ubizjak at gmail dot com  2008-01-22 12:03 -------
Created an attachment (id=14996)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14996&action=view)
Much shorter testcase.

This testcase was used to track down problems with the fre pass. Stay tuned for
an analysis.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug target/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (25 preceding siblings ...)
  2008-01-22 12:23 ` ubizjak at gmail dot com
@ 2008-01-22 12:29 ` pinskia at gcc dot gnu dot org
  2008-01-22 12:38 ` ubizjak at gmail dot com
                   ` (88 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2008-01-22 12:29 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #26 from pinskia at gcc dot gnu dot org  2008-01-22 12:07 -------
Really I bet FRE is doing its job and the RA can't do its.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug target/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (26 preceding siblings ...)
  2008-01-22 12:29 ` [Bug target/33928] " pinskia at gcc dot gnu dot org
@ 2008-01-22 12:38 ` ubizjak at gmail dot com
  2008-01-22 13:24 ` rguenth at gcc dot gnu dot org
                   ` (87 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: ubizjak at gmail dot com @ 2008-01-22 12:38 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #27 from ubizjak at gmail dot com  2008-01-22 12:20 -------
As already noted by Richi in Comment #9, the difference is in usage of %rax.

gcc-4.2 generates:
        ...
        addq    $7, %rax
        leaq    (%rax,%rbp,2), %r10
        leaq    (%rax,%rdx,2), %rdx
        leaq    (%rax,%rdi,2), %rdi
        movq    (%rcx), %rsi
        movq    (%r13), %rcx
        leaq    (%rax,%r9,2), %r9
        leaq    (%rax,%r8,2), %r8
        leaq    (%rax,%r14,2), %r11
        addq    $8, %rbp
        movsd   (%rdx), %xmm3
        leaq    (%rax,%rsi,2), %rsi
        leaq    (%rax,%rcx,2), %rcx
        ...
        movsd   %xmm7, (%rcx)
        subsd   %xmm1, %xmm10
        addsd   %xmm1, %xmm0
        movsd   %xmm8, (%rsi)
        movsd   %xmm0, (%rdi)
        movapd  %xmm12, %xmm0
        subsd   %xmm3, %xmm12
        addsd   %xmm3, %xmm0
        movsd   %xmm0, (%r8)
        movsd   %xmm10, (%r9)
        movsd   %xmm12, (%rdx)
        jg      .L26

where gcc-4.3 limps along with:
        ...
        leaq    7(%rax), %r9
        movq    %rbx, -64(%rsp)
        movq    -56(%rsp), %rcx
        addq    %r10, %r10
        movsd   7(%rax,%rdx), %xmm3
        movsd   (%r9,%rbx,2), %xmm8
        movq    (%r11), %rbx
        movsd   7(%rax,%r10), %xmm5
        addq    %r8, %r8
        addq    %rdi, %rdi
        movsd   7(%rax,%r8), %xmm12
        movsd   15(%rbx), %xmm2
        leaq    (%r9,%rbp,2), %r9
        movsd   7(%rbx), %xmm1
        ...
        movsd   %xmm0, 7(%rax,%r9,2)
        movapd  %xmm10, %xmm0
        movsd   %xmm7, 7(%rax,%rcx)
        subsd   %xmm1, %xmm10
        addsd   %xmm1, %xmm0
        movsd   %xmm8, 7(%rax,%rsi)
        movsd   %xmm0, 7(%rax,%rdi)
        movapd  %xmm12, %xmm0
        subsd   %xmm3, %xmm12
        addsd   %xmm3, %xmm0
        movsd   %xmm0, 7(%rax,%r8)
        movsd   %xmm10, 7(%rax,%r10)
        movsd   %xmm12, 7(%rax,%rdx)
        jg      .L17

The difference is in the offset addresses. Looking at the tree dumps, it is
obvious that the problem is in the fre pass.

At the end of the loop (line 685+ in _.034.fre) gcc-4.2 transforms every
sequence of:

  D.2013_432 = ___fp_256 + 40B;
  D.2014_433 = *D.2013_432;
  D.2068_434 = (long int *) D.2014_433;
  D.2069_435 = D.2068_434 + 7B;
  D.2070_436 = (long int) D.2069_435;
  D.2094_437 = ___r3_35 << 1;
  D.2095_438 = D.2070_436 + D.2094_437;
  D.2096_439 = (double *) D.2095_438;
  *D.2096_439 = ___F64V53_431;
  D.2013_440 = ___fp_256 + 40B;
  D.2014_441 = *D.2013_440;
  D.2068_442 = (long int *) D.2014_441;
  D.2069_443 = D.2068_442 + 7B;
  D.2070_444 = (long int) D.2069_443;
  D.2091_445 = ___r4_257 << 1;
  D.2092_446 = D.2070_444 + D.2091_445;
  D.2093_447 = (double *) D.2092_446;
  *D.2093_447 = ___F64V52_430;
  D.2013_448 = ___fp_256 + 40B;
  D.2014_449 = *D.2013_448;
  D.2068_450 = (long int *) D.2014_449;
  D.2069_451 = D.2068_450 + 7B;
  D.2070_452 = (long int) D.2069_451;
  ...

into:

  D.2013_432 = D.2013_286;
  D.2014_433 = D.2014_287;
  D.2068_434 = D.2068_288;
  D.2069_435 = D.2069_289;
  D.2070_436 = D.2070_290;
  D.2094_437 = D.2094_366;
  D.2095_438 = D.2095_367;
  D.2096_439 = D.2096_368;
  *D.2096_439 = ___F64V53_431;
  D.2013_440 = D.2013_286;
  D.2014_441 = D.2014_287;
  D.2068_442 = D.2068_288;
  D.2069_443 = D.2069_289;
  D.2070_444 = D.2070_290;
  D.2091_445 = D.2091_357;
  D.2092_446 = D.2092_358;
  D.2093_447 = D.2093_359;
  *D.2093_447 = ___F64V52_430;
  D.2013_448 = D.2013_286;
  D.2014_449 = D.2014_287;
  D.2068_450 = D.2068_288;
  D.2069_451 = D.2069_289;
  D.2070_452 = D.2070_290;
  D.1994_453 = D.1994_258;
  D.2040_454 = D.2040_347;
  D.2041_455 = D.2041_348;
  D.2089_456 = D.2089_349;
  D.2090_457 = D.2090_350;
  ...

and this is optimized in further passes into:

  *D.2096 = ___F64V32 + ___F64V45;
  *D.2093 = ___F64V31 + ___F64V42;
  *D.2090 = ___F64V32 - ___F64V45;
  *D.2088 = ___F64V31 - ___F64V42;
  *D.2084 = ___F64V28 + ___F64V39;
  *D.2081 = ___F64V27 + ___F64V36;
  *D.2077 = ___F64V28 - ___F64V39;
  *D.2074 = ___F64V27 - ___F64V36;

However, for some reason gcc-4.3 transforms only _some_ instructions (line 708+
in _.085t.fre dump), creating:

  D.1683_428 = D.1683_282;
  D.1684_429 = D.1684_283;
  D.1738_430 = D.1738_284;
  D.1739_431 = D.1739_285;
  D.1740_432 = D.1740_286;
  D.1764_433 = D.1764_362;
  D.1765_434 = D.1765_363;
  D.1766_435 = D.1766_364;
  *D.1766_435 = ___F64V53_427;
  D.1683_436 = D.1683_282;
  D.1684_437 = *D.1683_436;
  D.1738_438 = (long unsigned int) D.1684_437;
  D.1739_439 = D.1738_438 + 7;
  D.1740_440 = (long int) D.1739_439;
  D.1761_441 = D.1761_353;
  D.1762_442 = D.1740_440 + D.1761_441;
  D.1763_443 = (double *) D.1762_442;
  *D.1763_443 = ___F64V52_426;
  D.1683_444 = D.1683_282;
  D.1684_445 = *D.1683_444;
  D.1738_446 = (long unsigned int) D.1684_445;
  D.1739_447 = D.1738_446 + 7;
  D.1740_448 = (long int) D.1739_447;
  ...

which leaves us with:

  *D.1766 = ___F64V32 + ___F64V45;
  *(double *) (D.1761 + (long int) ((long unsigned int) *pretmp.33 + 7)) =
___F64V31 + ___F64V42;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*temp.65 <<
1)) = ___F64V32 - ___F64V45;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*D.1685 <<
1)) = ___F64V31 - ___F64V42;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*temp.61 <<
1)) = ___F64V28 + ___F64V39;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*pretmp.152
<< 1)) = ___F64V27 + ___F64V36;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*pretmp.147
<< 1)) = ___F64V28 - ___F64V39;
  *(double *) ((long int) ((long unsigned int) *pretmp.33 + 7) + (*___fp.47 <<
1)) = ___F64V27 - ___F64V36;

and creates the suboptimal asm shown above.
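
For readers who want the pattern in compilable form, here is a minimal C
sketch. It is not taken from direct.i: the frame_t type, the choice of slot 5
(byte offset 40 with 8-byte slots) and the function names are hypothetical, and
a 64-bit target such as x86_64 Linux is assumed. The first function re-reads
the base from the frame before every store, like the untransformed GIMPLE; the
second computes the base once, which is roughly the shape left after FRE
replaces the repeated computations and later passes clean up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the frame: slot 5 holds an integer that is
   7 less than the address of the first double, mirroring the "+ 7" in
   the dumps. */
typedef struct { intptr_t slot[8]; } frame_t;

/* Base address recomputed from memory before each store (untransformed shape). */
static void store_pair_reload(frame_t *fp, intptr_t r3, intptr_t r4,
                              double v53, double v52)
{
    assert((r3 & 3) == 0 && (r4 & 3) == 0);  /* keeps r << 1 a multiple of 8 */
    *(double *)((fp->slot[5] + 7) + (r3 << 1)) = v53;
    *(double *)((fp->slot[5] + 7) + (r4 << 1)) = v52;
}

/* Base address loaded and adjusted once, then reused for both stores. */
static void store_pair_hoisted(frame_t *fp, intptr_t r3, intptr_t r4,
                               double v53, double v52)
{
    assert((r3 & 3) == 0 && (r4 & 3) == 0);
    intptr_t base = fp->slot[5] + 7;
    *(double *)(base + (r3 << 1)) = v53;
    *(double *)(base + (r4 << 1)) = v52;
}
```

Both functions store the same values as long as the stores do not overwrite the
frame slot itself; whether the compiler can prove that is what decides which
shape it may emit.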


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug target/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (27 preceding siblings ...)
  2008-01-22 12:38 ` ubizjak at gmail dot com
@ 2008-01-22 13:24 ` rguenth at gcc dot gnu dot org
  2008-01-22 13:25 ` [Bug tree-optimization/33928] " bonzini at gnu dot org
                   ` (86 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2008-01-22 13:24 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #28 from rguenth at gcc dot gnu dot org  2008-01-22 12:38 -------
This is an alias partitioning problem: with --param max-aliased-vops=10000 I
see the sequence optimized by FRE.  Alternatively, with the alias-oracle patch
for FRE, --param max-fields-for-field-sensitive=1 does the job as well.
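
As a general illustration of why alias information matters here (a generic C
sketch, not GCC internals and not code from the testcase; the function name is
hypothetical): a load can only be replaced by an earlier load of the same
location if no intervening store may have changed it.

```c
#include <assert.h>

/* Reads *slot, stores through dst, then reads *slot again.  The second
   read may reuse the first value only if dst is known not to alias slot. */
static long store_then_reload(long *slot, long *dst)
{
    assert(slot && dst);
    long before = *slot;   /* first load */
    *dst = before + 1;     /* store that may or may not hit *slot */
    return *slot;          /* second load of the same location */
}
```

Called with two distinct objects, the function returns the original value;
called with the same pointer twice, it returns the incremented one. A compiler
that cannot tell these cases apart has to keep the second load.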


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |alias


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (28 preceding siblings ...)
  2008-01-22 13:24 ` rguenth at gcc dot gnu dot org
@ 2008-01-22 13:25 ` bonzini at gnu dot org
  2008-01-22 13:29 ` ubizjak at gmail dot com
                   ` (85 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2008-01-22 13:25 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #29 from bonzini at gnu dot org  2008-01-22 12:39 -------
target independent


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |tree-optimization


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (29 preceding siblings ...)
  2008-01-22 13:25 ` [Bug tree-optimization/33928] " bonzini at gnu dot org
@ 2008-01-22 13:29 ` ubizjak at gmail dot com
  2008-01-22 13:30 ` rguenth at gcc dot gnu dot org
                   ` (84 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: ubizjak at gmail dot com @ 2008-01-22 13:29 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #30 from ubizjak at gmail dot com  2008-01-22 12:52 -------
Please note that for the original testcase (direct.i), even '-O2 --param
max-aliased-vops=100000' doesn't generate expected code.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3 Regression] 22% performance slowdown from 4.2.2 to 4.3.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (30 preceding siblings ...)
  2008-01-22 13:29 ` ubizjak at gmail dot com
@ 2008-01-22 13:30 ` rguenth at gcc dot gnu dot org
  2008-03-14 17:04 ` [Bug tree-optimization/33928] [4.3/4.4 Regression] 22% performance slowdown from 4.2.2 to 4.3/4.4.0 " rguenth at gcc dot gnu dot org
                   ` (83 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2008-01-22 13:30 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #31 from rguenth at gcc dot gnu dot org  2008-01-22 13:06 -------
Created an attachment (id=14997)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=14997&action=view)
asm with alias-oracle enabled FRE

This is the asm produced from direct.i with -O2 --param
max-fields-for-field-sensitive=1 (SFTs disabled, which is the goal for 4.4)
with the (ok, a modified) alias-oracle patch for FRE applied.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3/4.4 Regression] 22% performance slowdown from 4.2.2 to 4.3/4.4.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (31 preceding siblings ...)
  2008-01-22 13:30 ` rguenth at gcc dot gnu dot org
@ 2008-03-14 17:04 ` rguenth at gcc dot gnu dot org
  2008-05-30 16:02 ` lucier at math dot purdue dot edu
                   ` (82 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2008-03-14 17:04 UTC (permalink / raw)
  To: gcc-bugs



-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to fail|                            |4.3.0
   Target Milestone|4.3.0                       |4.3.1


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3/4.4 Regression] 22% performance slowdown from 4.2.2 to 4.3/4.4.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (32 preceding siblings ...)
  2008-03-14 17:04 ` [Bug tree-optimization/33928] [4.3/4.4 Regression] 22% performance slowdown from 4.2.2 to 4.3/4.4.0 " rguenth at gcc dot gnu dot org
@ 2008-05-30 16:02 ` lucier at math dot purdue dot edu
  2008-06-06 15:00 ` rguenth at gcc dot gnu dot org
                   ` (81 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2008-05-30 16:02 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #32 from lucier at math dot purdue dot edu  2008-05-30 16:01 -------
I've decided to test the current ira branch with this problem.  I used the
build instructions in comment 24.

With -fno-ira I get the same results as with 4.3.0 (no surprise there).

With -fira I get the time

(time (direct-fft-recursive-4 a table))
    422 ms real time
    421 ms cpu time (421 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

which is an improvement, and the code at the beginning of the loop is

.L7262:
        movq    %rdx, %rcx
        addq    (%rsi), %rcx
        leaq    4(%rdx), %r15
        movq    %rcx, (%rbx)
        addq    $4, %rcx
        movq    %rcx, (%rbp)
        movq    (%rbx), %rcx
        addq    (%rsi), %rcx
        movq    %rcx, (%rdi)
        addq    $4, %rcx
        movq    %rcx, (%r8)
        movq    (%rdi), %rcx
        addq    (%rsi), %rcx
        leaq    4(%rcx), %r10
        movq    %rcx, (%r9)
        movq    %r10, (%r13)
        movq    (%rax), %rcx
        addq    $7, %rcx
        movsd   (%rcx,%r10,2), %xmm4
        movq    (%r9), %r10
        leaq    (%rcx,%rdx,2), %r11
        addq    $8, %rdx
        movsd   (%r11), %xmm11
        movsd   (%rcx,%r10,2), %xmm5
        movq    (%r8), %r10 
        movsd   (%rcx,%r10,2), %xmm6
        movq    (%rdi), %r10
        movsd   (%rcx,%r10,2), %xmm7
        movq    (%rbp), %r10
        movsd   (%rcx,%r10,2), %xmm8
        movq    (%rbx), %r10
        movapd  %xmm8, %xmm14
        movsd   (%rcx,%r10,2), %xmm9
        leaq    (%r15,%r15), %r10
        movsd   (%rcx,%r10), %xmm10
        movq    (%r12), %rcx
        movapd  %xmm9, %xmm15
        movsd   15(%rcx), %xmm1
        movsd   7(%rcx), %xmm2
        movapd  %xmm1, %xmm13
        movsd   31(%rcx), %xmm3
        movapd  %xmm2, %xmm12

which is also an improvement, but it still is nowhere near the result for
4.2.2.

So, whatever is causing this problem, it appears the new register allocator
isn't going to fix it.

The code generated by today's mainline (136210) isn't better than 4.3.0; the
time is

(time (direct-fft-recursive-4 a table))
    469 ms real time
    469 ms cpu time (469 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

and the code is essentially the same as for 4.3.0.
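
To put the timings in this comment on the same scale as the percentage in the
bug title, a small helper such as the following can be used. The function name
is hypothetical; the 366 ms baseline for gcc-4.2.2 is the figure from the
original report at the top of this thread.

```c
#include <assert.h>

/* Percentage by which measured_ms exceeds baseline_ms. */
static double slowdown_percent(double baseline_ms, double measured_ms)
{
    assert(baseline_ms > 0.0);
    return 100.0 * (measured_ms - baseline_ms) / baseline_ms;
}
```

Against the 366 ms baseline, the 448 ms from the original report is about 22%
slower, the 422 ms measured with -fira about 15% slower, and the 469 ms from
mainline revision 136210 about 28% slower. These are single runs, so small
differences should be read with some caution.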


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3/4.4 Regression] 22% performance slowdown from 4.2.2 to 4.3/4.4.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (33 preceding siblings ...)
  2008-05-30 16:02 ` lucier at math dot purdue dot edu
@ 2008-06-06 15:00 ` rguenth at gcc dot gnu dot org
  2008-07-09 16:06 ` lucier at math dot purdue dot edu
                   ` (80 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2008-06-06 15:00 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #33 from rguenth at gcc dot gnu dot org  2008-06-06 14:58 -------
4.3.1 is being released, adjusting target milestone.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.3.1                       |4.3.2


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3/4.4 Regression] 22% performance slowdown from 4.2.2 to 4.3/4.4.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (34 preceding siblings ...)
  2008-06-06 15:00 ` rguenth at gcc dot gnu dot org
@ 2008-07-09 16:06 ` lucier at math dot purdue dot edu
  2008-08-27 22:10 ` jsm28 at gcc dot gnu dot org
                   ` (79 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2008-07-09 16:06 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #34 from lucier at math dot purdue dot edu  2008-07-09 16:05 -------
Problem still exists with

euler-18% /pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release
--with-gmp=/pkgs/gmp-4.2.2/ --with-mpfr=/pkgs/gmp-4.2.2/
--prefix=/pkgs/gcc-mainline --enable-languages=c
--enable-gather-detailed-mem-stats
Thread model: posix
gcc version 4.4.0 20080708 (experimental) [trunk revision 137644] (GCC) 

Just checking whether recent changes happened to fix it.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3/4.4 Regression] 22% performance slowdown from 4.2.2 to 4.3/4.4.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (35 preceding siblings ...)
  2008-07-09 16:06 ` lucier at math dot purdue dot edu
@ 2008-08-27 22:10 ` jsm28 at gcc dot gnu dot org
  2008-09-04 20:40 ` lucier at math dot purdue dot edu
                   ` (78 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: jsm28 at gcc dot gnu dot org @ 2008-08-27 22:10 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #35 from jsm28 at gcc dot gnu dot org  2008-08-27 22:02 -------
4.3.2 is released, changing milestones to 4.3.3.


-- 

jsm28 at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.3.2                       |4.3.3


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3/4.4 Regression] 22% performance slowdown from 4.2.2 to 4.3/4.4.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (36 preceding siblings ...)
  2008-08-27 22:10 ` jsm28 at gcc dot gnu dot org
@ 2008-09-04 20:40 ` lucier at math dot purdue dot edu
  2008-09-04 20:45 ` rguenth at gcc dot gnu dot org
                   ` (77 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2008-09-04 20:40 UTC (permalink / raw)
  To: gcc-bugs

------- Comment #36 from lucier at math dot purdue dot edu  2008-09-04 20:39 -------
I don't really understand the status of this bug.

Before 4.3.0, it was P1, and Mark said he'd "like to see us start to
explain these kinds of dramatic performance changes."

There was quite a bit of detective work that ended with "for some reason
gcc-4.3 transforms only _some_ instructions (line 708+ in _.085t.fre dump)
...".

Richard opined that it was an "alias partitioning problem", but Uros noted that
for the original code (as opposed to the reduced testcase) even raising
--param max-aliased-vops to 100000 doesn't fix the problem.

So (a) we don't know what the current code is doing wrong, and (b) we don't
know why 4.2 got it right.

So I don't think Mark got what he wanted, and now it's P2, and each release the
target release for fixing it gets pushed back.

I've been testing mainline on this bug sporadically, especially when an entry
in gcc-patches mentions some words that also appear on this PR, to see if it's
fixed.  I'm a bit concerned that the target of 4.3.* is becoming increasingly
out of reach, as changes committed to that branch seem to be more and more
conservative because it's a release branch.

I don't think the code for this bug is terribly atypical for machine-generated
code; it would be nice to be able to remove this performance regression. 
Unfortunately, I'm in no position to do so.

-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4 Regression] 22% performance slowdown from 4.2.2 to 4.3/4.4.0 in floating-point code
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (37 preceding siblings ...)
  2008-09-04 20:40 ` lucier at math dot purdue dot edu
@ 2008-09-04 20:45 ` rguenth at gcc dot gnu dot org
  2008-09-04 20:50 ` lucier at math dot purdue dot edu
                   ` (76 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2008-09-04 20:45 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #37 from rguenth at gcc dot gnu dot org  2008-09-04 20:43 -------
We have to admit that this bug is unlikely to get fixed in the 4.3 series.
It still lacks proper analysis, as unfortunately that done on the shorter
testcase was not valid.  Analysis takes time, and honestly at this point I'd
rather spend time fixing wrong-code or ice-on-valid bugs.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4 Regression] 22% performance slowdown from 4.2.2 to 4.3/4.4.0 in floating-point code
From: lucier at math dot purdue dot edu @ 2008-09-04 20:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #38 from lucier at math dot purdue dot edu  2008-09-04 20:49 -------
OK, but I was moved to write because Jakub's latest 4.4 status report requests

Please concentrate now on fixing bugs, especially the performance regressions.

and this is a definite 4.3/4.4 performance regression from 4.2.  (How many of
the P1 PRs are performance regressions?)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4 Regression] 22% performance slowdown from 4.2.2 to 4.3/4.4.0 in floating-point code
From: lucier at math dot purdue dot edu @ 2008-12-06 16:39 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #39 from lucier at math dot purdue dot edu  2008-12-06 16:37 -------
I may have narrowed down the problem a bit.

With this compiler (revision 118491):

pythagoras-277% /tmp/lucier/install/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release
--prefix=/tmp/lucier/install --enable-languages=c
Thread model: posix
gcc version 4.3.0 20061105 (experimental)

one gets (on a faster machine than previous reports)

(time (direct-fft-recursive-4 a table))
    133 ms real time
    140 ms cpu time (140 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

With this compiler (revision 118474):

pythagoras-24% /tmp/lucier/install/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release
--prefix=/tmp/lucier/install --enable-languages=c
Thread model: posix
gcc version 4.3.0 20061104 (experimental)

one gets

(time (direct-fft-recursive-4 a table))
    116 ms real time
    108 ms cpu time (108 user, 0 system)
    no collections
    64 bytes allocated
    no minor faults
    no major faults

and you see the typical problem with assembly code from direct.i with the later
compiler.

Paolo may have been right about fwprop; this patch was installed that day:

Author: bonzini
Date: Sat Nov  4 08:36:45 2006
New Revision: 118475

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=118475
Log:
2006-11-03  Paolo Bonzini  <bonzini@gnu.org>
            Steven Bosscher  <stevenb.gcc@gmail.com>

        * fwprop.c: New file.
        * Makefile.in: Add fwprop.o.
        * tree-pass.h (pass_rtl_fwprop, pass_rtl_fwprop_with_addr): New.
        * passes.c (init_optimization_passes): Schedule forward propagation.
        * rtlanal.c (loc_mentioned_in_p): Support NULL value of the second
        parameter.
        * timevar.def (TV_FWPROP): New.
        * common.opt (-fforward-propagate): New.
        * opts.c (decode_options): Enable forward propagation at -O2.
        * gcse.c (one_cprop_pass): Do not run local cprop unless touching
        jumps.
        * cse.c (fold_rtx_subreg, fold_rtx_mem, fold_rtx_mem_1, find_best_addr,
        canon_for_address, table_size): Remove.
        (new_basic_block, insert, remove_from_table): Remove references to
        table_size.
        (fold_rtx): Process SUBREGs and MEMs with equiv_constant, make
        simplification loop more straightforward by not calling fold_rtx
        recursively.
        (equiv_constant): Move here a small part of fold_rtx_subreg,
        do not call fold_rtx.  Call avoid_constant_pool_reference
        to process MEMs.
        * recog.c (canonicalize_change_group): New.
        * recog.h (canonicalize_change_group): New.

        * doc/invoke.texi (Optimization Options): Document fwprop.
        * doc/passes.texi (RTL passes): Document fwprop.


Added:
    trunk/gcc/fwprop.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/common.opt
    trunk/gcc/cse.c
    trunk/gcc/doc/invoke.texi
    trunk/gcc/doc/passes.texi
    trunk/gcc/gcse.c
    trunk/gcc/opts.c
    trunk/gcc/passes.c
    trunk/gcc/recog.c
    trunk/gcc/recog.h
    trunk/gcc/rtlanal.c
    trunk/gcc/timevar.def
    trunk/gcc/tree-pass.h


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4 Regression] 22% performance slowdown from 4.2.2 to 4.3/4.4.0 in floating-point code
From: bonzini at gnu dot org @ 2008-12-07  2:56 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #40 from bonzini at gnu dot org  2008-12-07 02:55 -------
IIUC this is a typical case in which CSE was fixing something that earlier
passes messed up.  Unfortunately fwprop does (better) what CSE was meant to do,
but does not do what I assumed was already done before CSE.

If the problem is aliasing/FRE, then I think Richi is the one who could fix it
for good in the tree passes.  If there is more to it, however, I can take a
look at why fwprop is generating the ugly code.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4 Regression] 22% performance slowdown from 4.2.2 to 4.3/4.4.0 in floating-point code
From: rguenth at gcc dot gnu dot org @ 2008-12-07 13:01 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #41 from rguenth at gcc dot gnu dot org  2008-12-07 13:00 -------
There's not much to be done for aliasing - everything points to global memory
and thus aliases.  There may be some opportunities for offset-based
disambiguations via pointers, but I didn't investigate in detail.  Whoever
wants someone to work on specific details needs to provide way shorter
testcases ;)
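
A minimal sketch of the aliasing pattern being described, not taken from the
testcase: all names here (data, frame, slot 5, store_pair) are hypothetical.
It assumes the base address of the array lives in a memory slot and that the
code is built with -fno-strict-aliasing, as in the reported compile options,
so a store through a double pointer may in principle overwrite that slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

double data[8];      /* hypothetical array being transformed */
intptr_t frame[8];   /* hypothetical block of word-sized slots in memory */

/* The base address is read from fp[5] for each store.  A compiler that
   cannot prove the first store leaves fp[5] untouched has to load the
   base again before the second store.  */
void store_pair(intptr_t *fp, size_t i, size_t j, double x, double y)
{
    assert(fp != NULL);
    *(double *)((char *)fp[5] + i * sizeof(double)) = x;
    /* The store above may alias fp[5], so the base is re-read here.  */
    *(double *)((char *)fp[5] + j * sizeof(double)) = y;
}

/* Same result when nothing aliases, but the base is read once into a
   local, which no store through a pointer can modify.  */
void store_pair_hoisted(intptr_t *fp, size_t i, size_t j, double x, double y)
{
    assert(fp != NULL);
    char *base = (char *)fp[5];
    *(double *)(base + i * sizeof(double)) = x;
    *(double *)(base + j * sizeof(double)) = y;
}
```

The two functions only differ when a store really does hit the slot holding
the base; the compiler has to assume it might unless it can disambiguate the
accesses, for instance by offset.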


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2008-12-07 19:40 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #42 from lucier at math dot purdue dot edu  2008-12-07 19:39 -------
Just a comment that -fforward-propagate isn't enabled at -O1 (the main
optimization option in the test) while the cse code it replaces was enabled at
-O1.  This is presumably why adding -fno-forward-propagate to the command line
in the test a year ago didn't affect the generated code.

Adding -fno-forward-propagate to the command line of the test case with
revision r118475 of gcc changes the generated code, but doesn't improve the
problem code in the main loop.

Updated the title to report the performance hit on

Intel(R) Xeon(R) CPU           X5460  @ 3.16GHz

as reported by /proc/cpuinfo


-- 

lucier at math dot purdue dot edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[4.3/4.4 Regression] 22%    |[4.3/4.4 Regression] 30%
                   |performance slowdown from   |performance slowdown in
                   |4.2.2 to 4.3/4.4.0 in       |floating-point code caused
                   |floating-point code         |by  r118475


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: rguenth at gcc dot gnu dot org @ 2009-01-24 10:28 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #43 from rguenth at gcc dot gnu dot org  2009-01-24 10:19 -------
GCC 4.3.3 is being released, adjusting target milestone.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.3.3                       |4.3.4


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bonzini at gnu dot org @ 2009-02-13 16:05 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #44 from bonzini at gnu dot org  2009-02-13 16:05 -------
A simplified (local, noncascading) fwprop not using UD chains would not be hard
to do...  Basically, at -O1 use FOR_EACH_BB/FOR_EACH_BB_INSN instead of walking
the uses, keep a (regno, insn) map of pseudos (cleared at the beginning of
every basic block), and use that info instead of UD chains in
use_killed_between...
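
A rough sketch of that idea on a toy instruction list, not GCC's RTL or DF
data structures.  The insn layout, the opcodes and local_fwprop are all
hypothetical; a plain loop with OP_LABEL markers stands in for
FOR_EACH_BB/FOR_EACH_BB_INSN, and the per-block map plays the role of the
(regno, insn) map.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NREGS 16

enum op {
    OP_ADDI,   /* dst = src + imm */
    OP_LOAD,   /* dst = mem[src + imm] */
    OP_LABEL   /* marks the start of a new basic block */
};

struct insn { enum op op; int dst, src, imm; };

/* Toy local, noncascading forward propagation.  Register numbers are
   assumed to be in [0, NREGS).  Returns the number of substitutions.  */
int local_fwprop(struct insn *code, int n)
{
    struct insn def[NREGS];   /* regno -> defining OP_ADDI, as written */
    bool valid[NREGS];        /* is def[regno] usable at this insn? */
    int changed = 0;

    assert(code != NULL || n == 0);
    memset(valid, 0, sizeof valid);

    for (int i = 0; i < n; i++) {
        struct insn orig = code[i];

        if (orig.op == OP_LABEL) {
            /* New basic block: the map is purely local, so clear it.  */
            memset(valid, 0, sizeof valid);
            continue;
        }

        /* Substitute the recorded definition into the address operand.  */
        if (valid[orig.src]) {
            code[i].src = def[orig.src].src;
            code[i].imm = orig.imm + def[orig.src].imm;
            changed++;
        }

        /* This insn overwrites orig.dst: drop its old definition and any
           recorded definition that reads it (the "killed between" test).  */
        valid[orig.dst] = false;
        for (int r = 0; r < NREGS; r++)
            if (valid[r] && def[r].src == orig.dst)
                valid[r] = false;

        /* Record the unpropagated form so later uses do not cascade.  */
        if (orig.op == OP_ADDI && orig.dst != orig.src) {
            def[orig.dst] = orig;
            valid[orig.dst] = true;
        }
    }
    return changed;
}
```

Because the map is cleared at every block boundary, no dataflow problem has
to be solved, which is the property that would make this cheap enough for -O1
in the proposal above.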


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-02-13 16:10 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #45 from lucier at math dot purdue dot edu  2009-02-13 16:09 -------
Subject: Re:  [4.3/4.4 Regression] 30%
 performance slowdown in floating-point code caused by  r118475

On Fri, 2009-02-13 at 16:05 +0000, bonzini at gnu dot org wrote:
> ------- Comment #44 from bonzini at gnu dot org  2009-02-13 16:05 -------
> A simplified (local, noncascading) fwprop not using UD chains would not be hard
> to do...  Basically, at -O1 use FOR_EACH_BB/FOR_EACH_BB_INSN instead of walking
> the uses, keep a (regno, insn) map of pseudos (cleared at the beginning of
> every basic block), and use that info instead of UD chains in
> use_killed_between...

As noted in comment 42, enabling FWPROP on this test case does not fix
the performance problem.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bonzini at gnu dot org @ 2009-02-13 16:32 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #46 from bonzini at gnu dot org  2009-02-13 16:32 -------
Regarding your comment in bug 26854:

> address calculations are no longer optimized as much as they
> were before 

Sometimes, actually, they are optimized better.  It depends on the case.

In comment #42, also, you talked about -O1, where fwprop is not enabled.  So
I'm failing to understand if the problem is at the tree or RTL level for this
bug.

My comment was related to something said in PR39517, i.e. that chains are very
expensive and a reason why fwprop should not be enabled at -O1.  Following up
on my comment, alternatively, fwprop could compute its own dataflow instead of
using UD chains, since it only cares by design about uses with a single
definition.  This looks much better.

You would use something like df_chain_create_bb and
df_chain_create_bb_process_use, with code like the following (cfr.
df_chain_create_bb_process_use):

          /* Do not want to go through this for an uninitialized var.  */
          int count = DF_DEFS_COUNT (uregno);
          if (count)
            {
              if (top_flag == (DF_REF_FLAGS (use) & DF_REF_AT_TOP))
                {
                  unsigned int first_index = DF_DEFS_BEGIN (uregno);
                  unsigned int last_index = first_index + count - 1;

                  /* Uninitialized?  Exit.  */
                  bmp_iter_set_init (&bi, local_rd, first_index, &def_index);
                  if (!bmp_iter_set (&bi, &def_index)
                      || def_index > last_index)
                    continue;

                  /* 2 or more defs for this use, exit.  Otherwise the use
                     has a single definition and can be forward-propagated.  */
                  bmp_iter_next (&bi, &def_index);
                  if (!bmp_iter_set (&bi, &def_index)
                      || def_index > last_index)
                    SET_BIT (can_fwprop, DF_REF_ID (use));
                }
            }

With this change there would be no reason not to run fwprop at -O1.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-02-13 17:23 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #47 from lucier at math dot purdue dot edu  2009-02-13 17:22 -------
Subject: Re:  [4.3/4.4 Regression] 30%
 performance slowdown in floating-point code caused by  r118475

On Fri, 2009-02-13 at 16:32 +0000, bonzini at gnu dot org wrote:
> 
> 
> ------- Comment #46 from bonzini at gnu dot org  2009-02-13 16:32 -------
> Regarding your comment in bug 26854:
> 
> > address calculations are no longer optimized as much as they
> > were before 
> 
> Sometimes, actually, they are optimized better.  It depends on the case.

Yes.  I don't see why the optimizations in CSE, which were relatively
cheap and which were effective for this case, needed to be disabled when
FWPROP was added without, evidently, understanding why FWPROP does not
do what CSE was already doing.

> In comment #42, also, you talked about -O1, where fwprop is not enabled.  So
> I'm failing to understand if the problem is at the tree or RTL level for this
> bug.

When I add -fforward-propagate to the command line, then the assembly
code changes in some ways, but the performance problem remains the same.

Brad


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bonzini at gnu dot org @ 2009-02-13 20:10 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #48 from bonzini at gnu dot org  2009-02-13 20:09 -------
Subject: Re:  [4.3/4.4 Regression] 30% 
        performance slowdown in floating-point code caused by r118475

> Yes.  I don't see why the optimizations in CSE, which were relatively
> cheap and which were effective for this case, needed to be disabled when
> FWPROP was added without, evidently, understanding why FWPROP does not
> do what CSE was already doing.

Just to mention it, fwprop saved 3% of compile time.  That's not
"cheap".  It was also tested with SPEC and Nullstone on several
architectures.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4/4.5 Regression] 79% performance slowdown in floating-point code partially caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-04-23 15:59 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #49 from lucier at math dot purdue dot edu  2009-04-23 15:58 -------
With 4.4.0 and with mainline this code now runs in 280 ms instead of in 156 ms
with 4.2.4.

Since 280/156 = 1.794871794871795 I changed the subject line (the slowdown is
now not completely caused by r118475).

I guess I'll post the assembly code generated by 4.4.0 in the next attachment.

Timings (best of three runs) for the last

(time (direct-fft-recursive-4 a table))

from

 gsi/gsi -e '(define a (time (expt 3 10000000)))(define b (time (* a a)))'

With gcc-4.1.2:

    188 ms cpu time (188 user, 0 system)

With gcc-4.2.4

    156 ms cpu time (152 user, 4 system)

With gcc-4.3.3:

    180 ms cpu time (180 user, 0 system)

With gcc-4.4.0

    280 ms cpu time (280 user, 0 system)

With 4.5.0 20090423 (experimental) [trunk revision 146634]

    280 ms cpu time (280 user, 0 system)
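
For reference, the percentages follow directly from the CPU times listed
above; a small sketch of the arithmetic, with gcc-4.2.4 as the baseline.  The
struct and function names are only illustrative.

```c
#include <assert.h>
#include <stddef.h>

struct timing { const char *compiler; double cpu_ms; };

/* CPU times quoted above for (time (direct-fft-recursive-4 a table)).  */
const struct timing timings[] = {
    { "gcc-4.1.2",       188.0 },
    { "gcc-4.2.4",       156.0 },   /* baseline */
    { "gcc-4.3.3",       180.0 },
    { "gcc-4.4.0",       280.0 },
    { "trunk r146634",   280.0 },
};

/* Percent slowdown of `ms` relative to `baseline_ms`; negative means
   faster than the baseline.  */
double slowdown_percent(double baseline_ms, double ms)
{
    assert(baseline_ms > 0.0);
    return (ms / baseline_ms - 1.0) * 100.0;
}
```

With these numbers, slowdown_percent(156.0, 280.0) is roughly 79.5, which is
where the 79% in the new subject line comes from.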


-- 

lucier at math dot purdue dot edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[4.3/4.4/4.5 Regression] 30%|[4.3/4.4/4.5 Regression] 79%
                   |performance slowdown in     |performance slowdown in
                   |floating-point code caused  |floating-point code
                   |by  r118475                 |partially caused by  r118475


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4/4.5 Regression] 79% performance slowdown in floating-point code partially caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-04-23 16:01 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #50 from lucier at math dot purdue dot edu  2009-04-23 16:00 -------
Created an attachment (id=17685)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17685&action=view)
direct.s generated by 4.4.0


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4/4.5 Regression] 79% performance slowdown in floating-point code partially caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-04-23 16:03 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #51 from lucier at math dot purdue dot edu  2009-04-23 16:03 -------
Forgot to mention, the main loop starts at .L2947.

This is on

model name      : Intel(R) Core(TM)2 Duo CPU     E6550  @ 2.33GHz

Brad


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-04-26 18:27 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #52 from lucier at math dot purdue dot edu  2009-04-26 18:27 -------
I narrowed down the new performance regression to code added some time around
March 12, 2009, so I changed back the subject line of this PR to reflect the
performance regression caused only by the code added 2006-11-03 and added a new
PR

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39914

to reflect the effects of the March, 2009, code.


-- 

lucier at math dot purdue dot edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[4.3/4.4/4.5 Regression] 79%|[4.3/4.4/4.5 Regression] 30%
                   |performance slowdown in     |performance slowdown in
                   |floating-point code         |floating-point code caused
                   |partially caused by  r118475|by  r118475


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-05-06  3:43 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #53 from lucier at math dot purdue dot edu  2009-05-06 03:43 -------
I posted a possible fix to gcc-patches with the subject line

Possible fix for 30% performance regression in PR 33928

Here's the assembly for the main loop after the changes I proposed:

.L4230:
        movq    %r11, %rdi
        addq    8(%r10), %rdi
        movq    8(%r10), %rsi
        movq    8(%r10), %rdx
        movq    40(%r10), %rax
        leaq    4(%r11), %rbx
        addq    %rdi, %rsi
        leaq    4(%rdi), %r9
        movq    %rdi, -8(%r10)
        addq    %rsi, %rdx
        leaq    4(%rsi), %r8
        movq    %rsi, -24(%r10)
        leaq    4(%rdx), %rcx
        movq    %r9, -16(%r10)
        movq    %rdx, -40(%r10)
        movq    %r8, -32(%r10)
        addq    $7, %rax
        movq    %rcx, -48(%r10)
        movsd   (%rax,%rcx,2), %xmm12
        leaq    (%rbx,%rbx), %rcx
        movsd   (%rax,%rdx,2), %xmm3
        leaq    (%rax,%r11,2), %rdx
        addq    $8, %r11
        movsd   (%rax,%r8,2), %xmm14
        cmpq    %r11, %r13
        movsd   (%rax,%rsi,2), %xmm13
        movsd   (%rax,%r9,2), %xmm11
        movsd   (%rax,%rdi,2), %xmm10
        movsd   (%rax,%rcx), %xmm8
        movq    24(%r10), %rax
        movsd   (%rdx), %xmm7
        movsd   15(%rax), %xmm2
        movsd   7(%rax), %xmm1
        movapd  %xmm2, %xmm0
        movsd   31(%rax), %xmm9
        movapd  %xmm1, %xmm6
        mulsd   %xmm3, %xmm0
        movapd  %xmm1, %xmm4
        mulsd   %xmm12, %xmm6
        mulsd   %xmm3, %xmm4
        movapd  %xmm1, %xmm3
        mulsd   %xmm13, %xmm1
        mulsd   %xmm14, %xmm3
        addsd   %xmm0, %xmm6
        movapd  %xmm2, %xmm0
        movsd   23(%rax), %xmm5
        mulsd   %xmm12, %xmm0
        movapd  %xmm7, %xmm12
        subsd   %xmm0, %xmm4
        movapd  %xmm2, %xmm0
        mulsd   %xmm14, %xmm2
        movapd  %xmm8, %xmm14
        mulsd   %xmm13, %xmm0
        movapd  %xmm11, %xmm13
        addsd   %xmm6, %xmm11
        subsd   %xmm6, %xmm13
        subsd   %xmm2, %xmm1
        movapd  %xmm10, %xmm2
        addsd   %xmm0, %xmm3
        movapd  %xmm5, %xmm0
        subsd   %xmm4, %xmm2
        addsd   %xmm4, %xmm10
        subsd   %xmm1, %xmm12
        addsd   %xmm1, %xmm7
        movapd  %xmm9, %xmm1
        subsd   %xmm3, %xmm14
        mulsd   %xmm2, %xmm0
        xorpd   .LC5(%rip), %xmm1
        addsd   %xmm3, %xmm8
        movapd  %xmm1, %xmm3
        mulsd   %xmm2, %xmm1
        movapd  %xmm5, %xmm2
        mulsd   %xmm13, %xmm3
        mulsd   %xmm11, %xmm2
        addsd   %xmm0, %xmm3
        movapd  %xmm5, %xmm0
        mulsd   %xmm10, %xmm5
        mulsd   %xmm13, %xmm0
        subsd   %xmm0, %xmm1
        movapd  %xmm9, %xmm0
        mulsd   %xmm11, %xmm9
        mulsd   %xmm10, %xmm0
        subsd   %xmm9, %xmm5
        addsd   %xmm0, %xmm2
        movapd  %xmm7, %xmm0
        addsd   %xmm5, %xmm0
        subsd   %xmm5, %xmm7
        movsd   %xmm0, (%rdx)
        movapd  %xmm8, %xmm0
        movq    40(%r10), %rax
        subsd   %xmm2, %xmm8
        addsd   %xmm2, %xmm0
        movsd   %xmm0, 7(%rcx,%rax)
        movq    -8(%r10), %rdx
        movq    40(%r10), %rax
        movapd  %xmm12, %xmm0
        subsd   %xmm1, %xmm12
        movsd   %xmm7, 7(%rax,%rdx,2)
        movq    -16(%r10), %rdx
        movq    40(%r10), %rax
        addsd   %xmm1, %xmm0
        movsd   %xmm8, 7(%rax,%rdx,2)
        movq    -24(%r10), %rdx
        movq    40(%r10), %rax
        movsd   %xmm0, 7(%rax,%rdx,2)
        movapd  %xmm14, %xmm0
        movq    -32(%r10), %rdx
        movq    40(%r10), %rax
        subsd   %xmm3, %xmm14
        addsd   %xmm3, %xmm0
        movsd   %xmm0, 7(%rax,%rdx,2)
        movq    -40(%r10), %rdx
        movq    40(%r10), %rax
        movsd   %xmm12, 7(%rax,%rdx,2)
        movq    -48(%r10), %rdx
        movq    40(%r10), %rax
        movsd   %xmm14, 7(%rax,%rdx,2)
        jg      .L4230
        movq    %rbx, %r13
.L4228:


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug tree-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-05-06  3:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #54 from lucier at math dot purdue dot edu  2009-05-06 03:50 -------
Created an attachment (id=17805)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17805&action=view)
svn diff of cse.c to fix the performance regression

This partially reverts r118475 and adds code to call find_best_address for MEMs
in fold_rtx.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (55 preceding siblings ...)
  2009-05-06  3:50 ` lucier at math dot purdue dot edu
@ 2009-05-06  9:21 ` bonzini at gnu dot org
  2009-05-06  9:32 ` bonzini at gnu dot org
                   ` (58 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2009-05-06  9:21 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #55 from bonzini at gnu dot org  2009-05-06 09:20 -------
Created an attachment (id=17807)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17807&action=view)
svn diff of cse.c to "fix" the performance regression (updated)


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #17805|0                           |1
        is obsolete|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (56 preceding siblings ...)
  2009-05-06  9:21 ` bonzini at gnu dot org
@ 2009-05-06  9:32 ` bonzini at gnu dot org
  2009-05-06  9:50 ` jakub at gcc dot gnu dot org
                   ` (57 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2009-05-06  9:32 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #56 from bonzini at gnu dot org  2009-05-06 09:31 -------
Created an attachment (id=17808)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17808&action=view)
usable testcase

OK, I managed to produce reasonably readable source code (un-included the
stdlib files, removed unused Gambit stuff and ___ prefixes, simplified some
expressions), found the heavy loops, annotated them with asm statements (see
comment #18, 2007-11-30) and measured the length of the loops.

                   4.2      4.5     4.5 + patch
LOOP 1            ~190     ~230    ~190
INNER LOOP 1.1    ~120     ~130    ~120
LOOP 2             33       36      31

I am thus obsoleting (almost) everything that was posted and is not relevant
anymore.  Let's start from scratch with the new testcase.
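
For readers who have not seen the asm-statement trick mentioned above, here is
a minimal sketch of the general idea. It is an assumption about the technique,
not a copy of the attached testcase: the function sum_scaled, the LOOP_MARK
macro and the marker strings are all hypothetical, and the "#" comment
character is assumed to be the x86 GNU assembler's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker macro.  On x86 with GCC, an asm statement whose
   template is only an assembler comment emits no instruction; it just
   leaves a comment line in the .s output.  Elsewhere it expands to
   nothing, since the comment character differs between assemblers.  */
#if defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))
# define LOOP_MARK(text) __asm__ volatile ("# " text)
#else
# define LOOP_MARK(text) ((void) 0)
#endif

/* Illustrative loop, not taken from the testcase.  */
double
sum_scaled (const double *a, size_t n, double k)
{
  double s = 0.0;
  size_t i;

  assert (a != NULL || n == 0);

  LOOP_MARK ("LOOP 1 begins");
  for (i = 0; i < n; i++)
    s += k * a[i];
  LOOP_MARK ("LOOP 1 ends");

  return s;
}
```

After compiling with -S, the instructions between the two marker comments can
be counted in the .s file. Note that a volatile asm can itself constrain
scheduling around it, so the markers are best kept outside the hot loop body.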


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #14418|0                           |1
        is obsolete|                            |
  Attachment #14423|0                           |1
        is obsolete|                            |
  Attachment #14424|0                           |1
        is obsolete|                            |
  Attachment #14425|0                           |1
        is obsolete|                            |
  Attachment #14426|0                           |1
        is obsolete|                            |
  Attachment #14534|0                           |1
        is obsolete|                            |
  Attachment #14535|0                           |1
        is obsolete|                            |
  Attachment #14536|0                           |1
        is obsolete|                            |
  Attachment #14997|0                           |1
        is obsolete|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (57 preceding siblings ...)
  2009-05-06  9:32 ` bonzini at gnu dot org
@ 2009-05-06  9:50 ` jakub at gcc dot gnu dot org
  2009-05-06  9:57 ` bonzini at gnu dot org
                   ` (56 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-06  9:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #57 from jakub at gcc dot gnu dot org  2009-05-06 09:49 -------
Why do you need any #include lines at all in the reduced testcase?  Compiles
just fine even without them...


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (58 preceding siblings ...)
  2009-05-06  9:50 ` jakub at gcc dot gnu dot org
@ 2009-05-06  9:57 ` bonzini at gnu dot org
  2009-05-06 10:00 ` bonzini at gnu dot org
                   ` (55 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2009-05-06  9:57 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #58 from bonzini at gnu dot org  2009-05-06 09:56 -------
Uhm, it's better to run unpatched 4.5 with -O1 -fforward-propagate to get a
fair comparison.  Also, I was counting the loop headers, which are not part of
the hot code.

                   4.2 -O1     4.5 -O1 -ffw-prop     4.5 + patch -O1
LOOP 1               181             201                  180
INNER LOOP 1.1       117             118                  113
LOOP 2                27              27                   26

This shows that you should compare running the code (you can use direct.i) with
4.2/-O1 and 4.5/-O1 -fforward-propagate.  This is very important, otherwise
you're comparing apples to oranges.
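
To put the rows of the table above in relative terms, here is a quick sketch
that turns two instruction counts into a percentage. The helper name
percent_change is hypothetical, and the result describes static loop length
only, not run time.

```c
#include <assert.h>

/* Relative change of VALUE with respect to BASE, in percent.
   Positive means VALUE is larger (a longer loop) than BASE.  */
double
percent_change (double base, double value)
{
  assert (base != 0.0);
  return (value - base) / base * 100.0;
}
```

For LOOP 1, percent_change (181, 201) is about 11%, while
percent_change (181, 180) is about -0.6%.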

fwprop is creating too much register pressure by materializing offsets like
these in the loop header:

        leaq    -8(%r12), %rsi
        leaq    8(%r12), %r10
        leaq    -16(%r12), %r9
        leaq    -24(%r12), %rbx
        leaq    -32(%r12), %rbp
        leaq    -40(%r12), %rdi
        leaq    -48(%r12), %r11
        leaq    40(%r12), %rdx

Then, the additional register pressure is causing the bad scheduling we have in
the fast assembly outputs:

        movq    (%rdx), %rax
        movsd   (%rax,%r15,2), %xmm7
        movq    (%rdi), %r15
        movsd   (%rax,%r15,2), %xmm10
        movq    (%rbp), %r15
        movsd   (%rax,%r15,2), %xmm5
        movq    (%rbx), %r15
        movsd   (%rax,%r15,2), %xmm6
        movq    (%r9), %r15
        movsd   (%rax,%r15,2), %xmm15
        movq    (%rsi), %r15
        movsd   (%rax,%r15,2), %xmm11


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (59 preceding siblings ...)
  2009-05-06  9:57 ` bonzini at gnu dot org
@ 2009-05-06 10:00 ` bonzini at gnu dot org
  2009-05-06 10:48 ` bonzini at gnu dot org
                   ` (54 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2009-05-06 10:00 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #59 from bonzini at gnu dot org  2009-05-06 09:59 -------
Created an attachment (id=17809)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17809&action=view)
usable testcase

Without includes as Jakub suggested.


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #17808|0                           |1
        is obsolete|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug tree-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (60 preceding siblings ...)
  2009-05-06 10:00 ` bonzini at gnu dot org
@ 2009-05-06 10:48 ` bonzini at gnu dot org
  2009-05-06 13:06 ` [Bug rtl-optimization/33928] " jakub at gcc dot gnu dot org
                   ` (53 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2009-05-06 10:48 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #60 from bonzini at gnu dot org  2009-05-06 10:47 -------
Actually those are created by -fmove-loop-invariants.  With -O1
-fforward-propagate -fno-move-loop-invariants I get:

                   4.5 -O1 -ffw-prop -fno-move-loop-inv
LOOP 1                183
INNER LOOP 1.1        116
LOOP 2                25

You should be able to get performance close to 4.2 or better with options "-O1
-fforward-propagate -fno-move-loop-invariants -fschedule-insns2".  If you do,
this means two things:

1) That the bug is in the register pressure estimations of
-fmove-loop-invariants, and merely exposed by the fwprop patch.

2) That maybe you should start from -O2 and go backwards, eliminating
optimizations that do not help you or cause high compilation time, instead of
using -O1.
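
A source-level analogy for the register pressure issue, as a rough sketch
only: the compiler makes this choice on RTL, not on C source, and the function
names and the slot layout here are hypothetical. The element offsets mirror
the byte offsets in the leaq block from comment #58 (-8, 8, -16, ..., 40,
divided by 8).

```c
#include <assert.h>
#include <stddef.h>

/* Eight addresses computed once before the loop, like the hoisted leaq
   instructions: each pointer stays live across the whole loop and wants
   a register of its own.  */
long
sum_slots_hoisted (const long *fp, int iters)
{
  const long *p0, *p1, *p2, *p3, *p4, *p5, *p6, *p7;
  long s = 0;
  int i;

  assert (fp != NULL);
  p0 = fp - 1; p1 = fp + 1; p2 = fp - 2; p3 = fp - 3;
  p4 = fp - 4; p5 = fp - 5; p6 = fp - 6; p7 = fp + 5;

  for (i = 0; i < iters; i++)
    s += *p0 + *p1 + *p2 + *p3 + *p4 + *p5 + *p6 + *p7;
  return s;
}

/* Only FP is live across the loop; on x86 the constant offsets can be
   folded into the addressing mode of each load, e.g. -8(%r12).  */
long
sum_slots_folded (const long *fp, int iters)
{
  long s = 0;
  int i;

  assert (fp != NULL);
  for (i = 0; i < iters; i++)
    s += fp[-1] + fp[1] + fp[-2] + fp[-3] + fp[-4] + fp[-5] + fp[-6] + fp[5];
  return s;
}
```

Both functions compute the same sum; the difference is how many values must
be kept live across the loop, which is what the invariant-motion cost model
has to estimate.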


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |WAITING


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (61 preceding siblings ...)
  2009-05-06 10:48 ` bonzini at gnu dot org
@ 2009-05-06 13:06 ` jakub at gcc dot gnu dot org
  2009-05-06 15:08 ` bonzini at gnu dot org
                   ` (52 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: jakub at gcc dot gnu dot org @ 2009-05-06 13:06 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #61 from jakub at gcc dot gnu dot org  2009-05-06 13:05 -------
Also see PR39871, maybe that's related (though on ARM).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (62 preceding siblings ...)
  2009-05-06 13:06 ` [Bug rtl-optimization/33928] " jakub at gcc dot gnu dot org
@ 2009-05-06 15:08 ` bonzini at gnu dot org
  2009-05-06 19:58 ` lucier at math dot purdue dot edu
                   ` (51 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2009-05-06 15:08 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #62 from bonzini at gnu dot org  2009-05-06 15:07 -------
No, totally unrelated to PR39871


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (63 preceding siblings ...)
  2009-05-06 15:08 ` bonzini at gnu dot org
@ 2009-05-06 19:58 ` lucier at math dot purdue dot edu
  2009-05-06 20:44 ` lucier at math dot purdue dot edu
                   ` (50 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2009-05-06 19:58 UTC (permalink / raw)
  To: gcc-bugs




------- Comment #63 from lucier at math dot purdue dot edu  2009-05-06 19:57 -------
Was the patch in comment 55 meant for me to bootstrap and test with today's
mainline?  It crashes at the gcc_assert at

/* Subroutine of canon_reg.  Pass *XLOC through canon_reg, and validate
   the result if necessary.  INSN is as for canon_reg.  */

static void
validate_canon_reg (rtx *xloc, rtx insn)
{
  if (*xloc)
    {
      rtx new_rtx = canon_reg (*xloc, insn);

      /* If replacing pseudo with hard reg or vice versa, ensure the
         insn remains valid.  Likewise if the insn has MATCH_DUPs.  */
      gcc_assert (insn && new_rtx);
      validate_change (insn, xloc, new_rtx, 1);
    }
}

when building libgcc:

/tmp/lucier/gcc/objdirs/mainline/./gcc/xgcc
-B/tmp/lucier/gcc/objdirs/mainline/./gcc/
-B/pkgs/gcc-mainline/x86_64-unknown-linux-gnu/bin/
-B/pkgs/gcc-mainline/x86_64-unknown-linux-gnu/lib/ -isystem
/pkgs/gcc-mainline/x86_64-unknown-linux-gnu/include -isystem
/pkgs/gcc-mainline/x86_64-unknown-linux-gnu/sys-include -g -O2 -m32 -O2  -g -O2
-DIN_GCC   -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes
-Wcast-qual -Wold-style-definition  -isystem ./include  -fPIC -g
-DHAVE_GTHR_DEFAULT -DIN_LIBGCC2 -D__GCC_FLOAT_NOT_NEEDED   -I. -I.
-I../../.././gcc -I../../../../../mainline/libgcc
-I../../../../../mainline/libgcc/. -I../../../../../mainline/libgcc/../gcc
-I../../../../../mainline/libgcc/../include
-I../../../../../mainline/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT
-DHAVE_CC_TLS -DUSE_TLS -o _moddi3.o -MT _moddi3.o -MD -MP -MF _moddi3.dep
-DL_moddi3 -c ../../../../../mainline/libgcc/../gcc/libgcc2.c \
          -fexceptions -fnon-call-exceptions -fvisibility=hidden -DHIDE_EXPORTS
../../../../../mainline/libgcc/../gcc/libgcc2.c: In function '__moddi3':
../../../../../mainline/libgcc/../gcc/libgcc2.c:1121: internal compiler error:
in validate_canon_reg, at cse.c:2730


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (64 preceding siblings ...)
  2009-05-06 19:58 ` lucier at math dot purdue dot edu
@ 2009-05-06 20:44 ` lucier at math dot purdue dot edu
  2009-05-07  5:04 ` bonzini at gnu dot org
                   ` (49 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2009-05-06 20:44 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #64 from lucier at math dot purdue dot edu  2009-05-06 20:43 -------
In answer to comment 60, here's the command line where I added
-fforward-propagate -fno-move-loop-invariants:

/pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused
-O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing
-fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate
-fno-move-loop-invariants -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY
-D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\""
-D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c

here's the compiler:

/pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: /tmp/lucier/gcc/mainline/configure --enable-checking=release
--prefix=/pkgs/gcc-mainline --enable-languages=c
Thread model: posix
gcc version 4.5.0 20090506 (experimental) [trunk revision 147199] (GCC) 

and the runtime didn't change (substantially)

    132 ms cpu time (132 user, 0 system)

and the loop looks pretty much just as bad (it's 117 instructions long, by my
count):

.L2752:
        movq    %rcx, %rdx
        addq    8(%rax), %rdx
        leaq    4(%rcx), %rdi
        movq    %rdx, -8(%rax)
        leaq    4(%rdx), %rbx
        addq    8(%rax), %rdx
        movq    %rbx, -16(%rax)
        movq    %rdx, -24(%rax)
        leaq    4(%rdx), %rbx
        addq    8(%rax), %rdx
        movq    %rbx, -32(%rax)
        movq    %rdx, -40(%rax)
        leaq    4(%rdx), %rbx
        movq    40(%rax), %rdx
        movq    %rbx, -48(%rax)
        movsd   7(%rdx,%rbx,2), %xmm9
        movq    -40(%rax), %rbx
        leaq    7(%rdx,%rcx,2), %r8
        addq    $8, %rcx
        movsd   (%r8), %xmm4
        cmpq    %rcx, %r13
        movsd   7(%rdx,%rbx,2), %xmm11
        movq    -32(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm5
        movq    -24(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm7
        movq    -16(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm14
        movq    -8(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm6
        leaq    (%rdi,%rdi), %rbx
        movsd   7(%rbx,%rdx), %xmm8
        movq    24(%rax), %rdx
        movapd  %xmm6, %xmm13
        movsd   15(%rdx), %xmm1
        movsd   7(%rdx), %xmm2
        movapd  %xmm1, %xmm10
        movsd   31(%rdx), %xmm3
        movapd  %xmm2, %xmm12
        mulsd   %xmm11, %xmm10
        mulsd   %xmm9, %xmm12
        mulsd   %xmm2, %xmm11
        mulsd   %xmm1, %xmm9
        movsd   23(%rdx), %xmm0
        addsd   %xmm12, %xmm10
        movapd  %xmm2, %xmm12
        mulsd   %xmm7, %xmm2
        subsd   %xmm9, %xmm11
        movapd  %xmm1, %xmm9
        mulsd   %xmm5, %xmm12
        mulsd   %xmm5, %xmm1
        movapd  %xmm8, %xmm5
        mulsd   %xmm7, %xmm9
        movapd  %xmm4, %xmm7
        subsd   %xmm11, %xmm13
        addsd   %xmm6, %xmm11
        movsd   .LC5(%rip), %xmm6
        subsd   %xmm1, %xmm2
        movapd  %xmm0, %xmm1
        addsd   %xmm12, %xmm9
        movapd  %xmm14, %xmm12
        xorpd   %xmm3, %xmm6
        subsd   %xmm10, %xmm12
        mulsd   %xmm13, %xmm1
        subsd   %xmm2, %xmm7
        addsd   %xmm4, %xmm2
        movapd  %xmm6, %xmm4
        addsd   %xmm14, %xmm10
        mulsd   %xmm13, %xmm6
        mulsd   %xmm12, %xmm4
        subsd   %xmm9, %xmm5
        mulsd   %xmm0, %xmm12
        addsd   %xmm8, %xmm9
        movapd  %xmm0, %xmm8
        mulsd   %xmm11, %xmm0
        addsd   %xmm1, %xmm4
        movapd  %xmm3, %xmm1
        mulsd   %xmm10, %xmm3
        subsd   %xmm12, %xmm6
        mulsd   %xmm11, %xmm1
        mulsd   %xmm10, %xmm8
        subsd   %xmm3, %xmm0
        addsd   %xmm1, %xmm8
        movapd  %xmm2, %xmm1
        addsd   %xmm0, %xmm1
        subsd   %xmm0, %xmm2
        movapd  %xmm7, %xmm0
        subsd   %xmm6, %xmm7
        addsd   %xmm6, %xmm0
        movsd   %xmm1, (%r8)
        movapd  %xmm9, %xmm1
        movq    40(%rax), %rdx
        subsd   %xmm8, %xmm9
        addsd   %xmm8, %xmm1
        movsd   %xmm1, 7(%rbx,%rdx)
        movq    -8(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm2, 7(%rdx,%rbx,2)
        movq    -16(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm9, 7(%rdx,%rbx,2)
        movq    -24(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm0, 7(%rdx,%rbx,2)
        movapd  %xmm5, %xmm0
        movq    -32(%rax), %rbx
        movq    40(%rax), %rdx
        subsd   %xmm4, %xmm5
        addsd   %xmm4, %xmm0
        movsd   %xmm0, 7(%rdx,%rbx,2)
        movq    -40(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm7, 7(%rdx,%rbx,2)
        movq    -48(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm5, 7(%rdx,%rbx,2)
        jg      .L2752
        movq    %rdi, %r13
.L2751:
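
For orientation, the first mulsd/addsd/subsd group in the loop above has the
shape of a complex multiplication, and the movapd/subsd/addsd triples that
follow have the shape of sum-and-difference butterflies. A minimal C sketch of
that arithmetic, assuming a plain radix-2 butterfly (the type and function
names are hypothetical and are not taken from the Gambit source; the real
routine is a radix-4 FFT working on a different data layout):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type for illustration.  */
typedef struct { double re, im; } cplx;

/* Complex product x * w: four multiplies, one add, one subtract,
   which is the mulsd/mulsd/addsd/subsd pattern in the listing.  */
static cplx
cplx_mul (cplx x, cplx w)
{
  cplx r;
  r.re = x.re * w.re - x.im * w.im;
  r.im = x.re * w.im + x.im * w.re;
  return r;
}

/* In-place butterfly: (a, b) becomes (a + w*b, a - w*b).  */
void
butterfly (cplx *a, cplx *b, cplx w)
{
  cplx t;

  assert (a != NULL && b != NULL);
  t = cplx_mul (*b, w);
  b->re = a->re - t.re;
  b->im = a->im - t.im;
  a->re = a->re + t.re;
  a->im = a->im + t.im;
}
```

Every intermediate value here is live until its sum and difference are both
stored, which is why this kind of loop is sensitive to register allocation
and scheduling.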


-- 

lucier at math dot purdue dot edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (65 preceding siblings ...)
  2009-05-06 20:44 ` lucier at math dot purdue dot edu
@ 2009-05-07  5:04 ` bonzini at gnu dot org
  2009-05-07  5:27 ` lucier at math dot purdue dot edu
                   ` (48 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2009-05-07  5:04 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #65 from bonzini at gnu dot org  2009-05-07 05:03 -------
Subject: Re:  [4.3/4.4/4.5 Regression] 30% performance
 slowdown in floating-point code caused by  r118475

lucier at math dot purdue dot edu wrote:
> ------- Comment #64 from lucier at math dot purdue dot edu  2009-05-06 20:43 -------
> In answer to comment 60, here's the command line where I added
> -fforward-propagate -fno-move-loop-invariants:

Hmm, can you try adding -frename-registers *or* -fweb too?  (One or the
other: using both together gives no additional benefit.)

> and the loop looks pretty much just as bad (it's 117 instructions long, by my
> count):

116 actually: the movq here is outside the loop (that's how I made all
the instruction counts).

>         movsd   %xmm5, 7(%rdx,%rbx,2)
>         jg      .L2752
>         movq    %rdi, %r13
> .L2751:
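
The 116 versus 117 question comes down to whether the movq after the jg is
counted. A small sketch of a counter that follows the convention described
above (count from the line after the label up to and including the branch
back to it) is given here. The function names are hypothetical, and the
parsing is deliberately naive: it assumes one instruction per line, labels on
their own lines, and directives starting with ".".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return nonzero if NAME occurs in LINE[0..LEN) and is not followed by
   an identifier character (so ".L2752" does not match ".L27520").  */
static int
mentions (const char *line, size_t len, const char *name)
{
  size_t n = strlen (name);
  size_t i;

  if (n == 0)
    return 0;
  for (i = 0; i + n <= len; i++)
    if (memcmp (line + i, name, n) == 0
        && (i + n == len
            || !(isalnum ((unsigned char) line[i + n])
                 || line[i + n] == '_')))
      return 1;
  return 0;
}

/* Count instruction lines from the line after "LABEL:" up to and
   including the first instruction that refers to LABEL again.
   Return -1 if the label or the branch back to it is not found.  */
int
count_loop_insns (const char *asm_text, const char *label)
{
  const char *p = asm_text;
  size_t label_len;
  int in_loop = 0;
  int count = 0;

  assert (asm_text != NULL && label != NULL);
  label_len = strlen (label);

  while (*p != '\0')
    {
      const char *eol = strchr (p, '\n');
      const char *end = eol ? eol : p + strlen (p);
      const char *s = p;
      const char *e = end;

      /* Trim leading and trailing whitespace.  */
      while (s < e && isspace ((unsigned char) *s))
        s++;
      while (e > s && isspace ((unsigned char) e[-1]))
        e--;

      if (!in_loop)
        {
          if ((size_t) (e - s) == label_len + 1
              && memcmp (s, label, label_len) == 0
              && s[label_len] == ':')
            in_loop = 1;
        }
      else if (s < e && *s != '.' && *s != '#' && e[-1] != ':')
        {
          /* Not a blank line, directive, comment, or label.  */
          count++;
          if (mentions (s, (size_t) (e - s), label))
            return count;
        }

      p = eol ? eol + 1 : end;
    }
  return -1;
}
```

With this convention the trailing "movq %rdi, %r13" is not part of the loop,
because counting stops at the jg.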


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (66 preceding siblings ...)
  2009-05-07  5:04 ` bonzini at gnu dot org
@ 2009-05-07  5:27 ` lucier at math dot purdue dot edu
  2009-05-07 13:41 ` bonzini at gnu dot org
                   ` (47 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2009-05-07  5:27 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #66 from lucier at math dot purdue dot edu  2009-05-07 05:27 -------
Adding -frename-registers gives a significant speedup (sometimes as fast as
4.1.2 on this shared machine, i.e., it sometimes hits 108 ms instead of
132-140 ms).  The command line with -fforward-propagate
-fno-move-loop-invariants -frename-registers is

/pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused
-O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing
-fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate
-fno-move-loop-invariants -frename-registers -DHAVE_CONFIG_H -D___PRIMAL
-D___LIBRARY -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\""
-D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\""
-D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c

and the loop is

.L2752:
        movq    %rcx, %r12
        addq    8(%rax), %r12
        leaq    4(%rcx), %rdi
        movq    %r12, -8(%rax)
        leaq    4(%r12), %r8
        addq    8(%rax), %r12
        movq    %r8, -16(%rax)
        movq    -8(%rax), %r8
        movq    -16(%rax), %rdx
        movq    %r12, -24(%rax)
        leaq    4(%r12), %rbx
        addq    8(%rax), %r12
        movq    -24(%rax), %r9
        movq    %rbx, -32(%rax)
        movq    24(%rax), %rbx
        movq    -32(%rax), %r10
        leaq    4(%r12), %r11
        movq    %r12, -40(%rax)
        movq    40(%rax), %r12
        movq    -40(%rax), %r14
        movq    %r11, -48(%rax)
        movsd   15(%rbx), %xmm1
        movsd   7(%rbx), %xmm2
        movsd   7(%r12,%r11,2), %xmm9
        movapd  %xmm1, %xmm3
        movsd   7(%r12,%r14,2), %xmm11
        leaq    7(%r12,%rcx,2), %r11
        movapd  %xmm2, %xmm10
        leaq    (%rdi,%rdi), %r14
        mulsd   %xmm11, %xmm3
        movapd  %xmm2, %xmm12
        mulsd   %xmm9, %xmm10
        addq    $8, %rcx
        mulsd   %xmm1, %xmm9
        cmpq    %rcx, %r13
        mulsd   %xmm2, %xmm11
        movsd   7(%r12,%r10,2), %xmm5
        movsd   7(%r12,%r9,2), %xmm7
        addsd   %xmm10, %xmm3
        movsd   7(%r12,%r8,2), %xmm6
        subsd   %xmm9, %xmm11
        mulsd   %xmm7, %xmm2
        movapd  %xmm1, %xmm9
        mulsd   %xmm5, %xmm1
        movapd  %xmm6, %xmm13
        movsd   7(%r12,%rdx,2), %xmm14
        mulsd   %xmm5, %xmm12
        mulsd   %xmm7, %xmm9
        subsd   %xmm11, %xmm13
        movsd   31(%rbx), %xmm0
        addsd   %xmm6, %xmm11
        movsd   .LC5(%rip), %xmm6
        subsd   %xmm1, %xmm2
        movsd   (%r11), %xmm4
        movapd  %xmm14, %xmm10
        xorpd   %xmm0, %xmm6
        addsd   %xmm12, %xmm9
        movsd   7(%r14,%r12), %xmm8
        subsd   %xmm3, %xmm10
        movapd  %xmm4, %xmm7
        addsd   %xmm14, %xmm3
        movsd   23(%rbx), %xmm15
        subsd   %xmm2, %xmm7
        movapd  %xmm8, %xmm5
        addsd   %xmm4, %xmm2
        movapd  %xmm6, %xmm4
        subsd   %xmm9, %xmm5
        movapd  %xmm15, %xmm14
        addsd   %xmm8, %xmm9
        mulsd   %xmm10, %xmm4
        movapd  %xmm15, %xmm8
        mulsd   %xmm15, %xmm10
        movapd  %xmm0, %xmm12
        mulsd   %xmm11, %xmm15
        mulsd   %xmm3, %xmm0
        movapd  %xmm7, %xmm1
        mulsd   %xmm13, %xmm6
        mulsd   %xmm3, %xmm8
        movapd  %xmm9, %xmm3
        mulsd   %xmm11, %xmm12
        subsd   %xmm0, %xmm15
        mulsd   %xmm13, %xmm14
        subsd   %xmm10, %xmm6
        movapd  %xmm2, %xmm10
        movapd  %xmm5, %xmm0
        addsd   %xmm12, %xmm8
        addsd   %xmm15, %xmm10
        subsd   %xmm15, %xmm2
        addsd   %xmm14, %xmm4
        addsd   %xmm8, %xmm3
        movsd   %xmm10, (%r11)
        movq    40(%rax), %r10
        subsd   %xmm8, %xmm9
        addsd   %xmm6, %xmm1
        addsd   %xmm4, %xmm0
        movsd   %xmm3, 7(%r14,%r10)
        movq    -8(%rax), %r9
        movq    40(%rax), %rdx
        subsd   %xmm6, %xmm7
        subsd   %xmm4, %xmm5
        movsd   %xmm2, 7(%rdx,%r9,2)
        movq    -16(%rax), %r8
        movq    40(%rax), %r12
        movsd   %xmm9, 7(%r12,%r8,2)
        movq    -24(%rax), %rbx
        movq    40(%rax), %r11
        movsd   %xmm1, 7(%r11,%rbx,2)
        movq    -32(%rax), %r14
        movq    40(%rax), %r10
        movsd   %xmm0, 7(%r10,%r14,2)
        movq    -40(%rax), %r9
        movq    40(%rax), %rdx
        movsd   %xmm7, 7(%rdx,%r9,2)
        movq    -48(%rax), %r8
        movq    40(%rax), %r12
        movsd   %xmm5, 7(%r12,%r8,2)
        jg      .L2752

Using -fweb in place of -frename-registers (that is, -fforward-propagate
-fno-move-loop-invariants -fweb), the compile line is

/pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W -Wno-unused
-O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing
-fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -fforward-propagate
-fno-move-loop-invariants -fweb -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY
-D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\""
-D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c

With -fweb the time is not so good (consistently 128 ms) and the loop is

.L2752:
        movq    %rcx, %rdx
        addq    8(%rax), %rdx
        leaq    4(%rcx), %rdi
        movq    %rdx, -8(%rax)
        leaq    4(%rdx), %rbx
        addq    8(%rax), %rdx
        movq    %rbx, -16(%rax)
        movq    %rdx, -24(%rax)
        leaq    4(%rdx), %rbx
        addq    8(%rax), %rdx
        movq    %rbx, -32(%rax)
        movq    %rdx, -40(%rax)
        leaq    4(%rdx), %rbx
        movq    40(%rax), %rdx
        movq    %rbx, -48(%rax)
        movsd   7(%rdx,%rbx,2), %xmm9
        movq    -40(%rax), %rbx
        leaq    7(%rdx,%rcx,2), %r8
        addq    $8, %rcx
        movsd   (%r8), %xmm4
        cmpq    %rcx, %r13
        movsd   7(%rdx,%rbx,2), %xmm11
        movq    -32(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm5
        movq    -24(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm7
        movq    -16(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm14
        movq    -8(%rax), %rbx
        movsd   7(%rdx,%rbx,2), %xmm6
        leaq    (%rdi,%rdi), %rbx
        movsd   7(%rbx,%rdx), %xmm8
        movq    24(%rax), %rdx
        movapd  %xmm6, %xmm13
        movsd   15(%rdx), %xmm1
        movsd   7(%rdx), %xmm2
        movapd  %xmm1, %xmm10
        movsd   31(%rdx), %xmm3
        movapd  %xmm2, %xmm12
        mulsd   %xmm11, %xmm10
        mulsd   %xmm9, %xmm12
        mulsd   %xmm2, %xmm11
        mulsd   %xmm1, %xmm9
        movsd   23(%rdx), %xmm0
        addsd   %xmm12, %xmm10
        movapd  %xmm2, %xmm12
        mulsd   %xmm7, %xmm2
        subsd   %xmm9, %xmm11
        movapd  %xmm1, %xmm9
        mulsd   %xmm5, %xmm12
        mulsd   %xmm5, %xmm1
        movapd  %xmm8, %xmm5
        mulsd   %xmm7, %xmm9
        movapd  %xmm4, %xmm7
        subsd   %xmm11, %xmm13
        addsd   %xmm6, %xmm11
        movsd   .LC5(%rip), %xmm6
        subsd   %xmm1, %xmm2
        movapd  %xmm0, %xmm1
        addsd   %xmm12, %xmm9
        movapd  %xmm14, %xmm12
        xorpd   %xmm3, %xmm6
        subsd   %xmm10, %xmm12
        mulsd   %xmm13, %xmm1
        subsd   %xmm2, %xmm7
        addsd   %xmm4, %xmm2
        movapd  %xmm6, %xmm4
        addsd   %xmm14, %xmm10
        mulsd   %xmm13, %xmm6
        mulsd   %xmm12, %xmm4
        subsd   %xmm9, %xmm5
        mulsd   %xmm0, %xmm12
        addsd   %xmm8, %xmm9
        movapd  %xmm0, %xmm8
        mulsd   %xmm11, %xmm0
        addsd   %xmm1, %xmm4
        movapd  %xmm3, %xmm1
        mulsd   %xmm10, %xmm3
        subsd   %xmm12, %xmm6
        mulsd   %xmm11, %xmm1
        mulsd   %xmm10, %xmm8
        subsd   %xmm3, %xmm0
        addsd   %xmm1, %xmm8
        movapd  %xmm2, %xmm1
        addsd   %xmm0, %xmm1
        subsd   %xmm0, %xmm2
        movapd  %xmm7, %xmm0
        subsd   %xmm6, %xmm7
        addsd   %xmm6, %xmm0
        movsd   %xmm1, (%r8)
        movapd  %xmm9, %xmm1
        movq    40(%rax), %rdx
        subsd   %xmm8, %xmm9
        addsd   %xmm8, %xmm1
        movsd   %xmm1, 7(%rbx,%rdx)
        movq    -8(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm2, 7(%rdx,%rbx,2)
        movq    -16(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm9, 7(%rdx,%rbx,2)
        movq    -24(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm0, 7(%rdx,%rbx,2)
        movapd  %xmm5, %xmm0
        movq    -32(%rax), %rbx
        movq    40(%rax), %rdx
        subsd   %xmm4, %xmm5
        addsd   %xmm4, %xmm0
        movsd   %xmm0, 7(%rdx,%rbx,2)
        movq    -40(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm7, 7(%rdx,%rbx,2)
        movq    -48(%rax), %rbx
        movq    40(%rax), %rdx
        movsd   %xmm5, 7(%rdx,%rbx,2)
        jg      .L2752

And I still count 117 instructions in the loop in comment 64 (whether that
matters, I don't know).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (67 preceding siblings ...)
  2009-05-07  5:27 ` lucier at math dot purdue dot edu
@ 2009-05-07 13:41 ` bonzini at gnu dot org
  2009-05-07 15:41 ` steven at gcc dot gnu dot org
                   ` (46 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2009-05-07 13:41 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #67 from bonzini at gnu dot org  2009-05-07 13:40 -------
I'm thinking of enabling -frename-registers on x86; since x86 does not enable
the first scheduling pass, the live ranges are shorter and the register
allocator may reuse the same register over and over, leaving no freedom for
schedule-insns2.

This would leave only the bug with RTL loop invariant motion.

Brad, you are the one who's regularly producing "insane" testcases, can you
measure the compile-time slowdown from -O1 to -O1 -frename-registers?  It is a
local pass, so it should not be that much, but I'd rather check beforehand
(I'll check on a bootstrap instead).



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (68 preceding siblings ...)
  2009-05-07 13:41 ` bonzini at gnu dot org
@ 2009-05-07 15:41 ` steven at gcc dot gnu dot org
  2009-05-07 15:58 ` lucier at math dot purdue dot edu
                   ` (45 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: steven at gcc dot gnu dot org @ 2009-05-07 15:41 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #68 from steven at gcc dot gnu dot org  2009-05-07 15:40 -------
Be careful with -frename-registers, it is quadratic in the size of a basic
block. For Bradley's test cases it will certainly give a slow-down.

I have tried a rewrite of -frename-registers, but I keep running into trouble
with the INDEX_REGS and BASE_REGS non-classes. Paolo, we could look at this
stuff together if you want my help.



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (69 preceding siblings ...)
  2009-05-07 15:41 ` steven at gcc dot gnu dot org
@ 2009-05-07 15:58 ` lucier at math dot purdue dot edu
  2009-05-07 16:01 ` lucier at math dot purdue dot edu
                   ` (44 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2009-05-07 15:58 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #69 from lucier at math dot purdue dot edu  2009-05-07 15:57 -------
    Well, adding -frename-registers by itself to -O1, without
-fforward-propagate and -fno-move-loop-invariants, doesn't help (loop is given
below, along with complete compile options); the time is

        140 ms cpu time (140 user, 0 system)

    and adding -frename-registers and -fno-move-loop-invariants without
-fforward-propagate doesn't help (loop is again given below), it gets

        140 ms cpu time (140 user, 0 system)

    Adding all three gives a very consistent time this morning of

        120 ms cpu time (120 user, 0 system)

    which is the same as the 4.2.4 time without any of these options (this
morning).

    But -fforward-propagate is not a viable option in general for this type of
code; here are some times for the testcase from PR 31957 with various options
on a 2.something GHz Xeon server:

    pythagoras-45% time /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I.
-Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math
-fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp
-frename-registers -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -c compiler.i
-ftime-report -fmem-report >& rename-report
    252.987u 9.592s 4:23.20 99.7%   0+0k 0+0io 0pf+0w
    pythagoras-46% time /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I.
-Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math
-fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp
-DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -c compiler.i -ftime-report
-fmem-report >& no-rename-report
    249.875u 10.544s 4:21.73 99.4%  0+0k 0+0io 0pf+0w
    pythagoras-47% time /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I.
-Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math
-fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp
-frename-registers -fno-move-loop-invariants -DHAVE_CONFIG_H -D___PRIMAL
-D___LIBRARY -c compiler.i -ftime-report -fmem-report >&
rename-no-move-loop-invariants-report
    246.663u 10.484s 4:18.30 99.5%  0+0k 0+0io 0pf+0w
    pythagoras-48% time /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I.
-Wall -W -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math
-fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp
-frename-registers -fno-move-loop-invariants -fforward-propagate
-DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY -c compiler.i -ftime-report
-fmem-report >& rename-no-move-loop-invariants-forward-propagate-report
    357.830u 28.417s 6:27.81 99.5%  0+0k 0+0io 11pf+0w

    With -fforward-propagate the memory required went up to at least 21GB.

    I'll attach the time reports for the various options, but the compiler
wasn't configured to provide detailed memory reports.
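
To put the figures above in proportion, here is a minimal sketch in C. It assumes nothing beyond the numbers quoted above, and the helper name percent_increase is hypothetical. It gives roughly 17% for 140 ms versus 120 ms of run time, and roughly 43% for 357.830 s versus 249.875 s of user compile time.

```c
/* Relative increase of `after` over `before`, in percent.
   Hypothetical helper for illustration only. */
double percent_increase(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

For example, percent_increase(120.0, 140.0) compares the run time with all three options against the run time with -frename-registers alone.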

    Brad


    Loop with -frename-registers

    /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W
-Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math
-fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp
-frename-registers  -DHAVE_CONFIG_H -D___PRIMAL -D___LIBRARY
-D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\"" -D___SYS_TYPE_CPU="\"x86_64\""
-D___SYS_TYPE_VENDOR="\"unknown\"" -D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c



            movq    %rdx, %r12
            addq    (%r11), %r12
            leaq    4(%rdx), %r14
            movq    %r12, (%rsi)
            addq    $4, %r12
            movq    %r12, (%r10)
            movq    (%r11), %rcx
            addq    (%rsi), %rcx
            movq    %rcx, (%rbx)
            addq    $4, %rcx
            movq    %rcx, (%r9)
            movq    (%r11), %r13
            addq    (%rbx), %r13
            movq    %r13, (%r8)
            addq    $4, %r13
            movq    %r13, (%r15)
            movq    (%rax), %rcx
            movq    (%r8), %r12
            addq    $7, %rcx
            movsd   (%rcx,%r12,2), %xmm10
            movq    (%rbx), %r12
            movsd   (%rcx,%r13,2), %xmm13
            movq    (%r9), %r13
            movsd   (%rcx,%r12,2), %xmm6
            movq    (%rsi), %r12
            movsd   (%rcx,%r13,2), %xmm5
            movq    (%r10), %r13
            movsd   (%rcx,%r12,2), %xmm9
            leaq    (%r14,%r14), %r12
            movsd   (%rcx,%r13,2), %xmm11
            leaq    (%rcx,%rdx,2), %r13
            movsd   (%rcx,%r12), %xmm3
            movq    24(%rdi), %rcx
            movsd   (%r13), %xmm4
            addq    $8, %rdx
            movsd   15(%rcx), %xmm14
            movsd   7(%rcx), %xmm15
            movapd  %xmm14, %xmm8
            movapd  %xmm14, %xmm7
            movapd  %xmm15, %xmm12
            mulsd   %xmm10, %xmm8
            mulsd   %xmm13, %xmm12
            mulsd   %xmm15, %xmm10
            mulsd   %xmm14, %xmm13
            movsd   31(%rcx), %xmm2
            addsd   %xmm8, %xmm12
            movapd  %xmm15, %xmm8
            mulsd   %xmm6, %xmm7
            mulsd   %xmm5, %xmm14
            subsd   %xmm13, %xmm10
            mulsd   %xmm5, %xmm8
            movapd  %xmm2, %xmm13
            mulsd   %xmm6, %xmm15
            movapd  %xmm4, %xmm6
            xorpd   .LC5(%rip), %xmm13
            movapd  %xmm3, %xmm5
            addsd   %xmm7, %xmm8
            movapd  %xmm11, %xmm7
            subsd   %xmm14, %xmm15
            movapd  %xmm9, %xmm14
            movsd   23(%rcx), %xmm0
            subsd   %xmm12, %xmm7
            subsd   %xmm10, %xmm14
            movapd  %xmm13, %xmm1
            addsd   %xmm11, %xmm12
            movapd  %xmm2, %xmm11
            subsd   %xmm15, %xmm6
            addsd   %xmm4, %xmm15
            movapd  %xmm0, %xmm4
            mulsd   %xmm7, %xmm1
            addsd   %xmm9, %xmm10
            mulsd   %xmm14, %xmm4
            subsd   %xmm8, %xmm5
            mulsd   %xmm0, %xmm7
            addsd   %xmm3, %xmm8
            mulsd   %xmm13, %xmm14
            movapd  %xmm15, %xmm9
            mulsd   %xmm10, %xmm11
            mulsd   %xmm0, %xmm10
            addsd   %xmm1, %xmm4
            movapd  %xmm8, %xmm3
            movapd  %xmm5, %xmm1
            subsd   %xmm7, %xmm14
            movapd  %xmm0, %xmm7
            mulsd   %xmm12, %xmm7
            addsd   %xmm4, %xmm1
            mulsd   %xmm2, %xmm12
            movapd  %xmm6, %xmm2
            subsd   %xmm14, %xmm6
            addsd   %xmm14, %xmm2
            addsd   %xmm11, %xmm7
            subsd   %xmm12, %xmm10
            subsd   %xmm4, %xmm5
            addsd   %xmm7, %xmm3
            addsd   %xmm10, %xmm9
            subsd   %xmm10, %xmm15
            subsd   %xmm7, %xmm8
            movsd   %xmm9, (%r13)
            movq    (%rax), %rcx
            movsd   %xmm3, 7(%r12,%rcx)
            movq    (%rsi), %r13
            movq    (%rax), %rcx
            movsd   %xmm15, 7(%rcx,%r13,2)
            movq    (%r10), %r12
            movq    (%rax), %r13
            movsd   %xmm8, 7(%r13,%r12,2)
            movq    (%rbx), %rcx
            movq    (%rax), %r13
            movsd   %xmm2, 7(%r13,%rcx,2)
            movq    (%r9), %r12
            movq    (%rax), %rcx
            movsd   %xmm1, 7(%rcx,%r12,2)
            movq    (%r8), %r13
            movq    (%rax), %rcx
            movsd   %xmm6, 7(%rcx,%r13,2)
            movq    (%r15), %r12
            movq    (%rax), %r13
            movsd   %xmm5, 7(%r13,%r12,2)
            cmpq    %rdx, -104(%rsp)
            jg      .L2941

    Loop with -frename-registers -fno-move-loop-invariants

    /pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I. -Wall -W
-Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math
-fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp
-frename-registers -fno-move-loop-invariants -DHAVE_CONFIG_H -D___PRIMAL
-D___LIBRARY -D___GAMBCDIR="\"/usr/local/Gambit-C/v4.1.2\""
-D___SYS_TYPE_CPU="\"x86_64\"" -D___SYS_TYPE_VENDOR="\"unknown\""
-D___SYS_TYPE_OS="\"linux-gnu\"" -c _num.c

    .L2755:
            leaq    8(%rax), %rdx
            movq    %rcx, %r13
            leaq    -16(%rax), %r9
            leaq    -8(%rax), %r10
            leaq    -24(%rax), %r8
            leaq    -32(%rax), %rdi
            addq    (%rdx), %r13
            leaq    4(%rcx), %r14
            leaq    4(%r13), %rsi
            movq    %r13, (%r10)
            movq    %rsi, (%r9)
            addq    (%rdx), %r13
            leaq    -40(%rax), %rsi
            leaq    4(%r13), %r11
            movq    %r13, (%r8)
            movq    %r11, (%rdi)
            addq    (%rdx), %r13
            leaq    -48(%rax), %r11
            leaq    40(%rax), %rdx
            movq    %r13, (%rsi)
            addq    $4, %r13
            movq    %r13, (%r11)
            movq    (%rdx), %rbx
            movq    (%rsi), %r12
            addq    $7, %rbx
            movsd   (%rbx,%r12,2), %xmm11
            movq    (%r8), %r12
            movsd   (%rbx,%r13,2), %xmm9
            movq    (%rdi), %r13
            movsd   (%rbx,%r12,2), %xmm7
            movq    (%r10), %r12
            movsd   (%rbx,%r13,2), %xmm5
            movq    (%r9), %r13
            movsd   (%rbx,%r12,2), %xmm6
            leaq    (%r14,%r14), %r12
            movsd   (%rbx,%r13,2), %xmm14
            leaq    (%rbx,%rcx,2), %r13
            movsd   (%rbx,%r12), %xmm8
            movq    24(%rax), %rbx
            movapd  %xmm6, %xmm13
            addq    $8, %rcx
            movsd   (%r13), %xmm4
            cmpq    %rcx, %r15
            movsd   15(%rbx), %xmm1
            movsd   7(%rbx), %xmm2
            movapd  %xmm1, %xmm3
            movsd   31(%rbx), %xmm0
            movapd  %xmm2, %xmm10
            mulsd   %xmm11, %xmm3
            movapd  %xmm2, %xmm12
            mulsd   %xmm9, %xmm10
            mulsd   %xmm2, %xmm11
            mulsd   %xmm1, %xmm9
            mulsd   %xmm7, %xmm2
            addsd   %xmm10, %xmm3
            mulsd   %xmm5, %xmm12
            movapd  %xmm14, %xmm10
            movsd   23(%rbx), %xmm15
            subsd   %xmm9, %xmm11
            movapd  %xmm1, %xmm9
            mulsd   %xmm5, %xmm1
            movapd  %xmm8, %xmm5
            mulsd   %xmm7, %xmm9
            subsd   %xmm3, %xmm10
            movapd  %xmm4, %xmm7
            subsd   %xmm11, %xmm13
            addsd   %xmm6, %xmm11
            movsd   .LC5(%rip), %xmm6
            subsd   %xmm1, %xmm2
            xorpd   %xmm0, %xmm6
            addsd   %xmm14, %xmm3
            addsd   %xmm12, %xmm9
            movapd  %xmm15, %xmm14
            movapd  %xmm0, %xmm12
            subsd   %xmm2, %xmm7
            mulsd   %xmm13, %xmm14
            addsd   %xmm4, %xmm2
            movapd  %xmm6, %xmm4
            subsd   %xmm9, %xmm5
            mulsd   %xmm3, %xmm0
            addsd   %xmm8, %xmm9
            mulsd   %xmm10, %xmm4
            movapd  %xmm15, %xmm8
            mulsd   %xmm15, %xmm10
            mulsd   %xmm11, %xmm15
            movapd  %xmm7, %xmm1
            mulsd   %xmm13, %xmm6
            mulsd   %xmm3, %xmm8
            movapd  %xmm9, %xmm3
            mulsd   %xmm11, %xmm12
            addsd   %xmm14, %xmm4
            subsd   %xmm0, %xmm15
            movapd  %xmm5, %xmm0
            subsd   %xmm10, %xmm6
            movapd  %xmm2, %xmm10
            addsd   %xmm12, %xmm8
            addsd   %xmm15, %xmm10
            subsd   %xmm15, %xmm2
            addsd   %xmm6, %xmm1
            addsd   %xmm8, %xmm3
            movsd   %xmm10, (%r13)
            movq    (%rdx), %rbx
            subsd   %xmm8, %xmm9
            addsd   %xmm4, %xmm0
            subsd   %xmm6, %xmm7
            movsd   %xmm3, 7(%r12,%rbx)
            movq    (%r10), %r10
            movq    (%rdx), %r13
            subsd   %xmm4, %xmm5
            movsd   %xmm2, 7(%r13,%r10,2)
            movq    (%r9), %rbx
            movq    (%rdx), %r12
            movsd   %xmm9, 7(%r12,%rbx,2)
            movq    (%r8), %r13
            movq    (%rdx), %r10
            movsd   %xmm1, 7(%r10,%r13,2)
            movq    (%rdi), %r9
            movq    (%rdx), %rbx
            movsd   %xmm0, 7(%rbx,%r9,2)
            movq    (%rsi), %rsi
            movq    (%rdx), %r8
            movsd   %xmm7, 7(%r8,%rsi,2)
            movq    (%r11), %rdi
            movq    (%rdx), %r12
            movsd   %xmm5, 7(%r12,%rdi,2)
            jg      .L2755



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (70 preceding siblings ...)
  2009-05-07 15:58 ` lucier at math dot purdue dot edu
@ 2009-05-07 16:01 ` lucier at math dot purdue dot edu
  2009-05-07 16:03 ` lucier at math dot purdue dot edu
                   ` (43 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2009-05-07 16:01 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #70 from lucier at math dot purdue dot edu  2009-05-07 16:00 -------
Created an attachment (id=17819)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17819&action=view)
time report related to comment 69, time for PR 31957 with no options



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (72 preceding siblings ...)
  2009-05-07 16:03 ` lucier at math dot purdue dot edu
@ 2009-05-07 16:03 ` lucier at math dot purdue dot edu
  2009-05-07 16:04 ` lucier at math dot purdue dot edu
                   ` (41 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2009-05-07 16:03 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #72 from lucier at math dot purdue dot edu  2009-05-07 16:03 -------
Created an attachment (id=17821)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17821&action=view)
time for 31957, with rename-registers no-move-loop-invariants



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (71 preceding siblings ...)
  2009-05-07 16:01 ` lucier at math dot purdue dot edu
@ 2009-05-07 16:03 ` lucier at math dot purdue dot edu
  2009-05-07 16:03 ` lucier at math dot purdue dot edu
                   ` (42 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2009-05-07 16:03 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #71 from lucier at math dot purdue dot edu  2009-05-07 16:02 -------
Created an attachment (id=17820)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17820&action=view)
time for 31957, with rename-registers



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (73 preceding siblings ...)
  2009-05-07 16:03 ` lucier at math dot purdue dot edu
@ 2009-05-07 16:04 ` lucier at math dot purdue dot edu
  2009-05-07 16:21 ` bonzini at gnu dot org
                   ` (40 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2009-05-07 16:04 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #73 from lucier at math dot purdue dot edu  2009-05-07 16:04 -------
Created an attachment (id=17822)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17822&action=view)
time for 31957, with rename-registers no-move-loop-invariants forward-propagate



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (74 preceding siblings ...)
  2009-05-07 16:04 ` lucier at math dot purdue dot edu
@ 2009-05-07 16:21 ` bonzini at gnu dot org
  2009-05-07 16:32 ` lucier at math dot purdue dot edu
                   ` (39 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2009-05-07 16:21 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #74 from bonzini at gnu dot org  2009-05-07 16:21 -------
Ok.  One step at a time. :-)  To recap, here is the situation:

- the CSE optimization you mention was *not* removed, it was moved to fwprop,
so it does not run at -O1.

- once this was done, the way to go is to tune new optimizations, not to
reintroduce old ones

- for example, fwprop in turn triggered a bad choice in loop invariant motion,
for which a patch has been posted.  This patch will remove the need for
-fno-move-loop-invariants on this testcase (this is a deficiency in LIM that is
not specific to machine-generated code; OTOH the presence of many fp[N]
accesses helps trigger it).

- that scheduling is necessary now and not in 4.2.x is probably just a matter
of luck

- why renaming registers is necessary now and not in 4.2.x is still a mystery;
but, there is an explanation as to why it helps (it prolongs live ranges,
something that on non-x86 archs is done by the pre-regalloc scheduling)

- at least we have a set of options providing good performance on this
testcase, and guidance towards better tuning of the various problematic
optimizations

To conclude, nobody is underestimating the significance of this PR; it's just a
matter of priorities.  Near the end of the release cycle, you tend to look at
PRs with small testcases to minimize the time spent understanding the code;
near the beginning, you hope that new features magically fix the PRs and
concentrate on wrong-code bugs and so on.  Complex P2s such as this one
unfortunately tend to stay in a limbo.
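
The loop-invariant-motion point in the list above can be pictured at the source level. What follows is only a sketch in C, not GCC's RTL and not code from the testcase: the function names and the particular offsets are made up. Both functions compute the same sum. The second keeps four extra pointers live across the loop, which is the kind of register pressure that hoisting cheap addresses can create on a target where the address could instead be folded into the memory operand.

```c
/* Addresses formed at each use: only the base pointer fp has to stay
   live across the loop. */
long sum_inline(const long *fp, int iters)
{
    long total = 0;
    for (int i = 0; i < iters; i++)
        total += fp[-1] + fp[-2] + fp[-3] + fp[5];
    return total;
}

/* Addresses hoisted out of the loop: four more pointers stay live for
   the whole loop, competing with the data values for registers. */
long sum_hoisted(const long *fp, int iters)
{
    const long *a = fp - 1;
    const long *b = fp - 2;
    const long *c = fp - 3;
    const long *d = fp + 5;
    long total = 0;
    for (int i = 0; i < iters; i++)
        total += *a + *b + *c + *d;
    return total;
}
```

An optimizing compiler is free to turn either form into the other, so this shows the trade-off and not what any particular GCC version emits.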



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (75 preceding siblings ...)
  2009-05-07 16:21 ` bonzini at gnu dot org
@ 2009-05-07 16:32 ` lucier at math dot purdue dot edu
  2009-05-07 16:38 ` bonzini at gnu dot org
                   ` (38 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: lucier at math dot purdue dot edu @ 2009-05-07 16:32 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #75 from lucier at math dot purdue dot edu  2009-05-07 16:31 -------
Subject: Re:  [4.3/4.4/4.5 Regression] 30% performance slowdown in
floating-point code caused by  r118475


On May 7, 2009, at 12:21 PM, bonzini at gnu dot org wrote:

> ------- Comment #74 from bonzini at gnu dot org  2009-05-07 16:21 -------
> Ok.  One step at a time. :-)  To recap, here is the situation:
>
> - that scheduling is necessary now and not in 4.2.x is probably just a
> matter of luck

If you mean -fschedule-insns2, it has always been part of the options list.

> - at least we have a set of options providing good performance on this
> testcase, and guidance towards better tuning of the various problematic
> optimizations

OK, but -fforward-propagate is not viable in general for these
machine-generated codes.

Brad



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (76 preceding siblings ...)
  2009-05-07 16:32 ` lucier at math dot purdue dot edu
@ 2009-05-07 16:38 ` bonzini at gnu dot org
  2009-05-07 17:50 ` steven at gcc dot gnu dot org
                   ` (37 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2009-05-07 16:38 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #76 from bonzini at gnu dot org  2009-05-07 16:37 -------
It should be possible to modify fwprop to avoid excessive memory usage (doing
its own dataflow, basically, instead of using UD chains)



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (77 preceding siblings ...)
  2009-05-07 16:38 ` bonzini at gnu dot org
@ 2009-05-07 17:50 ` steven at gcc dot gnu dot org
  2009-05-08  6:51 ` bonzini at gcc dot gnu dot org
                   ` (36 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: steven at gcc dot gnu dot org @ 2009-05-07 17:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #77 from steven at gcc dot gnu dot org  2009-05-07 17:50 -------
Re. comment #75: Just the fact that an option is enabled in both releases
doesn't mean the pass behind it is doing the same thing in both releases. What
the scheduler does depends heavily on the code you feed it.  Sometimes it is
pure (good or bad) luck that changes the behavior of a pass in the compiler.
The interactions between all the pieces are just very complicated (which is
why, IMHO, retargetable-compiler engineering is so difficult: controlling the
pipeline is not doable).

Re. comment #76:
Sad as it may be, I think this is the best short-term solution.
Alternatively we could re-work fwprop to work on regions and use the
partial-CFG dataflow stuff, similar to what the RTL loop optimizers (like
loop-invariant) do.  To be honest, I'd much prefer the latter, but the
DIY-fwprop thing is probably easier in the short term.



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (78 preceding siblings ...)
  2009-05-07 17:50 ` steven at gcc dot gnu dot org
@ 2009-05-08  6:51 ` bonzini at gcc dot gnu dot org
  2009-05-08  7:18 ` bonzini at gnu dot org
                   ` (35 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gcc dot gnu dot org @ 2009-05-08  6:51 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #78 from bonzini at gnu dot org  2009-05-08 06:51 -------
Subject: Bug 33928

Author: bonzini
Date: Fri May  8 06:51:12 2009
New Revision: 147270

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147270
Log:
2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

        PR rtl-optimization/33928
        * loop-invariant.c (struct use): Add addr_use_p.
        (struct def): Add n_addr_uses.
        (struct invariant): Add cheap_address.
        (create_new_invariant): Set cheap_address.
        (record_use): Accept df_ref.  Set addr_use_p and update n_addr_uses.
        (record_uses): Pass df_ref to record_use.
        (get_inv_cost): Do not add inv->cost to comp_cost for cheap
        addresses used only as such.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/loop-invariant.c
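
The get_inv_cost entry above describes a costing rule. A minimal sketch of that idea in C follows, assuming a much simplified model: struct inv_sketch, its fields and hoisting_gain are hypothetical names for illustration and do not reproduce loop-invariant.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for an invariant candidate (hypothetical). */
struct inv_sketch {
    int cost;            /* estimated cost of computing the value once */
    bool cheap_address;  /* an address the target can fold into a memory operand */
    bool only_addr_uses; /* every use is as an address, never as a data value */
};

/* Sum the computation cost that hoisting would save.  Cheap addresses
   that are used only as addresses contribute nothing, so they no longer
   make hoisting look profitable. */
int hoisting_gain(const struct inv_sketch *invs, size_t n)
{
    int gain = 0;
    for (size_t i = 0; i < n; i++) {
        if (invs[i].cheap_address && invs[i].only_addr_uses)
            continue;
        gain += invs[i].cost;
    }
    return gain;
}
```

Under this model a cheap address that is also used as a data value still counts, since it has to be computed into a register anyway.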



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (79 preceding siblings ...)
  2009-05-08  6:51 ` bonzini at gcc dot gnu dot org
@ 2009-05-08  7:18 ` bonzini at gnu dot org
  2009-05-08  7:52 ` bonzini at gcc dot gnu dot org
                   ` (34 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2009-05-08  7:18 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #79 from bonzini at gnu dot org  2009-05-08 07:18 -------
I'm cobbling up the DIY dataflow patch and it is anything but ugly, actually.



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (80 preceding siblings ...)
  2009-05-08  7:18 ` bonzini at gnu dot org
@ 2009-05-08  7:52 ` bonzini at gcc dot gnu dot org
  2009-05-08  7:55 ` bonzini at gnu dot org
                   ` (33 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gcc dot gnu dot org @ 2009-05-08  7:52 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #80 from bonzini at gnu dot org  2009-05-08 07:51 -------
Subject: Bug 33928

Author: bonzini
Date: Fri May  8 07:51:46 2009
New Revision: 147274

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147274
Log:
2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

        PR rtl-optimization/33928
        * loop-invariant.c (record_use): Fix && vs. || mishap.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/loop-invariant.c



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (81 preceding siblings ...)
  2009-05-08  7:52 ` bonzini at gcc dot gnu dot org
@ 2009-05-08  7:55 ` bonzini at gnu dot org
  2009-05-08  9:41 ` bonzini at gnu dot org
                   ` (32 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2009-05-08  7:55 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #81 from bonzini at gnu dot org  2009-05-08 07:55 -------
Created an attachment (id=17825)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17825&action=view)
speed up fwprop and enable it at -O1

Here is a patch I'm bootstrapping to remove fwprop's usage of UD chains.  It
does not affect the assembly output at all; it just changes the data structure
that is used.
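
The change of data structure can be sketched as follows. With use-def chains every use carries a list of all definitions reaching it; one way to do without them is a single slot per use that holds the definition only when it is unique. This is a sketch under assumptions, not the patch itself: reaching definitions are modelled as a 64-bit mask per use, and only_set_bit and build_single_def_links are hypothetical names.

```c
#include <stdint.h>

/* Return the index of the only set bit in mask, or -1 if no bit or
   more than one bit is set. */
int only_set_bit(uint64_t mask)
{
    if (mask == 0 || (mask & (mask - 1)) != 0)
        return -1;
    int index = 0;
    while ((mask & 1) == 0) {
        mask >>= 1;
        index++;
    }
    return index;
}

/* For each use, record its unique reaching definition, or -1 when there
   is none or more than one.  Storage is one int per use instead of one
   list per use. */
void build_single_def_links(const uint64_t *reaching, int n_uses, int *use_def)
{
    for (int u = 0; u < n_uses; u++)
        use_def[u] = only_set_bit(reaching[u]);
}
```

A propagation pass built on such a table would simply skip every use whose slot is -1.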

compiler.i is probably too big for me, but I tried slatex.i and fwprop was ~2%
of compilation time with this patch.



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
  2007-10-28  1:46 [Bug regression/33928] New: 33% performance slowdown from 4.2.2 in floating-point code lucier at math dot purdue dot edu
                   ` (82 preceding siblings ...)
  2009-05-08  7:55 ` bonzini at gnu dot org
@ 2009-05-08  9:41 ` bonzini at gnu dot org
  2009-05-08 12:23 ` bonzini at gcc dot gnu dot org
                   ` (31 subsequent siblings)
  115 siblings, 0 replies; 117+ messages in thread
From: bonzini at gnu dot org @ 2009-05-08  9:41 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #82 from bonzini at gnu dot org  2009-05-08 09:41 -------
Hm, looking at the time reports the patch will save about 30-40% of the fwprop
execution time, and should fix the memory hog problem, but will still leave in
the 70s needed to compute reaching definitions.  I guess it's a step forward
for -O2 but borderline for -O1.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bonzini at gcc dot gnu dot org @ 2009-05-08 12:23 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #83 from bonzini at gnu dot org  2009-05-08 12:22 -------
Subject: Bug 33928

Author: bonzini
Date: Fri May  8 12:22:30 2009
New Revision: 147282

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147282
Log:
2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

        PR rtl-optimization/33928
        PR 26854
        * fwprop.c (use_def_ref, get_def_for_use, bitmap_only_bit_bitween,
        process_uses, build_single_def_use_links): New.
        (update_df): Update use_def_ref.
        (forward_propagate_into): Use get_def_for_use instead of use-def
        chains.
        (fwprop_init): Call build_single_def_use_links and let it initialize
        dataflow.
        (fwprop_done): Free use_def_ref.
        (fwprop_addr): Eliminate duplicate call to df_set_flags.
        * df-problems.c (df_rd_simulate_artificial_defs_at_top, 
        df_rd_simulate_one_insn): New.
        (df_rd_bb_local_compute_process_def): Update head comment.
        (df_chain_create_bb): Use the new RD simulation functions.
        * df.h (df_rd_simulate_artificial_defs_at_top, 
        df_rd_simulate_one_insn): New.
        * opts.c (decode_options): Enable fwprop at -O1.
        * doc/invoke.texi (-fforward-propagate): Document this.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/df-problems.c
    trunk/gcc/df.h
    trunk/gcc/doc/invoke.texi
    trunk/gcc/fwprop.c
    trunk/gcc/opts.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bonzini at gnu dot org @ 2009-05-15 10:36 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #84 from bonzini at gnu dot org  2009-05-15 10:35 -------
OK, I am working on a patch to add a multiple-definitions DF problem and use
it together with a domwalk to find the single definitions (instead of
reaching definitions, which is the remaining slow part).  The new problem has a
bitvector sized by the number of registers rather than the number of defs (that
is, it is sized like the bitvectors for liveness), which means it will be fast.
It is defined as follows:

MDkill (B) = regs that have a def in B
MDinit (B) = (union of MDkill (P) for every P : B \in DomFrontier(P))
             \cap LRin(B)
MDin (B) = MDinit (B) \cup (union of MDout (P) for every predecessor P of B)
MDout (B) = MDin (B) - MDkill (B)
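
The equations above can be illustrated with a small stand-alone sketch.  This
is not the code from the patch: it assumes that the intersection with LRin(B)
applies to the whole union in MDinit, it uses plain unsigned masks instead of
GCC bitmaps, and the four-block diamond CFG, the register numbering and the
names md_kill, lr_in, dom_frontier, preds and md_solve are all made up for the
example.

```c
#define NBLOCKS 4

/* Hypothetical diamond CFG: block 0 branches to 1 and 2, which join in 3.
   Registers are bits of an unsigned mask: r0 = 0x1, r1 = 0x2, r2 = 0x4.  */

/* MDkill (B): registers that have a def in B.  */
const unsigned md_kill[NBLOCKS] = { 0x1, 0x2, 0x6, 0x0 };

/* LRin (B): registers live on entry to B (made-up liveness data).  */
const unsigned lr_in[NBLOCKS] = { 0x0, 0x1, 0x1, 0x3 };

/* dom_frontier[P]: mask of the blocks in DomFrontier (P).  */
const unsigned dom_frontier[NBLOCKS] = { 0x0, 0x8, 0x8, 0x0 };

/* preds[B]: mask of the predecessor blocks of B.  */
const unsigned preds[NBLOCKS] = { 0x0, 0x1, 0x1, 0x6 };

/* Solve the four equations by iterating to a fixed point.  */
void
md_solve (unsigned md_in[NBLOCKS], unsigned md_out[NBLOCKS])
{
  unsigned md_init[NBLOCKS];
  int b, p, changed;

  /* MDinit (B) = (union of MDkill (P), B in DomFrontier (P)) & LRin (B).  */
  for (b = 0; b < NBLOCKS; b++)
    {
      unsigned init = 0;
      for (p = 0; p < NBLOCKS; p++)
        if ((dom_frontier[p] >> b) & 1)
          init |= md_kill[p];
      md_init[b] = init & lr_in[b];
      md_in[b] = md_out[b] = 0;
    }

  do
    {
      changed = 0;
      for (b = 0; b < NBLOCKS; b++)
        {
          /* MDin (B) = MDinit (B) | union of MDout (P) over predecessors.  */
          unsigned in = md_init[b];
          unsigned out;
          for (p = 0; p < NBLOCKS; p++)
            if ((preds[b] >> p) & 1)
              in |= md_out[p];
          /* MDout (B) = MDin (B) - MDkill (B).  */
          out = in & ~md_kill[b];
          if (in != md_in[b] || out != md_out[b])
            {
              md_in[b] = in;
              md_out[b] = out;
              changed = 1;
            }
        }
    }
  while (changed);
}
```

In this made-up CFG only r1 ends up in MDin of the join block: it is defined
in both arms and is live at the join, while r2 is dropped by the intersection
with LRin.  Each set is one mask over registers, which is the size argument
made above.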


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-05-16  0:20 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #85 from lucier at math dot purdue dot edu  2009-05-16 00:20 -------
Created an attachment (id=17878)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17878&action=view)
Large test file for testing time and memory usage

This is the file compiler.i used in the previous tests.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-05-16  0:29 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #86 from lucier at math dot purdue dot edu  2009-05-16 00:29 -------
Created an attachment (id=17879)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17879&action=view)
Time and memory report for compiler.i

This is the time and memory report after the hack from

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39301#c8

to make the statistic fields HOST_WIDEST_INTs.

Some interesting lines:

fwprop.c:178 (build_single_def_use_links)        8   8438189160        82240           0    1027496
df-problems.c:311 (df_rd_alloc)             155420   8433928200   8433870880  8433870880          0
df-problems.c:593 (df_rd_transfer_functio   909666  40718919320   6755812320  6755736840    2025096
Total                                     13171390  61130398320


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-05-16  0:33 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #87 from lucier at math dot purdue dot edu  2009-05-16 00:33 -------
The compiler options for the previous report:

/pkgs/gcc-mainline/bin/gcc -save-temps -I../include -I.  -Wall -W -Wno-unused
-O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing
-fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -frename-registers
-fno-move-loop-invariants -fforward-propagate -DHAVE_CONFIG_H -D___PRIMAL
-D___LIBRARY -c compiler.i -ftime-report -fmem-report > &
rename-no-move-loop-invariants-forward-propagate-report-new


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bonzini at gnu dot org @ 2009-06-08  8:40 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #88 from bonzini at gnu dot org  2009-06-08 08:40 -------
Created an attachment (id=17963)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17963&action=view)
patch I'm testing

Here is a patch I'm testing that completes the rewrite of fwprop's dataflow. 
This should make it much faster and less memory hungry.  It should also keep
the generated code fast (with -frename-registers of course), if not it's a bug
in the patch.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bonzini at gnu dot org @ 2009-06-08  8:59 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #89 from bonzini at gnu dot org  2009-06-08 08:59 -------
Created an attachment (id=17964)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17964&action=view)
correct version

oops, the previous one didn't work at -O1 even though it bootstrapped :-)


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #17963|0                           |1
        is obsolete|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bonzini at gnu dot org @ 2009-06-08 16:36 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #90 from bonzini at gnu dot org  2009-06-08 16:35 -------
Yo, with the patch the time to compile compiler.i with the given options is
331s on my machine (with a checking compiler).  Fwprop takes only 1% (including
computation of the new dataflow problem).  I'd estimate around 250s with your
nonchecking build.  I'll split it and post it tomorrow.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-06-08 18:19 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #91 from lucier at math dot purdue dot edu  2009-06-08 18:19 -------
Created an attachment (id=17968)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17968&action=view)
time and memory report for compiler.i after Paolo's patch

The patch cut the total bitmaps used compiling compiler.i from > 60GB to 3GB;
maximum memory (just from top) was 1631MB.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bonzini at gnu dot org @ 2009-06-12 14:51 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #92 from bonzini at gnu dot org  2009-06-12 14:50 -------
In the meanwhile something caused "tree incremental SSA" to jump up from 10s to
26s.  Sob.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: rguenth at gcc dot gnu dot org @ 2009-06-13 14:18 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #93 from rguenth at gcc dot gnu dot org  2009-06-13 14:18 -------
I would say that was the new SRA.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mjambor at suse dot cz


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: jamborm at gcc dot gnu dot org @ 2009-06-14  4:44 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #94 from jamborm at gcc dot gnu dot org  2009-06-14 04:43 -------
(In reply to comment #92)
> In the meanwhile something caused "tree incremental SSA" to jump up from 10s to
> 26s.  Sob.
> 

(In reply to comment #93)
> I would say that was the new SRA.
> 

OK, I'll try to investigate.  Which of the various attachments to this
bug is the one to look at?

Martin


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-06-14 14:59 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #95 from lucier at math dot purdue dot edu  2009-06-14 14:59 -------
The test case is compiler.i.gz


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-06-14 15:02 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #96 from lucier at math dot purdue dot edu  2009-06-14 15:02 -------
Sorry, the gcc options are in comment 87 (the -fforward-propagate is now
redundant), and without Paolo's recently proposed patch it requires about 9GB
of memory to compile.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bonzini at gnu dot org @ 2009-06-15 15:14 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #97 from bonzini at gnu dot org  2009-06-15 15:14 -------
Brad, could you try to time compiler.i with and without -ftime-report to see
how much of the "tree stmt walking" timevar is just accounting overhead?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-06-15 16:12 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #98 from lucier at math dot purdue dot edu  2009-06-15 16:11 -------
I don't quite understand how you would like me to configure and run the test.

First, I've applied your patches to speed up computing DF to my tree; do you
want them included in the test, or should I use a pristine mainline?

Second, when configuring mainline, should I include, or not include

1.  --enable-gather-detailed-mem-stats
2.  --enable-checking=release

After that, I think you just want to run two compiles with and without
-ftime-report, is that right?  (Nothing about -fmem-report.)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: paolo dot bonzini at gmail dot com @ 2009-06-15 16:21 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #99 from paolo dot bonzini at gmail dot com  2009-06-15 16:20 -------
Subject: Re:  [4.3/4.4/4.5 Regression] 30% performance
 slowdown in floating-point code caused by  r118475

> First, I've applied your patches to speed up computing DF to my tree; do you
> want them included in the test, or should I use a pristine mainline?

It doesn't matter, but yes, use them.

> Second, when configuring mainline, should I include, or not include
> 
> 1.  --enable-gather-detailed-mem-stats
> 2.  --enable-checking=release

Again it shouldn't matter, but use only --enable-checking=release.

> After that, I think you just want to run two compiles with and without
> -ftime-report, is that right?  (Nothing about -fmem-report.)

Yes, and the output of -ftime-report is not needed.  Just the "time 
./cc1 ..." output for the two.  Thanks!


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bonzini at gnu dot org @ 2009-06-15 16:22 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #100 from bonzini at gnu dot org  2009-06-15 16:22 -------
Just as a reminder for after the fwprop patches are committed, the problem in
CFG cleanup is that the iterative fixing of dominators in
remove_edge_and_dominated_blocks is very expensive.  Probably we should make
sure no dominators are there in some key cfgcleanup passes.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bonzini at gnu dot org @ 2009-06-15 16:26 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #101 from bonzini at gnu dot org  2009-06-15 16:26 -------
Time for cleanup.  This bug is fixed on mainline, and likely WONTFIX on 4.3/4.4
(though it could in principle be fixed by backporting the fwprop patches to
4.4).  I'll add some pointers to PR26854 for the attachments related to
compile-time problems.


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|                            |4.5.0
            Summary|[4.3/4.4/4.5 Regression] 30%|[4.3/4.4 Regression] 30%
                   |performance slowdown in     |performance slowdown in
                   |floating-point code caused  |floating-point code caused
                   |by  r118475                 |by  r118475
            Version|4.3.0                       |4.5.0


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-06-15 19:57 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #102 from lucier at math dot purdue dot edu  2009-06-15 19:57 -------
Subject: Re:  [4.3/4.4/4.5 Regression] 30%
 performance slowdown in floating-point code caused by  r118475

On Mon, 2009-06-15 at 16:20 +0000, paolo dot bonzini at gmail dot com
wrote:

> Yes, and the output of -ftime-report is not needed.  Just the "time 
> ./cc1 ..." output for the two.  Thanks!

The two commands:

time /pkgs/gcc-mainline/bin/gcc -O1 -fno-math-errno -fschedule-insns2
-fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC
-fno-common -mieee-fp -c compiler.i 
261.424u 1.184s 4:22.76 99.9%   0+0k 0+28456io 0pf+0w
time /pkgs/gcc-mainline/bin/gcc -O1 -fno-math-errno -fschedule-insns2
-fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC
-fno-common -mieee-fp -c compiler.i -ftime-report 
263.424u 4.900s 4:28.68 99.8%   0+0k 0+28480io 0pf+0w
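
The accounting overhead asked about in comment #97 follows from these two
lines by simple arithmetic.  The sketch below is only an illustration; the
helper names overhead_percent and wall_seconds are made up, and the inputs
are the user and wall-clock times printed above.

```c
/* Relative cost, in percent, of the run with -ftime-report compared to
   the run without it.  */
double
overhead_percent (double without_report, double with_report)
{
  return (with_report - without_report) / without_report * 100.0;
}

/* Convert a "minutes:seconds" wall-clock reading to seconds.  */
double
wall_seconds (int minutes, double seconds)
{
  return minutes * 60.0 + seconds;
}
```

With the figures above, overhead_percent (261.424, 263.424) is a bit under 1%
for user time, and overhead_percent (wall_seconds (4, 22.76),
wall_seconds (4, 28.68)) is a little over 2% for wall-clock time.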


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-06-15 20:21 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #103 from lucier at math dot purdue dot edu  2009-06-15 20:21 -------
Regarding comment #101 ...

With

heine:~/programs/gcc/objdirs/gsc-fft-tests/gambc-v4_1_2>
/pkgs/gcc-mainline/bin/gcc -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --prefix=/pkgs/gcc-mainline
--enable-languages=c --disable-multilib --enable-checking=release
Thread model: posix
gcc version 4.5.0 20090608 (experimental) [trunk revision 148276] (GCC) 

(and including Paolo's patch to speed up DF), the routine in direct.c takes

    168 ms cpu time (168 user, 0 system)

As reported here

http://www.math.purdue.edu/~lucier/bugzilla/9/

with gcc-4.2.4, this routine takes 156 ms on the same machine.

Comment #9 gives the code that 4.2.4 generates at the start of the main loop; 
the start of the main loop with the version of 4.5.0 I gave above is:

.L2938:
        movq    %rcx, %rdx
        addq    8(%rax), %rdx
        leaq    4(%rcx), %rbx
        movq    %rdx, -8(%rax)
        leaq    4(%rdx), %rdi
        addq    8(%rax), %rdx
        movq    %rdi, -16(%rax)
        movq    %rdx, -24(%rax)
        leaq    4(%rdx), %rdi
        addq    8(%rax), %rdx
        movq    %rdi, -32(%rax)
        movq    %rdx, -40(%rax)
        leaq    4(%rdx), %rdi
        movq    40(%rax), %rdx
        movq    %rdi, -48(%rax)
        movsd   7(%rdx,%rdi,2), %xmm7
        movq    -40(%rax), %rdi
        leaq    7(%rdx,%rcx,2), %r8
        addq    $8, %rcx
        movsd   (%r8), %xmm4
        cmpq    %rcx, %r13
        movsd   7(%rdx,%rdi,2), %xmm10
        movq    -32(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm5
        movq    -24(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm6
        movq    -16(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm13
        movq    -8(%rax), %rdi
        movsd   7(%rdx,%rdi,2), %xmm11
        leaq    (%rbx,%rbx), %rdi
        movsd   7(%rdi,%rdx), %xmm9
        movq    24(%rax), %rdx
        movapd  %xmm11, %xmm14
        movsd   15(%rdx), %xmm1
        movsd   7(%rdx), %xmm2
        movapd  %xmm1, %xmm8
        movsd   31(%rdx), %xmm3
        movapd  %xmm2, %xmm12
        mulsd   %xmm10, %xmm8
        mulsd   %xmm7, %xmm12
        mulsd   %xmm2, %xmm10
        mulsd   %xmm1, %xmm7
        movsd   23(%rdx), %xmm0

So, to my mind, this is still a 4.5 regression: there is still a slowdown,
and the code is still much less optimized by 4.5.0 than by 4.2.4.  168/156 ~
1.08, so if you want to change the Summary of this bug to 8% regression, or
to something else, that's fine, but I've changed this PR back to being a 4.5
regression.
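
For reference, the 8% figure is just the ratio of the two timings quoted in
this comment.  A minimal sketch, with a made-up helper name:

```c
/* Slowdown, in percent, of new_ms relative to old_ms.  */
double
slowdown_percent (double old_ms, double new_ms)
{
  return (new_ms / old_ms - 1.0) * 100.0;
}
```

slowdown_percent (156.0, 168.0) comes out just under 7.7, which rounds to the
8% above.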

I was not really thrilled when Richard marked PR 39157 as a duplicate of this
PR.  To my mind, there are three more or less independent things---run time of
Gambit-generated code, compile time of the code, and the space required to
compile the code.  This PR is about run time; PR 39157 was about space needed
by the compiler; PR 26854 is about compile time.  They seem to have all been
mushed together.


-- 

lucier at math dot purdue dot edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Known to work|4.5.0                       |
            Summary|[4.3/4.4 Regression] 30%    |[4.3/4.4/4.5 Regression] 30%
                   |performance slowdown in     |performance slowdown in
                   |floating-point code caused  |floating-point code caused
                   |by  r118475                 |by  r118475


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bonzini at gnu dot org @ 2009-06-16  6:48 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #104 from bonzini at gnu dot org  2009-06-16 06:47 -------
I understood that with -frename-registers the regression is fixed.  As I said,
without a pre-regalloc scheduling pass and without register renaming, the
scheduling quality you get is more or less random.
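To illustrate the point about register renaming, here is a minimal toy sketch; it is not GCC's scheduler. It assumes an idealized machine with unlimited issue width, made-up latencies, and a hypothetical `Insn` record, and it computes the length of an as-soon-as-possible schedule that respects true, anti, and output dependences. Reusing r1 for the second load ties two otherwise independent chains together, while renaming it to r4 removes that constraint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    """Hypothetical instruction record, for illustration only."""
    dest: str       # register written
    srcs: tuple     # registers read
    latency: int    # cycles until the result is available (made-up values)


def schedule_length(insns):
    """Cycles needed by an ASAP schedule that honors all register dependences.

    Assumes unlimited issue width, so only dependences delay an instruction.
    """
    start = []
    for i, insn in enumerate(insns):
        earliest = 0
        # True (read-after-write) dependence: wait for the most recent
        # earlier writer of each source register to produce its result.
        for src in insn.srcs:
            for j in range(i - 1, -1, -1):
                if insns[j].dest == src:
                    earliest = max(earliest, start[j] + insns[j].latency)
                    break
        # False dependences: anti (write-after-read) and output
        # (write-after-write). The later write must issue after the
        # earlier instruction has issued.
        for j in range(i):
            if insn.dest in insns[j].srcs or insn.dest == insns[j].dest:
                earliest = max(earliest, start[j] + 1)
        start.append(earliest)
    return max(s + insn.latency for s, insn in zip(start, insns))


# Two independent load/multiply chains that share register r1.
reused = [
    Insn("r1", (), 3),        # r1 = load a
    Insn("r2", ("r1",), 4),   # r2 = r1 * r1
    Insn("r1", (), 3),        # r1 = load b   (must wait for the read above)
    Insn("r3", ("r1",), 4),   # r3 = r1 * r1
]

# The same computation after renaming the second r1 to r4.
renamed = [
    Insn("r1", (), 3),
    Insn("r2", ("r1",), 4),
    Insn("r4", (), 3),
    Insn("r3", ("r4",), 4),
]

print("reused registers :", schedule_length(reused), "cycles")
print("renamed registers:", schedule_length(renamed), "cycles")
```

Under these assumptions the reused-register sequence needs 11 cycles and the renamed one 7, because the second chain no longer has to wait for the first multiply to read r1.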


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bonzini at gnu dot org @ 2009-06-16  7:02 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #105 from bonzini at gnu dot org  2009-06-16 07:01 -------
Marking PR39157 as a duplicate of PR26854 is not exact: only the fwprop part is
a duplicate, since the large compile times came from building large data
structures, and the CFG Cleanup part is not exactly a duplicate.  But I don't
think it matters, because we have a patch for the fwprop issue anyway.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-06-16  7:25 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #106 from lucier at math dot purdue dot edu  2009-06-16 07:24 -------
This machine has 4 ms ticks, so with a benchmark of this size we're getting
down to a difference of a few ticks.  It's 156 ms with 4.2.4, 168 ms with
4.5.0, and 164 ms when -frename-registers is added to the command line.

It's not just scheduling; there are more memory accesses with 4.5.0.

With a problem roughly 10 times as large, the times are

4.2.4:  2912ms
4.5.0:  3204ms
4.5.0:  3120ms (adding -frename-registers)

So even with -frename-registers there's still a 7% slowdown relative to 4.2.4.
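The arithmetic behind these figures can be checked with a short sketch. The timings and the 4 ms tick size are taken from the comment above; the helper name `slowdown_pct` and the choice of 4.2.4 as the baseline are assumptions made for illustration.

```python
TICK_MS = 4  # timer resolution reported in the comment


def slowdown_pct(baseline_ms, other_ms):
    """Extra run time of other_ms relative to baseline_ms, in percent."""
    return 100.0 * (other_ms - baseline_ms) / baseline_ms


# Timings in milliseconds, copied from the comment.
small = {"4.2.4": 156, "4.5.0": 168, "4.5.0 -frename-registers": 164}
large = {"4.2.4": 2912, "4.5.0": 3204, "4.5.0 -frename-registers": 3120}

for label, runs in (("small", small), ("large", large)):
    base = runs["4.2.4"]
    for name, ms in runs.items():
        ticks = ms // TICK_MS  # how many timer ticks the measurement spans
        pct = slowdown_pct(base, ms)
        print(f"{label:5s} {name:26s} {ms:5d} ms {ticks:4d} ticks {pct:+5.1f}%")
```

On the small benchmark the three builds differ by only two or three ticks, which is why the larger run gives the more trustworthy percentages.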


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: rguenth at gcc dot gnu dot org @ 2009-08-04 12:37 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #107 from rguenth at gcc dot gnu dot org  2009-08-04 12:28 -------
GCC 4.3.4 is being released, adjusting target milestone.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.3.4                       |4.3.5


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-08-27  1:18 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #108 from lucier at math dot purdue dot edu  2009-08-27 01:18 -------
direct.c contains a direct FFT; I compiled the direct and inverse FFTs and ran
them on arrays with 2^23 double-precision complex elements, using this compiler:

heine:~/programs/gcc/objdirs/bench-mainline-on-fft> /pkgs/gcc-mainline/bin/gcc
-v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../../mainline/configure --enable-checking=release
--prefix=/pkgs/gcc-mainline --enable-languages=c,c++
-enable-stage1-languages=c,c++
Thread model: posix
gcc version 4.5.0 20090803 (experimental) [trunk revision 150373] (GCC) 

The compile options were

/pkgs/gcc-mainline/bin/gcc -save-temps -c -Wno-unused -O1 -fno-math-errno
-fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv
-fomit-frame-pointer -fPIC -fno-common -mieee-fp -rdynamic -shared
-fschedule-insns

and the same without -fschedule-insns.

The runtime for direct+inverse FFT with instruction scheduling was 1.264
seconds and the time for direct+inverse FFT without -fschedule-insns was 1.444
seconds, which is a 14% speedup for that one compiler option.  This is on a
2.33GHz Core 2 quad machine.

I'll attach the inner loops of direct.c with and without -fschedule-insns.

I haven't been able to compile the complete Gambit runtime with
-fschedule-insns on either x86-64 or ppc64; I've filed PR41164 and PR41176 for
those two different failures.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-08-27  1:22 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #109 from lucier at math dot purdue dot edu  2009-08-27 01:22 -------
Created an attachment (id=18432)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18432&action=view)
inner loop of direct.c with -fschedule-insns


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-08-27  1:23 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #110 from lucier at math dot purdue dot edu  2009-08-27 01:22 -------
Created an attachment (id=18433)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18433&action=view)
inner loop of direct.c without -fschedule-insns


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: lucier at math dot purdue dot edu @ 2009-08-27 17:02 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #111 from lucier at math dot purdue dot edu  2009-08-27 17:02 -------
I can compile gambit 4.1.2 with -fschedule-insns except for the function noted
in PR41164.

On

model name      : Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz

with

gcc version 4.5.0 20090803 (experimental) [trunk revision 150373] (GCC) 

the times with -fschedule-insns are

(time (direct-fft-recursive-4 a table))
    144 ms cpu time (144 user, 0 system)
(time (inverse-fft-recursive-4 a table))
    136 ms cpu time (136 user, 0 system)

and the times without -fschedule-insns are

(time (direct-fft-recursive-4 a table))
    168 ms cpu time (168 user, 0 system)
(time (inverse-fft-recursive-4 a table))
    172 ms cpu time (172 user, 0 system)

That's a pretty big improvement.
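A small sketch can quantify the effect of -fschedule-insns on the timings reported in this thread. The numbers come from the comments above (the standalone direct+inverse FFT run and the two Gambit benchmarks); the helper names `speedup` and `time_saved_pct` are illustrative assumptions. It also shows that a "speedup" ratio and a "time saved" percentage are different measures of the same change.

```python
def speedup(without_ms, with_ms):
    """How many times faster the run is with the option enabled."""
    return without_ms / with_ms


def time_saved_pct(without_ms, with_ms):
    """Share of the original run time that the option removes, in percent."""
    return 100.0 * (without_ms - with_ms) / without_ms


# (benchmark, ms without -fschedule-insns, ms with -fschedule-insns)
runs = [
    ("direct+inverse FFT, standalone", 1444, 1264),
    ("direct-fft-recursive-4", 168, 144),
    ("inverse-fft-recursive-4", 172, 136),
]

for name, without_ms, with_ms in runs:
    print(f"{name:32s} speedup {speedup(without_ms, with_ms):.2f}x, "
          f"time saved {time_saved_pct(without_ms, with_ms):.1f}%")
```

The 14% figure quoted for the standalone run matches the speedup ratio (about 1.14x); expressed as time saved it is closer to 12%.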


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bergner at gcc dot gnu dot org @ 2009-10-03  1:39 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #112 from bergner at gcc dot gnu dot org  2009-10-03 01:39 -------
Subject: Bug 33928

Author: bergner
Date: Sat Oct  3 01:39:14 2009
New Revision: 152430

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=152430
Log:
        Backport from mainline.

        2009-08-30  Alan Modra  <amodra@bigpond.net.au>

        PR target/41081
        * fwprop.c (get_reg_use_in): Delete.
        (free_load_extend): New function.
        (forward_propagate_subreg): Use it.

        2009-08-23  Alan Modra  <amodra@bigpond.net.au>

        PR target/41081
        * fwprop.c (try_fwprop_subst): Allow multiple sets.
        (get_reg_use_in): New function.
        (forward_propagate_subreg): Propagate through subreg of zero_extend
        or sign_extend.

        2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

        PR rtl-optimization/33928
        PR 26854
        * fwprop.c (use_def_ref, get_def_for_use, bitmap_only_bit_bitween,
        process_uses, build_single_def_use_links): New.
        (update_df): Update use_def_ref.
        (forward_propagate_into): Use get_def_for_use instead of use-def
        chains.
        (fwprop_init): Call build_single_def_use_links and let it initialize
        dataflow.
        (fwprop_done): Free use_def_ref.
        (fwprop_addr): Eliminate duplicate call to df_set_flags.
        * df-problems.c (df_rd_simulate_artificial_defs_at_top,
        df_rd_simulate_one_insn): New.
        (df_rd_bb_local_compute_process_def): Update head comment.
        (df_chain_create_bb): Use the new RD simulation functions.
        * df.h (df_rd_simulate_artificial_defs_at_top,
        df_rd_simulate_one_insn): New.
        * opts.c (decode_options): Enable fwprop at -O1.
        * doc/invoke.texi (-fforward-propagate): Document this.

Modified:
    branches/ibm/gcc-4_3-branch/gcc/ChangeLog.ibm
    branches/ibm/gcc-4_3-branch/gcc/REVISION
    branches/ibm/gcc-4_3-branch/gcc/df-problems.c
    branches/ibm/gcc-4_3-branch/gcc/df.h
    branches/ibm/gcc-4_3-branch/gcc/doc/invoke.texi
    branches/ibm/gcc-4_3-branch/gcc/fwprop.c
    branches/ibm/gcc-4_3-branch/gcc/opts.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5/4.6 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: bergner at gcc dot gnu dot org @ 2010-04-29 14:35 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #113 from bergner at gcc dot gnu dot org  2010-04-29 14:34 -------
Subject: Bug 33928

Author: bergner
Date: Thu Apr 29 14:34:35 2010
New Revision: 158902

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=158902
Log:
        Backport from mainline.

        2009-08-30  Alan Modra  <amodra@bigpond.net.au>

        PR target/41081
        * fwprop.c (get_reg_use_in): Delete.
        (free_load_extend): New function.
        (forward_propagate_subreg): Use it.

        2009-08-23  Alan Modra  <amodra@bigpond.net.au>

        PR target/41081
        * fwprop.c (try_fwprop_subst): Allow multiple sets.
        (get_reg_use_in): New function.
        (forward_propagate_subreg): Propagate through subreg of zero_extend
        or sign_extend.

        2009-05-08  Paolo Bonzini  <bonzini@gnu.org>

        PR rtl-optimization/33928
        PR 26854
        * fwprop.c (use_def_ref, get_def_for_use, bitmap_only_bit_bitween,
        process_uses, build_single_def_use_links): New.
        (update_df): Update use_def_ref.
        (forward_propagate_into): Use get_def_for_use instead of use-def
        chains.
        (fwprop_init): Call build_single_def_use_links and let it initialize
        dataflow.
        (fwprop_done): Free use_def_ref.
        (fwprop_addr): Eliminate duplicate call to df_set_flags.
        * df-problems.c (df_rd_simulate_artificial_defs_at_top,
        df_rd_simulate_one_insn): New.
        (df_rd_bb_local_compute_process_def): Update head comment.
        (df_chain_create_bb): Use the new RD simulation functions.
        * df.h (df_rd_simulate_artificial_defs_at_top,
        df_rd_simulate_one_insn): New.
        * opts.c (decode_options): Enable fwprop at -O1.
        * doc/invoke.texi (-fforward-propagate): Document this.

Modified:
    branches/ibm/gcc-4_4-branch/gcc/ChangeLog.ibm
    branches/ibm/gcc-4_4-branch/gcc/df-problems.c
    branches/ibm/gcc-4_4-branch/gcc/df.h
    branches/ibm/gcc-4_4-branch/gcc/doc/invoke.texi
    branches/ibm/gcc-4_4-branch/gcc/fwprop.c
    branches/ibm/gcc-4_4-branch/gcc/opts.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928



* [Bug rtl-optimization/33928] [4.3/4.4/4.5/4.6 Regression] 30% performance slowdown in floating-point code caused by  r118475
From: rguenth at gcc dot gnu dot org @ 2010-05-22 18:20 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #114 from rguenth at gcc dot gnu dot org  2010-05-22 18:11 -------
GCC 4.3.5 is being released, adjusting target milestone.


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.3.5                       |4.3.6


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33928


