public inbox for gcc-bugs@sourceware.org
* [Bug rtl-optimization/27827]  New: gcc 4 produces worse x87 code on all platforms than gcc 3
@ 2006-05-31  0:33 hiclint at gmail dot com
  2006-05-31  0:35 ` [Bug rtl-optimization/27827] " pinskia at gcc dot gnu dot org
                   ` (70 more replies)
  0 siblings, 71 replies; 75+ messages in thread
From: hiclint at gmail dot com @ 2006-05-31  0:33 UTC (permalink / raw)
  To: gcc-bugs

Hi guys.  My name is Clint Whaley, I'm the developer of ATLAS, an open source
linear algebra package:
   http://directory.fsf.org/atlas.html

My users are asking me to support gcc 4, but right now its x87 fp performance
is much worse than gcc 3's.  Depending on the machine and the code being run,
it appears to be 10-50% worse.  Here is a tarfile that allows you to
reproduce the problem on any machine:
   http://www.cs.utsa.edu/~whaley/mmbench4.tar.gz

I have timed under a Pentium-D (gcc 4 gets 85% of gcc 3's performance on
example code) and Athlon-64 X2 (gcc 4 gets 60% of gcc 3's performance).  This
is a typical kernel from ATLAS, not the worst . . .

Looking at the assembly (the provided makefile will generate it with "make
assall"), the differences seem fairly minor.  From what I can tell, it mostly
comes down to gcc 4 using fmull with a memory operand rather than loading the
operand onto the fp stack first.
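
For reference, here is a minimal sketch (hypothetical names; this is not the
actual ATLAS kernel) of the kind of register-blocked statement where the
difference shows up.  A term like pA0[k]*rB0 may be compiled either by loading
the memory operand onto the fp stack first (fldl then fmul) or by multiplying
straight from memory (fmull):

/* mmsketch.c -- hypothetical sketch, not the real ATLAS kernel.
 * Build e.g. with: gcc -O -mfpmath=387 -S mmsketch.c and inspect the asm. */
#include <stdio.h>

#define NB 60

static void mm_kernel(const double *pA0, const double *pB0, double *pC0)
{
    double rC0 = *pC0;
    int k;
    for (k = 0; k < NB; k++) {
        double rB0 = pB0[k];     /* value the compiler keeps in an x87 reg */
        rC0 += pA0[k] * rB0;     /* the multiply whose codegen differs     */
    }
    *pC0 = rC0;
}

int main(void)
{
    double A[NB], B[NB], C = 0.0;
    int i;
    for (i = 0; i < NB; i++) { A[i] = 1.0; B[i] = 2.0; }
    mm_kernel(A, B, &C);
    printf("%f\n", C);           /* expect 120.0 */
    return 0;
}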

I know that SSE is the preferred target these days, but the x87 (when optimized
right) handily beats the single precision SSE unit in scalar mode due to the
expense of the scalar SSE load, and the x87 unit is slightly faster even in
double precision (in scalar mode).  Gcc cannot yet auto-vectorize any ATLAS
kernels.

Any help much appreciated,
Clint


-- 
           Summary: gcc 4 produces worse x87 code on all platforms than gcc
                    3
           Product: gcc
           Version: 4.1.1
            Status: UNCONFIRMED
          Severity: blocker
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: hiclint at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug rtl-optimization/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
@ 2006-05-31  0:35 ` pinskia at gcc dot gnu dot org
  2006-05-31  0:36 ` hiclint at gmail dot com
                   ` (69 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2006-05-31  0:35 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #1 from pinskia at gcc dot gnu dot org  2006-05-31 00:35 -------
Do you have a small testcase which shows the problem?


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|blocker                     |normal


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug rtl-optimization/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
  2006-05-31  0:35 ` [Bug rtl-optimization/27827] " pinskia at gcc dot gnu dot org
@ 2006-05-31  0:36 ` hiclint at gmail dot com
  2006-05-31  0:42 ` [Bug target/27827] " pinskia at gcc dot gnu dot org
                   ` (68 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: hiclint at gmail dot com @ 2006-05-31  0:36 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #2 from hiclint at gmail dot com  2006-05-31 00:36 -------
Created an attachment (id=11541)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=11541&action=view)
Makefile and source to demonstrate performance problem


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
  2006-05-31  0:35 ` [Bug rtl-optimization/27827] " pinskia at gcc dot gnu dot org
  2006-05-31  0:36 ` hiclint at gmail dot com
@ 2006-05-31  0:42 ` pinskia at gcc dot gnu dot org
  2006-05-31  0:50 ` hiclint at gmail dot com
                   ` (67 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2006-05-31  0:42 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #3 from pinskia at gcc dot gnu dot org  2006-05-31 00:41 -------
This is fully a target issue.


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|rtl-optimization            |target


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (2 preceding siblings ...)
  2006-05-31  0:42 ` [Bug target/27827] " pinskia at gcc dot gnu dot org
@ 2006-05-31  0:50 ` hiclint at gmail dot com
  2006-05-31  0:55 ` pinskia at gcc dot gnu dot org
                   ` (66 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: hiclint at gmail dot com @ 2006-05-31  0:50 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #4 from hiclint at gmail dot com  2006-05-31 00:50 -------
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3

Andrew,

Thanks for the reply.  For the small case demonstrating the problem, I
included it in the original message:
   http://www.cs.utsa.edu/~whaley/mmbench4.tar.gz

and have uploaded it as an attachment.  I am not sure what you mean by
"fully a target issue".  Perhaps I have submitted to the wrong area of
gcc performance bug?  Note that it is not limited to one machine: the
gcc 4 code is inferior to gcc 3 on both AMD and Intel.  I chose the
two newest machines I have access to, but I believe it is true for
older machines as well . . .

Any clarification appreciated,
Clint
On 31 May 2006 00:41:55 -0000, pinskia at gcc dot gnu dot org
<gcc-bugzilla@gcc.gnu.org> wrote:
>
>
> ------- Comment #3 from pinskia at gcc dot gnu dot org  2006-05-31 00:41 -------
> This is fully a target issue.
>
>
> --
>
> pinskia at gcc dot gnu dot org changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>           Component|rtl-optimization            |target
>
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
>
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug, or are watching someone who is.
> You reported the bug, or are watching the reporter.
>


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (3 preceding siblings ...)
  2006-05-31  0:50 ` hiclint at gmail dot com
@ 2006-05-31  0:55 ` pinskia at gcc dot gnu dot org
  2006-05-31  1:09 ` whaley at cs dot utsa dot edu
                   ` (65 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2006-05-31  0:55 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #5 from pinskia at gcc dot gnu dot org  2006-05-31 00:55 -------
(In reply to comment #4)
> and have uploaded it as an attachment.  I am not sure what you mean by
> "fully a target issue".  Perhaps I have submitted to the wrong area of
> gcc performance bug?  Note that it is not limited to one machine: the
> gcc 4 code is inferior to gcc 3 on both AMD and Intel.  I chose the
> two newest machines I have access to, but I believe it is true for
> older machines as well . . .
It only affects x86/x86_64 (really just x87 and its stack machine).
It truly looks like a register allocation (ra) issue.

There are no issues like this on, say, PowerPC.


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 GCC target triplet|                            |i?86-*-*, x86_64-*-*
           Keywords|                            |ra


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (4 preceding siblings ...)
  2006-05-31  0:55 ` pinskia at gcc dot gnu dot org
@ 2006-05-31  1:09 ` whaley at cs dot utsa dot edu
  2006-05-31 10:57 ` uros at kss-loka dot si
                   ` (64 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-05-31  1:09 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #6 from whaley at cs dot utsa dot edu  2006-05-31 01:09 -------
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3

Yes, I agree it is an x86/x86_64 issue.  I have not yet scoped the performance
of any of the other architectures with gcc 4 vs. 3: since 90% of my users use
an x86 of some sort, I can't switch to gcc 4 support until the x86 performance
is reasonable.  It seems the x87 performance always goes down with any big gcc
change (bugzilla 4991 is a similar performance drop between 2.x and 3.0, though
the issues are not exactly the same), probably because its oddball two-operand
instructions and x87 register stack don't map well to the saner ISAs that
compiler writers strongly prefer :)

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (5 preceding siblings ...)
  2006-05-31  1:09 ` whaley at cs dot utsa dot edu
@ 2006-05-31 10:57 ` uros at kss-loka dot si
  2006-05-31 14:13 ` whaley at cs dot utsa dot edu
                   ` (63 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: uros at kss-loka dot si @ 2006-05-31 10:57 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #7 from uros at kss-loka dot si  2006-05-31 10:56 -------
IMO the fact that gcc 3.x beats 4.x on this code could be attributed to pure
luck.

Looking into 3.x RTL, these things can be observed:

Instruction that multiplies pA0 and rB0 is described as:

__.20.combine:

(insn 75 73 76 2 (set (reg:DF 84)
        (mult:DF (mem:DF (reg/v/f:DI 70 [ pA0 ]) [0 S8 A64])
            (reg/v:DF 78 [ rB0 ]))) 551 {*fop_df_comm_nosse} (insn_list 65
(nil))
    (nil))

At this point, first input operand does not satisfy the operand constraint, so
register allocator pushes memory operand into the register:

__.25.greg:

(insn 703 73 75 2 (set (reg:DF 8 st [84])
        (mem:DF (reg/v/f:DI 0 ax [orig:70 pA0 ] [70]) [0 S8 A64])) 96
{*movdf_integer} (nil)
    (nil))

(insn 75 703 76 2 (set (reg:DF 8 st [84])
        (mult:DF (reg:DF 8 st [84])
            (reg/v:DF 9 st(1) [orig:78 rB0 ] [78]))) 551 {*fop_df_comm_nosse}
(insn_list 65 (nil))
    (nil))

This RTL produces following asm sequence:

        fldl    (%rax)  #* pA0
        fmul    %st(1), %st     #


In 4.x case, we have:

__.127r.combine:

(insn 60 58 61 4 (set (reg:DF 207)
        (mult:DF (reg/v:DF 187 [ rB0 ])
            (mem:DF (plus:DI (reg/v/f:DI 178 [ pA0.161 ])
                    (const_int 960 [0x3c0])) [0 S8 A64]))) 591
{*fop_df_comm_i387} (nil)
    (nil))

This instruction almost satisfies operand constraint, and register allocator
produces:

__.138r.greg:

(insn 470 58 60 5 (set (reg:DF 12 st(4) [207])
        (reg/v:DF 8 st [orig:187 rB0 ] [187])) 94 {*movdf_integer} (nil)
    (nil))

(insn 60 470 61 5 (set (reg:DF 12 st(4) [207])
        (mult:DF (reg:DF 12 st(4) [207])
            (mem:DF (plus:DI (reg/v/f:DI 0 ax [orig:178 pA0.161 ] [178])
                    (const_int 960 [0x3c0])) [0 S8 A64]))) 591
{*fop_df_comm_i387} (nil)
    (nil))

Stack handling then fixes this RTL to:

__.151r.stack:

(insn 470 58 60 4 (set (reg:DF 8 st)
        (reg:DF 8 st)) 94 {*movdf_integer} (nil)
    (nil))

(insn 60 470 61 4 (set (reg:DF 8 st)
        (mult:DF (reg:DF 8 st)
            (mem:DF (plus:DI (reg/v/f:DI 0 ax [orig:178 pA0.161 ] [178])
                    (const_int 960 [0x3c0])) [0 S8 A64]))) 591
{*fop_df_comm_i387} (nil)
    (nil))


From your measurements, it looks like instead of:

        fld     %st(0)  #
        fmull   (%rax)  #* pA0.161

it is faster to emit

        fldl    (%rax)  #* pA0
        fmul    %st(1), %st     #,


-- 

uros at kss-loka dot si changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |uros at kss-loka dot si


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (6 preceding siblings ...)
  2006-05-31 10:57 ` uros at kss-loka dot si
@ 2006-05-31 14:13 ` whaley at cs dot utsa dot edu
  2006-06-01  8:43 ` uros at kss-loka dot si
                   ` (62 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-05-31 14:13 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #8 from whaley at cs dot utsa dot edu  2006-05-31 14:12 -------
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3

Uros,

>IMO the fact that gcc 3.x beats 4.x on this code could be attributed to pure luck.

As far as understanding from first principles, performance on a modern x86
(which is busy doing OOE, register renaming, CISC/RISC translation, operand
fusion and fission, etc) is *always* a blind accident, IMHO :)   I've
hand-tuned code for the x87 for a *long* time (and written my own compilation
framework), and it has been my experience that only by trying different
schedules, instruction selection, etc. can you get decent-performing code.  gcc
actually does an amazing job on x87 performance when it's working right, and I
always figured it had to be empirically tweaked to get that level of performance.
The fact that x87 performance always drops off at major releases (a return to
first principles over discovered best cases) seems to confirm this . . .

So, I agree with you that the difference does not seem to have some big plan
behind it, but I want to stress that it is nonetheless critical: it happens to
all x87 codes on every x86 machine (I have so far tried Pentium-D, Athlon 64
X2, and P4e), and it happens no matter what optimized code I feed gcc 4.  Note
that ATLAS is not a static library, but rather uses a code generator to tune
matrix multiplication.  What this means is that ATLAS tries thousands of
different source implementations in trying to find one that will run the
fastest on the given architecture/compiler (the code generator does things like
tiling, register blocking, unroll & jam, software pipelining, unrolling, all at
the ANSI C source level, in an attempt to find the combo that the compiler/arch
likes, etc.).  On no x86 architecture I've installed on can gcc 4 compete with
gcc 3.  Thus, out of literally thousands of implementations on each platform,
gcc 4 cannot find one that can compete with gcc 3's best
case.  I cannot, of course, send you thousands of codes and say "see, all of
these are inferior", but they are, and the case I sent is not the worst.  For
instance, for single precision gemm on the Athlon 64, the kernel tuned for gcc
4 (best case of thousands taken) runs at 56.7% of the performance of the gcc
3-tuned kernel.  Nor does using SSE fix things: gcc 4 is still far slower using
SSE than gcc 3 using the x87 on all platforms, and for single precision, the
gap is worse than between x87 implementations!
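
(For illustration only -- a hypothetical sketch, not actual ATLAS generator
output: the generator emits many variants along these lines, varying the
register-blocking and unroll factors, and keeps whichever one the
compiler/architecture turns into the fastest code.)

/* Hypothetical 2x2 register-blocked micro-kernel, C += A*B, column-major,
 * order NB.  The real generator varies blocking, unrolling, prefetch, etc. */
#define NB 60

static void mm_2x2(const double *A, const double *B, double *C)
{
    int i, j, k;
    for (j = 0; j < NB; j += 2) {
        for (i = 0; i < NB; i += 2) {
            /* four accumulators held in registers (register blocking) */
            double c00 = C[i + j*NB],   c01 = C[i + (j+1)*NB];
            double c10 = C[i+1 + j*NB], c11 = C[i+1 + (j+1)*NB];
            for (k = 0; k < NB; k++) {
                double a0 = A[i + k*NB], a1 = A[i+1 + k*NB];
                double b0 = B[k + j*NB], b1 = B[k + (j+1)*NB];
                c00 += a0*b0;  c01 += a0*b1;
                c10 += a1*b0;  c11 += a1*b1;
            }
            C[i + j*NB] = c00;   C[i + (j+1)*NB] = c01;
            C[i+1 + j*NB] = c10; C[i+1 + (j+1)*NB] = c11;
        }
    }
}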

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (7 preceding siblings ...)
  2006-05-31 14:13 ` whaley at cs dot utsa dot edu
@ 2006-06-01  8:43 ` uros at kss-loka dot si
  2006-06-01 16:03 ` whaley at cs dot utsa dot edu
                   ` (61 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: uros at kss-loka dot si @ 2006-06-01  8:43 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #9 from uros at kss-loka dot si  2006-06-01 08:43 -------
The benchmark run on a Pentium4 3.2G/800MHz FSB (32bit):

vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping        : 9
cpu MHz         : 3191.917
cache size      : 512 KB

shows even more interesting results:

gcc version 3.4.6
vs.
gcc version 4.2.0 20060601 (experimental)

-fomit-frame-pointer -O -msse2 -mfpmath=sse

GCC 3.x     performance:
./xmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.162     2664.87

GCC 4.x     performance:
./xmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.164     2633.13

and

-fomit-frame-pointer -O -mfpmath=387

GCC 3.x     performance:
./xmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.160     2697.37

GCC 4.x     performance:
./xmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.164     2633.15

There is a small performance drop on gcc-4.x, but nothing critical.
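
(As a sanity check on the MFLOPS numbers above -- a sketch that assumes the
benchmark charges the standard 2*NB^3 flops per matrix-multiply call, which is
consistent with the reported output:)

/* mflops_check.c: recompute the reported rate from NB, REPS and TIME,
 * assuming 2*NB^3 flops per call (an assumption, but it matches above). */
#include <stdio.h>

int main(void)
{
    double nb = 60.0, reps = 1000.0, time = 0.162;   /* gcc 3.x SSE run above */
    double mflops = 2.0 * nb * nb * nb * reps / (time * 1.0e6);
    printf("%.2f MFLOPS\n", mflops);                 /* ~2666 vs. 2664.87 reported */
    return 0;
}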

I can confirm that the code indeed runs >50% slower on a 64bit Athlon. Perhaps
the problem is in the ordering of instructions (Software Optimization Guide for
AMD Athlon 64, Section 10.2). The gcc-3.4 code looks similar to the guide's
example of how things should be done, and the gcc-4.2 code looks similar to the
example of how things should _NOT_ be done.

BTW: Did you try to run the benchmark on the AMD target with -march=k8? The
effects of this flag are devastating on a Pentium4 CPU:

-O -msse2 -mfpmath=sse -march=k8

GCC 3.x     performance:
./xmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.836      516.79

GCC 4.x     performance:
./xmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.287     1504.66


-- 

uros at kss-loka dot si changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
   Last reconfirmed date|0000-00-00 00:00:00      |2006-06-01 08:43:34


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (8 preceding siblings ...)
  2006-06-01  8:43 ` uros at kss-loka dot si
@ 2006-06-01 16:03 ` whaley at cs dot utsa dot edu
  2006-06-01 16:26 ` whaley at cs dot utsa dot edu
                   ` (60 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-06-01 16:03 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #10 from whaley at cs dot utsa dot edu  2006-06-01 16:02 -------
Created an attachment (id=11571)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=11571&action=view)
Same benchmark, but with single precision timing included

Here's the same benchmark, but it can time single as well as double precision,
in case you want to play with the SSE code.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (9 preceding siblings ...)
  2006-06-01 16:03 ` whaley at cs dot utsa dot edu
@ 2006-06-01 16:26 ` whaley at cs dot utsa dot edu
  2006-06-01 18:43 ` whaley at cs dot utsa dot edu
                   ` (59 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-06-01 16:26 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #11 from whaley at cs dot utsa dot edu  2006-06-01 16:26 -------
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3

Uros,

OK, I originally replied a couple of hours ago, but that is not appearing on
bugzilla for some reason, so I'll try again, this time CCing myself so
I don't have to retype everything :)

>gcc version 3.4.6
>vs.
>gcc version 4.2.0 20060601 (experimental)
>
>-fomit-frame-pointer -O -msse2 -mfpmath=sse
>
>There is a small performance drop on gcc-4.x, but nothing critical.
>
>I can confirm that the code indeed runs >50% slower on a 64bit Athlon. Perhaps
>the problem is in the ordering of instructions (Software Optimization Guide for
>AMD Athlon 64, Section 10.2). The gcc-3.4 code looks similar to the guide's
>example of how things should be done, and the gcc-4.2 code looks similar to the
>example of how things should _NOT_ be done.

First, thanks for looking into this!  As to your point, yes, I am aware
that gcc4-sse can get almost the same performance as gcc3-x87 (though not
quite), and in fact can do so on the Athlon 64 as well, 
**but only for double precision**.  To get SSE within a few percent of x87
on the AMD machine, you use a different kernel (remember, I'm sending you an
example out of many), and throw the following flags:
   -march=athlon64 -O2 -mfpmath=sse -msse -msse2 -m64 \
   -ftree-vectorize -fargument-noalias-global 
(note this does not vectorize the code, but I throw the flag in the hope that
 future versions will :)

Note that my bug report concentrates on "x87 performance"!  There are reasons
to use x87 even if scalar SSE is competitive performance-wise, as the x87
unit produces much better accuracy.  However, even if we were to take the tack
(and gcc may be doing this for all I know) that the x87 unit will no longer be
supported once scalar SSE can compete performance-wise, we must also examine
single precision performance.  For single precision, I have never gotten any
scalar SSE kernel to come even close to the gcc3-x87
numbers.  I believe (w/o having proved it) that this is probably due to the
cost of using the scalar load: double precision can use the low-overhead movlpd
instruction, but single must use MOVSS, which is **much** slower than FLD,
and so any kernel using scalar SSE blows chunks.  ATLAS's best case gcc4-sse
kernel gets roughly half of the gcc-x87 performance on an Athlon-64, and
something like 80% on a P4e (note that intel machines have half the theoretical
peak for x87 [AMD: 2 flops/cycle, Intel: 1 flop/cycle]: getting a large % of
performance gets easier the lower your peak gets!).
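
(For comparison purposes, a minimal single precision sketch -- hypothetical,
not an ATLAS kernel -- that can be compiled once with -O -mfpmath=387 and once
with -O -msse -mfpmath=sse to contrast the codegen: the x87 build should load
with flds, while the scalar SSE build should use movss/mulss.)

/* sdot_sketch.c: single precision accumulation loop for comparing
 * x87 vs. scalar SSE code generation and timing. */
float sdot(const float *x, const float *y, int n)
{
    float sum = 0.0f;
    int i;
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}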

I originally submitted a double precision kernel, because that showed the
x87 performance problem, and allowed me to reuse the infrastructure I
created for an earlier bug report (bugzilla 4991).  I have just uploaded
an example attachment that can time both single and double precision
performance, if you want to confirm for yourself that SSE is not competitive
for single precision.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (10 preceding siblings ...)
  2006-06-01 16:26 ` whaley at cs dot utsa dot edu
@ 2006-06-01 18:43 ` whaley at cs dot utsa dot edu
  2006-06-07 22:39 ` whaley at cs dot utsa dot edu
                   ` (58 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-06-01 18:43 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #12 from whaley at cs dot utsa dot edu  2006-06-01 18:43 -------
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3

Uros,


>gcc version 3.4.6
>vs.
>gcc version 4.2.0 20060601 (experimental)
>
>-fomit-frame-pointer -O -msse2 -mfpmath=sse
>There is a small performance drop on gcc-4.x, but nothing critical.
>I can confirm that the code indeed runs >50% slower on a 64bit Athlon. Perhaps
>the problem is in the ordering of instructions (Software Optimization Guide for
>AMD Athlon 64, Section 10.2). The gcc-3.4 code looks similar to the guide's
>example of how things should be done, and the gcc-4.2 code looks similar to the
>example of how things should _NOT_ be done.

Thanks for looking into this!  However, I am indeed aware that by using SSE2
you can get the double precision results fairly close to the x87 on most
platforms.
In fact, you can get gcc 4.1-sse within a few % of gcc 3-x87 on the Athlon 64
as well, by changing the kernel you feed gcc, and giving it these flags:
   -march=athlon64 -O2 -mfpmath=sse -msse -msse2 -m64 \ 
   -ftree-vectorize -fargument-noalias-global
(this doesn't make it vectorize, but I throw the flag for future hope :)

Now, sometimes you want to use the x87 unit because of its superior precision,
but the real problem with the approach of "ignore the x87 performance and
just use SSE" comes in single precision.  The performance of the best
kernel found by ATLAS in single precision using gcc4.1-sse is roughly half
of that of using the x87 unit on an Athlon-64, and 80% on a P4e (one reason
they are closer on the P4e is that the P4e's x87 peak is 1/2 that of the
Athlon [AMD machines can do 2 flops/cycle using the x87, whereas intel machines
can do only 1], so there's not as large a gap between excellent and
not-so-excellent kernels).  My guess (and it's only a guess) for the reason
scalar double-precision sse can compete and single cannot comes down to the
cost of doing scalar loads and stores.  In double, you can use movlpd instead of
movsd for a low-overhead vector load, but in single you must use movss, and
since movss is much more expensive than fld, scalar SSE always blows in
comparison to x87 . . .

So, that's why my error report concentrated on "x87 performance".  I submitted
in double precision because I had a preexisting Makefile/source demonstrating
the performance problem from a prior bug report (bugzilla 4991).  I think
we should not blow off the x87 performance even if SSE *was* competitive,
because there are times when the x87 is better.  However, in single precision,
scalar SSE is not competitive, at least on the platforms I have tried.  If you
guys are planning on deprecating the x87 unit when SSE is competitive on modern
machines, I can certainly rework the tarfile so I can send you a single
precision benchmark, so you can see the sse/x87 performance gap yourself.  Let me know
if you want this, as I'll need to do a bit of extra work.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (11 preceding siblings ...)
  2006-06-01 18:43 ` whaley at cs dot utsa dot edu
@ 2006-06-07 22:39 ` whaley at cs dot utsa dot edu
  2006-06-14  3:04 ` whaley at cs dot utsa dot edu
                   ` (57 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-06-07 22:39 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #13 from whaley at cs dot utsa dot edu  2006-06-07 22:28 -------
Subject: Re:  gcc 4 produces worse x87 code on all platforms than gcc 3

Guys,

Just got access to a CoreDuo machine, and tested things there.  I had to
do some hand-translation of the assembly, as I didn't have access to the
gnu compiler there, so there's the possibility of error, but it looks
to me like the Core likes the gcc 4 x87 code stream better than
gcc 3's, so I think you'll want to select amongst them according to
-march . . .  Core is a PIII-based architecture, so when I have a moment
I'll try to find a PIII that's still running to see if PIIIs in general
like that code stream, while P4s and Athlons like the gcc3 way of things . . .

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (12 preceding siblings ...)
  2006-06-07 22:39 ` whaley at cs dot utsa dot edu
@ 2006-06-14  3:04 ` whaley at cs dot utsa dot edu
  2006-06-24 18:11 ` whaley at cs dot utsa dot edu
                   ` (56 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-06-14  3:04 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #14 from whaley at cs dot utsa dot edu  2006-06-14 02:40 -------
OK, I got access to some older machines, and it appears that Core is the only
architecture that likes gcc 4's code.  More precisely, I have confirmed that
the following architectures run significantly slower using gcc4 than gcc 3:
Pentium-D, P4e, Pentium III, PentiumPRO, Athlon-64 X2, Opteron.

Any help appreciated,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (13 preceding siblings ...)
  2006-06-14  3:04 ` whaley at cs dot utsa dot edu
@ 2006-06-24 18:11 ` whaley at cs dot utsa dot edu
  2006-06-24 19:13 ` rguenth at gcc dot gnu dot org
                   ` (55 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-06-24 18:11 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #15 from whaley at cs dot utsa dot edu  2006-06-24 18:10 -------
Hi,

Can someone tell me if anyone is looking into this problem with the hopes of
fixing it?  I just noticed that despite the posted code demonstrating the
problem, and verification on: Pentium Pro, Pentium III, Pentium 4e, Pentium-D,
Athlon-64 X2 and Opteron, it is still marked as "new", and no one is assigned
to look at it  . . .

The reason I ask is that I am preparing the next stable release of ATLAS, and
I'm getting close to having to make a decision on what compilers I will
support.
If someone is working feverishly in the background, I will be sure to wait
for it, in the hopes that there'll be a fix that will allow me to use
gcc 4, which I think will be what most of my users want.  If this problem
is not being looked into, I should not delay the ATLAS release for it, and
just require my users to install gcc 3 in order to get decent performance.

I realize you guys are busy, and fp performance is probably not your main
concern, so hopefully this message sounds more like a request for info on what
is going on, than a bitch about help that I'm getting for free :)  

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (14 preceding siblings ...)
  2006-06-24 18:11 ` whaley at cs dot utsa dot edu
@ 2006-06-24 19:13 ` rguenth at gcc dot gnu dot org
  2006-06-25 13:35 ` whaley at cs dot utsa dot edu
                   ` (54 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2006-06-24 19:13 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #16 from rguenth at gcc dot gnu dot org  2006-06-24 19:00 -------
Don't hold your breath.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (15 preceding siblings ...)
  2006-06-24 19:13 ` rguenth at gcc dot gnu dot org
@ 2006-06-25 13:35 ` whaley at cs dot utsa dot edu
  2006-06-25 23:05 ` rguenth at gcc dot gnu dot org
                   ` (53 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-06-25 13:35 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #17 from whaley at cs dot utsa dot edu  2006-06-25 13:17 -------
OK, thanks for the reply.  I will assume gcc 4 won't be fixed in the near
future.  My guess is this will make icc the easier compiler choice for my users,
which I kind of hate, and is why I worked as much as I did on this report . . .

I hope you will consider adding the mmbench4s.tar.gz attachment above (the one
that runs both single and double precision) to the gcc regression tests. 
Notice that it caught this problem between 3 and 4, as well as a similar fp
performance drop between gcc 2 and 3 (bugzilla 4991).  The kernel here is
typical of those used in ATLAS, which is used by hundreds of thousands of
people worldwide.  I believe these kernels are also typical of pretty much any
register blocked fp code, so having them in the regression tests may help other
open source fp packages (eg, fftw, etc) as well.  Notice that closed-source
alternatives that ship binaries do not face this challenge, so having compiler
performance drop between releases gives them an advantage and can drive HPC users
(where performance dictates everything) to proprietary solutions.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (16 preceding siblings ...)
  2006-06-25 13:35 ` whaley at cs dot utsa dot edu
@ 2006-06-25 23:05 ` rguenth at gcc dot gnu dot org
  2006-06-26  1:12 ` whaley at cs dot utsa dot edu
                   ` (52 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: rguenth at gcc dot gnu dot org @ 2006-06-25 23:05 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #18 from rguenth at gcc dot gnu dot org  2006-06-25 20:05 -------
Unfortunately we don't have infrastructure for performance regression tests. 
Btw. did you check what happens if you do not unroll the innermost loop
manually but let -funroll-loops do it?  For me the performance is the same (but
I may have screwed up removing the unrolling).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (17 preceding siblings ...)
  2006-06-25 23:05 ` rguenth at gcc dot gnu dot org
@ 2006-06-26  1:12 ` whaley at cs dot utsa dot edu
  2006-06-26  7:53 ` uros at kss-loka dot si
                   ` (51 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-06-26  1:12 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #19 from whaley at cs dot utsa dot edu  2006-06-26 00:55 -------
Thanks for the info.  I'm sorry to hear that no performance regression tests
are done, but I guess it kind of explains why these problems reoccur :)

As to not unrolling, the fully unrolled case is almost always commandingly
better whenever I've looked at it.  After your note, I just tried on my P4,
using ATLAS's  P4 kernel, and I get (ku is inner loop unrolling, and nb=40, so
40 is fully unrolled):
  GCC 4 ku=1  : 1.65Gflop
  GCC 4 ku=40 : 1.84Gflop
  Gcc 3 ku=1  : 1.90Gflop
  Gcc 3 ku=40:  2.19Gflop

This is throwing the -funroll-loops flag.

BTW, gcc 4 w/o the -funroll-loops flag (ku=1) is indeed slower, at roughly 1.54
Gflop . . .

Anyway, I've never found the performance of gcc ku=1 competitive with ku=<fully
unrolled> on any machine.  Even in assembly, I have to fully unroll the inner
loop to get near peak on all intel machines.  On the Opteron, you can get
within 5% or so with a rolled loop in assembly, but I've not gotten a C code to
do that.  I think the gcc unrolling probably defaults to something like 4 or 8
(guess from performance, not verified): unrolling all the way (the loop is over
a compile-time constant) is the way to go . . .
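
(For reference, a sketch of the ku choice at the source level -- hypothetical
code, with the trip count shrunk to 4 to keep the example short; the real
kernel uses nb=40 and the generator writes out all nb statements:)

/* ku = 1: rolled inner K loop over a compile-time constant trip count */
static double dot_ku1(const double *a, const double *b)
{
    double c = 0.0;
    int k;
    for (k = 0; k < 4; k++)
        c += a[k] * b[k];
    return c;
}

/* ku = nb: the same loop fully unrolled by the code generator */
static double dot_ku4(const double *a, const double *b)
{
    double c = 0.0;
    c += a[0] * b[0];
    c += a[1] * b[1];
    c += a[2] * b[2];
    c += a[3] * b[3];
    return c;
}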

When you said competitive, did you mean that gcc 4 ku=1 was competitive with
gcc 4 ku=40 or gcc 3 ku=1?  If the latter, I find it hard to believe unless you
use SSE for gcc 4 and something unexpected happens.  Even so, if you are using
SSE, try it with the single precision kernel, where SSE cannot compete with the
x87 unit (even the broken one in gcc 4). 

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (18 preceding siblings ...)
  2006-06-26  1:12 ` whaley at cs dot utsa dot edu
@ 2006-06-26  7:53 ` uros at kss-loka dot si
  2006-06-26 16:02 ` whaley at cs dot utsa dot edu
                   ` (50 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: uros at kss-loka dot si @ 2006-06-26  7:53 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #20 from uros at kss-loka dot si  2006-06-26 06:31 -------
(In reply to comment #15)

> Can someone tell me if anyone is looking into this problem with the hopes of
> fixing it?  I just noticed that despite the posted code demonstrating the
> problem, and verification on: Pentium Pro, Pentium III, Pentium 4e, Pentium-D,
> Athlon-64 X2 and Opteron, it is still marked as "new", and no one is assigned
> to look at it  . . .

Hm, I tried your single precision testcase (SSE) on:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping        : 9
cpu MHz         : 3191.917
cache size      : 512 KB

And the results are a bit surprising (this is the exact output of your test):

/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2
-mfpmath=sse -DTYPE=float -c mmbench.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2
-mfpmath=sse -c sgemm_atlas.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2
-mfpmath=sse -o xsmm_gcc mmbench.o sgemm_atlas.o
rm -f *.o
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse
-DTYPE=float -c mmbench.c
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -c
sgemm_atlas.c
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -o
xsmm_gc4 mmbench.o sgemm_atlas.o
rm -f *.o
echo "GCC 3.x     single performance:"
GCC 3.x     single performance:
./xsmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.141     3072.00

echo "GCC 4.x     single performance:"
GCC 4.x     single performance:
./xsmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.141     3072.00

where:

"gcc (GCC) 3.4.6" was tested against "gcc version 4.2.0 20060608
(experimental)"

FYI: there is another pathological testcase (PR target/19780), where SSE code
is 30% slower on AMD64, despite the fact that for SSE, 16 xmm registers were
available and _no_ memory was accessed in a for loop.

> The reason I ask is that I am preparing the next stable release of ATLAS, and
> I'm getting close to having to make a decision on what compilers I will
> support.
> If someone is working feverishly in the background, I will be sure to wait
> for it, in the hopes that there'll be a fix that will allow me to use
> gcc 4, which I think will be what most of my users want.  If this problem
> is not being looked into, I should not delay the ATLAS release for it, and
> just require my users to install gcc 3 in order to get decent performance.
> 
> I realize you guys are busy, and fp performance is probably not your main
> concern, so hopefully this message sounds more like a request for info on what
> is going on, than a bitch about help that I'm getting for free :)  

Without any other information available, I can only speculate that perhaps the
gcc4 code does not fully utilize the multiple FP pipelines in the processors
you listed.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (19 preceding siblings ...)
  2006-06-26  7:53 ` uros at kss-loka dot si
@ 2006-06-26 16:02 ` whaley at cs dot utsa dot edu
  2006-06-27  6:05 ` uros at kss-loka dot si
                   ` (49 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-06-26 16:02 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #21 from whaley at cs dot utsa dot edu  2006-06-26 15:03 -------
Uros,

Thanks for the reply; I think some confusion has set in (see below) :)

>And the results are a bit surprising (this is the exact output of your test):

Note that you are running the opposite of my test case: SSE vs SSE rather than
x87 vs x87.  This whole bug report is about x87 performance.  You can get more
detail on why I want x87 in my messages above, particularly comment #11, but
single precision is indeed the place where SSE cannot compete with the x87
unit.  To see it, put the flags back the way I had them in the attachment, and
you'll see that gcc 3 is much faster.  Also, you should find in single
precision that the x87 unit soundly beats the SSE unit (unlike double
precision, where the gcc 3's x87 unit is only slightly faster than the best SSE
code).  I think the x87 will win even using gcc 4 for both compilations, even
though gcc 4's x87 support is crippled by its new register allocation scheme.

So, let me say what I think is going on here, and you can correct me if I've
gotten it wrong.  I think in this last timing you think you've found an
exception to the problem, but have forgotten we want to look at the x87 (which
is the fastest method in this case anyway).  Try it with my original flags
(essentially, throw '-mfpmath=387' instead of the sse flags), and you should
see that this gives far better performance using gcc 3 than any use of scalar
sse.  I think even gcc 4 will be better using its de-optimized x87 code,
because x87 is inherently better than scalar sse on these platforms.  There is
only one machine that likes gcc 4's new x87 register usage pattern of all
the ones I've tested, and that is the CoreDuo.

The issue is in x87 register usage: Gcc 4 saves a register, and does the FMUL
from memory rather than first loading the value to the fpstack, and on at least
the PentiumPRO, Pentium III, Pentium 4e, Pentium-D, Athlon-64 X2 and Opteron,
that drops your x87 (which is your best) performance significantly.

Note that given gcc 3's register usage, I think a simple peephole step can
transform it to gcc 4's, if you want to maintain that usage for CoreDuo. 
Unfortunately, going the other way requires an additional register, and the
load plays with your stack operands, so it is easier to keep gcc 3's way as the
default, and peephole to gcc 4's when on a machine that likes that usage
(currently, only the Core).

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (20 preceding siblings ...)
  2006-06-26 16:02 ` whaley at cs dot utsa dot edu
@ 2006-06-27  6:05 ` uros at kss-loka dot si
  2006-06-27 14:37 ` whaley at cs dot utsa dot edu
                   ` (48 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: uros at kss-loka dot si @ 2006-06-27  6:05 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #22 from uros at kss-loka dot si  2006-06-27 05:49 -------
(In reply to comment #21)

> Note that you are running the opposite of my test case: SSE vs SSE rather than
> x87 vs x87.  This whole bug report is about x87 performance.  You can get more
> detail on why I want x87 in my messages above, particularly comment #11, but
> single precision is indeed the place where SSE cannot compete with the x87
> unit.  To see it, put the flags back the way I had them in the attachment, and
> you'll see that gcc 3 is much faster.  Also, you should find in single

Hm, these are x87 results:

/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -DTYPE=float
-c mmbench.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c
sgemm_atlas.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -o xsmm_gcc
mmbench.o sgemm_atlas.o
rm -f *.o
/usr/local.uros/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -DTYPE=float -c
mmbench.c
/usr/local.uros/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c sgemm_atlas.c
/usr/local.uros/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -o xsmm_gc4
mmbench.o sgemm_atlas.o
rm -f *.o
echo "GCC 3.x     single performance:"
GCC 3.x     single performance:
./xsmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.141     3072.00

echo "GCC 4.x     single performance:"
GCC 4.x     single performance:
./xsmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.143     3029.92


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (21 preceding siblings ...)
  2006-06-27  6:05 ` uros at kss-loka dot si
@ 2006-06-27 14:37 ` whaley at cs dot utsa dot edu
  2006-06-27 17:47 ` whaley at cs dot utsa dot edu
                   ` (47 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-06-27 14:37 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #23 from whaley at cs dot utsa dot edu  2006-06-27 14:20 -------
Uros,

OK, I made the stupid assumption that the P4 would behave like the P4e,
should've known better :)

I got access to a Pentium 4 (family=15, model=2), and indeed I can repeat the
several surprising things you report:

   (1) SSE does as well as x87 on this platform
   (2) The difference between gcc 3 & 4 x87 performance is extremely minor
   (3) The code is amazingly optimal (roughly 95-96% of peak!)

The significance of (3) is that it tells us we are not in the bad case where
the kernel in question gets such crappy performance that all codes look alike. 
This performance was so good that I ran a tester to verify that we were still
getting the right answer, and indeed we are :)

On this platform I didn't install the compilers myself (the system had Red Hat's
gcc 4.0.2-8 and 3.3.6 installed), so I scoped the assembly, and indeed they have
the fmul difference that causes problems on the other x87 machines, so it is
really true that the Pentium 4 handles either instruction stream almost equally well
(not sure the 2% is significant; 2% is less than clock resolution, though in my
timings anytime there is a difference, gcc 4 always loses).

Here is the machine breakdown as measured now:
   LIKES GCC 4    DOESN'T CARE    LIKES GCC 3
   ===========    ============    ===========
   CoreDuo        Pentium 4       PentiumPRO
                                  Pentium III
                                  Pentium 4e
                                  Pentium D
                                  Athlon-64 X2
                                  Opteron

The only machine we are missing that I can think of is the K7 (i.e. original
Athlon, not Athlon-64).  I don't presently have access to a K7, but I can
probably find someone on the developer list who could run the test if you like.

The other thing that would be of interest is for each machine to chart the %
performance lost/gained.  Here, though, we want two numbers: % lost on simple
benchmark code (which is easy to repeat), and % lost with ATLAS code generator
(which compares each compiler's best case out of thousands to each other).  I
will undertake to get this first (quick to run) number for the machines so we
have some quantitative results to look at . . .  The ATLAS comparison is
probably more important, but takes so long that maybe I'll post it only for the
most problematic platforms (i.e., if the arch shows a big drop gcc3 v. gcc4,
see if the drop is that big when we ask ATLAS to auto-adapt to gcc4).

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (22 preceding siblings ...)
  2006-06-27 14:37 ` whaley at cs dot utsa dot edu
@ 2006-06-27 17:47 ` whaley at cs dot utsa dot edu
  2006-06-28 17:37 ` [Bug target/27827] [4.0/4.1/4.2 Regression] " steven at gcc dot gnu dot org
                   ` (46 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-06-27 17:47 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #24 from whaley at cs dot utsa dot edu  2006-06-27 16:44 -------
Guys,

OK, here is a table summarizing the performance you can see using the
mmbench4s.tar.gz.  I believe this covers a strong majority of the x86
architectures in use today (there are some specialty processors such as the
Pentium-M, Turion, Efficeon, etc. missing, but I don't think they are a big %
of the market).

In this table, I report the following for each machine and data precision:
  % Clock: % of clock rate achieved by best compiled version of gemm_atlas.c
           (rated in mflop).  Note, theoretical peak for intel machines is
           1 flop/clock, and is 2 flops/clock for AMD, which would correspond
           to 100% and 200% respectively.
  gcc4/3 : (gcc 4 x87 performance) / (gcc 3 x87 performance)
           so < 1 indicates slowdown, > 1 indicates speedup

NOTES:
(1) Pentium 4 is a model=2, while Pentium 4E is model=3.
(2) PPRO, PIII & P4e get bad % clock for double: this is because the
    static blocking factor in the benchmark (nb=60) exceeds the cache,
    which makes the gcc 4 #s look better than they are.
(3) In general, the % peak achieved by this kernel is large enough that
    I think it is truly indicative of the computational efficiency of the
    generated code.

                        double                 single
                    --------------         ---------------
MACHINES            %CLOCK  gcc4/3         %CLOCK  gcc4/3
===========         ======  ======         ======  ======
PentiumPRO            67.5    0.77           78.5    0.71
PentiumIII            47.6    0.95           81.4    0.69
Pentium 4             93.8    0.92           95.7    1.00
Pentium4e             72.8    0.75           80.4    0.80
Pentium-D             86.7    0.83           94.1    0.91
CoreDuo               85.8    1.01           94.9    1.11
Athlon-K7            137.8    0.62          139.1    0.63
Athlon-64 X2         160.0    0.58          165.5    0.60
Opteron              164.6    0.57          164.6    0.61
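
(A worked reading of the table -- the 2.0 GHz clock rate below is a made-up
example value, while the percentages are the Opteron double precision row; it
assumes %CLOCK refers to the better of the two compilers, here gcc 3:)

/* table_read.c: convert %CLOCK and the gcc4/3 ratio back to MFLOPS. */
#include <stdio.h>

int main(void)
{
    double clock_mhz = 2000.0;   /* hypothetical 2.0 GHz Opteron          */
    double pct_clock = 164.6;    /* Opteron, double, best (gcc 3) %CLOCK  */
    double gcc4_vs_3 = 0.57;     /* gcc 4 x87 perf / gcc 3 x87 perf       */

    double gcc3_mflops = clock_mhz * pct_clock / 100.0;  /* ~3292 MFLOPS */
    double gcc4_mflops = gcc3_mflops * gcc4_vs_3;        /* ~1876 MFLOPS */
    printf("gcc 3: %.0f MFLOPS, gcc 4: %.0f MFLOPS\n",
           gcc3_mflops, gcc4_mflops);
    return 0;
}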

The CoreDuo numbers above were generated by me on an OS X machine, where I
hand-translated Linux assembly to run, since I could not compile a stock gcc.  I
have a request out for results from a guy who has Linux/CoreDuo, and when I get
those I will update the results if necessary.  At that time, I will also post
an attachment with all the raw timing runs that I generated the table from.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1/4.2 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (23 preceding siblings ...)
  2006-06-27 17:47 ` whaley at cs dot utsa dot edu
@ 2006-06-28 17:37 ` steven at gcc dot gnu dot org
  2006-06-28 20:18 ` whaley at cs dot utsa dot edu
                   ` (45 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: steven at gcc dot gnu dot org @ 2006-06-28 17:37 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #25 from steven at gcc dot gnu dot org  2006-06-28 17:30 -------
Pure luck or not, this is a regression.


-- 

steven at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 GCC target triplet|i?86-*-*, x86_64-*-*        |i386, x86_64
   Last reconfirmed date|2006-06-01 08:43:34      |2006-06-28 17:30:40
            Summary|gcc 4 produces worse x87    |[4.0/4.1/4.2 Regression] gcc
                   |code on all platforms than  |4 produces worse x87 code on
                   |gcc 3                       |all platforms than gcc 3
   Target Milestone|---                         |4.1.2


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1/4.2 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (24 preceding siblings ...)
  2006-06-28 17:37 ` [Bug target/27827] [4.0/4.1/4.2 Regression] " steven at gcc dot gnu dot org
@ 2006-06-28 20:18 ` whaley at cs dot utsa dot edu
  2006-06-29  4:18 ` hjl at lucon dot org
                   ` (44 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-06-28 20:18 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #26 from whaley at cs dot utsa dot edu  2006-06-28 19:57 -------
Created an attachment (id=11773)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=11773&action=view)
raw runs the table is generated from

As promised, here is the raw data I built the table out of, including a new run
from the Linux/CoreDuo user, which does not materially change the table.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1/4.2 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (25 preceding siblings ...)
  2006-06-28 20:18 ` whaley at cs dot utsa dot edu
@ 2006-06-29  4:18 ` hjl at lucon dot org
  2006-06-29  6:43 ` whaley at cs dot utsa dot edu
                   ` (43 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: hjl at lucon dot org @ 2006-06-29  4:18 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #27 from hjl at lucon dot org  2006-06-29 02:32 -------
Created an attachment (id=11777)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=11777&action=view)
An integer loop

I changed the loop from double to long long.  The 64-bit integer code generated
by gcc 4.0 is 10% slower than gcc 3.4 on Nocona:

/usr/gcc-3.4/bin/gcc -m32 -fomit-frame-pointer -O -c mmbench.c
/usr/gcc-3.4/bin/gcc -m32 -fomit-frame-pointer -O -c gemm_atlas.c
/usr/gcc-3.4/bin/gcc -m32 -fomit-frame-pointer -O -o xmm_gcc mmbench.o
gemm_atlas.o
rm -f *.o
/usr/gcc-4.0/bin/gcc -m32 -fomit-frame-pointer -O -c mmbench.c
/usr/gcc-4.0/bin/gcc -m32 -fomit-frame-pointer -O -c gemm_atlas.c
/usr/gcc-4.0/bin/gcc -m32 -fomit-frame-pointer -O -o xmm_gc4 mmbench.o
gemm_atlas.o
rm -f *.o
echo "GCC 3.x     performance:"
GCC 3.x     performance:
./xmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60    250       0.381      283.51

echo "GCC 4.x     performance:"
GCC 4.x     performance:
./xmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60    250       0.389      277.68

gnu-16:pts/2[5]> make                                     ~/bugs/gcc/27827/loop
/usr/gcc-3.4/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c mmbench.c
/usr/gcc-3.4/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c gemm_atlas.c
/usr/gcc-3.4/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -o xmm_gcc mmbench.o
gemm_atlas.o
rm -f *.o
/usr/gcc-4.0/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c mmbench.c
/usr/gcc-4.0/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -c gemm_atlas.c
/usr/gcc-4.0/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -o xmm_gc4 mmbench.o
gemm_atlas.o
rm -f *.o
echo "GCC 3.x     performance:"
GCC 3.x     performance:
./xmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.172     2512.01

echo "GCC 4.x     performance:"
GCC 4.x     performance:
./xmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.193     2238.68

So the problem may also be loop-related.
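
(For readers without the attachment: the change described above amounts to
something like the sketch below.  This is illustrative only, not the attached
gemm_atlas.c; the names are hypothetical.)

/* Sketch of an integer multiply-accumulate inner loop: the same shape as
   the fp kernel's inner loop, but with the data type switched from double
   to long long so that no x87 code is generated.                         */
void imm_kernel(int N, const long long *A, const long long *B, long long *C)
{
   int i;
   long long c0 = *C;
   for (i = 0; i < N; i++)
      c0 += A[i] * B[i];      /* integer multiply-accumulate */
   *C = c0;
}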


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1/4.2 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (26 preceding siblings ...)
  2006-06-29  4:18 ` hjl at lucon dot org
@ 2006-06-29  6:43 ` whaley at cs dot utsa dot edu
  2006-07-04 13:15 ` whaley at cs dot utsa dot edu
                   ` (42 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-06-29  6:43 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #28 from whaley at cs dot utsa dot edu  2006-06-29 04:17 -------
Guys,

If you are looking for the reason that the new code might be slower, my feeling
from the benchmark data is that it involves hiding the cost of the loads.
Notice that, except for the cases where the double-precision working set
exceeds the cache, the single precision gcc4 code always gets a greater
percentage of gcc3's numbers than double does for each platform.  This is the
opposite of what you expect if the problem is purely computational, but exactly
what you expect if the problem is due to memory costs (since single has half
the memory cost).  If I were forced to take a WAG as to what's going on, I
would guess it has to do with the additional dependencies in the new code
sequence confusing Tomasulo's algorithm or the register renaming.  I haven't
worked it out in detail, but scope the two competing code sequences:

   gcc 3                gcc 4
   ===========          =======
   fldl 32(%edx)        fldl 32(%edx)
   fldl 32(%eax)        fld %st(0)
   fmul %st(1),%st      fmull 32(%eax)
   faddp %st,%st(6)     faddp %st, %st(2)

Note that in gcc 3, both loads are independent, and can be moved past each
other and arbitrarily early in the instruction stream.  The fmull would need to
be broken into two instructions before a similar freedom occurs.  I'm not sure
how the fp stack handling is done in hardware, but the fact that you've
replaced two independent loads with 3 forced-order instructions cannot be
beneficial.  At the same time, it is difficult for me to see how the new
sequence can be better.  We've got the same number of loads, the same number of
instructions, the same register use (I think), with a forced ordering and loads
you cannot advance (critical in load-happy 8-register land).  I originally
thought that the gcc 4 stream used one fewer register, but it appears to copy
the edx operand onto the stack twice, so I'm no longer sure it has even that
advantage.

Just my guess,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1/4.2 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (27 preceding siblings ...)
  2006-06-29  6:43 ` whaley at cs dot utsa dot edu
@ 2006-07-04 13:15 ` whaley at cs dot utsa dot edu
  2006-07-05 17:55 ` mmitchel at gcc dot gnu dot org
                   ` (41 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-07-04 13:15 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #29 from whaley at cs dot utsa dot edu  2006-07-04 13:15 -------
Guys,

The integer and fp differences do not appear to be strongly related.  In
particular, on my P4e, gcc 4's integer code is actually faster than gcc 3's. 
Further, if you look at the assemblies of the integer code, it does not have
the extra dependencies that gcc 4's x87 code has.  In the integer code, both
gcc 3 and 4 explicitly load all operands into registers.  I haven't scoped it
in detail, but the
main difference appears to be in scheduling, with gcc 3 performing a bunch of
loads, then a bunch of computations, and gcc 4 intermixing them more.

So, we'd need a new series of runs to see which integer schedule is better, but
the integer code should not be studied to solve the x87 problem.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1/4.2 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (28 preceding siblings ...)
  2006-07-04 13:15 ` whaley at cs dot utsa dot edu
@ 2006-07-05 17:55 ` mmitchel at gcc dot gnu dot org
  2006-08-04  7:46 ` bonzini at gnu dot org
                   ` (40 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: mmitchel at gcc dot gnu dot org @ 2006-07-05 17:55 UTC (permalink / raw)
  To: gcc-bugs



-- 

mmitchel at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1/4.2 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (29 preceding siblings ...)
  2006-07-05 17:55 ` mmitchel at gcc dot gnu dot org
@ 2006-08-04  7:46 ` bonzini at gnu dot org
  2006-08-04 16:24 ` whaley at cs dot utsa dot edu
                   ` (39 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: bonzini at gnu dot org @ 2006-08-04  7:46 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #30 from bonzini at gnu dot org  2006-08-04 07:45 -------
Can you try this patch?  My only i686 machine is neutral to this problem.

I'm a bit worried about the Core Duo thing, but my hope is that other changes
between GCC 3 and GCC 4 improved performance on all machines, and Core Duo is
the only processor that does not see the performance loss introduced by "fld
%st".

I'm currently bootstrapping and regtesting the patch; a minimal testcase is
here:

/* { dg-do compile } */
/* { dg-options "-O2" } */

double a, b;
double f(double c)
{
  double x = a * b;
  return x + c * a;
}

/* { dg-final { scan-assembler-not "fld\[ \t\]*%st" } } */
/* { dg-final { scan-assembler "fmul\[ \t\]*%st" } } */

Without patch:
        fldl    a
        fld     %st(0)
        fmull   b
        fxch    %st(1)
        fmull   4(%esp)
        faddp   %st, %st(1)
        ret

With patch:
        fldl    a
        fldl    4(%esp)
        fmul    %st(1), %st
        fxch    %st(1)
        fmull   b
        faddp   %st, %st(1)
        ret

Index: i386.md
===================================================================
--- i386.md     (revision 115412)
+++ i386.md     (working copy)
@@ -18757,6 +18757,32 @@
   [(set_attr "type" "sseadd")
    (set_attr "mode" "DF")])

+;; Make two stack loads independent:
+;;   fld aa              fld aa
+;;   fld %st(0)     ->   fld bb
+;;   fmul bb             fmul %st(1), %st
+;;
+;; Actually we only match the last two instructions for simplicity.
+(define_peephole2
+  [(set (match_operand 0 "fp_register_operand" "")
+       (match_operand 1 "fp_register_operand" ""))
+   (set (match_dup 0)
+       (match_operator 2 "binary_fp_operator"
+          [(match_dup 0)
+           (match_operand 3 "memory_operand" "")]))]
+  "REGNO (operands[0]) != REGNO (operands[1])"
+  [(set (match_dup 0) (match_dup 3))
+   (set (match_dup 0) (match_dup 4))]
+
+  ;; The % modifier is not operational anymore in peephole2's, so we have to
+  ;; swap the operands manually in the case of addition and multiplication.
+  "if (COMMUTATIVE_ARITH_P (operands[2]))
+     operands[4] = gen_rtx_fmt_ee (GET_CODE (operands[2]), GET_MODE (operands[2]),
+                                operands[0], operands[1]);
+   else
+     operands[4] = gen_rtx_fmt_ee (GET_CODE (operands[2]), GET_MODE (operands[2]),
+                                operands[1], operands[0]);")
+
 ;; Conditional addition patterns
 (define_expand "addqicc"
   [(match_operand:QI 0 "register_operand" "")


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bonzini at gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1/4.2 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (30 preceding siblings ...)
  2006-08-04  7:46 ` bonzini at gnu dot org
@ 2006-08-04 16:24 ` whaley at cs dot utsa dot edu
  2006-08-05  7:21 ` bonzini at gnu dot org
                   ` (38 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-04 16:24 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #31 from whaley at cs dot utsa dot edu  2006-08-04 16:24 -------
Paolo,

Thanks for the update.  I attempted to apply this patch, but apparently I
failed, as it made absolutely no difference.  I mean, not only did it not
change performance, but if you diff the assembly, you get only 4 lines
different (version numbers and use of ffreep rather than fstp).  Here is what I
did:
>   59  10:29   cd gcc-4.1.1/
>   60  10:30   pushd gcc/config/i386/
>   62  10:30   patch < ~/x87patch
>   64  10:31   cd ../../..
>   67  10:31   mkdir MyObj
>   68  10:31   cd MyObj/
>   71  10:32   ../configure --prefix=/home/whaley/local/gcc4.1.1p1 --enable-languages=c,fortran
>   72  10:32   make
>   73  10:58   make install

I did this on my P4e (IA32) and Athlon64 X2 (x86-64) machines.  I did have to
hand-edit the patch, due to line breaks in mouse-copying from the webpage (it
wouldn't apply until I did that), so maybe that is the problem.

Can you grab the mmbench4s.tar.gz attachment, and point its Makefile at your
modified compiler, and tell it "make assall", and see if the generated dmm_4.s
and smm_4.s are different than what you get with stock 4.1.1?  If so, post them
as attachments, and I can probably hack the benchmark to load the assembly, as
I did on the Core.

Assuming they are different, maybe you can check that this is the only patch I
need to make?  If it is, is there something wrong with the way I applied it? 
If not, maybe you should post the patch file as an attachment so we can rule
out copying error . . .

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1/4.2 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (31 preceding siblings ...)
  2006-08-04 16:24 ` whaley at cs dot utsa dot edu
@ 2006-08-05  7:21 ` bonzini at gnu dot org
  2006-08-05 14:24 ` whaley at cs dot utsa dot edu
                   ` (37 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: bonzini at gnu dot org @ 2006-08-05  7:21 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #32 from bonzini at gnu dot org  2006-08-05 07:21 -------
It works for me.

GCC 4.x double      60   1000       0.208     2076.79
GCC patch double    60   1000       0.168     2571.28

GCC 4.x single      60   1000       0.188     2297.74
GCC patch single    60   1000       0.152     2841.94


Assembly changes are as follows: < is without my patch, > is with it.

---

21,22c21,22
<       fld     %st(0)
<       fmuls   (%eax)
---
>       flds    (%eax)
>       fmul    %st(1), %st
25,26c25,26
<       fld     %st(2)
<       fmuls   240(%eax)
---
>       flds    240(%eax)
>       fmul    %st(3), %st
28,29c28,29
<       fld     %st(3)
<       fmuls   480(%eax)
---
>       flds    480(%eax)
>       fmul    %st(4), %st


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|unassigned at gcc dot gnu   |bonzini at gnu dot org
                   |dot org                     |
             Status|NEW                         |ASSIGNED
   Last reconfirmed|2006-06-28 17:30:40         |2006-08-05 07:21:46
               date|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1/4.2 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (32 preceding siblings ...)
  2006-08-05  7:21 ` bonzini at gnu dot org
@ 2006-08-05 14:24 ` whaley at cs dot utsa dot edu
  2006-08-05 17:16 ` bonzini at gnu dot org
                   ` (36 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-05 14:24 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #33 from whaley at cs dot utsa dot edu  2006-08-05 14:24 -------
Paolo,

Can you post the assembly and the patch as attachments?  If necessary, I can
hack the benchmark to call the assembly routines on a couple of platforms. 
Also, did you see what I did wrong in applying the patch?

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1/4.2 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (33 preceding siblings ...)
  2006-08-05 14:24 ` whaley at cs dot utsa dot edu
@ 2006-08-05 17:16 ` bonzini at gnu dot org
  2006-08-05 18:26 ` whaley at cs dot utsa dot edu
                   ` (35 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: bonzini at gnu dot org @ 2006-08-05 17:16 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #34 from bonzini at gnu dot org  2006-08-05 17:15 -------
Created an attachment (id=12019)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=12019&action=view)
MMBENCH4s.tar.gz + assembly without and with patch

I don't know what was wrong, but you can now fetch the patch yourself from
http://gcc.gnu.org/ml/gcc-patches/2006-08/msg00113.html

Anyway, here's your .tar.gz now including the .s files (and the Makefile points
to my gcc's).  ?mm_3.s is the unpatched GCC 4.2, ?mm_4.s is the patched one.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1/4.2 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (34 preceding siblings ...)
  2006-08-05 17:16 ` bonzini at gnu dot org
@ 2006-08-05 18:26 ` whaley at cs dot utsa dot edu
  2006-08-06 15:03 ` [Bug target/27827] [4.0/4.1 " whaley at cs dot utsa dot edu
                   ` (34 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-05 18:26 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #35 from whaley at cs dot utsa dot edu  2006-08-05 18:26 -------
Created an attachment (id=12020)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=12020&action=view)
new Makefile targets

OK, this is same benchmark again, now creating MMBENCHS directory.  In addition
to the ability to make single & double, also has ability to build executables
from assembly files (see "asgexe" target of Makefile)


-- 

whaley at cs dot utsa dot edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #11541|0                           |1
        is obsolete|                            |
  Attachment #11571|0                           |1
        is obsolete|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (35 preceding siblings ...)
  2006-08-05 18:26 ` whaley at cs dot utsa dot edu
@ 2006-08-06 15:03 ` whaley at cs dot utsa dot edu
  2006-08-07  6:19 ` bonzini at gnu dot org
                   ` (33 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-06 15:03 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #36 from whaley at cs dot utsa dot edu  2006-08-06 15:03 -------
Paolo,

Thanks for working on this.  We are making progress, but I have some mixed
results.  I timed the assemblies you provided directly.  I added a target
"asgexe" that builds the same benchmark, assuming assembly source instead of C,
to make this more reproducible.  I ran on the Athlon-64X2, where your new
assembly ran *faster* than gcc 3 for double precision.  However, you still lost
for single precision.  I believe the reason is that you still have more
fmuls/fmull (fmul from memory) than does gcc 3:

>animal>fgrep -i fmuls smm_4.s | wc
>    240     480    4051
>animal>fgrep -i fmuls smm_asg.s | wc
>     60     120    1020
>animal>fgrep -i fmuls smm_3.s  | wc
>      0       0       0
>animal>fgrep -i fmull dmm_4.s | wc
>    100     200    1739
>animal>fgrep -i fmull dmm_asg.s | wc
>     20      40     360
>animal>fgrep -i fmuls dmm_3.s | wc
>      0       0       0


I haven't really scoped out the dmm diff, but in single prec anyway, these
dreaded fmuls are in the inner loop, and this is probably why you are still
losing.  I'm guessing your peephole is missing some cases, and for some reason
is missing more under single.  Any ideas?

As for your assembly actually beating gcc 3 for double, my guess is that it is
some other optimization that gcc 4 has, and you will win by even more once the
final fmulls are removed . . .

On the P4e, your double precision code is faster than stock gcc 4, but still
slower than gcc3.  Again, I suspect the remaining fmull.  Then comes the thing
I cannot explain at all.  Your single precision results are horrible.  gcc 3
gets 1991 MFLOPS, gcc 4 gets 1664, and the assembly you sent gets 34!  No chance
the mixed fld/fmuls is causing stack overflow, I guess?  I think this might
account for such a catastrophic drop  . . .  That's about the only WAG I've got
for this behavior.

Anyway, I think the first order of business may be to get your peephole to
grab all the cases, and see if that makes you win everywhere on Athlon, and
if it makes single precision P4e better, and we can go from there . . .

If you do that, attach the assemblies again, and I'll redo timings.  Also, if
you could attach the patch (rather than putting it in a comment), it'd be nice
to build the compiler myself, so I could test x86-64 code on the Athlon, etc.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (36 preceding siblings ...)
  2006-08-06 15:03 ` [Bug target/27827] [4.0/4.1 " whaley at cs dot utsa dot edu
@ 2006-08-07  6:19 ` bonzini at gnu dot org
  2006-08-07 15:32 ` whaley at cs dot utsa dot edu
                   ` (32 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: bonzini at gnu dot org @ 2006-08-07  6:19 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #37 from bonzini at gnu dot org  2006-08-07 06:19 -------
I don't see how the last fmul[sl] can be removed without increasing code size. 
The only way to fix it would be to change the machine description to say that
"this processor does not like FP operations with a memory operand".  With a
peephole, this is as good as we can get it.  The last fmul is not coupled with
a "fld %st" because it consumes the stack entry.  See in comment #30, where
there is still a "fmull b".

Can you please try re-running the tests?  It takes skill^W^W seems quite weird
to have a 100x slow-down, also because my tests were run on a similar Prescott
(P4e).

It also would be interesting to re-run your code generator on a compiler built
from svn trunk.  If it can provide higher performance, you'd be satisfied I
guess even if it comes from a different kernel.  Also, I strongly believe that
you should implement vectorization, or at least find out *why* GCC does not
vectorize your code.  It may be simply that it does not have any guarantee on
the alignment.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (37 preceding siblings ...)
  2006-08-07  6:19 ` bonzini at gnu dot org
@ 2006-08-07 15:32 ` whaley at cs dot utsa dot edu
  2006-08-07 16:47 ` whaley at cs dot utsa dot edu
                   ` (31 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-07 15:32 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #38 from whaley at cs dot utsa dot edu  2006-08-07 15:32 -------
Paolo,

Thanks for all the help.  I'm not sure I understand everything perfectly
though, so there's some questions below . . .

>I don't see how the last fmul[sl] can be removed without increasing code size.

Since the flags are asking for performance, not size optimization, this should
only be an argument if the fmul[s,l]'s are performance-neutral.  A lot of
performance optimizations increase code size, after all . . .  Obviously, no
fmul[sl] is possible, since gcc 3 achieves it.  However, I can see that the
peephole phase might not be able to change the register usage.

>Can you please try re-running the tests?  It takes skill^W^W

Yes, I found the results confusing as well, which is why I reran them 50 times
before posting.  I also posted the tarfile (with Makefile and assemblies) that
built them, so that my mistakes could be caught by someone with more skill. 
Just as a check, maybe you can confirm the .s you posted is the right one?  I
can't find the loads of the matrix C anywhere in its assembly, and I can find
them in the double version  . . .  Anyway, I like your suggestion (below) of
getting the compiler so we won't have to worry about assemblies, so that's
probably the way to go.  On this front, is there some reason you cannot post
the patch(es) as attachments, just to rule out copy problems, as I've asked in
the last several messages?  Note there's no need if I can grab your stuff from
SVN,
as below . . .

>because my tests were run on a similar Prescott (P4e)

You didn't post the gcc 3 performance numbers.  What were those like?  If
you beat/tied gcc 3, then the remaining fmul[l,s] are probably not a big
deal.  If gcc 3 is still winning, on the other hand . . .

>It also would be interesting to re-run your code generator on a compiler built from svn trunk.

Are your changes on a branch I could check out?  If so, give me the commands to
get that branch, as we are scoping assemblies only because of the patching
problem.  Having a full compiler would indeed enable more detailed
investigations, including turning the full code generator loose on the
improved compiler.

>Also, I strongly believe that you should implement vectorization,

ATLAS implements vectorization, by writing the entire GEMM kernel in assembly
and directly using SSE.  However, there are cases where generated C code must
be called, and that's where gcc comes in . . .

>or at least find out *why* GCC does not vectorize your code. It may be simply that it does not have any guarantee on the alignment.

I'm all for this.  info gcc says that w/o a guarantee of alignment, loops are
duped, with an if selecting between vector and scalar loops, is this not
accurate?  I spent a day trying to get gcc to vectorize any of the generator's
loops, and did not succeed (can you make it vectorize the provided benchmark
code?).  I also tried various unrollings of the inner loop, particularly no
unrolling and unroll=2 (vector length).  I was unable to truly decipher the
warning messages explaining the lack of vectorization, and I would truly
welcome some help in fixing this.

This is a separate issue from the x87 code, and this tracker item is already
fairly complex :) I'm assuming if I attempted to open a bug report of "gcc
will not vectorize atlas's generated code" it would be closed pretty quickly. 
Maybe you can recommend how to approach this, or open another report that we
can exchange info on?  I would truly appreciate the opportunity to get some
feedback from gcc authors to help guide me to solving this problem.

Thanks for all the info,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (38 preceding siblings ...)
  2006-08-07 15:32 ` whaley at cs dot utsa dot edu
@ 2006-08-07 16:47 ` whaley at cs dot utsa dot edu
  2006-08-07 16:58 ` paolo dot bonzini at lu dot unisi dot ch
                   ` (30 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-07 16:47 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #39 from whaley at cs dot utsa dot edu  2006-08-07 16:47 -------
Paolo,

OK, never mind about all the questions on assembly/patches/SVN/gcc3 perf: I
checked out the main branch, and vi'd the patched file, and I see that your
patch is there.  I am presently building the SVN gcc on several machines, and
will be posting results/issues as they come in . . .

I would still be very interested in advice on approaching the vectorization
problem as discussed at the end of the mail.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (39 preceding siblings ...)
  2006-08-07 16:47 ` whaley at cs dot utsa dot edu
@ 2006-08-07 16:58 ` paolo dot bonzini at lu dot unisi dot ch
  2006-08-07 17:19 ` whaley at cs dot utsa dot edu
                   ` (29 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: paolo dot bonzini at lu dot unisi dot ch @ 2006-08-07 16:58 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #40 from paolo dot bonzini at lu dot unisi dot ch  2006-08-07 16:58 -------
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


>> I don't see how the last fmul[sl] can be removed without increasing code size.
>>     
> However, I can see that the
> peephole phase might not be able to change the register usage.
Actually, the peephole phase may not change the register usage, but it 
could use a scratch register if available.  But it would be much more 
controversial (even if backed by your hard numbers on ATLAS) to state 
that splitting fmul[sl] to fld[sl]+fmul is always beneficial, unless 
there is some manual telling us exactly that... for example it would be 
a different story if it could give higher scheduling freedom (stuff like 
VectorPath vs. DirectPath on Athlons), and if we could figure out on 
which platforms it improves performance.
> On this front, is there some reason you cannot post
> the patch(es) as attachments, just to rule out copy problems, as I've asked in
> last several messages?  Note there's no need if I can grab your stuff from SVN,
> as below . . .
>   
You already found out about this :-P

Unfortunately I mistyped the PR number when I committed the patch; I 
meant for the commit to appear in the audit trail, so that you'd have seen 
that I had committed it.
>> because my tests were run on a similar Prescott (P4e)
>>     
> You didn't post the gcc 3 performance numbers.  What were those like?  If
> you beat/tied gcc 3, then the remaining fmul[l,s] are probably not a big
> deal.  If gcc 3 is still winning, on the other hand . . .
>   
I don't have GCC 3 on that machine.

Paolo


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (40 preceding siblings ...)
  2006-08-07 16:58 ` paolo dot bonzini at lu dot unisi dot ch
@ 2006-08-07 17:19 ` whaley at cs dot utsa dot edu
  2006-08-07 18:19 ` paolo dot bonzini at lu dot unisi dot ch
                   ` (28 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-07 17:19 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #41 from whaley at cs dot utsa dot edu  2006-08-07 17:19 -------
Paolo,

>Actually, the peephole phase may not change the register usage, but it
>could peruse a scratch register if available.  But it would be much more
>controversial (even if backed by your hard numbers on ATLAS) to state
>that splitting fmul[sl] to fld[sl]+fmul is always beneficial, unless

We'll have to see how this is in x87 code.  I have experience with it in SSE,
where doing it is fully a target issue.  For instance, the P4E likes you to
avoid the explicit load on the end, whereas the Hammer prefers the explicit load.
 If I recall right, there is a *slight* advantage on the intel to the from-mem
instruction, but I can't remember how much difference doing the separate
load/use made on the AMD.  We should get some idea by comparing gcc3 vs. your
patched compiler on the various platforms, though other gcc3/4 changes will
cloud the picture somewhat . . .

If this kind of machine difference in optimality holds true for x87 as well, I
assume a new peephole phase that looks for the scratch register could be called
if the appropriate -march were thrown?

Speaking of -march issues, when I get a compiler build that gens your new code,
I will pull the assembly trick to try it on the CoreDuo as well.  If the new
code is worse, you could probably avoid calling your present peephole when that
-march is thrown?

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (41 preceding siblings ...)
  2006-08-07 17:19 ` whaley at cs dot utsa dot edu
@ 2006-08-07 18:19 ` paolo dot bonzini at lu dot unisi dot ch
  2006-08-07 20:35 ` dorit at il dot ibm dot com
                   ` (27 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: paolo dot bonzini at lu dot unisi dot ch @ 2006-08-07 18:19 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #42 from paolo dot bonzini at lu dot unisi dot ch  2006-08-07 18:19 -------
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> We should get some idea by comparing gcc3 vs. your
> patched compiler on the various platforms, though other gcc3/4 changes will
> cloud the picture somewhat . . .
>   
That's why you should compare 4.2 before and after my patch, instead.
> If this kind of machine difference in optimality holds true for x87 as well, I
> assume a new peephole phase that looks for the scratch register could be called
> if the appropriate -march were thrown?
>   
Or you can disable the fmul[sl] instructions altogether.
> Speaking of -march issues, when I get a compiler build that gens your new code,
> I will pull the assembly trick to try it on the CoreDuo as well.  If the new
> code is worse, you can probably not call your present peephole if that -march
> is thrown?
>   
I'd find it very strange.  It is more likely that the Core Duo has a 
more powerful scheduler (maybe the micro-op fusion thing?) that does not 
dislike fmul[sl].


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (42 preceding siblings ...)
  2006-08-07 18:19 ` paolo dot bonzini at lu dot unisi dot ch
@ 2006-08-07 20:35 ` dorit at il dot ibm dot com
  2006-08-07 21:57 ` whaley at cs dot utsa dot edu
                   ` (26 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: dorit at il dot ibm dot com @ 2006-08-07 20:35 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #43 from dorit at il dot ibm dot com  2006-08-07 20:35 -------
> I'm all for this.  info gcc says that w/o a guarantee of alignment, loops are
> duped, with an if selecting between vector and scalar loops, is this not
> accurate?  

yes

>I spent a day trying to get gcc to vectorize any of the generator's
> loops, and did not succeed (can you make it vectorize the provided benchmark
> code?).  

The aggressive unrolling in the provided example seems to be the first obstacle
to vectorizing the code.
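
(To illustrate, not taken from the benchmark: a loop left in this simple,
un-unrolled shape is the kind the analysis can handle, whereas a heavily
hand-unrolled body is not.  The function below is a made-up example.)

/* Illustration only (not the ATLAS kernel): an un-unrolled, elementwise
   loop like this is the shape the vectorizer expects to see.            */
void saxpy(int N, float alpha, const float *X, float *Y)
{
   int i;
   for (i = 0; i < N; i++)
      Y[i] += alpha * X[i];
}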

> I also tried various unrollings of the inner loop, particularly no
> unrolling and unroll=2 (vector length).  I was unable to truly decipher the
> warning messages explaining the lack of vectorization, and I would truly
> welcome some help in fixing this.

I'd be happy to help decipher the vectorizer's dump file. please send the
un-unrolled version and the dump file generated by -fdump-tree-vect-details,
and I'll see if I can help.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (43 preceding siblings ...)
  2006-08-07 20:35 ` dorit at il dot ibm dot com
@ 2006-08-07 21:57 ` whaley at cs dot utsa dot edu
  2006-08-08  2:59 ` whaley at cs dot utsa dot edu
                   ` (25 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-07 21:57 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #44 from whaley at cs dot utsa dot edu  2006-08-07 21:56 -------
Guys,

OK, the mystery of why my hand-patched gcc didn't work is now cleared up.  My
first clue was that the SVN-built gcc didn't work either!  It turns out your peephole
opt is only done if I throw the flag -O3 rather than -O, which is what my
tarfile used.  Any reason it's done at only the high levels, since it makes
such a performance difference?

FYI, in gcc3 -O gets better performance than -O3, which is why those are my
default flags.  However, it appears that gcc4 gets very nice performance with
-O3.  It's fairly common for -O to give better performance than -O3, though
(since the ATLAS code is already aggressively optimized, gcc's maximum
optimization levels often de-optimize an already-optimal code), so turning this
on at the default level, or
being able to turn it off and on manually would be ideal . . .

>That's why you should compare 4.2 before and after my patch, instead.

Yeah, except 4.2 w/o your patch has horrible performance.  Our goal is not to
beat horrible performance, but rather to get good performance!  Gcc 3 provides
a measure of good performance.  However, I take your point that it'd be nice to
see the new stuff put a headlock on the crap performance, so I include that
below as well :)

Here's some initial data.  I report MFLOPS achieved by the kernel as compiled
by: gcc3 (usually gcc 3.2 or 3.4.3), gccS (current SVN gcc), and gcc4 (usually
gcc 4.1.1).  I will try to get more data later, but this is pretty suggestive,
IMHO.

                              DOUBLE            SINGLE
              PEAK        gcc3/gccS/gcc4    gcc3/gccS/gcc4
              ====        ==============    ==============
Pentium-D :   2800        2359/2417/2067    2685/2684/2362
Ath64-X2  :   5600        3677/3585/2102    3680/3914/2207
Opteron   :   3200        2590/2517/1507    2625/2800/1580

So, it appears to me we are seeing the same pattern I previously saw in my
hand-tuned SSE code: Intel likes the new pattern of doing the last load as part
of the FMUL instruction, but AMD is hampered by it.  Note that gccS is the best
compiler for both single & double on the Intel. On both AMD machines, however,
it wins only for single, where the cost of the load is lower.  It loses to gcc3
for double, where load performance more completely determines matmul
performance.  This is consistent with the view that gcc 4 does some other
optimizations better than gcc 3, and so if we got the fmull removed, gcc 4 would
win for all precisions . . .

Don't get me wrong, your patch has already removed the emergency: in the worst
case so far you are less than 3% slower.  However, I suspect if we added the
optional (for AMD chips only) peephole step to get rid of all possible
fmul[s,l], then we'd win for double, and win even more for single on AMD chips
. . .  So, any chance of an AMD-only or flag-controlled peephole step to get
rid of the last fmul[s,l]?

>Or you can disable the fmul[sl] instructions altogether.

As I mentioned, my own hand-tuning has indicated that the final fmul[sl] is
good for Intel netburst archs, but bad for AMD hammer archs.

I'll see about posting some vectorization data ASAP.  Can someone create a new
bug report so that the two threads of inquiry don't get mixed up, or do you
want to just intermix them here?

Thanks,
Clint

P.S.: I tried to run this on the Core by hand-translating gccS-genned assembly
to OS X assembly.  The double precision gccS runs at the same speed as Apple's
gcc.  However, the single precision is an order of magnitude slower, as I
experienced this morning on the P4E.  This is almost certainly an error in my
makefile, but damned if I can find it.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (44 preceding siblings ...)
  2006-08-07 21:57 ` whaley at cs dot utsa dot edu
@ 2006-08-08  2:59 ` whaley at cs dot utsa dot edu
  2006-08-08  6:15 ` hubicka at gcc dot gnu dot org
                   ` (24 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-08  2:59 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #45 from whaley at cs dot utsa dot edu  2006-08-08 02:59 -------
Guys,

OK, with Dorit's -fdump-tree-vect-details, I made a little progress on
vectorization.  In order to get vectorization to work, I had to add the flag
'-funsafe-math-optimizations'.  I will try to create a tarfile with everything
tomorrow so you guys can see all the output, but is it normal to need to throw
this to get vectorization?  SSE is IEEE compliant (unless you turn it off), and
ATLAS needs to stay IEEE, so I can't turn on unsafe-math-opt in general . . .

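(Presumably the flag is needed because the kernel's inner loop accumulates into
scalars, i.e. it is a reduction, and vectorizing it reorders the additions.  A
sketch of the shape involved, not the generated code itself:)

/* Sketch, not the generated kernel: accumulating into a scalar like this
   is a reduction.  A vectorized version computes partial sums in a
   different order than the serial code, which is why the vectorizer asks
   for -funsafe-math-optimizations before transforming it.               */
double ddot(int N, const double *X, const double *Y)
{
   int i;
   double sum = 0.0;
   for (i = 0; i < N; i++)
      sum += X[i] * Y[i];
   return sum;
}
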
With these flags, gcc can vectorize the kernel if I do no unrolling at all.  I
have not yet run the full search with these flags, but I've done quite a few
hand-called cases, and the performance is lower than either the x87 (best) or
scalar SSE for double on both the P4E and Ath64X2.  For single precision, there
is a modest speedup over the x87 code on both systems, but the total is *way*
below my assembly SSE kernels.

I just quickly glanced at the code, and I see that it never uses "movapd" from
memory, which is a key to getting decent performance.  ATLAS ensures that the
input matrices (A & B) are 16-byte aligned.  Is there any pragma/flag/etc I can
set that says "pointer X points to data that is 16-byte aligned"?

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (45 preceding siblings ...)
  2006-08-08  2:59 ` whaley at cs dot utsa dot edu
@ 2006-08-08  6:15 ` hubicka at gcc dot gnu dot org
  2006-08-08  6:28   ` Jan Hubicka
  2006-08-08  6:29 ` hubicka at ucw dot cz
                   ` (23 subsequent siblings)
  70 siblings, 1 reply; 75+ messages in thread
From: hubicka at gcc dot gnu dot org @ 2006-08-08  6:15 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #46 from hubicka at gcc dot gnu dot org  2006-08-08 06:15 -------
In the x86/x86-64 world one can be almost sure that the load+execute
instruction pair will execute (marginally to noticeably) faster than the
move+load-and-execute instruction pair, as the more complex instructions are
harder for on-chip scheduling (they retire later).
Perhaps we can move such a transformation somewhere more generic, perhaps to
post-reload copyprop?

Honza


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-08-08  6:15 ` hubicka at gcc dot gnu dot org
@ 2006-08-08  6:28   ` Jan Hubicka
  0 siblings, 0 replies; 75+ messages in thread
From: Jan Hubicka @ 2006-08-08  6:28 UTC (permalink / raw)
  To: hubicka at gcc dot gnu dot org; +Cc: gcc-bugs

> In x86/x86-64 world one can be almost sure that the load+execute instruction
> pair will execute (marginaly to noticeably) faster than move+load-and-execute
> instruction pair as the more complex instructions are harder for on-chip
> scheduling (they retire later).
			       ^^^ retirement filling up the scheduler
			       easily.
> Perhaps we can move such a transformation somewhere more generically perhaps to
> post-reload copyprop?
> 
> Honza


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (46 preceding siblings ...)
  2006-08-08  6:15 ` hubicka at gcc dot gnu dot org
@ 2006-08-08  6:29 ` hubicka at ucw dot cz
  2006-08-08  7:05 ` paolo dot bonzini at lu dot unisi dot ch
                   ` (22 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: hubicka at ucw dot cz @ 2006-08-08  6:29 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #47 from hubicka at ucw dot cz  2006-08-08 06:28 -------
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse x87 code on all
platforms than gcc 3

> In x86/x86-64 world one can be almost sure that the load+execute instruction
> pair will execute (marginaly to noticeably) faster than move+load-and-execute
> instruction pair as the more complex instructions are harder for on-chip
> scheduling (they retire later).
                               ^^^ retirement filling up the scheduler
                               easily.
> Perhaps we can move such a transformation somewhere more generically perhaps to
> post-reload copyprop?
> 
> Honza


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (47 preceding siblings ...)
  2006-08-08  6:29 ` hubicka at ucw dot cz
@ 2006-08-08  7:05 ` paolo dot bonzini at lu dot unisi dot ch
  2006-08-08 16:44 ` whaley at cs dot utsa dot edu
                   ` (21 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: paolo dot bonzini at lu dot unisi dot ch @ 2006-08-08  7:05 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #48 from paolo dot bonzini at lu dot unisi dot ch  2006-08-08 07:05 -------
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> In x86/x86-64 world one can be almost sure that the load+execute instruction
> pair will execute (marginaly to noticeably) faster than move+load-and-execute
> instruction pair as the more complex instructions are harder for on-chip
> scheduling (they retire later).
Yes, so far so good and this part has already been committed.  But does 
a *single* load-and-execute instruction execute faster than the two 
instructions in a load+execute sequence?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (48 preceding siblings ...)
  2006-08-08  7:05 ` paolo dot bonzini at lu dot unisi dot ch
@ 2006-08-08 16:44 ` whaley at cs dot utsa dot edu
  2006-08-08 18:36 ` whaley at cs dot utsa dot edu
                   ` (20 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-08 16:44 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #49 from whaley at cs dot utsa dot edu  2006-08-08 16:43 -------
Paolo,

>Yes, so far so good and this part has already been committed.  But does
>a *single* load-and-execute instruction execute faster than the two
>instructions in a load+execute sequence?

As I said, in my hand-tuned SSE assembly experience, which is faster depends on
the architecture.  In particular, NetBurst and Core do well with the final
fmul[ls], and other archs do not.  My guess is that NetBurst and Core probably
crack this single instruction in two during decode, which allows the implicit
load to be advanced while issuing fewer instructions.  I think other
architectures do not split the instruction during decode, which means that
Tomasulo's algorithm cannot advance the load due to dependencies, which makes
the separate
instructions faster, even in the face of the extra instruction.

If you can give me a patch that makes gcc call a new peephole opt getting rid
of the final fmul[sl] only when a certain flag is thrown, I will see if I can't
post timings across a variety of architectures using both ways, so we can see
if my SSE experience holds true for x87, and how strong the performance benefit
is for various architectures.  This will allow us to evaluate how important
getting this choice is, what should be the default state, and how we should
vary it according to architecture.  My own theoretical guess is that if you
*have* to pick a behavior, surely separate instructions are better: on systems
with the cracking, this extra inst at worst eats up some mem and a bit of
decode bandwidth, which on most machines is not critical.  On the other hand,
having a non-advanceable load is pretty bad news on systems w/o the cracking
ability.  The proposed timings could demonstrate the accuracy of this guess.

As I mentioned, and I *think* Jan echoed, for the case you have already fixed,
the peephole's way should be the default way, even at low optimization: this
peephole adds no extra instruction, it is better everywhere we've timed,
and I see no way in theory for the first sequence to be better.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (49 preceding siblings ...)
  2006-08-08 16:44 ` whaley at cs dot utsa dot edu
@ 2006-08-08 18:36 ` whaley at cs dot utsa dot edu
  2006-08-09  4:34 ` paolo dot bonzini at lu dot unisi dot ch
                   ` (19 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-08 18:36 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #50 from whaley at cs dot utsa dot edu  2006-08-08 18:36 -------
Guys,

I've been scoping this a little closer on the Athlon64X2.  I have found that
the patched gcc can achieve as much as 93% of theoretical peak (5218 Mflop/s on
a 2800 MHz Athlon64X2!) for in-cache matmul when the code generator is allowed to
go to town.  That at least ties the best I've ever seen for an x86 chip, and
what it means is that on this architecture, the x87 unit can be coaxed into
beating the SSE unit *even when the SSE instructions are fully vectorized* (for
double precision only, of course: vector single prec SSE has twice the
theoretical peak of x87).  This also means that ATLAS should get a real speed
boost when
the new gcc is released, and other fp packages have the potential to do so as
well.  So, with this motivation, I edited the genned assembly, and made the
following changes by hand in ~30 different places in the kernel assembly:

>#ifdef FMULL
>        fmull   1440(%rcx)
>#else
>        fldl    1440(%rcx)
>        fmulp   %st,%st(1)
>#endif

To my surprise, on this arch, using the fldl/fmulp pair caused a performance
drop.  So, either my SSE experience does not necessarily translate to x87, or
the Opteron (where I did the SSE tuning) is subtly different from the
Athlon64X2, or my memory of the tuning is faulty.  Just as a check, Paolo: is
this the peephole you would do?

Anyway, doing this by hand is too burdensome to make widespread timings
feasible, so if you'd like to see that, I'll need a gcc patch to do it
automatically . . .

Cheers,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (50 preceding siblings ...)
  2006-08-08 18:36 ` whaley at cs dot utsa dot edu
@ 2006-08-09  4:34 ` paolo dot bonzini at lu dot unisi dot ch
  2006-08-09 14:33 ` whaley at cs dot utsa dot edu
                   ` (18 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: paolo dot bonzini at lu dot unisi dot ch @ 2006-08-09  4:34 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #51 from paolo dot bonzini at lu dot unisi dot ch  2006-08-09 04:33 -------
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> I've been scoping this a little closer on the Athlon64X2.  I have found that
> the patched gcc can achieve as much as 93% of theoretical peak (5218Mflop on a
> 2800Mhz Athlon64X2!) for in-cache matmul when the code generator is allowed to
> go to town.
Not unexpected.  Code was so tightly tuned for GCC 3, and so big were 
the changes between GCC 3 and 4, that you were comparing sort of apples 
to oranges.  It could be interesting to see which different 
optimizations are performed by your code generator for GCC 3 vs. GCC 4.
>>        fmull   1440(%rcx)
>> #else
>>        fldl    1440(%rcx)
>>        fmulp   %st,%st(1)
>> #endif
>>     
> To my surprise, on this arch, using the fldl/fmulp pair caused a performance
> drop.  So, either my SSE experience does not necessarily translate to x87, or
> the Opteron (where I did the SSE tuning) is subtly different than the
> Athlon64X2, or my memory of the tuning is faulty.  Just as a check, Paulo: is
> this the peephole you would do?
>   
In some sense, this is the peephole I would rather *not* do.  But the 
answer is yes. :-)

So, do you now agree that the bug would be fixed if the patch that is in 
GCC 4.2 were backported to GCC 4.1 (so that your users can use that)?

And do you still see the abysmal x87 single-precision FP performance?

Thanks!


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (51 preceding siblings ...)
  2006-08-09  4:34 ` paolo dot bonzini at lu dot unisi dot ch
@ 2006-08-09 14:33 ` whaley at cs dot utsa dot edu
  2006-08-09 15:52 ` whaley at cs dot utsa dot edu
                   ` (17 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-09 14:33 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #52 from whaley at cs dot utsa dot edu  2006-08-09 14:33 -------
Paolo,

>In some sense, this is the peephole I would rather *not* do.  But the answer is yes. :-)

Ahh, got it :)

>So, do you now agree that the bug would be fixed if the patch that is in GCC 4.2 was backported to GCC 4.1 (so that your users can use that)?

Well, much as I might like to deny it, yes, I must agree the bug is fixed :)  I
think there might still be more performance to get, and initial timings show
that 4 may be slower than 3 on some systems.  However, it will also clearly be
faster than 3 on some (so far, most) systems, and so far it is competitive
everywhere, so not even I can call that a performance bug :)

And yes, getting it into the next gcc release would be very helpful for ATLAS.

>And do you still see the abysmal x87 single-precision FP performance?

No, the problems were the same for both precisions.  I haven't retimed all the
systems, but here are the numbers I do have for the benchmark:

                              DOUBLE            SINGLE
              PEAK        gcc3/gccS/gcc4    gcc3/gccS/gcc4
              ====        ==============    ==============
Pentium-D :   2800        2359/2417/2067    2685/2684/2362
Ath64-X2  :   5600        3681/4011/2102    3716/4256/2207
Opteron   :   3200        2590/2517/1507    2625/2800/1580
P4E       :   2800        1767/1754/1480    1914/1954/1609
PentiumIII:    500        239/238/225       407/393/283

As you can see, on the benchmark, the single precision numbers are better than
the double now.  I cannot get single precision to run at quite the impressive
93% of peak that double achieves when exercising the code generator on the Ath64-X2, but
it gets a respectable 85% of peak (at these levels of performance, it takes
only very minor differences to drop from 93 to 85, so that's not that
unexpected: I am still investigating this).

Thanks for all the help,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (52 preceding siblings ...)
  2006-08-09 14:33 ` whaley at cs dot utsa dot edu
@ 2006-08-09 15:52 ` whaley at cs dot utsa dot edu
  2006-08-09 16:08 ` whaley at cs dot utsa dot edu
                   ` (16 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-09 15:52 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #53 from whaley at cs dot utsa dot edu  2006-08-09 15:52 -------
Created an attachment (id=12047)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=12047&action=view)
benchmark wt vectorizable kernel


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (53 preceding siblings ...)
  2006-08-09 15:52 ` whaley at cs dot utsa dot edu
@ 2006-08-09 16:08 ` whaley at cs dot utsa dot edu
  2006-08-09 19:10   ` Dorit Nuzman
  2006-08-09 19:10 ` dorit at il dot ibm dot com
                   ` (15 subsequent siblings)
  70 siblings, 1 reply; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-09 16:08 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #54 from whaley at cs dot utsa dot edu  2006-08-09 16:08 -------
Dorit,

OK, I've posted a new tarfile with a safe kernel code where the loop is not
unrolled, so that the vectorizer has a chance.  With this kernel, I can get gcc
to vectorize the code, but only if I throw the -funsafe-math-optimizations flag.  This
kernel doesn't use a lot of registers, so it should work for both x86-32 and
x86-64 archs.

I would expect the vectorized code to beat the x87 in both precisions on the
P4E (vector SSE has two and four times the peak of x87, respectively), and to
beat the x87 code in single precision on the Ath64 (twice the peak).  So far,
vectorization is never a win on the P4E, but I can make single precision win on
the Ath64.  On both platforms, inspecting the assembly confirms that there are
loops in there that use the vector instructions.  Once I understand better
what's going on, maybe I can improve this . . .

Here are some questions I need to figure out:
(1) Why do I have to throw the -funsafe-math-optimizations flag to enable this?
   -- I see where the .vect file warns of it, but it refers to an SSA line,
      so I'm not sure what's going on.
   -- ATLAS cannot throw this flag, because it enables non-IEEE fp arithmetic,
      and ATLAS must maintain IEEE compliance.  SSE itself does *not* require
      ruining IEEE compliance.
   -- Let me know if there is some way in the code that I can avoid this prob
   -- If it cannot be avoided, is there a way to make this optimization
      controlled by a flag that does not mean a loss of IEEE compliance?
(2) Is there any pragma or assertion, etc, that I can put in the code to
    notify the compiler that certain pointers point to 16-byte aligned data?
    -- Only the output array (C) is possibly misaligned in ATLAS

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (54 preceding siblings ...)
  2006-08-09 16:08 ` whaley at cs dot utsa dot edu
@ 2006-08-09 19:10 ` dorit at il dot ibm dot com
  2006-08-09 21:33 ` whaley at cs dot utsa dot edu
                   ` (14 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: dorit at il dot ibm dot com @ 2006-08-09 19:10 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #55 from dorit at il dot ibm dot com  2006-08-09 19:10 -------
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse x87 code
 on all platforms than gcc 3

>
> Here's some questions I need to figure out:
> (1) Why do I have to throw the -funsafe-math-optimizations flag to
> enable this?
>    -- I see where the .vect file warns of it, but it refers to an SSA
line,
>       so I'm not sure what's going on.

This flag is needed in order to allow vectorization of reduction (summation
in your case) of floating-point data. This is because vectorization of
reduction changes the order of the computation, which may result in
different behavior (instead of summing this way:
((((((a0+a1)+a2)+a3)+a4)+a5)+a6)+a7, we sum this way:
(((a0+a2)+a4)+a6)+(((a1+a3)+a5)+a7)).
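
To make this concrete, here's a minimal sketch of the kind of reduction loop
in question (illustrative only -- the names are not taken from your kernel):

  /* Dot-product style reduction: the scalar loop sums in strict serial
     order.  A 2-wide SSE vectorization keeps two partial sums (even and
     odd elements) and adds them after the loop -- exactly the reordering
     shown above, hence the flag requirement. */
  double dot(const double *x, const double *y, int n)
  {
      double sum = 0.0;
      int i;
      for (i = 0; i < n; i++)
          sum += x[i] * y[i];
      return sum;
  }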

> (2) Is there any pragma or assertion, etc, that I can put in the code to
>     notify the compiler that certain pointers point to 16-byte aligned
data?
>     -- Only the output array (C) is possibly misaligned in ATLAS
>

Not really, I'm afraid - there is something that's not entirely supported
in gcc yet - see details in PR20794.

dorit

> Thanks,
> Clint
>
>
> --
>
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
>


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code  on all platforms than gcc 3
  2006-08-09 16:08 ` whaley at cs dot utsa dot edu
@ 2006-08-09 19:10   ` Dorit Nuzman
  0 siblings, 0 replies; 75+ messages in thread
From: Dorit Nuzman @ 2006-08-09 19:10 UTC (permalink / raw)
  To: gcc-bugzilla; +Cc: gcc-bugs

>
> Here's some questions I need to figure out:
> (1) Why do I have to throw the -funsafe-math-optimizations flag to
> enable this?
>    -- I see where the .vect file warns of it, but it refers to an SSA
line,
>       so I'm not sure what's going on.

This flag is needed in order to allow vectorization of reduction (summation
in your case) of floating-point data. This is because vectorization of
reduction changes the order of the computation, which may result in
different behavior (instead of summing this way:
((((((a0+a1)+a2)+a3)+a4)+a5)+a6)+a7, we sum this way:
(((a0+a2)+a4)+a6)+(((a1+a3)+a5)+a7)).

> (2) Is there any pragma or assertion, etc, that I can put in the code to
>     notify the compiler that certain pointers point to 16-byte aligned
data?
>     -- Only the output array (C) is possibly misaligned in ATLAS
>

Not really, I'm afraid - there is something that's not entirely supported
in gcc yet - see details in PR20794.

dorit

> Thanks,
> Clint
>
>
> --
>
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
>


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (55 preceding siblings ...)
  2006-08-09 19:10 ` dorit at il dot ibm dot com
@ 2006-08-09 21:33 ` whaley at cs dot utsa dot edu
  2006-08-09 21:46   ` Andrew Pinski
  2006-08-09 21:46 ` pinskia at physics dot uc dot edu
                   ` (13 subsequent siblings)
  70 siblings, 1 reply; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-09 21:33 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #56 from whaley at cs dot utsa dot edu  2006-08-09 21:33 -------
Dorit,

>This flag is needed in order to allow vectorization of reduction (summation
>in your case) of floating-point data.

OK, but this is a baaaad flag to require.  From the computational scientist's
point of view, there is a *vast* difference between reordering (which many
aggressive optimizations imply) and failing to have IEEE compliance.  Almost no
computational scientist will use non-IEEE code (because you have essentially no
idea if your answer is correct), but almost all will allow reordering.  So, it
is  really important to separate the non-IEEE optimizations from the IEEE
compliant ones.

If vectorization requires me to throw a flag that says it causes non-IEEE
arithmetic, I can't use it, and neither can anyone other than, AFAIK, some
graphics guys.  IEEE is the "contract" between the user and the computer, that
bounds how much error there can be, and allows the programmer to know if a
given algorithm will produce a usable result.  Non-IEEE is therefore the
death-knell for having any theoretical or a priori understanding of accuracy. 
So, while reordering and non-IEEE may both seem unsafe, a reordering just gives
different results, which are still known to be within normal fp error, while
non-IEEE means there is no contract with the programmer at all, and indeed
the answer may be arbitrarily bad.  Further, behavior under exceptional
conditions is not maintained, and so the answer may actually be undetectably
nonsensical, not merely inaccurate.  Having an oddly colored pixel doesn't hurt
the graphics guy, but sending a satellite into the atmosphere, or registering
cancer in a clean MRI are rather more serious . . .  So, mixing the two
transformation types on one flag means that vectorization is unusable to what
must be the majority of its audience.  Maybe I should open this as another bug
report "flag mixes normal and catastrophic optimizations"?

>Not really, I'm afraid - there is something that's not entirely supported
>in gcc yet - see details in PR20794

Hmm.  I'd tried the __attribute__ before, but I must have mistyped it, because
it didn't work on pointers then; it does compile in the MMBENCHV tarfile.
However, the code still doesn't use aligned loads to access the vectors (it
uses multiple movlpd/movhpd instead) . . .  Even more scary, adding the
attribute does not change the genned assembly at all.  Does the vectorization
phase get this alignment info passed to it?

Aligned loads can be as much as twice as fast as unaligned, and if you have to
choose amongst loops in the midst of a deep loop nest, these factors can
actually make vectorization a loser . . .
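
For reference, this is roughly what I tried (a sketch from memory, not the
exact MMBENCHV code; the typedef name is made up):

  /* A typedef carrying a 16-byte alignment hint; whether the vectorizer
     actually uses it through a pointer dereference is the open question
     (see PR20794). */
  typedef double aligned_double __attribute__ ((aligned (16)));

  void axpy(int n, double alpha, const aligned_double *x, aligned_double *y)
  {
      int i;
      for (i = 0; i < n; i++)
          y[i] += alpha * x[i];   /* hoped for: movapd, not movlpd/movhpd */
  }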

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-08-09 21:33 ` whaley at cs dot utsa dot edu
@ 2006-08-09 21:46   ` Andrew Pinski
  0 siblings, 0 replies; 75+ messages in thread
From: Andrew Pinski @ 2006-08-09 21:46 UTC (permalink / raw)
  To: gcc-bugzilla; +Cc: gcc-bugs

> 
> 
> 
> ------- Comment #56 from whaley at cs dot utsa dot edu  2006-08-09 21:33 -------
> Dorit,
> 
> >This flag is needed in order to allow vectorization of reduction (summation
> >in your case) of floating-point data.
> 
> OK, but this is a baaaad flag to require.  From the computational scientist's
> point of view, there is a *vast* difference between reordering (which many
> aggressive optimizations imply) and failing to have IEEE compliance.  Almost no
> computational scientist will use non-IEEE code (because you have essentially no
> idea if your answer is correct), but almost all will allow reordering.  So, it
> is  really important to separate the non-IEEE optimizations from the IEEE
> compliant ones.
Except for the fact that IEEE-compliant fp does not allow for reordering at all, except
in some small cases.  For example, (a + b) + (-a) is not the same as (a + (-a)) + b,
so reordering will invalidate IEEE fp for large a and small b.  Yes, maybe we should split out
the option for unsafe math fp reordering, but that is a different issue.

-- Pinski


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (56 preceding siblings ...)
  2006-08-09 21:33 ` whaley at cs dot utsa dot edu
@ 2006-08-09 21:46 ` pinskia at physics dot uc dot edu
  2006-08-09 23:02 ` whaley at cs dot utsa dot edu
                   ` (12 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: pinskia at physics dot uc dot edu @ 2006-08-09 21:46 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #57 from pinskia at physics dot uc dot edu  2006-08-09 21:46 -------
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse x87 code on all
platforms than gcc 3

> 
> 
> 
> ------- Comment #56 from whaley at cs dot utsa dot edu  2006-08-09 21:33 -------
> Dorit,
> 
> >This flag is needed in order to allow vectorization of reduction (summation
> >in your case) of floating-point data.
> 
> OK, but this is a baaaad flag to require.  From the computational scientist's
> point of view, there is a *vast* difference between reordering (which many
> aggressive optimizations imply) and failing to have IEEE compliance.  Almost no
> computational scientist will use non-IEEE code (because you have essentially no
> idea if your answer is correct), but almost all will allow reordering.  So, it
> is  really important to separate the non-IEEE optimizations from the IEEE
> compliant ones.
Except for the fact that IEEE-compliant fp does not allow for reordering at
all, except in some small cases.  For example, (a + b) + (-a) is not the same
as (a + (-a)) + b, so reordering will invalidate IEEE fp for large a and small
b.  Yes, maybe we should split out the option for unsafe math fp reordering,
but that is a different issue.

-- Pinski


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (57 preceding siblings ...)
  2006-08-09 21:46 ` pinskia at physics dot uc dot edu
@ 2006-08-09 23:02 ` whaley at cs dot utsa dot edu
  2006-08-10  6:52 ` paolo dot bonzini at lu dot unisi dot ch
                   ` (11 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-09 23:02 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #58 from whaley at cs dot utsa dot edu  2006-08-09 23:01 -------
Andrew,

>Except for the fact that IEEE-compliant fp does not allow for reordering at
>all, except in some small cases.  For example, (a + b) + (-a) is not the same
>as (a + (-a)) + b, so reordering will invalidate IEEE fp for large a and small
>b.  Yes, maybe we should split out the option for unsafe math fp reordering,
>but that is a different issue.

Thanks for the response, but I believe you are conflating two issues (as is
this flag, which is why this is bad news).  Different answers to the question
"what is this sum" does not ruin IEEE compliance.  I am referring to IEEE 754,
which is a standard set of rules for storage and arithmetic for floating point
(fp) on modern hardware.  I am unaware of there being any rules on compilation;
i.e., whether re-orderings are allowed is beyond the standard.  Rather, it is a
set of rules that specifies, for floating point operations (flops), how rounding
must be done, how overflow/underflow must be handled, etc.  Perhaps there is
another IEEE standard concerning compilation that you are referring to?

Now of course, floating point arithmetic in general (and IEEE-compliant fp in
particular) is not associative, so indeed (a+b+c) != (c+b+a).  However, both
sequences are valid answers to "what are these 3 things summed up", and both
are IEEE compliant if each addition is compliant.
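
A quick made-up illustration (nothing to do with the kernel): every addition
below is a correctly rounded IEEE double operation, yet the two groupings give
different answers, and both are IEEE-compliant:

  #include <stdio.h>

  int main(void)
  {
      double a = 1.0e16, b = -1.0e16, c = 1.0;
      /* Each individual addition is IEEE-correct; only the grouping differs. */
      printf("%g\n", (a + b) + c);   /* prints 1: a+b cancels exactly       */
      printf("%g\n", a + (b + c));   /* prints 0: b+c rounds back to -1e16  */
      return 0;
  }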

What non-IEEE means is that the individual flops are no longer IEEE compliant. 
This means that overflow may not be handled, or exceptional conditions may
cause unknown results (eg., divide by zero), and indeed we have no way at all
of knowing what an fp add even means.  An example of a non-IEEE optimization is
using 3DNow! vectorization, because 3DNow! does not follow the IEEE standard
(for instance, it handles overflow only by saturation, which violates the
standard).  SSE (unless you turn IEEE compliance off manually) is IEEE
compliant, and this is why you see computational guys like myself using it, and
not using 3DNow!.

To a computational scientist, non-IEEE is catastrophic, and "may change the
answer" is not.  "May change the answer" in this case simply means that I've
got a different ordering, which is also a valid IEEE fp answer, and indeed may
be a "better" answer than the original ordering (depending on the data; no way
to know this w/o looking at the data).  Non-IEEE means that I have no way of
knowing what kind of rounding was done, how the flop was done, if underflow (or
gradual underflow!) occurred, etc.  It is for this reason that optimizations
which are non-IEEE are a killer for computational scientists, and reorders are
no big deal.  In the first you have no idea what has happened with the data,
and in the second you have an IEEE-compliant answer, which has known
properties.

It has been my experience that most compiler people (and I have some experience
there, as I got my PhD in compilation) are more concerned with integer work,
and thus not experts on fp computation.  I've done fp computational work for
the majority of my research for the last decade, so I thought I might be able
to provide useful input to bridge the camps, so to speak.  In this case, I
think that by lumping "cause different IEEE-compliant answers" in with "use
non-IEEE arithmetic" you are preventing all serious fp users from utilizing the
optimizations.  Since vectorization is of great importance on modern machines,
this is bad news.  Obviously, I may be wrong in what I say, but if reordering
makes something non-IEEE I'm going to have some students mad at me for teaching
them the wrong stuff :)

Has this made my point any clearer, or do you still think I am wrong?  If I'm
wrong, maybe you can point to the part of the IEEE standard that discusses
orderings violating the standard (as opposed to the well-known fact that all
implemented fp arithmetic is non-associative)?  After you do this, I'll have
to dig up my copy of the thing, which I don't think I've seen in the last 2
years (but I did scan some of the books that cover it, and didn't find anything
about compilation).

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (58 preceding siblings ...)
  2006-08-09 23:02 ` whaley at cs dot utsa dot edu
@ 2006-08-10  6:52 ` paolo dot bonzini at lu dot unisi dot ch
  2006-08-10 14:08 ` whaley at cs dot utsa dot edu
                   ` (10 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: paolo dot bonzini at lu dot unisi dot ch @ 2006-08-10  6:52 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #59 from paolo dot bonzini at lu dot unisi dot ch  2006-08-10 06:52 -------
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> Thanks for the response, but I believe you are conflating two issues (as is
> this flag, which is why this is bad news).  Different answers to the question
> "what is this sum" does not ruin IEEE compliance.  I am referring to IEEE 754,
> which is a standard set of rules for storage and arithmetic for floating point
> (fp) on modern hardware.
You are also confusing -funsafe-math-optimizations with -ffast-math.  
The latter is a "one catch all" flag that compiles as if there were no 
FP traps, infinities, NaNs, and so on.  The former instead enables 
"unsafe" optimizations but not "catastrophic" optimizations -- if you 
consider meaningless results on badly conditioned matrixes to not be 
catastrophic...

A more or less complete list of things enabled by 
-funsafe-math-optimizations includes:

Reassociation:
- reassociation of operations, not only for the vectorizer's sake but 
also in the unroller (see around line 1600 of loop-unroll.c)
- other simplifications like a/(b*c) for a/b/c
- expansion of pow (a, b) to multiplications if b is integer

Compile-time evaluation:
- doing more aggressive compile-time evaluation of floating-point 
expressions (e.g. cabs)
- less accurate modeling of overflow in compile-time expressions, for 
formats such as 106-bit mantissa long doubles

Math identities:
- expansion of cabs to sqrt (a*a + b*b)
- simplifications involving transcendental functions, e.g. exp (0.5*x) 
for sqrt (exp (x)), or x for tan(atan(x))
- moving terms to the other side of a comparison, e.g. a > 4 for a + 4 > 
8, or x > -1 for 1 - x < 2
- assuming in-domain arguments of sqrt, log, etc., e.g. x for 
sqrt(x)*sqrt(x)
- in turn, this enables removing math functions from comparisons, e.g. x 
 > 4 for sqrt (x) > 2

Optimization:
- strength reduction of a/b to a*(1/b), both as loop invariants and in 
code like vector normalization
- eliminating recursion for "accumulator"-like functions, i.e. f (n) = n 
+ f(n-1)

Back-end operation:
- using x87 builtins for transcendental functions

There may be bugs, but in general these optimizations are safe for 
infinities and NaNs, but not for signed zeros or (as I said) for very 
badly conditioned data.
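
To make a few of these concrete, here is a small made-up example; the
commented right-hand sides are rewrites that -funsafe-math-optimizations
permits, not something gcc is guaranteed to emit:

  #include <math.h>

  double unsafe_examples(double a, double b, double c, double x)
  {
      double r = 0.0;
      r += a / b / c;          /* may be rewritten as a / (b * c)         */
      r += pow(x, 3.0);        /* may be expanded to x * x * x            */
      r += sqrt(x) * sqrt(x);  /* may be simplified to x (assumes x >= 0) */
      r += a / b;              /* in loops: strength-reduced to a * (1/b) */
      return r;
  }
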
> I am unaware of there being any rules on compilation.
>   
Rules are determined by the language standards.  I believe that C 
mandates no reassociation; Fortran allows reassociation unless explicit 
parentheses are present in the source, but this is not (yet) implemented 
by GCC.

Paolo


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (59 preceding siblings ...)
  2006-08-10  6:52 ` paolo dot bonzini at lu dot unisi dot ch
@ 2006-08-10 14:08 ` whaley at cs dot utsa dot edu
  2006-08-10 14:29 ` paolo dot bonzini at lu dot unisi dot ch
                   ` (9 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-10 14:08 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #60 from whaley at cs dot utsa dot edu  2006-08-10 14:08 -------
Paolo,

Thanks for the explanation of what -funsafe is presently doing.

>You are also confusing -funsafe-math-optimizations with -ffast-math.

No, what I'm doing is reading the man page (the closest thing to a contract
between gcc and me on what it is doing with my code):
|      -funsafe-math-optimizations
|          Allow optimizations for floating-point arithmetic that (a) assume
|          that arguments and results are valid and (b) may violate IEEE or
|          ANSI standards.

The (b) in this statement prevents me, as a library provider who *must* be
able to reassure my users that I have done nothing to violate the IEEE fp
standard (don't get me wrong, there are plenty of violations of the standard
that occur in hardware, but typically in ways well understood by the scientists
on those platforms, and in the less important parts of the standard), from
using this flag.  I can't even use it after verifying that no optimization has
hurt the present code, because an optimization that violates IEEE could be
added at a later date, or used on a system that I'm not testing on (e.g., on
some systems it could cause 3DNow! vectorization).

>Rules are determined by the language standards.  I believe that C
>mandates no reassociation; Fortran allows reassociation unless explicit
>parentheses are present in the source, but this is not (yet) implemented
>by GCC.

My precise point.  There are *lots* of C rules that an fp guy couldn't give a
crap about (for certain types of fp kernels), but IEEE is pretty much inviolate.
Since this flag conflates language violations (don't care) with IEEE violations
(catastrophic), I can't use it.  I cannot stress enough just how important IEEE
is: it is the only contract that tells us what it means to do a flop, and gives
us any way of understanding what our answer will be.

Making vectorization depend on a flag that says it is allowed to violate IEEE
is therefore a killer for me (and most knowledgeable fp guys).  This is ironic,
since vectorization of sums (as in GEMM) is usually implemented as scalar
expansion on the accumulators, and this not only produces an IEEE-compliant
answer, but it is *more* accurate for almost all data.
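
To illustrate what I mean by scalar expansion (a hand-written sketch, not the
actual ATLAS code):

  /* Two-accumulator "scalar expansion" of a sum (n assumed even for
     brevity).  The final s0 + s1 is the same reordering a 2-wide SSE
     reduction performs, and every operation is still IEEE-compliant. */
  double dot2(const double *x, const double *y, int n)
  {
      double s0 = 0.0, s1 = 0.0;
      int i;
      for (i = 0; i < n; i += 2) {
          s0 += x[i]     * y[i];
          s1 += x[i + 1] * y[i + 1];
      }
      return s0 + s1;
  }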

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (60 preceding siblings ...)
  2006-08-10 14:08 ` whaley at cs dot utsa dot edu
@ 2006-08-10 14:29 ` paolo dot bonzini at lu dot unisi dot ch
  2006-08-10 15:16 ` whaley at cs dot utsa dot edu
                   ` (8 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: paolo dot bonzini at lu dot unisi dot ch @ 2006-08-10 14:29 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #61 from paolo dot bonzini at lu dot unisi dot ch  2006-08-10 14:28 -------
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


> Making vectorization depend on a flag that says it is allowed to violate IEEE
> is therefore a killer for me (and most knowledgeable fp guys).  This is ironic,
> since vectorization of sums (as in GEMM) is usually implemented as scalar
> expansion on the accumulators
>   
In the case of GCC, it performs the transformation that Dorit explained.  It 
may not produce an IEEE-compliant answer if there are zeros and you 
expect to see a particular sign for the zero.
> and this not only produces an IEEE-compliant answer
>   
The IEEE standard mandates particular rules for performing operations on 
infinities, NaNs, signed zeros, denormals, ...  The C standard, by 
mandating no reassociation, ensures that you don't mess with NaNs, 
infinities, and signed zeros.  As soon as you perform reassociation, 
there is *no way* you can be sure that you get IEEE-compliant math.

  +Inf + (1 / +0) = Inf, +Inf + (1 / -0) = NaN.
> but it is *more* accurate for almost all data.
http://citeseer.ist.psu.edu/589698.html is an example of a paper that 
shows FP code that avoids accuracy problems.  Any kind of reassociation 
will break that code, and lower its accuracy.  That's why reassociation 
is an "unsafe" math optimization.

If you want a -freassociate-fp math, open an enhancement PR and somebody 
might be more than happy to separate reassociation from the other 
effects of -funsafe-math-optimizations.

(Independent of this, you should also open a separate PR for ATLAS 
vectorization, because that would not be a regression and would not be 
on x87) :-)

Paolo


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (61 preceding siblings ...)
  2006-08-10 14:29 ` paolo dot bonzini at lu dot unisi dot ch
@ 2006-08-10 15:16 ` whaley at cs dot utsa dot edu
  2006-08-10 15:22 ` paolo dot bonzini at lu dot unisi dot ch
                   ` (7 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-10 15:16 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #62 from whaley at cs dot utsa dot edu  2006-08-10 15:15 -------
Paolo,

>The IEEE standard mandates particular rules for performing operations on
>infinities, NaNs, signed zeros, denormals, ...  The C standard, by
>mandating no reassociation, ensures that you don't mess with NaNs,
>infinities, and signed zeros.  As soon as you perform reassociation,
>there is *no way* you can be sure that you get IEEE-compliant math.

No, again this is a conflation of the issues.  You have IEEE-compliant math,
but the differing orderings provide different summations of those values.  It
is an ANSI/ISO C rule being violated, not an IEEE one.  Each individual operation is
IEEE, and therefore both results are IEEE-compliant, but since the C rule
requiring order has been broken, some codes will break.  However, they break
not because of a violation of IEEE, but because of a violation of ANSI/ISO C. 
I can certify whether my code can take this violation of ANSI/ISO C by
examining my code.  I cannot certify my code works w/o IEEE by examining it,
since that means a+b is now essentially undefined.

>http://citeseer.ist.psu.edu/589698.html is an example of a paper that
>shows FP code that avoids accuracy problems.  Any kind of reassociation
>will break that code, and lower its accuracy.  That's why reassociation
>is an "unsafe" math optimization.

Please note I never argued that it was safe.  Violating the C usage rules is
always unsafe.  However, as explained above, I can certify my code for
reordering by examination, but nothing helps an IEEE violation.  My problem is
lumping IEEE violations (such as 3DNow! vectorization, or turning on non-IEEE
mode in SSE) in with C violations.

>If you want a -freassociate-fp math, open an enhancement PR and somebody

Ah, you mean like I asked about at the end of the 2nd paragraph of Comment #56?

>might be more than happy to separate reassociation from the other
>effects of -funsafe-math-optimizations.

What I'm arguing for is not lumping in violations of ISO/ANSI C with IEEE
violations, but you are right that this would fix my particular case.  From
what I see, -funsafe ought to be redefined as violating ANSI/ISO alone, and not
mention IEEE at all.

>(Independent of this, you should also open a separate PR for ATLAS
>vectorization, because that would not be a regression and would not be
>on x87) :-)

You mean like I pleaded for in the last paragraph of Comment #38, but
reluctantly shoved in here because that's what people seemed to want? :)

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (62 preceding siblings ...)
  2006-08-10 15:16 ` whaley at cs dot utsa dot edu
@ 2006-08-10 15:22 ` paolo dot bonzini at lu dot unisi dot ch
  2006-08-11  9:19 ` uros at kss-loka dot si
                   ` (6 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: paolo dot bonzini at lu dot unisi dot ch @ 2006-08-10 15:22 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #63 from paolo dot bonzini at lu dot unisi dot ch  2006-08-10 15:22 -------
Subject: Re:  [4.0/4.1 Regression] gcc 4 produces worse
 x87 code on all platforms than gcc 3


>> If you want a -freassociate-fp math, open an enhancement PR and somebody
>>     
> Ah, you mean like I asked about in end of 2nd paragraph of Comment #56?
>> (Independent of this, you should also open a separate PR for ATLAS
>> vectorization, because that would not be a regression and would not be
>> on x87) :-)
>>     
> You mean like I pleaded for in the last paragraph of Comment #38
Be bold.  Don't ask, just open PRs if you feel an issue is separate.  Go 
ahead now if you wish.  Having them closed or marked as duplicate is not 
a problem, and it is much easier to track than cluttering an existing PR.

All these issues with ATLAS will not be visible to somebody looking for 
bug fixes "known to fail" in 4.2.0, because the original problem is now 
fixed in that version, and will soon be in 4.1.1 too.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (63 preceding siblings ...)
  2006-08-10 15:22 ` paolo dot bonzini at lu dot unisi dot ch
@ 2006-08-11  9:19 ` uros at kss-loka dot si
  2006-08-11 13:26 ` bonzini at gcc dot gnu dot org
                   ` (5 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: uros at kss-loka dot si @ 2006-08-11  9:19 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #64 from uros at kss-loka dot si  2006-08-11 09:18 -------
Slightly offtopic, but to put some numbers to comment #8 and comment #11,
equivalent SSE code now reaches only 50% of x87 single performance and 60% of
x87 double performance on AMD x86_64:


ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

[float] -O2 -mfpmath=sse -march=k8:
atlasmm       60   1000       0.273     1582.66
[float] -O2 -mfpmath=387 -march=k8:
atlasmm       60   1000       0.138     3130.91

[double] -O2 -mfpmath=sse -march=k8:
atlasmm       60   1000       0.252     1714.54
[double] -O2 -mfpmath=387 -march=k8:
atlasmm       60   1000       0.152     2842.55

This effect was first observed in PR19780.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (64 preceding siblings ...)
  2006-08-11  9:19 ` uros at kss-loka dot si
@ 2006-08-11 13:26 ` bonzini at gcc dot gnu dot org
  2006-08-11 14:10 ` [Bug target/27827] [4.0 " bonzini at gnu dot org
                   ` (4 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: bonzini at gcc dot gnu dot org @ 2006-08-11 13:26 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #65 from bonzini at gnu dot org  2006-08-11 13:26 -------
Subject: Bug 27827

Author: bonzini
Date: Fri Aug 11 13:25:58 2006
New Revision: 116082

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=116082
Log:
2006-08-11  Paolo Bonzini  <bonzini@gnu.org>

        PR target/27827
        * config/i386/i386.md: Add peephole2 to avoid "fld %st"
        instructions.

testsuite:
2006-08-11  Paolo Bonzini  <bonzini@gnu.org>

        PR target/27827
        * gcc.target/i386/pr27827.c: New testcase.


Added:
    branches/gcc-4_1-branch/gcc/testsuite/gcc.target/i386/pr27827.c
      - copied unchanged from r115969,
trunk/gcc/testsuite/gcc.target/i386/pr27827.c
Modified:
    branches/gcc-4_1-branch/gcc/ChangeLog
    branches/gcc-4_1-branch/gcc/config/i386/i386.md
    branches/gcc-4_1-branch/gcc/testsuite/ChangeLog


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (65 preceding siblings ...)
  2006-08-11 13:26 ` bonzini at gcc dot gnu dot org
@ 2006-08-11 14:10 ` bonzini at gnu dot org
  2006-08-11 15:22 ` whaley at cs dot utsa dot edu
                   ` (3 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: bonzini at gnu dot org @ 2006-08-11 14:10 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #66 from bonzini at gnu dot org  2006-08-11 14:10 -------
(on bugzilla because I had problems sending mail to you)

> Just got your most recent update.  From what I can tell, you have applied
> your patch to the 4.1 series, so that the next 4.1 release will have the fix?

Yes.

> So, my question is that I notice the comment says:
>    * config/i386/i386.md: Add peephole2 to avoid "fld %st" instructions.
>
> Which, if its what we've been doing should be something like:
>    * config/i386/i386.md: Add peephole2 to substitute "fld" for memory-source 
>      "fmul"

No, what my patch does is exactly replace "fld reg + fmul mem" with "fld mem
+ fmul reg,reg".  Maybe the ChangeLog is not completely descriptive, but the PR
number is there and will make things clear enough.

> BTW, it's going to remain the case that you must do at least -O2 to get
> this peephole invoked?

You can add -fpeephole2.
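
For reference, a loop of roughly this shape exhibits the pattern (a sketch,
not the actual gcc.target/i386/pr27827.c testcase):

  /* With -O2 -mfpmath=387, gcc 4 could emit the multiply here as a
     memory-operand fmull after an "fld %st" copy; the peephole2 rewrites
     that as "fldl mem" followed by a register-register fmulp. */
  void maccum(int n, const double *a, const double *b, double *c)
  {
      int i;
      for (i = 0; i < n; i++)
          c[i] += a[i] * b[i];
  }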


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (66 preceding siblings ...)
  2006-08-11 14:10 ` [Bug target/27827] [4.0 " bonzini at gnu dot org
@ 2006-08-11 15:22 ` whaley at cs dot utsa dot edu
  2006-08-23 10:36 ` oliver dot jennrich at googlemail dot com
                   ` (2 subsequent siblings)
  70 siblings, 0 replies; 75+ messages in thread
From: whaley at cs dot utsa dot edu @ 2006-08-11 15:22 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #67 from whaley at cs dot utsa dot edu  2006-08-11 15:22 -------
Uros,

>Slightly offtopic, but to put some numbers to comment #8 and comment #11,
>equivalent SSE code now reaches only 50% of x87 single performance and 60% of
>x87 double performance on AMD x86_64

FYI, you *may* get slightly better single SSE performance with these flags:
   -fomit-frame-pointer -march=athlon64 -O2 -mfpmath=sse \
   -msse -msse2 -msse3 -fargument-noalias-global

Also, when ATLAS is allowed to exercise the code generator to find the best
kernel, for double precision gcc 4's SSE could be made to almost tie gcc3's x87
performance (gcc3's double x87 performance is roughly 92% of the patched gcc 4
for this platform).  However, single precision SSE, even allowing the code
generator to go crazy, could only achieve about 2/3 of double *SSE*
performance, and since x87 single perf is actually greater for x87 . . .

You can find some details at:
  
https://sourceforge.net/mailarchive/forum.php?thread_id=10026092&forum_id=426

Cheers,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (67 preceding siblings ...)
  2006-08-11 15:22 ` whaley at cs dot utsa dot edu
@ 2006-08-23 10:36 ` oliver dot jennrich at googlemail dot com
  2006-10-07 10:06 ` steven at gcc dot gnu dot org
  2007-02-13  2:59 ` pinskia at gcc dot gnu dot org
  70 siblings, 0 replies; 75+ messages in thread
From: oliver dot jennrich at googlemail dot com @ 2006-08-23 10:36 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #68 from oliver dot jennrich at googlemail dot com  2006-08-23 10:36 -------
(In reply to comment #23)

I read the discussion with  a lot of interest - so here are the data for a
Pentium-M:

echo "GCC 3.x     double performance:"
GCC 3.x     double performance:
./xdmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.281     1537.37

echo "GCC 4.x     double performance:"
GCC 4.x     double performance:
./xdmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.265     1630.19

echo "GCC 3.x     single performance:"
GCC 3.x     single performance:
./xsmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.281     1537.37

echo "GCC 4.x     single performance:"
GCC 4.x     single performance:
./xsmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.266     1624.06

> Here is the machine breakdown as measured now:
>    LIKES GCC 4    DOESN'T CARE    LIKES GCC 3
>    ===========    ============    ===========
>    CoreDuo        Pentium 4       PentiumPRO
>                                   Pentium III
>                                   Pentium 4e
>                                   Pentium D
>                                   Athlon-64 X2
>                                   Opteron

So I guess the first column gets another entry: Pentium M


-- 

oliver dot jennrich at googlemail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |oliver dot jennrich at
                   |                            |googlemail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (68 preceding siblings ...)
  2006-08-23 10:36 ` oliver dot jennrich at googlemail dot com
@ 2006-10-07 10:06 ` steven at gcc dot gnu dot org
  2007-02-13  2:59 ` pinskia at gcc dot gnu dot org
  70 siblings, 0 replies; 75+ messages in thread
From: steven at gcc dot gnu dot org @ 2006-10-07 10:06 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #69 from steven at gcc dot gnu dot org  2006-10-07 10:06 -------
The linked-to patch is already on the trunk.


-- 

steven at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|http://gcc.gnu.org/ml/gcc-  |
                   |patches/2006-               |
                   |08/msg00113.html            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [Bug target/27827] [4.0 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
  2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
                   ` (69 preceding siblings ...)
  2006-10-07 10:06 ` steven at gcc dot gnu dot org
@ 2007-02-13  2:59 ` pinskia at gcc dot gnu dot org
  70 siblings, 0 replies; 75+ messages in thread
From: pinskia at gcc dot gnu dot org @ 2007-02-13  2:59 UTC (permalink / raw)
  To: gcc-bugs



------- Comment #70 from pinskia at gcc dot gnu dot org  2007-02-13 02:59 -------
Fixed; the 4.0 branch has now been closed.


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827


^ permalink raw reply	[flat|nested] 75+ messages in thread

end of thread, other threads:[~2007-02-13  2:59 UTC | newest]

Thread overview: 75+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-05-31  0:33 [Bug rtl-optimization/27827] New: gcc 4 produces worse x87 code on all platforms than gcc 3 hiclint at gmail dot com
2006-05-31  0:35 ` [Bug rtl-optimization/27827] " pinskia at gcc dot gnu dot org
2006-05-31  0:36 ` hiclint at gmail dot com
2006-05-31  0:42 ` [Bug target/27827] " pinskia at gcc dot gnu dot org
2006-05-31  0:50 ` hiclint at gmail dot com
2006-05-31  0:55 ` pinskia at gcc dot gnu dot org
2006-05-31  1:09 ` whaley at cs dot utsa dot edu
2006-05-31 10:57 ` uros at kss-loka dot si
2006-05-31 14:13 ` whaley at cs dot utsa dot edu
2006-06-01  8:43 ` uros at kss-loka dot si
2006-06-01 16:03 ` whaley at cs dot utsa dot edu
2006-06-01 16:26 ` whaley at cs dot utsa dot edu
2006-06-01 18:43 ` whaley at cs dot utsa dot edu
2006-06-07 22:39 ` whaley at cs dot utsa dot edu
2006-06-14  3:04 ` whaley at cs dot utsa dot edu
2006-06-24 18:11 ` whaley at cs dot utsa dot edu
2006-06-24 19:13 ` rguenth at gcc dot gnu dot org
2006-06-25 13:35 ` whaley at cs dot utsa dot edu
2006-06-25 23:05 ` rguenth at gcc dot gnu dot org
2006-06-26  1:12 ` whaley at cs dot utsa dot edu
2006-06-26  7:53 ` uros at kss-loka dot si
2006-06-26 16:02 ` whaley at cs dot utsa dot edu
2006-06-27  6:05 ` uros at kss-loka dot si
2006-06-27 14:37 ` whaley at cs dot utsa dot edu
2006-06-27 17:47 ` whaley at cs dot utsa dot edu
2006-06-28 17:37 ` [Bug target/27827] [4.0/4.1/4.2 Regression] " steven at gcc dot gnu dot org
2006-06-28 20:18 ` whaley at cs dot utsa dot edu
2006-06-29  4:18 ` hjl at lucon dot org
2006-06-29  6:43 ` whaley at cs dot utsa dot edu
2006-07-04 13:15 ` whaley at cs dot utsa dot edu
2006-07-05 17:55 ` mmitchel at gcc dot gnu dot org
2006-08-04  7:46 ` bonzini at gnu dot org
2006-08-04 16:24 ` whaley at cs dot utsa dot edu
2006-08-05  7:21 ` bonzini at gnu dot org
2006-08-05 14:24 ` whaley at cs dot utsa dot edu
2006-08-05 17:16 ` bonzini at gnu dot org
2006-08-05 18:26 ` whaley at cs dot utsa dot edu
2006-08-06 15:03 ` [Bug target/27827] [4.0/4.1 " whaley at cs dot utsa dot edu
2006-08-07  6:19 ` bonzini at gnu dot org
2006-08-07 15:32 ` whaley at cs dot utsa dot edu
2006-08-07 16:47 ` whaley at cs dot utsa dot edu
2006-08-07 16:58 ` paolo dot bonzini at lu dot unisi dot ch
2006-08-07 17:19 ` whaley at cs dot utsa dot edu
2006-08-07 18:19 ` paolo dot bonzini at lu dot unisi dot ch
2006-08-07 20:35 ` dorit at il dot ibm dot com
2006-08-07 21:57 ` whaley at cs dot utsa dot edu
2006-08-08  2:59 ` whaley at cs dot utsa dot edu
2006-08-08  6:15 ` hubicka at gcc dot gnu dot org
2006-08-08  6:28   ` Jan Hubicka
2006-08-08  6:29 ` hubicka at ucw dot cz
2006-08-08  7:05 ` paolo dot bonzini at lu dot unisi dot ch
2006-08-08 16:44 ` whaley at cs dot utsa dot edu
2006-08-08 18:36 ` whaley at cs dot utsa dot edu
2006-08-09  4:34 ` paolo dot bonzini at lu dot unisi dot ch
2006-08-09 14:33 ` whaley at cs dot utsa dot edu
2006-08-09 15:52 ` whaley at cs dot utsa dot edu
2006-08-09 16:08 ` whaley at cs dot utsa dot edu
2006-08-09 19:10   ` Dorit Nuzman
2006-08-09 19:10 ` dorit at il dot ibm dot com
2006-08-09 21:33 ` whaley at cs dot utsa dot edu
2006-08-09 21:46   ` Andrew Pinski
2006-08-09 21:46 ` pinskia at physics dot uc dot edu
2006-08-09 23:02 ` whaley at cs dot utsa dot edu
2006-08-10  6:52 ` paolo dot bonzini at lu dot unisi dot ch
2006-08-10 14:08 ` whaley at cs dot utsa dot edu
2006-08-10 14:29 ` paolo dot bonzini at lu dot unisi dot ch
2006-08-10 15:16 ` whaley at cs dot utsa dot edu
2006-08-10 15:22 ` paolo dot bonzini at lu dot unisi dot ch
2006-08-11  9:19 ` uros at kss-loka dot si
2006-08-11 13:26 ` bonzini at gcc dot gnu dot org
2006-08-11 14:10 ` [Bug target/27827] [4.0 " bonzini at gnu dot org
2006-08-11 15:22 ` whaley at cs dot utsa dot edu
2006-08-23 10:36 ` oliver dot jennrich at googlemail dot com
2006-10-07 10:06 ` steven at gcc dot gnu dot org
2007-02-13  2:59 ` pinskia at gcc dot gnu dot org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).