public inbox for gcc@gcc.gnu.org
* 3.0 vs 3.0.1 on oopack's Max
@ 2001-09-07  6:43 Paolo Carlini
  2001-09-07 10:14 ` Tim Prince
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Paolo Carlini @ 2001-09-07  6:43 UTC (permalink / raw)
  To: gcc; +Cc: jh, jbuck

Hi all,

once more, I'm writing to the list to describe a recent performance
regression :( on a simple benchmark.

On my system (PII-400, 256 M, glibc2.2.4, binutils2.11.2) 3.0.1 produces
code much slower than 3.0 for the Max test of the oopack suite. In this
test the following two styles (the first "C-style", the second
"OOP-style") are compared:

void MaxBenchmark::c_style() const  // Compute max of vector (C-style)
{
    double max = U[0];
    for( int k=1; k<M; k++ )   // Loop over vector elements
        if( U[k] > max )
            max=U[k];
    MaxResult = max;
}

inline int Greater( double i, double j )
{
    return i>j;
}

void MaxBenchmark::oop_style() const   // Compute max of vector (OOP-style)
{
    double max = U[0];
    for( int k=1; k<M; k++ )   // Loop over vector elements
        if( Greater( U[k], max ) )
            max=U[k];
    MaxResult = max;
}
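
The excerpt above relies on MaxBenchmark members (U, M, MaxResult) that are not shown. For anyone who wants to try the two loops without the full oopack source, here is a minimal self-contained sketch. The free functions max_c_style and max_oop_style and the use of std::vector are assumptions made for illustration; they are not part of oopack_v1p8.C.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same comparison helper as in the benchmark excerpt.
inline int Greater(double i, double j)
{
    return i > j;
}

// Hypothetical stand-in for MaxBenchmark::c_style(): direct comparison.
double max_c_style(const std::vector<double>& u)
{
    assert(!u.empty());
    double max = u[0];
    for (std::size_t k = 1; k < u.size(); k++)   // Loop over vector elements
        if (u[k] > max)
            max = u[k];
    return max;
}

// Hypothetical stand-in for MaxBenchmark::oop_style(): comparison through
// an inline function that returns int, which is what exposes the setcc path.
double max_oop_style(const std::vector<double>& u)
{
    assert(!u.empty());
    double max = u[0];
    for (std::size_t k = 1; k < u.size(); k++)   // Loop over vector elements
        if (Greater(u[k], max))
            max = u[k];
    return max;
}
```

Both functions must return the same value for any non-empty input; only the generated code for the comparison is expected to differ.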

Now, if I compile oopack_v1p8.C with 3.0 and with 3.0.1 on my system
with

    g++ -O2 -finline-limit=600 oopack_v1p8.C

this is what I get as run times:

3.0.1
-----
                         Seconds       Mflops
Test       Iterations     C    OOP     C    OOP  Ratio
----       ----------  -----------  -----------  -----
Max            500000    7.1  19.6   70.6  25.4    2.8


3.0
---
                         Seconds       Mflops
Test       Iterations     C    OOP     C    OOP  Ratio
----       ----------  -----------  -----------  -----
Max            500000    7.1   9.7   70.5  51.7    1.4
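
As a sanity check on the two tables above, the Ratio and Mflops columns can be recomputed from the Seconds column. The sketch below assumes a vector length of 1000 (inferred from the `cmp $0x3e7,%edx` loop bound in the 3.1 listing later in this message, not stated in the tables) and one floating-point operation per element; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Ratio column: OOP-style seconds divided by C-style seconds.
double ratio(double c_seconds, double oop_seconds)
{
    assert(c_seconds > 0.0);
    return oop_seconds / c_seconds;
}

// Mflops, assuming one floating-point operation per vector element
// per iteration (an assumption, see the lead-in).
double mflops(double iterations, double vector_length, double seconds)
{
    assert(seconds > 0.0);
    return iterations * vector_length / seconds / 1e6;
}

// Round to one decimal place, as the tables do.
double round1(double x)
{
    return std::round(x * 10.0) / 10.0;
}
```

With these definitions, round1(ratio(7.1, 19.6)) gives 2.8 and round1(ratio(7.1, 9.7)) gives 1.4, matching the tables. The recomputed Mflops agree with the reported values only to within the rounding of the Seconds column.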


It turns out that the core loop over k is compiled in the same way for
the "C-style" case by both the compilers:

 80488d0: dd 02                 fldl   (%edx)
 80488d2: dd e1                 fucom  %st(1)
 80488d4: df e0                 fnstsw %ax
 80488d6: 9e                    sahf
 80488d7: 76 04                 jbe    80488dd <_ZNK12MaxBenchmark7c_styleEv+0x2d>
 80488d9: dd d9                 fstp   %st(1)
 80488db: eb 02                 jmp    80488df <_ZNK12MaxBenchmark7c_styleEv+0x2f>
 80488dd: dd d8                 fstp   %st(0)
 80488df: 83 c2 08              add    $0x8,%edx
 80488e2: 49                    dec    %ecx
 80488e3: 79 eb                 jns    80488d0 <_ZNK12MaxBenchmark7c_styleEv+0x20>


On the other hand, for the "OOP-style" case:

3.0.1
-----

 8048910: dd 01                 fldl   (%ecx)
 8048912: dd e1                 fucom  %st(1)
 8048914: df e0                 fnstsw %ax
 8048916: 9e                    sahf
 8048917: 0f 97 c0              seta   %al
 804891a: 83 e0 01              and    $0x1,%eax
 804891d: 74 04                 je     8048923 <_ZNK12MaxBenchmark9oop_styleEv+0x33>
 804891f: dd d9                 fstp   %st(1)
 8048921: eb 02                 jmp    8048925 <_ZNK12MaxBenchmark9oop_styleEv+0x35>
 8048923: dd d8                 fstp   %st(0)
 8048925: 83 c1 08              add    $0x8,%ecx
 8048928: 4a                    dec    %edx
 8048929: 79 e5                 jns    8048910 <_ZNK12MaxBenchmark9oop_styleEv+0x20>

3.0
---

 8048910: dd 03                 fldl   (%ebx)
 8048912: 31 d2                 xor    %edx,%edx
 8048914: dd e1                 fucom  %st(1)
 8048916: df e0                 fnstsw %ax
 8048918: 9e                    sahf
 8048919: 0f 97 c2              seta   %dl
 804891c: 85 d2                 test   %edx,%edx
 804891e: 74 04                 je     8048924 <_ZNK12MaxBenchmark9oop_styleEv+0x34>
 8048920: dd d9                 fstp   %st(1)
 8048922: eb 02                 jmp    8048926 <_ZNK12MaxBenchmark9oop_styleEv+0x36>
 8048924: dd d8                 fstp   %st(0)
 8048926: 83 c3 08              add    $0x8,%ebx
 8048929: 49                    dec    %ecx
 804892a: 79 e4                 jns    8048910 <_ZNK12MaxBenchmark9oop_styleEv+0x20>


By the way, recent 3.1 snapshots show the same performance regression
relative to a few weeks ago:

3.1 20010902 (experimental)
---------------------------

                         Seconds       Mflops
Test       Iterations     C    OOP     C    OOP  Ratio
----       ----------  -----------  -----------  -----
Max            500000    7.1  19.7   70.1  25.4    2.8


"C-style"
---------

 80488e0: dd 04 d5 60 b2 04 08  fldl   0x804b260(,%edx,8)
 80488e7: dd e1                 fucom  %st(1)
 80488e9: df e0                 fnstsw %ax
 80488eb: 9e                    sahf
 80488ec: 76 22                 jbe    8048910 <_ZNK12MaxBenchmark7c_styleEv+0x40>
 80488ee: dd d9                 fstp   %st(1)
 80488f0: 42                    inc    %edx
 80488f1: 81 fa e7 03 00 00     cmp    $0x3e7,%edx
 80488f7: 7e e7                 jle    80488e0 <_ZNK12MaxBenchmark7c_styleEv+0x10>


"OOP-style"
-----------

 8048930: dd 04 d5 60 b2 04 08  fldl   0x804b260(,%edx,8)
 8048937: dd e1                 fucom  %st(1)
 8048939: df e0                 fnstsw %ax
 804893b: 9e                    sahf
 804893c: 0f 97 c0              seta   %al
 804893f: 83 e0 01              and    $0x1,%eax
 8048942: 74 1c                 je     8048960 <_ZNK12MaxBenchmark9oop_styleEv+0x40>
 8048944: dd d9                 fstp   %st(1)
 8048946: 42                    inc    %edx
 8048947: 81 fa e7 03 00 00     cmp    $0x3e7,%edx
 804894d: 7e e1                 jle    8048930 <_ZNK12MaxBenchmark9oop_styleEv+0x10>



I hope that some of the gcc developers (perhaps Jan Hubicka?) may take
care of this disappointing behavior!

Regards,
Paolo Carlini.


* Re: 3.0 vs 3.0.1 on oopack's Max
  2001-09-07  6:43 3.0 vs 3.0.1 on oopack's Max Paolo Carlini
@ 2001-09-07 10:14 ` Tim Prince
  2001-09-07 16:53 ` Richard Henderson
  2001-09-08  9:03 ` Jan Hubicka
  2 siblings, 0 replies; 7+ messages in thread
From: Tim Prince @ 2001-09-07 10:14 UTC (permalink / raw)
  To: pcarlini, gcc; +Cc: jh, jbuck

I don't know why you would not use at least -march=pentiumpro for
this compilation.  It used to require -ffast-math as well for
good performance, but I have noticed that has become less
necessary.  I find that gcc has been out-performing several well
known commercial compilers on similar operations, although my
source is not identical to yours.

* Re: 3.0 vs 3.0.1 on oopack's Max
  2001-09-07  6:43 3.0 vs 3.0.1 on oopack's Max Paolo Carlini
  2001-09-07 10:14 ` Tim Prince
@ 2001-09-07 16:53 ` Richard Henderson
  2001-09-08  2:15   ` Paolo Carlini
  2001-09-08  9:03 ` Jan Hubicka
  2 siblings, 1 reply; 7+ messages in thread
From: Richard Henderson @ 2001-09-07 16:53 UTC (permalink / raw)
  To: Paolo Carlini; +Cc: gcc, jh, jbuck

On Fri, Sep 07, 2001 at 03:41:49PM +0200, Paolo Carlini wrote:
> once more, I'm writing to the list to describe a recent performance
> regression :( on a simple benchmark.

You didn't mention how the compiler was configured, 
or what options you gave for compilation.

The performance regression is caused by

3.0:
>  8048912: 31 d2                 xor    %edx,%edx
>  8048919: 0f 97 c2              seta   %dl
>  804891c: 85 d2                 test   %edx,%edx

3.0.1:
>  8048917: 0f 97 c0              seta   %al
>  804891a: 83 e0 01              and    $0x1,%eax

The 3.0.1 version has a partial register stall on %al.
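
Some background on why the shorter sequence is the slower one: on P6-family processors such as the Pentium II, reading a full 32-bit register right after writing only its low byte stalls the pipeline, unless the register was cleared beforehand with the xor-with-itself idiom. The toy model below counts such stalls for the two sequences. It is a sketch for illustration, not a simulator; the Reg32 type, its members, and the two seq_ functions are all made up here.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of one 32-bit register on a P6-family core (hypothetical type).
struct Reg32 {
    std::uint32_t value = 0;
    bool partial_pending = false;  // low byte written, full register not yet rewritten
    bool zeroed = false;           // last full write was the xor r,r zeroing idiom
    int stalls = 0;                // number of partial register stalls counted

    void xor_self()                // e.g. xor %edx,%edx
    {
        value = 0;
        partial_pending = false;
        zeroed = true;
    }

    void write_low8(std::uint8_t b)   // e.g. seta %dl or seta %al
    {
        value = (value & 0xFFFFFF00u) | b;
        if (!zeroed)
            partial_pending = true;   // upper bits unknown to the renamer
    }

    std::uint32_t read_full()         // e.g. test %edx,%edx or and $1,%eax
    {
        if (partial_pending) {
            ++stalls;
            partial_pending = false;
        }
        return value;
    }

    void write_full(std::uint32_t v)  // ordinary 32-bit result write
    {
        value = v;
        partial_pending = false;
        zeroed = false;
    }
};

// 3.0 sequence: xor %edx,%edx ; seta %dl ; test %edx,%edx
bool seq_3_0(Reg32& r, bool above)
{
    r.xor_self();
    r.write_low8(above ? 1 : 0);
    return r.read_full() != 0;
}

// 3.0.1 sequence: seta %al ; and $0x1,%eax
bool seq_3_0_1(Reg32& r, bool above)
{
    r.write_low8(above ? 1 : 0);
    std::uint32_t v = r.read_full() & 1u;
    r.write_full(v);
    return v != 0;
}
```

In this model both sequences compute the same condition, but seq_3_0 never increments the stall counter while seq_3_0_1 increments it on every call, which is consistent with the OOP-style loop slowing down while the C-style loop (which uses no setcc) is unaffected.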



r~


* Re: 3.0 vs 3.0.1 on oopack's Max
  2001-09-07 16:53 ` Richard Henderson
@ 2001-09-08  2:15   ` Paolo Carlini
  0 siblings, 0 replies; 7+ messages in thread
From: Paolo Carlini @ 2001-09-08  2:15 UTC (permalink / raw)
  To: Richard Henderson; +Cc: gcc

Hi and thank you very much for your feedback!

--- Richard Henderson <rth@redhat.com> wrote:
> You didn't mention how the compiler was configured, 
> or what options you gave for compilation.

Well, I did mention that the options were:

  -O2 -finline-limit=600

The compilers were built with a trivial:

  --prefix=... --enable-languages=c,c++,f77

> The performance regression is caused by
> 
> 3.0:
> >  8048912: 31 d2                 xor    %edx,%edx
> >  8048919: 0f 97 c2              seta   %dl
> >  804891c: 85 d2                 test   %edx,%edx
> 
> 3.0.1:
> >  8048917: 0f 97 c0              seta   %al
> >  804891a: 83 e0 01              and    $0x1,%eax
> 
> The 3.0.1 version has a partial register stall on
> %al.

Thanks for your analysis! I'm really puzzled, since
(of course!) only very safe patches went into the
branch between 3.0.0 and 3.0.1.

Is there something else I can do for you, RTL dumps or
whatever?!?

Cheers,
Paolo.




* Re: 3.0 vs 3.0.1 on oopack's Max
  2001-09-07  6:43 3.0 vs 3.0.1 on oopack's Max Paolo Carlini
  2001-09-07 10:14 ` Tim Prince
  2001-09-07 16:53 ` Richard Henderson
@ 2001-09-08  9:03 ` Jan Hubicka
  2001-09-10  9:33   ` Paolo Carlini
  2 siblings, 1 reply; 7+ messages in thread
From: Jan Hubicka @ 2001-09-08  9:03 UTC (permalink / raw)
  To: Paolo Carlini; +Cc: gcc, jh, jbuck, rth

> 
> I hope that some of the gcc developers (perhaps Jan Hubicka?) may take
> care of this disappointing behavior!
:) OK, it looks like the problem is in the setcc code, which causes a partial
register stall.
Otherwise the 3.1 code looks good to me, and it also performs quite happily
on Athlon, which is free of the issue.

The problem is relatively old: the result of setcc is a QImode register.
The AND following it gets combined to take a (SUBREG:SI (QIreg)) argument.
In most cases GCC gets around this by avoiding 8-bit computations altogether,
but setcc must be 8-bit.  Richard has recently changed the way setcc is
expanded, so that it is optimized better. This is one of the resulting
optimizations, and sadly it is not a very lucky one...

I am not quite sure how to prevent gcc from doing this optimization.
I will take a look overnight.

Honza
> 
> Regards,
> Paolo Carlini.


* Re: 3.0 vs 3.0.1 on oopack's Max
  2001-09-08  9:03 ` Jan Hubicka
@ 2001-09-10  9:33   ` Paolo Carlini
  2001-09-10  9:37     ` Jan Hubicka
  0 siblings, 1 reply; 7+ messages in thread
From: Paolo Carlini @ 2001-09-10  9:33 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc, rth

Hi,

Jan Hubicka wrote:

> :) OK, it looks like the problem is in the setcc code, which causes a partial
> register stall.
> Otherwise the 3.1 code looks good to me, and it also performs quite happily
> on Athlon, which is free of the issue.

But sadly, on Intel processors we now have a clear performance regression on a
very popular benchmark...
I know well that such short tests are not representative of real code but,
nonetheless, from a "marketing" point of view...

> The problem is relatively old: the result of setcc is a QImode register.
> The AND following it gets combined to take a (SUBREG:SI (QIreg)) argument.
> In most cases GCC gets around this by avoiding 8-bit computations altogether,
> but setcc must be 8-bit.  Richard has recently changed the way setcc is
> expanded, so that it is optimized better. This is one of the resulting
> optimizations, and sadly it is not a very lucky one...

Thanks for the analysis!
(At this point all of this is still quite obscure to me, unfortunately... but
for me it is a good occasion to learn more!)

By the way, I found only one change to setcc (ix86_expand_setcc) involving the
branch, that is:

    http://gcc.gnu.org/ml/gcc-patches/2001-07/msg01627.html

Which was a follow up to:

    http://gcc.gnu.org/ml/gcc-patches/2001-07/msg01552.html

> I am not quite sure how to avoid gcc from doing this optimization.
> I will take a look overnight.

Thanks!
Paolo.



* Re: 3.0 vs 3.0.1 on oopack's Max
  2001-09-10  9:33   ` Paolo Carlini
@ 2001-09-10  9:37     ` Jan Hubicka
  0 siblings, 0 replies; 7+ messages in thread
From: Jan Hubicka @ 2001-09-10  9:37 UTC (permalink / raw)
  To: Paolo Carlini; +Cc: Jan Hubicka, gcc, rth

> Thanks for the analysis!
> (At this point all of this is still quite obscure to me, unfortunately... but
> for me it is a good occasion to learn more!)
:)
> 
> By the way, I found only one change to setcc (ix86_expand_setcc) involving the
> branch, that is:
> 
>     http://gcc.gnu.org/ml/gcc-patches/2001-07/msg01627.html

That's the change.
I will try to come up with something to solve it tomorrow, but I am still not
quite sure how to do it.

One way I see is to fold both operations (setcc and movzx) into one instruction
before reload, but this just papers over one particular special case.

Honza
> 
> Which was a follow up to:
> 
>     http://gcc.gnu.org/ml/gcc-patches/2001-07/msg01552.html
> 
> > I am not quite sure how to avoid gcc from doing this optimization.
> > I will take a look overnight.
> 
> Thanks!
> Paolo.
> 
