* Re: GCC performance regression - up to 20% ? [not found] ` <20020421113238.GC16602@atrey.karlin.mff.cuni.cz.suse.lists.egcs> @ 2002-04-21 7:58 ` Andi Kleen 0 siblings, 0 replies; 16+ messages in thread From: Andi Kleen @ 2002-04-21 7:58 UTC (permalink / raw) To: Jan Hubicka; +Cc: aj, gcc Jan Hubicka <jh@suse.cz> writes: > > No, the SPECint numbers are quite consistently higher than in any previous > release. See http://www.suse.de/~aj/SPEC > In fact no previous release had such a huge gap in performance. http://www.suse.de/~aj/SPEC/CFP/sandbox-gcc-3_1-branch/index.html 179.art seems to have regressed badly a few days ago. Is that just a bad run, or the price for some bug fix? -Andi ^ permalink raw reply [flat|nested] 16+ messages in thread
* GCC performance regression - up to 20% ? @ 2002-04-20 18:13 Michel LESPINASSE 2002-04-21 3:41 ` Andreas Jaeger ` (2 more replies) 0 siblings, 3 replies; 16+ messages in thread From: Michel LESPINASSE @ 2002-04-20 18:13 UTC (permalink / raw) To: gcc list Hi, I downloaded the latest 3.1 snapshot (20020415) and ran some performance tests. So far I've been impressed by the FP performance, but kinda disappointed by the integer performance. The benchmarks I've run are two libraries I maintain, libmpeg2 and liba52. These are used by several open-source dvd players, and are quite CPU intensive (especially libmpeg2). So here are my results, using gcc 2.95 as a reference: First the good news: liba52 (mostly FP intensive workload) on athlon tbird 950, using -mcpu=pentiumpro: gcc-3.0 is between 4.5% and 6.5% faster than 2.95.4 depending on streams gcc-3.1 snapshot is between 8% and 9.5% faster than 2.95.4 From these measurements, 3.1 has very nice performance, very close to intel's icc. Great work ! Also using -march=athlon-tbird and generating sse code, I can get a few extra % of performance. Now the bad news: for libmpeg2, which is an integer-only workload, I get a 10% to 20% performance regression between 2.95.4 and 3.1... 3.0 was already slower than 2.95.4, but 3.1 seems to be worse for this workload at least. libmpeg2, on athlon tbird 950, with mmx optimizations: gcc-3.0 is about 2% slower than 2.95.4 gcc-3.1 snapshot is about 10% slower than 2.95.4 libmpeg2, on athlon tbird 950, using pure C code: gcc-3.0 is about 4.5% slower than 2.95.4 gcc-3.1 snapshot is about 5.5% slower than 2.95.4 libmpeg2, on celeron 366, with mmx optimizations: gcc-3.0 is about 4% slower than 2.95.4 gcc-3.1 snapshot is about 20.5% slower than 2.95.4 (!!!!) These results are all very repeatable. The celeron 366 results are the most worrying, as this processor already has borderline performance for decoding mpeg2 streams. 
Is there a known performance regression in current GCCs (say, do they get lower SPECint scores ?) or is it only with my code ? Also, is there anything I could do in my code to enhance performance with newer gcc versions ? One thing I noticed is that the 3.1 snapshot produces less inlining than 3.0 or 2.95. This probably accounts for some of the slowdown I see when using mmx optimizations, as my mmx routines are written using a few routines that I really expect to get inlined. Is there any way I can regain control over this, so that gcc honours the inline keyword ? I have not managed to do this either. BTW, these two apps I mentioned can be found at http://libmpeg2.sourceforge.net/ http://liba52.sourceforge.net/ Puzzled, -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-20 18:13 Michel LESPINASSE @ 2002-04-21 3:41 ` Andreas Jaeger 2002-04-21 5:46 ` Jan Hubicka 2002-04-22 1:47 ` Gerald Pfeifer 2 siblings, 0 replies; 16+ messages in thread From: Andreas Jaeger @ 2002-04-21 3:41 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list Michel LESPINASSE <walken@zoy.org> writes: > Hi, > > I have downloaded the latest 3.1 snapshot (20020415) and ran some > performance tests. So far I've been impressed by the FP performance, > but kinda disappointed by the integer performance. > > The benchmarks I've run are two libraries I maintain, libmpeg2 and > liba52. These are used by several open-source dvd players, and are > quite CPU intensive (especially libmpeg2). So here are my results, > using gcc 2.95 as a reference: > > First the good news: liba52 (mostly FP intensive workload) > on athlon tbird 950, using -mcpu=pentiumpro: > gcc-3.0 is between 4.5% and 6.5% faster than 2.95.4 depending on streams > gcc-3.1 snapshot is between 8% and 9.5% faster than 2.95.4 > from these measurements 3.1 has a very nice performance, very close to > intel's icc. Great work ! Also using -march=athlon-tbird and > generating sse code, I can get yet a few extra % of performance. > > Now the bad news: for libmepg2, which is an integer-only workload, I > get a 10% to 20% performance regression between 2.95.4 and 3.1... 3.0 > was already slower than 2.95.4, but 3.1 seems to be worse for this > workload at least. > > libmpeg2, on athlon tbird 950, with mmx optimizations: > gcc-3.0 is about 2% slower than 2.95.4 > gcc-3.1 snapshot is about 10% slower than 2.95.4 > > libmpeg2, on athlon tbird 950, using pure C code: > gcc-3.0 is about 4.5% slower than 2.95.4 > gcc-3.1 snapshot is about 5.5% slower than 2.95.4 > > libmpeg2, on celeron 366, with mmx optimizations: > gcc-3.0 is about 4% slower than 2.95.4 > gcc-3.1 snapshot is about 20.5% slower than 2.95.4 (!!!!) > > These results are all very repeatable. 
the celeron 366 results are the > most worrying, as this processor already has borderline performance > for decoding mpeg2 streams. > > Is there a known performance regression in current GCCs (say, do they > get lower SPECint scores ?) or is it only with my code ? Can you distill a test case that is as small as possible (the optimal way would be just the loop that causes the problem) and show it to us? That way it's much easier to discuss the issues and start looking into what needs to be done. > Also, is there anything I could do in my code to enhance performance > with newer gcc versions ? One thing I noticed is that 3.1 snapshot > produces less inlining than 3.0 or 2.95. This probably accounts for > some of the slowdown I see when using mmx optimizations, as my mmx > routines are written using a few routines that I really expect to get > inlined. Is there any way I can get back control about that, so that > gcc honours the inline keyword ? I have not managed to do this either. Try -finline-limit=2000 but check the manual for the exact name of the switch. Andreas -- Andreas Jaeger SuSE Labs aj@suse.de private aj@arthur.inka.de http://www.suse.de/~aj ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-20 18:13 Michel LESPINASSE 2002-04-21 3:41 ` Andreas Jaeger @ 2002-04-21 5:46 ` Jan Hubicka 2002-04-21 23:46 ` Michel LESPINASSE 2002-04-22 1:47 ` Gerald Pfeifer 2 siblings, 1 reply; 16+ messages in thread From: Jan Hubicka @ 2002-04-21 5:46 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list > > libmpeg2, on athlon tbird 950, with mmx optimizations: > gcc-3.0 is about 2% slower than 2.95.4 > gcc-3.1 snapshot is about 10% slower than 2.95.4 > > libmpeg2, on athlon tbird 950, using pure C code: > gcc-3.0 is about 4.5% slower than 2.95.4 > gcc-3.1 snapshot is about 5.5% slower than 2.95.4 > > libmpeg2, on celeron 366, with mmx optimizations: > gcc-3.0 is about 4% slower than 2.95.4 > gcc-3.1 snapshot is about 20.5% slower than 2.95.4 (!!!!) > > These results are all very repeatable. the celeron 366 results are the > most worrying, as this processor already has borderline performance > for decoding mpeg2 streams. Are you able to figure out what exactly makes the code slow? Having a self-contained testcase will definitely help a lot. What flags do you use? I would be quite curious whether using profile feedback helps (see the documentation of -fprofile-arcs and -fbranch-probabilities); you could just have some badly predicted branch in the innermost loop. The problem with such code usually is that it is tuned to avoid problems on one particular version of gcc, so even when the new version is faster overall, it is slower in such places. We've hit a similar case with Athlon matrix multiplication code, and such problems are usually easy to fix on the gcc side. > > Is there a known performance regression in current GCCs (say, do they > get lower SPECint scores ?) or is it only with my code ? No, the SPECint numbers are quite consistently higher than in any previous release. See http://www.suse.de/~aj/SPEC In fact no previous release had such a huge gap in performance. 
> > Also, is there anything I could do in my code to enhance performance > with newer gcc versions ? One thing I noticed is that 3.1 snapshot > produces less inlining than 3.0 or 2.95. This probably accounts for > some of the slowdown I see when using mmx optimizations, as my mmx > routines are written using a few routines that I really expect to get > inlined. Is there any way I can get back control about that, so that > gcc honours the inline keyword ? I have not managed to do this either. There is a parameter to increase the inline threshold, as well as an always_inline function attribute. See the documentation. Honza > > BTW, these two apps I mentioned can be found at > http://libmpeg2.sourceforge.net/ > http://liba52.sourceforge.net/ > > Puzzled, > > -- > Michel "Walken" LESPINASSE > Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-21 5:46 ` Jan Hubicka @ 2002-04-21 23:46 ` Michel LESPINASSE 2002-04-22 0:17 ` Andreas Jaeger 2002-04-22 7:11 ` Carlo Wood 0 siblings, 2 replies; 16+ messages in thread From: Michel LESPINASSE @ 2002-04-21 23:46 UTC (permalink / raw) To: Jan Hubicka, Andreas Jaeger; +Cc: gcc list Hi, I spent the afternoon looking at this issue and trying different flags. I think I got some interesting results, though I still don't get the full picture. The first thing I tweaked was to make the inlining work the way I wanted. As I understand it, in gcc 2.95 and 3.0, when compiling with usual options (-O3), the behaviour was that functions declared inline were always inlined, and functions not declared inline were inlined only if gcc thought they were simple enough. In the gcc 3.1 snapshot, however, it looks like the inline keyword only makes the function an inline candidate. I will argue later on why I think the old behaviour (inline being always honoured) is more useful (basically, function specialization is harder to do if inline is not honoured). In the meantime, I worked around the issue by using an absurdly high value, -finline-limit=6000 (I tried 2000 first, which was not sufficient), and then I also added -fno-inline-functions so that I don't get everything inlined when I don't ask for it. With the custom inlining, gcc-3.1 snapshot is about 4% slower than gcc-2.95, on my athlon tbird. This is progress, since without the custom inlining, the slowdown was 10%. Then I tried to figure out where the slowdown is, using gprof. And this is where things get really interesting: gprof tells me that the code compiled with 3.1 is faster, but 'time' tells me that the user time spent executing that code is higher with 3.1 than with 2.95. I'm not sure what to make of this, but I think this might give you some clues, so I'll describe it in more detail. I'm not sure what the overhead is, but it seems to be right in gprof's blind spot. 
I have to describe my gprof methodology first. 'normal' gprof (i.e. compiling every file with -pg) seems to have a high overhead for me, plus it conflicts with -fomit-frame-pointer, which I usually use. So I tend to use what I'll call 'light' gprof, which is as follows: everything is compiled with -g -O3 -fomit-frame-pointer -mcpu=pentiumpro except main.c, which is compiled with -g -O3 -p and is not cpu intensive. 'light' gprof cannot help me figure out call graphs, but it should normally be good enough to obtain a flat profile. When using gcc 2.95, 'light' gprof works great. In a flat profile, the last number in the 'cumulative seconds' column always matches (within 0.1s) the user time as reported by the 'time' command. I think I can trust the flat profile information. When using gcc 3.0 or the 3.1 snapshot though, there is a several-seconds gap between 'cumulative seconds' and 'user time'. I don't understand what happens during this time - could it be that 3.x has a higher overhead for function calls, and that this overhead is not accounted for in gprof ? I don't understand what happens here, but I get the feeling that this might be key to the slowdown I observe. To summarize the gprof thing, I observe 3.1 gprof time < 2.95 gprof time = 2.95 user time < 3.1 user time If I use 'normal' gprof (every file gets -pg and I remove the -fomit-frame-pointer), that inflates the user times a lot, but I still get 3.1 gprof time < 2.95 gprof time and 2.95 user time < 3.1 user time. Also to answer Jan's question, I did try to use -fbranch-probabilities and it helped; with this option the 3.1 snapshot is about 0.5% slower (in user time) than 2.95, instead of 4% slower without. It's still frustrating though, because gprof claims 3.1 is about 10% faster than 2.95, so there may be a lot of untapped performance. OK, so this is all I can say for now. I hope someone who knows gprof internals better than I do can explain why gprof returns times that are smaller in the 3.1 snapshot vs. 
2.95, while time does not agree. I thought I should add a few comments about the structure of the libmpeg2 code, as I suppose it is somewhat unusual. The time-consuming loop is pretty big, and includes several function calls. Each loop execution decodes a full mpeg2 macroblock (= 256 pixels); during this decoding it calls the IDCT (cosine transform) function up to 6 times, and some MC (motion compensation) functions up to 12 times. All these calls are indirect, using function pointers. These called functions are all kinda short, executing in about 0.4 microseconds per call on average. Yes, that's a lot of function calls, and I suspect SPECint does not do as many, which could maybe explain why it's not seeing the same performance regression that I see ? It would seem consistent with the gprof blind spot thing, too. Finally, I thought I should include a small explanation about what I do with inlines and why I'd like to have a gcc option so that the inline keyword always inlines a function, instead of just making that function an inline candidate. One example of that would be in the MC functions mentioned above. To get acceptable speed, these are written using mmx assembly operations. Actually they even have several implementations: one in straight C for compatibility, one in straight mmx, one using sse integer instructions (as present on PIII and athlon), and one using 3dnow instructions. It's easy to select at init time which implementation to use, since the MC functions are only called through function pointers. The only difference between the sse version and the 3dnow version is that the sse version uses the pavgb instruction, while the 3dnow version uses pavgusb. These instructions have different opcodes but the same behaviour, so I wrote this using an inline function:

static inline void MC_generic (...., int cpu)
{
    .... do stuff ....
    if (cpu == CPU_3DNOW)
        pavgusb (....);
    else
        pavgb ();
    .... do more stuff ....
}

void MC_3dnow (.....)
{
    MC_generic (....., CPU_3DNOW);
}

void MC_sse (.....)
{
    MC_generic (....., CPU_SSE);
}

In gcc-2.95 and gcc-3.0, this construct works out nicely since MC_generic gets inlined, and then the compiler figures out that the cpu test in it is a constant and generates clean code for both the 3dnow and the sse function. This kind of specialisation is sometimes very convenient, but it requires that the inline keyword performs unconditional inlining, not subject to compiler heuristics. I would really love to see a gcc option to make inlining unconditional when using the inline keyword, and have gcc use its heuristics when there is no such keyword. This email is longer than I thought it would be; thanks a lot to those of you who're still reading me :) Cheers, -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-21 23:46 ` Michel LESPINASSE @ 2002-04-22 0:17 ` Andreas Jaeger 2002-04-22 17:42 ` Michel LESPINASSE 2002-04-22 7:11 ` Carlo Wood 1 sibling, 1 reply; 16+ messages in thread From: Andreas Jaeger @ 2002-04-22 0:17 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Jan Hubicka, gcc list Michel LESPINASSE <walken@zoy.org> writes: > Hi, > > I spent the afternoon looking at this issue and trying different > flags. I think I got some interesting results, though I still dont get > the full picture. > > The first thing I tweaked was to make the inlining work the way I > wanted. As I understand it, in gcc 2.95 and 3.0, when compiling with > usual options (-O3), the behaviour was that functions declared inline > were always inlined, and functions not declared inline were inlined > only if gcc thought they were simple enough. While in gcc 3.1 > snapshot, it looks like the inline keyword only makes the function an > inline candidate. I will argue later on why I think the old behaviour > (inline being always honoured) is more useful (basically, function > specialization is harder to do if inline is not honoured). In the mean > time, I worked around the issue by using an absurdly high value > -finline-limit=6000 (I tried 2000 first which was not sufficient), and > then I also added -fno-inline-functions so that I dont get everything > inlined when I dont ask for it. > > With the custom inlining, gcc-3.1 snapshot is about 4% slower than > gcc-2.95, on my athlon tbird. This is a progress, since without the > custom inlining, the slowdown was 10%. > > Then I tried to figure out where the slowdown is, using gprof. And > this is where things get really interesting: gprof tells me that the > code compiled with 3.1 is faster, but 'time' tells me that the user > time spent executing that code is higher with 3.1 than with 2.95. 
I'm > not sure what to make of this, but I think this might give you some > clues, so I'll describe it in more detail. I'm not sure what the > overhead is, but it seems to be right in gprof's blind spot. Using -pg changes the binary and the results might not be the same as without the flag. You could try to use the Performance counters of your Athlon using e.g. oprofile (http://oprofile.sourceforge.net). > I have to describe my gprof methodology first. 'normal' gprof > (i.e. compiling every file with -pg) seems to have a high overhead for > me, plus it conflicts with -fomit-frame-pointer which I usualy use. So > I tend to use what I'll call 'light' gprof, which is as follows: > everything is compiled with -g -O3 -fomit-frame-pointer -mcpu=pentiumpro > except main.c which is compiled with -g -O3 -p and is not cpu intensive. > 'light' gprof can not help me figure out call graphs, but it should > normally be good enough to obtain a flat profile. > > When using gcc 2.95, 'light' gprof works great. In a flat profile, the > last number in the 'cumulative seconds' column always matches (within > 0.1s) the user time as reported by the 'time' command. I think I can > trust the flat profile information. > > When using gcc 3.0 or 3.1 shapshot though, there is a several-seconds > gap between 'cumulative seconds' and 'user time'. I dont understand > what happens during this time - could it be that 3.x has a higher > overhead for function calls, and that this overhead is not accounted > for in gprof ? I dont understand what happens here, but I get the > feeling that this might be key to the slowdown I observe. > > To summarize the gprof thing, I observe > 3.1 gprof time < 2.95 gprof time = 2.95 user time < 3.1 user time > > If I use 'normal' gprof (every file gets -pg and I remove the > -fomit-frame-pointer), that inflates the user times a lot, but I still > get 3.1 gprof time < 2.95 gprof time and 2.95 user time < 3.1 user time. 
> > Also to answer Jan's question, I did try to use -fbranch-probabilities > and it helped, with this option 3.1 snapshot is about 0.5% slower (in > user time) than 2.95, instead of 4% slower without. It's still > frustrating though, because gprof pretends 3.1 is about 10% faster > than 2.95, so there may be a lot of untapped performance. > > > OK, so this is all I can say for now. I hope someone will know gprof > internals better than I do and understand why gprof returns times that > are smaller in 3.1 shapshot vs. 2.95, while time does not agree. > > I thought I should add a few comments about the structure of libmpeg2 > code, as I suppose it is somewhat unusual. The time-consuming loop is > pretty big, and includes several function calls. Each loop execution > decodes a full mpeg2 macroblocks (=256 pixels), during this decoding > it calls the IDCT (cosine transform) function up to 6 times, and some > MC (motion compensation) functions up to 12 times. All these calls are > indirect, using function pointers. These called functions are all > kinda short, executing in about 0.4 microseconds per call on average. > Yes, thats a lot of function calls, and I suspect SPECint does not do > as many, which could maybe explain why it's not seeing the same > performance regression that I see ? It would seem consistent with the > gprof blind spot thing, too. It would really help a lot if you could try to write some small program that behaves the same way (performance wise) as this routine libmpeg2. > > Finally, I thought I should include a small explanation about what I > do with inlines and why I'd like to have a gcc option so that the > inline keyword always inlines a function, instead of just making that > function an inline candidate. One example of that would be in the MC > functions mentionned above. To get acceptable speed, these are written > using mmx assembly operations. 
Actually they even have several > implementations, one in straight c for compatibility, one in straight > mmx, one using sse integer instructions (as present on PIII and > athlon), and one using 3dnow instructions. It's easy to select at init > time which implementation to use, since the MC functions are only > called thru function pointers. The only difference between the sse > version and the 3dnow version, is that the sse version uses the pavgb > instruction, while the 3dnow version uses pavgusb. These instructions > have different opcodes but the same behaviour, so I wrote this using > an inline function: > > static inline void MC_generic (...., int cpu) > { > .... do stuff .... > if (cpu == CPU_3DNOW) > pavgusb (....); > else > pavgb (); > .... do more stuff .... > } > > void MC_3dnow (.....) > { > MC_generic (....., CPU_3DNOW); > } > > void MC_sse (.....) > { > MC_generic (....., CPU_SSE); > } > > In gcc-2.95 and gcc-3.0, this construct works out nicely since > MC_generic gets inlined, and then the compiler figures out that the > cpu test in it is a constant and generates clean code for both the > 3dnow and the sse function. This kind of specialisation is sometimes > very convenient, but it requires that the inline keyword does an > unconditional inlining, not subject to compiler heuristics. I would > really love to see a gcc option to make inlining unconditional when > using the inline keyword, and have gcc use its heuristics when there > is no such keyword. You might want to use the always_inline function attribute for your inline functions in 3.1. It makes the inlining unconditional. > > This email is longer than I thought it would be, thanks a lot for > those who're still reading me :) ;-) Andreas -- Andreas Jaeger SuSE Labs aj@suse.de private aj@arthur.inka.de http://www.suse.de/~aj ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 0:17 ` Andreas Jaeger @ 2002-04-22 17:42 ` Michel LESPINASSE 2002-04-22 18:20 ` Andrew Pinski 0 siblings, 1 reply; 16+ messages in thread From: Michel LESPINASSE @ 2002-04-22 17:42 UTC (permalink / raw) To: Andreas Jaeger; +Cc: Jan Hubicka, gcc list On Mon, Apr 22, 2002 at 09:10:55AM +0200, Andreas Jaeger wrote: > You might want to use the always_inline function attribute for your > inline functions in 3.1. It makes the inlining unconditional. Thanks, that suggestion does solve my inlining problem. I think the documentation for this option is a bit confusing: it says "For functions declared inline, this attribute inlines the function even if no optimization level was specified." which is true I guess, but the documentation does not mention that when optimization is enabled gcc will not always inline functions declared inline, unless we use this attribute. Also I want to point out that this is a new behaviour, as gcc up to 3.0.4 apparently always inlined functions declared inline (when compiling with -O2 at least). I think either the documentation of always_inline should be made more clear, or the inline keyword should be made to behave like always_inline when optimizing with -O2 and above (as it did previously). -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 17:42 ` Michel LESPINASSE @ 2002-04-22 18:20 ` Andrew Pinski 2002-04-22 18:30 ` Carlo Wood 0 siblings, 1 reply; 16+ messages in thread From: Andrew Pinski @ 2002-04-22 18:20 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Andreas Jaeger, Jan Hubicka, gcc list On Monday, April 22, 2002, at 08:39 , Michel LESPINASSE wrote: > On Mon, Apr 22, 2002 at 09:10:55AM +0200, Andreas Jaeger wrote: >> You might want to use the always_inline function attribute for your >> inline functions in 3.1. It makes the inlining unconditional. > > Thanks, that suggestion does solve my inlining problem. I think the > documentation for this option is a bit confusing: it says "For > functions declared inline, this attribute inlines the function even if > no optimization level was specified." which is true I guess, but the > documentation does not mention that when optimization is enabled gcc > will not always inline functions declared inline, unless we use this > attribute. I think the problem is that part of the documentation was written before the part about always_inline. > > Also I want to point out that this is a new behaviour, as gcc up to > 3.0.4 apparently always inlined functions declared inline (when > compiling with -O2 at least). Not really new `behavior'; it was just that the inline limit was lowered in 3.0 and above. > > I think either the documentation of always_inline should be made more > clear, or the inline keyword should be made to behave like > always_inline when optimizing with -O2 and above (as it did previously). The inline keyword was always just a hint to the compiler in ISO C++ and ISO C, and it was -O3 that included the inline optimization. Thanks, Andrew Pinski > > -- > Michel "Walken" LESPINASSE > Is this the best that god can do ? Then I'm not impressed. > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 18:20 ` Andrew Pinski @ 2002-04-22 18:30 ` Carlo Wood 2002-04-22 19:25 ` Andrew Pinski 0 siblings, 1 reply; 16+ messages in thread From: Carlo Wood @ 2002-04-22 18:30 UTC (permalink / raw) To: Andrew Pinski; +Cc: gcc On Mon, Apr 22, 2002 at 08:49:20PM -0400, Andrew Pinski wrote: > And it was -O3 that included the inline optimization. Huh?? I am using the fact (I hope) that functions marked with 'inline' or member functions that are defined inside their class declaration are inlined already with -O. Talking about real small functions here (accessors etc). -- Carlo Wood <carlo@alinoe.com> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 18:30 ` Carlo Wood @ 2002-04-22 19:25 ` Andrew Pinski 2002-04-24 15:24 ` Allan Sandfeld Jensen 0 siblings, 1 reply; 16+ messages in thread From: Andrew Pinski @ 2002-04-22 19:25 UTC (permalink / raw) To: Carlo Wood; +Cc: gcc On Monday, April 22, 2002, at 09:20 , Carlo Wood wrote: > On Mon, Apr 22, 2002 at 08:49:20PM -0400, Andrew Pinski wrote: >> And it was -O3 that included the inline optimization. > > Huh?? > I am using the fact (I hope) that functions marked with 'inline' > or member functions that are defined inside their class declaration > are inlined already with -O. > Talking about real small functions here (accessors etc). > Read the manual, under the -O3 option: Optimize yet more. `-O3' turns on all optimizations specified by `-O2' and also turns on the `-finline-functions' and `-frename-registers' options. These options control various sorts of optimizations: -O -O1 Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function. Without `-O', the compiler's goal is to reduce the cost of compilation and to make debugging produce the expected results. Statements are independent: if you stop the program with a breakpoint between statements, you can then assign a new value to any variable or change the program counter to any other statement in the function and get exactly the results you would expect from the source code. Without `-O', the compiler only allocates variables declared register in registers. The resulting compiled code is a little worse than produced by PCC without `-O'. With `-O', the compiler tries to reduce code size and execution time. When you specify `-O', the compiler turns on `-fthread-jumps' and `-fdefer-pop' on all machines. The compiler turns on `-fdelayed-branch' on machines that have delay slots, and `-fomit-frame-pointer' on machines that can support debugging even without a frame pointer. 
On some machines the compiler also turns on other flags. -O2 Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. The compiler does not perform loop unrolling or function inlining when you specify `-O2'. As compared to `-O', this option increases both compilation time and the performance of the generated code. `-O2' turns on all optional optimizations except for loop unrolling, function inlining, and register renaming. It also turns on the `-fforce-mem' option on all machines and frame pointer elimination on machines where doing so does not interfere with debugging. Please note the warning under `-fgcse' about invoking `-O2' on programs that use computed gotos. -O3 Optimize yet more. `-O3' turns on all optimizations specified by `-O2' and also turns on the `-finline-functions' and `-frename-registers' options. -O0 Do not optimize. -Os Optimize for size. `-Os' enables all `-O2' optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size. If you use multiple `-O' options, with or without level numbers, the last such option is the one that is effective. > -- > Carlo Wood <carlo@alinoe.com> > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 19:25 ` Andrew Pinski @ 2002-04-24 15:24 ` Allan Sandfeld Jensen 0 siblings, 0 replies; 16+ messages in thread From: Allan Sandfeld Jensen @ 2002-04-24 15:24 UTC (permalink / raw) To: Andrew Pinski; +Cc: gcc On Tuesday 23 April 2002 04:05, Andrew Pinski wrote: > On Monday, April 22, 2002, at 09:20 , Carlo Wood wrote: > > On Mon, Apr 22, 2002 at 08:49:20PM -0400, Andrew Pinski wrote: > >> And it was -O3 that included the inline optimization. > > > > Huh?? > > I am using the fact (I hope) that functions marked with 'inline' > > or member functions that are defined inside their class declaration > > are inlined already with -O. > > Talking about real small functions here (accessors etc). > > Read the manual, under the -O3 option: > > Optimize yet more. `-O3' turns on all optimizations specified by `-O2' > and also turns on the `-finline-functions' and `-frename-registers' > options. > That -finline-functions is the "inline optimization", i.e. inlining functions NOT declared inline. Functions declared inline are inlined already at -O. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-21 23:46 ` Michel LESPINASSE 2002-04-22 0:17 ` Andreas Jaeger @ 2002-04-22 7:11 ` Carlo Wood 2002-04-22 7:11 ` Falk Hueffner ` (2 more replies) 1 sibling, 3 replies; 16+ messages in thread From: Carlo Wood @ 2002-04-22 7:11 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Jan Hubicka, Andreas Jaeger, gcc list > > Then I tried to figure out where the slowdown is, using gprof. And > this is where things get really interesting: gprof tells me that the > code compiled with 3.1 is faster, but 'time' tells me that the user > time spent executing that code is higher with 3.1 than with 2.95. I'm > not sure what to make of this, but I think this might give you some > clues, so I'll describe it in more detail. I'm not sure what the > overhead is, but it seems to be right in gprof's blind spot. gprof "measures" the time that a function takes by probing which function the program is executing about every 20 ms. From that it builds up a statistical histogram. I wish there were a more precise profiler that used the hardware counters. Does anyone know of one? Hmm, I remember a Subject: line on the PAPI mailing list that mentioned gprof, but I deleted it. I think it asked the same question: whether or not there existed a 'gprof' that used PAPI. -- Carlo Wood <carlo@alinoe.com> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 7:11 ` Carlo Wood @ 2002-04-22 7:11 ` Falk Hueffner 2002-04-22 7:34 ` law 2002-04-22 8:23 ` Johannes Stezenbach 2 siblings, 0 replies; 16+ messages in thread From: Falk Hueffner @ 2002-04-22 7:11 UTC (permalink / raw) To: gcc list Carlo Wood <carlo@alinoe.com> writes: > I wish there would be a more precise profiler that uses the hardware > counters. Does anyone know of one? Hmm, I remember a Subject: line > on the PAPI mailinglist that mentioned gprof, but I deleted it. I > think it asked the same question: whether or not there existed a > 'gprof' that used PAPI. http://oprofile.sourceforge.net/ Only runs on i386 Linux and needs a kernel module, though. Falk ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 7:11 ` Carlo Wood 2002-04-22 7:11 ` Falk Hueffner @ 2002-04-22 7:34 ` law 2002-04-22 8:23 ` Johannes Stezenbach 2 siblings, 0 replies; 16+ messages in thread From: law @ 2002-04-22 7:34 UTC (permalink / raw) To: Carlo Wood; +Cc: Michel LESPINASSE, Jan Hubicka, Andreas Jaeger, gcc list In message <20020422155801.A21747@alinoe.com>, Carlo Wood writes: > > > > Then I tried to figure out where the slowdown is, using gprof. And > > this is where things get really interesting: gprof tells me that the > > code compiled with 3.1 is faster, but 'time' tells me that the user > > time spent executing that code is higher with 3.1 than with 2.95. I'm > > not sure what to make of this, but I think this might give you some > > clues, so I'll describe it in more detail. I'm not sure what the > > overhead is, but it seems to be right in gprof's blind spot. > > gprof "measures" the time that a function takes by probing > which function the program is executing about every 20 ms. > From that it builds up a statistical histogram. > > I wish there would be a more precise profiler that uses the > hardware counters. Does anyone know of one? Hmm, I remember > a Subject: line on the PAPI mailinglist that mentioned gprof, > but I deleted it. I think it asked the same question: whether > or not there existed a 'gprof' that used PAPI. You might look at oprofile. jeff ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 7:11 ` Carlo Wood 2002-04-22 7:11 ` Falk Hueffner 2002-04-22 7:34 ` law @ 2002-04-22 8:23 ` Johannes Stezenbach 2 siblings, 0 replies; 16+ messages in thread From: Johannes Stezenbach @ 2002-04-22 8:23 UTC (permalink / raw) To: Carlo Wood; +Cc: Michel LESPINASSE, Jan Hubicka, Andreas Jaeger, gcc list On Mon, Apr 22, 2002 at 03:58:01PM +0200, Carlo Wood wrote: > gprof "measures" the time that a function takes by probing > which function the program is executing about every 20 ms. > From that it builds up a statistical histogram. > > I wish there would be a more precise profiler that uses the > hardware counters. Does anyone know of one? Hmm, I remember > a Subject: line on the PAPI mailinglist that mentioned gprof, > but I deleted it. I think it asked the same question: whether > or not there existed a 'gprof' that used PAPI. I know of two profilers based on gcc's -finstrument-functions feature: * FunctionCheck: http://www710.univ-lyon1.fr/~yperret/fnccheck/profiler.html http://sources.redhat.com/ml/binutils/2001-01/msg00142.html I've tried this one once, looked promising but seems to be unmaintained now. Alpha quality. * Eazel's profiler: http://www.mozilla.org/performance/eazel.html http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&dir=eazel-tools/profiler Alpha quality, now unmaintained since Eazel is out of business. Haven't tried it. Regards, Johannes ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-20 18:13 Michel LESPINASSE 2002-04-21 3:41 ` Andreas Jaeger 2002-04-21 5:46 ` Jan Hubicka @ 2002-04-22 1:47 ` Gerald Pfeifer 2 siblings, 0 replies; 16+ messages in thread From: Gerald Pfeifer @ 2002-04-22 1:47 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list On Sat, 20 Apr 2002, Michel LESPINASSE wrote: > One thing I noticed is that 3.1 snapshot produces less inlining than 3.0 > or 2.95. There is a difference between 3.0/3.0.1/3.0.2 on the one hand, and 3.0.3/3.0.4 on the other hand. GCC 3.0-3.0.2 were so bad with respect to compile-time and memory consumption that the inlining strategy was changed for 3.0.3 (and mainline, which is now becoming 3.1). The new strategy should be more or less equivalent in most cases, but if you are seeing significant differences, this issue needs to be revisited. Gerald -- Gerald "Jerry" pfeifer@dbai.tuwien.ac.at http://www.dbai.tuwien.ac.at/~pfeifer/ ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2002-04-24 21:28 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <20020421005718.GA16378@zoy.org.suse.lists.egcs> [not found] ` <20020421113238.GC16602@atrey.karlin.mff.cuni.cz.suse.lists.egcs> 2002-04-21 7:58 ` GCC performance regression - up to 20% ? Andi Kleen 2002-04-20 18:13 Michel LESPINASSE 2002-04-21 3:41 ` Andreas Jaeger 2002-04-21 5:46 ` Jan Hubicka 2002-04-21 23:46 ` Michel LESPINASSE 2002-04-22 0:17 ` Andreas Jaeger 2002-04-22 17:42 ` Michel LESPINASSE 2002-04-22 18:20 ` Andrew Pinski 2002-04-22 18:30 ` Carlo Wood 2002-04-22 19:25 ` Andrew Pinski 2002-04-24 15:24 ` Allan Sandfeld Jensen 2002-04-22 7:11 ` Carlo Wood 2002-04-22 7:11 ` Falk Hueffner 2002-04-22 7:34 ` law 2002-04-22 8:23 ` Johannes Stezenbach 2002-04-22 1:47 ` Gerald Pfeifer