* GCC performance regression - up to 20% ? @ 2002-04-20 18:13 Michel LESPINASSE
From: Michel LESPINASSE @ 2002-04-20 18:13 UTC
To: gcc list

Hi,

I have downloaded the latest 3.1 snapshot (20020415) and ran some performance tests. So far I've been impressed by the FP performance, but kinda disappointed by the integer performance.

The benchmarks I've run are two libraries I maintain, libmpeg2 and liba52. These are used by several open-source dvd players, and are quite CPU intensive (especially libmpeg2). So here are my results, using gcc 2.95 as a reference.

First the good news: liba52 (mostly FP intensive workload) on athlon tbird 950, using -mcpu=pentiumpro:
  gcc-3.0 is between 4.5% and 6.5% faster than 2.95.4 depending on streams
  gcc-3.1 snapshot is between 8% and 9.5% faster than 2.95.4
From these measurements 3.1 has very nice performance, very close to intel's icc. Great work ! Also, using -march=athlon-tbird and generating sse code, I can get a few extra % of performance.

Now the bad news: for libmpeg2, which is an integer-only workload, I get a 10% to 20% performance regression between 2.95.4 and 3.1... 3.0 was already slower than 2.95.4, but 3.1 seems to be worse, for this workload at least.

libmpeg2, on athlon tbird 950, with mmx optimizations:
  gcc-3.0 is about 2% slower than 2.95.4
  gcc-3.1 snapshot is about 10% slower than 2.95.4

libmpeg2, on athlon tbird 950, using pure C code:
  gcc-3.0 is about 4.5% slower than 2.95.4
  gcc-3.1 snapshot is about 5.5% slower than 2.95.4

libmpeg2, on celeron 366, with mmx optimizations:
  gcc-3.0 is about 4% slower than 2.95.4
  gcc-3.1 snapshot is about 20.5% slower than 2.95.4 (!!!!)

These results are all very repeatable. The celeron 366 results are the most worrying, as this processor already has borderline performance for decoding mpeg2 streams.

Is there a known performance regression in current GCCs (say, do they get lower SPECint scores ?) or is it only with my code ?

Also, is there anything I could do in my code to enhance performance with newer gcc versions ? One thing I noticed is that the 3.1 snapshot produces less inlining than 3.0 or 2.95. This probably accounts for some of the slowdown I see when using mmx optimizations, as my mmx routines are written using a few helper routines that I really expect to get inlined. Is there any way I can get back control over that, so that gcc honours the inline keyword ? I have not managed to do this.

BTW, the two apps I mentioned can be found at
http://libmpeg2.sourceforge.net/
http://liba52.sourceforge.net/

Puzzled,

--
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.
* Re: GCC performance regression - up to 20% ?
From: Andreas Jaeger @ 2002-04-21 3:41 UTC
To: Michel LESPINASSE; +Cc: gcc list

Michel LESPINASSE <walken@zoy.org> writes:

> These results are all very repeatable. The celeron 366 results are the
> most worrying, as this processor already has borderline performance
> for decoding mpeg2 streams.
>
> Is there a known performance regression in current GCCs (say, do they
> get lower SPECint scores ?) or is it only with my code ?

Can you distill a test case that is as small as possible (the optimal way would be just the loop that causes the problem) and show it to us? That way it's much easier to discuss the issues and start looking into what needs to be done.

> Also, is there anything I could do in my code to enhance performance
> with newer gcc versions ? One thing I noticed is that the 3.1 snapshot
> produces less inlining than 3.0 or 2.95. [...] Is there any way I can
> get back control over that, so that gcc honours the inline keyword ?

Try -finline-limit=2000, but check the manual for the exact name of the switch.

Andreas

--
Andreas Jaeger
SuSE Labs aj@suse.de
private aj@arthur.inka.de
http://www.suse.de/~aj
* Re: GCC performance regression - up to 20% ?
From: Jan Hubicka @ 2002-04-21 5:46 UTC
To: Michel LESPINASSE; +Cc: gcc list

> libmpeg2, on athlon tbird 950, with mmx optimizations:
>   gcc-3.0 is about 2% slower than 2.95.4
>   gcc-3.1 snapshot is about 10% slower than 2.95.4
>
> libmpeg2, on athlon tbird 950, using pure C code:
>   gcc-3.0 is about 4.5% slower than 2.95.4
>   gcc-3.1 snapshot is about 5.5% slower than 2.95.4
>
> libmpeg2, on celeron 366, with mmx optimizations:
>   gcc-3.0 is about 4% slower than 2.95.4
>   gcc-3.1 snapshot is about 20.5% slower than 2.95.4 (!!!!)
>
> These results are all very repeatable. The celeron 366 results are the
> most worrying, as this processor already has borderline performance
> for decoding mpeg2 streams.

Are you able to figure out what exactly makes the code slow? Having a self-contained testcase will definitely help a lot. What flags do you use?

I would be quite curious whether using profile feedback helps (see the documentation of -fprofile-arcs and -fbranch-probabilities); you may just have some badly predicted branch in the innermost loop. The problem with such code is usually that it is tuned to avoid problems on one particular version of gcc, so even when a new version is faster overall, it is slower in such places. We hit a similar case with Athlon matrix multiplication code, and such problems are usually easy to fix on the gcc side.

> Is there a known performance regression in current GCCs (say, do they
> get lower SPECint scores ?) or is it only with my code ?

No, the SPECint numbers are quite consistently higher than in any previous release. See http://www.suse.de/~aj/SPEC - in fact no previous release had such a huge gap in performance.

> Also, is there anything I could do in my code to enhance performance
> with newer gcc versions ? [...] Is there any way I can get back
> control over that, so that gcc honours the inline keyword ?

There is a parameter to increase the inline threshold, as well as an always_inline function attribute. See the documentation.

Honza
* Re: GCC performance regression - up to 20% ?
From: Michel LESPINASSE @ 2002-04-21 23:46 UTC
To: Jan Hubicka, Andreas Jaeger; +Cc: gcc list

Hi,

I spent the afternoon looking at this issue and trying different flags. I think I got some interesting results, though I still don't get the full picture.

The first thing I tweaked was to make the inlining work the way I wanted. As I understand it, in gcc 2.95 and 3.0, when compiling with the usual options (-O3), the behaviour was that functions declared inline were always inlined, and functions not declared inline were inlined only if gcc thought they were simple enough. In the gcc 3.1 snapshot, it looks like the inline keyword only makes the function an inline candidate. I will argue later on why I think the old behaviour (inline being always honoured) is more useful (basically, function specialization is harder to do if inline is not honoured). In the meantime, I worked around the issue by using an absurdly high value, -finline-limit=6000 (I tried 2000 first, which was not sufficient), and then I also added -fno-inline-functions so that I don't get everything inlined when I don't ask for it.

With the custom inlining, the gcc-3.1 snapshot is about 4% slower than gcc-2.95 on my athlon tbird. This is progress, since without the custom inlining the slowdown was 10%.

Then I tried to figure out where the slowdown is, using gprof. And this is where things get really interesting: gprof tells me that the code compiled with 3.1 is faster, but 'time' tells me that the user time spent executing that code is higher with 3.1 than with 2.95. I'm not sure what to make of this, but I think this might give you some clues, so I'll describe it in more detail. I'm not sure what the overhead is, but it seems to be right in gprof's blind spot.

I have to describe my gprof methodology first. 'Normal' gprof (i.e. compiling every file with -pg) seems to have a high overhead for me, plus it conflicts with -fomit-frame-pointer which I usually use. So I tend to use what I'll call 'light' gprof, which is as follows: everything is compiled with -g -O3 -fomit-frame-pointer -mcpu=pentiumpro, except main.c which is compiled with -g -O3 -p and is not cpu intensive. 'Light' gprof can not help me figure out call graphs, but it should normally be good enough to obtain a flat profile.

When using gcc 2.95, 'light' gprof works great. In a flat profile, the last number in the 'cumulative seconds' column always matches (within 0.1s) the user time as reported by the 'time' command. I think I can trust the flat profile information.

When using gcc 3.0 or the 3.1 snapshot though, there is a several-seconds gap between 'cumulative seconds' and 'user time'. I don't understand what happens during this time - could it be that 3.x has a higher overhead for function calls, and that this overhead is not accounted for in gprof ? I don't understand what happens here, but I get the feeling that this might be key to the slowdown I observe.

To summarize the gprof thing, I observe
  3.1 gprof time < 2.95 gprof time = 2.95 user time < 3.1 user time

If I use 'normal' gprof (every file gets -pg and I remove the -fomit-frame-pointer), that inflates the user times a lot, but I still get 3.1 gprof time < 2.95 gprof time and 2.95 user time < 3.1 user time.

Also, to answer Jan's question, I did try to use -fbranch-probabilities and it helped: with this option the 3.1 snapshot is about 0.5% slower (in user time) than 2.95, instead of 4% slower without. It's still frustrating though, because gprof pretends 3.1 is about 10% faster than 2.95, so there may be a lot of untapped performance.

OK, so this is all I can say for now. I hope someone knows gprof internals better than I do and understands why gprof returns times that are smaller in the 3.1 snapshot vs. 2.95, while time does not agree.

I thought I should add a few comments about the structure of the libmpeg2 code, as I suppose it is somewhat unusual. The time-consuming loop is pretty big, and includes several function calls. Each loop execution decodes a full mpeg2 macroblock (= 256 pixels); during this decoding it calls the IDCT (cosine transform) function up to 6 times, and some MC (motion compensation) functions up to 12 times. All these calls are indirect, using function pointers. These called functions are all kinda short, executing in about 0.4 microseconds per call on average. Yes, that's a lot of function calls, and I suspect SPECint does not do as many, which could maybe explain why it's not seeing the same performance regression that I see ? It would seem consistent with the gprof blind spot thing, too.

Finally, I thought I should include a small explanation about what I do with inlines, and why I'd like to have a gcc option so that the inline keyword always inlines a function, instead of just making that function an inline candidate. One example would be the MC functions mentioned above. To get acceptable speed, these are written using mmx assembly operations. Actually they even have several implementations: one in straight C for compatibility, one in straight mmx, one using sse integer instructions (as present on PIII and athlon), and one using 3dnow instructions. It's easy to select at init time which implementation to use, since the MC functions are only called through function pointers. The only difference between the sse version and the 3dnow version is that the sse version uses the pavgb instruction, while the 3dnow version uses pavgusb. These instructions have different opcodes but the same behaviour, so I wrote this using an inline function:

    static inline void MC_generic (...., int cpu)
    {
        .... do stuff ....
        if (cpu == CPU_3DNOW)
            pavgusb (....);
        else
            pavgb (....);
        .... do more stuff ....
    }

    void MC_3dnow (.....)
    {
        MC_generic (....., CPU_3DNOW);
    }

    void MC_sse (.....)
    {
        MC_generic (....., CPU_SSE);
    }

In gcc-2.95 and gcc-3.0, this construct works out nicely since MC_generic gets inlined, and then the compiler figures out that the cpu test in it is a constant and generates clean code for both the 3dnow and the sse functions. This kind of specialisation is sometimes very convenient, but it requires that the inline keyword do an unconditional inlining, not subject to compiler heuristics. I would really love to see a gcc option to make inlining unconditional when using the inline keyword, and have gcc use its heuristics when there is no such keyword.

This email is longer than I thought it would be - thanks a lot to those who're still reading me :)

Cheers,

--
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.
* Re: GCC performance regression - up to 20% ?
From: Andreas Jaeger @ 2002-04-22 0:17 UTC
To: Michel LESPINASSE; +Cc: Jan Hubicka, gcc list

Michel LESPINASSE <walken@zoy.org> writes:

> Then I tried to figure out where the slowdown is, using gprof. And
> this is where things get really interesting: gprof tells me that the
> code compiled with 3.1 is faster, but 'time' tells me that the user
> time spent executing that code is higher with 3.1 than with 2.95. I'm
> not sure what to make of this, but I think this might give you some
> clues, so I'll describe it in more detail. I'm not sure what the
> overhead is, but it seems to be right in gprof's blind spot.

Using -pg changes the binary, and the results might not be the same as without the flag. You could try to use the performance counters of your Athlon using e.g. oprofile (http://oprofile.sourceforge.net).

> I thought I should add a few comments about the structure of the
> libmpeg2 code, as I suppose it is somewhat unusual. The time-consuming
> loop is pretty big, and includes several function calls. [...] All
> these calls are indirect, using function pointers. These called
> functions are all kinda short, executing in about 0.4 microseconds per
> call on average. Yes, that's a lot of function calls, and I suspect
> SPECint does not do as many, which could maybe explain why it's not
> seeing the same performance regression that I see ?

It would really help a lot if you could try to write some small program that behaves the same way (performance wise) as this part of libmpeg2.

> Finally, I thought I should include a small explanation about what I
> do with inlines, and why I'd like to have a gcc option so that the
> inline keyword always inlines a function, instead of just making that
> function an inline candidate. [...] This kind of specialisation is
> sometimes very convenient, but it requires that the inline keyword do
> an unconditional inlining, not subject to compiler heuristics.

You might want to use the always_inline function attribute for your inline functions in 3.1. It makes the inlining unconditional.

> This email is longer than I thought it would be - thanks a lot to
> those who're still reading me :)

;-)

Andreas

--
Andreas Jaeger
SuSE Labs aj@suse.de
private aj@arthur.inka.de
http://www.suse.de/~aj
* Re: GCC performance regression - up to 20% ?
From: Michel LESPINASSE @ 2002-04-22 17:42 UTC
To: Andreas Jaeger; +Cc: Jan Hubicka, gcc list

On Mon, Apr 22, 2002 at 09:10:55AM +0200, Andreas Jaeger wrote:
> You might want to use the always_inline function attribute for your
> inline functions in 3.1. It makes the inlining unconditional.

Thanks, that suggestion does solve my inlining problem. I think the documentation for this attribute is a bit confusing: it says "For functions declared inline, this attribute inlines the function even if no optimization level was specified," which is true I guess, but the documentation does not mention that even when optimization is enabled, gcc will not always inline functions declared inline unless we use this attribute.

Also, I want to point out that this is a new behaviour, as gcc up to 3.0.4 apparently always inlined functions declared inline (when compiling with -O2 at least).

I think either the documentation of always_inline should be made clearer, or the inline keyword should be made to behave like always_inline when optimizing with -O2 and above (as it did previously).

--
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.
* Re: GCC performance regression - up to 20% ?
From: Andrew Pinski @ 2002-04-22 18:20 UTC
To: Michel LESPINASSE; +Cc: Andreas Jaeger, Jan Hubicka, gcc list

On Monday, April 22, 2002, at 08:39, Michel LESPINASSE wrote:

> Thanks, that suggestion does solve my inlining problem. I think the
> documentation for this attribute is a bit confusing: it says "For
> functions declared inline, this attribute inlines the function even if
> no optimization level was specified," which is true I guess, but the
> documentation does not mention that even when optimization is enabled,
> gcc will not always inline functions declared inline unless we use
> this attribute.

I think the problem is that this part of the documentation was written before the part about always_inline.

> Also, I want to point out that this is a new behaviour, as gcc up to
> 3.0.4 apparently always inlined functions declared inline (when
> compiling with -O2 at least).

Not really new behaviour - it's just that the inline limit was lowered in 3.0 and above.

> I think either the documentation of always_inline should be made
> clearer, or the inline keyword should be made to behave like
> always_inline when optimizing with -O2 and above (as it did
> previously).

The inline keyword was always just a hint to the compiler in ISO C++ and ISO C. And it was -O3 that included the inline optimization.

Thanks,
Andrew Pinski
* Re: GCC performance regression - up to 20% ?
From: Carlo Wood @ 2002-04-22 18:30 UTC
To: Andrew Pinski; +Cc: gcc

On Mon, Apr 22, 2002 at 08:49:20PM -0400, Andrew Pinski wrote:
> And it was -O3 that included the inline optimization.

Huh?? I am relying on the fact (I hope) that functions marked with 'inline', or member functions that are defined inside their class declaration, are inlined already with -O. I'm talking about really small functions here (accessors etc).

--
Carlo Wood <carlo@alinoe.com>
* Re: GCC performance regression - up to 20% ?
From: Andrew Pinski @ 2002-04-22 19:25 UTC
To: Carlo Wood; +Cc: gcc

On Monday, April 22, 2002, at 09:20, Carlo Wood wrote:
> On Mon, Apr 22, 2002 at 08:49:20PM -0400, Andrew Pinski wrote:
>> And it was -O3 that included the inline optimization.
>
> Huh??
> I am relying on the fact (I hope) that functions marked with 'inline',
> or member functions that are defined inside their class declaration,
> are inlined already with -O.
> I'm talking about really small functions here (accessors etc).

Read the manual, under the -O3 option:

  Optimize yet more. `-O3' turns on all optimizations specified by
  `-O2' and also turns on the `-finline-functions' and
  `-frename-registers' options.

These options control various sorts of optimizations:

  -O
  -O1
    Optimize. Optimizing compilation takes somewhat more time, and a
    lot more memory for a large function.

    Without `-O', the compiler's goal is to reduce the cost of
    compilation and to make debugging produce the expected results.
    Statements are independent: if you stop the program with a
    breakpoint between statements, you can then assign a new value to
    any variable or change the program counter to any other statement
    in the function and get exactly the results you would expect from
    the source code. Without `-O', the compiler only allocates
    variables declared register in registers. The resulting compiled
    code is a little worse than that produced by PCC without `-O'.

    With `-O', the compiler tries to reduce code size and execution
    time. When you specify `-O', the compiler turns on `-fthread-jumps'
    and `-fdefer-pop' on all machines. The compiler turns on
    `-fdelayed-branch' on machines that have delay slots, and
    `-fomit-frame-pointer' on machines that can support debugging even
    without a frame pointer. On some machines the compiler also turns
    on other flags.

  -O2
    Optimize even more. GCC performs nearly all supported optimizations
    that do not involve a space-speed tradeoff. The compiler does not
    perform loop unrolling or function inlining when you specify
    `-O2'. As compared to `-O', this option increases both compilation
    time and the performance of the generated code.

    `-O2' turns on all optional optimizations except for loop
    unrolling, function inlining, and register renaming. It also turns
    on the `-fforce-mem' option on all machines and frame pointer
    elimination on machines where doing so does not interfere with
    debugging. Please note the warning under `-fgcse' about invoking
    `-O2' on programs that use computed gotos.

  -O3
    Optimize yet more. `-O3' turns on all optimizations specified by
    `-O2' and also turns on the `-finline-functions' and
    `-frename-registers' options.

  -O0
    Do not optimize.

  -Os
    Optimize for size. `-Os' enables all `-O2' optimizations that do
    not typically increase code size. It also performs further
    optimizations designed to reduce code size.

  If you use multiple `-O' options, with or without level numbers, the
  last such option is the one that is effective.
* Re: GCC performance regression - up to 20% ?
From: Allan Sandfeld Jensen @ 2002-04-24 15:24 UTC
To: Andrew Pinski; +Cc: gcc

On Tuesday 23 April 2002 04:05, Andrew Pinski wrote:
> On Monday, April 22, 2002, at 09:20, Carlo Wood wrote:
> > On Mon, Apr 22, 2002 at 08:49:20PM -0400, Andrew Pinski wrote:
> >> And it was -O3 that included the inline optimization.
> >
> > Huh??
> > I am relying on the fact (I hope) that functions marked with
> > 'inline', or member functions that are defined inside their class
> > declaration, are inlined already with -O.
> > I'm talking about really small functions here (accessors etc).
>
> Read the manual, under the -O3 option:
>
> Optimize yet more. `-O3' turns on all optimizations specified by `-O2'
> and also turns on the `-finline-functions' and `-frename-registers'
> options.

That inlining is the "inline optimization", i.e. inlining functions NOT declared inline. Functions declared inline are inlined already at -O.
* Re: GCC performance regression - up to 20% ?
From: Carlo Wood @ 2002-04-22 7:11 UTC
To: Michel LESPINASSE; +Cc: Jan Hubicka, Andreas Jaeger, gcc list

> Then I tried to figure out where the slowdown is, using gprof. And
> this is where things get really interesting: gprof tells me that the
> code compiled with 3.1 is faster, but 'time' tells me that the user
> time spent executing that code is higher with 3.1 than with 2.95. I'm
> not sure what to make of this, but I think this might give you some
> clues, so I'll describe it in more detail. I'm not sure what the
> overhead is, but it seems to be right in gprof's blind spot.

gprof "measures" the time that a function takes by probing which function the program is executing about every 20 ms, and from that it builds up a statistical histogram.

I wish there were a more precise profiler that uses the hardware counters. Does anyone know of one? Hmm, I remember a Subject: line on the PAPI mailing list that mentioned gprof, but I deleted it. I think it asked the same question: whether or not there exists a 'gprof' that uses PAPI.

--
Carlo Wood <carlo@alinoe.com>
* Re: GCC performance regression - up to 20% ? 2002-04-22 7:11 ` Carlo Wood @ 2002-04-22 7:11 ` Falk Hueffner 2002-04-22 7:34 ` law 2002-04-22 8:23 ` Johannes Stezenbach 2 siblings, 0 replies; 32+ messages in thread From: Falk Hueffner @ 2002-04-22 7:11 UTC (permalink / raw) To: gcc list Carlo Wood <carlo@alinoe.com> writes: > I wish there would be a more precise profiler that uses the hardware > counters. Does anyone know of one? Hmm, I remember a Subject: line > on the PAPI mailinglist that mentioned gprof, but I deleted it. I > think it asked the same question: whether or not there existed a > 'gprof' that used PAPI. http://oprofile.sourceforge.net/ Only runs on i386 Linux and needs a kernel module, though. Falk ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 7:11 ` Carlo Wood 2002-04-22 7:11 ` Falk Hueffner @ 2002-04-22 7:34 ` law 2002-04-22 8:23 ` Johannes Stezenbach 2 siblings, 0 replies; 32+ messages in thread From: law @ 2002-04-22 7:34 UTC (permalink / raw) To: Carlo Wood; +Cc: Michel LESPINASSE, Jan Hubicka, Andreas Jaeger, gcc list In message <20020422155801.A21747@alinoe.com>, Carlo Wood writes: > > > > Then I tried to figure out where the slowdown is, using gprof. And > > this is where things get really interesting: gprof tells me that the > > code compiled with 3.1 is faster, but 'time' tells me that the user > > time spent executing that code is higher with 3.1 than with 2.95. I'm > > not sure what to make of this, but I think this might give you some > > clues, so I'll describe it in more detail. I'm not sure what the > > overhead is, but it seems to be right in gprof's blind spot. > > gprof "measures" the time that a function takes by probing > which function the program is executing about every 20 ms. > >From that it builds up a statistical histogram. > > I wish there would be a more precise profiler that uses the > hardware counters. Does anyone know of one? Hmm, I remember > a Subject: line on the PAPI mailinglist that mentioned gprof, > but I deleted it. I think it asked the same question: whether > or not there existed a 'gprof' that used PAPI. You might look at oprofile. jeff ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 7:11 ` Carlo Wood 2002-04-22 7:11 ` Falk Hueffner 2002-04-22 7:34 ` law @ 2002-04-22 8:23 ` Johannes Stezenbach 2 siblings, 0 replies; 32+ messages in thread From: Johannes Stezenbach @ 2002-04-22 8:23 UTC (permalink / raw) To: Carlo Wood; +Cc: Michel LESPINASSE, Jan Hubicka, Andreas Jaeger, gcc list On Mon, Apr 22, 2002 at 03:58:01PM +0200, Carlo Wood wrote: > gprof "measures" the time that a function takes by probing > which function the program is executing about every 20 ms. > From that it builds up a statistical histogram. > > I wish there would be a more precise profiler that uses the > hardware counters. Does anyone know of one? Hmm, I remember > a Subject: line on the PAPI mailinglist that mentioned gprof, > but I deleted it. I think it asked the same question: whether > or not there existed a 'gprof' that used PAPI. I know of two profilers based on gcc's -finstrument-functions feature: * FunctionCheck: http://www710.univ-lyon1.fr/~yperret/fnccheck/profiler.html http://sources.redhat.com/ml/binutils/2001-01/msg00142.html I've tried this one once, looked promising but seems to be unmaintained now. Alpha quality. * Eazel's profiler: http://www.mozilla.org/performance/eazel.html http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&dir=eazel-tools/profiler Alpha quality, now unmaintained since Eazel is out of business. Haven't tried it. Regards, Johannes ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-20 18:13 GCC performance regression - up to 20% ? Michel LESPINASSE 2002-04-21 3:41 ` Andreas Jaeger 2002-04-21 5:46 ` Jan Hubicka @ 2002-04-22 1:47 ` Gerald Pfeifer 2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE 3 siblings, 0 replies; 32+ messages in thread From: Gerald Pfeifer @ 2002-04-22 1:47 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list On Sat, 20 Apr 2002, Michel LESPINASSE wrote: > One thing I noticed is that 3.1 snapshot produces less inlining than 3.0 > or 2.95. There is a difference between 3.0/3.0.1/3.0.2 on the one hand, and 3.0.3/3.0.4 on the other hand. GCC 3.0-3.0.2 were so bad with respect to compile-time and memory consumption that the inlining strategy was changed for 3.0.3 (and mainline, which is now becoming 3.1). The new strategy should be more or less equivalent in most cases, but if you are seeing significant differences, this issue needs to be revisited. Gerald -- Gerald "Jerry" pfeifer@dbai.tuwien.ac.at http://www.dbai.tuwien.ac.at/~pfeifer/ ^ permalink raw reply [flat|nested] 32+ messages in thread
* GCC performance regression - its memset ! 2002-04-20 18:13 GCC performance regression - up to 20% ? Michel LESPINASSE ` (2 preceding siblings ...) 2002-04-22 1:47 ` Gerald Pfeifer @ 2002-04-22 14:33 ` Michel LESPINASSE 2002-04-22 14:58 ` Jason R Thorpe 2002-04-22 17:10 ` Richard Henderson 3 siblings, 2 replies; 32+ messages in thread From: Michel LESPINASSE @ 2002-04-22 14:33 UTC (permalink / raw) To: gcc list

OK, so I worked more to find the cause of the slowdown, and I figured out it's all because of memset(). This function seems to be about twice as slow as in 2.95, and also for some reason the time spent in memset does not show up in gprof. Here is a test case:

--------------------------- foo.c ------------------------------
#include <string.h>

short table[64];

void bar (void);

int main (void)
{
    int i;

    bar ();
    for (i = 0; i < 100000000; i++)
	memset (table + 1, 0, 63 * sizeof(short));
    return 0;
}
----------------------------- end of foo.c ------------------------

----------------------------- bar.c -------------------------------
void bar (void)
{
}
----------------------------- end of bar.c ------------------------

# gcc-2.95 -g -O3 -p foo.c bar.c
# time ./a.out
./a.out  5.75s user 0.00s system 100% cpu 5.739 total
# gprof -bp ./a.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
100.00      5.74     5.74                             main
  0.00      5.74     0.00        1     0.00     0.00  bar

# gcc-3.1 -g -O3 -p foo.c bar.c
# time ./a.out
./a.out  10.78s user 0.00s system 101% cpu 10.634 total
# gprof -bp ./a.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
100.00      0.62     0.62                             main
  0.00      0.62     0.00        1     0.00     0.00  bar

The gcc-3.1 snapshot is about twice as slow as 2.95 on that test case, and for some reason the gprof output is bogus (it does not account for the time spent in memset), while it was not with 2.95.
I did not know my code spent that much time in memset, I'll see what I can do about it. Hope this helps, -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE @ 2002-04-22 14:58 ` Jason R Thorpe 2002-04-22 15:27 ` Michel LESPINASSE 2002-04-22 16:59 ` Segher Boessenkool 2002-04-22 17:10 ` Richard Henderson 1 sibling, 2 replies; 32+ messages in thread From: Jason R Thorpe @ 2002-04-22 14:58 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list On Mon, Apr 22, 2002 at 02:32:22PM -0700, Michel LESPINASSE wrote: > gcc-3.1 snapshot is about twice slower than 2.95 on that test case, > and for some reason the gprof output is bogus (it does not account for > the time spent in memset), while it was not with 2.95. gprof doesn't see it because gcc is doing the memset inline, presumably; it does this in certain cases where it knows the size at compile time. Try running your test with -fno-builtin. (I'm not suggesting this as a fix for your performance issue, just as an explanation of why memset() is invisible to gprof in your testcase). -- -- Jason R. Thorpe <thorpej@wasabisystems.com> ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 14:58 ` Jason R Thorpe @ 2002-04-22 15:27 ` Michel LESPINASSE 2002-04-22 16:59 ` Segher Boessenkool 1 sibling, 0 replies; 32+ messages in thread From: Michel LESPINASSE @ 2002-04-22 15:27 UTC (permalink / raw) To: Jason R Thorpe, gcc list On Mon, Apr 22, 2002 at 02:41:25PM -0700, Jason R Thorpe wrote: > gprof doesn't see it because gcc is doing the memset inline, presumably; it > does this in certain cases where it knows the size at compile time. I'm not surprised that there is no line for memset in the flat profile. But, I'm surprised that the cumulative time in the flat profile is not equal to the user time in the time command. 'normal' inlines (i.e. the ones I could write by declaring an inline function) are accounted normally by gprof (i.e. they do show up in the cumulative time). -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 14:58 ` Jason R Thorpe 2002-04-22 15:27 ` Michel LESPINASSE @ 2002-04-22 16:59 ` Segher Boessenkool 1 sibling, 0 replies; 32+ messages in thread From: Segher Boessenkool @ 2002-04-22 16:59 UTC (permalink / raw) To: thorpej; +Cc: Michel LESPINASSE, gcc list Jason R Thorpe wrote: > > On Mon, Apr 22, 2002 at 02:32:22PM -0700, Michel LESPINASSE wrote: > > > gcc-3.1 snapshot is about twice slower than 2.95 on that test case, > > and for some reason the gprof output is bogus (it does not account for > > the time spent in memset), while it was not with 2.95. > > gprof doesn't see it because gcc is doing the memset inline, presumably; it > does this in certain cases where it knows the size at compile time. > > Try running your test with -fno-builtin. (I'm not suggesting this as a > fix for your performance issue, just as an explanation of why memset() is > invisible to gprof in your testcase). Erm, no. In the 2.95 case, GCC _did_ inline the memset(); that's why it showed up as 5.74 seconds in main(). In the 3.1 case, it was not inlined; gprof doesn't show you the time spent in libc, as libc is not compiled with profiling enabled. Inlined functions are never profiled separately. Linking statically will make gprof show you the time spent in shared library functions (but not the call graph to those); or you can link against a libc that was compiled with profiling enabled, so you get a call graph as well. Segher ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE 2002-04-22 14:58 ` Jason R Thorpe @ 2002-04-22 17:10 ` Richard Henderson 2002-04-22 17:13 ` Michel LESPINASSE 1 sibling, 1 reply; 32+ messages in thread From: Richard Henderson @ 2002-04-22 17:10 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list On Mon, Apr 22, 2002 at 02:32:22PM -0700, Michel LESPINASSE wrote: > OK, so I worked more to find the cause of the slowdown, and I figured > out its all because of memset(). See whether or not disabling glibc's inline expansion of memset affects 2.95 vs 3.x with -D__NO_STRING_INLINES. r~ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:10 ` Richard Henderson @ 2002-04-22 17:13 ` Michel LESPINASSE 2002-04-22 17:39 ` Richard Henderson 2002-04-23 2:39 ` Jan Hubicka 0 siblings, 2 replies; 32+ messages in thread From: Michel LESPINASSE @ 2002-04-22 17:13 UTC (permalink / raw) To: Richard Henderson, gcc list

On Mon, Apr 22, 2002 at 04:59:53PM -0700, Richard Henderson wrote:
> See whether or not disabling glibc's inline expansion of
> memset affects 2.95 vs 3.x with -D__NO_STRING_INLINES.

I can get gcc 2.95 to be as slow as the 3.1 snapshot by using both -fno-builtin and -D__NO_STRING_INLINES, but I can't get gcc 3.1 to do the inlining for me.

If I don't use -fno-builtin -D__NO_STRING_INLINES, and just compile with -O3, gcc 2.95 generates this:

	movl $table+2,%edi
	cld
	movl $31,%ecx
	rep
	stosl
	stosw

while the 3.1 snapshot generates this:

	movl $table+2, (%esp)
	movl $0, 4(%esp)
	movl $126, 8(%esp)
	call memset

Hope this helps,

-- 
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:13 ` Michel LESPINASSE @ 2002-04-22 17:39 ` Richard Henderson 2002-04-22 17:49 ` Michel LESPINASSE 2002-04-23 2:39 ` Jan Hubicka 1 sibling, 1 reply; 32+ messages in thread From: Richard Henderson @ 2002-04-22 17:39 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list On Mon, Apr 22, 2002 at 05:10:45PM -0700, Michel LESPINASSE wrote: > I can get gcc 2.95 to be as slow as 3.1 snapshot by using both > -fno-builtin and -D__NO_STRING_INLINES I wasn't interested in -fno-builtin, but only differences seen when both 2.95 and 3.1 are using -D__NO_STRING_INLINES. r~ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:39 ` Richard Henderson @ 2002-04-22 17:49 ` Michel LESPINASSE 2002-04-23 5:03 ` Falk Hueffner 0 siblings, 1 reply; 32+ messages in thread From: Michel LESPINASSE @ 2002-04-22 17:49 UTC (permalink / raw) To: Richard Henderson, gcc list On Mon, Apr 22, 2002 at 05:13:35PM -0700, Richard Henderson wrote: > I wasn't interested in -fno-builtin, but only differences seen > when both 2.95 and 3.1 are using -D__NO_STRING_INLINES. If I use just -O3 -D__NO_STRING_INLINES, I get the same behaviour as with just -O3 (i.e. 2.95 inlines the memset and 3.1 doesn't) -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:49 ` Michel LESPINASSE @ 2002-04-23 5:03 ` Falk Hueffner 2002-04-23 6:53 ` Andreas Schwab 0 siblings, 1 reply; 32+ messages in thread From: Falk Hueffner @ 2002-04-23 5:03 UTC (permalink / raw) To: gcc list

Hi,

while we're at it, when you have a prototype for memset, gcc forgets about the alignment:

void f1(unsigned long *p) { memset(p, 0, 16); }
#include <string.h>
void f2(unsigned long *p) { memset(p, 0, 16); }

-> (Alpha)

0000000000000000 <f1>:
   0:	08 00 f0 b7 	stq	zero,8(a0)
   4:	1f 04 ff 5f 	fnop
   8:	00 00 f0 b7 	stq	zero,0(a0)
   c:	01 80 fa 6b 	ret
0000000000000020 <f2>:
  20:	0f 00 30 2c 	ldq_u	t0,15(a0)
  24:	00 00 50 2c 	ldq_u	t1,0(a0)
  28:	41 0e 30 48 	mskqh	t0,a0,t0
  2c:	0f 00 30 3c 	stq_u	t0,15(a0)
  30:	42 06 50 48 	mskql	t1,a0,t1
  34:	08 00 f0 3f 	stq_u	zero,8(a0)
  38:	00 00 50 3c 	stq_u	t1,0(a0)
  3c:	01 80 fa 6b 	ret

Is there anything that can be done about that?

Falk

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-23 5:03 ` Falk Hueffner @ 2002-04-23 6:53 ` Andreas Schwab 0 siblings, 0 replies; 32+ messages in thread From: Andreas Schwab @ 2002-04-23 6:53 UTC (permalink / raw) To: Falk Hueffner; +Cc: gcc list Falk Hueffner <falk.hueffner@student.uni-tuebingen.de> writes: |> Hi, |> |> while we're at it, when you have a prototype for memset, gcc forgets |> about the alignment: |> |> void f1(unsigned long *p) { memset(p, 0, 16); } |> #include <string.h> |> void f2(unsigned long *p) { memset(p, 0, 16); } |> |> -> (Alpha) |> |> 0000000000000000 <f1>: |> 0: 08 00 f0 b7 stq zero,8(a0) |> 4: 1f 04 ff 5f fnop |> 8: 00 00 f0 b7 stq zero,0(a0) |> c: 01 80 fa 6b ret |> 0000000000000020 <f2>: |> 20: 0f 00 30 2c ldq_u t0,15(a0) |> 24: 00 00 50 2c ldq_u t1,0(a0) |> 28: 41 0e 30 48 mskqh t0,a0,t0 |> 2c: 0f 00 30 3c stq_u t0,15(a0) |> 30: 42 06 50 48 mskql t1,a0,t1 |> 34: 08 00 f0 3f stq_u zero,8(a0) |> 38: 00 00 50 3c stq_u t1,0(a0) |> 3c: 01 80 fa 6b ret |> |> Is there anything that can be done about that? Please try -D__NO_STRING_INLINES. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE GmbH, Deutschherrnstr. 15-19, D-90429 Nürnberg Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:13 ` Michel LESPINASSE 2002-04-22 17:39 ` Richard Henderson @ 2002-04-23 2:39 ` Jan Hubicka 2002-04-23 13:36 ` Michel LESPINASSE 1 sibling, 1 reply; 32+ messages in thread From: Jan Hubicka @ 2002-04-23 2:39 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Richard Henderson, gcc list > On Mon, Apr 22, 2002 at 04:59:53PM -0700, Richard Henderson wrote: > > See whether or not disabling glibc's inline expansion of > > memset affects 2.95 vs 3.x with -D__NO_STRING_INLINES. > > I can get gcc 2.95 to be as slow as 3.1 snapshot by using both > -fno-builtin and -D__NO_STRING_INLINES > > but, I cant get gcc 3.1 to do the inlining for me > > If I dont use -fno-builtin -D__NO_STRING_INLINES, and just compile with -O3, > gcc 2.95 generates this: > > movl $table+2,%edi > cld > movl $31,%ecx > rep > stosl > stosw > > while 3.1 snapshot generates this: > > movl $table+2, (%esp) > movl $0, 4(%esp) > movl $126, 8(%esp) > call memset I guess the inlining threshold is too low or the default memset implementation too lame. I was tuning it for Athlon, so the mileage may vary from CPU to CPU. I will investigate the miscompilation first and check this second. Honza > > Hope this helps, > > -- > Michel "Walken" LESPINASSE > Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-23 2:39 ` Jan Hubicka @ 2002-04-23 13:36 ` Michel LESPINASSE 2002-04-24 0:30 ` Jan Hubicka 2002-04-24 3:32 ` Jan Hubicka 0 siblings, 2 replies; 32+ messages in thread From: Michel LESPINASSE @ 2002-04-23 13:36 UTC (permalink / raw) To: Jan Hubicka; +Cc: Richard Henderson, gcc list On Tue, Apr 23, 2002 at 11:25:40AM +0200, Jan Hubicka wrote: > I guess the inlining threshold is too low or the default memset > implementation too lame. I was tunning it for Athlon, so the > mileage may warry from CPU to CPU. I will investigate the > misscompilation first and check this second. > Concerning the inlining, gcc inlines all memcpys with size smaller > than 64 bytes. Perhaps this should be extended to 128 bytes in case > we are still about 2 times as bad. This is partly due to lame > implementation of memset in glibc too :( When gcc does the inlining, performance seems to not be so bad. There is probably still some untapped performance though, as some of the initial and final alignment checks could be omitted when gcc already knows about the alignment of the memory zone (like in my test case, it was an array of shorts in the data segment, so it was known to be on a two-byte boundary at least). But it might be hard to code into gcc, I don't know. Also, as I've been only giving bad news up to now, I wanted to say that now that I've worked around the two issues I had with inlining and with memset, the 3.1 snapshot does provide superior performance on my libmpeg2 codebase, about 5% faster than 2.95.4, and that gets up to 8% when using -fbranch-probabilities and 9% when using -mcpu=athlon-tbird instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am still worried, though, that other people will have the same trouble with inlining as I did and not see all of the performance improvements as a result. Cheers, -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. 
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-23 13:36 ` Michel LESPINASSE @ 2002-04-24 0:30 ` Jan Hubicka 2002-04-24 0:50 ` Jakub Jelinek 0 siblings, 1 reply; 32+ messages in thread From: Jan Hubicka @ 2002-04-24 0:30 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Jan Hubicka, Richard Henderson, gcc list > On Tue, Apr 23, 2002 at 11:25:40AM +0200, Jan Hubicka wrote: > > I guess the inlining threshold is too low or the default memset > > implementation too lame. I was tunning it for Athlon, so the > > mileage may warry from CPU to CPU. I will investigate the > > misscompilation first and check this second. > > > Concerning the inlining, gcc inlines all memcpys with size smaller > > than 64 bytes. Perhaps this should be extended to 128 bytes in case > > we are still about 2 times as bad. This is partly due to lame > > implementation of memset in glibc too :( > > When gcc does the inlining, performance seems to not be so bad. There > is probably still some untapped performance though, as some of the > initial and final alignment checks could be ommited when gcc already > knows about the alignment of the memory zone (like in my test case, it When it knows, it should avoid them. Definitely, on an array of shorts the alignment to an even byte is not done. It is difficult to make it assume that an array of shorts is 4-byte aligned, as the ABI does not specify this, so it may not be. GCC has new alignment tracking code, so it should be better than any previous version, but still not that good. (For instance, when the array is static it definitely has a chance to conclude so, but it does not; however, the majority of string functions are called on computed addresses.) > was an array of shorts in the data segment, so it was known to be on a > two-byte boundary at least). But might be hard to code into gcc, I > dont know. 
> > Also as I've been only giving bad news up to now, I wanted to say that > now that I've worked around the two issues I had with inlining and > with memset, the 3.1 snapshot does provide superior performance on my > libmpeg2 codebase, about 5% faster than 2.95.4, and that gets up to 8% > when using -fbranch-probabilities and 9% when using -mcpu=athlon-tbird That sounds good :) > instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am > still worried though, that other people will have the same trouble > with inlining as I did and not see all of the performance improvements > as a result. I will send a patch to increase the constant to 128. I was re-benchmarking the code, and on P4/Athlon with my assembly memset, 64 is just on the border (i.e. the inlined and non-inlined solutions differ by less than 10%), so setting it to 128 does not make us lose anything. For the glibc implementation, 128 is still a win to inline :( Honza > > Cheers, > > -- > Michel "Walken" LESPINASSE > Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-24 0:30 ` Jan Hubicka @ 2002-04-24 0:50 ` Jakub Jelinek 2002-04-24 1:00 ` Jan Hubicka 0 siblings, 1 reply; 32+ messages in thread From: Jakub Jelinek @ 2002-04-24 0:50 UTC (permalink / raw) To: Jan Hubicka; +Cc: Michel LESPINASSE, Richard Henderson, gcc list On Wed, Apr 24, 2002 at 09:17:48AM +0200, Jan Hubicka wrote: > > instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am > > still worried though, that other people will have the same trouble > > with inlining as I did and not see all of the performance improvements > > as a result. > > I will send patch to increase the constant to 128. I was re-benchmarking > the code and on P4/Athlon and my assembly memset, the 64 is just on the border > (ie inlined/not inlined sollution have less than 10% difference), setting > it to 128 does not make us to loose something. For glibc implementation > 128 is still a win to be inlined :( So please contribute it to glibc then... Is yours an Athlon optimized memset or does it perform better on P3/P4 too? Jakub ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-24 0:50 ` Jakub Jelinek @ 2002-04-24 1:00 ` Jan Hubicka 0 siblings, 0 replies; 32+ messages in thread From: Jan Hubicka @ 2002-04-24 1:00 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Jan Hubicka, Michel LESPINASSE, Richard Henderson, gcc list > On Wed, Apr 24, 2002 at 09:17:48AM +0200, Jan Hubicka wrote: > > > instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am > > > still worried though, that other people will have the same trouble > > > with inlining as I did and not see all of the performance improvements > > > as a result. > > > > I will send patch to increase the constant to 128. I was re-benchmarking > > the code and on P4/Athlon and my assembly memset, the 64 is just on the border > > (ie inlined/not inlined sollution have less than 10% difference), setting > > it to 128 does not make us to loose something. For glibc implementation > > 128 is still a win to be inlined :( > > So please contribute it to glibc then... > Is yours an Athlon optimized memset or does it perform better on P3/P4 too? It is a fairly simple memset (for small cases all that matters is whether the memset has a fast path through), but I do have an Athlon-optimized one as well. The problem with glibc is that it needs machinery for Athlon-specific stuff, and AFAIK the glibc maintainers haven't decided how to do that. Honza > > Jakub ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-23 13:36 ` Michel LESPINASSE 2002-04-24 0:30 ` Jan Hubicka @ 2002-04-24 3:32 ` Jan Hubicka 1 sibling, 0 replies; 32+ messages in thread From: Jan Hubicka @ 2002-04-24 3:32 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Jan Hubicka, Richard Henderson, gcc list > Also as I've been only giving bad news up to now, I wanted to say that > now that I've worked around the two issues I had with inlining and > with memset, the 3.1 snapshot does provide superior performance on my > libmpeg2 codebase, about 5% faster than 2.95.4, and that gets up to 8% > when using -fbranch-probabilities and 9% when using -mcpu=athlon-tbird > instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am Just note that -mcpu=pentiumpro is not generic at all. It optimizes for pentiumpro (and in turn P2/P3), but nothing else. Perhaps we could have a switch to generate code that works well on all modern CPUs, but we don't have one. Honza > still worried though, that other people will have the same trouble > with inlining as I did and not see all of the performance improvements > as a result. > > Cheers, > > -- > Michel "Walken" LESPINASSE > Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 32+ messages in thread
[parent not found: <20020421005718.GA16378@zoy.org.suse.lists.egcs>]
[parent not found: <20020421113238.GC16602@atrey.karlin.mff.cuni.cz.suse.lists.egcs>]
* Re: GCC performance regression - up to 20% ? [not found] ` <20020421113238.GC16602@atrey.karlin.mff.cuni.cz.suse.lists.egcs> @ 2002-04-21 7:58 ` Andi Kleen 0 siblings, 0 replies; 32+ messages in thread From: Andi Kleen @ 2002-04-21 7:58 UTC (permalink / raw) To: Jan Hubicka; +Cc: aj, gcc Jan Hubicka <jh@suse.cz> writes: > > No, the SPECint numbers are quite consistently higher than in any previous > release. See http://www.suse.de/~aj/SPEC > In fact no previous release had such a huge gap in performance. http://www.suse.de/~aj/SPEC/CFP/sandbox-gcc-3_1-branch/index.html 179.art seems to have regressed badly a few days ago. Is that just a bad run or the price for some bug fix? -Andi ^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2002-04-24 21:28 UTC | newest] Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2002-04-20 18:13 GCC performance regression - up to 20% ? Michel LESPINASSE 2002-04-21 3:41 ` Andreas Jaeger 2002-04-21 5:46 ` Jan Hubicka 2002-04-21 23:46 ` Michel LESPINASSE 2002-04-22 0:17 ` Andreas Jaeger 2002-04-22 17:42 ` Michel LESPINASSE 2002-04-22 18:20 ` Andrew Pinski 2002-04-22 18:30 ` Carlo Wood 2002-04-22 19:25 ` Andrew Pinski 2002-04-24 15:24 ` Allan Sandfeld Jensen 2002-04-22 7:11 ` Carlo Wood 2002-04-22 7:11 ` Falk Hueffner 2002-04-22 7:34 ` law 2002-04-22 8:23 ` Johannes Stezenbach 2002-04-22 1:47 ` Gerald Pfeifer 2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE 2002-04-22 14:58 ` Jason R Thorpe 2002-04-22 15:27 ` Michel LESPINASSE 2002-04-22 16:59 ` Segher Boessenkool 2002-04-22 17:10 ` Richard Henderson 2002-04-22 17:13 ` Michel LESPINASSE 2002-04-22 17:39 ` Richard Henderson 2002-04-22 17:49 ` Michel LESPINASSE 2002-04-23 5:03 ` Falk Hueffner 2002-04-23 6:53 ` Andreas Schwab 2002-04-23 2:39 ` Jan Hubicka 2002-04-23 13:36 ` Michel LESPINASSE 2002-04-24 0:30 ` Jan Hubicka 2002-04-24 0:50 ` Jakub Jelinek 2002-04-24 1:00 ` Jan Hubicka 2002-04-24 3:32 ` Jan Hubicka [not found] <20020421005718.GA16378@zoy.org.suse.lists.egcs> [not found] ` <20020421113238.GC16602@atrey.karlin.mff.cuni.cz.suse.lists.egcs> 2002-04-21 7:58 ` GCC performance regression - up to 20% ? Andi Kleen