* Re: GCC performance regression - up to 20% ? [not found] ` <20020421113238.GC16602@atrey.karlin.mff.cuni.cz.suse.lists.egcs> @ 2002-04-21 7:58 ` Andi Kleen 0 siblings, 0 replies; 16+ messages in thread From: Andi Kleen @ 2002-04-21 7:58 UTC (permalink / raw) To: Jan Hubicka; +Cc: aj, gcc Jan Hubicka <jh@suse.cz> writes: > > No, the SPECint numbers are quite consistently higher than in any previous > release. See http://www.suse.de/~aj/SPEC > In fact no previous release had such a huge gap in performance. http://www.suse.de/~aj/SPEC/CFP/sandbox-gcc-3_1-branch/index.html 179.art seems to have regressed badly a few days ago. Is that just a bad run, or the price for some bug fix? -Andi ^ permalink raw reply [flat|nested] 16+ messages in thread
* GCC performance regression - up to 20% ? @ 2002-04-20 18:13 Michel LESPINASSE 2002-04-21 3:41 ` Andreas Jaeger ` (2 more replies) 0 siblings, 3 replies; 16+ messages in thread From: Michel LESPINASSE @ 2002-04-20 18:13 UTC (permalink / raw) To: gcc list Hi, I downloaded the latest 3.1 snapshot (20020415) and ran some performance tests. So far I've been impressed by the FP performance, but kinda disappointed by the integer performance. The benchmarks I've run are two libraries I maintain, libmpeg2 and liba52. These are used by several open-source dvd players, and are quite CPU intensive (especially libmpeg2). So here are my results, using gcc 2.95 as a reference: First the good news: liba52 (mostly FP intensive workload) on athlon tbird 950, using -mcpu=pentiumpro: gcc-3.0 is between 4.5% and 6.5% faster than 2.95.4 depending on streams gcc-3.1 snapshot is between 8% and 9.5% faster than 2.95.4 From these measurements, 3.1 has very nice performance, very close to intel's icc. Great work ! Also using -march=athlon-tbird and generating sse code, I can get a few extra % of performance. Now the bad news: for libmpeg2, which is an integer-only workload, I get a 10% to 20% performance regression between 2.95.4 and 3.1... 3.0 was already slower than 2.95.4, but 3.1 seems to be worse for this workload at least. libmpeg2, on athlon tbird 950, with mmx optimizations: gcc-3.0 is about 2% slower than 2.95.4 gcc-3.1 snapshot is about 10% slower than 2.95.4 libmpeg2, on athlon tbird 950, using pure C code: gcc-3.0 is about 4.5% slower than 2.95.4 gcc-3.1 snapshot is about 5.5% slower than 2.95.4 libmpeg2, on celeron 366, with mmx optimizations: gcc-3.0 is about 4% slower than 2.95.4 gcc-3.1 snapshot is about 20.5% slower than 2.95.4 (!!!!) These results are all very repeatable. The celeron 366 results are the most worrying, as this processor already has borderline performance for decoding mpeg2 streams. 
Is there a known performance regression in current GCCs (say, do they get lower SPECint scores ?) or is it only with my code ? Also, is there anything I could do in my code to enhance performance with newer gcc versions ? One thing I noticed is that the 3.1 snapshot produces less inlining than 3.0 or 2.95. This probably accounts for some of the slowdown I see when using mmx optimizations, as my mmx routines are written using a few routines that I really expect to get inlined. Is there any way I can regain control over this, so that gcc honours the inline keyword ? I have not managed to do this either. BTW, these two apps I mentioned can be found at http://libmpeg2.sourceforge.net/ http://liba52.sourceforge.net/ Puzzled, -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-20 18:13 Michel LESPINASSE @ 2002-04-21 3:41 ` Andreas Jaeger 2002-04-21 5:46 ` Jan Hubicka 2002-04-22 1:47 ` Gerald Pfeifer 2 siblings, 0 replies; 16+ messages in thread From: Andreas Jaeger @ 2002-04-21 3:41 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list Michel LESPINASSE <walken@zoy.org> writes: > Hi, > > I have downloaded the latest 3.1 snapshot (20020415) and ran some > performance tests. So far I've been impressed by the FP performance, > but kinda disappointed by the integer performance. > > The benchmarks I've run are two libraries I maintain, libmpeg2 and > liba52. These are used by several open-source dvd players, and are > quite CPU intensive (especially libmpeg2). So here are my results, > using gcc 2.95 as a reference: > > First the good news: liba52 (mostly FP intensive workload) > on athlon tbird 950, using -mcpu=pentiumpro: > gcc-3.0 is between 4.5% and 6.5% faster than 2.95.4 depending on streams > gcc-3.1 snapshot is between 8% and 9.5% faster than 2.95.4 > from these measurements 3.1 has a very nice performance, very close to > intel's icc. Great work ! Also using -march=athlon-tbird and > generating sse code, I can get yet a few extra % of performance. > > Now the bad news: for libmepg2, which is an integer-only workload, I > get a 10% to 20% performance regression between 2.95.4 and 3.1... 3.0 > was already slower than 2.95.4, but 3.1 seems to be worse for this > workload at least. > > libmpeg2, on athlon tbird 950, with mmx optimizations: > gcc-3.0 is about 2% slower than 2.95.4 > gcc-3.1 snapshot is about 10% slower than 2.95.4 > > libmpeg2, on athlon tbird 950, using pure C code: > gcc-3.0 is about 4.5% slower than 2.95.4 > gcc-3.1 snapshot is about 5.5% slower than 2.95.4 > > libmpeg2, on celeron 366, with mmx optimizations: > gcc-3.0 is about 4% slower than 2.95.4 > gcc-3.1 snapshot is about 20.5% slower than 2.95.4 (!!!!) > > These results are all very repeatable. 
the celeron 366 results are the > most worrying, as this processor already has borderline performance > for decoding mpeg2 streams. > > Is there a known performance regression in current GCCs (say, do they > get lower SPECint scores ?) or is it only with my code ? Can you distill a test case that is as small as possible (the optimal way would be just the loop that causes the problem) and show it to us? That way it's much easier to discuss the issues and start looking into what needs to be done. > Also, is there anything I could do in my code to enhance performance > with newer gcc versions ? One thing I noticed is that 3.1 snapshot > produces less inlining than 3.0 or 2.95. This probably accounts for > some of the slowdown I see when using mmx optimizations, as my mmx > routines are written using a few routines that I really expect to get > inlined. Is there any way I can get back control about that, so that > gcc honours the inline keyword ? I have not managed to do this either. Try -finline-limit=2000 but check the manual for the exact name of the switch. Andreas -- Andreas Jaeger SuSE Labs aj@suse.de private aj@arthur.inka.de http://www.suse.de/~aj ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-20 18:13 Michel LESPINASSE 2002-04-21 3:41 ` Andreas Jaeger @ 2002-04-21 5:46 ` Jan Hubicka 2002-04-21 23:46 ` Michel LESPINASSE 2002-04-22 1:47 ` Gerald Pfeifer 2 siblings, 1 reply; 16+ messages in thread From: Jan Hubicka @ 2002-04-21 5:46 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list > > libmpeg2, on athlon tbird 950, with mmx optimizations: > gcc-3.0 is about 2% slower than 2.95.4 > gcc-3.1 snapshot is about 10% slower than 2.95.4 > > libmpeg2, on athlon tbird 950, using pure C code: > gcc-3.0 is about 4.5% slower than 2.95.4 > gcc-3.1 snapshot is about 5.5% slower than 2.95.4 > > libmpeg2, on celeron 366, with mmx optimizations: > gcc-3.0 is about 4% slower than 2.95.4 > gcc-3.1 snapshot is about 20.5% slower than 2.95.4 (!!!!) > > These results are all very repeatable. the celeron 366 results are the > most worrying, as this processor already has borderline performance > for decoding mpeg2 streams. Are you able to figure out what exactly makes the code slow? Having a self-contained testcase will definitely help a lot. What flags do you use? I would be quite curious whether using profile feedback helps (see the documentation of -fprofile-arcs and -fbranch-probabilities); you could just have some badly predicted branch in the innermost loop. The problem with such code usually is that it is tuned to avoid problems on one particular version of gcc, so even when the new version is faster overall, it is slower in such places. We've hit a similar case with Athlon matrix multiplication code, and such problems are usually easy to fix on the gcc side. > > Is there a known performance regression in current GCCs (say, do they > get lower SPECint scores ?) or is it only with my code ? No, the SPECint numbers are quite consistently higher than in any previous release. See http://www.suse.de/~aj/SPEC In fact no previous release had such a huge gap in performance. 
> > Also, is there anything I could do in my code to enhance performance > with newer gcc versions ? One thing I noticed is that 3.1 snapshot > produces less inlining than 3.0 or 2.95. This probably accounts for > some of the slowdown I see when using mmx optimizations, as my mmx > routines are written using a few routines that I really expect to get > inlined. Is there any way I can get back control about that, so that > gcc honours the inline keyword ? I have not managed to do this either. There is a parameter to increase the inline threshold, as well as an always_inline function attribute. See the documentation. Honza > > BTW, these two apps I mentioned can be found at > http://libmpeg2.sourceforge.net/ > http://liba52.sourceforge.net/ > > Puzzled, > > -- > Michel "Walken" LESPINASSE > Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-21 5:46 ` Jan Hubicka @ 2002-04-21 23:46 ` Michel LESPINASSE 2002-04-22 0:17 ` Andreas Jaeger 2002-04-22 7:11 ` Carlo Wood 0 siblings, 2 replies; 16+ messages in thread From: Michel LESPINASSE @ 2002-04-21 23:46 UTC (permalink / raw) To: Jan Hubicka, Andreas Jaeger; +Cc: gcc list Hi, I spent the afternoon looking at this issue and trying different flags. I think I got some interesting results, though I still don't get the full picture. The first thing I tweaked was to make the inlining work the way I wanted. As I understand it, in gcc 2.95 and 3.0, when compiling with usual options (-O3), the behaviour was that functions declared inline were always inlined, and functions not declared inline were inlined only if gcc thought they were simple enough. In the gcc 3.1 snapshot, however, it looks like the inline keyword only makes the function an inline candidate. I will argue later on why I think the old behaviour (inline being always honoured) is more useful (basically, function specialization is harder to do if inline is not honoured). In the meantime, I worked around the issue by using an absurdly high value, -finline-limit=6000 (I tried 2000 first, which was not sufficient), and then I also added -fno-inline-functions so that I don't get everything inlined when I don't ask for it. With the custom inlining, gcc-3.1 snapshot is about 4% slower than gcc-2.95, on my athlon tbird. This is progress, since without the custom inlining, the slowdown was 10%. Then I tried to figure out where the slowdown is, using gprof. And this is where things get really interesting: gprof tells me that the code compiled with 3.1 is faster, but 'time' tells me that the user time spent executing that code is higher with 3.1 than with 2.95. I'm not sure what to make of this, but I think this might give you some clues, so I'll describe it in more detail. I'm not sure what the overhead is, but it seems to be right in gprof's blind spot. 
I have to describe my gprof methodology first. 'normal' gprof (i.e. compiling every file with -pg) seems to have a high overhead for me, plus it conflicts with -fomit-frame-pointer, which I usually use. So I tend to use what I'll call 'light' gprof, which is as follows: everything is compiled with -g -O3 -fomit-frame-pointer -mcpu=pentiumpro except main.c, which is compiled with -g -O3 -p and is not cpu intensive. 'light' gprof cannot help me figure out call graphs, but it should normally be good enough to obtain a flat profile. When using gcc 2.95, 'light' gprof works great. In a flat profile, the last number in the 'cumulative seconds' column always matches (within 0.1s) the user time as reported by the 'time' command. I think I can trust the flat profile information. When using gcc 3.0 or the 3.1 snapshot though, there is a several-seconds gap between 'cumulative seconds' and 'user time'. I don't understand what happens during this time - could it be that 3.x has a higher overhead for function calls, and that this overhead is not accounted for in gprof ? I don't understand what happens here, but I get the feeling that this might be key to the slowdown I observe. To summarize the gprof thing, I observe 3.1 gprof time < 2.95 gprof time = 2.95 user time < 3.1 user time If I use 'normal' gprof (every file gets -pg and I remove the -fomit-frame-pointer), that inflates the user times a lot, but I still get 3.1 gprof time < 2.95 gprof time and 2.95 user time < 3.1 user time. Also to answer Jan's question, I did try to use -fbranch-probabilities and it helped; with this option the 3.1 snapshot is about 0.5% slower (in user time) than 2.95, instead of 4% slower without. It's still frustrating though, because gprof claims 3.1 is about 10% faster than 2.95, so there may be a lot of untapped performance. OK, so this is all I can say for now. I hope someone who knows gprof internals better than I do can explain why gprof returns times that are smaller in the 3.1 snapshot vs. 
2.95, while time does not agree. I thought I should add a few comments about the structure of the libmpeg2 code, as I suppose it is somewhat unusual. The time-consuming loop is pretty big, and includes several function calls. Each loop execution decodes a full mpeg2 macroblock (= 256 pixels); during this decoding it calls the IDCT (cosine transform) function up to 6 times, and some MC (motion compensation) functions up to 12 times. All these calls are indirect, using function pointers. These called functions are all kinda short, executing in about 0.4 microseconds per call on average. Yes, that's a lot of function calls, and I suspect SPECint does not do as many, which could maybe explain why it's not seeing the same performance regression that I see ? It would seem consistent with the gprof blind spot thing, too. Finally, I thought I should include a small explanation about what I do with inlines and why I'd like to have a gcc option so that the inline keyword always inlines a function, instead of just making that function an inline candidate. One example of that would be in the MC functions mentioned above. To get acceptable speed, these are written using mmx assembly operations. Actually they even have several implementations: one in straight C for compatibility, one in straight mmx, one using sse integer instructions (as present on PIII and athlon), and one using 3dnow instructions. It's easy to select at init time which implementation to use, since the MC functions are only called through function pointers. The only difference between the sse version and the 3dnow version is that the sse version uses the pavgb instruction, while the 3dnow version uses pavgusb. These instructions have different opcodes but the same behaviour, so I wrote this using an inline function:

static inline void MC_generic (...., int cpu)
{
    .... do stuff ....
    if (cpu == CPU_3DNOW)
        pavgusb (....);
    else
        pavgb ();
    .... do more stuff ....
}

void MC_3dnow (.....)
{
    MC_generic (....., CPU_3DNOW);
}

void MC_sse (.....)
{
    MC_generic (....., CPU_SSE);
}

In gcc-2.95 and gcc-3.0, this construct works out nicely since MC_generic gets inlined, and then the compiler figures out that the cpu test in it is a constant and generates clean code for both the 3dnow and the sse function. This kind of specialisation is sometimes very convenient, but it requires that the inline keyword performs unconditional inlining, not subject to compiler heuristics. I would really love to see a gcc option to make inlining unconditional when using the inline keyword, and have gcc use its heuristics when there is no such keyword. This email is longer than I thought it would be; thanks a lot to those of you who're still reading me :) Cheers, -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-21 23:46 ` Michel LESPINASSE @ 2002-04-22 0:17 ` Andreas Jaeger 2002-04-22 17:42 ` Michel LESPINASSE 2002-04-22 7:11 ` Carlo Wood 1 sibling, 1 reply; 16+ messages in thread From: Andreas Jaeger @ 2002-04-22 0:17 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Jan Hubicka, gcc list Michel LESPINASSE <walken@zoy.org> writes: > Hi, > > I spent the afternoon looking at this issue and trying different > flags. I think I got some interesting results, though I still dont get > the full picture. > > The first thing I tweaked was to make the inlining work the way I > wanted. As I understand it, in gcc 2.95 and 3.0, when compiling with > usual options (-O3), the behaviour was that functions declared inline > were always inlined, and functions not declared inline were inlined > only if gcc thought they were simple enough. While in gcc 3.1 > snapshot, it looks like the inline keyword only makes the function an > inline candidate. I will argue later on why I think the old behaviour > (inline being always honoured) is more useful (basically, function > specialization is harder to do if inline is not honoured). In the mean > time, I worked around the issue by using an absurdly high value > -finline-limit=6000 (I tried 2000 first which was not sufficient), and > then I also added -fno-inline-functions so that I dont get everything > inlined when I dont ask for it. > > With the custom inlining, gcc-3.1 snapshot is about 4% slower than > gcc-2.95, on my athlon tbird. This is a progress, since without the > custom inlining, the slowdown was 10%. > > Then I tried to figure out where the slowdown is, using gprof. And > this is where things get really interesting: gprof tells me that the > code compiled with 3.1 is faster, but 'time' tells me that the user > time spent executing that code is higher with 3.1 than with 2.95. 
I'm > not sure what to make of this, but I think this might give you some > clues, so I'll describe it in more detail. I'm not sure what the > overhead is, but it seems to be right in gprof's blind spot. Using -pg changes the binary and the results might not be the same as without the flag. You could try to use the Performance counters of your Athlon using e.g. oprofile (http://oprofile.sourceforge.net). > I have to describe my gprof methodology first. 'normal' gprof > (i.e. compiling every file with -pg) seems to have a high overhead for > me, plus it conflicts with -fomit-frame-pointer which I usualy use. So > I tend to use what I'll call 'light' gprof, which is as follows: > everything is compiled with -g -O3 -fomit-frame-pointer -mcpu=pentiumpro > except main.c which is compiled with -g -O3 -p and is not cpu intensive. > 'light' gprof can not help me figure out call graphs, but it should > normally be good enough to obtain a flat profile. > > When using gcc 2.95, 'light' gprof works great. In a flat profile, the > last number in the 'cumulative seconds' column always matches (within > 0.1s) the user time as reported by the 'time' command. I think I can > trust the flat profile information. > > When using gcc 3.0 or 3.1 shapshot though, there is a several-seconds > gap between 'cumulative seconds' and 'user time'. I dont understand > what happens during this time - could it be that 3.x has a higher > overhead for function calls, and that this overhead is not accounted > for in gprof ? I dont understand what happens here, but I get the > feeling that this might be key to the slowdown I observe. > > To summarize the gprof thing, I observe > 3.1 gprof time < 2.95 gprof time = 2.95 user time < 3.1 user time > > If I use 'normal' gprof (every file gets -pg and I remove the > -fomit-frame-pointer), that inflates the user times a lot, but I still > get 3.1 gprof time < 2.95 gprof time and 2.95 user time < 3.1 user time. 
> > Also to answer Jan's question, I did try to use -fbranch-probabilities > and it helped, with this option 3.1 snapshot is about 0.5% slower (in > user time) than 2.95, instead of 4% slower without. It's still > frustrating though, because gprof pretends 3.1 is about 10% faster > than 2.95, so there may be a lot of untapped performance. > > > OK, so this is all I can say for now. I hope someone will know gprof > internals better than I do and understand why gprof returns times that > are smaller in 3.1 shapshot vs. 2.95, while time does not agree. > > I thought I should add a few comments about the structure of libmpeg2 > code, as I suppose it is somewhat unusual. The time-consuming loop is > pretty big, and includes several function calls. Each loop execution > decodes a full mpeg2 macroblocks (=256 pixels), during this decoding > it calls the IDCT (cosine transform) function up to 6 times, and some > MC (motion compensation) functions up to 12 times. All these calls are > indirect, using function pointers. These called functions are all > kinda short, executing in about 0.4 microseconds per call on average. > Yes, thats a lot of function calls, and I suspect SPECint does not do > as many, which could maybe explain why it's not seeing the same > performance regression that I see ? It would seem consistent with the > gprof blind spot thing, too. It would really help a lot if you could try to write some small program that behaves the same way (performance wise) as this routine libmpeg2. > > Finally, I thought I should include a small explanation about what I > do with inlines and why I'd like to have a gcc option so that the > inline keyword always inlines a function, instead of just making that > function an inline candidate. One example of that would be in the MC > functions mentionned above. To get acceptable speed, these are written > using mmx assembly operations. 
Actually they even have several > implementations, one in straight c for compatibility, one in straight > mmx, one using sse integer instructions (as present on PIII and > athlon), and one using 3dnow instructions. It's easy to select at init > time which implementation to use, since the MC functions are only > called thru function pointers. The only difference between the sse > version and the 3dnow version, is that the sse version uses the pavgb > instruction, while the 3dnow version uses pavgusb. These instructions > have different opcodes but the same behaviour, so I wrote this using > an inline function: > > static inline void MC_generic (...., int cpu) > { > .... do stuff .... > if (cpu == CPU_3DNOW) > pavgusb (....); > else > pavgb (); > .... do more stuff .... > } > > void MC_3dnow (.....) > { > MC_generic (....., CPU_3DNOW); > } > > void MC_sse (.....) > { > MC_generic (....., CPU_SSE); > } > > In gcc-2.95 and gcc-3.0, this construct works out nicely since > MC_generic gets inlined, and then the compiler figures out that the > cpu test in it is a constant and generates clean code for both the > 3dnow and the sse function. This kind of specialisation is sometimes > very convenient, but it requires that the inline keyword does an > unconditional inlining, not subject to compiler heuristics. I would > really love to see a gcc option to make inlining unconditional when > using the inline keyword, and have gcc use its heuristics when there > is no such keyword. You might want to use the always_inline function attribute for your inline functions in 3.1. It makes the inlining unconditional. > > This email is longer than I thought it would be, thanks a lot for > those who're still reading me :) ;-) Andreas -- Andreas Jaeger SuSE Labs aj@suse.de private aj@arthur.inka.de http://www.suse.de/~aj ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 0:17 ` Andreas Jaeger @ 2002-04-22 17:42 ` Michel LESPINASSE 2002-04-22 18:20 ` Andrew Pinski 0 siblings, 1 reply; 16+ messages in thread From: Michel LESPINASSE @ 2002-04-22 17:42 UTC (permalink / raw) To: Andreas Jaeger; +Cc: Jan Hubicka, gcc list On Mon, Apr 22, 2002 at 09:10:55AM +0200, Andreas Jaeger wrote: > You might want to use the always_inline function attribute for your > inline functions in 3.1. It makes the inlining unconditional. Thanks, that suggestion does solve my inlining problem. I think the documentation for this option is a bit confusing: it says "For functions declared inline, this attribute inlines the function even if no optimization level was specified." which is true I guess, but the documentation does not mention that when optimization is enabled gcc will not always inline functions declared inline, unless we use this attribute. Also I want to point out that this is a new behaviour, as gcc up to 3.0.4 apparently always inlined functions declared inline (when compiling with -O2 at least). I think either the documentation of always_inline should be made more clear, or the inline keyword should be made to behave like always_inline when optimizing with -O2 and above (as it did previously). -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 17:42 ` Michel LESPINASSE @ 2002-04-22 18:20 ` Andrew Pinski 2002-04-22 18:30 ` Carlo Wood 0 siblings, 1 reply; 16+ messages in thread From: Andrew Pinski @ 2002-04-22 18:20 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Andreas Jaeger, Jan Hubicka, gcc list On Monday, April 22, 2002, at 08:39 , Michel LESPINASSE wrote: > On Mon, Apr 22, 2002 at 09:10:55AM +0200, Andreas Jaeger wrote: >> You might want to use the always_inline function attribute for your >> inline functions in 3.1. It makes the inlining unconditional. > > Thanks, that suggestion does solve my inlining problem. I think the > documentation for this option is a bit confusing: it says "For > functions declared inline, this attribute inlines the function even if > no optimization level was specified." which is true I guess, but the > documentation does not mention that when optimization is enabled gcc > will not always inline functions declared inline, unless we use this > attribute. I think the problem is that part of the documentation was written before the part about always_inline. > > Also I want to point out that this is a new behaviour, as gcc up to > 3.0.4 apparently always inlined functions declared inline (when > compiling with -O2 at least). Not really new `behavior'; it was just that the inline limit was lowered in 3.0 and above. > > I think either the documentation of always_inline should be made more > clear, or the inline keyword should be made to behave like > always_inline when optimizing with -O2 and above (as it did previously). The inline keyword was always just a hint to the compiler in ISO C++ and ISO C, and it was -O3 that included the inline optimization. Thanks, Andrew Pinski > > -- > Michel "Walken" LESPINASSE > Is this the best that god can do ? Then I'm not impressed. > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 18:20 ` Andrew Pinski @ 2002-04-22 18:30 ` Carlo Wood 2002-04-22 19:25 ` Andrew Pinski 0 siblings, 1 reply; 16+ messages in thread From: Carlo Wood @ 2002-04-22 18:30 UTC (permalink / raw) To: Andrew Pinski; +Cc: gcc On Mon, Apr 22, 2002 at 08:49:20PM -0400, Andrew Pinski wrote: > And it was -O3 that included the inline optimization. Huh?? I am using the fact (I hope) that functions marked with 'inline' or member functions that are defined inside their class declaration are inlined already with -O. Talking about real small functions here (accessors etc). -- Carlo Wood <carlo@alinoe.com> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 18:30 ` Carlo Wood @ 2002-04-22 19:25 ` Andrew Pinski 2002-04-24 15:24 ` Allan Sandfeld Jensen 0 siblings, 1 reply; 16+ messages in thread From: Andrew Pinski @ 2002-04-22 19:25 UTC (permalink / raw) To: Carlo Wood; +Cc: gcc On Monday, April 22, 2002, at 09:20 , Carlo Wood wrote: > On Mon, Apr 22, 2002 at 08:49:20PM -0400, Andrew Pinski wrote: >> And it was -O3 that included the inline optimization. > > Huh?? > I am using the fact (I hope) that functions marked with 'inline' > or member functions that are defined inside their class declaration > are inlined already with -O. > Talking about real small functions here (accessors etc). > Read the manual, under the -O3 option: Optimize yet more. `-O3' turns on all optimizations specified by `-O2' and also turns on the `-finline-functions' and `-frename-registers' options. These options control various sorts of optimizations: -O -O1 Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function. Without `-O', the compiler's goal is to reduce the cost of compilation and to make debugging produce the expected results. Statements are independent: if you stop the program with a breakpoint between statements, you can then assign a new value to any variable or change the program counter to any other statement in the function and get exactly the results you would expect from the source code. Without `-O', the compiler only allocates variables declared register in registers. The resulting compiled code is a little worse than produced by PCC without `-O'. With `-O', the compiler tries to reduce code size and execution time. When you specify `-O', the compiler turns on `-fthread-jumps' and `-fdefer-pop' on all machines. The compiler turns on `-fdelayed-branch' on machines that have delay slots, and `-fomit-frame-pointer' on machines that can support debugging even without a frame pointer. 
On some machines the compiler also turns on other flags. -O2 Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. The compiler does not perform loop unrolling or function inlining when you specify `-O2'. As compared to `-O', this option increases both compilation time and the performance of the generated code. `-O2' turns on all optional optimizations except for loop unrolling, function inlining, and register renaming. It also turns on the `-fforce-mem' option on all machines and frame pointer elimination on machines where doing so does not interfere with debugging. Please note the warning under `-fgcse' about invoking `-O2' on programs that use computed gotos. -O3 Optimize yet more. `-O3' turns on all optimizations specified by `-O2' and also turns on the `-finline-functions' and `-frename-registers' options. -O0 Do not optimize. -Os Optimize for size. `-Os' enables all `-O2' optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size. If you use multiple `-O' options, with or without level numbers, the last such option is the one that is effective. > -- > Carlo Wood <carlo@alinoe.com> > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 19:25 ` Andrew Pinski @ 2002-04-24 15:24 ` Allan Sandfeld Jensen 0 siblings, 0 replies; 16+ messages in thread From: Allan Sandfeld Jensen @ 2002-04-24 15:24 UTC (permalink / raw) To: Andrew Pinski; +Cc: gcc On Tuesday 23 April 2002 04:05, Andrew Pinski wrote: > On Monday, April 22, 2002, at 09:20 , Carlo Wood wrote: > > On Mon, Apr 22, 2002 at 08:49:20PM -0400, Andrew Pinski wrote: > >> And it was -O3 that included the inline optimization. > > > > Huh?? > > I am using the fact (I hope) that functions marked with 'inline' > > or member functions that are defined inside their class declaration > > are inlined already with -O. > > Talking about real small functions here (accessors etc). > > Read the manual, under the -O3 option: > > Optimize yet more. `-O3' turns on all optimizations specified by `-O2' > and also turns on the `-finline-functions' and `-frename-registers' > options. > That -finline-functions is the "inline optimization", i.e. inlining functions NOT declared inline. Functions declared inline are inlined already at -O. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-21 23:46 ` Michel LESPINASSE 2002-04-22 0:17 ` Andreas Jaeger @ 2002-04-22 7:11 ` Carlo Wood 2002-04-22 7:11 ` Falk Hueffner ` (2 more replies) 1 sibling, 3 replies; 16+ messages in thread From: Carlo Wood @ 2002-04-22 7:11 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Jan Hubicka, Andreas Jaeger, gcc list > > Then I tried to figure out where the slowdown is, using gprof. And > this is where things get really interesting: gprof tells me that the > code compiled with 3.1 is faster, but 'time' tells me that the user > time spent executing that code is higher with 3.1 than with 2.95. I'm > not sure what to make of this, but I think this might give you some > clues, so I'll describe it in more detail. I'm not sure what the > overhead is, but it seems to be right in gprof's blind spot. gprof "measures" the time that a function takes by probing which function the program is executing about every 20 ms. From that it builds up a statistical histogram. I wish there were a more precise profiler that used the hardware counters. Does anyone know of one? Hmm, I remember a Subject: line on the PAPI mailing list that mentioned gprof, but I deleted it. I think it asked the same question: whether or not there existed a 'gprof' that used PAPI. -- Carlo Wood <carlo@alinoe.com> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 7:11 ` Carlo Wood @ 2002-04-22 7:11 ` Falk Hueffner 2002-04-22 7:34 ` law 2002-04-22 8:23 ` Johannes Stezenbach 2 siblings, 0 replies; 16+ messages in thread From: Falk Hueffner @ 2002-04-22 7:11 UTC (permalink / raw) To: gcc list Carlo Wood <carlo@alinoe.com> writes: > I wish there would be a more precise profiler that uses the hardware > counters. Does anyone know of one? Hmm, I remember a Subject: line > on the PAPI mailinglist that mentioned gprof, but I deleted it. I > think it asked the same question: whether or not there existed a > 'gprof' that used PAPI. http://oprofile.sourceforge.net/ Only runs on i386 Linux and needs a kernel module, though. Falk ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 7:11 ` Carlo Wood 2002-04-22 7:11 ` Falk Hueffner @ 2002-04-22 7:34 ` law 2002-04-22 8:23 ` Johannes Stezenbach 2 siblings, 0 replies; 16+ messages in thread From: law @ 2002-04-22 7:34 UTC (permalink / raw) To: Carlo Wood; +Cc: Michel LESPINASSE, Jan Hubicka, Andreas Jaeger, gcc list In message <20020422155801.A21747@alinoe.com>, Carlo Wood writes: > > > > Then I tried to figure out where the slowdown is, using gprof. And > > this is where things get really interesting: gprof tells me that the > > code compiled with 3.1 is faster, but 'time' tells me that the user > > time spent executing that code is higher with 3.1 than with 2.95. I'm > > not sure what to make of this, but I think this might give you some > > clues, so I'll describe it in more detail. I'm not sure what the > > overhead is, but it seems to be right in gprof's blind spot. > > gprof "measures" the time that a function takes by probing > which function the program is executing about every 20 ms. > From that it builds up a statistical histogram. > > I wish there would be a more precise profiler that uses the > hardware counters. Does anyone know of one? Hmm, I remember > a Subject: line on the PAPI mailinglist that mentioned gprof, > but I deleted it. I think it asked the same question: whether > or not there existed a 'gprof' that used PAPI. You might look at oprofile. jeff ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 7:11 ` Carlo Wood 2002-04-22 7:11 ` Falk Hueffner 2002-04-22 7:34 ` law @ 2002-04-22 8:23 ` Johannes Stezenbach 2 siblings, 0 replies; 16+ messages in thread From: Johannes Stezenbach @ 2002-04-22 8:23 UTC (permalink / raw) To: Carlo Wood; +Cc: Michel LESPINASSE, Jan Hubicka, Andreas Jaeger, gcc list On Mon, Apr 22, 2002 at 03:58:01PM +0200, Carlo Wood wrote: > gprof "measures" the time that a function takes by probing > which function the program is executing about every 20 ms. > From that it builds up a statistical histogram. > > I wish there would be a more precise profiler that uses the > hardware counters. Does anyone know of one? Hmm, I remember > a Subject: line on the PAPI mailinglist that mentioned gprof, > but I deleted it. I think it asked the same question: whether > or not there existed a 'gprof' that used PAPI. I know of two profilers based on gcc's -finstrument-functions feature: * FunctionCheck: http://www710.univ-lyon1.fr/~yperret/fnccheck/profiler.html http://sources.redhat.com/ml/binutils/2001-01/msg00142.html I've tried this one once, looked promising but seems to be unmaintained now. Alpha quality. * Eazel's profiler: http://www.mozilla.org/performance/eazel.html http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&dir=eazel-tools/profiler Alpha quality, now unmaintained since Eazel is out of business. Haven't tried it. Regards, Johannes ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-20 18:13 Michel LESPINASSE 2002-04-21 3:41 ` Andreas Jaeger 2002-04-21 5:46 ` Jan Hubicka @ 2002-04-22 1:47 ` Gerald Pfeifer 2 siblings, 0 replies; 16+ messages in thread From: Gerald Pfeifer @ 2002-04-22 1:47 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list On Sat, 20 Apr 2002, Michel LESPINASSE wrote: > One thing I noticed is that 3.1 snapshot produces less inlining than 3.0 > or 2.95. There is a difference between 3.0/3.0.1/3.0.2 on the one hand, and 3.0.3/3.0.4 on the other hand. GCC 3.0-3.0.2 were so bad with respect to compile-time and memory consumption that the inlining strategy was changed for 3.0.3 (and mainline, which is now becoming 3.1). The new strategy should be more or less equivalent in most cases, but if you are seeing significant differences, this issue needs to be revisited. Gerald -- Gerald "Jerry" pfeifer@dbai.tuwien.ac.at http://www.dbai.tuwien.ac.at/~pfeifer/ ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2002-04-24 21:28 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <20020421005718.GA16378@zoy.org.suse.lists.egcs> [not found] ` <20020421113238.GC16602@atrey.karlin.mff.cuni.cz.suse.lists.egcs> 2002-04-21 7:58 ` GCC performance regression - up to 20% ? Andi Kleen 2002-04-20 18:13 Michel LESPINASSE 2002-04-21 3:41 ` Andreas Jaeger 2002-04-21 5:46 ` Jan Hubicka 2002-04-21 23:46 ` Michel LESPINASSE 2002-04-22 0:17 ` Andreas Jaeger 2002-04-22 17:42 ` Michel LESPINASSE 2002-04-22 18:20 ` Andrew Pinski 2002-04-22 18:30 ` Carlo Wood 2002-04-22 19:25 ` Andrew Pinski 2002-04-24 15:24 ` Allan Sandfeld Jensen 2002-04-22 7:11 ` Carlo Wood 2002-04-22 7:11 ` Falk Hueffner 2002-04-22 7:34 ` law 2002-04-22 8:23 ` Johannes Stezenbach 2002-04-22 1:47 ` Gerald Pfeifer