* GCC performance regression - up to 20% ? @ 2002-04-20 18:13 Michel LESPINASSE
From: Michel LESPINASSE @ 2002-04-20 18:13 UTC
To: gcc list

Hi,

I have downloaded the latest 3.1 snapshot (20020415) and ran some performance tests. So far I've been impressed by the FP performance, but kinda disappointed by the integer performance.

The benchmarks I've run are two libraries I maintain, libmpeg2 and liba52. These are used by several open-source dvd players, and are quite CPU intensive (especially libmpeg2). So here are my results, using gcc 2.95 as a reference.

First the good news: liba52 (mostly FP intensive workload) on athlon tbird 950, using -mcpu=pentiumpro:
  gcc-3.0 is between 4.5% and 6.5% faster than 2.95.4 depending on streams
  gcc-3.1 snapshot is between 8% and 9.5% faster than 2.95.4
From these measurements 3.1 has very nice performance, very close to intel's icc. Great work ! Also, using -march=athlon-tbird and generating sse code, I can get a few extra % of performance.

Now the bad news: for libmpeg2, which is an integer-only workload, I get a 10% to 20% performance regression between 2.95.4 and 3.1... 3.0 was already slower than 2.95.4, but 3.1 seems to be worse, for this workload at least.

libmpeg2, on athlon tbird 950, with mmx optimizations:
  gcc-3.0 is about 2% slower than 2.95.4
  gcc-3.1 snapshot is about 10% slower than 2.95.4

libmpeg2, on athlon tbird 950, using pure C code:
  gcc-3.0 is about 4.5% slower than 2.95.4
  gcc-3.1 snapshot is about 5.5% slower than 2.95.4

libmpeg2, on celeron 366, with mmx optimizations:
  gcc-3.0 is about 4% slower than 2.95.4
  gcc-3.1 snapshot is about 20.5% slower than 2.95.4 (!!!!)

These results are all very repeatable. The celeron 366 results are the most worrying, as this processor already has borderline performance for decoding mpeg2 streams.

Is there a known performance regression in current GCCs (say, do they get lower SPECint scores ?) or is it only with my code ?

Also, is there anything I could do in my code to enhance performance with newer gcc versions ? One thing I noticed is that the 3.1 snapshot produces less inlining than 3.0 or 2.95. This probably accounts for some of the slowdown I see when using mmx optimizations, as my mmx routines are written using a few helper routines that I really expect to get inlined. Is there any way I can get back control over that, so that gcc honours the inline keyword ? I have not managed to do this.

BTW, the two apps I mentioned can be found at
http://libmpeg2.sourceforge.net/
http://liba52.sourceforge.net/

Puzzled,

--
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.
* Re: GCC performance regression - up to 20% ?
From: Andreas Jaeger @ 2002-04-21 3:41 UTC
To: Michel LESPINASSE; +Cc: gcc list

Michel LESPINASSE <walken@zoy.org> writes:

> These results are all very repeatable. The celeron 366 results are the
> most worrying, as this processor already has borderline performance
> for decoding mpeg2 streams.
>
> Is there a known performance regression in current GCCs (say, do they
> get lower SPECint scores ?) or is it only with my code ?

Can you distill a test case that is as small as possible (the optimal way would be just the loop that causes the problem) and show it to us? That way it's much easier to discuss the issues and start looking into what needs to be done.

> Also, is there anything I could do in my code to enhance performance
> with newer gcc versions ? One thing I noticed is that the 3.1 snapshot
> produces less inlining than 3.0 or 2.95. [...] Is there any way I can
> get back control over that, so that gcc honours the inline keyword ?

Try -finline-limit=2000, but check the manual for the exact name of the switch.

Andreas

--
Andreas Jaeger
SuSE Labs aj@suse.de
private aj@arthur.inka.de
http://www.suse.de/~aj
* Re: GCC performance regression - up to 20% ?
From: Jan Hubicka @ 2002-04-21 5:46 UTC
To: Michel LESPINASSE; +Cc: gcc list

> libmpeg2, on athlon tbird 950, with mmx optimizations:
>   gcc-3.0 is about 2% slower than 2.95.4
>   gcc-3.1 snapshot is about 10% slower than 2.95.4
>
> libmpeg2, on athlon tbird 950, using pure C code:
>   gcc-3.0 is about 4.5% slower than 2.95.4
>   gcc-3.1 snapshot is about 5.5% slower than 2.95.4
>
> libmpeg2, on celeron 366, with mmx optimizations:
>   gcc-3.0 is about 4% slower than 2.95.4
>   gcc-3.1 snapshot is about 20.5% slower than 2.95.4 (!!!!)
>
> These results are all very repeatable. The celeron 366 results are the
> most worrying, as this processor already has borderline performance
> for decoding mpeg2 streams.

Are you able to figure out what exactly makes the code slow? Having a self-contained testcase will definitely help a lot. What flags do you use?

I would be quite curious whether using profile feedback helps (see the documentation of -fprofile-arcs and -fbranch-probabilities); you may just have some badly predicted branch in the innermost loop. The problem with such code is usually that it is tuned to avoid problems on one particular version of gcc, so even when a new version is faster overall, it is slower in such places. We hit a similar case with Athlon matrix multiplication code, and such problems are usually easy to fix on the gcc side.

> Is there a known performance regression in current GCCs (say, do they
> get lower SPECint scores ?) or is it only with my code ?

No, the SPECint numbers are quite consistently higher than in any previous release. See http://www.suse.de/~aj/SPEC - in fact no previous release had such a huge gap in performance.

> Also, is there anything I could do in my code to enhance performance
> with newer gcc versions ? [...] Is there any way I can get back
> control over that, so that gcc honours the inline keyword ?

There is a parameter to increase the inline threshold, as well as an always_inline function attribute. See the documentation.

Honza
* Re: GCC performance regression - up to 20% ?
From: Michel LESPINASSE @ 2002-04-21 23:46 UTC
To: Jan Hubicka, Andreas Jaeger; +Cc: gcc list

Hi,

I spent the afternoon looking at this issue and trying different flags. I think I got some interesting results, though I still don't get the full picture.

The first thing I tweaked was to make the inlining work the way I wanted. As I understand it, in gcc 2.95 and 3.0, when compiling with the usual options (-O3), the behaviour was that functions declared inline were always inlined, and functions not declared inline were inlined only if gcc thought they were simple enough. In the gcc 3.1 snapshot, it looks like the inline keyword only makes the function an inline candidate. I will argue later on why I think the old behaviour (inline being always honoured) is more useful (basically, function specialization is harder to do if inline is not honoured). In the meantime, I worked around the issue by using an absurdly high value, -finline-limit=6000 (I tried 2000 first, which was not sufficient), and then I also added -fno-inline-functions so that I don't get everything inlined when I don't ask for it.

With the custom inlining, the gcc-3.1 snapshot is about 4% slower than gcc-2.95 on my athlon tbird. This is progress, since without the custom inlining the slowdown was 10%.

Then I tried to figure out where the slowdown is, using gprof. And this is where things get really interesting: gprof tells me that the code compiled with 3.1 is faster, but 'time' tells me that the user time spent executing that code is higher with 3.1 than with 2.95. I'm not sure what to make of this, but I think this might give you some clues, so I'll describe it in more detail. I'm not sure what the overhead is, but it seems to be right in gprof's blind spot.

I have to describe my gprof methodology first. 'Normal' gprof (i.e. compiling every file with -pg) seems to have a high overhead for me, plus it conflicts with -fomit-frame-pointer which I usually use. So I tend to use what I'll call 'light' gprof, which is as follows: everything is compiled with -g -O3 -fomit-frame-pointer -mcpu=pentiumpro, except main.c which is compiled with -g -O3 -p and is not cpu intensive. 'Light' gprof can not help me figure out call graphs, but it should normally be good enough to obtain a flat profile.

When using gcc 2.95, 'light' gprof works great. In a flat profile, the last number in the 'cumulative seconds' column always matches (within 0.1s) the user time as reported by the 'time' command. I think I can trust the flat profile information.

When using gcc 3.0 or the 3.1 snapshot though, there is a several-seconds gap between 'cumulative seconds' and 'user time'. I don't understand what happens during this time - could it be that 3.x has a higher overhead for function calls, and that this overhead is not accounted for in gprof ? I don't understand what happens here, but I get the feeling that this might be key to the slowdown I observe.

To summarize the gprof thing, I observe
  3.1 gprof time < 2.95 gprof time = 2.95 user time < 3.1 user time

If I use 'normal' gprof (every file gets -pg and I remove the -fomit-frame-pointer), that inflates the user times a lot, but I still get 3.1 gprof time < 2.95 gprof time and 2.95 user time < 3.1 user time.

Also, to answer Jan's question, I did try to use -fbranch-probabilities and it helped: with this option the 3.1 snapshot is about 0.5% slower (in user time) than 2.95, instead of 4% slower without. It's still frustrating though, because gprof pretends 3.1 is about 10% faster than 2.95, so there may be a lot of untapped performance.

OK, so this is all I can say for now. I hope someone knows gprof internals better than I do and understands why gprof returns times that are smaller in the 3.1 snapshot vs. 2.95, while time does not agree.

I thought I should add a few comments about the structure of the libmpeg2 code, as I suppose it is somewhat unusual. The time-consuming loop is pretty big, and includes several function calls. Each loop execution decodes a full mpeg2 macroblock (= 256 pixels); during this decoding it calls the IDCT (cosine transform) function up to 6 times, and some MC (motion compensation) functions up to 12 times. All these calls are indirect, using function pointers. These called functions are all kinda short, executing in about 0.4 microseconds per call on average. Yes, that's a lot of function calls, and I suspect SPECint does not do as many, which could maybe explain why it's not seeing the same performance regression that I see ? It would seem consistent with the gprof blind spot thing, too.

Finally, I thought I should include a small explanation about what I do with inlines, and why I'd like to have a gcc option so that the inline keyword always inlines a function, instead of just making that function an inline candidate. One example would be the MC functions mentioned above. To get acceptable speed, these are written using mmx assembly operations. Actually they even have several implementations: one in straight C for compatibility, one in straight mmx, one using sse integer instructions (as present on PIII and athlon), and one using 3dnow instructions. It's easy to select at init time which implementation to use, since the MC functions are only called through function pointers. The only difference between the sse version and the 3dnow version is that the sse version uses the pavgb instruction, while the 3dnow version uses pavgusb. These instructions have different opcodes but the same behaviour, so I wrote this using an inline function:

    static inline void MC_generic (...., int cpu)
    {
        .... do stuff ....
        if (cpu == CPU_3DNOW)
            pavgusb (....);
        else
            pavgb (....);
        .... do more stuff ....
    }

    void MC_3dnow (.....)
    {
        MC_generic (....., CPU_3DNOW);
    }

    void MC_sse (.....)
    {
        MC_generic (....., CPU_SSE);
    }

In gcc-2.95 and gcc-3.0, this construct works out nicely since MC_generic gets inlined, and then the compiler figures out that the cpu test in it is a constant and generates clean code for both the 3dnow and the sse functions. This kind of specialisation is sometimes very convenient, but it requires that the inline keyword do an unconditional inlining, not subject to compiler heuristics. I would really love to see a gcc option to make inlining unconditional when using the inline keyword, and have gcc use its heuristics when there is no such keyword.

This email is longer than I thought it would be - thanks a lot to those who're still reading me :)

Cheers,

--
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.
* Re: GCC performance regression - up to 20% ?
From: Andreas Jaeger @ 2002-04-22 0:17 UTC
To: Michel LESPINASSE; +Cc: Jan Hubicka, gcc list

Michel LESPINASSE <walken@zoy.org> writes:

> Then I tried to figure out where the slowdown is, using gprof. And
> this is where things get really interesting: gprof tells me that the
> code compiled with 3.1 is faster, but 'time' tells me that the user
> time spent executing that code is higher with 3.1 than with 2.95. I'm
> not sure what to make of this, but I think this might give you some
> clues, so I'll describe it in more detail. I'm not sure what the
> overhead is, but it seems to be right in gprof's blind spot.

Using -pg changes the binary, and the results might not be the same as without the flag. You could try to use the performance counters of your Athlon using e.g. oprofile (http://oprofile.sourceforge.net).

> I thought I should add a few comments about the structure of the
> libmpeg2 code, as I suppose it is somewhat unusual. The time-consuming
> loop is pretty big, and includes several function calls. [...] All
> these calls are indirect, using function pointers. These called
> functions are all kinda short, executing in about 0.4 microseconds per
> call on average. Yes, that's a lot of function calls, and I suspect
> SPECint does not do as many, which could maybe explain why it's not
> seeing the same performance regression that I see ?

It would really help a lot if you could try to write some small program that behaves the same way (performance wise) as this part of libmpeg2.

> Finally, I thought I should include a small explanation about what I
> do with inlines, and why I'd like to have a gcc option so that the
> inline keyword always inlines a function, instead of just making that
> function an inline candidate. [...] This kind of specialisation is
> sometimes very convenient, but it requires that the inline keyword do
> an unconditional inlining, not subject to compiler heuristics.

You might want to use the always_inline function attribute for your inline functions in 3.1. It makes the inlining unconditional.

> This email is longer than I thought it would be - thanks a lot to
> those who're still reading me :)

;-)

Andreas

--
Andreas Jaeger
SuSE Labs aj@suse.de
private aj@arthur.inka.de
http://www.suse.de/~aj
* Re: GCC performance regression - up to 20% ?
From: Michel LESPINASSE @ 2002-04-22 17:42 UTC
To: Andreas Jaeger; +Cc: Jan Hubicka, gcc list

On Mon, Apr 22, 2002 at 09:10:55AM +0200, Andreas Jaeger wrote:
> You might want to use the always_inline function attribute for your
> inline functions in 3.1. It makes the inlining unconditional.

Thanks, that suggestion does solve my inlining problem. I think the documentation for this attribute is a bit confusing: it says "For functions declared inline, this attribute inlines the function even if no optimization level was specified," which is true I guess, but the documentation does not mention that even when optimization is enabled, gcc will not always inline functions declared inline unless we use this attribute.

Also, I want to point out that this is a new behaviour, as gcc up to 3.0.4 apparently always inlined functions declared inline (when compiling with -O2 at least).

I think either the documentation of always_inline should be made clearer, or the inline keyword should be made to behave like always_inline when optimizing with -O2 and above (as it did previously).

--
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.
* Re: GCC performance regression - up to 20% ?
From: Andrew Pinski @ 2002-04-22 18:20 UTC
To: Michel LESPINASSE; +Cc: Andreas Jaeger, Jan Hubicka, gcc list

On Monday, April 22, 2002, at 08:39, Michel LESPINASSE wrote:

> Thanks, that suggestion does solve my inlining problem. I think the
> documentation for this attribute is a bit confusing: it says "For
> functions declared inline, this attribute inlines the function even if
> no optimization level was specified," which is true I guess, but the
> documentation does not mention that even when optimization is enabled,
> gcc will not always inline functions declared inline unless we use
> this attribute.

I think the problem is that this part of the documentation was written before the part about always_inline.

> Also, I want to point out that this is a new behaviour, as gcc up to
> 3.0.4 apparently always inlined functions declared inline (when
> compiling with -O2 at least).

Not really new behaviour - it's just that the inline limit was lowered in 3.0 and above.

> I think either the documentation of always_inline should be made
> clearer, or the inline keyword should be made to behave like
> always_inline when optimizing with -O2 and above (as it did
> previously).

The inline keyword was always just a hint to the compiler in ISO C++ and ISO C. And it was -O3 that included the inline optimization.

Thanks,
Andrew Pinski
* Re: GCC performance regression - up to 20% ?
From: Carlo Wood @ 2002-04-22 18:30 UTC
To: Andrew Pinski; +Cc: gcc

On Mon, Apr 22, 2002 at 08:49:20PM -0400, Andrew Pinski wrote:
> And it was -O3 that included the inline optimization.

Huh?? I am relying on the fact (I hope) that functions marked with 'inline', or member functions that are defined inside their class declaration, are inlined already with -O. I'm talking about really small functions here (accessors etc).

--
Carlo Wood <carlo@alinoe.com>
* Re: GCC performance regression - up to 20% ?
From: Andrew Pinski @ 2002-04-22 19:25 UTC
To: Carlo Wood; +Cc: gcc

On Monday, April 22, 2002, at 09:20, Carlo Wood wrote:
> On Mon, Apr 22, 2002 at 08:49:20PM -0400, Andrew Pinski wrote:
>> And it was -O3 that included the inline optimization.
>
> Huh??
> I am relying on the fact (I hope) that functions marked with 'inline',
> or member functions that are defined inside their class declaration,
> are inlined already with -O.
> I'm talking about really small functions here (accessors etc).

Read the manual, under the -O3 option:

  Optimize yet more. `-O3' turns on all optimizations specified by
  `-O2' and also turns on the `-finline-functions' and
  `-frename-registers' options.

These options control various sorts of optimizations:

  -O
  -O1
    Optimize. Optimizing compilation takes somewhat more time, and a
    lot more memory for a large function.

    Without `-O', the compiler's goal is to reduce the cost of
    compilation and to make debugging produce the expected results.
    Statements are independent: if you stop the program with a
    breakpoint between statements, you can then assign a new value to
    any variable or change the program counter to any other statement
    in the function and get exactly the results you would expect from
    the source code. Without `-O', the compiler only allocates
    variables declared register in registers. The resulting compiled
    code is a little worse than that produced by PCC without `-O'.

    With `-O', the compiler tries to reduce code size and execution
    time. When you specify `-O', the compiler turns on `-fthread-jumps'
    and `-fdefer-pop' on all machines. The compiler turns on
    `-fdelayed-branch' on machines that have delay slots, and
    `-fomit-frame-pointer' on machines that can support debugging even
    without a frame pointer. On some machines the compiler also turns
    on other flags.

  -O2
    Optimize even more. GCC performs nearly all supported optimizations
    that do not involve a space-speed tradeoff. The compiler does not
    perform loop unrolling or function inlining when you specify
    `-O2'. As compared to `-O', this option increases both compilation
    time and the performance of the generated code.

    `-O2' turns on all optional optimizations except for loop
    unrolling, function inlining, and register renaming. It also turns
    on the `-fforce-mem' option on all machines and frame pointer
    elimination on machines where doing so does not interfere with
    debugging. Please note the warning under `-fgcse' about invoking
    `-O2' on programs that use computed gotos.

  -O3
    Optimize yet more. `-O3' turns on all optimizations specified by
    `-O2' and also turns on the `-finline-functions' and
    `-frename-registers' options.

  -O0
    Do not optimize.

  -Os
    Optimize for size. `-Os' enables all `-O2' optimizations that do
    not typically increase code size. It also performs further
    optimizations designed to reduce code size.

  If you use multiple `-O' options, with or without level numbers, the
  last such option is the one that is effective.
* Re: GCC performance regression - up to 20% ?
From: Allan Sandfeld Jensen @ 2002-04-24 15:24 UTC
To: Andrew Pinski; +Cc: gcc

On Tuesday 23 April 2002 04:05, Andrew Pinski wrote:
> On Monday, April 22, 2002, at 09:20, Carlo Wood wrote:
> > On Mon, Apr 22, 2002 at 08:49:20PM -0400, Andrew Pinski wrote:
> >> And it was -O3 that included the inline optimization.
> >
> > Huh??
> > I am relying on the fact (I hope) that functions marked with
> > 'inline', or member functions that are defined inside their class
> > declaration, are inlined already with -O.
> > I'm talking about really small functions here (accessors etc).
>
> Read the manual, under the -O3 option:
>
> Optimize yet more. `-O3' turns on all optimizations specified by `-O2'
> and also turns on the `-finline-functions' and `-frename-registers'
> options.

That inlining is the "inline optimization", i.e. inlining functions NOT declared inline. Functions declared inline are inlined already at -O.
* Re: GCC performance regression - up to 20% ?
From: Carlo Wood @ 2002-04-22 7:11 UTC
To: Michel LESPINASSE; +Cc: Jan Hubicka, Andreas Jaeger, gcc list

> Then I tried to figure out where the slowdown is, using gprof. And
> this is where things get really interesting: gprof tells me that the
> code compiled with 3.1 is faster, but 'time' tells me that the user
> time spent executing that code is higher with 3.1 than with 2.95. I'm
> not sure what to make of this, but I think this might give you some
> clues, so I'll describe it in more detail. I'm not sure what the
> overhead is, but it seems to be right in gprof's blind spot.

gprof "measures" the time that a function takes by probing which function the program is executing about every 20 ms, and from that it builds up a statistical histogram.

I wish there were a more precise profiler that uses the hardware counters. Does anyone know of one? Hmm, I remember a Subject: line on the PAPI mailing list that mentioned gprof, but I deleted it. I think it asked the same question: whether or not there exists a 'gprof' that uses PAPI.

--
Carlo Wood <carlo@alinoe.com>
* Re: GCC performance regression - up to 20% ? 2002-04-22 7:11 ` Carlo Wood @ 2002-04-22 7:11 ` Falk Hueffner 2002-04-22 7:34 ` law 2002-04-22 8:23 ` Johannes Stezenbach 2 siblings, 0 replies; 32+ messages in thread From: Falk Hueffner @ 2002-04-22 7:11 UTC (permalink / raw) To: gcc list Carlo Wood <carlo@alinoe.com> writes: > I wish there would be a more precise profiler that uses the hardware > counters. Does anyone know of one? Hmm, I remember a Subject: line > on the PAPI mailinglist that mentioned gprof, but I deleted it. I > think it asked the same question: whether or not there existed a > 'gprof' that used PAPI. http://oprofile.sourceforge.net/ Only runs on i386 Linux and needs a kernel module, though. Falk ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 7:11 ` Carlo Wood 2002-04-22 7:11 ` Falk Hueffner @ 2002-04-22 7:34 ` law 2002-04-22 8:23 ` Johannes Stezenbach 2 siblings, 0 replies; 32+ messages in thread From: law @ 2002-04-22 7:34 UTC (permalink / raw) To: Carlo Wood; +Cc: Michel LESPINASSE, Jan Hubicka, Andreas Jaeger, gcc list In message <20020422155801.A21747@alinoe.com>, Carlo Wood writes: > > > > Then I tried to figure out where the slowdown is, using gprof. And > > this is where things get really interesting: gprof tells me that the > > code compiled with 3.1 is faster, but 'time' tells me that the user > > time spent executing that code is higher with 3.1 than with 2.95. I'm > > not sure what to make of this, but I think this might give you some > > clues, so I'll describe it in more detail. I'm not sure what the > > overhead is, but it seems to be right in gprof's blind spot. > > gprof "measures" the time that a function takes by probing > which function the program is executing about every 20 ms. > >From that it builds up a statistical histogram. > > I wish there would be a more precise profiler that uses the > hardware counters. Does anyone know of one? Hmm, I remember > a Subject: line on the PAPI mailinglist that mentioned gprof, > but I deleted it. I think it asked the same question: whether > or not there existed a 'gprof' that used PAPI. You might look at oprofile. jeff ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-22 7:11 ` Carlo Wood 2002-04-22 7:11 ` Falk Hueffner 2002-04-22 7:34 ` law @ 2002-04-22 8:23 ` Johannes Stezenbach 2 siblings, 0 replies; 32+ messages in thread From: Johannes Stezenbach @ 2002-04-22 8:23 UTC (permalink / raw) To: Carlo Wood; +Cc: Michel LESPINASSE, Jan Hubicka, Andreas Jaeger, gcc list On Mon, Apr 22, 2002 at 03:58:01PM +0200, Carlo Wood wrote: > gprof "measures" the time that a function takes by probing > which function the program is executing about every 20 ms. > From that it builds up a statistical histogram. > > I wish there would be a more precise profiler that uses the > hardware counters. Does anyone know of one? Hmm, I remember > a Subject: line on the PAPI mailinglist that mentioned gprof, > but I deleted it. I think it asked the same question: whether > or not there existed a 'gprof' that used PAPI. I know of two profilers based on gcc's -finstrument-functions feature: * FunctionCheck: http://www710.univ-lyon1.fr/~yperret/fnccheck/profiler.html http://sources.redhat.com/ml/binutils/2001-01/msg00142.html I've tried this one once, looked promising but seems to be unmaintained now. Alpha quality. * Eazel's profiler: http://www.mozilla.org/performance/eazel.html http://cvs.gnome.org/bonsai/rview.cgi?cvsroot=/cvs/gnome&dir=eazel-tools/profiler Alpha quality, now unmaintained since Eazel is out of business. Haven't tried it. Regards, Johannes ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - up to 20% ? 2002-04-20 18:13 GCC performance regression - up to 20% ? Michel LESPINASSE 2002-04-21 3:41 ` Andreas Jaeger 2002-04-21 5:46 ` Jan Hubicka @ 2002-04-22 1:47 ` Gerald Pfeifer 2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE 3 siblings, 0 replies; 32+ messages in thread From: Gerald Pfeifer @ 2002-04-22 1:47 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list On Sat, 20 Apr 2002, Michel LESPINASSE wrote: > One thing I noticed is that 3.1 snapshot produces less inlining than 3.0 > or 2.95. There is a difference between 3.0/3.0.1/3.0.2 on the one hand, and 3.0.3/3.0.4 on the other hand. GCC 3.0-3.0.2 were so bad with respect to compile-time and memory consumption that the inlining strategy was changed for 3.0.3 (and mainline, which is now becoming 3.1). The new strategy should be more or less equivalent in most cases, but if you are seeing significant differences, this issue needs to be revisited. Gerald -- Gerald "Jerry" pfeifer@dbai.tuwien.ac.at http://www.dbai.tuwien.ac.at/~pfeifer/ ^ permalink raw reply [flat|nested] 32+ messages in thread
* GCC performance regression - its memset ! 2002-04-20 18:13 GCC performance regression - up to 20% ? Michel LESPINASSE ` (2 preceding siblings ...) 2002-04-22 1:47 ` Gerald Pfeifer @ 2002-04-22 14:33 ` Michel LESPINASSE 2002-04-22 14:58 ` Jason R Thorpe 2002-04-22 17:10 ` Richard Henderson 3 siblings, 2 replies; 32+ messages in thread From: Michel LESPINASSE @ 2002-04-22 14:33 UTC (permalink / raw) To: gcc list

OK, so I worked more to find the cause of the slowdown, and I figured out it's all because of memset(). This function seems to be about twice as slow as in 2.95, and also for some reason the time spent in memset does not show up in gprof. Here is a test case:

--------------------------- foo.c ------------------------------
#include <string.h>

short table[64];

void bar (void);

int main (void)
{
    int i;

    bar ();
    for (i = 0; i < 100000000; i++)
	memset (table + 1, 0, 63 * sizeof(short));
    return 0;
}
----------------------------- end of foo.c ------------------------

----------------------------- bar.c -------------------------------
void bar (void)
{
}
----------------------------- end of bar.c ------------------------

# gcc-2.95 -g -O3 -p foo.c bar.c
# time ./a.out
./a.out  5.75s user 0.00s system 100% cpu 5.739 total
# gprof -bp ./a.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
100.00      5.74     5.74                             main
  0.00      5.74     0.00        1     0.00     0.00  bar

# gcc-3.1 -g -O3 -p foo.c bar.c
# time ./a.out
./a.out  10.78s user 0.00s system 101% cpu 10.634 total
# gprof -bp ./a.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
100.00      0.62     0.62                             main
  0.00      0.62     0.00        1     0.00     0.00  bar

The gcc-3.1 snapshot is about twice as slow as 2.95 on that test case, and for some reason the gprof output is bogus (it does not account for the time spent in memset), while it was not with 2.95.
I did not know my code spent that much time in memset, I'll see what I can do about it. Hope this helps, -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE @ 2002-04-22 14:58 ` Jason R Thorpe 2002-04-22 15:27 ` Michel LESPINASSE 2002-04-22 16:59 ` Segher Boessenkool 2002-04-22 17:10 ` Richard Henderson 1 sibling, 2 replies; 32+ messages in thread From: Jason R Thorpe @ 2002-04-22 14:58 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list On Mon, Apr 22, 2002 at 02:32:22PM -0700, Michel LESPINASSE wrote: > gcc-3.1 snapshot is about twice slower than 2.95 on that test case, > and for some reason the gprof output is bogus (it does not account for > the time spent in memset), while it was not with 2.95. gprof doesn't see it because gcc is doing the memset inline, presumably; it does this in certain cases where it knows the size at compile time. Try running your test with -fno-builtin. (I'm not suggesting this as a fix for your performance issue, just as an explanation of why memset() is invisible to gprof in your testcase). -- -- Jason R. Thorpe <thorpej@wasabisystems.com> ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 14:58 ` Jason R Thorpe @ 2002-04-22 15:27 ` Michel LESPINASSE 2002-04-22 16:59 ` Segher Boessenkool 1 sibling, 0 replies; 32+ messages in thread From: Michel LESPINASSE @ 2002-04-22 15:27 UTC (permalink / raw) To: Jason R Thorpe, gcc list On Mon, Apr 22, 2002 at 02:41:25PM -0700, Jason R Thorpe wrote: > gprof doesn't see it because gcc is doing the memset inline, presumably; it > does this in certain cases where it knows the size at compile time. I'm not surprised that there is no line for memset in the flat profile. But, I'm surprised that the cumulative time in the flat profile is not equal to the user time in the time command. 'normal' inlines (i.e. the ones I could write by declaring an inline function) are accounted normally by gprof (i.e. they do show up in the cumulative time). -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 14:58 ` Jason R Thorpe 2002-04-22 15:27 ` Michel LESPINASSE @ 2002-04-22 16:59 ` Segher Boessenkool 1 sibling, 0 replies; 32+ messages in thread From: Segher Boessenkool @ 2002-04-22 16:59 UTC (permalink / raw) To: thorpej; +Cc: Michel LESPINASSE, gcc list Jason R Thorpe wrote: > > On Mon, Apr 22, 2002 at 02:32:22PM -0700, Michel LESPINASSE wrote: > > > gcc-3.1 snapshot is about twice slower than 2.95 on that test case, > > and for some reason the gprof output is bogus (it does not account for > > the time spent in memset), while it was not with 2.95. > > gprof doesn't see it because gcc is doing the memset inline, presumably; it > does this in certain cases where it knows the size at compile time. > > Try running your test with -fno-builtin. (I'm not suggesting this as a > fix for your performance issue, just as an explanation of why memset() is > invisible to gprof in your testcase). Erm, no. In the 2.95 case, GCC _did_ inline the memset(); that's why it showed up as 5.74 seconds in main(). In the 3.1 case, it was not inlined; gprof doesn't show you the time spent in libc, as libc is not compiled with profiling enabled. Inlined functions are never profiled separately. Linking statically will make gprof show you the time spent in shared library functions (but not the call graph to those); or you can link against a libc that was compiled with profiling enabled, so you get a call graph as well. Segher ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE 2002-04-22 14:58 ` Jason R Thorpe @ 2002-04-22 17:10 ` Richard Henderson 2002-04-22 17:13 ` Michel LESPINASSE 1 sibling, 1 reply; 32+ messages in thread From: Richard Henderson @ 2002-04-22 17:10 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list On Mon, Apr 22, 2002 at 02:32:22PM -0700, Michel LESPINASSE wrote: > OK, so I worked more to find the cause of the slowdown, and I figured > out its all because of memset(). See whether or not disabling glibc's inline expansion of memset affects 2.95 vs 3.x with -D__NO_STRING_INLINES. r~ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:10 ` Richard Henderson @ 2002-04-22 17:13 ` Michel LESPINASSE 2002-04-22 17:39 ` Richard Henderson 2002-04-23 2:39 ` Jan Hubicka 0 siblings, 2 replies; 32+ messages in thread From: Michel LESPINASSE @ 2002-04-22 17:13 UTC (permalink / raw) To: Richard Henderson, gcc list

On Mon, Apr 22, 2002 at 04:59:53PM -0700, Richard Henderson wrote:
> See whether or not disabling glibc's inline expansion of
> memset affects 2.95 vs 3.x with -D__NO_STRING_INLINES.

I can get gcc 2.95 to be as slow as the 3.1 snapshot by using both -fno-builtin and -D__NO_STRING_INLINES, but I can't get gcc 3.1 to do the inlining for me.

If I don't use -fno-builtin -D__NO_STRING_INLINES, and just compile with -O3, gcc 2.95 generates this:

	movl $table+2,%edi
	cld
	movl $31,%ecx
	rep
	stosl
	stosw

while the 3.1 snapshot generates this:

	movl $table+2, (%esp)
	movl $0, 4(%esp)
	movl $126, 8(%esp)
	call memset

Hope this helps,

-- 
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:13 ` Michel LESPINASSE @ 2002-04-22 17:39 ` Richard Henderson 2002-04-22 17:49 ` Michel LESPINASSE 2002-04-23 2:39 ` Jan Hubicka 1 sibling, 1 reply; 32+ messages in thread From: Richard Henderson @ 2002-04-22 17:39 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: gcc list On Mon, Apr 22, 2002 at 05:10:45PM -0700, Michel LESPINASSE wrote: > I can get gcc 2.95 to be as slow as 3.1 snapshot by using both > -fno-builtin and -D__NO_STRING_INLINES I wasn't interested in -fno-builtin, but only differences seen when both 2.95 and 3.1 are using -D__NO_STRING_INLINES. r~ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:39 ` Richard Henderson @ 2002-04-22 17:49 ` Michel LESPINASSE 2002-04-23 5:03 ` Falk Hueffner 0 siblings, 1 reply; 32+ messages in thread From: Michel LESPINASSE @ 2002-04-22 17:49 UTC (permalink / raw) To: Richard Henderson, gcc list On Mon, Apr 22, 2002 at 05:13:35PM -0700, Richard Henderson wrote: > I wasn't interested in -fno-builtin, but only differences seen > when both 2.95 and 3.1 are using -D__NO_STRING_INLINES. If I use just -O3 -D__NO_STRING_INLINES, I get the same behaviour as with just -O3 (i.e. 2.95 inlines the memset and 3.1 doesn't) -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:49 ` Michel LESPINASSE @ 2002-04-23 5:03 ` Falk Hueffner 2002-04-23 6:53 ` Andreas Schwab 0 siblings, 1 reply; 32+ messages in thread From: Falk Hueffner @ 2002-04-23 5:03 UTC (permalink / raw) To: gcc list

Hi,

while we're at it, when you have a prototype for memset, gcc forgets about the alignment:

void f1(unsigned long *p) { memset(p, 0, 16); }
#include <string.h>
void f2(unsigned long *p) { memset(p, 0, 16); }

-> (Alpha)

0000000000000000 <f1>:
   0:	08 00 f0 b7 	stq	zero,8(a0)
   4:	1f 04 ff 5f 	fnop
   8:	00 00 f0 b7 	stq	zero,0(a0)
   c:	01 80 fa 6b 	ret
0000000000000020 <f2>:
  20:	0f 00 30 2c 	ldq_u	t0,15(a0)
  24:	00 00 50 2c 	ldq_u	t1,0(a0)
  28:	41 0e 30 48 	mskqh	t0,a0,t0
  2c:	0f 00 30 3c 	stq_u	t0,15(a0)
  30:	42 06 50 48 	mskql	t1,a0,t1
  34:	08 00 f0 3f 	stq_u	zero,8(a0)
  38:	00 00 50 3c 	stq_u	t1,0(a0)
  3c:	01 80 fa 6b 	ret

Is there anything that can be done about that?

Falk

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-23 5:03 ` Falk Hueffner @ 2002-04-23 6:53 ` Andreas Schwab 0 siblings, 0 replies; 32+ messages in thread From: Andreas Schwab @ 2002-04-23 6:53 UTC (permalink / raw) To: Falk Hueffner; +Cc: gcc list Falk Hueffner <falk.hueffner@student.uni-tuebingen.de> writes: |> Hi, |> |> while we're at it, when you have a prototype for memset, gcc forgets |> about the alignment: |> |> void f1(unsigned long *p) { memset(p, 0, 16); } |> #include <string.h> |> void f2(unsigned long *p) { memset(p, 0, 16); } |> |> -> (Alpha) |> |> 0000000000000000 <f1>: |> 0: 08 00 f0 b7 stq zero,8(a0) |> 4: 1f 04 ff 5f fnop |> 8: 00 00 f0 b7 stq zero,0(a0) |> c: 01 80 fa 6b ret |> 0000000000000020 <f2>: |> 20: 0f 00 30 2c ldq_u t0,15(a0) |> 24: 00 00 50 2c ldq_u t1,0(a0) |> 28: 41 0e 30 48 mskqh t0,a0,t0 |> 2c: 0f 00 30 3c stq_u t0,15(a0) |> 30: 42 06 50 48 mskql t1,a0,t1 |> 34: 08 00 f0 3f stq_u zero,8(a0) |> 38: 00 00 50 3c stq_u t1,0(a0) |> 3c: 01 80 fa 6b ret |> |> Is there anything that can be done about that? Please try -D__NO_STRING_INLINES. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE GmbH, Deutschherrnstr. 15-19, D-90429 Nürnberg Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-22 17:13 ` Michel LESPINASSE 2002-04-22 17:39 ` Richard Henderson @ 2002-04-23 2:39 ` Jan Hubicka 2002-04-23 13:36 ` Michel LESPINASSE 1 sibling, 1 reply; 32+ messages in thread From: Jan Hubicka @ 2002-04-23 2:39 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Richard Henderson, gcc list > On Mon, Apr 22, 2002 at 04:59:53PM -0700, Richard Henderson wrote: > > See whether or not disabling glibc's inline expansion of > > memset affects 2.95 vs 3.x with -D__NO_STRING_INLINES. > > I can get gcc 2.95 to be as slow as 3.1 snapshot by using both > -fno-builtin and -D__NO_STRING_INLINES > > but, I cant get gcc 3.1 to do the inlining for me > > If I dont use -fno-builtin -D__NO_STRING_INLINES, and just compile with -O3, > gcc 2.95 generates this: > > movl $table+2,%edi > cld > movl $31,%ecx > rep > stosl > stosw > > while 3.1 snapshot generates this: > > movl $table+2, (%esp) > movl $0, 4(%esp) > movl $126, 8(%esp) > call memset I guess the inlining threshold is too low or the default memset implementation too lame. I was tuning it for Athlon, so the mileage may vary from CPU to CPU. I will investigate the miscompilation first and check this second. Honza > > Hope this helps, > > -- > Michel "Walken" LESPINASSE > Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-23 2:39 ` Jan Hubicka @ 2002-04-23 13:36 ` Michel LESPINASSE 2002-04-24 0:30 ` Jan Hubicka 2002-04-24 3:32 ` Jan Hubicka 0 siblings, 2 replies; 32+ messages in thread From: Michel LESPINASSE @ 2002-04-23 13:36 UTC (permalink / raw) To: Jan Hubicka; +Cc: Richard Henderson, gcc list On Tue, Apr 23, 2002 at 11:25:40AM +0200, Jan Hubicka wrote: > I guess the inlining threshold is too low or the default memset > implementation too lame. I was tunning it for Athlon, so the > mileage may warry from CPU to CPU. I will investigate the > misscompilation first and check this second. > Concerning the inlining, gcc inlines all memcpys with size smaller > than 64 bytes. Perhaps this should be extended to 128 bytes in case > we are still about 2 times as bad. This is partly due to lame > implementation of memset in glibc too :( When gcc does the inlining, performance seems to not be so bad. There is probably still some untapped performance though, as some of the initial and final alignment checks could be omitted when gcc already knows about the alignment of the memory zone (like in my test case, it was an array of shorts in the data segment, so it was known to be on a two-byte boundary at least). But it might be hard to code into gcc, I don't know. Also, as I've been only giving bad news up to now, I wanted to say that now that I've worked around the two issues I had with inlining and with memset, the 3.1 snapshot does provide superior performance on my libmpeg2 codebase, about 5% faster than 2.95.4, and that gets up to 8% when using -fbranch-probabilities and 9% when using -mcpu=athlon-tbird instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am still worried, though, that other people will have the same trouble with inlining as I did and not see all of the performance improvements as a result. Cheers, -- Michel "Walken" LESPINASSE Is this the best that god can do ? Then I'm not impressed. 
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-23 13:36 ` Michel LESPINASSE @ 2002-04-24 0:30 ` Jan Hubicka 2002-04-24 0:50 ` Jakub Jelinek 0 siblings, 1 reply; 32+ messages in thread From: Jan Hubicka @ 2002-04-24 0:30 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Jan Hubicka, Richard Henderson, gcc list > On Tue, Apr 23, 2002 at 11:25:40AM +0200, Jan Hubicka wrote: > > I guess the inlining threshold is too low or the default memset > > implementation too lame. I was tunning it for Athlon, so the > > mileage may warry from CPU to CPU. I will investigate the > > misscompilation first and check this second. > > > Concerning the inlining, gcc inlines all memcpys with size smaller > > than 64 bytes. Perhaps this should be extended to 128 bytes in case > > we are still about 2 times as bad. This is partly due to lame > > implementation of memset in glibc too :( > > When gcc does the inlining, performance seems to not be so bad. There > is probably still some untapped performance though, as some of the > initial and final alignment checks could be ommited when gcc already > knows about the alignment of the memory zone (like in my test case, it When it knows, it should avoid them. Definitely, on an array of shorts the alignment to an even byte is not done. It is difficult to make it assume that an array of shorts is 4-byte aligned, as the ABI does not specify this, so it may not be. GCC has new alignment tracking code, so it should be better than any previous version, but still not that good. (For instance, when the array is static it definitely has a chance to conclude so, but it does not; however, the majority of string functions are called on computed addresses.) > was an array of shorts in the data segment, so it was known to be on a > two-byte boundary at least). But might be hard to code into gcc, I > dont know. 
> > Also as I've been only giving bad news up to now, I wanted to say that > now that I've worked around the two issues I had with inlining and > with memset, the 3.1 snapshot does provide superior performance on my > libmpeg2 codebase, about 5% faster than 2.95.4, and that gets up to 8% > when using -fbranch-probabilities and 9% when using -mcpu=athlon-tbird That sounds good :) > instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am > still worried though, that other people will have the same trouble > with inlining as I did and not see all of the performance improvements > as a result. I will send a patch to increase the constant to 128. I was re-benchmarking the code, and on P4/Athlon with my assembly memset, 64 is just on the border (i.e. the inlined and non-inlined solutions differ by less than 10%), so setting it to 128 does not make us lose anything. For the glibc implementation, 128 is still a win to inline :( Honza > > Cheers, > > -- > Michel "Walken" LESPINASSE > Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-24 0:30 ` Jan Hubicka @ 2002-04-24 0:50 ` Jakub Jelinek 2002-04-24 1:00 ` Jan Hubicka 0 siblings, 1 reply; 32+ messages in thread From: Jakub Jelinek @ 2002-04-24 0:50 UTC (permalink / raw) To: Jan Hubicka; +Cc: Michel LESPINASSE, Richard Henderson, gcc list On Wed, Apr 24, 2002 at 09:17:48AM +0200, Jan Hubicka wrote: > > instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am > > still worried though, that other people will have the same trouble > > with inlining as I did and not see all of the performance improvements > > as a result. > > I will send patch to increase the constant to 128. I was re-benchmarking > the code and on P4/Athlon and my assembly memset, the 64 is just on the border > (ie inlined/not inlined sollution have less than 10% difference), setting > it to 128 does not make us to loose something. For glibc implementation > 128 is still a win to be inlined :( So please contribute it to glibc then... Is yours an Athlon optimized memset or does it perform better on P3/P4 too? Jakub ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-24 0:50 ` Jakub Jelinek @ 2002-04-24 1:00 ` Jan Hubicka 0 siblings, 0 replies; 32+ messages in thread From: Jan Hubicka @ 2002-04-24 1:00 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Jan Hubicka, Michel LESPINASSE, Richard Henderson, gcc list > On Wed, Apr 24, 2002 at 09:17:48AM +0200, Jan Hubicka wrote: > > > instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am > > > still worried though, that other people will have the same trouble > > > with inlining as I did and not see all of the performance improvements > > > as a result. > > > > I will send patch to increase the constant to 128. I was re-benchmarking > > the code and on P4/Athlon and my assembly memset, the 64 is just on the border > > (ie inlined/not inlined sollution have less than 10% difference), setting > > it to 128 does not make us to loose something. For glibc implementation > > 128 is still a win to be inlined :( > > So please contribute it to glibc then... > Is yours an Athlon optimized memset or does it perform better on P3/P4 too? It is a fairly simple memset (for small cases all that matters is whether the memset has a fast path through), but I do have an Athlon-optimized one as well. The problem with glibc is that it needs machinery for Athlon-specific stuff, and AFAIK the glibc maintainers haven't decided how to do that. Honza > > Jakub ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: GCC performance regression - its memset ! 2002-04-23 13:36 ` Michel LESPINASSE 2002-04-24 0:30 ` Jan Hubicka @ 2002-04-24 3:32 ` Jan Hubicka 1 sibling, 0 replies; 32+ messages in thread From: Jan Hubicka @ 2002-04-24 3:32 UTC (permalink / raw) To: Michel LESPINASSE; +Cc: Jan Hubicka, Richard Henderson, gcc list > Also as I've been only giving bad news up to now, I wanted to say that > now that I've worked around the two issues I had with inlining and > with memset, the 3.1 snapshot does provide superior performance on my > libmpeg2 codebase, about 5% faster than 2.95.4, and that gets up to 8% > when using -fbranch-probabilities and 9% when using -mcpu=athlon-tbird > instead of the more generic -mcpu=pentiumpro. Nice work guys ! I am Just note that -mcpu=pentiumpro is not generic at all. It optimizes for pentiumpro (and in turn P2/P3), but nothing else. Perhaps we could have a switch to generate code that works well on all modern CPUs, but we don't have one. Honza > still worried though, that other people will have the same trouble > with inlining as I did and not see all of the performance improvements > as a result. > > Cheers, > > -- > Michel "Walken" LESPINASSE > Is this the best that god can do ? Then I'm not impressed. ^ permalink raw reply [flat|nested] 32+ messages in thread
[parent not found: <20020421005718.GA16378@zoy.org.suse.lists.egcs>]
[parent not found: <20020421113238.GC16602@atrey.karlin.mff.cuni.cz.suse.lists.egcs>]
* Re: GCC performance regression - up to 20% ? [not found] ` <20020421113238.GC16602@atrey.karlin.mff.cuni.cz.suse.lists.egcs> @ 2002-04-21 7:58 ` Andi Kleen 0 siblings, 0 replies; 32+ messages in thread From: Andi Kleen @ 2002-04-21 7:58 UTC (permalink / raw) To: Jan Hubicka; +Cc: aj, gcc Jan Hubicka <jh@suse.cz> writes: > > No, the SPECint numbers are quite consistently higher than in any previous > release. See http://www.suse.de/~aj/SPEC > In fact no previous release had such a huge gap in performance. http://www.suse.de/~aj/SPEC/CFP/sandbox-gcc-3_1-branch/index.html 179.art seems to have regressed badly a few days ago. Is that just a bad run or the price for some bug fix? -Andi ^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2002-04-24 21:28 UTC | newest] Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2002-04-20 18:13 GCC performance regression - up to 20% ? Michel LESPINASSE 2002-04-21 3:41 ` Andreas Jaeger 2002-04-21 5:46 ` Jan Hubicka 2002-04-21 23:46 ` Michel LESPINASSE 2002-04-22 0:17 ` Andreas Jaeger 2002-04-22 17:42 ` Michel LESPINASSE 2002-04-22 18:20 ` Andrew Pinski 2002-04-22 18:30 ` Carlo Wood 2002-04-22 19:25 ` Andrew Pinski 2002-04-24 15:24 ` Allan Sandfeld Jensen 2002-04-22 7:11 ` Carlo Wood 2002-04-22 7:11 ` Falk Hueffner 2002-04-22 7:34 ` law 2002-04-22 8:23 ` Johannes Stezenbach 2002-04-22 1:47 ` Gerald Pfeifer 2002-04-22 14:33 ` GCC performance regression - its memset ! Michel LESPINASSE 2002-04-22 14:58 ` Jason R Thorpe 2002-04-22 15:27 ` Michel LESPINASSE 2002-04-22 16:59 ` Segher Boessenkool 2002-04-22 17:10 ` Richard Henderson 2002-04-22 17:13 ` Michel LESPINASSE 2002-04-22 17:39 ` Richard Henderson 2002-04-22 17:49 ` Michel LESPINASSE 2002-04-23 5:03 ` Falk Hueffner 2002-04-23 6:53 ` Andreas Schwab 2002-04-23 2:39 ` Jan Hubicka 2002-04-23 13:36 ` Michel LESPINASSE 2002-04-24 0:30 ` Jan Hubicka 2002-04-24 0:50 ` Jakub Jelinek 2002-04-24 1:00 ` Jan Hubicka 2002-04-24 3:32 ` Jan Hubicka [not found] <20020421005718.GA16378@zoy.org.suse.lists.egcs> [not found] ` <20020421113238.GC16602@atrey.karlin.mff.cuni.cz.suse.lists.egcs> 2002-04-21 7:58 ` GCC performance regression - up to 20% ? Andi Kleen