From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (qmail 19000 invoked by alias); 22 Apr 2002 07:13:47 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 18993 invoked from network); 22 Apr 2002 07:13:46 -0000
Received: from unknown (HELO Cantor.suse.de) (213.95.15.193) by sources.redhat.com with SMTP; 22 Apr 2002 07:13:46 -0000
Received: from Hermes.suse.de (Charybdis.suse.de [213.95.15.201]) by Cantor.suse.de (Postfix) with ESMTP id 4A6EA1EAD4; Mon, 22 Apr 2002 09:10:56 +0200 (MEST)
X-Authentication-Warning: gee.suse.de: aj set sender to aj@suse.de using -f
To: Michel LESPINASSE
Cc: Jan Hubicka, gcc list
Subject: Re: GCC performance regression - up to 20% ?
References: <20020421005718.GA16378@zoy.org> <20020421113238.GC16602@atrey.karlin.mff.cuni.cz> <20020422061937.GA27171@zoy.org>
From: Andreas Jaeger
Date: Mon, 22 Apr 2002 00:17:00 -0000
In-Reply-To: <20020422061937.GA27171@zoy.org> (Michel LESPINASSE's message of "Sun, 21 Apr 2002 23:19:37 -0700")
User-Agent: Gnus/5.090006 (Oort Gnus v0.06) XEmacs/21.4 (Artificial Intelligence, i386-suse-linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-SW-Source: 2002-04/txt/msg01068.txt.bz2

Michel LESPINASSE writes:

> Hi,
>
> I spent the afternoon looking at this issue and trying different
> flags. I think I got some interesting results, though I still don't get
> the full picture.
>
> The first thing I tweaked was to make the inlining work the way I
> wanted. As I understand it, in gcc 2.95 and 3.0, when compiling with
> usual options (-O3), the behaviour was that functions declared inline
> were always inlined, and functions not declared inline were inlined
> only if gcc thought they were simple enough. In the gcc 3.1
> snapshot, however, it looks like the inline keyword only makes the
> function an inline candidate.
> I will argue later on why I think the old behaviour
> (inline being always honoured) is more useful (basically, function
> specialization is harder to do if inline is not honoured). In the mean
> time, I worked around the issue by using an absurdly high value
> -finline-limit=6000 (I tried 2000 first which was not sufficient), and
> then I also added -fno-inline-functions so that I don't get everything
> inlined when I don't ask for it.
>
> With the custom inlining, gcc-3.1 snapshot is about 4% slower than
> gcc-2.95, on my athlon tbird. This is progress, since without the
> custom inlining, the slowdown was 10%.
>
> Then I tried to figure out where the slowdown is, using gprof. And
> this is where things get really interesting: gprof tells me that the
> code compiled with 3.1 is faster, but 'time' tells me that the user
> time spent executing that code is higher with 3.1 than with 2.95. I'm
> not sure what to make of this, but I think this might give you some
> clues, so I'll describe it in more detail. I'm not sure what the
> overhead is, but it seems to be right in gprof's blind spot.

Using -pg changes the binary, so the results might not be the same as
without the flag. You could try to use the performance counters of
your Athlon using e.g. oprofile (http://oprofile.sourceforge.net).

> I have to describe my gprof methodology first. 'normal' gprof
> (i.e. compiling every file with -pg) seems to have a high overhead for
> me, plus it conflicts with -fomit-frame-pointer which I usually use. So
> I tend to use what I'll call 'light' gprof, which is as follows:
> everything is compiled with -g -O3 -fomit-frame-pointer -mcpu=pentiumpro
> except main.c which is compiled with -g -O3 -p and is not cpu intensive.
> 'light' gprof can not help me figure out call graphs, but it should
> normally be good enough to obtain a flat profile.
>
> When using gcc 2.95, 'light' gprof works great.
> In a flat profile, the
> last number in the 'cumulative seconds' column always matches (within
> 0.1s) the user time as reported by the 'time' command. I think I can
> trust the flat profile information.
>
> When using gcc 3.0 or 3.1 snapshot though, there is a several-seconds
> gap between 'cumulative seconds' and 'user time'. I don't understand
> what happens during this time - could it be that 3.x has a higher
> overhead for function calls, and that this overhead is not accounted
> for in gprof ? I don't understand what happens here, but I get the
> feeling that this might be key to the slowdown I observe.
>
> To summarize the gprof thing, I observe
> 3.1 gprof time < 2.95 gprof time = 2.95 user time < 3.1 user time
>
> If I use 'normal' gprof (every file gets -pg and I remove the
> -fomit-frame-pointer), that inflates the user times a lot, but I still
> get 3.1 gprof time < 2.95 gprof time and 2.95 user time < 3.1 user time.
>
> Also to answer Jan's question, I did try to use -fbranch-probabilities
> and it helped, with this option 3.1 snapshot is about 0.5% slower (in
> user time) than 2.95, instead of 4% slower without. It's still
> frustrating though, because gprof pretends 3.1 is about 10% faster
> than 2.95, so there may be a lot of untapped performance.
>
>
> OK, so this is all I can say for now. I hope someone will know gprof
> internals better than I do and understand why gprof returns times that
> are smaller in 3.1 snapshot vs. 2.95, while time does not agree.
>
> I thought I should add a few comments about the structure of libmpeg2
> code, as I suppose it is somewhat unusual. The time-consuming loop is
> pretty big, and includes several function calls. Each loop execution
> decodes a full mpeg2 macroblock (=256 pixels), during this decoding
> it calls the IDCT (cosine transform) function up to 6 times, and some
> MC (motion compensation) functions up to 12 times. All these calls are
> indirect, using function pointers.
> These called functions are all
> kinda short, executing in about 0.4 microseconds per call on average.
> Yes, that's a lot of function calls, and I suspect SPECint does not do
> as many, which could maybe explain why it's not seeing the same
> performance regression that I see ? It would seem consistent with the
> gprof blind spot thing, too.

It would really help a lot if you could try to write some small
program that behaves the same way (performance wise) as this
libmpeg2 routine.

>
> Finally, I thought I should include a small explanation about what I
> do with inlines and why I'd like to have a gcc option so that the
> inline keyword always inlines a function, instead of just making that
> function an inline candidate. One example of that would be in the MC
> functions mentioned above. To get acceptable speed, these are written
> using mmx assembly operations. Actually they even have several
> implementations, one in straight c for compatibility, one in straight
> mmx, one using sse integer instructions (as present on PIII and
> athlon), and one using 3dnow instructions. It's easy to select at init
> time which implementation to use, since the MC functions are only
> called thru function pointers. The only difference between the sse
> version and the 3dnow version, is that the sse version uses the pavgb
> instruction, while the 3dnow version uses pavgusb. These instructions
> have different opcodes but the same behaviour, so I wrote this using
> an inline function:
>
> static inline void MC_generic (...., int cpu)
> {
> 	.... do stuff ....
> 	if (cpu == CPU_3DNOW)
> 		pavgusb (....);
> 	else
> 		pavgb ();
> 	.... do more stuff ....
> }
>
> void MC_3dnow (.....)
> {
> 	MC_generic (....., CPU_3DNOW);
> }
>
> void MC_sse (.....)
> {
> 	MC_generic (....., CPU_SSE);
> }
>
> In gcc-2.95 and gcc-3.0, this construct works out nicely since
> MC_generic gets inlined, and then the compiler figures out that the
> cpu test in it is a constant and generates clean code for both the
> 3dnow and the sse function. This kind of specialisation is sometimes
> very convenient, but it requires that the inline keyword does an
> unconditional inlining, not subject to compiler heuristics. I would
> really love to see a gcc option to make inlining unconditional when
> using the inline keyword, and have gcc use its heuristics when there
> is no such keyword.

You might want to use the always_inline function attribute for your
inline functions in 3.1. It makes the inlining unconditional.

>
> This email is longer than I thought it would be, thanks a lot to
> those who're still reading me :)

;-)

Andreas
-- 
 Andreas Jaeger
  SuSE Labs aj@suse.de
   private aj@arthur.inka.de
    http://www.suse.de/~aj
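
For illustration, here is a rough sketch of how the always_inline
attribute could be applied to the MC_generic pattern quoted above. It is
not libmpeg2 code: the names mc_generic, mc_sse, mc_3dnow, avg_sse,
avg_3dnow and mc_check, the function signatures, and the CPU_* values
are all made up for the example, and the two avg_* helpers are plain C
stand-ins for pavgb and pavgusb (both compute a rounded-up average of
unsigned bytes).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical CPU selectors, standing in for the constants in the mail. */
enum { CPU_SSE, CPU_3DNOW };

/* Plain C stand-ins for pavgb / pavgusb: rounded average of two bytes. */
static inline unsigned char avg_sse(unsigned char a, unsigned char b)
{
    return (unsigned char)((a + b + 1) >> 1);
}

static inline unsigned char avg_3dnow(unsigned char a, unsigned char b)
{
    return (unsigned char)((a + b + 1) >> 1);
}

/* always_inline asks GCC to inline this regardless of its size heuristics,
 * so the constant 'cpu' argument can be folded away in each caller. */
static inline __attribute__((always_inline)) void
mc_generic(unsigned char *dst, const unsigned char *src, int n, int cpu)
{
    int i;
    for (i = 0; i < n; i++) {
        if (cpu == CPU_3DNOW)
            dst[i] = avg_3dnow(dst[i], src[i]);
        else
            dst[i] = avg_sse(dst[i], src[i]);
    }
}

/* Two specialised entry points, suitable for selection via function pointer. */
void mc_3dnow(unsigned char *dst, const unsigned char *src, int n)
{
    mc_generic(dst, src, n, CPU_3DNOW);
}

void mc_sse(unsigned char *dst, const unsigned char *src, int n)
{
    mc_generic(dst, src, n, CPU_SSE);
}

/* Self-check: both specialisations must produce identical output. */
void mc_check(void)
{
    const unsigned char src[4] = {1, 20, 255, 8};
    unsigned char a[4] = {0, 10, 255, 7};
    unsigned char b[4] = {0, 10, 255, 7};

    mc_sse(a, src, 4);
    mc_3dnow(b, src, 4);
    assert(memcmp(a, b, sizeof a) == 0);
}
```

Whether the branch on cpu really disappears from each specialised
function is best confirmed by inspecting the generated assembly
(gcc -O2 -S), since that depends on the compiler version and flags.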