From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (qmail 12853 invoked by alias); 22 Apr 2002 06:19:45 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive: 
List-Post: 
List-Help: 
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 12845 invoked from network); 22 Apr 2002 06:19:43 -0000
Received: from unknown (HELO Angel.zoy.org) (12.236.86.18) by sources.redhat.com with SMTP; 22 Apr 2002 06:19:43 -0000
Received: by Angel.zoy.org (Postfix, from userid 1000) id B9A4BB89B; Sun, 21 Apr 2002 23:19:37 -0700 (PDT)
Date: Sun, 21 Apr 2002 23:46:00 -0000
From: Michel LESPINASSE 
To: Jan Hubicka , Andreas Jaeger 
Cc: gcc list 
Subject: Re: GCC performance regression - up to 20% ?
Message-ID: <20020422061937.GA27171@zoy.org>
References: <20020421005718.GA16378@zoy.org> <20020421113238.GC16602@atrey.karlin.mff.cuni.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20020421113238.GC16602@atrey.karlin.mff.cuni.cz>
User-Agent: Mutt/1.3.28i
X-SW-Source: 2002-04/txt/msg01065.txt.bz2

Hi,

I spent the afternoon looking at this issue and trying different flags. I think I got some interesting results, though I still don't get the full picture.

The first thing I tweaked was to make the inlining work the way I wanted. As I understand it, in gcc 2.95 and 3.0, when compiling with the usual options (-O3), functions declared inline were always inlined, and functions not declared inline were inlined only if gcc thought they were simple enough. In the gcc 3.1 snapshot, on the other hand, the inline keyword only makes the function an inline candidate. I will argue later on why I think the old behaviour (inline always being honoured) is more useful - basically, function specialization is harder to do if inline is not honoured.
In the meantime, I worked around the issue by using an absurdly high value, -finline-limit=6000 (I tried 2000 first, which was not sufficient), and then I also added -fno-inline-functions so that I don't get everything inlined when I don't ask for it. With the custom inlining, the gcc 3.1 snapshot is about 4% slower than gcc 2.95 on my athlon tbird. This is progress, since without the custom inlining the slowdown was 10%.

Then I tried to figure out where the slowdown is, using gprof. And this is where things get really interesting: gprof tells me that the code compiled with 3.1 is faster, but 'time' tells me that the user time spent executing that code is higher with 3.1 than with 2.95. I'm not sure what to make of this, but I think this might give you some clues, so I'll describe it in more detail. I'm not sure what the overhead is, but it seems to be right in gprof's blind spot.

I have to describe my gprof methodology first. 'normal' gprof (i.e. compiling every file with -pg) seems to have a high overhead for me, plus it conflicts with -fomit-frame-pointer which I usually use. So I tend to use what I'll call 'light' gprof, which is as follows: everything is compiled with -g -O3 -fomit-frame-pointer -mcpu=pentiumpro, except main.c which is compiled with -g -O3 -p and is not cpu intensive. 'light' gprof cannot help me figure out call graphs, but it should normally be good enough to obtain a flat profile.

When using gcc 2.95, 'light' gprof works great. In a flat profile, the last number in the 'cumulative seconds' column always matches (within 0.1s) the user time as reported by the 'time' command. I think I can trust the flat profile information. When using gcc 3.0 or the 3.1 snapshot though, there is a several-seconds gap between 'cumulative seconds' and 'user time'. I don't understand what happens during this time - could it be that 3.x has a higher overhead for function calls, and that this overhead is not accounted for in gprof?
I don't understand what happens here, but I get the feeling that this might be key to the slowdown I observe. To summarize the gprof results, I observe:

    3.1 gprof time < 2.95 gprof time = 2.95 user time < 3.1 user time

If I use 'normal' gprof (every file gets -pg and I remove -fomit-frame-pointer), that inflates the user times a lot, but I still get 3.1 gprof time < 2.95 gprof time and 2.95 user time < 3.1 user time.

Also, to answer Jan's question, I did try -fbranch-probabilities and it helped: with this option the 3.1 snapshot is about 0.5% slower (in user time) than 2.95, instead of 4% slower without it. It's still frustrating though, because gprof claims 3.1 is about 10% faster than 2.95, so there may be a lot of untapped performance.

OK, so this is all I can say for now. I hope someone knows gprof internals better than I do and can explain why gprof returns times that are smaller with the 3.1 snapshot than with 2.95, while 'time' does not agree.

I thought I should add a few comments about the structure of the libmpeg2 code, as I suppose it is somewhat unusual. The time-consuming loop is pretty big and includes several function calls. Each loop iteration decodes a full mpeg2 macroblock (= 256 pixels); during this decoding it calls the IDCT (inverse discrete cosine transform) function up to 6 times, and some MC (motion compensation) functions up to 12 times. All these calls are indirect, using function pointers. These called functions are all kinda short, executing in about 0.4 microseconds per call on average. Yes, that's a lot of function calls, and I suspect SPECint does not do as many, which could maybe explain why it's not seeing the same performance regression that I see? It would seem consistent with the gprof blind spot thing, too.

Finally, I thought I should include a small explanation about what I do with inlines, and why I'd like to have a gcc option so that the inline keyword always inlines a function, instead of just making that function an inline candidate.
One example of that would be in the MC functions mentioned above. To get acceptable speed, these are written using mmx assembly operations. Actually they even have several implementations: one in straight C for compatibility, one in straight mmx, one using sse integer instructions (as present on the PIII and athlon), and one using 3dnow instructions. It's easy to select at init time which implementation to use, since the MC functions are only called through function pointers.

The only difference between the sse version and the 3dnow version is that the sse version uses the pavgb instruction, while the 3dnow version uses pavgusb. These instructions have different opcodes but the same behaviour, so I wrote this using an inline function:

    static inline void MC_generic (...., int cpu)
    {
        .... do stuff ....
        if (cpu == CPU_3DNOW)
            pavgusb (....);
        else
            pavgb (....);
        .... do more stuff ....
    }

    void MC_3dnow (.....)
    {
        MC_generic (....., CPU_3DNOW);
    }

    void MC_sse (.....)
    {
        MC_generic (....., CPU_SSE);
    }

In gcc 2.95 and gcc 3.0, this construct works out nicely: MC_generic gets inlined, and then the compiler figures out that the cpu test in it is a constant and generates clean code for both the 3dnow and the sse functions. This kind of specialisation is sometimes very convenient, but it requires that the inline keyword do an unconditional inlining, not subject to compiler heuristics. I would really love to see a gcc option to make inlining unconditional when the inline keyword is used, and have gcc use its heuristics when there is no such keyword.

This email is longer than I thought it would be - thanks a lot to those who're still reading me :)

Cheers,

-- 
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.