From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-50416-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 19000 invoked by alias); 22 Apr 2002 07:13:47 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 18993 invoked from network); 22 Apr 2002 07:13:46 -0000
Received: from unknown (HELO Cantor.suse.de) (213.95.15.193)
  by sources.redhat.com with SMTP; 22 Apr 2002 07:13:46 -0000
Received: from Hermes.suse.de (Charybdis.suse.de [213.95.15.201])
	by Cantor.suse.de (Postfix) with ESMTP
	id 4A6EA1EAD4; Mon, 22 Apr 2002 09:10:56 +0200 (MEST)
X-Authentication-Warning: gee.suse.de: aj set sender to aj@suse.de using -f
To: Michel LESPINASSE <walken@zoy.org>
Cc: Jan Hubicka <jh@suse.cz>, gcc list <gcc@gcc.gnu.org>
Subject: Re: GCC performance regression - up to 20% ?
References: <20020421005718.GA16378@zoy.org>
	<20020421113238.GC16602@atrey.karlin.mff.cuni.cz>
	<20020422061937.GA27171@zoy.org>
From: Andreas Jaeger <aj@suse.de>
Date: Mon, 22 Apr 2002 00:17:00 -0000
In-Reply-To: <20020422061937.GA27171@zoy.org> (Michel LESPINASSE's message
 of "Sun, 21 Apr 2002 23:19:37 -0700")
Message-ID: <ho1yd85hts.fsf@gee.suse.de>
User-Agent: Gnus/5.090006 (Oort Gnus v0.06) XEmacs/21.4 (Artificial
 Intelligence, i386-suse-linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-SW-Source: 2002-04/txt/msg01068.txt.bz2

Michel LESPINASSE <walken@zoy.org> writes:

> Hi,
>
> I spent the afternoon looking at this issue and trying different
> flags. I think I got some interesting results, though I still don't get
> the full picture.
>
> The first thing I tweaked was to make the inlining work the way I
> wanted. As I understand it, in gcc 2.95 and 3.0, when compiling with
> usual options (-O3), the behaviour was that functions declared inline
> were always inlined, and functions not declared inline were inlined
> only if gcc thought they were simple enough. While in gcc 3.1
> snapshot, it looks like the inline keyword only makes the function an
> inline candidate. I will argue later on why I think the old behaviour
> (inline being always honoured) is more useful (basically, function
> specialization is harder to do if inline is not honoured). In the mean
> time, I worked around the issue by using an absurdly high value
> -finline-limit=6000 (I tried 2000 first which was not sufficient), and
> then I also added -fno-inline-functions so that I don't get everything
> inlined when I don't ask for it.
>
> With the custom inlining, gcc-3.1 snapshot is about 4% slower than
> gcc-2.95, on my athlon tbird. This is progress, since without the
> custom inlining, the slowdown was 10%.
>
> Then I tried to figure out where the slowdown is, using gprof. And
> this is where things get really interesting: gprof tells me that the
> code compiled with 3.1 is faster, but 'time' tells me that the user
> time spent executing that code is higher with 3.1 than with 2.95. I'm
> not sure what to make of this, but I think this might give you some
> clues, so I'll describe it in more detail. I'm not sure what the
> overhead is, but it seems to be right in gprof's blind spot.

Using -pg changes the binary, so the results might not be the same as
without the flag.  You could try using the performance counters of
your Athlon with e.g. oprofile (http://oprofile.sourceforge.net).
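
Another cheap cross-check that does not need -pg at all is to read the
user time from inside the program.  The sketch below assumes a POSIX
system with getrusage(); user_seconds is just a name I made up for
this mail, it is not part of gprof, oprofile or libmpeg2.

```c
#include <sys/time.h>
#include <sys/resource.h>

/* User CPU time consumed by this process so far, in seconds.
   Returns -1.0 if getrusage() fails. */
double user_seconds(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6;
}
```

Taking the difference of two user_seconds() calls around the decoding
loop gives the user time of just that region in an uninstrumented
binary, which can then be compared between compilers.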

> I have to describe my gprof methodology first. 'normal' gprof
> (i.e. compiling every file with -pg) seems to have a high overhead for
> me, plus it conflicts with -fomit-frame-pointer which I usually use. So
> I tend to use what I'll call 'light' gprof, which is as follows:
> everything is compiled with -g -O3 -fomit-frame-pointer -mcpu=pentiumpro
> except main.c which is compiled with -g -O3 -p and is not cpu intensive.
> 'light' gprof can not help me figure out call graphs, but it should
> normally be good enough to obtain a flat profile.
>
> When using gcc 2.95, 'light' gprof works great. In a flat profile, the
> last number in the 'cumulative seconds' column always matches (within
> 0.1s) the user time as reported by the 'time' command. I think I can
> trust the flat profile information.
>
> When using gcc 3.0 or 3.1 snapshot though, there is a several-second
> gap between 'cumulative seconds' and 'user time'. I don't understand
> what happens during this time - could it be that 3.x has a higher
> overhead for function calls, and that this overhead is not accounted
> for in gprof? I don't understand what happens here, but I get the
> feeling that this might be key to the slowdown I observe.
>
> To summarize the gprof thing, I observe
> 3.1 gprof time < 2.95 gprof time = 2.95 user time < 3.1 user time
>
> If I use 'normal' gprof (every file gets -pg and I remove the
> -fomit-frame-pointer), that inflates the user times a lot, but I still
> get 3.1 gprof time < 2.95 gprof time and 2.95 user time < 3.1 user time.
>
> Also to answer Jan's question, I did try to use -fbranch-probabilities
> and it helped: with this option the 3.1 snapshot is about 0.5% slower (in
> user time) than 2.95, instead of 4% slower without. It's still
> frustrating though, because gprof pretends 3.1 is about 10% faster
> than 2.95, so there may be a lot of untapped performance.
>
>
> OK, so this is all I can say for now. I hope someone will know gprof
> internals better than I do and understand why gprof returns times that
> are smaller in 3.1 snapshot vs. 2.95, while time does not agree.
>
> I thought I should add a few comments about the structure of libmpeg2
> code, as I suppose it is somewhat unusual. The time-consuming loop is
> pretty big, and includes several function calls. Each loop execution
> decodes a full mpeg2 macroblock (=256 pixels); during this decoding
> it calls the IDCT (cosine transform) function up to 6 times, and some
> MC (motion compensation) functions up to 12 times. All these calls are
> indirect, using function pointers. These called functions are all
> kinda short, executing in about 0.4 microseconds per call on average.
> Yes, that's a lot of function calls, and I suspect SPECint does not do
> as many, which could maybe explain why it's not seeing the same
> performance regression that I see? It would seem consistent with the
> gprof blind spot thing, too.

It would really help a lot if you could try to write a small
program that behaves the same way (performance-wise) as this routine
in libmpeg2.
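
As a starting point, such a test case might look roughly like the
sketch below.  It is not libmpeg2 code: fake_idct, fake_mc and
decode_blocks are made-up names and the loop bodies are arbitrary
filler.  Only the call pattern (6 + 12 short indirect calls per
256-pixel block) follows your description.

```c
#include <stdint.h>

#define BLOCK_PIXELS 256   /* one macroblock, as described above */

typedef void (*kernel_fn)(uint8_t *block);

/* Hypothetical stand-in for a short IDCT routine. */
static void fake_idct(uint8_t *block)
{
    int i;

    for (i = 0; i < 64; i++)
        block[i] = (uint8_t)(block[i] * 3 + 1);
}

/* Hypothetical stand-in for a short MC routine (rounded average). */
static void fake_mc(uint8_t *block)
{
    int i;

    for (i = 0; i < 64; i++)
        block[i] = (uint8_t)((block[i] + block[i + 64] + 1) >> 1);
}

/* volatile forces the pointers to be reloaded on every call, so the
   calls stay indirect even when the compiler can see the targets. */
static kernel_fn volatile idct_ptr = fake_idct;
static kernel_fn volatile mc_ptr = fake_mc;

/* Runs `count` iterations of the call pattern and returns a checksum
   so that the work cannot be optimised away. */
unsigned long decode_blocks(unsigned long count)
{
    uint8_t block[BLOCK_PIXELS];
    unsigned long n, sum = 0;
    int i;

    for (i = 0; i < BLOCK_PIXELS; i++)
        block[i] = (uint8_t)i;

    for (n = 0; n < count; n++) {
        for (i = 0; i < 6; i++)
            idct_ptr(block);      /* up to 6 IDCT calls per block */
        for (i = 0; i < 12; i++)
            mc_ptr(block);        /* up to 12 MC calls per block */
        sum += block[0];
    }
    return sum;
}
```

Calling decode_blocks() with a few million iterations from main() and
timing the run under each compiler might show whether the call pattern
alone is enough to reproduce the gap.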

>
> Finally, I thought I should include a small explanation about what I
> do with inlines and why I'd like to have a gcc option so that the
> inline keyword always inlines a function, instead of just making that
> function an inline candidate. One example of that would be in the MC
> functions mentioned above. To get acceptable speed, these are written
> using mmx assembly operations. Actually they even have several
> implementations, one in straight c for compatibility, one in straight
> mmx, one using sse integer instructions (as present on PIII and
> athlon), and one using 3dnow instructions. It's easy to select at init
> time which implementation to use, since the MC functions are only
> called through function pointers. The only difference between the sse
> version and the 3dnow version, is that the sse version uses the pavgb
> instruction, while the 3dnow version uses pavgusb. These instructions
> have different opcodes but the same behaviour, so I wrote this using
> an inline function:

>
> static inline void MC_generic (...., int cpu)
> {
> 	.... do stuff ....
> 	if (cpu == CPU_3DNOW)
> 		pavgusb (....);
> 	else
> 		pavgb (....);
> 	.... do more stuff ....
> }
>
> void MC_3dnow (.....)
> {
> 	MC_generic (....., CPU_3DNOW);
> }
>
> void MC_sse (.....)
> {
> 	MC_generic (....., CPU_SSE);
> }
>
> In gcc-2.95 and gcc-3.0, this construct works out nicely since
> MC_generic gets inlined, and then the compiler figures out that the
> cpu test in it is a constant and generates clean code for both the
> 3dnow and the sse function. This kind of specialisation is sometimes
> very convenient, but it requires that the inline keyword does an
> unconditional inlining, not subject to compiler heuristics. I would
> really love to see a gcc option to make inlining unconditional when
> using the inline keyword, and have gcc use its heuristics when there
> is no such keyword.

You might want to use the always_inline function attribute for your
inline functions in 3.1.  It makes the inlining unconditional.
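
Applied to your example, it could look something like the sketch
below.  The plain C averaging helpers avg_sse and avg_3dnow are
placeholders I made up so that the example compiles without MMX
assembly; the real code would use pavgb and pavgusb there.  The other
names and the argument lists are only for illustration as well.

```c
#include <stdint.h>

enum cpu_kind { CPU_SSE, CPU_3DNOW };

/* Placeholders for the pavgb / pavgusb wrappers: both instructions
   compute a rounded byte average, so plain C shows the structure. */
static inline uint8_t avg_sse(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

static inline uint8_t avg_3dnow(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* always_inline asks GCC to inline this function into every caller,
   regardless of its size heuristics. */
static inline __attribute__((always_inline))
void mc_generic(uint8_t *dst, const uint8_t *ref, int n, enum cpu_kind cpu)
{
    int i;

    for (i = 0; i < n; i++) {
        /* cpu is a constant in each caller, so once the function is
           inlined the optimiser can drop the untaken branch. */
        if (cpu == CPU_3DNOW)
            dst[i] = avg_3dnow(dst[i], ref[i]);
        else
            dst[i] = avg_sse(dst[i], ref[i]);
    }
}

void mc_3dnow(uint8_t *dst, const uint8_t *ref, int n)
{
    mc_generic(dst, ref, n, CPU_3DNOW);
}

void mc_sse(uint8_t *dst, const uint8_t *ref, int n)
{
    mc_generic(dst, ref, n, CPU_SSE);
}
```

The two specialised entry points can still be selected through
function pointers at init time, exactly as you do now.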


>
> This email is longer than I thought it would be, thanks a lot for
> to those who're still reading :)

;-)

Andreas
-- 
 Andreas Jaeger
  SuSE Labs aj@suse.de
   private aj@arthur.inka.de
    http://www.suse.de/~aj