From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (qmail 12853 invoked by alias); 22 Apr 2002 06:19:45 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive: 
List-Post: 
List-Help: 
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 12845 invoked from network); 22 Apr 2002 06:19:43 -0000
Received: from unknown (HELO Angel.zoy.org) (12.236.86.18) by sources.redhat.com with SMTP; 22 Apr 2002 06:19:43 -0000
Received: by Angel.zoy.org (Postfix, from userid 1000) id B9A4BB89B; Sun, 21 Apr 2002 23:19:37 -0700 (PDT)
Date: Sun, 21 Apr 2002 23:46:00 -0000
From: Michel LESPINASSE 
To: Jan Hubicka , Andreas Jaeger 
Cc: gcc list 
Subject: Re: GCC performance regression - up to 20% ?
Message-ID: <20020422061937.GA27171@zoy.org>
References: <20020421005718.GA16378@zoy.org> <20020421113238.GC16602@atrey.karlin.mff.cuni.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20020421113238.GC16602@atrey.karlin.mff.cuni.cz>
User-Agent: Mutt/1.3.28i
X-SW-Source: 2002-04/txt/msg01065.txt.bz2

Hi,

I spent the afternoon looking at this issue and trying different flags. I think I got some interesting results, though I still don't get the full picture.

The first thing I tweaked was to make the inlining work the way I wanted. As I understand it, in gcc 2.95 and 3.0, when compiling with the usual options (-O3), functions declared inline were always inlined, and functions not declared inline were inlined only if gcc thought they were simple enough. In the gcc 3.1 snapshot, on the other hand, the inline keyword only makes the function an inline candidate. I will argue later on why I think the old behaviour (inline always being honoured) is more useful - basically, function specialization is harder to do if inline is not honoured.
In the meantime, I worked around the issue by using an absurdly high value, -finline-limit=6000 (I tried 2000 first, which was not sufficient), and then I also added -fno-inline-functions so that I don't get everything inlined when I don't ask for it. With the custom inlining, the gcc 3.1 snapshot is about 4% slower than gcc 2.95 on my athlon tbird. This is progress, since without the custom inlining the slowdown was 10%.

Then I tried to figure out where the slowdown is, using gprof. And this is where things get really interesting: gprof tells me that the code compiled with 3.1 is faster, but 'time' tells me that the user time spent executing that code is higher with 3.1 than with 2.95. I'm not sure what to make of this, but I think this might give you some clues, so I'll describe it in more detail. I'm not sure what the overhead is, but it seems to be right in gprof's blind spot.

I have to describe my gprof methodology first. 'normal' gprof (i.e. compiling every file with -pg) seems to have a high overhead for me, plus it conflicts with -fomit-frame-pointer which I usually use. So I tend to use what I'll call 'light' gprof, which is as follows: everything is compiled with -g -O3 -fomit-frame-pointer -mcpu=pentiumpro, except main.c which is compiled with -g -O3 -p and is not cpu intensive. 'light' gprof cannot help me figure out call graphs, but it should normally be good enough to obtain a flat profile.

When using gcc 2.95, 'light' gprof works great. In a flat profile, the last number in the 'cumulative seconds' column always matches (within 0.1s) the user time as reported by the 'time' command. I think I can trust the flat profile information. When using gcc 3.0 or the 3.1 snapshot though, there is a several-seconds gap between 'cumulative seconds' and 'user time'. I don't understand what happens during this time - could it be that 3.x has a higher overhead for function calls, and that this overhead is not accounted for in gprof?
I don't understand what happens here, but I get the feeling that this might be key to the slowdown I observe. To summarize the gprof results, I observe:

    3.1 gprof time < 2.95 gprof time = 2.95 user time < 3.1 user time

If I use 'normal' gprof (every file gets -pg and I remove -fomit-frame-pointer), that inflates the user times a lot, but I still get 3.1 gprof time < 2.95 gprof time and 2.95 user time < 3.1 user time.

Also, to answer Jan's question, I did try -fbranch-probabilities and it helped: with this option the 3.1 snapshot is about 0.5% slower (in user time) than 2.95, instead of 4% slower without it. It's still frustrating though, because gprof claims 3.1 is about 10% faster than 2.95, so there may be a lot of untapped performance.

OK, so this is all I can say for now. I hope someone knows gprof internals better than I do and can explain why gprof returns times that are smaller with the 3.1 snapshot than with 2.95, while 'time' does not agree.

I thought I should add a few comments about the structure of the libmpeg2 code, as I suppose it is somewhat unusual. The time-consuming loop is pretty big and includes several function calls. Each loop iteration decodes a full mpeg2 macroblock (= 256 pixels); during this decoding it calls the IDCT (inverse discrete cosine transform) function up to 6 times, and some MC (motion compensation) functions up to 12 times. All these calls are indirect, using function pointers. These called functions are all kinda short, executing in about 0.4 microseconds per call on average. Yes, that's a lot of function calls, and I suspect SPECint does not do as many, which could maybe explain why it's not seeing the same performance regression that I see? It would seem consistent with the gprof blind spot thing, too.

Finally, I thought I should include a small explanation about what I do with inlines, and why I'd like to have a gcc option so that the inline keyword always inlines a function, instead of just making that function an inline candidate.
One example of that would be in the MC functions mentioned above. To get acceptable speed, these are written using mmx assembly operations. Actually they even have several implementations: one in straight C for compatibility, one in straight mmx, one using sse integer instructions (as present on the PIII and athlon), and one using 3dnow instructions. It's easy to select at init time which implementation to use, since the MC functions are only called through function pointers.

The only difference between the sse version and the 3dnow version is that the sse version uses the pavgb instruction, while the 3dnow version uses pavgusb. These instructions have different opcodes but the same behaviour, so I wrote this using an inline function:

    static inline void MC_generic (...., int cpu)
    {
        .... do stuff ....
        if (cpu == CPU_3DNOW)
            pavgusb (....);
        else
            pavgb (....);
        .... do more stuff ....
    }

    void MC_3dnow (.....)
    {
        MC_generic (....., CPU_3DNOW);
    }

    void MC_sse (.....)
    {
        MC_generic (....., CPU_SSE);
    }

In gcc 2.95 and gcc 3.0, this construct works out nicely: MC_generic gets inlined, and then the compiler figures out that the cpu test in it is a constant and generates clean code for both the 3dnow and the sse functions. This kind of specialisation is sometimes very convenient, but it requires that the inline keyword do an unconditional inlining, not subject to compiler heuristics. I would really love to see a gcc option to make inlining unconditional when the inline keyword is used, and have gcc use its heuristics when there is no such keyword.

This email is longer than I thought it would be - thanks a lot to those who're still reading me :)

Cheers,

-- 
Michel "Walken" LESPINASSE
Is this the best that god can do ? Then I'm not impressed.