Re: [Fwd: performance with gcc -O0/-O2]


* Re: [Fwd: performance with gcc -O0/-O2]
  2007-11-27 14:19 [Fwd: performance with gcc -O0/-O2] Howard Chu
@ 2007-11-27 14:19 ` Richard Guenther
  2007-11-27 16:09   ` Tim Prince
  2007-11-27 17:59   ` Howard Chu
  2007-11-27 15:46 ` Andrew Haley
  1 sibling, 2 replies; 9+ messages in thread
From: Richard Guenther @ 2007-11-27 14:19 UTC
  To: Howard Chu; +Cc: gcc

On Nov 27, 2007 2:23 PM, Howard Chu <hyc@highlandsun.com> wrote:
> A bit of a minor mystery. Not a problem, just a curiosity. If someone knew off
> the top of their head a reason for it, that'd be cool, but otherwise no sweat.

I'd try -Os, you might run into ICache limitations.

Richard.

> -------- Original Message --------
> Subject: Re: commit: ldap/servers/slapd connection.c daemon.c proto-slap.h
> syncrepl.c
> Date: Tue, 27 Nov 2007 05:17:04 -0800
> From: Howard Chu <hyc@symas.com>
> To: OpenLDAP-devel@openldap.org
> References: <200711261603.lAQG3R7e010741@cantor.openldap.org>
> <474AFA54.6080805@symas.com>    <474B0620.8030706@symas.com>
> <474B92F5.50306@symas.com>
>
> Howard Chu wrote:
> > Howard Chu wrote:
> >> Howard Chu wrote:
> >>> For reference, the peak throughput with back-null on the previous code was
> >>> only 7,800 auths/sec (with 8 client threads). With this patch it's 11,140
> >>> auths/sec.
>
> Those numbers are for Windows Server 2003 x86_64 on a Celestica A8440 with 4
> Opteron 875s, using OpenLDAP compiled with gcc 4.3.0. The following numbers
> are for Linux 2.6.23.1 x86_64, on the same machine, compiled first with gcc
> 4.1.2 and then later with gcc 4.2.2. There's no disk I/O in these tests.
>
> >>> In both cases the throughput declines as more client threads are
> >>> used. (Compare to 35,553 auths/sec for the same machine running Linux, and no
> >>> drop in throughput all the way up to hundreds/thousands of connections.)
>
> > Re-running on Linux with a non-optimized build, peaked at 40,101 auths/sec. (I
> > guess HEAD has sped up a bit more in the past week or so...)
>
> OK, this is odd. The code compiled without optimization peaks at 40K auths/sec
> at around 124-132 client threads. The code compiled with -O2 peaks at 37K
> auths/sec at around 128 client threads.
>
> The -O2 build is faster from about 4 to 24 client threads. From 28 on up, the
> nonoptimized code is faster at every load level. I was originally using gcc
> 4.1.2 but I'm seeing the same result now using gcc 4.2.2. Also, slapd is only
> configured with 8 worker threads in all of these tests. Strange that whatever
> optimizations the compiler has performed speed things up under lighter load
> but work against it under heavier load.
> --
>    -- Howard Chu
>    Chief Architect, Symas Corp.  http://www.symas.com
>    Director, Highland Sun        http://highlandsun.com/hyc/
>    Chief Architect, OpenLDAP     http://www.openldap.org/project/
>


* [Fwd: performance with gcc -O0/-O2]
@ 2007-11-27 14:19 Howard Chu
  2007-11-27 14:19 ` Richard Guenther
  2007-11-27 15:46 ` Andrew Haley
  0 siblings, 2 replies; 9+ messages in thread
From: Howard Chu @ 2007-11-27 14:19 UTC
  To: gcc

A bit of a minor mystery. Not a problem, just a curiosity. If someone knew off 
the top of their head a reason for it, that'd be cool, but otherwise no sweat.

-------- Original Message --------
Subject: Re: commit: ldap/servers/slapd connection.c daemon.c proto-slap.h 
syncrepl.c
Date: Tue, 27 Nov 2007 05:17:04 -0800
From: Howard Chu <hyc@symas.com>
To: OpenLDAP-devel@openldap.org
References: <200711261603.lAQG3R7e010741@cantor.openldap.org> 
<474AFA54.6080805@symas.com>	<474B0620.8030706@symas.com> 
<474B92F5.50306@symas.com>

Howard Chu wrote:
> Howard Chu wrote:
>> Howard Chu wrote:
>>> For reference, the peak throughput with back-null on the previous code was
>>> only 7,800 auths/sec (with 8 client threads). With this patch it's 11,140
>>> auths/sec.

Those numbers are for Windows Server 2003 x86_64 on a Celestica A8440 with 4 
Opteron 875s, using OpenLDAP compiled with gcc 4.3.0. The following numbers 
are for Linux 2.6.23.1 x86_64, on the same machine, compiled first with gcc 
4.1.2 and then later with gcc 4.2.2. There's no disk I/O in these tests.

>>> In both cases the throughput declines as more client threads are
>>> used. (Compare to 35,553 auths/sec for the same machine running Linux, and no
>>> drop in throughput all the way up to hundreds/thousands of connections.)

> Re-running on Linux with a non-optimized build, peaked at 40,101 auths/sec. (I
> guess HEAD has sped up a bit more in the past week or so...)

OK, this is odd. The code compiled without optimization peaks at 40K auths/sec
at around 124-132 client threads. The code compiled with -O2 peaks at 37K
auths/sec at around 128 client threads.

The -O2 build is faster from about 4 to 24 client threads. From 28 on up, the
nonoptimized code is faster at every load level. I was originally using gcc
4.1.2 but I'm seeing the same result now using gcc 4.2.2. Also, slapd is only
configured with 8 worker threads in all of these tests. Strange that whatever
optimizations the compiler has performed speed things up under lighter load
but work against it under heavier load.
-- 
   -- Howard Chu
   Chief Architect, Symas Corp.  http://www.symas.com
   Director, Highland Sun        http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP     http://www.openldap.org/project/


* Re: [Fwd: performance with gcc -O0/-O2]
  2007-11-27 14:19 [Fwd: performance with gcc -O0/-O2] Howard Chu
  2007-11-27 14:19 ` Richard Guenther
@ 2007-11-27 15:46 ` Andrew Haley
  1 sibling, 0 replies; 9+ messages in thread
From: Andrew Haley @ 2007-11-27 15:46 UTC
  To: Howard Chu; +Cc: gcc

Howard Chu writes:

 > A bit of a minor mystery. Not a problem, just a curiosity. If
 > someone knew off the top of their head a reason for it, that'd be
 > cool, but otherwise no sweat.

It's possible, although unlikely, that the optimized code has worse
cache behaviour.  No way to know better without doing some profiling.

Andrew.



-- 
Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, UK
Registered in England and Wales No. 3798903


* Re: [Fwd: performance with gcc -O0/-O2]
  2007-11-27 14:19 ` Richard Guenther
@ 2007-11-27 16:09   ` Tim Prince
  2007-11-27 17:59   ` Howard Chu
  1 sibling, 0 replies; 9+ messages in thread
From: Tim Prince @ 2007-11-27 16:09 UTC
  To: Richard Guenther; +Cc: Howard Chu, gcc

Richard Guenther wrote:
> On Nov 27, 2007 2:23 PM, Howard Chu <hyc@highlandsun.com> wrote:
>> A bit of a minor mystery. Not a problem, just a curiosity. If someone knew off
>> the top of their head a reason for it, that'd be cool, but otherwise no sweat.
> 
> I'd try -Os, you might run into ICache limitations.
Try -Os with and without setting -mpreferred-stack-boundary=4 (or
whatever value you currently have).  Watch memory usage, cache
evictions, etc. while running.
> 
> Richard.
> 


* Re: [Fwd: performance with gcc -O0/-O2]
  2007-11-27 14:19 ` Richard Guenther
  2007-11-27 16:09   ` Tim Prince
@ 2007-11-27 17:59   ` Howard Chu
  1 sibling, 0 replies; 9+ messages in thread
From: Howard Chu @ 2007-11-27 17:59 UTC
  To: Richard Guenther; +Cc: gcc

Richard Guenther wrote:
> On Nov 27, 2007 2:23 PM, Howard Chu <hyc@highlandsun.com> wrote:
>> A bit of a minor mystery. Not a problem, just a curiosity. If someone knew off
>> the top of their head a reason for it, that'd be cool, but otherwise no sweat.
> 
> I'd try -Os, you might run into ICache limitations.

Thanks to everyone for the replies.

FYI, the runs were all with -march=k8.

The test with gcc 4.2.2 and -Os just finished. It was again faster than -O0
from 4 to 28 clients, but slower than -O0 the rest of the way. It peaked at
38K auths/sec at 120 client threads, slightly better than -O2. I'll try again
with some of these other suggestions.
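
For a rough sense of scale, the peak figures reported so far in this thread can be compared against the -O0 baseline. The sketch below uses only the rounded peaks quoted in the thread (40K, 37K and 38K auths/sec); the function and variable names are illustrative and not part of any OpenLDAP tooling.

```python
# Rounded peak throughput (auths/sec) as quoted in this thread.
peaks = {"-O0": 40_000, "-O2": 37_000, "-Os": 38_000}


def deficit_vs_baseline(peaks, baseline="-O0"):
    """Return, per build, how far its peak falls below the baseline, in percent."""
    base = peaks[baseline]
    return {
        flag: round(100.0 * (base - value) / base, 1)
        for flag, value in peaks.items()
    }


if __name__ == "__main__":
    for flag, pct in deficit_vs_baseline(peaks).items():
        print(f"{flag}: {pct}% below the -O0 peak")
```

These are peak-to-peak differences only; they say nothing about the lighter-load range where the optimized builds were reported to be faster.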
-- 
   -- Howard Chu
   Chief Architect, Symas Corp.  http://www.symas.com
   Director, Highland Sun        http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP     http://www.openldap.org/project/


* Re: [Fwd: performance with gcc -O0/-O2]
  2007-11-27 15:51 J.C. Pizarro
@ 2007-11-27 20:45 ` Howard Chu
  0 siblings, 0 replies; 9+ messages in thread
From: Howard Chu @ 2007-11-27 20:45 UTC
  To: J.C. Pizarro; +Cc: gcc

J.C. Pizarro wrote:
> For your Opteron, try these options:
> 
> -O3 -fomit-frame-pointer -march=k8 -funroll-loops -finline-functions \
> -fpeel-loops -mno-sse3 -msse2 -msse -mno-mmx -mno-3dnow
> 
> The Opteron hardware said that it's better to use SSE2 than SSE3.
> The MMX and 3DNow!+ instructions are shorter and older than SSE2/SSE
> instructions.

Interesting. With these flags, the peak was 39K/sec, and it didn't top out 
until 272 client connections. (Quite a lengthy test; I'm running 2 minutes per 
iteration with X number of clients, then increasing on the next iteration, 
repeating until the transaction count stops growing. So this was over two 
hours before it finally maxed out.) I guess this is a pretty good setting for 
heavy scalability even though it didn't quite reach 40K/sec.
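
The ramp-up procedure described above (a fixed-length run per client count, increasing the count until throughput stops growing) can be sketched roughly as follows. This is a minimal sketch, not the actual test harness: `measure` is a hypothetical callback that runs one iteration with the given number of clients and returns the observed throughput, and the start value, step size and upper bound are assumptions.

```python
from typing import Callable, Tuple


def ramp_until_plateau(
    measure: Callable[[int], float],
    start: int = 4,          # assumed starting client count
    step: int = 4,           # assumed increment per iteration
    max_clients: int = 1024  # assumed safety limit
) -> Tuple[int, float]:
    """Raise the client count until measured throughput stops growing.

    Returns the client count and throughput of the best iteration seen.
    """
    best_clients, best_rate = start, measure(start)
    clients = start + step
    while clients <= max_clients:
        rate = measure(clients)
        if rate <= best_rate:
            # Throughput did not grow on this iteration: treat it as the plateau.
            break
        best_clients, best_rate = clients, rate
        clients += step
    return best_clients, best_rate
```

Stopping at the first iteration that fails to improve is a simplification; real measurements are noisy, so a practical harness would probably tolerate a few flat or slightly lower iterations before giving up.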

During these tests I see that about 94% of one core is consumed by interrupt 
processing, with 2% idle time left. I guess this ~200K packets-per-second rate 
is pretty near the limit of what this system can handle on gigabit ethernet. 
I've seen this box hit as high as 43K auths/sec using 4 slapd processes with 3 
threads each, as opposed to a single process with 8 threads. In that test 100% 
of a core was doing interrupt processing.
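
As a back-of-the-envelope check, the two rates quoted here imply roughly five packets per authentication. The snippet below only divides the figures given in this message; whether the ~200K packets/sec counts one direction or both is not stated, so the result is an order-of-magnitude estimate and the function name is illustrative.

```python
def packets_per_auth(packets_per_sec: float, auths_per_sec: float) -> float:
    """Average number of network packets handled per authentication."""
    return packets_per_sec / auths_per_sec


if __name__ == "__main__":
    # ~200K packets/sec and the 39K auths/sec peak, both as quoted above.
    print(round(packets_per_auth(200_000, 39_000), 1))
```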

Anyway, thanks again for all your responses.
-- 
   -- Howard Chu
   Chief Architect, Symas Corp.  http://www.symas.com
   Director, Highland Sun        http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP     http://www.openldap.org/project/


* Re: [Fwd: performance with gcc -O0/-O2]
  2007-11-27 16:47   ` Andi Kleen
@ 2007-11-27 18:13     ` Andrew Haley
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Haley @ 2007-11-27 18:13 UTC
  To: Andi Kleen; +Cc: gcc, Howard Chu

Andi Kleen writes:
 > Andrew Haley <aph@redhat.com> writes:
 > 
 > > Howard Chu writes:
 > >
 > >  > A bit of a minor mystery. Not a problem, just a curiosity. If
 > >  > someone knew off the top of their head a reason for it, that'd be
 > >  > cool, but otherwise no sweat.
 > >
 > > It's possible, although unlikely, that the optimized code has worse
 > > cache behaviour.  No way to know better without doing some profiling.
 > 
 > It's quite possible if he hits the conditional store "optimization"
 > (that actually adds unnecessary cache misses) that was recently discussed
 > in the load thread safety thread.

Possibly.  I'm guessing that what we are actually seeing is something
like an acutely timing-sensitive race condition, where making some
threads faster causes pessimal cache behaviour.  It's a really
interesting problem.  :-)

Andrew.

-- 
Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, UK
Registered in England and Wales No. 3798903


* Re: [Fwd: performance with gcc -O0/-O2]
       [not found] ` <18252.8713.124845.847654@zebedee.pink.suse.lists.egcs>
@ 2007-11-27 16:47   ` Andi Kleen
  2007-11-27 18:13     ` Andrew Haley
  0 siblings, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2007-11-27 16:47 UTC
  To: Andrew Haley; +Cc: gcc, Howard Chu

Andrew Haley <aph@redhat.com> writes:

> Howard Chu writes:
>
>  > A bit of a minor mystery. Not a problem, just a curiosity. If
>  > someone knew off the top of their head a reason for it, that'd be
>  > cool, but otherwise no sweat.
>
> It's possible, although unlikely, that the optimized code has worse
> cache behaviour.  No way to know better without doing some profiling.

It's quite possible if he hits the conditional store "optimization"
(that actually adds unnecessary cache misses) that was recently discussed
in the load thread safety thread.
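
To illustrate why such a transformation could add memory traffic, here is a toy model. It assumes the "conditional store optimization" refers to rewriting a guarded store, `if (cond) shared = v;`, into an unconditional load, select and store, `tmp = shared; if (cond) tmp = v; shared = tmp;`. The Python below is not compiler output; it only counts how often a location is written under each code shape, and every name in it is made up for the example.

```python
class CountingCell:
    """A memory location that counts how many times it is written."""

    def __init__(self, value=0):
        self.value = value
        self.stores = 0

    def store(self, value):
        self.value = value
        self.stores += 1


def conditional_store(cell, cond, new_value):
    # Source shape: write only when the condition holds.
    if cond:
        cell.store(new_value)


def unconditional_store(cell, cond, new_value):
    # Transformed shape: always write, selecting old or new value first.
    tmp = cell.value
    if cond:
        tmp = new_value
    cell.store(tmp)


def count_stores(update, conditions):
    """Apply `update` once per condition; return (store count, final value)."""
    cell = CountingCell()
    for i, cond in enumerate(conditions):
        update(cell, cond, i)
    return cell.stores, cell.value
```

Both shapes leave the same final value, but when the condition is rarely true the transformed shape writes on every pass. On a multi-core machine each write needs exclusive ownership of the cache line, so writes that the source never asked for may translate into extra coherence traffic between threads.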

-Andi


* Re: [Fwd: performance with gcc -O0/-O2]
@ 2007-11-27 15:51 J.C. Pizarro
  2007-11-27 20:45 ` Howard Chu
  0 siblings, 1 reply; 9+ messages in thread
From: J.C. Pizarro @ 2007-11-27 15:51 UTC
  To: Howard Chu, gcc

For your Opteron, try these options:

-O3 -fomit-frame-pointer -march=k8 -funroll-loops -finline-functions \
-fpeel-loops -mno-sse3 -msse2 -msse -mno-mmx -mno-3dnow

The Opteron hardware said that it's better to use SSE2 than SSE3.
The MMX and 3DNow!+ instructions are shorter and older than SSE2/SSE
instructions.

Sincerely, J.C.Pizarro

