public inbox for gcc-help@gcc.gnu.org
* Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it?
@ 2014-03-25  8:33 Xinrong Fu
  2014-03-25 10:10 ` Andrew Haley
  2014-03-25 10:12 ` David Brown
  0 siblings, 2 replies; 8+ messages in thread
From: Xinrong Fu @ 2014-03-25  8:33 UTC (permalink / raw)
  To: gcc-help

Hi guys:
   What does the number of stalled cycles in the CPU pipeline frontend
mean? Why does the 32-bit program have more stalled frontend cycles than
the 64-bit program when they run on the same 64-bit system?
Is there any gcc option to fix it?

linux-jjhr:/mnt/sda3/home/sean/suse_lab/test_32_64 # gcc -Wall test.c -o test64
linux-jjhr:/mnt/sda3/home/sean/suse_lab/test_32_64 # gcc -Wall test.c -o test32 -m32

linux-jjhr:/mnt/sda3/home/sean/suse_lab/test_32_64 # perf stat ./test64 1000000000

 Performance counter stats for './test64 1000000000':

      24650.018596 task-clock                #    0.999 CPUs utilized
             2,100 context-switches          #    0.000 M/sec
                 3 CPU-migrations            #    0.000 M/sec
               135 page-faults               #    0.000 M/sec
    71,966,342,812 cycles                    #    2.920 GHz                     [83.33%]
     6,369,556,234 stalled-cycles-frontend   #    8.85% frontend cycles idle    [83.33%]
     1,699,050,991 stalled-cycles-backend    #    2.36% backend  cycles idle    [66.67%]
   156,985,267,463 instructions              #    2.18  insns per cycle
                                             #    0.04  stalled cycles per insn [83.33%]
    35,472,160,125 branches                  # 1439.032 M/sec                   [83.33%]
         2,436,028 branch-misses             #    0.01% of all branches         [83.35%]

      24.674703793 seconds time elapsed

linux-jjhr:/mnt/sda3/home/sean/suse_lab/test_32_64 # perf stat ./test32 1000000000

 Performance counter stats for './test32 1000000000':

      54676.882729 task-clock                #    0.999 CPUs utilized
             4,657 context-switches          #    0.000 M/sec
                 7 CPU-migrations            #    0.000 M/sec
               116 page-faults               #    0.000 M/sec
   159,670,693,964 cycles                    #    2.920 GHz                     [83.33%]
    71,123,035,082 stalled-cycles-frontend   #   44.54% frontend cycles idle    [83.34%]
     7,119,090,236 stalled-cycles-backend    #    4.46% backend  cycles idle    [66.66%]
   204,576,003,586 instructions              #    1.28  insns per cycle
                                             #    0.35  stalled cycles per insn [83.33%]
    39,748,525,691 branches                  #  726.971 M/sec                   [83.34%]
         4,300,876 branch-misses             #    0.01% of all branches         [83.33%]

      54.731504570 seconds time elapsed

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it?
  2014-03-25  8:33 Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it? Xinrong Fu
@ 2014-03-25 10:10 ` Andrew Haley
  2014-03-25 10:12 ` David Brown
  1 sibling, 0 replies; 8+ messages in thread
From: Andrew Haley @ 2014-03-25 10:10 UTC (permalink / raw)
  To: Xinrong Fu, gcc-help

On 03/25/2014 03:31 AM, Xinrong Fu wrote:
>    What does the number of stalled cycles in the CPU pipeline frontend
> mean? Why does the 32-bit program have more stalled frontend cycles than
> the 64-bit program when they run on the same 64-bit system?
> Is there any gcc option to fix it?

What is the test case?

Andrew.


* Re: Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it?
  2014-03-25  8:33 Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it? Xinrong Fu
  2014-03-25 10:10 ` Andrew Haley
@ 2014-03-25 10:12 ` David Brown
  2014-03-25 15:51   ` Jonathan Wakely
  2014-03-25 19:38   ` Vincent Diepeveen
  1 sibling, 2 replies; 8+ messages in thread
From: David Brown @ 2014-03-25 10:12 UTC (permalink / raw)
  To: Xinrong Fu, gcc-help

On 25/03/14 04:31, Xinrong Fu wrote:
> Hi guys:
>    What does the number of stalled cycles in the CPU pipeline frontend
> mean? Why does the 32-bit program have more stalled frontend cycles than
> the 64-bit program when they run on the same 64-bit system?
> Is there any gcc option to fix it?
> 

Are you asking why the same program runs faster when compiled as 64-bit
rather than 32-bit?  There are /many/ reasons why 64-bit x86 code can be
faster than 32-bit x86 code - without having any idea about your code,
we can only make general points.  In comparison to 32-bit x86, the
64-bit mode has access to more registers, has wider registers (which
speeds data movement), less complicated instruction decoding and
instruction prefixes, more efficient floating point, and much more
efficient calling conventions.  It has the disadvantage that pointers
take up twice as much data cache and memory bandwidth, as they are twice
the size.

As for gcc options to "fix" it, there is no problem to fix - it is
normal that 64-bit code is a bit more efficient than 32-bit code from
the same program, but details vary according to the code in question.

One thing I notice from your post is that you are compiling without
enabling optimisation, which cripples the compiler's performance.
Enabling "-O2" will probably make your code several times faster (again,
without information on the program, I can only make general statements).
Different optimisation settings like "-Os", "-O3", and individual
optimisation flags may or may not make the code faster, but "-O2" is a
good start.




* Re: Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it?
  2014-03-25 10:12 ` David Brown
@ 2014-03-25 15:51   ` Jonathan Wakely
  2014-03-25 19:38   ` Vincent Diepeveen
  1 sibling, 0 replies; 8+ messages in thread
From: Jonathan Wakely @ 2014-03-25 15:51 UTC (permalink / raw)
  To: Xinrong Fu; +Cc: gcc-help

On 25 March 2014 10:10, David Brown wrote:
>
> One thing I notice from your post is that you are compiling without
> enabling optimisation, which cripples the compiler's performance.

Performance comparisons on unoptimised code are worthless.


* Re: Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it?
  2014-03-25 10:12 ` David Brown
  2014-03-25 15:51   ` Jonathan Wakely
@ 2014-03-25 19:38   ` Vincent Diepeveen
  2014-03-26 14:07     ` Florian Weimer
  1 sibling, 1 reply; 8+ messages in thread
From: Vincent Diepeveen @ 2014-03-25 19:38 UTC (permalink / raw)
  To: David Brown; +Cc: Xinrong Fu, gcc-help



On Tue, 25 Mar 2014, David Brown wrote:

> On 25/03/14 04:31, Xinrong Fu wrote:
>> Hi guys:
>>    What does the number of stalled cycles in the CPU pipeline frontend
>> mean? Why does the 32-bit program have more stalled frontend cycles than
>> the 64-bit program when they run on the same 64-bit system?
>> Is there any gcc option to fix it?

If the question is why a 32-bit program compiled as 64-bit is a lot
slower: there can be several reasons, but let's name a few.

a) for example, if you use signed 32-bit indexing:

int i, array[64];

i = ...;
x = array[i];

this is very fast on a 32-bit processor in 32-bit mode, yet a lot slower
in 64-bit mode, as i needs a sign extension to 64 bits.
So the compiler generates one additional instruction in 64-bit mode
to sign-extend i from 32 bits to 64 bits.

b) some processors can 'issue' more 32-bit instructions per clock than
64-bit instructions. This can have many reasons; for example, the
processor can only decode a limited number of bytes per clock, and as
32-bit instructions occupy less space, it might decode four 32-bit
instructions but just three 64-bit ones. Please note: I am not counting
vector instructions here, just treating each instruction as a unit,
regardless of how wide the register is that it operates upon.

Agner Fog has more exact measurements of how few bytes modern
processors can actually decode per clock.

My chess program Diep, which is deterministic integer code (so no
vector code), is about 10%-12% slower compiled as 64-bit than as
32-bit. This even though it does use a few 64-bit datatypes (very few,
though). In 64-bit mode the data size doesn't grow, but
instruction-wise it grows immensely, of course.

Besides the above reasons, another reason why 32-bit programs compiled
as 64-bit can be a lot slower, in the case of Diep, is:

c) the larger code size causes more L1 instruction cache misses.

And that is a major problem, especially as those L1i caches are already
so tiny on modern processors.

d) gcc is totally horrible at optimizing branches. Where a compiler
like Intel C++ easily gets 20%-25% performance out of PGO
(profile-guided optimization), gcc gets total peanuts out of the PGO
phase for my chess program: 3% or so.

This all has to do with how it deals with branches and the horrible
optimizations that trigger.

Now from these horrors you could either benefit by going to 64-bit, as
that code path is no longer used then, or get an additional penalty
from moving to 64-bit. The latter happens, for example, on an
older-generation AMD processor when the jump suddenly falls outside
what the processor sees in its lookahead, causing a huge penalty for a
branch mispredict.

It largely depends on the processor you have; especially older AMD
processors suffer there.

So moving to 64-bit could occasionally speed you up, even without using
any optimization at all, just because some of the old FUBAR code
generated for 32-bit no longer gets triggered.

> Are you asking why the same program runs faster when compiled as 64-bit
> rather than 32-bit?  There are /many/ reasons why 64-bit x86 code can be
> faster than 32-bit x86 code - without having any idea about your code,
> we can only make general points.  In comparison to 32-bit x86, the
> 64-bit mode has access to more registers,

Usually processors are optimized to use just a few registers, and they
already use all sorts of tricks internally (where additional registers
get used) to make up for it, so the additional registers are hardly an
advantage of any kind in 64-bit mode, not even in algorithmic code.

In tests performed with assembler code that uses more registers, the
processors actually slow down. So there is a performance benefit in
reusing the same few registers over and over again.

This penalty for using more registers is not only there on x64; it was
already the case on x86 processors. In fact, it is easy to measure on
the Pentium from two decades ago.

Hopefully that will change a tad in the future - yet I consider that
unlikely, as it would also involve changes in the Intel C++ compiler.

>has wider registers (which

Exactly:

If you use 64-bit datatypes like "long long", then obviously 64-bit
mode is a huge advantage over 32-bit. This can easily give a factor-2
speed improvement for integer code that is genuinely 64-bit.

> speeds data movement), less complicated instruction decoding and
> instruction prefixes, more efficient floating point, and much more
> efficient calling conventions.  It has the disadvantage that pointers
> take up twice as much data cache and memory bandwidth, as they are twice
> the size.

From a distance you're totally correct here that caches are the
problem. To zoom in: the larger pointer is more of a problem for the
instruction side of the cache.

In itself the larger pointer doesn't mean that the size the data
occupies in the data cache grows.

Yet the compiler in 64-bit mode needs more instructions to get at
32-bit data, and such 64-bit pointer instructions are simply larger,
putting more stress on instruction decoding and transport, whereas we
already know it can only decode a handful of bytes per clock.

Now for a lot of programs this isn't a big problem, as another bitwise
AND is a very fast instruction, yet for my software, which is pretty
optimized, I do feel the additional instructions - not least because
they make the already super-tiny L1 instruction cache overflow even
more, and because the IPC is already very high :)

> As for gcc options to "fix" it, there is no problem to fix - it is
> normal that 64-bit code is a bit more efficient than 32-bit code from
> the same program, but details vary according to the code in question.

> One thing I notice from your post is that you are compiling without
> enabling optimisation, which cripples the compiler's performance.
> Enabling "-O2" will probably make your code several times faster (again,
> without information on the program, I can only make general statements).
> Different optimisation settings like "-Os", "-O3", and individual
> optimisation flags may or may not make the code faster, but "-O2" is a
> good start.

A good tip with GCC is to never go further than -O2.

Going further is at your own risk :)

In the past 20 years or so, gcc has actually never generated faster
code for my chess software with -O3; usually it causes problems and
slows things down.

Kind Regards,
Vincent


* Re: Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it?
  2014-03-25 19:38   ` Vincent Diepeveen
@ 2014-03-26 14:07     ` Florian Weimer
  2014-03-26 14:47       ` Vincent Diepeveen
  0 siblings, 1 reply; 8+ messages in thread
From: Florian Weimer @ 2014-03-26 14:07 UTC (permalink / raw)
  To: Vincent Diepeveen, David Brown; +Cc: Xinrong Fu, gcc-help

On 03/25/2014 04:51 PM, Vincent Diepeveen wrote:

> a) for example if you use signed 32 bits indexation, for example
>
> int i, array[64];
>
> i = ...;
> x = array[i];
>
> this goes very fast in 32 bits processor and 32 bits mode yet a lot
> slower in 64 bits mode, as i needs a sign extension to 64 bits.
> So the compiler generates 1 additional instruction in 64 bits mode
> to sign extend i from 32 bits to 64 bits.

Is this relevant in practice?  I'm asking because it's a missed 
optimization opportunity—negative subscripts lead to undefined behavior 
here, so the sign extension can be omitted.

> b) some processors can 'issue' more 32 bits instructions a clock than 64
> bits instructions.

Some earlier processors also support more µop optimization in 32 bit mode.

> My chessprogram Diep which is deterministic integer code (so no vector
> codes) compiled 32 bits versus 64 bits is about 10%-12% slower in 64
> bits than in 32 bits. This where it does use a few 64 bits datatypes
> (very little though). In 64 bits the datasize used doesn't grow,
> instruction wise it grows immense of course.

Well, chess programs used to be the prototypical example for 64 bit 
architectures ...

> Besides the above reasons another reason why 32 bits programs compiled
> 64 bits can be a lot slower in case of Diep is:
>
> c) the larger code size causes more L1 instruction cache misses.

This really depends on the code.  Not everything is larger.  Typically 
it's the increased pointer size that causes increased data cache 
misses, which then cause slowdowns.

-- 
Florian Weimer / Red Hat Product Security Team


* Re: Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it?
  2014-03-26 14:07     ` Florian Weimer
@ 2014-03-26 14:47       ` Vincent Diepeveen
  2014-03-27 18:02         ` Xinrong Fu
  0 siblings, 1 reply; 8+ messages in thread
From: Vincent Diepeveen @ 2014-03-26 14:47 UTC (permalink / raw)
  To: Florian Weimer; +Cc: David Brown, Xinrong Fu, gcc-help

[-- Attachment #1: Type: TEXT/PLAIN, Size: 6560 bytes --]



On Wed, 26 Mar 2014, Florian Weimer wrote:

> On 03/25/2014 04:51 PM, Vincent Diepeveen wrote:
>
>> a) for example if you use signed 32 bits indexation, for example
>> 
>> int i, array[64];
>> 
>> i = ...;
>> x = array[i];
>> 
>> this goes very fast in 32 bits processor and 32 bits mode yet a lot
>> slower in 64 bits mode, as i needs a sign extension to 64 bits.
>> So the compiler generates 1 additional instruction in 64 bits mode
>> to sign extend i from 32 bits to 64 bits.
>
> Is this relevant in practice?  I'm asking because it's a missed optimization 
> opportunity—negative subscripts lead to undefined behavior here, so the sign 
> extension can be omitted.

Yes, this is very relevant of course, as it is an instruction.
It all adds up, you know. Now I don't know whether some modern
processors can secretly fuse this internally - as about 99.9% of all C
and C++ source code in existence just uses 'int', of course.

In the C specification, 'int' is in fact meant to be the fastest
datatype.

Well, on x64 it is not. It's a lot slower if you use it to index - a
factor of 2 slower, to be precise, as it generates another instruction.

If I write normal code, I simply use "int" and standardize on that.

Writing for speed has not been made easier, because "int" is still a
32-bit datatype whereas we have 64-bit processors nowadays.

The problem would be solved if 'sizeof(int)' were suddenly 8 bytes, of
course.

That would mean big refactoring of lots of code, though, yet one day we
will need to go through that process :)

I seem to remember that back in the day, sizeof(long) on the DEC Alpha
was already 8 bytes.

Now I'm not suggesting, not even hinting, that this would be a wise
change.

>> b) some processors can 'issue' more 32 bits instructions a clock than 64
>> bits instructions.
>
> Some earlier processors also support more µop optimization in 32 bit mode.

I'm not a big expert on how the decode and transport phases of
processors work nowadays - it has all become so very complex.

Yet the decoding and delivery of instructions is the bottleneck on
today's processors. They all have plenty of execution units.

They just cannot decode and deliver enough bytes per clock.

>> My chessprogram Diep which is deterministic integer code (so no vector
>> codes) compiled 32 bits versus 64 bits is about 10%-12% slower in 64
>> bits than in 32 bits. This where it does use a few 64 bits datatypes
>> (very little though). In 64 bits the datasize used doesn't grow,
>> instruction wise it grows immense of course.
>
> Well, chess programs used to be the prototypical example for 64 bit 
> architectures ...

Only when a bunch of CIA-related organisations got involved in funding
a bunch of programs - as it's easier to copy source code if you write
it for a sneaky organisation anyway.

The top chess engines are, from their origin, all 32-bit based, as they
can execute 32-bit instructions faster, of course, and most mobile
phones are still 32-bit anyway.

You cannot just cut and paste source code from others and get away with
it in a commercial setting.

Commercially it's too expensive to cut and paste other people's work
because of all the court cases - and you bet they will be there. Just
when governments got involved for the first time in history, I saw a
bunch of guys work together who would otherwise stick out each other's
eyes at any given occasion :)

I made another chess program here a while ago which gets nearly 10
million nps single-core. No 64-bit engine will ever manage that :)

Those extra instructions you can execute are deadly. And we're NOT
speaking about vector instructions here - just integers.

The reason why 64-bit is interesting is not that it is any faster - it
is not. It's slower in terms of executing instructions.

Yet algorithmically you can use a huge hash table shared by all cores,
and that speeds you up big time.

More than a decade ago I was happy to use 200 GB for that on the SGI
supercomputer. It really helps... ...not as much as some would guess,
yet a factor of 2 really is a lot :)

>> Besides the above reasons another reason why 32 bits programs compiled
>> 64 bits can be a lot slower in case of Diep is:
>> 
>> c) the larger code size causes more L1 instruction cache misses.
>
> This really depends on the code.  Not everything is larger.  Typically it's 
> the increased pointer size that cause increased data cache misses, which then 
> casues slowdowns.

Really a lot changes in 64-bit mode, of course, as the above chess
software is mainly busy with array lookups and branches in between
them.

You need those lookups everywhere. Arrays are really important - not
only because you want to look something up, but also because they avoid
writing out another bunch of lines of code to achieve the same :)

Also, the index into the array needs to be 64 bits, of course. Which
means that in the end every value gets converted to 64 bits in 64-bit
mode, which makes sense.

Now I'm sure you treat all array lookups as lookups through a pointer,
so we're on the same page then :)

Please also note that lots of branches in chess programs suddenly tend
to get slower too. Some might in fact go from, say, around a 5-clock
penalty to a 30-clock penalty, because the distance in bytes between
the conditional jump and the spot it might jump to is larger.

That you really feel big time.

GCC has always been world champion at rewriting branches into something
that is slower up front than the straightforward manner - and even the
PGO phase couldn't improve upon that. It slowed things down most
especially on AMD.

I seem to remember a discussion between a GCC guy and Linus, where
Linus said there was no excuse not to generate CMOVs now and then on
modern processors like Core 2 and Opteron - where the GCC team member
(a Polish name I didn't recognize) argued that crippling GCC was needed
because he owned a P4 :)

That was not long after I posted some similar code in forums showing
how FUBAR gcc was with branches - yet "by accident" that got a 25-30
clock penalty on AMD and not on Intel.

That piece of code fares better nowadays.

Where GCC needs major improvements right now is in the PGO phase.
The difference is just abnormal: something like a 3% speedup using PGO
in GCC versus 20-25% with other compilers, among them Intel C++.

I do not know what causes it - yet there should be tons of source code
available that exhibits the same problem.







* Re: Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it?
  2014-03-26 14:47       ` Vincent Diepeveen
@ 2014-03-27 18:02         ` Xinrong Fu
  0 siblings, 0 replies; 8+ messages in thread
From: Xinrong Fu @ 2014-03-27 18:02 UTC (permalink / raw)
  To: Vincent Diepeveen; +Cc: Florian Weimer, David Brown, gcc-help

[-- Attachment #1: Type: text/plain, Size: 7138 bytes --]

Hi Guys:
     Thanks for your reply. I am sorry I missed the test case.
The attachment is the test case.

Best Regards



[-- Attachment #2: test.c --]
[-- Type: text/x-csrc, Size: 681 bytes --]

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
        int count;
        int j = 0;
        FILE *fp;
        char msg[] = {'x','y','z'};
        char buf[20];

        if (argc < 2)
        {
            printf("Usage: %s <count>\n", argv[0]);
            exit(1);
        }
        count = atoi(argv[1]);

        if((fp=fopen("/dev/zero","wb+"))==NULL)
        {
            printf("Cannot open file, exiting!\n");
            exit(1);
        }

        fwrite(msg,sizeof(msg),1,fp);

        /* msg is not NUL-terminated, so use sizeof(msg) rather than
         * strlen(msg), which would read past the end of the array. */
        while( (count > 0 && j < count) || (count==0) )
        {
            fread(buf,sizeof(msg),1,fp);
            j++;
            if(j > count)
                break;
        }

        fclose(fp); 

        return 0;
}



end of thread, other threads:[~2014-03-27 13:59 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-25  8:33 Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it? Xinrong Fu
2014-03-25 10:10 ` Andrew Haley
2014-03-25 10:12 ` David Brown
2014-03-25 15:51   ` Jonathan Wakely
2014-03-25 19:38   ` Vincent Diepeveen
2014-03-26 14:07     ` Florian Weimer
2014-03-26 14:47       ` Vincent Diepeveen
2014-03-27 18:02         ` Xinrong Fu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).