gcc 3.4.3: -march optimization for Intel Core2Duo

public inbox for gcc-help@gcc.gnu.org

* gcc 3.4.3:  -march optimization for Intel Core2Duo
@ 2006-10-05 13:49 Jan Dillmann
  2006-10-05 15:50 ` Ian Lance Taylor
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Dillmann @ 2006-10-05 13:49 UTC
  To: gcc-help

Hi,

we are running several benchmarks (SpecCPU200...) on 32-bit Linux systems and are able to set an optimization parameter
for '-march'. We use Intel Core 2 Duo CPUs. Which parameter should we use (nocona, prescott...)?

regards
Jan


* Re: gcc 3.4.3:  -march optimization for Intel Core2Duo
  2006-10-05 13:49 gcc 3.4.3: -march optimization for Intel Core2Duo Jan Dillmann
@ 2006-10-05 15:50 ` Ian Lance Taylor
  2006-10-05 16:08   ` Vladimir Makarov
  0 siblings, 1 reply; 5+ messages in thread
From: Ian Lance Taylor @ 2006-10-05 15:50 UTC
  To: Jan Dillmann; +Cc: gcc-help

"Jan Dillmann" <jan.dillmann@tooltec.de> writes:

> we are running several benchmarks (SpecCPU200...) on 32-bit Linux systems and are able to set an optimization parameter
> for '-march'. We use Intel Core 2 Duo CPUs. Which parameter should we use (nocona, prescott...)?

gcc 3.4.3 has no specific tuning for Core2 Duo, if for no other reason
than the release was made before the processors became available.  My
guess would be that you will get the best results with -mtune=nocona.
But it is only a guess.

Ian


* Re: gcc 3.4.3:  -march optimization for Intel Core2Duo
  2006-10-05 15:50 ` Ian Lance Taylor
@ 2006-10-05 16:08   ` Vladimir Makarov
  0 siblings, 0 replies; 5+ messages in thread
From: Vladimir Makarov @ 2006-10-05 16:08 UTC
  To: Ian Lance Taylor; +Cc: Jan Dillmann, gcc-help

Ian Lance Taylor wrote:

>gcc 3.4.3 has no specific tuning for Core2 Duo, if for no other reason
>than the release was made before the processors became available.  My
>guess would be that you will get the best results with -mtune=nocona.
>But it is only a guess.

I believe that pentium-m will work better.  Nocona (an x86_64 processor)
is based on the northwood/prescott core, which is a high-frequency core with
long pipelines.  Core2 Duo is closer to pentium M (a lower-frequency core
with much shorter pipelines).  However, using pentium-m will result in
bigger code than nocona because loop/function alignment
will be forced (the northwood core is not so sensitive to alignment, therefore
alignment is not done when -mtune=nocona is used).  I don't remember
Intel's recommendation about aligning code for Core Duo (probably it is the
same as for pentium M).


* Re: Re: gcc 3.4.3:  -march optimization for Intel Core2Duo
@ 2006-10-05 16:26 Timothy C Prince
  2006-10-05 16:41 ` Vladimir Makarov
  2006-10-07  6:36 ` Ryan Hill
  0 siblings, 2 replies; 5+ messages in thread
From: Timothy C Prince @ 2006-10-05 16:26 UTC
  To: vmakarov; +Cc: iant, jan.dillmann, gcc-help



FWIW, pentium-m is optimized by using 387 code for nearly everything except (int) casts. This is because of the Banias SSE decoder bottleneck. If you use -march=pentium-m, you would add -mfpmath=sse to try to get code closer to optimal for any CPU other than Banias/Dothan.
The OP's question was about Core 2 Duo, a more advanced (64-bit capable) CPU than Core Duo.
Tim Prince
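
For anyone who wants to try the suggestions made so far side by side, here is a minimal sketch that only builds the compiler command lines; it does not run them. The variant names, the `bench.c` file name, the base flags and the `build_command` helper are assumptions made for illustration. Only the -march/-mtune/-mfpmath values come from this thread (GCC spells the FP option -mfpmath=sse).

```python
# Candidate flag sets taken from the suggestions in this thread.
# Variant names, BASE_FLAGS and build_command() are illustrative assumptions,
# not something GCC or SPEC provides.
BASE_FLAGS = ["-O2"]

CANDIDATES = {
    "tune-nocona": ["-mtune=nocona"],
    "march-prescott": ["-march=prescott"],
    "march-pentium-m": ["-march=pentium-m"],
    "march-pentium-m-sse": ["-march=pentium-m", "-mfpmath=sse"],
}


def build_command(compiler, source, name, extra_flags):
    """Return the argv list that compiles `source` into a binary named after the variant."""
    stem = source.rsplit(".", 1)[0]
    return [compiler, *BASE_FLAGS, *extra_flags, source, "-o", f"{stem}-{name}"]


if __name__ == "__main__":
    # Print one command line per variant so they can be reviewed before running.
    for name, flags in CANDIDATES.items():
        print(" ".join(build_command("gcc", "bench.c", name, flags)))
```

Each printed line can then be run by hand (or passed to `subprocess.run`) and the resulting binaries timed on the same machine, since all of the advice above is explicitly a guess until measured.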


* Re: gcc 3.4.3:  -march optimization for Intel Core2Duo
  2006-10-05 16:26 Timothy C Prince
@ 2006-10-05 16:41 ` Vladimir Makarov
  2006-10-07  6:36 ` Ryan Hill
  1 sibling, 0 replies; 5+ messages in thread
From: Vladimir Makarov @ 2006-10-05 16:41 UTC
  To: Timothy C Prince; +Cc: iant, jan.dillmann, gcc-help

Timothy C Prince wrote:

>FWIW, pentium-m is optimized by using 387 code for nearly everything except (int) casts. This is because of the Banias SSE decoder bottleneck. If you use -march=pentium-m, you would add -mfpmath=sse to attempt to get code more optimum for any CPU other than Banias/Dothan.
>OP question was about Core 2 Duo, a more advanced (64-bit capable) CPU than Core Duo.

Sorry, that was a typo.  When I wrote this I really meant Core 2 Duo
(not Core Duo) in a 32-bit environment.  Pentium-m tuning is not ideal
(not only for SSE but also, for example, for the maximal number of insns issued per cycle).
To be more accurate, I believe it will work better for SPEC2000.  Some
tuning for Core 2 Duo may be found on yara-branch (it is called
woodcrest there).



* Re: gcc 3.4.3:  -march optimization for Intel Core2Duo
  2006-10-05 16:26 Timothy C Prince
  2006-10-05 16:41 ` Vladimir Makarov
@ 2006-10-07  6:36 ` Ryan Hill
  1 sibling, 0 replies; 5+ messages in thread
From: Ryan Hill @ 2006-10-07  6:36 UTC
  To: gcc-help



I did some very unscientific and limited benchmarking of GCC trunk
performance in tramp3d.  Any analysis or suggestions, and other
benchmark numbers would be greatly appreciated if you can supply them.
I plan to do more when I have some time.

Also, this is on a Core Duo rather than a Core 2 Duo.  Does that make a
significant difference?


----------
(http://forums.gentoo.org/viewtopic-p-3602555.html#3601332)

ok, i did one simple c++ benchmark using TraMP3d-v4. keep in mind it's
just one benchmark.

the system used was a Toshiba Satellite A100 laptop with a Core Duo
T2300 @ 1.66GHz (Yonah), 2MiB shared L2 cache, and 1GiB of memory. the
GCC version used was 4.1-branch svn built yesterday.

[-O2 -march=prescott -fomit-frame-pointer -pipe]

dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2
-march=prescott -fomit-frame-pointer -pipe -Dleafify=flatten
tramp3d-v4.cpp  -o tramp3d-v4-prescott
95.45user 0.84system 1:35.69elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (0major+202080minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-prescott -n 25 --cartvis 1.0 0.0
--rhomin 1e-8
Using
  using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
  solving eeq
  time increments from [0, 1.79769e+308], cfl 0.5
  starting at t = 0, i = 1
  cell physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  face  physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  periodic boundaries in X Y Z
i = 1    t = 0.00209225  dt = 0.00209225 (0.07124s/it)
i = 2    t = 0.00410537  dt = 0.00201312 (0.946142s/it)
i = 3    t = 0.00603889  dt = 0.00193352 (0.966466s/it)
i = 4    t = 0.00794139  dt = 0.00190251 (0.975241s/it)
i = 5    t = 0.00984636  dt = 0.00190497 (0.97465s/it)
i = 6    t = 0.0117508   dt = 0.00190449 (0.985882s/it)
i = 7    t = 0.013681    dt = 0.00193011 (1.0047s/it)
i = 8    t = 0.0156598   dt = 0.0019788 (1.00467s/it)
i = 9    t = 0.0176706   dt = 0.00201081 (1.00171s/it)
i = 10   t = 0.0197364   dt = 0.0020658 (1.0184s/it)
i = 11   t = 0.0218716   dt = 0.0021352 (1.01445s/it)
i = 12   t = 0.0240721   dt = 0.00220057 (1.00954s/it)
i = 13   t = 0.0263471   dt = 0.002275 (1.01139s/it)
i = 14   t = 0.0287159   dt = 0.00236875 (1.01714s/it)
i = 15   t = 0.0311533   dt = 0.00243738 (1.01269s/it)
i = 16   t = 0.0336768   dt = 0.0025235 (1.01118s/it)
i = 17   t = 0.0362863   dt = 0.00260952 (1.00748s/it)
i = 18   t = 0.0389715   dt = 0.00268521 (1.00433s/it)
i = 19   t = 0.0417381   dt = 0.00276665 (1.00053s/it)
i = 20   t = 0.0445873   dt = 0.00284919 (1.00177s/it)
i = 21   t = 0.0475216   dt = 0.0029343 (0.989871s/it)
i = 22   t = 0.0505258   dt = 0.00300413 (0.997915s/it)
i = 23   t = 0.0535938   dt = 0.00306807 (0.98717s/it)
i = 24   t = 0.0567043   dt = 0.0031105 (0.989589s/it)
i = 25   t = 0.0598233   dt = 0.00311892 (0.987146s/it)
Time spent in iteration: 23.9913
Correctness:
        sum(rh) difference = 1.45519e-11
        sum(vx) = -0.242582
        sum(vy) = -0.295116
        sum(vz) = -0.335474
        sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-prescott
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-prescott, please wait...
i486:    0 i586:    0 ppro:  130 mmx:    0 sse:    0 sse2:    0 sse3:    2
tramp3d-v4-prescott will run on Pentium IV (pentium4) w/ SSE3 or higher
processor.


[-O2 -march=pentium-m -fomit-frame-pointer -pipe]

dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2
-march=pentium-m -fomit-frame-pointer -pipe -Dleafify=flatten
tramp3d-v4.cpp  -o tramp3d-v4-pentiumm-plain
97.74user 0.74system 1:38.47elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (11major+200253minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm-plain -n 25 --cartvis 1.0
0.0 --rhomin 1e-8
Using
  using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
  solving eeq
  time increments from [0, 1.79769e+308], cfl 0.5
  starting at t = 0, i = 1
  cell physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  face  physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  periodic boundaries in X Y Z
i = 1    t = 0.00209225  dt = 0.00209225 (0.0692961s/it)
i = 2    t = 0.00410537  dt = 0.00201312 (0.992859s/it)
i = 3    t = 0.00603889  dt = 0.00193352 (1.0033s/it)
i = 4    t = 0.00794139  dt = 0.00190251 (0.975363s/it)
i = 5    t = 0.00984636  dt = 0.00190497 (0.98926s/it)
i = 6    t = 0.0117508   dt = 0.00190449 (0.986304s/it)
i = 7    t = 0.013681    dt = 0.00193011 (0.997433s/it)
i = 8    t = 0.0156598   dt = 0.0019788 (0.99804s/it)
i = 9    t = 0.0176706   dt = 0.00201081 (1.00585s/it)
i = 10   t = 0.0197364   dt = 0.0020658 (1.00463s/it)
i = 11   t = 0.0218716   dt = 0.0021352 (1.01035s/it)
i = 12   t = 0.0240721   dt = 0.00220057 (1.00643s/it)
i = 13   t = 0.0263471   dt = 0.002275 (1.00908s/it)
i = 14   t = 0.0287159   dt = 0.00236875 (1.00359s/it)
i = 15   t = 0.0311533   dt = 0.00243738 (1.00683s/it)
i = 16   t = 0.0336768   dt = 0.0025235 (1.0018s/it)
i = 17   t = 0.0362863   dt = 0.00260952 (1.00395s/it)
i = 18   t = 0.0389715   dt = 0.00268521 (0.994894s/it)
i = 19   t = 0.0417381   dt = 0.00276665 (0.995252s/it)
i = 20   t = 0.0445873   dt = 0.00284919 (0.992024s/it)
i = 21   t = 0.0475216   dt = 0.0029343 (0.989914s/it)
i = 22   t = 0.0505258   dt = 0.00300413 (0.984155s/it)
i = 23   t = 0.0535938   dt = 0.00306807 (0.986609s/it)
i = 24   t = 0.0567043   dt = 0.0031105 (0.981239s/it)
i = 25   t = 0.0598233   dt = 0.00311892 (0.986686s/it)
Time spent in iteration: 23.9751
Correctness:
        sum(rh) difference = 1.45519e-11
        sum(vx) = -0.242582
        sum(vy) = -0.295116
        sum(vz) = -0.335474
        sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm-plain
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-pentiumm-plain, please wait...
i486:    0 i586:    0 ppro:  135 mmx:    0 sse:    0 sse2:    4 sse3:    0
tramp3d-v4-pentiumm-plain will run on Pentium IV (pentium4) or higher
processor.


[-O2 -march=pentium-m -msse3 -fomit-frame-pointer -pipe]

dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2
-march=pentium-m -msse3 -fomit-frame-pointer -pipe -Dleafify=flatten
tramp3d-v4.cpp  -o tramp3d-v4-pentiumm
97.73user 1.01system 1:38.05elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (0major+197280minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm -n 25 --cartvis 1.0 0.0
--rhomin 1e-8
Using
  using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
  solving eeq
  time increments from [0, 1.79769e+308], cfl 0.5
  starting at t = 0, i = 1
  cell physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  face  physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  periodic boundaries in X Y Z
i = 1    t = 0.00209225  dt = 0.00209225 (0.069342s/it)
i = 2    t = 0.00410537  dt = 0.00201312 (0.968165s/it)
i = 3    t = 0.00603889  dt = 0.00193352 (0.985737s/it)
i = 4    t = 0.00794139  dt = 0.00190251 (0.999364s/it)
i = 5    t = 0.00984636  dt = 0.00190497 (1.01105s/it)
i = 6    t = 0.0117508   dt = 0.00190449 (1.01161s/it)
i = 7    t = 0.013681    dt = 0.00193011 (1.02449s/it)
i = 8    t = 0.0156598   dt = 0.0019788 (1.02412s/it)
i = 9    t = 0.0176706   dt = 0.00201081 (1.02851s/it)
i = 10   t = 0.0197364   dt = 0.0020658 (1.02592s/it)
i = 11   t = 0.0218716   dt = 0.0021352 (1.03424s/it)
i = 12   t = 0.0240721   dt = 0.00220057 (1.0353s/it)
i = 13   t = 0.0263471   dt = 0.002275 (1.03373s/it)
i = 14   t = 0.0287159   dt = 0.00236875 (1.03266s/it)
i = 15   t = 0.0311533   dt = 0.00243738 (1.03526s/it)
i = 16   t = 0.0336768   dt = 0.0025235 (1.02011s/it)
i = 17   t = 0.0362863   dt = 0.00260952 (1.0232s/it)
i = 18   t = 0.0389715   dt = 0.00268521 (1.02476s/it)
i = 19   t = 0.0417381   dt = 0.00276665 (1.0153s/it)
i = 20   t = 0.0445873   dt = 0.00284919 (1.00431s/it)
i = 21   t = 0.0475216   dt = 0.0029343 (1.00313s/it)
i = 22   t = 0.0505258   dt = 0.00300413 (0.989761s/it)
i = 23   t = 0.0535938   dt = 0.00306807 (0.99909s/it)
i = 24   t = 0.0567043   dt = 0.0031105 (0.989536s/it)
i = 25   t = 0.0598233   dt = 0.00311892 (0.996134s/it)
Time spent in iteration: 24.3848
Correctness:
        sum(rh) difference = 1.45519e-11
        sum(vx) = -0.242582
        sum(vy) = -0.295116
        sum(vz) = -0.335474
        sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-pentiumm, please wait...
i486:    0 i586:    0 ppro:  135 mmx:    0 sse:    0 sse2:    0 sse3:    2
tramp3d-v4-pentiumm will run on Pentium IV (pentium4) w/ SSE3 or higher
processor.


[-O2 -march=pentium-m -msse3 -mfpmath=sse -fomit-frame-pointer -pipe]

dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2
-march=pentium-m -msse3 -mfpmath=sse -fomit-frame-pointer -pipe
-Dleafify=flatten tramp3d-v4.cpp  -o tramp3d-v4-pentiumm-sse
98.40user 0.94system 1:39.15elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (3major+198438minor)pagefaults 0swaps

dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm-sse -n 25 --cartvis 1.0
0.0 --rhomin 1e-8
Using
  using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
  solving eeq
  time increments from [0, 1.79769e+308], cfl 0.5
  starting at t = 0, i = 1
  cell physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  face  physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
  periodic boundaries in X Y Z
i = 1    t = 0.00209225  dt = 0.00209225 (0.0617449s/it)
i = 2    t = 0.00410537  dt = 0.00201312 (0.897831s/it)
i = 3    t = 0.00603889  dt = 0.00193352 (0.964484s/it)
i = 4    t = 0.00794139  dt = 0.00190251 (0.94189s/it)
i = 5    t = 0.00984636  dt = 0.00190497 (0.972172s/it)
i = 6    t = 0.0117508   dt = 0.00190449 (0.973818s/it)
i = 7    t = 0.013681    dt = 0.00193011 (0.984364s/it)
i = 8    t = 0.0156598   dt = 0.0019788 (0.988743s/it)
i = 9    t = 0.0176706   dt = 0.00201081 (0.996885s/it)
i = 10   t = 0.0197364   dt = 0.0020658 (0.997118s/it)
i = 11   t = 0.0218716   dt = 0.0021352 (1.00016s/it)
i = 12   t = 0.0240721   dt = 0.00220057 (0.99685s/it)
i = 13   t = 0.0263471   dt = 0.002275 (0.998231s/it)
i = 14   t = 0.0287159   dt = 0.00236875 (1.00025s/it)
i = 15   t = 0.0311533   dt = 0.00243738 (0.987068s/it)
i = 16   t = 0.0336768   dt = 0.0025235 (0.981898s/it)
i = 17   t = 0.0362863   dt = 0.00260952 (0.990963s/it)
i = 18   t = 0.0389715   dt = 0.00268521 (0.986071s/it)
i = 19   t = 0.0417381   dt = 0.00276665 (0.980461s/it)
i = 20   t = 0.0445873   dt = 0.00284919 (0.982345s/it)
i = 21   t = 0.0475216   dt = 0.0029343 (1.00055s/it)
i = 22   t = 0.0505258   dt = 0.00300413 (0.995297s/it)
i = 23   t = 0.0535938   dt = 0.00306807 (1.00189s/it)
i = 24   t = 0.0567043   dt = 0.0031105 (1.00527s/it)
i = 25   t = 0.0598233   dt = 0.00311892 (1.01299s/it)
Time spent in iteration: 23.6994
Correctness:
        sum(rh) difference = 1.28966e-08
        sum(vx) = -0.242582
        sum(vy) = -0.295116
        sum(vz) = -0.335474
        sum(rh*T) difference = -297.099

dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm-sse
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-pentiumm-sse, please wait...
i486:    0 i586:    0 ppro:   84 mmx:   44 sse:    0 sse2: 3089 sse3:    0
tramp3d-v4-pentiumm-sse will run on Pentium IV (pentium4) or higher
processor.


Keep in mind that anything that does strip-flags (e.g. GCC, glibc,
kernel) will remove both -msse3 and -mfpmath from your C[XX]FLAGS.

Very little difference in runtimes, maybe half a second, and next to no
difference in compile time. Surprisingly,
-O2 -march=pentium-m -msse3 -fomit-frame-pointer -pipe was the slowest.
I reran the test to be sure and it was slightly worse (24.5397s) than
the original run.
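
To put the "maybe half a second" in proportion, here is a small sketch that expresses each run as a slowdown relative to the fastest one. The numbers are the four "Time spent in iteration" values reported above; the `relative_slowdown` helper is just an illustrative name, and single runs like these say nothing about run-to-run noise.

```python
# "Time spent in iteration" values copied from the four runs above (seconds).
TIMES = {
    "-march=prescott": 23.9913,
    "-march=pentium-m": 23.9751,
    "-march=pentium-m -msse3": 24.3848,
    "-march=pentium-m -msse3 -mfpmath=sse": 23.6994,
}


def relative_slowdown(times):
    """Map each flag set to its slowdown in percent relative to the fastest run."""
    best = min(times.values())
    return {flags: 100.0 * (t - best) / best for flags, t in times.items()}


if __name__ == "__main__":
    slowdowns = relative_slowdown(TIMES)
    # List the runs from fastest to slowest.
    for flags in sorted(TIMES, key=TIMES.get):
        print(f"{TIMES[flags]:8.4f}s  +{slowdowns[flags]:5.2f}%  {flags}")
```

On these figures the whole spread between fastest and slowest is under 3%, which is small enough that repeated runs would be needed before ranking the flag sets.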

It also appears -mfpmath=sse does not generate sse3 instructions.
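
The analyze-x86 summary lines above can be compared more easily once parsed into numbers. The following is a rough sketch that assumes the summary line always has the `name: count` shape shown in this post; the `parse_counts` helper and the dictionary keys are made up for illustration and are not part of analyze-x86.

```python
import re

# Summary lines copied from the analyze-x86 runs above, keyed by the flags used.
REPORTS = {
    "prescott": "i486:    0 i586:    0 ppro:  130 mmx:    0 sse:    0 sse2:    0 sse3:    2",
    "pentium-m": "i486:    0 i586:    0 ppro:  135 mmx:    0 sse:    0 sse2:    4 sse3:    0",
    "pentium-m -msse3": "i486:    0 i586:    0 ppro:  135 mmx:    0 sse:    0 sse2:    0 sse3:    2",
    "pentium-m -msse3 -mfpmath=sse": "i486:    0 i586:    0 ppro:   84 mmx:   44 sse:    0 sse2: 3089 sse3:    0",
}


def parse_counts(line):
    """Turn 'i486: 0 i586: 0 ppro: 130 ...' into {'i486': 0, 'i586': 0, 'ppro': 130, ...}."""
    return {name: int(count) for name, count in re.findall(r"(\w+):\s*(\d+)", line)}


if __name__ == "__main__":
    for flags, line in REPORTS.items():
        counts = parse_counts(line)
        print(f"{flags:32s} sse2={counts['sse2']:5d} sse3={counts['sse3']:2d}")
```

Laid out this way, the -mfpmath=sse build is the only one with a large SSE2 count (3089) and it reports zero SSE3 instructions, which matches the observation above.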

----------


I also tested "-O2 -march=prescott -mfpmath=sse -fomit-frame-pointer
-pipe".  I forgot to record the results, but the times were a small
improvement over "-O2 -march=pentium-m -msse3 -mfpmath=sse
-fomit-frame-pointer -pipe", somewhere between 22.8 and 23.2s if I remember
correctly.

--de.

