* Re: gcc 3.4.3: -march optimization for Intel Core2Duo
2006-10-05 16:26 Timothy C Prince
2006-10-05 16:41 ` Vladimir Makarov
@ 2006-10-07 6:36 ` Ryan Hill
From: Ryan Hill @ 2006-10-07 6:36 UTC
To: gcc-help
Timothy C Prince wrote:
> From: Vladimir Makarov <vmakarov@redhat.com>
> Ian Lance Taylor wrote:
>> "Jan Dillmann" <jan.dillmann@tooltec.de> writes:
>>> we are running several benchmarks (SpecCPU200...) on 32-bit
>>> linux-systems and are able to set an optimization-parameter for
>>> '-march'. We use Intel Core2Duo-CPUs. Which parameter should we
>>> use (nocona, prescott...) ?
>>>
>>>
>> gcc 3.4.3 has no specific tuning for Core2 Duo, if for no other
>> reason than the release was made before the processors became
>> available. My guess would be that you will get the best results
>> with -mtune=nocona. But it is only a guess.
>>
>>
>>
> I believe that pentium-m will work better. Nocona (an x86_64
> processor) is based on the northwood/prescott core, which is a high
> frequency core with long pipelines. Core2 Duo is closer to pentium M
> (a lower frequency core with much shorter pipelines). However, using
> pentium-m will result in bigger code than nocona because alignment
> of loops and functions will be forced (the northwood core is not so
> sensitive to alignment, therefore no aligning is done when
> -mtune=nocona is used). I don't remember Intel's recommendation about
> aligning code for Core Duo (probably it is the same as for pentium M).
>
>
>
> ________________________________
>
> FWIW, pentium-m is optimized by using 387 code for nearly everything
> except (int) casts. This is because of the Banias SSE decoder
> bottleneck. If you use -march=pentium-m, you would add -mfpmath=sse to
> try to get code closer to optimal for any CPU other than
> Banias/Dothan. The OP's question was about Core 2 Duo, a more advanced
> (64-bit capable) CPU than Core Duo.
> Tim Prince
>
I did some very unscientific and limited benchmarking of GCC trunk
performance in tramp3d. Any analysis or suggestions, and other
benchmark numbers would be greatly appreciated if you can supply them.
I plan to do more when I have some time.
Also, this is on a Core Duo rather than a Core 2 Duo. Does that make a
significant difference?
----------
(http://forums.gentoo.org/viewtopic-p-3602555.html#3601332)
ok, i did one simple c++ benchmark using TraMP3d-v4. keep in mind it's
just one benchmark.
the system used was a Toshiba Satellite A100 laptop with a Core Duo
T2300 @ 1.66GHz (Yonah), 2MiB shared L2 cache, and 1GiB of memory. the
GCC version used was 4.1-branch svn built yesterday.
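For anyone who wants to repeat this, the four builds that follow could be scripted roughly as below. This is only a sketch: the compiler path, source file name and flags are copied from the commands in this post, while the VARIANTS table and the build_command helper are names made up for illustration. It only prints the command lines and does not run the compiler.

```python
import shlex

# Output suffix -> target-specific flags (the four combinations tested here).
VARIANTS = {
    "prescott": ["-march=prescott"],
    "pentiumm-plain": ["-march=pentium-m"],
    "pentiumm": ["-march=pentium-m", "-msse3"],
    "pentiumm-sse": ["-march=pentium-m", "-msse3", "-mfpmath=sse"],
}


def build_command(suffix,
                  compiler="/usr/bin/g++-4.1.2-pre20060923",
                  source="tramp3d-v4.cpp"):
    """Return the argv list for one build variant (nothing is executed)."""
    return [
        compiler, "-O2", *VARIANTS[suffix],
        "-fomit-frame-pointer", "-pipe", "-Dleafify=flatten",
        source, "-o", f"tramp3d-v4-{suffix}",
    ]


for suffix in VARIANTS:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(build_command(suffix)))
```

Prefixing each printed line with /usr/bin/time gives the compile timings shown in the transcripts.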
[-O2 -march=prescott -fomit-frame-pointer -pipe]
dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2
-march=prescott -fomit-frame-pointer -pipe -Dleafify=flatten
tramp3d-v4.cpp -o tramp3d-v4-prescott
95.45user 0.84system 1:35.69elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (0major+202080minor)pagefaults 0swaps
dirtyepic@tycho ~/tmp $ ./tramp3d-v4-prescott -n 25 --cartvis 1.0 0.0
--rhomin 1e-8
Using
using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
solving eeq
time increments from [0, 1.79769e+308], cfl 0.5
starting at t = 0, i = 1
cell physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
face physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
periodic boundaries in X Y Z
i = 1 t = 0.00209225 dt = 0.00209225 (0.07124s/it)
i = 2 t = 0.00410537 dt = 0.00201312 (0.946142s/it)
i = 3 t = 0.00603889 dt = 0.00193352 (0.966466s/it)
i = 4 t = 0.00794139 dt = 0.00190251 (0.975241s/it)
i = 5 t = 0.00984636 dt = 0.00190497 (0.97465s/it)
i = 6 t = 0.0117508 dt = 0.00190449 (0.985882s/it)
i = 7 t = 0.013681 dt = 0.00193011 (1.0047s/it)
i = 8 t = 0.0156598 dt = 0.0019788 (1.00467s/it)
i = 9 t = 0.0176706 dt = 0.00201081 (1.00171s/it)
i = 10 t = 0.0197364 dt = 0.0020658 (1.0184s/it)
i = 11 t = 0.0218716 dt = 0.0021352 (1.01445s/it)
i = 12 t = 0.0240721 dt = 0.00220057 (1.00954s/it)
i = 13 t = 0.0263471 dt = 0.002275 (1.01139s/it)
i = 14 t = 0.0287159 dt = 0.00236875 (1.01714s/it)
i = 15 t = 0.0311533 dt = 0.00243738 (1.01269s/it)
i = 16 t = 0.0336768 dt = 0.0025235 (1.01118s/it)
i = 17 t = 0.0362863 dt = 0.00260952 (1.00748s/it)
i = 18 t = 0.0389715 dt = 0.00268521 (1.00433s/it)
i = 19 t = 0.0417381 dt = 0.00276665 (1.00053s/it)
i = 20 t = 0.0445873 dt = 0.00284919 (1.00177s/it)
i = 21 t = 0.0475216 dt = 0.0029343 (0.989871s/it)
i = 22 t = 0.0505258 dt = 0.00300413 (0.997915s/it)
i = 23 t = 0.0535938 dt = 0.00306807 (0.98717s/it)
i = 24 t = 0.0567043 dt = 0.0031105 (0.989589s/it)
i = 25 t = 0.0598233 dt = 0.00311892 (0.987146s/it)
Time spent in iteration: 23.9913
Correctness:
sum(rh) difference = 1.45519e-11
sum(vx) = -0.242582
sum(vy) = -0.295116
sum(vz) = -0.335474
sum(rh*T) difference = -297.099
dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-prescott
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-prescott, please wait...
i486: 0 i586: 0 ppro: 130 mmx: 0 sse: 0 sse2: 0 sse3: 2
tramp3d-v4-prescott will run on Pentium IV (pentium4) w/ SSE3 or higher
processor.
[-O2 -march=pentium-m -fomit-frame-pointer -pipe]
dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2
-march=pentium-m -fomit-frame-pointer -pipe -Dleafify=flatten
tramp3d-v4.cpp -o tramp3d-v4-pentiumm-plain
97.74user 0.74system 1:38.47elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (11major+200253minor)pagefaults 0swaps
dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm-plain -n 25 --cartvis 1.0
0.0 --rhomin 1e-8
Using
using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
solving eeq
time increments from [0, 1.79769e+308], cfl 0.5
starting at t = 0, i = 1
cell physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
face physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
periodic boundaries in X Y Z
i = 1 t = 0.00209225 dt = 0.00209225 (0.0692961s/it)
i = 2 t = 0.00410537 dt = 0.00201312 (0.992859s/it)
i = 3 t = 0.00603889 dt = 0.00193352 (1.0033s/it)
i = 4 t = 0.00794139 dt = 0.00190251 (0.975363s/it)
i = 5 t = 0.00984636 dt = 0.00190497 (0.98926s/it)
i = 6 t = 0.0117508 dt = 0.00190449 (0.986304s/it)
i = 7 t = 0.013681 dt = 0.00193011 (0.997433s/it)
i = 8 t = 0.0156598 dt = 0.0019788 (0.99804s/it)
i = 9 t = 0.0176706 dt = 0.00201081 (1.00585s/it)
i = 10 t = 0.0197364 dt = 0.0020658 (1.00463s/it)
i = 11 t = 0.0218716 dt = 0.0021352 (1.01035s/it)
i = 12 t = 0.0240721 dt = 0.00220057 (1.00643s/it)
i = 13 t = 0.0263471 dt = 0.002275 (1.00908s/it)
i = 14 t = 0.0287159 dt = 0.00236875 (1.00359s/it)
i = 15 t = 0.0311533 dt = 0.00243738 (1.00683s/it)
i = 16 t = 0.0336768 dt = 0.0025235 (1.0018s/it)
i = 17 t = 0.0362863 dt = 0.00260952 (1.00395s/it)
i = 18 t = 0.0389715 dt = 0.00268521 (0.994894s/it)
i = 19 t = 0.0417381 dt = 0.00276665 (0.995252s/it)
i = 20 t = 0.0445873 dt = 0.00284919 (0.992024s/it)
i = 21 t = 0.0475216 dt = 0.0029343 (0.989914s/it)
i = 22 t = 0.0505258 dt = 0.00300413 (0.984155s/it)
i = 23 t = 0.0535938 dt = 0.00306807 (0.986609s/it)
i = 24 t = 0.0567043 dt = 0.0031105 (0.981239s/it)
i = 25 t = 0.0598233 dt = 0.00311892 (0.986686s/it)
Time spent in iteration: 23.9751
Correctness:
sum(rh) difference = 1.45519e-11
sum(vx) = -0.242582
sum(vy) = -0.295116
sum(vz) = -0.335474
sum(rh*T) difference = -297.099
dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm-plain
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-pentiumm-plain, please wait...
i486: 0 i586: 0 ppro: 135 mmx: 0 sse: 0 sse2: 4 sse3: 0
tramp3d-v4-pentiumm-plain will run on Pentium IV (pentium4) or higher
processor.
[-O2 -march=pentium-m -msse3 -fomit-frame-pointer -pipe]
dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2
-march=pentium-m -msse3 -fomit-frame-pointer -pipe -Dleafify=flatten
tramp3d-v4.cpp -o tramp3d-v4-pentiumm
97.73user 1.01system 1:38.05elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (0major+197280minor)pagefaults 0swaps
dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm -n 25 --cartvis 1.0 0.0
--rhomin 1e-8
Using
using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
solving eeq
time increments from [0, 1.79769e+308], cfl 0.5
starting at t = 0, i = 1
cell physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
face physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
periodic boundaries in X Y Z
i = 1 t = 0.00209225 dt = 0.00209225 (0.069342s/it)
i = 2 t = 0.00410537 dt = 0.00201312 (0.968165s/it)
i = 3 t = 0.00603889 dt = 0.00193352 (0.985737s/it)
i = 4 t = 0.00794139 dt = 0.00190251 (0.999364s/it)
i = 5 t = 0.00984636 dt = 0.00190497 (1.01105s/it)
i = 6 t = 0.0117508 dt = 0.00190449 (1.01161s/it)
i = 7 t = 0.013681 dt = 0.00193011 (1.02449s/it)
i = 8 t = 0.0156598 dt = 0.0019788 (1.02412s/it)
i = 9 t = 0.0176706 dt = 0.00201081 (1.02851s/it)
i = 10 t = 0.0197364 dt = 0.0020658 (1.02592s/it)
i = 11 t = 0.0218716 dt = 0.0021352 (1.03424s/it)
i = 12 t = 0.0240721 dt = 0.00220057 (1.0353s/it)
i = 13 t = 0.0263471 dt = 0.002275 (1.03373s/it)
i = 14 t = 0.0287159 dt = 0.00236875 (1.03266s/it)
i = 15 t = 0.0311533 dt = 0.00243738 (1.03526s/it)
i = 16 t = 0.0336768 dt = 0.0025235 (1.02011s/it)
i = 17 t = 0.0362863 dt = 0.00260952 (1.0232s/it)
i = 18 t = 0.0389715 dt = 0.00268521 (1.02476s/it)
i = 19 t = 0.0417381 dt = 0.00276665 (1.0153s/it)
i = 20 t = 0.0445873 dt = 0.00284919 (1.00431s/it)
i = 21 t = 0.0475216 dt = 0.0029343 (1.00313s/it)
i = 22 t = 0.0505258 dt = 0.00300413 (0.989761s/it)
i = 23 t = 0.0535938 dt = 0.00306807 (0.99909s/it)
i = 24 t = 0.0567043 dt = 0.0031105 (0.989536s/it)
i = 25 t = 0.0598233 dt = 0.00311892 (0.996134s/it)
Time spent in iteration: 24.3848
Correctness:
sum(rh) difference = 1.45519e-11
sum(vx) = -0.242582
sum(vy) = -0.295116
sum(vz) = -0.335474
sum(rh*T) difference = -297.099
dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-pentiumm, please wait...
i486: 0 i586: 0 ppro: 135 mmx: 0 sse: 0 sse2: 0 sse3: 2
tramp3d-v4-pentiumm will run on Pentium IV (pentium4) w/ SSE3 or higher
processor.
[-O2 -march=pentium-m -msse3 -mfpmath=sse -fomit-frame-pointer -pipe]
dirtyepic@tycho ~/tmp $ /usr/bin/time /usr/bin/g++-4.1.2-pre20060923 -O2
-march=pentium-m -msse3 -mfpmath=sse -fomit-frame-pointer -pipe
-Dleafify=flatten tramp3d-v4.cpp -o tramp3d-v4-pentiumm-sse
98.40user 0.94system 1:39.15elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (3major+198438minor)pagefaults 0swaps
dirtyepic@tycho ~/tmp $ ./tramp3d-v4-pentiumm-sse -n 25 --cartvis 1.0
0.0 --rhomin 1e-8
Using
using [1,1,1] block setup for computation on domain [0:63:1,0:63:1,0:63:1]
solving eeq
time increments from [0, 1.79769e+308], cfl 0.5
starting at t = 0, i = 1
cell physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
face physical/total domain [0:62:1,0:62:1,0:62:1],
[-2:64:1,-2:64:1,-2:64:1]
periodic boundaries in X Y Z
i = 1 t = 0.00209225 dt = 0.00209225 (0.0617449s/it)
i = 2 t = 0.00410537 dt = 0.00201312 (0.897831s/it)
i = 3 t = 0.00603889 dt = 0.00193352 (0.964484s/it)
i = 4 t = 0.00794139 dt = 0.00190251 (0.94189s/it)
i = 5 t = 0.00984636 dt = 0.00190497 (0.972172s/it)
i = 6 t = 0.0117508 dt = 0.00190449 (0.973818s/it)
i = 7 t = 0.013681 dt = 0.00193011 (0.984364s/it)
i = 8 t = 0.0156598 dt = 0.0019788 (0.988743s/it)
i = 9 t = 0.0176706 dt = 0.00201081 (0.996885s/it)
i = 10 t = 0.0197364 dt = 0.0020658 (0.997118s/it)
i = 11 t = 0.0218716 dt = 0.0021352 (1.00016s/it)
i = 12 t = 0.0240721 dt = 0.00220057 (0.99685s/it)
i = 13 t = 0.0263471 dt = 0.002275 (0.998231s/it)
i = 14 t = 0.0287159 dt = 0.00236875 (1.00025s/it)
i = 15 t = 0.0311533 dt = 0.00243738 (0.987068s/it)
i = 16 t = 0.0336768 dt = 0.0025235 (0.981898s/it)
i = 17 t = 0.0362863 dt = 0.00260952 (0.990963s/it)
i = 18 t = 0.0389715 dt = 0.00268521 (0.986071s/it)
i = 19 t = 0.0417381 dt = 0.00276665 (0.980461s/it)
i = 20 t = 0.0445873 dt = 0.00284919 (0.982345s/it)
i = 21 t = 0.0475216 dt = 0.0029343 (1.00055s/it)
i = 22 t = 0.0505258 dt = 0.00300413 (0.995297s/it)
i = 23 t = 0.0535938 dt = 0.00306807 (1.00189s/it)
i = 24 t = 0.0567043 dt = 0.0031105 (1.00527s/it)
i = 25 t = 0.0598233 dt = 0.00311892 (1.01299s/it)
Time spent in iteration: 23.6994
Correctness:
sum(rh) difference = 1.28966e-08
sum(vx) = -0.242582
sum(vy) = -0.295116
sum(vz) = -0.335474
sum(rh*T) difference = -297.099
dirtyepic@tycho ~/tmp $ analyze-x86 tramp3d-v4-pentiumm-sse
Checking vendor_id string... GenuineIntel
Disassembling tramp3d-v4-pentiumm-sse, please wait...
i486: 0 i586: 0 ppro: 84 mmx: 44 sse: 0 sse2: 3089 sse3: 0
tramp3d-v4-pentiumm-sse will run on Pentium IV (pentium4) or higher
processor.
Keep in mind that anything that does strip-flags (e.g. GCC, glibc,
kernel) will remove both -msse3 and -mfpmath from your C[XX]FLAGS.
Very little difference in runtimes: the fastest and slowest runs are about
0.7 s apart over roughly 24 s. Compile times are also close (95.45 to
98.40 s user). Surprisingly,
-O2 -march=pentium-m -msse3 -fomit-frame-pointer -pipe was the slowest.
I reran the test to be sure and it was slightly worse (24.5397s) than
the original run.
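To put a number on "very little difference", here is a quick sketch that ranks the four "Time spent in iteration" totals from the runs above and prints each one's slowdown relative to the fastest. The dictionary keys are just labels chosen for this example. These are single runs, so the percentages are only indicative.

```python
# "Time spent in iteration" totals (seconds) copied from the runs above.
totals = {
    "prescott": 23.9913,
    "pentium-m": 23.9751,
    "pentium-m -msse3": 24.3848,
    "pentium-m -msse3 -mfpmath=sse": 23.6994,
}

fastest = min(totals, key=totals.get)
slowest = max(totals, key=totals.get)
spread = totals[slowest] - totals[fastest]

for name, secs in sorted(totals.items(), key=lambda kv: kv[1]):
    # Slowdown relative to the fastest run, in percent.
    pct = 100.0 * (secs - totals[fastest]) / totals[fastest]
    print(f"{name:32s} {secs:8.4f}s  +{pct:.2f}%")

print(f"spread between fastest and slowest: {spread:.4f}s")
```

The spread works out to about 0.69 s, a little under 3% of the fastest total. The rerun of the -msse3 build (24.5397s) differed from its first run by about 0.16 s, so at least part of that spread may be run-to-run noise.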
It also appears that -mfpmath=sse does not generate any SSE3 instructions:
analyze-x86 reports 3089 sse2 and 0 sse3 instructions for that build.
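The instruction counts are easier to compare side by side. A small sketch that parses the analyze-x86 summary lines pasted above (the SUMMARIES table and the parse_counts helper are names invented here; how analyze-x86 classifies instructions is taken on trust):

```python
import re

# Summary lines copied from the analyze-x86 output above.
SUMMARIES = {
    "prescott": "i486: 0 i586: 0 ppro: 130 mmx: 0 sse: 0 sse2: 0 sse3: 2",
    "pentiumm-plain": "i486: 0 i586: 0 ppro: 135 mmx: 0 sse: 0 sse2: 4 sse3: 0",
    "pentiumm": "i486: 0 i586: 0 ppro: 135 mmx: 0 sse: 0 sse2: 0 sse3: 2",
    "pentiumm-sse": "i486: 0 i586: 0 ppro: 84 mmx: 44 sse: 0 sse2: 3089 sse3: 0",
}


def parse_counts(line):
    """Turn 'name: count' pairs into a dict mapping name -> int."""
    return {name: int(n) for name, n in re.findall(r"(\w+):\s*(\d+)", line)}


counts = {binary: parse_counts(line) for binary, line in SUMMARIES.items()}

for binary, c in counts.items():
    print(f"{binary:16s} ppro={c['ppro']:4d} mmx={c['mmx']:3d} "
          f"sse2={c['sse2']:5d} sse3={c['sse3']:2d}")
```

Only the -mfpmath=sse build contains a significant number of SSE2 instructions. The two builds that allow SSE3 without -mfpmath=sse each contain just 2 SSE3 instructions.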
----------
I also tested "-O2 -march=prescott -mfpmath=sse -fomit-frame-pointer
-pipe". I forgot to record the results, but the times were a small
improvement over "-O2 -march=pentium-m -msse3 -mfpmath=sse
-fomit-frame-pointer -pipe", somewhere between 22.8 and 23.2s if I
remember correctly.
--de.