* Re: Strange conditional jumps on the POWER4
  [not found] <200304080131.VAA30506@makai.watson.ibm.com>
From: Roger Sayle @ 2003-04-08 15:50 UTC
To: David Edelsohn; +Cc: gcc

> N7
> -O2 -mcpu=power4: 92
> -O2 -mcpu=power4 -ffunction-sections -fdata-sections: 139
>
> Most is alignment, the rest is scheduling quirk.

Funny how it isn't possible to align instructions in the AIX assembler;
you mentioned it doesn't support .p2align. Is there any way it can be
done by explicitly inserting nops into the insn stream? Perhaps via
machine-dependent reorg or something similar?

I've no idea whether GCC already has support for aligning instructions
without the assistance of assembler pseudo-ops.

Roger
--
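A minimal sketch of the arithmetic behind the nop-insertion idea above,
assuming the byte offset of the loop head from an already aligned start
is known. PowerPC instructions are a fixed 4 bytes, so the padding is a
whole number of nops. The name nops_to_align and its parameters are
hypothetical; nothing in this thread says GCC has such a helper.

```c
#include <assert.h>
#include <stdint.h>

#define PPC_INSN_BYTES 4u   /* every PowerPC instruction is 4 bytes */

/* Hypothetical helper: number of nops to emit before `addr` so that the
   next instruction starts on a 2^p2-byte boundary, i.e. what a
   ".p2align p2" directive would otherwise ask the assembler to do. */
unsigned nops_to_align(uint32_t addr, unsigned p2)
{
    uint32_t align, pad;

    assert(p2 >= 2 && p2 < 32);          /* at least instruction-sized */
    assert(addr % PPC_INSN_BYTES == 0);  /* code is always word aligned */

    align = (uint32_t)1 << p2;
    /* Bytes to the next boundary; the outer mask maps "already aligned"
       to zero instead of a full `align` bytes. */
    pad = (align - (addr & (align - 1))) & (align - 1);
    return pad / PPC_INSN_BYTES;
}
```

For example, nops_to_align(0x1004, 4) asks for a 16-byte boundary and
yields 3. The offset is only meaningful if the enclosing section or
function start is itself aligned at least as strictly.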
* Strange conditional jumps on the POWER4
From: Roger Sayle @ 2003-04-05 21:26 UTC
To: David Edelsohn; +Cc: gcc

Hi David,

I've come across some very strange behaviour when benchmarking GCC on
an AIX power4 box. The test case I'm investigating is:

	if (j == 1) j = 0;
	else j = 1;

It turns out that the code above is twice as slow as

	if (j > 1) j = 0;
	else j = 1;

as timed using "gcc -O2" with mainline CVS. This is particularly
curious because the second form uses a conditional jump, whereas the
first form generates straight line code.

I naturally suspected that the rs6000 backend was just poorly tuned for
power4, and that GCC was using straight line code when it should have
left the original branch untouched. The real mystery is that when I
then use "gcc -O2 -fno-if-conversion", the resulting code, with a
conditional jump for the equality test, is twice as slow again. Is it
really the case that testing for equality/inequality on a powerpc chip
is 4 times slower than testing less-than/greater-than?

This counter-intuitive behaviour means that gcc 3.4 runs loop N3 of the
whetstone benchmark at half the speed of gcc 3.2 on
powerpc-ibm-aix5.2.0.0, even though the same transformation doubles the
performance from 3.2 to 3.4 on most other platforms.

Basically, jump bypassing turns:

	if (j == 1) j = 2;
	else j = 3;
	if (j > 2) j = 0;
	else j = 1;
	if (j < 1) j = 1;
	else j = 0;

into the equivalent

	if (j == 1) j = 0;
	else j = 1;

But for reasons I cannot explain, this degrades performance!?

Any clues as to what's going on? I'm completely ignorant of this
architecture. Is it some kind of pipeline stall?

Roger
--
Roger Sayle,                         E-mail: roger@eyesopen.com
OpenEye Scientific Software,         WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road,     Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507.         Fax: (+1) 505-473-0833
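The equivalence claimed for jump bypassing in the message above can be
checked mechanically. The sketch below copies the two forms as quoted
there; the function names and the check_equivalence helper are
hypothetical and are not taken from Whetstone or from GCC.

```c
#include <assert.h>
#include <limits.h>

/* The three chained tests, as quoted in the message above. */
int three_step(int j)
{
    if (j == 1) j = 2; else j = 3;
    if (j > 2)  j = 0; else j = 1;
    if (j < 1)  j = 1; else j = 0;
    return j;
}

/* The single test that jump bypassing is described as producing. */
int bypassed(int j)
{
    if (j == 1) j = 0; else j = 1;
    return j;
}

/* Hypothetical helper: asserts that both forms agree for every j in
   [lo, hi].  hi must be below INT_MAX so the loop counter cannot
   overflow. */
void check_equivalence(int lo, int hi)
{
    int j;

    assert(hi < INT_MAX);
    for (j = lo; j <= hi; j++)
        assert(three_step(j) == bypassed(j));
}
```

Calling check_equivalence(-1000, 1000) passes: an input of 1 flows
through as 2, 1, 0, and any other input as 3, 0, 1, which is exactly
what the single equality test computes.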
* Re: Strange conditional jumps on the POWER4
From: Zack Weinberg @ 2003-04-06 1:42 UTC
To: Roger Sayle; +Cc: David Edelsohn, gcc

Roger Sayle <roger@www.eyesopen.com> writes:

> Hi David,
>
> I've come across some very strange behaviour when benchmarking
> GCC on an AIX power4 box.

It would help immensely if you would post assembly language snippets.

zw
* Re: Strange conditional jumps on the POWER4
From: Roger Sayle @ 2003-04-06 1:59 UTC
To: Zack Weinberg; +Cc: David Edelsohn, gcc

On Sat, 5 Apr 2003, Zack Weinberg wrote:
> Roger Sayle <roger@www.eyesopen.com> writes:
> > I've come across some very strange behaviour when benchmarking
> > GCC on an AIX power4 box.
>
> It would help immensely if you would post assembly language snippets.

My apologies. As I mentioned previously, I wasn't sure whether this is
a register allocation issue, i.e. whether my timing loops affect the
slow-down etc... However, here are some snippets:

#include <stdio.h>
#include <time.h>

int foo(int j)
{
  clock_t t1, t2;
  unsigned int i;

  t1 = clock();
  for (i=0; i<100000000; i++)
  {
    if (j == 1) j = 0;
    else j = 1;
  }
  t2 = clock();
  /* clock_t is not necessarily int, so cast to match %d */
  printf("ticks = %d\n",(int)((t2-t1)/1000));
  return j;
}

int main()
{
  foo(0);
  return 0;
}

Generates the following loop body with both gcc3.4 "-O2" and gcc3.2 "-O2"
(600 ticks)

L..10:
        xori 0,31,1
        addic 9,0,-1
        subfe 31,9,0
        bdnz L..10

Changing the condition to (j > 1) generates with gcc3.2 "-O2"
(300 ticks)

L..11:
        cmpwi 0,31,1
        li 31,0
        bgt- 0,L..4
        li 31,1
L..4:
        bdnz L..11

but the same code, (j > 1), generates with gcc3.4 "-O2" (600 ticks)

L..10:
        cmpwi 7,31,1
        li 31,1
        ble- 7,L..4
        li 31,0
L..4:
        bdnz L..10

However, with (j == 1) and "gcc-3.4 -O2 -fno-if-conversion", we get
(1200 ticks)

L..10:
        cmpwi 7,31,1
        li 31,1
        beq- 7,L..12
        bdnz L..10
L..13:
        ...
L..12:
        li 31,0
        bdnz L..10
        b L..13

So it looks as though the problem is not a register stall or scheduling
problem, but one of branch prediction and basic block re-ordering. If
you're lucky and the compiler gets the branch probabilities right, and
you always branch the same way, it can be done in 300 ticks. If you're
unlucky, the compiler gets the probabilities wrong and you alternate
taken/not-taken, it takes 1200 ticks.

Hence if-conversion hedges its bets and always uses a 600 tick
straight-line sequence. We were just lucky in gcc-3.2 that we got it
right, and the "safer" code generated in gcc-3.4 is just twice as slow.
If we'd got the branch prediction wrong in gcc-3.2, we'd instead be
seeing a doubling of performance with gcc-3.4.

Sorry if I've wasted your time. I hope the above analysis is
interesting. Obviously power4 has a significant misprediction penalty.

Roger
--
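The "always the same way" versus "alternating" distinction drawn above
can be made concrete by counting how often each condition changes
outcome from one iteration to the next. This is only a sketch of the
source-level pattern: it assumes nothing about how the POWER4 predicts
or issues branches, and the names flips_eq, flips_gt and demo are
hypothetical.

```c
#include <assert.h>

/* Runs the (j == 1) loop body n times and counts how often the outcome
   of the test differs from the previous iteration. */
unsigned flips_eq(int j, unsigned n)
{
    unsigned i, flips = 0;
    int prev = -1;                      /* no previous outcome yet */

    for (i = 0; i < n; i++) {
        int taken = (j == 1);
        if (prev >= 0 && taken != prev)
            flips++;
        prev = taken;
        if (taken) j = 0; else j = 1;
    }
    return flips;
}

/* Same bookkeeping for the (j > 1) variant. */
unsigned flips_gt(int j, unsigned n)
{
    unsigned i, flips = 0;
    int prev = -1;

    for (i = 0; i < n; i++) {
        int taken = (j > 1);
        if (prev >= 0 && taken != prev)
            flips++;
        prev = taken;
        if (taken) j = 0; else j = 1;
    }
    return flips;
}

/* Starting from j = 0 as in foo(0): the equality test flips on every
   iteration, while the greater-than test never flips. */
void demo(void)
{
    assert(flips_eq(0, 10) == 9);
    assert(flips_gt(0, 10) == 0);
}
```

With foo(0), j alternates 0, 1, 0, 1 under the equality test, so that
condition changes outcome every iteration; under the greater-than test
j settles at 1 and the outcome never changes.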
* Re: Strange conditional jumps on the POWER4
From: David Edelsohn @ 2003-04-06 3:33 UTC
To: Roger Sayle; +Cc: gcc

>>>>> Roger Sayle writes:

Roger> but the same code, (j > 1), generates with gcc3.4 "-O2" (600 ticks)

Roger> L..10:
Roger>         cmpwi 7,31,1
Roger>         li 31,1
Roger>         ble- 7,L..4
Roger>         li 31,0
Roger> L..4:
Roger>         bdnz L..10

Roger> However, with (j == 1) and "gcc-3.4 -O2 -fno-if-conversion", we get
Roger> (1200 ticks)

Roger> L..10:
Roger>         cmpwi 7,31,1
Roger>         li 31,1
Roger>         beq- 7,L..12
Roger>         bdnz L..10
Roger> L..13:

	It appears that you are not compiling the code with -mcpu=power4
commandline option for GCC 3.4, because the branch hint mnemonics (+/-)
would not be present without greater probabilities.

David
* Re: Strange conditional jumps on the POWER4
From: Roger Sayle @ 2003-04-06 5:21 UTC
To: David Edelsohn; +Cc: gcc

Hi David,

> It appears that you are not compiling the code with -mcpu=power4
> commandline option for GCC 3.4, because the branch hint mnemonics (+/-)
> would not be present without greater probabilities.

Indeed, I used exactly the flags described in the posting. Out of
interest, I've now also run Whetstone with "-mcpu=power4", hoping that
it would fix things. Not only does this not fix the N3 regression, it
actually does slightly worse overall.

With gcc-3.2 -O2 -ffast-math:

Loop content                  Result           MFLOPS      MOPS   Seconds

N1 floating point    -1.12475025653839111     157.257               0.588
N2 floating point    -1.12274754047393799     130.235               4.970
N3 if then else       1.00000000000000000               488.682     1.020
N4 fixed point       12.00000000000000000               171.030     8.870
N5 sin,cos etc.       0.49904659390449524                32.550    12.310
N6 floating point     0.99999988079071045      58.826              44.160
N7 assignments        3.00000000000000000                92.323     9.640
N8 exp,sqrt etc.      0.75110864639282227                10.374    17.270

MWIPS                                         487.311              98.828

With gcc-3.4 -O2 -ffast-math:

Loop content                  Result           MFLOPS      MOPS   Seconds

N1 floating point    -1.12475013732910156     189.658               0.582
N2 floating point    -1.12274742126464844     180.109               4.290
N3 if then else       1.00000000000000000               214.037     2.780
N4 fixed point       12.00000000000000000               186.310     9.720
N5 sin,cos etc.       0.49911010265350342                33.102    14.450
N6 floating point     0.99999982118606567      72.217              42.940
N7 assignments        3.00000000000000000               137.619     7.720
N8 exp,sqrt etc.      0.75110864639282227                16.008    13.360

MWIPS                                         599.841              95.842

With gcc-3.4 -O2 -ffast-math -mcpu=power4

Loop content                  Result           MFLOPS      MOPS   Seconds

N1 floating point    -1.12475013732910156     190.647               0.582
N2 floating point    -1.12274742126464844     178.142               4.360
N3 if then else       1.00000000000000000               214.382     2.790
N4 fixed point       12.00000000000000000               204.308     8.910
N5 sin,cos etc.       0.49911010265350342                33.647    14.290
N6 floating point     0.99999982118606567      71.627              43.520
N7 assignments        3.00000000000000000                88.334    12.090
N8 exp,sqrt etc.      0.75110864639282227                15.819    13.590

MWIPS                                         577.138             100.132

So as you can see the situation isn't too bad at all. Overall gcc 3.4
is much faster than gcc 3.2. But you can see my concern over the
slowdown of N3, and that adding "-mcpu=power4" doesn't resolve the
issue.

Roger
--
* Re: Strange conditional jumps on the POWER4
From: Segher Boessenkool @ 2003-04-06 16:49 UTC
To: Roger Sayle; +Cc: David Edelsohn, gcc

Roger Sayle wrote:

> With gcc-3.2 -O2 -ffast-math:
> N7 assignments        3.00000000000000000                92.323     9.640

> With gcc-3.4 -O2 -ffast-math:
> N7 assignments        3.00000000000000000               137.619     7.720

> With gcc-3.4 -O2 -ffast-math -mcpu=power4
> N7 assignments        3.00000000000000000                88.334    12.090

That last one looks very bad, too. Could you post source
and asm snippets for that?

Segher
* Re: Strange conditional jumps on the POWER4
From: David Edelsohn @ 2003-04-07 23:03 UTC
To: Segher Boessenkool, Roger Sayle; +Cc: gcc

>>>>> Segher Boessenkool writes:

>> With gcc-3.4 -O2 -ffast-math:
>> N7 assignments        3.00000000000000000               137.619     7.720

>> With gcc-3.4 -O2 -ffast-math -mcpu=power4
>> N7 assignments        3.00000000000000000                88.334    12.090

Segher> That last one looks very bad, too. Could you post source
Segher> and asm snippets for that?

	The N7 performance difference is due to an unfortunate instruction
address alignment. The default processor model just happens to align
the N7 loop well and the Power4 model aligns it poorly.

	The AIX assembler does not understand .p2align to allow GCC to
generate alignment directives to improve this. However, using
-ffunction-sections -fdata-sections allows the AIX linker to lay out the
functions more effectively, also producing better alignments, recovering
most of the performance.

David
* Re: Strange conditional jumps on the POWER4
From: David Edelsohn @ 2003-04-06 15:14 UTC
To: Roger Sayle; +Cc: Zack Weinberg, gcc

>>>>> Roger Sayle writes:

Roger> So it looks as though the problem is not a register stall, or scheduling
Roger> problem but with branch prediction and basic block re-ordering. If
Roger> you're lucky and the compiler gets the branch probabilities right,
Roger> and you always branch the same way it can be done in 300 ticks. If
Roger> you're unlucky, the compiler gets the probabilities wrong and you
Roger> alternate taken/not-taken it takes 1200 ticks.

	Newer PowerPC processors switched to using a different encoding of
branch hints. GCC for AIX currently does not enable the assembler mode
to map the mnemonics to the new hint bits, so Power4 always uses its
internal branch prediction, even if the +/- is present.

	The Power4 processor is predicting the branch correctly. I suspect
the problem is the branch to branch. Power4 has a one cycle latency
between branches. In the faster code, the conditional branch is not
taken, we overwrite the register (hidden by renaming) and we fall
through to the branch on counter. For the slower code, we don't
overwrite the register, but the conditional branch is taken, jumping to
the branch on counter. The branch on counter is held off one cycle,
leading to the performance degradation.

David