Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Date: Sun, 06 Apr 2003 01:59:00 -0000
From: Roger Sayle
To: Zack Weinberg
cc: David Edelsohn
Subject: Re: Strange conditional jumps on the POWER4
In-Reply-To: <87of3khbta.fsf@egil.codesourcery.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-SW-Source: 2003-04/txt/msg00230.txt.bz2

On Sat, 5 Apr 2003, Zack Weinberg wrote:
> Roger Sayle writes:
> > I've come across some very strange behaviour when benchmarking
> > GCC on an AIX power4 box.
>
> It would help immensely if you would post assembly language snippets.

My apologies. As I mentioned previously, I wasn't sure whether this is
a register allocation issue, i.e. whether my timing loops affect the
slow-down etc...
However, here are some snippets:

    #include <stdio.h>
    #include <time.h>

    int foo(int j)
    {
      clock_t t1, t2;
      unsigned int i;

      t1 = clock();
      for (i=0; i<100000000; i++)
      {
        if (j == 1)
          j = 0;
        else
          j = 1;
      }
      t2 = clock();
      /* clock_t need not be int, so cast before printing with %d */
      printf("ticks = %d\n", (int)((t2-t1)/1000));
      return j;
    }

    int main()
    {
      foo(0);
      return 0;
    }

This generates the following loop body with both gcc3.4 "-O2" and
gcc3.2 "-O2" (600 ticks):

    L..10:
            xori 0,31,1
            addic 9,0,-1
            subfe 31,9,0
            bdnz L..10

Changing the condition to (j > 1) generates with gcc3.2 "-O2"
(300 ticks):

    L..11:
            cmpwi 0,31,1
            li 31,0
            bgt- 0,L..4
            li 31,1
    L..4:
            bdnz L..11

but the same code, (j > 1), generates with gcc3.4 "-O2" (600 ticks):

    L..10:
            cmpwi 7,31,1
            li 31,1
            ble- 7,L..4
            li 31,0
    L..4:
            bdnz L..10

However, with (j == 1) and "gcc-3.4 -O2 -fno-if-conversion", we get
(1200 ticks):

    L..10:
            cmpwi 7,31,1
            li 31,1
            beq- 7,L..12
            bdnz L..10
    L..13:
            ...
    L..12:
            li 31,0
            bdnz L..10
            b L..13

So it looks as though the problem is not a register stall or a
scheduling problem, but branch prediction and basic block re-ordering.
If you're lucky, the compiler gets the branch probabilities right and
you always branch the same way, and the loop can be done in 300 ticks.
If you're unlucky, the compiler gets the probabilities wrong and you
alternate taken/not-taken, and it takes 1200 ticks. Hence if-conversion
hedges its bets and always uses a 600 tick straight-line sequence.

We were just lucky in gcc-3.2 that we got it right, and the "safer"
code generated in gcc-3.4 is just twice as slow. If we'd got the
branch prediction wrong in gcc-3.2 (1200 ticks), we'd be seeing
gcc-3.4 as a doubling of performance.

Sorry if I've wasted your time. I hope the above analysis is
interesting. Obviously power4 has a significant misprediction penalty.

Roger
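
For anyone following along without a POWER4 to hand, here is a minimal C sketch of what the if-converted (600 tick) loop body computes, next to the branchy form from the test program. It rests on one assumption, which is my reading of the xori/addic/subfe sequence: that it leaves 1 in r31 when j != 1 and 0 when j == 1. The function names (toggle_branchy, toggle_branchfree, run_loop, toggles_agree, self_check) are made up for illustration and appear nowhere in GCC. The sketch only checks that the two forms give the same values; it says nothing about timing.

```c
#include <assert.h>

/* Branchy form, as written in the test program. The direction of the
   branch depends on the current value of j. */
static int toggle_branchy(int j)
{
    if (j == 1)
        j = 0;
    else
        j = 1;
    return j;
}

/* Branch-free form. ASSUMPTION: this is the value the if-converted
   xori/addic/subfe sequence computes, i.e. 1 when j != 1, else 0. */
static int toggle_branchfree(int j)
{
    return j != 1;
}

/* Apply the given toggle n times, mirroring the benchmark loop. */
static int run_loop(int (*toggle)(int), int j, unsigned int n)
{
    unsigned int i;
    for (i = 0; i < n; i++)
        j = toggle(j);
    return j;
}

/* Hypothetical helper: returns 1 if both forms agree for every j in
   the inclusive range [lo, hi], 0 otherwise. */
static int toggles_agree(int lo, int hi)
{
    int j;
    for (j = lo; j <= hi; j++)
        if (toggle_branchy(j) != toggle_branchfree(j))
            return 0;
    return 1;
}

/* Sanity checks: the forms agree, and an even number of toggles
   starting from 0 ends at 0, as in foo(0). */
static void self_check(void)
{
    assert(toggles_agree(-4, 4));
    assert(run_loop(toggle_branchy, 0, 10) == 0);
    assert(run_loop(toggle_branchfree, 0, 10) == 0);
}
```

Calling self_check() from a main() of your own exercises both forms. Since j starts at 0 in the benchmark, the (j == 1) test alternates between true and false on every iteration, which is the taken/not-taken pattern described above.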