From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-71487-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 14449 invoked by alias); 5 Apr 2003 22:37:46 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 14441 invoked from network); 5 Apr 2003 22:37:46 -0000
Received: from unknown (HELO www.eyesopen.com) (12.96.199.11)
  by sources.redhat.com with SMTP; 5 Apr 2003 22:37:46 -0000
Received: from localhost (roger@localhost)
	by www.eyesopen.com (8.11.6/8.11.6) with ESMTP id h35Lc3u16957;
	Sat, 5 Apr 2003 14:38:03 -0700
Date: Sun, 06 Apr 2003 01:59:00 -0000
From: Roger Sayle <roger@www.eyesopen.com>
To: Zack Weinberg <zack@codesourcery.com>
cc: David Edelsohn <dje@watson.ibm.com>, <gcc@gcc.gnu.org>
Subject: Re: Strange conditional jumps on the POWER4
In-Reply-To: <87of3khbta.fsf@egil.codesourcery.com>
Message-ID: <Pine.LNX.4.44.0304051355450.14942-100000@www.eyesopen.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-SW-Source: 2003-04/txt/msg00230.txt.bz2


On Sat, 5 Apr 2003, Zack Weinberg wrote:
> Roger Sayle <roger@www.eyesopen.com> writes:
> > I've come across some very strange behaviour when benchmarking
> > GCC on an AIX power4 box.
>
> It would help immensely if you would post assembly language snippets.

My apologies.  As I mentioned previously, I wasn't sure whether this
is a register allocation issue, i.e. whether my timing loops themselves
affect the slow-down.  However, here are some snippets:

#include <stdio.h>
#include <time.h>

int foo(int j)
{
  clock_t t1, t2;
  unsigned int i;

  t1 = clock();
  for (i=0; i<100000000; i++)
  {
    if (j == 1) j = 0;
    else        j = 1;
  }
  t2 = clock();
  printf("ticks = %ld\n",(long)((t2-t1)/1000));  /* cast: clock_t need not be int */
  return j;
}

int main()
{
  foo(0);
  return 0;
}


Generates the following loop body with both gcc3.4 "-O2" and gcc3.2
"-O2" (600 ticks)

L..10:
        xori 0,31,1
        addic 9,0,-1
        subfe 31,9,0
        bdnz L..10
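
For readers not fluent in PowerPC carry tricks, here is a rough C model
of what I believe those three instructions compute.  The function name
and the explicit "ca" variable are mine, purely for illustration; the
point is only that the sequence sets j to (j != 1) without any
conditional branch.

```c
#include <stdint.h>

/* Hypothetical helper: models xori/addic/subfe on 32-bit registers. */
int toggle_branchless(int j)
{
    uint32_t r0  = (uint32_t)j ^ 1u;  /* xori  0,31,1 : zero only when j == 1 */
    uint32_t r9  = r0 - 1u;           /* addic 9,0,-1 : r0 + 0xFFFFFFFF       */
    uint32_t ca  = (r0 != 0u);        /* carry out of that add iff r0 != 0    */
    uint32_t r31 = ~r9 + r0 + ca;     /* subfe 31,9,0 : ~r9 + r0 + CA == CA   */
    return (int)r31;                  /* 0 when j == 1, otherwise 1           */
}
```

So every iteration costs the same three dependent integer operations,
whichever value j holds.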


Changing the condition to (j > 1) generates with gcc3.2 "-O2" (300
ticks)

L..11:
        cmpwi 0,31,1
        li 31,0
        bgt- 0,L..4
        li 31,1
L..4:
        bdnz L..11

but the same code, (j > 1), generates with gcc3.4 "-O2" (600 ticks)

L..10:
        cmpwi 7,31,1
        li 31,1
        ble- 7,L..4
        li 31,0
L..4:
        bdnz L..10


However, with (j == 1) and "gcc-3.4 -O2 -fno-if-conversion", we get
(1200 ticks)

L..10:
        cmpwi 7,31,1
        li 31,1
        beq- 7,L..12
        bdnz L..10
L..13:

	...

L..12:
        li 31,0
        bdnz L..10
        b L..13


So it looks as though the problem is not a register stall or a
scheduling problem, but branch prediction and basic block re-ordering.
If you're lucky, the compiler gets the branch probabilities right and
you always branch the same way, and the loop can be done in 300 ticks.
If you're unlucky, the compiler gets the probabilities wrong and you
alternate taken/not-taken, and it takes 1200 ticks.

Hence if-conversion hedges its bets and always uses a 600 tick
straight-line sequence.

We were just lucky in gcc-3.2 that we got it right, so the "safer"
code generated in gcc-3.4 looks twice as slow.  If we'd got the
branch prediction wrong in gcc-3.2, gcc-3.4 would instead show a
doubling of performance.
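
To make the "always the same way" versus "alternating" distinction
concrete, here is a small sketch that replays the benchmark loop and
counts how often the conditional changes direction from one iteration
to the next.  The function and its use_eq flag are hypothetical names
of mine, and this only models the branch outcomes of the source loop,
not what the POWER4 predictor actually does with them.

```c
/* Hypothetical helper: count direction changes of the loop's conditional.
   use_eq != 0 tests (j == 1); use_eq == 0 tests (j > 1). */
unsigned long count_direction_flips(int j, unsigned long n, int use_eq)
{
    unsigned long flips = 0;
    unsigned long i;
    int prev = -1;                      /* no previous outcome yet */

    for (i = 0; i < n; i++) {
        int taken = use_eq ? (j == 1) : (j > 1);
        if (prev != -1 && taken != prev)
            flips++;                    /* outcome differs from last iteration */
        prev = taken;
        j = taken ? 0 : 1;              /* same update as the benchmark loop */
    }
    return flips;
}
```

Starting from j = 0 as in main(), the (j == 1) test changes direction
on every iteration after the first, while the (j > 1) test never
changes direction at all.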


Sorry if I've wasted your time.  I hope the above analysis is
interesting.  Judging by these timings, power4 has a significant
misprediction penalty.

Roger
--