Re: Strange conditional jumps on the POWER4

public inbox for gcc@gcc.gnu.org

* Re: Strange conditional jumps on the POWER4
       [not found] <200304080131.VAA30506@makai.watson.ibm.com>
@ 2003-04-08 15:50 ` Roger Sayle
  0 siblings, 0 replies; 9+ messages in thread
From: Roger Sayle @ 2003-04-08 15:50 UTC
  To: David Edelsohn; +Cc: gcc

> N7
> -O2 -mcpu=power4:						92
> -O2 -mcpu=power4 -ffunction-sections -fdata-sections:		139
>
> Most is alignment, the rest is scheduling quirk.

Funny how it isn't possible to align instructions in the AIX
assembler; you mentioned it doesn't support .p2align.  Is there
any way it can be done by explicitly inserting nops into the insn
stream?  Perhaps via machine-dependent reorg or something similar?

I've no idea whether GCC already has support for aligning
instructions without the assistance of assembler pseudo-ops?

Roger
--


* Strange conditional jumps on the POWER4
@ 2003-04-05 21:26 Roger Sayle
  2003-04-06  1:42 ` Zack Weinberg
  0 siblings, 1 reply; 9+ messages in thread
From: Roger Sayle @ 2003-04-05 21:26 UTC
  To: David Edelsohn; +Cc: gcc

Hi David,

I've come across some very strange behaviour when benchmarking
GCC on an AIX power4 box.  The test case I'm investigating is:

	if (j == 1) j = 0;
	else        j = 1;

It turns out that the code above is twice as slow as

	if (j > 1) j = 0;
	else       j = 1;

as timed using "gcc -O2" with mainline CVS.  This is particularly
curious because the second form uses a conditional jump, whereas
the first form generates straight line code.

I naturally suspected that the rs6000 backend was just poorly
tuned for power4, and that GCC was using straight line code when
it should have left the original branch untouched.

The real mystery is that when I then use "gcc -O2 -fno-if-conversion"
the resulting code with a conditional jump for the equality test is
twice as slow still?

Is it really the case that testing for equality/inequality on a
powerpc chip is 4 times slower than testing less-than/greater-than?

This counter-intuitive behaviour means that gcc 3.4 runs
loop N3 of the whetstone benchmark at half the speed of gcc 3.2
on powerpc-ibm-aix5.2.0.0, even though the same transformation
doubles the performance from 3.2 to 3.4 on most other platforms.

Basically, jump bypassing turns:

	if (j == 1) j = 2;
	else        j = 3;
	if (j > 2)  j = 0;
	else        j = 1;
	if (j < 1)  j = 1;
	else        j = 0;

into the equivalent

	if (j == 1) j = 0;
	else        j = 1;

But for reasons I cannot explain, this degrades performance!?

Any clues as to what's going on?  I'm completely ignorant of this
architecture.  Is it some kind of pipeline stall?

Roger
--
Roger Sayle,                         E-mail: roger@eyesopen.com
OpenEye Scientific Software,         WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road,     Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507.         Fax: (+1) 505-473-0833


* Re: Strange conditional jumps on the POWER4
  2003-04-05 21:26 Roger Sayle
@ 2003-04-06  1:42 ` Zack Weinberg
  2003-04-06  1:59   ` Roger Sayle
  0 siblings, 1 reply; 9+ messages in thread
From: Zack Weinberg @ 2003-04-06  1:42 UTC
  To: Roger Sayle; +Cc: David Edelsohn, gcc

Roger Sayle <roger@www.eyesopen.com> writes:

> Hi David,
>
> I've come across some very strange behaviour when benchmarking
> GCC on an AIX power4 box.

It would help immensely if you would post assembly language snippets.

zw


* Re: Strange conditional jumps on the POWER4
  2003-04-06  1:42 ` Zack Weinberg
@ 2003-04-06  1:59   ` Roger Sayle
  2003-04-06  3:33     ` David Edelsohn
  2003-04-06 15:14     ` David Edelsohn
  0 siblings, 2 replies; 9+ messages in thread
From: Roger Sayle @ 2003-04-06  1:59 UTC
  To: Zack Weinberg; +Cc: David Edelsohn, gcc

On Sat, 5 Apr 2003, Zack Weinberg wrote:
> Roger Sayle <roger@www.eyesopen.com> writes:
> > I've come across some very strange behaviour when benchmarking
> > GCC on an AIX power4 box.
>
> It would help immensely if you would post assembly language snippets.

My apologies.  As I mentioned previously, I wasn't sure whether this
is a register allocation issue, i.e. whether my timing loops affect
the slow-down etc...  However here are some snippets:

#include <stdio.h>
#include <time.h>

int foo(int j)
{
  clock_t t1, t2;
  unsigned int i;

  t1 = clock();
  for (i=0; i<100000000; i++)
  {
    if (j == 1) j = 0;
    else        j = 1;
  }
  t2 = clock();
  printf("ticks = %ld\n", (long)((t2-t1)/1000));  /* clock_t is not int */
  return j;
}

int main()
{
  foo(0);
  return 0;
}

Generates the following loop body with both gcc3.4 "-O2" and gcc3.2
"-O2" (600 ticks)

L..10:
        xori 0,31,1
        addic 9,0,-1
        subfe 31,9,0
        bdnz L..10

Changing the condition to (j > 1) generates with gcc3.2 "-O2" (300
ticks)

L..11:
        cmpwi 0,31,1
        li 31,0
        bgt- 0,L..4
        li 31,1
L..4:
        bdnz L..11

but the same code, (j > 1), generates with gcc3.4 "-O2" (600 ticks)

L..10:
        cmpwi 7,31,1
        li 31,1
        ble- 7,L..4
        li 31,0
L..4:
        bdnz L..10

However, with (j == 1) and "gcc-3.4 -O2 -fno-if-conversion", we get
(1200 ticks)

L..10:
        cmpwi 7,31,1
        li 31,1
        beq- 7,L..12
        bdnz L..10
L..13:

	...

L..12:
        li 31,0
        bdnz L..10
        b L..13

So it looks as though the problem is not a register stall, or scheduling
problem but with branch prediction and basic block re-ordering.  If
you're lucky and the compiler gets the branch probabilities right,
and you always branch the same way it can be done in 300 ticks.  If
you're unlucky, the compiler gets the probabilities wrong and you
alternate taken/not-taken it takes 1200 ticks.

Hence if-conversion hedges its bets and always uses a 600 tick
straight-line sequence.

We were just lucky in gcc-3.2 that we got it right, and the "safer"
code generated in gcc-3.4 is just twice as slow.  If we'd got the
branch prediction wrong in gcc-3.2 we'd see a doubling of performance.

Sorry if I've wasted your time.  I hope the above analysis is
interesting.  Obviously power4 has a significant misprediction
penalty.

Roger
--


* Re: Strange conditional jumps on the POWER4
  2003-04-06  1:59   ` Roger Sayle
@ 2003-04-06  3:33     ` David Edelsohn
  2003-04-06  5:21       ` Roger Sayle
  2003-04-06 15:14     ` David Edelsohn
  1 sibling, 1 reply; 9+ messages in thread
From: David Edelsohn @ 2003-04-06  3:33 UTC
  To: Roger Sayle; +Cc: gcc

>>>>> Roger Sayle writes:

Roger> but the same code, (j > 1), generates with gcc3.4 "-O2" (600 ticks)

Roger> L..10:
Roger> cmpwi 7,31,1
Roger> li 31,1
Roger> ble- 7,L..4
Roger> li 31,0
Roger> L..4:
Roger> bdnz L..10

Roger> However, with (j == 1) and "gcc-3.4 -O2 -fno-if-conversion", we get
Roger> (1200 ticks)

Roger> L..10:
Roger> cmpwi 7,31,1
Roger> li 31,1
Roger> beq- 7,L..12
Roger> bdnz L..10
Roger> L..13:

	It appears that you are not compiling the code with -mcpu=power4
commandline option for GCC 3.4, because the branch hint mnemonics (+/-)
would not be present without greater probabilities.

David


* Re: Strange conditional jumps on the POWER4
  2003-04-06  3:33     ` David Edelsohn
@ 2003-04-06  5:21       ` Roger Sayle
  2003-04-06 16:49         ` Segher Boessenkool
  0 siblings, 1 reply; 9+ messages in thread
From: Roger Sayle @ 2003-04-06  5:21 UTC
  To: David Edelsohn; +Cc: gcc


Hi David,
> 	It appears that you are not compiling the code with -mcpu=power4
> commandline option for GCC 3.4, because the branch hint mnemonics (+/-)
> would not be present without greater probabilities.

Indeed, I used exactly the flags described in the posting.  Out of
interest, I've now also run Whetstone with "-mcpu=power4", hoping that
it would fix things.  Not only does this not fix the N3 regression, it
actually does slightly worse overall.


With gcc-3.2 -O2 -ffast-math:

Loop content                  Result              MFLOPS      MOPS   Seconds
N1 floating point     -1.12475025653839111       157.257              0.588
N2 floating point     -1.12274754047393799       130.235              4.970
N3 if then else        1.00000000000000000                 488.682    1.020
N4 fixed point        12.00000000000000000                 171.030    8.870
N5 sin,cos etc.        0.49904659390449524                  32.550   12.310
N6 floating point      0.99999988079071045        58.826             44.160
N7 assignments         3.00000000000000000                  92.323    9.640
N8 exp,sqrt etc.       0.75110864639282227                  10.374   17.270
MWIPS                                            487.311             98.828

With gcc-3.4 -O2 -ffast-math:

Loop content                  Result              MFLOPS      MOPS   Seconds
N1 floating point     -1.12475013732910156       189.658              0.582
N2 floating point     -1.12274742126464844       180.109              4.290
N3 if then else        1.00000000000000000                 214.037    2.780
N4 fixed point        12.00000000000000000                 186.310    9.720
N5 sin,cos etc.        0.49911010265350342                  33.102   14.450
N6 floating point      0.99999982118606567        72.217             42.940
N7 assignments         3.00000000000000000                 137.619    7.720
N8 exp,sqrt etc.       0.75110864639282227                  16.008   13.360
MWIPS                                            599.841             95.842

With gcc-3.4 -O2 -ffast-math -mcpu=power4

Loop content                  Result              MFLOPS      MOPS   Seconds
N1 floating point     -1.12475013732910156       190.647              0.582
N2 floating point     -1.12274742126464844       178.142              4.360
N3 if then else        1.00000000000000000                 214.382    2.790
N4 fixed point        12.00000000000000000                 204.308    8.910
N5 sin,cos etc.        0.49911010265350342                  33.647   14.290
N6 floating point      0.99999982118606567        71.627             43.520
N7 assignments         3.00000000000000000                  88.334   12.090
N8 exp,sqrt etc.       0.75110864639282227                  15.819   13.590
MWIPS                                            577.138            100.132


So as you can see the situation isn't too bad at all.  Overall gcc 3.4
is much faster than gcc 3.2.  But you can see my concern over the slow
down of N3, and that adding "-mcpu=power4" doesn't resolve the issue.

Roger
--


* Re: Strange conditional jumps on the POWER4
  2003-04-06  5:21       ` Roger Sayle
@ 2003-04-06 16:49         ` Segher Boessenkool
  2003-04-07 23:03           ` David Edelsohn
  0 siblings, 1 reply; 9+ messages in thread
From: Segher Boessenkool @ 2003-04-06 16:49 UTC
  To: Roger Sayle; +Cc: David Edelsohn, gcc

Roger Sayle wrote:
> With gcc-3.2 -O2 -ffast-math:
> N7 assignments         3.00000000000000000                  92.323    9.640

> With gcc-3.4 -O2 -ffast-math:
> N7 assignments         3.00000000000000000                 137.619    7.720

> With gcc-3.4 -O2 -ffast-math -mcpu=power4
> N7 assignments         3.00000000000000000                  88.334   12.090

That last one looks very bad, too.  Could you post source
and asm snippets for that?


Segher



* Re: Strange conditional jumps on the POWER4
  2003-04-06 16:49         ` Segher Boessenkool
@ 2003-04-07 23:03           ` David Edelsohn
  0 siblings, 0 replies; 9+ messages in thread
From: David Edelsohn @ 2003-04-07 23:03 UTC
  To: Segher Boessenkool, Roger Sayle; +Cc: gcc

>>>>> Segher Boessenkool writes:

>> With gcc-3.4 -O2 -ffast-math:
>> N7 assignments         3.00000000000000000                 137.619    7.720

>> With gcc-3.4 -O2 -ffast-math -mcpu=power4
>> N7 assignments         3.00000000000000000                  88.334   12.090

Segher> That last one looks very bad, too.  Could you post source
Segher> and asm snippets for that?

	The N7 performance difference is due to an unfortunate instruction
address alignment.  The default processor model just happens to align the
N7 loop well and the Power4 model aligns it poorly.  The AIX assembler
does not understand .p2align to allow GCC to generate alignment
directives to improve this.  However, using

-ffunction-sections -fdata-sections

allows the AIX linker to lay out the functions more effectively, also
producing better alignments, recovering most of the performance.

David


* Re: Strange conditional jumps on the POWER4
  2003-04-06  1:59   ` Roger Sayle
  2003-04-06  3:33     ` David Edelsohn
@ 2003-04-06 15:14     ` David Edelsohn
  1 sibling, 0 replies; 9+ messages in thread
From: David Edelsohn @ 2003-04-06 15:14 UTC
  To: Roger Sayle; +Cc: Zack Weinberg, gcc

>>>>> Roger Sayle writes:

Roger> So it looks as though the problem is not a register stall, or scheduling
Roger> problem but with branch prediction and basic block re-ordering.  If
Roger> you're lucky and the compiler gets the branch probabilities right,
Roger> and you always branch the same way it can be done in 300 ticks.  If
Roger> you're unlucky, the compiler gets the probabilities wrong and you
Roger> alternate taken/not-taken it takes 1200 ticks.

	Newer PowerPC processors switched to using a different encoding of
branch hints.  GCC for AIX currently does not enable the assembler mode to
map the mnemonics to the new hint bits, so Power4 always uses its internal
branch prediction, even if the +/- is present.  The Power4 processor is
predicting the branch correctly.

	I suspect the problem is the branch to branch.  Power4 has a one
cycle latency between branches.  In the faster code, the conditional
branch is not taken, we overwrite the register (hidden by renaming) and we
fallthru to the branch on counter.  For the slower code, we don't
overwrite the register, but the conditional branch is taken jumping to the
branch on counter.  The branch on counter is held off one cycle, leading
to the performance degradation.

David

