public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed
* Re: PA8000 performance oddity
@ 1999-05-25 19:54 N8TM
  1999-05-31 21:36 ` N8TM
  0 siblings, 1 reply; 20+ messages in thread
From: N8TM @ 1999-05-25 19:54 UTC (permalink / raw)
  To: jquinn, law; +Cc: egcs

In a message dated 5/25/99 8:47:44 AM Pacific Daylight Time, 
jquinn@nortelnetworks.com writes:

> It almost suggested that there is some kind of cache effect inside the
>  processor favoring recently used registers over ones that are idle.  Is
>  this possible/reasonable?  If so, it would make correct machine modeling
>  much more complex.
Or maybe the hardware does a more thorough dependency and re-ordering 
analysis for the case where a register is being re-used.  I recall Jeff 
quoting HP as saying there is no point in trying to assess accurately the 
costs of many operations on hppa2 (I hope I'm not misquoting).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-25 19:54 PA8000 performance oddity N8TM
@ 1999-05-31 21:36 ` N8TM
  0 siblings, 0 replies; 20+ messages in thread
From: N8TM @ 1999-05-31 21:36 UTC (permalink / raw)
  To: jquinn, law; +Cc: egcs

In a message dated 5/25/99 8:47:44 AM Pacific Daylight Time, 
jquinn@nortelnetworks.com writes:

> It almost suggested that there is some kind of cache effect inside the
>  processor favoring recently used registers over ones that are idle.  Is
>  this possible/reasonable?  If so, it would make correct machine modeling
>  much more complex.
Or maybe the hardware does a more thorough dependency and re-ordering 
analysis for the case where a register is being re-used.  I recall Jeff 
quoting HP as saying there is no point in trying to assess accurately the 
costs of many operations on hppa2 (I hope I'm not misquoting).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-27 12:09   ` Jerry Quinn
  1999-05-27 13:22     ` Jerry Quinn
@ 1999-05-31 21:36     ` Jerry Quinn
  1 sibling, 0 replies; 20+ messages in thread
From: Jerry Quinn @ 1999-05-31 21:36 UTC (permalink / raw)
  To: law; +Cc: Geert Bosch, egcs

Jeffrey A Law wrote:
> 
> Anyway, back to the PA8000 performance problem from Jerry.  I doubt the
> problem here is register renaming.  But it's hard to get a handle on it
> right now.  The most obvious inefficiency in the assembly code is not
> using the extended displacement for FP loads & stores.
> 
> It'll be interesting to see the performance once that is cleaned up (and it'll
> be a lot easier to analyze simply because the number of insns will be
> significantly reduced).

I had gcc generate a non-gas assembler file and then tweaked the float
loads to use long offsets.  It makes a 1-2% improvement as expected.

It turns out that gcc loses to aCC because there is a constant that gcc
creates as a single-precision value and converts to double at every use, while
aCC emits it as a double when writing the assembly.

I attached the source file for you to see.  The constant 0.333 is being
compiled as a float rather than a double, so we lose due to conversions
after every load.

Jerry

-- 
Jerry Quinn                             Tel: (514) 761-8737
jquinn@nortelnetworks.com               Fax: (514) 761-8505
Speech Recognition Research

#include <stdio.h>

#define LSIZE   256L

double doSth (double lfPosX, double lfPosY, double lfPosZ,
              double lfPntX, double lfPntY, double lfPntZ,
              double lfSqrRadius);

//====================================================================
int main ()
//====================================================================
{
  long          lSizeX, lSizeY, lSizeZ; // volume dimensions
  long          lActX,  lActY,  lActZ;  // actual voxel numbers
  double        lfPosX, lfPosY, lfPosZ; // center of zeroth voxel
  double        lfActX, lfActY, lfActZ; // actual position
  double        lfPntX, lfPntY, lfPntZ; // 'some' point in space
  double        lfSum = 0.;             // result

  // handling a LSIZE^3 volume (no allocation, only coordinate calculation)
  lSizeX = LSIZE;
  lSizeY = LSIZE;
  lSizeZ = LSIZE;

  // initializing 'some' point
  lfPntX = 0.1234;
  lfPntY = 0.3456;
  lfPntZ = 0.5678;

  // set center of volume to origin of (cartesian) coordinate system
  lfPosX = - (double) (lSizeX -1L) * 0.5;
  lfPosY = - (double) (lSizeY -1L) * 0.5;
  lfPosZ = - (double) (lSizeZ -1L) * 0.5;

  // going thru voxel centers
  for (lActZ=0L; lActZ<lSizeZ; lActZ++)
  {
    lfActZ = lfPosZ + (double) lActZ;
    for (lActY=0L; lActY<lSizeY; lActY++)
    {
      lfActY = lfPosY + (double) lActY;
      for (lActX=0L; lActX<lSizeX; lActX++)
      {
        lfActX = lfPosX + (double) lActX;

        lfSum += doSth (lfActX, lfActY, lfActZ,
                        lfPntX, lfPntY, lfPntZ,
                        (double) LSIZE * 0.333);
      }
    }
  }

  printf ("lfSum = %10.3f\n", lfSum);

  return (0);
}

//====================================================================
double doSth (double lfPosX, double lfPosY, double lfPosZ,
              double lfPntX, double lfPntY, double lfPntZ,
              double lfSqrRadius)
//====================================================================
{
  double        lfDistX, lfDistY, lfDistZ;

  lfDistX = lfPntX - lfPosX;
  lfDistY = lfPntY - lfPosY;
  lfDistZ = lfPntZ - lfPosZ;

  if (lfDistX * lfDistX + lfDistY * lfDistY + lfDistZ * lfDistZ <= lfSqrRadius)
    return (1.);
  else
    return (0.);
}

^ permalink raw reply	[flat|nested] 20+ messages in thread

* PA8000 performance oddity
  1999-05-21 14:25 Jerry Quinn
  1999-05-22  6:08 ` Jeffrey A Law
@ 1999-05-31 21:36 ` Jerry Quinn
  1 sibling, 0 replies; 20+ messages in thread
From: Jerry Quinn @ 1999-05-31 21:36 UTC (permalink / raw)
  To: egcs

I was looking at a test case someone had posted originally showing
poor g++ (vs gcc) performance and began looking at the performance
difference between aCC and egcs.  On this particular piece of code,
aCC wins by a sizeable margin.

In attempting to understand what is going on, I started looking at
register allocation in the doSth subroutine.  egcs heavily favors
reusing registers even when others are available while aCC apparently
tries to spread things out among registers when it can (i.e. b = a + b 
vs. c = a + b)

I wanted to see if reusing registers was causing a problem, so I
started reassigning registers by hand in this routine.  Then I found
this weird performance impact.  The subroutine has the following op
sequence:

r6 = r24 - r7
r24 = r6 * r6
r8 = r23 - r5
r27 = r22 - r26
r10 = r8 * r8 + r24
r11 = r27 * r27 + r10

I'd already changed a few assignments by hand without any noticeable
gain or loss.  I tried to change the second op to write to r9 on the
theory that it might allow the fifth op to execute sooner.  However,
doing this actually slowed the program down by 2%, repeatably.  The
times don't vary by more than .01 second.

Time as shown:
2.79u 0.02s 0:02.87 97.9%

Time with register change:
2.85u 0.02s 0:02.91 98.6%


I've included the complete assembly for anyone that cares to try it.
It is followed by a patch file to make the register change.  Link
using the 990517 snapshot and a recent binutils snapshot which has the
pa2.0 support.

Does anyone have a clue why this would happen?  Am I missing something
very obvious?

-- 
Jerry Quinn                             Tel: (514) 761-8737
jquinn@nortelnetworks.com               Fax: (514) 761-8505
Speech Recognition Research

------ cut here -------------------------------------
	.LEVEL 2.0
	.SPACE $PRIVATE$
	.SUBSPA $DATA$,QUAD=1,ALIGN=8,ACCESS=31
	.SUBSPA $BSS$,QUAD=1,ALIGN=8,ACCESS=31,ZERO,SORT=82
	.SPACE $TEXT$
	.SUBSPA $LIT$,QUAD=0,ALIGN=8,ACCESS=44
	.SUBSPA $CODE$,QUAD=0,ALIGN=8,ACCESS=44,CODE_ONLY
	.IMPORT $global$,DATA
	.IMPORT $$dyncall,MILLICODE
; gcc_compiled.:
	.IMPORT __main,CODE
	.IMPORT doSth__Fddddddd,CODE
	.IMPORT printf,CODE
	.SPACE $TEXT$
	.SUBSPA $LIT$

	.align 4
L$C0005
	.STRING "lfSum = %10.3f\x0a\x00"
	.align 8
L$C0000
	.word 0x3fbf9724
	.word 0x74538ef3
	.align 8
L$C0001
	.word 0x3fd61e4f
	.word 0x765fd8ae
	.align 8
L$C0002
	.word 0x3fe22b6a
	.word 0xe7d566cf
	.align 8
L$C0004
	.word 0x40554fdf
	.word 0x3b645a1d
	.align 8
L$C0006
	.word 0xc05fe000
	.word 0x0
	.SPACE $TEXT$
	.SUBSPA $CODE$

	.align 4
	.NSUBSPA $CODE$,QUAD=0,ALIGN=8,ACCESS=44,CODE_ONLY
	.EXPORT main,ENTRY,PRIV_LEV=3,RTNVAL=GR
main
	.PROC
	.CALLINFO FRAME=192,CALLS,SAVE_RP,ENTRY_GR=11,ENTRY_FR=19
	.ENTRY
	stw %r2,-20(%r30)
	stwm %r11,192(%r30)
	stw %r9,-184(%r30)
	ldo -152(%r30),%r1
	ldi 256,%r9
	stw %r10,-188(%r30)
	stw %r8,-180(%r30)
	stw %r7,-176(%r30)
	stw %r6,-172(%r30)
	stw %r5,-168(%r30)
	stw %r4,-164(%r30)
	stw %r3,-160(%r30)
	fstds,ma %fr19,8(%r1)
	fstds,ma %fr18,8(%r1)
	fstds,ma %fr17,8(%r1)
	fstds,ma %fr16,8(%r1)
	fstds,ma %fr15,8(%r1)
	fstds,ma %fr14,8(%r1)
	fcpy,dbl %fr0,%fr14
	fstds,ma %fr13,8(%r1)
	.CALL 
	bl __main,%r2
	fstds,ma %fr12,8(%r1)
	ldil LR'L$C0000,%r19
	ldil LR'L$C0001,%r20
	ldo RR'L$C0000(%r19),%r19
	ldil LR'L$C0002,%r21
	fldds 0(%r19),%fr19
	ldil LR'L$C0004,%r22
	ldil LR'L$C0006,%r19
	ldo RR'L$C0001(%r20),%r20
	ldo RR'L$C0006(%r19),%r19
	ldo RR'L$C0002(%r21),%r21
	ldo RR'L$C0004(%r22),%r22
	fldds 0(%r19),%fr15
	fldds 0(%r20),%fr18
	ldi 0,%r19
	fldds 0(%r21),%fr17
	fldds 0(%r22),%fr16
	stw %r19,-16(%r30)
L$0040
	ldo 1(%r19),%r11
	ldi 0,%r19
	fldws -16(%r30),%fr23L
	fcnvxf,sgl,dbl %fr23L,%fr22
	fadd,dbl %fr15,%fr22,%fr13
	stw %r19,-16(%r30)
L$0039
	ldo 1(%r19),%r10
	ldi 0,%r3
	extru %r9,31,2,%r19
	fldws -16(%r30),%fr23L
	fcnvxf,sgl,dbl %fr23L,%fr22
	comib,>= 0,%r9,L$0020
	fadd,dbl %fr15,%fr22,%fr12
	comib,= 0,%r19,L$0038
	ldo -56(%r30),%r4
	comib,>=,n 1,%r19,L$0020
	comib,>= 2,%r19,L$0021
	ldo -56(%r30),%r19
	ldo -64(%r30),%r20
	fstds %fr13,0(%r19)
	ldo -72(%r30),%r21
	ldo -80(%r30),%r19
	ldo -88(%r30),%r22
	fstds %fr19,0(%r20)
	fcpy,dbl %fr15,%fr5
	fcpy,dbl %fr12,%fr7
	ldi 1,%r3
	fstds %fr18,0(%r21)
	fstds %fr17,0(%r19)
	.CALL ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU
	bl doSth__Fddddddd,%r2
	fstds %fr16,0(%r22)
	fadd,dbl %fr14,%fr4,%fr14
L$0021
	ldo -56(%r30),%r19
	ldo -64(%r30),%r20
	fstds %fr13,0(%r19)
	ldo -72(%r30),%r21
	ldo -80(%r30),%r22
	ldo -88(%r30),%r19
	stw %r3,-16(%r30)
	fcpy,dbl %fr12,%fr7
	fldws -16(%r30),%fr22L
	fstds %fr19,0(%r20)
	fcnvxf,sgl,dbl %fr22L,%fr5
	fstds %fr18,0(%r21)
	ldo 1(%r3),%r3
	fadd,dbl %fr15,%fr5,%fr5
	fstds %fr17,0(%r22)
	.CALL ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU
	bl doSth__Fddddddd,%r2
	fstds %fr16,0(%r19)
	fadd,dbl %fr14,%fr4,%fr14
L$0020
	ldo -56(%r30),%r19
	ldo -64(%r30),%r20
	fstds %fr13,0(%r19)
	ldo -72(%r30),%r21
	ldo -80(%r30),%r22
	ldo -88(%r30),%r19
	stw %r3,-16(%r30)
	fcpy,dbl %fr12,%fr7
	fldws -16(%r30),%fr23L
	fstds %fr19,0(%r20)
	fcnvxf,sgl,dbl %fr23L,%fr5
	fstds %fr18,0(%r21)
	ldo 1(%r3),%r3
	fadd,dbl %fr15,%fr5,%fr5
	fstds %fr17,0(%r22)
	b L$0037
	fstds %fr16,0(%r19)
L$0014
	ldo -56(%r30),%r4
L$0038
	ldo -64(%r30),%r5
	fstds %fr13,0(%r4)
	ldo -72(%r30),%r7
	ldo -80(%r30),%r8
	ldo -88(%r30),%r6
	stw %r3,-16(%r30)
	fcpy,dbl %fr12,%fr7
	fldws -16(%r30),%fr22L
	fstds %fr19,0(%r5)
	fcnvxf,sgl,dbl %fr22L,%fr5
	fstds %fr18,0(%r7)
	fadd,dbl %fr15,%fr5,%fr5
	fstds %fr17,0(%r8)
	.CALL ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU
	bl doSth__Fddddddd,%r2
	fstds %fr16,0(%r6)
	fstds %fr13,0(%r4)
	ldo 1(%r3),%r19
	fcpy,dbl %fr12,%fr7
	stw %r19,-16(%r30)
	fadd,dbl %fr14,%fr4,%fr14
	fldws -16(%r30),%fr22L
	fstds %fr19,0(%r5)
	fcnvxf,sgl,dbl %fr22L,%fr5
	fstds %fr18,0(%r7)
	fadd,dbl %fr15,%fr5,%fr5
	fstds %fr17,0(%r8)
	.CALL ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU
	bl doSth__Fddddddd,%r2
	fstds %fr16,0(%r6)
	fstds %fr13,0(%r4)
	ldo 2(%r3),%r19
	fcpy,dbl %fr12,%fr7
	stw %r19,-16(%r30)
	fadd,dbl %fr14,%fr4,%fr14
	fldws -16(%r30),%fr22L
	fstds %fr19,0(%r5)
	fcnvxf,sgl,dbl %fr22L,%fr5
	fstds %fr18,0(%r7)
	fadd,dbl %fr15,%fr5,%fr5
	fstds %fr17,0(%r8)
	.CALL ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU
	bl doSth__Fddddddd,%r2
	fstds %fr16,0(%r6)
	fstds %fr13,0(%r4)
	ldo 3(%r3),%r19
	fcpy,dbl %fr12,%fr7
	ldo 4(%r3),%r3
	stw %r19,-16(%r30)
	fadd,dbl %fr14,%fr4,%fr14
	fldws -16(%r30),%fr22L
	fstds %fr19,0(%r5)
	fcnvxf,sgl,dbl %fr22L,%fr5
	fstds %fr18,0(%r7)
	fadd,dbl %fr15,%fr5,%fr5
	fstds %fr17,0(%r8)
	fstds %fr16,0(%r6)
L$0037
	.CALL ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU
	bl doSth__Fddddddd,%r2
	nop
	comb,> %r9,%r3,L$0014
	fadd,dbl %fr14,%fr4,%fr14
	copy %r10,%r19
	comb,>,n %r9,%r19,L$0039
	stw %r19,-16(%r30)
	copy %r11,%r19
	comb,>,n %r9,%r19,L$0040
	stw %r19,-16(%r30)
	fcpy,dbl %fr14,%fr7
	ldil LR'L$C0005,%r26
	.CALL ARGW0=GR,ARGW2=FR,ARGW3=FU
	bl printf,%r2
	ldo RR'L$C0005(%r26),%r26
	ldo -152(%r30),%r1
	ldw -212(%r30),%r2
	fldds,ma 8(%r1),%fr19
	ldi 0,%r28
	ldw -188(%r30),%r10
	fldds,ma 8(%r1),%fr18
	ldw -184(%r30),%r9
	fldds,ma 8(%r1),%fr17
	ldw -180(%r30),%r8
	fldds,ma 8(%r1),%fr16
	ldw -176(%r30),%r7
	fldds,ma 8(%r1),%fr15
	ldw -172(%r30),%r6
	fldds,ma 8(%r1),%fr14
	ldw -168(%r30),%r5
	fldds,ma 8(%r1),%fr13
	ldw -164(%r30),%r4
	ldw -160(%r30),%r3
	fldds,ma 8(%r1),%fr12
	bv %r0(%r2)
	ldwm -192(%r30),%r11
	.EXIT
	.PROCEND
	.align 4
	.NSUBSPA $CODE$,QUAD=0,ALIGN=8,ACCESS=44,CODE_ONLY
	.PARAM __static_initialization_and_destruction_0,ARGW0=GR,ARGW1=GR
__static_initialization_and_destruction_0
	.PROC
	.CALLINFO FRAME=0,NO_CALLS
	.ENTRY
	bv,n %r0(%r2)
	.EXIT
	.PROCEND
	.SPACE $TEXT$
	.SUBSPA $LIT$

	.align 8
L$C0008
	.word 0x3ff00000
	.word 0x0
	.SPACE $TEXT$
	.SUBSPA $CODE$

	.align 4
	.NSUBSPA $CODE$,QUAD=0,ALIGN=8,ACCESS=44,CODE_ONLY
	.EXPORT doSth__Fddddddd,ENTRY,PRIV_LEV=3,ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU,RTNVAL=FU
doSth__Fddddddd
	.PROC
	.CALLINFO FRAME=0,NO_CALLS
	.ENTRY
	ldo -72(%r30),%r19
	ldo -64(%r30),%r20
	fldds 0(%r19),%fr24
	ldo -56(%r30),%r21
	fldds 0(%r20),%fr23
	ldo -80(%r30),%r19
	fsub,dbl %fr24,%fr7,%fr6
	fldds 0(%r19),%fr22
;	fmpysub,dbl %fr24,%fr24,%fr24,%fr5,%fr23
	fmpy,dbl %fr6,%fr6,%fr9
	fsub,dbl %fr23,%fr5,%fr8
	fldds 0(%r21),%fr26
	ldo -88(%r30),%r19
	fsub,dbl %fr22,%fr26,%fr27
	fldds 0(%r19),%fr25
	fmpyfadd,dbl %fr8,%fr8,%fr9,%fr10
	fmpyfadd,dbl %fr27,%fr27,%fr10,%fr11
	fcmp,dbl,> %fr11,%fr25
	ftest
	b L$0042
	ldil LR'L$C0008,%r19
	bv %r0(%r2)
	fcpy,dbl %fr0,%fr4
L$0042
	ldo RR'L$C0008(%r19),%r19
	bv %r0(%r2)
	fldds 0(%r19),%fr4
	.EXIT
	.PROCEND
	.align 4
	.NSUBSPA $CODE$,QUAD=0,ALIGN=8,ACCESS=44,CODE_ONLY
	.PARAM __static_initialization_and_destruction_1,ARGW0=GR,ARGW1=GR
__static_initialization_and_destruction_1
	.PROC
	.CALLINFO FRAME=0,NO_CALLS
	.ENTRY
	bv,n %r0(%r2)
	.EXIT
	.PROCEND
------------cut here------------------------
*** nolfv.s     Fri May 21 11:55:10 1999
--- nolfv.mod.s Fri May 21 12:26:50 1999
***************
*** 296,308 ****
        fsub,dbl %fr24,%fr7,%fr6
        fldds 0(%r19),%fr22
  ;     fmpysub,dbl %fr24,%fr24,%fr24,%fr5,%fr23
!       fmpy,dbl %fr6,%fr6,%fr24
        fsub,dbl %fr23,%fr5,%fr8
        fldds 0(%r21),%fr26
        ldo -88(%r30),%r19
        fsub,dbl %fr22,%fr26,%fr27
        fldds 0(%r19),%fr25
!       fmpyfadd,dbl %fr8,%fr8,%fr24,%fr10
        fmpyfadd,dbl %fr27,%fr27,%fr10,%fr11
        fcmp,dbl,> %fr11,%fr25
        ftest
--- 296,308 ----
        fsub,dbl %fr24,%fr7,%fr6
        fldds 0(%r19),%fr22
  ;     fmpysub,dbl %fr24,%fr24,%fr24,%fr5,%fr23
!       fmpy,dbl %fr6,%fr6,%fr9
        fsub,dbl %fr23,%fr5,%fr8
        fldds 0(%r21),%fr26
        ldo -88(%r30),%r19
        fsub,dbl %fr22,%fr26,%fr27
        fldds 0(%r19),%fr25
!       fmpyfadd,dbl %fr8,%fr8,%fr9,%fr10
        fmpyfadd,dbl %fr27,%fr27,%fr10,%fr11
        fcmp,dbl,> %fr11,%fr25
        ftest

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-27 23:10       ` Gerald Pfeifer
@ 1999-05-31 21:36         ` Gerald Pfeifer
  0 siblings, 0 replies; 20+ messages in thread
From: Gerald Pfeifer @ 1999-05-31 21:36 UTC (permalink / raw)
  To: Jerry Quinn; +Cc: egcs

On Thu, 27 May 1999, Jerry Quinn wrote:
> Is this a known limitation of function inlining in gcc, that the
> function to be inlined must be defined before its use?

gcc.info says:
  
  Some calls cannot be integrated for various reasons (in particular,
  calls that precede the function's definition cannot be integrated, and
  neither can recursive calls within the definition).

Gerald
-- 
Gerald "Jerry" pfeifer@dbai.tuwien.ac.at http://www.dbai.tuwien.ac.at/~pfeifer/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-22  6:08 ` Jeffrey A Law
  1999-05-25  8:44   ` Jerry Quinn
@ 1999-05-31 21:36   ` Jeffrey A Law
  1 sibling, 0 replies; 20+ messages in thread
From: Jeffrey A Law @ 1999-05-31 21:36 UTC (permalink / raw)
  To: Jerry Quinn; +Cc: egcs

  In message < 199905211645.MAA03952@wmtl249c.us.nortel.com >you write:
  > 
  > I was looking at a test case someone had posted originally showing
  > poor g++ (vs gcc) performance and began looking at the performance
  > difference between aCC and egcs.  On this particular piece of code,
  > aCC wins by a sizeable margin.
  > 
  > In attempting to understand what is going on, I started looking at
  > register allocation in the doSth subroutine.  egcs heavily favors
  > reusing registers even when others are available while aCC apparently
  > tries to spread things out among registers when it can (i.e. b = a + b 
  > vs. c = a + b)
Yes.  Register renaming.  It can avoid false dependency stalls added by
register allocation in some circumstances.  GCC doesn't support this yet,
but does have some heuristics in the register allocator to try and avoid
creating false dependency stalls.

[ ... ]

  > I've included the complete assembly for anyone that cares to try it.
  > It is followed by a patch file to make the register change.  Link
  > using the 990517 snapshot and a recent binutils snapshot which has the
  > pa2.0 support.
  > 
  > Does anyone have a clue why this would happen?  Am I missing something
  > very obvious?
The most obvious thing I see is we are not taking advantage of the larger
displacement allowed in reg+disp addresses for FP insns.

In PA1.0/PA1.1 a FP load/store is only allowed a 5 bit displacement.  Thus
you see all those  "ldo disp(base), temp; fldds 0(temp),fptarget".

In PA2.0 FP loads/stores can have an aligned 16bit offset, which would
allow most of the ldo instructions to disappear.  This (of course) requires
more GAS work to support the larger displacements.

I did some fooling around with this stuff a while back and it was worth a
few more percent across specfp.

It may also be the case that using targeted FP compares rather than queued
FP compares would help.   I've never done any experiments to see what the benefit was,
but presumably PA, mips & others that added multiple FP condition code regs
did so for a reason :-)

Jeff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-25 10:15 ` Jeffrey A Law
  1999-05-27 12:09   ` Jerry Quinn
@ 1999-05-31 21:36   ` Jeffrey A Law
  1 sibling, 0 replies; 20+ messages in thread
From: Jeffrey A Law @ 1999-05-31 21:36 UTC (permalink / raw)
  To: Geert Bosch; +Cc: Jerry Quinn, egcs

  In message < 9905251642.AA17457@nile.gnat.com >you write:
  > This could well be possible. At least on Intel Pentium II, referencing
  > register names that are still in use by a non-retired intruction
  > can be faster than using new register names. The issue here is
  > mapping the register names to actual (renamed) registers.
I don't think this is the case on the PA8000 series machines though.


  > Favoring new renamings of already used register names over
  > using more register names may also reduce register pressure.
Yes.  More generally, latency scheduling for an out of order execution machine
does not make sense, except for those instructions with a long latency 
(several cycles).  The larger the reorder buffer, the longer latency the
processor can hide.

By ignoring relatively small latencies you reduce register pressure as a
secondary side effect.

The PA8000 scheduler tries to do this on a small scale basis.  It's still
missing some important cases.


Anyway, back to the PA8000 performance problem from Jerry.  I doubt the
problem here is register renaming.  But it's hard to get a handle on it
right now.  The most obvious inefficiency in the assembly code is not
using the extended displacement for FP loads & stores.

It'll be interesting to see the performance once that is cleaned up (and it'll
be a lot easier to analyze simply because the number of insns will be 
significantly reduced).





jeff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-27 13:22     ` Jerry Quinn
  1999-05-27 23:10       ` Gerald Pfeifer
@ 1999-05-31 21:36       ` Jerry Quinn
  1 sibling, 0 replies; 20+ messages in thread
From: Jerry Quinn @ 1999-05-31 21:36 UTC (permalink / raw)
  To: egcs

Quinn, Jerry (J.) [EXCHANGE:MTL:6X17:BNR] wrote:
> 
> Jeffrey A Law wrote:
> >
> > Anyway, back to the PA8000 performance problem from Jerry.  I doubt the
> > problem here is register renaming.  But it's hard to get a handle on it
> > right now.  The most obvious inefficiency in the assembly code is not
> > using the extended displacement for FP loads & stores.
> >
> > It'll be interesting to see the performance once that is cleaned up (and it'll
> > be a lot easier to analyze simply because the number of insns will be
> > significantly reduced).
> 
> I had gcc generate a non-gas assembler file and then tweaked the float
> loads to use long offsets.  It makes a 1-2% improvement as expected.
> 
> It turns out that gcc loses to aCC because there is a constant that gcc
> creates as a single-precision value and converts to double at every use, while
> aCC emits it as a double when writing the assembly.
> 
> I attached the source file for you to see.  The constant 0.333 is being
> compiled as a float rather than a double, so we lose due to conversions
> after every load.

Oops.  Foot in mouth time.  The type conversion is required.  In the
assembly, a loop index is stored as int, loaded into a float reg.  aCC
converts from signed int to double, while gcc code converts from float
to double.  It seems like the gcc code shouldn't be correct, yet it
still gives the right answer.  I don't understand.
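One plausible reason it still gives the right answer: every index in these loops is
far below 2^24, and integers of that size are exactly representable in an IEEE
single, so the int -> float -> double detour loses nothing.  A small stand-alone
check of that assumption (not part of the original test case):

#include <assert.h>

// Sanity check: for indices this small, converting through float is exact,
// so float -> double and int -> double give identical values.
int main ()
{
  for (long i = 0L; i < 256L; i++)
    assert ((double) (float) i == (double) i);  // exact for |i| < 2^24
  return (0);
}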

As it turns out, the big difference is that aCC (and hp cc) are inlining
the subroutine and gcc is not.  I tried cranking the -finline-limit to
20000000 but it had no effect.  I then tried moving the subroutine
before the main routine and it did inline, even without having to modify
inline-limit.

Is this a known limitation of function inlining in gcc, that the
function to be inlined must be defined before its use?
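If so, a minimal sketch of the reordering workaround (a hypothetical file, not the
attached test case): gcc of this era only integrates calls that follow the callee's
definition, so doSth is defined ahead of main and automatic inlining (e.g. -O3) can
then pick it up.

#include <stdio.h>

// Callee defined first, so its body is available when the call in main is seen.
static double doSth (double lfDist, double lfSqrRadius)
{
  return (lfDist * lfDist <= lfSqrRadius) ? 1. : 0.;
}

int main ()
{
  double lfSum = 0.;

  for (long lAct = 0L; lAct < 256L; lAct++)
    lfSum += doSth ((double) lAct, 100.);  // candidate for inline integration

  printf ("lfSum = %10.3f\n", lfSum);
  return (0);
}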

Jerry

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-25  8:44   ` Jerry Quinn
  1999-05-25 10:33     ` Jeffrey A Law
@ 1999-05-31 21:36     ` Jerry Quinn
  1 sibling, 0 replies; 20+ messages in thread
From: Jerry Quinn @ 1999-05-31 21:36 UTC (permalink / raw)
  To: law; +Cc: egcs

Jeffrey A Law wrote:
> 
>   In message < 199905211645.MAA03952@wmtl249c.us.nortel.com >you write:
>   >
>   > I was looking at a test case someone had posted originally showing
>   > poor g++ (vs gcc) performance and began looking at the performance
>   > difference between aCC and egcs.  On this particular piece of code,
>   > aCC wins by a sizeable margin.
>   >
>   > In attempting to understand what is going on, I started looking at
>   > register allocation in the doSth subroutine.  egcs heavily favors
>   > reusing registers even when others are available while aCC apparently
>   > tries to spread things out among registers when it can (i.e. b = a + b
>   > vs. c = a + b)
> Yes.  Register renaming.  It can avoid false dependency stalls added by
> register allocation in some circumstances.  GCC doesn't support this yet,
> but does have some heuristics in the register allocator to try and avoid
> creating false dependency stalls.

The actual weirdness I was trying to address was that when I manually
remove the false dependency by changing register allocation to another
temp register, i.e., changing b = a + b to c = a + b and updating the
ensuing code, performance got WORSE by a little bit.  If anything, I
would expect it to either be improved (due to eliminating a dependency)
or unchanged (latency hidden by reordering, or no latency problem).

It almost suggested that there is some kind of cache effect inside the
processor favoring recently used registers over ones that are idle.  Is
this possible/reasonable?  If so, it would make correct machine modeling
much more complex.


> The most obvious thing I see is we are not taking advantage of the larger
> displacement allowed in reg+disp addresses for FP insns.
> 
> In PA1.0/PA1.1 a FP load/store is only allowed a 5 bit displacement.  Thus
> you see all those  "ldo disp(base), temp; fldds 0(temp),fptarget".
> 
> In PA2.0 FP loads/stores can have an aligned 16bit offset, which would
> allow most of the ldo instructions to disappear.  This (of course) requires
> more GAS work to support the larger displacements.
> 
> I did some fooling around with this stuff a while back and it was worth a
> few more percent across specfp.

I saw something similar when I played with hobbling the aCC assembly
this way.  If I get a chance, I'll look at adding this stuff in.


> It may also be the case that using targeted FP compares rather than queued
> FP compares would help.   I've never done any experiments to see what the benefit was,
> but presumably PA, mips & others that added multiple FP condition code regs
> did so for a reason :-)


 
> Jeff

I can send the aCC assembly if you'd like to see what aCC did vs gcc.

One other thing I noted.  Gcc still generated fmpysub, so I'll have to
update the pattern in pa.md.  Shutting off pa_combine_instructions still
lets this pattern operate.  There was actually a performance loss by
having the instruction in there.  I know you said this, but I didn't
actually believe it.  I'm still not sure I actually understand.  Even
though fmpyadd/sub takes two reorder buffer slots, it shouldn't be any
slower than fmpy followed by fsub, should it?  If anything I would think
there is still a bit of benefit by having one less instruction - at
least in this particular example.

I've also played with a couple of other things.  I tried different
BRANCH_COST values up to 10 and ran it on our code, plus the performance
test suite that Mark Lehmann set up.  There doesn't seem to be much
benefit to changing the value that I can see.  Perhaps a value of 2 or 3
is a bit better, but there were plenty of test cases that got worse as
well as better.

Finally, in High Performance Computing 2nd Ed., there is an appendix
that sketches the PA8000.  One thing they mention that I haven't seen in
HP's publicly available papers is that branches take a slot in both the
memory reorder buffer and the arithmetic buffer.  I tried this and saw a
bit (1%?) drop in performance.  Is it possible that this comment isn't
correct?  I simply added branch instructions to the memory function unit
as well as the alu function unit.

Jerry

-- 
Jerry Quinn                             Tel: (514) 761-8737
jquinn@nortelnetworks.com               Fax: (514) 761-8505
Speech Recognition Research

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-25 10:33     ` Jeffrey A Law
@ 1999-05-31 21:36       ` Jeffrey A Law
  0 siblings, 0 replies; 20+ messages in thread
From: Jeffrey A Law @ 1999-05-31 21:36 UTC (permalink / raw)
  To: Jerry Quinn; +Cc: egcs

  In message < 374AC4C5.614AD369@americasm01.nt.com >you write:
  > The actual weirdness I was trying to address was that when I manually
  > remove the false dependency by changing register allocation to another
  > temp register, i.e., changing b = a + b to c = a + b and updating the
  > ensuing code, performance got WORSE by a little bit.  If anything, I
  > would expect it to either be improved (due to eliminating a dependency)
  > or unchanged (latency hidden by reordering, or no latency problem).
It can vary.  Predicting performance for out of order execution machines
is difficult at best.  I wouldn't worry too much about it right now (i.e.,
I suspect there are other things we can do to improve performance, which
will also make it easier to analyze the code later).

  > It almost suggested that there is some kind of cache effect inside the
  > processor favoring recently used registers over ones that are idle.  Is
  > this possible/reasonable?  If so, it would make correct machine modeling
  > much more complex.
Not that I'm aware of.  At least not in the PA8000 or PA8200.  I don't know
about the PA8500 or PA8700.

  > > I did some fooling around with this stuff a while back and it was worth a
  > > few more percent across specfp.
  > 
  > I saw something similar when I played with hobbling the aCC assembly
  > this way.  If I get a chance, I'll look at adding this stuff in.
I would recommend it.  It'll be a little more complex than the stuff you've
been working on, but not terribly so.

For example, some parts of the PA backend try to rewrite addresses for FP
loads and stores to make better use of the -16 .. 15 displacement.  You'll
need to tweak them (LEGITIMIZE_ADDRESS, LEGITIMIZE_RELOAD_ADDRESS).  You'll
also need to update GO_IF_LEGITIMATE_ADDRESS.  Assembler & BFD work is also
needed to handle the larger displacements for FP loads & stores.

My recommendation is to get the compiler stuff working first -- use the HP
assembler to test your code (and benchmark any improvements in your app).



  > I can send the aCC assembly if you'd like to see what aCC did vs gcc.
I really wouldn't have the time to look at it.  My focus is on gcc-2.95,
not performance issues for future releases.


  > One other thing I noted.  Gcc still generated fmpysub, so I'll have to
  > update the pattern in pa.md.  Shutting off pa_combine_instructions still
  > lets this pattern operate.  There was actually a performance loss by
  > having the instruction in there.  I know you said this, but I didn't
  > actually believe it.  I'm still not sure I actually understand.  Even
  > though fmpyadd/sub takes two reorder buffer slots, it shouldn't be any
  > slower than fmpy followed by fsub, should it?  If anything I would think
  > there is still a bit of benefit by having one less instruction - at
  > least in this particular example.
An fmpysub cannot retire until both operations are finished, so it holds its
resources until then.  An fmpy followed by an fsub allows either
operation to retire as soon as it is complete, returning resources to the
processor (particularly reorder buffer slots) for use by other insns.

You should investigate how the fmpysub was created.  That shouldn't be
happening.

  > I've also played with a couple of other things.  I tried different
  > BRANCH_COST values up to 10 and ran it on our code, plus the performance
  > test suite that Mark Lehmann set up.  There doesn't seem to be much
  > benefit to changing the value that I can see.  Perhaps a value of 2 or 3
  > is a bit better, but there were plenty of test cases that got worse as
  > well as better.
Like most optimization work, it will rarely help across the board.  I suspect a value
of 2 is best for the PA8000.  We may need to tweak it further for PA8200,
PA8500 and PA8700.
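For context, BRANCH_COST in this era is a plain target macro consulted by the
middle end when choosing between branchy and branch-free code; the experiment
above boils down to a one-line change along these lines (the value and the
placement in config/pa/pa.h are illustrative, not a tested patch):

/* Relative cost of a branch versus a straight-line insn; higher values make
   gcc work harder to avoid branches.  2 is the value suggested above.  */
#define BRANCH_COST 2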


  > Finally, in High Performance Computing 2nd Ed., there is an appendix
  > that sketches the PA8000.  One thing they mention that I haven't seen in
  > HP's publicly available papers is that branches take a slot in both the
  > memory reorder buffer and the arithmetic buffer.  I tried this and saw a
  > bit (1%?) drop in performance.  Is it possible that this comment isn't
  > correct?  I simply added branch instructions to the memory function unit
  > as well as the alu function unit.
It is correct, but I don't think trying to model it is going to work. 
Remember that the scheduler does not schedule branches.  I have no idea what
exposing this aspect of the PA architecture would do to the schedules.

jeff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-25  9:46 Geert Bosch
  1999-05-25 10:15 ` Jeffrey A Law
@ 1999-05-31 21:36 ` Geert Bosch
  1 sibling, 0 replies; 20+ messages in thread
From: Geert Bosch @ 1999-05-31 21:36 UTC (permalink / raw)
  To: Jerry Quinn, law; +Cc: egcs

On Tue, 25 May 1999 11:41:57 -0400, Jerry Quinn wrote:

  It almost suggested that there is some kind of cache effect inside the
  processor favoring recently used registers over ones that are idle.  Is
  this possible/reasonable?  If so, it would make correct machine modeling
  much more complex.

This could well be possible. At least on Intel Pentium II, referencing
register names that are still in use by a non-retired instruction
can be faster than using new register names. The issue here is
mapping the register names to actual (renamed) registers.

Favoring new renamings of already used register names over
using more register names may also reduce register pressure.

Compare the schedules below for the following program fragment:

Mem2 := Mem1 * Mem1;
Mem4 := Mem3 + 1;

  Schedule 1)       Schedule 2)

1 rA := Mem1;       rA := Mem1;
2 rB := Mem3;       rA := rA * rA;
3 rA := rA * rA;    Mem2 := rA;
4 rB := rB + 1;     rA := Mem3;
5 Mem2 := rA;       rA := rA + 1;
6 Mem4 := rB;       Mem4 := rA;

In 1) we use two register names, and 4 register instances.
On the other hand in 2) we only use a single name, still
with 4 instances. It is interesting to look at the effects
of out of order execution on this schedule. It may be that
schedule 2) performs as well as 1) while using fewer register
names.

Let's assume a machine that issues 1 instruction per cycle, has one
unit that can read, two for arithmetic and two for write. Read and
Write have a latency of three cycles, Add 1 and Mult 5.

  Execution 1)     Execution 2)

1 Read           1 Read
  |   2 Read       |
  |     :          |
3 Mult  |        2 Mult 4 Read
  |     |          |      |
  |   4 Add        |      |
  |   5 Write      |    5 Add
  |     |          |    6 Write
6 Write |        3 Write  |
  |     *          |      |
  |                |      *
  *                *

Regards,
   Geert



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-27 13:22     ` Jerry Quinn
@ 1999-05-27 23:10       ` Gerald Pfeifer
  1999-05-31 21:36         ` Gerald Pfeifer
  1999-05-31 21:36       ` Jerry Quinn
  1 sibling, 1 reply; 20+ messages in thread
From: Gerald Pfeifer @ 1999-05-27 23:10 UTC (permalink / raw)
  To: Jerry Quinn; +Cc: egcs

On Thu, 27 May 1999, Jerry Quinn wrote:
> Is this a known limitation of function inlining in gcc, that the
> function to be inlined must be defined before its use?

gcc.info says:
  
  Some calls cannot be integrated for various reasons (in particular,
  calls that precede the function's definition cannot be integrated, and
  neither can recursive calls within the definition).

Gerald
-- 
Gerald "Jerry" pfeifer@dbai.tuwien.ac.at http://www.dbai.tuwien.ac.at/~pfeifer/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-27 12:09   ` Jerry Quinn
@ 1999-05-27 13:22     ` Jerry Quinn
  1999-05-27 23:10       ` Gerald Pfeifer
  1999-05-31 21:36       ` Jerry Quinn
  1999-05-31 21:36     ` Jerry Quinn
  1 sibling, 2 replies; 20+ messages in thread
From: Jerry Quinn @ 1999-05-27 13:22 UTC (permalink / raw)
  To: egcs

Quinn, Jerry (J.) [EXCHANGE:MTL:6X17:BNR] wrote:
> 
> Jeffrey A Law wrote:
> >
> > Anyway, back to the PA8000 performance problem from Jerry.  I doubt the
> > problem here is register renaming.  But it's hard to get a handle on it
> > right now.  The most obvious inefficiency in the assembly code is not
> > using the extended displacement for FP loads & stores.
> >
> > It'll be interesting to see the performance once that is cleaned up (and it'll
> > be a lot easier to analyze simply because the number of insns will be
> > significantly reduced).
> 
> I had gcc generate a non-gas assembler file and then tweaked the float
> loads to use long offsets.  It makes a 1-2% improvement as expected.
> 
> It turns out that gcc loses to aCC because there is a constant that gcc
> creates as a single-precision value and converts to double at every use, while
> aCC emits it as a double when writing the assembly.
> 
> I attached the source file for you to see.  The constant 0.333 is being
> compiled as a float rather than a double, so we lose due to conversions
> after every load.

Oops.  Foot in mouth time.  The type conversion is required.  In the
assembly, a loop index is stored as int, loaded into a float reg.  aCC
converts from signed int to double, while gcc code converts from float
to double.  It seems like the gcc code shouldn't be correct, yet it
still gives the right answer.  I don't understand.

As it turns out, the big difference is that aCC (and hp cc) are inlining
the subroutine and gcc is not.  I tried cranking the -finline-limit to
20000000 but it had no effect.  I then tried moving the subroutine
before the main routine and it did inline, even without having to modify
inline-limit.

Is this a known limitation of function inlining in gcc, that the
function to be inlined must be defined before its use?

Jerry

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-25 10:15 ` Jeffrey A Law
@ 1999-05-27 12:09   ` Jerry Quinn
  1999-05-27 13:22     ` Jerry Quinn
  1999-05-31 21:36     ` Jerry Quinn
  1999-05-31 21:36   ` Jeffrey A Law
  1 sibling, 2 replies; 20+ messages in thread
From: Jerry Quinn @ 1999-05-27 12:09 UTC (permalink / raw)
  To: law; +Cc: Geert Bosch, egcs

Jeffrey A Law wrote:
> 
> Anyway, back to the PA8000 performance problem from Jerry.  I doubt the
> problem here is register renaming.  But it's hard to get a handle on it
> right now.  The most obvious inefficiency in the assembly code is not
> using the extended displacement for FP loads & stores.
> 
> It'll be interesting to see the performance once that is cleaned up (and it'll
> be a lot easier to analyze simply because the number of insns will be
> significantly reduced).

I had gcc generate a non-gas assembler file and then tweaked the float
loads to use long offsets.  It makes a 1-2% improvement as expected.

It turns out that gcc loses to aCC because there is a constant that gcc
creates as a single-precision value and converts to double at every use, while
aCC emits it as a double when writing the assembly.

I attached the source file for you to see.  The constant 0.333 is being
compiled as a float rather than a double, so we lose due to conversions
after every load.

Jerry

-- 
Jerry Quinn                             Tel: (514) 761-8737
jquinn@nortelnetworks.com               Fax: (514) 761-8505
Speech Recognition Research

#include <stdio.h>

#define LSIZE   256L

double doSth (double lfPosX, double lfPosY, double lfPosZ,
              double lfPntX, double lfPntY, double lfPntZ,
              double lfSqrRadius);

//====================================================================
int main ()
//====================================================================
{
  long          lSizeX, lSizeY, lSizeZ; // volume dimensions
  long          lActX,  lActY,  lActZ;  // actual voxel numbers
  double        lfPosX, lfPosY, lfPosZ; // center of zeroth voxel
  double        lfActX, lfActY, lfActZ; // actual position
  double        lfPntX, lfPntY, lfPntZ; // 'some' point in space
  double        lfSum = 0.;             // result

  // handling a LSIZE^3 volume (no allocation, only coordinate calculation)
  lSizeX = LSIZE;
  lSizeY = LSIZE;
  lSizeZ = LSIZE;

  // initializing 'some' point
  lfPntX = 0.1234;
  lfPntY = 0.3456;
  lfPntZ = 0.5678;

  // set center of volume to origin of (cartesian) coordinate system
  lfPosX = - (double) (lSizeX -1L) * 0.5;
  lfPosY = - (double) (lSizeY -1L) * 0.5;
  lfPosZ = - (double) (lSizeZ -1L) * 0.5;

  // going thru voxel centers
  for (lActZ=0L; lActZ<lSizeZ; lActZ++)
  {
    lfActZ = lfPosZ + (double) lActZ;
    for (lActY=0L; lActY<lSizeY; lActY++)
    {
      lfActY = lfPosY + (double) lActY;
      for (lActX=0L; lActX<lSizeX; lActX++)
      {
        lfActX = lfPosX + (double) lActX;

        lfSum += doSth (lfActX, lfActY, lfActZ,
                        lfPntX, lfPntY, lfPntZ,
                        (double) LSIZE * 0.333);
      }
    }
  }

  printf ("lfSum = %10.3f\n", lfSum);

  return (0);
}

//====================================================================
double doSth (double lfPosX, double lfPosY, double lfPosZ,
              double lfPntX, double lfPntY, double lfPntZ,
              double lfSqrRadius)
//====================================================================
{
  double        lfDistX, lfDistY, lfDistZ;

  lfDistX = lfPntX - lfPosX;
  lfDistY = lfPntY - lfPosY;
  lfDistZ = lfPntZ - lfPosZ;

  if (lfDistX * lfDistX + lfDistY * lfDistY + lfDistZ * lfDistZ <= lfSqrRadius)
    return (1.);
  else
    return (0.);
}

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-25  8:44   ` Jerry Quinn
@ 1999-05-25 10:33     ` Jeffrey A Law
  1999-05-31 21:36       ` Jeffrey A Law
  1999-05-31 21:36     ` Jerry Quinn
  1 sibling, 1 reply; 20+ messages in thread
From: Jeffrey A Law @ 1999-05-25 10:33 UTC (permalink / raw)
  To: Jerry Quinn; +Cc: egcs

  In message < 374AC4C5.614AD369@americasm01.nt.com >you write:
  > The actual weirdness I was trying to address was that when I manually
  > remove the false dependency by changing register allocation to another
  > temp register, i.e., changing b = a + b to c = a + b and updating the
  > ensuing code, performance got WORSE by a little bit.  If anything, I
  > would expect it to either be improved (due to eliminating a dependency)
  > or unchanged (latency hidden by reordering, or no latency problem).
It can vary.  Predicting performance for out of order execution machines
is difficult at best.  I wouldn't worry too much about it right now (i.e.,
I suspect there are other things we can do to improve performance, which
will also make it easier to analyze the code later).

  > It almost suggested that there is some kind of cache effect inside the
  > processor favoring recently used registers over ones that are idle.  Is
  > this possible/reasonable?  If so, it would make correct machine modeling
  > much more complex.
Not that I'm aware of.  At least not in the PA8000 or PA8200.  I don't know
about the PA8500 or PA8700.

  > > I did some fooling around with this stuff a while back and it was worth a
  > > few more percent across specfp.
  > 
  > I saw something similar when I played with hobbling the aCC assembly
  > this way.  If I get a chance, I'll look at adding this stuff in.
I would recommend it.  It'll be a little more complex than the stuff you've
been working on, but not terribly so.

For example, some parts of the PA backend try to rewrite addresses for FP
loads and stores to make better use of the -16 .. 15 displacement.  You'll
need to tweak them (LEGITIMIZE_ADDRESS, LEGITIMIZE_RELOAD_ADDRESS).  You'll
also need to update GO_IF_LEGITIMATE_ADDRESS.  Assembler & BFD work is also
needed to handle the larger displacements for FP loads & stores.

My recommendation is to get the compiler stuff working first -- use the HP
assembler to test your code (and benchmark any improvements in your app).



  > I can send the aCC assembly if you'd like to see what aCC did vs gcc.
I really wouldn't have the time to look at it.  My focus is on gcc-2.95,
not performance issues for future releases.


  > One other thing I noted.  Gcc still generated fmpysub, so I'll have to
  > update the pattern in pa.md.  Shutting off pa_combine_instructions still
  > lets this pattern operate.  There was actually a performance loss by
  > having the instruction in there.  I know you said this, but I didn't
  > actually believe it.  I'm still not sure I actually understand.  Even
  > though fmpyadd/sub takes two reorder buffer slots, it shouldn't be any
  > slower than fmpy followed by fsub, should it?  If anything I would think
  > there is still a bit of benefit by having one less instruction - at
  > least in this particular example.
An fmpysub cannot retire until both operations are finished, so it holds its
resources until then.  An fmpy followed by an fsub allows either
operation to retire as soon as it is complete, returning resources to the
processor (particularly reorder buffer slots) for use by other insns.

You should investigate how the fmpysub was created.  That shouldn't be
happening.

  > I've also played with a couple of other things.  I tried different
  > BRANCH_COST values up to 10 and ran it on our code, plus the performance
  > test suite that Mark Lehmann set up.  There doesn't seem to be much
  > benefit to changing the value that I can see.  Perhaps a value of 2 or 3
  > is a bit better, but there were plenty of test cases that got worse as
  > well as better.
Like most optimization work, it will rarely help across the board.  I suspect a value
of 2 is best for the PA8000.  We may need to tweak it further for PA8200,
PA8500 and PA8700.


  > Finally, in High Performance Computing 2nd Ed., there is an appendix
  > that sketches the PA8000.  One thing they mention that I haven't seen in
  > HP's publicly available papers is that branches take a slot in both the
  > memory reorder buffer and the arithmetic buffer.  I tried this and saw a
  > bit (1%?) drop in performance.  Is it possible that this comment isn't
  > correct?  I simply added branch instructions to the memory function unit
  > as well as the alu function unit.
It is correct, but I don't think trying to model it is going to work. 
Remember that the scheduler does not schedule branches.  I have no idea what
exposing this aspect of the PA architecture would do to the schedules.

jeff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-25  9:46 Geert Bosch
@ 1999-05-25 10:15 ` Jeffrey A Law
  1999-05-27 12:09   ` Jerry Quinn
  1999-05-31 21:36   ` Jeffrey A Law
  1999-05-31 21:36 ` Geert Bosch
  1 sibling, 2 replies; 20+ messages in thread
From: Jeffrey A Law @ 1999-05-25 10:15 UTC (permalink / raw)
  To: Geert Bosch; +Cc: Jerry Quinn, egcs

  In message < 9905251642.AA17457@nile.gnat.com >you write:
  > This could well be possible. At least on Intel Pentium II, referencing
  > register names that are still in use by a non-retired intruction
  > can be faster than using new register names. The issue here is
  > mapping the register names to actual (renamed) registers.
I don't think this is the case on the PA8000 series machines though.


  > Favoring new renamings of already used register names over
  > using more register names may also reduce register pressure.
Yes.  More generally, latency scheduling for an out of order execution machine
does not make sense, except for those instructions with a long latency 
(several cycles).  The larger the reorder buffer, the longer latency the
processor can hide.

By ignoring relatively small latencies you reduce register pressure as a
secondary side effect.

The PA8000 scheduler tries to do this on a small scale basis.  It's still
missing some important cases.


Anyway, back to the PA8000 performance problem from Jerry.  I doubt the
problem here is register renaming.  But it's hard to get a handle on it
right now.  The most obvious inefficiency in the assembly code is not
using the extended displacement for FP loads & stores.

It'll be interesting to see the performance once that is cleaned up (and it'll
be a lot easier to analyze simply because the number of insns will be 
significantly reduced).





jeff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
@ 1999-05-25  9:46 Geert Bosch
  1999-05-25 10:15 ` Jeffrey A Law
  1999-05-31 21:36 ` Geert Bosch
  0 siblings, 2 replies; 20+ messages in thread
From: Geert Bosch @ 1999-05-25  9:46 UTC (permalink / raw)
  To: Jerry Quinn, law; +Cc: egcs

On Tue, 25 May 1999 11:41:57 -0400, Jerry Quinn wrote:

  It almost suggested that there is some kind of cache effect inside the
  processor favoring recently used registers over ones that are idle.  Is
  this possible/reasonable?  If so, it would make correct machine modeling
  much more complex.

This could well be possible. At least on Intel Pentium II, referencing
register names that are still in use by a non-retired instruction
can be faster than using new register names. The issue here is
mapping the register names to actual (renamed) registers.

Favoring new renamings of already used register names over
using more register names may also reduce register pressure.

Compare the schedules below for the following program fragment:

Mem2 := Mem1 * Mem1;
Mem4 := Mem3 + 1;

  Schedule 1)       Schedule 2)

1 rA := Mem1;       rA := Mem1;
2 rB := Mem3;       rA := rA * rA;
3 rA := rA * rA;    Mem2 := rA;
4 rB := rB + 1;     rA := Mem3;
5 Mem2 := rA;       rA := rA + 1;
6 Mem4 := rB;       Mem4 := rA;

In 1) we use two register names, and 4 register instances.
On the other hand in 2) we only use a single name, still
with 4 instances. It is interesting to look at the effects
of out of order execution on this schedule. It may be that
schedule 2) performs as well as 1) while using fewer register
names.

Let's assume a machine that issues 1 instruction per cycle, has one
unit that can read, two for arithmetic and two for write. Read and
Write have a latency of three cycles, Add 1 and Mult 5.

  Execution 1)     Execution 2)

1 Read           1 Read
  |   2 Read       |
  |     :          |
3 Mult  |        2 Mult 4 Read
  |     |          |      |
  |   4 Add        |      |
  |   5 Write      |    5 Add
  |     |          |    6 Write
6 Write |        3 Write  |
  |     *          |      |
  |                |      *
  *                *

Regards,
   Geert



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-22  6:08 ` Jeffrey A Law
@ 1999-05-25  8:44   ` Jerry Quinn
  1999-05-25 10:33     ` Jeffrey A Law
  1999-05-31 21:36     ` Jerry Quinn
  1999-05-31 21:36   ` Jeffrey A Law
  1 sibling, 2 replies; 20+ messages in thread
From: Jerry Quinn @ 1999-05-25  8:44 UTC (permalink / raw)
  To: law; +Cc: egcs

Jeffrey A Law wrote:
> 
>   In message < 199905211645.MAA03952@wmtl249c.us.nortel.com >you write:
>   >
>   > I was looking at a test case someone had posted originally showing
>   > poor g++ (vs gcc) performance and began looking at the performance
>   > difference between aCC and egcs.  On this particular piece of code,
>   > aCC wins by a sizeable margin.
>   >
>   > In attempting to understand what is going on, I started looking at
>   > register allocation in the doSth subroutine.  egcs heavily favors
>   > reusing registers even when others are available while aCC apparently
>   > tries to spread things out among registers when it can (i.e. b = a + b
>   > vs. c = a + b)
> Yes.  Register renaming.  It can avoid false dependency stalls added by
> register allocation in some circumstances.  GCC doesn't support this yet,
> but does have some heuristics in the register allocator to try and avoid
> creating false dependency stalls.

The actual weirdness I was trying to address was that when I manually
remove the false dependency by changing register allocation to another
temp register, i.e., changing b = a + b to c = a + b and updating the
ensuing code, performance got WORSE by a little bit.  If anything, I
would expect it to either be improved (due to eliminating a dependency)
or unchanged (latency hidden by reordering, or no latency problem).

It almost suggested that there is some kind of cache effect inside the
processor favoring recently used registers over ones that are idle.  Is
this possible/reasonable?  If so, it would make correct machine modeling
much more complex.


> The most obvious thing I see is we are not taking advantage of the larger
> displacement allowed in reg+disp addresses for FP insns.
> 
> In PA1.0/PA1.1 a FP load/store is only allowed a 5 bit displacement.  Thus
> you see all those  "ldo disp(base), temp; fldds 0(temp),fptarget".
> 
> In PA2.0 FP loads/stores can have an aligned 16bit offset, which would
> allow most of the ldo instructions to disappear.  This (of course) requires
> more GAS work to support the larger displacements.
> 
> I did some fooling around with this stuff a while back and it was worth a
> few more percent across specfp.

I saw something similar when I played with hobbling the aCC assembly
this way.  If I get a chance, I'll look at adding this stuff in.


> It may also be the case that using targeted FP compares rather than queued
> FP compares would help.   I've never done any experiments to see what the benefit was,
> but presumably PA, mips & others that added multiple FP condition code regs
> did so for a reason :-)


 
> Jeff

I can send the aCC assembly if you'd like to see what aCC did vs gcc.

One other thing I noted.  Gcc still generated fmpysub, so I'll have to
update the pattern in pa.md.  Shutting off pa_combine_instructions still
lets this pattern operate.  There was actually a performance loss by
having the instruction in there.  I know you said this, but I didn't
actually believe it.  I'm still not sure I actually understand.  Even
though fmpyadd/sub takes two reorder buffer slots, it shouldn't be any
slower than fmpy followed by fsub, should it?  If anything I would think
there is still a bit of benefit by having one less instruction - at
least in this particular example.

I've also played with a couple of other things.  I tried different
BRANCH_COST values up to 10 and ran it on our code, plus the performance
test suite that Mark Lehmann set up.  There doesn't seem to be much
benefit to changing the value that I can see.  Perhaps a value of 2 or 3
is a bit better, but there were plenty of test cases that got worse as
well as better.

Finally, in High Performance Computing 2nd Ed., there is an appendix
that sketches the PA8000.  One thing they mention that I haven't seen in
HP's publicly available papers is that branches take a slot in both the
memory reorder buffer and the arithmetic buffer.  I tried this and saw a
bit (1%?) drop in performance.  Is it possible that this comment isn't
correct?  I simply added branch instructions to the memory function unit
as well as the alu function unit.

Jerry

-- 
Jerry Quinn                             Tel: (514) 761-8737
jquinn@nortelnetworks.com               Fax: (514) 761-8505
Speech Recognition Research

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: PA8000 performance oddity
  1999-05-21 14:25 Jerry Quinn
@ 1999-05-22  6:08 ` Jeffrey A Law
  1999-05-25  8:44   ` Jerry Quinn
  1999-05-31 21:36   ` Jeffrey A Law
  1999-05-31 21:36 ` Jerry Quinn
  1 sibling, 2 replies; 20+ messages in thread
From: Jeffrey A Law @ 1999-05-22  6:08 UTC (permalink / raw)
  To: Jerry Quinn; +Cc: egcs

  In message < 199905211645.MAA03952@wmtl249c.us.nortel.com >you write:
  > 
  > I was looking at a test case someone had posted originally showing
  > poor g++ (vs gcc) performance and began looking at the performance
  > difference between aCC and egcs.  On this particular piece of code,
  > aCC wins by a sizeable margin.
  > 
  > In attempting to understand what is going on, I started looking at
  > register allocation in the doSth subroutine.  egcs heavily favors
  > reusing registers even when others are available while aCC apparently
  > tries to spread things out among registers when it can (i.e. b = a + b 
  > vs. c = a + b)
Yes.  Register renaming.  It can avoid false dependency stalls added by
register allocation in some circumstances.  GCC doesn't support this yet,
but does have some heuristics in the register allocator to try and avoid
creating false dependency stalls.

[ ... ]

  > I've included the complete assembly for anyone that cares to try it.
  > It is followed by a patch file to make the register change.  Link
  > using the 990517 snapshot and a recent binutils snapshot which has the
  > pa2.0 support.
  > 
  > Does anyone have a clue why this would happen?  Am I missing something
  > very obvious?
The most obvious thing I see is we are not taking advantage of the larger
displacement allowed in reg+disp addresses for FP insns.

In PA1.0/PA1.1 a FP load/store is only allowed a 5 bit displacement.  Thus
you see all those  "ldo disp(base), temp; fldds 0(temp),fptarget".

In PA2.0 FP loads/stores can have an aligned 16bit offset, which would
allow most of the ldo instructions to disappear.  This (of course) requires
more GAS work to support the larger displacements.

I did some fooling around with this stuff a while back and it was worth a
few more percent across specfp.

It may also help to use targeted FP compares rather than queued FP
compares.  I've never done any experiments to see what the benefit is,
but presumably the PA, MIPS and the others that added multiple FP
condition-code registers did so for a reason :-)

Jeff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* PA8000 performance oddity
@ 1999-05-21 14:25 Jerry Quinn
  1999-05-22  6:08 ` Jeffrey A Law
  1999-05-31 21:36 ` Jerry Quinn
  0 siblings, 2 replies; 20+ messages in thread
From: Jerry Quinn @ 1999-05-21 14:25 UTC (permalink / raw)
  To: egcs

I was looking at a test case someone had posted originally showing
poor g++ (vs gcc) performance and began looking at the performance
difference between aCC and egcs.  On this particular piece of code,
aCC wins by a sizeable margin.

In attempting to understand what is going on, I started looking at
register allocation in the doSth subroutine.  egcs heavily favors
reusing registers even when others are available, while aCC apparently
tries to spread things out among registers when it can (i.e. b = a + b
vs. c = a + b).

I wanted to see if reusing registers was causing a problem, so I
started reassigning registers by hand in this routine.  Then I found
this weird performance impact.  The subroutine has the following op
sequence:

r6 = r24 - r7
r24 = r6 * r6
r8 = r23 - r5
r27 = r22 - r26
r10 = r8 * r8 + r24
r11 = r27 * r27 + r10
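
For reference, a rough C++ rendering of that sequence (variable names
are mine, not from the test case):

    double sum_of_squares(double x0, double y0, double z0,
                          double x1, double y1, double z1)
    {
        double dx   = x1 - x0;          // r6  = r24 - r7
        double sx   = dx * dx;          // r24 = r6 * r6
        double dy   = y1 - y0;          // r8  = r23 - r5
        double dz   = z1 - z0;          // r27 = r22 - r26
        double sxy  = dy * dy + sx;     // r10 = r8 * r8 + r24
        double sxyz = dz * dz + sxy;    // r11 = r27 * r27 + r10
        return sxyz;
    }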

I'd already changed a few assignments by hand without any noticeable
gain or loss.  I tried changing the second op to write to r9, on the
theory that it might allow the fifth op to execute sooner.  However,
doing this actually slowed the program down by 2%, repeatably.  The
times don't vary by more than .01 second.

Time as shown:
2.79u 0.02s 0:02.87 97.9%

Time with register change:
2.85u 0.02s 0:02.91 98.6%


I've included the complete assembly for anyone that cares to try it.
It is followed by a patch file to make the register change.  Link
using the 990517 snapshot and a recent binutils snapshot which has the
pa2.0 support.

Does anyone have a clue why this would happen?  Am I missing something
very obvious?

-- 
Jerry Quinn                             Tel: (514) 761-8737
jquinn@nortelnetworks.com               Fax: (514) 761-8505
Speech Recognition Research

------ cut here -------------------------------------
	.LEVEL 2.0
	.SPACE $PRIVATE$
	.SUBSPA $DATA$,QUAD=1,ALIGN=8,ACCESS=31
	.SUBSPA $BSS$,QUAD=1,ALIGN=8,ACCESS=31,ZERO,SORT=82
	.SPACE $TEXT$
	.SUBSPA $LIT$,QUAD=0,ALIGN=8,ACCESS=44
	.SUBSPA $CODE$,QUAD=0,ALIGN=8,ACCESS=44,CODE_ONLY
	.IMPORT $global$,DATA
	.IMPORT $$dyncall,MILLICODE
; gcc_compiled.:
	.IMPORT __main,CODE
	.IMPORT doSth__Fddddddd,CODE
	.IMPORT printf,CODE
	.SPACE $TEXT$
	.SUBSPA $LIT$

	.align 4
L$C0005
	.STRING "lfSum = %10.3f\x0a\x00"
	.align 8
L$C0000
	.word 0x3fbf9724
	.word 0x74538ef3
	.align 8
L$C0001
	.word 0x3fd61e4f
	.word 0x765fd8ae
	.align 8
L$C0002
	.word 0x3fe22b6a
	.word 0xe7d566cf
	.align 8
L$C0004
	.word 0x40554fdf
	.word 0x3b645a1d
	.align 8
L$C0006
	.word 0xc05fe000
	.word 0x0
	.SPACE $TEXT$
	.SUBSPA $CODE$

	.align 4
	.NSUBSPA $CODE$,QUAD=0,ALIGN=8,ACCESS=44,CODE_ONLY
	.EXPORT main,ENTRY,PRIV_LEV=3,RTNVAL=GR
main
	.PROC
	.CALLINFO FRAME=192,CALLS,SAVE_RP,ENTRY_GR=11,ENTRY_FR=19
	.ENTRY
	stw %r2,-20(%r30)
	stwm %r11,192(%r30)
	stw %r9,-184(%r30)
	ldo -152(%r30),%r1
	ldi 256,%r9
	stw %r10,-188(%r30)
	stw %r8,-180(%r30)
	stw %r7,-176(%r30)
	stw %r6,-172(%r30)
	stw %r5,-168(%r30)
	stw %r4,-164(%r30)
	stw %r3,-160(%r30)
	fstds,ma %fr19,8(%r1)
	fstds,ma %fr18,8(%r1)
	fstds,ma %fr17,8(%r1)
	fstds,ma %fr16,8(%r1)
	fstds,ma %fr15,8(%r1)
	fstds,ma %fr14,8(%r1)
	fcpy,dbl %fr0,%fr14
	fstds,ma %fr13,8(%r1)
	.CALL 
	bl __main,%r2
	fstds,ma %fr12,8(%r1)
	ldil LR'L$C0000,%r19
	ldil LR'L$C0001,%r20
	ldo RR'L$C0000(%r19),%r19
	ldil LR'L$C0002,%r21
	fldds 0(%r19),%fr19
	ldil LR'L$C0004,%r22
	ldil LR'L$C0006,%r19
	ldo RR'L$C0001(%r20),%r20
	ldo RR'L$C0006(%r19),%r19
	ldo RR'L$C0002(%r21),%r21
	ldo RR'L$C0004(%r22),%r22
	fldds 0(%r19),%fr15
	fldds 0(%r20),%fr18
	ldi 0,%r19
	fldds 0(%r21),%fr17
	fldds 0(%r22),%fr16
	stw %r19,-16(%r30)
L$0040
	ldo 1(%r19),%r11
	ldi 0,%r19
	fldws -16(%r30),%fr23L
	fcnvxf,sgl,dbl %fr23L,%fr22
	fadd,dbl %fr15,%fr22,%fr13
	stw %r19,-16(%r30)
L$0039
	ldo 1(%r19),%r10
	ldi 0,%r3
	extru %r9,31,2,%r19
	fldws -16(%r30),%fr23L
	fcnvxf,sgl,dbl %fr23L,%fr22
	comib,>= 0,%r9,L$0020
	fadd,dbl %fr15,%fr22,%fr12
	comib,= 0,%r19,L$0038
	ldo -56(%r30),%r4
	comib,>=,n 1,%r19,L$0020
	comib,>= 2,%r19,L$0021
	ldo -56(%r30),%r19
	ldo -64(%r30),%r20
	fstds %fr13,0(%r19)
	ldo -72(%r30),%r21
	ldo -80(%r30),%r19
	ldo -88(%r30),%r22
	fstds %fr19,0(%r20)
	fcpy,dbl %fr15,%fr5
	fcpy,dbl %fr12,%fr7
	ldi 1,%r3
	fstds %fr18,0(%r21)
	fstds %fr17,0(%r19)
	.CALL ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU
	bl doSth__Fddddddd,%r2
	fstds %fr16,0(%r22)
	fadd,dbl %fr14,%fr4,%fr14
L$0021
	ldo -56(%r30),%r19
	ldo -64(%r30),%r20
	fstds %fr13,0(%r19)
	ldo -72(%r30),%r21
	ldo -80(%r30),%r22
	ldo -88(%r30),%r19
	stw %r3,-16(%r30)
	fcpy,dbl %fr12,%fr7
	fldws -16(%r30),%fr22L
	fstds %fr19,0(%r20)
	fcnvxf,sgl,dbl %fr22L,%fr5
	fstds %fr18,0(%r21)
	ldo 1(%r3),%r3
	fadd,dbl %fr15,%fr5,%fr5
	fstds %fr17,0(%r22)
	.CALL ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU
	bl doSth__Fddddddd,%r2
	fstds %fr16,0(%r19)
	fadd,dbl %fr14,%fr4,%fr14
L$0020
	ldo -56(%r30),%r19
	ldo -64(%r30),%r20
	fstds %fr13,0(%r19)
	ldo -72(%r30),%r21
	ldo -80(%r30),%r22
	ldo -88(%r30),%r19
	stw %r3,-16(%r30)
	fcpy,dbl %fr12,%fr7
	fldws -16(%r30),%fr23L
	fstds %fr19,0(%r20)
	fcnvxf,sgl,dbl %fr23L,%fr5
	fstds %fr18,0(%r21)
	ldo 1(%r3),%r3
	fadd,dbl %fr15,%fr5,%fr5
	fstds %fr17,0(%r22)
	b L$0037
	fstds %fr16,0(%r19)
L$0014
	ldo -56(%r30),%r4
L$0038
	ldo -64(%r30),%r5
	fstds %fr13,0(%r4)
	ldo -72(%r30),%r7
	ldo -80(%r30),%r8
	ldo -88(%r30),%r6
	stw %r3,-16(%r30)
	fcpy,dbl %fr12,%fr7
	fldws -16(%r30),%fr22L
	fstds %fr19,0(%r5)
	fcnvxf,sgl,dbl %fr22L,%fr5
	fstds %fr18,0(%r7)
	fadd,dbl %fr15,%fr5,%fr5
	fstds %fr17,0(%r8)
	.CALL ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU
	bl doSth__Fddddddd,%r2
	fstds %fr16,0(%r6)
	fstds %fr13,0(%r4)
	ldo 1(%r3),%r19
	fcpy,dbl %fr12,%fr7
	stw %r19,-16(%r30)
	fadd,dbl %fr14,%fr4,%fr14
	fldws -16(%r30),%fr22L
	fstds %fr19,0(%r5)
	fcnvxf,sgl,dbl %fr22L,%fr5
	fstds %fr18,0(%r7)
	fadd,dbl %fr15,%fr5,%fr5
	fstds %fr17,0(%r8)
	.CALL ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU
	bl doSth__Fddddddd,%r2
	fstds %fr16,0(%r6)
	fstds %fr13,0(%r4)
	ldo 2(%r3),%r19
	fcpy,dbl %fr12,%fr7
	stw %r19,-16(%r30)
	fadd,dbl %fr14,%fr4,%fr14
	fldws -16(%r30),%fr22L
	fstds %fr19,0(%r5)
	fcnvxf,sgl,dbl %fr22L,%fr5
	fstds %fr18,0(%r7)
	fadd,dbl %fr15,%fr5,%fr5
	fstds %fr17,0(%r8)
	.CALL ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU
	bl doSth__Fddddddd,%r2
	fstds %fr16,0(%r6)
	fstds %fr13,0(%r4)
	ldo 3(%r3),%r19
	fcpy,dbl %fr12,%fr7
	ldo 4(%r3),%r3
	stw %r19,-16(%r30)
	fadd,dbl %fr14,%fr4,%fr14
	fldws -16(%r30),%fr22L
	fstds %fr19,0(%r5)
	fcnvxf,sgl,dbl %fr22L,%fr5
	fstds %fr18,0(%r7)
	fadd,dbl %fr15,%fr5,%fr5
	fstds %fr17,0(%r8)
	fstds %fr16,0(%r6)
L$0037
	.CALL ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU
	bl doSth__Fddddddd,%r2
	nop
	comb,> %r9,%r3,L$0014
	fadd,dbl %fr14,%fr4,%fr14
	copy %r10,%r19
	comb,>,n %r9,%r19,L$0039
	stw %r19,-16(%r30)
	copy %r11,%r19
	comb,>,n %r9,%r19,L$0040
	stw %r19,-16(%r30)
	fcpy,dbl %fr14,%fr7
	ldil LR'L$C0005,%r26
	.CALL ARGW0=GR,ARGW2=FR,ARGW3=FU
	bl printf,%r2
	ldo RR'L$C0005(%r26),%r26
	ldo -152(%r30),%r1
	ldw -212(%r30),%r2
	fldds,ma 8(%r1),%fr19
	ldi 0,%r28
	ldw -188(%r30),%r10
	fldds,ma 8(%r1),%fr18
	ldw -184(%r30),%r9
	fldds,ma 8(%r1),%fr17
	ldw -180(%r30),%r8
	fldds,ma 8(%r1),%fr16
	ldw -176(%r30),%r7
	fldds,ma 8(%r1),%fr15
	ldw -172(%r30),%r6
	fldds,ma 8(%r1),%fr14
	ldw -168(%r30),%r5
	fldds,ma 8(%r1),%fr13
	ldw -164(%r30),%r4
	ldw -160(%r30),%r3
	fldds,ma 8(%r1),%fr12
	bv %r0(%r2)
	ldwm -192(%r30),%r11
	.EXIT
	.PROCEND
	.align 4
	.NSUBSPA $CODE$,QUAD=0,ALIGN=8,ACCESS=44,CODE_ONLY
	.PARAM __static_initialization_and_destruction_0,ARGW0=GR,ARGW1=GR
__static_initialization_and_destruction_0
	.PROC
	.CALLINFO FRAME=0,NO_CALLS
	.ENTRY
	bv,n %r0(%r2)
	.EXIT
	.PROCEND
	.SPACE $TEXT$
	.SUBSPA $LIT$

	.align 8
L$C0008
	.word 0x3ff00000
	.word 0x0
	.SPACE $TEXT$
	.SUBSPA $CODE$

	.align 4
	.NSUBSPA $CODE$,QUAD=0,ALIGN=8,ACCESS=44,CODE_ONLY
	.EXPORT doSth__Fddddddd,ENTRY,PRIV_LEV=3,ARGW0=FR,ARGW1=FU,ARGW2=FR,ARGW3=FU,RTNVAL=FU
doSth__Fddddddd
	.PROC
	.CALLINFO FRAME=0,NO_CALLS
	.ENTRY
	ldo -72(%r30),%r19
	ldo -64(%r30),%r20
	fldds 0(%r19),%fr24
	ldo -56(%r30),%r21
	fldds 0(%r20),%fr23
	ldo -80(%r30),%r19
	fsub,dbl %fr24,%fr7,%fr6
	fldds 0(%r19),%fr22
;	fmpysub,dbl %fr24,%fr24,%fr24,%fr5,%fr23
	fmpy,dbl %fr6,%fr6,%fr9
	fsub,dbl %fr23,%fr5,%fr8
	fldds 0(%r21),%fr26
	ldo -88(%r30),%r19
	fsub,dbl %fr22,%fr26,%fr27
	fldds 0(%r19),%fr25
	fmpyfadd,dbl %fr8,%fr8,%fr9,%fr10
	fmpyfadd,dbl %fr27,%fr27,%fr10,%fr11
	fcmp,dbl,> %fr11,%fr25
	ftest
	b L$0042
	ldil LR'L$C0008,%r19
	bv %r0(%r2)
	fcpy,dbl %fr0,%fr4
L$0042
	ldo RR'L$C0008(%r19),%r19
	bv %r0(%r2)
	fldds 0(%r19),%fr4
	.EXIT
	.PROCEND
	.align 4
	.NSUBSPA $CODE$,QUAD=0,ALIGN=8,ACCESS=44,CODE_ONLY
	.PARAM __static_initialization_and_destruction_1,ARGW0=GR,ARGW1=GR
__static_initialization_and_destruction_1
	.PROC
	.CALLINFO FRAME=0,NO_CALLS
	.ENTRY
	bv,n %r0(%r2)
	.EXIT
	.PROCEND
------------cut here------------------------
*** nolfv.s     Fri May 21 11:55:10 1999
--- nolfv.mod.s Fri May 21 12:26:50 1999
***************
*** 296,308 ****
        fsub,dbl %fr24,%fr7,%fr6
        fldds 0(%r19),%fr22
  ;     fmpysub,dbl %fr24,%fr24,%fr24,%fr5,%fr23
!       fmpy,dbl %fr6,%fr6,%fr24
        fsub,dbl %fr23,%fr5,%fr8
        fldds 0(%r21),%fr26
        ldo -88(%r30),%r19
        fsub,dbl %fr22,%fr26,%fr27
        fldds 0(%r19),%fr25
!       fmpyfadd,dbl %fr8,%fr8,%fr24,%fr10
        fmpyfadd,dbl %fr27,%fr27,%fr10,%fr11
        fcmp,dbl,> %fr11,%fr25
        ftest
--- 296,308 ----
        fsub,dbl %fr24,%fr7,%fr6
        fldds 0(%r19),%fr22
  ;     fmpysub,dbl %fr24,%fr24,%fr24,%fr5,%fr23
!       fmpy,dbl %fr6,%fr6,%fr9
        fsub,dbl %fr23,%fr5,%fr8
        fldds 0(%r21),%fr26
        ldo -88(%r30),%r19
        fsub,dbl %fr22,%fr26,%fr27
        fldds 0(%r19),%fr25
!       fmpyfadd,dbl %fr8,%fr8,%fr9,%fr10
        fmpyfadd,dbl %fr27,%fr27,%fr10,%fr11
        fcmp,dbl,> %fr11,%fr25
        ftest

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~1999-05-31 21:36 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1999-05-25 19:54 PA8000 performance oddity N8TM
1999-05-31 21:36 ` N8TM
  -- strict thread matches above, loose matches on Subject: below --
1999-05-25  9:46 Geert Bosch
1999-05-25 10:15 ` Jeffrey A Law
1999-05-27 12:09   ` Jerry Quinn
1999-05-27 13:22     ` Jerry Quinn
1999-05-27 23:10       ` Gerald Pfeifer
1999-05-31 21:36         ` Gerald Pfeifer
1999-05-31 21:36       ` Jerry Quinn
1999-05-31 21:36     ` Jerry Quinn
1999-05-31 21:36   ` Jeffrey A Law
1999-05-31 21:36 ` Geert Bosch
1999-05-21 14:25 Jerry Quinn
1999-05-22  6:08 ` Jeffrey A Law
1999-05-25  8:44   ` Jerry Quinn
1999-05-25 10:33     ` Jeffrey A Law
1999-05-31 21:36       ` Jeffrey A Law
1999-05-31 21:36     ` Jerry Quinn
1999-05-31 21:36   ` Jeffrey A Law
1999-05-31 21:36 ` Jerry Quinn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).