From mboxrd@z Thu Jan 1 00:00:00 1970
From: N8TM@aol.com
To: law@cygnus.com
Cc: amylaar@cygnus.co.uk, egcs@cygnus.com
Subject: Re: 19980707 built on win95/i686-pc-cygwin32
Date: Sun, 12 Jul 1998 00:41:00 -0000
Message-id: <404f573f.35a864cf@aol.com>
X-SW-Source: 1998-07/msg00425.html

In a message dated 7/11/98 11:04:08 PM Pacific Daylight Time,
law@hurl.cygnus.com writes:

> Any chance you could analyze this code in more detail?  I'm quite
> interested in cases where gcse makes code slower.

Thanks for the suggestion.  The performance differences turn out to be
confined to small parts of my benchmark codes.  In the double precision
Livermore Kernels, -fno-gcse makes a significant difference in just 2 of
the 24 kernel tests; by significant I mean a difference greater than the
"experimental timing error" assessed by the benchmark code.

I ran these tests on a 233 MHz Pentium II with -funroll-loops
-malign-double -march=pentiumpro -O2, using binutils-2.9.1 installed
with the p2align hooks.  I quote numbers from win95/cygwin32 first, then
mention the comparison with Linux.  win95 timings were done with
sys_clock() rewritten with QueryPerformance WinAPI calls; Linux timings
with cpu_time() from libU77.

Kernel 9 performance at vector length 101 drops from 58 to 52 Mflops
with gcse; at vector length 15 it is 57 Mflops either way.  It is
definitely abnormal for performance to drop with increasing vector
length, and to me this indicates an increased cache miss rate.

Kernel 16 drops from 39 to 34 Mflops with gcse, at all vector lengths
(15, 40, 75).  I don't see anything to tell whether this is a code size
or a jump-target alignment effect; it could easily be the latter.

Kernels 9, 11, and 16 are the only ones where Linux
(i686-pc-linux-gnulibc1) performance is significantly better than
win95/cygwin32.  Linux does not exhibit any reduced performance with
increased vector length, which again supports the supposition of an
adverse cache effect under win95.
Maybe tomorrow I will get a chance to look at the .s code for these
cases.  I would look for a correlation with code size or order of data
access.  I don't know that I'm likely to see anything, or to figure out
how to see the results obtained by p2align.

I do see consistent increases in run time for number-crunching codes
going from Linux to win95.  On real codes, some of that evidently is in
the slowness of disk file access under win95.

Let's see if I can paste in source code for Kernels 9 and 16:

C***********************************************************************
C***  KERNEL 9      INTEGRATE PREDICTORS
C***********************************************************************
C
      do k= 1,n
        px(1,k)= dm28*px(13,k)+dm27*px(12,k)+dm26*px(11,k)+dm25*px(1
     &0,k)+dm24*px(9,k)+dm23*px(8,k)+dm22*px(7,k)+c0*(px(5,k)+px(6,k))+p
     &x(3,k)
      enddo
C***********************************************************************
C***  KERNEL 16     MONTE CARLO SEARCH LOOP
C***********************************************************************
C
      do m= 1,zone(1)
        j2= (n+n)*(m-1)+1
        do k= 1,n
          k2= k2+1
          j4= j2+k+k
          j5= zone(j4)
          if(j5 >= n)then
            if(j5 == n)then
              exit
            endif
            k3= k3+1
            if(d(j5) < d(j5-1)*(t-d(j5-2))**2+(s-d(j5-3))**2+
     &         (r-d(j5-4))**2)then
              goto 200
            endif
            if(d(j5) == d(j5-1)*(t-d(j5-2))**2+(s-d(j5-3))**2+
     &         (r-d(j5-4))**2)then
              exit
            endif
          else
            if(j5-n+lb < 0)then
              if(plan(j5) < t)then
                goto 200
              endif
              if(plan(j5) == t)then
                exit
              endif
            else
              if(j5-n+ii < 0)then
                if(plan(j5) < s)then
                  goto 200
                endif
                if(plan(j5) == s)then
                  exit
                endif
              else
                if(plan(j5) < r)then
                  goto 200
                endif
                if(plan(j5) == r)then
                  exit
                endif
              endif
            endif
          endif
          if(zone(j4-1) <= 0)then
            goto 200
          endif
        enddo
        exit
  200   if(zone(j4-1) == 0)then
          exit
        endif
      enddo