From mboxrd@z Thu Jan 1 00:00:00 1970
From: N8TM@aol.com
To: law@cygnus.com
Cc: amylaar@cygnus.co.uk, egcs@cygnus.com
Subject: Re: 19980707 built on win95/i686-pc-cygwin32
Date: Sun, 12 Jul 1998 00:41:00 -0000
Message-id: <404f573f.35a864cf@aol.com>
X-SW-Source: 1998-07/msg00425.html

In a message dated 7/11/98 11:04:08 PM Pacific Daylight Time,
law@hurl.cygnus.com writes:

> Any chance you could analyze this code in more detail?  I'm quite
> interested in cases where gcse makes code slower.

Thanks for the suggestion.  The performance differences turn out to be
confined to small parts of my benchmark codes.  In the double precision
Livermore Kernels, -fno-gcse makes a significant difference in just 2 of
the 24 kernel tests; by significant I mean a difference greater than the
"experimental timing error" assessed by the benchmark code.

I ran these tests on a 233 MHz Pentium II with -funroll-loops
-malign-double -march=pentiumpro -O2, using binutils-2.9.1 installed
with the p2align hooks.  I quote numbers from win95/cygwin32 first, then
mention the comparison with Linux.  win95 timings were done with
sys_clock() rewritten with QueryPerformance WinAPI calls; Linux timings
with cpu_time() from libU77.

Kernel 9 performance at vector length 101 drops from 58 to 52 Mflops
with gcse; at vector length 15 it is 57 Mflops either way.  It is
definitely abnormal for performance to drop with increasing vector
length, and to me this indicates an increased cache miss rate.

Kernel 16 drops from 39 to 34 Mflops with gcse, at all vector lengths
(15, 40, 75).  I don't see anything to tell whether this is a code size
or a jump-target alignment effect; it could easily be the latter.

Kernels 9, 11, and 16 are the only ones where Linux
(i686-pc-linux-gnulibc1) performance is significantly better than
win95/cygwin32.  Linux does not exhibit any reduced performance with
increased vector length, which again supports the supposition of an
adverse cache effect under win95.
Maybe tomorrow I will get a chance to look at the .s code for these
cases.  I would look for a correlation with code size or order of data
access.  I don't know that I'm likely to see anything, or to figure out
how to see the results obtained by p2align.

I do see consistent increases in run time for number-crunching codes
going from Linux to win95.  On real codes, some of that evidently is in
the slowness of disk file access under win95.

Let's see if I can paste in source code for Kernels 9 and 16:

C***********************************************************************
C***  KERNEL 9      INTEGRATE PREDICTORS
C***********************************************************************
C
      do k= 1,n
        px(1,k)= dm28*px(13,k)+dm27*px(12,k)+dm26*px(11,k)+dm25*px(1
     &0,k)+dm24*px(9,k)+dm23*px(8,k)+dm22*px(7,k)+c0*(px(5,k)+px(6,k))+p
     &x(3,k)
      enddo
C***********************************************************************
C***  KERNEL 16     MONTE CARLO SEARCH LOOP
C***********************************************************************
C
      do m= 1,zone(1)
        j2= (n+n)*(m-1)+1
        do k= 1,n
          k2= k2+1
          j4= j2+k+k
          j5= zone(j4)
          if(j5 >= n)then
            if(j5 == n)then
              exit
            endif
            k3= k3+1
            if(d(j5) < d(j5-1)*(t-d(j5-2))**2+(s-d(j5-3))**2+
     &         (r-d(j5-4))**2)then
              goto 200
            endif
            if(d(j5) == d(j5-1)*(t-d(j5-2))**2+(s-d(j5-3))**2+
     &         (r-d(j5-4))**2)then
              exit
            endif
          else
            if(j5-n+lb < 0)then
              if(plan(j5) < t)then
                goto 200
              endif
              if(plan(j5) == t)then
                exit
              endif
            else
              if(j5-n+ii < 0)then
                if(plan(j5) < s)then
                  goto 200
                endif
                if(plan(j5) == s)then
                  exit
                endif
              else
                if(plan(j5) < r)then
                  goto 200
                endif
                if(plan(j5) == r)then
                  exit
                endif
              endif
            endif
          endif
          if(zone(j4-1) <= 0)then
            goto 200
          endif
        enddo
        exit
  200   if(zone(j4-1) == 0)then
          exit
        endif
      enddo