From mboxrd@z Thu Jan  1 00:00:00 1970
From: N8TM@aol.com
To: pcg@goof.com, rth@cygnus.com, hjstein@bfr.co.il, toon@moene.indiv.nluug.nl, burley@gnu.org
Cc: egcs@cygnus.com
Subject: Re: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
Date: Mon, 21 Dec 1998 23:30:00 -0000
Message-id: <c73f6118.367f4a69@aol.com>
X-SW-Source: 1998-12/msg00843.html

Contrary to an opinion put forth in this exchange, I see that alignment of
64-bit spills makes a measurable difference in performance on my Pentium II.

I noticed, somewhat accidentally, that Livermore Fortran Kernel 8 runs 10%
faster when linked with cygwin-b20.1 than with cygwin-b19/coolview.  I built
the compiler today under cygwin-b19, and the performance of all the other
kernels was unchanged from the previous version of egcs/g77.  Relinking the
same .o against the other .dll accounted for the whole difference; it did not
matter which .dll the bash I ran under was linked with.

Examining the code, nine loop-invariant REAL*8 scalars are spilled outside the
two innermost loops.  Each is restored once inside the inner loop.  There are
15 REAL*8 memory accesses directly to COMMON in the inner loop and, I believe,
33 floating-point operations.  In addition, five pointers are spilled and
restored in the inner loop.  The 10% increase in execution time with a
mis-aligned stack would indicate that the penalty for restoring a spilled
REAL*8 is twice as great when it is mis-aligned, even though it surely would
stay in level 1 cache in the absence of cache mapping conflicts.
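
A minimal sketch in C of the address arithmetic behind this, not taken from
the kernel or from egcs.  The function names are hypothetical, and the
32-byte line size in the usage note is an assumption about the Pentium II
level 1 cache rather than something measured here.  It shows how to test
whether a spill slot is 8-byte aligned, how a prologue could round a stack
address down to an 8-byte boundary, and when an 8-byte slot that is only
4-byte aligned straddles a cache line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if addr is a multiple of alignment (which must be a power of
   two), else 0. */
int is_aligned(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (alignment - 1)) == 0;
}

/* Round addr down to a multiple of alignment (a power of two), as one
   would for a stack that grows toward lower addresses. */
uintptr_t align_down(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return addr & ~(alignment - 1);
}

/* Return 1 if an access of size bytes starting at addr touches two
   different blocks of line bytes, else 0.  size and line must be nonzero. */
int straddles(uintptr_t addr, size_t size, uintptr_t line)
{
    assert(size != 0 && line != 0);
    return (addr / line) != ((addr + size - 1) / line);
}
```

For example, a REAL*8 slot at address 0x1004 fails is_aligned(0x1004, 8)
but stays within one assumed 32-byte line, while a slot at 0x101C spans two
such lines.  With a stack that is only 4-byte aligned, which of these cases
a given spill hits depends on where the frame happens to start, which may
be why relinking against a different .dll can change the timing.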

As I have mentioned several times before, I had noticed that -O2 code ran
slower than -Os code on Windows 95, while this effect was not repeated on
linux-gnulibc1.  Today's finding confirms that effects like this stemmed from
mis-alignment of the stack, together with the smaller number of spills
generated with -Os.  With the up-to-date versions of both g77 and cygwin-b20,
there are no longer any Livermore Kernels which run slower with -O2 than -Os.

Not to say there are no challenges left!  I still find a few cases where the
commercial compiler lf90 4.50g runs 40% faster than g77 (as well as a smaller
number where g77 excels).  Apparently, there are no 80-bit spills or
mis-aligned COMMONs in that Lahey version, unlike the current l95.