From mboxrd@z Thu Jan 1 00:00:00 1970
From: N8TM@aol.com
To: pcg@goof.com, rth@cygnus.com, hjstein@bfr.co.il, toon@moene.indiv.nluug.nl, burley@gnu.org
Cc: egcs@cygnus.com
Subject: Re: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
Date: Mon, 21 Dec 1998 23:30:00 -0000
Message-id:
X-SW-Source: 1998-12/msg00843.html

Contrary to an opinion put forth in this exchange, I see that alignment of 64-bit spills makes a measurable difference in performance on my Pentium 2. I noticed, somewhat accidentally, that Livermore Fortran Kernel 8 runs 10% faster when linked with cygwin-b20.1 than with cygwin-b19/coolview. I built the compiler today under cygwin-b19, and the performance of all the other kernels was unchanged from the previous version of egcs/g77. Relinking the same .o with the different .dll made the difference, and it made no difference whether I ran under bash linked with one .dll or the other.

Examining the code, I see that 9 loop-invariant REAL*8 scalars are spilled outside the 2 innermost loops. Each is restored once inside the inner loop. There are 15 REAL*8 memory accesses directly to COMMON in the inner loop, and I believe 33 floating-point operations. In addition, 5 pointers are spilled and restored in the inner loop. The 10% increase in execution time for a mis-aligned stack would indicate that the penalty for restoring a spilled REAL*8 is twice as great when it is mis-aligned, even though it surely would stay in level 1 cache in the absence of cache mapping conflicts.

As I had mentioned several times earlier, I had noticed that the -O2 code was running slower on W95 than -Os code, while this effect was not repeated on linux-gnulibc1. Today's finding confirms that effects like this stemmed from mis-alignment of the stack, together with the smaller number of spills generated with -Os. With the up-to-date versions of both g77 and cygwin-b20, there are no longer any Livermore Kernels which run slower with -O2 than with -Os.

Not to say there are no challenges left!
I still find a few cases where the commercial compiler lf90 4.50g runs 40% faster than g77 (as well as a smaller number where g77 excels). Apparently, there are no 80-bit spills or mis-aligned COMMONs in that Lahey version, unlike the current l95.