From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-92988-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 17024 invoked by alias); 18 Mar 2004 17:30:15 -0000
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
Received: (qmail 16979 invoked from network); 18 Mar 2004 17:30:13 -0000
Received: from unknown (HELO www.eyesopen.COM) (12.96.199.11)
  by sources.redhat.com with SMTP; 18 Mar 2004 17:30:13 -0000
Received: from localhost (roger@localhost)
	by www.eyesopen.COM (8.11.6/8.11.6) with ESMTP id i2IGEj718627
	for <gcc@gcc.gnu.org>; Thu, 18 Mar 2004 09:14:45 -0700
Date: Thu, 18 Mar 2004 17:49:00 -0000
From: Roger Sayle <roger@eyesopen.com>
To: gcc@gcc.gnu.org
Subject: x87 float truncation/accuracy (gcc vs. icc/msvc)
Message-ID: <Pine.LNX.4.44.0403180818300.16213-100000@www.eyesopen.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-SW-Source: 2004-03/txt/msg01069.txt.bz2


My boss recently alerted me to an anomalous performance change in
a piece of code he was working on.  The reduced test case is shown
below:

float foo(float *x)
{
  int i;
  float y = 0.0;
  for (i=0; i<10; i++)
    y += 2.0*x[i];
  return y;
}

which on x87 runs at less than half the speed of the same code with
"2.0" changed to "2.0f"!?  The cause can be seen in the assembly output:

foo:    subl    $4, %esp
        fldz
        movl    8(%esp), %edx
        xorl    %eax, %eax
.L6:    flds    (%edx,%eax,4)
        incl    %eax
        cmpl    $9, %eax
        fadd    %st(0), %st
        faddp   %st, %st(1)
        fstps   (%esp)           <--- here
        flds    (%esp)           <--- here
        jle     .L6
        popl    %eax
        ret


The two instructions marked "here" are removed by using the float
constant, but are present with a double constant, even with -ffast-math.
The problem is that these instructions round %st(0) to a "float" by
storing it to memory and reading it back in again.  Changing the type
of "y" to double also resolves the problem (even though the addition
is actually XFmode, we don't attempt to correctly round down to DFmode).
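
To make the effect of the store/reload pair concrete, here is a minimal C
sketch of the two behaviours.  The function names sum_round_each_step and
sum_round_once are hypothetical, and a double accumulator merely stands in
for the wider x87 register, so this only models the idea on a target
without excess precision (for example SSE math).

```c
/* Models the code with the fstps/flds pair: the accumulator is a float,
   so every iteration rounds the running sum to 24 bits of mantissa. */
float sum_round_each_step(const float *x)
{
  int i;
  float y = 0.0f;
  for (i = 0; i < 10; i++)
    y += 2.0 * x[i];      /* computed in double, then rounded to float */
  return y;
}

/* Models the code without the pair: the running sum stays in a wider
   format (double here, as a stand-in) and is rounded once on return. */
float sum_round_once(const float *x)
{
  int i;
  double acc = 0.0;
  for (i = 0; i < 10; i++)
    acc += 2.0 * x[i];
  return (float) acc;
}
```

With x[0] = 16777216.0f and the other nine elements 0.5f, the first
function returns 33554432.0f (each added 1.0 is lost to rounding), whereas
the second returns 33554440.0f.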

In the original code, changing the coefficient's type actually produced
significantly different results between the two variants.  Particularly
interesting is that both the Microsoft Visual C/C++ compiler and Intel's
icc *by default* completely optimized away this "float_truncate",
producing incorrectly rounded results.

The same problem also hurts GCC on the popular "mflop" benchmark.


My interest now is how best to catch this transformation/optimization
using flag_unsafe_math_optimizations.  My analysis so far is that
this is an i386.md specific transformation.  On many machines "float"
operations are faster than "double", and their hardware often supports
efficient "double->float" conversion.  The IA-32 architecture on the
other hand seems unique in that commercial compilers are free to consider
"truncdfsf2" as a no-op, in the same way as "extendsfdf2".
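
The asymmetry between the two conversions can be checked in plain C.  This
is a small sketch with hypothetical helper names; it assumes IEEE single
and double formats and that casts really round (no excess precision).

```c
/* float -> double (the "extendsfdf2" direction) is exact: every float
   value is also a double value, so the round trip gives back the input. */
int extend_is_lossless(float f)
{
  double d = (double) f;
  return (float) d == f;
}

/* double -> float (the "truncdfsf2" direction) may round: returns 1 if
   converting d to float and back changes the value. */
int truncate_changes_value(double d)
{
  float f = (float) d;
  return (double) f != d;
}
```

Here extend_is_lossless(0.1f) is 1, truncate_changes_value(0.1) is 1 and
truncate_changes_value(0.5) is 0; dropping the truncation is only
harmless in cases like the last one.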

Do any of the x86 backend gurus have any suggestions as to how best
to implement "truncdfsf2" as a move between x87 registers, but as a
regular "fst*s" instruction for memory targets?  My initial attempt
was to simply guard the following splitter with !flag_unsafe_math_...

(define_split
  [(set (match_operand:SF 0 "register_operand" "")
        (float_truncate:SF
         (match_operand:DF 1 "fp_register_operand" "")))
   (clobber (match_operand:SF 2 "memory_operand" ""))]
  "TARGET_80387 && reload_completed"
  [(set (match_dup 2) (float_truncate:SF (match_dup 1)))
   (set (match_dup 0) (match_dup 2))]
  "")

Alas this failed miserably.

Any advice would be much appreciated.  I've confirmed that GCC performs
the related "safe" constant folding optimizations, such as converting
"(float)((double)f1 op (double)f2)" into "f1 op f2" for floating point
values f1 and f2, and operation op one of add, sub or mul.  For "mul",
for example, the product of the two 24 bit mantissas of IEEE "float"s
needs at most 48 bits, which can't overflow the 53 bit mantissa of an
IEEE double, so the double product is exact, there's no double rounding,
and the float multiplication returns the same (perfectly rounded)
result.  These don't help the code above however, which is fundamentally
unsafe and not normally a win except on Intel.
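
The claim for "mul" can be spot-checked with a few lines of C.  This is
only a sketch: the function name count_mul_mismatches and the simple
pseudo-random generator are my own choices, it assumes IEEE formats and
round-to-nearest without excess precision, and sampling is of course not
a proof.

```c
/* Returns how many of n pseudo-random float pairs give a different result
   for (float)((double)f1 * (double)f2) than for the float product f1 * f2. */
unsigned count_mul_mismatches(unsigned n)
{
  unsigned i, mismatches = 0;
  unsigned long state = 12345UL;        /* arbitrary seed */

  for (i = 0; i < n; i++) {
    float f1, f2, narrow, wide;

    /* Linear congruential steps; 24 bits of the state are used as an
       integer mantissa, so each value is exactly representable as a
       float, and the power-of-two divisions are exact too. */
    state = (state * 1103515245UL + 12345UL) & 0xffffffffUL;
    f1 = (float) ((state >> 8) & 0xffffffUL) / 1024.0f;
    state = (state * 1103515245UL + 12345UL) & 0xffffffffUL;
    f2 = (float) ((state >> 8) & 0xffffffUL) / 4096.0f;

    narrow = f1 * f2;                             /* float multiply */
    wide = (float) ((double) f1 * (double) f2);   /* exact product, one rounding */
    if (narrow != wide)
      mismatches++;
  }
  return mismatches;
}
```

Under those assumptions a call such as count_mul_mismatches(100000)
should report no mismatches.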

Roger
--