From: Roger Sayle
To: gcc@gcc.gnu.org
Date: Thu, 18 Mar 2004 17:49:00 -0000
Subject: x87 float truncation/accuracy (gcc vs. icc/msvc)

My boss recently alerted me to an anomalous performance change in a
piece of code he was working on. The reduced test case is shown below:

  float foo(float *x)
  {
    int i;
    float y = 0.0;

    for (i=0; i<10; i++)
      y += 2.0*x[i];
    return y;
  }

which on x87 runs at less than half the speed of the same code with
the "2.0" changed to "2.0f"!? The cause can be seen in the assembly
output:

  foo:    subl    $4, %esp
          fldz
          movl    8(%esp), %edx
          xorl    %eax, %eax
  .L6:    flds    (%edx,%eax,4)
          incl    %eax
          cmpl    $9, %eax
          fadd    %st(0), %st
          faddp   %st, %st(1)
          fstps   (%esp)          <--- here
          flds    (%esp)          <--- here
          jle     .L6
          popl    %eax
          ret

The two instructions marked "here" are removed by using the float
constant, but are present with a double constant, even with
-ffast-math. The problem is that these instructions round %st(0) to
a "float" by storing it to memory and reading it back in again.

Changing the type of "y" to double also resolves the problem (even
though the addition is actually XFmode, we don't attempt to correctly
round down to DFmode). In the original code, changing the coefficient's
type actually produced significantly different results between the two
variants.
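The effect of the store/reload pair can be reproduced in portable C by making the rounding steps explicit. The sketch below is illustrative only and is not taken from the original code: both function names are hypothetical, and double is used as a stand-in for the wider x87 register. `sum_round_each_step` narrows the running total to float on every iteration, as the fstps/flds pair does; `sum_round_once` keeps the total wide and rounds only on return, which is roughly what a compiler does when it drops the truncation.

```c
#include <stddef.h>

/* Round the accumulator to float after every iteration, as the
   fstps/flds pair does. The volatile store forces the narrowing even
   when the compiler evaluates float expressions in a wider format. */
float sum_round_each_step(const float *x, size_t n)
{
    volatile float y = 0.0f;
    for (size_t i = 0; i < n; i++)
        y = (float)((double)y + 2.0 * (double)x[i]);
    return y;
}

/* Keep the accumulator wide and round once at the end. Here double
   stands in for the x87 register, which is wider still. */
float sum_round_once(const float *x, size_t n)
{
    double y = 0.0;
    for (size_t i = 0; i < n; i++)
        y += 2.0 * (double)x[i];
    return (float)y;
}
```

For an input of 0.5f followed by nine elements equal to 2^-26, the first function returns exactly 1.0f, because each small addend is lost when the total is narrowed, while the second accumulates them and returns 1.0f + 2^-22.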

Particularly interesting is that both the Microsoft Visual C/C++
compiler and Intel's icc *by default* completely optimize away this
"float_truncate", producing incorrectly rounded results. The same
problem also hurts GCC on the popular "mflop" benchmark.

My interest now is how best to catch this transformation/optimization
using flag_unsafe_math_optimizations. My analysis so far is that this
is an i386.md specific transformation. On many machines "float"
operations are faster than "double", and their hardware often supports
efficient "double->float" conversion. The IA-32 architecture, on the
other hand, seems unique in that commercial compilers are free to
consider "truncdfsf2" as a no-op, in the same way as "extendsfdf2".

Do any of the x86 backend gurus have any suggestions as to how best to
implement "truncdfsf2" as a move between x87 registers, but as a
regular "fst*s" instruction for memory targets? My initial attempt was
to simply guard the following splitter with !flag_unsafe_math_...

  (define_split
    [(set (match_operand:SF 0 "register_operand" "")
          (float_truncate:SF
            (match_operand:DF 1 "fp_register_operand" "")))
     (clobber (match_operand:SF 2 "memory_operand" ""))]
    "TARGET_80387 && reload_completed"
    [(set (match_dup 2) (float_truncate:SF (match_dup 1)))
     (set (match_dup 0) (match_dup 2))]
    "")

Alas, this failed miserably. Any advice would be much appreciated.

I've confirmed that GCC performs the related "safe" constant folding
optimizations, such as converting "(float)((double)f1 op (double)f2)"
into "f1 op f2" for floating point values f1 and f2, and operation op
one of add, sub or mul. For "mul", for example, the product of the two
24 bit mantissas of IEEE "float" values needs at most 48 bits, so it
can't overflow the 53 bit mantissa of an IEEE double. The double
product is therefore exact, there's no double rounding, and the float
multiplication returns the same (perfectly rounded) result. These
don't help the code above, however, which is fundamentally unsafe and
not normally a win except on Intel.

Roger
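
The "safe" folding claim for add, sub and mul can be checked empirically. The following is a minimal sketch, assuming IEEE 754 single and double arithmetic with round-to-nearest; the helper names and the linear congruential generator are hypothetical choices made only for reproducibility, not anything from GCC. It compares each operation done directly in float against the same operation done in double and then narrowed, and counts disagreements.

```c
#include <stdint.h>

/* Small deterministic generator (an arbitrary LCG, used only so that
   the check is reproducible). */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Build a finite float from 24 random mantissa bits, a random sign and
   a random power-of-two scale. Every step is exact in float. */
static float random_float(uint32_t *state)
{
    uint32_t r = lcg_next(state);
    float mant = (float)(r >> 8);                   /* 0 .. 2^24 - 1 */
    unsigned shift = lcg_next(state) >> 27;         /* 0 .. 31 */
    float scale = 1.0f / (float)(1u << shift);      /* exact power of two */
    float v = mant * scale;
    return ((r >> 7) & 1u) ? -v : v;
}

/* Count cases where (float)((double)a op (double)b) differs from
   a op b computed in float, for op in {+, -, *}. */
unsigned count_folding_mismatches(unsigned trials)
{
    uint32_t state = 12345u;
    unsigned mismatches = 0;

    for (unsigned i = 0; i < trials; i++) {
        float a = random_float(&state);
        float b = random_float(&state);

        /* volatile keeps the compiler from folding the wide form into
           the narrow one, and forces results to their declared width. */
        volatile double wide_add = (double)a + (double)b;
        volatile double wide_sub = (double)a - (double)b;
        volatile double wide_mul = (double)a * (double)b;
        volatile float narrow_add = a + b;
        volatile float narrow_sub = a - b;
        volatile float narrow_mul = a * b;

        if ((float)wide_add != narrow_add) mismatches++;
        if ((float)wide_sub != narrow_sub) mismatches++;
        if ((float)wide_mul != narrow_mul) mismatches++;
    }
    return mismatches;
}
```

Under the stated assumptions the count should be zero for any number of trials, which is what makes this folding safe. No such guarantee holds for a loop that carries an unrounded accumulator across iterations, as in the test case above.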