Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
@ 1998-12-13 18:23 Stephen L Moshier
  1998-12-14  1:52 ` Harvey J. Stein
  0 siblings, 1 reply; 65+ messages in thread
From: Stephen L Moshier @ 1998-12-13 18:23 UTC (permalink / raw)
  To: tprince, egcs

The extra-precise registers are supposed to be a feature, not a bug.
Neither the computer language nor the compiler has a way to say
"this is an extra-precise register" so there is some inconvenience
using the feature.  It can't be made consistent.  The harder you look,
the more contradictions you find.

If you don't believe that, the alternative that makes sense is to
ask for straight IEEE behavior.  You can't get IEEE behavior without
setting the coprocessor rounding precision.  After you set the
rounding precision, all the other bugs disappear except for a rare
hardware bug or two.  The hardware bugs are dealt with by a trap
handler in the operating system, in the time honored fashion
of Intel, Borland, or Microsoft.

So there could be a straightforward plan to make x86 obey IEEE.
It's doubtful there could be a workable plan to fix the extra-precise
registers; anyway, they are a feature, no fix is needed!

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-13 18:23 FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86 Stephen L Moshier
@ 1998-12-14  1:52 ` Harvey J. Stein
  1998-12-14 14:56   ` Edward Jason Riedy
  0 siblings, 1 reply; 65+ messages in thread
From: Harvey J. Stein @ 1998-12-14  1:52 UTC (permalink / raw)
  To: moshier; +Cc: hjstein

Stephen L Moshier <moshier@mediaone.net> writes:

 > The extra-precise registers are supposed to be a feature, not a bug.
 > Neither the computer language nor the compiler has a way to say
 > "this is an extra-precise register" so there is some inconvenience
 > using the feature.  It can't be made consistent.  The harder you look,
 > the more contradictions you find.
 > 
 > If you don't believe that, the alternative that makes sense is to
 > ask for straight IEEE behavior.  You can't get IEEE behavior without
 > setting the coprocessor rounding precision.  After you set the
 > rounding precision, all the other bugs disappear except for a rare
 > hardware bug or two.  The hardware bugs are dealt with by a trap
 > handler in the operating system, in the time honored fashion
 > of Intel, Borland, or Microsoft.
 > 
 > So there could be a straightforward plan to make x86 obey IEEE.
 > It's doubtful there could be a workable plan to fix the extra-precise
 > registers; anyway, they are a feature, no fix is needed!

I agree wholeheartedly.

Reasonable floating point code should expect that reordering
operations will produce slightly different results due to round off
error, and should be tolerant of the optimizer doing such.  Especially
given how little control the programmer has over exactly how
computations are ordered.

What floating point code *doesn't* expect is that there are multiple
underlying representations with different precisions, and that
identical computations might yield different results depending on
which representation happens to get used.  For example, most code will
expect that after:

   x = 1.0;
   y = 1.0;

that f(x) and f(y) will always be equal (for an f with no side
effects).

It will also expect that after

   x = 1.0/3.0;
   y = 1.0/3.0;

that x and y will always be equal.

Both of these assumptions break on ix86 because of the extended
precision FPU registers.

These are very reasonable things to expect, and much code breaks when
they're not satisfied.  One would think that this can be fixed by the
programmer with careful use of tolerances in comparisons, but this is
not the case.  Even if *very* carefully (i.e. - with knowledge of the
precision of the underlying representations), it will often just cause
discontinuities which create worse problems elsewhere.  But, it
*can't* be done carefully, because values in registers and values in
memory can accumulate differently, thus affecting more than just the
excess precision bits.

C and Fortran give little control over order of operations and
virtually no control over where such variances in precision might crop
up.  The compiler is free to use registers as it sees fit, and the
programmer has no control over it aside from recourse to assembler
programming.

So, well behaved, well written floating point code might behave well
on motorola, sparc & alpha CPUs but fail miserably on ix86 CPUs.  In
particular, the problems that have been discussed don't have to do
with compiler reordering & optimization.  They only have to do with
the fact that registers can contain excess precision, which AFAIK,
only happens on the ix86 CPUs.

As was pointed out, -ffloat-store only goes part way to fixing the
problem, because it doesn't affect compiler generated temporaries.  It
also makes the code slower.

I think the option of setting the fpu precision to 53 bits is a great
solution.  Even the Intel assember manual says:

   The double precision and single precision settings reduce the size
   of the significand to 53 and 24 bits, respectively.  These settings
   are providied to support the IEEE standard and to allow exact
   replication of calculations which were done using the lower
   precision data types.

I think Intel's comments about how great it is to use the extra
precision is just propaganda.  I think that it maybe helps *slightly*
with badly written code.  But it can make it impossible to write good
code.  Maybe it was just a poorly thought out clever idea for fixing
bad code.  Maybe it's to avoid having to have single precision
and mixed single/double register instructions, coupled with lots of
marketing propaganda to prevent seeing this shortcomming.

I'd think the only drawback to using the FPU in double precision mode
instead of extended precision mode would be for carefully hand
optimized numerical routines in assembler which can possibly do things
quicker by utilizing the 64 bit extended precision mode.  For such
code the programmer could explicitly muck with the FPU control word -
save its value, set it to extended precision, do the computations &
restore the value.  The OS will have to save and restore the FPU
control word when context switching, but it has to do this now anyway
& the ix86 FPU state save & restore instructions include the FPU
control word.

Of course, it's also *not* going to solve problems with single
precision unless the compiler is careful to not mix single precision &
double precision operations on the FPU stack & to always keep the FPU
control word set to the appropriate value.  It'd seem, actually, that
this would be the only way to get IEEE conformance with code that uses
both single and double precision values.  Fortunately, it seems that
people are tending to just use double precision, so maybe this can be
put off for later.

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-14  1:52 ` Harvey J. Stein
@ 1998-12-14 14:56   ` Edward Jason Riedy
  1998-12-14 17:20     ` Joe Buck
  0 siblings, 1 reply; 65+ messages in thread
From: Edward Jason Riedy @ 1998-12-14 14:56 UTC (permalink / raw)
  To: Harvey J. Stein; +Cc: egcs

Oh well.  And Harvey J. Stein writes:
 - 
 - I think Intel's comments about how great it is to use the extra
 - precision is just propaganda.  I think that it maybe helps *slightly*
 - with badly written code.

No.  

Read through chapter four at
http://www.netlib.org/cgi-bin/checkout/blast/blast.pl .  The current
draft contains examples of good uses for extended precision.  (It's 
going to change within the next few days, so if the examples disappear, 
I still have a copy of the current draft.)

There are times when a few dabs of extended precision yield accurate
results from poorly-conditioned problems.  This is not a result of
poorly written code but rather walking the thin line between performance
and accuracy.  Sometimes you can get quadratic efficiency gains with no
loss of precision by using extended precision in hardware at the right
times.  In other cases, extended precision can yield a fully backwards
stable algorithm no matter what the input (e.g. iterative refinement of
a linear system's solution).

 - I'd think the only drawback to using the FPU in double precision mode
 - instead of extended precision mode would be for carefully hand
 - optimized numerical routines in assembler which can possibly do things
 - quicker by utilizing the 64 bit extended precision mode.

A little thought, planning, and lack of -O flags, and you can get huge
benefits from the 80-bit cells.  Of course, this is highly non-portable,
but more than a few folks do use the extended precision with gcc in
exactly that fashion.  You just don't hear from them often.

I do agree that setting the precision to the standard double by default
is a reasonable thing to do for the majority of the population.  I can 
cite cases where it'll break code, or rather I can go upstairs, knock on 
Dr. Kahan's door, and be pointed at a few.

I'd prefer a solution that keeps the extra precision whenever possible,
but my view is a somewhat compiler-aware numericist's.  I'm sure you'll 
agree that it doesn't reflect the majority of users (or, alas, the majority 
of numericists).  Still, there are ways to fix the problem well.  Any 
twiddle to the control words ought to be considered temporary.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-14 14:56   ` Edward Jason Riedy
@ 1998-12-14 17:20     ` Joe Buck
  1998-12-14 18:51       ` Edward Jason Riedy
  1998-12-14 22:54       ` Craig Burley
  0 siblings, 2 replies; 65+ messages in thread
From: Joe Buck @ 1998-12-14 17:20 UTC (permalink / raw)
  To: Edward Jason Riedy; +Cc: hjstein, egcs

> A little thought, planning, and lack of -O flags, and you can get huge
> benefits from the 80-bit cells.

It would seem that if we did 80-bit spills, then the -O flags would not
affect the results (assuming that the original code uses parentheses
to constrain the order of evaluation of FP expressions and that the
optimizer avoids transformations that can change numerical results,
such as pre-evaluating expressions with 64-bit that would otherwise
be evaluated using 80-bit precision at runtime).

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-14 17:20     ` Joe Buck
@ 1998-12-14 18:51       ` Edward Jason Riedy
  1998-12-14 21:54         ` Craig Burley
  1998-12-15 17:11         ` Jamie Lokier
  1998-12-14 22:54       ` Craig Burley
  1 sibling, 2 replies; 65+ messages in thread
From: Edward Jason Riedy @ 1998-12-14 18:51 UTC (permalink / raw)
  To: Joe Buck; +Cc: hjstein, egcs

Oh well.  And Joe Buck writes:
 - 
 - It would seem that if we did 80-bit spills, then the -O flags would not
 - affect the results (assuming that the original code uses parentheses
 - to constrain the order of evaluation of FP expressions and that the
 - optimizer avoids transformations that can change numerical results,
 - such as pre-evaluating expressions with 64-bit that would otherwise
 - be evaluated using 80-bit precision at runtime).

Yup.  Seeing someone else actually propose this is what prompted my 
jumping in...  

Using 80-bit spills is a quick approximation to extending the FP stack
into memory, and it should give some of the benefit with very little 
(hopefully) hassle.  Of course, 80 bits is wider than the normal spill,
so it eats more memory bandwidth, cache space, etc.  Anyone who's that
concerned will bend over backwards to avoid spills anyways.

I believe -ffast-math is necessary before result-changing reorderings 
are allowed.  (Someone from here corrected me on usenet when I assumed
otherwise, iirc.)  The pre-evaluation could conceivably be handled by 
running in extra precision at compilation (there are well-documented 
tricks that will work on anything with a guard bit, and gcc doesn't 
support the old Crays that don't have it), but that may be more trouble 
than it's worth.  If I were being paranoid, I wouldn't want to risk 
compiling in any round-off, so I'd evaluate it myself.  Extreme 
precision in compiler evaluations should be platform independant, too.

And if there are never in-expression truncations, -ffloat-store should
give proper IEEE rounding in all cases.  I'm basing that on the informal
definition of IEEE arithmetic (the correct result correctly rounded),
but it feels true.  Need to check opposing round-off errors before I can 
say that for sure.  That error should be at most a bit, if anything.

Overflow conditions wouldn't be triggered until the store, but any 
instructions to test that must come after the expression evaluation, so 
the result would be stored before the test.  I need to review what should 
happen with underflow...  Computing in 80 bits when you think it's a 
double could mean you're not dividing by zero when you would have been.
It might change a divide-by-zero (NaN or +-inf) into an overflow (+-inf).

Seems like 80-bit spills would be closer to IEEE, if not close enough to
claim full support.  Off-hand I'd say people would be happier with
those numeric properties than with the current situation.

I'll think about the esoterica (probably break down and ask people here).  
It would certainly make gcc the most reliable x86 floating-point compiler 
I know.  Sun's does something similar to -ffloat-store by default (gee, 
look at how fast that ultra is), and Microsoft's runs off the end of the 
stack at run-time .  I don't know what Intel's does, so I'm not considering 
it.  ;)

Jason, who will happily accept corrections to the above...

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-14 18:51       ` Edward Jason Riedy
@ 1998-12-14 21:54         ` Craig Burley
  1998-12-15 14:31           ` Edward Jason Riedy
  1998-12-15 17:11         ` Jamie Lokier
  1 sibling, 1 reply; 65+ messages in thread
From: Craig Burley @ 1998-12-14 21:54 UTC (permalink / raw)
  To: ejr; +Cc: burley

>  Sun's does something similar to -ffloat-store by default (gee, 
>look at how fast that ultra is)

Let me see if I understand the above, by translating it, and perhaps
helping continue to clarify the picture....

"Sun's compiler[s] effectively default to -ffloat-store on x86
machines, which not only increases consistency between x86 and
UltraSPARC executables under Solaris, but has the wonderful side
effect of making UltraSPARCs compare more favorably to x86 than
they would if they didn't make the x86's go through the extra
store/reload hoop for each assignment to a floating-point variable
to achieve this consistency."

In any case it is really wonderful to see all the commentary now
taking place over this issue.  I'm quite certain most of the
participants know more than I do about the issues -- probably all
of them.  That's probably the best thing about working on software
like egcs.

        tq vm, (burley)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-14 17:20     ` Joe Buck
  1998-12-14 18:51       ` Edward Jason Riedy
@ 1998-12-14 22:54       ` Craig Burley
  1 sibling, 0 replies; 65+ messages in thread
From: Craig Burley @ 1998-12-14 22:54 UTC (permalink / raw)
  To: jbuck; +Cc: burley

>> A little thought, planning, and lack of -O flags, and you can get huge
>> benefits from the 80-bit cells.
>
>It would seem that if we did 80-bit spills, then the -O flags would not
>affect the results (assuming that the original code uses parentheses
>to constrain the order of evaluation of FP expressions and that the
>optimizer avoids transformations that can change numerical results,
>such as pre-evaluating expressions with 64-bit that would otherwise
>be evaluated using 80-bit precision at runtime).

I wouldn't want us to say that, even assuming my proposal was adopted,
until we'd done at least a minimal audit of the compiler to reassure
ourselves that it was true.  Certainly I don't have enough confidence
or expertise to say it.

But it'd probably be nice if it was true, or could at least be made true
using an option or two!

        tq vm, (burley)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-14 21:54         ` Craig Burley
@ 1998-12-15 14:31           ` Edward Jason Riedy
  0 siblings, 0 replies; 65+ messages in thread
From: Edward Jason Riedy @ 1998-12-15 14:31 UTC (permalink / raw)
  To: Craig Burley; +Cc: egcs

Oh well.  And Craig Burley writes:
 - 
 - Let me see if I understand the above, by translating it, and perhaps
 - helping continue to clarify the picture....

I'm afraid my statement reads as a slant against Sun compiler engineers.
It shouldn't.  I don't doubt that they are trying to make x86 give the 
same numeric results as the Ultra.  They picked a float-store-like 
solution on a non-primary platform to avoid a probable re-design.  It's 
a valid engineering choice.  It's also documented in their man page
(under -fstore).

My statement was more an echo of what I hear from various users.

But yes, your translation is correct.  The Sun compiler defaults to 
the equivalent of -ffloat-store on x86, and that slaughters performance.  
They also give a large number of compiler options to play with FP modes 
and precisions.  It's really nice in that aspect.  I'd love to see 
things like that added to gcc, but obviously I'm going to have to wait 
until I learn enough to do it.  ;)

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-14 18:51       ` Edward Jason Riedy
  1998-12-14 21:54         ` Craig Burley
@ 1998-12-15 17:11         ` Jamie Lokier
  1998-12-16  0:26           ` Harvey J. Stein
                             ` (3 more replies)
  1 sibling, 4 replies; 65+ messages in thread
From: Jamie Lokier @ 1998-12-15 17:11 UTC (permalink / raw)
  To: Edward Jason Riedy, Joe Buck; +Cc: hjstein, egcs

On Mon, Dec 14, 1998 at 06:51:28PM -0800, Edward Jason Riedy wrote:
> Using 80-bit spills is a quick approximation to extending the FP stack
> into memory, and it should give some of the benefit with very little 
> (hopefully) hassle.  Of course, 80 bits is wider than the normal spill,
> so it eats more memory bandwidth, cache space, etc.  Anyone who's that
> concerned will bend over backwards to avoid spills anyways.

I like this 80-bit spills idea.

But given that you can just put the FPU into 64-bit precision mode
anyway to get predictable arithmetic, I would like to see the option to
use just 64-bit spills for those programs that do run the FPU in 64-bit
precision mode.

Maybe that option could be implied by -ffast-math.

-- Jamie

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15 17:11         ` Jamie Lokier
@ 1998-12-16  0:26           ` Harvey J. Stein
  1998-12-16  9:33             ` Craig Burley
  1998-12-16  9:38           ` Craig Burley
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 65+ messages in thread
From: Harvey J. Stein @ 1998-12-16  0:26 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: hjstein

Jamie Lokier <egcs@tantalophile.demon.co.uk> writes:

 > On Mon, Dec 14, 1998 at 06:51:28PM -0800, Edward Jason Riedy wrote:
 > > Using 80-bit spills is a quick approximation to extending the FP stack
 > > into memory, and it should give some of the benefit with very little 
 > > (hopefully) hassle.  Of course, 80 bits is wider than the normal spill,
 > > so it eats more memory bandwidth, cache space, etc.  Anyone who's that
 > > concerned will bend over backwards to avoid spills anyways.
 > 
 > I like this 80-bit spills idea.
 > 
 > But given that you can just put the FPU into 64-bit precision mode
 > anyway to get predictable arithmetic, I would like to see the option to
 > use just 64-bit spills for those programs that do run the FPU in 64-bit
 > precision mode.

I was just thinking along these lines.  It seems that in general when
one spills a register, one should spill the register and not part of
it.  This means that the spill width should be >= the FP register
width being used.

This would imply that for now, since the spill width is 64 bits &
can't immediately be changed, that the FPU be set to double precision,
and when gcc is capable of spilling other widths then the spill width
should be increased to 80 bits.  It'd be nice to allow it to be
selectable, but then you're really getting into what Craig Burley was
talking about - namely different compiled code could have different
spill widths and mucking with the FP control word could cause all
sorts of havoc.

 > Maybe that option could be implied by -ffast-math.

I'd much rather have more precise control over it.  Doesn't
-ffast-math imply various sorts of liberties to be taken?  This modulo
the above comment.

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-16  0:26           ` Harvey J. Stein
@ 1998-12-16  9:33             ` Craig Burley
  1998-12-16 12:18               ` Harvey J. Stein
  0 siblings, 1 reply; 65+ messages in thread
From: Craig Burley @ 1998-12-16  9:33 UTC (permalink / raw)
  To: hjstein; +Cc: burley

>This would imply that for now, since the spill width is 64 bits &
>can't immediately be changed, that the FPU be set to double precision,
>and when gcc is capable of spilling other widths then the spill width
>should be increased to 80 bits.

Okay, not a bad idea, but to set the FPU to double precision without
breaking existing code, the compiler would have to save the FPU
setting on entry to any procedure, ensure that it's restored when
that procedure returns, and restore it prior to any call, then
re-save and re-set it upon that call's return.  (And we cross our fingers
that the call wasn't made specifically to change the FPU settings,
to be made effective in the current routine, since we'd be breaking
that sort of thing.)

Somehow I don't think slowing FP performance to a crawl in gcc will
be considered acceptable, and I believe setting the FPU this way would
indeed slow performance to a crawl.

>It'd be nice to allow it to be
>selectable, but then you're really getting into what Craig Burley was
>talking about - namely different compiled code could have different
>spill widths and mucking with the FP control word could cause all
>sorts of havoc.

Even if you *don't* allow it to be selectable, you get into all that
as soon as you decide to set the FPU to 64 bits, or even to 80 bits.

AFAIK, the current assumption throughout the x86 world, except for
"leaf" nodes (codes, object files, libraries all produced by a
particular compiler vendor, generally incompatible with others),
is that the FPU is left in the default of 80 bits, though maybe set
by main() (or main's caller).

So IMO if we even changed gcc to explicitly set the FPU to *80* bits
on entry to any procedure, we'd be introducing potential problems
in some cases -- for users who've carefully ensured their code, when
compiled, worked fine when the FPU was set to 64 or 32 bits, and don't
realize gcc is now re-setting the FPU out from under them.

Put simply: we *cannot* fiddle with the FPU settings in any global sense,
without simultaneously "taking over" the entire codebase that gets
linked with gcc-produced code, something that I think is completely
infeasible compared to, for example, just implementing my proposal
(which isn't easy, but at least doesn't require a few million man-hours).

The most optimal step at this point still seems to be to continue to
gather (credible) information, and perhaps to adopt my proposal (which
calls for spilling FP registers to 80-bit temporaries -- presumably
actually 128-bit temps -- instead of 64-bit ones) as soon as feasible,
with an option to revert to the old behavior of spilling to 64-bit temps.

I do think that option name should be crafted to be precision-specific,
e.g.:

  -fchop-fp-spills
  -fchop-fp-spills-64
  -fchop-fp-spills-32

The first means "chop down to actual type of operation based on its
operands", the second means "chop down to 64 bits", etc.  I think the
first would mean the current behavior as well, but we should be careful
to note any differences, and perhaps account for them (via a slightly
different option, such as -fchop-fp-spills-old).

        tq vm, (burley)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15 17:11         ` Jamie Lokier
  1998-12-16  0:26           ` Harvey J. Stein
@ 1998-12-16  9:38           ` Craig Burley
  1998-12-16 12:25           ` Marc Lehmann
  1998-12-16 23:11           ` Joern Rennecke
  3 siblings, 0 replies; 65+ messages in thread
From: Craig Burley @ 1998-12-16  9:38 UTC (permalink / raw)
  To: egcs; +Cc: burley

>But given that you can just put the FPU into 64-bit precision mode
>anyway to get predictable arithmetic, I would like to see the option to
>use just 64-bit spills for those programs that do run the FPU in 64-bit
>precision mode.

I think my proposed -fchop-fp-spills option would provide this.  It
seems important, fundamentally, to provide an option to get the current
behavior at the same time we introduce proper spilling of complete FP
registers, in any case.

        tq vm, (burley)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-16  9:33             ` Craig Burley
@ 1998-12-16 12:18               ` Harvey J. Stein
  0 siblings, 0 replies; 65+ messages in thread
From: Harvey J. Stein @ 1998-12-16 12:18 UTC (permalink / raw)
  To: Craig Burley; +Cc: hjstein

Craig Burley <burley@gnu.org> writes:

 > >This would imply that for now, since the spill width is 64 bits &
 > >can't immediately be changed, that the FPU be set to double precision,
 > >and when gcc is capable of spilling other widths then the spill width
 > >should be increased to 80 bits.
 > 
 > Okay, not a bad idea, but to set the FPU to double precision without
 > breaking existing code, the compiler would have to save the FPU
 > setting on entry to any procedure, ensure that it's restored when
 > that procedure returns, and restore it prior to any call, then
 > re-save and re-set it upon that call's return.  (And we cross our fingers
 > that the call wasn't made specifically to change the FPU settings,
 > to be made effective in the current routine, since we'd be breaking
 > that sort of thing.)

I think this is massive over-engineering.  The numerics are so screwed
up anyway with the 64 bit spills & fcn return temporary variables that
I really can't imagine that there's actually code relying on gcc which
will break if the FPU setting was globally double precision.

Furthermore, given all this, I think that anyone who's using gcc &
intentionally writing code to exploit the 80 bit FPU registers (read -
implementers of things like libm) must be sufficiently on the ball to
twiddle the control word back & forth for his code that requires it.

I can't imagine anyone who's not intentionally writing such code
ever noticing such a change.

On the other hand, I think that most people will see numeric problems
go away if the FPU width matched external data width.

Yes, technically speaking, one should push & pop FPU width according
to what the programmer expected when written.  But the thing is, most
programmers haven't thought about it at all.  Why do they need it to
be preserved, let alone preserved in a broken way (80 bit registers
spilled into 64 bit memory)?  Remember - the code that's already
compiled is *always* going to be doing 64 bit spills.  It's only the
newly recompiled code that might get 80 bit spills.  The only thing we
can do to fix the old code is to run the FPU in double precision.
Without a doubt, many more things will get fixed than will get broken.

On the other hand, I don't even know why I'm arguing about it anymore!
I'll just stick the appropriate glibc call to put the FPU into 64 bit
mode in the main or MAIN or MAIN__ of all my code & be done with it.

 > Somehow I don't think slowing FP performance to a crawl in gcc will
 > be considered acceptable, and I believe setting the FPU this way would
 > indeed slow performance to a crawl.

Again, I think this is overkill.

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15 17:11         ` Jamie Lokier
  1998-12-16  0:26           ` Harvey J. Stein
  1998-12-16  9:38           ` Craig Burley
@ 1998-12-16 12:25           ` Marc Lehmann
  1998-12-16 12:50             ` Tim Hollebeek
  1998-12-16 23:11           ` Joern Rennecke
  3 siblings, 1 reply; 65+ messages in thread
From: Marc Lehmann @ 1998-12-16 12:25 UTC (permalink / raw)
  To: egcs

Sorry if my answers below were already answered, but I read the whole thread
and didn't see it.

On Wed, Dec 16, 1998 at 01:05:46AM +0000, Jamie Lokier wrote:
> On Mon, Dec 14, 1998 at 06:51:28PM -0800, Edward Jason Riedy wrote:
> > Using 80-bit spills is a quick approximation to extending the FP stack
> > into memory, and it should give some of the benefit with very little 
> > (hopefully) hassle.  Of course, 80 bits is wider than the normal spill,
> > so it eats more memory bandwidth, cache space, etc.  Anyone who's that
> > concerned will bend over backwards to avoid spills anyways.
> 
> I like this 80-bit spills idea.

Since it seems to give us full ieee without too much a drawback (spilling in
itself is slow, compared to the extra memory, at least in inner loops or so
;)

> But given that you can just put the FPU into 64-bit precision mode
> anyway to get predictable arithmetic, I would like to see the option to
> use just 64-bit spills for those programs that do run the FPU in 64-bit
> precision mode.

I still don't see what the 64 bit precision idea gives us, in terms of
performance. First, it doesn't give us full ieee, second, it kills
performance, depending on where the rounding mode is set (before each
assignment? resetting it to normal before each long double assignment?)

IAW, how is 64 bit rounding mode going to be faster? For me, it seems this
creates a similar situation to the float->integer conversion, i.e. save and
restoring the control word with each assignment.

      -----==-                                              |
      ----==-- _                                            |
      ---==---(_)__  __ ____  __       Marc Lehmann       +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com       |e|
      -=====/_/_//_/\_,_/ /_/\_\                          --+
    The choice of a GNU generation                        |
                                                          |

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-16 12:25           ` Marc Lehmann
@ 1998-12-16 12:50             ` Tim Hollebeek
  1998-12-16 13:04               ` Harvey J. Stein
                                 ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Tim Hollebeek @ 1998-12-16 12:50 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: egcs

Marc Lehmann writes ...
> 
> I still don't see what the 64 bit precision idea gives us, in terms of
> performance. First, it doesn't give us full ieee, second, it kills
> performance, depending on where the rounding mode is set (before each
> assignment? resetting it to normal before each long double assignment?)

64 bit rounding is a wonderful solution .... assuming the only
floating point type you ever use is 'double'.  As such, it makes sense
as a practical user level solution, but I'm afraid it's almost useless
as a general purpose solution.

---------------------------------------------------------------------------
Tim Hollebeek                           | "Everything above is a true
email: tim@wfn-shop.princeton.edu       |  statement, for sufficiently
URL: http://wfn-shop.princeton.edu/~tim |  false values of true."

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-16 12:50             ` Tim Hollebeek
@ 1998-12-16 13:04               ` Harvey J. Stein
  1998-12-16 14:01               ` Marc Lehmann
  1998-12-20 11:24               ` Dave Love
  2 siblings, 0 replies; 65+ messages in thread
From: Harvey J. Stein @ 1998-12-16 13:04 UTC (permalink / raw)
  To: Tim Hollebeek; +Cc: hjstein

Tim Hollebeek <tim@wagner.princeton.edu> writes:

 > Marc Lehmann writes ...
 > > 
 > > I still don't see what the 64 bit precision idea gives us, in terms of
 > > performance. First, it doesn't give us full ieee, second, it kills
 > > performance, depending on where the rounding mode is set (before each
 > > assignment? resetting it to normal before each long double assignment?)
 > 
 > 64 bit rounding is a wonderful solution .... assuming the only
 > floating point type you ever use is 'double'.  As such, it makes sense
 > as a practical user level solution, but I'm afraid it's almost useless
 > as a general purpose solution.

As I just detailed, I don't think it's any worse wrt correct numerics
than using 80 bit FPU + 80 bit spills.

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-16 12:50             ` Tim Hollebeek
  1998-12-16 13:04               ` Harvey J. Stein
@ 1998-12-16 14:01               ` Marc Lehmann
  1998-12-17 11:26                 ` Dave Love
  1998-12-20 11:24               ` Dave Love
  2 siblings, 1 reply; 65+ messages in thread
From: Marc Lehmann @ 1998-12-16 14:01 UTC (permalink / raw)
  To: Tim Hollebeek, Marc Lehmann; +Cc: egcs

On Wed, Dec 16, 1998 at 03:50:22PM -0500, Tim Hollebeek wrote:
> Marc Lehmann writes ...
> > 
> > I still don't see what the 64 bit precision idea gives us, in terms of
> > performance. First, it doesn't give us full ieee, second, it kills
> > performance, depending on where the rounding mode is set (before each
> > assignment? resetting it to normal before each long double assignment?)
> 
> 64 bit rounding is a wonderful solution .... assuming the only
> floating point type you ever use is 'double'.

And the only floatingpoint type your highly optimized libm uses is double,
OR the algorithms work stable with both data types.

> As such, it makes sense as a practical user level solution, but I'm afraid
> it's almost useless as a general purpose solution.

I fear it won't even make sense as a practical user level solution. I should
hjave a look at glibc's libm to see wether rounding would affect it
negatively.

      -----==-                                              |
      ----==-- _                                            |
      ---==---(_)__  __ ____  __       Marc Lehmann       +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com       |e|
      -=====/_/_//_/\_,_/ /_/\_\                          --+
    The choice of a GNU generation                        |
                                                          |

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15 17:11         ` Jamie Lokier
                             ` (2 preceding siblings ...)
  1998-12-16 12:25           ` Marc Lehmann
@ 1998-12-16 23:11           ` Joern Rennecke
  1998-12-17  6:07             ` Jamie Lokier
  3 siblings, 1 reply; 65+ messages in thread
From: Joern Rennecke @ 1998-12-16 23:11 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: ejr, jbuck, hjstein, egcs

> I like this 80-bit spills idea.
> 
> But given that you can just put the FPU into 64-bit precision mode
> anyway to get predictable arithmetic, I would like to see the option to
> use just 64-bit spills for those programs that do run the FPU in 64-bit
> precision mode.

The larger exponent range of 64 bit precision mode means that you still get
inconsistent behaviour for values that are close to zero.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-16 23:11           ` Joern Rennecke
@ 1998-12-17  6:07             ` Jamie Lokier
  0 siblings, 0 replies; 65+ messages in thread
From: Jamie Lokier @ 1998-12-17  6:07 UTC (permalink / raw)
  To: Joern Rennecke; +Cc: ejr, jbuck, hjstein, egcs

On Thu, Dec 17, 1998 at 07:10:51AM +0000, Joern Rennecke wrote:
> > I like this 80-bit spills idea.
> > 
> > But given that you can just put the FPU into 64-bit precision mode
> > anyway to get predictable arithmetic, I would like to see the option to
> > use just 64-bit spills for those programs that do run the FPU in 64-bit
> > precision mode.
> 
> The larger exponent range of 64 bit precision mode means that you still get
> inconsistent behaviour for values that are close to zero.

Ok, what I really meant was I'd like the option for the code to be fast,
if I don't care about _precise_ behaviour.

For some FP codes speed is more important than accuracy.

-- Jamie

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-16 14:01               ` Marc Lehmann
@ 1998-12-17 11:26                 ` Dave Love
  1998-12-17 15:06                   ` Marc Lehmann
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Love @ 1998-12-17 11:26 UTC (permalink / raw)
  To: egcs

>>>>> "Marc" == Marc Lehmann <pcg@goof.com> writes:

 Marc> I should hjave a look at glibc's libm to see wether rounding
 Marc> would affect it negatively.

Yes please!  I've never been able to get a story on it but haven't
seen problems in rather limited Fortran tests.  (There's a comment
about libm requirements in fpu_control.h which seemed to be
contradicted by something else in the header.)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-17 11:26                 ` Dave Love
@ 1998-12-17 15:06                   ` Marc Lehmann
  1998-12-18 12:50                     ` Dave Love
  0 siblings, 1 reply; 65+ messages in thread
From: Marc Lehmann @ 1998-12-17 15:06 UTC (permalink / raw)
  To: Dave Love, egcs

On Thu, Dec 17, 1998 at 07:26:15PM +0000, Dave Love wrote:
> >>>>> "Marc" == Marc Lehmann <pcg@goof.com> writes:
> 
>  Marc> I should hjave a look at glibc's libm to see wether rounding
>  Marc> would affect it negatively.
> 
> Yes please!  I've never been able to get a story on it but haven't
> seen problems in rather limited Fortran tests.  (There's a comment

Just that I'm the wrong person to ask, as we have much better glibc experts
on this list ;)

> about libm requirements in fpu_control.h which seemed to be
> contradicted by something else in the header.)

You mean:
 * The hardware default is 0x037f. I choose 0x1372.
(extended precision)

vs.

/* The fdlibm code requires strict IEEE double precision arithmetic,

?? The question is: what is "fdlibm"?

Back to the problem: I see a couple of functions in bits/mathinline.h
explicitly using long double, but I suspect this is to workaround the
"spilling vs. chop" problem when these are spilled to memory.

Looking through the functions I see more potential problems. Given
some function (say, hypot), which has to do some calculcations itself,
the result is probably correct to 64 bits when using extended precision.

When we set double precision in the fpu, all the internal calculations
in these functions are done with that precision, and we have rounding.

While the effects are small, isn't it possible that, out of a sudden, many
of our libm functions start to give us less then double precision (say,
54 or 55 bits)? Even if this is very minor, these functions have changed.

None of them set the fpu control word, btw.

      -----==-                                              |
      ----==-- _                                            |
      ---==---(_)__  __ ____  __       Marc Lehmann       +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com       |e|
      -=====/_/_//_/\_,_/ /_/\_\                          --+
    The choice of a GNU generation                        |
                                                          |

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-17 15:06                   ` Marc Lehmann
@ 1998-12-18 12:50                     ` Dave Love
  1998-12-19 14:09                       ` Marc Lehmann
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Love @ 1998-12-18 12:50 UTC (permalink / raw)
  To: egcs

>>>>> "Marc" == Marc Lehmann <pcg@goof.com> writes:

 Marc> You mean:
 Marc>  * The hardware default is 0x037f. I choose 0x1372.
 Marc> (extended precision)

 Marc> vs.

 Marc> /* The fdlibm code requires strict IEEE double precision arithmetic,

 Marc> ?? The question is: what is "fdlibm"?

Yes, that looks familiar.

 Marc> Back to the problem: I see a couple of functions in
 Marc> bits/mathinline.h explicitly using long double, but I suspect
 Marc> this is to workaround the "spilling vs. chop" problem when
 Marc> these are spilled to memory.

If they're confined to the inlines, at least we could avoid them
affecting Fortran, which would be nice (for some of us) to know.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-18 12:50                     ` Dave Love
@ 1998-12-19 14:09                       ` Marc Lehmann
  1998-12-20 11:28                         ` Dave Love
  0 siblings, 1 reply; 65+ messages in thread
From: Marc Lehmann @ 1998-12-19 14:09 UTC (permalink / raw)
  To: Dave Love; +Cc: egcs

On Fri, Dec 18, 1998 at 07:48:52PM +0000, Dave Love wrote:
> 
>  Marc> Back to the problem: I see a couple of functions in
>  Marc> bits/mathinline.h explicitly using long double, but I suspect
>  Marc> this is to workaround the "spilling vs. chop" problem when
>  Marc> these are spilled to memory.
> 
> If they're confined to the inlines, at least we could avoid them
> affecting Fortran, which would be nice (for some of us) to know.

The non-inline functions _seem_ to use float and double, respectively,
as intermediate storage. At least the ones in the libm-ieee754
directory.

The functions in libm-i386, though, are mostyl implemente dusing assembly,
so, naturally, you won't be able to see wether they use double or long double
(or more important: wether they depend on extended precision or not).

Since *most* functions just call i386 assembly equivalents, there should be
no rounding problems, but remember that I didn't say this ;)

This might make them immune against 64 bit rounding mode.

--
Happy New Year, I'll be away from 21. Dec to 7. Jan

      -----==-                                              |
      ----==-- _                                            |
      ---==---(_)__  __ ____  __       Marc Lehmann       +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com       |e|
      -=====/_/_//_/\_,_/ /_/\_\                          --+
    The choice of a GNU generation                        |
                                                          |

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-16 12:50             ` Tim Hollebeek
  1998-12-16 13:04               ` Harvey J. Stein
  1998-12-16 14:01               ` Marc Lehmann
@ 1998-12-20 11:24               ` Dave Love
  2 siblings, 0 replies; 65+ messages in thread
From: Dave Love @ 1998-12-20 11:24 UTC (permalink / raw)
  To: egcs

>>>>> "Tim" == Tim Hollebeek <tim@wagner.princeton.edu> writes:

 Tim> 64 bit rounding is a wonderful solution .... assuming the only
 Tim> floating point type you ever use is 'double'.  As such, it makes
 Tim> sense as a practical user level solution, but I'm afraid it's
 Tim> almost useless as a general purpose solution.

So I'll ask again: what real problems that we've missed in g77 support
traffic does it not address?  (I'm not arguing and I'd like to know
about them for support purposes.)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-19 14:09                       ` Marc Lehmann
@ 1998-12-20 11:28                         ` Dave Love
  0 siblings, 0 replies; 65+ messages in thread
From: Dave Love @ 1998-12-20 11:28 UTC (permalink / raw)
  To: egcs

>>>>> "Marc" == Marc Lehmann <pcg@goof.com> writes:

 Marc> Since *most* functions just call i386 assembly equivalents,
 Marc> there should be no rounding problems, but remember that I
 Marc> didn't say this ;)

I'll try to run the referenced (thanks rth) libm test stuff to check
sometime in the new year if no-one beats me to it.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-19  6:42     ` Emil Hallin
@ 1998-12-19 14:26       ` Dave Love
  0 siblings, 0 replies; 65+ messages in thread
From: Dave Love @ 1998-12-19 14:26 UTC (permalink / raw)
  To: egcs

>>>>> "Emil" == Emil Hallin <emil@skatter.usask.ca> writes:

 Emil> Also, I am not willing to use a tool which *deliberately*
 Emil> produces incorrect code,

With respect to what specification is g77 generating incorrect code?
(That isn't rhetorical, I don't know; it looks as thought it does the
same as, for instance, SunPro, though.)

 Emil> With the increasing pervasiveness of computers in every area of
 Emil> life, and with the increasing use of linux, it is not a huge
 Emil> stretch of the imagination to believe that at some point in the
 Emil> future people's lives might depend on a piece of gcc/g77
 Emil> compiled code that made the wrong decision because of a
 Emil> numerical error!

That's an absurd argument.  It would, apart from anything else, mean
that g77 needed changing (whether in default or design I'm not sure in
general) to avoid the need for the `Working Code' node of the manual.
You're not even talking about the most common problem.

 Emil> And yes, the code in question *might* have been written (and
 Emil> tested) by someone who would simply not be aware of the
 Emil> possibility that a test based on a < b would not *always* be
 Emil> evaluated the same way.

And you want to blame _us_ for that?  (Despite the information in the
manual.)

 Emil> And for those who need the ultimate performance, even if this
 Emil> proposal were to reduce speeds by 25%, your CPU today is an
 Emil> order of magnitude faster than it was a year or so ago.

Throw away the optimizer entirely and you'll avoid most of the `g77
has a bug because it doesn't give the same answers as ...' complaints.

I don't know whether the proposal at issue is good or bad, but I'm not
prepared to evaluate it on this basis.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-19  0:17   ` Craig Burley
@ 1998-12-19  6:42     ` Emil Hallin
  1998-12-19 14:26       ` Dave Love
  0 siblings, 1 reply; 65+ messages in thread
From: Emil Hallin @ 1998-12-19  6:42 UTC (permalink / raw)
  To: Craig Burley, egcs

I have been following this thread with a great deal of interest. I very much
appreciate your proposal AND I endorse it completely. I am more than willing to
pay a performance penalty in order to get numerically accurate results with less
programming on my part. Also, I am not willing to use a tool which *deliberately*
produces incorrect code, when the right thing to do is both known and can be
implemented in a reasonable way. With the increasing pervasiveness of computers in
every area of life, and with the increasing use of linux, it is not a huge stretch
of the imagination to believe that at some point in the future people's lives
might depend on a piece of gcc/g77 compiled code that made the wrong decision
because of a numerical error! And yes, the code in question *might* have been
written (and tested) by someone who would simply not be aware of the possibility
that a test based on a < b would not *always* be evaluated the same way.

Craig, please don't give up. Faster CPUs are always available for those that need
the ultimate speed.

And for those who need the ultimate performance, even if this proposal were to
reduce speeds by 25%, your CPU today is an order of magnitude faster than it was a
year or so ago.
    emil

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-19  0:27           ` Craig Burley
@ 1998-12-19  5:06             ` Stephen L Moshier
  0 siblings, 0 replies; 65+ messages in thread
From: Stephen L Moshier @ 1998-12-19  5:06 UTC (permalink / raw)
  To: Craig Burley; +Cc: pcg, hjstein, bosch, egcs, tprince

On Sat, 19 Dec 1998, Craig Burley wrote:

> I don't even know for sure if it'd stop working in 64-bit-mode FPU.

You should see failures in c-torture/execute/conversion.c and perhaps
some similar tests.  Long long int conversions in particular will be
affected.  The reason is that those mode conversions use XFmode if
it is available, and the compiler will think XFmode is still available.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-17 12:16   ` Harvey J. Stein
@ 1998-12-19  0:29     ` Craig Burley
  0 siblings, 0 replies; 65+ messages in thread
From: Craig Burley @ 1998-12-19  0:29 UTC (permalink / raw)
  To: hjstein; +Cc: burley

>Those trying to port to ix86 *must* be porting code that doesn't
>expect 80 bit cpu registers & therefore would actually produce results
>closer to the original code with the FPU set to 64 bits.

Yup.  And will all the libraries they use -- perhaps separately
ported -- will work that way as well?

And what about those trying to port to *gcc* (or g77) from some
*other* compiler on the ix86, where the other compiler made
proper use of the x86 (by doing 80-bit spills)?

But, never mind, I've withdrawn my proposal to try and offer 80-bit
spills, so such users should just be told to rewrite their code --
after taking a few more classes on numerical analysis.

        tq vm, (burley)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-17 14:54         ` Marc Lehmann
@ 1998-12-19  0:27           ` Craig Burley
  1998-12-19  5:06             ` Stephen L Moshier
  0 siblings, 1 reply; 65+ messages in thread
From: Craig Burley @ 1998-12-19  0:27 UTC (permalink / raw)
  To: pcg; +Cc: burley

>On Thu, Dec 17, 1998 at 01:21:52PM -0500, Craig Burley wrote:
>> >The choice is clear, at least for me.  I'm setting the FPU to 64 bit
>> >mode in all my code.  Maybe I'll even put it in crt0.o.
>> 
>> Yup.  We should do this (whichever gives the most coverage) across
>> the board in enough egcs snapshots to find out how it actually
>> affects people, e.g. how much other code must be rewritten.
>
>Will we warn for "long double"? ;->>>

I don't even know for sure if it'd stop working in 64-bit-mode FPU.

If it does, those who proposed that mode as "the" solution to
the problems with FP behavior will surely suggest something.

        tq vm, (burley)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-17 14:51 ` Marc Lehmann
@ 1998-12-19  0:17   ` Craig Burley
  1998-12-19  6:42     ` Emil Hallin
  0 siblings, 1 reply; 65+ messages in thread
From: Craig Burley @ 1998-12-19  0:17 UTC (permalink / raw)
  To: pcg; +Cc: burley

>On Thu, Dec 17, 1998 at 02:27:02PM -0500, Brad Lucier wrote:
>> If that means spilling FP registers to 80 bit temporaries aligned to
>> 128-bit (or 64bit) boundaries, then so be it.
>
>Given that the speed penalty of this solution might be very small, except for
>(?) degenerate cases, we should not accept or decline this solution unless
>somebody states some hard data (read: benchmarks).

Apparently no hard data at all is needed for several people to have
already concluded my proposal is a "loser all around" and constitutes
nothing more than "numerical political correctness" -- some of the
people I've trusted most to be thoughtful and considerate before
making such statements in the past.

I'm now sorry to have proposed it in the first place.  It seems
to have been a huge waste of time, at least on my part.  I suspect
someday it'll be clear my proposal was both ahead of, and behind,
its time: ahead, because it seems quite likely the rest of the
industry will decide to go in that direction anyway (as some parts
apparently already do), and gcc will have to follow; behind, because
if gcc had long-ago been implemented on the x86 the way the x87 (FP unit)
designer apparently intended it to be used, we'd be discussing whether
to add an option to provide 32/64-bit spills to get extra performance,
and it's unlikely people would be calling the 80-bit spills a "loser
all around", but simply a "reasonable default" that some very
knowledgable users might wish to override for performance reasons.

In the meantime, I'll withdraw my proposal.  Which means I no longer
suggest the planned x86 machine-description rewrite, or any other
part of gcc, take into consideration the potential need for 80-bit
spills at all.  After all, if the performance is already known to
be a major problem, there's no need to even experiment with an
option to enable them, and certainly we wouldn't make it the default.

(If we someday decide we want to, as I expect will happen, we can just
rewrite the appropriate parts of gcc -- again.)

In case it isn't clear: I'm withdrawing my proposal only because
I don't really want to argue about it anymore, and I've decided I must
have completely lost touch with what is appropriate for Fortran and
gcc users, generally, to have so completely missed the boat on
how important performance might be to various people, so I probably
shouldn't be making such proposals in the first place.  Besides, it's
probably better for me to read up on all the relevant issues -- e.g.
become a numerical-analyst expert, so I too can know what's needed
to write working FP code -- before talking about it, or working on
related Fortran stuff, further.

So, for now, I'll just stick with fixing g77 bugs and doing other
non-FP work (once I get back from Christmas vacation, anyway), until
I can figure out what I *am* qualified to discuss, propose, design,
implement, and so on, and how to become more competent in the
relevant areas.

        tq vm, (burley)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-18  3:05     ` Harvey J. Stein
  1998-12-18  9:01       ` Toon Moene
@ 1998-12-18 15:59       ` Richard Henderson
  1 sibling, 0 replies; 65+ messages in thread
From: Richard Henderson @ 1998-12-18 15:59 UTC (permalink / raw)
  To: Harvey J. Stein, Toon Moene; +Cc: egcs

On Fri, Dec 18, 1998 at 01:05:37PM +0200, Harvey J. Stein wrote:
> That's true, but, like I said, it's 10 byte, not 16 byte, and it's vs
> the current 8 byte.  There's currently no option of doing 4 byte spills.

On the contrary.  If you work with SFmode values, they'll be spilled
in SFmode.  And XFmode reads/writes to unaligned (mod 16) addresses
takes extra time.


r~

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-17 15:30 ` Harvey J. Stein
  1998-12-18  1:54   ` Toon Moene
@ 1998-12-18 13:26   ` Marc Lehmann
  1 sibling, 0 replies; 65+ messages in thread
From: Marc Lehmann @ 1998-12-18 13:26 UTC (permalink / raw)
  To: egcs

On Fri, Dec 18, 1998 at 01:29:59AM +0200, Harvey J. Stein wrote:
> Toon Moene <toon@moene.indiv.nluug.nl> writes:
> 
>  > I must say that I am *not* amused.  This discussion goes into a
>  > direction that will leave us with a compiler that, although
>  > numerical-politically correct, will be generating such slow code as
>  > to be totally unuseable.
> 
> Do you really think it'll make it that much slower?  If you're doing
> computations, how much time is spent dumping & restoring FP registers?
> Is dumping 10 bytes instead of 8 bytes (current practice) going to
> have a substantial impact in total run time?  You've mentioned in
> other posts a factor of 2, as if because it's 16 byte aligned it has
> to move 16 bytes of data.  But even if it's 16 byte aligned, it's
> still only 10 bytes of data, so why should it take 2x as long?

In addition:

- the overhead for a single spill is not only in the additional memory
  transferred. Depending on your cpu, 16 or 32 bytes are transferred from/to
  memory anway (if at all)
- even if the spill itself would take twice as long this wouldn't
  make your code twice as slow. The effects of increasing spill time
  first has to be measured. For most real-world programs no spills
  are required, at leats not in inner loops.

> disclaimer that I'm not advocating 80 bit spills.  I just hate seeing
> things accepted or rejected for incorrect reasons.)

BTW, Imwas not attacking anybody with this mail ;->

--
Happy New Year, I'll be away from 21. Dec to 7. Jan

      -----==-                                              |
      ----==-- _                                            |
      ---==---(_)__  __ ____  __       Marc Lehmann       +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com       |e|
      -=====/_/_//_/\_,_/ /_/\_\                          --+
    The choice of a GNU generation                        |
                                                          |

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-17 14:38 Toon Moene
  1998-12-17 15:30 ` Harvey J. Stein
@ 1998-12-18 12:50 ` Dave Love
  1 sibling, 0 replies; 65+ messages in thread
From: Dave Love @ 1998-12-18 12:50 UTC (permalink / raw)
  To: egcs

>>>>> "Toon" == Toon Moene <toon@moene.indiv.nluug.nl> writes:

 >> I haven't followed all this, but am somewhat bemused 
 >> by it.

 Toon> I must say that I am *not* amused.  

Me neither.  _Be_mused, baffled.

 Toon> This discussion goes into a direction that will leave us with a
 Toon> compiler that, although numerical-politically correct, will be
 Toon> generating such slow code as to be totally unuseable.

I don't know whether it's any more than numerical political
correctness.  I've been flamed several times for querying claims
apropos both x86 and Alpha that LAPACK (specifically, possibly other
things) requires strict IEEE double behaviour.  The flamers produced
no evidence, but I guess it might still be so and that I ought to ask
the LA pack themselves.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-18  3:05     ` Harvey J. Stein
@ 1998-12-18  9:01       ` Toon Moene
  1998-12-18 15:59       ` Richard Henderson
  1 sibling, 0 replies; 65+ messages in thread
From: Toon Moene @ 1998-12-18  9:01 UTC (permalink / raw)
  To: Harvey J. Stein, egcs

Harvey J. Stein wrote:

>  > For a thoroughly 32-bit application like mine, 16-byte spills are four
>  > times as large as necessary - and yes, having all this data move in and
>  > out of registers does have a cost (think cache footprint).
> 
> That's true, but, like I said, it's 10 byte, not 16 byte, and it's vs
> the current 8 byte.  There's currently no option of doing 4 byte
> spills.

Ah, that could be; I am not really up to snuff on the stuff reload can
and cannot do (would be a nice optimisation, though).

On the 10 vs. 16 byte issue:  This probably depends on the type of your
Intel processor (see for instance 
http://www.geocities.com/SiliconValley/9498/pentopt.html ) - the value
that's important is the number of bytes in a cache line, because that's
the number reserved / transferrd.

Cheers,

-- 
Toon Moene (toon@moene.indiv.nluug.nl)
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
Phone: +31 346 214290; Fax: +31 346 214286
g77 Support: fortran@gnu.org; egcs: egcs-bugs@cygnus.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-18  1:54   ` Toon Moene
@ 1998-12-18  3:05     ` Harvey J. Stein
  1998-12-18  9:01       ` Toon Moene
  1998-12-18 15:59       ` Richard Henderson
  0 siblings, 2 replies; 65+ messages in thread
From: Harvey J. Stein @ 1998-12-18  3:05 UTC (permalink / raw)
  To: Toon Moene; +Cc: hjstein

Toon Moene <toon@moene.indiv.nluug.nl> writes:

 > Harvey J. Stein wrote:
 > > 
 > > Toon Moene <toon@moene.indiv.nluug.nl> writes:
 > > 
 > >  > I must say that I am *not* amused.  This discussion goes into a
 > >  > direction that will leave us with a compiler that, although
 > >  > numerical-politically correct, will be generating such slow code as
 > >  > to be totally unuseable.
 > > 
 > > Do you really think it'll make it that much slower?  If you're doing
 > > computations, how much time is spent dumping & restoring FP registers?
 > > Is dumping 10 bytes instead of 8 bytes (current practice) going to
 > > have a substantial impact in total run time?  You've mentioned in
 > > other posts a factor of 2, as if because it's 16 byte aligned it has
 > > to move 16 bytes of data.  But even if it's 16 byte aligned, it's
 > > still only 10 bytes of data, so why should it take 2x as long?
 > 
 > For a thoroughly 32-bit application like mine, 16-byte spills are four
 > times as large as necessary - and yes, having all this data move in and
 > out of registers does have a cost (think cache footprint).

That's true, but, like I said, it's 10 byte, not 16 byte, and it's vs
the current 8 byte.  There's currently no option of doing 4 byte
spills.

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-17 15:30 ` Harvey J. Stein
@ 1998-12-18  1:54   ` Toon Moene
  1998-12-18  3:05     ` Harvey J. Stein
  1998-12-18 13:26   ` Marc Lehmann
  1 sibling, 1 reply; 65+ messages in thread
From: Toon Moene @ 1998-12-18  1:54 UTC (permalink / raw)
  To: Harvey J. Stein, egcs

Harvey J. Stein wrote:
> 
> Toon Moene <toon@moene.indiv.nluug.nl> writes:
> 
>  > I must say that I am *not* amused.  This discussion goes into a
>  > direction that will leave us with a compiler that, although
>  > numerical-politically correct, will be generating such slow code as
>  > to be totally unuseable.
> 
> Do you really think it'll make it that much slower?  If you're doing
> computations, how much time is spent dumping & restoring FP registers?
> Is dumping 10 bytes instead of 8 bytes (current practice) going to
> have a substantial impact in total run time?  You've mentioned in
> other posts a factor of 2, as if because it's 16 byte aligned it has
> to move 16 bytes of data.  But even if it's 16 byte aligned, it's
> still only 10 bytes of data, so why should it take 2x as long?

For a thoroughly 32-bit application like mine, 16-byte spills are four
times as large as necessary - and yes, having all this data move in and
out of registers does have a cost (think cache footprint).

-- 
Toon Moene (toon@moene.indiv.nluug.nl)
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
Phone: +31 346 214290; Fax: +31 346 214286
g77 Support: fortran@gnu.org; egcs: egcs-bugs@cygnus.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-17 14:38 Toon Moene
@ 1998-12-17 15:30 ` Harvey J. Stein
  1998-12-18  1:54   ` Toon Moene
  1998-12-18 13:26   ` Marc Lehmann
  1998-12-18 12:50 ` Dave Love
  1 sibling, 2 replies; 65+ messages in thread
From: Harvey J. Stein @ 1998-12-17 15:30 UTC (permalink / raw)
  To: Toon Moene; +Cc: hjstein

Toon Moene <toon@moene.indiv.nluug.nl> writes:

 > I must say that I am *not* amused.  This discussion goes into a
 > direction that will leave us with a compiler that, although
 > numerical-politically correct, will be generating such slow code as
 > to be totally unuseable.

Do you really think it'll make it that much slower?  If you're doing
computations, how much time is spent dumping & restoring FP registers?
Is dumping 10 bytes instead of 8 bytes (current practice) going to
have a substantial impact in total run time?  You've mentioned in
other posts a factor of 2, as if because it's 16 byte aligned it has
to move 16 bytes of data.  But even if it's 16 byte aligned, it's
still only 10 bytes of data, so why should it take 2x as long?

(This discussion has gotten heated enough that I'd better include the
disclaimer that I'm not advocating 80 bit spills.  I just hate seeing
things accepted or rejected for incorrect reasons.)

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-17 14:37 tprince
@ 1998-12-17 15:15 ` Stephen L Moshier
  0 siblings, 0 replies; 65+ messages in thread
From: Stephen L Moshier @ 1998-12-17 15:15 UTC (permalink / raw)
  To: tprince; +Cc: bosch, egcs, hjstein

>  I
> suggest we try patching gcc to force programs to start in 64-bit
> mode and otherwise not worry about it ("set once and forget") as
> you've been proposing.<<

The compiler may continue to generate some XFmode insns, so you should
check your programs and libgcc.a for them.  By design it is supposed to be
possible to create a GCC that never generates XFmode, but there may
still be a breakage in i386.md that doesn't quite allow it.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-17 10:22       ` Craig Burley
@ 1998-12-17 14:54         ` Marc Lehmann
  1998-12-19  0:27           ` Craig Burley
  0 siblings, 1 reply; 65+ messages in thread
From: Marc Lehmann @ 1998-12-17 14:54 UTC (permalink / raw)
  To: Craig Burley, hjstein; +Cc: bosch, moshier, egcs, tprince

On Thu, Dec 17, 1998 at 01:21:52PM -0500, Craig Burley wrote:
> >The choice is clear, at least for me.  I'm setting the FPU to 64 bit
> >mode in all my code.  Maybe I'll even put it in crt0.o.
> 
> Yup.  We should do this (whichever gives the most coverage) across
> the board in enough egcs snapshots to find out how it actually
> affects people, e.g. how much other code must be rewritten.

Will we warn for "long double"? ;->>>

      -----==-                                              |
      ----==-- _                                            |
      ---==---(_)__  __ ____  __       Marc Lehmann       +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com       |e|
      -=====/_/_//_/\_,_/ /_/\_\                          --+
    The choice of a GNU generation                        |
                                                          |

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-17 11:27 Brad Lucier
@ 1998-12-17 14:51 ` Marc Lehmann
  1998-12-19  0:17   ` Craig Burley
  0 siblings, 1 reply; 65+ messages in thread
From: Marc Lehmann @ 1998-12-17 14:51 UTC (permalink / raw)
  To: Brad Lucier; +Cc: egcs

On Thu, Dec 17, 1998 at 02:27:02PM -0500, Brad Lucier wrote:
> If that means spilling FP registers to 80 bit temporaries aligned to
> 128-bit (or 64bit) boundaries, then so be it.

Given that the speed penalty of this solution might be very small, except for
(?) degenerate cases, we should not accept or decline this solution unless
somebody states some hard data (read: benchmarks).

      -----==-                                              |
      ----==-- _                                            |
      ---==---(_)__  __ ____  __       Marc Lehmann       +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com       |e|
      -=====/_/_//_/\_,_/ /_/\_\                          --+
    The choice of a GNU generation                        |
                                                          |

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
@ 1998-12-17 14:38 Toon Moene
  1998-12-17 15:30 ` Harvey J. Stein
  1998-12-18 12:50 ` Dave Love
  0 siblings, 2 replies; 65+ messages in thread
From: Toon Moene @ 1998-12-17 14:38 UTC (permalink / raw)
  To: egcs

Toon> However, what I'm challenging is that we should burden
 Toon> the compiler with these considerations *by default* 

> Indeed.

> I haven't followed all this, but am somewhat bemused 
> by it.

I must say that I am *not* amused.  This discussion goes into a
direction that will leave us with a compiler that, although
numerical-politically correct, will be generating such slow code as to
be totally unuseable.

> There are frequent complaints about Fortran due to the > x86 register business.  All the ones I've checked have > been covered by the advice in the g77 manual to link
> code frobbing the control word (which allows us to
> pass paranoia).  Anyone know of exceptions?

Exactly.  That's the test we should be aiming for - not something
someone comes up with who hasn't taken the time to read the relevant
(numerical analysis) texts.

Sorry to be so harsh, but I'm trying to save a compiler here.

To put it all in a one-liner:

The compiler can't - and won't - save you from doing a numerical
analysis class.

-- 
Toon Moene (toon@moene.indiv.nluug.nl)
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
Phone: +31 346 214290; Fax: +31 346 214286
g77 Support: fortran@gnu.org; egcs: egcs-bugs@cygnus.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
@ 1998-12-17 14:37 tprince
  1998-12-17 15:15 ` Stephen L Moshier
  0 siblings, 1 reply; 65+ messages in thread
From: tprince @ 1998-12-17 14:37 UTC (permalink / raw)
  To: bosch, egcs, hjstein, moshier

          -Reply


>>Since my proposal is not likely to be implemented anytime
soon, I
suggest we try patching gcc to force programs to start in 64-bit
mode and otherwise not worry about it ("set once and forget") as
you've been proposing.<<

Can't this be made optional, either a flag which links the
modified startup, or a system function call?
Dr. Timothy C. Prince
Consulting Engineer
Solar Turbines, a Caterpillar Company
alternate e-mail: tprince@computer.org

           To:                                              INTERNET - IBMMAIL
                                                            N3356140 - IBMMAIL

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-17 10:06 ` Craig Burley
@ 1998-12-17 12:16   ` Harvey J. Stein
  1998-12-19  0:29     ` Craig Burley
  0 siblings, 1 reply; 65+ messages in thread
From: Harvey J. Stein @ 1998-12-17 12:16 UTC (permalink / raw)
  To: Craig Burley; +Cc: hjstein

Craig Burley <burley@gnu.org> writes:

 > The opposite approach I'm seeing a few people advocating simply doesn't
 > work for a product used so widely by so many people.  I'd agree it's
 > fine for, say, a shop like Toon's, but, these days, much of the work
 > done porting numerical code to new compiler/OS/CPU combinations seems
 > to be done by people who *don't* understand the numerical algorithms
 > involved, and are thus just "code jockeys" who know the language (C,
 > Fortran, whatever) well enough to handle ordinary "porting issues".

Those trying to port to ix86 *must* be porting code that doesn't
expect 80 bit cpu registers & therefore would actually produce results
closer to the original code with the FPU set to 64 bits.

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
@ 1998-12-17 11:27 Brad Lucier
  1998-12-17 14:51 ` Marc Lehmann
  0 siblings, 1 reply; 65+ messages in thread
From: Brad Lucier @ 1998-12-17 11:27 UTC (permalink / raw)
  To: burley, egcs; +Cc: lucier

I agree with Craig (I think ;-).

To get consistent, predictable, useable floating-point results, it is
absolutely necessary that spilled floating-point registers be stored in
memory in a format such that spilling and reading a value back into a
register does not change one bit (in the technical sense) of the number.
If that means spilling FP registers to 80 bit temporaries aligned to
128-bit (or 64bit) boundaries, then so be it.

I say this as someone who has built highly accurate elementary function
routine libraries, together with test libraries for those routines, built
test libraries for last-bit accuracy floating-point IO routines, etc.
The programmer needs absolute control over the precision and range of
all results, including intermediate results, for him/her to be successful
at things like this.

Brad Lucier      lucier@math.purdue.edu

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-16 13:52 Toon Moene
  1998-12-17 10:06 ` Craig Burley
@ 1998-12-17 11:20 ` Dave Love
  1 sibling, 0 replies; 65+ messages in thread
From: Dave Love @ 1998-12-17 11:20 UTC (permalink / raw)
  To: egcs

>>>>> "Toon" == Toon Moene <toon@moene.indiv.nluug.nl> writes:

 Toon> However, what I'm challenging is that we should burden
 Toon> the compiler with these considerations *by default* 

Indeed.

I haven't followed all this, but am somewhat bemused by it.

There are frequent complaints about Fortran due to the x86 register
business.  All the ones I've checked have been covered by the advice
in the g77 manual to link code frobbing the control word (which allows
us to pass paranoia).  Anyone know of exceptions?

[I started on implementing (as far as possible in f77-ish) the
Fortran2000 intrinsics for IEEE floating point control but gave up in
the absence of time to figure out more about the architectures which I
didn't have a clear story on.  At some stage I hope to find time to
consult Reid (the author) and do what he says.]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-16 12:47     ` Harvey J. Stein
@ 1998-12-17 10:22       ` Craig Burley
  1998-12-17 14:54         ` Marc Lehmann
  0 siblings, 1 reply; 65+ messages in thread
From: Craig Burley @ 1998-12-17 10:22 UTC (permalink / raw)
  To: hjstein; +Cc: burley

>Given this, then doing 80 bit spills will fix things *only* if you
>guarantee that once a value hits the FPU it remains 80 bit (aka 80 bit
>contagion).  I.e. - the compiler can't store back into a double (for
>example) and then fetch this value back for additional computations.
>It has to always get the value from 80 bit memory instead.

Isn't that what we've been discussing from the beginning, i.e. from
the first email I sent containing my proposal?  Else, what have *you*
been talking about?

>If gcc actually does this, then great.  80 bit spills will fix the
>numerics and also give the excess precision people are demanding.  I
>haven't actually met any of these people, but rumor has it that they
>exist. :)

Been reading egcs-bugs lately?

>However, if gcc *doesn't* guarantee this, then the numerics will still
>suck, even with 80 bit spills.  They'll just suck less often.

I'm still not convinced we can guarantee this.  Even if we can't,
fixing this problem that we *do* know about will allow us some
time and breathing room to let *all* of us -- developers and users --
find out what further needs to be done, *and* what it might actually
cost.

>On the other hand, running the FPU in 64 bit mode *will* guarantee
>this at least for computations done on double precision numbers.  As
>for the single precision computations, this property won't hold, so
>they'll still suck, *but*, they'll suck exactly as much as the double
>precision computations would have sucked with an 80 bit FPU & 80 bit
>spills.

Since my proposal is not likely to be implemented anytime soon, I
suggest we try patching gcc to force programs to start in 64-bit
mode and otherwise not worry about it ("set once and forget") as
you've been proposing.

Then, leave this in the snapshots for as long as it takes to get some
real feedback on it, while publically encouraging people with numerical
codes, other libraries, and so on to try it out and see how it works.

Of course, we'd have to make it clear that this is not a committed-to
strategy, but is being done to collect data on its impact, so people
shouldn't start rewriting or recompiling their libraries to accommodate
it just yet.

>The choice is clear, at least for me.  I'm setting the FPU to 64 bit
>mode in all my code.  Maybe I'll even put it in crt0.o.

Yup.  We should do this (whichever gives the most coverage) across
the board in enough egcs snapshots to find out how it actually
affects people, e.g. how much other code must be rewritten.

As I've already pointed out, this is a proposal that might well break
things, possibly well down the road after we *think* it won't.

My proposal is unlikely to break things to nearly the same extent,
though it might slow things down a bit.  That's mainly why I prefer
it.

But the only way to convince some of you that I'm right about 64-bit-mode
FPU possibly breaking things is to convince you to install it, and see
what happens.  That way we can all find out for ourselves.

Again, I'm *not* in support of this proposal.  I *am* in support of
collecting data for it, and, since *my* proposal will take some time
to implement, we might as well implement this much-easier proposal
(at least that's how it could ideally work out) and see what happens,
if only to stop people from claiming that 64-bit-mode is some kind
of panacea (though it'd be great if it was).

What most worries me about your approach is how *clear* the choice
is for you *already*, without having all the data I see as necessary
for *any* of us to make a decision.  Being so willing to change such
a fundamental aspect of the processor's behavior when any gcc-compiled
main program starts running, without even *trying* to learn how that
might affect existing code, strikes me as rather irresponsible.  But
maybe you're right and I'm wrong; in any case, you don't seem to
pay attention to my warnings about this, so go ahead and implement it.
Maybe the feedback we get will convince you.  It's even possible it'll
stop me from worrying anymore.

>And I also realized why I'm still arguing about it.  I think people
>who find the problem have an easy & reasonable fix, so I don't think
>the whole thing really is a big issue.  But, I'd just like the issues
>to be clear for all involved.

I agree with that last part, definitely.  But people with code that
is 32-bit and/or 64-bit and is not able to handle intermediate
results that are sometimes 80-bit, sometimes chopped to 32/64 bits,
do *not* have an easy & reasonable fix, IMO, unless you count "stop
using gcc on the x86 [and m68k?]" as one.  But if you know of a
reasonable fix for each the cases that might come up, write up the
docs so we can check it out for ourselves, and evaluate those
fixes in terms of their clarity, expressiveness, and performance
on compiler/CPU combinations *other* than gcc/x86.

        tq vm, (burley)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-16 13:52 Toon Moene
@ 1998-12-17 10:06 ` Craig Burley
  1998-12-17 12:16   ` Harvey J. Stein
  1998-12-17 11:20 ` Dave Love
  1 sibling, 1 reply; 65+ messages in thread
From: Craig Burley @ 1998-12-17 10:06 UTC (permalink / raw)
  To: toon; +Cc: burley

>> Also, you can use genuinely _better_ algorithms when you can rely on 
>> something very close to IEEE, and that is currently pretty hard on 
>> x86 with gcc.  And a touch of extended precision can really lead to 
>> algorithms that give huge performance improvements (factors of 20-40
>> for normal eqns. v. QR for least squares), although those examples
>> are beyond the current discussion.
>
>I'll grant you that one - I sure never researched the boundaries of IEEE
>754 arithmetic.  However, what I'm challenging is that we should burden
>the compiler with these considerations *by default* (I don't mind if
>it's hidden behind a compile time option like -pedantic-numerics).

I don't think anybody's saying we *should* so burden the compiler --
I'm certainly not -- but, given that the compiler is *already* burdened
with computing 80-bit results, to achieve half-decent performance on
the x86, the question is, what can we do to mitigate the fairly
serious programming challenges posed by the fact that the compiler,
on its own, pretty randomly truncates *some* of those results to 64-bit?

>What I tried to get across is that it is not *reasonable* to punish
>32-bit applications with the burden of either 1) unaligned 80 bit spills
>or 2) aligned 80 bit spills that are 4 times as large as necesssary.

We're *already* punishing them with computation of 80-bit results on
the x86!!  I don't know whether it's possible to get the x86 to run
faster by having it run in 32-bit FPU mode, but *surely* it's possible
to build a similar CPU that goes much faster when it knows it's operating
on 32-bit quantities and computing only 32-bit results!  Maybe we
can't do this in gcc, but to say that making gcc generate correct,
consistent code on the x86 is a case of *gcc* punishing 32-bit users is
a bit over the top.

So, IMO, users of the x86 are already getting poor 32-bit FP performance,
in terms of what the chip they're using *could* do, given its silicon.
If we have to hurt performance a little bit more to make the generated
code work correctly, so be it, but we won't know the effects until we
try it out for awhile.

We *still* won't be doing what vendors like Sun do to get even *more*
consistent results, to wit, making -ffloat-store the default on x86,
which I suspect would hurt performance far more than what I'm proposing
on all but the rarest code.

What I *do* know is that, if we don't implement my proposal, we might
as well stop telling people to use -ffloat-store on the x86, because it
is nearly useless as long as we randomly spill intermediate results to 64
bits.  (I doubt there's much code written that needs -ffloat-store but
somehow copes with this random spill/chop combination.)

>Yes, I know that you believe that floating point register spills are
>scarce.  You probably also believe that "Real Programmers are not afraid
>of 5 page long DO loops" was meant as a joke.

Until I see proof to the contrary, *I* believe 5-page-long DO loops
that see substantial performance decline due to spilling to 80-bit
temporaries are going to be quite rare, while Fortran and C codes
that *start working* as they *already* do when compiled under most *other*
combinations of compiler and CPU architecture will be, comparatively,
quite common.

Do you disagree with *that*?

If so, then let's find out who's right by implementing my proposal,
making it the default, and soliciting feedback from users as to what
effect the new snapshots have on their code.

If we implement my proposal as the default, we'll find out pretty
quickly whether it works, and what it costs to newly compiled code.

If it doesn't work, we'll have to fix it anyway, but we probably *wouldn't*
find that out without making it the default in snapshots and, perhaps,
in a release or two.

If it costs too much, by the time we'll have found that out, the entire
egcs-using community, perhaps as well as the rest of the industry, will
have learned an important lesson.

And, at that point, we can probably change the default back without
upsetting too many people, though that's easier if we haven't done an
official *release* with that default.  (So we should definitely take
that step with care and thought aforehand.)

If we refuse to make spilling to 80-bit temporaries a default, I suspect
we'll never really get such spilling to work in the first place.  It
won't matter if we make it available only as an option: the people who
need to use it are, mostly, the unwashed masses who don't think they
should have to read 80 pages of documentation on a compiler to figure
out how to get existing code to compile under gcc on an x86 and produce
correct results -- code that *already* "works" on most any other machine,
and perhaps under several proprietary compilers on the x86 as well.

So they won't use the option, and most of them will conclude gcc and g77
just don't work well for hard-core numerical work.

I still haven't seen any evidence that we should keep fast flakiness
in FP numerics the *default* for the gcc compiler suite.  "Runs faster"
isn't much of an argument, IMO -- else we might as well make -ffast-math
the default, since that'd make things run even faster, with less
IEEE-style predictability or consistency.  (And I'm not generally
opposed to or in favor of doing things like that.  I'd rather *any*
product default entirely to always-works, tweak-for-speed, or to
always-fast, tweak-for-full-range-and-predictability, rather than some
mish-mash as gcc now does and will probably forever continue to do.
So I focus on each issue as it comes up: is it worth getting this
particular thing right, or fast, considering all the factors?)

Further, it seems the only arguments against my proposal come from
people who, except for Toon, also argue the compiler shouldn't ever
rearrange FP expressions for optimization -- something I believe
every single proprietary optimizing Fortran compiler on the planet
does, something that is *necessary* for generation of fast code and
*expected* by programmers.

Let's get our priorities straight, and I'm sorry if this offends some
of you, but in my experience, some of you don't seem to have a good
grasp of engineering products for *use* by *people* with expertise that
substantially differs from yours:

  -  Compiling D = A * B * C as D = (C * A) * B is perfectly
     acceptable in Fortran, C, Ada, and so on; the programmers know
     it, they accept it, and they can often *predict* when it'll happen.
     gcc might not do it now, but it will someday, if it is to remain
     relevant in the industry.  (Someday, it and/or other compilers might
     compile the statement that way only sometimes, though.)

     Further, it's trivial for them to avoid it by writing D = (A * B) * C,
     with, typically, a neglible impact on performance, across the
     board, preserving expressive portability.

  -  Compiling D = A * B * C so that, sometimes, A * B is computed
     as an 80-bit result, and other times, it isn't, is generally
     *unacceptable* in any language; the programmers don't know it's
     going to happen (especially when they're moving their code across
     architectures), they don't accept it when they discover it *is*
     happening, and they can't *predict* when it might happen with any
     worthwhile degree of reliability.

     Further, the only way they can avoid it is by writing *much* slower
     code, e.g., by assigning A * B to a temporary *and* compiling
     using -ffloat-store when compiling, and doing this on *every single
     line* of existing code...which will then likely slow it down on
     *every* compiler/architecture combination on which it is then compiled.
     Or, e.g., they can change their code to test for tolerances, tests that
     are not needed on most compiler/CPU combinations but, again, slow
     those down anyway, all to accommodate gcc/x86.

I know some of you want to do things "fast".  I prefer we do them "right".

Some of you want people to dumb down their portable codes so they work
on the gcc/x86 combination as well as they work elsewhere, even though
doing so will almost certainly slow them down elsewhere.  I prefer we
not force people to change their code when there's a reasonable alternative
(and my proposal *is* a reasonable alternative, I think; at least, there's
been no evidence submitted to date that it'll make any code work less
well).

Make the compiler *work* out of the box for more people.  Let users then
tweak their code and/or compile-time invocations to make their code *fast*
afterwards, if necessary.  It's probably cheaper for them to buy a faster
x86 to make up for the loss of speed than to pore over every line of code
and figure out how to accommodate 64-bit spills, anyway.

The opposite approach I'm seeing a few people advocating simply doesn't
work for a product used so widely by so many people.  I'd agree it's
fine for, say, a shop like Toon's, but, these days, much of the work
done porting numerical code to new compiler/OS/CPU combinations seems
to be done by people who *don't* understand the numerical algorithms
involved, and are thus just "code jockeys" who know the language (C,
Fortran, whatever) well enough to handle ordinary "porting issues".

Do we want these people to feel free to port all these codes to gcc,
or to tell their managers, in droves, that gcc doesn't work for most
numerical codes, especially when used on the most popular general-
purpose CPU architecture on the planet, the x86?

If it turns out that the new default makes tons of working code slower
but makes very little code start working as expected, then we can change
the default back to -fchop-fp-spills.  That way, users can try
-fno-chop-fp-spills to fix the second problem described above, with
(presumably) not as much impact on performance as if they have to
use -ffloat-store and change their code as well.  But we won't find
out if -fno-chop-fp-spills ever works in the first place without making
it the default, because users won't know to use it (since needing to
use an option to get the compiler to save the entire contents of a
register *it* decides to spill is so nonsensical, most users won't
think of it as something to check for).

        tq vm, (burley)

P.S. I can't believe Toon is worrying about keeping the x86's
FP performance looking good.  Has the world turned upside down?  :)

P.P.S. Without clear documentation, perhaps from those opposing my
proposed default, on how to code without ever needing 80-bit spills,
it'll be pretty hard to convince people they don't.  Telling them they
should know better is not nearly as good as teaching them what they
should supposedly know.  That kind of documentation *alone* would
established a need for a front-end-independent document on gcc code-
generation issues (i.e. a "this is what you, as a programmer *using*
gcc, g77, g++, etc., need to know about the gcc back end" document),
since it should be fairly language-independent.  I think that's all to
the good, so maybe those of you opposing my proposal should stop
complaining via email and start writing the documentation containing
all your experience regarding how to avoid these FP mistakes in clear,
simple language that typical code-porters can understand.

P.P.P.S. Many of the other posts, opposed to my proposed default, have
contained clear, factual *errors* regarding what various standards permit
or disallow.  If these people, who are supposedly so bright they can
make emphatic statements about how software works or how hardware can
or cannot be designed, cannot be bothered to read the relevant standards
before discussing proposed compiler *policy*, where do they get off
expecting many thousands of users to read the tomes of documentation
necessary to figure out compiler *use* before trying, and fixing, their
already-working FP code so it continues to work under gcc/x86?

Or, put another way: let's not be the typical group of developers who
"lays heavy burdens upon others, but will not lift up so much as a finger
to help them", or however that quote goes.  If we expect users to read
all the docs before compiling with gcc, *we* should go to that much
*more* trouble, reading standards and learning about the issues, before
making emphatic statements about what future compiler policy in technical
matters should be.  (That's why I've generally phrased my proposal as
quite tentative, pending research, etc.  The stridency of the opposition
has been somewhat surprising to me, under the circumstances, especially
given the lack of actual facts that I'd hoped to see as counters to
my proposal.)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
@ 1998-12-16 13:52 Toon Moene
  1998-12-17 10:06 ` Craig Burley
  1998-12-17 11:20 ` Dave Love
  0 siblings, 2 replies; 65+ messages in thread
From: Toon Moene @ 1998-12-16 13:52 UTC (permalink / raw)
  To: egcs

Edward, Joe and Craig,

I'm not going to address all the posts separately - lets just
concentrate on Edward's:

> Also, you can use genuinely _better_ algorithms when you can rely on 
> something very close to IEEE, and that is currently pretty hard on 
> x86 with gcc.  And a touch of extended precision can really lead to 
> algorithms that give huge performance improvements (factors of 20-40
> for normal eqns. v. QR for least squares), although those examples
> are beyond the current discussion.

I'll grant you that one - I sure never researched the boundaries of IEEE
754 arithmetic.  However, what I'm challenging is that we should burden
the compiler with these considerations *by default* (I don't mind if
it's hidden behind a compile time option like -pedantic-numerics).

> And Mr. Buck's example does happen in real code.

Yes, but that doesn't make it correct, even on a strict, one size fits
all IEEE 754 machine.  The point is that the following code:

      REAL FUNCTION FINDROOT(FIRSTGUESS)
  10  FINDROOT=<expression involving FIRSTGUESS>
      IF (FINDROOT .EQ. FIRSTGUESS) RETURN
      FIRSTGUESS = FINDROOT
      GOTO 10
      END

simply is not guaranteed to work (I discussed this on comp.compilers
some months ago).  For an arbitrary choice of FIRSTGUESS and <expression
involving FIRSTGUESS> one _cannot_ prove that this won't eternally
oscillate between two numbers just one bit apart *in any precision*.

So this is the wrong way to solve such a problem.

The correct termination comparison is:

      IF (ABS(FINDROOT - FIRSTGUESS) .LT.
     ,    TOLERANCE * FIRSTGUESS) RETURN

with a suitable value of TOLERANCE (dependent on whether computations
are with 32, 64, or 80 bit REALS (Fortran 90 offers intrinsics to
parametrise this).

>  - My main concern is that there is a grid spacing that will render the
>  - basic equation of geostrophy badly approximated in 32-bit arithmetic:
> 
> And at the moment, that entirely depends on which variables happen
> to be spilled and which don't.

[ Sorry, I meant: My *only* concern is ... ]

No, because _we_ know what we're doing because we estimated the error
propagation in an independent way.

What I tried to get across is that it is not *reasonable* to punish
32-bit applications with the burden of either 1) unaligned 80 bit spills
or 2) aligned 80 bit spills that are 4 times as large as necesssary.

Yes, I know that you believe that floating point register spills are
scarce.  You probably also believe that "Real Programmers are not afraid
of 5 page long DO loops" was meant as a joke.

Cheers,

-- 
Toon Moene (toon@moene.indiv.nluug.nl)
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
Phone: +31 346 214290; Fax: +31 346 214286
g77 Support: fortran@gnu.org; egcs: egcs-bugs@cygnus.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-16 10:36   ` Craig Burley
@ 1998-12-16 12:47     ` Harvey J. Stein
  1998-12-17 10:22       ` Craig Burley
  0 siblings, 1 reply; 65+ messages in thread
From: Harvey J. Stein @ 1998-12-16 12:47 UTC (permalink / raw)
  To: Craig Burley; +Cc: hjstein

Craig Burley <burley@gnu.org> writes:

 > So, I think gcc is closer to this latter case of yours than the former,
 > although I'm uncomfortable saying that "everything always stays in
 > FP registers except for spills", not having studied the issues.

The *only* way to get correct numerics is to guarantee that values
can't go from higher precision to lower precision and back to higher
precision.  Once this happens then you're left with the possibility of
computations sometimes using higher precision & sometimes lower
precision, and having seemingly identical computations produce results
which don't test for equality - i.e. - all the problems with the
current defaults.

Given this, then doing 80 bit spills will fix things *only* if you
guarantee that once a value hits the FPU it remains 80 bit (aka 80 bit
contagion).  I.e. - the compiler can't store back into a double (for
example) and then fetch this value back for additional computations.
It has to always get the value from 80 bit memory instead.

If gcc actually does this, then great.  80 bit spills will fix the
numerics and also give the excess precision people are demanding.  I
haven't actually met any of these people, but rumor has it that they
exist. :)

However, if gcc *doesn't* guarantee this, then the numerics will still
suck, even with 80 bit spills.  They'll just suck less often.

On the other hand, running the FPU in 64 bit mode *will* guarantee
this at least for computations done on double precision numbers.  As
for the single precision computations, this property won't hold, so
they'll still suck, *but*, they'll suck exactly as much as the double
precision computations would have sucked with an 80 bit FPU & 80 bit
spills.

So, the summary:

                    FPU width & spill width
                 64                          80
computation
 width

 doubles     Correct.                     Correct *if* you guarantee 
                                          80 bit contagion.

 floats      Correct *if* you guarantee   Correct *if* you guarantee 
             64 bit contagion.            80 bit contagion.

The choice is clear, at least for me.  I'm setting the FPU to 64 bit
mode in all my code.  Maybe I'll even put it in crt0.o.

And I also realized why I'm still arguing about it.  I think people
who find the problem have an easy & reasonable fix, so I don't think
the whole thing really is a big issue.  But, I'd just like the issues
to be clear for all involved.

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15  3:34 ` Harvey J. Stein
@ 1998-12-16 10:36   ` Craig Burley
  1998-12-16 12:47     ` Harvey J. Stein
  0 siblings, 1 reply; 65+ messages in thread
From: Craig Burley @ 1998-12-16 10:36 UTC (permalink / raw)
  To: hjstein; +Cc: burley

>For example, suppose I have code like:
>
>    x = a*b;
>    y = c*d;
>    z = x+y;
>
>I've been under the (worst case) assumption that any combination of a,
>b, c & d might be in FP registers, that the multiplies might be done
>using register/register or memory/register multiplies, and that x and
>y might be gotten either from memory or from registers.
>
>Is this the case?

Yes, conceptually.  I don't know whether gcc does this currently,
but any current or future CPU chip might mandate that maximum
optimization is achieved by being able to make such decisions on
arbitrary bases -- including replicated that code sequence into
multiple ones, each with different sets of decisions, with the
choice of entry depending on which code path preceded and/or followed
the snippet, to get minimal cache re-loading.

In other words, the first time the above code is executed, it puts
everything in registers; the second time, some things spill to memory;
and so on.

In a sufficiently complicated CPU, it might even be profitable to
*simultaneously* execute multiple instances of the above snippet
implementing different decisions, and choose which results to use
at the last moment depending on preceding or subsequent decisions
made by the code, again, for optimization.  (If anyone doesn't get
or believe this, don't worry -- it's quite unlikely we'll ever see
such a CPU that also has wider FP registers than normal FP operations
like the x86.  But I might have to eat my words someday.)

>Or is it the case that values always go into FP registers first, and
>are always manipulated from FP registers except if we run out, in
>which case a spill is done?

I think this relates to -fforce-mem.  Basically it's up to the compiler.
Some issues might be more general than the x86 (the x86 might mandate
some choices due to instruction set, others due to optimization
requirements), and my proposal relates to the general issue of
completely spilling all registers.

>In particular, is it never the case that
>something would get stored back into memory (freeing up an FP
>register), and then later loaded back into an FP.  For example, in the
>above (after adding enough computations), could x get computed, stored
>back into &x and then later loaded from &x to compute z?

Assuming &x doesn't have the C meaning, but refers to a temporary
copy of x, then, yes, in fact gcc does this now.  It does it trivially
for function return values (causing f(x) < f(y) to not imply f(y) > f(x)),
with more difficulty for straight code (I have an example I made up).

>If my original assumption is correct, then I think my objections still
>hold - spilling in extended precision will help a little but not
>completely.

It'll at least get rid of a big, obvious problem, but we might well
find problems remain.  I'd rather we study the problem up front so
we understand and document the remaining problems, and have some
idea of what solutions to propose, if not offer, but I don't think
we should ask people to *not* fix gcc as I propose until we have
accomplished this (as long as we have an option to get the current
behavior).

>If not - if it's really the case that everything always stays in FP
>registers except for spills, then I agree that doing 80 bit spills
>will largely prevent weird numerical values.  It would effectively
>make 80 bitness contagious, which should be sufficient even for
>comparisons to act reasonably (assuming constants are also computed in
>80 bits, to prevent 1.0/3.0 from not equalling x/y after x=1.0;
>y=3.0).

My impression is that, because -fno-force-mem is now the default, most
*computations* are 80-bit, even if the operands start out as 64-bit.

So, I think gcc is closer to this latter case of yours than the former,
although I'm uncomfortable saying that "everything always stays in
FP registers except for spills", not having studied the issues.

At least, for the sample program I tried to write (and compiled without
optimization), it was *hard* to get the compiler to do *anything*
outside of FP registers, AFAICT, except of course the loads/stores
of the variables, and the final store of the result.  Only by making
this straight-line code very complicated could I convince it to
spill any intermediate results.

Therefore, I think the stuff we'll tend to "fix" by adopting my
proposal will be function return values, offhand, not by what
we think of as "normal" spills -- because it seems that normal
spills in straight-line code are pretty rare.

That's also why I don't think we'll see a big performance hit on
most code, and, of the code that *does* see it, I would't be
surprised if it turned out that a substantial portion of it was
previously producing inadequately correct results in its "fast" mode.

(This seems to be a recurring theme throughout the industry.  I remember
back when I went through, then "inflicted" on others, various iterations
of the "we will now default to rejecting implicit declarations" paradigm,
e.g. IMPLICIT NONE in Fortran.  Each time, the opposing arguments were
things like "but that'll slow me down", "I already know what I'm doing",
"what about all the existing, working code", but once the decision was
executed, the results slowly convinced most everyone that the bugs found
in existing production code were worth finding, even at the additional
costs.  The mind-set that claims partial spills are an okay default is
the same that claimed, 15 years ago, that header (#include) files and
prototypes were a waste of time and energy, suitable only for the
newbie programmer.  Not entirely false, but missing the big, and growing,
picture, IMO.)

        tq vm, (burley)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15  6:43 ` Stephen L Moshier
@ 1998-12-16 10:14   ` Craig Burley
  0 siblings, 0 replies; 65+ messages in thread
From: Craig Burley @ 1998-12-16 10:14 UTC (permalink / raw)
  To: moshier; +Cc: burley

>If you want to guarantee a value will be written out to memory,
>you can simply declare the memory variable to be "volatile."
>Then the compiler cannot continue to ignore you.  The reason
>this works, and must work on all compilers that optimize, is that
>you would not be able to write hardware device drivers if "volatile"
>did not work.  Software people seem to find this solution unappealing.
>I do not understand why.

Because `volatile' has nothing to do with getting consistent numerical
results.  "Writing to memory" is a meaningless phrase in numerical
programming.  Numerical programmers should not have to worry whether
the compiler places a value in a register, in memory, in a temporary
file, or whatever.  They should be able to just focus on the minimum
precision they need for variables, operations, and so on, and get
consistent behavior from the code generators they use.

If they *don't* get the consistent behavior, they'll use better
code generators that *do* give them consistent behavior.

Further, `volatile' is basically completely useless in getting f(x) < f(y)
to imply that f(y) > f(x), AFAICT.

        tq vm, (burley)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15 12:24 Toon Moene
  1998-12-15 12:55 ` Joe Buck
  1998-12-15 15:05 ` Edward Jason Riedy
@ 1998-12-16 10:05 ` Craig Burley
  2 siblings, 0 replies; 65+ messages in thread
From: Craig Burley @ 1998-12-16 10:05 UTC (permalink / raw)
  To: toon; +Cc: burley

>Am I the only one - apart from Harvey J. Stein and Tim Prince - who
>finds this whole discussion unreal ?  Surely, 80 bit temporaries might
>seem a neat hack to a numerical analyst like Dr. Kahan, but the ordinary
>computational physicist or chemist knows better than to choose "poorly
>conditioned" algorithms.

Then we have an awful lot of "extraordinary" people sending email
asking why FP doesn't work as expected in g77, gcc, egcs, and so on,
and campaigning for various corrections to Java, IEEE 854, x86, or
whatever!

>My main concern is that there is a grid spacing that will render the
>basic equation of geostrophy badly approximated in 32-bit arithmetic:
>
>	 1  dp
>	--- -- = f v
>	rho dx
>
>p	pressure
>rho	mass of air per unit volume (1 kg / m^3)
>x	distance
>f	Coriolis parameter (10^-4 s^-1)
>v	wind speed
>
>You can do the math (p ~ 10^5 kg m^-1 s^-2, v ~ 10 m/s, what dx will
>make dp < 10^-3 p ?)

I can't figure out what you're saying.  How will *not* randomly spilling
a computed 80-bit intermediate value to a chopped-down 64-bit result
make your code stop working, exactly?

>If that's the case we have to rethink our finite difference code for the
>first time in 13 years and use a trick like subtracting a basic state
>from the equations - big deal.

I still don't know what you're saying.  How does fixing this longstanding
bug in gcc/g77 break your code, exactly, in terms I, as *not* a
math expert, can understand?

>The last thing I need is to have egcs slowed down to a crawl by having
>it spill unaligned 80-bit temporaries for something that shouldn't be
>larger than 32 bits in the first place.

We don't *know* that it'll slow down to a crawl.  Spilling outside
of function-call return values seems to be rather rare; spilling return
values probably happens less when optimization is turned on, and,
besides, you're doing a *call* already!

I don't like that stack frames will get somewhat larger, though.

>Please make this and other "accuracy" options a "-pedantic-numerics"
>one.

Sounds like you're arguing in favor of the default being extreme speed
at the expense of correct, consistent behavior.

IMO, your experience represents that now-rare breed of programmers:
People Who Know What They're Doing.

And, if shops like yours can rewrite all your code, at huge expense,
to no longer depend on language support for 64-bit integers, to get
it to run on the fastest multi-million-dollar supercomputer you could
get ahold of...

...then you can darn well use the `-fchop-fp-spills' option I've
proposed (though, in my original proposal, I hadn't yet proposed a
name for it) when you decide you want your working code to run
faster.

I think that's pretty fair to ask of you, rather than ask the millions
of people who we want to use gcc, g77, and so on, over the next few
years, to use a special option to get their fast code to start
working in the first place!

In particular, I'd rather people who use g77 to do their numerical
work be able to remain experts in their fields, rather than have to
become experts in compiler code generation, which they'll have to be
to know what options to use.  They might have to increase their
computer expertise to get things to run *faster*, but, generally,
I think we should not require people to be experts on deep-down
details of how particular pieces of software do their job just to
get their straightforwardly-written code to *work* in the first place.

I know, "late answers are wrong answers", but wrong answers remain
wrong answers no matter how quickly one obtains them.  And since most
people have far less expertise and resources than shops like yours,
Toon, we don't want to burden them by telling them all the options
they must use to get code to work "as expected", all of which carry a
red flag saying "this will slow down your code, but if you know what
you're doing you can avoid it", which will cause *most* of these
programmers to say "of course I know what I'm doing", forget the
option, and get wrong results.

I think we'd be far better off, thinking globally and into the future,
if the defaults tended towards correctness and consistency, and the
options that changed behavior generally said things like "if you know
your code never depends on ..., you can use this option to possibly
get better performance".

We're less likely to get spurious bug reports using this approach, at
least -- compare the number of spurious bug reports we've gotten from
people *using* -ffast-math versus those *forgetting* to use -ffloat-store,
for example!

Ideally, the philosophy I'm promoting above would extend to ensuring
much more consistency across *all* GNU platforms.  As I've said before,
this'd mean defaulting to -mieee (or even -mieee-with-inexact) on
Alphas, completely emulating IEEE software on a few older machines,
and so on, and I don't think the industry would welcome the kind of
performance drop we'd get as a worthwhile tradeoff for the small amount
of extra consistency...at least, not right now, and, besides, if users
want Java, they know where to get it (at least for the moment, in
theory, when the moon is just overhead ;-).

But there's a *clear* widespread lack of understanding among today's
programmers that f(x) < f(y) does not imply f(y) > f(x) even when
f has no side effects or external references and x and y are constants.

I think it's easier for us to at least try to live with this lack of
understanding by fixing the compiler to meet the expectations of
this huge audience, than to try and teach them all to at least use
some option like -pedantic-numerics, much less teach them all about
why internal compiler code generation can produce such amazing
results.

From my compiler-internals perspective, I'm flummoxed as to why *anyone*
with knowledge of the issues would claim that 80-bit values should be
randomly chopped down to 64 bits by the compiler as a *default*.  From that
perspective, performance simply isn't an issue -- if it was, we could
simply not spill *anything* and just re-use random data and get *great*
performance, if consistent results were so unimportant to us.

The FP register stack contains 80-bit registers.  *Exactly* why should
we not spill them *correctly*?

        tq vm, (burley)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-16  8:32     ` Sylvain Pion
@ 1998-12-16  9:20       ` Craig Burley
  0 siblings, 0 replies; 65+ messages in thread
From: Craig Burley @ 1998-12-16  9:20 UTC (permalink / raw)
  To: Sylvain.Pion; +Cc: burley

>It is safe in the sence that the language may not require evaluating
>(a*b*c*d) in a specific order (ANSI C doesn't afaik).  However for FP,
>it can give different results.  Think overflow for example, or when
>rounding is set to infinity, you have: 0.1 * (0.1 * -1) != (0.1 * 0.1) * -1

Correct.  Nor does *any* version of Fortran -- *the* language
for high-speed scientific/numeric programming -- mandate evaluation
of expressions like a*b*c*d in any particular order.

I don't think I want to spend my valuable time correcting people on
language issues, when they can read the standards, and the discussions
already taking place among experts on the Internet, for themselves.

So, Sylvain, Joe Buck, and others, thanks for helping correct people
who seem to have appeared out of the woodwork to argue against a
very reasonable proposal using non-existant requirements regarding
numerics among the Fortran, C, and IEEE standards!

        tq vm, (burley)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15 10:14   ` Jeffrey A Law
@ 1998-12-16  8:32     ` Sylvain Pion
  1998-12-16  9:20       ` Craig Burley
  0 siblings, 1 reply; 65+ messages in thread
From: Sylvain Pion @ 1998-12-16  8:32 UTC (permalink / raw)
  To: law; +Cc: egcs

On Tue, Dec 15, 1998 at 11:12:01AM -0700, Jeffrey A Law wrote:
> For FP, we would like the ability to reassociate some expressions.  Take
> (a * b * c * d) * e
> 
> Right now we'll genrate
> 
> t1 = a * b;
> t2 = t1 * c;
> t3 = t2 * d;
> t4 = t3 * e;
> 
> Note the dependency of each insn on the previous insn.  This can be a major
> performance penalty -- especially on targets which have dual FP units or where
> a fpmul isn't incredibly fast (data dependency stalls at each step).
> 
> t1 = a * b;
> t2 = c * d;
> t3 = t1 * t2;
> t4 = t3 * e;
> 
> Is a much better (and safe as far as I know) sequence.  The first two insns
> are totally independent, which at the minimum reduces one of the 3 stall
> conditions due to data dependency.  For a target with a pipelined FPU or
> dual FPUs the second sequence will be significantly faster.

It is safe in the sence that the language may not require evaluating
(a*b*c*d) in a specific order (ANSI C doesn't afaik).  However for FP,
it can give different results.  Think overflow for example, or when
rounding is set to infinity, you have: 0.1 * (0.1 * -1) != (0.1 * 0.1) * -1

-- 
Sylvain

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15 12:24 Toon Moene
  1998-12-15 12:55 ` Joe Buck
@ 1998-12-15 15:05 ` Edward Jason Riedy
  1998-12-16 10:05 ` Craig Burley
  2 siblings, 0 replies; 65+ messages in thread
From: Edward Jason Riedy @ 1998-12-15 15:05 UTC (permalink / raw)
  To: Toon Moene; +Cc: egcs

Oh well.  And Toon Moene writes:
 - 
 - Am I the only one - apart from Harvey J. Stein and Tim Prince - who
 - finds this whole discussion unreal ?  Surely, 80 bit temporaries might
 - seem a neat hack to a numerical analyst like Dr. Kahan, but the ordinary
 - computational physicist or chemist knows better than to choose "poorly
 - conditioned" algorithms.

In my experience, that is not true.  I've seen many computational
chemists never check the condition numbers of their matrices, toss
them into pre-packaged routines (EISPACK, even), and then present 
eigenvalues as being more precise than they are.  It's not done 
intentionally; they just don't know better.  (Which is perfectly 
reasonable.  Ask me about equivalent difficulties in chemistry and 
I'll be clueless.)

Also, you can use genuinely _better_ algorithms when you can rely on 
something very close to IEEE, and that is currently pretty hard on 
x86 with gcc.  And a touch of extended precision can really lead to 
algorithms that give huge performance improvements (factors of 20-40
for normal eqns. v. QR for least squares), although those examples
are beyond the current discussion.

And Mr. Buck's example does happen in real code.

 - My main concern is that there is a grid spacing that will render the
 - basic equation of geostrophy badly approximated in 32-bit arithmetic:

And at the moment, that entirely depends on which variables happen
to be spilled and which don't.  Spilling 80-bit units won't hurt
your app in accuracy.  Take the fact that you've never run into the
problem as evidence that your current discretization is fine.

 - The last thing I need is to have egcs slowed down to a crawl by having
 - it spill unaligned 80-bit temporaries for something that shouldn't be
 - larger than 32 bits in the first place.

If they're aligned, it won't slow down much (I think) on real machines.
The extra alignment will cause it to eat 128 bits in cache rather than
32 bits, but I've been told most differential equation solvers aren't
as picky about cache as linear solvers (my area).  Yours may be 
different.  And I believe Mr. Buck is only looking at spills _within
single expressions_.  It's quite possible your app doesn't have any,
in which case you won't be bothered at all.

Anyways, I'll shut up until I at least know what would need modified
to implement 80-bit spills.  I think they're a good start and probably
just what most people need.

Jason

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15 12:10 Geert Bosch
@ 1998-12-15 13:09 ` Jeffrey A Law
  0 siblings, 0 replies; 65+ messages in thread
From: Jeffrey A Law @ 1998-12-15 13:09 UTC (permalink / raw)
  To: Geert Bosch; +Cc: Joe Buck, egcs, hjstein, moshier, tprince

  In message < 9812152009.AA19622@nile.gnat.com >you write:
  > I'd like to comment on this issue for Ada, and I'll explain what the
  > current situation is in GNAT and how (small?) compiler changes could
  > improve efficiency of overflow checking.
Thanks.  I'd peeked at the Ada compiler to try and figure out how it handled
the overflow issues and got totally lost.

  > The compiler does all arithmetic that needs to be checked for overflows
  > using a wider type. Regular 32-bit integers are calculated using 64-bits.
Ouch!  Yea, that's got to be quite inefficient.

  > To get efficient
  > overflow checks, the compiler should be able to take advantage of overflow
  > bits in the status register and raise an exception when an overflow is
  > detected.
Right.

  > This would be a place where the backend could help, although I don't
  > know exactly how this should be implemented.
Shoundn't be all that hard.  Most of the work is in the front/middle end.

Somehow the front end has to specify to the middle end where overflow checks
need to occur.  Presumably you'd need a new tree code for that.

The middle end would convert that tree code into a trap_if or similar rtl
construct which checked the overflow bit.

  > Reordering integer additions is fine for Ada-95, as exceptions do not need 
  > to be exact as long as they occur in the same block. It is also allowed to
  > not raise an exception at all if the final result is mathematically correct
  > even if intermediate values would have overflowed. Also when some operation
  > would have no external effect in the absense of checks, the compiler is
  > allowed to remove the checks and as a result usually is able to remove the
  > operation as well.
Ah.  Excellent.  That's a pretty reasonable definition.  It explains why
fold-const is allowed to perform reassociations which may mask overflows or
make them inexact.

Thanks!

jeff

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15 12:24 Toon Moene
@ 1998-12-15 12:55 ` Joe Buck
  1998-12-15 15:05 ` Edward Jason Riedy
  1998-12-16 10:05 ` Craig Burley
  2 siblings, 0 replies; 65+ messages in thread
From: Joe Buck @ 1998-12-15 12:55 UTC (permalink / raw)
  To: Toon Moene; +Cc: egcs

> The last thing I need is to have egcs slowed down to a crawl by having
> it spill unaligned 80-bit temporaries for something that shouldn't be
> larger than 32 bits in the first place.

If egcs is going to spill temporaries, it's going to have to align them,
or yes, we will slow down to a crawl.

> Please make this and other "accuracy" options a "-pedantic-numerics"
> one.

I'm not worried about the last few bits of accuracy, I'm worried about
things like root-finding algorithms blowing up because the < operator
isn't transitive if intermediate results are spilled.

e.g. assume f(x) is continuous and its derivative is always positive,
that f(lo) is negative and f(hi) is positive, and we want to find the
root.  A binary search approach may not be reliable, it may loop forever!
This is because the 80-bit version of lo may be less than the 80-bit
version of hi, while the 64-bit versions are equal.  The current state
of affairs is that the compiler randomly gives some results 80 bits
of precision and some results less, and it may change this at any time,
at random.

Yes, for well-conditioned algorithms just putting the FPU into 32-bit
mode may be the best solution.  But the current behavior has too many
surprises.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
@ 1998-12-15 12:24 Toon Moene
  1998-12-15 12:55 ` Joe Buck
                   ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Toon Moene @ 1998-12-15 12:24 UTC (permalink / raw)
  To: egcs

> Using 80-bit spills is a quick approximation to 
> extending the FP stack into memory, and it should give > some of the benefit with very little  (hopefully) 
> hassle.  Of course, 80 bits is wider than the normal 
> spill, so it eats more memory bandwidth, cache space, 
> etc.  Anyone who's that concerned will bend over 
> backwards to avoid spills anyways.

Am I the only one - apart from Harvey J. Stein and Tim Prince - who
finds this whole discussion unreal ?  Surely, 80 bit temporaries might
seem a neat hack to a numerical analyst like Dr. Kahan, but the ordinary
computational physicist or chemist knows better than to choose "poorly
conditioned" algorithms.

My main concern is that there is a grid spacing that will render the
basic equation of geostrophy badly approximated in 32-bit arithmetic:

	 1  dp
	--- -- = f v
	rho dx

p	pressure
rho	mass of air per unit volume (1 kg / m^3)
x	distance
f	Coriolis parameter (10^-4 s^-1)
v	wind speed

You can do the math (p ~ 10^5 kg m^-1 s^-2, v ~ 10 m/s, what dx will
make dp < 10^-3 p ?)

If that's the case we have to rethink our finite difference code for the
first time in 13 years and use a trick like subtracting a basic state
from the equations - big deal.

The last thing I need is to have egcs slowed down to a crawl by having
it spill unaligned 80-bit temporaries for something that shouldn't be
larger than 32 bits in the first place.

Please make this and other "accuracy" options a "-pedantic-numerics"
one.

-- 
Toon Moene (toon@moene.indiv.nluug.nl)
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
Phone: +31 346 214290; Fax: +31 346 214286
g77 Support: fortran@gnu.org; egcs: egcs-bugs@cygnus.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
@ 1998-12-15 12:10 Geert Bosch
  1998-12-15 13:09 ` Jeffrey A Law
  0 siblings, 1 reply; 65+ messages in thread
From: Geert Bosch @ 1998-12-15 12:10 UTC (permalink / raw)
  To: Jeffrey A Law, Joe Buck, law; +Cc: egcs, hjstein, moshier, tprince

On Tue, 15 Dec 1998 11:12:01 -0700, Jeffrey A Law wrote:

  For integer, we need to know where the parens are to preserve integer overflow
  semantics in languages like Ada for similar transformations

I'd like to comment on this issue for Ada, and I'll explain what the current
situation is in GNAT and how (small?) compiler changes could improve efficiency 
of overflow checking.

Currently overflow checking is not done by default in GNAT (GNU Ada95 compiler).
To be fully standards conforming you need to run the compiler with the -gnato 
flag which enables these checks. The reason these checks are disabled is that 
they are inefficient, at least on 32-bit targets.

The compiler does all arithmetic that needs to be checked for overflows using a
wider type. Regular 32-bit integers are calculated using 64-bits. To get efficient
overflow checks, the compiler should be able to take advantage of overflow
bits in the status register and raise an exception when an overflow is
detected. This would be a place where the backend could help, although I don't
know exactly how this should be implemented.

Reordering integer additions is fine for Ada-95, as exceptions do not need 
to be exact as long as they occur in the same block. It is also allowed to
not raise an exception at all if the final result is mathematically correct,
even if intermediate values would have overflowed. Also when some operation
would have no external effect in the absense of checks, the compiler is allowed
to remove the checks and as a result usually is able to remove the operation
as well.

With checks off, the behavior in GNAT is the same as with C. Formally, it
would still be allowed to detect overflows or range checks and raise an
exception. Suppressing checks only means that the implementation should
not impose extra overhead because of the checks. 

Regards,
   Geert

PS. This description is informal, for the exact details see the Ada RM.
    (ISO/IEC/ANSI 8652:1995, http://www.adahome.com/Resources/refs/rm95.html )

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15  9:29 ` Joe Buck
@ 1998-12-15 10:14   ` Jeffrey A Law
  1998-12-16  8:32     ` Sylvain Pion
  0 siblings, 1 reply; 65+ messages in thread
From: Jeffrey A Law @ 1998-12-15 10:14 UTC (permalink / raw)
  To: Joe Buck; +Cc: bosch, hjstein, moshier, egcs, tprince

  In message < 199812151728.JAA00746@yamato.synopsys.com >you write:
  > 
  > > Many useful fpt algorithms rely on ordering of operations to be honored, 
  > > and a compiler evaluating  B + (A - B) as (B + A) - B or even as A 
  > > is seriously broken for numerical stuff.
  > 
  > gcc is not broken in that way: parentheses prevent reordering for FP
  > operations.
Actually, having recently looked into reassociation optimizations I'll chime
in with a minor clarification.

GCC does not show parens anywhere in its tree or rtl structures.  We prevent
these transformations across parens by simply never performing these
transformations on floating point values.

At some point we'll need to be able to perform more fine grained tests, both
to help the floating point issues and to deal with overflow issues in languages
like Ada.

For FP, we would like the ability to reassociate some expressions.  Take
(a * b * c * d) * e

Right now we'll genrate

t1 = a * b;
t2 = t1 * c;
t3 = t2 * d;
t4 = t3 * e;

Note the dependency of each insn on the previous insn.  This can be a major
performance penalty -- especially on targets which have dual FP units or where
a fpmul isn't incredibly fast (data dependency stalls at each step).

t1 = a * b;
t2 = c * d;
t3 = t1 * t2;
t4 = t3 * e;

Is a much better (and safe as far as I know) sequence.  The first two insns
are totally independent, which at the minimum reduces one of the 3 stall
conditions due to data dependency.  For a target with a pipelined FPU or
dual FPUs the second sequence sequence will be significantly faster.

For integer, we need to know where the parens are to preserve integer overflow
semantics in languages like Ada for similar transformations.

jeff

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15  1:45 Geert Bosch
  1998-12-15  3:34 ` Harvey J. Stein
  1998-12-15  6:43 ` Stephen L Moshier
@ 1998-12-15  9:29 ` Joe Buck
  1998-12-15 10:14   ` Jeffrey A Law
  2 siblings, 1 reply; 65+ messages in thread
From: Joe Buck @ 1998-12-15  9:29 UTC (permalink / raw)
  To: bosch; +Cc: hjstein, moshier, egcs, tprince

> Many useful fpt algorithms rely on ordering of operations to be honored, 
> and a compiler evaluating  B + (A - B) as (B + A) - B or even as A 
> is seriously broken for numerical stuff.

gcc is not broken in that way: parentheses prevent reordering for FP
operations.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15  1:45 Geert Bosch
  1998-12-15  3:34 ` Harvey J. Stein
@ 1998-12-15  6:43 ` Stephen L Moshier
  1998-12-16 10:14   ` Craig Burley
  1998-12-15  9:29 ` Joe Buck
  2 siblings, 1 reply; 65+ messages in thread
From: Stephen L Moshier @ 1998-12-15  6:43 UTC (permalink / raw)
  To: Geert Bosch; +Cc: Harvey J. Stein, egcs, tprince

On Tue, 15 Dec 1998, Geert Bosch wrote:

> and a compiler evaluating  B + (A - B) as (B + A) - B or even as A 
> is seriously broken for numerical stuff.

Associative law "optimizations" were rooted out of gcc years ago.
But I do know of at least two commercial dsp compilers, based on gcc-2.3
or earlier, that might have this problem.

The reasons for the compiler deciding to write something out to
memory, or to not write it out, are many and mysterious.  I have
analyzed only one test case in which all 8 fpu registers actually got
used up and something had to be spilled.  That program really cratered
the computer and it is enshrined in c-torture.

If you declare an item to be long double precision in your source
program, then it better stay long double, or else please do post a test
case right away!  If you don't have long double, then maybe long double
is what you should be asking for.

If you want to guarantee a value will be written out to memory,
you can simply declare the memory variable to be "volatile."
Then the compiler cannot continue to ignore you.  The reason
this works, and must work on all compilers that optimize, is that
you would not be able to write hardware device drivers if "volatile"
did not work.  Software people seem to find this solution unappealing.
I do not understand why.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
  1998-12-15  1:45 Geert Bosch
@ 1998-12-15  3:34 ` Harvey J. Stein
  1998-12-16 10:36   ` Craig Burley
  1998-12-15  6:43 ` Stephen L Moshier
  1998-12-15  9:29 ` Joe Buck
  2 siblings, 1 reply; 65+ messages in thread
From: Harvey J. Stein @ 1998-12-15  3:34 UTC (permalink / raw)
  To: Geert Bosch; +Cc: hjstein

"Geert Bosch" <bosch@gnat.com> writes:

 > On 14 Dec 1998 11:51:23 +0200, Harvey J. Stein wrote:
 > 
 >   Reasonable floating point code should expect that reordering
 >   operations will produce slightly different results due to round off
 >   error, and should be tolerant of the optimizer doing such.  Especially
 >   given how little control the programmer has over exactly how
 >   computations are ordered.
 > 
 > Many useful fpt algorithms rely on ordering of operations to be honored, 
 > and a compiler evaluating  B + (A - B) as (B + A) - B or even as A 
 > is seriously broken for numerical stuff.
 > 
 > Having spills to memory retain full precision is very useful as this allows
 > one to prove much more about fpt code. Here is an example of what I mean,
 > using a decimal fpt type with 4 digits for extended precision in registers 
 > and 3 for the in-memory precision of a variable. (Examples using binary
 > 64-bit and 80-bit fpt types are similar but harder to read.)
 >
 > Calculate  S = (10.0 + 0.454) - (0.454 + 10.0), spilling one partial sum to T.

<snip>

 > Case 1 does not use extended registers and rounds at every operation.
 >   This is completely IEEE conformant behaviour.
 > 
 > Case 2 uses extended registers and same precision for spilled value.
 >   This is not IEEE-conformant, but guarantees consistent rounding behaviour.
 >   In particular the relative error is never more than that of case 1.
 >   For most algorithms this will work fine, double rounding will only
 >   occur on the final assignment. This is not ideal, but now worst-case
 >   is one double rounding per statement instead of one per operation.
 >   If assignments are forced to go to memory (using volatile var's for example),
 >   fpt behaviour is independent of optimization level. 
 > 
 > Case 3 uses extended registers, but lower precision for spilled value.
 >   This is the worst case and is what is causing problems right now.
 >   The intermediate values while evaluating the expression may be subject 
 >   to double rounding errors. People who care about right answers often 
 >   turn off optimization, but ironically this makes the problems only worse! 

I was thinking more along the lines of complex sequences of
computations with subexpression elimination, etc.

But, I'm a little unclear on register spilling.  Exactly when do
values enter & leave FP registers?  Does everything stay in FP
registers for the useful lifetime of the value except when a spill
occurs, or can things move in and out of variables more freely?

For example, suppose I have code like:

    x = a*b;
    y = c*d;
    z = x+y;

I've been under the (worst case) assumption that any combination of a,
b, c & d might be in FP registers, that the multiplies might be done
using register/register or memory/register multiplies, and that x and
y might be gotten either from memory or from registers.

Is this the case?

Or is it the case that values always go into FP registers first, and
are always manipulated from FP registers except if we run out, in
which case a spill is done?  In particular, is it never the case that
something would get stored back into memory (freeing up an FP
register), and then later loaded back into an FP.  For example, in the
above (after adding enough computations), could x get computed, stored
back into &x and then later loaded from &x to compute z?

If my original assumption is correct, then I think my objections still
hold - spilling in extended precision will help a little but not
completely.

If not - if it's really the case that everything always stays in FP
registers except for spills, then I agree that doing 80 bit spills
will largely prevent weird numerical values.  It would effectively
make 80 bitness contagious, which should be sufficient even for
comparisons to act reasonably (assuming constants are also computed in
80 bits, to prevent 1.0/3.0 from not equalling x/y after x=1.0;
y=3.0).

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86
@ 1998-12-15  1:45 Geert Bosch
  1998-12-15  3:34 ` Harvey J. Stein
                   ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Geert Bosch @ 1998-12-15  1:45 UTC (permalink / raw)
  To: Harvey J. Stein, moshier; +Cc: egcs, hjstein, tprince

On 14 Dec 1998 11:51:23 +0200, Harvey J. Stein wrote:

  Reasonable floating point code should expect that reordering
  operations will produce slightly different results due to round off
  error, and should be tolerant of the optimizer doing such.  Especially
  given how little control the programmer has over exactly how
  computations are ordered.

Many useful fpt algorithms rely on ordering of operations to be honored, 
and a compiler evaluating  B + (A - B) as (B + A) - B or even as A 
is seriously broken for numerical stuff.

Having spills to memory retain full precision is very useful as this allows
one to prove much more about fpt code. Here is an example of what I mean,
using a decimal fpt type with 4 digits for extended precision in registers 
and 3 for the in-memory precision of a variable. (Examples using binary
64-bit and 80-bit fpt types are similar but harder to read.)

Calculate  S = (10.0 + 0.454) - (0.454 + 10.0), spilling one partial sum to T.

    Case 1)                Case 2)                  Case 3)

    10.0                   10.0                     10.0
     0.454 +                0.454 +                  0.454 +
    -----                  ------                   -----
    10.5                   10.45                    10.45

T = 10.5               T = 10.45                T = 10.4
                0.454                  0.454                    0.454
               10.0  +                10.00  +                 10.00  +
               -----                  ------                   ------
    10.5  - << 10.5        10.45 - << 10.45         10.45 - << 10.45
    -----                  -----                    -----
S =  0.00              S =  0.00                S = -0.05

Case 1 does not use extended registers and rounds at every operation.
  This is completely IEEE conformant behaviour.

Case 2 uses extended registers and same precision for spilled value.
  This is not IEEE-conformant, but guarantees consistent rounding behaviour.
  In particular the relative error is never more than that of case 1.
  For most algorithms this will work fine, double rounding will only
  occur on the final assignment. This is not ideal, but now worst-case
  is one double rounding per statement instead of one per operation.
  If assignments are forced to go to memory (using volatile var's for example),
  fpt behaviour is independent of optimization level. 

Case 3 uses extended registers, but lower precision for spilled value.
  This is the worst case and is what is causing problems right now.
  The intermediate values while evaluating the expression may be subject 
  to double rounding errors. People who care about right answers often 
  turn off optimization, but ironically this makes the problems only worse! 

Regards,
   Geert

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~1998-12-20 11:28 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1998-12-13 18:23 FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86 Stephen L Moshier
1998-12-14  1:52 ` Harvey J. Stein
1998-12-14 14:56   ` Edward Jason Riedy
1998-12-14 17:20     ` Joe Buck
1998-12-14 18:51       ` Edward Jason Riedy
1998-12-14 21:54         ` Craig Burley
1998-12-15 14:31           ` Edward Jason Riedy
1998-12-15 17:11         ` Jamie Lokier
1998-12-16  0:26           ` Harvey J. Stein
1998-12-16  9:33             ` Craig Burley
1998-12-16 12:18               ` Harvey J. Stein
1998-12-16  9:38           ` Craig Burley
1998-12-16 12:25           ` Marc Lehmann
1998-12-16 12:50             ` Tim Hollebeek
1998-12-16 13:04               ` Harvey J. Stein
1998-12-16 14:01               ` Marc Lehmann
1998-12-17 11:26                 ` Dave Love
1998-12-17 15:06                   ` Marc Lehmann
1998-12-18 12:50                     ` Dave Love
1998-12-19 14:09                       ` Marc Lehmann
1998-12-20 11:28                         ` Dave Love
1998-12-20 11:24               ` Dave Love
1998-12-16 23:11           ` Joern Rennecke
1998-12-17  6:07             ` Jamie Lokier
1998-12-14 22:54       ` Craig Burley
1998-12-15  1:45 Geert Bosch
1998-12-15  3:34 ` Harvey J. Stein
1998-12-16 10:36   ` Craig Burley
1998-12-16 12:47     ` Harvey J. Stein
1998-12-17 10:22       ` Craig Burley
1998-12-17 14:54         ` Marc Lehmann
1998-12-19  0:27           ` Craig Burley
1998-12-19  5:06             ` Stephen L Moshier
1998-12-15  6:43 ` Stephen L Moshier
1998-12-16 10:14   ` Craig Burley
1998-12-15  9:29 ` Joe Buck
1998-12-15 10:14   ` Jeffrey A Law
1998-12-16  8:32     ` Sylvain Pion
1998-12-16  9:20       ` Craig Burley
1998-12-15 12:10 Geert Bosch
1998-12-15 13:09 ` Jeffrey A Law
1998-12-15 12:24 Toon Moene
1998-12-15 12:55 ` Joe Buck
1998-12-15 15:05 ` Edward Jason Riedy
1998-12-16 10:05 ` Craig Burley
1998-12-16 13:52 Toon Moene
1998-12-17 10:06 ` Craig Burley
1998-12-17 12:16   ` Harvey J. Stein
1998-12-19  0:29     ` Craig Burley
1998-12-17 11:20 ` Dave Love
1998-12-17 11:27 Brad Lucier
1998-12-17 14:51 ` Marc Lehmann
1998-12-19  0:17   ` Craig Burley
1998-12-19  6:42     ` Emil Hallin
1998-12-19 14:26       ` Dave Love
1998-12-17 14:37 tprince
1998-12-17 15:15 ` Stephen L Moshier
1998-12-17 14:38 Toon Moene
1998-12-17 15:30 ` Harvey J. Stein
1998-12-18  1:54   ` Toon Moene
1998-12-18  3:05     ` Harvey J. Stein
1998-12-18  9:01       ` Toon Moene
1998-12-18 15:59       ` Richard Henderson
1998-12-18 13:26   ` Marc Lehmann
1998-12-18 12:50 ` Dave Love

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).