public inbox for gcc@gcc.gnu.org
* Re: Floating-point Consistency, -ffloat-store, and x86 (mostly)
@ 1998-12-17  1:30 N8TM
  0 siblings, 0 replies; 6+ messages in thread
From: N8TM @ 1998-12-17  1:30 UTC (permalink / raw)
  To: burley; +Cc: hjstein, egcs

In a message dated 12/16/98 11:38:23 AM Pacific Standard Time, burley@gnu.org
writes:

<< Is lf90 spilling 80-bit values resulting from operations on 32-bit
 operands (originally, anyway) as 32, 64, or 80 bits? >>
I don't have tools to examine lf90; lf95 has adopted a more standard format
for .obj which works with objdump.  I'll do some checking on real code to
compare spills between g77 and lf95.  Unfortunately, the linker misalignments
haven't been corrected in lf95.  I'm not sure that I'll like the model of lf95
spilling, which may well be different from lf90, but at least I can report
some indication of how much spilling occurs in a real situation.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Floating-point Consistency, -ffloat-store, and x86 (mostly)
  1998-12-15  0:20 N8TM
@ 1998-12-16 10:37 ` Craig Burley
  0 siblings, 0 replies; 6+ messages in thread
From: Craig Burley @ 1998-12-16 10:37 UTC (permalink / raw)
  To: N8TM; +Cc: burley

>Switching from 32-bit spills to improperly aligned 80-bit spills will bring a
>noticeable reduction in performance.  Lahey just admitted the error of their
>ways in misaligning COMMON blocks in recent compiler versions.  Some of the
>Livermore Kernels suddenly took as much as 6 times as long to run, due to
>misalignments, but today's lf90 version is back to being generally faster
>than g77.  Intel specifically recommends 128-bit-aligned storage for 80-bit
>quantities.

Is lf90 spilling 80-bit values resulting from operations on 32-bit
operands (originally, anyway) as 32, 64, or 80 bits?  I don't think
we even do much spilling of 32-bit values, except for function return
values, but it'd be worth knowing what the competition is up to
before we make any final decisions!

        tq vm, (burley)


* Re: Floating-point Consistency, -ffloat-store, and x86 (mostly)
@ 1998-12-15  0:20 N8TM
  1998-12-16 10:37 ` Craig Burley
  0 siblings, 1 reply; 6+ messages in thread
From: N8TM @ 1998-12-15  0:20 UTC (permalink / raw)
  To: burley, hjstein; +Cc: egcs

In a message dated 12/14/98 11:57:20 PM Pacific Standard Time, burley@gnu.org
writes:

<< Either the FPU must be switched back and forth
 between the two modes (32 and 64), which is, apparently, dog slow
 on current chips, or we get right back into the double-rounding
 and exponent-range problems, >>

There isn't a double-rounding problem with doing single precision in the
double-precision rounding mode, at least not for single-precision
addition, subtraction, and multiplication, as there is no round-off when
a double-precision result is obtained from a single-precision operation.

<< switching to 80-bit spills seems least likely
to hurt anyone, to noticeably hurt performance >>

Switching from 32-bit spills to improperly aligned 80-bit spills will bring a
noticeable reduction in performance.  Lahey just admitted the error of their
ways in misaligning COMMON blocks in recent compiler versions.  Some of the
Livermore Kernels suddenly took as much as 6 times as long to run, due to
misalignments, but today's lf90 version is back to being generally faster
than g77.  Intel specifically recommends 128-bit-aligned storage for 80-bit
quantities.


* Re: Floating-point Consistency, -ffloat-store, and x86 (mostly)
  1998-12-14 12:29 ` Harvey J. Stein
@ 1998-12-14 23:49   ` Craig Burley
  0 siblings, 0 replies; 6+ messages in thread
From: Craig Burley @ 1998-12-14 23:49 UTC (permalink / raw)
  To: hjstein; +Cc: burley

>However, I don't understand why you claim that leaving the CPU in 64
>bit mode would be problematic.

Maybe it wouldn't, but I'm quite pessimistic about any change that
makes the CPU behave differently from what other programmers expect:
programmers who, through some means, created .o and .a files to which
newly compiled modules (the ones that set the CPU to the different
behavior) link before invoking the code in them.

>It would seem to me that the only code that would appropriately rely
>on the registers being extended precision would have to be either hand
>written assembler or *extremely* carefully written code in a higher
>level language.  I'd think that anyone capable of doing this would
>also be cognizant of the issues involved & quite capable of saving,
>setting & restoring the FPU control word appropriately.

If we were designing a new architecture from scratch, along with the
whole toolchain (assemblers, libraries, linkers, object-file formats,
and so on), I'd be less concerned, and just work hard to make sure
we made all the appropriate facilities available and documented the
bajeebers out of the issues.

But, keep in mind that, as someone else pointed out, there are
plenty of people *already* happy with the present situation, or
at least (as I worry :) who *think* they're happy with it.

That is, their code uses 80-bit precision, and they either don't
care about, haven't experienced, or haven't yet noticed ill effects
from any 64-bit spills of 80-bit values.

Now, for those who expected that 80-bit precision, which they got
for "free" on x86 machines without having to make any changes
(probably) in their code...

...what do you think they'll say when, upon using the "new gcc"
that sets the FPU to 64-bit mode, their code suddenly goes into
64-bit mode -- even their code that was compiled with the "old gcc"?

I think they'll be really upset.

To me, this is kind of like the issue of 64-bit double alignment
on the stack.  (To do it right, we need cooperation from crtN
through main() all the way down the call chain, at least to the
leaves, defined as procedures that don't call any FP-using
procedures.  That is, making sure the last several bits of the
frame pointer are "universally" zero is basically an exercise similar to
making sure the FP mode is universally 64-bit, in that it requires
some mixture of cooperation and defensiveness during code
generation to get it always right.)

Only difference is, mix up .o's, .a's, and different compiler
invocations, and with the 64-bit double alignment issue you get
flaky timing...

...whereas, with the 64-bit-mode FPU issue, you get flaky numerics.

Given that I'm still a bit worried about the alignment/performance
issue, you can understand why I simply cannot propose, or endorse
any proposal, that we simply decide the FPU on x86 machines (and
similar) be placed in 64-bit mode and hand-wave whatever problems
that might expose down the road.

>So, I don't see why you're claiming that the FPU control word would
>have to be fiddled on the fly according to embedded object code
>notations.

See the above.  In short, each snippet of code, at no more than a
procedure-level boundary, has its own requirements, or lack thereof,
regarding the precision of the FPU.

AFAIK, for *all* existing x86 code, once it reaches the .s, .o, .a,
.so, or executable stage, there is *no way* to recover those
requirements, which I believe can currently exist only as commentary
in the code, if we're even that lucky.

We can't assume "prevailing mode acceptable" is going to work for
all existing code out there, and the danger gets worse as the
percentage for which it doesn't goes *down*, until it reaches zero,
because the lower that percentage, the more likely people are to
trust an executable that links in the rare code that expected
80-bit mode, on the grounds that "this has always worked before",
and thus not be suspicious of what turn out to be wrong answers.

>Also, you said that leaving the FPU in 64 bit mode isn't a full
>solution because of double rounding.  It seems to me that the double
>rounding is only an issue near the edges of the range of a double, and
>as such is also a small problem.

I've seen that claimed by others discussing this issue vis-a-vis
Java and x86 numerics, but IIRC it isn't 64-bit-mode that does the
double rounding, it's 80-bit-mode with continual 64-bit store/reload.
At least, I think that's what they were talking about.

I *think* 64-bit mode doesn't suffer from double rounding.  I think
what it suffers from is greater exponent range than IEEE 64-bit doubles
really provide, until (of course) spills to memory happen.  We
should definitely study these alternatives, with an eye to determining
which of them the rest of the industry will find acceptable, before
assuming we can just set the FPU to whatever mode we like.

>After all, one thing that people
>expect from floating point math is trailing garbage, and this would
>just be another example - another check on the paranoia tests.

That might be right, but it'll worry me until the numerical big-wigs
come out and say it.  And, again, I think it's only a problem for
-ffloat-store-like store/reloads (FPU in 80-bit mode).

>I think the biggest problem is the places where the underlying
>extended precision registers are exposed to the programmer.

That, and 64-bit-mode FPU doesn't sound promising vis-a-vis
32-bit operations.  Either the FPU must be switched back and forth
between the two modes (32 and 64), which is, apparently, dog slow
on current chips, or we get right back into the double-rounding
and exponent-range problems, and perhaps the spill-with-chopping
problem as well.

That is, putting the FPU into 64-bit mode, AFAICT, gives us an
enticingly "cheap" way to say "see, now all those 64-bit calculations
will be nearly strict IEEE", but will have done basically nothing
for the 32-bit calculations, which I think will still be a problem
(as compared to other machines -- at least some of them, anyway,
do real IEEE 32-bit operations, I'm pretty sure).

>There's no way for a programmer in a higher level language to
>currently get at the values in the extended precision registers, nor
>is there a way to explicitly use them.  However, their existence is
>exposed by comparison operations.  This is what leads to all sorts of
>insanities such as 1.0/3.0 != 1.0/3.0 (when done appropriately with
>variables).

Yup.  Though some compilers (some of Digital's, I think) support
explicit extended-type declarations (REAL*10 in Fortran, as I think
I posted about already last week or so), "we" (gcc) don't yet.

>This is why I think the best thing would be to just default the FPU to
>double precision (not extended precision) mode, or at least make it
>easily settable.  It removes the difference between comparing
>registers & memory, and makes numerics as register vs memory usage
>independent as they can be on an ix86 in a higher level language.

I agree we should make it easily settable, but I think we'd need
to document that it can have undesirable (or at least unpredictable)
effects on existing codes (especially already-compiled codes).

>Although spilling in extended precision would make things somewhat
>more consistent, I don't think it'd be consistent enough to really
>help - it'd still mostly leave the comparison problems.

So would 64-bit-mode FPU, I think, when it comes to 32-bit comparisons.

And, remember, Fortran has "real" 32-bit-FP handling compared to C,
in the sense that the Fortran community has long had reliable
declaration and use of 32-bit FP, something not true for the C
community (which long had to assume 64-bit FP would get used,
regardless of what was declared explicitly or otherwise).

So you can't just hand-wave 32-bit FP processing by saying "well,
most C programmers won't really care if 32-bit FP values are
processed with 64-bit precision", even though you'd probably be right,
because many *Fortran* programmers *would* care.

>It'd also
>still leave the uncertainties of when extended precision gets used
>because the compiler is still free to decide when and how things move
>into registers.

The jury is still out on that, and I feel the problems cannot be
completely addressed (with any adequate performance ratio) either
way.  But I think the numerical community will, at least for a time,
prefer computing and spilling 80-bit values, as a default on machinery
that prefers 80-bit FP computation.

IMO doing 80-bit spills of stuff we already calculate in 80 bits
poses the least risk and least performance drop.

And, the benefit I like most about it, aside from it just making sense
(assuming we don't go to 64-bit-mode FPU, i.e. my "if your registers
are 80-bit, you'd better spill them to 80-bit memories" argument :),
is that, if we do this, we offer the industry a "benchmark" of sorts
to at least thrash out what the *actual* effects of widespread, easily
available, more-predictable 80-bit computing of intermediate results
will be.  This benchmark would be in the form of the gcc compiler
suite.

It might also give some numerical shops more reason to make g77/gcc/g++
their "default" compiler.

>I guess the only problem with:
>
>   __setfpucw ((_FPU_DEFAULT & ~_FPU_EXTENDED) | _FPU_DOUBLE);
>
>really is a portability issue.  Maybe I'll try it on my code where
>I've had to resort to -ffloat-store and see if it helps.

It should, assuming you're using 64-bit FP pretty much exclusively
(little or no pertinent 32-bit FP).

And the portability issue should not, IMO, be a reason to not do what
you're proposing.  If we decide it'll be better for the numerical
community as a whole to go with 64-bit-mode FPU, then it's really
just a matter of deciding *how* to do it (set it once and for all
in crtN or main()? set it at the beginning of any procedure that
uses FP? re-set it after any call to a "mysterious" procedure? save
it whenever setting, and maybe re-setting, it, to restore it at
any procedure return?) at a design level.

Once we decide that, then I think implementation, albeit hairy,
should not get in our way of making the right decision.

I just don't think this is the right direction, and I'd sure like
to see a lot more hairy numerical types agree with it first before
we spend much more time discussing it, or (especially) talking about
how it just won't bother anyone important if we go ahead and do
it (because I think such talk would be wrong).  (By "hairy" I don't
mean they have to have beards, or be men; they just have to email
as if they did.)

In the meantime, switching to 80-bit spills seems least likely
to hurt anyone, to noticeably hurt performance, etc., though I
still don't expect it's an easy thing to do (else I'd have submitted
a patch by now :).

        tq vm, (burley)


* Re: Floating-point Consistency, -ffloat-store, and x86 (mostly)
  1998-12-01 21:58 Craig Burley
@ 1998-12-14 12:29 ` Harvey J. Stein
  1998-12-14 23:49   ` Craig Burley
  0 siblings, 1 reply; 6+ messages in thread
From: Harvey J. Stein @ 1998-12-14 12:29 UTC (permalink / raw)
  To: Craig Burley; +Cc: hjstein

Unfortunately, I read your original post after I sent my other msg.
Here are some comments & questions.

I see your point about how doing 80 bit spills would help to reduce
numerical uncertainty.

However, I don't understand why you claim that leaving the CPU in 64
bit mode would be problematic.

It would seem to me that the only code that would appropriately rely
on the registers being extended precision would have to be either hand
written assembler or *extremely* carefully written code in a higher
level language.  I'd think that anyone capable of doing this would
also be cognizant of the issues involved & quite capable of saving,
setting & restoring the FPU control word appropriately.

So, I don't see why you're claiming that the FPU control word would
have to be fiddled on the fly according to embedded object code
notations.

Also, you said that leaving the FPU in 64 bit mode isn't a full
solution because of double rounding.  It seems to me that the double
rounding is only an issue near the edges of the range of a double, and
as such is also a small problem.  After all, one thing that people
expect from floating point math is trailing garbage, and this would
just be another example - another check on the paranoia tests.

I think the biggest problem is the places where the underlying
extended precision registers are exposed to the programmer.

There's no way for a programmer in a higher level language to
currently get at the values in the extended precision registers, nor
is there a way to explicitly use them.  However, their existence is
exposed by comparison operations.  This is what leads to all sorts of
insanities such as 1.0/3.0 != 1.0/3.0 (when done appropriately with
variables).

This is why I think the best thing would be to just default the FPU to
double precision (not extended precision) mode, or at least make it
easily settable.  It removes the difference between comparing
registers & memory, and makes numerics as register vs memory usage
independent as they can be on an ix86 in a higher level language.

Although spilling in extended precision would make things somewhat
more consistent, I don't think it'd be consistent enough to really
help - it'd still mostly leave the comparison problems.  It'd also
still leave the uncertainties of when extended precision gets used
because the compiler is still free to decide when and how things move
into registers.

I guess the only problem with:

   __setfpucw ((_FPU_DEFAULT & ~_FPU_EXTENDED) | _FPU_DOUBLE);

really is a portability issue.  Maybe I'll try it on my code where
I've had to resort to -ffloat-store and see if it helps.

-- 
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il


* Floating-point Consistency, -ffloat-store, and x86 (mostly)
@ 1998-12-01 21:58 Craig Burley
  1998-12-14 12:29 ` Harvey J. Stein
  0 siblings, 1 reply; 6+ messages in thread
From: Craig Burley @ 1998-12-01 21:58 UTC (permalink / raw)
  To: egcs; +Cc: burley

Although I've been busy trying to upgrade/fix my various computers
(a CD-ROM drive no longer works, most of the time, under recent
versions of Linux, for example, which is why the computer I'm typing
this on is mostly taken apart), I've been aware of some discussions
on this list about floating-point issues, especially regarding the x86
architecture.

On my list of things to do, for a long time, is study the issues so
I can perhaps figure out, and/or speak with authority regarding,
what to do vis-a-vis g77 and egcs.  But that hasn't happened yet.

One thing I've begun to gather, from reading comp.compilers and comp.arch
posts on the relevant topics, is that the x86 architecture (and hardware)
often isn't the only culprit leading to poor FP behavior (accuracy,
precision, performance, etc.).

Yes, the x86 basically makes it impossible to provide exact IEEE
behavior at the "expected" performance for a typical implementation
(due to things like excess exponent range in temporaries, even when
the FP unit is put in "double" mode).  And, when left in the default
mode, the x86 not only isn't close to exact IEEE behavior, it isn't
close to what many FP programmers expect (e.g. intermediate results
and maybe optimized-away variables get calculated with, *sometimes*,
more precision than explicitly declared and/or expected by the
programmer).

But, the impression I've gotten is that *software*, mainly compilers,
can be the culprit leading to problems, even in code designed to
cope with (or, perhaps, expect) the additional precision that is
used for intermediate calculations on the x86.

And, my impression is that the "first solution" is to ensure that
any FP temporaries are spilled only to *like-sized* memory locations.

In other words, the default for x86 code generation should apparently
be that, when the compiler generates an intermediate result, it
*always* uses maximum available precision for that result, even if
it has to spill the result to memory.  (I *think* it can do this while
obeying the current FP mode, but don't have time to check right now.)

Here's an example that illustrates some of the key points:

  double a, b, c, d, e;
  ...
  e = a*b + c*d;

The -ffloat-store option controls whether the store into `e' includes
a "chopping" (truncation or, usually, rounding) to the exact type
specified -- in this case, 64-bit IEEE (double) format.

The intermediate calculations -- t1=a*b, t2=c*d, and t1+t2 -- are
expressed here as computed in an unspecified precision and stored into
an undeclared temporary of the same precision, for the purposes of
discussion.

Typically, t1 and t2 end up as extended precision on the x86, because
the code just uses the prevailing FP mode (which is set to "extended",
normally) and the temporaries themselves reside on the x86 FP stack,
which accommodates extended-precision results.  These are, I believe,
80-bit results (though when stored to memory, 96 bits are written),
though those specifics are not pertinent to my points here.

The problem seems to be that, sometimes, t1 and/or t2 end up as
*double* precision, because the compiler realizes it can't keep
them on the stack while doing other pending calculations (assuming
other stuff going on that isn't in the example).

In such a case, either (say) t1=a*b is computed entirely in double
precision, requiring chopping of `a' and `b' into 64-bit doubles
(think t3=(double)a, t4=(double)b, t1=t3*t4) or, more likely (on
the x86 anyway), a*b is computed in extended precision but the
result is chopped down into double when t1 is spilled to memory.

In that more-likely latter case, at least, the problem is not just
that rounding happens *twice*, but that it happens without *any*
reasonable expectation, predictability, or control on the part of
the programmer who wrote the code.

I think there are a *lot* of issues in this area that we someday
need to address -- the Java/Kahan discussions probably raise most,
at least, of these.

But, at least, I think we could make egcs take a big step forward
towards providing *some* degree of predictability by doing the following:

  Ensure that any spills of intermediate calculations are done
  in such a way that the *complete* contents of the register being
  spilled are preserved and, later, restored.

Right now, for a sample case I (laboriously ;-) made up, it seems
that the above is not currently being done.  So, when egcs (g77)
decides it has to spill an FP register to make room for more
intermediate calculations, it chops it into a 64-bit double, which
leads to double roundings and other numerically suspicious things
(again, especially a problem when the programmer has no control
over this).

(If anyone wants to study this example of mine, let me know and I'll
put it up for ftp and email everyone the pointer, at which point
stop letting me know.  :)

Therefore, I propose that the above be implemented in egcs in time
for some targeted, well-publicized release (e.g. 1.3, or 2.0), and
that the current behavior (spilling to chopped registers) be obtainable
via an option such as `-ffloat-chop-spills', with `-fno-float-chop-spills'
being the default.

What I don't know is how hard this would be to do, and whether any
targets other than the x86 are affected.  I don't think performance
will be very adversely affected, but in cases where it is, people
should investigate whether their code really wanted those chopped
spills in the first place.  (The performance issue seems to be
primarily made up of the extra space allocated for spilled
temporaries on the stack, plus the extra bytes stored to and read
back from memory as part of the spill activity.  But, maybe
doing this would also require changing how intermediate operations
are performed, and perhaps that would reduce performance as well.)

Again, keep in mind I propose the above *without* having done the
pertinent research.  I wouldn't propose it except that it seems
that there are some top-level design decisions being made about some
kind of rewrite of the x86 machine description, and this proposal
represents what, given the reading I *have* done, I'm pretty sure
would be worth implementing, at least as an option but, preferably,
as the default.

I'd appreciate it if others would research the issues and see if I've
got them right (check into the Java FP issues and the sorts of things
Kahan and others write about).  I probably won't get to this myself
for some time (several months, perhaps), unless funded to do so.

Pending a conclusion that my proposal should be rejected, however,
I'd suggest any x86 rewrite be architected to accommodate the above,
e.g. make sure the back end knows that FP registers are, when spilled
to memory, up to 96 bits wide (or whatever), and so on.

        tq vm, (burley)

