public inbox for gcc@gcc.gnu.org
* FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86 (MOS
@ 1998-12-02 18:34 tprince
  1998-12-02 18:34 ` Craig Burley
  0 siblings, 1 reply; 2+ messages in thread
From: tprince @ 1998-12-02 18:34 UTC (permalink / raw)
  To: burley, egcs

>>"And, my impression is that the "first solution" is to ensure that
any FP temporaries are spilled only to *like-sized* memory
locations."

At least, double-precision spills are desirable even when all the
variables are single precision.  This could apply to some of the
PowerPC ports as well.

>>"In other words, the default for x86 code generation should
apparently
be that, when the compiler generates an intermediate result, it
*always* uses maximum available precision for that result, even
if
it has to spill the result to memory.  (I *think* it can do this while
obeying the current FP mode, but don't have time to check right
now.)

Here's an example that illustrates some of the key points:

  double a, b, c, d, e;
  ...
  e = a*b + c*d;

The -ffloat-store option controls whether the store into `e'
includes
a "chopping" (truncation or, usually, rounding) to the exact type
specified -- in this case, 64-bit IEEE (double) format."
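
(To make the stakes concrete, here is a small C case -- my own
sketch, whose behavior depends on target, options, and optimization.
On an x87, without -ffloat-store, it can report a difference,
because the stored `x' has been chopped to 64 bits while the
comparison recomputes the quotient in an 80-bit register:)

  #include <stdio.h>

  int main (void)
  {
    volatile double a = 1.0, b = 3.0;  /* volatile defeats folding */
    double x = a / b;  /* the store may chop the quotient to 64 bits */
    if (x != a / b)    /* recomputed here in extended precision */
      printf ("stored and register values differ\n");
    else
      printf ("consistent\n");
    return 0;
  }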

In the case where e is used in a subsequent calculation, we
don't want to force a store and reload unless -ffloat-store is
invoked.  But I'm not sure you can always apply the same rules to
storage to a named variable (it might be stored in a structure or
COMMON block) as to register spills, which aren't visible in the
source code.  We might even have a rare case where we are
doing some kind of remaindering, where we need to use the
form which is rounded to the declared format.  This is a more
difficult question to solve and I'm confused about what
connection you are making between that and the spilled
temporaries.
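
A sketch of the remaindering case I mean, in C:

  double a, b, q, r;
  ...
  q = a / b;      /* q must be exactly the 64-bit rounded quotient... */
  r = a - q * b;  /* ...or r is not the residual of the q we stored */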

>>"The intermediate calculations -- t1=a*b, t2=c*d, and t1+t2 -- are
expressed here as computed in an unspecified precision and stored
into an undeclared temporary of the same precision, for the purposes
of discussion.

Typically, t1 and t2 end up as extended precision on the x86,
because the code just uses the prevailing FP mode (which is set to
"extended", normally) and the temporaries themselves reside on the
x86 FP stack, which accommodates extended-precision results.  These
are, I believe, 80-bit results (though when stored to memory, 96
bits are written), though those specifics are not pertinent to my
points here."

I suspect the 96 bits must be written to a 128-bit aligned storage
location to minimize the performance hit.
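
(The padding is easy to inspect -- a one-line check; the exact
figure depends on the compiler and its flags:)

  #include <stdio.h>

  int main (void)
  {
    /* gcc on x86 typically pads the 10-byte extended format to 12
       bytes (96 bits); wider alignment trades space for speed.  */
    printf ("sizeof (long double) = %d\n", (int) sizeof (long double));
    return 0;
  }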

>>"The problem seems to be that, sometimes, t1 and/or t2 end
up as
*double* precision, because the compiler realizes it can't keep
them on the stack while doing other pending calculations
(assuming
other stuff going on that isn't in the example).

In such a case, either (say) t1=a*b is computed entirely in double
precision, requiring chopping of `a' and `b' into 64-bit doubles
(think t3=(double)a, t4=(double)b, t1=t3*t4) or, more likely (on
the x86 anyway), a*b is computed in extended precision but the
result is chopped down into double when t1 is spilled to
memory.

In that more-likely latter case, at least, the problem is not just
that rounding happens *twice*, but that it happens without *any*
reasonable expectation, predictability, or control on the part of
the programmer who wrote the code."
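
(Spelling those two paths out in C-ish terms -- my sketch, with
`long double' standing in for an 80-bit x87 register:)

  double a, b, t1;
  long double t;

  /* Path 1: operands chopped first, product computed in double --
     a single rounding, to 64 bits.  */
  t1 = (double) a * (double) b;

  /* Path 2, more likely on the x86: product in extended precision,
     then chopped when t1 is spilled -- rounded twice.  */
  t  = (long double) a * (long double) b;  /* round to 80 bits   */
  t1 = (double) t;                         /* round again, to 64 */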

"But, at least, I think we could make egcs take a big step forward
towards providing *some* degree of predictability by doing the
following:

  Ensure that any spills of intermediate calculations are done
  in such a way that the *complete* contents of the register being
  spilled are preserved and, later, restored.

Right now, for a sample case I (laboriously ;-) made up, it seems
that the above is not currently being done.  So, when egcs (g77)
decides it has to spill an FP register to make room for more
intermediate calculations, it chops it into a 64-bit double, which
leads to double roundings and other numerically suspicious
things
(again, especially a problem when the programmer has no
control
over this)."

Including the possibility that the results change as a side effect
of changes in optimization or nearby code.


>>        tq vm, (burley)

If someone does manage to implement this, I would like to study
the effect on the complex math functions of libF77, using Cody's
CELEFUNT test suite.  I have demonstrated already that the
extended double facility shows to good advantage in the double
complex functions.  The single complex functions already
accomplish what we are talking about by using double
declarations for all locals, and that gives them a big advantage
over certain vendors' libraries.

I know I'll arouse the ire of some people again when I point out
that CELEFUNT shows only 4 decimal places of accuracy from some of
the functions when everything is done purely in single precision,
and that a few of the tests show the full gain of 11 bits of
accuracy going from pure 53-bit double precision to extended double.

But if there are people who really depend on the complex math
functions and can show a good reason why they should not be
more accurate, we'll have to pay attention to that.  I don't consider
emulating one particular vendor's results a sufficient
consideration, when they all differ from each other, and we have
a recognized test suite as an arbitrator.



Dr. Timothy C. Prince
Consulting Engineer
Solar Turbines, a Caterpillar Company
alternate e-mail: tprince@computer.org


* Re: FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86 (MOS
  1998-12-02 18:34 FWD: FLOATING-POINT CONSISTENCY, -FFLOAT-STORE, AND X86 (MOS tprince
@ 1998-12-02 18:34 ` Craig Burley
  0 siblings, 0 replies; 2+ messages in thread
From: Craig Burley @ 1998-12-02 18:34 UTC (permalink / raw)
  To: tprince; +Cc: burley

[I had a bit of trouble following the quoting scheme you used, so I'll
try to clean it up for others and, perhaps, to help you and others discover
where I might have misread your email!]

>>And, my impression is that the "first solution" is to ensure that
>>any FP temporaries are spilled only to *like-sized* memory
>>locations.
>
>At least, double-precision spills are desirable even when all the
>variables are single precision.  This could apply to some of the
>Power-PC ports as well.

I'm not sure about the details here.  On the x86, my impression is
that the loads/stores involving the variables themselves would be
single-precision, but the operations are done in, or produce results
in, extended (80-bit) precision.  These should, according to my
proposal, be *spilled* as 80-bit, not 64-bit or 32-bit, values,
though when written to destinations (user-named variables), they'd
then (normally) be chopped down to size, per -ffloat-store and
what-not.
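
To sketch that in C (my example, not from the quoted mail):

  float a, b, c, d, e;   /* 32-bit loads and stores ...             */
  e = a * b + c * d;     /* ... but the x87 products and sum are
                            80-bit; under my proposal, any spill of
                            those intermediates would be 80-bit too */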

>>In other words, the default for x86 code generation should
>>apparently be that, when the compiler generates an intermediate result,
>>it *always* uses maximum available precision for that result, even
>>if it has to spill the result to memory.  (I *think* it can do this while
>>obeying the current FP mode, but don't have time to check right
>>now.)
>>[...]
>
>In the case where e is used in a subsequent calculation, we
>don't want to force a store and reload unless -ffloat-store is
>invoked.

Correct, AFAIK.

>But I'm not sure you can always apply the same rules to
>storage to a named variable (it might be stored in a structure or
>COMMON block) as to register spills, which aren't visible in the
>source code.

No, I don't think you can, and that's what my proposal and email
were trying to clarify (less than successfully, I gather!).

That is, I was trying to focus my proposal on only the compiler-
generated temporaries that get spilled and chopped down to "size"
at the same time.

>We might even have a rare case where we are
>doing some kind of remaindering, where we need to use the
>form which is rounded to the declared format.

I'm not sure offhand what this means, but if you mean the programmer
would like a way to specify just when a store to a variable *must*
be done "chopped down" to the (perhaps implicitly) declared precision
of that variable, I agree.  That's a language issue, though, and must
be solved by each front end in its own way; the back end, for its
part, should provide more fine-grained control of that sort of thing
than -ffloat-store (and the first front end to provide this feature
will probably have to drive that back-end improvement).

>This is a more
>difficult question to solve and I'm confused about what
>connection you are making between that and the spilled
>temporaries.

In my proposal, essentially none, except that it used to confuse me,
and I believe it still confuses others, that there are pretty bright-
line distinctions between compiler-generated temporaries and user-named
variables, in terms of precisions the compiler is, or should be,
permitted to employ for each class.  (But not all the distinctions
are so clear, it seems.)

In a nutshell (apologies to ora.com): user-named variables are, in most
languages, implicitly or explicitly declared to have a particular
precision, while compiler-generated temporaries, used to hold intermediate
results, normally have the implicit precision of the operations that
produced those results.

With user-named variables, it is sometimes helpful or hurtful, but
normally permitted, for the compiler to employ more than the declared
precision, some or all of the time.  The back end supports the
-ffloat-store option to disable this (though the option is somewhat
misnamed in this sense, but that's more of a packaging issue than
a numerics/code-generation one).

With compiler-generated temporaries, it is, again, helpful or hurtful,
and normally permitted, for the compiler to employ *more* than the
implicit precision of the operation, but the problem with the gcc
back end, on the x86 at least, is that it (apparently) sometimes
employs *less*, specifically, when spilling those temporaries.  (That
is, when the temporary needs to be copied from the register in which
it "lives" to a memory location, the gcc back end apparently is
happy to chop the temporary down to fit into a smaller memory location.)

My proposal deals only with this latter deficiency (as I now think it
is), that is, it recommends that precision *reduction* of compiler-
generated temporaries no longer happen (at least not by default).

The other deficiencies raised above, then, are as follows, including
the one I'm addressing with my proposal:

  -  The compiler sometimes uses more precision than the programmer
     asked for when optimizing stores and loads of programmer-named
     variables.  When this hurts, but never helps, an entire source-file
     module, the programmer can specify -ffloat-store during the
     compilation.  But, there should be language-specific ways provided
     to specify which variables, perhaps even which *stores* (assignments)
     to variables, must be done in the exact precision, and which may
     (or must?) be done in the prevailing extended precision.

     (This is the case where `c = a * b; e = c * d;', with all variables
     declared as 64-bit, is optimized to `e = a * b * d;' with all
     multiplies done in extended, e.g. 80-bit, precision.  The -ffloat-store
     option prevents this from happening, in effect by ensuring that the
     `a * b' intermediate result is chopped down to 64 bits.)

     This sort of fine-grained control is the lowest priority of these
     issues, IMO, because my impression is that the scientific/numerical
     community either doesn't know exactly what it wants here ("Java" almost
     was a possible answer, but seems to be no longer, if my impressions about
     the arguing over strict IEEE vs. fast x86 execution are correct) or
     isn't really prepared to make use of such fine-grained control.  But
     things should change over time here, one way or another.

  -  The compiler provides no way to "force" available excess precision
     to be reliably used for programmer-named variables anyplace that
     is possible (say, within a module).  Some compilers offer explicit
     extended type declarations (REAL*10 in Fortran; `long double' in C?),
     but g77 doesn't yet.  So, whether a named variable carries the
     (possible) excess precision of its computed value into subsequent
     calculations is at the whim of the compiler's optimization phases.

     (This facility would ensure that, after `c = a * b;', *all* uses
     of `c' employed the extended-precision, not double-precision, result,
     except where the double-precision result is required by the language,
     such as when passing it to an external routine or writing it to a
     file.)

     My impression is that this is not much of a problem, and it should
     thus be a low-priority one to solve.  (Though REAL*16 seems to be
     asked for fairly often.)
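
     (In C terms, such a facility might look like the following
     sketch -- mine, with `long double' standing in for an explicit
     extended declaration like REAL*10.)

       double a, b, d, e;
       long double c;            /* explicit extended declaration    */

       c = (long double) a * b;  /* keep the full 80-bit product ... */
       e = (double) (c * d);     /* ... rounding to double only where
                                    the language requires it, e.g.
                                    at this final store              */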

  -  The compiler sometimes uses more precision than the programmer expected
     for intermediate results, said expectations being the product of the
     implicit or explicit declarations applying to the operands of the
     computed operations.  The programmer might be able to work
     around this somewhat by fiddling with processor "internals" such
     as the current FP format for operations, but there's no compiler
     option to prevent this, generally, from happening.

     (This is the case where `d = a * b * c;' uses more than 64 bits
     for the first or second multiply, such as `a * b'.  The resulting
     excess precision might not be expected by the programmer.)

     This might be the next thing to solve, after the solution I'm
     proposing (repeated below).  It would presumably require something
     akin to a -ffloat-store option, but for temporaries instead of named
     variables (say, -ffloat-store-temps).  In fact, until a year or so
     ago, I thought -ffloat-store applied to *all* computed results, not
     just named variables, and I've noticed other people have made the
     same mistake.
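
     (Pending such an option, one can approximate it by hand -- a
     sketch; -ffloat-store-temps itself does not exist:)

       double a, b, c, d;
       volatile double t1;  /* volatile forces a genuine 64-bit store */

       t1 = a * b;          /* product chopped here, predictably      */
       d  = t1 * c;         /* like `d = a * b * c;' but with the
                               first product rounded to double        */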

  -  The compiler provides no way to "force" available excess precision
     to be reliably used for intermediate computations.  Programmers
     depending on this tendency (say, x86 programmers) might like some
     way to explicitly code-in this dependency.

     (This would ensure that `d = a * b * c;' *will* use extended
     precision for all intermediate calculations.)

     But, I'm not sure there's much interest in this currently, because
     my impression is that it's always true for targets like the x86,
     never true for any others, and making it true for others would
     make for really slow code (in which case, might as well run it
     on an x86 anyway ;-).  This is probably the lowest priority of the
     bunch, therefore.

  -  The compiler provides no way to prevent excess precision used for
     *some* intermediate results from being chopped down to lower
     precision, due to spills.

     (This is the problem where `c = a * b * c;' might or might not use
     extended precision for the `a * b', or whatever is the first multiply,
     depending on its mood -- because it might decide to "spill" that
     first result, and thus chop it down to "size" before using it in
     the second multiply.)

     My impression is that this is the highest-priority problem of all in
     this list, and it's the only thing I'm proposing we fix, in the
     medium term, at this point.
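
     (In source-level terms, the fix says a spill must behave like
     the first copy below, never like the second -- a sketch, with
     `long double' again standing in for the 80-bit register format.)

       long double t;            /* an 80-bit intermediate result    */

       long double spill80 = t;  /* 80-bit store and reload: every
                                    bit of t is preserved            */
       double      spill64 = t;  /* 64-bit store: t is chopped, and
                                    later uses see a double rounding */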

>I suspect the 96 bits must be written to a 128-bit aligned storage
>location to minimize the performance hit.

Probably.  But we're not even at 64-bit aligned storage for stack
variables (which is where spills must happen, for the most part) yet,
and IMO code that requires FP spills, on the x86 anyway, is probably
not going to notice the lack of alignment due to its complexity.

(I'm talking about spills of unnamed, temporary results, not spills
of named variables that are assigned to registers, here.  It might
be an interesting exercise to patch gcc to abort whenever it spills
an FP temporary of this sort, and see how much code we can find that
triggers the crash, to get an idea of how much code would even be
potentially subject to a performance hit due to adopting my
proposal.)

>If someone does manage to implement this, I would like to study
>the effect on the complex math functions of libF77, using Cody's
>CELEFUNT test suite.  I have demonstrated already that the
>extended double facility shows to good advantage in the double
>complex functions.  The single complex functions already
>accomplish what we are talking about by using double
>declarations for all locals, and that gives them a big advantage
>over certain vendors' libraries.

Right now, my impression is that the effect would be nil *unless*
these codes are complicated enough to cause spills of temporaries
in the first place.

But, there is a question I do have about my own proposal.  What
does it, or should it, imply about spills of *named* variables when
they *are* carrying excess precision (more than the user declared)?

That is, assuming -fno-float-store (the default), `c = a * b; e = c * d;'
is permitted to be computed as `e = a * b * d;' using extended
precision on the right-hand side of the assignment.

In that case, can gcc subsequently decide it has to spill the `a * b'
result before that, or a subsequent, similar, use?  My guess is, yes,
it *can*, though it might, in practice, rarely do so.

If yes, should my proposal affect those spills as well?  And do
"those spills" include normal spills, such as around calls, or
are they limited to "complexity" spills, the sort that are hard
for programmers to predict or see in their code?

There are a few reasons I'm unsure about this issue.

First, the main goal of my proposal is to reduce unpredictable loss
of precision on machines like x86, where programmers should be
aware their code will often employ extended precision (and thus might
depend on it).

However, if -ffloat-store is not used, then perhaps this reduction
would not be complete, and could lead to rarer, yet even more obscure
and hard-to-find, bugs, unless we indeed make sure that even spills
of named variables never chop the values of those variables (which
might be in extended precision).

I'm worried about this because it took me quite a bit of effort to
produce a Fortran module that did *any* spills of the sort I'm
generally concerned about (that's what I worked on yesterday, in the
midst of sending my earlier email, for probably 30 minutes or so) --
that is, spills of intermediate results.

But, it might take a lot *less* effort to produce the spills I'm
referring to here, namely, spills of named variables in their
excess-precision form, because here the optimizer might help
"complexify" the code to a point where spills, other than the "normal"
ones around calls and such, are more likely to happen.

So it might be the case that just dealing with spills of temporaries
doesn't really address enough of the problem to constitute an effective
reduction of unexpected loss of precision.

Second, other goals of my proposal are to increase the predictability
of chopping-down of values to the actual types of the variables to
which they're assigned, and to execute existing, working code
faithfully.

My concern here is that if we do always spill without chopping, even
when spilling named variables, we might break code, and expectations,
that assume that `c = a * b; foo(); e = c * d;' can never carry excess
precision for `c' across the call to foo().

For all I know, gcc might currently, or someday, optimize that to
`foo(); e = a * b * d;' but, due to other complexities and such,
calculate it as `t1 = a * b; foo(); e = t1 * d;', leading to excess
precision, as far as the programmer is concerned.

I don't know how legitimate my concerns are, but offhand it seems like
a non-trivial task to change the back end (at least; maybe the front
ends would need changing in some cases) to know the difference between
"visible" spill-points and other, internally demanded, spills, chopping
down named variables in the former situations but never in the latter.

        tq vm, (burley)

P.S. Most, if not all, of this is the result of widespread disagreement
over what a simple type declaration like `REAL*8 A' or `double a;' really
means.  The simple view is "it means that the variable must be capable
of holding the specified precision", but so many people really expect
it to mean so much more, in terms of whether operations on the variable
may, might, or must involve more precision, etc.  And, since the
predominant languages give those people no straightforward way to express
what they *do* really want, how surprising is it that they "overload" the
"simple" view of what a type definition really means?

