public inbox for gcc-bugs@sourceware.org
* [Bug libgcc/108279] New: Improved speed for float128 routines
@ 2023-01-03 20:55 tkoenig at gcc dot gnu.org
  2023-01-03 21:02 ` [Bug libgcc/108279] " tkoenig at gcc dot gnu.org
                   ` (23 more replies)
  0 siblings, 24 replies; 25+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2023-01-03 20:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

            Bug ID: 108279
           Summary: Improved speed for float128 routines
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: libgcc
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

Our soft-float routines, which are used for the basic float128 arithmetic
(__addtf3, __subtf3, etc.), are much slower than they need to be.

Michael S has some routines which are considerably faster, at
https://github.com/already5chosen/extfloat, which he would like to
contribute to gcc.  There is a rather lengthy thread in comp.arch
starting with https://groups.google.com/g/comp.arch/c/Izheu-k00Nw .

Current status of the discussion:

The routines currently do not support rounding modes; they only support round
to nearest, ties to even. Adding such support would be feasible.

Handling of the rounding mode is currently done in libgcc by
querying the hardware, leading to high overhead for each
call. This would not be needed if -ffast-math (or a relevant
suboption) is specified.

The routines would also be suitable as is (with a different name) for Fortran
intrinsics such as matmul.

Fortran is a bit special because rounding modes are at their default on
procedure entry and are restored on procedure exit (which is why setting the
rounding mode in a subroutine is a no-op). This would allow keeping a local
variable that tracks the rounding mode.
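In C terms, a rough sketch of that caching idea (fenv.h is standard; the
soft_add entry point with an explicit rounding-mode argument is hypothetical):

#include <fenv.h>

/* Hypothetical soft-float entry point taking an explicit rounding mode. */
extern double soft_add (double a, double b, int rm);

/* Because Fortran guarantees a known rounding mode on procedure entry,
   it can be queried once and cached in a local for the whole procedure. */
double sum (const double *x, int n)
{
  int rm = fegetround ();        /* query the hardware once */
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s = soft_add (s, x[i], rm);  /* reuse the cached mode */
  return s;
}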

The current idea would be something like this:

The current behavior of __addtf3 and friends could remain as is;
their speed could be improved, but they would still query the
hardware.

There could be two additional routines for each arithmetic operation. One
of them would implement the operation given a specified rounding mode
(to be called from Fortran when the correct IEEE module is in
use).

The other one would just implement round-to-nearest, for use from
Fortran intrinsics and from all other languages if the right flags
are given. It would be good to bolt this onto some flag which is
used for libgfortran, to make it accessible from C.
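A sketch of what those entry points could look like (all names below are
hypothetical; no ABI has been decided):

/* Existing entry point: full IEEE behavior, queries the hardware
   rounding mode on every call.  */
__float128 __addtf3 (__float128 a, __float128 b);

/* Proposed: the caller passes the rounding mode explicitly, e.g. a value
   tracked by the Fortran IEEE modules (hypothetical name).  */
__float128 __addtf3_rm (__float128 a, __float128 b, int rounding_mode);

/* Proposed: round-to-nearest only, for Fortran intrinsics such as matmul
   and for C under suitable -ffast-math suboptions (hypothetical name).  */
__float128 __addtf3_rn (__float128 a, __float128 b);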

Probably GCC 14 material.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
@ 2023-01-03 21:02 ` tkoenig at gcc dot gnu.org
  2023-01-04 10:43 ` jakub at gcc dot gnu.org
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2023-01-03 21:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #1 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Created attachment 54183
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54183&action=edit
Example patch with Michael S's code just pasted over the libgcc implementation,
for a test

A benchmark: just pasting the code from the github repo over the libgcc
implementation improves gfortran's matmul by almost a factor of two, so
significant speedups are possible:

module tick
  interface
     function rdtsc() bind(C,name="rdtsc")
       use iso_c_binding
       integer(kind=c_long) :: rdtsc
     end function rdtsc
  end interface
end module tick

program main
  use tick
  use iso_c_binding
  implicit none
  integer, parameter :: wp = selected_real_kind(30)   ! quad precision
  ! integer, parameter :: n=5000, p=4000, m=3666
  integer, parameter :: n = 1000, p = 1000, m = 1000
  real (kind=wp) :: c(n,p), a(n,m), b(m, p)
  character(len=80) :: line
  integer(c_long) :: t1, t2, t3
  real (kind=wp) :: fl = 2.d0*n*m*p                   ! flop count of matmul
  integer :: i,j

  print *,wp

  line = '10 10'
  call random_number(a)
  call random_number(b)
  ! measure the overhead of rdtsc itself
  t1 = rdtsc()
  t2 = rdtsc()
  t3 = t2-t1
  print *,t3
  t1 = rdtsc()
  c = matmul(a,b)
  t2 = rdtsc()
  print *,1/(fl/(t2-t1-t3)),"Cycles per operation"
  ! touch one element so the matmul cannot be optimized away
  read (unit=line,fmt=*) i,j
  write (unit=line,fmt=*) c(i,j)
end program main

showed

tkoenig@gcc188:~> ./original
16
32
^C
tkoenig@gcc188:~> time ./original
16
32
90.5696151959999999999999999999999997 Cycles per operation

real 1m2,148s
user 1m2,123s
sys 0m0,008s
tkoenig@gcc188:~> time ./modified
16
32
52.8148391719999999999999999999999957 Cycles per operation

real 0m36,296s
user 0m36,278s
sys 0m0,008s 

where "original" is the current libgcc soft-float implementation, and
"modified" is with the code from the repro.

It does not handle exceptions, so this causes a few regressions, but it
certainly shows the potential.
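For reference, the benchmark binds to an external C function named "rdtsc".
A minimal sketch of such a helper (an assumption, not part of the attachment;
__rdtsc() is the GCC intrinsic for the x86 RDTSC instruction):

#include <x86intrin.h>

/* Cycle counter for the Fortran benchmark, matched to
   bind(C,name="rdtsc") with integer(kind=c_long).  x86-only sketch.  */
long rdtsc (void)
{
  return (long) __rdtsc ();
}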

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
  2023-01-03 21:02 ` [Bug libgcc/108279] " tkoenig at gcc dot gnu.org
@ 2023-01-04 10:43 ` jakub at gcc dot gnu.org
  2023-01-04 17:14 ` tkoenig at gcc dot gnu.org
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-01-04 10:43 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
From what I can see, they are certainly not portable.
E.g. relying on __int128 rules out various arches (basically all 32-bit
arches: ia32 and 32-bit powerpc, among others).  Not handling exceptions is a
show stopper too.
I guess a better time investment would be to improve the performance of the
soft-fp versions.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
  2023-01-03 21:02 ` [Bug libgcc/108279] " tkoenig at gcc dot gnu.org
  2023-01-04 10:43 ` jakub at gcc dot gnu.org
@ 2023-01-04 17:14 ` tkoenig at gcc dot gnu.org
  2023-01-04 22:19 ` already5chosen at yahoo dot com
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2023-01-04 17:14 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #3 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #2)
> From what I can see, they are certainly not portable.
> E.g. relying on __int128 rules out various arches (basically all 32-bit
> arches: ia32 and 32-bit powerpc, among others).

For this kind of performance improvement on 64-bit systems, we could probably
introduce an appropriate #ifdef. Regarding the x86 intrinsics, maybe they
can be replaced by gcc's vector extensions.
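A minimal sketch of such a guard (the typedef name is hypothetical;
__SIZEOF_INT128__ is the macro gcc predefines on targets that have __int128):

#if defined(__SIZEOF_INT128__)
/* 64-bit targets: use a fast implementation built on __int128.  */
typedef unsigned __int128 fp_wide_t;   /* hypothetical name */
#else
/* Other targets: keep the existing soft-fp implementation.  */
#endif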

> Not handling exceptions is a show
> stopper too.

Agreed, we should not be replacing the soft-fp that way.

> Guess better time investment would be to improve performance of the soft-fp
> versions.

I'm not sure; I think we could get an appreciable benefit if we
only invoke this kind of routine behind the appropriate suboptions
of -ffast-math.

For general-purpose code, I see no way around the bottleneck
of querying the processor status on each invocation, and that is a waste
if the program does not care.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2023-01-04 17:14 ` tkoenig at gcc dot gnu.org
@ 2023-01-04 22:19 ` already5chosen at yahoo dot com
  2023-01-11 23:06 ` already5chosen at yahoo dot com
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: already5chosen at yahoo dot com @ 2023-01-04 22:19 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #4 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Jakub Jelinek from comment #2)
> From what I can see, they are certainly not portable.
> E.g. relying on __int128 rules out various arches (basically all 32-bit
> arches: ia32 and 32-bit powerpc, among others).  Not handling exceptions is
> a show stopper too.
> I guess a better time investment would be to improve the performance of the
> soft-fp versions.

The only 32-bit gcc targets [out of the targets represented on Godbolt] that
support __float128/_Float128 right now are
- i386 (IA32)
- SPARC
- RV32
I don't know why it is supported, but I would be surprised if __float128 on
RV32 is actually used by anybody. It is just too slow.

32-bit SPARC sounds like a dead platform.

i386 is more problematic. People still use it, and people still develop new
programs for it. And since modern processors are so fast, the performance of
__float128 on i386 could be usable, even if it is 5 or 10 times slower than on
the same machine running x86-64.

PPC32 is not supported, so no problems from that side.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2023-01-04 22:19 ` already5chosen at yahoo dot com
@ 2023-01-11 23:06 ` already5chosen at yahoo dot com
  2023-01-12 23:24 ` tkoenig at gcc dot gnu.org
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: already5chosen at yahoo dot com @ 2023-01-11 23:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #5 from Michael_S <already5chosen at yahoo dot com> ---
Hi Thomas
Are you in or out?

If you are still in, I can use your help on several issues.

1. Torture. 
See if the Invalid Operand exception is raised properly now. Also check whether
there are still remaining problems with NaN.

2. Run my correctness tests on as many non-AMD64 targets as you can. Preferably
with 100,000,000 iterations, but on weaker HW 10,000,000 will do.

3. Run my speed tests (tests/matmulq/mm_speed_ma) on a more diverse set of
AMD64 computers than I did.
Of special interest are
- AMD Zen3 on Linux running on bare metal
- Intel Skylake, SkylakeX, Tiger/Rocket Lake and Alder Lake on Linux running on
bare metal
I realize that doing speed tests is not nearly as simple as correctness tests.
We need non-busy (preferably almost idle) machines that have a stable CPU clock
rate. It's not easy to find machines like that nowadays. But maybe you can find
at least some from the list.

4. Run my speed tests on as many non-obsolete ARM64 computers as you can find.
Well, probably wishful thinking on my part.


Also off topic but of interest: postprocessed source of matmul_r16.c

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2023-01-11 23:06 ` already5chosen at yahoo dot com
@ 2023-01-12 23:24 ` tkoenig at gcc dot gnu.org
  2023-01-13  0:34 ` already5chosen at yahoo dot com
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2023-01-12 23:24 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #6 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
(In reply to Michael_S from comment #5)
> Hi Thomas
> Are you in or out?

Depends a bit on what exactly you want to do, and if there is
a chance that what you want to do will be incorporated into gcc.

If you want to replace the soft-float routines, you will have to
replace them with the full functionality.

And there will have to be a decision about 32-bit targets.

> If you are still in, I can use your help on several issues.
> 
> 1. Torture. 
> See if the Invalid Operand exception is raised properly now. Also check
> whether there are still remaining problems with NaN.

I've put your addition/subtraction routines in as a replacement
and am running a regression test.  We'll see when that finishes.

> 2. Run my correctness tests on as many non-AMD64 targets as you can.
> Preferably with 100,000,000 iterations, but on weaker HW 10,000,000 will do.

This will take some time.

> 3. Run my speed tests (tests/matmulq/mm_speed_ma) on a more diverse set of
> AMD64 computers than I did.
> Of special interest are
> - AMD Zen3 on Linux running on bare metal
> - Intel Skylake, SkylakeX, Tiger/Rocket Lake and Alder Lake on Linux running
> on bare metal
> I realize that doing speed tests is not nearly as simple as correctness
> tests.
> We need non-busy (preferably almost idle) machines that have a stable CPU
> clock rate. It's not easy to find machines like that nowadays. But maybe
> you can find at least some from the list.

I currently have no access to that sort of hardware (I'm just a volunteer,
and my home box is Zen-1).

> 4. Run my speed tests on as many non-obsolete ARM64 computers as you can
> find.
> Well, probably wishful thinking on my part.
> 
> 
> Also off topic but of interest: postprocessed source of matmul_r16.c

Where should I send that to?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2023-01-12 23:24 ` tkoenig at gcc dot gnu.org
@ 2023-01-13  0:34 ` already5chosen at yahoo dot com
  2023-01-13  1:29 ` already5chosen at yahoo dot com
                   ` (16 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: already5chosen at yahoo dot com @ 2023-01-13  0:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #7 from Michael_S <already5chosen at yahoo dot com> ---
Either here or my yahoo e-mail

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2023-01-13  0:34 ` already5chosen at yahoo dot com
@ 2023-01-13  1:29 ` already5chosen at yahoo dot com
  2023-01-14  9:21 ` tkoenig at gcc dot gnu.org
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: already5chosen at yahoo dot com @ 2023-01-13  1:29 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #8 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Thomas Koenig from comment #6)
> (In reply to Michael_S from comment #5)
> > Hi Thomas
> > Are you in or out?
> 
> Depends a bit on what exactly you want to do, and if there is
> a chance that what you want to do will be incorporated into gcc.
> 

What about incorporation in Fortran?
What about incorporation in C under -ffast-math?

> If you want to replace the soft-float routines, you will have to
> replace them with the full functionality.
> 

Full functionality including the Inexact exception that practically nobody
uses? Sounds wasteful of perfectly good CPU cycles.
Also, I am not so sure that the Inexact exception is fully supported in the
existing soft-float library.

Almost-full functionality, with support for non-default rounding modes but
without the Inexact exception?
I actually implemented that and did a few measurements. You can find the
results in the /reports directory in my repo.
Summary: the architecture-neutral method causes a very serious slowdown. Less
so on slower machines, but a massive 2.5x on the fastest machine (Zen3 under
Linux under WSL).
The AMD64-specific method causes a smaller slowdown, especially on relatively
old Intel cores on Windows (I have no modern Intel cores available for
testing). But Zen3/Linux still suffers a 1.45x slowdown. Again, a big waste of
perfectly good CPU cycles.
Also, what about other architectures? Should they suffer an
"architecture-neutral" slowdown? Even if there are faster methods on other
architectures, those methods have to be found and tested by somebody.
This sort of work is time-consuming. And for what?

I also measured the impact of implementing non-default rounding through an
additional function parameter. The impact is very small, 0 to 5%.
You said on comp.arch that, at least for Fortran, it could work.

What else is missing for "full functionality"? 
Surely there are other things that I forgot. Maybe additional exceptions,
apart from Invalid Operand (which hopefully already works) and apart from
Inexact, which I find stupid? I don't think they are hard to implement or
expensive in terms of speed. Just a bit of work and more than a bit of testing.

> And there will have to be a decision about 32-bit targets.
>

IMHO, 32-bit targets should be left in their current state.
People that use them probably do not care deeply about performance.
Technically, I can implement 32-bit targets in the same sources, by means of a
few ifdefs and macros, but the resulting source code will look much uglier than
it looks today. Still, not to the same level of horror that you have in
matmul_r16.c, but certainly uglier than how I like it to look.
And I am not at all sure that my implementation for 32-bit targets would be
significantly faster than the current soft float. 
In short, it does not sound like a good ROI.
BTW, do you know why the current soft float supports so few 32-bit targets?
Most likely somebody felt just like me about it - it's not too hard to support
more 32-bit targets, but it's not a good ROI.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2023-01-13  1:29 ` already5chosen at yahoo dot com
@ 2023-01-14  9:21 ` tkoenig at gcc dot gnu.org
  2023-01-14 10:22 ` tkoenig at gcc dot gnu.org
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2023-01-14  9:21 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #9 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Created attachment 54273
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54273&action=edit
matmul_r16.i

Here is matmul_r16.i from a relatively recent trunk.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2023-01-14  9:21 ` tkoenig at gcc dot gnu.org
@ 2023-01-14 10:22 ` tkoenig at gcc dot gnu.org
  2023-01-14 23:52 ` already5chosen at yahoo dot com
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2023-01-14 10:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #10 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
What we would need for incorporation into gcc is to have several
functions, which would then be called depending on which floating point
options are in force at the time of invocation.

So, let's go through the gcc options, to see what would fit where. Walking
down the options tree, depth first.

From the gcc docs:

'-ffast-math'
     Sets the options '-fno-math-errno', '-funsafe-math-optimizations',
     '-ffinite-math-only', '-fno-rounding-math', '-fno-signaling-nans',
     '-fcx-limited-range' and '-fexcess-precision=fast'.

-fno-math-errno is irrelevant in this context, no need to look at that.

'-funsafe-math-optimizations'

     Allow optimizations for floating-point arithmetic that (a) assume
     that arguments and results are valid and (b) may violate IEEE or
     ANSI standards.  When used at link time, it may include libraries
     or startup files that change the default FPU control word or other
     similar optimizations.

     This option is not turned on by any '-O' option since it can result
     in incorrect output for programs that depend on an exact
     implementation of IEEE or ISO rules/specifications for math
     functions.  It may, however, yield faster code for programs that do
     not require the guarantees of these specifications.  Enables
     '-fno-signed-zeros', '-fno-trapping-math', '-fassociative-math' and
     '-freciprocal-math'.

'-fno-signed-zeros'
     Allow optimizations for floating-point arithmetic that ignore the
     signedness of zero.  IEEE arithmetic specifies the behavior of
     distinct +0.0 and -0.0 values, which then prohibits simplification
     of expressions such as x+0.0 or 0.0*x (even with
     '-ffinite-math-only').  This option implies that the sign of a zero
     result isn't significant.

     The default is '-fsigned-zeros'.

I don't think this option is relevant.

'-fno-trapping-math'
     Compile code assuming that floating-point operations cannot
     generate user-visible traps.  These traps include division by zero,
     overflow, underflow, inexact result and invalid operation.  This
     option requires that '-fno-signaling-nans' be in effect.  Setting
     this option may allow faster code if one relies on "non-stop" IEEE
     arithmetic, for example.

     This option should never be turned on by any '-O' option since it
     can result in incorrect output for programs that depend on an exact
     implementation of IEEE or ISO rules/specifications for math
     functions.

     The default is '-ftrapping-math'.

Relevant.

'-ffinite-math-only'
     Allow optimizations for floating-point arithmetic that assume that
     arguments and results are not NaNs or +-Infs.

     This option is not turned on by any '-O' option since it can result
     in incorrect output for programs that depend on an exact
     implementation of IEEE or ISO rules/specifications for math
     functions.  It may, however, yield faster code for programs that do
     not require the guarantees of these specifications.

This does not have further suboptions. Relevant.

'-fassociative-math'

     Allow re-association of operands in series of floating-point
     operations.  This violates the ISO C and C++ language standard by
     possibly changing computation result.  NOTE: re-ordering may change
     the sign of zero as well as ignore NaNs and inhibit or create
     underflow or overflow (and thus cannot be used on code that relies
     on rounding behavior like '(x + 2**52) - 2**52'.  May also reorder
     floating-point comparisons and thus may not be used when ordered
     comparisons are required.  This option requires that both
     '-fno-signed-zeros' and '-fno-trapping-math' be in effect.
     Moreover, it doesn't make much sense with '-frounding-math'.  For
     Fortran the option is automatically enabled when both
     '-fno-signed-zeros' and '-fno-trapping-math' are in effect.

     The default is '-fno-associative-math'.

Not relevant, I think - this influences compiler optimizations.

'-freciprocal-math'

     Allow the reciprocal of a value to be used instead of dividing by
     the value if this enables optimizations.  For example 'x / y' can
     be replaced with 'x * (1/y)', which is useful if '(1/y)' is subject
     to common subexpression elimination.  Note that this loses
     precision and increases the number of flops operating on the value.

     The default is '-fno-reciprocal-math'.

Again, not relevant.


'-frounding-math'
     Disable transformations and optimizations that assume default
     floating-point rounding behavior.  This is round-to-zero for all
     floating point to integer conversions, and round-to-nearest for all
     other arithmetic truncations.  This option should be specified for
     programs that change the FP rounding mode dynamically, or that may
     be executed with a non-default rounding mode.  This option disables
     constant folding of floating-point expressions at compile time
     (which may be affected by rounding mode) and arithmetic
     transformations that are unsafe in the presence of sign-dependent
     rounding modes.

     The default is '-fno-rounding-math'.

     This option is experimental and does not currently guarantee to
     disable all GCC optimizations that are affected by rounding mode.
     Future versions of GCC may provide finer control of this setting
     using C99's 'FENV_ACCESS' pragma.  This command-line option will be
     used to specify the default state for 'FENV_ACCESS'.

Also no further suboptions. This is relevant.

'-fsignaling-nans'
     Compile code assuming that IEEE signaling NaNs may generate
     user-visible traps during floating-point operations.  Setting this
     option disables optimizations that may change the number of
     exceptions visible with signaling NaNs.  This option implies
     '-ftrapping-math'.

     This option causes the preprocessor macro '__SUPPORT_SNAN__' to be
     defined.

     The default is '-fno-signaling-nans'.

     This option is experimental and does not currently guarantee to
     disable all GCC optimizations that affect signaling NaN behavior.

Also, no further suboptions. Relevant.

-fcx-limited-range is not relevant, and neither is -fexcess-precision=fast.

So, unless I missed something, it should be possible to select different
functions depending on the values of -ftrapping-math, -ffinite-math-only,
-frounding-math and -fsignaling-nans.
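A sketch of that selection (pseudo-logic only; the _rm/_rn names are the
hypothetical entry points from comment #0, not an agreed ABI):

/* Pick the __addtf3 variant from the relevant options.  */
const char *
pick_addtf3 (int trapping_math, int rounding_math)
{
  if (trapping_math)
    return "__addtf3";     /* exceptions required: full IEEE entry point */
  if (rounding_math)
    return "__addtf3_rm";  /* dynamic rounding passed as an argument */
  return "__addtf3_rn";    /* round-to-nearest only, no traps */
}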

Regarding Fortran's matmul: We use -ffast-math when compiling the library
functions, so any change we make to any of the -ffast-math suboptions
would be used, as well.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2023-01-14 10:22 ` tkoenig at gcc dot gnu.org
@ 2023-01-14 23:52 ` already5chosen at yahoo dot com
  2023-01-15  0:20 ` already5chosen at yahoo dot com
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: already5chosen at yahoo dot com @ 2023-01-14 23:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #11 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Thomas Koenig from comment #9)
> Created attachment 54273 [details]
> matmul_r16.i
> 
> Here is matmul_r16.i from a relatively recent trunk.

Thank you.
Unfortunately, I was not able to link it with a main program in Fortran.
So, I still can only guess why, even after replacing __multf3 and
__addtf3 with my implementations, it is still more than twice as slow (on Zen3)
as it should be.
Looking at the source, and assuming that the inner loop starts at line 8944,
this loop looks very strange; but apart from bad programming style and a
misunderstanding of what constitutes optimal scheduling, there is nothing
criminal about it. Maybe, because of the wrong scheduling, it is 10-15% slower
than the best, but it certainly should not be 2.4 times slower, which is what
I am seeing.


That's what I got from the linker:
/usr/bin/ld: matmul_r16.o: warning: relocation against
`_gfortrani_matmul_r16_avx128_fma3' in read-only section `.text'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_avx':
matmul_r16.c:(.text+0x48): undefined reference to `_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x342): undefined reference to
`_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x10e4): undefined reference to
`_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x10f1): undefined reference to
`_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x12e5): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x13a9): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x13e7): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_avx2':
matmul_r16.c:(.text+0x24c8): undefined reference to
`_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x27c2): undefined reference to
`_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x3564): undefined reference to
`_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x3571): undefined reference to
`_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x3765): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x3829): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x3867): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_avx512f':
matmul_r16.c:(.text+0x4948): undefined reference to
`_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x4c47): undefined reference to
`_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x5a32): undefined reference to
`_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x5a3f): undefined reference to
`_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x5c35): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x5cfb): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x5d35): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `matmul_r16_vanilla':
matmul_r16.c:(.text+0x6de8): undefined reference to
`_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x70e2): undefined reference to
`_gfortrani_compile_options'
/usr/bin/ld: matmul_r16.c:(.text+0x7e84): undefined reference to
`_gfortrani_size0'
/usr/bin/ld: matmul_r16.c:(.text+0x7e91): undefined reference to
`_gfortrani_xmallocarray'
/usr/bin/ld: matmul_r16.c:(.text+0x8085): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x8149): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.c:(.text+0x8187): undefined reference to
`_gfortrani_runtime_error'
/usr/bin/ld: matmul_r16.o: in function `_gfortran_matmul_r16':
matmul_r16.c:(.text+0x92b7): undefined reference to
`_gfortrani_matmul_r16_avx128_fma3'
/usr/bin/ld: matmul_r16.c:(.text+0x92ea): undefined reference to
`_gfortrani_matmul_r16_avx128_fma4'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
collect2: error: ld returned 1 exit status

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2023-01-14 23:52 ` already5chosen at yahoo dot com
@ 2023-01-15  0:20 ` already5chosen at yahoo dot com
  2023-01-15 15:13 ` tkoenig at gcc dot gnu.org
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: already5chosen at yahoo dot com @ 2023-01-15  0:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #12 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Thomas Koenig from comment #10)
> What we would need for incorporation into gcc is to have several
> functions, which would then be called depending on which floating point
> options are in force at the time of invocation.
> 
> So, let's go through the gcc options, to see what would fit where. Walking
> down the options tree, depth first.
> 
> From the gcc docs:
> 
> '-ffast-math'
>      Sets the options '-fno-math-errno', '-funsafe-math-optimizations',
>      '-ffinite-math-only', '-fno-rounding-math', '-fno-signaling-nans',
>      '-fcx-limited-range' and '-fexcess-precision=fast'.
> 
> -fno-math-errno is irrelevant in this context, no need to look at that.
> 
> '-funsafe-math-optimizations'
> 
>      Allow optimizations for floating-point arithmetic that (a) assume
>      that arguments and results are valid and (b) may violate IEEE or
>      ANSI standards.  When used at link time, it may include libraries
>      or startup files that change the default FPU control word or other
>      similar optimizations.
> 
>      This option is not turned on by any '-O' option since it can result
>      in incorrect output for programs that depend on an exact
>      implementation of IEEE or ISO rules/specifications for math
>      functions.  It may, however, yield faster code for programs that do
>      not require the guarantees of these specifications.  Enables
>      '-fno-signed-zeros', '-fno-trapping-math', '-fassociative-math' and
>      '-freciprocal-math'.
> 
> '-fno-signed-zeros'
>      Allow optimizations for floating-point arithmetic that ignore the
>      signedness of zero.  IEEE arithmetic specifies the behavior of
>      distinct +0.0 and -0.0 values, which then prohibits simplification
>      of expressions such as x+0.0 or 0.0*x (even with
>      '-ffinite-math-only').  This option implies that the sign of a zero
>      result isn't significant.
> 
>      The default is '-fsigned-zeros'.
> 
> I don't think this option is relevant.
> 
> '-fno-trapping-math'
>      Compile code assuming that floating-point operations cannot
>      generate user-visible traps.  These traps include division by zero,
>      overflow, underflow, inexact result and invalid operation.  This
>      option requires that '-fno-signaling-nans' be in effect.  Setting
>      this option may allow faster code if one relies on "non-stop" IEEE
>      arithmetic, for example.
> 
>      This option should never be turned on by any '-O' option since it
>      can result in incorrect output for programs that depend on an exact
>      implementation of IEEE or ISO rules/specifications for math
>      functions.
> 
>      The default is '-ftrapping-math'.
> 
> Relevant.
> 
> '-ffinite-math-only'
>      Allow optimizations for floating-point arithmetic that assume that
>      arguments and results are not NaNs or +-Infs.
> 
>      This option is not turned on by any '-O' option since it can result
>      in incorrect output for programs that depend on an exact
>      implementation of IEEE or ISO rules/specifications for math
>      functions.  It may, however, yield faster code for programs that do
>      not require the guarantees of these specifications.
> 
> This does not have further suboptions. Relevant.
> 
> '-fassociative-math'
> 
>      Allow re-association of operands in series of floating-point
>      operations.  This violates the ISO C and C++ language standard by
>      possibly changing computation result.  NOTE: re-ordering may change
>      the sign of zero as well as ignore NaNs and inhibit or create
>      underflow or overflow (and thus cannot be used on code that relies
>      on rounding behavior like '(x + 2**52) - 2**52'.  May also reorder
>      floating-point comparisons and thus may not be used when ordered
>      comparisons are required.  This option requires that both
>      '-fno-signed-zeros' and '-fno-trapping-math' be in effect.
>      Moreover, it doesn't make much sense with '-frounding-math'.  For
>      Fortran the option is automatically enabled when both
>      '-fno-signed-zeros' and '-fno-trapping-math' are in effect.
> 
>      The default is '-fno-associative-math'.
> 
> Not relevant, I think - this influences compiler optimizations.
> 
> '-freciprocal-math'
> 
>      Allow the reciprocal of a value to be used instead of dividing by
>      the value if this enables optimizations.  For example 'x / y' can
>      be replaced with 'x * (1/y)', which is useful if '(1/y)' is subject
>      to common subexpression elimination.  Note that this loses
>      precision and increases the number of flops operating on the value.
> 
>      The default is '-fno-reciprocal-math'.
> 
> Again, not relevant.
> 
> 
> '-frounding-math'
>      Disable transformations and optimizations that assume default
>      floating-point rounding behavior.  This is round-to-zero for all
>      floating point to integer conversions, and round-to-nearest for all
>      other arithmetic truncations.  This option should be specified for
>      programs that change the FP rounding mode dynamically, or that may
>      be executed with a non-default rounding mode.  This option disables
>      constant folding of floating-point expressions at compile time
>      (which may be affected by rounding mode) and arithmetic
>      transformations that are unsafe in the presence of sign-dependent
>      rounding modes.
> 
>      The default is '-fno-rounding-math'.
> 
>      This option is experimental and does not currently guarantee to
>      disable all GCC optimizations that are affected by rounding mode.
>      Future versions of GCC may provide finer control of this setting
>      using C99's 'FENV_ACCESS' pragma.  This command-line option will be
>      used to specify the default state for 'FENV_ACCESS'.
> 
> Also no further suboptions. This is relevant.
> 
> '-fsignaling-nans'
>      Compile code assuming that IEEE signaling NaNs may generate
>      user-visible traps during floating-point operations.  Setting this
>      option disables optimizations that may change the number of
>      exceptions visible with signaling NaNs.  This option implies
>      '-ftrapping-math'.
> 
>      This option causes the preprocessor macro '__SUPPORT_SNAN__' to be
>      defined.
> 
>      The default is '-fno-signaling-nans'.
> 
>      This option is experimental and does not currently guarantee to
>      disable all GCC optimizations that affect signaling NaN behavior.
> 
> Also, no further suboptions. Relevant.
> 
> -fcx-limited-range is not relevant, and neither is -fexcess-precision=fast.
> 
> So, unless I missed something, it should be possible to select different
> functions depending on the values of -ftrapping-math, -ffinite-math-only,
> -frounding-math and -fsignaling-nans.
> 
> Regarding Fortran's matmul: We use -ffast-math when compiling the library
> functions, so any change we make to any of the -ffast-math suboptions
> would be used, as well.

This set of options does not map too well onto the real difficulties of
implementation.
There are only 2 things that are expensive:
1. The Inexact exception.
2. Fetching the current rounding mode.
The rest of the IEEE-754 features are so cheap that creating separate variants
without them simply is not worth the effort of maintaining distinct variants,
even if all the difference is a single three-line #ifdef.

BTW, the Inexact exception can be made fairly affordable with a little help
from the compiler. All we need for that is the ability to say "don't remove
this floating point addition even if you don't see that it produces any
effect".
Something similar to 'volatile', but with volatile the compiler currently puts
the result of the addition on the stack, which adds undesirable cost.
However, judging by the comment of Jakub, the compiler maintainers are not
particularly interested in this enterprise.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2023-01-15  0:20 ` already5chosen at yahoo dot com
@ 2023-01-15 15:13 ` tkoenig at gcc dot gnu.org
  2023-01-15 19:17 ` tkoenig at gcc dot gnu.org
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2023-01-15 15:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #13 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
I tried compiling your tests on Apple silicon using Asahi Linux, but
without success. The first step was rather easy; replacing __float128 by
_Float128 was required.  I then bootstrapped gcc on that machine and
added the (private) include path for <quadmath.h>, and am now hitting a
missing __float128 in quadmath.h.  Not sure how to proceed from here.

The machine is gcc103.fsffrance.org, by the way, of the GCC compile farm.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (12 preceding siblings ...)
  2023-01-15 15:13 ` tkoenig at gcc dot gnu.org
@ 2023-01-15 19:17 ` tkoenig at gcc dot gnu.org
  2023-01-15 19:28 ` jakub at gcc dot gnu.org
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: tkoenig at gcc dot gnu.org @ 2023-01-15 19:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #14 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
It seems that libquadmath is not built on that particular Linux/CPU variant,
for whatever reason. At least, I cannot find any *quadmath* files
in the build directory.

/proc/cpuinfo tells me that

processor       : 0
BogoMIPS        : 48.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp
asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdfhm dit
uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint
CPU implementer : 0x61
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0x022
CPU revision    : 1

[...]

and uname -a is

Linux gcc103.fsffrance.org 6.0.0-rc5-asahi-00001-gc62bd3fe430f #1 SMP Sun Sep
18 18:07:57 CEST 2022 aarch64 GNU/Linux

So much for testing on Apple silicon.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (13 preceding siblings ...)
  2023-01-15 19:17 ` tkoenig at gcc dot gnu.org
@ 2023-01-15 19:28 ` jakub at gcc dot gnu.org
  2023-01-15 22:27 ` already5chosen at yahoo dot com
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-01-15 19:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #15 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
libquadmath is not needed nor useful on aarch64-linux, because long double type
there is already IEEE 754 quad.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (14 preceding siblings ...)
  2023-01-15 19:28 ` jakub at gcc dot gnu.org
@ 2023-01-15 22:27 ` already5chosen at yahoo dot com
  2023-01-16 22:16 ` joseph at codesourcery dot com
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: already5chosen at yahoo dot com @ 2023-01-15 22:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #16 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Jakub Jelinek from comment #15)
> libquadmath is not needed nor useful on aarch64-linux, because long double
> type there is already IEEE 754 quad.

That's good to know. Thank you.

If you are here, I have one more somewhat related question: is IEEE binary128
arithmetic really not supported under ARMv7, or, like in the case above, is it
supported, but not under one of the common names (__float128/_Float128)?

And if it is not supported, then why? 
If I had to pick one 32-bit architecture to support, it certainly would be
ARMv7, the most used 32-bit architecture in the world, which probably outsells
the next one by an order of magnitude.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (15 preceding siblings ...)
  2023-01-15 22:27 ` already5chosen at yahoo dot com
@ 2023-01-16 22:16 ` joseph at codesourcery dot com
  2023-01-18 19:02 ` wilco at gcc dot gnu.org
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: joseph at codesourcery dot com @ 2023-01-16 22:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #17 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
It's not part of the ABI for the Arm 32-bit Architecture (AAPCS32).

https://github.com/ARM-software/abi-aa/blob/main/aapcs32/aapcs32.rst

You can file an issue there if you want, though I don't know how 
interested the maintainers will be in that optional language feature.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (16 preceding siblings ...)
  2023-01-16 22:16 ` joseph at codesourcery dot com
@ 2023-01-18 19:02 ` wilco at gcc dot gnu.org
  2023-01-18 19:20 ` jakub at gcc dot gnu.org
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: wilco at gcc dot gnu.org @ 2023-01-18 19:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #18 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Michael_S from comment #12)

> This set of options does not map too well onto the real difficulties of
> implementation.
> There are only 2 things that are expensive:
> 1. The Inexact exception.
> 2. Fetching the current rounding mode.
> The rest of the IEEE-754 features are so cheap that creating separate
> variants without them simply is not worth the effort of maintaining distinct
> variants, even if all the difference is a single three-line #ifdef

In general, reading the current rounding mode is relatively cheap, but
modifying it can be expensive, so optimized fenv implementations in GLIBC only
modify the FP status if a change is required. It should be feasible to check
for round-to-even and use optimized code for that case.
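A sketch of that check (the two callees are hypothetical):

#include <fenv.h>

extern __float128 addtf3_nearest (__float128, __float128);  /* fast path */
extern __float128 addtf3_generic (__float128, __float128);  /* full modes */

__float128
addtf3_dispatch (__float128 a, __float128 b)
{
  /* Fast path for the round-to-nearest default; fall back to the
     fully general path otherwise.  */
  if (fegetround () == FE_TONEAREST)
    return addtf3_nearest (a, b);
  return addtf3_generic (a, b);
}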

> BTW, the Inexact exception can be made fairly affordable with a little help
> from the compiler. All we need for that is the ability to say "don't remove
> this floating point addition even if you don't see that it produces any
> effect".
> Something similar to 'volatile', but with volatile the compiler currently
> puts the result of the addition on the stack, which adds undesirable cost.
> However, judging by the comment of Jakub, the compiler maintainers are not
> particularly interested in this enterprise.

There are macros in GLIBC math-barriers.h which do what you want - eg. AArch64:

#define math_opt_barrier(x)                                     \
  ({ __typeof (x) __x = (x); __asm ("" : "+w" (__x)); __x; })
#define math_force_eval(x)                                              \
  ({ __typeof (x) __x = (x); __asm __volatile__ ("" : : "w" (__x)); })

The first blocks optimizations (like constant folding) across the barrier; the
second forces evaluation of an expression even if it is deemed useless. These
are used in many math functions in GLIBC. They are target-specific due to
needing inline assembler operands, but it should be easy to add similar
definitions to libgcc.
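For example, a soft-float routine could raise the inexact flag without a
volatile spill (a sketch, assuming barriers like the ones above are made
available in libgcc):

/* Force a real FP addition whose result rounds, so FE_INEXACT is raised;
   math_force_eval keeps the compiler from deleting the add.  */
static inline void
raise_inexact (void)
{
  double one = math_opt_barrier (1.0);  /* hide the constant */
  math_force_eval (one + 0x1p-60);      /* rounds to 1.0 -> inexact */
}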

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (17 preceding siblings ...)
  2023-01-18 19:02 ` wilco at gcc dot gnu.org
@ 2023-01-18 19:20 ` jakub at gcc dot gnu.org
  2023-01-18 19:26 ` jakub at gcc dot gnu.org
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-01-18 19:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #19 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
So, if stmxcsr/vstmxcsr is too slow, perhaps we should change x86 sfp-machine.h
#define FP_INIT_ROUNDMODE                                       \
  do {                                                          \
    __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw));       \
  } while (0)
#endif
to do that only if not round-to-nearest.
E.g. the fast-float library has used the following since last November:
  static volatile float fmin = __FLT_MIN__;
  float fmini = fmin; // we copy it so that it gets loaded at most once.
  return (fmini + 1.0f == 1.0f - fmini);
as a quick test of whether round-to-nearest is in effect or some other
rounding mode.
Most likely, if this is done with -frounding-math, it wouldn't even need the
volatile stuff.  Of course, if it isn't round-to-nearest, we would need to do
the expensive {,v}stmxcsr.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (18 preceding siblings ...)
  2023-01-18 19:20 ` jakub at gcc dot gnu.org
@ 2023-01-18 19:26 ` jakub at gcc dot gnu.org
  2023-01-18 20:10 ` wilco at gcc dot gnu.org
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: jakub at gcc dot gnu.org @ 2023-01-18 19:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #20 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
__attribute__((noinline, optimize ("rounding-math"))) static int
round_to_nearest (void) { return 1.0f - __FLT_MIN__ == 1.0f + __FLT_MIN__; }

and
  if (round_to_nearest ()) \
    _fcw = FP_RND_NEAREST; \
  else \
    __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw)); \

Except that from _fcw we don't determine just the rounding mode but also what
exceptions are enabled.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (19 preceding siblings ...)
  2023-01-18 19:26 ` jakub at gcc dot gnu.org
@ 2023-01-18 20:10 ` wilco at gcc dot gnu.org
  2023-01-18 22:28 ` already5chosen at yahoo dot com
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 25+ messages in thread
From: wilco at gcc dot gnu.org @ 2023-01-18 20:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #21 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #20)
> __attribute__((noinline, optimize ("rounding-math"))) static int
> round_to_nearest (void) { return 1.0f - __FLT_MIN__ == 1.0f + __FLT_MIN__; }

Wouldn't that always set inexact?

> and
>   if (round_to_nearest ()) \
>     _fcw = FP_RND_NEAREST; \
>   else \
>     __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw)); \
> 
> Except that from _fcw we don't determine just the rounding mode but also
> what exceptions are enabled.

Yes, that wouldn't work in fenv, but FP emulation functions don't need to read
the exception flags.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (20 preceding siblings ...)
  2023-01-18 20:10 ` wilco at gcc dot gnu.org
@ 2023-01-18 22:28 ` already5chosen at yahoo dot com
  2023-01-18 23:31 ` already5chosen at yahoo dot com
  2023-02-10 13:38 ` already5chosen at yahoo dot com
  23 siblings, 0 replies; 25+ messages in thread
From: already5chosen at yahoo dot com @ 2023-01-18 22:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #22 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Michael_S from comment #8)
> (In reply to Thomas Koenig from comment #6)
> > And there will have to be a decision about 32-bit targets.
> >
> 
> IMHO, 32-bit targets should be left in their current state.
> People that use them probably do not care deeply about performance.
> Technically, I can implement 32-bit targets in the same sources, by means of
> few ifdefs and macros, but resulting source code will look much uglier than
> how it looks today. Still, not to the same level of horror that you have in
> matmul_r16.c, but certainly uglier than how I like it to look.
> And I am not sure at all that my implementation of 32-bit targets would be
> significantly faster than current soft float.

I explored this path (implementing 32-bit and 64-bit targets from the same
source with a few ifdefs) a little more:
Now I am even more sure that it is not the way to go. The gcc compiler does not
generate good 32-bit code for this style of source. This especially applies to
i386; the other supported 32-bit targets (RV32, SPARC32) are affected less.

In the process I encountered a funny, illogical pessimization by the i386 code
generator:
https://godbolt.org/z/En6Tredox
Routines foo32() and foo64() are semantically identical, but foo32() is written
with 32-bit targets in mind, while foo64() is in the style of code that would
likely be written if one wants to support 32 and 64 bits from the same source
with #ifdef.

The code generated by gcc for foo32() is reasonable. Like in the source, we
can see 8 multiplications.
The code generated by gcc for foo64() is anything but reasonable. Somehow, the
compiler invented 5 more multiplications, for a total of 13 multiplications.

Maybe it deserves a separate bug report.
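The actual foo32/foo64 are behind the godbolt link above; a hypothetical pair
illustrating the two styles (here computing the high 64 bits of a 64x64
multiplication) might look like:

#include <stdint.h>

/* 32-bit-target style: explicit 32x32->64 widening multiplies.  */
uint64_t mulh32 (uint64_t a, uint64_t b)
{
  uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
  uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);
  uint64_t lo = (uint64_t)a0 * b0;
  uint64_t t1 = (uint64_t)a1 * b0 + (lo >> 32);
  uint64_t t2 = (uint64_t)a0 * b1 + (uint32_t)t1;
  return (uint64_t)a1 * b1 + (t1 >> 32) + (t2 >> 32);
}

/* Shared-source style: plain 64-bit arithmetic, which a 32-bit code
   generator has to expand itself, often less efficiently.  */
uint64_t mulh64 (uint64_t a, uint64_t b)
{
  uint64_t a0 = a & 0xffffffffu, a1 = a >> 32;
  uint64_t b0 = b & 0xffffffffu, b1 = b >> 32;
  uint64_t t1 = a1 * b0 + ((a0 * b0) >> 32);
  uint64_t t2 = a0 * b1 + (t1 & 0xffffffffu);
  return a1 * b1 + (t1 >> 32) + (t2 >> 32);
}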

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (21 preceding siblings ...)
  2023-01-18 22:28 ` already5chosen at yahoo dot com
@ 2023-01-18 23:31 ` already5chosen at yahoo dot com
  2023-02-10 13:38 ` already5chosen at yahoo dot com
  23 siblings, 0 replies; 25+ messages in thread
From: already5chosen at yahoo dot com @ 2023-01-18 23:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #23 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Jakub Jelinek from comment #19)
> So, if stmxcsr/vstmxcsr is too slow, perhaps we should change x86
> sfp-machine.h
> #define FP_INIT_ROUNDMODE                                       \
>   do {                                                          \
>     __asm__ __volatile__ ("%vstmxcsr\t%0" : "=m" (_fcw));       \
>   } while (0)
> #endif
> to do that only if not round-to-nearest.
> E.g. the fast-float library uses since last November:
>   static volatile float fmin = __FLT_MIN__;
>   float fmini = fmin; // we copy it so that it gets loaded at most once.
>   return (fmini + 1.0f == 1.0f - fmini);
> as a quick test whether round-to-nearest is in effect or some other rounding
> mode.
> Most likely if this is done with -frounding-math it wouldn't even need the
> volatile stuff.  Of course, if it isn't round-to-nearest, we would need to
> do the expensive {,v}stmxcsr.

I agree with Wilco. This trick is problematic due to its effect on the inexact
flag.
Also, I don't quite understand how you got to setting the rounding mode.
I don't need to set the rounding mode, I just need to read the current
rounding mode.

Doing it in a portable way, i.e. by fegetround(), is slow, mostly due to
various overheads.
Doing it in a non-portable way on x86-64 (by _MM_GET_ROUNDING_MODE()) is not
slow on Intel, but still pretty slow on AMD Zen3, although even on Zen3 it is
much faster than fegetround().
Results of the measurements are here:
https://github.com/already5chosen/extfloat/blob/master/binary128/reports/rm-impact.txt
Anyway, I'd very much prefer a portable solution over a multitude of ifdefs.
It is a pity that gcc doesn't implement FLT_ROUNDS like other compilers do.
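The two ways of reading the mode being compared (both are existing APIs):

#include <fenv.h>        /* fegetround, FE_TONEAREST, ... */
#include <xmmintrin.h>   /* _MM_GET_ROUNDING_MODE, x86-only */

/* Portable: returns an FE_* value; typically a library call.  */
int rm_portable (void) { return fegetround (); }

/* x86-specific: reads MXCSR directly, returns an _MM_ROUND_* value.  */
unsigned rm_x86 (void) { return _MM_GET_ROUNDING_MODE (); }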

But, then again, it is a pity that gcc doesn't implement a few other things
implemented by other compilers that could make the life of developers of
portable multiprecision routines in general, and of soft float in particular,
so much easier.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [Bug libgcc/108279] Improved speed for float128 routines
  2023-01-03 20:55 [Bug libgcc/108279] New: Improved speed for float128 routines tkoenig at gcc dot gnu.org
                   ` (22 preceding siblings ...)
  2023-01-18 23:31 ` already5chosen at yahoo dot com
@ 2023-02-10 13:38 ` already5chosen at yahoo dot com
  23 siblings, 0 replies; 25+ messages in thread
From: already5chosen at yahoo dot com @ 2023-02-10 13:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #24 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Michael_S from comment #22)
> (In reply to Michael_S from comment #8)
> > (In reply to Thomas Koenig from comment #6)
> > > And there will have to be a decision about 32-bit targets.
> > >
> > 
> > IMHO, 32-bit targets should be left in their current state.
> > People that use them probably do not care deeply about performance.
> > Technically, I can implement 32-bit targets in the same sources, by means
> > of a few ifdefs and macros, but the resulting source code will look much
> > uglier than it looks today. Still, not to the same level of horror that
> > you have in matmul_r16.c, but certainly uglier than how I like it to look.
> > And I am not at all sure that my implementation for 32-bit targets would
> > be significantly faster than the current soft float.
> 
> I explored this path (implementing 32-bit and 64-bit targets from the same
> source with a few ifdefs) a little more:
> Now I am even more sure that it is not the way to go. The gcc compiler does
> not generate good 32-bit code for this style of source. This especially
> applies to i386; the other supported 32-bit targets (RV32, SPARC32) are
> affected less.
> 

I can't explain to myself why I am doing it, but I did continue the
exploration of 32-bit targets. Well, not quite "targets"; I don't have SPARC32
or RV32 to play with. So, I continued the exploration of i386.
As said above, using the same code for 32-bit and 64-bit does not produce
acceptable results. But a pure 32-bit source did better than I expected.
So when I wrote on 2023-01-13 that "I am not at all sure that my
implementation for 32-bit targets would be significantly faster than the
current soft float", I was wrong. My implementation for 32-bit targets (i.e.
i386) is significantly faster than the current soft float: up to 3 times
faster on Zen3, and approximately 2 times faster on various oldish Intel CPUs.
Today I put the 32-bit sources into my github repository.

I am still convinced that improving the performance of IEEE binary128 on
32-bit targets is a waste of time, but since the time is already wasted, maybe
the results can be used.

And maybe it can be used to bring IEEE binary128 to the Arm Cortex-M, where it
could be moderately useful in some situations.

^ permalink raw reply	[flat|nested] 25+ messages in thread
