public inbox for gcc-bugs@sourceware.org
* [Bug target/103008] New: poor inlined builtin_fmod on x86_64
@ 2021-10-30 18:51 fx at gnu dot org
  2021-10-30 18:52 ` [Bug target/103008] " fx at gnu dot org
                   ` (16 more replies)
  0 siblings, 17 replies; 18+ messages in thread
From: fx at gnu dot org @ 2021-10-30 18:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

            Bug ID: 103008
           Summary: poor inlined builtin_fmod on x86_64
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fx at gnu dot org
  Target Milestone: ---
            Target: x86_64

Created attachment 51706
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51706&action=edit
ggl.f90

This is from looking at a Fortran benchmark set
<https://www.fortran.uk/fortran-compiler-comparisons/>, but presumably
isn't Fortran-specific.

One of the cases in that set (ac.f90) gets bottlenecked on a random
number routine (which may be rubbish, but it's there).  It uses DMOD,
which gets compiled to __builtin_fmod according to the tree dump, and
is inlined.  However, the benchmark performance is still 50% worse
with gfortran than Intel ifort, and if I replace DMOD with its
definition, gfortran is much closer to ifort.

I'll attach files ggl.f90, the original, and gglx.f90 which avoids the
call to the intrinsic, along with assembler from each.  The assembler
is from GCC 11.2.0, run (on SKX) as

  gfortran -Ofast -march=native

(I note that the generated fmod isn't inlined with -O3, which looks to
me like a Fortran miss that I should report.)

I only take benchmarks seriously as a means of understanding the results
but, at least with PDO, GCC is pretty much on a par with ifort on the
bottom line of that set, despite #40770 and another poor case. :-)


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
@ 2021-10-30 18:52 ` fx at gnu dot org
  2021-10-30 18:55 ` fx at gnu dot org
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: fx at gnu dot org @ 2021-10-30 18:52 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #1 from Dave Love <fx at gnu dot org> ---
Created attachment 51707
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51707&action=edit
gglx.f90


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
  2021-10-30 18:52 ` [Bug target/103008] " fx at gnu dot org
@ 2021-10-30 18:55 ` fx at gnu dot org
  2021-10-30 18:56 ` fx at gnu dot org
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: fx at gnu dot org @ 2021-10-30 18:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #2 from Dave Love <fx at gnu dot org> ---
Created attachment 51708
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51708&action=edit
ggl.s extract


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
  2021-10-30 18:52 ` [Bug target/103008] " fx at gnu dot org
  2021-10-30 18:55 ` fx at gnu dot org
@ 2021-10-30 18:56 ` fx at gnu dot org
  2021-10-30 20:15 ` fx at gnu dot org
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: fx at gnu dot org @ 2021-10-30 18:56 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #3 from Dave Love <fx at gnu dot org> ---
Created attachment 51709
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51709&action=edit
gglx.s extract


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
                   ` (2 preceding siblings ...)
  2021-10-30 18:56 ` fx at gnu dot org
@ 2021-10-30 20:15 ` fx at gnu dot org
  2021-10-30 20:39 ` anlauf at gcc dot gnu.org
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: fx at gnu dot org @ 2021-10-30 20:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #4 from Dave Love <fx at gnu dot org> ---
On further consideration, perhaps this is just a Fortran issue.  I thought
-ffast-math should turn off all the relevant checks to allow reducing mod to
the arithmetic expression, but it probably doesn't.  Also, MAQAO complained
about x87 instructions being generated, but I'm not sure about that either if
it's just for status.  Apologies if this is invalid, and correction welcome.


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
                   ` (3 preceding siblings ...)
  2021-10-30 20:15 ` fx at gnu dot org
@ 2021-10-30 20:39 ` anlauf at gcc dot gnu.org
  2021-10-31 20:05 ` ubizjak at gmail dot com
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: anlauf at gcc dot gnu.org @ 2021-10-30 20:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #5 from anlauf at gcc dot gnu.org ---
There's a mixture of single and double precision in the testcase variants.
I haven't checked thoroughly enough if both variants are really equivalent.

Do you see the issue if you have only single or only double precision?


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
                   ` (4 preceding siblings ...)
  2021-10-30 20:39 ` anlauf at gcc dot gnu.org
@ 2021-10-31 20:05 ` ubizjak at gmail dot com
  2021-11-01  8:23 ` ubizjak at gmail dot com
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: ubizjak at gmail dot com @ 2021-10-31 20:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

Uroš Bizjak <ubizjak at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-10-31
     Ever confirmed|0                           |1

--- Comment #6 from Uroš Bizjak <ubizjak at gmail dot com> ---
Please see PR29852. The only way to avoid the libcall in 2006 was to inline the
x87 version also for SSE math. Nowadays a (generic) arithmetic sequence can be
provided and x87 fmod builtins can be enabled for x87 math only with:

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index e733a40fc90..ed818232ae7 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -17378,6 +17378,8 @@
    (use (match_operand:MODEF 1 "general_operand"))
    (use (match_operand:MODEF 2 "general_operand"))]
   "TARGET_USE_FANCY_MATH_387
+   && (!(SSE_FLOAT_MODE_P (<MODE>mode) && TARGET_SSE_MATH)
+       || TARGET_MIX_SSE_I387)
    && flag_finite_math_only"
 {
   rtx (*gen_truncxf) (rtx, rtx);


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
                   ` (5 preceding siblings ...)
  2021-10-31 20:05 ` ubizjak at gmail dot com
@ 2021-11-01  8:23 ` ubizjak at gmail dot com
  2022-02-10 14:18 ` rguenth at gcc dot gnu.org
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: ubizjak at gmail dot com @ 2021-11-01  8:23 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #7 from Uroš Bizjak <ubizjak at gmail dot com> ---
IMO, inlined fmod (and drem) should eventually be expanded in a generic way in
the middle-end as:

fmod (a, p) = a - trunc (a/p) * p
drem (a, p) = a - roundeven (a/p) * p

so division can be later simplified to multiplication with reciprocal, either
constant or implemented with rcp instruction.


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
                   ` (6 preceding siblings ...)
  2021-11-01  8:23 ` ubizjak at gmail dot com
@ 2022-02-10 14:18 ` rguenth at gcc dot gnu.org
  2022-02-10 14:22 ` rguenth at gcc dot gnu.org
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-02-10 14:18 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
*** Bug 104485 has been marked as a duplicate of this bug. ***


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
                   ` (7 preceding siblings ...)
  2022-02-10 14:18 ` rguenth at gcc dot gnu.org
@ 2022-02-10 14:22 ` rguenth at gcc dot gnu.org
  2022-02-10 18:09 ` ubizjak at gmail dot com
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-02-10 14:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jsm28 at gcc dot gnu.org

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
The PR104485 case in 526.blender_r has a special case of fmod (x, 1.0) which
could be simplified further as fmod (a, 1.) = a - trunc (a) if side-cases
with NaNs, Infs, signed zeros, etc. allow.  Btw, trunc rounds toward zero
rather than to the nearest integer, which is what fmod needs since its
result takes the sign of a; floor gives the same value only when a is
positive.


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
                   ` (8 preceding siblings ...)
  2022-02-10 14:22 ` rguenth at gcc dot gnu.org
@ 2022-02-10 18:09 ` ubizjak at gmail dot com
  2022-02-10 20:50 ` joseph at codesourcery dot com
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: ubizjak at gmail dot com @ 2022-02-10 18:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #10 from Uroš Bizjak <ubizjak at gmail dot com> ---
FYI, the following testcase:

--cut here--
#include <math.h>

float
__attribute__((noinline))
_fmodf (float x, float y)
{
  return x - truncf (x/y) * y;
}

int
main ()
{

  float a, b;
  volatile float z;

  for (a = -1000.0f; a < 1000.0f; a += 0.01f)
    for (b = -1000.0f; b < 1000.0f; b += 0.1f)
      z = fmodf (a, b);

  return 0;
}
--cut here--

$ gcc -Ofast -lm fmod-bench.c

      22,127092116 seconds time elapsed

      22,125111000 seconds user
       0,000999000 seconds sys


$ gcc -Ofast -fno-builtin-fmodf -lm fmod-bench.c

      32,751589079 seconds time elapsed

      32,746156000 seconds user
       0,000999000 seconds sys


This indicates that the x87 code is considerably faster on my target
(Ivybridge-E) on Fedora-34 with glibc-2.33.

For reference, when the above _fmodf is called, I get:

$ gcc -Ofast -lm fmod-bench.c

      10,706189749 seconds time elapsed

      10,704859000 seconds user
       0,000999000 seconds sys

$ gcc -Ofast -lm -msse4 fmod-bench.c

      11,391062747 seconds time elapsed

      11,390771000 seconds user
       0,000000000 seconds sys

So, considerably faster!

It looks like, with -ffast-math, it is not the inlined x87 code that is
problematic but the missing fmod transformation. As shown above, the SSE2 code
for truncf is on par with the SSE4 roundss instruction, so if the target can
provide optimized truncf code, fmodf should definitely be converted to
"a - trunc (a/p) * p".


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
                   ` (9 preceding siblings ...)
  2022-02-10 18:09 ` ubizjak at gmail dot com
@ 2022-02-10 20:50 ` joseph at codesourcery dot com
  2022-02-11  7:59 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: joseph at codesourcery dot com @ 2022-02-10 20:50 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #11 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
An implementation using division like that definitely isn't valid without 
-funsafe-math-optimizations (it gives nonsense results when the exponent 
difference between the arguments is too large, inaccurate results whenever 
the multiplication is inexact unless you use fma, and spurious exceptions 
for operations that are never supposed to raise "inexact" on any 
arguments).


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
                   ` (10 preceding siblings ...)
  2022-02-10 20:50 ` joseph at codesourcery dot com
@ 2022-02-11  7:59 ` rguenth at gcc dot gnu.org
  2022-02-11 15:47 ` ubizjak at gmail dot com
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-02-11  7:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
Just as a data point: on znver2, Uroš's testcase shows

rguenther@ryzen:/tmp> gcc-11 t.c -Ofast -lm -march=znver2
rguenther@ryzen:/tmp> numactl --physcpubind=3 /usr/bin/time ./a.out 
19.18user 0.00system 0:19.18elapsed 99%CPU (0avgtext+0avgdata 1528maxresident)k
0inputs+0outputs (0major+76minor)pagefaults 0swaps
rguenther@ryzen:/tmp> gcc-11 t.c -Ofast -lm -march=znver2 -fno-builtin-fmod
rguenther@ryzen:/tmp> numactl --physcpubind=3 /usr/bin/time ./a.out 
19.26user 0.00system 0:19.26elapsed 99%CPU (0avgtext+0avgdata 1528maxresident)k
0inputs+0outputs (0major+76minor)pagefaults 0swaps
rguenther@ryzen:/tmp> gcc-11 t.c -Ofast -lm -march=znver2 -Dfmodf=_fmodf   
rguenther@ryzen:/tmp> numactl --physcpubind=3 /usr/bin/time ./a.out 
4.40user 0.00system 0:04.40elapsed 100%CPU (0avgtext+0avgdata 1528maxresident)k
0inputs+0outputs (0major+76minor)pagefaults 0swaps

that's with glibc 2.31.  So the _fmodf variant is very much faster.  But
as Joseph says a general expansion like that is probably a bad idea.

The specific case of blender using doubles and fmod (x, 1.) shows that
glibc is very much slower than x87 in the test below on znver2 but the
proposed inline is very very much faster.

Note that using modf(x, &tem) is more than three times as fast as
using fmod (x, 1.) with glibc 2.31.  While we have an optab for fmod
we don't have one for modf (which has an unfortunate pointer output API).
I'm not sure whether fmod (x, 1.) == modf (x, &tem).

#include <math.h>

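double
__attribute__((noinline))
/* The second argument is ignored; it is assumed to be 1.0.  It is named
   because omitting a parameter name is an error in C before C23.  */
_fmod (double x, double y __attribute__((unused)))
{
  return x - trunc (x);
}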

int
main ()
{

  double a, b;
  volatile double z;

  for (a = -1000.0; a < 1000.0; a += 0.01)
    for (b = -1000.0; b < 1000.0; b += 0.1)
      {
        volatile double tem = a;
        z = fmod (tem, 1.);
      }

  return 0;
}

Note that replacing a call of fmod (x, 1.) with x - trunc (x) would
not be a simplification on GIMPLE, so should that possibly be done
by RTL expansion?  Replacing it with modf (x, &tem) would be OK
I think (unfortunately modf doesn't seem to accept a NULL arg).
Both functions are part of C99 / POSIX so replacing one with the
other should be generally OK.

Maybe there's a function that does not compute the integer part
as well.


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
                   ` (11 preceding siblings ...)
  2022-02-11  7:59 ` rguenth at gcc dot gnu.org
@ 2022-02-11 15:47 ` ubizjak at gmail dot com
  2022-02-12 22:07 ` ubizjak at gmail dot com
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: ubizjak at gmail dot com @ 2022-02-11 15:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #13 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Richard Biener from comment #12)
> Just as data-point on znver2 Uros testcase shows
> 
> rguenther@ryzen:/tmp> gcc-11 t.c -Ofast -lm -march=znver2
> rguenther@ryzen:/tmp> numactl --physcpubind=3 /usr/bin/time ./a.out 
> 19.18user 0.00system 0:19.18elapsed 99%CPU (0avgtext+0avgdata
> 1528maxresident)k
> 0inputs+0outputs (0major+76minor)pagefaults 0swaps
> rguenther@ryzen:/tmp> gcc-11 t.c -Ofast -lm -march=znver2 -fno-builtin-fmod

You should use -fno-builtin-fmodf in the above compile flags.


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
                   ` (12 preceding siblings ...)
  2022-02-11 15:47 ` ubizjak at gmail dot com
@ 2022-02-12 22:07 ` ubizjak at gmail dot com
  2022-02-13 21:00 ` Dave.Love at manchester dot ac.uk
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: ubizjak at gmail dot com @ 2022-02-12 22:07 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #14 from Uroš Bizjak <ubizjak at gmail dot com> ---
Created attachment 52428
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52428&action=edit
Proposed patch

The attached patch implements:

fmod (a, p) = a - trunc (a/p) * p
drem (a, p) = a - roundeven (a/p) * p

using SSE4 round instruction (and uses fnma when available).

Timings with Polyhedron ac.f90 on IvyBridge-E, Fedora-34, glibc 2.33-21.fc34

-Ofast:
       6,150082000 seconds user

-Ofast -mno-80387:
      18,354654000 seconds user

-Ofast -msse4:
       5,722511000 seconds user


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
                   ` (13 preceding siblings ...)
  2022-02-12 22:07 ` ubizjak at gmail dot com
@ 2022-02-13 21:00 ` Dave.Love at manchester dot ac.uk
  2022-02-14  7:12 ` rguenther at suse dot de
  2022-02-14  7:35 ` rguenth at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: Dave.Love at manchester dot ac.uk @ 2022-02-13 21:00 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #15 from Dave.Love at manchester dot ac.uk ---
"ubizjak at gmail dot com" <gcc-bugzilla@gcc.gnu.org> writes:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008
>
> --- Comment #14 from Uroš Bizjak <ubizjak at gmail dot com> ---
> Created attachment 52428
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52428&action=edit
> Proposed patch
>
> The attached patch implements:
>
> fmod (a, p) = a - trunc (a/p) * p
> drem (a, p) = a - roundeven (a/p) * p
>
> using SSE4 round instruction (and uses fnma when available).
>
> Timings with Polyhedron ac.f90 on IvyBridge-E, Fedora-34, glibc 2.33-21.fc34
>
> -Ofast:
>        6,150082000 seconds user
>
> -Ofast -mno-80387:
>       18,354654000 seconds user
>
> -Ofast -msse4:
>        5,722511000 seconds user

I realize I never made a Fortran bug report about this, and should do.
(I think the gfortran intrinsic should avoid fmod anyway, and just use
the standard's arithmetical definition of MOD without having to bother
about errors.)


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
                   ` (14 preceding siblings ...)
  2022-02-13 21:00 ` Dave.Love at manchester dot ac.uk
@ 2022-02-14  7:12 ` rguenther at suse dot de
  2022-02-14  7:35 ` rguenth at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: rguenther at suse dot de @ 2022-02-14  7:12 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #16 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 11 Feb 2022, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008
> 
> --- Comment #13 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to Richard Biener from comment #12)
> > Just as data-point on znver2 Uros testcase shows
> > 
> > rguenther@ryzen:/tmp> gcc-11 t.c -Ofast -lm -march=znver2
> > rguenther@ryzen:/tmp> numactl --physcpubind=3 /usr/bin/time ./a.out 
> > 19.18user 0.00system 0:19.18elapsed 99%CPU (0avgtext+0avgdata
> > 1528maxresident)k
> > 0inputs+0outputs (0major+76minor)pagefaults 0swaps
> > rguenther@ryzen:/tmp> gcc-11 t.c -Ofast -lm -march=znver2 -fno-builtin-fmod
> 
> You should use -fno-builtin-fmodf in the above compile flags.

Oops, yes.  Then the glibc version is

22.53user 0.00system 0:22.53elapsed 99%CPU (0avgtext+0avgdata 
1600maxresident)k
0inputs+0outputs (0major+77minor)pagefaults 0swaps

so indeed for float the x87 inline version is faster when benchmarked
this way.  For double it's

19.31user 0.00system 0:19.31elapsed 99%CPU (0avgtext+0avgdata 
1536maxresident)k
0inputs+0outputs (0major+76minor)pagefaults 0swaps

vs.

18.47user 0.00system 0:18.47elapsed 99%CPU (0avgtext+0avgdata 
1600maxresident)k
0inputs+0outputs (0major+77minor)pagefaults 0swaps

so glibc is a bit faster here while the x87 version is of course
similar.  Avoiding the libcall can of course avoid spilling SSE
regs around the call.

So what remains is really the special case in blender doing
fmod (x, 1.) which can eventually be optimized with SSE4.


* [Bug target/103008] poor inlined builtin_fmod on x86_64
  2021-10-30 18:51 [Bug target/103008] New: poor inlined builtin_fmod on x86_64 fx at gnu dot org
                   ` (15 preceding siblings ...)
  2022-02-14  7:12 ` rguenther at suse dot de
@ 2022-02-14  7:35 ` rguenth at gcc dot gnu.org
  16 siblings, 0 replies; 18+ messages in thread
From: rguenth at gcc dot gnu.org @ 2022-02-14  7:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103008

--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #14)
> Created attachment 52428 [details]
> Proposed patch
> 
> The attached patch implements:
> 
> fmod (a, p) = a - trunc (a/p) * p
> drem (a, p) = a - roundeven (a/p) * p
> 
> using SSE4 round instruction (and uses fnma when available).
> 
> Timings with Polyhedron ac.f90 on IvyBridge-E, Fedora-34, glibc 2.33-21.fc34
> 
> -Ofast:
>        6,150082000 seconds user
> 
> -Ofast -mno-80387:
>       18,354654000 seconds user
> 
> -Ofast -msse4:
>        5,722511000 seconds user

I fear this is a bit too much on the "unsafe" side.  Maybe we can
go this way for float but use double arithmetic for the fmod to avoid
the exponent issue?  For double, can we do some cheap range checking
and fall back to fmod() when not safe?

That said, can we have a flag like -mrecip to control this?
