[Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ?

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ?
@ 2020-04-06  9:35 grasland at lal dot in2p3.fr
  2020-04-06  9:39 ` [Bug middle-end/94497] " pinskia at gcc dot gnu.org
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: grasland at lal dot in2p3.fr @ 2020-04-06  9:35 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497

            Bug ID: 94497
           Summary: Branchless clamp in the general case gets a branch in
                    a particular case ?
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: grasland at lal dot in2p3.fr
  Target Milestone: ---

(Triage note: I think this is probably a compiler middle-end or back-end issue,
but I am not knowledgeable enough about the structure of the GCC codebase to
pick the right component.)

---

I am trying to make a floating-point computation autovectorization-friendly,
without mandating the use of -ffast-math for optimal performance as that is a
numerical stability and compiler portability hazard. This turned out to be an
interesting exercise in IEEE-754 pedantry, of course, but I can live with that.

However, while trying to optimize a "clamp" computation, I ended up at a point
where the behavior of the GCC optimizer just does not make sense to me and I
could use the opinion of an expert.

Consider the following functions:

```
double fast_min(double x, double y) {
    return (x < y) ? x : y;
}

double fast_max(double x, double y) {
    return (x > y) ? x : y;
}
```

The definitions of fast_min and fast_max are carefully crafted to match the
semantics of x86's min and max instruction family, and indeed if I compile this
code with -O1 or above I get minsd/maxsd or vminsd/vmaxsd instructions
depending on which vector instruction sets are enabled.

This is exactly what I wanted, so far I'm happy. And if I now try to use these
min and max functions to write a clamp function...

```
double fast_clamp(double x, double min, double max) {
    return fast_max(fast_min(x, max), min);
}
```

...again, at -O1 optimization level and above, I get a minsd/maxsd pair, short
and sweet:

```
fast_clamp(double, double, double):
        minsd   xmm0, xmm2
        maxsd   xmm0, xmm1
        ret
```

Where this perfect picture becomes tainted, however, is as soon as I try to
_use_ this function with certain min/max arguments.

```
double use_fast_clamp(double x) {
    return fast_clamp(x, 0.0, 1.0);
}
```

All of a sudden, the assembly becomes branchy and terrible-looking, even in -O3
mode!

```
use_fast_clamp(double):
        movapd  xmm1, xmm0
        movsd   xmm0, QWORD PTR .LC0[rip]
        comisd  xmm0, xmm1
        jbe     .L13
        maxsd   xmm1, QWORD PTR .LC1[rip]
        movapd  xmm0, xmm1
.L13:
        ret
.LC0:
        .long   0
        .long   1072693248
.LC1:
        .long   0
        .long   0
```

I can make the generated code go back to a minsd/maxsd pair if I enable
-ffast-math (more precisely -ffinite-math-only -funsafe-math-optimizations),
but to the best of my knowledge, I shouldn't need fast-math flags here.

Further, even if I did forget about an IEEE-754 oddity that requires fast-math
flags, it would still mean that the above compilation of the general fast_clamp
function is incorrect: if this compilation output should work for any pair of
"min" and "max" double-precision arguments, then it trivially should work when
the min is 0.0 and max is 1.0. So one way or another, I think the GCC optimizer
is doing something strange here.

---

This is the most minimal example of this behavior that I managed to come up
with. Using only the fast_min or fast_math functions in isolation will behave
as expected and codegen into a single minsd or maxsd:

```
double use_fast_min(double x) {
    return fast_min(x, 1.0);
}

double use_fast_max(double x) {
    return fast_max(x, 0.0);
}
```

I observed similar behavior on any GCC build I could get my hands on, all the
way from the most recent GCC trunk build currently available on godbolt (10.0.1
20200405) to the most ancient build provided by godbolt (4.1.2).

Both my local system and godbolt run are Linux-based.

My local GCC build was configured with  ../configure --prefix=/usr
--infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64
--libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,ada,go,d
--enable-offload-targets=hsa,nvptx-none=/usr/nvptx-none, --without-cuda-driver
--disable-werror --with-gxx-include-dir=/usr/include/c++/9 --enable-ssp
--disable-libssp --disable-libvtv --disable-cet --disable-libcc1
--enable-plugin --with-bugurl=https://bugs.opensuse.org/
--with-pkgversion='SUSE Linux' --with-slibdir=/lib64 --with-system-zlib
--enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-libphobos
--enable-version-specific-runtime-libs --with-gcc-major-version-only
--enable-linker-build-id --enable-linux-futex --enable-gnu-indirect-function
--program-suffix=-9 --without-system-libunwind --enable-multilib
--with-arch-32=x86-64 --with-tune=generic
--with-build-config=bootstrap-lto-lean --enable-link-mutex
--build=x86_64-suse-linux --host=x86_64-suse-linux

As for godbolt builds, it is easy to go to godbolt.org and add a -v to the
compiler options of the build you're interested in, so I will invite you to do
that instead of cluttering this already long bug report further.

---

FWIW, clang 10 behaves the way I would expect without fast-math flags (and also
generates the zero in place with a xorpd instead of loading it from memory,
which is kind of cool), but I'm well aware of the danger of comparing the
floating-point behavior of various compiler optimizers. So I wouldn't read too
much into that:

```
.LCPI5_0:
        .quad   4607182418800017408     # double 1
use_fast_clamp(double):                    # @use_fast_clamp(double)
        minsd   xmm0, qword ptr [rip + .LCPI5_0]
        xorpd   xmm1, xmm1
        maxsd   xmm0, xmm1
        ret
```

If you like to experiment on godbolt too, here's my setup:
https://godbolt.org/z/eD-guY .

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/94497] Branchless clamp in the general case gets a branch in a particular case ?
  2020-04-06  9:35 [Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ? grasland at lal dot in2p3.fr
@ 2020-04-06  9:39 ` pinskia at gcc dot gnu.org
  2020-04-06 13:33 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: pinskia at gcc dot gnu.org @ 2020-04-06  9:39 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
So there might be some jump threading going on which is causing this issue. 
There might already be a bug about that case too.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/94497] Branchless clamp in the general case gets a branch in a particular case ?
  2020-04-06  9:35 [Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ? grasland at lal dot in2p3.fr
  2020-04-06  9:39 ` [Bug middle-end/94497] " pinskia at gcc dot gnu.org
@ 2020-04-06 13:33 ` rguenth at gcc dot gnu.org
  2020-04-06 13:34 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-04-06 13:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2020-04-06
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is that our GIMPLE level if-conversion does not do the min/max
replacement because our MIN/MAX_EXPR IL elements are not IEEE conforming
with respect to NaN behavior.  There _is_ now FMIN/FMAX internal functions
which targets can support (not sure if x86 does support that named patterns)
which if-conversion could use (but it doesn't).  There's a bug about exactly
that I think.  PR88540 might be a close fit.

I do have a not working patch to address this, queued for GCC 11.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/94497] Branchless clamp in the general case gets a branch in a particular case ?
  2020-04-06  9:35 [Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ? grasland at lal dot in2p3.fr
  2020-04-06  9:39 ` [Bug middle-end/94497] " pinskia at gcc dot gnu.org
  2020-04-06 13:33 ` rguenth at gcc dot gnu.org
@ 2020-04-06 13:34 ` rguenth at gcc dot gnu.org
  2020-04-06 13:37 ` rguenth at gcc dot gnu.org
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-04-06 13:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 48213
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48213&action=edit
incomplete patch

In case anybody is interested to complete it ...

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/94497] Branchless clamp in the general case gets a branch in a particular case ?
  2020-04-06  9:35 [Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ? grasland at lal dot in2p3.fr
                   ` (2 preceding siblings ...)
  2020-04-06 13:34 ` rguenth at gcc dot gnu.org
@ 2020-04-06 13:37 ` rguenth at gcc dot gnu.org
  2020-04-06 14:04 ` grasland at lal dot in2p3.fr
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-04-06 13:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
As a workaround you can use -ffinite-math-only -fno-signed-zeros if that is
applicable to the rest of your application.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/94497] Branchless clamp in the general case gets a branch in a particular case ?
  2020-04-06  9:35 [Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ? grasland at lal dot in2p3.fr
                   ` (3 preceding siblings ...)
  2020-04-06 13:37 ` rguenth at gcc dot gnu.org
@ 2020-04-06 14:04 ` grasland at lal dot in2p3.fr
  2020-04-06 16:31 ` rguenth at gcc dot gnu.org
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: grasland at lal dot in2p3.fr @ 2020-04-06 14:04 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497

--- Comment #5 from Hadrien Grasland <grasland at lal dot in2p3.fr> ---
Thanks for the clarifications! We could probably live with -fno-signed-zeros,
but I think -ffinite-math-only would be too much as an application-wide flag,
as I've spotted several occurences of the "do the computation, then check if
the result is normal" pattern around...

I tried to experiment with #pragma optimize as a finer-grained alternative to a
global fast-math flag, which would probably be acceptable. But that seemed to
have the side-effect of preventing inlining of the functions to which the
attributes are applied into other functions that don't have it (I guess that
attribute is implemented by splitting the code into multiple compilation
units?), whereas in this part of the codebase we want all the inlining we can
get...

So since I don't have the skills and time to work on GCC myself, I guess I will
need to wait for now and see what you come up with.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/94497] Branchless clamp in the general case gets a branch in a particular case ?
  2020-04-06  9:35 [Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ? grasland at lal dot in2p3.fr
                   ` (4 preceding siblings ...)
  2020-04-06 14:04 ` grasland at lal dot in2p3.fr
@ 2020-04-06 16:31 ` rguenth at gcc dot gnu.org
  2021-08-08 23:59 ` pinskia at gcc dot gnu.org
  2023-07-20  8:38 ` rguenth at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-04-06 16:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-* i?86-*-*

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hadrien Grasland from comment #5)
> Thanks for the clarifications! We could probably live with
> -fno-signed-zeros, but I think -ffinite-math-only would be too much as an
> application-wide flag, as I've spotted several occurences of the "do the
> computation, then check if the result is normal" pattern around...
> 
> I tried to experiment with #pragma optimize as a finer-grained alternative
> to a global fast-math flag, which would probably be acceptable. But that
> seemed to have the side-effect of preventing inlining of the functions to
> which the attributes are applied into other functions that don't have it (I
> guess that attribute is implemented by splitting the code into multiple
> compilation units?), whereas in this part of the codebase we want all the
> inlining we can get...

Yeah, since the #pragma optimize works at a function level we cannot permit
inlining of functions with conflicting settings...

> So since I don't have the skills and time to work on GCC myself, I guess I
> will need to wait for now and see what you come up with.

There's also the possibility to use XMM intrinsics or simply fmin()/fmax()
(but the latter is only optimized for targets implementing fmin/fmax expanders
which unfortunately x86 doesn't).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/94497] Branchless clamp in the general case gets a branch in a particular case ?
  2020-04-06  9:35 [Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ? grasland at lal dot in2p3.fr
                   ` (5 preceding siblings ...)
  2020-04-06 16:31 ` rguenth at gcc dot gnu.org
@ 2021-08-08 23:59 ` pinskia at gcc dot gnu.org
  2023-07-20  8:38 ` rguenth at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: pinskia at gcc dot gnu.org @ 2021-08-08 23:59 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
On a related note, I Notice we are doing a load for the constant forming of 0
here.  That seems less than optimial.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/94497] Branchless clamp in the general case gets a branch in a particular case ?
  2020-04-06  9:35 [Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ? grasland at lal dot in2p3.fr
                   ` (6 preceding siblings ...)
  2021-08-08 23:59 ` pinskia at gcc dot gnu.org
@ 2023-07-20  8:38 ` rguenth at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2023-07-20  8:38 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |14.0
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
This is now fixed with the fix for PR61747 (this bug would have been a better
fit though).

Fixed in GCC 14.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-07-20  8:38 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-06  9:35 [Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ? grasland at lal dot in2p3.fr
2020-04-06  9:39 ` [Bug middle-end/94497] " pinskia at gcc dot gnu.org
2020-04-06 13:33 ` rguenth at gcc dot gnu.org
2020-04-06 13:34 ` rguenth at gcc dot gnu.org
2020-04-06 13:37 ` rguenth at gcc dot gnu.org
2020-04-06 14:04 ` grasland at lal dot in2p3.fr
2020-04-06 16:31 ` rguenth at gcc dot gnu.org
2021-08-08 23:59 ` pinskia at gcc dot gnu.org
2023-07-20  8:38 ` rguenth at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).