[Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ?

public inbox for gcc-bugs@sourceware.org

From: "grasland at lal dot in2p3.fr" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ?
Date: Mon, 06 Apr 2020 09:35:30 +0000
Message-ID: <bug-94497-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497

            Bug ID: 94497
           Summary: Branchless clamp in the general case gets a branch in
                    a particular case ?
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: grasland at lal dot in2p3.fr
  Target Milestone: ---

(Triage note: I think this is probably a compiler middle-end or back-end issue,
but I am not knowledgeable enough about the structure of the GCC codebase to
pick the right component.)

---

I am trying to make a floating-point computation autovectorization-friendly,
without mandating the use of -ffast-math for optimal performance as that is a
numerical stability and compiler portability hazard. This turned out to be an
interesting exercise in IEEE-754 pedantry, of course, but I can live with that.

However, while trying to optimize a "clamp" computation, I ended up at a point
where the behavior of the GCC optimizer just does not make sense to me and I
could use the opinion of an expert.

Consider the following functions:

```
double fast_min(double x, double y) {
    return (x < y) ? x : y;
}

double fast_max(double x, double y) {
    return (x > y) ? x : y;
}
```

The definitions of fast_min and fast_max are carefully crafted to match the
semantics of x86's min and max instruction family, and indeed if I compile this
code with -O1 or above I get minsd/maxsd or vminsd/vmaxsd instructions
depending on which vector instruction sets are enabled.
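
For readers who want to see what "matching the semantics" means in practice, here is a minimal sketch. It assumes the documented behavior of minsd/maxsd, which return their second source operand when either input is NaN or when both inputs are zeros, and checks that the ternary forms do the same at the C level. The helper same_value is hypothetical and not part of the original code; it only exists so that NaNs and signed zeros can be compared.

```c
#include <math.h>     /* NAN, isnan, signbit */
#include <stdbool.h>

/* Same definitions as in the report. */
static double fast_min(double x, double y) {
    return (x < y) ? x : y;
}

static double fast_max(double x, double y) {
    return (x > y) ? x : y;
}

/* Hypothetical helper (not in the report): true if a and b are the same
 * value, treating all NaNs as equal and telling +0.0 apart from -0.0. */
static bool same_value(double a, double b) {
    if (isnan(a) || isnan(b))
        return isnan(a) && isnan(b);
    return a == b && !signbit(a) == !signbit(b);
}
```

With these definitions, any comparison involving NaN is false, so the second argument is returned: fast_min(NAN, 1.0) gives 1.0 while fast_min(1.0, NAN) gives NaN. Likewise, 0.0 and -0.0 compare equal, so fast_min(0.0, -0.0) gives -0.0 and fast_min(-0.0, 0.0) gives +0.0. The argument order therefore matters, which is why the definitions cannot simply be swapped around.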

This is exactly what I wanted, so I am happy so far. And if I now try to use
these min and max functions to write a clamp function...

```
double fast_clamp(double x, double min, double max) {
    return fast_max(fast_min(x, max), min);
}
```

...again, at -O1 optimization level and above, I get a minsd/maxsd pair, short
and sweet:

```
fast_clamp(double, double, double):
        minsd   xmm0, xmm2
        maxsd   xmm0, xmm1
        ret
```

This perfect picture becomes tainted, however, as soon as I try to _use_ this
function with certain min/max arguments.

```
double use_fast_clamp(double x) {
    return fast_clamp(x, 0.0, 1.0);
}
```

All of a sudden, the assembly becomes branchy and terrible-looking, even at
-O3!

```
use_fast_clamp(double):
        movapd  xmm1, xmm0
        movsd   xmm0, QWORD PTR .LC0[rip]
        comisd  xmm0, xmm1
        jbe     .L13
        maxsd   xmm1, QWORD PTR .LC1[rip]
        movapd  xmm0, xmm1
.L13:
        ret
.LC0:
        .long   0
        .long   1072693248
.LC1:
        .long   0
        .long   0
```

I can make the generated code go back to a minsd/maxsd pair if I enable
-ffast-math (more precisely -ffinite-math-only -funsafe-math-optimizations),
but to the best of my knowledge, I shouldn't need fast-math flags here.

Further, even if I did forget about an IEEE-754 oddity that requires fast-math
flags, it would still mean that the above compilation of the general fast_clamp
function is incorrect: if this compilation output should work for any pair of
"min" and "max" double-precision arguments, then it trivially should work when
the min is 0.0 and max is 1.0. So one way or another, I think the GCC optimizer
is doing something strange here.
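
One way to check this argument empirically is to compare the constant-bounds call against the same call with bounds the compiler cannot see through. The following is only a sketch under stated assumptions: clamp_opaque_bounds, same_value and count_mismatches are hypothetical helpers that are not in the original code, and the volatile reads are just one way to keep the bounds from being constant-propagated. The input list is a small hand-picked sample, not an exhaustive test.

```c
#include <math.h>     /* NAN, INFINITY, isnan, signbit */
#include <stdbool.h>
#include <stddef.h>   /* size_t */

static double fast_min(double x, double y) { return (x < y) ? x : y; }
static double fast_max(double x, double y) { return (x > y) ? x : y; }

static double fast_clamp(double x, double min, double max) {
    return fast_max(fast_min(x, max), min);
}

/* The particular case from the report: bounds are compile-time constants. */
static double use_fast_clamp(double x) {
    return fast_clamp(x, 0.0, 1.0);
}

/* Hypothetical helper: same bounds, but read through volatile objects so
 * the optimizer has to treat them like the general case. */
static double clamp_opaque_bounds(double x) {
    volatile double lo = 0.0, hi = 1.0;
    return fast_clamp(x, lo, hi);
}

/* Hypothetical helper: equality that treats all NaNs as equal and
 * distinguishes +0.0 from -0.0. */
static bool same_value(double a, double b) {
    if (isnan(a) || isnan(b))
        return isnan(a) && isnan(b);
    return a == b && !signbit(a) == !signbit(b);
}

/* Counts the sample inputs on which the two versions disagree. */
static int count_mismatches(void) {
    const double inputs[] = {
        -INFINITY, -1.0, -0.0, 0.0, 0.5, 1.0, 2.0, INFINITY, NAN
    };
    int mismatches = 0;
    for (size_t i = 0; i < sizeof inputs / sizeof inputs[0]; ++i) {
        if (!same_value(use_fast_clamp(inputs[i]),
                        clamp_opaque_bounds(inputs[i])))
            ++mismatches;
    }
    return mismatches;
}
```

By the C-level definitions, both versions should agree on every input, including the edge cases: a NaN input clamps to 1.0 and -0.0 clamps to +0.0. If that holds, the branchy output and the minsd/maxsd pair compute the same values, and the difference is one of code generation rather than of semantics.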

---

This is the most minimal example of this behavior that I managed to come up
with. Using only the fast_min or fast_max functions in isolation behaves as
expected and compiles to a single minsd or maxsd:

```
double use_fast_min(double x) {
    return fast_min(x, 1.0);
}

double use_fast_max(double x) {
    return fast_max(x, 0.0);
}
```

I observed similar behavior on any GCC build I could get my hands on, all the
way from the most recent GCC trunk build currently available on godbolt (10.0.1
20200405) to the most ancient build provided by godbolt (4.1.2).

Both my local system and the godbolt builds are Linux-based.

My local GCC build was configured with  ../configure --prefix=/usr
--infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64
--libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,ada,go,d
--enable-offload-targets=hsa,nvptx-none=/usr/nvptx-none, --without-cuda-driver
--disable-werror --with-gxx-include-dir=/usr/include/c++/9 --enable-ssp
--disable-libssp --disable-libvtv --disable-cet --disable-libcc1
--enable-plugin --with-bugurl=https://bugs.opensuse.org/
--with-pkgversion='SUSE Linux' --with-slibdir=/lib64 --with-system-zlib
--enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-libphobos
--enable-version-specific-runtime-libs --with-gcc-major-version-only
--enable-linker-build-id --enable-linux-futex --enable-gnu-indirect-function
--program-suffix=-9 --without-system-libunwind --enable-multilib
--with-arch-32=x86-64 --with-tune=generic
--with-build-config=bootstrap-lto-lean --enable-link-mutex
--build=x86_64-suse-linux --host=x86_64-suse-linux

As for godbolt builds, it is easy to go to godbolt.org and add a -v to the
compiler options of the build you're interested in, so I will invite you to do
that instead of cluttering this already long bug report further.

---

FWIW, clang 10 behaves the way I would expect without fast-math flags (and also
generates the zero in place with a xorpd instead of loading it from memory,
which is kind of cool), but I'm well aware of the danger of comparing the
floating-point behavior of various compiler optimizers. So I wouldn't read too
much into that:

```
.LCPI5_0:
        .quad   4607182418800017408     # double 1
use_fast_clamp(double):                    # @use_fast_clamp(double)
        minsd   xmm0, qword ptr [rip + .LCPI5_0]
        xorpd   xmm1, xmm1
        maxsd   xmm0, xmm1
        ret
```

If you like to experiment on godbolt too, here's my setup:
https://godbolt.org/z/eD-guY .

Thread overview: 9+ messages
2020-04-06  9:35 grasland at lal dot in2p3.fr [this message]
2020-04-06  9:39 ` [Bug middle-end/94497] " pinskia at gcc dot gnu.org
2020-04-06 13:33 ` rguenth at gcc dot gnu.org
2020-04-06 13:34 ` rguenth at gcc dot gnu.org
2020-04-06 13:37 ` rguenth at gcc dot gnu.org
2020-04-06 14:04 ` grasland at lal dot in2p3.fr
2020-04-06 16:31 ` rguenth at gcc dot gnu.org
2021-08-08 23:59 ` pinskia at gcc dot gnu.org
2023-07-20  8:38 ` rguenth at gcc dot gnu.org
