From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 8BBD0385B831; Mon,  6 Apr 2020 09:35:30 +0000 (GMT)
From: "grasland at lal dot in2p3.fr" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug c/94497] New: Branchless clamp in the general case gets a
 branch in a particular case ?
Date: Mon, 06 Apr 2020 09:35:30 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: c
X-Bugzilla-Version: 10.0
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: grasland at lal dot in2p3.fr
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter target_milestone
Message-ID: <bug-94497-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497

            Bug ID: 94497
           Summary: Branchless clamp in the general case gets a branch in
                    a particular case ?
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: grasland at lal dot in2p3.fr
  Target Milestone: ---

(Triage note: I think this is probably a compiler middle-end or back-end issue,
but I am not knowledgeable enough about the structure of the GCC codebase to
pick the right component.)

---

I am trying to make a floating-point computation autovectorization-friendly,
without mandating the use of -ffast-math for optimal performance as that is a
numerical stability and compiler portability hazard. This turned out to be an
interesting exercise in IEEE-754 pedantry, of course, but I can live with that.

However, while trying to optimize a "clamp" computation, I ended up at a point
where the behavior of the GCC optimizer just does not make sense to me and I
could use the opinion of an expert.

Consider the following functions:

```
double fast_min(double x, double y) {
    return (x < y) ? x : y;
}

double fast_max(double x, double y) {
    return (x > y) ? x : y;
}
```

The definitions of fast_min and fast_max are carefully crafted to match the
semantics of x86's min and max instruction family, and indeed if I compile this
code with -O1 or above I get minsd/maxsd or vminsd/vmaxsd instructions
depending on which vector instruction sets are enabled.
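
To make "match the semantics" concrete, here is a minimal sketch of the corner
cases where the operand order matters. It assumes IEEE-754 doubles and a build
without fast-math flags; the helper name check_minmax_corner_cases is
hypothetical and only serves this illustration. Like minsd/maxsd, both
functions return their second operand when either input is NaN or when the
inputs are zeros of opposite sign.

```c
#include <assert.h>
#include <math.h>

double fast_min(double x, double y) {
    return (x < y) ? x : y;
}

double fast_max(double x, double y) {
    return (x > y) ? x : y;
}

/* Hypothetical helper: aborts via assert() if a corner case misbehaves. */
void check_minmax_corner_cases(void) {
    /* Any ordered comparison with NaN is false, so the second operand wins. */
    assert(fast_min(NAN, 1.0) == 1.0);
    assert(isnan(fast_min(1.0, NAN)));
    assert(fast_max(NAN, 0.0) == 0.0);
    assert(isnan(fast_max(0.0, NAN)));

    /* +0.0 and -0.0 compare equal, so again the second operand wins. */
    assert(signbit(fast_min(0.0, -0.0)));
    assert(!signbit(fast_min(-0.0, 0.0)));
    assert(signbit(fast_max(0.0, -0.0)));
    assert(!signbit(fast_max(-0.0, 0.0)));
}
```

Calling check_minmax_corner_cases() from a main function should pass silently
on a conforming build; with -ffast-math the NaN and signed-zero checks are no
longer guaranteed to hold.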

This is exactly what I wanted, so far I'm happy. And if I now try to use these
min and max functions to write a clamp function...

```
double fast_clamp(double x, double min, double max) {
    return fast_max(fast_min(x, max), min);
}
```

...again, at -O1 optimization level and above, I get a minsd/maxsd pair, short
and sweet:

```
fast_clamp(double, double, double):
        minsd   xmm0, xmm2
        maxsd   xmm0, xmm1
        ret
```

Where this perfect picture becomes tainted, however, is as soon as I try to
_use_ this function with certain min/max arguments.

```
double use_fast_clamp(double x) {
    return fast_clamp(x, 0.0, 1.0);
}
```

All of a sudden, the assembly becomes branchy and terrible-looking, even in -O3
mode!

```
use_fast_clamp(double):
        movapd  xmm1, xmm0
        movsd   xmm0, QWORD PTR .LC0[rip]
        comisd  xmm0, xmm1
        jbe     .L13
        maxsd   xmm1, QWORD PTR .LC1[rip]
        movapd  xmm0, xmm1
.L13:
        ret
.LC0:
        .long   0
        .long   1072693248
.LC1:
        .long   0
        .long   0
```
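
For readers decoding the constant pool in the listing above, the following
sketch reassembles the two .long words of each constant into a double. It
assumes IEEE-754 binary64 and the little-endian word order used on x86-64 (low
word first); the helper names are hypothetical. Under those assumptions .LC0 is
1.0 and .LC1 is 0.0, i.e. the max and min bounds passed to fast_clamp.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double (assumes IEEE-754 binary64). */
double double_from_bits(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Combine two 32-bit words, given low word first as in the .long pairs. */
double double_from_words(uint32_t low, uint32_t high) {
    return double_from_bits(((uint64_t)high << 32) | low);
}

/* Hypothetical helper: checks the two constants from the listing. */
void check_constants(void) {
    assert(double_from_words(0u, 1072693248u) == 1.0); /* .LC0 */
    assert(double_from_words(0u, 0u) == 0.0);          /* .LC1 */
}
```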

I can make the generated code go back to a minsd/maxsd pair if I enable
-ffast-math (more precisely -ffinite-math-only -funsafe-math-optimizations),
but to the best of my knowledge, I shouldn't need fast-math flags here.

Further, even if I did forget about an IEEE-754 oddity that requires fast-math
flags, it would still mean that the above compilation of the general fast_clamp
function is incorrect: if this compilation output should work for any pair of
"min" and "max" double-precision arguments, then it trivially should work when
the min is 0.0 and max is 1.0. So one way or another, I think the GCC optimizer
is doing something strange here.
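
The argument above can be checked empirically with a sketch along these lines.
It compares the specialized call against the general function on a handful of
edge-case inputs, bit for bit. The helpers same_bits and count_mismatches are
hypothetical names, and the volatile bounds are only an assumption-level trick
to keep the compiler from constant-propagating 0.0 and 1.0 into the general
path; this is a spot check, not a proof of equivalence.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

double fast_min(double x, double y) {
    return (x < y) ? x : y;
}

double fast_max(double x, double y) {
    return (x > y) ? x : y;
}

double fast_clamp(double x, double min, double max) {
    return fast_max(fast_min(x, max), min);
}

double use_fast_clamp(double x) {
    return fast_clamp(x, 0.0, 1.0);
}

/* Bitwise comparison, so that the sign of zero is compared as well. */
static int same_bits(double a, double b) {
    uint64_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    return ua == ub;
}

/* Returns how many edge-case inputs give different results. */
int count_mismatches(void) {
    /* volatile: the bounds are only known at run time on the general path. */
    volatile double lo = 0.0, hi = 1.0;
    const double inputs[] = {
        -INFINITY, -1.0, -0.0, 0.0, 0.5, 1.0, 2.0, INFINITY, NAN
    };
    int mismatches = 0;
    for (size_t i = 0; i < sizeof inputs / sizeof inputs[0]; ++i) {
        double general = fast_clamp(inputs[i], lo, hi);
        double special = use_fast_clamp(inputs[i]);
        if (!same_bits(general, special))
            ++mismatches;
    }
    return mismatches;
}
```

If count_mismatches() returns 0, the branchy code and the minsd/maxsd pair
agree on these inputs, which would point towards a missed optimization rather
than a semantic difference.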

---

This is the most minimal example of this behavior that I managed to come up
with. Using only the fast_min or fast_max functions in isolation will behave
as expected and codegen into a single minsd or maxsd:

```
double use_fast_min(double x) {
    return fast_min(x, 1.0);
}

double use_fast_max(double x) {
    return fast_max(x, 0.0);
}
```

I observed similar behavior on any GCC build I could get my hands on, all the
way from the most recent GCC trunk build currently available on godbolt (10.0.1
20200405) to the most ancient build provided by godbolt (4.1.2).

Both my local system and the godbolt builds are Linux-based.

My local GCC build was configured with  ../configure --prefix=/usr
--infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64
--libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,ada,go,d
--enable-offload-targets=hsa,nvptx-none=/usr/nvptx-none, --without-cuda-driver
--disable-werror --with-gxx-include-dir=/usr/include/c++/9 --enable-ssp
--disable-libssp --disable-libvtv --disable-cet --disable-libcc1
--enable-plugin --with-bugurl=https://bugs.opensuse.org/
--with-pkgversion='SUSE Linux' --with-slibdir=/lib64 --with-system-zlib
--enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-libphobos
--enable-version-specific-runtime-libs --with-gcc-major-version-only
--enable-linker-build-id --enable-linux-futex --enable-gnu-indirect-function
--program-suffix=-9 --without-system-libunwind --enable-multilib
--with-arch-32=x86-64 --with-tune=generic
--with-build-config=bootstrap-lto-lean --enable-link-mutex
--build=x86_64-suse-linux --host=x86_64-suse-linux

As for godbolt builds, it is easy to go to godbolt.org and add a -v to the
compiler options of the build you're interested in, so I will invite you to do
that instead of cluttering this already long bug report further.

---

FWIW, clang 10 behaves the way I would expect without fast-math flags (and also
generates the zero in place with a xorpd instead of loading it from memory,
which is kind of cool), but I'm well aware of the danger of comparing the
floating-point behavior of various compiler optimizers. So I wouldn't read too
much into that:

```
.LCPI5_0:
        .quad   4607182418800017408     # double 1
use_fast_clamp(double):                    # @use_fast_clamp(double)
        minsd   xmm0, qword ptr [rip + .LCPI5_0]
        xorpd   xmm1, xmm1
        maxsd   xmm0, xmm1
        ret
```

If you like to experiment on godbolt too, here's my setup:
https://godbolt.org/z/eD-guY .