From mboxrd@z Thu Jan 1 00:00:00 1970
From: "grasland at lal dot in2p3.fr"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ?
Date: Mon, 06 Apr 2020 09:35:30 +0000
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
List-Id: Gcc-bugs mailing list

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497

Bug ID: 94497
Summary: Branchless clamp in the general case gets a branch in a particular case ?
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: grasland at lal dot in2p3.fr
Target Milestone: ---

(Triage note: I think this is probably a compiler middle-end or back-end issue,
but I am not knowledgeable enough about the structure of the GCC codebase to
pick the right component.)

---

I am trying to make a floating-point computation autovectorization-friendly,
without mandating the use of -ffast-math for optimal performance as that is a
numerical stability and compiler portability hazard. This turned out to be an
interesting exercise in IEEE-754 pedantry, of course, but I can live with that.

However, while trying to optimize a "clamp" computation, I ended up at a point
where the behavior of the GCC optimizer just does not make sense to me and I
could use the opinion of an expert.

Consider the following functions:

```
double fast_min(double x, double y) {
    return (x < y) ? x : y;
}

double fast_max(double x, double y) {
    return (x > y) ? x : y;
}
```

The definitions of fast_min and fast_max are carefully crafted to match the
semantics of x86's min and max instruction family, and indeed if I compile this
code with -O1 or above I get minsd/maxsd or vminsd/vmaxsd instructions
depending on which vector instruction sets are enabled.

This is exactly what I wanted, so far I'm happy. And if I now try to use these
min and max functions to write a clamp function...

```
double fast_clamp(double x, double min, double max) {
    return fast_max(fast_min(x, max), min);
}
```

...again, at -O1 optimization level and above, I get a minsd/maxsd pair, short
and sweet:

```
fast_clamp(double, double, double):
        minsd   xmm0, xmm2
        maxsd   xmm0, xmm1
        ret
```

Where this perfect picture becomes tainted, however, is as soon as I try to
_use_ this function with certain min/max arguments.
```
double use_fast_clamp(double x) {
    return fast_clamp(x, 0.0, 1.0);
}
```

All of a sudden, the assembly becomes branchy and terrible-looking, even in -O3
mode!

```
use_fast_clamp(double):
        movapd  xmm1, xmm0
        movsd   xmm0, QWORD PTR .LC0[rip]
        comisd  xmm0, xmm1
        jbe     .L13
        maxsd   xmm1, QWORD PTR .LC1[rip]
        movapd  xmm0, xmm1
.L13:
        ret
.LC0:
        .long   0
        .long   1072693248
.LC1:
        .long   0
        .long   0
```

I can make the generated code go back to a minsd/maxsd pair if I enable
-ffast-math (more precisely -ffinite-math-only -funsafe-math-optimizations),
but to the best of my knowledge, I shouldn't need fast-math flags here.

Further, even if I did forget about an IEEE-754 oddity that requires fast-math
flags, it would still mean that the above compilation of the general fast_clamp
function is incorrect: if this compilation output should work for any pair of
"min" and "max" double-precision arguments, then it trivially should work when
the min is 0.0 and max is 1.0. So one way or another, I think the GCC optimizer
is doing something strange here.

---

This is the most minimal example of this behavior that I managed to come up
with. Using only the fast_min or fast_max functions in isolation will behave
as expected and codegen into a single minsd or maxsd:

```
double use_fast_min(double x) {
    return fast_min(x, 1.0);
}

double use_fast_max(double x) {
    return fast_max(x, 0.0);
}
```

I observed similar behavior on any GCC build I could get my hands on, all the
way from the most recent GCC trunk build currently available on godbolt (10.0.1
20200405) to the most ancient build provided by godbolt (4.1.2).

Both my local system and godbolt are Linux-based.
My local GCC build was configured with ../configure --prefix=/usr
--infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64
--libexecdir=/usr/lib64
--enable-languages=c,c++,objc,fortran,obj-c++,ada,go,d
--enable-offload-targets=hsa,nvptx-none=/usr/nvptx-none, --without-cuda-driver
--disable-werror --with-gxx-include-dir=/usr/include/c++/9 --enable-ssp
--disable-libssp --disable-libvtv --disable-cet --disable-libcc1
--enable-plugin --with-bugurl=https://bugs.opensuse.org/
--with-pkgversion='SUSE Linux' --with-slibdir=/lib64 --with-system-zlib
--enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-libphobos
--enable-version-specific-runtime-libs --with-gcc-major-version-only
--enable-linker-build-id --enable-linux-futex --enable-gnu-indirect-function
--program-suffix=-9 --without-system-libunwind --enable-multilib
--with-arch-32=x86-64 --with-tune=generic
--with-build-config=bootstrap-lto-lean --enable-link-mutex
--build=x86_64-suse-linux --host=x86_64-suse-linux

As for godbolt builds, it is easy to go to godbolt.org and add a -v to the
compiler options of the build you're interested in, so I will invite you to do
that instead of cluttering this already long bug report further.

---

FWIW, clang 10 behaves the way I would expect without fast-math flags (and also
generates the zero in place with a xorpd instead of loading it from memory,
which is kind of cool), but I'm well aware of the danger of comparing the
floating-point behavior of various compiler optimizers. So I wouldn't read too
much into that:

```
.LCPI5_0:
        .quad   4607182418800017408     # double 1
use_fast_clamp(double):                 # @use_fast_clamp(double)
        minsd   xmm0, qword ptr [rip + .LCPI5_0]
        xorpd   xmm1, xmm1
        maxsd   xmm0, xmm1
        ret
```

If you like to experiment on godbolt too, here's my setup:
https://godbolt.org/z/eD-guY .
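
---

For anyone who wants to check a given compilation against the source-level semantics, here is a minimal sketch in C. It reuses the fast_min, fast_max and fast_clamp definitions quoted above and evaluates fast_clamp(x, 0.0, 1.0) on the inputs where IEEE-754 is most pedantic: NaN, infinities and negative zero. The helper names same_value and check_clamp_edge_cases are hypothetical and were introduced only for this illustration, and the expected values were derived by evaluating the ternary expressions by hand, not taken from any compiler output.

```c
#include <assert.h>
#include <math.h>
#include <stdio.h>

/* Same definitions as in the report. */
double fast_min(double x, double y) { return (x < y) ? x : y; }
double fast_max(double x, double y) { return (x > y) ? x : y; }
double fast_clamp(double x, double min, double max) {
    return fast_max(fast_min(x, max), min);
}

/* Hypothetical helper: compares two doubles while distinguishing +0.0
   from -0.0 and treating any two NaNs as the same value. */
int same_value(double a, double b) {
    if (isnan(a) || isnan(b)) return isnan(a) && isnan(b);
    return a == b && !signbit(a) == !signbit(b);
}

/* Hypothetical helper: asserts that fast_clamp(x, 0.0, 1.0) returns the
   value obtained by evaluating the source expressions by hand. */
void check_clamp_edge_cases(void) {
    const struct { double x, expected; } cases[] = {
        { 0.5,       0.5 },
        { 2.0,       1.0 },
        { -3.0,      0.0 },
        { -0.0,      0.0 },  /* -0.0 > 0.0 is false, so fast_max yields +0.0 */
        { NAN,       1.0 },  /* NaN < 1.0 is false, so fast_min yields 1.0 */
        { INFINITY,  1.0 },
        { -INFINITY, 0.0 },
    };
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; ++i) {
        /* Reading through a volatile discourages constant folding, so the
           compiled comparison code is what gets exercised. */
        volatile double x = cases[i].x;
        double got = fast_clamp(x, 0.0, 1.0);
        printf("fast_clamp(%g, 0.0, 1.0) = %g\n", x, got);
        assert(same_value(got, cases[i].expected));
    }
}
```

Calling check_clamp_edge_cases() from a main function and building at several optimization levels, with and without fast-math flags, is one way to see whether a particular code generation choice changes any of these results. The table only covers a handful of inputs, so it is a sanity check rather than a proof of equivalence.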