From: "jakub at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/109154] [13 regression] jump threading de-optimizes nested floating point comparisons
Date: Thu, 13 Apr 2023 16:54:19 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P1
X-Bugzilla-Target-Milestone: 13.0

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

--- Comment #45 from Jakub Jelinek ---
So, would

void
foo (float *f, float d, float e)
{
  if (e >= 2.0f && e <= 4.0f)
    ;
  else
    __builtin_unreachable ();
  for (int i = 0; i < 1024; i++)
    {
      float a = f[i];
      f[i] = (a < 0.0f ? 1.0f : 1.0f - a * d) * (a < e ? 1.0f : 0.0f);
    }
}

be a better reduction on what's going on?
From the frange/threading POV, when e is in [2.0f, 4.0f] range, if a < 0.0f,
we know that a < e is also true, so there is no point in testing that at
runtime.
So I think what threadfull1 does is right and desirable if the final code
actually performs those comparisons and uses conditional jumps.
The only thing is that it is harmful for vectorization and maybe for
predicated code.
Therefore, for scalar code at least without massive ARM style conditional
execution, the above is better emitted as
  if (a < 0.0f)
    tmp = 1.0f;
  else
    {
      tmp = (1.0f - a * d) * (a < e ? 1.0f : 0.0f);
    }
or even
  if (a < 0.0f)
    tmp = 1.0f;
  else if (a < e)
    tmp = 1.0f - a * d;
  else
    tmp = 0.0f;
  f[i] = tmp;
Thus, could we effectively try to undo it at ifcvt time on loops for
vectorization only, or during vectorization or something similar?
As ifcvt then turns the IMHO desirable
  if (a_16 >= 0.0)
    goto ; [59.00%]
  else
    goto ; [41.00%]

  [local count: 435831803]:
  goto ; [100.00%]

  [local count: 627172605]:
  _7 = a_16 * d_17(D);
  iftmp.0_18 = 1.0e+0 - _7;
  if (e_13(D) > a_16)
    goto ; [20.00%]
  else
    goto ; [80.00%]

  [local count: 125434523]:
  goto ; [100.00%]

  [local count: 501738082]:

  [local count: 1063004410]:
  # prephitmp_26 = PHI
(ok, the 2 empty forwarders are unlikely useful) into:
  _7 = a_16 * d_17(D);
  iftmp.0_18 = 1.0e+0 - _7;
  _21 = a_16 >= 0.0;
  _10 = e_13(D) > a_16;
  _9 = _10 & _21;
  _27 = e_13(D) <= a_16;
  _28 = _21 & _27;
  _ifc__43 = _9 ? iftmp.0_18 : 0.0;
  _ifc__44 = _28 ? 0.0 : _ifc__43;
  _45 = a_16 < 0.0;
  prephitmp_26 = _45 ? 1.0e+0 : _ifc__44;

Now, perhaps if ifcvt used ranger, it could figure out that a_16 < 0.0 implies
e_13(D) > a_16 and do something smarter with it.
Or maybe just try to do smarter ifcvt just based on the original CFG.
The pre-ifcvt code was
  a_16 < 0.0f ? 1.0f : a_16 < e_13 ? 1.0f - a_16 * d_17 : 0.0f
so when ifcvt puts everything together, make it
  _7 = a_16 * d_17(D);
  iftmp.0_18 = 1.0e+0 - _7;
  _27 = e_13(D) > a_16;
  _28 = a_16 < 0.0;
  _ifc__43 = _27 ? iftmp.0_18 : 0.0f;
  prephitmp_26 = _28 ? 1.0f : _ifc__43;
?
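
As a rough sanity check of the equivalence argued above, the following sketch writes the three shapes as plain C: the original expression, the branchy scalar form, and the two-select form. It then compares them on a small grid. The function names and the helper count_mismatches are hypothetical, not anything in GCC. The grid is an assumption too: only finite inputs with e in [2.0f, 4.0f], chosen so that a * d is exactly representable. With a NaN or infinite a the original expression can yield NaN (NaN * 0.0f) while the branchy form yields 0.0f, so the match presumably relies on finite-math-only style assumptions and on ignoring the sign of zero.

```c
#include <assert.h>

/* The expression from the reduced test case, evaluated as written.  */
static float
ref_form (float a, float d, float e)
{
  return (a < 0.0f ? 1.0f : 1.0f - a * d) * (a < e ? 1.0f : 0.0f);
}

/* The branchy scalar form; it relies on a < 0.0f implying a < e.  */
static float
branchy_form (float a, float d, float e)
{
  /* Stands in for the [2.0f, 4.0f] range that frange knows about.  */
  assert (e >= 2.0f && e <= 4.0f);
  if (a < 0.0f)
    return 1.0f;
  else if (a < e)
    return 1.0f - a * d;
  else
    return 0.0f;
}

/* The two-select shape suggested for ifcvt, written as C.
   Comments name the corresponding SSA names from the comment.  */
static float
select_form (float a, float d, float e)
{
  float v = 1.0f - a * d;       /* iftmp.0_18 */
  int c27 = e > a;              /* _27 */
  int c28 = a < 0.0f;           /* _28 */
  float t = c27 ? v : 0.0f;     /* _ifc__43 */
  return c28 ? 1.0f : t;        /* prephitmp_26 */
}

/* Hypothetical helper: count grid points where the three forms disagree.
   a steps by 0.25 and d by 0.5, so every product is exact in float.
   Comparison uses ==, which treats -0.0f and 0.0f as equal.  */
static int
count_mismatches (void)
{
  int bad = 0;
  for (int ei = 0; ei <= 4; ei++)
    for (int di = -4; di <= 4; di++)
      for (int ai = -32; ai <= 32; ai++)
        {
          float e = 2.0f + 0.5f * ei;
          float d = 0.5f * di;
          float a = 0.25f * ai;
          float r = ref_form (a, d, e);
          if (r != branchy_form (a, d, e) || r != select_form (a, d, e))
            bad++;
        }
  return bad;
}
```

Calling count_mismatches () from a small main is enough to exercise it. The check only covers the sampled grid, so it illustrates the argument rather than proving it.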