From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 5AA813858C83; Thu, 13 Apr 2023 16:54:20 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 5AA813858C83
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1681404860;
	bh=RaGj0NR+8f/pfTZD7Oc1GWYVtZUGwx7kVBwdDfjf5JY=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=CntxVMbttLbTYzd9zy4UGQExpQsViGzbWO3oVJ3uLP3j6D0365gfzaVl5m3gtQA2w
	 +huLqRPUlgKhjJJNMD59Jww9eB6V0nThp93PQ/QVgA3Z4pCv+NiknRSpLGOIe76TEz
	 fI7EP7S4UoV+1IeebrORHjRERpwRiazo06juNo08=
From: "jakub at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/109154] [13 regression] jump threading
 de-optimizes nested floating point comparisons
Date: Thu, 13 Apr 2023 16:54:19 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: jakub at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P1
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 13.0
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-109154-4-FGD1i0KKev@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-109154-4@http.gcc.gnu.org/bugzilla/>
References: <bug-109154-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154
--- Comment #45 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
So, would
void
foo (float *f, float d, float e)
{
  if (e >= 2.0f && e <= 4.0f)
    ;
  else
    __builtin_unreachable ();
  for (int i = 0; i < 1024; i++)
    {
      float a = f[i];
      f[i] = (a < 0.0f ? 1.0f : 1.0f - a * d) * (a < e ? 1.0f : 0.0f);
    }
}
be a better reduction of what's going on?
From the frange/threading POV, when e is in the [2.0f, 4.0f] range, if a < 0.0f, we
know that a < e is also true, so there is no point in testing that at runtime.
So I think what threadfull1 does is right and desirable if the final code
actually performs those comparisons and uses conditional jumps.
The only thing is that it is harmful for vectorization and maybe for predicated
code.
Therefore, for scalar code, at least without massive ARM-style conditional
execution, the above is better emitted as
  if (a < 0.0f)
    tmp = 1.0f;
  else
    {
      tmp = (1.0f - a * d) * (a < e ? 1.0f : 0.0f);
    }
or even
  if (a < 0.0f)
    tmp = 1.0f;
  else if (a < e)
    tmp = 1.0f - a * d;
  else
    tmp = 0.0f;
  f[i] = tmp;
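As a sanity check on the rewrite above, here is a minimal sketch that compares
the original product expression with the three-way branch form on a handful of
finite sample values. The function names and the sample arrays are made up for
illustration, and the assert only stands in for the __builtin_unreachable ()
range hint in foo. It says nothing about NaN or infinite inputs, and == treats
-0.0f and 0.0f as equal, so a possible sign-of-zero difference in the a >= e
case is deliberately not counted.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop-body expression, as written in foo.  */
static float form_product(float a, float d, float e)
{
  return (a < 0.0f ? 1.0f : 1.0f - a * d) * (a < e ? 1.0f : 0.0f);
}

/* Three-way branch form; relies on a < 0.0f implying a < e.  */
static float form_branches(float a, float d, float e)
{
  /* Stand-in for the __builtin_unreachable () range hint in foo.  */
  assert(e >= 2.0f && e <= 4.0f);
  if (a < 0.0f)
    return 1.0f;
  else if (a < e)
    return 1.0f - a * d;
  else
    return 0.0f;
}

/* Hypothetical helper: number of sample points where the two forms
   differ.  All samples are finite and e stays within [2.0f, 4.0f].  */
int count_mismatches(void)
{
  static const float a_vals[] = { -8.0f, -0.5f, 0.0f, 0.5f,
                                  2.0f, 3.0f, 4.0f, 16.0f };
  static const float d_vals[] = { -2.0f, 0.0f, 0.25f, 1.0f };
  static const float e_vals[] = { 2.0f, 3.0f, 4.0f };
  int bad = 0;

  for (size_t i = 0; i < sizeof a_vals / sizeof a_vals[0]; i++)
    for (size_t j = 0; j < sizeof d_vals / sizeof d_vals[0]; j++)
      for (size_t k = 0; k < sizeof e_vals / sizeof e_vals[0]; k++)
        if (form_product(a_vals[i], d_vals[j], e_vals[k])
            != form_branches(a_vals[i], d_vals[j], e_vals[k]))
          bad++;
  return bad;
}
```

Calling count_mismatches () from a small main and checking for 0 is enough to
exercise it; the samples are small values that float represents exactly, so
rounding should not blur the comparison.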
Thus, could we effectively try to undo it at ifcvt time on loops for
vectorization only, or during vectorization or something similar?
As ifcvt then turns the IMHO desirable
  if (a_16 >= 0.0)
    goto <bb 5>; [59.00%]
  else
    goto <bb 11>; [41.00%]

  <bb 11> [local count: 435831803]:
  goto <bb 7>; [100.00%]

  <bb 5> [local count: 627172605]:
  _7 = a_16 * d_17(D);
  iftmp.0_18 = 1.0e+0 - _7;
  if (e_13(D) > a_16)
    goto <bb 12>; [20.00%]
  else
    goto <bb 6>; [80.00%]

  <bb 12> [local count: 125434523]:
  goto <bb 7>; [100.00%]

  <bb 6> [local count: 501738082]:

  <bb 7> [local count: 1063004410]:
  # prephitmp_26 = PHI <iftmp.0_18(12), 0.0(6), 1.0e+0(11)>
(ok, the 2 empty forwarders are unlikely to be useful) into:
  _7 = a_16 * d_17(D);
  iftmp.0_18 = 1.0e+0 - _7;
  _21 = a_16 >= 0.0;
  _10 = e_13(D) > a_16;
  _9 = _10 & _21;
  _27 = e_13(D) <= a_16;
  _28 = _21 & _27;
  _ifc__43 = _9 ? iftmp.0_18 : 0.0;
  _ifc__44 = _28 ? 0.0 : _ifc__43;
  _45 = a_16 < 0.0;
  prephitmp_26 = _45 ? 1.0e+0 : _ifc__44;
Now, perhaps if ifcvt used ranger, it could figure out that a_16 < 0.0 implies
e_13(D) > a_16 and do something smarter with it.
Or maybe just try to do smarter ifcvt just based on the original CFG.
The pre-ifcvt code was
a_16 < 0.0f ? 1.0f : a_16 < e_13 ? 1.0f - a_16 * d_17 : 0.0f
so when ifcvt puts everything together, make it
  _7 = a_16 * d_17(D);
  iftmp.0_18 = 1.0e+0 - _7;
  _27 = e_13(D) > a_16;
  _28 = a_16 < 0.0;
  _ifc__43 = _27 ? iftmp.0_18 : 0.0f;
  prephitmp_26 = _28 ? 1.0f : _ifc__43;
?
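
To make the comparison between the two select chains concrete, here is a rough
C model of both: the sequence ifcvt emits today and the shorter one suggested
above. This is only a sketch. The C variable names (c21, s43 and so on) are
made-up stand-ins for the SSA names in the dumps, and the sample values are
arbitrary finite floats. As far as I can tell the agreement does not depend on
e being in [2.0f, 4.0f], since both chains are derived from the same CFG, but
the check below only covers the listed samples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Select chain as ifcvt emits it today, transcribed from the dump.  */
static float select_current(float a, float d, float e)
{
  float t7 = a * d;
  float iftmp = 1.0f - t7;
  bool c21 = a >= 0.0f;
  bool c10 = e > a;
  bool c9 = c10 & c21;
  bool c27 = e <= a;
  bool c28 = c21 & c27;
  float s43 = c9 ? iftmp : 0.0f;
  float s44 = c28 ? 0.0f : s43;
  bool c45 = a < 0.0f;
  return c45 ? 1.0f : s44;
}

/* Shorter chain suggested above: two compares, two selects.  */
static float select_proposed(float a, float d, float e)
{
  float t7 = a * d;
  float iftmp = 1.0f - t7;
  bool c27 = e > a;
  bool c28 = a < 0.0f;
  float s43 = c27 ? iftmp : 0.0f;
  return c28 ? 1.0f : s43;
}

/* Hypothetical helper: asserts that both chains agree on every sample
   point and returns how many points were compared.  */
int check_selects_agree(void)
{
  static const float a_vals[] = { -8.0f, -0.5f, 0.0f, 0.5f,
                                  2.0f, 3.0f, 4.0f, 16.0f };
  static const float d_vals[] = { -2.0f, 0.0f, 0.25f, 1.0f };
  static const float e_vals[] = { 2.0f, 3.0f, 4.0f };
  int checked = 0;

  for (size_t i = 0; i < sizeof a_vals / sizeof a_vals[0]; i++)
    for (size_t j = 0; j < sizeof d_vals / sizeof d_vals[0]; j++)
      for (size_t k = 0; k < sizeof e_vals / sizeof e_vals[0]; k++)
        {
          assert(select_current(a_vals[i], d_vals[j], e_vals[k])
                 == select_proposed(a_vals[i], d_vals[j], e_vals[k]));
          checked++;
        }
  return checked;
}
```

The model counts 5 comparisons/logical ops plus 3 selects in the current chain
against 2 compares plus 2 selects in the proposed one, which is the saving the
vectorizer would see per lane.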