From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 58C263858C20; Thu, 13 Apr 2023 17:25:49 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 58C263858C20
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1681406749;
	bh=771VvVW9smbb85I5Jq+q8sZ1ZmR4bMGwypF+r3UWxBE=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=s/w6hBvaldseu9TZvdMzc0y6aGSEB5Q4LBmjm6dbLujRt7XwgIQxlaAXKNsqbTKiU
	 oKSxhxBe8phpVVd0nqt7161zjvKGU+3rrYM3V5zRWfZ1Vz4UowCd2q2rL0vlb+8uf8
	 8G89j/xVk/8K2L7s2RctJRIlC0CJIyiAnRyszWz8=
From: "rguenther at suse dot de" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/109154] [13 regression] jump threading
 de-optimizes nested floating point comparisons
Date: Thu, 13 Apr 2023 17:25:47 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenther at suse dot de
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P1
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 13.0
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-109154-4-RsgGC9SSNd@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-109154-4@http.gcc.gnu.org/bugzilla/>
References: <bug-109154-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154
--- Comment #46 from rguenther at suse dot de <rguenther at suse dot de> ---
On 13.04.2023 at 18:54, jakub at gcc dot gnu.org
<gcc-bugzilla@gcc.gnu.org> wrote:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154
>
> --- Comment #45 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> So, would
> void
> foo (float *f, float d, float e)
> {
>  if (e >= 2.0f && e <= 4.0f)
>    ;
>  else
>    __builtin_unreachable ();
>  for (int i = 0; i < 1024; i++)
>    {
>      float a = f[i];
>      f[i] = (a < 0.0f ? 1.0f : 1.0f - a * d) * (a < e ? 1.0f : 0.0f);
>    }
> }
> be a better reduction of what's going on?
> From the frange/threading POV, when e is in [2.0f, 4.0f] range, if a < 0.0f, we
> know that a < e is also true, so there is no point in testing that at runtime.
> So I think what threadfull1 does is right and desirable if the final code
> actually performs those comparisons and uses conditional jumps.
> The only thing is that it is harmful for vectorization and maybe for predicated
> code.
> Therefore, for scalar code at least without massive ARM style conditional
> execution,
> the above is better emitted as
>  if (a < 0.0f)
>    tmp = 1.0f;
>  else
>    {
>      tmp = (1.0f - a * d) * (a < e ? 1.0f : 0.0f);
>    }
> or even
>  if (a < 0.0f)
>    tmp = 1.0f;
>  else if (a < e)
>    tmp = 1.0f - a * d;
>  else
>    tmp = 0.0f;
>  f[i] = tmp;
> Thus, could we effectively try to undo it at ifcvt time on loops for
> vectorization only, or during vectorization or something similar?
> As ifcvt then turns the IMHO desirable
>  if (a_16 >= 0.0)
>    goto <bb 5>; [59.00%]
>  else
>    goto <bb 11>; [41.00%]
>
>  <bb 11> [local count: 435831803]:
>  goto <bb 7>; [100.00%]
>
>  <bb 5> [local count: 627172605]:
>  _7 = a_16 * d_17(D);
>  iftmp.0_18 = 1.0e+0 - _7;
>  if (e_13(D) > a_16)
>    goto <bb 12>; [20.00%]
>  else
>    goto <bb 6>; [80.00%]
>
>  <bb 12> [local count: 125434523]:
>  goto <bb 7>; [100.00%]
>
>  <bb 6> [local count: 501738082]:
>
>  <bb 7> [local count: 1063004410]:
>  # prephitmp_26 = PHI <iftmp.0_18(12), 0.0(6), 1.0e+0(11)>
> (ok, the 2 empty forwarders are unlikely useful) into:
>  _7 = a_16 * d_17(D);
>  iftmp.0_18 = 1.0e+0 - _7;
>  _21 = a_16 >= 0.0;
>  _10 = e_13(D) > a_16;
>  _9 = _10 & _21;
>  _27 = e_13(D) <= a_16;
>  _28 = _21 & _27;
>  _ifc__43 = _9 ? iftmp.0_18 : 0.0;
>  _ifc__44 = _28 ? 0.0 : _ifc__43;
>  _45 = a_16 < 0.0;
>  prephitmp_26 = _45 ? 1.0e+0 : _ifc__44;
> Now, perhaps if ifcvt used ranger, it could figure out that a_16 < 0.0 implies
> e_13(D) > a_16 and do something smarter with it.
> Or maybe just try to do smarter ifcvt just based on the original CFG.
> The pre-ifcvt code was a_16 < 0.0f ? 1.0f : a_16 < e_13 ? 1.0f - a_16 * d_17 :
> 0.0f
> so when ifcvt puts everything together, make it
>  _7 = a_16 * d_17(D);
>  iftmp.0_18 = 1.0e+0 - _7;
>  _27 = e_13(D) > a_16;
>  _28 = a_16 < 0.0;
>  _ifc__43 = _27 ? iftmp.0_18 : 0.0f;
>  prephitmp_26 = _28 ? 1.0f : _ifc__43;
> ?

Improving what ifcvt produces for multi-argument PHIs is certainly desirable.
I’m not sure whether undoing the threading is generally possible.
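
To make the comparison concrete, here is a rough C sketch of the two
select chains from comment #45 next to the branchy source form. The
function names (select_current, select_proposed, select_branchy,
chains_agree) are made up for illustration and are not GCC code; the
temporaries are only annotated with the SSA names they are meant to
mirror. It assumes finite inputs (no NaNs), matching the inverted
a_16 >= 0.0 test in the GIMPLE above, and uses sample values whose
products are exact so FP contraction cannot change the result.

```c
#include <stddef.h>

/* Select chain as ifcvt emits it today for the three-argument PHI. */
static float select_current(float a, float d, float e)
{
    float iftmp = 1.0f - a * d;      /* iftmp.0_18 */
    int ge0  = a >= 0.0f;            /* _21 */
    int lt_e = e > a;                /* _10 */
    int c9   = lt_e & ge0;           /* _9  */
    int ge_e = e <= a;               /* _27 */
    int c28  = ge0 & ge_e;           /* _28 */
    float t43 = c9 ? iftmp : 0.0f;   /* _ifc__43 */
    float t44 = c28 ? 0.0f : t43;    /* _ifc__44 */
    int lt0  = a < 0.0f;             /* _45 */
    return lt0 ? 1.0f : t44;         /* prephitmp_26 */
}

/* Shorter chain suggested in comment #45: two compares, two selects. */
static float select_proposed(float a, float d, float e)
{
    float iftmp = 1.0f - a * d;
    int lt_e = e > a;
    int lt0  = a < 0.0f;
    float t43 = lt_e ? iftmp : 0.0f;
    return lt0 ? 1.0f : t43;
}

/* Branchy form of the same computation, used as the reference. */
static float select_branchy(float a, float d, float e)
{
    if (a < 0.0f)
        return 1.0f;
    if (a < e)
        return 1.0f - a * d;
    return 0.0f;
}

/* Returns 1 if all three forms agree on a few finite sample inputs. */
static int chains_agree(float d, float e)
{
    static const float samples[] = {
        -2.0f, -0.5f, 0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        float a = samples[i];
        float ref = select_branchy(a, d, e);
        if (select_current(a, d, e) != ref
            || select_proposed(a, d, e) != ref)
            return 0;
    }
    return 1;
}
```

Calling chains_agree (0.5f, 3.0f) should return 1. In this sketch the
two chains agree for any finite e, because the _28 select only ever
replaces 0.0 with 0.0; the [2.0f, 4.0f] range on e matters for the
threading decision, not for simplifying the selects.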
