From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48) id 127FB3858D35; Thu, 16 Mar 2023 17:03:05 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 127FB3858D35
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1678986185; bh=KukIghGOLY88zDmbh82yNrHFs8wDP6EmhoExRCHtAlI=; h=From:To:Subject:Date:In-Reply-To:References:From; b=F9VpgnW3dHGP2Y0aqcD/U7jvvampG0XItwOG5leZDHZrt8mAJi97d/wtZtYfQEPdb l561hI9LDRvR4gyZyviikzlmbjmPBRaYQfooPqQIOiUEx+Loc4xZCXfMS9seW24TQv FX+b8ym/+aBBjZ4PfaJ1nvV+PE+bJvne3oIliXAA=
From: "tnfchris at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/109154] [13 regression] aarch64 -mcpu=neoverse-v1 microbude performance regression
Date: Thu, 16 Mar 2023 17:03:03 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: tnfchris at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 13.0
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

--- Comment #2 from Tamar Christina ---

Confirmed. It looks like the extra range information from
g:4fbe3e6aa74dae5c75a73c46ae6683fdecd1a75d is leading jump threading down the
wrong path.

Reduced testcase:

---
int etot_0, fasten_main_natpro_chrg_init;
void fasten_main_natpro() {
  float elcdst = 1;
  for (int l; l < 1; l++) {
    int zone1 = l < 0.0f,
        chrg_e = fasten_main_natpro_chrg_init * (zone1 ?: 1) *
                 (l < elcdst ? 1 : 0.0f);
    etot_0 += chrg_e;
  }
}
---

Compile with `-O1`. The issue also affects all targets, not just AArch64
(https://godbolt.org/z/qes4K4oTz), and `-fno-thread-jumps` is confirmed to
"fix" it.

With the new range information, jump threading duplicates the edges on the
l < 0.0f check. The dump says:

"Jump threading proved probability of edge 5->7 too small
 (it is 41.0% (guessed) should be 69.5% (guessed))"

In BB 3 the branch probabilities are guessed as:

  if (_1 < 0.0)
    goto ; [41.00%]
  else
    goto ; [59.00%]

and in BB 5:

  if (_1 < 1.0e+0)
    goto ; [41.00%]
  else
    goto ; [59.00%]

and so it thinks that the chance of _1 >= 0.0 && _1 < 1.0 is very small:

  if (_1 < 1.0e+0)
    goto ; [14.80%]
  else
    goto ; [85.20%]

The problem is that BB 4 falls through to BB 5, and BB 6 falls through to
BB 7. Jump threading optimizes BB 5 by splitting the work done in BB 5 for
the fall-through from BB 4 back into BB 4. It then threads the additional
edge to BB 7, where the final calculation is now more expensive than before
(a three-way PHI node). But because the hot path through BB 6 also falls
into BB 7, the overall result is that all paths become slower: the hot path
picked up an additional comparison.

This is why the code slows down. For each instance of this occurrence (and
in the example provided by microbude it happens often) we get an additional
branch on a few paths. The slowdown is bigger for SVE (vs. the scalar
slowdown) because it then creates a longer dependency chain on producing
the predicate for the BB.

It looks like this threading shouldn't be done if both the hot and cold
branches end up in the same place?