From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 2F8BD3858C66; Fri,  3 May 2024 10:45:10 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 2F8BD3858C66
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1714733110;
	bh=jbbGbnS44jEXrAxSoyc6k8E1al3F7TneVykomC4N7ag=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=E3npxtLdtpS342nGN4d+YKw/u5fMCqikC9R/Bmw/KLHjM7QmWVdW68SD4uyyx1Soz
	 +AnR3mcIhcG4z2tMlFux4CjevESAo1DAlGBRPBbhcnrzKSuJQHdTHjVb6MJ+4srzbO
	 plvJuDRGW/KxOGZlRcDT2Ml1AUMVvEM8DSCIPZf4=
From: "prathamesh3492 at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/114860] [14/15 regression] [aarch64] 511.povray
 regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since
 r14-10014-ga2f4be3dae04fa
Date: Fri, 03 May 2024 10:45:09 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: prathamesh3492 at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 14.0
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-114860-4-YhZA6wDm9o@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-114860-4@http.gcc.gnu.org/bugzilla/>
References: <bug-114860-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #4 from prathamesh3492 at gcc dot gnu.org ---
Hi Tamar,
Sorry for the late response.

perf profile for povray with LTO:

Compiled with 82d6d385f97 (commit before a2f4be3dae0):
20.03%  pov::All_CSG_Intersect_Intersections
16.42%  pov::All_Plane_Intersections
10.29%  pov::All_Sphere_Intersections
10.10%  pov::Intersect_BBox_Tree

Compiled with a2f4be3dae0:
19.51%  pov::All_CSG_Intersect_Intersections
16.91%  pov::All_Plane_Intersections
12.53%  pov::All_Sphere_Intersections
 9.81%  pov::Intersect_BBox_Tree
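
To make the shift between the two profiles easier to see, the per-function
shares above can be diffed with a few lines of Python. This is only a sketch:
the numbers are copied from the two listings, the dictionary names are made up
for illustration, and the values are shares of each build's own runtime, so the
deltas are percentage points and not absolute time.

```python
# Per-function share of samples (%), copied from the two perf listings above.
before = {  # compiled with 82d6d385f97
    "pov::All_CSG_Intersect_Intersections": 20.03,
    "pov::All_Plane_Intersections": 16.42,
    "pov::All_Sphere_Intersections": 10.29,
    "pov::Intersect_BBox_Tree": 10.10,
}
after = {  # compiled with a2f4be3dae0
    "pov::All_CSG_Intersect_Intersections": 19.51,
    "pov::All_Plane_Intersections": 16.91,
    "pov::All_Sphere_Intersections": 12.53,
    "pov::Intersect_BBox_Tree": 9.81,
}

# Change in percentage points for each function (after minus before).
delta = {name: round(after[name] - before[name], 2) for name in before}

# Print the largest increase first.
for name, change in sorted(delta.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{change:+6.2f}  {name}")
```

All_Sphere_Intersections stands out with the largest change (+2.24 points); the
other three functions move by roughly half a point or less.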

I verified there are no code-gen differences for any of the above hot
functions.
Running size on povray_r_exe.out shows a slight code-size decrease of 344 bytes
for the text section:
Compiled with 82d6d385f97: 1101505
Compiled with a2f4be3dae0: 1101161

Curiously, there’s a meaningful difference for pov::All_Sphere_Intersections,
which seems to be caused by the following adrp instruction (with no code-gen
changes in All_Sphere_Intersections):

Compiled with 82d6d385f97:
 18.07 │4aec44:   adrp  x0, 4e0000 <pov::SetCommandOption(POVMSData*, unsigned int, pov::shelldata*) [clone .isra.0]+0x1c0>
  1.77 │4aec48:   ldr   d28, [x0, #2784]

Compiled with a2f4be3dae0:
 28.93 │4aeae4:   adrp  x0, 4e0000 <pov::Warning(unsigned int, char const*, ...) [clone .constprop.0]+0x100>
  1.27 │4aeae8:   ldr   d28, [x0, #2432]
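
As a sanity check on the annotate output, the address each ldr reads can be
worked out by hand: adrp leaves the page address 0x4e0000 in x0 and the ldr adds
its immediate offset. A small sketch in Python, using only the numbers shown
above (the variable names are illustrative):

```python
PAGE = 0x4E0000  # adrp target shown in both listings

# Effective address = page address in x0 + immediate offset of the ldr.
old_literal = PAGE + 2784  # ldr d28, [x0, #2784] with 82d6d385f97
new_literal = PAGE + 2432  # ldr d28, [x0, #2432] with a2f4be3dae0

print(hex(old_literal), hex(new_literal), old_literal - new_literal)
```

So the value being loaded (presumably the same constant in both builds) sits
352 bytes lower with a2f4be3dae0, but still in the same 4 KiB page.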

This seems to come from the following condition in Intersect_Sphere (which gets
inlined into All_Sphere_Intersections):

if ((OCSquared >= Radius2) && (t_Closest_Approach < EPSILON))

As far as I can see, there’s no difference between the two adrp instructions
except their addresses (4aec44 vs 4aeae4). And as far as I know, adrp only
calculates a pc-relative page address (and does not load any data). To check for
possible icache misses I used the L1I_CACHE_REFILL counter, and it turns out that
there are 64% more L1 icache misses for the above adrp instruction with
a2f4be3dae0 compared to 82d6d385f97, which may (partially) explain the
performance difference?
However, perf stat shows around 7% more L1 icache misses for the whole
program run with 82d6d385f97 compared to a2f4be3dae0.
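
For reference, the address arithmetic behind adrp can be sketched as below. This
is a simplified Python model, not compiler or hardware code: adrp_result and
line_offset are hypothetical helper names, the page delta of 0x32 is derived
from the addresses in the listings, and the 64-byte line size is an assumption
about the L1 instruction cache rather than something measured here.

```python
PAGE_SIZE = 4096
LINE_SIZE = 64  # assumed L1I line size; not taken from the measurements above


def adrp_result(pc, page_delta):
    """Model of adrp: clear the low 12 bits of pc, then add page_delta pages."""
    return (pc & ~(PAGE_SIZE - 1)) + page_delta * PAGE_SIZE


def line_offset(pc):
    """Byte offset of an instruction within its (assumed) cache line."""
    return pc % LINE_SIZE


# The two adrp addresses from the annotate output.
for pc in (0x4AEC44, 0x4AEAE4):
    print(hex(pc), hex(adrp_result(pc, 0x32)), line_offset(pc))
```

Both instructions sit in the same 4 KiB page (0x4ae000) and need the same page
delta to reach 0x4e0000, so they encode the same operation. Under the 64-byte
assumption, neither adrp/ldr pair straddles a line boundary; only the position
within the line differs (offset 4 vs 36).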

I could (repeatedly) reproduce the issue on two neoverse-v2 machines.
The full command line passed to the compiler was:
"-O3 -Wl,-z,muldefs -lm -fallow-argument-mismatch -fpermissive -fstack-arrays
-flto -march=native -mcpu=neoverse-v2"

Thanks,
Prathamesh