From: "prathamesh3492 at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
Date: Fri, 03 May 2024 10:45:09 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #4 from prathamesh3492 at gcc dot gnu.org ---
Hi Tamar,
Sorry for the late response.
perf profile for povray with LTO:

Compiled with 82d6d385f97 (commit before a2f4be3dae0):
  20.03%  pov::All_CSG_Intersect_Intersections
  16.42%  pov::All_Plane_Intersections
  10.29%  pov::All_Sphere_Intersections
  10.10%  pov::Intersect_BBox_Tree

Compiled with a2f4be3dae0:
  19.51%  pov::All_CSG_Intersect_Intersections
  16.91%  pov::All_Plane_Intersections
  12.53%  pov::All_Sphere_Intersections
   9.81%  pov::Intersect_BBox_Tree

I verified there are no code-gen differences for any of the above hot functions.

Running size on povray_r_exe.out shows a slight code-size decrease of 344 bytes for the text section:

Compiled with 82d6d385f97: 1101505
Compiled with a2f4be3dae0: 1101161

Curiously, there’s a meaningful difference for pov::All_Sphere_Intersections, which seems to be caused by the following adrp instruction (with no code-gen changes in All_Sphere_Intersections):

Compiled with 82d6d385f97:
 18.07 │4aec44:   adrp   x0, 4e0000
  1.77 │4aec48:   ldr    d28, [x0, #2784]

Compiled with a2f4be3dae0:
 28.93 │4aeae4:   adrp   x0, 4e0000
  1.27 │4aeae8:   ldr    d28, [x0, #2432]

This seems to come from the following condition in Intersect_Sphere (which gets inlined into All_Sphere_Intersections):

  if ((OCSquared >= Radius2) && (t_Closest_Approach < EPSILON))

As far as I can see, there’s no difference between the two adrp instructions except their address (4aec44 vs 4aeae4). And as far as I know, adrp only calculates a pc-relative page address (and does not load any data).

To check for possible icache misses I used the L1I_CACHE_REFILL counter, and it turns out that there are 64% more L1 icache misses for the above adrp instruction with a2f4be3dae0 compared to 82d6d385f97, which may (partially) explain the performance difference? However, perf stat shows around 7% more L1 icache misses for the whole program run with 82d6d385f97 compared to a2f4be3dae0.

I could (repeatedly) reproduce the issue on two neoverse-v2 machines.

The full command line passed to the compiler was:
"-O3 -Wl,-z,muldefs -lm -fallow-argument-mismatch -fpermissive -fstack-arrays -flto -march=native -mcpu=neoverse-v2"

Thanks,
Prathamesh
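
To make the adrp point above concrete, here is a minimal sketch that models what adrp writes to its destination register for the two instruction addresses quoted above, and where each instruction falls within a cache line. The helper names are hypothetical, and the 64-byte L1 icache line size is an assumption that is not taken from the measurements.

```python
# Sketch of what `adrp` computes, plus where each copy sits in a cache line.
# Assumption (not from the report): 64-byte L1 icache lines.
# The helper names below are hypothetical, for illustration only.

PAGE_SIZE = 4096   # adrp operates on 4 KiB pages
CACHE_LINE = 64    # assumed L1 icache line size


def page_delta_for(pc: int, target: int) -> int:
    """Signed page count that adrp must encode to reach `target` from `pc`."""
    return (target >> 12) - (pc >> 12)


def adrp_result(pc: int, page_delta: int) -> int:
    """Value adrp writes: the page base of `pc` plus `page_delta` pages."""
    return (pc & ~(PAGE_SIZE - 1)) + page_delta * PAGE_SIZE


def line_offset(addr: int) -> int:
    """Byte offset of `addr` within its (assumed) 64-byte cache line."""
    return addr % CACHE_LINE


TARGET = 0x4E0000  # page address shown in both disassemblies
for label, pc in (("82d6d385f97", 0x4AEC44), ("a2f4be3dae0", 0x4AEAE4)):
    delta = page_delta_for(pc, TARGET)
    print(f"{label}: pc={pc:#x} page_delta={delta} "
          f"result={adrp_result(pc, delta):#x} "
          f"line_offset={line_offset(pc)}")
```

Under these assumptions both instructions encode the same 50-page delta and produce 0x4e0000. They differ only in placement: offset 4 versus offset 36 within their respective 64-byte lines. So the instruction itself is doing identical work in both builds.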