public inbox for gcc-bugs@sourceware.org
* [Bug target/114860] New: [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2
@ 2024-04-26  6:25 prathamesh3492 at gcc dot gnu.org
  2024-04-26 14:27 ` [Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa rguenth at gcc dot gnu.org
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: prathamesh3492 at gcc dot gnu.org @ 2024-04-26  6:25 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

            Bug ID: 114860
           Summary: [aarch64] 511.povray regresses by ~5.5% with -O3 -flto
                    -march=native -mcpu=neoverse-v2
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: prathamesh3492 at gcc dot gnu.org
  Target Milestone: ---

Hi,
It seems the performance of the 511.povray benchmark regresses by ~5.5% with
-O3 -flto -march=native -mcpu=neoverse-v2, and by ~1.6% without LTO.

This seems to have started with the following commit:
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=a2f4be3dae04fa8606d1cc8451f0b9d450f7e6e6

Reverting it brings back performance. I am investigating further.

Thanks,
Prathamesh


* [Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
  2024-04-26  6:25 [Bug target/114860] New: [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 prathamesh3492 at gcc dot gnu.org
@ 2024-04-26 14:27 ` rguenth at gcc dot gnu.org
  2024-04-26 14:30 ` tnfchris at gcc dot gnu.org
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-04-26 14:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |14.0


* [Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
  2024-04-26  6:25 [Bug target/114860] New: [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 prathamesh3492 at gcc dot gnu.org
  2024-04-26 14:27 ` [Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa rguenth at gcc dot gnu.org
@ 2024-04-26 14:30 ` tnfchris at gcc dot gnu.org
  2024-04-26 14:46 ` tnfchris at gcc dot gnu.org
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-04-26 14:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #1 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Hmm

I am unable to reproduce this with -O3 -flto -mcpu=neoverse-v2 on a
neoverse-v2 machine.

Are any other options required?

Also, that code was new in GCC 14 and was partially reverted due to a register
allocation issue.

So if there is a performance difference, it's a missed optimization and not a
regression.


* [Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
  2024-04-26  6:25 [Bug target/114860] New: [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 prathamesh3492 at gcc dot gnu.org
  2024-04-26 14:27 ` [Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa rguenth at gcc dot gnu.org
  2024-04-26 14:30 ` tnfchris at gcc dot gnu.org
@ 2024-04-26 14:46 ` tnfchris at gcc dot gnu.org
  2024-05-01  6:22 ` tnfchris at gcc dot gnu.org
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-04-26 14:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
There's no change in the binaries aside from an alignment change in one of the
less hot functions.

I guess you recompile libc? In which case you need to compare against GCC 13.

As mentioned in the quoted commit, this reverts the general case for GCC 14.


* [Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
  2024-04-26  6:25 [Bug target/114860] New: [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 prathamesh3492 at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2024-04-26 14:46 ` tnfchris at gcc dot gnu.org
@ 2024-05-01  6:22 ` tnfchris at gcc dot gnu.org
  2024-05-03 10:45 ` prathamesh3492 at gcc dot gnu.org
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-05-01  6:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #3 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
I cannot reproduce this even when recompiling libc.


* [Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
  2024-04-26  6:25 [Bug target/114860] New: [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 prathamesh3492 at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2024-05-01  6:22 ` tnfchris at gcc dot gnu.org
@ 2024-05-03 10:45 ` prathamesh3492 at gcc dot gnu.org
  2024-05-03 21:20 ` pinskia at gcc dot gnu.org
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: prathamesh3492 at gcc dot gnu.org @ 2024-05-03 10:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #4 from prathamesh3492 at gcc dot gnu.org ---
Hi Tamar,
Sorry for the late response.

perf profile for povray with LTO:

Compiled with 82d6d385f97 (commit before a2f4be3dae0):
20.03%  pov::All_CSG_Intersect_Intersections
16.42%  pov::All_Plane_Intersections
10.29%  pov::All_Sphere_Intersections
10.10%  pov::Intersect_BBox_Tree

Compiled with a2f4be3dae0:
19.51%  pov::All_CSG_Intersect_Intersections
16.91%  pov::All_Plane_Intersections
12.53%  pov::All_Sphere_Intersections
 9.81%  pov::Intersect_BBox_Tree

I verified there are no code-gen differences for any of the above hot
functions.
Running size on povray_r_exe.out shows a slight code-size decrease of 344 bytes
for the text section:
Compiled with 82d6d385f97: 1101505
Compiled with a2f4be3dae0: 1101161

Curiously, there’s a meaningful difference for pov::All_Sphere_Intersections,
which seems to be caused by the following adrp instruction (with no code-gen
changes in All_Sphere_Intersections):

Compiled with 82d6d385f97:
 18.07 │4aec44:   adrp  x0, 4e0000 <pov::SetCommandOption(POVMSData*, unsigned int, pov::shelldata*) [clone .isra.0]+0x1c0>
  1.77 │4aec48:   ldr   d28, [x0, #2784]

Compiled with a2f4be3dae0:
 28.93 │4aeae4:   adrp  x0, 4e0000 <pov::Warning(unsigned int, char const*, ...) [clone .constprop.0]+0x100>
  1.27 │4aeae8:   ldr   d28, [x0, #2432]

This seems to come from the following condition in Intersect_Sphere (which gets
inlined into All_Sphere_Intersections):

if ((OCSquared >= Radius2) && (t_Closest_Approach < EPSILON))

As far as I can see, there’s no difference between the two adrp instructions
except the address (4aec44 vs 4aeae4). And as far as I know, adrp only
calculates a pc-relative page address (and does not load any data). To check for
possible icache misses I used the L1I_CACHE_REFILL counter, and it turns out
that there are 64% more L1 icache misses for the above adrp instruction with
a2f4be3dae0 compared to 82d6d385f97, which may (partially) explain the
performance difference? However, perf stat shows around 7% more L1 icache
misses over the whole program run with 82d6d385f97 compared to a2f4be3dae0.
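
A minimal sketch of the adrp arithmetic described above, using the addresses from the perf annotate output. The helper names are made up for illustration, and the model is simplified to the page calculation only (4 KiB pages, as adrp uses). It suggests that both builds encode the same page immediate, so the two instructions differ only in where they sit.

```python
# Simplified model of AArch64 ADRP: result = page(PC) + (imm << 12).
# Helper names are hypothetical; addresses come from the perf output above.
PAGE_MASK = ~0xFFF  # clears the low 12 bits (4 KiB page)

def adrp_immediate(pc: int, target_page: int) -> int:
    """Page-count immediate that an ADRP at `pc` needs to reach `target_page`."""
    return (target_page - (pc & PAGE_MASK)) >> 12

def adrp_result(pc: int, imm: int) -> int:
    """Value ADRP writes to its destination register."""
    return (pc & PAGE_MASK) + (imm << 12)

old_pc, new_pc = 0x4AEC44, 0x4AEAE4  # 82d6d385f97 vs a2f4be3dae0
target = 0x4E0000

old_imm = adrp_immediate(old_pc, target)
new_imm = adrp_immediate(new_pc, target)
print(old_imm, new_imm)                         # 50 50
print(hex(adrp_result(new_pc, new_imm)))        # 0x4e0000
# Addresses read by the following ldr in each build:
print(hex(target + 2784), hex(target + 2432))   # 0x4e0ae0 0x4e0980
```

No memory access is modelled for adrp itself; only the subsequent ldr touches data.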

I could (repeatedly) reproduce the issue on two neoverse-v2 machines.
The full command line passed to the compiler was:
"-O3 -Wl,-z,muldefs -lm -fallow-argument-mismatch -fpermissive -fstack-arrays
-flto -march=native -mcpu=neoverse-v2"

Thanks,
Prathamesh


* [Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
  2024-04-26  6:25 [Bug target/114860] New: [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 prathamesh3492 at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2024-05-03 10:45 ` prathamesh3492 at gcc dot gnu.org
@ 2024-05-03 21:20 ` pinskia at gcc dot gnu.org
  2024-05-07  7:45 ` rguenth at gcc dot gnu.org
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: pinskia at gcc dot gnu.org @ 2024-05-03 21:20 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to prathamesh3492 from comment #4)
> To check for any
> possible icache misses I used L1I_CACHE_REFILL counter, and turns out that
> there are 64% more L1 icache misses for above adrp instruction with
> a2f4be3dae0 compared to 82d6d385f97, which may (partially) explain the
> performance difference ? Although perf stat shows there are around 7% more
> L1 icache misses for whole program run with 82d6d385f97 compared to
> a2f4be3dae0.

This makes it sound like there is a code alignment issue or a branch
misprediction issue going on.

bad alignment:  4aeae4
good alignment: 4aec44

The good alignment case is (almost) at the start of an icache line while the
bad alignment case is in the middle. (I am assuming 64-byte cache lines, which I
think is correct.)

Maybe look at mispredicted branches too. It might be that the branch leading to
this code is mispredicted more often because its address is now interfering
with another branch.

It might just have been bad luck that caused this regression in both cases
really; alignment differences and/or address differences can be bad luck.
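
The offsets behind the "start" versus "middle" observation can be checked with a few lines. This is only a sketch under the same 64-byte cache-line assumption stated above; the function names are illustrative.

```python
CACHE_LINE = 64  # assumed icache line size in bytes
INSN_SIZE = 4    # AArch64 instructions are 4 bytes wide

def line_offset(addr: int, line: int = CACHE_LINE) -> int:
    """Byte offset of `addr` within its cache line."""
    return addr % line

def insns_left_in_line(addr: int, line: int = CACHE_LINE) -> int:
    """Instructions after `addr` that still fit in the same cache line."""
    return (line - line_offset(addr, line)) // INSN_SIZE - 1

bad, good = 0x4AEAE4, 0x4AEC44
print(line_offset(bad), line_offset(good))                # 36 4
print(insns_left_in_line(bad), insns_left_in_line(good))  # 6 14
```

With these assumptions the good build's adrp sits one instruction into its line, while the bad build's adrp sits just past the middle.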


* [Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
  2024-04-26  6:25 [Bug target/114860] New: [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 prathamesh3492 at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2024-05-03 21:20 ` pinskia at gcc dot gnu.org
@ 2024-05-07  7:45 ` rguenth at gcc dot gnu.org
  2024-05-16 17:17 ` tnfchris at gcc dot gnu.org
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: rguenth at gcc dot gnu.org @ 2024-05-07  7:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|14.0                        |14.2

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 14.1 is being released, retargeting bugs to GCC 14.2.


* [Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
  2024-04-26  6:25 [Bug target/114860] New: [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 prathamesh3492 at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2024-05-07  7:45 ` rguenth at gcc dot gnu.org
@ 2024-05-16 17:17 ` tnfchris at gcc dot gnu.org
  2024-05-22 11:27 ` prathamesh3492 at gcc dot gnu.org
  2024-05-22 11:36 ` tnfchris at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-05-16 17:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Yeah, it's most likely an alignment issue, especially as there are no code
changes.

We run our benchmarking with different flags, so that may be why we don't see it.
The loop seems misaligned; you can try increasing the alignment guarantee to
check, e.g. -falign-loops=5.

But ultimately, I think it's just bad luck. We don't align loops and labels if
they require too many padding instructions.
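
As a rough illustration of what such an alignment request means: as far as I read the GCC manual, -falign-loops=n aligns to the next power of two greater than or equal to n, so n=5 would request an 8-byte boundary. The sketch below assumes that reading; the loop address is hypothetical, and the function names are not GCC's.

```python
def effective_alignment(n: int) -> int:
    """Next power of two >= n (assumed rounding rule for -falign-*=n)."""
    return 1 << (n - 1).bit_length()

def padding_bytes(addr: int, n: int) -> int:
    """Bytes of padding needed to move `addr` up to the requested boundary."""
    return (-addr) % effective_alignment(n)

print(effective_alignment(5))    # 8
print(padding_bytes(0x1004, 5))  # 4  (hypothetical loop-head address)
print(padding_bytes(0x1008, 5))  # 0  (already aligned)
```

Under that assumption the cost on AArch64 is at most one 4-byte padding instruction per loop, which fits the trade-off between alignment and padding described above.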


* [Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
  2024-04-26  6:25 [Bug target/114860] New: [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 prathamesh3492 at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2024-05-16 17:17 ` tnfchris at gcc dot gnu.org
@ 2024-05-22 11:27 ` prathamesh3492 at gcc dot gnu.org
  2024-05-22 11:36 ` tnfchris at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: prathamesh3492 at gcc dot gnu.org @ 2024-05-22 11:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #8 from prathamesh3492 at gcc dot gnu.org ---
Hi Tamar,
Using -falign-loops=5 indeed brings back the performance.
With -falign-loops=5 the adrp instruction has the same address (0x4ae784)
with and without a2f4be3dae0 (which reduces the misalignment to 4). So I guess
this really is a code-alignment issue?

(Also, in our latest builds the regression has seemingly gone away without any
adjustments to code alignment.)

Thanks,
Prathamesh


* [Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
  2024-04-26  6:25 [Bug target/114860] New: [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 prathamesh3492 at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2024-05-22 11:27 ` prathamesh3492 at gcc dot gnu.org
@ 2024-05-22 11:36 ` tnfchris at gcc dot gnu.org
  9 siblings, 0 replies; 11+ messages in thread
From: tnfchris at gcc dot gnu.org @ 2024-05-22 11:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860

--- Comment #9 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to prathamesh3492 from comment #8)
> Hi Tamar,
> Using -falign-loops=5 indeed brings back the performance.
> The adrp instruction has same address (0x4ae784) by setting -falign-loops=5
> (which reduces misalignment to 4) with/without a2f4be3dae0. So I guess this
> is really code-alignment issue ?
> 

Indeed, we don't aggressively align loops that would require too much padding,
so as not to bloat the binaries. That's why sometimes you just get unlucky and
the hot loop gets misaligned.
