public inbox for gcc-bugs@sourceware.org
* [Bug c/103850] New: missed optimization in AVX code
@ 2021-12-28 10:10 martin@mpa-garching.mpg.de
2021-12-28 10:24 ` [Bug target/103850] " pinskia at gcc dot gnu.org
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: martin@mpa-garching.mpg.de @ 2021-12-28 10:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
Bug ID: 103850
Summary: missed optimization in AVX code
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: martin@mpa-garching.mpg.de
Target Milestone: ---
Created attachment 52076
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52076&action=edit
test case
(I'm reporting this under "C" because I don't know which optimizer is
responsible, but I observe the same behaviour in C++ programs as well.)
This test case was distilled from a hot loop in a library computing spherical
harmonic transforms. Apparently it can be compiled in a way that gives close to
theoretical peak performance at least on my hardware (Zen 2), but this only
happens if the statements in the inner loop are arranged in a specific way.
Trivial rearrangements result in performance that is about 30% lower.
I would have expected that gcc would be able to spot this kind of rearrangement
and do it by itself, but this doesn't seem to be the case at the moment. If that
could be fixed, that would obviously be great; if not, I'd be grateful for any
tips on how the most efficient arrangement can be found for such critical
loops without resorting to trial and error.
The loops in question start at lines 27 and 78 in the attached test case.
On my machine the code reports
slow kernel version: 45.317578 GFlops/s
fast kernel version: 67.083952 GFlops/s
when compiled with "-O3 -march=znver2 -ffast-math -W -Wall"
Clang and Intel icx show the same discrepancy, so it seems that the required
re-ordering is indeed hard to do.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug target/103850] missed optimization in AVX code
From: pinskia at gcc dot gnu.org @ 2021-12-28 10:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What            |Removed      |Added
----------------------------------------------------------------------------
Severity        |normal       |enhancement
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is due to instructions not being scheduled on x86 before register
allocation.
Adding -fschedule-insns seems to get the slow kernel to be the same speed as
the fast kernel.
* [Bug target/103850] missed optimization in AVX code
From: martin@mpa-garching.mpg.de @ 2021-12-28 10:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
--- Comment #2 from Martin Reinecke <martin@mpa-garching.mpg.de> ---
Thanks! This flag indeed causes both kernels to have the same speed, but (at
least for me) it's slower than both original versions...
slow kernel version: 29.027915 GFlops/s
fast kernel version: 29.008313 GFlops/s
Strange.
* [Bug target/103850] missed optimization in AVX code
From: martin@mpa-garching.mpg.de @ 2021-12-28 10:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
--- Comment #3 from Martin Reinecke <martin@mpa-garching.mpg.de> ---
Just for completeness, this is the CPU I'm running on:
vendor_id : AuthenticAMD
cpu family : 23
model : 96
model name : AMD Ryzen 7 4800H with Radeon Graphics
stepping : 1
microcode : 0x8600103
* [Bug target/103850] missed optimization in AVX code
From: rguenth at gcc dot gnu.org @ 2022-01-04 13:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
Richard Biener <rguenth at gcc dot gnu.org> changed:
What            |Removed      |Added
----------------------------------------------------------------------------
Status          |UNCONFIRMED  |NEW
Ever confirmed  |0            |1
Last reconfirmed|             |2022-01-04
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.
* [Bug target/103850] missed optimization in AVX code
From: rguenth at gcc dot gnu.org @ 2022-01-04 13:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note the issue can also be reproduced without -ffast-math, where the generated
functions are nearly identical, so I fear you are running into some
micro-architectural hazard. Maybe
.L3:
vmovapd %ymm2, %ymm0
vmovapd %ymm3, %ymm1
.L2:
vbroadcastsd (%rsi), %ymm2
vbroadcastsd 8(%rsi), %ymm4
addq $32, %rdx
addq $16, %rsi
vmovapd %ymm2, %ymm3
vfmadd132pd -32(%rsp), %ymm4, %ymm2
vfmadd132pd %ymm15, %ymm4, %ymm3
vbroadcastsd -24(%rdx), %ymm4
vfmadd132pd %ymm1, %ymm14, %ymm3
vmovapd %ymm1, %ymm14
vfmadd231pd %ymm1, %ymm4, %ymm11
vfmadd231pd %ymm0, %ymm4, %ymm7
vbroadcastsd -8(%rdx), %ymm4
vfmadd132pd %ymm0, %ymm13, %ymm2
vbroadcastsd -32(%rdx), %ymm13
vfmadd231pd %ymm1, %ymm4, %ymm9
vfmadd231pd %ymm0, %ymm4, %ymm5
vfmadd231pd %ymm1, %ymm13, %ymm12
vfmadd231pd %ymm0, %ymm13, %ymm8
vbroadcastsd -16(%rdx), %ymm13
vfmadd231pd %ymm1, %ymm13, %ymm10
vfmadd231pd %ymm0, %ymm13, %ymm6
vmovapd %ymm0, %ymm13
cmpq %rdx, %rax
jne .L3
is easier to handle, since the follow-up loads carry only a single data
dependence, on the addq $32, %rdx, but in the (slow)
.L8:
vmovapd %ymm2, %ymm0
vmovapd %ymm3, %ymm1
.L7:
vbroadcastsd 8(%rdx), %ymm2
vbroadcastsd (%rdx), %ymm3
addq $16, %rsi
addq $32, %rdx
vbroadcastsd -8(%rsi), %ymm4
vfmadd231pd %ymm1, %ymm2, %ymm11
vfmadd231pd %ymm0, %ymm2, %ymm7
vbroadcastsd -8(%rdx), %ymm2
vfmadd231pd %ymm1, %ymm3, %ymm12
vfmadd231pd %ymm0, %ymm3, %ymm8
vbroadcastsd -16(%rdx), %ymm3
vfmadd231pd %ymm1, %ymm2, %ymm9
vfmadd231pd %ymm0, %ymm2, %ymm5
vbroadcastsd -16(%rsi), %ymm2
vfmadd231pd %ymm1, %ymm3, %ymm10
vfmadd231pd %ymm0, %ymm3, %ymm6
vmovapd %ymm2, %ymm3
vfmadd132pd -32(%rsp), %ymm4, %ymm2
vfmadd132pd %ymm15, %ymm4, %ymm3
vfmadd132pd %ymm1, %ymm14, %ymm3
vmovapd %ymm1, %ymm14
vfmadd132pd %ymm0, %ymm13, %ymm2
vmovapd %ymm0, %ymm13
cmpq %rsi, %rax
jne .L8
both increments impose dependences on the following loads.
* [Bug target/103850] missed optimization in AVX code
From: martin@mpa-garching.mpg.de @ 2022-01-04 14:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
--- Comment #6 from Martin Reinecke <martin@mpa-garching.mpg.de> ---
I would have expected that this does not make a significant difference,
assuming that speculative execution works and the branch predictor takes the
backward jump at the end of the loop. In that picture both versions of the
loop should look exactly the same.
But my knowledge about all this is admittedly rather vague...