public inbox for gcc-bugs@sourceware.org
* [Bug c/103850] New: missed optimization in AVX code

From: martin@mpa-garching.mpg.de
Date: 2021-12-28 10:10 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

            Bug ID: 103850
           Summary: missed optimization in AVX code
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: martin@mpa-garching.mpg.de
  Target Milestone: ---

Created attachment 52076
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52076&action=edit
test case

(I'm reporting this under "C" because I don't know which optimizer is
responsible, but I observe the same behaviour in C++ programs as well.)

This test case was distilled from a hot loop in a library computing
spherical harmonic transforms. Apparently it can be compiled in a way that
gives close to theoretical peak performance, at least on my hardware
(Zen 2), but this only happens if the statements in the inner loop are
arranged in a specific way. Trivial rearrangements result in performance
that is about 30% lower.

I would have expected gcc to be able to spot this kind of rearrangement and
do it by itself, but this doesn't seem to be the case at the moment. If that
could be fixed, that would obviously be great; if not, I'd be grateful for
any tips on how the most efficient arrangement can be found for such
critical loops without resorting to trial and error.

The loops in question start at lines 27 and 78 in the attached test case.
On my machine the code reports

  slow kernel version: 45.317578 GFlops/s
  fast kernel version: 67.083952 GFlops/s

when compiled with "-O3 -march=znver2 -ffast-math -W -Wall".

Clang and Intel icx show the same discrepancy, so it seems that the
required re-ordering is indeed hard to do.
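The attached test case itself is not reproduced in this thread, but the
pattern it describes (a two-term recurrence feeding a bank of independent FMA
accumulators, where only the statement order differs between the fast and
slow variants) can be sketched in plain C. The function names, the recurrence
coefficients, and the data layout below are illustrative assumptions, not the
actual kernel from attachment 52076:

```c
#include <stddef.h>

/* Hypothetical sketch of the kind of kernel described: a scalar
 * two-term recurrence (p0, p1) feeds four independent accumulators.
 * The two variants compute identical results and differ only in
 * statement order inside the loop body. */
static void kernel_variant_a(const double *a, const double *b,
                             size_t n, double acc[4])
{
    double p0 = 1.0, p1 = 0.5;
    for (size_t i = 0; i < n; ++i) {
        /* recurrence update first ... */
        double tmp = a[2*i] * p1 + a[2*i+1];
        /* ... then the accumulator FMAs */
        acc[0] += b[2*i]   * p0;
        acc[1] += b[2*i]   * p1;
        acc[2] += b[2*i+1] * p0;
        acc[3] += b[2*i+1] * p1;
        p0 = p1;
        p1 = tmp;
    }
}

static void kernel_variant_b(const double *a, const double *b,
                             size_t n, double acc[4])
{
    double p0 = 1.0, p1 = 0.5;
    for (size_t i = 0; i < n; ++i) {
        /* accumulator FMAs first ... */
        acc[0] += b[2*i]   * p0;
        acc[1] += b[2*i]   * p1;
        acc[2] += b[2*i+1] * p0;
        acc[3] += b[2*i+1] * p1;
        /* ... then the recurrence update */
        double tmp = a[2*i] * p1 + a[2*i+1];
        p0 = p1;
        p1 = tmp;
    }
}
```

Since the orderings are semantically equivalent, the compiler would be free
to transform one into the other; the report is that only one of them reaches
near-peak throughput.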
* [Bug target/103850] missed optimization in AVX code

From: pinskia at gcc dot gnu.org
Date: 2021-12-28 10:24 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed         |Added
----------------------------------------------------------------------------
          Severity|normal          |enhancement

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is due to not scheduling on x86 before register allocation. Adding
-fschedule-insns seems to get the slow kernel to be the same speed as the
fast kernel.
* [Bug target/103850] missed optimization in AVX code

From: martin@mpa-garching.mpg.de
Date: 2021-12-28 10:32 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

--- Comment #2 from Martin Reinecke <martin@mpa-garching.mpg.de> ---
Thanks! This flag indeed causes both kernels to have the same speed, but (at
least for me) it's slower than both original versions...

  slow kernel version: 29.027915 GFlops/s
  fast kernel version: 29.008313 GFlops/s

Strange.
* [Bug target/103850] missed optimization in AVX code

From: martin@mpa-garching.mpg.de
Date: 2021-12-28 10:41 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

--- Comment #3 from Martin Reinecke <martin@mpa-garching.mpg.de> ---
Just for completeness, this is the CPU I'm running on:

  vendor_id  : AuthenticAMD
  cpu family : 23
  model      : 96
  model name : AMD Ryzen 7 4800H with Radeon Graphics
  stepping   : 1
  microcode  : 0x8600103
* [Bug target/103850] missed optimization in AVX code

From: rguenth at gcc dot gnu.org
Date: 2022-01-04 13:45 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed         |Added
----------------------------------------------------------------------------
            Status|UNCONFIRMED     |NEW
    Ever confirmed|0               |1
  Last reconfirmed|                |2022-01-04

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.
* [Bug target/103850] missed optimization in AVX code

From: rguenth at gcc dot gnu.org
Date: 2022-01-04 13:55 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note the issue can be reproduced without -ffast-math as well, where the
functions are nearly identical, so I fear you are running into some
micro-architectural hazard. Maybe

.L3:
        vmovapd %ymm2, %ymm0
        vmovapd %ymm3, %ymm1
.L2:
        vbroadcastsd    (%rsi), %ymm2
        vbroadcastsd    8(%rsi), %ymm4
        addq    $32, %rdx
        addq    $16, %rsi
        vmovapd %ymm2, %ymm3
        vfmadd132pd     -32(%rsp), %ymm4, %ymm2
        vfmadd132pd     %ymm15, %ymm4, %ymm3
        vbroadcastsd    -24(%rdx), %ymm4
        vfmadd132pd     %ymm1, %ymm14, %ymm3
        vmovapd %ymm1, %ymm14
        vfmadd231pd     %ymm1, %ymm4, %ymm11
        vfmadd231pd     %ymm0, %ymm4, %ymm7
        vbroadcastsd    -8(%rdx), %ymm4
        vfmadd132pd     %ymm0, %ymm13, %ymm2
        vbroadcastsd    -32(%rdx), %ymm13
        vfmadd231pd     %ymm1, %ymm4, %ymm9
        vfmadd231pd     %ymm0, %ymm4, %ymm5
        vfmadd231pd     %ymm1, %ymm13, %ymm12
        vfmadd231pd     %ymm0, %ymm13, %ymm8
        vbroadcastsd    -16(%rdx), %ymm13
        vfmadd231pd     %ymm1, %ymm13, %ymm10
        vfmadd231pd     %ymm0, %ymm13, %ymm6
        vmovapd %ymm0, %ymm13
        cmpq    %rdx, %rax
        jne     .L3

is easier to handle since there's only one data dependence of the followup
loads on the addq $32, %rdx, but for (slow)

.L8:
        vmovapd %ymm2, %ymm0
        vmovapd %ymm3, %ymm1
.L7:
        vbroadcastsd    8(%rdx), %ymm2
        vbroadcastsd    (%rdx), %ymm3
        addq    $16, %rsi
        addq    $32, %rdx
        vbroadcastsd    -8(%rsi), %ymm4
        vfmadd231pd     %ymm1, %ymm2, %ymm11
        vfmadd231pd     %ymm0, %ymm2, %ymm7
        vbroadcastsd    -8(%rdx), %ymm2
        vfmadd231pd     %ymm1, %ymm3, %ymm12
        vfmadd231pd     %ymm0, %ymm3, %ymm8
        vbroadcastsd    -16(%rdx), %ymm3
        vfmadd231pd     %ymm1, %ymm2, %ymm9
        vfmadd231pd     %ymm0, %ymm2, %ymm5
        vbroadcastsd    -16(%rsi), %ymm2
        vfmadd231pd     %ymm1, %ymm3, %ymm10
        vfmadd231pd     %ymm0, %ymm3, %ymm6
        vmovapd %ymm2, %ymm3
        vfmadd132pd     -32(%rsp), %ymm4, %ymm2
        vfmadd132pd     %ymm15, %ymm4, %ymm3
        vfmadd132pd     %ymm1, %ymm14, %ymm3
        vmovapd %ymm1, %ymm14
        vfmadd132pd     %ymm0, %ymm13, %ymm2
        vmovapd %ymm0, %ymm13
        cmpq    %rsi, %rax
        jne     .L8

both increments impose dependences on the following loads.
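The kind of dependence described here, where loads must wait for the
preceding pointer or index increment, can sometimes be mitigated at source
level by a software-pipelining idiom: load the next iteration's input before
doing the current iteration's arithmetic. The following is a hypothetical
plain-C illustration of the idiom, not code from the attached test case:

```c
#include <stddef.h>

/* Hypothetical illustration: preload one element and fetch the NEXT
 * iteration's input at the top of the loop, so the load does not sit
 * behind this iteration's arithmetic in the dependence chain. */
static double sum_squares_pipelined(const double *x, size_t n)
{
    if (n == 0) return 0.0;
    double acc = 0.0;
    double cur = x[0];            /* preload first element */
    for (size_t i = 1; i < n; ++i) {
        double next = x[i];       /* load for the next use first ... */
        acc += cur * cur;         /* ... then do this iteration's work */
        cur = next;
    }
    acc += cur * cur;             /* drain the last preloaded element */
    return acc;
}
```

Whether this helps in practice depends on the register allocator and
scheduler, which is exactly what the -fschedule-insns experiment above
probes.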
* [Bug target/103850] missed optimization in AVX code

From: martin@mpa-garching.mpg.de
Date: 2022-01-04 14:12 UTC
To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

--- Comment #6 from Martin Reinecke <martin@mpa-garching.mpg.de> ---
I would have expected that this does not make a significant difference,
assuming that speculative execution works and the branch predictor takes
the jump backwards at the loop's end. In that picture both versions of the
loop should look exactly the same. But my knowledge about all this is
admittedly really vague...