public inbox for gcc-bugs@sourceware.org
* [Bug c/103850] New: missed optimization in AVX code
@ 2021-12-28 10:10 martin@mpa-garching.mpg.de
  2021-12-28 10:24 ` [Bug target/103850] " pinskia at gcc dot gnu.org
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: martin@mpa-garching.mpg.de @ 2021-12-28 10:10 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850

            Bug ID: 103850
           Summary: missed optimization in AVX code
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: martin@mpa-garching.mpg.de
  Target Milestone: ---

Created attachment 52076
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52076&action=edit
test case

(I'm reporting this under "C" because I don't know which optimizer is
responsible for this, but I observe the same behaviour in C++ programs as well.)

This test case was distilled from a hot loop in a library computing spherical
harmonic transforms. Apparently it can be compiled in a way that gives close to
theoretical peak performance at least on my hardware (Zen 2), but this only
happens if the statements in the inner loop are arranged in a specific way.
Trivial rearrangements result in performance that is about 30% lower.
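
To illustrate what a "trivial rearrangement" can mean in a loop of this kind,
here is a minimal sketch. It is hypothetical and NOT taken from attachment
52076: the recurrence, the constant factor, the trip count N and all function
names are assumptions made for illustration only. Both kernels run the same
operations; only the order of independent statements inside the loop body
differs, so they should produce the same sums.

```c
#include <assert.h>
#include <stddef.h>

#define N 16  /* hypothetical trip count, chosen only for this sketch */

/* Arrangement A: accumulate first, then advance the recurrence. */
static void kernel_a(const double *w, double x, double *sum1, double *sum2)
{
    const double f = 0.5 * x + 0.25;  /* made-up recurrence factor */
    double l1 = 0.0, l2 = 1.0, s1 = 0.0, s2 = 0.0;
    for (size_t i = 0; i < N; ++i) {
        s1 += w[i] * l2;      /* reads l2 before it is updated */
        l1 = f * l2 + l1;     /* advance first recurrence value */
        s2 += w[i] * l1;      /* reads the updated l1 */
        l2 = f * l1 + l2;     /* advance second recurrence value */
    }
    *sum1 = s1;
    *sum2 = s2;
}

/* Arrangement B: the same statements, with each independent pair swapped. */
static void kernel_b(const double *w, double x, double *sum1, double *sum2)
{
    const double f = 0.5 * x + 0.25;
    double l1 = 0.0, l2 = 1.0, s1 = 0.0, s2 = 0.0;
    for (size_t i = 0; i < N; ++i) {
        l1 = f * l2 + l1;     /* does not modify l2 ... */
        s1 += w[i] * l2;      /* ... so this still reads the old l2 */
        l2 = f * l1 + l2;     /* does not modify l1 ... */
        s2 += w[i] * l1;      /* ... so this still reads the updated l1 */
    }
    *sum1 = s1;
    *sum2 = s2;
}

/* Absolute difference without pulling in <math.h>. */
static double absdiff(double a, double b)
{
    return a > b ? a - b : b - a;
}

/* Returns 1 if both arrangements agree to within a small tolerance. */
int arrangements_agree(void)
{
    double w[N], a1, a2, b1, b2;
    for (size_t i = 0; i < N; ++i)
        w[i] = 1.0 / (double)(i + 1);
    kernel_a(w, 0.3, &a1, &a2);
    kernel_b(w, 0.3, &b1, &b2);
    return absdiff(a1, b1) <= 1e-9 && absdiff(a2, b2) <= 1e-9;
}

/* Convenience wrapper that aborts if the two arrangements disagree. */
void self_check(void)
{
    assert(arrangements_agree());
}
```

This sketch only shows that such reorderings leave the result unchanged; it
makes no claim about which arrangement is faster, and it does not reproduce
the timings reported for the real test case.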

I would have expected that gcc would be able to spot this kind of rearrangement
and do it by itself, but this doesn't seem to be the case at the moment. If that
could be fixed, that would obviously be great, but if not, I'd be grateful for
any tips on how the most "efficient" arrangements can be found for such critical
loops without resorting to trial and error.

The loops in question start at lines 27 and 78 in the attached test case.
On my machine the code reports

slow kernel version: 45.317578 GFlops/s
fast kernel version: 67.083952 GFlops/s

when compiled with "-O3 -march=znver2 -ffast-math -W -Wall"

Clang and Intel icx show the same discrepancy, so it seems that the required
re-ordering is indeed hard to do.


Thread overview: 7+ messages
2021-12-28 10:10 [Bug c/103850] New: missed optimization in AVX code martin@mpa-garching.mpg.de
2021-12-28 10:24 ` [Bug target/103850] " pinskia at gcc dot gnu.org
2021-12-28 10:32 ` martin@mpa-garching.mpg.de
2021-12-28 10:41 ` martin@mpa-garching.mpg.de
2022-01-04 13:45 ` rguenth at gcc dot gnu.org
2022-01-04 13:55 ` rguenth at gcc dot gnu.org
2022-01-04 14:12 ` martin@mpa-garching.mpg.de
