public inbox for gcc-bugs@sourceware.org
* [Bug c/103850] New: missed optimization in AVX code
@ 2021-12-28 10:10 martin@mpa-garching.mpg.de
2021-12-28 10:24 ` [Bug target/103850] " pinskia at gcc dot gnu.org
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: martin@mpa-garching.mpg.de @ 2021-12-28 10:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
Bug ID: 103850
Summary: missed optimization in AVX code
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: martin@mpa-garching.mpg.de
Target Milestone: ---
Created attachment 52076
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52076&action=edit
test case
(I'm reporting this under "C" because I don't know which optimizer is
responsible, but I observe the same behaviour in C++ programs as well.)
This test case was distilled from a hot loop in a library computing spherical
harmonic transforms. Apparently it can be compiled in a way that gives close to
theoretical peak performance at least on my hardware (Zen 2), but this only
happens if the statements in the inner loop are arranged in a specific way.
Trivial rearrangements result in performance that is about 30% lower.
I would have expected that gcc would be able to spot this kind of rearrangement
and do it by itself, but this doesn't seem to be the case at the moment. If that
could be fixed, that would obviously be great; if not, I'd be grateful for any
tips on how the most efficient arrangement can be found for such critical
loops without resorting to trial and error.
The loops in question start at lines 27 and 78 in the attached test case.
On my machine the code reports
slow kernel version: 45.317578 GFlops/s
fast kernel version: 67.083952 GFlops/s
when compiled with "-O3 -march=znver2 -ffast-math -W -Wall"
Clang and Intel icx show the same discrepancy, so it seems that the required
re-ordering is indeed hard to do.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [Bug target/103850] missed optimization in AVX code
From: pinskia at gcc dot gnu.org @ 2021-12-28 10:24 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
Andrew Pinski <pinskia at gcc dot gnu.org> changed:
What            |Removed      |Added
----------------------------------------------------------------------------
Severity        |normal       |enhancement
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
This is due to instructions not being scheduled on x86 before register
allocation.
Adding -fschedule-insns seems to get the slow kernel to be the same speed as
the fast kernel.
* [Bug target/103850] missed optimization in AVX code
From: martin@mpa-garching.mpg.de @ 2021-12-28 10:32 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
--- Comment #2 from Martin Reinecke <martin@mpa-garching.mpg.de> ---
Thanks! This flag indeed causes both kernels to have the same speed, but (at
least for me) it's slower than both original versions...
slow kernel version: 29.027915 GFlops/s
fast kernel version: 29.008313 GFlops/s
Strange.
* [Bug target/103850] missed optimization in AVX code
From: martin@mpa-garching.mpg.de @ 2021-12-28 10:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
--- Comment #3 from Martin Reinecke <martin@mpa-garching.mpg.de> ---
Just for completeness, this is the CPU I'm running on:
vendor_id : AuthenticAMD
cpu family : 23
model : 96
model name : AMD Ryzen 7 4800H with Radeon Graphics
stepping : 1
microcode : 0x8600103
* [Bug target/103850] missed optimization in AVX code
From: rguenth at gcc dot gnu.org @ 2022-01-04 13:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
Richard Biener <rguenth at gcc dot gnu.org> changed:
What            |Removed      |Added
----------------------------------------------------------------------------
Status          |UNCONFIRMED  |NEW
Ever confirmed  |0            |1
Last reconfirmed|             |2022-01-04
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.
* [Bug target/103850] missed optimization in AVX code
From: rguenth at gcc dot gnu.org @ 2022-01-04 13:55 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note the issue can also be reproduced without -ffast-math, where the generated
functions are nearly identical, so I fear you are running into some
micro-architectural hazard. Maybe
.L3:
vmovapd %ymm2, %ymm0
vmovapd %ymm3, %ymm1
.L2:
vbroadcastsd (%rsi), %ymm2
vbroadcastsd 8(%rsi), %ymm4
addq $32, %rdx
addq $16, %rsi
vmovapd %ymm2, %ymm3
vfmadd132pd -32(%rsp), %ymm4, %ymm2
vfmadd132pd %ymm15, %ymm4, %ymm3
vbroadcastsd -24(%rdx), %ymm4
vfmadd132pd %ymm1, %ymm14, %ymm3
vmovapd %ymm1, %ymm14
vfmadd231pd %ymm1, %ymm4, %ymm11
vfmadd231pd %ymm0, %ymm4, %ymm7
vbroadcastsd -8(%rdx), %ymm4
vfmadd132pd %ymm0, %ymm13, %ymm2
vbroadcastsd -32(%rdx), %ymm13
vfmadd231pd %ymm1, %ymm4, %ymm9
vfmadd231pd %ymm0, %ymm4, %ymm5
vfmadd231pd %ymm1, %ymm13, %ymm12
vfmadd231pd %ymm0, %ymm13, %ymm8
vbroadcastsd -16(%rdx), %ymm13
vfmadd231pd %ymm1, %ymm13, %ymm10
vfmadd231pd %ymm0, %ymm13, %ymm6
vmovapd %ymm0, %ymm13
cmpq %rdx, %rax
jne .L3
is easier to handle, since the follow-up loads carry only a single data
dependence, on the addq $32, %rdx, but in the (slow)
.L8:
vmovapd %ymm2, %ymm0
vmovapd %ymm3, %ymm1
.L7:
vbroadcastsd 8(%rdx), %ymm2
vbroadcastsd (%rdx), %ymm3
addq $16, %rsi
addq $32, %rdx
vbroadcastsd -8(%rsi), %ymm4
vfmadd231pd %ymm1, %ymm2, %ymm11
vfmadd231pd %ymm0, %ymm2, %ymm7
vbroadcastsd -8(%rdx), %ymm2
vfmadd231pd %ymm1, %ymm3, %ymm12
vfmadd231pd %ymm0, %ymm3, %ymm8
vbroadcastsd -16(%rdx), %ymm3
vfmadd231pd %ymm1, %ymm2, %ymm9
vfmadd231pd %ymm0, %ymm2, %ymm5
vbroadcastsd -16(%rsi), %ymm2
vfmadd231pd %ymm1, %ymm3, %ymm10
vfmadd231pd %ymm0, %ymm3, %ymm6
vmovapd %ymm2, %ymm3
vfmadd132pd -32(%rsp), %ymm4, %ymm2
vfmadd132pd %ymm15, %ymm4, %ymm3
vfmadd132pd %ymm1, %ymm14, %ymm3
vmovapd %ymm1, %ymm14
vfmadd132pd %ymm0, %ymm13, %ymm2
vmovapd %ymm0, %ymm13
cmpq %rsi, %rax
jne .L8
both increments impose dependences on the following loads.
* [Bug target/103850] missed optimization in AVX code
From: martin@mpa-garching.mpg.de @ 2022-01-04 14:12 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
--- Comment #6 from Martin Reinecke <martin@mpa-garching.mpg.de> ---
I would have expected that this does not make a significant difference,
assuming that speculative execution works and the branch predictor takes the
backward jump at the end of the loop. In that picture both versions of the
loop should look exactly the same.
But my knowledge about all this is admittedly rather vague...