[Bug middle-end/95899] New: -funroll-loops does not duplicate accumulators when calculating reductions, failing to break up dependency chains


* [Bug middle-end/95899] New: -funroll-loops does not duplicate accumulators when calculating reductions, failing to break up dependency chains
@ 2020-06-25 15:10 elrodc at gmail dot com
  2020-06-25 15:22 ` [Bug middle-end/95899] " ktkachov at gcc dot gnu.org
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: elrodc at gmail dot com @ 2020-06-25 15:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899

            Bug ID: 95899
           Summary: -funroll-loops does not duplicate accumulators when
                    calculating reductions, failing to break up dependency
                    chains
           Product: gcc
           Version: 10.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: elrodc at gmail dot com
  Target Milestone: ---

Created attachment 48784
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48784&action=edit
cc -march=skylake-avx512 -mprefer-vector-width=512 -Ofast -funroll-loops -S dot.c -o dot.s

Sample code:

```
double dot(double* a, double* b, long N){
  double s = 0.0;
  for (long n = 0; n < N; n++){
    s += a[n] * b[n];
  }
  return s;
}
```

Relevant part of the asm:
```
.L4:
        vmovupd (%rdi,%r11), %zmm8
        vmovupd 64(%rdi,%r11), %zmm9
        vfmadd231pd     (%rsi,%r11), %zmm8, %zmm0
        vmovupd 128(%rdi,%r11), %zmm10
        vmovupd 192(%rdi,%r11), %zmm11
        vmovupd 256(%rdi,%r11), %zmm12
        vmovupd 320(%rdi,%r11), %zmm13
        vfmadd231pd     64(%rsi,%r11), %zmm9, %zmm0
        vmovupd 384(%rdi,%r11), %zmm14
        vmovupd 448(%rdi,%r11), %zmm15
        vfmadd231pd     128(%rsi,%r11), %zmm10, %zmm0
        vfmadd231pd     192(%rsi,%r11), %zmm11, %zmm0
        vfmadd231pd     256(%rsi,%r11), %zmm12, %zmm0
        vfmadd231pd     320(%rsi,%r11), %zmm13, %zmm0
        vfmadd231pd     384(%rsi,%r11), %zmm14, %zmm0
        vfmadd231pd     448(%rsi,%r11), %zmm15, %zmm0
        addq    $512, %r11
        cmpq    %r8, %r11
        jne     .L4

```

Skylake-AVX512's vfmadd231pd should have a throughput of 2/cycle, but a latency
of 4 cycles.

Because each unrolled instance accumulates into `%zmm0`, we are limited by the
dependency chain to 1 fma every 4 cycles.

It should use separate accumulators.
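
As a rough source-level sketch of what separate accumulators mean here (the name `dot4` and the choice of four scalar accumulators are illustrative assumptions, not the compiler's actual transformation), each partial sum gets its own dependency chain and the chains are combined only after the loop:

```c
#include <assert.h>

/* Hypothetical hand-unrolled variant of dot(); dot4 is not from the report.
   Four independent accumulators give four independent dependency chains. */
double dot4(const double *a, const double *b, long N) {
  assert(N >= 0);
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  long n = 0;
  /* Main loop: no accumulator depends on another within an iteration. */
  for (; n + 4 <= N; n += 4) {
    s0 += a[n + 0] * b[n + 0];
    s1 += a[n + 1] * b[n + 1];
    s2 += a[n + 2] * b[n + 2];
    s3 += a[n + 3] * b[n + 3];
  }
  /* Remainder loop handles the last N % 4 elements. */
  for (; n < N; n++)
    s0 += a[n] * b[n];
  /* Combine the partial sums once, after the loop. */
  return (s0 + s1) + (s2 + s3);
}
```

Splitting the sum this way changes the order of the floating-point additions, so the result can differ in rounding from the single-accumulator loop. The compiler is only free to do this on its own when reassociation is allowed, as it is under `-Ofast`.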

Additionally, if the loads are aligned, the core can sustain a throughput of 2
loads/cycle. Because we need 2 loads per fma, that limits us to only 1 fma per
cycle. If breaking the dependency chain were the primary motivation for
unrolling, we'd only want to unroll by 4, not 8: 4 cycles of latency times
1 fma per cycle gives 4 fmas in flight at once (OoO).

Something like a sum (1 load per add) would perform better with the 8x
unrolling seen here (at least, from 100 or so elements until it becomes memory
bound).
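
The arithmetic behind those unroll factors can be written down as a small helper. This is only a back-of-the-envelope sketch: `accumulators_needed` and its parameters are hypothetical names, and the sum case assumes the vector add has the same 4-cycle latency and 2/cycle issue rate as the fma, which the report does not state.

```c
#include <assert.h>

/* Hypothetical helper (not from the report): estimate how many independent
   accumulators are needed to hide the latency of a reduction operation. */
int accumulators_needed(int latency_cycles, int issue_per_cycle,
                        int loads_per_cycle, int loads_per_op) {
  assert(loads_per_op > 0);
  /* Operations per cycle permitted by the load ports. */
  int load_bound = loads_per_cycle / loads_per_op;
  /* Sustained rate is the smaller of the issue limit and the load limit. */
  int ops_per_cycle = issue_per_cycle < load_bound ? issue_per_cycle : load_bound;
  /* Operations in flight = latency * sustained throughput. */
  return latency_cycles * ops_per_cycle;
}
```

With the figures above, `accumulators_needed(4, 2, 2, 2)` models the dot product (2 loads per fma) and gives 4, while `accumulators_needed(4, 2, 2, 1)` models a plain sum (1 load per add, under the assumed add latency) and gives 8.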

From: ktkachov at gcc dot gnu.org @ 2020-06-25 15:22 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ktkachov at gcc dot gnu.org

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Is this -fvariable-expansion-in-unroller? (It is off by default for everything
but powerpc.)

From: elrodc at gmail dot com @ 2020-06-25 15:45 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899

--- Comment #2 from Chris Elrod <elrodc at gmail dot com> ---
Interesting. Compiling with:

gcc -march=native -fvariable-expansion-in-unroller -Ofast -funroll-loops -S dot.c -o dot.s

Yields:

```
.L4:
        vmovupd (%rdi,%r11), %zmm9
        vmovupd 64(%rdi,%r11), %zmm10
        vfmadd231pd     (%rsi,%r11), %zmm9, %zmm0
        vfmadd231pd     64(%rsi,%r11), %zmm10, %zmm1
        vmovupd 128(%rdi,%r11), %zmm11
        vmovupd 192(%rdi,%r11), %zmm12
        vmovupd 256(%rdi,%r11), %zmm13
        vfmadd231pd     128(%rsi,%r11), %zmm11, %zmm0
        vfmadd231pd     192(%rsi,%r11), %zmm12, %zmm1
        vmovupd 320(%rdi,%r11), %zmm14
        vmovupd 384(%rdi,%r11), %zmm15
        vmovupd 448(%rdi,%r11), %zmm4
        vfmadd231pd     256(%rsi,%r11), %zmm13, %zmm0
        vfmadd231pd     320(%rsi,%r11), %zmm14, %zmm1
        vfmadd231pd     384(%rsi,%r11), %zmm15, %zmm0
        vfmadd231pd     448(%rsi,%r11), %zmm4, %zmm1
        addq    $512, %r11
        cmpq    %r8, %r11
        jne     .L4
```

So the dependency chain has now been split in 2.
4 would be ideal. I'll try running benchmarks later to see how it does.
FWIW, the original ran at between 20 and 25 GFLOPS from roughly N = 80 through
N = 1024.
The fastest versions I benchmarked climbed from around 20 to 50 GFLOPS over
this range. So perhaps just splitting the dependency once can get it much of
the way there.

Out of curiosity, what's the reason for this being off by default for
everything but ppc?
Seems like it should be turned on with `-funroll-loops`, given that breaking
dependency chains is one of the primary ways unrolling can actually help
performance.

From: rguenth at gcc dot gnu.org @ 2020-06-26  6:47 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
The register allocator cannot always recover from the extra accumulators, so
enabling it can lead to spilling.
