* [Bug middle-end/95899] New: -funroll-loops does not duplicate accumulators when calculating reductions, failing to break up dependency chains
From: elrodc at gmail dot com @ 2020-06-25 15:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899

            Bug ID: 95899
           Summary: -funroll-loops does not duplicate accumulators when
                    calculating reductions, failing to break up dependency
                    chains
           Product: gcc
           Version: 10.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: elrodc at gmail dot com
  Target Milestone: ---

Created attachment 48784
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48784&action=edit
cc -march=skylake-avx512 -mprefer-vector-width=512 -Ofast -funroll-loops -S
dot.c -o dot.s

Sample code:

```
double dot(double* a, double* b, long N){
  double s = 0.0;
  for (long n = 0; n < N; n++){
    s += a[n] * b[n];
  }
  return s;
}
```

Relevant part of the asm:
```
.L4:
        vmovupd (%rdi,%r11), %zmm8
        vmovupd 64(%rdi,%r11), %zmm9
        vfmadd231pd     (%rsi,%r11), %zmm8, %zmm0
        vmovupd 128(%rdi,%r11), %zmm10
        vmovupd 192(%rdi,%r11), %zmm11
        vmovupd 256(%rdi,%r11), %zmm12
        vmovupd 320(%rdi,%r11), %zmm13
        vfmadd231pd     64(%rsi,%r11), %zmm9, %zmm0
        vmovupd 384(%rdi,%r11), %zmm14
        vmovupd 448(%rdi,%r11), %zmm15
        vfmadd231pd     128(%rsi,%r11), %zmm10, %zmm0
        vfmadd231pd     192(%rsi,%r11), %zmm11, %zmm0
        vfmadd231pd     256(%rsi,%r11), %zmm12, %zmm0
        vfmadd231pd     320(%rsi,%r11), %zmm13, %zmm0
        vfmadd231pd     384(%rsi,%r11), %zmm14, %zmm0
        vfmadd231pd     448(%rsi,%r11), %zmm15, %zmm0
        addq    $512, %r11
        cmpq    %r8, %r11
        jne     .L4

```

Skylake-AVX512's vfmadd231pd should have a throughput of 2/cycle, but a
latency of 4 cycles.

Because each unrolled instance accumulates into `%zmm0`, we are limited by the
dependency chain to 1 fma every 4 cycles.

It should use separate accumulators.
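
As a rough illustration of what separate accumulators mean here, below is a
hand-written scalar sketch. The function name `dot_4acc` and the choice of 4
accumulators are illustrative assumptions, not GCC output; in the vectorized
loop each accumulator would be its own zmm register. Splitting the sum
reassociates the floating-point additions, which -Ofast already permits.

```c
/* Hypothetical sketch: dot product with 4 independent accumulators.
   Each s0..s3 forms its own dependency chain, so the 4 fmas in one
   iteration do not have to wait on each other. */
double dot_4acc(const double *a, const double *b, long N)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long n = 0;

    /* Main loop: 4 elements per iteration, one per accumulator. */
    for (; n + 4 <= N; n += 4) {
        s0 += a[n]     * b[n];
        s1 += a[n + 1] * b[n + 1];
        s2 += a[n + 2] * b[n + 2];
        s3 += a[n + 3] * b[n + 3];
    }

    /* Combine the partial sums once, after the loop. */
    double s = (s0 + s1) + (s2 + s3);

    /* Remainder loop for the last N % 4 elements. */
    for (; n < N; n++)
        s += a[n] * b[n];

    return s;
}
```

The partial sums are only added together after the loop, so the reduction
cost is paid once rather than on every iteration.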

Additionally, if the loads are aligned, the load throughput would be 2
loads/cycle. Because we need 2 loads per fma, that limits us to only 1 fma per
cycle. If the dependency chain were the primary motivation for unrolling, we'd
only want to unroll by 4, not 8: 4 cycles of latency x 1 fma per cycle -> 4
simultaneous / OoO fmas.

Something like a sum (1 load per add) would perform better with the 8x
unrolling seen here (at least, from 100 or so elements until it becomes memory
bound).
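
The unroll-factor arithmetic above can be written down as a small helper. This
is only a sketch of the reasoning: the name `chains_needed` and its parameters
are made up for illustration, and the add figures (4-cycle latency, 2
adds/cycle) are an assumption that the add behaves like the fma on this
target.

```c
/* Hypothetical helper: number of independent accumulator chains needed
   to keep the execution units busy.
   chains = latency (cycles) * sustained operations per cycle, where the
   sustained rate is capped by both the execution throughput and the
   load throughput. */
double chains_needed(double latency, double ops_per_cycle,
                     double loads_per_cycle, double loads_per_op)
{
    /* How many ops per cycle the load units can feed. */
    double load_bound = loads_per_cycle / loads_per_op;

    /* The sustained rate is the smaller of the two limits. */
    double sustained = ops_per_cycle < load_bound ? ops_per_cycle
                                                  : load_bound;
    return latency * sustained;
}
```

With the numbers from this report, `chains_needed(4, 2, 2, 2)` covers the dot
product (2 loads per fma) and gives 4, while `chains_needed(4, 2, 2, 1)`
covers a plain sum (1 load per add, under the assumption above) and gives 8.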
