public inbox for gcc-bugs@sourceware.org
* [Bug middle-end/95899] New: -funroll-loops does not duplicate accumulators when calculating reductions, failing to break up dependency chains
@ 2020-06-25 15:10 elrodc at gmail dot com
From: elrodc at gmail dot com @ 2020-06-25 15:10 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899
Bug ID: 95899
Summary: -funroll-loops does not duplicate accumulators when
calculating reductions, failing to break up dependency
chains
Product: gcc
Version: 10.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: elrodc at gmail dot com
Target Milestone: ---
Created attachment 48784
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48784&action=edit
cc -march=skylake-avx512 -mprefer-vector-width=512 -Ofast -funroll-loops -S dot.c -o dot.s
Sample code:
```
double dot(double* a, double* b, long N){
double s = 0.0;
for (long n = 0; n < N; n++){
s += a[n] * b[n];
}
return s;
}
```
Relevant part of the asm:
```
.L4:
vmovupd (%rdi,%r11), %zmm8
vmovupd 64(%rdi,%r11), %zmm9
vfmadd231pd (%rsi,%r11), %zmm8, %zmm0
vmovupd 128(%rdi,%r11), %zmm10
vmovupd 192(%rdi,%r11), %zmm11
vmovupd 256(%rdi,%r11), %zmm12
vmovupd 320(%rdi,%r11), %zmm13
vfmadd231pd 64(%rsi,%r11), %zmm9, %zmm0
vmovupd 384(%rdi,%r11), %zmm14
vmovupd 448(%rdi,%r11), %zmm15
vfmadd231pd 128(%rsi,%r11), %zmm10, %zmm0
vfmadd231pd 192(%rsi,%r11), %zmm11, %zmm0
vfmadd231pd 256(%rsi,%r11), %zmm12, %zmm0
vfmadd231pd 320(%rsi,%r11), %zmm13, %zmm0
vfmadd231pd 384(%rsi,%r11), %zmm14, %zmm0
vfmadd231pd 448(%rsi,%r11), %zmm15, %zmm0
addq $512, %r11
cmpq %r8, %r11
jne .L4
```
Skylake-AVX512's vfmadd231pd should have a throughput of 2/cycle, but a latency
of 4 cycles.
Because every unrolled instance accumulates into `%zmm0`, the dependency chain
limits us to 1 fma every 4 cycles.
The unrolled loop should use separate accumulators.
Additionally, if the loads are aligned, the core can sustain 2 loads/cycle.
Because we need 2 loads per fma, that limits us to only 1 fma per cycle. If
breaking the dependency chain were the primary motivation for unrolling, we'd
only want to unroll by 4, not 8: 4 cycles of latency x 1 fma per cycle -> 4
simultaneous / OoO fmas.
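For illustration, here is a hand-written sketch of what "separate accumulators" means at the source level. The name `dot_4acc` and the choice of exactly four scalar accumulators are assumptions made for this example, not something GCC emits. Note that it changes the summation order, so results can differ in the last bits from the sequential loop; the compiler is only allowed to make that change itself under `-Ofast` / `-ffast-math`.

```c
/* Hypothetical hand-unrolled variant of dot() with four independent
   accumulators. The name dot_4acc is illustrative only. */
double dot_4acc(const double *a, const double *b, long N) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long n = 0;
    /* Main loop: each accumulator is its own dependency chain, so up
       to four multiply-adds can be in flight at once. */
    for (; n + 4 <= N; n += 4) {
        s0 += a[n]     * b[n];
        s1 += a[n + 1] * b[n + 1];
        s2 += a[n + 2] * b[n + 2];
        s3 += a[n + 3] * b[n + 3];
    }
    /* Remainder loop for the last N % 4 elements. */
    for (; n < N; n++)
        s0 += a[n] * b[n];
    /* Combine the partial sums once, after the loop. */
    return (s0 + s1) + (s2 + s3);
}
```

Whether the vectorizer then keeps four vector accumulators for this source is not guaranteed; checking the generated asm is the only way to know.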
Something like a sum (1 load per add) would perform better with the 8x
unrolling seen here (at least, from 100 or so elements until it becomes memory
bound).
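The unroll-factor arithmetic above can be written down as a small helper. This is only a back-of-the-envelope sketch: the function name and parameters are made up for illustration, and the figures for the plain sum (4-cycle latency, 2 adds/cycle) are an assumption that the vector add behaves like the fma.

```c
/* Estimate of how many independent accumulators are needed to hide
   latency: latency (cycles) times the sustainable ops per cycle.
   All names here are hypothetical. */
static double min2(double x, double y) { return x < y ? x : y; }

int accumulators_needed(int latency_cycles, double ops_per_cycle,
                        double loads_per_cycle, double loads_per_op) {
    /* The sustainable rate is capped by both the execution units and
       the load ports. */
    double rate = min2(ops_per_cycle, loads_per_cycle / loads_per_op);
    double need = latency_cycles * rate;
    int whole = (int)need;
    return (need > whole) ? whole + 1 : whole; /* round up */
}
```

With the numbers from this report, `accumulators_needed(4, 2.0, 2.0, 2.0)` gives 4 for the dot product (2 loads per fma), and `accumulators_needed(4, 2.0, 2.0, 1.0)` gives 8 for a sum (1 load per add).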
From: ktkachov at gcc dot gnu.org @ 2020-06-25 15:22 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899
ktkachov at gcc dot gnu.org changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |ktkachov at gcc dot gnu.org
--- Comment #1 from ktkachov at gcc dot gnu.org ---
Is this what -fvariable-expansion-in-unroller does? (It is off by default for
everything but powerpc.)
From: elrodc at gmail dot com @ 2020-06-25 15:45 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899
--- Comment #2 from Chris Elrod <elrodc at gmail dot com> ---
Interesting. Compiling with:
gcc -march=native -fvariable-expansion-in-unroller -Ofast -funroll-loops -S dot.c -o dot.s
Yields:
```
.L4:
vmovupd (%rdi,%r11), %zmm9
vmovupd 64(%rdi,%r11), %zmm10
vfmadd231pd (%rsi,%r11), %zmm9, %zmm0
vfmadd231pd 64(%rsi,%r11), %zmm10, %zmm1
vmovupd 128(%rdi,%r11), %zmm11
vmovupd 192(%rdi,%r11), %zmm12
vmovupd 256(%rdi,%r11), %zmm13
vfmadd231pd 128(%rsi,%r11), %zmm11, %zmm0
vfmadd231pd 192(%rsi,%r11), %zmm12, %zmm1
vmovupd 320(%rdi,%r11), %zmm14
vmovupd 384(%rdi,%r11), %zmm15
vmovupd 448(%rdi,%r11), %zmm4
vfmadd231pd 256(%rsi,%r11), %zmm13, %zmm0
vfmadd231pd 320(%rsi,%r11), %zmm14, %zmm1
vfmadd231pd 384(%rsi,%r11), %zmm15, %zmm0
vfmadd231pd 448(%rsi,%r11), %zmm4, %zmm1
addq $512, %r11
cmpq %r8, %r11
jne .L4
```
So the dependency chain has now been split in 2.
4 would be ideal. I'll try running benchmarks later to see how it does.
FWIW, the original ran at between 20 and 25 GFLOPS from roughly N = 80 through
N = 1024.
The fastest versions I benchmarked climbed from around 20 to 50 GFLOPS over
this range. So perhaps just splitting the dependency once can get it much of
the way there.
Out of curiosity, what's the reason for this being off by default for
everything but ppc?
It seems like it should be turned on with `-funroll-loops`, given that breaking
dependency chains is one of the primary ways unrolling can actually help
performance.
From: rguenth at gcc dot gnu.org @ 2020-06-26 6:47 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
The register allocator cannot always recover, so variable expansion can lead to
spilling.