public inbox for gcc-bugs@sourceware.org
* [Bug target/97343] New: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product
@ 2020-10-08 23:09 already5chosen at yahoo dot com
2020-10-09 6:41 ` [Bug tree-optimization/97343] " rguenth at gcc dot gnu.org
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: already5chosen at yahoo dot com @ 2020-10-08 23:09 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343
Bug ID: 97343
Summary: AVX2 vectorizer generates extremely strange and slow
code for AoSoA complex dot product
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: already5chosen at yahoo dot com
Target Milestone: ---
Let's continue our complex dot product series started here
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96854
This time I have no code generation bugs for your pleasure, just "interesting"
optimization issues.
All examples below, unless stated otherwise, were compiled with gcc 10.2 for
x86-64 with the following sets of flags:
set1: -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math -fno-associative-math
set2: -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math
The kernel in question is an example of complex dot product in so-called
"hybrid AoS" layout a.k.a. AoSoA.
https://en.wikipedia.org/wiki/AoS_and_SoA#Array_of_Structures_of_Arrays
In my experience, in dense complex linear algebra and similar computational
fields, it is quite rare for AoSoA *not* to be the optimal internal form.
So, practically speaking, I consider these kernels more important than the AoS
kernel presented in bug 96854.
More specifically, the layout can be described as struct { double re[4], im[4];
};
But for the sake of simplicity I omitted the type definition from the code
examples and coded it directly over flat arrays of doubles.
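For readers who prefer the named form, the layout can be spelled out explicitly. The struct name and helper functions below are illustrative only; they do not appear in the kernels that follow, which use the flat indexing directly.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative named form of the "hybrid AoS" layout: one block packs
   4 complex numbers as a real half followed by an imaginary half. */
typedef struct { double re[4]; double im[4]; } cplx4;

/* Viewed as a flat array of doubles, element k of block c sits at the
   same offsets the kernels below use: a[c*8+k+0] (re), a[c*8+k+4] (im). */
static double flat_re(const double* a, int c, int k) { return a[c*8 + k + 0]; }
static double flat_im(const double* a, int c, int k) { return a[c*8 + k + 4]; }
```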
Part 1.
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re = 0;
  double acc_im = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_re = acc_re + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
      acc_im = acc_im - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_re;
  res[4] = acc_im;
}
That's how we would want to code it in an ideal world, letting the compiler
take care of the dirty details.
In the less ideal world we live in, gcc is not the only compiler that can't
cope with it. MSVC (-W4 -O2 -fp:fast -arch:AVX2) can't vectorize it either.
Even the mighty icc generates code that is not quite bad, but somewhat
suboptimal.
So let's give it a pass. I don't want to blame gcc for not being smart enough.
That's just normal.
Except that with set2 the code generated by gcc becomes not merely non-smart,
but quite crazy.
I am ignoring it in the hope that it will be magically fixed by the change
made by Richard Biener on 2020-08-31.
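For reference, what the Part 1 kernel computes is the conjugated complex dot product, i.e. the sum over all 4*N elements of a[i]*conj(b[i]). A minimal check harness against the <complex.h> reference (the test values are made up) confirms the sign convention:

```c
#include <assert.h>
#include <complex.h>
#include <math.h>

/* Part 1 kernel, verbatim. */
static void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re = 0;
  double acc_im = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_re = acc_re + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
      acc_im = acc_im - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_re;
  res[4] = acc_im;
}

/* Scalar reference: sum of a[i]*conj(b[i]) over all 4*N complex elements. */
static double complex cdot_ref(const double* a, const double* b, int N)
{
  double complex s = 0;
  for (int c = 0; c < N; ++c)
    for (int k = 0; k < 4; ++k) {
      double complex av = a[c*8+k+0] + I*a[c*8+k+4];
      double complex bv = b[c*8+k+0] + I*b[c*8+k+4];
      s += av * conj(bv);
    }
  return s;
}
```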
Part 2.
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
      acc_reim += a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_imre-acc_reim;
}
This time we are explaining it to the compiler slowly.
For icc and MSVC that's enough; they understood.
icc generates near-perfect code. I could write it more nicely, but I don't
expect my variant to be any faster.
MSVC generates a near-perfect inner loop and an epilogue that is not great,
but not really much slower.
gcc still doesn't get it. It still implements the 4 accumulators literally,
as if -ffast-math were not there.
But, sad as it is, that's still a case of not being smart enough, so I am not
complaining.
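Part 2 is exactly what Part 1 becomes after reassociating the additions, and that reassociation is the step the compiler must be licensed to perform: IEEE floating-point addition is not associative, which is why set1's -fno-associative-math blocks the split. A two-line illustration (function names are mine):

```c
#include <assert.h>

/* Floating-point addition is not associative: regrouping the same three
   terms can change the rounded result, so gcc refuses to split the Part 1
   accumulators unless -fassociative-math (part of -ffast-math) allows it. */
static double left_assoc(double x, double y, double z)  { return (x + y) + z; }
static double right_assoc(double x, double y, double z) { return x + (y + z); }
```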
Part 3.
static inline double sum4(double x[]) {
  return x[0]+x[1]+x[2]+x[3];
}

void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re[4] = {0};
  double acc_im[4] = {0};
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_re[k] = acc_re[k] + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
      acc_im[k] = acc_im[k] - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = sum4(acc_re);
  res[4] = sum4(acc_im);
}
This is an attempt to feed the compiler by teaspoon. It is not the way I want
to write code in an HLL.
icc copes, producing about the same code as in Part 1.
MSVC doesn't understand the Kunststück (I am sympathetic) and generates
literal scalar code with the local arrays on the stack.
gcc with set1 is a little better than MSVC: the code is fully scalar, but at
least the accumulators are kept in registers.
gcc with set2 is the most interesting case. It vectorizes, but how?
Here is the inner loop:
.L3:
vpermpd $27, (%r8,%rax), %ymm2
vpermpd $27, 32(%rdx,%rax), %ymm3
vpermpd $27, (%rdx,%rax), %ymm1
vpermpd $27, 32(%r8,%rax), %ymm0
vmulpd %ymm2, %ymm1, %ymm6
vmulpd %ymm2, %ymm3, %ymm2
addq $64, %rax
vfnmadd132pd %ymm0, %ymm2, %ymm1
vfmadd132pd %ymm3, %ymm6, %ymm0
vaddpd %ymm1, %ymm5, %ymm5
vaddpd %ymm0, %ymm4, %ymm4
cmpq %rcx, %rax
jne .L3
What is all this vpermpd business about? Shuffling SIMD lanes around just
because it's fun?
That is the first thing I do want to complain about: not "not smart enough",
but too smart for its own good.
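For contrast, here is a lane-exact C sketch of what the inner loop could be. Each double[4] below models one ymm register, and no statement ever moves data between lanes, because the AoSoA layout already has re[k] and im[k] in their final positions. This is a sketch of the intended shape, not gcc's actual output:

```c
#include <assert.h>

/* Shuffle-free shape of the kernel: per iteration, four full-width loads
   and four lane-wise FMAs; the only cross-lane work is the final sum4. */
static void cdot_lanes(double* res, const double* a, const double* b, int N)
{
  double acc_re[4] = {0}, acc_im[4] = {0};    /* two ymm accumulators */
  for (int c = 0; c < N; ++c) {
    const double *ar = a + c*8, *ai = ar + 4; /* re/im halves of a block */
    const double *br = b + c*8, *bi = br + 4;
    for (int k = 0; k < 4; ++k) {             /* one SIMD lane each */
      acc_re[k] += ar[k]*br[k] + ai[k]*bi[k];
      acc_im[k] += ai[k]*br[k] - ar[k]*bi[k];
    }
  }
  res[0] = acc_re[0] + acc_re[1] + acc_re[2] + acc_re[3];
  res[4] = acc_im[0] + acc_im[1] + acc_im[2] + acc_im[3];
}
```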
And finally:
Part 4.
static inline double sum4(double x[]) {
  return x[0]+x[1]+x[2]+x[3];
}

void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere[4] = {0};
  double acc_imim[4] = {0};
  double acc_reim[4] = {0};
  double acc_imre[4] = {0};
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere[k] += a[c*8+k+0]*b[c*8+k+0];
      acc_imim[k] += a[c*8+k+4]*b[c*8+k+4];
      acc_reim[k] += a[c*8+k+0]*b[c*8+k+4];
      acc_imre[k] += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  double acc_re[4];
  double acc_im[4];
  for (int k = 0; k < 4; ++k) {
    acc_re[k] = acc_rere[k]+acc_imim[k];
    acc_im[k] = acc_imre[k]-acc_reim[k];
  }
  res[0] = sum4(acc_re);
  res[4] = sum4(acc_im);
}
Now the compiler is not just fed by teaspoon; its mouth is held open manually,
so to speak.
icc, of course, understands and generates pretty much the same good code as in
Part 2.
MSVC, of course, does not understand and generates the arrays on the stack.
gcc with set2, of course, continues to enjoy its juggling, doubling or
tripling the amount of shuffling relative to last time.
Inner loop:
.L3:
vmovupd (%rdx,%rax), %ymm1
vmovupd 32(%rdx,%rax), %ymm0
vmovupd 32(%r8,%rax), %ymm5
vperm2f128 $49, %ymm1, %ymm0, %ymm3
vinsertf128 $1, %xmm1, %ymm0, %ymm0
vpermpd $221, %ymm0, %ymm10
vpermpd $136, %ymm0, %ymm1
vmovupd (%r8,%rax), %ymm0
vpermpd $136, %ymm3, %ymm9
vperm2f128 $49, %ymm5, %ymm0, %ymm2
vinsertf128 $1, %xmm5, %ymm0, %ymm0
vpermpd $40, %ymm2, %ymm11
vpermpd $125, %ymm0, %ymm5
vpermpd $221, %ymm3, %ymm3
vpermpd $40, %ymm0, %ymm0
vpermpd $125, %ymm2, %ymm2
addq $64, %rax
vfmadd231pd %ymm2, %ymm3, %ymm8
vfmadd231pd %ymm11, %ymm9, %ymm6
vfmadd231pd %ymm5, %ymm10, %ymm7
vfmadd231pd %ymm0, %ymm1, %ymm4
cmpq %rax, %rcx
jne .L3
But this time gcc with set1 was the real star of the show. My only reaction
is: "What?"
.L4:
vmovupd 0(%r13), %ymm5
vmovupd 64(%r13), %ymm7
vmovupd 192(%r13), %ymm4
vmovupd 128(%r13), %ymm6
vunpcklpd 32(%r13), %ymm5, %ymm13
vunpckhpd 32(%r13), %ymm5, %ymm12
vunpckhpd 96(%r13), %ymm7, %ymm1
vunpcklpd 96(%r13), %ymm7, %ymm5
vmovupd 128(%r13), %ymm7
vunpcklpd 224(%r13), %ymm4, %ymm2
vunpcklpd 160(%r13), %ymm6, %ymm6
vunpckhpd 160(%r13), %ymm7, %ymm11
vunpckhpd 224(%r13), %ymm4, %ymm0
vpermpd $216, %ymm13, %ymm13
vpermpd $216, %ymm6, %ymm6
vpermpd $216, %ymm2, %ymm2
vpermpd $216, %ymm5, %ymm5
vunpcklpd %ymm2, %ymm6, %ymm4
vpermpd $216, %ymm1, %ymm1
vpermpd $216, %ymm11, %ymm11
vunpcklpd %ymm5, %ymm13, %ymm9
vpermpd $216, %ymm12, %ymm12
vpermpd $216, %ymm0, %ymm0
vpermpd $216, %ymm4, %ymm3
vpermpd $216, %ymm9, %ymm9
vunpckhpd %ymm2, %ymm6, %ymm4
vunpckhpd %ymm5, %ymm13, %ymm5
vunpcklpd %ymm1, %ymm12, %ymm6
vunpcklpd %ymm0, %ymm11, %ymm2
vunpckhpd %ymm1, %ymm12, %ymm12
vunpckhpd %ymm0, %ymm11, %ymm0
vpermpd $216, %ymm12, %ymm1
vunpcklpd %ymm3, %ymm9, %ymm11
vpermpd $216, %ymm5, %ymm5
vpermpd $216, %ymm4, %ymm4
vpermpd $216, %ymm0, %ymm0
vmovupd 64(%r12), %ymm15
vpermpd $216, %ymm6, %ymm6
vpermpd $216, %ymm11, %ymm8
vpermpd $216, %ymm2, %ymm2
vunpcklpd %ymm4, %ymm5, %ymm11
vunpckhpd %ymm3, %ymm9, %ymm9
vunpckhpd %ymm4, %ymm5, %ymm4
vunpcklpd %ymm0, %ymm1, %ymm5
vpermpd $216, %ymm9, %ymm3
vunpcklpd %ymm2, %ymm6, %ymm9
vunpckhpd %ymm2, %ymm6, %ymm2
vpermpd $216, %ymm5, %ymm6
vunpcklpd 96(%r12), %ymm15, %ymm12
vunpckhpd %ymm0, %ymm1, %ymm0
vmovupd %ymm6, 64(%rsp)
vunpckhpd 96(%r12), %ymm15, %ymm6
vmovupd 128(%r12), %ymm15
vpermpd $216, %ymm0, %ymm5
vpermpd $216, %ymm9, %ymm7
vmovupd (%r12), %ymm0
vunpckhpd 160(%r12), %ymm15, %ymm9
vmovupd %ymm5, 96(%rsp)
vunpcklpd 160(%r12), %ymm15, %ymm5
vmovupd 192(%r12), %ymm15
vunpcklpd 32(%r12), %ymm0, %ymm1
vpermpd $216, %ymm9, %ymm14
vunpcklpd 224(%r12), %ymm15, %ymm9
vunpckhpd 224(%r12), %ymm15, %ymm13
vunpckhpd 32(%r12), %ymm0, %ymm0
vpermpd $216, %ymm12, %ymm12
vpermpd $216, %ymm9, %ymm9
vpermpd $216, %ymm1, %ymm1
vpermpd $216, %ymm5, %ymm5
vpermpd $216, %ymm6, %ymm6
vunpcklpd %ymm12, %ymm1, %ymm10
vpermpd $216, %ymm0, %ymm0
vpermpd $216, %ymm13, %ymm13
vunpckhpd %ymm12, %ymm1, %ymm1
vunpcklpd %ymm9, %ymm5, %ymm12
vpermpd $216, %ymm12, %ymm12
vpermpd $216, %ymm10, %ymm10
vunpckhpd %ymm9, %ymm5, %ymm5
vunpcklpd %ymm6, %ymm0, %ymm9
vunpckhpd %ymm6, %ymm0, %ymm0
vunpcklpd %ymm13, %ymm14, %ymm6
vunpckhpd %ymm13, %ymm14, %ymm13
vpermpd $216, %ymm13, %ymm14
vunpcklpd %ymm12, %ymm10, %ymm13
vpermpd $216, %ymm13, %ymm13
vmulpd %ymm13, %ymm8, %ymm15
vpermpd $216, %ymm5, %ymm5
vpermpd $216, %ymm6, %ymm6
vpermpd $216, %ymm1, %ymm1
vpermpd $216, %ymm9, %ymm9
vpermpd $216, %ymm0, %ymm0
vunpckhpd %ymm12, %ymm10, %ymm10
vunpcklpd %ymm6, %ymm9, %ymm12
vunpckhpd %ymm6, %ymm9, %ymm9
vunpcklpd %ymm5, %ymm1, %ymm6
vunpckhpd %ymm5, %ymm1, %ymm1
vunpcklpd %ymm14, %ymm0, %ymm5
vunpckhpd %ymm14, %ymm0, %ymm0
vpermpd $216, %ymm0, %ymm0
vmovupd %ymm0, 160(%rsp)
vmovq %r9, %xmm0
vaddsd %xmm15, %xmm0, %xmm0
vunpckhpd %xmm15, %xmm15, %xmm14
vpermpd $216, %ymm10, %ymm10
vaddsd %xmm14, %xmm0, %xmm0
vextractf128 $0x1, %ymm15, %xmm14
vmulpd %ymm10, %ymm8, %ymm8
vaddsd %xmm14, %xmm0, %xmm15
vunpckhpd %xmm14, %xmm14, %xmm14
vpermpd $216, %ymm12, %ymm12
vaddsd %xmm14, %xmm15, %xmm0
vmulpd %ymm10, %ymm3, %ymm15
vunpckhpd %xmm8, %xmm8, %xmm10
vmovq %xmm0, %r9
vmovq %rcx, %xmm0
vmulpd %ymm13, %ymm3, %ymm3
vaddsd %xmm15, %xmm0, %xmm0
vunpckhpd %xmm15, %xmm15, %xmm14
vextractf128 $0x1, %ymm15, %xmm15
vaddsd %xmm14, %xmm0, %xmm14
vpermpd $216, %ymm1, %ymm1
vmovupd %ymm1, 128(%rsp)
vaddsd %xmm15, %xmm14, %xmm14
vunpckhpd %xmm15, %xmm15, %xmm15
vpermpd $216, %ymm2, %ymm2
vaddsd %xmm15, %xmm14, %xmm0
vmovsd 56(%rsp), %xmm14
vpermpd $216, %ymm9, %ymm9
vaddsd %xmm8, %xmm14, %xmm14
vextractf128 $0x1, %ymm8, %xmm8
vmovq %xmm0, %rcx
vaddsd %xmm10, %xmm14, %xmm10
vpermpd $216, %ymm6, %ymm6
vpermpd $216, %ymm11, %ymm11
vaddsd %xmm8, %xmm10, %xmm10
vunpckhpd %xmm8, %xmm8, %xmm8
vpermpd $216, %ymm4, %ymm4
vaddsd %xmm8, %xmm10, %xmm0
vmovsd 48(%rsp), %xmm10
vunpckhpd %xmm3, %xmm3, %xmm8
vaddsd %xmm3, %xmm10, %xmm10
vextractf128 $0x1, %ymm3, %xmm3
vmovsd %xmm0, 56(%rsp)
vaddsd %xmm8, %xmm10, %xmm8
vmulpd %ymm12, %ymm7, %ymm10
vmulpd %ymm9, %ymm7, %ymm7
vaddsd %xmm3, %xmm8, %xmm8
vunpckhpd %xmm3, %xmm3, %xmm3
vpermpd $216, %ymm5, %ymm5
vaddsd %xmm3, %xmm8, %xmm0
vunpckhpd %xmm10, %xmm10, %xmm3
addq $256, %r12
vmovsd %xmm0, 48(%rsp)
vmovq %rdi, %xmm0
vaddsd %xmm10, %xmm0, %xmm8
vextractf128 $0x1, %ymm10, %xmm10
vmovq %rbx, %xmm0
vaddsd %xmm3, %xmm8, %xmm3
vmulpd %ymm9, %ymm2, %ymm8
vmulpd %ymm12, %ymm2, %ymm2
vaddsd %xmm10, %xmm3, %xmm3
vunpckhpd %xmm10, %xmm10, %xmm10
addq $256, %r13
vaddsd %xmm10, %xmm3, %xmm1
vaddsd %xmm8, %xmm0, %xmm10
vunpckhpd %xmm8, %xmm8, %xmm3
vextractf128 $0x1, %ymm8, %xmm8
vaddsd %xmm3, %xmm10, %xmm3
vmovq %xmm1, %rdi
vmovq %r11, %xmm1
vaddsd %xmm8, %xmm3, %xmm3
vunpckhpd %xmm8, %xmm8, %xmm8
vmovq %r10, %xmm0
vaddsd %xmm8, %xmm3, %xmm3
vmovsd 40(%rsp), %xmm8
vaddsd %xmm7, %xmm8, %xmm8
vmovq %xmm3, %rbx
vunpckhpd %xmm7, %xmm7, %xmm3
vaddsd %xmm3, %xmm8, %xmm3
vextractf128 $0x1, %ymm7, %xmm7
vaddsd %xmm7, %xmm3, %xmm3
vunpckhpd %xmm7, %xmm7, %xmm7
vaddsd %xmm7, %xmm3, %xmm3
vmovsd 32(%rsp), %xmm7
vaddsd %xmm2, %xmm7, %xmm7
vmovsd %xmm3, 40(%rsp)
vunpckhpd %xmm2, %xmm2, %xmm3
vaddsd %xmm3, %xmm7, %xmm3
vextractf128 $0x1, %ymm2, %xmm2
vmulpd %ymm6, %ymm11, %ymm7
vaddsd %xmm2, %xmm3, %xmm3
vunpckhpd %xmm2, %xmm2, %xmm2
vaddsd %xmm2, %xmm3, %xmm2
vaddsd %xmm7, %xmm1, %xmm3
vmovupd 128(%rsp), %ymm1
vmovsd %xmm2, 32(%rsp)
vunpckhpd %xmm7, %xmm7, %xmm2
vaddsd %xmm2, %xmm3, %xmm2
vextractf128 $0x1, %ymm7, %xmm7
vmulpd %ymm1, %ymm4, %ymm3
vaddsd %xmm7, %xmm2, %xmm2
vunpckhpd %xmm7, %xmm7, %xmm7
vmulpd %ymm1, %ymm11, %ymm1
vaddsd %xmm7, %xmm2, %xmm2
vaddsd %xmm3, %xmm0, %xmm7
vmulpd %ymm6, %ymm4, %ymm4
vmovq %xmm2, %r11
vunpckhpd %xmm3, %xmm3, %xmm2
vaddsd %xmm2, %xmm7, %xmm2
vextractf128 $0x1, %ymm3, %xmm3
vmovupd 64(%rsp), %ymm6
vaddsd %xmm3, %xmm2, %xmm2
vunpckhpd %xmm3, %xmm3, %xmm3
vmovupd 96(%rsp), %ymm7
vaddsd %xmm3, %xmm2, %xmm2
vmovsd 24(%rsp), %xmm3
vmovupd 160(%rsp), %ymm0
vaddsd %xmm1, %xmm3, %xmm3
vmovq %xmm2, %r10
vunpckhpd %xmm1, %xmm1, %xmm2
vaddsd %xmm2, %xmm3, %xmm2
vextractf128 $0x1, %ymm1, %xmm1
vmovq %rbp, %xmm3
vaddsd %xmm1, %xmm2, %xmm2
vunpckhpd %xmm1, %xmm1, %xmm1
vaddsd %xmm1, %xmm2, %xmm2
vunpckhpd %xmm4, %xmm4, %xmm1
vmovsd %xmm2, 24(%rsp)
vmovsd 16(%rsp), %xmm2
vaddsd %xmm4, %xmm2, %xmm2
vextractf128 $0x1, %ymm4, %xmm4
vaddsd %xmm1, %xmm2, %xmm1
vaddsd %xmm4, %xmm1, %xmm1
vunpckhpd %xmm4, %xmm4, %xmm4
vaddsd %xmm4, %xmm1, %xmm4
vmovsd %xmm4, 16(%rsp)
vmulpd %ymm6, %ymm5, %ymm4
vmulpd %ymm7, %ymm5, %ymm5
vaddsd %xmm4, %xmm3, %xmm1
vunpckhpd %xmm4, %xmm4, %xmm2
vmovq %rsi, %xmm3
vaddsd %xmm2, %xmm1, %xmm2
vextractf128 $0x1, %ymm4, %xmm1
vaddsd %xmm1, %xmm2, %xmm2
vunpckhpd %xmm1, %xmm1, %xmm1
vaddsd %xmm1, %xmm2, %xmm4
vmovq %xmm4, %rbp
vmulpd %ymm0, %ymm7, %ymm4
vmulpd %ymm0, %ymm6, %ymm0
vaddsd %xmm4, %xmm3, %xmm1
vunpckhpd %xmm4, %xmm4, %xmm2
vaddsd %xmm2, %xmm1, %xmm2
vextractf128 $0x1, %ymm4, %xmm1
vaddsd %xmm1, %xmm2, %xmm2
vunpckhpd %xmm1, %xmm1, %xmm1
vaddsd %xmm1, %xmm2, %xmm4
vmovsd 8(%rsp), %xmm2
vunpckhpd %xmm0, %xmm0, %xmm1
vaddsd %xmm0, %xmm2, %xmm2
vextractf128 $0x1, %ymm0, %xmm0
vmovq %xmm4, %rsi
vaddsd %xmm1, %xmm2, %xmm1
vaddsd %xmm0, %xmm1, %xmm1
vunpckhpd %xmm0, %xmm0, %xmm0
vaddsd %xmm0, %xmm1, %xmm6
vmovsd (%rsp), %xmm1
vunpckhpd %xmm5, %xmm5, %xmm0
vaddsd %xmm5, %xmm1, %xmm1
vextractf128 $0x1, %ymm5, %xmm5
vmovsd %xmm6, 8(%rsp)
vaddsd %xmm0, %xmm1, %xmm0
vaddsd %xmm5, %xmm0, %xmm0
vunpckhpd %xmm5, %xmm5, %xmm5
vaddsd %xmm5, %xmm0, %xmm5
vmovsd %xmm5, (%rsp)
cmpq %rax, %r12
jne .L4
movl %r15d, %r12d
andl $-4, %r12d
movl %r12d, %edx
cmpl %r12d, %r15d
je .L5
.L3:
movl %r15d, %eax
subl %r12d, %eax
cmpl $1, %eax
je .L6
salq $6, %r12
leaq (%r14,%r12), %r13
vmovupd 16(%r13), %xmm3
vmovupd 48(%r13), %xmm0
vmovupd 64(%r13), %xmm8
vmovupd 112(%r13), %xmm10
vmovupd 0(%r13), %xmm4
vmovupd 32(%r13), %xmm2
vmovupd 80(%r13), %xmm6
vmovupd 96(%r13), %xmm1
vunpcklpd %xmm3, %xmm4, %xmm5
vunpckhpd %xmm3, %xmm4, %xmm4
vunpcklpd %xmm0, %xmm2, %xmm3
vunpckhpd %xmm0, %xmm2, %xmm2
vunpcklpd %xmm6, %xmm8, %xmm0
vunpckhpd %xmm6, %xmm8, %xmm6
vunpcklpd %xmm10, %xmm1, %xmm8
vunpckhpd %xmm10, %xmm1, %xmm1
vunpcklpd %xmm3, %xmm5, %xmm11
vunpcklpd %xmm2, %xmm4, %xmm10
vunpckhpd %xmm3, %xmm5, %xmm3
vunpckhpd %xmm2, %xmm4, %xmm2
vunpcklpd %xmm8, %xmm0, %xmm5
vunpcklpd %xmm1, %xmm6, %xmm4
vunpckhpd %xmm8, %xmm0, %xmm0
vunpckhpd %xmm1, %xmm6, %xmm1
addq %r8, %r12
vunpcklpd %xmm5, %xmm11, %xmm8
vunpckhpd %xmm0, %xmm3, %xmm7
vunpckhpd %xmm5, %xmm11, %xmm11
vunpckhpd %xmm1, %xmm2, %xmm5
vmovupd 64(%r12), %xmm12
vunpcklpd %xmm1, %xmm2, %xmm6
vmovupd 80(%r12), %xmm9
vmovupd 48(%r12), %xmm1
vmovupd 96(%r12), %xmm2
vunpcklpd %xmm4, %xmm10, %xmm14
vunpcklpd %xmm0, %xmm3, %xmm13
vunpckhpd %xmm4, %xmm10, %xmm10
vmovupd 32(%r12), %xmm3
vmovupd 16(%r12), %xmm4
vmovapd %xmm7, 64(%rsp)
vmovapd %xmm5, 96(%rsp)
vmovupd 112(%r12), %xmm7
vmovupd (%r12), %xmm5
movl %eax, %r12d
vunpcklpd %xmm4, %xmm5, %xmm15
vunpckhpd %xmm4, %xmm5, %xmm5
vunpcklpd %xmm1, %xmm3, %xmm4
vunpckhpd %xmm1, %xmm3, %xmm3
vunpcklpd %xmm9, %xmm12, %xmm1
vunpckhpd %xmm9, %xmm12, %xmm9
vunpcklpd %xmm7, %xmm2, %xmm12
vunpckhpd %xmm7, %xmm2, %xmm2
vunpcklpd %xmm4, %xmm15, %xmm7
vunpckhpd %xmm4, %xmm15, %xmm15
vunpcklpd %xmm12, %xmm1, %xmm4
vunpckhpd %xmm12, %xmm1, %xmm1
vunpcklpd %xmm3, %xmm5, %xmm12
vunpckhpd %xmm3, %xmm5, %xmm5
vunpcklpd %xmm2, %xmm9, %xmm3
vunpckhpd %xmm2, %xmm9, %xmm2
vunpcklpd %xmm4, %xmm7, %xmm9
vunpckhpd %xmm1, %xmm15, %xmm0
vunpckhpd %xmm4, %xmm7, %xmm4
vunpcklpd %xmm3, %xmm12, %xmm7
vunpckhpd %xmm3, %xmm12, %xmm3
vunpcklpd %xmm1, %xmm15, %xmm12
vunpcklpd %xmm2, %xmm5, %xmm15
vunpckhpd %xmm2, %xmm5, %xmm2
vmulpd %xmm9, %xmm8, %xmm5
vmovapd %xmm0, 128(%rsp)
vmovq %r9, %xmm0
andl $-2, %r12d
addl %r12d, %edx
vaddsd %xmm5, %xmm0, %xmm0
vunpckhpd %xmm5, %xmm5, %xmm5
vaddsd %xmm5, %xmm0, %xmm1
vmulpd %xmm4, %xmm11, %xmm5
vmulpd %xmm4, %xmm8, %xmm4
vmovq %xmm1, %r9
vmovq %rcx, %xmm1
vmulpd %xmm9, %xmm11, %xmm11
vaddsd %xmm5, %xmm1, %xmm1
vunpckhpd %xmm5, %xmm5, %xmm5
vmulpd %xmm7, %xmm14, %xmm9
vaddsd %xmm5, %xmm1, %xmm1
vmovsd 56(%rsp), %xmm5
vmulpd %xmm3, %xmm10, %xmm8
vaddsd %xmm4, %xmm5, %xmm5
vunpckhpd %xmm4, %xmm4, %xmm4
vmovq %xmm1, %rcx
vaddsd %xmm4, %xmm5, %xmm4
vmovq %rdi, %xmm1
vmulpd %xmm3, %xmm14, %xmm14
vmovsd %xmm4, 56(%rsp)
vmovsd 48(%rsp), %xmm4
vmovq %rbx, %xmm0
vaddsd %xmm11, %xmm4, %xmm4
vunpckhpd %xmm11, %xmm11, %xmm11
vmovsd 40(%rsp), %xmm3
vaddsd %xmm11, %xmm4, %xmm4
vmulpd %xmm7, %xmm10, %xmm10
vaddsd %xmm14, %xmm3, %xmm3
vmovsd %xmm4, 48(%rsp)
vaddsd %xmm9, %xmm1, %xmm4
vunpckhpd %xmm9, %xmm9, %xmm9
vunpckhpd %xmm14, %xmm14, %xmm14
vaddsd %xmm9, %xmm4, %xmm4
vmovapd 128(%rsp), %xmm5
vmovapd 64(%rsp), %xmm11
vmovq %xmm4, %rdi
vaddsd %xmm8, %xmm0, %xmm4
vunpckhpd %xmm8, %xmm8, %xmm8
vmovsd 24(%rsp), %xmm1
vaddsd %xmm8, %xmm4, %xmm4
vmovsd 16(%rsp), %xmm0
vmovq %xmm4, %rbx
vaddsd %xmm14, %xmm3, %xmm4
vmovsd 32(%rsp), %xmm3
vaddsd %xmm10, %xmm3, %xmm3
vunpckhpd %xmm10, %xmm10, %xmm10
vmovsd %xmm4, 40(%rsp)
vaddsd %xmm10, %xmm3, %xmm7
vmulpd %xmm12, %xmm13, %xmm3
vmulpd %xmm5, %xmm13, %xmm13
vmovsd %xmm7, 32(%rsp)
vmovq %r11, %xmm7
vmulpd %xmm11, %xmm12, %xmm12
vaddsd %xmm3, %xmm7, %xmm4
vunpckhpd %xmm3, %xmm3, %xmm3
vaddsd %xmm13, %xmm1, %xmm1
vaddsd %xmm3, %xmm4, %xmm7
vmulpd %xmm5, %xmm11, %xmm3
vunpckhpd %xmm13, %xmm13, %xmm13
vmovq %xmm7, %r11
vmovq %r10, %xmm7
vaddsd %xmm12, %xmm0, %xmm0
vaddsd %xmm3, %xmm7, %xmm4
vunpckhpd %xmm3, %xmm3, %xmm3
vunpckhpd %xmm12, %xmm12, %xmm12
vaddsd %xmm3, %xmm4, %xmm7
vaddsd %xmm13, %xmm1, %xmm4
vmovq %xmm7, %r10
vmovsd %xmm4, 24(%rsp)
vaddsd %xmm12, %xmm0, %xmm4
vmulpd %xmm15, %xmm6, %xmm0
vmovq %rbp, %xmm7
vmovsd %xmm4, 16(%rsp)
vmovapd 96(%rsp), %xmm5
vaddsd %xmm0, %xmm7, %xmm1
vunpckhpd %xmm0, %xmm0, %xmm0
vmovq %rsi, %xmm7
vaddsd %xmm0, %xmm1, %xmm4
vmulpd %xmm5, %xmm2, %xmm0
vmulpd %xmm2, %xmm6, %xmm2
vmovq %xmm4, %rbp
vmulpd %xmm5, %xmm15, %xmm15
vaddsd %xmm0, %xmm7, %xmm1
vunpckhpd %xmm0, %xmm0, %xmm0
vaddsd %xmm0, %xmm1, %xmm4
vmovsd 8(%rsp), %xmm0
vaddsd %xmm2, %xmm0, %xmm0
vunpckhpd %xmm2, %xmm2, %xmm2
vmovq %xmm4, %rsi
vaddsd %xmm2, %xmm0, %xmm6
vmovsd (%rsp), %xmm0
vaddsd %xmm15, %xmm0, %xmm0
vunpckhpd %xmm15, %xmm15, %xmm15
vmovsd %xmm6, 8(%rsp)
vaddsd %xmm15, %xmm0, %xmm5
vmovsd %xmm5, (%rsp)
cmpl %r12d, %eax
je .L5
* [Bug tree-optimization/97343] AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product
2020-10-08 23:09 [Bug target/97343] New: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product already5chosen at yahoo dot com
@ 2020-10-09 6:41 ` rguenth at gcc dot gnu.org
2020-10-09 11:58 ` already5chosen at yahoo dot com
2020-10-12 13:11 ` rguenth at gcc dot gnu.org
2 siblings, 0 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-10-09 6:41 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Component|target |tree-optimization
Ever confirmed|0 |1
Target| |x86_64-*-* i?86-*-*
Blocks| |53947
Keywords| |missed-optimization
Last reconfirmed| |2020-10-09
CC| |rguenth at gcc dot gnu.org
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
All below for Part 2.
Without -ffast-math you are seeing GCC using in-order reductions now while
with -ffast-math the vectorizer gets a bit confused about reassociations done
before, for me producing
.L3:
vmovupd 32(%rsi,%rax), %ymm3
vmovupd (%rdx,%rax), %ymm7
vinsertf128 $1, (%rsi,%rax), %ymm3, %ymm0
vinsertf128 $1, 32(%rdx,%rax), %ymm7, %ymm2
vmovupd 32(%rsi,%rax), %ymm5
vpermpd $136, %ymm0, %ymm4
vpermpd $40, %ymm2, %ymm7
vpermpd $221, %ymm0, %ymm1
vpermpd $125, %ymm2, %ymm3
vperm2f128 $49, (%rsi,%rax), %ymm5, %ymm0
vmovupd (%rdx,%rax), %ymm2
vperm2f128 $49, 32(%rdx,%rax), %ymm2, %ymm2
addq $64, %rax
vpermpd $136, %ymm0, %ymm5
vpermpd $221, %ymm0, %ymm0
vpermpd $40, %ymm2, %ymm8
vpermpd $125, %ymm2, %ymm2
vmulpd %ymm8, %ymm5, %ymm5
vmulpd %ymm2, %ymm0, %ymm0
vfmadd132pd %ymm3, %ymm5, %ymm1
vfmadd231pd %ymm7, %ymm4, %ymm0
vaddpd %ymm0, %ymm1, %ymm0
vaddpd %ymm0, %ymm6, %ymm6
cmpq %rcx, %rax
jne .L3
-ffast-math vs. non-ffast-math we're using a SLP reduction vs. 4 reduction
chains and this SLP reduction ends up looking like
t5.c:7:21: note: Vectorizing SLP tree:
t5.c:7:21: note: node 0x4100c20 (max_nunits=4, refcnt=2)
t5.c:7:21: note: stmt 0 acc_imre_158 = acc_imre_3 + _34;
t5.c:7:21: note: stmt 1 acc_reim_156 = acc_reim_1 + _8;
t5.c:7:21: note: stmt 2 acc_imim_154 = _21 + acc_imim_35;
t5.c:7:21: note: stmt 3 acc_rere_146 = _11 + acc_rere_29;
t5.c:7:21: note: children 0x3f272e0 0x4100bb0
t5.c:7:21: note: node 0x3f272e0 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 acc_imre_3 = PHI <acc_imre_158(7), 0.0(8)>
t5.c:7:21: note: stmt 1 acc_reim_1 = PHI <acc_reim_156(7), 0.0(8)>
t5.c:7:21: note: stmt 2 acc_imim_35 = PHI <acc_imim_154(7), 0.0(8)>
t5.c:7:21: note: stmt 3 acc_rere_29 = PHI <acc_rere_146(7), 0.0(8)>
t5.c:7:21: note: children 0x4100c20
t5.c:7:21: note: node 0x4100bb0 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _34 = _36 + _157;
t5.c:7:21: note: stmt 1 _8 = _30 + _155;
t5.c:7:21: note: stmt 2 _21 = _15 + _153;
t5.c:7:21: note: stmt 3 _11 = _6 + _145;
t5.c:7:21: note: children 0x4100920 0x4100b40
t5.c:7:21: note: node 0x4100920 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _36 = _37 + _73;
t5.c:7:21: note: stmt 1 _30 = _32 + _71;
t5.c:7:21: note: stmt 2 _15 = _10 + _69;
t5.c:7:21: note: stmt 3 _6 = _31 + _61;
t5.c:7:21: note: children 0x41004e0 0x41008b0
t5.c:7:21: note: node 0x41004e0 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _37 = _101 + _129;
t5.c:7:21: note: stmt 1 _32 = _99 + _127;
t5.c:7:21: note: stmt 2 _10 = _97 + _125;
t5.c:7:21: note: stmt 3 _31 = _89 + _117;
t5.c:7:21: note: children 0x3f2a550 0x3f28700
t5.c:7:21: note: node 0x3f2a550 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _101 = _88 * _94;
t5.c:7:21: note: stmt 1 _99 = _86 * _96;
t5.c:7:21: note: stmt 2 _97 = _94 * _96;
t5.c:7:21: note: stmt 3 _89 = _86 * _88;
t5.c:7:21: note: children 0x40b6990 0x3f29e00
t5.c:7:21: note: node 0x40b6990 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _88 = *_87;
t5.c:7:21: note: stmt 1 _96 = *_95;
t5.c:7:21: note: stmt 2 _96 = *_95;
t5.c:7:21: note: stmt 3 _88 = *_87;
t5.c:7:21: note: load permutation { 1 5 5 1 }
t5.c:7:21: note: node 0x3f29e00 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _94 = *_93;
t5.c:7:21: note: stmt 1 _86 = *_85;
t5.c:7:21: note: stmt 2 _94 = *_93;
t5.c:7:21: note: stmt 3 _86 = *_85;
t5.c:7:21: note: load permutation { 5 1 5 1 }
t5.c:7:21: note: node 0x3f28700 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _129 = _116 * _122;
t5.c:7:21: note: stmt 1 _127 = _114 * _124;
t5.c:7:21: note: stmt 2 _125 = _122 * _124;
t5.c:7:21: note: stmt 3 _117 = _114 * _116;
t5.c:7:21: note: children 0x3f287e0 0x3f28770
t5.c:7:21: note: node 0x3f287e0 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _116 = *_115;
t5.c:7:21: note: stmt 1 _124 = *_123;
t5.c:7:21: note: stmt 2 _124 = *_123;
t5.c:7:21: note: stmt 3 _116 = *_115;
t5.c:7:21: note: load permutation { 2 6 6 2 }
t5.c:7:21: note: node 0x3f28770 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _122 = *_121;
t5.c:7:21: note: stmt 1 _114 = *_113;
t5.c:7:21: note: stmt 2 _122 = *_121;
t5.c:7:21: note: stmt 3 _114 = *_113;
t5.c:7:21: note: load permutation { 6 2 6 2 }
t5.c:7:21: note: node 0x41008b0 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _73 = _60 * _66;
t5.c:7:21: note: stmt 1 _71 = _58 * _68;
t5.c:7:21: note: stmt 2 _69 = _66 * _68;
t5.c:7:21: note: stmt 3 _61 = _58 * _60;
t5.c:7:21: note: children 0x4100290 0x4100810
t5.c:7:21: note: node 0x4100290 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _60 = *_59;
t5.c:7:21: note: stmt 1 _68 = *_67;
t5.c:7:21: note: stmt 2 _68 = *_67;
t5.c:7:21: note: stmt 3 _60 = *_59;
t5.c:7:21: note: load permutation { 0 4 4 0 }
t5.c:7:21: note: node 0x4100810 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _66 = *_65;
t5.c:7:21: note: stmt 1 _58 = *_57;
t5.c:7:21: note: stmt 2 _66 = *_65;
t5.c:7:21: note: stmt 3 _58 = *_57;
t5.c:7:21: note: load permutation { 4 0 4 0 }
t5.c:7:21: note: node 0x4100b40 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _157 = _144 * _150;
t5.c:7:21: note: stmt 1 _155 = _142 * _152;
t5.c:7:21: note: stmt 2 _153 = _150 * _152;
t5.c:7:21: note: stmt 3 _145 = _142 * _144;
t5.c:7:21: note: children 0x4100990 0x4100a50
t5.c:7:21: note: node 0x4100990 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _144 = *_143;
t5.c:7:21: note: stmt 1 _152 = *_151;
t5.c:7:21: note: stmt 2 _152 = *_151;
t5.c:7:21: note: stmt 3 _144 = *_143;
t5.c:7:21: note: load permutation { 3 7 7 3 }
t5.c:7:21: note: node 0x4100a50 (max_nunits=4, refcnt=1)
t5.c:7:21: note: stmt 0 _150 = *_149;
t5.c:7:21: note: stmt 1 _142 = *_141;
t5.c:7:21: note: stmt 2 _150 = *_149;
t5.c:7:21: note: stmt 3 _142 = *_141;
t5.c:7:21: note: load permutation { 7 3 7 3 }
which eventually shows some non-obvious permute optimization opportunities.
I'm currently working on a permute optimization phase, btw., but for a start
it only handles cases that do not help here.
Btw, if I use -ffast-math but disable reassociation via -fno-tree-reassoc I get
the reduction chain variant which optimizes to
.L3:
vmovupd 32(%rsi,%rax), %ymm6
vmovupd 32(%rdx,%rax), %ymm7
vmovupd (%rsi,%rax), %ymm5
vfmadd231pd (%rdx,%rax), %ymm6, %ymm0
vfmadd231pd (%rdx,%rax), %ymm5, %ymm3
vfmadd231pd (%rsi,%rax), %ymm7, %ymm1
addq $64, %rax
vfmadd231pd %ymm6, %ymm7, %ymm2
cmpq %rcx, %rax
jne .L3
even with GCC 10 (-Ofast -march=core-avx2 -fno-tree-reassoc). Which means
the following source change helps:
void __attribute__((optimize("no-tree-reassoc")))
cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
      acc_reim += a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_imre-acc_reim;
}
the reduction epilogue ends up like
vextractf128 $0x1, %ymm3, %xmm4
vaddpd %xmm3, %xmm4, %xmm3
vunpckhpd %xmm3, %xmm3, %xmm4
vaddpd %xmm3, %xmm4, %xmm3
vextractf128 $0x1, %ymm2, %xmm4
vaddpd %xmm2, %xmm4, %xmm4
vunpckhpd %xmm4, %xmm4, %xmm2
vaddpd %xmm4, %xmm2, %xmm2
vextractf128 $0x1, %ymm1, %xmm4
vaddpd %xmm1, %xmm4, %xmm4
vaddsd %xmm2, %xmm3, %xmm2
vunpckhpd %xmm4, %xmm4, %xmm1
vaddpd %xmm4, %xmm1, %xmm1
vextractf128 $0x1, %ymm0, %xmm4
vaddpd %xmm0, %xmm4, %xmm4
vunpckhpd %xmm4, %xmm4, %xmm0
vaddpd %xmm4, %xmm0, %xmm0
vsubsd %xmm1, %xmm0, %xmm0
vzeroupper
vmovsd %xmm2, (%rdi)
vmovsd %xmm0, 32(%rdi)
which is not optimal since we miss the opportunity to vectorize the
adds of the accumulators
res[0] = acc_rere+acc_imim;
res[4] = acc_imre-acc_reim;
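At the source level, the missed epilogue vectorization would look like the sketch below (function name and test values are mine): one vector add and one vector subtract combine the four 4-wide accumulators first, leaving only two horizontal reductions instead of four.

```c
#include <assert.h>

/* Desired epilogue shape: the re[]/im[] loops map to one vaddpd and one
   vsubpd on the vector accumulators; only then do two horizontal sums. */
static void reduce_epilogue(const double rere[4], const double imim[4],
                            const double reim[4], const double imre[4],
                            double* res)
{
  double re[4], im[4];
  for (int k = 0; k < 4; ++k) {
    re[k] = rere[k] + imim[k];
    im[k] = imre[k] - reim[k];
  }
  res[0] = (re[0] + re[1]) + (re[2] + re[3]);  /* horizontal sum of re */
  res[4] = (im[0] + im[1]) + (im[2] + im[3]);  /* horizontal sum of im */
}
```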
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
* [Bug tree-optimization/97343] AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product
2020-10-08 23:09 [Bug target/97343] New: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product already5chosen at yahoo dot com
2020-10-09 6:41 ` [Bug tree-optimization/97343] " rguenth at gcc dot gnu.org
@ 2020-10-09 11:58 ` already5chosen at yahoo dot com
2020-10-12 13:11 ` rguenth at gcc dot gnu.org
2 siblings, 0 replies; 4+ messages in thread
From: already5chosen at yahoo dot com @ 2020-10-09 11:58 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343
--- Comment #2 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Richard Biener from comment #1)
> All below for Part 2.
>
> Without -ffast-math you are seeing GCC using in-order reductions now while
> with -ffast-math the vectorizer gets a bit confused about reassociations done
> before, for me producing
>
Just to understand: when you say "Without -ffast-math", does that include my
set1 == "-O3 -ffast-math -fno-associative-math"?
BTW, I like your phrasing: "a bit confused about reassociations"
> t5.c:7:21: note: stmt 3 _60 = *_59;
> t5.c:7:21: note: load permutation { 0 4 4 0 }
> t5.c:7:21: note: node 0x4100810 (max_nunits=4, refcnt=1)
> t5.c:7:21: note: stmt 0 _66 = *_65;
> t5.c:7:21: note: stmt 1 _58 = *_57;
> t5.c:7:21: note: stmt 2 _66 = *_65;
> t5.c:7:21: note: stmt 3 _58 = *_57;
> t5.c:7:21: note: load permutation { 4 0 4 0 }
> t5.c:7:21: note: node 0x4100b40 (max_nunits=4, refcnt=1)
> t5.c:7:21: note: stmt 0 _157 = _144 * _150;
> t5.c:7:21: note: stmt 1 _155 = _142 * _152;
> t5.c:7:21: note: stmt 2 _153 = _150 * _152;
> t5.c:7:21: note: stmt 3 _145 = _142 * _144;
> t5.c:7:21: note: children 0x4100990 0x4100a50
> t5.c:7:21: note: node 0x4100990 (max_nunits=4, refcnt=1)
> t5.c:7:21: note: stmt 0 _144 = *_143;
> t5.c:7:21: note: stmt 1 _152 = *_151;
> t5.c:7:21: note: stmt 2 _152 = *_151;
> t5.c:7:21: note: stmt 3 _144 = *_143;
> t5.c:7:21: note: load permutation { 3 7 7 3 }
> t5.c:7:21: note: node 0x4100a50 (max_nunits=4, refcnt=1)
> t5.c:7:21: note: stmt 0 _150 = *_149;
> t5.c:7:21: note: stmt 1 _142 = *_141;
> t5.c:7:21: note: stmt 2 _150 = *_149;
> t5.c:7:21: note: stmt 3 _142 = *_141;
> t5.c:7:21: note: load permutation { 7 3 7 3 }
>
> which eventually shows some non-obvious permute optimization opportunities.
> I'm currently working on a permute optimization phase, btw., but for a start
> it only handles cases that do not help here.
>
> Btw, if I use -ffast-math but disable reassociation via -fno-tree-reassoc I
> get
> the reduction chain variant which optimizes to
>
> .L3:
> vmovupd 32(%rsi,%rax), %ymm6
> vmovupd 32(%rdx,%rax), %ymm7
> vmovupd (%rsi,%rax), %ymm5
> vfmadd231pd (%rdx,%rax), %ymm6, %ymm0
> vfmadd231pd (%rdx,%rax), %ymm5, %ymm3
> vfmadd231pd (%rsi,%rax), %ymm7, %ymm1
> addq $64, %rax
> vfmadd231pd %ymm6, %ymm7, %ymm2
> cmpq %rcx, %rax
> jne .L3
>
> even with GCC 10 (-Ofast -march=core-avx2 -fno-tree-reassoc).
It's very fragile. I made a tiny (and natural for my real app) change in the
source (see below) and the nice MSVC-like inner loop disappeared.
void cdot(double* res, const double* a, const double* b, int N)
{
double acc_rere = 0;
double acc_imim = 0;
double acc_reim = 0;
double acc_imre = 0;
for (int c = 0; c < N; ++c) {
for (int k = 0; k < 4; ++k) {
acc_rere += a[c*8+k+0]*b[c*8+k+0];
acc_imim += a[c*8+k+4]*b[c*8+k+4];
acc_reim -= a[c*8+k+0]*b[c*8+k+4];
acc_imre += a[c*8+k+4]*b[c*8+k+0];
}
}
res[0] = acc_rere+acc_imim;
res[4] = acc_reim+acc_imre;
}
> Which means
> the following source change helps:
>
> void __attribute__((optimize("no-tree-reassoc"))) cdot(double* res, const
> double* a, const double* b, int N)
> {
> double acc_rere = 0;
> double acc_imim = 0;
> double acc_reim = 0;
> double acc_imre = 0;
> for (int c = 0; c < N; ++c) {
> for (int k = 0; k < 4; ++k) {
> acc_rere += a[c*8+k+0]*b[c*8+k+0];
> acc_imim += a[c*8+k+4]*b[c*8+k+4];
> acc_reim += a[c*8+k+0]*b[c*8+k+4];
> acc_imre += a[c*8+k+4]*b[c*8+k+0];
> }
> }
> res[0] = acc_rere+acc_imim;
> res[4] = acc_imre-acc_reim;
> }
>
IMHO, options like "no-tree-reassoc", including and especially within
__attribute__((optimize(""))), are for people like you.
People like me, i.e. people who are not compiler developers, don't consider them
for production, and not even for a hobby, unless the hobby is compiler research.
Also, several years ago I was told (not by you as Richard Biener, but by "you" as
"gcc maintainers", more specifically by Manuel López-Ibáñez; you, as Richard
Biener, also took part in the discussion but appeared to hold a different opinion,
see bug 70255) that __attribute__((optimize(""))) can't be relied on in
production. Back then we came to the conclusion that this statement had to be in
the official docs. And indeed, since GCC 7, section 6.33.1 contains the following
sentences:
"The optimize attribute should be used for debugging purposes only. It is not
suitable in production code."
> the reduction epilogue ends up like
>
> vextractf128 $0x1, %ymm3, %xmm4
> vaddpd %xmm3, %xmm4, %xmm3
> vunpckhpd %xmm3, %xmm3, %xmm4
> vaddpd %xmm3, %xmm4, %xmm3
> vextractf128 $0x1, %ymm2, %xmm4
> vaddpd %xmm2, %xmm4, %xmm4
> vunpckhpd %xmm4, %xmm4, %xmm2
> vaddpd %xmm4, %xmm2, %xmm2
> vextractf128 $0x1, %ymm1, %xmm4
> vaddpd %xmm1, %xmm4, %xmm4
> vaddsd %xmm2, %xmm3, %xmm2
> vunpckhpd %xmm4, %xmm4, %xmm1
> vaddpd %xmm4, %xmm1, %xmm1
> vextractf128 $0x1, %ymm0, %xmm4
> vaddpd %xmm0, %xmm4, %xmm4
> vunpckhpd %xmm4, %xmm4, %xmm0
> vaddpd %xmm4, %xmm0, %xmm0
> vsubsd %xmm1, %xmm0, %xmm0
> vzeroupper
> vmovsd %xmm2, (%rdi)
> vmovsd %xmm0, 32(%rdi)
>
> which is not optimal since we miss the opportunity to vectorize the
> adds of the accumulators
>
> res[0] = acc_rere+acc_imim;
> res[4] = acc_imre-acc_reim;
The epilogue is a tricky matter.
There are many ways to do it at about the same speed, and which variant is
faster can depend on fine microarchitectural details that may differ even
between such quite similar CPUs as Skylake and Zen2, or Skylake and Ice
Lake, or maybe even Skylake and Skylake-X (well, maybe the last one is an
exaggeration). The optimal sequence also depends on the surrounding code, e.g. if I
change the source to:
res[0] = acc_rere+acc_imim;
res[1] = acc_reim-acc_imre; // results adjacent in memory
it could already be different.
In the latter case it would likely be
vaddpd ymm_rere,ymm_imim,ymm_re
vsubpd ymm_reim,ymm_imre,ymm_im
vhaddpd ymm_im,ymm_re,ymm_reim
vperm2f128 $1, ymm_reim, ymm_reim, ymm_reimH
vaddpd xmm_reimH, xmm_reim, xmm_reim
Even icc can't do it perfectly right now.
It would be nice (for you personally, at least) if in this case gcc would
generate better code than icc, but that is far [far [far...]] less important than
robust handling of the inner loop.
^ permalink raw reply [flat|nested] 4+ messages in thread
* [Bug tree-optimization/97343] AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product
2020-10-08 23:09 [Bug target/97343] New: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product already5chosen at yahoo dot com
2020-10-09 6:41 ` [Bug tree-optimization/97343] " rguenth at gcc dot gnu.org
2020-10-09 11:58 ` already5chosen at yahoo dot com
@ 2020-10-12 13:11 ` rguenth at gcc dot gnu.org
2 siblings, 0 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-10-12 13:11 UTC (permalink / raw)
To: gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Michael_S from comment #2)
> (In reply to Richard Biener from comment #1)
> > All below for Part 2.
> >
> > Without -ffast-math you are seeing GCC using in-order reductions now while
> > with -ffast-math the vectorizer gets a bit confused about reassociations done
> > before, for me producing
> >
>
> Just to understand, when you are saying "Without -ffast-math" does it
> includes my set1 == "-O3 -ffast-math -fno-associative-math" ?
No, -fno-associative-math breaks it again.
> BTW, I like your phrasing: "a bit confused about reassociations"
>
> > .L3:
> > vmovupd 32(%rsi,%rax), %ymm3
> > vmovupd (%rdx,%rax), %ymm7
> > vinsertf128 $1, (%rsi,%rax), %ymm3, %ymm0
> > vinsertf128 $1, 32(%rdx,%rax), %ymm7, %ymm2
> > vmovupd 32(%rsi,%rax), %ymm5
> > vpermpd $136, %ymm0, %ymm4
> > vpermpd $40, %ymm2, %ymm7
> > vpermpd $221, %ymm0, %ymm1
> > vpermpd $125, %ymm2, %ymm3
> > vperm2f128 $49, (%rsi,%rax), %ymm5, %ymm0
> > vmovupd (%rdx,%rax), %ymm2
> > vperm2f128 $49, 32(%rdx,%rax), %ymm2, %ymm2
> > addq $64, %rax
> > vpermpd $136, %ymm0, %ymm5
> > vpermpd $221, %ymm0, %ymm0
> > vpermpd $40, %ymm2, %ymm8
> > vpermpd $125, %ymm2, %ymm2
> > vmulpd %ymm8, %ymm5, %ymm5
> > vmulpd %ymm2, %ymm0, %ymm0
> > vfmadd132pd %ymm3, %ymm5, %ymm1
> > vfmadd231pd %ymm7, %ymm4, %ymm0
> > vaddpd %ymm0, %ymm1, %ymm0
> > vaddpd %ymm0, %ymm6, %ymm6
> > cmpq %rcx, %rax
> > jne .L3
> >
> > With -ffast-math vs. without it we're using an SLP reduction vs. 4 reduction
> > chains, and this SLP reduction ends up looking like
> >
> > [...]
> >
> > which eventually shows some non-obvious permute optimization opportunities.
> > I'm currently working on a permute optimization phase, btw., but for a start
> > it only handles cases that do not help here.
> >
> > Btw, if I use -ffast-math but disable reassociation via -fno-tree-reassoc I
> > get
> > the reduction chain variant which optimizes to
> >
> > .L3:
> > vmovupd 32(%rsi,%rax), %ymm6
> > vmovupd 32(%rdx,%rax), %ymm7
> > vmovupd (%rsi,%rax), %ymm5
> > vfmadd231pd (%rdx,%rax), %ymm6, %ymm0
> > vfmadd231pd (%rdx,%rax), %ymm5, %ymm3
> > vfmadd231pd (%rsi,%rax), %ymm7, %ymm1
> > addq $64, %rax
> > vfmadd231pd %ymm6, %ymm7, %ymm2
> > cmpq %rcx, %rax
> > jne .L3
> >
> > even with GCC 10 (-Ofast -march=core-avx2 -fno-tree-reassoc).
>
> It's very fragile. I made a tiny (and natural for my real app) change in the
> source (see below) and the nice MSVC-like inner loop disappeared.
>
> void cdot(double* res, const double* a, const double* b, int N)
> {
> double acc_rere = 0;
> double acc_imim = 0;
> double acc_reim = 0;
> double acc_imre = 0;
> for (int c = 0; c < N; ++c) {
> for (int k = 0; k < 4; ++k) {
> acc_rere += a[c*8+k+0]*b[c*8+k+0];
> acc_imim += a[c*8+k+4]*b[c*8+k+4];
> acc_reim -= a[c*8+k+0]*b[c*8+k+4];
> acc_imre += a[c*8+k+4]*b[c*8+k+0];
> }
> }
> res[0] = acc_rere+acc_imim;
> res[4] = acc_reim+acc_imre;
> }
wow, OK ... :/
> > Which means
> > the following source change helps:
> >
> > void __attribute__((optimize("no-tree-reassoc"))) cdot(double* res, const
> > double* a, const double* b, int N)
> > {
> > double acc_rere = 0;
> > double acc_imim = 0;
> > double acc_reim = 0;
> > double acc_imre = 0;
> > for (int c = 0; c < N; ++c) {
> > for (int k = 0; k < 4; ++k) {
> > acc_rere += a[c*8+k+0]*b[c*8+k+0];
> > acc_imim += a[c*8+k+4]*b[c*8+k+4];
> > acc_reim += a[c*8+k+0]*b[c*8+k+4];
> > acc_imre += a[c*8+k+4]*b[c*8+k+0];
> > }
> > }
> > res[0] = acc_rere+acc_imim;
> > res[4] = acc_imre-acc_reim;
> > }
> >
>
> IMHO, options like "no-tree-reassoc", including and especially within
> __attribute__((optimize(""))), are for people like you.
> People like me, i.e. people who are not compiler developers, don't consider
> them for production, and not even for a hobby, unless the hobby is compiler
> research.
That's true.
> Also, several years ago I was told (not by you as Richard Biener, but by "you"
> as "gcc maintainers", more specifically by Manuel López-Ibáñez; you, as
> Richard Biener, also took part in the discussion but appeared to hold a
> different opinion, see bug 70255) that __attribute__((optimize(""))) can't be
> relied on in production. Back then we came to the conclusion that this
> statement had to be in the official docs. And indeed, since GCC 7, section
> 6.33.1 contains the following sentences:
> "The optimize attribute should be used for debugging purposes only. It is
> not suitable in production code."
Yeah, I guess from my HPC background I'd say HPC is never "production" ;)
I indeed would avoid optimize("") in, say, a github-hosted project, but
IMHO it's as valid as splitting out the relevant function and compiling
the TU with -fno-tree-reassoc: it's tuning the setup for max performance
with a specific compiler and for a specific host.
>
> > the reduction epilogue ends up like
> >
> > vextractf128 $0x1, %ymm3, %xmm4
> > vaddpd %xmm3, %xmm4, %xmm3
> > vunpckhpd %xmm3, %xmm3, %xmm4
> > vaddpd %xmm3, %xmm4, %xmm3
> > vextractf128 $0x1, %ymm2, %xmm4
> > vaddpd %xmm2, %xmm4, %xmm4
> > vunpckhpd %xmm4, %xmm4, %xmm2
> > vaddpd %xmm4, %xmm2, %xmm2
> > vextractf128 $0x1, %ymm1, %xmm4
> > vaddpd %xmm1, %xmm4, %xmm4
> > vaddsd %xmm2, %xmm3, %xmm2
> > vunpckhpd %xmm4, %xmm4, %xmm1
> > vaddpd %xmm4, %xmm1, %xmm1
> > vextractf128 $0x1, %ymm0, %xmm4
> > vaddpd %xmm0, %xmm4, %xmm4
> > vunpckhpd %xmm4, %xmm4, %xmm0
> > vaddpd %xmm4, %xmm0, %xmm0
> > vsubsd %xmm1, %xmm0, %xmm0
> > vzeroupper
> > vmovsd %xmm2, (%rdi)
> > vmovsd %xmm0, 32(%rdi)
> >
> > which is not optimal since we miss the opportunity to vectorize the
> > adds of the accumulators
> >
> > res[0] = acc_rere+acc_imim;
> > res[4] = acc_imre-acc_reim;
>
> The epilogue is a tricky matter.
> There are many ways to do it at about the same speed, and which variant is
> faster can depend on fine microarchitectural details that may differ even
> between such quite similar CPUs as Skylake and Zen2, or Skylake and Ice Lake,
> or maybe even Skylake and Skylake-X (well, maybe the last one is an
> exaggeration). The optimal sequence also depends on the surrounding code, e.g.
> if I change the source to:
> res[0] = acc_rere+acc_imim;
> res[1] = acc_reim-acc_imre; // results adjacent in memory
> it could already be different.
> In the latter case it would likely be
> vaddpd ymm_rere,ymm_imim,ymm_re
> vsubpd ymm_reim,ymm_imre,ymm_im
> vhaddpd ymm_im,ymm_re,ymm_reim
> vperm2f128 $1, ymm_reim, ymm_reim, ymm_reimH
> vaddpd xmm_reimH, xmm_reim, xmm_reim
>
> Even icc can't do it perfectly right now.
> It would be nice (for you personally, at least) if in this case gcc would
> generate better code than icc, but that is far [far [far...]] less important
> than robust handling of the inner loop.
Yeah, the issue really is that with reassociation we miss the obviously
better reduction scheme, and the vectorizer cannot translate between one
and the other at the moment (it doesn't itself try reassociating, other than
what is implied by vectorizing). We could add this capability without
too many issues, I guess; the main problem is that the reduction scheme is
an upfront decision, and it wouldn't be one based on actual costs.
Thanks for the insightful loop kernels btw!
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2020-10-12 13:11 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-08 23:09 [Bug target/97343] New: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product already5chosen at yahoo dot com
2020-10-09 6:41 ` [Bug tree-optimization/97343] " rguenth at gcc dot gnu.org
2020-10-09 11:58 ` already5chosen at yahoo dot com
2020-10-12 13:11 ` rguenth at gcc dot gnu.org
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).