public inbox for gcc-bugs@sourceware.org
* [Bug target/97343] New: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product
@ 2020-10-08 23:09 already5chosen at yahoo dot com
  2020-10-09  6:41 ` [Bug tree-optimization/97343] " rguenth at gcc dot gnu.org
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: already5chosen at yahoo dot com @ 2020-10-08 23:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343

            Bug ID: 97343
           Summary: AVX2 vectorizer generates extremely strange and slow
                    code for AoSoA complex dot product
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: already5chosen at yahoo dot com
  Target Milestone: ---

Let's continue our complex dot product series started here
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96854

This time I have no code generation bugs for your pleasure, just "interesting"
optimization issues.
All examples below, unless stated otherwise, were compiled with gcc 10.2 for
x86-64 with the following sets of flags:
set1: -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math -fno-associative-math
set2: -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math

The kernel in question is an example of complex dot product in so-called
"hybrid AoS" layout a.k.a. AoSoA.
https://en.wikipedia.org/wiki/AoS_and_SoA#Array_of_Structures_of_Arrays
In my experience, in dense complex linear algebra and similar computational
fields it is quite rare for AoSoA *not* to be the optimal internal form.
So, practically, I consider these kernels more important than the AoS kernel
presented in bug 96854.

More specifically, the layout can be described as struct { double re[4], im[4]; };
but for simplicity I omitted the type definition from the code examples and
coded directly over flat arrays of doubles.
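For illustration only (the type name below is mine, not part of the examples),
this is the element type that the flat indices emulate: a[c*8+k+0] corresponds
to blk[c].re[k] and a[c*8+k+4] to blk[c].im[k].

typedef struct { double re[4]; double im[4]; } cdouble4; /* one AoSoA block */
/* flat view of block c: re[0..3] at c*8+0 .. c*8+3, im[0..3] at c*8+4 .. c*8+7 */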

Part 1.
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re = 0;
  double acc_im = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_re = acc_re + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
      acc_im = acc_im - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_re;
  res[4] = acc_im;
}

That's how we would want to code it in an ideal world, letting the compiler
take care of the dirty details.
In the less ideal world we live in, gcc is not the only compiler that can't
cope with it. MSVC (-W4 -O2 -fp:fast -arch:AVX2) also can't vectorize it. Even
mighty icc generates code that is not quite bad, but somewhat suboptimal.
So let's let it pass. I don't want to blame gcc for not being smart enough.
That's just normal.
Except that when we use set2 the code generated by gcc becomes not just
non-smart, but quite crazy.
I am ignoring it in the hope that it will be magically fixed by the change
Richard Biener made on 2020-08-31.

Part 2.
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
      acc_reim += a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_imre-acc_reim;
}

Here we explain it to the compiler slowly.
For icc and MSVC that is enough; they understood.
icc generates near-perfect code. I could write it more nicely, but I do not
expect my variant to be any faster.
MSVC generates a near-perfect inner loop and an epilogue that is not great, but
not really much slower.
gcc still didn't get it. It still tries to implement the 4 accumulators
literally, as if -ffast-math were not there.
But, sad as it is, that is still a case of not being smart enough. So I am not
complaining.
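
For reference, here is a minimal intrinsics sketch (my code, not the output of
any of the compilers above) of the kind of loop one would hope the vectorizer
produces for this kernel; it assumes -mavx2 -mfma and the same layout as the
examples. The inner loop is four loads and four FMAs per block, with no
cross-lane shuffles.

#include <immintrin.h>

void cdot_sketch(double* res, const double* a, const double* b, int N)
{
  __m256d acc_rere = _mm256_setzero_pd();
  __m256d acc_imim = _mm256_setzero_pd();
  __m256d acc_reim = _mm256_setzero_pd();
  __m256d acc_imre = _mm256_setzero_pd();
  for (int c = 0; c < N; ++c) {
    __m256d a_re = _mm256_loadu_pd(&a[c*8+0]);  /* re[0..3] of block c */
    __m256d a_im = _mm256_loadu_pd(&a[c*8+4]);  /* im[0..3] of block c */
    __m256d b_re = _mm256_loadu_pd(&b[c*8+0]);
    __m256d b_im = _mm256_loadu_pd(&b[c*8+4]);
    acc_rere = _mm256_fmadd_pd(a_re, b_re, acc_rere);
    acc_imim = _mm256_fmadd_pd(a_im, b_im, acc_imim);
    acc_reim = _mm256_fmadd_pd(a_re, b_im, acc_reim);
    acc_imre = _mm256_fmadd_pd(a_im, b_re, acc_imre);
  }
  /* combine the accumulators lane-wise, then reduce each result vector once */
  __m256d re = _mm256_add_pd(acc_rere, acc_imim);
  __m256d im = _mm256_sub_pd(acc_imre, acc_reim);
  __m128d re2 = _mm_add_pd(_mm256_castpd256_pd128(re),
                           _mm256_extractf128_pd(re, 1));
  __m128d im2 = _mm_add_pd(_mm256_castpd256_pd128(im),
                           _mm256_extractf128_pd(im, 1));
  res[0] = _mm_cvtsd_f64(_mm_add_sd(re2, _mm_unpackhi_pd(re2, re2)));
  res[4] = _mm_cvtsd_f64(_mm_add_sd(im2, _mm_unpackhi_pd(im2, im2)));
}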

Part 3.
static inline double sum4(double x[]) {
  return x[0]+x[1]+x[2]+x[3];
}
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re[4] = {0};
  double acc_im[4] = {0};
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_re[k] = acc_re[k] + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
      acc_im[k] = acc_im[k] - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = sum4(acc_re);
  res[4] = sum4(acc_im);
}

An attempt to feed the compiler by teaspoon.
That's not the way I want to write code in an HLL.
icc copes, producing about the same code as in Part 1.
MSVC doesn't understand such a Kunststück (I am sympathetic) and generates
literal scalar code with local arrays on the stack.
gcc with set1 is a little better than MSVC - the code is fully scalar, but at
least the accumulators are kept in registers.
gcc with set2 is the most interesting. It vectorizes, but how?
Here is the inner loop:

.L3:
        vpermpd $27, (%r8,%rax), %ymm2
        vpermpd $27, 32(%rdx,%rax), %ymm3
        vpermpd $27, (%rdx,%rax), %ymm1
        vpermpd $27, 32(%r8,%rax), %ymm0
        vmulpd  %ymm2, %ymm1, %ymm6
        vmulpd  %ymm2, %ymm3, %ymm2
        addq    $64, %rax
        vfnmadd132pd    %ymm0, %ymm2, %ymm1
        vfmadd132pd     %ymm3, %ymm6, %ymm0
        vaddpd  %ymm1, %ymm5, %ymm5
        vaddpd  %ymm0, %ymm4, %ymm4
        cmpq    %rcx, %rax
        jne     .L3

What is all this vpermpd business about? Shuffling SIMD lanes around just
because it's fun?
That's the first thing I do want to complain about. Not "not smart enough", but
too smart for its own good.

And finally,
Part 4.
static inline double sum4(double x[]) {
  return x[0]+x[1]+x[2]+x[3];
}
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere[4] = {0};
  double acc_imim[4] = {0};
  double acc_reim[4] = {0};
  double acc_imre[4] = {0};
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere[k] += a[c*8+k+0]*b[c*8+k+0];
      acc_imim[k] += a[c*8+k+4]*b[c*8+k+4];
      acc_reim[k] += a[c*8+k+0]*b[c*8+k+4];
      acc_imre[k] += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  double acc_re[4];
  double acc_im[4];
  for (int k = 0; k < 4; ++k) {
    acc_re[k] = acc_rere[k]+acc_imim[k];
    acc_im[k] = acc_imre[k]-acc_reim[k];
  }
  res[0] = sum4(acc_re);
  res[4] = sum4(acc_im);
}

Not just fed by teaspoon; this time the compiler's mouth is held open manually,
so to speak.
icc, of course, understands and generates pretty much the same good code as in
Part 2.
MSVC, of course, does not understand and generates arrays on the stack.

gcc with set2, of course, continues to enjoy the juggling, doubling or tripling
it compared to last time.
Inner loop:
.L3:
        vmovupd (%rdx,%rax), %ymm1
        vmovupd 32(%rdx,%rax), %ymm0
        vmovupd 32(%r8,%rax), %ymm5
        vperm2f128      $49, %ymm1, %ymm0, %ymm3
        vinsertf128     $1, %xmm1, %ymm0, %ymm0
        vpermpd $221, %ymm0, %ymm10
        vpermpd $136, %ymm0, %ymm1
        vmovupd (%r8,%rax), %ymm0
        vpermpd $136, %ymm3, %ymm9
        vperm2f128      $49, %ymm5, %ymm0, %ymm2
        vinsertf128     $1, %xmm5, %ymm0, %ymm0
        vpermpd $40, %ymm2, %ymm11
        vpermpd $125, %ymm0, %ymm5
        vpermpd $221, %ymm3, %ymm3
        vpermpd $40, %ymm0, %ymm0
        vpermpd $125, %ymm2, %ymm2
        addq    $64, %rax
        vfmadd231pd     %ymm2, %ymm3, %ymm8
        vfmadd231pd     %ymm11, %ymm9, %ymm6
        vfmadd231pd     %ymm5, %ymm10, %ymm7
        vfmadd231pd     %ymm0, %ymm1, %ymm4
        cmpq    %rax, %rcx
        jne     .L3

But this time gcc with set1 was the real star of the show. My only reaction is
"What?"
.L4:
        vmovupd 0(%r13), %ymm5
        vmovupd 64(%r13), %ymm7
        vmovupd 192(%r13), %ymm4
        vmovupd 128(%r13), %ymm6
        vunpcklpd       32(%r13), %ymm5, %ymm13
        vunpckhpd       32(%r13), %ymm5, %ymm12
        vunpckhpd       96(%r13), %ymm7, %ymm1
        vunpcklpd       96(%r13), %ymm7, %ymm5
        vmovupd 128(%r13), %ymm7
        vunpcklpd       224(%r13), %ymm4, %ymm2
        vunpcklpd       160(%r13), %ymm6, %ymm6
        vunpckhpd       160(%r13), %ymm7, %ymm11
        vunpckhpd       224(%r13), %ymm4, %ymm0
        vpermpd $216, %ymm13, %ymm13
        vpermpd $216, %ymm6, %ymm6
        vpermpd $216, %ymm2, %ymm2
        vpermpd $216, %ymm5, %ymm5
        vunpcklpd       %ymm2, %ymm6, %ymm4
        vpermpd $216, %ymm1, %ymm1
        vpermpd $216, %ymm11, %ymm11
        vunpcklpd       %ymm5, %ymm13, %ymm9
        vpermpd $216, %ymm12, %ymm12
        vpermpd $216, %ymm0, %ymm0
        vpermpd $216, %ymm4, %ymm3
        vpermpd $216, %ymm9, %ymm9
        vunpckhpd       %ymm2, %ymm6, %ymm4
        vunpckhpd       %ymm5, %ymm13, %ymm5
        vunpcklpd       %ymm1, %ymm12, %ymm6
        vunpcklpd       %ymm0, %ymm11, %ymm2
        vunpckhpd       %ymm1, %ymm12, %ymm12
        vunpckhpd       %ymm0, %ymm11, %ymm0
        vpermpd $216, %ymm12, %ymm1
        vunpcklpd       %ymm3, %ymm9, %ymm11
        vpermpd $216, %ymm5, %ymm5
        vpermpd $216, %ymm4, %ymm4
        vpermpd $216, %ymm0, %ymm0
        vmovupd 64(%r12), %ymm15
        vpermpd $216, %ymm6, %ymm6
        vpermpd $216, %ymm11, %ymm8
        vpermpd $216, %ymm2, %ymm2
        vunpcklpd       %ymm4, %ymm5, %ymm11
        vunpckhpd       %ymm3, %ymm9, %ymm9
        vunpckhpd       %ymm4, %ymm5, %ymm4
        vunpcklpd       %ymm0, %ymm1, %ymm5
        vpermpd $216, %ymm9, %ymm3
        vunpcklpd       %ymm2, %ymm6, %ymm9
        vunpckhpd       %ymm2, %ymm6, %ymm2
        vpermpd $216, %ymm5, %ymm6
        vunpcklpd       96(%r12), %ymm15, %ymm12
        vunpckhpd       %ymm0, %ymm1, %ymm0
        vmovupd %ymm6, 64(%rsp)
        vunpckhpd       96(%r12), %ymm15, %ymm6
        vmovupd 128(%r12), %ymm15
        vpermpd $216, %ymm0, %ymm5
        vpermpd $216, %ymm9, %ymm7
        vmovupd (%r12), %ymm0
        vunpckhpd       160(%r12), %ymm15, %ymm9
        vmovupd %ymm5, 96(%rsp)
        vunpcklpd       160(%r12), %ymm15, %ymm5
        vmovupd 192(%r12), %ymm15
        vunpcklpd       32(%r12), %ymm0, %ymm1
        vpermpd $216, %ymm9, %ymm14
        vunpcklpd       224(%r12), %ymm15, %ymm9
        vunpckhpd       224(%r12), %ymm15, %ymm13
        vunpckhpd       32(%r12), %ymm0, %ymm0
        vpermpd $216, %ymm12, %ymm12
        vpermpd $216, %ymm9, %ymm9
        vpermpd $216, %ymm1, %ymm1
        vpermpd $216, %ymm5, %ymm5
        vpermpd $216, %ymm6, %ymm6
        vunpcklpd       %ymm12, %ymm1, %ymm10
        vpermpd $216, %ymm0, %ymm0
        vpermpd $216, %ymm13, %ymm13
        vunpckhpd       %ymm12, %ymm1, %ymm1
        vunpcklpd       %ymm9, %ymm5, %ymm12
        vpermpd $216, %ymm12, %ymm12
        vpermpd $216, %ymm10, %ymm10
        vunpckhpd       %ymm9, %ymm5, %ymm5
        vunpcklpd       %ymm6, %ymm0, %ymm9
        vunpckhpd       %ymm6, %ymm0, %ymm0
        vunpcklpd       %ymm13, %ymm14, %ymm6
        vunpckhpd       %ymm13, %ymm14, %ymm13
        vpermpd $216, %ymm13, %ymm14
        vunpcklpd       %ymm12, %ymm10, %ymm13
        vpermpd $216, %ymm13, %ymm13
        vmulpd  %ymm13, %ymm8, %ymm15
        vpermpd $216, %ymm5, %ymm5
        vpermpd $216, %ymm6, %ymm6
        vpermpd $216, %ymm1, %ymm1
        vpermpd $216, %ymm9, %ymm9
        vpermpd $216, %ymm0, %ymm0
        vunpckhpd       %ymm12, %ymm10, %ymm10
        vunpcklpd       %ymm6, %ymm9, %ymm12
        vunpckhpd       %ymm6, %ymm9, %ymm9
        vunpcklpd       %ymm5, %ymm1, %ymm6
        vunpckhpd       %ymm5, %ymm1, %ymm1
        vunpcklpd       %ymm14, %ymm0, %ymm5
        vunpckhpd       %ymm14, %ymm0, %ymm0
        vpermpd $216, %ymm0, %ymm0
        vmovupd %ymm0, 160(%rsp)
        vmovq   %r9, %xmm0
        vaddsd  %xmm15, %xmm0, %xmm0
        vunpckhpd       %xmm15, %xmm15, %xmm14
        vpermpd $216, %ymm10, %ymm10
        vaddsd  %xmm14, %xmm0, %xmm0
        vextractf128    $0x1, %ymm15, %xmm14
        vmulpd  %ymm10, %ymm8, %ymm8
        vaddsd  %xmm14, %xmm0, %xmm15
        vunpckhpd       %xmm14, %xmm14, %xmm14
        vpermpd $216, %ymm12, %ymm12
        vaddsd  %xmm14, %xmm15, %xmm0
        vmulpd  %ymm10, %ymm3, %ymm15
        vunpckhpd       %xmm8, %xmm8, %xmm10
        vmovq   %xmm0, %r9
        vmovq   %rcx, %xmm0
        vmulpd  %ymm13, %ymm3, %ymm3
        vaddsd  %xmm15, %xmm0, %xmm0
        vunpckhpd       %xmm15, %xmm15, %xmm14
        vextractf128    $0x1, %ymm15, %xmm15
        vaddsd  %xmm14, %xmm0, %xmm14
        vpermpd $216, %ymm1, %ymm1
        vmovupd %ymm1, 128(%rsp)
        vaddsd  %xmm15, %xmm14, %xmm14
        vunpckhpd       %xmm15, %xmm15, %xmm15
        vpermpd $216, %ymm2, %ymm2
        vaddsd  %xmm15, %xmm14, %xmm0
        vmovsd  56(%rsp), %xmm14
        vpermpd $216, %ymm9, %ymm9
        vaddsd  %xmm8, %xmm14, %xmm14
        vextractf128    $0x1, %ymm8, %xmm8
        vmovq   %xmm0, %rcx
        vaddsd  %xmm10, %xmm14, %xmm10
        vpermpd $216, %ymm6, %ymm6
        vpermpd $216, %ymm11, %ymm11
        vaddsd  %xmm8, %xmm10, %xmm10
        vunpckhpd       %xmm8, %xmm8, %xmm8
        vpermpd $216, %ymm4, %ymm4
        vaddsd  %xmm8, %xmm10, %xmm0
        vmovsd  48(%rsp), %xmm10
        vunpckhpd       %xmm3, %xmm3, %xmm8
        vaddsd  %xmm3, %xmm10, %xmm10
        vextractf128    $0x1, %ymm3, %xmm3
        vmovsd  %xmm0, 56(%rsp)
        vaddsd  %xmm8, %xmm10, %xmm8
        vmulpd  %ymm12, %ymm7, %ymm10
        vmulpd  %ymm9, %ymm7, %ymm7
        vaddsd  %xmm3, %xmm8, %xmm8
        vunpckhpd       %xmm3, %xmm3, %xmm3
        vpermpd $216, %ymm5, %ymm5
        vaddsd  %xmm3, %xmm8, %xmm0
        vunpckhpd       %xmm10, %xmm10, %xmm3
        addq    $256, %r12
        vmovsd  %xmm0, 48(%rsp)
        vmovq   %rdi, %xmm0
        vaddsd  %xmm10, %xmm0, %xmm8
        vextractf128    $0x1, %ymm10, %xmm10
        vmovq   %rbx, %xmm0
        vaddsd  %xmm3, %xmm8, %xmm3
        vmulpd  %ymm9, %ymm2, %ymm8
        vmulpd  %ymm12, %ymm2, %ymm2
        vaddsd  %xmm10, %xmm3, %xmm3
        vunpckhpd       %xmm10, %xmm10, %xmm10
        addq    $256, %r13
        vaddsd  %xmm10, %xmm3, %xmm1
        vaddsd  %xmm8, %xmm0, %xmm10
        vunpckhpd       %xmm8, %xmm8, %xmm3
        vextractf128    $0x1, %ymm8, %xmm8
        vaddsd  %xmm3, %xmm10, %xmm3
        vmovq   %xmm1, %rdi
        vmovq   %r11, %xmm1
        vaddsd  %xmm8, %xmm3, %xmm3
        vunpckhpd       %xmm8, %xmm8, %xmm8
        vmovq   %r10, %xmm0
        vaddsd  %xmm8, %xmm3, %xmm3
        vmovsd  40(%rsp), %xmm8
        vaddsd  %xmm7, %xmm8, %xmm8
        vmovq   %xmm3, %rbx
        vunpckhpd       %xmm7, %xmm7, %xmm3
        vaddsd  %xmm3, %xmm8, %xmm3
        vextractf128    $0x1, %ymm7, %xmm7
        vaddsd  %xmm7, %xmm3, %xmm3
        vunpckhpd       %xmm7, %xmm7, %xmm7
        vaddsd  %xmm7, %xmm3, %xmm3
        vmovsd  32(%rsp), %xmm7
        vaddsd  %xmm2, %xmm7, %xmm7
        vmovsd  %xmm3, 40(%rsp)
        vunpckhpd       %xmm2, %xmm2, %xmm3
        vaddsd  %xmm3, %xmm7, %xmm3
        vextractf128    $0x1, %ymm2, %xmm2
        vmulpd  %ymm6, %ymm11, %ymm7
        vaddsd  %xmm2, %xmm3, %xmm3
        vunpckhpd       %xmm2, %xmm2, %xmm2
        vaddsd  %xmm2, %xmm3, %xmm2
        vaddsd  %xmm7, %xmm1, %xmm3
        vmovupd 128(%rsp), %ymm1
        vmovsd  %xmm2, 32(%rsp)
        vunpckhpd       %xmm7, %xmm7, %xmm2
        vaddsd  %xmm2, %xmm3, %xmm2
        vextractf128    $0x1, %ymm7, %xmm7
        vmulpd  %ymm1, %ymm4, %ymm3
        vaddsd  %xmm7, %xmm2, %xmm2
        vunpckhpd       %xmm7, %xmm7, %xmm7
        vmulpd  %ymm1, %ymm11, %ymm1
        vaddsd  %xmm7, %xmm2, %xmm2
        vaddsd  %xmm3, %xmm0, %xmm7
        vmulpd  %ymm6, %ymm4, %ymm4
        vmovq   %xmm2, %r11
        vunpckhpd       %xmm3, %xmm3, %xmm2
        vaddsd  %xmm2, %xmm7, %xmm2
        vextractf128    $0x1, %ymm3, %xmm3
        vmovupd 64(%rsp), %ymm6
        vaddsd  %xmm3, %xmm2, %xmm2
        vunpckhpd       %xmm3, %xmm3, %xmm3
        vmovupd 96(%rsp), %ymm7
        vaddsd  %xmm3, %xmm2, %xmm2
        vmovsd  24(%rsp), %xmm3
        vmovupd 160(%rsp), %ymm0
        vaddsd  %xmm1, %xmm3, %xmm3
        vmovq   %xmm2, %r10
        vunpckhpd       %xmm1, %xmm1, %xmm2
        vaddsd  %xmm2, %xmm3, %xmm2
        vextractf128    $0x1, %ymm1, %xmm1
        vmovq   %rbp, %xmm3
        vaddsd  %xmm1, %xmm2, %xmm2
        vunpckhpd       %xmm1, %xmm1, %xmm1
        vaddsd  %xmm1, %xmm2, %xmm2
        vunpckhpd       %xmm4, %xmm4, %xmm1
        vmovsd  %xmm2, 24(%rsp)
        vmovsd  16(%rsp), %xmm2
        vaddsd  %xmm4, %xmm2, %xmm2
        vextractf128    $0x1, %ymm4, %xmm4
        vaddsd  %xmm1, %xmm2, %xmm1
        vaddsd  %xmm4, %xmm1, %xmm1
        vunpckhpd       %xmm4, %xmm4, %xmm4
        vaddsd  %xmm4, %xmm1, %xmm4
        vmovsd  %xmm4, 16(%rsp)
        vmulpd  %ymm6, %ymm5, %ymm4
        vmulpd  %ymm7, %ymm5, %ymm5
        vaddsd  %xmm4, %xmm3, %xmm1
        vunpckhpd       %xmm4, %xmm4, %xmm2
        vmovq   %rsi, %xmm3
        vaddsd  %xmm2, %xmm1, %xmm2
        vextractf128    $0x1, %ymm4, %xmm1
        vaddsd  %xmm1, %xmm2, %xmm2
        vunpckhpd       %xmm1, %xmm1, %xmm1
        vaddsd  %xmm1, %xmm2, %xmm4
        vmovq   %xmm4, %rbp
        vmulpd  %ymm0, %ymm7, %ymm4
        vmulpd  %ymm0, %ymm6, %ymm0
        vaddsd  %xmm4, %xmm3, %xmm1
        vunpckhpd       %xmm4, %xmm4, %xmm2
        vaddsd  %xmm2, %xmm1, %xmm2
        vextractf128    $0x1, %ymm4, %xmm1
        vaddsd  %xmm1, %xmm2, %xmm2
        vunpckhpd       %xmm1, %xmm1, %xmm1
        vaddsd  %xmm1, %xmm2, %xmm4
        vmovsd  8(%rsp), %xmm2
        vunpckhpd       %xmm0, %xmm0, %xmm1
        vaddsd  %xmm0, %xmm2, %xmm2
        vextractf128    $0x1, %ymm0, %xmm0
        vmovq   %xmm4, %rsi
        vaddsd  %xmm1, %xmm2, %xmm1
        vaddsd  %xmm0, %xmm1, %xmm1
        vunpckhpd       %xmm0, %xmm0, %xmm0
        vaddsd  %xmm0, %xmm1, %xmm6
        vmovsd  (%rsp), %xmm1
        vunpckhpd       %xmm5, %xmm5, %xmm0
        vaddsd  %xmm5, %xmm1, %xmm1
        vextractf128    $0x1, %ymm5, %xmm5
        vmovsd  %xmm6, 8(%rsp)
        vaddsd  %xmm0, %xmm1, %xmm0
        vaddsd  %xmm5, %xmm0, %xmm0
        vunpckhpd       %xmm5, %xmm5, %xmm5
        vaddsd  %xmm5, %xmm0, %xmm5
        vmovsd  %xmm5, (%rsp)
        cmpq    %rax, %r12
        jne     .L4
        movl    %r15d, %r12d
        andl    $-4, %r12d
        movl    %r12d, %edx
        cmpl    %r12d, %r15d
        je      .L5
.L3:
        movl    %r15d, %eax
        subl    %r12d, %eax
        cmpl    $1, %eax
        je      .L6
        salq    $6, %r12
        leaq    (%r14,%r12), %r13
        vmovupd 16(%r13), %xmm3
        vmovupd 48(%r13), %xmm0
        vmovupd 64(%r13), %xmm8
        vmovupd 112(%r13), %xmm10
        vmovupd 0(%r13), %xmm4
        vmovupd 32(%r13), %xmm2
        vmovupd 80(%r13), %xmm6
        vmovupd 96(%r13), %xmm1
        vunpcklpd       %xmm3, %xmm4, %xmm5
        vunpckhpd       %xmm3, %xmm4, %xmm4
        vunpcklpd       %xmm0, %xmm2, %xmm3
        vunpckhpd       %xmm0, %xmm2, %xmm2
        vunpcklpd       %xmm6, %xmm8, %xmm0
        vunpckhpd       %xmm6, %xmm8, %xmm6
        vunpcklpd       %xmm10, %xmm1, %xmm8
        vunpckhpd       %xmm10, %xmm1, %xmm1
        vunpcklpd       %xmm3, %xmm5, %xmm11
        vunpcklpd       %xmm2, %xmm4, %xmm10
        vunpckhpd       %xmm3, %xmm5, %xmm3
        vunpckhpd       %xmm2, %xmm4, %xmm2
        vunpcklpd       %xmm8, %xmm0, %xmm5
        vunpcklpd       %xmm1, %xmm6, %xmm4
        vunpckhpd       %xmm8, %xmm0, %xmm0
        vunpckhpd       %xmm1, %xmm6, %xmm1
        addq    %r8, %r12
        vunpcklpd       %xmm5, %xmm11, %xmm8
        vunpckhpd       %xmm0, %xmm3, %xmm7
        vunpckhpd       %xmm5, %xmm11, %xmm11
        vunpckhpd       %xmm1, %xmm2, %xmm5
        vmovupd 64(%r12), %xmm12
        vunpcklpd       %xmm1, %xmm2, %xmm6
        vmovupd 80(%r12), %xmm9
        vmovupd 48(%r12), %xmm1
        vmovupd 96(%r12), %xmm2
        vunpcklpd       %xmm4, %xmm10, %xmm14
        vunpcklpd       %xmm0, %xmm3, %xmm13
        vunpckhpd       %xmm4, %xmm10, %xmm10
        vmovupd 32(%r12), %xmm3
        vmovupd 16(%r12), %xmm4
        vmovapd %xmm7, 64(%rsp)
        vmovapd %xmm5, 96(%rsp)
        vmovupd 112(%r12), %xmm7
        vmovupd (%r12), %xmm5
        movl    %eax, %r12d
        vunpcklpd       %xmm4, %xmm5, %xmm15
        vunpckhpd       %xmm4, %xmm5, %xmm5
        vunpcklpd       %xmm1, %xmm3, %xmm4
        vunpckhpd       %xmm1, %xmm3, %xmm3
        vunpcklpd       %xmm9, %xmm12, %xmm1
        vunpckhpd       %xmm9, %xmm12, %xmm9
        vunpcklpd       %xmm7, %xmm2, %xmm12
        vunpckhpd       %xmm7, %xmm2, %xmm2
        vunpcklpd       %xmm4, %xmm15, %xmm7
        vunpckhpd       %xmm4, %xmm15, %xmm15
        vunpcklpd       %xmm12, %xmm1, %xmm4
        vunpckhpd       %xmm12, %xmm1, %xmm1
        vunpcklpd       %xmm3, %xmm5, %xmm12
        vunpckhpd       %xmm3, %xmm5, %xmm5
        vunpcklpd       %xmm2, %xmm9, %xmm3
        vunpckhpd       %xmm2, %xmm9, %xmm2
        vunpcklpd       %xmm4, %xmm7, %xmm9
        vunpckhpd       %xmm1, %xmm15, %xmm0
        vunpckhpd       %xmm4, %xmm7, %xmm4
        vunpcklpd       %xmm3, %xmm12, %xmm7
        vunpckhpd       %xmm3, %xmm12, %xmm3
        vunpcklpd       %xmm1, %xmm15, %xmm12
        vunpcklpd       %xmm2, %xmm5, %xmm15
        vunpckhpd       %xmm2, %xmm5, %xmm2
        vmulpd  %xmm9, %xmm8, %xmm5
        vmovapd %xmm0, 128(%rsp)
        vmovq   %r9, %xmm0
        andl    $-2, %r12d
        addl    %r12d, %edx
        vaddsd  %xmm5, %xmm0, %xmm0
        vunpckhpd       %xmm5, %xmm5, %xmm5
        vaddsd  %xmm5, %xmm0, %xmm1
        vmulpd  %xmm4, %xmm11, %xmm5
        vmulpd  %xmm4, %xmm8, %xmm4
        vmovq   %xmm1, %r9
        vmovq   %rcx, %xmm1
        vmulpd  %xmm9, %xmm11, %xmm11
        vaddsd  %xmm5, %xmm1, %xmm1
        vunpckhpd       %xmm5, %xmm5, %xmm5
        vmulpd  %xmm7, %xmm14, %xmm9
        vaddsd  %xmm5, %xmm1, %xmm1
        vmovsd  56(%rsp), %xmm5
        vmulpd  %xmm3, %xmm10, %xmm8
        vaddsd  %xmm4, %xmm5, %xmm5
        vunpckhpd       %xmm4, %xmm4, %xmm4
        vmovq   %xmm1, %rcx
        vaddsd  %xmm4, %xmm5, %xmm4
        vmovq   %rdi, %xmm1
        vmulpd  %xmm3, %xmm14, %xmm14
        vmovsd  %xmm4, 56(%rsp)
        vmovsd  48(%rsp), %xmm4
        vmovq   %rbx, %xmm0
        vaddsd  %xmm11, %xmm4, %xmm4
        vunpckhpd       %xmm11, %xmm11, %xmm11
        vmovsd  40(%rsp), %xmm3
        vaddsd  %xmm11, %xmm4, %xmm4
        vmulpd  %xmm7, %xmm10, %xmm10
        vaddsd  %xmm14, %xmm3, %xmm3
        vmovsd  %xmm4, 48(%rsp)
        vaddsd  %xmm9, %xmm1, %xmm4
        vunpckhpd       %xmm9, %xmm9, %xmm9
        vunpckhpd       %xmm14, %xmm14, %xmm14
        vaddsd  %xmm9, %xmm4, %xmm4
        vmovapd 128(%rsp), %xmm5
        vmovapd 64(%rsp), %xmm11
        vmovq   %xmm4, %rdi
        vaddsd  %xmm8, %xmm0, %xmm4
        vunpckhpd       %xmm8, %xmm8, %xmm8
        vmovsd  24(%rsp), %xmm1
        vaddsd  %xmm8, %xmm4, %xmm4
        vmovsd  16(%rsp), %xmm0
        vmovq   %xmm4, %rbx
        vaddsd  %xmm14, %xmm3, %xmm4
        vmovsd  32(%rsp), %xmm3
        vaddsd  %xmm10, %xmm3, %xmm3
        vunpckhpd       %xmm10, %xmm10, %xmm10
        vmovsd  %xmm4, 40(%rsp)
        vaddsd  %xmm10, %xmm3, %xmm7
        vmulpd  %xmm12, %xmm13, %xmm3
        vmulpd  %xmm5, %xmm13, %xmm13
        vmovsd  %xmm7, 32(%rsp)
        vmovq   %r11, %xmm7
        vmulpd  %xmm11, %xmm12, %xmm12
        vaddsd  %xmm3, %xmm7, %xmm4
        vunpckhpd       %xmm3, %xmm3, %xmm3
        vaddsd  %xmm13, %xmm1, %xmm1
        vaddsd  %xmm3, %xmm4, %xmm7
        vmulpd  %xmm5, %xmm11, %xmm3
        vunpckhpd       %xmm13, %xmm13, %xmm13
        vmovq   %xmm7, %r11
        vmovq   %r10, %xmm7
        vaddsd  %xmm12, %xmm0, %xmm0
        vaddsd  %xmm3, %xmm7, %xmm4
        vunpckhpd       %xmm3, %xmm3, %xmm3
        vunpckhpd       %xmm12, %xmm12, %xmm12
        vaddsd  %xmm3, %xmm4, %xmm7
        vaddsd  %xmm13, %xmm1, %xmm4
        vmovq   %xmm7, %r10
        vmovsd  %xmm4, 24(%rsp)
        vaddsd  %xmm12, %xmm0, %xmm4
        vmulpd  %xmm15, %xmm6, %xmm0
        vmovq   %rbp, %xmm7
        vmovsd  %xmm4, 16(%rsp)
        vmovapd 96(%rsp), %xmm5
        vaddsd  %xmm0, %xmm7, %xmm1
        vunpckhpd       %xmm0, %xmm0, %xmm0
        vmovq   %rsi, %xmm7
        vaddsd  %xmm0, %xmm1, %xmm4
        vmulpd  %xmm5, %xmm2, %xmm0
        vmulpd  %xmm2, %xmm6, %xmm2
        vmovq   %xmm4, %rbp
        vmulpd  %xmm5, %xmm15, %xmm15
        vaddsd  %xmm0, %xmm7, %xmm1
        vunpckhpd       %xmm0, %xmm0, %xmm0
        vaddsd  %xmm0, %xmm1, %xmm4
        vmovsd  8(%rsp), %xmm0
        vaddsd  %xmm2, %xmm0, %xmm0
        vunpckhpd       %xmm2, %xmm2, %xmm2
        vmovq   %xmm4, %rsi
        vaddsd  %xmm2, %xmm0, %xmm6
        vmovsd  (%rsp), %xmm0
        vaddsd  %xmm15, %xmm0, %xmm0
        vunpckhpd       %xmm15, %xmm15, %xmm15
        vmovsd  %xmm6, 8(%rsp)
        vaddsd  %xmm15, %xmm0, %xmm5
        vmovsd  %xmm5, (%rsp)
        cmpl    %r12d, %eax
        je      .L5

* [Bug tree-optimization/97343] AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product
  2020-10-08 23:09 [Bug target/97343] New: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product already5chosen at yahoo dot com
@ 2020-10-09  6:41 ` rguenth at gcc dot gnu.org
  2020-10-09 11:58 ` already5chosen at yahoo dot com
  2020-10-12 13:11 ` rguenth at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-10-09  6:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
          Component|target                      |tree-optimization
     Ever confirmed|0                           |1
             Target|                            |x86_64-*-* i?86-*-*
             Blocks|                            |53947
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2020-10-09
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
All below for Part 2.

Without -ffast-math you are seeing GCC using in-order reductions now while
with -ffast-math the vectorizer gets a bit confused about reassociations done
before, for me producing

.L3:
        vmovupd 32(%rsi,%rax), %ymm3
        vmovupd (%rdx,%rax), %ymm7
        vinsertf128     $1, (%rsi,%rax), %ymm3, %ymm0
        vinsertf128     $1, 32(%rdx,%rax), %ymm7, %ymm2
        vmovupd 32(%rsi,%rax), %ymm5
        vpermpd $136, %ymm0, %ymm4
        vpermpd $40, %ymm2, %ymm7
        vpermpd $221, %ymm0, %ymm1
        vpermpd $125, %ymm2, %ymm3
        vperm2f128      $49, (%rsi,%rax), %ymm5, %ymm0
        vmovupd (%rdx,%rax), %ymm2
        vperm2f128      $49, 32(%rdx,%rax), %ymm2, %ymm2
        addq    $64, %rax
        vpermpd $136, %ymm0, %ymm5
        vpermpd $221, %ymm0, %ymm0
        vpermpd $40, %ymm2, %ymm8
        vpermpd $125, %ymm2, %ymm2
        vmulpd  %ymm8, %ymm5, %ymm5
        vmulpd  %ymm2, %ymm0, %ymm0
        vfmadd132pd     %ymm3, %ymm5, %ymm1
        vfmadd231pd     %ymm7, %ymm4, %ymm0
        vaddpd  %ymm0, %ymm1, %ymm0
        vaddpd  %ymm0, %ymm6, %ymm6
        cmpq    %rcx, %rax
        jne     .L3

-ffast-math vs. non-ffast-math we're using a SLP reduction vs. 4 reduction
chains and this SLP reduction ends up looking like

t5.c:7:21: note:   Vectorizing SLP tree:
t5.c:7:21: note:   node 0x4100c20 (max_nunits=4, refcnt=2)
t5.c:7:21: note:        stmt 0 acc_imre_158 = acc_imre_3 + _34;
t5.c:7:21: note:        stmt 1 acc_reim_156 = acc_reim_1 + _8;
t5.c:7:21: note:        stmt 2 acc_imim_154 = _21 + acc_imim_35;
t5.c:7:21: note:        stmt 3 acc_rere_146 = _11 + acc_rere_29;
t5.c:7:21: note:        children 0x3f272e0 0x4100bb0
t5.c:7:21: note:   node 0x3f272e0 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 acc_imre_3 = PHI <acc_imre_158(7), 0.0(8)>
t5.c:7:21: note:        stmt 1 acc_reim_1 = PHI <acc_reim_156(7), 0.0(8)>
t5.c:7:21: note:        stmt 2 acc_imim_35 = PHI <acc_imim_154(7), 0.0(8)>
t5.c:7:21: note:        stmt 3 acc_rere_29 = PHI <acc_rere_146(7), 0.0(8)>
t5.c:7:21: note:        children 0x4100c20
t5.c:7:21: note:   node 0x4100bb0 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _34 = _36 + _157;
t5.c:7:21: note:        stmt 1 _8 = _30 + _155;
t5.c:7:21: note:        stmt 2 _21 = _15 + _153;
t5.c:7:21: note:        stmt 3 _11 = _6 + _145;
t5.c:7:21: note:        children 0x4100920 0x4100b40
t5.c:7:21: note:   node 0x4100920 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _36 = _37 + _73;
t5.c:7:21: note:        stmt 1 _30 = _32 + _71;
t5.c:7:21: note:        stmt 2 _15 = _10 + _69;
t5.c:7:21: note:        stmt 3 _6 = _31 + _61;
t5.c:7:21: note:        children 0x41004e0 0x41008b0
t5.c:7:21: note:   node 0x41004e0 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _37 = _101 + _129;
t5.c:7:21: note:        stmt 1 _32 = _99 + _127;
t5.c:7:21: note:        stmt 2 _10 = _97 + _125;
t5.c:7:21: note:        stmt 3 _31 = _89 + _117;
t5.c:7:21: note:        children 0x3f2a550 0x3f28700
t5.c:7:21: note:   node 0x3f2a550 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _101 = _88 * _94;
t5.c:7:21: note:        stmt 1 _99 = _86 * _96;
t5.c:7:21: note:        stmt 2 _97 = _94 * _96;
t5.c:7:21: note:        stmt 3 _89 = _86 * _88;
t5.c:7:21: note:        children 0x40b6990 0x3f29e00
t5.c:7:21: note:   node 0x40b6990 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _88 = *_87;
t5.c:7:21: note:        stmt 1 _96 = *_95;
t5.c:7:21: note:        stmt 2 _96 = *_95;
t5.c:7:21: note:        stmt 3 _88 = *_87;
t5.c:7:21: note:        load permutation { 1 5 5 1 }
t5.c:7:21: note:   node 0x3f29e00 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _94 = *_93;
t5.c:7:21: note:        stmt 1 _86 = *_85;
t5.c:7:21: note:        stmt 2 _94 = *_93;
t5.c:7:21: note:        stmt 3 _86 = *_85;
t5.c:7:21: note:        load permutation { 5 1 5 1 }
t5.c:7:21: note:   node 0x3f28700 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _129 = _116 * _122;
t5.c:7:21: note:        stmt 1 _127 = _114 * _124;
t5.c:7:21: note:        stmt 2 _125 = _122 * _124;
t5.c:7:21: note:        stmt 3 _117 = _114 * _116;
t5.c:7:21: note:        children 0x3f287e0 0x3f28770
t5.c:7:21: note:   node 0x3f287e0 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _116 = *_115;
t5.c:7:21: note:        stmt 1 _124 = *_123;
t5.c:7:21: note:        stmt 2 _124 = *_123;
t5.c:7:21: note:        stmt 3 _116 = *_115;
t5.c:7:21: note:        load permutation { 2 6 6 2 }
t5.c:7:21: note:   node 0x3f28770 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _122 = *_121;
t5.c:7:21: note:        stmt 1 _114 = *_113;
t5.c:7:21: note:        stmt 2 _122 = *_121;
t5.c:7:21: note:        stmt 3 _114 = *_113;
t5.c:7:21: note:        load permutation { 6 2 6 2 }
t5.c:7:21: note:   node 0x41008b0 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _73 = _60 * _66;
t5.c:7:21: note:        stmt 1 _71 = _58 * _68;
t5.c:7:21: note:        stmt 2 _69 = _66 * _68;
t5.c:7:21: note:        stmt 3 _61 = _58 * _60;
t5.c:7:21: note:        children 0x4100290 0x4100810
t5.c:7:21: note:   node 0x4100290 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _60 = *_59;
t5.c:7:21: note:        stmt 1 _68 = *_67;
t5.c:7:21: note:        stmt 2 _68 = *_67;
t5.c:7:21: note:        stmt 3 _60 = *_59;
t5.c:7:21: note:        load permutation { 0 4 4 0 }
t5.c:7:21: note:   node 0x4100810 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _66 = *_65;
t5.c:7:21: note:        stmt 1 _58 = *_57;
t5.c:7:21: note:        stmt 2 _66 = *_65;
t5.c:7:21: note:        stmt 3 _58 = *_57;
t5.c:7:21: note:        load permutation { 4 0 4 0 }
t5.c:7:21: note:   node 0x4100b40 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _157 = _144 * _150;
t5.c:7:21: note:        stmt 1 _155 = _142 * _152;
t5.c:7:21: note:        stmt 2 _153 = _150 * _152;
t5.c:7:21: note:        stmt 3 _145 = _142 * _144;
t5.c:7:21: note:        children 0x4100990 0x4100a50
t5.c:7:21: note:   node 0x4100990 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _144 = *_143;
t5.c:7:21: note:        stmt 1 _152 = *_151;
t5.c:7:21: note:        stmt 2 _152 = *_151;
t5.c:7:21: note:        stmt 3 _144 = *_143;
t5.c:7:21: note:        load permutation { 3 7 7 3 }
t5.c:7:21: note:   node 0x4100a50 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _150 = *_149;
t5.c:7:21: note:        stmt 1 _142 = *_141;
t5.c:7:21: note:        stmt 2 _150 = *_149;
t5.c:7:21: note:        stmt 3 _142 = *_141;
t5.c:7:21: note:        load permutation { 7 3 7 3 }

which eventually shows some non-obvious permute optimization opportunities.
I'm currently working on a permute optimization phase, btw., but for a start it
only handles cases that do not help here.

Btw, if I use -ffast-math but disable reassociation via -fno-tree-reassoc I get
the reduction chain variant which optimizes to

.L3:
        vmovupd 32(%rsi,%rax), %ymm6
        vmovupd 32(%rdx,%rax), %ymm7
        vmovupd (%rsi,%rax), %ymm5
        vfmadd231pd     (%rdx,%rax), %ymm6, %ymm0
        vfmadd231pd     (%rdx,%rax), %ymm5, %ymm3
        vfmadd231pd     (%rsi,%rax), %ymm7, %ymm1
        addq    $64, %rax
        vfmadd231pd     %ymm6, %ymm7, %ymm2
        cmpq    %rcx, %rax
        jne     .L3

even with GCC 10 (-Ofast -march=core-avx2 -fno-tree-reassoc).  Which means
the following source change helps:

void __attribute__((optimize("no-tree-reassoc"))) cdot(double* res, const
double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
      acc_reim += a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_imre-acc_reim;
}

the reduction epilogue ends up like

        vextractf128    $0x1, %ymm3, %xmm4
        vaddpd  %xmm3, %xmm4, %xmm3
        vunpckhpd       %xmm3, %xmm3, %xmm4
        vaddpd  %xmm3, %xmm4, %xmm3
        vextractf128    $0x1, %ymm2, %xmm4
        vaddpd  %xmm2, %xmm4, %xmm4
        vunpckhpd       %xmm4, %xmm4, %xmm2
        vaddpd  %xmm4, %xmm2, %xmm2
        vextractf128    $0x1, %ymm1, %xmm4
        vaddpd  %xmm1, %xmm4, %xmm4
        vaddsd  %xmm2, %xmm3, %xmm2
        vunpckhpd       %xmm4, %xmm4, %xmm1
        vaddpd  %xmm4, %xmm1, %xmm1
        vextractf128    $0x1, %ymm0, %xmm4
        vaddpd  %xmm0, %xmm4, %xmm4
        vunpckhpd       %xmm4, %xmm4, %xmm0
        vaddpd  %xmm4, %xmm0, %xmm0
        vsubsd  %xmm1, %xmm0, %xmm0
        vzeroupper
        vmovsd  %xmm2, (%rdi)
        vmovsd  %xmm0, 32(%rdi)

which is not optimal since we miss the opportunity to vectorize the
adds of the accumulators

  res[0] = acc_rere+acc_imim;
  res[4] = acc_imre-acc_reim;


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

* [Bug tree-optimization/97343] AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product
  2020-10-08 23:09 [Bug target/97343] New: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product already5chosen at yahoo dot com
  2020-10-09  6:41 ` [Bug tree-optimization/97343] " rguenth at gcc dot gnu.org
@ 2020-10-09 11:58 ` already5chosen at yahoo dot com
  2020-10-12 13:11 ` rguenth at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: already5chosen at yahoo dot com @ 2020-10-09 11:58 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343

--- Comment #2 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Richard Biener from comment #1)
> All below for Part 2.
> 
> Without -ffast-math you are seeing GCC using in-order reductions now while
> with -ffast-math the vectorizer gets a bit confused about reassociations done
> before, for me producing
> 

Just to understand: when you say "Without -ffast-math", does that include my
set1 == "-O3 -ffast-math -fno-associative-math"?

BTW, I like your phrasing: "a bit confused about reassociations"


> .L3:
>         vmovupd 32(%rsi,%rax), %ymm3
>         vmovupd (%rdx,%rax), %ymm7
>         vinsertf128     $1, (%rsi,%rax), %ymm3, %ymm0
>         vinsertf128     $1, 32(%rdx,%rax), %ymm7, %ymm2
>         vmovupd 32(%rsi,%rax), %ymm5
>         vpermpd $136, %ymm0, %ymm4
>         vpermpd $40, %ymm2, %ymm7
>         vpermpd $221, %ymm0, %ymm1
>         vpermpd $125, %ymm2, %ymm3
>         vperm2f128      $49, (%rsi,%rax), %ymm5, %ymm0
>         vmovupd (%rdx,%rax), %ymm2
>         vperm2f128      $49, 32(%rdx,%rax), %ymm2, %ymm2
>         addq    $64, %rax
>         vpermpd $136, %ymm0, %ymm5
>         vpermpd $221, %ymm0, %ymm0
>         vpermpd $40, %ymm2, %ymm8
>         vpermpd $125, %ymm2, %ymm2
>         vmulpd  %ymm8, %ymm5, %ymm5
>         vmulpd  %ymm2, %ymm0, %ymm0
>         vfmadd132pd     %ymm3, %ymm5, %ymm1
>         vfmadd231pd     %ymm7, %ymm4, %ymm0
>         vaddpd  %ymm0, %ymm1, %ymm0
>         vaddpd  %ymm0, %ymm6, %ymm6
>         cmpq    %rcx, %rax
>         jne     .L3
> 
> -ffast-math vs. non-ffast-math we're using a SLP reduction vs. 4 reduction
> chains and this SLP reduction ends up looking like
> 
> t5.c:7:21: note:   Vectorizing SLP tree:
> t5.c:7:21: note:   node 0x4100c20 (max_nunits=4, refcnt=2)
> t5.c:7:21: note:        stmt 0 acc_imre_158 = acc_imre_3 + _34;
> t5.c:7:21: note:        stmt 1 acc_reim_156 = acc_reim_1 + _8;
> t5.c:7:21: note:        stmt 2 acc_imim_154 = _21 + acc_imim_35;
> t5.c:7:21: note:        stmt 3 acc_rere_146 = _11 + acc_rere_29;
> t5.c:7:21: note:        children 0x3f272e0 0x4100bb0
> t5.c:7:21: note:   node 0x3f272e0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 acc_imre_3 = PHI <acc_imre_158(7), 0.0(8)>
> t5.c:7:21: note:        stmt 1 acc_reim_1 = PHI <acc_reim_156(7), 0.0(8)>
> t5.c:7:21: note:        stmt 2 acc_imim_35 = PHI <acc_imim_154(7), 0.0(8)>
> t5.c:7:21: note:        stmt 3 acc_rere_29 = PHI <acc_rere_146(7), 0.0(8)>
> t5.c:7:21: note:        children 0x4100c20
> t5.c:7:21: note:   node 0x4100bb0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _34 = _36 + _157;
> t5.c:7:21: note:        stmt 1 _8 = _30 + _155;
> t5.c:7:21: note:        stmt 2 _21 = _15 + _153;
> t5.c:7:21: note:        stmt 3 _11 = _6 + _145;
> t5.c:7:21: note:        children 0x4100920 0x4100b40
> t5.c:7:21: note:   node 0x4100920 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _36 = _37 + _73;
> t5.c:7:21: note:        stmt 1 _30 = _32 + _71;
> t5.c:7:21: note:        stmt 2 _15 = _10 + _69;
> t5.c:7:21: note:        stmt 3 _6 = _31 + _61;
> t5.c:7:21: note:        children 0x41004e0 0x41008b0
> t5.c:7:21: note:   node 0x41004e0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _37 = _101 + _129;
> t5.c:7:21: note:        stmt 1 _32 = _99 + _127;
> t5.c:7:21: note:        stmt 2 _10 = _97 + _125;
> t5.c:7:21: note:        stmt 3 _31 = _89 + _117;
> t5.c:7:21: note:        children 0x3f2a550 0x3f28700
> t5.c:7:21: note:   node 0x3f2a550 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _101 = _88 * _94;
> t5.c:7:21: note:        stmt 1 _99 = _86 * _96;
> t5.c:7:21: note:        stmt 2 _97 = _94 * _96;
> t5.c:7:21: note:        stmt 3 _89 = _86 * _88;
> t5.c:7:21: note:        children 0x40b6990 0x3f29e00
> t5.c:7:21: note:   node 0x40b6990 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _88 = *_87;
> t5.c:7:21: note:        stmt 1 _96 = *_95;
> t5.c:7:21: note:        stmt 2 _96 = *_95;
> t5.c:7:21: note:        stmt 3 _88 = *_87;
> t5.c:7:21: note:        load permutation { 1 5 5 1 }
> t5.c:7:21: note:   node 0x3f29e00 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _94 = *_93;
> t5.c:7:21: note:        stmt 1 _86 = *_85;
> t5.c:7:21: note:        stmt 2 _94 = *_93;
> t5.c:7:21: note:        stmt 3 _86 = *_85;
> t5.c:7:21: note:        load permutation { 5 1 5 1 }
> t5.c:7:21: note:   node 0x3f28700 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _129 = _116 * _122;
> t5.c:7:21: note:        stmt 1 _127 = _114 * _124;
> t5.c:7:21: note:        stmt 2 _125 = _122 * _124;
> t5.c:7:21: note:        stmt 3 _117 = _114 * _116;
> t5.c:7:21: note:        children 0x3f287e0 0x3f28770
> t5.c:7:21: note:   node 0x3f287e0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _116 = *_115;
> t5.c:7:21: note:        stmt 1 _124 = *_123;
> t5.c:7:21: note:        stmt 2 _124 = *_123;
> t5.c:7:21: note:        stmt 3 _116 = *_115;
> t5.c:7:21: note:        load permutation { 2 6 6 2 }
> t5.c:7:21: note:   node 0x3f28770 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _122 = *_121;
> t5.c:7:21: note:        stmt 1 _114 = *_113;
> t5.c:7:21: note:        stmt 2 _122 = *_121;
> t5.c:7:21: note:        stmt 3 _114 = *_113;
> t5.c:7:21: note:        load permutation { 6 2 6 2 }
> t5.c:7:21: note:   node 0x41008b0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _73 = _60 * _66;
> t5.c:7:21: note:        stmt 1 _71 = _58 * _68;
> t5.c:7:21: note:        stmt 2 _69 = _66 * _68;
> t5.c:7:21: note:        stmt 3 _61 = _58 * _60;
> t5.c:7:21: note:        children 0x4100290 0x4100810
> t5.c:7:21: note:   node 0x4100290 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _60 = *_59;
> t5.c:7:21: note:        stmt 1 _68 = *_67;
> t5.c:7:21: note:        stmt 2 _68 = *_67;
> t5.c:7:21: note:        stmt 3 _60 = *_59;
> t5.c:7:21: note:        load permutation { 0 4 4 0 }
> t5.c:7:21: note:   node 0x4100810 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _66 = *_65;
> t5.c:7:21: note:        stmt 1 _58 = *_57;
> t5.c:7:21: note:        stmt 2 _66 = *_65;
> t5.c:7:21: note:        stmt 3 _58 = *_57;
> t5.c:7:21: note:        load permutation { 4 0 4 0 }
> t5.c:7:21: note:   node 0x4100b40 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _157 = _144 * _150;
> t5.c:7:21: note:        stmt 1 _155 = _142 * _152;
> t5.c:7:21: note:        stmt 2 _153 = _150 * _152;
> t5.c:7:21: note:        stmt 3 _145 = _142 * _144;
> t5.c:7:21: note:        children 0x4100990 0x4100a50
> t5.c:7:21: note:   node 0x4100990 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _144 = *_143;
> t5.c:7:21: note:        stmt 1 _152 = *_151;
> t5.c:7:21: note:        stmt 2 _152 = *_151;
> t5.c:7:21: note:        stmt 3 _144 = *_143;
> t5.c:7:21: note:        load permutation { 3 7 7 3 }
> t5.c:7:21: note:   node 0x4100a50 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _150 = *_149;
> t5.c:7:21: note:        stmt 1 _142 = *_141;
> t5.c:7:21: note:        stmt 2 _150 = *_149;
> t5.c:7:21: note:        stmt 3 _142 = *_141;
> t5.c:7:21: note:        load permutation { 7 3 7 3 }
> 
> which eventually shows some non-obvious permute optimization opportunities.
> I'm currently working on a permute optimization phase, btw., but for a start
> it only handles cases that do not help here.
> 
> Btw, if I use -ffast-math but disable reassociation via -fno-tree-reassoc I
> get
> the reduction chain variant which optimizes to
> 
> .L3:
>         vmovupd 32(%rsi,%rax), %ymm6
>         vmovupd 32(%rdx,%rax), %ymm7
>         vmovupd (%rsi,%rax), %ymm5
>         vfmadd231pd     (%rdx,%rax), %ymm6, %ymm0
>         vfmadd231pd     (%rdx,%rax), %ymm5, %ymm3
>         vfmadd231pd     (%rsi,%rax), %ymm7, %ymm1
>         addq    $64, %rax
>         vfmadd231pd     %ymm6, %ymm7, %ymm2
>         cmpq    %rcx, %rax
>         jne     .L3
> 
> even with GCC 10 (-Ofast -march=core-avx2 -fno-tree-reassoc). 

It's very fragile. I made a tiny (and natural for my real app) change in the
source (see below) and the nice MSVC-like inner loop disappeared.

void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
      acc_reim -= a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_reim+acc_imre;
}

> Which means
> the following source change helps:
> 
> void __attribute__((optimize("no-tree-reassoc"))) cdot(double* res, const
> double* a, const double* b, int N)
> {
>   double acc_rere = 0;
>   double acc_imim = 0;
>   double acc_reim = 0;
>   double acc_imre = 0;
>   for (int c = 0; c < N; ++c) {
>     for (int k = 0; k < 4; ++k) {
>       acc_rere += a[c*8+k+0]*b[c*8+k+0];
>       acc_imim += a[c*8+k+4]*b[c*8+k+4];
>       acc_reim += a[c*8+k+0]*b[c*8+k+4];
>       acc_imre += a[c*8+k+4]*b[c*8+k+0];
>     }
>   }
>   res[0] = acc_rere+acc_imim;
>   res[4] = acc_imre-acc_reim;
> }
> 

IMHO, options like "no-tree-reassoc", including and especially within
__attribute__((optimize(""))), are for people like you.
People like me, i.e. not compiler guys, don't consider them, not only for
production but even for a hobby, unless the hobby is compiler research.

Also, several years ago I was told (not by you as Richard Biener, but by "you"
as in "gcc maintainers", more specifically by Manuel López-Ibáñez; you, as
Richard Biener, also took part in that discussion but appeared to hold a
different opinion, see bug 70255) that __attribute__((optimize(""))) can't be
relied on in production. Back then we came to the conclusion that this
statement had to be in the official docs. And indeed, since GCC 7, section
6.33.1 contains the following sentence:
"The optimize attribute should be used for debugging purposes only. It is not
suitable in production code."


> the reduction epilogue ends up like
> 
>         vextractf128    $0x1, %ymm3, %xmm4
>         vaddpd  %xmm3, %xmm4, %xmm3
>         vunpckhpd       %xmm3, %xmm3, %xmm4
>         vaddpd  %xmm3, %xmm4, %xmm3
>         vextractf128    $0x1, %ymm2, %xmm4
>         vaddpd  %xmm2, %xmm4, %xmm4
>         vunpckhpd       %xmm4, %xmm4, %xmm2
>         vaddpd  %xmm4, %xmm2, %xmm2
>         vextractf128    $0x1, %ymm1, %xmm4
>         vaddpd  %xmm1, %xmm4, %xmm4
>         vaddsd  %xmm2, %xmm3, %xmm2
>         vunpckhpd       %xmm4, %xmm4, %xmm1
>         vaddpd  %xmm4, %xmm1, %xmm1
>         vextractf128    $0x1, %ymm0, %xmm4
>         vaddpd  %xmm0, %xmm4, %xmm4
>         vunpckhpd       %xmm4, %xmm4, %xmm0
>         vaddpd  %xmm4, %xmm0, %xmm0
>         vsubsd  %xmm1, %xmm0, %xmm0
>         vzeroupper
>         vmovsd  %xmm2, (%rdi)
>         vmovsd  %xmm0, 32(%rdi)
> 
> which is not optimal since we miss the opportunity to vectorize the
> adds of the accumulators
> 
>   res[0] = acc_rere+acc_imim;
>   res[4] = acc_imre-acc_reim;

The epilogue is a tricky matter.
There are many ways to do it with about the same speed, and which variant is
faster could depend on fine microarchitectural details that may differ even
between such quite similar CPUs as Skylake and Zen2, or Skylake and Ice Lake,
or maybe even Skylake and Skylake-X (well, maybe the last one is an
exaggeration). The optimal sequence also depends on the surrounding code, e.g.
if I change the source to:
  res[0] = acc_rere+acc_imim;
  res[1] = acc_reim-acc_imre; // results adjacent in memory
it could already be different.
In the latter case it would likely be
  vaddpd  ymm_rere,ymm_imim,ymm_re
  vsubpd  ymm_reim,ymm_imre,ymm_im
  vhaddpd ymm_im,ymm_re,ymm_reim
  vperm2f128 $1,    ymm_reim, ymm_reim, ymm_reimH
  vaddpd  xmm_reimH, xmm_reim, xmm_reim

Even icc can't do it perfectly right now.
It would be nice (for you personally, at least) if in this case gcc generated
better code than icc, but that is far [far [far...]] less important than robust
handling of the inner loop.
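
For what it's worth, a hedged intrinsics rendering of that sequence (my code
and naming, not generated output), for the results-adjacent-in-memory variant:

#include <immintrin.h>

/* acc_* are the four vector accumulators left by the loop;
   stores re to res[0] and im to res[1] */
static void store_adjacent(double* res,
                           __m256d acc_rere, __m256d acc_imim,
                           __m256d acc_reim, __m256d acc_imre)
{
  __m256d re   = _mm256_add_pd(acc_rere, acc_imim);       /* vaddpd */
  __m256d im   = _mm256_sub_pd(acc_reim, acc_imre);       /* vsubpd */
  __m256d reim = _mm256_hadd_pd(re, im);                  /* vhaddpd:
                     {re0+re1, im0+im1, re2+re3, im2+im3} */
  __m128d sum  = _mm_add_pd(_mm256_castpd256_pd128(reim), /* add the two      */
                            _mm256_extractf128_pd(reim, 1)); /* 128-bit halves */
  _mm_storeu_pd(&res[0], sum);                            /* res[0]=re, res[1]=im */
}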

* [Bug tree-optimization/97343] AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product
  2020-10-08 23:09 [Bug target/97343] New: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product already5chosen at yahoo dot com
  2020-10-09  6:41 ` [Bug tree-optimization/97343] " rguenth at gcc dot gnu.org
  2020-10-09 11:58 ` already5chosen at yahoo dot com
@ 2020-10-12 13:11 ` rguenth at gcc dot gnu.org
  2 siblings, 0 replies; 4+ messages in thread
From: rguenth at gcc dot gnu.org @ 2020-10-12 13:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Michael_S from comment #2)
> (In reply to Richard Biener from comment #1)
> > All below for Part 2.
> > 
> > Without -ffast-math you are seeing GCC using in-order reductions now while
> > with -ffast-math the vectorizer gets a bit confused about reassociations done
> > before, for me producing
> > 
> 
> Just to understand: when you say "Without -ffast-math", does that include
> my set1 == "-O3 -ffast-math -fno-associative-math"?

No, -fno-associative-math breaks it again.

> BTW, I like your phrasing: "a bit confused about reassociations"
> 
> > .L3:
> >         vmovupd 32(%rsi,%rax), %ymm3
> >         vmovupd (%rdx,%rax), %ymm7
> >         vinsertf128     $1, (%rsi,%rax), %ymm3, %ymm0
> >         vinsertf128     $1, 32(%rdx,%rax), %ymm7, %ymm2
> >         vmovupd 32(%rsi,%rax), %ymm5
> >         vpermpd $136, %ymm0, %ymm4
> >         vpermpd $40, %ymm2, %ymm7
> >         vpermpd $221, %ymm0, %ymm1
> >         vpermpd $125, %ymm2, %ymm3
> >         vperm2f128      $49, (%rsi,%rax), %ymm5, %ymm0
> >         vmovupd (%rdx,%rax), %ymm2
> >         vperm2f128      $49, 32(%rdx,%rax), %ymm2, %ymm2
> >         addq    $64, %rax
> >         vpermpd $136, %ymm0, %ymm5
> >         vpermpd $221, %ymm0, %ymm0
> >         vpermpd $40, %ymm2, %ymm8
> >         vpermpd $125, %ymm2, %ymm2
> >         vmulpd  %ymm8, %ymm5, %ymm5
> >         vmulpd  %ymm2, %ymm0, %ymm0
> >         vfmadd132pd     %ymm3, %ymm5, %ymm1
> >         vfmadd231pd     %ymm7, %ymm4, %ymm0
> >         vaddpd  %ymm0, %ymm1, %ymm0
> >         vaddpd  %ymm0, %ymm6, %ymm6
> >         cmpq    %rcx, %rax
> >         jne     .L3
> > 
> > -ffast-math vs. non-ffast-math we're using a SLP reduction vs. 4 reduction
> > chains and this SLP reduction ends up looking like
> > 
> > t5.c:7:21: note:   Vectorizing SLP tree:
> > t5.c:7:21: note:   node 0x4100c20 (max_nunits=4, refcnt=2)
> > t5.c:7:21: note:        stmt 0 acc_imre_158 = acc_imre_3 + _34;
> > t5.c:7:21: note:        stmt 1 acc_reim_156 = acc_reim_1 + _8;
> > t5.c:7:21: note:        stmt 2 acc_imim_154 = _21 + acc_imim_35;
> > t5.c:7:21: note:        stmt 3 acc_rere_146 = _11 + acc_rere_29;
> > t5.c:7:21: note:        children 0x3f272e0 0x4100bb0
> > t5.c:7:21: note:   node 0x3f272e0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 acc_imre_3 = PHI <acc_imre_158(7), 0.0(8)>
> > t5.c:7:21: note:        stmt 1 acc_reim_1 = PHI <acc_reim_156(7), 0.0(8)>
> > t5.c:7:21: note:        stmt 2 acc_imim_35 = PHI <acc_imim_154(7), 0.0(8)>
> > t5.c:7:21: note:        stmt 3 acc_rere_29 = PHI <acc_rere_146(7), 0.0(8)>
> > t5.c:7:21: note:        children 0x4100c20
> > t5.c:7:21: note:   node 0x4100bb0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _34 = _36 + _157;
> > t5.c:7:21: note:        stmt 1 _8 = _30 + _155;
> > t5.c:7:21: note:        stmt 2 _21 = _15 + _153;
> > t5.c:7:21: note:        stmt 3 _11 = _6 + _145;
> > t5.c:7:21: note:        children 0x4100920 0x4100b40
> > t5.c:7:21: note:   node 0x4100920 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _36 = _37 + _73;
> > t5.c:7:21: note:        stmt 1 _30 = _32 + _71;
> > t5.c:7:21: note:        stmt 2 _15 = _10 + _69;
> > t5.c:7:21: note:        stmt 3 _6 = _31 + _61;
> > t5.c:7:21: note:        children 0x41004e0 0x41008b0
> > t5.c:7:21: note:   node 0x41004e0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _37 = _101 + _129;
> > t5.c:7:21: note:        stmt 1 _32 = _99 + _127;
> > t5.c:7:21: note:        stmt 2 _10 = _97 + _125;
> > t5.c:7:21: note:        stmt 3 _31 = _89 + _117;
> > t5.c:7:21: note:        children 0x3f2a550 0x3f28700
> > t5.c:7:21: note:   node 0x3f2a550 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _101 = _88 * _94;
> > t5.c:7:21: note:        stmt 1 _99 = _86 * _96;
> > t5.c:7:21: note:        stmt 2 _97 = _94 * _96;
> > t5.c:7:21: note:        stmt 3 _89 = _86 * _88;
> > t5.c:7:21: note:        children 0x40b6990 0x3f29e00
> > t5.c:7:21: note:   node 0x40b6990 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _88 = *_87;
> > t5.c:7:21: note:        stmt 1 _96 = *_95;
> > t5.c:7:21: note:        stmt 2 _96 = *_95;
> > t5.c:7:21: note:        stmt 3 _88 = *_87;
> > t5.c:7:21: note:        load permutation { 1 5 5 1 }
> > t5.c:7:21: note:   node 0x3f29e00 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _94 = *_93;
> > t5.c:7:21: note:        stmt 1 _86 = *_85;
> > t5.c:7:21: note:        stmt 2 _94 = *_93;
> > t5.c:7:21: note:        stmt 3 _86 = *_85;
> > t5.c:7:21: note:        load permutation { 5 1 5 1 }
> > t5.c:7:21: note:   node 0x3f28700 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _129 = _116 * _122;
> > t5.c:7:21: note:        stmt 1 _127 = _114 * _124;
> > t5.c:7:21: note:        stmt 2 _125 = _122 * _124;
> > t5.c:7:21: note:        stmt 3 _117 = _114 * _116;
> > t5.c:7:21: note:        children 0x3f287e0 0x3f28770
> > t5.c:7:21: note:   node 0x3f287e0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _116 = *_115;
> > t5.c:7:21: note:        stmt 1 _124 = *_123;
> > t5.c:7:21: note:        stmt 2 _124 = *_123;
> > t5.c:7:21: note:        stmt 3 _116 = *_115;
> > t5.c:7:21: note:        load permutation { 2 6 6 2 }
> > t5.c:7:21: note:   node 0x3f28770 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _122 = *_121;
> > t5.c:7:21: note:        stmt 1 _114 = *_113;
> > t5.c:7:21: note:        stmt 2 _122 = *_121;
> > t5.c:7:21: note:        stmt 3 _114 = *_113;
> > t5.c:7:21: note:        load permutation { 6 2 6 2 }
> > t5.c:7:21: note:   node 0x41008b0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _73 = _60 * _66;
> > t5.c:7:21: note:        stmt 1 _71 = _58 * _68;
> > t5.c:7:21: note:        stmt 2 _69 = _66 * _68;
> > t5.c:7:21: note:        stmt 3 _61 = _58 * _60;
> > t5.c:7:21: note:        children 0x4100290 0x4100810
> > t5.c:7:21: note:   node 0x4100290 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _60 = *_59;
> > t5.c:7:21: note:        stmt 1 _68 = *_67;
> > t5.c:7:21: note:        stmt 2 _68 = *_67;
> > t5.c:7:21: note:        stmt 3 _60 = *_59;
> > t5.c:7:21: note:        load permutation { 0 4 4 0 }
> > t5.c:7:21: note:   node 0x4100810 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _66 = *_65;
> > t5.c:7:21: note:        stmt 1 _58 = *_57;
> > t5.c:7:21: note:        stmt 2 _66 = *_65;
> > t5.c:7:21: note:        stmt 3 _58 = *_57;
> > t5.c:7:21: note:        load permutation { 4 0 4 0 }
> > t5.c:7:21: note:   node 0x4100b40 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _157 = _144 * _150;
> > t5.c:7:21: note:        stmt 1 _155 = _142 * _152;
> > t5.c:7:21: note:        stmt 2 _153 = _150 * _152;
> > t5.c:7:21: note:        stmt 3 _145 = _142 * _144;
> > t5.c:7:21: note:        children 0x4100990 0x4100a50
> > t5.c:7:21: note:   node 0x4100990 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _144 = *_143;
> > t5.c:7:21: note:        stmt 1 _152 = *_151;
> > t5.c:7:21: note:        stmt 2 _152 = *_151;
> > t5.c:7:21: note:        stmt 3 _144 = *_143;
> > t5.c:7:21: note:        load permutation { 3 7 7 3 }
> > t5.c:7:21: note:   node 0x4100a50 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _150 = *_149;
> > t5.c:7:21: note:        stmt 1 _142 = *_141;
> > t5.c:7:21: note:        stmt 2 _150 = *_149;
> > t5.c:7:21: note:        stmt 3 _142 = *_141;
> > t5.c:7:21: note:        load permutation { 7 3 7 3 }
> > 
> > which eventually shows some non-obvious permute optimization opportunities.
> > I'm currently working on a permute optimization phase, btw., but for a start
> > it only handles cases that do not help here.
> > 
> > Btw, if I use -ffast-math but disable reassociation via -fno-tree-reassoc I
> > get
> > the reduction chain variant which optimizes to
> > 
> > .L3:
> >         vmovupd 32(%rsi,%rax), %ymm6
> >         vmovupd 32(%rdx,%rax), %ymm7
> >         vmovupd (%rsi,%rax), %ymm5
> >         vfmadd231pd     (%rdx,%rax), %ymm6, %ymm0
> >         vfmadd231pd     (%rdx,%rax), %ymm5, %ymm3
> >         vfmadd231pd     (%rsi,%rax), %ymm7, %ymm1
> >         addq    $64, %rax
> >         vfmadd231pd     %ymm6, %ymm7, %ymm2
> >         cmpq    %rcx, %rax
> >         jne     .L3
> > 
> > even with GCC 10 (-Ofast -march=core-avx2 -fno-tree-reassoc). 
> 
> It's very fragile. I made a tiny (and natural for my real app) change in the
> source (see below) and the nice MSVC-like inner loop disappeared.
> 
> void cdot(double* res, const double* a, const double* b, int N)
> {
>   double acc_rere = 0;
>   double acc_imim = 0;
>   double acc_reim = 0;
>   double acc_imre = 0;
>   for (int c = 0; c < N; ++c) {
>     for (int k = 0; k < 4; ++k) {
>       acc_rere += a[c*8+k+0]*b[c*8+k+0];
>       acc_imim += a[c*8+k+4]*b[c*8+k+4];
>       acc_reim -= a[c*8+k+0]*b[c*8+k+4];
>       acc_imre += a[c*8+k+4]*b[c*8+k+0];
>     }
>   }
>   res[0] = acc_rere+acc_imim;
>   res[4] = acc_reim+acc_imre;
> }

wow, OK ... :/

> > Which means
> > the following source change helps:
> > 
> > void __attribute__((optimize("no-tree-reassoc"))) cdot(double* res, const
> > double* a, const double* b, int N)
> > {
> >   double acc_rere = 0;
> >   double acc_imim = 0;
> >   double acc_reim = 0;
> >   double acc_imre = 0;
> >   for (int c = 0; c < N; ++c) {
> >     for (int k = 0; k < 4; ++k) {
> >       acc_rere += a[c*8+k+0]*b[c*8+k+0];
> >       acc_imim += a[c*8+k+4]*b[c*8+k+4];
> >       acc_reim += a[c*8+k+0]*b[c*8+k+4];
> >       acc_imre += a[c*8+k+4]*b[c*8+k+0];
> >     }
> >   }
> >   res[0] = acc_rere+acc_imim;
> >   res[4] = acc_imre-acc_reim;
> > }
> > 
> 
> IMHO, options like "no-tree-reassoc", including and especially within
> __attribute__((optimize(""))), are for people like you.
> People like me, i.e. not compiler guys, don't consider them, not only for
> production but even for a hobby, unless the hobby is compiler research.

That's true.

> Also, several years ago I was told (not by you as Richard Biener, but by
> "you" as in "gcc maintainers", more specifically by Manuel López-Ibáñez;
> you, as Richard Biener, also took part in that discussion but appeared to
> hold a different opinion, see bug 70255) that __attribute__((optimize("")))
> can't be relied on in production. Back then we came to the conclusion that
> this statement had to be in the official docs. And indeed, since GCC 7,
> section 6.33.1 contains the following sentence:
> "The optimize attribute should be used for debugging purposes only. It is
> not suitable in production code."

Yeah, I guess from my HPC background I'd say HPC is never "production" ;)
I would indeed avoid optimize("") in, say, a github-hosted project, but
IMHO it's as valid as splitting out the relevant function and compiling
the TU with -fno-tree-reassoc - it's tuning the setup for max performance
with a specific compiler and for a specific host.
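
For instance (a hedged illustration only, reusing the flag sets from the
report plus -fno-tree-reassoc from above), something like

  gcc -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math -fno-tree-reassoc -c cdot.c

keeps the attribute out of the source while getting the same effect for that
translation unit.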

> 
> > the reduction epilogue ends up like
> > 
> >         vextractf128    $0x1, %ymm3, %xmm4
> >         vaddpd  %xmm3, %xmm4, %xmm3
> >         vunpckhpd       %xmm3, %xmm3, %xmm4
> >         vaddpd  %xmm3, %xmm4, %xmm3
> >         vextractf128    $0x1, %ymm2, %xmm4
> >         vaddpd  %xmm2, %xmm4, %xmm4
> >         vunpckhpd       %xmm4, %xmm4, %xmm2
> >         vaddpd  %xmm4, %xmm2, %xmm2
> >         vextractf128    $0x1, %ymm1, %xmm4
> >         vaddpd  %xmm1, %xmm4, %xmm4
> >         vaddsd  %xmm2, %xmm3, %xmm2
> >         vunpckhpd       %xmm4, %xmm4, %xmm1
> >         vaddpd  %xmm4, %xmm1, %xmm1
> >         vextractf128    $0x1, %ymm0, %xmm4
> >         vaddpd  %xmm0, %xmm4, %xmm4
> >         vunpckhpd       %xmm4, %xmm4, %xmm0
> >         vaddpd  %xmm4, %xmm0, %xmm0
> >         vsubsd  %xmm1, %xmm0, %xmm0
> >         vzeroupper
> >         vmovsd  %xmm2, (%rdi)
> >         vmovsd  %xmm0, 32(%rdi)
> > 
> > which is not optimal since we miss the opportunity to vectorize the
> > adds of the accumulators
> > 
> >   res[0] = acc_rere+acc_imim;
> >   res[4] = acc_imre-acc_reim;
> 
> The epilogue is a tricky matter.
> There are many ways to do it with about the same speed, and which variant is
> faster could depend on fine microarchitectural details that may differ even
> between such quite similar CPUs as Skylake and Zen2, or Skylake and Ice
> Lake, or maybe even Skylake and Skylake-X (well, maybe the last one is an
> exaggeration). The optimal sequence also depends on the surrounding code,
> e.g. if I change the source to:
>   res[0] = acc_rere+acc_imim;
>   res[1] = acc_reim-acc_imre; // results adjacent in memory
> it could already be different.
> In the latter case it would likely be
>   vaddpd  ymm_rere,ymm_imim,ymm_re
>   vsubpd  ymm_reim,ymm_imre,ymm_im
>   vhaddpd ymm_im,ymm_re,ymm_reim
>   vperm2f128 $1,    ymm_reim, ymm_reim, ymm_reimH
>   vaddpd  xmm_reimH, xmm_reim, xmm_reim
> 
> Even icc can't do it perfectly right now.
> It would be nice (for you personally, at least) if in this case gcc
> generated better code than icc, but that is far [far [far...]] less
> important than robust handling of the inner loop.

Yeah, the issue really is that with reassociation we miss the obviously
better reduction scheme and the vectorizer cannot translate between one
and the other at the moment (it doesn't itself try reassociating other than
what is implied by vectorizing).  We could try to add this capability without
too many issues, I guess - the main problem is that the reduction scheme is
decided upfront and that the decision wouldn't be based on actual costs.

Thanks for the insightful loop kernels btw!
