From: "already5chosen at yahoo dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/97343] New: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product
Date: Thu, 08 Oct 2020 23:09:36 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343

            Bug ID: 97343
           Summary: AVX2 vectorizer generates extremely strange and slow
                    code for AoSoA complex dot product
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: already5chosen at yahoo dot com
  Target Milestone: ---

Let's continue the complex dot product series started here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96854

This time I have no code generation bugs for your pleasure, just "interesting"
optimization issues.

All examples below, unless stated otherwise, were compiled with gcc 10.2 for
x86-64 with the following sets of flags:

set1: -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math -fno-associative-math
set2: -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math

The kernel in question is an example of a complex dot product in the so-called
"hybrid AoS" layout, a.k.a. AoSoA:
https://en.wikipedia.org/wiki/AoS_and_SoA#Array_of_Structures_of_Arrays

In my experience, in dense complex linear algebra and similar computational
fields, it is quite rare for AoSoA *not* to be the optimal internal form.
So, practically, I consider these kernels more important than the AoS kernel
presented in bug 96854.

More specifically, the layout can be described as struct { double re[4], im[4]; },
but for the sake of simplicity I omitted the type definition from the code
examples and coded it directly over flat arrays of doubles.

Part 1.

void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re = 0;
  double acc_im = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_re = acc_re + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
      acc_im = acc_im - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_re;
  res[4] = acc_im;
}

That's how we would want to code it in an ideal world, letting the compiler
take care of the dirty details.

In the less ideal world we live in, gcc is not the only compiler that can't
cope with it. MSVC (-W4 -O2 -fp:fast -arch:AVX2) also can't vectorize it.
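For reference, here is roughly the kind of inner loop one would hope to see,
written by hand with AVX2 intrinsics. This is only a minimal sketch with my own
naming, not the output of any of the compilers discussed here: four vector
accumulators, plain unpermuted 256-bit loads, one FMA per product, and a
horizontal reduction only after the loop.

#include <immintrin.h>

/* Hand-written sketch (my own naming), assuming the AoSoA layout
   { double re[4]; double im[4]; } flattened into groups of 8 doubles. */
void cdot_sketch(double* res, const double* a, const double* b, int N)
{
  __m256d acc_rere = _mm256_setzero_pd();
  __m256d acc_imim = _mm256_setzero_pd();
  __m256d acc_reim = _mm256_setzero_pd();
  __m256d acc_imre = _mm256_setzero_pd();
  for (int c = 0; c < N; ++c) {
    __m256d a_re = _mm256_loadu_pd(&a[c*8+0]);
    __m256d a_im = _mm256_loadu_pd(&a[c*8+4]);
    __m256d b_re = _mm256_loadu_pd(&b[c*8+0]);
    __m256d b_im = _mm256_loadu_pd(&b[c*8+4]);
    acc_rere = _mm256_fmadd_pd(a_re, b_re, acc_rere);
    acc_imim = _mm256_fmadd_pd(a_im, b_im, acc_imim);
    acc_reim = _mm256_fmadd_pd(a_re, b_im, acc_reim);
    acc_imre = _mm256_fmadd_pd(a_im, b_re, acc_imre);
  }
  __m256d acc_re = _mm256_add_pd(acc_rere, acc_imim);   /* re*re + im*im */
  __m256d acc_im = _mm256_sub_pd(acc_imre, acc_reim);   /* im*re - re*im */
  /* horizontal sums of the four lanes */
  __m128d re2 = _mm_add_pd(_mm256_castpd256_pd128(acc_re),
                           _mm256_extractf128_pd(acc_re, 1));
  __m128d im2 = _mm_add_pd(_mm256_castpd256_pd128(acc_im),
                           _mm256_extractf128_pd(acc_im, 1));
  res[0] = _mm_cvtsd_f64(_mm_add_sd(re2, _mm_unpackhi_pd(re2, re2)));
  res[4] = _mm_cvtsd_f64(_mm_add_sd(im2, _mm_unpackhi_pd(im2, im2)));
}

Again, just a sketch of the obvious lowering, not a tuned implementation.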
Even mighty icc generates code that is not quite bad, but somewhat suboptimal.
So let's let that pass. I don't want to blame gcc for not being smart enough;
that's just normal. Except that with set2 the code generated by gcc becomes not
merely non-smart, but quite crazy. I am ignoring it in the hope that it will be
magically fixed by the change Richard Biener made on 2020-08-31.

Part 2.

void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
      acc_reim += a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_imre-acc_reim;
}

Here we explain it to the compiler slowly. For icc and MSVC that's enough; they
understood.
icc generates near-perfect code. I could write it more nicely, but I would not
expect my variant to be any faster.
MSVC generates a near-perfect inner loop and an epilogue that is not great, but
not really much slower.
gcc still doesn't get it. It still tries to implement the 4 accumulators
literally, as if -ffast-math were not there. But, sad as it is, that's still a
case of not being smart enough, so I am not complaining.

Part 3.

static inline double sum4(double x[]) {
  return x[0]+x[1]+x[2]+x[3];
}

void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re[4] = {0};
  double acc_im[4] = {0};
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_re[k] = acc_re[k] + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
      acc_im[k] = acc_im[k] - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = sum4(acc_re);
  res[4] = sum4(acc_im);
}

An attempt to feed the compiler by the teaspoon. That's not the way I want to
write code in an HLL.
icc copes, producing about the same code as in Part 1.
MSVC doesn't understand such a Kunststück (I am sympathetic) and generates
literal scalar code with the local arrays on the stack.
gcc with set1 is a little better than MSVC - the code is fully scalar, but at
least the accumulators are kept in registers.
gcc with set2 is the most interesting. It vectorizes, but how? Here is the
inner loop:

.L3:
        vpermpd $27, (%r8,%rax), %ymm2
        vpermpd $27, 32(%rdx,%rax), %ymm3
        vpermpd $27, (%rdx,%rax), %ymm1
        vpermpd $27, 32(%r8,%rax), %ymm0
        vmulpd %ymm2, %ymm1, %ymm6
        vmulpd %ymm2, %ymm3, %ymm2
        addq $64, %rax
        vfnmadd132pd %ymm0, %ymm2, %ymm1
        vfmadd132pd %ymm3, %ymm6, %ymm0
        vaddpd %ymm1, %ymm5, %ymm5
        vaddpd %ymm0, %ymm4, %ymm4
        cmpq %rcx, %rax
        jne .L3

What is all this vpermpd business about? Shuffling SIMD lanes around just
because it's funny?

That's the first thing I do want to complain about. Not "not smart enough", but
too smart for its own good.
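A side note on that immediate: $27 is 0x1B, which for vpermpd means "take lanes
3,2,1,0", i.e. a full reversal of the four doubles in the register. A tiny
stand-alone illustration with the corresponding intrinsic (my own example, not
taken from the test case):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
  /* lanes: v[0]=0.0, v[1]=1.0, v[2]=2.0, v[3]=3.0 (_mm256_set_pd lists high to low) */
  __m256d v = _mm256_set_pd(3.0, 2.0, 1.0, 0.0);
  /* vpermpd $27 == _mm256_permute4x64_pd(v, 0x1B): selects lanes 3,2,1,0 */
  __m256d r = _mm256_permute4x64_pd(v, 0x1B);
  double out[4];
  _mm256_storeu_pd(out, r);
  printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); /* prints: 3 2 1 0 */
  return 0;
}

As far as I can tell, both the a and the b operands of every product are
reversed in the same way before being multiplied and summed into reduction
accumulators, so the reversal cannot affect the final sums; the shuffles look
like pure overhead.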
And finally

Part 4.

static inline double sum4(double x[]) {
  return x[0]+x[1]+x[2]+x[3];
}

void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere[4] = {0};
  double acc_imim[4] = {0};
  double acc_reim[4] = {0};
  double acc_imre[4] = {0};
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere[k] += a[c*8+k+0]*b[c*8+k+0];
      acc_imim[k] += a[c*8+k+4]*b[c*8+k+4];
      acc_reim[k] += a[c*8+k+0]*b[c*8+k+4];
      acc_imre[k] += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  double acc_re[4];
  double acc_im[4];
  for (int k = 0; k < 4; ++k) {
    acc_re[k] = acc_rere[k]+acc_imim[k];
    acc_im[k] = acc_imre[k]-acc_reim[k];
  }
  res[0] = sum4(acc_re);
  res[4] = sum4(acc_im);
}

Not just fed by the teaspoon - the compiler's mouth is held open manually, so to
speak.

icc, of course, understands and generates pretty much the same good code as in
Part 2.
MSVC, of course, does not understand and generates the arrays on the stack.
gcc with set2, of course, continues to enjoy the juggling, doubling or tripling
up versus last time's performance. Inner loop:

.L3:
        vmovupd (%rdx,%rax), %ymm1
        vmovupd 32(%rdx,%rax), %ymm0
        vmovupd 32(%r8,%rax), %ymm5
        vperm2f128 $49, %ymm1, %ymm0, %ymm3
        vinsertf128 $1, %xmm1, %ymm0, %ymm0
        vpermpd $221, %ymm0, %ymm10
        vpermpd $136, %ymm0, %ymm1
        vmovupd (%r8,%rax), %ymm0
        vpermpd $136, %ymm3, %ymm9
        vperm2f128 $49, %ymm5, %ymm0, %ymm2
        vinsertf128 $1, %xmm5, %ymm0, %ymm0
        vpermpd $40, %ymm2, %ymm11
        vpermpd $125, %ymm0, %ymm5
        vpermpd $221, %ymm3, %ymm3
        vpermpd $40, %ymm0, %ymm0
        vpermpd $125, %ymm2, %ymm2
        addq $64, %rax
        vfmadd231pd %ymm2, %ymm3, %ymm8
        vfmadd231pd %ymm11, %ymm9, %ymm6
        vfmadd231pd %ymm5, %ymm10, %ymm7
        vfmadd231pd %ymm0, %ymm1, %ymm4
        cmpq %rax, %rcx
        jne .L3

But this time gcc with set1 was the real star of the show. My only reaction is
"What?"
.L4:
        vmovupd 0(%r13), %ymm5
        vmovupd 64(%r13), %ymm7
        vmovupd 192(%r13), %ymm4
        vmovupd 128(%r13), %ymm6
        vunpcklpd 32(%r13), %ymm5, %ymm13
        vunpckhpd 32(%r13), %ymm5, %ymm12
        vunpckhpd 96(%r13), %ymm7, %ymm1
        vunpcklpd 96(%r13), %ymm7, %ymm5
        vmovupd 128(%r13), %ymm7
        vunpcklpd 224(%r13), %ymm4, %ymm2
        vunpcklpd 160(%r13), %ymm6, %ymm6
        vunpckhpd 160(%r13), %ymm7, %ymm11
        vunpckhpd 224(%r13), %ymm4, %ymm0
        vpermpd $216, %ymm13, %ymm13
        vpermpd $216, %ymm6, %ymm6
        vpermpd $216, %ymm2, %ymm2
        vpermpd $216, %ymm5, %ymm5
        vunpcklpd %ymm2, %ymm6, %ymm4
        vpermpd $216, %ymm1, %ymm1
        vpermpd $216, %ymm11, %ymm11
        vunpcklpd %ymm5, %ymm13, %ymm9
        vpermpd $216, %ymm12, %ymm12
        vpermpd $216, %ymm0, %ymm0
        vpermpd $216, %ymm4, %ymm3
        vpermpd $216, %ymm9, %ymm9
        vunpckhpd %ymm2, %ymm6, %ymm4
        vunpckhpd %ymm5, %ymm13, %ymm5
        vunpcklpd %ymm1, %ymm12, %ymm6
        vunpcklpd %ymm0, %ymm11, %ymm2
        vunpckhpd %ymm1, %ymm12, %ymm12
        vunpckhpd %ymm0, %ymm11, %ymm0
        vpermpd $216, %ymm12, %ymm1
        vunpcklpd %ymm3, %ymm9, %ymm11
        vpermpd $216, %ymm5, %ymm5
        vpermpd $216, %ymm4, %ymm4
        vpermpd $216, %ymm0, %ymm0
        vmovupd 64(%r12), %ymm15
        vpermpd $216, %ymm6, %ymm6
        vpermpd $216, %ymm11, %ymm8
        vpermpd $216, %ymm2, %ymm2
        vunpcklpd %ymm4, %ymm5, %ymm11
        vunpckhpd %ymm3, %ymm9, %ymm9
        vunpckhpd %ymm4, %ymm5, %ymm4
        vunpcklpd %ymm0, %ymm1, %ymm5
        vpermpd $216, %ymm9, %ymm3
        vunpcklpd %ymm2, %ymm6, %ymm9
        vunpckhpd %ymm2, %ymm6, %ymm2
        vpermpd $216, %ymm5, %ymm6
        vunpcklpd 96(%r12), %ymm15, %ymm12
        vunpckhpd %ymm0, %ymm1, %ymm0
        vmovupd %ymm6, 64(%rsp)
        vunpckhpd 96(%r12), %ymm15, %ymm6
        vmovupd 128(%r12), %ymm15
        vpermpd $216, %ymm0, %ymm5
        vpermpd $216, %ymm9, %ymm7
        vmovupd (%r12), %ymm0
        vunpckhpd 160(%r12), %ymm15, %ymm9
        vmovupd %ymm5, 96(%rsp)
        vunpcklpd 160(%r12), %ymm15, %ymm5
        vmovupd 192(%r12), %ymm15
        vunpcklpd 32(%r12), %ymm0, %ymm1
        vpermpd $216, %ymm9, %ymm14
        vunpcklpd 224(%r12), %ymm15, %ymm9
        vunpckhpd 224(%r12), %ymm15, %ymm13
        vunpckhpd 32(%r12), %ymm0, %ymm0
        vpermpd $216, %ymm12, %ymm12
        vpermpd $216, %ymm9, %ymm9
        vpermpd $216, %ymm1, %ymm1
        vpermpd $216, %ymm5, %ymm5
        vpermpd $216, %ymm6, %ymm6
        vunpcklpd %ymm12, %ymm1, %ymm10
        vpermpd $216, %ymm0, %ymm0
        vpermpd $216, %ymm13, %ymm13
        vunpckhpd %ymm12, %ymm1, %ymm1
        vunpcklpd %ymm9, %ymm5, %ymm12
        vpermpd $216, %ymm12, %ymm12
        vpermpd $216, %ymm10, %ymm10
        vunpckhpd %ymm9, %ymm5, %ymm5
        vunpcklpd %ymm6, %ymm0, %ymm9
        vunpckhpd %ymm6, %ymm0, %ymm0
        vunpcklpd %ymm13, %ymm14, %ymm6
        vunpckhpd %ymm13, %ymm14, %ymm13
        vpermpd $216, %ymm13, %ymm14
        vunpcklpd %ymm12, %ymm10, %ymm13
        vpermpd $216, %ymm13, %ymm13
        vmulpd %ymm13, %ymm8, %ymm15
        vpermpd $216, %ymm5, %ymm5
        vpermpd $216, %ymm6, %ymm6
        vpermpd $216, %ymm1, %ymm1
        vpermpd $216, %ymm9, %ymm9
        vpermpd $216, %ymm0, %ymm0
        vunpckhpd %ymm12, %ymm10, %ymm10
        vunpcklpd %ymm6, %ymm9, %ymm12
        vunpckhpd %ymm6, %ymm9, %ymm9
        vunpcklpd %ymm5, %ymm1, %ymm6
        vunpckhpd %ymm5, %ymm1, %ymm1
        vunpcklpd %ymm14, %ymm0, %ymm5
        vunpckhpd %ymm14, %ymm0, %ymm0
        vpermpd $216, %ymm0, %ymm0
        vmovupd %ymm0, 160(%rsp)
        vmovq %r9, %xmm0
        vaddsd %xmm15, %xmm0, %xmm0
        vunpckhpd %xmm15, %xmm15, %xmm14
        vpermpd $216, %ymm10, %ymm10
        vaddsd %xmm14, %xmm0, %xmm0
        vextractf128 $0x1, %ymm15, %xmm14
        vmulpd %ymm10, %ymm8, %ymm8
        vaddsd %xmm14, %xmm0, %xmm15
        vunpckhpd %xmm14, %xmm14, %xmm14
        vpermpd $216, %ymm12, %ymm12
        vaddsd %xmm14, %xmm15, %xmm0
        vmulpd %ymm10, %ymm3, %ymm15
        vunpckhpd %xmm8, %xmm8, %xmm10
        vmovq %xmm0, %r9
        vmovq %rcx, %xmm0
        vmulpd %ymm13, %ymm3, %ymm3
        vaddsd %xmm15, %xmm0, %xmm0
        vunpckhpd %xmm15, %xmm15, %xmm14
        vextractf128 $0x1, %ymm15, %xmm15
        vaddsd %xmm14, %xmm0, %xmm14
        vpermpd $216, %ymm1, %ymm1
        vmovupd %ymm1, 128(%rsp)
        vaddsd %xmm15, %xmm14, %xmm14
        vunpckhpd %xmm15, %xmm15, %xmm15
        vpermpd $216, %ymm2, %ymm2
        vaddsd %xmm15, %xmm14, %xmm0
        vmovsd 56(%rsp), %xmm14
        vpermpd $216, %ymm9, %ymm9
        vaddsd %xmm8, %xmm14, %xmm14
        vextractf128 $0x1, %ymm8, %xmm8
        vmovq %xmm0, %rcx
        vaddsd %xmm10, %xmm14, %xmm10
        vpermpd $216, %ymm6, %ymm6
        vpermpd $216, %ymm11, %ymm11
        vaddsd %xmm8, %xmm10, %xmm10
        vunpckhpd %xmm8, %xmm8, %xmm8
        vpermpd $216, %ymm4, %ymm4
        vaddsd %xmm8, %xmm10, %xmm0
        vmovsd 48(%rsp), %xmm10
        vunpckhpd %xmm3, %xmm3, %xmm8
        vaddsd %xmm3, %xmm10, %xmm10
        vextractf128 $0x1, %ymm3, %xmm3
        vmovsd %xmm0, 56(%rsp)
        vaddsd %xmm8, %xmm10, %xmm8
        vmulpd %ymm12, %ymm7, %ymm10
        vmulpd %ymm9, %ymm7, %ymm7
        vaddsd %xmm3, %xmm8, %xmm8
        vunpckhpd %xmm3, %xmm3, %xmm3
        vpermpd $216, %ymm5, %ymm5
        vaddsd %xmm3, %xmm8, %xmm0
        vunpckhpd %xmm10, %xmm10, %xmm3
        addq $256, %r12
        vmovsd %xmm0, 48(%rsp)
        vmovq %rdi, %xmm0
        vaddsd %xmm10, %xmm0, %xmm8
        vextractf128 $0x1, %ymm10, %xmm10
        vmovq %rbx, %xmm0
        vaddsd %xmm3, %xmm8, %xmm3
        vmulpd %ymm9, %ymm2, %ymm8
        vmulpd %ymm12, %ymm2, %ymm2
        vaddsd %xmm10, %xmm3, %xmm3
        vunpckhpd %xmm10, %xmm10, %xmm10
        addq $256, %r13
        vaddsd %xmm10, %xmm3, %xmm1
        vaddsd %xmm8, %xmm0, %xmm10
        vunpckhpd %xmm8, %xmm8, %xmm3
        vextractf128 $0x1, %ymm8, %xmm8
        vaddsd %xmm3, %xmm10, %xmm3
        vmovq %xmm1, %rdi
        vmovq %r11, %xmm1
        vaddsd %xmm8, %xmm3, %xmm3
        vunpckhpd %xmm8, %xmm8, %xmm8
        vmovq %r10, %xmm0
        vaddsd %xmm8, %xmm3, %xmm3
        vmovsd 40(%rsp), %xmm8
        vaddsd %xmm7, %xmm8, %xmm8
        vmovq %xmm3, %rbx
        vunpckhpd %xmm7, %xmm7, %xmm3
        vaddsd %xmm3, %xmm8, %xmm3
        vextractf128 $0x1, %ymm7, %xmm7
        vaddsd %xmm7, %xmm3, %xmm3
        vunpckhpd %xmm7, %xmm7, %xmm7
        vaddsd %xmm7, %xmm3, %xmm3
        vmovsd 32(%rsp), %xmm7
        vaddsd %xmm2, %xmm7, %xmm7
        vmovsd %xmm3, 40(%rsp)
        vunpckhpd %xmm2, %xmm2, %xmm3
        vaddsd %xmm3, %xmm7, %xmm3
        vextractf128 $0x1, %ymm2, %xmm2
        vmulpd %ymm6, %ymm11, %ymm7
        vaddsd %xmm2, %xmm3, %xmm3
        vunpckhpd %xmm2, %xmm2, %xmm2
        vaddsd %xmm2, %xmm3, %xmm2
        vaddsd %xmm7, %xmm1, %xmm3
        vmovupd 128(%rsp), %ymm1
        vmovsd %xmm2, 32(%rsp)
        vunpckhpd %xmm7, %xmm7, %xmm2
        vaddsd %xmm2, %xmm3, %xmm2
        vextractf128 $0x1, %ymm7, %xmm7
        vmulpd %ymm1, %ymm4, %ymm3
        vaddsd %xmm7, %xmm2, %xmm2
        vunpckhpd %xmm7, %xmm7, %xmm7
        vmulpd %ymm1, %ymm11, %ymm1
        vaddsd %xmm7, %xmm2, %xmm2
        vaddsd %xmm3, %xmm0, %xmm7
        vmulpd %ymm6, %ymm4, %ymm4
        vmovq %xmm2, %r11
        vunpckhpd %xmm3, %xmm3, %xmm2
        vaddsd %xmm2, %xmm7, %xmm2
        vextractf128 $0x1, %ymm3, %xmm3
        vmovupd 64(%rsp), %ymm6
        vaddsd %xmm3, %xmm2, %xmm2
        vunpckhpd %xmm3, %xmm3, %xmm3
        vmovupd 96(%rsp), %ymm7
        vaddsd %xmm3, %xmm2, %xmm2
        vmovsd 24(%rsp), %xmm3
        vmovupd 160(%rsp), %ymm0
        vaddsd %xmm1, %xmm3, %xmm3
        vmovq %xmm2, %r10
        vunpckhpd %xmm1, %xmm1, %xmm2
        vaddsd %xmm2, %xmm3, %xmm2
        vextractf128 $0x1, %ymm1, %xmm1
        vmovq %rbp, %xmm3
        vaddsd %xmm1, %xmm2, %xmm2
        vunpckhpd %xmm1, %xmm1, %xmm1
        vaddsd %xmm1, %xmm2, %xmm2
        vunpckhpd %xmm4, %xmm4, %xmm1
        vmovsd %xmm2, 24(%rsp)
        vmovsd 16(%rsp), %xmm2
        vaddsd %xmm4, %xmm2, %xmm2
        vextractf128 $0x1, %ymm4, %xmm4
        vaddsd %xmm1, %xmm2, %xmm1
        vaddsd %xmm4, %xmm1, %xmm1
        vunpckhpd %xmm4, %xmm4, %xmm4
        vaddsd %xmm4, %xmm1, %xmm4
        vmovsd %xmm4, 16(%rsp)
        vmulpd %ymm6, %ymm5, %ymm4
        vmulpd %ymm7, %ymm5, %ymm5
        vaddsd %xmm4, %xmm3, %xmm1
        vunpckhpd %xmm4, %xmm4, %xmm2
        vmovq %rsi, %xmm3
        vaddsd %xmm2, %xmm1, %xmm2
        vextractf128 $0x1, %ymm4, %xmm1
        vaddsd %xmm1, %xmm2, %xmm2
        vunpckhpd %xmm1, %xmm1, %xmm1
        vaddsd %xmm1, %xmm2, %xmm4
        vmovq %xmm4, %rbp
        vmulpd %ymm0, %ymm7, %ymm4
        vmulpd %ymm0, %ymm6, %ymm0
        vaddsd %xmm4, %xmm3, %xmm1
        vunpckhpd %xmm4, %xmm4, %xmm2
        vaddsd %xmm2, %xmm1, %xmm2
        vextractf128 $0x1, %ymm4, %xmm1
        vaddsd %xmm1, %xmm2, %xmm2
        vunpckhpd %xmm1, %xmm1, %xmm1
        vaddsd %xmm1, %xmm2, %xmm4
        vmovsd 8(%rsp), %xmm2
        vunpckhpd %xmm0, %xmm0, %xmm1
        vaddsd %xmm0, %xmm2, %xmm2
        vextractf128 $0x1, %ymm0, %xmm0
        vmovq %xmm4, %rsi
        vaddsd %xmm1, %xmm2, %xmm1
        vaddsd %xmm0, %xmm1, %xmm1
        vunpckhpd %xmm0, %xmm0, %xmm0
        vaddsd %xmm0, %xmm1, %xmm6
        vmovsd (%rsp), %xmm1
        vunpckhpd %xmm5, %xmm5, %xmm0
        vaddsd %xmm5, %xmm1, %xmm1
        vextractf128 $0x1, %ymm5, %xmm5
        vmovsd %xmm6, 8(%rsp)
        vaddsd %xmm0, %xmm1, %xmm0
        vaddsd %xmm5, %xmm0, %xmm0
        vunpckhpd %xmm5, %xmm5, %xmm5
        vaddsd %xmm5, %xmm0, %xmm5
        vmovsd %xmm5, (%rsp)
        cmpq %rax, %r12
        jne .L4
        movl %r15d, %r12d
        andl $-4, %r12d
        movl %r12d, %edx
        cmpl %r12d, %r15d
        je .L5
.L3:
        movl %r15d, %eax
        subl %r12d, %eax
        cmpl $1, %eax
        je .L6
        salq $6, %r12
        leaq (%r14,%r12), %r13
        vmovupd 16(%r13), %xmm3
        vmovupd 48(%r13), %xmm0
        vmovupd 64(%r13), %xmm8
        vmovupd 112(%r13), %xmm10
        vmovupd 0(%r13), %xmm4
        vmovupd 32(%r13), %xmm2
        vmovupd 80(%r13), %xmm6
        vmovupd 96(%r13), %xmm1
        vunpcklpd %xmm3, %xmm4, %xmm5
        vunpckhpd %xmm3, %xmm4, %xmm4
        vunpcklpd %xmm0, %xmm2, %xmm3
        vunpckhpd %xmm0, %xmm2, %xmm2
        vunpcklpd %xmm6, %xmm8, %xmm0
        vunpckhpd %xmm6, %xmm8, %xmm6
        vunpcklpd %xmm10, %xmm1, %xmm8
        vunpckhpd %xmm10, %xmm1, %xmm1
        vunpcklpd %xmm3, %xmm5, %xmm11
        vunpcklpd %xmm2, %xmm4, %xmm10
        vunpckhpd %xmm3, %xmm5, %xmm3
        vunpckhpd %xmm2, %xmm4, %xmm2
        vunpcklpd %xmm8, %xmm0, %xmm5
        vunpcklpd %xmm1, %xmm6, %xmm4
        vunpckhpd %xmm8, %xmm0, %xmm0
        vunpckhpd %xmm1, %xmm6, %xmm1
        addq %r8, %r12
        vunpcklpd %xmm5, %xmm11, %xmm8
        vunpckhpd %xmm0, %xmm3, %xmm7
        vunpckhpd %xmm5, %xmm11, %xmm11
        vunpckhpd %xmm1, %xmm2, %xmm5
        vmovupd 64(%r12), %xmm12
        vunpcklpd %xmm1, %xmm2, %xmm6
        vmovupd 80(%r12), %xmm9
        vmovupd 48(%r12), %xmm1
        vmovupd 96(%r12), %xmm2
        vunpcklpd %xmm4, %xmm10, %xmm14
        vunpcklpd %xmm0, %xmm3, %xmm13
        vunpckhpd %xmm4, %xmm10, %xmm10
        vmovupd 32(%r12), %xmm3
        vmovupd 16(%r12), %xmm4
        vmovapd %xmm7, 64(%rsp)
        vmovapd %xmm5, 96(%rsp)
        vmovupd 112(%r12), %xmm7
        vmovupd (%r12), %xmm5
        movl %eax, %r12d
        vunpcklpd %xmm4, %xmm5, %xmm15
        vunpckhpd %xmm4, %xmm5, %xmm5
        vunpcklpd %xmm1, %xmm3, %xmm4
        vunpckhpd %xmm1, %xmm3, %xmm3
        vunpcklpd %xmm9, %xmm12, %xmm1
        vunpckhpd %xmm9, %xmm12, %xmm9
        vunpcklpd %xmm7, %xmm2, %xmm12
        vunpckhpd %xmm7, %xmm2, %xmm2
        vunpcklpd %xmm4, %xmm15, %xmm7
        vunpckhpd %xmm4, %xmm15, %xmm15
        vunpcklpd %xmm12, %xmm1, %xmm4
        vunpckhpd %xmm12, %xmm1, %xmm1
        vunpcklpd %xmm3, %xmm5, %xmm12
        vunpckhpd %xmm3, %xmm5, %xmm5
        vunpcklpd %xmm2, %xmm9, %xmm3
        vunpckhpd %xmm2, %xmm9, %xmm2
        vunpcklpd %xmm4, %xmm7, %xmm9
        vunpckhpd %xmm1, %xmm15, %xmm0
        vunpckhpd %xmm4, %xmm7, %xmm4
        vunpcklpd %xmm3, %xmm12, %xmm7
        vunpckhpd %xmm3, %xmm12, %xmm3
        vunpcklpd %xmm1, %xmm15, %xmm12
        vunpcklpd %xmm2, %xmm5, %xmm15
        vunpckhpd %xmm2, %xmm5, %xmm2
        vmulpd %xmm9, %xmm8, %xmm5
        vmovapd %xmm0, 128(%rsp)
        vmovq %r9, %xmm0
        andl $-2, %r12d
        addl %r12d, %edx
        vaddsd %xmm5, %xmm0, %xmm0
        vunpckhpd %xmm5, %xmm5, %xmm5
        vaddsd %xmm5, %xmm0, %xmm1
        vmulpd %xmm4, %xmm11, %xmm5
        vmulpd %xmm4, %xmm8, %xmm4
        vmovq %xmm1, %r9
        vmovq %rcx, %xmm1
        vmulpd %xmm9, %xmm11, %xmm11
        vaddsd %xmm5, %xmm1, %xmm1
        vunpckhpd %xmm5, %xmm5, %xmm5
        vmulpd %xmm7, %xmm14, %xmm9
        vaddsd %xmm5, %xmm1, %xmm1
        vmovsd 56(%rsp), %xmm5
        vmulpd %xmm3, %xmm10, %xmm8
        vaddsd %xmm4, %xmm5, %xmm5
        vunpckhpd %xmm4, %xmm4, %xmm4
        vmovq %xmm1, %rcx
        vaddsd %xmm4, %xmm5, %xmm4
        vmovq %rdi, %xmm1
        vmulpd %xmm3, %xmm14, %xmm14
        vmovsd %xmm4, 56(%rsp)
        vmovsd 48(%rsp), %xmm4
        vmovq %rbx, %xmm0
        vaddsd %xmm11, %xmm4, %xmm4
        vunpckhpd %xmm11, %xmm11, %xmm11
        vmovsd 40(%rsp), %xmm3
        vaddsd %xmm11, %xmm4, %xmm4
        vmulpd %xmm7, %xmm10, %xmm10
        vaddsd %xmm14, %xmm3, %xmm3
        vmovsd %xmm4, 48(%rsp)
        vaddsd %xmm9, %xmm1, %xmm4
        vunpckhpd %xmm9, %xmm9, %xmm9
        vunpckhpd %xmm14, %xmm14, %xmm14
        vaddsd %xmm9, %xmm4, %xmm4
        vmovapd 128(%rsp), %xmm5
        vmovapd 64(%rsp), %xmm11
        vmovq %xmm4, %rdi
        vaddsd %xmm8, %xmm0, %xmm4
        vunpckhpd %xmm8, %xmm8, %xmm8
        vmovsd 24(%rsp), %xmm1
        vaddsd %xmm8, %xmm4, %xmm4
        vmovsd 16(%rsp), %xmm0
        vmovq %xmm4, %rbx
        vaddsd %xmm14, %xmm3, %xmm4
        vmovsd 32(%rsp), %xmm3
        vaddsd %xmm10, %xmm3, %xmm3
        vunpckhpd %xmm10, %xmm10, %xmm10
        vmovsd %xmm4, 40(%rsp)
        vaddsd %xmm10, %xmm3, %xmm7
        vmulpd %xmm12, %xmm13, %xmm3
        vmulpd %xmm5, %xmm13, %xmm13
        vmovsd %xmm7, 32(%rsp)
        vmovq %r11, %xmm7
        vmulpd %xmm11, %xmm12, %xmm12
        vaddsd %xmm3, %xmm7, %xmm4
        vunpckhpd %xmm3, %xmm3, %xmm3
        vaddsd %xmm13, %xmm1, %xmm1
        vaddsd %xmm3, %xmm4, %xmm7
        vmulpd %xmm5, %xmm11, %xmm3
        vunpckhpd %xmm13, %xmm13, %xmm13
        vmovq %xmm7, %r11
        vmovq %r10, %xmm7
        vaddsd %xmm12, %xmm0, %xmm0
        vaddsd %xmm3, %xmm7, %xmm4
        vunpckhpd %xmm3, %xmm3, %xmm3
        vunpckhpd %xmm12, %xmm12, %xmm12
        vaddsd %xmm3, %xmm4, %xmm7
        vaddsd %xmm13, %xmm1, %xmm4
        vmovq %xmm7, %r10
        vmovsd %xmm4, 24(%rsp)
        vaddsd %xmm12, %xmm0, %xmm4
        vmulpd %xmm15, %xmm6, %xmm0
        vmovq %rbp, %xmm7
        vmovsd %xmm4, 16(%rsp)
        vmovapd 96(%rsp), %xmm5
        vaddsd %xmm0, %xmm7, %xmm1
        vunpckhpd %xmm0, %xmm0, %xmm0
        vmovq %rsi, %xmm7
        vaddsd %xmm0, %xmm1, %xmm4
        vmulpd %xmm5, %xmm2, %xmm0
        vmulpd %xmm2, %xmm6, %xmm2
        vmovq %xmm4, %rbp
        vmulpd %xmm5, %xmm15, %xmm15
        vaddsd %xmm0, %xmm7, %xmm1
        vunpckhpd %xmm0, %xmm0, %xmm0
        vaddsd %xmm0, %xmm1, %xmm4
        vmovsd 8(%rsp), %xmm0
        vaddsd %xmm2, %xmm0, %xmm0
        vunpckhpd %xmm2, %xmm2, %xmm2
        vmovq %xmm4, %rsi
        vaddsd %xmm2, %xmm0, %xmm6
        vmovsd (%rsp), %xmm0
        vaddsd %xmm15, %xmm0, %xmm0
        vunpckhpd %xmm15, %xmm15, %xmm15
        vmovsd %xmm6, 8(%rsp)
        vaddsd %xmm15, %xmm0, %xmm5
        vmovsd %xmm5, (%rsp)
        cmpl %r12d, %eax
        je .L5