From mboxrd@z Thu Jan 1 00:00:00 1970
From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/39821] 120% slowdown with vectorizer
Date: Tue, 27 Jul 2021 07:24:08 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 4.4.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
List-Id: Gcc-bugs mailing list

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821

--- Comment #6 from Richard Biener ---
0x398f310 _2 * _4 1 times scalar_stmt costs 12 in body
...
0x392b3f0 _1 w* _3 2 times vec_promote_demote costs 8 in body
...
t4.c:4:12: note:  Cost model analysis:
  Vector inside of loop cost: 40
  Vector prologue cost: 4
  Vector epilogue cost: 108
  Scalar iteration cost: 40
  Scalar outside cost: 32
  Vector outside cost: 112
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 3

so clearly the widening multiplication is not costed correctly.
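
For readers without the testcase at hand: t4.c is not quoted in this comment, so the following is only a guess at the shape of loop that produces a `w*` (widening multiply) statement feeding a 64-bit sum. The function name and types are assumptions for illustration, not the reporter's source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the loop in t4.c (the real source is not
   shown in this comment).  Each 32-bit product is widened to 64 bits
   before it is added to the accumulator, which is what the vectorizer
   dumps as "_1 w* _3".  */
static int64_t
dot_widen (const int32_t *a, const int32_t *b, size_t n)
{
  assert (n == 0 || (a != NULL && b != NULL));

  int64_t sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += (int64_t) a[i] * b[i];   /* 32x32 -> 64 bit multiply */
  return sum;
}
```

The cast on one operand is what makes the multiply widening; without it the product would be computed in 32 bits and could overflow before the addition.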

With SSE 4.2 we can do better:

.L4:
        movdqu  (%rcx,%rax), %xmm0
        movdqu  (%rsi,%rax), %xmm1
        addq    $16, %rax
        movdqa  %xmm0, %xmm3
        movdqa  %xmm1, %xmm4
        punpckldq       %xmm0, %xmm3
        punpckldq       %xmm1, %xmm4
        punpckhdq       %xmm0, %xmm0
        pmuldq  %xmm4, %xmm3
        punpckhdq       %xmm1, %xmm1
        pmuldq  %xmm1, %xmm0
        paddq   %xmm3, %xmm2
        paddq   %xmm0, %xmm2
        cmpq    %rdi, %rax
        jne     .L4

but even there the costing is imprecise.  The vectorizer is unhelpful
in categorizing the widen mult as vec_promote_demote which then fails to
run into

        case MULT_EXPR:
        case WIDEN_MULT_EXPR:
        case MULT_HIGHPART_EXPR:
          stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
          break;

fixing that yields

0x392b3f0 _1 w* _3 2 times vector_stmt costs 136 in body

for both SSE2 and SSE4.2 and AVX2 so that's over-estimating cost then via

      /* V*DImode is emulated with 5-8 insns.  */
      else if (mode == V2DImode || mode == V4DImode)
        {
          if (TARGET_XOP && mode == V2DImode)
            return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 3);
          else
            return ix86_vec_cost (mode, cost->mulss * 3 + cost->sse_op * 5);
        }

with cost->mulss == 16.  I suppose it is somehow failing to realize it's
doing a widening multiply.
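
The figure of 136 can be reproduced from the quoted branch with a little arithmetic. The sketch below is not GCC code: the helper names are made up, cost->sse_op == 4 is an assumption inferred from the numbers (only cost->mulss == 16 is stated above), and ix86_vec_cost is assumed to pass the cost through unscaled for this mode.

```c
#include <assert.h>

/* mulss == 16 is taken from the comment above.  sse_op == 4 is an
   assumption, chosen because it makes the totals match the dump.  */
enum { MULSS_COST = 16, SSE_OP_COST = 4 };

/* Hypothetical helper mirroring the quoted V2DImode/V4DImode branch,
   assuming ix86_vec_cost returns its argument unchanged.  */
static int
vdi_mult_cost (int mulss, int sse_op, int xop_v2di)
{
  assert (mulss >= 0 && sse_op >= 0);
  if (xop_v2di)
    return mulss * 2 + sse_op * 3;
  return mulss * 3 + sse_op * 5;
}

/* The dump reports "N times <kind> costs C": C is N times the
   per-statement cost.  */
static int
body_cost (int count, int per_stmt)
{
  return count * per_stmt;
}
```

Under these assumptions the non-XOP path gives 16 * 3 + 4 * 5 = 68 per statement, and two statements give 136, against 8 for the same two statements when they are classed as vec_promote_demote. The SSE 4.2 loop above spends two pmuldq plus a handful of unpacks on the multiply, so 136 looks too high for that case.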