From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id 94702385BF84; Mon, 25 Oct 2021 21:44:09 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 94702385BF84 From: "peter at cordes dot ca" To: gcc-bugs@gcc.gnu.org Subject: [Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP Date: Mon, 25 Oct 2021 21:44:09 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: tree-optimization X-Bugzilla-Version: 12.0 X-Bugzilla-Keywords: missed-optimization X-Bugzilla-Severity: normal X-Bugzilla-Who: peter at cordes dot ca X-Bugzilla-Status: UNCONFIRMED X-Bugzilla-Resolution: X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: cc Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: gcc-bugs@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-bugs mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 25 Oct 2021 21:44:09 -0000 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D102494 Peter Cordes changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |peter at cordes dot ca --- Comment #10 from Peter Cordes --- Current trunk with -fopenmp is still not good https://godbolt.org/z/b3jjhcv= Ta=20 Still doing two separate sign extensions and two stores / wider reload (sto= re forwarding stall): -O3 -march=3Dskylake -fopenmp simde_vaddlv_s8: push rbp vpmovsxbw xmm2, xmm0 vpsrlq xmm0, xmm0, 32 mov rbp, rsp vpmovsxbw xmm3, xmm0 and rsp, -32 vmovq QWORD PTR [rsp-16], xmm2 vmovq QWORD PTR [rsp-8], xmm3 vmovdqa xmm4, XMMWORD PTR [rsp-16] ... then asm using byte-shifts Including stuff like movdqa xmm1, xmm0 psrldq xmm1, 4 instead of pshufd, which is an option because high garbage can be ignored. And ARM64 goes scalar. ---- Current trunk *without* -fopenmp produces decent asm https://godbolt.org/z/h1KEKPTW9 For ARM64 we've been making good asm since GCC 10.x (vs. scalar in 9.3) simde_vaddlv_s8: sxtl v0.8h, v0.8b addv h0, v0.8h umov w0, v0.h[0] ret x86-64 gcc -O3 -march=3Dskylake simde_vaddlv_s8: vpmovsxbw xmm1, xmm0 vpsrlq xmm0, xmm0, 32 vpmovsxbw xmm0, xmm0 vpaddw xmm0, xmm1, xmm0 vpsrlq xmm1, xmm0, 32 vpaddw xmm0, xmm0, xmm1 vpsrlq xmm1, xmm0, 16 vpaddw xmm0, xmm0, xmm1 vpextrw eax, xmm0, 0 ret That's pretty good, but VMOVD eax, xmm0 would be more efficient than VPE= XTRW when we don't need to avoid high garbage (because it's a return value in th= is case). VPEXTRW zero-extends into RAX, so it's not directly helpful if we n= eed to sign-extend to 32 or 64-bit for some reason; we'd still need a scalar mo= vsx. Or with BMI2, go scalar before the last shift / VPADDW step, e.g. ... vmovd eax, xmm0 rorx edx, eax, 16 add eax, edx=