From: "pinskia at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/94908] Failure to optimally optimize certain shuffle patterns
Date: Fri, 17 Feb 2023 21:05:29 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 10.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Changed-Fields: bug_severity see_also component

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
           See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53346,
                   |                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93720
          Component|tree-optimization           |target

--- Comment #4 from Andrew Pinski ---
I think this was a target issue and maybe should be split into a couple different
bugs.

For GCC 8, aarch64 produces:
        dup     v0.4s, v0.s[1]
        ldr     q1, [sp, 16]
        ldp     x29, x30, [sp], 32
        ins     v0.s[1], v1.s[1]
        ins     v0.s[2], v1.s[2]
        ins     v0.s[3], v1.s[3]

GCC 9/10 produced the following (which is ok, though it could be improved,
which happened in GCC 11):
        adrp    x0, .LC0
        ldr     q1, [sp, 16]
        ldr     q2, [x0, #:lo12:.LC0]
        ldp     x29, x30, [sp], 32
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b

For GCC 11+, aarch64 produces:
        ldr     q1, [sp, 16]
        ins     v1.s[0], v0.s[1]
        mov     v0.16b, v1.16b

Which means for aarch64, this was changed in GCC 10 and fixed fully for GCC 11
(by r11-2192-gc9c87e6f9c795b aka PR 93720, which was my patch in fact).

For x86_64, the trunk produces:
        movaps  (%rsp), %xmm1
        addq    $24, %rsp
        shufps  $85, %xmm1, %xmm0
        shufps  $232, %xmm1, %xmm0

while GCC 12 produces:
        movaps  (%rsp), %xmm1
        addq    $24, %rsp
        shufps  $85, %xmm0, %xmm0
        movaps  %xmm1, %xmm2
        shufps  $85, %xmm1, %xmm2
        movaps  %xmm2, %xmm3
        movaps  %xmm1, %xmm2
        unpckhps        %xmm1, %xmm2
        unpcklps        %xmm3, %xmm0
        shufps  $255, %xmm1, %xmm1
        unpcklps        %xmm1, %xmm2
        movlhps %xmm2, %xmm0

This was changed with r13-2843-g3db8e9c2422d92 (aka PR 53346).

For powerpc64le, it looks ok for GCC 11:
        addis   9,2,.LC0@toc@ha
        addi    1,1,48
        addi    9,9,.LC0@toc@l
        li      0,-16
        lvx     0,0,9
        vperm   2,31,2,0

Both the x86_64 and the PowerPC PERM implementations could be improved to
support the insertion like the aarch64 backend does.
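
For readers without the bug's test case at hand, here is a minimal C sketch
of the kind of source that leads to the single-lane insertion seen in the
GCC 11+ aarch64 output (ins v1.s[0], v0.s[1]). It is an assumption
reconstructed from the assembly above, not the attachment from the bug; the
names v4sf, g and f and the values returned by g are hypothetical.

```c
/* GCC/Clang vector extension: four packed floats in one 128-bit register. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical stand-in for an opaque call that returns a vector.
   noinline keeps it a real call, so `a` in f() must survive across it. */
__attribute__((noinline)) v4sf g(void)
{
    return (v4sf){10.0f, 11.0f, 12.0f, 13.0f};
}

/* Lane 0 of the result is lane 1 of g()'s return value; lanes 1..3 are
   taken unchanged from a. Ideally this is one reload of a plus one
   lane insert, rather than a general permute. */
v4sf f(v4sf a)
{
    v4sf t = g();
    return (v4sf){t[1], a[1], a[2], a[3]};
}
```

Compiling something like this with -O2 and inspecting the output of -S is one
way to compare backends; the exact instructions will depend on the GCC
version and target.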