From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 55143385828E; Fri, 17 Feb 2023 21:05:30 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 55143385828E
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1676667930;
	bh=RjeSv27oo+S7EWoKz5d9xemz12QfJgIJvW2/J+T73Co=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=JacbdvG8GIwz9RL55XSrE+8y1iDQY+5SzQ8A1eov2Hg82iz11YuE2IHFmaQyya1lc
	 h1oM5MehrkeHmlcZaHpgRL8CfV/dXCHP+IeArstL64Te2zh9k3Kb2PdRBSBikdJd/s
	 iZU6k3hxKggVUrM6lHM3eopwAT+bLB3ZhfLXiMTg=
From: "pinskia at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/94908] Failure to optimally optimize certain shuffle
 patterns
Date: Fri, 17 Feb 2023 21:05:29 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 10.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: enhancement
X-Bugzilla-Who: pinskia at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_severity see_also component
Message-ID: <bug-94908-4-KBAZimhVtx@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-94908-4@http.gcc.gnu.org/bugzilla/>
References: <bug-94908-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94908

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
           See Also|                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=53346,
                   |                            |https://gcc.gnu.org/bugzill
                   |                            |a/show_bug.cgi?id=93720
          Component|tree-optimization           |target
--- Comment #4 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
I think this was a target issue and should maybe be split into a couple of
different bugs.

For GCC 8, aarch64 produces:
        dup     v0.4s, v0.s[1]
        ldr     q1, [sp, 16]
        ldp     x29, x30, [sp], 32
        ins     v0.s[1], v1.s[1]
        ins     v0.s[2], v1.s[2]
        ins     v0.s[3], v1.s[3]


GCC 9/10 produced the following (which is OK, though it could be improved, and
was improved in GCC 11):
        adrp    x0, .LC0
        ldr     q1, [sp, 16]
        ldr     q2, [x0, #:lo12:.LC0]
        ldp     x29, x30, [sp], 32
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b

For GCC 11+, aarch64 produces:
        ldr     q1, [sp, 16]
        ins     v1.s[0], v0.s[1]
        mov     v0.16b, v1.16b


This means that for aarch64, this was changed in GCC 10 and fully fixed in GCC 11
(by r11-2192-gc9c87e6f9c795b, aka PR 93720, which was in fact my patch).

For x86_64, the trunk produces:

        movaps  (%rsp), %xmm1
        addq    $24, %rsp
        shufps  $85, %xmm1, %xmm0
        shufps  $232, %xmm1, %xmm0

While GCC 12 produces:

        movaps  (%rsp), %xmm1
        addq    $24, %rsp
        shufps  $85, %xmm0, %xmm0
        movaps  %xmm1, %xmm2
        shufps  $85, %xmm1, %xmm2
        movaps  %xmm2, %xmm3
        movaps  %xmm1, %xmm2
        unpckhps        %xmm1, %xmm2
        unpcklps        %xmm3, %xmm0
        shufps  $255, %xmm1, %xmm1
        unpcklps        %xmm1, %xmm2
        movlhps %xmm2, %xmm0

This was changed with r13-2843-g3db8e9c2422d92 (aka PR 53346).
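
To make the two-instruction trunk sequence easier to follow, here is a minimal
sketch that models `shufps $imm, src, dst` in plain C and replays the two
shuffles. The helper name shufps_model is hypothetical, and the sketch assumes
that %xmm0 holds one input vector (x) and %xmm1 the vector reloaded from the
stack (a); neither name appears in the bug report.

```c
/* Model of SSE `shufps $imm, src, dst` in AT&T operand order:
   the two low result lanes are picked from dst, the two high lanes
   from src, each by a 2-bit field of the immediate. */
static void shufps_model(unsigned imm, const float src[4], float dst[4])
{
    float r[4];
    r[0] = dst[imm & 3];
    r[1] = dst[(imm >> 2) & 3];
    r[2] = src[(imm >> 4) & 3];
    r[3] = src[(imm >> 6) & 3];
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}

/* Replays the trunk sequence on x (assumed %xmm0) and a (assumed %xmm1):
     shufps $85,  %xmm1, %xmm0   ->  x = { x[1], x[1], a[1], a[1] }
     shufps $232, %xmm1, %xmm0   ->  x = { x[1], a[1], a[2], a[3] }  */
static void trunk_sequence(const float a[4], float x[4])
{
    shufps_model(85, a, x);
    shufps_model(232, a, x);
}
```

Under those assumptions the net effect is a copy of a with lane 0 replaced by
lane 1 of x, which is the same result the single aarch64 `ins` achieves.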

For powerpc64le, it looks ok for GCC 11:
        addis 9,2,.LC0@toc@ha
        addi 1,1,48
        addi 9,9,.LC0@toc@l
        li 0,-16
        lvx 0,0,9
        vperm 2,31,2,0

Both the x86_64 and the PowerPC PERM implementations could be improved to
support the insertion case, as the aarch64 backend already does.
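
As an illustration of what "insertion" means here, the following sketch writes
the lane pattern implied by the assembly above in two ways, using the GCC/Clang
vector extensions. The function names and the selector { 1, 5, 6, 7 } are
inferred from the aarch64 output, not taken from the bug's test case, so treat
them as assumptions.

```c
typedef float v4sf __attribute__((vector_size(16)));

/* Generic two-input permute: selector values 0-3 pick lanes of x,
   values 4-7 pick lanes of a.  This is the form a backend sees. */
static v4sf permute_generic(v4sf x, v4sf a)
{
    const int sel[4] = {1, 5, 6, 7};   /* assumed selector */
    v4sf r = {0};
    for (int i = 0; i < 4; i++)
        r[i] = sel[i] < 4 ? x[sel[i]] : a[sel[i] - 4];
    return r;
}

/* The same permute recognized as an insertion: all lanes but one come
   from a in their original positions, so start from a and overwrite
   the single differing lane. */
static v4sf permute_as_insert(v4sf x, v4sf a)
{
    v4sf r = a;
    r[0] = x[1];
    return r;
}
```

Both functions compute the same vector; a backend that detects the second form
can emit a single lane insert instead of a general permute.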