From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id D49123858D32; Mon,  8 May 2023 13:58:36 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org D49123858D32
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1683554316;
	bh=5O47t1Jn9OWu3sMyJhIqpiQR/eaEOqvyQFerw04mQ8w=;
	h=From:To:Subject:Date:From;
	b=CCksKJ9oQr7ey2Ww/I9oa1TurKxyLrVRi7aFQxowOK4SlB1ijYoWzvcZbaltXsKSd
	 tSQDCWFWEHA4CpJMgP9jYT6MNoSHriDJsiSnzVdMgfZmKxz8Aze3whpL9Wn0Y+FOfL
	 8QVmWgUixHlTY32kUrou6jwo4DgAqCzO/93GN2iM=
From: "chfast at gmail dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/109771] New: Unnecessary pblendw for
 vectorized or
Date: Mon, 08 May 2023 13:58:36 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: rtl-optimization
X-Bugzilla-Version: 13.1.0
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: chfast at gmail dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter target_milestone
Message-ID: <bug-109771-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109771

            Bug ID: 109771
           Summary: Unnecessary pblendw for vectorized or
           Product: gcc
           Version: 13.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: chfast at gmail dot com
  Target Milestone: ---

I have an example that vectorizes an operation on a struct of 4 x 64-bit
words (a representation of a 256-bit integer). The implementation is just a
for loop with a trip count of 4.

The loop is vectorized as expected in isolation. However, when combined with
some non-trivial control flow and additional wrapper functions, the final
assembly contains unnecessary pblendw instructions. The extra instruction
varies with compiler version and target:

pblendw xmm1, xmm3, 240          (GCC 13, x86-64-v2)
movlpd  xmm1, QWORD PTR [rdi+16] (GCC 13, x86-64-v1)
shufpd  xmm1, xmm3, 2            (GCC 12)

I believe this is a regression in GCC 13, because I have a larger context
where GCC 12 optimized this "correctly". However, that difference was lost
during test reduction.

https://godbolt.org/z/jzK44h3js

cpp:

struct u256 {
    unsigned long w[4];
};

inline u256 or_(u256 x, u256 y) {
    u256 z;
    for (int i = 0; i < 4; ++i)
        z.w[i] = x.w[i] | y.w[i];
    return z;
}

inline void or_to(u256& z, u256 y) { z = or_(z, y); }

void op_or(u256* t) { or_to(t[1], t[0]); }

void test(u256* t) {
    void* tbl[]{&&CLOBBER, &&OR};
CLOBBER:
    goto * 0;
OR:
    op_or(t);
    goto * 0;
}


x86-64-v2 asm:

test(u256*):
        xorl    %eax, %eax
        jmp     *%rax
        movdqu  32(%rdi), %xmm3
        movdqu  (%rdi), %xmm1
        movdqu  16(%rdi), %xmm2
        movdqu  48(%rdi), %xmm0
        por     %xmm3, %xmm1
        movups  %xmm1, 32(%rdi)
        movdqa  %xmm2, %xmm1
        pblendw $240, %xmm0, %xmm1
        pblendw $240, %xmm2, %xmm0
        por     %xmm1, %xmm0
        movups  %xmm0, 48(%rdi)
        jmp     *%rax
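
To illustrate why the two pblendw instructions look redundant, here is a
minimal scalar sketch, not taken from GCC. It models an XMM register as eight
16-bit words and assumes the documented pblendw behavior (bit i of the
immediate selects word i from the source operand). The names Xmm, pblendw,
por, with_blends and without_blends are hypothetical helpers for this sketch
only. With mask 240 the two blends only swap the upper halves of the two
inputs, so the following por should give the same result as a plain por of
the original registers:

```cpp
#include <array>
#include <cstdint>

// Scalar model of one 128-bit XMM register as eight 16-bit words.
using Xmm = std::array<std::uint16_t, 8>;

// Model of `pblendw $imm8, src, dst` (AT&T operand order): word i of the
// result is taken from src if bit i of imm8 is set, otherwise from dst.
Xmm pblendw(Xmm dst, const Xmm& src, unsigned imm8) {
    for (int i = 0; i < 8; ++i)
        if (imm8 & (1u << i)) dst[i] = src[i];
    return dst;
}

// Model of `por src, dst`: bitwise OR of every word.
Xmm por(Xmm dst, const Xmm& src) {
    for (int i = 0; i < 8; ++i) dst[i] |= src[i];
    return dst;
}

// Mirrors the tail of the x86-64-v2 listing above.
Xmm with_blends(const Xmm& xmm2, const Xmm& xmm0) {
    Xmm xmm1 = pblendw(xmm2, xmm0, 240);  // movdqa + pblendw $240, %xmm0, %xmm1
    Xmm res  = pblendw(xmm0, xmm2, 240);  // pblendw $240, %xmm2, %xmm0
    return por(res, xmm1);                // por %xmm1, %xmm0
}

// What one would expect instead: a single por of the two loaded values.
Xmm without_blends(const Xmm& xmm2, const Xmm& xmm0) {
    return por(xmm0, xmm2);
}
```

Under this model with_blends and without_blends agree for any inputs, because
each 64-bit half of the result is still the OR of the matching halves of the
two operands.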