From: "chfast at gmail dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/109771] New: Unnecessary pblendw for vectorized or
Date: Mon, 08 May 2023 13:58:36 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109771

            Bug ID: 109771
           Summary: Unnecessary pblendw for vectorized or
           Product: gcc
           Version: 13.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: chfast at gmail dot com
  Target Milestone: ---

I have an example of vectorizing a struct of 4 x 64-bit words (a representation of a 256-bit integer). The implementation is just a for loop with a trip count of 4.
This is vectorized in isolation; however, when combined with some non-trivial control flow and additional wrapping functions, the final assembly contains weird pblendw instructions (or their equivalents, depending on compiler version and target level):

pblendw xmm1, xmm3, 240             (GCC 13, x86-64-v2)
movlpd  xmm1, QWORD PTR [rdi+16]    (GCC 13, x86-64-v1)
shufpd  xmm1, xmm3, 2               (GCC 12)

I believe this is some kind of regression in GCC 13, because I have a bigger context where GCC 12 was optimizing it "correctly". However, that context was lost during test reduction.

https://godbolt.org/z/jzK44h3js

cpp:

struct u256 {
    unsigned long w[4];
};

inline u256 or_(u256 x, u256 y) {
    u256 z;
    for (int i = 0; i < 4; ++i)
        z.w[i] = x.w[i] | y.w[i];
    return z;
}

inline void or_to(u256& z, u256 y) { z = or_(z, y); }

void op_or(u256* t) { or_to(t[1], t[0]); }

void test(u256* t) {
    void* tbl[]{&&CLOBBER, &&OR};
CLOBBER:
    goto * 0;
OR:
    op_or(t);
    goto * 0;
}

x86-64-v2 asm:

test(u256*):
        xorl    %eax, %eax
        jmp     *%rax
        movdqu  32(%rdi), %xmm3
        movdqu  (%rdi), %xmm1
        movdqu  16(%rdi), %xmm2
        movdqu  48(%rdi), %xmm0
        por     %xmm3, %xmm1
        movups  %xmm1, 32(%rdi)
        movdqa  %xmm2, %xmm1
        pblendw $240, %xmm0, %xmm1
        pblendw $240, %xmm2, %xmm0
        por     %xmm1, %xmm0
        movups  %xmm0, 48(%rdi)
        jmp     *%rax
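
To illustrate why the two pblendw instructions in the x86-64-v2 listing look unnecessary, here is a rough scalar model of the upper-half sequence. It is only a sketch and not part of the reduced test case: the names xmm, pblendw_240, por, or_with_blends and or_direct are hypothetical helpers, and a 128-bit register is modelled as two 64-bit lanes. With immediate 240 (0xF0), pblendw keeps the low four words of the destination and takes the high four words from the source, so the two blends only swap the upper lanes between the operands before the por, and the result should match a single por on the unblended values.

```cpp
#include <array>
#include <cstdint>
#include <cassert>

// Hypothetical scalar model: one XMM register as two 64-bit lanes {low, high}.
using xmm = std::array<std::uint64_t, 2>;

// Models `pblendw $240, src, dst` (AT&T operand order). Immediate 0xF0 selects
// words 4-7 (the high 64 bits) from src and keeps words 0-3 (the low 64 bits)
// of dst.
inline xmm pblendw_240(xmm dst, xmm src) { return xmm{{dst[0], src[1]}}; }

// Models `por`: lane-wise bitwise OR.
inline xmm por(xmm a, xmm b) { return xmm{{a[0] | b[0], a[1] | b[1]}}; }

// Mirrors the upper-half sequence in the listing:
//   movdqa %xmm2, %xmm1 ; pblendw $240, %xmm0, %xmm1
//   pblendw $240, %xmm2, %xmm0 ; por %xmm1, %xmm0
// where t0_hi models t[0].w[2..3] and t1_hi models t[1].w[2..3].
inline xmm or_with_blends(xmm t0_hi, xmm t1_hi) {
    xmm a = pblendw_240(t0_hi, t1_hi);  // {t0.w[2], t1.w[3]}
    xmm b = pblendw_240(t1_hi, t0_hi);  // {t1.w[2], t0.w[3]}
    return por(a, b);
}

// The sequence one would expect instead: a single por, as in the lower half.
inline xmm or_direct(xmm t0_hi, xmm t1_hi) { return por(t0_hi, t1_hi); }

// Small self-check: both variants agree on one sample input.
inline void check_blends_are_redundant() {
    xmm t0_hi{{0x00ffu, 0x0f0fu}};
    xmm t1_hi{{0xff00u, 0xf0f0u}};
    assert(or_with_blends(t0_hi, t1_hi) == or_direct(t0_hi, t1_hi));
}
```

Since OR is applied per lane and is commutative, swapping the high lanes between the two operands cannot change the result, which is why the blends appear to be pure overhead in this model.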