From: "amonakov at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics
Date: Sun, 11 Oct 2020 16:05:44 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 11.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: amonakov at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields: cc
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

Alexander Monakov changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov ---
Intrinsics being type-agnostic cause vector subregs to appear before register
allocation: the pseudo coming from the load has mode V2DI, the shift needs to
be done in mode V4SI, the bitwise-or and the store are done in mode V2DI
again. Subreg in the bitwise-or appears to be handled inefficiently. Didn't dig
deeper as to what happens during allocation.

FWIW, using generic vectors avoids introducing such mismatches, and indeed
the variant coded with generic vectors does not have extra loads. For your
original code you'll have to convert between generic vectors and __m128i to
use the shuffle intrinsic. The last paragraphs in the "Vector Extensions"
chapter [1] suggest using a union for that purpose in C; in C++ reinterpreting
via union is formally UB, so another approach could be used (probably simply
converting via assignment).

[1] https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

#include <stdint.h>

typedef uint32_t u32v4 __attribute__((vector_size(16)));

void gcc_double_load_128(int8_t *__restrict out, const int8_t *__restrict input)
{
    u32v4 *vin = (u32v4 *)input;
    u32v4 *vout = (u32v4 *)out;
    /* 1024 bytes total, one 16-byte vector per iteration */
    for (unsigned i = 0; i < 1024; i += 16) {
        u32v4 in = *vin++;
        /* load, shift and or all happen on the same vector type */
        *vout++ = in | (in >> 4);
    }
}

Above code on Compiler Explorer: https://godbolt.org/z/MKPvxb
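
The conversion between generic vectors and __m128i mentioned above could look
roughly like the sketch below. It is only an illustration: the helper names,
the vec128 union and the choice of _mm_shuffle_epi32 as "the shuffle intrinsic"
are assumptions, not taken from the original testcase. The first helper uses
the union approach the manual suggests for C; the second uses a plain cast,
which the manual documents as valid between vector types of the same size.
Both assume an x86 target with SSE2.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: __m128i, _mm_shuffle_epi32 */

typedef uint32_t u32v4 __attribute__((vector_size(16)));

/* Hypothetical union giving two views of the same 16 bytes. */
typedef union {
    u32v4   v;  /* generic-vector view */
    __m128i m;  /* intrinsic view */
} vec128;

/* Generic-vector arithmetic first, then an intrinsic shuffle that
   reverses the four 32-bit lanes (stand-in for the real shuffle). */
static inline u32v4 or_shift_reverse_union(u32v4 in)
{
    vec128 x;
    x.v = in | (in >> 4);
    x.m = _mm_shuffle_epi32(x.m, _MM_SHUFFLE(0, 1, 2, 3));
    return x.v;
}

/* Same operation, converting with casts between same-size vector types. */
static inline u32v4 or_shift_reverse_cast(u32v4 in)
{
    u32v4 t = in | (in >> 4);
    __m128i m = _mm_shuffle_epi32((__m128i)t, _MM_SHUFFLE(0, 1, 2, 3));
    return (u32v4)m;
}
```

Whether either form avoids the extra load in the original loop would still
need to be checked in the generated assembly.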