From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/104151] [10/11/12/13 Regression] x86: excessive code generated for 128-bit byteswap
Date: Wed, 07 Sep 2022 08:18:55 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: middle-end
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 12.3

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

--- Comment #16 from Richard Biener ---
(In reply to Barnabás Pőcze from comment #15)
> Sorry, I haven't found a better issue. But I think the example below
> exhibits the same or a very similar issue.
>
> I would expect the following code
>
> void f(unsigned char *p, std::uint32_t x, std::uint32_t y)
> {
>     p[0] = x >> 24;
>     p[1] = x >> 16;
>     p[2] = x >> 8;
>     p[3] = x >> 0;
>
>     p[4] = y >> 24;
>     p[5] = y >> 16;
>     p[6] = y >> 8;
>     p[7] = y >> 0;
> }
>
> to be compiled to something along the lines of
>
> f(unsigned char*, unsigned int, unsigned int):
>         bswap   esi
>         bswap   edx
>         mov     DWORD PTR [rdi], esi
>         mov     DWORD PTR [rdi+4], edx
>         ret
>
> however, I get scores of bitwise operations instead if `-fno-tree-vectorize`
> is not specified.
>
> https://gcc.godbolt.org/z/z51K6qorv

Yes, here we vectorize the store:

  [local count: 1073741824]:
  _1 = x_15(D) >> 24;
  _2 = (unsigned char) _1;
  _3 = x_15(D) >> 16;
  _4 = (unsigned char) _3;
  _5 = x_15(D) >> 8;
  _6 = (unsigned char) _5;
  _7 = (unsigned char) x_15(D);
  _8 = y_22(D) >> 24;
  _9 = (unsigned char) _8;
  _10 = y_22(D) >> 16;
  _11 = (unsigned char) _10;
  _12 = y_22(D) >> 8;
  _13 = (unsigned char) _12;
  _14 = (unsigned char) y_22(D);
  _35 = {_2, _4, _6, _7, _9, _11, _13, _14};
  vectp.4_36 = p_17(D);
  MEM [(unsigned char *)vectp.4_36] = _35;

Without vectorization, however, the store-merging pass (which runs after
vectorization) is able to detect two SImode bswaps. Basically we fail to
consider "generic" vectorization as an option here, and generic vectorization
fails to consider using bswap for permutes of "existing vectors". Likewise
we fail to consider _1, _3, etc. as element accesses of the existing
"vectors" x and y. That would work iff the shift + truncates were
canonicalized as BIT_FIELD_REF, but it's certainly possible to work with the
existing IL here.

Note this issue is probably better tracked in a separate bug report.
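
For illustration, the two SImode bswaps that store merging detects correspond
roughly to the hand-written form below. This is only a sketch, assuming a
little-endian target (such as x86) and the GCC/Clang `__builtin_bswap32`
builtin; the names f_bytes, f_bswap and same_bytes are made up for this
example and do not come from the bug report.

```cpp
#include <cstdint>
#include <cstring>

// Byte-wise big-endian stores, as in the test case from comment #15.
void f_bytes(unsigned char *p, std::uint32_t x, std::uint32_t y)
{
    p[0] = x >> 24;
    p[1] = x >> 16;
    p[2] = x >> 8;
    p[3] = x >> 0;

    p[4] = y >> 24;
    p[5] = y >> 16;
    p[6] = y >> 8;
    p[7] = y >> 0;
}

// Hand-written equivalent: one 32-bit byte swap plus one 32-bit store
// per argument. ASSUMPTION: little-endian host; on a big-endian host
// the swaps would have to be dropped.
void f_bswap(unsigned char *p, std::uint32_t x, std::uint32_t y)
{
    std::uint32_t bx = __builtin_bswap32(x);
    std::uint32_t by = __builtin_bswap32(y);
    std::memcpy(p, &bx, sizeof bx);      // memcpy avoids aliasing issues
    std::memcpy(p + 4, &by, sizeof by);
}

// Hypothetical helper: true if both variants write the same 8 bytes.
bool same_bytes(std::uint32_t x, std::uint32_t y)
{
    unsigned char a[8], b[8];
    f_bytes(a, x, y);
    f_bswap(b, x, y);
    return std::memcmp(a, b, sizeof a) == 0;
}
```

Comparing the assembly of f_bytes and f_bswap with and without
`-fno-tree-vectorize` should show whether the byte-wise form gets reduced to
the bswap form.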