From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/104151] [10/11/12/13 Regression] x86: excessive code
 generated for 128-bit byteswap
Date: Wed, 07 Sep 2022 08:18:55 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: middle-end
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 12.3
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-104151-4-UH0dufAiKG@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-104151-4@http.gcc.gnu.org/bugzilla/>
References: <bug-104151-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151
--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Barnabás Pőcze from comment #15)
> Sorry, I haven't found a better issue. But I think the example below
> exhibits the same or a very similar issue.
>
> I would expect the following code
>
> void f(unsigned char *p, std::uint32_t x, std::uint32_t y)
> {
>     p[0] = x >> 24;
>     p[1] = x >> 16;
>     p[2] = x >>  8;
>     p[3] = x >>  0;
>
>     p[4] = y >> 24;
>     p[5] = y >> 16;
>     p[6] = y >>  8;
>     p[7] = y >>  0;
> }
>
> to be compiled to something along the lines of
>
> f(unsigned char*, unsigned int, unsigned int):
>         bswap   esi
>         bswap   edx
>         mov     DWORD PTR [rdi], esi
>         mov     DWORD PTR [rdi+4], edx
>         ret
>
> however, I get scores of bitwise operations instead if `-fno-tree-vectorize`
> is not specified.
>=20
> https://gcc.godbolt.org/z/z51K6qorv

Yes, here we vectorize the store:

  <bb 2> [local count: 1073741824]:
  _1 = x_15(D) >> 24;
  _2 = (unsigned char) _1;
  _3 = x_15(D) >> 16;
  _4 = (unsigned char) _3;
  _5 = x_15(D) >> 8;
  _6 = (unsigned char) _5;
  _7 = (unsigned char) x_15(D);
  _8 = y_22(D) >> 24;
  _9 = (unsigned char) _8;
  _10 = y_22(D) >> 16;
  _11 = (unsigned char) _10;
  _12 = y_22(D) >> 8;
  _13 = (unsigned char) _12;
  _14 = (unsigned char) y_22(D);
  _35 = {_2, _4, _6, _7, _9, _11, _13, _14};
  vectp.4_36 = p_17(D);
  MEM <vector(8) unsigned char> [(unsigned char *)vectp.4_36] = _35;

but without vectorization, the store-merging pass (which runs after
vectorization) is able to detect the two SImode bswaps.
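
To make the expected result concrete, here is a rough sketch of the
source-level equivalent of "two SImode bswaps plus two 32-bit stores".
It assumes a little-endian target (such as x86) and a compiler that
provides __builtin_bswap32 (GCC, Clang).  The names f_bswap and
stores_match are made up for this illustration; they are not part of
the test case or of GCC.

```cpp
#include <cassert>   // only needed for the check in the usage note
#include <cstdint>
#include <cstring>

// Byte-wise big-endian stores, as in the quoted test case.
void f(unsigned char *p, std::uint32_t x, std::uint32_t y)
{
    p[0] = x >> 24;
    p[1] = x >> 16;
    p[2] = x >>  8;
    p[3] = x >>  0;

    p[4] = y >> 24;
    p[5] = y >> 16;
    p[6] = y >>  8;
    p[7] = y >>  0;
}

// Hand-written form of the hoped-for code: one byte swap and one
// 32-bit store per input.  ASSUMPTION: little-endian target.
void f_bswap(unsigned char *p, std::uint32_t x, std::uint32_t y)
{
    x = __builtin_bswap32(x);
    y = __builtin_bswap32(y);
    std::memcpy(p, &x, sizeof x);      // 32-bit store at p
    std::memcpy(p + 4, &y, sizeof y);  // 32-bit store at p + 4
}

// Hypothetical helper: true if both variants write the same 8 bytes.
bool stores_match(std::uint32_t x, std::uint32_t y)
{
    unsigned char a[8], b[8];
    f(a, x, y);
    f_bswap(b, x, y);
    return std::memcmp(a, b, sizeof a) == 0;
}
```

On a little-endian host, assert(stores_match(0x11223344u, 0xAABBCCDDu))
should hold, which is the equivalence the bswap detection relies on.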

Basically we fail to consider "generic" vectorization as an option here,
and generic vectorization fails to consider using bswap for permutes
of "existing vectors".  Likewise we fail to consider _1, _3, etc.
as element accesses of the existing "vectors" x and y.  That would
work iff the shift + truncates were canonicalized as BIT_FIELD_REF,
but it's certainly possible to work with the existing IL here.
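
The "element access" view can be illustrated at the source level.  This
is only a sketch of the idea, not of how the vectorizer would implement
it: it treats a 32-bit value as a four-byte "vector" in memory order
and checks that shift + truncate picks out one element, and that the
stored byte sequence is the reversing permute of that vector, which is
what bswap computes.  It assumes a little-endian target and
__builtin_bswap32; Bytes4, as_bytes, shift_trunc, is_element_access and
store_is_reversal are names invented for this example.

```cpp
#include <array>
#include <cassert>   // only needed for the checks in the usage note
#include <cstdint>
#include <cstring>

using Bytes4 = std::array<unsigned char, 4>;

// View a 32-bit value as a "vector" of four bytes in memory order.
Bytes4 as_bytes(std::uint32_t v)
{
    Bytes4 b;
    std::memcpy(b.data(), &v, sizeof v);
    return b;
}

// Shift + truncate, the form the IL above uses for _1 .. _14.
unsigned char shift_trunc(std::uint32_t v, unsigned k)
{
    return static_cast<unsigned char>(v >> (8 * k));
}

// ASSUMPTION: little-endian target.  There, shift_trunc(v, k) is
// exactly element k of as_bytes(v).
bool is_element_access(std::uint32_t v)
{
    const Bytes4 b = as_bytes(v);
    for (unsigned k = 0; k < 4; ++k)
        if (shift_trunc(v, k) != b[k])
            return false;
    return true;
}

// The stored sequence {v>>24, v>>16, v>>8, v} is the permute {3,2,1,0}
// of as_bytes(v), i.e. the byte vector of bswap(v).
bool store_is_reversal(std::uint32_t v)
{
    const Bytes4 stored = { shift_trunc(v, 3), shift_trunc(v, 2),
                            shift_trunc(v, 1), shift_trunc(v, 0) };
    return stored == as_bytes(__builtin_bswap32(v));
}
```

On a little-endian host both is_element_access(0x11223344u) and
store_is_reversal(0x11223344u) should return true.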

Note this issue is probably better tracked in a separate bug report.