From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 7AE44385BF93; Fri, 18 Feb 2022 11:31:49 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 7AE44385BF93
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/104582] [11/12 Regression] Unoptimal code for
 __negdi2 (and others) from libgcc2 due to unwanted vectorization
Date: Fri, 18 Feb 2022 11:31:49 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: rguenth at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 11.3
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-104582-4-BftwqPaH6c@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-104582-4@http.gcc.gnu.org/bugzilla/>
References: <bug-104582-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Fri, 18 Feb 2022 11:31:49 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104582
--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
The patch will cause the following testsuite changes:

FAIL: gcc.target/i386/pr91446.c scan-assembler-times vmovdqa[^\\n\\r]*xmm[0-9] 2
FAIL: gcc.target/i386/pr92658-avx512bw-2.c scan-assembler-times pmovsxdq 2
FAIL: gcc.target/i386/pr92658-sse4-2.c scan-assembler-times pmovsxbq 2
FAIL: gcc.target/i386/pr92658-sse4-2.c scan-assembler-times pmovsxdq 2
FAIL: gcc.target/i386/pr92658-sse4-2.c scan-assembler-times pmovsxwq 2
FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxbq 2
FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxdq 2
FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxwq 2
XPASS: gcc.target/i386/pr99881.c scan-assembler-not xmm[0-9]

I still have to look into some of them.  The pr92658 failures seem to be cases like

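/* Element types assumed unsigned, as the zero extension in the dump implies.  */
typedef unsigned int v4si __attribute__ ((vector_size (16)));
typedef unsigned long long v2di __attribute__ ((vector_size (16)));

void
bar_u32_u64 (v2di * dst, v4si src)
{
  unsigned long long tem[2];
  tem[0] = src[0];
  tem[1] = src[1];
  dst[0] = *(v2di *) tem;
}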

where we fail to recognize the BIT_FIELD_REF as accessing a pre-existing
vector (we only support a subset of cases during SLP discovery):

  _1 = BIT_FIELD_REF <src_6(D), 32, 0>;
  _2 = (long long unsigned int) _1;
  tem[0] = _2;
  _3 = BIT_FIELD_REF <src_6(D), 32, 32>;
  _4 = (long long unsigned int) _3;
  tem[1] = _4;

but when we vectorize just the store and the conversion, as in

  <bb 2> [local count: 1073741824]:
  _1 = BIT_FIELD_REF <src_6(D), 32, 0>;
  _3 = BIT_FIELD_REF <src_6(D), 32, 32>;
  _13 = {_1, _3};
  vect__2.110_14 = (vector(2) long long unsigned int) _13;
  MEM <vector(2) long long unsigned int> [(long long unsigned int *)&tem] = vect__2.110_14;

we can recover things on the RTL side.
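For illustration, the vectorized GIMPLE above can be written back as C with GCC's generic vector extensions. This is a minimal sketch, not part of the testsuite: it assumes unsigned element types (as the zero extension in the dump suggests), the names v2si, bar_scalar, bar_vec and bar_forms_agree are hypothetical, and __builtin_convertvector needs GCC 9 or later (Clang also accepts it).

```c
#include <assert.h>
#include <string.h>

/* Assumed typedefs; v2si is a hypothetical two-lane helper type.  */
typedef unsigned int v4si __attribute__ ((vector_size (16)));
typedef unsigned int v2si __attribute__ ((vector_size (8)));
typedef unsigned long long v2di __attribute__ ((vector_size (16)));

/* Scalar form of the testcase.  memcpy replaces the *(v2di *) tem load
   so the sketch does not rely on the alignment of the stack temporary.  */
void
bar_scalar (v2di *dst, v4si src)
{
  unsigned long long tem[2];
  tem[0] = src[0];
  tem[1] = src[1];
  memcpy (dst, tem, sizeof tem);
}

/* The vectorized GIMPLE as C:
     _13 = {_1, _3};                    -> build a two-lane vector
     vect = (vector(2) ...) _13;        -> one vector conversion
   The dump stores the result to tem; here it goes straight to *dst.  */
void
bar_vec (v2di *dst, v4si src)
{
  v2si lo = { src[0], src[1] };
  *dst = __builtin_convertvector (lo, v2di);
}

/* Returns 1 when both forms produce the same two 64-bit lanes.  */
int
bar_forms_agree (v4si src)
{
  v2di a, b;
  bar_scalar (&a, src);
  bar_vec (&b, src);
  return memcmp (&a, &b, sizeof a) == 0;
}
```

Both forms zero-extend lanes 0 and 1 of src, which is the operation pmovzxdq performs in a single instruction.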

So this is just a reminder that costing is difficult:

Cost model analysis:
_2 1 times scalar_store costs 12 in body
_4 1 times scalar_store costs 12 in body
(long long unsigned int) _1 1 times scalar_stmt costs 4 in body
(long long unsigned int) _3 1 times scalar_stmt costs 4 in body
(long long unsigned int) _1 1 times vector_stmt costs 4 in body
node 0x415e268 1 times vec_construct costs 20 in prologue
_2 1 times vector_store costs 16 in body
Cost model analysis for part in loop 0:
  Vector cost: 40
  Scalar cost: 32
not vectorized: vectorization is not profitable.
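The two totals are plain sums of the per-statement entries in the dump. A minimal sketch reproducing them, with the numbers copied from the dump; the comparison rule in vectorization_profitable is an assumption (modelled as "vector cost not above scalar cost"), and all names are hypothetical.

```c
#include <assert.h>

/* Per-statement costs copied from the dump (icelake-server tuning).  */
enum
{
  SCALAR_STORE = 12,  /* _2 and _4: scalar_store */
  SCALAR_STMT = 4,    /* each (long long unsigned int) conversion */
  VECTOR_STMT = 4,    /* the single vector conversion */
  VEC_CONSTRUCT = 20, /* building {_1, _3}, costed in the prologue */
  VECTOR_STORE = 16   /* the single vector store */
};

int
scalar_cost (void)
{
  return 2 * SCALAR_STORE + 2 * SCALAR_STMT;         /* 24 + 8 = 32 */
}

int
vector_cost (void)
{
  return VECTOR_STMT + VEC_CONSTRUCT + VECTOR_STORE; /* 4 + 20 + 16 = 40 */
}

/* Assumed rule: vectorize unless the vector cost exceeds the scalar cost.  */
int
vectorization_profitable (void)
{
  return vector_cost () <= scalar_cost ();
}
```

The vec_construct alone (20) is larger than the 8-unit gap between the two totals, so it is the entry that tips the decision.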

Note that this uses the icelake-server costs, which have an unusually high
sse_to_integer cost.

The best fix here would of course be to recognize the BIT_FIELD_REF vector use.
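To illustrate what recognizing that vector use involves, here is a toy model. It is not GCC's SLP discovery code, and struct bfr and lanes_are_vector_prefix are hypothetical names. Each scalar lane of the group is described as BIT_FIELD_REF <base, size, pos>. The group reads a pre-existing vector directly when every lane uses the same base and element size and lane i sits at bit position i * size. This assumes bit positions count in lane order, as they do in the dump above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one BIT_FIELD_REF <base, size, pos>.  */
struct bfr
{
  int base;      /* identifies the SSA name, e.g. src_6(D) */
  unsigned size; /* extracted width in bits */
  unsigned pos;  /* bit position of the extraction */
};

/* Returns 1 when lane i of the group reads element i of one vector of
   BASE_BITS bits, i.e. the lanes form a contiguous low part of BASE.  */
int
lanes_are_vector_prefix (const struct bfr *lanes, size_t n,
                         unsigned base_bits)
{
  if (n == 0)
    return 0;
  for (size_t i = 0; i < n; i++)
    {
      /* All lanes must come from the same vector with the same width.  */
      if (lanes[i].base != lanes[0].base || lanes[i].size != lanes[0].size)
        return 0;
      /* Lane i must sit at bit i * size, inside the base vector.  */
      if (lanes[i].size == 0 || lanes[i].pos != i * lanes[i].size)
        return 0;
      if (lanes[i].pos + lanes[i].size > base_bits)
        return 0;
    }
  return 1;
}
```

For the pr92658 case, the lanes {src_6(D), 32, 0} and {src_6(D), 32, 32} pass the check against the 128-bit src. Swapped or mixed-base lanes would instead need a permute or a real vec_construct.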