From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 55E1B385DC06; Wed, 27 Mar 2024 10:37:00 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 55E1B385DC06
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1711535820;
	bh=0zjkoNSXN8IvJkoJ6cmNPD5sm3b3zS3CsDc3R5Y3hWQ=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=MeoLJfpMlgZVSr12VHvbHjPrB1xnPX9A6FzUNpcfE9JMxGBEByqi/XLdxD2TkbOhK
	 P8fGe14BFr1XusNyaIEzLTwNZAP7fUISKHMysfPH1dvegnjFPsLa2lCUaZhHmbRzia
	 NZ1TsKgS8I5OlpDUcIoA4bOuqajb7yjLsegDCLJs=
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/114057] [14 Regression] 435.gromacs fails
 verification with -Ofast -march={znver2,znver4} and PGO after
 r14-7272-g57f611604e8bab
Date: Wed, 27 Mar 2024 10:36:58 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: wrong-code
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P1
X-Bugzilla-Assigned-To: rguenth at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 14.0
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-114057-4-yyPcrVbKM7@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-114057-4@http.gcc.gnu.org/bugzilla/>
References: <bug-114057-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114057
--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
OK, so I think the change is that we get to "correctly" notice

-vec.h:380:9: note: node (external) 0x6a2e9d8 (max_nunits=2, refcnt=1) vector(2) float
-vec.h:380:9: note:     stmt 0 _164 = MEM[(const real *)_27 + 8B];
-vec.h:380:9: note:     stmt 1 _158 = MEM[(const real *)_27];
+vec.h:380:9: note: node (external) 0x5a823a8 (max_nunits=2, refcnt=1) vector(2) float
+vec.h:380:9: note:     [l] stmt 0 _164 = MEM[(const real *)_27 + 8B];
+vec.h:380:9: note:     [l] stmt 1 _158 = MEM[(const real *)_27];

for the loads that we do not handle because of gaps and that get promoted
to external.  That leads to extra costs.

But also

+vec.h:380:9: note: node 0x5a81770 (max_nunits=2, refcnt=2) vector(2) float
 vec.h:380:9: note: op template: x_160 = _158 - _159;
 vec.h:380:9: note:     stmt 0 x_160 = _158 - _159;
-vec.h:380:9: note:     [l] stmt 1 y_163 = _161 - _162;
+vec.h:380:9: note:     stmt 1 y_163 = _161 - _162;

so y_163 isn't considered live for some reason.  We find

_123 = _117 * y_163;

is vectorized as part of a reduction.  On the costing side we then see

-_161 - _162 1 times scalar_stmt costs 12 in body
-MEM[(const real *)_27 + 4B] 1 times scalar_load costs 12 in body
-MEM[(const real *)_24 + 4B] 1 times scalar_load costs 12 in body

which are the live (and dependent) stmts that are no longer costed on the
scalar side, but also

+MEM[(const real *)_27 + 8B] 1 times vec_to_scalar costs 4 in epilogue
+MEM[(const real *)_24 + 8B] 1 times vec_to_scalar costs 4 in epilogue

costed in the vector epilog.  This is because we're conservative as we
don't really know whether we'll be able to code-generate the live
operation.  The costing side here is also not in sync, as can be seen
from the removed _161 - _162 op.
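
To put numbers on the costing change, here is a small sketch that just sums
the cost lines quoted above by kind and location.  It is not part of GCC; the
helper name cost_delta and the regular expression are made up for this
illustration, and it assumes the "costs N" figure is the total for the line
(with a count of 1 it makes no difference).

```python
import re

# Cost lines copied verbatim from the dump diff above.
# The leading '-' / '+' is the diff marker, not part of the dump itself.
DUMP_DIFF = """\
-_161 - _162 1 times scalar_stmt costs 12 in body
-MEM[(const real *)_27 + 4B] 1 times scalar_load costs 12 in body
-MEM[(const real *)_24 + 4B] 1 times scalar_load costs 12 in body
+MEM[(const real *)_27 + 8B] 1 times vec_to_scalar costs 4 in epilogue
+MEM[(const real *)_24 + 8B] 1 times vec_to_scalar costs 4 in epilogue
"""

# Hypothetical pattern for "<stmt> <count> times <kind> costs <cost> in <where>".
COST_RE = re.compile(r"^([+-])(.*) (\d+) times (\w+) costs (\d+) in (\w+)$")


def cost_delta(diff_text):
    """Sum the cost change per (kind, where); removed lines count negative."""
    totals = {}
    for line in diff_text.splitlines():
        m = COST_RE.match(line)
        if not m:
            continue  # skip anything that is not a cost line
        sign, _stmt, _count, kind, cost, where = m.groups()
        delta = int(cost) if sign == "+" else -int(cost)
        key = (kind, where)
        totals[key] = totals.get(key, 0) + delta
    return totals


if __name__ == "__main__":
    for (kind, where), delta in sorted(cost_delta(DUMP_DIFF).items()):
        print(f"{kind:14s} {where:9s} {delta:+d}")
```

For the lines above this gives -36 for the three stmts dropped from the
scalar side and +8 for the two vec_to_scalar extracts added to the vector
epilogue, so both changes shift the scalar vs. vector comparison against the
vector version.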

I should also note that the setting of PURE_SLP is done a bit too early,
before we analyze operations and eventually throw away instances or
prune them by promoting ops to external.

For reductions we also falsely claim all root stmts are vectorized - we
do have remaining ops.  Fixing this restores the LIVE on them and in some
way restores vectorization.

I'm going to test this as a fix for now.