From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 4F438385802A; Mon, 15 Mar 2021 09:52:09 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 4F438385802A
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/90579] [8/9/10/11 Regression] Huge store
 forward stall due to vectorizer, missed CSE
Date: Mon, 15 Mar 2021 09:52:08 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 9.1.1
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: rguenth at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 8.5
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-90579-4-Rv3LM1gLNe@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-90579-4@http.gcc.gnu.org/bugzilla/>
References: <bug-90579-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 15 Mar 2021 09:52:09 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579
--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #9)
> So we now have a "real" FRE after the vectorizer but we fail to CSE
>
>   MEM <vector(4) double> [(double *)&r] = vect__3.20_74;
> ...
>   MEM <vector(2) double> [(double *)&r + 32B] = vect__62.26_88;
> ...
>   vect__5.7_34 = MEM <vector(4) double> [(double *)&r + 16B];
>
> mine for GCC 11 to look at.  The code to CSE that load for _74 and _88
> is going to be a bit awkward though but it will nicely combine with the
> following stmts
>
>   vect__5.8_35 = VEC_PERM_EXPR <vect__5.7_34, vect__5.7_34, { 3, 2, 1, 0 }>;
>   stmp_t_12.9_36 = BIT_FIELD_REF <vect__5.8_35, 64, 0>;
>   stmp_t_12.9_37 = stmp_t_12.9_36 + 0.0;
>   stmp_t_12.9_38 = BIT_FIELD_REF <vect__5.8_35, 64, 64>;
>   stmp_t_12.9_39 = stmp_t_12.9_37 + stmp_t_12.9_38;
>   stmp_t_12.9_40 = BIT_FIELD_REF <vect__5.8_35, 64, 128>;
>   stmp_t_12.9_41 = stmp_t_12.9_39 + stmp_t_12.9_40;
>   stmp_t_12.9_42 = BIT_FIELD_REF <vect__5.8_35, 64, 192>;
>   t_12 = stmp_t_12.9_41 + stmp_t_12.9_42;
>
> and hopefully elide 'r' completely.

So the difficult thing is that we need to compose the upper v2df half of
vect__3.20_74 and the v2df vect__62.26_88 (the v4df load at &r + 16B
covers exactly those two pieces).  Assembly for that would be something
like

        vextractf128    $0x1, %ymm0, %xmm0
        vinsertf128     $0x1, %xmm1, %ymm0, %ymm0

and on GIMPLE

    tem_42 = BIT_FIELD_REF <vect__3.20_74, 128, 128>;
    vect__5.7_34 = { tem_42, vect__62.26_88 };

that's two stmts which at the moment VN simplification insertion doesn't
support.  It would be "nicer" to enhance for example VEC_PERM to allow

    vect__5.7_34 = VEC_PERM <vect__3.20_74, vect__62.26_88, { 2, 3, 4, 5 }>

"implicitly" extending _88 to v4df (aka a paradoxical v4df subreg of
the v2df SSE reg).  It would turn VEC_PERM into a concat + select operation
without requiring the intermediate to have a vector mode (in this case it
would be v6df without introducing subregs, a mode that is not possible).
On RTL unfortunately
(vec_select:V4DF (vec_concat (reg:V4DF ..) (reg:V2DF ..)) ..)
is not possible because of that restriction.  OTOH RTL lacks such a
concat-and-select operation, which would allow the cited form and
vec_merge to be "merged" (vec_merge doesn't require such an intermediate
mode either).
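
To make the intended semantics concrete, here is a rough scalar model in
plain C.  It is only a sketch, not GCC code: concat_select,
load_through_memory and forwarding_matches are hypothetical names made up
for illustration, and the element values are arbitrary.  It models the
two stores and the overlapping load through a 6-element 'r', and checks
that the loaded value equals a concat + select of the two stored vectors
with selector { 2, 3, 4, 5 }, without ever materializing a 6-element
vector.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "concat and select": selector values below na pick from a,
   the rest pick from b, as if a and b were concatenated.  No na + nb wide
   intermediate is built.  Returns 0 on success, -1 on a bad selector.  */
int concat_select(const double *a, size_t na,
                  const double *b, size_t nb,
                  const unsigned *sel, size_t nsel, double *out)
{
    for (size_t i = 0; i < nsel; i++) {
        unsigned idx = sel[i];
        if (idx >= na + nb)
            return -1;
        out[i] = idx < na ? a[idx] : b[idx - na];
    }
    return 0;
}

/* Reference path: perform the stores and the load through memory.  */
void load_through_memory(const double v74[4], const double v88[2],
                         double out[4])
{
    double r[6];
    memcpy(&r[0], v74, 4 * sizeof(double)); /* store at &r        */
    memcpy(&r[4], v88, 2 * sizeof(double)); /* store at &r + 32B  */
    memcpy(out, &r[2], 4 * sizeof(double)); /* load from &r + 16B */
}

/* Returns 1 if the forwarded value matches the load through memory.  */
int forwarding_matches(const double v74[4], const double v88[2])
{
    static const unsigned sel[4] = { 2, 3, 4, 5 };
    double from_mem[4], from_regs[4];
    int rc;

    load_through_memory(v74, v88, from_mem);
    rc = concat_select(v74, 4, v88, 2, sel, 4, from_regs);
    assert(rc == 0);
    (void) rc;
    return memcmp(from_mem, from_regs, sizeof from_mem) == 0;
}
```

Calling forwarding_matches with any v4df/v2df pair should return 1; the
point is only that selectors 2 and 3 address the upper half of the first
operand and 4 and 5 the whole of the second.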

I'll see how difficult it is to teach VN multi-stmt insertions.