From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id C77B93840C32; Tue, 29 Mar 2022 10:39:01 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org C77B93840C32
From: "rguenth at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/104271] [12 Regression] 538.imagick_r run-time at -Ofast
 -march=native regressed by 26% on Intel Cascade Lake server CPU
Date: Tue, 29 Mar 2022 10:39:01 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 12.0
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_status cc
Message-ID: <bug-104271-4-GvzEbuI03f@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-104271-4@http.gcc.gnu.org/bugzilla/>
References: <bug-104271-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 29 Mar 2022 10:39:01 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104271

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |jamborm at gcc dot gnu.org
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to cuilili from comment #7)
> Created attachment 52706 [details]
> Add a heuristic for eliminate redundant load and store in inline pass.
>
> Hi Richard,
>
> Could you help take a look? This is my first time adding code in mid-end,
> hope you can give me some advice, thank you!
>
> I added an INLINE_HINT_eliminate_load_and_store hint to the inline pass. When
> the callee's memory access is to the caller's local memory passed as a
> parameter and the access size is greater than the target threshold, we
> enable the hint. With the hint, the inlining_insns_auto bound is enlarged.
> The target hook is only enabled for x86 now.
>
> With the patch applied
> Icelake server: 538.imagick_r gets a 15.18% improvement for multicopy and a
> 40.78% improvement for single copy, with no measurable changes for other
> benchmarks.
>
> Cascade Lake: 538.imagick_r gets a 12.4% improvement for multicopy, with
> code size increased by 0.4% and no measurable changes for other benchmarks.
>
> Znver3 server: 538.imagick_r gets a 9.6% improvement for multicopy, with
> code size increased by 0.5% and no measurable changes for other benchmarks.

It's an interesting idea, though note that Honza knows IPA modref and
inlining better than I do.  What I doubt is that you can directly use IPA
modref info to determine whether inlining will likely elide a store/load
pair, since IIRC the modref info is for the whole function.
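
To make the store/load pair concrete, here is a minimal C sketch of the
pattern under discussion.  The names (struct pixel, sum_pixel_*, caller_*)
are made up for illustration and are not taken from 538.imagick_r or from
the attached patch; the noinline/always_inline attributes are only there to
force both outcomes side by side.

```c
/* Hypothetical illustration; not code from 538.imagick_r or the patch.  */
struct pixel { double r, g, b, a; };

/* Out-of-line callee: reads the caller's local aggregate through a
   pointer, so the caller must first store the fields to its stack.  */
__attribute__((noinline))
static double sum_pixel_outofline (const struct pixel *p)
{
  return p->r + p->g + p->b + p->a;   /* four loads */
}

/* Same body, but forced inline so the accesses become visible in the
   caller and can be scalarized there.  */
static inline __attribute__((always_inline))
double sum_pixel_inline (const struct pixel *p)
{
  return p->r + p->g + p->b + p->a;
}

double caller_call (double x)
{
  struct pixel px = { x, 2 * x, 3 * x, 4 * x };   /* four stores */
  return sum_pixel_outofline (&px);
}

double caller_inlined (double x)
{
  struct pixel px = { x, 2 * x, 3 * x, 4 * x };
  return sum_pixel_inline (&px);
}
```

Both callers compute the same value; the difference one would expect to
see in the generated code is that caller_inlined no longer needs px in
memory at all once the aggregate has been scalarized.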

IPA SRA might perform the kind of analysis that is confined to the
call context, and that might be available here already (and possibly
IPA SRA could even consider passing the stored/loaded values by value?)

But yes, having a stream of up to N (independent?) stores before each call
plus a stream of up to M (independent hoistable to function start?) loads
at each function start would make such analysis possible.
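
One way to picture such an analysis, as a rough sketch only: record the
stores before the call and the loads at callee entry as (parameter index,
byte offset, size) triples, then count how many load bytes are fully
covered by a store.  struct access, covered_load_bytes and
want_inline_hint are hypothetical names for this sketch, not GCC's IPA
data structures or the patch's implementation, and the threshold stands in
for whatever the target hook would supply.

```c
#include <stddef.h>

/* Hypothetical summary record; not a GCC data structure.  */
struct access
{
  int param;     /* index of the pointer argument accessed through */
  long offset;   /* byte offset from that pointer */
  long size;     /* access size in bytes */
};

/* Return nonzero if store S fully covers load L.  */
static int covers (const struct access *s, const struct access *l)
{
  return s->param == l->param
         && s->offset <= l->offset
         && l->offset + l->size <= s->offset + s->size;
}

/* Sum the sizes of the callee-entry loads that are fully covered by
   some caller store preceding the call.  Each load is counted once.  */
long covered_load_bytes (const struct access *stores, size_t n,
                         const struct access *loads, size_t m)
{
  long bytes = 0;
  for (size_t j = 0; j < m; j++)
    for (size_t i = 0; i < n; i++)
      if (covers (&stores[i], &loads[j]))
        {
          bytes += loads[j].size;
          break;
        }
  return bytes;
}

/* Hypothetical hint predicate: enable the hint when the covered size
   is greater than the target threshold.  */
int want_inline_hint (long covered_bytes, long threshold)
{
  return covered_bytes > threshold;
}
```

The sketch deliberately ignores partial overlaps and aliasing between
parameters; a real summary would need to be per call site, which is the
concern raised above about whole-function modref info.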