Date: Thu, 29 Aug 2019 18:18:00 -0000
From: Alexander Monakov
To: Maxim Kuvyrkov
Cc: Richard Guenther, gcc-patches@gcc.gnu.org, Wilco Dijkstra
Subject: Re: [PR91598] Improve autoprefetcher heuristic in haifa-sched.c

On Thu, 29 Aug 2019, Maxim Kuvyrkov wrote:

> >> r1 = [rb + 0]
> >>
> >> r2 = [rb + 8]
> >>
> >> r3 = [rb + 16]
> >>
> >> which, apparently, the cortex-a53 autoprefetcher doesn't recognize.
> >> This schedule happens because the r2= load gets lower priority than
> >> the "irrelevant" instructions due to the above patch.
> >>
> >> If we think about it, the fact that "r1 = [rb + 0]" can be scheduled
> >> means that the true dependencies of all similar base+offset loads
> >> are resolved. Therefore, for an autoprefetcher-friendly schedule we
> >> should prioritize memory reads over "irrelevant" instructions.
> > But isn't there also a max number of load issues in a fetch window
> > to consider? So interleaving arithmetic with loads might be
> > profitable.
>
> It appears that cores with autoprefetcher hardware prefer loads and
> stores bundled together, not interspersed with other instructions
> occupying the rest of the CPU units.

Let me point out that the motivating example has a bigger effect in play:

  (1) r1 = [rb + 0]
  (2)
  (3) r2 = [rb + 8]
  (4)
  (5) r3 = [rb + 16]
  (6)

Here Cortex-A53, being an in-order core, cannot issue the load at (3)
until the load at (1) has completed, because the use at (2) depends on
it. The good schedule allows the three loads to issue in a pipelined
fashion.

So essentially the main issue is not a hardware peculiarity, but rather
that the bad schedule is simply wrong: it could only make sense if loads
had 1-cycle latency, which they do not.

I think this highlights that implementing this autoprefetch heuristic
via the dfa_lookahead_guard interface looks questionable in the first
place, but the patch itself makes sense to me.

Alexander
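To make the latency argument above concrete, here is a toy cycle-count
sketch of a single-issue in-order pipeline. The latencies are made up
for illustration (3 cycles per load, 1 per ALU op; real Cortex-A53
numbers differ), and the register names and helper are hypothetical,
not taken from the patch:

```python
# Toy single-issue, in-order pipeline model. Latencies are assumptions
# (load = 3 cycles, ALU = 1 cycle); real Cortex-A53 numbers differ.
# Each instruction is (dest, list-of-source-regs, latency).

LOAD, ALU = 3, 1

def total_cycles(schedule):
    ready = {}  # register -> cycle at which its value becomes available
    t = 0       # earliest cycle the next instruction may issue
    for dest, srcs, lat in schedule:
        # in-order issue: stall until every source operand is ready
        t = max([t] + [ready[s] for s in srcs])
        ready[dest] = t + lat
        t += 1  # single-issue: at most one instruction per cycle
    return max([t] + list(ready.values()))

# bad schedule: each load is immediately followed by its use,
# so every load's full latency is exposed
bad = [("r1", [], LOAD), ("a", ["r1"], ALU),
       ("r2", [], LOAD), ("b", ["r2"], ALU),
       ("r3", [], LOAD), ("c", ["r3"], ALU)]

# good schedule: loads bundled first, so their latencies overlap
good = [("r1", [], LOAD), ("r2", [], LOAD), ("r3", [], LOAD),
        ("a", ["r1"], ALU), ("b", ["r2"], ALU), ("c", ["r3"], ALU)]

print(total_cycles(bad), total_cycles(good))  # prints "12 6"
```

Under these assumed latencies the interleaved schedule takes twice as
many cycles as the bundled one, purely from exposed load latency, with
no autoprefetcher modeled at all.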