Subject: Re: [PATCH][AArch64] Model Cortex-A53 load forwarding
To: Wilco Dijkstra, GCC Patches
Cc: nd, James Greenhalgh
From: "Richard Earnshaw (lists)"
Message-ID: <43c07c0f-11ed-3001-370d-fb5884e3207d@arm.com>
Date: Thu, 04 May 2017 10:40:00 -0000

On 05/04/17 13:29, Wilco Dijkstra wrote:
> Code scheduling for Cortex-A53 isn't as good as it could be. It turns out
> code runs faster overall if we place loads and stores with a dependency
> closer together. To achieve this effect, this patch adds a bypass between
> cortex_a53_load1 and cortex_a53_load*/cortex_a53_store* if the result of an
> earlier load is used in an address calculation. This significantly improved
> benchmark scores in a proprietary benchmark suite.
>
> Passes AArch64 bootstrap and regress. OK for stage 1?
>

What about an ARM bootstrap?

OK if that also passes.

R.

> ChangeLog:
> 2017-04-05  Wilco Dijkstra
>
>     * config/arm/aarch-common.c (arm_early_load_addr_dep_ptr):
>     New function.
>     (arm_early_store_addr_dep_ptr): Likewise.
>     * config/arm/aarch-common-protos.h
>     (arm_early_load_addr_dep_ptr): Add prototype.
>     (arm_early_store_addr_dep_ptr): Likewise.
>     * config/arm/cortex-a53.md: Add new bypasses.
> ---
>
> diff --git a/gcc/config/arm/aarch-common-protos.h b/gcc/config/arm/aarch-common-protos.h
> index 8e9fb7a895b0a4aaf1585eb3368443899b061c9b..5298172e6b6930a110388a40a7533ff208a87095 100644
> --- a/gcc/config/arm/aarch-common-protos.h
> +++ b/gcc/config/arm/aarch-common-protos.h
> @@ -30,7 +30,9 @@ extern bool aarch_rev16_p (rtx);
>  extern bool aarch_rev16_shleft_mask_imm_p (rtx, machine_mode);
>  extern bool aarch_rev16_shright_mask_imm_p (rtx, machine_mode);
>  extern int arm_early_load_addr_dep (rtx, rtx);
> +extern int arm_early_load_addr_dep_ptr (rtx, rtx);
>  extern int arm_early_store_addr_dep (rtx, rtx);
> +extern int arm_early_store_addr_dep_ptr (rtx, rtx);
>  extern int arm_mac_accumulator_is_mul_result (rtx, rtx);
>  extern int arm_mac_accumulator_is_result (rtx, rtx);
>  extern int arm_no_early_alu_shift_dep (rtx, rtx);
> diff --git a/gcc/config/arm/aarch-common.c b/gcc/config/arm/aarch-common.c
> index dd37be0291a633f606d95ec8acacc598435828b3..74b80b272550028919c4274387944867ffed43d1 100644
> --- a/gcc/config/arm/aarch-common.c
> +++ b/gcc/config/arm/aarch-common.c
> @@ -241,6 +241,24 @@ arm_early_load_addr_dep (rtx producer, rtx consumer)
>    return reg_overlap_mentioned_p (value, addr);
>  }
>
> +/* Return nonzero if the CONSUMER instruction (a load) does need
> +   a Pmode PRODUCER's value to calculate the address.  */
> +
> +int
> +arm_early_load_addr_dep_ptr (rtx producer, rtx consumer)
> +{
> +  rtx value = arm_find_sub_rtx_with_code (PATTERN (producer), SET, false);
> +  rtx addr = arm_find_sub_rtx_with_code (PATTERN (consumer), SET, false);
> +
> +  if (!value || !addr || !MEM_P (SET_SRC (value)))
> +    return 0;
> +
> +  value = SET_DEST (value);
> +  addr = SET_SRC (addr);
> +
> +  return GET_MODE (value) == Pmode && reg_overlap_mentioned_p (value, addr);
> +}
> +
>  /* Return nonzero if the CONSUMER instruction (an ALU op) does not
>     have an early register shift value or amount dependency on the
>     result of PRODUCER.  */
> @@ -336,6 +354,24 @@ arm_early_store_addr_dep (rtx producer, rtx consumer)
>    return !arm_no_early_store_addr_dep (producer, consumer);
>  }
>
> +/* Return nonzero if the CONSUMER instruction (a store) does need
> +   a Pmode PRODUCER's value to calculate the address.  */
> +
> +int
> +arm_early_store_addr_dep_ptr (rtx producer, rtx consumer)
> +{
> +  rtx value = arm_find_sub_rtx_with_code (PATTERN (producer), SET, false);
> +  rtx addr = arm_find_sub_rtx_with_code (PATTERN (consumer), SET, false);
> +
> +  if (!value || !addr || !MEM_P (SET_SRC (value)))
> +    return 0;
> +
> +  value = SET_DEST (value);
> +  addr = SET_DEST (addr);
> +
> +  return GET_MODE (value) == Pmode && reg_overlap_mentioned_p (value, addr);
> +}
> +
>  /* Return non-zero iff the consumer (a multiply-accumulate or a
>     multiple-subtract instruction) has an accumulator dependency on the
>     result of the producer and no other dependency on that result.  It
> diff --git a/gcc/config/arm/cortex-a53.md b/gcc/config/arm/cortex-a53.md
> index b367ad403a4a641da34521c17669027b87092737..f8225f33c7a06485147b30fe2633309ac252d0c7 100644
> --- a/gcc/config/arm/cortex-a53.md
> +++ b/gcc/config/arm/cortex-a53.md
> @@ -246,6 +246,16 @@
>                   "cortex_a53_store*"
>                   "arm_no_early_store_addr_dep")
>
> +;; Model a bypass for load to load/store address.
> +
> +(define_bypass 3 "cortex_a53_load1"
> +               "cortex_a53_load*"
> +               "arm_early_load_addr_dep_ptr")
> +
> +(define_bypass 3 "cortex_a53_load1"
> +               "cortex_a53_store*"
> +               "arm_early_store_addr_dep_ptr")
> +
>  ;; Model a GP->FP register move as similar to stores.
>
>  (define_bypass 0 "cortex_a53_alu*,cortex_a53_shift*"
>
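
For readers following the guard logic in the quoted patch, here is a minimal standalone sketch of the condition the two new predicates test. It is not GCC code: struct toy_insn, enum toy_mode, TOY_PMODE and toy_early_addr_dep_ptr are hypothetical names invented for illustration, and the model assumes 64-bit pointers. It folds the load and store variants into one function, since in the patch they differ only in whether the consumer's address is taken from SET_SRC or SET_DEST.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an RTL insn; this is NOT GCC's rtx.  */
enum toy_mode { TOY_SImode, TOY_DImode };

/* Assumption: pointers are 64-bit, so the pointer mode is DImode.  */
#define TOY_PMODE TOY_DImode
#define TOY_MAX_ADDR_REGS 2

struct toy_insn
{
  bool src_is_mem;                   /* True when the insn reads memory (a load).  */
  int dest_reg;                      /* Register written, or -1 if none.  */
  enum toy_mode dest_mode;           /* Mode of the value written.  */
  int addr_regs[TOY_MAX_ADDR_REGS];  /* Registers used to form the memory address.  */
  size_t n_addr_regs;                /* Number of valid entries in addr_regs.  */
};

/* Return nonzero if PRODUCER is a load that writes a pointer-mode register
   which CONSUMER then uses to compute its memory address.  This mirrors the
   shape of the checks in the quoted predicates: producer source is a MEM,
   destination mode is Pmode, and the destination overlaps the address.  */
int
toy_early_addr_dep_ptr (const struct toy_insn *producer,
                        const struct toy_insn *consumer)
{
  assert (producer != NULL && consumer != NULL);
  assert (consumer->n_addr_regs <= TOY_MAX_ADDR_REGS);

  if (!producer->src_is_mem || producer->dest_reg < 0)
    return 0;

  if (producer->dest_mode != TOY_PMODE)
    return 0;

  for (size_t i = 0; i < consumer->n_addr_regs; i++)
    if (consumer->addr_regs[i] == producer->dest_reg)
      return 1;

  return 0;
}
```

In this toy model, a 64-bit load into register 1 followed by a load or store addressed through register 1 satisfies the guard, while a 32-bit load into the same register, or an ALU producer, does not. In the machine description, the first operand of define_bypass is the latency the scheduler uses when the guard returns nonzero, which is 3 for both new bypasses.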