* [PATCH][AArch64] Model Cortex-A53 load forwarding
@ 2017-04-05 12:29 Wilco Dijkstra
2017-04-20 15:59 ` Wilco Dijkstra
2017-05-04 10:40 ` Richard Earnshaw (lists)
0 siblings, 2 replies; 3+ messages in thread
From: Wilco Dijkstra @ 2017-04-05 12:29 UTC (permalink / raw)
To: GCC Patches; +Cc: nd, James Greenhalgh
Code scheduling for Cortex-A53 isn't as good as it could be. It turns out
code runs faster overall if we place loads and stores with a dependency
closer together. To achieve this effect, this patch adds a bypass between
cortex_a53_load1 and cortex_a53_load*/cortex_a53_store* if the result of an
earlier load is used in an address calculation. This significantly improved
benchmark scores in a proprietary benchmark suite.
Passes AArch64 bootstrap and regress. OK for stage 1?
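[Editorial illustration, not part of the patch.] The kind of dependency described above typically shows up in pointer-chasing code. The C sketch below is an assumption about what such source code looks like; the names sum_list and store_through are hypothetical and do not appear in GCC or in the benchmark suite mentioned.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
    long value;
};

/* Sum a linked list.  The load of p->next produces a pointer that the
   next iteration uses as the base address of its own loads, i.e. a
   load whose result feeds the address of a later load.  */
static long sum_list(const struct node *p)
{
    long sum = 0;
    while (p != NULL) {
        sum += p->value;   /* address depends on the previously loaded p */
        p = p->next;       /* pointer-sized load feeding the next address */
    }
    return sum;
}

/* Store through a pointer that was itself just loaded: the load of
   *slot feeds the address of the following store.  */
static void store_through(long **slot, long v)
{
    assert(slot != NULL && *slot != NULL);
    long *dst = *slot;     /* load producing a pointer */
    *dst = v;              /* store whose address uses that pointer */
}
```

Whether a compiler actually emits back-to-back dependent loads and stores for these functions depends on optimization level and register allocation; the sketch only shows the source-level shape of the dependency.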
ChangeLog:
2017-04-05 Wilco Dijkstra <wdijkstr@arm.com>
* config/arm/aarch-common.c (arm_early_load_addr_dep_ptr):
New function.
(arm_early_store_addr_dep_ptr): Likewise.
* config/arm/aarch-common-protos.h
(arm_early_load_addr_dep_ptr): Add prototype.
(arm_early_store_addr_dep_ptr): Likewise.
* config/arm/cortex-a53.md: Add new bypasses.
---
diff --git a/gcc/config/arm/aarch-common-protos.h b/gcc/config/arm/aarch-common-protos.h
index 8e9fb7a895b0a4aaf1585eb3368443899b061c9b..5298172e6b6930a110388a40a7533ff208a87095 100644
--- a/gcc/config/arm/aarch-common-protos.h
+++ b/gcc/config/arm/aarch-common-protos.h
@@ -30,7 +30,9 @@ extern bool aarch_rev16_p (rtx);
extern bool aarch_rev16_shleft_mask_imm_p (rtx, machine_mode);
extern bool aarch_rev16_shright_mask_imm_p (rtx, machine_mode);
extern int arm_early_load_addr_dep (rtx, rtx);
+extern int arm_early_load_addr_dep_ptr (rtx, rtx);
extern int arm_early_store_addr_dep (rtx, rtx);
+extern int arm_early_store_addr_dep_ptr (rtx, rtx);
extern int arm_mac_accumulator_is_mul_result (rtx, rtx);
extern int arm_mac_accumulator_is_result (rtx, rtx);
extern int arm_no_early_alu_shift_dep (rtx, rtx);
diff --git a/gcc/config/arm/aarch-common.c b/gcc/config/arm/aarch-common.c
index dd37be0291a633f606d95ec8acacc598435828b3..74b80b272550028919c4274387944867ffed43d1 100644
--- a/gcc/config/arm/aarch-common.c
+++ b/gcc/config/arm/aarch-common.c
@@ -241,6 +241,24 @@ arm_early_load_addr_dep (rtx producer, rtx consumer)
   return reg_overlap_mentioned_p (value, addr);
 }
+/* Return nonzero if the CONSUMER instruction (a load) does need
+   a Pmode PRODUCER's value to calculate the address.  */
+
+int
+arm_early_load_addr_dep_ptr (rtx producer, rtx consumer)
+{
+  rtx value = arm_find_sub_rtx_with_code (PATTERN (producer), SET, false);
+  rtx addr = arm_find_sub_rtx_with_code (PATTERN (consumer), SET, false);
+
+  if (!value || !addr || !MEM_P (SET_SRC (value)))
+    return 0;
+
+  value = SET_DEST (value);
+  addr = SET_SRC (addr);
+
+  return GET_MODE (value) == Pmode && reg_overlap_mentioned_p (value, addr);
+}
+
 /* Return nonzero if the CONSUMER instruction (an ALU op) does not
    have an early register shift value or amount dependency on the
    result of PRODUCER.  */
@@ -336,6 +354,24 @@ arm_early_store_addr_dep (rtx producer, rtx consumer)
   return !arm_no_early_store_addr_dep (producer, consumer);
 }
+/* Return nonzero if the CONSUMER instruction (a store) does need
+   a Pmode PRODUCER's value to calculate the address.  */
+
+int
+arm_early_store_addr_dep_ptr (rtx producer, rtx consumer)
+{
+  rtx value = arm_find_sub_rtx_with_code (PATTERN (producer), SET, false);
+  rtx addr = arm_find_sub_rtx_with_code (PATTERN (consumer), SET, false);
+
+  if (!value || !addr || !MEM_P (SET_SRC (value)))
+    return 0;
+
+  value = SET_DEST (value);
+  addr = SET_DEST (addr);
+
+  return GET_MODE (value) == Pmode && reg_overlap_mentioned_p (value, addr);
+}
+
 /* Return non-zero iff the consumer (a multiply-accumulate or a
    multiple-subtract instruction) has an accumulator dependency on the
    result of the producer and no other dependency on that result.  It
diff --git a/gcc/config/arm/cortex-a53.md b/gcc/config/arm/cortex-a53.md
index b367ad403a4a641da34521c17669027b87092737..f8225f33c7a06485147b30fe2633309ac252d0c7 100644
--- a/gcc/config/arm/cortex-a53.md
+++ b/gcc/config/arm/cortex-a53.md
@@ -246,6 +246,16 @@
"cortex_a53_store*"
"arm_no_early_store_addr_dep")
+;; Model a bypass for load to load/store address.
+
+(define_bypass 3 "cortex_a53_load1"
+ "cortex_a53_load*"
+ "arm_early_load_addr_dep_ptr")
+
+(define_bypass 3 "cortex_a53_load1"
+ "cortex_a53_store*"
+ "arm_early_store_addr_dep_ptr")
+
;; Model a GP->FP register move as similar to stores.
(define_bypass 0 "cortex_a53_alu*,cortex_a53_shift*"
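
[Editorial illustration, not part of the patch.] The two new predicates apply the same test: the producer must be a load, its destination must have pointer mode, and the consumer's memory address must mention that destination. The C sketch below models this on a deliberately simplified instruction record. Every name in it (toy_insn, toy_mode, TOY_PMODE, toy_early_addr_dep_ptr, toy_latency) is hypothetical and is not a GCC type or function. It collapses reg_overlap_mentioned_p into a single base-register comparison, and it assumes a 64-bit pointer mode as on LP64 AArch64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for GCC's RTL; not GCC types.  */
enum toy_mode { TOY_SI, TOY_DI };   /* 32-bit and 64-bit integer modes */
#define TOY_PMODE TOY_DI            /* assumed pointer mode (LP64) */

struct toy_insn {
    bool is_load;             /* true if the SET source is a memory ref */
    int dest_reg;             /* register written, or -1 for a store */
    enum toy_mode dest_mode;  /* mode of the written register */
    int addr_reg;             /* base register of the address, or -1 */
};

/* Nonzero-style predicate: PRODUCER is a load of a pointer-mode value
   and CONSUMER uses that register to form its memory address.  */
static bool toy_early_addr_dep_ptr(const struct toy_insn *producer,
                                   const struct toy_insn *consumer)
{
    assert(producer != NULL && consumer != NULL);
    if (!producer->is_load || producer->dest_reg < 0)
        return false;
    return producer->dest_mode == TOY_PMODE
           && consumer->addr_reg == producer->dest_reg;
}

/* Latency a scheduler would assume between the two instructions: the
   bypass value 3 from the define_bypass entries when the predicate
   holds, otherwise a caller-supplied default (hypothetical here).  */
static int toy_latency(const struct toy_insn *producer,
                       const struct toy_insn *consumer,
                       int default_latency)
{
    return toy_early_addr_dep_ptr(producer, consumer) ? 3 : default_latency;
}
```

In the real predicates the only difference between the load and store variants is where the consumer's address is found: SET_SRC for a load consumer and SET_DEST for a store consumer. The model hides that behind the single addr_reg field.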
* Re: [PATCH][AArch64] Model Cortex-A53 load forwarding
From: Wilco Dijkstra @ 2017-04-20 15:59 UTC (permalink / raw)
To: GCC Patches, James Greenhalgh; +Cc: nd
ping
* Re: [PATCH][AArch64] Model Cortex-A53 load forwarding
From: Richard Earnshaw (lists) @ 2017-05-04 10:40 UTC (permalink / raw)
To: Wilco Dijkstra, GCC Patches; +Cc: nd, James Greenhalgh
On 05/04/17 13:29, Wilco Dijkstra wrote:
> Code scheduling for Cortex-A53 isn't as good as it could be. It turns out
> code runs faster overall if we place loads and stores with a dependency
> closer together. To achieve this effect, this patch adds a bypass between
> cortex_a53_load1 and cortex_a53_load*/cortex_a53_store* if the result of an
> earlier load is used in an address calculation. This significantly improved
> benchmark scores in a proprietary benchmark suite.
>
> Passes AArch64 bootstrap and regress. OK for stage 1?
>
What about an ARM bootstrap? OK if that also passes.
R.