public inbox for gcc-regression@sourceware.org
help / color / mirror / Atom feed
* Re: [TCWG CI] 453.povray slowed down by 3% after gcc: ira: Consider modelling caller-save allocations as loop spills
[not found] <1851691846.10927.1641946260901@jenkins.jenkins>
@ 2022-02-10 16:16 ` Maxim Kuvyrkov
0 siblings, 0 replies; only message in thread
From: Maxim Kuvyrkov @ 2022-02-10 16:16 UTC (permalink / raw)
To: Richard Sandiford; +Cc: gcc-regression
Hi Richard,
It seems your patch slows down 453.povray by 3%. It’s not a huge slow down, but take a look if you have a spare minute.
FWIW, it seems that some other patch has slowed down 453.povray by another 3%, but, unfortunately, our benchmarking CI was down for a week and we didn’t catch the culprit.
Thanks,
--
Maxim Kuvyrkov
https://www.linaro.org
> On 12 Jan 2022, at 03:11, ci_notify@linaro.org wrote:
>
> After gcc commit 01f3e6a40e7202310abbeb41c345d325bd69554f
> Author: Richard Sandiford <richard.sandiford@arm.com>
>
> ira: Consider modelling caller-save allocations as loop spills
>
> the following benchmarks slowed down by more than 2%:
> - 453.povray slowed down by 3% from 4710 to 4840 perf samples
>
> Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI.
>
> For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
> - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3_LTO/45/artifact/artifacts/build-01f3e6a40e7202310abbeb41c345d325bd69554f/save-temps/
> - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3_LTO/45/artifact/artifacts/build-8e7a23728f66d2da88b47e34224410457fdefbf5/save-temps/
> - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3_LTO/45/artifact/artifacts/build-baseline/save-temps/
>
> Configuration:
> - Benchmark: SPEC CPU2006
> - Toolchain: GCC + Glibc + GNU Linker
> - Version: all components were built from their tip of trunk
> - Target: arm-linux-gnueabihf
> - Compiler flags: -O3 -flto -marm
> - Hardware: NVidia TK1 4x Cortex-A15
>
> This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain@lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports.
>
> THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
>
> This commit has regressed these CI configurations:
> - tcwg_bmk_gnu_tk1/gnu-master-arm-spec2k6-O3_LTO
>
> First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3_LTO/45/artifact/artifacts/build-01f3e6a40e7202310abbeb41c345d325bd69554f/
> Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3_LTO/45/artifact/artifacts/build-8e7a23728f66d2da88b47e34224410457fdefbf5/
> Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3_LTO/45/artifact/artifacts/build-baseline/
> Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3_LTO/45/artifact/artifacts/
>
> Reproduce builds:
> <cut>
> mkdir investigate-gcc-01f3e6a40e7202310abbeb41c345d325bd69554f
> cd investigate-gcc-01f3e6a40e7202310abbeb41c345d325bd69554f
>
> # Fetch scripts
> git clone https://git.linaro.org/toolchain/jenkins-scripts
>
> # Fetch manifests and test.sh script
> mkdir -p artifacts/manifests
> curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3_LTO/45/artifact/artifacts/manifests/build-baseline.sh --fail
> curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3_LTO/45/artifact/artifacts/manifests/build-parameters.sh --fail
> curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-arm-spec2k6-O3_LTO/45/artifact/artifacts/test.sh --fail
> chmod +x artifacts/test.sh
>
> # Reproduce the baseline build (build all pre-requisites)
> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
>
> # Save baseline build state (which is then restored in artifacts/test.sh)
> mkdir -p ./bisect
> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
>
> cd gcc
>
> # Reproduce first_bad build
> git checkout --detach 01f3e6a40e7202310abbeb41c345d325bd69554f
> ../artifacts/test.sh
>
> # Reproduce last_good build
> git checkout --detach 8e7a23728f66d2da88b47e34224410457fdefbf5
> ../artifacts/test.sh
>
> cd ..
> </cut>
>
> Full commit (up to 1000 lines):
> <cut>
> commit 01f3e6a40e7202310abbeb41c345d325bd69554f
> Author: Richard Sandiford <richard.sandiford@arm.com>
> Date: Mon Jan 10 14:47:08 2022 +0000
>
> ira: Consider modelling caller-save allocations as loop spills
>
> If an allocno A in an inner loop L spans a call, a parent allocno AP
> can choose to handle a call-clobbered/caller-saved hard register R
> in one of two ways:
>
> (1) save R before each call in L and restore R after each call
> (2) spill R to memory throughout L
>
> (2) can be cheaper than (1) in some cases, particularly if L does
> not reference A.
>
> Before the patch we always did (1). The patch adds support for
> picking (2) instead, when it seems cheaper. It builds on the
> earlier support for not propagating conflicts to parent allocnos.
>
> gcc/
> PR rtl-optimization/98782
> * ira-int.h (ira_caller_save_cost): New function.
> (ira_caller_save_loop_spill_p): Likewise.
> * ira-build.c (ira_propagate_hard_reg_costs): Test whether it is
> cheaper to spill a call-clobbered register throughout a loop rather
> than spill it around each individual call. If so, treat all
> call-clobbered registers as conflicts and...
> (propagate_allocno_info): ...do not propagate call information
> from the child to the parent.
> * ira-color.c (move_spill_restore): Update accordingly.
> * ira-costs.c (ira_tune_allocno_costs): Use ira_caller_save_cost.
>
> gcc/testsuite/
> * gcc.target/aarch64/reg-alloc-3.c: New test.
> ---
> gcc/ira-build.c | 23 +++++----
> gcc/ira-color.c | 13 ++++--
> gcc/ira-costs.c | 7 +--
> gcc/ira-int.h | 39 ++++++++++++++++
> gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c | 65 ++++++++++++++++++++++++++
> 5 files changed, 129 insertions(+), 18 deletions(-)
>
> diff --git a/gcc/ira-build.c b/gcc/ira-build.c
> index 875b4d8ed7c..ab3e87164e1 100644
> --- a/gcc/ira-build.c
> +++ b/gcc/ira-build.c
> @@ -2000,6 +2000,8 @@ ira_propagate_hard_reg_costs (ira_allocno_t parent_a, ira_allocno_t a,
> int spill_cost)
> {
> HARD_REG_SET conflicts = ira_total_conflict_hard_regs (a);
> + if (ira_caller_save_loop_spill_p (parent_a, a, spill_cost))
> + conflicts |= ira_need_caller_save_regs (a);
> conflicts &= ~ira_total_conflict_hard_regs (parent_a);
>
> auto costs = ALLOCNO_HARD_REG_COSTS (a);
> @@ -2069,15 +2071,18 @@ propagate_allocno_info (void)
> if (!ira_subloop_allocnos_can_differ_p (parent_a))
> merge_hard_reg_conflicts (a, parent_a, true);
>
> - ALLOCNO_CALL_FREQ (parent_a) += ALLOCNO_CALL_FREQ (a);
> - ALLOCNO_CALLS_CROSSED_NUM (parent_a)
> - += ALLOCNO_CALLS_CROSSED_NUM (a);
> - ALLOCNO_CHEAP_CALLS_CROSSED_NUM (parent_a)
> - += ALLOCNO_CHEAP_CALLS_CROSSED_NUM (a);
> - ALLOCNO_CROSSED_CALLS_ABIS (parent_a)
> - |= ALLOCNO_CROSSED_CALLS_ABIS (a);
> - ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (parent_a)
> - |= ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (a);
> + if (!ira_caller_save_loop_spill_p (parent_a, a, spill_cost))
> + {
> + ALLOCNO_CALL_FREQ (parent_a) += ALLOCNO_CALL_FREQ (a);
> + ALLOCNO_CALLS_CROSSED_NUM (parent_a)
> + += ALLOCNO_CALLS_CROSSED_NUM (a);
> + ALLOCNO_CHEAP_CALLS_CROSSED_NUM (parent_a)
> + += ALLOCNO_CHEAP_CALLS_CROSSED_NUM (a);
> + ALLOCNO_CROSSED_CALLS_ABIS (parent_a)
> + |= ALLOCNO_CROSSED_CALLS_ABIS (a);
> + ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (parent_a)
> + |= ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (a);
> + }
> ALLOCNO_EXCESS_PRESSURE_POINTS_NUM (parent_a)
> += ALLOCNO_EXCESS_PRESSURE_POINTS_NUM (a);
> aclass = ALLOCNO_CLASS (a);
> diff --git a/gcc/ira-color.c b/gcc/ira-color.c
> index 4344ee6689e..1487afc5ef1 100644
> --- a/gcc/ira-color.c
> +++ b/gcc/ira-color.c
> @@ -3597,11 +3597,16 @@ move_spill_restore (void)
> propagate_allocno_info will have propagated
> the cost of spilling HARD_REGNO in SUBLOOP_NODE.
> (ira_subloop_allocnos_can_differ_p must be true
> - in that case.) Otherwise, SPILL_COST acted as
> - a cap on the propagated register cost, in cases
> - where the allocations can differ. */
> + in that case.) If HARD_REGNO is a caller-saved
> + register, we might have modelled it in the same way.
> +
> + Otherwise, SPILL_COST acted as a cap on the propagated
> + register cost, in cases where the allocations can differ. */
> auto conflicts = ira_total_conflict_hard_regs (subloop_allocno);
> - if (TEST_HARD_REG_BIT (conflicts, hard_regno))
> + if (TEST_HARD_REG_BIT (conflicts, hard_regno)
> + || (ira_need_caller_save_p (subloop_allocno, hard_regno)
> + && ira_caller_save_loop_spill_p (a, subloop_allocno,
> + spill_cost)))
> reg_cost = spill_cost;
> else if (ira_subloop_allocnos_can_differ_p (a))
> reg_cost = MIN (reg_cost, spill_cost);
> diff --git a/gcc/ira-costs.c b/gcc/ira-costs.c
> index 280befc58da..cbb58d32be8 100644
> --- a/gcc/ira-costs.c
> +++ b/gcc/ira-costs.c
> @@ -2308,7 +2308,7 @@ ira_tune_allocno_costs (void)
> {
> int j, n, regno;
> int cost, min_cost, *reg_costs;
> - enum reg_class aclass, rclass;
> + enum reg_class aclass;
> machine_mode mode;
> ira_allocno_t a;
> ira_allocno_iterator ai;
> @@ -2347,12 +2347,9 @@ ira_tune_allocno_costs (void)
> }
> if (skip_p)
> continue;
> - rclass = REGNO_REG_CLASS (regno);
> cost = 0;
> if (ira_need_caller_save_p (a, regno))
> - cost += (ALLOCNO_CALL_FREQ (a)
> - * (ira_memory_move_cost[mode][rclass][0]
> - + ira_memory_move_cost[mode][rclass][1]));
> + cost += ira_caller_save_cost (a);
> #ifdef IRA_HARD_REGNO_ADD_COST_MULTIPLIER
> cost += ((ira_memory_move_cost[mode][rclass][0]
> + ira_memory_move_cost[mode][rclass][1])
> diff --git a/gcc/ira-int.h b/gcc/ira-int.h
> index 8b87498f77f..a78811eb416 100644
> --- a/gcc/ira-int.h
> +++ b/gcc/ira-int.h
> @@ -1660,4 +1660,43 @@ ira_total_conflict_hard_regs (ira_allocno_t a)
> return conflicts;
> }
>
> +/* Return the cost of saving a caller-saved register before each call
> + in A's live range and restoring the same register after each call. */
> +inline int
> +ira_caller_save_cost (ira_allocno_t a)
> +{
> + auto mode = ALLOCNO_MODE (a);
> + auto rclass = ALLOCNO_CLASS (a);
> + return (ALLOCNO_CALL_FREQ (a)
> + * (ira_memory_move_cost[mode][rclass][0]
> + + ira_memory_move_cost[mode][rclass][1]));
> +}
> +
> +/* A and SUBLOOP_A are allocnos for the same pseudo register, with A's
> + loop immediately enclosing SUBLOOP_A's loop. If we allocate to A a
> + hard register R that is clobbered by a call in SUBLOOP_A, decide
> + which of the following approaches should be used for handling the
> + conflict:
> +
> + (1) Spill R on entry to SUBLOOP_A's loop, assign memory to SUBLOOP_A,
> + and restore R on exit from SUBLOOP_A's loop.
> +
> + (2) Spill R before each necessary call in SUBLOOP_A's live range and
> + restore R after each such call.
> +
> + Return true if (1) is better than (2). SPILL_COST is the cost of
> + doing (1). */
> +inline bool
> +ira_caller_save_loop_spill_p (ira_allocno_t a, ira_allocno_t subloop_a,
> + int spill_cost)
> +{
> + if (!ira_subloop_allocnos_can_differ_p (a))
> + return false;
> +
> + /* Calculate the cost of saving a call-clobbered register
> + before each call and restoring it afterwards. */
> + int call_cost = ira_caller_save_cost (subloop_a);
> + return call_cost && call_cost >= spill_cost;
> +}
> +
> #endif /* GCC_IRA_INT_H */
> diff --git a/gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c b/gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c
> new file mode 100644
> index 00000000000..7acdc432b0c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c
> @@ -0,0 +1,65 @@
> +/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
> +/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
> +
> +#define PROB 0.1
> +
> +struct L
> +{
> + int data;
> + volatile struct L *next;
> + volatile struct L *inner;
> +};
> +
> +void ext();
> +
> +/* The thing we're testing here is that the !head->inner path of the outer loop
> + body has no stack accesses. It's possible that we'll need to update this
> + pattern for unrelated code changes. but the test should be XFAILed rather
> + than changed if any new stack accesses creep into the !head->inner path. */
> +/*
> +** foo:
> +** ...
> +** ldr (w[0-9]+), \[(x[0-9]+)\]
> +** add (w[0-9]+), (?:\3, \1|\1, \3)
> +** ldr (x[0-9]+), \[\2, #?16\]
> +** str \3, \[\2\]
> +** ldr \2, \[\2, #?8\]
> +** cbn?z \4, .*
> +** ...
> +** ret
> +*/
> +void
> +foo (volatile struct L *head, int inc, double *ptr)
> +{
> + double d = *ptr;
> + while (head)
> + {
> + /* Clobber all call-preserved GPRs, so that the loop has to use
> + call-clobbered GPRs if it is to avoid spilling. */
> + asm volatile ("" :::
> + "x19", "x20", "x21", "x22", "x23",
> + "x24", "x25", "x26", "x27", "x28");
> + inc = head->data + inc;
> + volatile struct L *inner = head->inner;
> + head->data = inc;
> + head = head->next;
> + if (__builtin_expect_with_probability (inner != 0, 0, PROB))
> + for (int i = 0; i < 1000; ++i)
> + {
> + ext ();
> + /* Hack to create high register pressure, so that IRA doesn't
> + collapse this loop into the parent loop. */
> + d += 1;
> + asm volatile ("// foo" :::
> + "d0", "d1", "d2", "d3",
> + "d4", "d5", "d6", "d7",
> + "d8", "d9", "d10", "d11",
> + "d12", "d13", "d14", "d15",
> + "d16", "d17", "d18", "d19",
> + "d20", "d21", "d22", "d23",
> + "d24", "d25", "d26", "d27",
> + "d28", "d29", "d30", "d31");
> + }
> + }
> + *ptr = d;
> +}
> </cut>
^ permalink raw reply [flat|nested] only message in thread
only message in thread, other threads:[~2022-02-10 16:16 UTC | newest]
Thread overview: (only message) (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <1851691846.10927.1641946260901@jenkins.jenkins>
2022-02-10 16:16 ` [TCWG CI] 453.povray slowed down by 3% after gcc: ira: Consider modelling caller-save allocations as loop spills Maxim Kuvyrkov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).