From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lj1-x22d.google.com (mail-lj1-x22d.google.com [IPv6:2a00:1450:4864:20::22d]) by sourceware.org (Postfix) with ESMTPS id 497573858C20 for ; Thu, 10 Feb 2022 16:16:19 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 497573858C20 Received: by mail-lj1-x22d.google.com with SMTP id j14so8753300lja.3 for ; Thu, 10 Feb 2022 08:16:19 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=ljLPN/4MgJ1aMv/aD8x62tbByQUSRmR60gQE21HYpkY=; b=J4k8hW5TXJ6eajm5zvYM9iTBn36SIYuli6DABgIbbNI6JIad6gH+g3JuUhEPUt6NQh G3C0le0FK2ScX/8sqzfJoih2S3MrrBOzb+o4I4nTkYE/9jXfUaO9M8U629GOUj1HI0U9 wnKBgINo/9W+zZBgDS0hbAIckPDvmKgjIieMImoq0HK3dN4mm6ZGagsgMEqLSSuuHVKq p3PtFNY68GFhsGthVDbXlLvlwXH+TXfhNKjwdl1ybTDm7SvyoIPrBxWDiWWp7ZvQmERk G67kT+jkPqTz9LPp4hPAmwnPoplEBD754n1wuGYDQKUygDeUgmYB7kFMzOag573uuSvF V9AA== X-Gm-Message-State: AOAM532QwKvi0QPVFSbZmynEGT9RBQ+kaohubnI7cP51Y4tsX2+N2J3X +dzkekjfK6wMMkEpqtOSU0Ft X-Google-Smtp-Source: ABdhPJxcNIoV/aqvn2h9ej82cq+u0/hMCSw4Cs1/UbUGRlAX1XqEAu7fXSFqvDGhMA7U6kWapfqM8w== X-Received: by 2002:a2e:91d9:: with SMTP id u25mr5241616ljg.41.1644509777147; Thu, 10 Feb 2022 08:16:17 -0800 (PST) Received: from smtpclient.apple ([185.30.228.158]) by smtp.gmail.com with ESMTPSA id o10sm2837210lfl.116.2022.02.10.08.16.16 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Thu, 10 Feb 2022 08:16:16 -0800 (PST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.13\)) Subject: Re: [TCWG CI] 453.povray slowed down by 3% after gcc: ira: Consider modelling caller-save allocations as loop spills From: Maxim Kuvyrkov In-Reply-To: <1851691846.10927.1641946260901@jenkins.jenkins> Date: Thu, 10 Feb 2022 19:16:15 +0300 Cc: gcc-regression@gcc.gnu.org Content-Transfer-Encoding: quoted-printable Message-Id: References: <1851691846.10927.1641946260901@jenkins.jenkins> To: Richard Sandiford X-Mailer: Apple Mail (2.3654.120.0.1.13) X-Spam-Status: No, score=-11.6 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_LOTSOFHASH, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-regression@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-regression mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 10 Feb 2022 16:16:22 -0000 Hi Richard, It seems your patch slows down 453.povray by 3%. It=E2=80=99s not a = huge slow down, but take a look if you have a spare minute. FWIW, it seems that some other patch has slowed down 453.povray by = another 3%, but, unfortunately, our benchmarking CI was down for a week = and we didn=E2=80=99t catch the culprit. Thanks, -- Maxim Kuvyrkov https://www.linaro.org > On 12 Jan 2022, at 03:11, ci_notify@linaro.org wrote: >=20 > After gcc commit 01f3e6a40e7202310abbeb41c345d325bd69554f > Author: Richard Sandiford >=20 > ira: Consider modelling caller-save allocations as loop spills >=20 > the following benchmarks slowed down by more than 2%: > - 453.povray slowed down by 3% from 4710 to 4840 perf samples >=20 > Below reproducer instructions can be used to re-build both "first_bad" = and "last_good" cross-toolchains used in this bisection. Naturally, the = scripts will fail when triggerring benchmarking jobs if you don't have = access to Linaro TCWG CI. >=20 > For your convenience, we have uploaded tarballs with pre-processed = source and assembly files at: > - First_bad save-temps: = https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a= rm-spec2k6-O3_LTO/45/artifact/artifacts/build-01f3e6a40e7202310abbeb41c345= d325bd69554f/save-temps/ > - Last_good save-temps: = https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a= rm-spec2k6-O3_LTO/45/artifact/artifacts/build-8e7a23728f66d2da88b47e342244= 10457fdefbf5/save-temps/ > - Baseline save-temps: = https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a= rm-spec2k6-O3_LTO/45/artifact/artifacts/build-baseline/save-temps/ >=20 > Configuration: > - Benchmark: SPEC CPU2006 > - Toolchain: GCC + Glibc + GNU Linker > - Version: all components were built from their tip of trunk > - Target: arm-linux-gnueabihf > - Compiler flags: -O3 -flto -marm > - Hardware: NVidia TK1 4x Cortex-A15 >=20 > This benchmarking CI is work-in-progress, and we welcome feedback and = suggestions at linaro-toolchain@lists.linaro.org . In our improvement = plans is to add support for SPEC CPU2017 benchmarks and provide "perf = report/annotate" data behind these reports. >=20 > THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, = REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. >=20 > This commit has regressed these CI configurations: > - tcwg_bmk_gnu_tk1/gnu-master-arm-spec2k6-O3_LTO >=20 > First_bad build: = https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a= rm-spec2k6-O3_LTO/45/artifact/artifacts/build-01f3e6a40e7202310abbeb41c345= d325bd69554f/ > Last_good build: = https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a= rm-spec2k6-O3_LTO/45/artifact/artifacts/build-8e7a23728f66d2da88b47e342244= 10457fdefbf5/ > Baseline build: = https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a= rm-spec2k6-O3_LTO/45/artifact/artifacts/build-baseline/ > Even more details: = https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a= rm-spec2k6-O3_LTO/45/artifact/artifacts/ >=20 > Reproduce builds: > > mkdir investigate-gcc-01f3e6a40e7202310abbeb41c345d325bd69554f > cd investigate-gcc-01f3e6a40e7202310abbeb41c345d325bd69554f >=20 > # Fetch scripts > git clone https://git.linaro.org/toolchain/jenkins-scripts >=20 > # Fetch manifests and test.sh script > mkdir -p artifacts/manifests > curl -o artifacts/manifests/build-baseline.sh = https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a= rm-spec2k6-O3_LTO/45/artifact/artifacts/manifests/build-baseline.sh = --fail > curl -o artifacts/manifests/build-parameters.sh = https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a= rm-spec2k6-O3_LTO/45/artifact/artifacts/manifests/build-parameters.sh = --fail > curl -o artifacts/test.sh = https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a= rm-spec2k6-O3_LTO/45/artifact/artifacts/test.sh --fail > chmod +x artifacts/test.sh >=20 > # Reproduce the baseline build (build all pre-requisites) > ./jenkins-scripts/tcwg_bmk-build.sh @@ = artifacts/manifests/build-baseline.sh >=20 > # Save baseline build state (which is then restored in = artifacts/test.sh) > mkdir -p ./bisect > rsync -a --del --delete-excluded --exclude /bisect/ --exclude = /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/ >=20 > cd gcc >=20 > # Reproduce first_bad build > git checkout --detach 01f3e6a40e7202310abbeb41c345d325bd69554f > ../artifacts/test.sh >=20 > # Reproduce last_good build > git checkout --detach 8e7a23728f66d2da88b47e34224410457fdefbf5 > ../artifacts/test.sh >=20 > cd .. > >=20 > Full commit (up to 1000 lines): > > commit 01f3e6a40e7202310abbeb41c345d325bd69554f > Author: Richard Sandiford > Date: Mon Jan 10 14:47:08 2022 +0000 >=20 > ira: Consider modelling caller-save allocations as loop spills >=20 > If an allocno A in an inner loop L spans a call, a parent allocno = AP > can choose to handle a call-clobbered/caller-saved hard register R > in one of two ways: >=20 > (1) save R before each call in L and restore R after each call > (2) spill R to memory throughout L >=20 > (2) can be cheaper than (1) in some cases, particularly if L does > not reference A. >=20 > Before the patch we always did (1). The patch adds support for > picking (2) instead, when it seems cheaper. It builds on the > earlier support for not propagating conflicts to parent allocnos. >=20 > gcc/ > PR rtl-optimization/98782 > * ira-int.h (ira_caller_save_cost): New function. > (ira_caller_save_loop_spill_p): Likewise. > * ira-build.c (ira_propagate_hard_reg_costs): Test whether = it is > cheaper to spill a call-clobbered register throughout a = loop rather > than spill it around each individual call. If so, treat = all > call-clobbered registers as conflicts and... > (propagate_allocno_info): ...do not propagate call = information > from the child to the parent. > * ira-color.c (move_spill_restore): Update accordingly. > * ira-costs.c (ira_tune_allocno_costs): Use = ira_caller_save_cost. >=20 > gcc/testsuite/ > * gcc.target/aarch64/reg-alloc-3.c: New test. > --- > gcc/ira-build.c | 23 +++++---- > gcc/ira-color.c | 13 ++++-- > gcc/ira-costs.c | 7 +-- > gcc/ira-int.h | 39 ++++++++++++++++ > gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c | 65 = ++++++++++++++++++++++++++ > 5 files changed, 129 insertions(+), 18 deletions(-) >=20 > diff --git a/gcc/ira-build.c b/gcc/ira-build.c > index 875b4d8ed7c..ab3e87164e1 100644 > --- a/gcc/ira-build.c > +++ b/gcc/ira-build.c > @@ -2000,6 +2000,8 @@ ira_propagate_hard_reg_costs (ira_allocno_t = parent_a, ira_allocno_t a, > int spill_cost) > { > HARD_REG_SET conflicts =3D ira_total_conflict_hard_regs (a); > + if (ira_caller_save_loop_spill_p (parent_a, a, spill_cost)) > + conflicts |=3D ira_need_caller_save_regs (a); > conflicts &=3D ~ira_total_conflict_hard_regs (parent_a); >=20 > auto costs =3D ALLOCNO_HARD_REG_COSTS (a); > @@ -2069,15 +2071,18 @@ propagate_allocno_info (void) > if (!ira_subloop_allocnos_can_differ_p (parent_a)) > merge_hard_reg_conflicts (a, parent_a, true); >=20 > - ALLOCNO_CALL_FREQ (parent_a) +=3D ALLOCNO_CALL_FREQ (a); > - ALLOCNO_CALLS_CROSSED_NUM (parent_a) > - +=3D ALLOCNO_CALLS_CROSSED_NUM (a); > - ALLOCNO_CHEAP_CALLS_CROSSED_NUM (parent_a) > - +=3D ALLOCNO_CHEAP_CALLS_CROSSED_NUM (a); > - ALLOCNO_CROSSED_CALLS_ABIS (parent_a) > - |=3D ALLOCNO_CROSSED_CALLS_ABIS (a); > - ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (parent_a) > - |=3D ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (a); > + if (!ira_caller_save_loop_spill_p (parent_a, a, spill_cost)) > + { > + ALLOCNO_CALL_FREQ (parent_a) +=3D ALLOCNO_CALL_FREQ (a); > + ALLOCNO_CALLS_CROSSED_NUM (parent_a) > + +=3D ALLOCNO_CALLS_CROSSED_NUM (a); > + ALLOCNO_CHEAP_CALLS_CROSSED_NUM (parent_a) > + +=3D ALLOCNO_CHEAP_CALLS_CROSSED_NUM (a); > + ALLOCNO_CROSSED_CALLS_ABIS (parent_a) > + |=3D ALLOCNO_CROSSED_CALLS_ABIS (a); > + ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (parent_a) > + |=3D ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (a); > + } > ALLOCNO_EXCESS_PRESSURE_POINTS_NUM (parent_a) > +=3D ALLOCNO_EXCESS_PRESSURE_POINTS_NUM (a); > aclass =3D ALLOCNO_CLASS (a); > diff --git a/gcc/ira-color.c b/gcc/ira-color.c > index 4344ee6689e..1487afc5ef1 100644 > --- a/gcc/ira-color.c > +++ b/gcc/ira-color.c > @@ -3597,11 +3597,16 @@ move_spill_restore (void) > propagate_allocno_info will have propagated > the cost of spilling HARD_REGNO in SUBLOOP_NODE. > (ira_subloop_allocnos_can_differ_p must be true > - in that case.) Otherwise, SPILL_COST acted as > - a cap on the propagated register cost, in cases > - where the allocations can differ. */ > + in that case.) If HARD_REGNO is a caller-saved > + register, we might have modelled it in the same way. > + > + Otherwise, SPILL_COST acted as a cap on the propagated > + register cost, in cases where the allocations can = differ. */ > auto conflicts =3D ira_total_conflict_hard_regs = (subloop_allocno); > - if (TEST_HARD_REG_BIT (conflicts, hard_regno)) > + if (TEST_HARD_REG_BIT (conflicts, hard_regno) > + || (ira_need_caller_save_p (subloop_allocno, = hard_regno) > + && ira_caller_save_loop_spill_p (a, = subloop_allocno, > + spill_cost))) > reg_cost =3D spill_cost; > else if (ira_subloop_allocnos_can_differ_p (a)) > reg_cost =3D MIN (reg_cost, spill_cost); > diff --git a/gcc/ira-costs.c b/gcc/ira-costs.c > index 280befc58da..cbb58d32be8 100644 > --- a/gcc/ira-costs.c > +++ b/gcc/ira-costs.c > @@ -2308,7 +2308,7 @@ ira_tune_allocno_costs (void) > { > int j, n, regno; > int cost, min_cost, *reg_costs; > - enum reg_class aclass, rclass; > + enum reg_class aclass; > machine_mode mode; > ira_allocno_t a; > ira_allocno_iterator ai; > @@ -2347,12 +2347,9 @@ ira_tune_allocno_costs (void) > } > if (skip_p) > continue; > - rclass =3D REGNO_REG_CLASS (regno); > cost =3D 0; > if (ira_need_caller_save_p (a, regno)) > - cost +=3D (ALLOCNO_CALL_FREQ (a) > - * (ira_memory_move_cost[mode][rclass][0] > - + ira_memory_move_cost[mode][rclass][1])); > + cost +=3D ira_caller_save_cost (a); > #ifdef IRA_HARD_REGNO_ADD_COST_MULTIPLIER > cost +=3D ((ira_memory_move_cost[mode][rclass][0] > + ira_memory_move_cost[mode][rclass][1]) > diff --git a/gcc/ira-int.h b/gcc/ira-int.h > index 8b87498f77f..a78811eb416 100644 > --- a/gcc/ira-int.h > +++ b/gcc/ira-int.h > @@ -1660,4 +1660,43 @@ ira_total_conflict_hard_regs (ira_allocno_t a) > return conflicts; > } >=20 > +/* Return the cost of saving a caller-saved register before each call > + in A's live range and restoring the same register after each call. = */ > +inline int > +ira_caller_save_cost (ira_allocno_t a) > +{ > + auto mode =3D ALLOCNO_MODE (a); > + auto rclass =3D ALLOCNO_CLASS (a); > + return (ALLOCNO_CALL_FREQ (a) > + * (ira_memory_move_cost[mode][rclass][0] > + + ira_memory_move_cost[mode][rclass][1])); > +} > + > +/* A and SUBLOOP_A are allocnos for the same pseudo register, with = A's > + loop immediately enclosing SUBLOOP_A's loop. If we allocate to A = a > + hard register R that is clobbered by a call in SUBLOOP_A, decide > + which of the following approaches should be used for handling the > + conflict: > + > + (1) Spill R on entry to SUBLOOP_A's loop, assign memory to = SUBLOOP_A, > + and restore R on exit from SUBLOOP_A's loop. > + > + (2) Spill R before each necessary call in SUBLOOP_A's live range = and > + restore R after each such call. > + > + Return true if (1) is better than (2). SPILL_COST is the cost of > + doing (1). */ > +inline bool > +ira_caller_save_loop_spill_p (ira_allocno_t a, ira_allocno_t = subloop_a, > + int spill_cost) > +{ > + if (!ira_subloop_allocnos_can_differ_p (a)) > + return false; > + > + /* Calculate the cost of saving a call-clobbered register > + before each call and restoring it afterwards. */ > + int call_cost =3D ira_caller_save_cost (subloop_a); > + return call_cost && call_cost >=3D spill_cost; > +} > + > #endif /* GCC_IRA_INT_H */ > diff --git a/gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c = b/gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c > new file mode 100644 > index 00000000000..7acdc432b0c > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c > @@ -0,0 +1,65 @@ > +/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */ > +/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } = */ > + > +#define PROB 0.1 > + > +struct L > +{ > + int data; > + volatile struct L *next; > + volatile struct L *inner; > +}; > + > +void ext(); > + > +/* The thing we're testing here is that the !head->inner path of the = outer loop > + body has no stack accesses. It's possible that we'll need to = update this > + pattern for unrelated code changes. but the test should be XFAILed = rather > + than changed if any new stack accesses creep into the !head->inner = path. */ > +/* > +** foo: > +** ... > +** ldr (w[0-9]+), \[(x[0-9]+)\] > +** add (w[0-9]+), (?:\3, \1|\1, \3) > +** ldr (x[0-9]+), \[\2, #?16\] > +** str \3, \[\2\] > +** ldr \2, \[\2, #?8\] > +** cbn?z \4, .* > +** ... > +** ret > +*/ > +void > +foo (volatile struct L *head, int inc, double *ptr) > +{ > + double d =3D *ptr; > + while (head) > + { > + /* Clobber all call-preserved GPRs, so that the loop has to use > + call-clobbered GPRs if it is to avoid spilling. */ > + asm volatile ("" ::: > + "x19", "x20", "x21", "x22", "x23", > + "x24", "x25", "x26", "x27", "x28"); > + inc =3D head->data + inc; > + volatile struct L *inner =3D head->inner; > + head->data =3D inc; > + head =3D head->next; > + if (__builtin_expect_with_probability (inner !=3D 0, 0, PROB)) > + for (int i =3D 0; i < 1000; ++i) > + { > + ext (); > + /* Hack to create high register pressure, so that IRA = doesn't > + collapse this loop into the parent loop. */ > + d +=3D 1; > + asm volatile ("// foo" ::: > + "d0", "d1", "d2", "d3", > + "d4", "d5", "d6", "d7", > + "d8", "d9", "d10", "d11", > + "d12", "d13", "d14", "d15", > + "d16", "d17", "d18", "d19", > + "d20", "d21", "d22", "d23", > + "d24", "d25", "d26", "d27", > + "d28", "d29", "d30", "d31"); > + } > + } > + *ptr =3D d; > +} >