From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <maxim.kuvyrkov@linaro.org>
Received: from mail-lj1-x22d.google.com (mail-lj1-x22d.google.com
 [IPv6:2a00:1450:4864:20::22d])
 by sourceware.org (Postfix) with ESMTPS id 497573858C20
 for <gcc-regression@gcc.gnu.org>; Thu, 10 Feb 2022 16:16:19 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 497573858C20
Received: by mail-lj1-x22d.google.com with SMTP id j14so8753300lja.3
 for <gcc-regression@gcc.gnu.org>; Thu, 10 Feb 2022 08:16:19 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc
 :content-transfer-encoding:message-id:references:to;
 bh=ljLPN/4MgJ1aMv/aD8x62tbByQUSRmR60gQE21HYpkY=;
 b=J4k8hW5TXJ6eajm5zvYM9iTBn36SIYuli6DABgIbbNI6JIad6gH+g3JuUhEPUt6NQh
 G3C0le0FK2ScX/8sqzfJoih2S3MrrBOzb+o4I4nTkYE/9jXfUaO9M8U629GOUj1HI0U9
 wnKBgINo/9W+zZBgDS0hbAIckPDvmKgjIieMImoq0HK3dN4mm6ZGagsgMEqLSSuuHVKq
 p3PtFNY68GFhsGthVDbXlLvlwXH+TXfhNKjwdl1ybTDm7SvyoIPrBxWDiWWp7ZvQmERk
 G67kT+jkPqTz9LPp4hPAmwnPoplEBD754n1wuGYDQKUygDeUgmYB7kFMzOag573uuSvF
 V9AA==
X-Gm-Message-State: AOAM532QwKvi0QPVFSbZmynEGT9RBQ+kaohubnI7cP51Y4tsX2+N2J3X
 +dzkekjfK6wMMkEpqtOSU0Ft
X-Google-Smtp-Source: ABdhPJxcNIoV/aqvn2h9ej82cq+u0/hMCSw4Cs1/UbUGRlAX1XqEAu7fXSFqvDGhMA7U6kWapfqM8w==
X-Received: by 2002:a2e:91d9:: with SMTP id u25mr5241616ljg.41.1644509777147; 
 Thu, 10 Feb 2022 08:16:17 -0800 (PST)
Received: from smtpclient.apple ([185.30.228.158])
 by smtp.gmail.com with ESMTPSA id o10sm2837210lfl.116.2022.02.10.08.16.16
 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
 Thu, 10 Feb 2022 08:16:16 -0800 (PST)
Content-Type: text/plain;
	charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.13\))
Subject: Re: [TCWG CI] 453.povray slowed down by 3% after gcc: ira: Consider
 modelling caller-save allocations as loop spills
From: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
In-Reply-To: <1851691846.10927.1641946260901@jenkins.jenkins>
Date: Thu, 10 Feb 2022 19:16:15 +0300
Cc: gcc-regression@gcc.gnu.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <AA197DB4-834A-4BE2-AD72-2A9552777BB8@linaro.org>
References: <1851691846.10927.1641946260901@jenkins.jenkins>
To: Richard Sandiford <richard.sandiford@arm.com>
X-Mailer: Apple Mail (2.3654.120.0.1.13)
X-Spam-Status: No, score=-11.6 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_LOTSOFHASH,
 KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on
 server2.sourceware.org
X-BeenThere: gcc-regression@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-regression mailing list <gcc-regression.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-regression>,
 <mailto:gcc-regression-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-regression/>
List-Post: <mailto:gcc-regression@gcc.gnu.org>
List-Help: <mailto:gcc-regression-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-regression>,
 <mailto:gcc-regression-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Thu, 10 Feb 2022 16:16:22 -0000

Hi Richard,

It seems your patch slows down 453.povray by 3%.  It=E2=80=99s not a =
huge slow down, but take a look if you have a spare minute.

FWIW, it seems that some other patch has slowed down 453.povray by =
another 3%, but, unfortunately, our benchmarking CI was down for a week =
and we didn=E2=80=99t catch the culprit.

Thanks,

--
Maxim Kuvyrkov
https://www.linaro.org

> On 12 Jan 2022, at 03:11, ci_notify@linaro.org wrote:
>=20
> After gcc commit 01f3e6a40e7202310abbeb41c345d325bd69554f
> Author: Richard Sandiford <richard.sandiford@arm.com>
>=20
>    ira: Consider modelling caller-save allocations as loop spills
>=20
> the following benchmarks slowed down by more than 2%:
> - 453.povray slowed down by 3% from 4710 to 4840 perf samples
>=20
> Below reproducer instructions can be used to re-build both "first_bad" =
and "last_good" cross-toolchains used in this bisection.  Naturally, the =
scripts will fail when triggerring benchmarking jobs if you don't have =
access to Linaro TCWG CI.
>=20
> For your convenience, we have uploaded tarballs with pre-processed =
source and assembly files at:
> - First_bad save-temps: =
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a=
rm-spec2k6-O3_LTO/45/artifact/artifacts/build-01f3e6a40e7202310abbeb41c345=
d325bd69554f/save-temps/
> - Last_good save-temps: =
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a=
rm-spec2k6-O3_LTO/45/artifact/artifacts/build-8e7a23728f66d2da88b47e342244=
10457fdefbf5/save-temps/
> - Baseline save-temps: =
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a=
rm-spec2k6-O3_LTO/45/artifact/artifacts/build-baseline/save-temps/
>=20
> Configuration:
> - Benchmark: SPEC CPU2006
> - Toolchain: GCC + Glibc + GNU Linker
> - Version: all components were built from their tip of trunk
> - Target: arm-linux-gnueabihf
> - Compiler flags: -O3 -flto -marm
> - Hardware: NVidia TK1 4x Cortex-A15
>=20
> This benchmarking CI is work-in-progress, and we welcome feedback and =
suggestions at linaro-toolchain@lists.linaro.org .  In our improvement =
plans is to add support for SPEC CPU2017 benchmarks and provide "perf =
report/annotate" data behind these reports.
>=20
> THIS IS THE END OF INTERESTING STUFF.  BELOW ARE LINKS TO BUILDS, =
REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
>=20
> This commit has regressed these CI configurations:
> - tcwg_bmk_gnu_tk1/gnu-master-arm-spec2k6-O3_LTO
>=20
> First_bad build: =
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a=
rm-spec2k6-O3_LTO/45/artifact/artifacts/build-01f3e6a40e7202310abbeb41c345=
d325bd69554f/
> Last_good build: =
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a=
rm-spec2k6-O3_LTO/45/artifact/artifacts/build-8e7a23728f66d2da88b47e342244=
10457fdefbf5/
> Baseline build: =
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a=
rm-spec2k6-O3_LTO/45/artifact/artifacts/build-baseline/
> Even more details: =
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a=
rm-spec2k6-O3_LTO/45/artifact/artifacts/
>=20
> Reproduce builds:
> <cut>
> mkdir investigate-gcc-01f3e6a40e7202310abbeb41c345d325bd69554f
> cd investigate-gcc-01f3e6a40e7202310abbeb41c345d325bd69554f
>=20
> # Fetch scripts
> git clone https://git.linaro.org/toolchain/jenkins-scripts
>=20
> # Fetch manifests and test.sh script
> mkdir -p artifacts/manifests
> curl -o artifacts/manifests/build-baseline.sh =
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a=
rm-spec2k6-O3_LTO/45/artifact/artifacts/manifests/build-baseline.sh =
--fail
> curl -o artifacts/manifests/build-parameters.sh =
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a=
rm-spec2k6-O3_LTO/45/artifact/artifacts/manifests/build-parameters.sh =
--fail
> curl -o artifacts/test.sh =
https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-a=
rm-spec2k6-O3_LTO/45/artifact/artifacts/test.sh --fail
> chmod +x artifacts/test.sh
>=20
> # Reproduce the baseline build (build all pre-requisites)
> ./jenkins-scripts/tcwg_bmk-build.sh @@ =
artifacts/manifests/build-baseline.sh
>=20
> # Save baseline build state (which is then restored in =
artifacts/test.sh)
> mkdir -p ./bisect
> rsync -a --del --delete-excluded --exclude /bisect/ --exclude =
/artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
>=20
> cd gcc
>=20
> # Reproduce first_bad build
> git checkout --detach 01f3e6a40e7202310abbeb41c345d325bd69554f
> ../artifacts/test.sh
>=20
> # Reproduce last_good build
> git checkout --detach 8e7a23728f66d2da88b47e34224410457fdefbf5
> ../artifacts/test.sh
>=20
> cd ..
> </cut>
>=20
> Full commit (up to 1000 lines):
> <cut>
> commit 01f3e6a40e7202310abbeb41c345d325bd69554f
> Author: Richard Sandiford <richard.sandiford@arm.com>
> Date:   Mon Jan 10 14:47:08 2022 +0000
>=20
>    ira: Consider modelling caller-save allocations as loop spills
>=20
>    If an allocno A in an inner loop L spans a call, a parent allocno =
AP
>    can choose to handle a call-clobbered/caller-saved hard register R
>    in one of two ways:
>=20
>    (1) save R before each call in L and restore R after each call
>    (2) spill R to memory throughout L
>=20
>    (2) can be cheaper than (1) in some cases, particularly if L does
>    not reference A.
>=20
>    Before the patch we always did (1).  The patch adds support for
>    picking (2) instead, when it seems cheaper.  It builds on the
>    earlier support for not propagating conflicts to parent allocnos.
>=20
>    gcc/
>            PR rtl-optimization/98782
>            * ira-int.h (ira_caller_save_cost): New function.
>            (ira_caller_save_loop_spill_p): Likewise.
>            * ira-build.c (ira_propagate_hard_reg_costs): Test whether =
it is
>            cheaper to spill a call-clobbered register throughout a =
loop rather
>            than spill it around each individual call.  If so, treat =
all
>            call-clobbered registers as conflicts and...
>            (propagate_allocno_info): ...do not propagate call =
information
>            from the child to the parent.
>            * ira-color.c (move_spill_restore): Update accordingly.
>            * ira-costs.c (ira_tune_allocno_costs): Use =
ira_caller_save_cost.
>=20
>    gcc/testsuite/
>            * gcc.target/aarch64/reg-alloc-3.c: New test.
> ---
> gcc/ira-build.c                                | 23 +++++----
> gcc/ira-color.c                                | 13 ++++--
> gcc/ira-costs.c                                |  7 +--
> gcc/ira-int.h                                  | 39 ++++++++++++++++
> gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c | 65 =
++++++++++++++++++++++++++
> 5 files changed, 129 insertions(+), 18 deletions(-)
>=20
> diff --git a/gcc/ira-build.c b/gcc/ira-build.c
> index 875b4d8ed7c..ab3e87164e1 100644
> --- a/gcc/ira-build.c
> +++ b/gcc/ira-build.c
> @@ -2000,6 +2000,8 @@ ira_propagate_hard_reg_costs (ira_allocno_t =
parent_a, ira_allocno_t a,
> 			      int spill_cost)
> {
>   HARD_REG_SET conflicts =3D ira_total_conflict_hard_regs (a);
> +  if (ira_caller_save_loop_spill_p (parent_a, a, spill_cost))
> +    conflicts |=3D ira_need_caller_save_regs (a);
>   conflicts &=3D ~ira_total_conflict_hard_regs (parent_a);
>=20
>   auto costs =3D ALLOCNO_HARD_REG_COSTS (a);
> @@ -2069,15 +2071,18 @@ propagate_allocno_info (void)
> 	  if (!ira_subloop_allocnos_can_differ_p (parent_a))
> 	    merge_hard_reg_conflicts (a, parent_a, true);
>=20
> -	  ALLOCNO_CALL_FREQ (parent_a) +=3D ALLOCNO_CALL_FREQ (a);
> -	  ALLOCNO_CALLS_CROSSED_NUM (parent_a)
> -	    +=3D ALLOCNO_CALLS_CROSSED_NUM (a);
> -	  ALLOCNO_CHEAP_CALLS_CROSSED_NUM (parent_a)
> -	    +=3D ALLOCNO_CHEAP_CALLS_CROSSED_NUM (a);
> -	  ALLOCNO_CROSSED_CALLS_ABIS (parent_a)
> -	    |=3D ALLOCNO_CROSSED_CALLS_ABIS (a);
> -	  ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (parent_a)
> -	    |=3D ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (a);
> +	  if (!ira_caller_save_loop_spill_p (parent_a, a, spill_cost))
> +	    {
> +	      ALLOCNO_CALL_FREQ (parent_a) +=3D ALLOCNO_CALL_FREQ (a);
> +	      ALLOCNO_CALLS_CROSSED_NUM (parent_a)
> +		+=3D ALLOCNO_CALLS_CROSSED_NUM (a);
> +	      ALLOCNO_CHEAP_CALLS_CROSSED_NUM (parent_a)
> +		+=3D ALLOCNO_CHEAP_CALLS_CROSSED_NUM (a);
> +	      ALLOCNO_CROSSED_CALLS_ABIS (parent_a)
> +		|=3D ALLOCNO_CROSSED_CALLS_ABIS (a);
> +	      ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (parent_a)
> +		|=3D ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (a);
> +	    }
> 	  ALLOCNO_EXCESS_PRESSURE_POINTS_NUM (parent_a)
> 	    +=3D ALLOCNO_EXCESS_PRESSURE_POINTS_NUM (a);
> 	  aclass =3D ALLOCNO_CLASS (a);
> diff --git a/gcc/ira-color.c b/gcc/ira-color.c
> index 4344ee6689e..1487afc5ef1 100644
> --- a/gcc/ira-color.c
> +++ b/gcc/ira-color.c
> @@ -3597,11 +3597,16 @@ move_spill_restore (void)
> 		 propagate_allocno_info will have propagated
> 		 the cost of spilling HARD_REGNO in SUBLOOP_NODE.
> 		 (ira_subloop_allocnos_can_differ_p must be true
> -		 in that case.)  Otherwise, SPILL_COST acted as
> -		 a cap on the propagated register cost, in cases
> -		 where the allocations can differ.  */
> +		 in that case.)  If HARD_REGNO is a caller-saved
> +		 register, we might have modelled it in the same way.
> +
> +		 Otherwise, SPILL_COST acted as a cap on the propagated
> +		 register cost, in cases where the allocations can =
differ.  */
> 	      auto conflicts =3D ira_total_conflict_hard_regs =
(subloop_allocno);
> -	      if (TEST_HARD_REG_BIT (conflicts, hard_regno))
> +	      if (TEST_HARD_REG_BIT (conflicts, hard_regno)
> +		  || (ira_need_caller_save_p (subloop_allocno, =
hard_regno)
> +		      && ira_caller_save_loop_spill_p (a, =
subloop_allocno,
> +						       spill_cost)))
> 		reg_cost =3D spill_cost;
> 	      else if (ira_subloop_allocnos_can_differ_p (a))
> 		reg_cost =3D MIN (reg_cost, spill_cost);
> diff --git a/gcc/ira-costs.c b/gcc/ira-costs.c
> index 280befc58da..cbb58d32be8 100644
> --- a/gcc/ira-costs.c
> +++ b/gcc/ira-costs.c
> @@ -2308,7 +2308,7 @@ ira_tune_allocno_costs (void)
> {
>   int j, n, regno;
>   int cost, min_cost, *reg_costs;
> -  enum reg_class aclass, rclass;
> +  enum reg_class aclass;
>   machine_mode mode;
>   ira_allocno_t a;
>   ira_allocno_iterator ai;
> @@ -2347,12 +2347,9 @@ ira_tune_allocno_costs (void)
> 		}
> 	      if (skip_p)
> 		continue;
> -	      rclass =3D REGNO_REG_CLASS (regno);
> 	      cost =3D 0;
> 	      if (ira_need_caller_save_p (a, regno))
> -		cost +=3D (ALLOCNO_CALL_FREQ (a)
> -			 * (ira_memory_move_cost[mode][rclass][0]
> -			    + ira_memory_move_cost[mode][rclass][1]));
> +		cost +=3D ira_caller_save_cost (a);
> #ifdef IRA_HARD_REGNO_ADD_COST_MULTIPLIER
> 	      cost +=3D ((ira_memory_move_cost[mode][rclass][0]
> 			+ ira_memory_move_cost[mode][rclass][1])
> diff --git a/gcc/ira-int.h b/gcc/ira-int.h
> index 8b87498f77f..a78811eb416 100644
> --- a/gcc/ira-int.h
> +++ b/gcc/ira-int.h
> @@ -1660,4 +1660,43 @@ ira_total_conflict_hard_regs (ira_allocno_t a)
>   return conflicts;
> }
>=20
> +/* Return the cost of saving a caller-saved register before each call
> +   in A's live range and restoring the same register after each call. =
 */
> +inline int
> +ira_caller_save_cost (ira_allocno_t a)
> +{
> +  auto mode =3D ALLOCNO_MODE (a);
> +  auto rclass =3D ALLOCNO_CLASS (a);
> +  return (ALLOCNO_CALL_FREQ (a)
> +	  * (ira_memory_move_cost[mode][rclass][0]
> +	     + ira_memory_move_cost[mode][rclass][1]));
> +}
> +
> +/* A and SUBLOOP_A are allocnos for the same pseudo register, with =
A's
> +   loop immediately enclosing SUBLOOP_A's loop.  If we allocate to A =
a
> +   hard register R that is clobbered by a call in SUBLOOP_A, decide
> +   which of the following approaches should be used for handling the
> +   conflict:
> +
> +   (1) Spill R on entry to SUBLOOP_A's loop, assign memory to =
SUBLOOP_A,
> +       and restore R on exit from SUBLOOP_A's loop.
> +
> +   (2) Spill R before each necessary call in SUBLOOP_A's live range =
and
> +       restore R after each such call.
> +
> +   Return true if (1) is better than (2).  SPILL_COST is the cost of
> +   doing (1).  */
> +inline bool
> +ira_caller_save_loop_spill_p (ira_allocno_t a, ira_allocno_t =
subloop_a,
> +			      int spill_cost)
> +{
> +  if (!ira_subloop_allocnos_can_differ_p (a))
> +    return false;
> +
> +  /* Calculate the cost of saving a call-clobbered register
> +     before each call and restoring it afterwards.  */
> +  int call_cost =3D ira_caller_save_cost (subloop_a);
> +  return call_cost && call_cost >=3D spill_cost;
> +}
> +
> #endif /* GCC_IRA_INT_H */
> diff --git a/gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c =
b/gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c
> new file mode 100644
> index 00000000000..7acdc432b0c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c
> @@ -0,0 +1,65 @@
> +/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
> +/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } =
*/
> +
> +#define PROB 0.1
> +
> +struct L
> +{
> +  int data;
> +  volatile struct L *next;
> +  volatile struct L *inner;
> +};
> +
> +void ext();
> +
> +/* The thing we're testing here is that the !head->inner path of the =
outer loop
> +   body has no stack accesses.  It's possible that we'll need to =
update this
> +   pattern for unrelated code changes. but the test should be XFAILed =
rather
> +   than changed if any new stack accesses creep into the !head->inner =
path.  */
> +/*
> +** foo:
> +**	...
> +**	ldr	(w[0-9]+), \[(x[0-9]+)\]
> +**	add	(w[0-9]+), (?:\3, \1|\1, \3)
> +**	ldr	(x[0-9]+), \[\2, #?16\]
> +**	str	\3, \[\2\]
> +**	ldr	\2, \[\2, #?8\]
> +**	cbn?z	\4, .*
> +**	...
> +**	ret
> +*/
> +void
> +foo (volatile struct L *head, int inc, double *ptr)
> +{
> +  double d =3D *ptr;
> +  while (head)
> +    {
> +      /* Clobber all call-preserved GPRs, so that the loop has to use
> +	 call-clobbered GPRs if it is to avoid spilling.  */
> +      asm volatile ("" :::
> +		    "x19", "x20", "x21", "x22", "x23",
> +		    "x24", "x25", "x26", "x27", "x28");
> +      inc =3D head->data + inc;
> +      volatile struct L *inner =3D head->inner;
> +      head->data =3D inc;
> +      head =3D head->next;
> +      if (__builtin_expect_with_probability (inner !=3D 0, 0, PROB))
> +	for (int i =3D 0; i < 1000; ++i)
> +	  {
> +	    ext ();
> +	    /* Hack to create high register pressure, so that IRA =
doesn't
> +	       collapse this loop into the parent loop.  */
> +	    d +=3D 1;
> +	    asm volatile ("// foo" :::
> +			  "d0", "d1", "d2", "d3",
> +			  "d4", "d5", "d6", "d7",
> +			  "d8", "d9", "d10", "d11",
> +			  "d12", "d13", "d14", "d15",
> +			  "d16", "d17", "d18", "d19",
> +			  "d20", "d21", "d22", "d23",
> +			  "d24", "d25", "d26", "d27",
> +			  "d28", "d29", "d30", "d31");
> +	  }
> +    }
> +  *ptr =3D d;
> +}
> </cut>