public inbox for gcc-patches@gcc.gnu.org
* [PATCH 0/6] ira: Fix performance regression in exchange2 [PR98782]
@ 2022-01-06 14:45 Richard Sandiford
  2022-01-06 14:46 ` [PATCH 1/6] ira: Add a ira_loop_border_costs class Richard Sandiford
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Richard Sandiford @ 2022-01-06 14:45 UTC (permalink / raw)
  To: gcc-patches

This series of patches recovers the exchange2 performance lost in the
GCC 11 timeframe (at least on aarch64 and Power9 -- thanks Pat for
testing the latter).

There are 6 patches, split into two groups of 3.  The first 3 are just
preparatory patches, although patch 2 does contain a minor bug fix.
The other 3 patches are the ones that together fix the regression.

I realise this is a bit invasive for stage 3.  However, the series is
fixing a large regression in an important benchmark and AFAIK there are
no known acceptable mitigations that we could apply instead.  I think
the series is also working with concepts that already exist in IRA:
it's really about tweaking the cost model used to control them.

We also still have at least 3 months (more realistically 4 months) of
testing before GCC 12 is released.  So perhaps one option would be to
apply any approved version of the series now, but with the understanding
that if there's significant fallout (more than a handful of small tweaks
or fixes), we would simply revert the patches rather than trying to
rework them in situ.  The series is confined to IRA, so reverting it
should be simple.  Would that be OK?

Each patch bootstrapped & regression-tested individually on
aarch64-linux-gnu.  Also tested as a series on aarch64_be-elf,
arm-linux-gnueabihf, powerpc64le-linux-gnu, and x86_64-linux-gnu.

Richard

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/6] ira: Add a ira_loop_border_costs class
  2022-01-06 14:45 [PATCH 0/6] ira: Fix performance regression in exchange2 [PR98782] Richard Sandiford
@ 2022-01-06 14:46 ` Richard Sandiford
  2022-01-06 15:08   ` Jan Hubicka
  2022-01-07 14:38   ` Vladimir Makarov
  2022-01-06 14:46 ` [PATCH 2/6] ira: Add comments and fix move_spill_restore calculation Richard Sandiford
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Richard Sandiford @ 2022-01-06 14:46 UTC (permalink / raw)
  To: gcc-patches

The final index into (ira_)memory_move_cost is 1 for loads and
0 for stores.  Thus the combination:

  entry_freq * memory_cost[1] + exit_freq * memory_cost[0]

is the cost of loading a register on entry to a loop and
storing it back on exit from the loop.  This is the cost to
use if the register is successfully allocated within the
loop but is spilled in the parent loop.  Similarly:

  entry_freq * memory_cost[0] + exit_freq * memory_cost[1]

is the cost of storing a register on entry to the loop and
restoring it on exit from the loop.  This is the cost to
use if the register is spilled within the loop but is
successfully allocated in the parent loop.
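The two combinations above can be sketched as plain functions.  This is an illustration only, with made-up store/load costs (4 and 6) standing in for one `ira_memory_move_cost[mode][class]` entry; it is not GCC's actual table:

```c
#include <assert.h>

/* Hypothetical memory-move costs: index 0 is the cost of a store,
   index 1 the cost of a load, matching the convention described
   above for (ira_)memory_move_cost.  */
static const int memory_cost[2] = { 4 /* store */, 6 /* load */ };

/* Load the register on entry to the loop, store it back on exit.
   Use this when the register is allocated within the loop but
   spilled in the parent loop.  */
static int
spill_outside_cost (int entry_freq, int exit_freq)
{
  return entry_freq * memory_cost[1] + exit_freq * memory_cost[0];
}

/* Store the register on entry to the loop, restore it on exit.
   Use this when the register is spilled within the loop but
   allocated in the parent loop.  */
static int
spill_inside_cost (int entry_freq, int exit_freq)
{
  return entry_freq * memory_cost[0] + exit_freq * memory_cost[1];
}
```

With asymmetric load/store costs the two values differ whenever the entry and exit frequencies differ, which is why the distinction matters.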

The patch adds a helper class for calculating these values and
mechanically replaces the existing instances.  There is no attempt to
editorialise the choice between using “spill inside” and “spill outside”
costs.  (I think one of them is the wrong way round, but a later patch
deals with that.)

No functional change intended.

gcc/
	PR rtl-optimization/98782
	* ira-int.h (ira_loop_border_costs): New class.
	* ira-color.c (ira_loop_border_costs::ira_loop_border_costs):
	New constructor.
	(calculate_allocno_spill_cost): Use ira_loop_border_costs.
	(color_pass): Likewise.
	(move_spill_restore): Likewise.
---
 gcc/ira-color.c | 76 +++++++++++++++++++------------------------------
 gcc/ira-int.h   | 56 ++++++++++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+), 46 deletions(-)

diff --git a/gcc/ira-color.c b/gcc/ira-color.c
index e0b4e490043..66c11710b97 100644
--- a/gcc/ira-color.c
+++ b/gcc/ira-color.c
@@ -2567,13 +2567,23 @@ ira_loop_edge_freq (ira_loop_tree_node_t loop_node, int regno, bool exit_p)
   return REG_FREQ_FROM_EDGE_FREQ (freq);
 }
 
+/* Construct an object that describes the boundary between A and its
+   parent allocno.  */
+ira_loop_border_costs::ira_loop_border_costs (ira_allocno_t a)
+  : m_mode (ALLOCNO_MODE (a)),
+    m_class (ALLOCNO_CLASS (a)),
+    m_entry_freq (ira_loop_edge_freq (ALLOCNO_LOOP_TREE_NODE (a),
+				      ALLOCNO_REGNO (a), false)),
+    m_exit_freq (ira_loop_edge_freq (ALLOCNO_LOOP_TREE_NODE (a),
+				     ALLOCNO_REGNO (a), true))
+{
+}
+
 /* Calculate and return the cost of putting allocno A into memory.  */
 static int
 calculate_allocno_spill_cost (ira_allocno_t a)
 {
   int regno, cost;
-  machine_mode mode;
-  enum reg_class rclass;
   ira_allocno_t parent_allocno;
   ira_loop_tree_node_t parent_node, loop_node;
 
@@ -2586,24 +2596,12 @@ calculate_allocno_spill_cost (ira_allocno_t a)
     return cost;
   if ((parent_allocno = parent_node->regno_allocno_map[regno]) == NULL)
     return cost;
-  mode = ALLOCNO_MODE (a);
-  rclass = ALLOCNO_CLASS (a);
+  ira_loop_border_costs border_costs (a);
   if (ALLOCNO_HARD_REGNO (parent_allocno) < 0)
-    cost -= (ira_memory_move_cost[mode][rclass][0]
-	     * ira_loop_edge_freq (loop_node, regno, true)
-	     + ira_memory_move_cost[mode][rclass][1]
-	     * ira_loop_edge_freq (loop_node, regno, false));
+    cost -= border_costs.spill_outside_loop_cost ();
   else
-    {
-      ira_init_register_move_cost_if_necessary (mode);
-      cost += ((ira_memory_move_cost[mode][rclass][1]
-		* ira_loop_edge_freq (loop_node, regno, true)
-		+ ira_memory_move_cost[mode][rclass][0]
-		* ira_loop_edge_freq (loop_node, regno, false))
-	       - (ira_register_move_cost[mode][rclass][rclass]
-		  * (ira_loop_edge_freq (loop_node, regno, false)
-		     + ira_loop_edge_freq (loop_node, regno, true))));
-    }
+    cost += (border_costs.spill_inside_loop_cost ()
+	     - border_costs.move_between_loops_cost ());
   return cost;
 }
 
@@ -3342,7 +3340,7 @@ static void
 color_pass (ira_loop_tree_node_t loop_tree_node)
 {
   int regno, hard_regno, index = -1, n;
-  int cost, exit_freq, enter_freq;
+  int cost;
   unsigned int j;
   bitmap_iterator bi;
   machine_mode mode;
@@ -3466,8 +3464,6 @@ color_pass (ira_loop_tree_node_t loop_tree_node)
 		}
 	      continue;
 	    }
-	  exit_freq = ira_loop_edge_freq (subloop_node, regno, true);
-	  enter_freq = ira_loop_edge_freq (subloop_node, regno, false);
 	  ira_assert (regno < ira_reg_equiv_len);
 	  if (ira_equiv_no_lvalue_p (regno))
 	    {
@@ -3483,16 +3479,16 @@ color_pass (ira_loop_tree_node_t loop_tree_node)
 	    }
 	  else if (hard_regno < 0)
 	    {
+	      ira_loop_border_costs border_costs (subloop_allocno);
 	      ALLOCNO_UPDATED_MEMORY_COST (subloop_allocno)
-		-= ((ira_memory_move_cost[mode][rclass][1] * enter_freq)
-		    + (ira_memory_move_cost[mode][rclass][0] * exit_freq));
+		-= border_costs.spill_outside_loop_cost ();
 	    }
 	  else
 	    {
+	      ira_loop_border_costs border_costs (subloop_allocno);
 	      aclass = ALLOCNO_CLASS (subloop_allocno);
 	      ira_init_register_move_cost_if_necessary (mode);
-	      cost = (ira_register_move_cost[mode][rclass][rclass]
-		      * (exit_freq + enter_freq));
+	      cost = border_costs.move_between_loops_cost ();
 	      ira_allocate_and_set_or_copy_costs
 		(&ALLOCNO_UPDATED_HARD_REG_COSTS (subloop_allocno), aclass,
 		 ALLOCNO_UPDATED_CLASS_COST (subloop_allocno),
@@ -3508,8 +3504,7 @@ color_pass (ira_loop_tree_node_t loop_tree_node)
 		ALLOCNO_UPDATED_CLASS_COST (subloop_allocno)
 		  = ALLOCNO_UPDATED_HARD_REG_COSTS (subloop_allocno)[index];
 	      ALLOCNO_UPDATED_MEMORY_COST (subloop_allocno)
-		+= (ira_memory_move_cost[mode][rclass][0] * enter_freq
-		    + ira_memory_move_cost[mode][rclass][1] * exit_freq);
+		+= border_costs.spill_inside_loop_cost ();
 	    }
 	}
     }
@@ -3550,7 +3545,6 @@ move_spill_restore (void)
 {
   int cost, regno, hard_regno, hard_regno2, index;
   bool changed_p;
-  int enter_freq, exit_freq;
   machine_mode mode;
   enum reg_class rclass;
   ira_allocno_t a, parent_allocno, subloop_allocno;
@@ -3605,38 +3599,28 @@ move_spill_restore (void)
 		       - (ALLOCNO_HARD_REG_COSTS (subloop_allocno) == NULL
 			  ? ALLOCNO_CLASS_COST (subloop_allocno)
 			  : ALLOCNO_HARD_REG_COSTS (subloop_allocno)[index]));
-	      exit_freq = ira_loop_edge_freq (subloop_node, regno, true);
-	      enter_freq = ira_loop_edge_freq (subloop_node, regno, false);
+	      ira_loop_border_costs border_costs (subloop_allocno);
 	      if ((hard_regno2 = ALLOCNO_HARD_REGNO (subloop_allocno)) < 0)
-		cost -= (ira_memory_move_cost[mode][rclass][0] * exit_freq
-			 + ira_memory_move_cost[mode][rclass][1] * enter_freq);
+		cost -= border_costs.spill_outside_loop_cost ();
 	      else
 		{
-		  cost
-		    += (ira_memory_move_cost[mode][rclass][0] * exit_freq
-			+ ira_memory_move_cost[mode][rclass][1] * enter_freq);
+		  cost += border_costs.spill_outside_loop_cost ();
 		  if (hard_regno2 != hard_regno)
-		    cost -= (ira_register_move_cost[mode][rclass][rclass]
-			     * (exit_freq + enter_freq));
+		    cost -= border_costs.move_between_loops_cost ();
 		}
 	    }
 	  if ((parent = loop_node->parent) != NULL
 	      && (parent_allocno = parent->regno_allocno_map[regno]) != NULL)
 	    {
 	      ira_assert (rclass == ALLOCNO_CLASS (parent_allocno));
-	      exit_freq	= ira_loop_edge_freq (loop_node, regno, true);
-	      enter_freq = ira_loop_edge_freq (loop_node, regno, false);
+	      ira_loop_border_costs border_costs (a);
 	      if ((hard_regno2 = ALLOCNO_HARD_REGNO (parent_allocno)) < 0)
-		cost -= (ira_memory_move_cost[mode][rclass][0] * exit_freq
-			 + ira_memory_move_cost[mode][rclass][1] * enter_freq);
+		cost -= border_costs.spill_outside_loop_cost ();
 	      else
 		{
-		  cost
-		    += (ira_memory_move_cost[mode][rclass][1] * exit_freq
-			+ ira_memory_move_cost[mode][rclass][0] * enter_freq);
+		  cost += border_costs.spill_inside_loop_cost ();
 		  if (hard_regno2 != hard_regno)
-		    cost -= (ira_register_move_cost[mode][rclass][rclass]
-			     * (exit_freq + enter_freq));
+		    cost -= border_costs.move_between_loops_cost ();
 		}
 	    }
 	  if (cost < 0)
diff --git a/gcc/ira-int.h b/gcc/ira-int.h
index 59d2872cac3..b32c80d4c9e 100644
--- a/gcc/ira-int.h
+++ b/gcc/ira-int.h
@@ -1539,4 +1539,60 @@ ira_need_caller_save_p (ira_allocno_t a, unsigned int regno)
 				     ALLOCNO_MODE (a), regno);
 }
 
+/* Represents the boundary between an allocno in one loop and its parent
+   allocno in the enclosing loop.  It is usually possible to change a
+   register's allocation on this boundary; the class provides routines
+   for calculating the cost of such changes.  */
+class ira_loop_border_costs
+{
+public:
+  ira_loop_border_costs (ira_allocno_t);
+
+  int move_between_loops_cost () const;
+  int spill_outside_loop_cost () const;
+  int spill_inside_loop_cost () const;
+
+private:
+  /* The mode and class of the child allocno.  */
+  machine_mode m_mode;
+  reg_class m_class;
+
+  /* Sums the frequencies of the entry edges and the exit edges.  */
+  int m_entry_freq, m_exit_freq;
+};
+
+/* Return the cost of storing the register on entry to the loop and
+   loading it back on exit from the loop.  This is the cost to use if
+   the register is spilled within the loop but is successfully allocated
+   in the parent loop.  */
+inline int
+ira_loop_border_costs::spill_inside_loop_cost () const
+{
+  return (m_entry_freq * ira_memory_move_cost[m_mode][m_class][0]
+	  + m_exit_freq * ira_memory_move_cost[m_mode][m_class][1]);
+}
+
+/* Return the cost of loading the register on entry to the loop and
+   storing it back on exit from the loop.  This is the cost to use if
+   the register is successfully allocated within the loop but is spilled
+   in the parent loop.  */
+inline int
+ira_loop_border_costs::spill_outside_loop_cost () const
+{
+  return (m_entry_freq * ira_memory_move_cost[m_mode][m_class][1]
+	  + m_exit_freq * ira_memory_move_cost[m_mode][m_class][0]);
+}
+
+/* Return the cost of moving the pseudo register between different hard
+   registers on entry and exit from the loop.  This is the cost to use
+   if the register is successfully allocated within both this loop and
+   the parent loop, but the allocations for the loops differ.  */
+inline int
+ira_loop_border_costs::move_between_loops_cost () const
+{
+  ira_init_register_move_cost_if_necessary (m_mode);
+  auto move_cost = ira_register_move_cost[m_mode][m_class][m_class];
+  return move_cost * (m_entry_freq + m_exit_freq);
+}
+
 #endif /* GCC_IRA_INT_H */
-- 
2.25.1



* [PATCH 2/6] ira: Add comments and fix move_spill_restore calculation
  2022-01-06 14:45 [PATCH 0/6] ira: Fix performance regression in exchange2 [PR98782] Richard Sandiford
  2022-01-06 14:46 ` [PATCH 1/6] ira: Add a ira_loop_border_costs class Richard Sandiford
@ 2022-01-06 14:46 ` Richard Sandiford
  2022-01-07 14:39   ` Vladimir Makarov
  2022-01-06 14:47 ` [PATCH 3/6] ira: Add ira_subloop_allocnos_can_differ_p Richard Sandiford
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Richard Sandiford @ 2022-01-06 14:46 UTC (permalink / raw)
  To: gcc-patches

This patch adds comments to describe each use of ira_loop_border_costs.
I think this highlights that move_spill_restore was using the wrong cost
in one case, which came from transposing [0] and [1] in the original
(pre-ira_loop_border_costs) ira_memory_move_cost expressions.  The
difference would only be noticeable on targets that distinguish between
load and store costs.
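A toy calculation shows why the transposition went unnoticed on targets with symmetric costs.  The store/load costs and frequencies below are invented for illustration:

```c
#include <assert.h>

/* Store on entry, load on exit -- the "spill inside loop" form.  */
static int
border_cost (int store_c, int load_c, int entry_freq, int exit_freq)
{
  return entry_freq * store_c + exit_freq * load_c;
}

/* The transposed form: load cost charged on entry, store cost on
   exit.  This is what the buggy expression effectively computed.  */
static int
transposed_border_cost (int store_c, int load_c,
			int entry_freq, int exit_freq)
{
  return entry_freq * load_c + exit_freq * store_c;
}
```

When `store_c == load_c` the two functions agree for any frequencies, so only targets that distinguish load and store costs could observe the wrong result.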

gcc/
	PR rtl-optimization/98782
	* ira-color.c (color_pass): Add comments to describe the spill costs.
	(move_spill_restore): Likewise.  Fix reversed calculation.
---
 gcc/ira-color.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/gcc/ira-color.c b/gcc/ira-color.c
index 66c11710b97..e7433312675 100644
--- a/gcc/ira-color.c
+++ b/gcc/ira-color.c
@@ -3479,6 +3479,13 @@ color_pass (ira_loop_tree_node_t loop_tree_node)
 	    }
 	  else if (hard_regno < 0)
 	    {
+	      /* If we allocate a register to SUBLOOP_ALLOCNO, we'll need
+		 to load the register on entry to the subloop and store
+		 the register back on exit from the subloop.  This incurs
+		 a fixed cost for all registers.  Since UPDATED_MEMORY_COST
+		 is (and should only be) used relative to the register costs
+		 for the same allocno, we can subtract this shared register
+		 cost from the memory cost.  */
 	      ira_loop_border_costs border_costs (subloop_allocno);
 	      ALLOCNO_UPDATED_MEMORY_COST (subloop_allocno)
 		-= border_costs.spill_outside_loop_cost ();
@@ -3503,6 +3510,9 @@ color_pass (ira_loop_tree_node_t loop_tree_node)
 		  > ALLOCNO_UPDATED_HARD_REG_COSTS (subloop_allocno)[index])
 		ALLOCNO_UPDATED_CLASS_COST (subloop_allocno)
 		  = ALLOCNO_UPDATED_HARD_REG_COSTS (subloop_allocno)[index];
+	      /* If we spill SUBLOOP_ALLOCNO, we'll need to store HARD_REGNO
+		 on entry to the subloop and restore HARD_REGNO on exit from
+		 the subloop.  */
 	      ALLOCNO_UPDATED_MEMORY_COST (subloop_allocno)
 		+= border_costs.spill_inside_loop_cost ();
 	    }
@@ -3601,9 +3611,17 @@ move_spill_restore (void)
 			  : ALLOCNO_HARD_REG_COSTS (subloop_allocno)[index]));
 	      ira_loop_border_costs border_costs (subloop_allocno);
 	      if ((hard_regno2 = ALLOCNO_HARD_REGNO (subloop_allocno)) < 0)
-		cost -= border_costs.spill_outside_loop_cost ();
+		/* The register was spilled in the subloop.  If we spill
+		   it in the outer loop too then we'll no longer need to
+		   save the register on entry to the subloop and restore
+		   the register on exit from the subloop.  */
+		cost -= border_costs.spill_inside_loop_cost ();
 	      else
 		{
+		  /* The register was also allocated in the subloop.  If we
+		     spill it in the outer loop then we'll need to load the
+		     register on entry to the subloop and store the register
+		     back on exit from the subloop.  */
 		  cost += border_costs.spill_outside_loop_cost ();
 		  if (hard_regno2 != hard_regno)
 		    cost -= border_costs.move_between_loops_cost ();
@@ -3615,9 +3633,17 @@ move_spill_restore (void)
 	      ira_assert (rclass == ALLOCNO_CLASS (parent_allocno));
 	      ira_loop_border_costs border_costs (a);
 	      if ((hard_regno2 = ALLOCNO_HARD_REGNO (parent_allocno)) < 0)
+		/* The register was spilled in the parent loop.  If we spill
+		   it in this loop too then we'll no longer need to load the
+		   register on entry to this loop and save the register back
+		   on exit from this loop.  */
 		cost -= border_costs.spill_outside_loop_cost ();
 	      else
 		{
+		  /* The register was also allocated in the parent loop.
+		     If we spill it in this loop then we'll need to save
+		     the register on entry to this loop and restore the
+		     register on exit from this loop.  */
 		  cost += border_costs.spill_inside_loop_cost ();
 		  if (hard_regno2 != hard_regno)
 		    cost -= border_costs.move_between_loops_cost ();
-- 
2.25.1



* [PATCH 3/6] ira: Add ira_subloop_allocnos_can_differ_p
  2022-01-06 14:45 [PATCH 0/6] ira: Fix performance regression in exchange2 [PR98782] Richard Sandiford
  2022-01-06 14:46 ` [PATCH 1/6] ira: Add a ira_loop_border_costs class Richard Sandiford
  2022-01-06 14:46 ` [PATCH 2/6] ira: Add comments and fix move_spill_restore calculation Richard Sandiford
@ 2022-01-06 14:47 ` Richard Sandiford
  2022-01-07 14:40   ` Vladimir Makarov
  2022-01-06 14:47 ` [PATCH 4/6] ira: Try to avoid propagating conflicts Richard Sandiford
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Richard Sandiford @ 2022-01-06 14:47 UTC (permalink / raw)
  To: gcc-patches

color_pass has two instances of the same code for propagating non-cap
assignments from parent loops to subloops.  This patch adds a helper
function for testing when such propagations are required for correctness
and uses it to remove the duplicated code.

A later patch will use this in ira-build.c too, which is why the
function is exported to ira-int.h.

No functional change intended.

gcc/
	PR rtl-optimization/98782
	* ira-int.h (ira_subloop_allocnos_can_differ_p): New function,
	extracted from...
	* ira-color.c (color_pass): ...here.
---
 gcc/ira-color.c | 21 +--------------------
 gcc/ira-int.h   | 28 ++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/gcc/ira-color.c b/gcc/ira-color.c
index e7433312675..ae0f08af4b3 100644
--- a/gcc/ira-color.c
+++ b/gcc/ira-color.c
@@ -3446,26 +3446,7 @@ color_pass (ira_loop_tree_node_t loop_tree_node)
 	  if ((flag_ira_region == IRA_REGION_MIXED
 	       && (loop_tree_node->reg_pressure[pclass]
 		   <= ira_class_hard_regs_num[pclass]))
-	      || (pic_offset_table_rtx != NULL
-		  && regno == (int) REGNO (pic_offset_table_rtx))
-	      /* Avoid overlapped multi-registers. Moves between them
-		 might result in wrong code generation.  */
-	      || (hard_regno >= 0
-		  && ira_reg_class_max_nregs[pclass][mode] > 1))
-	    {
-	      if (! ALLOCNO_ASSIGNED_P (subloop_allocno))
-		{
-		  ALLOCNO_HARD_REGNO (subloop_allocno) = hard_regno;
-		  ALLOCNO_ASSIGNED_P (subloop_allocno) = true;
-		  if (hard_regno >= 0)
-		    update_costs_from_copies (subloop_allocno, true, true);
-		  /* We don't need updated costs anymore.  */
-		  ira_free_allocno_updated_costs (subloop_allocno);
-		}
-	      continue;
-	    }
-	  ira_assert (regno < ira_reg_equiv_len);
-	  if (ira_equiv_no_lvalue_p (regno))
+	      || !ira_subloop_allocnos_can_differ_p (a, hard_regno >= 0))
 	    {
 	      if (! ALLOCNO_ASSIGNED_P (subloop_allocno))
 		{
diff --git a/gcc/ira-int.h b/gcc/ira-int.h
index b32c80d4c9e..c5b1a131abd 100644
--- a/gcc/ira-int.h
+++ b/gcc/ira-int.h
@@ -1595,4 +1595,32 @@ ira_loop_border_costs::move_between_loops_cost () const
   return move_cost * (m_entry_freq + m_exit_freq);
 }
 
+/* Return true if subloops that contain allocnos for A's register can
+   use a different assignment from A.  ALLOCATED_P is true for the case
+   in which allocation succeeded for A.  */
+inline bool
+ira_subloop_allocnos_can_differ_p (ira_allocno_t a, bool allocated_p = true)
+{
+  auto regno = ALLOCNO_REGNO (a);
+
+  if (pic_offset_table_rtx != NULL
+      && regno == (int) REGNO (pic_offset_table_rtx))
+    return false;
+
+  ira_assert (regno < ira_reg_equiv_len);
+  if (ira_equiv_no_lvalue_p (regno))
+    return false;
+
+  /* Avoid overlapping multi-registers.  Moves between them might result
+     in wrong code generation.  */
+  if (allocated_p)
+    {
+      auto pclass = ira_pressure_class_translate[ALLOCNO_CLASS (a)];
+      if (ira_reg_class_max_nregs[pclass][ALLOCNO_MODE (a)] > 1)
+	return false;
+    }
+
+  return true;
+}
+
 #endif /* GCC_IRA_INT_H */
-- 
2.25.1



* [PATCH 4/6] ira: Try to avoid propagating conflicts
  2022-01-06 14:45 [PATCH 0/6] ira: Fix performance regression in exchange2 [PR98782] Richard Sandiford
                   ` (2 preceding siblings ...)
  2022-01-06 14:47 ` [PATCH 3/6] ira: Add ira_subloop_allocnos_can_differ_p Richard Sandiford
@ 2022-01-06 14:47 ` Richard Sandiford
  2022-01-10 13:51   ` Vladimir Makarov
  2022-01-06 14:48 ` [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills Richard Sandiford
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Richard Sandiford @ 2022-01-06 14:47 UTC (permalink / raw)
  To: gcc-patches

Suppose that:

- an inner loop L contains an allocno A
- L clobbers hard register R while A is live
- A's parent allocno is AP

Previously, propagate_allocno_info would propagate conflict sets up the
loop tree, so that the conflict between A and R would become a conflict
between AP and R (and so on for ancestors of AP).

However, when IRA treats loops as separate allocation regions, it can
decide on a loop-by-loop basis whether to allocate a register or spill
to memory.  Conflicts in inner loops therefore don't need to become
hard conflicts in parent loops.  Instead we can record that using the
“conflicting” registers for the parent allocnos has a higher cost.
In the example above, this higher cost is the sum of:

- the cost of saving R on entry to L
- the cost of keeping the pseudo register in memory throughout L
- the cost of reloading R on exit from L

This value is also a cap on the hard register cost that A can contribute
to AP in general (not just for conflicts).  Whatever allocation we pick
for AP, there is always the option of spilling that register to memory
throughout L, so the cost to A of allocating a register to AP can't be
more than the cost of spilling A.

To take an extreme example: if allocating a register R2 to A is more
expensive than spilling A to memory, ALLOCNO_HARD_REG_COSTS (A)[R2]
could be (say) 2 times greater than ALLOCNO_MEMORY_COST (A) or 100
times greater than ALLOCNO_MEMORY_COST (A).  But this scale factor
doesn't matter to AP.  All that matters is that R2 is more expensive
than memory for A, so that allocating R2 to AP should be costed as
spilling A to memory (again assuming that A and AP are in different
allocation regions).  Propagating a factor of 100 would distort the
register costs for AP.
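The capping rule can be sketched as follows.  The costs and frequencies are hypothetical, and the helpers are simplified stand-ins for the spill-cost calculation the patch performs in propagate_allocno_info:

```c
#include <assert.h>

/* Cost of spilling the child's register throughout its loop: store on
   entry, keep the pseudo in memory inside the loop, reload on exit.
   STORE_C and LOAD_C are per-reference memory-move costs;
   MEMORY_COST_IN_LOOP is the child's accumulated memory cost.  */
static int
spill_cost_throughout_loop (int entry_freq, int exit_freq,
			    int store_c, int load_c,
			    int memory_cost_in_loop)
{
  return entry_freq * store_c + exit_freq * load_c
	 + memory_cost_in_loop;
}

/* Whatever register cost the child allocno reports, the parent never
   pays more than the cost of spilling the child throughout its loop,
   since that fallback is always available.  */
static int
propagated_reg_cost (int child_reg_cost, int spill_cost)
{
  return child_reg_cost < spill_cost ? child_reg_cost : spill_cost;
}
```

So a child whose register cost is 100 times its memory cost contributes the same capped amount to the parent as one whose register cost is merely 2 times greater, avoiding the distortion described above.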

move_spill_restore tries to undo the propagation done by
propagate_allocno_info, so we need some extra processing there.

gcc/
	PR rtl-optimization/98782
	* ira-int.h (ira_allocno::might_conflict_with_parent_p): New field.
	(ALLOCNO_MIGHT_CONFLICT_WITH_PARENT_P): New macro.
	(ira_single_region_allocno_p): New function.
	(ira_total_conflict_hard_regs): Likewise.
	* ira-build.c (ira_create_allocno): Initialize
	ALLOCNO_MIGHT_CONFLICT_WITH_PARENT_P.
	(ira_propagate_hard_reg_costs): New function.
	(propagate_allocno_info): Use it.  Try to avoid propagating
	hard register conflicts to parent allocnos if we can handle
	the conflicts by spilling instead.  Limit the propagated
	register costs to the cost of spilling throughout the child loop.
	* ira-color.c (color_pass): Use ira_single_region_allocno_p to
	test whether a child and parent allocno can share the same
	register.
	(move_spill_restore): Adjust for the new behavior of
	propagate_allocno_info.

gcc/testsuite/
	* gcc.target/aarch64/reg-alloc-2.c: New test.
---
 gcc/ira-build.c                               | 55 +++++++++++++++++--
 gcc/ira-color.c                               | 53 ++++++++++++------
 gcc/ira-int.h                                 | 37 +++++++++++++
 .../gcc.target/aarch64/reg-alloc-2.c          | 47 ++++++++++++++++
 4 files changed, 169 insertions(+), 23 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/reg-alloc-2.c

diff --git a/gcc/ira-build.c b/gcc/ira-build.c
index 0547edfde55..875b4d8ed7c 100644
--- a/gcc/ira-build.c
+++ b/gcc/ira-build.c
@@ -497,6 +497,7 @@ ira_create_allocno (int regno, bool cap_p,
   bitmap_set_bit (loop_tree_node->all_allocnos, ALLOCNO_NUM (a));
   ALLOCNO_NREFS (a) = 0;
   ALLOCNO_FREQ (a) = 0;
+  ALLOCNO_MIGHT_CONFLICT_WITH_PARENT_P (a) = false;
   ALLOCNO_HARD_REGNO (a) = -1;
   ALLOCNO_CALL_FREQ (a) = 0;
   ALLOCNO_CALLS_CROSSED_NUM (a) = 0;
@@ -1991,6 +1992,35 @@ propagate_modified_regnos (ira_loop_tree_node_t loop_tree_node)
 		   loop_tree_node->modified_regnos);
 }
 
+/* Propagate ALLOCNO_HARD_REG_COSTS from A to PARENT_A.  Use SPILL_COST
+   as the cost of spilling a register throughout A (which we have to do
+   for PARENT_A allocations that conflict with A).  */
+static void
+ira_propagate_hard_reg_costs (ira_allocno_t parent_a, ira_allocno_t a,
+			      int spill_cost)
+{
+  HARD_REG_SET conflicts = ira_total_conflict_hard_regs (a);
+  conflicts &= ~ira_total_conflict_hard_regs (parent_a);
+
+  auto costs = ALLOCNO_HARD_REG_COSTS (a);
+  if (!hard_reg_set_empty_p (conflicts))
+    ALLOCNO_MIGHT_CONFLICT_WITH_PARENT_P (a) = true;
+  else if (!costs)
+    return;
+
+  auto aclass = ALLOCNO_CLASS (a);
+  ira_allocate_and_set_costs (&ALLOCNO_HARD_REG_COSTS (parent_a),
+			      aclass, ALLOCNO_CLASS_COST (parent_a));
+  auto parent_costs = ALLOCNO_HARD_REG_COSTS (parent_a);
+  for (int i = 0; i < ira_class_hard_regs_num[aclass]; ++i)
+    if (TEST_HARD_REG_BIT (conflicts, ira_class_hard_regs[aclass][i]))
+      parent_costs[i] += spill_cost;
+    else if (costs)
+      /* The cost to A of allocating this register to PARENT_A can't
+	 be more than the cost of spilling the register throughout A.  */
+      parent_costs[i] += MIN (costs[i], spill_cost);
+}
+
 /* Propagate new info about allocno A (see comments about accumulated
    info in allocno definition) to the corresponding allocno on upper
    loop tree level.  So allocnos on upper levels accumulate
@@ -2019,12 +2049,27 @@ propagate_allocno_info (void)
 	  && bitmap_bit_p (ALLOCNO_LOOP_TREE_NODE (a)->border_allocnos,
 			   ALLOCNO_NUM (a)))
 	{
+	  /* Calculate the cost of storing to memory on entry to A's loop,
+	     referencing as memory within A's loop, and restoring from
+	     memory on exit from A's loop.  */
+	  ira_loop_border_costs border_costs (a);
+	  int spill_cost = INT_MAX;
+	  if (ira_subloop_allocnos_can_differ_p (parent_a))
+	    spill_cost = (border_costs.spill_inside_loop_cost ()
+			  + ALLOCNO_MEMORY_COST (a));
+
 	  if (! ALLOCNO_BAD_SPILL_P (a))
 	    ALLOCNO_BAD_SPILL_P (parent_a) = false;
 	  ALLOCNO_NREFS (parent_a) += ALLOCNO_NREFS (a);
 	  ALLOCNO_FREQ (parent_a) += ALLOCNO_FREQ (a);
+
+	  /* If A's allocation can differ from PARENT_A's, we can if necessary
+	     spill PARENT_A on entry to A's loop and restore it afterwards.
+	     Doing that has cost SPILL_COST.  */
+	  if (!ira_subloop_allocnos_can_differ_p (parent_a))
+	    merge_hard_reg_conflicts (a, parent_a, true);
+
 	  ALLOCNO_CALL_FREQ (parent_a) += ALLOCNO_CALL_FREQ (a);
-	  merge_hard_reg_conflicts (a, parent_a, true);
 	  ALLOCNO_CALLS_CROSSED_NUM (parent_a)
 	    += ALLOCNO_CALLS_CROSSED_NUM (a);
 	  ALLOCNO_CHEAP_CALLS_CROSSED_NUM (parent_a)
@@ -2037,15 +2082,15 @@ propagate_allocno_info (void)
 	    += ALLOCNO_EXCESS_PRESSURE_POINTS_NUM (a);
 	  aclass = ALLOCNO_CLASS (a);
 	  ira_assert (aclass == ALLOCNO_CLASS (parent_a));
-	  ira_allocate_and_accumulate_costs
-	    (&ALLOCNO_HARD_REG_COSTS (parent_a), aclass,
-	     ALLOCNO_HARD_REG_COSTS (a));
+	  ira_propagate_hard_reg_costs (parent_a, a, spill_cost);
 	  ira_allocate_and_accumulate_costs
 	    (&ALLOCNO_CONFLICT_HARD_REG_COSTS (parent_a),
 	     aclass,
 	     ALLOCNO_CONFLICT_HARD_REG_COSTS (a));
+	  /* The cost to A of allocating a register to PARENT_A can't be
+	     more than the cost of spilling the register throughout A.  */
 	  ALLOCNO_CLASS_COST (parent_a)
-	    += ALLOCNO_CLASS_COST (a);
+	    += MIN (ALLOCNO_CLASS_COST (a), spill_cost);
 	  ALLOCNO_MEMORY_COST (parent_a) += ALLOCNO_MEMORY_COST (a);
 	}
 }
diff --git a/gcc/ira-color.c b/gcc/ira-color.c
index ae0f08af4b3..4344ee6689e 100644
--- a/gcc/ira-color.c
+++ b/gcc/ira-color.c
@@ -3344,7 +3344,7 @@ color_pass (ira_loop_tree_node_t loop_tree_node)
   unsigned int j;
   bitmap_iterator bi;
   machine_mode mode;
-  enum reg_class rclass, aclass, pclass;
+  enum reg_class rclass, aclass;
   ira_allocno_t a, subloop_allocno;
   ira_loop_tree_node_t subloop_node;
 
@@ -3389,10 +3389,9 @@ color_pass (ira_loop_tree_node_t loop_tree_node)
 	/* Remove from processing in the next loop.  */
 	bitmap_clear_bit (consideration_allocno_bitmap, j);
 	rclass = ALLOCNO_CLASS (a);
-	pclass = ira_pressure_class_translate[rclass];
-	if (flag_ira_region == IRA_REGION_MIXED
-	    && (loop_tree_node->reg_pressure[pclass]
-		<= ira_class_hard_regs_num[pclass]))
+	subloop_allocno = ALLOCNO_CAP_MEMBER (a);
+	subloop_node = ALLOCNO_LOOP_TREE_NODE (subloop_allocno);
+	if (ira_single_region_allocno_p (a, subloop_allocno))
 	  {
 	    mode = ALLOCNO_MODE (a);
 	    hard_regno = ALLOCNO_HARD_REGNO (a);
@@ -3402,8 +3401,6 @@ color_pass (ira_loop_tree_node_t loop_tree_node)
 		ira_assert (index >= 0);
 	      }
 	    regno = ALLOCNO_REGNO (a);
-	    subloop_allocno = ALLOCNO_CAP_MEMBER (a);
-	    subloop_node = ALLOCNO_LOOP_TREE_NODE (subloop_allocno);
 	    ira_assert (!ALLOCNO_ASSIGNED_P (subloop_allocno));
 	    ALLOCNO_HARD_REGNO (subloop_allocno) = hard_regno;
 	    ALLOCNO_ASSIGNED_P (subloop_allocno) = true;
@@ -3426,7 +3423,6 @@ color_pass (ira_loop_tree_node_t loop_tree_node)
 	  ira_assert (ALLOCNO_CAP_MEMBER (a) == NULL);
 	  mode = ALLOCNO_MODE (a);
 	  rclass = ALLOCNO_CLASS (a);
-	  pclass = ira_pressure_class_translate[rclass];
 	  hard_regno = ALLOCNO_HARD_REGNO (a);
 	  /* Use hard register class here.  ??? */
 	  if (hard_regno >= 0)
@@ -3443,11 +3439,11 @@ color_pass (ira_loop_tree_node_t loop_tree_node)
 	  ira_assert (ALLOCNO_CLASS (subloop_allocno) == rclass);
 	  ira_assert (bitmap_bit_p (subloop_node->all_allocnos,
 				    ALLOCNO_NUM (subloop_allocno)));
-	  if ((flag_ira_region == IRA_REGION_MIXED
-	       && (loop_tree_node->reg_pressure[pclass]
-		   <= ira_class_hard_regs_num[pclass]))
+	  if (ira_single_region_allocno_p (a, subloop_allocno)
 	      || !ira_subloop_allocnos_can_differ_p (a, hard_regno >= 0))
 	    {
+	      gcc_assert (!ALLOCNO_MIGHT_CONFLICT_WITH_PARENT_P
+			  (subloop_allocno));
 	      if (! ALLOCNO_ASSIGNED_P (subloop_allocno))
 		{
 		  ALLOCNO_HARD_REGNO (subloop_allocno) = hard_regno;
@@ -3583,14 +3579,35 @@ move_spill_restore (void)
 	      if (subloop_allocno == NULL)
 		continue;
 	      ira_assert (rclass == ALLOCNO_CLASS (subloop_allocno));
-	      /* We have accumulated cost.  To get the real cost of
-		 allocno usage in the loop we should subtract costs of
-		 the subloop allocnos.  */
-	      cost -= (ALLOCNO_MEMORY_COST (subloop_allocno)
-		       - (ALLOCNO_HARD_REG_COSTS (subloop_allocno) == NULL
-			  ? ALLOCNO_CLASS_COST (subloop_allocno)
-			  : ALLOCNO_HARD_REG_COSTS (subloop_allocno)[index]));
 	      ira_loop_border_costs border_costs (subloop_allocno);
+
+	      /* We have accumulated cost.  To get the real cost of
+		 allocno usage in the loop we should subtract the costs
+		 added by propagate_allocno_info for the subloop allocnos.  */
+	      int reg_cost
+		= (ALLOCNO_HARD_REG_COSTS (subloop_allocno) == NULL
+		   ? ALLOCNO_CLASS_COST (subloop_allocno)
+		   : ALLOCNO_HARD_REG_COSTS (subloop_allocno)[index]);
+
+	      int spill_cost
+		= (border_costs.spill_inside_loop_cost ()
+		   + ALLOCNO_MEMORY_COST (subloop_allocno));
+
+	      /* If HARD_REGNO conflicts with SUBLOOP_A then
+		 propagate_allocno_info will have propagated
+		 the cost of spilling HARD_REGNO in SUBLOOP_NODE.
+		 (ira_subloop_allocnos_can_differ_p must be true
+		 in that case.)  Otherwise, SPILL_COST acted as
+		 a cap on the propagated register cost, in cases
+		 where the allocations can differ.  */
+	      auto conflicts = ira_total_conflict_hard_regs (subloop_allocno);
+	      if (TEST_HARD_REG_BIT (conflicts, hard_regno))
+		reg_cost = spill_cost;
+	      else if (ira_subloop_allocnos_can_differ_p (a))
+		reg_cost = MIN (reg_cost, spill_cost);
+
+	      cost -= ALLOCNO_MEMORY_COST (subloop_allocno) - reg_cost;
+
 	      if ((hard_regno2 = ALLOCNO_HARD_REGNO (subloop_allocno)) < 0)
 		/* The register was spilled in the subloop.  If we spill
 		   it in the outer loop too then we'll no longer need to
diff --git a/gcc/ira-int.h b/gcc/ira-int.h
index c5b1a131abd..8b87498f77f 100644
--- a/gcc/ira-int.h
+++ b/gcc/ira-int.h
@@ -314,6 +314,13 @@ struct ira_allocno
      vector where a bit with given index represents allocno with the
      same number.  */
   unsigned int conflict_vec_p : 1;
+  /* True if the parent loop has an allocno for the same register and
+     if the parent allocno's assignment might not be valid in this loop.
+     This means that we cannot merge this allocno and the parent allocno
+     together.
+
+     This is only ever true for non-cap allocnos.  */
+  unsigned int might_conflict_with_parent_p : 1;
   /* Hard register assigned to given allocno.  Negative value means
      that memory was allocated to the allocno.  During the reload,
      spilled allocno has value equal to the corresponding stack slot
@@ -423,6 +430,8 @@ struct ira_allocno
 #define ALLOCNO_CAP_MEMBER(A) ((A)->cap_member)
 #define ALLOCNO_NREFS(A) ((A)->nrefs)
 #define ALLOCNO_FREQ(A) ((A)->freq)
+#define ALLOCNO_MIGHT_CONFLICT_WITH_PARENT_P(A) \
+  ((A)->might_conflict_with_parent_p)
 #define ALLOCNO_HARD_REGNO(A) ((A)->hard_regno)
 #define ALLOCNO_CALL_FREQ(A) ((A)->call_freq)
 #define ALLOCNO_CALLS_CROSSED_NUM(A) ((A)->calls_crossed_num)
@@ -1623,4 +1632,32 @@ ira_subloop_allocnos_can_differ_p (ira_allocno_t a, bool allocated_p = true)
   return true;
 }
 
+/* Return true if we should treat A and SUBLOOP_A as belonging to a
+   single region.  */
+inline bool
+ira_single_region_allocno_p (ira_allocno_t a, ira_allocno_t subloop_a)
+{
+  if (flag_ira_region != IRA_REGION_MIXED)
+    return false;
+
+  if (ALLOCNO_MIGHT_CONFLICT_WITH_PARENT_P (subloop_a))
+    return false;
+
+  auto rclass = ALLOCNO_CLASS (a);
+  auto pclass = ira_pressure_class_translate[rclass];
+  auto loop_used_regs = ALLOCNO_LOOP_TREE_NODE (a)->reg_pressure[pclass];
+  return loop_used_regs <= ira_class_hard_regs_num[pclass];
+}
+
+/* Return the set of all hard registers that conflict with A.  */
+inline HARD_REG_SET
+ira_total_conflict_hard_regs (ira_allocno_t a)
+{
+  auto obj_0 = ALLOCNO_OBJECT (a, 0);
+  HARD_REG_SET conflicts = OBJECT_TOTAL_CONFLICT_HARD_REGS (obj_0);
+  for (int i = 1; i < ALLOCNO_NUM_OBJECTS (a); i++)
+    conflicts |= OBJECT_TOTAL_CONFLICT_HARD_REGS (ALLOCNO_OBJECT (a, i));
+  return conflicts;
+}
+
 #endif /* GCC_IRA_INT_H */
diff --git a/gcc/testsuite/gcc.target/aarch64/reg-alloc-2.c b/gcc/testsuite/gcc.target/aarch64/reg-alloc-2.c
new file mode 100644
index 00000000000..d4d260c96b5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/reg-alloc-2.c
@@ -0,0 +1,47 @@
+/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#define PROB 0.1
+
+struct L
+{
+  int data;
+  volatile struct L *next;
+  volatile struct L *inner;
+};
+
+/* The thing we're testing here is that the !head->inner path of the outer loop
+   body has no stack accesses.  It's possible that we'll need to update this
+   pattern for unrelated code changes, but the test should be XFAILed rather
+   than changed if any new stack accesses occur on the !head->inner path.  */
+/*
+** foo:
+**	...
+**	ldr	(w[0-9]+), \[(x[0-9]+)\]
+**	add	(w[0-9]+), (?:\3, \1|\1, \3)
+**	ldr	(x[0-9]+), \[\2, #?16\]
+**	str	\3, \[\2\]
+**	ldr	\2, \[\2, #?8\]
+**	cbn?z	\4, .*
+**	...
+**	ret
+*/
+void
+foo (volatile struct L *head, int inc)
+{
+  while (head)
+    {
+      inc = head->data + inc;
+      volatile struct L *inner = head->inner;
+      head->data = inc;
+      head = head->next;
+      if (__builtin_expect_with_probability (inner != 0, 0, PROB))
+	for (int i = 0; i < 1000; ++i)
+	  /* Leave x30 for i.  */
+	  asm volatile ("// foo" :::
+			"x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7",
+			"x8", "x9", "x10", "x11", "x12", "x13", "x14", "x15",
+			"x16", "x17", "x18", "x19", "x20", "x21", "x22", "x23",
+			"x24", "x25", "x26", "x27", "x28");
+    }
+}
-- 
2.25.1



* [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills
  2022-01-06 14:45 [PATCH 0/6] ira: Fix performance regression in exchange2 [PR98782] Richard Sandiford
                   ` (3 preceding siblings ...)
  2022-01-06 14:47 ` [PATCH 4/6] ira: Try to avoid propagating conflicts Richard Sandiford
@ 2022-01-06 14:48 ` Richard Sandiford
  2022-01-10 13:51   ` Vladimir Makarov
                     ` (2 more replies)
  2022-01-06 14:48 ` [PATCH 6/6] ira: Handle "soft" conflicts between cap and non-cap allocnos Richard Sandiford
  2022-01-07 14:38 ` [PATCH 0/6] ira: Fix performance regression in exchange2 [PR98782] Vladimir Makarov
  6 siblings, 3 replies; 24+ messages in thread
From: Richard Sandiford @ 2022-01-06 14:48 UTC (permalink / raw)
  To: gcc-patches

If an allocno A in an inner loop L spans a call, a parent allocno AP
can choose to handle a call-clobbered/caller-saved hard register R
in one of two ways:

(1) save R before each call in L and restore R after each call
(2) spill R to memory throughout L

(2) can be cheaper than (1) in some cases, particularly if L does
not reference A.

Before the patch we always did (1).  The patch adds support for
picking (2) instead, when it seems cheaper.  It builds on the
earlier support for not propagating conflicts to parent allocnos.
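For reference, the (1)-vs-(2) decision above boils down to comparing a
frequency-weighted per-call save/restore cost against the one-off cost of
spilling for the whole loop.  The sketch below is a simplified standalone
model with hypothetical names and parameters, not the patch's actual code
(the real logic, ira_caller_save_loop_spill_p, uses IRA's memory move cost
tables and an ira_subloop_allocnos_can_differ_p check):

```c
#include <assert.h>

/* Return nonzero if approach (2) (spill the register to memory
   throughout the loop) looks no more expensive than approach (1)
   (save before and restore after each call).  All names and
   parameters here are hypothetical simplifications.  */
static int
prefer_loop_spill_p (int call_freq, int load_cost, int store_cost,
		     int loop_spill_cost)
{
  /* Approach (1): one save and one restore around each call,
     weighted by how often the calls execute.  */
  int call_cost = call_freq * (load_cost + store_cost);

  /* Prefer approach (2) only if there are calls at all and the
     whole-loop spill is no more expensive.  */
  return call_cost != 0 && call_cost >= loop_spill_cost;
}
```

With frequent calls the per-call cost dominates and the loop spill wins;
with no calls crossed there is nothing to save, so the register is kept.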

gcc/
	PR rtl-optimization/98782
	* ira-int.h (ira_caller_save_cost): New function.
	(ira_caller_save_loop_spill_p): Likewise.
	* ira-build.c (ira_propagate_hard_reg_costs): Test whether it is
	cheaper to spill a call-clobbered register throughout a loop rather
	than spill it around each individual call.  If so, treat all
	call-clobbered registers as conflicts and...
	(propagate_allocno_info): ...do not propagate call information
	from the child to the parent.
	* ira-color.c (move_spill_restore): Update accordingly.
	* ira-costs.c (ira_tune_allocno_costs): Use ira_caller_save_cost.

gcc/testsuite/
	* gcc.target/aarch64/reg-alloc-3.c: New test.
---
 gcc/ira-build.c                               | 23 ++++---
 gcc/ira-color.c                               | 13 ++--
 gcc/ira-costs.c                               |  7 +-
 gcc/ira-int.h                                 | 39 +++++++++++
 .../gcc.target/aarch64/reg-alloc-3.c          | 65 +++++++++++++++++++
 5 files changed, 129 insertions(+), 18 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c

diff --git a/gcc/ira-build.c b/gcc/ira-build.c
index 875b4d8ed7c..ab3e87164e1 100644
--- a/gcc/ira-build.c
+++ b/gcc/ira-build.c
@@ -2000,6 +2000,8 @@ ira_propagate_hard_reg_costs (ira_allocno_t parent_a, ira_allocno_t a,
 			      int spill_cost)
 {
   HARD_REG_SET conflicts = ira_total_conflict_hard_regs (a);
+  if (ira_caller_save_loop_spill_p (parent_a, a, spill_cost))
+    conflicts |= ira_need_caller_save_regs (a);
   conflicts &= ~ira_total_conflict_hard_regs (parent_a);
 
   auto costs = ALLOCNO_HARD_REG_COSTS (a);
@@ -2069,15 +2071,18 @@ propagate_allocno_info (void)
 	  if (!ira_subloop_allocnos_can_differ_p (parent_a))
 	    merge_hard_reg_conflicts (a, parent_a, true);
 
-	  ALLOCNO_CALL_FREQ (parent_a) += ALLOCNO_CALL_FREQ (a);
-	  ALLOCNO_CALLS_CROSSED_NUM (parent_a)
-	    += ALLOCNO_CALLS_CROSSED_NUM (a);
-	  ALLOCNO_CHEAP_CALLS_CROSSED_NUM (parent_a)
-	    += ALLOCNO_CHEAP_CALLS_CROSSED_NUM (a);
-	  ALLOCNO_CROSSED_CALLS_ABIS (parent_a)
-	    |= ALLOCNO_CROSSED_CALLS_ABIS (a);
-	  ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (parent_a)
-	    |= ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (a);
+	  if (!ira_caller_save_loop_spill_p (parent_a, a, spill_cost))
+	    {
+	      ALLOCNO_CALL_FREQ (parent_a) += ALLOCNO_CALL_FREQ (a);
+	      ALLOCNO_CALLS_CROSSED_NUM (parent_a)
+		+= ALLOCNO_CALLS_CROSSED_NUM (a);
+	      ALLOCNO_CHEAP_CALLS_CROSSED_NUM (parent_a)
+		+= ALLOCNO_CHEAP_CALLS_CROSSED_NUM (a);
+	      ALLOCNO_CROSSED_CALLS_ABIS (parent_a)
+		|= ALLOCNO_CROSSED_CALLS_ABIS (a);
+	      ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (parent_a)
+		|= ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (a);
+	    }
 	  ALLOCNO_EXCESS_PRESSURE_POINTS_NUM (parent_a)
 	    += ALLOCNO_EXCESS_PRESSURE_POINTS_NUM (a);
 	  aclass = ALLOCNO_CLASS (a);
diff --git a/gcc/ira-color.c b/gcc/ira-color.c
index 4344ee6689e..1487afc5ef1 100644
--- a/gcc/ira-color.c
+++ b/gcc/ira-color.c
@@ -3597,11 +3597,16 @@ move_spill_restore (void)
 		 propagate_allocno_info will have propagated
 		 the cost of spilling HARD_REGNO in SUBLOOP_NODE.
 		 (ira_subloop_allocnos_can_differ_p must be true
-		 in that case.)  Otherwise, SPILL_COST acted as
-		 a cap on the propagated register cost, in cases
-		 where the allocations can differ.  */
+		 in that case.)  If HARD_REGNO is a caller-saved
+		 register, we might have modelled it in the same way.
+
+		 Otherwise, SPILL_COST acted as a cap on the propagated
+		 register cost, in cases where the allocations can differ.  */
 	      auto conflicts = ira_total_conflict_hard_regs (subloop_allocno);
-	      if (TEST_HARD_REG_BIT (conflicts, hard_regno))
+	      if (TEST_HARD_REG_BIT (conflicts, hard_regno)
+		  || (ira_need_caller_save_p (subloop_allocno, hard_regno)
+		      && ira_caller_save_loop_spill_p (a, subloop_allocno,
+						       spill_cost)))
 		reg_cost = spill_cost;
 	      else if (ira_subloop_allocnos_can_differ_p (a))
 		reg_cost = MIN (reg_cost, spill_cost);
diff --git a/gcc/ira-costs.c b/gcc/ira-costs.c
index 280befc58da..cbb58d32be8 100644
--- a/gcc/ira-costs.c
+++ b/gcc/ira-costs.c
@@ -2308,7 +2308,7 @@ ira_tune_allocno_costs (void)
 {
   int j, n, regno;
   int cost, min_cost, *reg_costs;
-  enum reg_class aclass, rclass;
+  enum reg_class aclass;
   machine_mode mode;
   ira_allocno_t a;
   ira_allocno_iterator ai;
@@ -2347,12 +2347,9 @@ ira_tune_allocno_costs (void)
 		}
 	      if (skip_p)
 		continue;
-	      rclass = REGNO_REG_CLASS (regno);
 	      cost = 0;
 	      if (ira_need_caller_save_p (a, regno))
-		cost += (ALLOCNO_CALL_FREQ (a)
-			 * (ira_memory_move_cost[mode][rclass][0]
-			    + ira_memory_move_cost[mode][rclass][1]));
+		cost += ira_caller_save_cost (a);
 #ifdef IRA_HARD_REGNO_ADD_COST_MULTIPLIER
 	      cost += ((ira_memory_move_cost[mode][rclass][0]
 			+ ira_memory_move_cost[mode][rclass][1])
diff --git a/gcc/ira-int.h b/gcc/ira-int.h
index 8b87498f77f..a78811eb416 100644
--- a/gcc/ira-int.h
+++ b/gcc/ira-int.h
@@ -1660,4 +1660,43 @@ ira_total_conflict_hard_regs (ira_allocno_t a)
   return conflicts;
 }
 
+/* Return the cost of saving a caller-saved register before each call
+   in A's live range and restoring the same register after each call.  */
+inline int
+ira_caller_save_cost (ira_allocno_t a)
+{
+  auto mode = ALLOCNO_MODE (a);
+  auto rclass = ALLOCNO_CLASS (a);
+  return (ALLOCNO_CALL_FREQ (a)
+	  * (ira_memory_move_cost[mode][rclass][0]
+	     + ira_memory_move_cost[mode][rclass][1]));
+}
+
+/* A and SUBLOOP_A are allocnos for the same pseudo register, with A's
+   loop immediately enclosing SUBLOOP_A's loop.  If we allocate to A a
+   hard register R that is clobbered by a call in SUBLOOP_A, decide
+   which of the following approaches should be used for handling the
+   conflict:
+
+   (1) Spill R on entry to SUBLOOP_A's loop, assign memory to SUBLOOP_A,
+       and restore R on exit from SUBLOOP_A's loop.
+
+   (2) Spill R before each necessary call in SUBLOOP_A's live range and
+       restore R after each such call.
+
+   Return true if (1) is better than (2).  SPILL_COST is the cost of
+   doing (1).  */
+inline bool
+ira_caller_save_loop_spill_p (ira_allocno_t a, ira_allocno_t subloop_a,
+			      int spill_cost)
+{
+  if (!ira_subloop_allocnos_can_differ_p (a))
+    return false;
+
+  /* Calculate the cost of saving a call-clobbered register
+     before each call and restoring it afterwards.  */
+  int call_cost = ira_caller_save_cost (subloop_a);
+  return call_cost && call_cost >= spill_cost;
+}
+
 #endif /* GCC_IRA_INT_H */
diff --git a/gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c b/gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c
new file mode 100644
index 00000000000..7acdc432b0c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/reg-alloc-3.c
@@ -0,0 +1,65 @@
+/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#define PROB 0.1
+
+struct L
+{
+  int data;
+  volatile struct L *next;
+  volatile struct L *inner;
+};
+
+void ext();
+
+/* The thing we're testing here is that the !head->inner path of the outer loop
+   body has no stack accesses.  It's possible that we'll need to update this
+   pattern for unrelated code changes, but the test should be XFAILed rather
+   than changed if any new stack accesses creep into the !head->inner path.  */
+/*
+** foo:
+**	...
+**	ldr	(w[0-9]+), \[(x[0-9]+)\]
+**	add	(w[0-9]+), (?:\3, \1|\1, \3)
+**	ldr	(x[0-9]+), \[\2, #?16\]
+**	str	\3, \[\2\]
+**	ldr	\2, \[\2, #?8\]
+**	cbn?z	\4, .*
+**	...
+**	ret
+*/
+void
+foo (volatile struct L *head, int inc, double *ptr)
+{
+  double d = *ptr;
+  while (head)
+    {
+      /* Clobber all call-preserved GPRs, so that the loop has to use
+	 call-clobbered GPRs if it is to avoid spilling.  */
+      asm volatile ("" :::
+		    "x19", "x20", "x21", "x22", "x23",
+		    "x24", "x25", "x26", "x27", "x28");
+      inc = head->data + inc;
+      volatile struct L *inner = head->inner;
+      head->data = inc;
+      head = head->next;
+      if (__builtin_expect_with_probability (inner != 0, 0, PROB))
+	for (int i = 0; i < 1000; ++i)
+	  {
+	    ext ();
+	    /* Hack to create high register pressure, so that IRA doesn't
+	       collapse this loop into the parent loop.  */
+	    d += 1;
+	    asm volatile ("// foo" :::
+			  "d0", "d1", "d2", "d3",
+			  "d4", "d5", "d6", "d7",
+			  "d8", "d9", "d10", "d11",
+			  "d12", "d13", "d14", "d15",
+			  "d16", "d17", "d18", "d19",
+			  "d20", "d21", "d22", "d23",
+			  "d24", "d25", "d26", "d27",
+			  "d28", "d29", "d30", "d31");
+	  }
+    }
+  *ptr = d;
+}
-- 
2.25.1



* [PATCH 6/6] ira: Handle "soft" conflicts between cap and non-cap allocnos
  2022-01-06 14:45 [PATCH 0/6] ira: Fix performance regression in exchange2 [PR98782] Richard Sandiford
                   ` (4 preceding siblings ...)
  2022-01-06 14:48 ` [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills Richard Sandiford
@ 2022-01-06 14:48 ` Richard Sandiford
  2022-01-10 13:52   ` Vladimir Makarov
  2022-01-07 14:38 ` [PATCH 0/6] ira: Fix performance regression in exchange2 [PR98782] Vladimir Makarov
  6 siblings, 1 reply; 24+ messages in thread
From: Richard Sandiford @ 2022-01-06 14:48 UTC (permalink / raw)
  To: gcc-patches

This patch looks for allocno conflicts of the following form:

- One allocno (X) is a cap allocno for some non-cap allocno X2.
- X2 belongs to some loop L2.
- The other allocno (Y) is a non-cap allocno.
- Y is an ancestor of some allocno Y2 in L2.
- Y2 is not referenced in L2 (that is, ALLOCNO_NREFS (Y2) == 0).
- Y can use a different allocation from Y2.

In this case, Y's register is live across L2 but is not used within it,
whereas X's register is used only within L2.  The conflict is therefore
only "soft", in that it can easily be avoided by spilling Y2 inside L2
without affecting any insn references.

In principle we could do this for ALLOCNO_NREFS (Y2) != 0 too, with the
callers then taking Y2's ALLOCNO_MEMORY_COST into account.  There would
then be no "cliff edge" between a Y2 that has no references and a Y2 that
has (say) a single cold reference.

However, doing that isn't necessary for the PR and seems to give
variable results in practice.  (fotonik3d_r improves slightly but
namd_r regresses slightly.)  It therefore seemed better to start
with the higher-value zero-reference case and see how things go.

On top of the previous patches in the series, this fixes the exchange2
regression seen in GCC 11.
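Mechanically, a recorded soft conflict only forces a spill if the chosen
hard registers actually clash with the conflicting allocation, which
reduces to a half-open interval intersection on hard register numbers
(the test used in spill_soft_conflicts).  A minimal standalone sketch,
with a hypothetical helper name:

```c
#include <assert.h>

/* Return nonzero if hard registers [hregno, hregno + nregs) overlap
   [conflict_hregno, conflict_hregno + conflict_nregs).  Two half-open
   intervals intersect iff each one starts before the other ends.  */
static int
hard_reg_ranges_overlap_p (int hregno, int nregs,
			   int conflict_hregno, int conflict_nregs)
{
  return hregno + nregs > conflict_hregno
	 && conflict_hregno + conflict_nregs > hregno;
}
```

Only when this holds for the chosen allocation does the subloop allocno
get marked with ALLOCNO_MIGHT_CONFLICT_WITH_PARENT_P.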

gcc/
	PR rtl-optimization/98782
	* ira-int.h (ira_soft_conflict): Declare.
	* ira-costs.c (max_soft_conflict_loop_depth): New constant.
	(ira_soft_conflict): New function.
	(spill_soft_conflicts): Likewise.
	(assign_hard_reg): Use them to handle the case described by
	the comment above ira_soft_conflict.
	(improve_allocation): Likewise.
	* ira.c (check_allocation): Allow allocnos with "soft" conflicts
	to share the same register.

gcc/testsuite/
	* gcc.target/aarch64/reg-alloc-4.c: New test.
---
 gcc/ira-color.c                               | 284 ++++++++++++++++--
 gcc/ira-int.h                                 |   1 +
 gcc/ira.c                                     |   2 +
 .../gcc.target/aarch64/reg-alloc-4.c          |  69 +++++
 4 files changed, 326 insertions(+), 30 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/reg-alloc-4.c

diff --git a/gcc/ira-color.c b/gcc/ira-color.c
index 1487afc5ef1..36f3f4d70f3 100644
--- a/gcc/ira-color.c
+++ b/gcc/ira-color.c
@@ -36,6 +36,11 @@ along with GCC; see the file COPYING3.  If not see
 #include "reload.h"
 #include "cfgloop.h"
 
+/* Cap to prevent soft conflict detection from becoming quadratic in
+   the loop depth.  Only very pathological cases hit it, so it hardly
+   seems worth a --param.  */
+const int max_soft_conflict_loop_depth = 64;
+
 typedef struct allocno_hard_regs *allocno_hard_regs_t;
 
 /* The structure contains information about hard registers can be
@@ -1707,6 +1712,167 @@ calculate_saved_nregs (int hard_regno, machine_mode mode)
   return nregs;
 }
 
+/* Allocnos A1 and A2 are known to conflict.  Check whether, in some loop L
+   that is either the current loop or a nested subloop, the conflict is of
+   the following form:
+
+   - One allocno (X) is a cap allocno for some non-cap allocno X2.
+
+   - X2 belongs to some loop L2.
+
+   - The other allocno (Y) is a non-cap allocno.
+
+   - Y is an ancestor of some allocno Y2 in L2.  (Note that such a Y2
+     must exist, given that X and Y conflict.)
+
+   - Y2 is not referenced in L2 (that is, ALLOCNO_NREFS (Y2) == 0).
+
+   - Y can use a different allocation from Y2.
+
+   In this case, Y's register is live across L2 but is not used within it,
+   whereas X's register is used only within L2.  The conflict is therefore
+   only "soft", in that it can easily be avoided by spilling Y2 inside L2
+   without affecting any insn references.
+
+   If the conflict does have this form, return the Y2 that would need to be
+   spilled in order to allow X and Y (and thus A1 and A2) to use the same
+   register.  Return null otherwise.  Returning null is conservatively correct;
+   any nonnull return value is an optimization.  */
+ira_allocno_t
+ira_soft_conflict (ira_allocno_t a1, ira_allocno_t a2)
+{
+  /* Search for the loop L and its associated allocnos X and Y.  */
+  int search_depth = 0;
+  while (ALLOCNO_CAP_MEMBER (a1) && ALLOCNO_CAP_MEMBER (a2))
+    {
+      a1 = ALLOCNO_CAP_MEMBER (a1);
+      a2 = ALLOCNO_CAP_MEMBER (a2);
+      if (search_depth++ > max_soft_conflict_loop_depth)
+	return nullptr;
+    }
+  /* This must be true if A1 and A2 conflict.  */
+  ira_assert (ALLOCNO_LOOP_TREE_NODE (a1) == ALLOCNO_LOOP_TREE_NODE (a2));
+
+  /* Make A1 the cap allocno (X in the comment above) and A2 the
+     non-cap allocno (Y in the comment above).  */
+  if (ALLOCNO_CAP_MEMBER (a2))
+    std::swap (a1, a2);
+  if (!ALLOCNO_CAP_MEMBER (a1))
+    return nullptr;
+
+  /* Search for the real allocno that A1 caps (X2 in the comment above).  */
+  do
+    {
+      a1 = ALLOCNO_CAP_MEMBER (a1);
+      if (search_depth++ > max_soft_conflict_loop_depth)
+	return nullptr;
+    }
+  while (ALLOCNO_CAP_MEMBER (a1));
+
+  /* Find the associated allocno for A2 (Y2 in the comment above).  */
+  auto node = ALLOCNO_LOOP_TREE_NODE (a1);
+  auto local_a2 = node->regno_allocno_map[ALLOCNO_REGNO (a2)];
+
+  /* Find the parent of LOCAL_A2/Y2.  LOCAL_A2 must be a descendant of A2
+     for the conflict query to make sense, so this parent lookup must succeed.
+
+     If the parent allocno has no references, it is usually cheaper to
+     spill at that loop level instead.  Keep searching until we find
+     a parent allocno that does have references (but don't look past
+     the starting allocno).  */
+  ira_allocno_t local_parent_a2;
+  for (;;)
+    {
+      local_parent_a2 = ira_parent_allocno (local_a2);
+      if (local_parent_a2 == a2 || ALLOCNO_NREFS (local_parent_a2) != 0)
+	break;
+      local_a2 = local_parent_a2;
+    }
+  if (CHECKING_P)
+    {
+      /* Sanity check to make sure that the conflict we've been given
+	 makes sense.  */
+      auto test_a2 = local_parent_a2;
+      while (test_a2 != a2)
+	{
+	  test_a2 = ira_parent_allocno (test_a2);
+	  ira_assert (test_a2);
+	}
+    }
+  if (local_a2
+      && ALLOCNO_NREFS (local_a2) == 0
+      && ira_subloop_allocnos_can_differ_p (local_parent_a2))
+    return local_a2;
+  return nullptr;
+}
+
+/* The caller has decided to allocate HREGNO to A and has proved that
+   this is safe.  However, the allocation might require the kind of
+   spilling described in the comment above ira_soft_conflict.
+   The caller has recorded that:
+
+   - The allocnos in ALLOCNOS_TO_SPILL are the ones that would need
+     to be spilled to satisfy soft conflicts for at least one allocation
+     (not necessarily HREGNO).
+
+   - The soft conflicts apply only to A allocations that overlap
+     SOFT_CONFLICT_REGS.
+
+   If allocating HREGNO is subject to any soft conflicts, record the
+   subloop allocnos that need to be spilled.  */
+static void
+spill_soft_conflicts (ira_allocno_t a, bitmap allocnos_to_spill,
+		      HARD_REG_SET soft_conflict_regs, int hregno)
+{
+  auto nregs = hard_regno_nregs (hregno, ALLOCNO_MODE (a));
+  bitmap_iterator bi;
+  unsigned int i;
+  EXECUTE_IF_SET_IN_BITMAP (allocnos_to_spill, 0, i, bi)
+    {
+      /* SPILL_A needs to be spilled for at least one allocation
+	 (not necessarily this one).  */
+      auto spill_a = ira_allocnos[i];
+
+      /* Find the corresponding allocno for this loop.  */
+      auto conflict_a = spill_a;
+      do
+	{
+	  conflict_a = ira_parent_or_cap_allocno (conflict_a);
+	  ira_assert (conflict_a);
+	}
+      while (ALLOCNO_LOOP_TREE_NODE (conflict_a)->level
+	     > ALLOCNO_LOOP_TREE_NODE (a)->level);
+
+      ira_assert (ALLOCNO_LOOP_TREE_NODE (conflict_a)
+		  == ALLOCNO_LOOP_TREE_NODE (a));
+
+      if (conflict_a == a)
+	{
+	  /* SPILL_A is a descendant of A.  We don't know (and don't need
+	     to know) which cap allocnos have a soft conflict with A.
+	     All we need to do is test whether the soft conflict applies
+	     to the chosen allocation.  */
+	  if (ira_hard_reg_set_intersection_p (hregno, ALLOCNO_MODE (a),
+					       soft_conflict_regs))
+	    ALLOCNO_MIGHT_CONFLICT_WITH_PARENT_P (spill_a) = true;
+	}
+      else
+	{
+	  /* SPILL_A is a descendant of CONFLICT_A, which has a soft conflict
+	     with A.  Test whether the soft conflict applies to the current
+	     allocation.  */
+	  ira_assert (ira_soft_conflict (a, conflict_a) == spill_a);
+	  auto conflict_hregno = ALLOCNO_HARD_REGNO (conflict_a);
+	  ira_assert (conflict_hregno >= 0);
+	  auto conflict_nregs = hard_regno_nregs (conflict_hregno,
+						  ALLOCNO_MODE (conflict_a));
+	  if (hregno + nregs > conflict_hregno
+	      && conflict_hregno + conflict_nregs > hregno)
+	    ALLOCNO_MIGHT_CONFLICT_WITH_PARENT_P (spill_a) = true;
+	}
+    }
+}
+
 /* Choose a hard register for allocno A.  If RETRY_P is TRUE, it means
    that the function called from function
    `ira_reassign_conflict_allocnos' and `allocno_reload_assign'.  In
@@ -1746,6 +1912,8 @@ assign_hard_reg (ira_allocno_t a, bool retry_p)
 #ifdef STACK_REGS
   bool no_stack_reg_p;
 #endif
+  auto_bitmap allocnos_to_spill;
+  HARD_REG_SET soft_conflict_regs = {};
 
   ira_assert (! ALLOCNO_ASSIGNED_P (a));
   get_conflict_and_start_profitable_regs (a, retry_p,
@@ -1833,23 +2001,56 @@ assign_hard_reg (ira_allocno_t a, bool retry_p)
 
 		  mode = ALLOCNO_MODE (conflict_a);
 		  conflict_nregs = hard_regno_nregs (hard_regno, mode);
-		  if (conflict_nregs == n_objects && conflict_nregs > 1)
+		  auto spill_a = (retry_p
+				  ? nullptr
+				  : ira_soft_conflict (a, conflict_a));
+		  if (spill_a)
 		    {
-		      int num = OBJECT_SUBWORD (conflict_obj);
-
-		      if (REG_WORDS_BIG_ENDIAN)
-			SET_HARD_REG_BIT (conflicting_regs[word],
-					  hard_regno + n_objects - num - 1);
-		      else
-			SET_HARD_REG_BIT (conflicting_regs[word],
-					  hard_regno + num);
+		      if (bitmap_set_bit (allocnos_to_spill,
+					  ALLOCNO_NUM (spill_a)))
+			{
+			  ira_loop_border_costs border_costs (spill_a);
+			  auto cost = border_costs.spill_inside_loop_cost ();
+			  auto note_conflict = [&](int r)
+			    {
+			      SET_HARD_REG_BIT (soft_conflict_regs, r);
+			      auto hri = ira_class_hard_reg_index[aclass][r];
+			      if (hri >= 0)
+				{
+				  costs[hri] += cost;
+				  full_costs[hri] += cost;
+				}
+			    };
+			  for (int r = hard_regno;
+			       r >= 0 && (int) end_hard_regno (mode, r) > hard_regno;
+			       r--)
+			    note_conflict (r);
+			  for (int r = hard_regno + 1;
+			       r < hard_regno + conflict_nregs;
+			       r++)
+			    note_conflict (r);
+			}
 		    }
 		  else
-		    conflicting_regs[word]
-		      |= ira_reg_mode_hard_regset[hard_regno][mode];
-		  if (hard_reg_set_subset_p (profitable_hard_regs,
-					     conflicting_regs[word]))
-		    goto fail;
+		    {
+		      if (conflict_nregs == n_objects && conflict_nregs > 1)
+			{
+			  int num = OBJECT_SUBWORD (conflict_obj);
+
+			  if (REG_WORDS_BIG_ENDIAN)
+			    SET_HARD_REG_BIT (conflicting_regs[word],
+					      hard_regno + n_objects - num - 1);
+			  else
+			    SET_HARD_REG_BIT (conflicting_regs[word],
+					      hard_regno + num);
+			}
+		      else
+			conflicting_regs[word]
+			  |= ira_reg_mode_hard_regset[hard_regno][mode];
+		      if (hard_reg_set_subset_p (profitable_hard_regs,
+						 conflicting_regs[word]))
+			goto fail;
+		    }
 		}
 	    }
 	  else if (! retry_p
@@ -1962,6 +2163,8 @@ assign_hard_reg (ira_allocno_t a, bool retry_p)
     {
       for (i = hard_regno_nregs (best_hard_regno, mode) - 1; i >= 0; i--)
 	allocated_hardreg_p[best_hard_regno + i] = true;
+      spill_soft_conflicts (a, allocnos_to_spill, soft_conflict_regs,
+			    best_hard_regno);
     }
   if (! retry_p)
     restore_costs_from_copies (a);
@@ -2983,6 +3186,8 @@ improve_allocation (void)
 	   assigning hard register to allocno A even without spilling
 	   conflicting allocnos.  */
 	continue;
+      auto_bitmap allocnos_to_spill;
+      HARD_REG_SET soft_conflict_regs = {};
       mode = ALLOCNO_MODE (a);
       nwords = ALLOCNO_NUM_OBJECTS (a);
       /* Process each allocno conflicting with A and update the cost
@@ -3008,31 +3213,49 @@ improve_allocation (void)
 	      ALLOCNO_COLOR_DATA (conflict_a)->temp = check;
 	      if ((conflict_hregno = ALLOCNO_HARD_REGNO (conflict_a)) < 0)
 		continue;
-	      spill_cost = ALLOCNO_UPDATED_MEMORY_COST (conflict_a);
-	      k = (ira_class_hard_reg_index
-		   [ALLOCNO_CLASS (conflict_a)][conflict_hregno]);
-	      ira_assert (k >= 0);
-	      if ((allocno_costs = ALLOCNO_HARD_REG_COSTS (conflict_a))
-		  != NULL)
-		spill_cost -= allocno_costs[k];
+	      auto spill_a = ira_soft_conflict (a, conflict_a);
+	      if (spill_a)
+		{
+		  if (!bitmap_set_bit (allocnos_to_spill,
+				       ALLOCNO_NUM (spill_a)))
+		    continue;
+		  ira_loop_border_costs border_costs (spill_a);
+		  spill_cost = border_costs.spill_inside_loop_cost ();
+		}
 	      else
-		spill_cost -= ALLOCNO_UPDATED_CLASS_COST (conflict_a);
-	      spill_cost
-		+= allocno_copy_cost_saving (conflict_a, conflict_hregno);
+		{
+		  spill_cost = ALLOCNO_UPDATED_MEMORY_COST (conflict_a);
+		  k = (ira_class_hard_reg_index
+		       [ALLOCNO_CLASS (conflict_a)][conflict_hregno]);
+		  ira_assert (k >= 0);
+		  if ((allocno_costs = ALLOCNO_HARD_REG_COSTS (conflict_a))
+		      != NULL)
+		    spill_cost -= allocno_costs[k];
+		  else
+		    spill_cost -= ALLOCNO_UPDATED_CLASS_COST (conflict_a);
+		  spill_cost
+		    += allocno_copy_cost_saving (conflict_a, conflict_hregno);
+		}
 	      conflict_nregs = hard_regno_nregs (conflict_hregno,
 						 ALLOCNO_MODE (conflict_a));
+	      auto note_conflict = [&](int r)
+		{
+		  if (check_hard_reg_p (a, r,
+					conflicting_regs, profitable_hard_regs))
+		    {
+		      if (spill_a)
+			SET_HARD_REG_BIT (soft_conflict_regs, r);
+		      costs[r] += spill_cost;
+		    }
+		};
 	      for (r = conflict_hregno;
 		   r >= 0 && (int) end_hard_regno (mode, r) > conflict_hregno;
 		   r--)
-		if (check_hard_reg_p (a, r,
-				      conflicting_regs, profitable_hard_regs))
-		  costs[r] += spill_cost;
+		note_conflict (r);
 	      for (r = conflict_hregno + 1;
 		   r < conflict_hregno + conflict_nregs;
 		   r++)
-		if (check_hard_reg_p (a, r,
-				      conflicting_regs, profitable_hard_regs))
-		  costs[r] += spill_cost;
+		note_conflict (r);
 	    }
 	}
       min_cost = INT_MAX;
@@ -3055,6 +3278,7 @@ improve_allocation (void)
 	   by spilling some conflicting allocnos does not improve the
 	   allocation cost.  */
 	continue;
+      spill_soft_conflicts (a, allocnos_to_spill, soft_conflict_regs, best);
       nregs = hard_regno_nregs (best, mode);
       /* Now spill conflicting allocnos which contain a hard register
 	 of A when we assign the best chosen hard register to it.  */
diff --git a/gcc/ira-int.h b/gcc/ira-int.h
index a78811eb416..e1e68025211 100644
--- a/gcc/ira-int.h
+++ b/gcc/ira-int.h
@@ -1067,6 +1067,7 @@ extern void ira_debug_conflicts (bool);
 extern void ira_build_conflicts (void);
 
 /* ira-color.c */
+extern ira_allocno_t ira_soft_conflict (ira_allocno_t, ira_allocno_t);
 extern void ira_debug_hard_regs_forest (void);
 extern int ira_loop_edge_freq (ira_loop_tree_node_t, int, bool);
 extern void ira_reassign_conflict_allocnos (int);
diff --git a/gcc/ira.c b/gcc/ira.c
index 6f65d20cfd9..3bece668692 100644
--- a/gcc/ira.c
+++ b/gcc/ira.c
@@ -2649,6 +2649,8 @@ check_allocation (void)
 	      int conflict_hard_regno = ALLOCNO_HARD_REGNO (conflict_a);
 	      if (conflict_hard_regno < 0)
 		continue;
+	      if (ira_soft_conflict (a, conflict_a))
+		continue;
 
 	      conflict_nregs = hard_regno_nregs (conflict_hard_regno,
 						 ALLOCNO_MODE (conflict_a));
diff --git a/gcc/testsuite/gcc.target/aarch64/reg-alloc-4.c b/gcc/testsuite/gcc.target/aarch64/reg-alloc-4.c
new file mode 100644
index 00000000000..ceb6f50de2d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/reg-alloc-4.c
@@ -0,0 +1,69 @@
+/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+#define PROB 0.1
+
+struct L
+{
+  int data;
+  volatile struct L *next;
+  volatile struct L *inner;
+};
+
+/* The thing we're testing here is that the !head->inner path of the outer loop
+   body has no stack accesses.  It's possible that we'll need to update this
+   pattern for unrelated code changes, but the test should be XFAILed rather
+   than changed if any new stack accesses occur on the !head->inner path.  */
+/*
+** foo:
+**	...
+**	ldr	(w[0-9]+), \[(x[0-9]+)\]
+**	add	(w[0-9]+), (?:\3, \1|\1, \3)
+**	ldr	(x[0-9]+), \[\2, #?16\]
+**	str	\3, \[\2\]
+**	ldr	\2, \[\2, #?8\]
+**	cbn?z	\4, .*
+**	...
+**	ret
+*/
+void
+foo (volatile struct L *head, int inc)
+{
+  while (head)
+    {
+      /* Clobber all call-preserved GPRs, so that the loop has to use
+	 call-clobbered GPRs if it is to avoid spilling.  */
+      asm volatile ("" :::
+		    "x19", "x20", "x21", "x22", "x23",
+		    "x24", "x25", "x26", "x27", "x28");
+      inc = head->data + inc;
+      volatile struct L *inner = head->inner;
+      head->data = inc;
+      head = head->next;
+      if (__builtin_expect_with_probability (inner != 0, 0, PROB))
+	for (int i = 0; i < 1000; ++i)
+	  asm volatile ("" ::			/* example allocation: */
+			"r" (i),		/* x0 */
+			"r" (inner),		/* x1 */
+			"r" (inner->next),	/* x2 */
+			"r" (inner->next),	/* x3 */
+			"r" (inner->next),	/* x4 */
+			"r" (inner->next),	/* x5 */
+			"r" (inner->next),	/* x6 */
+			"r" (inner->next),	/* x7 */
+			"r" (inner->next),	/* x8 */
+			"r" (inner->next),	/* x9 */
+			"r" (inner->next),	/* x10 */
+			"r" (inner->next),	/* x11 */
+			"r" (inner->next),	/* x12 */
+			"r" (inner->next),	/* x13 */
+			"r" (inner->next),	/* x14 */
+			"r" (inner->next),	/* x15 */
+			"r" (inner->next),	/* x16 */
+			"r" (inner->next),	/* x17 */
+			"r" (inner->next),	/* x18 */
+			"r" (inner->next) :	/* x30 */
+			"x19", "x20", "x21", "x22", "x23",
+			"x24", "x25", "x26", "x27", "x28");
+    }
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] ira: Add a ira_loop_border_costs class
  2022-01-06 14:46 ` [PATCH 1/6] ira: Add a ira_loop_border_costs class Richard Sandiford
@ 2022-01-06 15:08   ` Jan Hubicka
  2022-01-07 11:12     ` Richard Sandiford
  2022-01-07 14:38   ` Vladimir Makarov
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Hubicka @ 2022-01-06 15:08 UTC (permalink / raw)
  To: gcc-patches, vmakarov, richard.sandiford

> The final index into (ira_)memory_move_cost is 1 for loads and
> 0 for stores.  Thus the combination:
> 
>   entry_freq * memory_cost[1] + exit_freq * memory_cost[0]
> 
> is the cost of loading a register on entry to a loop and
> storing it back on exit from the loop.  This is the cost to
> use if the register is successfully allocated within the
> loop but is spilled in the parent loop.  Similarly:
> 
>   entry_freq * memory_cost[0] + exit_freq * memory_cost[1]
> 
> is the cost of storing a register on entry to the loop and
> restoring it on exit from the loop.  This is the cost to
> use if the register is spilled within the loop but is
> successfully allocated in the parent loop.
> 
> The patch adds a helper class for calculating these values and
> mechanically replaces the existing instances.  There is no attempt to
> editorialise the choice between using “spill inside” and “spill outside”
> costs.  (I think one of them is the wrong way round, but a later patch
> deals with that.)
> 
> No functional change intended.
> 
> gcc/
> 	PR rtl-optimization/98782
> 	* ira-int.h (ira_loop_border_costs): New class.
> 	* ira-color.c (ira_loop_border_costs::ira_loop_border_costs):
> 	New constructor.
> 	(calculate_allocno_spill_cost): Use ira_loop_border_costs.
> 	(color_pass): Likewise.
> 	(move_spill_restore): Likewise.

Thanks for working on this.  For profile bits, the patch looks good to
me.  In general I am trying to move away from the integer frequencies.
It would be more precise to calculate the m_entry_freq and m_exit_freq
as profile_count m_entry_count, m_exit_count
and do conversion only later.

count->to_frequency method basically scales it to the range 0...BB_FREQ_MAX.

Honza


* Re: [PATCH 1/6] ira: Add a ira_loop_border_costs class
  2022-01-06 15:08   ` Jan Hubicka
@ 2022-01-07 11:12     ` Richard Sandiford
  0 siblings, 0 replies; 24+ messages in thread
From: Richard Sandiford @ 2022-01-07 11:12 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, vmakarov

Jan Hubicka <hubicka@kam.mff.cuni.cz> writes:
>> The final index into (ira_)memory_move_cost is 1 for loads and
>> 0 for stores.  Thus the combination:
>> 
>>   entry_freq * memory_cost[1] + exit_freq * memory_cost[0]
>> 
>> is the cost of loading a register on entry to a loop and
>> storing it back on exit from the loop.  This is the cost to
>> use if the register is successfully allocated within the
>> loop but is spilled in the parent loop.  Similarly:
>> 
>>   entry_freq * memory_cost[0] + exit_freq * memory_cost[1]
>> 
>> is the cost of storing a register on entry to the loop and
>> restoring it on exit from the loop.  This is the cost to
>> use if the register is spilled within the loop but is
>> successfully allocated in the parent loop.
>> 
>> The patch adds a helper class for calculating these values and
>> mechanically replaces the existing instances.  There is no attempt to
>> editorialise the choice between using “spill inside” and “spill outside”
>> costs.  (I think one of them is the wrong way round, but a later patch
>> deals with that.)
>> 
>> No functional change intended.
>> 
>> gcc/
>> 	PR rtl-optimization/98782
>> 	* ira-int.h (ira_loop_border_costs): New class.
>> 	* ira-color.c (ira_loop_border_costs::ira_loop_border_costs):
>> 	New constructor.
>> 	(calculate_allocno_spill_cost): Use ira_loop_border_costs.
>> 	(color_pass): Likewise.
>> 	(move_spill_restore): Likewise.
>
> Thanks for working on this.  For profile bits, the patch looks good to
> me.  In general I am trying to move away from the integer frequencies.
> It would be more precise to calculate the m_entry_freq and m_exit_freq
> as profile_count m_entry_count, m_exit_count
> and do conversion only later.
>
> count->to_frequency method basically scales it to the range 0...BB_FREQ_MAX.

Yeah.  If I get time, I'd like to see how easy it would be to move the
RAs away from the BB_FREQ_MAX scaling.  I agree it's not really precise
enough.

The problem is that the costs are multiplied by several other things,
and there have recently been overflow problems (g:7d02c8bf75980fa2468f).
I guess the way to avoid that would be to use sreals, but the problem
then is that the costing code (particularly record_reg_classes) is very
compile-time sensitive.  So it might need a bit of surgery to move
over to sreal costs without negatively affecting compile time (and
perhaps memory consumption).

Of course, RA changes would become much easier if we got rid of
old reload. :-)  (OK, bit of a cheap shot in this case, but true
in general.)

Thanks,
Richard


* Re: [PATCH 0/6] ira: Fix performance regression in exchange2 [PR98782]
  2022-01-06 14:45 [PATCH 0/6] ira: Fix performance regression in exchange2 [PR98782] Richard Sandiford
                   ` (5 preceding siblings ...)
  2022-01-06 14:48 ` [PATCH 6/6] ira: Handle "soft" conflicts between cap and non-cap allocnos Richard Sandiford
@ 2022-01-07 14:38 ` Vladimir Makarov
  6 siblings, 0 replies; 24+ messages in thread
From: Vladimir Makarov @ 2022-01-07 14:38 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford


On 2022-01-06 09:45, Richard Sandiford wrote:
> This series of patches recovers the exchange2 performance lost in the
> GCC 11 timeframe (at least on aarch64 and Power9 -- thanks Pat for
> testing the latter).
>
> There are 6 patches, split into two groups of 3.  The first 3 are just
> preparatory patches, although patch 2 does contain a minor bug fix.
> The other 3 patches are the ones that together fix the regression.
>
> I realise this is a bit invasive for stage 3.  However, the series is
> fixing a large regression in an important benchmark and AFAIK there are
> no known acceptable mitigations that we could apply instead.  I think
> the series is also working with concepts that already exist in IRA:
> it's really about tweaking the cost model used to control them.
>
> We also still have at least 3 months (more realistically 4 months) of
> testing before GCC 12 is released.  So perhaps one option would be to
> apply any approved version of the series now, but with the understanding
> that if there's significant fallout (more than a handful of small tweaks
> or fixes), we would simply revert the patches rather than trying to
> rework them in-situ.  The series is confined to IRA so reverting it
> should be simple.  Would that be OK?

Richard. thank you for working on these issues.

I don't think there is a problem with the GCC development stage here.
These are patches solving existing PR(s).  Of course it is better to do
such changes earlier in stage 3, so IMHO the timing is right.

I don't expect the changes to result in serious problems like wrong
code generation or RA crashes, as they are about improving RA
heuristics.  They can result in new GCC test failures on some targets
(we have many overconstrained tests expecting exact GCC output).  If
we are overwhelmed with new failures, we can revert the patches.

The first 3 patches are OK to commit.  I'll look at the remaining 3
this weekend and write you my opinion on Monday.  I don't think there
will be a problem with the last 3 patches; they clearly improve RA
heuristics.  I just need some time to think about them.

Thank you again for picking this difficult PR and working on it.

> Each patch bootstrapped & regression-tested individually on
> aarch64-linux-gnu.  Also tested as a series on aarch64_be-elf,
> arm-linux-gnueabihf, powerpc64le-linux-gnu, and x86_64-linux-gnu.
>



* Re: [PATCH 1/6] ira: Add a ira_loop_border_costs class
  2022-01-06 14:46 ` [PATCH 1/6] ira: Add a ira_loop_border_costs class Richard Sandiford
  2022-01-06 15:08   ` Jan Hubicka
@ 2022-01-07 14:38   ` Vladimir Makarov
  1 sibling, 0 replies; 24+ messages in thread
From: Vladimir Makarov @ 2022-01-07 14:38 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford


On 2022-01-06 09:46, Richard Sandiford wrote:
> The final index into (ira_)memory_move_cost is 1 for loads and
> 0 for stores.  Thus the combination:
>
>    entry_freq * memory_cost[1] + exit_freq * memory_cost[0]
>
> is the cost of loading a register on entry to a loop and
> storing it back on exit from the loop.  This is the cost to
> use if the register is successfully allocated within the
> loop but is spilled in the parent loop.  Similarly:
>
>    entry_freq * memory_cost[0] + exit_freq * memory_cost[1]
>
> is the cost of storing a register on entry to the loop and
> restoring it on exit from the loop.  This is the cost to
> use if the register is spilled within the loop but is
> successfully allocated in the parent loop.
>
> The patch adds a helper class for calculating these values and
> mechanically replaces the existing instances.  There is no attempt to
> editorialise the choice between using “spill inside” and “spill outside”
> costs.  (I think one of them is the wrong way round, but a later patch
> deals with that.)
>
> No functional change intended.
>
> gcc/
> 	PR rtl-optimization/98782
> 	* ira-int.h (ira_loop_border_costs): New class.
> 	* ira-color.c (ira_loop_border_costs::ira_loop_border_costs):
> 	New constructor.
> 	(calculate_allocno_spill_cost): Use ira_loop_border_costs.
> 	(color_pass): Likewise.
> 	(move_spill_restore): Likewise.
It is OK for me.



* Re: [PATCH 2/6] ira: Add comments and fix move_spill_restore calculation
  2022-01-06 14:46 ` [PATCH 2/6] ira: Add comments and fix move_spill_restore calculation Richard Sandiford
@ 2022-01-07 14:39   ` Vladimir Makarov
  0 siblings, 0 replies; 24+ messages in thread
From: Vladimir Makarov @ 2022-01-07 14:39 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford


On 2022-01-06 09:46, Richard Sandiford wrote:
> This patch adds comments to describe each use of ira_loop_border_costs.
> I think this highlights that move_spill_restore was using the wrong cost
> in one case, which came from tranposing [0] and [1] in the original
> (pre-ira_loop_border_costs) ira_memory_move_cost expressions.  The
> difference would only be noticeable on targets that distinguish between
> load and store costs.
>
> gcc/
> 	PR rtl-optimization/98782
> 	* ira-color.c (color_pass): Add comments to describe the spill costs.
> 	(move_spill_restore): Likewise.  Fix reversed calculation.
OK for me.  Thank you for fixing the cost typo.



* Re: [PATCH 3/6] ira: Add ira_subloop_allocnos_can_differ_p
  2022-01-06 14:47 ` [PATCH 3/6] ira: Add ira_subloop_allocnos_can_differ_p Richard Sandiford
@ 2022-01-07 14:40   ` Vladimir Makarov
  0 siblings, 0 replies; 24+ messages in thread
From: Vladimir Makarov @ 2022-01-07 14:40 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford


On 2022-01-06 09:47, Richard Sandiford wrote:
> color_pass has two instances of the same code for propagating non-cap
> assignments from parent loops to subloops.  This patch adds a helper
> function for testing when such propagations are required for correctness
> and uses it to remove the duplicated code.
>
> A later patch will use this in ira-build.c too, which is why the
> function is exported to ira-int.h.
>
> No functional change intended.
>
> gcc/
> 	PR rtl-optimization/98782
> 	* ira-int.h (ira_subloop_allocnos_can_differ_p): New function,
> 	extracted from...
> 	* ira-color.c (color_pass): ...here.
OK.



* Re: [PATCH 4/6] ira: Try to avoid propagating conflicts
  2022-01-06 14:47 ` [PATCH 4/6] ira: Try to avoid propagating conflicts Richard Sandiford
@ 2022-01-10 13:51   ` Vladimir Makarov
  2022-01-10 14:49     ` Richard Sandiford
  0 siblings, 1 reply; 24+ messages in thread
From: Vladimir Makarov @ 2022-01-10 13:51 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford


On 2022-01-06 09:47, Richard Sandiford wrote:
> Suppose that:
>
> - an inner loop L contains an allocno A
> - L clobbers hard register R while A is live
> - A's parent allocno is AP
>
> Previously, propagate_allocno_info would propagate conflict sets up the
> loop tree, so that the conflict between A and R would become a conflict
> between AP and R (and so on for ancestors of AP).
My thinking behind propagating hard register conflicts was to avoid
changing allocations on the region border as much as possible.  The
solution you are proposing might result in allocating R to the allocno
and creating moves/loads/stores on the region border when it would be
possible to assign R to another allocno and another hard reg to the
allocno in question.  As it is all about heuristics, it is hard to say
just by speculating how probable such a situation is and which
heuristic is better.  Checking credible benchmarks is the only
criterion for choosing heuristics, and it seems yours work better.
Thank you for putting deep thought into improving the existing
heuristics in this and the following patches, Richard.
> However, when IRA treats loops as separate allocation regions, it can
> decide on a loop-by-loop basis whether to allocate a register or spill
> to memory.  Conflicts in inner loops therefore don't need to become
> hard conflicts in parent loops.  Instead we can record that using the
> “conflicting” registers for the parent allocnos has a higher cost.
> In the example above, this higher cost is the sum of:
>
> - the cost of saving R on entry to L
> - the cost of keeping the pseudo register in memory throughout L
> - the cost of reloading R on exit from L
>
> This value is also a cap on the hard register cost that A can contribute
> to AP in general (not just for conflicts).  Whatever allocation we pick
> for AP, there is always the option of spilling that register to memory
> throughout L, so the cost to A of allocating a register to AP can't be
> more than the cost of spilling A.
>
> To take an extreme example: if allocating a register R2 to A is more
> expensive than spilling A to memory, ALLOCNO_HARD_REG_COSTS (A)[R2]
> could be (say) 2 times greater than ALLOCNO_MEMORY_COST (A) or 100
> times greater than ALLOCNO_MEMORY_COST (A).  But this scale factor
> doesn't matter to AP.  All that matters is that R2 is more expensive
> than memory for A, so that allocating R2 to AP should be costed as
> spilling A to memory (again assuming that A and AP are in different
> allocation regions).  Propagating a factor of 100 would distort the
> register costs for AP.
>
> move_spill_restore tries to undo the propagation done by
> propagate_allocno_info, so we need some extra processing there.
>
> gcc/
> 	PR rtl-optimization/98782
> 	* ira-int.h (ira_allocno::might_conflict_with_parent_p): New field.
> 	(ALLOCNO_MIGHT_CONFLICT_WITH_PARENT_P): New macro.
> 	(ira_single_region_allocno_p): New function.
> 	(ira_total_conflict_hard_regs): Likewise.
> 	* ira-build.c (ira_create_allocno): Initialize
> 	ALLOCNO_MIGHT_CONFLICT_WITH_PARENT_P.
> 	(ira_propagate_hard_reg_costs): New function.
> 	(propagate_allocno_info): Use it.  Try to avoid propagating
> 	hard register conflicts to parent allocnos if we can handle
> 	the conflicts by spilling instead.  Limit the propagated
> 	register costs to the cost of spilling throughout the child loop.
> 	* ira-color.c (color_pass): Use ira_single_region_allocno_p to
> 	test whether a child and parent allocno can share the same
> 	register.
> 	(move_spill_restore): Adjust for the new behavior of
> 	propagate_allocno_info.
>
> gcc/testsuite/
> 	* gcc.target/aarch64/reg-alloc-2.c: New test.
Thank you for the patch.  It is ok for me.



* Re: [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills
  2022-01-06 14:48 ` [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills Richard Sandiford
@ 2022-01-10 13:51   ` Vladimir Makarov
  2022-01-11  4:21   ` Hans-Peter Nilsson
  2022-01-11  8:38   ` Robin Dapp
  2 siblings, 0 replies; 24+ messages in thread
From: Vladimir Makarov @ 2022-01-10 13:51 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford


On 2022-01-06 09:48, Richard Sandiford wrote:
> If an allocno A in an inner loop L spans a call, a parent allocno AP
> can choose to handle a call-clobbered/caller-saved hard register R
> in one of two ways:
>
> (1) save R before each call in L and restore R after each call
> (2) spill R to memory throughout L
>
> (2) can be cheaper than (1) in some cases, particularly if L does
> not reference A.
>
> Before the patch we always did (1).  The patch adds support for
> picking (2) instead, when it seems cheaper.  It builds on the
> earlier support for not propagating conflicts to parent allocnos.
Another cost-calculation improvement for calls would be to take into
account that an allocno can be saved once and restored once around
several subsequent calls (e.g. in one BB).
> gcc/
> 	PR rtl-optimization/98782
> 	* ira-int.h (ira_caller_save_cost): New function.
> 	(ira_caller_save_loop_spill_p): Likewise.
> 	* ira-build.c (ira_propagate_hard_reg_costs): Test whether it is
> 	cheaper to spill a call-clobbered register throughout a loop rather
> 	than spill it around each individual call.  If so, treat all
> 	call-clobbered registers as conflicts and...
> 	(propagate_allocno_info): ...do not propagate call information
> 	from the child to the parent.
> 	* ira-color.c (move_spill_restore): Update accordingly.
> 	* ira-costs.c (ira_tune_allocno_costs): Use ira_caller_save_cost.
>
> gcc/testsuite/
> 	* gcc.target/aarch64/reg-alloc-3.c: New test.
OK for me.  Thank you for the patch.



* Re: [PATCH 6/6] ira: Handle "soft" conflicts between cap and non-cap allocnos
  2022-01-06 14:48 ` [PATCH 6/6] ira: Handle "soft" conflicts between cap and non-cap allocnos Richard Sandiford
@ 2022-01-10 13:52   ` Vladimir Makarov
  0 siblings, 0 replies; 24+ messages in thread
From: Vladimir Makarov @ 2022-01-10 13:52 UTC (permalink / raw)
  To: gcc-patches, richard.sandiford


On 2022-01-06 09:48, Richard Sandiford wrote:
> This patch looks for allocno conflicts of the following form:
>
> - One allocno (X) is a cap allocno for some non-cap allocno X2.
> - X2 belongs to some loop L2.
> - The other allocno (Y) is a non-cap allocno.
> - Y is an ancestor of some allocno Y2 in L2.
> - Y2 is not referenced in L2 (that is, ALLOCNO_NREFS (Y2) == 0).
> - Y can use a different allocation from Y2.
>
> In this case, Y's register is live across L2 but is not used within it,
> whereas X's register is used only within L2.  The conflict is therefore
> only "soft", in that it can easily be avoided by spilling Y2 inside L2
> without affecting any insn references.
>
> In principle we could do this for ALLOCNO_NREFS (Y2) != 0 too, with the
> callers then taking Y2's ALLOCNO_MEMORY_COST into account.  There would
> then be no "cliff edge" between a Y2 that has no references and a Y2 that
> has (say) a single cold reference.
>
> However, doing that isn't necessary for the PR and seems to give
> variable results in practice.  (fotonik3d_r improves slightly but
> namd_r regresses slightly.)  It therefore seemed better to start
> with the higher-value zero-reference case and see how things go.
>
> On top of the previous patches in the series, this fixes the exchange2
> regression seen in GCC 11.
>
> gcc/
> 	PR rtl-optimization/98782
> 	* ira-int.h (ira_soft_conflict): Declare.
> 	* ira-costs.c (max_soft_conflict_loop_depth): New constant.
> 	(ira_soft_conflict): New function.
> 	(spill_soft_conflicts): Likewise.
> 	(assign_hard_reg): Use them to handle the case described by
> 	the comment above ira_soft_conflict.
> 	(improve_allocation): Likewise.
> 	* ira.c (check_allocation): Allow allocnos with "soft" conflicts
> 	to share the same register.
>
> gcc/testsuite/
> 	* gcc.target/aarch64/reg-alloc-4.c: New test.

OK.  If something goes wrong with the patches (e.g. a lot of GCC
testsuite failures or performance degradation), we can revert just the
last 3 of them, as those are the ones actually changing the heuristics.
But I hope that will not be necessary.

Thank you again for working on the PR.  Fixing it required a big
effort in thinking, testing and benchmarking.




* Re: [PATCH 4/6] ira: Try to avoid propagating conflicts
  2022-01-10 13:51   ` Vladimir Makarov
@ 2022-01-10 14:49     ` Richard Sandiford
  0 siblings, 0 replies; 24+ messages in thread
From: Richard Sandiford @ 2022-01-10 14:49 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: gcc-patches

Hi Vlad,

Vladimir Makarov <vmakarov@redhat.com> writes:
> On 2022-01-06 09:47, Richard Sandiford wrote:
>> Suppose that:
>>
>> - an inner loop L contains an allocno A
>> - L clobbers hard register R while A is live
>> - A's parent allocno is AP
>>
>> Previously, propagate_allocno_info would propagate conflict sets up the
>> loop tree, so that the conflict between A and R would become a conflict
>> between AP and R (and so on for ancestors of AP).
> My thoughts for propagating hard register conflicts was to avoid 
> changing allocations on the region border as much as possible.  The 
> solution you are proposing might result in allocating R to the allocno 
> and creating moves/loads/stores on the region border when it would be 
> possible to assign R to another allocno and another hard reg to the 
> allocno in consideration.  As it is all about heuristics it is hard to 
> say just speculating what probability of such situation and what 
> heuristic is better.  Only checking credible benchmarks is a criterium 
> to choose heuristics. […]

Yeah, I agree with all of the above.  Any change to these heuristics is
likely to make some things worse: a strict improvement over the status
quo is essentially impossible.

I guess in principle, the new heuristic is more suited to those high
register pressure situations in which some spilling is inevitable.
The cases that are most likely to lose out under the new heuristics
are those where allocation is possible without spilling and where
recording the conflicts gave the allocator the extra “push” it needed to
prioritise the parent allocno over others with fewer subloop conflicts.
And the risks of that happening are probably greater if the target is
providing lop-sided costs.

(Talking of which, the aarch64 port is still providing equal costs
for loads and stores, which is something that we ought to look at.
But again, it's a sensitive heuristic, so tweaking it will need care.)

Thanks a lot for the quick reviews, really appreciate it.

I've now pushed the series.  Let's see what the fallout is :-)

Richard


* Re: [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills
  2022-01-06 14:48 ` [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills Richard Sandiford
  2022-01-10 13:51   ` Vladimir Makarov
@ 2022-01-11  4:21   ` Hans-Peter Nilsson
  2022-01-11  8:38   ` Robin Dapp
  2 siblings, 0 replies; 24+ messages in thread
From: Hans-Peter Nilsson @ 2022-01-11  4:21 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches, vmakarov

> From: Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org>
> Date: Thu, 6 Jan 2022 15:48:01 +0100

> If an allocno A in an inner loop L spans a call, a parent allocno AP
> can choose to handle a call-clobbered/caller-saved hard register R
> in one of two ways:
> 
> (1) save R before each call in L and restore R after each call
> (2) spill R to memory throughout L
> 
> (2) can be cheaper than (1) in some cases, particularly if L does
> not reference A.
> 
> Before the patch we always did (1).  The patch adds support for
> picking (2) instead, when it seems cheaper.  It builds on the
> earlier support for not propagating conflicts to parent allocnos.
> 
> gcc/
> 	PR rtl-optimization/98782
> 	* ira-int.h (ira_caller_save_cost): New function.
> 	(ira_caller_save_loop_spill_p): Likewise.
> 	* ira-build.c (ira_propagate_hard_reg_costs): Test whether it is
> 	cheaper to spill a call-clobbered register throughout a loop rather
> 	than spill it around each individual call.  If so, treat all
> 	call-clobbered registers as conflicts and...
> 	(propagate_allocno_info): ...do not propagate call information
> 	from the child to the parent.
> 	* ira-color.c (move_spill_restore): Update accordingly.
> 	* ira-costs.c (ira_tune_allocno_costs): Use ira_caller_save_cost.

I bisected a broken build for cris-elf to this patch.
Details in https://gcc.gnu.org/PR103974 are supposedly
sufficient to find a quick resolution.

(JFTR, as you're already CC:ed by your @gcc.gnu.org account.)

Perhaps some of these patches are better postponed for stage 1?

brgds, H-P


* Re: [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills
  2022-01-06 14:48 ` [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills Richard Sandiford
  2022-01-10 13:51   ` Vladimir Makarov
  2022-01-11  4:21   ` Hans-Peter Nilsson
@ 2022-01-11  8:38   ` Robin Dapp
  2022-01-11  8:52     ` Richard Sandiford
  2 siblings, 1 reply; 24+ messages in thread
From: Robin Dapp @ 2022-01-11  8:38 UTC (permalink / raw)
  To: gcc-patches, vmakarov, richard.sandiford

Hi Richard,

this causes a bootstrap error on s390 (where
IRA_HARD_REGNO_ADD_COST_MULTIPLIER is defined): rclass is used in the
#define-guarded area.  I guess you also wanted to move this into the
new function ira_caller_save_cost?

Regards
 Robin

--

../../gcc/ira-costs.c: In function ‘void ira_tune_allocno_costs()’:
../../gcc/ira-costs.c:2354:45: error: ‘rclass’ was not declared in this
scope; did you mean ‘aclass’?
 2354 |        cost += ((ira_memory_move_cost[mode][rclass][0]
      |                                             ^~~~~~
      |                                             aclass


* Re: [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills
  2022-01-11  8:38   ` Robin Dapp
@ 2022-01-11  8:52     ` Richard Sandiford
  2022-01-11 11:31       ` Robin Dapp
                         ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Richard Sandiford @ 2022-01-11  8:52 UTC (permalink / raw)
  To: Robin Dapp; +Cc: gcc-patches, vmakarov

[-- Attachment #1: Type: text/plain, Size: 519 bytes --]

Robin Dapp <rdapp@linux.ibm.com> writes:
> Hi Richard,
>
> this causes a bootstrap error on s390 (where
> IRA_HARD_REGNO_ADD_COST_MULTIPLIER is defined). rclass is used in the
> #define-guarded area.

Gah, sorry about that.

> I guess you also wanted to move this to the new function
> ira_caller_save_cost?

No, the IRA_HARD_REGNO_ADD_COST_MULTIPLIER heuristic is a separate thing.
It's just that I had to remove the rclass variable to allow bootstrap on
other targets.

Could you try the attached?

Thanks,
Richard



[-- Attachment #2: 0001-ira-Fix-s390-build.patch --]
[-- Type: text/x-diff, Size: 1400 bytes --]

From 74cca9d27da840dbb79aa9ed9edc6f529945d477 Mon Sep 17 00:00:00 2001
From: Richard Sandiford <richard.sandiford@arm.com>
Date: Tue, 11 Jan 2022 08:44:56 +0000
Subject: [PATCH] ira: Fix s390 build

My g:01f3e6a40e7202310abbeb41c345d325bd69554f broke the s390
build because the rclass variable was still needed by the
IRA_HARD_REGNO_ADD_COST_MULTIPLIER code.

gcc/
	* ira-costs.c (ira_tune_allocno_costs): Fix missing rclass
	definition in IRA_HARD_REGNO_ADD_COST_MULTIPLIER code.
---
 gcc/ira-costs.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/gcc/ira-costs.c b/gcc/ira-costs.c
index cbb58d32be8..1e4cf5a35e4 100644
--- a/gcc/ira-costs.c
+++ b/gcc/ira-costs.c
@@ -2351,10 +2351,13 @@ ira_tune_allocno_costs (void)
 	      if (ira_need_caller_save_p (a, regno))
 		cost += ira_caller_save_cost (a);
 #ifdef IRA_HARD_REGNO_ADD_COST_MULTIPLIER
-	      cost += ((ira_memory_move_cost[mode][rclass][0]
-			+ ira_memory_move_cost[mode][rclass][1])
-		       * ALLOCNO_FREQ (a)
-		       * IRA_HARD_REGNO_ADD_COST_MULTIPLIER (regno) / 2);
+	      {
+		auto rclass = REGNO_REG_CLASS (regno);
+		cost += ((ira_memory_move_cost[mode][rclass][0]
+			  + ira_memory_move_cost[mode][rclass][1])
+			 * ALLOCNO_FREQ (a)
+			 * IRA_HARD_REGNO_ADD_COST_MULTIPLIER (regno) / 2);
+	      }
 #endif
 	      if (INT_MAX - cost < reg_costs[j])
 		reg_costs[j] = INT_MAX;
-- 
2.25.1



* Re: [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills
  2022-01-11  8:52     ` Richard Sandiford
@ 2022-01-11 11:31       ` Robin Dapp
  2022-01-11 12:43       ` Martin Liška
  2022-01-11 12:47       ` Robin Dapp
  2 siblings, 0 replies; 24+ messages in thread
From: Robin Dapp @ 2022-01-11 11:31 UTC (permalink / raw)
  To: gcc-patches, vmakarov, richard.sandiford

Hi Richard,

> Could you try the attached?

Build and bootstrap look OK with it.  The testsuite shows lots of
fallout, but the proper bisect isn't finished yet.  The commit before
your series is still fine - the problem could also be after it, though.
Will report back later.

Thanks
 Robin


* Re: [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills
  2022-01-11  8:52     ` Richard Sandiford
  2022-01-11 11:31       ` Robin Dapp
@ 2022-01-11 12:43       ` Martin Liška
  2022-01-11 12:47       ` Robin Dapp
  2 siblings, 0 replies; 24+ messages in thread
From: Martin Liška @ 2022-01-11 12:43 UTC (permalink / raw)
  To: Robin Dapp, gcc-patches, vmakarov, richard.sandiford

On 1/11/22 09:52, Richard Sandiford via Gcc-patches wrote:
> Robin Dapp <rdapp@linux.ibm.com> writes:
>> Hi Richard,
>>
>> this causes a bootstrap error on s390 (where
>> IRA_HARD_REGNO_ADD_COST_MULTIPLIER is defined). rclass is used in the
>> #define-guarded area.
> 
> Gah, sorry about that.
> 
>> I guess you also wanted to move this to the new function
>> ira_caller_save_cost?
> 
> No, the IRA_HARD_REGNO_ADD_COST_MULTIPLIER heuristic is a separate thing.
> It's just that I had to remove the rclass variable to allow bootstrap on
> other targets.
> 
> Could you try the attached?

Hello.

I noticed the same failure. Please push the patch.

Thanks,
Martin

> 
> Thanks,
> Richard
> 
> 



* Re: [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills
  2022-01-11  8:52     ` Richard Sandiford
  2022-01-11 11:31       ` Robin Dapp
  2022-01-11 12:43       ` Martin Liška
@ 2022-01-11 12:47       ` Robin Dapp
  2022-01-11 12:49         ` Richard Sandiford
  2 siblings, 1 reply; 24+ messages in thread
From: Robin Dapp @ 2022-01-11 12:47 UTC (permalink / raw)
  To: gcc-patches, vmakarov, richard.sandiford

> Could you try the attached?

The series with the patch is OK from a testsuite point of view.  The
other problem appears later.

Regards
 Robin


* Re: [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills
  2022-01-11 12:47       ` Robin Dapp
@ 2022-01-11 12:49         ` Richard Sandiford
  0 siblings, 0 replies; 24+ messages in thread
From: Richard Sandiford @ 2022-01-11 12:49 UTC (permalink / raw)
  To: Robin Dapp; +Cc: gcc-patches, vmakarov

Robin Dapp <rdapp@linux.ibm.com> writes:
>> Could you try the attached?
>
> The series with the patch is OK from a testsuite point of view.  The
> other problem appears later.

OK, thanks for checking.  I've pushed the patch as obvious.

Richard


end of thread, other threads:[~2022-01-11 12:49 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-06 14:45 [PATCH 0/6] ira: Fix performance regression in exchange2 [PR98782] Richard Sandiford
2022-01-06 14:46 ` [PATCH 1/6] ira: Add a ira_loop_border_costs class Richard Sandiford
2022-01-06 15:08   ` Jan Hubicka
2022-01-07 11:12     ` Richard Sandiford
2022-01-07 14:38   ` Vladimir Makarov
2022-01-06 14:46 ` [PATCH 2/6] ira: Add comments and fix move_spill_restore calculation Richard Sandiford
2022-01-07 14:39   ` Vladimir Makarov
2022-01-06 14:47 ` [PATCH 3/6] ira: Add ira_subloop_allocnos_can_differ_p Richard Sandiford
2022-01-07 14:40   ` Vladimir Makarov
2022-01-06 14:47 ` [PATCH 4/6] ira: Try to avoid propagating conflicts Richard Sandiford
2022-01-10 13:51   ` Vladimir Makarov
2022-01-10 14:49     ` Richard Sandiford
2022-01-06 14:48 ` [PATCH 5/6] ira: Consider modelling caller-save allocations as loop spills Richard Sandiford
2022-01-10 13:51   ` Vladimir Makarov
2022-01-11  4:21   ` Hans-Peter Nilsson
2022-01-11  8:38   ` Robin Dapp
2022-01-11  8:52     ` Richard Sandiford
2022-01-11 11:31       ` Robin Dapp
2022-01-11 12:43       ` Martin Liška
2022-01-11 12:47       ` Robin Dapp
2022-01-11 12:49         ` Richard Sandiford
2022-01-06 14:48 ` [PATCH 6/6] ira: Handle "soft" conflicts between cap and non-cap allocnos Richard Sandiford
2022-01-10 13:52   ` Vladimir Makarov
2022-01-07 14:38 ` [PATCH 0/6] ira: Fix performance regression in exchange2 [PR98782] Vladimir Makarov
