public inbox for gcc@gcc.gnu.org
* PR80155: Code hoisting and register pressure
@ 2018-05-23  7:27 Prathamesh Kulkarni
  2018-05-23  8:29 ` Richard Biener
  0 siblings, 1 reply; 14+ messages in thread
From: Prathamesh Kulkarni @ 2018-05-23  7:27 UTC (permalink / raw)
  To: gcc, Richard Biener, Bin Cheng, Thomas Preudhomme

[-- Attachment #1: Type: text/plain, Size: 3490 bytes --]

Hi,
I am trying to work on PR80155, which exposes a problem with code
hoisting and register pressure on a leading embedded benchmark for ARM
cortex-m7, where code-hoisting causes an extra register spill.

I have attached two test-cases which (hopefully) are representative of
the original test-case.
The first one (trans_dfa.c) is bigger and somewhat similar to the
original test-case, and trans_dfa_2.c is a hand-reduced version of
trans_dfa.c. There are two spills with trans_dfa.c and one spill with
trans_dfa_2.c, due to its smaller number of cases.
The test-cases in the PR are probably not relevant.

Initially I thought the spill was happening because of "too many
hoistings" taking place in the original test-case, thus increasing the
register pressure, but it seems the spill is possibly caused by an
expression getting hoisted out of a block that is on a loop exit.

For example, the following hoistings take place with trans_dfa_2.c:

(1) Inserting expression in block 4 for code hoisting:
{mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)

(2) Inserting expression in block 4 for code hoisting: {plus_expr,_4,1} (0006)

(3) Inserting expression in block 4 for code hoisting:
{pointer_plus_expr,s_33,1} (0023)

(4) Inserting expression in block 3 for code hoisting:
{pointer_plus_expr,s_33,1} (0023)

The issue seems to be the hoisting of (*tab + 1), which consists of the
first two hoistings into block 4 from blocks 5 and 9, and which causes
the extra spill. I verified that by disabling hoisting into block 4,
which resulted in no extra spills.

I wonder if that's because the expression (*tab + 1) is getting hoisted
from blocks 5 and 9, which are on a loop exit? So an expression that was
previously computed in a block on a loop exit gets hoisted outside that
block, which possibly makes the allocator more defensive? Similarly,
disabling hoisting of expressions which appeared in blocks on a loop
exit in the original test-case prevented the extra spill. The other
hoistings didn't seem to matter.

If that's the case, would it make sense to add another constraint to
hoisting, namely not to hoist an expression if it appears in a block
that is on a loop exit (the exception being if the block has only a
single successor), at least for targets like cortex-m7 where the effect
of a spill is more pronounced?

I tried to write an untested prototype patch (approach-8.diff) along
these lines, to refuse hoisting if an expression appears in a block on
a loop exit. The patch adds a new map, pre_expr_block_d, that maps a
pre_expr to the set of blocks (as a bitmap) it appears in that are on a
loop exit, computed during compute_avail.
During do_hoist_insertion, it simply checks whether the bitmap of
blocks is non-empty.
It works for the attached test-cases and passes the ssa-pre-*.c and
ssa-hoist-*.c tests.

Alternatively, we could refrain from replacing an expression by its
leader in eliminate_dom_walker::before_dom_children if the expression
appears in a block on a loop exit.
In principle this is more general than hoisting, but I have restricted
it to only hoisted expressions to avoid being overly conservative.
Similarly, this constraint could be made conditional, only for targets
like cortex-m7. I have attached a prototype patch based on this
approach (approach-9.diff). Although it works for the attached
test-cases, it unfortunately ICEs with the original test-case in
tail-merge; I am working on that.

I am not sure, though, whether either of these approaches is in the
right direction, and I would be grateful for suggestions on the issue!

Thanks,
Prathamesh

[-- Attachment #2: trans_dfa.c --]
[-- Type: text/x-csrc, Size: 940 bytes --]

int foo(char **save_s, int *tab)
{
  char *s;
  int state = 0;
  int c;

  for (s = *save_s; *s; s++)
    {
      c = *s;
      switch (state)
	{
	  case 0:
	   if (c == 'A')
	     {
	       state = 1;
	       tab[0]++;
	     }
	   else
	     {
	       state = -1;
	       tab[0]++;
	     }
	  break;

	  case 1:
	    if (c == 'B')
	      {
		state = 2;
		tab[1]++;
	      }
	    else
	      {
		state = -1;
		tab[1]++;
	      }
	   break;

	  case 2:
	    if (c == 'C')
	      {
		state = 3;
		tab[2]++;
	      }
	    else
	      {
		state = -1;
		tab[2]++;
	      }
	   break;
	  
	  case 3:
	    if (c == 'D')
	      {
		state = 4;
		tab[3]++;
	      }
	    else
	      {
		state = -1;
		tab[3]++;
	      }
	  break;

	 case 4:
	    if (c == 'E')
	      {
		state = 5;
		tab[4]++;
	      }
	    else
	      {
		state = -1;
		tab[4]++;
	      }
	 break;
	 
	 default:
	   break;
	}
    }

  tab[state]++;
  *save_s = s;
  return state;
}

[-- Attachment #3: trans_dfa_2.c --]
[-- Type: text/x-csrc, Size: 385 bytes --]

int foo(char **save_s, int *tab) 
{
  char *s; 
  int c;

  int state = 0;
  for (s = *save_s; *s; s++)
    {
      c = *s;
      switch (state)
	{
	  case 0:
	   if (c == 'A')
	     {
	       state = 1;
	       tab[0]++;
	     }
	   else
	     {
	       state = -1;
	       tab[0]++;
	     }
	  break;

	 default:
	   break;
	}
    }

  tab[state]++;
  *save_s = s;
  return state;
}

[-- Attachment #4: approach-9.diff --]
[-- Type: text/x-patch, Size: 6933 bytes --]

diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 11b1938216e..d7663bee1ee 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ -52,6 +52,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-dce.h"
 #include "tree-cfgcleanup.h"
 #include "alias.h"
+#include "graph.h"
 
 /* Even though this file is called tree-ssa-pre.c, we actually
    implement a bit more than just PRE here.  All of them piggy-back
@@ -2433,6 +2434,7 @@ compute_antic (void)
    for performing quick dead code elimination of insertions we made
    that didn't turn out to be necessary.   */
 static bitmap inserted_exprs;
+static bitmap hoisted_exprs;
 
 /* The actual worker for create_component_ref_by_pieces.  */
 
@@ -3549,6 +3551,13 @@ do_hoist_insertion (basic_block block)
 	  continue;
 	}
 
+#if 0
+	  fprintf (stderr,
+		   "Inserting expression in block %d for code hoisting: ",
+		   block->index);
+	  print_pre_expr (stderr, expr);
+	  fprintf (stderr, " (%04d)\n", value_id);
+#endif
       /* OK, we should hoist this value.  Perform the transformation.  */
       pre_stats.hoist_insert++;
       if (dump_file && (dump_flags & TDF_DETAILS))
@@ -3583,6 +3592,8 @@ do_hoist_insertion (basic_block block)
 	continue;
 
       new_stuff = true;
+      bitmap_set_bit (hoisted_exprs, SSA_NAME_VERSION (res));
+//      fprintf (stderr, "\nres = "); debug_generic_expr (res);
     }
 
   exprs.release ();
@@ -4066,6 +4077,7 @@ init_pre (void)
   name_to_id.create (0);
 
   inserted_exprs = BITMAP_ALLOC (NULL);
+  hoisted_exprs = BITMAP_ALLOC (NULL);
 
   connect_infinite_loops_to_exit ();
   memset (&pre_stats, 0, sizeof (pre_stats));
@@ -4183,7 +4195,7 @@ pass_pre::execute (function *fun)
   statistics_counter_event (fun, "New PHIs", pre_stats.phis);
 
   /* Remove all the redundant expressions.  */
-  todo |= vn_eliminate (inserted_exprs);
+  todo |= vn_eliminate (inserted_exprs, hoisted_exprs);
 
   /* Because we don't follow exactly the standard PRE algorithm, and decide not
      to insert PHI nodes sometimes, and because value numbering of casts isn't
@@ -4217,6 +4229,7 @@ pass_pre::execute (function *fun)
      the todo.  */
   update_ssa (TODO_update_ssa_only_virtuals);
 
+  debug_dot_cfg (cfun, 0, "after-pre.dot");
   return todo;
 }
 
diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index 1463c1d4116..2ae88dc574e 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -5171,7 +5171,7 @@ vn_nary_may_trap (vn_nary_op_t nary)
 class eliminate_dom_walker : public dom_walker
 {
 public:
-  eliminate_dom_walker (cdi_direction, bitmap);
+  eliminate_dom_walker (cdi_direction, bitmap, bitmap hoisted_exprs = NULL);
   ~eliminate_dom_walker ();
 
   virtual edge before_dom_children (basic_block);
@@ -5188,6 +5188,7 @@ public:
 
   /* SSA names that had their defs inserted by PRE if do_pre.  */
   bitmap inserted_exprs;
+  bitmap hoisted_exprs;
 
   /* Blocks with statements that have had their EH properties changed.  */
   bitmap need_eh_cleanup;
@@ -5202,10 +5203,12 @@ public:
 };
 
 eliminate_dom_walker::eliminate_dom_walker (cdi_direction direction,
-					    bitmap inserted_exprs_)
+					    bitmap inserted_exprs_,
+					    bitmap hoisted_exprs_)
   : dom_walker (direction), do_pre (inserted_exprs_ != NULL),
     el_todo (0), eliminations (0), insertions (0),
-    inserted_exprs (inserted_exprs_)
+    inserted_exprs (inserted_exprs_),
+    hoisted_exprs (hoisted_exprs_)
 {
   need_eh_cleanup = BITMAP_ALLOC (NULL);
   need_ab_cleanup = BITMAP_ALLOC (NULL);
@@ -5338,7 +5341,18 @@ eliminate_dom_walker::eliminate_insert (gimple_stmt_iterator *gsi, tree val)
   return res;
 }
 
+static bool
+block_loop_exit_p (basic_block b)
+{
+  edge e;
+  edge_iterator ei;
 
+  FOR_EACH_EDGE (e, ei, b->succs)
+    if (b->loop_father != e->dest->loop_father)
+      return true;
+
+  return false;
+}
 
 /* Perform elimination for the basic-block B during the domwalk.  */
 
@@ -5369,6 +5383,17 @@ eliminate_dom_walker::before_dom_children (basic_block b)
 	}
 
       tree sprime = eliminate_avail (res);
+#if 1
+      if (block_loop_exit_p (b)
+	  && hoisted_exprs
+	  && sprime
+	  && bitmap_bit_p (hoisted_exprs, SSA_NAME_VERSION (sprime))
+	  && !single_succ_p (b))
+	{
+	  gsi_next (&gsi);
+	  continue;
+	}
+#endif
       if (sprime
 	  && sprime != res)
 	{
@@ -5432,6 +5458,19 @@ eliminate_dom_walker::before_dom_children (basic_block b)
 		   && is_global_var (gimple_assign_rhs1 (stmt)))))
 	{
 	  sprime = eliminate_avail (lhs);
+#if 1 
+	  if (block_loop_exit_p (b)
+	      && sprime
+	      && hoisted_exprs
+	      && bitmap_bit_p (hoisted_exprs, SSA_NAME_VERSION (sprime))
+	      && !single_succ_p (b)) 
+	    {
+//	      fprintf (stderr, "\nblock = %d, stmt = ", b->index); debug_gimple_stmt (stmt);
+//	      fprintf (stderr, "b single succ = %d\n", single_succ_p (b));
+	      continue;
+	    }
+#endif
+
 	  if (!sprime)
 	    {
 	      /* If there is no existing usable leader but SCCVN thinks
@@ -5697,6 +5736,13 @@ eliminate_dom_walker::before_dom_children (basic_block b)
 	  if (TREE_CODE (use) != SSA_NAME)
 	    continue;
 	  tree sprime = eliminate_avail (use);
+	  if (block_loop_exit_p (b)
+	      && sprime
+	      && hoisted_exprs
+	      && bitmap_bit_p (hoisted_exprs, SSA_NAME_VERSION (sprime))
+	      && !single_succ_p (b)) 
+	    continue;
+
 	  if (sprime && sprime != use
 	      && may_propagate_copy (use, sprime)
 	      /* We substitute into debug stmts to avoid excessive
@@ -5877,6 +5923,13 @@ eliminate_dom_walker::before_dom_children (basic_block b)
 	      || virtual_operand_p (arg))
 	    continue;
 	  tree sprime = eliminate_avail (arg);
+	  if (block_loop_exit_p (b)
+	      && sprime
+	      && hoisted_exprs
+	      && bitmap_bit_p (hoisted_exprs, SSA_NAME_VERSION (sprime))
+	      && !single_succ_p (b)) 
+	    continue;
+
 	  if (sprime && may_propagate_copy (arg, sprime))
 	    propagate_value (use_p, sprime);
 	}
@@ -5903,9 +5956,9 @@ eliminate_dom_walker::after_dom_children (basic_block)
 /* Eliminate fully redundant computations.  */
 
 unsigned int
-vn_eliminate (bitmap inserted_exprs)
+vn_eliminate (bitmap inserted_exprs, bitmap hoisted_exprs)
 {
-  eliminate_dom_walker el (CDI_DOMINATORS, inserted_exprs);
+  eliminate_dom_walker el (CDI_DOMINATORS, inserted_exprs, hoisted_exprs);
   el.avail.reserve (num_ssa_names);
 
   el.walk (cfun->cfg->x_entry_block_ptr);
diff --git a/gcc/tree-ssa-sccvn.h b/gcc/tree-ssa-sccvn.h
index 9356520cbe5..5f9bab7b83f 100644
--- a/gcc/tree-ssa-sccvn.h
+++ b/gcc/tree-ssa-sccvn.h
@@ -214,7 +214,7 @@ extern vn_ssa_aux_t VN_INFO (tree);
 extern vn_ssa_aux_t VN_INFO_GET (tree);
 tree vn_get_expr_for (tree);
 void run_scc_vn (vn_lookup_kind);
-unsigned int vn_eliminate (bitmap);
+unsigned int vn_eliminate (bitmap, bitmap hoisted_exprs = NULL);
 void free_scc_vn (void);
 void scc_vn_restore_ssa_info (void);
 tree vn_nary_op_lookup (tree, vn_nary_op_t *);

[-- Attachment #5: approach-8.diff --]
[-- Type: text/x-patch, Size: 4657 bytes --]

diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 11b1938216e..d10a276daf2 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ -52,6 +52,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-dce.h"
 #include "tree-cfgcleanup.h"
 #include "alias.h"
+#include "graph.h"
 
 /* Even though this file is called tree-ssa-pre.c, we actually
    implement a bit more than just PRE here.  All of them piggy-back
@@ -500,7 +501,6 @@ typedef struct bb_bitmap_sets
 #define BB_MAY_NOTRETURN(BB) ((bb_value_sets_t) ((BB)->aux))->contains_may_not_return_call
 #define BB_LIVE_VOP_ON_EXIT(BB) ((bb_value_sets_t) ((BB)->aux))->vop_on_exit
 
-
 /* This structure is used to keep track of statistics on what
    optimization PRE was able to perform.  */
 static struct
@@ -2434,6 +2434,25 @@ compute_antic (void)
    that didn't turn out to be necessary.   */
 static bitmap inserted_exprs;
 
+struct pre_expr_block_d : nofree_ptr_hash <pre_expr_block_d>
+{
+  pre_expr expr;
+  bitmap blocks;
+
+  /* hash_table support.  */
+  static inline hashval_t hash(const pre_expr_block_d *x)
+  {
+    return pre_expr_d::hash (x->expr);
+  }
+
+  static inline int equal(const pre_expr_block_d *x1, const pre_expr_block_d *x2)
+  {
+    return pre_expr_d::equal (x1->expr, x2->expr);
+  }
+};
+
+static hash_table<pre_expr_block_d> * expr_to_blocks;
+
 /* The actual worker for create_component_ref_by_pieces.  */
 
 static tree
@@ -3549,6 +3568,15 @@ do_hoist_insertion (basic_block block)
 	  continue;
 	}
 
+      pre_expr_block_d x;
+      x.expr = expr;
+      pre_expr_block_d **slot = expr_to_blocks->find_slot (&x, NO_INSERT);
+      if (slot)
+	{
+	  gcc_assert (!bitmap_empty_p ((*slot)->blocks));
+	  continue;
+	}
+
       /* OK, we should hoist this value.  Perform the transformation.  */
       pre_stats.hoist_insert++;
       if (dump_file && (dump_flags & TDF_DETAILS))
@@ -3678,6 +3706,40 @@ insert (void)
   statistics_histogram_event (cfun, "insert iterations", num_iterations);
 }
 
+static void
+add_expr_to_blocks (pre_expr expr, basic_block block)
+{
+  pre_expr_block_d x;
+  x.expr = expr;
+
+  edge e;
+  edge_iterator ei;
+  bool loop_exit = false;
+
+  if (single_succ_p (block))
+    return;
+
+  FOR_EACH_EDGE (e, ei, block->succs)
+    if (block->loop_father != e->dest->loop_father)
+      {
+	loop_exit = true;
+	break;
+      }
+
+  if (!loop_exit)
+    return;
+
+  pre_expr_block_d **slot = expr_to_blocks->find_slot (&x, INSERT);
+  if (!*slot)
+    {
+      *slot = XNEW (struct pre_expr_block_d);
+      (*slot)->expr = expr;
+      (*slot)->blocks = BITMAP_ALLOC (NULL);
+    }
+
+   
+  bitmap_set_bit ((*slot)->blocks, block->index);
+}
 
 /* Compute the AVAIL set for all basic blocks.
 
@@ -3711,6 +3773,7 @@ compute_avail (void)
 
       e = get_or_alloc_expr_for_name (name);
       add_to_value (get_expr_value_id (e), e);
+      add_expr_to_blocks (e, ENTRY_BLOCK_PTR_FOR_FN (cfun));
       bitmap_insert_into_set (TMP_GEN (ENTRY_BLOCK_PTR_FOR_FN (cfun)), e);
       bitmap_value_insert_into_set (AVAIL_OUT (ENTRY_BLOCK_PTR_FOR_FN (cfun)),
 				    e);
@@ -3771,6 +3834,7 @@ compute_avail (void)
 
 	  pre_expr e = get_or_alloc_expr_for_name (result);
 	  add_to_value (get_expr_value_id (e), e);
+	  add_expr_to_blocks (e, block);
 	  bitmap_value_insert_into_set (AVAIL_OUT (block), e);
 	  bitmap_insert_into_set (PHI_GEN (block), e);
 	}
@@ -3808,6 +3872,7 @@ compute_avail (void)
 	      pre_expr e = get_or_alloc_expr_for_name (op);
 
 	      add_to_value (get_expr_value_id (e), e);
+	      add_expr_to_blocks (e, block);
 	      bitmap_insert_into_set (TMP_GEN (block), e);
 	      bitmap_value_insert_into_set (AVAIL_OUT (block), e);
 	    }
@@ -3862,6 +3927,7 @@ compute_avail (void)
 		    PRE_EXPR_REFERENCE (result) = ref;
 
 		    get_or_alloc_expression_id (result);
+		    add_expr_to_blocks (result, block);
 		    add_to_value (get_expr_value_id (result), result);
 		    bitmap_value_insert_into_set (EXP_GEN (block), result);
 		  }
@@ -4019,6 +4085,7 @@ compute_avail (void)
 
 		get_or_alloc_expression_id (result);
 		add_to_value (get_expr_value_id (result), result);
+		add_expr_to_blocks (result, block);
 		bitmap_value_insert_into_set (EXP_GEN (block), result);
 		continue;
 	      }
@@ -4077,6 +4144,7 @@ init_pre (void)
   bitmap_obstack_initialize (&grand_bitmap_obstack);
   phi_translate_table = new hash_table<expr_pred_trans_d> (5110);
   expression_to_id = new hash_table<pre_expr_d> (num_ssa_names * 3);
+  expr_to_blocks = new hash_table<pre_expr_block_d> (num_ssa_names * 3);
   FOR_ALL_BB_FN (bb, cfun)
     {
       EXP_GEN (bb) = bitmap_set_new ();


* Re: PR80155: Code hoisting and register pressure
  2018-05-23  7:27 PR80155: Code hoisting and register pressure Prathamesh Kulkarni
@ 2018-05-23  8:29 ` Richard Biener
  2018-05-23  9:20   ` Prathamesh Kulkarni
  2018-05-23  9:40   ` Bin.Cheng
  0 siblings, 2 replies; 14+ messages in thread
From: Richard Biener @ 2018-05-23  8:29 UTC (permalink / raw)
  To: Prathamesh Kulkarni; +Cc: gcc, Bin Cheng, Thomas Preudhomme

On Wed, 23 May 2018, Prathamesh Kulkarni wrote:

> Hi,
> I am trying to work on PR80155, which exposes a problem with code
> hoisting and register pressure on a leading embedded benchmark for ARM
> cortex-m7, where code-hoisting causes an extra register spill.
> 
> I have attached two test-cases which (hopefully) are representative of
> the original test-case.
> The first one (trans_dfa.c) is bigger and somewhat similar to the
> original test-case and trans_dfa_2.c is hand-reduced version of
> trans_dfa.c. There's 2 spills caused with trans_dfa.c
> and one spill with trans_dfa_2.c due to lesser amount of cases.
> The test-cases in the PR are probably not relevant.
> 
> Initially I thought the spill was happening because of "too many
> hoistings" taking place in original test-case thus increasing the
> register pressure, but it seems the spill is possibly caused because
> expression gets hoisted out of a block that is on loop exit.
> 
> For example, the following hoistings take place with trans_dfa_2.c:
> 
> (1) Inserting expression in block 4 for code hoisting:
> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
> 
> (2) Inserting expression in block 4 for code hoisting: {plus_expr,_4,1} (0006)
> 
> (3) Inserting expression in block 4 for code hoisting:
> {pointer_plus_expr,s_33,1} (0023)
> 
> (4) Inserting expression in block 3 for code hoisting:
> {pointer_plus_expr,s_33,1} (0023)
> 
> The issue seems to be hoisting of (*tab + 1) which consists of first
> two hoistings in block 4
> from blocks 5 and 9, which causes the extra spill. I verified that by
> disabling hoisting into block 4,
> which resulted in no extra spills.
> 
> I wonder if that's because the expression (*tab + 1) is getting
> hoisted from blocks 5 and 9,
> which are on loop exit ? So the expression that was previously
> computed in a block on loop exit, gets hoisted outside that block
> which possibly makes the allocator more defensive ? Similarly
> disabling hoisting of expressions which appeared in blocks on loop
> exit in original test-case prevented the extra spill. The other
> hoistings didn't seem to matter.

I think that's simply coincidence.  The only thing that makes
a block that also exits from the loop special is that an
expression could be sunk out of the loop and hoisting (commoning
with another path) could prevent that.  But that isn't what is
happening here, and it would be a pass ordering issue as
the sinking pass runs only after hoisting (no idea why exactly,
but I guess there are cases where we want to prefer CSE over
sinking).  So you could try whether re-ordering PRE and sinking helps
your testcase.

What I do see is a missed opportunity to merge the successors
of BB 4.  After PRE we have

<bb 4> [local count: 159303558]:
<L1>:
pretmp_123 = *tab_37(D);
_87 = pretmp_123 + 1;
if (c_36 == 65)
  goto <bb 5>; [34.00%]
else
  goto <bb 8>; [66.00%]

<bb 5> [local count: 54163210]:
*tab_37(D) = _87;
_96 = MEM[(char *)s_57 + 1B];
if (_96 != 0)
  goto <bb 7>; [89.00%]
else
  goto <bb 6>; [11.00%]

<bb 8> [local count: 105140348]:
*tab_37(D) = _87;
_56 = MEM[(char *)s_57 + 1B];
if (_56 != 0)
  goto <bb 10>; [89.00%]
else
  goto <bb 9>; [11.00%]

here at least the stores and loads can be hoisted.  Note this
may also point at the real issue of the code hoisting which is
tearing apart the RMW operation?

> If that's the case, would it make sense to add another constraint to
> hoisting to not hoist expression if it appears in a block that is on
> loop exit (exception is if block has only single successor), at-least
> for targets like cortex-m7 where the effect of spill is more
> pronounced ?
> 
> I tried to write an untested prototype patch (approach-8.diff) on
> these lines, to refuse hoisting if an expression appears in block on
> loop exit. The patch adds a new map pre_expr_block_d, that maps
> pre_expr to the set of blocks (as a bitmap) it appears in and are on
> loop exit, which is computed during compute_avail.
> During do_hoist_insertion, it simply checks if the bitmap of blocks is
> not empty.
> It works for the attached test-cases and passes ssa-pre-*.c and
> ssa-hoist-*.c tests.

As said, to me the heuristic doesn't make much sense.  There is,
btw, loop_exits_from_bb_p ().
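
For reference, an untested sketch of what the loop-exit test from the
prototype patches could look like using that helper instead of
open-coding the edge walk (it assumes loop info is available so that
b->loop_father is valid):

static bool
block_is_on_loop_exit_p (basic_block b)
{
  /* Blocks not contained in any real loop cannot be on a loop exit.  */
  if (loop_outer (b->loop_father) == NULL)
    return false;

  /* True iff B has a successor edge that leaves its containing loop.  */
  return loop_exits_from_bb_p (b->loop_father, b);
}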

If the issue in the end is a RA one (that is, there _is_ a better
allocation possible?) then you may also want to look at out-of-SSA
coalescing.

So overall I'm not convinced the hoisting decision is wrong.
It doesn't increase register pressure at all.  It does expose
a missed store hoisting and load CSE opportunity (but we don't
have a way to "PRE" or hoist stores at the moment).  Stores
do not fit data flow problems well given they need to be kept
in order with respect to other stores and loads, and apparently
*tab aliases *s (yeah, s is char *...  make tab restrict and I
guess things will work much smoother).
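
For the attached testcases that amounts to, e.g. (just spelling out the
suggestion, untested):

  int foo (char **save_s, int *restrict tab);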

Richard.

> Alternatively, we could restrict replacing expression by it's leader
> in eliminate_dom_walker::before_dom_children if the expression appears
> in a block on loop exit.
> In principle this is more general than hoisting, but I have restricted
> it to only hoisted expressions to avoid being overly conservative.
> Similarly, this constrained could be made conditional, only for
> targets like cortex-m7. I have attached a prototype patch based on
> this approach (approach-9.diff). Although it works for attached
> test-cases, unfortunately it ICE's with the original test-case in
> tail-merge, I am working on that.
> 
> I am not sure though if either of these approaches are in the correct
> direction and would
> be grateful for suggestions on the issue!
> 
> Thanks,
> Prathamesh
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: PR80155: Code hoisting and register pressure
  2018-05-23  8:29 ` Richard Biener
@ 2018-05-23  9:20   ` Prathamesh Kulkarni
  2018-05-23 13:07     ` Jeff Law
  2018-05-23  9:40   ` Bin.Cheng
  1 sibling, 1 reply; 14+ messages in thread
From: Prathamesh Kulkarni @ 2018-05-23  9:20 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc, Bin Cheng, Thomas Preudhomme

On 23 May 2018 at 13:58, Richard Biener <rguenther@suse.de> wrote:
> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>
>> Hi,
>> I am trying to work on PR80155, which exposes a problem with code
>> hoisting and register pressure on a leading embedded benchmark for ARM
>> cortex-m7, where code-hoisting causes an extra register spill.
>>
>> I have attached two test-cases which (hopefully) are representative of
>> the original test-case.
>> The first one (trans_dfa.c) is bigger and somewhat similar to the
>> original test-case and trans_dfa_2.c is hand-reduced version of
>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
>> and one spill with trans_dfa_2.c due to lesser amount of cases.
>> The test-cases in the PR are probably not relevant.
>>
>> Initially I thought the spill was happening because of "too many
>> hoistings" taking place in original test-case thus increasing the
>> register pressure, but it seems the spill is possibly caused because
>> expression gets hoisted out of a block that is on loop exit.
>>
>> For example, the following hoistings take place with trans_dfa_2.c:
>>
>> (1) Inserting expression in block 4 for code hoisting:
>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>
>> (2) Inserting expression in block 4 for code hoisting: {plus_expr,_4,1} (0006)
>>
>> (3) Inserting expression in block 4 for code hoisting:
>> {pointer_plus_expr,s_33,1} (0023)
>>
>> (4) Inserting expression in block 3 for code hoisting:
>> {pointer_plus_expr,s_33,1} (0023)
>>
>> The issue seems to be hoisting of (*tab + 1) which consists of first
>> two hoistings in block 4
>> from blocks 5 and 9, which causes the extra spill. I verified that by
>> disabling hoisting into block 4,
>> which resulted in no extra spills.
>>
>> I wonder if that's because the expression (*tab + 1) is getting
>> hoisted from blocks 5 and 9,
>> which are on loop exit ? So the expression that was previously
>> computed in a block on loop exit, gets hoisted outside that block
>> which possibly makes the allocator more defensive ? Similarly
>> disabling hoisting of expressions which appeared in blocks on loop
>> exit in original test-case prevented the extra spill. The other
>> hoistings didn't seem to matter.
>
> I think that's simply co-incidence.  The only thing that makes
> a block that also exits from the loop special is that an
> expression could be sunk out of the loop and hoisting (commoning
> with another path) could prevent that.  But that isn't what is
> happening here and it would be a pass ordering issue as
> the sinking pass runs only after hoisting (no idea why exactly
> but I guess there are cases where we want to prefer CSE over
> sinking).  So you could try if re-ordering PRE and sinking helps
> your testcase.
Thanks for the suggestions. Placing the sink pass before PRE works
for both of these test-cases! Sadly it still causes the spill for the benchmark -:(
I will try to create a better approximation of the original test-case.
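
For reference, the reordering I tried amounts to moving the sinking
pass ahead of PRE in gcc/passes.def, roughly along these lines
(surrounding passes elided; treat the exact placement as a sketch of
the experiment rather than a proposed change):

       NEXT_PASS (pass_sink_code);   /* moved up, ahead of PRE */
       NEXT_PASS (pass_pre);
       ...
       /* NEXT_PASS (pass_sink_code);   original position, removed */
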
>
> What I do see is a missed opportunity to merge the successors
> of BB 4.  After PRE we have
>
> <bb 4> [local count: 159303558]:
> <L1>:
> pretmp_123 = *tab_37(D);
> _87 = pretmp_123 + 1;
> if (c_36 == 65)
>   goto <bb 5>; [34.00%]
> else
>   goto <bb 8>; [66.00%]
>
> <bb 5> [local count: 54163210]:
> *tab_37(D) = _87;
> _96 = MEM[(char *)s_57 + 1B];
> if (_96 != 0)
>   goto <bb 7>; [89.00%]
> else
>   goto <bb 6>; [11.00%]
>
> <bb 8> [local count: 105140348]:
> *tab_37(D) = _87;
> _56 = MEM[(char *)s_57 + 1B];
> if (_56 != 0)
>   goto <bb 10>; [89.00%]
> else
>   goto <bb 9>; [11.00%]
>
> here at least the stores and loads can be hoisted.  Note this
> may also point at the real issue of the code hoisting which is
> tearing apart the RMW operation?
Indeed, this possibility seems much more likely than the block being on
a loop exit.
I will try to "hardcode" the load/store hoists into block 4 for this
specific test-case to check if that prevents the spill.
>
>> If that's the case, would it make sense to add another constraint to
>> hoisting to not hoist expression if it appears in a block that is on
>> loop exit (exception is if block has only single successor), at-least
>> for targets like cortex-m7 where the effect of spill is more
>> pronounced ?
>>
>> I tried to write an untested prototype patch (approach-8.diff) on
>> these lines, to refuse hoisting if an expression appears in block on
>> loop exit. The patch adds a new map pre_expr_block_d, that maps
>> pre_expr to the set of blocks (as a bitmap) it appears in and are on
>> loop exit, which is computed during compute_avail.
>> During do_hoist_insertion, it simply checks if the bitmap of blocks is
>> not empty.
>> It works for the attached test-cases and passes ssa-pre-*.c and
>> ssa-hoist-*.c tests.
>
> As said to me the heuristic doesn't make much sense.  There is
> btw loop_exits_from_bb_p ().
>
> If the issue in the end is a RA one (that is, there _is_ a better
> allocation possible?) then you may also want to look at out-of-SSA
> coalescing.
>
> So overall I'm not convinced the hoisting decision is wrong.
> It doesn't increase register pressure at all.  It does expose
> a missed store hoisting and load CSE opportunity (but we don't
> have a way to "PRE" or hoist stores at the moment).  Stores
> do not fit data flow problems well given they need to be kept
> in order with respect to other stores and loads and appearantly
> *tab aliases *s (yeah, s is char *...  make tab restrict and I
> guess things will work much smoother).
Strangely enough, making tab restrict resulted in two extra spills
compared to the version without hoisting.

Thanks,
Prathamesh
>
> Richard.
>
>> Alternatively, we could restrict replacing expression by it's leader
>> in eliminate_dom_walker::before_dom_children if the expression appears
>> in a block on loop exit.
>> In principle this is more general than hoisting, but I have restricted
>> it to only hoisted expressions to avoid being overly conservative.
>> Similarly, this constrained could be made conditional, only for
>> targets like cortex-m7. I have attached a prototype patch based on
>> this approach (approach-9.diff). Although it works for attached
>> test-cases, unfortunately it ICE's with the original test-case in
>> tail-merge, I am working on that.
>>
>> I am not sure though if either of these approaches are in the correct
>> direction and would
>> be grateful for suggestions on the issue!
>>
>> Thanks,
>> Prathamesh
>>
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: PR80155: Code hoisting and register pressure
  2018-05-23  8:29 ` Richard Biener
  2018-05-23  9:20   ` Prathamesh Kulkarni
@ 2018-05-23  9:40   ` Bin.Cheng
  1 sibling, 0 replies; 14+ messages in thread
From: Bin.Cheng @ 2018-05-23  9:40 UTC (permalink / raw)
  To: Richard Biener
  Cc: Prathamesh Kulkarni, GCC Development, Bin Cheng, Thomas Preudhomme

On Wed, May 23, 2018 at 9:28 AM, Richard Biener <rguenther@suse.de> wrote:
> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>
>> Hi,
>> I am trying to work on PR80155, which exposes a problem with code
>> hoisting and register pressure on a leading embedded benchmark for ARM
>> cortex-m7, where code-hoisting causes an extra register spill.
>>
>> I have attached two test-cases which (hopefully) are representative of
>> the original test-case.
>> The first one (trans_dfa.c) is bigger and somewhat similar to the
>> original test-case and trans_dfa_2.c is hand-reduced version of
>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
>> and one spill with trans_dfa_2.c due to lesser amount of cases.
>> The test-cases in the PR are probably not relevant.
>>
>> Initially I thought the spill was happening because of "too many
>> hoistings" taking place in original test-case thus increasing the
>> register pressure, but it seems the spill is possibly caused because
>> expression gets hoisted out of a block that is on loop exit.
>>
>> For example, the following hoistings take place with trans_dfa_2.c:
>>
>> (1) Inserting expression in block 4 for code hoisting:
>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>
>> (2) Inserting expression in block 4 for code hoisting: {plus_expr,_4,1} (0006)
>>
>> (3) Inserting expression in block 4 for code hoisting:
>> {pointer_plus_expr,s_33,1} (0023)
>>
>> (4) Inserting expression in block 3 for code hoisting:
>> {pointer_plus_expr,s_33,1} (0023)
>>
>> The issue seems to be hoisting of (*tab + 1) which consists of first
>> two hoistings in block 4
>> from blocks 5 and 9, which causes the extra spill. I verified that by
>> disabling hoisting into block 4,
>> which resulted in no extra spills.
>>
>> I wonder if that's because the expression (*tab + 1) is getting
>> hoisted from blocks 5 and 9,
>> which are on loop exit ? So the expression that was previously
>> computed in a block on loop exit, gets hoisted outside that block
>> which possibly makes the allocator more defensive ? Similarly
>> disabling hoisting of expressions which appeared in blocks on loop
>> exit in original test-case prevented the extra spill. The other
>> hoistings didn't seem to matter.
>
> I think that's simply co-incidence.  The only thing that makes
> a block that also exits from the loop special is that an
> expression could be sunk out of the loop and hoisting (commoning
> with another path) could prevent that.  But that isn't what is
> happening here and it would be a pass ordering issue as
> the sinking pass runs only after hoisting (no idea why exactly
> but I guess there are cases where we want to prefer CSE over
> sinking).  So you could try if re-ordering PRE and sinking helps
> your testcase.
>
> What I do see is a missed opportunity to merge the successors
> of BB 4.  After PRE we have
>
> <bb 4> [local count: 159303558]:
> <L1>:
> pretmp_123 = *tab_37(D);
> _87 = pretmp_123 + 1;
> if (c_36 == 65)
>   goto <bb 5>; [34.00%]
> else
>   goto <bb 8>; [66.00%]
>
> <bb 5> [local count: 54163210]:
> *tab_37(D) = _87;
> _96 = MEM[(char *)s_57 + 1B];
> if (_96 != 0)
>   goto <bb 7>; [89.00%]
> else
>   goto <bb 6>; [11.00%]
>
> <bb 8> [local count: 105140348]:
> *tab_37(D) = _87;
> _56 = MEM[(char *)s_57 + 1B];
> if (_56 != 0)
>   goto <bb 10>; [89.00%]
> else
>   goto <bb 9>; [11.00%]
>
> here at least the stores and loads can be hoisted.  Note this
> may also point at the real issue of the code hoisting which is
> tearing apart the RMW operation?
>
>> If that's the case, would it make sense to add another constraint to
>> hoisting to not hoist expression if it appears in a block that is on
>> loop exit (exception is if block has only single successor), at-least
>> for targets like cortex-m7 where the effect of spill is more
>> pronounced ?
>>
>> I tried to write an untested prototype patch (approach-8.diff) on
>> these lines, to refuse hoisting if an expression appears in block on
>> loop exit. The patch adds a new map pre_expr_block_d, that maps
>> pre_expr to the set of blocks (as a bitmap) it appears in and are on
>> loop exit, which is computed during compute_avail.
>> During do_hoist_insertion, it simply checks if the bitmap of blocks is
>> not empty.
>> It works for the attached test-cases and passes ssa-pre-*.c and
>> ssa-hoist-*.c tests.
>
> As said to me the heuristic doesn't make much sense.  There is
> btw loop_exits_from_bb_p ().
>
> If the issue in the end is a RA one (that is, there _is_ a better
> allocation possible?) then you may also want to look at out-of-SSA
> coalescing.
>
> So overall I'm not convinced the hoisting decision is wrong.
> It doesn't increase register pressure at all.  It does expose
Not quite.  There are two hoistings in the case of trans_dfa_2.c.

For the first one:

Inserting expression in block 4 for code hoisting:
{mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
Inserted pretmp_30 = *tab_20(D);
 in predecessor 4 (0005)
Inserting expression in block 4 for code hoisting: {plus_expr,_4,1} (0006)
Inserted _12 = pretmp_30 + 1;
 in predecessor 4 (0006)

I agree hoisting separates RM from W, but for the second one:

Inserting expression in block 4 for code hoisting:
{pointer_plus_expr,s_33,1} (0023)
Inserted _14 = s_33 + 1;
 in predecessor 4 (0023)
Inserting expression in block 3 for code hoisting:
{pointer_plus_expr,s_33,1} (0023)
Inserted _62 = s_33 + 1;
 in predecessor 3 (0023)

It does increase register pressure, because _62 (originally s_34 and
s_27) is extended while s_33 is still live.

Thanks,
bin

> a missed store hoisting and load CSE opportunity (but we don't
> have a way to "PRE" or hoist stores at the moment).  Stores
> do not fit data flow problems well given they need to be kept
> in order with respect to other stores and loads and appearantly
> *tab aliases *s (yeah, s is char *...  make tab restrict and I
> guess things will work much smoother).
>
> Richard.
>
>> Alternatively, we could restrict replacing expression by it's leader
>> in eliminate_dom_walker::before_dom_children if the expression appears
>> in a block on loop exit.
>> In principle this is more general than hoisting, but I have restricted
>> it to only hoisted expressions to avoid being overly conservative.
>> Similarly, this constrained could be made conditional, only for
>> targets like cortex-m7. I have attached a prototype patch based on
>> this approach (approach-9.diff). Although it works for attached
>> test-cases, unfortunately it ICE's with the original test-case in
>> tail-merge, I am working on that.
>>
>> I am not sure though if either of these approaches are in the correct
>> direction and would
>> be grateful for suggestions on the issue!
>>
>> Thanks,
>> Prathamesh
>>
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: PR80155: Code hoisting and register pressure
  2018-05-23  9:20   ` Prathamesh Kulkarni
@ 2018-05-23 13:07     ` Jeff Law
  2018-05-25  9:23       ` Prathamesh Kulkarni
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff Law @ 2018-05-23 13:07 UTC (permalink / raw)
  To: Prathamesh Kulkarni, Richard Biener; +Cc: gcc, Bin Cheng, Thomas Preudhomme

On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
> On 23 May 2018 at 13:58, Richard Biener <rguenther@suse.de> wrote:
>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>
>>> Hi,
>>> I am trying to work on PR80155, which exposes a problem with code
>>> hoisting and register pressure on a leading embedded benchmark for ARM
>>> cortex-m7, where code-hoisting causes an extra register spill.
>>>
>>> I have attached two test-cases which (hopefully) are representative of
>>> the original test-case.
>>> The first one (trans_dfa.c) is bigger and somewhat similar to the
>>> original test-case and trans_dfa_2.c is hand-reduced version of
>>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
>>> and one spill with trans_dfa_2.c due to lesser amount of cases.
>>> The test-cases in the PR are probably not relevant.
>>>
>>> Initially I thought the spill was happening because of "too many
>>> hoistings" taking place in original test-case thus increasing the
>>> register pressure, but it seems the spill is possibly caused because
>>> expression gets hoisted out of a block that is on loop exit.
>>>
>>> For example, the following hoistings take place with trans_dfa_2.c:
>>>
>>> (1) Inserting expression in block 4 for code hoisting:
>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>
>>> (2) Inserting expression in block 4 for code hoisting: {plus_expr,_4,1} (0006)
>>>
>>> (3) Inserting expression in block 4 for code hoisting:
>>> {pointer_plus_expr,s_33,1} (0023)
>>>
>>> (4) Inserting expression in block 3 for code hoisting:
>>> {pointer_plus_expr,s_33,1} (0023)
>>>
>>> The issue seems to be hoisting of (*tab + 1) which consists of first
>>> two hoistings in block 4
>>> from blocks 5 and 9, which causes the extra spill. I verified that by
>>> disabling hoisting into block 4,
>>> which resulted in no extra spills.
>>>
>>> I wonder if that's because the expression (*tab + 1) is getting
>>> hoisted from blocks 5 and 9,
>>> which are on loop exit ? So the expression that was previously
>>> computed in a block on loop exit, gets hoisted outside that block
>>> which possibly makes the allocator more defensive ? Similarly
>>> disabling hoisting of expressions which appeared in blocks on loop
>>> exit in original test-case prevented the extra spill. The other
>>> hoistings didn't seem to matter.
>>
>> I think that's simply co-incidence.  The only thing that makes
>> a block that also exits from the loop special is that an
>> expression could be sunk out of the loop and hoisting (commoning
>> with another path) could prevent that.  But that isn't what is
>> happening here and it would be a pass ordering issue as
>> the sinking pass runs only after hoisting (no idea why exactly
>> but I guess there are cases where we want to prefer CSE over
>> sinking).  So you could try if re-ordering PRE and sinking helps
>> your testcase.
> Thanks for the suggestions. Placing sink pass before PRE works
> for both these test-cases! Sadly it still causes the spill for the benchmark -:(
> I will try to create a better approximation of the original test-case.
>>
>> What I do see is a missed opportunity to merge the successors
>> of BB 4.  After PRE we have
>>
>> <bb 4> [local count: 159303558]:
>> <L1>:
>> pretmp_123 = *tab_37(D);
>> _87 = pretmp_123 + 1;
>> if (c_36 == 65)
>>   goto <bb 5>; [34.00%]
>> else
>>   goto <bb 8>; [66.00%]
>>
>> <bb 5> [local count: 54163210]:
>> *tab_37(D) = _87;
>> _96 = MEM[(char *)s_57 + 1B];
>> if (_96 != 0)
>>   goto <bb 7>; [89.00%]
>> else
>>   goto <bb 6>; [11.00%]
>>
>> <bb 8> [local count: 105140348]:
>> *tab_37(D) = _87;
>> _56 = MEM[(char *)s_57 + 1B];
>> if (_56 != 0)
>>   goto <bb 10>; [89.00%]
>> else
>>   goto <bb 9>; [11.00%]
>>
>> here at least the stores and loads can be hoisted.  Note this
>> may also point at the real issue of the code hoisting which is
>> tearing apart the RMW operation?
> Indeed, this possibility seems much more likely than block being on loop exit.
> I will try to "hardcode" the load/store hoists into block 4 for this
> specific test-case to check
> if that prevents the spill.
Even if it prevents the spill in this case, it's likely a good thing to
do.  The statements prior to the conditional in bb5 and bb8 should be
hoisted, leaving bb5 and bb8 with just their conditionals.
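
I.e., roughly this shape (a hand-edited sketch of the GIMPLE Richard
quoted above, not actual compiler output; the two loads become a single
value once commoned):

<bb 4>:
pretmp_123 = *tab_37(D);
_87 = pretmp_123 + 1;
*tab_37(D) = _87;
_96 = MEM[(char *)s_57 + 1B];
if (c_36 == 65)
  goto <bb 5>;
else
  goto <bb 8>;

<bb 5>:
if (_96 != 0)
  goto <bb 7>;
else
  goto <bb 6>;

<bb 8>:
if (_96 != 0)
  goto <bb 10>;
else
  goto <bb 9>;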

Jeff


* Re: PR80155: Code hoisting and register pressure
  2018-05-23 13:07     ` Jeff Law
@ 2018-05-25  9:23       ` Prathamesh Kulkarni
  2018-05-25  9:49         ` Bin.Cheng
  0 siblings, 1 reply; 14+ messages in thread
From: Prathamesh Kulkarni @ 2018-05-25  9:23 UTC (permalink / raw)
  To: Jeff Law; +Cc: Richard Biener, gcc, Bin Cheng, Thomas Preudhomme

On 23 May 2018 at 18:37, Jeff Law <law@redhat.com> wrote:
> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>> On 23 May 2018 at 13:58, Richard Biener <rguenther@suse.de> wrote:
>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>
>>>> Hi,
>>>> I am trying to work on PR80155, which exposes a problem with code
>>>> hoisting and register pressure on a leading embedded benchmark for ARM
>>>> cortex-m7, where code-hoisting causes an extra register spill.
>>>>
>>>> I have attached two test-cases which (hopefully) are representative of
>>>> the original test-case.
>>>> The first one (trans_dfa.c) is bigger and somewhat similar to the
>>>> original test-case and trans_dfa_2.c is hand-reduced version of
>>>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
>>>> and one spill with trans_dfa_2.c due to lesser amount of cases.
>>>> The test-cases in the PR are probably not relevant.
>>>>
>>>> Initially I thought the spill was happening because of "too many
>>>> hoistings" taking place in original test-case thus increasing the
>>>> register pressure, but it seems the spill is possibly caused because
>>>> expression gets hoisted out of a block that is on loop exit.
>>>>
>>>> For example, the following hoistings take place with trans_dfa_2.c:
>>>>
>>>> (1) Inserting expression in block 4 for code hoisting:
>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>>
>>>> (2) Inserting expression in block 4 for code hoisting: {plus_expr,_4,1} (0006)
>>>>
>>>> (3) Inserting expression in block 4 for code hoisting:
>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>
>>>> (4) Inserting expression in block 3 for code hoisting:
>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>
>>>> The issue seems to be hoisting of (*tab + 1) which consists of first
>>>> two hoistings in block 4
>>>> from blocks 5 and 9, which causes the extra spill. I verified that by
>>>> disabling hoisting into block 4,
>>>> which resulted in no extra spills.
>>>>
>>>> I wonder if that's because the expression (*tab + 1) is getting
>>>> hoisted from blocks 5 and 9,
>>>> which are on loop exit ? So the expression that was previously
>>>> computed in a block on loop exit, gets hoisted outside that block
>>>> which possibly makes the allocator more defensive ? Similarly
>>>> disabling hoisting of expressions which appeared in blocks on loop
>>>> exit in original test-case prevented the extra spill. The other
>>>> hoistings didn't seem to matter.
>>>
>>> I think that's simply co-incidence.  The only thing that makes
>>> a block that also exits from the loop special is that an
>>> expression could be sunk out of the loop and hoisting (commoning
>>> with another path) could prevent that.  But that isn't what is
>>> happening here and it would be a pass ordering issue as
>>> the sinking pass runs only after hoisting (no idea why exactly
>>> but I guess there are cases where we want to prefer CSE over
>>> sinking).  So you could try if re-ordering PRE and sinking helps
>>> your testcase.
>> Thanks for the suggestions. Placing sink pass before PRE works
>> for both these test-cases! Sadly it still causes the spill for the benchmark -:(
>> I will try to create a better approximation of the original test-case.
>>>
>>> What I do see is a missed opportunity to merge the successors
>>> of BB 4.  After PRE we have
>>>
>>> <bb 4> [local count: 159303558]:
>>> <L1>:
>>> pretmp_123 = *tab_37(D);
>>> _87 = pretmp_123 + 1;
>>> if (c_36 == 65)
>>>   goto <bb 5>; [34.00%]
>>> else
>>>   goto <bb 8>; [66.00%]
>>>
>>> <bb 5> [local count: 54163210]:
>>> *tab_37(D) = _87;
>>> _96 = MEM[(char *)s_57 + 1B];
>>> if (_96 != 0)
>>>   goto <bb 7>; [89.00%]
>>> else
>>>   goto <bb 6>; [11.00%]
>>>
>>> <bb 8> [local count: 105140348]:
>>> *tab_37(D) = _87;
>>> _56 = MEM[(char *)s_57 + 1B];
>>> if (_56 != 0)
>>>   goto <bb 10>; [89.00%]
>>> else
>>>   goto <bb 9>; [11.00%]
>>>
>>> here at least the stores and loads can be hoisted.  Note this
>>> may also point at the real issue of the code hoisting which is
>>> tearing apart the RMW operation?
>> Indeed, this possibility seems much more likely than block being on loop exit.
>> I will try to "hardcode" the load/store hoists into block 4 for this
>> specific test-case to check
>> if that prevents the spill.
> Even if it prevents the spill in this case, it's likely a good thing to
> do.  The statements prior to the conditional in bb5 and bb8 should be
> hoisted, leaving bb5 and bb8 with just their conditionals.
Hi,
It seems that disabling forwprop somehow avoids the extra spills
on the original test-case.

For instance,
Hoisting without forwprop:

bb 3:
_1 = tab_1(D) + 8
pretmp_268 = MEM[tab_1(D) + 8B];
_2 = pretmp_268 + 1;
goto <bb 4> or <bb 5>

bb 4:
 *_1 = _2

bb 5:
*_1 = _2

Hoisting with forwprop:

bb 3:
pretmp_164 = MEM[tab_1(D) + 8B];
_2 = pretmp_164 + 1
goto <bb 4> or <bb 5>

bb 4:
MEM[tab_1(D) + 8] = _2;

bb 5:
MEM[tab_1(D) + 8] = _2;

Although in both cases we aren't hoisting stores, the issue with forwprop
for this case seems to be the folding of
*_1 = _2
into
MEM[tab_1(D) + 8] = _2 ?

Disabling the folding to mem_ref[base + offset] in forwprop "works" in the
sense that it creates the same set of hoistings as without forwprop;
however, it still results in additional spills (albeit of different registers).

That's because forwprop seems to be increasing the live range of
prephitmp_217 by substituting
_221 + 1 with prephitmp_217 + 2 (_221 is defined as prephitmp_217 + 1).
On the other hand, Bin pointed out to me in private that forwprop also
helps to reduce register pressure by propagating "tab + const_int"
for the same test-case.

So I am not really sure if there's an easier fix than having
heuristics for estimating register pressure at the TREE level? I would be
grateful for suggestions on how to proceed from here.
Thanks!

Regards,
Prathamesh
>
> Jeff


* Re: PR80155: Code hoisting and register pressure
  2018-05-25  9:23       ` Prathamesh Kulkarni
@ 2018-05-25  9:49         ` Bin.Cheng
  2018-05-25  9:58           ` Richard Biener
  2018-05-25 16:57           ` Jeff Law
  0 siblings, 2 replies; 14+ messages in thread
From: Bin.Cheng @ 2018-05-25  9:49 UTC (permalink / raw)
  To: Prathamesh Kulkarni
  Cc: Jeff Law, Richard Biener, GCC Development, Thomas Preudhomme

On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
> On 23 May 2018 at 18:37, Jeff Law <law@redhat.com> wrote:
>> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>>> On 23 May 2018 at 13:58, Richard Biener <rguenther@suse.de> wrote:
>>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>>
>>>>> Hi,
>>>>> I am trying to work on PR80155, which exposes a problem with code
>>>>> hoisting and register pressure on a leading embedded benchmark for ARM
>>>>> cortex-m7, where code-hoisting causes an extra register spill.
>>>>>
>>>>> I have attached two test-cases which (hopefully) are representative of
>>>>> the original test-case.
>>>>> The first one (trans_dfa.c) is bigger and somewhat similar to the
>>>>> original test-case and trans_dfa_2.c is hand-reduced version of
>>>>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
>>>>> and one spill with trans_dfa_2.c due to lesser amount of cases.
>>>>> The test-cases in the PR are probably not relevant.
>>>>>
>>>>> Initially I thought the spill was happening because of "too many
>>>>> hoistings" taking place in original test-case thus increasing the
>>>>> register pressure, but it seems the spill is possibly caused because
>>>>> expression gets hoisted out of a block that is on loop exit.
>>>>>
>>>>> For example, the following hoistings take place with trans_dfa_2.c:
>>>>>
>>>>> (1) Inserting expression in block 4 for code hoisting:
>>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>>>
>>>>> (2) Inserting expression in block 4 for code hoisting: {plus_expr,_4,1} (0006)
>>>>>
>>>>> (3) Inserting expression in block 4 for code hoisting:
>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>
>>>>> (4) Inserting expression in block 3 for code hoisting:
>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>
>>>>> The issue seems to be hoisting of (*tab + 1) which consists of first
>>>>> two hoistings in block 4
>>>>> from blocks 5 and 9, which causes the extra spill. I verified that by
>>>>> disabling hoisting into block 4,
>>>>> which resulted in no extra spills.
>>>>>
>>>>> I wonder if that's because the expression (*tab + 1) is getting
>>>>> hoisted from blocks 5 and 9,
>>>>> which are on loop exit ? So the expression that was previously
>>>>> computed in a block on loop exit, gets hoisted outside that block
>>>>> which possibly makes the allocator more defensive ? Similarly
>>>>> disabling hoisting of expressions which appeared in blocks on loop
>>>>> exit in original test-case prevented the extra spill. The other
>>>>> hoistings didn't seem to matter.
>>>>
>>>> I think that's simply co-incidence.  The only thing that makes
>>>> a block that also exits from the loop special is that an
>>>> expression could be sunk out of the loop and hoisting (commoning
>>>> with another path) could prevent that.  But that isn't what is
>>>> happening here and it would be a pass ordering issue as
>>>> the sinking pass runs only after hoisting (no idea why exactly
>>>> but I guess there are cases where we want to prefer CSE over
>>>> sinking).  So you could try if re-ordering PRE and sinking helps
>>>> your testcase.
>>> Thanks for the suggestions. Placing sink pass before PRE works
>>> for both these test-cases! Sadly it still causes the spill for the benchmark -:(
>>> I will try to create a better approximation of the original test-case.
>>>>
>>>> What I do see is a missed opportunity to merge the successors
>>>> of BB 4.  After PRE we have
>>>>
>>>> <bb 4> [local count: 159303558]:
>>>> <L1>:
>>>> pretmp_123 = *tab_37(D);
>>>> _87 = pretmp_123 + 1;
>>>> if (c_36 == 65)
>>>>   goto <bb 5>; [34.00%]
>>>> else
>>>>   goto <bb 8>; [66.00%]
>>>>
>>>> <bb 5> [local count: 54163210]:
>>>> *tab_37(D) = _87;
>>>> _96 = MEM[(char *)s_57 + 1B];
>>>> if (_96 != 0)
>>>>   goto <bb 7>; [89.00%]
>>>> else
>>>>   goto <bb 6>; [11.00%]
>>>>
>>>> <bb 8> [local count: 105140348]:
>>>> *tab_37(D) = _87;
>>>> _56 = MEM[(char *)s_57 + 1B];
>>>> if (_56 != 0)
>>>>   goto <bb 10>; [89.00%]
>>>> else
>>>>   goto <bb 9>; [11.00%]
>>>>
>>>> here at least the stores and loads can be hoisted.  Note this
>>>> may also point at the real issue of the code hoisting which is
>>>> tearing apart the RMW operation?
>>> Indeed, this possibility seems much more likely than block being on loop exit.
>>> I will try to "hardcode" the load/store hoists into block 4 for this
>>> specific test-case to check
>>> if that prevents the spill.
>> Even if it prevents the spill in this case, it's likely a good thing to
>> do.  The statements prior to the conditional in bb5 and bb8 should be
>> hoisted, leaving bb5 and bb8 with just their conditionals.
> Hi,
> It seems disabling forwprop somehow works for causing no extra spills
> on the original test-case.
>
> For instance,
> Hoisting without forwprop:
>
> bb 3:
> _1 = tab_1(D) + 8
> pretmp_268 = MEM[tab_1(D) + 8B];
> _2 = pretmp_268 + 1;
> goto <bb 4> or <bb 5>
>
> bb 4:
>  *_1 = _ 2
>
> bb 5:
> *_1 = _2
>
> Hoisting with forwprop:
>
> bb 3:
> pretmp_164 = MEM[tab_1(D) + 8B];
> _2 = pretmp_164 + 1
> goto <bb 4> or <bb 5>
>
> bb 4:
> MEM[tab_1(D) + 8] = _2;
>
> bb 5:
> MEM[tab_1(D) + 8] = _2;
>
> Although in both cases, we aren't hoisting stores, the issues with forwprop
> for this case seems to be the folding of
> *_1 = _2
> into
> MEM[tab_1(D) + 8] = _2  ?

This isn't an issue, right?  IIUC, tab_1(D) is used all over the loop,
so propagating _1 as (tab_1(D) + 8) actually removes one live
range.

>
> Disabling folding to mem_ref[base + offset] in forwprop "works" in the
> sense it created same set of hoistings as without forwprop, however it
> still results in additional spills (albeit different registers).
>
> That's because forwprop seems to be increasing live range of
> prephitmp_217 by substituting
> _221 + 1 with prephitmp_217 + 2 (_221 is defined as prephitmp_217 + 1).
Hmm, it's hard to discuss private benchmarks; I am not sure in which dump
I should look for the prephitmp_221/prephitmp_217 stuff.

> On the other hand, Bin pointed out to me in private that forwprop also
> helps to restrict register pressure by propagating "tab + const_int"
> for same test-case.
>
> So I am not really sure if there's an easier fix than having
> heuristics for estimating register pressure at TREE level ? I would be
An easy fix, maybe not.  OTOH, I am more convinced that passes like
forwprop/sink/hoisting can be improved by taking live ranges into
consideration, specifically to direct such passes when moving code
across basic blocks, because inter-block register pressure
is hard to resolve afterwards.

As suggested by Jeff and Richi, I guess the first step would be doing
experiments and collecting more benchmark data for reordering sink before
pre?  It enables code sinking as well as decreasing register pressure in
the original reduced cases, IIRC.

Thanks,
bin
> grateful for suggestions on how to proceed from here.
> Thanks!
>
> Regards,
> Prathamesh
>>
>> Jeff


* Re: PR80155: Code hoisting and register pressure
  2018-05-25  9:49         ` Bin.Cheng
@ 2018-05-25  9:58           ` Richard Biener
  2018-05-25 16:57           ` Jeff Law
  1 sibling, 0 replies; 14+ messages in thread
From: Richard Biener @ 2018-05-25  9:58 UTC (permalink / raw)
  To: Bin.Cheng
  Cc: Prathamesh Kulkarni, Jeff Law, GCC Development, Thomas Preudhomme

On Fri, 25 May 2018, Bin.Cheng wrote:

> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> > On 23 May 2018 at 18:37, Jeff Law <law@redhat.com> wrote:
> >> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
> >>> On 23 May 2018 at 13:58, Richard Biener <rguenther@suse.de> wrote:
> >>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
> >>>>
> >>>>> Hi,
> >>>>> I am trying to work on PR80155, which exposes a problem with code
> >>>>> hoisting and register pressure on a leading embedded benchmark for ARM
> >>>>> cortex-m7, where code-hoisting causes an extra register spill.
> >>>>>
> >>>>> I have attached two test-cases which (hopefully) are representative of
> >>>>> the original test-case.
> >>>>> The first one (trans_dfa.c) is bigger and somewhat similar to the
> >>>>> original test-case and trans_dfa_2.c is hand-reduced version of
> >>>>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
> >>>>> and one spill with trans_dfa_2.c due to lesser amount of cases.
> >>>>> The test-cases in the PR are probably not relevant.
> >>>>>
> >>>>> Initially I thought the spill was happening because of "too many
> >>>>> hoistings" taking place in original test-case thus increasing the
> >>>>> register pressure, but it seems the spill is possibly caused because
> >>>>> expression gets hoisted out of a block that is on loop exit.
> >>>>>
> >>>>> For example, the following hoistings take place with trans_dfa_2.c:
> >>>>>
> >>>>> (1) Inserting expression in block 4 for code hoisting:
> >>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
> >>>>>
> >>>>> (2) Inserting expression in block 4 for code hoisting: {plus_expr,_4,1} (0006)
> >>>>>
> >>>>> (3) Inserting expression in block 4 for code hoisting:
> >>>>> {pointer_plus_expr,s_33,1} (0023)
> >>>>>
> >>>>> (4) Inserting expression in block 3 for code hoisting:
> >>>>> {pointer_plus_expr,s_33,1} (0023)
> >>>>>
> >>>>> The issue seems to be hoisting of (*tab + 1) which consists of first
> >>>>> two hoistings in block 4
> >>>>> from blocks 5 and 9, which causes the extra spill. I verified that by
> >>>>> disabling hoisting into block 4,
> >>>>> which resulted in no extra spills.
> >>>>>
> >>>>> I wonder if that's because the expression (*tab + 1) is getting
> >>>>> hoisted from blocks 5 and 9,
> >>>>> which are on loop exit ? So the expression that was previously
> >>>>> computed in a block on loop exit, gets hoisted outside that block
> >>>>> which possibly makes the allocator more defensive ? Similarly
> >>>>> disabling hoisting of expressions which appeared in blocks on loop
> >>>>> exit in original test-case prevented the extra spill. The other
> >>>>> hoistings didn't seem to matter.
> >>>>
> >>>> I think that's simply co-incidence.  The only thing that makes
> >>>> a block that also exits from the loop special is that an
> >>>> expression could be sunk out of the loop and hoisting (commoning
> >>>> with another path) could prevent that.  But that isn't what is
> >>>> happening here and it would be a pass ordering issue as
> >>>> the sinking pass runs only after hoisting (no idea why exactly
> >>>> but I guess there are cases where we want to prefer CSE over
> >>>> sinking).  So you could try if re-ordering PRE and sinking helps
> >>>> your testcase.
> >>> Thanks for the suggestions. Placing sink pass before PRE works
> >>> for both these test-cases! Sadly it still causes the spill for the benchmark -:(
> >>> I will try to create a better approximation of the original test-case.
> >>>>
> >>>> What I do see is a missed opportunity to merge the successors
> >>>> of BB 4.  After PRE we have
> >>>>
> >>>> <bb 4> [local count: 159303558]:
> >>>> <L1>:
> >>>> pretmp_123 = *tab_37(D);
> >>>> _87 = pretmp_123 + 1;
> >>>> if (c_36 == 65)
> >>>>   goto <bb 5>; [34.00%]
> >>>> else
> >>>>   goto <bb 8>; [66.00%]
> >>>>
> >>>> <bb 5> [local count: 54163210]:
> >>>> *tab_37(D) = _87;
> >>>> _96 = MEM[(char *)s_57 + 1B];
> >>>> if (_96 != 0)
> >>>>   goto <bb 7>; [89.00%]
> >>>> else
> >>>>   goto <bb 6>; [11.00%]
> >>>>
> >>>> <bb 8> [local count: 105140348]:
> >>>> *tab_37(D) = _87;
> >>>> _56 = MEM[(char *)s_57 + 1B];
> >>>> if (_56 != 0)
> >>>>   goto <bb 10>; [89.00%]
> >>>> else
> >>>>   goto <bb 9>; [11.00%]
> >>>>
> >>>> here at least the stores and loads can be hoisted.  Note this
> >>>> may also point at the real issue of the code hoisting which is
> >>>> tearing apart the RMW operation?
> >>> Indeed, this possibility seems much more likely than block being on loop exit.
> >>> I will try to "hardcode" the load/store hoists into block 4 for this
> >>> specific test-case to check
> >>> if that prevents the spill.
> >> Even if it prevents the spill in this case, it's likely a good thing to
> >> do.  The statements prior to the conditional in bb5 and bb8 should be
> >> hoisted, leaving bb5 and bb8 with just their conditionals.
> > Hi,
> > It seems disabling forwprop somehow works for causing no extra spills
> > on the original test-case.
> >
> > For instance,
> > Hoisting without forwprop:
> >
> > bb 3:
> > _1 = tab_1(D) + 8
> > pretmp_268 = MEM[tab_1(D) + 8B];
> > _2 = pretmp_268 + 1;
> > goto <bb 4> or <bb 5>
> >
> > bb 4:
> >  *_1 = _ 2
> >
> > bb 5:
> > *_1 = _2
> >
> > Hoisting with forwprop:
> >
> > bb 3:
> > pretmp_164 = MEM[tab_1(D) + 8B];
> > _2 = pretmp_164 + 1
> > goto <bb 4> or <bb 5>
> >
> > bb 4:
> > MEM[tab_1(D) + 8] = _2;
> >
> > bb 5:
> > MEM[tab_1(D) + 8] = _2;
> >
> > Although in both cases, we aren't hoisting stores, the issues with forwprop
> > for this case seems to be the folding of
> > *_1 = _2
> > into
> > MEM[tab_1(D) + 8] = _2  ?
> 
> This isn't an issue, right?  IIUC, tab_1(D) used all over the loop
> thus propagating _1 using (tab_1(D) + 8) actually removes one live
> range.
> 
> >
> > Disabling folding to mem_ref[base + offset] in forwprop "works" in the
> > sense it created same set of hoistings as without forwprop, however it
> > still results in additional spills (albeit different registers).
> >
> > That's because forwprop seems to be increasing live range of
> > prephitmp_217 by substituting
> > _221 + 1 with prephitmp_217 + 2 (_221 is defined as prephitmp_217 + 1).
> Hmm, it's hard to discuss private benchmarks, not sure which dump
> shall I find prephitmp_221/prephitmp_217 stuff.
> 
> > On the other hand, Bin pointed out to me in private that forwprop also
> > helps to restrict register pressure by propagating "tab + const_int"
> > for same test-case.
> >
> > So I am not really sure if there's an easier fix than having
> > heuristics for estimating register pressure at TREE level ? I would be
> Easy fix, maybe not.  OTOH, I am more convinced passes like
> forwprop/sink/hoisting can be improved by taking live range into
> consideration.  Specifically, to direct such passes when moving code
> around different basic blocks, because inter-block register pressure
> is hard to resolve afterwards.
> 
> As suggested by Jeff and Richi, I guess the first step would be doing
> experiments, collecting more benchmark data for reordering sink before
> pre?  It enables code sink as well as decreases register pressure in
> the original reduced cases IIRC.

Note sinking also doesn't have a cost model that takes into account
register pressure.  But yes, I can't think of a reason to do sinking
after PRE (well, apart from PRE performing value-numbering and that
of course might expose sinking opportunities).

So yes, perform some experiments - you should be able to use
-fdump-statistics[-stats] and look at the number of sunk stmts
reported (and also the number of PRE eliminations ultimately
performed).
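
A minimal sketch of that experiment (pass names taken from
gcc/passes.def, surrounding entries elided; this is an illustration
only, not a tested patch):

   ...
 + NEXT_PASS (pass_sink_code);
   NEXT_PASS (pass_pre);
   ...
 - NEXT_PASS (pass_sink_code);   /* existing entry, moved up */
   ...

and then comparing before/after output of something like

   gcc -O2 -fdump-statistics-stats -S trans_dfa_2.c

looking at the sink and PRE elimination counters in the .statistics
dump (the exact counter names may differ).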

Richard.

> Thanks,
> bin
> > grateful for suggestions on how to proceed from here.
> > Thanks!
> >
> > Regards,
> > Prathamesh
> >>
> >> Jeff
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PR80155: Code hoisting and register pressure
  2018-05-25  9:49         ` Bin.Cheng
  2018-05-25  9:58           ` Richard Biener
@ 2018-05-25 16:57           ` Jeff Law
  2018-05-25 17:54             ` Richard Biener
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff Law @ 2018-05-25 16:57 UTC (permalink / raw)
  To: Bin.Cheng, Prathamesh Kulkarni
  Cc: Richard Biener, GCC Development, Thomas Preudhomme

On 05/25/2018 03:49 AM, Bin.Cheng wrote:
> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
>> On 23 May 2018 at 18:37, Jeff Law <law@redhat.com> wrote:
>>> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>>>> On 23 May 2018 at 13:58, Richard Biener <rguenther@suse.de> wrote:
>>>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>>>
>>>>>> Hi,
>>>>>> I am trying to work on PR80155, which exposes a problem with code
>>>>>> hoisting and register pressure on a leading embedded benchmark for ARM
>>>>>> cortex-m7, where code-hoisting causes an extra register spill.
>>>>>>
>>>>>> I have attached two test-cases which (hopefully) are representative of
>>>>>> the original test-case.
>>>>>> The first one (trans_dfa.c) is bigger and somewhat similar to the
>>>>>> original test-case and trans_dfa_2.c is hand-reduced version of
>>>>>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
>>>>>> and one spill with trans_dfa_2.c due to lesser amount of cases.
>>>>>> The test-cases in the PR are probably not relevant.
>>>>>>
>>>>>> Initially I thought the spill was happening because of "too many
>>>>>> hoistings" taking place in original test-case thus increasing the
>>>>>> register pressure, but it seems the spill is possibly caused because
>>>>>> expression gets hoisted out of a block that is on loop exit.
>>>>>>
>>>>>> For example, the following hoistings take place with trans_dfa_2.c:
>>>>>>
>>>>>> (1) Inserting expression in block 4 for code hoisting:
>>>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>>>>
>>>>>> (2) Inserting expression in block 4 for code hoisting: {plus_expr,_4,1} (0006)
>>>>>>
>>>>>> (3) Inserting expression in block 4 for code hoisting:
>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>
>>>>>> (4) Inserting expression in block 3 for code hoisting:
>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>
>>>>>> The issue seems to be hoisting of (*tab + 1) which consists of first
>>>>>> two hoistings in block 4
>>>>>> from blocks 5 and 9, which causes the extra spill. I verified that by
>>>>>> disabling hoisting into block 4,
>>>>>> which resulted in no extra spills.
>>>>>>
>>>>>> I wonder if that's because the expression (*tab + 1) is getting
>>>>>> hoisted from blocks 5 and 9,
>>>>>> which are on loop exit ? So the expression that was previously
>>>>>> computed in a block on loop exit, gets hoisted outside that block
>>>>>> which possibly makes the allocator more defensive ? Similarly
>>>>>> disabling hoisting of expressions which appeared in blocks on loop
>>>>>> exit in original test-case prevented the extra spill. The other
>>>>>> hoistings didn't seem to matter.
>>>>>
>>>>> I think that's simply co-incidence.  The only thing that makes
>>>>> a block that also exits from the loop special is that an
>>>>> expression could be sunk out of the loop and hoisting (commoning
>>>>> with another path) could prevent that.  But that isn't what is
>>>>> happening here and it would be a pass ordering issue as
>>>>> the sinking pass runs only after hoisting (no idea why exactly
>>>>> but I guess there are cases where we want to prefer CSE over
>>>>> sinking).  So you could try if re-ordering PRE and sinking helps
>>>>> your testcase.
>>>> Thanks for the suggestions. Placing sink pass before PRE works
>>>> for both these test-cases! Sadly it still causes the spill for the benchmark -:(
>>>> I will try to create a better approximation of the original test-case.
>>>>>
>>>>> What I do see is a missed opportunity to merge the successors
>>>>> of BB 4.  After PRE we have
>>>>>
>>>>> <bb 4> [local count: 159303558]:
>>>>> <L1>:
>>>>> pretmp_123 = *tab_37(D);
>>>>> _87 = pretmp_123 + 1;
>>>>> if (c_36 == 65)
>>>>>   goto <bb 5>; [34.00%]
>>>>> else
>>>>>   goto <bb 8>; [66.00%]
>>>>>
>>>>> <bb 5> [local count: 54163210]:
>>>>> *tab_37(D) = _87;
>>>>> _96 = MEM[(char *)s_57 + 1B];
>>>>> if (_96 != 0)
>>>>>   goto <bb 7>; [89.00%]
>>>>> else
>>>>>   goto <bb 6>; [11.00%]
>>>>>
>>>>> <bb 8> [local count: 105140348]:
>>>>> *tab_37(D) = _87;
>>>>> _56 = MEM[(char *)s_57 + 1B];
>>>>> if (_56 != 0)
>>>>>   goto <bb 10>; [89.00%]
>>>>> else
>>>>>   goto <bb 9>; [11.00%]
>>>>>
>>>>> here at least the stores and loads can be hoisted.  Note this
>>>>> may also point at the real issue of the code hoisting which is
>>>>> tearing apart the RMW operation?
>>>> Indeed, this possibility seems much more likely than block being on loop exit.
>>>> I will try to "hardcode" the load/store hoists into block 4 for this
>>>> specific test-case to check
>>>> if that prevents the spill.
>>> Even if it prevents the spill in this case, it's likely a good thing to
>>> do.  The statements prior to the conditional in bb5 and bb8 should be
>>> hoisted, leaving bb5 and bb8 with just their conditionals.
>> Hi,
>> It seems disabling forwprop somehow works for causing no extra spills
>> on the original test-case.
>>
>> For instance,
>> Hoisting without forwprop:
>>
>> bb 3:
>> _1 = tab_1(D) + 8
>> pretmp_268 = MEM[tab_1(D) + 8B];
>> _2 = pretmp_268 + 1;
>> goto <bb 4> or <bb 5>
>>
>> bb 4:
>>  *_1 = _ 2
>>
>> bb 5:
>> *_1 = _2
>>
>> Hoisting with forwprop:
>>
>> bb 3:
>> pretmp_164 = MEM[tab_1(D) + 8B];
>> _2 = pretmp_164 + 1
>> goto <bb 4> or <bb 5>
>>
>> bb 4:
>> MEM[tab_1(D) + 8] = _2;
>>
>> bb 5:
>> MEM[tab_1(D) + 8] = _2;
>>
>> Although in both cases, we aren't hoisting stores, the issues with forwprop
>> for this case seems to be the folding of
>> *_1 = _2
>> into
>> MEM[tab_1(D) + 8] = _2  ?
> 
> This isn't an issue, right?  IIUC, tab_1(D) used all over the loop
> thus propagating _1 using (tab_1(D) + 8) actually removes one live
> range.
> 
>>
>> Disabling folding to mem_ref[base + offset] in forwprop "works" in the
>> sense it created same set of hoistings as without forwprop, however it
>> still results in additional spills (albeit different registers).
>>
>> That's because forwprop seems to be increasing live range of
>> prephitmp_217 by substituting
>> _221 + 1 with prephitmp_217 + 2 (_221 is defined as prephitmp_217 + 1).
> Hmm, it's hard to discuss private benchmarks, not sure which dump
> shall I find prephitmp_221/prephitmp_217 stuff.
> 
>> On the other hand, Bin pointed out to me in private that forwprop also
>> helps to restrict register pressure by propagating "tab + const_int"
>> for same test-case.
>>
>> So I am not really sure if there's an easier fix than having
>> heuristics for estimating register pressure at TREE level ? I would be
> Easy fix, maybe not.  OTOH, I am more convinced passes like
> forwprop/sink/hoisting can be improved by taking live range into
> consideration.  Specifically, to direct such passes when moving code
> around different basic blocks, because inter-block register pressure
> is hard to resolve afterwards.
> 
> As suggested by Jeff and Richi, I guess the first step would be doing
> experiments, collecting more benchmark data for reordering sink before
> pre?  It enables code sink as well as decreases register pressure in
> the original reduced cases IIRC.
We might even consider re-evaluating Bernd's work on what is effectively
a gimple scheduler to minimize register pressure.

Or we could look to extend your work into a generalized
pressure-reducing pass that we could run near the gimple/rtl border.

The final possibility would be Click's algorithm from '95 adjusted to
just do pressure reduction.

jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PR80155: Code hoisting and register pressure
  2018-05-25 16:57           ` Jeff Law
@ 2018-05-25 17:54             ` Richard Biener
  2018-05-25 19:26               ` Jeff Law
  2018-05-26 18:07               ` Bin.Cheng
  0 siblings, 2 replies; 14+ messages in thread
From: Richard Biener @ 2018-05-25 17:54 UTC (permalink / raw)
  To: Jeff Law, Bin.Cheng, Prathamesh Kulkarni
  Cc: GCC Development, Thomas Preudhomme

On May 25, 2018 6:57:13 PM GMT+02:00, Jeff Law <law@redhat.com> wrote:
>On 05/25/2018 03:49 AM, Bin.Cheng wrote:
>> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
>> <prathamesh.kulkarni@linaro.org> wrote:
>>> On 23 May 2018 at 18:37, Jeff Law <law@redhat.com> wrote:
>>>> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>>>>> On 23 May 2018 at 13:58, Richard Biener <rguenther@suse.de> wrote:
>>>>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I am trying to work on PR80155, which exposes a problem with
>code
>>>>>>> hoisting and register pressure on a leading embedded benchmark
>for ARM
>>>>>>> cortex-m7, where code-hoisting causes an extra register spill.
>>>>>>>
>>>>>>> I have attached two test-cases which (hopefully) are
>representative of
>>>>>>> the original test-case.
>>>>>>> The first one (trans_dfa.c) is bigger and somewhat similar to
>the
>>>>>>> original test-case and trans_dfa_2.c is hand-reduced version of
>>>>>>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
>>>>>>> and one spill with trans_dfa_2.c due to lesser amount of cases.
>>>>>>> The test-cases in the PR are probably not relevant.
>>>>>>>
>>>>>>> Initially I thought the spill was happening because of "too many
>>>>>>> hoistings" taking place in original test-case thus increasing
>the
>>>>>>> register pressure, but it seems the spill is possibly caused
>because
>>>>>>> expression gets hoisted out of a block that is on loop exit.
>>>>>>>
>>>>>>> For example, the following hoistings take place with
>trans_dfa_2.c:
>>>>>>>
>>>>>>> (1) Inserting expression in block 4 for code hoisting:
>>>>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>>>>>
>>>>>>> (2) Inserting expression in block 4 for code hoisting:
>{plus_expr,_4,1} (0006)
>>>>>>>
>>>>>>> (3) Inserting expression in block 4 for code hoisting:
>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>
>>>>>>> (4) Inserting expression in block 3 for code hoisting:
>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>
>>>>>>> The issue seems to be hoisting of (*tab + 1) which consists of
>first
>>>>>>> two hoistings in block 4
>>>>>>> from blocks 5 and 9, which causes the extra spill. I verified
>that by
>>>>>>> disabling hoisting into block 4,
>>>>>>> which resulted in no extra spills.
>>>>>>>
>>>>>>> I wonder if that's because the expression (*tab + 1) is getting
>>>>>>> hoisted from blocks 5 and 9,
>>>>>>> which are on loop exit ? So the expression that was previously
>>>>>>> computed in a block on loop exit, gets hoisted outside that
>block
>>>>>>> which possibly makes the allocator more defensive ? Similarly
>>>>>>> disabling hoisting of expressions which appeared in blocks on
>loop
>>>>>>> exit in original test-case prevented the extra spill. The other
>>>>>>> hoistings didn't seem to matter.
>>>>>>
>>>>>> I think that's simply co-incidence.  The only thing that makes
>>>>>> a block that also exits from the loop special is that an
>>>>>> expression could be sunk out of the loop and hoisting (commoning
>>>>>> with another path) could prevent that.  But that isn't what is
>>>>>> happening here and it would be a pass ordering issue as
>>>>>> the sinking pass runs only after hoisting (no idea why exactly
>>>>>> but I guess there are cases where we want to prefer CSE over
>>>>>> sinking).  So you could try if re-ordering PRE and sinking helps
>>>>>> your testcase.
>>>>> Thanks for the suggestions. Placing sink pass before PRE works
>>>>> for both these test-cases! Sadly it still causes the spill for the
>benchmark -:(
>>>>> I will try to create a better approximation of the original
>test-case.
>>>>>>
>>>>>> What I do see is a missed opportunity to merge the successors
>>>>>> of BB 4.  After PRE we have
>>>>>>
>>>>>> <bb 4> [local count: 159303558]:
>>>>>> <L1>:
>>>>>> pretmp_123 = *tab_37(D);
>>>>>> _87 = pretmp_123 + 1;
>>>>>> if (c_36 == 65)
>>>>>>   goto <bb 5>; [34.00%]
>>>>>> else
>>>>>>   goto <bb 8>; [66.00%]
>>>>>>
>>>>>> <bb 5> [local count: 54163210]:
>>>>>> *tab_37(D) = _87;
>>>>>> _96 = MEM[(char *)s_57 + 1B];
>>>>>> if (_96 != 0)
>>>>>>   goto <bb 7>; [89.00%]
>>>>>> else
>>>>>>   goto <bb 6>; [11.00%]
>>>>>>
>>>>>> <bb 8> [local count: 105140348]:
>>>>>> *tab_37(D) = _87;
>>>>>> _56 = MEM[(char *)s_57 + 1B];
>>>>>> if (_56 != 0)
>>>>>>   goto <bb 10>; [89.00%]
>>>>>> else
>>>>>>   goto <bb 9>; [11.00%]
>>>>>>
>>>>>> here at least the stores and loads can be hoisted.  Note this
>>>>>> may also point at the real issue of the code hoisting which is
>>>>>> tearing apart the RMW operation?
>>>>> Indeed, this possibility seems much more likely than block being
>on loop exit.
>>>>> I will try to "hardcode" the load/store hoists into block 4 for
>this
>>>>> specific test-case to check
>>>>> if that prevents the spill.
>>>> Even if it prevents the spill in this case, it's likely a good
>thing to
>>>> do.  The statements prior to the conditional in bb5 and bb8 should
>be
>>>> hoisted, leaving bb5 and bb8 with just their conditionals.
>>> Hi,
>>> It seems disabling forwprop somehow works for causing no extra
>spills
>>> on the original test-case.
>>>
>>> For instance,
>>> Hoisting without forwprop:
>>>
>>> bb 3:
>>> _1 = tab_1(D) + 8
>>> pretmp_268 = MEM[tab_1(D) + 8B];
>>> _2 = pretmp_268 + 1;
>>> goto <bb 4> or <bb 5>
>>>
>>> bb 4:
>>>  *_1 = _ 2
>>>
>>> bb 5:
>>> *_1 = _2
>>>
>>> Hoisting with forwprop:
>>>
>>> bb 3:
>>> pretmp_164 = MEM[tab_1(D) + 8B];
>>> _2 = pretmp_164 + 1
>>> goto <bb 4> or <bb 5>
>>>
>>> bb 4:
>>> MEM[tab_1(D) + 8] = _2;
>>>
>>> bb 5:
>>> MEM[tab_1(D) + 8] = _2;
>>>
>>> Although in both cases, we aren't hoisting stores, the issues with
>forwprop
>>> for this case seems to be the folding of
>>> *_1 = _2
>>> into
>>> MEM[tab_1(D) + 8] = _2  ?
>> 
>> This isn't an issue, right?  IIUC, tab_1(D) used all over the loop
>> thus propagating _1 using (tab_1(D) + 8) actually removes one live
>> range.
>> 
>>>
>>> Disabling folding to mem_ref[base + offset] in forwprop "works" in
>the
>>> sense it created same set of hoistings as without forwprop, however
>it
>>> still results in additional spills (albeit different registers).
>>>
>>> That's because forwprop seems to be increasing live range of
>>> prephitmp_217 by substituting
>>> _221 + 1 with prephitmp_217 + 2 (_221 is defined as prephitmp_217 +
>1).
>> Hmm, it's hard to discuss private benchmarks, not sure which dump
>> shall I find prephitmp_221/prephitmp_217 stuff.
>> 
>>> On the other hand, Bin pointed out to me in private that forwprop
>also
>>> helps to restrict register pressure by propagating "tab + const_int"
>>> for same test-case.
>>>
>>> So I am not really sure if there's an easier fix than having
>>> heuristics for estimating register pressure at TREE level ? I would
>be
>> Easy fix, maybe not.  OTOH, I am more convinced passes like
>> forwprop/sink/hoisting can be improved by taking live range into
>> consideration.  Specifically, to direct such passes when moving code
>> around different basic blocks, because inter-block register pressure
>> is hard to resolve afterwards.
>> 
>> As suggested by Jeff and Richi, I guess the first step would be doing
>> experiments, collecting more benchmark data for reordering sink
>before
>> pre?  It enables code sink as well as decreases register pressure in
>> the original reduced cases IIRC.
>We might even consider re-evaluating Bernd's work on what is
>effectively
>a gimple scheduler to minimize register pressure.

Sure. The main issue I see here is the interaction with TER, which we unfortunately still rely on. Enough GIMPLE instruction selection might help to get rid of the remaining pieces...

>Or we could look to extend your work into a generalized pressure
>reducing pass that we could run near the gimple/rtl border.
>
>The final possibility would be Click's algorithm from '95 adjusted to
>just do pressure reduction.
>
>jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PR80155: Code hoisting and register pressure
  2018-05-25 17:54             ` Richard Biener
@ 2018-05-25 19:26               ` Jeff Law
  2018-05-26  6:10                 ` Richard Biener
  2018-05-26 18:07               ` Bin.Cheng
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff Law @ 2018-05-25 19:26 UTC (permalink / raw)
  To: Richard Biener, Bin.Cheng, Prathamesh Kulkarni
  Cc: GCC Development, Thomas Preudhomme

On 05/25/2018 11:54 AM, Richard Biener wrote:
> On May 25, 2018 6:57:13 PM GMT+02:00, Jeff Law <law@redhat.com> wrote:
>> On 05/25/2018 03:49 AM, Bin.Cheng wrote:
>>> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
>>> <prathamesh.kulkarni@linaro.org> wrote:
>>>> On 23 May 2018 at 18:37, Jeff Law <law@redhat.com> wrote:
>>>>> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>>>>>> On 23 May 2018 at 13:58, Richard Biener <rguenther@suse.de> wrote:
>>>>>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I am trying to work on PR80155, which exposes a problem with
>> code
>>>>>>>> hoisting and register pressure on a leading embedded benchmark
>> for ARM
>>>>>>>> cortex-m7, where code-hoisting causes an extra register spill.
>>>>>>>>
>>>>>>>> I have attached two test-cases which (hopefully) are
>> representative of
>>>>>>>> the original test-case.
>>>>>>>> The first one (trans_dfa.c) is bigger and somewhat similar to
>> the
>>>>>>>> original test-case and trans_dfa_2.c is hand-reduced version of
>>>>>>>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
>>>>>>>> and one spill with trans_dfa_2.c due to lesser amount of cases.
>>>>>>>> The test-cases in the PR are probably not relevant.
>>>>>>>>
>>>>>>>> Initially I thought the spill was happening because of "too many
>>>>>>>> hoistings" taking place in original test-case thus increasing
>> the
>>>>>>>> register pressure, but it seems the spill is possibly caused
>> because
>>>>>>>> expression gets hoisted out of a block that is on loop exit.
>>>>>>>>
>>>>>>>> For example, the following hoistings take place with
>> trans_dfa_2.c:
>>>>>>>>
>>>>>>>> (1) Inserting expression in block 4 for code hoisting:
>>>>>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>>>>>>
>>>>>>>> (2) Inserting expression in block 4 for code hoisting:
>> {plus_expr,_4,1} (0006)
>>>>>>>>
>>>>>>>> (3) Inserting expression in block 4 for code hoisting:
>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>
>>>>>>>> (4) Inserting expression in block 3 for code hoisting:
>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>
>>>>>>>> The issue seems to be hoisting of (*tab + 1) which consists of
>> first
>>>>>>>> two hoistings in block 4
>>>>>>>> from blocks 5 and 9, which causes the extra spill. I verified
>> that by
>>>>>>>> disabling hoisting into block 4,
>>>>>>>> which resulted in no extra spills.
>>>>>>>>
>>>>>>>> I wonder if that's because the expression (*tab + 1) is getting
>>>>>>>> hoisted from blocks 5 and 9,
>>>>>>>> which are on loop exit ? So the expression that was previously
>>>>>>>> computed in a block on loop exit, gets hoisted outside that
>> block
>>>>>>>> which possibly makes the allocator more defensive ? Similarly
>>>>>>>> disabling hoisting of expressions which appeared in blocks on
>> loop
>>>>>>>> exit in original test-case prevented the extra spill. The other
>>>>>>>> hoistings didn't seem to matter.
>>>>>>>
>>>>>>> I think that's simply co-incidence.  The only thing that makes
>>>>>>> a block that also exits from the loop special is that an
>>>>>>> expression could be sunk out of the loop and hoisting (commoning
>>>>>>> with another path) could prevent that.  But that isn't what is
>>>>>>> happening here and it would be a pass ordering issue as
>>>>>>> the sinking pass runs only after hoisting (no idea why exactly
>>>>>>> but I guess there are cases where we want to prefer CSE over
>>>>>>> sinking).  So you could try if re-ordering PRE and sinking helps
>>>>>>> your testcase.
>>>>>> Thanks for the suggestions. Placing sink pass before PRE works
>>>>>> for both these test-cases! Sadly it still causes the spill for the
>> benchmark -:(
>>>>>> I will try to create a better approximation of the original
>> test-case.
>>>>>>>
>>>>>>> What I do see is a missed opportunity to merge the successors
>>>>>>> of BB 4.  After PRE we have
>>>>>>>
>>>>>>> <bb 4> [local count: 159303558]:
>>>>>>> <L1>:
>>>>>>> pretmp_123 = *tab_37(D);
>>>>>>> _87 = pretmp_123 + 1;
>>>>>>> if (c_36 == 65)
>>>>>>>   goto <bb 5>; [34.00%]
>>>>>>> else
>>>>>>>   goto <bb 8>; [66.00%]
>>>>>>>
>>>>>>> <bb 5> [local count: 54163210]:
>>>>>>> *tab_37(D) = _87;
>>>>>>> _96 = MEM[(char *)s_57 + 1B];
>>>>>>> if (_96 != 0)
>>>>>>>   goto <bb 7>; [89.00%]
>>>>>>> else
>>>>>>>   goto <bb 6>; [11.00%]
>>>>>>>
>>>>>>> <bb 8> [local count: 105140348]:
>>>>>>> *tab_37(D) = _87;
>>>>>>> _56 = MEM[(char *)s_57 + 1B];
>>>>>>> if (_56 != 0)
>>>>>>>   goto <bb 10>; [89.00%]
>>>>>>> else
>>>>>>>   goto <bb 9>; [11.00%]
>>>>>>>
>>>>>>> here at least the stores and loads can be hoisted.  Note this
>>>>>>> may also point at the real issue of the code hoisting which is
>>>>>>> tearing apart the RMW operation?
>>>>>> Indeed, this possibility seems much more likely than block being
>> on loop exit.
>>>>>> I will try to "hardcode" the load/store hoists into block 4 for
>> this
>>>>>> specific test-case to check
>>>>>> if that prevents the spill.
>>>>> Even if it prevents the spill in this case, it's likely a good
>> thing to
>>>>> do.  The statements prior to the conditional in bb5 and bb8 should
>> be
>>>>> hoisted, leaving bb5 and bb8 with just their conditionals.
>>>> Hi,
>>>> It seems disabling forwprop somehow works for causing no extra
>> spills
>>>> on the original test-case.
>>>>
>>>> For instance,
>>>> Hoisting without forwprop:
>>>>
>>>> bb 3:
>>>> _1 = tab_1(D) + 8
>>>> pretmp_268 = MEM[tab_1(D) + 8B];
>>>> _2 = pretmp_268 + 1;
>>>> goto <bb 4> or <bb 5>
>>>>
>>>> bb 4:
>>>>  *_1 = _ 2
>>>>
>>>> bb 5:
>>>> *_1 = _2
>>>>
>>>> Hoisting with forwprop:
>>>>
>>>> bb 3:
>>>> pretmp_164 = MEM[tab_1(D) + 8B];
>>>> _2 = pretmp_164 + 1
>>>> goto <bb 4> or <bb 5>
>>>>
>>>> bb 4:
>>>> MEM[tab_1(D) + 8] = _2;
>>>>
>>>> bb 5:
>>>> MEM[tab_1(D) + 8] = _2;
>>>>
>>>> Although in both cases, we aren't hoisting stores, the issues with
>> forwprop
>>>> for this case seems to be the folding of
>>>> *_1 = _2
>>>> into
>>>> MEM[tab_1(D) + 8] = _2  ?
>>>
>>> This isn't an issue, right?  IIUC, tab_1(D) used all over the loop
>>> thus propagating _1 using (tab_1(D) + 8) actually removes one live
>>> range.
>>>
>>>>
>>>> Disabling folding to mem_ref[base + offset] in forwprop "works" in
>> the
>>>> sense it created same set of hoistings as without forwprop, however
>> it
>>>> still results in additional spills (albeit different registers).
>>>>
>>>> That's because forwprop seems to be increasing live range of
>>>> prephitmp_217 by substituting
>>>> _221 + 1 with prephitmp_217 + 2 (_221 is defined as prephitmp_217 +
>> 1).
>>> Hmm, it's hard to discuss private benchmarks, not sure which dump
>>> shall I find prephitmp_221/prephitmp_217 stuff.
>>>
>>>> On the other hand, Bin pointed out to me in private that forwprop
>> also
>>>> helps to restrict register pressure by propagating "tab + const_int"
>>>> for same test-case.
>>>>
>>>> So I am not really sure if there's an easier fix than having
>>>> heuristics for estimating register pressure at TREE level ? I would
>> be
>>> Easy fix, maybe not.  OTOH, I am more convinced passes like
>>> forwprop/sink/hoisting can be improved by taking live range into
>>> consideration.  Specifically, to direct such passes when moving code
>>> around different basic blocks, because inter-block register pressure
>>> is hard to resolve afterwards.
>>>
>>> As suggested by Jeff and Richi, I guess the first step would be doing
>>> experiments, collecting more benchmark data for reordering sink
>> before
>>> pre?  It enables code sink as well as decreases register pressure in
>>> the original reduced cases IIRC.
>> We might even consider re-evaluating Bernd's work on what is
>> effectively
>> a gimple scheduler to minimize register pressure.
> 
> Sure. The main issue here I see is with the interaction with TER which we unfortunately still rely on. Enough GIMPLE instruction selection might help to get rid of the remaining pieces... 
I really wonder how bad it would be to walk over expr.c and change the
expanders to be able to walk SSA_NAME_DEF_STMT to potentially get at the
more complex statements rather than relying on TER.

That's really all TER is supposed to be doing anyway.
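
As a rough sketch of that direction (illustration only -- the helper
name is made up, and it ignores the validity/coalescing constraints
that TER computes, which a real version would have to honor):

/* Hypothetical expr.c helper: look through a single-use SSA name to
   its defining statement so the expander can expand the RHS directly
   instead of relying on TER having substituted it.  Assumes the
   usual tree.h/gimple.h/ssa.h context of expr.c.  */
static gimple *
get_combinable_def (tree op)
{
  /* Mirror what TER substitutes today: single-use SSA names only.  */
  if (TREE_CODE (op) != SSA_NAME || !has_single_use (op))
    return NULL;

  gimple *def = SSA_NAME_DEF_STMT (op);

  /* Keep it to simple, side-effect-free assignments.  */
  if (is_gimple_assign (def) && !gimple_has_side_effects (def))
    return def;

  return NULL;
}

Each expander would call this on its operands and, when it returns a
statement, expand that statement's RHS as part of the larger
expression.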

Jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PR80155: Code hoisting and register pressure
  2018-05-25 19:26               ` Jeff Law
@ 2018-05-26  6:10                 ` Richard Biener
  0 siblings, 0 replies; 14+ messages in thread
From: Richard Biener @ 2018-05-26  6:10 UTC (permalink / raw)
  To: Jeff Law, Bin.Cheng, Prathamesh Kulkarni
  Cc: GCC Development, Thomas Preudhomme

On May 25, 2018 9:25:51 PM GMT+02:00, Jeff Law <law@redhat.com> wrote:
>On 05/25/2018 11:54 AM, Richard Biener wrote:
>> On May 25, 2018 6:57:13 PM GMT+02:00, Jeff Law <law@redhat.com>
>wrote:
>>> On 05/25/2018 03:49 AM, Bin.Cheng wrote:
>>>> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
>>>> <prathamesh.kulkarni@linaro.org> wrote:
>>>>> On 23 May 2018 at 18:37, Jeff Law <law@redhat.com> wrote:
>>>>>> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>>>>>>> On 23 May 2018 at 13:58, Richard Biener <rguenther@suse.de>
>wrote:
>>>>>>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> I am trying to work on PR80155, which exposes a problem with
>>> code
>>>>>>>>> hoisting and register pressure on a leading embedded benchmark
>>> for ARM
>>>>>>>>> cortex-m7, where code-hoisting causes an extra register spill.
>>>>>>>>>
>>>>>>>>> I have attached two test-cases which (hopefully) are
>>> representative of
>>>>>>>>> the original test-case.
>>>>>>>>> The first one (trans_dfa.c) is bigger and somewhat similar to
>>> the
>>>>>>>>> original test-case and trans_dfa_2.c is hand-reduced version
>of
>>>>>>>>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
>>>>>>>>> and one spill with trans_dfa_2.c due to lesser amount of
>cases.
>>>>>>>>> The test-cases in the PR are probably not relevant.
>>>>>>>>>
>>>>>>>>> Initially I thought the spill was happening because of "too
>many
>>>>>>>>> hoistings" taking place in original test-case thus increasing
>>> the
>>>>>>>>> register pressure, but it seems the spill is possibly caused
>>> because
>>>>>>>>> expression gets hoisted out of a block that is on loop exit.
>>>>>>>>>
>>>>>>>>> For example, the following hoistings take place with
>>> trans_dfa_2.c:
>>>>>>>>>
>>>>>>>>> (1) Inserting expression in block 4 for code hoisting:
>>>>>>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>>>>>>>
>>>>>>>>> (2) Inserting expression in block 4 for code hoisting:
>>> {plus_expr,_4,1} (0006)
>>>>>>>>>
>>>>>>>>> (3) Inserting expression in block 4 for code hoisting:
>>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>>
>>>>>>>>> (4) Inserting expression in block 3 for code hoisting:
>>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>>
>>>>>>>>> The issue seems to be hoisting of (*tab + 1) which consists of
>>> first
>>>>>>>>> two hoistings in block 4
>>>>>>>>> from blocks 5 and 9, which causes the extra spill. I verified
>>> that by
>>>>>>>>> disabling hoisting into block 4,
>>>>>>>>> which resulted in no extra spills.
>>>>>>>>>
>>>>>>>>> I wonder if that's because the expression (*tab + 1) is
>getting
>>>>>>>>> hoisted from blocks 5 and 9,
>>>>>>>>> which are on loop exit ? So the expression that was previously
>>>>>>>>> computed in a block on loop exit, gets hoisted outside that
>>> block
>>>>>>>>> which possibly makes the allocator more defensive ? Similarly
>>>>>>>>> disabling hoisting of expressions which appeared in blocks on
>>> loop
>>>>>>>>> exit in original test-case prevented the extra spill. The
>other
>>>>>>>>> hoistings didn't seem to matter.
>>>>>>>>
>>>>>>>> I think that's simply co-incidence.  The only thing that makes
>>>>>>>> a block that also exits from the loop special is that an
>>>>>>>> expression could be sunk out of the loop and hoisting
>(commoning
>>>>>>>> with another path) could prevent that.  But that isn't what is
>>>>>>>> happening here and it would be a pass ordering issue as
>>>>>>>> the sinking pass runs only after hoisting (no idea why exactly
>>>>>>>> but I guess there are cases where we want to prefer CSE over
>>>>>>>> sinking).  So you could try if re-ordering PRE and sinking
>helps
>>>>>>>> your testcase.
>>>>>>> Thanks for the suggestions. Placing sink pass before PRE works
>>>>>>> for both these test-cases! Sadly it still causes the spill for
>the
>>> benchmark -:(
>>>>>>> I will try to create a better approximation of the original
>>> test-case.
>>>>>>>>
>>>>>>>> What I do see is a missed opportunity to merge the successors
>>>>>>>> of BB 4.  After PRE we have
>>>>>>>>
>>>>>>>> <bb 4> [local count: 159303558]:
>>>>>>>> <L1>:
>>>>>>>> pretmp_123 = *tab_37(D);
>>>>>>>> _87 = pretmp_123 + 1;
>>>>>>>> if (c_36 == 65)
>>>>>>>>   goto <bb 5>; [34.00%]
>>>>>>>> else
>>>>>>>>   goto <bb 8>; [66.00%]
>>>>>>>>
>>>>>>>> <bb 5> [local count: 54163210]:
>>>>>>>> *tab_37(D) = _87;
>>>>>>>> _96 = MEM[(char *)s_57 + 1B];
>>>>>>>> if (_96 != 0)
>>>>>>>>   goto <bb 7>; [89.00%]
>>>>>>>> else
>>>>>>>>   goto <bb 6>; [11.00%]
>>>>>>>>
>>>>>>>> <bb 8> [local count: 105140348]:
>>>>>>>> *tab_37(D) = _87;
>>>>>>>> _56 = MEM[(char *)s_57 + 1B];
>>>>>>>> if (_56 != 0)
>>>>>>>>   goto <bb 10>; [89.00%]
>>>>>>>> else
>>>>>>>>   goto <bb 9>; [11.00%]
>>>>>>>>
>>>>>>>> here at least the stores and loads can be hoisted.  Note this
>>>>>>>> may also point at the real issue of the code hoisting which is
>>>>>>>> tearing apart the RMW operation?
>>>>>>> Indeed, this possibility seems much more likely than block being
>>> on loop exit.
>>>>>>> I will try to "hardcode" the load/store hoists into block 4 for
>>> this
>>>>>>> specific test-case to check
>>>>>>> if that prevents the spill.
>>>>>> Even if it prevents the spill in this case, it's likely a good
>>> thing to
>>>>>> do.  The statements prior to the conditional in bb5 and bb8
>should
>>> be
>>>>>> hoisted, leaving bb5 and bb8 with just their conditionals.
>>>>> Hi,
>>>>> It seems disabling forwprop somehow works for causing no extra
>>> spills
>>>>> on the original test-case.
>>>>>
>>>>> For instance,
>>>>> Hoisting without forwprop:
>>>>>
>>>>> bb 3:
>>>>> _1 = tab_1(D) + 8
>>>>> pretmp_268 = MEM[tab_1(D) + 8B];
>>>>> _2 = pretmp_268 + 1;
>>>>> goto <bb 4> or <bb 5>
>>>>>
>>>>> bb 4:
>>>>>  *_1 = _ 2
>>>>>
>>>>> bb 5:
>>>>> *_1 = _2
>>>>>
>>>>> Hoisting with forwprop:
>>>>>
>>>>> bb 3:
>>>>> pretmp_164 = MEM[tab_1(D) + 8B];
>>>>> _2 = pretmp_164 + 1
>>>>> goto <bb 4> or <bb 5>
>>>>>
>>>>> bb 4:
>>>>> MEM[tab_1(D) + 8] = _2;
>>>>>
>>>>> bb 5:
>>>>> MEM[tab_1(D) + 8] = _2;
>>>>>
>>>>> Although in both cases, we aren't hoisting stores, the issues with
>>> forwprop
>>>>> for this case seems to be the folding of
>>>>> *_1 = _2
>>>>> into
>>>>> MEM[tab_1(D) + 8] = _2  ?
>>>>
>>>> This isn't an issue, right?  IIUC, tab_1(D) used all over the loop
>>>> thus propagating _1 using (tab_1(D) + 8) actually removes one live
>>>> range.
>>>>
>>>>>
>>>>> Disabling folding to mem_ref[base + offset] in forwprop "works" in
>>> the
>>>>> sense it created same set of hoistings as without forwprop,
>however
>>> it
>>>>> still results in additional spills (albeit different registers).
>>>>>
>>>>> That's because forwprop seems to be increasing live range of
>>>>> prephitmp_217 by substituting
>>>>> _221 + 1 with prephitmp_217 + 2 (_221 is defined as prephitmp_217
>+
>>> 1).
>>>> Hmm, it's hard to discuss private benchmarks, not sure which dump
>>>> shall I find prephitmp_221/prephitmp_217 stuff.
>>>>
>>>>> On the other hand, Bin pointed out to me in private that forwprop
>>> also
>>>>> helps to restrict register pressure by propagating "tab +
>const_int"
>>>>> for same test-case.
>>>>>
>>>>> So I am not really sure if there's an easier fix than having
>>>>> heuristics for estimating register pressure at TREE level ? I
>would
>>> be
>>>> Easy fix, maybe not.  OTOH, I am more convinced passes like
>>>> forwprop/sink/hoisting can be improved by taking live range into
>>>> consideration.  Specifically, to direct such passes when moving
>code
>>>> around different basic blocks, because inter-block register
>pressure
>>>> is hard to resolve afterwards.
>>>>
>>>> As suggested by Jeff and Richi, I guess the first step would be
>doing
>>>> experiments, collecting more benchmark data for reordering sink
>>> before
>>>> pre?  It enables code sink as well as decreases register pressure
>in
>>>> the original reduced cases IIRC.
>>> We might even consider re-evaluating Bernd's work on what is
>>> effectively
>>> a gimple scheduler to minimize register pressure.
>> 
>> Sure. The main issue here I see is with the interaction with TER
>which we unfortunately still rely on. Enough GIMPLE instruction
>selection might help to get rid of the remaining pieces... 
>I really wonder how bad it would be to walk over expr.c and change the
>expanders to be able to walk SSA_NAME_DEF_STMT to potentially get at
>the
>more complex statements rather than relying on TER.

But that's what they do... TER computes when this is valid and doesn't break anything due to coalescing.

>That's really all TER is supposed to be doing anyway.

Yes. Last year I posted patches to apply the scheduling that TER does on top of the above, but it was difficult to dissect it out... Maybe we need to try harder here...

Richard. 

>Jeff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PR80155: Code hoisting and register pressure
  2018-05-25 17:54             ` Richard Biener
  2018-05-25 19:26               ` Jeff Law
@ 2018-05-26 18:07               ` Bin.Cheng
  2018-05-28 15:22                 ` Richard Biener
  1 sibling, 1 reply; 14+ messages in thread
From: Bin.Cheng @ 2018-05-26 18:07 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jeff Law, Prathamesh Kulkarni, GCC Development, Thomas Preudhomme

On Fri, May 25, 2018 at 5:54 PM, Richard Biener <rguenther@suse.de> wrote:
> On May 25, 2018 6:57:13 PM GMT+02:00, Jeff Law <law@redhat.com> wrote:
>>On 05/25/2018 03:49 AM, Bin.Cheng wrote:
>>> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
>>> <prathamesh.kulkarni@linaro.org> wrote:
>>>> On 23 May 2018 at 18:37, Jeff Law <law@redhat.com> wrote:
>>>>> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>>>>>> On 23 May 2018 at 13:58, Richard Biener <rguenther@suse.de> wrote:
>>>>>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I am trying to work on PR80155, which exposes a problem with
>>code
>>>>>>>> hoisting and register pressure on a leading embedded benchmark
>>for ARM
>>>>>>>> cortex-m7, where code-hoisting causes an extra register spill.
>>>>>>>>
>>>>>>>> I have attached two test-cases which (hopefully) are
>>representative of
>>>>>>>> the original test-case.
>>>>>>>> The first one (trans_dfa.c) is bigger and somewhat similar to
>>the
>>>>>>>> original test-case and trans_dfa_2.c is hand-reduced version of
>>>>>>>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
>>>>>>>> and one spill with trans_dfa_2.c due to lesser amount of cases.
>>>>>>>> The test-cases in the PR are probably not relevant.
>>>>>>>>
>>>>>>>> Initially I thought the spill was happening because of "too many
>>>>>>>> hoistings" taking place in original test-case thus increasing
>>the
>>>>>>>> register pressure, but it seems the spill is possibly caused
>>because
>>>>>>>> expression gets hoisted out of a block that is on loop exit.
>>>>>>>>
>>>>>>>> For example, the following hoistings take place with
>>trans_dfa_2.c:
>>>>>>>>
>>>>>>>> (1) Inserting expression in block 4 for code hoisting:
>>>>>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>>>>>>
>>>>>>>> (2) Inserting expression in block 4 for code hoisting:
>>{plus_expr,_4,1} (0006)
>>>>>>>>
>>>>>>>> (3) Inserting expression in block 4 for code hoisting:
>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>
>>>>>>>> (4) Inserting expression in block 3 for code hoisting:
>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>
>>>>>>>> The issue seems to be hoisting of (*tab + 1) which consists of
>>first
>>>>>>>> two hoistings in block 4
>>>>>>>> from blocks 5 and 9, which causes the extra spill. I verified
>>that by
>>>>>>>> disabling hoisting into block 4,
>>>>>>>> which resulted in no extra spills.
>>>>>>>>
>>>>>>>> I wonder if that's because the expression (*tab + 1) is getting
>>>>>>>> hoisted from blocks 5 and 9,
>>>>>>>> which are on loop exit ? So the expression that was previously
>>>>>>>> computed in a block on loop exit, gets hoisted outside that
>>block
>>>>>>>> which possibly makes the allocator more defensive ? Similarly
>>>>>>>> disabling hoisting of expressions which appeared in blocks on
>>loop
>>>>>>>> exit in original test-case prevented the extra spill. The other
>>>>>>>> hoistings didn't seem to matter.
>>>>>>>
>>>>>>> I think that's simply co-incidence.  The only thing that makes
>>>>>>> a block that also exits from the loop special is that an
>>>>>>> expression could be sunk out of the loop and hoisting (commoning
>>>>>>> with another path) could prevent that.  But that isn't what is
>>>>>>> happening here and it would be a pass ordering issue as
>>>>>>> the sinking pass runs only after hoisting (no idea why exactly
>>>>>>> but I guess there are cases where we want to prefer CSE over
>>>>>>> sinking).  So you could try if re-ordering PRE and sinking helps
>>>>>>> your testcase.
>>>>>> Thanks for the suggestions. Placing sink pass before PRE works
>>>>>> for both these test-cases! Sadly it still causes the spill for the
>>benchmark -:(
>>>>>> I will try to create a better approximation of the original
>>test-case.
>>>>>>>
>>>>>>> What I do see is a missed opportunity to merge the successors
>>>>>>> of BB 4.  After PRE we have
>>>>>>>
>>>>>>> <bb 4> [local count: 159303558]:
>>>>>>> <L1>:
>>>>>>> pretmp_123 = *tab_37(D);
>>>>>>> _87 = pretmp_123 + 1;
>>>>>>> if (c_36 == 65)
>>>>>>>   goto <bb 5>; [34.00%]
>>>>>>> else
>>>>>>>   goto <bb 8>; [66.00%]
>>>>>>>
>>>>>>> <bb 5> [local count: 54163210]:
>>>>>>> *tab_37(D) = _87;
>>>>>>> _96 = MEM[(char *)s_57 + 1B];
>>>>>>> if (_96 != 0)
>>>>>>>   goto <bb 7>; [89.00%]
>>>>>>> else
>>>>>>>   goto <bb 6>; [11.00%]
>>>>>>>
>>>>>>> <bb 8> [local count: 105140348]:
>>>>>>> *tab_37(D) = _87;
>>>>>>> _56 = MEM[(char *)s_57 + 1B];
>>>>>>> if (_56 != 0)
>>>>>>>   goto <bb 10>; [89.00%]
>>>>>>> else
>>>>>>>   goto <bb 9>; [11.00%]
>>>>>>>
>>>>>>> here at least the stores and loads can be hoisted.  Note this
>>>>>>> may also point at the real issue of the code hoisting which is
>>>>>>> tearing apart the RMW operation?
>>>>>> Indeed, this possibility seems much more likely than block being
>>on loop exit.
>>>>>> I will try to "hardcode" the load/store hoists into block 4 for
>>this
>>>>>> specific test-case to check
>>>>>> if that prevents the spill.
>>>>> Even if it prevents the spill in this case, it's likely a good
>>thing to
>>>>> do.  The statements prior to the conditional in bb5 and bb8 should
>>be
>>>>> hoisted, leaving bb5 and bb8 with just their conditionals.
>>>> Hi,
>>>> It seems disabling forwprop somehow works for causing no extra
>>spills
>>>> on the original test-case.
>>>>
>>>> For instance,
>>>> Hoisting without forwprop:
>>>>
>>>> bb 3:
>>>> _1 = tab_1(D) + 8
>>>> pretmp_268 = MEM[tab_1(D) + 8B];
>>>> _2 = pretmp_268 + 1;
>>>> goto <bb 4> or <bb 5>
>>>>
>>>> bb 4:
>>>>  *_1 = _ 2
>>>>
>>>> bb 5:
>>>> *_1 = _2
>>>>
>>>> Hoisting with forwprop:
>>>>
>>>> bb 3:
>>>> pretmp_164 = MEM[tab_1(D) + 8B];
>>>> _2 = pretmp_164 + 1
>>>> goto <bb 4> or <bb 5>
>>>>
>>>> bb 4:
>>>> MEM[tab_1(D) + 8] = _2;
>>>>
>>>> bb 5:
>>>> MEM[tab_1(D) + 8] = _2;
>>>>
>>>> Although in both cases, we aren't hoisting stores, the issues with
>>forwprop
>>>> for this case seems to be the folding of
>>>> *_1 = _2
>>>> into
>>>> MEM[tab_1(D) + 8] = _2  ?
>>>
>>> This isn't an issue, right?  IIUC, tab_1(D) used all over the loop
>>> thus propagating _1 using (tab_1(D) + 8) actually removes one live
>>> range.
>>>
>>>>
>>>> Disabling folding to mem_ref[base + offset] in forwprop "works" in
>>the
>>>> sense it created same set of hoistings as without forwprop, however
>>it
>>>> still results in additional spills (albeit different registers).
>>>>
>>>> That's because forwprop seems to be increasing live range of
>>>> prephitmp_217 by substituting
>>>> _221 + 1 with prephitmp_217 + 2 (_221 is defined as prephitmp_217 +
>>1).
>>> Hmm, it's hard to discuss private benchmarks, not sure which dump
>>> shall I find prephitmp_221/prephitmp_217 stuff.
>>>
>>>> On the other hand, Bin pointed out to me in private that forwprop
>>also
>>>> helps to restrict register pressure by propagating "tab + const_int"
>>>> for same test-case.
>>>>
>>>> So I am not really sure if there's an easier fix than having
>>>> heuristics for estimating register pressure at TREE level ? I would
>>be
>>> Easy fix, maybe not.  OTOH, I am more convinced passes like
>>> forwprop/sink/hoisting can be improved by taking live range into
>>> consideration.  Specifically, to direct such passes when moving code
>>> around different basic blocks, because inter-block register pressure
>>> is hard to resolve afterwards.
>>>
>>> As suggested by Jeff and Richi, I guess the first step would be doing
>>> experiments, collecting more benchmark data for reordering sink
>>before
>>> pre?  It enables code sink as well as decreases register pressure in
>>> the original reduced cases IIRC.
>>We might even consider re-evaluating Bernd's work on what is
>>effectively
>>a gimple scheduler to minimize register pressure.
Could you please point me to Bernd's work?  Does it schedule across
different basic blocks to minimize register pressure?  As in this
case, various optimizers extend live ranges into different basic
blocks.  I once prototyped a single-block gimple scheduler, but it
isn't very useful IMHO.

It would be a huge amount of work to take register pressure into
consideration in the various optimizers.  OTOH, having a single
inter-block live-range shrinking pass before expanding would be
great, but I think it would be at least equally difficult.

>
> Sure. The main issue here I see is with the interaction with TER which we unfortunately still rely on. Enough GIMPLE instruction selection might help to get rid of the remaining pieces...

If the scheduler is designed to focus on inter-block live-range
shrinking, it won't interfere with TER, which only replaces single
uses within each basic block.
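
To illustrate with a made-up fragment (not from the benchmark):

<bb 2> :
_5 = a_1(D) + b_2(D);
...                      ;; only use of _5 is in bb 7

<bb 7> :
c_9 = _5 * 4;

An inter-block pass would sink the definition of _5 into bb 7 and so
shrink its live range across the intermediate blocks; once the def
and its use share a block, TER can still do its usual single-use
substitution, so the two shouldn't conflict.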

Thanks,
bin
>
>>Or we could look to extend your work into a generalized pressure
>>reducing pass that we could run near the gimple/rtl border.
>>
>>The final possibility would be Click's algorithm from '95 adjusted to
>>just do pressure reduction.
>>
>>jeff
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PR80155: Code hoisting and register pressure
  2018-05-26 18:07               ` Bin.Cheng
@ 2018-05-28 15:22                 ` Richard Biener
  0 siblings, 0 replies; 14+ messages in thread
From: Richard Biener @ 2018-05-28 15:22 UTC (permalink / raw)
  To: Bin.Cheng
  Cc: Jeff Law, Prathamesh Kulkarni, GCC Development, Thomas Preudhomme

On Sat, 26 May 2018, Bin.Cheng wrote:

> On Fri, May 25, 2018 at 5:54 PM, Richard Biener <rguenther@suse.de> wrote:
> > On May 25, 2018 6:57:13 PM GMT+02:00, Jeff Law <law@redhat.com> wrote:
> >>On 05/25/2018 03:49 AM, Bin.Cheng wrote:
> >>> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
> >>> <prathamesh.kulkarni@linaro.org> wrote:
> >>>> On 23 May 2018 at 18:37, Jeff Law <law@redhat.com> wrote:
> >>>>> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
> >>>>>> On 23 May 2018 at 13:58, Richard Biener <rguenther@suse.de> wrote:
> >>>>>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>> I am trying to work on PR80155, which exposes a problem with
> >>code
> >>>>>>>> hoisting and register pressure on a leading embedded benchmark
> >>for ARM
> >>>>>>>> cortex-m7, where code-hoisting causes an extra register spill.
> >>>>>>>>
> >>>>>>>> I have attached two test-cases which (hopefully) are
> >>representative of
> >>>>>>>> the original test-case.
> >>>>>>>> The first one (trans_dfa.c) is bigger and somewhat similar to
> >>the
> >>>>>>>> original test-case and trans_dfa_2.c is hand-reduced version of
> >>>>>>>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
> >>>>>>>> and one spill with trans_dfa_2.c due to lesser amount of cases.
> >>>>>>>> The test-cases in the PR are probably not relevant.
> >>>>>>>>
> >>>>>>>> Initially I thought the spill was happening because of "too many
> >>>>>>>> hoistings" taking place in original test-case thus increasing
> >>the
> >>>>>>>> register pressure, but it seems the spill is possibly caused
> >>because
> >>>>>>>> expression gets hoisted out of a block that is on loop exit.
> >>>>>>>>
> >>>>>>>> For example, the following hoistings take place with
> >>trans_dfa_2.c:
> >>>>>>>>
> >>>>>>>> (1) Inserting expression in block 4 for code hoisting:
> >>>>>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
> >>>>>>>>
> >>>>>>>> (2) Inserting expression in block 4 for code hoisting:
> >>{plus_expr,_4,1} (0006)
> >>>>>>>>
> >>>>>>>> (3) Inserting expression in block 4 for code hoisting:
> >>>>>>>> {pointer_plus_expr,s_33,1} (0023)
> >>>>>>>>
> >>>>>>>> (4) Inserting expression in block 3 for code hoisting:
> >>>>>>>> {pointer_plus_expr,s_33,1} (0023)
> >>>>>>>>
> >>>>>>>> The issue seems to be hoisting of (*tab + 1) which consists of
> >>first
> >>>>>>>> two hoistings in block 4
> >>>>>>>> from blocks 5 and 9, which causes the extra spill. I verified
> >>that by
> >>>>>>>> disabling hoisting into block 4,
> >>>>>>>> which resulted in no extra spills.
> >>>>>>>>
> >>>>>>>> I wonder if that's because the expression (*tab + 1) is getting
> >>>>>>>> hoisted from blocks 5 and 9,
> >>>>>>>> which are on loop exit ? So the expression that was previously
> >>>>>>>> computed in a block on loop exit, gets hoisted outside that
> >>block
> >>>>>>>> which possibly makes the allocator more defensive ? Similarly
> >>>>>>>> disabling hoisting of expressions which appeared in blocks on
> >>loop
> >>>>>>>> exit in original test-case prevented the extra spill. The other
> >>>>>>>> hoistings didn't seem to matter.
> >>>>>>>
> >>>>>>> I think that's simply co-incidence.  The only thing that makes
> >>>>>>> a block that also exits from the loop special is that an
> >>>>>>> expression could be sunk out of the loop and hoisting (commoning
> >>>>>>> with another path) could prevent that.  But that isn't what is
> >>>>>>> happening here and it would be a pass ordering issue as
> >>>>>>> the sinking pass runs only after hoisting (no idea why exactly
> >>>>>>> but I guess there are cases where we want to prefer CSE over
> >>>>>>> sinking).  So you could try if re-ordering PRE and sinking helps
> >>>>>>> your testcase.
> >>>>>> Thanks for the suggestions. Placing sink pass before PRE works
> >>>>>> for both these test-cases! Sadly it still causes the spill for the
> >>benchmark -:(
> >>>>>> I will try to create a better approximation of the original
> >>test-case.
> >>>>>>>
> >>>>>>> What I do see is a missed opportunity to merge the successors
> >>>>>>> of BB 4.  After PRE we have
> >>>>>>>
> >>>>>>> <bb 4> [local count: 159303558]:
> >>>>>>> <L1>:
> >>>>>>> pretmp_123 = *tab_37(D);
> >>>>>>> _87 = pretmp_123 + 1;
> >>>>>>> if (c_36 == 65)
> >>>>>>>   goto <bb 5>; [34.00%]
> >>>>>>> else
> >>>>>>>   goto <bb 8>; [66.00%]
> >>>>>>>
> >>>>>>> <bb 5> [local count: 54163210]:
> >>>>>>> *tab_37(D) = _87;
> >>>>>>> _96 = MEM[(char *)s_57 + 1B];
> >>>>>>> if (_96 != 0)
> >>>>>>>   goto <bb 7>; [89.00%]
> >>>>>>> else
> >>>>>>>   goto <bb 6>; [11.00%]
> >>>>>>>
> >>>>>>> <bb 8> [local count: 105140348]:
> >>>>>>> *tab_37(D) = _87;
> >>>>>>> _56 = MEM[(char *)s_57 + 1B];
> >>>>>>> if (_56 != 0)
> >>>>>>>   goto <bb 10>; [89.00%]
> >>>>>>> else
> >>>>>>>   goto <bb 9>; [11.00%]
> >>>>>>>
> >>>>>>> here at least the stores and loads can be hoisted.  Note this
> >>>>>>> may also point at the real issue of the code hoisting which is
> >>>>>>> tearing apart the RMW operation?
> >>>>>> Indeed, this possibility seems much more likely than block being
> >>on loop exit.
> >>>>>> I will try to "hardcode" the load/store hoists into block 4 for
> >>this
> >>>>>> specific test-case to check
> >>>>>> if that prevents the spill.
> >>>>> Even if it prevents the spill in this case, it's likely a good
> >>thing to
> >>>>> do.  The statements prior to the conditional in bb5 and bb8 should
> >>be
> >>>>> hoisted, leaving bb5 and bb8 with just their conditionals.
> >>>> Hi,
> >>>> It seems disabling forwprop somehow works for causing no extra
> >>spills
> >>>> on the original test-case.
> >>>>
> >>>> For instance,
> >>>> Hoisting without forwprop:
> >>>>
> >>>> bb 3:
> >>>> _1 = tab_1(D) + 8;
> >>>> pretmp_268 = MEM[tab_1(D) + 8B];
> >>>> _2 = pretmp_268 + 1;
> >>>> goto <bb 4> or <bb 5>
> >>>>
> >>>> bb 4:
> >>>> *_1 = _2;
> >>>>
> >>>> bb 5:
> >>>> *_1 = _2;
> >>>>
> >>>> Hoisting with forwprop:
> >>>>
> >>>> bb 3:
> >>>> pretmp_164 = MEM[tab_1(D) + 8B];
> >>>> _2 = pretmp_164 + 1;
> >>>> goto <bb 4> or <bb 5>
> >>>>
> >>>> bb 4:
> >>>> MEM[tab_1(D) + 8] = _2;
> >>>>
> >>>> bb 5:
> >>>> MEM[tab_1(D) + 8] = _2;
> >>>>
> >>>> Although in both cases we aren't hoisting stores, the issue with
> >>>> forwprop for this case seems to be the folding of
> >>>> *_1 = _2
> >>>> into
> >>>> MEM[tab_1(D) + 8] = _2  ?
> >>>
> >>> This isn't an issue, right?  IIUC, tab_1(D) is used all over the
> >>> loop, thus propagating (tab_1(D) + 8) in place of _1 actually removes
> >>> one live range.
> >>>
> >>>>
> >>>> Disabling folding to mem_ref[base + offset] in forwprop "works" in
> >>>> the sense that it creates the same set of hoistings as without
> >>>> forwprop; however, it still results in additional spills (albeit in
> >>>> different registers).
> >>>>
> >>>> That's because forwprop seems to be increasing the live range of
> >>>> prephitmp_217 by substituting _221 + 1 with prephitmp_217 + 2
> >>>> (_221 is defined as prephitmp_217 + 1).
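Schematically, that substitution looks like this (only the substitution
itself is from the dump; the _230 name and the elided statements are
made up for illustration):

  /* Before forwprop: prephitmp_217 dies at the definition of _221.  */
  _221 = prephitmp_217 + 1;
  ...
  _230 = _221 + 1;

  /* After forwprop: prephitmp_217 is kept live all the way down to the
     rewritten use, while _221 may still be live for its other uses --
     two overlapping live ranges instead of one.  */
  _221 = prephitmp_217 + 1;
  ...
  _230 = prephitmp_217 + 2;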
> >>> Hmm, it's hard to discuss private benchmarks; I'm not sure in which
> >>> dump I should find the prephitmp_221/prephitmp_217 stuff.
> >>>
> >>>> On the other hand, Bin pointed out to me in private that forwprop
> >>>> also helps to restrict register pressure by propagating
> >>>> "tab + const_int" for the same test-case.
> >>>>
> >>>> So I am not really sure if there's an easier fix than having
> >>>> heuristics for estimating register pressure at TREE level?  I would
> >>>> be
> >>> An easy fix, maybe not.  OTOH, I am more convinced that passes like
> >>> forwprop/sink/hoisting can be improved by taking live ranges into
> >>> consideration -- specifically, to guide such passes when they move
> >>> code across basic blocks, because inter-block register pressure is
> >>> hard to resolve afterwards.
> >>>
> >>> As suggested by Jeff and Richi, I guess the first step would be doing
> >>> experiments, collecting more benchmark data for reordering sink
> >>> before PRE?  It enables code sinking as well as decreasing register
> >>> pressure in the original reduced cases, IIRC.
> >>We might even consider re-evaluating Bernd's work on what is
> >>effectively a gimple scheduler to minimize register pressure.
> Could you please point me to Bernd's work?  Does it schedule across
> different basic blocks to minimize register pressure?  As in this
> case, various optimizers extend live ranges into different basic blocks.
> I once prototyped a single-block gimple scheduler; it isn't very
> useful IMHO.

Possibly https://gcc.gnu.org/ml/gcc-patches/2017-05/msg01058.html.

Quickly re-skimming the patch, it seems to work on the set of TER-able
stmts, scheduling them and, if scheduled, removing them from the TER-able
set (because TER would undo the scheduling later).  So it's partly a
register-pressure-driven cost model for TER and partly scheduling
on GIMPLE.

Of course it suffers from the issue I mentioned - we rely on TER
(too much) to do combine-like work during RTL expansion.

> It will be a huge amount of work to take register pressure into
> consideration in the various optimizers.  OTOH, having a single
> inter-block live range shrinking pass before expanding would be great,
> but I think it would be at least equally difficult.

Yes, I've always wanted to do some GIMPLE-level scheduling before
expanding.  And yes, it's equally difficult because it means we have
to do, on GIMPLE, all the instruction selection that TER enables - which
is of course also a good thing.  And yes, the difficulty is to actually
see all the expand-time benefits we get from TER - most of them
are _not_ places that look at def stmts during expansion but
those that benefit from being fed "complex" RTL when expanding
a stmt's operands.
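A hypothetical C example of that last point (not from the PR; the
function and variable names are made up):

    /* The single-use, same-block temporary t is substituted back into
       its use by TER during out-of-SSA, so RTL expansion sees
       a[i] = b[i] * c + d as one tree and can, on some targets, pick a
       better addressing mode or a fused operation right at expand time
       instead of leaving all of that to later RTL passes.  */
    void
    f (int *a, const int *b, int c, int d, int n)
    {
      for (int i = 0; i < n; i++)
        {
          int t = b[i] * c;   /* single use, same block: TER-able */
          a[i] = t + d;
        }
    }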

> > Sure.  The main issue I see here is with the interaction with TER,
> > which we unfortunately still rely on.  Enough GIMPLE instruction
> > selection might help to get rid of the remaining pieces...
> 
> If the scheduler is designed to focus on inter-block live range
> shrinking, it won't interfere with TER, which only replaces a single
> use within each basic block.

True.
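For concreteness, the two situations differ roughly like this
(schematic GIMPLE with made-up SSA names):

  /* TER-able: _5 is defined and used exactly once, in the same block,
     so out-of-SSA can substitute it into its use before expansion.  */
  _5 = a_2 * b_3;
  c_6 = _5 + d_4;

  /* Not TER-able: _7 is live across a block boundary.  This is the
     kind of live range an inter-block scheduler would try to shrink,
     e.g. by moving the definition closer to its use.  */
  <bb 2> :
  _7 = a_2 * b_3;
  if (x_8 > 0)
    goto <bb 3>;
  else
    goto <bb 4>;

  <bb 3> :
  c_9 = _7 + d_4;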

Richard.


end of thread

Thread overview: 14+ messages
2018-05-23  7:27 PR80155: Code hoisting and register pressure Prathamesh Kulkarni
2018-05-23  8:29 ` Richard Biener
2018-05-23  9:20   ` Prathamesh Kulkarni
2018-05-23 13:07     ` Jeff Law
2018-05-25  9:23       ` Prathamesh Kulkarni
2018-05-25  9:49         ` Bin.Cheng
2018-05-25  9:58           ` Richard Biener
2018-05-25 16:57           ` Jeff Law
2018-05-25 17:54             ` Richard Biener
2018-05-25 19:26               ` Jeff Law
2018-05-26  6:10                 ` Richard Biener
2018-05-26 18:07               ` Bin.Cheng
2018-05-28 15:22                 ` Richard Biener
2018-05-23  9:40   ` Bin.Cheng
