public inbox for gcc-patches@gcc.gnu.org
* [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
@ 2013-08-01 16:32 Teresa Johnson
  2013-08-02 11:22 ` Bernhard Reutner-Fischer
  0 siblings, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-08-01 16:32 UTC (permalink / raw)
  To: gcc-patches; +Cc: Steven Bosscher, Jan Hubicka, Jeff Law


Patch 3 of 3 split out from the patch I sent in May that fixes problems with
-freorder-blocks-and-partition, with changes/fixes discussed in that thread.

See http://gcc.gnu.org/ml/gcc-patches/2013-05/threads.html#00388 for context.

This patch sanitizes the partitioning to address issues such as edge
weight insanities that sometimes occur due to upstream optimizations,
and ensures that hot blocks are not dominated by cold blocks. This
needs to be resanitized after certain cfg optimizations that may
cause hot blocks previously reached via both hot and cold paths to
only be reached by cold paths.
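
In other words, the invariant being enforced is roughly the following
(an illustrative sketch against the dominator API, not code from the
patch itself; it assumes dominance info is available):

  basic_block bb;
  FOR_EACH_BB (bb)
    if (BB_PARTITION (bb) == BB_HOT_PARTITION)
      {
        /* A hot block must not be immediately dominated by a cold
           block.  Rather than assert, the patch fixes violations:
           before bbro the cold dominator is re-marked hot, and after
           later cfg optimizations the dominated blocks are re-marked
           cold instead.  */
        basic_block dom = get_immediate_dominator (CDI_DOMINATORS, bb);
        gcc_assert (BB_PARTITION (dom) != BB_COLD_PARTITION);
      }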

The verification code in sanitize_dominator_hotness was contributed by
Steven Bosscher.

Bootstrapped and tested on x86-64-unknown-linux-gnu. Also ensured that
a profiledbootstrap passed with -freorder-blocks-and-partition enabled
(and with the dwarf version changed to 2 to work around PR57451).

Ok for trunk?

(I also included the patch as an attachment since my mailer invariably
messes up the formatting in the pasted version.)

Thanks,
Teresa

2013-08-01  Teresa Johnson  <tejohnson@google.com>
            Steven Bosscher  <steven@gcc.gnu.org>

        * cfgrtl.c (fixup_bb_partition): New routine.
        (commit_edge_insertions): Invoke fixup_partitions.
        (find_partition_fixes): New routine.
        (fixup_partitions): Ditto.
        (verify_hot_cold_block_grouping): Update comments.
        (rtl_verify_edges): Invoke find_partition_fixes.
        (rtl_verify_bb_pointers): Update comments.
        (rtl_verify_bb_layout): Ditto.
        * basic-block.h (fixup_partitions): Declare.
        * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
        * bb-reorder.c (sanitize_dominator_hotness): New function.
        (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
        sanitize_dominator_hotness.

Index: cfgrtl.c
===================================================================
--- cfgrtl.c (revision 201281)
+++ cfgrtl.c (working copy)
@@ -1341,6 +1341,34 @@ fixup_partition_crossing (edge e)
     }
 }

+/* Called when block BB has been reassigned to a different partition,
+   to ensure that the region crossing attributes are updated.  */
+
+static void
+fixup_bb_partition (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  /* Now need to make bb's pred edges non-region crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      fixup_partition_crossing (e);
+    }
+
+  /* Possibly need to make bb's successor edges region crossing,
+     or remove stale region crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    {
+      if ((e->flags & EDGE_FALLTHRU)
+          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
+          && e->dest != EXIT_BLOCK_PTR)
+        force_nonfallthru (e);
+      else
+        fixup_partition_crossing (e);
+    }
+}
+
 /* Attempt to change code to redirect edge E to TARGET.  Don't do that on
    expense of adding new instructions or reordering basic blocks.

@@ -1979,6 +2007,14 @@ commit_edge_insertions (void)
 {
   basic_block bb;

+  /* Optimization passes that invoke this routine can cause hot blocks
+     previously reached by both hot and cold blocks to become dominated only
+     by cold blocks. This will cause the verification below to fail,
+     and lead to now cold code in the hot section. In some cases this
+     may only be visible after newly unreachable blocks are deleted,
+     which will be done by fixup_partitions.  */
+  fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
   verify_flow_info ();
 #endif
@@ -2173,6 +2209,101 @@ get_last_bb_insn (basic_block bb)
   return end;
 }

+/* Sanity check partition hotness to ensure that basic blocks in
+   the cold partition don't dominate basic blocks in the hot partition.
+   If FLAG_ONLY is true, report violations as errors. Otherwise
+   re-mark the dominated blocks as cold, since this is run after
+   cfg optimizations that may make hot blocks previously reached
+   by both hot and cold blocks now only reachable along cold paths.  */
+
+vec<basic_block>
+find_partition_fixes (bool flag_only)
+{
+  basic_block bb;
+  vec<basic_block> bbs_in_cold_partition = vNULL;
+  vec<basic_block> bbs_to_fix = vNULL;
+
+  if (!crtl->has_bb_partition)
+    return vNULL;
+
+  FOR_EACH_BB (bb)
+    if ((BB_PARTITION (bb) == BB_COLD_PARTITION))
+      bbs_in_cold_partition.safe_push (bb);
+
+  if (bbs_in_cold_partition.is_empty ())
+    return vNULL;
+
+  bool dom_calculated_here = !dom_info_available_p (CDI_DOMINATORS);
+
+  if (dom_calculated_here)
+    calculate_dominance_info (CDI_DOMINATORS);
+
+  while (! bbs_in_cold_partition.is_empty  ())
+    {
+      bb = bbs_in_cold_partition.pop ();
+      /* Any blocks dominated by a block in the cold section
+         must also be cold.  */
+      basic_block son;
+      for (son = first_dom_son (CDI_DOMINATORS, bb);
+           son;
+           son = next_dom_son (CDI_DOMINATORS, son))
+        {
+          /* If son is not yet cold, then mark it cold here and
+             enqueue it for further processing.  */
+          if ((BB_PARTITION (son) != BB_COLD_PARTITION))
+            {
+              if (flag_only)
+                error ("non-cold basic block %d dominated "
+                       "by a block in the cold partition", son->index);
+              else
+                BB_SET_PARTITION (son, BB_COLD_PARTITION);
+              bbs_to_fix.safe_push (son);
+              bbs_in_cold_partition.safe_push (son);
+            }
+        }
+    }
+
+  if (dom_calculated_here)
+    free_dominance_info (CDI_DOMINATORS);
+
+  return bbs_to_fix;
+}
+
+/* Perform cleanup on the hot/cold bb partitioning after optimization
+   passes that modify the cfg.  */
+
+void
+fixup_partitions (void)
+{
+  basic_block bb;
+
+  if (!crtl->has_bb_partition)
+    return;
+
+  /* Delete any blocks that became unreachable and weren't
+     already cleaned up, for example during edge forwarding
+     and convert_jumps_to_returns. This will expose more
+     opportunities for fixing the partition boundaries here.
+     Also, the calculation of the dominance graph during verification
+     will assert if there are unreachable nodes.  */
+  delete_unreachable_blocks ();
+
+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.
+     Fixup any that now violate this requirement, as a result of edge
+     forwarding and unreachable block deletion.  */
+  vec<basic_block> bbs_to_fix = find_partition_fixes (false);
+
+  /* Do the partition fixup after all necessary blocks have been converted to
+     cold, so that we only update the region crossings the minimum number of
+     places, which can require forcing edges to be non fallthru.  */
+  while (! bbs_to_fix.is_empty ())
+    {
+      bb = bbs_to_fix.pop ();
+      fixup_bb_partition (bb);
+    }
+}
+
 /* Verify, in the basic block chain, that there is at most one switch
    between hot/cold partitions. This condition will not be true until
    after reorder_basic_blocks is called.  */
@@ -2219,7 +2350,8 @@ verify_hot_cold_block_grouping (void)
 /* Perform several checks on the edges out of each block, such as
    the consistency of the branch probabilities, the correctness
    of hot/cold partition crossing edges, and the number of expected
-   successor edges.  */
+   successor edges.  Also verify that the dominance relationship
+   between hot/cold blocks is sane.  */

 static int
 rtl_verify_edges (void)
@@ -2382,6 +2514,14 @@ rtl_verify_edges (void)
  }
     }

+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.  */
+  if (crtl->has_bb_partition && !err)
+    {
+      vec<basic_block> bbs_to_fix = find_partition_fixes (true);
+      err = !bbs_to_fix.is_empty ();
+    }
+
   /* Clean up.  */
   return err;
 }
@@ -2515,7 +2655,7 @@ rtl_verify_bb_pointers (void)
      and NOTE_INSN_BASIC_BLOCK
    - verify that no fall_thru edge crosses hot/cold partition boundaries
    - verify that there are no pending RTL branch predictions
-   - verify that there is a single hot/cold partition boundary after bbro
+   - verify that hot blocks are not dominated by cold blocks

    In future it can be extended check a lot of other stuff as well
    (reachability of basic blocks, life information, etc. etc.).  */
@@ -2761,7 +2901,8 @@ rtl_verify_bb_layout (void)
    - check that all insns are in the basic blocks
      (except the switch handling code, barriers and notes)
    - check that all returns are followed by barriers
-   - check that all fallthru edge points to the adjacent blocks.  */
+   - check that all fallthru edge points to the adjacent blocks
+   - verify that there is a single hot/cold partition boundary after bbro  */

 static int
 rtl_verify_flow_info (void)
Index: basic-block.h
===================================================================
--- basic-block.h (revision 201281)
+++ basic-block.h (working copy)
@@ -797,6 +797,7 @@ extern bool contains_no_active_insn_p (const_basic
 extern bool forwarder_block_p (const_basic_block);
 extern bool can_fallthru (basic_block, basic_block);
 extern void emit_barrier_after_bb (basic_block bb);
+extern void fixup_partitions (void);

 /* In cfgbuild.c.  */
 extern void find_many_sub_basic_blocks (sbitmap);
Index: cfgcleanup.c
===================================================================
--- cfgcleanup.c (revision 201281)
+++ cfgcleanup.c (working copy)
@@ -2807,10 +2807,21 @@ try_optimize_cfg (int mode)
       df_analyze ();
     }

+  if (changed)
+            {
+              /* Edge forwarding in particular can cause hot blocks previously
+                 reached by both hot and cold blocks to become dominated only
+                 by cold blocks. This will cause the verification below to fail,
+                 and lead to now cold code in the hot section. This is not easy
+                 to detect and fix during edge forwarding, and in some cases
+                 is only visible after newly unreachable blocks are deleted,
+                 which will be done in fixup_partitions.  */
+              fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
-  if (changed)
-    verify_flow_info ();
+              verify_flow_info ();
 #endif
+            }

   changed_overall |= changed;
   first_pass = false;
Index: bb-reorder.c
===================================================================
--- bb-reorder.c (revision 201281)
+++ bb-reorder.c (working copy)
@@ -1444,6 +1444,55 @@ fix_up_crossing_landing_pad (eh_landing_pad old_lp
       ei_next (&ei);
 }

+
+/* Ensure that no cold bbs dominate hot bbs along the dominance or
+   post-dominance DIR, for example as a result of edge weight insanities.
+   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
+   to BBS_IN_HOT_PARTITION.  */
+
+static unsigned int
+sanitize_dominator_hotness (enum cdi_direction dir, unsigned int cold_bb_count,
+                            vec<basic_block> *bbs_in_hot_partition)
+{
+  if (!cold_bb_count)
+    return 0;
+
+  bool dom_calculated_here = !dom_info_available_p (dir);
+
+  if (dom_calculated_here)
+    calculate_dominance_info (dir);
+
+  /* Keep examining hot bbs until we have either checked them all, or
+     re-marked all cold bbs as hot.  */
+  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
+  while (! hot_bbs_to_check.is_empty ()
+         && cold_bb_count)
+    {
+      basic_block bb = hot_bbs_to_check.pop ();
+      basic_block dom_bb = get_immediate_dominator (dir, bb);
+
+      /* If bb's immediate dominator is also hot then it is ok.  */
+      if (BB_PARTITION (dom_bb) != BB_COLD_PARTITION)
+        continue;
+
+      /* We have a hot bb with an immediate dominator that is cold.
+         The dominator needs to be re-marked hot.  */
+      BB_SET_PARTITION (dom_bb, BB_HOT_PARTITION);
+      cold_bb_count--;
+
+      /* Now we need to examine newly-hot dom_bb to see if it is also
+         dominated by a cold bb.  */
+      bbs_in_hot_partition->safe_push (dom_bb);
+      hot_bbs_to_check.safe_push (dom_bb);
+    }
+
+  if (dom_calculated_here)
+    free_dominance_info (dir);
+
+  return cold_bb_count;
+}
+
+
 /* Find the basic blocks that are rarely executed and need to be moved to
    a separate section of the .o file (to cut down on paging and improve
    cache locality).  Return a vector of all edges that cross.  */
@@ -1455,16 +1504,42 @@ find_rarely_executed_basic_blocks_and_crossing_edg
   basic_block bb;
   edge e;
   edge_iterator ei;
+  unsigned int cold_bb_count = 0;
+  vec<basic_block> bbs_in_hot_partition = vNULL;

   /* Mark which partition (hot/cold) each basic block belongs in.  */
   FOR_EACH_BB (bb)
     {
       if (probably_never_executed_bb_p (cfun, bb))
- BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+          cold_bb_count++;
+        }
       else
- BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+          bbs_in_hot_partition.safe_push (bb);
+        }
     }

+  /* Ensure that no cold bbs dominate hot bbs. This could happen as a result of
+     several different possibilities. One is that there are edge weight insanities
+     due to optimization phases that do not properly update basic block profile
+     counts. The second is that the entry of the function may not be hot, because
+     it is entered fewer times than the number of profile training runs, but there
+     is a loop inside the function that causes blocks within the function to be
+     above the threshold for hotness. Then do the same along the post-dominator
+     tree (which could have additional changes required after fixing up
+     dominators).  */
+  if (cold_bb_count)
+    cold_bb_count = sanitize_dominator_hotness (CDI_DOMINATORS,
+                                                cold_bb_count,
+                                                &bbs_in_hot_partition);
+  if (cold_bb_count)
+    cold_bb_count = sanitize_dominator_hotness (CDI_POST_DOMINATORS,
+                                                cold_bb_count,
+                                                &bbs_in_hot_partition);
+
   /* The format of .gcc_except_table does not allow landing pads to
      be in a different partition as the throw.  Fix this by either
      moving or duplicating the landing pads.  */


-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413


* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-01 16:32 [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition Teresa Johnson
@ 2013-08-02 11:22 ` Bernhard Reutner-Fischer
  2013-08-02 14:51   ` Teresa Johnson
  0 siblings, 1 reply; 62+ messages in thread
From: Bernhard Reutner-Fischer @ 2013-08-02 11:22 UTC (permalink / raw)
  To: Teresa Johnson; +Cc: gcc-patches, Steven Bosscher, Jan Hubicka, Jeff Law

On 1 August 2013 18:32, Teresa Johnson <tejohnson@google.com> wrote:
> Patch 3 of 3 split out from the patch I sent in May that fixes problems with
> -freorder-blocks-and-partition, with changes/fixes discussed in that thread.
>
> See http://gcc.gnu.org/ml/gcc-patches/2013-05/threads.html#00388 for context.
>
> This patch sanitizes the partitioning to address issues such as edge
> weight insanities that sometimes occur due to upstream optimizations,
> and ensures that hot blocks are not dominated by cold blocks. This
> needs to be resanitized after certain cfg optimizations that may
> cause hot blocks previously reached via both hot and cold paths to
> only be reached by cold paths.
>
> The verification code in sanitize_dominator_hotness was contributed by
> Steven Bosscher.
>
> Bootstrapped and tested on x86-64-unknown-linux-gnu. Also ensured that
> a profiledbootstrap passed with -freorder-blocks-and-partition enabled
> (and with the dwarf version changed to 2 to work around PR57451).
>
> Ok for trunk?
>
> (I also included the patch as an attachment since my mailer invariably
> messes up the formatting in the pasted version.)
>
> Thanks,
> Teresa
>
> 2013-08-01  Teresa Johnson  <tejohnson@google.com>
>             Steven Bosscher  <steven@gcc.gnu.org>
>
>         * cfgrtl.c (fixup_bb_partition): New routine.
>         (commit_edge_insertions): Invoke fixup_partitions.
>         (find_partition_fixes): New routine.
>         (fixup_partitions): Ditto.
>         (verify_hot_cold_block_grouping): Update comments.
>         (rtl_verify_edges): Invoke find_partition_fixes.
>         (rtl_verify_bb_pointers): Update comments.
>         (rtl_verify_bb_layout): Ditto.
>         * basic-block.h (fixup_partitions): Declare.
>         * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
>         * bb-reorder.c (sanitize_dominator_hotness): New function.
>         (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
>         sanitize_dominator_hotness.
>
> Index: cfgrtl.c
> ===================================================================
> --- cfgrtl.c (revision 201281)
> +++ cfgrtl.c (working copy)
> @@ -1341,6 +1341,34 @@ fixup_partition_crossing (edge e)
>      }
>  }
>
> +/* Called when block BB has been reassigned to a different partition,
> +   to ensure that the region crossing attributes are updated.  */
> +
> +static void
> +fixup_bb_partition (basic_block bb)
> +{
> +  edge e;
> +  edge_iterator ei;
> +
> +  /* Now need to make bb's pred edges non-region crossing.  */
> +  FOR_EACH_EDGE (e, ei, bb->preds)
> +    {
> +      fixup_partition_crossing (e);
> +    }
> +
> +  /* Possibly need to make bb's successor edges region crossing,
> +     or remove stale region crossing.  */
> +  FOR_EACH_EDGE (e, ei, bb->succs)
> +    {
> +      if ((e->flags & EDGE_FALLTHRU)
> +          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
> +          && e->dest != EXIT_BLOCK_PTR)
> +        force_nonfallthru (e);
> +      else
> +        fixup_partition_crossing (e);
> +    }
> +}
> +
>  /* Attempt to change code to redirect edge E to TARGET.  Don't do that on
>     expense of adding new instructions or reordering basic blocks.
>
> @@ -1979,6 +2007,14 @@ commit_edge_insertions (void)
>  {
>    basic_block bb;
>
> +  /* Optimization passes that invoke this routine can cause hot blocks
> +     previously reached by both hot and cold blocks to become dominated only
> +     by cold blocks. This will cause the verification below to fail,
> +     and lead to now cold code in the hot section. In some cases this
> +     may only be visible after newly unreachable blocks are deleted,
> +     which will be done by fixup_partitions.  */
> +  fixup_partitions ();
> +
>  #ifdef ENABLE_CHECKING
>    verify_flow_info ();
>  #endif
> @@ -2173,6 +2209,101 @@ get_last_bb_insn (basic_block bb)
>    return end;
>  }
>
> +/* Sanity check partition hotness to ensure that basic blocks in
> +   the cold partition don't dominate basic blocks in the hot partition.
> +   If FLAG_ONLY is true, report violations as errors. Otherwise
> +   re-mark the dominated blocks as cold, since this is run after
> +   cfg optimizations that may make hot blocks previously reached
> +   by both hot and cold blocks now only reachable along cold paths.  */
> +
> +vec<basic_block>
> +find_partition_fixes (bool flag_only)
> +{
> +  basic_block bb;
> +  vec<basic_block> bbs_in_cold_partition = vNULL;
> +  vec<basic_block> bbs_to_fix = vNULL;
> +
> +  if (!crtl->has_bb_partition)
> +    return vNULL;

I'd push this early return into the callers instead, at most turn it into a
gcc_checking_assert to be safe.

Both callers, fixup_partitions and rtl_verify_edges, look at
crtl->has_bb_partition already before calling this, so the above should
be dead already.
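
I.e. replace the early return with something like:

  /* Callers have already checked this.  */
  gcc_checking_assert (crtl->has_bb_partition);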

Did my mailer somehow swallow the static from find_partition_fixes?

> +
> +  FOR_EACH_BB (bb)
> +    if ((BB_PARTITION (bb) == BB_COLD_PARTITION))
> +      bbs_in_cold_partition.safe_push (bb);
> +
> +  if (bbs_in_cold_partition.is_empty ())
> +    return vNULL;
> +
> +  bool dom_calculated_here = !dom_info_available_p (CDI_DOMINATORS);
> +
> +  if (dom_calculated_here)
> +    calculate_dominance_info (CDI_DOMINATORS);
> +
> +  while (! bbs_in_cold_partition.is_empty  ())
> +    {
> +      bb = bbs_in_cold_partition.pop ();
> +      /* Any blocks dominated by a block in the cold section
> +         must also be cold.  */
> +      basic_block son;
> +      for (son = first_dom_son (CDI_DOMINATORS, bb);
> +           son;
> +           son = next_dom_son (CDI_DOMINATORS, son))
> +        {
> +          /* If son is not yet cold, then mark it cold here and
> +             enqueue it for further processing.  */
> +          if ((BB_PARTITION (son) != BB_COLD_PARTITION))
> +            {
> +              if (flag_only)
> +                error ("non-cold basic block %d dominated "
> +                       "by a block in the cold partition", son->index);
> +              else
> +                BB_SET_PARTITION (son, BB_COLD_PARTITION);
> +              bbs_to_fix.safe_push (son);
> +              bbs_in_cold_partition.safe_push (son);
> +            }
> +        }
> +    }
> +
> +  if (dom_calculated_here)
> +    free_dominance_info (CDI_DOMINATORS);
> +
> +  return bbs_to_fix;
> +}
> +
> +/* Perform cleanup on the hot/cold bb partitioning after optimization
> +   passes that modify the cfg.  */
> +
> +void
> +fixup_partitions (void)
> +{
> +  basic_block bb;
> +
> +  if (!crtl->has_bb_partition)
> +    return;
> +
> +  /* Delete any blocks that became unreachable and weren't
> +     already cleaned up, for example during edge forwarding
> +     and convert_jumps_to_returns. This will expose more
> +     opportunities for fixing the partition boundaries here.
> +     Also, the calculation of the dominance graph during verification
> +     will assert if there are unreachable nodes.  */
> +  delete_unreachable_blocks ();
> +
> +  /* If there are partitions, do a sanity check on them: A basic block in
> +     a cold partition cannot dominate a basic block in a hot partition.
> +     Fixup any that now violate this requirement, as a result of edge
> +     forwarding and unreachable block deletion.  */
> +  vec<basic_block> bbs_to_fix = find_partition_fixes (false);
> +
> +  /* Do the partition fixup after all necessary blocks have been converted to
> +     cold, so that we only update the region crossings the minimum number of
> +     places, which can require forcing edges to be non fallthru.  */
> +  while (! bbs_to_fix.is_empty ())
> +    {
> +      bb = bbs_to_fix.pop ();
> +      fixup_bb_partition (bb);
> +    }
> +}
> +
>  /* Verify, in the basic block chain, that there is at most one switch
>     between hot/cold partitions. This condition will not be true until
>     after reorder_basic_blocks is called.  */
> @@ -2219,7 +2350,8 @@ verify_hot_cold_block_grouping (void)
>  /* Perform several checks on the edges out of each block, such as
>     the consistency of the branch probabilities, the correctness
>     of hot/cold partition crossing edges, and the number of expected
> -   successor edges.  */
> +   successor edges.  Also verify that the dominance relationship
> +   between hot/cold blocks is sane.  */
>
>  static int
>  rtl_verify_edges (void)
> @@ -2382,6 +2514,14 @@ rtl_verify_edges (void)
>   }
>      }
>
> +  /* If there are partitions, do a sanity check on them: A basic block in
> +     a cold partition cannot dominate a basic block in a hot partition.  */
> +  if (crtl->has_bb_partition && !err)
> +    {
> +      vec<basic_block> bbs_to_fix = find_partition_fixes (true);
> +      err = !bbs_to_fix.is_empty ();
> +    }
> +
>    /* Clean up.  */
>    return err;
>  }
> @@ -2515,7 +2655,7 @@ rtl_verify_bb_pointers (void)
>       and NOTE_INSN_BASIC_BLOCK
>     - verify that no fall_thru edge crosses hot/cold partition boundaries
>     - verify that there are no pending RTL branch predictions
> -   - verify that there is a single hot/cold partition boundary after bbro
> +   - verify that hot blocks are not dominated by cold blocks
>
>     In future it can be extended check a lot of other stuff as well
>     (reachability of basic blocks, life information, etc. etc.).  */
> @@ -2761,7 +2901,8 @@ rtl_verify_bb_layout (void)
>     - check that all insns are in the basic blocks
>       (except the switch handling code, barriers and notes)
>     - check that all returns are followed by barriers
> -   - check that all fallthru edge points to the adjacent blocks.  */
> +   - check that all fallthru edge points to the adjacent blocks
> +   - verify that there is a single hot/cold partition boundary after bbro  */
>
>  static int
>  rtl_verify_flow_info (void)
> Index: basic-block.h
> ===================================================================
> --- basic-block.h (revision 201281)
> +++ basic-block.h (working copy)
> @@ -797,6 +797,7 @@ extern bool contains_no_active_insn_p (const_basic
>  extern bool forwarder_block_p (const_basic_block);
>  extern bool can_fallthru (basic_block, basic_block);
>  extern void emit_barrier_after_bb (basic_block bb);
> +extern void fixup_partitions (void);
>
>  /* In cfgbuild.c.  */
>  extern void find_many_sub_basic_blocks (sbitmap);
> Index: cfgcleanup.c
> ===================================================================
> --- cfgcleanup.c (revision 201281)
> +++ cfgcleanup.c (working copy)
> @@ -2807,10 +2807,21 @@ try_optimize_cfg (int mode)
>        df_analyze ();
>      }
>
> +  if (changed)
> +            {
> +              /* Edge forwarding in particular can cause hot blocks previously
> +                 reached by both hot and cold blocks to become dominated only
> +                 by cold blocks. This will cause the verification below to fail,
> +                 and lead to now cold code in the hot section. This is not easy
> +                 to detect and fix during edge forwarding, and in some cases
> +                 is only visible after newly unreachable blocks are deleted,
> +                 which will be done in fixup_partitions.  */
> +              fixup_partitions ();
> +
>  #ifdef ENABLE_CHECKING
> -  if (changed)
> -    verify_flow_info ();
> +              verify_flow_info ();
>  #endif
> +            }
>
>    changed_overall |= changed;
>    first_pass = false;
> Index: bb-reorder.c
> ===================================================================
> --- bb-reorder.c (revision 201281)
> +++ bb-reorder.c (working copy)
> @@ -1444,6 +1444,55 @@ fix_up_crossing_landing_pad (eh_landing_pad old_lp
>        ei_next (&ei);
>  }
>
> +
> +/* Ensure that no cold bbs dominate hot bbs along the dominance or
> +   post-dominance DIR, for example as a result of edge weight insanities.
> +   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
> +   to BBS_IN_HOT_PARTITION.  */
> +
> +static unsigned int
> +sanitize_dominator_hotness (enum cdi_direction dir, unsigned int cold_bb_count,
> +                            vec<basic_block> *bbs_in_hot_partition)
> +{
> +  if (!cold_bb_count)
> +    return 0;

Same pattern as above. Callers do not invoke us if !cold_bb_count so the above
check is dead code. Again, remove or checking assert?
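
I.e. something like:

  /* Callers only invoke this when there are cold bbs to sanitize.  */
  gcc_checking_assert (cold_bb_count);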
> +
> +  bool dom_calculated_here = !dom_info_available_p (dir);
> +
> +  if (dom_calculated_here)
> +    calculate_dominance_info (dir);
> +
> +  /* Keep examining hot bbs until we have either checked them all, or
> +     re-marked all cold bbs as hot.  */
> +  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
> +  while (! hot_bbs_to_check.is_empty ()
> +         && cold_bb_count)

The comment says "or", which sounds plausible, but the code says "and"?
> +    {
> +      basic_block bb = hot_bbs_to_check.pop ();
> +      basic_block dom_bb = get_immediate_dominator (dir, bb);
> +
> +      /* If bb's immediate dominator is also hot then it is ok.  */
> +      if (BB_PARTITION (dom_bb) != BB_COLD_PARTITION)

Why not follow the comment here and == BB_HOT_PARTITION instead, for clarity?

> +        continue;
> +
> +      /* We have a hot bb with an immediate dominator that is cold.
> +         The dominator needs to be re-marked hot.  */
> +      BB_SET_PARTITION (dom_bb, BB_HOT_PARTITION);
> +      cold_bb_count--;
> +
> +      /* Now we need to examine newly-hot dom_bb to see if it is also
> +         dominated by a cold bb.  */
> +      bbs_in_hot_partition->safe_push (dom_bb);
> +      hot_bbs_to_check.safe_push (dom_bb);
> +    }
> +
> +  if (dom_calculated_here)
> +    free_dominance_info (dir);
> +
> +  return cold_bb_count;
> +}
> +
> +
>  /* Find the basic blocks that are rarely executed and need to be moved to
>     a separate section of the .o file (to cut down on paging and improve
>     cache locality).  Return a vector of all edges that cross.  */
> @@ -1455,16 +1504,42 @@ find_rarely_executed_basic_blocks_and_crossing_edg
>    basic_block bb;
>    edge e;
>    edge_iterator ei;
> +  unsigned int cold_bb_count = 0;
> +  vec<basic_block> bbs_in_hot_partition = vNULL;
>
>    /* Mark which partition (hot/cold) each basic block belongs in.  */
>    FOR_EACH_BB (bb)
>      {
>        if (probably_never_executed_bb_p (cfun, bb))
> - BB_SET_PARTITION (bb, BB_COLD_PARTITION);
> +        {
> +          BB_SET_PARTITION (bb, BB_COLD_PARTITION);
> +          cold_bb_count++;
> +        }
>        else
> - BB_SET_PARTITION (bb, BB_HOT_PARTITION);
> +        {
> +          BB_SET_PARTITION (bb, BB_HOT_PARTITION);
> +          bbs_in_hot_partition.safe_push (bb);
> +        }
>      }
>
> +  /* Ensure that no cold bbs dominate hot bbs. This could happen as a result of
> +     several different possibilities. One is that there are edge weight insanities
> +     due to optimization phases that do not properly update basic block profile
> +     counts. The second is that the entry of the function may not be hot, because
> +     it is entered fewer times than the number of profile training runs, but there
> +     is a loop inside the function that causes blocks within the function to be
> +     above the threshold for hotness. Then do the same along the post-dominator
> +     tree (which could have additional changes required after fixing up
> +     dominators).  */
> +  if (cold_bb_count)
> +    cold_bb_count = sanitize_dominator_hotness (CDI_DOMINATORS,
> +                                                cold_bb_count,
> +                                                &bbs_in_hot_partition);
> +  if (cold_bb_count)
> +    cold_bb_count = sanitize_dominator_hotness (CDI_POST_DOMINATORS,
> +                                                cold_bb_count,
> +                                                &bbs_in_hot_partition);

I take it this last store to cold_bb_count is eliminated anyway.

Thanks,
> +
>    /* The format of .gcc_except_table does not allow landing pads to
>       be in a different partition as the throw.  Fix this by either
>       moving or duplicating the landing pads.  */
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413


* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-02 11:22 ` Bernhard Reutner-Fischer
@ 2013-08-02 14:51   ` Teresa Johnson
  2013-08-02 15:05     ` Jan Hubicka
  0 siblings, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-08-02 14:51 UTC (permalink / raw)
  To: Bernhard Reutner-Fischer
  Cc: gcc-patches, Steven Bosscher, Jan Hubicka, Jeff Law

On Fri, Aug 2, 2013 at 4:22 AM, Bernhard Reutner-Fischer
<rep.dot.nop@gmail.com> wrote:
> On 1 August 2013 18:32, Teresa Johnson <tejohnson@google.com> wrote:
>> Patch 3 of 3 split out from the patch I sent in May that fixes problems with
>> -freorder-blocks-and-partition, with changes/fixes discussed in that thread.
>>
>> See http://gcc.gnu.org/ml/gcc-patches/2013-05/threads.html#00388 for context.
>>
>> This patch sanitizes the partitioning to address issues such as edge
>> weight insanities that sometimes occur due to upstream optimizations,
>> and ensures that hot blocks are not dominated by cold blocks. This
>> needs to be resanitized after certain cfg optimizations that may
>> cause hot blocks previously reached via both hot and cold paths to
>> only be reached by cold paths.
>>
>> The verification code in sanitize_dominator_hotness was contributed by
>> Steven Bosscher.
>>
>> Bootstrapped and tested on x86-64-unknown-linux-gnu. Also ensured that
>> a profiledbootstrap passed with -freorder-blocks-and-partition enabled
>> (and with the dwarf version changed to 2 to work around PR57451).
>>
>> Ok for trunk?
>>
>> (I also included the patch as an attachment since my mailer invariably
>> messes up the formatting in the pasted version.)
>>
>> Thanks,
>> Teresa
>>
>> 2013-08-01  Teresa Johnson  <tejohnson@google.com>
>>             Steven Bosscher  <steven@gcc.gnu.org>
>>
>>         * cfgrtl.c (fixup_bb_partition): New routine.
>>         (commit_edge_insertions): Invoke fixup_partitions.
>>         (find_partition_fixes): New routine.
>>         (fixup_partitions): Ditto.
>>         (verify_hot_cold_block_grouping): Update comments.
>>         (rtl_verify_edges): Invoke find_partition_fixes.
>>         (rtl_verify_bb_pointers): Update comments.
>>         (rtl_verify_bb_layout): Ditto.
>>         * basic-block.h (fixup_partitions): Declare.
>>         * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
>>         * bb-reorder.c (sanitize_dominator_hotness): New function.
>>         (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
>>         sanitize_dominator_hotness.
>>
>> Index: cfgrtl.c
>> ===================================================================
>> --- cfgrtl.c (revision 201281)
>> +++ cfgrtl.c (working copy)
>> @@ -1341,6 +1341,34 @@ fixup_partition_crossing (edge e)
>>      }
>>  }
>>
>> +/* Called when block BB has been reassigned to a different partition,
>> +   to ensure that the region crossing attributes are updated.  */
>> +
>> +static void
>> +fixup_bb_partition (basic_block bb)
>> +{
>> +  edge e;
>> +  edge_iterator ei;
>> +
>> +  /* Now need to make bb's pred edges non-region crossing.  */
>> +  FOR_EACH_EDGE (e, ei, bb->preds)
>> +    {
>> +      fixup_partition_crossing (e);
>> +    }
>> +
>> +  /* Possibly need to make bb's successor edges region crossing,
>> +     or remove stale region crossing.  */
>> +  FOR_EACH_EDGE (e, ei, bb->succs)
>> +    {
>> +      if ((e->flags & EDGE_FALLTHRU)
>> +          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
>> +          && e->dest != EXIT_BLOCK_PTR)
>> +        force_nonfallthru (e);
>> +      else
>> +        fixup_partition_crossing (e);
>> +    }
>> +}
>> +
>>  /* Attempt to change code to redirect edge E to TARGET.  Don't do that on
>>     expense of adding new instructions or reordering basic blocks.
>>
>> @@ -1979,6 +2007,14 @@ commit_edge_insertions (void)
>>  {
>>    basic_block bb;
>>
>> +  /* Optimization passes that invoke this routine can cause hot blocks
>> +     previously reached by both hot and cold blocks to become dominated only
>> +     by cold blocks. This will cause the verification below to fail,
>> +     and lead to now cold code in the hot section. In some cases this
>> +     may only be visible after newly unreachable blocks are deleted,
>> +     which will be done by fixup_partitions.  */
>> +  fixup_partitions ();
>> +
>>  #ifdef ENABLE_CHECKING
>>    verify_flow_info ();
>>  #endif
>> @@ -2173,6 +2209,101 @@ get_last_bb_insn (basic_block bb)
>>    return end;
>>  }
>>
>> +/* Sanity check partition hotness to ensure that basic blocks in
>> +   the cold partition don't dominate basic blocks in the hot partition.
>> +   If FLAG_ONLY is true, report violations as errors. Otherwise
>> +   re-mark the dominated blocks as cold, since this is run after
>> +   cfg optimizations that may make hot blocks previously reached
>> +   by both hot and cold blocks now only reachable along cold paths.  */
>> +
>> +vec<basic_block>
>> +find_partition_fixes (bool flag_only)
>> +{
>> +  basic_block bb;
>> +  vec<basic_block> bbs_in_cold_partition = vNULL;
>> +  vec<basic_block> bbs_to_fix = vNULL;
>> +
>> +  if (!crtl->has_bb_partition)
>> +    return vNULL;
>
> I'd push this early return into the callers instead, at most turn it into a
> gcc_checking_assert to be safe.
>
> Both callers, fixup_partitions and rtl_verify_edges, look at
> crtl->has_bb_partition already before calling this, so the above should
> be dead already.

Right, I was being paranoid - changed to a gcc_checking_assert.

>
> Did my mailer somehow swallow the static from find_partition_fixes?

Oops, I missed that - added the static.
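
I.e. the definition now starts:

  static vec<basic_block>
  find_partition_fixes (bool flag_only)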

>
>> +
>> +  FOR_EACH_BB (bb)
>> +    if ((BB_PARTITION (bb) == BB_COLD_PARTITION))
>> +      bbs_in_cold_partition.safe_push (bb);
>> +
>> +  if (bbs_in_cold_partition.is_empty ())
>> +    return vNULL;
>> +
>> +  bool dom_calculated_here = !dom_info_available_p (CDI_DOMINATORS);
>> +
>> +  if (dom_calculated_here)
>> +    calculate_dominance_info (CDI_DOMINATORS);
>> +
>> +  while (! bbs_in_cold_partition.is_empty  ())
>> +    {
>> +      bb = bbs_in_cold_partition.pop ();
>> +      /* Any blocks dominated by a block in the cold section
>> +         must also be cold.  */
>> +      basic_block son;
>> +      for (son = first_dom_son (CDI_DOMINATORS, bb);
>> +           son;
>> +           son = next_dom_son (CDI_DOMINATORS, son))
>> +        {
>> +          /* If son is not yet cold, then mark it cold here and
>> +             enqueue it for further processing.  */
>> +          if ((BB_PARTITION (son) != BB_COLD_PARTITION))
>> +            {
>> +              if (flag_only)
>> +                error ("non-cold basic block %d dominated "
>> +                       "by a block in the cold partition", son->index);
>> +              else
>> +                BB_SET_PARTITION (son, BB_COLD_PARTITION);
>> +              bbs_to_fix.safe_push (son);
>> +              bbs_in_cold_partition.safe_push (son);
>> +            }
>> +        }
>> +    }
>> +
>> +  if (dom_calculated_here)
>> +    free_dominance_info (CDI_DOMINATORS);
>> +
>> +  return bbs_to_fix;
>> +}
>> +
>> +/* Perform cleanup on the hot/cold bb partitioning after optimization
>> +   passes that modify the cfg.  */
>> +
>> +void
>> +fixup_partitions (void)
>> +{
>> +  basic_block bb;
>> +
>> +  if (!crtl->has_bb_partition)
>> +    return;
>> +
>> +  /* Delete any blocks that became unreachable and weren't
>> +     already cleaned up, for example during edge forwarding
>> +     and convert_jumps_to_returns. This will expose more
>> +     opportunities for fixing the partition boundaries here.
>> +     Also, the calculation of the dominance graph during verification
>> +     will assert if there are unreachable nodes.  */
>> +  delete_unreachable_blocks ();
>> +
>> +  /* If there are partitions, do a sanity check on them: A basic block in
>> +     a cold partition cannot dominate a basic block in a hot partition.
>> +     Fixup any that now violate this requirement, as a result of edge
>> +     forwarding and unreachable block deletion.  */
>> +  vec<basic_block> bbs_to_fix = find_partition_fixes (false);
>> +
>> +  /* Do the partition fixup after all necessary blocks have been converted to
>> +     cold, so that we only update the region crossings the minimum number of
>> +     places, which can require forcing edges to be non fallthru.  */
>> +  while (! bbs_to_fix.is_empty ())
>> +    {
>> +      bb = bbs_to_fix.pop ();
>> +      fixup_bb_partition (bb);
>> +    }
>> +}
>> +
>>  /* Verify, in the basic block chain, that there is at most one switch
>>     between hot/cold partitions. This condition will not be true until
>>     after reorder_basic_blocks is called.  */
>> @@ -2219,7 +2350,8 @@ verify_hot_cold_block_grouping (void)
>>  /* Perform several checks on the edges out of each block, such as
>>     the consistency of the branch probabilities, the correctness
>>     of hot/cold partition crossing edges, and the number of expected
>> -   successor edges.  */
>> +   successor edges.  Also verify that the dominance relationship
>> +   between hot/cold blocks is sane.  */
>>
>>  static int
>>  rtl_verify_edges (void)
>> @@ -2382,6 +2514,14 @@ rtl_verify_edges (void)
>>   }
>>      }
>>
>> +  /* If there are partitions, do a sanity check on them: A basic block in
>> +     a cold partition cannot dominate a basic block in a hot partition.  */
>> +  if (crtl->has_bb_partition && !err)
>> +    {
>> +      vec<basic_block> bbs_to_fix = find_partition_fixes (true);
>> +      err = !bbs_to_fix.is_empty ();
>> +    }
>> +
>>    /* Clean up.  */
>>    return err;
>>  }
>> @@ -2515,7 +2655,7 @@ rtl_verify_bb_pointers (void)
>>       and NOTE_INSN_BASIC_BLOCK
>>     - verify that no fall_thru edge crosses hot/cold partition boundaries
>>     - verify that there are no pending RTL branch predictions
>> -   - verify that there is a single hot/cold partition boundary after bbro
>> +   - verify that hot blocks are not dominated by cold blocks
>>
>>     In future it can be extended check a lot of other stuff as well
>>     (reachability of basic blocks, life information, etc. etc.).  */
>> @@ -2761,7 +2901,8 @@ rtl_verify_bb_layout (void)
>>     - check that all insns are in the basic blocks
>>       (except the switch handling code, barriers and notes)
>>     - check that all returns are followed by barriers
>> -   - check that all fallthru edge points to the adjacent blocks.  */
>> +   - check that all fallthru edge points to the adjacent blocks
>> +   - verify that there is a single hot/cold partition boundary after bbro  */
>>
>>  static int
>>  rtl_verify_flow_info (void)
>> Index: basic-block.h
>> ===================================================================
>> --- basic-block.h (revision 201281)
>> +++ basic-block.h (working copy)
>> @@ -797,6 +797,7 @@ extern bool contains_no_active_insn_p (const_basic
>>  extern bool forwarder_block_p (const_basic_block);
>>  extern bool can_fallthru (basic_block, basic_block);
>>  extern void emit_barrier_after_bb (basic_block bb);
>> +extern void fixup_partitions (void);
>>
>>  /* In cfgbuild.c.  */
>>  extern void find_many_sub_basic_blocks (sbitmap);
>> Index: cfgcleanup.c
>> ===================================================================
>> --- cfgcleanup.c (revision 201281)
>> +++ cfgcleanup.c (working copy)
>> @@ -2807,10 +2807,21 @@ try_optimize_cfg (int mode)
>>        df_analyze ();
>>      }
>>
>> +  if (changed)
>> +            {
>> +              /* Edge forwarding in particular can cause hot blocks previously
>> +                 reached by both hot and cold blocks to become dominated only
>> +                 by cold blocks. This will cause the verification below to fail,
>> +                 and lead to now cold code in the hot section. This is not easy
>> +                 to detect and fix during edge forwarding, and in some cases
>> +                 is only visible after newly unreachable blocks are deleted,
>> +                 which will be done in fixup_partitions.  */
>> +              fixup_partitions ();
>> +
>>  #ifdef ENABLE_CHECKING
>> -  if (changed)
>> -    verify_flow_info ();
>> +              verify_flow_info ();
>>  #endif
>> +            }
>>
>>    changed_overall |= changed;
>>    first_pass = false;
>> Index: bb-reorder.c
>> ===================================================================
>> --- bb-reorder.c (revision 201281)
>> +++ bb-reorder.c (working copy)
>> @@ -1444,6 +1444,55 @@ fix_up_crossing_landing_pad (eh_landing_pad old_lp
>>        ei_next (&ei);
>>  }
>>
>> +
>> +/* Ensure that no cold bbs dominate hot bbs along the dominance or
>> +   post-dominance DIR, for example as a result of edge weight insanities.
>> +   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
>> +   to BBS_IN_HOT_PARTITION.  */
>> +
>> +static unsigned int
>> +sanitize_dominator_hotness (enum cdi_direction dir, unsigned int cold_bb_count,
>> +                            vec<basic_block> *bbs_in_hot_partition)
>> +{
>> +  if (!cold_bb_count)
>> +    return 0;
>
> Same pattern as above. Callers do not invoke us if !cold_bb_count so the above
> check is dead code. Again, remove or checking assert?

Changed to checking assert.

>> +
>> +  bool dom_calculated_here = !dom_info_available_p (dir);
>> +
>> +  if (dom_calculated_here)
>> +    calculate_dominance_info (dir);
>> +
>> +  /* Keep examining hot bbs until we have either checked them all, or
>> +     re-marked all cold bbs as hot.  */
>> +  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
>> +  while (! hot_bbs_to_check.is_empty ()
>> +         && cold_bb_count)
>
> The comment says "or", which sounds plausible, but the code says "and"?

The comment was describing the conditions for exiting the loop, which
is why it used an "or". I changed the comment to describe the
conditions for staying in the loop, to make it clearer and more
consistent with the code:

  /* Keep examining hot bbs while we still have some left to check
     and there are remaining cold bbs.  */

>> +    {
>> +      basic_block bb = hot_bbs_to_check.pop ();
>> +      basic_block dom_bb = get_immediate_dominator (dir, bb);
>> +
>> +      /* If bb's immediate dominator is also hot then it is ok.  */
>> +      if (BB_PARTITION (dom_bb) != BB_COLD_PARTITION)
>
> Why not follow the comment here and use == BB_HOT_PARTITION instead, for clarity?

The entry/exit blocks are unpartitioned, and we want to skip those. I
have clarified that in the comment:

      /* If bb's immediate dominator is also hot (or unpartitioned,
         e.g. the entry block) then it is ok. If it is cold, it
         needs to be adjusted.  */
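
To make the distinction concrete, here is a standalone sketch (the
toy_* names are made up for illustration; the real encoding is the
partition flags in basic-block.h):

  /* Toy model: the entry and exit blocks are unpartitioned, so they
     match neither the hot nor the cold partition.  */
  enum toy_partition { TOY_UNPARTITIONED = 0, TOY_HOT, TOY_COLD };

  static int
  toy_dominator_needs_adjustment (enum toy_partition dom)
  {
    /* Hot or unpartitioned (entry-block) dominators are fine; only a
       cold one forces a fixup, hence "!= cold" rather than "== hot".  */
    return dom == TOY_COLD;
  }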

>
>> +        continue;
>> +
>> +      /* We have a hot bb with an immediate dominator that is cold.
>> +         The dominator needs to be re-marked hot.  */
>> +      BB_SET_PARTITION (dom_bb, BB_HOT_PARTITION);
>> +      cold_bb_count--;
>> +
>> +      /* Now we need to examine newly-hot dom_bb to see if it is also
>> +         dominated by a cold bb.  */
>> +      bbs_in_hot_partition->safe_push (dom_bb);
>> +      hot_bbs_to_check.safe_push (dom_bb);
>> +    }
>> +
>> +  if (dom_calculated_here)
>> +    free_dominance_info (dir);
>> +
>> +  return cold_bb_count;
>> +}
>> +
>> +
>>  /* Find the basic blocks that are rarely executed and need to be moved to
>>     a separate section of the .o file (to cut down on paging and improve
>>     cache locality).  Return a vector of all edges that cross.  */
>> @@ -1455,16 +1504,42 @@ find_rarely_executed_basic_blocks_and_crossing_edg
>>    basic_block bb;
>>    edge e;
>>    edge_iterator ei;
>> +  unsigned int cold_bb_count = 0;
>> +  vec<basic_block> bbs_in_hot_partition = vNULL;
>>
>>    /* Mark which partition (hot/cold) each basic block belongs in.  */
>>    FOR_EACH_BB (bb)
>>      {
>>        if (probably_never_executed_bb_p (cfun, bb))
>> - BB_SET_PARTITION (bb, BB_COLD_PARTITION);
>> +        {
>> +          BB_SET_PARTITION (bb, BB_COLD_PARTITION);
>> +          cold_bb_count++;
>> +        }
>>        else
>> - BB_SET_PARTITION (bb, BB_HOT_PARTITION);
>> +        {
>> +          BB_SET_PARTITION (bb, BB_HOT_PARTITION);
>> +          bbs_in_hot_partition.safe_push (bb);
>> +        }
>>      }
>>
>> +  /* Ensure that no cold bbs dominate hot bbs. This could happen as a result of
>> +     several different possibilities. One is that there are edge weight insanities
>> +     due to optimization phases that do not properly update basic block profile
>> +     counts. The second is that the entry of the function may not be hot, because
>> +     it is entered fewer times than the number of profile training runs, but there
>> +     is a loop inside the function that causes blocks within the function to be
>> +     above the threshold for hotness. Then do the same along the post-dominator
>> +     tree (which could have additional changes required after fixing up
>> +     dominators).  */
>> +  if (cold_bb_count)
>> +    cold_bb_count = sanitize_dominator_hotness (CDI_DOMINATORS,
>> +                                                cold_bb_count,
>> +                                                &bbs_in_hot_partition);
>> +  if (cold_bb_count)
>> +    cold_bb_count = sanitize_dominator_hotness (CDI_POST_DOMINATORS,
>> +                                                cold_bb_count,
>> +                                                &bbs_in_hot_partition);
>
> I take it this last store to cold_bb_count is eliminated anyway.

Yep, and I went ahead and removed the dead store.

Thanks for the review. New patch below.

Teresa


2013-08-01  Teresa Johnson  <tejohnson@google.com>
            Steven Bosscher  <steven@gcc.gnu.org>

        * cfgrtl.c (fixup_bb_partition): New routine.
        (commit_edge_insertions): Invoke fixup_partitions.
        (find_partition_fixes): New routine.
        (fixup_partitions): Ditto.
        (verify_hot_cold_block_grouping): Update comments.
        (rtl_verify_edges): Invoke find_partition_fixes.
        (rtl_verify_bb_pointers): Update comments.
        (rtl_verify_bb_layout): Ditto.
        * basic-block.h (fixup_partitions): Declare.
        * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
        * bb-reorder.c (sanitize_dominator_hotness): New function.
        (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
        sanitize_dominator_hotness.

Index: cfgrtl.c
===================================================================
--- cfgrtl.c    (revision 201281)
+++ cfgrtl.c    (working copy)
@@ -1341,6 +1341,34 @@ fixup_partition_crossing (edge e)
     }
 }

+/* Called when block BB has been reassigned to a different partition,
+   to ensure that the region crossing attributes are updated.  */
+
+static void
+fixup_bb_partition (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  /* Now need to make bb's pred edges non-region crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      fixup_partition_crossing (e);
+    }
+
+  /* Possibly need to make bb's successor edges region crossing,
+     or remove stale region crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    {
+      if ((e->flags & EDGE_FALLTHRU)
+          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
+          && e->dest != EXIT_BLOCK_PTR)
+        force_nonfallthru (e);
+      else
+        fixup_partition_crossing (e);
+    }
+}
+
 /* Attempt to change code to redirect edge E to TARGET.  Don't do that on
    expense of adding new instructions or reordering basic blocks.

@@ -1979,6 +2007,14 @@ commit_edge_insertions (void)
 {
   basic_block bb;

+  /* Optimization passes that invoke this routine can cause hot blocks
+     previously reached by both hot and cold blocks to become dominated only
+     by cold blocks. This will cause the verification below to fail,
+     and lead to now cold code in the hot section. In some cases this
+     may only be visible after newly unreachable blocks are deleted,
+     which will be done by fixup_partitions.  */
+  fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
   verify_flow_info ();
 #endif
@@ -2173,6 +2209,101 @@ get_last_bb_insn (basic_block bb)
   return end;
 }

+/* Sanity check partition hotness to ensure that basic blocks in
+   the cold partition don't dominate basic blocks in the hot partition.
+   If FLAG_ONLY is true, report violations as errors. Otherwise
+   re-mark the dominated blocks as cold, since this is run after
+   cfg optimizations that may make hot blocks previously reached
+   by both hot and cold blocks now only reachable along cold paths.  */
+
+static vec<basic_block>
+find_partition_fixes (bool flag_only)
+{
+  basic_block bb;
+  vec<basic_block> bbs_in_cold_partition = vNULL;
+  vec<basic_block> bbs_to_fix = vNULL;
+
+  /* Callers check this.  */
+  gcc_checking_assert (crtl->has_bb_partition);
+
+  FOR_EACH_BB (bb)
+    if ((BB_PARTITION (bb) == BB_COLD_PARTITION))
+      bbs_in_cold_partition.safe_push (bb);
+
+  if (bbs_in_cold_partition.is_empty ())
+    return vNULL;
+
+  bool dom_calculated_here = !dom_info_available_p (CDI_DOMINATORS);
+
+  if (dom_calculated_here)
+    calculate_dominance_info (CDI_DOMINATORS);
+
+  while (! bbs_in_cold_partition.is_empty  ())
+    {
+      bb = bbs_in_cold_partition.pop ();
+      /* Any blocks dominated by a block in the cold section
+         must also be cold.  */
+      basic_block son;
+      for (son = first_dom_son (CDI_DOMINATORS, bb);
+           son;
+           son = next_dom_son (CDI_DOMINATORS, son))
+        {
+          /* If son is not yet cold, then mark it cold here and
+             enqueue it for further processing.  */
+          if ((BB_PARTITION (son) != BB_COLD_PARTITION))
+            {
+              if (flag_only)
+                error ("non-cold basic block %d dominated "
+                       "by a block in the cold partition", son->index);
+              else
+                BB_SET_PARTITION (son, BB_COLD_PARTITION);
+              bbs_to_fix.safe_push (son);
+              bbs_in_cold_partition.safe_push (son);
+            }
+        }
+    }
+
+  if (dom_calculated_here)
+    free_dominance_info (CDI_DOMINATORS);
+
+  return bbs_to_fix;
+}
+
+/* Perform cleanup on the hot/cold bb partitioning after optimization
+   passes that modify the cfg.  */
+
+void
+fixup_partitions (void)
+{
+  basic_block bb;
+
+  if (!crtl->has_bb_partition)
+    return;
+
+  /* Delete any blocks that became unreachable and weren't
+     already cleaned up, for example during edge forwarding
+     and convert_jumps_to_returns. This will expose more
+     opportunities for fixing the partition boundaries here.
+     Also, the calculation of the dominance graph during verification
+     will assert if there are unreachable nodes.  */
+  delete_unreachable_blocks ();
+
+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.
+     Fixup any that now violate this requirement, as a result of edge
+     forwarding and unreachable block deletion.  */
+  vec<basic_block> bbs_to_fix = find_partition_fixes (false);
+
+  /* Do the partition fixup after all necessary blocks have been converted to
+     cold, so that we only update the region crossings the minimum number of
+     places, which can require forcing edges to be non fallthru.  */
+  while (! bbs_to_fix.is_empty ())
+    {
+      bb = bbs_to_fix.pop ();
+      fixup_bb_partition (bb);
+    }
+}
+
 /* Verify, in the basic block chain, that there is at most one switch
    between hot/cold partitions. This condition will not be true until
    after reorder_basic_blocks is called.  */
@@ -2219,7 +2350,8 @@ verify_hot_cold_block_grouping (void)
 /* Perform several checks on the edges out of each block, such as
    the consistency of the branch probabilities, the correctness
    of hot/cold partition crossing edges, and the number of expected
-   successor edges.  */
+   successor edges.  Also verify that the dominance relationship
+   between hot/cold blocks is sane.  */

 static int
 rtl_verify_edges (void)
@@ -2382,6 +2514,14 @@ rtl_verify_edges (void)
        }
     }

+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.  */
+  if (crtl->has_bb_partition && !err)
+    {
+      vec<basic_block> bbs_to_fix = find_partition_fixes (true);
+      err = !bbs_to_fix.is_empty ();
+    }
+
   /* Clean up.  */
   return err;
 }
@@ -2515,7 +2655,7 @@ rtl_verify_bb_pointers (void)
      and NOTE_INSN_BASIC_BLOCK
    - verify that no fall_thru edge crosses hot/cold partition boundaries
    - verify that there are no pending RTL branch predictions
-   - verify that there is a single hot/cold partition boundary after bbro
+   - verify that hot blocks are not dominated by cold blocks

    In future it can be extended check a lot of other stuff as well
    (reachability of basic blocks, life information, etc. etc.).  */
@@ -2761,7 +2901,8 @@ rtl_verify_bb_layout (void)
    - check that all insns are in the basic blocks
      (except the switch handling code, barriers and notes)
    - check that all returns are followed by barriers
-   - check that all fallthru edge points to the adjacent blocks.  */
+   - check that all fallthru edge points to the adjacent blocks
+   - verify that there is a single hot/cold partition boundary after bbro  */

 static int
 rtl_verify_flow_info (void)
Index: basic-block.h
===================================================================
--- basic-block.h       (revision 201281)
+++ basic-block.h       (working copy)
@@ -797,6 +797,7 @@ extern bool contains_no_active_insn_p (const_basic
 extern bool forwarder_block_p (const_basic_block);
 extern bool can_fallthru (basic_block, basic_block);
 extern void emit_barrier_after_bb (basic_block bb);
+extern void fixup_partitions (void);

 /* In cfgbuild.c.  */
 extern void find_many_sub_basic_blocks (sbitmap);
Index: cfgcleanup.c
===================================================================
--- cfgcleanup.c        (revision 201281)
+++ cfgcleanup.c        (working copy)
@@ -2807,10 +2807,21 @@ try_optimize_cfg (int mode)
              df_analyze ();
            }

+         if (changed)
+            {
+              /* Edge forwarding in particular can cause hot blocks previously
+                 reached by both hot and cold blocks to become dominated only
+                 by cold blocks. This will cause the verification below to fail,
+                 and lead to now cold code in the hot section. This is not easy
+                 to detect and fix during edge forwarding, and in some cases
+                 is only visible after newly unreachable blocks are deleted,
+                 which will be done in fixup_partitions.  */
+              fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
-         if (changed)
-           verify_flow_info ();
+              verify_flow_info ();
 #endif
+            }

          changed_overall |= changed;
          first_pass = false;
Index: bb-reorder.c
===================================================================
--- bb-reorder.c        (revision 201281)
+++ bb-reorder.c        (working copy)
@@ -1444,6 +1444,57 @@ fix_up_crossing_landing_pad (eh_landing_pad old_lp
       ei_next (&ei);
 }

+
+/* Ensure that no cold bbs dominate hot bbs along the dominance or
+   post-dominance DIR, for example as a result of edge weight insanities.
+   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
+   to BBS_IN_HOT_PARTITION.  */
+
+static unsigned int
+sanitize_dominator_hotness (enum cdi_direction dir, unsigned int cold_bb_count,
+                            vec<basic_block> *bbs_in_hot_partition)
+{
+  /* Callers check this.  */
+  gcc_checking_assert (cold_bb_count);
+
+  bool dom_calculated_here = !dom_info_available_p (dir);
+
+  if (dom_calculated_here)
+    calculate_dominance_info (dir);
+
+  /* Keep examining hot bbs while we still have some left to check
+     and there are remaining cold bbs.  */
+  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
+  while (! hot_bbs_to_check.is_empty ()
+         && cold_bb_count)
+    {
+      basic_block bb = hot_bbs_to_check.pop ();
+      basic_block dom_bb = get_immediate_dominator (dir, bb);
+
+      /* If bb's immediate dominator is also hot (or unpartitioned,
+         e.g. the entry block) then it is ok. If it is cold, it
+         needs to be adjusted.  */
+      if (BB_PARTITION (dom_bb) != BB_COLD_PARTITION)
+        continue;
+
+      /* We have a hot bb with an immediate dominator that is cold.
+         The dominator needs to be re-marked hot.  */
+      BB_SET_PARTITION (dom_bb, BB_HOT_PARTITION);
+      cold_bb_count--;
+
+      /* Now we need to examine newly-hot dom_bb to see if it is also
+         dominated by a cold bb.  */
+      bbs_in_hot_partition->safe_push (dom_bb);
+      hot_bbs_to_check.safe_push (dom_bb);
+    }
+
+  if (dom_calculated_here)
+    free_dominance_info (dir);
+
+  return cold_bb_count;
+}
+
+
 /* Find the basic blocks that are rarely executed and need to be moved to
    a separate section of the .o file (to cut down on paging and improve
    cache locality).  Return a vector of all edges that cross.  */
@@ -1455,16 +1506,42 @@ find_rarely_executed_basic_blocks_and_crossing_edg
   basic_block bb;
   edge e;
   edge_iterator ei;
+  unsigned int cold_bb_count = 0;
+  vec<basic_block> bbs_in_hot_partition = vNULL;

   /* Mark which partition (hot/cold) each basic block belongs in.  */
   FOR_EACH_BB (bb)
     {
       if (probably_never_executed_bb_p (cfun, bb))
-       BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+          cold_bb_count++;
+        }
       else
-       BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+          bbs_in_hot_partition.safe_push (bb);
+        }
     }

+  /* Ensure that no cold bbs dominate hot bbs. This could happen as a result of
+     several different possibilities. One is that there are edge weight insanities
+     due to optimization phases that do not properly update basic block profile
+     counts. The second is that the entry of the function may not be hot, because
+     it is entered fewer times than the number of profile training runs, but there
+     is a loop inside the function that causes blocks within the function to be
+     above the threshold for hotness. Then do the same along the post-dominator
+     tree (which could have additional changes required after fixing up
+     dominators).  */
+  if (cold_bb_count)
+    cold_bb_count = sanitize_dominator_hotness (CDI_DOMINATORS,
+                                                cold_bb_count,
+                                                &bbs_in_hot_partition);
+  if (cold_bb_count)
+    sanitize_dominator_hotness (CDI_POST_DOMINATORS,
+                                cold_bb_count,
+                                &bbs_in_hot_partition);
+
   /* The format of .gcc_except_table does not allow landing pads to
      be in a different partition as the throw.  Fix this by either
      moving or duplicating the landing pads.  */


-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-02 14:51   ` Teresa Johnson
@ 2013-08-02 15:05     ` Jan Hubicka
  2013-08-02 23:05       ` Steven Bosscher
  2013-08-03  4:48       ` Teresa Johnson
  0 siblings, 2 replies; 62+ messages in thread
From: Jan Hubicka @ 2013-08-02 15:05 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher,
	Jan Hubicka, Jeff Law

> 
> 2013-08-01  Teresa Johnson  <tejohnson@google.com>
>             Steven Bosscher  <steven@gcc.gnu.org>
> 
>         * cfgrtl.c (fixup_bb_partition): New routine.
>         (commit_edge_insertions): Invoke fixup_partitions.
>         (find_partition_fixes): New routine.
>         (fixup_partitions): Ditto.
>         (verify_hot_cold_block_grouping): Update comments.
>         (rtl_verify_edges): Invoke find_partition_fixes.
>         (rtl_verify_bb_pointers): Update comments.
>         (rtl_verify_bb_layout): Ditto.
>         * basic-block.h (fixup_partitions): Declare.
>         * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
>         * bb-reorder.c (sanitize_dominator_hotness): New function.
>         (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
>         sanitize_dominator_hotness.
> 
> Index: cfgrtl.c
> ===================================================================
> --- cfgrtl.c    (revision 201281)
> +++ cfgrtl.c    (working copy)
> @@ -1341,6 +1341,34 @@ fixup_partition_crossing (edge e)
>      }
>  }
> 
> +/* Called when block BB has been reassigned to a different partition,
> +   to ensure that the region crossing attributes are updated.  */
> +
> +static void
> +fixup_bb_partition (basic_block bb)
> +{
> +  edge e;
> +  edge_iterator ei;
> +
> +  /* Now need to make bb's pred edges non-region crossing.  */
> +  FOR_EACH_EDGE (e, ei, bb->preds)
> +    {
> +      fixup_partition_crossing (e);
> +    }
> +
> +  /* Possibly need to make bb's successor edges region crossing,
> +     or remove stale region crossing.  */
> +  FOR_EACH_EDGE (e, ei, bb->succs)
> +    {
> +      if ((e->flags & EDGE_FALLTHRU)
> +          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
> +          && e->dest != EXIT_BLOCK_PTR)
> +        force_nonfallthru (e);
> +      else
> +        fixup_partition_crossing (e);
> +    }
> +}

Is there a particular reason why preds can not be fallthrus and why
a force_nonfallthru edge does not need partition crossing fixup?
(If so, perhaps it could be mentioned in the description; if not,
I think the force_nonfallthru path has to check if a new BB was introduced
and do the right thing on the edge.)

> +/* Sanity check partition hotness to ensure that basic blocks in
> +   the cold partition don't dominate basic blocks in the hot partition.
> +   If FLAG_ONLY is true, report violations as errors. Otherwise
> +   re-mark the dominated blocks as cold, since this is run after
> +   cfg optimizations that may make hot blocks previously reached
> +   by both hot and cold blocks now only reachable along cold paths.  */

With profile, I suppose we can have cold blocks dominating hot blocks when the
hot block is in a loop whose trip count is high enough.  Indeed, for partitioning
reasons it does not make sense to push those into a different section.

I also wonder, if we finally get the pass stable, can we enable it by default
and offline probably cold blocks w/o profile? Primarily blocks reachable only
by EH + blocks leading to a crash or throw().  For C++ those should be common
enough to make a difference...
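
Something along these lines might already cover the common cases
(just a toy sketch with made-up types; nothing like this exists in
the tree):

  /* Seed the statically cold blocks, then propagate coldness to any
     block all of whose successors are cold, iterating to a fixed
     point.  */
  struct toy_bb
  {
    int reachable_only_by_eh;   /* only EH edges lead here */
    int ends_in_noreturn_call;  /* e.g. abort () or a throw */
    int nsuccs;
    struct toy_bb **succs;
    int cold;
  };

  static void
  toy_mark_static_cold (struct toy_bb **bbs, int nbbs)
  {
    for (int i = 0; i < nbbs; i++)
      bbs[i]->cold = (bbs[i]->reachable_only_by_eh
                      || bbs[i]->ends_in_noreturn_call);

    int changed = 1;
    while (changed)
      {
        changed = 0;
        for (int i = 0; i < nbbs; i++)
          {
            struct toy_bb *bb = bbs[i];
            if (bb->cold || bb->nsuccs == 0)
              continue;
            int all_cold = 1;
            for (int j = 0; j < bb->nsuccs; j++)
              if (!bb->succs[j]->cold)
                {
                  all_cold = 0;
                  break;
                }
            if (all_cold)
              {
                bb->cold = 1;
                changed = 1;
              }
          }
      }
  }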

Honza

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-02 15:05     ` Jan Hubicka
@ 2013-08-02 23:05       ` Steven Bosscher
  2013-08-03  4:53         ` Teresa Johnson
  2013-08-03  4:48       ` Teresa Johnson
  1 sibling, 1 reply; 62+ messages in thread
From: Steven Bosscher @ 2013-08-02 23:05 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Teresa Johnson, Bernhard Reutner-Fischer, gcc-patches, Jeff Law

On Fri, Aug 2, 2013 at 5:05 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> +/* Called when block BB has been reassigned to a different partition,
>> +   to ensure that the region crossing attributes are updated.  */
>> +
>> +static void
>> +fixup_bb_partition (basic_block bb)
>> +{
>> +  edge e;
>> +  edge_iterator ei;
>> +
>> +  /* Now need to make bb's pred edges non-region crossing.  */
>> +  FOR_EACH_EDGE (e, ei, bb->preds)
>> +    {
>> +      fixup_partition_crossing (e);
>> +    }
>> +
>> +  /* Possibly need to make bb's successor edges region crossing,
>> +     or remove stale region crossing.  */
>> +  FOR_EACH_EDGE (e, ei, bb->succs)
>> +    {
>> +      if ((e->flags & EDGE_FALLTHRU)
>> +          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
>> +          && e->dest != EXIT_BLOCK_PTR)
>> +        force_nonfallthru (e);
>> +      else
>> +        fixup_partition_crossing (e);
>> +    }
>> +}
>
> Is there a particular reason why preds can not be fallthrus

Yes, by definition a crossing edge cannot fall through. There is
always a control transfer from one section to another.
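
A standalone way to see why (a toy example, not GCC internals): once
two pieces of code land in different sections, the assembler may
place them arbitrarily far apart, so any transfer between them has to
be an explicit branch or call:

  /* Compile with gcc -O2 -S and inspect the assembly: nothing can
     simply fall through from .text.hot into .text.unlikely.  */
  __attribute__ ((section (".text.hot")))
  void hot_path (void) { }

  __attribute__ ((section (".text.unlikely")))
  void cold_path (void) { }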


>> +/* Sanity check partition hotness to ensure that basic blocks in
>> +   the cold partition don't dominate basic blocks in the hot partition.
>> +   If FLAG_ONLY is true, report violations as errors. Otherwise
>> +   re-mark the dominated blocks as cold, since this is run after
>> +   cfg optimizations that may make hot blocks previously reached
>> +   by both hot and cold blocks now only reachable along cold paths.  */
>
> With profile, I suppose we can have cold blocks dominating hot blocks when the
> hot block is in a loop whose trip count is high enough.

That is the common case, actually.

>  Indeed for partitioning
> reasons it does not make sense to push those into different section.

> The partitioning algorithm makes sure this doesn't happen. The
hottest path from the entry block to a hot basic block is always part
of the hot partition.


> I also wonder, if we finally get the pass stable, can we enable it by default
> and offline probably cold blocks w/o profile?

That is the general idea behind all this work, obviously ;-)

> Primarily blocks reachable only
> by EH + blocks leading to a crash or throw().  For C++ those should be common
> enough to make a difference...

Yup, and IIRC Teresa posted some numbers that showed this.

Ciao!
Steven

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-02 15:05     ` Jan Hubicka
  2013-08-02 23:05       ` Steven Bosscher
@ 2013-08-03  4:48       ` Teresa Johnson
  2013-08-05 13:36         ` Teresa Johnson
  1 sibling, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-08-03  4:48 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law

On Fri, Aug 2, 2013 at 8:05 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> 2013-08-01  Teresa Johnson  <tejohnson@google.com>
>>             Steven Bosscher  <steven@gcc.gnu.org>
>>
>>         * cfgrtl.c (fixup_bb_partition): New routine.
>>         (commit_edge_insertions): Invoke fixup_partitions.
>>         (find_partition_fixes): New routine.
>>         (fixup_partitions): Ditto.
>>         (verify_hot_cold_block_grouping): Update comments.
>>         (rtl_verify_edges): Invoke find_partition_fixes.
>>         (rtl_verify_bb_pointers): Update comments.
>>         (rtl_verify_bb_layout): Ditto.
>>         * basic-block.h (fixup_partitions): Declare.
>>         * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
>>         * bb-reorder.c (sanitize_dominator_hotness): New function.
>>         (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
>>         sanitize_dominator_hotness.
>>
>> Index: cfgrtl.c
>> ===================================================================
>> --- cfgrtl.c    (revision 201281)
>> +++ cfgrtl.c    (working copy)
>> @@ -1341,6 +1341,34 @@ fixup_partition_crossing (edge e)
>>      }
>>  }
>>
>> +/* Called when block BB has been reassigned to a different partition,
>> +   to ensure that the region crossing attributes are updated.  */
>> +
>> +static void
>> +fixup_bb_partition (basic_block bb)
>> +{
>> +  edge e;
>> +  edge_iterator ei;
>> +
>> +  /* Now need to make bb's pred edges non-region crossing.  */
>> +  FOR_EACH_EDGE (e, ei, bb->preds)
>> +    {
>> +      fixup_partition_crossing (e);
>> +    }
>> +
>> +  /* Possibly need to make bb's successor edges region crossing,
>> +     or remove stale region crossing.  */
>> +  FOR_EACH_EDGE (e, ei, bb->succs)
>> +    {
>> +      if ((e->flags & EDGE_FALLTHRU)
>> +          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
>> +          && e->dest != EXIT_BLOCK_PTR)
>> +        force_nonfallthru (e);
>> +      else
>> +        fixup_partition_crossing (e);
>> +    }
>> +}
>
> Is there a particular reason why preds can not be fallthrus and why
> a force_nonfallthru edge does not need partition crossing fixup?
> (If so, perhaps it could be mentioned in the description; if not,
> I think the force_nonfallthru path has to check if a new BB was introduced
> and do the right thing on the edge.)

I need to clarify the comments in this routine, because without the
context of how this is called it isn't clear. This routine is only
called when we detect a hot bb that is now dominated by a cold bb and
needs to become cold. Therefore, its preds will no longer be region
crossing (any non-dominating blocks that were previously hot would
have been marked cold in the caller for the same reason, so we will
not end up adjusting the region crossing-ness or fallthrough-ness of
those pred edges). Any that were region crossing before but aren't any
longer could not have been fall through (as Steven noted, you can't
have a fall through across a partition boundary). I will add some
better comments here.

Regarding the call to force_nonfallthru, that routine calls
fixup_partition_crossing as needed, and I will update the comment to
reflect that too.

>
>> +/* Sanity check partition hotness to ensure that basic blocks in
>> +   the cold partition don't dominate basic blocks in the hot partition.
>> +   If FLAG_ONLY is true, report violations as errors. Otherwise
>> +   re-mark the dominated blocks as cold, since this is run after
>> +   cfg optimizations that may make hot blocks previously reached
>> +   by both hot and cold blocks now only reachable along cold paths.  */
>
> With profile, I suppose we can have cold blocks dominating hot blocks when the
> hot block is in a loop whose trip count is high enough.  Indeed, for partitioning
> reasons it does not make sense to push those into a different section.
>
> I also wonder, if we finally get the pass stable, can we enable it by default
> and offline probably cold blocks w/o profile? Primarily blocks reachable only
> by EH + blocks leading to a crash or throw().  For C++ those should be common
> enough to make a difference...

Yep, as soon as PR57451 is fixed, which I hope to get to next week,
then I am going to send a patch to turn this on by default, at least
with profile feedback, which is where I've been doing performance
tuning. But you are right that there are some cases where it should be
beneficial without profile data as well.

Thanks,
Teresa

>
> Honza



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-02 23:05       ` Steven Bosscher
@ 2013-08-03  4:53         ` Teresa Johnson
  0 siblings, 0 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-08-03  4:53 UTC (permalink / raw)
  To: Steven Bosscher
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches, Jeff Law

On Fri, Aug 2, 2013 at 4:04 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Fri, Aug 2, 2013 at 5:05 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> +/* Called when block BB has been reassigned to a different partition,
>>> +   to ensure that the region crossing attributes are updated.  */
>>> +
>>> +static void
>>> +fixup_bb_partition (basic_block bb)
>>> +{
>>> +  edge e;
>>> +  edge_iterator ei;
>>> +
>>> +  /* Now need to make bb's pred edges non-region crossing.  */
>>> +  FOR_EACH_EDGE (e, ei, bb->preds)
>>> +    {
>>> +      fixup_partition_crossing (e);
>>> +    }
>>> +
>>> +  /* Possibly need to make bb's successor edges region crossing,
>>> +     or remove stale region crossing.  */
>>> +  FOR_EACH_EDGE (e, ei, bb->succs)
>>> +    {
>>> +      if ((e->flags & EDGE_FALLTHRU)
>>> +          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
>>> +          && e->dest != EXIT_BLOCK_PTR)
>>> +        force_nonfallthru (e);
>>> +      else
>>> +        fixup_partition_crossing (e);
>>> +    }
>>> +}
>>
>> Is there particular reason why preds can not be fallhtrus
>
> Yes, by definition a crossing edge cannot fall through. There is
> always a control transfer from one section to another.
>
>
>>> +/* Sanity check partition hotness to ensure that basic blocks in
>>> +   the cold partition don't dominate basic blocks in the hot partition.
>>> +   If FLAG_ONLY is true, report violations as errors. Otherwise
>>> +   re-mark the dominated blocks as cold, since this is run after
>>> +   cfg optimizations that may make hot blocks previously reached
>>> +   by both hot and cold blocks now only reachable along cold paths.  */
>>
>> With profile, I suppose we can have cold blocks dominating hot blocks when the
>> hot block is in a loop whose trip count is high enough.
>
> That is the common case, actually.
>
>>  Indeed, for partitioning
>> reasons it does not make sense to push those into a different section.
>
> The partitioning algrorithm makes sure this doesn't happen. The
> hottest path from the entry block to a hot basic block is always part
> of the hot partition.

Well, at least with this patch that will be true. The trunk version
just partitions based on the bb's count without regard to paths.
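
That is, roughly the following, with made-up types (the real test is
probably_never_executed_bb_p, which is more careful about the number
of profile training runs):

  struct toy_counted_bb { long count; };

  /* Pre-patch behavior: classify each block on its own count alone,
     with no attempt to keep a hot path from the entry connected.  */
  static int
  toy_bb_is_cold (const struct toy_counted_bb *bb, long hot_threshold)
  {
    return bb->count < hot_threshold;
  }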

Thanks,
Teresa

>
>
>> I also wonder, if we finally get the pass stable, can we enable it by default
>> and offline probably cold blocks w/o profile?
>
> That is the general idea behind all this work, obviously ;-)
>
>> Primarily blocks reachable only
>> by EH + blocks leading to a crash or throw().  For C++ those should be common
>> enough to make a difference...
>
> Yup, and IIRC Theresa posted some numbers that showed this.
>
> Ciao!
> Steven



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-03  4:48       ` Teresa Johnson
@ 2013-08-05 13:36         ` Teresa Johnson
  2013-08-05 14:11           ` Jan Hubicka
       [not found]           ` <20130808222332.GA31755@kam.mff.cuni.cz>
  0 siblings, 2 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-08-05 13:36 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law

On Fri, Aug 2, 2013 at 9:48 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Fri, Aug 2, 2013 at 8:05 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>
>>> 2013-08-01  Teresa Johnson  <tejohnson@google.com>
>>>             Steven Bosscher  <steven@gcc.gnu.org>
>>>
>>>         * cfgrtl.c (fixup_bb_partition): New routine.
>>>         (commit_edge_insertions): Invoke fixup_partitions.
>>>         (find_partition_fixes): New routine.
>>>         (fixup_partitions): Ditto.
>>>         (verify_hot_cold_block_grouping): Update comments.
>>>         (rtl_verify_edges): Invoke find_partition_fixes.
>>>         (rtl_verify_bb_pointers): Update comments.
>>>         (rtl_verify_bb_layout): Ditto.
>>>         * basic-block.h (fixup_partitions): Declare.
>>>         * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
>>>         * bb-reorder.c (sanitize_dominator_hotness): New function.
>>>         (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
>>>         sanitize_dominator_hotness.
>>>
>>> Index: cfgrtl.c
>>> ===================================================================
>>> --- cfgrtl.c    (revision 201281)
>>> +++ cfgrtl.c    (working copy)
>>> @@ -1341,6 +1341,34 @@ fixup_partition_crossing (edge e)
>>>      }
>>>  }
>>>
>>> +/* Called when block BB has been reassigned to a different partition,
>>> +   to ensure that the region crossing attributes are updated.  */
>>> +
>>> +static void
>>> +fixup_bb_partition (basic_block bb)
>>> +{
>>> +  edge e;
>>> +  edge_iterator ei;
>>> +
>>> +  /* Now need to make bb's pred edges non-region crossing.  */
>>> +  FOR_EACH_EDGE (e, ei, bb->preds)
>>> +    {
>>> +      fixup_partition_crossing (e);
>>> +    }
>>> +
>>> +  /* Possibly need to make bb's successor edges region crossing,
>>> +     or remove stale region crossing.  */
>>> +  FOR_EACH_EDGE (e, ei, bb->succs)
>>> +    {
>>> +      if ((e->flags & EDGE_FALLTHRU)
>>> +          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
>>> +          && e->dest != EXIT_BLOCK_PTR)
>>> +        force_nonfallthru (e);
>>> +      else
>>> +        fixup_partition_crossing (e);
>>> +    }
>>> +}
>>
>> Is there a particular reason why preds can not be fallthrus and why
>> a force_nonfallthru edge does not need partition crossing fixup?
>> (If so, perhaps it could be mentioned in the description; if not,
>> I think the force_nonfallthru path has to check if a new BB was introduced
>> and do the right thing on the edge.)
>
> I need to clarify the comments in this routine, because without the
> context of how this is called it isn't clear. This routine is only
> called when we detect a hot bb that is now dominated by a cold bb and
> needs to become cold. Therefore, its preds will no longer be region
> crossing (any non-dominating blocks that were previously hot would
> have been marked cold in the caller for the same reason, so we will
> not end up adjusting the region crossing-ness or fallthrough-ness of
> those pred edges). Any that were region crossing before but aren't any
> longer could not have been fall through (as Steven noted, you can't
> have a fall through across a partition boundary). I will add some
> better comments here.
>
> Regarding the call to force_nonfallthru, that routine calls
> fixup_partition_crossing as needed, and I will update the comment to
> reflect that too.

Patch with updated comments below. Ok for trunk?

Thanks,
Teresa

2013-08-05  Teresa Johnson  <tejohnson@google.com>
            Steven Bosscher  <steven@gcc.gnu.org>

        * cfgrtl.c (fixup_bb_partition): New routine.
        (commit_edge_insertions): Invoke fixup_partitions.
        (find_partition_fixes): New routine.
        (fixup_partitions): Ditto.
        (verify_hot_cold_block_grouping): Update comments.
        (rtl_verify_edges): Invoke find_partition_fixes.
        (rtl_verify_bb_pointers): Update comments.
        (rtl_verify_bb_layout): Ditto.
        * basic-block.h (fixup_partitions): Declare.
        * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
        * bb-reorder.c (sanitize_dominator_hotness): New function.
        (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
        sanitize_dominator_hotness.

Index: cfgrtl.c
===================================================================
--- cfgrtl.c    (revision 201461)
+++ cfgrtl.c    (working copy)
@@ -1341,6 +1341,43 @@ fixup_partition_crossing (edge e)
     }
 }

+/* Called when block BB has been reassigned to the cold partition,
+   because it is now dominated by another cold block,
+   to ensure that the region crossing attributes are updated.  */
+
+static void
+fixup_new_cold_bb (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  /* This is called when a hot bb is found to now be dominated
+     by a cold bb and therefore needs to become cold. Therefore,
+     its preds will no longer be region crossing. Any non-dominating
+     preds that were previously hot would also have become cold
+     in the caller for the same reason. Any preds that were previously
+     region-crossing will be adjusted in fixup_partition_crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      fixup_partition_crossing (e);
+    }
+
+  /* Possibly need to make bb's successor edges region crossing,
+     or remove stale region crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    {
+      /* We can't have fall-through edges across partition boundaries.
+         Note that force_nonfallthru will do any necessary partition
+         boundary fixup by calling fixup_partition_crossing itself.  */
+      if ((e->flags & EDGE_FALLTHRU)
+          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
+          && e->dest != EXIT_BLOCK_PTR)
+        force_nonfallthru (e);
+      else
+        fixup_partition_crossing (e);
+    }
+}
+
 /* Attempt to change code to redirect edge E to TARGET.  Don't do that on
    expense of adding new instructions or reordering basic blocks.

@@ -1979,6 +2016,14 @@ commit_edge_insertions (void)
 {
   basic_block bb;

+  /* Optimization passes that invoke this routine can cause hot blocks
+     previously reached by both hot and cold blocks to become dominated only
+     by cold blocks. This will cause the verification below to fail,
+     and lead to now cold code in the hot section. In some cases this
+     may only be visible after newly unreachable blocks are deleted,
+     which will be done by fixup_partitions.  */
+  fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
   verify_flow_info ();
 #endif
@@ -2173,6 +2218,101 @@ get_last_bb_insn (basic_block bb)
   return end;
 }

+/* Sanity check partition hotness to ensure that basic blocks in
+   the cold partition don't dominate basic blocks in the hot partition.
+   If FLAG_ONLY is true, report violations as errors. Otherwise
+   re-mark the dominated blocks as cold, since this is run after
+   cfg optimizations that may make hot blocks previously reached
+   by both hot and cold blocks now only reachable along cold paths.  */
+
+static vec<basic_block>
+find_partition_fixes (bool flag_only)
+{
+  basic_block bb;
+  vec<basic_block> bbs_in_cold_partition = vNULL;
+  vec<basic_block> bbs_to_fix = vNULL;
+
+  /* Callers check this.  */
+  gcc_checking_assert (crtl->has_bb_partition);
+
+  FOR_EACH_BB (bb)
+    if ((BB_PARTITION (bb) == BB_COLD_PARTITION))
+      bbs_in_cold_partition.safe_push (bb);
+
+  if (bbs_in_cold_partition.is_empty ())
+    return vNULL;
+
+  bool dom_calculated_here = !dom_info_available_p (CDI_DOMINATORS);
+
+  if (dom_calculated_here)
+    calculate_dominance_info (CDI_DOMINATORS);
+
+  while (! bbs_in_cold_partition.is_empty  ())
+    {
+      bb = bbs_in_cold_partition.pop ();
+      /* Any blocks dominated by a block in the cold section
+         must also be cold.  */
+      basic_block son;
+      for (son = first_dom_son (CDI_DOMINATORS, bb);
+           son;
+           son = next_dom_son (CDI_DOMINATORS, son))
+        {
+          /* If son is not yet cold, then mark it cold here and
+             enqueue it for further processing.  */
+          if ((BB_PARTITION (son) != BB_COLD_PARTITION))
+            {
+              if (flag_only)
+                error ("non-cold basic block %d dominated "
+                       "by a block in the cold partition", son->index);
+              else
+                BB_SET_PARTITION (son, BB_COLD_PARTITION);
+              bbs_to_fix.safe_push (son);
+              bbs_in_cold_partition.safe_push (son);
+            }
+        }
+    }
+
+  if (dom_calculated_here)
+    free_dominance_info (CDI_DOMINATORS);
+
+  return bbs_to_fix;
+}
+
+/* Perform cleanup on the hot/cold bb partitioning after optimization
+   passes that modify the cfg.  */
+
+void
+fixup_partitions (void)
+{
+  basic_block bb;
+
+  if (!crtl->has_bb_partition)
+    return;
+
+  /* Delete any blocks that became unreachable and weren't
+     already cleaned up, for example during edge forwarding
+     and convert_jumps_to_returns. This will expose more
+     opportunities for fixing the partition boundaries here.
+     Also, the calculation of the dominance graph during verification
+     will assert if there are unreachable nodes.  */
+  delete_unreachable_blocks ();
+
+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.
+     Fixup any that now violate this requirement, as a result of edge
+     forwarding and unreachable block deletion.  */
+  vec<basic_block> bbs_to_fix = find_partition_fixes (false);
+
+  /* Do the partition fixup after all necessary blocks have been converted to
+     cold, so that we only update the region crossings the minimum number of
+     places, which can require forcing edges to be non fallthru.  */
+  while (! bbs_to_fix.is_empty ())
+    {
+      bb = bbs_to_fix.pop ();
+      fixup_new_cold_bb (bb);
+    }
+}
+
 /* Verify, in the basic block chain, that there is at most one switch
    between hot/cold partitions. This condition will not be true until
    after reorder_basic_blocks is called.  */
@@ -2219,7 +2359,8 @@ verify_hot_cold_block_grouping (void)
 /* Perform several checks on the edges out of each block, such as
    the consistency of the branch probabilities, the correctness
    of hot/cold partition crossing edges, and the number of expected
-   successor edges.  */
+   successor edges.  Also verify that the dominance relationship
+   between hot/cold blocks is sane.  */

 static int
 rtl_verify_edges (void)
@@ -2382,6 +2523,14 @@ rtl_verify_edges (void)
        }
     }

+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.  */
+  if (crtl->has_bb_partition && !err)
+    {
+      vec<basic_block> bbs_to_fix = find_partition_fixes (true);
+      err = !bbs_to_fix.is_empty ();
+    }
+
   /* Clean up.  */
   return err;
 }
@@ -2515,7 +2664,7 @@ rtl_verify_bb_pointers (void)
      and NOTE_INSN_BASIC_BLOCK
    - verify that no fall_thru edge crosses hot/cold partition boundaries
    - verify that there are no pending RTL branch predictions
-   - verify that there is a single hot/cold partition boundary after bbro
+   - verify that hot blocks are not dominated by cold blocks

    In future it can be extended check a lot of other stuff as well
    (reachability of basic blocks, life information, etc. etc.).  */
@@ -2761,7 +2910,8 @@ rtl_verify_bb_layout (void)
    - check that all insns are in the basic blocks
      (except the switch handling code, barriers and notes)
    - check that all returns are followed by barriers
-   - check that all fallthru edge points to the adjacent blocks.  */
+   - check that all fallthru edge points to the adjacent blocks
+   - verify that there is a single hot/cold partition boundary after bbro  */

 static int
 rtl_verify_flow_info (void)
Index: basic-block.h
===================================================================
--- basic-block.h       (revision 201461)
+++ basic-block.h       (working copy)
@@ -797,6 +797,7 @@ extern bool contains_no_active_insn_p (const_basic
 extern bool forwarder_block_p (const_basic_block);
 extern bool can_fallthru (basic_block, basic_block);
 extern void emit_barrier_after_bb (basic_block bb);
+extern void fixup_partitions (void);

 /* In cfgbuild.c.  */
 extern void find_many_sub_basic_blocks (sbitmap);
Index: cfgcleanup.c
===================================================================
--- cfgcleanup.c        (revision 201461)
+++ cfgcleanup.c        (working copy)
@@ -2807,10 +2807,21 @@ try_optimize_cfg (int mode)
              df_analyze ();
            }

+         if (changed)
+            {
+              /* Edge forwarding in particular can cause hot blocks previously
+                 reached by both hot and cold blocks to become dominated only
+                 by cold blocks. This will cause the verification below to fail,
+                 and lead to now cold code in the hot section. This is not easy
+                 to detect and fix during edge forwarding, and in some cases
+                 is only visible after newly unreachable blocks are deleted,
+                 which will be done in fixup_partitions.  */
+              fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
-         if (changed)
-           verify_flow_info ();
+              verify_flow_info ();
 #endif
+            }

          changed_overall |= changed;
          first_pass = false;
Index: bb-reorder.c
===================================================================
--- bb-reorder.c        (revision 201461)
+++ bb-reorder.c        (working copy)
@@ -1444,6 +1444,57 @@ fix_up_crossing_landing_pad (eh_landing_pad old_lp
       ei_next (&ei);
 }

+
+/* Ensure that no cold bbs dominate hot bbs along the dominance or
+   post-dominance DIR, for example as a result of edge weight insanities.
+   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
+   to BBS_IN_HOT_PARTITION.  */
+
+static unsigned int
+sanitize_dominator_hotness (enum cdi_direction dir, unsigned int cold_bb_count,
+                            vec<basic_block> *bbs_in_hot_partition)
+{
+  /* Callers check this.  */
+  gcc_checking_assert (cold_bb_count);
+
+  bool dom_calculated_here = !dom_info_available_p (dir);
+
+  if (dom_calculated_here)
+    calculate_dominance_info (dir);
+
+  /* Keep examining hot bbs while we still have some left to check
+     and there are remaining cold bbs.  */
+  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
+  while (! hot_bbs_to_check.is_empty ()
+         && cold_bb_count)
+    {
+      basic_block bb = hot_bbs_to_check.pop ();
+      basic_block dom_bb = get_immediate_dominator (dir, bb);
+
+      /* If bb's immediate dominator is also hot (or unpartitioned,
+         e.g. the entry block) then it is ok. If it is cold, it
+         needs to be adjusted.  */
+      if (BB_PARTITION (dom_bb) != BB_COLD_PARTITION)
+        continue;
+
+      /* We have a hot bb with an immediate dominator that is cold.
+         The dominator needs to be re-marked hot.  */
+      BB_SET_PARTITION (dom_bb, BB_HOT_PARTITION);
+      cold_bb_count--;
+
+      /* Now we need to examine newly-hot dom_bb to see if it is also
+         dominated by a cold bb.  */
+      bbs_in_hot_partition->safe_push (dom_bb);
+      hot_bbs_to_check.safe_push (dom_bb);
+    }
+
+  if (dom_calculated_here)
+    free_dominance_info (dir);
+
+  return cold_bb_count;
+}
+
+
 /* Find the basic blocks that are rarely executed and need to be moved to
    a separate section of the .o file (to cut down on paging and improve
    cache locality).  Return a vector of all edges that cross.  */
@@ -1455,16 +1506,42 @@ find_rarely_executed_basic_blocks_and_crossing_edg
   basic_block bb;
   edge e;
   edge_iterator ei;
+  unsigned int cold_bb_count = 0;
+  vec<basic_block> bbs_in_hot_partition = vNULL;

   /* Mark which partition (hot/cold) each basic block belongs in.  */
   FOR_EACH_BB (bb)
     {
       if (probably_never_executed_bb_p (cfun, bb))
-       BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+          cold_bb_count++;
+        }
       else
-       BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+          bbs_in_hot_partition.safe_push (bb);
+        }
     }

+  /* Ensure that no cold bbs dominate hot bbs. This could happen as a result of
+     several different possibilities. One is that there are edge weight insanities
+     due to optimization phases that do not properly update basic block profile
+     counts. The second is that the entry of the function may not be hot, because
+     it is entered fewer times than the number of profile training runs, but there
+     is a loop inside the function that causes blocks within the function to be
+     above the threshold for hotness. Then do the same along the post-dominator
+     tree (which could have additional changes required after fixing up
+     dominators).  */
+  if (cold_bb_count)
+    cold_bb_count = sanitize_dominator_hotness (CDI_DOMINATORS,
+                                                cold_bb_count,
+                                                &bbs_in_hot_partition);
+  if (cold_bb_count)
+    sanitize_dominator_hotness (CDI_POST_DOMINATORS,
+                                cold_bb_count,
+                                &bbs_in_hot_partition);
+
   /* The format of .gcc_except_table does not allow landing pads to
      be in a different partition as the throw.  Fix this by either
      moving or duplicating the landing pads.  */

>
>>
>>> +/* Sanity check partition hotness to ensure that basic blocks in
>>> +   the cold partition don't dominate basic blocks in the hot partition.
>>> +   If FLAG_ONLY is true, report violations as errors. Otherwise
>>> +   re-mark the dominated blocks as cold, since this is run after
>>> +   cfg optimizations that may make hot blocks previously reached
>>> +   by both hot and cold blocks now only reachable along cold paths.  */
>>
>> With profile, I suppose we can have cold blocks dominating hot blocks when the
>> hot blocks is in loop whose trip count is high enough.  Indeed for partitioning
>> reasons it does not make sense to push those into different section.
>>
>> I also wonder, if we finally get the pass stable, can we enable it by default
>> and offline probably cold blocks w/o profile? Primarily blocks reachable only
>> by EH + blocks leading to a crash or throw().  For C++ those should be common
>> enough to make a difference...
>
> Yep, as soon as PR57451 is fixed, which I hope to get to next week,
> then I am going to send a patch to turn this on by default, at least
> with profile feedback, which is where I've been doing performance
> tuning. But you are right that there are some cases where it should be
> beneficial without profile data as well.
>
> Thanks,
> Teresa
>
>>
>> Honza
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-05 13:36         ` Teresa Johnson
@ 2013-08-05 14:11           ` Jan Hubicka
  2013-08-05 14:57             ` Teresa Johnson
       [not found]           ` <20130808222332.GA31755@kam.mff.cuni.cz>
  1 sibling, 1 reply; 62+ messages in thread
From: Jan Hubicka @ 2013-08-05 14:11 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law

The patch looks OK to me in general (I cannot approve it).
I still have one question...
> +
> +/* Ensure that no cold bbs dominate hot bbs along the dominance or
> +   post-dominance DIR, for example as a result of edge weight insanities.
> +   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
> +   to BBS_IN_HOT_PARTITION.  */
> +
> +static unsigned int
> +sanitize_dominator_hotness (enum cdi_direction dir, unsigned int cold_bb_count,
> +                            vec<basic_block> *bbs_in_hot_partition)
> +{
> +  /* Callers check this.  */
> +  gcc_checking_assert (cold_bb_count);
> +
> +  bool dom_calculated_here = !dom_info_available_p (dir);
> +
> +  if (dom_calculated_here)
> +    calculate_dominance_info (dir);
> +
> +  /* Keep examining hot bbs while we still have some left to check
> +     and there are remaining cold bbs.  */
> +  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
> +  while (! hot_bbs_to_check.is_empty ()
> +         && cold_bb_count)
> +    {
> +      basic_block bb = hot_bbs_to_check.pop ();
> +      basic_block dom_bb = get_immediate_dominator (dir, bb);
> +
> +      /* If bb's immediate dominator is also hot (or unpartitioned,
> +         e.g. the entry block) then it is ok. If it is cold, it
> +         needs to be adjusted.  */
> +      if (BB_PARTITION (dom_bb) != BB_COLD_PARTITION)
> +        continue;

What will happen on

if (t)
  something
else
  something else
for (i=0;i<1000000;i++)
  something else2

I would expect if/something and something else to be cold by profile feedback.
Your dominator code will bring the if into the hot partition but both paths
of the conditional will be cold, so the number of crossings will actually grow.

If we want to have at least some path to hot blocks in the hot region, I suspect
we could walk back from hot regions to entry and keep those in hot regions rather
than relying on the dominator tree...
But I am sure such things can be dealt with incrementally.

Honza

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-05 14:11           ` Jan Hubicka
@ 2013-08-05 14:57             ` Teresa Johnson
  2013-08-06  3:01               ` Teresa Johnson
  0 siblings, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-08-05 14:57 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law

On Mon, Aug 5, 2013 at 7:11 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> The patch looks OK to me in general (I can not approve it).
> Still have one question...
>> +
>> +/* Ensure that no cold bbs dominate hot bbs along the dominance or
>> +   post-dominance DIR, for example as a result of edge weight insanities.
>> +   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
>> +   to BBS_IN_HOT_PARTITION.  */
>> +
>> +static unsigned int
>> +sanitize_dominator_hotness (enum cdi_direction dir, unsigned int cold_bb_count,
>> +                            vec<basic_block> *bbs_in_hot_partition)
>> +{
>> +  /* Callers check this.  */
>> +  gcc_checking_assert (cold_bb_count);
>> +
>> +  bool dom_calculated_here = !dom_info_available_p (dir);
>> +
>> +  if (dom_calculated_here)
>> +    calculate_dominance_info (dir);
>> +
>> +  /* Keep examining hot bbs while we still have some left to check
>> +     and there are remaining cold bbs.  */
>> +  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
>> +  while (! hot_bbs_to_check.is_empty ()
>> +         && cold_bb_count)
>> +    {
>> +      basic_block bb = hot_bbs_to_check.pop ();
>> +      basic_block dom_bb = get_immediate_dominator (dir, bb);
>> +
>> +      /* If bb's immediate dominator is also hot (or unpartitioned,
>> +         e.g. the entry block) then it is ok. If it is cold, it
>> +         needs to be adjusted.  */
>> +      if (BB_PARTITION (dom_bb) != BB_COLD_PARTITION)
>> +        continue;
>
> What will happen on
>
> if (t)
>   something
> else
>   something else
> for (i=0;i<1000000;i++)
>   something else2
>
> I would expect if/something and something else to be cold by profile feedback.
> Your dominator code will bring the if into the hot partition but both paths
> of the conditional will be cold, so the number of crossings will actually grow.

You are right, this case will not be handled well.

>
> If we want to have at least some path to hot blocks in the hot region, I suspect
> we could walk back from hot regions to entry and keep those in hot regions rather
> than relying on the dominator tree...
> But I am sure such things can be dealt with incrementally.

I am going to fix this and will resend the patch. Rather than look at
the immediate dominator of each hot block, we need to ensure that at
least one pred bb is hot. In your example, if that was a 50-50 branch,
then IMO both preds should be marked hot.

Thanks,
Teresa

>
> Honza



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-05 14:57             ` Teresa Johnson
@ 2013-08-06  3:01               ` Teresa Johnson
  0 siblings, 0 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-08-06  3:01 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law

On Mon, Aug 5, 2013 at 7:57 AM, Teresa Johnson <tejohnson@google.com> wrote:
> On Mon, Aug 5, 2013 at 7:11 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> The patch looks OK to me in general (I can not approve it).
>> Still have one question...
>>> +
>>> +/* Ensure that no cold bbs dominate hot bbs along the dominance or
>>> +   post-dominance DIR, for example as a result of edge weight insanities.
>>> +   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
>>> +   to BBS_IN_HOT_PARTITION.  */
>>> +
>>> +static unsigned int
>>> +sanitize_dominator_hotness (enum cdi_direction dir, unsigned int cold_bb_count,
>>> +                            vec<basic_block> *bbs_in_hot_partition)
>>> +{
>>> +  /* Callers check this.  */
>>> +  gcc_checking_assert (cold_bb_count);
>>> +
>>> +  bool dom_calculated_here = !dom_info_available_p (dir);
>>> +
>>> +  if (dom_calculated_here)
>>> +    calculate_dominance_info (dir);
>>> +
>>> +  /* Keep examining hot bbs while we still have some left to check
>>> +     and there are remaining cold bbs.  */
>>> +  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
>>> +  while (! hot_bbs_to_check.is_empty ()
>>> +         && cold_bb_count)
>>> +    {
>>> +      basic_block bb = hot_bbs_to_check.pop ();
>>> +      basic_block dom_bb = get_immediate_dominator (dir, bb);
>>> +
>>> +      /* If bb's immediate dominator is also hot (or unpartitioned,
>>> +         e.g. the entry block) then it is ok. If it is cold, it
>>> +         needs to be adjusted.  */
>>> +      if (BB_PARTITION (dom_bb) != BB_COLD_PARTITION)
>>> +        continue;
>>
>> What will happen on
>>
>> if (t)
>>   something
>> else
>>   something else
>> for (i=0;i<1000000;i++)
>>   something else2
>>
>> I would expect if/something and something else to be cold by profile feedback.
>> Your dominator code will bring the if into the hot partition but both paths
>> of the conditional will be cold, so the number of crossings will actually grow.
>
> You are right, this case will not be handled well.
>
>>
>> If we want to have at least some path to hot blocks in the hot region, I suspect
>> we could walk back from hot regions to entry and keep those in hot regions rather
>> than relying on the dominator tree...
>> But I am sure such things can be dealt with incrementally.
>
> I am going to fix this and will resend the patch. Rather than look at
> the immediate dominator of each hot block, we need to ensure that at
> least one pred bb is hot. In your example, if that was a 50-50 branch,
> then IMO both preds should be marked hot.

New patch below that walks the preds of each hot bb instead of the dominators.

Bootstrapped and tested on x86-64-unknown-linux-gnu. Also ensured that
a profiledbootstrap passed with -freorder-blocks-and-partition enabled
still works.

Ok for trunk?

Thanks,
Teresa


2013-08-05  Teresa Johnson  <tejohnson@google.com>
            Steven Bosscher  <steven@gcc.gnu.org>

        * cfgrtl.c (fixup_new_cold_bb): New routine.
        (commit_edge_insertions): Invoke fixup_partitions.
        (find_partition_fixes): New routine.
        (fixup_partitions): Ditto.
        (verify_hot_cold_block_grouping): Update comments.
        (rtl_verify_edges): Invoke find_partition_fixes.
        (rtl_verify_bb_pointers): Update comments.
        (rtl_verify_bb_layout): Ditto.
        * basic-block.h (fixup_partitions): Declare.
        * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
        * bb-reorder.c (sanitize_hot_paths): New function.
        (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
        sanitize_hot_paths.

Index: cfgrtl.c
===================================================================
--- cfgrtl.c    (revision 201461)
+++ cfgrtl.c    (working copy)
@@ -1341,6 +1341,43 @@ fixup_partition_crossing (edge e)
     }
 }

+/* Called when block BB has been reassigned to the cold partition,
+   because it is now dominated by another cold block,
+   to ensure that the region crossing attributes are updated.  */
+
+static void
+fixup_new_cold_bb (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  /* This is called when a hot bb is found to now be dominated
+     by a cold bb and therefore needs to become cold. Therefore,
+     its preds will no longer be region crossing. Any non-dominating
+     preds that were previously hot would also have become cold
+     in the caller for the same region. Any preds that were previously
+     region-crossing will be adjusted in fixup_partition_crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      fixup_partition_crossing (e);
+    }
+
+  /* Possibly need to make bb's successor edges region crossing,
+     or remove stale region crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    {
+      /* We can't have fall-through edges across partition boundaries.
+         Note that force_nonfallthru will do any necessary partition
+         boundary fixup by calling fixup_partition_crossing itself.  */
+      if ((e->flags & EDGE_FALLTHRU)
+          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
+          && e->dest != EXIT_BLOCK_PTR)
+        force_nonfallthru (e);
+      else
+        fixup_partition_crossing (e);
+    }
+}
+
 /* Attempt to change code to redirect edge E to TARGET.  Don't do that on
    expense of adding new instructions or reordering basic blocks.

@@ -1979,6 +2016,14 @@ commit_edge_insertions (void)
 {
   basic_block bb;

+  /* Optimization passes that invoke this routine can cause hot blocks
+     previously reached by both hot and cold blocks to become dominated only
+     by cold blocks. This will cause the verification below to fail,
+     and lead to now cold code in the hot section. In some cases this
+     may only be visible after newly unreachable blocks are deleted,
+     which will be done by fixup_partitions.  */
+  fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
   verify_flow_info ();
 #endif
@@ -2173,6 +2218,101 @@ get_last_bb_insn (basic_block bb)
   return end;
 }

+/* Sanity check partition hotness to ensure that basic blocks in
+   the cold partition don't dominate basic blocks in the hot partition.
+   If FLAG_ONLY is true, report violations as errors. Otherwise
+   re-mark the dominated blocks as cold, since this is run after
+   cfg optimizations that may make hot blocks previously reached
+   by both hot and cold blocks now only reachable along cold paths.  */
+
+static vec<basic_block>
+find_partition_fixes (bool flag_only)
+{
+  basic_block bb;
+  vec<basic_block> bbs_in_cold_partition = vNULL;
+  vec<basic_block> bbs_to_fix = vNULL;
+
+  /* Callers check this.  */
+  gcc_checking_assert (crtl->has_bb_partition);
+
+  FOR_EACH_BB (bb)
+    if ((BB_PARTITION (bb) == BB_COLD_PARTITION))
+      bbs_in_cold_partition.safe_push (bb);
+
+  if (bbs_in_cold_partition.is_empty ())
+    return vNULL;
+
+  bool dom_calculated_here = !dom_info_available_p (CDI_DOMINATORS);
+
+  if (dom_calculated_here)
+    calculate_dominance_info (CDI_DOMINATORS);
+
+  while (! bbs_in_cold_partition.is_empty ())
+    {
+      bb = bbs_in_cold_partition.pop ();
+      /* Any blocks dominated by a block in the cold section
+         must also be cold.  */
+      basic_block son;
+      for (son = first_dom_son (CDI_DOMINATORS, bb);
+           son;
+           son = next_dom_son (CDI_DOMINATORS, son))
+        {
+          /* If son is not yet cold, then mark it cold here and
+             enqueue it for further processing.  */
+          if ((BB_PARTITION (son) != BB_COLD_PARTITION))
+            {
+              if (flag_only)
+                error ("non-cold basic block %d dominated "
+                       "by a block in the cold partition (%d)",
+                       son->index, bb->index);
+              else
+                BB_SET_PARTITION (son, BB_COLD_PARTITION);
+              bbs_to_fix.safe_push (son);
+              bbs_in_cold_partition.safe_push (son);
+            }
+        }
+    }
+
+  if (dom_calculated_here)
+    free_dominance_info (CDI_DOMINATORS);
+
+  return bbs_to_fix;
+}
+
+/* Perform cleanup on the hot/cold bb partitioning after optimization
+   passes that modify the cfg.  */
+
+void
+fixup_partitions (void)
+{
+  basic_block bb;
+
+  if (!crtl->has_bb_partition)
+    return;
+
+  /* Delete any blocks that became unreachable and weren't
+     already cleaned up, for example during edge forwarding
+     and convert_jumps_to_returns. This will expose more
+     opportunities for fixing the partition boundaries here.
+     Also, the calculation of the dominance graph during verification
+     will assert if there are unreachable nodes.  */
+  delete_unreachable_blocks ();
+
+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.
+     Fixup any that now violate this requirement, as a result of edge
+     forwarding and unreachable block deletion.  */
+  vec<basic_block> bbs_to_fix = find_partition_fixes (false);
+
+  /* Do the partition fixup after all necessary blocks have been converted to
+     cold, so that we only update the region crossings the minimum number of
+     places, which can require forcing edges to be non fallthru.  */
+  while (! bbs_to_fix.is_empty ())
+    {
+      bb = bbs_to_fix.pop ();
+      fixup_new_cold_bb (bb);
+    }
+}
+
 /* Verify, in the basic block chain, that there is at most one switch
    between hot/cold partitions. This condition will not be true until
    after reorder_basic_blocks is called.  */
@@ -2219,7 +2359,8 @@ verify_hot_cold_block_grouping (void)
 /* Perform several checks on the edges out of each block, such as
    the consistency of the branch probabilities, the correctness
    of hot/cold partition crossing edges, and the number of expected
-   successor edges.  */
+   successor edges.  Also verify that the dominance relationship
+   between hot/cold blocks is sane.  */

 static int
 rtl_verify_edges (void)
@@ -2382,6 +2523,14 @@ rtl_verify_edges (void)
        }
     }

+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.  */
+  if (crtl->has_bb_partition && !err)
+    {
+      vec<basic_block> bbs_to_fix = find_partition_fixes (true);
+      err = !bbs_to_fix.is_empty ();
+    }
+
   /* Clean up.  */
   return err;
 }
@@ -2515,7 +2664,7 @@ rtl_verify_bb_pointers (void)
      and NOTE_INSN_BASIC_BLOCK
    - verify that no fall_thru edge crosses hot/cold partition boundaries
    - verify that there are no pending RTL branch predictions
-   - verify that there is a single hot/cold partition boundary after bbro
+   - verify that hot blocks are not dominated by cold blocks

    In future it can be extended check a lot of other stuff as well
    (reachability of basic blocks, life information, etc. etc.).  */
@@ -2761,7 +2910,8 @@ rtl_verify_bb_layout (void)
    - check that all insns are in the basic blocks
      (except the switch handling code, barriers and notes)
    - check that all returns are followed by barriers
-   - check that all fallthru edge points to the adjacent blocks.  */
+   - check that all fallthru edge points to the adjacent blocks
+   - verify that there is a single hot/cold partition boundary after bbro  */

 static int
 rtl_verify_flow_info (void)
Index: basic-block.h
===================================================================
--- basic-block.h       (revision 201461)
+++ basic-block.h       (working copy)
@@ -797,6 +797,7 @@ extern bool contains_no_active_insn_p (const_basic
 extern bool forwarder_block_p (const_basic_block);
 extern bool can_fallthru (basic_block, basic_block);
 extern void emit_barrier_after_bb (basic_block bb);
+extern void fixup_partitions (void);

 /* In cfgbuild.c.  */
 extern void find_many_sub_basic_blocks (sbitmap);
Index: cfgcleanup.c
===================================================================
--- cfgcleanup.c        (revision 201461)
+++ cfgcleanup.c        (working copy)
@@ -2807,10 +2807,21 @@ try_optimize_cfg (int mode)
              df_analyze ();
            }

+         if (changed)
+            {
+              /* Edge forwarding in particular can cause hot blocks previously
+                 reached by both hot and cold blocks to become dominated only
+                 by cold blocks. This will cause the verification
+                 below to fail,
+                 and lead to now cold code in the hot section. This is not easy
+                 to detect and fix during edge forwarding, and in some cases
+                 is only visible after newly unreachable blocks are deleted,
+                 which will be done in fixup_partitions.  */
+              fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
-         if (changed)
-           verify_flow_info ();
+              verify_flow_info ();
 #endif
+            }

          changed_overall |= changed;
          first_pass = false;
Index: bb-reorder.c
===================================================================
--- bb-reorder.c        (revision 201461)
+++ bb-reorder.c        (working copy)
@@ -1444,27 +1444,134 @@ fix_up_crossing_landing_pad (eh_landing_pad old_lp
       ei_next (&ei);
 }

+
+/* Ensure that all hot bbs are included in a hot path through the
+   procedure. This is done by calling this function twice, once
+   with WALK_UP true (to look for paths from the entry to hot bbs) and
+   once with WALK_UP false (to look for paths from hot bbs to the exit).
+   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
+   to BBS_IN_HOT_PARTITION.  */
+
+static unsigned int
+sanitize_hot_paths (bool walk_up, unsigned int cold_bb_count,
+                    vec<basic_block> *bbs_in_hot_partition)
+{
+  /* Callers check this.  */
+  gcc_checking_assert (cold_bb_count);
+
+  /* Keep examining hot bbs while we still have some left to check
+     and there are remaining cold bbs.  */
+  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
+  while (! hot_bbs_to_check.is_empty ()
+         && cold_bb_count)
+    {
+      basic_block bb = hot_bbs_to_check.pop ();
+      vec<edge, va_gc> *edges = walk_up ? bb->preds : bb->succs;
+      edge e;
+      edge_iterator ei;
+      int highest_probability = 0;
+      bool found = false;
+
+      /* Walk the preds/succs and check if there is at least one already
+         marked hot. Keep track of the most frequent pred/succ so that we
+         can mark it hot if we don't find one.  */
+      FOR_EACH_EDGE (e, ei, edges)
+        {
+          basic_block reach_bb = walk_up ? e->src : e->dest;
+
+          if (e->flags & EDGE_DFS_BACK)
+            continue;
+
+          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
+            {
+              found = true;
+              break;
+            }
+          if (e->probability > highest_probability)
+            highest_probability = e->probability;
+        }
+
+      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
+         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
+         then the most frequent pred (or succ) needs to be adjusted.  In the
+         case where multiple preds/succs have the same probability (e.g. a
+         50-50 branch), then both will be adjusted.  */
+      if (found)
+        continue;
+
+      FOR_EACH_EDGE (e, ei, edges)
+        {
+          if (e->flags & EDGE_DFS_BACK)
+            continue;
+          if (e->probability < highest_probability)
+            continue;
+
+          basic_block reach_bb = walk_up ? e->src : e->dest;
+
+          /* We have a hot bb whose preds (or succs, for !WALK_UP) are
+             all cold.  The most frequent one needs to be re-marked hot.  */
+          BB_SET_PARTITION (reach_bb, BB_HOT_PARTITION);
+          cold_bb_count--;
+
+          /* Now we need to examine newly-hot reach_bb to see if it is also
+             dominated by a cold bb.  */
+          bbs_in_hot_partition->safe_push (reach_bb);
+          hot_bbs_to_check.safe_push (reach_bb);
+        }
+    }
+
+  return cold_bb_count;
+}
+
+
 /* Find the basic blocks that are rarely executed and need to be moved to
    a separate section of the .o file (to cut down on paging and improve
    cache locality).  Return a vector of all edges that cross.  */

-static vec<edge>
+static vec<edge>
 find_rarely_executed_basic_blocks_and_crossing_edges (void)
 {
   vec<edge> crossing_edges = vNULL;
   basic_block bb;
   edge e;
   edge_iterator ei;
+  unsigned int cold_bb_count = 0;
+  vec<basic_block> bbs_in_hot_partition = vNULL;

   /* Mark which partition (hot/cold) each basic block belongs in.  */
   FOR_EACH_BB (bb)
     {
       if (probably_never_executed_bb_p (cfun, bb))
-       BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+          cold_bb_count++;
+        }
       else
-       BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+          bbs_in_hot_partition.safe_push (bb);
+        }
     }

+  /* Ensure that hot bbs are included along a hot path from the entry to exit.
+     Several different possibilities may include cold bbs along all paths
+     to/from a hot bb. One is that there are edge weight insanities
+     due to optimization phases that do not properly update basic block profile
+     counts. The second is that the entry of the function may not be
+     hot, because
+     it is entered fewer times than the number of profile training
+     runs, but there
+     is a loop inside the function that causes blocks within the function to be
+     above the threshold for hotness. This is fixed by walking up from hot bbs
+     to the entry block, and then down from hot bbs to the exit, performing
+     partitioning fixups as necessary.  */
+  if (cold_bb_count)
+    {
+      mark_dfs_back_edges ();
+      cold_bb_count = sanitize_hot_paths (true, cold_bb_count,
+                                          &bbs_in_hot_partition);
+      if (cold_bb_count)
+        sanitize_hot_paths (false, cold_bb_count, &bbs_in_hot_partition);
+    }
+
   /* The format of .gcc_except_table does not allow landing pads to
      be in a different partition as the throw.  Fix this by either
      moving or duplicating the landing pads.  */

>
> Thanks,
> Teresa
>
>>
>> Honza
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
       [not found]           ` <20130808222332.GA31755@kam.mff.cuni.cz>
@ 2013-08-08 23:04             ` Teresa Johnson
  2013-08-09  9:58               ` Jan Hubicka
  0 siblings, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-08-08 23:04 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law,
	marxin.liska

On Thu, Aug 8, 2013 at 3:23 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> Martin Liska was kind enough to generate a disk seeking graph of gimp startup with his function reordering.
> His code simply measures the time of first execution of each function and orders the functions in that order.
> The functions stay in the subsections (unlikely/startup/exit/hot/normal) that are then glued together
> in this order.
>
> I am attaching disk seeking with and without -freorder-blocks-and-partition (with your patch).
>
> In 2.pdf you can see two increasing sequences in the text segment.  If I am not mistaken the bottom
> one is for the hot section and the top one for the normal section.  The big unused part at the bottom is the unlikely
> section since most of gimp is not trained.

2.pdf is reordered with Martin's technique?

>
> Now 1.pdf is with -freorder-blocks-and-partition and your patch.  You can see there is a third sequence
> near the bottom of the text section; that is the beginning of the unlikely section, so it tracks jumps where we
> fall into the cold section of a function.

1.pdf is generated using the usual FDO +
-freorder-blocks-and-partition (i.e. not using Martin's technique)?

>
> It still seems rather bad (i.e. a good part of the unlikely section is actually used).  I think the dominator
> based approach is not going to work too reliably (I can "fix" my testcase to contain multiple nested
> conditionals and then the heuristic about predecessors won't help).

Yes, this doesn't look good. Did you use the latest version of my
patch that doesn't walk the dominators?

Do you know how many training runs are done for this benchmark? I
think a lot of the issues that you pointed out with the hot loop
preceded by non-looping conditional code as in your earlier example,
or multiple nested conditionals, come from the fact that the cold
cutoff is not 0, but some number less than the number of training
runs. Perhaps the cutoff for splitting should be 0. Then the main
issue that needs to be corrected is profile insanities, not code that
is executed once (since that would not be marked cold).

The only other issue that I can think of here is that the training
data was not representative and didn't execute these blocks.

>
> What about simply walking the CFG from entry through all edges with non-0 counts and making all reachable
> blocks hot + forcibly making any hot blocks not reachable this way reachable?

Is this different than what I currently have + changing the cold
cutoff to 0? In that case any blocks reachable through non-0 edges
should be non-0 and marked hot, and the current patch forces the most
frequent paths to all hot blocks to be hot.
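
For reference, here is roughly how I understand the walk you are
proposing (a hypothetical sketch with a made-up name, not code from
my patch):

  /* Walk the CFG from the entry block through edges with non-zero
     counts, re-marking every block reached as hot.  */
  static void
  mark_nonzero_reachable_hot (void)
  {
    sbitmap visited = sbitmap_alloc (last_basic_block);
    bitmap_clear (visited);
    vec<basic_block> worklist = vNULL;
    worklist.safe_push (ENTRY_BLOCK_PTR);
    while (! worklist.is_empty ())
      {
        basic_block bb = worklist.pop ();
        edge e;
        edge_iterator ei;
        FOR_EACH_EDGE (e, ei, bb->succs)
          if (e->count > 0
              && e->dest != EXIT_BLOCK_PTR
              && ! bitmap_bit_p (visited, e->dest->index))
            {
              bitmap_set_bit (visited, e->dest->index);
              BB_SET_PARTITION (e->dest, BB_HOT_PARTITION);
              worklist.safe_push (e->dest);
            }
      }
    worklist.release ();
    sbitmap_free (visited);
  }

If that matches what you meant, then the remaining difference would be
how we force hot blocks not reached this way back into the hot region.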

Thanks,
Teresa

> I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
> that should not be visited by the train run.  We can then see how to make the heuristic more aggressive?
>
> Honza



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-08 23:04             ` Teresa Johnson
@ 2013-08-09  9:58               ` Jan Hubicka
  2013-08-09 14:38                 ` Teresa Johnson
  0 siblings, 1 reply; 62+ messages in thread
From: Jan Hubicka @ 2013-08-09  9:58 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, marxin.liska

> On Thu, Aug 8, 2013 at 3:23 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> > Hi,
> > Martin Liska was kind enough to generate a disk seeking graph of gimp startup with his function reordering.
> > His code simply measures the time of first execution of each function and orders the functions in that order.
> > The functions stay in the subsections (unlikely/startup/exit/hot/normal) that are then glued together
> > in this order.
> >
> > I am attaching disk seeking with and without -freorder-blocks-and-partition (with your patch).
> >
> > In 2.pdf you can see two increasing sequences in the text segment.  If I am not mistaken the bottom
> > one is for the hot section and the top one for the normal section.  The big unused part at the bottom is the unlikely
> > section since most of gimp is not trained.
> 
> 2.pdf is reordered with Martin's technique?
> 
> >
> > Now 1.pdf is with -freorder-blocks-and-partition and your patch.  You can see there is a third sequence
> > near the bottom of the text section; that is the beginning of the unlikely section, so it tracks jumps where we
> > fall into the cold section of a function.
> 
> 1.pdf is generated using the usual FDO +
> -freorder-blocks-and-partition (i.e. not using Martin's technique)?

2.pdf is Martin's reordering (that works orthogonally to what we already have -
it just orders the functions inside individual subsections.  This makes the
subsections more visible than without his patch).
1.pdf is Martin's reordering + your patch (I asked him to double check it is
the latest version) + -freorder-blocks-and-partition.

He simply trains and measures the gimp startup, nothing else, so there should not
be a problem with the representativeness of the data.
> 
> >
> > It still seems rather bad (i.e. a good part of the unlikely section is actually used).  I think the dominator
> > based approach is not going to work too reliably (I can "fix" my testcase to contain multiple nested
> > conditionals and then the heuristic about predecessors won't help).
> 
> Yes, this doesn't look good. Did you use the latest version of my
> patch that doesn't walk the dominators?
> 
> Do you know how many training runs are done for this benchmark? I
> think a lot of the issues that you pointed out with the hot loop
> preceded by non-looping conditional code as in your earlier example,
> or multiple nested conditionals, come from the fact that the cold
> cutoff is not 0, but some number less than the number of training
> runs. Perhaps the cutoff for splitting should be 0. Then the main
> issue that needs to be corrected is profile insanities, not code that
> is executed once (since that would not be marked cold).

Hmm, compute_function_frequency uses probably_never_executed_bb_p, which requires
the count of a basic block to be less than the number of training runs.  In Martin's
setup that will be 0.
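
Roughly it amounts to this (a simplified sketch; see predict.c for the
exact definition):

  /* A bb is considered probably never executed when its count,
     averaged over the training runs and rounded, is zero.  */
  return ((bb->count + profile_info->runs / 2) / profile_info->runs) == 0;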

This is the same as what -freorder-blocks-and-partition does?
> 
> The only other issue that I can think of here is that the training
> data was not representative and didn't execute these blocks.
> 
> >
> > What about simply walking the CFG from entry through all edges with non-0 counts and making all reachable
> > blocks hot + forcibly making any hot blocks not reachable this way reachable?
> 
> Is this different than what I currently have + changing the cold
> cutoff to 0? In that case any blocks reachable through non-0 edges
> should be non-0 and marked hot, and the current patch forces the most
> frequent paths to all hot blocks to be hot.

Do we sanity check that the cold partition does not contain any blocks of
count 0?  It may be that the profile is broken enough to make partitioning
not work.
I can think of inlining where the count gets scaled all the way down to 0.  Perhaps
the count scaling code can be modified to never round towards 0 for blocks executing
non-0 times...
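
Something along these lines (a hypothetical sketch, not existing code):

  /* Scale COUNT by NUM/DEN (DEN > 0 assumed), but never round a
     non-zero count down to zero.  */
  static gcov_type
  scale_count (gcov_type count, gcov_type num, gcov_type den)
  {
    gcov_type scaled = count * num / den;
    if (count > 0 && scaled == 0)
      scaled = 1;
    return scaled;
  }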

Honza
> 
> Thanks,
> Teresa
> 
> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
> > that should not be visited by the train run.  We can then see how to make the heuristic more aggressive?
> >
> > Honza
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-09  9:58               ` Jan Hubicka
@ 2013-08-09 14:38                 ` Teresa Johnson
  2013-08-09 15:28                   ` Jan Hubicka
  0 siblings, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-08-09 14:38 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law,
	marxin.liska, Sriraman Tallam

On Fri, Aug 9, 2013 at 2:58 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Thu, Aug 8, 2013 at 3:23 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Hi,
>> > Martin Liska was kind enough to generate a disk seeking graph of gimp startup with his function reordering.
>> > His code simply measures the time of first execution of each function and orders the functions in that order.
>> > The functions stay in the subsections (unlikely/startup/exit/hot/normal) that are then glued together
>> > in this order.
>> >
>> > I am attaching disk seeking with and without -freorder-blocks-and-partition (with your patch).
>> >
>> > In 2.pdf you can see two increasing sequences in the text segment.  If I am not mistaken the bottom
>> > one is for the hot section and the top one for the normal section.  The big unused part at the bottom is the unlikely
>> > section since most of gimp is not trained.
>>
>> 2.pdf is reordered with Martin's technique?
>>
>> >
>> > Now 1.pdf is with -freorder-blocks-and-partition and your patch.  You can see there is a third sequence
>> > near the bottom of the text section; that is the beginning of the unlikely section, so it tracks jumps where we
>> > fall into the cold section of a function.
>>
>> 1.pdf is generated using the usual FDO +
>> -freorder-blocks-and-partition (i.e. not using Martin's technique)?
>
> 2.pdf is Martin's reordering (that works orthogonally to what we already have -
> it just orders the functions inside individual subsections.  This makes the
> subsections more visible than without his patch).
> 1.pdf is Martin's reordering + your patch (I asked him to double check it is
> the latest version) + -freorder-blocks-and-partition.
>
> He simply trains and measures the gimp startup, nothing else, so there should not
> be a problem with the representativeness of the data.

Ok, so a single training run, and it is essentially the same as what
is being used to create the graph after optimization.

>>
>> >
>> > It still seems rather bad (i.e. a good part of the unlikely section is actually used).  I think the dominator
>> > based approach is not going to work too reliably (I can "fix" my testcase to contain multiple nested
>> > conditionals and then the heuristic about predecessors won't help).
>>
>> Yes, this doesn't look good. Did you use the latest version of my
>> patch that doesn't walk the dominators?
>>
>> Do you know how many training runs are done for this benchmark? I
>> think a lot of the issues that you pointed out with the hot loop
>> preceded by non-looping conditional code as in your earlier example,
>> or multiple nested conditionals, come from the fact that the cold
>> cutoff is not 0, but some number less than the number of training
>> runs. Perhaps the cutoff for splitting should be 0. Then the main
>> issue that needs to be corrected is profile insanities, not code that
>> is executed once (since that would not be marked cold).
>
> Hmm, compute_function_frequency uses probably_never_executed_bb_p, which requires
> the count of a basic block to be less than the number of training runs.  In Martin's
> setup that will be 0.
>
> This is the same as what -freorder-blocks-and-partition does?

Right, it simply puts blocks that are probably_never_executed_bb_p
into the cold section. But it sounds like that is not an issue here
since Martin is doing a single training run so the cutoff is
essentially 0.

>>
>> The only other issue that I can think of here is that the training
>> data was not representative and didn't execute these blocks.
>>
>> >
>> > What about simply walking the CFG from entry through all edges with non-0 counts and making all reachable
>> > blocks hot + forcibly making any hot blocks not reachable this way reachable?
>>
>> Is this different than what I currently have + changing the cold
>> cutoff to 0? In that case any blocks reachable through non-0 edges
>> should be non-0 and marked hot, and the current patch forces the most
>> frequent paths to all hot blocks to be hot.
>
> Do we sanity check that the cold partition does not contain any blocks of
> count 0?  It may be that the profile is broken enough to make partitioning
> not work.

Do you mean sanity check that the cold partition does not contain any
blocks of count > 0? (they should all be zero) I don't think that
sanity check is there, but I can try adding that.
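
Something like the following is what I have in mind (a sketch only,
not yet part of the patch):

  /* Verify that no block in the cold partition has a non-zero
     execution count.  */
  basic_block bb;
  FOR_EACH_BB (bb)
    if (BB_PARTITION (bb) == BB_COLD_PARTITION && bb->count > 0)
      error ("cold partition block %d has non-zero count", bb->index);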

The issue with such a sanity check may be due to the later fixup I
have in this patch (fixup_partitions). It is invoked after certain
optimizations on the cfg that may make hot blocks previously reached
by both hot and cold edges only reachable by cold blocks. These blocks
are remarked cold. If the profile data hasn't been updated correctly
it is possible that they would still have a non-0 count, although they
are essentially cold after the cfg transformation.

But certainly such a sanity check should always succeed after the
original partitioning.

> I can think of inlining where the count gets scaled all the way down to 0.  Perhaps
> the count scaling code can be modified to never round towards 0 for blocks executing
> non-0 times...

This reminds me of why this situation could happen. When I have been
testing this on the google branch I found situations where COMDAT
routines have 0 profile counts (if I remember correctly, this happens
when the profile-gen binary has a call to the out-of-line copy of a COMDAT in
module A, the linker chooses the out-of-line copy from module B, and therefore
the profile data for the COMDAT in module A is 0). When the COMDAT gets
inlined, the 0 counts on its bbs are scaled to 0, even though the
callsite is non-zero. I have a patch that I was planning to send as a
follow-up that handles this case by propagating the callsite bb's
count to the inlined code when it has 0 counts, scaling by the edge
frequencies. I can either include that patch in this one, or send it
for review separately right now. Do you want to give it a try with
this one to see if it addresses the issue?
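
In outline, that follow-up does something like this (a simplified
sketch with made-up names, not the actual patch):

  /* If the inlined callee's profile counts are all zero but the call
     site's count is not, synthesize counts from the call site, scaled
     by the relative block frequencies.  */
  if (callee_entry_count == 0 && call_site_bb->count > 0)
    FOR_EACH_BB_FN (bb, callee_fun)
      bb->count = call_site_bb->count * bb->frequency / BB_FREQ_MAX;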

Also, can you send me reproduction instructions for gimp? I don't
think I need Martin's patch, but which version of gimp and what is the
equivalent way for me to train it? I have some scripts to generate a
similar type of instruction heat map graph that I have been using to
tune partitioning and function reordering. Essentially it uses linux
perf to sample on instructions_retired and then munge the data in
several ways to produce various stats and graphs. One thing that has
been useful has been to combine the perf data with nm output to
determine which cold functions are being executed at runtime.

However, for this to tell me which split cold bbs are being executed I
need to use a patch that Sri sent for review several months back that
gives the split cold section its own name:
  http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
Steven had some follow up comments that Sri hasn't had a chance to address yet:
  http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
(cc'ing Sri as we should probably revive this patch soon to address
gdb and other issues with detecting split functions properly)

Thanks!
Teresa

>
> Honza
>>
>> Thanks,
>> Teresa
>>
>> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>> > that should not be visited by the train run.  We can then see how to make the heuristic more aggressive?
>> >
>> > Honza
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-09 14:38                 ` Teresa Johnson
@ 2013-08-09 15:28                   ` Jan Hubicka
  2013-08-09 15:54                     ` Martin Liška
  2013-08-09 21:02                     ` Teresa Johnson
  0 siblings, 2 replies; 62+ messages in thread
From: Jan Hubicka @ 2013-08-09 15:28 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, marxin.liska, Sriraman Tallam

> > Do we sanity check that the cold partition does not contain any blocks of
> > count 0?  It may be that the profile is broken enough to make partitioning
> > not work.
> 
> Do you mean sanity check that the cold partition does not contain any
> blocks of count > 0? (they should all be zero) I don't think that
> sanity check is there, but I can try adding that.

Thanks, let's start with this - I suppose we need to figure out if
 1) the reachable blocks go to the cold section because partitioning decides
    so even if they have non-0 count.
 2) the reachable blocks go to the cold section because their count was
    incorrectly updated to 0 by someone
 3) profiling gets some blocks wrong.
> 
> The issue with such a sanity check may be due to the later fixup I
> have in this patch (fixup_partitions). It is invoked after certain
> optimizations on the cfg that may make hot blocks previously reached
> by both hot and cold edges only reachable by cold blocks. These blocks
> are remarked cold. If the profile data hasn't been updated correctly
> it is possible that they would still have a non-0 count, although they
> are essentially cold after the cfg transformation.

Well, or the other possibility is that the edges were updated wrongly
and the blocks are really cold.  We need to figure out if that happens
commonly enough.

I will try to think of some artificial testcases.
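
For instance something of this shape (hypothetical, extending my
earlier example with a nested conditional):

  int a[1000];
  int
  f (int t)
  {
    int i, s = 0;
    if (t)              /* cold under the training input */
      {
        if (t > 10)     /* nested cold conditional */
          s = 3;
        s += t;
      }
    for (i = 0; i < 1000000; i++)  /* hot loop */
      s += a[i % 1000];
    return s;
  }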

> 
> But certainly such a sanity check should always succeed after the
> original partitioning.
> 
> > I can think of inlining where the count gets scaled all the way down to 0.  Perhaps
> > the count scaling code can be modified to never round towards 0 for blocks executing
> > non-0 times...
> 
> This reminds me of why this situation could happen. When I have been
> testing this on the google branch I found situations where COMDAT
> routines have 0 profile counts (if I remember correctly, this happens
> when the profile-gen binary has a call to the out-of-line copy of a COMDAT in
> module A, the linker chooses the out-of-line copy from module B, and therefore
> the profile data for the COMDAT in module A is 0). When the COMDAT gets
> inlined, the 0 counts on its bbs are scaled to 0, even though the
> callsite is non-zero. I have a patch that I was planning to send as a
> follow-up that handles this case by propagating the callsite bb's
> count to the inlined code when it has 0 counts, scaling by the edge
> frequencies. I can either include that patch in this one, or send it
> for review separately right now. Do you want to give it a try with
> this one to see if it addresses the issue?

This scenario should not happen with the LTO setup: the LTO symbol table contains
code before early optimization and should be identical with profiling or
without (modulo the new references and calls from profile code).

But this patch seems useful as a backup solution for non-LTO, so yes, please
send it separately and I can try to double check that it really does not happen
with LTO.
(actually the LTO symtab may just choose the COMDAT from the module that has counts with it.
It has all the info for it.  I was thinking about it a few weeks back.  It is a
bit hard to do - you need to verify that all references from the function are
the same or linking might fail if you overwrite the linker's decisions).
> 
> Also, can you send me reproduction instructions for gimp? I don't
> think I need Martin's patch, but which version of gimp and what is the
> equivalent way for me to train it? I have some scripts to generate a
> similar type of instruction heat map graph that I have been using to
> tune partitioning and function reordering. Essentially it uses linux
> perf to sample on instructions_retired and then munge the data in
> several ways to produce various stats and graphs. One thing that has
> been useful has been to combine the perf data with nm output to
> determine which cold functions are being executed at runtime.

Martin?

> 
> However, for this to tell me which split cold bbs are being executed I
> need to use a patch that Sri sent for review several months back that
> gives the split cold section its own name:
>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
> Steven had some follow up comments that Sri hasn't had a chance to address yet:
>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
> (cc'ing Sri as we should probably revive this patch soon to address
> gdb and other issues with detecting split functions properly)

Interesting, I used a linker script for this purpose, but that is GNU ld only...

Honza
> 
> Thanks!
> Teresa
> 
> >
> > Honza
> >>
> >> Thanks,
> >> Teresa
> >>
> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
> >> > that should not be visited by the train run.  We can then see how to make the heuristic more aggressive?
> >> >
> >> > Honza
> >>
> >>
> >>
> >> --
> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-09 15:28                   ` Jan Hubicka
@ 2013-08-09 15:54                     ` Martin Liška
  2013-08-09 21:03                       ` Teresa Johnson
  2013-08-09 21:02                     ` Teresa Johnson
  1 sibling, 1 reply; 62+ messages in thread
From: Martin Liška @ 2013-08-09 15:54 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Teresa Johnson, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, Sriraman Tallam

Hi

On 9 August 2013 17:28, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Do we sanity check that the cold partition does not contain any blocks of
>> > count 0?  It may be that the profile is broken enough to make partitioning
>> > not work.
>>
>> Do you mean sanity check that the cold partition does not contain any
>> blocks of count > 0? (they should all be zero) I don't think that
>> sanity check is there, but I can try adding that.
>
> Thanks, let's start with this - I suppose we need to figure out if
>  1) the reachable blocks go to the cold section because partitioning decides
>     so even if they have non-0 count.
>  2) the reachable blocks go to the cold section because their count was
>     incorrectly updated to 0 by someone
>  3) profiling gets some blocks wrong.
>>
>> The issue with such a sanity check may be due to the later fixup I
>> have in this patch (fixup_partitions). It is invoked after certain
>> optimizations on the cfg that may make hot blocks previously reached
>> by both hot and cold edges only reachable by cold blocks. These blocks
>> are remarked cold. If the profile data hasn't been updated correctly
>> it is possible that they would still have a non-0 count, although they
>> are essentially cold after the cfg transformation.
>
> Well, or the other possibility is that the edges were updated wrongly
> and the blocks are really cold.  We need to figure out if that happens
> commonly enough.
>
> I will try to think of some artificial testcases.
>
>>
>> But certainly such a sanity check should always succeed after the
>> original partitioning.
>>
>> > I can think of inlining where the count gets scaled all the way down to 0.  Perhaps
>> > the count scaling code can be modified to never round towards 0 for blocks executing
>> > non-0 times...
>>
>> This reminds me of why this situation could happen. When I have been
>> testing this on the google branch I found situations where COMDAT
>> routines have 0 profile counts (if I remember correctly, this happens
>>> when the profile-gen binary has a call to the out-of-line copy of a COMDAT in
>>> module A, the linker chooses the out-of-line copy from module B, and therefore
>>> the profile data for the COMDAT in module A is 0). When the COMDAT gets
>> inlined, the 0 counts on its bbs are scaled to 0, even though the
>> callsite is non-zero. I have a patch that I was planning to send as a
>> follow-up that handles this case by propagating the callsite bb's
>> count to the inlined code when it has 0 counts, scaling by the edge
>> frequencies. I can either include that patch in this one, or send it
>> for review separately right now. Do you want to give it a try with
>> this one to see if it addresses the issue?
>
> This scenario should not happen with the LTO setup: the LTO symbol table contains
> code before early optimization and should be identical with profiling or
> without (modulo the new references and calls from profile code).
>
> But this patch seems useful as a backup solution for non-LTO, so yes, please
> send it separately and I can try to double check that it really does not happen
> with LTO.
> (actually the LTO symtab may just choose the COMDAT from the module that has counts with it.
> It has all the info for it.  I was thinking about it a few weeks back.  It is a
> bit hard to do - you need to verify that all references from the function are
> the same or linking might fail if you overwrite the linker's decisions).
>>
>> Also, can you send me reproduction instructions for gimp? I don't
>> think I need Martin's patch, but which version of gimp and what is the
>> equivalent way for me to train it? I have some scripts to generate a
>> similar type of instruction heat map graph that I have been using to
>> tune partitioning and function reordering. Essentially it uses linux
>> perf to sample on instructions_retired and then munge the data in
>> several ways to produce various stats and graphs. One thing that has
>> been useful has been to combine the perf data with nm output to
>> determine which cold functions are being executed at runtime.
>
> Martin?

I use gimp from git repository, commit:
88ecd59c3436d302b644a5d25c1938c0e7b60ae0 (from Feb 5 2013)
Link: http://www.gimp.org/source/#gimp_from_git

Martin

>>
>> However, for this to tell me which split cold bbs are being executed I
>> need to use a patch that Sri sent for review several months back that
>> gives the split cold section its own name:
>>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
>> Steven had some follow up comments that Sri hasn't had a chance to address yet:
>>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
>> (cc'ing Sri as we should probably revive this patch soon to address
>> gdb and other issues with detecting split functions properly)
>
> Interesting, I used a linker script for this purpose, but that is GNU ld only...
>
> Honza
>>
>> Thanks!
>> Teresa
>>
>> >
>> > Honza
>> >>
>> >> Thanks,
>> >> Teresa
>> >>
>> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>> >> > that should not be visited by the train run.  We can then see how to make the heuristic more aggressive?
>> >> >
>> >> > Honza
>> >>
>> >>
>> >>
>> >> --
>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-09 15:28                   ` Jan Hubicka
  2013-08-09 15:54                     ` Martin Liška
@ 2013-08-09 21:02                     ` Teresa Johnson
  2013-08-09 22:43                       ` Jan Hubicka
                                         ` (2 more replies)
  1 sibling, 3 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-08-09 21:02 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law,
	marxin.liska, Sriraman Tallam

On Fri, Aug 9, 2013 at 8:28 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Do we sanity check that the cold partition does not contain any blocks of
>> > count 0?  It may be that the profile is broken enough to make partitioning
>> > not work.
>>
>> Do you mean sanity check that the cold partition does not contain any
>> blocks of count > 0? (they should all be zero) I don't think that
>> sanity check is there, but I can try adding that.
>
> Thanks, let's start with this - I suppose we need to figure out if
>  1) the reachable blocks go to the cold section because partitioning decides
>     so even if they have non-0 count.

Right, this should be easy enough to check and should hopefully never happen.

>  2) the reachable blocks go to the cold section because their count was
>     incorrectly updated to 0 by someone

A sanity check should find this too. It can happen now for various
reasons, like the comdat issue I described below, but it will be good
to flag these and fix them.

>  3) profiling gets some blocks wrong.

This is the one that will be tough to fix, if the training run isn't
representative.

>>
>> The issue with such a sanity check may be due to the later fixup I
>> have in this patch (fixup_partitions). It is invoked after certain
>> optimizations on the cfg that may make hot blocks previously reached
>> by both hot and cold edges only reachable by cold blocks. These blocks
>> are remarked cold. If the profile data hasn't been updated correctly
>> it is possible that they would still have a non-0 count, although they
>> are essentially cold after the cfg transformation.
>
> Well, or the other possibility is that the edges were updated wrongly
> and the blocks are really cold.  We need to figure out if that happens
> commonly enough.
>
> I will try to think of some artificial testcases.
>
>>
>> But certainly such a sanity check should always succeed after the
>> original partitioning.
>>
>> > I can think of inlining where the count gets scaled all the way down to 0.  Perhaps
>> > the count scaling code can be modified to never round towards 0 for blocks executing
>> > non-0 times...
>>
>> This reminds me of why this situation could happen. When I have been
>> testing this on the google branch I found situations where COMDAT
>> routines have 0 profile counts (if I remember correctly, this happens
>> when the profile-gen binary has a call to the out-of-line copy of a COMDAT in
>> module A, the linker chooses the out-of-line copy from module B, and therefore
>> the profile data for the COMDAT in module A is 0). When the COMDAT gets
>> inlined, the 0 counts on its bbs are scaled to 0, even though the
>> callsite is non-zero. I have a patch that I was planning to send as a
>> follow-up that handles this case by propagating the callsite bb's
>> count to the inlined code when it has 0 counts, scaling by the edge
>> frequencies. I can either include that patch in this one, or send it
>> for review separately right now. Do you want to give it a try with
>> this one to see if it addresses the issue?
>
> This scenario should not happen with the LTO setup: the LTO symbol table contains
> code before early optimization and should be identical with profiling or
> without (modulo the new references and calls from profile code).
>
> But this patch seems useful as a backup solution for non-LTO, so yes, please
> send it separately and I can try to double check that it really does not happen
> with LTO.
> (actually the LTO symtab may just choose the COMDAT from the module that has counts with it.
> It has all the info for it.  I was thinking about it a few weeks back.  It is a
> bit hard to do - you need to verify that all references from the function are
> the same or linking might fail if you overwrite the linker's decisions).

I see, yes LTO can deal with this better since it has global
information. In non-LTO mode (including LIPO) we have the issue.

I take it gimp is built with LTO and therefore shouldn't be hitting
this comdat issue?

Let me do a couple things:
- port over my comdat inlining fix from the google branch to trunk and
send it for review. If you or Martin could try it to see if it helps
with function splitting to avoid the hits from the cold code, that
would be great.
- I'll add some new sanity checking to try to detect non-zero blocks
in the cold section, or 0 blocks reached by non-zero edges (see the
sketch after this list), and see if I can flush out any problems with
my tests or a profiledbootstrap or gimp.
- I'll try building and profiling gimp myself to see if I can
reproduce the issue with code executing out of the cold section.
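
For the second check in the list above, I am thinking of something
like this sketch (hypothetical, not yet written):

  /* Flag any zero-count block that is reached by a non-zero edge.  */
  basic_block bb;
  FOR_EACH_BB (bb)
    if (bb->count == 0)
      {
        edge e;
        edge_iterator ei;
        FOR_EACH_EDGE (e, ei, bb->preds)
          if (e->count > 0)
            error ("zero-count block %d reached by non-zero edge from %d",
                   bb->index, e->src->index);
      }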

Thanks,
Teresa

>>
>> Also, can you send me reproduction instructions for gimp? I don't
>> think I need Martin's patch, but which version of gimp and what is the
>> equivalent way for me to train it? I have some scripts to generate a
>> similar type of instruction heat map graph that I have been using to
>> tune partitioning and function reordering. Essentially it uses linux
>> perf to sample on instructions_retired and then munge the data in
>> several ways to produce various stats and graphs. One thing that has
>> been useful has been to combine the perf data with nm output to
>> determine which cold functions are being executed at runtime.
>
> Martin?
>
>>
>> However, for this to tell me which split cold bbs are being executed I
>> need to use a patch that Sri sent for review several months back that
>> gives the split cold section its own name:
>>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
>> Steven had some follow up comments that Sri hasn't had a chance to address yet:
>>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
>> (cc'ing Sri as we should probably revive this patch soon to address
>> gdb and other issues with detecting split functions properly)
>
> Interesting, I used a linker script for this purpose, but that is GNU ld only...
>
> Honza
>>
>> Thanks!
>> Teresa
>>
>> >
>> > Honza
>> >>
>> >> Thanks,
>> >> Teresa
>> >>
>> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
>> >> >
>> >> > Honza
>> >>
>> >>
>> >>
>> >> --
>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-09 15:54                     ` Martin Liška
@ 2013-08-09 21:03                       ` Teresa Johnson
  0 siblings, 0 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-08-09 21:03 UTC (permalink / raw)
  To: Martin Liška
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, Sriraman Tallam

On Fri, Aug 9, 2013 at 8:54 AM, Martin Liška <marxin.liska@gmail.com> wrote:
> Hi
>
> On 9 August 2013 17:28, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> > Do we sanity check that the cold partition does not contain any blocks of
>>> > count 0?  It may be that the profile is broken enough to make partitioning
>>> > not work.
>>>
>>> Do you mean sanity check that the cold partition does not contain any
>>> blocks of count > 0? (they should all be zero) I don't think that
>>> sanity check is there, but I can try adding that.
>>
>> Thanks, let's start with this - I suppose we need to figure out if
>>  1) reachable blocks go to the cold section because partitioning decides
>>     so even if they have a non-0 count.
>>  2) reachable blocks go to the cold section because their count was
>>     incorrectly updated to 0 by someone
>>  3) profiling gets some blocks wrong.
>>>
>>> The issue with such a sanity check may be due to the later fixup I
>>> have in this patch (fixup_partitions). It is invoked after certain
>>> optimizations on the cfg that may make hot blocks previously reached
>>> by both hot and cold edges only reachable by cold blocks. These blocks
>>> are remarked cold. If the profile data hasn't been updated correctly
>>> it is possible that they would still have a non-0 count, although they
>>> are essentially cold after the cfg transformation.
>>
>> Well, or the other possibility is that the edges were updated wrongly
>> and the blocks are really cold.  We need to figure out if that happens
>> commonly enough.
>>
>> I will try to think of some artificial testcases.
>>
>>>
>>> But certainly such a sanity check should always succeed after the
>>> original partitioning.
>>>
>>> > I can think of inlining where the count gets scaled all the way down to 0.  Perhaps
>>> > the count scaling code can be modified to never round towards 0 for blocks executing
>>> > non-0 times...
>>>
>>> This reminds me of why this situation could happen. When I have been
>>> testing this on the google branch I found situations where COMDAT
>>> routines have 0 profile counts (if I remember correctly, this happens
>>> when profile-gen binary has call to out-of-line copy of COMDAT in
>>> module A, linker chooses the out-of-line copy from module B, therefore
>>> the profile data for COMDAT in module A is 0). When the COMDAT gets
>>> inlined, the 0 counts on its bbs are scaled to 0, even though the
>>> callsite is non-zero. I have a patch that I was planning to send as a
>>> follow-up that handles this case by propagating the callsite bb's
>>> count to the inlined code when it has 0 counts, scaling by the edge
>>> frequencies. I can either include that patch in this one, or send it
>>> for review separately right now. Do you want to give it a try with
>>> this one to see if it addresses the issue?
>>
>> This scenario should not happen with the LTO setup: the LTO symbol table
>> contains code from before early optimization and should be identical with
>> profiling or without (modulo the new references and calls from the profile
>> code).
>>
>> But this patch seems useful as a backup solution for non-LTO, so yes, please
>> send it separately and I can try to double check that it really does not
>> happen with LTO.
>> (Actually the LTO symtab may just choose the COMDAT from the module that has
>> counts with it.  It has all the info for it.  I was thinking about it a few
>> weeks back.  It is a bit hard to do - you need to verify that all references
>> from the function are the same, or linking might fail if you overwrite the
>> linker's decisions.)
>>>
>>> Also, can you send me reproduction instructions for gimp? I don't
>>> think I need Martin's patch, but which version of gimp and what is the
>>> equivalent way for me to train it? I have some scripts to generate a
>>> similar type of instruction heat map graph that I have been using to
>>> tune partitioning and function reordering. Essentially it uses linux
>>> perf to sample on instructions_retired and then munge the data in
>>> several ways to produce various stats and graphs. One thing that has
>>> been useful has been to combine the perf data with nm output to
>>> determine which cold functions are being executed at runtime.
>>
>> Martin?
>
> I use gimp from the git repository, commit:
> 88ecd59c3436d302b644a5d25c1938c0e7b60ae0 (from Feb 5 2013)
> Link: http://www.gimp.org/source/#gimp_from_git

Thanks. Were you building with LTO? And just -O2, or any other options
I should use?

Teresa

>
> Martin
>
>>>
>>> However, for this to tell me which split cold bbs are being executed I
>>> need to use a patch that Sri sent for review several months back that
>>> gives the split cold section its own name:
>>>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
>>> Steven had some follow up comments that Sri hasn't had a chance to address yet:
>>>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
>>> (cc'ing Sri as we should probably revive this patch soon to address
>>> gdb and other issues with detecting split functions properly)
>>
>> Interesting, I used a linker script for this purpose, but that is GNU ld only...
>>
>> Honza
>>>
>>> Thanks!
>>> Teresa
>>>
>>> >
>>> > Honza
>>> >>
>>> >> Thanks,
>>> >> Teresa
>>> >>
>>> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>>> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
>>> >> >
>>> >> > Honza
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>>>
>>>
>>>
>>> --
>>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-09 21:02                     ` Teresa Johnson
@ 2013-08-09 22:43                       ` Jan Hubicka
  2013-08-11 12:21                       ` Jan Hubicka
  2013-08-17 15:54                       ` Teresa Johnson
  2 siblings, 0 replies; 62+ messages in thread
From: Jan Hubicka @ 2013-08-09 22:43 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, marxin.liska, Sriraman Tallam

> 
> I see, yes LTO can deal with this better since it has global
> information. In non-LTO mode (including LIPO) we have the issue.

Thinking about it, there is still one problem left: I usually suggest that
users train with -fno-lto to avoid excessive linking time with
instrumentation.  This will actually bring differences between the
compile-time and link-time decisions of the linker.
We can't even replace one body by another, since inlining done
at the training stage may distribute the profile across multiple comdat
copies.  I suppose the only way is to extend lto-symtab to actually
merge profiles.  To merge CFG profiles it will need to read in the body.

Martin is working on identical function merging; I suppose once we
implement CFG profile merging for that purpose, we can use it here.
I will try to look into this.
> 
> I take it gimp is built with LTO and therefore shouldn't be hitting
> this comdat issue?

I think gimp is still mostly a C program (I did not double check; my last
contribution to it is from the 90s :)
> 
> Let me do a couple things:
> - port over my comdat inlining fix from the google branch to trunk and
> send it for review. If you or Martin could try it to see if it helps
> with function splitting to avoid the hits from the cold code that
> would be great
> - I'll add some new sanity checking to try to detect non-zero blocks
> in the cold section, or 0 blocks reached by non-zero edges and see if
> I can flush out any problems with my tests or a profiledbootstrap or
> gimp.
> - I'll try building and profiling gimp myself to see if I can
> reproduce the issue with code executing out of the cold section.
> 
> Thanks,
> Teresa
> 
> >>
> >> Also, can you send me reproduction instructions for gimp? I don't
> >> think I need Martin's patch, but which version of gimp and what is the
> >> equivalent way for me to train it? I have some scripts to generate a
> >> similar type of instruction heat map graph that I have been using to
> >> tune partitioning and function reordering. Essentially it uses linux
> >> perf to sample on instructions_retired and then munge the data in
> >> several ways to produce various stats and graphs. One thing that has
> >> been useful has been to combine the perf data with nm output to
> >> determine which cold functions are being executed at runtime.
> >
> > Martin?
> >
> >>
> >> However, for this to tell me which split cold bbs are being executed I
> >> need to use a patch that Sri sent for review several months back that
> >> gives the split cold section its own name:
> >>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
> >> Steven had some follow up comments that Sri hasn't had a chance to address yet:
> >>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
> >> (cc'ing Sri as we should probably revive this patch soon to address
> >> gdb and other issues with detecting split functions properly)
> >
> > Interesting, I used a linker script for this purpose, but that is GNU ld only...
> >
> > Honza
> >>
> >> Thanks!
> >> Teresa
> >>
> >> >
> >> > Honza
> >> >>
> >> >> Thanks,
> >> >> Teresa
> >> >>
> >> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
> >> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
> >> >> >
> >> >> > Honza
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> >>
> >>
> >>
> >> --
> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-09 21:02                     ` Teresa Johnson
  2013-08-09 22:43                       ` Jan Hubicka
@ 2013-08-11 12:21                       ` Jan Hubicka
  2013-08-11 13:25                         ` Teresa Johnson
  2013-08-17 15:54                       ` Teresa Johnson
  2 siblings, 1 reply; 62+ messages in thread
From: Jan Hubicka @ 2013-08-11 12:21 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, marxin.liska, Sriraman Tallam

> 
> I see, yes LTO can deal with this better since it has global
> information. In non-LTO mode (including LIPO) we have the issue.

Either Martin or I will implement merging of the multiple copies at
LTO link time.  This is needed for Martin's code unification patch anyway.

Theoretically the gcov runtime can also have the symbol names and cfg
checksums of comdats in the static data and at exit produce buckets based
on matching names+checksums+counter counts, merge all data in each bucket
into one representative by the existing merging routines, and then memcpy
the result back to all the original copies.  This way all compilation
units will receive the same results.

I am not very keen on making the gcov runtime bigger and more complex than
it needs to be, but having a sane profile for comdats seems quite important.
Perhaps, in the GNU toolchain, ordered subsections can be used to make the
linker produce an ordered list of comdats, so the runtime won't need to do
hashing + lookups.
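
To illustrate the idea, here is a minimal sketch of the exit-time pass,
using made-up stand-ins for the real libgcov data structures (fn_record,
same_comdat and merge_comdat_copies are illustrative names, not the gcov
API; plain summation stands in for the existing per-counter merge
routines, and a real implementation would use buckets/hashing instead of
the quadratic scan):

#include <stddef.h>
#include <string.h>

struct fn_record                /* one instrumented copy of a comdat */
{
  const char *name;             /* symbol name */
  unsigned cfg_checksum;        /* CFG checksum from instrumentation */
  size_t n_counters;
  long long *counters;          /* execution counts for this copy */
};

/* Two records describe the same comdat only if the name, checksum and
   counter counts all match; merging anything else would corrupt the
   profile.  */
static int
same_comdat (const struct fn_record *a, const struct fn_record *b)
{
  return a->n_counters == b->n_counters
         && a->cfg_checksum == b->cfg_checksum
         && strcmp (a->name, b->name) == 0;
}

/* At exit: fold every copy of a comdat into one representative, then
   copy the merged counters back so all compilation units dump the same
   numbers.  */
static void
merge_comdat_copies (struct fn_record *recs, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      /* Skip records already folded into an earlier representative.  */
      int is_rep = 1;
      for (size_t j = 0; j < i; j++)
        if (same_comdat (&recs[i], &recs[j]))
          {
            is_rep = 0;
            break;
          }
      if (!is_rep)
        continue;

      /* Pass 1: accumulate all later copies into the representative.  */
      for (size_t j = i + 1; j < n; j++)
        if (same_comdat (&recs[i], &recs[j]))
          for (size_t k = 0; k < recs[i].n_counters; k++)
            recs[i].counters[k] += recs[j].counters[k];

      /* Pass 2: copy the merged totals back to every copy.  */
      for (size_t j = i + 1; j < n; j++)
        if (same_comdat (&recs[i], &recs[j]))
          memcpy (recs[j].counters, recs[i].counters,
                  recs[i].n_counters * sizeof (long long));
    }
}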

Honza
> 
> I take it gimp is built with LTO and therefore shouldn't be hitting
> this comdat issue?
> 
> Let me do a couple things:
> - port over my comdat inlining fix from the google branch to trunk and
> send it for review. If you or Martin could try it to see if it helps
> with function splitting to avoid the hits from the cold code that
> would be great
> - I'll add some new sanity checking to try to detect non-zero blocks
> in the cold section, or 0 blocks reached by non-zero edges and see if
> I can flush out any problems with my tests or a profiledbootstrap or
> gimp.
> - I'll try building and profiling gimp myself to see if I can
> reproduce the issue with code executing out of the cold section.
> 
> Thanks,
> Teresa
> 
> >>
> >> Also, can you send me reproduction instructions for gimp? I don't
> >> think I need Martin's patch, but which version of gimp and what is the
> >> equivalent way for me to train it? I have some scripts to generate a
> >> similar type of instruction heat map graph that I have been using to
> >> tune partitioning and function reordering. Essentially it uses linux
> >> perf to sample on instructions_retired and then munge the data in
> >> several ways to produce various stats and graphs. One thing that has
> >> been useful has been to combine the perf data with nm output to
> >> determine which cold functions are being executed at runtime.
> >
> > Martin?
> >
> >>
> >> However, for this to tell me which split cold bbs are being executed I
> >> need to use a patch that Sri sent for review several months back that
> >> gives the split cold section its own name:
> >>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
> >> Steven had some follow up comments that Sri hasn't had a chance to address yet:
> >>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
> >> (cc'ing Sri as we should probably revive this patch soon to address
> >> gdb and other issues with detecting split functions properly)
> >
> > Interesting, I used a linker script for this purpose, but that is GNU ld only...
> >
> > Honza
> >>
> >> Thanks!
> >> Teresa
> >>
> >> >
> >> > Honza
> >> >>
> >> >> Thanks,
> >> >> Teresa
> >> >>
> >> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
> >> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
> >> >> >
> >> >> > Honza
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> >>
> >>
> >>
> >> --
> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-11 12:21                       ` Jan Hubicka
@ 2013-08-11 13:25                         ` Teresa Johnson
  2013-08-11 15:55                           ` Martin Liška
  2013-08-11 21:05                           ` Jan Hubicka
  0 siblings, 2 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-08-11 13:25 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law,
	marxin.liska, Sriraman Tallam, Rong Xu

Cc'ing Rong since he is also working on trying to address the comdat
profile issue. Rong, you may need to see an earlier message for more
context:
http://gcc.gnu.org/ml/gcc-patches/2013-08/msg00558.html

Teresa

On Sun, Aug 11, 2013 at 5:21 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> I see, yes LTO can deal with this better since it has global
>> information. In non-LTO mode (including LIPO) we have the issue.
>
> Either Martin or I will implement merging of the multiple copies at
> LTO link time.  This is needed for Martin's code unification patch anyway.
>
> Theoretically the gcov runtime can also have the symbol names and cfg
> checksums of comdats in the static data and at exit produce buckets based
> on matching names+checksums+counter counts, merge all data in each bucket
> into one representative by the existing merging routines, and then memcpy
> the result back to all the original copies.  This way all compilation
> units will receive the same results.
>
> I am not very keen on making the gcov runtime bigger and more complex than
> it needs to be, but having a sane profile for comdats seems quite important.
> Perhaps, in the GNU toolchain, ordered subsections can be used to make the
> linker produce an ordered list of comdats, so the runtime won't need to do
> hashing + lookups.
>
> Honza
>>
>> I take it gimp is built with LTO and therefore shouldn't be hitting
>> this comdat issue?
>>
>> Let me do a couple things:
>> - port over my comdat inlining fix from the google branch to trunk and
>> send it for review. If you or Martin could try it to see if it helps
>> with function splitting to avoid the hits from the cold code that
>> would be great
>> - I'll add some new sanity checking to try to detect non-zero blocks
>> in the cold section, or 0 blocks reached by non-zero edges and see if
>> I can flush out any problems with my tests or a profiledbootstrap or
>> gimp.
>> - I'll try building and profiling gimp myself to see if I can
>> reproduce the issue with code executing out of the cold section.
>>
>> Thanks,
>> Teresa
>>
>> >>
>> >> Also, can you send me reproduction instructions for gimp? I don't
>> >> think I need Martin's patch, but which version of gimp and what is the
>> >> equivalent way for me to train it? I have some scripts to generate a
>> >> similar type of instruction heat map graph that I have been using to
>> >> tune partitioning and function reordering. Essentially it uses linux
>> >> perf to sample on instructions_retired and then munge the data in
>> >> several ways to produce various stats and graphs. One thing that has
>> >> been useful has been to combine the perf data with nm output to
>> >> determine which cold functions are being executed at runtime.
>> >
>> > Martin?
>> >
>> >>
>> >> However, for this to tell me which split cold bbs are being executed I
>> >> need to use a patch that Sri sent for review several months back that
>> >> gives the split cold section its own name:
>> >>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
>> >> Steven had some follow up comments that Sri hasn't had a chance to address yet:
>> >>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
>> >> (cc'ing Sri as we should probably revive this patch soon to address
>> >> gdb and other issues with detecting split functions properly)
>> >
>> > Interesting, I used a linker script for this purpose, but that is GNU ld only...
>> >
>> > Honza
>> >>
>> >> Thanks!
>> >> Teresa
>> >>
>> >> >
>> >> > Honza
>> >> >>
>> >> >> Thanks,
>> >> >> Teresa
>> >> >>
>> >> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>> >> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
>> >> >> >
>> >> >> > Honza
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>> >>
>> >>
>> >>
>> >> --
>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-11 13:25                         ` Teresa Johnson
@ 2013-08-11 15:55                           ` Martin Liška
  2013-08-11 17:55                             ` Jan Hubicka
  2013-08-11 21:05                           ` Jan Hubicka
  1 sibling, 1 reply; 62+ messages in thread
From: Martin Liška @ 2013-08-11 15:55 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, Sriraman Tallam, Rong Xu

[-- Attachment #1: Type: text/plain, Size: 4934 bytes --]

Hello,
   I did a collection of systemtap graphs for GIMP.

All these graphs were created with LTO, profiling and -O2 enabled.

1) gimp-reordered.pdf - functions are reordered according to my newly
created profile that utilizes the LTO infrastructure
2) gimp-no-top-level-reorder.pdf - (GCC rev. 201648) -fno-top-level-reorder
3) gimp-top-level-reorder.pdf - (GCC rev. 201648) -ftop-level-reorder

Honza has an idea of how to minimize the hot text section, and I will send
new graphs for the proposed patch.
Moreover, I will send graphs for Inkscape, which is written in C++.

Have a nice day,
Martin

On 11 August 2013 15:25, Teresa Johnson <tejohnson@google.com> wrote:
> Cc'ing Rong since he is also working on trying to address the comdat
> profile issue. Rong, you may need to see an earlier message for more
> context:
> http://gcc.gnu.org/ml/gcc-patches/2013-08/msg00558.html
>
> Teresa
>
> On Sun, Aug 11, 2013 at 5:21 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>
>>> I see, yes LTO can deal with this better since it has global
>>> information. In non-LTO mode (including LIPO) we have the issue.
>>
>> Either Martin or I will implement merging of the multiple copies at
>> LTO link time.  This is needed for Martin's code unification patch anyway.
>>
>> Theoretically the gcov runtime can also have the symbol names and cfg
>> checksums of comdats in the static data and at exit produce buckets based
>> on matching names+checksums+counter counts, merge all data in each bucket
>> into one representative by the existing merging routines, and then memcpy
>> the result back to all the original copies.  This way all compilation
>> units will receive the same results.
>>
>> I am not very keen on making the gcov runtime bigger and more complex than
>> it needs to be, but having a sane profile for comdats seems quite important.
>> Perhaps, in the GNU toolchain, ordered subsections can be used to make the
>> linker produce an ordered list of comdats, so the runtime won't need to do
>> hashing + lookups.
>>
>> Honza
>>>
>>> I take it gimp is built with LTO and therefore shouldn't be hitting
>>> this comdat issue?
>>>
>>> Let me do a couple things:
>>> - port over my comdat inlining fix from the google branch to trunk and
>>> send it for review. If you or Martin could try it to see if it helps
>>> with function splitting to avoid the hits from the cold code that
>>> would be great
>>> - I'll add some new sanity checking to try to detect non-zero blocks
>>> in the cold section, or 0 blocks reached by non-zero edges and see if
>>> I can flush out any problems with my tests or a profiledbootstrap or
>>> gimp.
>>> - I'll try building and profiling gimp myself to see if I can
>>> reproduce the issue with code executing out of the cold section.
>>>
>>> Thanks,
>>> Teresa
>>>
>>> >>
>>> >> Also, can you send me reproduction instructions for gimp? I don't
>>> >> think I need Martin's patch, but which version of gimp and what is the
>>> >> equivalent way for me to train it? I have some scripts to generate a
>>> >> similar type of instruction heat map graph that I have been using to
>>> >> tune partitioning and function reordering. Essentially it uses linux
>>> >> perf to sample on instructions_retired and then munge the data in
>>> >> several ways to produce various stats and graphs. One thing that has
>>> >> been useful has been to combine the perf data with nm output to
>>> >> determine which cold functions are being executed at runtime.
>>> >
>>> > Martin?
>>> >
>>> >>
>>> >> However, for this to tell me which split cold bbs are being executed I
>>> >> need to use a patch that Sri sent for review several months back that
>>> >> gives the split cold section its own name:
>>> >>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
>>> >> Steven had some follow up comments that Sri hasn't had a chance to address yet:
>>> >>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
>>> >> (cc'ing Sri as we should probably revive this patch soon to address
>>> >> gdb and other issues with detecting split functions properly)
>>> >
>>> > Interesting, I used a linker script for this purpose, but that is GNU ld only...
>>> >
>>> > Honza
>>> >>
>>> >> Thanks!
>>> >> Teresa
>>> >>
>>> >> >
>>> >> > Honza
>>> >> >>
>>> >> >> Thanks,
>>> >> >> Teresa
>>> >> >>
>>> >> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>>> >> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
>>> >> >> >
>>> >> >> > Honza
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>>>
>>>
>>>
>>> --
>>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

[-- Attachment #2: gimp-graphs.tar.bz2 --]
[-- Type: application/x-bzip2, Size: 105015 bytes --]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-11 15:55                           ` Martin Liška
@ 2013-08-11 17:55                             ` Jan Hubicka
  0 siblings, 0 replies; 62+ messages in thread
From: Jan Hubicka @ 2013-08-11 17:55 UTC (permalink / raw)
  To: Martin Liška
  Cc: Teresa Johnson, Jan Hubicka, Bernhard Reutner-Fischer,
	gcc-patches, Steven Bosscher, Jeff Law, Sriraman Tallam, Rong Xu

> Hello,
>    I did a collection of systemtap graphs for GIMP.
> 
> All these graphs were created with LTO, profiling and -O2 enabled.
> 
> 1) gimp-reordered.pdf - functions are reordered according to my newly
> created profile that utilizes the LTO infrastructure
> 2) gimp-no-top-level-reorder.pdf - (GCC rev. 201648) -fno-top-level-reorder
> 3) gimp-top-level-reorder.pdf - (GCC rev. 201648) -ftop-level-reorder

Thanks for the graphs! 
gimp-top-level-reorder seems to be bogus (it shows accesses into dynstr only).

To catch the -fno-reorder-blocks-partition problem, perhaps you can modify
Martin's linker script to make the .text.unlikely section non-executable.
This way it will crash the application every time we jump into it.

Honza
> 
> Honza has an idea of how to minimize the hot text section, and I will send
> new graphs for the proposed patch.
> Moreover, I will send graphs for Inkscape, which is written in C++.
> 
> Have a nice day,
> Martin
> 
> On 11 August 2013 15:25, Teresa Johnson <tejohnson@google.com> wrote:
> > Cc'ing Rong since he is also working on trying to address the comdat
> > profile issue. Rong, you may need to see an earlier message for more
> > context:
> > http://gcc.gnu.org/ml/gcc-patches/2013-08/msg00558.html
> >
> > Teresa
> >
> > On Sun, Aug 11, 2013 at 5:21 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >>>
> >>> I see, yes LTO can deal with this better since it has global
> >>> information. In non-LTO mode (including LIPO) we have the issue.
> >>
> >> Either Martin or I will implement merging of the multiple copies at
> >> LTO link time.  This is needed for Martin's code unification patch anyway.
> >>
> >> Theoretically the gcov runtime can also have the symbol names and cfg
> >> checksums of comdats in the static data and at exit produce buckets based
> >> on matching names+checksums+counter counts, merge all data in each bucket
> >> into one representative by the existing merging routines, and then memcpy
> >> the result back to all the original copies.  This way all compilation
> >> units will receive the same results.
> >>
> >> I am not very keen on making the gcov runtime bigger and more complex than
> >> it needs to be, but having a sane profile for comdats seems quite important.
> >> Perhaps, in the GNU toolchain, ordered subsections can be used to make the
> >> linker produce an ordered list of comdats, so the runtime won't need to do
> >> hashing + lookups.
> >>
> >> Honza
> >>>
> >>> I take it gimp is built with LTO and therefore shouldn't be hitting
> >>> this comdat issue?
> >>>
> >>> Let me do a couple things:
> >>> - port over my comdat inlining fix from the google branch to trunk and
> >>> send it for review. If you or Martin could try it to see if it helps
> >>> with function splitting to avoid the hits from the cold code that
> >>> would be great
> >>> - I'll add some new sanity checking to try to detect non-zero blocks
> >>> in the cold section, or 0 blocks reached by non-zero edges and see if
> >>> I can flush out any problems with my tests or a profiledbootstrap or
> >>> gimp.
> >>> - I'll try building and profiling gimp myself to see if I can
> >>> reproduce the issue with code executing out of the cold section.
> >>>
> >>> Thanks,
> >>> Teresa
> >>>
> >>> >>
> >>> >> Also, can you send me reproduction instructions for gimp? I don't
> >>> >> think I need Martin's patch, but which version of gimp and what is the
> >>> >> equivalent way for me to train it? I have some scripts to generate a
> >>> >> similar type of instruction heat map graph that I have been using to
> >>> >> tune partitioning and function reordering. Essentially it uses linux
> >>> >> perf to sample on instructions_retired and then munge the data in
> >>> >> several ways to produce various stats and graphs. One thing that has
> >>> >> been useful has been to combine the perf data with nm output to
> >>> >> determine which cold functions are being executed at runtime.
> >>> >
> >>> > Martin?
> >>> >
> >>> >>
> >>> >> However, for this to tell me which split cold bbs are being executed I
> >>> >> need to use a patch that Sri sent for review several months back that
> >>> >> gives the split cold section its own name:
> >>> >>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
> >>> >> Steven had some follow up comments that Sri hasn't had a chance to address yet:
> >>> >>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
> >>> >> (cc'ing Sri as we should probably revive this patch soon to address
> >>> >> gdb and other issues with detecting split functions properly)
> >>> >
> >>> > Interesting, I used a linker script for this purpose, but that is GNU ld only...
> >>> >
> >>> > Honza
> >>> >>
> >>> >> Thanks!
> >>> >> Teresa
> >>> >>
> >>> >> >
> >>> >> > Honza
> >>> >> >>
> >>> >> >> Thanks,
> >>> >> >> Teresa
> >>> >> >>
> >>> >> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
> >>> >> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
> >>> >> >> >
> >>> >> >> > Honza
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> --
> >>> >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> >>> >>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> >>>
> >>>
> >>>
> >>> --
> >>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> >
> >
> >
> > --
> > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-11 13:25                         ` Teresa Johnson
  2013-08-11 15:55                           ` Martin Liška
@ 2013-08-11 21:05                           ` Jan Hubicka
  1 sibling, 0 replies; 62+ messages in thread
From: Jan Hubicka @ 2013-08-11 21:05 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, marxin.liska, Sriraman Tallam,
	Rong Xu

Hi,
thinking about it a bit more, I suppose the easiest way is to:
1) make separate sets of counters for each comdat and place them
   into a comdat section named as DECL_COMDAT_GROUP (node) + cfg_checksum + individual_counter_counts.
   This will make the linker unify the sections for us.
2) extend the API of libgcov initialization so multiple counters can be recorded per file.
3) at merging time, gcov needs to merge all comdat section counters into temporary memory, so
   multiple merging won't produce bad results.
4) counter streaming will need to be updated to deal with separate comdat sections...
5) probably we will want to update histogram production to avoid counting the same comdat many
   times (this can be done by adding a "processed" flag into the per-function sections).

I don't see any obvious problems with this plan, just that it is quite some
work.  If you had a chance to implement something along these lines, I think
it would help ;))
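
For step 1, the section naming could look roughly like this (a sketch
only: the ".gcov_ctrs" prefix and the helper name are made up for
illustration, not the real gcov implementation):

#include <stdio.h>

/* Build the name of the per-comdat counter section.  E.g. the group
   "_ZN3FooC2Ev" with cfg checksum 0xdeadbeef and 42 counters yields
   ".gcov_ctrs._ZN3FooC2Ev.deadbeef.42".  Every module emitting this
   comdat produces the same name, so the linker's usual comdat rules
   keep a single copy of the counters.  */
static int
comdat_counter_section_name (char *buf, size_t len,
                             const char *comdat_group,
                             unsigned cfg_checksum,
                             unsigned n_counters)
{
  return snprintf (buf, len, ".gcov_ctrs.%s.%08x.%u",
                   comdat_group, cfg_checksum, n_counters);
}

Encoding the checksum and counter count in the name means copies that
differ structurally stay in separate sections instead of being unified
wrongly.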

Honza
> Cc'ing Rong since he is also working on trying to address the comdat
> profile issue. Rong, you may need to see an earlier message for more
> context:
> http://gcc.gnu.org/ml/gcc-patches/2013-08/msg00558.html
> 
> Teresa
> 
> On Sun, Aug 11, 2013 at 5:21 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >>
> >> I see, yes LTO can deal with this better since it has global
> >> information. In non-LTO mode (including LIPO) we have the issue.
> >
> > Either Martin or I will implement merging of the multiple copies at
> > LTO link time.  This is needed for Martin's code unification patch anyway.
> >
> > Theoretically the gcov runtime can also have the symbol names and cfg
> > checksums of comdats in the static data and at exit produce buckets based
> > on matching names+checksums+counter counts, merge all data in each bucket
> > into one representative by the existing merging routines, and then memcpy
> > the result back to all the original copies.  This way all compilation
> > units will receive the same results.
> >
> > I am not very keen on making the gcov runtime bigger and more complex than
> > it needs to be, but having a sane profile for comdats seems quite important.
> > Perhaps, in the GNU toolchain, ordered subsections can be used to make the
> > linker produce an ordered list of comdats, so the runtime won't need to do
> > hashing + lookups.
> >
> > Honza
> >>
> >> I take it gimp is built with LTO and therefore shouldn't be hitting
> >> this comdat issue?
> >>
> >> Let me do a couple things:
> >> - port over my comdat inlining fix from the google branch to trunk and
> >> send it for review. If you or Martin could try it to see if it helps
> >> with function splitting to avoid the hits from the cold code that
> >> would be great
> >> - I'll add some new sanity checking to try to detect non-zero blocks
> >> in the cold section, or 0 blocks reached by non-zero edges and see if
> >> I can flush out any problems with my tests or a profiledbootstrap or
> >> gimp.
> >> - I'll try building and profiling gimp myself to see if I can
> >> reproduce the issue with code executing out of the cold section.
> >>
> >> Thanks,
> >> Teresa
> >>
> >> >>
> >> >> Also, can you send me reproduction instructions for gimp? I don't
> >> >> think I need Martin's patch, but which version of gimp and what is the
> >> >> equivalent way for me to train it? I have some scripts to generate a
> >> >> similar type of instruction heat map graph that I have been using to
> >> >> tune partitioning and function reordering. Essentially it uses linux
> >> >> perf to sample on instructions_retired and then munge the data in
> >> >> several ways to produce various stats and graphs. One thing that has
> >> >> been useful has been to combine the perf data with nm output to
> >> >> determine which cold functions are being executed at runtime.
> >> >
> >> > Martin?
> >> >
> >> >>
> >> >> However, for this to tell me which split cold bbs are being executed I
> >> >> need to use a patch that Sri sent for review several months back that
> >> >> gives the split cold section its own name:
> >> >>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
> >> >> Steven had some follow up comments that Sri hasn't had a chance to address yet:
> >> >>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
> >> >> (cc'ing Sri as we should probably revive this patch soon to address
> >> >> gdb and other issues with detecting split functions properly)
> >> >
> >> > Interesting, I used a linker script for this purpose, but that is GNU ld only...
> >> >
> >> > Honza
> >> >>
> >> >> Thanks!
> >> >> Teresa
> >> >>
> >> >> >
> >> >> > Honza
> >> >> >>
> >> >> >> Thanks,
> >> >> >> Teresa
> >> >> >>
> >> >> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
> >> >> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
> >> >> >> >
> >> >> >> > Honza
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> >>
> >>
> >>
> >> --
> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-09 21:02                     ` Teresa Johnson
  2013-08-09 22:43                       ` Jan Hubicka
  2013-08-11 12:21                       ` Jan Hubicka
@ 2013-08-17 15:54                       ` Teresa Johnson
  2013-08-17 21:02                         ` Jan Hubicka
  2 siblings, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-08-17 15:54 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law,
	marxin.liska, Sriraman Tallam

On Fri, Aug 9, 2013 at 2:02 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Fri, Aug 9, 2013 at 8:28 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> > Do we sanity check that the cold partition does not contain any blocks of
>>> > count 0?  It may be that the profile is broken enough to make partitioning
>>> > not work.
>>>
>>> Do you mean sanity check that the cold partition does not contain any
>>> blocks of count > 0? (they should all be zero) I don't think that
>>> sanity check is there, but I can try adding that.
>>
>> Thanks, let's start with this - I suppose we need to figure out if
>>  1) reachable blocks go to the cold section because partitioning decides
>>     so even if they have a non-0 count.
>
> Right, this should be easy enough to check and should hopefully never happen.
>
>>  2) reachable blocks go to the cold section because their count was
>>     incorrectly updated to 0 by someone
>
> A sanity check should find this too. But it can happen now for various
> reasons like the comdat issue I described below. But it will be good
> to flag these and fix them.
>
>>  3) profiling gets some blocks wrong.
>
> This is the one that will be tough to fix, if the training run isn't
> representative.
>
>>>
>>> The issue with such a sanity check may be due to the later fixup I
>>> have in this patch (fixup_partitions). It is invoked after certain
>>> optimizations on the cfg that may make hot blocks previously reached
>>> by both hot and cold edges only reachable by cold blocks. These blocks
>>> are remarked cold. If the profile data hasn't been updated correctly
>>> it is possible that they would still have a non-0 count, although they
>>> are essentially cold after the cfg transformation.
>>
>> Well, or the other possibility is that the edges were updated wrongly
>> and the blocks are really cold.  We need to figure out if that happens
>> commonly enough.
>>
>> I will try to think of some artificial testcases.
>>
>>>
>>> But certainly such a sanity check should always succeed after the
>>> original partitioning.
>>>
>>> > I can think of inlining where the count gets scaled all the way down to 0.  Perhaps
>>> > the count scaling code can be modified to never round towards 0 for blocks executing
>>> > non-0 times...
>>>
>>> This reminds me of why this situation could happen. When I have been
>>> testing this on the google branch I found situations where COMDAT
>>> routines have 0 profile counts (if I remember correctly, this happens
>>> when profile-gen binary has call to out-of-line copy of COMDAT in
>>> module A, linker chooses the out-of-line copy from module B, therefore
>>> the profile data for COMDAT in module A is 0). When the COMDAT gets
>>> inlined, the 0 counts on its bbs are scaled to 0, even though the
>>> callsite is non-zero. I have a patch that I was planning to send as a
>>> follow-up that handles this case by propagating the callsite bb's
>>> count to the inlined code when it has 0 counts, scaling by the edge
>>> frequencies. I can either include that patch in this one, or send it
>>> for review separately right now. Do you want to give it a try with
>>> this one to see if it addresses the issue?
>>
>> This scenario should not happen with the LTO setup: the LTO symbol table
>> contains code from before early optimization and should be identical with
>> profiling or without (modulo the new references and calls from the profile
>> code).
>>
>> But this patch seems useful as a backup solution for non-LTO, so yes, please
>> send it separately and I can try to double check that it really does not
>> happen with LTO.
>> (Actually the LTO symtab may just choose the COMDAT from the module that has
>> counts with it.  It has all the info for it.  I was thinking about it a few
>> weeks back.  It is a bit hard to do - you need to verify that all references
>> from the function are the same, or linking might fail if you overwrite the
>> linker's decisions.)
>
> I see, yes LTO can deal with this better since it has global
> information. In non-LTO mode (including LIPO) we have the issue.
>
> I take it gimp is built with LTO and therefore shouldn't be hitting
> this comdat issue?
>
> Let me do a couple things:

Here is some status:

> - port over my comdat inlining fix from the google branch to trunk and
> send it for review. If you or Martin could try it to see if it helps
> with function splitting to avoid the hits from the cold code that
> would be great

I have included the cleaned-up patch below. I will send it to trunk
for review after a little more extensive testing that I want to do
following the cleanup, but it is included here in case you want to try
it out before then (it passes bootstrap and some small test cases I
checked manually).

> - I'll add some new sanity checking to try to detect non-zero blocks
> in the cold section, or 0 blocks reached by non-zero edges and see if
> I can flush out any problems with my tests or a profiledbootstrap or
> gimp.

I added both of these and ran into issues due to profile maintenance.
For example, there were non-zero blocks in the cold section because
pro_and_epilogue split a simple return block that was previously reached
by both hot and cold paths. The new return block, which was then only
reached via the cold path, did not have its count properly updated to
reflect this, and since with this patch blocks dominated by cold
blocks are remarked cold, we ended up with a non-zero count block in
the cold section. And there were 0 count blocks reached by non-zero
edges because copyprop did not clean up edge weights after removing
some branches and blocks, leaving non-zero weights on edges that had
previously targeted a removed branch and now target a 0 count block
that the removed branch always branched around.
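
For reference, the two checks are roughly of the following shape (a
sketch only: verify_cold_partition_counts, the exact messages, and the
hook point are illustrative, not the code as I have it):

static int
verify_cold_partition_counts (void)
{
  basic_block bb;
  edge e;
  edge_iterator ei;
  int err = 0;

  FOR_EACH_BB (bb)
    {
      /* Check 1: a block placed in the cold partition should have a
         zero profile count.  */
      if (BB_PARTITION (bb) == BB_COLD_PARTITION && bb->count > 0)
        {
          error ("cold basic block %i has non-zero count", bb->index);
          err = 1;
        }

      /* Check 2: a zero count block should not be reached via edges
         that themselves claim a non-zero count.  */
      if (bb->count == 0)
        FOR_EACH_EDGE (e, ei, bb->preds)
          if (e->count > 0)
            {
              error ("zero count block %i reached by non-zero edge from %i",
                     bb->index, e->src->index);
              err = 1;
            }
    }
  return err;
}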

In any case, the good news is that in the cases I looked at, the
splitting code is doing the right thing and the blocks that were
marked cold really were cold. It would be great to fix the profile
maintenance issues, but in the meantime the above sanity checks
are too aggressive.

I think it makes sense to commit the current patch if possible, as it
is making the splitting more sane.

> - I'll try building and profiling gimp myself to see if I can
> reproduce the issue with code executing out of the cold section.

I have spent some time this week trying to get the latest gimp Martin
pointed me to configured and built, but it took a while to track down
and configure/build all of the required versions of dependent
packages. I'm still hitting some issues trying to get it compiled, so
it may not yet be configured properly. I'll take a look again early
next week.

Teresa

patch for updating counts based on estimated frequencies to address
inlined comdats with 0 profile counts:

2013-08-16  Teresa Johnson  <tejohnson@google.com>

        * tree-inline.c (copy_bb): Compute count based on frequency.
        (copy_edges_for_bb): Ditto.
        (copy_cfg_body): Ditto.
        (copy_body): Pass down frequency.
        (expand_call_inline): Ditto.
        (tree_function_versioning): Ditto.
        * predict.c (init_and_estimate_bb_frequencies): New function.
        (rebuild_frequencies): Invoke init_and_estimate_bb_frequencies.
        * predict.h (init_and_estimate_bb_frequencies): Declare.
        * profile.c (branch_prob): Invoke init_and_estimate_bb_frequencies.
        * ipa-inline-transform.c (update_noncloned_frequencies): Scale edge
        counts.
        (clone_inlined_nodes): Compute edge count scale if needed.

Index: tree-inline.c
===================================================================
--- tree-inline.c       (revision 201644)
+++ tree-inline.c       (working copy)
@@ -1502,7 +1502,7 @@ remap_gimple_stmt (gimple stmt, copy_body_data *id

 static basic_block
 copy_bb (copy_body_data *id, basic_block bb, int frequency_scale,
-         gcov_type count_scale)
+         gcov_type count_scale, gcov_type freq_to_count_scale)
 {
   gimple_stmt_iterator gsi, copy_gsi, seq_gsi;
   basic_block copy_basic_block;
@@ -1519,7 +1519,13 @@ copy_bb (copy_body_data *id, basic_block bb, int f
      basic_block_info automatically.  */
   copy_basic_block = create_basic_block (NULL, (void *) 0,
                                          (basic_block) prev->aux);
-  copy_basic_block->count = apply_scale (bb->count, count_scale);
+  copy_basic_block->count
+      = (count_scale
+         ? apply_scale (bb->count, count_scale)
+         /* When the callee bb counts were all zero (e.g. this was a COMDAT
+            that didn't get profile counts) then we compute the new bb counts
+            via the statically-estimated frequencies.  */
+         : RDIV ((gcov_type)bb->frequency * freq_to_count_scale, BB_FREQ_MAX));

   /* We are going to rebuild frequencies from scratch.  These values
      have just small importance to drive canonicalize_loop_headers.  */
@@ -1888,7 +1894,8 @@ update_ssa_across_abnormal_edges (basic_block bb,
    debug stmts are left after a statement that must end the basic block.  */

 static bool
-copy_edges_for_bb (basic_block bb, gcov_type count_scale, basic_block ret_bb,
+copy_edges_for_bb (basic_block bb, gcov_type count_scale,
+                   basic_block ret_bb,
                   bool can_make_abnormal_goto)
 {
   basic_block new_bb = (basic_block) bb->aux;
@@ -1912,7 +1919,14 @@ static bool
            && old_edge->dest->aux != EXIT_BLOCK_PTR)
          flags |= EDGE_FALLTHRU;
        new_edge = make_edge (new_bb, (basic_block) old_edge->dest->aux, flags);
-       new_edge->count = apply_scale (old_edge->count, count_scale);
+        basic_block new_src_bb = (basic_block) old_edge->src->aux;
+       new_edge->count
+            = (count_scale
+               ? apply_scale (old_edge->count, count_scale)
+               // The bb counts have already been scaled with freq_to_count_scale
+               // when that is non-zero, so just scale that new bb count by
+               // the edge probability.
+               : apply_probability (new_src_bb->count, old_edge->probability));
        new_edge->probability = old_edge->probability;
       }

@@ -2282,7 +2296,8 @@ redirect_all_calls (copy_body_data * id, basic_blo
    another function.  Walks FN via CFG, returns new fndecl.  */

 static tree
-copy_cfg_body (copy_body_data * id, gcov_type count, int frequency_scale,
+copy_cfg_body (copy_body_data * id, gcov_type count,
+              int frequency, int frequency_scale,
               basic_block entry_block_map, basic_block exit_block_map,
               bitmap blocks_to_copy, basic_block new_entry)
 {
@@ -2293,15 +2308,20 @@ static tree
   basic_block bb;
   tree new_fndecl = NULL;
   bool need_debug_cleanup = false;
-  gcov_type count_scale;
+  gcov_type count_scale = 0;
+  gcov_type freq_to_count_scale = 0;
   int last;
   int incoming_frequency = 0;
   gcov_type incoming_count = 0;

-  if (ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count)
-    count_scale
-        = GCOV_COMPUTE_SCALE (count,
-                              ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count);
+  basic_block entry_bb = ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun);
+  if (entry_bb->count)
+    count_scale = GCOV_COMPUTE_SCALE (count, entry_bb->count);
+  /* When the callee bb counts were all zero (e.g. this was a COMDAT
+     that didn't get profile counts) then we compute the new bb counts
+     via the statically-estimated frequencies.  */
+  else if (entry_bb->frequency)
+    freq_to_count_scale = RDIV (count * frequency, entry_bb->frequency);
   else
     count_scale = REG_BR_PROB_BASE;

@@ -2323,7 +2343,13 @@ static tree
            incoming_frequency += EDGE_FREQUENCY (e);
            incoming_count += e->count;
          }
-      incoming_count = apply_scale (incoming_count, count_scale);
+      incoming_count
+          = (count_scale
+             ? apply_scale (incoming_count, count_scale)
+             /* When the callee bb counts were all zero (e.g. this was a COMDAT
+                that didn't get profile counts) then we compute the new bb counts
+                via the statically-estimated frequencies.  */
+             : RDIV (incoming_frequency * freq_to_count_scale, BB_FREQ_MAX));
       incoming_frequency
        = apply_scale ((gcov_type)incoming_frequency, frequency_scale);
       ENTRY_BLOCK_PTR->count = incoming_count;
@@ -2350,7 +2376,8 @@ static tree
   FOR_EACH_BB_FN (bb, cfun_to_copy)
     if (!blocks_to_copy || bitmap_bit_p (blocks_to_copy, bb->index))
       {
-       basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale);
+       basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale,
+                                     freq_to_count_scale);
        bb->aux = new_bb;
        new_bb->aux = bb;
        new_bb->loop_father = entry_block_map->loop_father;
@@ -2364,7 +2391,8 @@ static tree
   FOR_ALL_BB_FN (bb, cfun_to_copy)
     if (!blocks_to_copy
         || (bb->index > 0 && bitmap_bit_p (blocks_to_copy, bb->index)))
-      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale, exit_block_map,
+      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale,
+                                              exit_block_map,
                                               can_make_abormal_goto);

   if (new_entry)
@@ -2562,7 +2590,8 @@ copy_tree_body (copy_body_data *id)
    another function.  */

 static tree
-copy_body (copy_body_data *id, gcov_type count, int frequency_scale,
+copy_body (copy_body_data *id, gcov_type count, int frequency,
+          int frequency_scale,
           basic_block entry_block_map, basic_block exit_block_map,
           bitmap blocks_to_copy, basic_block new_entry)
 {
@@ -2571,7 +2600,8 @@ static tree

   /* If this body has a CFG, walk CFG and copy.  */
   gcc_assert (ENTRY_BLOCK_PTR_FOR_FUNCTION (DECL_STRUCT_FUNCTION (fndecl)));
-  body = copy_cfg_body (id, count, frequency_scale, entry_block_map,
exit_block_map,
+  body = copy_cfg_body (id, count, frequency, frequency_scale,
+                       entry_block_map, exit_block_map,
                        blocks_to_copy, new_entry);
   copy_debug_stmts (id);

@@ -4172,7 +4202,7 @@ expand_call_inline (basic_block bb, gimple stmt, c
      function in any way before this point, as this CALL_EXPR may be
      a self-referential call; if we're calling ourselves, we need to
      duplicate our body before altering anything.  */
-  copy_body (id, bb->count,
+  copy_body (id, bb->count, bb->frequency,
             GCOV_COMPUTE_SCALE (cg_edge->frequency, CGRAPH_FREQ_BASE),
             bb, return_block, NULL, NULL);

@@ -5299,8 +5329,9 @@ tree_function_versioning (tree old_decl, tree new_
     }

   /* Copy the Function's body.  */
-  copy_body (&id, old_entry_block->count, REG_BR_PROB_BASE,
-            ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR, blocks_to_copy, new_entry);
+  copy_body (&id, old_entry_block->count, old_entry_block->frequency,
+            REG_BR_PROB_BASE, ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR,
+            blocks_to_copy, new_entry);

   /* Renumber the lexical scoping (non-code) blocks consecutively.  */
   number_blocks (new_decl);
Index: predict.c
===================================================================
--- predict.c   (revision 201644)
+++ predict.c   (working copy)
@@ -2976,6 +2976,24 @@ make_pass_strip_predict_hints (gcc::context *ctxt)
   return new pass_strip_predict_hints (ctxt);
 }

+/* Initialize loop edges and compute estimated bb frequencies when there
+   is no profile data available.  */
+
+void
+init_and_estimate_bb_frequencies (void)
+{
+  if (profile_status == PROFILE_READ && counts_to_freqs ())
+    return;
+
+  loop_optimizer_init (0);
+  add_noreturn_fake_exit_edges ();
+  mark_irreducible_loops ();
+  connect_infinite_loops_to_exit ();
+  estimate_bb_frequencies ();
+  remove_fake_exit_edges ();
+  loop_optimizer_finalize ();
+}
+
 /* Rebuild function frequencies.  Passes are in general expected to
    maintain profile by hand, however in some cases this is not possible:
    for example when inlining several functions with loops freuqencies might run
@@ -2986,15 +3004,7 @@ rebuild_frequencies (void)
 {
   timevar_push (TV_REBUILD_FREQUENCIES);
   if (profile_status == PROFILE_GUESSED)
-    {
-      loop_optimizer_init (0);
-      add_noreturn_fake_exit_edges ();
-      mark_irreducible_loops ();
-      connect_infinite_loops_to_exit ();
-      estimate_bb_frequencies ();
-      remove_fake_exit_edges ();
-      loop_optimizer_finalize ();
-    }
+    init_and_estimate_bb_frequencies ();
   else if (profile_status == PROFILE_READ)
     counts_to_freqs ();
   else
Index: predict.h
===================================================================
--- predict.h   (revision 201644)
+++ predict.h   (working copy)
@@ -38,6 +38,7 @@ enum prediction
 extern void predict_insn_def (rtx, enum br_predictor, enum prediction);
 extern int counts_to_freqs (void);
 extern void estimate_bb_frequencies (void);
+extern void init_and_estimate_bb_frequencies (void);
 extern const char *predictor_name (enum br_predictor);
 extern tree build_predict_expr (enum br_predictor, enum prediction);
 extern void tree_estimate_probability (void);
Index: profile.c
===================================================================
--- profile.c   (revision 201644)
+++ profile.c   (working copy)
@@ -1305,6 +1305,12 @@ branch_prob (void)

   values.release ();
   free_edge_list (el);
+
+  /* Call after setting profile_status to PROFILE_READ, will then
+     invoke counts_to_freqs and if the sum of the counts is zero, will
+     estimate the frequencies.  */
+  init_and_estimate_bb_frequencies ();
+
   coverage_end_function (lineno_checksum, cfg_checksum);
 }
 ^L
Index: ipa-inline-transform.c
===================================================================
--- ipa-inline-transform.c      (revision 201644)
+++ ipa-inline-transform.c      (working copy)
@@ -51,7 +51,7 @@ int nfunctions_inlined;

 static void
 update_noncloned_frequencies (struct cgraph_node *node,
-                             int freq_scale)
+                             gcov_type count_scale, int freq_scale)
 {
   struct cgraph_edge *e;

@@ -60,14 +60,16 @@ update_noncloned_frequencies (struct cgraph_node *
     freq_scale = 1;
   for (e = node->callees; e; e = e->next_callee)
     {
+      e->count = apply_scale (e->count, count_scale);
       e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
       if (e->frequency > CGRAPH_FREQ_MAX)
         e->frequency = CGRAPH_FREQ_MAX;
       if (!e->inline_failed)
-        update_noncloned_frequencies (e->callee, freq_scale);
+        update_noncloned_frequencies (e->callee, count_scale, freq_scale);
     }
   for (e = node->indirect_calls; e; e = e->next_callee)
     {
+      e->count = apply_scale (e->count, count_scale);
       e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
       if (e->frequency > CGRAPH_FREQ_MAX)
         e->frequency = CGRAPH_FREQ_MAX;
@@ -169,7 +171,13 @@ clone_inlined_nodes (struct cgraph_edge *e, bool d
            }
          duplicate = false;
          e->callee->symbol.externally_visible = false;
-          update_noncloned_frequencies (e->callee, e->frequency);
+          // In the case of a COMDAT, the callee's count may be from other
+          // modules, and we need to scale it for the current module's calls
+          // (e.g. e->count may be 0 despite e->callee->count > 0).
+          gcov_type count_scale = REG_BR_PROB_BASE;
+          if (e->callee->count > e->count)
+            count_scale = GCOV_COMPUTE_SCALE (e->count, e->callee->count);
+          update_noncloned_frequencies (e->callee, count_scale, e->frequency);
        }
       else
        {


>
> Thanks,
> Teresa
>
>>>
>>> Also, can you send me reproduction instructions for gimp? I don't
>>> think I need Martin's patch, but which version of gimp and what is the
>>> equivalent way for me to train it? I have some scripts to generate a
>>> similar type of instruction heat map graph that I have been using to
>>> tune partitioning and function reordering. Essentially it uses linux
>>> perf to sample on instructions_retired and then munge the data in
>>> several ways to produce various stats and graphs. One thing that has
>>> been useful has been to combine the perf data with nm output to
>>> determine which cold functions are being executed at runtime.
>>
>> Martin?
>>
>>>
>>> However, for this to tell me which split cold bbs are being executed I
>>> need to use a patch that Sri sent for review several months back that
>>> gives the split cold section its own name:
>>>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
>>> Steven had some follow up comments that Sri hasn't had a chance to address yet:
>>>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
>>> (cc'ing Sri as we should probably revive this patch soon to address
>>> gdb and other issues with detecting split functions properly)
>>
>> Interesting, I used a linker script for this purpose, but that is GNU ld only...
>>
>> Honza
>>>
>>> Thanks!
>>> Teresa
>>>
>>> >
>>> > Honza
>>> >>
>>> >> Thanks,
>>> >> Teresa
>>> >>
>>> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>>> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
>>> >> >
>>> >> > Honza
>>> >>



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-17 15:54                       ` Teresa Johnson
@ 2013-08-17 21:02                         ` Jan Hubicka
  2013-08-19 13:51                           ` Teresa Johnson
  2013-08-19 15:34                           ` Teresa Johnson
  0 siblings, 2 replies; 62+ messages in thread
From: Jan Hubicka @ 2013-08-17 21:02 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, marxin.liska, Sriraman Tallam

> 
> I added both of these and ran into issues due to profile maintenance.
> For example, there were non-zero blocks in the cold section because
> pro_and_epilogue split a simple return block that was previously reached
> by both hot and cold paths. The new return block that was then only
> reached via the cold path did not have its count properly updated to
> reflect this, and since with this patch, blocks dominated by cold
> blocks are remarked cold, we ended up with a non-zero count block in
> the cold section. And there were 0 count blocks reached by non-zero
> edges because copyprop did not clean up edge weights after removing
> some branches and blocks, leading to non-zero edge weights that had
> previously targeted a branch that was removed, now targeting a 0 count
> block that the removed branch always branched around.

I see, can you please send fixes for the problems you identified?
Thanks for working on this!
> 
> In any case, the good news is in that the cases I looked at, the
> splitting code is doing the right thing and these blocks that were
> marked cold really were cold. It would be great to fix the profile
> maintenance issues, but that in the meantime the above sanity checks
> are too aggressive.

We can keep them and output info into dump file - it is what most of
the profile sanity checking does anyway.

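For instance, something along these lines in the verification code (a
sketch only; the condition here stands in for whatever check we
currently fail on):

  if (dump_file && bb->count
      && BB_PARTITION (bb) == BB_COLD_PARTITION)
    fprintf (dump_file,
             "bb %i has non-zero count but is in the cold section\n",
             bb->index);
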
Did you try to use Martin's linker script to make the text.unlikely
section non-executable?  I think that way we will easily find what really
causes us to use it during startup of a trained application (just like
Martin does for gimp).
> 
> I think it makes sense to commit the current patch if possible, as it
> is making the splitting more sane.

My only concern about the patch is that I am not convinced the
dominator-based code has a chance to work reliably enough that we won't
see too many accesses into the cold section.
We can commit it and work on a better solution incrementally, but that will
probably mean replacing it later.  If you think it makes things easier
to work on it incrementally, I think the patch is OK.
> 
> > - I'll try building and profiling gimp myself to see if I can
> > reproduce the issue with code executing out of the cold section.
> 
> I have spent some time this week trying to get the latest gimp Martin
> pointed me to configured and built, but it took awhile to track down
> and configure/build all of the required versions of dependent
> packages. I'm still hitting some issues trying to get it compiled, so
> it may not yet be configured properly. I'll take a look again early
> next week.

I do not think there is anything special about gimp.  You can probably
take any other bigger app, like GCC itself. With a profiledbootstrap
and a linker script locking the unlikely section you should get ICEs where
we jump into the cold section and should not.
> 
> Teresa
> 
> patch for updating counts based on estimated frequencies to address
> inlined comdats with 0 profile counts:
> 
> 2013-08-16  Teresa Johnson  <tejohnson@google.com>
> 
>         * tree-inline.c (copy_bb): Compute count based on frequency.
>         (copy_edges_for_bb): Ditto.
>         (copy_cfg_body): Ditto.
>         (copy_body): Pass down frequency.
>         (expand_call_inline): Ditto.
>         (tree_function_versioning): Ditto.
>         * predict.c (init_and_estimate_bb_frequencies): New function.
>         (rebuild_frequencies): Invoke init_and_estimate_bb_frequencies.
>         * predict.h (init_and_estimate_bb_frequencies): Declare.
>         * profile.c (branch_prob): Invoke init_and_estimate_bb_frequencies.
>         * ipa-inline-transform.c (update_noncloned_frequencies): Scale edge
>         counts.
>         (clone_inlined_nodes): Compute edge count scale if needed.

I do not see why the inliner needs to care about scaling more than it does
right now.  So you have init_and_estimate_bb_frequencies that forces
profile guessing on a given function body. In addition to that I think you
need something like freqs_to_counts that will compute counts based on
freqs with a given scale (actually you can do that as part of propagation,
before frequencies are scaled to the usual 0...FREQ_MAX scale and
precision is lost).

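Something along these lines, untested and just a sketch (the helper name
and the assumption that the guessed entry-block frequency is BB_FREQ_MAX
are mine, not part of the posted patch):

  /* Synthesize bb counts from guessed frequencies so that the entry
     block receives SUM, the combined count of the incoming call edges.  */
  static void
  freqs_to_counts (struct function *fn, gcov_type sum)
  {
    basic_block bb;

    FOR_ALL_BB_FN (bb, fn)
      bb->count = RDIV (sum * (gcov_type) bb->frequency, BB_FREQ_MAX);
  }
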
Because an offline COMDAT function will be produced for every COMDAT used,
I think it is bad to produce any COMDAT (or any function reachable via
calls with non-0 count) that has an empty profile (either because it got
lost by COMDAT merging or because of a reading mismatch).

So I guess you can just check functions with a 0 count and non-0 count
callers and initialize their guessed profile.
Some capping will probably be needed so we do not propagate insanely large
numbers.

Since new direct calls can be discovered later, the inliner may want to do
that again each time it inlines a non-0 count call of a COMDAT with a 0
count...

Honza
> 
> Index: tree-inline.c
> ===================================================================
> --- tree-inline.c       (revision 201644)
> +++ tree-inline.c       (working copy)
> @@ -1502,7 +1502,7 @@ remap_gimple_stmt (gimple stmt, copy_body_data *id
> 
>  static basic_block
>  copy_bb (copy_body_data *id, basic_block bb, int frequency_scale,
> -         gcov_type count_scale)
> +         gcov_type count_scale, gcov_type freq_to_count_scale)
>  {
>    gimple_stmt_iterator gsi, copy_gsi, seq_gsi;
>    basic_block copy_basic_block;
> @@ -1519,7 +1519,13 @@ copy_bb (copy_body_data *id, basic_block bb, int f
>       basic_block_info automatically.  */
>    copy_basic_block = create_basic_block (NULL, (void *) 0,
>                                           (basic_block) prev->aux);
> -  copy_basic_block->count = apply_scale (bb->count, count_scale);
> +  copy_basic_block->count
> +      = (count_scale
> +         ? apply_scale (bb->count, count_scale)
> +         /* When the callee bb counts were all zero (e.g. this was a COMDAT
> +            that didn't get profile counts) then we compute the new bb counts
> +            via the statically-estimated frequencies.  */
> +         : RDIV ((gcov_type)bb->frequency * freq_to_count_scale, BB_FREQ_MAX));
> 
>    /* We are going to rebuild frequencies from scratch.  These values
>       have just small importance to drive canonicalize_loop_headers.  */
> @@ -1888,7 +1894,8 @@ update_ssa_across_abnormal_edges (basic_block bb,
>     debug stmts are left after a statement that must end the basic block.  */
> 
>  static bool
> -copy_edges_for_bb (basic_block bb, gcov_type count_scale, basic_block ret_bb,
> +copy_edges_for_bb (basic_block bb, gcov_type count_scale,
> +                   basic_block ret_bb,
>                    bool can_make_abnormal_goto)
>  {
>    basic_block new_bb = (basic_block) bb->aux;
> @@ -1912,7 +1919,14 @@ static bool
>             && old_edge->dest->aux != EXIT_BLOCK_PTR)
>           flags |= EDGE_FALLTHRU;
>         new_edge = make_edge (new_bb, (basic_block) old_edge->dest->aux, flags);
> -       new_edge->count = apply_scale (old_edge->count, count_scale);
> +        basic_block new_src_bb = (basic_block) old_edge->src->aux;
> +       new_edge->count
> +            = (count_scale
> +               ? apply_scale (old_edge->count, count_scale)
> +               // The bb counts have already been scaled with freq_to_count_scale
> +               // when that is non-zero, so just scale that new bb count by
> +               // the edge probability.
> +               : apply_probability (new_src_bb->count, old_edge->probability));
>         new_edge->probability = old_edge->probability;
>        }
> 
> @@ -2282,7 +2296,8 @@ redirect_all_calls (copy_body_data * id, basic_blo
>     another function.  Walks FN via CFG, returns new fndecl.  */
> 
>  static tree
> -copy_cfg_body (copy_body_data * id, gcov_type count, int frequency_scale,
> +copy_cfg_body (copy_body_data * id, gcov_type count,
> +              int frequency, int frequency_scale,
>                basic_block entry_block_map, basic_block exit_block_map,
>                bitmap blocks_to_copy, basic_block new_entry)
>  {
> @@ -2293,15 +2308,20 @@ static tree
>    basic_block bb;
>    tree new_fndecl = NULL;
>    bool need_debug_cleanup = false;
> -  gcov_type count_scale;
> +  gcov_type count_scale = 0;
> +  gcov_type freq_to_count_scale = 0;
>    int last;
>    int incoming_frequency = 0;
>    gcov_type incoming_count = 0;
> 
> -  if (ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count)
> -    count_scale
> -        = GCOV_COMPUTE_SCALE (count,
> -                              ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count);
> +  basic_block entry_bb = ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun);
> +  if (entry_bb->count)
> +    count_scale = GCOV_COMPUTE_SCALE (count, entry_bb->count);
> +  /* When the callee bb counts were all zero (e.g. this was a COMDAT
> +     that didn't get profile counts) then we compute the new bb counts
> +     via the statically-estimated frequencies.  */
> +  else if (entry_bb->frequency)
> +    freq_to_count_scale = RDIV (count * frequency, entry_bb->frequency);
>    else
>      count_scale = REG_BR_PROB_BASE;
> 
> @@ -2323,7 +2343,13 @@ static tree
>             incoming_frequency += EDGE_FREQUENCY (e);
>             incoming_count += e->count;
>           }
> -      incoming_count = apply_scale (incoming_count, count_scale);
> +      incoming_count
> +          = (count_scale
> +             ? apply_scale (incoming_count, count_scale)
> +             /* When the callee bb counts were all zero (e.g. this was a COMDAT
> +                that didn't get profile counts) then we compute the new bb counts
> +                via the statically-estimated frequencies.  */
> +             : RDIV (incoming_frequency * freq_to_count_scale, BB_FREQ_MAX));
>        incoming_frequency
>         = apply_scale ((gcov_type)incoming_frequency, frequency_scale);
>        ENTRY_BLOCK_PTR->count = incoming_count;
> @@ -2350,7 +2376,8 @@ static tree
>    FOR_EACH_BB_FN (bb, cfun_to_copy)
>      if (!blocks_to_copy || bitmap_bit_p (blocks_to_copy, bb->index))
>        {
> -       basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale);
> +       basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale,
> +                                     freq_to_count_scale);
>         bb->aux = new_bb;
>         new_bb->aux = bb;
>         new_bb->loop_father = entry_block_map->loop_father;
> @@ -2364,7 +2391,8 @@ static tree
>    FOR_ALL_BB_FN (bb, cfun_to_copy)
>      if (!blocks_to_copy
>          || (bb->index > 0 && bitmap_bit_p (blocks_to_copy, bb->index)))
> -      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale, exit_block_map,
> +      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale,
> +                                              exit_block_map,
>                                                can_make_abormal_goto);
> 
>    if (new_entry)
> @@ -2562,7 +2590,8 @@ copy_tree_body (copy_body_data *id)
>     another function.  */
> 
>  static tree
> -copy_body (copy_body_data *id, gcov_type count, int frequency_scale,
> +copy_body (copy_body_data *id, gcov_type count, int frequency,
> +          int frequency_scale,
>            basic_block entry_block_map, basic_block exit_block_map,
>            bitmap blocks_to_copy, basic_block new_entry)
>  {
> @@ -2571,7 +2600,8 @@ static tree
> 
>    /* If this body has a CFG, walk CFG and copy.  */
>    gcc_assert (ENTRY_BLOCK_PTR_FOR_FUNCTION (DECL_STRUCT_FUNCTION (fndecl)));
> -  body = copy_cfg_body (id, count, frequency_scale, entry_block_map, exit_block_map,
> +  body = copy_cfg_body (id, count, frequency, frequency_scale,
> +                       entry_block_map, exit_block_map,
>                         blocks_to_copy, new_entry);
>    copy_debug_stmts (id);
> 
> @@ -4172,7 +4202,7 @@ expand_call_inline (basic_block bb, gimple stmt, c
>       function in any way before this point, as this CALL_EXPR may be
>       a self-referential call; if we're calling ourselves, we need to
>       duplicate our body before altering anything.  */
> -  copy_body (id, bb->count,
> +  copy_body (id, bb->count, bb->frequency,
>              GCOV_COMPUTE_SCALE (cg_edge->frequency, CGRAPH_FREQ_BASE),
>              bb, return_block, NULL, NULL);
> 
> @@ -5299,8 +5329,9 @@ tree_function_versioning (tree old_decl, tree new_
>      }
> 
>    /* Copy the Function's body.  */
> -  copy_body (&id, old_entry_block->count, REG_BR_PROB_BASE,
> -            ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR, blocks_to_copy, new_entry);
> +  copy_body (&id, old_entry_block->count, old_entry_block->frequency,
> +            REG_BR_PROB_BASE, ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR,
> +            blocks_to_copy, new_entry);
> 
>    /* Renumber the lexical scoping (non-code) blocks consecutively.  */
>    number_blocks (new_decl);
^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-17 21:02                         ` Jan Hubicka
@ 2013-08-19 13:51                           ` Teresa Johnson
  2013-08-19 15:16                             ` Jan Hubicka
  2013-08-19 15:34                           ` Teresa Johnson
  1 sibling, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-08-19 13:51 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law,
	marxin.liska, Sriraman Tallam

On Sat, Aug 17, 2013 at 1:44 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> I added both of these and ran into issues due to profile maintenance.
>> For example, there were non-zero blocks in the cold section because
>> pro_and_epilogue split a simple return block that was previously reached
>> by both hot and cold paths. The new return block that was then only
>> reached via the cold path did not have its count properly updated to
>> reflect this, and since with this patch, blocks dominated by cold
>> blocks are remarked cold, we ended up with a non-zero count block in
>> the cold section. And there were 0 count blocks reached by non-zero
>> edges because copyprop did not clean up edge weights after removing
>> some branches and blocks, leading to non-zero edge weights that had
>> previously targeted a branch that was removed, now targeting a 0 count
>> block that the removed branch always branched around.
>
> I see, can you please send fixes for the problems you identified?
> Thanks for working on this!

I don't have fixes at this point - I just identified the phase and
transformation from looking at the dump. But I'll try to fix them soon
while I'm working on performance tuning for splitting. I have a
feeling there are probably a bunch of places where the profile isn't
getting updated properly, unfortunately.

>>
>> In any case, the good news is in that the cases I looked at, the
>> splitting code is doing the right thing and these blocks that were
>> marked cold really were cold. It would be great to fix the profile
>> maintenance issues, but that in the meantime the above sanity checks
>> are too aggressive.
>
> We can keep them and output info into dump file - it is what most of
> the profile sanity checking does anyway.

Ok, I will add that.

>
> Did you try to use Martin's linker script to make the text.unlikely
> section non-executable?  I think that way we will easily find what really
> causes us to use it during startup of a trained application (just like
> Martin does for gimp).

I haven't - where can I get that script?

>>
>> I think it makes sense to commit the current patch if possible, as it
>> is making the splitting more sane.
>
> My only concern about the patch is that I am not convinced the
> dominator-based code has a chance to work reliably enough that we won't
> see too many accesses into the cold section.

Remember it isn't using dominance anymore. The latest patch instead
ensures that the most frequent path between each hot block and the
entry/exit is marked hot. That should be better than the dominance
approach used in the earlier version.

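In rough form (a condensed sketch; the real code works from a worklist,
also handles the path to the exit, and guards against revisiting blocks
in cycles):

  /* Walk back from hot block BB toward the entry along the most
     frequent incoming edge, promoting cold blocks on that path.  */
  static void
  ensure_hot_path_to_entry (basic_block bb)
  {
    while (bb != ENTRY_BLOCK_PTR)
      {
        edge e, best = NULL;
        edge_iterator ei;

        FOR_EACH_EDGE (e, ei, bb->preds)
          if (!best || EDGE_FREQUENCY (e) > EDGE_FREQUENCY (best))
            best = e;
        if (!best)
          break;
        bb = best->src;
        if (bb != ENTRY_BLOCK_PTR
            && BB_PARTITION (bb) == BB_COLD_PARTITION)
          BB_SET_PARTITION (bb, BB_HOT_PARTITION);
      }
  }
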
> We can commit it and work on a better solution incrementally, but that will
> probably mean replacing it later.  If you think it makes things easier
> to work on it incrementally, I think the patch is OK.

Yes, I think this is a big step forward from what is there now for
splitting, which does the splitting purely based on bb count in
isolation. I don't have a much better solution in mind yet.

>>
>> > - I'll try building and profiling gimp myself to see if I can
>> > reproduce the issue with code executing out of the cold section.
>>
>> I have spent some time this week trying to get the latest gimp Martin
>> pointed me to configured and built, but it took awhile to track down
>> and configure/build all of the required versions of dependent
>> packages. I'm still hitting some issues trying to get it compiled, so
>> it may not yet be configured properly. I'll take a look again early
>> next week.
>
> I do not think there is anything special about gimp.  You can probably
> take any other bigger app, like GCC itself. With a profiledbootstrap
> and a linker script locking the unlikely section you should get ICEs where
> we jump into the cold section and should not.

Ok, please point me to the linker script and I will try gcc
profiledbootstrap as well. I wanted to try gimp if possible as I
haven't seen this much jumping to the cold section in some of the
internal apps I tried.

Thanks,
Teresa


-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-19 13:51                           ` Teresa Johnson
@ 2013-08-19 15:16                             ` Jan Hubicka
  2013-08-19 17:48                               ` Teresa Johnson
  0 siblings, 1 reply; 62+ messages in thread
From: Jan Hubicka @ 2013-08-19 15:16 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, marxin.liska, Sriraman Tallam

> Remember it isn't using dominance anymore. The latest patch instead
> ensures that the most frequent path between each hot block and the
> entry/exit is marked hot. That should be better than the dominance
> approach used in the earlier version.

Indeed, that looks like a more reasonable approach.
Can you point me to the last version of the patch? The last one I remember
still walked dominators...
> 
> > We can commit it and work on a better solution incrementally, but that will
> > probably mean replacing it later.  If you think it makes things easier
> > to work on it incrementally, I think the patch is OK.
> 
> Yes, I think this is a big step forward from what is there now for
> splitting, which does the splitting purely based on bb count in
> isolation. I don't have a much better solution in mind yet.
> 
> >>
> >> > - I'll try building and profiling gimp myself to see if I can
> >> > reproduce the issue with code executing out of the cold section.
> >>
> >> I have spent some time this week trying to get the latest gimp Martin
> >> pointed me to configured and built, but it took awhile to track down
> >> and configure/build all of the required versions of dependent
> >> packages. I'm still hitting some issues trying to get it compiled, so
> >> it may not yet be configured properly. I'll take a look again early
> >> next week.
> >
> > I do not think there is anything special about gimp.  You can probably
> > take any other bigger app, like GCC itself. With a profiledbootstrap
> > and a linker script locking the unlikely section you should get ICEs where
> > we jump into the cold section and should not.
> 
> Ok, please point me to the linker script and I will try gcc
> profiledbootstrap as well. I wanted to try gimp if possible as I
> haven't seen this much jumping to the cold section in some of the
> internal apps I tried.

You can also discuss with Martin the systemtap script to plot disk accesses
during startup.  It is very handy for analyzing code layout issues.

It may be interesting to get a similar script taking traces from valgrind
and plotting the most frequent calls in the final layout ;)

Honza

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-17 21:02                         ` Jan Hubicka
  2013-08-19 13:51                           ` Teresa Johnson
@ 2013-08-19 15:34                           ` Teresa Johnson
  2013-08-21 15:31                             ` Jan Hubicka
  1 sibling, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-08-19 15:34 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law,
	marxin.liska, Sriraman Tallam

On Sat, Aug 17, 2013 at 1:44 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> patch for updating counts based on estimated frequencies to address
>> inlined comdats with 0 profile counts:
>>
>> 2013-08-16  Teresa Johnson  <tejohnson@google.com>
>>
>>         * tree-inline.c (copy_bb): Compute count based on frequency.
>>         (copy_edges_for_bb): Ditto.
>>         (copy_cfg_body): Ditto.
>>         (copy_body): Pass down frequency.
>>         (expand_call_inline): Ditto.
>>         (tree_function_versioning): Ditto.
>>         * predict.c (init_and_estimate_bb_frequencies): New function.
>>         (rebuild_frequencies): Invoke init_and_estimate_bb_frequencies.
>>         * predict.h (init_and_estimate_bb_frequencies): Declare.
>>         * profile.c (branch_prob): Invoke init_and_estimate_bb_frequencies.
>>         * ipa-inline-transform.c (update_noncloned_frequencies): Scale edge
>>         counts.
>>         (clone_inlined_nodes): Compute edge count scale if needed.
>
> I do not see why the inliner needs to care about scaling more than it does
> right now.  So you have init_and_estimate_bb_frequencies that forces
> profile guessing on a given function body. In addition to that I think you
> need something like freqs_to_counts that will compute counts based on
> freqs with a given scale (actually you can do that as part of propagation,
> before frequencies are scaled to the usual 0...FREQ_MAX scale and
> precision is lost).
>
> Because an offline COMDAT function will be produced for every COMDAT used,
> I think it is bad to produce any COMDAT (or any function reachable via
> calls with non-0 count) that has an empty profile (either because it got
> lost by COMDAT merging or because of a reading mismatch).

The approach this patch takes is to simply treat those functions the
same as we would if we didn't feed back profile data in the first
place, by using the frequencies. This is sufficient except when one is
inlined, which is why I have the special handling in the inliner
itself.

>
> So I guess you can just check functions with a 0 count and non-0 count
> callers and initialize their guessed profile.
> Some capping will probably be needed so we do not propagate insanely large
> numbers.

But at profile read time we don't have access to the inter-module
calls. Presumably having guessed profiles for these routines should
help the O2 profile-use case as well (i.e. better optimized
out-of-line copy), so I wouldn't want to limit it to IPO or LIPO
compiles where we can identify inter-module call counts at some point
in the compilation.

>
> Since new direct calls can be discovered later, the inliner may want to do
> that again each time it inlines a non-0 count call of a COMDAT with a 0
> count...

How about an approach like this:
- Invoke init_and_estimate_bb_frequencies as I am doing to guess the
profiles at profile read time for functions with 0 counts.
- At inline time, invoke some kind of freqs_to_counts routine for any
0-count routine that is reached by non-zero call edges. It would take
the sum of all incoming call edge counts and synthesize counts for the
bbs using the guessed profile frequencies applied earlier by
init_and_estimate_bb_frequencies. Then the inliner can do its normal
bb count scaling.

Does that seem like a reasonable approach?

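In sketch form, that inline-time step might look like this (the helper
name is a placeholder, and freqs_to_counts is the routine you suggested
earlier in the thread, not existing code):

  /* If the callee of edge E has a 0 count but non-zero callers,
     synthesize its bb counts from the guessed frequencies before the
     inliner does its normal scaling.  */
  static void
  maybe_synthesize_callee_counts (struct cgraph_edge *e)
  {
    struct cgraph_edge *caller;
    gcov_type sum = 0;

    if (e->callee->count)
      return;
    for (caller = e->callee->callers; caller; caller = caller->next_caller)
      sum += caller->count;
    if (sum)
      freqs_to_counts (DECL_STRUCT_FUNCTION (e->callee->symbol.decl), sum);
  }
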
There is one other fix in this patch:
- The clone_inlined_nodes/update_noncloned_frequencies changes below
are handling a different case: a 0-count call edge in this module, with
a non-zero callee node count due to calls from other modules. It will
allow update_noncloned_frequencies to scale down the edge counts in the
callee's cloned call tree. This was a fix I made for the
callgraph-based linker plugin function reordering, and not for splitting
(since that uses both the node and edge weights to make ordering
decisions). Here's a description of the issue from when I was debugging it:

----
In this case, because the callee we are inlining does not have any
other callers and is not external, we call
update_noncloned_frequencies from clone_inlined_nodes instead of
creating a clone. This routine does not attempt to scale the outgoing
edge weight counters on the callee, since the assumption must be that
there are no other callers so all the weight is attributed to the
current edge that we are inlining.

In this case this is clearly not correct, because the caller's count
is 0. I'm assuming that this is happening because the callee we are
inlining is a comdat, so its non-zero weights must have come from a
different module. It seems like update_noncloned_frequencies should go
ahead and scale the counts too.
----

For the above case, I think the right place to fix this is probably
during clone_inlined_nodes/update_noncloned_frequencies, as scaling is
handled by cgraph_clone_node in the case where we need cloning (also
called from clone_inlined_nodes).
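
Concretely, with hypothetical numbers: if e->count is 0 while
e->callee->count is 1000 (because the callee's profile came from other
modules), then count_scale = GCOV_COMPUTE_SCALE (0, 1000) == 0, and
update_noncloned_frequencies scales the callee's edge counts down to 0
for this module instead of leaving the foreign weights in place.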

Teresa

>>
>> @@ -60,14 +60,16 @@ update_noncloned_frequencies (struct cgraph_node *
>>      freq_scale = 1;
>>    for (e = node->callees; e; e = e->next_callee)
>>      {
>> +      e->count = apply_scale (e->count, count_scale);
>>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>>        if (e->frequency > CGRAPH_FREQ_MAX)
>>          e->frequency = CGRAPH_FREQ_MAX;
>>        if (!e->inline_failed)
>> -        update_noncloned_frequencies (e->callee, freq_scale);
>> +        update_noncloned_frequencies (e->callee, count_scale, freq_scale);
>>      }
>>    for (e = node->indirect_calls; e; e = e->next_callee)
>>      {
>> +      e->count = apply_scale (e->count, count_scale);
>>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>>        if (e->frequency > CGRAPH_FREQ_MAX)
>>          e->frequency = CGRAPH_FREQ_MAX;
>> @@ -169,7 +171,13 @@ clone_inlined_nodes (struct cgraph_edge *e, bool d
>>             }
>>           duplicate = false;
>>           e->callee->symbol.externally_visible = false;
>> -          update_noncloned_frequencies (e->callee, e->frequency);
>> +          // In the case of a COMDAT, the callee's count may be from other
>> +          // modules, and we need to scale it for the current module's calls
>> +          // (e.g. e->count may be 0 despite e->callee->count > 0).
>> +          gcov_type count_scale = REG_BR_PROB_BASE;
>> +          if (e->callee->count > e->count)
>> +            count_scale = GCOV_COMPUTE_SCALE (e->count, e->callee->count);
>> +          update_noncloned_frequencies (e->callee, count_scale, e->frequency);
>>         }
>>        else
>>         {
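
As an aside, a minimal standalone sketch of the frequency-to-count step used
in the quoted hunk above (assuming GCC's RDIV and BB_FREQ_MAX definitions;
the wrapper name is invented purely for illustration):

/* Illustration only, not part of the patch: when the callee's profile
   counts were all zero, a count is synthesized from the statically
   estimated frequency FREQ, using the FREQ_TO_COUNT_SCALE the patch
   computes at the callee entry.  The result is proportional to FREQ,
   with BB_FREQ_MAX acting as the fixed-point base.  */
static gcov_type
count_from_estimated_frequency (int freq, gcov_type freq_to_count_scale)
{
  return RDIV ((gcov_type) freq * freq_to_count_scale, BB_FREQ_MAX);
}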



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-19 15:16                             ` Jan Hubicka
@ 2013-08-19 17:48                               ` Teresa Johnson
  2013-08-19 19:56                                 ` Martin Liška
  2013-08-27 18:12                                 ` Teresa Johnson
  0 siblings, 2 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-08-19 17:48 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law,
	marxin.liska, Sriraman Tallam

On Mon, Aug 19, 2013 at 8:09 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Remember it isn't using dominance anymore. The latest patch was
>> instead ensuring the most frequent path between hot blocks and the
>> entry/exit are marked hot. That should be better than the dominance
>> approach used in the earlier version.
>
> Indeed, that looks like a more reasonable approach.
> Can you point me to the last version of patch? Last one I remember still
> walked dominators...

I've included the latest patch below. I still use dominators in the
post-cfg-optimization fixup (fixup_partitions), but not in the
partition sanitizing done during the partitioning itself
(sanitize_hot_paths). The former is looking for hot bbs newly
dominated by cold bbs after cfg transformations.

>>
>> > We can commit it and work on a better solution incrementally but it will
>> > probably mean replacing it later.  If you think it makes things easier
>> > to work on it incrementally, I think the patch is OK.
>>
>> Yes, I think this is a big step forward from what is there now for
>> splitting, which does the splitting purely based on bb count in
>> isolation. I don't have a much better solution in mind yet.
>>
>> >>
>> >> > - I'll try building and profiling gimp myself to see if I can
>> >> > reproduce the issue with code executing out of the cold section.
>> >>
>> >> I have spent some time this week trying to get the latest gimp Martin
>> >> pointed me to configured and built, but it took a while to track down
>> >> and configure/build all of the required versions of dependent
>> >> packages. I'm still hitting some issues trying to get it compiled, so
>> >> it may not yet be configured properly. I'll take a look again early
>> >> next week.
>> >
>> > I do not think there is anything special about gimp.  You can probably
>> > take any other bigger app, like GCC itself. With profiledbootstrap
>> > and a linker script to lock the unlikely section you should get ICEs where
>> > we jump into the cold section and should not.
>>
>> Ok, please point me to the linker script and I will try gcc
>> profiledbootstrap as well. I wanted to try gimp if possible as I
>> haven't seen this much jumping to the cold section in some of the
>> internal apps I tried.
>
>> You can also discuss with Martin the systemtap script to plot disk accesses
>> during startup.  It is very handy for analyzing code layout issues.

Ok. I am using linux perf to collect this info (fed through some
scripts that munge and plot the data).

>
>> It may be interesting to get a similar script taking traces from valgrind
>> and plotting the most frequent calls in the final layout ;)

I think linux perf -g to get a callgraph should give similar data.

Teresa

>
> Honza

2013-08-05  Teresa Johnson  <tejohnson@google.com>
            Steven Bosscher  <steven@gcc.gnu.org>

        * cfgrtl.c (fixup_new_cold_bb): New routine.
        (commit_edge_insertions): Invoke fixup_partitions.
        (find_partition_fixes): New routine.
        (fixup_partitions): Ditto.
        (verify_hot_cold_block_grouping): Update comments.
        (rtl_verify_edges): Invoke find_partition_fixes.
        (rtl_verify_bb_pointers): Update comments.
        (rtl_verify_bb_layout): Ditto.
        * basic-block.h (fixup_partitions): Declare.
        * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
        * bb-reorder.c (sanitize_hot_paths): New function.
        (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
        sanitize_hot_paths.

Index: cfgrtl.c
===================================================================
--- cfgrtl.c    (revision 201461)
+++ cfgrtl.c    (working copy)
@@ -1341,6 +1341,43 @@ fixup_partition_crossing (edge e)
     }
 }

+/* Called when block BB has been reassigned to the cold partition,
+   because it is now dominated by another cold block,
+   to ensure that the region crossing attributes are updated.  */
+
+static void
+fixup_new_cold_bb (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  /* This is called when a hot bb is found to now be dominated
+     by a cold bb and therefore needs to become cold. Therefore,
+     its preds will no longer be region crossing. Any non-dominating
+     preds that were previously hot would also have become cold
+     in the caller for the same region. Any preds that were previously
+     region-crossing will be adjusted in fixup_partition_crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      fixup_partition_crossing (e);
+    }
+
+  /* Possibly need to make bb's successor edges region crossing,
+     or remove stale region crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    {
+      /* We can't have fall-through edges across partition boundaries.
+         Note that force_nonfallthru will do any necessary partition
+         boundary fixup by calling fixup_partition_crossing itself.  */
+      if ((e->flags & EDGE_FALLTHRU)
+          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
+          && e->dest != EXIT_BLOCK_PTR)
+        force_nonfallthru (e);
+      else
+        fixup_partition_crossing (e);
+    }
+}
+
 /* Attempt to change code to redirect edge E to TARGET.  Don't do that on
    expense of adding new instructions or reordering basic blocks.

@@ -1979,6 +2016,14 @@ commit_edge_insertions (void)
 {
   basic_block bb;

+  /* Optimization passes that invoke this routine can cause hot blocks
+     previously reached by both hot and cold blocks to become dominated only
+     by cold blocks. This will cause the verification below to fail,
+     and lead to now cold code in the hot section. In some cases this
+     may only be visible after newly unreachable blocks are deleted,
+     which will be done by fixup_partitions.  */
+  fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
   verify_flow_info ();
 #endif
@@ -2173,6 +2218,101 @@ get_last_bb_insn (basic_block bb)
   return end;
 }

+/* Sanity check partition hotness to ensure that basic blocks in
+   the cold partition don't dominate basic blocks in the hot partition.
+   If FLAG_ONLY is true, report violations as errors. Otherwise
+   re-mark the dominated blocks as cold, since this is run after
+   cfg optimizations that may make hot blocks previously reached
+   by both hot and cold blocks now only reachable along cold paths.  */
+
+static vec<basic_block>
+find_partition_fixes (bool flag_only)
+{
+  basic_block bb;
+  vec<basic_block> bbs_in_cold_partition = vNULL;
+  vec<basic_block> bbs_to_fix = vNULL;
+
+  /* Callers check this.  */
+  gcc_checking_assert (crtl->has_bb_partition);
+
+  FOR_EACH_BB (bb)
+    if ((BB_PARTITION (bb) == BB_COLD_PARTITION))
+      bbs_in_cold_partition.safe_push (bb);
+
+  if (bbs_in_cold_partition.is_empty ())
+    return vNULL;
+
+  bool dom_calculated_here = !dom_info_available_p (CDI_DOMINATORS);
+
+  if (dom_calculated_here)
+    calculate_dominance_info (CDI_DOMINATORS);
+
+  while (! bbs_in_cold_partition.is_empty  ())
+    {
+      bb = bbs_in_cold_partition.pop ();
+      /* Any blocks dominated by a block in the cold section
+         must also be cold.  */
+      basic_block son;
+      for (son = first_dom_son (CDI_DOMINATORS, bb);
+           son;
+           son = next_dom_son (CDI_DOMINATORS, son))
+        {
+          /* If son is not yet cold, then mark it cold here and
+             enqueue it for further processing.  */
+          if ((BB_PARTITION (son) != BB_COLD_PARTITION))
+            {
+              if (flag_only)
+                error ("non-cold basic block %d dominated "
+                       "by a block in the cold partition (%d)",
+                       son->index, bb->index);
+              else
+                BB_SET_PARTITION (son, BB_COLD_PARTITION);
+              bbs_to_fix.safe_push (son);
+              bbs_in_cold_partition.safe_push (son);
+            }
+        }
+    }
+
+  if (dom_calculated_here)
+    free_dominance_info (CDI_DOMINATORS);
+
+  return bbs_to_fix;
+}
+
+/* Perform cleanup on the hot/cold bb partitioning after optimization
+   passes that modify the cfg.  */
+
+void
+fixup_partitions (void)
+{
+  basic_block bb;
+
+  if (!crtl->has_bb_partition)
+    return;
+
+  /* Delete any blocks that became unreachable and weren't
+     already cleaned up, for example during edge forwarding
+     and convert_jumps_to_returns. This will expose more
+     opportunities for fixing the partition boundaries here.
+     Also, the calculation of the dominance graph during verification
+     will assert if there are unreachable nodes.  */
+  delete_unreachable_blocks ();
+
+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.
+     Fixup any that now violate this requirement, as a result of edge
+     forwarding and unreachable block deletion.  */
+  vec<basic_block> bbs_to_fix = find_partition_fixes (false);
+
+  /* Do the partition fixup after all necessary blocks have been converted to
+     cold, so that we only update the region crossings the minimum number of
+     places, which can require forcing edges to be non fallthru.  */
+  while (! bbs_to_fix.is_empty ())
+    {
+      bb = bbs_to_fix.pop ();
+      fixup_new_cold_bb (bb);
+    }
+}
+
 /* Verify, in the basic block chain, that there is at most one switch
    between hot/cold partitions. This condition will not be true until
    after reorder_basic_blocks is called.  */
@@ -2219,7 +2359,8 @@ verify_hot_cold_block_grouping (void)
 /* Perform several checks on the edges out of each block, such as
    the consistency of the branch probabilities, the correctness
    of hot/cold partition crossing edges, and the number of expected
-   successor edges.  */
+   successor edges.  Also verify that the dominance relationship
+   between hot/cold blocks is sane.  */

 static int
 rtl_verify_edges (void)
@@ -2382,6 +2523,14 @@ rtl_verify_edges (void)
        }
     }

+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.  */
+  if (crtl->has_bb_partition && !err)
+    {
+      vec<basic_block> bbs_to_fix = find_partition_fixes (true);
+      err = !bbs_to_fix.is_empty ();
+    }
+
   /* Clean up.  */
   return err;
 }
@@ -2515,7 +2664,7 @@ rtl_verify_bb_pointers (void)
      and NOTE_INSN_BASIC_BLOCK
    - verify that no fall_thru edge crosses hot/cold partition boundaries
    - verify that there are no pending RTL branch predictions
-   - verify that there is a single hot/cold partition boundary after bbro
+   - verify that hot blocks are not dominated by cold blocks

    In future it can be extended check a lot of other stuff as well
    (reachability of basic blocks, life information, etc. etc.).  */
@@ -2761,7 +2910,8 @@ rtl_verify_bb_layout (void)
    - check that all insns are in the basic blocks
      (except the switch handling code, barriers and notes)
    - check that all returns are followed by barriers
-   - check that all fallthru edge points to the adjacent blocks.  */
+   - check that all fallthru edge points to the adjacent blocks
+   - verify that there is a single hot/cold partition boundary after bbro  */

 static int
 rtl_verify_flow_info (void)
Index: basic-block.h
===================================================================
--- basic-block.h       (revision 201461)
+++ basic-block.h       (working copy)
@@ -797,6 +797,7 @@ extern bool contains_no_active_insn_p (const_basic
 extern bool forwarder_block_p (const_basic_block);
 extern bool can_fallthru (basic_block, basic_block);
 extern void emit_barrier_after_bb (basic_block bb);
+extern void fixup_partitions (void);

 /* In cfgbuild.c.  */
 extern void find_many_sub_basic_blocks (sbitmap);
Index: cfgcleanup.c
===================================================================
--- cfgcleanup.c        (revision 201461)
+++ cfgcleanup.c        (working copy)
@@ -2807,10 +2807,21 @@ try_optimize_cfg (int mode)
              df_analyze ();
            }

+         if (changed)
+            {
+              /* Edge forwarding in particular can cause hot blocks previously
+                 reached by both hot and cold blocks to become dominated only
+                 by cold blocks. This will cause the verification below to fail,
+                 and lead to now cold code in the hot section. This is not easy
+                 to detect and fix during edge forwarding, and in some cases
+                 is only visible after newly unreachable blocks are deleted,
+                 which will be done in fixup_partitions.  */
+              fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
-         if (changed)
-           verify_flow_info ();
+              verify_flow_info ();
 #endif
+            }

          changed_overall |= changed;
          first_pass = false;
Index: bb-reorder.c
===================================================================
--- bb-reorder.c        (revision 201461)
+++ bb-reorder.c        (working copy)
@@ -1444,27 +1444,134 @@ fix_up_crossing_landing_pad (eh_landing_pad old_lp
       ei_next (&ei);
 }

+
+/* Ensure that all hot bbs are included in a hot path through the
+   procedure. This is done by calling this function twice, once
+   with WALK_UP true (to look for paths from the entry to hot bbs) and
+   once with WALK_UP false (to look for paths from hot bbs to the exit).
+   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
+   to BBS_IN_HOT_PARTITION.  */
+
+static unsigned int
+sanitize_hot_paths (bool walk_up, unsigned int cold_bb_count,
+                    vec<basic_block> *bbs_in_hot_partition)
+{
+  /* Callers check this.  */
+  gcc_checking_assert (cold_bb_count);
+
+  /* Keep examining hot bbs while we still have some left to check
+     and there are remaining cold bbs.  */
+  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
+  while (! hot_bbs_to_check.is_empty ()
+         && cold_bb_count)
+    {
+      basic_block bb = hot_bbs_to_check.pop ();
+      vec<edge, va_gc> *edges = walk_up ? bb->preds : bb->succs;
+      edge e;
+      edge_iterator ei;
+      int highest_probability = 0;
+      bool found = false;
+
+      /* Walk the preds/succs and check if there is at least one already
+         marked hot. Keep track of the most frequent pred/succ so that we
+         can mark it hot if we don't find one.  */
+      FOR_EACH_EDGE (e, ei, edges)
+        {
+          basic_block reach_bb = walk_up ? e->src : e->dest;
+
+          if (e->flags & EDGE_DFS_BACK)
+            continue;
+
+          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
+          {
+            found = true;
+            break;
+          }
+          if (e->probability > highest_probability)
+            highest_probability = e->probability;
+        }
+
+      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
+         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
+         then the most frequent pred (or succ) needs to be adjusted.  In the
+         case where multiple preds/succs have the same probability (e.g. a
+         50-50 branch), then both will be adjusted.  */
+      if (found)
+        continue;
+
+      FOR_EACH_EDGE (e, ei, edges)
+        {
+          if (e->flags & EDGE_DFS_BACK)
+            continue;
+          if (e->probability < highest_probability)
+            continue;
+
+          basic_block reach_bb = walk_up ? e->src : e->dest;
+
+          /* We have a hot bb with an immediate dominator that is cold.
+             The dominator needs to be re-marked hot.  */
+          BB_SET_PARTITION (reach_bb, BB_HOT_PARTITION);
+          cold_bb_count--;
+
+          /* Now we need to examine newly-hot reach_bb to see if it is also
+             dominated by a cold bb.  */
+          bbs_in_hot_partition->safe_push (reach_bb);
+          hot_bbs_to_check.safe_push (reach_bb);
+        }
+    }
+
+  return cold_bb_count;
+}
+
+
 /* Find the basic blocks that are rarely executed and need to be moved to
    a separate section of the .o file (to cut down on paging and improve
    cache locality).  Return a vector of all edges that cross.  */

-static vec<edge>
+static vec<edge>
 find_rarely_executed_basic_blocks_and_crossing_edges (void)
 {
   vec<edge> crossing_edges = vNULL;
   basic_block bb;
   edge e;
   edge_iterator ei;
+  unsigned int cold_bb_count = 0;
+  vec<basic_block> bbs_in_hot_partition = vNULL;

   /* Mark which partition (hot/cold) each basic block belongs in.  */
   FOR_EACH_BB (bb)
     {
       if (probably_never_executed_bb_p (cfun, bb))
-       BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+          cold_bb_count++;
+        }
       else
-       BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+          bbs_in_hot_partition.safe_push (bb);
+        }
     }

+  /* Ensure that hot bbs are included along a hot path from the entry to exit.
+     Several different possibilities may include cold bbs along all paths
+     to/from a hot bb. One is that there are edge weight insanities
+     due to optimization phases that do not properly update basic block profile
+     counts. The second is that the entry of the function may not be hot, because
+     it is entered fewer times than the number of profile training runs, but there
+     is a loop inside the function that causes blocks within the function to be
+     above the threshold for hotness. This is fixed by walking up from hot bbs
+     to the entry block, and then down from hot bbs to the exit, performing
+     partitioning fixups as necessary.  */
+  if (cold_bb_count)
+    {
+      mark_dfs_back_edges ();
+      cold_bb_count = sanitize_hot_paths (true, cold_bb_count,
+                                          &bbs_in_hot_partition);
+      if (cold_bb_count)
+        sanitize_hot_paths (false, cold_bb_count, &bbs_in_hot_partition);
+    }
+
   /* The format of .gcc_except_table does not allow landing pads to
      be in a different partition as the throw.  Fix this by either
      moving or duplicating the landing pads.  */


-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-19 17:48                               ` Teresa Johnson
@ 2013-08-19 19:56                                 ` Martin Liška
  2013-08-27 18:12                                 ` Teresa Johnson
  1 sibling, 0 replies; 62+ messages in thread
From: Martin Liška @ 2013-08-19 19:56 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, Sriraman Tallam

[-- Attachment #1: Type: text/plain, Size: 19977 bytes --]

Dear Teresa,

On 19 August 2013 19:47, Teresa Johnson <tejohnson@google.com> wrote:
> On Mon, Aug 19, 2013 at 8:09 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> Remember it isn't using dominance anymore. The latest patch was
>>> instead ensuring the most frequent path between hot blocks and the
>>> entry/exit are marked hot. That should be better than the dominance
>>> approach used in the earlier version.
>>
>> Indeed, that looks like a more reasonable approach.
>> Can you point me to the last version of patch? Last one I remember still
>> walked dominators...
>
> I've included the latest patch below. I still use dominators in the
> post-cfg-optimization fixup (fixup_partitions), but not in the
> partition sanitizing done during the partitioning itself
> (sanitize_hot_paths). The former is looking for hot bbs newly
> dominated by cold bbs after cfg transformations.
>
>>>
>>> > We can commit it and work on a better solution incrementally but it will
>>> > probably mean replacing it later.  If you think it makes things easier
>>> > to work on it incrementally, I think the patch is OK.
>>>
>>> Yes, I think this is a big step forward from what is there now for
>>> splitting, which does the splitting purely based on bb count in
>>> isolation. I don't have a much better solution in mind yet.
>>>
>>> >>
>>> >> > - I'll try building and profiling gimp myself to see if I can
>>> >> > reproduce the issue with code executing out of the cold section.
>>> >>
>>> >> I have spent some time this week trying to get the latest gimp Martin
>>> >> pointed me to configured and built, but it took a while to track down
>>> >> and configure/build all of the required versions of dependent
>>> >> packages. I'm still hitting some issues trying to get it compiled, so
>>> >> it may not yet be configured properly. I'll take a look again early
>>> >> next week.
>>> >
>>> > I do not think there is anything special about gimp.  You can probably
>>> > take any other bigger app, like GCC itself. With profiledbootstrap
>>> > and a linker script to lock the unlikely section you should get ICEs where
>>> > we jump into the cold section and should not.
>>>
>>> Ok, please point me to the linker script and I will try gcc
>>> profiledbootstrap as well. I wanted to try gimp if possible as I
>>> haven't seen this much jumping to the cold section in some of the
>>> internal apps I tried.
>>
>> You can also discuss with Martin the systemtap script to plot disk accesses
>> during startup.  It is very handy for analyzing code layout issues.

I am sending you as an attachment a linker script that preserves the
.text.unlikely, .text.exit, .text.startup and .text.hot sections. I
tried to modify the memory access flags for the .text.unlikely section
to read-only, but I am not very familiar with linker scripting. Maybe
you can define a read-only memory area for .text.unlikely with the
MEMORY command.

For my graphing I use the Linux kernel with the read-ahead function
disabled, together with a systemtap script, and the results are
presented with the matplotlib library. If you are interested in this
tool-chain, write to me and I can send you these scripts. I was also
using valgrind to trace functions, so I will try to get the locations
of called functions to identify which functions are called in the
corresponding sections.

Martin

> [rest of quoted message and patch snipped]

[-- Attachment #2: ld.script --]
[-- Type: application/octet-stream, Size: 8644 bytes --]

/* Script for -z combreloc: combine and sort reloc sections */
OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64",
	      "elf64-x86-64")
OUTPUT_ARCH(i386:x86-64)
ENTRY(_start)
SEARCH_DIR("/home/marxin/binutils-bin/x86_64-unknown-linux-gnu/lib64"); SEARCH_DIR("/home/marxin/binutils-bin/lib64"); SEARCH_DIR("/usr/local/lib64"); SEARCH_DIR("/lib64"); SEARCH_DIR("/usr/lib64"); SEARCH_DIR("/home/marxin/binutils-bin/x86_64-unknown-linux-gnu/lib"); SEARCH_DIR("/home/marxin/binutils-bin/lib"); SEARCH_DIR("/usr/local/lib"); SEARCH_DIR("/lib"); SEARCH_DIR("/usr/lib");
SECTIONS
{
  /* Read-only sections, merged into text segment: */
  PROVIDE (__executable_start = SEGMENT_START("text-segment", 0x400000)); . = SEGMENT_START("text-segment", 0x400000) + SIZEOF_HEADERS;
  .interp         : { *(.interp) }
  .note.gnu.build-id : { *(.note.gnu.build-id) }
  .hash           : { *(.hash) }
  .gnu.hash       : { *(.gnu.hash) }
  .dynsym         : { *(.dynsym) }
  .dynstr         : { *(.dynstr) }
  .gnu.version    : { *(.gnu.version) }
  .gnu.version_d  : { *(.gnu.version_d) }
  .gnu.version_r  : { *(.gnu.version_r) }
  .rela.dyn       :
    {
      *(.rela.init)
      *(.rela.text .rela.text.* .rela.gnu.linkonce.t.*)
      *(.rela.fini)
      *(.rela.rodata .rela.rodata.* .rela.gnu.linkonce.r.*)
      *(.rela.data .rela.data.* .rela.gnu.linkonce.d.*)
      *(.rela.tdata .rela.tdata.* .rela.gnu.linkonce.td.*)
      *(.rela.tbss .rela.tbss.* .rela.gnu.linkonce.tb.*)
      *(.rela.ctors)
      *(.rela.dtors)
      *(.rela.got)
      *(.rela.bss .rela.bss.* .rela.gnu.linkonce.b.*)
      *(.rela.ldata .rela.ldata.* .rela.gnu.linkonce.l.*)
      *(.rela.lbss .rela.lbss.* .rela.gnu.linkonce.lb.*)
      *(.rela.lrodata .rela.lrodata.* .rela.gnu.linkonce.lr.*)
      *(.rela.ifunc)
    }
  .rela.plt       :
    {
      *(.rela.plt)
      PROVIDE_HIDDEN (__rela_iplt_start = .);
      *(.rela.iplt)
      PROVIDE_HIDDEN (__rela_iplt_end = .);
    }
  .init           :
  {
    KEEP (*(SORT_NONE(.init)))
  }
  .plt            : { *(.plt) *(.iplt) }
  .text.unlikely  : { *(.text.unlikely .text.*_unlikely .text.unlikely.*) }
  .text.exit      : { *(.text.exit .text.exit.*) }
  .text.startup   : { *(.text.startup .text.startup.*) }
  .text.hot       : { *(.text.hot .text.hot.*) }
  .text           :
  {
    *(.text .stub .text.* .gnu.linkonce.t.*)
    /* .gnu.warning sections are handled specially by elf32.em.  */
    *(.gnu.warning)
  }
  .fini           :
  {
    KEEP (*(SORT_NONE(.fini)))
  }
  PROVIDE (__etext = .);
  PROVIDE (_etext = .);
  PROVIDE (etext = .);
  .rodata         : { *(.rodata .rodata.* .gnu.linkonce.r.*) }
  .rodata1        : { *(.rodata1) }
  .eh_frame_hdr : { *(.eh_frame_hdr) }
  .eh_frame       : ONLY_IF_RO { KEEP (*(.eh_frame)) }
  .gcc_except_table   : ONLY_IF_RO { *(.gcc_except_table
  .gcc_except_table.*) }
  /* These sections are generated by the Sun/Oracle C++ compiler.  */
  .exception_ranges   : ONLY_IF_RO { *(.exception_ranges
  .exception_ranges*) }
  /* Adjust the address for the data segment.  We want to adjust up to
     the same address within the page on the next page up.  */
  . = ALIGN (CONSTANT (MAXPAGESIZE)) - ((CONSTANT (MAXPAGESIZE) - .) & (CONSTANT (MAXPAGESIZE) - 1)); . = DATA_SEGMENT_ALIGN (CONSTANT (MAXPAGESIZE), CONSTANT (COMMONPAGESIZE));
  /* Exception handling  */
  .eh_frame       : ONLY_IF_RW { KEEP (*(.eh_frame)) }
  .gcc_except_table   : ONLY_IF_RW { *(.gcc_except_table .gcc_except_table.*) }
  .exception_ranges   : ONLY_IF_RW { *(.exception_ranges .exception_ranges*) }
  /* Thread Local Storage sections  */
  .tdata	  : { *(.tdata .tdata.* .gnu.linkonce.td.*) }
  .tbss		  : { *(.tbss .tbss.* .gnu.linkonce.tb.*) *(.tcommon) }
  .preinit_array     :
  {
    PROVIDE_HIDDEN (__preinit_array_start = .);
    KEEP (*(.preinit_array))
    PROVIDE_HIDDEN (__preinit_array_end = .);
  }
  .init_array     :
  {
    PROVIDE_HIDDEN (__init_array_start = .);
    KEEP (*(SORT_BY_INIT_PRIORITY(.init_array.*) SORT_BY_INIT_PRIORITY(.ctors.*)))
    KEEP (*(.init_array EXCLUDE_FILE (*crtbegin.o *crtbegin?.o *crtend.o *crtend?.o ) .ctors))
    PROVIDE_HIDDEN (__init_array_end = .);
  }
  .fini_array     :
  {
    PROVIDE_HIDDEN (__fini_array_start = .);
    KEEP (*(SORT_BY_INIT_PRIORITY(.fini_array.*) SORT_BY_INIT_PRIORITY(.dtors.*)))
    KEEP (*(.fini_array EXCLUDE_FILE (*crtbegin.o *crtbegin?.o *crtend.o *crtend?.o ) .dtors))
    PROVIDE_HIDDEN (__fini_array_end = .);
  }
  .ctors          :
  {
    /* gcc uses crtbegin.o to find the start of
       the constructors, so we make sure it is
       first.  Because this is a wildcard, it
       doesn't matter if the user does not
       actually link against crtbegin.o; the
       linker won't look for a file to match a
       wildcard.  The wildcard also means that it
       doesn't matter which directory crtbegin.o
       is in.  */
    KEEP (*crtbegin.o(.ctors))
    KEEP (*crtbegin?.o(.ctors))
    /* We don't want to include the .ctor section from
       the crtend.o file until after the sorted ctors.
       The .ctor section from the crtend file contains the
       end of ctors marker and it must be last */
    KEEP (*(EXCLUDE_FILE (*crtend.o *crtend?.o ) .ctors))
    KEEP (*(SORT(.ctors.*)))
    KEEP (*(.ctors))
  }
  .dtors          :
  {
    KEEP (*crtbegin.o(.dtors))
    KEEP (*crtbegin?.o(.dtors))
    KEEP (*(EXCLUDE_FILE (*crtend.o *crtend?.o ) .dtors))
    KEEP (*(SORT(.dtors.*)))
    KEEP (*(.dtors))
  }
  .jcr            : { KEEP (*(.jcr)) }
  .data.rel.ro : { *(.data.rel.ro.local* .gnu.linkonce.d.rel.ro.local.*) *(.data.rel.ro .data.rel.ro.* .gnu.linkonce.d.rel.ro.*) }
  .dynamic        : { *(.dynamic) }
  .got            : { *(.got) *(.igot) }
  . = DATA_SEGMENT_RELRO_END (SIZEOF (.got.plt) >= 24 ? 24 : 0, .);
  .got.plt        : { *(.got.plt)  *(.igot.plt) }
  .data           :
  {
    *(.data .data.* .gnu.linkonce.d.*)
    SORT(CONSTRUCTORS)
  }
  .data1          : { *(.data1) }
  _edata = .; PROVIDE (edata = .);
  . = .;
  __bss_start = .;
  .bss            :
  {
   *(.dynbss)
   *(.bss .bss.* .gnu.linkonce.b.*)
   *(COMMON)
   /* Align here to ensure that the .bss section occupies space up to
      _end.  Align after .bss to ensure correct alignment even if the
      .bss section disappears because there are no input sections.
      FIXME: Why do we need it? When there is no .bss section, we don't
      pad the .data section.  */
   . = ALIGN(. != 0 ? 64 / 8 : 1);
  }
  .lbss   :
  {
    *(.dynlbss)
    *(.lbss .lbss.* .gnu.linkonce.lb.*)
    *(LARGE_COMMON)
  }
  . = ALIGN(64 / 8);
  . = SEGMENT_START("ldata-segment", .);
  .lrodata   ALIGN(CONSTANT (MAXPAGESIZE)) + (. & (CONSTANT (MAXPAGESIZE) - 1)) :
  {
    *(.lrodata .lrodata.* .gnu.linkonce.lr.*)
  }
  .ldata   ALIGN(CONSTANT (MAXPAGESIZE)) + (. & (CONSTANT (MAXPAGESIZE) - 1)) :
  {
    *(.ldata .ldata.* .gnu.linkonce.l.*)
    . = ALIGN(. != 0 ? 64 / 8 : 1);
  }
  . = ALIGN(64 / 8);
  _end = .; PROVIDE (end = .);
  . = DATA_SEGMENT_END (.);
  /* Stabs debugging sections.  */
  .stab          0 : { *(.stab) }
  .stabstr       0 : { *(.stabstr) }
  .stab.excl     0 : { *(.stab.excl) }
  .stab.exclstr  0 : { *(.stab.exclstr) }
  .stab.index    0 : { *(.stab.index) }
  .stab.indexstr 0 : { *(.stab.indexstr) }
  .comment       0 : { *(.comment) }
  /* DWARF debug sections.
     Symbols in the DWARF debugging sections are relative to the beginning
     of the section so we begin them at 0.  */
  /* DWARF 1 */
  .debug          0 : { *(.debug) }
  .line           0 : { *(.line) }
  /* GNU DWARF 1 extensions */
  .debug_srcinfo  0 : { *(.debug_srcinfo) }
  .debug_sfnames  0 : { *(.debug_sfnames) }
  /* DWARF 1.1 and DWARF 2 */
  .debug_aranges  0 : { *(.debug_aranges) }
  .debug_pubnames 0 : { *(.debug_pubnames) }
  /* DWARF 2 */
  .debug_info     0 : { *(.debug_info .gnu.linkonce.wi.*) }
  .debug_abbrev   0 : { *(.debug_abbrev) }
  .debug_line     0 : { *(.debug_line .debug_line.* .debug_line_end ) }
  .debug_frame    0 : { *(.debug_frame) }
  .debug_str      0 : { *(.debug_str) }
  .debug_loc      0 : { *(.debug_loc) }
  .debug_macinfo  0 : { *(.debug_macinfo) }
  /* SGI/MIPS DWARF 2 extensions */
  .debug_weaknames 0 : { *(.debug_weaknames) }
  .debug_funcnames 0 : { *(.debug_funcnames) }
  .debug_typenames 0 : { *(.debug_typenames) }
  .debug_varnames  0 : { *(.debug_varnames) }
  /* DWARF 3 */
  .debug_pubtypes 0 : { *(.debug_pubtypes) }
  .debug_ranges   0 : { *(.debug_ranges) }
  /* DWARF Extension.  */
  .debug_macro    0 : { *(.debug_macro) }
  .gnu.attributes 0 : { KEEP (*(.gnu.attributes)) }
  /DISCARD/ : { *(.note.GNU-stack) *(.gnu_debuglink) *(.gnu.lto_*) }
}

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-19 15:34                           ` Teresa Johnson
@ 2013-08-21 15:31                             ` Jan Hubicka
  0 siblings, 0 replies; 62+ messages in thread
From: Jan Hubicka @ 2013-08-21 15:31 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, marxin.liska, Sriraman Tallam

> >
> > Because offline COMDAT functoin will be porduced for every COMDAT used, I think
> > it is bad to porduce any COMDAT (or any reachable function via calls with non-0
> > count) that has empty profile (either because it got lost by COMDAT merging
> > or because of reading mismatch).
> 
> The approach this patch takes is to simply treat those functions the
> same as we would if we didn't feed back profile data in the first
> place, by using the frequencies. This is sufficient except when one is
> inlined, which is why I have the special handling in the inliner
> itself.

Yes, my original plan was to have a per-function profile_status that
specifies whether the profile is read, guessed or absent, and to make
function-local decisions sanely with each setting.

Here we read the function and set its profile to READ (with all counts being 0).
We should drop it to GUESSED when we see that there are non-0 count edges
calling the function in question, and probably we should see if it is obviously
hot (i.e. reachable by a hot call) and then promote its function profile to HOT,
to make code placement less bad...
> >
> > Since new direct calls can be discovered later, the inliner may want to do
> > that again each time it inlines a non-0 count call of a COMDAT with 0 count...
> 
> How about an approach like this:
> - Invoke init_and_estimate_bb_frequencies as I am doing to guess the
> profiles at profile read time for functions with 0 counts.

I see, here we are out of sync. 
We always used to go with estimated frequencies for functions with 0 counts,
but it seems that this code broke when prediction was moved before profiling.
(we should also keep the edge probabilities from predict.c in that case)

The estimated profile is already there before reading the profile in, so we
only need to avoid overwriting it.  Does the following work for you?

Index: tree-profile.c
===================================================================
--- tree-profile.c	(revision 201838)
+++ tree-profile.c	(working copy)
@@ -604,6 +604,34 @@
 
       pop_cfun ();
     }
+  /* See if 0 count function has non-0 count callers.  In this case we
+     lost some profile.  Drop its function profile to PROFILE_GUESSED.  */
+  FOR_EACH_DEFINED_FUNCTION (node)
+    {
+      struct cgraph_edge *e;
+      bool called = false;
+      if (node->count)
+	continue;
+      for (e = node->callers; e; e = e->next_caller)
+	{
+	  if (e->count)
+	    called = true;
+	  if (cgraph_maybe_hot_edge_p (e))
+	    break;
+	}
+      if ((e || called)
+	  && profile_status_for_function
+	      (DECL_STRUCT_FUNCTION (node->symbol.decl)) == PROFILE_READ)
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "Dropping 0 profile for %s/%i.%s based on calls.\n",
+		     cgraph_node_name (node), node->symbol.order,
+		     e ? "function is hot" : "function is normal");
+	  profile_status_for_function (DECL_STRUCT_FUNCTION (node->symbol.decl))
+	    = (flag_guess_branch_prob ? PROFILE_GUESSED : PROFILE_ABSENT);
+	  node->frequency = e ? NODE_FREQUENCY_HOT : NODE_FREQUENCY_NORMAL;
+	}
+    }
 
   del_node_map();
   return 0;
Index: predict.c
===================================================================
--- predict.c	(revision 201838)
+++ predict.c	(working copy)
@@ -2715,6 +2715,9 @@
   gcov_type count_max, true_count_max = 0;
   basic_block bb;
 
+  if (!ENTRY_BLOCK_PTR->count)
+    return 0;
+
   FOR_BB_BETWEEN (bb, ENTRY_BLOCK_PTR, NULL, next_bb)
     true_count_max = MAX (bb->count, true_count_max);
 

> - At inline time, invoke some kind of freqs_to_counts routine for any
> 0-count routine that is reached by non-zero call edges. It would take

We should not need that since frequencies ought to be there.

> the sum of all incoming call edge counts and synthesize counts for the
> bbs using the guessed profile frequencies applied earlier by
> init_and_estimate_bb_frequencies. Then the inliner can do its normal
> bb count scaling.

Yes, I guess we should go this way.  Still, we will need to watch for overly
large values; recursive inlining can probably easily produce quite some
nonsense here.

We will also need to solve the problem that in this case cgraph edges will
have a 0 profile.  We probably want to play the same game there and just do
the scaling for the edge counts, since IPA passes probably do not want to
care about partial profiles.
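
Something like the following minimal sketch is the shape I would expect
(untested; freqs_to_counts and its interface are hypothetical names here,
and the overflow and cgraph-edge issues above are ignored):

static void
freqs_to_counts (struct function *fun, gcov_type count)
{
  basic_block bb;
  int entry_freq = ENTRY_BLOCK_PTR_FOR_FUNCTION (fun)->frequency;

  /* Guard against a 0 estimated frequency on the entry block.  */
  if (!entry_freq)
    entry_freq = 1;

  /* Scale each bb count so that the entry block receives COUNT, e.g.
     the sum of the incoming call edge counts.  The bbs' outgoing edge
     counts would need the same scaling.  */
  FOR_ALL_BB_FN (bb, fun)
    bb->count = count * bb->frequency / entry_freq;
}
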
> 
> Does that seem like a reasonable approach?
> 
> There is one other fix in this patch:
> - The clone_inlined_nodes/update_noncloned_frequencies changes below
> are handling a different case: a 0-count call edge in this module, with
> a non-zero callee node count due to calls from other modules. It will
> allow update_noncloned_frequencies to scale down the edge counts in the
> callee's cloned call tree. This was a fix I made for the
> callgraph-based linker plugin function reordering, and not splitting
> (since it is using both the node and edge weights to make ordering
> decisions). Here's a description of the issue when I was debugging it:

Yes, it seems reasonable.  I did not really care about comdats at the time
I was writing this function...
> >> Index: ipa-inline-transform.c
> >> ===================================================================
> >> --- ipa-inline-transform.c      (revision 201644)
> >> +++ ipa-inline-transform.c      (working copy)
> >> @@ -51,7 +51,7 @@ int nfunctions_inlined;
> >>
> >>  static void
> >>  update_noncloned_frequencies (struct cgraph_node *node,
> >> -                             int freq_scale)
> >> +                             gcov_type count_scale, int freq_scale)
> >>  {
> >>    struct cgraph_edge *e;
> >>
> >> @@ -60,14 +60,16 @@ update_noncloned_frequencies (struct cgraph_node *
> >>      freq_scale = 1;
> >>    for (e = node->callees; e; e = e->next_callee)
> >>      {
> >> +      e->count = apply_scale (e->count, count_scale);
> >>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
> >>        if (e->frequency > CGRAPH_FREQ_MAX)
> >>          e->frequency = CGRAPH_FREQ_MAX;
> >>        if (!e->inline_failed)
> >> -        update_noncloned_frequencies (e->callee, freq_scale);
> >> +        update_noncloned_frequencies (e->callee, count_scale, freq_scale);
> >>      }
> >>    for (e = node->indirect_calls; e; e = e->next_callee)
> >>      {
> >> +      e->count = apply_scale (e->count, count_scale);
> >>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
> >>        if (e->frequency > CGRAPH_FREQ_MAX)
> >>          e->frequency = CGRAPH_FREQ_MAX;
> >> @@ -169,7 +171,13 @@ clone_inlined_nodes (struct cgraph_edge *e, bool d
> >>             }
> >>           duplicate = false;
> >>           e->callee->symbol.externally_visible = false;
> >> -          update_noncloned_frequencies (e->callee, e->frequency);
> >> +          // In the case of a COMDAT, the callee's count may be from other
> >> +          // modules, and we need to scale it for the current module's calls
> >> +          // (e.g. e->count may be 0 despite e->callee->count > 0).
> >> +          gcov_type count_scale = REG_BR_PROB_BASE;
> >> +          if (e->callee->count > e->count)
> >> +            count_scale = GCOV_COMPUTE_SCALE (e->count, e->callee->count);
> >> +          update_noncloned_frequencies (e->callee, count_scale, e->frequency);

Please fix the comment to the usual style and go ahead with this change.

Thanks,
Honza
> >>         }
> >>        else
> >>         {
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413


* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-19 17:48                               ` Teresa Johnson
  2013-08-19 19:56                                 ` Martin Liška
@ 2013-08-27 18:12                                 ` Teresa Johnson
  2013-08-28 16:59                                   ` Jan Hubicka
  2013-08-31 16:20                                   ` Jan Hubicka
  1 sibling, 2 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-08-27 18:12 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law,
	marxin.liska, Sriraman Tallam

On Mon, Aug 19, 2013 at 10:47 AM, Teresa Johnson <tejohnson@google.com> wrote:
> On Mon, Aug 19, 2013 at 8:09 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> Remember it isn't using dominance anymore. The latest patch was
>>> instead ensuring the most frequent paths between hot blocks and the
>>> entry/exit are marked hot. That should be better than the dominance
>>> approach used in the earlier version.
>>
>> Indeed, that looks like a more reasonable approach.
>> Can you point me to the last version of the patch? The last one I remember
>> still walked dominators...
>
> I've included the latest patch below. I still use dominators in the
> post-cfg-optimization fixup (fixup_partitions), but not in the
> partition sanitizing done during the partitioning itself
> (sanitize_hot_paths). The former is looking for hot bbs newly
> dominated by cold bbs after cfg transformations.
>
>>>
>>> > We can commit it and work on a better solution incrementally but it will
>>> > probably mean replacing it later.  If you think it makes things easier
>>> > to work on it incrementally, I think the patch is OK.
>>>
>>> Yes, I think this is a big step forward from what is there now for
>>> splitting, which does the splitting purely based on bb count in
>>> isolation. I don't have a much better solution in mind yet.
>>>

Ping on this patch. Honza, did the latest version I sent last week
look ok to you? I've included below a new version that adds the
partitioning insanity warnings we discussed (emitted to the dump only
since, as I noted, there are various optimization passes that provoke
this by not fixing up profile data).

(Honza - I am also going to move the discussion we started in this
thread on the COMDAT missing profile handling patch to a different
thread. Should have an update on that shortly.)

Latest patch below retested with bootstrap and regression testing on
x86-64-unknown-linux-gnu, and with a profiledbootstrap and
-freorder-blocks-and-partition forced on. Ok for trunk?

Thanks,
Teresa

2013-08-27  Teresa Johnson  <tejohnson@google.com>
            Steven Bosscher  <steven@gcc.gnu.org>

        * cfgrtl.c (fixup_new_cold_bb): New routine.
        (commit_edge_insertions): Invoke fixup_partitions.
        (find_partition_fixes): New routine.
        (fixup_partitions): Ditto.
        (verify_hot_cold_block_grouping): Update comments.
        (rtl_verify_edges): Invoke find_partition_fixes.
        (rtl_verify_bb_pointers): Update comments.
        (rtl_verify_bb_layout): Ditto.
        * basic-block.h (probably_never_executed_edge_p): Declare.
        (fixup_partitions): Ditto.
        * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
        * bb-reorder.c (sanitize_hot_paths): New function.
        (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
        sanitize_hot_paths.
        * predict.c (probably_never_executed_edge_p): New routine.
        * cfg.c (check_bb_profile): Add partition insanity warnings.

Index: cfgrtl.c
===================================================================
--- cfgrtl.c    (revision 202021)
+++ cfgrtl.c    (working copy)
@@ -1358,6 +1358,43 @@ fixup_partition_crossing (edge e)
     }
 }

+/* Called when block BB has been reassigned to the cold partition,
+   because it is now dominated by another cold block,
+   to ensure that the region crossing attributes are updated.  */
+
+static void
+fixup_new_cold_bb (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  /* This is called when a hot bb is found to now be dominated
+     by a cold bb and therefore needs to become cold. Therefore,
+     its preds will no longer be region crossing. Any non-dominating
+     preds that were previously hot would also have become cold
+     in the caller for the same region. Any preds that were previously
+     region-crossing will be adjusted in fixup_partition_crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      fixup_partition_crossing (e);
+    }
+
+  /* Possibly need to make bb's successor edges region crossing,
+     or remove stale region crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    {
+      /* We can't have fall-through edges across partition boundaries.
+         Note that force_nonfallthru will do any necessary partition
+         boundary fixup by calling fixup_partition_crossing itself.  */
+      if ((e->flags & EDGE_FALLTHRU)
+          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
+          && e->dest != EXIT_BLOCK_PTR)
+        force_nonfallthru (e);
+      else
+        fixup_partition_crossing (e);
+    }
+}
+
 /* Attempt to change code to redirect edge E to TARGET.  Don't do that on
    expense of adding new instructions or reordering basic blocks.

@@ -1996,6 +2033,14 @@ commit_edge_insertions (void)
 {
   basic_block bb;

+  /* Optimization passes that invoke this routine can cause hot blocks
+     previously reached by both hot and cold blocks to become dominated only
+     by cold blocks. This will cause the verification below to fail,
+     and lead to now cold code in the hot section. In some cases this
+     may only be visible after newly unreachable blocks are deleted,
+     which will be done by fixup_partitions.  */
+  fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
   verify_flow_info ();
 #endif
@@ -2190,6 +2235,101 @@ get_last_bb_insn (basic_block bb)
   return end;
 }

+/* Sanity check partition hotness to ensure that basic blocks in
+   the cold partition don't dominate basic blocks in the hot partition.
+   If FLAG_ONLY is true, report violations as errors. Otherwise
+   re-mark the dominated blocks as cold, since this is run after
+   cfg optimizations that may make hot blocks previously reached
+   by both hot and cold blocks now only reachable along cold paths.  */
+
+static vec<basic_block>
+find_partition_fixes (bool flag_only)
+{
+  basic_block bb;
+  vec<basic_block> bbs_in_cold_partition = vNULL;
+  vec<basic_block> bbs_to_fix = vNULL;
+
+  /* Callers check this.  */
+  gcc_checking_assert (crtl->has_bb_partition);
+
+  FOR_EACH_BB (bb)
+    if ((BB_PARTITION (bb) == BB_COLD_PARTITION))
+      bbs_in_cold_partition.safe_push (bb);
+
+  if (bbs_in_cold_partition.is_empty ())
+    return vNULL;
+
+  bool dom_calculated_here = !dom_info_available_p (CDI_DOMINATORS);
+
+  if (dom_calculated_here)
+    calculate_dominance_info (CDI_DOMINATORS);
+
+  while (! bbs_in_cold_partition.is_empty  ())
+    {
+      bb = bbs_in_cold_partition.pop ();
+      /* Any blocks dominated by a block in the cold section
+         must also be cold.  */
+      basic_block son;
+      for (son = first_dom_son (CDI_DOMINATORS, bb);
+           son;
+           son = next_dom_son (CDI_DOMINATORS, son))
+        {
+          /* If son is not yet cold, then mark it cold here and
+             enqueue it for further processing.  */
+          if ((BB_PARTITION (son) != BB_COLD_PARTITION))
+            {
+              if (flag_only)
+                error ("non-cold basic block %d dominated "
+                       "by a block in the cold partition (%d)",
+                       son->index, bb->index);
+              else
+                BB_SET_PARTITION (son, BB_COLD_PARTITION);
+              bbs_to_fix.safe_push (son);
+              bbs_in_cold_partition.safe_push (son);
+            }
+        }
+    }
+
+  if (dom_calculated_here)
+    free_dominance_info (CDI_DOMINATORS);
+
+  return bbs_to_fix;
+}
+
+/* Perform cleanup on the hot/cold bb partitioning after optimization
+   passes that modify the cfg.  */
+
+void
+fixup_partitions (void)
+{
+  basic_block bb;
+
+  if (!crtl->has_bb_partition)
+    return;
+
+  /* Delete any blocks that became unreachable and weren't
+     already cleaned up, for example during edge forwarding
+     and convert_jumps_to_returns. This will expose more
+     opportunities for fixing the partition boundaries here.
+     Also, the calculation of the dominance graph during verification
+     will assert if there are unreachable nodes.  */
+  delete_unreachable_blocks ();
+
+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.
+     Fixup any that now violate this requirement, as a result of edge
+     forwarding and unreachable block deletion.  */
+  vec<basic_block> bbs_to_fix = find_partition_fixes (false);
+
+  /* Do the partition fixup after all necessary blocks have been converted to
+     cold, so that we only update the region crossings the minimum number of
+     places, which can require forcing edges to be non fallthru.  */
+  while (! bbs_to_fix.is_empty ())
+    {
+      bb = bbs_to_fix.pop ();
+      fixup_new_cold_bb (bb);
+    }
+}
+
 /* Verify, in the basic block chain, that there is at most one switch
    between hot/cold partitions. This condition will not be true until
    after reorder_basic_blocks is called.  */
@@ -2236,7 +2376,8 @@ verify_hot_cold_block_grouping (void)
 /* Perform several checks on the edges out of each block, such as
    the consistency of the branch probabilities, the correctness
    of hot/cold partition crossing edges, and the number of expected
-   successor edges.  */
+   successor edges.  Also verify that the dominance relationship
+   between hot/cold blocks is sane.  */

 static int
 rtl_verify_edges (void)
@@ -2399,6 +2540,14 @@ rtl_verify_edges (void)
        }
     }

+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.  */
+  if (crtl->has_bb_partition && !err)
+    {
+      vec<basic_block> bbs_to_fix = find_partition_fixes (true);
+      err = !bbs_to_fix.is_empty ();
+    }
+
   /* Clean up.  */
   return err;
 }
@@ -2532,7 +2681,7 @@ rtl_verify_bb_pointers (void)
      and NOTE_INSN_BASIC_BLOCK
    - verify that no fall_thru edge crosses hot/cold partition boundaries
    - verify that there are no pending RTL branch predictions
-   - verify that there is a single hot/cold partition boundary after bbro
+   - verify that hot blocks are not dominated by cold blocks

    In future it can be extended check a lot of other stuff as well
    (reachability of basic blocks, life information, etc. etc.).  */
@@ -2778,7 +2927,8 @@ rtl_verify_bb_layout (void)
    - check that all insns are in the basic blocks
      (except the switch handling code, barriers and notes)
    - check that all returns are followed by barriers
-   - check that all fallthru edge points to the adjacent blocks.  */
+   - check that all fallthru edge points to the adjacent blocks
+   - verify that there is a single hot/cold partition boundary after bbro  */

 static int
 rtl_verify_flow_info (void)
Index: basic-block.h
===================================================================
--- basic-block.h       (revision 202021)
+++ basic-block.h       (working copy)
@@ -726,6 +726,7 @@ extern void compute_available (sbitmap *, sbitmap
 extern bool maybe_hot_bb_p (struct function *, const_basic_block);
 extern bool maybe_hot_edge_p (edge);
 extern bool probably_never_executed_bb_p (struct function *, const_basic_block);
+extern bool probably_never_executed_edge_p (struct function *, edge);
 extern bool optimize_bb_for_size_p (const_basic_block);
 extern bool optimize_bb_for_speed_p (const_basic_block);
 extern bool optimize_edge_for_size_p (edge);
@@ -797,6 +798,7 @@ extern bool contains_no_active_insn_p (const_basic
 extern bool forwarder_block_p (const_basic_block);
 extern bool can_fallthru (basic_block, basic_block);
 extern void emit_barrier_after_bb (basic_block bb);
+extern void fixup_partitions (void);

 /* In cfgbuild.c.  */
 extern void find_many_sub_basic_blocks (sbitmap);
Index: cfgcleanup.c
===================================================================
--- cfgcleanup.c        (revision 202021)
+++ cfgcleanup.c        (working copy)
@@ -2807,10 +2807,21 @@ try_optimize_cfg (int mode)
              df_analyze ();
            }

+         if (changed)
+            {
+              /* Edge forwarding in particular can cause hot blocks previously
+                 reached by both hot and cold blocks to become dominated only
+                 by cold blocks. This will cause the verification below
+                 to fail,
+                 and lead to now cold code in the hot section. This is not easy
+                 to detect and fix during edge forwarding, and in some cases
+                 is only visible after newly unreachable blocks are deleted,
+                 which will be done in fixup_partitions.  */
+              fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
-         if (changed)
-           verify_flow_info ();
+              verify_flow_info ();
 #endif
+            }

          changed_overall |= changed;
          first_pass = false;
Index: bb-reorder.c
===================================================================
--- bb-reorder.c        (revision 202021)
+++ bb-reorder.c        (working copy)
@@ -1444,27 +1444,134 @@ fix_up_crossing_landing_pad (eh_landing_pad old_lp
       ei_next (&ei);
 }

+
+/* Ensure that all hot bbs are included in a hot path through the
+   procedure. This is done by calling this function twice, once
+   with WALK_UP true (to look for paths from the entry to hot bbs) and
+   once with WALK_UP false (to look for paths from hot bbs to the exit).
+   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
+   to BBS_IN_HOT_PARTITION.  */
+
+static unsigned int
+sanitize_hot_paths (bool walk_up, unsigned int cold_bb_count,
+                    vec<basic_block> *bbs_in_hot_partition)
+{
+  /* Callers check this.  */
+  gcc_checking_assert (cold_bb_count);
+
+  /* Keep examining hot bbs while we still have some left to check
+     and there are remaining cold bbs.  */
+  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
+  while (! hot_bbs_to_check.is_empty ()
+         && cold_bb_count)
+    {
+      basic_block bb = hot_bbs_to_check.pop ();
+      vec<edge, va_gc> *edges = walk_up ? bb->preds : bb->succs;
+      edge e;
+      edge_iterator ei;
+      int highest_probability = 0;
+      bool found = false;
+
+      /* Walk the preds/succs and check if there is at least one already
+         marked hot. Keep track of the most frequent pred/succ so that we
+         can mark it hot if we don't find one.  */
+      FOR_EACH_EDGE (e, ei, edges)
+        {
+          basic_block reach_bb = walk_up ? e->src : e->dest;
+
+          if (e->flags & EDGE_DFS_BACK)
+            continue;
+
+          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
+          {
+            found = true;
+            break;
+          }
+          if (e->probability > highest_probability)
+            highest_probability = e->probability;
+        }
+
+      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
+         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
+         then the most frequent pred (or succ) needs to be adjusted.  In the
+         case where multiple preds/succs have the same probability (e.g. a
+         50-50 branch), then both will be adjusted.  */
+      if (found)
+        continue;
+
+      FOR_EACH_EDGE (e, ei, edges)
+        {
+          if (e->flags & EDGE_DFS_BACK)
+            continue;
+          if (e->probability < highest_probability)
+            continue;
+
+          basic_block reach_bb = walk_up ? e->src : e->dest;
+
+          /* We have a hot bb with an immediate dominator that is cold.
+             The dominator needs to be re-marked hot.  */
+          BB_SET_PARTITION (reach_bb, BB_HOT_PARTITION);
+          cold_bb_count--;
+
+          /* Now we need to examine newly-hot reach_bb to see if it is also
+             dominated by a cold bb.  */
+          bbs_in_hot_partition->safe_push (reach_bb);
+          hot_bbs_to_check.safe_push (reach_bb);
+        }
+    }
+
+  return cold_bb_count;
+}
+
+
 /* Find the basic blocks that are rarely executed and need to be moved to
    a separate section of the .o file (to cut down on paging and improve
    cache locality).  Return a vector of all edges that cross.  */

-static vec<edge>
+static vec<edge>
 find_rarely_executed_basic_blocks_and_crossing_edges (void)
 {
   vec<edge> crossing_edges = vNULL;
   basic_block bb;
   edge e;
   edge_iterator ei;
+  unsigned int cold_bb_count = 0;
+  vec<basic_block> bbs_in_hot_partition = vNULL;

   /* Mark which partition (hot/cold) each basic block belongs in.  */
   FOR_EACH_BB (bb)
     {
       if (probably_never_executed_bb_p (cfun, bb))
-       BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+          cold_bb_count++;
+        }
       else
-       BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+          bbs_in_hot_partition.safe_push (bb);
+        }
     }

+  /* Ensure that hot bbs are included along a hot path from the entry to exit.
+     Several different possibilities may include cold bbs along all paths
+     to/from a hot bb. One is that there are edge weight insanities
+     due to optimization phases that do not properly update basic block profile
+     counts. The second is that the entry of the function may not be hot,
+     because it is entered fewer times than the number of profile training
+     runs, but there
+     is a loop inside the function that causes blocks within the function to be
+     above the threshold for hotness. This is fixed by walking up from hot bbs
+     to the entry block, and then down from hot bbs to the exit, performing
+     partitioning fixups as necessary.  */
+  if (cold_bb_count)
+    {
+      mark_dfs_back_edges ();
+      cold_bb_count = sanitize_hot_paths (true, cold_bb_count,
+                                          &bbs_in_hot_partition);
+      if (cold_bb_count)
+        sanitize_hot_paths (false, cold_bb_count, &bbs_in_hot_partition);
+    }
+
   /* The format of .gcc_except_table does not allow landing pads to
      be in a different partition as the throw.  Fix this by either
      moving or duplicating the landing pads.  */
Index: predict.c
===================================================================
--- predict.c   (revision 202021)
+++ predict.c   (working copy)
@@ -241,6 +241,22 @@ probably_never_executed_bb_p (struct function *fun
   return false;
 }

+
+/* Return true in case edge E is probably never executed.  */
+
+bool
+probably_never_executed_edge_p (struct function *fun, edge e)
+{
+  gcc_checking_assert (fun);
+  if (profile_info && flag_branch_probabilities)
+    return ((e->count + profile_info->runs / 2) / profile_info->runs) == 0;
+  if ((!profile_info || !flag_branch_probabilities)
+      && (cgraph_get_node (fun->decl)->frequency
+         == NODE_FREQUENCY_UNLIKELY_EXECUTED))
+    return true;
+  return false;
+}
+
 /* Return true if NODE should be optimized for size.  */

 bool
Index: cfg.c
===================================================================
--- cfg.c       (revision 202021)
+++ cfg.c       (working copy)
@@ -446,6 +446,21 @@ check_bb_profile (basic_block bb, FILE * file, int
                 (flags & TDF_COMMENT) ? ";; " : "", s_indent,
                 (int) lsum, (int) bb->count);
     }
+  if (BB_PARTITION (bb) == BB_COLD_PARTITION)
+    {
+      /* Warn about inconsistencies in the partitioning that are
+         currently caused by profile insanities created via optimization.  */
+      if (!probably_never_executed_bb_p (fun, bb))
+        fprintf (file, "%s%sBlock in cold partition with hot count\n",
+                 (flags & TDF_COMMENT) ? ";; " : "", s_indent);
+      FOR_EACH_EDGE (e, ei, bb->preds)
+        {
+          if (!probably_never_executed_edge_p (fun, e))
+            fprintf (file,
+                     "%s%sBlock in cold partition with incoming hot edge\n",
+                     (flags & TDF_COMMENT) ? ";; " : "", s_indent);
+        }
+    }
 }
 ^L
 void



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413


* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-27 18:12                                 ` Teresa Johnson
@ 2013-08-28 16:59                                   ` Jan Hubicka
  2013-08-28 18:35                                     ` Teresa Johnson
  2013-08-31 16:20                                   ` Jan Hubicka
  1 sibling, 1 reply; 62+ messages in thread
From: Jan Hubicka @ 2013-08-28 16:59 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, marxin.liska, Sriraman Tallam

Hi,
with Martin we made a bit of progress on analyzing the problems.  We now have
COMDAT profile merging for FDO, and we also noticed that forks can make your
basic blocks appear never executed even though they are executed every run:
the fork is accounted as 3 independent runs of the program.  The first run
is until the fork; the other 2 runs are the parent and child variants.
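
For a concrete picture of the dilution, consider this hypothetical
program (with -fprofile-arcs the call to fork is redirected to
__gcov_fork, which flushes the counters before forking):

#include <unistd.h>

static void work_before_fork (void) { /* runs once per invocation */ }
static void child_work (void) {}
static void parent_work (void) {}

int
main (void)
{
  work_before_fork ();   /* flushed as "run" 1 at the fork */
  if (fork () == 0)
    child_work ();       /* child dumps at exit: "run" 2 */
  else
    parent_work ();      /* parent dumps at exit: "run" 3 */
  return 0;
}

With profile_info->runs == 3 but a count of only 1 on the block calling
work_before_fork, (1 + 3/2) / 3 rounds to 0, so that block looks
probably-never-executed even though it executes every time.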

I have a patch to track this.  Moreover, vfork seems to produce repeated
merging of results.

These two factors riddled the firefox profiles enough that it took us
weeks to understand what happens.
> +         if (changed)
> +            {
> +              /* Edge forwarding in particular can cause hot blocks previously
> +                 reached by both hot and cold blocks to become dominated only
> +                 by cold blocks. This will cause the verification
> below to fail,
> +                 and lead to now cold code in the hot section. This is not easy
> +                 to detect and fix during edge forwarding, and in some cases
> +                 is only visible after newly unreachable blocks are deleted,
> +                 which will be done in fixup_partitions.  */
> +              fixup_partitions ();

Is it really necessary to run this from the internal loop of cfgcleanup?  It
seems you will play a back-and-forth game where edge forwarding removes your
fallthru and you re-add it?

I would wait for cfgcleanup to finish its job (I don't really think it needs
to be iterative) and then do the fixup, possibly cleaning up after blocks were
repositioned (I suppose that is the only case where the code above introduces
new cfgcleanup opportunities).

> +      /* Walk the preds/succs and check if there is at least one already
> +         marked hot. Keep track of the most frequent pred/succ so that we
> +         can mark it hot if we don't find one.  */
> +      FOR_EACH_EDGE (e, ei, edges)
> +        {
> +          basic_block reach_bb = walk_up ? e->src : e->dest;
> +
> +          if (e->flags & EDGE_DFS_BACK)
> +            continue;
> +
> +          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
> +          {
> +            found = true;
> +            break;
> +          }
> +          if (e->probability > highest_probability)
> +            highest_probability = e->probability;

When doing the predecessor walk, if you have two predecessors, one executing
100000 times and reaching the block with probability 1%, you want to choose
it over a block executing once and reaching you with probability 100%.

You probably want to look for the most likely predecessor here.  You need to
look for the highest e->count, fall back to the highest EDGE_FREQUENCY if the
counts are all 0, and use the maximal probability only if all else fails?
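
I.e. roughly this order of comparisons (an untested sketch; hottest_edge
is a hypothetical name):

/* Pick the hottest non-back edge: compare counts first, then
   EDGE_FREQUENCY when the counts tie (e.g. are all 0), then the raw
   probability when the frequencies tie too.  */

static edge
hottest_edge (vec<edge, va_gc> *edges)
{
  edge e, best = NULL;
  edge_iterator ei;

  FOR_EACH_EDGE (e, ei, edges)
    {
      if (e->flags & EDGE_DFS_BACK)
        continue;
      if (!best
          || e->count > best->count
          || (e->count == best->count
              && (EDGE_FREQUENCY (e) > EDGE_FREQUENCY (best)
                  || (EDGE_FREQUENCY (e) == EDGE_FREQUENCY (best)
                      && e->probability > best->probability))))
        best = e;
    }
  return best;
}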

> +        }
> +
> +      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
> +         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
> +         then the most frequent pred (or succ) needs to be adjusted.  In the
> +         case where multiple preds/succs have the same probability (e.g. a
> +         50-50 branch), then both will be adjusted.  */
> +      if (found)
> +        continue;
> +
> +      FOR_EACH_EDGE (e, ei, edges)
> +        {
> +          if (e->flags & EDGE_DFS_BACK)
> +            continue;
> +          if (e->probability < highest_probability)
> +            continue;

Again, for the predecessor walk you need to apply the slightly crazy logic described above.
> Index: predict.c
> ===================================================================
> --- predict.c   (revision 202021)
> +++ predict.c   (working copy)
> @@ -241,6 +241,22 @@ probably_never_executed_bb_p (struct function *fun
>    return false;
>  }
> 
> +
> +/* Return true in case edge E is probably never executed.  */
> +
> +bool
> +probably_never_executed_edge_p (struct function *fun, edge e)
> +{
> +  gcc_checking_assert (fun);
> +  if (profile_info && flag_branch_probabilities)
> +    return ((e->count + profile_info->runs / 2) / profile_info->runs) == 0;
> +  if ((!profile_info || !flag_branch_probabilities)
> +      && (cgraph_get_node (fun->decl)->frequency
> +         == NODE_FREQUENCY_UNLIKELY_EXECUTED))
> +    return true;
> +  return false;
Instead of duplicating the conditional, break out the tests into
probably_never_executed_count_p, like we have for maybe_hot_count_p.
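
I.e. something along these lines (a sketch only, mirroring the conditions
from the patch above and the signature of maybe_hot_count_p):

static bool
probably_never_executed_count_p (struct function *fun, gcov_type count)
{
  gcc_checking_assert (fun);
  if (profile_info && flag_branch_probabilities)
    return ((count + profile_info->runs / 2) / profile_info->runs) == 0;
  if ((!profile_info || !flag_branch_probabilities)
      && (cgraph_get_node (fun->decl)->frequency
          == NODE_FREQUENCY_UNLIKELY_EXECUTED))
    return true;
  return false;
}

bool
probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
{
  return probably_never_executed_count_p (fun, bb->count);
}

bool
probably_never_executed_edge_p (struct function *fun, edge e)
{
  return probably_never_executed_count_p (fun, e->count);
}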

It would be nice to extend this to work w/o profile: probably_never_executed_edge_p
can return true for EH edges and setjmp edges, and we can then walk bodies ignoring
EH/setjmp, flagging blocks that are probably never executed, enabling the
partitioning to do its job w/o profile.

Otherwise the patch looks OK to me. Thanks for working on this. Do we have
agreement on the C++ way of mixing declarations and code?

Honza


* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-28 16:59                                   ` Jan Hubicka
@ 2013-08-28 18:35                                     ` Teresa Johnson
  2013-08-30  7:17                                       ` Teresa Johnson
  0 siblings, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-08-28 18:35 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law,
	marxin.liska, Sriraman Tallam, Rong Xu

On Wed, Aug 28, 2013 at 9:58 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> with Martin we made a bit of progress on analyzing the problems.  We now have
> COMDAT profile merging for FDO

 Great! Is this the LTO merging you were talking about in an earlier
message, or the gcov runtime fix (that would presumably not be
lto-specific)?

> and we also noticed that forks can make your
> basic blocks appear never executed even though they are executed every run:
> the fork is accounted as 3 independent runs of the program.  The first run
> is until the fork; the other 2 runs are the parent and child variants.
>
> I have a patch to track this.  Moreover, vfork seems to produce repeated
> merging of results.

Aha, does this explain the gimp issue as well?

>
> These two factors riddled the firefox profiles enough that it took us
> weeks to understand what happens.
>> +         if (changed)
>> +            {
>> +              /* Edge forwarding in particular can cause hot blocks previously
>> +                 reached by both hot and cold blocks to become dominated only
>> +                 by cold blocks. This will cause the verification
>> below to fail,
>> +                 and lead to now cold code in the hot section. This is not easy
>> +                 to detect and fix during edge forwarding, and in some cases
>> +                 is only visible after newly unreachable blocks are deleted,
>> +                 which will be done in fixup_partitions.  */
>> +              fixup_partitions ();
>
> Is it really necessary to run this from the internal loop of cfgcleanup?

The reason I added it here is that just below there is a call to
verify_flow_info, and that will fail with the new verification.

> It seems
> you will play a back-and-forth game where edge forwarding removes your fallthru
> and you re-add it?

fixup_partitions will not add new fall-through edges. (It may invoke
force_nonfallthru to do the opposite.) So there shouldn't be any
ping-ponging effect.

>
> I would wait for cfgcleanup to finish its job (I don't really think it needs
> to be iterative) and then do the fixup, possibly cleaning up after blocks were
> repositioned (I suppose that is the only case where the code above introduces
> new cfgcleanup opportunities).

As noted above, I can't do this due to the call to verify_flow_info
for each iteration. One option is to move both down outside the loop.

>
>> +      /* Walk the preds/succs and check if there is at least one already
>> +         marked hot. Keep track of the most frequent pred/succ so that we
>> +         can mark it hot if we don't find one.  */
>> +      FOR_EACH_EDGE (e, ei, edges)
>> +        {
>> +          basic_block reach_bb = walk_up ? e->src : e->dest;
>> +
>> +          if (e->flags & EDGE_DFS_BACK)
>> +            continue;
>> +
>> +          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
>> +          {
>> +            found = true;
>> +            break;
>> +          }
>> +          if (e->probability > highest_probability)
>> +            highest_probability = e->probability;
>
>> When doing the predecessor walk, if you have two predecessors, one executing
>> 100000 times and reaching the block with probability 1%, you want to choose
>> it over a block executing once and reaching you with probability 100%.
>>
>> You probably want to look for the most likely predecessor here.  You need to
>> look for the highest e->count, fall back to the highest EDGE_FREQUENCY if the
>> counts are all 0, and use the maximal probability only if all else fails?

Yes, thanks, let me do that.

>
>> +        }
>> +
>> +      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
>> +         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
>> +         then the most frequent pred (or succ) needs to be adjusted.  In the
>> +         case where multiple preds/succs have the same probability (e.g. a
>> +         50-50 branch), then both will be adjusted.  */
>> +      if (found)
>> +        continue;
>> +
>> +      FOR_EACH_EDGE (e, ei, edges)
>> +        {
>> +          if (e->flags & EDGE_DFS_BACK)
>> +            continue;
>> +          if (e->probability < highest_probability)
>> +            continue;
>
> Again, for the predecessor walk you need to apply the slightly crazy logic described above.
>> Index: predict.c
>> ===================================================================
>> --- predict.c   (revision 202021)
>> +++ predict.c   (working copy)
>> @@ -241,6 +241,22 @@ probably_never_executed_bb_p (struct function *fun
>>    return false;
>>  }
>>
>> +
>> +/* Return true in case edge E is probably never executed.  */
>> +
>> +bool
>> +probably_never_executed_edge_p (struct function *fun, edge e)
>> +{
>> +  gcc_checking_assert (fun);
>> +  if (profile_info && flag_branch_probabilities)
>> +    return ((e->count + profile_info->runs / 2) / profile_info->runs) == 0;
>> +  if ((!profile_info || !flag_branch_probabilities)
>> +      && (cgraph_get_node (fun->decl)->frequency
>> +         == NODE_FREQUENCY_UNLIKELY_EXECUTED))
>> +    return true;
>> +  return false;
> Instead of duplicating the conditional, break out the tests into
> probably_never_executed_count_p, like we have for maybe_hot_count_p.

ok

>
> It would be nice to extend this to work w/o profile: probably_never_executed_edge_p
> can return true for EH edges and setjmp edges, and we can then walk bodies ignoring
> EH/setjmp, flagging blocks that are probably never executed, enabling the
> partitioning to do its job w/o profile.

Agreed, although I would prefer to leave that for a follow-on patch so
it can be tuned a bit.
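
(For reference, an untuned sketch of that direction, with a hypothetical
helper name:)

/* Without profile data, consider EH and abnormal (e.g. setjmp
   receiver) edges probably never executed, so the partitioning has
   something to work with when no profile was read.  */

static bool
never_executed_edge_without_profile_p (edge e)
{
  return (e->flags & (EDGE_EH | EDGE_ABNORMAL)) != 0;
}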

>
> Otherwise the patch looks OK to me. Thanks for working on this. Do we have
> agreement on the C++ way of mixing declarations and code?

According to http://gcc.gnu.org/wiki/CppConventions:

"In new code variables which are used in a small scope should be
defined at the point of first use, rather than at the top of the
function. Variables which are used throughout the function may be
defined at the top of the function, as in C."

I think I am following that, but let me know if you see something that
needs to be fixed.

Thanks,
Teresa

>
> Honza



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413


* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-28 18:35                                     ` Teresa Johnson
@ 2013-08-30  7:17                                       ` Teresa Johnson
  2013-08-30  9:16                                         ` Jan Hubicka
  0 siblings, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-08-30  7:17 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law,
	marxin.liska, Sriraman Tallam, Rong Xu

On Wed, Aug 28, 2013 at 11:20 AM, Teresa Johnson <tejohnson@google.com> wrote:
> On Wed, Aug 28, 2013 at 9:58 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Hi,
>> with Martin we made a bit of progress on analyzing the problems.  We now have
>> COMDAT profile merging for FDO
>
>  Great! Is this the LTO merging you were talking about in an earlier
> message, or the gcov runtime fix (that would presumably not be
> lto-specific)?
>
>> and we also noticed that forks can make your
>> basic blocks appear never executed even though they are executed every run:
>> the fork is accounted as 3 independent runs of the program.  The first run
>> is until the fork; the other 2 runs are the parent and child variants.
>>
>> I have a patch to track this.  Moreover, vfork seems to produce repeated
>> merging of results.
>
> Aha, does this explain the gimp issue as well?
>
>>
>> These two factors riddled the firefox profiles enough that it took us
>> weeks to understand what happens.
>>> +         if (changed)
>>> +            {
>>> +              /* Edge forwarding in particular can cause hot blocks previously
>>> +                 reached by both hot and cold blocks to become dominated only
>>> +                 by cold blocks. This will cause the verification
>>> below to fail,
>>> +                 and lead to now cold code in the hot section. This is not easy
>>> +                 to detect and fix during edge forwarding, and in some cases
>>> +                 is only visible after newly unreachable blocks are deleted,
>>> +                 which will be done in fixup_partitions.  */
>>> +              fixup_partitions ();
>>
>> Is it really necessary to run this from the internal loop of cfgcleanup?
>
> The reason I added it here is that just below there is a call to
> verify_flow_info, and that will fail with the new verification.
>
>> It seems
>> you will play a back-and-forth game where edge forwarding removes your fallthru
>> and you re-add it?
>
> fixup_partitions will not add new fall-through edges. (It may invoke
> force_nonfallthru to do the opposite.) So there shouldn't be any
> ping-ponging effect.
>
>>
>> I would wait for cfgcleanup to finish its job (I don't really think it needs
>> to be iterative) and then do the fixup, possibly cleaning up after blocks were
>> repositioned (I suppose that is the only case where the code above introduces
>> new cfgcleanup opportunities).
>
> As noted above, I can't do this due to the call to verify_flow_info
> for each iteration. One option is to move both down outside the loop.
>
>>
>>> +      /* Walk the preds/succs and check if there is at least one already
>>> +         marked hot. Keep track of the most frequent pred/succ so that we
>>> +         can mark it hot if we don't find one.  */
>>> +      FOR_EACH_EDGE (e, ei, edges)
>>> +        {
>>> +          basic_block reach_bb = walk_up ? e->src : e->dest;
>>> +
>>> +          if (e->flags & EDGE_DFS_BACK)
>>> +            continue;
>>> +
>>> +          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
>>> +          {
>>> +            found = true;
>>> +            break;
>>> +          }
>>> +          if (e->probability > highest_probability)
>>> +            highest_probability = e->probability;
>>
>> When doing the predecessor walk, if you have two predecessors, one executing
>> 100000 times and reaching the block with probability 1%, you want to choose
>> it over a block executing once and reaching you with probability 100%.
>>
>> You probably want to look for the most likely predecessor here.  You need to
>> look for the highest e->count, fall back to the highest EDGE_FREQUENCY if the
>> counts are all 0, and use the maximal probability only if all else fails?
>
> Yes, thanks, let me do that.

New patch that addresses this is included below.

Thanks,
Teresa

>
>>
>>> +        }
>>> +
>>> +      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
>>> +         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
>>> +         then the most frequent pred (or succ) needs to be adjusted.  In the
>>> +         case where multiple preds/succs have the same probability (e.g. a
>>> +         50-50 branch), then both will be adjusted.  */
>>> +      if (found)
>>> +        continue;
>>> +
>>> +      FOR_EACH_EDGE (e, ei, edges)
>>> +        {
>>> +          if (e->flags & EDGE_DFS_BACK)
>>> +            continue;
>>> +          if (e->probability < highest_probability)
>>> +            continue;
>>
>> Again, for the predecessor walk you need to apply the slightly crazy logic described above.
>>> Index: predict.c
>>> ===================================================================
>>> --- predict.c   (revision 202021)
>>> +++ predict.c   (working copy)
>>> @@ -241,6 +241,22 @@ probably_never_executed_bb_p (struct function *fun
>>>    return false;
>>>  }
>>>
>>> +
>>> +/* Return true in case edge E is probably never executed.  */
>>> +
>>> +bool
>>> +probably_never_executed_edge_p (struct function *fun, edge e)
>>> +{
>>> +  gcc_checking_assert (fun);
>>> +  if (profile_info && flag_branch_probabilities)
>>> +    return ((e->count + profile_info->runs / 2) / profile_info->runs) == 0;
>>> +  if ((!profile_info || !flag_branch_probabilities)
>>> +      && (cgraph_get_node (fun->decl)->frequency
>>> +         == NODE_FREQUENCY_UNLIKELY_EXECUTED))
>>> +    return true;
>>> +  return false;
>> Instead of duplicating the conditional, break out the tests into
>> probably_never_executed_count_p, like we have for maybe_hot_count_p.
>
> ok
>
>>
>> It would be nice to extend this to work w/o profile: probably_never_executed_edge_p
>> can return true for EH edges and setjmp edges, and we can then walk bodies ignoring
>> EH/setjmp, flagging blocks that are probably never executed, enabling the
>> partitioning to do its job w/o profile.
>
> Agreed, although I would prefer to leave that for a follow-on patch so
> it can be tuned a bit.
>
>>
>> Otherwise the patch looks OK to me. Thanks for working on this. Do we have
>> agreement on the C++ way of mixing declarations and code?
>
> According to http://gcc.gnu.org/wiki/CppConventions:
>
> "In new code variables which are used in a small scope should be
> defined at the point of first use, rather than at the top of the
> function. Variables which are used throughout the function may be
> defined at the top of the function, as in C."
>
> I think I am following that, but let me know if you see something that
> needs to be fixed.
>
> Thanks,
> Teresa
>
>>

2013-08-29  Teresa Johnson  <tejohnson@google.com>
            Steven Bosscher  <steven@gcc.gnu.org>

        * cfgrtl.c (fixup_new_cold_bb): New routine.
        (commit_edge_insertions): Invoke fixup_partitions.
        (find_partition_fixes): New routine.
        (fixup_partitions): Ditto.
        (verify_hot_cold_block_grouping): Update comments.
        (rtl_verify_edges): Invoke find_partition_fixes.
        (rtl_verify_bb_pointers): Update comments.
        (rtl_verify_bb_layout): Ditto.
        * basic-block.h (probably_never_executed_edge_p): Declare.
        (fixup_partitions): Ditto.
        * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
        * bb-reorder.c (sanitize_hot_paths): New function.
        (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
        sanitize_hot_paths.
        * predict.c (probably_never_executed_edge_p): New routine.
        * cfg.c (check_bb_profile): Add partition insanity warnings.

Index: cfgrtl.c
===================================================================
--- cfgrtl.c    (revision 202021)
+++ cfgrtl.c    (working copy)
@@ -1358,6 +1358,43 @@ fixup_partition_crossing (edge e)
     }
 }

+/* Called when block BB has been reassigned to the cold partition,
+   because it is now dominated by another cold block,
+   to ensure that the region crossing attributes are updated.  */
+
+static void
+fixup_new_cold_bb (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  /* This is called when a hot bb is found to now be dominated
+     by a cold bb and therefore needs to become cold. Therefore,
+     its preds will no longer be region crossing. Any non-dominating
+     preds that were previously hot would also have become cold
+     in the caller for the same region. Any preds that were previously
+     region-crossing will be adjusted in fixup_partition_crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      fixup_partition_crossing (e);
+    }
+
+  /* Possibly need to make bb's successor edges region crossing,
+     or remove stale region crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    {
+      /* We can't have fall-through edges across partition boundaries.
+         Note that force_nonfallthru will do any necessary partition
+         boundary fixup by calling fixup_partition_crossing itself.  */
+      if ((e->flags & EDGE_FALLTHRU)
+          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
+          && e->dest != EXIT_BLOCK_PTR)
+        force_nonfallthru (e);
+      else
+        fixup_partition_crossing (e);
+    }
+}
+
 /* Attempt to change code to redirect edge E to TARGET.  Don't do that on
    expense of adding new instructions or reordering basic blocks.

@@ -1996,6 +2033,14 @@ commit_edge_insertions (void)
 {
   basic_block bb;

+  /* Optimization passes that invoke this routine can cause hot blocks
+     previously reached by both hot and cold blocks to become dominated only
+     by cold blocks. This will cause the verification below to fail,
+     and lead to now cold code in the hot section. In some cases this
+     may only be visible after newly unreachable blocks are deleted,
+     which will be done by fixup_partitions.  */
+  fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
   verify_flow_info ();
 #endif
@@ -2190,6 +2235,101 @@ get_last_bb_insn (basic_block bb)
   return end;
 }

+/* Sanity check partition hotness to ensure that basic blocks in
+   the cold partition don't dominate basic blocks in the hot partition.
+   If FLAG_ONLY is true, report violations as errors. Otherwise
+   re-mark the dominated blocks as cold, since this is run after
+   cfg optimizations that may make hot blocks previously reached
+   by both hot and cold blocks now only reachable along cold paths.  */
+
+static vec<basic_block>
+find_partition_fixes (bool flag_only)
+{
+  basic_block bb;
+  vec<basic_block> bbs_in_cold_partition = vNULL;
+  vec<basic_block> bbs_to_fix = vNULL;
+
+  /* Callers check this.  */
+  gcc_checking_assert (crtl->has_bb_partition);
+
+  FOR_EACH_BB (bb)
+    if ((BB_PARTITION (bb) == BB_COLD_PARTITION))
+      bbs_in_cold_partition.safe_push (bb);
+
+  if (bbs_in_cold_partition.is_empty ())
+    return vNULL;
+
+  bool dom_calculated_here = !dom_info_available_p (CDI_DOMINATORS);
+
+  if (dom_calculated_here)
+    calculate_dominance_info (CDI_DOMINATORS);
+
+  while (! bbs_in_cold_partition.is_empty  ())
+    {
+      bb = bbs_in_cold_partition.pop ();
+      /* Any blocks dominated by a block in the cold section
+         must also be cold.  */
+      basic_block son;
+      for (son = first_dom_son (CDI_DOMINATORS, bb);
+           son;
+           son = next_dom_son (CDI_DOMINATORS, son))
+        {
+          /* If son is not yet cold, then mark it cold here and
+             enqueue it for further processing.  */
+          if ((BB_PARTITION (son) != BB_COLD_PARTITION))
+            {
+              if (flag_only)
+                error ("non-cold basic block %d dominated "
+                       "by a block in the cold partition (%d)",
+                       son->index, bb->index);
+              else
+                BB_SET_PARTITION (son, BB_COLD_PARTITION);
+              bbs_to_fix.safe_push (son);
+              bbs_in_cold_partition.safe_push (son);
+            }
+        }
+    }
+
+  if (dom_calculated_here)
+    free_dominance_info (CDI_DOMINATORS);
+
+  return bbs_to_fix;
+}
+
+/* Perform cleanup on the hot/cold bb partitioning after optimization
+   passes that modify the cfg.  */
+
+void
+fixup_partitions (void)
+{
+  basic_block bb;
+
+  if (!crtl->has_bb_partition)
+    return;
+
+  /* Delete any blocks that became unreachable and weren't
+     already cleaned up, for example during edge forwarding
+     and convert_jumps_to_returns. This will expose more
+     opportunities for fixing the partition boundaries here.
+     Also, the calculation of the dominance graph during verification
+     will assert if there are unreachable nodes.  */
+  delete_unreachable_blocks ();
+
+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.
+     Fixup any that now violate this requirement, as a result of edge
+     forwarding and unreachable block deletion.  */
+  vec<basic_block> bbs_to_fix = find_partition_fixes (false);
+
+  /* Do the partition fixup after all necessary blocks have been converted to
+     cold, so that we only update the region crossings the minimum number of
+     places, which can require forcing edges to be non fallthru.  */
+  while (! bbs_to_fix.is_empty ())
+    {
+      bb = bbs_to_fix.pop ();
+      fixup_new_cold_bb (bb);
+    }
+}
+
 /* Verify, in the basic block chain, that there is at most one switch
    between hot/cold partitions. This condition will not be true until
    after reorder_basic_blocks is called.  */
@@ -2236,7 +2376,8 @@ verify_hot_cold_block_grouping (void)
 /* Perform several checks on the edges out of each block, such as
    the consistency of the branch probabilities, the correctness
    of hot/cold partition crossing edges, and the number of expected
-   successor edges.  */
+   successor edges.  Also verify that the dominance relationship
+   between hot/cold blocks is sane.  */

 static int
 rtl_verify_edges (void)
@@ -2399,6 +2540,14 @@ rtl_verify_edges (void)
        }
     }

+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.  */
+  if (crtl->has_bb_partition && !err)
+    {
+      vec<basic_block> bbs_to_fix = find_partition_fixes (true);
+      err = !bbs_to_fix.is_empty ();
+    }
+
   /* Clean up.  */
   return err;
 }
@@ -2532,7 +2681,7 @@ rtl_verify_bb_pointers (void)
      and NOTE_INSN_BASIC_BLOCK
    - verify that no fall_thru edge crosses hot/cold partition boundaries
    - verify that there are no pending RTL branch predictions
-   - verify that there is a single hot/cold partition boundary after bbro
+   - verify that hot blocks are not dominated by cold blocks

    In future it can be extended check a lot of other stuff as well
    (reachability of basic blocks, life information, etc. etc.).  */
@@ -2778,7 +2927,8 @@ rtl_verify_bb_layout (void)
    - check that all insns are in the basic blocks
      (except the switch handling code, barriers and notes)
    - check that all returns are followed by barriers
-   - check that all fallthru edge points to the adjacent blocks.  */
+   - check that all fallthru edge points to the adjacent blocks
+   - verify that there is a single hot/cold partition boundary after bbro  */

 static int
 rtl_verify_flow_info (void)
Index: basic-block.h
===================================================================
--- basic-block.h       (revision 202021)
+++ basic-block.h       (working copy)
@@ -726,6 +726,7 @@ extern void compute_available (sbitmap *, sbitmap
 extern bool maybe_hot_bb_p (struct function *, const_basic_block);
 extern bool maybe_hot_edge_p (edge);
 extern bool probably_never_executed_bb_p (struct function *, const_basic_block);
+extern bool probably_never_executed_edge_p (struct function *, edge);
 extern bool optimize_bb_for_size_p (const_basic_block);
 extern bool optimize_bb_for_speed_p (const_basic_block);
 extern bool optimize_edge_for_size_p (edge);
@@ -797,6 +798,7 @@ extern bool contains_no_active_insn_p (const_basic
 extern bool forwarder_block_p (const_basic_block);
 extern bool can_fallthru (basic_block, basic_block);
 extern void emit_barrier_after_bb (basic_block bb);
+extern void fixup_partitions (void);

 /* In cfgbuild.c.  */
 extern void find_many_sub_basic_blocks (sbitmap);
Index: cfgcleanup.c
===================================================================
--- cfgcleanup.c        (revision 202021)
+++ cfgcleanup.c        (working copy)
@@ -2807,10 +2807,21 @@ try_optimize_cfg (int mode)
              df_analyze ();
            }

+         if (changed)
+            {
+              /* Edge forwarding in particular can cause hot blocks previously
+                 reached by both hot and cold blocks to become dominated only
+                 by cold blocks. This will cause the verification below
+                 to fail,
+                 and lead to now cold code in the hot section. This is not easy
+                 to detect and fix during edge forwarding, and in some cases
+                 is only visible after newly unreachable blocks are deleted,
+                 which will be done in fixup_partitions.  */
+              fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
-         if (changed)
-           verify_flow_info ();
+              verify_flow_info ();
 #endif
+            }

          changed_overall |= changed;
          first_pass = false;
Index: bb-reorder.c
===================================================================
--- bb-reorder.c        (revision 202021)
+++ bb-reorder.c        (working copy)
@@ -1444,27 +1444,157 @@ fix_up_crossing_landing_pad (eh_landing_pad old_lp
       ei_next (&ei);
 }

+
+/* Ensure that all hot bbs are included in a hot path through the
+   procedure. This is done by calling this function twice, once
+   with WALK_UP true (to look for paths from the entry to hot bbs) and
+   once with WALK_UP false (to look for paths from hot bbs to the exit).
+   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
+   to BBS_IN_HOT_PARTITION.  */
+
+static unsigned int
+sanitize_hot_paths (bool walk_up, unsigned int cold_bb_count,
+                    vec<basic_block> *bbs_in_hot_partition)
+{
+  /* Callers check this.  */
+  gcc_checking_assert (cold_bb_count);
+
+  /* Keep examining hot bbs while we still have some left to check
+     and there are remaining cold bbs.  */
+  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
+  while (! hot_bbs_to_check.is_empty ()
+         && cold_bb_count)
+    {
+      basic_block bb = hot_bbs_to_check.pop ();
+      vec<edge, va_gc> *edges = walk_up ? bb->preds : bb->succs;
+      edge e;
+      edge_iterator ei;
+      int highest_probability = 0;
+      int highest_freq = 0;
+      gcov_type highest_count = 0;
+      bool found = false;
+
+      /* Walk the preds/succs and check if there is at least one already
+         marked hot. Keep track of the most frequent pred/succ so that we
+         can mark it hot if we don't find one.  */
+      FOR_EACH_EDGE (e, ei, edges)
+        {
+          basic_block reach_bb = walk_up ? e->src : e->dest;
+
+          if (e->flags & EDGE_DFS_BACK)
+            continue;
+
+          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
+          {
+            found = true;
+            break;
+          }
+          /* The following loop will look for the hottest edge via
+             the edge count, if it is non-zero, then fallback to the edge
+             frequency and finally the edge probability.  */
+          if (e->count > highest_count)
+            highest_count = e->count;
+          int edge_freq = EDGE_FREQUENCY (e);
+          if (edge_freq > highest_freq)
+            highest_freq = edge_freq;
+          if (e->probability > highest_probability)
+            highest_probability = e->probability;
+        }
+
+      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
+         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
+         then the most frequent pred (or succ) needs to be adjusted.  In the
+         case where multiple preds/succs have the same frequency (e.g. a
+         50-50 branch), then both will be adjusted.  */
+      if (found)
+        continue;
+
+      FOR_EACH_EDGE (e, ei, edges)
+        {
+          if (e->flags & EDGE_DFS_BACK)
+            continue;
+          /* Select the hottest edge using the edge count, if it is non-zero,
+             then fallback to the edge frequency and finally the edge
+             probability.  */
+          if (highest_count)
+            {
+              if (e->count < highest_count)
+                continue;
+            }
+          else if (highest_freq)
+            {
+              if (EDGE_FREQUENCY (e) < highest_freq)
+                continue;
+            }
+          else if (e->probability < highest_probability)
+            continue;
+
+          basic_block reach_bb = walk_up ? e->src : e->dest;
+
+          /* We have a hot bb with an immediate dominator that is cold.
+             The dominator needs to be re-marked hot.  */
+          BB_SET_PARTITION (reach_bb, BB_HOT_PARTITION);
+          cold_bb_count--;
+
+          /* Now we need to examine newly-hot reach_bb to see if it is also
+             dominated by a cold bb.  */
+          bbs_in_hot_partition->safe_push (reach_bb);
+          hot_bbs_to_check.safe_push (reach_bb);
+        }
+    }
+
+  return cold_bb_count;
+}
+
+
 /* Find the basic blocks that are rarely executed and need to be moved to
    a separate section of the .o file (to cut down on paging and improve
    cache locality).  Return a vector of all edges that cross.  */

-static vec<edge>
+static vec<edge>
 find_rarely_executed_basic_blocks_and_crossing_edges (void)
 {
   vec<edge> crossing_edges = vNULL;
   basic_block bb;
   edge e;
   edge_iterator ei;
+  unsigned int cold_bb_count = 0;
+  vec<basic_block> bbs_in_hot_partition = vNULL;

   /* Mark which partition (hot/cold) each basic block belongs in.  */
   FOR_EACH_BB (bb)
     {
       if (probably_never_executed_bb_p (cfun, bb))
-       BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+          cold_bb_count++;
+        }
       else
-       BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+          bbs_in_hot_partition.safe_push (bb);
+        }
     }

+  /* Ensure that hot bbs are included along a hot path from the entry to exit.
+     Several different possibilities may include cold bbs along all paths
+     to/from a hot bb. One is that there are edge weight insanities
+     due to optimization phases that do not properly update basic block profile
+     counts. The second is that the entry of the function may not be hot, because
+     it is entered fewer times than the number of profile training runs, but there
+     is a loop inside the function that causes blocks within the function to be
+     above the threshold for hotness. This is fixed by walking up from hot bbs
+     to the entry block, and then down from hot bbs to the exit, performing
+     partitioning fixups as necessary.  */
+  if (cold_bb_count)
+    {
+      mark_dfs_back_edges ();
+      cold_bb_count = sanitize_hot_paths (true, cold_bb_count,
+                                          &bbs_in_hot_partition);
+      if (cold_bb_count)
+        sanitize_hot_paths (false, cold_bb_count, &bbs_in_hot_partition);
+    }
+
   /* The format of .gcc_except_table does not allow landing pads to
      be in a different partition as the throw.  Fix this by either
      moving or duplicating the landing pads.  */
Index: predict.c
===================================================================
--- predict.c   (revision 202021)
+++ predict.c   (working copy)
@@ -241,6 +241,22 @@ probably_never_executed_bb_p (struct function *fun
   return false;
 }

+
+/* Return true in case edge E is probably never executed.  */
+
+bool
+probably_never_executed_edge_p (struct function *fun, edge e)
+{
+  gcc_checking_assert (fun);
+  if (profile_info && flag_branch_probabilities)
+    return ((e->count + profile_info->runs / 2) / profile_info->runs) == 0;
+  if ((!profile_info || !flag_branch_probabilities)
+      && (cgraph_get_node (fun->decl)->frequency
+         == NODE_FREQUENCY_UNLIKELY_EXECUTED))
+    return true;
+  return false;
+}
+
 /* Return true if NODE should be optimized for size.  */

 bool
Index: cfg.c
===================================================================
--- cfg.c       (revision 202021)
+++ cfg.c       (working copy)
@@ -446,6 +446,21 @@ check_bb_profile (basic_block bb, FILE * file, int
                 (flags & TDF_COMMENT) ? ";; " : "", s_indent,
                 (int) lsum, (int) bb->count);
     }
+  if (BB_PARTITION (bb) == BB_COLD_PARTITION)
+    {
+      /* Warn about inconsistencies in the partitioning that are
+         currently caused by profile insanities created via optimization.  */
+      if (!probably_never_executed_bb_p (fun, bb))
+        fprintf (file, "%s%sBlock in cold partition with hot count\n",
+                 (flags & TDF_COMMENT) ? ";; " : "", s_indent);
+      FOR_EACH_EDGE (e, ei, bb->preds)
+        {
+          if (!probably_never_executed_edge_p (fun, e))
+            fprintf (file,
+                     "%s%sBlock in cold partition with incoming hot edge\n",
+                     (flags & TDF_COMMENT) ? ";; " : "", s_indent);
+        }
+    }
 }
 ^L
 void




-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-30  7:17                                       ` Teresa Johnson
@ 2013-08-30  9:16                                         ` Jan Hubicka
  2013-08-30 15:13                                           ` Teresa Johnson
  0 siblings, 1 reply; 62+ messages in thread
From: Jan Hubicka @ 2013-08-30  9:16 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, marxin.liska, Sriraman Tallam,
	Rong Xu

> Great! Is this the LTO merging you were talking about in an earlier
> message, or the gcov runtime fix (that would presumably not be
> lto-specific)?

It is the LTO path - we need to merge profiles there anyway for his code unification
work.

> > I have a patch to track this.  Moreover vforks seem to produce repeated
> > merging of results.
> 
> Aha, does this explain the gimp issue as well?

Not really - we still need to debug why we hit the cold section so many times with
partitioning.  I still think the easier approach will be to lock the cold section and
then probably start with the testsuite (i.e. write a script to compile the small testcases
with FDO + partitioning and see what crashes by hitting the cold section).

> >
> > Is it really necessary to run this from the internal loop of the cfgcleanup?
> 
> The reason I added it here is that just below there is a call to
> verify_flow_info, and that will fail with the new verification.

Hmm, OK, I suppose we run the cleanup after partitioning just once or twice, right?
We can track this incrementally - I am not sure we would get anything substantial
by pulling it out of the internal iteration loop either.
Removing unreachable blocks twice is however ugly.

> +/* Ensure that all hot bbs are included in a hot path through the
> +   procedure. This is done by calling this function twice, once
> +   with WALK_UP true (to look for paths from the entry to hot bbs) and
> +   once with WALK_UP false (to look for paths from hot bbs to the exit).
> +   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
> +   to BBS_IN_HOT_PARTITION.  */
> +
> +static unsigned int
> +sanitize_hot_paths (bool walk_up, unsigned int cold_bb_count,
> +                    vec<basic_block> *bbs_in_hot_partition)
> +{
> +  /* Callers check this.  */
> +  gcc_checking_assert (cold_bb_count);
> +
> +  /* Keep examining hot bbs while we still have some left to check
> +     and there are remaining cold bbs.  */
> +  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
> +  while (! hot_bbs_to_check.is_empty ()
> +         && cold_bb_count)
> +    {
> +      basic_block bb = hot_bbs_to_check.pop ();
> +      vec<edge, va_gc> *edges = walk_up ? bb->preds : bb->succs;
> +      edge e;
> +      edge_iterator ei;
> +      int highest_probability = 0;
> +      int highest_freq = 0;
> +      gcov_type highest_count = 0;
> +      bool found = false;
> +
> +      /* Walk the preds/succs and check if there is at least one already
> +         marked hot. Keep track of the most frequent pred/succ so that we
> +         can mark it hot if we don't find one.  */
> +      FOR_EACH_EDGE (e, ei, edges)
> +        {
> +          basic_block reach_bb = walk_up ? e->src : e->dest;
> +
> +          if (e->flags & EDGE_DFS_BACK)
> +            continue;
> +
> +          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
> +          {
> +            found = true;
> +            break;
> +          }
> +          /* The following loop will look for the hottest edge via
> +             the edge count, if it is non-zero, then fallback to the edge
> +             frequency and finally the edge probability.  */
> +          if (e->count > highest_count)
> +            highest_count = e->count;
> +          int edge_freq = EDGE_FREQUENCY (e);
> +          if (edge_freq > highest_freq)
> +            highest_freq = edge_freq;
> +          if (e->probability > highest_probability)
> +            highest_probability = e->probability;
> +        }
> +
> +      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
> +         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
> +         then the most frequent pred (or succ) needs to be adjusted.  In the
> +         case where multiple preds/succs have the same frequency (e.g. a
> +         50-50 branch), then both will be adjusted.  */
> +      if (found)
> +        continue;
> +
> +      FOR_EACH_EDGE (e, ei, edges)
> +        {
> +          if (e->flags & EDGE_DFS_BACK)
> +            continue;
> +          /* Select the hottest edge using the edge count, if it is non-zero,
> +             then fallback to the edge frequency and finally the edge
> +             probability.  */
> +          if (highest_count)
> +            {
> +              if (e->count < highest_count)
> +                continue;
> +            }
> +          else if (highest_freq)

The frequency condition needs to be done only when you walk predecessors - when
you walk down the edge probabilities are just fine.

The patch seems OK to me now.  I will make our FDO tester use partitioning so we get
this benchmarked a bit.

Honza

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-30  9:16                                         ` Jan Hubicka
@ 2013-08-30 15:13                                           ` Teresa Johnson
  2013-08-30 15:28                                             ` Jan Hubicka
  2013-08-30 21:56                                             ` Rong Xu
  0 siblings, 2 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-08-30 15:13 UTC (permalink / raw)
  To: Jan Hubicka, Rong Xu
  Cc: Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher, Jeff Law,
	marxin.liska, Sriraman Tallam

[-- Attachment #1: Type: text/plain, Size: 5789 bytes --]

Can someone review and ok the attached patch for trunk? It has been
bootstrapped and tested on x86-64-unknown-linux-gnu, and also tested
with -freorder-blocks-and-partition enabled for a profiledbootstrap.

(Honza, see more responses inlined below. Rong, please see note below as well).

Thanks,
Teresa

On Fri, Aug 30, 2013 at 2:14 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Great! Is this the LTO merging you were talking about in an earlier
>> message, or the gcov runtime fix (that would presumably not be
>> lto-specific)?
>
> It is the LTO path - we need to merge profiles there anyway for his code unification
> work.

Rong - can you send a summary of the approach you are working on? Is
it LIPO-specific?

>
>> > I have a patch to track this.  Moreover vforks seem to produce repeated
>> > merging of results.
>>
>> Aha, does this explain the gimp issue as well?
>
> Not really - we still need to debug why we hit the cold section so many times with
> partitioning.  I still think the easier approach will be to lock the cold section and
> then probably start with the testsuite (i.e. write a script to compile the small testcases
> with FDO + partitioning and see what crashes by hitting the cold section).

Ok, that is on my todo list.

>
>> >
>> > Is it really necessary to run this from the internal loop of the cfgcleanup?
>>
>> The reason I added it here is that just below there is a call to
>> verify_flow_info, and that will fail with the new verification.
>
> Hmm, OK, I suppose we run the cleanup after partitioning just once or twice, right?
> We can track this incrementally - I am not sure we would get anything substantial
> by pulling it out of the internal iteration loop either.
> Removing unreachable blocks twice is however ugly.

When I was debugging the issue that led to this change I seemed to see
1-2 iterations typically, although I haven't measured it
scientifically. It would be good to revisit that and see if we can
pull both parts out of the loop, but as a separate patch.

>
>> +/* Ensure that all hot bbs are included in a hot path through the
>> +   procedure. This is done by calling this function twice, once
>> +   with WALK_UP true (to look for paths from the entry to hot bbs) and
>> +   once with WALK_UP false (to look for paths from hot bbs to the exit).
>> +   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
>> +   to BBS_IN_HOT_PARTITION.  */
>> +
>> +static unsigned int
>> +sanitize_hot_paths (bool walk_up, unsigned int cold_bb_count,
>> +                    vec<basic_block> *bbs_in_hot_partition)
>> +{
>> +  /* Callers check this.  */
>> +  gcc_checking_assert (cold_bb_count);
>> +
>> +  /* Keep examining hot bbs while we still have some left to check
>> +     and there are remaining cold bbs.  */
>> +  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
>> +  while (! hot_bbs_to_check.is_empty ()
>> +         && cold_bb_count)
>> +    {
>> +      basic_block bb = hot_bbs_to_check.pop ();
>> +      vec<edge, va_gc> *edges = walk_up ? bb->preds : bb->succs;
>> +      edge e;
>> +      edge_iterator ei;
>> +      int highest_probability = 0;
>> +      int highest_freq = 0;
>> +      gcov_type highest_count = 0;
>> +      bool found = false;
>> +
>> +      /* Walk the preds/succs and check if there is at least one already
>> +         marked hot. Keep track of the most frequent pred/succ so that we
>> +         can mark it hot if we don't find one.  */
>> +      FOR_EACH_EDGE (e, ei, edges)
>> +        {
>> +          basic_block reach_bb = walk_up ? e->src : e->dest;
>> +
>> +          if (e->flags & EDGE_DFS_BACK)
>> +            continue;
>> +
>> +          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
>> +          {
>> +            found = true;
>> +            break;
>> +          }
>> +          /* The following loop will look for the hottest edge via
>> +             the edge count, if it is non-zero, then fallback to the edge
>> +             frequency and finally the edge probability.  */
>> +          if (e->count > highest_count)
>> +            highest_count = e->count;
>> +          int edge_freq = EDGE_FREQUENCY (e);
>> +          if (edge_freq > highest_freq)
>> +            highest_freq = edge_freq;
>> +          if (e->probability > highest_probability)
>> +            highest_probability = e->probability;
>> +        }
>> +
>> +      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
>> +         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
>> +         then the most frequent pred (or succ) needs to be adjusted.  In the
>> +         case where multiple preds/succs have the same frequency (e.g. a
>> +         50-50 branch), then both will be adjusted.  */
>> +      if (found)
>> +        continue;
>> +
>> +      FOR_EACH_EDGE (e, ei, edges)
>> +        {
>> +          if (e->flags & EDGE_DFS_BACK)
>> +            continue;
>> +          /* Select the hottest edge using the edge count, if it is non-zero,
>> +             then fallback to the edge frequency and finally the edge
>> +             probability.  */
>> +          if (highest_count)
>> +            {
>> +              if (e->count < highest_count)
>> +                continue;
>> +            }
>> +          else if (highest_freq)
>
> The frequency condition needs to be done only when you walk predecessors - when
> you walk down the edge probabilities are just fine.

True. For simplicity I think it should be fine to leave as-is so there
isn't more special casing as the current approach works in both
directions.
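
For intuition, here is a small made-up illustration (not from any
testcase) of why a raw probability can mislead when walking up to
predecessors:

/* pred A: executed 1 time,        edge A->bb probability 100%  -> edge count 1
   pred B: executed 1000000 times, edge B->bb probability 10%   -> edge count 100000
   Walking up from bb, B is the hot pred despite its lower probability,
   which is why the loop prefers counts, then frequencies, and falls
   back to probabilities only when both are zero.  */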

>
> The patch seems OK to me now.  I will make our FDO tester use partitioning so we get
> this benchmarked a bit.

Ok thanks.

>
> Honza



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

[-- Attachment #2: patch.diff --]
[-- Type: application/octet-stream, Size: 19380 bytes --]

Patch 3 of 3 split out from the patch I sent in May that fixes problems with
-freorder-blocks-and-partition, with changes/fixes discussed in that thread.

See http://gcc.gnu.org/ml/gcc-patches/2013-05/threads.html#00388 for more context.

This patch sanitizes the partitioning to address issues such as edge
weight insanities that sometimes occur due to upstream optimizations,
and ensures that hot blocks are not dominated by cold blocks. This
needs to be resanitized after certain cfg optimizations that may
cause hot blocks previously reached via both hot and cold paths to
only be reached by cold paths.

The verification code in find_partition_fixes was contributed by
Steven Bosscher.

2013-08-29  Teresa Johnson  <tejohnson@google.com>
            Steven Bosscher  <steven@gcc.gnu.org>

	* cfgrtl.c (fixup_new_cold_bb): New routine.
	(commit_edge_insertions): Invoke fixup_partitions.
	(find_partition_fixes): New routine.
	(fixup_partitions): Ditto.
	(verify_hot_cold_block_grouping): Update comments.
	(rtl_verify_edges): Invoke find_partition_fixes.
	(rtl_verify_bb_pointers): Update comments.
	(rtl_verify_bb_layout): Ditto.
	* basic-block.h (probably_never_executed_edge_p): Declare.
	(fixup_partitions): Ditto.
	* cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
	* bb-reorder.c (sanitize_hot_paths): New function.
	(find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
	sanitize_hot_paths.
	* predict.c (probably_never_executed_edge_p): New routine.
	* cfg.c (check_bb_profile): Add partition insanity warnings.

Index: cfgrtl.c
===================================================================
--- cfgrtl.c	(revision 202021)
+++ cfgrtl.c	(working copy)
@@ -1358,6 +1358,43 @@ fixup_partition_crossing (edge e)
     }
 }
 
+/* Called when block BB has been reassigned to the cold partition,
+   because it is now dominated by another cold block,
+   to ensure that the region crossing attributes are updated.  */
+
+static void
+fixup_new_cold_bb (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  /* This is called when a hot bb is found to now be dominated
+     by a cold bb and therefore needs to become cold. Therefore,
+     its preds will no longer be region crossing. Any non-dominating
+     preds that were previously hot would also have become cold
+     in the caller for the same region. Any preds that were previously
+     region-crossing will be adjusted in fixup_partition_crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      fixup_partition_crossing (e);
+    }
+
+  /* Possibly need to make bb's successor edges region crossing,
+     or remove stale region crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    {
+      /* We can't have fall-through edges across partition boundaries.
+         Note that force_nonfallthru will do any necessary partition
+         boundary fixup by calling fixup_partition_crossing itself.  */
+      if ((e->flags & EDGE_FALLTHRU)
+          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
+          && e->dest != EXIT_BLOCK_PTR)
+        force_nonfallthru (e);
+      else
+        fixup_partition_crossing (e);
+    }
+}
+
 /* Attempt to change code to redirect edge E to TARGET.  Don't do that on
    expense of adding new instructions or reordering basic blocks.
 
@@ -1996,6 +2033,14 @@ commit_edge_insertions (void)
 {
   basic_block bb;
 
+  /* Optimization passes that invoke this routine can cause hot blocks
+     previously reached by both hot and cold blocks to become dominated only
+     by cold blocks. This will cause the verification below to fail,
+     and lead to now cold code in the hot section. In some cases this
+     may only be visible after newly unreachable blocks are deleted,
+     which will be done by fixup_partitions.  */
+  fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
   verify_flow_info ();
 #endif
@@ -2190,6 +2235,101 @@ get_last_bb_insn (basic_block bb)
   return end;
 }
 
+/* Sanity check partition hotness to ensure that basic blocks in
+   the cold partition don't dominate basic blocks in the hot partition.
+   If FLAG_ONLY is true, report violations as errors. Otherwise
+   re-mark the dominated blocks as cold, since this is run after
+   cfg optimizations that may make hot blocks previously reached
+   by both hot and cold blocks now only reachable along cold paths.  */
+
+static vec<basic_block>
+find_partition_fixes (bool flag_only)
+{
+  basic_block bb;
+  vec<basic_block> bbs_in_cold_partition = vNULL;
+  vec<basic_block> bbs_to_fix = vNULL;
+
+  /* Callers check this.  */
+  gcc_checking_assert (crtl->has_bb_partition);
+
+  FOR_EACH_BB (bb)
+    if ((BB_PARTITION (bb) == BB_COLD_PARTITION))
+      bbs_in_cold_partition.safe_push (bb);
+
+  if (bbs_in_cold_partition.is_empty ())
+    return vNULL;
+
+  bool dom_calculated_here = !dom_info_available_p (CDI_DOMINATORS);
+
+  if (dom_calculated_here)
+    calculate_dominance_info (CDI_DOMINATORS);
+
+  while (! bbs_in_cold_partition.is_empty  ())
+    {
+      bb = bbs_in_cold_partition.pop ();
+      /* Any blocks dominated by a block in the cold section
+         must also be cold.  */
+      basic_block son;
+      for (son = first_dom_son (CDI_DOMINATORS, bb);
+           son;
+           son = next_dom_son (CDI_DOMINATORS, son))
+        {
+          /* If son is not yet cold, then mark it cold here and
+             enqueue it for further processing.  */
+          if ((BB_PARTITION (son) != BB_COLD_PARTITION))
+            {
+              if (flag_only)
+                error ("non-cold basic block %d dominated "
+                       "by a block in the cold partition (%d)", son->index, bb->index);
+              else
+                BB_SET_PARTITION (son, BB_COLD_PARTITION);
+              bbs_to_fix.safe_push (son);
+              bbs_in_cold_partition.safe_push (son);
+            }
+        }
+    }
+
+  if (dom_calculated_here)
+    free_dominance_info (CDI_DOMINATORS);
+
+  return bbs_to_fix;
+}
+
+/* Perform cleanup on the hot/cold bb partitioning after optimization
+   passes that modify the cfg.  */
+
+void
+fixup_partitions (void)
+{
+  basic_block bb;
+
+  if (!crtl->has_bb_partition)
+    return;
+
+  /* Delete any blocks that became unreachable and weren't
+     already cleaned up, for example during edge forwarding
+     and convert_jumps_to_returns. This will expose more
+     opportunities for fixing the partition boundaries here.
+     Also, the calculation of the dominance graph during verification
+     will assert if there are unreachable nodes.  */
+  delete_unreachable_blocks ();
+
+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.
+     Fixup any that now violate this requirement, as a result of edge
+     forwarding and unreachable block deletion.  */
+  vec<basic_block> bbs_to_fix = find_partition_fixes (false);
+
+  /* Do the partition fixup after all necessary blocks have been converted to
+     cold, so that we only update the region crossings the minimum number of
+     places, which can require forcing edges to be non fallthru.  */
+  while (! bbs_to_fix.is_empty ())
+    {
+      bb = bbs_to_fix.pop ();
+      fixup_new_cold_bb (bb);
+    }
+}
+
 /* Verify, in the basic block chain, that there is at most one switch
    between hot/cold partitions. This condition will not be true until
    after reorder_basic_blocks is called.  */
@@ -2236,7 +2376,8 @@ verify_hot_cold_block_grouping (void)
 /* Perform several checks on the edges out of each block, such as
    the consistency of the branch probabilities, the correctness
    of hot/cold partition crossing edges, and the number of expected
-   successor edges.  */
+   successor edges.  Also verify that the dominance relationship
+   between hot/cold blocks is sane.  */
 
 static int
 rtl_verify_edges (void)
@@ -2399,6 +2540,14 @@ rtl_verify_edges (void)
 	}
     }
 
+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.  */
+  if (crtl->has_bb_partition && !err)
+    {
+      vec<basic_block> bbs_to_fix = find_partition_fixes (true);
+      err = !bbs_to_fix.is_empty ();
+    }
+
   /* Clean up.  */
   return err;
 }
@@ -2532,7 +2681,7 @@ rtl_verify_bb_pointers (void)
      and NOTE_INSN_BASIC_BLOCK
    - verify that no fall_thru edge crosses hot/cold partition boundaries
    - verify that there are no pending RTL branch predictions
-   - verify that there is a single hot/cold partition boundary after bbro
+   - verify that hot blocks are not dominated by cold blocks
 
    In future it can be extended check a lot of other stuff as well
    (reachability of basic blocks, life information, etc. etc.).  */
@@ -2778,7 +2927,8 @@ rtl_verify_bb_layout (void)
    - check that all insns are in the basic blocks
      (except the switch handling code, barriers and notes)
    - check that all returns are followed by barriers
-   - check that all fallthru edge points to the adjacent blocks.  */
+   - check that all fallthru edge points to the adjacent blocks
+   - verify that there is a single hot/cold partition boundary after bbro  */
 
 static int
 rtl_verify_flow_info (void)
Index: basic-block.h
===================================================================
--- basic-block.h	(revision 202021)
+++ basic-block.h	(working copy)
@@ -726,6 +726,7 @@ extern void compute_available (sbitmap *, sbitmap
 extern bool maybe_hot_bb_p (struct function *, const_basic_block);
 extern bool maybe_hot_edge_p (edge);
 extern bool probably_never_executed_bb_p (struct function *, const_basic_block);
+extern bool probably_never_executed_edge_p (struct function *, edge);
 extern bool optimize_bb_for_size_p (const_basic_block);
 extern bool optimize_bb_for_speed_p (const_basic_block);
 extern bool optimize_edge_for_size_p (edge);
@@ -797,6 +798,7 @@ extern bool contains_no_active_insn_p (const_basic
 extern bool forwarder_block_p (const_basic_block);
 extern bool can_fallthru (basic_block, basic_block);
 extern void emit_barrier_after_bb (basic_block bb);
+extern void fixup_partitions (void);
 
 /* In cfgbuild.c.  */
 extern void find_many_sub_basic_blocks (sbitmap);
Index: cfgcleanup.c
===================================================================
--- cfgcleanup.c	(revision 202021)
+++ cfgcleanup.c	(working copy)
@@ -2807,10 +2807,21 @@ try_optimize_cfg (int mode)
 	      df_analyze ();
 	    }
 
+	  if (changed)
+            {
+              /* Edge forwarding in particular can cause hot blocks previously
+                 reached by both hot and cold blocks to become dominated only
+                 by cold blocks. This will cause the verification below to fail,
+                 and lead to now cold code in the hot section. This is not easy
+                 to detect and fix during edge forwarding, and in some cases
+                 is only visible after newly unreachable blocks are deleted,
+                 which will be done in fixup_partitions.  */
+              fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
-	  if (changed)
-	    verify_flow_info ();
+              verify_flow_info ();
 #endif
+            }
 
 	  changed_overall |= changed;
 	  first_pass = false;
Index: bb-reorder.c
===================================================================
--- bb-reorder.c	(revision 202021)
+++ bb-reorder.c	(working copy)
@@ -1444,27 +1444,157 @@ fix_up_crossing_landing_pad (eh_landing_pad old_lp
       ei_next (&ei);
 }
 
+
+/* Ensure that all hot bbs are included in a hot path through the
+   procedure. This is done by calling this function twice, once
+   with WALK_UP true (to look for paths from the entry to hot bbs) and
+   once with WALK_UP false (to look for paths from hot bbs to the exit).
+   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
+   to BBS_IN_HOT_PARTITION.  */
+
+static unsigned int
+sanitize_hot_paths (bool walk_up, unsigned int cold_bb_count,
+                    vec<basic_block> *bbs_in_hot_partition)
+{
+  /* Callers check this.  */
+  gcc_checking_assert (cold_bb_count);
+
+  /* Keep examining hot bbs while we still have some left to check
+     and there are remaining cold bbs.  */
+  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
+  while (! hot_bbs_to_check.is_empty ()
+         && cold_bb_count)
+    {
+      basic_block bb = hot_bbs_to_check.pop ();
+      vec<edge, va_gc> *edges = walk_up ? bb->preds : bb->succs;
+      edge e;
+      edge_iterator ei;
+      int highest_probability = 0;
+      int highest_freq = 0;
+      gcov_type highest_count = 0;
+      bool found = false;
+
+      /* Walk the preds/succs and check if there is at least one already
+         marked hot. Keep track of the most frequent pred/succ so that we
+         can mark it hot if we don't find one.  */
+      FOR_EACH_EDGE (e, ei, edges)
+        {
+          basic_block reach_bb = walk_up ? e->src : e->dest;
+
+          if (e->flags & EDGE_DFS_BACK)
+            continue;
+
+          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
+          {
+            found = true;
+            break;
+          }
+          /* The following loop will look for the hottest edge via
+             the edge count, if it is non-zero, then fallback to the edge
+             frequency and finally the edge probability.  */
+          if (e->count > highest_count)
+            highest_count = e->count;
+          int edge_freq = EDGE_FREQUENCY (e);
+          if (edge_freq > highest_freq)
+            highest_freq = edge_freq;
+          if (e->probability > highest_probability)
+            highest_probability = e->probability;
+        }
+
+      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
+         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
+         then the most frequent pred (or succ) needs to be adjusted.  In the
+         case where multiple preds/succs have the same frequency (e.g. a
+         50-50 branch), then both will be adjusted.  */
+      if (found)
+        continue;
+
+      FOR_EACH_EDGE (e, ei, edges)
+        {
+          if (e->flags & EDGE_DFS_BACK)
+            continue;
+          /* Select the hottest edge using the edge count, if it is non-zero,
+             then fallback to the edge frequency and finally the edge
+             probability.  */
+          if (highest_count)
+            {
+              if (e->count < highest_count)
+                continue;
+            }
+          else if (highest_freq)
+            {
+              if (EDGE_FREQUENCY (e) < highest_freq)
+                continue;
+            }
+          else if (e->probability < highest_probability)
+            continue;
+
+          basic_block reach_bb = walk_up ? e->src : e->dest;
+
+          /* We have a hot bb with an immediate dominator that is cold.
+             The dominator needs to be re-marked hot.  */
+          BB_SET_PARTITION (reach_bb, BB_HOT_PARTITION);
+          cold_bb_count--;
+
+          /* Now we need to examine newly-hot reach_bb to see if it is also
+             dominated by a cold bb.  */
+          bbs_in_hot_partition->safe_push (reach_bb);
+          hot_bbs_to_check.safe_push (reach_bb);
+        }
+    }
+
+  return cold_bb_count;
+}
+
+
 /* Find the basic blocks that are rarely executed and need to be moved to
    a separate section of the .o file (to cut down on paging and improve
    cache locality).  Return a vector of all edges that cross.  */
 
-static vec<edge> 
+static vec<edge>
 find_rarely_executed_basic_blocks_and_crossing_edges (void)
 {
   vec<edge> crossing_edges = vNULL;
   basic_block bb;
   edge e;
   edge_iterator ei;
+  unsigned int cold_bb_count = 0;
+  vec<basic_block> bbs_in_hot_partition = vNULL;
 
   /* Mark which partition (hot/cold) each basic block belongs in.  */
   FOR_EACH_BB (bb)
     {
       if (probably_never_executed_bb_p (cfun, bb))
-	BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+          cold_bb_count++;
+        }
       else
-	BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+          bbs_in_hot_partition.safe_push (bb);
+        }
     }
 
+  /* Ensure that hot bbs are included along a hot path from the entry to exit.
+     Several different possibilities may include cold bbs along all paths
+     to/from a hot bb. One is that there are edge weight insanities
+     due to optimization phases that do not properly update basic block profile
+     counts. The second is that the entry of the function may not be hot, because
+     it is entered fewer times than the number of profile training runs, but there
+     is a loop inside the function that causes blocks within the function to be
+     above the threshold for hotness. This is fixed by walking up from hot bbs
+     to the entry block, and then down from hot bbs to the exit, performing
+     partitioning fixups as necessary.  */
+  if (cold_bb_count)
+    {
+      mark_dfs_back_edges ();
+      cold_bb_count = sanitize_hot_paths (true, cold_bb_count,
+                                          &bbs_in_hot_partition);
+      if (cold_bb_count)
+        sanitize_hot_paths (false, cold_bb_count, &bbs_in_hot_partition);
+    }
+
   /* The format of .gcc_except_table does not allow landing pads to
      be in a different partition as the throw.  Fix this by either
      moving or duplicating the landing pads.  */
Index: predict.c
===================================================================
--- predict.c	(revision 202021)
+++ predict.c	(working copy)
@@ -241,6 +241,22 @@ probably_never_executed_bb_p (struct function *fun
   return false;
 }
 
+
+/* Return true in case edge E is probably never executed.  */
+
+bool
+probably_never_executed_edge_p (struct function *fun, edge e)
+{
+  gcc_checking_assert (fun);
+  if (profile_info && flag_branch_probabilities)
+    return ((e->count + profile_info->runs / 2) / profile_info->runs) == 0;
+  if ((!profile_info || !flag_branch_probabilities)
+      && (cgraph_get_node (fun->decl)->frequency
+	  == NODE_FREQUENCY_UNLIKELY_EXECUTED))
+    return true;
+  return false;
+}
+
 /* Return true if NODE should be optimized for size.  */
 
 bool
Index: cfg.c
===================================================================
--- cfg.c	(revision 202021)
+++ cfg.c	(working copy)
@@ -446,6 +446,21 @@ check_bb_profile (basic_block bb, FILE * file, int
 		 (flags & TDF_COMMENT) ? ";; " : "", s_indent,
 		 (int) lsum, (int) bb->count);
     }
+  if (BB_PARTITION (bb) == BB_COLD_PARTITION)
+    {
+      /* Warn about inconsistencies in the partitioning that are
+         currently caused by profile insanities created via optimization.  */
+      if (!probably_never_executed_bb_p (fun, bb))
+        fprintf (file, "%s%sBlock in cold partition with hot count\n",
+                 (flags & TDF_COMMENT) ? ";; " : "", s_indent);
+      FOR_EACH_EDGE (e, ei, bb->preds)
+        {
+          if (!probably_never_executed_edge_p (fun, e))
+            fprintf (file,
+                     "%s%sBlock in cold partition with incoming hot edge\n",
+                     (flags & TDF_COMMENT) ? ";; " : "", s_indent);
+        }
+    }
 }
 \f
 void

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-30 15:13                                           ` Teresa Johnson
@ 2013-08-30 15:28                                             ` Jan Hubicka
  2013-08-30 15:54                                               ` Teresa Johnson
  2013-08-30 21:56                                             ` Rong Xu
  1 sibling, 1 reply; 62+ messages in thread
From: Jan Hubicka @ 2013-08-30 15:28 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Rong Xu, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, marxin.liska, Sriraman Tallam

> >
> > The frequency condition needs to be done only when you walk predecessors - when
> > you walk down the edge probabilities are just fine.
> 
> True. For simplicity I think it should be fine to leave as-is so there
> isn't more special casing as the current approach works in both
> directions.

Yep, you are right. Frequencies are safe in both directions.

I think this change belongs to the profile feedback category, so the patch is OK.

Honza

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-30 15:28                                             ` Jan Hubicka
@ 2013-08-30 15:54                                               ` Teresa Johnson
  0 siblings, 0 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-08-30 15:54 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Rong Xu, Bernhard Reutner-Fischer, gcc-patches, Steven Bosscher,
	Jeff Law, marxin.liska, Sriraman Tallam

On Fri, Aug 30, 2013 at 8:26 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >
>> > The frequency condition needs to be done only when you walk predecessors - when
>> > you walk down the edge probabilities are just fine.
>>
>> True. For simplicity I think it should be fine to leave as-is so there
>> isn't more special casing as the current approach works in both
>> directions.
>
> Yep, you are right. Frequencies are safe in both directions.
>
> I think this change belongs to the profile feedback category, so the patch is OK.

ok, thanks.

Teresa

>
> Honza



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-30 15:13                                           ` Teresa Johnson
  2013-08-30 15:28                                             ` Jan Hubicka
@ 2013-08-30 21:56                                             ` Rong Xu
  1 sibling, 0 replies; 62+ messages in thread
From: Rong Xu @ 2013-08-30 21:56 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, marxin.liska, Sriraman Tallam

On Fri, Aug 30, 2013 at 7:50 AM, Teresa Johnson <tejohnson@google.com> wrote:
> Can someone review and ok the attached patch for trunk? It has been
> bootstrapped and tested on x86-64-unknown-linux-gnu, and also tested
> with -freorder-blocks-and-partition enabled for a profiledbootstrap.
>
> (Honza, see more responses inlined below. Rong, please see note below as well).
>
> Thanks,
> Teresa
>
> On Fri, Aug 30, 2013 at 2:14 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> Great! Is this the LTO merging you were talking about in an earlier
>>> message, or the gcov runtime fix (that would presumably not be
>>> lto-specific)?
>>
>> It is the LTO path - we need to merge profiles there anyway for his code unification
>> work.
>
> Rong - can you send a summary of the approach you are working on? Is
> it LIPO-specific?

I'm also working to improve COMDAT handling in FDO/LIPO.  It's
applicable to regular FDO.
Our motivating case is different from the case discussed here:
  COMDAT function F is defined in module A and is picked by the linker as
the out-of-line copy in the profile-gen phase.
It gets all the counters. In the profile-use compilation F may not be
emitted in A (e.g. when all the callsites are ipa-inlined in A), so
we choose the instance of F in module B as the out-of-line copy, and it
does not get optimized.

We are seeing more of this kind of case in LIPO due to the multiple
COMDAT copies brought in by the auxiliary modules.

Since a COMDAT function may be inlined after instrumentation, multiple
copies of the counters may co-exist.
We want to differentiate inlined-copy counters from out-of-line-copy
counters.
(In LIPO, we actually encourage inlining of COMDATs in IPA-inline to
get more context-sensitive counters.)

Our current solution is to have another instrumentation only for
COMDAT functions:
* For each comdat_key, we create a global var pointing to the
gcov_fn_info of the out-of-line copy.
* This global var is initialized by instrumentation code placed at the
function entry, to the gcov_fn_info of the current module.
   This is post-IPA-inline instrumentation, so at most one such
instrumentation (the one picked by the linker) is executed.
* Extend gcov_fn_info to point to the global var.
At run time, we can then tell whether the counters belong to the
out-of-line copy or an inlined copy, and we set a tag in the gcda file
accordingly (when both an out-of-line copy and inlined copies exist,
we treat it as the out-of-line copy).

In the profile-use phase, we use this info to resolve the decls of
COMDAT functions, and also to make sure we only emit
the out-of-line copy of the COMDAT function (even if it's not
referenced in the module).

This has been done and we are testing it now.
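
To make the shape of this concrete, here is a toy sketch (every name in
it is hypothetical and heavily simplified - it is not the real libgcov
interface):

/* sketch.c - toy model of the scheme described above.  */
#include <stdio.h>

struct fn_info_sketch { const char *module; long *counters; };

/* Per-module counters and profile record for COMDAT function f.  */
static long f_counters[4];
static struct fn_info_sketch f_info = { "module-A", f_counters };

/* One global per comdat_key, weak so all modules' copies collapse into
   a single var; at run time it identifies the out-of-line copy.  */
struct fn_info_sketch *f_chosen __attribute__ ((weak));

int
f (int x)                  /* stands in for a COMDAT function */
{
  f_chosen = &f_info;      /* entry-time instrumentation: only the copy
                              the linker kept ever executes this */
  f_counters[0]++;         /* an ordinary arc counter */
  return x + 1;
}

int
main (void)
{
  f (1);
  /* At dump time the runtime could tag f_info's counters as the
     out-of-line ones because f_chosen == &f_info.  */
  printf ("out-of-line counters from %s\n",
          f_chosen ? f_chosen->module : "(inlined everywhere)");
  return 0;
}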

I talked to Teresa about her case some time back. It seems that
merging the counters is an easy change with the above patch because
we have the address of the out-of-line gcov_fn_info. We can do a
simple in-memory merge if the checksum matches. The concern is
that this may reduce the context-sensitive information.

-Rong

>
>>
>>> > I have a patch to track this.  Moreover vforks seem to produce repeated
>>> > merging of results.
>>>
>>> Aha, does this explain the gimp issue as well?
>>
>> Not really - we still need to debug why we hit the cold section so many times with
>> partitioning.  I still think the easier approach will be to lock the cold section and
>> then probably start with the testsuite (i.e. write a script to compile the small testcases
>> with FDO + partitioning and see what crashes by hitting the cold section).
>
> Ok, that is on my todo list.
>
>>
>>> >
>>> > Is it really necessary to run this from the internal loop of the cfgcleanup?
>>>
>>> The reason I added it here is that just below there is a call to
>>> verify_flow_info, and that will fail with the new verification.
>>
>> Hmm, OK, I suppose we run the cleanup after partitioning just once or twice, right?
>> We can track this incrementally - I am not sure we would get anything substantial
>> by pulling it out of the internal iteration loop either.
>> Removing unreachable blocks twice is however ugly.
>
> When I was debugging the issue that led to this change I seemed to see
> 1-2 iterations typically, although I haven't measured it
> scientifically. It would be good to revisit that and see if we can
> pull both parts out of the loop, but as a separate patch.
>
>>
>>> +/* Ensure that all hot bbs are included in a hot path through the
>>> +   procedure. This is done by calling this function twice, once
>>> +   with WALK_UP true (to look for paths from the entry to hot bbs) and
>>> +   once with WALK_UP false (to look for paths from hot bbs to the exit).
>>> +   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
>>> +   to BBS_IN_HOT_PARTITION.  */
>>> +
>>> +static unsigned int
>>> +sanitize_hot_paths (bool walk_up, unsigned int cold_bb_count,
>>> +                    vec<basic_block> *bbs_in_hot_partition)
>>> +{
>>> +  /* Callers check this.  */
>>> +  gcc_checking_assert (cold_bb_count);
>>> +
>>> +  /* Keep examining hot bbs while we still have some left to check
>>> +     and there are remaining cold bbs.  */
>>> +  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
>>> +  while (! hot_bbs_to_check.is_empty ()
>>> +         && cold_bb_count)
>>> +    {
>>> +      basic_block bb = hot_bbs_to_check.pop ();
>>> +      vec<edge, va_gc> *edges = walk_up ? bb->preds : bb->succs;
>>> +      edge e;
>>> +      edge_iterator ei;
>>> +      int highest_probability = 0;
>>> +      int highest_freq = 0;
>>> +      gcov_type highest_count = 0;
>>> +      bool found = false;
>>> +
>>> +      /* Walk the preds/succs and check if there is at least one already
>>> +         marked hot. Keep track of the most frequent pred/succ so that we
>>> +         can mark it hot if we don't find one.  */
>>> +      FOR_EACH_EDGE (e, ei, edges)
>>> +        {
>>> +          basic_block reach_bb = walk_up ? e->src : e->dest;
>>> +
>>> +          if (e->flags & EDGE_DFS_BACK)
>>> +            continue;
>>> +
>>> +          if (BB_PARTITION (reach_bb) != BB_COLD_PARTITION)
>>> +          {
>>> +            found = true;
>>> +            break;
>>> +          }
>>> +          /* The following loop will look for the hottest edge via
>>> +             the edge count, if it is non-zero, then fallback to the edge
>>> +             frequency and finally the edge probability.  */
>>> +          if (e->count > highest_count)
>>> +            highest_count = e->count;
>>> +          int edge_freq = EDGE_FREQUENCY (e);
>>> +          if (edge_freq > highest_freq)
>>> +            highest_freq = edge_freq;
>>> +          if (e->probability > highest_probability)
>>> +            highest_probability = e->probability;
>>> +        }
>>> +
>>> +      /* If bb is reached by (or reaches, in the case of !WALK_UP) another hot
>>> +         block (or unpartitioned, e.g. the entry block) then it is ok. If not,
>>> +         then the most frequent pred (or succ) needs to be adjusted.  In the
>>> +         case where multiple preds/succs have the same frequency (e.g. a
>>> +         50-50 branch), then both will be adjusted.  */
>>> +      if (found)
>>> +        continue;
>>> +
>>> +      FOR_EACH_EDGE (e, ei, edges)
>>> +        {
>>> +          if (e->flags & EDGE_DFS_BACK)
>>> +            continue;
>>> +          /* Select the hottest edge using the edge count, if it is non-zero,
>>> +             then fallback to the edge frequency and finally the edge
>>> +             probability.  */
>>> +          if (highest_count)
>>> +            {
>>> +              if (e->count < highest_count)
>>> +                continue;
>>> +            }
>>> +          else if (highest_freq)
>>
>> The frequency condition needs to be done only when you walk predecessors - when
>> you walk down the edge probabilities are just fine.
>
> True. For simplicity I think it should be fine to leave as-is so there
> isn't more special casing as the current approach works in both
> directions.
>
>>
>> The patch seems OK to me now.  I will make our FDO tester use partitioning so we get
>> this benchmarked a bit.
>
> Ok thanks.
>
>>
>> Honza
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-27 18:12                                 ` Teresa Johnson
  2013-08-28 16:59                                   ` Jan Hubicka
@ 2013-08-31 16:20                                   ` Jan Hubicka
  2013-08-31 23:40                                     ` Jan Hubicka
  1 sibling, 1 reply; 62+ messages in thread
From: Jan Hubicka @ 2013-08-31 16:20 UTC (permalink / raw)
  To: Teresa Johnson
  Cc: Jan Hubicka, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, marxin.liska, Sriraman Tallam

Hi,
With Martin we made a script for testing the profiling failures.
First do

ld --verbose >~/script

then apply

--- /home/jh/script2	2013-08-31 17:59:11.000000000 +0200
+++ /home/jh/script	2013-08-31 17:39:40.000000000 +0200
@@ -1,12 +1,3 @@
-GNU ld (GNU Binutils for Debian) 2.20.1-system.20100303
-  Supported emulations:
-   elf_x86_64
-   elf_i386
-   i386linux
-   elf_l1om
-using internal linker script:
-==================================================
-/* Script for -z combreloc: combine and sort reloc sections */
 OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64",
 	      "elf64-x86-64")
 OUTPUT_ARCH(i386:x86-64)
@@ -55,6 +46,7 @@
     KEEP (*(.init))
   } =0x90909090
   .plt            : { *(.plt) *(.iplt) }
+  .text.unlikely (NOLOAD) : { *(.text.unlikely .text.*_unlikely .text.unlikely.*) }
   .text           :
   {
     *(.text.unlikely .text.*_unlikely)
@@ -218,4 +210,3 @@
 }
 
 
-==================================================

then create t.c as:

#include <stdio.h>

__attribute__ ((noinline))
void t()
{
  printf ("test\n");
}

int main(int argc, char **argv)
{
  if (argc > 1)
    t();
  return 0;
}

and dotests as:

for name in $*
do
rm a.out *.gcda 2>/dev/null
# instrumented build
./xgcc -B ./ -Ofast -fprofile-generate $name --static  2>/dev/null
if [ -f a.out ]
then
# training run (no extra arguments)
./a.out >/dev/null 2>/dev/null || continue
# FDO build with partitioning and the modified (NOLOAD) linker script
./xgcc -B ./ -Ofast -fprofile-use -freorder-blocks-and-partition -Wl,-T,/home/jh/script --static $name   2>/dev/null
# re-run with an argument: executing the unloaded cold section crashes
./a.out t >/dev/null 2>/dev/null || echo FAIL $name
else
echo skip $name
fi
done

Then run:

jh@gcc10:~/trunk/build/gcc$ sh dotests t.c
FAIL t.c

You should get FAIL here if things are fine, because t.c's behaviour depends on
the number of command line arguments: the training run never calls t(), so t()
is split into the cold section, which the modified linker script leaves
unloaded, and the second run (with an argument) then executes it and crashes.
Once that FAILs you can run e.g.

jh@gcc10:~/trunk/build/gcc$ sh dotests ~/trunk/gcc/testsuite/gcc.c-torture/execute/*.c
skip /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000402-1.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
skip /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20010329-1.c

Those that FAIL get the cold section executed.  When you run them again through
dotests you can do gdb a.out and see what function gets split incorrectly.

Honza

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-31 16:20                                   ` Jan Hubicka
@ 2013-08-31 23:40                                     ` Jan Hubicka
  2013-09-24  8:07                                       ` Teresa Johnson
  0 siblings, 1 reply; 62+ messages in thread
From: Jan Hubicka @ 2013-08-31 23:40 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Teresa Johnson, Bernhard Reutner-Fischer, gcc-patches,
	Steven Bosscher, Jeff Law, marxin.liska, Sriraman Tallam

Hi,
I ran my script on the execute testsuite and looked into a few testcases. The problem I found
was roundoff errors - i.e. when expanding a switch we set a 50% chance that the out-of-bound
value is above or below.  Both get rounded to 0, because the switch is executed once
and the value is below.
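
As a standalone sketch of the roundoff (the constants mirror GCC's, the
values are made up):

/* roundoff.c - why a 50-50 switch expansion zeroes both arms.  */
typedef long long gcov_type;
#define REG_BR_PROB_BASE 10000

int
main (void)
{
  int runs = 1;                      /* one training run         */
  gcov_type bb_count = 1;            /* switch bb executed once  */
  int prob = REG_BR_PROB_BASE / 2;   /* 50% on each arm          */
  gcov_type edge_count = bb_count * prob / REG_BR_PROB_BASE;  /* 0 */
  /* The probably-never-executed test then computes
     (edge_count + runs / 2) / runs == (0 + 0) / 1 == 0
     on both arms, including the one that was actually taken.  */
  return (int) ((edge_count + runs / 2) / runs);
}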

Partly this can be fixed by making probably_never_executed consider frequencies when
counts are too coarse:

Index: predict.c
===================================================================
--- predict.c	(revision 202133)
+++ predict.c	(working copy)
@@ -232,8 +232,22 @@ bool
 probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
 {
   gcc_checking_assert (fun);
-  if (profile_info && flag_branch_probabilities)
-    return ((bb->count + profile_info->runs / 2) / profile_info->runs) == 0;
+  if (profile_status_for_function (fun) == PROFILE_READ)
+    {
+      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
+	return false;
+      if (!bb->frequency)
+	return true;
+      if (!ENTRY_BLOCK_PTR->frequency)
+	return false;
+      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
+	{
+	  return (RDIV (bb->frequency * ENTRY_BLOCK_PTR->count,
+		        ENTRY_BLOCK_PTR->frequency)
+		  < REG_BR_PROB_BASE / 4);
+	}
+      return true;
+    }
   if ((!profile_info || !flag_branch_probabilities)
       && (cgraph_get_node (fun->decl)->frequency
 	  == NODE_FREQUENCY_UNLIKELY_EXECUTED))
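
To see what the count part of this changes, a worked example with
made-up numbers (runs == 100):

/* Old test:  (count + 50) / 100      -> any count below 50 is "never".
   New test:  (count * 4 + 50) / 100  -> the cutoff drops to 13:
     count == 13:  (13 * 4 + 50) / 100 == 1  -> not never-executed
     count == 12:  (12 * 4 + 50) / 100 == 0  -> fall through to the
                   frequency-based estimate against the entry block.  */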

In other cases it was mostly loop unrolling in combination with jump threading. So
I modified my script to separately report when the failure happens for a test trained
once and for a test trained a hundred times.

FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c

FAIL1 is failure after one run, FAIL is failure after 100 train runs.
We should take a look at the FAILs and see if there are bugs to fix. For FAIL1
I think it is a kind of design problem: while implementing counts&frequencies
the idea was that small counts do not matter, so integer arithmetic is all
right.
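
(To make the FAIL1 problem concrete - my illustration, not from the original
mail: a branch executed once has count 1; split it 50-50 and each successor
is assigned count 1/2, which integer arithmetic truncates to 0, so both arms
look never-executed even though one of them ran.)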

I wonder if, with the current C++ wonderland, we can't simply switch count
to a better representation - either sreal or fixed point with capping
(the integer overflow issues are tiring, too).

Honza

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-08-31 23:40                                     ` Jan Hubicka
@ 2013-09-24  8:07                                       ` Teresa Johnson
  2013-09-24 13:44                                         ` Jan Hubicka
  2013-09-24 18:28                                         ` Jan Hubicka
  0 siblings, 2 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-09-24  8:07 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, marxin.liska

Hi Honza,

I am finally getting back to working on this after a few weeks of
working on some other priorities.

On Sat, Aug 31, 2013 at 2:46 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> I ran my script on the execute testsuite and looked into a few testcases. The problem I found
> was roundoff errors - i.e. when expanding a switch we set a 50% chance that the out-of-bound
> value is above or below.  Both get rounded to 0, because the switch is executed once
> and the value is below.
>
> Partly this can be fixed by making probably_never_executed consider frequencies when
> counts are too coarse:
>
> Index: predict.c
> ===================================================================
> --- predict.c   (revision 202133)
> +++ predict.c   (working copy)
> @@ -232,8 +232,22 @@ bool
>  probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
>  {
>    gcc_checking_assert (fun);
> -  if (profile_info && flag_branch_probabilities)
> -    return ((bb->count + profile_info->runs / 2) / profile_info->runs) == 0;
> +  if (profile_status_for_function (fun) == PROFILE_READ)
> +    {
> +      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
> +       return false;
> +      if (!bb->frequency)
> +       return true;
> +      if (!ENTRY_BLOCK_PTR->frequency)
> +       return false;
> +      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
> +       {
> +         return (RDIV (bb->frequency * ENTRY_BLOCK_PTR->count,
> +                       ENTRY_BLOCK_PTR->frequency)
> +                 < REG_BR_PROB_BASE / 4);
> +       }
> +      return true;
> +    }
>    if ((!profile_info || !flag_branch_probabilities)
>        && (cgraph_get_node (fun->decl)->frequency
>           == NODE_FREQUENCY_UNLIKELY_EXECUTED))

Did you mean to commit the above change? I see that it went in as part
of r202258 but doesn't show up in the ChangeLog entry for that
revision.

>
> > In other cases it was mostly loop unrolling in combination with jump threading. So
> > I modified my script to separately report when failure happens for a test trained
> > once and for a test trained a hundred times.

Thanks for the linker script. I reproduced your results. I looked at a
couple cases. The first was one that failed after 1 training run only
(20000910-2.c). It was due to jump threading, which you noted was a
problem. For this one I think we can handle it in the partitioning,
since there is an FDO insanity that we could probably treat more
conservatively when splitting.

I looked at one that failed after 100 as well (20031204-1.c). In this
case, it was due to expansion which was creating multiple branches/bbs
from a logical OR and guessing incorrectly on how to assign the
counts:

 if (octets == 4 && (*cp == ':' || *cp == '\0')) {

The (*cp == ':' || *cp == '\0') part looked like the following going
into RTL expansion:

  [20031204-1.c : 31:33] _29 = _28 == 58;
  [20031204-1.c : 31:33] _30 = _28 == 0;
  [20031204-1.c : 31:33] _31 = _29 | _30;
  [20031204-1.c : 31:18] if (_31 != 0)
    goto <bb 16>;
  else
    goto <bb 19>;

where the result of the OR was always true, so bb 16 had a count of
100 and bb 19 a count of 0. When it was expanded, the expanded version
of the above turned into 2 bbs with a branch in between. Both
comparisons were done in the first bb, but the first bb checked
whether the result of the *cp == '\0' compare was true, and if not
branched to the check for whether the *cp == ':' compare was true. It
gave the branch to the second check against ':' a count of 0, so that
bb got a count of 0 and was split out, and put the count of 100 on the
fall through assuming the compare with '\0' always evaluated to true.
In reality, this OR condition was always true because *cp was ':', not
'\0'. Therefore, the count of 0 on the second block with the check for
':' was incorrect, we ended up trying to execute it, and failed.

Presumably we had the correct profile data for both blocks, but the
accuracy was reduced when the OR was represented as a logical
computation with a single branch. We could change the expansion code
to do something different, e.g. treat as a 50-50 branch. But we would
still end up with integer truncation issues when there was a single
training run. But that could be dealt with conservatively in the
bbpart code as I suggested for the jump threading issue above. I.e. a
cold block with incoming non-cold edges conservatively not marked cold
for splitting.
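
Concretely, something like this minimal sketch is what I have in mind for the
partitioning change (conservatively_cold_bb_p is a hypothetical name, not code
from an actual patch):

static bool
conservatively_cold_bb_p (basic_block bb)
{
  edge e;
  edge_iterator ei;

  if (!probably_never_executed_bb_p (cfun, bb))
    return false;

  /* A block with a zero count but a non-cold incoming edge is likely
     an FDO insanity from truncation; keep it hot.  */
  FOR_EACH_EDGE (e, ei, bb->preds)
    if (e->count > 0 || !probably_never_executed_bb_p (cfun, e->src))
      return false;

  return true;
}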

>
> [... FAIL list snipped - same as above ...]
>
> FAIL1 is failure after one run, FAIL is failure after 100 train runs.
> We should take a look at the FAILs and see if there are bugs to fix. For FAIL1
> I think it is a kind of design problem: while implementing counts&frequencies
> the idea was that small counts do not matter, so integer arithmetic is all
> right.
>
> I wonder if, with the current C++ wonderland, we can't simply switch count
> to a better representation - either sreal or fixed point with capping
> (the integer overflow issues are tiring, too).

It also seems like we should be able to detect the profile insanities
caused by integer truncation and handle them conservatively. That
being said, I see some sreal uses already in the profile.c code, so
presumably we could use this for the counts as well if it turns out to
be necessary?

BTW, Rong also implemented his runtime patch to do the COMDAT profile
merging. However, that ended up having some issues that were solvable
but would have caused us to lose all context sensitivity from COMDATS
inlined during the profile-gen build. I am going to go back to solving
this in the profile-use phase as we discussed in the separate thread
on the COMDAT inlining patch I had been working on.

Thanks,
Teresa

>
> Honza



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-09-24  8:07                                       ` Teresa Johnson
@ 2013-09-24 13:44                                         ` Jan Hubicka
  2013-09-24 19:06                                           ` Teresa Johnson
  2013-09-26 20:55                                           ` Rong Xu
  2013-09-24 18:28                                         ` Jan Hubicka
  1 sibling, 2 replies; 62+ messages in thread
From: Jan Hubicka @ 2013-09-24 13:44 UTC (permalink / raw)
  To: Teresa Johnson; +Cc: Jan Hubicka, gcc-patches, marxin.liska

> Hi Honza,
> 
> I am finally getting back to working on this after a few weeks of
> working on some other priorities.

I am also trying to return to this, so good timing ;)
Martin has got smaller C++ programs (Inkscape) to not touch the cold segment
during startup with FDO (w/o partitioning). Firefox still does; I think
the problem is lost samples due to different linker decisions even with LTO
(i.e. the linker picks an object from a .a library at profile-generate time that it
never passes later).

I plan to look into that today.
> 
> Did you mean to commit the above change? I see that it went in as part
> of r202258 but doesn't show up in the ChangeLog entry for that
> revision.

Yes, I meant to check it in, but did not mean to do so w/o a ChangeLog.  I will
fix that.
> 
> >
> > In other cases it was mostly loop unrolling in combination with jump threading. So
> > I modified my script to separately report when failure happens for a test trained
> > once and for a test trained a hundred times.
> 
> Thanks for the linker script. I reproduced your results. I looked at a
> couple cases. The first was one that failed after 1 training run only
> (20000910-2.c). It was due to jump threading, which you noted was a
> problem. For this one I think we can handle it in the partitioning,
> since there is an FDO insanity that we could probably treat more
> conservatively when splitting.

We should fix the roundoff issues - when I was introducing the
frequency/probability/count system I made an assumption that parts of programs
with very low counts do not matter, since they are not part of a hot spot (and I
cared only about the hot spot).  Now we care about identifying unlikely
executed spots and we need to fix this.
> 
> I looked at one that failed after 100 as well (20031204-1.c). In this
> case, it was due to expansion which was creating multiple branches/bbs
> from a logical OR and guessing incorrectly on how to assign the
> counts:
> 
>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
> 
> The (*cp == ':' || *cp == '\0') part looked like the following going
> into RTL expansion:
> 
>   [20031204-1.c : 31:33] _29 = _28 == 58;
>   [20031204-1.c : 31:33] _30 = _28 == 0;
>   [20031204-1.c : 31:33] _31 = _29 | _30;
>   [20031204-1.c : 31:18] if (_31 != 0)
>     goto <bb 16>;
>   else
>     goto <bb 19>;
> 
> where the result of the OR was always true, so bb 16 had a count of
> 100 and bb 19 a count of 0. When it was expanded, the expanded version
> of the above turned into 2 bbs with a branch in between. Both
> comparisons were done in the first bb, but the first bb checked
> whether the result of the *cp == '\0' compare was true, and if not
> branched to the check for whether the *cp == ':' compare was true. It
> gave the branch to the second check against ':' a count of 0, so that
> bb got a count of 0 and was split out, and put the count of 100 on the
> fall through assuming the compare with '\0' always evaluated to true.
> In reality, this OR condition was always true because *cp was ':', not
> '\0'. Therefore, the count of 0 on the second block with the check for
> ':' was incorrect, we ended up trying to execute it, and failed.
> 
> Presumably we had the correct profile data for both blocks, but the
> accuracy was reduced when the OR was represented as a logical
> computation with a single branch. We could change the expansion code
> to do something different, e.g. treat as a 50-50 branch. But we would
> still end up with integer truncation issues when there was a single
> training run. But that could be dealt with conservatively in the
> bbpart code as I suggested for the jump threading issue above. I.e. a
> cold block with incoming non-cold edges conservatively not marked cold
> for splitting.
> 
> >
> > [... FAIL list snipped - same as above ...]
> >
> > FAIL1 is failure after one run, FAIL is failure after 100 train runs.
> > We should take a look at the FAILs and see if there are bugs to fix. For FAIL1
> > I think it is a kind of design problem: while implementing counts&frequencies
> > the idea was that small counts do not matter, so integer arithmetic is all
> > right.
> >
> > I wonder if, with the current C++ wonderland, we can't simply switch count
> > to a better representation - either sreal or fixed point with capping
> > (the integer overflow issues are tiring, too).
> 
> It also seems like we should be able to detect the profile insanities
> caused by integer truncation and handle them conservatively. That
> being said, I see some sreal uses already in the profile.c code, so
> presumably we could use this for the counts as well if it turns out to
> be necessary?

Yes, I was thinking about this, too.  We would need to do some evaluation of
the compile-time implications, but switching counts from gcov_type to a
profile_counter_t that is a typedef to sreal seems a sane idea.

We could switch the CFG code first.  There should not be many hot spots where
counts are involved.  We can offline the common calculations we have already
moved to macros.

We will also need to invent a REG representation for them.  Now we have INT_LIST;
for that we may have an SREAL list and introduce SREAL as a valid RTX argument.
This can be done incrementally.
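
For the fixed-point flavour, the sort of thing I mean is below (purely a
sketch - profile_counter_t, PROFILE_COUNT_MAX and profile_count_scale are
made-up names, and the real work is in auditing all the users):

#include <stdint.h>

typedef struct { uint64_t v; } profile_counter_t;
#define PROFILE_COUNT_MAX ((uint64_t) -1)

/* Scale C by NUM/DEN, rounding to nearest and saturating instead of
   overflowing.  */
static inline profile_counter_t
profile_count_scale (profile_counter_t c, uint64_t num, uint64_t den)
{
  profile_counter_t r;
  if (num && c.v > PROFILE_COUNT_MAX / num)
    r.v = PROFILE_COUNT_MAX;
  else
    r.v = (c.v * num + den / 2) / den;
  return r;
}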
> 
> BTW, Rong also implemented his runtime patch to do the COMDAT profile
> merging. However, that ended up having some issues that were solvable
> but would have caused us to lose all context sensitivity from COMDATS
> inlined during the profile-gen build. I am going to go back to solving
> this in the profile-use phase as we discussed in the separate thread
> on the COMDAT inlining patch I had been working on.

Yes, let's move ahead with this, too.  I think I should dig out the change
that made frequencies be guessed again.

As for COMDAT merging, I would like to see the patch.  I am experimenting
now with a patch to also privatize COMDATs during -fprofile-generate to
avoid the problems with lost profiles mentioned above.

As for context sensitivity, one approach would be to have two sets of
counters for every comdat - one merged globally and one counting local
instances.  We can then always privatize and, at the profile read-in stage,
just clone every comdat and have two instances - one for the offline copy
and one for inlining.
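
In data-structure terms, roughly (illustrative only - none of these names
exist in GCC today):

/* Per-comdat profile: one counter set merged across the whole program
   for the offline copy, one private to this unit's inlined instances.  */
struct comdat_profile
{
  gcov_type *global_counters;
  gcov_type *local_counters;
  unsigned n_counters;
};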

This is not different from how I always wanted to handle GNU extern inlines
(those also have this issue - when you do not inline one, the unit does not see
any profile of it).

We can just tie the two functions together so the "inline" version stays prior
to inlining, and then have the linker redirect to the inline version instead of the
offline version in such cases.  It already knows how to skip aliases, and this is
not terribly different from that.

Honza
> 
> Thanks,
> Teresa
> 
> >
> > Honza
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-09-24  8:07                                       ` Teresa Johnson
  2013-09-24 13:44                                         ` Jan Hubicka
@ 2013-09-24 18:28                                         ` Jan Hubicka
  2013-09-24 18:51                                           ` Teresa Johnson
  1 sibling, 1 reply; 62+ messages in thread
From: Jan Hubicka @ 2013-09-24 18:28 UTC (permalink / raw)
  To: Teresa Johnson; +Cc: Jan Hubicka, gcc-patches, marxin.liska

> 
> I looked at one that failed after 100 as well (20031204-1.c). In this
> case, it was due to expansion which was creating multiple branches/bbs
> from a logical OR and guessing incorrectly on how to assign the
> counts:
> 
>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
> 
> The (*cp == ':' || *cp == '\0') part looked like the following going
> into RTL expansion:
> 
>   [20031204-1.c : 31:33] _29 = _28 == 58;
>   [20031204-1.c : 31:33] _30 = _28 == 0;
>   [20031204-1.c : 31:33] _31 = _29 | _30;
>   [20031204-1.c : 31:18] if (_31 != 0)
>     goto <bb 16>;
>   else
>     goto <bb 19>;
> 
> where the result of the OR was always true, so bb 16 had a count of
> 100 and bb 19 a count of 0. When it was expanded, the expanded version
> of the above turned into 2 bbs with a branch in between. Both
> comparisons were done in the first bb, but the first bb checked
> whether the result of the *cp == '\0' compare was true, and if not
> branched to the check for whether the *cp == ':' compare was true. It
> gave the branch to the second check against ':' a count of 0, so that
> bb got a count of 0 and was split out, and put the count of 100 on the
> fall through assuming the compare with '\0' always evaluated to true.
> In reality, this OR condition was always true because *cp was ':', not
> '\0'. Therefore, the count of 0 on the second block with the check for
> ':' was incorrect, we ended up trying to execute it, and failed.

I see, we produce:
;; if (_26 != 0)  

(insn 94 93 95 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 107 [ D.2184 ])
            (const_int 0 [0]))) a.c:31 -1
     (nil))

(insn 95 94 96 (set (reg:QI 122 [ D.2186 ])
        (eq:QI (reg:CCZ 17 flags)
            (const_int 0 [0]))) a.c:31 -1
     (nil)) 
        
(insn 96 95 97 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 122 [ D.2186 ])
            (const_int 0 [0]))) a.c:31 -1
     (nil))

(jump_insn 97 96 98 (set (pc)
        (if_then_else (ne (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (label_ref 100)
            (pc))) a.c:31 -1
     (expr_list:REG_BR_PROB (const_int 6100 [0x17d4])
        (nil)))
     
(insn 98 97 99 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 108 [ D.2186 ])
            (const_int 0 [0]))) a.c:31 -1 
     (nil)) 
     
(jump_insn 99 98 100 (set (pc)
        (if_then_else (eq (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (label_ref 0)
            (pc))) a.c:31 -1
     (expr_list:REG_BR_PROB (const_int 3900 [0xf3c])
        (nil)))

(code_label 100 99 0 14 "" [0 uses])

That is because we TER together "_26 = _25 | _24" and "if (_26 != 0)"

First I think the logic of do_jump should really be moved to trees.  It is not
doing anything that could not be adequately represented by gimple.

I am not that certain we want to move it before profiling though.
> 
> Presumably we had the correct profile data for both blocks, but the
> accuracy was reduced when the OR was represented as a logical
> computation with a single branch. We could change the expansion code
> to do something different, e.g. treat as a 50-50 branch. But we would
> still end up with integer truncation issues when there was a single
> training run. But that could be dealt with conservatively in the

Yep, but it is still better than what we have now - if the test above was
in a hot part of the program (i.e. not executed just once), we will end up
optimizing the second conditional for size.

So I think it is a do_jump bug not to distribute the probabilities across the two
conditionals introduced.
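
(To illustrate what distributing could mean - my arithmetic, not what do_jump
does today: if the combined (A || B) jump is taken with measured probability
p, giving each of the two introduced conditionals the same taken probability
q with q + (1 - q) * q == p, i.e. q = 1 - sqrt (1 - p), would at least keep
the combined outcome consistent, instead of the fixed 6100/3900 guesses seen
in the dump above.)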
> bbpart code as I suggested for the jump threading issue above. I.e. a
> cold block with incoming non-cold edges conservatively not marked cold
> for splitting.

Yep, we can probably do that, but we ought to fix the individual cases
above, at least for a reasonable number of runs.

Will you look into the logic of do_jump, or shall I try to dive in?

Honza

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-09-24 18:28                                         ` Jan Hubicka
@ 2013-09-24 18:51                                           ` Teresa Johnson
  2013-09-25 23:10                                             ` Teresa Johnson
                                                               ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-09-24 18:51 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, marxin.liska

On Tue, Sep 24, 2013 at 10:57 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> I looked at one that failed after 100 as well (20031204-1.c). In this
>> case, it was due to expansion which was creating multiple branches/bbs
>> from a logical OR and guessing incorrectly on how to assign the
>> counts:
>>
>>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>
>> The (*cp == ':' || *cp == '\0') part looked like the following going
>> into RTL expansion:
>>
>>   [20031204-1.c : 31:33] _29 = _28 == 58;
>>   [20031204-1.c : 31:33] _30 = _28 == 0;
>>   [20031204-1.c : 31:33] _31 = _29 | _30;
>>   [20031204-1.c : 31:18] if (_31 != 0)
>>     goto <bb 16>;
>>   else
>>     goto <bb 19>;
>>
>> where the result of the OR was always true, so bb 16 had a count of
>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>> of the above turned into 2 bbs with a branch in between. Both
>> comparisons were done in the first bb, but the first bb checked
>> whether the result of the *cp == '\0' compare was true, and if not
>> branched to the check for whether the *cp == ':' compare was true. It
>> gave the branch to the second check against ':' a count of 0, so that
>> bb got a count of 0 and was split out, and put the count of 100 on the
>> fall through assuming the compare with '\0' always evaluated to true.
>> In reality, this OR condition was always true because *cp was ':', not
>> '\0'. Therefore, the count of 0 on the second block with the check for
>> ':' was incorrect, we ended up trying to execute it, and failed.
>
> I see, we produce:
> ;; if (_26 != 0)
>
> (insn 94 93 95 (set (reg:CCZ 17 flags)
>         (compare:CCZ (reg:QI 107 [ D.2184 ])
>             (const_int 0 [0]))) a.c:31 -1
>      (nil))
>
> (insn 95 94 96 (set (reg:QI 122 [ D.2186 ])
>         (eq:QI (reg:CCZ 17 flags)
>             (const_int 0 [0]))) a.c:31 -1
>      (nil))
>
> (insn 96 95 97 (set (reg:CCZ 17 flags)
>         (compare:CCZ (reg:QI 122 [ D.2186 ])
>             (const_int 0 [0]))) a.c:31 -1
>      (nil))
>
> (jump_insn 97 96 98 (set (pc)
>         (if_then_else (ne (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (label_ref 100)
>             (pc))) a.c:31 -1
>      (expr_list:REG_BR_PROB (const_int 6100 [0x17d4])
>         (nil)))
>
> (insn 98 97 99 (set (reg:CCZ 17 flags)
>         (compare:CCZ (reg:QI 108 [ D.2186 ])
>             (const_int 0 [0]))) a.c:31 -1
>      (nil))
>
> (jump_insn 99 98 100 (set (pc)
>         (if_then_else (eq (reg:CCZ 17 flags)
>                 (const_int 0 [0]))
>             (label_ref 0)
>             (pc))) a.c:31 -1
>      (expr_list:REG_BR_PROB (const_int 3900 [0xf3c])
>         (nil)))
>
> (code_label 100 99 0 14 "" [0 uses])
>
> That is because we TER together "_26 = _25 | _24" and "if (_26 != 0)"
>
> First I think the logic of do_jump should really be moved to trees.  It is not
> doing anything that could not be adequately represented by gimple.
>
> I am not that certain we want to move it before profiling though.
>>
>> Presumably we had the correct profile data for both blocks, but the
>> accuracy was reduced when the OR was represented as a logical
>> computation with a single branch. We could change the expansion code
>> to do something different, e.g. treat as a 50-50 branch. But we would
>> still end up with integer truncation issues when there was a single
>> training run. But that could be dealt with conservatively in the
>
> Yep, but it is still better than what we have now - if the test above was
> in a hot part of the program (i.e. not executed just once), we will end up
> optimizing the second conditional for size.
>
> So I think it is a do_jump bug not to distribute the probabilities across the two
> conditionals introduced.
>> bbpart code as I suggested for the jump threading issue above. I.e. a
>> cold block with incoming non-cold edges conservatively not marked cold
>> for splitting.
>
> Yep, we can probably do that, but we ought to fix the individual cases
> above, at least for a reasonable number of runs.

I made this change and it removed a few of the failures.

I looked at another case that still failed with 1 train run but passed
with 100. It turned out to be another truncation issue exposed by RTL
expansion, where we created some control flow for a memset builtin
which was in a block with an execution count of 1. Some of the blocks
got frequencies less than half the original block, so the count was
rounded down or truncated to 0. I noticed in this case (as well
as in the jump threading case I fixed by looking for non-zero incoming
edges in partitioning) that the bb frequency was non-zero.
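
(Concretely, with made-up numbers: the count scaling uses
RDIV (X, Y) = ((X) + (Y) / 2) / (Y), so a block with count 1 and frequency
1000 that is split into a piece with frequency 400 gets
RDIV (1 * 400, 1000) = (400 + 500) / 1000 = 0 - a zero count despite a
non-zero frequency.)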

Why not just have probably_never_executed_bb_p simply return
false when bb->frequency is non-zero (right now it does the opposite -
it returns true when bb->frequency is 0)? Making this change removed a
bunch of other failures. With this change as well, there are only 3
cases that still fail with 1 train run but pass with 100. I need to
look at those.
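
I.e. the PROFILE_READ case would shrink to something like this (sketch only,
untested; the tail of the function is as in your patch):

bool
probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
{
  gcc_checking_assert (fun);
  if (profile_status_for_function (fun) == PROFILE_READ)
    {
      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
        return false;
      /* Trust a non-zero frequency: integer truncation can zero the
         count of a block that demonstrably executes.  */
      return !bb->frequency;
    }
  if ((!profile_info || !flag_branch_probabilities)
      && (cgraph_get_node (fun->decl)->frequency
          == NODE_FREQUENCY_UNLIKELY_EXECUTED))
    return true;
  return false;
}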

>
> Will you look into the logic of do_jump, or shall I try to dive in?

I can take a look, but probably won't have a chance until late this
week. If you don't get to it before then I will see if I can figure
out why it is applying the branch probabilities this way.

Teresa

>
> Honza



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-09-24 13:44                                         ` Jan Hubicka
@ 2013-09-24 19:06                                           ` Teresa Johnson
  2013-09-26 20:55                                           ` Rong Xu
  1 sibling, 0 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-09-24 19:06 UTC (permalink / raw)
  To: Jan Hubicka, Rong Xu; +Cc: gcc-patches, marxin.liska

Rong - can you answer the questions below on the comdat patch?


On Tue, Sep 24, 2013 at 5:31 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Hi Honza,
>>
>> I am finally getting back to working on this after a few weeks of
>> working on some other priorities.
>
> I am also trying to return to this, so good timing ;)
> Martin has got smaller C++ programs (Inkscape) to not touch the cold segment
> during startup with FDO (w/o partitioning). Firefox still does; I think
> the problem is lost samples due to different linker decisions even with LTO
> (i.e. the linker picks an object from a .a library at profile-generate time that it
> never passes later).
>
> I plan to look into that today.
>>
>> Did you mean to commit the above change? I see that it went in as part
>> of r202258 but doesn't show up in the ChangeLog entry for that
>> revision.
>
> Yes, I meant to check it in, but did not mean to do so w/o a ChangeLog.  I will
> fix that.

Should the same fix be applied to probably_never_executed_edge_p?

>>
>> >
>> > In other cases it was mostly loop unrolling in combination with jump threading. So
>> > I modified my script to separately report when failure happens for a test trained
>> > once and for a test trained a hundred times.
>>
>> Thanks for the linker script. I reproduced your results. I looked at a
>> couple cases. The first was one that failed after 1 training run only
>> (20000910-2.c). It was due to jump threading, which you noted was a
>> problem. For this one I think we can handle it in the partitioning,
>> since there is an FDO insanity that we could probably treat more
>> conservatively when splitting.
>
> We should fix the roundoff issues - when I was introducing the
> frequency/probability/count system I made an assumption that parts of programs
> with very low counts do not matter, since they are not part of a hot spot (and I
> cared only about the hot spot).  Now we care about identifying unlikely
> executed spots and we need to fix this.
>>
>> I looked at one that failed after 100 as well (20031204-1.c). In this
>> case, it was due to expansion which was creating multiple branches/bbs
>> from a logical OR and guessing incorrectly on how to assign the
>> counts:
>>
>>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>
>> The (*cp == ':' || *cp == '\0') part looked like the following going
>> into RTL expansion:
>>
>>   [20031204-1.c : 31:33] _29 = _28 == 58;
>>   [20031204-1.c : 31:33] _30 = _28 == 0;
>>   [20031204-1.c : 31:33] _31 = _29 | _30;
>>   [20031204-1.c : 31:18] if (_31 != 0)
>>     goto <bb 16>;
>>   else
>>     goto <bb 19>;
>>
>> where the result of the OR was always true, so bb 16 had a count of
>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>> of the above turned into 2 bbs with a branch in between. Both
>> comparisons were done in the first bb, but the first bb checked
>> whether the result of the *cp == '\0' compare was true, and if not
>> branched to the check for whether the *cp == ':' compare was true. It
>> gave the branch to the second check against ':' a count of 0, so that
>> bb got a count of 0 and was split out, and put the count of 100 on the
>> fall through assuming the compare with '\0' always evaluated to true.
>> In reality, this OR condition was always true because *cp was ':', not
>> '\0'. Therefore, the count of 0 on the second block with the check for
>> ':' was incorrect, we ended up trying to execute it, and failed.
>>
>> Presumably we had the correct profile data for both blocks, but the
>> accuracy was reduced when the OR was represented as a logical
>> computation with a single branch. We could change the expansion code
>> to do something different, e.g. treat as a 50-50 branch. But we would
>> still end up with integer truncation issues when there was a single
>> training run. But that could be dealt with conservatively in the
>> bbpart code as I suggested for the jump threading issue above. I.e. a
>> cold block with incoming non-cold edges conservatively not marked cold
>> for splitting.
>>
>> >
>> > [... FAIL list snipped - same as above ...]
>> >
>> > FAIL1 is failure after one run, FAIL is failure after 100 train runs.
>> > We should take a look at the FAILs and see if there are bugs to fix. For FAIL1
>> > I think it is a kind of design problem: while implementing counts&frequencies
>> > the idea was that small counts do not matter, so integer arithmetic is all
>> > right.
>> >
>> > I wonder if, with the current C++ wonderland, we can't simply switch count
>> > to a better representation - either sreal or fixed point with capping
>> > (the integer overflow issues are tiring, too).
>>
>> It also seems like we should be able to detect the profile insanities
>> caused by integer truncation and handle them conservatively. That
>> being said, I see some sreal uses already in the profile.c code, so
>> presumably we could use this for the counts as well if it turns out to
>> be necessary?
>
> Yes, I was thinking about this, too.  We would need to do some evaluation of
> the compile-time implications, but switching counts from gcov_type to a
> profile_counter_t that is a typedef to sreal seems a sane idea.
>
> We could switch the CFG code first.  There should not be many hot spots where
> counts are involved.  We can offline the common calculations we have already
> moved to macros.
>
> We will also need to invent a REG representation for them.  Now we have INT_LIST;
> for that we may have an SREAL list and introduce SREAL as a valid RTX argument.
> This can be done incrementally.
>>
>> BTW, Rong also implemented his runtime patch to do the COMDAT profile
>> merging. However, that ended up having some issues that were solvable
>> but would have caused us to lose all context sensitivity from COMDATS
>> inlined during the profile-gen build. I am going to go back to solving
>> this in the profile-use phase as we discussed in the separate thread
>> on the COMDAT inlining patch I had been working on.
>
> Yes, let's move ahead with this, too.  I think I should dig out the change
> that made frequencies be guessed again.

I think I have that and was building my patch on top of it.

Rong:

>
> As for COMDAT merging, I would like to see the patch.  I am experimenting
> now with a patch to also privatize COMDATs during -fprofile-generate to
> avoid the problems with lost profiles mentioned above.
>
> As for context sensitivity, one approach would be to have two sets of
> counters for every comdat - one merged globally and one counting local
> instances.  We can then always privatize and, at the profile read-in stage,
> just clone every comdat and have two instances - one for the offline copy
> and one for inlining.
>
> This is not different from how I always wanted to handle GNU extern inlines
> (those also have this issue - when you do not inline one, the unit does not see
> any profile of it).
>
> We can just tie the two functions together so the "inline" version stays prior
> to inlining, and then have the linker redirect to the inline version instead of the
> offline version in such cases.  It already knows how to skip aliases, and this is
> not terribly different from that.
>
> Honza
>>
>> Thanks,
>> Teresa
>>
>> >
>> > Honza
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-09-24 18:51                                           ` Teresa Johnson
@ 2013-09-25 23:10                                             ` Teresa Johnson
  2013-09-26  8:44                                               ` Teresa Johnson
  2013-09-26 22:26                                             ` Jan Hubicka
  2013-10-01 17:36                                             ` Teresa Johnson
  2 siblings, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-09-25 23:10 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, marxin.liska

On Tue, Sep 24, 2013 at 11:25 AM, Teresa Johnson <tejohnson@google.com> wrote:
> On Tue, Sep 24, 2013 at 10:57 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>
>>> I looked at one that failed after 100 as well (20031204-1.c). In this
>>> case, it was due to expansion which was creating multiple branches/bbs
>>> from a logical OR and guessing incorrectly on how to assign the
>>> counts:
>>>
>>>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>>
>>> The (*cp == ':' || *cp == '\0') part looked like the following going
>>> into RTL expansion:
>>>
>>>   [20031204-1.c : 31:33] _29 = _28 == 58;
>>>   [20031204-1.c : 31:33] _30 = _28 == 0;
>>>   [20031204-1.c : 31:33] _31 = _29 | _30;
>>>   [20031204-1.c : 31:18] if (_31 != 0)
>>>     goto <bb 16>;
>>>   else
>>>     goto <bb 19>;
>>>
>>> where the result of the OR was always true, so bb 16 had a count of
>>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>>> of the above turned into 2 bbs with a branch in between. Both
>>> comparisons were done in the first bb, but the first bb checked
>>> whether the result of the *cp == '\0' compare was true, and if not
>>> branched to the check for whether the *cp == ':' compare was true. It
>>> gave the branch to the second check against ':' a count of 0, so that
>>> bb got a count of 0 and was split out, and put the count of 100 on the
>>> fall through assuming the compare with '\0' always evaluated to true.
>>> In reality, this OR condition was always true because *cp was ':', not
>>> '\0'. Therefore, the count of 0 on the second block with the check for
>>> ':' was incorrect, we ended up trying to execute it, and failed.
>>
>> I see, we produce:
>> ;; if (_26 != 0)
>>
>> (insn 94 93 95 (set (reg:CCZ 17 flags)
>>         (compare:CCZ (reg:QI 107 [ D.2184 ])
>>             (const_int 0 [0]))) a.c:31 -1
>>      (nil))
>>
>> (insn 95 94 96 (set (reg:QI 122 [ D.2186 ])
>>         (eq:QI (reg:CCZ 17 flags)
>>             (const_int 0 [0]))) a.c:31 -1
>>      (nil))
>>
>> (insn 96 95 97 (set (reg:CCZ 17 flags)
>>         (compare:CCZ (reg:QI 122 [ D.2186 ])
>>             (const_int 0 [0]))) a.c:31 -1
>>      (nil))
>>
>> (jump_insn 97 96 98 (set (pc)
>>         (if_then_else (ne (reg:CCZ 17 flags)
>>                 (const_int 0 [0]))
>>             (label_ref 100)
>>             (pc))) a.c:31 -1
>>      (expr_list:REG_BR_PROB (const_int 6100 [0x17d4])
>>         (nil)))
>>
>> (insn 98 97 99 (set (reg:CCZ 17 flags)
>>         (compare:CCZ (reg:QI 108 [ D.2186 ])
>>             (const_int 0 [0]))) a.c:31 -1
>>      (nil))
>>
>> (jump_insn 99 98 100 (set (pc)
>>         (if_then_else (eq (reg:CCZ 17 flags)
>>                 (const_int 0 [0]))
>>             (label_ref 0)
>>             (pc))) a.c:31 -1
>>      (expr_list:REG_BR_PROB (const_int 3900 [0xf3c])
>>         (nil)))
>>
>> (code_label 100 99 0 14 "" [0 uses])
>>
>> That is because we TER together "_26 = _25 | _24" and "if (_26 != 0)"
>>
>> First I think the logic of do_jump should really be moved to trees.  It is not
>> doing anything that could not be adequately represented by gimple.
>>
>> I am not that certain we want to move it before profiling though.
>>>
>>> Presumably we had the correct profile data for both blocks, but the
>>> accuracy was reduced when the OR was represented as a logical
>>> computation with a single branch. We could change the expansion code
>>> to do something different, e.g. treat as a 50-50 branch. But we would
>>> still end up with integer truncation issues when there was a single
>>> training run. But that could be dealt with conservatively in the
>>
>> Yep, but it is still better than what we have now - if the test above was
>> in a hot part of the program (i.e. not executed just once), we will end up
>> optimizing the second conditional for size.
>>
>> So I think it is a do_jump bug not to distribute the probabilities across the two
>> conditionals introduced.
>>> bbpart code as I suggested for the jump threading issue above. I.e. a
>>> cold block with incoming non-cold edges conservatively not marked cold
>>> for splitting.
>>
>> Yep, we can probably do that, but we ought to fix the individual cases
>> above, at least for a reasonable number of runs.
>
> I made this change and it removed a few of the failures.
>
> I looked at another case that still failed with 1 train run but passed
> with 100. It turned out to be another truncation issue exposed by RTL
> expansion, where we created some control flow for a memset builtin
> which was in a block with an execution count of 1. Some of the blocks
> got frequencies less than half the original block, so the count was
> rounded down or truncated to 0. I noticed in this case (as well
> as in the jump threading case I fixed by looking for non-zero incoming
> edges in partitioning) that the bb frequency was non-zero.
>
> Why not just have probably_never_executed_bb_p simply return
> false when bb->frequency is non-zero (right now it does the opposite -
> it returns true when bb->frequency is 0)? Making this change removed a
> bunch of other failures. With this change as well, there are only 3
> cases that still fail with 1 train run but pass with 100. I need to
> look at those.

FYI, these turned out to be more jump threading issues. I am currently
working on getting the jump threading profile updates to work properly; I
think I'm pretty close. I haven't had a chance to look at do_jump yet.

Teresa

>
>>
>> Will you look into the logic of do_jump, or shall I try to dive in?
>
> I can take a look, but probably won't have a chance until late this
> week. If you don't get to it before then I will see if I can figure
> out why it is applying the branch probabilities this way.
>
> Teresa
>
>>
>> Honza
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-09-25 23:10                                             ` Teresa Johnson
@ 2013-09-26  8:44                                               ` Teresa Johnson
  0 siblings, 0 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-09-26  8:44 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, marxin.liska

On Wed, Sep 25, 2013 at 2:33 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Tue, Sep 24, 2013 at 11:25 AM, Teresa Johnson <tejohnson@google.com> wrote:
>> On Tue, Sep 24, 2013 at 10:57 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>>
>>>> I looked at one that failed after 100 as well (20031204-1.c). In this
>>>> case, it was due to expansion which was creating multiple branches/bbs
>>>> from a logical OR and guessing incorrectly on how to assign the
>>>> counts:
>>>>
>>>>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>>>
>>>> The (*cp == ':' || *cp == '\0') part looked like the following going
>>>> into RTL expansion:
>>>>
>>>>   [20031204-1.c : 31:33] _29 = _28 == 58;
>>>>   [20031204-1.c : 31:33] _30 = _28 == 0;
>>>>   [20031204-1.c : 31:33] _31 = _29 | _30;
>>>>   [20031204-1.c : 31:18] if (_31 != 0)
>>>>     goto <bb 16>;
>>>>   else
>>>>     goto <bb 19>;
>>>>
>>>> where the result of the OR was always true, so bb 16 had a count of
>>>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>>>> of the above turned into 2 bbs with a branch in between. Both
>>>> comparisons were done in the first bb, but the first bb checked
>>>> whether the result of the *cp == '\0' compare was true, and if not
>>>> branched to the check for whether the *cp == ':' compare was true. It
>>>> gave the branch to the second check against ':' a count of 0, so that
>>>> bb got a count of 0 and was split out, and put the count of 100 on the
>>>> fall through assuming the compare with '\0' always evaluated to true.
>>>> In reality, this OR condition was always true because *cp was ':', not
>>>> '\0'. Therefore, the count of 0 on the second block with the check for
>>>> ':' was incorrect, we ended up trying to execute it, and failed.
>>>
>>> I see, we produce:
>>> ;; if (_26 != 0)
>>>
>>> (insn 94 93 95 (set (reg:CCZ 17 flags)
>>>         (compare:CCZ (reg:QI 107 [ D.2184 ])
>>>             (const_int 0 [0]))) a.c:31 -1
>>>      (nil))
>>>
>>> (insn 95 94 96 (set (reg:QI 122 [ D.2186 ])
>>>         (eq:QI (reg:CCZ 17 flags)
>>>             (const_int 0 [0]))) a.c:31 -1
>>>      (nil))
>>>
>>> (insn 96 95 97 (set (reg:CCZ 17 flags)
>>>         (compare:CCZ (reg:QI 122 [ D.2186 ])
>>>             (const_int 0 [0]))) a.c:31 -1
>>>      (nil))
>>>
>>> (jump_insn 97 96 98 (set (pc)
>>>         (if_then_else (ne (reg:CCZ 17 flags)
>>>                 (const_int 0 [0]))
>>>             (label_ref 100)
>>>             (pc))) a.c:31 -1
>>>      (expr_list:REG_BR_PROB (const_int 6100 [0x17d4])
>>>         (nil)))
>>>
>>> (insn 98 97 99 (set (reg:CCZ 17 flags)
>>>         (compare:CCZ (reg:QI 108 [ D.2186 ])
>>>             (const_int 0 [0]))) a.c:31 -1
>>>      (nil))
>>>
>>> (jump_insn 99 98 100 (set (pc)
>>>         (if_then_else (eq (reg:CCZ 17 flags)
>>>                 (const_int 0 [0]))
>>>             (label_ref 0)
>>>             (pc))) a.c:31 -1
>>>      (expr_list:REG_BR_PROB (const_int 3900 [0xf3c])
>>>         (nil)))
>>>
>>> (code_label 100 99 0 14 "" [0 uses])
>>>
>>> That is because we TER together "_26 = _25 | _24" and "if (_26 != 0)"
>>>
>>> First I think the logic of do_jump should really be moved to trees.  It is not
>>> doing anything that could not be adequately represented by gimple.
>>>
>>> I am not that certain we want to move it before profiling though.
>>>>
>>>> Presumably we had the correct profile data for both blocks, but the
>>>> accuracy was reduced when the OR was represented as a logical
>>>> computation with a single branch. We could change the expansion code
>>>> to do something different, e.g. treat as a 50-50 branch. But we would
>>>> still end up with integer truncation issues when there was a single
>>>> training run. But that could be dealt with conservatively in the
>>>
>>> Yep, but it is still better than what we have now - if the test above was
>>> in a hot part of the program (i.e. not executed just once), we will end up
>>> optimizing the second conditional for size.
>>>
>>> So I think it is a do_jump bug not to distribute the probabilities across the two
>>> conditionals introduced.
>>>> bbpart code as I suggested for the jump threading issue above. I.e. a
>>>> cold block with incoming non-cold edges conservatively not marked cold
>>>> for splitting.
>>>
>>> Yep, we can probably do that, but we ought to fix the individual cases
>>> above, at least for a reasonable number of runs.
>>
>> I made this change and it removed a few of the failures.
>>
>> I looked at another case that still failed with 1 train run but passed
>> with 100. It turned out to be another truncation issue exposed by RTL
>> expansion, where we created some control flow for a memset builtin
>> which was in a block with an execution count of 1. Some of the blocks
>> got frequencies less than half the original block, so the count was
>> rounded down or truncated to 0. I noticed in this case (as well
>> as in the jump threading case I fixed by looking for non-zero incoming
>> edges in partitioning) that the bb frequency was non-zero.
>>
>> Why not just have probably_never_executed_bb_p simply return
>> false when bb->frequency is non-zero (right now it does the opposite -
>> it returns true when bb->frequency is 0)? Making this change removed a
>> bunch of other failures. With this change as well, there are only 3
>> cases that still fail with 1 train run but pass with 100. I need to
>> look at those.
>
> FYI, these turned out to be more jump threading issues. I am currently
> working on getting the jump threading profile updates to work properly; I
> think I'm pretty close. I haven't had a chance to look at do_jump yet.

Correction: it was not the 3 tests I mentioned above, but a different
set of tests being affected by this jump threading issue. I have a
patch I am regression testing. But there are other profile insanities
being caused upstream of jump threading that I haven't tracked down.
So I will also test and send the patch to handle some of these in the
splitting/cold detection code.  It also contains the change to
probably_never_executed_bb_p that I mentioned above to return false when
bb->frequency is non-zero.

Teresa

>
> Teresa
>
>>
>>>
>>> Will you look into the logic of do_jump, or shall I try to dive in?
>>
>> I can take a look, but probably won't have a chance until late this
>> week. If you don't get to it before then I will see if I can figure
>> out why it is applying the branch probabilities this way.
>>
>> Teresa
>>
>>>
>>> Honza
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-09-24 13:44                                         ` Jan Hubicka
  2013-09-24 19:06                                           ` Teresa Johnson
@ 2013-09-26 20:55                                           ` Rong Xu
  2013-09-26 22:23                                             ` Jan Hubicka
  1 sibling, 1 reply; 62+ messages in thread
From: Rong Xu @ 2013-09-26 20:55 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Teresa Johnson, gcc-patches, marxin.liska

On Tue, Sep 24, 2013 at 5:31 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> Hi Honza,
>>
>> I am finally getting back to working on this after a few weeks of
>> working on some other priorities.
>
> I am also trying to return to this, so good timing ;)
> Martin has got smaller C++ programs (Inkscape) to not touch the cold segment
> during startup with FDO (w/o partitioning). Firefox still does; I think
> the problem is lost samples due to different linker decisions even with LTO
> (i.e. the linker picks an object from a .a library at profile-generate time that it
> never passes later).
>
> I plan to look into that today.
>>
>> Did you mean to commit the above change? I see that it went in as part
>> of r202258 but doesn't show up in the ChangeLog entry for that
>> revision.
>
> Yes, I meant to check it in, but did not mean to do so w/o a ChangeLog.  I will
> fix that.
>>
>> >
>> > In other cases it was mostly loop unrolling in combination with jump threading. So
>> > I modified my script to separately report when failure happens for a test trained
>> > once and for a test trained a hundred times.
>>
>> Thanks for the linker script. I reproduced your results. I looked at a
>> couple cases. The first was one that failed after 1 training run only
>> (20000910-2.c). It was due to jump threading, which you noted was a
>> problem. For this one I think we can handle it in the partitioning,
>> since there is an FDO insanity that we could probably treat more
>> conservatively when splitting.
>
> We should fix the roundoff issues - when I was introducing the
> frequency/probability/count system I made an assumption that parts of programs
> with very low counts do not matter, since they are not part of a hot spot (and I
> cared only about the hot spot).  Now we care about identifying unlikely
> executed spots and we need to fix this.
>>
>> I looked at one that failed after 100 as well (20031204-1.c). In this
>> case, it was due to expansion which was creating multiple branches/bbs
>> from a logical OR and guessing incorrectly on how to assign the
>> counts:
>>
>>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>
>> The (*cp == ':' || *cp == '\0') part looked like the following going
>> into RTL expansion:
>>
>>   [20031204-1.c : 31:33] _29 = _28 == 58;
>>   [20031204-1.c : 31:33] _30 = _28 == 0;
>>   [20031204-1.c : 31:33] _31 = _29 | _30;
>>   [20031204-1.c : 31:18] if (_31 != 0)
>>     goto <bb 16>;
>>   else
>>     goto <bb 19>;
>>
>> where the result of the OR was always true, so bb 16 had a count of
>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>> of the above turned into 2 bbs with a branch in between. Both
>> comparisons were done in the first bb, but the first bb checked
>> whether the result of the *cp == '\0' compare was true, and if not
>> branched to the check for whether the *cp == ':' compare was true. It
>> gave the branch to the second check against ':' a count of 0, so that
>> bb got a count of 0 and was split out, and put the count of 100 on the
>> fall through assuming the compare with '\0' always evaluated to true.
>> In reality, this OR condition was always true because *cp was ':', not
>> '\0'. Therefore, the count of 0 on the second block with the check for
>> ':' was incorrect, we ended up trying to execute it, and failed.
>>
>> Presumably we had the correct profile data for both blocks, but the
>> accuracy was reduced when the OR was represented as a logical
>> computation with a single branch. We could change the expansion code
>> to do something different, e.g. treat it as a 50-50 branch. But we would
>> still end up with integer truncation issues when there was a single
>> training run. But that could be dealt with conservatively in the
>> bbpart code as I suggested for the jump threading issue above. I.e. a
>> cold block with incoming non-cold edges would conservatively not be marked
>> cold for splitting.
>>
>> >
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000422-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20000910-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20020413-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20030903-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20031204-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060420-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20060905-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120427-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20120808-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/20121108-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920501-6.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/920726-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/981001-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/990628-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/991216-2.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/cmpdi-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/float-floor.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr33870.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr36093.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/pr43784.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/switch-1.c
>> > FAIL1 /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>> > FAIL /home/jh/trunk/gcc/testsuite/gcc.c-torture/execute/va-arg-22.c
>> >
>> > FAIL1 is failure after one run, FAIL is failure after 100 train runs.
>> > We should take a look at the FAILs and see if there are bugs to fix. For FAIL1
>> > I think it is a kind of design problem: while implementing counts&frequencies
>> > the idea was that small counts do not matter, so integer arithmetic is all
>> > right.
>> >
>> > I wonder if with the current C++ wonderland we can't simply switch counts
>> > to a better representation. Either sreal or fixed point with capping
>> > (the integer overflow issues are tiring, too).
>>
>> It also seems like we should be able to detect the profile insanities
>> caused by integer truncation and handle them conservatively. That
>> being said, I see some sreal uses already in the profile.c code, so
>> presumably we could use this for the counts as well if it turns out to
>> be necessary?
>
> Yes, I was thinking about this, too.  We would need to do some evaluation of
> compile time implications, but switching counts from gcov_type to a
> profile_counter_t that is a typedef to sreal seems a sane idea.
>
> We could switch the CFG code first; there should not be many hot spots where
> counts are involved.  We can offline the common calculations we already moved
> to macros.
>
> We will also need to invent a REG representation for them.  Now we have INT_LIST
> for that; we may have an SREAL list and introduce SREAL as a valid RTX argument.
> This can be done incrementally.
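A minimal sketch of the capping idea mentioned above, for illustration only
(the type and helper names here are hypothetical, not existing GCC API):

  typedef long long sketch_count_t;
  #define SKETCH_COUNT_MAX (((sketch_count_t) 1) << 62)

  /* Saturating add for non-negative counts: overflow caps at the
     maximum instead of wrapping, so arithmetic on huge counts cannot
     silently produce small ones downstream.  */
  static inline sketch_count_t
  sketch_count_add (sketch_count_t a, sketch_count_t b)
  {
    if (a > SKETCH_COUNT_MAX - b)
      return SKETCH_COUNT_MAX;
    return a + b;
  }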
>>
>> BTW, Rong also implemented his runtime patch to do the COMDAT profile
>> merging. However, that ended up having some issues that were solvable,
>> but would have caused us to lose all context sensitivity from COMDATS
>> inlined during the profile-gen build. I am going to go back to solving
>> this in the profile-use phase as we discussed in the separate thread
>> on the COMDAT inlining patch I had been working on.
>
> Yes, let's move ahead with this, too.  I think I should dig out the change
> that made frequencies be guessed again.
>
> As for COMDAT merging, i would like to see the patch.  I am experimenting
> now with a patch to also privatize COMDATs during -fprofile-generate to
> avoid problems with lost profiles mentioned above.
>

Do you mean you privatize every COMDAT function in the profile-generate build?
We discussed this idea internally and we thought it would not work for
large applications (like those at Google) due to size.

> As for context sensitivity, one approach would be to have two sets of
> counters for every comdat - one merged globally and one counting local
> instances.  We can then privatize always and at the profile read-in stage
> just clone every comdat and have two instances - one for offline copy
> and one for inlining.
>
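A sketch of what those two sets of counters could look like (structure and
field names are made up for illustration, not existing GCC code):

  typedef long long gcov_type_sketch;   /* stand-in for GCC's gcov_type */

  /* One record per comdat: the runtime would merge 'global_count'
     across all modules, while 'local_count' counts only this module's
     instances.  */
  struct comdat_counters
  {
    gcov_type_sketch global_count;
    gcov_type_sketch local_count;
  };

  /* At profile read-in, the offline clone would be annotated from
     global_count, and the clone kept for inlining from local_count.  */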

In my implementation, I also allow multiple sets of COMDAT profiles to
co-exist in one compilation.
Due to the auxiliary modules in LIPO, I actually have more than two.

But I'm wondering how you determine which profile to use for each
call-site -- the inline decision may not
be the same for the profile-generate and profile-use compilations.

> This is not different from how I always wanted to handle GNU extern inlines
> (those also have this issue - when you do not inline one, the unit does not see
> any profile of it).
>
> We can just tie the two functions together so the "inline" version stays prior
> to inlining, and then have the linker redirect to the inline version instead of
> the offline version in such cases.  It already knows how to skip aliases and
> this is not terribly different from that.
>
> Honza
>>
>> Thanks,
>> Teresa
>>
>> >
>> > Honza
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-09-26 20:55                                           ` Rong Xu
@ 2013-09-26 22:23                                             ` Jan Hubicka
  2013-09-26 22:54                                               ` Rong Xu
  0 siblings, 1 reply; 62+ messages in thread
From: Jan Hubicka @ 2013-09-26 22:23 UTC (permalink / raw)
  To: Rong Xu; +Cc: Jan Hubicka, Teresa Johnson, gcc-patches, marxin.liska

> > As for COMDAT merging, i would like to see the patch.  I am experimenting
> > now with a patch to also privatize COMDATs during -fprofile-generate to
> > avoid problems with lost profiles mentioned above.
> >
> 
> Do you mean you privatize every COMDAT function in the profile-generate?
> We discussed this idea internally and we thought it would not work for
> large applications (like in google) due to size.

Yes, Martin and I plan to test this on Firefox.  In a way you already have all
the COMDAT functions unshared in the object files, so the resulting binary
should not be completely off the limits.  But I do not have any quantitative
data yet, since we hit a bug in constant folding and devirtualization that I
fixed in the meantime, but we did not re-run the tests yet.

> 
> > As for context sensitivity, one approach would be to have two sets of
> > counters for every comdat - one merged globally and one counting local
> > instances.  We can then privatize always and at profile read in stage
> > just clone every comdat and have two instances - one for offline copy
> > and one for inlining.
> >
> 
> In my implementation, I also allow multiple sets of COMDAT profiles to
> co-exist in one compilation.
> Due to the auxiliary modules in LIPO, I actually have more than two.

How do auxiliary modules work?
> 
> But I'm wondering how you determine which profile to use for each
> call-site -- the inline decision may not
> be the same for the profile-generate and profile-use compilations.

My suggestion was to simply use the module local profile for all inline sites
within the given module and the global profile for the offline copy of the
function (that one will, in the case it survives linking, be shared across
all the modules anyway).

I think this may work in the cases where e.g. the use of hash templates in one
module is very different (in average size) from another module.
I did not really put much effort into it - I currently worry primarily about
the cases where the profile is lost completely since it gets attached to a
function not surviving final linking (or because we inline something we did
not inline at profile time).

As for context sensitivity, we may try to consider developing a more consistent
solution for this.  COMDAT functions are definitely not the only ones that may
exhibit context-sensitive behaviour.
One approach would be to always have multiple counters for each function and
hash based on the backtraces collected by the indirect call profiling
instrumentation.
In a way this is the same as path profiling, but that would definitely add
quite some overhead + we will need to think of a reasonable way to represent
this within the compiler.
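Roughly like this (entirely made-up names, just to illustrate the hashing
idea, not a worked-out design):

  #define N_CTX 4   /* contexts tracked per function; arbitrary */

  struct ctx_counters
  {
    unsigned long key[N_CTX];   /* hash of the sampled backtrace */
    long long count[N_CTX];
  };

  /* Pick the counter slot for the current call context.  Colliding
     contexts simply share a slot here; a real design would need
     something better.  */
  static long long *
  ctx_counter (struct ctx_counters *c, unsigned long backtrace_hash)
  {
    unsigned i = backtrace_hash % N_CTX;
    if (c->key[i] == 0)
      c->key[i] = backtrace_hash;
    return &c->count[i];
  }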

How do you decide what functions you want to have multiple profiles for?

Honza

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-09-24 18:51                                           ` Teresa Johnson
  2013-09-25 23:10                                             ` Teresa Johnson
@ 2013-09-26 22:26                                             ` Jan Hubicka
  2013-09-27 14:50                                               ` Teresa Johnson
  2013-10-01 17:36                                             ` Teresa Johnson
  2 siblings, 1 reply; 62+ messages in thread
From: Jan Hubicka @ 2013-09-26 22:26 UTC (permalink / raw)
  To: Teresa Johnson; +Cc: Jan Hubicka, gcc-patches, marxin.liska

> 
> Why not just have probably_never_executed_bb_p simply return
> false when bb->frequency is non-zero (right now it does the opposite -

We want to have frequencies guessed for functions that were not trained
in the profiling run (that was the patch I posted earlier that I think did not
go in yet).

Currently I return true when the frequency indicates that the BB is executed in
at most 1/4th of all executions.  With the cases discussed I see we may need to
reduce this threshold.  In general I do not like hard tests for 0 much, because
the meaning of 0 depends on REG_BR_FREQ_BASE, which is supposed to be
changeable, and we may want to make frequencies sreal, too.

I suppose we may introduce a --param for this.  You are also right that I should
update probably_never_executed_edge_p (I intended to, but obviously the code
ended up in mainline accidentally).

I however saw at least one case of jump threading where this trick did not
help: the jump threading update confused itself by scaling via counts rather
than frequencies and ended up dropping everything to 0. This makes it 
more tempting to try to go with sreals for those....

Honza

> returns true when bb->frequency is 0)? Making this change removed a
> bunch of other failures. With this change as well, there are only 3
> cases that still fail with 1 train run that pass with 100. Need to
> look at those.
> 
> >
> > Will you look into logic of do_jump or shall I try to dive in?
> 
> I can take a look, but probably won't have a chance until late this
> week. If you don't get to it before then I will see if I can figure
> out why it is applying the branch probabilities this way.
> 
> Teresa
> 
> >
> > Honza
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-09-26 22:23                                             ` Jan Hubicka
@ 2013-09-26 22:54                                               ` Rong Xu
  0 siblings, 0 replies; 62+ messages in thread
From: Rong Xu @ 2013-09-26 22:54 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Teresa Johnson, gcc-patches, marxin.liska

On Thu, Sep 26, 2013 at 2:54 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > As for COMDAT merging, i would like to see the patch.  I am experimenting
>> > now with a patch to also privatize COMDATs during -fprofile-generate to
>> > avoid problems with lost profiles mentioned above.
>> >
>>
>> Do you mean you privatize every COMDAT function in the profile-generate?
>> We discussed this idea internally and we thought it would not work for
>> large applications (like in google) due to size.
>
> Yes, Martin and I plan to test this on firefox.  In a way you already have all
> the COMDAT functions unshared in the object files, so the resulting binary
> should not be completely off the limits.  But I do not have any quantitative
> data, yet, since we hit bug in constant folding and devirtualization I fixed in
> meantime but we did not re-run the tests yet.

The linker removes a great number of duplicated copies, especially for those
template functions.
We don't have quantitative numbers either. But I'll collect some soon.
>
>>
>> > As for context sensitivity, one approach would be to have two sets of
>> > counters for every comdat - one merged globally and one counting local
>> > instances.  We can then privatize always and at profile read in stage
>> > just clone every comdat and have two instances - one for offline copy
>> > and one for inlining.
>> >
>>
>> In my implementation, I also allow multiple sets of COMDAT profile
>> co-existing in one compilation.
>> Due to the auxiliary modules in LIPO, I actually have more than two.
>
> How does auxiliary modules work?

It pulls in multiple profiles from other compilations. So there might be
multiple inlined profiles.

>>
>> But I'm wondering how do you determine which profile to use for each
>> call-site -- the inline decision may not
>> be the same for profile-generate and profile-use compilation.
>
> My suggestion was to simply use the module local profile for all inline sites
> within the given module and the global profile for the offline copy of the
> function (that one will, in the case it survives linking, be shared across
> all the modules anyway).

For a simple example like:
callsite1 --> comdat_function_foo
callsite2 --> comdat_function_foo

callsite1 is inlined in profile-generate, so it has its own inlined
profile counters.
callsite2 is not inlined and its profile goes to the offline copy.
Say callsite1 is cold (0 counter) and callsite2 is hot. Using the
local profile (the cold one)
for callsite2 will not be correct.
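In C terms the situation is roughly the following (hypothetical function
names, purely for illustration):

  /* A comdat-style inline function shared between modules.  */
  static inline void comdat_foo (void) { /* shared body */ }

  void f1 (void)   /* callsite1: inlined at profile-generate time,  */
  {                /* so its private inline counters stay cold (0)  */
    comdat_foo ();
  }

  void f2 (void)   /* callsite2: not inlined; its hot execution is  */
  {                /* recorded in the offline copy's counters       */
    comdat_foo ();
  }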

>
> I think this may work in the cases where e.g. the use of hash templates in one
> module is very different (in average size) from another module.
> I did not really put much effort into it - I currently worry primarily about
> the cases where the profile is lost completely since it gets attached to a
> function not surviving final linking (or because we inline something we did
> not inline at profile time).
>
> As for context sensitivity, we may try to consider developing a more
> consistent solution for this.  COMDAT functions are definitely not the only
> ones that may exhibit context-sensitive behaviour.
> One approach would be to always have multiple counters for each function and
> hash based on the backtraces collected by the indirect call profiling
> instrumentation.
> In a way this is the same as path profiling, but that would definitely add
> quite some overhead + we will need to think of a reasonable way to represent
> this within the compiler.
>
> How do you decide what functions you want to have multiple profiles for?

I do the instrumentation after ipa-inline for comdat functions, so I know
whether a callsite
is inlined or not. In the profile-use phase, I also need to provide
the context (which module this is from) to pick
the right profile.

>
> Honza

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-09-26 22:26                                             ` Jan Hubicka
@ 2013-09-27 14:50                                               ` Teresa Johnson
  2013-09-29 17:34                                                 ` Teresa Johnson
  0 siblings, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-09-27 14:50 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, marxin.liska

On Thu, Sep 26, 2013 at 3:02 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> Why not just have probably_never_executed_bb_p simply return
>> false when bb->frequency is non-zero (right now it does the opposite -
>
> We want to have frequencies guessed for functions that were not trained
> in the profiling run (that was the patch I posted earlier that I think did not
> go in yet).

Right, but for splitting and bb layout purposes, for these statically
guessed unprofiled functions we in fact don't want to do any splitting
or treat the bbs as never executed (which shouldn't be a change from
the status quo since all the bbs in these functions are currently 0
weight; it's only when we inline in the case of comdats that they
appear colder than the surrounding code, but in fact we don't want
this).

The only other caller to probably_never_executed_bb_p is
compute_function_frequency, but in the case of statically guessed
functions they will have profile_status != PROFILE_READ and won't
invoke probably_never_executed_bb_p. But re-reading our most recent
exchange on the comdat profile issue, it sounds like you were
suggesting guessing profiles for all 0-weight functions early, then
dropping them from PROFILE_READ to PROFILE_GUESSED only once we
determine in ipa-inline that there is a potentially non-zero call path
to them. In that case with the change I describe above to
probably_never_executed_bb_p, the 0-weight functions with 0 calls to
them will incorrectly be marked as NODE_FREQUENCY_NORMAL, which would
be bad as they would not be size optimized or moved into the cold
section.

So it seems like we want different handling of these guessed
frequencies in compute_function_frequency and bb-reorder.c. Actually I
think we can handle this by checking if the function entry block has a
0 count. If so, then we just look at the bb counts and not the
frequencies for determining bb hotness as the frequencies would
presumably have been statically-guessed. This will ensure that the
cgraph node continues to be marked unlikely and size-optimized. If the
function entry block has a non-zero count, then we look at both the bb
count and the bb frequency - if they are both zero then the bb is
probably never executed, but if either is non-zero then we should
treat the block as possibly executed (which will come into play for
splitting and bb layout).
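In other words, roughly the following (a sketch only, with simplified
stand-ins for the real GCC structures and no claim to match the final
signatures):

  struct bb_sketch { long long count; int frequency; };

  static int
  bb_probably_never_executed (long long entry_count, struct bb_sketch bb)
  {
    if (entry_count == 0)
      /* Statically guessed profile: frequencies are meaningless for
         this decision, so go by the (zero) counts alone.  */
      return bb.count == 0;
    /* Profiled function: require both to be zero, so a truncated
       count with a non-zero frequency is not treated as never
       executed.  */
    return bb.count == 0 && bb.frequency == 0;
  }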

Teresa

>
> Currently I return true when the frequency indicates that the BB is executed
> in at most 1/4th of all executions.  With the cases discussed I see we may
> need to reduce this threshold.  In general I do not like hard tests for 0
> much, because the meaning of 0 depends on REG_BR_FREQ_BASE, which is supposed
> to be changeable, and we may want to make frequencies sreal, too.
>
> I suppose we may introduce a --param for this.  You are also right that I
> should update probably_never_executed_edge_p (I intended to, but obviously
> the code ended up in mainline accidentally).
>
> I however saw at least one case of jump threading where this trick did not
> help: the jump threading update confused itself by scaling via counts rather
> than frequencies and ended up dropping everything to 0. This makes it
> more tempting to try to go with sreals for those....
>
> Honza
>
>> returns true when bb->frequency is 0)? Making this change removed a
>> bunch of other failures. With this change as well, there are only 3
>> cases that still fail with 1 train run that pass with 100. Need to
>> look at those.
>>
>> >
>> > Will you look into logic of do_jump or shall I try to dive in?
>>
>> I can take a look, but probably won't have a chance until late this
>> week. If you don't get to it before then I will see if I can figure
>> out why it is applying the branch probabilities this way.
>>
>> Teresa
>>
>> >
>> > Honza
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-09-27 14:50                                               ` Teresa Johnson
@ 2013-09-29 17:34                                                 ` Teresa Johnson
  2013-10-02 16:19                                                   ` Jan Hubicka
  0 siblings, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-09-29 17:34 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, marxin.liska

On Fri, Sep 27, 2013 at 7:15 AM, Teresa Johnson <tejohnson@google.com> wrote:
> On Thu, Sep 26, 2013 at 3:02 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>
>>> Why not just have probably_never_executed_bb_p simply return
>>> false when bb->frequency is non-zero (right now it does the opposite -
>>
>> We want to have frequencies guessed for functions that were not trained
>> in the profiling run (that was the patch I posted earlier that I think did
>> not go in yet).
>
> Right, but for splitting and bb layout purposes, for these statically
> guessed unprofiled functions we in fact don't want to do any splitting
> or treat the bbs as never executed (which shouldn't be a change from
> the status quo since all the bbs in these functions are currently 0
> weight; it's only when we inline in the case of comdats that they
> appear colder than the surrounding code, but in fact we don't want
> this).
>
> The only other caller to probably_never_executed_bb_p is
> compute_function_frequency, but in the case of statically guessed
> functions they will have profile_status != PROFILE_READ and won't
> invoke probably_never_executed_bb_p. But re-reading our most recent
> exchange on the comdat profile issue, it sounds like you were
> suggesting guessing profiles for all 0-weight functions early, then
> dropping them from PROFILE_READ to PROFILE_GUESSED only once we
> determine in ipa-inline that there is a potentially non-zero call path
> to them. In that case with the change I describe above to
> probably_never_executed_bb_p, the 0-weight functions with 0 calls to
> them will incorrectly be marked as NODE_FREQUENCY_NORMAL, which would
> be bad as they would not be size optimized or moved into the cold
> section.
>
> So it seems like we want different handling of these guessed
> frequencies in compute_function_frequency and bb-reorder.c. Actually I
> think we can handle this by checking if the function entry block has a
> 0 count. If so, then we just look at the bb counts and not the
> frequencies for determining bb hotness as the frequencies would
> presumably have been statically-guessed. This will ensure that the
> cgraph node continues to be marked unlikely and size-optimized. If the
> function entry block has a non-zero count, then we look at both the bb
> count and the bb frequency - if they are both zero then the bb is
> probably never executed, but if either is non-zero then we should
> treat the block as possibly executed (which will come into play for
> splitting and bb layout).

Here is a patch to handle the profile insanities conservatively during
splitting. It also simplifies the probably_never_executed* code to
treat missing counts within a profiled function differently
(conservatively, based on frequency) from the case where the whole
function has a guessed profile. That way, once a patch to guess
profiles for non-executed functions is added, they will continue to
have their nodes marked as unlikely. I also pulled the guts of the
probably_never_executed_bb_p code out to a helper that is then invoked
by both the bb and edge versions of this function, so they stay in
sync.

This gets rid of a number of the failures with splitting + the linker
script to make the unlikely section non-executable. I have a patch to
fix some jump threading insanities that I will send separately.

Bootstrapped and regression tested on x86_64. Also tested with an lto
profiledbootstrap. Ok for trunk?

Thanks,
Teresa

2013-09-29  Teresa Johnson  <tejohnson@google.com>

        * bb-reorder.c (find_rarely_executed_basic_blocks_and_crossing_edges):
        Treat profile insanities conservatively.
        * predict.c (probably_never_executed): New function. Treat profile
        insanities conservatively.
        (probably_never_executed_bb_p): Invoke probably_never_executed.
        (probably_never_executed_edge_p): Invoke probably_never_executed.

Index: bb-reorder.c
===================================================================
--- bb-reorder.c        (revision 202947)
+++ bb-reorder.c        (working copy)
@@ -1564,8 +1564,25 @@ find_rarely_executed_basic_blocks_and_crossing_edg
   /* Mark which partition (hot/cold) each basic block belongs in.  */
   FOR_EACH_BB (bb)
     {
+      bool cold_bb = false;
       if (probably_never_executed_bb_p (cfun, bb))
         {
+          /* Handle profile insanities created by upstream optimizations
+             by also checking the incoming edge weights. If there is a non-cold
+             incoming edge, conservatively prevent this block from being split
+             into the cold section.  */
+          cold_bb = true;
+          FOR_EACH_EDGE (e, ei, bb->preds)
+            {
+              if (!probably_never_executed_edge_p (cfun, e))
+                {
+                  cold_bb = false;
+                  break;
+                }
+            }
+        }
+      if (cold_bb)
+        {
           BB_SET_PARTITION (bb, BB_COLD_PARTITION);
           cold_bb_count++;
         }
Index: predict.c
===================================================================
--- predict.c   (revision 202947)
+++ predict.c   (working copy)
@@ -226,26 +226,26 @@ maybe_hot_edge_p (edge e)
 }


-/* Return true in case BB is probably never executed.  */

-bool
-probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
+/* Return true if profile COUNT and FREQUENCY, or function FUN static
+   node frequency reflects never being executed.  */
+
+static bool
+probably_never_executed (struct function *fun,
+                         gcov_type count, int frequency)
 {
   gcc_checking_assert (fun);
   if (profile_status_for_function (fun) == PROFILE_READ)
     {
-      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
+      if ((count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
        return false;
-      if (!bb->frequency)
-       return true;
-      if (!ENTRY_BLOCK_PTR->frequency)
-       return false;
-      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
-       {
-         return (RDIV (bb->frequency * ENTRY_BLOCK_PTR->count,
-                       ENTRY_BLOCK_PTR->frequency)
-                 < REG_BR_PROB_BASE / 4);
-       }
+      // If this is a profiled function (entry bb non-zero count), then base
+      // the coldness decision on the frequency. This will handle cases where
+      // counts are not updated properly during optimizations or expansion.
+      if (ENTRY_BLOCK_PTR->count)
+       return frequency == 0;
+      // Unprofiled function, frequencies statically assigned. All bbs are
+      // treated as cold.
       return true;
     }
   if ((!profile_info || !flag_branch_probabilities)
@@ -256,19 +256,21 @@ maybe_hot_edge_p (edge e)
 }


+/* Return true in case BB is probably never executed.  */
+
+bool
+probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
+{
+  return probably_never_executed (fun, bb->count, bb->frequency);
+}
+
+
 /* Return true in case edge E is probably never executed.  */

 bool
 probably_never_executed_edge_p (struct function *fun, edge e)
 {
-  gcc_checking_assert (fun);
-  if (profile_info && flag_branch_probabilities)
-    return ((e->count + profile_info->runs / 2) / profile_info->runs) == 0;
-  if ((!profile_info || !flag_branch_probabilities)
-      && (cgraph_get_node (fun->decl)->frequency
-         == NODE_FREQUENCY_UNLIKELY_EXECUTED))
-    return true;
-  return false;
+  return probably_never_executed (fun, e->count, EDGE_FREQUENCY (e));
 }

 /* Return true if NODE should be optimized for size.  */


>
> Teresa
>
>>
>> Currently I return true when the frequency indicates that the BB is executed
>> in at most 1/4th of all executions.  With the cases discussed I see we may
>> need to reduce this threshold.  In general I do not like hard tests for 0
>> much, because the meaning of 0 depends on REG_BR_FREQ_BASE, which is supposed
>> to be changeable, and we may want to make frequencies sreal, too.
>>
>> I suppose we may introduce a --param for this.  You are also right that I
>> should update probably_never_executed_edge_p (I intended to, but obviously
>> the code ended up in mainline accidentally).
>>
>> I however saw at least one case of jump threading where this trick did not
>> help: the jump threading update confused itself by scaling via counts rather
>> than frequencies and ended up dropping everything to 0. This makes it
>> more tempting to try to go with sreals for those....
>>
>> Honza
>>
>>> returns true when bb->frequency is 0)? Making this change removed a
>>> bunch of other failures. With this change as well, there are only 3
>>> cases that still fail with 1 train run that pass with 100. Need to
>>> look at those.
>>>
>>> >
>>> > Will you look into logic of do_jump or shall I try to dive in?
>>>
>>> I can take a look, but probably won't have a chance until late this
>>> week. If you don't get to it before then I will see if I can figure
>>> out why it is applying the branch probabilities this way.
>>>
>>> Teresa
>>>
>>> >
>>> > Honza
>>>
>>>
>>>
>>> --
>>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-09-24 18:51                                           ` Teresa Johnson
  2013-09-25 23:10                                             ` Teresa Johnson
  2013-09-26 22:26                                             ` Jan Hubicka
@ 2013-10-01 17:36                                             ` Teresa Johnson
  2 siblings, 0 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-10-01 17:36 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, marxin.liska

On Tue, Sep 24, 2013 at 11:25 AM, Teresa Johnson <tejohnson@google.com> wrote:
> On Tue, Sep 24, 2013 at 10:57 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>
>>> I looked at one that failed after 100 as well (20031204-1.c). In this
>>> case, it was due to expansion which was creating multiple branches/bbs
>>> from a logical OR and guessing incorrectly on how to assign the
>>> counts:
>>>
>>>  if (octets == 4 && (*cp == ':' || *cp == '\0')) {
>>>
>>> The (*cp == ':' || *cp == '\0') part looked like the following going
>>> into RTL expansion:
>>>
>>>   [20031204-1.c : 31:33] _29 = _28 == 58;
>>>   [20031204-1.c : 31:33] _30 = _28 == 0;
>>>   [20031204-1.c : 31:33] _31 = _29 | _30;
>>>   [20031204-1.c : 31:18] if (_31 != 0)
>>>     goto <bb 16>;
>>>   else
>>>     goto <bb 19>;
>>>
>>> where the result of the OR was always true, so bb 16 had a count of
>>> 100 and bb 19 a count of 0. When it was expanded, the expanded version
>>> of the above turned into 2 bbs with a branch in between. Both
>>> comparisons were done in the first bb, but the first bb checked
>>> whether the result of the *cp == '\0' compare was true, and if not
>>> branched to the check for whether the *cp == ':' compare was true. It
>>> gave the branch to the second check against ':' a count of 0, so that
>>> bb got a count of 0 and was split out, and put the count of 100 on the
>>> fall through assuming the compare with '\0' always evaluated to true.
>>> In reality, this OR condition was always true because *cp was ':', not
>>> '\0'. Therefore, the count of 0 on the second block with the check for
>>> ':' was incorrect, we ended up trying to execute it, and failed.
>>
>> I see, we produce:
>> ;; if (_26 != 0)
>>
>> (insn 94 93 95 (set (reg:CCZ 17 flags)
>>         (compare:CCZ (reg:QI 107 [ D.2184 ])
>>             (const_int 0 [0]))) a.c:31 -1
>>      (nil))
>>
>> (insn 95 94 96 (set (reg:QI 122 [ D.2186 ])
>>         (eq:QI (reg:CCZ 17 flags)
>>             (const_int 0 [0]))) a.c:31 -1
>>      (nil))
>>
>> (insn 96 95 97 (set (reg:CCZ 17 flags)
>>         (compare:CCZ (reg:QI 122 [ D.2186 ])
>>             (const_int 0 [0]))) a.c:31 -1
>>      (nil))
>>
>> (jump_insn 97 96 98 (set (pc)
>>         (if_then_else (ne (reg:CCZ 17 flags)
>>                 (const_int 0 [0]))
>>             (label_ref 100)
>>             (pc))) a.c:31 -1
>>      (expr_list:REG_BR_PROB (const_int 6100 [0x17d4])
>>         (nil)))
>>
>> (insn 98 97 99 (set (reg:CCZ 17 flags)
>>         (compare:CCZ (reg:QI 108 [ D.2186 ])
>>             (const_int 0 [0]))) a.c:31 -1
>>      (nil))
>>
>> (jump_insn 99 98 100 (set (pc)
>>         (if_then_else (eq (reg:CCZ 17 flags)
>>                 (const_int 0 [0]))
>>             (label_ref 0)
>>             (pc))) a.c:31 -1
>>      (expr_list:REG_BR_PROB (const_int 3900 [0xf3c])
>>         (nil)))
>>
>> (code_label 100 99 0 14 "" [0 uses])
>>
>> That is because we TER together "_26 = _25 | _24" and "if (_26 != 0)"
>>
>> First I think the logic of do_jump should really be moved to trees.  It is
>> not doing anything that cannot be adequately represented by gimple.
>>
>> I am not that certain we want to move it before profiling though.
>>>
>>> Presumably we had the correct profile data for both blocks, but the
>>> accuracy was reduced when the OR was represented as a logical
>>> computation with a single branch. We could change the expansion code
>>> to do something different, e.g. treat it as a 50-50 branch. But we would
>>> still end up with integer truncation issues when there was a single
>>> training run. But that could be dealt with conservatively in the
>>
>> Yep, but it is still better than what we have now - if the test above was
>> in a hot part of the program (i.e. not executed once), we will end up
>> optimizing the second conditional for size.
>>
>> So I think it is a do_jump bug to not distribute probabilities across the two
>> conditionals introduced.
>>> bbpart code as I suggested for the jump threading issue above. I.e. a
>>> cold block with incoming non-cold edges would conservatively not be marked
>>> cold for splitting.
>>
>> Yep, we can probably do that, but we ought to fix the individual cases
>> above at least for a reasonable number of runs.
>
> I made this change and it removed a few of the failures.
>
> I looked at another case that still failed with 1 train run but passed
> with 100. It turned out to be another truncation issue exposed by RTL
> expansion, where we created some control flow for a memset builtin
> which was in a block with an execution count of 1. Some of the blocks
> got frequencies less than half the original block, so the count was
> rounded down or truncated to 0. I noticed in this case (as well
> as the jump threading case I fixed by looking for non-zero incoming
> edges in partitioning) that the bb frequency was non-zero.
>
> Why not just have probably_never_executed_bb_p simply return
> false when bb->frequency is non-zero (right now it does the opposite -
> returns true when bb->frequency is 0)? Making this change removed a
> bunch of other failures. With this change as well, there are only 3
> cases that still fail with 1 train run that pass with 100. Need to
> look at those.
>
>>
>> Will you look into logic of do_jump or shall I try to dive in?
>
> I can take a look, but probably won't have a chance until late this
> week. If you don't get to it before then I will see if I can figure
> out why it is applying the branch probabilities this way.

Turned out not to be too tricky to fix the do_jump issue affecting
20031204-1.c. The patch below fixes the issue for this test case
(along with the patch I posted a couple days ago that handles profile
insanities conservatively in the case where there is 1 training run
and we truncate counts):

Index: dojump.c
===================================================================
--- dojump.c (revision 202947)
+++ dojump.c (working copy)
@@ -325,15 +325,20 @@
       break;

     case TRUTH_ORIF_EXPR:
+      /* Spread the probability evenly between the two conditions. So
+         the first condition has half the total probability of being true,
+         and therefore has half the probability of being false
+         (i.e. falls through to the second condition). If we reach the
+         second condition, it will be true with the original probability.  */
       if (if_true_label == NULL_RTX)
         {
           drop_through_label = gen_label_rtx ();
-          do_jump (op0, NULL_RTX, drop_through_label, prob);
+          do_jump (op0, NULL_RTX, drop_through_label, prob / 2);
           do_jump (op1, if_false_label, NULL_RTX, prob);
         }
       else
         {
-          do_jump (op0, NULL_RTX, if_true_label, prob);
+          do_jump (op0, NULL_RTX, if_true_label, prob / 2);
           do_jump (op1, if_false_label, if_true_label, prob);
         }
       break;

I am regression testing this now and will post the patch for review in
a separate thread.

Honza, any comments on the patch I posted a couple days ago that
treats the profile insanities conservatively?

Thanks,
Teresa

>
> Teresa
>
>>
>> Honza
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-09-29 17:34                                                 ` Teresa Johnson
@ 2013-10-02 16:19                                                   ` Jan Hubicka
  2013-10-02 17:55                                                     ` Teresa Johnson
  2013-10-03 13:42                                                     ` Teresa Johnson
  0 siblings, 2 replies; 62+ messages in thread
From: Jan Hubicka @ 2013-10-02 16:19 UTC (permalink / raw)
  To: Teresa Johnson; +Cc: Jan Hubicka, gcc-patches, marxin.liska

> 2013-09-29  Teresa Johnson  <tejohnson@google.com>
> 
>         * bb-reorder.c (find_rarely_executed_basic_blocks_and_crossing_edges):
>         Treat profile insanities conservatively.
>         * predict.c (probably_never_executed): New function. Treat profile
>         insanities conservatively.
>         (probably_never_executed_bb_p): Invoke probably_never_executed.
>         (probably_never_executed_edge_p): Invoke probably_never_executed.
> 
> Index: bb-reorder.c
> ===================================================================
> --- bb-reorder.c        (revision 202947)
> +++ bb-reorder.c        (working copy)
> @@ -1564,8 +1564,25 @@ find_rarely_executed_basic_blocks_and_crossing_edg
>    /* Mark which partition (hot/cold) each basic block belongs in.  */
>    FOR_EACH_BB (bb)
>      {
> +      bool cold_bb = false;

whitespace here

>        if (probably_never_executed_bb_p (cfun, bb))
>          {
> +          /* Handle profile insanities created by upstream optimizations
> +             by also checking the incoming edge weights. If there is a non-cold
> +             incoming edge, conservatively prevent this block from being split
> +             into the cold section.  */
> +          cold_bb = true;
> +          FOR_EACH_EDGE (e, ei, bb->preds)
> +            {
> +              if (!probably_never_executed_edge_p (cfun, e))
> +                {
> +                  cold_bb = false;
> +                  break;
> +                }
> +            }

You can probably eliminate the extra braces.
So we won't propagate deeper in the CFG, right?

This change is OK.

> +        }
> +      if (cold_bb)
> +        {
>            BB_SET_PARTITION (bb, BB_COLD_PARTITION);
>            cold_bb_count++;
>          }
> Index: predict.c
> ===================================================================
> --- predict.c   (revision 202947)
> +++ predict.c   (working copy)
> @@ -226,26 +226,26 @@ maybe_hot_edge_p (edge e)
>  }
> 
> 
> -/* Return true in case BB is probably never executed.  */
> 
> -bool
> -probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
> +/* Return true if profile COUNT and FREQUENCY, or function FUN static
> +   node frequency reflects never being executed.  */
> +
> +static bool
> +probably_never_executed (struct function *fun,
> +                         gcov_type count, int frequency)
>  {
>    gcc_checking_assert (fun);
>    if (profile_status_for_function (fun) == PROFILE_READ)
>      {
> -      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
> +      if ((count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
>         return false;
> -      if (!bb->frequency)
> -       return true;
> -      if (!ENTRY_BLOCK_PTR->frequency)
> -       return false;
> -      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
> -       {
> -         return (RDIV (bb->frequency * ENTRY_BLOCK_PTR->count,
> -                       ENTRY_BLOCK_PTR->frequency)
> -                 < REG_BR_PROB_BASE / 4);
> -       }
> +      // If this is a profiled function (entry bb non-zero count), then base
> +      // the coldness decision on the frequency. This will handle cases where
> +      // counts are not updated properly during optimizations or expansion.
> +      if (ENTRY_BLOCK_PTR->count)
> +       return frequency == 0;
> +      // Unprofiled function, frequencies statically assigned. All bbs are
> +      // treated as cold.

I would avoid combining C and C++ comments in the function.  
Did you get some data on how many basic blocks we now consider hot?

The previous implementation considered a block as never executed when the
frequencies indicate that it is executed in at most 1/4th of invocations of the
program.  You essentially change that to 1/10000.  The first seems a bit too
high given the way we distribute probabilities in dojump and friends; the
second looks too low.

The change introducing probably_never_executed with the current logic is OK.  
We may want to fine tune the ratio.

Honza
>        return true;
>      }
>    if ((!profile_info || !flag_branch_probabilities)
> @@ -256,19 +256,21 @@ maybe_hot_edge_p (edge e)
>  }
> 
> 
> +/* Return true in case BB is probably never executed.  */
> +
> +bool
> +probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
> +{
> +  return probably_never_executed (fun, bb->count, bb->frequency);
> +}
> +
> +
>  /* Return true in case edge E is probably never executed.  */
> 
>  bool
>  probably_never_executed_edge_p (struct function *fun, edge e)
>  {
> -  gcc_checking_assert (fun);
> -  if (profile_info && flag_branch_probabilities)
> -    return ((e->count + profile_info->runs / 2) / profile_info->runs) == 0;
> -  if ((!profile_info || !flag_branch_probabilities)
> -      && (cgraph_get_node (fun->decl)->frequency
> -         == NODE_FREQUENCY_UNLIKELY_EXECUTED))
> -    return true;
> -  return false;
> +  return probably_never_executed (fun, e->count, EDGE_FREQUENCY (e));
>  }
> 
>  /* Return true if NODE should be optimized for size.  */

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-10-02 16:19                                                   ` Jan Hubicka
@ 2013-10-02 17:55                                                     ` Teresa Johnson
  2013-10-02 18:10                                                       ` Jan Hubicka
  2013-10-03 13:42                                                     ` Teresa Johnson
  1 sibling, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-10-02 17:55 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, marxin.liska

On Wed, Oct 2, 2013 at 9:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> 2013-09-29  Teresa Johnson  <tejohnson@google.com>
>>
>>         * bb-reorder.c (find_rarely_executed_basic_blocks_and_crossing_edges):
>>         Treat profile insanities conservatively.
>>         * predict.c (probably_never_executed): New function. Treat profile
>>         insanities conservatively.
>>         (probably_never_executed_bb_p): Invoke probably_never_executed.
>>         (probably_never_executed_edge_p): Invoke probably_never_executed.
>>
>> Index: bb-reorder.c
>> ===================================================================
>> --- bb-reorder.c        (revision 202947)
>> +++ bb-reorder.c        (working copy)
>> @@ -1564,8 +1564,25 @@ find_rarely_executed_basic_blocks_and_crossing_edg
>>    /* Mark which partition (hot/cold) each basic block belongs in.  */
>>    FOR_EACH_BB (bb)
>>      {
>> +      bool cold_bb = false;
>
> whitespace here

meaning add a line of whitespace? Ok, done.

>
>>        if (probably_never_executed_bb_p (cfun, bb))
>>          {
>> +          /* Handle profile insanities created by upstream optimizations
>> +             by also checking the incoming edge weights. If there is a non-cold
>> +             incoming edge, conservatively prevent this block from being split
>> +             into the cold section.  */
>> +          cold_bb = true;
>> +          FOR_EACH_EDGE (e, ei, bb->preds)
>> +            {
>> +              if (!probably_never_executed_edge_p (cfun, e))
>> +                {
>> +                  cold_bb = false;
>> +                  break;
>> +                }
>> +            }
>
> You can probably eliminate the extra braces.
> So we won't propagate deeper in the CFG, right?

Done.

>
> This change is OK.
>
>> +        }
>> +      if (cold_bb)
>> +        {
>>            BB_SET_PARTITION (bb, BB_COLD_PARTITION);
>>            cold_bb_count++;
>>          }
>> Index: predict.c
>> ===================================================================
>> --- predict.c   (revision 202947)
>> +++ predict.c   (working copy)
>> @@ -226,26 +226,26 @@ maybe_hot_edge_p (edge e)
>>  }
>>
>>
>> -/* Return true in case BB is probably never executed.  */
>>
>> -bool
>> -probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
>> +/* Return true if profile COUNT and FREQUENCY, or function FUN static
>> +   node frequency reflects never being executed.  */
>> +
>> +static bool
>> +probably_never_executed (struct function *fun,
>> +                         gcov_type count, int frequency)
>>  {
>>    gcc_checking_assert (fun);
>>    if (profile_status_for_function (fun) == PROFILE_READ)
>>      {
>> -      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
>> +      if ((count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
>>         return false;
>> -      if (!bb->frequency)
>> -       return true;
>> -      if (!ENTRY_BLOCK_PTR->frequency)
>> -       return false;
>> -      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
>> -       {
>> -         return (RDIV (bb->frequency * ENTRY_BLOCK_PTR->count,
>> -                       ENTRY_BLOCK_PTR->frequency)
>> -                 < REG_BR_PROB_BASE / 4);
>> -       }
>> +      // If this is a profiled function (entry bb non-zero count), then base
>> +      // the coldness decision on the frequency. This will handle cases where
>> +      // counts are not updated properly during optimizations or expansion.
>> +      if (ENTRY_BLOCK_PTR->count)
>> +       return frequency == 0;
>> +      // Unprofiled function, frequencies statically assigned. All bbs are
>> +      // treated as cold.
>
> I would avoid combining C and C++ comments in the function.

Fixed.

> Did you get some data on how many basic blocks we now consider hot?

No, I can do that.

>
> The previous implementation considered a block as never executed when the
> frequencies indicate that it is executed in at most 1/4th of invocations of
> the program.  You essentially change that to 1/10000.  The first seems a bit
> too high given the way we distribute probabilities in dojump and friends; the
> second looks too low.

But why do we want to consider blocks as "probably never executed"
when the frequency suggests they are sometimes executed?

AFAICT, there are 2 main callers of this routine:
1) function splitting in bb-layout
2) function cgraph node weight

Where #2 will affect optimization of the function for size and also
function layout by the linker.

I would argue that for function splitting, we really want to know when
it is probably *never* executed - i.e. completely cold, since the cost
of jumping back and forth to the cold section is likely to be high.

I am not sure for #2 what the right ratio is. For function layout, we
may also want to place only really cold *never* executed functions
into the cold section, but I am less sure about optimization for size.

Perhaps we really need two different interfaces to test for different
levels of coldness:

probably_never_executed()
  -> returns true when there is profile information for the function
and the bb has 0 count and 0 frequency.
  -> invoked from bb-reorder.c to drive function splitting
  -> may want to consider invoking this as an additional check before
putting function into unlikely text section in the future.

possibly_never_executed()
   -> essentially the existing logic in probably_never_executed_bb_p
   -> invoked when marking the cgraph node
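Roughly, with simplified stand-in types (a sketch only, not the real GCC
signatures or heuristics):

   struct bb_info_sketch { long long count; int frequency; };

   /* Completely cold: strict test, drives function splitting.  */
   static int
   probably_never_executed_sketch (struct bb_info_sketch bb)
   {
     return bb.count == 0 && bb.frequency == 0;
   }

   /* Possibly cold: looser test, drives the cgraph node weight and
      size optimization.  The existing probably_never_executed_bb_p
      heuristics (count scaled by runs, frequency ratio against the
      entry block) would live here.  */
   static int
   possibly_never_executed_sketch (struct bb_info_sketch bb)
   {
     return bb.count == 0;
   }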

>
> The change introducing probably_never_executed with the current logic is OK.

Ok, I will commit the two approved parts for now (the change to
bb-reorder.c and the addition of probably_never_executed that uses the
existing logic from probably_never_executed_bb_p.

Thanks,
Teresa

> We may want to fine tune the ratio.
>
> Honza
>>        return true;
>>      }
>>    if ((!profile_info || !flag_branch_probabilities)
>> @@ -256,19 +256,21 @@ maybe_hot_edge_p (edge e)
>>  }
>>
>>
>> +/* Return true in case BB is probably never executed.  */
>> +
>> +bool
>> +probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
>> +{
>> +  return probably_never_executed (fun, bb->count, bb->frequency);
>> +}
>> +
>> +
>>  /* Return true in case edge E is probably never executed.  */
>>
>>  bool
>>  probably_never_executed_edge_p (struct function *fun, edge e)
>>  {
>> -  gcc_checking_assert (fun);
>> -  if (profile_info && flag_branch_probabilities)
>> -    return ((e->count + profile_info->runs / 2) / profile_info->runs) == 0;
>> -  if ((!profile_info || !flag_branch_probabilities)
>> -      && (cgraph_get_node (fun->decl)->frequency
>> -         == NODE_FREQUENCY_UNLIKELY_EXECUTED))
>> -    return true;
>> -  return false;
>> +  return probably_never_executed (fun, e->count, EDGE_FREQUENCY (e));
>>  }
>>
>>  /* Return true if NODE should be optimized for size.  */



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-10-02 17:55                                                     ` Teresa Johnson
@ 2013-10-02 18:10                                                       ` Jan Hubicka
  0 siblings, 0 replies; 62+ messages in thread
From: Jan Hubicka @ 2013-10-02 18:10 UTC (permalink / raw)
  To: Teresa Johnson; +Cc: Jan Hubicka, gcc-patches, marxin.liska

> But why do we want to consider blocks as "probably never executed"
> when the frequency suggests they are sometimes executed?

Well, probably never executed is meant to refer to one run.  If you have
something like code handling fatal errors, you probably still want to have it
in the cold section even if the user may have trained the program on a
testsuite that triggers them once or twice per thousand runs.

We may just make the predicate more strict, but let's do that incrementally so
we know how much things change.

I am somewhat concerned that we are not that effective at breaking
out cold code, so -fprofile-use does not lead to as significant code
size reductions as the theory would suggest; perhaps I am just overly
conservative about this.  Getting the splitting to work reliably is
definitely going to be a win.

> Perhaps we really need two different interfaces to test for different
> levels of coldness:
> 
> probably_never_executed()
>   -> returns true when there is profile information for the function
> and the bb has 0 count and 0 frequency.
>   -> invoked from bb-reorder.c to drive function splitting
>   -> may want to consider invoking this as an additional check before
> putting function into unlikely text section in the future.
> 
> possibly_never_executed()
>    -> essentially the existing logic in probably_never_executed_bb_p
>    -> invoked when marking the cgraph node

Perhaps...
The advantage of the hot/normal/cold split is that it is easy to understand,
but if necessary (i.e. it becomes impossible to tune well) we may add more
stages...

Honza

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-10-02 16:19                                                   ` Jan Hubicka
  2013-10-02 17:55                                                     ` Teresa Johnson
@ 2013-10-03 13:42                                                     ` Teresa Johnson
  2013-10-03 23:37                                                       ` Teresa Johnson
  1 sibling, 1 reply; 62+ messages in thread
From: Teresa Johnson @ 2013-10-03 13:42 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, marxin.liska

On Wed, Oct 2, 2013 at 9:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> 2013-09-29  Teresa Johnson  <tejohnson@google.com>
>>
>>         * bb-reorder.c (find_rarely_executed_basic_blocks_and_crossing_edges):
>>         Treat profile insanities conservatively.
>>         * predict.c (probably_never_executed): New function. Treat profile
>>         insanities conservatively.
>>         (probably_never_executed_bb_p): Invoke probably_never_executed.
>>         (probably_never_executed_edge_p): Invoke probably_never_executed.
>>
>> Index: bb-reorder.c
>> ===================================================================
>> --- bb-reorder.c        (revision 202947)
>> +++ bb-reorder.c        (working copy)
>> @@ -1564,8 +1564,25 @@ find_rarely_executed_basic_blocks_and_crossing_edg
>>    /* Mark which partition (hot/cold) each basic block belongs in.  */
>>    FOR_EACH_BB (bb)
>>      {
>> +      bool cold_bb = false;
>
> whitespace here
>
>>        if (probably_never_executed_bb_p (cfun, bb))
>>          {
>> +          /* Handle profile insanities created by upstream optimizations
>> +             by also checking the incoming edge weights. If there is a non-cold
>> +             incoming edge, conservatively prevent this block from being split
>> +             into the cold section.  */
>> +          cold_bb = true;
>> +          FOR_EACH_EDGE (e, ei, bb->preds)
>> +            {
>> +              if (!probably_never_executed_edge_p (cfun, e))
>> +                {
>> +                  cold_bb = false;
>> +                  break;
>> +                }
>> +            }
>
> You can probably eliminate the extra braces.
> So we won't propagate deeper in the CFG, right?
>
> This change is OK.
>
>> +        }
>> +      if (cold_bb)
>> +        {
>>            BB_SET_PARTITION (bb, BB_COLD_PARTITION);
>>            cold_bb_count++;
>>          }
>> Index: predict.c
>> ===================================================================
>> --- predict.c   (revision 202947)
>> +++ predict.c   (working copy)
>> @@ -226,26 +226,26 @@ maybe_hot_edge_p (edge e)
>>  }
>>
>>
>> -/* Return true in case BB is probably never executed.  */
>>
>> -bool
>> -probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
>> +/* Return true if profile COUNT and FREQUENCY, or function FUN static
>> +   node frequency reflects never being executed.  */
>> +
>> +static bool
>> +probably_never_executed (struct function *fun,
>> +                         gcov_type count, int frequency)
>>  {
>>    gcc_checking_assert (fun);
>>    if (profile_status_for_function (fun) == PROFILE_READ)
>>      {
>> -      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
>> +      if ((count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
>>         return false;
>> -      if (!bb->frequency)
>> -       return true;
>> -      if (!ENTRY_BLOCK_PTR->frequency)
>> -       return false;
>> -      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
>> -       {
>> -         return (RDIV (bb->frequency * ENTRY_BLOCK_PTR->count,
>> -                       ENTRY_BLOCK_PTR->frequency)
>> -                 < REG_BR_PROB_BASE / 4);
>> -       }
>> +      // If this is a profiled function (entry bb non-zero count), then base
>> +      // the coldness decision on the frequency. This will handle cases where
>> +      // counts are not updated properly during optimizations or expansion.
>> +      if (ENTRY_BLOCK_PTR->count)
>> +       return frequency == 0;
>> +      // Unprofiled function, frequencies statically assigned. All bbs are
>> +      // treated as cold.
>
> I would avoid combining C and C++ comments in the function.
> Did you get some data on how many basic blocks we now consider hot?
>
> The previous implementation considered a block as never executed when the
> frequencies indicate that it is executed in at most 1/4th of the invocations
> of the program.  You essentially change that to 1/10000.  The first seems a
> bit too high given the way we distribute probabilities in dojump and
> friends; the second looks too low.

Actually, I don't think the current code is detecting when the
frequencies indicate it was executed 1/4 of the time. The current code takes
a ratio of the entry block count, and compares it to
REG_BR_PROB_BASE/4, which seems like the wrong comparison for profile
counts. Shouldn't this be something like:

gcov_type computed_count = RDIV (frequency * ENTRY_BLOCK_PTR->count,
                                 ENTRY_BLOCK_PTR->frequency);
if ((computed_count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
  return false;
return true;

i.e. do the same check we do for bb->count above. And the check
guarding this is looking for ENTRY_BLOCK_PTR->count <
REG_BR_PROB_BASE, which also doesn't seem right, although we need to
ensure that we don't overflow when multiplying by frequency.
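
Something along these lines, perhaps (an untested sketch of the body of
probably_never_executed; the explicit overflow guard and its 1<<48 threshold
are only illustrative):

  if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->frequency)
    {
      gcov_type computed_count;
      /* Guard against overflowing gcov_type when scaling the entry count
         by FREQUENCY; for very large counts divide first, at a small cost
         in precision.  */
      if (frequency == 0)
        computed_count = 0;
      else if (ENTRY_BLOCK_PTR->count < (((gcov_type) 1) << 48) / frequency)
        computed_count = RDIV (frequency * ENTRY_BLOCK_PTR->count,
                               ENTRY_BLOCK_PTR->frequency);
      else
        computed_count = frequency * RDIV (ENTRY_BLOCK_PTR->count,
                                           ENTRY_BLOCK_PTR->frequency);
      /* Same 1/4-of-a-run threshold as the bb->count check above.  */
      if ((computed_count * 4 + profile_info->runs / 2)
          / profile_info->runs > 0)
        return false;
      return true;
    }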

Teresa

>
> The change introducing probably_never_executed with the current logic is OK.
> We may want to fine tune the ratio.
>
> Honza
>>        return true;
>>      }
>>    if ((!profile_info || !flag_branch_probabilities)
>> @@ -256,19 +256,21 @@ maybe_hot_edge_p (edge e)
>>  }
>>
>>
>> +/* Return true in case BB is probably never executed.  */
>> +
>> +bool
>> +probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
>> +{
>> +  return probably_never_executed (fun, bb->count, bb->frequency);
>> +}
>> +
>> +
>>  /* Return true in case edge E is probably never executed.  */
>>
>>  bool
>>  probably_never_executed_edge_p (struct function *fun, edge e)
>>  {
>> -  gcc_checking_assert (fun);
>> -  if (profile_info && flag_branch_probabilities)
>> -    return ((e->count + profile_info->runs / 2) / profile_info->runs) == 0;
>> -  if ((!profile_info || !flag_branch_probabilities)
>> -      && (cgraph_get_node (fun->decl)->frequency
>> -         == NODE_FREQUENCY_UNLIKELY_EXECUTED))
>> -    return true;
>> -  return false;
>> +  return probably_never_executed (fun, e->count, EDGE_FREQUENCY (e));
>>  }
>>
>>  /* Return true if NODE should be optimized for size.  */



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition
  2013-10-03 13:42                                                     ` Teresa Johnson
@ 2013-10-03 23:37                                                       ` Teresa Johnson
  0 siblings, 0 replies; 62+ messages in thread
From: Teresa Johnson @ 2013-10-03 23:37 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: gcc-patches, marxin.liska

On Thu, Oct 3, 2013 at 6:41 AM, Teresa Johnson <tejohnson@google.com> wrote:
> On Wed, Oct 2, 2013 at 9:19 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> 2013-09-29  Teresa Johnson  <tejohnson@google.com>
>>>
>>>         * bb-reorder.c (find_rarely_executed_basic_blocks_and_crossing_edges):
>>>         Treat profile insanities conservatively.
>>>         * predict.c (probably_never_executed): New function. Treat profile
>>>         insanities conservatively.
>>>         (probably_never_executed_bb_p): Invoke probably_never_executed.
>>>         (probably_never_executed_edge_p): Invoke probably_never_executed.
>>>
>>> Index: bb-reorder.c
>>> ===================================================================
>>> --- bb-reorder.c        (revision 202947)
>>> +++ bb-reorder.c        (working copy)
>>> @@ -1564,8 +1564,25 @@ find_rarely_executed_basic_blocks_and_crossing_edg
>>>    /* Mark which partition (hot/cold) each basic block belongs in.  */
>>>    FOR_EACH_BB (bb)
>>>      {
>>> +      bool cold_bb = false;
>>
>> whitespace here
>>
>>>        if (probably_never_executed_bb_p (cfun, bb))
>>>          {
>>> +          /* Handle profile insanities created by upstream optimizations
>>> +             by also checking the incoming edge weights. If there is a non-cold
>>> +             incoming edge, conservatively prevent this block from being split
>>> +             into the cold section.  */
>>> +          cold_bb = true;
>>> +          FOR_EACH_EDGE (e, ei, bb->preds)
>>> +            {
>>> +              if (!probably_never_executed_edge_p (cfun, e))
>>> +                {
>>> +                  cold_bb = false;
>>> +                  break;
>>> +                }
>>> +            }
>>
>> You can probably eliminate the extra braces.
>> So we won't propagate deeper in the CFG, right?
>>
>> This change is OK.
>>
>>> +        }
>>> +      if (cold_bb)
>>> +        {
>>>            BB_SET_PARTITION (bb, BB_COLD_PARTITION);
>>>            cold_bb_count++;
>>>          }
>>> Index: predict.c
>>> ===================================================================
>>> --- predict.c   (revision 202947)
>>> +++ predict.c   (working copy)
>>> @@ -226,26 +226,26 @@ maybe_hot_edge_p (edge e)
>>>  }
>>>
>>>
>>> -/* Return true in case BB is probably never executed.  */
>>>
>>> -bool
>>> -probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
>>> +/* Return true if profile COUNT and FREQUENCY, or function FUN static
>>> +   node frequency reflects never being executed.  */
>>> +
>>> +static bool
>>> +probably_never_executed (struct function *fun,
>>> +                         gcov_type count, int frequency)
>>>  {
>>>    gcc_checking_assert (fun);
>>>    if (profile_status_for_function (fun) == PROFILE_READ)
>>>      {
>>> -      if ((bb->count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
>>> +      if ((count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
>>>         return false;
>>> -      if (!bb->frequency)
>>> -       return true;
>>> -      if (!ENTRY_BLOCK_PTR->frequency)
>>> -       return false;
>>> -      if (ENTRY_BLOCK_PTR->count && ENTRY_BLOCK_PTR->count < REG_BR_PROB_BASE)
>>> -       {
>>> -         return (RDIV (bb->frequency * ENTRY_BLOCK_PTR->count,
>>> -                       ENTRY_BLOCK_PTR->frequency)
>>> -                 < REG_BR_PROB_BASE / 4);
>>> -       }
>>> +      // If this is a profiled function (entry bb non-zero count), then base
>>> +      // the coldness decision on the frequency. This will handle cases where
>>> +      // counts are not updated properly during optimizations or expansion.
>>> +      if (ENTRY_BLOCK_PTR->count)
>>> +       return frequency == 0;
>>> +      // Unprofiled function, frequencies statically assigned. All bbs are
>>> +      // treated as cold.
>>
>> I would avoid combining C and C++ comments in the function.
>> Did you get some data on how many basic blocks we now consider hot?
>>
>> The previous implementation considered a block as never executed when the
>> frequencies indicate that it is executed in at most 1/4th of the invocations
>> of the program.  You essentially change that to 1/10000.  The first seems a
>> bit too high given the way we distribute probabilities in dojump and
>> friends; the second looks too low.
>
> Actually, I don't think the current code is detecting when the
> frequencies indicate it was executed 1/4 of the time. The current code takes
> a ratio of the entry block count, and compares it to
> REG_BR_PROB_BASE/4, which seems like the wrong comparison for profile
> counts. Shouldn't this be something like:
>
> gcov_type computed_count = RDIV (frequency * ENTRY_BLOCK_PTR->count,
>                                  ENTRY_BLOCK_PTR->frequency);
> if ((computed_count * 4 + profile_info->runs / 2) / profile_info->runs > 0)
>   return false;
> return true;
>
> i.e. do the same check we do for bb->count above. And the check
> guarding this is looking for ENTRY_BLOCK_PTR->count <
> REG_BR_PROB_BASE, which also doesn't seem right, although we need to
> ensure that we don't overflow when multiplying by frequency.

I have a variant of this change that works well and has the proper
overflow checking. It gets rid of 13 failures with a single training
run. Changing the required ratio from 1/4 to 1/100 of the training runs
reduces the single-run failures by another 5, since smaller frequencies
are handled even when the count has been truncated to 0.

I found that a couple more failures with a single training run were due
to inlining profile update issues, which I have a fix for. At that
point, the failures are the same between the 1 training run and 100
training run cases.

After that, there are 2 remaining failures (both with 1 and 100
training runs) that go away when I disable loop unrolling; I haven't
looked at those yet.

I'll send a patch with the above changes, hopefully tonight. What do
you think of changing the required ratio of execution count to profile
runs to 1/100?
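
Concretely, the 1/100 version would just change the scaling factor in the
run-ratio test, e.g. (sketch, untested):

  /* Treat the block as executed if it ran in at least roughly 1/100th of
     the training runs, instead of the current 1/4th.  */
  if ((count * 100 + profile_info->runs / 2) / profile_info->runs > 0)
    return false;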

Teresa

>
> Teresa
>
>>
>> The change introducing probably_never_executed with the current logic is OK.
>> We may want to fine tune the ratio.
>>
>> Honza
>>>        return true;
>>>      }
>>>    if ((!profile_info || !flag_branch_probabilities)
>>> @@ -256,19 +256,21 @@ maybe_hot_edge_p (edge e)
>>>  }
>>>
>>>
>>> +/* Return true in case BB is probably never executed.  */
>>> +
>>> +bool
>>> +probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
>>> +{
>>> +  return probably_never_executed (fun, bb->count, bb->frequency);
>>> +}
>>> +
>>> +
>>>  /* Return true in case edge E is probably never executed.  */
>>>
>>>  bool
>>>  probably_never_executed_edge_p (struct function *fun, edge e)
>>>  {
>>> -  gcc_checking_assert (fun);
>>> -  if (profile_info && flag_branch_probabilities)
>>> -    return ((e->count + profile_info->runs / 2) / profile_info->runs) == 0;
>>> -  if ((!profile_info || !flag_branch_probabilities)
>>> -      && (cgraph_get_node (fun->decl)->frequency
>>> -         == NODE_FREQUENCY_UNLIKELY_EXECUTED))
>>> -    return true;
>>> -  return false;
>>> +  return probably_never_executed (fun, e->count, EDGE_FREQUENCY (e));
>>>  }
>>>
>>>  /* Return true if NODE should be optimized for size.  */
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2013-10-03 23:37 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-01 16:32 [PATCH] Sanitize block partitioning under -freorder-blocks-and-partition Teresa Johnson
2013-08-02 11:22 ` Bernhard Reutner-Fischer
2013-08-02 14:51   ` Teresa Johnson
2013-08-02 15:05     ` Jan Hubicka
2013-08-02 23:05       ` Steven Bosscher
2013-08-03  4:53         ` Teresa Johnson
2013-08-03  4:48       ` Teresa Johnson
2013-08-05 13:36         ` Teresa Johnson
2013-08-05 14:11           ` Jan Hubicka
2013-08-05 14:57             ` Teresa Johnson
2013-08-06  3:01               ` Teresa Johnson
     [not found]           ` <20130808222332.GA31755@kam.mff.cuni.cz>
2013-08-08 23:04             ` Teresa Johnson
2013-08-09  9:58               ` Jan Hubicka
2013-08-09 14:38                 ` Teresa Johnson
2013-08-09 15:28                   ` Jan Hubicka
2013-08-09 15:54                     ` Martin Liška
2013-08-09 21:03                       ` Teresa Johnson
2013-08-09 21:02                     ` Teresa Johnson
2013-08-09 22:43                       ` Jan Hubicka
2013-08-11 12:21                       ` Jan Hubicka
2013-08-11 13:25                         ` Teresa Johnson
2013-08-11 15:55                           ` Martin Liška
2013-08-11 17:55                             ` Jan Hubicka
2013-08-11 21:05                           ` Jan Hubicka
2013-08-17 15:54                       ` Teresa Johnson
2013-08-17 21:02                         ` Jan Hubicka
2013-08-19 13:51                           ` Teresa Johnson
2013-08-19 15:16                             ` Jan Hubicka
2013-08-19 17:48                               ` Teresa Johnson
2013-08-19 19:56                                 ` Martin Liška
2013-08-27 18:12                                 ` Teresa Johnson
2013-08-28 16:59                                   ` Jan Hubicka
2013-08-28 18:35                                     ` Teresa Johnson
2013-08-30  7:17                                       ` Teresa Johnson
2013-08-30  9:16                                         ` Jan Hubicka
2013-08-30 15:13                                           ` Teresa Johnson
2013-08-30 15:28                                             ` Jan Hubicka
2013-08-30 15:54                                               ` Teresa Johnson
2013-08-30 21:56                                             ` Rong Xu
2013-08-31 16:20                                   ` Jan Hubicka
2013-08-31 23:40                                     ` Jan Hubicka
2013-09-24  8:07                                       ` Teresa Johnson
2013-09-24 13:44                                         ` Jan Hubicka
2013-09-24 19:06                                           ` Teresa Johnson
2013-09-26 20:55                                           ` Rong Xu
2013-09-26 22:23                                             ` Jan Hubicka
2013-09-26 22:54                                               ` Rong Xu
2013-09-24 18:28                                         ` Jan Hubicka
2013-09-24 18:51                                           ` Teresa Johnson
2013-09-25 23:10                                             ` Teresa Johnson
2013-09-26  8:44                                               ` Teresa Johnson
2013-09-26 22:26                                             ` Jan Hubicka
2013-09-27 14:50                                               ` Teresa Johnson
2013-09-29 17:34                                                 ` Teresa Johnson
2013-10-02 16:19                                                   ` Jan Hubicka
2013-10-02 17:55                                                     ` Teresa Johnson
2013-10-02 18:10                                                       ` Jan Hubicka
2013-10-03 13:42                                                     ` Teresa Johnson
2013-10-03 23:37                                                       ` Teresa Johnson
2013-10-01 17:36                                             ` Teresa Johnson
2013-08-19 15:34                           ` Teresa Johnson
2013-08-21 15:31                             ` Jan Hubicka
