public inbox for gcc-patches@gcc.gnu.org
* Initial shrink-wrapping patch
@ 2011-08-31 20:37 Bernd Schmidt
  2011-09-01 13:57 ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Bernd Schmidt @ 2011-08-31 20:37 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 781 bytes --]

This is a new version of the original 4/6 shrink wrapping patch, minus
the preliminary bits that were already approved, and with some extra bug
fixes that were discovered in the meantime.  This is now mostly confined
to function.c.  I'll resubmit the other parts (i.e. exposing more
shrink-wrapping opportunities through code movement and copy
propagation) later after retesting them.

This requires the two patches I sent earlier today, for mips epilogue
unwinding and copying the frame_related bit.  Together with these,
bootstrapped and regression-tested on i686-linux, and sim tested on
mips64-elf.  ARM tests are still running
so I've dropped those target-specific bits for now.

Experience with testing on mips shows that at the very least the
consistency checks in dwarf2cfi are working well.


Bernd

[-- Attachment #2: sw0831b.diff --]
[-- Type: text/plain, Size: 36261 bytes --]

	* doc/tm.texi (RETURN_ADDR_REGNUM): Document.
	* doc/invoke.texi (-fshrink-wrap): Document.
	* opts.c (default_options_table): Add it.
	* common.opt (fshrink-wrap): Add.
	* function.c (emit_return_into_block): Remove useless declaration.
	(record_hard_reg_uses_1, record_hard_reg_uses, frame_required_for_rtx,
	requires_stack_frame_p, gen_return_pattern): New static functions.
	(emit_return_into_block): New arg simple_p.  All callers changed.
	Use gen_return_pattern.
	(thread_prologue_and_epilogue_insns): Implement shrink-wrapping.
	* config/i386/i386.md (returns): New code_iterator.
	(return_str, return_cond): New code_attrs.
	(<return_str>return, <return_str>return_internal,
	<return_str>return_internal_long, <return_str>return_pop_internal,
	<return_str>return_indirect_internal): New patterns, replacing...
	(return, return_internal, return_internal_long, return_pop_internal,
	return_indirect_internal): ... these.
	* config/i386/i386.c (ix86_expand_epilogue): Adjust to expand
	simple returns.
	(ix86_pad_returns): Likewise.

Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	(revision 178135)
+++ gcc/doc/tm.texi	(working copy)
@@ -3208,6 +3208,12 @@ Define this if the return address of a p
 from the frame pointer of the previous stack frame.
 @end defmac
 
+@defmac RETURN_ADDR_REGNUM
+If defined, a C expression whose value is the register number of the return
+address for the current function.  Targets that pass the return address on
+the stack should not define this macro.
+@end defmac
+
 @defmac INCOMING_RETURN_ADDR_RTX
 A C expression whose value is RTL representing the location of the
 incoming return address at the beginning of any function, before the
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 178135)
+++ gcc/doc/invoke.texi	(working copy)
@@ -395,10 +395,10 @@ Objective-C and Objective-C++ Dialects}.
 -fschedule-insns -fschedule-insns2 -fsection-anchors @gol
 -fselective-scheduling -fselective-scheduling2 @gol
 -fsel-sched-pipelining -fsel-sched-pipelining-outer-loops @gol
--fsignaling-nans -fsingle-precision-constant -fsplit-ivs-in-unroller @gol
--fsplit-wide-types -fstack-protector -fstack-protector-all @gol
--fstrict-aliasing -fstrict-overflow -fthread-jumps -ftracer @gol
--ftree-bit-ccp @gol
+-fshrink-wrap -fsignaling-nans -fsingle-precision-constant @gol
+-fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector @gol
+-fstack-protector-all -fstrict-aliasing -fstrict-overflow @gol
+-fthread-jumps -ftracer -ftree-bit-ccp @gol
 -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-copy-prop @gol
 -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse @gol
 -ftree-forwprop -ftree-fre -ftree-loop-if-convert @gol
@@ -6872,6 +6872,12 @@ This option has no effect until one of @
 When pipelining loops during selective scheduling, also pipeline outer loops.
 This option has no effect until @option{-fsel-sched-pipelining} is turned on.
 
+@item -fshrink-wrap
+@opindex fshrink-wrap
+Emit function prologues only before parts of the function that need it,
+rather than at the top of the function.  This flag is enabled by default at
+@option{-O} and higher.
+
 @item -fcaller-saves
 @opindex fcaller-saves
 Enable values to be allocated in registers that will be clobbered by
Index: gcc/opts.c
===================================================================
--- gcc/opts.c	(revision 178135)
+++ gcc/opts.c	(working copy)
@@ -434,6 +434,7 @@ static const struct default_options defa
     { OPT_LEVELS_1_PLUS, OPT_fipa_reference, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_fipa_profile, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_fmerge_constants, NULL, 1 },
+    { OPT_LEVELS_1_PLUS, OPT_fshrink_wrap, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_fsplit_wide_types, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_ccp, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_bit_ccp, NULL, 1 },
Index: gcc/function.c
===================================================================
--- gcc/function.c	(revision 178135)
+++ gcc/function.c	(working copy)
@@ -147,9 +147,6 @@ extern tree debug_find_var_in_block_tree
    can always export `prologue_epilogue_contains'.  */
 static void record_insns (rtx, rtx, htab_t *) ATTRIBUTE_UNUSED;
 static bool contains (const_rtx, htab_t);
-#ifdef HAVE_return
-static void emit_return_into_block (basic_block);
-#endif
 static void prepare_function_start (void);
 static void do_clobber_return_reg (rtx, void *);
 static void do_use_return_reg (rtx, void *);
@@ -5285,6 +5282,77 @@ prologue_epilogue_contains (const_rtx in
   return 0;
 }
 
+#ifdef HAVE_simple_return
+
+/* A for_each_rtx subroutine of record_hard_reg_sets.  */
+static int
+record_hard_reg_uses_1 (rtx *px, void *data)
+{
+  rtx x = *px;
+  HARD_REG_SET *pused = (HARD_REG_SET *)data;
+
+  if (REG_P (x) && REGNO (x) < FIRST_PSEUDO_REGISTER)
+    {
+      int nregs = hard_regno_nregs[REGNO (x)][GET_MODE (x)];
+      while (nregs-- > 0)
+	SET_HARD_REG_BIT (*pused, REGNO (x) + nregs);
+    }
+  return 0;
+}
+
+/* Like record_hard_reg_sets, but called through note_uses.  */
+static void
+record_hard_reg_uses (rtx *px, void *data)
+{
+  for_each_rtx (px, record_hard_reg_uses_1, data);
+}
+
+/* A subroutine of requires_stack_frame_p, called via for_each_rtx.
+   If any change is made, set CHANGED
+   to true.  */
+
+static int
+frame_required_for_rtx (rtx *loc, void *data ATTRIBUTE_UNUSED)
+{
+  rtx x = *loc;
+  if (x == stack_pointer_rtx || x == hard_frame_pointer_rtx
+      || x == arg_pointer_rtx || x == pic_offset_table_rtx
+#ifdef RETURN_ADDR_REGNUM
+      || (REG_P (x) && REGNO (x) == RETURN_ADDR_REGNUM)
+#endif
+      )
+    return 1;
+  return 0;
+}
+
+/* Return true if INSN requires the stack frame to be set up.
+   PROLOGUE_USED contains the hard registers used in the function
+   prologue.  */
+static bool
+requires_stack_frame_p (rtx insn, HARD_REG_SET prologue_used)
+{
+  HARD_REG_SET hardregs;
+  unsigned regno;
+
+  if (!INSN_P (insn) || DEBUG_INSN_P (insn))
+    return false;
+  if (CALL_P (insn))
+    return !SIBLING_CALL_P (insn);
+  if (for_each_rtx (&PATTERN (insn), frame_required_for_rtx, NULL))
+    return true;
+  CLEAR_HARD_REG_SET (hardregs);
+  note_stores (PATTERN (insn), record_hard_reg_sets, &hardregs);
+  if (hard_reg_set_intersect_p (hardregs, prologue_used))
+    return true;
+  AND_COMPL_HARD_REG_SET (hardregs, call_used_reg_set);
+  for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
+    if (TEST_HARD_REG_BIT (hardregs, regno)
+	&& df_regs_ever_live_p (regno))
+      return true;
+  return false;
+}
+#endif
+
 #ifdef HAVE_return
 /* Insert use of return register before the end of BB.  */
 
@@ -5299,45 +5367,141 @@ emit_use_return_register_into_block (bas
   emit_insn_before (seq, BB_END (bb));
 }
 
-/* Insert gen_return at the end of block BB.  This also means updating
-   block_for_insn appropriately.  */
+
+static rtx
+gen_return_pattern (bool simple_p)
+{
+#ifdef HAVE_simple_return
+  return simple_p ? gen_simple_return () : gen_return ();
+#else
+  gcc_assert (!simple_p);
+  return gen_return ();
+#endif
+}
+
+/* Insert an appropriate return pattern at the end of block BB.  This
+   also means updating block_for_insn appropriately.  */
 
 static void
-emit_return_into_block (basic_block bb)
+emit_return_into_block (bool simple_p, basic_block bb)
 {
-  rtx jump = emit_jump_insn_after (gen_return (), BB_END (bb));
-  rtx pat = PATTERN (jump);
+  rtx jump, pat;
+  jump = emit_jump_insn_after (gen_return_pattern (simple_p), BB_END (bb));
+  pat = PATTERN (jump);
   if (GET_CODE (pat) == PARALLEL)
     pat = XVECEXP (pat, 0, 0);
   gcc_assert (ANY_RETURN_P (pat));
   JUMP_LABEL (jump) = pat;
 }
-#endif /* HAVE_return */
+#endif
 
 /* Generate the prologue and epilogue RTL if the machine supports it.  Thread
    this into place with notes indicating where the prologue ends and where
-   the epilogue begins.  Update the basic block information when possible.  */
+   the epilogue begins.  Update the basic block information when possible.
+
+   Notes on epilogue placement:
+   There are several kinds of edges to the exit block:
+   * a single fallthru edge from LAST_BB
+   * possibly, edges from blocks containing sibcalls
+   * possibly, fake edges from infinite loops
+
+   The epilogue is always emitted on the fallthru edge from the last basic
+   block in the function, LAST_BB, into the exit block.
+
+   If LAST_BB is empty except for a label, it is the target of every
+   other basic block in the function that ends in a return.  If a
+   target has a return or simple_return pattern (possibly with
+   conditional variants), these basic blocks can be changed so that a
+   return insn is emitted into them, and their target is adjusted to
+   the real exit block.
+
+   Notes on shrink wrapping: We implement a fairly conservative
+   version of shrink-wrapping rather than the textbook one.  We only
+   generate a single prologue and a single epilogue.  This is
+   sufficient to catch a number of interesting cases involving early
+   exits.
+
+   First, we identify the blocks that require the prologue to occur before
+   them.  These are the ones that modify a call-saved register, or reference
+   any of the stack or frame pointer registers.  To simplify things, we then
+   mark everything reachable from these blocks as also requiring a prologue.
+   This takes care of loops automatically, and avoids the need to examine
+   whether MEMs reference the frame, since it is sufficient to check for
+   occurrences of the stack or frame pointer.
+
+   We then compute the set of blocks for which the need for a prologue
+   is anticipatable (borrowing terminology from the shrink-wrapping
+   description in Muchnick's book).  These are the blocks which either
+   require a prologue themselves, or those that have only successors
+   where the prologue is anticipatable.  The prologue needs to be
+   inserted on all edges from BB1->BB2 where BB2 is in ANTIC and BB1
+   is not.  For the moment, we ensure that only one such edge exists.
+
+   The epilogue is placed as described above, but we make a
+   distinction between inserting return and simple_return patterns
+   when modifying other blocks that end in a return.  Blocks that end
+   in a sibcall omit the sibcall_epilogue if the block is not in
+   ANTIC.  */
 
 static void
 thread_prologue_and_epilogue_insns (void)
 {
   bool inserted;
+  basic_block last_bb;
+  bool last_bb_active;
+#ifdef HAVE_simple_return
+  bool unconverted_simple_returns = false;
+  basic_block simple_return_block_hot = NULL;
+  basic_block simple_return_block_cold = NULL;
+#endif
+  rtx returnjump ATTRIBUTE_UNUSED;
   rtx seq ATTRIBUTE_UNUSED, epilogue_end ATTRIBUTE_UNUSED;
-  edge entry_edge, e;
+  rtx prologue_seq ATTRIBUTE_UNUSED, split_prologue_seq ATTRIBUTE_UNUSED;
+  edge e, entry_edge, orig_entry_edge, exit_fallthru_edge;
   edge_iterator ei;
+  bitmap_head bb_flags;
+
+  df_analyze ();
 
   rtl_profile_for_bb (ENTRY_BLOCK_PTR);
 
   inserted = false;
   seq = NULL_RTX;
+  prologue_seq = NULL_RTX;
   epilogue_end = NULL_RTX;
+  returnjump = NULL_RTX;
 
   /* Can't deal with multiple successors of the entry block at the
      moment.  Function should always have at least one entry
      point.  */
   gcc_assert (single_succ_p (ENTRY_BLOCK_PTR));
   entry_edge = single_succ_edge (ENTRY_BLOCK_PTR);
+  orig_entry_edge = entry_edge;
 
+  exit_fallthru_edge = find_fallthru_edge (EXIT_BLOCK_PTR->preds);
+  if (exit_fallthru_edge != NULL)
+    {
+      rtx label;
+
+      last_bb = exit_fallthru_edge->src;
+      /* Test whether there are active instructions in the last block.  */
+      label = BB_END (last_bb);
+      while (label && !LABEL_P (label))
+	{
+	  if (active_insn_p (label))
+	    break;
+	  label = PREV_INSN (label);
+	}
+
+      last_bb_active = BB_HEAD (last_bb) != label || !LABEL_P (label);
+    }
+  else
+    {
+      last_bb = NULL;
+      last_bb_active = false;
+    }
+
+  split_prologue_seq = NULL_RTX;
   if (flag_split_stack
       && (lookup_attribute ("no_split_stack", DECL_ATTRIBUTES (cfun->decl))
 	  == NULL))
@@ -5349,17 +5513,15 @@ thread_prologue_and_epilogue_insns (void
 
       start_sequence ();
       emit_insn (gen_split_stack_prologue ());
-      seq = get_insns ();
+      split_prologue_seq = get_insns ();
       end_sequence ();
 
-      record_insns (seq, NULL, &prologue_insn_hash);
-      set_insn_locators (seq, prologue_locator);
-
-      insert_insn_on_edge (seq, entry_edge);
-      inserted = true;
+      record_insns (split_prologue_seq, NULL, &prologue_insn_hash);
+      set_insn_locators (split_prologue_seq, prologue_locator);
 #endif
     }
 
+  prologue_seq = NULL_RTX;
 #ifdef HAVE_prologue
   if (HAVE_prologue)
     {
@@ -5382,15 +5544,196 @@ thread_prologue_and_epilogue_insns (void
       if (!targetm.profile_before_prologue () && crtl->profile)
         emit_insn (gen_blockage ());
 
-      seq = get_insns ();
+      prologue_seq = get_insns ();
       end_sequence ();
-      set_insn_locators (seq, prologue_locator);
+      set_insn_locators (prologue_seq, prologue_locator);
+    }
+#endif
 
-      insert_insn_on_edge (seq, entry_edge);
-      inserted = true;
+  bitmap_initialize (&bb_flags, &bitmap_default_obstack);
+
+#ifdef HAVE_simple_return
+  /* Try to perform a kind of shrink-wrapping, making sure the
+     prologue/epilogue is emitted only around those parts of the
+     function that require it.  */
+
+  if (flag_shrink_wrap && HAVE_simple_return && !flag_non_call_exceptions
+      && HAVE_prologue && !crtl->calls_eh_return)
+    {
+      HARD_REG_SET prologue_clobbered, prologue_used, live_on_edge;
+      rtx p_insn;
+
+      VEC(basic_block, heap) *vec;
+      basic_block bb;
+      bitmap_head bb_antic_flags;
+      bitmap_head bb_on_list;
+
+      /* Compute the registers set and used in the prologue.  */
+      CLEAR_HARD_REG_SET (prologue_clobbered);
+      CLEAR_HARD_REG_SET (prologue_used);
+      for (p_insn = prologue_seq; p_insn; p_insn = NEXT_INSN (p_insn))
+	{
+	  HARD_REG_SET this_used;
+	  if (!NONDEBUG_INSN_P (p_insn))
+	    continue;
+
+	  CLEAR_HARD_REG_SET (this_used);
+	  note_uses (&PATTERN (p_insn), record_hard_reg_uses,
+		     &this_used);
+	  AND_COMPL_HARD_REG_SET (this_used, prologue_clobbered);
+	  IOR_HARD_REG_SET (prologue_used, this_used);
+	  note_stores (PATTERN (p_insn), record_hard_reg_sets,
+		       &prologue_clobbered);
+	}
+
+      bitmap_initialize (&bb_antic_flags, &bitmap_default_obstack);
+      bitmap_initialize (&bb_on_list, &bitmap_default_obstack);
+
+      vec = VEC_alloc (basic_block, heap, n_basic_blocks);
+
+      FOR_EACH_BB (bb)
+	{
+	  rtx insn;
+	  FOR_BB_INSNS (bb, insn)
+	    {
+	      if (requires_stack_frame_p (insn, prologue_used))
+		{
+		  bitmap_set_bit (&bb_flags, bb->index);
+		  VEC_quick_push (basic_block, vec, bb);
+		  break;
+		}
+	    }
+	}
+
+      /* For every basic block that needs a prologue, mark all blocks
+	 reachable from it, so as to ensure they are also seen as
+	 requiring a prologue.  */
+      while (!VEC_empty (basic_block, vec))
+	{
+	  basic_block tmp_bb = VEC_pop (basic_block, vec);
+	  edge e;
+	  edge_iterator ei;
+	  FOR_EACH_EDGE (e, ei, tmp_bb->succs)
+	    {
+	      if (e->dest == EXIT_BLOCK_PTR
+		  || bitmap_bit_p (&bb_flags, e->dest->index))
+		continue;
+	      bitmap_set_bit (&bb_flags, e->dest->index);
+	      VEC_quick_push (basic_block, vec, e->dest);
+	    }
+	}
+      /* If the last basic block contains only a label, we'll be able
+	 to convert jumps to it to (potentially conditional) return
+	 insns later.  This means we don't necessarily need a prologue
+	 for paths reaching it.  */
+      if (last_bb)
+	{
+	  if (!last_bb_active)
+	    bitmap_clear_bit (&bb_flags, last_bb->index);
+	  else if (!bitmap_bit_p (&bb_flags, last_bb->index))
+	    goto fail_shrinkwrap;
+	}
+
+      /* Now walk backwards from every block that is marked as needing
+	 a prologue to compute the bb_antic_flags bitmap.  */
+      bitmap_copy (&bb_antic_flags, &bb_flags);
+      FOR_EACH_BB (bb)
+	{
+	  edge e;
+	  edge_iterator ei;
+	  if (!bitmap_bit_p (&bb_flags, bb->index))
+	    continue;
+	  FOR_EACH_EDGE (e, ei, bb->preds)
+	    if (!bitmap_bit_p (&bb_antic_flags, e->src->index))
+	      {
+		VEC_quick_push (basic_block, vec, e->src);
+		bitmap_set_bit (&bb_on_list, e->src->index);
+	      }
+	}
+      while (!VEC_empty (basic_block, vec))
+	{
+	  basic_block tmp_bb = VEC_pop (basic_block, vec);
+	  edge e;
+	  edge_iterator ei;
+	  bool all_set = true;
+
+	  bitmap_clear_bit (&bb_on_list, tmp_bb->index);
+	  FOR_EACH_EDGE (e, ei, tmp_bb->succs)
+	    {
+	      if (!bitmap_bit_p (&bb_antic_flags, e->dest->index))
+		{
+		  all_set = false;
+		  break;
+		}
+	    }
+	  if (all_set)
+	    {
+	      bitmap_set_bit (&bb_antic_flags, tmp_bb->index);
+	      FOR_EACH_EDGE (e, ei, tmp_bb->preds)
+		if (!bitmap_bit_p (&bb_antic_flags, e->src->index))
+		  {
+		    VEC_quick_push (basic_block, vec, e->src);
+		    bitmap_set_bit (&bb_on_list, e->src->index);
+		  }
+	    }
+	}
+      /* Find exactly one edge that leads to a block in ANTIC from
+	 a block that isn't.  */
+      if (!bitmap_bit_p (&bb_antic_flags, entry_edge->dest->index))
+	FOR_EACH_BB (bb)
+	  {
+	    if (!bitmap_bit_p (&bb_antic_flags, bb->index))
+	      continue;
+	    FOR_EACH_EDGE (e, ei, bb->preds)
+	      if (!bitmap_bit_p (&bb_antic_flags, e->src->index))
+		{
+		  if (entry_edge != orig_entry_edge)
+		    {
+		      entry_edge = orig_entry_edge;
+		      goto fail_shrinkwrap;
+		    }
+		  entry_edge = e;
+		}
+	  }
+
+      /* Test whether the prologue is known to clobber any register
+	 (other than FP or SP) which are live on the edge.  */
+      CLEAR_HARD_REG_SET (prologue_clobbered);
+      for (p_insn = prologue_seq; p_insn; p_insn = NEXT_INSN (p_insn))
+	if (NONDEBUG_INSN_P (p_insn))
+	  note_stores (PATTERN (p_insn), record_hard_reg_sets,
+		       &prologue_clobbered);
+      for (p_insn = split_prologue_seq; p_insn; p_insn = NEXT_INSN (p_insn))
+	if (NONDEBUG_INSN_P (p_insn))
+	  note_stores (PATTERN (p_insn), record_hard_reg_sets,
+		       &prologue_clobbered);
+      CLEAR_HARD_REG_BIT (prologue_clobbered, STACK_POINTER_REGNUM);
+      if (frame_pointer_needed)
+	CLEAR_HARD_REG_BIT (prologue_clobbered, HARD_FRAME_POINTER_REGNUM);
+      CLEAR_HARD_REG_SET (live_on_edge);
+      reg_set_to_hard_reg_set (&live_on_edge,
+			       df_get_live_in (entry_edge->dest));
+      if (hard_reg_set_intersect_p (live_on_edge, prologue_clobbered))
+	entry_edge = orig_entry_edge;
+
+    fail_shrinkwrap:
+      bitmap_clear (&bb_antic_flags);
+      bitmap_clear (&bb_on_list);
+      VEC_free (basic_block, heap, vec);
     }
 #endif
 
+  if (split_prologue_seq != NULL_RTX)
+    {
+      insert_insn_on_edge (split_prologue_seq, entry_edge);
+      inserted = true;
+    }
+  if (prologue_seq != NULL_RTX)
+    {
+      insert_insn_on_edge (prologue_seq, entry_edge);
+      inserted = true;
+    }
+
   /* If the exit block has no non-fake predecessors, we don't need
      an epilogue.  */
   FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR->preds)
@@ -5400,110 +5743,148 @@ thread_prologue_and_epilogue_insns (void
     goto epilogue_done;
 
   rtl_profile_for_bb (EXIT_BLOCK_PTR);
+
 #ifdef HAVE_return
-  if (optimize && HAVE_return)
+  /* If we're allowed to generate a simple return instruction, then by
+     definition we don't need a full epilogue.  If the last basic
+     block before the exit block does not contain active instructions,
+     examine its predecessors and try to emit (conditional) return
+     instructions.  */
+  if (optimize && !last_bb_active
+      && (HAVE_return || entry_edge != orig_entry_edge))
     {
-      /* If we're allowed to generate a simple return instruction,
-	 then by definition we don't need a full epilogue.  Examine
-	 the block that falls through to EXIT.   If it does not
-	 contain any code, examine its predecessors and try to
-	 emit (conditional) return instructions.  */
-
-      basic_block last;
+      edge_iterator ei2;
+      int i;
+      basic_block bb;
       rtx label;
+      VEC(basic_block,heap) *src_bbs;
 
-      e = find_fallthru_edge (EXIT_BLOCK_PTR->preds);
-      if (e == NULL)
+      if (exit_fallthru_edge == NULL)
 	goto epilogue_done;
-      last = e->src;
+      label = BB_HEAD (last_bb);
 
-      /* Verify that there are no active instructions in the last block.  */
-      label = BB_END (last);
-      while (label && !LABEL_P (label))
-	{
-	  if (active_insn_p (label))
-	    break;
-	  label = PREV_INSN (label);
-	}
+      src_bbs = VEC_alloc (basic_block, heap, EDGE_COUNT (last_bb->preds));
+      FOR_EACH_EDGE (e, ei2, last_bb->preds)
+	if (e->src != ENTRY_BLOCK_PTR)
+	  VEC_quick_push (basic_block, src_bbs, e->src);
 
-      if (BB_HEAD (last) == label && LABEL_P (label))
+      FOR_EACH_VEC_ELT (basic_block, src_bbs, i, bb)
 	{
-	  edge_iterator ei2;
+	  bool simple_p;
+	  rtx jump;
+	  e = find_edge (bb, last_bb);
 
-	  for (ei2 = ei_start (last->preds); (e = ei_safe_edge (ei2)); )
-	    {
-	      basic_block bb = e->src;
-	      rtx jump;
+	  jump = BB_END (bb);
 
-	      if (bb == ENTRY_BLOCK_PTR)
-		{
-		  ei_next (&ei2);
-		  continue;
-		}
+#ifdef HAVE_simple_return
+	  simple_p = (entry_edge != orig_entry_edge
+		      ? !bitmap_bit_p (&bb_flags, bb->index) : false);
+#else
+	  simple_p = false;
+#endif
 
-	      jump = BB_END (bb);
-	      if (!JUMP_P (jump) || JUMP_LABEL (jump) != label)
-		{
-		  ei_next (&ei2);
-		  continue;
-		}
+	  if (!simple_p
+	      && (!HAVE_return || !JUMP_P (jump)
+		  || JUMP_LABEL (jump) != label))
+	    continue;
 
-	      /* If we have an unconditional jump, we can replace that
-		 with a simple return instruction.  */
-	      if (simplejump_p (jump))
-		{
-		  /* The use of the return register might be present in the exit
-		     fallthru block.  Either:
-		     - removing the use is safe, and we should remove the use in
-		       the exit fallthru block, or
-		     - removing the use is not safe, and we should add it here.
-		     For now, we conservatively choose the latter.  Either of the
-		     2 helps in crossjumping.  */
-		  emit_use_return_register_into_block (bb);
+	  /* If we have an unconditional jump, we can replace that
+	     with a simple return instruction.  */
+	  if (!JUMP_P (jump))
+	    {
+	      emit_barrier_after (BB_END (bb));
+	      emit_return_into_block (simple_p, bb);
+	    }
+	  else if (simplejump_p (jump))
+	    {
+	      /* The use of the return register might be present in the exit
+	         fallthru block.  Either:
+	         - removing the use is safe, and we should remove the use in
+	           the exit fallthru block, or
+	         - removing the use is not safe, and we should add it here.
+	         For now, we conservatively choose the latter.  Either of the
+	         2 helps in crossjumping.  */
+	      emit_use_return_register_into_block (bb);
 
-		  emit_return_into_block (bb);
-		  delete_insn (jump);
-		}
+	      emit_return_into_block (simple_p, bb);
+	      delete_insn (jump);
+	    }
+	  else if (condjump_p (jump) && JUMP_LABEL (jump) != label)
+	    {
+	      basic_block new_bb;
+	      edge new_e;
 
-	      /* If we have a conditional jump, we can try to replace
-		 that with a conditional return instruction.  */
-	      else if (condjump_p (jump))
+	      gcc_assert (simple_p);
+	      new_bb = split_edge (e);
+	      emit_barrier_after (BB_END (new_bb));
+	      emit_return_into_block (simple_p, new_bb);
+#ifdef HAVE_simple_return
+	      if (simple_p)
 		{
-		  if (! redirect_jump (jump, ret_rtx, 0))
-		    {
-		      ei_next (&ei2);
-		      continue;
-		    }
-
-		  /* See comment in simple_jump_p case above.  */
-		  emit_use_return_register_into_block (bb);
-
-		  /* If this block has only one successor, it both jumps
-		     and falls through to the fallthru block, so we can't
-		     delete the edge.  */
-		  if (single_succ_p (bb))
-		    {
-		      ei_next (&ei2);
-		      continue;
-		    }
+		  if (BB_PARTITION (new_bb) == BB_HOT_PARTITION)
+		    simple_return_block_hot = new_bb;
+		  else
+		    simple_return_block_cold = new_bb;
 		}
+#endif
+	      new_e = single_succ_edge (new_bb);
+	      redirect_edge_succ (new_e, EXIT_BLOCK_PTR);
+
+	      continue;
+	    }
+	  /* If we have a conditional jump branching to the last
+	     block, we can try to replace that with a conditional
+	     return instruction.  */
+	  else if (condjump_p (jump))
+	    {
+	      rtx dest;
+	      if (simple_p)
+		dest = simple_return_rtx;
 	      else
+		dest = ret_rtx;
+	      if (! redirect_jump (jump, dest, 0))
 		{
-		  ei_next (&ei2);
+#ifdef HAVE_simple_return
+		  if (simple_p)
+		    unconverted_simple_returns = true;
+#endif
 		  continue;
 		}
 
-	      /* Fix up the CFG for the successful change we just made.  */
-	      redirect_edge_succ (e, EXIT_BLOCK_PTR);
+	      /* See comment in simple_jump_p case above.  */
+	      emit_use_return_register_into_block (bb);
+
+	      /* If this block has only one successor, it both jumps
+		 and falls through to the fallthru block, so we can't
+		 delete the edge.  */
+	      if (single_succ_p (bb))
+		continue;
+	    }
+	  else
+	    {
+#ifdef HAVE_simple_return
+	      if (simple_p)
+		unconverted_simple_returns = true;
+#endif
+	      continue;
 	    }
 
+	  /* Fix up the CFG for the successful change we just made.  */
+	  redirect_edge_succ (e, EXIT_BLOCK_PTR);
+	}
+      VEC_free (basic_block, heap, src_bbs);
+
+      if (HAVE_return)
+	{
 	  /* Emit a return insn for the exit fallthru block.  Whether
 	     this is still reachable will be determined later.  */
 
-	  emit_barrier_after (BB_END (last));
-	  emit_return_into_block (last);
-	  epilogue_end = BB_END (last);
-	  single_succ_edge (last)->flags &= ~EDGE_FALLTHRU;
+	  emit_barrier_after (BB_END (last_bb));
+	  emit_return_into_block (false, last_bb);
+	  epilogue_end = BB_END (last_bb);
+	  if (JUMP_P (epilogue_end))
+	    JUMP_LABEL (epilogue_end) = ret_rtx;
+	  single_succ_edge (last_bb)->flags &= ~EDGE_FALLTHRU;
 	  goto epilogue_done;
 	}
     }
@@ -5540,20 +5921,15 @@ thread_prologue_and_epilogue_insns (void
     }
 #endif
 
-  /* Find the edge that falls through to EXIT.  Other edges may exist
-     due to RETURN instructions, but those don't need epilogues.
-     There really shouldn't be a mixture -- either all should have
-     been converted or none, however...  */
+  /* If nothing falls through into the exit block, we don't need an
+     epilogue.  */
 
-  e = find_fallthru_edge (EXIT_BLOCK_PTR->preds);
-  if (e == NULL)
+  if (exit_fallthru_edge == NULL)
     goto epilogue_done;
 
 #ifdef HAVE_epilogue
   if (HAVE_epilogue)
     {
-      rtx returnjump;
-
       start_sequence ();
       epilogue_end = emit_note (NOTE_INSN_EPILOGUE_BEG);
       seq = gen_epilogue ();
@@ -5564,11 +5940,11 @@ thread_prologue_and_epilogue_insns (void
       record_insns (seq, NULL, &epilogue_insn_hash);
       set_insn_locators (seq, epilogue_locator);
 
-      returnjump = get_last_insn ();
       seq = get_insns ();
+      returnjump = get_last_insn ();
       end_sequence ();
 
-      insert_insn_on_edge (seq, e);
+      insert_insn_on_edge (seq, exit_fallthru_edge);
       inserted = true;
 
       if (JUMP_P (returnjump))
@@ -5581,23 +5957,21 @@ thread_prologue_and_epilogue_insns (void
 	  else
 	    JUMP_LABEL (returnjump) = ret_rtx;
 	}
-      else
-	returnjump = NULL_RTX;
     }
   else
 #endif
     {
       basic_block cur_bb;
 
-      if (! next_active_insn (BB_END (e->src)))
+      if (! next_active_insn (BB_END (exit_fallthru_edge->src)))
 	goto epilogue_done;
       /* We have a fall-through edge to the exit block, the source is not
          at the end of the function, and there will be an assembler epilogue
          at the end of the function.
          We can't use force_nonfallthru here, because that would try to
-         use return.  Inserting a jump 'by hand' is extremely messy, so
+	 use return.  Inserting a jump 'by hand' is extremely messy, so
 	 we take advantage of cfg_layout_finalize using
-	fixup_fallthru_exit_predecessor.  */
+	 fixup_fallthru_exit_predecessor.  */
       cfg_layout_initialize (0);
       FOR_EACH_BB (cur_bb)
 	if (cur_bb->index >= NUM_FIXED_BLOCKS
@@ -5607,6 +5981,7 @@ thread_prologue_and_epilogue_insns (void
     }
 
 epilogue_done:
+
   default_rtl_profile ();
 
   if (inserted)
@@ -5632,33 +6007,102 @@ epilogue_done:
 	}
     }
 
+#ifdef HAVE_simple_return
+  /* If there were branches to an empty LAST_BB which we tried to
+     convert to conditional simple_returns, but couldn't for some
+     reason, create a block to hold a simple_return insn and redirect
+     those remaining edges.  */
+  if (unconverted_simple_returns)
+    {
+      edge_iterator ei2;
+      basic_block exit_pred = EXIT_BLOCK_PTR->prev_bb;
+
+      gcc_assert (entry_edge != orig_entry_edge);
+
+      /* See if we can reuse the last insn that was emitted for the
+	 epilogue.  */
+      if (returnjump != NULL_RTX
+	  && JUMP_LABEL (returnjump) == simple_return_rtx)
+	{
+	  edge e = split_block (exit_fallthru_edge->src,
+				PREV_INSN (returnjump));
+	  if (BB_PARTITION (e->src) == BB_HOT_PARTITION)
+	    simple_return_block_hot = e->dest;
+	  else
+	    simple_return_block_cold = e->dest;
+	}
+
+    restart_scan:
+      for (ei2 = ei_start (last_bb->preds); (e = ei_safe_edge (ei2)); )
+	{
+	  basic_block bb = e->src;
+	  basic_block *pdest_bb;
+
+	  if (bb == ENTRY_BLOCK_PTR
+	      || bitmap_bit_p (&bb_flags, bb->index))
+	    {
+	      ei_next (&ei2);
+	      continue;
+	    }
+	  if (BB_PARTITION (e->src) == BB_HOT_PARTITION)
+	    pdest_bb = &simple_return_block_hot;
+	  else
+	    pdest_bb = &simple_return_block_cold;
+	  if (*pdest_bb == NULL)
+	    {
+	      basic_block bb;
+	      rtx start;
+
+	      bb = create_basic_block (NULL, NULL, exit_pred);
+	      BB_COPY_PARTITION (bb, e->src);
+	      start = emit_jump_insn_after (gen_simple_return (),
+					    BB_END (bb));
+	      JUMP_LABEL (start) = simple_return_rtx;
+	      emit_barrier_after (start);
+
+	      *pdest_bb = bb;
+	      make_edge (bb, EXIT_BLOCK_PTR, 0);
+	    }
+	  redirect_edge_and_branch_force (e, *pdest_bb);
+	  goto restart_scan;
+	}
+    }
+#endif
+
 #ifdef HAVE_sibcall_epilogue
   /* Emit sibling epilogues before any sibling call sites.  */
   for (ei = ei_start (EXIT_BLOCK_PTR->preds); (e = ei_safe_edge (ei)); )
     {
       basic_block bb = e->src;
       rtx insn = BB_END (bb);
+      rtx ep_seq;
 
       if (!CALL_P (insn)
-	  || ! SIBLING_CALL_P (insn))
+	  || ! SIBLING_CALL_P (insn)
+	  || (entry_edge != orig_entry_edge
+	      && !bitmap_bit_p (&bb_flags, bb->index)))
 	{
 	  ei_next (&ei);
 	  continue;
 	}
 
-      start_sequence ();
-      emit_note (NOTE_INSN_EPILOGUE_BEG);
-      emit_insn (gen_sibcall_epilogue ());
-      seq = get_insns ();
-      end_sequence ();
+      ep_seq = gen_sibcall_epilogue ();
+      if (ep_seq)
+	{
+	  start_sequence ();
+	  emit_note (NOTE_INSN_EPILOGUE_BEG);
+	  emit_insn (ep_seq);
+	  seq = get_insns ();
+	  end_sequence ();
 
-      /* Retain a map of the epilogue insns.  Used in life analysis to
-	 avoid getting rid of sibcall epilogue insns.  Do this before we
-	 actually emit the sequence.  */
-      record_insns (seq, NULL, &epilogue_insn_hash);
-      set_insn_locators (seq, epilogue_locator);
+	  /* Retain a map of the epilogue insns.  Used in life analysis to
+	     avoid getting rid of sibcall epilogue insns.  Do this before we
+	     actually emit the sequence.  */
+	  record_insns (seq, NULL, &epilogue_insn_hash);
+	  set_insn_locators (seq, epilogue_locator);
 
-      emit_insn_before (seq, insn);
+	  emit_insn_before (seq, insn);
+	}
       ei_next (&ei);
     }
 #endif
@@ -5683,6 +6127,8 @@ epilogue_done:
     }
 #endif
 
+  bitmap_clear (&bb_flags);
+
   /* Threading the prologue and epilogue changes the artificial refs
      in the entry and exit blocks.  */
   epilogue_completed = 1;
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 178135)
+++ gcc/common.opt	(working copy)
@@ -1768,6 +1768,11 @@ fshow-column
 Common Report Var(flag_show_column) Init(1)
 Show column numbers in diagnostics, when available.  Default on
 
+fshrink-wrap
+Common Report Var(flag_shrink_wrap) Optimization
+Emit function prologues only before parts of the function that need it,
+rather than at the top of the function.
+
 fsignaling-nans
 Common Report Var(flag_signaling_nans) Optimization SetByCombined
 Disable optimizations observable by IEEE signaling NaNs
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 178135)
+++ gcc/config/i386/i386.md	(working copy)
@@ -11686,24 +11686,29 @@ (define_insn "prologue_use"
   ""
   [(set_attr "length" "0")])
 
+(define_code_iterator returns [return simple_return])
+(define_code_attr return_str [(return "") (simple_return "simple_")])
+(define_code_attr return_cond [(return "ix86_can_use_return_insn_p ()")
+			       (simple_return "")])
+
 ;; Insn emitted into the body of a function to return from a function.
 ;; This is only done if the function's epilogue is known to be simple.
 ;; See comments for ix86_can_use_return_insn_p in i386.c.
 
-(define_expand "return"
-  [(return)]
-  "ix86_can_use_return_insn_p ()"
+(define_expand "<return_str>return"
+  [(returns)]
+  "<return_cond>"
 {
   if (crtl->args.pops_args)
     {
       rtx popc = GEN_INT (crtl->args.pops_args);
-      emit_jump_insn (gen_return_pop_internal (popc));
+      emit_jump_insn (gen_<return_str>return_pop_internal (popc));
       DONE;
     }
 })
 
-(define_insn "return_internal"
-  [(return)]
+(define_insn "<return_str>return_internal"
+  [(returns)]
   "reload_completed"
   "ret"
   [(set_attr "length" "1")
@@ -11714,8 +11719,8 @@ (define_insn "return_internal"
 ;; Used by x86_machine_dependent_reorg to avoid penalty on single byte RET
 ;; instruction Athlon and K8 have.
 
-(define_insn "return_internal_long"
-  [(return)
+(define_insn "<return_str>return_internal_long"
+  [(returns)
    (unspec [(const_int 0)] UNSPEC_REP)]
   "reload_completed"
   "rep\;ret"
@@ -11725,8 +11730,8 @@ (define_insn "return_internal_long"
    (set_attr "prefix_rep" "1")
    (set_attr "modrm" "0")])
 
-(define_insn "return_pop_internal"
-  [(return)
+(define_insn "<return_str>return_pop_internal"
+  [(returns)
    (use (match_operand:SI 0 "const_int_operand" ""))]
   "reload_completed"
   "ret\t%0"
@@ -11735,8 +11740,8 @@ (define_insn "return_pop_internal"
    (set_attr "length_immediate" "2")
    (set_attr "modrm" "0")])
 
-(define_insn "return_indirect_internal"
-  [(return)
+(define_insn "<return_str>return_indirect_internal"
+  [(returns)
    (use (match_operand:SI 0 "register_operand" "r"))]
   "reload_completed"
   "jmp\t%A0"
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 178135)
+++ gcc/config/i386/i386.c	(working copy)
@@ -10754,13 +10754,13 @@ ix86_expand_epilogue (int style)
 
 	  pro_epilogue_adjust_stack (stack_pointer_rtx, stack_pointer_rtx,
 				     popc, -1, true);
-	  emit_jump_insn (gen_return_indirect_internal (ecx));
+	  emit_jump_insn (gen_simple_return_indirect_internal (ecx));
 	}
       else
-	emit_jump_insn (gen_return_pop_internal (popc));
+	emit_jump_insn (gen_simple_return_pop_internal (popc));
     }
   else
-    emit_jump_insn (gen_return_internal ());
+    emit_jump_insn (gen_simple_return_internal ());
 
   /* Restore the state back to the state from the prologue,
      so that it's correct for the next epilogue.  */
@@ -30615,7 +30615,10 @@ ix86_pad_returns (void)
 	}
       if (replace)
 	{
-	  emit_jump_insn_before (gen_return_internal_long (), ret);
+	  if (PATTERN (ret) == ret_rtx)
+	    emit_jump_insn_before (gen_return_internal_long (), ret);
+	  else
+	    emit_jump_insn_before (gen_simple_return_internal_long (), ret);
 	  delete_insn (ret);
 	}
     }


* Re: Initial shrink-wrapping patch
  2011-08-31 20:37 Initial shrink-wrapping patch Bernd Schmidt
@ 2011-09-01 13:57 ` Richard Sandiford
  2011-09-13  2:57   ` Bernd Schmidt
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2011-09-01 13:57 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

Bernd Schmidt <bernds@codesourcery.com> writes:
> +/* A for_each_rtx subroutine of record_hard_reg_sets.  */
> +static int
> +record_hard_reg_uses_1 (rtx *px, void *data)
> +{
> +  rtx x = *px;
> +  HARD_REG_SET *pused = (HARD_REG_SET *)data;
> +
> +  if (REG_P (x) && REGNO (x) < FIRST_PSEUDO_REGISTER)
> +    {
> +      int nregs = hard_regno_nregs[REGNO (x)][GET_MODE (x)];
> +      while (nregs-- > 0)
> +	SET_HARD_REG_BIT (*pused, REGNO (x) + nregs);
> +    }

add_to_hard_reg_set (pused, GET_MODE (x), REGNO (x));

> +/* A subroutine of requires_stack_frame_p, called via for_each_rtx.
> +   If any change is made, set CHANGED
> +   to true.  */

Strange line break, and comment doesn't match code (no "changed" variable).
Probably moot though because:

> +/* Return true if INSN requires the stack frame to be set up.
> +   PROLOGUE_USED contains the hard registers used in the function
> +   prologue.  */
> +static bool
> +requires_stack_frame_p (rtx insn, HARD_REG_SET prologue_used)
> +{
>[...]
> +  if (for_each_rtx (&PATTERN (insn), frame_required_for_rtx, NULL))
> +    return true;
> +  CLEAR_HARD_REG_SET (hardregs);
> +  note_stores (PATTERN (insn), record_hard_reg_sets, &hardregs);
> +  if (hard_reg_set_intersect_p (hardregs, prologue_used))
> +    return true;
> +  AND_COMPL_HARD_REG_SET (hardregs, call_used_reg_set);
> +  for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> +    if (TEST_HARD_REG_BIT (hardregs, regno)
> +	&& df_regs_ever_live_p (regno))
> +      return true;

...I suppose this ought to use DF instead.

Does something guarantee that non-local gotos are marked as
needing a frame?

Why do we need to check specifically for pic_offset_table_rtx?  Isn't it
handled by the generic "live registers set by the prologue" test?
Reason for asking is that pic_offset_table_rtx can be a global value,
such as for mips*-elf.

> -/* Insert gen_return at the end of block BB.  This also means updating
> -   block_for_insn appropriately.  */
> +
> +static rtx
> +gen_return_pattern (bool simple_p)
> +{
> +#ifdef HAVE_simple_return
> +  return simple_p ? gen_simple_return () : gen_return ();
> +#else
> +  gcc_assert (!simple_p);
> +  return gen_return ();
> +#endif
> +}

Missing function comment.

> +
> +/* Insert an appropriate return pattern at the end of block BB.  This
> +   also means updating block_for_insn appropriately.  */
>  
>  static void
> -emit_return_into_block (basic_block bb)
> +emit_return_into_block (bool simple_p, basic_block bb)

Pedantic, but the comment should reference SIMPLE_P.

> +#ifdef HAVE_simple_return
> +  /* Try to perform a kind of shrink-wrapping, making sure the
> +     prologue/epilogue is emitted only around those parts of the
> +     function that require it.  */
> +
> +  if (flag_shrink_wrap && HAVE_simple_return && !flag_non_call_exceptions
> +      && HAVE_prologue && !crtl->calls_eh_return)
> +    {

Maybe it should be obvious, but could you add a comment explaining why
flag_non_call_exceptions and crtl->calls_eh_return need to be checked
explicitly?  I can see why we don't want to optimise functions that
call eh_return, but I was curious why it wasn't handled by the general
insn-level analysis.

Would checking prologue_seq != NULL be better than HAVE_prologue?

> +      HARD_REG_SET prologue_clobbered, prologue_used, live_on_edge;
> +      rtx p_insn;
> +
> +      VEC(basic_block, heap) *vec;
> +      basic_block bb;
> +      bitmap_head bb_antic_flags;
> +      bitmap_head bb_on_list;
> +
> +      /* Compute the registers set and used in the prologue.  */
> +      CLEAR_HARD_REG_SET (prologue_clobbered);
> +      CLEAR_HARD_REG_SET (prologue_used);
> +      for (p_insn = prologue_seq; p_insn; p_insn = NEXT_INSN (p_insn))
> +	{
> +	  HARD_REG_SET this_used;
> +	  if (!NONDEBUG_INSN_P (p_insn))
> +	    continue;
> +
> +	  CLEAR_HARD_REG_SET (this_used);
> +	  note_uses (&PATTERN (p_insn), record_hard_reg_uses,
> +		     &this_used);
> +	  AND_COMPL_HARD_REG_SET (this_used, prologue_clobbered);
> +	  IOR_HARD_REG_SET (prologue_used, this_used);
> +	  note_stores (PATTERN (p_insn), record_hard_reg_sets,
> +		       &prologue_clobbered);
> +	}

Should this iterate over split_prologue_seq as well?

Could we combine prologue_seq and split_prologue_seq into a single sequence?

> +      bitmap_initialize (&bb_antic_flags, &bitmap_default_obstack);
> +      bitmap_initialize (&bb_on_list, &bitmap_default_obstack);
> +
> +      vec = VEC_alloc (basic_block, heap, n_basic_blocks);
> +
> +      FOR_EACH_BB (bb)
> +	{
> +	  rtx insn;
> +	  FOR_BB_INSNS (bb, insn)
> +	    {
> +	      if (requires_stack_frame_p (insn, prologue_used))
> +		{
> +		  bitmap_set_bit (&bb_flags, bb->index);
> +		  VEC_quick_push (basic_block, vec, bb);
> +		  break;
> +		}
> +	    }

Excess { and } in for loop.

> +	}
> +
> +      /* For every basic block that needs a prologue, mark all blocks
> +	 reachable from it, so as to ensure they are also seen as
> +	 requiring a prologue.  */
> +      while (!VEC_empty (basic_block, vec))
> +	{
> +	  basic_block tmp_bb = VEC_pop (basic_block, vec);
> +	  edge e;
> +	  edge_iterator ei;
> +	  FOR_EACH_EDGE (e, ei, tmp_bb->succs)
> +	    {
> +	      if (e->dest == EXIT_BLOCK_PTR
> +		  || bitmap_bit_p (&bb_flags, e->dest->index))
> +		continue;
> +	      bitmap_set_bit (&bb_flags, e->dest->index);
> +	      VEC_quick_push (basic_block, vec, e->dest);
> +	    }

Don't know which is the preferred style, but:

    if (e->dest != EXIT_BLOCK_PTR
        && bitmap_set_bit (&bb_flags, e->dest->index))
      VEC_quick_push (basic_block, vec, e->dest);

should be a little more efficient.
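
bitmap_set_bit returns whether the bit was previously clear, so the
membership test and the mark collapse into a single probe, and a block
can never be pushed onto the worklist twice.  A stand-alone sketch of
the idiom, with a plain bool array standing in for the bitmap:

    #include <stdbool.h>

    #define N_BLOCKS 64

    static bool flags[N_BLOCKS];
    static int worklist[N_BLOCKS];
    static int n_worklist;

    /* Set the flag for BB and report whether it was previously
       clear, mirroring the bitmap_set_bit return value.  */
    static bool
    mark_bb (int bb)
    {
      bool changed = !flags[bb];
      flags[bb] = true;
      return changed;
    }

    static void
    push_bb (int bb)
    {
      /* One probe does both the test and the insertion, so BB is
         pushed at most once no matter how many edges reach it.  */
      if (mark_bb (bb))
        worklist[n_worklist++] = bb;
    }

The same applies to the two loops further down.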

> +      /* Now walk backwards from every block that is marked as needing
> +	 a prologue to compute the bb_antic_flags bitmap.  */
> +      bitmap_copy (&bb_antic_flags, &bb_flags);
> +      FOR_EACH_BB (bb)
> +	{
> +	  edge e;
> +	  edge_iterator ei;
> +	  if (!bitmap_bit_p (&bb_flags, bb->index))
> +	    continue;
> +	  FOR_EACH_EDGE (e, ei, bb->preds)
> +	    if (!bitmap_bit_p (&bb_antic_flags, e->src->index))
> +	      {
> +		VEC_quick_push (basic_block, vec, e->src);
> +		bitmap_set_bit (&bb_on_list, e->src->index);
> +	      }

Here too I think we want:

	  FOR_EACH_EDGE (e, ei, bb->preds)
	    if (!bitmap_bit_p (&bb_antic_flags, e->src->index)
		&& bitmap_set_bit (&bb_on_list, e->src->index))
	      VEC_quick_push (basic_block, vec, e->src);

in this case to avoid pushing the same thing twice.

> +	  bitmap_clear_bit (&bb_on_list, tmp_bb->index);
> +	  FOR_EACH_EDGE (e, ei, tmp_bb->succs)
> +	    {
> +	      if (!bitmap_bit_p (&bb_antic_flags, e->dest->index))
> +		{
> +		  all_set = false;
> +		  break;
> +		}
> +	    }

Excess { and } in for loop.

> +	  if (all_set)
> +	    {
> +	      bitmap_set_bit (&bb_antic_flags, tmp_bb->index);
> +	      FOR_EACH_EDGE (e, ei, tmp_bb->preds)
> +		if (!bitmap_bit_p (&bb_antic_flags, e->src->index))
> +		  {
> +		    VEC_quick_push (basic_block, vec, e->src);
> +		    bitmap_set_bit (&bb_on_list, e->src->index);
> +		  }
> +	    }

Same:

	  FOR_EACH_EDGE (e, ei, tmp_bb->preds)
	    if (!bitmap_bit_p (&bb_antic_flags, e->src->index)
		&& bitmap_set_bit (&bb_on_list, e->src->index))
	      VEC_quick_push (basic_block, vec, e->src);

comment as above.

> +      /* Test whether the prologue is known to clobber any register
> +	 (other than FP or SP) which are live on the edge.  */
> +      CLEAR_HARD_REG_SET (prologue_clobbered);
> +      for (p_insn = prologue_seq; p_insn; p_insn = NEXT_INSN (p_insn))
> +	if (NONDEBUG_INSN_P (p_insn))
> +	  note_stores (PATTERN (p_insn), record_hard_reg_sets,
> +		       &prologue_clobbered);
> +      for (p_insn = split_prologue_seq; p_insn; p_insn = NEXT_INSN (p_insn))
> +	if (NONDEBUG_INSN_P (p_insn))
> +	  note_stores (PATTERN (p_insn), record_hard_reg_sets,
> +		       &prologue_clobbered);
> +      CLEAR_HARD_REG_BIT (prologue_clobbered, STACK_POINTER_REGNUM);
> +      if (frame_pointer_needed)
> +	CLEAR_HARD_REG_BIT (prologue_clobbered, HARD_FRAME_POINTER_REGNUM);

Seems like we should be able to reuse the insn walk from the beginning
of the enclosing if statement.

> +#ifdef HAVE_simple_return
> +	  simple_p = (entry_edge != orig_entry_edge
> +		      ? !bitmap_bit_p (&bb_flags, bb->index) : false);
> +#else
> +	  simple_p = false;
> +#endif

Why is this bb_flags rather than bb_antic_flags?

	  simple_p = (entry_edge != orig_entry_edge
		      && !bitmap_bit_p (&bb_flags, bb->index));

seems more readable, but I suppose that's personal taste.

> +	      gcc_assert (simple_p);
> +	      new_bb = split_edge (e);
> +	      emit_barrier_after (BB_END (new_bb));
> +	      emit_return_into_block (simple_p, new_bb);
> +#ifdef HAVE_simple_return
> +	      if (simple_p)

Check is redundant given the gcc_assert.


> +	      /* If this block has only one successor, it both jumps
> +		 and falls through to the fallthru block, so we can't
> +		 delete the edge.  */
> +	      if (single_succ_p (bb))
> +		continue;

Seems odd that this could happen in optimised code, but if it did,
wouldn't it invalidate the simple_p transformation?  Seems like
the non-fallthrough edge would use a simple return but the
fallthrough one wouldn't.

> -      start_sequence ();
> -      emit_note (NOTE_INSN_EPILOGUE_BEG);
> -      emit_insn (gen_sibcall_epilogue ());
> -      seq = get_insns ();
> -      end_sequence ();
> +      ep_seq = gen_sibcall_epilogue ();
> +      if (ep_seq)
> +	{

Why the new null check?

Richard


* Re: Initial shrink-wrapping patch
  2011-09-01 13:57 ` Richard Sandiford
@ 2011-09-13  2:57   ` Bernd Schmidt
  2011-09-13 11:03     ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Bernd Schmidt @ 2011-09-13  2:57 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 6637 bytes --]

On 09/01/11 15:57, Richard Sandiford wrote:
> add_to_hard_reg_set (pused, GET_MODE (x), REGNO (x));

Done.

> Strange line break, and comment doesn't match code (no "changed" variable).

Fixed.

>> +/* Return true if INSN requires the stack frame to be set up.
>> +   PROLOGUE_USED contains the hard registers used in the function
>> +   prologue.  */
>> +static bool
>> +requires_stack_frame_p (rtx insn, HARD_REG_SET prologue_used)
>> +{
>> [...]
>> +  if (for_each_rtx (&PATTERN (insn), frame_required_for_rtx, NULL))
>> +    return true;
>> +  CLEAR_HARD_REG_SET (hardregs);
>> +  note_stores (PATTERN (insn), record_hard_reg_sets, &hardregs);
>> +  if (hard_reg_set_intersect_p (hardregs, prologue_used))
>> +    return true;
>> +  AND_COMPL_HARD_REG_SET (hardregs, call_used_reg_set);
>> +  for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
>> +    if (TEST_HARD_REG_BIT (hardregs, regno)
>> +	&& df_regs_ever_live_p (regno))
>> +      return true;
> 
> ...I suppose this ought to use DF instead.

Here, I kind of disagree. Most of the new code works on unemitted
sequences, and df doesn't work for that, so we rely on note_stores and
such quite a lot. I've changed the one note_stores call in this function
to use df, but IMO it's not an improvement. As for
frame_required_for_rtx, I'd much rather test for equality with
hard_frame_pointer_rtx than muck about with register numbers and
frame_pointer_needed tests.

Also, it turns out I need to pretend that trap_if requires the prologue
due to the testcase you also ran across, so a for_each_rtx walk is
required anyway.

> Does something guarantee that non-local gotos are marked as
> needing a frame?

There are (use fp) (use sp) and such insns before nonlocal gotos. There
aren't any testsuite failures related to nonlocal gotos, so I assume
that's sufficient.

> Why do we need to check specifically for pic_offset_table_rtx?  Isn't it
> handled by the generic "live registers set by the prologue" test?

There is no such test, really - other than this code. There is a "live
registers clobbered by the prologue" test which disables the
optimization at a late stage. However, the prologue may set some
registers simply as scratch regs, and things would be more complicated
if we tried to identify which ones are scratch and which ones are really
set up for the rest of the function.

> Reason for asking is that pic_offset_table_rtx can be a global value,
> such as for mips*-elf.

Can we shelve this as a potential improvement for later?

>> +static rtx
>> +gen_return_pattern (bool simple_p)
> 
> Missing function comment.

Fixed.

> Pedantic, but the comment should reference SIMPLE_P.

Fixed.

> 
>> +#ifdef HAVE_simple_return
>> +  /* Try to perform a kind of shrink-wrapping, making sure the
>> +     prologue/epilogue is emitted only around those parts of the
>> +     function that require it.  */
>> +
>> +  if (flag_shrink_wrap && HAVE_simple_return && !flag_non_call_exceptions
>> +      && HAVE_prologue && !crtl->calls_eh_return)
>> +    {
> 
> Maybe it should be obvious, but could you add a comment explaining why
> flag_non_call_exceptions and crtl->calls_eh_return need to be checked
> explicitly?

Primarily because I think it's a waste of time to worry about them. I've
removed the flag_non_call_exceptions check, but eh_return is handled
elsewhere in this function, and there's really no reason to try to think
of the possible interactions. It's only used in one function in libgcc...

> Would checking prologue_seq != NULL be better than HAVE_prologue?

I was going to say no, but a look at the i386 expand_prologue suggests
we don't really want to try to duplicate the logic for HAVE_prologue.
"prologue_seq != NULL" isn't sufficient due to extra stuff being emitted
into the sequence, but I've added some code to check if the prologue has
nontrivial insns.

> Should this iterate over split_prologue_seq as well?

and

> Seems like we should be able to reuse the insn walk from the beginning
> of the enclosing if statement.

That seems to be a problem with merging multiple versions of this code.
Rearranged.

> Could we combine prologue_seq and split_prologue_seq into a single sequence?

Not in this patch, please.

> Excess { and } in for loop.

Fixed.

>     if (e->dest != EXIT_BLOCK_PTR
>         && bitmap_set_bit (&bb_flags, e->dest->index))
>       VEC_quick_push (basic_block, vec, e->dest);
> 
> should be a little more efficient.

Changed (three similar changes which you pointed out).

>> +#ifdef HAVE_simple_return
>> +	  simple_p = (entry_edge != orig_entry_edge
>> +		      ? !bitmap_bit_p (&bb_flags, bb->index) : false);
>> +#else
>> +	  simple_p = false;
>> +#endif
> 
> Why is this bb_flags rather than bb_antic_flags?

Doesn't really matter. (bb_antic_flags & ~bb_flags) only contains blocks
which don't need a prologue, but for which all subsequent paths lead to
something that requires a prologue. That's never going to be a block
that needs an epilogue or a return inserted.
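
To make the distinction concrete, here is a toy model of the two
bitmaps -- plain C with bit masks standing in for bitmap_head, and an
invented five-block CFG, so this is an illustration rather than GCC
code.  Only block 3 requires the frame; block 1 is a frame-free early
return; block 2 is frame-free but all of its paths lead to block 3:

    /* CFG: 0 -> 1, 0 -> 2, 2 -> 3; blocks 1 and 3 go to the exit
       block 4.  Only block 3 requires the stack frame.  */
    #include <stdio.h>

    #define EXIT_BB 4

    static const int succs[5][2] = {
      {1, 2},         /* 0: entry, branches to 1 and 2 */
      {EXIT_BB, -1},  /* 1: frame-free early return */
      {3, -1},        /* 2: frame-free, but every path hits 3 */
      {EXIT_BB, -1},  /* 3: clobbers a call-saved register */
      {-1, -1}        /* 4: exit */
    };

    static void
    print_set (const char *name, int set)
    {
      int bb;
      printf ("%s = {", name);
      for (bb = 0; bb < EXIT_BB; bb++)
        if (set & (1 << bb))
          printf (" %d", bb);
      printf (" }\n");
    }

    int
    main (void)
    {
      /* Forward step: block 3 requires the frame, plus everything
         reachable from it (only the exit here, so nothing new).  */
      int bb_flags = 1 << 3;
      /* Backward fixpoint: a block joins ANTIC when all of its
         successors are in ANTIC; the exit block never is.  */
      int antic = bb_flags;
      int changed;

      do
        {
          int bb, i;
          changed = 0;
          for (bb = 0; bb < EXIT_BB; bb++)
            {
              int all_antic = 1;
              if (antic & (1 << bb))
                continue;
              for (i = 0; i < 2; i++)
                if (succs[bb][i] >= 0
                    && !(antic & (1 << succs[bb][i])))
                  all_antic = 0;
              if (all_antic)
                {
                  antic |= 1 << bb;
                  changed = 1;
                }
            }
        }
      while (changed);

      print_set ("bb_flags      ", bb_flags);  /* { 3 }   */
      print_set ("bb_antic_flags", antic);     /* { 2 3 } */
      return 0;
    }

Block 2 is exactly the (bb_antic_flags & ~bb_flags) case: it needs no
prologue itself, but every path from it reaches block 3, which does.
The prologue can therefore go on the single edge 0->2, and block 2 is
never a block where a return would be inserted, as said above.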

> 	  simple_p = (entry_edge != orig_entry_edge
> 		      && !bitmap_bit_p (&bb_flags, bb->index));
> 
> seems more readable, but I suppose that's personal taste.

Fixed.

>> +#ifdef HAVE_simple_return
>> +	      if (simple_p)
> 
> Check is redundant given the gcc_assert.

Removed the if.

>> +	      /* If this block has only one successor, it both jumps
>> +		 and falls through to the fallthru block, so we can't
>> +		 delete the edge.  */
>> +	      if (single_succ_p (bb))
>> +		continue;
> 
> Seems odd that this could happen in optimised code, but if it did,
> wouldn't it invalidate the simple_p transformation?  Seems like
> the non-fallthrough edge would use a simple return but the
> fallthrough one wouldn't.

Whuh. For what it's worth, this code is discussed in this thread:
http://gcc.gnu.org/ml/gcc-patches/2000-02/msg00175.html

I've added a test for this so that blocks ending in such a block have
their bb_flags bit set.

>> -      start_sequence ();
>> -      emit_note (NOTE_INSN_EPILOGUE_BEG);
>> -      emit_insn (gen_sibcall_epilogue ());
>> -      seq = get_insns ();
>> -      end_sequence ();
>> +      ep_seq = gen_sibcall_epilogue ();
>> +      if (ep_seq)
>> +	{
> 
> Why the new null check?

This may have been more important with the old version of the dwarf2
code, but it seems cleaner this way - do we really want to insert
NOTE_INSN_EPILOGUE_BEG when there's no epilogue?

New version, bootstrapped and tested on i686-linux. Slightly earlier
versions tested with mips-elf (plus the various prologue/epilogue
patches for mips). Also added debug dump printout and a testcase (based
on the old 6/6 patch).


Bernd

[-- Attachment #2: sw0912.diff --]
[-- Type: text/plain, Size: 38920 bytes --]

	* doc/tm.texi.in (RETURN_ADDR_REGNUM): Document.
	* doc/tm.texi: Regenerate.
	* doc/invoke.texi (-fshrink-wrap): Document.
	* opts.c (default_options_table): Add it.
	* common.opt (fshrink-wrap): Add.
	* function.c (emit_return_into_block): Remove useless declaration.
	(record_hard_reg_uses_1, record_hard_reg_uses, frame_required_for_rtx,
	requires_stack_frame_p, gen_return_pattern): New static functions.
	(emit_return_into_block): New arg simple_p.  All callers changed.
	Use gen_return_pattern.
	(thread_prologue_and_epilogue_insns): Implement shrink-wrapping.
	* config/i386/i386.md (returns): New code_iterator.
	(return_str, return_cond): New code_attrs.
	(<return_str>return, <return_str>return_internal,
	<return_str>return_internal_long, <return_str>return_pop_internal,
	<return_str>return_indirect_internal): New patterns, replacing...
	(return, return_internal, return_internal_long, return_pop_internal,
	return_indirect_internal): ... these.
	* config/i386/i386.c (ix86_expand_epilogue): Adjust to expand
	simple returns.
	(ix86_pad_returns): Likewise.

	* gcc.target/i386/sw-1.c: New test.

Index: doc/tm.texi
===================================================================
--- doc/tm.texi	(revision 178734)
+++ doc/tm.texi	(working copy)
@@ -3208,6 +3208,12 @@ Define this if the return address of a p
 from the frame pointer of the previous stack frame.
 @end defmac
 
+@defmac RETURN_ADDR_REGNUM
+If defined, a C expression whose value is the register number of the return
+address for the current function.  Targets that pass the return address on
+the stack should not define this macro.
+@end defmac
+
 @defmac INCOMING_RETURN_ADDR_RTX
 A C expression whose value is RTL representing the location of the
 incoming return address at the beginning of any function, before the
Index: doc/tm.texi.in
===================================================================
--- doc/tm.texi.in	(revision 178734)
+++ doc/tm.texi.in	(working copy)
@@ -3194,6 +3194,12 @@ Define this if the return address of a p
 from the frame pointer of the previous stack frame.
 @end defmac
 
+@defmac RETURN_ADDR_REGNUM
+If defined, a C expression whose value is the register number of the return
+address for the current function.  Targets that pass the return address on
+the stack should not define this macro.
+@end defmac
+
 @defmac INCOMING_RETURN_ADDR_RTX
 A C expression whose value is RTL representing the location of the
 incoming return address at the beginning of any function, before the
Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 178734)
+++ doc/invoke.texi	(working copy)
@@ -396,10 +396,10 @@ Objective-C and Objective-C++ Dialects}.
 -fschedule-insns -fschedule-insns2 -fsection-anchors @gol
 -fselective-scheduling -fselective-scheduling2 @gol
 -fsel-sched-pipelining -fsel-sched-pipelining-outer-loops @gol
--fsignaling-nans -fsingle-precision-constant -fsplit-ivs-in-unroller @gol
--fsplit-wide-types -fstack-protector -fstack-protector-all @gol
--fstrict-aliasing -fstrict-overflow -fthread-jumps -ftracer @gol
--ftree-bit-ccp @gol
+-fshrink-wrap -fsignaling-nans -fsingle-precision-constant @gol
+-fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector @gol
+-fstack-protector-all -fstrict-aliasing -fstrict-overflow @gol
+-fthread-jumps -ftracer -ftree-bit-ccp @gol
 -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-copy-prop @gol
 -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse @gol
 -ftree-forwprop -ftree-fre -ftree-loop-if-convert @gol
@@ -6878,6 +6878,12 @@ This option has no effect until one of @
 When pipelining loops during selective scheduling, also pipeline outer loops.
 This option has no effect until @option{-fsel-sched-pipelining} is turned on.
 
+@item -fshrink-wrap
+@opindex fshrink-wrap
+Emit function prologues only before parts of the function that need it,
+rather than at the top of the function.  This flag is enabled by default at
+@option{-O} and higher.
+
 @item -fcaller-saves
 @opindex fcaller-saves
 Enable values to be allocated in registers that will be clobbered by
Index: testsuite/gcc.target/i386/sw-1.c
===================================================================
--- testsuite/gcc.target/i386/sw-1.c	(revision 0)
+++ testsuite/gcc.target/i386/sw-1.c	(revision 0)
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
+
+#include <string.h>
+
+int c;
+int x[2000];
+__attribute__((regparm(1))) int foo (int a, int b)
+ {
+   int t[200];
+   if (a == 0)
+     return 1;
+   if (c == 0)
+     return 2;
+   memcpy (t, x + b, sizeof t);
+   return t[a];
+ }
+
+/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" } } */
+/* { dg-final { cleanup-rtl-dump "pro_and_epilogue" } } */
Index: opts.c
===================================================================
--- opts.c	(revision 178734)
+++ opts.c	(working copy)
@@ -434,6 +434,7 @@ static const struct default_options defa
     { OPT_LEVELS_1_PLUS, OPT_fipa_reference, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_fipa_profile, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_fmerge_constants, NULL, 1 },
+    { OPT_LEVELS_1_PLUS, OPT_fshrink_wrap, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_fsplit_wide_types, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_ccp, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_bit_ccp, NULL, 1 },
Index: function.c
===================================================================
--- function.c	(revision 178734)
+++ function.c	(working copy)
@@ -147,9 +147,6 @@ extern tree debug_find_var_in_block_tree
    can always export `prologue_epilogue_contains'.  */
 static void record_insns (rtx, rtx, htab_t *) ATTRIBUTE_UNUSED;
 static bool contains (const_rtx, htab_t);
-#ifdef HAVE_return
-static void emit_return_into_block (basic_block);
-#endif
 static void prepare_function_start (void);
 static void do_clobber_return_reg (rtx, void *);
 static void do_use_return_reg (rtx, void *);
@@ -5285,6 +5282,82 @@ prologue_epilogue_contains (const_rtx in
   return 0;
 }
 
+#ifdef HAVE_simple_return
+
+/* A for_each_rtx subroutine of record_hard_reg_uses.  */
+static int
+record_hard_reg_uses_1 (rtx *px, void *data)
+{
+  rtx x = *px;
+  HARD_REG_SET *pused = (HARD_REG_SET *)data;
+
+  if (REG_P (x) && REGNO (x) < FIRST_PSEUDO_REGISTER)
+    add_to_hard_reg_set (pused, GET_MODE (x), REGNO (x));
+  return 0;
+}
+
+/* Like record_hard_reg_sets, but called through note_uses.  */
+static void
+record_hard_reg_uses (rtx *px, void *data)
+{
+  for_each_rtx (px, record_hard_reg_uses_1, data);
+}
+
+/* A subroutine of requires_stack_frame_p, called via for_each_rtx.
+   Return 1 if we found an rtx that forces a prologue, zero otherwise.  */
+static int
+frame_required_for_rtx (rtx *loc, void *data ATTRIBUTE_UNUSED)
+{
+  rtx x = *loc;
+  if (x == stack_pointer_rtx || x == hard_frame_pointer_rtx
+      || x == arg_pointer_rtx || x == pic_offset_table_rtx
+#ifdef RETURN_ADDR_REGNUM
+      || (REG_P (x) && REGNO (x) == RETURN_ADDR_REGNUM)
+#endif
+      || GET_CODE (x) == TRAP_IF)
+    return 1;
+  return 0;
+}
+
+/* Return true if INSN requires the stack frame to be set up.
+   PROLOGUE_USED contains the hard registers used in the function
+   prologue.  */
+static bool
+requires_stack_frame_p (rtx insn, HARD_REG_SET prologue_used)
+{
+  df_ref *def_rec;
+  HARD_REG_SET hardregs;
+  unsigned regno;
+
+  if (!INSN_P (insn) || DEBUG_INSN_P (insn))
+    return false;
+  if (CALL_P (insn))
+    return !SIBLING_CALL_P (insn);
+  if (for_each_rtx (&PATTERN (insn), frame_required_for_rtx, NULL))
+    return true;
+
+  CLEAR_HARD_REG_SET (hardregs);
+  for (def_rec = DF_INSN_DEFS (insn); *def_rec; def_rec++)
+    {
+      rtx dreg = DF_REF_REG (*def_rec);
+
+      if (!REG_P (dreg))
+	continue;
+
+      add_to_hard_reg_set (&hardregs, GET_MODE (dreg),
+			   REGNO (dreg));
+    }
+  if (hard_reg_set_intersect_p (hardregs, prologue_used))
+    return true;
+  AND_COMPL_HARD_REG_SET (hardregs, call_used_reg_set);
+  for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
+    if (TEST_HARD_REG_BIT (hardregs, regno)
+	&& df_regs_ever_live_p (regno))
+      return true;
+  return false;
+}
+#endif
+
 #ifdef HAVE_return
 /* Insert use of return register before the end of BB.  */
 
@@ -5299,45 +5372,145 @@ emit_use_return_register_into_block (bas
   emit_insn_before (seq, BB_END (bb));
 }
 
-/* Insert gen_return at the end of block BB.  This also means updating
-   block_for_insn appropriately.  */
+
+/* Create a return pattern, either simple_return or return, depending on
+   simple_p.  */
+
+static rtx
+gen_return_pattern (bool simple_p)
+{
+#ifdef HAVE_simple_return
+  return simple_p ? gen_simple_return () : gen_return ();
+#else
+  gcc_assert (!simple_p);
+  return gen_return ();
+#endif
+}
+
+/* Insert an appropriate return pattern at the end of block BB.  This
+   also means updating block_for_insn appropriately.  SIMPLE_P is
+   the same as in gen_return_pattern and passed to it.  */
 
 static void
-emit_return_into_block (basic_block bb)
+emit_return_into_block (bool simple_p, basic_block bb)
 {
-  rtx jump = emit_jump_insn_after (gen_return (), BB_END (bb));
-  rtx pat = PATTERN (jump);
+  rtx jump, pat;
+  jump = emit_jump_insn_after (gen_return_pattern (simple_p), BB_END (bb));
+  pat = PATTERN (jump);
   if (GET_CODE (pat) == PARALLEL)
     pat = XVECEXP (pat, 0, 0);
   gcc_assert (ANY_RETURN_P (pat));
   JUMP_LABEL (jump) = pat;
 }
-#endif /* HAVE_return */
+#endif
 
 /* Generate the prologue and epilogue RTL if the machine supports it.  Thread
    this into place with notes indicating where the prologue ends and where
-   the epilogue begins.  Update the basic block information when possible.  */
+   the epilogue begins.  Update the basic block information when possible.
+
+   Notes on epilogue placement:
+   There are several kinds of edges to the exit block:
+   * a single fallthru edge from LAST_BB
+   * possibly, edges from blocks containing sibcalls
+   * possibly, fake edges from infinite loops
+
+   The epilogue is always emitted on the fallthru edge from the last basic
+   block in the function, LAST_BB, into the exit block.
+
+   If LAST_BB is empty except for a label, it is the target of every
+   other basic block in the function that ends in a return.  If a
+   target has a return or simple_return pattern (possibly with
+   conditional variants), these basic blocks can be changed so that a
+   return insn is emitted into them, and their target is adjusted to
+   the real exit block.
+
+   Notes on shrink wrapping: We implement a fairly conservative
+   version of shrink-wrapping rather than the textbook one.  We only
+   generate a single prologue and a single epilogue.  This is
+   sufficient to catch a number of interesting cases involving early
+   exits.
+
+   First, we identify the blocks that require the prologue to occur before
+   them.  These are the ones that modify a call-saved register, or reference
+   any of the stack or frame pointer registers.  To simplify things, we then
+   mark everything reachable from these blocks as also requiring a prologue.
+   This takes care of loops automatically, and avoids the need to examine
+   whether MEMs reference the frame, since it is sufficient to check for
+   occurrences of the stack or frame pointer.
+
+   We then compute the set of blocks for which the need for a prologue
+   is anticipatable (borrowing terminology from the shrink-wrapping
+   description in Muchnick's book).  These are the blocks which either
+   require a prologue themselves, or those that have only successors
+   where the prologue is anticipatable.  The prologue needs to be
+   inserted on all edges from BB1->BB2 where BB2 is in ANTIC and BB1
+   is not.  For the moment, we ensure that only one such edge exists.
+
+   The epilogue is placed as described above, but we make a
+   distinction between inserting return and simple_return patterns
+   when modifying other blocks that end in a return.  Blocks that end
+   in a sibcall omit the sibcall_epilogue if the block is not in
+   ANTIC.  */
 
 static void
 thread_prologue_and_epilogue_insns (void)
 {
   bool inserted;
+  basic_block last_bb;
+  bool last_bb_active;
+#ifdef HAVE_simple_return
+  bool unconverted_simple_returns = false;
+  basic_block simple_return_block_hot = NULL;
+  basic_block simple_return_block_cold = NULL;
+  bool nonempty_prologue;
+#endif
+  rtx returnjump ATTRIBUTE_UNUSED;
   rtx seq ATTRIBUTE_UNUSED, epilogue_end ATTRIBUTE_UNUSED;
-  edge entry_edge, e;
+  rtx prologue_seq ATTRIBUTE_UNUSED, split_prologue_seq ATTRIBUTE_UNUSED;
+  edge e, entry_edge, orig_entry_edge, exit_fallthru_edge;
   edge_iterator ei;
+  bitmap_head bb_flags;
+
+  df_analyze ();
 
   rtl_profile_for_bb (ENTRY_BLOCK_PTR);
 
   inserted = false;
   seq = NULL_RTX;
   epilogue_end = NULL_RTX;
+  returnjump = NULL_RTX;
 
   /* Can't deal with multiple successors of the entry block at the
      moment.  Function should always have at least one entry
      point.  */
   gcc_assert (single_succ_p (ENTRY_BLOCK_PTR));
   entry_edge = single_succ_edge (ENTRY_BLOCK_PTR);
+  orig_entry_edge = entry_edge;
+
+  exit_fallthru_edge = find_fallthru_edge (EXIT_BLOCK_PTR->preds);
+  if (exit_fallthru_edge != NULL)
+    {
+      rtx label;
 
+      last_bb = exit_fallthru_edge->src;
+      /* Test whether there are active instructions in the last block.  */
+      label = BB_END (last_bb);
+      while (label && !LABEL_P (label))
+	{
+	  if (active_insn_p (label))
+	    break;
+	  label = PREV_INSN (label);
+	}
+
+      last_bb_active = BB_HEAD (last_bb) != label || !LABEL_P (label);
+    }
+  else
+    {
+      last_bb = NULL;
+      last_bb_active = false;
+    }
+
+  split_prologue_seq = NULL_RTX;
   if (flag_split_stack
       && (lookup_attribute ("no_split_stack", DECL_ATTRIBUTES (cfun->decl))
 	  == NULL))
@@ -5349,17 +5522,15 @@ thread_prologue_and_epilogue_insns (void
 
       start_sequence ();
       emit_insn (gen_split_stack_prologue ());
-      seq = get_insns ();
+      split_prologue_seq = get_insns ();
       end_sequence ();
 
-      record_insns (seq, NULL, &prologue_insn_hash);
-      set_insn_locators (seq, prologue_locator);
-
-      insert_insn_on_edge (seq, entry_edge);
-      inserted = true;
+      record_insns (split_prologue_seq, NULL, &prologue_insn_hash);
+      set_insn_locators (split_prologue_seq, prologue_locator);
 #endif
     }
 
+  prologue_seq = NULL_RTX;
 #ifdef HAVE_prologue
   if (HAVE_prologue)
     {
@@ -5382,15 +5553,220 @@ thread_prologue_and_epilogue_insns (void
       if (!targetm.profile_before_prologue () && crtl->profile)
         emit_insn (gen_blockage ());
 
-      seq = get_insns ();
+      prologue_seq = get_insns ();
       end_sequence ();
-      set_insn_locators (seq, prologue_locator);
+      set_insn_locators (prologue_seq, prologue_locator);
+    }
+#endif
 
-      insert_insn_on_edge (seq, entry_edge);
-      inserted = true;
+  bitmap_initialize (&bb_flags, &bitmap_default_obstack);
+
+#ifdef HAVE_simple_return
+  /* Try to perform a kind of shrink-wrapping, making sure the
+     prologue/epilogue is emitted only around those parts of the
+     function that require it.  */
+
+  nonempty_prologue = false;
+  for (seq = prologue_seq; seq; seq = NEXT_INSN (seq))
+    if (!NOTE_P (seq) || NOTE_KIND (seq) != NOTE_INSN_PROLOGUE_END)
+      {
+	nonempty_prologue = true;
+	break;
+      }
+      
+  if (flag_shrink_wrap && HAVE_simple_return
+      && nonempty_prologue && !crtl->calls_eh_return)
+    {
+      HARD_REG_SET prologue_clobbered, prologue_used, live_on_edge;
+      rtx p_insn;
+
+      VEC(basic_block, heap) *vec;
+      basic_block bb;
+      bitmap_head bb_antic_flags;
+      bitmap_head bb_on_list;
+
+      if (dump_file)
+	fprintf (dump_file, "Attempting shrink-wrapping optimization.\n");
+
+      /* Compute the registers set and used in the prologue.  */
+      CLEAR_HARD_REG_SET (prologue_clobbered);
+      CLEAR_HARD_REG_SET (prologue_used);
+      for (p_insn = prologue_seq; p_insn; p_insn = NEXT_INSN (p_insn))
+	{
+	  HARD_REG_SET this_used;
+	  if (!NONDEBUG_INSN_P (p_insn))
+	    continue;
+
+	  CLEAR_HARD_REG_SET (this_used);
+	  note_uses (&PATTERN (p_insn), record_hard_reg_uses,
+		     &this_used);
+	  AND_COMPL_HARD_REG_SET (this_used, prologue_clobbered);
+	  IOR_HARD_REG_SET (prologue_used, this_used);
+	  note_stores (PATTERN (p_insn), record_hard_reg_sets,
+		       &prologue_clobbered);
+	}
+      for (p_insn = split_prologue_seq; p_insn; p_insn = NEXT_INSN (p_insn))
+	if (NONDEBUG_INSN_P (p_insn))
+	  note_stores (PATTERN (p_insn), record_hard_reg_sets,
+		       &prologue_clobbered);
+
+      bitmap_initialize (&bb_antic_flags, &bitmap_default_obstack);
+      bitmap_initialize (&bb_on_list, &bitmap_default_obstack);
+
+      /* Find the set of basic blocks that require a stack frame.  */
+
+      vec = VEC_alloc (basic_block, heap, n_basic_blocks);
+
+      FOR_EACH_BB (bb)
+	{
+	  rtx insn;
+	  /* As a special case, check for jumps to the last bb that
+	     cannot successfully be converted to simple_returns later
+	     on, and mark them as requiring a frame.  These are
+	     conditional jumps that jump to their fallthru block, so
+	     it's not a case that is expected to occur often.  */
+	  if (JUMP_P (BB_END (bb)) && condjump_p (BB_END (bb))
+	      && single_succ_p (bb)
+	      && !last_bb_active
+	      && single_succ (bb) == last_bb)
+	    {
+	      bitmap_set_bit (&bb_flags, bb->index);
+	      VEC_quick_push (basic_block, vec, bb);
+	    }
+	  else
+	    FOR_BB_INSNS (bb, insn)
+	      if (requires_stack_frame_p (insn, prologue_used))
+		{
+		  bitmap_set_bit (&bb_flags, bb->index);
+		  VEC_quick_push (basic_block, vec, bb);
+		  break;
+		}
+	}
+
+      /* For every basic block that needs a prologue, mark all blocks
+	 reachable from it, so as to ensure they are also seen as
+	 requiring a prologue.  */
+      while (!VEC_empty (basic_block, vec))
+	{
+	  basic_block tmp_bb = VEC_pop (basic_block, vec);
+	  edge e;
+	  edge_iterator ei;
+	  FOR_EACH_EDGE (e, ei, tmp_bb->succs)
+	    if (e->dest != EXIT_BLOCK_PTR
+		&& bitmap_set_bit (&bb_flags, e->dest->index))
+	      VEC_quick_push (basic_block, vec, e->dest);
+	}
+      /* If the last basic block contains only a label, we'll be able
+	 to convert jumps to it to (potentially conditional) return
+	 insns later.  This means we don't necessarily need a prologue
+	 for paths reaching it.  */
+      if (last_bb)
+	{
+	  if (!last_bb_active)
+	    bitmap_clear_bit (&bb_flags, last_bb->index);
+	  else if (!bitmap_bit_p (&bb_flags, last_bb->index))
+	    goto fail_shrinkwrap;
+	}
+
+      /* Now walk backwards from every block that is marked as needing
+	 a prologue to compute the bb_antic_flags bitmap.  */
+      bitmap_copy (&bb_antic_flags, &bb_flags);
+      FOR_EACH_BB (bb)
+	{
+	  edge e;
+	  edge_iterator ei;
+	  if (!bitmap_bit_p (&bb_flags, bb->index))
+	    continue;
+	  FOR_EACH_EDGE (e, ei, bb->preds)
+	    if (!bitmap_bit_p (&bb_antic_flags, e->src->index)
+		&& bitmap_set_bit (&bb_on_list, e->src->index))
+	      VEC_quick_push (basic_block, vec, e->src);
+	}
+      while (!VEC_empty (basic_block, vec))
+	{
+	  basic_block tmp_bb = VEC_pop (basic_block, vec);
+	  edge e;
+	  edge_iterator ei;
+	  bool all_set = true;
+
+	  bitmap_clear_bit (&bb_on_list, tmp_bb->index);
+	  FOR_EACH_EDGE (e, ei, tmp_bb->succs)
+	    if (!bitmap_bit_p (&bb_antic_flags, e->dest->index))
+	      {
+		all_set = false;
+		break;
+	      }
+
+	  if (all_set)
+	    {
+	      bitmap_set_bit (&bb_antic_flags, tmp_bb->index);
+	      FOR_EACH_EDGE (e, ei, tmp_bb->preds)
+		if (!bitmap_bit_p (&bb_antic_flags, e->src->index)
+		    && bitmap_set_bit (&bb_on_list, e->src->index))
+		  VEC_quick_push (basic_block, vec, e->src);
+	    }
+	}
+      /* Find exactly one edge that leads to a block in ANTIC from
+	 a block that isn't.  */
+      if (!bitmap_bit_p (&bb_antic_flags, entry_edge->dest->index))
+	FOR_EACH_BB (bb)
+	  {
+	    if (!bitmap_bit_p (&bb_antic_flags, bb->index))
+	      continue;
+	    FOR_EACH_EDGE (e, ei, bb->preds)
+	      if (!bitmap_bit_p (&bb_antic_flags, e->src->index))
+		{
+		  if (entry_edge != orig_entry_edge)
+		    {
+		      entry_edge = orig_entry_edge;
+		      if (dump_file)
+			fprintf (dump_file, "More than one candidate edge.\n");
+		      goto fail_shrinkwrap;
+		    }
+		  if (dump_file)
+		    fprintf (dump_file, "Found candidate edge for "
+			     "shrink-wrapping, %d->%d.\n", e->src->index,
+			     e->dest->index);
+		  entry_edge = e;
+		}
+	  }
+
+      /* Test whether the prologue is known to clobber any register
+	 (other than FP or SP) which are live on the edge.  */
+      CLEAR_HARD_REG_BIT (prologue_clobbered, STACK_POINTER_REGNUM);
+      if (frame_pointer_needed)
+	CLEAR_HARD_REG_BIT (prologue_clobbered, HARD_FRAME_POINTER_REGNUM);
+      CLEAR_HARD_REG_SET (live_on_edge);
+      reg_set_to_hard_reg_set (&live_on_edge,
+			       df_get_live_in (entry_edge->dest));
+      if (hard_reg_set_intersect_p (live_on_edge, prologue_clobbered))
+	{
+	  entry_edge = orig_entry_edge;
+	  if (dump_file)
+	    fprintf (dump_file, "Shrink-wrapping aborted due to clobber.\n");
+	}
+      else if (dump_file)
+	{
+	  fprintf (dump_file, "Performing shrink-wrapping.\n");
+	}
+    fail_shrinkwrap:
+      bitmap_clear (&bb_antic_flags);
+      bitmap_clear (&bb_on_list);
+      VEC_free (basic_block, heap, vec);
     }
 #endif
 
+  if (split_prologue_seq != NULL_RTX)
+    {
+      insert_insn_on_edge (split_prologue_seq, entry_edge);
+      inserted = true;
+    }
+  if (prologue_seq != NULL_RTX)
+    {
+      insert_insn_on_edge (prologue_seq, entry_edge);
+      inserted = true;
+    }
+
   /* If the exit block has no non-fake predecessors, we don't need
      an epilogue.  */
   FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR->preds)
@@ -5400,110 +5776,145 @@ thread_prologue_and_epilogue_insns (void
     goto epilogue_done;
 
   rtl_profile_for_bb (EXIT_BLOCK_PTR);
+
 #ifdef HAVE_return
-  if (optimize && HAVE_return)
+  /* If we're allowed to generate a simple return instruction, then by
+     definition we don't need a full epilogue.  If the last basic
+     block before the exit block does not contain active instructions,
+     examine its predecessors and try to emit (conditional) return
+     instructions.  */
+  if (optimize && !last_bb_active
+      && (HAVE_return || entry_edge != orig_entry_edge))
     {
-      /* If we're allowed to generate a simple return instruction,
-	 then by definition we don't need a full epilogue.  Examine
-	 the block that falls through to EXIT.   If it does not
-	 contain any code, examine its predecessors and try to
-	 emit (conditional) return instructions.  */
-
-      basic_block last;
+      edge_iterator ei2;
+      int i;
+      basic_block bb;
       rtx label;
+      VEC(basic_block,heap) *src_bbs;
 
-      e = find_fallthru_edge (EXIT_BLOCK_PTR->preds);
-      if (e == NULL)
+      if (exit_fallthru_edge == NULL)
 	goto epilogue_done;
-      last = e->src;
+      label = BB_HEAD (last_bb);
 
-      /* Verify that there are no active instructions in the last block.  */
-      label = BB_END (last);
-      while (label && !LABEL_P (label))
-	{
-	  if (active_insn_p (label))
-	    break;
-	  label = PREV_INSN (label);
-	}
+      src_bbs = VEC_alloc (basic_block, heap, EDGE_COUNT (last_bb->preds));
+      FOR_EACH_EDGE (e, ei2, last_bb->preds)
+	if (e->src != ENTRY_BLOCK_PTR)
+	  VEC_quick_push (basic_block, src_bbs, e->src);
 
-      if (BB_HEAD (last) == label && LABEL_P (label))
+      FOR_EACH_VEC_ELT (basic_block, src_bbs, i, bb)
 	{
-	  edge_iterator ei2;
-
-	  for (ei2 = ei_start (last->preds); (e = ei_safe_edge (ei2)); )
-	    {
-	      basic_block bb = e->src;
-	      rtx jump;
+	  bool simple_p;
+	  rtx jump;
+	  e = find_edge (bb, last_bb);
 
-	      if (bb == ENTRY_BLOCK_PTR)
-		{
-		  ei_next (&ei2);
-		  continue;
-		}
+	  jump = BB_END (bb);
 
-	      jump = BB_END (bb);
-	      if (!JUMP_P (jump) || JUMP_LABEL (jump) != label)
-		{
-		  ei_next (&ei2);
-		  continue;
-		}
+#ifdef HAVE_simple_return
+	  simple_p = (entry_edge != orig_entry_edge
+		      && !bitmap_bit_p (&bb_flags, bb->index));
+#else
+	  simple_p = false;
+#endif
 
-	      /* If we have an unconditional jump, we can replace that
-		 with a simple return instruction.  */
-	      if (simplejump_p (jump))
-		{
-		  /* The use of the return register might be present in the exit
-		     fallthru block.  Either:
-		     - removing the use is safe, and we should remove the use in
-		       the exit fallthru block, or
-		     - removing the use is not safe, and we should add it here.
-		     For now, we conservatively choose the latter.  Either of the
-		     2 helps in crossjumping.  */
-		  emit_use_return_register_into_block (bb);
+	  if (!simple_p
+	      && (!HAVE_return || !JUMP_P (jump)
+		  || JUMP_LABEL (jump) != label))
+	    continue;
 
-		  emit_return_into_block (bb);
-		  delete_insn (jump);
-		}
+	  /* If we have an unconditional jump, we can replace that
+	     with a simple return instruction.  */
+	  if (!JUMP_P (jump))
+	    {
+	      emit_barrier_after (BB_END (bb));
+	      emit_return_into_block (simple_p, bb);
+	    }
+	  else if (simplejump_p (jump))
+	    {
+	      /* The use of the return register might be present in the exit
+	         fallthru block.  Either:
+	         - removing the use is safe, and we should remove the use in
+	           the exit fallthru block, or
+	         - removing the use is not safe, and we should add it here.
+	         For now, we conservatively choose the latter.  Either of the
+	         2 helps in crossjumping.  */
+	      emit_use_return_register_into_block (bb);
 
-	      /* If we have a conditional jump, we can try to replace
-		 that with a conditional return instruction.  */
-	      else if (condjump_p (jump))
-		{
-		  if (! redirect_jump (jump, ret_rtx, 0))
-		    {
-		      ei_next (&ei2);
-		      continue;
-		    }
+	      emit_return_into_block (simple_p, bb);
+	      delete_insn (jump);
+	    }
+	  else if (condjump_p (jump) && JUMP_LABEL (jump) != label)
+	    {
+	      basic_block new_bb;
+	      edge new_e;
 
-		  /* See comment in simple_jump_p case above.  */
-		  emit_use_return_register_into_block (bb);
+	      gcc_assert (simple_p);
+	      new_bb = split_edge (e);
+	      emit_barrier_after (BB_END (new_bb));
+	      emit_return_into_block (simple_p, new_bb);
+#ifdef HAVE_simple_return
+	      if (BB_PARTITION (new_bb) == BB_HOT_PARTITION)
+		simple_return_block_hot = new_bb;
+	      else
+		simple_return_block_cold = new_bb;
+#endif
+	      new_e = single_succ_edge (new_bb);
+	      redirect_edge_succ (new_e, EXIT_BLOCK_PTR);
 
-		  /* If this block has only one successor, it both jumps
-		     and falls through to the fallthru block, so we can't
-		     delete the edge.  */
-		  if (single_succ_p (bb))
-		    {
-		      ei_next (&ei2);
-		      continue;
-		    }
-		}
+	      continue;
+	    }
+	  /* If we have a conditional jump branching to the last
+	     block, we can try to replace that with a conditional
+	     return instruction.  */
+	  else if (condjump_p (jump))
+	    {
+	      rtx dest;
+	      if (simple_p)
+		dest = simple_return_rtx;
 	      else
+		dest = ret_rtx;
+	      if (! redirect_jump (jump, dest, 0))
 		{
-		  ei_next (&ei2);
+#ifdef HAVE_simple_return
+		  if (simple_p)
+		    unconverted_simple_returns = true;
+#endif
 		  continue;
 		}
 
-	      /* Fix up the CFG for the successful change we just made.  */
-	      redirect_edge_succ (e, EXIT_BLOCK_PTR);
+	      /* See comment in simplejump_p case above.  */
+	      emit_use_return_register_into_block (bb);
+
+	      /* If this block has only one successor, it both jumps
+		 and falls through to the fallthru block, so we can't
+		 delete the edge.  */
+	      if (single_succ_p (bb))
+		continue;
+	    }
+	  else
+	    {
+#ifdef HAVE_simple_return
+	      if (simple_p)
+		unconverted_simple_returns = true;
+#endif
+	      continue;
 	    }
 
+	  /* Fix up the CFG for the successful change we just made.  */
+	  redirect_edge_succ (e, EXIT_BLOCK_PTR);
+	}
+      VEC_free (basic_block, heap, src_bbs);
+
+      if (HAVE_return)
+	{
 	  /* Emit a return insn for the exit fallthru block.  Whether
 	     this is still reachable will be determined later.  */
 
-	  emit_barrier_after (BB_END (last));
-	  emit_return_into_block (last);
-	  epilogue_end = BB_END (last);
-	  single_succ_edge (last)->flags &= ~EDGE_FALLTHRU;
+	  emit_barrier_after (BB_END (last_bb));
+	  emit_return_into_block (false, last_bb);
+	  epilogue_end = BB_END (last_bb);
+	  if (JUMP_P (epilogue_end))
+	    JUMP_LABEL (epilogue_end) = ret_rtx;
+	  single_succ_edge (last_bb)->flags &= ~EDGE_FALLTHRU;
 	  goto epilogue_done;
 	}
     }
@@ -5540,20 +5951,15 @@ thread_prologue_and_epilogue_insns (void
     }
 #endif
 
-  /* Find the edge that falls through to EXIT.  Other edges may exist
-     due to RETURN instructions, but those don't need epilogues.
-     There really shouldn't be a mixture -- either all should have
-     been converted or none, however...  */
+  /* If nothing falls through into the exit block, we don't need an
+     epilogue.  */
 
-  e = find_fallthru_edge (EXIT_BLOCK_PTR->preds);
-  if (e == NULL)
+  if (exit_fallthru_edge == NULL)
     goto epilogue_done;
 
 #ifdef HAVE_epilogue
   if (HAVE_epilogue)
     {
-      rtx returnjump;
-
       start_sequence ();
       epilogue_end = emit_note (NOTE_INSN_EPILOGUE_BEG);
       seq = gen_epilogue ();
@@ -5564,11 +5970,11 @@ thread_prologue_and_epilogue_insns (void
       record_insns (seq, NULL, &epilogue_insn_hash);
       set_insn_locators (seq, epilogue_locator);
 
-      returnjump = get_last_insn ();
       seq = get_insns ();
+      returnjump = get_last_insn ();
       end_sequence ();
 
-      insert_insn_on_edge (seq, e);
+      insert_insn_on_edge (seq, exit_fallthru_edge);
       inserted = true;
 
       if (JUMP_P (returnjump))
@@ -5581,23 +5987,21 @@ thread_prologue_and_epilogue_insns (void
 	  else
 	    JUMP_LABEL (returnjump) = ret_rtx;
 	}
-      else
-	returnjump = NULL_RTX;
     }
   else
 #endif
     {
       basic_block cur_bb;
 
-      if (! next_active_insn (BB_END (e->src)))
+      if (! next_active_insn (BB_END (exit_fallthru_edge->src)))
 	goto epilogue_done;
       /* We have a fall-through edge to the exit block, the source is not
          at the end of the function, and there will be an assembler epilogue
          at the end of the function.
          We can't use force_nonfallthru here, because that would try to
-         use return.  Inserting a jump 'by hand' is extremely messy, so
+	 use return.  Inserting a jump 'by hand' is extremely messy, so
 	 we take advantage of cfg_layout_finalize using
-	fixup_fallthru_exit_predecessor.  */
+	 fixup_fallthru_exit_predecessor.  */
       cfg_layout_initialize (0);
       FOR_EACH_BB (cur_bb)
 	if (cur_bb->index >= NUM_FIXED_BLOCKS
@@ -5607,6 +6011,7 @@ thread_prologue_and_epilogue_insns (void
     }
 
 epilogue_done:
+
   default_rtl_profile ();
 
   if (inserted)
@@ -5632,33 +6037,102 @@ epilogue_done:
 	}
     }
 
+#ifdef HAVE_simple_return
+  /* If there were branches to an empty LAST_BB which we tried to
+     convert to conditional simple_returns, but couldn't for some
+     reason, create a block to hold a simple_return insn and redirect
+     those remaining edges.  */
+  if (unconverted_simple_returns)
+    {
+      edge_iterator ei2;
+      basic_block exit_pred = EXIT_BLOCK_PTR->prev_bb;
+
+      gcc_assert (entry_edge != orig_entry_edge);
+
+      /* See if we can reuse the last insn that was emitted for the
+	 epilogue.  */
+      if (returnjump != NULL_RTX
+	  && JUMP_LABEL (returnjump) == simple_return_rtx)
+	{
+	  edge e = split_block (exit_fallthru_edge->src,
+				PREV_INSN (returnjump));
+	  if (BB_PARTITION (e->src) == BB_HOT_PARTITION)
+	    simple_return_block_hot = e->dest;
+	  else
+	    simple_return_block_cold = e->dest;
+	}
+
+    restart_scan:
+      for (ei2 = ei_start (last_bb->preds); (e = ei_safe_edge (ei2)); )
+	{
+	  basic_block bb = e->src;
+	  basic_block *pdest_bb;
+
+	  if (bb == ENTRY_BLOCK_PTR
+	      || bitmap_bit_p (&bb_flags, bb->index))
+	    {
+	      ei_next (&ei2);
+	      continue;
+	    }
+	  if (BB_PARTITION (e->src) == BB_HOT_PARTITION)
+	    pdest_bb = &simple_return_block_hot;
+	  else
+	    pdest_bb = &simple_return_block_cold;
+	  if (*pdest_bb == NULL)
+	    {
+	      basic_block bb;
+	      rtx start;
+
+	      bb = create_basic_block (NULL, NULL, exit_pred);
+	      BB_COPY_PARTITION (bb, e->src);
+	      start = emit_jump_insn_after (gen_simple_return (),
+					    BB_END (bb));
+	      JUMP_LABEL (start) = simple_return_rtx;
+	      emit_barrier_after (start);
+
+	      *pdest_bb = bb;
+	      make_edge (bb, EXIT_BLOCK_PTR, 0);
+	    }
+	  redirect_edge_and_branch_force (e, *pdest_bb);
+	  goto restart_scan;
+	}
+    }
+#endif
+
 #ifdef HAVE_sibcall_epilogue
   /* Emit sibling epilogues before any sibling call sites.  */
   for (ei = ei_start (EXIT_BLOCK_PTR->preds); (e = ei_safe_edge (ei)); )
     {
       basic_block bb = e->src;
       rtx insn = BB_END (bb);
+      rtx ep_seq;
 
       if (!CALL_P (insn)
-	  || ! SIBLING_CALL_P (insn))
+	  || ! SIBLING_CALL_P (insn)
+	  || (entry_edge != orig_entry_edge
+	      && !bitmap_bit_p (&bb_flags, bb->index)))
 	{
 	  ei_next (&ei);
 	  continue;
 	}
 
-      start_sequence ();
-      emit_note (NOTE_INSN_EPILOGUE_BEG);
-      emit_insn (gen_sibcall_epilogue ());
-      seq = get_insns ();
-      end_sequence ();
+      ep_seq = gen_sibcall_epilogue ();
+      if (ep_seq)
+	{
+	  start_sequence ();
+	  emit_note (NOTE_INSN_EPILOGUE_BEG);
+	  emit_insn (ep_seq);
+	  seq = get_insns ();
+	  end_sequence ();
 
-      /* Retain a map of the epilogue insns.  Used in life analysis to
-	 avoid getting rid of sibcall epilogue insns.  Do this before we
-	 actually emit the sequence.  */
-      record_insns (seq, NULL, &epilogue_insn_hash);
-      set_insn_locators (seq, epilogue_locator);
+	  /* Retain a map of the epilogue insns.  Used in life analysis to
+	     avoid getting rid of sibcall epilogue insns.  Do this before we
+	     actually emit the sequence.  */
+	  record_insns (seq, NULL, &epilogue_insn_hash);
+	  set_insn_locators (seq, epilogue_locator);
 
-      emit_insn_before (seq, insn);
+	  emit_insn_before (seq, insn);
+	}
       ei_next (&ei);
     }
 #endif
@@ -5683,6 +6157,8 @@ epilogue_done:
     }
 #endif
 
+  bitmap_clear (&bb_flags);
+
   /* Threading the prologue and epilogue changes the artificial refs
      in the entry and exit blocks.  */
   epilogue_completed = 1;
Index: common.opt
===================================================================
--- common.opt	(revision 178734)
+++ common.opt	(working copy)
@@ -1768,6 +1768,11 @@ fshow-column
 Common Report Var(flag_show_column) Init(1)
 Show column numbers in diagnostics, when available.  Default on
 
+fshrink-wrap
+Common Report Var(flag_shrink_wrap) Optimization
+Emit function prologues only before parts of the function that need it,
+rather than at the top of the function.
+
 fsignaling-nans
 Common Report Var(flag_signaling_nans) Optimization SetByCombined
 Disable optimizations observable by IEEE signaling NaNs
Index: config/i386/i386.md
===================================================================
--- config/i386/i386.md	(revision 178734)
+++ config/i386/i386.md	(working copy)
@@ -11657,24 +11657,29 @@ (define_insn "prologue_use"
   ""
   [(set_attr "length" "0")])
 
+(define_code_iterator returns [return simple_return])
+(define_code_attr return_str [(return "") (simple_return "simple_")])
+(define_code_attr return_cond [(return "ix86_can_use_return_insn_p ()")
+			       (simple_return "")])
+
 ;; Insn emitted into the body of a function to return from a function.
 ;; This is only done if the function's epilogue is known to be simple.
 ;; See comments for ix86_can_use_return_insn_p in i386.c.
 
-(define_expand "return"
-  [(return)]
-  "ix86_can_use_return_insn_p ()"
+(define_expand "<return_str>return"
+  [(returns)]
+  "<return_cond>"
 {
   if (crtl->args.pops_args)
     {
       rtx popc = GEN_INT (crtl->args.pops_args);
-      emit_jump_insn (gen_return_pop_internal (popc));
+      emit_jump_insn (gen_<return_str>return_pop_internal (popc));
       DONE;
     }
 })
 
-(define_insn "return_internal"
-  [(return)]
+(define_insn "<return_str>return_internal"
+  [(returns)]
   "reload_completed"
   "ret"
   [(set_attr "length" "1")
@@ -11685,8 +11690,8 @@ (define_insn "return_internal"
 ;; Used by x86_machine_dependent_reorg to avoid penalty on single byte RET
 ;; instruction Athlon and K8 have.
 
-(define_insn "return_internal_long"
-  [(return)
+(define_insn "<return_str>return_internal_long"
+  [(returns)
    (unspec [(const_int 0)] UNSPEC_REP)]
   "reload_completed"
   "rep\;ret"
@@ -11696,8 +11701,8 @@ (define_insn "return_internal_long"
    (set_attr "prefix_rep" "1")
    (set_attr "modrm" "0")])
 
-(define_insn "return_pop_internal"
-  [(return)
+(define_insn "<return_str>return_pop_internal"
+  [(returns)
    (use (match_operand:SI 0 "const_int_operand" ""))]
   "reload_completed"
   "ret\t%0"
@@ -11706,8 +11711,8 @@ (define_insn "return_pop_internal"
    (set_attr "length_immediate" "2")
    (set_attr "modrm" "0")])
 
-(define_insn "return_indirect_internal"
-  [(return)
+(define_insn "<return_str>return_indirect_internal"
+  [(returns)
    (use (match_operand:SI 0 "register_operand" "r"))]
   "reload_completed"
   "jmp\t%A0"
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 178734)
+++ config/i386/i386.c	(working copy)
@@ -10779,13 +10779,13 @@ ix86_expand_epilogue (int style)
 
 	  pro_epilogue_adjust_stack (stack_pointer_rtx, stack_pointer_rtx,
 				     popc, -1, true);
-	  emit_jump_insn (gen_return_indirect_internal (ecx));
+	  emit_jump_insn (gen_simple_return_indirect_internal (ecx));
 	}
       else
-	emit_jump_insn (gen_return_pop_internal (popc));
+	emit_jump_insn (gen_simple_return_pop_internal (popc));
     }
   else
-    emit_jump_insn (gen_return_internal ());
+    emit_jump_insn (gen_simple_return_internal ());
 
   /* Restore the state back to the state from the prologue,
      so that it's correct for the next epilogue.  */
@@ -31097,7 +31097,10 @@ ix86_pad_returns (void)
 	}
       if (replace)
 	{
-	  emit_jump_insn_before (gen_return_internal_long (), ret);
+	  if (PATTERN (ret) == ret_rtx)
+	    emit_jump_insn_before (gen_return_internal_long (), ret);
+	  else
+	    emit_jump_insn_before (gen_simple_return_internal_long (), ret);
 	  delete_insn (ret);
 	}
     }

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Initial shrink-wrapping patch
  2011-09-13  2:57   ` Bernd Schmidt
@ 2011-09-13 11:03     ` Richard Sandiford
  2011-09-13 12:49       ` Bernd Schmidt
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2011-09-13 11:03 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

Bernd Schmidt <bernds_cb1@t-online.de> writes:
>>> +/* Return true if INSN requires the stack frame to be set up.
>>> +   PROLOGUE_USED contains the hard registers used in the function
>>> +   prologue.  */
>>> +static bool
>>> +requires_stack_frame_p (rtx insn, HARD_REG_SET prologue_used)
>>> +{
>>> [...]
>>> +  if (for_each_rtx (&PATTERN (insn), frame_required_for_rtx, NULL))
>>> +    return true;
>>> +  CLEAR_HARD_REG_SET (hardregs);
>>> +  note_stores (PATTERN (insn), record_hard_reg_sets, &hardregs);
>>> +  if (hard_reg_set_intersect_p (hardregs, prologue_used))
>>> +    return true;
>>> +  AND_COMPL_HARD_REG_SET (hardregs, call_used_reg_set);
>>> +  for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
>>> +    if (TEST_HARD_REG_BIT (hardregs, regno)
>>> +	&& df_regs_ever_live_p (regno))
>>> +      return true;
>> 
>> ...I suppose this ought to use DF instead.
>
> Here, I kind of disagree. Most of the new code works on unemitted
> sequences, and df doesn't work for that, so we rely on note_stores and
> such quite a lot. I've changed the one note_stores call in this function
> to use df, but IMO it's not an improvement. As for
> frame_required_for_rtx, I'd much rather test for equality with
> hard_frame_pointer_rtx than muck about with register numbers and
> frame_pointer_needed tests.

Seems like checking for HARD_FRAME_POINTER_REGNUM would be OK regardless.
Even if it isn't being used as a frame pointer, the hard frame pointer
still needs to be saved and restored, so references would still need the
protection of the prologue.  I don't know off-hand whether targets like
AVR ever need to generate QImode parts of the frame pointer (in which
case checking for register numbers seems like the right thing).

But I agree it's borderline, so I won't insist.

> Also, it turns out I need to pretend that trap_if requires the prologue
> due to the testcase you also ran across, so a for_each_rtx walk is
> required anyway.

Why does TRAP_IF inherently need a prologue though?  The CFA information
should be correct regardless.  Do we really want this behaviour for
conditional TRAP_IFs, such as you might get to trap division by zero?
Seems odd to treat explicit traps (on architectures that need them)
differently from a plain div (in cases where that traps).
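
For instance, a hypothetical sketch of the two cases being contrasted,
assuming a target like x86 where the divide instruction itself traps
on zero without any TRAP_IF appearing in the RTL:

int
div_plain (int a, int b)
{
  return a / b;          /* may trap at run time; no TRAP_IF */
}

int
div_checked (int a, int b)
{
  if (b == 0)
    __builtin_trap ();   /* explicit trap; if-conversion can turn
                            this into a conditional TRAP_IF */
  return a / b;
}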

Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Initial shrink-wrapping patch
  2011-09-13 11:03     ` Richard Sandiford
@ 2011-09-13 12:49       ` Bernd Schmidt
  2011-09-13 13:06         ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Bernd Schmidt @ 2011-09-13 12:49 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford

On 09/13/11 10:35, Richard Sandiford wrote:

>> Also, it turns out I need to pretend that trap_if requires the prologue
>> due to the testcase you also ran across, so a for_each_rtx walk is
>> required anyway.
> 
> Why does TRAP_IF inherently need a prologue though?  The CFA information
> should be correct regardless.

It is until you cross-jump the two trap_ifs. So we have to disable
either one of the two optimizations for them.
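
A hypothetical sketch of the hazard (just the shape, not a reduced
testcase): one trap is reachable on the prologue-less early-exit path,
the other only after the prologue has run.  If cross-jumping merges
the two identical trap insns into one block, a single CFA state has to
describe both incoming paths, and the unwind info ends up wrong for
one of them.

extern int fill (int *);

int
f (int a)
{
  int buf[100];          /* this path needs a stack frame */
  if (a < 0)
    __builtin_trap ();   /* reachable without the prologue */
  if (fill (buf) < 0)
    __builtin_trap ();   /* reachable only after the prologue */
  return buf[a];
}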


Bernd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Initial shrink-wrapping patch
  2011-09-13 12:49       ` Bernd Schmidt
@ 2011-09-13 13:06         ` Richard Sandiford
  2011-09-13 13:24           ` Bernd Schmidt
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2011-09-13 13:06 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

Bernd Schmidt <bernds_cb1@t-online.de> writes:
> On 09/13/11 10:35, Richard Sandiford wrote:
>>> Also, it turns out I need to pretend that trap_if requires the prologue
>>> due to the testcase you also ran across, so a for_each_rtx walk is
>>> required anyway.
>> 
>> Why does TRAP_IF inherently need a prologue though?  The CFA information
>> should be correct regardless.
>
> It is until you cross-jump the two trap_ifs. So we have to disable
> either one of the two optimizations for them.

But trap_if seems to be tackling it at the wrong level.  AIUI, the
problem is that we usually rely on the different return JUMP_LABELs
to stop unwrapped blocks from being merged with wrapped blocks,
and that that falls down if there is no return.  And although it's
likely always true in practice, it seems wrong in principle to assume
that nothing after prologue/epilogue generation will discover new
points of no return.

So yeah, maybe trying to disable cross-jumping in this sort of situation
(and hopefully only in this sort of situation) is the right way to go.
It would be good if we could represent it in the rtl somehow, so that
the current analysis code will think that the two blocks aren't equal,
without us having to remember to add a special check everywhere that
this sort of thing could occur.  Not sure how feasible that is though...

Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Initial shrink-wrapping patch
  2011-09-13 13:06         ` Richard Sandiford
@ 2011-09-13 13:24           ` Bernd Schmidt
  2011-09-13 13:42             ` Richard Sandiford
  0 siblings, 1 reply; 41+ messages in thread
From: Bernd Schmidt @ 2011-09-13 13:24 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford

On 09/13/11 13:37, Richard Sandiford wrote:
> Bernd Schmidt <bernds_cb1@t-online.de> writes:
>> On 09/13/11 10:35, Richard Sandiford wrote:
>>>> Also, it turns out I need to pretend that trap_if requires the prologue
>>>> due to the testcase you also ran across, so a for_each_rtx walk is
>>>> required anyway.
>>>
>>> Why does TRAP_IF inherently need a prologue though?  The CFA information
>>> should be correct regardless.
>>
>> It is until you cross-jump the two trap_ifs. So we have to disable
>> either one of the two optimizations for them.
> 
> But trap_if seems to be tackling it at the wrong level.  AIUI, the
> problem is that we usually rely on the different return JUMP_LABELs
> to stop unwrapped blocks from being merged with wrapped blocks,
> and that that falls down if there is no return.  And although it's
> likely always true in practice, it seems wrong in principle to assume
> that nothing after prologue/epilogue generation will discover new
> points of no return.

Well, I know of nothing that could lead to such a case. Maybe we should
worry about it when we get there? At the moment, only if_trap is known
to cause us problems and it's easy enough to just deal with it by
turning off either shrink-wrapping or crossjumping. I don't much care
which...


Bernd

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Initial shrink-wrapping patch
  2011-09-13 13:24           ` Bernd Schmidt
@ 2011-09-13 13:42             ` Richard Sandiford
  2011-09-13 17:22               ` Bernd Schmidt
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2011-09-13 13:42 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

Bernd Schmidt <bernds@codesourcery.com> writes:
> On 09/13/11 13:37, Richard Sandiford wrote:
>> Bernd Schmidt <bernds_cb1@t-online.de> writes:
>>> On 09/13/11 10:35, Richard Sandiford wrote:
>>>>> Also, it turns out I need to pretend that trap_if requires the prologue
>>>>> due to the testcase you also ran across, so a for_each_rtx walk is
>>>>> required anyway.
>>>>
>>>> Why does TRAP_IF inherently need a prologue though?  The CFA information
>>>> should be correct regardless.
>>>
>>> It is until you cross-jump the two trap_ifs. So we have to disable
>>> either one of the two optimizations for them.
>> 
>> But trap_if seems to be tackling it at the wrong level.  AIUI, the
>> problem is that we usually rely on the different return JUMP_LABELs
>> to stop unwrapped blocks from being merged with wrapped blocks,
>> and that that falls down if there is no return.  And although it's
>> likely always true in practice, it seems wrong in principle to assume
>> that nothing after prologue/epilogue generation will discover new
>> points of no return.
>
> Well, I know of nothing that could lead to such a case. Maybe we should
> worry about it when we get there? At the moment, only trap_if is known
> to cause us problems and it's easy enough to just deal with it by
> turning off either shrink-wrapping or crossjumping. I don't much care
> which...

It just feels like checking for trap_if or turning off cross-jumping
is working around problems in the representation of shrink-wrapped
functions.  There should be something in the IL to say that those
two blocks cannot be merged for CFI reasons.  Maybe two flags on
the basic block to say whether they start (resp. end) with the
"wrapped" version of the CFI?  (Which unfortunately would need
to be checked explicitly.)
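
A minimal sketch of what I mean, with made-up names:

/* Hypothetical flag bits: set on blocks that start (resp. end) with
   the CFI state of the shrink-wrapped path; cross-jumping would
   refuse to merge blocks whose bits disagree.  */
#define BB_CFI_WRAPPED_AT_START  (1 << 0)
#define BB_CFI_WRAPPED_AT_END    (1 << 1)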

OTOH, if another reviewer thinks that's unreasonable, I'll happily
defer to them.

Richard

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Initial shrink-wrapping patch
  2011-09-13 13:42             ` Richard Sandiford
@ 2011-09-13 17:22               ` Bernd Schmidt
  2011-09-15  1:48                 ` Richard Henderson
  0 siblings, 1 reply; 41+ messages in thread
From: Bernd Schmidt @ 2011-09-13 17:22 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford; +Cc: Richard Henderson

[-- Attachment #1: Type: text/plain, Size: 1154 bytes --]

On 09/13/11 15:05, Richard Sandiford wrote:
> It just feels like checking for trap_if or turning off cross-jumping
> is working around problems in the representation of shrink-wrapped
> functions.  There should be something in the IL to say that those
> two blocks cannot be merged for CFI reasons. 

There is - JUMP_LABELs and such, and the simple_return vs return
distinction. This works for essentially all the interesting cases. The
problem here is that we don't have a jump as the last insn. So how about
the solution in crossjumping as below?

> Maybe two flags on
> the basic block to say whether they start (resp. end) with the
> "wrapped" version of the CFI?  (Which unfortunately would need
> to be checked explicitly.)

I think that's overdesigning it, and it breaks as soon as something
discards the bb info (reorg...) or puts a label in the middle of a
prologue or epilogue.  Keeping that up-to-date would be much more
fragile than just manually dealing with the few cases where we can't
tell what's going on.

> OTOH, if another reviewer thinks that's unreasonable, I'll happily
> defer to them.

Cc'ing rth for a second opinion...


Bernd

[-- Attachment #2: ccsw.diff --]
[-- Type: text/plain, Size: 1057 bytes --]

	* cfgcleanup.c (outgoing_edges_match): Nonjump edges to the
	EXIT_BLOCK_PTR match only if we did not perform shrink-wrapping.

Index: gcc/cfgcleanup.c
===================================================================
--- gcc/cfgcleanup.c	(revision 178734)
+++ gcc/cfgcleanup.c	(working copy)
@@ -1488,6 +1488,16 @@ outgoing_edges_match (int mode, basic_bl
   edge e1, e2;
   edge_iterator ei;
 
+  /* If we performed shrink-wrapping, edges to the EXIT_BLOCK_PTR can
+     only be distinguished for JUMP_INSNs.  The two paths may differ in
+     whether they went through the prologue.  Sibcalls are fine, we know
+     that we either didn't need or inserted an epilogue before them.  */
+  if (flag_shrink_wrap
+      && single_succ_p (bb1) && single_succ (bb1) == EXIT_BLOCK_PTR
+      && !JUMP_P (BB_END (bb1))
+      && !(CALL_P (BB_END (bb1)) && SIBLING_CALL_P (BB_END (bb1))))
+    return false;
+  
   /* If BB1 has only one successor, we may be looking at either an
      unconditional jump, or a fake edge to exit.  */
   if (single_succ_p (bb1)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Initial shrink-wrapping patch
  2011-09-13 17:22               ` Bernd Schmidt
@ 2011-09-15  1:48                 ` Richard Henderson
  2011-09-27 23:07                   ` Bernd Schmidt
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Henderson @ 2011-09-15  1:48 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches, richard.sandiford

On 09/13/2011 08:36 AM, Bernd Schmidt wrote:
> On 09/13/11 15:05, Richard Sandiford wrote:
>> It just feels like checking for trap_if or turning off cross-jumping
>> is working around problems in the representation of shrink-wrapped
>> functions.  There should be something in the IL to say that those
>> two blocks cannot be merged for CFI reasons. 
> 
> There is - JUMP_LABELs and such, and the simple_return vs return
> distinction. This works for essentially all the interesting cases. The
> problem here is that we don't have a jump as the last insn. So how about
> the solution in crossjumping as below?
> 
>> Maybe two flags on
>> the basic block to say whether they start (resp. end) with the
>> "wrapped" version of the CFI?  (Which unfortunately would need
>> to be checked explicitly.)
> 
> I think that's overdesigning it, and it breaks as soon as something
> discards the bb info (reorg...) or puts a label in the middle of a
> prologue or epilogue.  Keeping that up-to-date would be much more
> fragile than just manually dealing with the few cases where we can't
> tell what's going on.
> 
>> OTOH, if another reviewer thinks that's unreasonable, I'll happily
>> defer to them.
> 
> Cc'ing rth for a second opinion...

It feels hacky, but I don't have anything better to suggest.
I think the patch is ok.


r~

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Initial shrink-wrapping patch
  2011-09-15  1:48                 ` Richard Henderson
@ 2011-09-27 23:07                   ` Bernd Schmidt
  2011-09-30 18:19                     ` Richard Henderson
  2011-10-03  8:30                     ` Richard Sandiford
  0 siblings, 2 replies; 41+ messages in thread
From: Bernd Schmidt @ 2011-09-27 23:07 UTC (permalink / raw)
  To: Richard Henderson; +Cc: GCC Patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 1800 bytes --]

On 09/15/11 00:51, Richard Henderson wrote:
> On 09/13/2011 08:36 AM, Bernd Schmidt wrote:
>> On 09/13/11 15:05, Richard Sandiford wrote:
>>> It just feels like checking for trap_if or turning off cross-jumping
>>> is working around problems in the representation of shrink-wrapped
>>> functions.  There should be something in the IL to say that those
>>> two blocks cannot be merged for CFI reasons. 
>>
>> There is - JUMP_LABELs and such, and the simple_return vs return
>> distinction. This works for essentially all the interesting cases. The
>> problem here is that we don't have a jump as the last insn. So how about
>> the solution in crossjumping as below?
>>
>>> Maybe two flags on
>>> the basic block to say whether they start (resp. end) with the
>>> "wrapped" version of the CFI?  (Which unfortunately would need
>>> to be checked explicitly.)
>>
>> I think that's overdesigning it, and it breaks as soon as something
>> discards the bb info (reorg...) or puts a label in the middle of a
>> prologue or epilogue.  Keeping that up-to-date would be much more
>> fragile than just manually dealing with the few cases where we can't
>> tell what's going on.
>>
>>> OTOH, if another reviewer thinks that's unreasonable, I'll happily
>>> defer to them.
>>
>> Cc'ing rth for a second opinion...
> 
> It feels hacky, but I don't have anything better to suggest.
> I think the patch is ok.

Here's a new version of the entire shrink-wrapping patch with the
trap_if test replaced by the outgoing_edges_match change, the condjump_p
bug fixed, and the dump output and testcase adjusted a bit. Bootstrapped
and tested on i686-linux and mips-elf. Ok? (I guess I could also leave
out the RETURN_ADDR_REGNUM thing, since it was needed for ARM only and I
can't quite remember why at the moment).


Bernd


[-- Attachment #2: sw0927c.diff --]
[-- Type: text/plain, Size: 40043 bytes --]

	* doc/tm.texi.in (RETURN_ADDR_REGNUM): Document.
	* doc/tm.texi: Regenerate.
	* doc/invoke.texi (-fshrink-wrap): Document.
	* opts.c (default_options_table): Add it.
	* common.opt (fshrink-wrap): Add.
	* function.c (emit_return_into_block): Remove useless declaration.
	(record_hard_reg_uses_1, record_hard_reg_uses, frame_required_for_rtx,
	requires_stack_frame_p, gen_return_pattern): New static functions.
	(emit_return_into_block): New arg simple_p.  All callers changed.
	Use gen_return_pattern.
	(thread_prologue_and_epilogue_insns): Implement shrink-wrapping.
	* config/i386/i386.md (returns): New code_iterator.
	(return_str, return_cond): New code_attrs.
	(<return_str>return, <return_str>return_internal,
	<return_str>return_internal_long, <return_str>return_pop_internal,
	<return_str>return_indirect_internal): New patterns, replacing...
	(return, return_internal, return_internal_long, return_pop_internal,
	return_indirect_internal): ... these.
	* config/i386/i386.c (ix86_expand_epilogue): Adjust to expand
	simple returns.
	(ix86_pad_returns): Likewise.
	* cfgcleanup.c (outgoing_edges_match): If shrink-wrapping, edges that
	are not jumps or sibcalls can't be compared.

	* gcc.target/i386/sw-1.c: New test.

Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	(revision 178734)
+++ gcc/doc/tm.texi	(working copy)
@@ -3208,6 +3208,12 @@ Define this if the return address of a p
 from the frame pointer of the previous stack frame.
 @end defmac
 
+@defmac RETURN_ADDR_REGNUM
+If defined, a C expression whose value is the register number of the return
+address for the current function.  Targets that pass the return address on
+the stack should not define this macro.
+@end defmac
+
 @defmac INCOMING_RETURN_ADDR_RTX
 A C expression whose value is RTL representing the location of the
 incoming return address at the beginning of any function, before the
Index: gcc/doc/tm.texi.in
===================================================================
--- gcc/doc/tm.texi.in	(revision 178734)
+++ gcc/doc/tm.texi.in	(working copy)
@@ -3194,6 +3194,12 @@ Define this if the return address of a p
 from the frame pointer of the previous stack frame.
 @end defmac
 
+@defmac RETURN_ADDR_REGNUM
+If defined, a C expression whose value is the register number of the return
+address for the current function.  Targets that pass the return address on
+the stack should not define this macro.
+@end defmac
+
 @defmac INCOMING_RETURN_ADDR_RTX
 A C expression whose value is RTL representing the location of the
 incoming return address at the beginning of any function, before the
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 178734)
+++ gcc/doc/invoke.texi	(working copy)
@@ -396,10 +396,10 @@ Objective-C and Objective-C++ Dialects}.
 -fschedule-insns -fschedule-insns2 -fsection-anchors @gol
 -fselective-scheduling -fselective-scheduling2 @gol
 -fsel-sched-pipelining -fsel-sched-pipelining-outer-loops @gol
--fsignaling-nans -fsingle-precision-constant -fsplit-ivs-in-unroller @gol
--fsplit-wide-types -fstack-protector -fstack-protector-all @gol
--fstrict-aliasing -fstrict-overflow -fthread-jumps -ftracer @gol
--ftree-bit-ccp @gol
+-fshrink-wrap -fsignaling-nans -fsingle-precision-constant @gol
+-fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector @gol
+-fstack-protector-all -fstrict-aliasing -fstrict-overflow @gol
+-fthread-jumps -ftracer -ftree-bit-ccp @gol
 -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-copy-prop @gol
 -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse @gol
 -ftree-forwprop -ftree-fre -ftree-loop-if-convert @gol
@@ -6878,6 +6878,12 @@ This option has no effect until one of @
 When pipelining loops during selective scheduling, also pipeline outer loops.
 This option has no effect until @option{-fsel-sched-pipelining} is turned on.
 
+@item -fshrink-wrap
+@opindex fshrink-wrap
+Emit function prologues only before parts of the function that need it,
+rather than at the top of the function.  This flag is enabled by default at
+@option{-O} and higher.
+
 @item -fcaller-saves
 @opindex fcaller-saves
 Enable values to be allocated in registers that will be clobbered by
Index: gcc/testsuite/gcc.target/i386/sw-1.c
===================================================================
--- gcc/testsuite/gcc.target/i386/sw-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/sw-1.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
+
+#include <string.h>
+
+int c;
+int x[2000];
+__attribute__((regparm(1))) void foo (int a, int b)
+ {
+   int t[200];
+   if (a == 0 || c == 0)
+     return;
+   memcpy (t, x + b, sizeof t);
+   c = t[a];
+ }
+
+/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" } } */
+/* { dg-final { cleanup-rtl-dump "pro_and_epilogue" } } */
Index: gcc/opts.c
===================================================================
--- gcc/opts.c	(revision 178734)
+++ gcc/opts.c	(working copy)
@@ -434,6 +434,7 @@ static const struct default_options defa
     { OPT_LEVELS_1_PLUS, OPT_fipa_reference, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_fipa_profile, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_fmerge_constants, NULL, 1 },
+    { OPT_LEVELS_1_PLUS, OPT_fshrink_wrap, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_fsplit_wide_types, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_ccp, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_bit_ccp, NULL, 1 },
Index: gcc/function.c
===================================================================
--- gcc/function.c	(revision 178734)
+++ gcc/function.c	(working copy)
@@ -147,9 +147,6 @@ extern tree debug_find_var_in_block_tree
    can always export `prologue_epilogue_contains'.  */
 static void record_insns (rtx, rtx, htab_t *) ATTRIBUTE_UNUSED;
 static bool contains (const_rtx, htab_t);
-#ifdef HAVE_return
-static void emit_return_into_block (basic_block);
-#endif
 static void prepare_function_start (void);
 static void do_clobber_return_reg (rtx, void *);
 static void do_use_return_reg (rtx, void *);
@@ -5285,6 +5282,81 @@ prologue_epilogue_contains (const_rtx in
   return 0;
 }
 
+#ifdef HAVE_simple_return
+
+/* A for_each_rtx subroutine of record_hard_reg_uses.  */
+static int
+record_hard_reg_uses_1 (rtx *px, void *data)
+{
+  rtx x = *px;
+  HARD_REG_SET *pused = (HARD_REG_SET *)data;
+
+  if (REG_P (x) && REGNO (x) < FIRST_PSEUDO_REGISTER)
+    add_to_hard_reg_set (pused, GET_MODE (x), REGNO (x));
+  return 0;
+}
+
+/* Like record_hard_reg_sets, but called through note_uses.  */
+static void
+record_hard_reg_uses (rtx *px, void *data)
+{
+  for_each_rtx (px, record_hard_reg_uses_1, data);
+}
+
+/* A subroutine of requires_stack_frame_p, called via for_each_rtx.
+   Return 1 if we found an rtx that forces a prologue, zero otherwise.  */
+static int
+frame_required_for_rtx (rtx *loc, void *data ATTRIBUTE_UNUSED)
+{
+  rtx x = *loc;
+  if (x == stack_pointer_rtx || x == hard_frame_pointer_rtx
+#ifdef RETURN_ADDR_REGNUM
+      || (REG_P (x) && REGNO (x) == RETURN_ADDR_REGNUM)
+#endif
+      || x == arg_pointer_rtx || x == pic_offset_table_rtx)
+    return 1;
+  return 0;
+}
+
+/* Return true if INSN requires the stack frame to be set up.
+   PROLOGUE_USED contains the hard registers used in the function
+   prologue.  */
+static bool
+requires_stack_frame_p (rtx insn, HARD_REG_SET prologue_used)
+{
+  df_ref *def_rec;
+  HARD_REG_SET hardregs;
+  unsigned regno;
+
+  if (!INSN_P (insn) || DEBUG_INSN_P (insn))
+    return false;
+  if (CALL_P (insn))
+    return !SIBLING_CALL_P (insn);
+  if (for_each_rtx (&PATTERN (insn), frame_required_for_rtx, NULL))
+    return true;
+
+  CLEAR_HARD_REG_SET (hardregs);
+  for (def_rec = DF_INSN_DEFS (insn); *def_rec; def_rec++)
+    {
+      rtx dreg = DF_REF_REG (*def_rec);
+
+      if (!REG_P (dreg))
+	continue;
+
+      add_to_hard_reg_set (&hardregs, GET_MODE (dreg),
+			   REGNO (dreg));
+    }
+  if (hard_reg_set_intersect_p (hardregs, prologue_used))
+    return true;
+  AND_COMPL_HARD_REG_SET (hardregs, call_used_reg_set);
+  for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
+    if (TEST_HARD_REG_BIT (hardregs, regno)
+	&& df_regs_ever_live_p (regno))
+      return true;
+  return false;
+}
+#endif
+
 #ifdef HAVE_return
 /* Insert use of return register before the end of BB.  */
 
@@ -5299,45 +5371,145 @@ emit_use_return_register_into_block (bas
   emit_insn_before (seq, BB_END (bb));
 }
 
-/* Insert gen_return at the end of block BB.  This also means updating
-   block_for_insn appropriately.  */
+
+/* Create a return pattern, either simple_return or return, depending on
+   simple_p.  */
+
+static rtx
+gen_return_pattern (bool simple_p)
+{
+#ifdef HAVE_simple_return
+  return simple_p ? gen_simple_return () : gen_return ();
+#else
+  gcc_assert (!simple_p);
+  return gen_return ();
+#endif
+}
+
+/* Insert an appropriate return pattern at the end of block BB.  This
+   also means updating block_for_insn appropriately.  SIMPLE_P is
+   the same as in gen_return_pattern and passed to it.  */
 
 static void
-emit_return_into_block (basic_block bb)
+emit_return_into_block (bool simple_p, basic_block bb)
 {
-  rtx jump = emit_jump_insn_after (gen_return (), BB_END (bb));
-  rtx pat = PATTERN (jump);
+  rtx jump, pat;
+  jump = emit_jump_insn_after (gen_return_pattern (simple_p), BB_END (bb));
+  pat = PATTERN (jump);
   if (GET_CODE (pat) == PARALLEL)
     pat = XVECEXP (pat, 0, 0);
   gcc_assert (ANY_RETURN_P (pat));
   JUMP_LABEL (jump) = pat;
 }
-#endif /* HAVE_return */
+#endif
 
 /* Generate the prologue and epilogue RTL if the machine supports it.  Thread
    this into place with notes indicating where the prologue ends and where
-   the epilogue begins.  Update the basic block information when possible.  */
+   the epilogue begins.  Update the basic block information when possible.
+
+   Notes on epilogue placement:
+   There are several kinds of edges to the exit block:
+   * a single fallthru edge from LAST_BB
+   * possibly, edges from blocks containing sibcalls
+   * possibly, fake edges from infinite loops
+
+   The epilogue is always emitted on the fallthru edge from the last basic
+   block in the function, LAST_BB, into the exit block.
+
+   If LAST_BB is empty except for a label, it is the target of every
+   other basic block in the function that ends in a return.  If a
+   target has a return or simple_return pattern (possibly with
+   conditional variants), these basic blocks can be changed so that a
+   return insn is emitted into them, and their target is adjusted to
+   the real exit block.
+
+   Notes on shrink wrapping: We implement a fairly conservative
+   version of shrink-wrapping rather than the textbook one.  We only
+   generate a single prologue and a single epilogue.  This is
+   sufficient to catch a number of interesting cases involving early
+   exits.
+
+   First, we identify the blocks that require the prologue to occur before
+   them.  These are the ones that modify a call-saved register, or reference
+   any of the stack or frame pointer registers.  To simplify things, we then
+   mark everything reachable from these blocks as also requiring a prologue.
+   This takes care of loops automatically, and avoids the need to examine
+   whether MEMs reference the frame, since it is sufficient to check for
+   occurrences of the stack or frame pointer.
+
+   We then compute the set of blocks for which the need for a prologue
+   is anticipatable (borrowing terminology from the shrink-wrapping
+   description in Muchnick's book).  These are the blocks which either
+   require a prologue themselves, or those that have only successors
+   where the prologue is anticipatable.  The prologue needs to be
+   inserted on all edges from BB1->BB2 where BB2 is in ANTIC and BB1
+   is not.  For the moment, we ensure that only one such edge exists.
+
+   The epilogue is placed as described above, but we make a
+   distinction between inserting return and simple_return patterns
+   when modifying other blocks that end in a return.  Blocks that end
+   in a sibcall omit the sibcall_epilogue if the block is not in
+   ANTIC.  */
 
 static void
 thread_prologue_and_epilogue_insns (void)
 {
   bool inserted;
+  basic_block last_bb;
+  bool last_bb_active;
+#ifdef HAVE_simple_return
+  bool unconverted_simple_returns = false;
+  basic_block simple_return_block_hot = NULL;
+  basic_block simple_return_block_cold = NULL;
+  bool nonempty_prologue;
+#endif
+  rtx returnjump ATTRIBUTE_UNUSED;
   rtx seq ATTRIBUTE_UNUSED, epilogue_end ATTRIBUTE_UNUSED;
-  edge entry_edge, e;
+  rtx prologue_seq ATTRIBUTE_UNUSED, split_prologue_seq ATTRIBUTE_UNUSED;
+  edge e, entry_edge, orig_entry_edge, exit_fallthru_edge;
   edge_iterator ei;
+  bitmap_head bb_flags;
+
+  df_analyze ();
 
   rtl_profile_for_bb (ENTRY_BLOCK_PTR);
 
   inserted = false;
   seq = NULL_RTX;
   epilogue_end = NULL_RTX;
+  returnjump = NULL_RTX;
 
   /* Can't deal with multiple successors of the entry block at the
      moment.  Function should always have at least one entry
      point.  */
   gcc_assert (single_succ_p (ENTRY_BLOCK_PTR));
   entry_edge = single_succ_edge (ENTRY_BLOCK_PTR);
+  orig_entry_edge = entry_edge;
+
+  exit_fallthru_edge = find_fallthru_edge (EXIT_BLOCK_PTR->preds);
+  if (exit_fallthru_edge != NULL)
+    {
+      rtx label;
+
+      last_bb = exit_fallthru_edge->src;
+      /* Test whether there are active instructions in the last block.  */
+      label = BB_END (last_bb);
+      while (label && !LABEL_P (label))
+	{
+	  if (active_insn_p (label))
+	    break;
+	  label = PREV_INSN (label);
+	}
+
+      last_bb_active = BB_HEAD (last_bb) != label || !LABEL_P (label);
+    }
+  else
+    {
+      last_bb = NULL;
+      last_bb_active = false;
+    }
 
+  split_prologue_seq = NULL_RTX;
   if (flag_split_stack
       && (lookup_attribute ("no_split_stack", DECL_ATTRIBUTES (cfun->decl))
 	  == NULL))
@@ -5349,17 +5521,15 @@ thread_prologue_and_epilogue_insns (void
 
       start_sequence ();
       emit_insn (gen_split_stack_prologue ());
-      seq = get_insns ();
+      split_prologue_seq = get_insns ();
       end_sequence ();
 
-      record_insns (seq, NULL, &prologue_insn_hash);
-      set_insn_locators (seq, prologue_locator);
-
-      insert_insn_on_edge (seq, entry_edge);
-      inserted = true;
+      record_insns (split_prologue_seq, NULL, &prologue_insn_hash);
+      set_insn_locators (split_prologue_seq, prologue_locator);
 #endif
     }
 
+  prologue_seq = NULL_RTX;
 #ifdef HAVE_prologue
   if (HAVE_prologue)
     {
@@ -5382,15 +5552,219 @@ thread_prologue_and_epilogue_insns (void
       if (!targetm.profile_before_prologue () && crtl->profile)
         emit_insn (gen_blockage ());
 
-      seq = get_insns ();
+      prologue_seq = get_insns ();
       end_sequence ();
-      set_insn_locators (seq, prologue_locator);
+      set_insn_locators (prologue_seq, prologue_locator);
+    }
+#endif
 
-      insert_insn_on_edge (seq, entry_edge);
-      inserted = true;
+  bitmap_initialize (&bb_flags, &bitmap_default_obstack);
+
+#ifdef HAVE_simple_return
+  /* Try to perform a kind of shrink-wrapping, making sure the
+     prologue/epilogue is emitted only around those parts of the
+     function that require it.  */
+
+  nonempty_prologue = false;
+  for (seq = prologue_seq; seq; seq = NEXT_INSN (seq))
+    if (!NOTE_P (seq) || NOTE_KIND (seq) != NOTE_INSN_PROLOGUE_END)
+      {
+	nonempty_prologue = true;
+	break;
+      }
+      
+  if (flag_shrink_wrap && HAVE_simple_return
+      && nonempty_prologue && !crtl->calls_eh_return)
+    {
+      HARD_REG_SET prologue_clobbered, prologue_used, live_on_edge;
+      rtx p_insn;
+
+      VEC(basic_block, heap) *vec;
+      basic_block bb;
+      bitmap_head bb_antic_flags;
+      bitmap_head bb_on_list;
+
+      if (dump_file)
+	fprintf (dump_file, "Attempting shrink-wrapping optimization.\n");
+
+      /* Compute the registers set and used in the prologue.  */
+      CLEAR_HARD_REG_SET (prologue_clobbered);
+      CLEAR_HARD_REG_SET (prologue_used);
+      for (p_insn = prologue_seq; p_insn; p_insn = NEXT_INSN (p_insn))
+	{
+	  HARD_REG_SET this_used;
+	  if (!NONDEBUG_INSN_P (p_insn))
+	    continue;
+
+	  CLEAR_HARD_REG_SET (this_used);
+	  note_uses (&PATTERN (p_insn), record_hard_reg_uses,
+		     &this_used);
+	  AND_COMPL_HARD_REG_SET (this_used, prologue_clobbered);
+	  IOR_HARD_REG_SET (prologue_used, this_used);
+	  note_stores (PATTERN (p_insn), record_hard_reg_sets,
+		       &prologue_clobbered);
+	}
+      for (p_insn = split_prologue_seq; p_insn; p_insn = NEXT_INSN (p_insn))
+	if (NONDEBUG_INSN_P (p_insn))
+	  note_stores (PATTERN (p_insn), record_hard_reg_sets,
+		       &prologue_clobbered);
+
+      bitmap_initialize (&bb_antic_flags, &bitmap_default_obstack);
+      bitmap_initialize (&bb_on_list, &bitmap_default_obstack);
+
+      /* Find the set of basic blocks that require a stack frame.  */
+
+      vec = VEC_alloc (basic_block, heap, n_basic_blocks);
+
+      FOR_EACH_BB (bb)
+	{
+	  rtx insn;
+	  /* As a special case, check for jumps to the last bb that
+	     cannot successfully be converted to simple_returns later
+	     on, and mark them as requiring a frame.  These are
+	     conditional jumps that jump to their fallthru block, so
+	     it's not a case that is expected to occur often.  */
+	  if (JUMP_P (BB_END (bb)) && any_condjump_p (BB_END (bb))
+	      && single_succ_p (bb)
+	      && !last_bb_active
+	      && single_succ (bb) == last_bb)
+	    {
+	      bitmap_set_bit (&bb_flags, bb->index);
+	      VEC_quick_push (basic_block, vec, bb);
+	    }
+	  else
+	    FOR_BB_INSNS (bb, insn)
+	      if (requires_stack_frame_p (insn, prologue_used))
+		{
+		  bitmap_set_bit (&bb_flags, bb->index);
+		  VEC_quick_push (basic_block, vec, bb);
+		  break;
+		}
+	}
+
+      /* For every basic block that needs a prologue, mark all blocks
+	 reachable from it, so as to ensure they are also seen as
+	 requiring a prologue.  */
+      while (!VEC_empty (basic_block, vec))
+	{
+	  basic_block tmp_bb = VEC_pop (basic_block, vec);
+	  edge e;
+	  edge_iterator ei;
+	  FOR_EACH_EDGE (e, ei, tmp_bb->succs)
+	    if (e->dest != EXIT_BLOCK_PTR
+		&& bitmap_set_bit (&bb_flags, e->dest->index))
+	      VEC_quick_push (basic_block, vec, e->dest);
+	}
+      /* If the last basic block contains only a label, we'll be able
+	 to convert jumps to it to (potentially conditional) return
+	 insns later.  This means we don't necessarily need a prologue
+	 for paths reaching it.  */
+      if (last_bb)
+	{
+	  if (!last_bb_active)
+	    bitmap_clear_bit (&bb_flags, last_bb->index);
+	  else if (!bitmap_bit_p (&bb_flags, last_bb->index))
+	    goto fail_shrinkwrap;
+	}
+
+      /* Now walk backwards from every block that is marked as needing
+	 a prologue to compute the bb_antic_flags bitmap.  */
+      bitmap_copy (&bb_antic_flags, &bb_flags);
+      FOR_EACH_BB (bb)
+	{
+	  edge e;
+	  edge_iterator ei;
+	  if (!bitmap_bit_p (&bb_flags, bb->index))
+	    continue;
+	  FOR_EACH_EDGE (e, ei, bb->preds)
+	    if (!bitmap_bit_p (&bb_antic_flags, e->src->index)
+		&& bitmap_set_bit (&bb_on_list, e->src->index))
+	      VEC_quick_push (basic_block, vec, e->src);
+	}
+      while (!VEC_empty (basic_block, vec))
+	{
+	  basic_block tmp_bb = VEC_pop (basic_block, vec);
+	  edge e;
+	  edge_iterator ei;
+	  bool all_set = true;
+
+	  bitmap_clear_bit (&bb_on_list, tmp_bb->index);
+	  FOR_EACH_EDGE (e, ei, tmp_bb->succs)
+	    if (!bitmap_bit_p (&bb_antic_flags, e->dest->index))
+	      {
+		all_set = false;
+		break;
+	      }
+
+	  if (all_set)
+	    {
+	      bitmap_set_bit (&bb_antic_flags, tmp_bb->index);
+	      FOR_EACH_EDGE (e, ei, tmp_bb->preds)
+		if (!bitmap_bit_p (&bb_antic_flags, e->src->index)
+		    && bitmap_set_bit (&bb_on_list, e->src->index))
+		  VEC_quick_push (basic_block, vec, e->src);
+	    }
+	}
+      /* Find exactly one edge that leads to a block in ANTIC from
+	 a block that isn't.  */
+      if (!bitmap_bit_p (&bb_antic_flags, entry_edge->dest->index))
+	FOR_EACH_BB (bb)
+	  {
+	    if (!bitmap_bit_p (&bb_antic_flags, bb->index))
+	      continue;
+	    FOR_EACH_EDGE (e, ei, bb->preds)
+	      if (!bitmap_bit_p (&bb_antic_flags, e->src->index))
+		{
+		  if (entry_edge != orig_entry_edge)
+		    {
+		      entry_edge = orig_entry_edge;
+		      if (dump_file)
+			fprintf (dump_file, "More than one candidate edge.\n");
+		      goto fail_shrinkwrap;
+		    }
+		  if (dump_file)
+		    fprintf (dump_file, "Found candidate edge for "
+			     "shrink-wrapping, %d->%d.\n", e->src->index,
+			     e->dest->index);
+		  entry_edge = e;
+		}
+	  }
+
+      /* Test whether the prologue is known to clobber any register
+	 (other than FP or SP) which are live on the edge.  */
+      CLEAR_HARD_REG_BIT (prologue_clobbered, STACK_POINTER_REGNUM);
+      if (frame_pointer_needed)
+	CLEAR_HARD_REG_BIT (prologue_clobbered, HARD_FRAME_POINTER_REGNUM);
+      CLEAR_HARD_REG_SET (live_on_edge);
+      reg_set_to_hard_reg_set (&live_on_edge,
+			       df_get_live_in (entry_edge->dest));
+      if (hard_reg_set_intersect_p (live_on_edge, prologue_clobbered))
+	{
+	  entry_edge = orig_entry_edge;
+	  if (dump_file)
+	    fprintf (dump_file, "Shrink-wrapping aborted due to clobber.\n");
+	}
+      else if (dump_file && entry_edge != orig_entry_edge)
+	fprintf (dump_file, "Performing shrink-wrapping.\n");
+
+    fail_shrinkwrap:
+      bitmap_clear (&bb_antic_flags);
+      bitmap_clear (&bb_on_list);
+      VEC_free (basic_block, heap, vec);
     }
 #endif
 
+  if (split_prologue_seq != NULL_RTX)
+    {
+      insert_insn_on_edge (split_prologue_seq, entry_edge);
+      inserted = true;
+    }
+  if (prologue_seq != NULL_RTX)
+    {
+      insert_insn_on_edge (prologue_seq, entry_edge);
+      inserted = true;
+    }
+
   /* If the exit block has no non-fake predecessors, we don't need
      an epilogue.  */
   FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR->preds)
@@ -5400,110 +5774,145 @@ thread_prologue_and_epilogue_insns (void
     goto epilogue_done;
 
   rtl_profile_for_bb (EXIT_BLOCK_PTR);
+
 #ifdef HAVE_return
-  if (optimize && HAVE_return)
+  /* If we're allowed to generate a simple return instruction, then by
+     definition we don't need a full epilogue.  If the last basic
+     block before the exit block does not contain active instructions,
+     examine its predecessors and try to emit (conditional) return
+     instructions.  */
+  if (optimize && !last_bb_active
+      && (HAVE_return || entry_edge != orig_entry_edge))
     {
-      /* If we're allowed to generate a simple return instruction,
-	 then by definition we don't need a full epilogue.  Examine
-	 the block that falls through to EXIT.   If it does not
-	 contain any code, examine its predecessors and try to
-	 emit (conditional) return instructions.  */
-
-      basic_block last;
+      edge_iterator ei2;
+      int i;
+      basic_block bb;
       rtx label;
+      VEC(basic_block,heap) *src_bbs;
 
-      e = find_fallthru_edge (EXIT_BLOCK_PTR->preds);
-      if (e == NULL)
+      if (exit_fallthru_edge == NULL)
 	goto epilogue_done;
-      last = e->src;
+      label = BB_HEAD (last_bb);
 
-      /* Verify that there are no active instructions in the last block.  */
-      label = BB_END (last);
-      while (label && !LABEL_P (label))
-	{
-	  if (active_insn_p (label))
-	    break;
-	  label = PREV_INSN (label);
-	}
+      src_bbs = VEC_alloc (basic_block, heap, EDGE_COUNT (last_bb->preds));
+      FOR_EACH_EDGE (e, ei2, last_bb->preds)
+	if (e->src != ENTRY_BLOCK_PTR)
+	  VEC_quick_push (basic_block, src_bbs, e->src);
 
-      if (BB_HEAD (last) == label && LABEL_P (label))
+      FOR_EACH_VEC_ELT (basic_block, src_bbs, i, bb)
 	{
-	  edge_iterator ei2;
-
-	  for (ei2 = ei_start (last->preds); (e = ei_safe_edge (ei2)); )
-	    {
-	      basic_block bb = e->src;
-	      rtx jump;
+	  bool simple_p;
+	  rtx jump;
+	  e = find_edge (bb, last_bb);
 
-	      if (bb == ENTRY_BLOCK_PTR)
-		{
-		  ei_next (&ei2);
-		  continue;
-		}
+	  jump = BB_END (bb);
 
-	      jump = BB_END (bb);
-	      if (!JUMP_P (jump) || JUMP_LABEL (jump) != label)
-		{
-		  ei_next (&ei2);
-		  continue;
-		}
+#ifdef HAVE_simple_return
+	  simple_p = (entry_edge != orig_entry_edge
+		      && !bitmap_bit_p (&bb_flags, bb->index));
+#else
+	  simple_p = false;
+#endif
 
-	      /* If we have an unconditional jump, we can replace that
-		 with a simple return instruction.  */
-	      if (simplejump_p (jump))
-		{
-		  /* The use of the return register might be present in the exit
-		     fallthru block.  Either:
-		     - removing the use is safe, and we should remove the use in
-		       the exit fallthru block, or
-		     - removing the use is not safe, and we should add it here.
-		     For now, we conservatively choose the latter.  Either of the
-		     2 helps in crossjumping.  */
-		  emit_use_return_register_into_block (bb);
+	  if (!simple_p
+	      && (!HAVE_return || !JUMP_P (jump)
+		  || JUMP_LABEL (jump) != label))
+	    continue;
 
-		  emit_return_into_block (bb);
-		  delete_insn (jump);
-		}
+	  /* If we have an unconditional jump, we can replace that
+	     with a simple return instruction.  */
+	  if (!JUMP_P (jump))
+	    {
+	      emit_barrier_after (BB_END (bb));
+	      emit_return_into_block (simple_p, bb);
+	    }
+	  else if (simplejump_p (jump))
+	    {
+	      /* The use of the return register might be present in the exit
+	         fallthru block.  Either:
+	         - removing the use is safe, and we should remove the use in
+	           the exit fallthru block, or
+	         - removing the use is not safe, and we should add it here.
+	         For now, we conservatively choose the latter.  Either of the
+	         2 helps in crossjumping.  */
+	      emit_use_return_register_into_block (bb);
 
-	      /* If we have a conditional jump, we can try to replace
-		 that with a conditional return instruction.  */
-	      else if (condjump_p (jump))
-		{
-		  if (! redirect_jump (jump, ret_rtx, 0))
-		    {
-		      ei_next (&ei2);
-		      continue;
-		    }
+	      emit_return_into_block (simple_p, bb);
+	      delete_insn (jump);
+	    }
+	  else if (condjump_p (jump) && JUMP_LABEL (jump) != label)
+	    {
+	      basic_block new_bb;
+	      edge new_e;
 
-		  /* See comment in simple_jump_p case above.  */
-		  emit_use_return_register_into_block (bb);
+	      gcc_assert (simple_p);
+	      new_bb = split_edge (e);
+	      emit_barrier_after (BB_END (new_bb));
+	      emit_return_into_block (simple_p, new_bb);
+#ifdef HAVE_simple_return
+	      if (BB_PARTITION (new_bb) == BB_HOT_PARTITION)
+		simple_return_block_hot = new_bb;
+	      else
+		simple_return_block_cold = new_bb;
+#endif
+	      new_e = single_succ_edge (new_bb);
+	      redirect_edge_succ (new_e, EXIT_BLOCK_PTR);
 
-		  /* If this block has only one successor, it both jumps
-		     and falls through to the fallthru block, so we can't
-		     delete the edge.  */
-		  if (single_succ_p (bb))
-		    {
-		      ei_next (&ei2);
-		      continue;
-		    }
-		}
+	      continue;
+	    }
+	  /* If we have a conditional jump branching to the last
+	     block, we can try to replace that with a conditional
+	     return instruction.  */
+	  else if (condjump_p (jump))
+	    {
+	      rtx dest;
+	      if (simple_p)
+		dest = simple_return_rtx;
 	      else
+		dest = ret_rtx;
+	      if (! redirect_jump (jump, dest, 0))
 		{
-		  ei_next (&ei2);
+#ifdef HAVE_simple_return
+		  if (simple_p)
+		    unconverted_simple_returns = true;
+#endif
 		  continue;
 		}
 
-	      /* Fix up the CFG for the successful change we just made.  */
-	      redirect_edge_succ (e, EXIT_BLOCK_PTR);
+	      /* See comment in simplejump_p case above.  */
+	      emit_use_return_register_into_block (bb);
+
+	      /* If this block has only one successor, it both jumps
+		 and falls through to the fallthru block, so we can't
+		 delete the edge.  */
+	      if (single_succ_p (bb))
+		continue;
+	    }
+	  else
+	    {
+#ifdef HAVE_simple_return
+	      if (simple_p)
+		unconverted_simple_returns = true;
+#endif
+	      continue;
 	    }
 
+	  /* Fix up the CFG for the successful change we just made.  */
+	  redirect_edge_succ (e, EXIT_BLOCK_PTR);
+	}
+      VEC_free (basic_block, heap, src_bbs);
+
+      if (HAVE_return)
+	{
 	  /* Emit a return insn for the exit fallthru block.  Whether
 	     this is still reachable will be determined later.  */
 
-	  emit_barrier_after (BB_END (last));
-	  emit_return_into_block (last);
-	  epilogue_end = BB_END (last);
-	  single_succ_edge (last)->flags &= ~EDGE_FALLTHRU;
+	  emit_barrier_after (BB_END (last_bb));
+	  emit_return_into_block (false, last_bb);
+	  epilogue_end = BB_END (last_bb);
+	  if (JUMP_P (epilogue_end))
+	    JUMP_LABEL (epilogue_end) = ret_rtx;
+	  single_succ_edge (last_bb)->flags &= ~EDGE_FALLTHRU;
 	  goto epilogue_done;
 	}
     }
@@ -5540,20 +5949,15 @@ thread_prologue_and_epilogue_insns (void
     }
 #endif
 
-  /* Find the edge that falls through to EXIT.  Other edges may exist
-     due to RETURN instructions, but those don't need epilogues.
-     There really shouldn't be a mixture -- either all should have
-     been converted or none, however...  */
+  /* If nothing falls through into the exit block, we don't need an
+     epilogue.  */
 
-  e = find_fallthru_edge (EXIT_BLOCK_PTR->preds);
-  if (e == NULL)
+  if (exit_fallthru_edge == NULL)
     goto epilogue_done;
 
 #ifdef HAVE_epilogue
   if (HAVE_epilogue)
     {
-      rtx returnjump;
-
       start_sequence ();
       epilogue_end = emit_note (NOTE_INSN_EPILOGUE_BEG);
       seq = gen_epilogue ();
@@ -5564,11 +5968,11 @@ thread_prologue_and_epilogue_insns (void
       record_insns (seq, NULL, &epilogue_insn_hash);
       set_insn_locators (seq, epilogue_locator);
 
-      returnjump = get_last_insn ();
       seq = get_insns ();
+      returnjump = get_last_insn ();
       end_sequence ();
 
-      insert_insn_on_edge (seq, e);
+      insert_insn_on_edge (seq, exit_fallthru_edge);
       inserted = true;
 
       if (JUMP_P (returnjump))
@@ -5581,23 +5985,21 @@ thread_prologue_and_epilogue_insns (void
 	  else
 	    JUMP_LABEL (returnjump) = ret_rtx;
 	}
-      else
-	returnjump = NULL_RTX;
     }
   else
 #endif
     {
       basic_block cur_bb;
 
-      if (! next_active_insn (BB_END (e->src)))
+      if (! next_active_insn (BB_END (exit_fallthru_edge->src)))
 	goto epilogue_done;
       /* We have a fall-through edge to the exit block, the source is not
          at the end of the function, and there will be an assembler epilogue
          at the end of the function.
          We can't use force_nonfallthru here, because that would try to
-         use return.  Inserting a jump 'by hand' is extremely messy, so
+	 use return.  Inserting a jump 'by hand' is extremely messy, so
 	 we take advantage of cfg_layout_finalize using
-	fixup_fallthru_exit_predecessor.  */
+	 fixup_fallthru_exit_predecessor.  */
       cfg_layout_initialize (0);
       FOR_EACH_BB (cur_bb)
 	if (cur_bb->index >= NUM_FIXED_BLOCKS
@@ -5607,6 +6009,7 @@ thread_prologue_and_epilogue_insns (void
     }
 
 epilogue_done:
+
   default_rtl_profile ();
 
   if (inserted)
@@ -5632,33 +6035,102 @@ epilogue_done:
 	}
     }
 
+#ifdef HAVE_simple_return
+  /* If there were branches to an empty LAST_BB which we tried to
+     convert to conditional simple_returns, but couldn't for some
+     reason, create a block to hold a simple_return insn and redirect
+     those remaining edges.  */
+  if (unconverted_simple_returns)
+    {
+      edge_iterator ei2;
+      basic_block exit_pred = EXIT_BLOCK_PTR->prev_bb;
+
+      gcc_assert (entry_edge != orig_entry_edge);
+
+      /* See if we can reuse the last insn that was emitted for the
+	 epilogue.  */
+      if (returnjump != NULL_RTX
+	  && JUMP_LABEL (returnjump) == simple_return_rtx)
+	{
+	  edge e = split_block (exit_fallthru_edge->src,
+				PREV_INSN (returnjump));
+	  if (BB_PARTITION (e->src) == BB_HOT_PARTITION)
+	    simple_return_block_hot = e->dest;
+	  else
+	    simple_return_block_cold = e->dest;
+	}
+
+    restart_scan:
+      for (ei2 = ei_start (last_bb->preds); (e = ei_safe_edge (ei2)); )
+	{
+	  basic_block bb = e->src;
+	  basic_block *pdest_bb;
+
+	  if (bb == ENTRY_BLOCK_PTR
+	      || bitmap_bit_p (&bb_flags, bb->index))
+	    {
+	      ei_next (&ei2);
+	      continue;
+	    }
+	  if (BB_PARTITION (e->src) == BB_HOT_PARTITION)
+	    pdest_bb = &simple_return_block_hot;
+	  else
+	    pdest_bb = &simple_return_block_cold;
+	  if (*pdest_bb == NULL)
+	    {
+	      basic_block bb;
+	      rtx start;
+
+	      bb = create_basic_block (NULL, NULL, exit_pred);
+	      BB_COPY_PARTITION (bb, e->src);
+	      start = emit_jump_insn_after (gen_simple_return (),
+					    BB_END (bb));
+	      JUMP_LABEL (start) = simple_return_rtx;
+	      emit_barrier_after (start);
+
+	      *pdest_bb = bb;
+	      make_edge (bb, EXIT_BLOCK_PTR, 0);
+	    }
+	  redirect_edge_and_branch_force (e, *pdest_bb);
+	  goto restart_scan;
+	}
+    }
+#endif
+
 #ifdef HAVE_sibcall_epilogue
   /* Emit sibling epilogues before any sibling call sites.  */
   for (ei = ei_start (EXIT_BLOCK_PTR->preds); (e = ei_safe_edge (ei)); )
     {
       basic_block bb = e->src;
       rtx insn = BB_END (bb);
+      rtx ep_seq;
 
       if (!CALL_P (insn)
-	  || ! SIBLING_CALL_P (insn))
+	  || ! SIBLING_CALL_P (insn)
+	  || (entry_edge != orig_entry_edge
+	      && !bitmap_bit_p (&bb_flags, bb->index)))
 	{
 	  ei_next (&ei);
 	  continue;
 	}
 
-      start_sequence ();
-      emit_note (NOTE_INSN_EPILOGUE_BEG);
-      emit_insn (gen_sibcall_epilogue ());
-      seq = get_insns ();
-      end_sequence ();
+      ep_seq = gen_sibcall_epilogue ();
+      if (ep_seq)
+	{
+	  start_sequence ();
+	  emit_note (NOTE_INSN_EPILOGUE_BEG);
+	  emit_insn (ep_seq);
+	  seq = get_insns ();
+	  end_sequence ();
 
-      /* Retain a map of the epilogue insns.  Used in life analysis to
-	 avoid getting rid of sibcall epilogue insns.  Do this before we
-	 actually emit the sequence.  */
-      record_insns (seq, NULL, &epilogue_insn_hash);
-      set_insn_locators (seq, epilogue_locator);
+	  /* Retain a map of the epilogue insns.  Used in life analysis to
+	     avoid getting rid of sibcall epilogue insns.  Do this before we
+	     actually emit the sequence.  */
+	  record_insns (seq, NULL, &epilogue_insn_hash);
+	  set_insn_locators (seq, epilogue_locator);
 
-      emit_insn_before (seq, insn);
+	  emit_insn_before (seq, insn);
+	}
       ei_next (&ei);
     }
 #endif
@@ -5683,6 +6155,8 @@ epilogue_done:
     }
 #endif
 
+  bitmap_clear (&bb_flags);
+
   /* Threading the prologue and epilogue changes the artificial refs
      in the entry and exit blocks.  */
   epilogue_completed = 1;
Index: gcc/cfgcleanup.c
===================================================================
--- gcc/cfgcleanup.c	(revision 178734)
+++ gcc/cfgcleanup.c	(working copy)
@@ -1488,6 +1488,16 @@ outgoing_edges_match (int mode, basic_bl
   edge e1, e2;
   edge_iterator ei;
 
+  /* If we performed shrink-wrapping, edges to the EXIT_BLOCK_PTR can
+     only be distinguished for JUMP_INSNs.  The two paths may differ in
+     whether they went through the prologue.  Sibcalls are fine, we know
+     that we either didn't need or inserted an epilogue before them.  */
+  if (flag_shrink_wrap
+      && single_succ_p (bb1) && single_succ (bb1) == EXIT_BLOCK_PTR
+      && !JUMP_P (BB_END (bb1))
+      && !(CALL_P (BB_END (bb1)) && SIBLING_CALL_P (BB_END (bb1))))
+    return false;
+  
   /* If BB1 has only one successor, we may be looking at either an
      unconditional jump, or a fake edge to exit.  */
   if (single_succ_p (bb1)
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 178734)
+++ gcc/common.opt	(working copy)
@@ -1768,6 +1768,11 @@ fshow-column
 Common Report Var(flag_show_column) Init(1)
 Show column numbers in diagnostics, when available.  Default on
 
+fshrink-wrap
+Common Report Var(flag_shrink_wrap) Optimization
+Emit function prologues only before parts of the function that need it,
+rather than at the top of the function.
+
 fsignaling-nans
 Common Report Var(flag_signaling_nans) Optimization SetByCombined
 Disable optimizations observable by IEEE signaling NaNs
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 178734)
+++ gcc/config/i386/i386.md	(working copy)
@@ -11657,24 +11657,29 @@ (define_insn "prologue_use"
   ""
   [(set_attr "length" "0")])
 
+(define_code_iterator returns [return simple_return])
+(define_code_attr return_str [(return "") (simple_return "simple_")])
+(define_code_attr return_cond [(return "ix86_can_use_return_insn_p ()")
+			       (simple_return "")])
+
 ;; Insn emitted into the body of a function to return from a function.
 ;; This is only done if the function's epilogue is known to be simple.
 ;; See comments for ix86_can_use_return_insn_p in i386.c.
 
-(define_expand "return"
-  [(return)]
-  "ix86_can_use_return_insn_p ()"
+(define_expand "<return_str>return"
+  [(returns)]
+  "<return_cond>"
 {
   if (crtl->args.pops_args)
     {
       rtx popc = GEN_INT (crtl->args.pops_args);
-      emit_jump_insn (gen_return_pop_internal (popc));
+      emit_jump_insn (gen_<return_str>return_pop_internal (popc));
       DONE;
     }
 })
 
-(define_insn "return_internal"
-  [(return)]
+(define_insn "<return_str>return_internal"
+  [(returns)]
   "reload_completed"
   "ret"
   [(set_attr "length" "1")
@@ -11685,8 +11690,8 @@ (define_insn "return_internal"
 ;; Used by x86_machine_dependent_reorg to avoid penalty on single byte RET
 ;; instruction Athlon and K8 have.
 
-(define_insn "return_internal_long"
-  [(return)
+(define_insn "<return_str>return_internal_long"
+  [(returns)
    (unspec [(const_int 0)] UNSPEC_REP)]
   "reload_completed"
   "rep\;ret"
@@ -11696,8 +11701,8 @@ (define_insn "return_internal_long"
    (set_attr "prefix_rep" "1")
    (set_attr "modrm" "0")])
 
-(define_insn "return_pop_internal"
-  [(return)
+(define_insn "<return_str>return_pop_internal"
+  [(returns)
    (use (match_operand:SI 0 "const_int_operand" ""))]
   "reload_completed"
   "ret\t%0"
@@ -11706,8 +11711,8 @@ (define_insn "return_pop_internal"
    (set_attr "length_immediate" "2")
    (set_attr "modrm" "0")])
 
-(define_insn "return_indirect_internal"
-  [(return)
+(define_insn "<return_str>return_indirect_internal"
+  [(returns)
    (use (match_operand:SI 0 "register_operand" "r"))]
   "reload_completed"
   "jmp\t%A0"
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 178734)
+++ gcc/config/i386/i386.c	(working copy)
@@ -10779,13 +10779,13 @@ ix86_expand_epilogue (int style)
 
 	  pro_epilogue_adjust_stack (stack_pointer_rtx, stack_pointer_rtx,
 				     popc, -1, true);
-	  emit_jump_insn (gen_return_indirect_internal (ecx));
+	  emit_jump_insn (gen_simple_return_indirect_internal (ecx));
 	}
       else
-	emit_jump_insn (gen_return_pop_internal (popc));
+	emit_jump_insn (gen_simple_return_pop_internal (popc));
     }
   else
-    emit_jump_insn (gen_return_internal ());
+    emit_jump_insn (gen_simple_return_internal ());
 
   /* Restore the state back to the state from the prologue,
      so that it's correct for the next epilogue.  */
@@ -31097,7 +31097,10 @@ ix86_pad_returns (void)
 	}
       if (replace)
 	{
-	  emit_jump_insn_before (gen_return_internal_long (), ret);
+	  if (PATTERN (ret) == ret_rtx)
+	    emit_jump_insn_before (gen_return_internal_long (), ret);
+	  else
+	    emit_jump_insn_before (gen_simple_return_internal_long (), ret);
 	  delete_insn (ret);
 	}
     }

* Re: Initial shrink-wrapping patch
  2011-09-27 23:07                   ` Bernd Schmidt
@ 2011-09-30 18:19                     ` Richard Henderson
  2011-10-04 22:19                       ` Bernd Schmidt
  2011-10-03  8:30                     ` Richard Sandiford
  1 sibling, 1 reply; 41+ messages in thread
From: Richard Henderson @ 2011-09-30 18:19 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches, richard.sandiford

On 09/27/2011 02:02 PM, Bernd Schmidt wrote:
> Here's a new version of the entire shrink-wrapping patch with the
> trap_if test replaced by the outgoing_edges_match change, the condjump_p
> bug fixed, and the dump output and testcase adjusted a bit. Bootstrapped
> and tested on i686-linux and mips-elf. Ok? (I guess I could also leave
> out the RETURN_ADDR_REGNUM thing, since it was needed for ARM only and I
> can't quite remember why at the moment).

Please do leave out RETURN_ADDR_REGNUM for now.  If you remember why,
then you could bring it back alongside the patch for the ARM backend.

As for the i386 backend changes, not an objection per se, but I'm
trying to understand why we need so many copies of patterns.

For instance, while I understand the continued existence of the
"return" expander, why doesn't that expand to a simple_return?
After all, ix86_can_use_return_insn_p asserts that there *is*
no epilogue.  And if "return" can expand to simple_return, can't
all of the return_internal_* patterns also use only simple_return
and avoid the macro-ization?
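
To make that concrete, here is a rough sketch (using only names your
patch already provides, i.e. gen_simple_return_pop_internal and
gen_simple_return_internal from the i386.c changes) of what the C body
of a single unified "return" expander might boil down to:

  {
    if (crtl->args.pops_args)
      {
        rtx popc = GEN_INT (crtl->args.pops_args);
        /* Pop the arguments and return; no epilogue is needed.  */
        emit_jump_insn (gen_simple_return_pop_internal (popc));
        DONE;
      }
    /* Plain "ret"; again, no epilogue.  */
    emit_jump_insn (gen_simple_return_internal ());
    DONE;
  }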

I don't see anything glaringly wrong in the middle end, although
the thread_prologue_and_epilogue_insns function is now gigantic.
If there were an easy way to break that up and reduce the amount
of conditional compilation at the same time... that'd be great,
but not a requirement.



r~

* Re: Initial shrink-wrapping patch
  2011-09-27 23:07                   ` Bernd Schmidt
  2011-09-30 18:19                     ` Richard Henderson
@ 2011-10-03  8:30                     ` Richard Sandiford
  2011-10-03 11:29                       ` Basile Starynkevitch
  1 sibling, 1 reply; 41+ messages in thread
From: Richard Sandiford @ 2011-10-03  8:30 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Richard Henderson, GCC Patches

Just a suggestion, but...

Bernd Schmidt <bernds@codesourcery.com> writes:
> Index: gcc/cfgcleanup.c
> ===================================================================
> --- gcc/cfgcleanup.c	(revision 178734)
> +++ gcc/cfgcleanup.c	(working copy)
> @@ -1488,6 +1488,16 @@ outgoing_edges_match (int mode, basic_bl
>    edge e1, e2;
>    edge_iterator ei;
>  
> +  /* If we performed shrink-wrapping, edges to the EXIT_BLOCK_PTR can
> +     only be distinguished for JUMP_INSNs.  The two paths may differ in
> +     whether they went through the prologue.  Sibcalls are fine, we know
> +     that we either didn't need or inserted an epilogue before them.  */
> +  if (flag_shrink_wrap
> +      && single_succ_p (bb1) && single_succ (bb1) == EXIT_BLOCK_PTR
> +      && !JUMP_P (BB_END (bb1))
> +      && !(CALL_P (BB_END (bb1)) && SIBLING_CALL_P (BB_END (bb1))))
> +    return false;

...how about adding a bit to crtl to say whether shrink-wrapping occurred,
and check that instead of flag_shrink_wrap?
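
Something along these lines (just a sketch of the idea; the field name
shrink_wrapped is only my suggestion, not anything that exists yet):

  /* function.h, in struct rtl_data:  */
  /* Nonzero if shrink-wrapping moved the prologue away from
     the entry block.  */
  bool shrink_wrapped;

  /* cfgcleanup.c, outgoing_edges_match, instead of flag_shrink_wrap:  */
  if (crtl->shrink_wrapped
      && single_succ_p (bb1) && single_succ (bb1) == EXIT_BLOCK_PTR
      && !JUMP_P (BB_END (bb1))
      && !(CALL_P (BB_END (bb1)) && SIBLING_CALL_P (BB_END (bb1))))
    return false;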

(Leaving the full review to Richard.)

Richard

* Re: Initial shrink-wrapping patch
  2011-10-03  8:30                     ` Richard Sandiford
@ 2011-10-03 11:29                       ` Basile Starynkevitch
  2011-10-03 16:39                         ` Bernd Schmidt
  0 siblings, 1 reply; 41+ messages in thread
From: Basile Starynkevitch @ 2011-10-03 11:29 UTC (permalink / raw)
  To: Bernd Schmidt, Richard Henderson, GCC Patches, richard.sandiford


Hello,

Regarding this shrink-wrapping patch, I would suggest describing, in a
comment of one or two sentences, what shrink-wrapping means in the context
of GCC.

http://en.wikipedia.org/wiki/Shrink_wrap does not help much in understanding
that.

Cheers.
-- 
Basile STARYNKEVITCH         http://starynkevitch.net/Basile/
email: basile<at>starynkevitch<dot>net mobile: +33 6 8501 2359
8, rue de la Faiencerie, 92340 Bourg La Reine, France
*** opinions {are only mines, sont seulement les miennes} ***

* Re: Initial shrink-wrapping patch
  2011-10-03 11:29                       ` Basile Starynkevitch
@ 2011-10-03 16:39                         ` Bernd Schmidt
  0 siblings, 0 replies; 41+ messages in thread
From: Bernd Schmidt @ 2011-10-03 16:39 UTC (permalink / raw)
  To: Basile Starynkevitch; +Cc: Richard Henderson, GCC Patches, richard.sandiford

On 10/03/11 13:28, Basile Starynkevitch wrote:
> Regarding this shrink-wrapping patch, I would suggest to describe, in a
> comments of one or two sentences, what shkink-wrapping means in the context
> of GCC.

See the documentation part of the patch.
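
The testcase in the patch also shows the idea in a couple of lines.  In
a function like this (a simplified copy of gcc.target/i386/sw-1.c):

  #include <string.h>

  int c;
  int x[2000];

  void foo (int a, int b)
  {
    int t[200];
    if (a == 0 || c == 0)          /* early exit: needs no stack frame */
      return;
    memcpy (t, x + b, sizeof t);   /* this path needs the frame */
    c = t[a];
  }

the early-return path uses no call-saved registers and no stack, so with
-fshrink-wrap the prologue is emitted only on the path leading to the
memcpy, and the quick exit runs without it.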


Bernd

* Re: Initial shrink-wrapping patch
  2011-09-30 18:19                     ` Richard Henderson
@ 2011-10-04 22:19                       ` Bernd Schmidt
  2011-10-04 22:43                         ` Richard Henderson
  2011-10-06 18:47                         ` H.J. Lu
  0 siblings, 2 replies; 41+ messages in thread
From: Bernd Schmidt @ 2011-10-04 22:19 UTC (permalink / raw)
  To: Richard Henderson; +Cc: GCC Patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 1427 bytes --]

On 09/30/11 18:51, Richard Henderson wrote:

> Please do leave out RETURN_ADDR_REGNUM for now.  If you remember why,
> then you could bring it back alongside the patch for the ARM backend.

Changed.

> As for the i386 backend changes, not an objection per se, but I'm
> trying to understand why we need so many copies of patterns.

Also changed.

> I don't see anything glaringly wrong in the middle end.  Although
> the thread_prologue_and_epilogue_insns function is now gigantic.
> If there were an easy way to break that up and reduce the amount
> of conditional compilation at the same time... that'd be great,
> but not a requirement.

I don't think there's an easy way, and it's almost certain to break
stuff again, so I'd rather avoid doing it at the same time as this patch
if possible.

I can see one possible way of tackling it: have an analysis phase that
fills in a few basic_block VECs (those which need sibcall returns, those
which need plain returns, those which need simple returns) and computes
other information, such as the edges on which the prologue and epilogue
are to be inserted, and then a worker phase (probably split across
several functions) that does all the fiddling.
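
Roughly like this (none of these names exist in the patch; it's only
meant to illustrate the shape such a split could take):

  /* Filled in by the analysis phase, consumed by the workers.  */
  struct shrink_wrap_info
  {
    /* Blocks whose final jump should become a sibcall, plain or
       simple return, respectively.  */
    VEC(basic_block, heap) *sibcall_return_bbs;
    VEC(basic_block, heap) *plain_return_bbs;
    VEC(basic_block, heap) *simple_return_bbs;
    /* The edges on which the prologue and epilogue are to be
       inserted.  */
    edge prologue_edge;
    edge epilogue_edge;
  };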

Richard S. suggested:
> ...how about adding a bit to crtl to say whether shrink-wrapping occurred,
> and check that instead of flag_shrink_wrap?

Good idea, also changed.

New version below.  Bootstrapped and tested on i686-linux.


Bernd

[-- Attachment #2: sw1004.diff --]
[-- Type: text/plain, Size: 38792 bytes --]

	* doc/invoke.texi (-fshrink-wrap): Document.
	* opts.c (default_options_table): Add it.
	* common.opt (fshrink-wrap): Add.
	* function.c (emit_return_into_block): Remove useless declaration.
	(record_hard_reg_uses_1, record_hard_reg_uses, frame_required_for_rtx,
	requires_stack_frame_p, gen_return_pattern): New static functions.
	(emit_return_into_block): New arg simple_p.  All callers changed.
	Use gen_return_pattern.
	(thread_prologue_and_epilogue_insns): Implement shrink-wrapping.
	* config/i386/i386.md (return): Expand into a simple_return.
	(simple_return): New expander.
	(simple_return_internal, simple_return_internal_long,
	simple_return_pop_internal, simple_return_indirect_internal):
	Renamed from return_internal, return_internal_long,
	return_pop_internal and return_indirect_internal; changed to use
	simple_return.
	* config/i386/i386.c (ix86_expand_epilogue): Adjust to expand
	simple returns.
	(ix86_pad_returns): Likewise.
	* function.h (struct rtl_data): Add member shrink_wrapped.
	* cfgcleanup.c (outgoing_edges_match): If shrink-wrapped, edges that
	are not jumps or sibcalls can't be compared.

	* gcc.target/i386/sw-1.c: New test.

Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 179310)
+++ gcc/doc/invoke.texi	(working copy)
@@ -396,10 +396,10 @@ Objective-C and Objective-C++ Dialects}.
 -fschedule-insns -fschedule-insns2 -fsection-anchors @gol
 -fselective-scheduling -fselective-scheduling2 @gol
 -fsel-sched-pipelining -fsel-sched-pipelining-outer-loops @gol
--fsignaling-nans -fsingle-precision-constant -fsplit-ivs-in-unroller @gol
--fsplit-wide-types -fstack-protector -fstack-protector-all @gol
--fstrict-aliasing -fstrict-overflow -fthread-jumps -ftracer @gol
--ftree-bit-ccp @gol
+-fshrink-wrap -fsignaling-nans -fsingle-precision-constant @gol
+-fsplit-ivs-in-unroller -fsplit-wide-types -fstack-protector @gol
+-fstack-protector-all -fstrict-aliasing -fstrict-overflow @gol
+-fthread-jumps -ftracer -ftree-bit-ccp @gol
 -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-copy-prop @gol
 -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse @gol
 -ftree-forwprop -ftree-fre -ftree-loop-if-convert @gol
@@ -6879,6 +6879,12 @@ This option has no effect until one of @
 When pipelining loops during selective scheduling, also pipeline outer loops.
 This option has no effect until @option{-fsel-sched-pipelining} is turned on.
 
+@item -fshrink-wrap
+@opindex fshrink-wrap
+Emit function prologues only before parts of the function that need it,
+rather than at the top of the function.  This flag is enabled by default at
+@option{-O} and higher.
+
 @item -fcaller-saves
 @opindex fcaller-saves
 Enable values to be allocated in registers that will be clobbered by
Index: gcc/testsuite/gcc.target/i386/sw-1.c
===================================================================
--- gcc/testsuite/gcc.target/i386/sw-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/sw-1.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fshrink-wrap -fdump-rtl-pro_and_epilogue" } */
+
+#include <string.h>
+
+int c;
+int x[2000];
+__attribute__((regparm(1))) void foo (int a, int b)
+ {
+   int t[200];
+   if (a == 0 || c == 0)
+     return;
+   memcpy (t, x + b, sizeof t);
+   c = t[a];
+ }
+
+/* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" } } */
+/* { dg-final { cleanup-rtl-dump "pro_and_epilogue" } } */
Index: gcc/opts.c
===================================================================
--- gcc/opts.c	(revision 179310)
+++ gcc/opts.c	(working copy)
@@ -434,6 +434,7 @@ static const struct default_options defa
     { OPT_LEVELS_1_PLUS, OPT_fipa_reference, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_fipa_profile, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_fmerge_constants, NULL, 1 },
+    { OPT_LEVELS_1_PLUS, OPT_fshrink_wrap, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_fsplit_wide_types, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_ccp, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_bit_ccp, NULL, 1 },
Index: gcc/function.c
===================================================================
--- gcc/function.c	(revision 179310)
+++ gcc/function.c	(working copy)
@@ -147,9 +147,6 @@ extern tree debug_find_var_in_block_tree
    can always export `prologue_epilogue_contains'.  */
 static void record_insns (rtx, rtx, htab_t *) ATTRIBUTE_UNUSED;
 static bool contains (const_rtx, htab_t);
-#ifdef HAVE_return
-static void emit_return_into_block (basic_block);
-#endif
 static void prepare_function_start (void);
 static void do_clobber_return_reg (rtx, void *);
 static void do_use_return_reg (rtx, void *);
@@ -5285,6 +5282,78 @@ prologue_epilogue_contains (const_rtx in
   return 0;
 }
 
+#ifdef HAVE_simple_return
+
+/* A for_each_rtx subroutine of record_hard_reg_uses.  */
+static int
+record_hard_reg_uses_1 (rtx *px, void *data)
+{
+  rtx x = *px;
+  HARD_REG_SET *pused = (HARD_REG_SET *)data;
+
+  if (REG_P (x) && REGNO (x) < FIRST_PSEUDO_REGISTER)
+    add_to_hard_reg_set (pused, GET_MODE (x), REGNO (x));
+  return 0;
+}
+
+/* Like record_hard_reg_sets, but called through note_uses.  */
+static void
+record_hard_reg_uses (rtx *px, void *data)
+{
+  for_each_rtx (px, record_hard_reg_uses_1, data);
+}
+
+/* A subroutine of requires_stack_frame_p, called via for_each_rtx.
+   Return 1 if we found an rtx that forces a prologue, zero otherwise.  */
+static int
+frame_required_for_rtx (rtx *loc, void *data ATTRIBUTE_UNUSED)
+{
+  rtx x = *loc;
+  if (x == stack_pointer_rtx || x == hard_frame_pointer_rtx
+      || x == arg_pointer_rtx || x == pic_offset_table_rtx)
+    return 1;
+  return 0;
+}
+
+/* Return true if INSN requires the stack frame to be set up.
+   PROLOGUE_USED contains the hard registers used in the function
+   prologue.  */
+static bool
+requires_stack_frame_p (rtx insn, HARD_REG_SET prologue_used)
+{
+  df_ref *def_rec;
+  HARD_REG_SET hardregs;
+  unsigned regno;
+
+  if (!INSN_P (insn) || DEBUG_INSN_P (insn))
+    return false;
+  if (CALL_P (insn))
+    return !SIBLING_CALL_P (insn);
+  if (for_each_rtx (&PATTERN (insn), frame_required_for_rtx, NULL))
+    return true;
+
+  CLEAR_HARD_REG_SET (hardregs);
+  for (def_rec = DF_INSN_DEFS (insn); *def_rec; def_rec++)
+    {
+      rtx dreg = DF_REF_REG (*def_rec);
+
+      if (!REG_P (dreg))
+	continue;
+
+      add_to_hard_reg_set (&hardregs, GET_MODE (dreg),
+			   REGNO (dreg));
+    }
+  if (hard_reg_set_intersect_p (hardregs, prologue_used))
+    return true;
+  AND_COMPL_HARD_REG_SET (hardregs, call_used_reg_set);
+  for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
+    if (TEST_HARD_REG_BIT (hardregs, regno)
+	&& df_regs_ever_live_p (regno))
+      return true;
+  return false;
+}
+#endif
+
 #ifdef HAVE_return
 /* Insert use of return register before the end of BB.  */
 
@@ -5299,45 +5368,145 @@ emit_use_return_register_into_block (bas
   emit_insn_before (seq, BB_END (bb));
 }
 
-/* Insert gen_return at the end of block BB.  This also means updating
-   block_for_insn appropriately.  */
+
+/* Create a return pattern, either simple_return or return, depending on
+   simple_p.  */
+
+static rtx
+gen_return_pattern (bool simple_p)
+{
+#ifdef HAVE_simple_return
+  return simple_p ? gen_simple_return () : gen_return ();
+#else
+  gcc_assert (!simple_p);
+  return gen_return ();
+#endif
+}
+
+/* Insert an appropriate return pattern at the end of block BB.  This
+   also means updating block_for_insn appropriately.  SIMPLE_P is
+   the same as in gen_return_pattern and passed to it.  */
 
 static void
-emit_return_into_block (basic_block bb)
+emit_return_into_block (bool simple_p, basic_block bb)
 {
-  rtx jump = emit_jump_insn_after (gen_return (), BB_END (bb));
-  rtx pat = PATTERN (jump);
+  rtx jump, pat;
+  jump = emit_jump_insn_after (gen_return_pattern (simple_p), BB_END (bb));
+  pat = PATTERN (jump);
   if (GET_CODE (pat) == PARALLEL)
     pat = XVECEXP (pat, 0, 0);
   gcc_assert (ANY_RETURN_P (pat));
   JUMP_LABEL (jump) = pat;
 }
-#endif /* HAVE_return */
+#endif
 
 /* Generate the prologue and epilogue RTL if the machine supports it.  Thread
    this into place with notes indicating where the prologue ends and where
-   the epilogue begins.  Update the basic block information when possible.  */
+   the epilogue begins.  Update the basic block information when possible.
+
+   Notes on epilogue placement:
+   There are several kinds of edges to the exit block:
+   * a single fallthru edge from LAST_BB
+   * possibly, edges from blocks containing sibcalls
+   * possibly, fake edges from infinite loops
+
+   The epilogue is always emitted on the fallthru edge from the last basic
+   block in the function, LAST_BB, into the exit block.
+
+   If LAST_BB is empty except for a label, it is the target of every
+   other basic block in the function that ends in a return.  If a
+   target has a return or simple_return pattern (possibly with
+   conditional variants), these basic blocks can be changed so that a
+   return insn is emitted into them, and their target is adjusted to
+   the real exit block.
+
+   Notes on shrink wrapping: We implement a fairly conservative
+   version of shrink-wrapping rather than the textbook one.  We only
+   generate a single prologue and a single epilogue.  This is
+   sufficient to catch a number of interesting cases involving early
+   exits.
+
+   First, we identify the blocks that require the prologue to occur before
+   them.  These are the ones that modify a call-saved register, or reference
+   any of the stack or frame pointer registers.  To simplify things, we then
+   mark everything reachable from these blocks as also requiring a prologue.
+   This takes care of loops automatically, and avoids the need to examine
+   whether MEMs reference the frame, since it is sufficient to check for
+   occurrences of the stack or frame pointer.
+
+   We then compute the set of blocks for which the need for a prologue
+   is anticipatable (borrowing terminology from the shrink-wrapping
+   description in Muchnick's book).  These are the blocks which either
+   require a prologue themselves, or those that have only successors
+   where the prologue is anticipatable.  The prologue needs to be
+   inserted on all edges from BB1->BB2 where BB2 is in ANTIC and BB1
+   is not.  For the moment, we ensure that only one such edge exists.
+
+   The epilogue is placed as described above, but we make a
+   distinction between inserting return and simple_return patterns
+   when modifying other blocks that end in a return.  Blocks that end
+   in a sibcall omit the sibcall_epilogue if the block is not in
+   ANTIC.  */
 
 static void
 thread_prologue_and_epilogue_insns (void)
 {
   bool inserted;
+  basic_block last_bb;
+  bool last_bb_active;
+#ifdef HAVE_simple_return
+  bool unconverted_simple_returns = false;
+  basic_block simple_return_block_hot = NULL;
+  basic_block simple_return_block_cold = NULL;
+  bool nonempty_prologue;
+#endif
+  rtx returnjump ATTRIBUTE_UNUSED;
   rtx seq ATTRIBUTE_UNUSED, epilogue_end ATTRIBUTE_UNUSED;
-  edge entry_edge, e;
+  rtx prologue_seq ATTRIBUTE_UNUSED, split_prologue_seq ATTRIBUTE_UNUSED;
+  edge e, entry_edge, orig_entry_edge, exit_fallthru_edge;
   edge_iterator ei;
+  bitmap_head bb_flags;
+
+  df_analyze ();
 
   rtl_profile_for_bb (ENTRY_BLOCK_PTR);
 
   inserted = false;
   seq = NULL_RTX;
   epilogue_end = NULL_RTX;
+  returnjump = NULL_RTX;
 
   /* Can't deal with multiple successors of the entry block at the
      moment.  Function should always have at least one entry
      point.  */
   gcc_assert (single_succ_p (ENTRY_BLOCK_PTR));
   entry_edge = single_succ_edge (ENTRY_BLOCK_PTR);
+  orig_entry_edge = entry_edge;
+
+  exit_fallthru_edge = find_fallthru_edge (EXIT_BLOCK_PTR->preds);
+  if (exit_fallthru_edge != NULL)
+    {
+      rtx label;
+
+      last_bb = exit_fallthru_edge->src;
+      /* Test whether there are active instructions in the last block.  */
+      label = BB_END (last_bb);
+      while (label && !LABEL_P (label))
+	{
+	  if (active_insn_p (label))
+	    break;
+	  label = PREV_INSN (label);
+	}
 
+      last_bb_active = BB_HEAD (last_bb) != label || !LABEL_P (label);
+    }
+  else
+    {
+      last_bb = NULL;
+      last_bb_active = false;
+    }
+
+  split_prologue_seq = NULL_RTX;
   if (flag_split_stack
       && (lookup_attribute ("no_split_stack", DECL_ATTRIBUTES (cfun->decl))
 	  == NULL))
@@ -5349,17 +5518,15 @@ thread_prologue_and_epilogue_insns (void
 
       start_sequence ();
       emit_insn (gen_split_stack_prologue ());
-      seq = get_insns ();
+      split_prologue_seq = get_insns ();
       end_sequence ();
 
-      record_insns (seq, NULL, &prologue_insn_hash);
-      set_insn_locators (seq, prologue_locator);
-
-      insert_insn_on_edge (seq, entry_edge);
-      inserted = true;
+      record_insns (split_prologue_seq, NULL, &prologue_insn_hash);
+      set_insn_locators (split_prologue_seq, prologue_locator);
 #endif
     }
 
+  prologue_seq = NULL_RTX;
 #ifdef HAVE_prologue
   if (HAVE_prologue)
     {
@@ -5382,15 +5549,222 @@ thread_prologue_and_epilogue_insns (void
       if (!targetm.profile_before_prologue () && crtl->profile)
         emit_insn (gen_blockage ());
 
-      seq = get_insns ();
+      prologue_seq = get_insns ();
       end_sequence ();
-      set_insn_locators (seq, prologue_locator);
+      set_insn_locators (prologue_seq, prologue_locator);
+    }
+#endif
 
-      insert_insn_on_edge (seq, entry_edge);
-      inserted = true;
+  bitmap_initialize (&bb_flags, &bitmap_default_obstack);
+
+#ifdef HAVE_simple_return
+  /* Try to perform a kind of shrink-wrapping, making sure the
+     prologue/epilogue is emitted only around those parts of the
+     function that require it.  */
+
+  nonempty_prologue = false;
+  for (seq = prologue_seq; seq; seq = NEXT_INSN (seq))
+    if (!NOTE_P (seq) || NOTE_KIND (seq) != NOTE_INSN_PROLOGUE_END)
+      {
+	nonempty_prologue = true;
+	break;
+      }
+      
+  if (flag_shrink_wrap && HAVE_simple_return
+      && nonempty_prologue && !crtl->calls_eh_return)
+    {
+      HARD_REG_SET prologue_clobbered, prologue_used, live_on_edge;
+      rtx p_insn;
+
+      VEC(basic_block, heap) *vec;
+      basic_block bb;
+      bitmap_head bb_antic_flags;
+      bitmap_head bb_on_list;
+
+      if (dump_file)
+	fprintf (dump_file, "Attempting shrink-wrapping optimization.\n");
+
+      /* Compute the registers set and used in the prologue.  */
+      CLEAR_HARD_REG_SET (prologue_clobbered);
+      CLEAR_HARD_REG_SET (prologue_used);
+      for (p_insn = prologue_seq; p_insn; p_insn = NEXT_INSN (p_insn))
+	{
+	  HARD_REG_SET this_used;
+	  if (!NONDEBUG_INSN_P (p_insn))
+	    continue;
+
+	  CLEAR_HARD_REG_SET (this_used);
+	  note_uses (&PATTERN (p_insn), record_hard_reg_uses,
+		     &this_used);
+	  AND_COMPL_HARD_REG_SET (this_used, prologue_clobbered);
+	  IOR_HARD_REG_SET (prologue_used, this_used);
+	  note_stores (PATTERN (p_insn), record_hard_reg_sets,
+		       &prologue_clobbered);
+	}
+      for (p_insn = split_prologue_seq; p_insn; p_insn = NEXT_INSN (p_insn))
+	if (NONDEBUG_INSN_P (p_insn))
+	  note_stores (PATTERN (p_insn), record_hard_reg_sets,
+		       &prologue_clobbered);
+
+      bitmap_initialize (&bb_antic_flags, &bitmap_default_obstack);
+      bitmap_initialize (&bb_on_list, &bitmap_default_obstack);
+
+      /* Find the set of basic blocks that require a stack frame.  */
+
+      vec = VEC_alloc (basic_block, heap, n_basic_blocks);
+
+      FOR_EACH_BB (bb)
+	{
+	  rtx insn;
+	  /* As a special case, check for jumps to the last bb that
+	     cannot successfully be converted to simple_returns later
+	     on, and mark them as requiring a frame.  These are
+	     conditional jumps that jump to their fallthru block, so
+	     it's not a case that is expected to occur often.  */
+	  if (JUMP_P (BB_END (bb)) && any_condjump_p (BB_END (bb))
+	      && single_succ_p (bb)
+	      && !last_bb_active
+	      && single_succ (bb) == last_bb)
+	    {
+	      bitmap_set_bit (&bb_flags, bb->index);
+	      VEC_quick_push (basic_block, vec, bb);
+	    }
+	  else
+	    FOR_BB_INSNS (bb, insn)
+	      if (requires_stack_frame_p (insn, prologue_used))
+		{
+		  bitmap_set_bit (&bb_flags, bb->index);
+		  VEC_quick_push (basic_block, vec, bb);
+		  break;
+		}
+	}
+
+      /* For every basic block that needs a prologue, mark all blocks
+	 reachable from it, so as to ensure they are also seen as
+	 requiring a prologue.  */
+      while (!VEC_empty (basic_block, vec))
+	{
+	  basic_block tmp_bb = VEC_pop (basic_block, vec);
+	  edge e;
+	  edge_iterator ei;
+	  FOR_EACH_EDGE (e, ei, tmp_bb->succs)
+	    if (e->dest != EXIT_BLOCK_PTR
+		&& bitmap_set_bit (&bb_flags, e->dest->index))
+	      VEC_quick_push (basic_block, vec, e->dest);
+	}
+      /* If the last basic block contains only a label, we'll be able
+	 to convert jumps to it to (potentially conditional) return
+	 insns later.  This means we don't necessarily need a prologue
+	 for paths reaching it.  */
+      if (last_bb)
+	{
+	  if (!last_bb_active)
+	    bitmap_clear_bit (&bb_flags, last_bb->index);
+	  else if (!bitmap_bit_p (&bb_flags, last_bb->index))
+	    goto fail_shrinkwrap;
+	}
+
+      /* Now walk backwards from every block that is marked as needing
+	 a prologue to compute the bb_antic_flags bitmap.  */
+      bitmap_copy (&bb_antic_flags, &bb_flags);
+      FOR_EACH_BB (bb)
+	{
+	  edge e;
+	  edge_iterator ei;
+	  if (!bitmap_bit_p (&bb_flags, bb->index))
+	    continue;
+	  FOR_EACH_EDGE (e, ei, bb->preds)
+	    if (!bitmap_bit_p (&bb_antic_flags, e->src->index)
+		&& bitmap_set_bit (&bb_on_list, e->src->index))
+	      VEC_quick_push (basic_block, vec, e->src);
+	}
+      while (!VEC_empty (basic_block, vec))
+	{
+	  basic_block tmp_bb = VEC_pop (basic_block, vec);
+	  edge e;
+	  edge_iterator ei;
+	  bool all_set = true;
+
+	  bitmap_clear_bit (&bb_on_list, tmp_bb->index);
+	  FOR_EACH_EDGE (e, ei, tmp_bb->succs)
+	    if (!bitmap_bit_p (&bb_antic_flags, e->dest->index))
+	      {
+		all_set = false;
+		break;
+	      }
+
+	  if (all_set)
+	    {
+	      bitmap_set_bit (&bb_antic_flags, tmp_bb->index);
+	      FOR_EACH_EDGE (e, ei, tmp_bb->preds)
+		if (!bitmap_bit_p (&bb_antic_flags, e->src->index)
+		    && bitmap_set_bit (&bb_on_list, e->src->index))
+		  VEC_quick_push (basic_block, vec, e->src);
+	    }
+	}
+      /* Find exactly one edge that leads to a block in ANTIC from
+	 a block that isn't.  */
+      if (!bitmap_bit_p (&bb_antic_flags, entry_edge->dest->index))
+	FOR_EACH_BB (bb)
+	  {
+	    if (!bitmap_bit_p (&bb_antic_flags, bb->index))
+	      continue;
+	    FOR_EACH_EDGE (e, ei, bb->preds)
+	      if (!bitmap_bit_p (&bb_antic_flags, e->src->index))
+		{
+		  if (entry_edge != orig_entry_edge)
+		    {
+		      entry_edge = orig_entry_edge;
+		      if (dump_file)
+			fprintf (dump_file, "More than one candidate edge.\n");
+		      goto fail_shrinkwrap;
+		    }
+		  if (dump_file)
+		    fprintf (dump_file, "Found candidate edge for "
+			     "shrink-wrapping, %d->%d.\n", e->src->index,
+			     e->dest->index);
+		  entry_edge = e;
+		}
+	  }
+
+      /* Test whether the prologue is known to clobber any register
+	 (other than FP or SP) that is live on the edge.  */
+      CLEAR_HARD_REG_BIT (prologue_clobbered, STACK_POINTER_REGNUM);
+      if (frame_pointer_needed)
+	CLEAR_HARD_REG_BIT (prologue_clobbered, HARD_FRAME_POINTER_REGNUM);
+      CLEAR_HARD_REG_SET (live_on_edge);
+      reg_set_to_hard_reg_set (&live_on_edge,
+			       df_get_live_in (entry_edge->dest));
+      if (hard_reg_set_intersect_p (live_on_edge, prologue_clobbered))
+	{
+	  entry_edge = orig_entry_edge;
+	  if (dump_file)
+	    fprintf (dump_file, "Shrink-wrapping aborted due to clobber.\n");
+	}
+      else if (dump_file && entry_edge != orig_entry_edge)
+	{
+	  crtl->shrink_wrapped = true;
+	  fprintf (dump_file, "Performing shrink-wrapping.\n");
+	}
+
+    fail_shrinkwrap:
+      bitmap_clear (&bb_antic_flags);
+      bitmap_clear (&bb_on_list);
+      VEC_free (basic_block, heap, vec);
     }
 #endif
 
+  if (split_prologue_seq != NULL_RTX)
+    {
+      insert_insn_on_edge (split_prologue_seq, entry_edge);
+      inserted = true;
+    }
+  if (prologue_seq != NULL_RTX)
+    {
+      insert_insn_on_edge (prologue_seq, entry_edge);
+      inserted = true;
+    }
+
   /* If the exit block has no non-fake predecessors, we don't need
      an epilogue.  */
   FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR->preds)
@@ -5400,110 +5774,145 @@ thread_prologue_and_epilogue_insns (void
     goto epilogue_done;
 
   rtl_profile_for_bb (EXIT_BLOCK_PTR);
+
 #ifdef HAVE_return
-  if (optimize && HAVE_return)
+  /* If we're allowed to generate a simple return instruction, then by
+     definition we don't need a full epilogue.  If the last basic
+     block before the exit block does not contain active instructions,
+     examine its predecessors and try to emit (conditional) return
+     instructions.  */
+  if (optimize && !last_bb_active
+      && (HAVE_return || entry_edge != orig_entry_edge))
     {
-      /* If we're allowed to generate a simple return instruction,
-	 then by definition we don't need a full epilogue.  Examine
-	 the block that falls through to EXIT.   If it does not
-	 contain any code, examine its predecessors and try to
-	 emit (conditional) return instructions.  */
-
-      basic_block last;
+      edge_iterator ei2;
+      int i;
+      basic_block bb;
       rtx label;
+      VEC(basic_block,heap) *src_bbs;
 
-      e = find_fallthru_edge (EXIT_BLOCK_PTR->preds);
-      if (e == NULL)
+      if (exit_fallthru_edge == NULL)
 	goto epilogue_done;
-      last = e->src;
+      label = BB_HEAD (last_bb);
 
-      /* Verify that there are no active instructions in the last block.  */
-      label = BB_END (last);
-      while (label && !LABEL_P (label))
-	{
-	  if (active_insn_p (label))
-	    break;
-	  label = PREV_INSN (label);
-	}
+      src_bbs = VEC_alloc (basic_block, heap, EDGE_COUNT (last_bb->preds));
+      FOR_EACH_EDGE (e, ei2, last_bb->preds)
+	if (e->src != ENTRY_BLOCK_PTR)
+	  VEC_quick_push (basic_block, src_bbs, e->src);
 
-      if (BB_HEAD (last) == label && LABEL_P (label))
+      FOR_EACH_VEC_ELT (basic_block, src_bbs, i, bb)
 	{
-	  edge_iterator ei2;
-
-	  for (ei2 = ei_start (last->preds); (e = ei_safe_edge (ei2)); )
-	    {
-	      basic_block bb = e->src;
-	      rtx jump;
+	  bool simple_p;
+	  rtx jump;
+	  e = find_edge (bb, last_bb);
 
-	      if (bb == ENTRY_BLOCK_PTR)
-		{
-		  ei_next (&ei2);
-		  continue;
-		}
+	  jump = BB_END (bb);
 
-	      jump = BB_END (bb);
-	      if (!JUMP_P (jump) || JUMP_LABEL (jump) != label)
-		{
-		  ei_next (&ei2);
-		  continue;
-		}
+#ifdef HAVE_simple_return
+	  simple_p = (entry_edge != orig_entry_edge
+		      && !bitmap_bit_p (&bb_flags, bb->index));
+#else
+	  simple_p = false;
+#endif
 
-	      /* If we have an unconditional jump, we can replace that
-		 with a simple return instruction.  */
-	      if (simplejump_p (jump))
-		{
-		  /* The use of the return register might be present in the exit
-		     fallthru block.  Either:
-		     - removing the use is safe, and we should remove the use in
-		       the exit fallthru block, or
-		     - removing the use is not safe, and we should add it here.
-		     For now, we conservatively choose the latter.  Either of the
-		     2 helps in crossjumping.  */
-		  emit_use_return_register_into_block (bb);
+	  if (!simple_p
+	      && (!HAVE_return || !JUMP_P (jump)
+		  || JUMP_LABEL (jump) != label))
+	    continue;
 
-		  emit_return_into_block (bb);
-		  delete_insn (jump);
-		}
+	  /* If we have an unconditional jump, we can replace that
+	     with a simple return instruction.  */
+	  if (!JUMP_P (jump))
+	    {
+	      emit_barrier_after (BB_END (bb));
+	      emit_return_into_block (simple_p, bb);
+	    }
+	  else if (simplejump_p (jump))
+	    {
+	      /* The use of the return register might be present in the exit
+	         fallthru block.  Either:
+	         - removing the use is safe, and we should remove the use in
+	           the exit fallthru block, or
+	         - removing the use is not safe, and we should add it here.
+	         For now, we conservatively choose the latter.  Either of the
+	         2 helps in crossjumping.  */
+	      emit_use_return_register_into_block (bb);
 
-	      /* If we have a conditional jump, we can try to replace
-		 that with a conditional return instruction.  */
-	      else if (condjump_p (jump))
-		{
-		  if (! redirect_jump (jump, ret_rtx, 0))
-		    {
-		      ei_next (&ei2);
-		      continue;
-		    }
+	      emit_return_into_block (simple_p, bb);
+	      delete_insn (jump);
+	    }
+	  else if (condjump_p (jump) && JUMP_LABEL (jump) != label)
+	    {
+	      basic_block new_bb;
+	      edge new_e;
 
-		  /* See comment in simple_jump_p case above.  */
-		  emit_use_return_register_into_block (bb);
+	      gcc_assert (simple_p);
+	      new_bb = split_edge (e);
+	      emit_barrier_after (BB_END (new_bb));
+	      emit_return_into_block (simple_p, new_bb);
+#ifdef HAVE_simple_return
+	      if (BB_PARTITION (new_bb) == BB_HOT_PARTITION)
+		simple_return_block_hot = new_bb;
+	      else
+		simple_return_block_cold = new_bb;
+#endif
+	      new_e = single_succ_edge (new_bb);
+	      redirect_edge_succ (new_e, EXIT_BLOCK_PTR);
 
-		  /* If this block has only one successor, it both jumps
-		     and falls through to the fallthru block, so we can't
-		     delete the edge.  */
-		  if (single_succ_p (bb))
-		    {
-		      ei_next (&ei2);
-		      continue;
-		    }
-		}
+	      continue;
+	    }
+	  /* If we have a conditional jump branching to the last
+	     block, we can try to replace that with a conditional
+	     return instruction.  */
+	  else if (condjump_p (jump))
+	    {
+	      rtx dest;
+	      if (simple_p)
+		dest = simple_return_rtx;
 	      else
+		dest = ret_rtx;
+	      if (! redirect_jump (jump, dest, 0))
 		{
-		  ei_next (&ei2);
+#ifdef HAVE_simple_return
+		  if (simple_p)
+		    unconverted_simple_returns = true;
+#endif
 		  continue;
 		}
 
-	      /* Fix up the CFG for the successful change we just made.  */
-	      redirect_edge_succ (e, EXIT_BLOCK_PTR);
+	      /* See comment in the simplejump_p case above.  */
+	      emit_use_return_register_into_block (bb);
+
+	      /* If this block has only one successor, it both jumps
+		 and falls through to the fallthru block, so we can't
+		 delete the edge.  */
+	      if (single_succ_p (bb))
+		continue;
+	    }
+	  else
+	    {
+#ifdef HAVE_simple_return
+	      if (simple_p)
+		unconverted_simple_returns = true;
+#endif
+	      continue;
 	    }
 
+	  /* Fix up the CFG for the successful change we just made.  */
+	  redirect_edge_succ (e, EXIT_BLOCK_PTR);
+	}
+      VEC_free (basic_block, heap, src_bbs);
+
+      if (HAVE_return)
+	{
 	  /* Emit a return insn for the exit fallthru block.  Whether
 	     this is still reachable will be determined later.  */
 
-	  emit_barrier_after (BB_END (last));
-	  emit_return_into_block (last);
-	  epilogue_end = BB_END (last);
-	  single_succ_edge (last)->flags &= ~EDGE_FALLTHRU;
+	  emit_barrier_after (BB_END (last_bb));
+	  emit_return_into_block (false, last_bb);
+	  epilogue_end = BB_END (last_bb);
+	  if (JUMP_P (epilogue_end))
+	    JUMP_LABEL (epilogue_end) = ret_rtx;
+	  single_succ_edge (last_bb)->flags &= ~EDGE_FALLTHRU;
 	  goto epilogue_done;
 	}
     }
@@ -5540,20 +5949,15 @@ thread_prologue_and_epilogue_insns (void
     }
 #endif
 
-  /* Find the edge that falls through to EXIT.  Other edges may exist
-     due to RETURN instructions, but those don't need epilogues.
-     There really shouldn't be a mixture -- either all should have
-     been converted or none, however...  */
+  /* If nothing falls through into the exit block, we don't need an
+     epilogue.  */
 
-  e = find_fallthru_edge (EXIT_BLOCK_PTR->preds);
-  if (e == NULL)
+  if (exit_fallthru_edge == NULL)
     goto epilogue_done;
 
 #ifdef HAVE_epilogue
   if (HAVE_epilogue)
     {
-      rtx returnjump;
-
       start_sequence ();
       epilogue_end = emit_note (NOTE_INSN_EPILOGUE_BEG);
       seq = gen_epilogue ();
@@ -5564,11 +5968,11 @@ thread_prologue_and_epilogue_insns (void
       record_insns (seq, NULL, &epilogue_insn_hash);
       set_insn_locators (seq, epilogue_locator);
 
-      returnjump = get_last_insn ();
       seq = get_insns ();
+      returnjump = get_last_insn ();
       end_sequence ();
 
-      insert_insn_on_edge (seq, e);
+      insert_insn_on_edge (seq, exit_fallthru_edge);
       inserted = true;
 
       if (JUMP_P (returnjump))
@@ -5581,23 +5985,21 @@ thread_prologue_and_epilogue_insns (void
 	  else
 	    JUMP_LABEL (returnjump) = ret_rtx;
 	}
-      else
-	returnjump = NULL_RTX;
     }
   else
 #endif
     {
       basic_block cur_bb;
 
-      if (! next_active_insn (BB_END (e->src)))
+      if (! next_active_insn (BB_END (exit_fallthru_edge->src)))
 	goto epilogue_done;
       /* We have a fall-through edge to the exit block, the source is not
          at the end of the function, and there will be an assembler epilogue
          at the end of the function.
          We can't use force_nonfallthru here, because that would try to
-         use return.  Inserting a jump 'by hand' is extremely messy, so
+	 use return.  Inserting a jump 'by hand' is extremely messy, so
 	 we take advantage of cfg_layout_finalize using
-	fixup_fallthru_exit_predecessor.  */
+	 fixup_fallthru_exit_predecessor.  */
       cfg_layout_initialize (0);
       FOR_EACH_BB (cur_bb)
 	if (cur_bb->index >= NUM_FIXED_BLOCKS
@@ -5607,6 +6009,7 @@ thread_prologue_and_epilogue_insns (void
     }
 
 epilogue_done:
+
   default_rtl_profile ();
 
   if (inserted)
@@ -5632,33 +6035,102 @@ epilogue_done:
 	}
     }
 
+#ifdef HAVE_simple_return
+  /* If there were branches to an empty LAST_BB which we tried to
+     convert to conditional simple_returns, but couldn't for some
+     reason, create a block to hold a simple_return insn and redirect
+     those remaining edges.  */
+  if (unconverted_simple_returns)
+    {
+      edge_iterator ei2;
+      basic_block exit_pred = EXIT_BLOCK_PTR->prev_bb;
+
+      gcc_assert (entry_edge != orig_entry_edge);
+
+      /* See if we can reuse the last insn that was emitted for the
+	 epilogue.  */
+      if (returnjump != NULL_RTX
+	  && JUMP_LABEL (returnjump) == simple_return_rtx)
+	{
+	  edge e = split_block (exit_fallthru_edge->src,
+				PREV_INSN (returnjump));
+	  if (BB_PARTITION (e->src) == BB_HOT_PARTITION)
+	    simple_return_block_hot = e->dest;
+	  else
+	    simple_return_block_cold = e->dest;
+	}
+
+    restart_scan:
+      for (ei2 = ei_start (last_bb->preds); (e = ei_safe_edge (ei2)); )
+	{
+	  basic_block bb = e->src;
+	  basic_block *pdest_bb;
+
+	  if (bb == ENTRY_BLOCK_PTR
+	      || bitmap_bit_p (&bb_flags, bb->index))
+	    {
+	      ei_next (&ei2);
+	      continue;
+	    }
+	  if (BB_PARTITION (e->src) == BB_HOT_PARTITION)
+	    pdest_bb = &simple_return_block_hot;
+	  else
+	    pdest_bb = &simple_return_block_cold;
+	  if (*pdest_bb == NULL)
+	    {
+	      basic_block bb;
+	      rtx start;
+
+	      bb = create_basic_block (NULL, NULL, exit_pred);
+	      BB_COPY_PARTITION (bb, e->src);
+	      start = emit_jump_insn_after (gen_simple_return (),
+					    BB_END (bb));
+	      JUMP_LABEL (start) = simple_return_rtx;
+	      emit_barrier_after (start);
+
+	      *pdest_bb = bb;
+	      make_edge (bb, EXIT_BLOCK_PTR, 0);
+	    }
+	  redirect_edge_and_branch_force (e, *pdest_bb);
+	  goto restart_scan;
+	}
+    }
+#endif
+
 #ifdef HAVE_sibcall_epilogue
   /* Emit sibling epilogues before any sibling call sites.  */
   for (ei = ei_start (EXIT_BLOCK_PTR->preds); (e = ei_safe_edge (ei)); )
     {
       basic_block bb = e->src;
       rtx insn = BB_END (bb);
+      rtx ep_seq;
 
       if (!CALL_P (insn)
-	  || ! SIBLING_CALL_P (insn))
+	  || ! SIBLING_CALL_P (insn)
+	  || (entry_edge != orig_entry_edge
+	      && !bitmap_bit_p (&bb_flags, bb->index)))
 	{
 	  ei_next (&ei);
 	  continue;
 	}
 
-      start_sequence ();
-      emit_note (NOTE_INSN_EPILOGUE_BEG);
-      emit_insn (gen_sibcall_epilogue ());
-      seq = get_insns ();
-      end_sequence ();
+      ep_seq = gen_sibcall_epilogue ();
+      if (ep_seq)
+	{
+	  start_sequence ();
+	  emit_note (NOTE_INSN_EPILOGUE_BEG);
+	  emit_insn (ep_seq);
+	  seq = get_insns ();
+	  end_sequence ();
 
-      /* Retain a map of the epilogue insns.  Used in life analysis to
-	 avoid getting rid of sibcall epilogue insns.  Do this before we
-	 actually emit the sequence.  */
-      record_insns (seq, NULL, &epilogue_insn_hash);
-      set_insn_locators (seq, epilogue_locator);
+	  /* Retain a map of the epilogue insns.  Used in life analysis to
+	     avoid getting rid of sibcall epilogue insns.  Do this before we
+	     actually emit the sequence.  */
+	  record_insns (seq, NULL, &epilogue_insn_hash);
+	  set_insn_locators (seq, epilogue_locator);
 
-      emit_insn_before (seq, insn);
+	  emit_insn_before (seq, insn);
+	}
       ei_next (&ei);
     }
 #endif
@@ -5683,6 +6155,8 @@ epilogue_done:
     }
 #endif
 
+  bitmap_clear (&bb_flags);
+
   /* Threading the prologue and epilogue changes the artificial refs
      in the entry and exit blocks.  */
   epilogue_completed = 1;
Index: gcc/function.h
===================================================================
--- gcc/function.h	(revision 179310)
+++ gcc/function.h	(working copy)
@@ -435,6 +435,9 @@ struct GTY(()) rtl_data {
      function where currently compiled version of it is nothrow.  */
   bool nothrow;
 
+  /* True if we performed shrink-wrapping for the current function.  */
+  bool shrink_wrapped;
+
   /* Like regs_ever_live, but 1 if a reg is set or clobbered from an
      asm.  Unlike regs_ever_live, elements of this array corresponding
      to eliminable regs (like the frame pointer) are set if an asm
Index: gcc/cfgcleanup.c
===================================================================
--- gcc/cfgcleanup.c	(revision 179310)
+++ gcc/cfgcleanup.c	(working copy)
@@ -1488,6 +1488,16 @@ outgoing_edges_match (int mode, basic_bl
   edge e1, e2;
   edge_iterator ei;
 
+  /* If we performed shrink-wrapping, edges to the EXIT_BLOCK_PTR can
+     only be distinguished for JUMP_INSNs.  The two paths may differ in
+     whether they went through the prologue.  Sibcalls are fine, we know
+     that we either didn't need or inserted an epilogue before them.  */
+  if (crtl->shrink_wrapped
+      && single_succ_p (bb1) && single_succ (bb1) == EXIT_BLOCK_PTR
+      && !JUMP_P (BB_END (bb1))
+      && !(CALL_P (BB_END (bb1)) && SIBLING_CALL_P (BB_END (bb1))))
+    return false;
+  
   /* If BB1 has only one successor, we may be looking at either an
      unconditional jump, or a fake edge to exit.  */
   if (single_succ_p (bb1)
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 179310)
+++ gcc/common.opt	(working copy)
@@ -1772,6 +1772,11 @@ fshow-column
 Common Report Var(flag_show_column) Init(1)
 Show column numbers in diagnostics, when available.  Default on
 
+fshrink-wrap
+Common Report Var(flag_shrink_wrap) Optimization
+Emit function prologues only before parts of the function that need it,
+rather than at the top of the function.
+
 fsignaling-nans
 Common Report Var(flag_signaling_nans) Optimization SetByCombined
 Disable optimizations observable by IEEE signaling NaNs
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 179310)
+++ gcc/config/i386/i386.md	(working copy)
@@ -11695,19 +11695,31 @@ (define_insn "prologue_use"
 ;; See comments for ix86_can_use_return_insn_p in i386.c.
 
 (define_expand "return"
-  [(return)]
+  [(simple_return)]
   "ix86_can_use_return_insn_p ()"
 {
   if (crtl->args.pops_args)
     {
       rtx popc = GEN_INT (crtl->args.pops_args);
-      emit_jump_insn (gen_return_pop_internal (popc));
+      emit_jump_insn (gen_simple_return_pop_internal (popc));
       DONE;
     }
 })
 
-(define_insn "return_internal"
-  [(return)]
+(define_expand "simple_return"
+  [(simple_return)]
+  ""
+{
+  if (crtl->args.pops_args)
+    {
+      rtx popc = GEN_INT (crtl->args.pops_args);
+      emit_jump_insn (gen_simple_return_pop_internal (popc));
+      DONE;
+    }
+})
+
+(define_insn "simple_return_internal"
+  [(simple_return)]
   "reload_completed"
   "ret"
   [(set_attr "length" "1")
@@ -11718,8 +11730,8 @@ (define_insn "return_internal"
 ;; Used by x86_machine_dependent_reorg to avoid penalty on single byte RET
 ;; instruction Athlon and K8 have.
 
-(define_insn "return_internal_long"
-  [(return)
+(define_insn "simple_return_internal_long"
+  [(simple_return)
    (unspec [(const_int 0)] UNSPEC_REP)]
   "reload_completed"
   "rep\;ret"
@@ -11729,8 +11741,8 @@ (define_insn "return_internal_long"
    (set_attr "prefix_rep" "1")
    (set_attr "modrm" "0")])
 
-(define_insn "return_pop_internal"
-  [(return)
+(define_insn "simple_return_pop_internal"
+  [(simple_return)
    (use (match_operand:SI 0 "const_int_operand" ""))]
   "reload_completed"
   "ret\t%0"
@@ -11739,8 +11751,8 @@ (define_insn "return_pop_internal"
    (set_attr "length_immediate" "2")
    (set_attr "modrm" "0")])
 
-(define_insn "return_indirect_internal"
-  [(return)
+(define_insn "simple_return_indirect_internal"
+  [(simple_return)
    (use (match_operand:SI 0 "register_operand" "r"))]
   "reload_completed"
   "jmp\t%A0"
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 179310)
+++ gcc/config/i386/i386.c	(working copy)
@@ -10779,13 +10779,13 @@ ix86_expand_epilogue (int style)
 
 	  pro_epilogue_adjust_stack (stack_pointer_rtx, stack_pointer_rtx,
 				     popc, -1, true);
-	  emit_jump_insn (gen_return_indirect_internal (ecx));
+	  emit_jump_insn (gen_simple_return_indirect_internal (ecx));
 	}
       else
-	emit_jump_insn (gen_return_pop_internal (popc));
+	emit_jump_insn (gen_simple_return_pop_internal (popc));
     }
   else
-    emit_jump_insn (gen_return_internal ());
+    emit_jump_insn (gen_simple_return_internal ());
 
   /* Restore the state back to the state from the prologue,
      so that it's correct for the next epilogue.  */
@@ -31184,7 +31184,7 @@ ix86_pad_returns (void)
 	}
       if (replace)
 	{
-	  emit_jump_insn_before (gen_return_internal_long (), ret);
+	  emit_jump_insn_before (gen_simple_return_internal_long (), ret);
 	  delete_insn (ret);
 	}
     }
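
For anyone following the thread without applying the patch, here is a
minimal example (hypothetical, not taken from the patch or testsuite)
of the kind of function this transformation targets.  The early-return
path needs no stack frame, so with -O2 -fshrink-wrap the register
saves can be sunk onto the path that contains the call:

/* Hypothetical illustration.  'a' must survive the call to g, so the
   slow path needs a callee-saved register and therefore a prologue;
   the fast path can return via a bare (simple_return).  */
extern int g (int *);

int
f (int a, int *p)
{
  if (a == 0)
    return 0;          /* fast path: no prologue executed */
  return g (p) + a;    /* slow path: prologue inserted on this edge */
}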


* Re: Initial shrink-wrapping patch
  2011-10-04 22:19                       ` Bernd Schmidt
@ 2011-10-04 22:43                         ` Richard Henderson
  2011-10-05 15:22                           ` Richard Guenther
  2011-10-05 17:23                           ` Bernd Schmidt
  2011-10-06 18:47                         ` H.J. Lu
  1 sibling, 2 replies; 41+ messages in thread
From: Richard Henderson @ 2011-10-04 22:43 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches, richard.sandiford

On 10/04/2011 03:10 PM, Bernd Schmidt wrote:
> 	* doc/invoke.texi (-fshrink-wrap): Document.
> 	* opts.c (default_options_table): Add it.
> 	* common.opt (fshrink-wrap): Add.
> 	* function.c (emit_return_into_block): Remove useless declaration.
> 	(record_hard_reg_uses_1, record_hard_reg_uses, frame_required_for_rtx,
> 	requires_stack_frame_p, gen_return_pattern): New static functions.
> 	(emit_return_into_block): New arg simple_p.  All callers changed.
> 	Use gen_return_pattern.
> 	(thread_prologue_and_epilogue_insns): Implement shrink-wrapping.
> 	* config/i386/i386.md (return): Expand into a simple_return.
> 	(simple_return): New expander.
> 	(simple_return_internal, simple_return_internal_long,
> 	simple_return_pop_internal_long, simple_return_indirect_internal):
> 	Renamed from return_internal, return_internal_long,
> 	return_pop_internal_long and return_indirect_internal; changed to use
> 	simple_return.
> 	* config/i386/i386.c (ix86_expand_epilogue): Adjust to expand
> 	simple returns.
> 	(ix86_pad_returns): Likewise.
> 	* function.h (struct rtl_data): Add member shrink_wrapped.
> 	* cfgcleanup.c (outgoing_edges_match): If shrink-wrapped, edges that
> 	are not jumps or sibcalls can't be compared.
> 
> 	* gcc.target/i386/sw-1.c: New test.

Ok.

As a followup, I think this option needs to be disabled for profiling
and profile_after_prologue.  Should be a mere matter of frobbing the
options at startup.


r~


* Re: Initial shrink-wrapping patch
  2011-10-04 22:43                         ` Richard Henderson
@ 2011-10-05 15:22                           ` Richard Guenther
  2011-10-05 16:04                             ` Bernd Schmidt
  2011-10-05 17:23                           ` Bernd Schmidt
  1 sibling, 1 reply; 41+ messages in thread
From: Richard Guenther @ 2011-10-05 15:22 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Bernd Schmidt, GCC Patches, richard.sandiford

On Wed, Oct 5, 2011 at 12:29 AM, Richard Henderson <rth@redhat.com> wrote:
> On 10/04/2011 03:10 PM, Bernd Schmidt wrote:
>>       * doc/invoke.texi (-fshrink-wrap): Document.
>>       * opts.c (default_options_table): Add it.
>>       * common.opt (fshrink-wrap): Add.
>>       * function.c (emit_return_into_block): Remove useless declaration.
>>       (record_hard_reg_uses_1, record_hard_reg_uses, frame_required_for_rtx,
>>       requires_stack_frame_p, gen_return_pattern): New static functions.
>>       (emit_return_into_block): New arg simple_p.  All callers changed.
>>       Use gen_return_pattern.
>>       (thread_prologue_and_epilogue_insns): Implement shrink-wrapping.
>>       * config/i386/i386.md (return): Expand into a simple_return.
>>       (simple_return): New expander.
>>       (simple_return_internal, simple_return_internal_long,
>>       simple_return_pop_internal_long, simple_return_indirect_internal):
>>       Renamed from return_internal, return_internal_long,
>>       return_pop_internal_long and return_indirect_internal; changed to use
>>       simple_return.
>>       * config/i386/i386.c (ix86_expand_epilogue): Adjust to expand
>>       simple returns.
>>       (ix86_pad_returns): Likewise.
>>       * function.h (struct rtl_data): Add member shrink_wrapped.
>>       * cfgcleanup.c (outgoing_edges_match): If shrink-wrapped, edges that
>>       are not jumps or sibcalls can't be compared.
>>
>>       * gcc.target/i386/sw-1.c: New test.
>
> Ok.
>
> As a followup, I think this option needs to be disabled for profiling
> and profile_after_prologue.  Should be a mere matter of frobbing the
> options at startup.

This breaks bootstrap on x86_64-linux.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50621

Richard.

>
> r~
>


* Re: Initial shrink-wrapping patch
  2011-10-05 15:22                           ` Richard Guenther
@ 2011-10-05 16:04                             ` Bernd Schmidt
  2011-10-05 16:22                               ` Richard Henderson
  2011-10-06 18:07                               ` Bernd Schmidt
  0 siblings, 2 replies; 41+ messages in thread
From: Bernd Schmidt @ 2011-10-05 16:04 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Richard Henderson, GCC Patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 2095 bytes --]

On 10/05/11 17:13, Richard Guenther wrote:
> On Wed, Oct 5, 2011 at 12:29 AM, Richard Henderson <rth@redhat.com> wrote:
>> On 10/04/2011 03:10 PM, Bernd Schmidt wrote:
>>>       * doc/invoke.texi (-fshrink-wrap): Document.
>>>       * opts.c (default_options_table): Add it.
>>>       * common.opt (fshrink-wrap): Add.
>>>       * function.c (emit_return_into_block): Remove useless declaration.
>>>       (record_hard_reg_uses_1, record_hard_reg_uses, frame_required_for_rtx,
>>>       requires_stack_frame_p, gen_return_pattern): New static functions.
>>>       (emit_return_into_block): New arg simple_p.  All callers changed.
>>>       Use gen_return_pattern.
>>>       (thread_prologue_and_epilogue_insns): Implement shrink-wrapping.
>>>       * config/i386/i386.md (return): Expand into a simple_return.
>>>       (simple_return): New expander.
>>>       (simple_return_internal, simple_return_internal_long,
>>>       simple_return_pop_internal_long, simple_return_indirect_internal):
>>>       Renamed from return_internal, return_internal_long,
>>>       return_pop_internal_long and return_indirect_internal; changed to use
>>>       simple_return.
>>>       * config/i386/i386.c (ix86_expand_epilogue): Adjust to expand
>>>       simple returns.
>>>       (ix86_pad_returns): Likewise.
>>>       * function.h (struct rtl_data): Add member shrink_wrapped.
>>>       * cfgcleanup.c (outgoing_edges_match): If shrink-wrapped, edges that
>>>       are not jumps or sibcalls can't be compared.
>>>
>>>       * gcc.target/i386/sw-1.c: New test.
>>
>> Ok.
>>
>> As a followup, I think this option needs to be disabled for profiling
>> and profile_after_prologue.  Should be a mere matter of frobbing the
>> options at startup.
> 
> This breaks bootstrap on x86_64-linux.
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50621

Caused by the x86_64 expand_epilogue not generating REG_CFA_RESTORE
notes, and in another case by queuing but not emitting them.

Bootstrapping the following now. Ok? (Alternatively, could keep the
redzone logic, but make it depend on !flag_shrink_wrap).
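
As a concrete sketch of that alternative (hypothetical, not what was
committed), the check at the top of ix86_add_cfa_restore_note would
become something like:

  /* Sketch only: keep the red-zone shortcut when shrink-wrapping
     is disabled.  */
  if (!flag_shrink_wrap
      && cfa_offset <= cfun->machine->fs.red_zone_offset)
    return;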


Bernd

[-- Attachment #2: redzone.diff --]
[-- Type: text/plain, Size: 2630 bytes --]

	* config/i386/i386.c (ix86_add_cfa_restore_note): Lose CFA_OFFSET
	argument.  All callers changed.  Always emit a note.
	(ix86_expand_epilogue): Ensure queued cfa_adjust notes are attached
	to an insn.

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 179553)
+++ gcc/config/i386/i386.c	(working copy)
@@ -9126,17 +9126,11 @@ ix86_emit_save_sse_regs_using_mov (HOST_
 static GTY(()) rtx queued_cfa_restores;
 
 /* Add a REG_CFA_RESTORE REG note to INSN or queue them until next stack
-   manipulation insn.  The value is on the stack at CFA - CFA_OFFSET.
-   Don't add the note if the previously saved value will be left untouched
-   within stack red-zone till return, as unwinders can find the same value
-   in the register and on the stack.  */
+   manipulation insn.  */
 
 static void
-ix86_add_cfa_restore_note (rtx insn, rtx reg, HOST_WIDE_INT cfa_offset)
+ix86_add_cfa_restore_note (rtx insn, rtx reg)
 {
-  if (cfa_offset <= cfun->machine->fs.red_zone_offset)
-    return;
-
   if (insn)
     {
       add_reg_note (insn, REG_CFA_RESTORE, reg);
@@ -10298,7 +10292,7 @@ ix86_emit_restore_reg_using_pop (rtx reg
   struct machine_function *m = cfun->machine;
   rtx insn = emit_insn (gen_pop (reg));
 
-  ix86_add_cfa_restore_note (insn, reg, m->fs.sp_offset);
+  ix86_add_cfa_restore_note (insn, reg);
   m->fs.sp_offset -= UNITS_PER_WORD;
 
   if (m->fs.cfa_reg == crtl->drap_reg
@@ -10383,8 +10377,7 @@ ix86_emit_leave (void)
       add_reg_note (insn, REG_CFA_DEF_CFA,
 		    plus_constant (stack_pointer_rtx, m->fs.sp_offset));
       RTX_FRAME_RELATED_P (insn) = 1;
-      ix86_add_cfa_restore_note (insn, hard_frame_pointer_rtx,
-				 m->fs.fp_offset);
+      ix86_add_cfa_restore_note (insn, hard_frame_pointer_rtx);
     }
 }
 
@@ -10421,7 +10414,7 @@ ix86_emit_restore_regs_using_mov (HOST_W
 	    m->fs.drap_valid = true;
 	  }
 	else
-	  ix86_add_cfa_restore_note (NULL_RTX, reg, cfa_offset);
+	  ix86_add_cfa_restore_note (NULL_RTX, reg);
 
 	cfa_offset -= UNITS_PER_WORD;
       }
@@ -10446,7 +10439,7 @@ ix86_emit_restore_sse_regs_using_mov (HO
 	set_mem_align (mem, 128);
 	emit_move_insn (reg, mem);
 
-	ix86_add_cfa_restore_note (NULL_RTX, reg, cfa_offset);
+	ix86_add_cfa_restore_note (NULL_RTX, reg);
 
 	cfa_offset -= 16;
       }
@@ -10738,6 +10731,8 @@ ix86_expand_epilogue (int style)
 				 GEN_INT (m->fs.sp_offset - UNITS_PER_WORD),
 				 style, true);
     }
+  else
+    ix86_add_queued_cfa_restore_notes (get_last_insn ());
 
   /* Sibcall epilogues don't want a return instruction.  */
   if (style == 0)


* Re: Initial shrink-wrapping patch
  2011-10-05 16:04                             ` Bernd Schmidt
@ 2011-10-05 16:22                               ` Richard Henderson
  2011-10-05 17:21                                 ` Bernd Schmidt
  2011-10-06 18:07                               ` Bernd Schmidt
  1 sibling, 1 reply; 41+ messages in thread
From: Richard Henderson @ 2011-10-05 16:22 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Richard Guenther, GCC Patches, richard.sandiford

On 10/05/2011 08:59 AM, Bernd Schmidt wrote:
> Bootstrapping the following now. Ok? (Alternatively, could keep the
> redzone logic, but make it depend on !flag_shrink_wrap).

It's a good space-saving optimization, that redzone logic.  We
should be able to keep it based on crtl->shrink_wrapped; that
does appear to get set before we actually emit the epilogue.


r~


* Re: Initial shrink-wrapping patch
  2011-10-05 16:22                               ` Richard Henderson
@ 2011-10-05 17:21                                 ` Bernd Schmidt
  2011-10-05 23:35                                   ` Ian Lance Taylor
  0 siblings, 1 reply; 41+ messages in thread
From: Bernd Schmidt @ 2011-10-05 17:21 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Richard Guenther, GCC Patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 488 bytes --]

On 10/05/11 18:21, Richard Henderson wrote:
> On 10/05/2011 08:59 AM, Bernd Schmidt wrote:
>> Bootstrapping the following now. Ok? (Alternatively, could keep the
>> redzone logic, but make it depend on !flag_shrink_wrap).
> 
> It's a good space-saving optimization, that redzone logic.  We
> should be able to keep it based on crtl->shrink_wrapped; that
> does appear to get set before we actually emit the epilogue.

I've committed the following after an x86_64-linux bootstrap.


Bernd


[-- Attachment #2: redzone.diff --]
[-- Type: text/plain, Size: 1757 bytes --]

	PR bootstrap/50621
	* config/i386/i386.c (ix86_add_cfa_restore_note): Omit notes only
	if the function was not shrink-wrapped.
	(ix86_expand_epilogue): Ensure queued cfa_adjust notes are attached
	to an insn.
	* function.c (thread_prologue_and_epilogue_insns): Make sure the
	shrink_wrapped flag is set even if there is no dump file.

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 179553)
+++ gcc/config/i386/i386.c	(working copy)
@@ -9134,7 +9134,8 @@ static GTY(()) rtx queued_cfa_restores;
 static void
 ix86_add_cfa_restore_note (rtx insn, rtx reg, HOST_WIDE_INT cfa_offset)
 {
-  if (cfa_offset <= cfun->machine->fs.red_zone_offset)
+  if (!crtl->shrink_wrapped
+      && cfa_offset <= cfun->machine->fs.red_zone_offset)
     return;
 
   if (insn)
@@ -10738,6 +10739,8 @@ ix86_expand_epilogue (int style)
 				 GEN_INT (m->fs.sp_offset - UNITS_PER_WORD),
 				 style, true);
     }
+  else
+    ix86_add_queued_cfa_restore_notes (get_last_insn ());
 
   /* Sibcall epilogues don't want a return instruction.  */
   if (style == 0)
Index: gcc/function.c
===================================================================
--- gcc/function.c	(revision 179553)
+++ gcc/function.c	(working copy)
@@ -5741,10 +5741,11 @@ thread_prologue_and_epilogue_insns (void
 	  if (dump_file)
 	    fprintf (dump_file, "Shrink-wrapping aborted due to clobber.\n");
 	}
-      else if (dump_file && entry_edge != orig_entry_edge)
+      else if (entry_edge != orig_entry_edge)
 	{
 	  crtl->shrink_wrapped = true;
-	  fprintf (dump_file, "Performing shrink-wrapping.\n");
+	  if (dump_file)
+	    fprintf (dump_file, "Performing shrink-wrapping.\n");
 	}
 
     fail_shrinkwrap:


* Re: Initial shrink-wrapping patch
  2011-10-04 22:43                         ` Richard Henderson
  2011-10-05 15:22                           ` Richard Guenther
@ 2011-10-05 17:23                           ` Bernd Schmidt
  2011-10-05 18:16                             ` Richard Henderson
  1 sibling, 1 reply; 41+ messages in thread
From: Bernd Schmidt @ 2011-10-05 17:23 UTC (permalink / raw)
  To: Richard Henderson; +Cc: GCC Patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 344 bytes --]

On 10/05/11 00:29, Richard Henderson wrote:
> As a followup, I think this option needs to be disabled for profiling
> and profile_after_prologue.  Should be a mere matter of frobbing the
> options at startup.

The other code seems to test crtl->profile rather than an option flag,
so how's this? Bootstrapped along with the x86_64 fix.


Bernd

[-- Attachment #2: swprof.diff --]
[-- Type: text/plain, Size: 588 bytes --]

	* function.c (thread_prologue_and_epilogue_insns): Don't shrink-wrap
	if profiling after the prologue.

Index: gcc/function.c
===================================================================
--- gcc/function.c	(revision 179553)
+++ gcc/function.c	(working copy)
@@ -5571,6 +5571,7 @@ thread_prologue_and_epilogue_insns (void
       }
       
   if (flag_shrink_wrap && HAVE_simple_return
+      && (targetm.profile_before_prologue () || !crtl->profile)
       && nonempty_prologue && !crtl->calls_eh_return)
     {
       HARD_REG_SET prologue_clobbered, prologue_used, live_on_edge;


* Re: Initial shrink-wrapping patch
  2011-10-05 17:23                           ` Bernd Schmidt
@ 2011-10-05 18:16                             ` Richard Henderson
  0 siblings, 0 replies; 41+ messages in thread
From: Richard Henderson @ 2011-10-05 18:16 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches, richard.sandiford

On 10/05/2011 10:19 AM, Bernd Schmidt wrote:
> 	* function.c (thread_prologue_and_epilogue_insns): Don't shrink-wrap
> 	if profiling after the prologue.

Ok.


r~


* Re: Initial shrink-wrapping patch
  2011-10-05 17:21                                 ` Bernd Schmidt
@ 2011-10-05 23:35                                   ` Ian Lance Taylor
  2011-10-06  1:43                                     ` Bernd Schmidt
  0 siblings, 1 reply; 41+ messages in thread
From: Ian Lance Taylor @ 2011-10-05 23:35 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Richard Henderson, Richard Guenther, GCC Patches, richard.sandiford

On Wed, Oct 5, 2011 at 10:17 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
>
> I've committed the following after an x86_64-linux bootstrap.

This patch appears to have broken the Go bootstrap when compiling a C
file in libgo.  Here is a reduced test case:

static struct
{
  int n;
  unsigned int m;
  _Bool f;
} g;

_Bool
f (int s)
{
  unsigned int b, m;

  if (!g.f)
    return 0;
  b = 1 << s;
  while (1)
    {
      m = g.m;
      if (m & b)
        break;
      if (__sync_bool_compare_and_swap (&g.m, m, m|b))
        {
          if (m == 0)
            f2 (&g.n);
          break;
        }
    }
  return 1;
}

Compiling this file with -O2 -fsplit-stack causes an ICE:

foo.c:29:1: internal compiler error: in maybe_record_trace_start, at
dwarf2cfi.c:2243

Compiling the file with -O2 -fsplit-stack -fno-shrink-wrap works.

Ian


* Re: Initial shrink-wrapping patch
  2011-10-05 23:35                                   ` Ian Lance Taylor
@ 2011-10-06  1:43                                     ` Bernd Schmidt
  2011-10-06  6:41                                       ` Ian Lance Taylor
  2011-10-06 13:44                                       ` Bernd Schmidt
  0 siblings, 2 replies; 41+ messages in thread
From: Bernd Schmidt @ 2011-10-06  1:43 UTC (permalink / raw)
  To: Ian Lance Taylor
  Cc: Richard Henderson, Richard Guenther, GCC Patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 1299 bytes --]

On 10/06/11 01:04, Ian Lance Taylor wrote:
> On Wed, Oct 5, 2011 at 10:17 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
>>
>> I've committed the following after a x86_64-linux bootstrap.
> 
> This patch appears to have broken the Go bootstrap when compiling a C
> file in libgo.  Here is a reduced test case:
> 
> static struct
> {
>   int n;
>   unsigned int m;
>   _Bool f;
> } g;
> 
> _Bool
> f (int s)
> {
>   unsigned int b, m;
> 
>   if (!g.f)
>     return 0;
>   b = 1 << s;
>   while (1)
>     {
>       m = g.m;
>       if (m & b)
>         break;
>       if (__sync_bool_compare_and_swap (&g.m, m, m|b))
>         {
>           if (m == 0)
>             f2 (&g.n);
>           break;
>         }
>     }
>   return 1;
> }
> 
> Compiling this file with -O2 -fsplit-stack causes an ICE:
> 
> foo.c:29:1: internal compiler error: in maybe_record_trace_start, at
> dwarf2cfi.c:2243
> 
> Compiling the file with -O2 -fsplit-stack -fno-shrink-wrap works.

This appears to be because the split prologue contains a jump, which
means the find_many_sub_blocks call reorders the block numbers, and our
indices into bb_flags are off.
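
A contrived stand-alone illustration of that failure mode (plain C,
not GCC code): a property recorded under a block's index is silently
lost as soon as the numbering changes.

#include <stdio.h>

int
main (void)
{
  int needs_prologue[4] = { 0, 0, 0, 0 };
  int bb_index = 1;

  needs_prologue[bb_index] = 1; /* record a fact about "block 1" */
  bb_index = 2;                 /* renumbering moves the block */
  /* The lookup now consults the wrong slot and prints 0.  */
  printf ("%d\n", needs_prologue[bb_index]);
  return 0;
}

The patch below sidesteps this by remembering the basic_block pointers
themselves in a VEC, instead of rechecking a bitmap keyed by bb->index
after the numbers may have changed.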

I'm testing the following patch, but it won't finish today - feel free
to test and check it in, or to just disable shrink-wrapping with split
prologues.


Bernd

[-- Attachment #2: ucsr.diff --]
[-- Type: text/plain, Size: 2500 bytes --]

Index: gcc/function.c
===================================================================
--- gcc/function.c	(revision 179577)
+++ gcc/function.c	(working copy)
@@ -5455,7 +5455,7 @@ thread_prologue_and_epilogue_insns (void
   basic_block last_bb;
   bool last_bb_active;
 #ifdef HAVE_simple_return
-  bool unconverted_simple_returns = false;
+  VEC (basic_block, heap) *unconverted_simple_returns = NULL;
   basic_block simple_return_block_hot = NULL;
   basic_block simple_return_block_cold = NULL;
   bool nonempty_prologue;
@@ -5876,7 +5876,8 @@ thread_prologue_and_epilogue_insns (void
 		{
 #ifdef HAVE_simple_return
 		  if (simple_p)
-		    unconverted_simple_returns = true;
+		    VEC_safe_push (basic_block, heap,
+				   unconverted_simple_returns, bb);
 #endif
 		  continue;
 		}
@@ -5894,7 +5895,8 @@ thread_prologue_and_epilogue_insns (void
 	    {
 #ifdef HAVE_simple_return
 	      if (simple_p)
-		unconverted_simple_returns = true;
+		VEC_safe_push (basic_block, heap,
+			       unconverted_simple_returns, bb);
 #endif
 	      continue;
 	    }
@@ -6042,10 +6044,11 @@ epilogue_done:
      convert to conditional simple_returns, but couldn't for some
      reason, create a block to hold a simple_return insn and redirect
      those remaining edges.  */
-  if (unconverted_simple_returns)
+  if (!VEC_empty (basic_block, unconverted_simple_returns))
     {
-      edge_iterator ei2;
       basic_block exit_pred = EXIT_BLOCK_PTR->prev_bb;
+      basic_block src_bb;
+      int i;
 
       gcc_assert (entry_edge != orig_entry_edge);
 
@@ -6062,19 +6065,12 @@ epilogue_done:
 	    simple_return_block_cold = e->dest;
 	}
 
-    restart_scan:
-      for (ei2 = ei_start (last_bb->preds); (e = ei_safe_edge (ei2)); )
+      FOR_EACH_VEC_ELT (basic_block, unconverted_simple_returns, i, src_bb)
 	{
-	  basic_block bb = e->src;
+	  edge e = find_edge (src_bb, last_bb);
 	  basic_block *pdest_bb;
 
-	  if (bb == ENTRY_BLOCK_PTR
-	      || bitmap_bit_p (&bb_flags, bb->index))
-	    {
-	      ei_next (&ei2);
-	      continue;
-	    }
-	  if (BB_PARTITION (e->src) == BB_HOT_PARTITION)
+	  if (BB_PARTITION (src_bb) == BB_HOT_PARTITION)
 	    pdest_bb = &simple_return_block_hot;
 	  else
 	    pdest_bb = &simple_return_block_cold;
@@ -6094,8 +6090,8 @@ epilogue_done:
 	      make_edge (bb, EXIT_BLOCK_PTR, 0);
 	    }
 	  redirect_edge_and_branch_force (e, *pdest_bb);
-	  goto restart_scan;
 	}
+      VEC_free (basic_block, heap, unconverted_simple_returns);
     }
 #endif
 


* Re: Initial shrink-wrapping patch
  2011-10-06  1:43                                     ` Bernd Schmidt
@ 2011-10-06  6:41                                       ` Ian Lance Taylor
  2011-10-06 10:45                                         ` Bernd Schmidt
  2011-10-06 13:44                                       ` Bernd Schmidt
  1 sibling, 1 reply; 41+ messages in thread
From: Ian Lance Taylor @ 2011-10-06  6:41 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Richard Henderson, Richard Guenther, GCC Patches, richard.sandiford

Bernd Schmidt <bernds@codesourcery.com> writes:

> This appears to be because the split prologue contains a jump, which
> means the find_many_sub_blocks call reorders the block numbers, and our
> indices into bb_flags are off.
>
> I'm testing the following patch, but it won't finish today - feel free
> to test and check it in, or to just disable shrink-wrapping with split
> prologues.

Thinking about it, I think this is the wrong approach.  The -fsplit-stack
code by definition has to wrap the entire function and it cannot modify
any callee-saved registers.  We should do shrink wrapping before
-fsplit-stack, not the other way around.

Ian


* Re: Initial shrink-wrapping patch
  2011-10-06  6:41                                       ` Ian Lance Taylor
@ 2011-10-06 10:45                                         ` Bernd Schmidt
  2011-10-06 16:18                                           ` Ian Lance Taylor
  0 siblings, 1 reply; 41+ messages in thread
From: Bernd Schmidt @ 2011-10-06 10:45 UTC (permalink / raw)
  To: Ian Lance Taylor
  Cc: Richard Henderson, Richard Guenther, GCC Patches, richard.sandiford

On 10/06/11 05:17, Ian Lance Taylor wrote:
> Thinking about it I think this is the wrong approach.  The -fsplit-stack
> code by definition has to wrap the entire function and it can not modify
> any callee-saved registers.  We should do shrink wrapping before
> -fsplit-stack, not the other way around.

Sorry, I'm not following what you're saying here. Can you elaborate?


Bernd


* Re: Initial shrink-wrapping patch
  2011-10-06  1:43                                     ` Bernd Schmidt
  2011-10-06  6:41                                       ` Ian Lance Taylor
@ 2011-10-06 13:44                                       ` Bernd Schmidt
  2011-10-06 15:42                                         ` Richard Henderson
  1 sibling, 1 reply; 41+ messages in thread
From: Bernd Schmidt @ 2011-10-06 13:44 UTC (permalink / raw)
  To: Ian Lance Taylor
  Cc: Richard Henderson, Richard Guenther, GCC Patches, richard.sandiford

On 10/06/11 01:47, Bernd Schmidt wrote:
> This appears to be because the split prologue contains a jump, which
> means the find_many_sub_blocks call reorders the block numbers, and our
> indices into bb_flags are off.

Testing of the patch completed - ok? Regardless of split-stack, it seems
like a cleanup and eliminates a potential source of errors.


Bernd


* Re: Initial shrink-wrapping patch
  2011-10-06 13:44                                       ` Bernd Schmidt
@ 2011-10-06 15:42                                         ` Richard Henderson
  0 siblings, 0 replies; 41+ messages in thread
From: Richard Henderson @ 2011-10-06 15:42 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Ian Lance Taylor, Richard Guenther, GCC Patches, richard.sandiford

On 10/06/2011 06:37 AM, Bernd Schmidt wrote:
> On 10/06/11 01:47, Bernd Schmidt wrote:
>> This appears to be because the split prologue contains a jump, which
>> means the find_many_sub_blocks call reorders the block numbers, and our
>> indices into bb_flags are off.
> 
> Testing of the patch completed - ok? Regardless of split-stack it seems
> like a cleanup and eliminates a potential source of errors.

Yes, patch is ok.


r~


* Re: Initial shrink-wrapping patch
  2011-10-06 10:45                                         ` Bernd Schmidt
@ 2011-10-06 16:18                                           ` Ian Lance Taylor
  2011-10-06 16:19                                             ` Bernd Schmidt
  0 siblings, 1 reply; 41+ messages in thread
From: Ian Lance Taylor @ 2011-10-06 16:18 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Richard Henderson, Richard Guenther, GCC Patches, richard.sandiford

Bernd Schmidt <bernds@codesourcery.com> writes:

> On 10/06/11 05:17, Ian Lance Taylor wrote:
>> Thinking about it I think this is the wrong approach.  The -fsplit-stack
>> code by definition has to wrap the entire function and it can not modify
>> any callee-saved registers.  We should do shrink wrapping before
>> -fsplit-stack, not the other way around.
>
> Sorry, I'm not following what you're saying here. Can you elaborate?

Basically -fsplit-stack wraps the entire function in code that (on
x86_64) looks like

	cmpq	%fs:112, %rsp
	jae	.L2
	movl	$24, %r10d
	movl	$0, %r11d
	call	__morestack
	ret
.L2:

There is absolutely no reason to try to shrink wrap that code.  It will
never help.  That code always has to be first.  It especially has to be
first because the gold linker recognizes the prologue specially when a
split-stack function calls a non-split-stack function, in order to
request a larger stack.

Therefore, it seems to me that we should apply shrink wrapping to the
function as it exists *before* the split-stack prologue is created.  The
flag_split_stack bit should be moved after the flag_shrink_wrap bit.
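
For completeness, individual functions can already opt out of the
wrapper via the attribute the prologue code checks for.  A
hypothetical example:

__attribute__ ((no_split_stack))
void
no_check_needed (void)
{
  /* Compiled without the __morestack check sequence.  */
}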

Ian


* Re: Initial shrink-wrapping patch
  2011-10-06 16:18                                           ` Ian Lance Taylor
@ 2011-10-06 16:19                                             ` Bernd Schmidt
  2011-10-06 17:32                                               ` Richard Henderson
  2011-10-06 19:39                                               ` Ian Lance Taylor
  0 siblings, 2 replies; 41+ messages in thread
From: Bernd Schmidt @ 2011-10-06 16:19 UTC (permalink / raw)
  To: Ian Lance Taylor
  Cc: Richard Henderson, Richard Guenther, GCC Patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 726 bytes --]

On 10/06/11 17:57, Ian Lance Taylor wrote:
> There is absolutely no reason to try to shrink wrap that code.  It will
> never help.  That code always has to be first.  It especially has to be
> first because the gold linker recognizes the prologue specially when a
> split-stack function calls a non-split-stack function, in order to
> request a larger stack.

Urgh, ok.

> Therefore, it seems to me that we should apply shrink wrapping to the
> function as it exists *before* the split-stack prologue is created.  The
> flag_split_stack bit should be moved after the flag_shrink_wrap bit.

Sounds like we just need to always emit the split prologue on the
original entry edge then. Can you test the following with Go?


Bernd

[-- Attachment #2: split-seq.diff --]
[-- Type: text/plain, Size: 1098 bytes --]

	* function.c (thread_prologue_and_epilogue_insns): Emit split
	prologue on the orig_entry_edge. Don't account for it in
	prologue_clobbered.

Index: gcc/function.c
===================================================================
--- gcc/function.c	(revision 179619)
+++ gcc/function.c	(working copy)
@@ -5602,10 +5602,6 @@ thread_prologue_and_epilogue_insns (void
 	  note_stores (PATTERN (p_insn), record_hard_reg_sets,
 		       &prologue_clobbered);
 	}
-      for (p_insn = split_prologue_seq; p_insn; p_insn = NEXT_INSN (p_insn))
-	if (NONDEBUG_INSN_P (p_insn))
-	  note_stores (PATTERN (p_insn), record_hard_reg_sets,
-		       &prologue_clobbered);
 
       bitmap_initialize (&bb_antic_flags, &bitmap_default_obstack);
       bitmap_initialize (&bb_on_list, &bitmap_default_obstack);
@@ -5758,7 +5754,7 @@ thread_prologue_and_epilogue_insns (void
 
   if (split_prologue_seq != NULL_RTX)
     {
-      insert_insn_on_edge (split_prologue_seq, entry_edge);
+      insert_insn_on_edge (split_prologue_seq, orig_entry_edge);
       inserted = true;
     }
   if (prologue_seq != NULL_RTX)


* Re: Initial shrink-wrapping patch
  2011-10-06 16:19                                             ` Bernd Schmidt
@ 2011-10-06 17:32                                               ` Richard Henderson
  2011-10-06 19:39                                               ` Ian Lance Taylor
  1 sibling, 0 replies; 41+ messages in thread
From: Richard Henderson @ 2011-10-06 17:32 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Ian Lance Taylor, Richard Guenther, GCC Patches, richard.sandiford

On 10/06/2011 09:01 AM, Bernd Schmidt wrote:
> On 10/06/11 17:57, Ian Lance Taylor wrote:
>> There is absolutely no reason to try to shrink wrap that code.  It will
>> never help.  That code always has to be first.  It especially has to be
>> first because the gold linker recognizes the prologue specially when a
>> split-stack function calls a non-split-stack function, in order to
>> request a larger stack.
> 
> Urgh, ok.
> 
>> Therefore, it seems to me that we should apply shrink wrapping to the
>> function as it exists *before* the split-stack prologue is created.  The
>> flag_split_stack bit should be moved after the flag_shrink_wrap bit.
> 
> Sounds like we just need to always emit the split prologue on the
> original entry edge then. Can you test the following with Go?

Looks reasonable.

I wonder if we can have this as a generic feature?  I'm thinking about
things like the MIPS and Alpha load-gp stuff.  That operation also needs
to happen exactly at the start of the function, due to the pc-relative
nature of the operation.

I do see that MIPS works around this by emitting the load-gp as text
in the legacy prologue.  But Alpha makes some effort to emit this as
rtl, so that the scheduler knows about the two pipeline reservations
and the latency of any use of the gp register.

Would a "pre_prologue" named pattern seem wrong to anyone?


r~


* Re: Initial shrink-wrapping patch
  2011-10-05 16:04                             ` Bernd Schmidt
  2011-10-05 16:22                               ` Richard Henderson
@ 2011-10-06 18:07                               ` Bernd Schmidt
  2011-10-06 18:19                                 ` Richard Henderson
  1 sibling, 1 reply; 41+ messages in thread
From: Bernd Schmidt @ 2011-10-06 18:07 UTC (permalink / raw)
  To: Richard Henderson; +Cc: GCC Patches, richard.sandiford

HJ found some more maybe_record_trace_start failures. In one case I
debugged, we have

(insn/f 31 0 32 (parallel [
            (set (reg/f:DI 7 sp)
                (plus:DI (reg/f:DI 7 sp)
                    (const_int 8 [0x8])))
            (clobber (reg:CC 17 flags))
            (clobber (mem:BLK (scratch) [0 A8]))
        ]) -1
     (expr_list:REG_CFA_ADJUST_CFA (set (reg/f:DI 7 sp)
            (plus:DI (reg/f:DI 7 sp)
                (const_int 8 [0x8])))
        (nil)))

The insn pattern is later changed by csa to adjust by 24, and the note
is left untouched; that seems to be triggering the problem.

Richard, is there a reason to use REG_CFA_ADJUST_CFA rather than
REG_CFA_DEF_CFA? If no, I'll just try to fix i386.c not to emit the former.
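
Mechanically the swap would be something like this (sketch only, not
the actual i386.c change; insn stands for the epilogue's stack
adjustment and offset for the CFA offset after it):

  /* REG_CFA_DEF_CFA states the resulting CFA outright, instead of
     describing the adjustment performed by the insn's pattern.  */
  add_reg_note (insn, REG_CFA_DEF_CFA,
		plus_constant (stack_pointer_rtx, offset));
  RTX_FRAME_RELATED_P (insn) = 1;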


Bernd

* Re: Initial shrink-wrapping patch
  2011-10-06 18:07                               ` Bernd Schmidt
@ 2011-10-06 18:19                                 ` Richard Henderson
  2011-10-06 18:28                                   ` Bernd Schmidt
  0 siblings, 1 reply; 41+ messages in thread
From: Richard Henderson @ 2011-10-06 18:19 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches, richard.sandiford

On 10/06/2011 11:03 AM, Bernd Schmidt wrote:
> HJ found some more maybe_record_trace_start failures. In one case I
> debugged, we have
> 
> (insn/f 31 0 32 (parallel [
>             (set (reg/f:DI 7 sp)
>                 (plus:DI (reg/f:DI 7 sp)
>                     (const_int 8 [0x8])))
>             (clobber (reg:CC 17 flags))
>             (clobber (mem:BLK (scratch) [0 A8]))
>         ]) -1
>      (expr_list:REG_CFA_ADJUST_CFA (set (reg/f:DI 7 sp)
>             (plus:DI (reg/f:DI 7 sp)
>                 (const_int 8 [0x8])))
>         (nil)))
> 
> The insn pattern is later changed by csa to adjust by 24, and the note
> is left untouched; that seems to be triggering the problem.

Hmm.  That seems a bit odd, considering this function probably does not
use alloca (no frame pointer), and uses accumulated outgoing arguments
(x86_64 never uses no-accumulate-outgoing-args, afaik).

> Richard, is there a reason to use REG_CFA_ADJUST_CFA rather than
> REG_CFA_DEF_CFA? If no, I'll just try to fix i386.c not to emit the former.

Not that I can think of.  But if that change makes any difference at all,
that's almost certainly another bug.

What PR are you looking at here?


r~

* Re: Initial shrink-wrapping patch
  2011-10-06 18:19                                 ` Richard Henderson
@ 2011-10-06 18:28                                   ` Bernd Schmidt
  0 siblings, 0 replies; 41+ messages in thread
From: Bernd Schmidt @ 2011-10-06 18:28 UTC (permalink / raw)
  To: Richard Henderson; +Cc: GCC Patches, richard.sandiford

On 10/06/11 20:13, Richard Henderson wrote:
> What PR are you looking at here?

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50632

Testcase is gcc.dg/pr50132.c.


Bernd

* Re: Initial shrink-wrapping patch
  2011-10-04 22:19                       ` Bernd Schmidt
  2011-10-04 22:43                         ` Richard Henderson
@ 2011-10-06 18:47                         ` H.J. Lu
  2011-10-06 18:53                           ` Bernd Schmidt
  1 sibling, 1 reply; 41+ messages in thread
From: H.J. Lu @ 2011-10-06 18:47 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Richard Henderson, GCC Patches, richard.sandiford

On Tue, Oct 4, 2011 at 3:10 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 09/30/11 18:51, Richard Henderson wrote:
>
>> Please do leave out RETURN_ADDR_REGNUM for now.  If you remember why,
>> then you could bring it back alongside the patch for the ARM backend.
>
> Changed.
>
>> As for the i386 backend changes, not an objection per se, but I'm
>> trying to understand why we need so many copies of patterns.
>
> Also changed.
>
>> I don't see anything glaringly wrong in the middle end.  Although
>> the thread_prologue_and_epilogue_insns function is now gigantic.
>> If there were an easy way to break that up and reduce the amount
>> of conditional compilation at the same time... that'd be great,
>> but not a requirement.
>
> I don't think there's an easy way; and it's almost certain to break
> stuff again, so I'd rather avoid doing it at the same time as this patch
> if possible.
>
> I can see one possible way of tackling it; have an analysis phase that
> fills up a few basic_block VECs (those which need sibcall returns, those
> which need plain returns, those which need simple returns) and computes
> other information, such as the edges on which prologue and epilogue are
> to be inserted, and then a worker phase (probably split across several
> functions) which does all the fiddling.
>
> Richard S. suggested:
>> ...how about adding a bit to crtl to say whether shrink-wrap occured,
>> and check that instead of flag_shrink_wrap?
>
> Good idea, also changed.
>
> New version below.  Bootstrapped and tested i686-linux.
>
>

It also caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50633

Don't you need to update ix86_expand_prologue?


-- 
H.J.

* Re: Initial shrink-wrapping patch
  2011-10-06 18:47                         ` H.J. Lu
@ 2011-10-06 18:53                           ` Bernd Schmidt
  2011-10-06 19:25                             ` H.J. Lu
  0 siblings, 1 reply; 41+ messages in thread
From: Bernd Schmidt @ 2011-10-06 18:53 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Richard Henderson, GCC Patches, richard.sandiford

On 10/06/11 20:27, H.J. Lu wrote:
> It also caused:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50633
> 
> Don't you need to update ix86_expand_prologue?

In theory it should just work. It seems the x32 stuff has entertaining
properties :-( Haven't quite figured out how to build it yet, but:

-    subq    $136, %rsp
-    .cfi_def_cfa_offset 144
     movl    $0, %eax
     movl    %esp, %ecx
     addl    $60, %ecx
@@ -16,6 +14,8 @@ main:
     movl    %eax, (%edx)
     cmpl    $16, %eax
     jne    .L2
+    subq    $136, %rsp
+    .cfi_def_cfa_offset 144

So, this looks like we have both %esp and %rsp - i.e. we're not using
stack_pointer_rtx in all cases? Is there a way to avoid this?

BTW, one other thing that occurred to me - what about drap_reg? Does
that need to be added to the set of registers whose use requires a prologue?


Bernd

* Re: Initial shrink-wrapping patch
  2011-10-06 18:53                           ` Bernd Schmidt
@ 2011-10-06 19:25                             ` H.J. Lu
  0 siblings, 0 replies; 41+ messages in thread
From: H.J. Lu @ 2011-10-06 19:25 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Richard Henderson, GCC Patches, richard.sandiford

On Thu, Oct 6, 2011 at 11:40 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 10/06/11 20:27, H.J. Lu wrote:
>> It also caused:
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50633
>>
>> Don't you need to update ix86_expand_prologue?
>
> In theory it should just work. It seems the x32 stuff has entertaining
> properties :-( Haven't quite figured out how to build it yet, but:
>
> -    subq    $136, %rsp
> -    .cfi_def_cfa_offset 144
>     movl    $0, %eax
>     movl    %esp, %ecx
>     addl    $60, %ecx
> @@ -16,6 +14,8 @@ main:
>     movl    %eax, (%edx)
>     cmpl    $16, %eax
>     jne    .L2
> +    subq    $136, %rsp
> +    .cfi_def_cfa_offset 144
>
> So, this looks like we have both $esp and $rsp - i.e. not using
> stack_pointer_rtx in all cases? Is there a way to avoid this?

X32 has a 32-bit software stack pointer and a 64-bit hardware stack
pointer.
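
For illustration only (not from any patch), that means the one hard
register can show up in rtl in two modes:

  /* These are distinct rtxes, so identity comparison against
     stack_pointer_rtx (which has Pmode) misses the other-mode copy,
     while the hard register number matches both.  */
  rtx sp_ptr  = gen_rtx_REG (SImode, STACK_POINTER_REGNUM);	/* %esp */
  rtx sp_full = gen_rtx_REG (DImode, STACK_POINTER_REGNUM);	/* %rsp */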

> BTW, one other thing that occurred to me - what about drap_reg? Does
> that need to be added to the set of registers whose use requires a prologue?
>

It should be covered by SP.

-- 
H.J.

* Re: Initial shrink-wrapping patch
  2011-10-06 16:19                                             ` Bernd Schmidt
  2011-10-06 17:32                                               ` Richard Henderson
@ 2011-10-06 19:39                                               ` Ian Lance Taylor
  2011-10-06 22:15                                                 ` Bernd Schmidt
  1 sibling, 1 reply; 41+ messages in thread
From: Ian Lance Taylor @ 2011-10-06 19:39 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Richard Henderson, Richard Guenther, GCC Patches, richard.sandiford

Bernd Schmidt <bernds@codesourcery.com> writes:

> On 10/06/11 17:57, Ian Lance Taylor wrote:
>> There is absolutely no reason to try to shrink wrap that code.  It will
>> never help.  That code always has to be first.  It especially has to be
>> first because the gold linker recognizes the prologue specially when a
>> split-stack function calls a non-split-stack function, in order to
>> request a larger stack.
>
> Urgh, ok.
>
>> Therefore, it seems to me that we should apply shrink wrapping to the
>> function as it exists *before* the split-stack prologue is created.  The
>> flag_split_stack bit should be moved after the flag_shrink_wrap bit.
>
> Sounds like we just need to always emit the split prologue on the
> original entry edge then. Can you test the following with Go?

Unfortunately, that patch fails compiling libgo/runtime/go-map-len.c.
go-map-len.c is a C file, and actually quite a simple one.

../../../trunk/libgo/runtime/go-map-len.c: In function ‘__go_map_len’:
../../../trunk/libgo/runtime/go-map-len.c:23:1: error: in basic block 2:
../../../trunk/libgo/runtime/go-map-len.c:23:1: error: flow control insn inside a basic block
(jump_insn 38 37 39 2 (set (pc)
        (if_then_else (geu (reg:CC 17 flags)
                (const_int 0 [0]))
            (label_ref 43)
            (pc))) ../../../trunk/libgo/runtime/go-map-len.c:18 -1
     (expr_list:REG_DEAD (reg:CC 17 flags)
        (expr_list:REG_BR_PROB (const_int 9900 [0x26ac])
            (nil)))
 -> 43)
../../../trunk/libgo/runtime/go-map-len.c:23:1: internal compiler error: in rtl_verify_flow_info_1, at cfgrtl.c:2001
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.


Note that you can test Go yourself by adding --enable-languages=go to
your configure line.  Nothing special is required to build Go.  It works
better if you use the gold linker but that is not required.

Ian

* Re: Initial shrink-wrapping patch
  2011-10-06 19:39                                               ` Ian Lance Taylor
@ 2011-10-06 22:15                                                 ` Bernd Schmidt
  2011-10-06 22:28                                                   ` Richard Henderson
  0 siblings, 1 reply; 41+ messages in thread
From: Bernd Schmidt @ 2011-10-06 22:15 UTC (permalink / raw)
  To: Ian Lance Taylor
  Cc: Richard Henderson, Richard Guenther, GCC Patches, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 926 bytes --]

On 10/06/11 21:33, Ian Lance Taylor wrote:

> Note that you can test Go yourself by adding --enable-languages=go to
> your configure line.  Nothing special is required to build Go.  It works
> better if you use the gold linker but that is not required.

Oh, ok. For some reason I thought I remembered that it didn't build on
my box.

Here's the set of fixes I have now. find_many_sub_basic_blocks needs to be run
on the destination of the orig_entry_edge, and it also appears to munge
basic block pointers, so unconverted_simple_returns now stores jump
insns. For HJ's problem I gave in and used register numbers rather than
comparing with stack_pointer_rtx etc., and converted some more code to
use df as originally suggested by Richard S.
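
Isolated from the full diff below, the register-number check amounts
to a sketch like this, with reg an rtx mentioned by an insn and
set_up_by_prologue the HARD_REG_SET the patch builds:

  /* Membership by hard register number catches %esp and %rsp alike,
     where identity comparison with stack_pointer_rtx did not.  */
  if (REG_P (reg)
      && overlaps_hard_reg_set_p (set_up_by_prologue,
				  GET_MODE (reg), REGNO (reg)))
    return true;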

Testing now with rth's csa patch also applied, both i686-linux and
x86_64-linux. Bootstraps complete, all languages including Go for the
i686-linux one. Ok if testing passes?


Bernd


[-- Attachment #2: swfixes1006.diff --]
[-- Type: text/plain, Size: 6352 bytes --]

	* function.c (frame_required_for_rtx): Remove function.
	(requires_stack_frame_p): New arg set_up_by_prologue.  All callers
	changed.  Compute a set of mentioned registers and compare against
	the new arg rather than calling frame_required_for_rtx.
	(thread_prologue_and_epilogue_insns): Compute the set_up_by_prologue
	reg set.  Convert the unconverted_simple_returns mechanism to store
	jump insns rather than their basic blocks.  Also check the
	orig_entry_edge destination for new blocks.

Index: function.c
===================================================================
--- function.c	(revision 179627)
+++ function.c	(working copy)
@@ -5303,25 +5303,15 @@ record_hard_reg_uses (rtx *px, void *dat
   for_each_rtx (px, record_hard_reg_uses_1, data);
 }
 
-/* A subroutine of requires_stack_frame_p, called via for_each_rtx.
-   Return 1 if we found an rtx that forces a prologue, zero otherwise.  */
-static int
-frame_required_for_rtx (rtx *loc, void *data ATTRIBUTE_UNUSED)
-{
-  rtx x = *loc;
-  if (x == stack_pointer_rtx || x == hard_frame_pointer_rtx
-      || x == arg_pointer_rtx || x == pic_offset_table_rtx)
-    return 1;
-  return 0;
-}
-
 /* Return true if INSN requires the stack frame to be set up.
    PROLOGUE_USED contains the hard registers used in the function
-   prologue.  */
+   prologue.  SET_UP_BY_PROLOGUE is the set of registers we expect the
+   prologue to set up for the function.  */
 static bool
-requires_stack_frame_p (rtx insn, HARD_REG_SET prologue_used)
+requires_stack_frame_p (rtx insn, HARD_REG_SET prologue_used,
+			HARD_REG_SET set_up_by_prologue)
 {
-  df_ref *def_rec;
+  df_ref *df_rec;
   HARD_REG_SET hardregs;
   unsigned regno;
 
@@ -5329,13 +5319,11 @@ requires_stack_frame_p (rtx insn, HARD_R
     return false;
   if (CALL_P (insn))
     return !SIBLING_CALL_P (insn);
-  if (for_each_rtx (&PATTERN (insn), frame_required_for_rtx, NULL))
-    return true;
 
   CLEAR_HARD_REG_SET (hardregs);
-  for (def_rec = DF_INSN_DEFS (insn); *def_rec; def_rec++)
+  for (df_rec = DF_INSN_DEFS (insn); *df_rec; df_rec++)
     {
-      rtx dreg = DF_REF_REG (*def_rec);
+      rtx dreg = DF_REF_REG (*df_rec);
 
       if (!REG_P (dreg))
 	continue;
@@ -5350,6 +5338,20 @@ requires_stack_frame_p (rtx insn, HARD_R
     if (TEST_HARD_REG_BIT (hardregs, regno)
 	&& df_regs_ever_live_p (regno))
       return true;
+
+  for (df_rec = DF_INSN_USES (insn); *df_rec; df_rec++)
+    {
+      rtx reg = DF_REF_REG (*df_rec);
+
+      if (!REG_P (reg))
+	continue;
+
+      add_to_hard_reg_set (&hardregs, GET_MODE (reg),
+			   REGNO (reg));
+    }
+  if (hard_reg_set_intersect_p (hardregs, set_up_by_prologue))
+    return true;
+
   return false;
 }
 #endif
@@ -5455,7 +5457,7 @@ thread_prologue_and_epilogue_insns (void
   basic_block last_bb;
   bool last_bb_active ATTRIBUTE_UNUSED;
 #ifdef HAVE_simple_return
-  VEC (basic_block, heap) *unconverted_simple_returns = NULL;
+  VEC (rtx, heap) *unconverted_simple_returns = NULL;
   basic_block simple_return_block_hot = NULL;
   basic_block simple_return_block_cold = NULL;
   bool nonempty_prologue;
@@ -5575,6 +5577,7 @@ thread_prologue_and_epilogue_insns (void
       && nonempty_prologue && !crtl->calls_eh_return)
     {
       HARD_REG_SET prologue_clobbered, prologue_used, live_on_edge;
+      HARD_REG_SET set_up_by_prologue;
       rtx p_insn;
 
       VEC(basic_block, heap) *vec;
@@ -5610,6 +5613,16 @@ thread_prologue_and_epilogue_insns (void
 
       vec = VEC_alloc (basic_block, heap, n_basic_blocks);
 
+      CLEAR_HARD_REG_SET (set_up_by_prologue);
+      add_to_hard_reg_set (&set_up_by_prologue, Pmode, STACK_POINTER_REGNUM);
+      add_to_hard_reg_set (&set_up_by_prologue, Pmode, ARG_POINTER_REGNUM);
+      if (frame_pointer_needed)
+	add_to_hard_reg_set (&set_up_by_prologue, Pmode,
+			     HARD_FRAME_POINTER_REGNUM);
+      if (pic_offset_table_rtx)
+	add_to_hard_reg_set (&set_up_by_prologue, Pmode,
+			     PIC_OFFSET_TABLE_REGNUM);
+
       FOR_EACH_BB (bb)
 	{
 	  rtx insn;
@@ -5628,7 +5641,8 @@ thread_prologue_and_epilogue_insns (void
 	    }
 	  else
 	    FOR_BB_INSNS (bb, insn)
-	      if (requires_stack_frame_p (insn, prologue_used))
+	      if (requires_stack_frame_p (insn, prologue_used,
+					  set_up_by_prologue))
 		{
 		  bitmap_set_bit (&bb_flags, bb->index);
 		  VEC_quick_push (basic_block, vec, bb);
@@ -5872,8 +5886,8 @@ thread_prologue_and_epilogue_insns (void
 		{
 #ifdef HAVE_simple_return
 		  if (simple_p)
-		    VEC_safe_push (basic_block, heap,
-				   unconverted_simple_returns, bb);
+		    VEC_safe_push (rtx, heap,
+				   unconverted_simple_returns, jump);
 #endif
 		  continue;
 		}
@@ -5891,8 +5905,8 @@ thread_prologue_and_epilogue_insns (void
 	    {
 #ifdef HAVE_simple_return
 	      if (simple_p)
-		VEC_safe_push (basic_block, heap,
-			       unconverted_simple_returns, bb);
+		VEC_safe_push (rtx, heap,
+			       unconverted_simple_returns, jump);
 #endif
 	      continue;
 	    }
@@ -6022,6 +6036,7 @@ epilogue_done:
       blocks = sbitmap_alloc (last_basic_block);
       sbitmap_zero (blocks);
       SET_BIT (blocks, entry_edge->dest->index);
+      SET_BIT (blocks, orig_entry_edge->dest->index);
       find_many_sub_basic_blocks (blocks);
       sbitmap_free (blocks);
 
@@ -6040,10 +6055,10 @@ epilogue_done:
      convert to conditional simple_returns, but couldn't for some
      reason, create a block to hold a simple_return insn and redirect
      those remaining edges.  */
-  if (!VEC_empty (basic_block, unconverted_simple_returns))
+  if (!VEC_empty (rtx, unconverted_simple_returns))
     {
       basic_block exit_pred = EXIT_BLOCK_PTR->prev_bb;
-      basic_block src_bb;
+      rtx jump;
       int i;
 
       gcc_assert (entry_edge != orig_entry_edge);
@@ -6061,8 +6076,9 @@ epilogue_done:
 	    simple_return_block_cold = e->dest;
 	}
 
-      FOR_EACH_VEC_ELT (basic_block, unconverted_simple_returns, i, src_bb)
+      FOR_EACH_VEC_ELT (rtx, unconverted_simple_returns, i, jump)
 	{
+	  basic_block src_bb = BLOCK_FOR_INSN (jump);
 	  edge e = find_edge (src_bb, last_bb);
 	  basic_block *pdest_bb;
 
@@ -6087,7 +6103,7 @@ epilogue_done:
 	    }
 	  redirect_edge_and_branch_force (e, *pdest_bb);
 	}
-      VEC_free (basic_block, heap, unconverted_simple_returns);
+      VEC_free (rtx, heap, unconverted_simple_returns);
     }
 #endif
 

* Re: Initial shrink-wrapping patch
  2011-10-06 22:15                                                 ` Bernd Schmidt
@ 2011-10-06 22:28                                                   ` Richard Henderson
  0 siblings, 0 replies; 41+ messages in thread
From: Richard Henderson @ 2011-10-06 22:28 UTC (permalink / raw)
  To: Bernd Schmidt
  Cc: Ian Lance Taylor, Richard Guenther, GCC Patches, richard.sandiford

On 10/06/2011 02:39 PM, Bernd Schmidt wrote:
> 	* function.c (frame_required_for_rtx): Remove function.
> 	(requires_stack_frame_p): New arg set_up_by_prologue.  All callers
> 	changed.  Compute a set of mentioned registers and compare against
> 	the new arg rather than calling frame_required_for_rtx.
> 	(thread_prologue_and_epilogue_insns): Compute the set_up_by_prologue
> 	reg set.  Convert the unconverted_simple_returns mechanism to store
> 	jump insns rather than their basic blocks.  Also check the
> 	orig_entry_edge destination for new blocks.

Ok.


r~

Thread overview: 41+ messages
2011-08-31 20:37 Initial shrink-wrapping patch Bernd Schmidt
2011-09-01 13:57 ` Richard Sandiford
2011-09-13  2:57   ` Bernd Schmidt
2011-09-13 11:03     ` Richard Sandiford
2011-09-13 12:49       ` Bernd Schmidt
2011-09-13 13:06         ` Richard Sandiford
2011-09-13 13:24           ` Bernd Schmidt
2011-09-13 13:42             ` Richard Sandiford
2011-09-13 17:22               ` Bernd Schmidt
2011-09-15  1:48                 ` Richard Henderson
2011-09-27 23:07                   ` Bernd Schmidt
2011-09-30 18:19                     ` Richard Henderson
2011-10-04 22:19                       ` Bernd Schmidt
2011-10-04 22:43                         ` Richard Henderson
2011-10-05 15:22                           ` Richard Guenther
2011-10-05 16:04                             ` Bernd Schmidt
2011-10-05 16:22                               ` Richard Henderson
2011-10-05 17:21                                 ` Bernd Schmidt
2011-10-05 23:35                                   ` Ian Lance Taylor
2011-10-06  1:43                                     ` Bernd Schmidt
2011-10-06  6:41                                       ` Ian Lance Taylor
2011-10-06 10:45                                         ` Bernd Schmidt
2011-10-06 16:18                                           ` Ian Lance Taylor
2011-10-06 16:19                                             ` Bernd Schmidt
2011-10-06 17:32                                               ` Richard Henderson
2011-10-06 19:39                                               ` Ian Lance Taylor
2011-10-06 22:15                                                 ` Bernd Schmidt
2011-10-06 22:28                                                   ` Richard Henderson
2011-10-06 13:44                                       ` Bernd Schmidt
2011-10-06 15:42                                         ` Richard Henderson
2011-10-06 18:07                               ` Bernd Schmidt
2011-10-06 18:19                                 ` Richard Henderson
2011-10-06 18:28                                   ` Bernd Schmidt
2011-10-05 17:23                           ` Bernd Schmidt
2011-10-05 18:16                             ` Richard Henderson
2011-10-06 18:47                         ` H.J. Lu
2011-10-06 18:53                           ` Bernd Schmidt
2011-10-06 19:25                             ` H.J. Lu
2011-10-03  8:30                     ` Richard Sandiford
2011-10-03 11:29                       ` Basile Starynkevitch
2011-10-03 16:39                         ` Bernd Schmidt
