public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH 2/5] dce: Don't dead-code delete separately wrapped restores
  2016-09-23  8:22 [PATCH v3 0/5] Separate shrink-wrapping Segher Boessenkool
@ 2016-09-23  8:22 ` Segher Boessenkool
  2016-09-26 16:55   ` Jeff Law
  2016-09-23  8:22 ` [PATCH 1/5] separate shrink-wrap: New command-line flag, status flag, hooks, and doc Segher Boessenkool
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Segher Boessenkool @ 2016-09-23  8:22 UTC (permalink / raw)
  To: gcc-patches; +Cc: law, dje.gcc, Segher Boessenkool

If there is a separately wrapped register restore on some path that
is dead (say, control goes into an endless loop after it), then we
cannot delete that restore because that would confuse the DWARF CFI
(if there is another path joining after this).
This happens with gcc.dg/torture/pr53168.c, for example.


2016-09-23  Segher Boessenkool  <segher@kernel.crashing.org>

	* dce.c (delete_unmarked_insns): Don't delete instructions with
	a REG_CFA_RESTORE note.

---
 gcc/dce.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/gcc/dce.c b/gcc/dce.c
index ea3fb00..d510287 100644
--- a/gcc/dce.c
+++ b/gcc/dce.c
@@ -587,6 +587,15 @@ delete_unmarked_insns (void)
 	  if (!dbg_cnt (dce))
 	    continue;
 
+	  if (crtl->shrink_wrapped_separate
+	      && find_reg_note (insn, REG_CFA_RESTORE, NULL))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, "DCE: NOT deleting insn %d, it's a "
+				    "callee-save restore\n", INSN_UID (insn));
+	      continue;
+	    }
+
 	  if (dump_file)
 	    fprintf (dump_file, "DCE: Deleting insn %d\n", INSN_UID (insn));
 
-- 
1.9.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 0/5] Separate shrink-wrapping
@ 2016-09-23  8:22 Segher Boessenkool
  2016-09-23  8:22 ` [PATCH 2/5] dce: Don't dead-code delete separately wrapped restores Segher Boessenkool
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Segher Boessenkool @ 2016-09-23  8:22 UTC (permalink / raw)
  To: gcc-patches; +Cc: law, dje.gcc, Segher Boessenkool

A new version of this patch set.

This now marks every block without any successors as needing all
components.  This prevents most previous problem situations from
ever happening.

The emit hooks now put explicit CFI notes on the instructions.  The
heuristics dwarf2cfi uses do not work so well for separately-wrapped
prologues.  With this change the cprop patch is no longer needed (and
we get better code in some cases).

I have improved the documentation, and many of the other suggested
changes.

Still no testcases, and I need to do new benchmark runs.


Segher


Segher Boessenkool (5):
  separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  dce: Don't dead-code delete separately wrapped restores
  regrename: Don't rename restores
  shrink-wrap: Shrink-wrapping for separate components
  rs6000: Separate shrink-wrapping

 gcc/common.opt             |   4 +
 gcc/config/rs6000/rs6000.c | 269 ++++++++++++++++-
 gcc/dce.c                  |   9 +
 gcc/doc/invoke.texi        |  11 +-
 gcc/doc/tm.texi            |  63 ++++
 gcc/doc/tm.texi.in         |  38 +++
 gcc/emit-rtl.h             |   4 +
 gcc/function.c             |   9 +-
 gcc/regrename.c            |   6 +
 gcc/shrink-wrap.c          | 729 +++++++++++++++++++++++++++++++++++++++++++++
 gcc/shrink-wrap.h          |   1 +
 gcc/target.def             |  57 ++++
 12 files changed, 1181 insertions(+), 19 deletions(-)

-- 
1.9.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/5] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-09-23  8:22 [PATCH v3 0/5] Separate shrink-wrapping Segher Boessenkool
  2016-09-23  8:22 ` [PATCH 2/5] dce: Don't dead-code delete separately wrapped restores Segher Boessenkool
@ 2016-09-23  8:22 ` Segher Boessenkool
  2016-09-26 17:02   ` Jeff Law
  2016-09-23  8:23 ` [PATCH 3/5] regrename: Don't rename restores Segher Boessenkool
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Segher Boessenkool @ 2016-09-23  8:22 UTC (permalink / raw)
  To: gcc-patches; +Cc: law, dje.gcc, Segher Boessenkool

This patch adds a new command-line flag "-fshrink-wrap-separate", a status
flag "shrink_wrapped_separate", hooks for abstracting the target components,
and documentation for all those.


2016-09-23  Segher Boessenkool  <segher@kernel.crashing.org>

	* common.opt (-fshrink-wrap-separate): New flag.
	* doc/invoke.texi: Document it.
	* doc/tm.texi.in (Shrink-wrapping separate components): New subsection.
	* doc/tm.texi: Regenerate.
	* emit-rtl.h (struct rtl_data): New field shrink_wrapped_separate.
	* target.def (shrink_wrap): New hook vector.
	(get_separate_components, components_for_bb, disqualify_components,
	emit_prologue_components, emit_epilogue_components,
	set_handled_components): New hooks.

---
 gcc/common.opt      |  4 ++++
 gcc/doc/invoke.texi | 11 +++++++++-
 gcc/doc/tm.texi     | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/doc/tm.texi.in  | 38 ++++++++++++++++++++++++++++++++
 gcc/emit-rtl.h      |  4 ++++
 gcc/target.def      | 57 ++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 176 insertions(+), 1 deletion(-)

diff --git a/gcc/common.opt b/gcc/common.opt
index fa1c036..1a38d12 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2189,6 +2189,10 @@ Common Report Var(flag_shrink_wrap) Optimization
 Emit function prologues only before parts of the function that need it,
 rather than at the top of the function.
 
+fshrink-wrap-separate
+Common Report Var(flag_shrink_wrap_separate) Init(1) Optimization
+Shrink-wrap parts of the prologue and epilogue separately.
+
 fsignaling-nans
 Common Report Var(flag_signaling_nans) Optimization SetByCombined
 Disable optimizations observable by IEEE signaling NaNs.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 8eb5eff..464b737 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -397,7 +397,8 @@ Objective-C and Objective-C++ Dialects}.
 -fschedule-insns -fschedule-insns2 -fsection-anchors @gol
 -fselective-scheduling -fselective-scheduling2 @gol
 -fsel-sched-pipelining -fsel-sched-pipelining-outer-loops @gol
--fsemantic-interposition -fshrink-wrap -fsignaling-nans @gol
+-fsemantic-interposition -fshrink-wrap -fshrink-wrap-separate @gol
+-fsignaling-nans @gol
 -fsingle-precision-constant -fsplit-ivs-in-unroller @gol
 -fsplit-paths @gol
 -fsplit-wide-types -fssa-backprop -fssa-phiopt @gol
@@ -6396,6 +6397,7 @@ compilation time.
 -fmove-loop-invariants @gol
 -freorder-blocks @gol
 -fshrink-wrap @gol
+-fshrink-wrap-separate @gol
 -fsplit-wide-types @gol
 -fssa-backprop @gol
 -fssa-phiopt @gol
@@ -7306,6 +7308,13 @@ Emit function prologues only before parts of the function that need it,
 rather than at the top of the function.  This flag is enabled by default at
 @option{-O} and higher.
 
+@item -fshrink-wrap-separate
+@opindex fshrink-wrap-separate
+Shrink-wrap separate parts of the prologue and epilogue separately, so that
+those parts are only executed when needed.
+This option is on by default, but has no effect unless @option{-fshrink-wrap}
+is also turned on.
+
 @item -fcaller-saves
 @opindex fcaller-saves
 Enable allocation of values to registers that are clobbered by
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index ced6274..c439695 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -2924,6 +2924,7 @@ This describes the stack layout and calling conventions.
 * Function Entry::
 * Profiling::
 * Tail Calls::
+* Shrink-wrapping separate components::
 * Stack Smashing Protection::
 * Miscellaneous Register Hooks::
 @end menu
@@ -4853,6 +4854,68 @@ This hook should add additional registers that are computed by the prologue to t
 True if a function's return statements should be checked for matching the function's return type.  This includes checking for falling off the end of a non-void function.  Return false if no such check should be made.
 @end deftypefn
 
+@node Shrink-wrapping separate components
+@subsection Shrink-wrapping separate components
+@cindex shrink-wrapping separate components
+
+The prologue may perform a variety of target dependent tasks such as
+saving callee-saved registers, saving the return address, aligning the
+stack, creating a stack frame, initializing the PIC register, setting
+up the static chain, etc.
+
+On some targets some of these tasks may be independent of others and
+thus may be shrink-wrapped separately.  These independent tasks are
+referred to as components and are handled generically by the target
+independent parts of GCC.
+
+Using the following hooks those prologue or epilogue components can be
+shrink-wrapped separately, so that the initialization (and possibly
+teardown) those components do is not done as frequently on execution
+paths where this would unnecessary.
+
+What exactly those components are is up to the target code; the generic
+code treats them abstractly, as a bit in an @code{sbitmap}.  These
+@code{sbitmap}s are allocated by the @code{shrink_wrap.get_separate_components}
+and @code{shrink_wrap.components_for_bb} hooks, and deallocated by the
+generic code.
+
+@deftypefn {Target Hook} sbitmap TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS (void)
+This hook should return an @code{sbitmap} with the bits set for those
+components that can be separately shrink-wrapped in the current function.
+Return @code{NULL} if the current function should not get any separate
+shrink-wrapping.
+Don't define this hook if it would always return @code{NULL}.
+If it is defined, the other hooks in this group have to be defined as well.
+@end deftypefn
+
+@deftypefn {Target Hook} sbitmap TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB (basic_block)
+This hook should return an @code{sbitmap} with the bits set for those
+components where either the prologue component has to be executed before
+the @code{basic_block}, or the epilogue component after it, or both.
+@end deftypefn
+
+@deftypefn {Target Hook} void TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS (sbitmap @var{components}, edge @var{e}, sbitmap @var{edge_components}, bool @var{is_prologue})
+This hook should clear the bits in the @var{components} bitmap for those
+components in @var{edge_components} that the target cannot handle on edge
+@var{e}, where @var{is_prologue} says if this is for a prologue or an
+epilogue instead.
+@end deftypefn
+
+@deftypefn {Target Hook} void TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS (sbitmap)
+Emit prologue insns for the components indicated by the parameter.
+@end deftypefn
+
+@deftypefn {Target Hook} void TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS (sbitmap)
+Emit epilogue insns for the components indicated by the parameter.
+@end deftypefn
+
+@deftypefn {Target Hook} void TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS (sbitmap)
+Mark the components in the parameter as handled, so that the
+@code{prologue} and @code{epilogue} named patterns know to ignore those
+components.  The target code should not hang on to the @code{sbitmap}, it
+will be deleted after this call.
+@end deftypefn
+
 @node Stack Smashing Protection
 @subsection Stack smashing protection
 @cindex stack smashing protection
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index a5714d1..84164bf 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -2530,6 +2530,7 @@ This describes the stack layout and calling conventions.
 * Function Entry::
 * Profiling::
 * Tail Calls::
+* Shrink-wrapping separate components::
 * Stack Smashing Protection::
 * Miscellaneous Register Hooks::
 @end menu
@@ -3775,6 +3776,43 @@ the function prologue.  Normally, the profiling code comes after.
 
 @hook TARGET_WARN_FUNC_RETURN
 
+@node Shrink-wrapping separate components
+@subsection Shrink-wrapping separate components
+@cindex shrink-wrapping separate components
+
+The prologue may perform a variety of target dependent tasks such as
+saving callee-saved registers, saving the return address, aligning the
+stack, creating a stack frame, initializing the PIC register, setting
+up the static chain, etc.
+
+On some targets some of these tasks may be independent of others and
+thus may be shrink-wrapped separately.  These independent tasks are
+referred to as components and are handled generically by the target
+independent parts of GCC.
+
+Using the following hooks those prologue or epilogue components can be
+shrink-wrapped separately, so that the initialization (and possibly
+teardown) those components do is not done as frequently on execution
+paths where this would unnecessary.
+
+What exactly those components are is up to the target code; the generic
+code treats them abstractly, as a bit in an @code{sbitmap}.  These
+@code{sbitmap}s are allocated by the @code{shrink_wrap.get_separate_components}
+and @code{shrink_wrap.components_for_bb} hooks, and deallocated by the
+generic code.
+
+@hook TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
+
+@hook TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
+
+@hook TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
+
+@hook TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
+
+@hook TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
+
+@hook TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
+
 @node Stack Smashing Protection
 @subsection Stack smashing protection
 @cindex stack smashing protection
diff --git a/gcc/emit-rtl.h b/gcc/emit-rtl.h
index 52c72b1..0a242b1 100644
--- a/gcc/emit-rtl.h
+++ b/gcc/emit-rtl.h
@@ -254,6 +254,10 @@ struct GTY(()) rtl_data {
   /* True if we performed shrink-wrapping for the current function.  */
   bool shrink_wrapped;
 
+  /* True if we performed shrink-wrapping for separate components for
+     the current function.  */
+  bool shrink_wrapped_separate;
+
   /* Nonzero if function being compiled doesn't modify the stack pointer
      (ignoring the prologue and epilogue).  This is only valid after
      pass_stack_ptr_mod has run.  */
diff --git a/gcc/target.def b/gcc/target.def
index 0e10d6f..c357d65 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -5792,6 +5792,63 @@ DEFHOOK
  bool, (tree),
  hook_bool_tree_true)
 
+#undef HOOK_PREFIX
+#define HOOK_PREFIX "TARGET_SHRINK_WRAP_"
+HOOK_VECTOR (TARGET_SHRINK_WRAP_HOOKS, shrink_wrap)
+
+DEFHOOK
+(get_separate_components,
+ "This hook should return an @code{sbitmap} with the bits set for those\n\
+components that can be separately shrink-wrapped in the current function.\n\
+Return @code{NULL} if the current function should not get any separate\n\
+shrink-wrapping.\n\
+Don't define this hook if it would always return @code{NULL}.\n\
+If it is defined, the other hooks in this group have to be defined as well.",
+ sbitmap, (void),
+ NULL)
+
+DEFHOOK
+(components_for_bb,
+ "This hook should return an @code{sbitmap} with the bits set for those\n\
+components where either the prologue component has to be executed before\n\
+the @code{basic_block}, or the epilogue component after it, or both.",
+ sbitmap, (basic_block),
+ NULL)
+
+DEFHOOK
+(disqualify_components,
+ "This hook should clear the bits in the @var{components} bitmap for those\n\
+components in @var{edge_components} that the target cannot handle on edge\n\
+@var{e}, where @var{is_prologue} says if this is for a prologue or an\n\
+epilogue instead.",
+ void, (sbitmap components, edge e, sbitmap edge_components, bool is_prologue),
+ NULL)
+
+DEFHOOK
+(emit_prologue_components,
+ "Emit prologue insns for the components indicated by the parameter.",
+ void, (sbitmap),
+ NULL)
+
+DEFHOOK
+(emit_epilogue_components,
+ "Emit epilogue insns for the components indicated by the parameter.",
+ void, (sbitmap),
+ NULL)
+
+DEFHOOK
+(set_handled_components,
+ "Mark the components in the parameter as handled, so that the\n\
+@code{prologue} and @code{epilogue} named patterns know to ignore those\n\
+components.  The target code should not hang on to the @code{sbitmap}, it\n\
+will be deleted after this call.",
+ void, (sbitmap),
+ NULL)
+
+HOOK_VECTOR_END (shrink_wrap)
+#undef HOOK_PREFIX
+#define HOOK_PREFIX "TARGET_"
+
 /* Determine the type of unwind info to emit for debugging.  */
 DEFHOOK
 (debug_unwind_info,
-- 
1.9.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 3/5] regrename: Don't rename restores
  2016-09-23  8:22 [PATCH v3 0/5] Separate shrink-wrapping Segher Boessenkool
  2016-09-23  8:22 ` [PATCH 2/5] dce: Don't dead-code delete separately wrapped restores Segher Boessenkool
  2016-09-23  8:22 ` [PATCH 1/5] separate shrink-wrap: New command-line flag, status flag, hooks, and doc Segher Boessenkool
@ 2016-09-23  8:23 ` Segher Boessenkool
  2016-09-26 16:44   ` Jeff Law
  2016-09-23  8:33 ` [PATCH 4/5] shrink-wrap: Shrink-wrapping for separate components Segher Boessenkool
  2016-09-23  8:44 ` [PATCH 5/5] rs6000: Separate shrink-wrapping Segher Boessenkool
  4 siblings, 1 reply; 18+ messages in thread
From: Segher Boessenkool @ 2016-09-23  8:23 UTC (permalink / raw)
  To: gcc-patches; +Cc: law, dje.gcc, Segher Boessenkool

A restore is supposed to restore some certain register.  Restoring it
into some other register will not work.  Don't.


2016-09-23  Segher Boessenkool  <segher@kernel.crashing.org>

	* regrename.c (build_def_use): Invalidate chains that have a
	REG_CFA_RESTORE on some instruction.

---
 gcc/regrename.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/gcc/regrename.c b/gcc/regrename.c
index 54c7768..00a5d02 100644
--- a/gcc/regrename.c
+++ b/gcc/regrename.c
@@ -1867,6 +1867,12 @@ build_def_use (basic_block bb)
 		scan_rtx (insn, &XEXP (note, 0), NO_REGS, terminate_dead,
 			  OP_IN);
 	      }
+
+	  /* Step 8: Kill the chains involving register restores.  Those
+	     should restore _that_ register.  */
+	  for (note = REG_NOTES (insn); note; note = XEXP (note, 1))
+	    if (REG_NOTE_KIND (note) == REG_CFA_RESTORE)
+	      scan_rtx (insn, &XEXP (note, 0), NO_REGS, mark_all_read, OP_IN);
 	}
       else if (DEBUG_INSN_P (insn)
 	       && !VAR_LOC_UNKNOWN_P (INSN_VAR_LOCATION_LOC (insn)))
-- 
1.9.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 4/5] shrink-wrap: Shrink-wrapping for separate components
  2016-09-23  8:22 [PATCH v3 0/5] Separate shrink-wrapping Segher Boessenkool
                   ` (2 preceding siblings ...)
  2016-09-23  8:23 ` [PATCH 3/5] regrename: Don't rename restores Segher Boessenkool
@ 2016-09-23  8:33 ` Segher Boessenkool
  2016-09-27 21:25   ` Jeff Law
  2016-09-23  8:44 ` [PATCH 5/5] rs6000: Separate shrink-wrapping Segher Boessenkool
  4 siblings, 1 reply; 18+ messages in thread
From: Segher Boessenkool @ 2016-09-23  8:33 UTC (permalink / raw)
  To: gcc-patches; +Cc: law, dje.gcc, Segher Boessenkool

This is the main substance of this patch series.

Instead of doing all of the prologue and epilogue in one spot, it often
is better to do components of it at different places, so that they are
executed less frequently.

What exactly is a component is completely up to the target; this code
treats it all abstractly, and uses hooks for the target to handle the
more concrete things.  Commonly there is one component for each callee-
saved register, for example.

Components can be executed more than once per function execution.  This
pass makes sure that a component's epilogue is not called more often
than the corresponding prologue has been, at any point in time; that the
prologue is called more often, wherever the prologue's effect is needed;
and that the epilogue is called as often as the prologue has been, when
the function exits.  It does this by first deciding which blocks need
which components "active", and then placing prologue and epilogue
components to make that exactly true.

Deciding what blocks should run with a certain component active so that
the total cost of executing the prologues (and epilogues) is optimal, is
not a computationally feasible problem.  Instead, for each basic block,
we estimate the cost of putting a prologue right before the block, and
if that is cheaper than the total cost of putting prologues optimally
(according to the estimated cost) in the dominator subtrees strictly
dominated by this first block, place it at the first block instead.
This simple procedure places the components optimally for any dominator
sub tree where the root node's cost does not depend on anything outside
its subtree.

The cost is the execution frequency of all edges into the block coming
from blocks that do not have this component active.  The estimated cost
is the execution frequency of the block, minus the execution frequency
of any backedges (which by definition are coming from subtrees, so if
the "head" block gets a prologue, the source block of any backedge has
that component active as well).

Currently, the epilogues are placed as late as possible, given the
constraints.  This does not matter for execution cost, but we could
save a little bit of code size by placing the epilogues in a smarter
way.  This is a possible future optimisation.

Now all that is left is inserting prologues and epilogues on all edges
that jump into resp. out of the "active" set of blocks.  Often we need
to insert some components' prologues (or epilogues) on all edges into
(or out of) a block.  In theory cross-jumping can unify all such, but
in practice that often fails; besides, that is a lot of work.  So in
this case we insert the prologue and epilogue components at the "head"
or "tail" of a block, instead.

As a final optimisation, if a block needs a prologue and its immediate
dominator has the block as a post-dominator, that immediate dominator
gets the prologue as well.


2016-09-23  Segher Boessenkool  <segher@kernel.crashing.org>

	* function.c (thread_prologue_and_epilogue_insns): Recompute the
	live info.  Call try_shrink_wrapping_separate.  Compute the
	prologue_seq afterwards, if it has possibly changed.  Compute the
	split_prologue_seq and epilogue_seq later, too.
	* shrink-wrap.c: #include cfgbuild.h.
	(dump_components): New function.
	(struct sw): New struct.
	(SW): New function.
	(init_separate_shrink_wrap): New function.
	(fini_separate_shrink_wrap): New function.
	(place_prologue_for_one_component): New function.
	(spread_components): New function.
	(disqualify_problematic_components): New function.
	(emit_common_heads_for_components): New function.
	(emit_common_tails_for_components): New function.
	(insert_prologue_epilogue_for_components): New function.
	(try_shrink_wrapping_separate): New function.
	* shrink-wrap.h: Declare try_shrink_wrapping_separate.

---
 gcc/function.c    |   9 +-
 gcc/shrink-wrap.c | 729 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/shrink-wrap.h |   1 +
 3 files changed, 737 insertions(+), 2 deletions(-)

diff --git a/gcc/function.c b/gcc/function.c
index 53bad87..79e7b4f 100644
--- a/gcc/function.c
+++ b/gcc/function.c
@@ -5920,9 +5920,7 @@ thread_prologue_and_epilogue_insns (void)
   edge entry_edge = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (cfun));
   edge orig_entry_edge = entry_edge;
 
-  rtx_insn *split_prologue_seq = make_split_prologue_seq ();
   rtx_insn *prologue_seq = make_prologue_seq ();
-  rtx_insn *epilogue_seq = make_epilogue_seq ();
 
   /* Try to perform a kind of shrink-wrapping, making sure the
      prologue/epilogue is emitted only around those parts of the
@@ -5930,6 +5928,13 @@ thread_prologue_and_epilogue_insns (void)
 
   try_shrink_wrapping (&entry_edge, prologue_seq);
 
+  try_shrink_wrapping_separate (entry_edge->dest);
+
+  if (crtl->shrink_wrapped_separate)
+    prologue_seq = make_prologue_seq ();
+
+  rtx_insn *split_prologue_seq = make_split_prologue_seq ();
+  rtx_insn *epilogue_seq = make_epilogue_seq ();
 
   rtl_profile_for_bb (EXIT_BLOCK_PTR_FOR_FN (cfun));
 
diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c
index b85b1c3..0dced22 100644
--- a/gcc/shrink-wrap.c
+++ b/gcc/shrink-wrap.c
@@ -34,6 +34,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "output.h"
 #include "tree-pass.h"
 #include "cfgrtl.h"
+#include "cfgbuild.h"
 #include "params.h"
 #include "bb-reorder.h"
 #include "shrink-wrap.h"
@@ -1006,3 +1007,731 @@ try_shrink_wrapping (edge *entry_edge, rtx_insn *prologue_seq)
   BITMAP_FREE (bb_with);
   free_dominance_info (CDI_DOMINATORS);
 }
+\f
+/* Separate shrink-wrapping
+
+   Instead of putting all of the prologue and epilogue in one spot, we
+   can put parts of it in places where those components are executed less
+   frequently.  The following code does this, for prologue and epilogue
+   components that can be put in more than one location, and where those
+   components can be executed more than once (the epilogue component will
+   always be executed before the prologue component is executed a second
+   time).
+
+   What exactly is a component is target-dependent.  The more usual
+   components are simple saves/restores to/from the frame of callee-saved
+   registers.  This code treats components abstractly (as an sbitmap),
+   letting the target handle all details.
+
+   Prologue components are placed in such a way that for every component
+   the prologue is executed as infrequently as possible.  We do this by
+   walking the dominator tree, comparing the cost of placing a prologue
+   component before a block to the sum of costs determined for all subtrees
+   of that block.
+
+   From this placement, we then determine for each component all blocks
+   where at least one of this block's dominators (including itself) will
+   get a prologue inserted.  That then is how the components are placed.
+   We could place the epilogue components a bit smarter (we can save a
+   bit of code size sometimes); this is a possible future improvement.
+
+   Prologues and epilogues are preferably placed into a block, either at
+   the beginning or end of it, if it is needed for all predecessor resp.
+   successor edges; or placed on the edge otherwise.
+
+   If the placement of any prologue/epilogue leads to a situation we cannot
+   handle (for example, an abnormal edge would need to be split, or some
+   targets want to use some specific registers that may not be available
+   where we want to put them), separate shrink-wrapping for the components
+   in that prologue/epilogue is aborted.  */
+
+
+/* Print the sbitmap COMPONENTS to the DUMP_FILE if not empty, with the
+   label LABEL.  */
+static void
+dump_components (const char *label, sbitmap components)
+{
+  if (bitmap_empty_p (components))
+    return;
+
+  fprintf (dump_file, " [%s", label);
+
+  for (unsigned int j = 0; j < components->n_bits; j++)
+    if (bitmap_bit_p (components, j))
+      fprintf (dump_file, " %u", j);
+
+  fprintf (dump_file, "]");
+}
+
+/* The data we collect for each bb.  */
+struct sw {
+  /* What components does this BB need?  */
+  sbitmap needs_components;
+
+  /* What components does this BB have?  This is the main decision this
+     pass makes.  */
+  sbitmap has_components;
+
+  /* The components for which we placed code at the start of the BB (instead
+     of on all incoming edges).  */
+  sbitmap head_components;
+
+  /* The components for which we placed code at the end of the BB (instead
+     of on all outgoing edges).  */
+  sbitmap tail_components;
+
+  /* The frequency of executing the prologue for this BB, if a prologue is
+     placed on this BB.  This is a pessimistic estimate (no prologue is
+     needed for edges from blocks that have the component under consideration
+     active already).  */
+  gcov_type own_cost;
+
+  /* The frequency of executing the prologue for this BB and all BBs
+     dominated by it.  */
+  gcov_type total_cost;
+};
+
+/* A helper function for accessing the pass-specific info.  */
+static inline struct sw *
+SW (basic_block bb)
+{
+  gcc_assert (bb->aux);
+  return (struct sw *) bb->aux;
+}
+
+/* Create the pass-specific data structures for separately shrink-wrapping
+   with components COMPONENTS.  */
+static void
+init_separate_shrink_wrap (sbitmap components)
+{
+  basic_block bb;
+  FOR_ALL_BB_FN (bb, cfun)
+    {
+      bb->aux = xcalloc (1, sizeof (struct sw));
+
+      SW (bb)->needs_components = targetm.shrink_wrap.components_for_bb (bb);
+
+      /* Mark all basic blocks without successor as needing all components.
+	 This avoids problems in at least cfgcleanup, sel-sched, and
+	 regrename (largely to do with all paths to such a block still
+	 needing the same dwarf CFI info).  */
+      if (EDGE_COUNT (bb->succs) == 0)
+	bitmap_copy (SW (bb)->needs_components, components);
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "bb %d components:", bb->index);
+	  dump_components ("has", SW (bb)->needs_components);
+	  fprintf (dump_file, "\n");
+	}
+
+      SW (bb)->has_components = sbitmap_alloc (SBITMAP_SIZE (components));
+      SW (bb)->head_components = sbitmap_alloc (SBITMAP_SIZE (components));
+      SW (bb)->tail_components = sbitmap_alloc (SBITMAP_SIZE (components));
+      bitmap_clear (SW (bb)->has_components);
+      bitmap_clear (SW (bb)->head_components);
+      bitmap_clear (SW (bb)->tail_components);
+    }
+}
+
+/* Destroy the pass-specific data.  */
+static void
+fini_separate_shrink_wrap (void)
+{
+  basic_block bb;
+  FOR_ALL_BB_FN (bb, cfun)
+    if (bb->aux)
+      {
+	sbitmap_free (SW (bb)->needs_components);
+	sbitmap_free (SW (bb)->has_components);
+	sbitmap_free (SW (bb)->head_components);
+	sbitmap_free (SW (bb)->tail_components);
+	free (bb->aux);
+	bb->aux = 0;
+      }
+}
+
+/* Place the prologue for component WHICH, in the basic blocks dominated
+   by HEAD.  Do a DFS over the dominator tree, and set bit WHICH in the
+   HAS_COMPONENTS of a block if either the block has that bit set in
+   NEEDS_COMPONENTS, or it is cheaper to place the prologue here than in all
+   dominator subtrees separately.  */
+static void
+place_prologue_for_one_component (unsigned int which, basic_block head)
+{
+  /* The block we are currently dealing with.  */
+  basic_block bb = head;
+  /* Is this the first time we visit this block, i.e. have we just gone
+     down the tree.  */
+  bool first_visit = true;
+
+  /* Walk the dominator tree, visit one block per iteration of this loop.
+     Each basic block is visited twice: once before visiting any children
+     of the block, and once after visiting all of them (leaf nodes are
+     visited only once).  As an optimization, we do not visit subtrees
+     that can no longer influence the prologue placement.  */
+  for (;;)
+    {
+      /* First visit of a block: set the (children) cost accumulator to zero;
+	 if the block does not have the component itself, walk down.  */
+      if (first_visit)
+	{
+	  /* Initialize the cost.  The cost is the block execution frequency
+	     that does not come from backedges.  Calculating this by simply
+	     adding the cost of all edges that aren't backedges does not
+	     work: this does not always add up to the block frequency at
+	     all, and even if it does, rounding error makes for bad
+	     decisions.  */
+	  SW (bb)->own_cost = bb->frequency;
+
+	  edge e;
+	  edge_iterator ei;
+	  FOR_EACH_EDGE (e, ei, bb->preds)
+	    if (dominated_by_p (CDI_DOMINATORS, e->src, bb))
+	      {
+		if (SW (bb)->own_cost > EDGE_FREQUENCY (e))
+		  SW (bb)->own_cost -= EDGE_FREQUENCY (e);
+		else
+		  SW (bb)->own_cost = 0;
+	      }
+
+	  SW (bb)->total_cost = 0;
+
+	  if (!bitmap_bit_p (SW (bb)->needs_components, which)
+	      && first_dom_son (CDI_DOMINATORS, bb))
+	    {
+	      bb = first_dom_son (CDI_DOMINATORS, bb);
+	      continue;
+	    }
+	}
+
+      /* If this block does need the component itself, or it is cheaper to
+	 put the prologue here than in all the descendants that need it,
+	 mark it so.  If this block's immediate post-dominator is dominated
+	 by this block, and that needs the prologue, we can put it on this
+	 block as well (earlier is better).  */
+      if (bitmap_bit_p (SW (bb)->needs_components, which)
+	  || SW (bb)->total_cost > SW (bb)->own_cost)
+	{
+	  SW (bb)->total_cost = SW (bb)->own_cost;
+	  bitmap_set_bit (SW (bb)->has_components, which);
+	}
+      else
+	{
+	  basic_block kid = get_immediate_dominator (CDI_POST_DOMINATORS, bb);
+	  if (dominated_by_p (CDI_DOMINATORS, kid, bb)
+	      && bitmap_bit_p (SW (kid)->has_components, which))
+	    {
+	      SW (bb)->total_cost = SW (bb)->own_cost;
+	      bitmap_set_bit (SW (bb)->has_components, which);
+	    }
+	}
+
+      /* We are back where we started, so we are done now.  */
+      if (bb == head)
+	return;
+
+      /* We now know the cost of the subtree rooted at the current block.
+	 Accumulate this cost in the parent.  */
+      basic_block parent = get_immediate_dominator (CDI_DOMINATORS, bb);
+      SW (parent)->total_cost += SW (bb)->total_cost;
+
+      /* Don't walk the tree down unless necessary.  */
+      if (next_dom_son (CDI_DOMINATORS, bb)
+          && SW (parent)->total_cost <= SW (parent)->own_cost)
+	{
+	  bb = next_dom_son (CDI_DOMINATORS, bb);
+	  first_visit = true;
+	}
+      else
+	{
+	  bb = parent;
+	  first_visit = false;
+	}
+    }
+}
+
+/* Mark HAS_COMPONENTS for every block dominated by at least one block with
+   PRO_COMPONENTS set for the respective components, starting at HEAD.  */
+static void
+spread_components (basic_block head)
+{
+  basic_block bb = head;
+  bool first_visit = true;
+  /* This keeps a tally of all components active.  */
+  sbitmap components = SW (head)->has_components;
+
+  for (;;)
+    {
+      if (first_visit)
+	{
+	  bitmap_ior (SW (bb)->has_components, SW (bb)->has_components,
+		      components);
+
+	  if (first_dom_son (CDI_DOMINATORS, bb))
+	    {
+	      components = SW (bb)->has_components;
+	      bb = first_dom_son (CDI_DOMINATORS, bb);
+	      continue;
+	    }
+	}
+
+      components = SW (bb)->has_components;
+
+      if (next_dom_son (CDI_DOMINATORS, bb))
+	{
+	  bb = next_dom_son (CDI_DOMINATORS, bb);
+	  basic_block parent = get_immediate_dominator (CDI_DOMINATORS, bb);
+	  components = SW (parent)->has_components;
+	  first_visit = true;
+	}
+      else
+	{
+	  if (bb == head)
+	    return;
+	  bb = get_immediate_dominator (CDI_DOMINATORS, bb);
+	  first_visit = false;
+	}
+    }
+}
+
+/* If we cannot handle placing some component's prologues or epilogues where
+   we decided we should place them, unmark that component in COMPONENTS so
+   that it is not wrapped separately.  */
+static void
+disqualify_problematic_components (sbitmap components)
+{
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, bb->succs)
+	{
+	  /* Find which components we want pro/epilogues for here.  */
+	  bitmap_and_compl (epi, SW (e->src)->has_components,
+			    SW (e->dest)->has_components);
+	  bitmap_and_compl (pro, SW (e->dest)->has_components,
+			    SW (e->src)->has_components);
+
+	  /* Ask the target what it thinks about things.  */
+	  if (!bitmap_empty_p (epi))
+	    targetm.shrink_wrap.disqualify_components (components, e, epi,
+						       false);
+	  if (!bitmap_empty_p (pro))
+	    targetm.shrink_wrap.disqualify_components (components, e, pro,
+						       true);
+
+	  /* If this edge doesn't need splitting, we're fine.  */
+	  if (single_pred_p (e->dest)
+	      && e->dest != EXIT_BLOCK_PTR_FOR_FN (cfun))
+	    continue;
+
+	  /* If the edge can be split, that is fine too.  */
+	  if ((e->flags & EDGE_ABNORMAL) == 0)
+	    continue;
+
+	  /* We also can handle sibcalls.  */
+	  if (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun))
+	    {
+	      gcc_assert (e->flags & EDGE_SIBCALL);
+	      continue;
+	    }
+
+	  /* Remove from consideration those components we would need
+	     pro/epilogues for on edges where we cannot insert them.  */
+	  bitmap_and_compl (components, components, epi);
+	  bitmap_and_compl (components, components, pro);
+
+	  if (dump_file && !bitmap_subset_p (epi, components))
+	    {
+	      fprintf (dump_file, "  BAD epi %d->%d", e->src->index,
+		       e->dest->index);
+	      if (e->flags & EDGE_EH)
+		fprintf (dump_file, " for EH");
+	      dump_components ("epi", epi);
+	      fprintf (dump_file, "\n");
+	    }
+
+	  if (dump_file && !bitmap_subset_p (pro, components))
+	    {
+	      fprintf (dump_file, "  BAD pro %d->%d", e->src->index,
+		       e->dest->index);
+	      if (e->flags & EDGE_EH)
+		fprintf (dump_file, " for EH");
+	      dump_components ("pro", pro);
+	      fprintf (dump_file, "\n");
+	    }
+	}
+    }
+
+  sbitmap_free (pro);
+  sbitmap_free (epi);
+}
+
+/* Place code for prologues and epilogues for COMPONENTS where we can put
+   that code at the start of basic blocks.  */
+static void
+emit_common_heads_for_components (sbitmap components)
+{
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components));
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      /* Find which prologue resp. epilogue components are needed for all
+	 predecessor edges to this block.  */
+
+      /* First, select all possible components.  */
+      bitmap_copy (epi, components);
+      bitmap_copy (pro, components);
+
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  if (e->flags & EDGE_ABNORMAL)
+	    {
+	      bitmap_clear (epi);
+	      bitmap_clear (pro);
+	      break;
+	    }
+
+	  /* Deselect those epilogue components that should not be inserted
+	     for this edge.  */
+	  bitmap_and_compl (tmp, SW (e->src)->has_components,
+			    SW (e->dest)->has_components);
+	  bitmap_and (epi, epi, tmp);
+
+	  /* Similar, for the prologue.  */
+	  bitmap_and_compl (tmp, SW (e->dest)->has_components,
+			    SW (e->src)->has_components);
+	  bitmap_and (pro, pro, tmp);
+	}
+
+      if (dump_file && !(bitmap_empty_p (epi) && bitmap_empty_p (pro)))
+	fprintf (dump_file, "  bb %d", bb->index);
+
+      if (dump_file && !bitmap_empty_p (epi))
+	dump_components ("epi", epi);
+      if (dump_file && !bitmap_empty_p (pro))
+	dump_components ("pro", pro);
+
+      if (dump_file && !(bitmap_empty_p (epi) && bitmap_empty_p (pro)))
+	fprintf (dump_file, "\n");
+
+      /* Place code after the BB note.  */
+      if (!bitmap_empty_p (pro))
+	{
+	  start_sequence ();
+	  targetm.shrink_wrap.emit_prologue_components (pro);
+	  rtx_insn *seq = get_insns ();
+	  end_sequence ();
+
+	  emit_insn_after (seq, bb_note (bb));
+
+	  bitmap_ior (SW (bb)->head_components, SW (bb)->head_components, pro);
+	}
+
+      if (!bitmap_empty_p (epi))
+	{
+	  start_sequence ();
+	  targetm.shrink_wrap.emit_epilogue_components (epi);
+	  rtx_insn *seq = get_insns ();
+	  end_sequence ();
+
+	  emit_insn_after (seq, bb_note (bb));
+
+	  bitmap_ior (SW (bb)->head_components, SW (bb)->head_components, epi);
+	}
+    }
+
+  sbitmap_free (pro);
+  sbitmap_free (epi);
+  sbitmap_free (tmp);
+}
+
+/* Place code for prologues and epilogues for COMPONENTS where we can put
+   that code at the end of basic blocks.  */
+static void
+emit_common_tails_for_components (sbitmap components)
+{
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components));
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      /* Find which prologue resp. epilogue components are needed for all
+	 successor edges from this block.  */
+      if (EDGE_COUNT (bb->succs) == 0)
+	continue;
+
+      /* First, select all possible components.  */
+      bitmap_copy (epi, components);
+      bitmap_copy (pro, components);
+
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, bb->succs)
+	{
+	  if (e->flags & EDGE_ABNORMAL)
+	    {
+	      bitmap_clear (epi);
+	      bitmap_clear (pro);
+	      break;
+	    }
+
+	  /* Deselect those epilogue components that should not be inserted
+	     for this edge, and also those that are already put at the head
+	     of the successor block.  */
+	  bitmap_and_compl (tmp, SW (e->src)->has_components,
+			    SW (e->dest)->has_components);
+	  bitmap_and_compl (tmp, tmp, SW (e->dest)->head_components);
+	  bitmap_and (epi, epi, tmp);
+
+	  /* Similarly, for the prologue.  */
+	  bitmap_and_compl (tmp, SW (e->dest)->has_components,
+			    SW (e->src)->has_components);
+	  bitmap_and_compl (tmp, tmp, SW (e->dest)->head_components);
+	  bitmap_and (pro, pro, tmp);
+	}
+
+      if (dump_file && !(bitmap_empty_p (epi) && bitmap_empty_p (pro)))
+	fprintf (dump_file, "  bb %d", bb->index);
+
+      if (dump_file && !bitmap_empty_p (epi))
+	dump_components ("epi", epi);
+      if (dump_file && !bitmap_empty_p (pro))
+	dump_components ("pro", pro);
+
+      if (dump_file && !(bitmap_empty_p (epi) && bitmap_empty_p (pro)))
+	fprintf (dump_file, "\n");
+
+      /* Put the code at the end of the BB, but before any final jump.  */
+      if (!bitmap_empty_p (epi))
+	{
+	  start_sequence ();
+	  targetm.shrink_wrap.emit_epilogue_components (epi);
+	  rtx_insn *seq = get_insns ();
+	  end_sequence ();
+
+	  rtx_insn *insn = BB_END (bb);
+	  if (control_flow_insn_p (insn))
+	    emit_insn_before (seq, insn);
+	  else
+	    emit_insn_after (seq, insn);
+
+	  bitmap_ior (SW (bb)->tail_components, SW (bb)->tail_components, epi);
+	}
+
+      if (!bitmap_empty_p (pro))
+	{
+	  start_sequence ();
+	  targetm.shrink_wrap.emit_prologue_components (pro);
+	  rtx_insn *seq = get_insns ();
+	  end_sequence ();
+
+	  rtx_insn *insn = BB_END (bb);
+	  if (JUMP_P (insn) || (CALL_P (insn) && SIBLING_CALL_P (insn)))
+	    emit_insn_before (seq, insn);
+	  else
+	    emit_insn_after (seq, insn);
+
+	  bitmap_ior (SW (bb)->tail_components, SW (bb)->tail_components, pro);
+	}
+    }
+
+  sbitmap_free (pro);
+  sbitmap_free (epi);
+  sbitmap_free (tmp);
+}
+
+/* Place prologues and epilogues for COMPONENTS on edges, if we haven't already
+   placed them inside blocks directly.  */
+static void
+insert_prologue_epilogue_for_components (sbitmap components)
+{
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      if (!bb->aux)
+	continue;
+
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, bb->succs)
+	{
+	  /* Find which pro/epilogue components are needed on this edge.  */
+	  bitmap_and_compl (epi, SW (e->src)->has_components,
+			    SW (e->dest)->has_components);
+	  bitmap_and_compl (pro, SW (e->dest)->has_components,
+			    SW (e->src)->has_components);
+	  bitmap_and (epi, epi, components);
+	  bitmap_and (pro, pro, components);
+
+	  /* Deselect those we already have put at the head or tail of the
+	     edge's dest resp. src.  */
+	  bitmap_and_compl (epi, epi, SW (e->dest)->head_components);
+	  bitmap_and_compl (pro, pro, SW (e->dest)->head_components);
+	  bitmap_and_compl (epi, epi, SW (e->src)->tail_components);
+	  bitmap_and_compl (pro, pro, SW (e->src)->tail_components);
+
+	  if (!bitmap_empty_p (epi) || !bitmap_empty_p (pro))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "  %d->%d", e->src->index,
+			   e->dest->index);
+		  dump_components ("epi", epi);
+		  dump_components ("pro", pro);
+		  fprintf (dump_file, "\n");
+		}
+
+	      /* Put the epilogue components in place.  */
+	      start_sequence ();
+	      targetm.shrink_wrap.emit_epilogue_components (epi);
+	      rtx_insn *seq = get_insns ();
+	      end_sequence ();
+
+	      if (e->flags & EDGE_SIBCALL)
+		{
+		  gcc_assert (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun));
+
+		  rtx_insn *insn = BB_END (e->src);
+		  gcc_assert (CALL_P (insn) && SIBLING_CALL_P (insn));
+		  emit_insn_before (seq, insn);
+		}
+	      else if (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun))
+		{
+		  gcc_assert (e->flags & EDGE_FALLTHRU);
+		  basic_block new_bb = split_edge (e);
+		  emit_insn_after (seq, BB_END (new_bb));
+		}
+	      else
+		insert_insn_on_edge (seq, e);
+
+	      /* Put the prologue components in place.  */
+	      start_sequence ();
+	      targetm.shrink_wrap.emit_prologue_components (pro);
+	      seq = get_insns ();
+	      end_sequence ();
+
+	      insert_insn_on_edge (seq, e);
+	    }
+	}
+    }
+
+  sbitmap_free (pro);
+  sbitmap_free (epi);
+
+  commit_edge_insertions ();
+}
+
+/* The main entry point to this subpass.  FIRST_BB is where the prologue
+   would be normally put.  */
+void
+try_shrink_wrapping_separate (basic_block first_bb)
+{
+  if (!(SHRINK_WRAPPING_ENABLED
+	&& flag_shrink_wrap_separate
+	&& optimize_function_for_speed_p (cfun)
+	&& targetm.shrink_wrap.get_separate_components))
+    return;
+
+  /* We don't handle "strange" functions.  */
+  if (cfun->calls_alloca
+      || cfun->calls_setjmp
+      || cfun->can_throw_non_call_exceptions
+      || crtl->calls_eh_return
+      || crtl->has_nonlocal_goto
+      || crtl->saves_all_registers)
+    return;
+
+  /* Ask the target what components there are.  If it returns NULL, don't
+     do anything.  */
+  sbitmap components = targetm.shrink_wrap.get_separate_components ();
+  if (!components)
+    return;
+
+  /* We need LIVE info.  */
+  df_live_add_problem ();
+  df_live_set_all_dirty ();
+  df_analyze ();
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  calculate_dominance_info (CDI_POST_DOMINATORS);
+
+  init_separate_shrink_wrap (components);
+
+  sbitmap_iterator sbi;
+  unsigned int j;
+  EXECUTE_IF_SET_IN_BITMAP (components, 0, j, sbi)
+    place_prologue_for_one_component (j, first_bb);
+
+  spread_components (first_bb);
+
+  disqualify_problematic_components (components);
+
+  /* Don't separately shrink-wrap anything where the "main" prologue will
+     go; the target code can often optimize things if it is presented with
+     all components together (say, if it generates store-multiple insns).  */
+  bitmap_and_compl (components, components, SW (first_bb)->has_components);
+
+  if (bitmap_empty_p (components))
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not wrapping anything separately.\n");
+    }
+  else
+    {
+      if (dump_file)
+	{
+	  fprintf (dump_file, "The components we wrap separately are");
+	  dump_components ("sep", components);
+	  fprintf (dump_file, "\n");
+
+	  fprintf (dump_file, "... Inserting common heads...\n");
+	}
+
+      emit_common_heads_for_components (components);
+
+      if (dump_file)
+	fprintf (dump_file, "... Inserting common tails...\n");
+
+      emit_common_tails_for_components (components);
+
+      if (dump_file)
+	fprintf (dump_file, "... Inserting the more difficult ones...\n");
+
+      insert_prologue_epilogue_for_components (components);
+
+      if (dump_file)
+	fprintf (dump_file, "... Done.\n");
+
+      targetm.shrink_wrap.set_handled_components (components);
+
+      crtl->shrink_wrapped_separate = true;
+    }
+
+  fini_separate_shrink_wrap ();
+
+  sbitmap_free (components);
+  free_dominance_info (CDI_DOMINATORS);
+  free_dominance_info (CDI_POST_DOMINATORS);
+
+  if (crtl->shrink_wrapped_separate)
+    {
+      df_live_set_all_dirty ();
+      df_analyze ();
+    }
+}
diff --git a/gcc/shrink-wrap.h b/gcc/shrink-wrap.h
index e06ab37..05fcb41 100644
--- a/gcc/shrink-wrap.h
+++ b/gcc/shrink-wrap.h
@@ -25,6 +25,7 @@ along with GCC; see the file COPYING3.  If not see
 /* In shrink-wrap.c.  */
 extern bool requires_stack_frame_p (rtx_insn *, HARD_REG_SET, HARD_REG_SET);
 extern void try_shrink_wrapping (edge *entry_edge, rtx_insn *prologue_seq);
+extern void try_shrink_wrapping_separate (basic_block first_bb);
 #define SHRINK_WRAPPING_ENABLED \
   (flag_shrink_wrap && targetm.have_simple_return ())
 
-- 
1.9.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 5/5] rs6000: Separate shrink-wrapping
  2016-09-23  8:22 [PATCH v3 0/5] Separate shrink-wrapping Segher Boessenkool
                   ` (3 preceding siblings ...)
  2016-09-23  8:33 ` [PATCH 4/5] shrink-wrap: Shrink-wrapping for separate components Segher Boessenkool
@ 2016-09-23  8:44 ` Segher Boessenkool
  2016-09-26 16:39   ` Jeff Law
  2016-09-26 18:16   ` David Edelsohn
  4 siblings, 2 replies; 18+ messages in thread
From: Segher Boessenkool @ 2016-09-23  8:44 UTC (permalink / raw)
  To: gcc-patches; +Cc: law, dje.gcc, Segher Boessenkool

This implements the hooks for separate shrink-wrapping for rs6000.
It handles GPRs and LR.  The GPRs get a component number corresponding
to their register number; LR gets component number 0.


2016-09-23  Segher Boessenkool  <segher@kernel.crashing.org>

	* config/rs6000/rs6000.c (machine_function): Add new fields
	gpr_is_wrapped_separately and lr_is_wrapped_separately.
	(TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
	TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
	TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
	TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
	TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
	TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.
	(rs6000_get_separate_components): New function.
	(rs6000_components_for_bb): New function.
	(rs6000_disqualify_components): New function.
	(rs6000_emit_prologue_components): New function.
	(rs6000_emit_epilogue_components): New function.
	(rs6000_set_handled_components): New function.
	(rs6000_emit_prologue): Don't emit LR save if lr_is_wrapped_separately.
	Don't emit GPR saves if gpr_is_wrapped_separately for that register.
	(rs6000_emit_epilogue): Don't emit GPR restores if
	gpr_is_wrapped_separately for that register.  Don't make a
	REG_CFA_RESTORE note for registers we did not restore, either.

---
 gcc/config/rs6000/rs6000.c | 269 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 253 insertions(+), 16 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 557009f..ec8d637 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -152,6 +152,10 @@ typedef struct GTY(()) machine_function
   bool split_stack_argp_used;
   /* Flag if r2 setup is needed with ELFv2 ABI.  */
   bool r2_setup_needed;
+  /* The components already handled by separate shrink-wrapping, which should
+     not be considered by the prologue and epilogue.  */
+  bool gpr_is_wrapped_separately[32];
+  bool lr_is_wrapped_separately;
 } machine_function;
 
 /* Support targetm.vectorize.builtin_mask_for_load.  */
@@ -1513,6 +1517,19 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_SET_UP_BY_PROLOGUE
 #define TARGET_SET_UP_BY_PROLOGUE rs6000_set_up_by_prologue
 
+#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
+#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS rs6000_get_separate_components
+#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
+#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB rs6000_components_for_bb
+#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
+#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS rs6000_disqualify_components
+#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS rs6000_emit_prologue_components
+#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS rs6000_emit_epilogue_components
+#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
+#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS rs6000_set_handled_components
+
 #undef TARGET_EXTRA_LIVE_ON_ENTRY
 #define TARGET_EXTRA_LIVE_ON_ENTRY rs6000_live_on_entry
 
@@ -26850,6 +26867,212 @@ rs6000_global_entry_point_needed_p (void)
   return cfun->machine->r2_setup_needed;
 }
 
+/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
+static sbitmap
+rs6000_get_separate_components (void)
+{
+  rs6000_stack_t *info = rs6000_stack_info ();
+
+  if (!(info->savres_strategy & SAVE_INLINE_GPRS)
+      || !(info->savres_strategy & REST_INLINE_GPRS)
+      || WORLD_SAVE_P (info))
+    return NULL;
+
+  sbitmap components = sbitmap_alloc (32);
+  bitmap_clear (components);
+
+  /* The GPRs we need saved to the frame.  */
+  int reg_size = TARGET_32BIT ? 4 : 8;
+  int offset = info->gp_save_offset;
+  if (info->push_p)
+    offset += info->total_size;
+
+  for (unsigned regno = info->first_gp_reg_save; regno < 32; regno++)
+    {
+      if (IN_RANGE (offset, -0x8000, 0x7fff)
+	  && rs6000_reg_live_or_pic_offset_p (regno))
+	bitmap_set_bit (components, regno);
+
+      offset += reg_size;
+    }
+
+  /* Don't mess with the hard frame pointer.  */
+  if (frame_pointer_needed)
+    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
+
+  /* Don't mess with the fixed TOC register.  */
+  if ((TARGET_TOC && TARGET_MINIMAL_TOC)
+      || (flag_pic == 1 && DEFAULT_ABI == ABI_V4)
+      || (flag_pic && DEFAULT_ABI == ABI_DARWIN))
+    bitmap_clear_bit (components, RS6000_PIC_OFFSET_TABLE_REGNUM);
+
+  /* Optimize LR save and restore if we can.  This is component 0.  */
+  if (info->lr_save_p
+      && !(flag_pic && (DEFAULT_ABI == ABI_V4 || DEFAULT_ABI == ABI_DARWIN)))
+    {
+      offset = info->lr_save_offset;
+      if (info->push_p)
+	offset += info->total_size;
+      if (IN_RANGE (offset, -0x8000, 0x7fff))
+	bitmap_set_bit (components, 0);
+    }
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
+static sbitmap
+rs6000_components_for_bb (basic_block bb)
+{
+  rs6000_stack_t *info = rs6000_stack_info ();
+
+  bitmap in = DF_LIVE_IN (bb);
+  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
+  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
+
+  sbitmap components = sbitmap_alloc (32);
+  bitmap_clear (components);
+
+  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
+  for (unsigned regno = info->first_gp_reg_save; regno < 32; regno++)
+    if (bitmap_bit_p (in, regno)
+	|| bitmap_bit_p (gen, regno)
+	|| bitmap_bit_p (kill, regno))
+      bitmap_set_bit (components, regno);
+
+  /* LR needs to be saved around a bb if it is killed in that bb.  */
+  if (bitmap_bit_p (gen, LR_REGNO)
+      || bitmap_bit_p (kill, LR_REGNO))
+    bitmap_set_bit (components, 0);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.  */
+static void
+rs6000_disqualify_components (sbitmap components, edge e,
+			      sbitmap edge_components, bool /*is_prologue*/)
+{
+  /* Our LR pro/epilogue code moves LR via R0, so R0 had better not be
+     live where we want to place that code.  */
+  if (bitmap_bit_p (edge_components, 0)
+      && bitmap_bit_p (DF_LIVE_IN (e->dest), 0))
+    {
+      if (dump_file)
+	fprintf (dump_file, "Disqualifying LR because GPR0 is live "
+		 "on entry to bb %d\n", e->dest->index);
+      bitmap_clear_bit (components, 0);
+    }
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
+static void
+rs6000_emit_prologue_components (sbitmap components)
+{
+  rs6000_stack_t *info = rs6000_stack_info ();
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+  int reg_size = TARGET_32BIT ? 4 : 8;
+
+  /* Prologue for LR.  */
+  if (bitmap_bit_p (components, 0))
+    {
+      rtx reg = gen_rtx_REG (Pmode, 0);
+      rtx_insn *insn = emit_move_insn (reg, gen_rtx_REG (Pmode, LR_REGNO));
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_REGISTER, NULL);
+
+      int offset = info->lr_save_offset;
+      if (info->push_p)
+	offset += info->total_size;
+
+      insn = emit_insn (gen_frame_store (reg, ptr_reg, offset));
+      RTX_FRAME_RELATED_P (insn) = 1;
+      rtx lr = gen_rtx_REG (Pmode, LR_REGNO);
+      rtx mem = copy_rtx (SET_DEST (single_set (insn)));
+      add_reg_note (insn, REG_CFA_OFFSET, gen_rtx_SET (mem, lr));
+    }
+
+  /* Prologue for the GPRs.  */
+  int offset = info->gp_save_offset;
+  if (info->push_p)
+    offset += info->total_size;
+
+  for (int i = info->first_gp_reg_save; i < 32; i++)
+    {
+      if (bitmap_bit_p (components, i))
+	{
+	  rtx reg = gen_rtx_REG (Pmode, i);
+	  rtx_insn *insn = emit_insn (gen_frame_store (reg, ptr_reg, offset));
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  rtx set = copy_rtx (single_set (insn));
+	  add_reg_note (insn, REG_CFA_OFFSET, set);
+	}
+
+      offset += reg_size;
+    }
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
+static void
+rs6000_emit_epilogue_components (sbitmap components)
+{
+  rs6000_stack_t *info = rs6000_stack_info ();
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+			     ? HARD_FRAME_POINTER_REGNUM
+			     : STACK_POINTER_REGNUM);
+  int reg_size = TARGET_32BIT ? 4 : 8;
+
+  /* Epilogue for the GPRs.  */
+  int offset = info->gp_save_offset;
+  if (info->push_p)
+    offset += info->total_size;
+
+  for (int i = info->first_gp_reg_save; i < 32; i++)
+    {
+      if (bitmap_bit_p (components, i))
+	{
+	  rtx reg = gen_rtx_REG (Pmode, i);
+	  rtx_insn *insn = emit_insn (gen_frame_load (reg, ptr_reg, offset));
+	  RTX_FRAME_RELATED_P (insn) = 1;
+	  add_reg_note (insn, REG_CFA_RESTORE, reg);
+	}
+
+      offset += reg_size;
+    }
+
+  /* Epilogue for LR.  */
+  if (bitmap_bit_p (components, 0))
+    {
+      int offset = info->lr_save_offset;
+      if (info->push_p)
+	offset += info->total_size;
+
+      rtx reg = gen_rtx_REG (Pmode, 0);
+      rtx_insn *insn = emit_insn (gen_frame_load (reg, ptr_reg, offset));
+
+      rtx lr = gen_rtx_REG (Pmode, LR_REGNO);
+      insn = emit_move_insn (lr, reg);
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_RESTORE, lr);
+    }
+}
+
+/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
+static void
+rs6000_set_handled_components (sbitmap components)
+{
+  rs6000_stack_t *info = rs6000_stack_info ();
+
+  for (int i = info->first_gp_reg_save; i < 32; i++)
+    if (bitmap_bit_p (components, i))
+      cfun->machine->gpr_is_wrapped_separately[i] = true;
+
+  if (bitmap_bit_p (components, 0))
+    cfun->machine->lr_is_wrapped_separately = true;
+}
+
 /* Emit function prologue as insns.  */
 
 void
@@ -27107,7 +27330,8 @@ rs6000_emit_prologue (void)
     }
 
   /* If we use the link register, get it into r0.  */
-  if (!WORLD_SAVE_P (info) && info->lr_save_p)
+  if (!WORLD_SAVE_P (info) && info->lr_save_p
+      && !cfun->machine->lr_is_wrapped_separately)
     {
       rtx addr, reg, mem;
 
@@ -27335,13 +27559,16 @@ rs6000_emit_prologue (void)
     }
   else if (!WORLD_SAVE_P (info))
     {
-      int i;
-      for (i = 0; i < 32 - info->first_gp_reg_save; i++)
-	if (rs6000_reg_live_or_pic_offset_p (info->first_gp_reg_save + i))
-	  emit_frame_save (frame_reg_rtx, reg_mode,
-			   info->first_gp_reg_save + i,
-			   info->gp_save_offset + frame_off + reg_size * i,
-			   sp_off - frame_off);
+      int offset = info->gp_save_offset + frame_off;
+      for (int i = info->first_gp_reg_save; i < 32; i++)
+	{
+	  if (rs6000_reg_live_or_pic_offset_p (i)
+	      && !cfun->machine->gpr_is_wrapped_separately[i])
+	    emit_frame_save (frame_reg_rtx, reg_mode, i, offset,
+			     sp_off - frame_off);
+
+	  offset += reg_size;
+	}
     }
 
   if (crtl->calls_eh_return)
@@ -28264,7 +28491,9 @@ rs6000_emit_epilogue (int sibcall)
 		&& (restoring_FPRs_inline
 		    || (strategy & REST_NOINLINE_FPRS_DOESNT_RESTORE_LR))
 		&& (restoring_GPRs_inline
-		    || info->first_fp_reg_save < 64));
+		    || info->first_fp_reg_save < 64)
+		&& !cfun->machine->lr_is_wrapped_separately);
+
 
   if (WORLD_SAVE_P (info))
     {
@@ -28899,12 +29128,18 @@ rs6000_emit_epilogue (int sibcall)
     }
   else
     {
-      for (i = 0; i < 32 - info->first_gp_reg_save; i++)
-	if (rs6000_reg_live_or_pic_offset_p (info->first_gp_reg_save + i))
-	  emit_insn (gen_frame_load
-		     (gen_rtx_REG (reg_mode, info->first_gp_reg_save + i),
-		      frame_reg_rtx,
-		      info->gp_save_offset + frame_off + reg_size * i));
+      int offset = info->gp_save_offset + frame_off;
+      for (i = info->first_gp_reg_save; i < 32; i++)
+	{
+	  if (rs6000_reg_live_or_pic_offset_p (i)
+	      && !cfun->machine->gpr_is_wrapped_separately[i])
+	    {
+	      rtx reg = gen_rtx_REG (reg_mode, i);
+	      emit_insn (gen_frame_load (reg, frame_reg_rtx, offset));
+	    }
+
+	  offset += reg_size;
+	}
     }
 
   if (DEFAULT_ABI == ABI_V4 || flag_shrink_wrap)
@@ -28943,8 +29178,10 @@ rs6000_emit_epilogue (int sibcall)
 	    || using_load_multiple
 	    || rs6000_reg_live_or_pic_offset_p (i))
 	  {
-	    rtx reg = gen_rtx_REG (reg_mode, i);
+	    if (cfun->machine->gpr_is_wrapped_separately[i])
+	      continue;
 
+	    rtx reg = gen_rtx_REG (reg_mode, i);
 	    cfa_restores = alloc_reg_note (REG_CFA_RESTORE, reg, cfa_restores);
 	  }
     }
-- 
1.9.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5] rs6000: Separate shrink-wrapping
  2016-09-23  8:44 ` [PATCH 5/5] rs6000: Separate shrink-wrapping Segher Boessenkool
@ 2016-09-26 16:39   ` Jeff Law
  2016-09-26 18:16   ` David Edelsohn
  1 sibling, 0 replies; 18+ messages in thread
From: Jeff Law @ 2016-09-26 16:39 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches; +Cc: dje.gcc

On 09/23/2016 02:21 AM, Segher Boessenkool wrote:
> This implements the hooks for separate shrink-wrapping for rs6000.
> It handles GPRs and LR.  The GPRs get a component number corresponding
> to their register number; LR gets component number 0.
>
>
> 2016-09-23  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* config/rs6000/rs6000.c (machine_function): Add new fields
> 	gpr_is_wrapped_separately and lr_is_wrapped_separately.
> 	(TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
> 	TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
> 	TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
> 	TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
> 	TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
> 	TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.
> 	(rs6000_get_separate_components): New function.
> 	(rs6000_components_for_bb): New function.
> 	(rs6000_disqualify_components): New function.
> 	(rs6000_emit_prologue_components): New function.
> 	(rs6000_emit_epilogue_components): New function.
> 	(rs6000_set_handled_components): New function.
> 	(rs6000_emit_prologue): Don't emit LR save if lr_is_wrapped_separately.
> 	Don't emit GPR saves if gpr_is_wrapped_separately for that register.
> 	(rs6000_emit_epilogue): Don't emit GPR restores if
> 	gpr_is_wrapped_separately for that register.  Don't make a
> 	REG_CFA_RESTORE note for registers we did not restore, either.
Just to be explicit, I'm assuming you and the other ppc port maintainers 
will handle final review/approval on this.  I've referred back to this 
patch to see how the various target independent bits interact, but I 
haven't looked closely at this patch.

jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] regrename: Don't rename restores
  2016-09-23  8:23 ` [PATCH 3/5] regrename: Don't rename restores Segher Boessenkool
@ 2016-09-26 16:44   ` Jeff Law
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff Law @ 2016-09-26 16:44 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches; +Cc: dje.gcc

On 09/23/2016 02:21 AM, Segher Boessenkool wrote:
> A restore is supposed to restore some certain register.  Restoring it
> into some other register will not work.  Don't.
>
>
> 2016-09-23  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* regrename.c (build_def_use): Invalidate chains that have a
> 	REG_CFA_RESTORE on some instruction.
>
> ---
>  gcc/regrename.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/gcc/regrename.c b/gcc/regrename.c
> index 54c7768..00a5d02 100644
> --- a/gcc/regrename.c
> +++ b/gcc/regrename.c
> @@ -1867,6 +1867,12 @@ build_def_use (basic_block bb)
>  		scan_rtx (insn, &XEXP (note, 0), NO_REGS, terminate_dead,
>  			  OP_IN);
>  	      }
> +
> +	  /* Step 8: Kill the chains involving register restores.  Those
> +	     should restore _that_ register.  */
> +	  for (note = REG_NOTES (insn); note; note = XEXP (note, 1))
> +	    if (REG_NOTE_KIND (note) == REG_CFA_RESTORE)
> +	      scan_rtx (insn, &XEXP (note, 0), NO_REGS, mark_all_read, OP_IN);
>  	}
>        else if (DEBUG_INSN_P (insn)
>  	       && !VAR_LOC_UNKNOWN_P (INSN_VAR_LOCATION_LOC (insn)))
Seems like a good thing regardless of the shrink-wrapping changes. 
There's a comment about 200 lines earlier (egad) which outlines the 
steps.  Can you please add a comment there too.

It would probably be a good idea to refactor build_def_use a bit, but 
I'd understand if you don't want to tackle that.  I don't think that 
desire should block this patch.

As I've said before, I'm not sure we're getting CFA notes right, 
particularly on the older ports, but I don't think that should block 
this patch.


With the earlier comment update change this is OK.

jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/5] dce: Don't dead-code delete separately wrapped restores
  2016-09-23  8:22 ` [PATCH 2/5] dce: Don't dead-code delete separately wrapped restores Segher Boessenkool
@ 2016-09-26 16:55   ` Jeff Law
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff Law @ 2016-09-26 16:55 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches; +Cc: dje.gcc

On 09/23/2016 02:21 AM, Segher Boessenkool wrote:
> If there is a separately wrapped register restore on some path that
> is dead (say, control goes into an endless loop after it), then we
> cannot delete that restore because that would confuse the DWARF CFI
> (if there is another path joining after this).
> This happens with gcc.dg/torture/pr53168.c, for example.
>
>
> 2016-09-23  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* dce.c (delete_unmarked_insns): Don't delete instructions with
> 	a REG_CFA_RESTORE note.
OK and can go in now IMHO.
jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-09-23  8:22 ` [PATCH 1/5] separate shrink-wrap: New command-line flag, status flag, hooks, and doc Segher Boessenkool
@ 2016-09-26 17:02   ` Jeff Law
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff Law @ 2016-09-26 17:02 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches; +Cc: dje.gcc

On 09/23/2016 02:21 AM, Segher Boessenkool wrote:
> This patch adds a new command-line flag "-fshrink-wrap-separate", a status
> flag "shrink_wrapped_separate", hooks for abstracting the target components,
> and documentation for all those.
>
>
> 2016-09-23  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* common.opt (-fshrink-wrap-separate): New flag.
> 	* doc/invoke.texi: Document it.
> 	* doc/tm.texi.in (Shrink-wrapping separate components): New subsection.
> 	* doc/tm.texi: Regenerate.
> 	* emit-rtl.h (struct rtl_data): New field shrink_wrapped_separate.
> 	* target.def (shrink_wrap): New hook vector.
> 	(get_separate_components, components_for_bb, disqualify_components,
> 	emit_prologue_components, emit_epilogue_components,
> 	set_handled_components): New hooks.
>
> ---
>  gcc/common.opt      |  4 ++++
>  gcc/doc/invoke.texi | 11 +++++++++-
>  gcc/doc/tm.texi     | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  gcc/doc/tm.texi.in  | 38 ++++++++++++++++++++++++++++++++
>  gcc/emit-rtl.h      |  4 ++++
>  gcc/target.def      | 57 ++++++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 176 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/common.opt b/gcc/common.opt
> index fa1c036..1a38d12 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -2189,6 +2189,10 @@ Common Report Var(flag_shrink_wrap) Optimization
>  Emit function prologues only before parts of the function that need it,
>  rather than at the top of the function.
>
> +fshrink-wrap-separate
> +Common Report Var(flag_shrink_wrap_separate) Init(1) Optimization
> +Shrink-wrap parts of the prologue and epilogue separately.
> +
>  fsignaling-nans
>  Common Report Var(flag_signaling_nans) Optimization SetByCombined
>  Disable optimizations observable by IEEE signaling NaNs.
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 8eb5eff..464b737 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -397,7 +397,8 @@ Objective-C and Objective-C++ Dialects}.
>  -fschedule-insns -fschedule-insns2 -fsection-anchors @gol
>  -fselective-scheduling -fselective-scheduling2 @gol
>  -fsel-sched-pipelining -fsel-sched-pipelining-outer-loops @gol
> --fsemantic-interposition -fshrink-wrap -fsignaling-nans @gol
> +-fsemantic-interposition -fshrink-wrap -fshrink-wrap-separate @gol
> +-fsignaling-nans @gol
>  -fsingle-precision-constant -fsplit-ivs-in-unroller @gol
>  -fsplit-paths @gol
>  -fsplit-wide-types -fssa-backprop -fssa-phiopt @gol
> @@ -6396,6 +6397,7 @@ compilation time.
>  -fmove-loop-invariants @gol
>  -freorder-blocks @gol
>  -fshrink-wrap @gol
> +-fshrink-wrap-separate @gol
>  -fsplit-wide-types @gol
>  -fssa-backprop @gol
>  -fssa-phiopt @gol
> @@ -7306,6 +7308,13 @@ Emit function prologues only before parts of the function that need it,
>  rather than at the top of the function.  This flag is enabled by default at
>  @option{-O} and higher.
>
> +@item -fshrink-wrap-separate
> +@opindex fshrink-wrap-separate
> +Shrink-wrap separate parts of the prologue and epilogue separately, so that
> +those parts are only executed when needed.
> +This option is on by default, but has no effect unless @option{-fshrink-wrap}
> +is also turned on.
This option also has no effect if the target has not implemented support 
for separate shrink wrapping.   Or something like that seems appropriate 
here.

I think with that change this patch is fine and can be committed with 
the main part once approved.

jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/5] rs6000: Separate shrink-wrapping
  2016-09-23  8:44 ` [PATCH 5/5] rs6000: Separate shrink-wrapping Segher Boessenkool
  2016-09-26 16:39   ` Jeff Law
@ 2016-09-26 18:16   ` David Edelsohn
  1 sibling, 0 replies; 18+ messages in thread
From: David Edelsohn @ 2016-09-26 18:16 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: GCC Patches, Jeffrey Law

On Fri, Sep 23, 2016 at 4:21 AM, Segher Boessenkool
<segher@kernel.crashing.org> wrote:
> This implements the hooks for separate shrink-wrapping for rs6000.
> It handles GPRs and LR.  The GPRs get a component number corresponding
> to their register number; LR gets component number 0.
>
>
> 2016-09-23  Segher Boessenkool  <segher@kernel.crashing.org>
>
>         * config/rs6000/rs6000.c (machine_function): Add new fields
>         gpr_is_wrapped_separately and lr_is_wrapped_separately.
>         (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
>         TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
>         TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
>         TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
>         TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
>         TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.
>         (rs6000_get_separate_components): New function.
>         (rs6000_components_for_bb): New function.
>         (rs6000_disqualify_components): New function.
>         (rs6000_emit_prologue_components): New function.
>         (rs6000_emit_epilogue_components): New function.
>         (rs6000_set_handled_components): New function.
>         (rs6000_emit_prologue): Don't emit LR save if lr_is_wrapped_separately.
>         Don't emit GPR saves if gpr_is_wrapped_separately for that register.
>         (rs6000_emit_epilogue): Don't emit GPR restores if
>         gpr_is_wrapped_separately for that register.  Don't make a
>         REG_CFA_RESTORE note for registers we did not restore, either.

The rs6000 bits are okay when the rest of the shrink-wrapping
infrastructure is approved.

Thanks, David

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] shrink-wrap: Shrink-wrapping for separate components
  2016-09-23  8:33 ` [PATCH 4/5] shrink-wrap: Shrink-wrapping for separate components Segher Boessenkool
@ 2016-09-27 21:25   ` Jeff Law
  2016-09-28  9:26     ` Segher Boessenkool
                       ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Jeff Law @ 2016-09-27 21:25 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches; +Cc: dje.gcc

On 09/23/2016 02:21 AM, Segher Boessenkool wrote:
> Deciding what blocks should run with a certain component active so that
> the total cost of executing the prologues (and epilogues) is optimal, is
> not a computationally feasible problem.  Instead, for each basic block,
> we estimate the cost of putting a prologue right before the block, and
> if that is cheaper than the total cost of putting prologues optimally
> (according to the estimated cost) in the dominator subtrees strictly
> dominated by this first block, place it at the first block instead.
> This simple procedure places the components optimally for any dominator
> sub tree where the root node's cost does not depend on anything outside
> its subtree.
>
> The cost is the execution frequency of all edges into the block coming
> from blocks that do not have this component active.  The estimated cost
> is the execution frequency of the block, minus the execution frequency
> of any backedges (which by definition are coming from subtrees, so if
> the "head" block gets a prologue, the source block of any backedge has
> that component active as well).
>
> Currently, the epilogues are placed as late as possible, given the
> constraints.  This does not matter for execution cost, but we could
> save a little bit of code size by placing the epilogues in a smarter
> way.  This is a possible future optimisation.
>
> Now all that is left is inserting prologues and epilogues on all edges
> that jump into resp. out of the "active" set of blocks.  Often we need
> to insert some components' prologues (or epilogues) on all edges into
> (or out of) a block.  In theory cross-jumping can unify all such, but
> in practice that often fails; besides, that is a lot of work.  So in
> this case we insert the prologue and epilogue components at the "head"
> or "tail" of a block, instead.
>
> As a final optimisation, if a block needs a prologue and its immediate
> dominator has the block as a post-dominator, that immediate dominator
> gets the prologue as well.
>
>
> 2016-09-23  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* function.c (thread_prologue_and_epilogue_insns): Recompute the
> 	live info.  Call try_shrink_wrapping_separate.  Compute the
> 	prologue_seq afterwards, if it has possibly changed.  Compute the
> 	split_prologue_seq and epilogue_seq later, too.
> 	* shrink-wrap.c: #include cfgbuild.h.
> 	(dump_components): New function.
> 	(struct sw): New struct.
> 	(SW): New function.
> 	(init_separate_shrink_wrap): New function.
> 	(fini_separate_shrink_wrap): New function.
> 	(place_prologue_for_one_component): New function.
> 	(spread_components): New function.
> 	(disqualify_problematic_components): New function.
> 	(emit_common_heads_for_components): New function.
> 	(emit_common_tails_for_components): New function.
> 	(insert_prologue_epilogue_for_components): New function.
> 	(try_shrink_wrapping_separate): New function.
> 	* shrink-wrap.h: Declare try_shrink_wrapping_separate.
>
> ---
>  gcc/function.c    |   9 +-
>  gcc/shrink-wrap.c | 729 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  gcc/shrink-wrap.h |   1 +
>  3 files changed, 737 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/function.c b/gcc/function.c
> index 53bad87..79e7b4f 100644
> --- a/gcc/function.c
> +++ b/gcc/function.c
> @@ -5920,9 +5920,7 @@ thread_prologue_and_epilogue_insns (void)
>    edge entry_edge = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (cfun));
>    edge orig_entry_edge = entry_edge;
>
> -  rtx_insn *split_prologue_seq = make_split_prologue_seq ();
>    rtx_insn *prologue_seq = make_prologue_seq ();
> -  rtx_insn *epilogue_seq = make_epilogue_seq ();
>
>    /* Try to perform a kind of shrink-wrapping, making sure the
>       prologue/epilogue is emitted only around those parts of the
> @@ -5930,6 +5928,13 @@ thread_prologue_and_epilogue_insns (void)
>
>    try_shrink_wrapping (&entry_edge, prologue_seq);
>
> +  try_shrink_wrapping_separate (entry_edge->dest);
> +
> +  if (crtl->shrink_wrapped_separate)
> +    prologue_seq = make_prologue_seq ();
Note this implies that make_prologue_seq (and consequently the target 
specific bits for prologue generation) can be safely called more than 
once.  That's probably OK, but might be worth a comment -- your decision 
whether or not to add the comment.


> diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c
> index b85b1c3..0dced22 100644
> --- a/gcc/shrink-wrap.c
> +++ b/gcc/shrink-wrap.c

> +/* Place the prologue for component WHICH, in the basic blocks dominated
> +   by HEAD.  Do a DFS over the dominator tree, and set bit WHICH in the
> +   HAS_COMPONENTS of a block if either the block has that bit set in
> +   NEEDS_COMPONENTS, or it is cheaper to place the prologue here than in all
> +   dominator subtrees separately.  */
> +static void
> +place_prologue_for_one_component (unsigned int which, basic_block head)
> +{
> +  /* The block we are currently dealing with.  */
> +  basic_block bb = head;
> +  /* Is this the first time we visit this block, i.e. have we just gone
> +     down the tree.  */
> +  bool first_visit = true;
> +
> +  /* Walk the dominator tree, visit one block per iteration of this loop.
> +     Each basic block is visited twice: once before visiting any children
> +     of the block, and once after visiting all of them (leaf nodes are
> +     visited only once).  As an optimization, we do not visit subtrees
> +     that can no longer influence the prologue placement.  */
> +  for (;;)
Is there some reason you wrote this as a loop rather than recursion? 
IMHO it makes this function (and spread_components) more difficult to 
reason about than it needs to be.



> +
> +/* Mark HAS_COMPONENTS for every block dominated by at least one block with
> +   PRO_COMPONENTS set for the respective components, starting at HEAD.  */
> +static void
> +spread_components (basic_block head)
I don't see any references to PRO_COMPONENTS in the actual code.  If I'm 
reading the code correctly, you're just accumulating/propagating 
HAS_COMPONENTS from parents to children via a DFS walk of the dominator 
tree, right?





> +
> +/* Place code for prologues and epilogues for COMPONENTS where we can put
> +   that code at the end of basic blocks.  */
> +static void
> +emit_common_tails_for_components (sbitmap components)
[ Snip. ]
> +
> +      /* Put the code at the end of the BB, but before any final jump.  */
> +      if (!bitmap_empty_p (epi))
So what if the final jump uses hard registers in one way or another?   I 
don't immediately see anything that verifies it is safe to transpose the 
epilogue and the final jump.

Conceptually want the epilogue to occur on the outgoing edge(s).  But 
you want to actually transpose the epilogue and the control flow insn so 
that you can insert the epilogue in one place.

But naive transposition isn't safe.  The control flow insn may use one 
or more registers that you were restoring or you may be on a cc0 target. 
   I think you need to handle the former more cleanly.  The latter I'd 
be comfortable filtering out in try_shrink_wrapping_separate.

This has similarities to some of the problems we've had in the past with 
rtl-gcse insertion as well as output reloads on insns with control flow.

With transposition issue addressed, the only blocker I see are some 
simple testcases we can add to the suite.  They don't have to be real 
extensive.  And one motivating example for the list archives, ideally 
the glibc malloc case.

Jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] shrink-wrap: Shrink-wrapping for separate components
  2016-09-27 21:25   ` Jeff Law
@ 2016-09-28  9:26     ` Segher Boessenkool
  2016-09-28 16:36       ` Jeff Law
  2016-09-30 10:14     ` Segher Boessenkool
       [not found]     ` <20160930102908.GB14933@gate.crashing.org>
  2 siblings, 1 reply; 18+ messages in thread
From: Segher Boessenkool @ 2016-09-28  9:26 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, dje.gcc

On Tue, Sep 27, 2016 at 03:14:51PM -0600, Jeff Law wrote:
> On 09/23/2016 02:21 AM, Segher Boessenkool wrote:
> >--- a/gcc/function.c
> >+++ b/gcc/function.c
> >@@ -5920,9 +5920,7 @@ thread_prologue_and_epilogue_insns (void)
> >   edge entry_edge = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (cfun));
> >   edge orig_entry_edge = entry_edge;
> >
> >-  rtx_insn *split_prologue_seq = make_split_prologue_seq ();
> >   rtx_insn *prologue_seq = make_prologue_seq ();
> >-  rtx_insn *epilogue_seq = make_epilogue_seq ();
> >
> >   /* Try to perform a kind of shrink-wrapping, making sure the
> >      prologue/epilogue is emitted only around those parts of the
> >@@ -5930,6 +5928,13 @@ thread_prologue_and_epilogue_insns (void)
> >
> >   try_shrink_wrapping (&entry_edge, prologue_seq);
> >
> >+  try_shrink_wrapping_separate (entry_edge->dest);
> >+
> >+  if (crtl->shrink_wrapped_separate)
> >+    prologue_seq = make_prologue_seq ();
> Note this implies that make_prologue_seq (and consequently the target 
> specific bits for prologue generation) can be safely called more than 
> once.  That's probably OK, but might be worth a comment -- your decision 
> whether or not to add the comment.

It is only called twice if separately shrink-wrapping (and that did
anything), so it won't change current behaviour.

I'll add a comment.

> >+static void
> >+place_prologue_for_one_component (unsigned int which, basic_block head)
> >+{
> >+  /* The block we are currently dealing with.  */
> >+  basic_block bb = head;
> >+  /* Is this the first time we visit this block, i.e. have we just gone
> >+     down the tree.  */
> >+  bool first_visit = true;
> >+
> >+  /* Walk the dominator tree, visit one block per iteration of this loop.
> >+     Each basic block is visited twice: once before visiting any children
> >+     of the block, and once after visiting all of them (leaf nodes are
> >+     visited only once).  As an optimization, we do not visit subtrees
> >+     that can no longer influence the prologue placement.  */
> >+  for (;;)
> Is there some reason you wrote this as a loop rather than recursion? 
> IMHO it makes this function (and spread_components) more difficult to 
> reason about than it needs to be.

It would recurse a few thousand frames deep on not terribly big testcases
(i.e. I've seen it happen).  This uses a lot of stack space on some ABIs,
and is very slow as well (this is the slowest function here by far).
Unlimited recursion is bad.

> >+/* Mark HAS_COMPONENTS for every block dominated by at least one block 
> >with
> >+   PRO_COMPONENTS set for the respective components, starting at HEAD.  */
> >+static void
> >+spread_components (basic_block head)
> I don't see any references to PRO_COMPONENTS in the actual code.  If I'm 
> reading the code correctly, you're just accumulating/propagating 
> HAS_COMPONENTS from parents to children via a DFS walk of the dominator 
> tree, right?

Yeah, s/PRO_COMPONENTS/HAS_COMPONENTS/.  I have renamed things a few
time but missed this one due to the ALL VARS IN ALL CAPS style :-)

> >+/* Place code for prologues and epilogues for COMPONENTS where we can put
> >+   that code at the end of basic blocks.  */
> >+static void
> >+emit_common_tails_for_components (sbitmap components)
> [ Snip. ]
> >+
> >+      /* Put the code at the end of the BB, but before any final jump.  */
> >+      if (!bitmap_empty_p (epi))
> So what if the final jump uses hard registers in one way or another?   I 
> don't immediately see anything that verifies it is safe to transpose the 
> epilogue and the final jump.

Whoops.  Thanks for catching this.

> Conceptually want the epilogue to occur on the outgoing edge(s).  But 
> you want to actually transpose the epilogue and the control flow insn so 
> that you can insert the epilogue in one place.

The same problem happens with prologues, too.

> But naive transposition isn't safe.  The control flow insn may use one 
> or more registers that you were restoring or you may be on a cc0 target. 

A cc0 target can not use separate shrink-wrapping *anyway* if any of the
components would clobber cc0, so that is no problem here.

>   I think you need to handle the former more cleanly.  The latter I'd 
> be comfortable filtering out in try_shrink_wrapping_separate.

I'm thinking to not do the common tail optimisation if BB_END is a
JUMP_INSN but not simplejump_p (or a return or a sibling call).  Do
you see any problem with that?

> This has similarities to some of the problems we've had in the past with 
> rtl-gcse insertion as well as output reloads on insns with control flow.

Oh, I had so much fun recently with the latter :-/

> With transposition issue addressed, the only blocker I see are some 
> simple testcases we can add to the suite.  They don't have to be real 
> extensive.

Yeah, working on it.

> And one motivating example for the list archives, ideally 
> the glibc malloc case.

Okay.

Thanks for the reviews,


Segher

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] shrink-wrap: Shrink-wrapping for separate components
  2016-09-28  9:26     ` Segher Boessenkool
@ 2016-09-28 16:36       ` Jeff Law
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff Law @ 2016-09-28 16:36 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, dje.gcc

On 09/28/2016 03:04 AM, Segher Boessenkool wrote:
>
>>> +static void
>>> +place_prologue_for_one_component (unsigned int which, basic_block head)
>>> +{
>>> +  /* The block we are currently dealing with.  */
>>> +  basic_block bb = head;
>>> +  /* Is this the first time we visit this block, i.e. have we just gone
>>> +     down the tree.  */
>>> +  bool first_visit = true;
>>> +
>>> +  /* Walk the dominator tree, visit one block per iteration of this loop.
>>> +     Each basic block is visited twice: once before visiting any children
>>> +     of the block, and once after visiting all of them (leaf nodes are
>>> +     visited only once).  As an optimization, we do not visit subtrees
>>> +     that can no longer influence the prologue placement.  */
>>> +  for (;;)
>> Is there some reason you wrote this as a loop rather than recursion?
>> IMHO it makes this function (and spread_components) more difficult to
>> reason about than it needs to be.
>
> It would recurse a few thousand frames deep on not terribly big testcases
> (i.e. I've seen it happen).  This uses a lot of stack space on some ABIs,
> and is very slow as well (this is the slowest function here by far).
> Unlimited recursion is bad.
I'm surprised the recursion was that deep.  Such is life.   Thanks for 
clarifying.  I won't object to the iterative version. :-)

>>> +/* Place code for prologues and epilogues for COMPONENTS where we can put
>>> +   that code at the end of basic blocks.  */
>>> +static void
>>> +emit_common_tails_for_components (sbitmap components)
>> [ Snip. ]
>>> +
>>> +      /* Put the code at the end of the BB, but before any final jump.  */
>>> +      if (!bitmap_empty_p (epi))
>> So what if the final jump uses hard registers in one way or another?   I
>> don't immediately see anything that verifies it is safe to transpose the
>> epilogue and the final jump.
>
> Whoops.  Thanks for catching this.
I missed it the first time though the code too.

>
>> Conceptually want the epilogue to occur on the outgoing edge(s).  But
>> you want to actually transpose the epilogue and the control flow insn so
>> that you can insert the epilogue in one place.
>
> The same problem happens with prologues, too.
Yea.  I guess if a had a jump with an embedded side effect (such as movb 
or addb on the PA), then transposing the control flow insn with the 
prologue would be problematical as well.

>
> A cc0 target can not use separate shrink-wrapping *anyway* if any of the
> components would clobber cc0, so that is no problem here.
True, but I'd be more comfortable if we filtered out cc0 targets explicitly.

>
>>   I think you need to handle the former more cleanly.  The latter I'd
>> be comfortable filtering out in try_shrink_wrapping_separate.
>
> I'm thinking to not do the common tail optimisation if BB_END is a
> JUMP_INSN but not simplejump_p (or a return or a sibling call).  Do
> you see any problem with that?
Seems reasonable.


Jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] shrink-wrap: Shrink-wrapping for separate components
  2016-09-27 21:25   ` Jeff Law
  2016-09-28  9:26     ` Segher Boessenkool
@ 2016-09-30 10:14     ` Segher Boessenkool
       [not found]     ` <20160930102908.GB14933@gate.crashing.org>
  2 siblings, 0 replies; 18+ messages in thread
From: Segher Boessenkool @ 2016-09-30 10:14 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, dje.gcc

On Tue, Sep 27, 2016 at 03:14:51PM -0600, Jeff Law wrote:
> With transposition issue addressed, the only blocker I see are some 
> simple testcases we can add to the suite.  They don't have to be real 
> extensive.  And one motivating example for the list archives, ideally 
> the glibc malloc case.

Here are some testcases.


2016-09-30  Segher Boessenkool  <segher@kernel.crashing.org>

gcc/testsuite/
	* gcc.target/powerpc/shrink-wrap-separate-0.c: New.
	* gcc.target/powerpc/shrink-wrap-separate-1.c: New.
	* gcc.target/powerpc/shrink-wrap-separate-2.c: New.


diff --git a/gcc/testsuite/gcc.target/powerpc/shrink-wrap-separate-0.c b/gcc/testsuite/gcc.target/powerpc/shrink-wrap-separate-0.c
new file mode 100644
index 0000000..dea0611
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/shrink-wrap-separate-0.c
@@ -0,0 +1,22 @@
+/* { dg-do compile { target powerpc*-*-* } } */
+/* { dg-options "-O2" } */
+/* { dg-final { scan-assembler {#before\M.*\mmflr\M} } } */
+
+/* This tests if shrink-wrapping for separate components works.
+
+   r20 (a callee-saved register) is forced live at the start, so that we
+   get it saved in a prologue at the start of the function.
+   The link register only needs to be saved if x is non-zero; without
+   separate shrink-wrapping it would however be saved in the one prologue.
+   The test tests if the mflr insn ends up behind the prologue.  */
+
+void g(void);
+
+void f(int x)
+{
+	register int r20 asm("20") = x;
+	asm("#before" : : "r"(r20));
+	if (x)
+		g();
+	asm(""); // no tailcall of g
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/shrink-wrap-separate-1.c b/gcc/testsuite/gcc.target/powerpc/shrink-wrap-separate-1.c
new file mode 100644
index 0000000..735b606
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/shrink-wrap-separate-1.c
@@ -0,0 +1,18 @@
+/* { dg-do compile { target powerpc*-*-* } } */
+/* { dg-options "-O2" } */
+/* { dg-final { scan-assembler {\mmflr\M.*\mbl\M.*\mmflr\M.*\mbl\M} } } */
+
+/* This tests if shrink-wrapping for separate components creates more
+   than one prologue when that is useful.  In this case, it saves the
+   link register before both the call to g and the call to h.  */
+
+void g(void) __attribute__((noreturn));
+void h(void) __attribute__((noreturn));
+
+void f(int x)
+{
+	if (x == 42)
+		g();
+	if (x == 31)
+		h();
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/shrink-wrap-separate-2.c b/gcc/testsuite/gcc.target/powerpc/shrink-wrap-separate-2.c
new file mode 100644
index 0000000..b22564a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/shrink-wrap-separate-2.c
@@ -0,0 +1,26 @@
+/* { dg-do compile { target powerpc*-*-* } } */
+/* { dg-options "-O2" } */
+/* { dg-final { scan-assembler {\mmflr\M.*\mbl\M.*\mmflr\M.*\mbl\M} } } */
+
+/* This tests if shrink-wrapping for separate components puts a prologue
+   inside a loop when that is useful.  In this case, it saves the link
+   register before each call: both calls happen with probability .10,
+   so saving the link register happens with .80 per execution of f on
+   average, which is smaller than 1 which you would get if you saved
+   it outside the loop.  */
+
+int *a;
+void g(void);
+
+void f(int x)
+{
+	int j;
+	for (j = 0; j < 4; j++) {
+		if (__builtin_expect(a[j], 0))
+			g();
+		asm("#" : : : "memory");
+		if (__builtin_expect(a[j], 0))
+			g();
+		a[j]++;
+	}
+}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] shrink-wrap: Shrink-wrapping for separate components
       [not found]     ` <20160930102908.GB14933@gate.crashing.org>
@ 2016-09-30 10:52       ` Segher Boessenkool
  2016-10-10 21:21         ` Jeff Law
  0 siblings, 1 reply; 18+ messages in thread
From: Segher Boessenkool @ 2016-09-30 10:52 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, dje.gcc

[-- Attachment #1: Type: text/plain, Size: 4205 bytes --]

[ whoops, message too big, resending with the attachment compressed ]

On Tue, Sep 27, 2016 at 03:14:51PM -0600, Jeff Law wrote:
> With transposition issue addressed, the only blocker I see are some 
> simple testcases we can add to the suite.  They don't have to be real 
> extensive.  And one motivating example for the list archives, ideally 
> the glibc malloc case.

And here is the malloc testcase.

A very important (for performance) function is _int_malloc, which starts
with


static void *
_int_malloc (mstate av, size_t bytes)
{
// [ variable declarations culled ]

  if (((unsigned long) (bytes) >= (unsigned long) (size_t) (-2 * (unsigned long)((((
 __builtin_offsetof (
 struct malloc_chunk
 ,-
 fd_nextsize
 )
 )+((2 *(sizeof(size_t)) < __alignof__ (long double) ? __alignof__ (long double) : 2 *(sizeof(size_t))) - 1)) & ~((2 *(sizeof(size_t)) < __alignof__ (long double) ? __alignof__ (long double) : 2 *(sizeof(size_t))) - 1)))))) { (__libc_errno = (
 12
 )); return 0; }

  if (__builtin_expect ((av ==-
     ((
     void-
     *)0)
     ), 0))
    {
      void *p = sysmalloc (nb, av);
      if (p !=-
              ((
              void-
              *)0)
                  )
 alloc_perturb (p, bytes);
      return p;
    }


which without separate shrink-wrapping ends up as (reordered the blocks):


.L._int_malloc:
        mflr 0
        li 9,-65
        std 14,-144(1)
        std 15,-136(1)
        cmpld 7,4,9
        std 16,-128(1)
        std 17,-120(1)
        std 18,-112(1)
        std 19,-104(1)
        std 20,-96(1)
        std 21,-88(1)
        std 22,-80(1)
        std 23,-72(1)
        std 0,16(1)
        std 24,-64(1)
        std 25,-56(1)
        std 26,-48(1)
        std 27,-40(1)
        std 28,-32(1)
        std 29,-24(1)
        std 30,-16(1)
        std 31,-8(1)
        stdu 1,-288(1)
        bgt 7,.L768
        addi 14,4,23
        mr 15,3
        cmpldi 0,14,31
        ble 0,.L769

# ...

.L768:
        addis 27,2,__libc_errno@got@tprel@ha
        li 19,12
        ld 28,__libc_errno@got@tprel@l(27)
        li 3,0
        add 17,28,__libc_errno@tls
        stw 19,0(17)
        b .L631

# ...

.L631:
        addi 1,1,288
        ld 29,16(1)
        ld 14,-144(1)
        ld 15,-136(1)
        ld 16,-128(1)
        ld 17,-120(1)
        ld 18,-112(1)
        ld 19,-104(1)
        ld 20,-96(1)
        ld 21,-88(1)
        ld 22,-80(1)
        ld 23,-72(1)
        ld 24,-64(1)
        mtlr 29
        ld 25,-56(1)
        ld 26,-48(1)
        ld 27,-40(1)
        ld 28,-32(1)
        ld 29,-24(1)
        ld 30,-16(1)
        ld 31,-8(1)
        blr
# ...

.L769:
        cmpdi 1,3,0
        beq 1,.L715

# ...

.L715:
        li 14,32
.L635:
        li 4,0
.L762:
        addi 1,1,288
        mr 3,14
        ld 14,16(1)
        ld 15,-136(1)
        ld 16,-128(1)
        ld 17,-120(1)
        ld 18,-112(1)
        ld 19,-104(1)
        ld 20,-96(1)
        ld 21,-88(1)
        ld 22,-80(1)
        ld 23,-72(1)
        ld 24,-64(1)
        ld 25,-56(1)
        mtlr 14
        ld 26,-48(1)
        ld 14,-144(1)
        ld 27,-40(1)
        ld 28,-32(1)
        ld 29,-24(1)
        ld 30,-16(1)
        ld 31,-8(1)
        b sysmalloc


[ I see have regrename on by default still; doesn't matter much for
this test, it's just less readable ]


With separate shrink-wrapping we get instead


.L._int_malloc:
        li 9,-65
        stdu 1,-288(1)
        cmpld 7,4,9
        bgt 7,.L811
        std 14,144(1)
        addi 14,4,23
        std 15,152(1)
        mr 15,3
        cmpldi 0,14,31
        std 25,232(1)
        std 30,272(1)
        ble 0,.L812

# ...

.L811:
        addis 6,2,__libc_errno@got@tprel@ha
        li 11,12
        ld 10,__libc_errno@got@tprel@l(6)
        li 3,0
        add 12,10,__libc_errno@tls
        stw 11,0(12)
        b .L673

# ...

.L673:
        addi 1,1,288
        blr

# ...

.L812:
        cmpdi 1,3,0
        beq 1,.L757

# ...

.L757:
        li 14,32
.L677:
        mr 3,14
        ld 15,152(1)
        ld 14,144(1)
        ld 25,232(1)
        ld 30,272(1)
        li 4,0
        addi 1,1,288
        b sysmalloc


I'm attaching the full testcase (pre-processed for powerpc64-linux).


Segher

[-- Attachment #2: malloc.i.gz --]
[-- Type: application/x-gzip, Size: 74379 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] shrink-wrap: Shrink-wrapping for separate components
  2016-09-30 10:52       ` Segher Boessenkool
@ 2016-10-10 21:21         ` Jeff Law
  2016-10-10 22:23           ` Segher Boessenkool
  0 siblings, 1 reply; 18+ messages in thread
From: Jeff Law @ 2016-10-10 21:21 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, dje.gcc

On 09/30/2016 04:34 AM, Segher Boessenkool wrote:
> [ whoops, message too big, resending with the attachment compressed ]
>
> On Tue, Sep 27, 2016 at 03:14:51PM -0600, Jeff Law wrote:
>> With transposition issue addressed, the only blocker I see are some
>> simple testcases we can add to the suite.  They don't have to be real
>> extensive.  And one motivating example for the list archives, ideally
>> the glibc malloc case.
>
> And here is the malloc testcase.
>
> A very important (for performance) function is _int_malloc, which starts
> with
[ ... ]
THanks.  What I think is important to note with this example is the bits 
that were pushed into the path with the sysmalloc/alloc_perturb calls. 
That's an unlikely path.

We have to extrapolate a bit from the assembly provided.  In the not 
separately shrink-wrapped version, we have a full prologue of stores and 
two instances of a full epilogue (though only one ever executes) provided.

With separate shrink wrapping the (presumably) very cold path where we 
error has virtually no prologue/epilogue.  That's probably a nop from a 
performance standpoint.

More interesting is the path where we call sysmalloc/alloc_perturb, it's 
a cold path, but not as cold as the error path.  We save/restore 4 regs 
in that case.  Rather than a full prologue/epilogue.  So there's clearly 
a savings there, though again, via the expect it's a cold path.

Where we have to extrapolate is the hot path.  Presumably on the hot 
path we're saving/restoring ~4 fewer registers.   I haven't verified 
that, but that is kindof the whole point here.



Thanks,
Jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] shrink-wrap: Shrink-wrapping for separate components
  2016-10-10 21:21         ` Jeff Law
@ 2016-10-10 22:23           ` Segher Boessenkool
  0 siblings, 0 replies; 18+ messages in thread
From: Segher Boessenkool @ 2016-10-10 22:23 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, dje.gcc

[-- Attachment #1: Type: text/plain, Size: 3113 bytes --]

On Mon, Oct 10, 2016 at 03:21:31PM -0600, Jeff Law wrote:
> On 09/30/2016 04:34 AM, Segher Boessenkool wrote:
> >[ whoops, message too big, resending with the attachment compressed ]
> >
> >On Tue, Sep 27, 2016 at 03:14:51PM -0600, Jeff Law wrote:
> >>With transposition issue addressed, the only blocker I see are some
> >>simple testcases we can add to the suite.  They don't have to be real
> >>extensive.  And one motivating example for the list archives, ideally
> >>the glibc malloc case.
> >
> >And here is the malloc testcase.
> >
> >A very important (for performance) function is _int_malloc, which starts
> >with
> [ ... ]
> THanks.  What I think is important to note with this example is the bits 
> that were pushed into the path with the sysmalloc/alloc_perturb calls. 
> That's an unlikely path.

alloc_perturb is a no-op, and inlined as such: as nothing :-)

> We have to extrapolate a bit from the assembly provided.  In the not 
> separately shrink-wrapped version, we have a full prologue of stores and 
> two instances of a full epilogue (though only one ever executes) provided.
> 
> With separate shrink wrapping the (presumably) very cold path where we 
> error has virtually no prologue/epilogue.  That's probably a nop from a 
> performance standpoint.
> 
> More interesting is the path where we call sysmalloc/alloc_perturb, it's 
> a cold path, but not as cold as the error path.  We save/restore 4 regs 
> in that case.  Rather than a full prologue/epilogue.  So there's clearly 
> a savings there, though again, via the expect it's a cold path.
> 
> Where we have to extrapolate is the hot path.  Presumably on the hot 
> path we're saving/restoring ~4 fewer registers.   I haven't verified 
> that, but that is kindof the whole point here.

We save/restore just four registers total on the hot path.  And yes,
that is the point :-)

The hot exit is

.L683:
	ld 14,144(1)
	ld 15,152(1)
	ld 25,232(1)
	ld 30,272(1)
	addi 3,4,16
.L673:
	addi 1,1,288
	blr

so four GPR restores and no LR restore.  Without separate shrink-wrapping
this was

.L641:
	addi 3,21,16
	b .L631

[ ... ]

.L631:
	addi 1,1,288
	ld 29,16(1)
	ld 14,-144(1)
	ld 15,-136(1)
	ld 16,-128(1)
	ld 17,-120(1)
	ld 18,-112(1)
	ld 19,-104(1)
	ld 20,-96(1)
	ld 21,-88(1)
	ld 22,-80(1)
	ld 23,-72(1)
	ld 24,-64(1)
	mtlr 29
	ld 25,-56(1)
	ld 26,-48(1)
	ld 27,-40(1)
	ld 28,-32(1)
	ld 29,-24(1)
	ld 30,-16(1)
	ld 31,-8(1)
	blr

(18 GPRs as well as LR).

I didn't show this path because there is a whole bunch of branches with
inline asm in the way.

The sysmalloc path was

.L635:
	li 4,0
.L761:
	addi 1,1,288
	mr 3,14
	ld 14,16(1)
	ld 15,-136(1)
	ld 16,-128(1)
	ld 17,-120(1)
	ld 18,-112(1)
	ld 19,-104(1)
	ld 20,-96(1)
	ld 21,-88(1)
	ld 22,-80(1)
	ld 23,-72(1)
	ld 24,-64(1)
	ld 25,-56(1)
	mtlr 14
	ld 26,-48(1)
	ld 14,-144(1)
	ld 27,-40(1)
	ld 28,-32(1)
	ld 29,-24(1)
	ld 30,-16(1)
	ld 31,-8(1)
	b sysmalloc

and now is

.L677:
	mr 3,14
	ld 15,152(1)
	ld 14,144(1)
	ld 25,232(1)
	ld 30,272(1)
	li 4,0
	addi 1,1,288
	b sysmalloc

I attach malloc.s.{no,yes}, I hope you can stomach that.  Well you
can read HP-PA, heh.


Segher

[-- Attachment #2: malloc.s.no.gz --]
[-- Type: application/x-gzip, Size: 40507 bytes --]

[-- Attachment #3: malloc.s.yes.gz --]
[-- Type: application/x-gzip, Size: 42479 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-10-10 22:23 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-23  8:22 [PATCH v3 0/5] Separate shrink-wrapping Segher Boessenkool
2016-09-23  8:22 ` [PATCH 2/5] dce: Don't dead-code delete separately wrapped restores Segher Boessenkool
2016-09-26 16:55   ` Jeff Law
2016-09-23  8:22 ` [PATCH 1/5] separate shrink-wrap: New command-line flag, status flag, hooks, and doc Segher Boessenkool
2016-09-26 17:02   ` Jeff Law
2016-09-23  8:23 ` [PATCH 3/5] regrename: Don't rename restores Segher Boessenkool
2016-09-26 16:44   ` Jeff Law
2016-09-23  8:33 ` [PATCH 4/5] shrink-wrap: Shrink-wrapping for separate components Segher Boessenkool
2016-09-27 21:25   ` Jeff Law
2016-09-28  9:26     ` Segher Boessenkool
2016-09-28 16:36       ` Jeff Law
2016-09-30 10:14     ` Segher Boessenkool
     [not found]     ` <20160930102908.GB14933@gate.crashing.org>
2016-09-30 10:52       ` Segher Boessenkool
2016-10-10 21:21         ` Jeff Law
2016-10-10 22:23           ` Segher Boessenkool
2016-09-23  8:44 ` [PATCH 5/5] rs6000: Separate shrink-wrapping Segher Boessenkool
2016-09-26 16:39   ` Jeff Law
2016-09-26 18:16   ` David Edelsohn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).