[PATCH v2 0/9] Separate shrink-wrapping

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH v2 0/9] Separate shrink-wrapping
@ 2016-08-01  1:43 Segher Boessenkool
  2016-08-01  1:43 ` [PATCH 3/9] dce: Don't dead-code delete separately wrapped restores Segher Boessenkool
                   ` (10 more replies)
  0 siblings, 11 replies; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-01  1:43 UTC (permalink / raw)
  To: gcc-patches; +Cc: bschmidt, Segher Boessenkool

This is the second version.  Concern was renamed to component, and all
other comments were addressed (I hope).  It still uses only two bitmaps
for the component placement, but now they are called needs_components
and has_components, which hopefully is easier to follow.  The "can this
prologue be moved earlier for no cost" test is generalised a bit, and
the cost estimate explicitly excludes backedges, which makes it easier
to see that prologues will not be placed inside a loop if there is no
benefit to that (I didn't see any different generated code because of
this change).  And new and updated comments all over, of course.

Is this okay for trunk?

Segher

 gcc/cfgcleanup.c           |   5 +
 gcc/common.opt             |   4 +
 gcc/config/rs6000/rs6000.c | 257 ++++++++++++++++-
 gcc/dce.c                  |   9 +
 gcc/doc/invoke.texi        |  11 +-
 gcc/doc/tm.texi            |  54 ++++
 gcc/doc/tm.texi.in         |  29 ++
 gcc/emit-rtl.h             |   4 +
 gcc/function.c             |  15 +-
 gcc/regcprop.c             |   4 +
 gcc/regrename.c            |  13 +-
 gcc/sel-sched-ir.c         |   1 +
 gcc/shrink-wrap.c          | 705 +++++++++++++++++++++++++++++++++++++++++++++
 gcc/shrink-wrap.h          |   1 +
 gcc/target.def             |  57 ++++
 15 files changed, 1150 insertions(+), 19 deletions(-)

-- 
1.9.3

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 3/9] dce: Don't dead-code delete separately wrapped restores
  2016-08-01  1:43 [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
@ 2016-08-01  1:43 ` Segher Boessenkool
  2016-09-08 17:52   ` Jeff Law
  2016-08-01  1:43 ` [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc Segher Boessenkool
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-01  1:43 UTC (permalink / raw)
  To: gcc-patches; +Cc: bschmidt, Segher Boessenkool

Deleting restores (before a noreturn) that are dead confuses dwarf2cfi.

2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>

	* dce.c (delete_unmarked_insns): Don't delete instructions with
	a REG_CFA_RESTORE note.
---
 gcc/dce.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/gcc/dce.c b/gcc/dce.c
index ea3fb00..d510287 100644
--- a/gcc/dce.c
+++ b/gcc/dce.c
@@ -587,6 +587,15 @@ delete_unmarked_insns (void)
 	  if (!dbg_cnt (dce))
 	    continue;
 
+	  if (crtl->shrink_wrapped_separate
+	      && find_reg_note (insn, REG_CFA_RESTORE, NULL))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, "DCE: NOT deleting insn %d, it's a "
+				    "callee-save restore\n", INSN_UID (insn));
+	      continue;
+	    }
+
 	  if (dump_file)
 	    fprintf (dump_file, "DCE: Deleting insn %d\n", INSN_UID (insn));
 
-- 
1.9.3

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 2/9] cfgcleanup: Don't confuse CFI when -fshrink-wrap-separate
  2016-08-01  1:43 [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
  2016-08-01  1:43 ` [PATCH 3/9] dce: Don't dead-code delete separately wrapped restores Segher Boessenkool
  2016-08-01  1:43 ` [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc Segher Boessenkool
@ 2016-08-01  1:43 ` Segher Boessenkool
  2016-09-08 17:51   ` Jeff Law
  2016-08-01  1:57 ` [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components Segher Boessenkool
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-01  1:43 UTC (permalink / raw)
  To: gcc-patches; +Cc: bschmidt, Segher Boessenkool

cfgcleanup would try to join noreturn paths with different components
handled.  This then fails in dwarf2cfi.

2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>

	* cfgcleanup.c (outgoing_edges_match): Don't join noreturn calls
	if shrink_wrapped_separate.
---
 gcc/cfgcleanup.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/gcc/cfgcleanup.c b/gcc/cfgcleanup.c
index 023b9d2..e3f205b 100644
--- a/gcc/cfgcleanup.c
+++ b/gcc/cfgcleanup.c
@@ -1824,6 +1824,11 @@ outgoing_edges_match (int mode, basic_block bb1, basic_block bb2)
           || !find_reg_note (last1, REG_ARGS_SIZE, NULL)))
     return false;
 
+  /* If shrink-wrapping separate components, joining noreturn calls that
+     have different components set up will confuse dwarf2cfi.  */
+  if (!nonfakeedges && crtl->shrink_wrapped_separate)
+    return false;
+
   /* fallthru edges must be forwarded to the same destination.  */
   if (fallthru1)
     {
-- 
1.9.3

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-08-01  1:43 [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
  2016-08-01  1:43 ` [PATCH 3/9] dce: Don't dead-code delete separately wrapped restores Segher Boessenkool
@ 2016-08-01  1:43 ` Segher Boessenkool
  2016-08-29  9:31   ` Bernd Schmidt
  2016-09-08 17:48   ` Jeff Law
  2016-08-01  1:43 ` [PATCH 2/9] cfgcleanup: Don't confuse CFI when -fshrink-wrap-separate Segher Boessenkool
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-01  1:43 UTC (permalink / raw)
  To: gcc-patches; +Cc: bschmidt, Segher Boessenkool

This patch adds a new command-line flag "-fshrink-wrap-separate", a status
flag "shrink_wrapped_separate", hooks for abstracting the target components,
and documentation for all those.

2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>

	* common.opt (-fshrink-wrap-separate): New flag.
	* doc/invoke.texi: Document it.
	* doc/tm.texi.in (Shrink-wrapping separate components): New subsection.
	* doc/tm.texi: Regenerate.
	* emit-rtl.h (struct rtl_data): New field shrink_wrapped_separate.
	* target.def (shrink_wrap): New hook vector.
	(get_separate_components, components_for_bb, disqualify_components,
	emit_prologue_components, emit_epilogue_components,
	set_handled_components): New hooks.
---
 gcc/common.opt      |  4 ++++
 gcc/doc/invoke.texi | 11 ++++++++++-
 gcc/doc/tm.texi     | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/doc/tm.texi.in  | 29 +++++++++++++++++++++++++++
 gcc/emit-rtl.h      |  4 ++++
 gcc/target.def      | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/gcc/common.opt b/gcc/common.opt
index 8a292ed..97d305f 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2168,6 +2168,10 @@ Common Report Var(flag_shrink_wrap) Optimization
 Emit function prologues only before parts of the function that need it,
 rather than at the top of the function.
 
+fshrink-wrap-separate
+Common Report Var(flag_shrink_wrap_separate) Init(1) Optimization
+Shrink-wrap parts of the prologue and epilogue separately.
+
 fsignaling-nans
 Common Report Var(flag_signaling_nans) Optimization SetByCombined
 Disable optimizations observable by IEEE signaling NaNs.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 22001f9..2ea1727 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -397,7 +397,8 @@ Objective-C and Objective-C++ Dialects}.
 -fschedule-insns -fschedule-insns2 -fsection-anchors @gol
 -fselective-scheduling -fselective-scheduling2 @gol
 -fsel-sched-pipelining -fsel-sched-pipelining-outer-loops @gol
--fsemantic-interposition -fshrink-wrap -fsignaling-nans @gol
+-fsemantic-interposition -fshrink-wrap -fshrink-wrap-separate @gol
+-fsignaling-nans @gol
 -fsingle-precision-constant -fsplit-ivs-in-unroller @gol
 -fsplit-paths @gol
 -fsplit-wide-types -fssa-backprop -fssa-phiopt @gol
@@ -6322,6 +6323,7 @@ compilation time.
 -fmove-loop-invariants @gol
 -freorder-blocks @gol
 -fshrink-wrap @gol
+-fshrink-wrap-separate @gol
 -fsplit-wide-types @gol
 -fssa-backprop @gol
 -fssa-phiopt @gol
@@ -7231,6 +7233,13 @@ Emit function prologues only before parts of the function that need it,
 rather than at the top of the function.  This flag is enabled by default at
 @option{-O} and higher.
 
+@item -fshrink-wrap-separate
+@opindex fshrink-wrap-separate
+Shrink-wrap separate parts of the prologue and epilogue separately, so that
+those parts are only executed when needed.
+This option is on by default, but has no effect unless @option{-fshrink-wrap}
+is also turned on.
+
 @item -fcaller-saves
 @opindex fcaller-saves
 Enable allocation of values to registers that are clobbered by
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 9edb006..5a5c5cab 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -2924,6 +2924,7 @@ This describes the stack layout and calling conventions.
 * Function Entry::
 * Profiling::
 * Tail Calls::
+* Shrink-wrapping separate components::
 * Stack Smashing Protection::
 * Miscellaneous Register Hooks::
 @end menu
@@ -4852,6 +4853,59 @@ This hook should add additional registers that are computed by the prologue to t
 True if a function's return statements should be checked for matching the function's return type.  This includes checking for falling off the end of a non-void function.  Return false if no such check should be made.
 @end deftypefn
 
+@node Shrink-wrapping separate components
+@subsection Shrink-wrapping separate components
+@cindex shrink-wrapping separate components
+
+The prologue does a lot of separate things: save callee-saved registers,
+do whatever needs to be done to be able to call things (save the return
+address, align the stack, whatever; different for each target), set up a
+stack frame, do whatever needs to be done for the static chain (if anything),
+set up registers for PIC, etc.  Using the following hooks those prologue
+or epilogue components can be shrink-wrapped separately, so that the
+initialization (and possibly teardown) those components do is not done on
+execution paths where it is unnecessary.
+
+What exactly those components are is up to the target code; the generic
+code treats them abstractly.
+
+@deftypefn {Target Hook} sbitmap TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS (void)
+This hook should return an @code{sbitmap} with the bits set for those
+components that can be separately shrink-wrapped in the current function.
+Return @code{NULL} if the current function should not get any separate
+shrink-wrapping.
+Don't define this hook if it would always return @code{NULL}.
+If it is defined, the other hooks in this group have to be defined as well.
+@end deftypefn
+
+@deftypefn {Target Hook} sbitmap TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB (basic_block)
+This hook should return an @code{sbitmap} with the bits set for those
+components where either the prologue component has to be executed before
+the @code{basic_block}, or the epilogue component after it, or both.
+@end deftypefn
+
+@deftypefn {Target Hook} void TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS (sbitmap @var{components}, edge @var{e}, sbitmap @var{edge_components}, bool @var{is_prologue})
+This hook should clear the bits in the @var{components} bitmap for those
+components in @var{edge_components} that the target cannot handle on edge
+@var{e}, where @var{is_prologue} says if this is for a prologue or an
+epilogue instead.
+@end deftypefn
+
+@deftypefn {Target Hook} void TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS (sbitmap)
+Emit prologue insns for the components indicated by the parameter.
+@end deftypefn
+
+@deftypefn {Target Hook} void TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS (sbitmap)
+Emit epilogue insns for the components indicated by the parameter.
+@end deftypefn
+
+@deftypefn {Target Hook} void TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS (sbitmap)
+Mark the components in the parameter as handled, so that the
+@code{prologue} and @code{epilogue} named patterns know to ignore those
+components.  The target code should not hang on to the @code{sbitmap}, it
+will be deleted after this call.
+@end deftypefn
+
 @node Stack Smashing Protection
 @subsection Stack smashing protection
 @cindex stack smashing protection
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index a72c3d8..aaa11e6 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -2530,6 +2530,7 @@ This describes the stack layout and calling conventions.
 * Function Entry::
 * Profiling::
 * Tail Calls::
+* Shrink-wrapping separate components::
 * Stack Smashing Protection::
 * Miscellaneous Register Hooks::
 @end menu
@@ -3789,6 +3790,34 @@ the function prologue.  Normally, the profiling code comes after.
 
 @hook TARGET_WARN_FUNC_RETURN
 
+@node Shrink-wrapping separate components
+@subsection Shrink-wrapping separate components
+@cindex shrink-wrapping separate components
+
+The prologue does a lot of separate things: save callee-saved registers,
+do whatever needs to be done to be able to call things (save the return
+address, align the stack, whatever; different for each target), set up a
+stack frame, do whatever needs to be done for the static chain (if anything),
+set up registers for PIC, etc.  Using the following hooks those prologue
+or epilogue components can be shrink-wrapped separately, so that the
+initialization (and possibly teardown) those components do is not done on
+execution paths where it is unnecessary.
+
+What exactly those components are is up to the target code; the generic
+code treats them abstractly.
+
+@hook TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
+
+@hook TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
+
+@hook TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
+
+@hook TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
+
+@hook TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
+
+@hook TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
+
 @node Stack Smashing Protection
 @subsection Stack smashing protection
 @cindex stack smashing protection
diff --git a/gcc/emit-rtl.h b/gcc/emit-rtl.h
index 39dfce9..001f775 100644
--- a/gcc/emit-rtl.h
+++ b/gcc/emit-rtl.h
@@ -254,6 +254,10 @@ struct GTY(()) rtl_data {
   /* True if we performed shrink-wrapping for the current function.  */
   bool shrink_wrapped;
 
+  /* True if we performed shrink-wrapping for separate components for
+     the current function.  */
+  bool shrink_wrapped_separate;
+
   /* Nonzero if function being compiled doesn't modify the stack pointer
      (ignoring the prologue and epilogue).  This is only valid after
      pass_stack_ptr_mod has run.  */
diff --git a/gcc/target.def b/gcc/target.def
index 929d9ea..b951aec 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -5773,6 +5773,63 @@ DEFHOOK
  bool, (tree),
  hook_bool_tree_true)
 
+#undef HOOK_PREFIX
+#define HOOK_PREFIX "TARGET_SHRINK_WRAP_"
+HOOK_VECTOR (TARGET_SHRINK_WRAP_HOOKS, shrink_wrap)
+
+DEFHOOK
+(get_separate_components,
+ "This hook should return an @code{sbitmap} with the bits set for those\n\
+components that can be separately shrink-wrapped in the current function.\n\
+Return @code{NULL} if the current function should not get any separate\n\
+shrink-wrapping.\n\
+Don't define this hook if it would always return @code{NULL}.\n\
+If it is defined, the other hooks in this group have to be defined as well.",
+ sbitmap, (void),
+ NULL)
+
+DEFHOOK
+(components_for_bb,
+ "This hook should return an @code{sbitmap} with the bits set for those\n\
+components where either the prologue component has to be executed before\n\
+the @code{basic_block}, or the epilogue component after it, or both.",
+ sbitmap, (basic_block),
+ NULL)
+
+DEFHOOK
+(disqualify_components,
+ "This hook should clear the bits in the @var{components} bitmap for those\n\
+components in @var{edge_components} that the target cannot handle on edge\n\
+@var{e}, where @var{is_prologue} says if this is for a prologue or an\n\
+epilogue instead.",
+ void, (sbitmap components, edge e, sbitmap edge_components, bool is_prologue),
+ NULL)
+
+DEFHOOK
+(emit_prologue_components,
+ "Emit prologue insns for the components indicated by the parameter.",
+ void, (sbitmap),
+ NULL)
+
+DEFHOOK
+(emit_epilogue_components,
+ "Emit epilogue insns for the components indicated by the parameter.",
+ void, (sbitmap),
+ NULL)
+
+DEFHOOK
+(set_handled_components,
+ "Mark the components in the parameter as handled, so that the\n\
+@code{prologue} and @code{epilogue} named patterns know to ignore those\n\
+components.  The target code should not hang on to the @code{sbitmap}, it\n\
+will be deleted after this call.",
+ void, (sbitmap),
+ NULL)
+
+HOOK_VECTOR_END (shrink_wrap)
+#undef HOOK_PREFIX
+#define HOOK_PREFIX "TARGET_"
+
 /* Determine the type of unwind info to emit for debugging.  */
 DEFHOOK
 (debug_unwind_info,
-- 
1.9.3

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 7/9] cprop: Leave RTX_FRAME_RELATED_P instructions alone
  2016-08-01  1:43 [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
                   ` (5 preceding siblings ...)
  2016-08-01  1:57 ` [PATCH 4/9] regrename: Don't rename restores Segher Boessenkool
@ 2016-08-01  1:57 ` Segher Boessenkool
  2016-09-08 18:34   ` Jeff Law
  2016-08-01  2:12 ` [PATCH 6/9] sel-sched: Don't mess with register restores Segher Boessenkool
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-01  1:57 UTC (permalink / raw)
  To: gcc-patches; +Cc: bschmidt, Segher Boessenkool

Doing cprop on frame-related instructions blows up spectacularly.
So don't.

2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>

	* regcprop.c (copyprop_hardreg_forward_1): Don't change
	RTX_FRAME_RELATED_P instructions.
---
 gcc/regcprop.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/gcc/regcprop.c b/gcc/regcprop.c
index 1498300..6aea74f 100644
--- a/gcc/regcprop.c
+++ b/gcc/regcprop.c
@@ -829,6 +829,10 @@ copyprop_hardreg_forward_1 (basic_block bb, struct value_data *vd)
 	    }
 	}
 
+      /* Don't change prologue instructions.  */
+      if (RTX_FRAME_RELATED_P (insn))
+	set = NULL;
+
       /* Special-case plain move instructions, since we may well
 	 be able to do the move from a different register class.  */
       if (set && REG_P (SET_SRC (set)))
-- 
1.9.3

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 4/9] regrename: Don't rename restores
  2016-08-01  1:43 [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
                   ` (4 preceding siblings ...)
  2016-08-01  1:57 ` [PATCH 9/9] rs6000: Separate shrink-wrapping Segher Boessenkool
@ 2016-08-01  1:57 ` Segher Boessenkool
  2016-09-08 17:54   ` Jeff Law
  2016-08-01  1:57 ` [PATCH 7/9] cprop: Leave RTX_FRAME_RELATED_P instructions alone Segher Boessenkool
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-01  1:57 UTC (permalink / raw)
  To: gcc-patches; +Cc: bschmidt, Segher Boessenkool

A restore is supposed to restore some certain register.  Restoring it
into some other register will not work.  Don't.

2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>

	* regrename.c (build_def_use): Invalidate chains that have a
	REG_CFA_RESTORE on some instruction.
---
 gcc/regrename.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/gcc/regrename.c b/gcc/regrename.c
index 54c7768..00a5d02 100644
--- a/gcc/regrename.c
+++ b/gcc/regrename.c
@@ -1867,6 +1867,12 @@ build_def_use (basic_block bb)
 		scan_rtx (insn, &XEXP (note, 0), NO_REGS, terminate_dead,
 			  OP_IN);
 	      }
+
+	  /* Step 8: Kill the chains involving register restores.  Those
+	     should restore _that_ register.  */
+	  for (note = REG_NOTES (insn); note; note = XEXP (note, 1))
+	    if (REG_NOTE_KIND (note) == REG_CFA_RESTORE)
+	      scan_rtx (insn, &XEXP (note, 0), NO_REGS, mark_all_read, OP_IN);
 	}
       else if (DEBUG_INSN_P (insn)
 	       && !VAR_LOC_UNKNOWN_P (INSN_VAR_LOCATION_LOC (insn)))
-- 
1.9.3

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components
  2016-08-01  1:43 [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
                   ` (2 preceding siblings ...)
  2016-08-01  1:43 ` [PATCH 2/9] cfgcleanup: Don't confuse CFI when -fshrink-wrap-separate Segher Boessenkool
@ 2016-08-01  1:57 ` Segher Boessenkool
  2016-09-08 19:03   ` Jeff Law
  2016-08-01  1:57 ` [PATCH 9/9] rs6000: Separate shrink-wrapping Segher Boessenkool
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-01  1:57 UTC (permalink / raw)
  To: gcc-patches; +Cc: bschmidt, Segher Boessenkool

This is the main substance of this patch series.

Instead of doing all of the prologue and epilogue in one spot, it often
is better to do components of it at different places, so that they are
executed less frequently.

What exactly is a component is completely up to the target; this code
treats it all abstractly, and uses hooks for the target to handle the
more concrete things.  Commonly there is one component for each callee-
saved register, for example.

Components can be executed more than once per function execution.  This
pass makes sure that a component's epilogue is not called more often
than the corresponding prologue has been, at any point in time; that the
prologue is called more often, wherever the prologue's effect is needed;
and that the epilogue is called as often as the prologue has been, when
the function exits.  It does this by first deciding which blocks need
which components active, and then placing prologue and epilogue
components to make that exactly true.

Deciding what blocks should run with a certain component active so that
the total cost of executing the prologues (and epilogues) is optimal, is
not a computationally feasible problem.  Instead, for each basic block,
we estimate the cost of putting a prologue right before the block, and
if that is cheaper than the total cost of putting prologues optimally
(according to the estimated cost) in the dominator subtrees strictly
dominated by this first block, place it at the first block instead.
This simple procedure places the components optimally for any dominator
sub tree where the root node's cost does not depend on anything outside
its subtree.

The cost is the execution frequency of all edges into the block coming
from blocks that do not have this component active.  The estimated cost
is the execution frequency of the block, minus the execution frequency
of any backedges (which by definition are coming from subtrees, so if
the "head" block gets a prologue, the source block of any backedge has
that component active as well).

Currently, the epilogues are placed as late as possible, given the
constraints.  This does not matter for execution cost, but we could
save a little bit of code size by placing the epilogues in a smarter
way.  This is a possible future optimisation.

Now all that is left is inserting prologues and epilogues on all edges
that jump into resp. out of the "active" set of blocks.  Often we need
to insert some components' prologues (or epilogues) on all edges into
(or out of) a block.  In theory cross-jumping can unify all such, but
in practice that often fails; besides, that is a lot of work.  So in
this case we insert the prologue and epilogue components at the "head"
or "tail" of a block, instead.

As a final optimisation, if a block needs a prologue and its immediate
dominator has the block as a post-dominator, the dominator gets the
prologue as well.

2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>

	* function.c (thread_prologue_and_epilogue_insns): Recompute the
	live info.  Call try_shrink_wrapping_separate.  Compute the
	prologue_seq afterwards, if it has possibly changed.  Compute the
	split_prologue_seq and epilogue_seq later, too.
	* shrink-wrap.c: #include cfgbuild.h.
	(dump_components): New function.
	(struct sw): New struct.
	(SW): New function.
	(init_separate_shrink_wrap): New function.
	(fini_separate_shrink_wrap): New function.
	(place_prologue_for_one_component): New function.
	(spread_components): New function.
	(disqualify_problematic_components): New function.
	(emit_common_heads_for_components): New function.
	(emit_common_tails_for_components): New function.
	(insert_prologue_epilogue_for_components): New function.
	(try_shrink_wrapping_separate): New function.
	* shrink-wrap.h: Declare try_shrink_wrapping_separate.
---
 gcc/function.c    |  15 +-
 gcc/shrink-wrap.c | 715 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/shrink-wrap.h |   1 +
 3 files changed, 729 insertions(+), 2 deletions(-)

diff --git a/gcc/function.c b/gcc/function.c
index bba0705..390e9a6 100644
--- a/gcc/function.c
+++ b/gcc/function.c
@@ -5912,6 +5912,12 @@ make_epilogue_seq (void)
 void
 thread_prologue_and_epilogue_insns (void)
 {
+  if (optimize > 1)
+    {
+      df_live_add_problem ();
+      df_live_set_all_dirty ();
+    }
+
   df_analyze ();
 
   /* Can't deal with multiple successors of the entry block at the
@@ -5922,9 +5928,7 @@ thread_prologue_and_epilogue_insns (void)
   edge entry_edge = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (cfun));
   edge orig_entry_edge = entry_edge;
 
-  rtx_insn *split_prologue_seq = make_split_prologue_seq ();
   rtx_insn *prologue_seq = make_prologue_seq ();
-  rtx_insn *epilogue_seq = make_epilogue_seq ();
 
   /* Try to perform a kind of shrink-wrapping, making sure the
      prologue/epilogue is emitted only around those parts of the
@@ -5932,6 +5936,13 @@ thread_prologue_and_epilogue_insns (void)
 
   try_shrink_wrapping (&entry_edge, prologue_seq);
 
+  df_analyze ();
+  try_shrink_wrapping_separate (entry_edge->dest);
+  if (crtl->shrink_wrapped_separate)
+    prologue_seq = make_prologue_seq ();
+
+  rtx_insn *split_prologue_seq = make_split_prologue_seq ();
+  rtx_insn *epilogue_seq = make_epilogue_seq ();
 
   rtl_profile_for_bb (EXIT_BLOCK_PTR_FOR_FN (cfun));
 
diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c
index b85b1c3..643e375 100644
--- a/gcc/shrink-wrap.c
+++ b/gcc/shrink-wrap.c
@@ -34,6 +34,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "output.h"
 #include "tree-pass.h"
 #include "cfgrtl.h"
+#include "cfgbuild.h"
 #include "params.h"
 #include "bb-reorder.h"
 #include "shrink-wrap.h"
@@ -1006,3 +1007,717 @@ try_shrink_wrapping (edge *entry_edge, rtx_insn *prologue_seq)
   BITMAP_FREE (bb_with);
   free_dominance_info (CDI_DOMINATORS);
 }
+\f
+/* Separate shrink-wrapping
+
+   Instead of putting all of the prologue and epilogue in one spot, we
+   can put parts of it in places where those components are executed less
+   frequently.  The following code does this, for prologue and epilogue
+   components that can be put in more than one location, and where those
+   components can be executed more than once (the epilogue component will
+   always be executed before the prologue component is executed a second
+   time).
+
+   What exactly is a component is target-dependent.  The more usual
+   components are simple saves/restores to/from the frame of callee-saved
+   registers.  This code treats components abstractly (as an sbitmap),
+   letting the target handle all details.
+
+   Prologue components are placed in such a way that for every component
+   the prologue is executed as infrequently as possible.  We do this by
+   walking the dominator tree, comparing the cost of placing a prologue
+   component before a block to the sum of costs determined for all subtrees
+   of that block.
+
+   From this placement, we then determine for each component all blocks
+   where at least one of this block's dominators (including itself) will
+   get a prologue inserted.  That then is how the components are placed.
+   We could place the epilogue components a bit smarter (we can save a
+   bit of code size sometimes); this is a possible future improvement.
+
+   Prologues and epilogues are preferably placed into a block, either at
+   the beginning or end of it, if it is needed for all predecessor resp.
+   successor edges; or placed on the edge otherwise.
+
+   If the placement of any prologue/epilogue leads to a situation we cannot
+   handle (for example, an abnormal edge would need to be split, or some
+   targets want to use some specific registers that may not be available
+   where we want to put them), separate shrink-wrapping for the components
+   in that prologue/epilogue is aborted.  */
+
+
+/* Print the sbitmap COMPONENTS to the DUMP_FILE if not empty, with the
+   label LABEL.  */
+static void
+dump_components (const char *label, sbitmap components)
+{
+  if (bitmap_empty_p (components))
+    return;
+
+  fprintf (dump_file, " [%s", label);
+
+  for (unsigned int j = 0; j < components->n_bits; j++)
+    if (bitmap_bit_p (components, j))
+      fprintf (dump_file, " %u", j);
+
+  fprintf (dump_file, "]");
+}
+
+/* The data we collect for each bb.  */
+struct sw {
+  /* What components does this BB need?  */
+  sbitmap needs_components;
+
+  /* What components does this BB have?  This is the main decision this
+     pass makes.  */
+  sbitmap has_components;
+
+  /* The components for which we placed code at the start of the BB (instead
+     of on all incoming edges).  */
+  sbitmap head_components;
+
+  /* The components for which we placed code at the end of the BB (instead
+     of on all outgoing edges).  */
+  sbitmap tail_components;
+
+  /* The frequency of executing the prologue for this BB, if a prologue is
+     placed on this BB.  This is a pessimistic estimate (no prologue is
+     needed for edges from blocks that have the component under consideration
+     active already).  */
+  gcov_type own_cost;
+
+  /* The frequency of executing the prologue for this BB and all BBs
+     dominated by it.  */
+  gcov_type total_cost;
+};
+
+/* A helper function for accessing the pass-specific info.  */
+static inline struct sw *
+SW (basic_block bb)
+{
+  gcc_assert (bb->aux);
+  return (struct sw *) bb->aux;
+}
+
+/* Create the pass-specific data structures for separately shrink-wrapping
+   with components COMPONENTS.  */
+static void
+init_separate_shrink_wrap (sbitmap components)
+{
+  basic_block bb;
+  FOR_ALL_BB_FN (bb, cfun)
+    {
+      bb->aux = xcalloc (1, sizeof (struct sw));
+
+      SW (bb)->needs_components = targetm.shrink_wrap.components_for_bb (bb);
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "bb %d components:", bb->index);
+	  dump_components ("has", SW (bb)->needs_components);
+	  fprintf (dump_file, "\n");
+	}
+
+      SW (bb)->has_components = sbitmap_alloc (SBITMAP_SIZE (components));
+      SW (bb)->head_components = sbitmap_alloc (SBITMAP_SIZE (components));
+      SW (bb)->tail_components = sbitmap_alloc (SBITMAP_SIZE (components));
+      bitmap_clear (SW (bb)->has_components);
+      bitmap_clear (SW (bb)->head_components);
+      bitmap_clear (SW (bb)->tail_components);
+    }
+}
+
+/* Destroy the pass-specific data.  */
+static void
+fini_separate_shrink_wrap (void)
+{
+  basic_block bb;
+  FOR_ALL_BB_FN (bb, cfun)
+    if (bb->aux)
+      {
+	sbitmap_free (SW (bb)->needs_components);
+	sbitmap_free (SW (bb)->has_components);
+	sbitmap_free (SW (bb)->head_components);
+	sbitmap_free (SW (bb)->tail_components);
+	free (bb->aux);
+	bb->aux = 0;
+      }
+}
+
+/* Place the prologue for component WHICH, in the basic blocks dominated
+   by HEAD.  Do a DFS over the dominator tree, and set bit WHICH in the
+   HAS_COMPONENTS of a block if either the block has that bit set in
+   NEEDS_COMPONENTS, or it is cheaper to place the prologue here than in all
+   dominator subtrees separately.  */
+static void
+place_prologue_for_one_component (unsigned int which, basic_block head)
+{
+  /* The block we are currently dealing with.  */
+  basic_block bb = head;
+  /* Is this the first time we visit this block, i.e. have we just gone
+     down the tree.  */
+  bool first_visit = true;
+
+  /* Walk the dominator tree, visit one block per iteration of this loop.
+     Each basic block is visited twice: once before visiting any children
+     of the block, and once after visiting all of them (leaf nodes are
+     visited only once).  As an optimization, we do not visit subtrees
+     that can no longer influence the prologue placement.  */
+  for (;;)
+    {
+      /* First visit of a block: set the (children) cost accumulator to zero;
+	 if the block does not have the component itself, walk down.  */
+      if (first_visit)
+	{
+	  /* Initialize the cost.  The cost is the block execution frequency
+	     that does not come from backedges.  Calculating this by simply
+	     adding the cost of all edges that aren't backedges does not
+	     work: this does not always add up to the block frequency at
+	     all, and even if it does, rounding error makes for bad
+	     decisions.  */
+	  SW (bb)->own_cost = bb->frequency;
+
+	  edge e;
+	  edge_iterator ei;
+	  FOR_EACH_EDGE (e, ei, bb->preds)
+	    if (dominated_by_p (CDI_DOMINATORS, e->src, bb))
+	      {
+		if (SW (bb)->own_cost > EDGE_FREQUENCY (e))
+		  SW (bb)->own_cost -= EDGE_FREQUENCY (e);
+		else
+		  SW (bb)->own_cost = 0;
+	      }
+
+	  SW (bb)->total_cost = 0;
+
+	  if (!bitmap_bit_p (SW (bb)->needs_components, which)
+	      && first_dom_son (CDI_DOMINATORS, bb))
+	    {
+	      bb = first_dom_son (CDI_DOMINATORS, bb);
+	      continue;
+	    }
+	}
+
+      /* If this block does need the component itself, or it is cheaper to
+	 put the prologue here than in all the descendants that need it,
+	 mark it so.  If this block's immediate post-dominator is dominated
+	 by this block, and that needs the prologue, we can put it on this
+	 block as well (earlier is better).  */
+      if (bitmap_bit_p (SW (bb)->needs_components, which)
+	  || SW (bb)->total_cost > SW (bb)->own_cost)
+	{
+	  SW (bb)->total_cost = SW (bb)->own_cost;
+	  bitmap_set_bit (SW (bb)->has_components, which);
+	}
+      else
+	{
+	  basic_block kid = get_immediate_dominator (CDI_POST_DOMINATORS, bb);
+	  if (dominated_by_p (CDI_DOMINATORS, kid, bb)
+	      && bitmap_bit_p (SW (kid)->has_components, which))
+	    {
+	      SW (bb)->total_cost = SW (bb)->own_cost;
+	      bitmap_set_bit (SW (bb)->has_components, which);
+	    }
+	}
+
+      /* We are back where we started, so we are done now.  */
+      if (bb == head)
+	return;
+
+      /* We now know the cost of the subtree rooted at the current block.
+	 Accumulate this cost in the parent.  */
+      basic_block parent = get_immediate_dominator (CDI_DOMINATORS, bb);
+      SW (parent)->total_cost += SW (bb)->total_cost;
+
+      /* Don't walk the tree down unless necessary.  */
+      if (next_dom_son (CDI_DOMINATORS, bb)
+          && SW (parent)->total_cost <= SW (parent)->own_cost)
+	{
+	  bb = next_dom_son (CDI_DOMINATORS, bb);
+	  first_visit = true;
+	}
+      else
+	{
+	  bb = parent;
+	  first_visit = false;
+	}
+    }
+}
+
+/* Mark HAS_COMPONENTS for every block dominated by at least one block with
+   PRO_COMPONENTS set for the respective components, starting at HEAD.  */
+static void
+spread_components (basic_block head)
+{
+  basic_block bb = head;
+  bool first_visit = true;
+  /* This keeps a tally of all components active.  */
+  sbitmap components = SW (head)->has_components;
+
+  for (;;)
+    {
+      if (first_visit)
+	{
+	  bitmap_ior (SW (bb)->has_components, SW (bb)->has_components,
+		      components);
+
+	  if (first_dom_son (CDI_DOMINATORS, bb))
+	    {
+	      components = SW (bb)->has_components;
+	      bb = first_dom_son (CDI_DOMINATORS, bb);
+	      continue;
+	    }
+	}
+
+      components = SW (bb)->has_components;
+
+      if (next_dom_son (CDI_DOMINATORS, bb))
+	{
+	  bb = next_dom_son (CDI_DOMINATORS, bb);
+	  basic_block parent = get_immediate_dominator (CDI_DOMINATORS, bb);
+	  components = SW (parent)->has_components;
+	  first_visit = true;
+	}
+      else
+	{
+	  if (bb == head)
+	    return;
+	  bb = get_immediate_dominator (CDI_DOMINATORS, bb);
+	  first_visit = false;
+	}
+    }
+}
+
+/* If we cannot handle placing some component's prologues or epilogues where
+   we decided we should place them, unmark that component in COMPONENTS so
+   that it is not wrapped separately.  */
+static void
+disqualify_problematic_components (sbitmap components)
+{
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, bb->succs)
+	{
+	  /* Find which components we want pro/epilogues for here.  */
+	  bitmap_and_compl (epi, SW (e->src)->has_components,
+			    SW (e->dest)->has_components);
+	  bitmap_and_compl (pro, SW (e->dest)->has_components,
+			    SW (e->src)->has_components);
+
+	  /* Ask the target what it thinks about things.  */
+	  if (!bitmap_empty_p (epi))
+	    targetm.shrink_wrap.disqualify_components (components, e, epi,
+						       false);
+	  if (!bitmap_empty_p (pro))
+	    targetm.shrink_wrap.disqualify_components (components, e, pro,
+						       true);
+
+	  /* If this edge doesn't need splitting, we're fine.  */
+	  if (single_pred_p (e->dest)
+	      && e->dest != EXIT_BLOCK_PTR_FOR_FN (cfun))
+	    continue;
+
+	  /* If the edge can be split, that is fine too.  */
+	  if ((e->flags & EDGE_ABNORMAL) == 0)
+	    continue;
+
+	  /* We also can handle sibcalls.  */
+	  if (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun))
+	    {
+	      gcc_assert (e->flags & EDGE_SIBCALL);
+	      continue;
+	    }
+
+	  /* Remove from consideration those components we would need
+	     pro/epilogues for on edges where we cannot insert them.  */
+	  bitmap_and_compl (components, components, epi);
+	  bitmap_and_compl (components, components, pro);
+
+	  if (dump_file && !bitmap_subset_p (epi, components))
+	    {
+	      fprintf (dump_file, "  BAD epi %d->%d", e->src->index,
+		       e->dest->index);
+	      if (e->flags & EDGE_EH)
+		fprintf (dump_file, " for EH");
+	      dump_components ("epi", epi);
+	      fprintf (dump_file, "\n");
+	    }
+
+	  if (dump_file && !bitmap_subset_p (pro, components))
+	    {
+	      fprintf (dump_file, "  BAD pro %d->%d", e->src->index,
+		       e->dest->index);
+	      if (e->flags & EDGE_EH)
+		fprintf (dump_file, " for EH");
+	      dump_components ("pro", pro);
+	      fprintf (dump_file, "\n");
+	    }
+	}
+    }
+
+  sbitmap_free (pro);
+  sbitmap_free (epi);
+}
+
+/* Place code for prologues and epilogues for COMPONENTS where we can put
+   that code at the start of basic blocks.  */
+static void
+emit_common_heads_for_components (sbitmap components)
+{
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components));
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      /* Find which prologue resp. epilogue components are needed for all
+	 predecessor edges to this block.  */
+
+      /* First, select all possible components.  */
+      bitmap_copy (epi, components);
+      bitmap_copy (pro, components);
+
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  if (e->flags & EDGE_ABNORMAL)
+	    {
+	      bitmap_clear (epi);
+	      bitmap_clear (pro);
+	      break;
+	    }
+
+	  /* Deselect those epilogue components that should not be inserted
+	     for this edge.  */
+	  bitmap_and_compl (tmp, SW (e->src)->has_components,
+			    SW (e->dest)->has_components);
+	  bitmap_and (epi, epi, tmp);
+
+	  /* Similar, for the prologue.  */
+	  bitmap_and_compl (tmp, SW (e->dest)->has_components,
+			    SW (e->src)->has_components);
+	  bitmap_and (pro, pro, tmp);
+	}
+
+      if (dump_file && !(bitmap_empty_p (epi) && bitmap_empty_p (pro)))
+	fprintf (dump_file, "  bb %d", bb->index);
+
+      if (dump_file && !bitmap_empty_p (epi))
+	dump_components ("epi", epi);
+      if (dump_file && !bitmap_empty_p (pro))
+	dump_components ("pro", pro);
+
+      if (dump_file && !(bitmap_empty_p (epi) && bitmap_empty_p (pro)))
+	fprintf (dump_file, "\n");
+
+      /* Place code after the BB note.  */
+      if (!bitmap_empty_p (pro))
+	{
+	  start_sequence ();
+	  targetm.shrink_wrap.emit_prologue_components (pro);
+	  rtx_insn *seq = get_insns ();
+	  end_sequence ();
+
+	  emit_insn_after (seq, bb_note (bb));
+
+	  bitmap_ior (SW (bb)->head_components, SW (bb)->head_components, pro);
+	}
+
+      if (!bitmap_empty_p (epi))
+	{
+	  start_sequence ();
+	  targetm.shrink_wrap.emit_epilogue_components (epi);
+	  rtx_insn *seq = get_insns ();
+	  end_sequence ();
+
+	  emit_insn_after (seq, bb_note (bb));
+
+	  bitmap_ior (SW (bb)->head_components, SW (bb)->head_components, epi);
+	}
+    }
+
+  sbitmap_free (pro);
+  sbitmap_free (epi);
+  sbitmap_free (tmp);
+}
+
+/* Place code for prologues and epilogues for COMPONENTS where we can put
+   that code at the end of basic blocks.  */
+static void
+emit_common_tails_for_components (sbitmap components)
+{
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components));
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      /* Find which prologue resp. epilogue components are needed for all
+	 successor edges from this block.  */
+      if (EDGE_COUNT (bb->succs) == 0)
+	continue;
+
+      /* First, select all possible components.  */
+      bitmap_copy (epi, components);
+      bitmap_copy (pro, components);
+
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, bb->succs)
+	{
+	  if (e->flags & EDGE_ABNORMAL)
+	    {
+	      bitmap_clear (epi);
+	      bitmap_clear (pro);
+	      break;
+	    }
+
+	  /* Deselect those epilogue components that should not be inserted
+	     for this edge, and also those that are already put at the head
+	     of the successor block.  */
+	  bitmap_and_compl (tmp, SW (e->src)->has_components,
+			    SW (e->dest)->has_components);
+	  bitmap_and_compl (tmp, tmp, SW (e->dest)->head_components);
+	  bitmap_and (epi, epi, tmp);
+
+	  /* Similarly, for the prologue.  */
+	  bitmap_and_compl (tmp, SW (e->dest)->has_components,
+			    SW (e->src)->has_components);
+	  bitmap_and_compl (tmp, tmp, SW (e->dest)->head_components);
+	  bitmap_and (pro, pro, tmp);
+	}
+
+      if (dump_file && !(bitmap_empty_p (epi) && bitmap_empty_p (pro)))
+	fprintf (dump_file, "  bb %d", bb->index);
+
+      if (dump_file && !bitmap_empty_p (epi))
+	dump_components ("epi", epi);
+      if (dump_file && !bitmap_empty_p (pro))
+	dump_components ("pro", pro);
+
+      if (dump_file && !(bitmap_empty_p (epi) && bitmap_empty_p (pro)))
+	fprintf (dump_file, "\n");
+
+      /* Put the code at the end of the BB, but before any final jump.  */
+      if (!bitmap_empty_p (epi))
+	{
+	  start_sequence ();
+	  targetm.shrink_wrap.emit_epilogue_components (epi);
+	  rtx_insn *seq = get_insns ();
+	  end_sequence ();
+
+	  rtx_insn *insn = BB_END (bb);
+	  if (control_flow_insn_p (insn))
+	    emit_insn_before (seq, insn);
+	  else
+	    emit_insn_after (seq, insn);
+
+	  bitmap_ior (SW (bb)->tail_components, SW (bb)->tail_components, epi);
+	}
+
+      if (!bitmap_empty_p (pro))
+	{
+	  start_sequence ();
+	  targetm.shrink_wrap.emit_prologue_components (pro);
+	  rtx_insn *seq = get_insns ();
+	  end_sequence ();
+
+	  rtx_insn *insn = BB_END (bb);
+	  if (JUMP_P (insn) || (CALL_P (insn) && SIBLING_CALL_P (insn)))
+	    emit_insn_before (seq, insn);
+	  else
+	    emit_insn_after (seq, insn);
+
+	  bitmap_ior (SW (bb)->tail_components, SW (bb)->tail_components, pro);
+	}
+    }
+
+  sbitmap_free (pro);
+  sbitmap_free (epi);
+  sbitmap_free (tmp);
+}
+
+/* Place prologues and epilogues for COMPONENTS on edges, if we haven't already
+   placed them inside blocks directly.  */
+static void
+insert_prologue_epilogue_for_components (sbitmap components)
+{
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      if (!bb->aux)
+	continue;
+
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, bb->succs)
+	{
+	  /* Find which pro/epilogue components are needed on this edge.  */
+	  bitmap_and_compl (epi, SW (e->src)->has_components,
+			    SW (e->dest)->has_components);
+	  bitmap_and_compl (pro, SW (e->dest)->has_components,
+			    SW (e->src)->has_components);
+	  bitmap_and (epi, epi, components);
+	  bitmap_and (pro, pro, components);
+
+	  /* Deselect those we already have put at the head or tail of the
+	     edge's dest resp. src.  */
+	  bitmap_and_compl (epi, epi, SW (e->dest)->head_components);
+	  bitmap_and_compl (pro, pro, SW (e->dest)->head_components);
+	  bitmap_and_compl (epi, epi, SW (e->src)->tail_components);
+	  bitmap_and_compl (pro, pro, SW (e->src)->tail_components);
+
+	  if (!bitmap_empty_p (epi) || !bitmap_empty_p (pro))
+	    {
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "  %d->%d", e->src->index,
+			   e->dest->index);
+		  dump_components ("epi", epi);
+		  dump_components ("pro", pro);
+		  fprintf (dump_file, "\n");
+		}
+
+	      /* Put the epilogue components in place.  */
+	      start_sequence ();
+	      targetm.shrink_wrap.emit_epilogue_components (epi);
+	      rtx_insn *seq = get_insns ();
+	      end_sequence ();
+
+	      if (e->flags & EDGE_SIBCALL)
+		{
+		  gcc_assert (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun));
+
+		  rtx_insn *insn = BB_END (e->src);
+		  gcc_assert (CALL_P (insn) && SIBLING_CALL_P (insn));
+		  emit_insn_before (seq, insn);
+		}
+	      else if (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun))
+		{
+		  gcc_assert (e->flags & EDGE_FALLTHRU);
+		  basic_block new_bb = split_edge (e);
+		  emit_insn_after (seq, BB_END (new_bb));
+		}
+	      else
+		insert_insn_on_edge (seq, e);
+
+	      /* Put the prologue components in place.  */
+	      start_sequence ();
+	      targetm.shrink_wrap.emit_prologue_components (pro);
+	      seq = get_insns ();
+	      end_sequence ();
+
+	      insert_insn_on_edge (seq, e);
+	    }
+	}
+    }
+
+  sbitmap_free (pro);
+  sbitmap_free (epi);
+
+  commit_edge_insertions ();
+}
+
+/* The main entry point to this subpass.  FIRST_BB is where the prologue
+   would be normally put.  */
+void
+try_shrink_wrapping_separate (basic_block first_bb)
+{
+  if (!(SHRINK_WRAPPING_ENABLED
+	&& flag_shrink_wrap_separate
+	&& optimize_function_for_speed_p (cfun)
+	&& targetm.shrink_wrap.get_separate_components))
+    return;
+
+  /* We need LIVE info.  */
+  if (!df_live)
+    return;
+
+  /* We don't handle "strange" functions.  */
+  if (cfun->calls_alloca
+      || cfun->calls_setjmp
+      || cfun->can_throw_non_call_exceptions
+      || crtl->calls_eh_return
+      || crtl->has_nonlocal_goto
+      || crtl->saves_all_registers)
+    return;
+
+  /* Ask the target what components there are.  If it returns NULL, don't
+     do anything.  */
+  sbitmap components = targetm.shrink_wrap.get_separate_components ();
+  if (!components)
+    return;
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  calculate_dominance_info (CDI_POST_DOMINATORS);
+
+  init_separate_shrink_wrap (components);
+
+  sbitmap_iterator sbi;
+  unsigned int j;
+  EXECUTE_IF_SET_IN_BITMAP (components, 0, j, sbi)
+    place_prologue_for_one_component (j, first_bb);
+
+  spread_components (first_bb);
+
+  disqualify_problematic_components (components);
+
+  /* Don't separately shrink-wrap anything where the "main" prologue will
+     go; the target code can often optimize things if it is presented with
+     all components together (say, if it generates store-multiple insns).  */
+  bitmap_and_compl (components, components, SW (first_bb)->has_components);
+
+  if (bitmap_empty_p (components))
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not wrapping anything separately.\n");
+    }
+  else
+    {
+      if (dump_file)
+	{
+	  fprintf (dump_file, "The components we wrap separately are");
+	  dump_components ("sep", components);
+	  fprintf (dump_file, "\n");
+
+	  fprintf (dump_file, "... Inserting common heads...\n");
+	}
+
+      emit_common_heads_for_components (components);
+
+      if (dump_file)
+	fprintf (dump_file, "... Inserting common tails...\n");
+
+      emit_common_tails_for_components (components);
+
+      if (dump_file)
+	fprintf (dump_file, "... Inserting the more difficult ones...\n");
+
+      insert_prologue_epilogue_for_components (components);
+
+      if (dump_file)
+	fprintf (dump_file, "... Done.\n");
+
+      targetm.shrink_wrap.set_handled_components (components);
+
+      crtl->shrink_wrapped_separate = true;
+    }
+
+  fini_separate_shrink_wrap ();
+
+  sbitmap_free (components);
+  free_dominance_info (CDI_DOMINATORS);
+  free_dominance_info (CDI_POST_DOMINATORS);
+}
diff --git a/gcc/shrink-wrap.h b/gcc/shrink-wrap.h
index e06ab37..05fcb41 100644
--- a/gcc/shrink-wrap.h
+++ b/gcc/shrink-wrap.h
@@ -25,6 +25,7 @@ along with GCC; see the file COPYING3.  If not see
 /* In shrink-wrap.c.  */
 extern bool requires_stack_frame_p (rtx_insn *, HARD_REG_SET, HARD_REG_SET);
 extern void try_shrink_wrapping (edge *entry_edge, rtx_insn *prologue_seq);
+extern void try_shrink_wrapping_separate (basic_block first_bb);
 #define SHRINK_WRAPPING_ENABLED \
   (flag_shrink_wrap && targetm.have_simple_return ())
 
-- 
1.9.3

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 9/9] rs6000: Separate shrink-wrapping
  2016-08-01  1:43 [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
                   ` (3 preceding siblings ...)
  2016-08-01  1:57 ` [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components Segher Boessenkool
@ 2016-08-01  1:57 ` Segher Boessenkool
  2016-08-01  1:57 ` [PATCH 4/9] regrename: Don't rename restores Segher Boessenkool
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-01  1:57 UTC (permalink / raw)
  To: gcc-patches; +Cc: bschmidt, Segher Boessenkool

This implements the hooks for separate shrink-wrapping for rs6000.
It handles GPRs and LR.  The GPRs get a component number corresponding
to their register number; LR gets component number 0.

This improves specint by 1.8%, specfp by 0.5%, separate benchmarks
much more.  It improves the hot path in various interpreters (some
improve by double digits), and e.g. in glibc's malloc.

2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>

	* config/rs6000/rs6000.c (machine_function): Add new fields
	gpr_is_wrapped_separately and lr_is_wrapped_separately.
	(TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS,
	TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB,
	TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS,
	TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS,
	TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS,
	TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Define.
	(rs6000_get_separate_components): New function.
	(rs6000_components_for_bb): New function.
	(rs6000_disqualify_components): New function.
	(rs6000_emit_prologue_components): New function.
	(rs6000_emit_epilogue_components): New function.
	(rs6000_set_handled_components): New function.
	(rs6000_emit_prologue): Don't emit LR save if lr_is_wrapped_separately.
	Don't emit GPR saves if gpr_is_wrapped_separately for that register.
	(restore_saved_lr): Don't restore LR if lr_is_wrapped_separately.
	(rs6000_emit_epilogue): Don't emit GPR restores if
	gpr_is_wrapped_separately for that register.  Don't make a
	REG_CFA_RESTORE note for registers we did not restore, either.
---
 gcc/config/rs6000/rs6000.c | 257 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 242 insertions(+), 15 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 46b46d7..723fea5 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -152,6 +152,10 @@ typedef struct GTY(()) machine_function
   bool split_stack_argp_used;
   /* Flag if r2 setup is needed with ELFv2 ABI.  */
   bool r2_setup_needed;
+  /* The components already handled by separate shrink-wrapping, which should
+     not be considered by the prologue and epilogue.  */
+  bool gpr_is_wrapped_separately[32];
+  bool lr_is_wrapped_separately;
 } machine_function;
 
 /* Support targetm.vectorize.builtin_mask_for_load.  */
@@ -1513,6 +1517,19 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_SET_UP_BY_PROLOGUE
 #define TARGET_SET_UP_BY_PROLOGUE rs6000_set_up_by_prologue
 
+#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
+#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS rs6000_get_separate_components
+#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
+#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB rs6000_components_for_bb
+#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
+#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS rs6000_disqualify_components
+#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS rs6000_emit_prologue_components
+#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS rs6000_emit_epilogue_components
+#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
+#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS rs6000_set_handled_components
+
 #undef TARGET_EXTRA_LIVE_ON_ENTRY
 #define TARGET_EXTRA_LIVE_ON_ENTRY rs6000_live_on_entry
 
@@ -26444,6 +26461,201 @@ rs6000_global_entry_point_needed_p (void)
   return cfun->machine->r2_setup_needed;
 }
 
+/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
+static sbitmap
+rs6000_get_separate_components (void)
+{
+  rs6000_stack_t *info = rs6000_stack_info ();
+
+  if (!(info->savres_strategy & SAVE_INLINE_GPRS)
+      || !(info->savres_strategy & REST_INLINE_GPRS)
+      || WORLD_SAVE_P (info))
+    return NULL;
+
+  sbitmap components = sbitmap_alloc (32);
+  bitmap_clear (components);
+
+  /* The GPRs we need saved to the frame.  */
+  int reg_size = TARGET_32BIT ? 4 : 8;
+  int offset = info->gp_save_offset;
+  if (info->push_p)
+    offset += info->total_size;
+
+  for (unsigned regno = info->first_gp_reg_save; regno < 32; regno++)
+    {
+      if (IN_RANGE (offset, -0x8000, 0x7fff)
+	  && rs6000_reg_live_or_pic_offset_p (regno))
+	bitmap_set_bit (components, regno);
+
+      offset += reg_size;
+    }
+
+  /* Don't mess with the hard frame pointer.  */
+  if (frame_pointer_needed)
+    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
+
+  /* Don't mess with the fixed TOC register.  */
+  if ((TARGET_TOC && TARGET_MINIMAL_TOC)
+      || (flag_pic == 1 && DEFAULT_ABI == ABI_V4)
+      || (flag_pic && DEFAULT_ABI == ABI_DARWIN))
+    bitmap_clear_bit (components, RS6000_PIC_OFFSET_TABLE_REGNUM);
+
+  /* Optimize LR save and restore if we can.  This is component 0.  */
+  if (info->lr_save_p
+      && !(flag_pic && (DEFAULT_ABI == ABI_V4 || DEFAULT_ABI == ABI_DARWIN)))
+    {
+      offset = info->lr_save_offset;
+      if (info->push_p)
+	offset += info->total_size;
+      if (IN_RANGE (offset, -0x8000, 0x7fff))
+	bitmap_set_bit (components, 0);
+    }
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
+static sbitmap
+rs6000_components_for_bb (basic_block bb)
+{
+  rs6000_stack_t *info = rs6000_stack_info ();
+
+  bitmap in = DF_LIVE_IN (bb);
+  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
+  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
+
+  sbitmap components = sbitmap_alloc (32);
+  bitmap_clear (components);
+
+  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
+  for (unsigned regno = info->first_gp_reg_save; regno < 32; regno++)
+    if (bitmap_bit_p (in, regno)
+	|| bitmap_bit_p (gen, regno)
+	|| bitmap_bit_p (kill, regno))
+      bitmap_set_bit (components, regno);
+
+  /* LR needs to be saved around a bb if it is killed in that bb.  */
+  if (bitmap_bit_p (kill, LR_REGNO))
+    bitmap_set_bit (components, 0);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.  */
+static void
+rs6000_disqualify_components (sbitmap components, edge e,
+			      sbitmap edge_components, bool /*is_prologue*/)
+{
+  /* Our LR pro/epilogue code moves LR via R0, so R0 had better not be
+     live where we want to place that code.  */
+  if (bitmap_bit_p (edge_components, 0)
+      && bitmap_bit_p (DF_LIVE_IN (e->dest), 0))
+    {
+      if (dump_file)
+	fprintf (dump_file, "Disqualifying LR because GPR0 is live "
+		 "on entry to bb %d\n", e->dest->index);
+      bitmap_clear_bit (components, 0);
+    }
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
+static void
+rs6000_emit_prologue_components (sbitmap components)
+{
+  rs6000_stack_t *info = rs6000_stack_info ();
+  rtx sp_reg_rtx = gen_rtx_REG (Pmode, STACK_POINTER_REGNUM);
+  int reg_size = TARGET_32BIT ? 4 : 8;
+
+  /* Prologue for LR.  */
+  if (bitmap_bit_p (components, 0))
+    {
+      rtx reg = gen_rtx_REG (Pmode, 0);
+      emit_move_insn (reg, gen_rtx_REG (Pmode, LR_REGNO));
+      RTX_FRAME_RELATED_P (get_last_insn ()) = 1;
+
+      int offset = info->lr_save_offset;
+      if (info->push_p)
+	offset += info->total_size;
+
+      emit_insn (gen_frame_store (reg, sp_reg_rtx, offset));
+      RTX_FRAME_RELATED_P (get_last_insn ()) = 1;
+    }
+
+  /* Prologue for the GPRs.  */
+  int offset = info->gp_save_offset;
+  if (info->push_p)
+    offset += info->total_size;
+
+  for (int i = info->first_gp_reg_save; i < 32; i++)
+    {
+      if (bitmap_bit_p (components, i))
+	{
+	  rtx reg = gen_rtx_REG (Pmode, i);
+	  emit_insn (gen_frame_store (reg, sp_reg_rtx, offset));
+	  RTX_FRAME_RELATED_P (get_last_insn ()) = 1;
+	}
+
+      offset += reg_size;
+    }
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
+static void
+rs6000_emit_epilogue_components (sbitmap components)
+{
+  rs6000_stack_t *info = rs6000_stack_info ();
+  rtx sp_reg_rtx = gen_rtx_REG (Pmode, STACK_POINTER_REGNUM);
+  int reg_size = TARGET_32BIT ? 4 : 8;
+
+  /* Epilogue for the GPRs.  */
+  int offset = info->gp_save_offset;
+  if (info->push_p)
+    offset += info->total_size;
+
+  for (int i = info->first_gp_reg_save; i < 32; i++)
+    {
+      if (bitmap_bit_p (components, i))
+	{
+	  rtx reg = gen_rtx_REG (Pmode, i);
+	  emit_insn (gen_frame_load (reg, sp_reg_rtx, offset));
+	  RTX_FRAME_RELATED_P (get_last_insn ()) = 1;
+	  add_reg_note (get_last_insn (), REG_CFA_RESTORE, reg);
+	}
+
+      offset += reg_size;
+    }
+
+  /* Epilogue for LR.  */
+  if (bitmap_bit_p (components, 0))
+    {
+      int offset = info->lr_save_offset;
+      if (info->push_p)
+	offset += info->total_size;
+
+      rtx reg = gen_rtx_REG (Pmode, 0);
+      emit_insn (gen_frame_load (reg, sp_reg_rtx, offset));
+
+      rtx lr = gen_rtx_REG (Pmode, LR_REGNO);
+      emit_move_insn (lr, reg);
+      RTX_FRAME_RELATED_P (get_last_insn ()) = 1;
+      add_reg_note (get_last_insn (), REG_CFA_RESTORE, lr);
+    }
+}
+
+/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
+static void
+rs6000_set_handled_components (sbitmap components)
+{
+  rs6000_stack_t *info = rs6000_stack_info ();
+
+  for (int i = info->first_gp_reg_save; i < 32; i++)
+    if (bitmap_bit_p (components, i))
+      cfun->machine->gpr_is_wrapped_separately[i] = true;
+
+  if (bitmap_bit_p (components, 0))
+    cfun->machine->lr_is_wrapped_separately = true;
+}
+
 /* Emit function prologue as insns.  */
 
 void
@@ -26701,7 +26913,8 @@ rs6000_emit_prologue (void)
     }
 
   /* If we use the link register, get it into r0.  */
-  if (!WORLD_SAVE_P (info) && info->lr_save_p)
+  if (!WORLD_SAVE_P (info) && info->lr_save_p
+      && !cfun->machine->lr_is_wrapped_separately)
     {
       rtx addr, reg, mem;
 
@@ -26929,13 +27142,16 @@ rs6000_emit_prologue (void)
     }
   else if (!WORLD_SAVE_P (info))
     {
-      int i;
-      for (i = 0; i < 32 - info->first_gp_reg_save; i++)
-	if (rs6000_reg_live_or_pic_offset_p (info->first_gp_reg_save + i))
-	  emit_frame_save (frame_reg_rtx, reg_mode,
-			   info->first_gp_reg_save + i,
-			   info->gp_save_offset + frame_off + reg_size * i,
-			   sp_off - frame_off);
+      int offset = info->gp_save_offset + frame_off;
+      for (int i = info->first_gp_reg_save; i < 32; i++)
+	{
+	  if (rs6000_reg_live_or_pic_offset_p (i)
+	      && !cfun->machine->gpr_is_wrapped_separately[i])
+	    emit_frame_save (frame_reg_rtx, reg_mode, i, offset,
+			     sp_off - frame_off);
+
+	  offset += reg_size;
+	}
     }
 
   if (crtl->calls_eh_return)
@@ -27740,6 +27956,9 @@ load_lr_save (int regno, rtx frame_reg_rtx, int offset)
 static void
 restore_saved_lr (int regno, bool exit_func)
 {
+  if (cfun->machine->lr_is_wrapped_separately)
+    return;
+
   rtx reg = gen_rtx_REG (Pmode, regno);
   rtx lr = gen_rtx_REG (Pmode, LR_REGNO);
   rtx_insn *insn = emit_move_insn (lr, reg);
@@ -28497,12 +28716,18 @@ rs6000_emit_epilogue (int sibcall)
     }
   else
     {
-      for (i = 0; i < 32 - info->first_gp_reg_save; i++)
-	if (rs6000_reg_live_or_pic_offset_p (info->first_gp_reg_save + i))
-	  emit_insn (gen_frame_load
-		     (gen_rtx_REG (reg_mode, info->first_gp_reg_save + i),
-		      frame_reg_rtx,
-		      info->gp_save_offset + frame_off + reg_size * i));
+      int offset = info->gp_save_offset + frame_off;
+      for (i = info->first_gp_reg_save; i < 32; i++)
+	{
+	  if (rs6000_reg_live_or_pic_offset_p (i)
+	      && !cfun->machine->gpr_is_wrapped_separately[i])
+	    {
+	      rtx reg = gen_rtx_REG (reg_mode, i);
+	      emit_insn (gen_frame_load (reg, frame_reg_rtx, offset));
+	    }
+
+	  offset += reg_size;
+	}
     }
 
   if (DEFAULT_ABI == ABI_V4 || flag_shrink_wrap)
@@ -28541,8 +28766,10 @@ rs6000_emit_epilogue (int sibcall)
 	    || using_load_multiple
 	    || rs6000_reg_live_or_pic_offset_p (i))
 	  {
-	    rtx reg = gen_rtx_REG (reg_mode, i);
+	    if (cfun->machine->gpr_is_wrapped_separately[i])
+	      continue;
 
+	    rtx reg = gen_rtx_REG (reg_mode, i);
 	    cfa_restores = alloc_reg_note (REG_CFA_RESTORE, reg, cfa_restores);
 	  }
     }
-- 
1.9.3

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 6/9] sel-sched: Don't mess with register restores
  2016-08-01  1:43 [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
                   ` (6 preceding siblings ...)
  2016-08-01  1:57 ` [PATCH 7/9] cprop: Leave RTX_FRAME_RELATED_P instructions alone Segher Boessenkool
@ 2016-08-01  2:12 ` Segher Boessenkool
  2016-08-04  7:33   ` Andrey Belevantsev
  2016-09-08 17:55   ` Jeff Law
  2016-08-01  2:12 ` [PATCH 5/9] regrename: Don't run if function was separately shrink-wrapped Segher Boessenkool
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-01  2:12 UTC (permalink / raw)
  To: gcc-patches; +Cc: bschmidt, Segher Boessenkool

If selective scheduling copies register restores it confuses dwarf2cfi.

2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>

	* sel-sched-ir.c (init_global_and_expr_for_insn): Don't copy
	instructions with a REG_CFA_RESTORE note.
---
 gcc/sel-sched-ir.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/sel-sched-ir.c b/gcc/sel-sched-ir.c
index 83f813a..4a3984a 100644
--- a/gcc/sel-sched-ir.c
+++ b/gcc/sel-sched-ir.c
@@ -3015,6 +3015,7 @@ init_global_and_expr_for_insn (insn_t insn)
           /* TRAP_IF though have an INSN code is control_flow_insn_p ().  */
           || control_flow_insn_p (insn)
           || volatile_insn_p (PATTERN (insn))
+	  || find_reg_note (insn, REG_CFA_RESTORE, NULL)
           || (targetm.cannot_copy_insn_p
               && targetm.cannot_copy_insn_p (insn)))
         force_unique_p = true;
-- 
1.9.3

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 5/9] regrename: Don't run if function was separately shrink-wrapped
  2016-08-01  1:43 [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
                   ` (7 preceding siblings ...)
  2016-08-01  2:12 ` [PATCH 6/9] sel-sched: Don't mess with register restores Segher Boessenkool
@ 2016-08-01  2:12 ` Segher Boessenkool
  2016-09-08 17:54   ` Jeff Law
  2016-08-04  0:05 ` [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
  2016-08-26 13:03 ` Bernd Schmidt
  10 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-01  2:12 UTC (permalink / raw)
  To: gcc-patches; +Cc: bschmidt, Segher Boessenkool

Unfortunately even after the previous patch there are still a few cases
where regrename creates invalid code when used together with separate
shrink-wrapping.  At noreturn exits regrename thinks it can use all
callee-saved registers, but that is not true.  This patch disables
regrename for functions that were separately shrink-wrapped.

2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>

	* regrename.c (gate): Disable regrename if shrink_wrapped_separate
	is set in crtl.
---
 gcc/regrename.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/gcc/regrename.c b/gcc/regrename.c
index 00a5d02..e2d2483 100644
--- a/gcc/regrename.c
+++ b/gcc/regrename.c
@@ -1956,7 +1956,12 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *)
     {
-      return (optimize > 0 && (flag_rename_registers));
+      /* regrename can decide to use callee-saved registers in places where
+	 those should hold their original contents.  This happens when
+	 separate shrink-wrapping is used.  Disable regrename in that case,
+	 until this problem is fixed.  */
+      return (optimize > 0 && flag_rename_registers
+	      && !crtl->shrink_wrapped_separate);
     }
 
   virtual unsigned int execute (function *) { return regrename_optimize (); }
-- 
1.9.3

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-08-01  1:43 [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
                   ` (8 preceding siblings ...)
  2016-08-01  2:12 ` [PATCH 5/9] regrename: Don't run if function was separately shrink-wrapped Segher Boessenkool
@ 2016-08-04  0:05 ` Segher Boessenkool
  2016-08-24 16:04   ` Segher Boessenkool
  2016-08-26 13:03 ` Bernd Schmidt
  10 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-04  0:05 UTC (permalink / raw)
  To: gcc-patches; +Cc: bschmidt

Ping.


Segher


On Mon, Aug 01, 2016 at 01:42:37AM +0000, Segher Boessenkool wrote:
> This is the second version.  Concern was renamed to component, and all
> other comments were addressed (I hope).  It still uses only two bitmaps
> for the component placement, but now they are called needs_components
> and has_components, which hopefully is easier to follow.  The "can this
> prologue be moved earlier for no cost" test is generalised a bit, and
> the cost estimate explicitly excludes backedges, which makes it easier
> to see that prologues will not be placed inside a loop if there is no
> benefit to that (I didn't see any different generated code because of
> this change).  And new and updated comments all over, of course.
> 
> Is this okay for trunk?
> 
> 
> Segher
> 
> 
>  gcc/cfgcleanup.c           |   5 +
>  gcc/common.opt             |   4 +
>  gcc/config/rs6000/rs6000.c | 257 ++++++++++++++++-
>  gcc/dce.c                  |   9 +
>  gcc/doc/invoke.texi        |  11 +-
>  gcc/doc/tm.texi            |  54 ++++
>  gcc/doc/tm.texi.in         |  29 ++
>  gcc/emit-rtl.h             |   4 +
>  gcc/function.c             |  15 +-
>  gcc/regcprop.c             |   4 +
>  gcc/regrename.c            |  13 +-
>  gcc/sel-sched-ir.c         |   1 +
>  gcc/shrink-wrap.c          | 705 +++++++++++++++++++++++++++++++++++++++++++++
>  gcc/shrink-wrap.h          |   1 +
>  gcc/target.def             |  57 ++++
>  15 files changed, 1150 insertions(+), 19 deletions(-)
> 
> -- 
> 1.9.3

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 6/9] sel-sched: Don't mess with register restores
  2016-08-01  2:12 ` [PATCH 6/9] sel-sched: Don't mess with register restores Segher Boessenkool
@ 2016-08-04  7:33   ` Andrey Belevantsev
  2016-09-08 17:55   ` Jeff Law
  1 sibling, 0 replies; 74+ messages in thread
From: Andrey Belevantsev @ 2016-08-04  7:33 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches; +Cc: bschmidt

Hello,

On 01.08.2016 4:42, Segher Boessenkool wrote:
> If selective scheduling copies register restores it confuses dwarf2cfi.
> 
> 2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
> 
> 	* sel-sched-ir.c (init_global_and_expr_for_insn): Don't copy
> 	instructions with a REG_CFA_RESTORE note.

OK from sel-sched POV.

Best,
Andrey

> ---
>  gcc/sel-sched-ir.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/gcc/sel-sched-ir.c b/gcc/sel-sched-ir.c
> index 83f813a..4a3984a 100644
> --- a/gcc/sel-sched-ir.c
> +++ b/gcc/sel-sched-ir.c
> @@ -3015,6 +3015,7 @@ init_global_and_expr_for_insn (insn_t insn)
>            /* TRAP_IF though have an INSN code is control_flow_insn_p ().  */
>            || control_flow_insn_p (insn)
>            || volatile_insn_p (PATTERN (insn))
> +	  || find_reg_note (insn, REG_CFA_RESTORE, NULL)
>            || (targetm.cannot_copy_insn_p
>                && targetm.cannot_copy_insn_p (insn)))
>          force_unique_p = true;
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-08-04  0:05 ` [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
@ 2016-08-24 16:04   ` Segher Boessenkool
  0 siblings, 0 replies; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-24 16:04 UTC (permalink / raw)
  To: gcc-patches; +Cc: bschmidt

Ping x2.

On Wed, Aug 03, 2016 at 07:05:34PM -0500, Segher Boessenkool wrote:
> Ping.
> 
> 
> Segher
> 
> 
> On Mon, Aug 01, 2016 at 01:42:37AM +0000, Segher Boessenkool wrote:
> > This is the second version.  Concern was renamed to component, and all
> > other comments were addressed (I hope).  It still uses only two bitmaps
> > for the component placement, but now they are called needs_components
> > and has_components, which hopefully is easier to follow.  The "can this
> > prologue be moved earlier for no cost" test is generalised a bit, and
> > the cost estimate explicitly excludes backedges, which makes it easier
> > to see that prologues will not be placed inside a loop if there is no
> > benefit to that (I didn't see any different generated code because of
> > this change).  And new and updated comments all over, of course.
> > 
> > Is this okay for trunk?
> > 
> > 
> > Segher
> > 
> > 
> >  gcc/cfgcleanup.c           |   5 +
> >  gcc/common.opt             |   4 +
> >  gcc/config/rs6000/rs6000.c | 257 ++++++++++++++++-
> >  gcc/dce.c                  |   9 +
> >  gcc/doc/invoke.texi        |  11 +-
> >  gcc/doc/tm.texi            |  54 ++++
> >  gcc/doc/tm.texi.in         |  29 ++
> >  gcc/emit-rtl.h             |   4 +
> >  gcc/function.c             |  15 +-
> >  gcc/regcprop.c             |   4 +
> >  gcc/regrename.c            |  13 +-
> >  gcc/sel-sched-ir.c         |   1 +
> >  gcc/shrink-wrap.c          | 705 +++++++++++++++++++++++++++++++++++++++++++++
> >  gcc/shrink-wrap.h          |   1 +
> >  gcc/target.def             |  57 ++++
> >  15 files changed, 1150 insertions(+), 19 deletions(-)
> > 
> > -- 
> > 1.9.3

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-08-01  1:43 [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
                   ` (9 preceding siblings ...)
  2016-08-04  0:05 ` [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
@ 2016-08-26 13:03 ` Bernd Schmidt
  2016-08-26 13:48   ` David Malcolm
                     ` (2 more replies)
  10 siblings, 3 replies; 74+ messages in thread
From: Bernd Schmidt @ 2016-08-26 13:03 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2083 bytes --]

On 08/01/2016 03:42 AM, Segher Boessenkool wrote:
> This is the second version.  Concern was renamed to component, and all
> other comments were addressed (I hope).

Not really, I'm afraid. There still seems to be no detailed explanation 
of the algorithm. Instead, there is a vague outline (which should be 
expanded to at least the level of detail found e.g. in tree-ssa-pre.c), 
and inside the code there are still meaningless comments of the form

/* Deselect those epilogue components that should not be inserted
    for this edge.  */

which don't tell me anything about what the intention is (why should 
they not be inserted?). The result is that as a reader, I still find 
myself unable to determine whether the algorithm is correct or not.

Worse, I'm still not entirely certain what it is intended to achieve: I 
asked for a motivating example or two, but haven't seen one in the 
comments anywhere. My most general question would be why the algorithm 
for placing individual prologue components would be different from the 
regular shrink-wrapping algorithm for full prologues. Examples and/or 
arguments relating to how this new code acts with regard to loops are 
also particularly needed.

So, I think improvement is necessary in these points, but I also think 
that there's room for experimental verification by way of self-tests. If 
done thoroughly (testing the transformation on several sufficiently 
random CFGs and maybe some manually constructed ones) it would go a long 
way towards showing that at least we don't have to worry too much about 
miscompilations. That's what I've been working on, and an initial patch 
is below. This is incomplete and posted more as an RFD since we're 
getting towards the end of the week: there are gaps in the test 
coverage, and it currently fails the selftest. I assume the latter is a 
problem with my code, but it wouldn't hurt if you could take a look; 
maybe I misunderstood something entirely about what the separate 
shrink-wrapping is supposed to achieve, or maybe I messed up the 
algorithm with my changes.


Bernd

[-- Attachment #2: sepsw-st4.diff --]
[-- Type: text/x-patch, Size: 33547 bytes --]

diff '--width=118' -x .svn -dru sw-selftest-base/gcc/sbitmap.c with-selftest/gcc/sbitmap.c
--- sw-selftest-base/gcc/sbitmap.c	2016-04-29 10:46:49.189700651 +0200
+++ with-selftest/gcc/sbitmap.c	2016-08-26 11:40:05.043374153 +0200
@@ -319,7 +319,7 @@
       *dstp++ = *ap++;
 }
 
-/* Return true if there are any bits set in A are also set in B.
+/* Return true if there are any bits set in A which are also set in B.
    Return false otherwise.  */
 
 bool
@@ -335,6 +335,24 @@
       return true;
 
   return false;
+}
+
+/* Return true if there are any bits set in A which are not set in B.
+   Return false otherwise.  */
+
+bool
+bitmap_intersect_compl_p (const_sbitmap a, const_sbitmap b)
+{
+  const_sbitmap_ptr ap = a->elms;
+  const_sbitmap_ptr bp = b->elms;
+  unsigned int i, n;
+
+  n = MIN (a->size, b->size);
+  for (i = 0; i < n; i++)
+    if ((*ap++ & ~*bp++) != 0)
+      return true;
+
+  return false;
 }
 
 /* Set DST to be (A and B).
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/sbitmap.h with-selftest/gcc/sbitmap.h
--- sw-selftest-base/gcc/sbitmap.h	2016-08-01 12:42:29.678435973 +0200
+++ with-selftest/gcc/sbitmap.h	2016-08-26 11:40:14.520040887 +0200
@@ -246,6 +246,7 @@
 extern bool bitmap_and_or (sbitmap, const_sbitmap,
 				     const_sbitmap, const_sbitmap);
 extern bool bitmap_intersect_p (const_sbitmap, const_sbitmap);
+extern bool bitmap_intersect_compl_p (const_sbitmap, const_sbitmap);
 extern bool bitmap_and (sbitmap, const_sbitmap, const_sbitmap);
 extern bool bitmap_ior (sbitmap, const_sbitmap, const_sbitmap);
 extern bool bitmap_xor (sbitmap, const_sbitmap, const_sbitmap);
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/selftest.h with-selftest/gcc/selftest.h
--- sw-selftest-base/gcc/selftest.h	2016-08-01 12:42:57.738436176 +0200
+++ with-selftest/gcc/selftest.h	2016-08-26 11:49:50.656711729 +0200
@@ -92,6 +92,11 @@
 extern void tree_cfg_c_tests ();
 extern void vec_c_tests ();
 extern void wide_int_cc_tests ();
+extern void shrink_wrapping_separate_tests ();
+
+/* Helper functions to set up certain tests.  */
+extern basic_block build_random_cfg (int);
+extern void free_random_cfg ();
 
 extern int num_passes;
 
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/selftest-run-tests.c with-selftest/gcc/selftest-run-tests.c
--- sw-selftest-base/gcc/selftest-run-tests.c	2016-08-01 12:43:07.411769580 +0200
+++ with-selftest/gcc/selftest-run-tests.c	2016-08-26 11:50:34.160045378 +0200
@@ -67,8 +67,9 @@
   spellcheck_tree_c_tests ();
   tree_cfg_c_tests ();
 
-  /* This one relies on most of the above.  */
+  /* These ones relies on most of the above.  */
   function_tests_c_tests ();
+  shrink_wrapping_separate_tests ();
 
   /* Finished running tests.  */
   long finish_time = get_run_time ();
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/shrink-wrap.c with-selftest/gcc/shrink-wrap.c
--- sw-selftest-base/gcc/shrink-wrap.c	2016-08-25 13:50:11.013943584 +0200
+++ with-selftest/gcc/shrink-wrap.c	2016-08-26 13:25:48.843420128 +0200
@@ -41,7 +41,7 @@
 #include "regcprop.h"
 #include "rtl-iter.h"
 #include "valtrack.h"
-
+#include "selftest.h"
 
 /* Return true if INSN requires the stack frame to be set up.
    PROLOGUE_USED contains the hard registers used in the function
@@ -1091,6 +1091,58 @@
   gcov_type total_cost;
 };
 
+/* An abstract base class which holds the implementation of the algorithm.
+   Derived classes exist for actual shrink-wrapping, and for self-test.  */
+class sw_sep_base
+{
+  virtual sbitmap components_for_bb (basic_block) = 0;
+  virtual void disqualify (edge, sbitmap, bool) = 0;
+  virtual void emit_prologue_into_bb (basic_block, sbitmap, bool) = 0;
+  virtual void emit_epilogue_into_bb (basic_block, sbitmap, bool) = 0;
+  virtual void emit_prologue_on_edge (edge, sbitmap) = 0;
+  virtual void emit_epilogue_on_edge (edge, sbitmap) = 0;
+  virtual void set_handled () = 0;
+
+  void init ();
+  void place_prologue_for_one_component (unsigned int, basic_block);
+  void spread_components (basic_block head);
+  void disqualify_problematic_components ();
+  void emit_common_heads_for_components ();
+  void emit_common_tails_for_components ();
+  void insert_prologue_epilogue_for_components ();
+
+ protected:
+  basic_block m_first_bb;
+  sbitmap m_components;
+
+  sw_sep_base (basic_block first_bb, sbitmap components)
+    : m_first_bb (first_bb), m_components (components)
+  {
+  }
+
+ public:
+  virtual void fini ();
+  sbitmap run ();
+};
+
+/* The derived class that performs the real optimization.  */
+class sw_sep : public sw_sep_base
+{
+  sbitmap components_for_bb (basic_block);
+  void disqualify (edge, sbitmap, bool);
+  void emit_prologue_into_bb (basic_block, sbitmap, bool);
+  void emit_epilogue_into_bb (basic_block, sbitmap, bool);
+  void emit_prologue_on_edge (edge, sbitmap);
+  void emit_epilogue_on_edge (edge, sbitmap);
+  void set_handled ();
+
+ public:
+  sw_sep (basic_block first_bb, sbitmap components)
+    : sw_sep_base (first_bb, components)
+  {
+  }
+};
+
 /* A helper function for accessing the pass-specific info.  */
 static inline struct sw *
 SW (basic_block bb)
@@ -1101,26 +1153,19 @@
 
 /* Create the pass-specific data structures for separately shrink-wrapping
    with components COMPONENTS.  */
-static void
-init_separate_shrink_wrap (sbitmap components)
+void
+sw_sep_base::init ()
 {
   basic_block bb;
   FOR_ALL_BB_FN (bb, cfun)
     {
       bb->aux = xcalloc (1, sizeof (struct sw));
 
-      SW (bb)->needs_components = targetm.shrink_wrap.components_for_bb (bb);
-
-      if (dump_file)
-	{
-	  fprintf (dump_file, "bb %d components:", bb->index);
-	  dump_components ("has", SW (bb)->needs_components);
-	  fprintf (dump_file, "\n");
-	}
+      SW (bb)->needs_components = components_for_bb (bb);
 
-      SW (bb)->has_components = sbitmap_alloc (SBITMAP_SIZE (components));
-      SW (bb)->head_components = sbitmap_alloc (SBITMAP_SIZE (components));
-      SW (bb)->tail_components = sbitmap_alloc (SBITMAP_SIZE (components));
+      SW (bb)->has_components = sbitmap_alloc (SBITMAP_SIZE (m_components));
+      SW (bb)->head_components = sbitmap_alloc (SBITMAP_SIZE (m_components));
+      SW (bb)->tail_components = sbitmap_alloc (SBITMAP_SIZE (m_components));
       bitmap_clear (SW (bb)->has_components);
       bitmap_clear (SW (bb)->head_components);
       bitmap_clear (SW (bb)->tail_components);
@@ -1128,8 +1173,8 @@
 }
 
 /* Destroy the pass-specific data.  */
-static void
-fini_separate_shrink_wrap (void)
+void
+sw_sep_base::fini (void)
 {
   basic_block bb;
   FOR_ALL_BB_FN (bb, cfun)
@@ -1144,13 +1189,132 @@
       }
 }
 
+sbitmap
+sw_sep::components_for_bb (basic_block bb)
+{
+  sbitmap b = targetm.shrink_wrap.components_for_bb (bb);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "bb %d components:", bb->index);
+      dump_components ("has", b);
+      fprintf (dump_file, "\n");
+    }
+  return b;
+}
+
+void
+sw_sep::disqualify (edge e, sbitmap edge_components, bool is_prologue)
+{
+  targetm.shrink_wrap.disqualify_components (m_components, e, edge_components,
+					     is_prologue);
+}
+
+/* Call the target hook to emit epilogue components for EPI, and place
+   the insns on edge E.  */
+void
+sw_sep::emit_epilogue_on_edge (edge e, sbitmap epi)
+{
+  start_sequence ();
+  targetm.shrink_wrap.emit_epilogue_components (epi);
+  rtx_insn *seq = get_insns ();
+  end_sequence ();
+
+  if (e->flags & EDGE_SIBCALL)
+    {
+      gcc_assert (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun));
+
+      rtx_insn *insn = BB_END (e->src);
+      gcc_assert (CALL_P (insn) && SIBLING_CALL_P (insn));
+      emit_insn_before (seq, insn);
+    }
+  else if (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun))
+    {
+      gcc_assert (e->flags & EDGE_FALLTHRU);
+      basic_block new_bb = split_edge (e);
+      emit_insn_after (seq, BB_END (new_bb));
+    }
+  else
+    insert_insn_on_edge (seq, e);
+}
+
+/* Call the target hook to emit prologue components for EPI, and place
+   the insns on edge E.  */
+void
+sw_sep::emit_prologue_on_edge (edge e, sbitmap pro)
+{
+  start_sequence ();
+  targetm.shrink_wrap.emit_prologue_components (pro);
+  rtx_insn *seq = get_insns ();
+  end_sequence ();
+
+  insert_insn_on_edge (seq, e);
+}
+
+/* Call the target hook to emit epilogue components for EPI, and place
+   the insns in basic block BB, at the start if AT_HEAD, at the end
+   otherwise.  */
+void
+sw_sep::emit_epilogue_into_bb (basic_block bb, sbitmap epi, bool at_head)
+{
+  start_sequence ();
+  targetm.shrink_wrap.emit_epilogue_components (epi);
+  rtx_insn *seq = get_insns ();
+  end_sequence ();
+
+  if (at_head)
+    emit_insn_after (seq, bb_note (bb));
+  else
+    {
+      rtx_insn *insn = BB_END (bb);
+      if (control_flow_insn_p (insn))
+	emit_insn_before (seq, insn);
+      else
+	emit_insn_after (seq, insn);
+    }
+}
+
+/* Call the target hook to emit prologue components for PRO, and place
+   the insns in basic block BB, at the start if AT_HEAD, at the end
+   otherwise.  */
+void
+sw_sep::emit_prologue_into_bb (basic_block bb, sbitmap pro, bool at_head)
+{
+  start_sequence ();
+  targetm.shrink_wrap.emit_prologue_components (pro);
+  rtx_insn *seq = get_insns ();
+  end_sequence ();
+
+  if (at_head)
+    emit_insn_after (seq, bb_note (bb));
+  else
+    {
+      rtx_insn *insn = BB_END (bb);
+      if (control_flow_insn_p (insn))
+	emit_insn_before (seq, insn);
+      else
+	emit_insn_after (seq, insn);
+    }
+}
+
+/* Called by the engine if separate shrink-wrapping was successful for
+   some components.  */
+void
+sw_sep::set_handled ()
+{
+  targetm.shrink_wrap.set_handled_components (m_components);
+
+  crtl->shrink_wrapped_separate = true;
+}
+
 /* Place the prologue for component WHICH, in the basic blocks dominated
    by HEAD.  Do a DFS over the dominator tree, and set bit WHICH in the
    HAS_COMPONENTS of a block if either the block has that bit set in
    NEEDS_COMPONENTS, or it is cheaper to place the prologue here than in all
    dominator subtrees separately.  */
-static void
-place_prologue_for_one_component (unsigned int which, basic_block head)
+void
+sw_sep_base::place_prologue_for_one_component (unsigned int which,
+					       basic_block head)
 {
   /* The block we are currently dealing with.  */
   basic_block bb = head;
@@ -1246,8 +1410,8 @@
 
 /* Mark HAS_COMPONENTS for every block dominated by at least one block with
    PRO_COMPONENTS set for the respective components, starting at HEAD.  */
-static void
-spread_components (basic_block head)
+void
+sw_sep_base::spread_components (basic_block head)
 {
   basic_block bb = head;
   bool first_visit = true;
@@ -1291,11 +1455,11 @@
 /* If we cannot handle placing some component's prologues or epilogues where
    we decided we should place them, unmark that component in COMPONENTS so
    that it is not wrapped separately.  */
-static void
-disqualify_problematic_components (sbitmap components)
+void
+sw_sep_base::disqualify_problematic_components ()
 {
-  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
-  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (m_components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (m_components));
 
   basic_block bb;
   FOR_EACH_BB_FN (bb, cfun)
@@ -1312,11 +1476,9 @@
 
 	  /* Ask the target what it thinks about things.  */
 	  if (!bitmap_empty_p (epi))
-	    targetm.shrink_wrap.disqualify_components (components, e, epi,
-						       false);
+	    disqualify (e, epi, false);
 	  if (!bitmap_empty_p (pro))
-	    targetm.shrink_wrap.disqualify_components (components, e, pro,
-						       true);
+	    disqualify (e, pro, true);
 
 	  /* If this edge doesn't need splitting, we're fine.  */
 	  if (single_pred_p (e->dest)
@@ -1336,10 +1498,10 @@
 
 	  /* Remove from consideration those components we would need
 	     pro/epilogues for on edges where we cannot insert them.  */
-	  bitmap_and_compl (components, components, epi);
-	  bitmap_and_compl (components, components, pro);
+	  bitmap_and_compl (m_components, m_components, epi);
+	  bitmap_and_compl (m_components, m_components, pro);
 
-	  if (dump_file && !bitmap_subset_p (epi, components))
+	  if (dump_file && !bitmap_subset_p (epi, m_components))
 	    {
 	      fprintf (dump_file, "  BAD epi %d->%d", e->src->index,
 		       e->dest->index);
@@ -1349,7 +1511,7 @@
 	      fprintf (dump_file, "\n");
 	    }
 
-	  if (dump_file && !bitmap_subset_p (pro, components))
+	  if (dump_file && !bitmap_subset_p (pro, m_components))
 	    {
 	      fprintf (dump_file, "  BAD pro %d->%d", e->src->index,
 		       e->dest->index);
@@ -1367,12 +1529,12 @@
 
 /* Place code for prologues and epilogues for COMPONENTS where we can put
    that code at the start of basic blocks.  */
-static void
-emit_common_heads_for_components (sbitmap components)
+void
+sw_sep_base::emit_common_heads_for_components ()
 {
-  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
-  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
-  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (m_components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (m_components));
+  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (m_components));
 
   basic_block bb;
   FOR_EACH_BB_FN (bb, cfun)
@@ -1381,8 +1543,8 @@
 	 predecessor edges to this block.  */
 
       /* First, select all possible components.  */
-      bitmap_copy (epi, components);
-      bitmap_copy (pro, components);
+      bitmap_copy (epi, m_components);
+      bitmap_copy (pro, m_components);
 
       edge e;
       edge_iterator ei;
@@ -1421,24 +1583,14 @@
       /* Place code after the BB note.  */
       if (!bitmap_empty_p (pro))
 	{
-	  start_sequence ();
-	  targetm.shrink_wrap.emit_prologue_components (pro);
-	  rtx_insn *seq = get_insns ();
-	  end_sequence ();
-
-	  emit_insn_after (seq, bb_note (bb));
+	  emit_prologue_into_bb (bb, pro, true);
 
 	  bitmap_ior (SW (bb)->head_components, SW (bb)->head_components, pro);
 	}
 
       if (!bitmap_empty_p (epi))
 	{
-	  start_sequence ();
-	  targetm.shrink_wrap.emit_epilogue_components (epi);
-	  rtx_insn *seq = get_insns ();
-	  end_sequence ();
-
-	  emit_insn_after (seq, bb_note (bb));
+	  emit_epilogue_into_bb (bb, epi, true);
 
 	  bitmap_ior (SW (bb)->head_components, SW (bb)->head_components, epi);
 	}
@@ -1451,12 +1603,12 @@
 
 /* Place code for prologues and epilogues for COMPONENTS where we can put
    that code at the end of basic blocks.  */
-static void
-emit_common_tails_for_components (sbitmap components)
+void
+sw_sep_base::emit_common_tails_for_components ()
 {
-  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
-  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
-  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (m_components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (m_components));
+  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (m_components));
 
   basic_block bb;
   FOR_EACH_BB_FN (bb, cfun)
@@ -1467,8 +1619,8 @@
 	continue;
 
       /* First, select all possible components.  */
-      bitmap_copy (epi, components);
-      bitmap_copy (pro, components);
+      bitmap_copy (epi, m_components);
+      bitmap_copy (pro, m_components);
 
       edge e;
       edge_iterator ei;
@@ -1510,32 +1662,13 @@
       /* Put the code at the end of the BB, but before any final jump.  */
       if (!bitmap_empty_p (epi))
 	{
-	  start_sequence ();
-	  targetm.shrink_wrap.emit_epilogue_components (epi);
-	  rtx_insn *seq = get_insns ();
-	  end_sequence ();
-
-	  rtx_insn *insn = BB_END (bb);
-	  if (control_flow_insn_p (insn))
-	    emit_insn_before (seq, insn);
-	  else
-	    emit_insn_after (seq, insn);
-
+	  emit_epilogue_into_bb (bb, epi, false);
 	  bitmap_ior (SW (bb)->tail_components, SW (bb)->tail_components, epi);
 	}
 
       if (!bitmap_empty_p (pro))
 	{
-	  start_sequence ();
-	  targetm.shrink_wrap.emit_prologue_components (pro);
-	  rtx_insn *seq = get_insns ();
-	  end_sequence ();
-
-	  rtx_insn *insn = BB_END (bb);
-	  if (JUMP_P (insn) || (CALL_P (insn) && SIBLING_CALL_P (insn)))
-	    emit_insn_before (seq, insn);
-	  else
-	    emit_insn_after (seq, insn);
+	  emit_prologue_into_bb (bb, pro, false);
 
 	  bitmap_ior (SW (bb)->tail_components, SW (bb)->tail_components, pro);
 	}
@@ -1548,11 +1681,11 @@
 
 /* Place prologues and epilogues for COMPONENTS on edges, if we haven't already
    placed them inside blocks directly.  */
-static void
-insert_prologue_epilogue_for_components (sbitmap components)
+void
+sw_sep_base::insert_prologue_epilogue_for_components ()
 {
-  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
-  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (m_components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (m_components));
 
   basic_block bb;
   FOR_EACH_BB_FN (bb, cfun)
@@ -1569,8 +1702,8 @@
 			    SW (e->dest)->has_components);
 	  bitmap_and_compl (pro, SW (e->dest)->has_components,
 			    SW (e->src)->has_components);
-	  bitmap_and (epi, epi, components);
-	  bitmap_and (pro, pro, components);
+	  bitmap_and (epi, epi, m_components);
+	  bitmap_and (pro, pro, m_components);
 
 	  /* Deselect those we already have put at the head or tail of the
 	     edge's dest resp. src.  */
@@ -1590,36 +1723,8 @@
 		  fprintf (dump_file, "\n");
 		}
 
-	      /* Put the epilogue components in place.  */
-	      start_sequence ();
-	      targetm.shrink_wrap.emit_epilogue_components (epi);
-	      rtx_insn *seq = get_insns ();
-	      end_sequence ();
-
-	      if (e->flags & EDGE_SIBCALL)
-		{
-		  gcc_assert (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun));
-
-		  rtx_insn *insn = BB_END (e->src);
-		  gcc_assert (CALL_P (insn) && SIBLING_CALL_P (insn));
-		  emit_insn_before (seq, insn);
-		}
-	      else if (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun))
-		{
-		  gcc_assert (e->flags & EDGE_FALLTHRU);
-		  basic_block new_bb = split_edge (e);
-		  emit_insn_after (seq, BB_END (new_bb));
-		}
-	      else
-		insert_insn_on_edge (seq, e);
-
-	      /* Put the prologue components in place.  */
-	      start_sequence ();
-	      targetm.shrink_wrap.emit_prologue_components (pro);
-	      seq = get_insns ();
-	      end_sequence ();
-
-	      insert_insn_on_edge (seq, e);
+	      emit_epilogue_on_edge (e, epi);
+	      emit_prologue_on_edge (e, pro);
 	    }
 	}
     }
@@ -1630,6 +1735,62 @@
   commit_edge_insertions ();
 }
 
+sbitmap
+sw_sep_base::run ()
+{
+  init ();
+
+  sbitmap_iterator sbi;
+  unsigned int j;
+  EXECUTE_IF_SET_IN_BITMAP (m_components, 0, j, sbi)
+    place_prologue_for_one_component (j, m_first_bb);
+
+  spread_components (m_first_bb);
+
+  disqualify_problematic_components ();
+
+  /* Don't separately shrink-wrap anything where the "main" prologue will
+     go; the target code can often optimize things if it is presented with
+     all components together (say, if it generates store-multiple insns).  */
+  bitmap_and_compl (m_components, m_components,
+		    SW (m_first_bb)->has_components);
+
+  if (bitmap_empty_p (m_components))
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not wrapping anything separately.\n");
+      return m_components;
+    }
+  if (dump_file)
+    {
+      fprintf (dump_file, "The components we wrap separately are");
+      dump_components ("sep", m_components);
+      fprintf (dump_file, "\n");
+
+      fprintf (dump_file, "... Inserting common heads...\n");
+    }
+
+  emit_common_heads_for_components ();
+
+  if (dump_file)
+    fprintf (dump_file, "... Inserting common tails...\n");
+
+  emit_common_tails_for_components ();
+
+  if (dump_file)
+    fprintf (dump_file, "... Inserting the more difficult ones...\n");
+
+  insert_prologue_epilogue_for_components ();
+
+  if (dump_file)
+    fprintf (dump_file, "... Done.\n");
+
+  set_handled ();
+
+  fini ();
+  return m_components;
+}
+
 /* The main entry point to this subpass.  FIRST_BB is where the prologue
    would be normally put.  */
 void
@@ -1663,61 +1824,310 @@
   calculate_dominance_info (CDI_DOMINATORS);
   calculate_dominance_info (CDI_POST_DOMINATORS);
 
-  init_separate_shrink_wrap (components);
+  sw_sep engine (first_bb, components);
+  components = engine.run ();
+  engine.fini ();
 
-  sbitmap_iterator sbi;
-  unsigned int j;
-  EXECUTE_IF_SET_IN_BITMAP (components, 0, j, sbi)
-    place_prologue_for_one_component (j, first_bb);
+  sbitmap_free (components);
+  free_dominance_info (CDI_DOMINATORS);
+  free_dominance_info (CDI_POST_DOMINATORS);
+}
 
-  spread_components (first_bb);
+#if CHECKING_P
 
-  disqualify_problematic_components (components);
+const int n_components_for_test = 32;
+const int n_cfgs_to_try = 32;
+const int n_paths_to_try = 256;
+const int max_path_len = 32;
 
-  /* Don't separately shrink-wrap anything where the "main" prologue will
-     go; the target code can often optimize things if it is presented with
-     all components together (say, if it generates store-multiple insns).  */
-  bitmap_and_compl (components, components, SW (first_bb)->has_components);
+namespace selftest {
 
-  if (bitmap_empty_p (components))
+/* A derived class that performs a self-test of the separate
+   shrink-wrapping algorithm.  */
+class sw_test_sep : public sw_sep_base
+{
+  hash_map<basic_block, sbitmap> m_orig_components;
+  hash_map<basic_block, sbitmap> m_pro_added_to_head;
+  hash_map<basic_block, sbitmap> m_epi_added_to_head;
+  hash_map<basic_block, sbitmap> m_pro_added_to_end;
+  hash_map<basic_block, sbitmap> m_epi_added_to_end;
+  hash_map<edge, sbitmap> m_pro_added_to_edge;
+  hash_map<edge, sbitmap> m_epi_added_to_edge;
+
+  sbitmap components_for_bb (basic_block);
+  void disqualify (edge, sbitmap, bool);
+  void emit_prologue_into_bb (basic_block, sbitmap, bool);
+  void emit_epilogue_into_bb (basic_block, sbitmap, bool);
+  void emit_prologue_on_edge (edge, sbitmap);
+  void emit_epilogue_on_edge (edge, sbitmap);
+  void set_handled ();
+
+ public:
+  void fini ();
+
+  sw_test_sep (basic_block first_bb, sbitmap components)
+    : sw_sep_base (first_bb, components)
+  {
+  }
+};
+
+sbitmap
+sw_test_sep::components_for_bb (basic_block bb)
+{
+  sbitmap components = sbitmap_alloc (n_components_for_test);
+  bitmap_clear (components);
+  if (bb != ENTRY_BLOCK_PTR_FOR_FN (cfun)
+      && bb != EXIT_BLOCK_PTR_FOR_FN (cfun))
+    for (int i = 0; i < n_components_for_test; i++)
+      if (random () & 1)
+	bitmap_set_bit (components, i);
+  m_orig_components.put (bb, components);
+  return components;
+}
+
+void
+sw_test_sep::disqualify (edge, sbitmap, bool)
+{
+  /* This mechanism isn't tested yet.  */
+}
+
+/* Call the target hook to emit epilogue components for EPI, and place
+   the insns on edge E.  */
+void
+sw_test_sep::emit_epilogue_on_edge (edge e, sbitmap epi)
+{
+  sbitmap *p = m_epi_added_to_edge.get (e);
+  sbitmap mask;
+  if (p == NULL)
     {
-      if (dump_file)
-	fprintf (dump_file, "Not wrapping anything separately.\n");
+      mask = sbitmap_alloc (n_components_for_test);
+      bitmap_clear (mask);
+      m_epi_added_to_edge.put (e, mask);
     }
   else
+    mask = *p;
+
+  ASSERT_FALSE (bitmap_intersect_p (mask, epi));
+  bitmap_ior (mask, mask, epi);
+}
+
+/* Call the target hook to emit prologue components for EPI, and place
+   the insns on edge E.  */
+void
+sw_test_sep::emit_prologue_on_edge (edge e, sbitmap pro)
+{
+  sbitmap *p = m_pro_added_to_edge.get (e);
+  sbitmap mask;
+  if (p == NULL)
     {
-      if (dump_file)
-	{
-	  fprintf (dump_file, "The components we wrap separately are");
-	  dump_components ("sep", components);
-	  fprintf (dump_file, "\n");
+      mask = sbitmap_alloc (n_components_for_test);
+      bitmap_clear (mask);
+      m_pro_added_to_edge.put (e, mask);
+    }
+  else
+    mask = *p;
 
-	  fprintf (dump_file, "... Inserting common heads...\n");
-	}
+  ASSERT_FALSE (bitmap_intersect_p (mask, pro));
+  bitmap_ior (mask, mask, pro);
+}
 
-      emit_common_heads_for_components (components);
+/* Call the target hook to emit epilogue components for EPI, and place
+   the insns in basic block BB, at the start if AT_HEAD, at the end
+   otherwise.  */
+void
+sw_test_sep::emit_epilogue_into_bb (basic_block bb, sbitmap epi, bool at_head)
+{
+  hash_map<basic_block, sbitmap> *map;
+  map = at_head ? &m_epi_added_to_head : &m_epi_added_to_end;
+  sbitmap *p = map->get (bb);
+  sbitmap mask;
+  if (p == NULL)
+    {
+      mask = sbitmap_alloc (n_components_for_test);
+      bitmap_clear (mask);
+      map->put (bb, mask);
+    }
+  else
+    mask = *p;
 
-      if (dump_file)
-	fprintf (dump_file, "... Inserting common tails...\n");
+  ASSERT_FALSE (bitmap_intersect_p (mask, epi));
+  bitmap_ior (mask, mask, epi);
+}
 
-      emit_common_tails_for_components (components);
+/* Call the target hook to emit prologue components for PRO, and place
+   the insns in basic block BB, at the start if AT_HEAD, at the end
+   otherwise.  */
+void
+sw_test_sep::emit_prologue_into_bb (basic_block bb, sbitmap pro, bool at_head)
+{
+  hash_map<basic_block, sbitmap> *map;
+  map = at_head ? &m_pro_added_to_head : &m_pro_added_to_end;
+  sbitmap *p = map->get (bb);
+  sbitmap mask;
+  if (p == NULL)
+    {
+      mask = sbitmap_alloc (n_components_for_test);
+      bitmap_clear (mask);
+      map->put (bb, mask);
+    }
+  else
+    mask = *p;
 
-      if (dump_file)
-	fprintf (dump_file, "... Inserting the more difficult ones...\n");
+  ASSERT_FALSE (bitmap_intersect_p (mask, pro));
+  bitmap_ior (mask, mask, pro);
+}
 
-      insert_prologue_epilogue_for_components (components);
+/* Called by the engine if separate shrink-wrapping was successful for
+   some components.  For the test, we use this to verify the result.  */
+void
+sw_test_sep::set_handled ()
+{
+  basic_block bb;
+  /* Verify that we didn't added prologue and epilogue for the same
+     component in the same place.  */
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      sbitmap *pro = m_pro_added_to_head.get (bb);
+      sbitmap *epi = m_epi_added_to_head.get (bb);
+      if (pro && epi)
+	ASSERT_FALSE (bitmap_intersect_p (*pro, *epi));
+      pro = m_pro_added_to_end.get (bb);
+      epi = m_epi_added_to_end.get (bb);
+      if (pro && epi)
+	ASSERT_FALSE (bitmap_intersect_p (*pro, *epi));
+    }
 
-      if (dump_file)
-	fprintf (dump_file, "... Done.\n");
+  /* Try random paths through the function, and verify that
+     (a) if a block needs a component, it has been saved
+     (b) no component is saved or restored more than once in
+         a row.  */
 
-      targetm.shrink_wrap.set_handled_components (components);
+  sbitmap active = sbitmap_alloc (n_components_for_test);
+  for (int i = 0; i < n_paths_to_try; i++)
+    {
+      basic_block curr = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+      bitmap_clear (active);
 
-      crtl->shrink_wrapped_separate = true;
+      for (int len = 0; len < max_path_len; len++)
+	{
+	  /* Update for the bb head.  */
+	  sbitmap *pro_entry = m_pro_added_to_head.get (curr);
+	  sbitmap *epi_entry = m_epi_added_to_head.get (curr);
+	  if (epi_entry)
+	    {
+	      ASSERT_FALSE (bitmap_intersect_compl_p (*epi_entry, active));
+	      bitmap_and_compl (active, active, *epi_entry);
+	    }
+	  if (pro_entry)
+	    {
+	      ASSERT_FALSE (bitmap_intersect_p (*pro_entry, active));
+	      bitmap_ior (active, active, *pro_entry);
+	    }
+
+	  /* Make sure the required components are active.  */
+	  sbitmap *required = m_orig_components.get (curr);
+	  if (required)
+	    ASSERT_FALSE (bitmap_intersect_compl_p (*required, active));
+
+	  /* Update for the bb end.  */
+	  sbitmap *pro_end = m_pro_added_to_end.get (curr);
+	  sbitmap *epi_end = m_epi_added_to_end.get (curr);
+	  if (epi_end)
+	    {
+	      ASSERT_FALSE (bitmap_intersect_compl_p (*epi_end, active));
+	      bitmap_and_compl (active, active, *epi_end);
+	    }
+	  if (pro_end)
+	    {
+	      ASSERT_FALSE (bitmap_intersect_p (*pro_end, active));
+	      bitmap_ior (active, active, *pro_end);
+	    }
+
+	  int l = EDGE_COUNT (curr->succs);
+	  if (l == 0)
+	    break;
+	  int n = random () % l;
+	  edge e = EDGE_SUCC (curr, n);
+
+	  /* Update for the edge.  */
+	  sbitmap *pro_edge = m_pro_added_to_edge.get (e);
+	  sbitmap *epi_edge = m_epi_added_to_edge.get (e);
+	  if (epi_edge)
+	    {
+	      ASSERT_FALSE (bitmap_intersect_compl_p (*epi_edge, active));
+	      bitmap_and_compl (active, active, *epi_edge);
+	    }
+	  if (pro_edge)
+	    {
+	      ASSERT_FALSE (bitmap_intersect_p (*pro_edge, active));
+	      bitmap_ior (active, active, *pro_edge);
+	    }
+
+	  curr = e->dest;
+	  if (curr == EXIT_BLOCK_PTR_FOR_FN (cfun))
+	    break;
+	}
     }
+}
 
-  fini_separate_shrink_wrap ();
+void
+sw_test_sep::fini ()
+{
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      sbitmap *proh = m_pro_added_to_head.get (bb);
+      sbitmap *epih = m_epi_added_to_head.get (bb);
+      sbitmap *proe = m_pro_added_to_end.get (bb);
+      sbitmap *epie = m_epi_added_to_end.get (bb);
+      if (proh)
+	sbitmap_free (*proh);
+      if (epih)
+	sbitmap_free (*epih);
+      if (proe)
+	sbitmap_free (*proe);
+      if (epie)
+	sbitmap_free (*epie);
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  sbitmap *pro = m_pro_added_to_edge.get (e);
+	  sbitmap *epi = m_epi_added_to_edge.get (e);
+	  if (pro)
+	    sbitmap_free (*pro);
+	  if (epi)
+	    sbitmap_free (*epi);
+	}
+    }
+
+  sw_sep_base::fini ();
+}
+
+void
+shrink_wrapping_separate_tests ()
+{
+  /* Ask the target what components there are.  If it returns NULL, don't
+     do anything.  */
+  sbitmap components = sbitmap_alloc (n_components_for_test);
+  bitmap_ones (components);
+
+  basic_block first = build_random_cfg (5);
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  calculate_dominance_info (CDI_POST_DOMINATORS);
+
+  sw_test_sep engine (first, components);
+  components = engine.run ();
+  engine.fini ();
 
   sbitmap_free (components);
   free_dominance_info (CDI_DOMINATORS);
   free_dominance_info (CDI_POST_DOMINATORS);
+
+  free_random_cfg ();
 }
+
+} // namespace selftest
+
+#endif
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/tree-cfg.c with-selftest/gcc/tree-cfg.c
--- sw-selftest-base/gcc/tree-cfg.c	2016-08-25 13:49:37.443942684 +0200
+++ with-selftest/gcc/tree-cfg.c	2016-08-26 14:45:06.643454608 +0200
@@ -9234,6 +9234,110 @@
   return fndecl;
 }
 
+static basic_block
+build_single_after (vec<basic_block> *in_exits, vec<basic_block> *exits)
+{
+  basic_block bb_a = create_empty_bb ((*in_exits)[0]);
+  exits->safe_push (bb_a);
+  basic_block bb;
+  unsigned i;
+  FOR_EACH_VEC_ELT (*in_exits, i, bb)
+    make_edge (bb, bb_a, i == 0 ? EDGE_FALLTHRU : 0);
+  return bb_a;
+}
+
+static basic_block
+build_if (function *fun, basic_block thn, vec<basic_block> *exits)
+{
+  basic_block entry = ENTRY_BLOCK_PTR_FOR_FN (fun);
+  basic_block bb_a = create_empty_bb (entry);
+  make_edge (bb_a, thn, EDGE_TRUE_VALUE);
+  exits->safe_push (bb_a);
+  return bb_a;
+}
+
+static basic_block
+build_ifelse (function *fun, basic_block thn, basic_block els)
+{
+  basic_block entry = ENTRY_BLOCK_PTR_FOR_FN (fun);
+  basic_block bb_a = create_empty_bb (entry);
+  make_edge (bb_a, thn, EDGE_TRUE_VALUE);
+  make_edge (bb_a, els, EDGE_FALSE_VALUE);
+  return bb_a;
+}
+
+static basic_block
+build_loop (basic_block inner_head, vec<basic_block> *inner_exits,
+	    vec<basic_block> *exits)
+{
+  build_single_after (inner_exits, exits);
+  basic_block bb;
+  unsigned i;
+  FOR_EACH_VEC_ELT (*inner_exits, i, bb)
+    make_edge (bb, inner_head, 0);
+
+  return inner_head;
+}
+
+static basic_block
+recursive_build_cfg (function *fun, int maxdepth, vec<basic_block> *exits)
+{
+  if (maxdepth == 0)
+    {
+      basic_block entry = ENTRY_BLOCK_PTR_FOR_FN (fun);
+      basic_block bb_a = create_empty_bb (entry);
+      exits->safe_push (bb_a);
+      return bb_a;
+    }
+  int subdepth1 = random () % maxdepth;
+  int subdepth2 = random () % maxdepth;
+  auto_vec<basic_block> inner_exits;
+  basic_block inner, inner2;
+
+  switch (random () % 4)
+    {
+    case 0:
+      inner = recursive_build_cfg (fun, subdepth1, exits);
+      return build_if (fun, inner, exits);
+    case 1:
+      inner = recursive_build_cfg (fun, subdepth1, exits);
+      inner2 = recursive_build_cfg (fun, subdepth2, exits);
+      return build_ifelse (fun, inner, inner2);
+    case 2:
+      inner = recursive_build_cfg (fun, subdepth1, &inner_exits);
+      return build_loop (inner, &inner_exits, exits);
+    case 3:
+      inner = recursive_build_cfg (fun, subdepth1, &inner_exits);
+      build_single_after (&inner_exits, exits);
+      return inner;
+    default:
+      gcc_unreachable ();
+    }
+}
+
+basic_block
+build_random_cfg (int maxdepth)
+{
+  gimple_register_cfg_hooks ();
+
+  tree fndecl = push_fndecl ("cfg_test_random");
+  function *fun = DECL_STRUCT_FUNCTION (fndecl);
+
+  auto_vec<basic_block> rgn_exits;
+  basic_block head = recursive_build_cfg (fun, maxdepth, &rgn_exits);
+  make_edge (ENTRY_BLOCK_PTR_FOR_FN (fun), head, EDGE_FALLTHRU);
+  basic_block bb;
+  unsigned i;
+  FOR_EACH_VEC_ELT (rgn_exits, i, bb)
+    make_edge (bb, EXIT_BLOCK_PTR_FOR_FN (fun), 0);
+  return head;
+}
+
+void free_random_cfg ()
+{
+  pop_cfun ();
+}
+
 /* These tests directly create CFGs.
    Compare with the static fns within tree-cfg.c:
      - build_gimple_cfg

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-08-26 13:03 ` Bernd Schmidt
@ 2016-08-26 13:48   ` David Malcolm
  2016-08-26 13:55     ` Bernd Schmidt
  2016-08-26 14:50   ` Segher Boessenkool
  2016-09-09 19:36   ` Jeff Law
  2 siblings, 1 reply; 74+ messages in thread
From: David Malcolm @ 2016-08-26 13:48 UTC (permalink / raw)
  To: Bernd Schmidt, Segher Boessenkool, gcc-patches

On Fri, 2016-08-26 at 15:03 +0200, Bernd Schmidt wrote:
> On 08/01/2016 03:42 AM, Segher Boessenkool wrote:
> > This is the second version.  Concern was renamed to component, and
> > all
> > other comments were addressed (I hope).
> 
> Not really, I'm afraid. There still seems to be no detailed
> explanation 
> of the algorithm. Instead, there is a vague outline (which should be 
> expanded to at least the level of detail found e.g. in tree-ssa
> -pre.c), 
> and inside the code there are still meaningless comments of the form
> 
> /* Deselect those epilogue components that should not be inserted
>     for this edge.  */
> 
> which don't tell me anything about what the intention is (why should 
> they not be inserted?). The result is that as a reader, I still find 
> myself unable to determine whether the algorithm is correct or not.
> 
> Worse, I'm still not entirely certain what it is intended to achieve:
> I 
> asked for a motivating example or two, but haven't seen one in the 
> comments anywhere. My most general question would be why the
> algorithm 
> for placing individual prologue components would be different from
> the 
> regular shrink-wrapping algorithm for full prologues. Examples and/or
> arguments relating to how this new code acts with regard to loops are
> also particularly needed.
> 
> So, I think improvement is necessary in these points, but I also
> think 
> that there's room for experimental verification by way of self-tests.
> If 
> done thoroughly (testing the transformation on several sufficiently 
> random CFGs and maybe some manually constructed ones) it would go a
> long 
> way towards showing that at least we don't have to worry too much
> about 
> miscompilations. That's what I've been working on, and an initial
> patch 
> is below. This is incomplete and posted more as an RFD since we're 
> getting towards the end of the week: there are gaps in the test 
> coverage, and it currently fails the selftest. I assume the latter is
> a 
> problem with my code, but it wouldn't hurt if you could take a look; 
> maybe I misunderstood something entirely about what the separate 
> shrink-wrapping is supposed to achieve, or maybe I messed up the 
> algorithm with my changes.
> 

Bernd: I'm very happy to see someone else using the selftest framework.

(My mailer isn't letting me quote the patch body, sorry).

I'm nervous about the build_random_cfg function: randomness in
selftests seems like a double-edged sword.  On the one hand, we can use
it to fuzz-test an optimization to rapidly gain a lot of coverage.  On
the other hand, does every host generate the same sequence of values?
Presumably we want everyone to be effectively running the same
selftests.

Is this using libiberty's implementation of "random", or can it use the
underlying host libc implementation?  Are there any cross-platform
guarantees on the sequence of values that are returned for a particular
seed?  I don't want us to have host-specific failures that turn out to
be due to host differences in the RNG.

Do we need to re-seed the RNG before each test?  (or to rework
libiberty's random to bundle up the RNG state in a class, and use
that).  Otherwise, the meaning of the test could change if someone adds
another random-using selftest before this one.

Maybe there's a need for some kind of selftest::rng class here?

On a unrelated note, should the various vfunc implementations be marked
with "FINAL OVERRIDE" (from coretypes.h), so that the compiler has a
better chance of devirtualizing them in a release build?

Hope this is constructive
Dave

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-08-26 13:48   ` David Malcolm
@ 2016-08-26 13:55     ` Bernd Schmidt
  0 siblings, 0 replies; 74+ messages in thread
From: Bernd Schmidt @ 2016-08-26 13:55 UTC (permalink / raw)
  To: David Malcolm, Segher Boessenkool, gcc-patches

On 08/26/2016 03:48 PM, David Malcolm wrote:
> I'm nervous about the build_random_cfg function: randomness in
> selftests seems like a double-edged sword.  On the one hand, we can use
> it to fuzz-test an optimization to rapidly gain a lot of coverage.  On
> the other hand, does every host generate the same sequence of values?
> Presumably we want everyone to be effectively running the same
> selftests.

I intended to put a srandom call in there at the start but forgot. One 
could argue that there's value in having different tests on different 
hosts, as long as they are stable between runs, but...

> Maybe there's a need for some kind of selftest::rng class here?

That might be a good idea too.

> On a unrelated note, should the various vfunc implementations be marked
> with "FINAL OVERRIDE" (from coretypes.h), so that the compiler has a
> better chance of devirtualizing them in a release build?

Wasn't aware of that - will have a look.

Maybe we could also use a long-running "-fselftest=extensive" for 
stress-testing algorithms such as this (increasing the number of random 
cfgs it tries)?


Bernd

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-08-26 13:03 ` Bernd Schmidt
  2016-08-26 13:48   ` David Malcolm
@ 2016-08-26 14:50   ` Segher Boessenkool
  2016-08-26 15:03     ` Bernd Schmidt
  2016-09-09 19:36   ` Jeff Law
  2 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-26 14:50 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches

Hi Bernd,

On Fri, Aug 26, 2016 at 03:03:43PM +0200, Bernd Schmidt wrote:
> Not really, I'm afraid. There still seems to be no detailed explanation 
> of the algorithm. Instead, there is a vague outline

Did you read the description of 8/9?  If you think any of that needs to
be in the code, please just say so.

> and inside the code there are still meaningless comments of the form
> 
> /* Deselect those epilogue components that should not be inserted
>    for this edge.  */

You asked for a comment here, see
https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00932.html
("what's the purpose of this", etc.)

> Worse, I'm still not entirely certain what it is intended to achieve: I 
> asked for a motivating example or two, but haven't seen one in the 
> comments anywhere.

"Make faster code", like all other optimisation passes do as well.

I commented repeatedly about (micro-)benchmark results.

The head comment starts with

+/* Separate shrink-wrapping
+
+   Instead of putting all of the prologue and epilogue in one spot, we
+   can put parts of it in places where those components are executed less
+   frequently.

and that is the long and short of it.

> My most general question would be why the algorithm 
> for placing individual prologue components would be different from the 
> regular shrink-wrapping algorithm for full prologues. Examples and/or 
> arguments relating to how this new code acts with regard to loops are 
> also particularly needed.

The full-prologue algorithm makes as many blocks run without prologue as
possible, by duplicating blocks where that helps.  If you do this for
every component you can and up with 2**40 blocks for just 40 components,
not such a hot plan.  The "separate" algo does not duplicate blocks at all
(its only code growth is for all the prologue/epilogue components inserted,
and for some forwarder blocks sometimes).

The full-prologue algorithm only ever inserts exactly one prologue, far
from optimal.  But it *cannot* duplicate it with the semantics we have.
The separate components can of course be duplicated, it's a new abstraction
and part of the requirements for it.

> So, I think improvement is necessary in these points, but I also think 
> that there's room for experimental verification by way of self-tests.

Since it is used everywhere I think that is a big waste of time (it is
tested a lot already).  Testing specific problem cases can of course help;
but we have "make check" for that already anwyay. Also, I think "self-tests"
looking at the internals are just wrong (the only sane way to update such
tests when changing the code is to delete the tests, since they should be
written independently; otherwise you just have two copies of the same).  And
the useless abstractions are just useless.

The algorithm is procedural; writing it in procedural style is much clearer
than hiding everything.

> If 
> done thoroughly (testing the transformation on several sufficiently 
> random CFGs and maybe some manually constructed ones) it would go a long 
> way towards showing that at least we don't have to worry too much about 
> miscompilations.

If you hadn't noticed, the algorithm is constructed in such a way as to
minimise possible miscompilations: it first determines what blocks need
what components "active", and then makes it so, in a separate (much more
trivial) phase.

Wrt your patch...  making each component needed half of the time is not
very realistic, you'll end up with all components active in most blocks.

bools as parameters where they do not mean "yes/no" is confusing.

It seems you do no longer do the "insert at head" and "insert at tail"
before the "insert on edge"?  This cannot work afais.

Some utility funcs print dump info; the caller should, instead.

You make "components" global (as "m_components").

Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-08-26 14:50   ` Segher Boessenkool
@ 2016-08-26 15:03     ` Bernd Schmidt
  2016-08-26 16:27       ` Segher Boessenkool
                         ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Bernd Schmidt @ 2016-08-26 15:03 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches

On 08/26/2016 04:50 PM, Segher Boessenkool wrote:
> The head comment starts with
>
> +/* Separate shrink-wrapping
> +
> +   Instead of putting all of the prologue and epilogue in one spot, we
> +   can put parts of it in places where those components are executed less
> +   frequently.
>
> and that is the long and short of it.

And that comment puzzles me. Surely prologue and epilogue are executed 
only once currently, so how does frequency come into it? Again - please 
provide an example.

> The full-prologue algorithm makes as many blocks run without prologue as
> possible, by duplicating blocks where that helps.  If you do this for
> every component you can and up with 2**40 blocks for just 40 components,

Ok, so why wouldn't we use the existing code with the duplication part 
disabled? That's a later addition anyway and isn't necessary to do 
shrink-wrapping in the first place.


Bernd

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-08-26 15:03     ` Bernd Schmidt
@ 2016-08-26 16:27       ` Segher Boessenkool
  2016-09-08 16:58         ` Jeff Law
  2016-08-30 12:31       ` Michael Matz
  2016-09-08 17:20       ` Jeff Law
  2 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-26 16:27 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches

On Fri, Aug 26, 2016 at 05:03:34PM +0200, Bernd Schmidt wrote:
> On 08/26/2016 04:50 PM, Segher Boessenkool wrote:
> >The head comment starts with
> >
> >+/* Separate shrink-wrapping
> >+
> >+   Instead of putting all of the prologue and epilogue in one spot, we
> >+   can put parts of it in places where those components are executed less
> >+   frequently.
> >
> >and that is the long and short of it.
> 
> And that comment puzzles me. Surely prologue and epilogue are executed 
> only once currently, so how does frequency come into it? Again - please 
> provide an example.

If some component is only needed for 0.01% of executions of a function,
running it once for every execution is 10000 times too much.

The trivial example is a function that does an early exit, but uses one
or a few non-volatile registers before that exit.  This happens in e.g.
glibc's malloc, if you want an easily accessed example.  With the current
code, *all* components will be saved and then restored shortly afterwards.

> >The full-prologue algorithm makes as many blocks run without prologue as
> >possible, by duplicating blocks where that helps.  If you do this for
> >every component you can and up with 2**40 blocks for just 40 components,
> 
> Ok, so why wouldn't we use the existing code with the duplication part 
> disabled?

That would not perform nearly as well.

> That's a later addition anyway and isn't necessary to do 
> shrink-wrapping in the first place.

No, it always did that, just not as often (it only duplicated straight-line
code before).


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-08-01  1:43 ` [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc Segher Boessenkool
@ 2016-08-29  9:31   ` Bernd Schmidt
  2016-08-29 14:30     ` Segher Boessenkool
  2016-09-08 17:37     ` Jeff Law
  2016-09-08 17:48   ` Jeff Law
  1 sibling, 2 replies; 74+ messages in thread
From: Bernd Schmidt @ 2016-08-29  9:31 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches

On 08/01/2016 03:42 AM, Segher Boessenkool wrote:
> +@deftypefn {Target Hook} void TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS (sbitmap)
> +Emit prologue insns for the components indicated by the parameter.
> +@end deftypefn
> +
> +@deftypefn {Target Hook} void TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS (sbitmap)
> +Emit epilogue insns for the components indicated by the parameter.
> +@end deftypefn

How do these actually know where to save/restore registers? The frame 
pointer may have been eliminated, and SP isn't necessarily constant 
during the function. Seems like you'd have to calculate CFA reg/offset 
much like dwarf2out does and pass it to this hook.


Bernd

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-08-29  9:31   ` Bernd Schmidt
@ 2016-08-29 14:30     ` Segher Boessenkool
  2016-09-08 17:37     ` Jeff Law
  1 sibling, 0 replies; 74+ messages in thread
From: Segher Boessenkool @ 2016-08-29 14:30 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc-patches

On Mon, Aug 29, 2016 at 11:31:51AM +0200, Bernd Schmidt wrote:
> On 08/01/2016 03:42 AM, Segher Boessenkool wrote:
> >+@deftypefn {Target Hook} void TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS 
> >(sbitmap)
> >+Emit prologue insns for the components indicated by the parameter.
> >+@end deftypefn
> >+
> >+@deftypefn {Target Hook} void TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS 
> >(sbitmap)
> >+Emit epilogue insns for the components indicated by the parameter.
> >+@end deftypefn
> 
> How do these actually know where to save/restore registers? The frame 
> pointer may have been eliminated, and SP isn't necessarily constant 
> during the function. Seems like you'd have to calculate CFA reg/offset 
> much like dwarf2out does and pass it to this hook.

There are many other reasons why separate shrink-wrapping can not be
done for a certain function, too; some target-specific.

The generic code (patch 8/9) does

+  /* We don't handle "strange" functions.  */
+  if (cfun->calls_alloca
+      || cfun->calls_setjmp
+      || cfun->can_throw_non_call_exceptions
+      || crtl->calls_eh_return
+      || crtl->has_nonlocal_goto
+      || crtl->saves_all_registers)
+    return;
+
+  /* Ask the target what components there are.  If it returns NULL, don't
+     do anything.  */
+  sbitmap components = targetm.shrink_wrap.get_separate_components ();

and the rs6000 version of TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
(path 9/9) then does

+  rs6000_stack_t *info = rs6000_stack_info ();
+
+  if (!(info->savres_strategy & SAVE_INLINE_GPRS)
+      || !(info->savres_strategy & REST_INLINE_GPRS)
+      || WORLD_SAVE_P (info))
+    return NULL;

and also checks for each component if it can access the save slot with
just one instruction.  It can handle both the stack pointer (r1) and
the hard frame pointer (r31); it uses the same logic as the "ordinary"
prologue/epilogue would.


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-08-26 15:03     ` Bernd Schmidt
  2016-08-26 16:27       ` Segher Boessenkool
@ 2016-08-30 12:31       ` Michael Matz
  2016-09-08 16:41         ` Jeff Law
  2016-09-08 17:20       ` Jeff Law
  2 siblings, 1 reply; 74+ messages in thread
From: Michael Matz @ 2016-08-30 12:31 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Segher Boessenkool, gcc-patches

Hi,

On Fri, 26 Aug 2016, Bernd Schmidt wrote:

> And that comment puzzles me. Surely prologue and epilogue are executed only
> once currently, so how does frequency come into it? Again - please provide an
> example.

int some_global;
int foo (void) {
  if (!some_global) {
    call_this();
    call_that();
    x = some + large / computation;
    while (x--) { much_more_code(); }
    some_global = 1;
  }
  return some_global;
}

Now you need the full pro/epilogue (saving/restoring all call clobbered 
regs presumably used by the large computation and loop) for only one call 
of that function (the first), and then never again.  But you do need a 
part of it always assuming the architecture is right, and this is a shared 
library, namely the setup of the PIC pointer for the access to 
some_global.

A prologue does many things, and in some cases many of them aren't 
necessary for all calls (indeed that's often the case for functions 
containing early-outs), so much so that the full prologue including 
unnecessary parts needs more time than the functional body of a functions 
particular call.  It's obvious that it would be better to move those parts 
of the prologue to places where they actually are needed:

int f2 (void) {
  setup_pic_reg();
  if (!some_global) {
    save_call_clobbers();
    code_from_above();
    restore_call_clobbers();
  }
  retreg = some_global;
  restore_pic_reg();
}

That includes moving parts of the prologue into loops:

int g() {
  int sum = 0;
  for (int i = 0; i < NUM; i++) {
    sum += i;
    if (sum >= CUTOFF) {
      some_long_winded_expression();
      that_eventually_calls_abort();
    }
  }
  return sum;
}

Here all parts of the prologue that somehow make it possible to call other 
functions are necessary only when the program will abort eventually: hence 
is necessary only at one call of g() at most.  Again it's sensible to move 
those parts inside the unlikely condition, even though it's inside a loop.

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-08-30 12:31       ` Michael Matz
@ 2016-09-08 16:41         ` Jeff Law
  2016-09-09  6:31           ` Segher Boessenkool
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-08 16:41 UTC (permalink / raw)
  To: Michael Matz, Bernd Schmidt; +Cc: Segher Boessenkool, gcc-patches

On 08/30/2016 06:31 AM, Michael Matz wrote:
> Hi,
>
> On Fri, 26 Aug 2016, Bernd Schmidt wrote:
>
>> And that comment puzzles me. Surely prologue and epilogue are executed only
>> once currently, so how does frequency come into it? Again - please provide an
>> example.
>
> int some_global;
> int foo (void) {
>   if (!some_global) {
>     call_this();
>     call_that();
>     x = some + large / computation;
>     while (x--) { much_more_code(); }
>     some_global = 1;
>   }
>   return some_global;
> }
>
> Now you need the full pro/epilogue (saving/restoring all call clobbered
> regs presumably used by the large computation and loop) for only one call
> of that function (the first), and then never again.  But you do need a
> part of it always assuming the architecture is right, and this is a shared
> library, namely the setup of the PIC pointer for the access to
> some_global.

>
> A prologue does many things, and in some cases many of them aren't
> necessary for all calls (indeed that's often the case for functions
> containing early-outs), so much so that the full prologue including
> unnecessary parts needs more time than the functional body of a functions
> particular call.  It's obvious that it would be better to move those parts
> of the prologue to places where they actually are needed:
[ ... ]
Right.  Essentially Segher's patch introduces the concept of prologue 
components that are independent of each other and which can be 
shrink-wrapped separately.  The degree of independence is highly target 
specific, of course.

I think one of the questions (and I haven't looked through the whole 
thread yet to see if it's answered) is why the basic shrink-wrapping 
algorithm can't be applied to each of the prologue components -- though 
you may have answered it with your loop example below.


>
> int f2 (void) {
>   setup_pic_reg();
>   if (!some_global) {
>     save_call_clobbers();
>     code_from_above();
>     restore_call_clobbers();
>   }
>   retreg = some_global;
>   restore_pic_reg();
> }
>
> That includes moving parts of the prologue into loops:
>
> int g() {
>   int sum = 0;
>   for (int i = 0; i < NUM; i++) {
>     sum += i;
>     if (sum >= CUTOFF) {
>       some_long_winded_expression();
>       that_eventually_calls_abort();
>     }
>   }
>   return sum;
> }
>
> Here all parts of the prologue that somehow make it possible to call other
> functions are necessary only when the program will abort eventually: hence
> is necessary only at one call of g() at most.  Again it's sensible to move
> those parts inside the unlikely condition, even though it's inside a loop.
Thanks.  I'd been wondering about when it'd be useful to push prologue 
code into a loop nest when I saw the patches fly by, but didn't think 
about it too much.  I haven't looked at the shrink-wrapping literature 
in years, but I don't recall it having any concept that there were cases 
where sinking into a loop nest was profitable.

Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-08-26 16:27       ` Segher Boessenkool
@ 2016-09-08 16:58         ` Jeff Law
  2016-09-09 15:26           ` Segher Boessenkool
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-08 16:58 UTC (permalink / raw)
  To: Segher Boessenkool, Bernd Schmidt; +Cc: gcc-patches

On 08/26/2016 10:27 AM, Segher Boessenkool wrote:
> On Fri, Aug 26, 2016 at 05:03:34PM +0200, Bernd Schmidt wrote:
>> On 08/26/2016 04:50 PM, Segher Boessenkool wrote:
>>> The head comment starts with
>>>
>>> +/* Separate shrink-wrapping
>>> +
>>> +   Instead of putting all of the prologue and epilogue in one spot, we
>>> +   can put parts of it in places where those components are executed less
>>> +   frequently.
>>>
>>> and that is the long and short of it.
>>
>> And that comment puzzles me. Surely prologue and epilogue are executed
>> only once currently, so how does frequency come into it? Again - please
>> provide an example.
>
> If some component is only needed for 0.01% of executions of a function,
> running it once for every execution is 10000 times too much.
>
> The trivial example is a function that does an early exit, but uses one
> or a few non-volatile registers before that exit.  This happens in e.g.
> glibc's malloc, if you want an easily accessed example.  With the current
> code, *all* components will be saved and then restored shortly afterwards.
So can you expand on the malloc example a bit -- I'm pretty sure I 
understand what you're trying to do, but a concrete example may help 
Bernd and be useful for archival purposes.

I also know that Carlos is interested in the malloc example -- so I'd 
like to be able to pass that along to him.

Given the multiple early exit and fast paths through the allocator, I'm 
not at all surprised that sinking different components of the prologue 
to different locations is useful.

Also if there's a case where sinking into a loop occurs, definitely 
point that out.

>
>>> The full-prologue algorithm makes as many blocks run without prologue as
>>> possible, by duplicating blocks where that helps.  If you do this for
>>> every component you can and up with 2**40 blocks for just 40 components,
>>
>> Ok, so why wouldn't we use the existing code with the duplication part
>> disabled?
>
> That would not perform nearly as well.
>
>> That's a later addition anyway and isn't necessary to do
>> shrink-wrapping in the first place.
>
> No, it always did that, just not as often (it only duplicated straight-line
> code before).
Presumably (I haven't looked yet), the duplication is so that we can 
isolate one or more paths which in turn allows sinking the prologue 
further on some of those paths.

This is something I'll definitely want to look at -- block duplication 
to facilitate code elimination (or in this case avoid code insertion) 
hits several areas of interest to me -- and how we balance duplication 
vs runtime savings is always interesting.

Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-08-26 15:03     ` Bernd Schmidt
  2016-08-26 16:27       ` Segher Boessenkool
  2016-08-30 12:31       ` Michael Matz
@ 2016-09-08 17:20       ` Jeff Law
  2016-09-09 15:33         ` Segher Boessenkool
  2 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-08 17:20 UTC (permalink / raw)
  To: Bernd Schmidt, Segher Boessenkool; +Cc: gcc-patches

On 08/26/2016 09:03 AM, Bernd Schmidt wrote:
> On 08/26/2016 04:50 PM, Segher Boessenkool wrote:
>> The head comment starts with
>>
>> +/* Separate shrink-wrapping
>> +
>> +   Instead of putting all of the prologue and epilogue in one spot, we
>> +   can put parts of it in places where those components are executed
>> less
>> +   frequently.
>>
>> and that is the long and short of it.
>
> And that comment puzzles me. Surely prologue and epilogue are executed
> only once currently, so how does frequency come into it? Again - please
> provide an example.
Right, they're executed once currently.  But the prologue could be sunk 
into locations where they are not executed every time the function is 
called.  That's the basis behind shrink wrapping.

Segher's code essentially allows individual components of the prologue 
to sink to different points within the function rather than forcing the 
prologue to be sunk as an atomic unit.

Conceptually you could run the standard algorithm on each independent 
component.


>
>> The full-prologue algorithm makes as many blocks run without prologue as
>> possible, by duplicating blocks where that helps.  If you do this for
>> every component you can and up with 2**40 blocks for just 40 components,
>
> Ok, so why wouldn't we use the existing code with the duplication part
> disabled? That's a later addition anyway and isn't necessary to do
> shrink-wrapping in the first place.
I think the concern here is the balance between code explosion and the 
runtime gains.

jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-08-29  9:31   ` Bernd Schmidt
  2016-08-29 14:30     ` Segher Boessenkool
@ 2016-09-08 17:37     ` Jeff Law
  2016-09-09 11:03       ` Bernd Schmidt
  2016-09-09 15:40       ` Segher Boessenkool
  1 sibling, 2 replies; 74+ messages in thread
From: Jeff Law @ 2016-09-08 17:37 UTC (permalink / raw)
  To: Bernd Schmidt, Segher Boessenkool, gcc-patches

On 08/29/2016 03:31 AM, Bernd Schmidt wrote:
> On 08/01/2016 03:42 AM, Segher Boessenkool wrote:
>> +@deftypefn {Target Hook} void
>> TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS (sbitmap)
>> +Emit prologue insns for the components indicated by the parameter.
>> +@end deftypefn
>> +
>> +@deftypefn {Target Hook} void
>> TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS (sbitmap)
>> +Emit epilogue insns for the components indicated by the parameter.
>> +@end deftypefn
>
> How do these actually know where to save/restore registers? The frame
> pointer may have been eliminated, and SP isn't necessarily constant
> during the function. Seems like you'd have to calculate CFA reg/offset
> much like dwarf2out does and pass it to this hook.
So I think the confusion here is these hooks are independent of 
placement. ie, the target independent code does something like:

FOR_EACH_BB
   Build the component bitmap using the incoming edge components
   Emit the prologue components at the start of the block
   Emit the epilogue components at the end of the block


The components handled by a particular block start are set/cleared by 
the other hooks.

jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-08-01  1:43 ` [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc Segher Boessenkool
  2016-08-29  9:31   ` Bernd Schmidt
@ 2016-09-08 17:48   ` Jeff Law
  2016-09-09 15:44     ` Segher Boessenkool
  1 sibling, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-08 17:48 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches; +Cc: bschmidt

On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
> This patch adds a new command-line flag "-fshrink-wrap-separate", a status
> flag "shrink_wrapped_separate", hooks for abstracting the target components,
> and documentation for all those.
>
> 2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* common.opt (-fshrink-wrap-separate): New flag.
> 	* doc/invoke.texi: Document it.
> 	* doc/tm.texi.in (Shrink-wrapping separate components): New subsection.
> 	* doc/tm.texi: Regenerate.
> 	* emit-rtl.h (struct rtl_data): New field shrink_wrapped_separate.
> 	* target.def (shrink_wrap): New hook vector.
> 	(get_separate_components, components_for_bb, disqualify_components,
> 	emit_prologue_components, emit_epilogue_components,
> 	set_handled_components): New hooks.
> ---
>  gcc/common.opt      |  4 ++++
>  gcc/doc/invoke.texi | 11 ++++++++++-
>  gcc/doc/tm.texi     | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  gcc/doc/tm.texi.in  | 29 +++++++++++++++++++++++++++
>  gcc/emit-rtl.h      |  4 ++++
>  gcc/target.def      | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 158 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 8a292ed..97d305f 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 9edb006..5a5c5cab 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -4852,6 +4853,59 @@ This hook should add additional registers that are computed by the prologue to t
>  True if a function's return statements should be checked for matching the function's return type.  This includes checking for falling off the end of a non-void function.  Return false if no such check should be made.
>  @end deftypefn
>
> +@node Shrink-wrapping separate components
> +@subsection Shrink-wrapping separate components
> +@cindex shrink-wrapping separate components
> +
> +The prologue does a lot of separate things: save callee-saved registers,
> +do whatever needs to be done to be able to call things (save the return
> +address, align the stack, whatever; different for each target), set up a
> +stack frame, do whatever needs to be done for the static chain (if anything),
> +set up registers for PIC, etc.
The prologue may perform a variety of target dependent tasks such as 
saving callee saved registers, saving the return address, aligning the 
stack, create a local stack frame, initialize the PIC register, etc.

On some targets some of these tasks may be independent of others and 
thus may be shrink-wrapped separately.  These independent tasks are 
referred to as components and are handled generically by the target 
independent parts of GCC.

Each component has a slot in a sbitmap that is generated and maintained 
for each basic block.



> +@deftypefn {Target Hook} sbitmap TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS (void)
> +This hook should return an @code{sbitmap} with the bits set for those
> +components that can be separately shrink-wrapped in the current function.
> +Return @code{NULL} if the current function should not get any separate
> +shrink-wrapping.
> +Don't define this hook if it would always return @code{NULL}.
> +If it is defined, the other hooks in this group have to be defined as well.
> +@end deftypefn
> +
> +@deftypefn {Target Hook} sbitmap TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB (basic_block)
> +This hook should return an @code{sbitmap} with the bits set for those
> +components where either the prologue component has to be executed before
> +the @code{basic_block}, or the epilogue component after it, or both.
> +@end deftypefn
Who is responsible for allocating and releasing the sbitmaps?


I don't have major concerns with this patch -- I'd like to see 
clarification done on the ownership of the sbitmaps (ie, who allocates 
and releases those objects).  I'd like to see if we can get a better 
introduction as well.

Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/9] cfgcleanup: Don't confuse CFI when -fshrink-wrap-separate
  2016-08-01  1:43 ` [PATCH 2/9] cfgcleanup: Don't confuse CFI when -fshrink-wrap-separate Segher Boessenkool
@ 2016-09-08 17:51   ` Jeff Law
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff Law @ 2016-09-08 17:51 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches; +Cc: bschmidt

On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
> cfgcleanup would try to join noreturn paths with different components
> handled.  This then fails in dwarf2cfi.
>
> 2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* cfgcleanup.c (outgoing_edges_match): Don't join noreturn calls
> 	if shrink_wrapped_separate.
So if you only have fake edges, then you've got a non-returning call. If 
you could twiddle the comment to make that clear it'd be appreciated.

Obviously you could look at the components and allow joining if the 
components are the same.  But I don't know if that happens enough in 
practice to be worth the effort.

OK with the comment update if/when the kit as a whole is approved.

jeff



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 3/9] dce: Don't dead-code delete separately wrapped restores
  2016-08-01  1:43 ` [PATCH 3/9] dce: Don't dead-code delete separately wrapped restores Segher Boessenkool
@ 2016-09-08 17:52   ` Jeff Law
  2016-09-09 15:59     ` Segher Boessenkool
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-08 17:52 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches; +Cc: bschmidt

On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
> Deleting restores (before a noreturn) that are dead confuses dwarf2cfi.
>
> 2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* dce.c (delete_unmarked_insns): Don't delete instructions with
> 	a REG_CFA_RESTORE note.
I don't really understand this one.  Why is the restore marked dead and 
why doesn't that happen for normal epilogues?  Something wonky seems to 
be going on here.

jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/9] regrename: Don't rename restores
  2016-08-01  1:57 ` [PATCH 4/9] regrename: Don't rename restores Segher Boessenkool
@ 2016-09-08 17:54   ` Jeff Law
  2016-09-09 21:05     ` Segher Boessenkool
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-08 17:54 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches; +Cc: bschmidt

On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
> A restore is supposed to restore some certain register.  Restoring it
> into some other register will not work.  Don't.
>
> 2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* regrename.c (build_def_use): Invalidate chains that have a
> 	REG_CFA_RESTORE on some instruction.
Again, how is this different from a normal epilogue that restores 
registers which regrename seems to not muck with.

Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 5/9] regrename: Don't run if function was separately shrink-wrapped
  2016-08-01  2:12 ` [PATCH 5/9] regrename: Don't run if function was separately shrink-wrapped Segher Boessenkool
@ 2016-09-08 17:54   ` Jeff Law
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff Law @ 2016-09-08 17:54 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches; +Cc: bschmidt

On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
> Unfortunately even after the previous patch there are still a few cases
> where regrename creates invalid code when used together with separate
> shrink-wrapping.  At noreturn exits regrename thinks it can use all
> callee-saved registers, but that is not true.  This patch disables
> regrename for functions that were separately shrink-wrapped.
>
> 2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* regrename.c (gate): Disable regrename if shrink_wrapped_separate
> 	is set in crtl.
I think this (and the prior patches) are masking a deeper issue.  It's 
almost as if we don't have correct lifetime information and the various 
return points.  I'd really like to see a deeper analysis of these issues.

jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 6/9] sel-sched: Don't mess with register restores
  2016-08-01  2:12 ` [PATCH 6/9] sel-sched: Don't mess with register restores Segher Boessenkool
  2016-08-04  7:33   ` Andrey Belevantsev
@ 2016-09-08 17:55   ` Jeff Law
  2016-09-09 21:13     ` Segher Boessenkool
  1 sibling, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-08 17:55 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches; +Cc: bschmidt

On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
> If selective scheduling copies register restores it confuses dwarf2cfi.
>
> 2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* sel-sched-ir.c (init_global_and_expr_for_insn): Don't copy
> 	instructions with a REG_CFA_RESTORE note.
Similarly, I think you're papering over a lifetime problem of some kind 
here.

jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 7/9] cprop: Leave RTX_FRAME_RELATED_P instructions alone
  2016-08-01  1:57 ` [PATCH 7/9] cprop: Leave RTX_FRAME_RELATED_P instructions alone Segher Boessenkool
@ 2016-09-08 18:34   ` Jeff Law
  2016-09-09 21:21     ` Segher Boessenkool
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-08 18:34 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches; +Cc: bschmidt

On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
> Doing cprop on frame-related instructions blows up spectacularly.
> So don't.
>
> 2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* regcprop.c (copyprop_hardreg_forward_1): Don't change
> 	RTX_FRAME_RELATED_P instructions.
Example or testcase?

jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components
  2016-08-01  1:57 ` [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components Segher Boessenkool
@ 2016-09-08 19:03   ` Jeff Law
  2016-09-09 22:03     ` Segher Boessenkool
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-08 19:03 UTC (permalink / raw)
  To: Segher Boessenkool, gcc-patches; +Cc: bschmidt

On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
>
> Deciding what blocks should run with a certain component active so that
> the total cost of executing the prologues (and epilogues) is optimal, is
> not a computationally feasible problem.
Really?  It's just a dataflow problem is it not and one that ought to 
converge very quickly I'd think.  Or is it more a function of having to 
run it on so many independent components?  I'm still pondering the value 
of having every GPR be an independent component :-)

ISTM this adds a fair amount of complexity to the implementation in that 
prologues and epilogues for a particular component can run more than 
once.  Can you give a concrete example where this happens so that we can 
all understand it better?

If we keep this aspect of the implementation it seems like a note in the 
developer section would be in order.  It's certainly non-intuitive.

I only glanced over the code that seems related to this aspect of the 
implementation.  If we decide to go forward, I'd like to look at it 
again more closely.

>
> Now all that is left is inserting prologues and epilogues on all edges
> that jump into resp. out of the "active" set of blocks.  Often we need
> to insert some components' prologues (or epilogues) on all edges into
> (or out of) a block.  In theory cross-jumping can unify all such, but
> in practice that often fails; besides, that is a lot of work.  So in
> this case we insert the prologue and epilogue components at the "head"
> or "tail" of a block, instead.
Cross jumping is rather simplistic, so I'm not surprised that it doesn't 
catch all this stuff.

>
> As a final optimisation, if a block needs a prologue and its immediate
> dominator has the block as a post-dominator, the dominator gets the
> prologue as well.
So why not just put it in the idom and not in the dominated block? 
Doesn't this just end up running that component's prologue twice?

>
> 2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* function.c (thread_prologue_and_epilogue_insns): Recompute the
> 	live info.  Call try_shrink_wrapping_separate.  Compute the
> 	prologue_seq afterwards, if it has possibly changed.  Compute the
> 	split_prologue_seq and epilogue_seq later, too.
> 	* shrink-wrap.c: #include cfgbuild.h.
> 	(dump_components): New function.
> 	(struct sw): New struct.
> 	(SW): New function.
> 	(init_separate_shrink_wrap): New function.
> 	(fini_separate_shrink_wrap): New function.
> 	(place_prologue_for_one_component): New function.
> 	(spread_components): New function.
> 	(disqualify_problematic_components): New function.
> 	(emit_common_heads_for_components): New function.
> 	(emit_common_tails_for_components): New function.
> 	(insert_prologue_epilogue_for_components): New function.
> 	(try_shrink_wrapping_separate): New function.
> 	* shrink-wrap.h: Declare try_shrink_wrapping_separate.
> ---
>  gcc/function.c    |  15 +-
>  gcc/shrink-wrap.c | 715 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  gcc/shrink-wrap.h |   1 +
>  3 files changed, 729 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/function.c b/gcc/function.c
> index bba0705..390e9a6 100644
> --- a/gcc/function.c
> +++ b/gcc/function.c
> @@ -5912,6 +5912,12 @@ make_epilogue_seq (void)
>  void
>  thread_prologue_and_epilogue_insns (void)
>  {
> +  if (optimize > 1)
> +    {
> +      df_live_add_problem ();
> +      df_live_set_all_dirty ();
> +    }
Perhaps conditional on separate shrink wrapping?

> @@ -5922,9 +5928,7 @@ thread_prologue_and_epilogue_insns (void)
>    edge entry_edge = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (cfun));
>    edge orig_entry_edge = entry_edge;
>
> -  rtx_insn *split_prologue_seq = make_split_prologue_seq ();
>    rtx_insn *prologue_seq = make_prologue_seq ();
> -  rtx_insn *epilogue_seq = make_epilogue_seq ();
>
>    /* Try to perform a kind of shrink-wrapping, making sure the
>       prologue/epilogue is emitted only around those parts of the
> @@ -5932,6 +5936,13 @@ thread_prologue_and_epilogue_insns (void)
>
>    try_shrink_wrapping (&entry_edge, prologue_seq);
>
> +  df_analyze ();
> +  try_shrink_wrapping_separate (entry_edge->dest);
> +  if (crtl->shrink_wrapped_separate)
> +    prologue_seq = make_prologue_seq ();
Perhaps push the df_analyze call into try_shrink_wrapping_separate?

ANd if it was successful, do you need to update the DF information?  Is 
that perhaps the cause of some of the issues we're seeing with DCE, 
regrename, the scheduler?



> diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c
> index b85b1c3..643e375 100644
> --- a/gcc/shrink-wrap.c
> +++ b/gcc/shrink-wrap.c
> @@ -34,6 +34,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "output.h"
>  #include "tree-pass.h"
>  #include "cfgrtl.h"
> +#include "cfgbuild.h"
>  #include "params.h"
>  #include "bb-reorder.h"
>  #include "shrink-wrap.h"
> @@ -1006,3 +1007,717 @@ try_shrink_wrapping (edge *entry_edge, rtx_insn *prologue_seq)
>    BITMAP_FREE (bb_with);
>    free_dominance_info (CDI_DOMINATORS);
>  }
> +\f
> +/* Separate shrink-wrapping
> +
> +   Instead of putting all of the prologue and epilogue in one spot, we
> +   can put parts of it in places where those components are executed less
> +   frequently.  The following code does this, for prologue and epilogue
> +   components that can be put in more than one location, and where those
> +   components can be executed more than once (the epilogue component will
> +   always be executed before the prologue component is executed a second
> +   time).
> +
> +   What exactly is a component is target-dependent.  The more usual
> +   components are simple saves/restores to/from the frame of callee-saved
> +   registers.  This code treats components abstractly (as an sbitmap),
> +   letting the target handle all details.
> +
> +   Prologue components are placed in such a way that for every component
> +   the prologue is executed as infrequently as possible.  We do this by
> +   walking the dominator tree, comparing the cost of placing a prologue
> +   component before a block to the sum of costs determined for all subtrees
> +   of that block.
> +
> +   From this placement, we then determine for each component all blocks
> +   where at least one of this block's dominators (including itself) will
> +   get a prologue inserted.  That then is how the components are placed.
> +   We could place the epilogue components a bit smarter (we can save a
> +   bit of code size sometimes); this is a possible future improvement.
> +
> +   Prologues and epilogues are preferably placed into a block, either at
> +   the beginning or end of it, if it is needed for all predecessor resp.
> +   successor edges; or placed on the edge otherwise.
> +
> +   If the placement of any prologue/epilogue leads to a situation we cannot
> +   handle (for example, an abnormal edge would need to be split, or some
> +   targets want to use some specific registers that may not be available
> +   where we want to put them), separate shrink-wrapping for the components
> +   in that prologue/epilogue is aborted.  */
> +
> +
> +/* Print the sbitmap COMPONENTS to the DUMP_FILE if not empty, with the
> +   label LABEL.  */
> +static void
> +dump_components (const char *label, sbitmap components)
> +{
> +  if (bitmap_empty_p (components))
> +    return;
> +
> +  fprintf (dump_file, " [%s", label);
> +
> +  for (unsigned int j = 0; j < components->n_bits; j++)
> +    if (bitmap_bit_p (components, j))
> +      fprintf (dump_file, " %u", j);
> +
> +  fprintf (dump_file, "]");
Consider allowing the target to provide a mapping from the component to 
a symbolic name of some kind and using that in the dumps.  Related, 
consider using an enum rather than magic constants in the target bits (I 
noted seeing component #0 as a magic constant in the ppc implementation).


> +
> +/* A helper function for accessing the pass-specific info.  */
> +static inline struct sw *
> +SW (basic_block bb)
> +{
> +  gcc_assert (bb->aux);
> +  return (struct sw *) bb->aux;
> +}
> +
> +/* Create the pass-specific data structures for separately shrink-wrapping
> +   with components COMPONENTS.  */
> +static void
> +init_separate_shrink_wrap (sbitmap components)
So it seems like there's a toplevel list of components that's owned by 
the target and state at each block components that are owned by the 
generic code.  That's fine.  Just make sure we doc that the toplevel 
list of components is allocated by the backend (and where does it get 
freed?)

Consider using auto_sbitmap rather than manually managing 
allocation/releasing of the per-block structures.  In fact, can't all of 
SW become a class and we lose the explicit init/fini routines in favor 
of a ctor/dtor?



> +
> +/* Place code for prologues and epilogues for COMPONENTS where we can put
> +   that code at the start of basic blocks.  */
> +static void
> +emit_common_heads_for_components (sbitmap components)
> +{
> +  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
> +  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
> +  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components));
> +
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, cfun)
> +    {
> +      /* Find which prologue resp. epilogue components are needed for all
> +	 predecessor edges to this block.  */
Nit.  Avoid ".resp".  I actually had to look that abbreviation up :-) 
Being a bit more verbose in the comments would be preferred over ".resp".


Having looked at this in more detail now, I'm wondering if we want to 
talk at all in the docs about when selection vs disqualifying happens. 
ISTM that we build the set of components we want to insert for each 
edge/block.  When that's complete, we then prune those results with the 
disqualifying sets.

For the PPC R0 vs LR is the only thing that causes disqualification 
right?  Can't that be handled when we build the set of components we 
want to insert for each edge/block?  Is there some advantage to handling 
disqualifications after all the potential insertion points have been 
handled?

Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-08 16:41         ` Jeff Law
@ 2016-09-09  6:31           ` Segher Boessenkool
  2016-09-09 15:28             ` Jeff Law
  0 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09  6:31 UTC (permalink / raw)
  To: Jeff Law; +Cc: Michael Matz, Bernd Schmidt, gcc-patches

Thanks for looking at the patches Jeff.

On Thu, Sep 08, 2016 at 10:28:59AM -0600, Jeff Law wrote:
> Right.  Essentially Segher's patch introduces the concept of prologue 
> components that are independent of each other and which can be 
> shrink-wrapped separately.  The degree of independence is highly target 
> specific, of course.

There can be dependencies as well (e.g. a save to the frame requires that
frame to be set up already), but the current patches have no way to
describe such dependencies.  I haven't found a good way to describe the
dependencies yet.  Finding a good balance between general and useful isn't
easy, as usual.

> I think one of the questions (and I haven't looked through the whole 
> thread yet to see if it's answered) is why the basic shrink-wrapping 
> algorithm can't be applied to each of the prologue components -- though 
> you may have answered it with your loop example below.

You get code size explosion with the "basic" algorithm.  And a lot of
that isn't really worth it: avoiding the whole prologue/epilogue is
usually worth copying some blocks for, but avoiding just a single register
save/restore pair?  More doubtful.

> >That includes moving parts of the prologue into loops:
> >
> >int g() {
> >  int sum = 0;
> >  for (int i = 0; i < NUM; i++) {
> >    sum += i;
> >    if (sum >= CUTOFF) {
> >      some_long_winded_expression();
> >      that_eventually_calls_abort();
> >    }
> >  }
> >  return sum;
> >}
> >
> >Here all parts of the prologue that somehow make it possible to call other
> >functions are necessary only when the program will abort eventually: hence
> >is necessary only at one call of g() at most.  Again it's sensible to move
> >those parts inside the unlikely condition, even though it's inside a loop.
> Thanks.  I'd been wondering about when it'd be useful to push prologue 
> code into a loop nest when I saw the patches fly by, but didn't think 
> about it too much.  I haven't looked at the shrink-wrapping literature 
> in years, but I don't recall it having any concept that there were cases 
> where sinking into a loop nest was profitable.

It isn't common, but it does happen.  If you use a proper cost metric
based on executiuon frequency with some algorithm then that algorithm will
naturally avoid placing *logues into loops.


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-09-08 17:37     ` Jeff Law
@ 2016-09-09 11:03       ` Bernd Schmidt
  2016-09-09 15:13         ` Segher Boessenkool
  2016-09-09 15:51         ` Jeff Law
  2016-09-09 15:40       ` Segher Boessenkool
  1 sibling, 2 replies; 74+ messages in thread
From: Bernd Schmidt @ 2016-09-09 11:03 UTC (permalink / raw)
  To: Jeff Law, Segher Boessenkool, gcc-patches

On 09/08/2016 07:20 PM, Jeff Law wrote:
> On 08/29/2016 03:31 AM, Bernd Schmidt wrote:
>> How do these actually know where to save/restore registers? The frame
>> pointer may have been eliminated, and SP isn't necessarily constant
>> during the function. Seems like you'd have to calculate CFA reg/offset
>> much like dwarf2out does and pass it to this hook.
> So I think the confusion here is these hooks are independent of
> placement. ie, the target independent code does something like:
>
> FOR_EACH_BB
>   Build the component bitmap using the incoming edge components
>   Emit the prologue components at the start of the block
>   Emit the epilogue components at the end of the block
>
>
> The components handled by a particular block start are set/cleared by
> the other hooks.

Hmm? The problem is that you can't generally emit a save/restore 
independent of placement, because you may not know which offset to use 
from whichever base register. But these offsets aren't necessarily 
constant throughout the function. Segher explained that the algorithm 
deals with this by giving up in many cases, which of course limits the 
usefulness. It probably makes it unusable entirely on targets that want 
to use pushes for function args.


Bernd

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-09-09 11:03       ` Bernd Schmidt
@ 2016-09-09 15:13         ` Segher Boessenkool
  2016-09-09 18:31           ` Jeff Law
  2016-09-09 15:51         ` Jeff Law
  1 sibling, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 15:13 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Jeff Law, gcc-patches

On Fri, Sep 09, 2016 at 12:58:11PM +0200, Bernd Schmidt wrote:
> Hmm? The problem is that you can't generally emit a save/restore 
> independent of placement, because you may not know which offset to use 
> from whichever base register. But these offsets aren't necessarily 
> constant throughout the function. Segher explained that the algorithm 
> deals with this by giving up in many cases, which of course limits the 
> usefulness. It probably makes it unusable entirely on targets that want 
> to use pushes for function args.

It's not the generic algorithm that gives up; it's the target hook.
Specifically, at least for the PowerPC one I wrote, the
TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS hook gives up on register
components that need an offset from the base reg (stack or frame pointer)
that cannot be used in a single instruction (i.e. won't fit in 16 bits).

The generic code only does

+  /* We don't handle "strange" functions.  */
+  if (cfun->calls_alloca
+      || cfun->calls_setjmp
+      || cfun->can_throw_non_call_exceptions
+      || crtl->calls_eh_return
+      || crtl->has_nonlocal_goto
+      || crtl->saves_all_registers)
+    return;

so that does not give up in "many" cases.

Targets that push function args can handle things fine as far as I see?
Targets that normally use push insns in the prologue will just have to
not do that for the components that are separately wrapped.  Or they can
still use pushes to reserve that space, if that works better.

Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-08 16:58         ` Jeff Law
@ 2016-09-09 15:26           ` Segher Boessenkool
  2016-09-09 16:26             ` Jeff Law
  0 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 15:26 UTC (permalink / raw)
  To: Jeff Law; +Cc: Bernd Schmidt, gcc-patches

On Thu, Sep 08, 2016 at 10:41:37AM -0600, Jeff Law wrote:
> So can you expand on the malloc example a bit -- I'm pretty sure I 
> understand what you're trying to do, but a concrete example may help 
> Bernd and be useful for archival purposes.

Sure, but it's big (which is the problem :-) )

> I also know that Carlos is interested in the malloc example -- so I'd 
> like to be able to pass that along to him.
> 
> Given the multiple early exit and fast paths through the allocator, I'm 
> not at all surprised that sinking different components of the prologue 
> to different locations is useful.
> 
> Also if there's a case where sinking into a loop occurs, definitely 
> point that out.

Not sure that happens in there, I'll find out.

> >>That's a later addition anyway and isn't necessary to do
> >>shrink-wrapping in the first place.
> >
> >No, it always did that, just not as often (it only duplicated straight-line
> >code before).
> Presumably (I haven't looked yet), the duplication is so that we can 
> isolate one or more paths which in turn allows sinking the prologue 
> further on some of those paths.

It duplicates as many blocks as it needs to dup, to make as many exits
as possible reachable without *any* prologue/epilogue.

As the header comment before the older code says:

/* Try to perform a kind of shrink-wrapping, making sure the
   prologue/epilogue is emitted only around those parts of the
   function that require it.

   There will be exactly one prologue, and it will be executed either
   zero or one time, on any path.  Depending on where the prologue is
   placed, some of the basic blocks can be reached via both paths with
   and without a prologue.  Such blocks will be duplicated here, and the
   edges changed to match.

   Paths that go to the exit without going through the prologue will use
   a simple_return instead of the epilogue.  We maximize the number of
   those, making sure to only duplicate blocks that can be duplicated.
   If the prologue can then still be placed in multiple locations, we
   place it as early as possible.

> This is something I'll definitely want to look at -- block duplication 
> to facilitate code elimination (or in this case avoid code insertion) 
> hits several areas of interest to me -- and how we balance duplication 
> vs runtime savings is always interesting.

Yeah.  If you use separate shrink-wrapping you still *also* get the
"old" algorithm, if that wasn't clear.

Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-09  6:31           ` Segher Boessenkool
@ 2016-09-09 15:28             ` Jeff Law
  2016-09-09 15:43               ` Segher Boessenkool
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-09 15:28 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Michael Matz, Bernd Schmidt, gcc-patches

On 09/09/2016 12:19 AM, Segher Boessenkool wrote:
> Thanks for looking at the patches Jeff.
>
> On Thu, Sep 08, 2016 at 10:28:59AM -0600, Jeff Law wrote:
>> Right.  Essentially Segher's patch introduces the concept of prologue
>> components that are independent of each other and which can be
>> shrink-wrapped separately.  The degree of independence is highly target
>> specific, of course.
>
> There can be dependencies as well (e.g. a save to the frame requires that
> frame to be set up already), but the current patches have no way to
> describe such dependencies.  I haven't found a good way to describe the
> dependencies yet.  Finding a good balance between general and useful isn't
> easy, as usual.
Right.  I over-simplified a bit here.  Some dependencies can be handled 
via the hooks in the target files.  I think what we've got is sufficient 
for now -- we'll have a much clearer picture of weak spots in the design 
if/when someone enables separate shrink wrapping for another target. 
Until then, I think we're OK.

>
>> I think one of the questions (and I haven't looked through the whole
>> thread yet to see if it's answered) is why the basic shrink-wrapping
>> algorithm can't be applied to each of the prologue components -- though
>> you may have answered it with your loop example below.
>
> You get code size explosion with the "basic" algorithm.  And a lot of
> that isn't really worth it: avoiding the whole prologue/epilogue is
> usually worth copying some blocks for, but avoiding just a single register
> save/restore pair?  More doubtful.
Understood.   We have similar balancing problems elsewhere.  How much 
duplication should we allow to expose a CSE or eliminate a conditional 
jump on a path through the CFG.

So I think sticking with this as a design decision makes sense -- does 
it impact the issue around running a particular component's prologue 
more than once?


>> Thanks.  I'd been wondering about when it'd be useful to push prologue
>> code into a loop nest when I saw the patches fly by, but didn't think
>> about it too much.  I haven't looked at the shrink-wrapping literature
>> in years, but I don't recall it having any concept that there were cases
>> where sinking into a loop nest was profitable.
>
> It isn't common, but it does happen.  If you use a proper cost metric
> based on executiuon frequency with some algorithm then that algorithm will
> naturally avoid placing *logues into loops.
Right and the cases where it's placing them into loops it's going to be 
doing so on paths that are unlikely executed within the loop.  ISTM that 
using frequencies is a significant step forward over the older placement 
algorithms in this space (which IIRC were structured as simple dataflow 
solvers) -- with the added advantage that if we have profiling data we 
can make better decisions.

Does this impact the compile time computation complexity issue that was 
raised elsewhere?

jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-08 17:20       ` Jeff Law
@ 2016-09-09 15:33         ` Segher Boessenkool
  2016-09-09 16:49           ` Jeff Law
  0 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 15:33 UTC (permalink / raw)
  To: Jeff Law; +Cc: Bernd Schmidt, gcc-patches

On Thu, Sep 08, 2016 at 10:58:13AM -0600, Jeff Law wrote:
> >And that comment puzzles me. Surely prologue and epilogue are executed
> >only once currently, so how does frequency come into it? Again - please
> >provide an example.
> Right, they're executed once currently.  But the prologue could be sunk 
> into locations where they are not executed every time the function is 
> called.  That's the basis behind shrink wrapping.

Right.

> Segher's code essentially allows individual components of the prologue 
> to sink to different points within the function rather than forcing the 
> prologue to be sunk as an atomic unit.

It also allows prologue an epilogue components to be placed in multiple
places, and even allows them to be executed more than once, if that is
cheaper.

> >>The full-prologue algorithm makes as many blocks run without prologue as
> >>possible, by duplicating blocks where that helps.  If you do this for
> >>every component you can and up with 2**40 blocks for just 40 components,
> >
> >Ok, so why wouldn't we use the existing code with the duplication part
> >disabled? That's a later addition anyway and isn't necessary to do
> >shrink-wrapping in the first place.
> I think the concern here is the balance between code explosion and the 
> runtime gains.

The code explosion can be terrible if you have only a few components,
already.  The runtime gains are not so great either.  Here's a simple
example, showing part of a cfg, the exits to the right are calls to
abort and need some prologue component:

|
1
|\
| \
2  X1
|\
| \
|  X2
|

With the "old" algorithm, you need to place the prologue at 1 (because
you can only have one), and then the fall-through path also necessarily
gets that component's prologue.  With the "separate" algorithm, the
component prologues can be placed at X1 and X2 instead (and that will
most likely be cheapest according to the cost model, too).

You can also put this in a loop.  I'll try to make a simple example
with that.

Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-09-08 17:37     ` Jeff Law
  2016-09-09 11:03       ` Bernd Schmidt
@ 2016-09-09 15:40       ` Segher Boessenkool
  2016-09-09 16:58         ` Jeff Law
  1 sibling, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 15:40 UTC (permalink / raw)
  To: Jeff Law; +Cc: Bernd Schmidt, gcc-patches

On Thu, Sep 08, 2016 at 11:20:41AM -0600, Jeff Law wrote:
> On 08/29/2016 03:31 AM, Bernd Schmidt wrote:
> >On 08/01/2016 03:42 AM, Segher Boessenkool wrote:
> >>+@deftypefn {Target Hook} void
> >>TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS (sbitmap)
> >>+Emit prologue insns for the components indicated by the parameter.
> >>+@end deftypefn
> >>+
> >>+@deftypefn {Target Hook} void
> >>TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS (sbitmap)
> >>+Emit epilogue insns for the components indicated by the parameter.
> >>+@end deftypefn
> >
> >How do these actually know where to save/restore registers? The frame
> >pointer may have been eliminated, and SP isn't necessarily constant
> >during the function. Seems like you'd have to calculate CFA reg/offset
> >much like dwarf2out does and pass it to this hook.
> So I think the confusion here is these hooks are independent of 
> placement. ie, the target independent code does something like:
> 
> FOR_EACH_BB
>   Build the component bitmap using the incoming edge components
>   Emit the prologue components at the start of the block
>   Emit the epilogue components at the end of the block

So you think those hooks need a BB parameter?  If there are ports that
need that, then sure.  PowerPC doesn't need it.


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-09 15:28             ` Jeff Law
@ 2016-09-09 15:43               ` Segher Boessenkool
  2016-09-09 18:25                 ` Jeff Law
  0 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 15:43 UTC (permalink / raw)
  To: Jeff Law; +Cc: Michael Matz, Bernd Schmidt, gcc-patches

On Fri, Sep 09, 2016 at 09:26:39AM -0600, Jeff Law wrote:
> >>I think one of the questions (and I haven't looked through the whole
> >>thread yet to see if it's answered) is why the basic shrink-wrapping
> >>algorithm can't be applied to each of the prologue components -- though
> >>you may have answered it with your loop example below.
> >
> >You get code size explosion with the "basic" algorithm.  And a lot of
> >that isn't really worth it: avoiding the whole prologue/epilogue is
> >usually worth copying some blocks for, but avoiding just a single register
> >save/restore pair?  More doubtful.
> Understood.   We have similar balancing problems elsewhere.  How much 
> duplication should we allow to expose a CSE or eliminate a conditional 
> jump on a path through the CFG.
> 
> So I think sticking with this as a design decision makes sense -- does 
> it impact the issue around running a particular component's prologue 
> more than once?

I don't follow, sorry; could you rephrase?

> >>Thanks.  I'd been wondering about when it'd be useful to push prologue
> >>code into a loop nest when I saw the patches fly by, but didn't think
> >>about it too much.  I haven't looked at the shrink-wrapping literature
> >>in years, but I don't recall it having any concept that there were cases
> >>where sinking into a loop nest was profitable.
> >
> >It isn't common, but it does happen.  If you use a proper cost metric
> >based on executiuon frequency with some algorithm then that algorithm will
> >naturally avoid placing *logues into loops.
> Right and the cases where it's placing them into loops it's going to be 
> doing so on paths that are unlikely executed within the loop.  ISTM that 
> using frequencies is a significant step forward over the older placement 
> algorithms in this space (which IIRC were structured as simple dataflow 
> solvers) -- with the added advantage that if we have profiling data we 
> can make better decisions.

Using the known/guessed execution frequencies is not really new, just not
forty years old :-)

> Does this impact the compile time computation complexity issue that was 
> raised elsewhere?

I'm not sure what you mean here either, sorry.  It is all O(NM) with N
the number of BBs and M the number of components (and assuming dom
lookups are constant time, as usual).


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-09-08 17:48   ` Jeff Law
@ 2016-09-09 15:44     ` Segher Boessenkool
  0 siblings, 0 replies; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 15:44 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, bschmidt

On Thu, Sep 08, 2016 at 11:37:17AM -0600, Jeff Law wrote:
> I don't have major concerns with this patch -- I'd like to see 
> clarification done on the ownership of the sbitmaps (ie, who allocates 
> and releases those objects).  I'd like to see if we can get a better 
> introduction as well.

I'll work a bit more on improving the internals documentation, okay.

Thanks,


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-09-09 11:03       ` Bernd Schmidt
  2016-09-09 15:13         ` Segher Boessenkool
@ 2016-09-09 15:51         ` Jeff Law
  1 sibling, 0 replies; 74+ messages in thread
From: Jeff Law @ 2016-09-09 15:51 UTC (permalink / raw)
  To: Bernd Schmidt, Segher Boessenkool, gcc-patches

On 09/09/2016 04:58 AM, Bernd Schmidt wrote:
> On 09/08/2016 07:20 PM, Jeff Law wrote:
>> On 08/29/2016 03:31 AM, Bernd Schmidt wrote:
>>> How do these actually know where to save/restore registers? The frame
>>> pointer may have been eliminated, and SP isn't necessarily constant
>>> during the function. Seems like you'd have to calculate CFA reg/offset
>>> much like dwarf2out does and pass it to this hook.
>> So I think the confusion here is these hooks are independent of
>> placement. ie, the target independent code does something like:
>>
>> FOR_EACH_BB
>>   Build the component bitmap using the incoming edge components
>>   Emit the prologue components at the start of the block
>>   Emit the epilogue components at the end of the block
>>
>>
>> The components handled by a particular block start are set/cleared by
>> the other hooks.
>
> Hmm? The problem is that you can't generally emit a save/restore
> independent of placement, because you may not know which offset to use
> from whichever base register. But these offsets aren't necessarily
> constant throughout the function. Segher explained that the algorithm
> deals with this by giving up in many cases, which of course limits the
> usefulness. It probably makes it unusable entirely on targets that want
> to use pushes for function args.
On a target with ACCUMULATE_OUTGOING_ARGS the offsets are generally 
going to be constant.  For a target that's pushing args and not using a 
frame pointer, this isn't likely to work.  But that's OK as those 
targets wouldn't define the hooks or would test for those cases within 
the given hooks.  I can envision (but will certainly not implement) 
separate shrink wrapping on m68k with a frame pointer in Segher's framework.

Essentially the hooks push these decisions into the target machine, 
which is where they belong.  FUrthermore, the hooks can build a custom 
sequence for each insertion point.

That allows (as an example) the PPC LR save/restore sequence to use r0 
as an intermediate register for that case.  We could use those 
mechanisms to ensure there's a scratch register at the insertion point 
for address computations on targets that need them, etc.



Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 3/9] dce: Don't dead-code delete separately wrapped restores
  2016-09-08 17:52   ` Jeff Law
@ 2016-09-09 15:59     ` Segher Boessenkool
  2016-09-09 16:39       ` Jeff Law
  0 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 15:59 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, bschmidt

On Thu, Sep 08, 2016 at 11:50:56AM -0600, Jeff Law wrote:
> On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
> >Deleting restores (before a noreturn) that are dead confuses dwarf2cfi.
> >
> >2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
> >
> >	* dce.c (delete_unmarked_insns): Don't delete instructions with
> >	a REG_CFA_RESTORE note.
> I don't really understand this one.  Why is the restore marked dead and 
> why doesn't that happen for normal epilogues?  Something wonky seems to 
> be going on here.

Because it's not behind a NOTE_INSN_EPILOGUE_BEGIN it is not treated
specially by DCE, like insns in "normal" epilogues are.

It is marked dead because it *is* dead: the value in the callee-save
register is not used by anything anymore (noreturn!)


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-09 15:26           ` Segher Boessenkool
@ 2016-09-09 16:26             ` Jeff Law
  2016-09-09 16:51               ` Segher Boessenkool
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-09 16:26 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Bernd Schmidt, gcc-patches

On 09/09/2016 09:17 AM, Segher Boessenkool wrote:
> On Thu, Sep 08, 2016 at 10:41:37AM -0600, Jeff Law wrote:
>> So can you expand on the malloc example a bit -- I'm pretty sure I
>> understand what you're trying to do, but a concrete example may help
>> Bernd and be useful for archival purposes.
>
> Sure, but it's big (which is the problem :-) )
Yea :(  But it's likely a very compelling example of real world code. 
It's almost certainly too big to turn into a testcase of any kind, but 
just some before/after annotated code would be helpful.

Ideally we'd have some smaller testcases we could put in the testsuite 
to ensure that the feature works over-time in the way intended would be 
helpful as well.

>>
>>>> That's a later addition anyway and isn't necessary to do
>>>> shrink-wrapping in the first place.
>>>
>>> No, it always did that, just not as often (it only duplicated straight-line
>>> code before).
>> Presumably (I haven't looked yet), the duplication is so that we can
>> isolate one or more paths which in turn allows sinking the prologue
>> further on some of those paths.
>
> It duplicates as many blocks as it needs to dup, to make as many exits
> as possible reachable without *any* prologue/epilogue.
>
> As the header comment before the older code says:
>
> /* Try to perform a kind of shrink-wrapping, making sure the
>    prologue/epilogue is emitted only around those parts of the
>    function that require it.
>
>    There will be exactly one prologue, and it will be executed either
>    zero or one time, on any path.
Right.  That's always been my understanding of the key driver for 
placement.  There's exactly one and will be executed one time or none 
across all paths in the CFG.

Essentially this is comparable to PRE-like algorithms for placement of 
expression evaluations.

And in separate shrink-wrapping world, we're leaving that model behind 
and I think that's one of the big things I'm struggling with -- we may 
execute a prologue component more than once if I've read everything 
correctly.


  Depending on where the prologue is
>    placed, some of the basic blocks can be reached via both paths with
>    and without a prologue.  Such blocks will be duplicated here, and the
>    edges changed to match.
Understood.

This really feels comparable to block duplication for the purposes of 
isolating a particular path through the CFG so that path can be modified 
without affecting the behavior of other paths through the CFG.

It's also directly comparable to block duplication to allow more 
aggressive code motion in PRE-like algorithms.

Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 3/9] dce: Don't dead-code delete separately wrapped restores
  2016-09-09 15:59     ` Segher Boessenkool
@ 2016-09-09 16:39       ` Jeff Law
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff Law @ 2016-09-09 16:39 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, bschmidt

On 09/09/2016 09:51 AM, Segher Boessenkool wrote:
> On Thu, Sep 08, 2016 at 11:50:56AM -0600, Jeff Law wrote:
>> On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
>>> Deleting restores (before a noreturn) that are dead confuses dwarf2cfi.
>>>
>>> 2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
>>>
>>> 	* dce.c (delete_unmarked_insns): Don't delete instructions with
>>> 	a REG_CFA_RESTORE note.
>> I don't really understand this one.  Why is the restore marked dead and
>> why doesn't that happen for normal epilogues?  Something wonky seems to
>> be going on here.
>
> Because it's not behind a NOTE_INSN_EPILOGUE_BEGIN it is not treated
> specially by DCE, like insns in "normal" epilogues are.
Ohhhhh!  While I don't see the special handling of things after 
NOTE_INSN_EPILOGUE_BEG, I do see this in df-scan.c:

   if (targetm.have_epilogue () && epilogue_completed)
     {
       /* Mark all call-saved registers that we actually used.  */
       for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
         if (df_regs_ever_live_p (i) && !LOCAL_REGNO (i)
             && !TEST_HARD_REG_BIT (regs_invalidated_by_call, i))
           bitmap_set_bit (exit_block_uses, i);
     }

Which then presumably gets used elsewhere when solving problems.

I wonder if we need some kind of similar mechanism to make those 
registers appear to be live at the component epilogues.

Jeff


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-09 15:33         ` Segher Boessenkool
@ 2016-09-09 16:49           ` Jeff Law
  2016-09-09 17:00             ` Segher Boessenkool
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-09 16:49 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Bernd Schmidt, gcc-patches

On 09/09/2016 09:28 AM, Segher Boessenkool wrote:
>> Segher's code essentially allows individual components of the prologue
>> to sink to different points within the function rather than forcing the
>> prologue to be sunk as an atomic unit.
>
> It also allows prologue an epilogue components to be placed in multiple
> places,
Right.  I need to keep that in the forefront of my mind.



and even allows them to be executed more than once, if that is
> cheaper.
This is the part that I'm still struggling with.

>
>>>> The full-prologue algorithm makes as many blocks run without prologue as
>>>> possible, by duplicating blocks where that helps.  If you do this for
>>>> every component you can and up with 2**40 blocks for just 40 components,
>>>
>>> Ok, so why wouldn't we use the existing code with the duplication part
>>> disabled? That's a later addition anyway and isn't necessary to do
>>> shrink-wrapping in the first place.
>> I think the concern here is the balance between code explosion and the
>> runtime gains.
>
> The code explosion can be terrible if you have only a few components,
> already.  The runtime gains are not so great either.  Here's a simple
> example, showing part of a cfg, the exits to the right are calls to
> abort and need some prologue component:
>
> |
> 1
> |\
> | \
> 2  X1
> |\
> | \
> |  X2
> |
   3
>
> With the "old" algorithm, you need to place the prologue at 1 (because
> you can only have one), and then the fall-through path also necessarily
> gets that component's prologue.  With the "separate" algorithm, the
> component prologues can be placed at X1 and X2 instead (and that will
> most likely be cheapest according to the cost model, too).
Thanks.  That really helps highlight some of the key differences between 
the old and new model.


>
> You can also put this in a loop.  I'll try to make a simple example
> with that.
That would be great.  And a case where they execute more than once too.

Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-09 16:26             ` Jeff Law
@ 2016-09-09 16:51               ` Segher Boessenkool
  2016-09-09 17:22                 ` Jeff Law
  0 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 16:51 UTC (permalink / raw)
  To: Jeff Law; +Cc: Bernd Schmidt, gcc-patches

On Fri, Sep 09, 2016 at 09:59:03AM -0600, Jeff Law wrote:
> On 09/09/2016 09:17 AM, Segher Boessenkool wrote:
> >On Thu, Sep 08, 2016 at 10:41:37AM -0600, Jeff Law wrote:
> >>So can you expand on the malloc example a bit -- I'm pretty sure I
> >>understand what you're trying to do, but a concrete example may help
> >>Bernd and be useful for archival purposes.
> >
> >Sure, but it's big (which is the problem :-) )
> Yea :(  But it's likely a very compelling example of real world code. 
> It's almost certainly too big to turn into a testcase of any kind, but 
> just some before/after annotated code would be helpful.

I'll work on it, but it won't be before tonight, probably quite a bit
later.  Seeing the difference in generated machine code is probably
most useful?  Better than RTL ;-)

> Ideally we'd have some smaller testcases we could put in the testsuite 
> to ensure that the feature works over-time in the way intended would be 
> helpful as well.

I might be able to make some really tiny testcases that will not need
updates all of the time.  We'll see.

> Right.  That's always been my understanding of the key driver for 
> placement.  There's exactly one and will be executed one time or none 
> across all paths in the CFG.

But that is not because it is good to have only one!  GCC expects there
to be only one, instead.  Some ports might even use prologues that cannot
be duplicated at all.

> And in separate shrink-wrapping world, we're leaving that model behind 
> and I think that's one of the big things I'm struggling with -- we may 
> execute a prologue component more than once if I've read everything 
> correctly.

Yes, and that is perhaps radically new in the GCC world, but not anywhere
else.

[ ... ]

> This really feels comparable to block duplication for the purposes of 
> isolating a particular path through the CFG so that path can be modified 
> without affecting the behavior of other paths through the CFG.

Yes, very much.

> It's also directly comparable to block duplication to allow more 
> aggressive code motion in PRE-like algorithms.

Yeah.


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-09-09 15:40       ` Segher Boessenkool
@ 2016-09-09 16:58         ` Jeff Law
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff Law @ 2016-09-09 16:58 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Bernd Schmidt, gcc-patches

On 09/09/2016 09:33 AM, Segher Boessenkool wrote:
> On Thu, Sep 08, 2016 at 11:20:41AM -0600, Jeff Law wrote:
>> On 08/29/2016 03:31 AM, Bernd Schmidt wrote:
>>> On 08/01/2016 03:42 AM, Segher Boessenkool wrote:
>>>> +@deftypefn {Target Hook} void
>>>> TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS (sbitmap)
>>>> +Emit prologue insns for the components indicated by the parameter.
>>>> +@end deftypefn
>>>> +
>>>> +@deftypefn {Target Hook} void
>>>> TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS (sbitmap)
>>>> +Emit epilogue insns for the components indicated by the parameter.
>>>> +@end deftypefn
>>>
>>> How do these actually know where to save/restore registers? The frame
>>> pointer may have been eliminated, and SP isn't necessarily constant
>>> during the function. Seems like you'd have to calculate CFA reg/offset
>>> much like dwarf2out does and pass it to this hook.
>> So I think the confusion here is these hooks are independent of
>> placement. ie, the target independent code does something like:
>>
>> FOR_EACH_BB
>>   Build the component bitmap using the incoming edge components
>>   Emit the prologue components at the start of the block
>>   Emit the epilogue components at the end of the block
>
> So you think those hooks need a BB parameter?  If there are ports that
> need that, then sure.  PowerPC doesn't need it.
I wouldn't add it now.  If we find a port that needs it, we can do so at 
that time.  Maybe the port wants to scan the block for a scratch 
register or somesuch.  But let's wait until we actually see a need.

I was mostly trying to highlight the high level structure and that each 
block can have a prologue and/or epilogue associated with it and that 
they are specialized for each block.

jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-09 16:49           ` Jeff Law
@ 2016-09-09 17:00             ` Segher Boessenkool
  2016-09-09 17:44               ` Jeff Law
  0 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 17:00 UTC (permalink / raw)
  To: Jeff Law; +Cc: Bernd Schmidt, gcc-patches

On Fri, Sep 09, 2016 at 10:48:30AM -0600, Jeff Law wrote:
> and even allows them to be executed more than once, if that is
> >cheaper.
> This is the part that I'm still struggling with.

The usual example:

1
|\
| \
|  2
| /
|/
3
|\
| \
|  4
| /
|/
5

where 2 and 4 need a certain prologue component (and the rest doesn't).

If the frequency of 2 plus that of 4 is less than the frequency of 1,
it is better to place a prologue (and epilogue) around both 2 and 4
than to place a prologue at 1 and an epilogue at 5.

> >You can also put this in a loop.  I'll try to make a simple example
> >with that.
> That would be great.  And a case where they execute more than once too.

Okido.


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-09 16:51               ` Segher Boessenkool
@ 2016-09-09 17:22                 ` Jeff Law
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff Law @ 2016-09-09 17:22 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Bernd Schmidt, gcc-patches

On 09/09/2016 10:49 AM, Segher Boessenkool wrote:
> On Fri, Sep 09, 2016 at 09:59:03AM -0600, Jeff Law wrote:
>> On 09/09/2016 09:17 AM, Segher Boessenkool wrote:
>>> On Thu, Sep 08, 2016 at 10:41:37AM -0600, Jeff Law wrote:
>>>> So can you expand on the malloc example a bit -- I'm pretty sure I
>>>> understand what you're trying to do, but a concrete example may help
>>>> Bernd and be useful for archival purposes.
>>>
>>> Sure, but it's big (which is the problem :-) )
>> Yea :(  But it's likely a very compelling example of real world code.
>> It's almost certainly too big to turn into a testcase of any kind, but
>> just some before/after annotated code would be helpful.
>
> I'll work on it, but it won't be before tonight, probably quite a bit
> later.  Seeing the difference in generated machine code is probably
> most useful?  Better than RTL ;-)
Yea, machine code is probably most useful.

>
>> Ideally we'd have some smaller testcases we could put in the testsuite
>> to ensure that the feature works over-time in the way intended would be
>> helpful as well.
>
> I might be able to make some really tiny testcases that will not need
> updates all of the time.  We'll see.
Understood, particularly on the maintenance burden for these kind of 
tests.  The alternate approach might be to pick up Bernd's work and see 
if we can use that test the various components.

>
> But that is not because it is good to have only one!  GCC expects there
> to be only one, instead.  Some ports might even use prologues that cannot
> be duplicated at all.
Agreed on all points.

>
>> And in separate shrink-wrapping world, we're leaving that model behind
>> and I think that's one of the big things I'm struggling with -- we may
>> execute a prologue component more than once if I've read everything
>> correctly.
>
> Yes, and that is perhaps radically new in the GCC world, but not anywhere
> else.
And just to be clear I think I'm mostly looking for why multiple 
execution of a particular component is useful.  My first thought in that 
case is that we've got a redundancy and that the redundancy should be 
identified and eliminated :-)

I've got a few not-well-formed ideas in my head about when that might 
happen and why we might not want to squash out the redundancy.  But it'd 
be a hell of a step forward to see that case in action.

Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-09 17:00             ` Segher Boessenkool
@ 2016-09-09 17:44               ` Jeff Law
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff Law @ 2016-09-09 17:44 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Bernd Schmidt, gcc-patches

On 09/09/2016 10:57 AM, Segher Boessenkool wrote:
> On Fri, Sep 09, 2016 at 10:48:30AM -0600, Jeff Law wrote:
>> and even allows them to be executed more than once, if that is
>>> cheaper.
>> This is the part that I'm still struggling with.
>
> The usual example:
>
> 1
> |\
> | \
> |  2
> | /
> |/
> 3
> |\
> | \
> |  4
> | /
> |/
> 5
>
> where 2 and 4 need a certain prologue component (and the rest doesn't).
Perfect.

So this is consistent with one of the ideas I was starting to form.  I'm 
going to stay in the PRE world because it's model is so damn close to 
what you're doing.

PRE minimizes expression evaluations on paths *without* introducing 
evaluations on paths that didn't already have one.

So if we pretend we had some expression evaluated in 2 & 4.  The path 
1->2->3->4 has two evaluations of the expression, but PRE can't really 
do anything here.  We can't hoist the evaluation into a better spot as 
doing so would introduce evaluations on paths that didn't have one 
before.  2 & 4 are the proper locations for that expression evaluation.

And the same applies to separate shrink wrapping.  2 & 4 are the right 
place.  It's all so clear now.

I'll note that duplicating 3 into 3' and redirecting the edge 2->3 to 
2->3' allows us to do better PRE and prologue insertion.  But I don't 
think that's a requirement for you to go forward :-)

Anyway, it's all clear now.  Thanks so much for that example.

Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-09 15:43               ` Segher Boessenkool
@ 2016-09-09 18:25                 ` Jeff Law
  2016-09-09 20:29                   ` Segher Boessenkool
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-09 18:25 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Michael Matz, Bernd Schmidt, gcc-patches

On 09/09/2016 09:40 AM, Segher Boessenkool wrote:
>>
>> So I think sticking with this as a design decision makes sense -- does
>> it impact the issue around running a particular component's prologue
>> more than once?
>
> I don't follow, sorry; could you rephrase?
Nevermind -- my question has been resolved.  I was fumbling around 
trying to figure out the circumstances where a prologue/epilogue might 
be run more than once.  With your recent message, that's been cleared up.

>> Right and the cases where it's placing them into loops it's going to be
>> doing so on paths that are unlikely executed within the loop.  ISTM that
>> using frequencies is a significant step forward over the older placement
>> algorithms in this space (which IIRC were structured as simple dataflow
>> solvers) -- with the added advantage that if we have profiling data we
>> can make better decisions.
>
> Using the known/guessed execution frequencies is not really new, just not
> forty years old :-)
Referring to its use in GCC (or lack thereof).

>
>> Does this impact the compile time computation complexity issue that was
>> raised elsewhere?
>
> I'm not sure what you mean here either, sorry.  It is all O(NM) with N
> the number of BBs and M the number of components (and assuming dom
> lookups are constant time, as usual).
You said in a different message that computing optimal placement points 
for prologue/epilogue components was not computationally feasible.   I'm 
just trying to figure out if the switch to utilizing block frequencies 
is a part of what makes solving that problem infeasible.

jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-09-09 15:13         ` Segher Boessenkool
@ 2016-09-09 18:31           ` Jeff Law
  2016-09-09 20:41             ` Segher Boessenkool
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-09 18:31 UTC (permalink / raw)
  To: Segher Boessenkool, Bernd Schmidt; +Cc: gcc-patches

On 09/09/2016 09:04 AM, Segher Boessenkool wrote:
> On Fri, Sep 09, 2016 at 12:58:11PM +0200, Bernd Schmidt wrote:
>> Hmm? The problem is that you can't generally emit a save/restore
>> independent of placement, because you may not know which offset to use
>> from whichever base register. But these offsets aren't necessarily
>> constant throughout the function. Segher explained that the algorithm
>> deals with this by giving up in many cases, which of course limits the
>> usefulness. It probably makes it unusable entirely on targets that want
>> to use pushes for function args.
>
> It's not the generic algorithm that gives up; it's the target hook.
> Specifically, at least for the PowerPC one I wrote, the
> TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS hook gives up on register
> components that need an offset from the base reg (stack or frame pointer)
> that cannot be used in a single instruction (i.e. won't fit in 16 bits).
>
> The generic code only does
>
> +  /* We don't handle "strange" functions.  */
> +  if (cfun->calls_alloca
> +      || cfun->calls_setjmp
> +      || cfun->can_throw_non_call_exceptions
> +      || crtl->calls_eh_return
> +      || crtl->has_nonlocal_goto
> +      || crtl->saves_all_registers)
> +    return;
>
> so that does not give up in "many" cases.
Doesn't seem like a lot to me either.


>
> Targets that push function args can handle things fine as far as I see?
> Targets that normally use push insns in the prologue will just have to
> not do that for the components that are separately wrapped.  Or they can
> still use pushes to reserve that space, if that works better.
To me it's more about the fact that the offset to the slot where the 
register should be saved varies (unless you have a frame pointer) and I 
don't think there's enough information in any of the hook arguments to 
allow derivation of that offset.

But I don't consider that a flaw that should block this feature.  If 
someday we want to implement on such a target, we'll have to figure out 
how to compute that offset at an arbitrary location and pass along the 
needed information to the hooks.

jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-08-26 13:03 ` Bernd Schmidt
  2016-08-26 13:48   ` David Malcolm
  2016-08-26 14:50   ` Segher Boessenkool
@ 2016-09-09 19:36   ` Jeff Law
  2016-09-09 21:00     ` Segher Boessenkool
  2 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-09 19:36 UTC (permalink / raw)
  To: Bernd Schmidt, Segher Boessenkool, gcc-patches

On 08/26/2016 07:03 AM, Bernd Schmidt wrote:
> On 08/01/2016 03:42 AM, Segher Boessenkool wrote:
>> This is the second version.  Concern was renamed to component, and all
>> other comments were addressed (I hope).
>
> Not really, I'm afraid. There still seems to be no detailed explanation
> of the algorithm. Instead, there is a vague outline (which should be
> expanded to at least the level of detail found e.g. in tree-ssa-pre.c),
> and inside the code there are still meaningless comments of the form
I think Segher and I can work towards this. The placement algorithm is 
where I think the main focus will need to be.  Segher has some text for 
that, but we may need to supplement with a CFG or two to help clarify.

>
> /* Deselect those epilogue components that should not be inserted
>    for this edge.  */
>
> which don't tell me anything about what the intention is (why should
> they not be inserted?). The result is that as a reader, I still find
> myself unable to determine whether the algorithm is correct or not.
Deselection is more about pruning components that we initially thought 
were separate shrink wrap candidates, but upon deeper inspection we can 
not handle.  And it's all target driven.

Deselection has to run after placement computation because deselection 
is a property of the desired insertion site (which is why this hook 
passes in the desired edge where we want to insert code).  Deselection 
runs prior to actual generation & insertion of the sequences. 
Deselection affects the ability to separately shrink wrap that component 
globally.

So it does affect correctness, but it's pretty easy to see that its safe.



>
> Worse, I'm still not entirely certain what it is intended to achieve: I
> asked for a motivating example or two, but haven't seen one in the
> comments anywhere. My most general question would be why the algorithm
> for placing individual prologue components would be different from the
> regular shrink-wrapping algorithm for full prologues. Examples and/or
> arguments relating to how this new code acts with regard to loops are
> also particularly needed.
I think Segher and Michael covered this in the various follow-ups.  I'm 
now comfortable with the overall goals.

Right now we have a single copy of the prologue/epilogue.  That single 
copy might get shrink-wrapped, but at the end of the day, it's still a 
single copy that gets executed at most once.

If we relax the single copy and single execution traits, then we get 
more freedom to place the prologues/epilogues and less frequently 
executed points.

We get even more flexibility by handling components of the 
prologue/epilogue separately.

Some comments in these areas will be appropriate, but again, I'm 
comfortable with the overall direction this work has gone.


>
> So, I think improvement is necessary in these points, but I also think
> that there's room for experimental verification by way of self-tests. If
> done thoroughly (testing the transformation on several sufficiently
> random CFGs and maybe some manually constructed ones) it would go a long
> way towards showing that at least we don't have to worry too much about
> miscompilations. That's what I've been working on, and an initial patch
> is below. This is incomplete and posted more as an RFD since we're
> getting towards the end of the week: there are gaps in the test
> coverage, and it currently fails the selftest. I assume the latter is a
> problem with my code, but it wouldn't hurt if you could take a look;
> maybe I misunderstood something entirely about what the separate
> shrink-wrapping is supposed to achieve, or maybe I messed up the
> algorithm with my changes.
I think the lack of test coverage is something we'll want to address.

jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-09 18:25                 ` Jeff Law
@ 2016-09-09 20:29                   ` Segher Boessenkool
  0 siblings, 0 replies; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 20:29 UTC (permalink / raw)
  To: Jeff Law; +Cc: Michael Matz, Bernd Schmidt, gcc-patches

On Fri, Sep 09, 2016 at 12:19:03PM -0600, Jeff Law wrote:
> >>Does this impact the compile time computation complexity issue that was
> >>raised elsewhere?
> >
> >I'm not sure what you mean here either, sorry.  It is all O(NM) with N
> >the number of BBs and M the number of components (and assuming dom
> >lookups are constant time, as usual).
> You said in a different message that computing optimal placement points 
> for prologue/epilogue components was not computationally feasible.   I'm 
> just trying to figure out if the switch to utilizing block frequencies 
> is a part of what makes solving that problem infeasible.

Ah, I see.  Allowing multiple prologues (and epilogues) is what makes
it infeasible.  If there is just (at most) one prologue (per component),
calculating the optimal placement is of course linear in # BBs, given
that the cost function is O(1) (or sort of kind of; linear in # edges
really, if you have to insist :-) )

If you allow multiple prologues, i.e. allow any combination of blocks
to run with or without some component "active", you get an exponential
number of possible way to place things, and the cost for those combinations
is *not* an ordered function: if all predecessors of a block have some
component active, then this block itself does not need a prologue.

I get around that by making the cost function ordered, that is, possibly
overestimating the cost of nodes that are the dest of a cross-jump; in
the first version of the patch, by always using the execution frequency
of a node (so, not considering that a prologue there does not cost
anything for edges where the predecessor already has that component
active); and in the second version of the patch, that, but do subtract
the cost from backedges (which makes it clearer that loops are handled
correctly, because the loop header generally has lower cost than the
nodes in the loop body).

Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-09-09 18:31           ` Jeff Law
@ 2016-09-09 20:41             ` Segher Boessenkool
  2016-09-12 16:49               ` Jeff Law
  0 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 20:41 UTC (permalink / raw)
  To: Jeff Law; +Cc: Bernd Schmidt, gcc-patches

On Fri, Sep 09, 2016 at 12:28:12PM -0600, Jeff Law wrote:
> >The generic code only does
> >
> >+  /* We don't handle "strange" functions.  */
> >+  if (cfun->calls_alloca
> >+      || cfun->calls_setjmp
> >+      || cfun->can_throw_non_call_exceptions
> >+      || crtl->calls_eh_return
> >+      || crtl->has_nonlocal_goto
> >+      || crtl->saves_all_registers)
> >+    return;
> >
> >so that does not give up in "many" cases.
> Doesn't seem like a lot to me either.

A few of those could be handled, perhaps with some extra hooks, but it
didn't look useful to me so far.

> >Targets that push function args can handle things fine as far as I see?
> >Targets that normally use push insns in the prologue will just have to
> >not do that for the components that are separately wrapped.  Or they can
> >still use pushes to reserve that space, if that works better.
> To me it's more about the fact that the offset to the slot where the 
> register should be saved varies (unless you have a frame pointer) and I 
> don't think there's enough information in any of the hook arguments to 
> allow derivation of that offset.

I think knowing in front of what BB to insert the prologue (or after what
BB, the epilogue) is all info we need?

And if even that is not good enough for any target then that target can
elect not to do separate shrink-wrapping at all ;-)

> But I don't consider that a flaw that should block this feature.  If 
> someday we want to implement on such a target, we'll have to figure out 
> how to compute that offset at an arbitrary location and pass along the 
> needed information to the hooks.

We probably need to be able to calculate that offset at the edges of any
BB for other reasons, already.


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-09 19:36   ` Jeff Law
@ 2016-09-09 21:00     ` Segher Boessenkool
  2016-09-12 11:00       ` Bernd Schmidt
  2016-09-12 16:59       ` Jeff Law
  0 siblings, 2 replies; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 21:00 UTC (permalink / raw)
  To: Jeff Law; +Cc: Bernd Schmidt, gcc-patches

On Fri, Sep 09, 2016 at 12:57:32PM -0600, Jeff Law wrote:
> I think the lack of test coverage is something we'll want to address.

Building and running the compiler, the various target libraries, and the
testsuite is more than enough coverage for correctness in my opinion --
I cannot make up anything that is not already in there.  No doubt some
PR will show up later, and we can (and should) of course add that then.

For checking whether it makes "smart" decisions, I'll make some really
simple testcases that will not fail for random reasons all the time.

Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/9] regrename: Don't rename restores
  2016-09-08 17:54   ` Jeff Law
@ 2016-09-09 21:05     ` Segher Boessenkool
  2016-09-12 17:01       ` Jeff Law
  0 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 21:05 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, bschmidt

On Thu, Sep 08, 2016 at 11:51:53AM -0600, Jeff Law wrote:
> On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
> >A restore is supposed to restore some certain register.  Restoring it
> >into some other register will not work.  Don't.
> >
> >2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
> >
> >	* regrename.c (build_def_use): Invalidate chains that have a
> >	REG_CFA_RESTORE on some instruction.
> Again, how is this different from a normal epilogue that restores 
> registers which regrename seems to not muck with.

Good question.  Either way, it is always wrong to rename a register we
restore from stack.


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 6/9] sel-sched: Don't mess with register restores
  2016-09-08 17:55   ` Jeff Law
@ 2016-09-09 21:13     ` Segher Boessenkool
  2016-09-12 17:39       ` Jeff Law
  0 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 21:13 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, bschmidt

On Thu, Sep 08, 2016 at 11:54:33AM -0600, Jeff Law wrote:
> On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
> >If selective scheduling copies register restores it confuses dwarf2cfi.
> >
> >2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
> >
> >	* sel-sched-ir.c (init_global_and_expr_for_insn): Don't copy
> >	instructions with a REG_CFA_RESTORE note.
> Similarly, I think you're papering over a lifetime problem of some kind 
> here.

sel-sched-ir.c says

    /* Certain instructions cannot be cloned, and frame related insns and
       the insn adjacent to NOTE_INSN_EPILOGUE_BEG cannot be moved out of
       their block.  */
    if (prologue_epilogue_contains (insn))

...

and I'm just extending that to "epilogue" instructions not in the
"epilogue" ;-)

If all such epilogue instructions always had a REG_CFA_RESTORE note,
we could drop the "old" thing; but even instructions restoring a register
do not always have such a note (they can be batched up, and they can be
not emitted at all if not shrink-wrapping).


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 7/9] cprop: Leave RTX_FRAME_RELATED_P instructions alone
  2016-09-08 18:34   ` Jeff Law
@ 2016-09-09 21:21     ` Segher Boessenkool
  0 siblings, 0 replies; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 21:21 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, bschmidt

On Thu, Sep 08, 2016 at 11:55:04AM -0600, Jeff Law wrote:
> On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
> >Doing cprop on frame-related instructions blows up spectacularly.
> >So don't.
> >
> >2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
> >
> >	* regcprop.c (copyprop_hardreg_forward_1): Don't change
> >	RTX_FRAME_RELATED_P instructions.
> Example or testcase?

A bunch of testsuite things, maybe many even, I don't remember, it's been
a while.  Easy to find out though :-)  (It'll take at least a day).


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components
  2016-09-08 19:03   ` Jeff Law
@ 2016-09-09 22:03     ` Segher Boessenkool
  2016-09-12 18:21       ` Jeff Law
  0 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-09 22:03 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, bschmidt

On Thu, Sep 08, 2016 at 12:34:07PM -0600, Jeff Law wrote:
> On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
> >
> >Deciding what blocks should run with a certain component active so that
> >the total cost of executing the prologues (and epilogues) is optimal, is
> >not a computationally feasible problem.
> Really?  It's just a dataflow problem is it not and one that ought to 
> converge very quickly I'd think.  Or is it more a function of having to 
> run it on so many independent components?  I'm still pondering the value 
> of having every GPR be an independent component :-)

The cost function (as a function of which BBs runs with a certain component
enabled) is not monotonic: the cost for having a component active for a
subset of BBs can be higher, the same, or equal to the cost for all such
nodes (where "cost" means "how often do we execute this prologue").

> >Now all that is left is inserting prologues and epilogues on all edges
> >that jump into resp. out of the "active" set of blocks.  Often we need
> >to insert some components' prologues (or epilogues) on all edges into
> >(or out of) a block.  In theory cross-jumping can unify all such, but
> >in practice that often fails; besides, that is a lot of work.  So in
> >this case we insert the prologue and epilogue components at the "head"
> >or "tail" of a block, instead.
> Cross jumping is rather simplistic, so I'm not surprised that it doesn't 
> catch all this stuff.

I hoped it would, so I could have so much simpler code.  Sniff.

> >As a final optimisation, if a block needs a prologue and its immediate
> >dominator has the block as a post-dominator, the dominator gets the
> >prologue as well.
> So why not just put it in the idom and not in the dominated block? 

That's what it does :-)


> > void
> > thread_prologue_and_epilogue_insns (void)
> > {
> >+  if (optimize > 1)
> >+    {
> >+      df_live_add_problem ();
> >+      df_live_set_all_dirty ();
> >+    }
> Perhaps conditional on separate shrink wrapping?

Actually, I think we need to do this once more, one of the times always
(also when only using "regular" shrink-wrapping), because otherwise the
info in the dump file is out of whack.  Everything is okay once the next
pass starts, of course.  I'll have a look.

> >@@ -5932,6 +5936,13 @@ thread_prologue_and_epilogue_insns (void)
> >
> >   try_shrink_wrapping (&entry_edge, prologue_seq);
> >
> >+  df_analyze ();
> >+  try_shrink_wrapping_separate (entry_edge->dest);
> >+  if (crtl->shrink_wrapped_separate)
> >+    prologue_seq = make_prologue_seq ();
> Perhaps push the df_analyze call into try_shrink_wrapping_separate?

Yeah possibly.

> ANd if it was successful, do you need to update the DF information?  Is 
> that perhaps the cause of some of the issues we're seeing with DCE, 
> regrename, the scheduler?

See above.  The DF info is correct when the next pass starts (or ends,
I do not remember; the dump file does show correct information).

> Consider allowing the target to provide a mapping from the component to 
> a symbolic name of some kind and using that in the dumps.

Yeah that will help, once we do more than just GPRs anyway -- not everyone
remembers the GCC register #s for all registers ;-)

> Related, 
> consider using an enum rather than magic constants in the target bits (I 
> noted seeing component #0 as a magic constant in the ppc implementation).

But 0 is the hard register used there :-)

And since we now are C++, I cannot use enums as integers.  Sigh.

The meaning of "0" should of course be documented.

> So it seems like there's a toplevel list of components that's owned by 
> the target and state at each block components that are owned by the 
> generic code.  That's fine.  Just make sure we doc that the toplevel 
> list of components is allocated by the backend (and where does it get 
> freed?)

Every sbitmap is owned by the separate shrink-wrapping pass, not by the
target.  As the documentation says:

+@deftypefn {Target Hook} void TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS (sbitma
+Mark the components in the parameter as handled, so that the
+@code{prologue} and @code{epilogue} named patterns know to ignore those
+components.  The target code should not hang on to the @code{sbitmap}, it
+will be deleted after this call.
+@end deftypefn

> Consider using auto_sbitmap rather than manually managing 
> allocation/releasing of the per-block structures.  In fact, can't all of 
> SW become a class and we lose the explicit init/fini routines in favor 
> of a ctor/dtor?

Yes, you can always add indirection.  I do not think the code becomes
more readable that way (quite the opposite).  Explicit is *good*.

The init/fini code is small, and that is not an accident.

> >+      /* Find which prologue resp. epilogue components are needed for all
> >+	 predecessor edges to this block.  */
> Nit.  Avoid ".resp".  I actually had to look that abbreviation up :-) 
> Being a bit more verbose in the comments would be preferred over ".resp".

"resp." is very common in mathematics papers.  It means either "and" or
"or", and it magically means either or both *and* exactly the thing you
mean there.

But I'll try to use it less often, sure.

> Having looked at this in more detail now, I'm wondering if we want to 
> talk at all in the docs about when selection vs disqualifying happens. 
> ISTM that we build the set of components we want to insert for each 
> edge/block.  When that's complete, we then prune those results with the 
> disqualifying sets.

We decide which blocks should have which components active.  From that,
it is easy to derive on which edges we need a prologue or an epilogue.
Now if the target says putting some prologue or epilogue on some certain
edge will not work, we just don't do that component at all.  It doesn't
happen very often.  In theory we could be smarter.

> For the PPC R0 vs LR is the only thing that causes disqualification 
> right?

Currently, yes.

> Can't that be handled when we build the set of components we 
> want to insert for each edge/block?  Is there some advantage to handling 
> disqualifications after all the potential insertion points have been 
> handled?

We do not know if an edge needs a prologue, epilogue, or neither, until
we have decided whether *both* ends of that edge want the component active
or not.


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-09 21:00     ` Segher Boessenkool
@ 2016-09-12 11:00       ` Bernd Schmidt
  2016-09-12 16:59       ` Jeff Law
  1 sibling, 0 replies; 74+ messages in thread
From: Bernd Schmidt @ 2016-09-12 11:00 UTC (permalink / raw)
  To: Segher Boessenkool, Jeff Law; +Cc: gcc-patches

On 09/09/2016 10:56 PM, Segher Boessenkool wrote:
> On Fri, Sep 09, 2016 at 12:57:32PM -0600, Jeff Law wrote:
>> I think the lack of test coverage is something we'll want to address.
>
> Building and running the compiler, the various target libraries, and the
> testsuite is more than enough coverage for correctness in my opinion --

I'd argue against that:
  - a lot of compiler bugs for passes like this show up only for unusual
    CFGs that you don't really get with normal input code.
  - separate shrink wrapping explicitly tries to modify infrequent
    paths, which may not even get exercised.

Also, unless I've missed it you've not provided any numbers about how 
often this triggers, let's say on a cc1 build.

For these reasons I thought having self-tests would be a good idea, but 
I was informed it's a waste of time, so I guess not.


Bernd

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc
  2016-09-09 20:41             ` Segher Boessenkool
@ 2016-09-12 16:49               ` Jeff Law
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff Law @ 2016-09-12 16:49 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Bernd Schmidt, gcc-patches

On 09/09/2016 02:33 PM, Segher Boessenkool wrote:
> On Fri, Sep 09, 2016 at 12:28:12PM -0600, Jeff Law wrote:
>>> The generic code only does
>>>
>>> +  /* We don't handle "strange" functions.  */
>>> +  if (cfun->calls_alloca
>>> +      || cfun->calls_setjmp
>>> +      || cfun->can_throw_non_call_exceptions
>>> +      || crtl->calls_eh_return
>>> +      || crtl->has_nonlocal_goto
>>> +      || crtl->saves_all_registers)
>>> +    return;
>>>
>>> so that does not give up in "many" cases.
>> Doesn't seem like a lot to me either.
>
> A few of those could be handled, perhaps with some extra hooks, but it
> didn't look useful to me so far.
Agreed.  I don't think this is worth spending time on.


>
>>> Targets that push function args can handle things fine as far as I see?
>>> Targets that normally use push insns in the prologue will just have to
>>> not do that for the components that are separately wrapped.  Or they can
>>> still use pushes to reserve that space, if that works better.
>> To me it's more about the fact that the offset to the slot where the
>> register should be saved varies (unless you have a frame pointer) and I
>> don't think there's enough information in any of the hook arguments to
>> allow derivation of that offset.
>
> I think knowing in front of what BB to insert the prologue (or after what
> BB, the epilogue) is all info we need?
Maybe.  I'd be worried about things like deferred pops and 
combine-stack-adjustments.  The former are probably OK as I suspect we 
cleaned things up at basic block boundaries.  The latter I've never 
really looked at.

>
> And if even that is not good enough for any target then that target can
> elect not to do separate shrink-wrapping at all ;-)
Exactly.

Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-09 21:00     ` Segher Boessenkool
  2016-09-12 11:00       ` Bernd Schmidt
@ 2016-09-12 16:59       ` Jeff Law
  2016-09-14 13:22         ` Segher Boessenkool
  1 sibling, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-12 16:59 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Bernd Schmidt, gcc-patches

On 09/09/2016 02:56 PM, Segher Boessenkool wrote:
> On Fri, Sep 09, 2016 at 12:57:32PM -0600, Jeff Law wrote:
>> I think the lack of test coverage is something we'll want to address.
>
> Building and running the compiler, the various target libraries, and the
> testsuite is more than enough coverage for correctness in my opinion --
> I cannot make up anything that is not already in there.  No doubt some
> PR will show up later, and we can (and should) of course add that then.
>
> For checking whether it makes "smart" decisions, I'll make some really
> simple testcases that will not fail for random reasons all the time.
I'm moving more and more towards wanting tests for new features. 
There's no reason why verification of basic behavior shouldn't be part 
of the feature patch itself.

I suspect the biggest problem with testing something like separate 
shrink wrapping is making sure the test is stable over time -- and 
that's a legitimate concern.  But it ought to be do-able in one form or 
another, particularly since we have control over the debug dumps. 
Alternately building something on top of Bernd's work is an option too.

Building the compiler & libraries, regression testing, etc are all 
important as well.

jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/9] regrename: Don't rename restores
  2016-09-09 21:05     ` Segher Boessenkool
@ 2016-09-12 17:01       ` Jeff Law
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff Law @ 2016-09-12 17:01 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, bschmidt

On 09/09/2016 02:59 PM, Segher Boessenkool wrote:
> On Thu, Sep 08, 2016 at 11:51:53AM -0600, Jeff Law wrote:
>> On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
>>> A restore is supposed to restore some certain register.  Restoring it
>>> into some other register will not work.  Don't.
>>>
>>> 2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
>>>
>>> 	* regrename.c (build_def_use): Invalidate chains that have a
>>> 	REG_CFA_RESTORE on some instruction.
>> Again, how is this different from a normal epilogue that restores
>> registers which regrename seems to not muck with.
>
> Good question.  Either way, it is always wrong to rename a register we
> restore from stack.
Agreed.  Somehow register renaming does the right thing for a normal 
epilogue.  I don't see anything in regrename that obviously treats the 
epilogue specially, there's check_new_reg_p, but I don't see that it 
inherently handles this case.

regrename seems to use the DF infrastructure, so I wouldn't be surprised 
if this is a symptom of incorrect DF information for the epilogues.  One 
way to potentially find out would be to tweak the DF code to mark all 
the callee saved regs as live at the return insns and see how that 
affects regrename's decision making.

jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 6/9] sel-sched: Don't mess with register restores
  2016-09-09 21:13     ` Segher Boessenkool
@ 2016-09-12 17:39       ` Jeff Law
  2016-09-14 13:28         ` Segher Boessenkool
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-12 17:39 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, bschmidt

On 09/09/2016 03:13 PM, Segher Boessenkool wrote:
> On Thu, Sep 08, 2016 at 11:54:33AM -0600, Jeff Law wrote:
>> On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
>>> If selective scheduling copies register restores it confuses dwarf2cfi.
>>>
>>> 2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
>>>
>>> 	* sel-sched-ir.c (init_global_and_expr_for_insn): Don't copy
>>> 	instructions with a REG_CFA_RESTORE note.
>> Similarly, I think you're papering over a lifetime problem of some kind
>> here.
>
> sel-sched-ir.c says
>
>     /* Certain instructions cannot be cloned, and frame related insns and
>        the insn adjacent to NOTE_INSN_EPILOGUE_BEG cannot be moved out of
>        their block.  */
>     if (prologue_epilogue_contains (insn))
>
> ...
>
> and I'm just extending that to "epilogue" instructions not in the
> "epilogue" ;-)
>
> If all such epilogue instructions always had a REG_CFA_RESTORE note,
> we could drop the "old" thing; but even instructions restoring a register
> do not always have such a note (they can be batched up, and they can be
> not emitted at all if not shrink-wrapping).
Can you fix this by registering the separate prologue/epilogue insns in 
prologue_insn_hash and epilogue_insn_hash or does that have unintended 
consequences?


Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components
  2016-09-09 22:03     ` Segher Boessenkool
@ 2016-09-12 18:21       ` Jeff Law
  2016-09-14 13:45         ` Segher Boessenkool
  0 siblings, 1 reply; 74+ messages in thread
From: Jeff Law @ 2016-09-12 18:21 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, bschmidt

On 09/09/2016 03:57 PM, Segher Boessenkool wrote:
> On Thu, Sep 08, 2016 at 12:34:07PM -0600, Jeff Law wrote:
>> On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
>>>
>>> Deciding what blocks should run with a certain component active so that
>>> the total cost of executing the prologues (and epilogues) is optimal, is
>>> not a computationally feasible problem.
>> Really?  It's just a dataflow problem is it not and one that ought to
>> converge very quickly I'd think.  Or is it more a function of having to
>> run it on so many independent components?  I'm still pondering the value
>> of having every GPR be an independent component :-)
>
> The cost function (as a function of which BBs runs with a certain component
> enabled) is not monotonic: the cost for having a component active for a
> subset of BBs can be higher, the same, or equal to the cost for all such
> nodes (where "cost" means "how often do we execute this prologue").
Understood.   You covered this reasonably well in another reply.  Thanks.

>> Cross jumping is rather simplistic, so I'm not surprised that it doesn't
>> catch all this stuff.
>
> I hoped it would, so I could have so much simpler code.  Sniff.
In general, I think our ability to identify and de-duplicate code is 
poor at best, especially at the RTL level.  It's never been a major area 
of focus.   So, yea.  Sniff.


>
>>> As a final optimisation, if a block needs a prologue and its immediate
>>> dominator has the block as a post-dominator, the dominator gets the
>>> prologue as well.
>> So why not just put it in the idom and not in the dominated block?
>
> That's what it does :-)
Then I must have mis-parsed.  Thanks for clarifying.

>
>
>>> void
>>> thread_prologue_and_epilogue_insns (void)
>>> {
>>> +  if (optimize > 1)
>>> +    {
>>> +      df_live_add_problem ();
>>> +      df_live_set_all_dirty ();
>>> +    }
>> Perhaps conditional on separate shrink wrapping?
>
> Actually, I think we need to do this once more, one of the times always
> (also when only using "regular" shrink-wrapping), because otherwise the
> info in the dump file is out of whack.  Everything is okay once the next
> pass starts, of course.  I'll have a look.
OK.  It struck me as a bit odd, so just a verify on your side that it 
was intended is fine with me.


>
>>> @@ -5932,6 +5936,13 @@ thread_prologue_and_epilogue_insns (void)
>>>
>>>   try_shrink_wrapping (&entry_edge, prologue_seq);
>>>
>>> +  df_analyze ();
>>> +  try_shrink_wrapping_separate (entry_edge->dest);
>>> +  if (crtl->shrink_wrapped_separate)
>>> +    prologue_seq = make_prologue_seq ();
>> Perhaps push the df_analyze call into try_shrink_wrapping_separate?
>
> Yeah possibly.
Your call.


>
>> ANd if it was successful, do you need to update the DF information?  Is
>> that perhaps the cause of some of the issues we're seeing with DCE,
>> regrename, the scheduler?
>
> See above.  The DF info is correct when the next pass starts (or ends,
> I do not remember; the dump file does show correct information).
Hmm, then explain again why DCE is mucking up?  I don't immediately see 
how EPILOGUE_BEG notes come into play with DCE.  It seems to rely on the 
DF data and AFAICT DF only cares about the EPILOGUE_BEG note in 
can_move_insns_across which shouldn't be used by DCE.




>
>> Related,
>> consider using an enum rather than magic constants in the target bits (I
>> noted seeing component #0 as a magic constant in the ppc implementation).
>
> But 0 is the hard register used there :-)
Oh, missed that.  Nevermind :-)

>> So it seems like there's a toplevel list of components that's owned by
>> the target and state at each block components that are owned by the
>> generic code.  That's fine.  Just make sure we doc that the toplevel
>> list of components is allocated by the backend (and where does it get
>> freed?)
>
> Every sbitmap is owned by the separate shrink-wrapping pass, not by the
> target.  As the documentation says:
I'm specifically referring to:

+@deftypefn {Target Hook} sbitmap 
TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS (void)

+@deftypefn {Target Hook} sbitmap TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB 
(basic_block)

Which in the rs6000 implementation allocate and return an sbitmap.  I 
don't see anywhere we could reasonably free them in the target with the 
existing hooks, so it seems like they have to be free'd by the generic 
code. So ownership of those sbitmaps isn't particularly clear.


>
>> Consider using auto_sbitmap rather than manually managing
>> allocation/releasing of the per-block structures.  In fact, can't all of
>> SW become a class and we lose the explicit init/fini routines in favor
>> of a ctor/dtor?
>
> Yes, you can always add indirection.  I do not think the code becomes
> more readable that way (quite the opposite).  Explicit is *good*.
The GCC project is moving away from this kind of explicit 
allocation/deallocation and more towards a RAII.  Unless there is a 
clear need for the explicit allocation/deallocation, please put this 
stuff into a class with an appropriate ctor/dtor.

FWIW, I was a big opponent of how much stuff happens "behind your back" 
with some languages (including C++).  But over the last few years my 
personal stance has softened considerably after seeing how cleanly RAII 
solves certain problems.


>
>> Having looked at this in more detail now, I'm wondering if we want to
>> talk at all in the docs about when selection vs disqualifying happens.
>> ISTM that we build the set of components we want to insert for each
>> edge/block.  When that's complete, we then prune those results with the
>> disqualifying sets.
>
> We decide which blocks should have which components active.  From that,
> it is easy to derive on which edges we need a prologue or an epilogue.
> Now if the target says putting some prologue or epilogue on some certain
> edge will not work, we just don't do that component at all.  It doesn't
> happen very often.  In theory we could be smarter.
Right.  I was obviously over-simplifying, the key point I was verifying 
was that they're separate steps...  Which led to the question about when 
to disqualify for the R0 vs LR issue.


>
>> For the PPC R0 vs LR is the only thing that causes disqualification
>> right?
>
> Currently, yes.
>
>> Can't that be handled when we build the set of components we
>> want to insert for each edge/block?  Is there some advantage to handling
>> disqualifications after all the potential insertion points have been
>> handled?
>
> We do not know if an edge needs a prologue, epilogue, or neither, until
> we have decided whether *both* ends of that edge want the component active
> or not.
Right.  Hmm, maybe I'm not asking the question clearly.

Whether or not an edge needs a prologue or epilogue is a function not 
just of the state at the head or tail of the edge, but instead is a 
function of global dataflow propagation?  Thus we can't disqualify until 
after we've done the dataflow propagation?  Right?


Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Separate shrink-wrapping
  2016-09-12 16:59       ` Jeff Law
@ 2016-09-14 13:22         ` Segher Boessenkool
  0 siblings, 0 replies; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-14 13:22 UTC (permalink / raw)
  To: Jeff Law; +Cc: Bernd Schmidt, gcc-patches

On Mon, Sep 12, 2016 at 10:49:46AM -0600, Jeff Law wrote:
> On 09/09/2016 02:56 PM, Segher Boessenkool wrote:
> >On Fri, Sep 09, 2016 at 12:57:32PM -0600, Jeff Law wrote:
> >>I think the lack of test coverage is something we'll want to address.
> >
> >Building and running the compiler, the various target libraries, and the
> >testsuite is more than enough coverage for correctness in my opinion --
> >I cannot make up anything that is not already in there.  No doubt some
> >PR will show up later, and we can (and should) of course add that then.
> >
> >For checking whether it makes "smart" decisions, I'll make some really
> >simple testcases that will not fail for random reasons all the time.
> I'm moving more and more towards wanting tests for new features. 
> There's no reason why verification of basic behavior shouldn't be part 
> of the feature patch itself.

Yes, but as I said, this is used millions of times during bootstrap
already, and at least *I* cannot think of more complex structures that
would fail.  If any show up we can add them as regression tests.

> I suspect the biggest problem with testing something like separate 
> shrink wrapping is making sure the test is stable over time -- and 
> that's a legitimate concern.

Yes exactly.

> But it ought to be do-able in one form or another,

Yes, as I promised, I'll make up some trivial tests.

> particularly since we have control over the debug dumps. 

I'm not a big fan of using the debug dumps in the testsuite -- very
often tests break because you changed something in the dump.  Debug
dumps are for debug, not for testing.

If we get to the point where we generically can use some piece of RTL
as input and just run one pass on it, that would be nice.  But we're
not there yet, and the kind of "unit tests" where you just write all
code twice and essentially only test if the test is correct: no thanks.

Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 6/9] sel-sched: Don't mess with register restores
  2016-09-12 17:39       ` Jeff Law
@ 2016-09-14 13:28         ` Segher Boessenkool
  0 siblings, 0 replies; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-14 13:28 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, bschmidt

On Mon, Sep 12, 2016 at 11:24:01AM -0600, Jeff Law wrote:
> >sel-sched-ir.c says
> >
> >    /* Certain instructions cannot be cloned, and frame related insns and
> >       the insn adjacent to NOTE_INSN_EPILOGUE_BEG cannot be moved out of
> >       their block.  */
> >    if (prologue_epilogue_contains (insn))
> >
> >...
> >
> >and I'm just extending that to "epilogue" instructions not in the
> >"epilogue" ;-)
> >
> >If all such epilogue instructions always had a REG_CFA_RESTORE note,
> >we could drop the "old" thing; but even instructions restoring a register
> >do not always have such a note (they can be batched up, and they can be
> >not emitted at all if not shrink-wrapping).
> Can you fix this by registering the separate prologue/epilogue insns in 
> prologue_insn_hash and epilogue_insn_hash or does that have unintended 
> consequences?

An interesting plan, I'll try.


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components
  2016-09-12 18:21       ` Jeff Law
@ 2016-09-14 13:45         ` Segher Boessenkool
  2016-09-15 16:47           ` Jeff Law
  0 siblings, 1 reply; 74+ messages in thread
From: Segher Boessenkool @ 2016-09-14 13:45 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches, bschmidt

On Mon, Sep 12, 2016 at 12:02:50PM -0600, Jeff Law wrote:
> >>>As a final optimisation, if a block needs a prologue and its immediate
> >>>dominator has the block as a post-dominator, the dominator gets the
> >>>prologue as well.
> >>So why not just put it in the idom and not in the dominated block?
> >
> >That's what it does :-)
> Then I must have mis-parsed.  Thanks for clarifying.

"As a final optimisation, if a block needs a prologue and its immediate
dominator has the block as a post-dominator, ***that immediate dominator***
gets the prologue as well."

That is clearer I hope :-)

> Hmm, then explain again why DCE is mucking up?  I don't immediately see 
> how EPILOGUE_BEG notes come into play with DCE.  It seems to rely on the 
> DF data and AFAICT DF only cares about the EPILOGUE_BEG note in 
> can_move_insns_across which shouldn't be used by DCE.

The register restore *is* dead code, but we need to have the same CFI
for all convergent paths.

> >>Consider using auto_sbitmap rather than manually managing
> >>allocation/releasing of the per-block structures.  In fact, can't all of
> >>SW become a class and we lose the explicit init/fini routines in favor
> >>of a ctor/dtor?
> >
> >Yes, you can always add indirection.  I do not think the code becomes
> >more readable that way (quite the opposite).  Explicit is *good*.
> The GCC project is moving away from this kind of explicit 
> allocation/deallocation and more towards a RAII.  Unless there is a 
> clear need for the explicit allocation/deallocation, please put this 
> stuff into a class with an appropriate ctor/dtor.
> 
> FWIW, I was a big opponent of how much stuff happens "behind your back" 
> with some languages (including C++).  But over the last few years my 
> personal stance has softened considerably after seeing how cleanly RAII 
> solves certain problems.

We then still cannot get rid of SW, which is a convenience macro to do
a nasty cast on bb->aux.  If bb->aux was some pretty class hierarchy,
easy to use and all that, I would of course agree with your suggestion.
But as it is it is just a bare pointer, so the less we hide the safer
it is.

> >>For the PPC R0 vs LR is the only thing that causes disqualification
> >>right?
> >
> >Currently, yes.
> >
> >>Can't that be handled when we build the set of components we
> >>want to insert for each edge/block?  Is there some advantage to handling
> >>disqualifications after all the potential insertion points have been
> >>handled?
> >
> >We do not know if an edge needs a prologue, epilogue, or neither, until
> >we have decided whether *both* ends of that edge want the component active
> >or not.
> Right.  Hmm, maybe I'm not asking the question clearly.
> 
> Whether or not an edge needs a prologue or epilogue is a function not 
> just of the state at the head or tail of the edge, but instead is a 
> function of global dataflow propagation?  Thus we can't disqualify until 
> after we've done the dataflow propagation?  Right?

We can figure out before we decide what blocks need what components, what
edges can not get a prologue or epilogue for which components.  This
complicates the selection algorithm a whole lot, for not much gain that
I have seen so far, so I just give up in the cases that end up "bad".

It is not easy at all to see what edges will need to get a *logue,
because not always both blocks that edge connects are in the same
dominator subtree (or tree even, for an epilogue-aware placement
algorithm, but this patch doesn't do that yet; it's a more minor
optimisation, only reduces code size a little).


Segher

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components
  2016-09-14 13:45         ` Segher Boessenkool
@ 2016-09-15 16:47           ` Jeff Law
  0 siblings, 0 replies; 74+ messages in thread
From: Jeff Law @ 2016-09-15 16:47 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: gcc-patches, bschmidt

On 09/14/2016 07:38 AM, Segher Boessenkool wrote:
> On Mon, Sep 12, 2016 at 12:02:50PM -0600, Jeff Law wrote:
>>>>> As a final optimisation, if a block needs a prologue and its immediate
>>>>> dominator has the block as a post-dominator, the dominator gets the
>>>>> prologue as well.
>>>> So why not just put it in the idom and not in the dominated block?
>>>
>>> That's what it does :-)
>> Then I must have mis-parsed.  Thanks for clarifying.
>
> "As a final optimisation, if a block needs a prologue and its immediate
> dominator has the block as a post-dominator, ***that immediate dominator***
> gets the prologue as well."
>
> That is clearer I hope :-)
It is :-)

>
>> Hmm, then explain again why DCE is mucking up?  I don't immediately see
>> how EPILOGUE_BEG notes come into play with DCE.  It seems to rely on the
>> DF data and AFAICT DF only cares about the EPILOGUE_BEG note in
>> can_move_insns_across which shouldn't be used by DCE.
>
> The register restore *is* dead code, but we need to have the same CFI
> for all convergent paths.
OK.   I think I was conflating multiple issues.  So we need to keep the 
restore alive so that we have the same CFI across those paths, even 
though it appears dead on one or more paths.

I think this points us back to what you were experimenting with to 
address the regrename problems -- specifically creating "uses" at those 
key points.  That solves the DCE problem as well as one of the regrename 
problems, right?

>>
>> Whether or not an edge needs a prologue or epilogue is a function not
>> just of the state at the head or tail of the edge, but instead is a
>> function of global dataflow propagation?  Thus we can't disqualify until
>> after we've done the dataflow propagation?  Right?
>
> We can figure out before we decide what blocks need what components, what
> edges can not get a prologue or epilogue for which components.  This
> complicates the selection algorithm a whole lot, for not much gain that
> I have seen so far, so I just give up in the cases that end up "bad".
OK.  I'll drop it :-)  It was more a mental exercise in understanding 
then something I think needed to change.

Jeff

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 3/9] dce: Don't dead-code delete separately wrapped restores
  2016-06-08  1:48 [PATCH 0/9] separate shrink-wrapping Segher Boessenkool
@ 2016-06-08  1:48 ` Segher Boessenkool
  0 siblings, 0 replies; 74+ messages in thread
From: Segher Boessenkool @ 2016-06-08  1:48 UTC (permalink / raw)
  To: gcc-patches; +Cc: dje.gcc, Segher Boessenkool

Deleting restores (before a noreturn) that are dead confuses dwarf2cfi.


2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>

	* dce.c (delete_unmarked_insns): Don't delete instructions with
	a REG_CFA_RESTORE note.

---
 gcc/dce.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/gcc/dce.c b/gcc/dce.c
index ea3fb00..d510287 100644
--- a/gcc/dce.c
+++ b/gcc/dce.c
@@ -587,6 +587,15 @@ delete_unmarked_insns (void)
 	  if (!dbg_cnt (dce))
 	    continue;
 
+	  if (crtl->shrink_wrapped_separate
+	      && find_reg_note (insn, REG_CFA_RESTORE, NULL))
+	    {
+	      if (dump_file)
+		fprintf (dump_file, "DCE: NOT deleting insn %d, it's a "
+				    "callee-save restore\n", INSN_UID (insn));
+	      continue;
+	    }
+
 	  if (dump_file)
 	    fprintf (dump_file, "DCE: Deleting insn %d\n", INSN_UID (insn));
 
-- 
1.9.3

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2016-09-15 16:43 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-01  1:43 [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
2016-08-01  1:43 ` [PATCH 3/9] dce: Don't dead-code delete separately wrapped restores Segher Boessenkool
2016-09-08 17:52   ` Jeff Law
2016-09-09 15:59     ` Segher Boessenkool
2016-09-09 16:39       ` Jeff Law
2016-08-01  1:43 ` [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc Segher Boessenkool
2016-08-29  9:31   ` Bernd Schmidt
2016-08-29 14:30     ` Segher Boessenkool
2016-09-08 17:37     ` Jeff Law
2016-09-09 11:03       ` Bernd Schmidt
2016-09-09 15:13         ` Segher Boessenkool
2016-09-09 18:31           ` Jeff Law
2016-09-09 20:41             ` Segher Boessenkool
2016-09-12 16:49               ` Jeff Law
2016-09-09 15:51         ` Jeff Law
2016-09-09 15:40       ` Segher Boessenkool
2016-09-09 16:58         ` Jeff Law
2016-09-08 17:48   ` Jeff Law
2016-09-09 15:44     ` Segher Boessenkool
2016-08-01  1:43 ` [PATCH 2/9] cfgcleanup: Don't confuse CFI when -fshrink-wrap-separate Segher Boessenkool
2016-09-08 17:51   ` Jeff Law
2016-08-01  1:57 ` [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components Segher Boessenkool
2016-09-08 19:03   ` Jeff Law
2016-09-09 22:03     ` Segher Boessenkool
2016-09-12 18:21       ` Jeff Law
2016-09-14 13:45         ` Segher Boessenkool
2016-09-15 16:47           ` Jeff Law
2016-08-01  1:57 ` [PATCH 9/9] rs6000: Separate shrink-wrapping Segher Boessenkool
2016-08-01  1:57 ` [PATCH 4/9] regrename: Don't rename restores Segher Boessenkool
2016-09-08 17:54   ` Jeff Law
2016-09-09 21:05     ` Segher Boessenkool
2016-09-12 17:01       ` Jeff Law
2016-08-01  1:57 ` [PATCH 7/9] cprop: Leave RTX_FRAME_RELATED_P instructions alone Segher Boessenkool
2016-09-08 18:34   ` Jeff Law
2016-09-09 21:21     ` Segher Boessenkool
2016-08-01  2:12 ` [PATCH 6/9] sel-sched: Don't mess with register restores Segher Boessenkool
2016-08-04  7:33   ` Andrey Belevantsev
2016-09-08 17:55   ` Jeff Law
2016-09-09 21:13     ` Segher Boessenkool
2016-09-12 17:39       ` Jeff Law
2016-09-14 13:28         ` Segher Boessenkool
2016-08-01  2:12 ` [PATCH 5/9] regrename: Don't run if function was separately shrink-wrapped Segher Boessenkool
2016-09-08 17:54   ` Jeff Law
2016-08-04  0:05 ` [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
2016-08-24 16:04   ` Segher Boessenkool
2016-08-26 13:03 ` Bernd Schmidt
2016-08-26 13:48   ` David Malcolm
2016-08-26 13:55     ` Bernd Schmidt
2016-08-26 14:50   ` Segher Boessenkool
2016-08-26 15:03     ` Bernd Schmidt
2016-08-26 16:27       ` Segher Boessenkool
2016-09-08 16:58         ` Jeff Law
2016-09-09 15:26           ` Segher Boessenkool
2016-09-09 16:26             ` Jeff Law
2016-09-09 16:51               ` Segher Boessenkool
2016-09-09 17:22                 ` Jeff Law
2016-08-30 12:31       ` Michael Matz
2016-09-08 16:41         ` Jeff Law
2016-09-09  6:31           ` Segher Boessenkool
2016-09-09 15:28             ` Jeff Law
2016-09-09 15:43               ` Segher Boessenkool
2016-09-09 18:25                 ` Jeff Law
2016-09-09 20:29                   ` Segher Boessenkool
2016-09-08 17:20       ` Jeff Law
2016-09-09 15:33         ` Segher Boessenkool
2016-09-09 16:49           ` Jeff Law
2016-09-09 17:00             ` Segher Boessenkool
2016-09-09 17:44               ` Jeff Law
2016-09-09 19:36   ` Jeff Law
2016-09-09 21:00     ` Segher Boessenkool
2016-09-12 11:00       ` Bernd Schmidt
2016-09-12 16:59       ` Jeff Law
2016-09-14 13:22         ` Segher Boessenkool
  -- strict thread matches above, loose matches on Subject: below --
2016-06-08  1:48 [PATCH 0/9] separate shrink-wrapping Segher Boessenkool
2016-06-08  1:48 ` [PATCH 3/9] dce: Don't dead-code delete separately wrapped restores Segher Boessenkool

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).