Re: [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Jeff Law <law@redhat.com>
To: Segher Boessenkool <segher@kernel.crashing.org>, gcc-patches@gcc.gnu.org
Cc: bschmidt@redhat.com
Subject: Re: [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components
Date: Thu, 08 Sep 2016 19:03:00 -0000	[thread overview]
Message-ID: <75e00978-7f5d-f821-85de-5bfc2d2800b0@redhat.com> (raw)
In-Reply-To: <e21c6260ba313fd4905f00fa0da29f92af14b13e.1470015604.git.segher@kernel.crashing.org>

On 07/31/2016 07:42 PM, Segher Boessenkool wrote:
>
> Deciding what blocks should run with a certain component active so that
> the total cost of executing the prologues (and epilogues) is optimal, is
> not a computationally feasible problem.
Really?  It's just a dataflow problem is it not and one that ought to 
converge very quickly I'd think.  Or is it more a function of having to 
run it on so many independent components?  I'm still pondering the value 
of having every GPR be an independent component :-)

ISTM this adds a fair amount of complexity to the implementation in that 
prologues and epilogues for a particular component can run more than 
once.  Can you give a concrete example where this happens so that we can 
all understand it better?

If we keep this aspect of the implementation it seems like a note in the 
developer section would be in order.  It's certainly non-intuitive.

I only glanced over the code that seems related to this aspect of the 
implementation.  If we decide to go forward, I'd like to look at it 
again more closely.

>
> Now all that is left is inserting prologues and epilogues on all edges
> that jump into resp. out of the "active" set of blocks.  Often we need
> to insert some components' prologues (or epilogues) on all edges into
> (or out of) a block.  In theory cross-jumping can unify all such, but
> in practice that often fails; besides, that is a lot of work.  So in
> this case we insert the prologue and epilogue components at the "head"
> or "tail" of a block, instead.
Cross jumping is rather simplistic, so I'm not surprised that it doesn't 
catch all this stuff.

>
> As a final optimisation, if a block needs a prologue and its immediate
> dominator has the block as a post-dominator, the dominator gets the
> prologue as well.
So why not just put it in the idom and not in the dominated block? 
Doesn't this just end up running that component's prologue twice?

>
> 2016-06-07  Segher Boessenkool  <segher@kernel.crashing.org>
>
> 	* function.c (thread_prologue_and_epilogue_insns): Recompute the
> 	live info.  Call try_shrink_wrapping_separate.  Compute the
> 	prologue_seq afterwards, if it has possibly changed.  Compute the
> 	split_prologue_seq and epilogue_seq later, too.
> 	* shrink-wrap.c: #include cfgbuild.h.
> 	(dump_components): New function.
> 	(struct sw): New struct.
> 	(SW): New function.
> 	(init_separate_shrink_wrap): New function.
> 	(fini_separate_shrink_wrap): New function.
> 	(place_prologue_for_one_component): New function.
> 	(spread_components): New function.
> 	(disqualify_problematic_components): New function.
> 	(emit_common_heads_for_components): New function.
> 	(emit_common_tails_for_components): New function.
> 	(insert_prologue_epilogue_for_components): New function.
> 	(try_shrink_wrapping_separate): New function.
> 	* shrink-wrap.h: Declare try_shrink_wrapping_separate.
> ---
>  gcc/function.c    |  15 +-
>  gcc/shrink-wrap.c | 715 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  gcc/shrink-wrap.h |   1 +
>  3 files changed, 729 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/function.c b/gcc/function.c
> index bba0705..390e9a6 100644
> --- a/gcc/function.c
> +++ b/gcc/function.c
> @@ -5912,6 +5912,12 @@ make_epilogue_seq (void)
>  void
>  thread_prologue_and_epilogue_insns (void)
>  {
> +  if (optimize > 1)
> +    {
> +      df_live_add_problem ();
> +      df_live_set_all_dirty ();
> +    }
Perhaps conditional on separate shrink wrapping?

> @@ -5922,9 +5928,7 @@ thread_prologue_and_epilogue_insns (void)
>    edge entry_edge = single_succ_edge (ENTRY_BLOCK_PTR_FOR_FN (cfun));
>    edge orig_entry_edge = entry_edge;
>
> -  rtx_insn *split_prologue_seq = make_split_prologue_seq ();
>    rtx_insn *prologue_seq = make_prologue_seq ();
> -  rtx_insn *epilogue_seq = make_epilogue_seq ();
>
>    /* Try to perform a kind of shrink-wrapping, making sure the
>       prologue/epilogue is emitted only around those parts of the
> @@ -5932,6 +5936,13 @@ thread_prologue_and_epilogue_insns (void)
>
>    try_shrink_wrapping (&entry_edge, prologue_seq);
>
> +  df_analyze ();
> +  try_shrink_wrapping_separate (entry_edge->dest);
> +  if (crtl->shrink_wrapped_separate)
> +    prologue_seq = make_prologue_seq ();
Perhaps push the df_analyze call into try_shrink_wrapping_separate?

ANd if it was successful, do you need to update the DF information?  Is 
that perhaps the cause of some of the issues we're seeing with DCE, 
regrename, the scheduler?



> diff --git a/gcc/shrink-wrap.c b/gcc/shrink-wrap.c
> index b85b1c3..643e375 100644
> --- a/gcc/shrink-wrap.c
> +++ b/gcc/shrink-wrap.c
> @@ -34,6 +34,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "output.h"
>  #include "tree-pass.h"
>  #include "cfgrtl.h"
> +#include "cfgbuild.h"
>  #include "params.h"
>  #include "bb-reorder.h"
>  #include "shrink-wrap.h"
> @@ -1006,3 +1007,717 @@ try_shrink_wrapping (edge *entry_edge, rtx_insn *prologue_seq)
>    BITMAP_FREE (bb_with);
>    free_dominance_info (CDI_DOMINATORS);
>  }
> +\f
> +/* Separate shrink-wrapping
> +
> +   Instead of putting all of the prologue and epilogue in one spot, we
> +   can put parts of it in places where those components are executed less
> +   frequently.  The following code does this, for prologue and epilogue
> +   components that can be put in more than one location, and where those
> +   components can be executed more than once (the epilogue component will
> +   always be executed before the prologue component is executed a second
> +   time).
> +
> +   What exactly is a component is target-dependent.  The more usual
> +   components are simple saves/restores to/from the frame of callee-saved
> +   registers.  This code treats components abstractly (as an sbitmap),
> +   letting the target handle all details.
> +
> +   Prologue components are placed in such a way that for every component
> +   the prologue is executed as infrequently as possible.  We do this by
> +   walking the dominator tree, comparing the cost of placing a prologue
> +   component before a block to the sum of costs determined for all subtrees
> +   of that block.
> +
> +   From this placement, we then determine for each component all blocks
> +   where at least one of this block's dominators (including itself) will
> +   get a prologue inserted.  That then is how the components are placed.
> +   We could place the epilogue components a bit smarter (we can save a
> +   bit of code size sometimes); this is a possible future improvement.
> +
> +   Prologues and epilogues are preferably placed into a block, either at
> +   the beginning or end of it, if it is needed for all predecessor resp.
> +   successor edges; or placed on the edge otherwise.
> +
> +   If the placement of any prologue/epilogue leads to a situation we cannot
> +   handle (for example, an abnormal edge would need to be split, or some
> +   targets want to use some specific registers that may not be available
> +   where we want to put them), separate shrink-wrapping for the components
> +   in that prologue/epilogue is aborted.  */
> +
> +
> +/* Print the sbitmap COMPONENTS to the DUMP_FILE if not empty, with the
> +   label LABEL.  */
> +static void
> +dump_components (const char *label, sbitmap components)
> +{
> +  if (bitmap_empty_p (components))
> +    return;
> +
> +  fprintf (dump_file, " [%s", label);
> +
> +  for (unsigned int j = 0; j < components->n_bits; j++)
> +    if (bitmap_bit_p (components, j))
> +      fprintf (dump_file, " %u", j);
> +
> +  fprintf (dump_file, "]");
Consider allowing the target to provide a mapping from the component to 
a symbolic name of some kind and using that in the dumps.  Related, 
consider using an enum rather than magic constants in the target bits (I 
noted seeing component #0 as a magic constant in the ppc implementation).


> +
> +/* A helper function for accessing the pass-specific info.  */
> +static inline struct sw *
> +SW (basic_block bb)
> +{
> +  gcc_assert (bb->aux);
> +  return (struct sw *) bb->aux;
> +}
> +
> +/* Create the pass-specific data structures for separately shrink-wrapping
> +   with components COMPONENTS.  */
> +static void
> +init_separate_shrink_wrap (sbitmap components)
So it seems like there's a toplevel list of components that's owned by 
the target and state at each block components that are owned by the 
generic code.  That's fine.  Just make sure we doc that the toplevel 
list of components is allocated by the backend (and where does it get 
freed?)

Consider using auto_sbitmap rather than manually managing 
allocation/releasing of the per-block structures.  In fact, can't all of 
SW become a class and we lose the explicit init/fini routines in favor 
of a ctor/dtor?



> +
> +/* Place code for prologues and epilogues for COMPONENTS where we can put
> +   that code at the start of basic blocks.  */
> +static void
> +emit_common_heads_for_components (sbitmap components)
> +{
> +  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
> +  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
> +  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components));
> +
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, cfun)
> +    {
> +      /* Find which prologue resp. epilogue components are needed for all
> +	 predecessor edges to this block.  */
Nit.  Avoid ".resp".  I actually had to look that abbreviation up :-) 
Being a bit more verbose in the comments would be preferred over ".resp".


Having looked at this in more detail now, I'm wondering if we want to 
talk at all in the docs about when selection vs disqualifying happens. 
ISTM that we build the set of components we want to insert for each 
edge/block.  When that's complete, we then prune those results with the 
disqualifying sets.

For the PPC R0 vs LR is the only thing that causes disqualification 
right?  Can't that be handled when we build the set of components we 
want to insert for each edge/block?  Is there some advantage to handling 
disqualifications after all the potential insertion points have been 
handled?

Jeff

next prev parent reply	other threads:[~2016-09-08 18:34 UTC|newest]

Thread overview: 73+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-01  1:43 [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
2016-08-01  1:43 ` [PATCH 1/9] separate shrink-wrap: New command-line flag, status flag, hooks, and doc Segher Boessenkool
2016-08-29  9:31   ` Bernd Schmidt
2016-08-29 14:30     ` Segher Boessenkool
2016-09-08 17:37     ` Jeff Law
2016-09-09 11:03       ` Bernd Schmidt
2016-09-09 15:13         ` Segher Boessenkool
2016-09-09 18:31           ` Jeff Law
2016-09-09 20:41             ` Segher Boessenkool
2016-09-12 16:49               ` Jeff Law
2016-09-09 15:51         ` Jeff Law
2016-09-09 15:40       ` Segher Boessenkool
2016-09-09 16:58         ` Jeff Law
2016-09-08 17:48   ` Jeff Law
2016-09-09 15:44     ` Segher Boessenkool
2016-08-01  1:43 ` [PATCH 2/9] cfgcleanup: Don't confuse CFI when -fshrink-wrap-separate Segher Boessenkool
2016-09-08 17:51   ` Jeff Law
2016-08-01  1:43 ` [PATCH 3/9] dce: Don't dead-code delete separately wrapped restores Segher Boessenkool
2016-09-08 17:52   ` Jeff Law
2016-09-09 15:59     ` Segher Boessenkool
2016-09-09 16:39       ` Jeff Law
2016-08-01  1:57 ` [PATCH 7/9] cprop: Leave RTX_FRAME_RELATED_P instructions alone Segher Boessenkool
2016-09-08 18:34   ` Jeff Law
2016-09-09 21:21     ` Segher Boessenkool
2016-08-01  1:57 ` [PATCH 8/9] shrink-wrap: shrink-wrapping for separate components Segher Boessenkool
2016-09-08 19:03   ` Jeff Law [this message]
2016-09-09 22:03     ` Segher Boessenkool
2016-09-12 18:21       ` Jeff Law
2016-09-14 13:45         ` Segher Boessenkool
2016-09-15 16:47           ` Jeff Law
2016-08-01  1:57 ` [PATCH 9/9] rs6000: Separate shrink-wrapping Segher Boessenkool
2016-08-01  1:57 ` [PATCH 4/9] regrename: Don't rename restores Segher Boessenkool
2016-09-08 17:54   ` Jeff Law
2016-09-09 21:05     ` Segher Boessenkool
2016-09-12 17:01       ` Jeff Law
2016-08-01  2:12 ` [PATCH 6/9] sel-sched: Don't mess with register restores Segher Boessenkool
2016-08-04  7:33   ` Andrey Belevantsev
2016-09-08 17:55   ` Jeff Law
2016-09-09 21:13     ` Segher Boessenkool
2016-09-12 17:39       ` Jeff Law
2016-09-14 13:28         ` Segher Boessenkool
2016-08-01  2:12 ` [PATCH 5/9] regrename: Don't run if function was separately shrink-wrapped Segher Boessenkool
2016-09-08 17:54   ` Jeff Law
2016-08-04  0:05 ` [PATCH v2 0/9] Separate shrink-wrapping Segher Boessenkool
2016-08-24 16:04   ` Segher Boessenkool
2016-08-26 13:03 ` Bernd Schmidt
2016-08-26 13:48   ` David Malcolm
2016-08-26 13:55     ` Bernd Schmidt
2016-08-26 14:50   ` Segher Boessenkool
2016-08-26 15:03     ` Bernd Schmidt
2016-08-26 16:27       ` Segher Boessenkool
2016-09-08 16:58         ` Jeff Law
2016-09-09 15:26           ` Segher Boessenkool
2016-09-09 16:26             ` Jeff Law
2016-09-09 16:51               ` Segher Boessenkool
2016-09-09 17:22                 ` Jeff Law
2016-08-30 12:31       ` Michael Matz
2016-09-08 16:41         ` Jeff Law
2016-09-09  6:31           ` Segher Boessenkool
2016-09-09 15:28             ` Jeff Law
2016-09-09 15:43               ` Segher Boessenkool
2016-09-09 18:25                 ` Jeff Law
2016-09-09 20:29                   ` Segher Boessenkool
2016-09-08 17:20       ` Jeff Law
2016-09-09 15:33         ` Segher Boessenkool
2016-09-09 16:49           ` Jeff Law
2016-09-09 17:00             ` Segher Boessenkool
2016-09-09 17:44               ` Jeff Law
2016-09-09 19:36   ` Jeff Law
2016-09-09 21:00     ` Segher Boessenkool
2016-09-12 11:00       ` Bernd Schmidt
2016-09-12 16:59       ` Jeff Law
2016-09-14 13:22         ` Segher Boessenkool

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=75e00978-7f5d-f821-85de-5bfc2d2800b0@redhat.com \
    --to=law@redhat.com \
    --cc=bschmidt@redhat.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=segher@kernel.crashing.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).